The Essentials of Machine Learning Data Curation

Jul 08, 2020

Imagine owning a Bugatti Chiron with a 261 mph top speed - a masterpiece of art, form, and technique. Absolute mint. Now imagine running it on water. Dear ML Engineer, you may have the best machine learning model but without the right data, it will fall flat on its face and may never rise again.

Data is the new oil - and just as oil needs the right refining before it can be put to use, data needs curating. The power of your machine learning models will depend greatly on the quality of your data.

As AI integration across industries picks up pace, ML engineers are confronted with a sad reality - once stakeholders identify a use case with proven ROI, they are eager to jump onto the AI ship, and data curation is not given its due importance. In a survey of 150 machine learning engineers at large companies, 41 percent said their data is too siloed.

The assumption that AI only needs to be fed random data collected and combined on a huge scale can gravely backfire. Incorrect datasets can come in many forms ranging from factually incorrect information to knowledge gaps to incorrect guidelines. Among many other problems, an uncurated dataset can be:

  • Biased: A few popular AI systems used for image recognition recently displayed troubling gender and racial bias.

  • Inaccurate, unreliable or falsely represented

  • Error-ridden or ambiguous

Using uncurated raw datasets has been “found to decrease the feature quality when evaluated on a transfer task” (Caron et al., 2019).

So how do we prepare datasets so that they serve the exact purpose ML engineers want them to? Before digging into this question, let’s look at the types of datasets an ML engineer needs.

Types of Datasets for Machine Learning

ML engineers depend on data during each step of their AI journey – from model selection, training, and tuning to testing. These datasets usually fall under three categories: training sets, validation sets, and testing sets.

Every ML project needs at least two of these from the start: the training data set and the testing data set.

The training data set is used to fit the model: the algorithm learns patterns from it and produces results. Around 60 percent of the data is typically set aside for training.

Testing data is used to evaluate the trained model on examples it has not seen. Training data is not reused for testing because the model has already learned from it, so it would simply produce the expected output and overstate performance. The testing data set comprises about 20 percent of the total data.

Validation sets are used to select and tune the ML model. With the proportions above, they take up roughly the remaining 20 percent.
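To make the split concrete, here is a minimal sketch in plain Python that uses the 60 percent and 20 percent figures above. The function name `split_dataset`, the fixed seed, and the choice to give validation whatever is left over are assumptions made for illustration, not a prescribed method.

```python
import random


def split_dataset(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle records and split them into train, validation, and test lists.

    Validation receives whatever remains after the train and test shares
    are taken (an assumption for this sketch).
    """
    if train_frac <= 0 or test_frac <= 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and leave room for validation")

    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Shuffling before slicing matters: if the records are sorted by date or by class, an unshuffled split would hand the model a test set that looks nothing like its training data.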

Data Curation for Machine Learning

Data curators collect data from multiple sources, integrate it into one form, authenticate, manage, archive, preserve, retrieve, and represent it.

The process of curating datasets for machine learning starts well before you acquire any data. Here’s what we suggest:

  • Identify the goal of your AI project

  • Identify what dataset you will need to solve the problem

  • Make a record of your assumptions while selecting the data

  • Aim for collecting diverse and meaningful data from both external and internal resources

  • Build a dataset that is hard for your competitors to copy

If you have a small dataset, using a model pre-trained on large datasets can be a good idea. You can use your small dataset to fine-tune it.

Once you have collected the right data, you can proceed with building the training set. This step of putting data in the optimal format is called feature transformation and it comprises four stages:

Formatting: Data usually arrives in different formats. Formatting brings it together in one consistent sheet. For example, customer data can come with different currencies, languages, etc. These need to be converted to a single format.
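As a rough illustration of the currency example, the sketch below converts customer records to one currency and one field layout. The field names, the `normalize_record` helper, and the exchange rates are all made up for this example; real rates would have to come from a trusted source.

```python
# Hypothetical, made-up exchange rates to USD, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "GBP": 1.5}


def normalize_record(record):
    """Return a record with a trimmed customer name and the amount in USD."""
    currency = record["currency"].strip().upper()
    if currency not in RATES_TO_USD:
        # Fail loudly instead of silently guessing a rate.
        raise ValueError(f"unknown currency: {currency}")

    amount_usd = round(float(record["amount"]) * RATES_TO_USD[currency], 2)
    return {"customer": record["customer"].strip(), "amount_usd": amount_usd}


raw_records = [
    {"customer": " Ana ", "amount": "100", "currency": "eur"},
    {"customer": "Bo", "amount": 40, "currency": "USD"},
]
formatted = [normalize_record(r) for r in raw_records]
print(formatted)
```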

Labeling: Labels tell the model what each example represents, so the data set has to be labeled for the task at hand. For example, a self-driving car will need data labeled as pictures of cars, pedestrians, street signs, footpaths, etc.
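One simple check that can follow labeling is an audit of the labels themselves. The sketch below, which borrows the self-driving example, counts examples per class and flags labels outside an agreed set. The label set, the `audit_labels` helper, and the image IDs are hypothetical.

```python
from collections import Counter

# Hypothetical label set, taken from the self-driving example above.
ALLOWED_LABELS = {"car", "pedestrian", "street_sign", "footpath"}


def audit_labels(annotations):
    """Count examples per allowed label and collect annotations that do not fit.

    `annotations` is an iterable of (image_id, label) pairs.
    """
    counts = Counter()
    invalid = []
    for image_id, label in annotations:
        # Normalize case and spacing so "Street sign" matches "street_sign".
        normalized = label.strip().lower().replace(" ", "_")
        if normalized in ALLOWED_LABELS:
            counts[normalized] += 1
        else:
            invalid.append((image_id, label))
    return counts, invalid


annotations = [
    ("img1", "Car"),
    ("img2", "street sign"),
    ("img3", "bicycle"),
    ("img4", "car"),
]
counts, invalid = audit_labels(annotations)
print(counts, invalid)
```

The per-class counts also give an early hint of imbalance, which is one way the bias mentioned earlier can creep into a dataset.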

Data Cleaning: Unwanted characters are removed and missing values are dealt with.
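A minimal sketch of both cleaning steps, using only the standard library. Which characters count as unwanted and how missing values should be filled depend on the dataset; stripping punctuation and filling with the median are assumptions chosen for this example, and the helper names are hypothetical.

```python
import re
from statistics import median


def clean_text(value):
    """Drop punctuation and symbols, then collapse repeated whitespace."""
    without_symbols = re.sub(r"[^\w\s]", "", value)
    return re.sub(r"\s+", " ", without_symbols).strip()


def impute_missing(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(present)
    return [fill if v is None else v for v in values]


print(clean_text("  Hello,,  world!! "))
print(impute_missing([1, None, 3]))
```

Dropping rows with missing values is the other common option; imputation keeps more data but can blur real patterns when many values are missing.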

Feature extraction: A number of features are analyzed and optimized. Features that are important for prediction are selected for quicker computation and less memory consumption.
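There are many ways to pick the features that matter. As one hedged example, the sketch below ranks numeric features by the absolute Pearson correlation with the target and keeps the top k. This is a simple filter method chosen for illustration, not the only or necessarily the best approach, and the function and column names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no linear signal; score it as zero.
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_features(columns, target, k):
    """Return the names of the k features most correlated with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


columns = {
    "a": [1, 2, 3, 4],
    "b": [5, 5, 5, 5],
    "c": [2, 1, 4, 3],
}
target = [2, 4, 6, 8]
selected = select_top_features(columns, target, k=2)
print(selected)
```

Correlation only captures linear relationships with a single feature at a time, so a feature that scores low here may still be useful in combination with others.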

The Upshot

A dataset alone can determine the success or failure of an ML model. Data curation is one of the fundamental aspects of machine learning and, if done right, it can unleash great power. The process may appear time-consuming, but it keeps your dataset aligned with your model’s goals at every step.

“Curations are about where the humans can actually add their knowledge to what the machine has automated.” - Stephanie McReynolds, VP of marketing at Alation

Setting up a data curation team and process can look expensive in the short term, so organizations must weigh that cost against its long-term value. Unsupervised methods trained on curated data are available, yet expensive. Choose wisely.
