How much more data do we need to make this model work reliably? – This question often cripples AI engineers in their efforts to create a practical machine learning solution. The quest to build effective AI models is never-ending. But it is now becoming more data-centric than ever before.
That brings us to an even more basic question – What are the fundamental elements of a working AI solution?
- Data.
- Algorithm or AI model.
In the traditional approach, machine learning practitioners primarily show interest in improving the model to make the solution more effective. This strategy is called the model-centric approach. Let’s discuss this approach in detail to understand the need for the data-centric approach.
In the model-centric approach to AI, developers give primary importance to the ML algorithm and the model when building a machine learning solution. Changing the model is treated as the sole means of improving performance. Even though experts now favor the data-centric approach, AI models have developed splendidly under model centrism.
Here are a few features of this approach.
1. Data operations are a one-time thing.
Traditionally, data science teams download a static dataset from a source. It is then processed, cleaned, and stored in the database. After that, it does not get much consideration as data collection & processing is considered a one-time thing.
2. Data quantity over data quality
This approach places a lot of emphasis on data quantity to make the models work. Thus, although we get a lot of data to work with, the benefit does not match the effort.
3. Focus on the model
The primary focus is on the algorithm and the model to improve the predictive performance of the AI model. Various algorithms are experimented with, and hyperparameters are tuned progressively to achieve the needed improvement.
Even though AI has made inspiring progress in the last decade, some problems have become painfully clear with the current approach.
Some of the drawbacks of using the model-centric approach are –
1. Limitations to the progress
There is a limit to the progress you can achieve by working only on your algorithms. After a while, performance is bound to plateau. However, if you work on engineering your data for quality, you can often outperform your past benchmarks.
2. Less reliable outcomes
When you are concerned only with the model, you fail to notice the peculiarities in your data. As a result, data skew, under-representation, and biases go unnoticed and creep silently into the model.
3. High costs
The traditional approach is data volume intensive. It means that you need a lot of data to bring desired results. As the volume of data you work with increases, you need more resources to store and handle this data. This scenario increases the total operational costs of a project.
4. Lack of training data
Having a large amount of training data for every use case is not always possible. That makes it hard to solve practical problems using AI.
Data-centric AI is the approach to developing AI models where you focus on engineering your data to improve your model’s performance.
Now that machine learning experts have hit a roadblock with model-centricity, they are starting to talk more about the data. In the data-centric approach, data gets the primary importance.
A few of the essential features of this approach are –
1. Data quality over Data quantity
Data centricity prioritizes data quality over its quantity. The working belief is that a model with less training data, but superior data quality, can perform on par with a model trained on a large data volume.
2. Data processing is essential.
In the data-centric method, data processing steps like labeling, augmenting, cleaning, etc., are given much more time and resources than in the model-centric approach.
3. Data is not static
Data processing isn’t a one-time thing for those who follow data centrism. Data keeps evolving with time through feedback from the model and other information that data engineers keep discovering throughout the project lifecycle. Thus, data is constantly improved to improve the model’s performance.
4. Domain Experts
AI is solving problems in different domains and industries. Recognizing that, the data-centric approach places great emphasis on working with domain experts while solving a domain problem. Doing so improves data quality and data specificity, and gives better context to the problem.
1. Minimize bias in your data
When data is collected and labeled, it is easy for the prejudices of the annotators to show up. While the model-centric process might miss such biases, the data-centric approach gives enough consideration to removing this bias from the data. One possible way is to have more than one annotator label the same data and decide on a label by consensus.
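The consensus idea above can be sketched in a few lines. This is only an illustration, assuming a simple majority vote: the `consensus_label` helper, the `min_agreement` threshold, and the sample annotations are hypothetical and not taken from any particular labeling tool.

```python
from collections import Counter

def consensus_label(votes, min_agreement=0.5):
    """Return the majority label, or None when agreement is too weak.

    votes: labels given by different annotators for one item.
    min_agreement: assumed threshold; the winner's share must exceed it.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Ties and weak majorities are not resolved automatically.
    return label if count / len(votes) > min_agreement else None

# Hypothetical annotations: three items, each labeled by several annotators.
annotations = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "cat"],
    "img_003": ["dog", "dog", "dog"],
}
resolved = {item: consensus_label(votes) for item, votes in annotations.items()}
needs_review = [item for item, label in resolved.items() if label is None]
```

Items that end up in `needs_review` are the ones where annotators disagree, which is often where a closer look at the labeling guidelines pays off.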
2. Representative Datasets
Another vital aspect of building well-functioning AI models is to train them on a dataset that is truly representative. You can run data analysis to ensure that every possibility is adequately covered. Tools like Pandas, NumPy, and Matplotlib can help with this. You can source additional data if the data does not feel representative.
A complementary and sometimes alternative technique can be to use synthetic data generation to create representative data.
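As a rough illustration of such an analysis, the sketch below flags classes that make up too small a share of a dataset. It uses only the standard library to stay self-contained; the `underrepresented_classes` helper, the 10% cutoff, and the sample labels are assumptions made for the example. A Pandas-based check would follow the same logic.

```python
from collections import Counter

def underrepresented_classes(labels, min_share=0.10):
    """Return {class: share} for classes below min_share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items() if n / total < min_share}

# Hypothetical label column with a heavily skewed class distribution.
labels = ["approved"] * 80 + ["rejected"] * 15 + ["flagged"] * 5
gaps = underrepresented_classes(labels)
```

Classes that show up in `gaps` are candidates for additional data collection or synthetic data generation.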
3. Curate data with the help of SMEs
Data collection might get you hundreds of thousands of data points, if not millions. If you go on to label everything you see, it will become an expensive affair in no time. Not only that, but data quality will also deteriorate.
Subject matter experts can help you curate informative data for the problem you are solving. Only the most valuable data should remain in your training dataset. It will make data labeling easier while making the training process less resource heavy.
This approach will even help you throw out low-quality and erroneous data in favor of data points that can meaningfully contribute to the model’s growth.
4. Put domain knowledge to use
Domain knowledge can be a game changer when developing machine learning models that can solve your problems. From healthcare to finance, domain experts have insights that outsiders do not have. Such insights can make the task of an AI engineer easier.
For example, you can use domain knowledge to perform efficient feature engineering to select features that are the most crucial to finding the solution. The machine learning model will perform well even with limited data using this optimization.
5. Error Analysis
Working with data is tough; it is even tougher to eradicate all the discrepancies in a dataset at once. However, focused and iterative error analysis of subsets of data can help you achieve the desired quality over time.
For example, if your model is not performing well on a particular class, you focus on the dataset representative of that class. You improve the quality of that subset and retrain the model to improve its performance.
When multiple subsets of data get improved over time, the quality of the whole dataset improves.
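A first step in this kind of error analysis is finding out which class the model handles worst. The sketch below computes per-class accuracy from true and predicted labels; the `per_class_accuracy` helper and the sample predictions are made up for illustration.

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Return {class: accuracy} computed over the examples of each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {cls: correct[cls] / total[cls] for cls in total}

# Hypothetical evaluation results.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird", "bird", "bird"]
y_pred = ["cat", "cat", "dog", "cat", "bird", "dog", "cat", "dog"]

accuracy = per_class_accuracy(y_true, y_pred)
# The weakest class is the subset to inspect and improve first.
worst_class = min(accuracy, key=accuracy.get)
```

Here `worst_class` points to the subset whose data deserves attention before the next retraining round.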
6. Augment data
Even though the data-centric approach is more about data quality than quantity, a minimum quantity of data may still be needed to ensure representativeness and better generalization.
That’s why some data teams choose to augment their dataset using techniques like interpolation, extrapolation, and synthetic data generation. Data augmentation is a popular method among AI engineers to create better AI systems.
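To make the interpolation idea concrete, here is a minimal sketch that creates synthetic points on the line segment between two randomly chosen real samples of the same class. The `interpolate_samples` helper, the fixed seed, and the sample feature vectors are assumptions for illustration; real projects usually rely on dedicated augmentation libraries.

```python
import random

def interpolate_samples(samples, n_new, seed=0):
    """Create n_new points, each lying between two randomly chosen real samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)   # two distinct real samples
        t = rng.random()                # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Hypothetical feature vectors for a minority class.
minority = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
augmented = minority + interpolate_samples(minority, n_new=4)
```

Because every new point stays between existing ones, this only fills gaps inside the observed range; it cannot add coverage for cases the original data never contained.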
7. Pre-process your data better
Raw data is not used directly in an AI model. It needs processing before it is fed into the AI algorithm. While data pre-processing or preparation is a one-time action in the traditional approach, it is an iterative process in the data-centric method. It is a critical step that needs to be done with near perfection.
Data pre-processing involves data cleaning, integration, reduction, and transformation. These steps ensure that your model gets the quality data it needs to perform better.
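A small sketch of three of these steps, assuming records are plain dictionaries: rows with missing values are dropped (cleaning), exact duplicates are removed (reduction), and one numeric field is min-max scaled (transformation). The `clean_and_scale` helper and the sample rows are hypothetical.

```python
def clean_and_scale(rows, numeric_key):
    """Drop incomplete and duplicate rows, then min-max scale one numeric field."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # cleaning: skip rows with missing values
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # reduction: skip exact duplicates
        seen.add(key)
        cleaned.append(dict(row))
    if not cleaned:
        return []
    values = [row[numeric_key] for row in cleaned]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero when all values are equal
    for row in cleaned:
        row[numeric_key] = (row[numeric_key] - low) / span  # transformation
    return cleaned

# Hypothetical raw records.
raw_rows = [
    {"age": 20, "label": "a"},
    {"age": 40, "label": "b"},
    {"age": 20, "label": "a"},    # duplicate
    {"age": None, "label": "b"},  # missing value
    {"age": 30, "label": "a"},
]
prepared = clean_and_scale(raw_rows, "age")
```

In a data-centric workflow a function like this would be rerun whenever the dataset changes, rather than applied once and forgotten.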
It is easy to get attached to data you worked so hard to get your hands on. That attachment is a blunder that might prevent you from throwing away poor-quality data, and that will kill your model. More data doesn’t always translate to a better model. With unsuitable training data, you might teach the model things you didn’t intend it to learn. Or it might make your model less accurate than it could otherwise have been.
Data preparation also involves labeling your data. While you may want to label data based on your observation and instinct, it pays to have some pre-decided rules and thresholds that help you label your data points objectively. Without such rules, for example predetermined threshold values of the independent variables, different annotators might assign contradictory labels to the same data point.
Benefits of better data preparation -
It will remove noisy data from your dataset.
It will ensure that your data is correctly labeled.
It will ensure that your model learns what is relevant.
It will ensure that minimal resources are needed to train your model.
8. Programmatic Labeling
In this approach, labeling happens via machines rather than humans. This approach uses the subject matter expertise of human experts to create labeling rules that the program then uses to assign labels for all the data points.
Thus, the time and effort needed to generate these labels are significantly reduced. However, these labels are noisier than the ones created by humans, and edge cases are especially prone to being wrong. Human annotators can then check the annotations for errors.
Such revision takes a lot less time than the conventional labeling approach. Many organizations have saved millions of dollars in their budget by choosing this strategy.
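The sketch below shows the general shape of programmatic labeling: a few hand-written rules (often called labeling functions) each vote on a label or abstain, and the votes are combined. The rules, the label names, and the plain majority vote are all assumptions made for illustration; dedicated tools combine rule outputs in more sophisticated ways.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion

# Hypothetical rules written with the help of a subject matter expert.
def lf_mentions_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_mentions_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return "complaint" if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_refund, lf_mentions_thanks, lf_many_exclamations]

def programmatic_label(text):
    """Combine rule votes by majority; return ABSTAIN on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    top = Counter(votes).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN  # conflicting rules: leave this item for a human
    return top[0][0]

texts = [
    "I want a refund now",
    "Thank you for the quick help",
    "Thanks, but I still need a refund",
    "Where is my order",
]
labels = [programmatic_label(text) for text in texts]
```

Items left without a label are the edge cases the rules could not settle, so they are natural candidates for human review.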
9. Hybrid Labeling
The hybrid labeling approach involves humans and AI-enabled systems to help you label your data. While easily recognizable data gets labeled by the machine, human annotators can label data that machines find hard to recognize and classify.
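One common way to split the work, sketched here under the assumption that the machine labeler reports a confidence score, is to accept confident predictions automatically and queue the rest for people. The `route_for_labeling` helper, the 0.9 threshold, and the sample predictions are hypothetical.

```python
def route_for_labeling(predictions, confidence_threshold=0.9):
    """Split (item_id, label, confidence) triples into auto-labeled and human-queued."""
    auto_labeled = {}
    human_queue = []
    for item_id, label, confidence in predictions:
        if confidence >= confidence_threshold:
            auto_labeled[item_id] = label   # machine is confident enough
        else:
            human_queue.append(item_id)     # hard case: send to an annotator
    return auto_labeled, human_queue

# Hypothetical model outputs.
predictions = [
    ("doc_1", "invoice", 0.98),
    ("doc_2", "receipt", 0.62),
    ("doc_3", "invoice", 0.91),
]
auto_labeled, human_queue = route_for_labeling(predictions)
```

The threshold controls the trade-off: raising it sends more items to humans and lowers the risk of wrong machine labels entering the training set.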
10. Active Learning
In the active learning approach, we select a small dataset and try to label it as correctly as possible. This small dataset is a subset of a larger dataset. Then we iteratively apply active learning techniques to determine which other data points need labels for the model to perform better.
Active learning techniques can help if you have limited data, or if you have a large dataset but limited processing power.
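One widely used selection strategy is uncertainty sampling: ask for labels on the items the current model is least sure about. The sketch below assumes the model already produced class probabilities for the unlabeled pool; the `least_confident` helper and the sample probabilities are hypothetical.

```python
def least_confident(probabilities, batch_size):
    """Return the ids whose top predicted probability is lowest.

    probabilities: {item_id: [p_class0, p_class1, ...]} from the current model.
    """
    ranked = sorted(probabilities, key=lambda item: max(probabilities[item]))
    return ranked[:batch_size]

# Hypothetical model outputs on an unlabeled pool.
unlabeled_probs = {
    "a": [0.95, 0.05],
    "b": [0.55, 0.45],
    "c": [0.70, 0.30],
    "d": [0.51, 0.49],
}
to_label = least_confident(unlabeled_probs, batch_size=2)
```

After the selected items are labeled, the model is retrained and the selection is repeated, which is what makes the process iterative.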
Machine learning models have come a long way with the model-centric approach. Algorithms have been optimized to very sophisticated levels, and even incremental progress now would need exponential effort. While AI engineers and data practitioners will continue to pursue that direction of model improvement, data will become the primary avenue of improvement.
Data centrism is carving out its own niche in the AI world. Its popularity has grown to the point that various tools that help developers adopt data centrism are now emerging.
WhyLabs – It is an ML monitoring tool that helps you achieve a healthier model working on clean and valuable data.
CleanLab – It is a data-centric package developed by MIT researchers to perform error analysis and improve data.
Tecton – Tecton is an enterprise feature platform that helps teams build and serve features, including embeddings, as model inputs.
Snorkel – If you want to implement programmatic labeling in your dataset, Snorkel can be the tool of your choice. It uses weak supervision to annotate data.
Ydata – It’s a popular data-centric tool used for data profiling and synthetic data generation.
AutoAugment – AutoAugment is a brainchild of Google Brain. This tool helps augment your data for volume and diversity to improve your model’s performance.
Synthetic Data Vault – It is an open-source synthetic data generation tool that you can use to create representative data for your specific problem.
AWS SageMaker – SageMaker is a data-centric tool provided by Amazon to make it easy to deploy your machine learning models.
Galileo – Galileo is an AI tool meant for profiling your NLP datasets. One of its most powerful features is that it identifies texts in your data that might be hurting your model’s performance.
Arize AI – Arize AI is an ML monitoring platform that helps you track your model’s performance in real time. In addition to model-centric checks, it offers tools to check data quality, contributing to the data-centric paradigm.
The data-centric versus model-centric debate is not a zero-sum game, and AI engineers should not approach it as one. It is not an either-or choice. While working on your AI models, you can optimize the algorithm and also ensure that you have the best training data available. If you are looking to build state-of-the-art artificial intelligence solutions, it’s true that mediocre data won’t get you there. But it is equally true that mediocre algorithmic implementations won’t either.