Organizations Need To Focus On Good Data, Not Big Data

Andrew Ng is a computer scientist and technology entrepreneur specializing in machine learning and artificial intelligence. He co-founded Google Brain and served as its head, and was formerly chief scientist at Baidu. He is also widely credited, together with his students at Stanford University, with pioneering the use of GPUs to train deep learning models in the 2000s.

Currently, Andrew's primary focus is his company, Landing AI, which uses computer vision to enhance visual inspection. He has also become an advocate of the data-centric AI movement, which he says can provide small-data solutions to major AI issues such as model efficiency, reliability, and subjectivity.

In an interview, Andrew Ng said he is excited about NLP models getting bigger and about their potential for building foundation models in computer vision. He also pointed out that there are no foundation models for video so far because of the compute bandwidth and cost of processing video, in contrast to tokenized text.

Foundation Model

Andrew explained that "foundation model" is a term coined by Percy Liang and colleagues at Stanford; it refers to a big model that is trained on large datasets and can be adapted for particular applications. GPT-3 is an example of a foundation model for natural language processing.

Foundation models offer a new way to build machine learning applications; however, they present challenges when it comes to ensuring they're fair and free from bias, especially when we build on top of them.

Moreover, Andrew believes that building a video foundation model faces a scalability problem. The computing power required to process the enormous volume of images in video is immense, which is why foundation models emerged first in natural language processing. Many experts are currently working on this, and foundation models in computer vision are expected soon.
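To make that scale gap concrete, here is a rough, illustrative calculation (the frame rate, resolution, and token figures are generic assumptions, not numbers from the interview) comparing the raw input volume of an hour of video with the same hour of speech rendered as text:

```python
# Back-of-envelope comparison: raw input values in one hour of video
# versus tokens in one hour of transcribed speech. All constants are
# illustrative assumptions.
FRAME_RATE = 30            # frames per second
RESOLUTION = 224 * 224     # pixels per frame at a common model input size
CHANNELS = 3               # RGB

video_values = 3600 * FRAME_RATE * RESOLUTION * CHANNELS
# ~16.3 billion raw values for a single hour of video

WORDS_PER_MINUTE = 150     # rough speaking rate
TOKENS_PER_WORD = 1.3      # rough tokenizer average

text_tokens = int(60 * WORDS_PER_MINUTE * TOKENS_PER_WORD)
# ~11,700 tokens for the same hour as transcribed text

print(f"video: {video_values:,} values | text: {text_tokens:,} tokens")
print(f"ratio: roughly {video_values // text_tokens:,}x more raw input")
```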

Deep learning has been used heavily over the past decade in consumer-facing businesses with large datasets. Although this approach to machine learning worked well for some industries, it is unlikely to work for others. In the many businesses that lack big datasets, the focus has to shift from big data to good data.

Data-centric AI

Ng defines data-centric AI as the field of systematically engineering the data required to successfully build an AI system. He adds that building an AI system requires implementing an algorithm, such as a neural network, in code and then training it on a dataset. During the past ten years, the predominant approach has been to download the dataset and concentrate on optimizing the code. Thanks to that approach, deep learning networks have advanced dramatically, to the point where, for the majority of applications, the code (the neural network architecture) is essentially a solved problem. For many practical applications, it is therefore now more fruitful to focus on finding ways to enhance the data.
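A minimal sketch of the two iteration loops may help clarify the contrast: the model-centric loop holds the data fixed and varies the code, while the data-centric loop holds the code fixed and improves the data. The `fix_labels` hook below is a hypothetical placeholder for whatever cleaning step applies, and scikit-learn stands in for "the code":

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def model_centric_iteration(X, y, candidate_models):
    """Hold the data fixed; search over code/architecture choices."""
    scores = {name: cross_val_score(model, X, y).mean()
              for name, model in candidate_models.items()}
    return max(scores, key=scores.get)

def data_centric_iteration(X, y, fix_labels):
    """Hold the code fixed; systematically improve the data."""
    model = LogisticRegression(max_iter=1000)   # a fixed, "good enough" model
    X_clean, y_clean = fix_labels(X, y)         # e.g. relabel noisy examples
    return cross_val_score(model, X_clean, y_clean).mean()
```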

Big Data vs. Good Data

Regarding companies with only a small amount of data to work with, Andrew Ng noted that some vision systems are built with millions of images; he once built a face recognition AI system using 350 million images. However, it can take only 50 good examples to create something valuable, such as a defect-inspection system. He believes that in the many enterprises that lack big datasets, the emphasis has to shift from big data to good data: 50 well-engineered examples can be enough to show the neural network what it should learn.
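One common way to act on the "50 good examples" idea is to start from a model pretrained on a large dataset and fine-tune only its classification head on the small, consistently labeled set. Here is a minimal sketch of that pattern in PyTorch (the two-class defect setup and training details are my assumptions, not specifics from the interview):

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a backbone pretrained on a large dataset (ImageNet here).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                  # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, 2)   # defect / no-defect head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One gradient step on a small batch of well-labeled examples."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

With only around 50 images, most of the leverage comes from the quality and consistency of the labels rather than from the architecture.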

Quality Data

Ng adds that when dealing with data, consistency is essential if you want performance to improve. For example, suppose you have 20,000 images and 20 of them belong to the same class but are labeled inconsistently. One thing you can do in this case is relabel the inconsistent images, which yields higher-quality data.
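As a sketch of what such a consistency check might look like, the snippet below flags byte-identical images that carry conflicting labels. A perceptual hash would be needed to also catch near-duplicates, and the `(path, label)` dataset layout is an assumption:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_label_conflicts(dataset):
    """dataset: iterable of (image_path, label) pairs.
    Returns groups of identical images whose labels disagree."""
    labels_by_hash = defaultdict(set)
    paths_by_hash = defaultdict(list)
    for path, label in dataset:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        labels_by_hash[digest].add(label)
        paths_by_hash[digest].append(path)
    return [paths_by_hash[h]
            for h, labels in labels_by_hash.items() if len(labels) > 1]
```

Any group this returns is a candidate for relabeling before the next training run.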

Biased data leads to biased systems. If there is a problem in a subset of the data and you try to change the whole neural network architecture to improve performance on just that subset, it is going to be challenging. But if you can engineer a subset of the data, you can address the issue in a much more targeted way, and data-centric AI offers powerful tools for doing exactly that.
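A simple form of that targeted analysis is to score the model separately on each subset (slice) and then spend the data effort on the worst one. A minimal sketch, assuming a `slice_fn` that assigns each example to a named subset:

```python
from collections import defaultdict

def accuracy_by_slice(examples, predict, slice_fn):
    """Per-slice accuracy: examples are (input, label) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for x, y in examples:
        s = slice_fn(x)
        total[s] += 1
        correct[s] += int(predict(x) == y)
    return {s: correct[s] / total[s] for s in total}

# scores = accuracy_by_slice(val_set, model_predict, slice_fn)
# worst = min(scores, key=scores.get)   # engineer data for this slice
```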

Collecting More Data for Everything Can Be Expensive

Data cleaning is vital in AI, but it has traditionally been done in a fairly manual way. In computer vision, for instance, a problem might be identified, and potentially fixed, by an individual inspecting images in a Jupyter notebook. What is more interesting are tools that let you work with huge datasets and that rapidly and effectively highlight the subsets of data pointing to issues, or that quickly direct your attention to the one class out of 100 where gathering further data would be advantageous. More data can be helpful, but getting more data for everything could prove quite costly.
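The snippet below sketches the kind of triage such a tool might automate: ranking classes so that the one out of 100 where more data would help most surfaces first. The scoring heuristic (error rate weighted toward under-represented classes) is my assumption, not a formula from any particular tool:

```python
from collections import Counter

def classes_needing_data(y_true, y_pred):
    """Rank classes by (error rate) / (example count): high-error,
    under-represented classes come first."""
    counts = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    score = {c: (errors[c] / counts[c]) / counts[c] for c in counts}
    return sorted(score, key=score.get, reverse=True)
```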

Ng also recounted discovering that a speech-recognition system struggled with background traffic noise. That insight let him gather more data with vehicle noise in the background, rather than attempting to gather data on everything, which would have been costly and time-consuming.
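The interview describes collecting more noisy recordings; a cheaper, related technique (a substitution on my part, not Ng's stated method) is to synthesize such data by mixing recorded vehicle noise into clean speech at a chosen signal-to-noise ratio:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Overlay `noise` onto `speech` at the given SNR in decibels.
    Both arguments are 1-D float arrays of audio samples."""
    noise = np.resize(noise, speech.shape)     # loop/trim noise to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_power = speech_power / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_power / noise_power)
```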

Small Data Drives Good Data

Inside organizations, there appears to be a general propensity to concentrate on "Big Data" and sophisticated analytics before tackling the fundamentals of handling small data. Big Data is undoubtedly becoming a more significant component of the landscape, but it is typically not the most urgent data issue within a company.

Getting "Small Data" right will probably yield more immediate rewards. Small Data is crucial for various reasons, which is why it must be tackled separately from a larger Big Data endeavor.

Small data can frequently answer important strategic questions about a business operation, which can in turn guide the optimal use of Big Data and more sophisticated analytics. And while most businesses do not yet have enough data to qualify as big data, every organization has some data it can use to gain insight. Finally, mastering small data management is a vital step toward achieving total data management excellence within a company.

