📝 Guest post: SuperData is the new oil – How to win the AI race in the 21st century*
In this guest post, Vahan Petrosyan, co-founder and CTO at SuperAnnotate, explains the term SuperData and its importance for the development of the AI space. He dives deeper into the definitions of processed and unprocessed data and talks about how some of the fastest-growing unicorns and decacorns use such data to create value and grow in competitive environments.
Before going deeper into the details of the article, let's first define the term SuperData.
SuperData = AI-ready training data
I.e., well-structured, tagged, and high-quality labeled data for creating intelligence.
Back in 2006, British mathematician Clive Humby coined the phrase “Data is the new oil.” Since then, many data-driven businesses worldwide have grown into billion-, if not trillion-dollar, companies. Both oil and data can be transformed into different products: you can use oil to produce plastics, detergents, and so on, while data can be refined into valuable information and insights used to make any type of business decision. As a result, access to the right data allows some of the world’s largest companies to beat their competitors and grow at unprecedented speed.
For example, predicting Walmart’s revenue in advance allows a more accurate estimate of its stock price before the quarterly reports come out. Since forecasting revenue directly is difficult, one can assume that Walmart’s revenue is roughly proportional to the average number of cars in its parking lots. Quantitative data on those vehicles is not publicly available, but satellite imagery companies have made it possible to get imagery of a given location at a given time.
Hence, by acquiring parking lot imagery from all Walmart stores, one can build an AI algorithm that predicts the number of cars in a particular parking lot, which in turn serves as a foundation for estimating Walmart’s revenue. Data availability is not an issue in this case: a few API calls are enough to get the raw satellite images. Building a robust AI algorithm that can precisely count cars across different locations, weather, and lighting conditions is possible, but it remains a hard problem (some AI startups are already tackling exactly this). In such scenarios, the expression “data is the new oil” can be misleading: raw data by itself, much like crude oil, does not produce much value. Hence the need for processed data.
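To make the idea concrete, here is a minimal sketch of the car-counting step. It leans on a COCO-pretrained detector from torchvision purely as a stand-in: a production system would need a model trained on overhead aerial imagery, and the file name below is hypothetical.

```python
# A minimal sketch of the car-counting idea, assuming a COCO-pretrained
# detector as a stand-in for a purpose-built satellite-imagery model.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

COCO_CAR_CLASS = 3  # "car" in the COCO label map used by torchvision detectors

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def count_cars(image_path: str, score_threshold: float = 0.6) -> int:
    """Count car detections in one parking-lot image."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        prediction = model([image])[0]
    keep = (prediction["labels"] == COCO_CAR_CLASS) & (
        prediction["scores"] >= score_threshold
    )
    return int(keep.sum())

# Hypothetical file name, for illustration only.
print(count_cars("walmart_lot_2021_10_01.png"))
```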
Let’s dive deeper.
Unprocessed raw data
As technology progresses, even the smallest IoT device collects data that can be stored on your local machine or with your favorite cloud provider for future use. Different types of raw data (tabular, images, videos, documents, etc.) keep accumulating in such repositories, called data lakes, where, if not managed correctly, the data ends up useless for its target applications. For companies dealing with tons of data, the real value lies not in creating data lakes that degrade into data swamps, but in structuring them so that valuable insights can be extracted at any time. Companies like Snowflake and Databricks help structure such datasets effectively, enabling their clients to grow into billion-dollar businesses on top of well-shaped data warehouses.
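As a rough illustration of what “structuring” means in practice, here is a minimal sketch that turns raw, schema-less event records into a typed, partitioned columnar table. The file names and fields are hypothetical, and a real pipeline would run on a platform like the ones above rather than a single script.

```python
# A minimal sketch of turning raw "data lake" records into a structured,
# queryable table. File names and fields are hypothetical.
import json
import pandas as pd

# Raw, schema-less events as they might land in a data lake.
with open("events.jsonl") as f:
    raw_records = [json.loads(line) for line in f]

df = pd.DataFrame(raw_records)
df["timestamp"] = pd.to_datetime(df["timestamp"])  # enforce types
df = df.dropna(subset=["device_id"])               # drop unusable rows

# Write a columnar, partitioned copy that downstream tools can query cheaply.
df.to_parquet("warehouse/events", partition_cols=["device_id"])
```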
The AI race
Digital transformation took a giant leap during the COVID-19 pandemic. Consequently, companies that dealt with process optimization turned to AI-enabled solutions to survive the intensifying AI race.
"Ultimately, every company will become an AI company." IBM CEO Arvind Krishna
Today, the winners of this AI race fully understand how difficult the transformation to AI readiness is and invest in it ahead of time. However, AI readiness primarily depends on the data used to train these companies’ AI models.
It is becoming increasingly clear that data is the main driver of accurate AI algorithms. The term data-centric AI, coined by prominent AI scientist Andrew Ng, captures this paradigm shift within the AI community: we have slowly come to realize that to improve AI, we need to focus more on creating high-quality training data rather than incrementally improving models and their architectures. Nevertheless, high-quality training data is tough to create and is very different from raw data. We call such top-quality training data SuperData.
SuperData = AI-ready training data
I.e., well-structured, tagged, and high-quality labeled data for creating intelligence.
To survive the increasingly competitive AI race, every company should transform into a data company. Every data company, in turn, should create AI-ready SuperData to sustain its growth.
SuperData vs. just data
Many data companies gather petabytes of data and freeze them in different data lakes. You may be able to compute simple statistics on such datasets, but to build an AI application or extract deeper insights, you need to structure and accurately version them so that everything is searchable and sliceable. Snowflake and Databricks (est. 2012 and 2013) are among the companies enabling businesses to move away from unstructured data lakes and build powerful data warehouses.
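Once data is structured this way, “searchable and sliceable” stops being abstract. A minimal sketch, assuming the partitioned Parquet layout from the earlier example (paths and columns are hypothetical):

```python
# Load only the slice we care about instead of scanning the whole lake.
# Assumes the hypothetical "warehouse/events" layout written earlier.
import pandas as pd

df = pd.read_parquet(
    "warehouse/events",
    filters=[("device_id", "==", "cam-042")],  # pushdown filter (pyarrow engine)
)
recent = df[df["timestamp"] >= "2021-10-01"]
print(len(recent), "events from cam-042 since October")
```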
Over the last few years, more and more AI applications have been built on visual (images, video, LiDAR, DICOM), text, and audio datasets. However, structuring such datasets well is not enough to create intelligent ML algorithms. In these cases, creating SuperData requires tagging, annotating, and versioning datasets to perfection. Note that neither raw data nor poorly annotated data can become SuperData: they are not enough to develop intelligent models (i.e., garbage in, garbage out).
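What annotation adds on top of structure is easiest to see in a concrete record. Below is a minimal, hypothetical example in the widely used COCO annotation style, labeling two cars in one parking-lot image:

```python
# A minimal COCO-style annotation record: the kind of structure that turns a
# raw image into labeled training data. All values are hypothetical.
annotation_example = {
    "images": [
        {"id": 1, "file_name": "lot_001.png", "width": 1024, "height": 1024}
    ],
    "annotations": [
        # One bounding box per car: [x, y, width, height] in pixels.
        {"id": 10, "image_id": 1, "category_id": 3, "bbox": [412, 230, 38, 19]},
        {"id": 11, "image_id": 1, "category_id": 3, "bbox": [455, 232, 37, 18]},
    ],
    "categories": [{"id": 3, "name": "car"}],
}
```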
Similar to Databricks and Snowflake, Scale and SuperAnnotate (est. 2016 and 2019) are among the fastest-growing companies empowering businesses with SuperData. All these companies will continue to grow, since so many others rely on them to build the most powerful training data for their AI.
Unleashing the power of AI with SuperData
In the past, to improve ML model performance, AI engineers would focus on different model architectures, tune parameters, add layers to their neural networks, and primarily use tools and frameworks such as PyTorch, TensorFlow, and AWS SageMaker. Research was booming in that direction, and some believed those were the only components needed to build AI applications.
Over the last 1-2 years, we’ve experienced a mind shift from a model-centric to a data-centric approach. However, with a data-centric approach, preparing SuperData for neural networks and deep learning algorithms takes more than 80% of a data science team’s effort, mainly because these algorithms require large amounts of SuperData. And, of course, creating, versioning, cleaning, updating, and continuously improving SuperData requires massive effort and collaboration among different professionals: data annotators, data validators, project managers, ML engineers, MLOps engineers, etc. Enabling these professionals to work together seamlessly necessitates a deep understanding of the entire AI lifecycle, and they need sophisticated tooling to create, version, and improve SuperData.
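One way to picture a single data-centric iteration: compare the current model’s predictions against the stored labels and queue the suspicious examples for human review. The sketch below is a simplification with hypothetical names, not any particular platform’s workflow.

```python
# A minimal sketch of one data-centric iteration: flag training examples
# where a confident model disagrees with the stored label, and route them
# back to annotators for review. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Example:
    image_id: str
    label: int          # current (possibly noisy) annotation
    predicted: int      # model prediction
    confidence: float   # model confidence in its prediction

def needs_review(ex: Example, threshold: float = 0.9) -> bool:
    # A confident model contradicting the stored label often signals a
    # labeling error rather than a model error.
    return ex.predicted != ex.label and ex.confidence >= threshold

def review_queue(dataset: list[Example]) -> list[str]:
    return [ex.image_id for ex in dataset if needs_review(ex)]

dataset = [
    Example("img_001", label=1, predicted=1, confidence=0.97),
    Example("img_002", label=0, predicted=1, confidence=0.95),  # likely mislabeled
]
print(review_queue(dataset))  # -> ['img_002']
```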
As the community shifts toward creating better AI-enabling datasets, SuperData platforms become essential for staying ahead in the continuously intensifying AI race. Referring back to the example above, a correctly annotated, tagged, and diverse dataset is the only way to build computer intelligence that can predict Walmart’s revenue and stock price ahead of time from parking lot information.
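To close the loop on that example, the final step from car counts to a revenue estimate can be as simple as a regression, provided the training data behind the counts is SuperData. A toy sketch with fabricated numbers, for illustration only:

```python
# Closing the loop on the Walmart example: a toy regression from average
# quarterly car counts to reported revenue. The numbers are fabricated for
# illustration; a real model would use many stores and many quarters.
import numpy as np

# Hypothetical history: (average cars per lot, revenue in $B) per quarter.
avg_cars = np.array([310.0, 335.0, 298.0, 352.0])
revenue = np.array([134.0, 138.5, 132.1, 140.3])

# Fit revenue = a * cars + b with ordinary least squares.
a, b = np.polyfit(avg_cars, revenue, deg=1)

# Estimate the upcoming quarter from this quarter's satellite-derived count.
current_count = 344.0
print(f"estimated revenue: ${a * current_count + b:.1f}B")
```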
Conclusion
The expression “Data is the new oil” became a true inspiration for many companies to build their businesses around data. Over the last two decades, data has redefined entire industries, allowing top-tier companies to pull ahead by using it intelligently. In recent years, nearly every smart device connected through Wi-Fi or Bluetooth has started gathering some sort of structured or unstructured data. Such datasets would not have become usable without being transformed into SuperData.
SuperData is key to achieving AI supremacy and staying competitive in the AI race. Data is necessary but not sufficient to win; it’s the SuperData that breaks new ground. Therefore, I would like to update the famous, almost two-decade-old quote: SuperData is the new oil.
*This post was written by Vahan Petrosyan, co-founder and CTO at SuperAnnotate. We thank SuperAnnotate for their ongoing support of TheSequence.
About SuperAnnotate
SuperAnnotate backs genius minds with SuperData to help them disrupt industries faster, smarter, and better. Our platform enables ML engineers, MLOps engineers, data scientists, project managers, annotators, and data validators to seamlessly collaborate with each other to create the best SuperData for their AI. We also advise our clients on how to annotate, version, manage, improve, and streamline their ML processes and achieve supreme AI algorithms. To better understand how our team can help you out in your AI journey, go ahead and request a demo. You can also connect with me directly by sending an email to my name at my company name dot com.