Maybe U2 had it right: There is something even better than the real thing—at least as far as data is concerned.
The still nascent market of synthetic data—or artificially manufactured data—seems to be having a moment right now.
In the last year-plus, several large companies, including Microsoft, Google and Amazon (which used it to train Alexa), have talked openly about their use of synthetic data.
Then last October, Facebook acquired New York-based synthetic data generator AI.Reverie. The next month, chip giant Nvidia said it was creating an engine for generating synthetic data for training AI networks.
Interest in the space has even worked its way into the venture capital world. Only about two dozen companies in the space have received funding in the last two years, according to Crunchbase data. But in the last several months, a few startups have raised significant rounds, including:
- San Diego-based synthetic data creator Gretel.ai closed a $50 million Series B funding round led by Anthos Capital in October.
- Austria-based synthetic data generator MOSTLY AI raised a $25 million Series B led by Molten Ventures in January.
- Israel-based Datagen, a platform using synthetic data for visual AI applications, closed a $50 million Series B led by Scale Venture Partners last month.
While those rounds may not be huge, they are substantial, considering synthetic data is a concept few understand.
What is synthetic data?
In the simplest terms, synthetic data is information that is artificially manufactured and not actually created by real-world activities and events. That’s important because developing AI and ML (machine learning) projects requires immense amounts of data. However, real-world data can be expensive and difficult to collect.
“A lot of data collection is done manually,” said Ofir Zuk (Chakon), CEO and co-founder of Datagen. “That can be very, very slow.”
Real-world data can also be biased and “dirty”—prone to incorrect labeling or other human error since it’s gathered manually.
Synthetic data eliminates many of those issues. It is also easier and faster to produce than real-world data is to collect, which lets developers more quickly build the algorithms and AI models they need.
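To make the idea concrete, here is a minimal Python sketch of manufactured data whose labels are known by construction, so no manual labeling step is needed. It is a toy illustration only: the function name, the "transaction" fields and every numeric parameter are assumptions invented for this example, and it does not represent how Datagen, Gretel.ai or any other company mentioned here generates data.

```python
import random


def make_synthetic_transactions(n, fraud_rate=0.05, seed=0):
    """Return n made-up transaction records with built-in labels.

    All distributions and parameters are illustrative assumptions,
    not values taken from any real dataset.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for _ in range(n):
        # The label is chosen first, so it is correct by construction.
        is_fraud = rng.random() < fraud_rate
        # Hypothetical rule: fraudulent transactions tend to be larger.
        mean = 900.0 if is_fraud else 60.0
        amount = round(max(0.01, rng.gauss(mean, mean * 0.25)), 2)
        rows.append({"amount": amount, "is_fraud": is_fraud})
    return rows


rows = make_synthetic_transactions(1000)
print(len(rows), "records;", sum(r["is_fraud"] for r in rows), "labeled as fraud")
```

Because the generator is seeded, the same call always yields the same records, and producing a million rows instead of a thousand is a one-argument change rather than a new collection effort.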
“It’s completely revolutionized the way developers work,” Zuk said.
Why now?
Zuk said last year proved pivotal. “Prior to 2021, very few companies understood synthetic data,” he said. “But then big companies started to publish outcomes using it. That changes things.”
Before last year, Datagen mainly got customers through outbound sales calls, Zuk said. That changed last year as it became clear many big tech companies were adopting the new kind of data.
“We started getting about 10 inbound sales requests a week,” he said.
The market does seem to be catching up. Gartner estimates that by 2024, 60 percent of data for AI applications will be synthetic. The market for synthetic data generation grew to more than $110 million last year and is expected to get to $1.15 billion by 2027, according to a report published by research firm Cognilytica.
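For a sense of what the Cognilytica projection implies, the short Python sketch below computes the compound annual growth rate between the two figures. It assumes "last year" means 2021, which gives a six-year span to 2027, and it treats "more than $110 million" as exactly $110 million, so the result is only a rough estimate. The function name is made up for this example.

```python
def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate needed to go from start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Assumptions: 2021 baseline of $110 million, 2027 target of $1.15 billion.
rate = implied_annual_growth(110e6, 1.15e9, 2027 - 2021)
print(f"Implied growth rate: {rate:.1%} per year")
```

Under those assumptions the market would need to grow by a little under 50 percent every year to hit the 2027 figure.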
Privacy push
Another driving force behind synthetic data’s growth over the past few years is privacy. While companies may be drowning in data, they can’t always use it.
“You may not be able to use the data you have because of regulations,” said Ali Golshan, CEO and co-founder of Gretel.ai. “Synthetic data avoids that issue.”
That is especially important considering two of the leading sectors using AI and synthetic data—finance and health—are also highly regulated.
“I think health has fueled synthetic data’s growth,” he said. “Not just are there regulations around privacy issues, but health care data also can be extremely rare.”
Getting money
Until recently, investors also rarely understood synthetic data, despite its evolution over two decades.
Now, “investors understand it much better,” Golshan said. “We could have raised 2x the amount we did if we had wanted it.”
His company gets one or two inbound calls from VCs each week asking when it will raise a Series C round, he added.
Zuk agreed, saying his experience raising Datagen’s Series B last month was much different from when the company locked down its $18.5 million Series A in February 2021.
“All the tier one investors in the U.S. were happy to take the first call,” he said with a laugh. “In 2018, maybe one out of every 10 took the call.”
Andy Vitus, a partner at Scale Venture Partners, which led Datagen’s Series B, said that when he first started learning about synthetic data years ago, he thought it was counterintuitive to use such data for AI/ML models.
However, as those models create simulations of the real world, using simulated data did not seem unreasonable, he added.
Vitus said that while some data experts still question synthetic data, he sees a future for the industry. “It’s an idea whose time has come,” he said.
With large companies including Amazon and Microsoft already showing interest in synthetic data, others are likely to follow. Data platform Scale AI, valued at more than $7 billion, just announced plans to get into synthetic data.
“I think it’s logical to think companies like Snowflake and Databricks will come in,” Golshan said. “I’m sure there will be others.”