Can synthetic data bridge the AI training data gap?

Devansh Gupta
Dec 28, 2022

Training AI models requires gathering and preparing a huge amount of data, which is a complex process. The data must be checked for bias, and it must not contain any sensitive information. Because the data is massive, potentially a billion records or more, it is often impractical to meet every requirement, such as the data being complete or representative of the population we are trying to understand. This is where ‘synthetic data’ comes in. Synthetic data is artificially generated data that stands in for real-world data and is designed to have the same mathematical and statistical properties. It has many uses: training models when real-world data is incomplete, filling gaps in training data, speeding up model development, simulating future scenarios, and so on.

There are different approaches to creating synthetic data, depending on how much real data is available: none, some, or a full set. Where there is no real data, the AI engineer must rely on a good grasp of what the data set should look like and of the data environment. Where only some real data exists, the engineer generates some parts from the actual data and other parts from assumed distributions. Where real data exists, the synthetic data is generated from a best-fit distribution of that data.
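The best-fit case can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive method: it assumes a single numeric column and assumes a normal distribution is an adequate fit, whereas a real project would compare several candidate distributions. The sample values and the function names `fit_normal` and `generate_synthetic` are hypothetical.

```python
import random
import statistics


def fit_normal(real_values):
    """Estimate the mean and standard deviation of the real sample."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def generate_synthetic(real_values, n, seed=0):
    """Draw n synthetic values from the normal distribution fitted to real_values."""
    mu, sigma = fit_normal(real_values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" transaction amounts, made up for illustration.
real_amounts = [12.5, 40.0, 33.2, 27.8, 19.9, 45.1, 30.3, 22.7]
synthetic_amounts = generate_synthetic(real_amounts, n=10_000)

# The synthetic sample should roughly reproduce the real sample's statistics.
print(statistics.mean(real_amounts), statistics.mean(synthetic_amounts))
print(statistics.stdev(real_amounts), statistics.stdev(synthetic_amounts))
```

The synthetic values contain none of the original records, yet their mean and spread track the real sample, which is the property the best-fit approach relies on.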

Synthetic data is still a relatively new way of obtaining training data. If it proves effective enough, it could bridge the AI training data gap. But the method needs to be implemented and put through trial and error so that both its outcomes and its shortcomings can be properly assessed.

Some data sets don't accurately reflect a business's use cases. A system that identifies phone numbers, for example, may not see enough international numbers in its training data. Another issue that arises frequently is balancing a data set. According to John Blankenbaker, principal data scientist at SSA & Co., a worldwide management consulting organization, a historical data set might contain 99 percent non-fraudulent transactions and less than 1 percent fraudulent transactions. "Many models will conclude that labeling every transaction as non-fraudulent is the most effective policy."
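A small calculation shows why such a policy looks attractive to a model. The sketch below uses a made-up data set of 1,000 transactions, 10 of them fraudulent, and a hypothetical "model" that labels everything as non-fraudulent; the numbers are assumptions for illustration only.

```python
def evaluate(labels, predictions):
    """Return (accuracy, recall on the fraud class) for 0/1 labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    accuracy = correct / len(labels)
    caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    actual_fraud = sum(labels)
    recall = caught / actual_fraud if actual_fraud else 0.0
    return accuracy, recall


# Hypothetical imbalanced data set: 1 = fraudulent, 0 = non-fraudulent.
labels = [1] * 10 + [0] * 990

# A "model" that labels every transaction as non-fraudulent.
predictions = [0] * len(labels)

accuracy, recall = evaluate(labels, predictions)
print(accuracy, recall)
```

The do-nothing model scores 0.99 accuracy while catching none of the fraud (recall 0.0), which is why accuracy alone is a misleading measure on an unbalanced data set.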

Synthetic data can help balance a data set, but it must be used with caution. "The synthesis method will only be useful if it captures whatever it is about a transaction that implies fraud," Blankenbaker explains. "Which is unlikely to be clear because we'd have to use it as a fraud detector then."
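One common way to synthesize extra minority-class records is to interpolate between existing ones. The sketch below is a simplified illustration of that idea, not a method from Blankenbaker or a full algorithm such as SMOTE (which interpolates toward nearest neighbours). The feature values and the function name `interpolate_minority` are hypothetical.

```python
import random


def interpolate_minority(minority_rows, n_new, seed=0):
    """Create n_new synthetic rows, each a random point on the line
    segment between two randomly chosen real minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority_rows, 2)  # two distinct real rows
        t = rng.random()                     # position along the segment
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical fraud records: [amount, transactions in the last hour].
fraud_rows = [[950.0, 3.0], [1200.0, 1.0], [870.0, 4.0], [1500.0, 2.0]]

new_fraud_rows = interpolate_minority(fraud_rows, n_new=96)
balanced_fraud = fraud_rows + new_fraud_rows
print(len(balanced_fraud))
```

This also illustrates the caution above: the new rows only ever fall between fraud cases that are already known, so if the features do not capture what actually implies fraud, the extra rows add volume without adding signal.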

Not everyone is persuaded. In 2021, Neil Raden published an article discussing the pros and cons of synthetic data, in which he said he was still not convinced that data can be created that closely resembles the real data. Much research has been conducted on the effectiveness of synthetic data, and researchers and engineers are still looking for ways it could help bridge the AI training data gap.

Image by Gerd Altmann from Pixabay