To Be Or Not To Be Synthetic: Data In Bio x ML
Intuition for Synthetic Training Data in the Sciences
Summary: Synthetic data is becoming more central to the training and development of deep neural networks. This article describes some of the intuition around its utility in different domains.
If there is one thing that has come to characterize the ethos of deep learning, it is that data is of paramount importance. Computational architectures are fascinating, even artistic at times, and the engineering rigor that goes into training, scaling, and optimizing is tremendous, but it is all for naught if the underlying data does not hold the representations that are to be modelled.
This can be rather challenging to assess, as the representations or informational content of data are neither human prescribed nor typically known a priori. It is, however, a primary concern of model developers that the data contain a faithful sample of the true distribution of the underlying system. One of the primary ways to establish this with some degree of certainty is simply to gather bigger and bigger datasets that would, just by statistical occurrence, contain a sufficient representation of the information central to the model along with a decent number of “long-tail” examples that are critical to high-functioning, real-world AI systems.
However, this approach has diminishing returns, as seen in the power-law scaling of performance with compute or with dataset size: loosely speaking, model performance improves only logarithmically in the quantity of data or compute dedicated. One way of thinking about this intuitively is to consider data like a jar of marbles of different colors, where the objective is to sample the marbles to understand the relative relationships between the colors. Say there are 1000 marbles in total: 989 are black, 5 are red, 3 are green, 2 are blue, and 1 is yellow. If you randomly sampled 25 marbles from the jar, the statistics would determine how accurately you could recover the actual relationships (the true ratio of red to green is 5:3). Recovering these relationships is, in essence, one of the core outcomes of training a deep neural network like a large language model. The techniques used to improve the estimate are:
Run training for longer: if you draw an increasing number of 25-marble random samples, your aggregate estimate approaches the true distribution simply through repeated sampling
Get more marbles: in your dataset you may have 1 yellow marble per 1000, whereas the true population has 2 per 1000 (the specific data you had to work with was itself a sample from that population and happened to contain only 1). If you double your dataset by adding another 1000 marbles drawn from the population, the law of large numbers indicates that the yellow count will tend toward the expected 4 in 2000 (2 per 1000), so the larger dataset becomes more reflective of the true underlying distribution.
Both of the above methods, however, run into the same diminishing returns: the sampling statistics are such that beginning to approach the true distribution requires exponentially more data or computation. For a more detailed description of the role of dataset size and compute in model training, see Intuition on AI Scale for Biologists1.
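To make the marble intuition concrete, here is a minimal simulation (illustrative only): it draws repeated 25-marble samples from the jar described above and tracks how slowly the estimated red-to-green ratio approaches the true value of about 1.67.

```python
import random

# The jar described above: 1000 marbles with a heavily skewed color mix.
jar = ["black"] * 989 + ["red"] * 5 + ["green"] * 3 + ["blue"] * 2 + ["yellow"]

def estimate_red_green_ratio(sample_size, n_draws):
    """Estimate the red:green ratio from repeated random samples."""
    reds = greens = 0
    for _ in range(n_draws):
        sample = random.sample(jar, sample_size)
        reds += sample.count("red")
        greens += sample.count("green")
    return reds / greens if greens else float("inf")

# More draws (longer "training") slowly approach the true ratio of ~1.67.
for n_draws in (10, 100, 1_000, 10_000):
    ratio = estimate_red_green_ratio(sample_size=25, n_draws=n_draws)
    print(f"{n_draws:>6} draws of 25 marbles -> estimated red:green = {ratio:.2f}")
```

Because red and green together account for less than one percent of the jar, a single 25-marble sample usually contains neither color; only after many draws, or with a much larger jar, does the estimate stabilize, which is precisely the diminishing-returns behavior described above.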
In the real world this can become limiting in terms of model performance in two main cases:
Where the majority of raw data has already been consumed. This is the case, for example, for many of the foundational language and image models that are trained on data from the Internet (or at a minimum on the data that was originally human generated). The Internet as a data lake for deep learning training has been analogized to fossil fuels: one fueled the industrial revolution and one the deep learning revolution, but both are limited resources.
Where real data representing the reality one aims to model doesn't exist. Beyond digital data that exists online, this is likely the case in most domains, notably those that are dimensionally intractable. The physical sciences, and particularly biology and the life sciences, are among the primary domains where this becomes a challenge. The cost of acquisition, the dimensionality, and the heterogeneity of data in the life sciences make comprehensive, distributionally complete datasets very rare, which makes training AI systems a challenge.
One of the primary strategies for making progress in training models in data-constrained domains is the creation of synthetic data to increase the available datasets for training. The remainder of this article is intended to provide some intuition for what synthetic data can and cannot accomplish in this pursuit.
Data Augmentation - Making Up Data Synthetically
Generally, one can think of creating synthetic data for a few different purposes, but they all fall under the label of Data Augmentation.
Two of the primary ways of thinking about this are:
The synthetic creation of new data samples. Such samples can be created from whole cloth and added to the existing training datasets. One simple example of this is creating variations of existing pictures to add to a training dataset for image generation models.
The imputation of missing data elements in existing datasets to complete them. In many cases, particularly in biology and healthcare, datasets are incomplete: different samples may contain, for example, different biomarkers or measurements, and for the purposes of modelling one would want these datasets to be consistently complete. In such cases, synthetic data can be modelled to fill in the missing pieces (a minimal imputation sketch follows the next paragraph).
Both are flavors of the same idea but may be useful to think about separately. In each case, the creation of synthetic data is used to augment datasets.
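As a minimal sketch of the imputation case, the example below uses scikit-learn's KNNImputer on a small, made-up biomarker matrix (the values and the choice of a k-nearest-neighbors imputer are illustrative assumptions, not a method prescribed here):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical biomarker matrix: rows are samples, columns are measurements;
# NaN marks assays that were not run for a given sample.
X = np.array([
    [1.2, 0.8, np.nan, 3.1],
    [1.1, np.nan, 2.4, 3.0],
    [0.9, 0.7, 2.2, np.nan],
    [1.3, 0.9, 2.6, 3.3],
])

# Each missing value is filled in from the most similar complete samples;
# the filled-in entries are synthetic data points added to the dataset.
imputer = KNNImputer(n_neighbors=2)
X_complete = imputer.fit_transform(X)
print(X_complete)
```

Whether such filled-in values are trustworthy depends on whether the neighboring samples genuinely reflect the missing measurements.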
The key consideration for the creation of synthetic data is whether the data is an accurate representation of reality. There are two ways of thinking about the creation of synthetic data:
Data that is created based on rules
Data that is generated based on sampling prior trained models
Rule Based Data Creation
Augmenting datasets with rules-based synthetic data is straightforward and provides consistent, accurate synthetic data for training sets. Rules-based data generation can be thought of as identifying a principle of invariance and generating data around that invariance.
For example, physical systems that are accurately modelled by equations can be simulated ad infinitum, and such simulated data is an accurate representation of the underlying reality (at least to the extent that the equations are not themselves generalizing or smoothing too much). This type of synthetic data generation also applies to games with well-defined rules such as chess, Go, and StarCraft (and their respective Alpha* models), and it is typically used in reinforcement learning settings with Monte Carlo tree search to generate essentially infinite data points, provided an objective function can be assessed that quantifies the value of the outcomes.
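As a minimal sketch of equation-driven data generation (a generic projectile-motion example chosen purely for illustration), synthetic trajectories can be produced in unlimited quantity directly from the governing equations:

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def trajectory(v0, angle_deg, n_points=50):
    """Return (t, x, y) points for a drag-free projectile launch."""
    theta = np.radians(angle_deg)
    t_flight = 2 * v0 * np.sin(theta) / G
    t = np.linspace(0.0, t_flight, n_points)
    x = v0 * np.cos(theta) * t
    y = v0 * np.sin(theta) * t - 0.5 * G * t**2
    return np.stack([t, x, y], axis=1)

# Randomly sampled launch conditions yield an arbitrarily large dataset,
# each point faithful to the (simplified) physics that generated it.
rng = np.random.default_rng(0)
dataset = [trajectory(rng.uniform(5, 50), rng.uniform(10, 80)) for _ in range(1_000)]
print(len(dataset), "trajectories of shape", dataset[0].shape)
```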
In addition to the use of equations for physical simulation, the identification of invariants is useful in physical modelling with respect to scale. The Buckingham Pi Theorem2, for example, identifies combinations of physical parameters that are collectively dimensionless, such that equations expressed in those numbers are independent of the specific scale of the physical system being modelled. This method would allow for the creation of synthetic data points, for example in the parameters of physical systems, that retain their grounding in true representations of reality.
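A small sketch of this invariance idea, using the Reynolds number as a standard dimensionless group (an illustrative choice; no particular system is specified above): new parameter combinations can be generated at different scales while holding the dimensionless quantity fixed.

```python
import numpy as np

def reynolds(rho, v, L, mu):
    """Dimensionless Reynolds number: rho * v * L / mu."""
    return rho * v * L / mu

# One measured flow condition (illustrative values for water in a small pipe).
rho, v, L, mu = 1000.0, 2.0, 0.05, 1.0e-3
target_re = reynolds(rho, v, L, mu)

# Synthesize parameter sets at new length scales that preserve the invariant
# by rescaling velocity, so each stays grounded in the same physical regime.
rng = np.random.default_rng(0)
for _ in range(5):
    new_L = L * rng.uniform(0.1, 10.0)
    new_v = target_re * mu / (rho * new_L)
    print(f"L={new_L:.3f} m, v={new_v:.3f} m/s, Re={reynolds(rho, new_v, new_L, mu):.0f}")
```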
One of the more intuitive concepts of invariance in the generation of synthetic data is positional invariance in images. For example, moving a dog in a picture from the left side of the frame to the right does not change the identity of the dog, but from a dataset perspective it provides a different set of points to train with. The same applies to other translations, rotations, reflections, and similar transformations of images, and it can yield an essentially limitless synthetic dataset that nevertheless retains an accurate representation of reality. Learning features that are stable under such transformations is a primary function of convolutional layers in deep neural networks.
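As a minimal sketch (using a random array as a stand-in for a real photograph), these transformations can be applied directly to image arrays to multiply a dataset without changing what the images depict:

```python
import numpy as np

image = np.random.rand(64, 64, 3)  # stand-in for a photo of a dog

# Flips and rotations move pixels around but preserve the subject's identity,
# so each transformed copy is a valid new training example.
augmented = [
    image,
    np.fliplr(image),      # horizontal flip: subject moves left <-> right
    np.flipud(image),      # vertical flip
    np.rot90(image, k=1),  # 90-degree rotation
    np.rot90(image, k=3),  # 270-degree rotation
]
print(len(augmented), "training examples derived from one original image")
```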
Generative Synthetic Data
The second method of synthetic data creation is generative. This approach is arguably becoming the dominant one, at least in the LLM space, and it is anticipated that in the not-so-distant future the majority of content available on the Internet will be the product of generative models. It is also the case that a number of LLMs released as open-source models have, at least in part, been trained on data generated by other foundational LLMs (and indeed some proprietary LLM developers have included restrictions in their terms of use stating that outputs cannot be used to train other LLMs).
The question of whether generated synthetic data can be used to train models still rests on the desired distributional representation being modelled and whether a given data generator can produce it. It's not dissimilar from making copies of copies. In the context of language, training on LLM-generated data has been shown to cause models to irreversibly forget and to degrade in performance. Visually, this is easily understood from a basic resampling perspective.
Ref: The Curse of Recursion: Training on Generated Data Makes Models Forget3
Resampling consistently crops the distribution's tails such that future distributions become narrower. This seems to be an inevitable consequence of sampling, independent of the scale of synthetic data generated (a small simulation following the excerpt below illustrates the effect). It is reminiscent of Lewis Carroll's sentiment on maps4:
"What a useful thing a pocket-map is!" I remarked.
"That's another thing we've learned from your Nation," said Mein Herr, "map-making. But we've carried it much further than you. What do you consider the largest map that would be really useful?"
"About six inches to the mile."
"Only six inches!" exclaimed Mein Herr. "We very soon got to six yards to the mile. Then we tried a hundred yards to the mile. And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!"
"Have you used it much?" I enquired.
"It has never been spread out, yet," said Mein Herr: "the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well."
Bias Beware
In all cases of synthetic data generation, whether via rules-based or sampling approaches, the validity of the synthetic data as training data depends on the biases it inherits. Biases may be less prevalent in data generated through the equations of physics, but they are certainly worth considering both in rules-based data guided by decision policies and in recursive generative sampling from a distribution.
Generating Synthetic Data in Continuous Space Can Be Challenging
Synthetic data creation in a bounded, invariance-defined space is relatively straightforward provided the invariant parameters are sufficiently well defined. In many complex and continuous settings, however, the generation of high-fidelity synthetic data can be substantially more difficult. The recent release of models like Sora notwithstanding, this has been most apparent in the challenges of generative video: progressing from one frame to the next in continuous time requires a more comprehensive understanding of the underlying physics (one might argue a “world model”) than may be captured by a generative model. It is worth mentioning that physics-driven engines like Unreal Engine, developed by Epic Games, can produce accurate video progressions, but they do so via specified equations grounded in physics rather than generative predictions from a neural network. In this case, synthetic data used to train neural networks might plausibly be provided by such physics-based engines (in addition to the massive troves of data held on resources like YouTube).
In the life sciences, continuous-space synthetic data generation is rather more challenging, and this poses a critical problem for the training of deep neural networks. The central difficulty is that there are few, if any, clear invariants in complex biological systems that can be used extensively to generate synthetic data that approximates reality. It is true that biological systems, for example single-cell gene expression, are constrained in some ways, and prior knowledge such as the co-expression of specific genes might be used to augment incomplete datasets, but these relationships are almost certainly not invariant.
Significant efforts have been aimed at creating digital twins of various biological systems, from cells to whole organisms. The primary data augmentation approaches have centered on imputing missing data elements in multimodal settings, largely because the available datasets are unimodal or fragmented. Where paired datasets exist, joint-embedding approaches can learn representations that align different modalities and possibly impute missing data; diffusion models have been used to generate pretty images of single-cell representations and protein structures; and variational autoencoders presumably create a continuous latent space that can be sampled to generate new data points. In each of these cases, however, one must take care to assess whether the synthetic data is reflective of reality. Assuming that a point sampled from a VAE or a diffusion model corresponds to a cell state that could actually exist is not dissimilar from tracing a pseudotime progression of cell state through a UMAP plot using specific gene markers: both are probably in the general ballpark of reality, but both are still best guesses and not likely suitable for training subsequent neural networks. To the extent that prior knowledge, for example of gene co-expression, can be used to constrain or prune the scope of de novo synthetic data, it may be more or less useful for training augmentation (a minimal sketch of such pruning follows below).
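Here is the pruning sketch mentioned above. Everything in it is illustrative: the three "genes", the fitted-Gaussian generator standing in for a VAE or diffusion model, and the co-expression rule used to reject implausible candidates are assumptions for demonstration, not real biology.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" expression data for three genes, where gene A and gene B are
# assumed to be co-expressed (correlation ~0.8); gene C is independent.
cov = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
real = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=500)

# Naive generator: fit a Gaussian to the real data and sample candidates
# (standing in for sampling a VAE or diffusion latent space).
mu, sigma = real.mean(axis=0), np.cov(real, rowvar=False)
candidates = rng.multivariate_normal(mu, sigma, size=1_000)

# Prune candidates that violate the prior co-expression rule, i.e. gene A
# strongly up while gene B is strongly down, or vice versa.
a, b = candidates[:, 0], candidates[:, 1]
violates = ((a > 1.0) & (b < -1.0)) | ((b > 1.0) & (a < -1.0))
synthetic = candidates[~violates]
print(f"kept {len(synthetic)} of {len(candidates)} candidate profiles")
```

The pruning step only removes candidates that contradict known constraints; it cannot guarantee that the remaining samples correspond to cell states that actually exist, which is the core caution raised above.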
Scale and Proximate Reality Matter
When considering the utility of synthetic data generated from previously trained models, the scale of the data and the requirements for proximate reality both matter for the ultimate use case. For example, the recent Sora text-to-video model from OpenAI was trained in part on captions that were themselves generated by other models. This use of synthetic data can be thought of in the following context:
The scale can be very large, which, for the purposes of generating training data for video generation, can compensate for some of the resampling loss in the long tails that are not critical for a video product. Edge cases will still exist where the model breaks down, but they do not prevent general usability as a consumer product.
The proximate reality of text-to-image training data is flexible: a wide range of possible text captions are valid for a given image, so the likelihood that a synthetic training point is an appropriate representation of reality is high.
In the biological sciences, these two criteria are not met as seamlessly. In many cases in biology, the edge cases are the critical parameters. Transcription factors, for example, may span concentration ranges across several orders of magnitude yet have significant functional implications. Additionally, the proximate reality of synthetically generated biological data may be entirely absent. A generated “cell-state” that is nominally close to another may not actually exist in reality and will not be a valid data point for future training.
These two considerations make synthetic data generation for physical systems and the sciences categorically different from that for consumer products or LLMs. The fact that such systems are more tightly constrained by the real world makes generating synthetic data more challenging. The future of training such systems is better addressed in the domain of Physics Informed Machine Learning5, where models are trained within the constraints of physical laws.
Intuition on Synthetic Data
There is, and will continue to be, a tremendous amount of AI-generated content produced; some estimates are that 99 to 99.9 percent of the Internet's content will be AI-generated by 2025 to 20306.
Practically, there is also an infinite amount of synthetic data that can be generated through rules-based systems, games, equations, etc.
It’s important, however, to note the difference in use case between generating data in general and generating data for the express purpose of training subsequent data generators. The latter is much more challenging.
The intuition for the reader should rest on two primary ideas:
Is the generator of the new data based on invariance principles?
If not, how do data distributions change with recursive sampling?
There are many cases in which synthetic data will be an invaluable resource for training deep neural networks. The sophistication of reinforcement learning models in game settings is a clear example. However, in complex systems that do not have clearly defined rules, the use of synthetic data for training is more speculative. Biology is one such system; at the end of the day, we must ask ourselves whether the data we are using to train systems is an accurate representation of the fullness of the reality we are aiming to model.
If it’s not in the data, it’s not in the model…
References:
1. Intuition on AI Scale for Biologists
2. Buckingham π theorem: https://en.wikipedia.org/wiki/Buckingham_%CF%80_theorem
3. The Curse of Recursion: Training on Generated Data Makes Models Forget: https://arxiv.org/abs/2305.17493
4. Lewis Carroll on maps: https://people.duke.edu/~ng46/topics/lewis-carroll.htm
5. Physics Informed Machine Learning: https://www.nature.com/articles/s42254-021-00314-5
6. https://futurism.com/the-byte/ai-internet-generation
Header Image Ref: https://images.fineartamerica.com/images/artworkimages/mediumlarge/3/matrix-green-code-secret-password-poster-joshua-williams.jpg