TLDR: This article discusses the role of data in AI reasoning, including the pairing of generation with search and reinforcement, and the recent release of o1 from OpenAI.
We Need More Data
This is the most common refrain in the AI world. In the drug development world, it’s often used to raise capital. Data is often expensive, considered special, and typically proprietary. Companies publish pretty images of fluorescently stained cells, big point clouds, and elegant protein folds. They are artistic, but the real question is whether they are meaningful. It is an under-discussed fact that for many AI-driven biotech companies, their lead assets are not, in fact, produced from the pipelines that generate data for their AI models. It’s fair to say that this may be changing, but the jury remains out on how fast.
This article is intended to present a basic framework for this question of data in training AI models. It’s connected to other articles I have written on topics related to scaling models and synthetic data (both linked below) but is a bit more general. There is a lot of math behind the perspective, but in line with the theme of this Substack, this article will keep the content at an intuitive level.
The basic idea will center around entropy—or the informational content of data—and why this is, in fact, the critical factor in determining the success of AI models.
We will take a detour into the mechanics of the recent o1 “strawberry” model released by OpenAI and discuss the role of data entropy in AI reasoning and perhaps in the future of AGI.
For additional reading on this topic, related articles are linked below:
An outline of the article is below.
Priors
One of the primary goals of machine learning is to identify patterns in information to make predictions about the future. This naturally presupposes that there are, indeed, patterns. When the weights of a machine learning model are initialized, they are naive, meaning they have no particular inclination toward any specific outcome. When you start to train the model with data, the idea is that the patterns in the data get transferred into the weights of the model.
The success of LLMs has been fascinating—natural language has, for a long time, been considered a very difficult computational challenge and it has required both substantial computing power and Internet-scale datasets to enable the progress we have seen to date. However, we should also consider that these datasets are actually quite amenable to machine learning because they exhibit very strong priors.
Simply put, natural language training data, such as that found on the Internet, is already preselected for data distributions that are intended to comprise semantic content. When we speak or write, the information content of the language is presupposed, and this is a substantial advantage for machine learning models that aim to learn semantics. It would be, for example, much more difficult if the training data for LLMs consisted of dictionary entries. While this training data also contains natural language and may be extensive and robust informationally, it does not have the preselected prior of having rich and long-range semantic context.
LLMs are considered “next-word predictors”—in this context, the performance of an LLM is determined by how well it can predict the next word, or, more specifically, by how much the prior sequence of words constrains the probability of the next word.
For example, the phrase “Mary had a little ____” is a case of generally high constraint. Most people do not need much prompting to predict the next word or even the next several dozen words that would complete the entire poem. This is not to say that other combinations of words are not possible, just that the probability of one combination is much higher than others. The series of words “Mary had a little….” substantially constrains the likelihood of future words.
In math terms, this is discussed in the context of entropy and information. Entropy is a measure of disorder. It is often presented in the context of, for example, gas molecules filling a volume of space, where a disordered state is one in which the gas is uniformly distributed in the box, with all the molecules bouncing around randomly. An ordered state is one in which, for example, all of the molecules happen to be on one side of the box. The latter state is substantially less likely than the former: of all the possible arrangements of molecules in the system, far fewer correspond to it. It is considered a lower-entropy state. The second law of thermodynamics indicates that all natural processes tend toward higher entropy—this just means that gases will fill up an empty volume naturally.
Entropy is used in information theory as a measure of information content. In fact, Claude Shannon’s mathematical construction of information theory, in the seminal paper on the topic, A Mathematical Theory of Communication2, introduced the merging of these concepts in what is now called “Shannon entropy.” If entropy is a measure of the number of possibilities in the state of a system, it can also be a measure of the amount of information that a system contains. More possibilities = more information in any given state, fewer possibilities = less information in any given state. In the example of “Mary had a little ___,” we can think about this in the following progression of predicted next words (a small sketch follows the list):
“Mary___” – high entropy, high information content, many possibilities; the next word will provide important information.
“Mary had ___” – a bit lower entropy, lower information content, fewer possibilities for the next word, which provides less information.
“Mary had a little ____” – low entropy, the progression is almost fully defined at this point, and future words provide very little additional information.
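To make this concrete, here is a minimal sketch in Python (the candidate continuations and their probabilities are invented for illustration, not taken from a real model) that computes the Shannon entropy of a hypothetical next-word distribution at each stage of the rhyme:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-word distributions (illustrative only).
contexts = {
    "Mary ___":              [0.05] * 20,                # 20 equally likely continuations
    "Mary had ___":          [0.5, 0.2, 0.1, 0.1, 0.1],  # fewer plausible continuations
    "Mary had a little ___": [0.95, 0.03, 0.02],         # "lamb" dominates
}

for context, probs in contexts.items():
    print(f"{context:<24} entropy = {shannon_entropy(probs):.2f} bits")
# Entropy falls as the context grows: each additional word of the rhyme
# constrains the next word more, so the next word carries less information.
```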
The informational content as measured by entropy also goes by a related term, perplexity (effectively the exponential of the entropy)—a model’s performance at predicting the next word improves at lower perplexity values. In LLMs, this can also be controlled by a setting typically called “temperature,” which introduces variance into next-word prediction by scaling the logits before the softmax function turns them into the probability distribution from which the next token is sampled. Changing the temperature creates more or less variability around a specific training set and is often considered a creativity dial. For example, a well-trained model may produce the entire poem above as a direct next-token prediction with high probability, but increasing the temperature will increase the introduction of novelty—or perhaps hallucination.
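As a minimal sketch of the temperature dial (the candidate tokens and logits are invented for illustration), dividing the logits by a temperature before the softmax sharpens or flattens the distribution the next token is sampled from:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then normalize with a softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()               # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical logits for candidate tokens after "Mary had a little ___"
tokens = ["lamb", "dog", "problem", "star"]
logits = [5.0, 2.0, 1.0, 0.5]

for T in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: " + ", ".join(f"{t}={p:.2f}" for t, p in zip(tokens, probs)))
# Low temperature concentrates probability on "lamb" (the training-set poem);
# high temperature spreads it out, inviting novelty -- or hallucination.
```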
The Origins of Data: Selection Pressure for Priors
Data always has an origin, which often defines its distribution. The most useful data has origins with strong selection priors—meaning that if the data exists, it has already been through rather stringent filters for meaning. Evolutionary data has this feature because it has a strong survival selection bias—if we see it, it’s because it carries some unique information. Natural languages have this feature, photos on the internet have this feature (e.g., photos are generally of real-world things that people care about), and a good deal of evolutionary biology has this feature. Apart from the simple fact that DNA and proteins are typically expressed as strings of letters (which makes them “appear” similar to semantic languages), their sequences have been produced under strong selective pressures that bias them toward meaning. Protein sequences in nature are not random—they exist because they have function. Despite the claims of protein design companies about the ability to design novel proteins, for example, in antibodies or CRISPR enzymes, the functional and defining features of these “novel designs” are still largely homologous to their natural training sets, with divergence generally confined to scaffolding or structural regions. An example from the OpenCRISPR paper by Profluent:
Overall, generated sequences closely matched the length of natural proteins from the same protein cluster, with a Pearson correlation of 0.97 (Fig. 2d) … By aligning the structures against curated families from the SCOPe database (30), we revealed the presence of core Cas9 domains in most generated proteins and at a similar rate as naturals (Fig. 2e). This included the HNH and RuvC nuclease domains (100% and 52.1%, respectively), which are responsible for DNA cleavage, as well as the PAM-interacting domain (92.9%) and target recognition (REC) lobe (99.9%) (Fig. 2e)
…
Sequence alignment indicated significant sequence divergence between OpenCRISPR-1 and SpCas9, with a sequence identity of 71.7%. Template-based AlphaFold2 (29) predictions of the catalytic state (37) of OpenCRISPR-1 illustrated that the majority of the mutations were concentrated at the solvent exposed surface of the protein, with only a fraction located at the protein-nucleic acid interface (Fig. S11a-b). The majority of critical nucleic acid coordinating residues and nuclease site components were preserved, demonstrating the capability of the model to accurately constrain all necessary catalytic and interaction sites.3
Generally, proteins that are “predicted” naively from genomic composition models, like Evo4, are pictorially pretty but biologically much less meaningful. This is not to say that these models are not useful—in fact, from a training data perspective, if one is building on natural datasets, they are among the most promising. Indeed, models like AlphaMissense5 use selective constraint to predict which variants may be deleterious on the basis that they are rare and have been selected out. The challenge of designing novel genomes, however, is more likely one of combinatorial explosion: a one-dimensional sequence passes through selection at the level of protein folding (structure and function) and protein network integration before facing selective pressure in the environmental complexity of living systems. The combinatorics of these steps suggest that even tremendous amounts of genomic data cover an insignificant portion of the total possible space, such that learning algorithms might, at best, make progress on narrow subproblems such as local promoter sequence optimization. A result of this is that “generative” biology models aimed at more complex and novel outcomes are probably quite limited, as the training data that exists is highly constrained to a small subset of the possible space, limiting generalization. We will build models that produce variants on the same themes, but not divergent novelty.
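As a rough worked example of that combinatorial explosion (the numbers below are order-of-magnitude assumptions, not measurements), compare the size of protein sequence space with the number of sequences any database could plausibly hold:

```python
import math

# Order-of-magnitude sketch of sequence-space coverage.
protein_length = 300                         # a modest 300-residue protein
alphabet_size = 20                           # 20 amino acids

log10_possible = protein_length * math.log10(alphabet_size)
log10_catalogued = 9                         # assume ~10^9 catalogued natural sequences

print(f"Possible {protein_length}-residue sequences: ~10^{log10_possible:.0f}")
print(f"Fraction covered by catalogued sequences:   ~10^{log10_catalogued - log10_possible:.0f}")
# ~10^390 possible sequences versus ~10^9 observed: natural training data is a
# vanishingly small, selection-biased corner of the space.
```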
Higher-Order Datasets: An Entropic View
Looking at training datasets from an entropy perspective makes many higher-order or experimental datasets less attractive. Increasing the number of variables increases the potential entropy of the dataset—this in turn increases the information content (which presumably incorporates the impact of all variables) and increases both the sample size and the model size needed to accommodate it. The relationship between these two was discussed in this article as a form of “informational impedance matching.6” Importantly, increasing the informational content of a dataset does not translate into better quality data—it translates into more information in the data. If, in a cell biology experiment, some of that information relates to the type of plastics used, that is still part of the dataset.
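A small sketch of why adding variables inflates entropy (the factors and level counts are hypothetical): for independent factors, the entropy of the joint distribution is the sum of the individual entropies, so every added variable, nuisance or not, adds information the model must absorb.

```python
import math

def uniform_entropy_bits(n_levels):
    """Entropy in bits of a uniformly distributed variable with n_levels values."""
    return math.log2(n_levels)

# Hypothetical factors in a cell biology experiment.
factors = {"cell line": 8, "compound": 1000, "dose": 10, "plastic lot": 4, "operator": 3}

total_bits = 0.0
for name, levels in factors.items():
    bits = uniform_entropy_bits(levels)
    total_bits += bits
    print(f"{name:<12} {levels:>5} levels -> {bits:5.2f} bits")

print(f"joint entropy (independent factors): {total_bits:.2f} bits")
# The plastic lot contributes entropy just like the compound does: it is part
# of the dataset's information content whether or not it is wanted.
```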
Efforts to remove batch effects or technical artifacts are big topics in biological datasets, and these are all positive steps. The broader theme of this article is whether the datasets are under selective pressure to embody semantic content. In cell biology, the answer is definitely yes—information embodied in cells is inherently meaningful because it is under substantial selective pressure. However, this is not the case in all domains—the chemical domain is a notable exception.
Chemistry Has Lower Selective Pressure
For a bit of specificity, this statement relates to the intersection of chemistry and biology. Chemistry as a field has inherent selective pressure, which might be boiled down to physics and thermodynamics. This has been built up in the contentious (and seemingly obvious) mathematics of “Assembly Theory7” to explain much of the chemical world, but it says much less about the function or utility of chemistry.
Simply put, there isn’t a strict biological selective pressure on the world of possible chemical structures to embody or comprise information about what chemical structures are supposed to do in biology. It is true that many bioactive molecules are found in nature — many of these have been produced by the evolution of metabolic gene clusters for various functions. This is a good place to start functionally; however, it is a vanishingly small fraction of the broader chemical space and does not provide any restrictions on the impact of any other chemical structure. In short, there is no intuitive reason to rule out chemical structures just because they are not found in nature, and there is no intuitive reason to believe that a chemical structure found in nature will be effective as a therapeutic.
This overall lack of selective pressure on existing chemical space is one of the reasons that chemistry may be a substantially more difficult field to apply machine learning to than, for example, proteins. Many chemistry models end up falling back on some form of physics predictions like docking rather than biological impact predictions. Further, biological therapeutics, like proteins, have a higher clinical success rate than small molecules—the stronger selective pressure makes training data more relevant for the task.
Revisiting Synthetic Data
Synthetic data is considered a central part of the future of AI8. It should serve two objectives:
Increase the sample number via interpolation in order to give the weights of a model more observations to find the same entropic minimum. The most common way this is done is by using equations to create new data. From an informational standpoint, using equations doesn’t increase the informational content of the data—for example, the equation y = 2x, which generates training data points (1,2), (2,4), (3,6), etc., does not contain any new information. However, it may be useful if a model simply needs more examples for gradient descent to fit its parameters (see the sketch after this list).
Create informationally new data points. This is more challenging. In general, resampling techniques (e.g., using a model to produce training outputs for the same model) do not accomplish this. However, this type of synthetic data is possible to create via cross-model training data (e.g., using model 1 to produce data for model 2) or by using recombination or genetic algorithm approaches to create new training data. The outstanding question will be whether the new training data actually comprises new information that is useful for the prediction task at hand. Interestingly, there has been some evidence that training on multiple languages makes a model more effective, which is counterintuitive: one would not expect tokens produced by a German language model to be particularly useful as training data for an English model.
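Here is the promised sketch of the first objective (the function and the fitting step are invented for illustration): synthetic points generated from a known equation add observations but no new information, and a fit recovers the same relationship no matter how many points are generated.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_points(n):
    """Generate n 'synthetic' training points from the known rule y = 2x."""
    x = rng.uniform(0, 10, n)
    return x, 2 * x

for n in (3, 30, 30_000):
    x, y = synthetic_points(n)
    slope = np.polyfit(x, y, 1)[0]      # fit a line y = a*x + b
    print(f"n={n:>6}: fitted slope = {slope:.4f}")
# The fitted slope is 2.0 at every sample size: more points may help the
# optimization converge, but the informational content has not increased.
```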
Reasoning and AGI — on “Strawberry”
Up to now, the discussion has focused on data entropy largely for first-order assessments — this is a bit loose terminology, but it can be thought of as “single forward pass” assessments — there is a single function that maps the input to the output (regardless of how complex), and this is embodied in the weights of a model. The idea of a single forward pass function requires that the informational content of the training data be sufficient such that a gradient descent algorithm can learn it in one shot. For higher entropy — or higher informational content — topics, this would require both significant training data and significant model capacity. The question is how such data can be acquired and whether or not scaling model capacity is the right approach.
The approach of “reasoning” takes a different view on data entropy. It does not assume that training data must come from the outside world. In a manner rather similar to all of the Alpha models from DeepMind, reasoning models generate the high entropy datasets themselves and explore those data sets through search. A more detailed description of this was written last year around the Q* news9 with the main focus being on marrying language generation and search.
The basic idea is simple — if you want to find a way through a maze, you generate a massive number of possible ways and then figure out which one gets you successfully to the end. For the visually minded, in vector space, this can be thought of as exploring every “neighborhood cloud” of progressive vector additions where each vector is the embedding of a randomly generated sequence. OpenAI turned up the “temperature” on the models to ensure a degree of creativity in this exploration process. This generate/search paradigm creates very high degrees of data entropy, but also very large amounts of synthetic data which can be used to explore and learn hypothetical reasoning steps.
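A minimal sketch of the generate/search idea (the toy task, generator, and scorer below are stand-ins for illustration, not OpenAI's actual method): sample many candidate solution paths at some temperature, score each against a verifiable endpoint, and keep the successful trajectories as synthetic training signal.

```python
import random

random.seed(0)
TARGET = 24  # a verifiable endpoint: combine the numbers to reach 24

def generate_candidate(numbers, temperature):
    """Stand-in 'generator': randomly combine numbers with + and *.
    Higher temperature means more random exploration of operator choices."""
    expr = str(numbers[0])
    for n in numbers[1:]:
        op = random.choice(["+", "*"]) if random.random() < temperature else "+"
        expr = f"({expr} {op} {n})"
    return expr

def score(expr):
    """Verifiable reward: 1 if the expression evaluates to the target, else 0."""
    return 1 if eval(expr) == TARGET else 0  # safe here: we built expr ourselves

numbers = [2, 3, 4]
candidates = [generate_candidate(numbers, temperature=0.9) for _ in range(200)]
winners = sorted({c for c in candidates if score(c)})
print(f"{len(winners)} distinct successful path(s) out of {len(candidates)} candidates:")
for w in winners:
    print(" ", w)
# The successful (and failed) trajectories are themselves synthetic data that
# a policy can subsequently be trained on.
```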
Figure: A visual representation of vector embedding spaces for different types of models. The white dot is a specific answer (e.g., a mathematical proof) and the colored gradient space is a complex answer space with varying degrees of “correctness” (e.g., scientific hypotheses). (A) - the partial space carved out by the shaded area represents a low temperature model — meaning it can only produce outputs in a banded range. (B) - the fuller range indicates a higher temperature model — meaning it can explore more space. (C) - the space carved out by a model is representative of the model size and required training data. For example, in (C), the model is lower temperature and very large, meaning it can find a solution to arrive at the point, whereas in (A) it cannot. (D) — “reasoning” via model iteration. In this process, a low temperature model can take the outputs of one iteration and use them as input to the next. Given the scope spanned by the model’s capacity, this type of iteration can find paths to any point in the upper right quadrant. However, it cannot find solutions in other spaces. (E) — a higher temperature “reasoning” model can find multiple paths to many destinations in the solution space, including potentially useful ones for policy training, because each iteration expands the search space more generally. The dotted line may be a human reasoning path — there is no specific constraint that model reasoning paths be human mappable.
As with much of the history of deep learning, it is likely that successful reasoning steps are not human interpretable, or at least that there is no requirement for them to be. One of the more amusing examples of unexpected results from reinforcement learning in gameplay is below (timepoint 4:25 is particularly fun):
Limitations on Reasoning
While models are becoming increasingly impressive, their movement toward a generalized intelligence is still domain-bound at the moment. The fundamental reason is that the models are still bound by language or (probably soon) visual input. This paradigm is very strong in domains where words carry more concrete meaning. However, as people, our natural neural networks (brains) imbue words with far more connections than just other words. We connect words with feelings, biochemical responses, and an array of other interpretations that are not language based. Languages do, however, aim to embody these sentiments in a more generalized form, and words are buckets of cultural evolution. This is one of the reasons why translating words can be difficult. The reasoning capabilities of LLMs will therefore continue to increase to the extent that problems are bounded by a high degree of shared understanding; however, they may struggle to generalize to more complex sentiments.
One of the most interesting examples of this comes from the musician Jacob Collier, who on a live video stream improvised music at increasing degrees of ambiguity. It’s worth a watch because it is both amazing and a good example of where language starts to blur in certain domains of “intelligence.”
Reasoning on the Limits of Science
This same idea of language ambiguity relates to the ability of reasoning models, even with very high degrees of generation and search across very large landscapes of outcomes, to perform on tasks that do not have defined destinations. To date, all reinforcement learning requires a clear reward function that can be calculated regardless of how far away the reward destination is (for example, in complex games, it is very difficult to calculate the reward for every incremental step, but it is possible to calculate the reward of winning the game). It is also important that this reward can be calculated in silico in order to achieve the data scales needed.
This becomes difficult for domains where the reward function is either not clear or not calculable. Even to train “reasoning” models like o1, the destination that the model arrives at has to be definable and scorable. This is certainly the case for games and typically the case for things like programming, math, physics, and most things that are known facts, but it is more challenging in other domains. This was seen in the performance results for things like personal writing, where o1 actually scored below previous models.
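A hedged sketch of this constraint (the task format and checker are hypothetical): a terminal reward that can be computed in silico works when the destination is definable and scorable, and simply does not exist for open-ended outputs like personal writing.

```python
def terminal_reward(task, final_answer):
    """Reward computed only at the end of a reasoning trajectory."""
    if task["kind"] == "math":
        # Definable, scorable destination: compare against a known answer.
        return 1.0 if final_answer == task["expected"] else 0.0
    # Open-ended domains (personal writing, scientific hypotheses) have no
    # ground-truth checker; the reward would require human judgment or
    # real-world experiments, which do not scale in silico.
    raise ValueError(f"no in-silico reward for task kind: {task['kind']!r}")

print(terminal_reward({"kind": "math", "expected": 42}, 42))  # 1.0
print(terminal_reward({"kind": "math", "expected": 42}, 41))  # 0.0

try:
    terminal_reward({"kind": "personal_writing"}, "a heartfelt essay")
except ValueError as err:
    print(err)  # no in-silico reward for task kind: 'personal_writing'
```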
Figure Reference10
This also would challenge domains of science where there is not a clear endpoint. This does not negate the utility of such models in generating hypotheses or in simply synthesizing large amounts of information in ways that humans cannot, but it does still require that such knowledge be tested in the real world for both validation and training of effective reasoning policies. It is an interesting question, if one considers biological data from an entropic perspective, which datasets would be the most effective to generate. This is already being done in various labs, utilizing kernels to select optimal experimental designs.
However, to the extent that we consider intelligence the ability to learn, communicate, and execute tactile things in the real world, the integration of generation, search, and reinforcement toward specific or definable endpoints will likely make tremendous progress. Similarly, despite the challenges models face in ambiguous domains such as human preference, they will certainly serve as useful collaborators for humans in generating things that are more to human liking.
Concluding Thoughts
Data is not useful for data’s sake. One of the primary issues with many companies that have large repositories of data is that there is no informational coherence to be expected from many of those datasets. Or, if there might be, the variability (or entropy) in the data is so high that the quantity doesn’t compensate enough to make it useful. This is a central issue for companies aiming to make use of their data to make effective predictions about the future. This should also be a central concern for companies aiming to generate data, either experimentally or synthetically. Data is often not cheap.
The idea of understanding the priors of training data can be difficult and abstract. In practice, most will likely move forward with data generation pipelines without much regard for this issue. This is particularly prevalent in the biological sciences and drug discovery, and this approach will inevitably result in a tremendous misallocation of capital in the pursuit of suboptimal data. But the concept is not new—similar ideas pervade statistics and machine learning in the form of data leakage (e.g., training data leaking into test data) or data analysis independence (e.g., ensuring that your logic is not circular in some form). The fundamental issue underlying both of these topics is whether or not your data contains information worth learning and whether you are actually learning it.
We should make it worth it.
References
https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
https://www.biorxiv.org/content/10.1101/2024.04.22.590591v1
https://arcinstitute.org/news/blog/evo
https://www.science.org/doi/10.1126/science.adg7492
https://www.nature.com/articles/s41586-023-06600-9
To Be Or Not To Be Synthetic: Data In Bio x ML
Summary: Synthetic data is becoming more central to the training and development of deep neural networks. This article describes some of the intuition around its utility in different domains.
https://openai.com/index/learning-to-reason-with-llms/