Preamble and TL;DR
This article is about virtual cells and the prospect of being able to simulate biology at the molecular, cellular, organ, and perhaps organism level.
Like other articles in this Substack, it will be semi-technical, but largely oriented toward developing an intuition for the dimensions of these challenges and the approaches to solving them. A few of the key terms and concepts that will be introduced are the following:
Simulation is the integration of predictions over time in physical systems. This becomes important when we consider the state of current prediction models and the acquisition of data, particularly in biology. For example, most of the data we have is endpoint data, not time-course data. The time delta and the number of processes and variables that contribute to the end state have a direct relationship to the training data required to produce a model that would simulate it. Small time steps and more bounded tasks with direct relationships are more tractable.
The basis of simulation is calculation. There are three broad approaches to calculation -- theoretical symbolic, agnostic trained (e.g., deep learning), and neurosymbolic. A feature of calculation that appears to be real is known as computational irreducibility (see the footnote at the end). This is related to a concept in information theory called Kolmogorov complexity, which is the length of the shortest program that can reproduce a given information state. The gist of these ideas is that for many complex systems, we have no apparent ability to predict a future state apart from brute calculation.
The concept of computational irreducibility is related to, but separate from, chaos theory. The former can be based on fully deterministic rules and fully determined starting states, yet still be predictable only by brute calculation. Conway's Game of Life and cellular automata are examples of this, and it currently appears that many very deep neural networks also have this property (i.e., there is no shortcut to the output other than performing the full calculation). The latter is based primarily on the idea that we cannot measure states with arbitrary precision, and small changes can propagate into large effects. Both apply to biological systems.
Simulation is closely related to the concept of causality, which is a major focus, especially in drug development and medicine. Causality is in turn closely related to the idea of dimensionality reduction in complex systems, for example, the ability to identify focal nodes in networks and control systems.
Simulation is often proxied by good mimicry -- an example is the simulation of intelligence by LLMs. In the case of mimicry, causality isn't required. The field of AI interpretability aims to bridge this gap. For many practical purposes, particularly in biology where simulations will need to be validated in the real world, the difference between good mimicry and a causal simulation matters less, and models that lack a mechanistic interpretation but produce high-quality simulations will still be very useful. This is less true in domains like medicine, where important and irrevocable decisions may be made based on the results of simulations.
This article will expand on these ideas and some of the practical considerations for the development of biological simulators. The article is not intended to be a review article. There are many groups and publications in this space. Please feel free to reach out directly for more details.
Introduction -- Simulacra and Simulation
The direction of technical progress almost inexorably leads to simulation. There is even an argument that our current experience of existence is a simulation. At its root, simulation is about the ability to accurately predict the future evolution of a system in time.
The most advanced computer chips in the world are designed in simulation before a single one is ever taped out, pilots are trained for complex scenarios in simulation, and the most advanced AI-driven robots are trained in simulations good enough to be zero-shot transferred to the real world.
Simulating complex systems has largely been a feat of engineering. In some cases, the model systems are governed by a set of symbolic equations, such as the Navier-Stokes or Maxwell equations for modeling fluid flow or electromagnetism, respectively; in other cases, the systems are modeled by substantially more complex functions, such as those embodied by deep neural networks when simulating natural language conversation.
At the center of the ability to simulate is the ability to calculate. A model takes a given state and calculates the next step in discrete time. For example, when an LLM simulates writing (e.g., predicting the next word), it is performing a calculation, albeit one of very large proportions, based on what it has learned during training about the statistical context of words. It repeats this calculation for each new word, stepping forward in sequence time. Similar approaches are used for complex simulations like fluid flow over an airplane wing: discrete calculations at each point in space, stepped forward one small delta of time at a time.
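To make the stepping idea concrete, here is a minimal sketch in Python of the generic loop that underlies most discrete-time simulation. The logistic-growth example and all parameter values are placeholders chosen purely for illustration.

```python
import numpy as np

def simulate(f, x0, dt, n_steps):
    """Generic forward-Euler simulator: repeatedly apply x <- x + f(x) * dt."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        x = x + f(x) * dt          # one discrete calculation per time step
        trajectory.append(x.copy())
    return np.array(trajectory)

# Illustrative example: logistic growth, dx/dt = r * x * (1 - x / K)
r, K = 0.5, 100.0
traj = simulate(lambda x: r * x * (1 - x / K), x0=[1.0], dt=0.1, n_steps=200)
```

Everything that follows -- whether the update rule f comes from theory or from a trained network -- is a variation on this loop; the hard parts are where f comes from and how well the state can be measured.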
To date, nearly all effective forms of simulation in the sciences have been rooted in systems with foundations in physics or stochastic processes. We can simulate fluid flow because there are symbolic equations that characterize it (despite it being computationally intensive). We can simulate financial markets because, while they lack deterministic equations, they exhibit reasonably well-defined statistics (save for the occasional black swan). Most simulations, however, tend to break down at sufficient scale or complexity, largely as a result of the principles of chaos -- notably, the inability to measure a system's state with arbitrarily high resolution.
The gap from the micro to the macro has sometimes been bridged with statistical physics, such as Boltzmann's treatment of gases or quantum physics, or with more complex machine learning models, for example in predicting the weather. An important feature of these systems is that simulations do not necessarily need to be based on the lowest-level features. Simulations are models like any other, in that they are valid relative to the scales of their measurements. We do not need to simulate at the quantum physics level to model the trajectory of a baseball. In the biological sciences, this principle applies equally well, as represented by homeostasis and stable oscillators.
Measurement and Inductive Biases
The ability to simulate a complex system depends on the degree to which the system's properties are understood. In some cases, such as in the physical sciences, we aim to compress our understanding of physical systems into equations like E=mc^2. These types of equations represent concrete inductive biases that directly transform one measurement into another.
In more complex physical systems, we model the dynamical relationships between measurable entities with differential equations. These relationships do not necessarily provide any details about the underlying state, only about how the variables change with respect to one another -- and typically this expands the number of variables that need to be measured in order to develop a model simulation. All equations that have their origins in theoretical physics take this form and, provided that a system is sufficiently well described by the variables that can be measured, these equations can be used to compute simulations.
Most real-world systems, however, are much messier. We do not necessarily know all of the inputs or all of the relationships that influence an outcome. In macroeconomics, for example, we routinely fail to find accurate simulations. This is not just because we cannot take all the measurements of all the relevant inputs at all times, but also because we can't fully describe the relationships between them. For cases like this, a field has grown up around physics-informed machine learning -- a class of models that uses the tools of machine learning, including trainable neural networks, together with symbolic equation assembly to "discover" the equations that can model an underlying phenomenon.
An objective of this type of learning is to minimize the complexity of the resulting equation, but it does not necessarily lead to the concise equations that are derived from theory. An additional benefit of this type of learning is that it learns a model solely with respect to the input data provided. If, for example, a system has variables A and B that affect the outcome, but only B is measurable, this approach would learn a model solely with respect to B which, if the model is accurate, implicitly incorporates the information in A (or the implied relationship of A and B). This implicit information is most useful in relation to A's dependence on B. In practice, it is very rare that complete dependence or independence is achieved in complex systems, but the degree to which such a relationship exists determines how well the implicit information generalizes. We will get to this point later when discussing the data challenges in multi-scale biological models -- one simple example being the ability to predict one -omic from another -omic.
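As a rough illustration of the equation-discovery idea, the sketch below implements a toy version of sequentially thresholded least squares (the approach behind methods like SINDy): fit the observed rate of change as a sparse combination of candidate library terms, pruning small coefficients so the resulting equation stays simple. The "true" system, the library, and the threshold are invented purely for the example.

```python
import numpy as np

# Toy equation discovery: the hidden dynamics are dx/dt = -2x + 0.5x^2.
rng = np.random.default_rng(1)
x = rng.uniform(0.1, 2.0, size=200)
dxdt = -2.0 * x + 0.5 * x**2 + rng.normal(scale=0.01, size=x.shape)

# Library of candidate terms the model is allowed to assemble into an equation.
library = np.column_stack([np.ones_like(x), x, x**2, x**3])
names = ["1", "x", "x^2", "x^3"]

# Sequentially thresholded least squares: prune small coefficients and refit,
# so the discovered equation stays as simple as the data allows.
coefs, *_ = np.linalg.lstsq(library, dxdt, rcond=None)
for _ in range(10):
    small = np.abs(coefs) < 0.1
    coefs[small] = 0.0
    big = ~small
    if big.any():
        coefs[big], *_ = np.linalg.lstsq(library[:, big], dxdt, rcond=None)

print({n: round(c, 3) for n, c in zip(names, coefs) if c != 0.0})
# Should recover approximately {"x": -2.0, "x^2": 0.5}
```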
The key takeaway is that trainable neural networks can implicitly incorporate relationships between underlying data points that are not explicitly measurable. The greater the degree of underlying dynamical relationships between input data, the less measured data is required to train a useful model. If the underlying relationships are complex and the measurable data points cover only a small fraction of the overall input space, the amount of data needed to train a performant model will be significant, and there should be little expectation that a model will converge at all, depending on the relative importance of the data types that can be measured.
Measurement, inductive biases, and dynamical relationships are essential considerations when developing models that simulate complex systems like those in biology.
Simulating Biology
A grand ambition in biology is to be able to simulate it. This capability would replace a great deal of the time and labor that goes into "discovery" and vastly accelerate the development of a whole host of new therapeutics and biologically driven products. Developments in deep learning over the last several years -- the success of AlphaFold on protein folding and the host of models developed since -- have raised exciting possibilities of simulating greater and greater portions of the biological world.
This is not a new concept in computational biology -- the field of systems biology has been at this for decades, using symbolic differential equations and linear algebra to simulate metabolic pathways, genetic circuits, and more global phenomena like cell growth rates, bioproduction, etc. A distinctive feature of these models is that they have been limited in scope (e.g., simulating the expression of a small set of genes in a synthetic genetic circuit in a simple model organism under controlled conditions). The scale is analogous to early NLP, when researchers built models with rule-based systems or hidden Markov models.
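For a flavor of what these bounded, symbolic simulations look like, here is a minimal sketch of the repressilator -- a synthetic three-gene circuit in which each protein represses transcription of the next -- integrated with SciPy. The parameter values are illustrative defaults, not fit to any particular experiment.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Dimensionless repressilator model: three mRNAs (m) and three proteins (p).
alpha, alpha0, beta, n = 216.0, 0.2, 5.0, 2.0   # illustrative parameters

def repressilator(t, y):
    m1, m2, m3, p1, p2, p3 = y
    dm1 = -m1 + alpha / (1 + p3**n) + alpha0   # gene 1 repressed by protein 3
    dm2 = -m2 + alpha / (1 + p1**n) + alpha0   # gene 2 repressed by protein 1
    dm3 = -m3 + alpha / (1 + p2**n) + alpha0   # gene 3 repressed by protein 2
    dp1 = -beta * (p1 - m1)
    dp2 = -beta * (p2 - m2)
    dp3 = -beta * (p3 - m3)
    return [dm1, dm2, dm3, dp1, dp2, dp3]

sol = solve_ivp(repressilator, t_span=(0, 100), y0=[1, 2, 3, 0, 0, 0])
# sol.t and sol.y hold the oscillating mRNA/protein trajectories -- a bounded,
# well-characterized circuit, which is exactly the regime where this works.
```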
The impressive results in NLP from massively scaling deep neural network architectures and compute, which have produced the current classes of LLMs, have spawned a deep curiosity and excitement about a similar jump in capabilities in the biological world -- and, more ambitiously, the ability to eventually create virtual cells and simulate their activity.
This is a tall order, but one that has a path, albeit one that is likely much longer than most would anticipate, particularly those accustomed to working in silicon.
A Multi-Scale Primer
Let's start with some basic scale orientation -- this is important because when we talk about medicine, biology, or simulations, there are very different cause/effect mappings, and most practitioners are accustomed to operating at and around one slice of this hierarchy. Most of what we consider the science of Western medicine operates at the molecular and, increasingly, the cellular level. This is where most pharmaceutical interventions act. The cell is the fundamental unit of life, so it makes sense to focus on it as the irreducible entity.
Projects like the Human Cell Atlas have the goal of producing molecular and phenotypic maps of all cells in the human body. There are around 300 different cell types and close to 40 trillion individual cells in an adult human. How these cells work is most often characterized by their molecular expression profiles. Of the 25,000 or so genes in the genome, any given cell expresses 20-40% of them, and it is often the very long tail of cell type-specific genes that defines the unique function of a cell. Some of this classification is clearly defined, but in general, the behavior of cells and their molecular composition exist on a continuum. Even to the extent that we define "fully differentiated" cells, it is almost certainly the case that these cells maintain a dynamic balance of compositions, albeit within more tightly bounded constraints for specific cell types.
The network balances of cellular components for any given cell are a complex mapping of a vast array of variables including, but not limited to, the status of specific chemistries on the genome, the activities of proteins and other macromolecules (arising from both their concentrations and compositions), the presence of a plethora of small molecules and metabolic products, spatial localization, and the associated physics of diffusion, free energy, etc. And this is just the context within a cell -- if we expand one order of magnitude in scale outward, we incorporate the immediate extracellular environment with its features of intercellular communication, physical dynamics, and so forth.
The takeaway is that even at the cell, the most fundamental unit, the system is already rather intractable to model from a first-principles perspective. This is, however, where most science focuses, because it is the closest intersection where interventions can be causally (or mechanistically) described and experiments can be as controlled as possible.
On the more macro end of the spectrum, we have biology as measured by organ- or organism-level features. Historically, this is where the practice of medicine has resided because of the limitations of measurability. We can measure blood pressure, weight, and an array of biomarkers, or take images to parse for visible signs of dysfunction. There are intersections between these two domains: MRI scans, for example, take macro images of micro-chemical environments via the magnetic behavior of hydrogen atoms, and biophysics connects physical forces to molecular dynamics, as in the genetic responses of endothelial cells to the physical strain of high blood pressure, among others. However, these can all be viewed as slices of the overall system -- leaving a great many of the underlying variables unaccounted for.
Biology and Economics -- A Brief Segue
One of the more intuitive ways of thinking about modeling biology is by analogy to economics. This also provides some intuition as to how well we should expect to be able to simulate biology. The two domains are rather similar -- they both operate on multiple scales but with a common small-scale agent, namely the cell or the individual. Microeconomics is akin to molecular/cellular biology and macroeconomics to medicine. In each case, there are tools, albeit limited ones, to understand the nature of the activity of each agent, and a great deal of effort has gone into connecting measurements to these agents. In biology, this is very much related to initiatives like the Human Cell Atlas; in economics, we have seen the obsessive drive of the tech industry toward tracking and attaching every piece of data to an individual user. In each case, the effort is to characterize the agent and its environment as much as possible.
The overarching goal in biology, however, is not to build a model of a specific agent, but rather to understand the evolution of a system. In the case of economics, we sample macro factors like the CPI, inflation, or yield curves, along with slightly more detailed measurements such as sector-specific job reports, and try to predict how to respond to or influence them. I'll leave it to the reader to assess how successful they believe this effort has been in managing macroeconomics.
The parallels in biology are straightforward. Changing LIBOR is like taking a drug. Both operate on a complex system, but the micro/macro mechanistic causality is tenuous at best. This is largely why medicine has been an observational discipline and is still characterized as being "practiced". Importantly, as noted earlier, system-level behavior does not necessarily require detailed individual-level measurements, because the overall system operates in a lower dimension -- for example, as a stable oscillator. We can somewhat see this in the "business cycle" in economics and the physiological "set point" in biology. These system-level features can smooth out much of the underlying noise -- until they don't. We see this in black swan economic events and in diseases like cancer. Under those circumstances, the concepts of lower-dimensional modeling (and any simulators we may have relied on) go out the window.
Fishing for Relationships in a Sea of Dimensionality
A large portion of the current AI world has been focused on extracting insights from information that already exists in the world. This can be thought of as building models that look for input --> output mappings for any input/output pairing. We have some data X and some observation Y, and we want to determine how best to understand the features of X such that we can either infer or generate new Y. Deep neural networks search very large spaces of variables, equations, and assemblies to put together the closest equation that translates X to Y. These equations can be billions of terms long, but they all have the same underlying purpose.
The underlying conviction of the field is that more data and more terms will yield better models. This seems intuitively correct, notwithstanding all the technical details of the loss landscape mapped out by the data and parameters and of how to effectively navigate it to convergence. The approaches in biology have largely been to combine and stack increasingly multimodal datasets -- e.g., genomics, proteomics, imaging, etc. -- and train models to perform certain tasks. The implicit assumption is the same: if we don't achieve convergence, just increase the number of layers or add more data. There have been high-profile publications about popular deep learning architectures, mostly based on transformers, applied to dataset X or Y that produce outputs that can be interpreted as biologically meaningful.
A great deal of the hype around AI in Bio has flowed from the hype in the LLM space around mysterious topics like "AGI". The allure of the "what-if" is providing fuel for overreaching claims. For example, it is a well-known phenomenon that when analyzing data for correlations, the more data types you analyze, the greater the likelihood of finding a spurious correlation by chance alone. Adding more parameters increases the likelihood of overfitting (as discussed elsewhere). This does not, however, dissuade high-profile institutions and companies from promoting these capabilities.
In the context of simulations, the value of finding relationships is largely to reduce the dimensionality of the problem. This is used, for example, in cell atlases to impute missing data and integrate datasets. Autoencoders are used to reduce the dimensionality of cell states, and joint embeddings are used to map one variable type to another. These approaches can be used to "complete" an experimental dataset, but more importantly, they point to the ability to better simulate systems while reducing the number of variables that must be measured.
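As a sketch of the dimensionality-reduction step, the toy autoencoder below compresses an expression profile into a low-dimensional cell-state embedding and decodes it back to "complete" the original features. The layer sizes, latent dimension, and random stand-in data are placeholders, not a recommendation for any real atlas pipeline.

```python
import torch
import torch.nn as nn

class CellAutoencoder(nn.Module):
    """Compress an expression profile into a low-dimensional cell-state embedding."""
    def __init__(self, n_genes: int = 2000, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_genes),
        )

    def forward(self, x):
        z = self.encoder(x)          # low-dimensional cell state
        return self.decoder(z), z    # reconstruction + embedding

model = CellAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
expression = torch.randn(64, 2000)   # stand-in for a normalized count matrix
recon, embedding = model(expression)
loss = nn.functional.mse_loss(recon, expression)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```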
Simulation Is About Time, Not Just Space
The concept of building virtual cells or simulations of biology is about how to model their progression in time. With few exceptions, AI models of biology are focused on making predictions in space. In this usage, space can mean either physical space or parameter space (such as predicting gene expression under a certain circumstance). The goal of producing or predicting a "representation" is just a stand-in for the real thing -- a simulacrum.
If we draw a parallel to LLMs, a prediction in space is akin to a bidirectional encoder model, whereas a prediction in time is a sequence decoder. In one version we are filling in a missing word in the context of other words; in the other, we are generating a continuation.
The central challenge with prediction in time is one of predicting change. Three primary technical approaches have been used in simulations:
1. The theory approach -- defining mechanistic equations and solving them discretely forward in time. As noted, early work in metabolic engineering was of this flavor, as is much of physics simulation. This approach is computationally intractable at the whole (eukaryotic) cell level, though it has, to an extent, been successful in simpler systems where the networks and kinetics are better defined.
2. The agnostic deep neural network approach -- a very challenging approach, similar to how video transformers have been developed. In the domain of vision, this builds video based on whether a frame-by-frame decomposition represents a "good continuation". The dimensionality of this problem is significantly higher than for language sequence models because the continuation space is very large relative to the model size, and capabilities such as object permanence and some semblance of world-model physics must be learned with few to no priors. While agnostic approaches are quite successful in space (e.g., images or endpoint assays), they are much more challenging in time.
3. A combination of 1 and 2 -- a middle path that builds measurable priors into large-scale neural networks. This is related to the domain of physics-informed deep learning. An example is combining RNA velocity estimates, derived from the proportions of spliced and unspliced RNA, with transcriptomic measurements, a combination that may enable a small delta step forward in time (a sketch of this idea follows below).
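To make the third approach concrete, here is a minimal sketch of the standard RNA velocity splicing model, under which ds/dt = beta*u - gamma*s: unspliced (u) and spliced (s) counts give an estimate of the instantaneous rate of change of expression, which supports a small Euler step forward in time. The rate constants and time step below are placeholders; in practice they are fit per gene from the data.

```python
import numpy as np

def velocity_step(s, u, beta=1.0, gamma=0.5, dt=0.1):
    """Extrapolate expression one small step forward using RNA velocity."""
    velocity = beta * u - gamma * s      # estimated ds/dt for each gene
    return s + velocity * dt             # Euler step: the "small delta in time"

s = np.array([5.0, 2.0, 0.5])   # spliced counts for three genes (toy values)
u = np.array([1.0, 3.0, 0.1])   # unspliced counts for the same genes
s_next = velocity_step(s, u)
```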
A central question for any simulation is the desired resolution. For example, if you want to simulate whether a cell would die when exposed to a toxin, that could be relatively straightforward -- though it is less a simulation than an endpoint classification. If, however, you wanted to simulate the mechanism of cell death, a trans-differentiation trajectory of cell states, or the impact of an arbitrary perturbation on a complex cell state, the challenge is notably harder. It is both an issue of the resolution of data acquisition, particularly in time, and of the diversity of the manifolds of cell states that any given cell may be traversing (and whether these are even continuous).
Consider the analogy of a cell as a person traversing a geography spanning several countries. As a brief aside, in my 20s I spent the better part of a year traveling overland from Europe to India, so the analogy is apropos. Many observations we take in travel and science are point-to-point -- we fly in and fly out. Overland, however, the gradients are much more evident.
Some approaches are tractable depending on the data type -- continuous live-cell imaging, for example, can be used essentially as video-transformer training data -- but practical challenges remain, such as the fact that 2D cell cultures do not represent the 3D environment that exists in tissues. And the major issue that remains is data sparsity relative to the potential space.
In the limit, as the time delta goes to zero, a simulation becomes continuous prediction.
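Stated as a formula (the standard forward-Euler relationship, included here only for orientation):

```latex
x(t + \Delta t) \approx x(t) + f\big(x(t)\big)\,\Delta t
\qquad \longrightarrow \qquad
\frac{dx}{dt} = f(x) \quad \text{as } \Delta t \to 0
```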
It's Still a Data Issue
Returning to the idea of deep learning as a function mapping X --> f(X) --> Y, we should discuss two central ideas:
1. If we are constructing a virtual cell or a virtual simulator, we make some assumptions about the underlying function f(X) and its consistency across the data we are using to train a model. The consistency of this model rests on two factors: the distribution and diversity of the training data, and the number of transitions that implicitly separate X from Y. The latter point is just saying that f(X) is most likely a composite function. For example, if X were a DNA sequence and Y were expression, there is a relatively close mapping between the two, with few intervening states or measurements necessary; whereas if X were DNA and Y were MRI images, the mapping would be considerably more difficult.
2. The degree of mutual dependence of any measurements that may be useful for bridging predictive models or simulations. Cell states are certainly not i.i.d., and this fact is useful in establishing relevant priors and imputing information that may be very useful for simulations. This is currently done most frequently by establishing joint embedding spaces for paired modalities and building embedding atlases that can be used to regenerate more comprehensive features with high fidelity. In general, these are promising approaches that take advantage of the fact that, despite being extremely complex, biology largely operates on a lower-dimensional manifold.
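A joint embedding for paired modalities can be sketched as two encoders trained with a contrastive objective, so that measurements from the same cell land near each other in a shared space. The encoders, dimensions, and random batch below are placeholders (say, RNA and surface-protein panels measured on the same cells); real pipelines add substantially more structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two small towers mapping each modality into a shared 64-dimensional space.
rna_encoder = nn.Sequential(nn.Linear(2000, 256), nn.ReLU(), nn.Linear(256, 64))
protein_encoder = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64))

def contrastive_loss(rna, protein, temperature=0.1):
    """CLIP-style loss: the i-th RNA profile should match the i-th protein profile."""
    z_rna = F.normalize(rna_encoder(rna), dim=-1)
    z_prot = F.normalize(protein_encoder(protein), dim=-1)
    logits = z_rna @ z_prot.T / temperature      # cell-by-cell similarity matrix
    targets = torch.arange(len(rna))             # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

rna_batch, protein_batch = torch.randn(32, 2000), torch.randn(32, 100)
loss = contrastive_loss(rna_batch, protein_batch)
loss.backward()   # gradients flow into both encoders
```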
Taking these into consideration, we might assume that it would be more tractable to develop a series of smaller models and stitch them together than it would be to develop one macro model. However, we can’t ignore the origins of the data.
One of the unexpected lessons of Eroom's Law is that while the technologies developed over the last several decades have exponentially increased our ability to generate and acquire data in biology, our ability to meaningfully translate that information into the tools of medicine has gone in the opposite direction (at least from an economic productivity standpoint). There is no guarantee that increasing the amount of data and compute will yield models of biology that meaningfully improve.
Drawing a parallel between LLMs trained on natural language and models trained on biological data is not accurate. The evidence points to the idea that while we have gathered increasing amounts of data in biology, the relevance per data point has diminished faster than the throughput has increased. The majority of cellular data is generated in a very small number of cell lines that are not representative of human biology, either in cell type or in underlying biology. Despite the fact that there are 80M+ cells in the CellXGene database, the median cell count is 10k. There is an extremely long tail. We have generated data for data's sake, but the value per unit has not kept pace. The analogy to LLMs would be if 80% of the Internet's content were generated by people from San Francisco. In fact, we do see a version of this at larger scale, where LLM performance is biased toward English and the content of the Internet; the difference is that the consumers of LLMs and the producers of their training content overlap heavily.
In contrast, I am not a K562 human.
The Future of Virtual Cells and Simulation
This article has discussed a number of the intuitions behind the development of virtual cells and simulators of biological systems. The future of this space rests significantly on definitions. If we want to create a virtual cell that models the metabolic production of a compound, that has largely been achieved; if we want to create a virtual cell model of the evolution of an immune response or of cell state transitions in disease, that remains a considerable outstanding challenge.
To date, the majority of deep learning approaches to biology have been simulacra, not simulations. This is the entire concept of a "representation" or "embedding". These representations can be used to make predictions, notably for classification tasks. In some cases, particularly for graph-based models -- which, on account of their specific structure and training, encode more explicit relationships between variables -- there have been advances in interpretability and causality, which are steps in the right direction toward simulation.
The drive to develop simulations and virtual cells should be led by utility. The utility of metabolic pathway simulation for simple cells has been achieved through the ability to measure the relevant variables, the well-established network structure of metabolic pathways, and clear time-domain kinetic relationships. The development of virtual cell models for more complex systems should be driven by the same parameters -- namely, measurement, network composition, and dynamics. Depending on the scope of the task, however, this can rapidly become intractable on many fronts. The tools of deep learning are most applicable in developing ways to reduce the dimensionality of the challenge by modeling high-order patterns and relationships. Extending the toolkit to time-domain data will be an essential component of building better models.
It remains important to distinguish between the goals of building biological simulators and building multi-scale foundation models. These two ideas are often discussed in a similar context -- e.g., we will be able to "model" biology. While in the limit we could say that the two converge, for the foreseeable future they are very distinct. Multi-scale biological simulators (extending virtual cells to organisms) are likely an intractable problem, at least as far as we can see. This is as much a data acquisition challenge as it is a network modeling challenge. However, multi-scale classifiers, representation models, mappings, etc., that do not require dynamic considerations are likely to become increasingly useful through large-scale data generation, integration, and advances in model architectures.
It is exciting to consider what is becoming possible at the intersection of deep learning (including neurosymbolic approaches) and data platforms in the life sciences. In the limit, the question of building simulators is almost a philosophical one. We currently face practical limitations from information theory and from our inability to measure at arbitrary resolution; however, we also benefit from the useful fact that most complex systems exhibit properties that smooth out noise over useful scales.
Building virtual cells and models of biology is a canonical example of the maxim: all models are wrong, but some are useful.
Additional Reading
This article is not intended to be an exhaustive review article. There are multiple academic labs working in this space. If you are interested in discussing this, please reach out to me directly. Two specific resources that I have found interesting are below:
- Simulacra and Simulation by Jean Baudrillard -- one of the primary references for the Matrix series, recommended reading for an understanding of representations
- YouTube series by Steve Brunton on physics-informed deep learning -- I have found this series to be particularly useful and accessible for understanding how equations and deep learning can be integrated
Image credit: https://today.uconn.edu/2017/10/uconn-health-researchers-visualize-life-silico/#
Footnote: The concept of theoretical computational irreducibility seems in some sense intuitively incorrect for any system that exhibits higher-order structure. In fact, we see this in LLMs, where similar performance can be achieved with increasingly smaller models, indicating that the computation can be reduced or compressed. At some point this hits a limit, but that is a compression limit, not a feature of the original system. Practical computational irreducibility could very well hold in the real world on account of measurement resolution limitations. Several fields and techniques relate to this problem, including encryption, hashing, and the outstanding P vs. NP problem. It remains an open and intriguing question whether deep learning, particularly for equation discovery, will yield increasing advances in this space. Systems with higher-order features, like cells, mostly exhibit attractors and other stable patterns of homeostasis that make this limitation potentially less of an issue.