The term “foundation model” has been increasingly applied to biology. It’s an enticing idea but would benefit from some specification.
A foundation model is generally characterized by three features:
1) It has a specific and measurable output or objective
2) It is intended to span the space of features that contribute to that objective
3) It is based on the existence of a common (though perhaps undefined) generator function
For example, in the common foundation models that exist now – notably the large language models and the image generation models – the criteria are satisfied as follows:
1) For LLMs: the objective is next-word prediction; the feature space is spanned by training on the internet; and the implied generator is an underlying statistical grammar (perhaps a Chomskyan grammar) that produces the outputs. (A minimal sketch of the next-word objective follows this list.)
2) For image generation: the objective is correct correspondence between images and their labels; the feature space is spanned by the internet; and the implied generator is human annotation as the source of truth.
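To make the first example concrete, here is a minimal, hedged sketch of the next-word-prediction objective. The vocabulary size, dimensions, and placeholder data are all invented for illustration, and the transformer a real LLM would apply between embedding and prediction is omitted.

```python
import torch
import torch.nn as nn

# Minimal next-word-prediction objective: given a token sequence, predict
# the token that follows each position. All sizes are invented placeholders.
vocab_size, dim = 1000, 64
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 33))   # batch of placeholder sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets are inputs shifted by one

hidden = embed(inputs)              # a real LLM would apply a transformer here
logits = lm_head(hidden)            # scores over the vocabulary at each position
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()                     # "predict the next word" as an optimization
```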
With these features, one can think of a foundation model as a net (or manifold) that fully covers a feature space and relates the variables. The net can be dense or sparse depending on the dimensionality and the size of the training set. The denser the net, the better the generality; the sparser the net, the more transfer-learning fine-tuning may be needed to build useful models in specific areas.
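As a loose illustration of the sparse-net case, here is a minimal transfer-learning sketch: a hypothetical pretrained encoder (standing in for the foundation model) is frozen, and only a small task-specific head is tuned on domain data. All layer sizes and names are illustrative, not from any particular model.

```python
import torch
import torch.nn as nn

# Hypothetical pretrained encoder standing in for a "sparse net" foundation
# model: it covers the feature space only coarsely, so we adapt it per domain.
pretrained_encoder = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 64),
)

# Freeze the foundation weights; the general structure is kept as-is.
for p in pretrained_encoder.parameters():
    p.requires_grad = False

# A small task-specific head is all that gets trained on the new domain.
task_head = nn.Linear(64, 10)
optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fine_tune_step(x, y):
    """One transfer-learning step: frozen encoder, trainable head."""
    with torch.no_grad():
        features = pretrained_encoder(x)   # reuse the general "net"
    logits = task_head(features)           # adapt to the specific domain
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```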
To the extent that the foundation model concept has been applied to biology, it has had the above features. The specificity of objectives and the data sampling are both human-defined, but the underlying generator function is inherent to the system being modeled and is the most central of the three.
The successful cases of generalized foundation models in biology are molecular – e.g., AlphaFold and its variants, chemistry modeling, and so on. The implied underlying generator is physics. It is not necessary that this generator be explicitly defined, but it does govern behavior. This is analogous to numerically computing solutions to partial differential equations that have no symbolic solution (and gradient-based training via backpropagation is, in effect, the same kind of numerical approximation).
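The PDE analogy can be made concrete with a textbook scheme: the sketch below numerically integrates the 1D heat equation by iterating a local update rule, never consulting a symbolic solution. The constants are arbitrary illustrative choices.

```python
import numpy as np

# Explicit finite-difference solution of the 1D heat equation
#   du/dt = alpha * d2u/dx2
# No closed-form solution is used; the solver simply iterates a local
# update rule, much as gradient descent iterates toward a fitted function.
alpha = 0.01               # diffusivity
nx, nt = 100, 500          # spatial points, time steps
dx = 1.0 / (nx - 1)
dt = 0.4 * dx**2 / alpha   # within the explicit-scheme stability bound

u = np.zeros(nx)
u[nx // 2] = 1.0           # initial condition: a spike of heat in the middle

for _ in range(nt):
    # Second spatial derivative via central differences (interior points).
    d2u = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
    u[1:-1] += dt * alpha * d2u   # forward-Euler time step
    u[0] = u[-1] = 0.0            # fixed (Dirichlet) boundaries

print(u.max())  # the spike has diffused outward
```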
In more complex systems such as cells, the underlying generator functions become less generalizable. For example, the generator of gene expression consists of a collection of both "cell state" and environmental factors, many of which are difficult to measure or define. As such, any deep network that aims to map inputs to outputs will learn from training data that is specific to the underlying generator function. Many research papers aim to do this in biology via joint embeddings (for example, across drug interactions, imaging, gene expression, etc.); they can demonstrate improvement within the scope of the publication, but the gains do not necessarily extrapolate beyond it.
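For concreteness, here is a schematic sketch of the joint-embedding pattern such papers use, with invented input shapes: two modality-specific encoders (say, gene expression and imaging) are trained with a contrastive, InfoNCE-style loss so that paired measurements of the same cells land near each other in a shared space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two modality-specific encoders mapping into one shared embedding space.
# Input sizes are invented placeholders for, e.g., expression vectors and
# flattened image features.
expr_encoder = nn.Sequential(nn.Linear(2000, 256), nn.ReLU(), nn.Linear(256, 64))
img_encoder = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(), nn.Linear(256, 64))

def contrastive_loss(z_expr, z_img, temperature=0.1):
    """InfoNCE-style loss: each expression profile should match its own image."""
    z_expr = F.normalize(z_expr, dim=1)
    z_img = F.normalize(z_img, dim=1)
    logits = z_expr @ z_img.T / temperature   # pairwise similarities
    targets = torch.arange(len(z_expr))       # diagonal entries = true pairs
    return F.cross_entropy(logits, targets)

# One training step on a batch of paired measurements from the same cells.
expr_batch = torch.randn(32, 2000)   # placeholder gene-expression profiles
img_batch = torch.randn(32, 4096)    # placeholder imaging features
loss = contrastive_loss(expr_encoder(expr_batch), img_encoder(img_batch))
loss.backward()
```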
In any sufficiently complex system, if one wants to create a foundation model, the two levers to consider are 1) at what level we presume a common generator function, and 2) given that level, how much data is required to cover the feature space below that level to make a useful model.
We are familiar with this concept in the tech world, where companies aim to link all activity to a single user (the generator) and then cluster those generators (i.e., people) into common groups to predict shared behaviors. The approach was not to set out to build a "foundation model," nor would that likely have been useful or successful.
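A minimal sketch of that cluster-the-generators approach, with invented activity features standing in for whatever a company actually tracks:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one user (one "generator"): an invented activity vector such as
# counts of page views, purchases, and session lengths.
rng = np.random.default_rng(0)
user_activity = rng.poisson(lam=3.0, size=(1000, 12)).astype(float)

# Cluster the generators into groups expected to share behavior.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
segments = kmeans.fit_predict(user_activity)

# Predictions are then made per segment: a segment's mean activity serves as
# the expected behavior for a new user assigned to it.
segment_profiles = np.vstack([
    user_activity[segments == k].mean(axis=0) for k in range(8)
])
print(segment_profiles.shape)  # (8, 12): one behavioral profile per segment
```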
The same will apply to complex systems in biology. While the generator functions are closer to an objective model than human psychology or behavior is, the data is far less abundant at this time.
Practical approaches to developing foundation models in complex systems should be considered in a more constructionist format – one where useful models are developed in specific domains and then expanded to others.
We should not expect that “biology” will be solved by foundation models, but we should expect deep learning and data generation to continue to develop increasingly effective models of important complex phenomena.
Thanks for this thoughtful piece, Jason. Could you provide an example of the "constructionist format" of foundational model development referenced below? I want to better understand what you mean by domain-specific models being expanded to other domains.
"Practical approaches to developing foundational models in complex systems should be considered more in a constructionist format – where useful models are developed in specific domains and expanded to others."