SOTA Seeking – A Knife Fight in a Phone Booth
Commercialization and State of the Art in Bio and ML
One of the most memorable quotes I have heard in my time in commercial sales organizations operating in transformative and rapidly moving technologies is the following:
“When you have a knife fight in a phone booth, everyone ends up getting cut”
The sentiment is about how customers interpret the differentiation that companies aim to position, and how companies present the competitive advantages of their products. The wild west of marketing, particularly in unregulated sectors, creates an unparalleled opportunity for narrative – narrative that is often neither real nor relevant. The current SOTA seeking in machine learning, particularly in the biological sciences, is starting to look like history repeating itself. As the field transitions to more application-centric products in machine learning, we might expect that competitive dynamics will create increasingly fragmented differentiation.
For clarity, the perspective presented in this article will be about the commercial relevance of State of the Art (SOTA) seeking in the context of the current machine learning landscape in biology. It will cover the ambiguous concepts of SOTA on arbitrary tasks, the role of underlying data distributions in determining performance metrics, the progress and hype of machine learning architecture differentiation, the parameters bounding deep neural network performance and the psychological and social motivations of SOTA in both research and commercial domains.
SOTA as a Meme and a Selling Point
SOTA as a meme is typically reserved for fast moving technologies where competition is fierce, accessible, and incremental. It's atypical to use the term SOTA to describe, for example, fusion technologies or other "deep tech" industries. We generally do not speak about SOTA spaceships, at least as a marketing tool or meme.
I have had the opportunity over the last decade to be part of three such SOTA technology transitions, in the first two building 8-figure revenue organizations, and have observed the common features of commercial and product success over that time. The transitions have been:
The introduction of population wide clinical genomics with NIPT
The mass adoption of CRISPR technologies
The rapid evolution of deep learning in biology
These transitions share the following common features:
They introduce a foundational technology that is categorically better than any prior
The technology is highly accessible and has a low barrier of entry
Customer interactions with the technology are variable and individual
Updated Notes: An article published in 2022, Mapping global dynamics of benchmark creation and saturation in artificial intelligence, tracks the rise of SOTA in image and NLP fields and highlights the rapid obsolescence of many new metrics.
A Brief Historical Anecdote of SOTA in Biotech – Transformative Genomics
In 2012 the first non-invasive prenatal testing (NIPT) was commercially offered in the US. In the following year, 3 other US based companies joined the market and over the following decade, dozens of companies began offering NIPT for routine prenatal screening. The introduction of the underlying liquid biopsy technology for analyzing cell-free DNA in blood has categorically transformed pregnancy care and set the stage for a transformation in cancer management as well.
Liquid biopsy testing was the first population wide application of next generation sequencing in clinical practice and there is little doubt it has transformed clinical care. The introduction of the technology created fierce competition among the commercial providers and a relative degree of confusion among customers. The information disparity, combined with an underdeveloped regulatory framework, led to a marketing wild west around performance metrics, technology potential, product specs and so forth. Everyone was, if not seeking, at the very least promoting SOTA.
In the genomic screening paradigm, SOTA took many forms, but was most often marketed as sensitivity and specificity metrics of a particular test. Glossy marketing brochures would present comparison tables of each competitor’s product highlighting a 98% vs 99% sensitivity or a 99.5% vs 99.8% specificity metric as being differentiating. There were numerous clinical study publications that would be used to cherry pick specific metrics that could be highlighted as core differentiating factors to customers. And it was almost entirely noise.
How to Lie with Statistics
One of the most important features of transformative technology introduction into commercial products is the customer learning curve…and one of the most common tactics used, particularly by high growth companies vying for market share, is to exploit the information asymmetry. This isn't necessarily done in a malignant manner; however, it is often very difficult and impractical to communicate technical nuance in a competitive commercial setting. And perhaps more critical is that for most customers, the competitive technical nuances of transformative technologies were often not relevant to their purchasing decisions. In the case of cell free DNA testing for prenatal screening, for example, the categorical improvement leap went from a biochemical screening paradigm with 80% sensitivity and 5% false positive rates to a genetic paradigm with 99% sensitivity and 0.01% false positive rate (statistics for trisomy 21). The categorical paradigm shift was so great that technical nuance often had no practical impact on clinical care. For oncology applications, there was even a race toward single-molecule DNA detection as the SOTA of assay sensitivity, yet from a customer perspective this was largely irrelevant in clinical practice.
The problem of competitive differentiation from a technical perspective became considerable, and companies began to flood the market with any statistic that favored their product. This became problematic from an accuracy perspective, however, as customers were often not experts in the statistics of clinical population screening, the specific patient populations and distribution frequencies of positive and negative samples in published studies, or the varying complexities of reproductive genetics and next generation sequencing. Moreover, customers had individual experiences with each product that may or may not have been representative of overall technical statistics — e.g. if the technology didn't work for them, it didn't matter if it worked for 99% of other customers (and vice versa). The information asymmetry led to the "knife fight in a phone booth" in the commercial sphere, with company A cherry picking data set Y to favor them and company B refuting data set Y with a technical nuance that was irrelevant to the actual customer experience and was, in many cases, not even accurately reflective of the underlying technology performance. If, for example, you report a Positive Predictive Value metric on a screening test as assessed from an enriched patient population, it is probably not reflective of actual clinical performance. It is not lying with statistics, but it can certainly border on misleading in some cases. In the extreme it became more a battle among vendors for prominence than attention to the customer's needs.
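The PPV caveat above is just arithmetic: the same sensitivity and false positive rate yield very different predictive values depending on the prevalence of the condition in the tested population. A minimal sketch, using the trisomy 21 sensitivity and false positive figures quoted above; the prevalence values (1-in-700 general population, 10% enriched study) are illustrative assumptions, not drawn from any cited study:

```python
# Positive predictive value (PPV) of a screening test:
#   PPV = sens * prev / (sens * prev + fpr * (1 - prev))

def ppv(sens: float, fpr: float, prev: float) -> float:
    """Probability that a positive result is a true positive, at a given prevalence."""
    true_pos = sens * prev
    false_pos = fpr * (1 - prev)
    return true_pos / (true_pos + false_pos)

SENS, FPR = 0.99, 0.0001            # 99% sensitivity, 0.01% false positive rate

general = ppv(SENS, FPR, 1 / 700)   # assumed general-population prevalence
enriched = ppv(SENS, FPR, 0.10)     # assumed case-enriched study population

print(f"general population PPV: {general:.3f}")   # ~0.93
print(f"enriched study PPV:     {enriched:.3f}")  # ~0.999
```

The identical assay appears near-perfect when validated on an enriched cohort, which is exactly why a brochure PPV from such a study says little about real-world clinical performance.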
CRISPR Everywhere
The second genomics transformation in the last decade has been CRISPR and it shares a similar set of characteristics from a commercial perspective:
An extremely high velocity publication rate of new information
An information asymmetry between vendors and customers
A low barrier to entry
A probabilistic or potentially variable customer experience
The commercial landscape manifested itself on several different technical dimensions. In similar analogy to genetic screening, technology performance took center stage – the editing efficiencies or design accuracy of CRISPR tools, the on/off target specificity, chemical purity, etc. were all top of the roster on commercially competitive metrics and the competition was similarly fierce in the commercial marketing brochures. SOTA across several dimensions was, and probably still is, largely undefined in CRISPR on account of the diversity and complexity of applications in biology and what frequently matters commercially are practical factors like logistics, price, customer and technical support, etc.
The Dimensions of Commercial Competition are Not Solely SOTA-Based
It's not a favorable view among technologists (and often to my own chagrin), but technology is often not the driver of commercial success in SOTA based paradigms. The reason for this is straightforward - SOTA seeking is not the same as commercialization. The two have different intentions and different timelines, which often do not match those of customers.
It's worth pointing out that the degree of this discrepancy is proportional to the customer's rate of adoption and required understanding of the underlying technology, with greater discrepancies leading to more opportunities for marketing spin. For example, SOTA seeking in diffusion models for consumer image generation is more valuable and much less prone to marketing ploys than SOTA seeking in predictive biological applications or clinical trials. The rationale is simple – customers can immediately recognize and adopt new SOTA technologies for their pictures but cannot do so as easily for drug discovery.
Another way to consider this is that in commercial markets where the goal is to drive adoption, the uptake curves often follow the Crossing the Chasm dynamic – and this dynamic is largely focused on the concept of the technology developing as a “whole product” where the technology component plays just one role.
Figure: The Crossing the Chasm adoption curve (source: https://innolution.com/uploads/misc/Crossing_The_Chasm_Annotated.png)
Commercial success in SOTA driven markets is largely driven by Product Experience, which includes but is not solely dependent on state of the art technology
The importance of product experience is much more commonly recognized in software than in biotech and generally has less connection to the concept of SOTA. This is not to say that technology performance is not important – only that its importance is measured against a customer's needs, experiences, and practical ability to benefit from that performance. The majority of value in software has accrued to the application layer, not the technology layer, and the gap between the two is perhaps no better described than in Steve Jobs's brilliant response to technical criticism. In the case of NIPT for prenatal screening, the primary four US companies, while all vying for SOTA recognition, were commercially competitive across a range of dimensions that were important to the customer experience of different market segments, namely logistics/user experience, price/economics, scope of clinical evidence, and technology/assay differentiation. There was no single SOTA that dominated; it was the overall product experience that became the differentiating success factor for each vendor. The importance of product experience is among the most underestimated value propositions in SOTA driven domains, particularly for companies that intend to cross the chasm into a scalable market.
SOTA in Machine Learning and Biology Will Not Solely Define Commercial Success (at least currently)
SOTA seeking is driving a frenzy of publications at the intersection of machine learning and biology. The high publication rate is mostly noise for the following reasons:
SOTA is defined on an arbitrary number of dimensions that are not standardized
Underlying data sets for training and validation are frequently not comparable
Primary limitations on performance improvements are gated by the availability of data, not architectures or compute.
Incentives in the SOTA domain, largely present in academia, reward novelty, not the development needed for commercial products.
A few caveats to the above: SOTA will be more important in domains where these conditions are less relevant. A clear example is protein folding, where the specific task and its performance can be unambiguously defined. The existence of this objective basis is important for the concept of Foundation Models in biology and elsewhere. The applications of machine learning to aspects of cell biology have a different set of considerations.
SOTA in Bio Has an Unclear Definition
There are no standard metrics for evaluating deep learning architectures against specific tasks in biology. There is no BLEU score or MNIST equivalent, and the ImageNet equivalents are collections of relatively disparate datasets (e.g. public consortia datasets). Performance comparisons are frequently done internally, often not on equivalent matched data sets, or against reported metrics from other publications that use both different data sets and, often, entirely different hyperparameters or architectures. While this may be useful from a proof-of-concept perspective, it says little about the development of a new SOTA in the field. A recently published paper, HyenaDNA, is an example of this. Notwithstanding the actual computational architectures of the paper, it provides a series of comparisons against both alternative versions of itself and other publications such as the Nucleotide Transformer.
Figure 1: Comparison tables across different architectures and different publications from HyenaDNA paper. In table 4.2 (left), the HyenaDNA architecture achieves a stated SOTA in 12 of 17 tasks.
This reference is not to say that these comparisons are not useful, only that they say little about what the SOTA in the field should be – or how best to improve it. The data sets that each of these architectures are trained on are different, and there is no reason to believe that the architecture itself is the majority driver of performance. In fact, the opposite may be true. Chatter about SOTA in genomics has swung back and forth over the last few years, with examples covering models such as Basenji (CNN), Enformer (CNN/Transformer), Nucleotide Transformer (Transformer), and HyenaDNA (CNN composite). In each case, the architecture is advocated as the new de facto SOTA, but it remains difficult to assess objectively across specific tasks and, in fact, each architecture achieves a SOTA metric only on a selection of desired tasks and not others (see Figure 1 - Table 4.2). Moreover, for any given architecture, it is unknown to what degree hyperparameter tuning or curated data might improve the performance of certain tasks. In the Enformer paper, it was stated that
“To pinpoint the benefit of attention layers compared with the dilated convolutions used in Basenji2, we replaced attention layers with dilated convolutions and tuned the learning rate for optimal performance. Attention layers outperformed dilated convolutions across all model sizes, numbers of layers, and numbers of training data points…Overall, these results confirm that attention layers are better suited than dilated convolutions for gene expression prediction.”
Whereas in the HyenaDNA paper, a different conclusion was made:
“A large language model based on implicit convolutions was shown to match attention in quality while allowing longer context lengths and lower time complexity”
The rewards of novelty are certainly accruing, however, as each new architecture makes a splash on (bio)arXiv and the twittersphere. It seems reasonable to expect that architectures will continue to be mixed and matched, with benefits for one type of task or another and each claiming a new SOTA in some domain. It also seems reasonable that new architectures will continue to be developed that tout an increasing range of operational statistics, such as training compute cost, transfer-learnability, tuning speed, and probably yet-to-be-defined network metrics.
The same trend is seen in models of transcriptomics. The recent publication of the scFoundation model had a series of comparative assessments but among the more interesting was a plot comparing the training losses against the parameters of different models that have all been released over the last year.
Figure 2: Training metrics of different transformer model scales against transcriptomic data. This data does not actually reflect these specific models, but rather models that the author created with parameter counts set equal to those of the other models. Importantly, the original posting of this graphic in the bioRxiv preprint on June 2, 2023 did not include this clarification, which might lead readers to believe that this was a genuine comparison; a later version on June 21, 2023 added the clarification. This further exacerbates the potential confusion about how to compare SOTA models.
There are two important observations about this graphic:
It does not present any clear SOTA comparison that can be benchmarked in the industry. It is actually only an internal comparison with different hyperparameter values matching the referenced models.
The performance improvement trends are going to be limited by data availability
The latter point about data availability is likely at the crux of the concept of SOTA in bio.
The proverbial phone booth has been constructed…
The Current Bottlenecks in Bio SOTA
Notwithstanding the lack of standardized metrics on which to base an assessment of SOTA in applications of machine learning to cell biology tasks, there is a more significant hurdle at hand – the power scaling laws of deep neural networks with respect to data availability.
In 2020, Kaplan et al. published an analysis of the power scaling laws of deep neural networks as they relate to three primary dimensions: token count, total training compute, and network parameter count. A snapshot from that paper is below:
An important observation of this study is that
“empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two”
with one of these factors being data set size.
This has important implications for the development of SOTA models in biology as it relates to training data and architecture improvements. While both model size and total compute are readily modifiable features, their power scaling performance may only continue if the token count is not limiting. Given that token count also follows a power law model, there may be limits to performance improvements based on the production of data.
It has become relatively en vogue to state that there is a data quality/quantity crisis in the life sciences, and it is true that single cell sequencing data is increasing; however, it's a matter of scale. If a power law holds for token count, the current SOTA model uses 50M cells (scFoundation), and ~100M are currently in the public domain, the question is how fast this data set can grow. An analysis by Adam Gayoso (see here) seems to indicate that while single cell data has been exploding recently, it may not be following the continuing exponential required to avoid bottlenecking the advances in compute and model sizes.
Figure 3: Analysis of scRNA-Seq data growth (reference: Adam Gayoso)
Growing data exponentially will pose a challenge to keeping pace with power scaling performance. Further, this analysis refers only to scRNA-Seq data, which may model only a subset of relevant biological tasks.
The more relevant approaches to data (and what ultimately will bring about useful product applications) will reside in the specific and curated generation of data for specific product tasks, not generalized buckets of everything, everywhere, all at once.
As Andrew Ng has been advocating, it’s about better data, not bigger data. And better data is defined by the specific tasks that the data is intended to model.
The broader question about these models, even if there are no clear SOTA definitions, is whether they are useful for customers and to what degree. On this point there is a vast space of product development and customer experience questions that take center stage. It will be important that the core performance of models continues to improve, but the overall utility has many more dimensions.
So What About SOTA, Products, and Commercialization Success?
This article presents a perspective on the different weights (no pun intended) of technology performance and product value as considered from a customer's perspective. While not intending to discount the importance of technical product performance, it brings to attention the fact that this is frequently only one part of the whole product. Particularly in instances where there is a high rate of innovation, an asymmetry in vendor/customer information, and an ambiguous set of benchmarked standards, it behooves product companies to avoid the knife fight in the phone booth, find the whole product utility, and address the unspoken needs of customers. There will continue to be considerable high frequency chatter in the machine learning space as researchers mix and match architecture components and run trial-and-error parameter optimizations. However, for machine learning to make a meaningful impact across the drug discovery industry, it will need to be developed into useful products, not solely (bio)arXiv papers, and for that, the differentiation space will be larger than the SOTA of the moment.