Preamble:
This post originally appeared on the Timmerman Report linked here and, being written for a more general audience, is less technical than other posts on this Substack. It covers a very interesting and important trend nonetheless. One of the most powerful aspects of deep representation learning is the ability to express an arbitrary set of features in a common numerical format, which can then be used in calculations that better quantify nuanced relationships. The impact of these technologies on our understanding of biomedicine will be profound.
What’s Old Is New Again
Say the word “repurposing” in certain scientific circles, and you get knee-jerk reactions to compounds long ago dismissed as failures. But in the new world of AI, what’s old can be new again.
Consider Every Cure, a New York- and Philadelphia-based nonprofit building a platform for repurposing drugs using AI. In February it received a $48 million contract [1] from ARPA-H, the US government’s new health science agency that funds futuristic, groundbreaking ideas. Every Cure’s goal is to dust off 3,000 FDA-approved drugs, identify 100 new potential candidates for repurposing, and move 25 into clinical trials.
The cost of this effort is relatively low, but the potential impact on human health is immense. The industry has already invested billions of dollars in developing these compounds; that’s a sunk cost. If AI can help us analyze known biology and drug mechanisms of action, it could point science in new directions that greatly reduce the time and money that make biopharma R&D so challenging. This could fundamentally change the investment math – the risk-reward proposition – for developing treatments for 7,000+ rare diseases, many of which aren’t considered investable today.
The concept of drug repurposing has been around for decades, and well over 100 drugs [2] have been repurposed for other indications to date, but the opportunity remains difficult to explore fully. The diverse, incomplete, and scattered data in the biomedical landscape is challenging to work with. Many rare diseases lack complete or robust documentation, and the mechanisms of action of many compounds are not known. To date, the approaches have been, at best, a combination of database sleuthing, publication scraping, and statistical analysis.
AI is changing the game.
The Complex Biomedical Data Landscape Makes Insight Challenging
There is a tremendous amount of biomedical, chemical, and clinical data in both the public and private domains that may be useful for discovering new use cases for drugs. To date, the primary approaches have combined the tools of data science and modeling with high-throughput assay development. A summary of some of the approaches used:
Phenotypic screening: one of the original methods of determining drug efficacy, this type of screening doesn't rely on any pharmacological or biological hypotheses and is largely serendipitous in discovery.
Target/Pathway/MoA analysis: these methods aim to find commonalities in disease etiology to determine if there are useful intersections where drugs may be effective. They may be based on comparing gene expression profiles, common biomarkers, pathway analysis, high-content/high-throughput screening, and in silico methods.
Knowledge-based methods: these methods are used to find commonalities in published literature and existing data sets to cross-reference potential repurposing opportunities.
Molecular modeling: these methods look at the structure of molecules and determine if they might be suitable docking candidates for various other targets.
Despite the array of tools available for drug repurposing, cross-referencing, coordinating, and interpreting the disparate data sets and types across a wide range (and long tail) of potential diseases remains a substantial challenge even for sophisticated data science approaches. Many diseases are poorly annotated, or annotated in natural language that is not easily comparable or integrable at the scale of the full biomedical literature. A vast amount of data is represented in tabular formats that enable some statistical methods but not nuanced pattern recognition, and data is incomplete across many domains.
For drug hunters, this data ecosystem is a sparse network of individual points that need to be integrated and completed.
Knowledge graphs (KGs) have been a core part of the toolkit for beginning to understand these relationships. These graphs are large integrated data sets that represent different types of knowledge as nodes and different types of relationships as edges between those nodes.
For example, nodes may represent drugs, genes, proteins, diseases, ontologies, clinical notes, etc., and edges may represent relationships such as “drug A binds protein B” or “gene X is associated with disease Y.” Large efforts have been made in the public domain to assemble these knowledge graphs, with examples such as PrimeKG [3] and Hetionet [4]. Each assembly covers a different scope of datasets, generally with tens to hundreds of thousands of nodes and millions of edges between them.
These knowledge graphs have enabled insight into the connectivity of diverse types of biomedical data, but their interpretation with standard tools of graph analysis and data science is generally limited to analysis of the existing known data – for example, calculating a weighted degree of how many edges link a given node to another to determine how closely the two are related. The ability of these graphs to enable complex insight, or predictions of new potential links, has been limited. A toy example of this kind of graph, and of a simple degree calculation, is sketched below.
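To make this concrete, here is a minimal sketch of such a graph in Python using the networkx library. The node and edge names are hypothetical placeholders, not entries from PrimeKG or Hetionet, and the degree count at the end is the kind of simple connectivity statistic described above.

```python
# A minimal sketch of a toy biomedical knowledge graph with networkx.
# The drugs, genes, proteins, and diseases here are hypothetical
# placeholders, not taken from any real knowledge graph.
import networkx as nx

G = nx.MultiDiGraph()

# Nodes carry a "kind" attribute: drug, protein, gene, or disease.
G.add_node("drug_A", kind="drug")
G.add_node("protein_B", kind="protein")
G.add_node("gene_X", kind="gene")
G.add_node("disease_Y", kind="disease")

# Typed edges encode relationships like "drug A binds protein B".
G.add_edge("drug_A", "protein_B", relation="binds")
G.add_edge("gene_X", "protein_B", relation="encodes")
G.add_edge("gene_X", "disease_Y", relation="associated_with")

# Classical graph analysis: count how many edges touch each node.
# Real KG analyses weight these degrees by edge type and confidence.
for node in G.nodes:
    print(node, "degree:", G.degree(node))
```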
AI Is Enabling Deeper Insight for Predictions
At the core of modern machine learning and AI is the idea that all types of information can be represented in a common manner such that it can be used for calculations. In short, AI aims to translate words, chemical structures, genes, concepts, etc. into information-rich numbers, so that math can subsequently be performed on those numbers. These numerical forms are known as representations, or embeddings.
Applied at scale to diverse data types, this is a rather extraordinary capability. It lets us move beyond a primarily relational understanding of biomedical data to one that is numerical and quantifiable, can be learned and optimized, and serves as the basis for making predictions about unknown hypotheses, such as whether a drug may be a good candidate for repurposing in a given disease.
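As a small illustration of what “translating text into numbers” looks like in practice, the sketch below embeds two hypothetical sentences and computes their cosine similarity. It assumes the open-source sentence-transformers library and a general-purpose text encoder; a production biomedical pipeline would use a domain-specific model.

```python
# A minimal sketch of turning text into embeddings and doing math on them.
# Assumes the sentence-transformers library; a real pipeline would use a
# biomedical-domain encoder rather than this general-purpose model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "Drug A is a small-molecule inhibitor of kinase B.",      # hypothetical
    "Disease Y is driven by overactive kinase B signaling.",  # hypothetical
]
emb = model.encode(texts)  # shape (2, 384): information-rich vectors

# Cosine similarity: one simple "calculation on numbers" that quantifies
# how related two pieces of biomedical text are.
cos = emb[0] @ emb[1] / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(f"similarity: {cos:.3f}")
```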
Among the most powerful recent tools for building these representations are Large Language Models (LLMs), put to well-known use in OpenAI’s GPT products.
LLMs can be used to translate a wide array of textual data – clinical notes, disease descriptions, ontologies, etc. – into numerical data. The same sort of translation can be done with a range of foundation models for molecules (such as ESM for proteins), genetic embedding models, models of imaging data, and many others. These embeddings can be linked together in heterogeneous graphs that are subsequently trained as graph neural networks (GNNs) to predict a wide range of different features, or with the objective of aligning the different embedding models into a common space, as in the BioBridge model [5] developed by researchers at Amazon.
These networks are built in an abstract numerical way – the language is not one of gene labels, drug names, or natural language – and this enables complex and rich calculations that can provide powerful insight.
For example, in a traditional biomedical KG, analysis is often limited to the links between nodes extracted from the existing literature. For the long tail of rare diseases, however, the data linking a disease to other parts of the biomedical literature may be sparse; sifting through the network of links and disparate data types to find a candidate drug may be intractable, and such a link may not even exist in the known literature.
AI models built with GNNs and KG embeddings can both find complex patterns in massive quantities of data and make predictions about the quantifiable likelihood of links that do not already exist between data points – for example, whether there may be a linkage between a drug and a disease. A minimal sketch of this kind of link prediction follows.
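Here is a deliberately minimal sketch of embedding-based link prediction in PyTorch. It learns one vector per node and scores a candidate drug-disease edge with a dot product trained against random negative edges; a real system would use a GNN over a heterogeneous graph, and the node indices and edge list here are hypothetical.

```python
# A minimal sketch of KG-embedding link prediction: learn a vector per
# node, score a candidate drug-disease edge by a dot product, and train
# on known edges vs. random negatives. Edge data is hypothetical.
import torch
import torch.nn as nn

num_nodes, dim = 1000, 64
emb = nn.Embedding(num_nodes, dim)
opt = torch.optim.Adam(emb.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

# Known (drug, disease) edges as pairs of node indices -- hypothetical.
pos = torch.tensor([[0, 500], [1, 501], [2, 502]])

for step in range(100):
    # Sample random node pairs as presumed-negative edges.
    neg = torch.randint(0, num_nodes, pos.shape)
    pairs = torch.cat([pos, neg])
    labels = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))])

    # Score each pair by the dot product of its two node embeddings.
    scores = (emb(pairs[:, 0]) * emb(pairs[:, 1])).sum(dim=1)
    loss = loss_fn(scores, labels)

    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained scorer assigns a quantifiable likelihood to an unseen pair,
# e.g. the drug at index 3 and the disease at index 503.
pair_score = (emb(torch.tensor(3)) * emb(torch.tensor(503))).sum()
print("predicted link probability:", torch.sigmoid(pair_score).item())
```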
One of the concerns scientists have raised about AI in drug discovery has been a lack of interpretability of these predictions. But those questions are being addressed. A team at Harvard Medical School’s Department of Biomedical Informatics recently reported a method to revert the numerical form of these networks back into the underlying sub-graph data that informed them, providing a detailed biological interpretation that supports the underlying prediction [6].
These approaches to drug repurposing and knowledge management are still new, and it will take time to see whether they improve the predictability of drug utility across different contexts. However, they have already demonstrated the ability to identify and replicate known repurposed drugs and off-label clinical prescriptions. With an explainability layer that describes predictions in the context of the knowledge elements that contributed to them, they have already improved clinical decision accuracy and confidence in off-label prescriptions [6].
These new types of AI-driven knowledge graphs, including Every Cure’s MATRIX platform (ML/AI-enabled Therapeutic Repurposing In eXtended uses), are enabling new insight into the vast sea of biological data.
Biomedicine and the Future of Knowledge Management
Knowledge management remains the most significant near-term application of AI in biomedicine [7] because it provides a method of representing all the different types of information in a common format, extracting complex patterns, and enabling quantifiable and explainable predictions for testable hypotheses.
What this means for healthcare and medicine could be profound when the scale is considered. In a recent podcast, John Halamka, the President of Mayo Clinic Platform, said his team is building models across Mayo’s 10 million+ longitudinal patient records and aims to build an international consortium that could expand to potentially hundreds of millions of patient records [8]. With all the caveats of patient privacy considered, advances in AI – particularly in GNNs, KGs, and LLMs – are likely to enable deep insight into patterns and potential in biomedicine that were previously unknown.
The cost of conducting clinical studies remains a barrier for the long tail of 7,000 rare diseases. But scientists are generating enormous volumes of biological data, and that data can be converted into forms that AI can analyze at scale. There’s good reason to believe that drug candidates that have already been through some clinical trials are a promising place to start making, and testing, AI-generated predictions. This approach has the potential to change the biopharma R&D investment model. A fascinating set of experiments is about to be run, and the answers could help guide R&D priorities and off-label prescribing that could change the lives of patients with many different rare diseases.
References
1. https://everycure.org/every-cure-to-receive-48-3m-from-arpa-h-to-develop-ai-driven-platform-to-revolutionize-future-of-drug-development-and-repurposing/
2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9945820/
3. https://zitniklab.hms.harvard.edu/projects/PrimeKG/
4. https://het.io/
5. https://arxiv.org/abs/2310.03320
6. https://www.medrxiv.org/content/10.1101/2023.03.19.23287458v2
7. https://timmermanreport.com/2024/01/rebooting-ai-in-drug-discovery-on-the-slope-of-enlightenment/