Correction: a previous version of the article did not include Vertex Pharmaceuticals as a developer — this has been corrected.
One of my goals for this Substack is to provide accessible intuition, in a semi-technical manner, for how to think about low-frequency (i.e., longer and more transformative) trends in technology and biology. The recent approval of the first CRISPR-based therapeutic is a case study in what the future holds at the intersection of genomics, medicine and machine learning, and it is decidedly a low-frequency trend [1]. The article below ebbs and flows in the technical weeds, but hopefully provides useful context.
NOTE: This article is split into two parts to make it more readable. Subscribe to stay tuned for Part 2.
Genomics-based AI and programmable genome editing tools like CRISPR are transforming our understanding of, and ability to treat, human disease in extraordinary ways. This two-part article details how.
The Big News
This past week, the first-ever CRISPR-based gene editing cell therapy was approved in the UK [2]. CASGEVY, developed by CRISPR Therapeutics and Vertex Pharmaceuticals, is an ex vivo treatment for sickle cell disease (SCD), a disease in which a malformed form of hemoglobin causes red blood cells to become misshapen (sickle-shaped rather than symmetrical and round), which can cause severe vascular and systemic problems because normal, smooth blood flow is inhibited. Note: it was also approved for transfusion-dependent beta thalassemia (TDT), which is related, but this article will focus on SCD. Around 300,000 children are born with SCD each year, and global prevalence is estimated at 20-25 million. It is the canonical genetic disease: people are born with it and live with it their entire lives. Before genetic engineering, there was no ability to cure this disease.
As milestones in medicine go, the approval of the first CRISPR-based genetic engineering therapy is of monumental importance.
A New Era of Tools
The milestone is made even greater by the speed. Just 11 years ago, CRISPR, a co-opted bacterial immune system, was only a publication in Science [3]. In the ensuing decade, it has transformed the landscape of life science research, drug discovery, and our concept of what is possible in medicine. Where most historical drug development programs built on well-established small molecule or protein chemistries would take on the order of 15 years to reach approval, CRISPR got there in 11 (earning a Nobel Prize for Doudna and Charpentier along the way [4]). The promise is even greater because CRISPR is a programmable system that can be used to target an increasing array of genetic diseases.
We are in a new era of genetic medicines, and of curing genetic diseases.
Tools on Tools on Tools…a Brief History of Genomics
The development of CRISPR technologies (itself an exploding and ever-expanding field [5]) is the current capstone of the tools developed over the last few decades in pursuit of an understanding of human genetics and the genetic drivers of disease. To consider the full context of this history, and to look to the future, the story of CASGEVY is a model rubric.
What has been done over decades will soon be done over years, and possibly even less. The complexity of biology is being matched by the development of high-throughput tools, massive data sets, advances in machine learning, and genetic engineering technologies…and we are just getting started.
Mapping the Genome
Shortly after the Human Genome Project was completed (and to some extent even before), there was a large effort to begin to understand the meaning of the DNA sequences that were being produced. Over the last 20 years, this effort was codified in the 4 iterations of the ENCODE project (the ENCyclopedia Of DNA Elements). These projects have followed technology development in genomics and aimed at standardizing and aggregating all information about the function of genomic sequences. The ENCODE projects themselves deserve an entire article, but below is a summary of the progression of technology development:
Figure: a timeline of the types of assays used to understand the human genome over time [6]. Over the last 20 years we have introduced an increasing number of technologies to map the genome that help us understand, from an experimental perspective, the function of DNA. These tools have laid the foundation for the data we now have to train deep learning models.
To put some of these tools in context, below is a short description of some of the information that can be learned from each:
3D chromatin structure tools: as seen in the figure of the globin locus, the structure of DNA has both 2D and 3D features. Tools that map 3D structure were some of the first developed and can provide insight into which parts of DNA are “touching” which other parts, which is a starting point for a hypothesis of how the circuitry works.
DNA accessibility: not all of our DNA is “open”. In fact, in many cases a large portion is so tightly compacted that proteins cannot bind it. Tools that look at DNA accessibility include DNase footprinting (which identifies DNase I hypersensitive sites, or DHSs, referred to below) and are used to determine which parts of the DNA are available, or which parts might form the components of the circuitry.
TF ChIP, histone ChIP, etc.: these are technologies that directly measure the binding of different proteins to DNA, many of which impact the regulation of gene expression. BCL11A is one such protein that is important in managing the regulation of genes during the development of blood cells.
Over the last 2 decades, the host of molecular tools that have been developed as part of the ENCODE project have produced a wealth of experimental data about the features of our genetic circuitry — and the data has been hard earned.
A Short Primer on Sickle Cell Disease
To set the stage for an understanding of why this story is so compelling, it’s important to have a general overview of sickle cell disease. I won’t discuss the phenotypic issues substantially but rather the genetic ones, which are the basis of the story herein. Just to satisfy the more visual readers, sickle cell disease looks like the below:
Figure: When hemoglobin is normal, red blood cells are disc-shaped (barbell-like in cross section), flexible and deformable, so they can squeeze through narrow blood vessels. When hemoglobin is abnormal, the cells deform into a “sickle” shape and can aggregate and block blood flow [7].
From a genetic perspective, sickle cell disease is the poster child of genetic disorders because it is so precisely defined. It is caused by a single point mutation in the beta globin gene that changes a single amino acid from glutamate to valine. For some additional detail, we all have two types of hemoglobin: an adult version and a fetal version. After birth, expression of the fetal version is downregulated in favor of the adult version. Three different globin genes (alpha, beta, gamma) combine to make these versions:
Adult hemoglobin (HbA) is composed of 2 alpha globin and 2 beta globin subunits.
Fetal hemoglobin (HbF) is composed of 2 alpha globin and 2 gamma globin subunits.
Since the mutation that causes SCD is in the beta subunit, it affects the adult version of hemoglobin.
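To make the single-letter change concrete, here is a minimal Python sketch. It assumes the first nine codons of the beta globin coding sequence as commonly cited, and an A-to-T substitution in the middle of the seventh codon (GAG to GTG). The sequence fragment, the partial codon table, and the function name `translate` are illustrative choices for this article, not part of any CASGEVY tooling.

```python
# Partial codon table: only the codons that appear in this short fragment.
CODON_TABLE = {
    "ATG": "Met", "GTG": "Val", "CAT": "His", "CTG": "Leu",
    "ACT": "Thr", "CCT": "Pro", "GAG": "Glu", "AAG": "Lys",
}


def translate(dna: str) -> list[str]:
    """Translate a DNA coding fragment into a list of amino acid names."""
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3)]


# Assumed start of the beta globin coding sequence (illustrative fragment).
NORMAL = "ATGGTGCATCTGACTCCTGAGGAGAAG"

# Sickle allele: one base changes (A -> T) in the seventh codon, GAG -> GTG.
MUTATION_INDEX = 19  # zero-based position of the middle base of codon 7
SICKLE = NORMAL[:MUTATION_INDEX] + "T" + NORMAL[MUTATION_INDEX + 1:]

print("normal:", "-".join(translate(NORMAL)))
print("sickle:", "-".join(translate(SICKLE)))
```

The two printed protein fragments differ at exactly one position, glutamate versus valine, even though only one of the 27 DNA letters changed.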
This specific biology leads to two possible ways of treating SCD from a genetic standpoint:
It is possible to genetically correct the specific mutation in the beta globin gene to ensure that the protein is formed correctly.
An alternative route is to shift the balance of expression toward fetal hemoglobin (which does not include the deleterious mutation), which would have a therapeutic benefit.
The former is more direct, but CASGEVY does the latter, and the story of its discovery is just the tip of the iceberg for genetic medicines. It is the story of how we can use genetic engineering not just to correct genetic mutations but to actually tune our genetic circuits, potentially addressing a vast range of biological conditions. Our genetic circuitry is incredibly complex, and SCD is one of the simplest and most well-defined use cases because it involves only one gene and one mutation. The potential of deep learning models to provide insights into the underpinnings of complex genetic regulatory networks and the “cell states” of disease and health offers a window into what the future holds. More on that later, but first a description of the gene regulation of hemoglobin.
A Beautiful History of Genomic Dissection
To start at the end, below is a simplified diagram of the regulation of globin gene expression. The goal is to understand how to increase the expression of gamma globin relative to beta globin (the alpha subunit is the same in both versions). Gamma globin is the subunit unique to the fetal version of hemoglobin, and it does not contain the sickle cell mutation. Increasing the expression of gamma globin would therefore increase the proportion of overall hemoglobin that is beneficial (HbF).
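The reasoning above can be written as a toy calculation. This sketch assumes alpha globin is never limiting and that gamma and beta subunits pair with alpha equally well, so the HbF share simply tracks the gamma share. Real hemoglobin assembly is more involved, the function name `hbf_fraction` is made up for this illustration, and the input numbers are arbitrary, not clinical data.

```python
def hbf_fraction(gamma_units: float, beta_units: float) -> float:
    """Share of hemoglobin that is HbF in a toy model.

    Assumes alpha globin is in excess, so every gamma subunit ends up in
    HbF (2 alpha + 2 gamma) and every beta subunit ends up in HbA
    (2 alpha + 2 beta).
    """
    total = gamma_units + beta_units
    if total <= 0:
        raise ValueError("need a positive amount of gamma plus beta globin")
    return gamma_units / total


# Arbitrary expression levels, chosen only to show the direction of the effect.
for gamma, beta in [(1, 99), (20, 80), (40, 60)]:
    share = hbf_fraction(gamma, beta)
    print(f"gamma={gamma:>3} beta={beta:>3} -> HbF share {share:.0%}")
```

The point is only directional: shifting expression from beta toward gamma raises the share of hemoglobin that does not carry the sickle mutation.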
The globin locus looks something like the above, with the gamma and beta versions under similar regulation [8]. (HPFH is benign hereditary persistence of fetal hemoglobin in adults.) In this case, there is a region called the LCR (Locus Control Region) which controls gene expression. There is also an upstream genetic enhancer region which is a binding target for a protein called BCL11A, a protein which binds to a specific genetic sequence (see top image) and recruits other proteins to loop the DNA over such that the LCR is centered over beta globin. As seen in the bottom image, when the binding sequence has genetic variation (or is modified), it affects the ability of BCL11A to bind (because proteins are very sensitive to the sequences they bind to). That changes the looping of the DNA such that the LCR is more co-localized with the gamma-globin locus. This is a rather simplified discussion, but it presents the pieces.
If it is not already evident, the nuance of gene regulation specificity is absolutely stunning.
But what is more fascinating is how this was all discovered…
Discovering the Mechanism…
The mechanism described above jumps ahead to the end of the story, so let’s discuss some of the genetic history behind it.
I won’t go into an extensive history of DNA sequencing here because it’s a story unto itself, but suffice it to say that the big milestones were: in 2003, the first human genome sequence from the Human Genome Project (albeit very incomplete and still being filled in [9]); in 2007-2008, the introduction of Next Generation Sequencing (NGS), with its Moore’s Law-like cost dynamics and scale [10]; and a host of subsequent studies that linked this rapid growth in sequencing data to clinical or phenotypic outcomes, collectively known as Genome-Wide Association Studies or GWAS. The corpus grows every day, and there are currently several thousand GWAS to date [11].
One outcome of all of this sequencing is that we have gained a deep perspective on the level of genomic diversity and complexity, but much less understanding of the impact of that diversity. One simple reason is that only about 10% of all variation occurs in protein-coding genes (where we can at least translate the variant into a different protein and make conjectures about its impact), while 90% resides in the “dark genome”: the part of our DNA that doesn’t make proteins but that serves as the underlying substructure for regulation of the genes that do [12]. This part of our DNA, previously considered “junk”, is likely to provide critical context for how the regulation of our biology actually works. But it is very complex and difficult to assess experimentally. To date we have largely relied on GWAS to get some understanding: we can say that a variant is “associated” with a particular disease, but in most cases we don’t really know why.
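To give a feel for what “associated” means statistically, here is a minimal sketch of an allelic association test for a single variant: a chi-square statistic on a 2x2 table of allele counts in cases versus controls. The counts are hypothetical numbers invented for illustration, and the function name `allelic_chi_square` is my own. Real GWAS typically use regression models with covariates and correct for testing millions of variants.

```python
def allelic_chi_square(case_alt: int, case_ref: int,
                       control_alt: int, control_ref: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table.

    Rows are cases and controls, columns are alternate and reference
    allele counts. No continuity correction is applied.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs a nonzero total")
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical allele counts for one variant (not real data).
chi2 = allelic_chi_square(case_alt=30, case_ref=70,
                          control_alt=10, control_ref=90)

# 3.84 is the chi-square cutoff for p < 0.05 with 1 degree of freedom.
print(f"chi-square = {chi2:.2f}, exceeds 3.84: {chi2 > 3.84}")
```

A large statistic says the allele is more common in one group than the other. It says nothing about why, which is exactly the gap described above.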
SCD is a great place to start. Through a variety of GWAS, it was discovered that individuals with genetic variation in the BCL11A gene locus, as well as those with variation in the globin promoter binding site for the BCL11A protein (as indicated above in HPFH patients), had higher levels of gamma globin and HbF. One of the observations from these studies was that:
BCL11A brings together two complementary themes in human genetics. Common genetic variation, as reflected in GWAS, led to recognition of BCL11A as a candidate factor responsible for HbF silencing [13].
If the goal is to increase HbF, one strategy could be to reduce the influence of BCL11A.
The mechanism and future role of machine learning in Part 2…
References
[2] https://www.theguardian.com/society/2023/nov/16/uk-medicines-regulator-approves-casgevy-gene-therapy-for-two-blood-disorders-sickle-cell
[3] https://www.science.org/doi/10.1126/science.1225829?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed
[4] https://www.nobelprize.org/prizes/chemistry/2020/popular-information/
[5] https://www.nature.com/articles/s41587-020-0561-9
[6] https://www.nature.com/articles/s41586-020-2449-8
[7] https://steemitimages.com/DQmPxzfUfbxkw4SqqQBvPAYe3CJwPvzK234VyqEv74zc83E/Sickle-Cell-Normal-Cell_800x549.gif
[8] https://www.cell.com/cell/fulltext/S0092-8674(18)30296-4
[9] https://www.science.org/doi/10.1126/science.abj6987
[10] https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data
[11] https://www.nature.com/articles/s43586-021-00056-9
[12] https://pubmed.ncbi.nlm.nih.gov/35365203/
[13] https://www.cell.com/cell/fulltext/S0092-8674(18)30296-4