🎙 Bryce Daines/CDS at Modulus Therapeutics: Using ML to Power Next Generation Cell Therapy

Researchers are using increasingly sophisticated model architectures to incorporate experimental conditions as covariates and predict the effect of perturbations with increasing success

We’ve interviewed ML practitioners from VC funds and ML startups; today we’d like to offer you a perspective from an implementation standpoint. Bryce Daines, Chief Data Scientist at Modulus Therapeutics, explained how ML is leveraged in cell therapy and cell design, why traditional ML and statistical methods still dominate biological research, and what ML breakthroughs will be particularly impactful on cell therapy design.


👤 Quick bio / Bryce Daines

Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.

Bryce Daines (BD): I am a co-founder and Chief Data Scientist at Modulus Therapeutics, which we launched this year after incubating it at the Allen Institute for AI. I grew up reading a lot of science fiction and non-fiction with a particular interest in genetically modified organisms and viral pandemics – stuff we’re all familiar with now, but in hindsight seems oddly prescient. As an undergraduate, I studied bioinformatics, a field at the intersection of biology, computer science, and statistics. My first formal introduction to ML was the applied coursework in my major, problems such as inferring the functional properties of proteins from their sequence structure and reconstructing the evolutionary tree of viruses from their nucleotide sequences. I was fortunate enough to begin graduate school at the advent of ‘next generation’ DNA sequencing when the accompanying proliferation of biological data and the application of machine learning to biological datasets were hitting an inflection point. 

🛠 ML Work  

What are immune cell therapy and cell design, and when did scientists start to tackle these challenges with machine learning?

BD: Immune cell therapies are an emerging treatment option intended to boost the immune system's effectiveness at recognizing and eliminating diseases, such as cancer. Frequently, immune cell therapies are derived from a patient or donor’s white blood cells (lymphocytes), which are genetically engineered to enhance their effectiveness, then grown to very high cell numbers, and injected into the patient. Cell therapies are effective for certain blood cancers, but their effectiveness for treating solid tumors, such as breast cancer, remains to be seen.  

Using genetic engineering tools, such as CRISPR, immune cells can be programmed in various ways to enhance their ability to recognize and eliminate cancers. Still, the complexity of biology and the vastness of the possible design space make finding the optimal engineering blueprint experimentally an intractable problem. At Modulus, we leverage high-throughput experimentation and machine learning to understand how immune cells behave and to rationally design immune cells that eliminate solid tumors. We refer to this process as cell design.

We train our machine learning models on high-dimensional data extracted from individual cells in high-throughput experiments. At scale, this means collecting thousands of data points across hundreds or thousands of distinct cell populations in pooled experiments, which we use to infer the therapeutic potential of thousands or millions of cells individually and simultaneously. This approach dramatically increases the experimental throughput while reducing the cost.

Cell design is still in its infancy, with ongoing research at various institutions applying machine learning and novel experimental techniques across a broad range of biological conditions.

Tell us about the vision of the Modulus Convergent Design platform and how it leverages modern ML techniques.

BD: At Modulus, our goal is to integrate experimental and computational methodologies to accelerate the learning/feedback loop. We aim to build predictive models of immune cell behaviors and discover the optimal blueprint for engineered cell therapies. 

There are several ways in which we leverage modern machine learning and deep learning techniques in our platform. For example: 

  • Enabling higher multiplexing of our experimental methodologies by combining multiple engineering experiments into the same physical experiment (e.g., a well on a dish) and deconvoluting the signal of these experiments.

  • Inferring and reconstructing cell-type-specific genetic regulatory networks from single-cell data, e.g., identifying transcription factors that regulate the expression of downstream pathways and phenotypes. From these models, we can derive predictions about the effect of perturbing important nodes in these networks and measure the accuracy of predictions in downstream experiments. 

  • Modeling the relationship between perturbations (genetic modifications) and high-dimensional single-cell data to infer the relationship between perturbations and immune cell behaviors. This enables us to make predictions about out-of-distribution perturbations, informing and prioritizing subsequent rounds of experimentation. 
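
As a rough illustration of the third item above, and not a description of Modulus's actual models, the simplest way to relate perturbations to single-cell measurements is to estimate each perturbation's mean shift from control cells and, assuming effects add, to sum those shifts to predict a combination that was never measured. The labels `KO_A` and `KO_B`, the toy expression values, and the additivity assumption are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def mean_profile(cells):
    """Average a list of equal-length expression vectors gene by gene."""
    return [fmean(gene_values) for gene_values in zip(*cells)]


def perturbation_effects(cells, control="control"):
    """Estimate each perturbation's effect as its mean shift from control.

    `cells` is a list of (perturbation_label, expression_vector) pairs.
    """
    by_label = defaultdict(list)
    for label, expression in cells:
        by_label[label].append(expression)
    baseline = mean_profile(by_label[control])
    effects = {
        label: [m - b for m, b in zip(mean_profile(group), baseline)]
        for label, group in by_label.items()
        if label != control
    }
    return baseline, effects


def predict_combination(baseline, effects, labels):
    """Predict an unseen combination by summing single-perturbation shifts.

    This is an additive baseline (an assumption); real perturbations can interact.
    """
    predicted = list(baseline)
    for label in labels:
        predicted = [p + e for p, e in zip(predicted, effects[label])]
    return predicted


# Toy data: two genes, two hypothetical single knockouts.
cells = [
    ("control", [1.0, 2.0]),
    ("control", [3.0, 2.0]),
    ("KO_A", [4.0, 2.0]),
    ("KO_A", [4.0, 4.0]),
    ("KO_B", [2.0, 0.0]),
    ("KO_B", [2.0, 2.0]),
]
baseline, effects = perturbation_effects(cells)
predicted_double = predict_combination(baseline, effects, ["KO_A", "KO_B"])
print(predicted_double)
```

A prediction like `predicted_double` could then be used to rank untested combinations for the next round of experiments; where measured results depart from the additive guess, the perturbations likely interact.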

How does ML for cell design compare to ML applied to other problems in biology? 

BD: Biology has a rich history of supervised and unsupervised machine learning applications, which has only accelerated in recent years with the advent of ‘next-generation’ sequencing and other high-throughput technologies. A handful of examples I’m familiar with include:

  • Genomics: Sequence recognition, e.g., open reading frame detection 

  • Proteomics: Secondary structure prediction 

  • Metagenomics: Microbial community classification 

  • Systems Biology: Disease marker discovery 

  • Evolution: Reconstruction of phylogenetic trees 

  • Text Mining: Inference of biological relationships from text 

  • Clinical Diagnosis: Prognosis/diagnosis from histology images 

Deep learning is also coming of age for biology as larger and more complex models are being applied to long-standing biological domain problems. AlphaFold, for example, was recently ranked as the top algorithm for protein structure prediction, a critically important problem for basic and applied research, including drug discovery. 

In keeping with the broader industry of biological research, cell design is still dominated by traditional statistical methods. But with exponentially growing biological datasets, the application of deep learning and other modern machine learning approaches has significantly increased. 

How important is model interpretability in the design of cell therapeutics?

BD: Building interpretable models that aid in understanding biological mechanisms around observed behaviors is a key part of what we do. Often, interpretability, the biological meaning of predictive models, is more important than accuracy. The black box nature of many deep learning models often inhibits interpretation, making these approaches less favorable than more interpretable models. Due to this, traditional machine learning and statistical methods still dominate biological research. But there is a lot of opportunity for the development of more interpretable models, especially as biological datasets continue to grow exponentially.

What unsolved challenges remain in the interpretation of perturbational single-cell omics data? And more generally, what breakthroughs in ML relevant to cell design are you expecting to see in the next 3-5 years?

BD: In collecting single-cell data, particularly scRNA-seq data, sparsity remains an important, unsolved challenge for data analysis. Due to biological variation and technical limitations of the experimental techniques, many genes, though expressed, are not observed in scRNA-Seq datasets. Sparsity in these datasets has an impact on many downstream analyses. Effective statistical and ML-based approaches for modeling latent spaces, denoising and normalizing expression measurements, and imputing non-zero expression values remain active research areas. 
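
To make the sparsity point concrete, here is a minimal sketch, assuming a toy count matrix with one row per cell and one column per gene. It measures the fraction of zero entries and applies a common preprocessing step, scaling each cell to a fixed total and taking `log1p`. The function names and the `target_sum` default are illustrative choices, and this step does not recover unobserved expression; it only makes cells with different sequencing depths comparable.

```python
import math


def zero_fraction(counts):
    """Fraction of entries in a cells-by-genes count matrix that are zero."""
    total = sum(len(cell) for cell in counts)
    zeros = sum(value == 0 for cell in counts for value in cell)
    return zeros / total


def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        depth = sum(cell)
        if depth == 0:
            # A cell with no counts stays all-zero rather than dividing by zero.
            normalized.append([0.0] * len(cell))
            continue
        scale = target_sum / depth
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized


# Toy matrix: 3 cells x 4 genes, most entries unobserved (zero).
counts = [
    [0, 5, 0, 5],
    [2, 0, 0, 18],
    [0, 0, 1, 0],
]
sparsity = zero_fraction(counts)       # 7 of the 12 entries are zero
normalized = normalize_log1p(counts)   # zeros remain zeros after normalization
print(round(sparsity, 3))
```

Zeros survive normalization unchanged, which is why denoising and imputation are treated as separate modeling problems.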

Additionally, integrating multiple measurement modalities, experimental conditions, and timepoints in single-cell datasets, and linking the data in interpretable ways, becomes increasingly challenging as the complexity of experimental designs increases. Integrated datasets can be particularly powerful in identifying previously unidentified subpopulations of cells that may have interesting cell behaviors. However, adding samples, timepoints, and heterogeneous measures also increases the challenges of batch effects across experiments. Compounded by the sparsity challenge discussed above, combining single-cell datasets can be difficult.
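
A batch effect can be pictured as a per-experiment offset added to every cell from that experiment. The sketch below shows the crudest possible correction under that assumption: subtract each batch's per-gene mean and add back the global mean. The function name and toy values are hypothetical. The approach is only reasonable when batches contain comparable cell populations; otherwise it also erases real biological differences, which is part of why dedicated integration methods remain an active research area.

```python
from collections import defaultdict
from statistics import fmean


def center_by_batch(expression, batches):
    """Remove per-batch, per-gene mean offsets while keeping the global mean.

    `expression` is a cells-by-genes matrix (list of lists); `batches` gives
    the batch label of each cell, in the same order.
    """
    global_mean = [fmean(column) for column in zip(*expression)]
    rows_by_batch = defaultdict(list)
    for row, batch in zip(expression, batches):
        rows_by_batch[batch].append(row)
    batch_mean = {
        batch: [fmean(column) for column in zip(*rows)]
        for batch, rows in rows_by_batch.items()
    }
    return [
        [value - bm + gm for value, bm, gm in zip(row, batch_mean[batch], global_mean)]
        for row, batch in zip(expression, batches)
    ]


# Toy data: batch "b2" is shifted upward relative to "b1" in both genes.
expression = [[1.0, 10.0], [3.0, 10.0], [5.0, 20.0], [7.0, 20.0]]
batches = ["b1", "b1", "b2", "b2"]
corrected = center_by_batch(expression, batches)
print(corrected)
```

After centering, the two batches share the same per-gene means, while the within-batch differences between cells are preserved.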

Advances in the application of machine learning, deep learning in particular, to modeling integrated, large-scale, single-cell experiments across experimental conditions are the area where breakthroughs will be particularly impactful on cell therapy design. Researchers are using increasingly sophisticated model architectures to incorporate experimental conditions as covariates and predict the effect of perturbations with increasing success. I anticipate the continued exponential growth of experimental datasets coupled with progress in model architectures will yield significant gains over the next few years.


💥 Miscellaneous – a set of rapid-fire questions  

Favorite math paradox? 

BD: One puzzle I’ve long enjoyed, not a math ‘paradox’ per se, is the C-value enigma. It poses the question of why genome size (measured in nucleotide base pairs) is not correlated with organism complexity. For example, we know that some single-celled organisms, such as protists, have much larger genomes than arguably more complex organisms like humans. We understand now that non-coding DNA sequences, once termed ‘junk DNA’, are a significant piece of this puzzle. What we don’t fully understand is the incredible layers of complexity involved in orchestrating the developmental processes of complex organisms, like humans, which are performed by this non-coding DNA. Nor do we understand the full potential repertoire of phenotypic expression possible from the DNA of one organism.

What book can you recommend to an aspiring ML engineer?

BD: I recommend everyone read Robert Jordan’s The Eye of the World before Amazon starts streaming The Wheel of Time series. But if you're looking for something more technical, Ian Goodfellow’s Deep Learning is a favorite textbook.

Is the Turing test still relevant? Any clever alternatives?

BD: The Turing test always provided an example for philosophical discussions of AI rather than a practical engineering application. As a thought exercise, though, there’s a striking parallel between the Turing test and the immune system's role. Immune cells are effectively tasked with being the ‘interrogator’ in the Turing test, discriminating between self (healthy cells) and non-self (cancerous or infected cells). The development of a tumor is only possible when tumor cells learn to trick the immune system. The interesting thing that we’re doing at Modulus is trying to engineer an ‘interrogator’ that is better adapted than our native immune cells at differentiating between self and non-self, sort of a reversal of the Turing test.

Does P equal NP?

BD: Maybe, sometimes? But I’m more in the camp that a proof, if one existed, would likely be nonconstructive and unhelpful. As with protein structure prediction referenced above, a classical NP-hard optimization problem in biology, I expect we’ll be fumbling along with empirically-good-enough heuristic approaches for the foreseeable future.