🎙 Greg Finak, CTO of Ozette: using ML to extract intelligence from the immune system
We need to open the black box: without that transparency, there is no trust and verifiability
We’ve done interviews with ML practitioners from VC funds and ML startups; today we’d like to offer you a perspective from an implementation standpoint. Greg Finak, CTO and co-founder of the AI-powered immune profiling platform Ozette, explained how they leverage machine learning in high-resolution immune profiling, what types of datasets and data structures they typically use for cytometric data analysis, and which breakthroughs in deep learning could be relevant to immune profiling. Share this interview if you find it insightful. No subscription is needed.
*Thanks to my colleague Evan Greene, VP of Data and co-founder at Ozette, for contributing to the answers below.
👤 Quick bio / Greg Finak, CTO and co-founder
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.
Greg Finak (GF): I’m a scientist and academic turned CTO of Ozette. I fell into bioinformatics and computational biology during my Ph.D. training and continued in that vein afterward. I was fortunate to do research at the Fred Hutch, with access to incredible data sets and interesting computational challenges and scientific questions. That’s where we did the foundational research underpinning Ozette.
🛠 ML Work
Ozette applies machine learning to fascinating healthcare scenarios such as cell analysis and cytometry. Can you tell us about the type of ML scenarios you are trying to tackle at Ozette?
GF: We’re building a platform to automate the profiling of the immune system at unprecedented resolution. The technology to enable this has been around for 50 years, but only now are the analytics catching up to the dimensionality and scale of the generated data. To solve this challenge, we are focused on interpretability, which is critical for healthcare applications. We are using ML to derive insights into the dynamics and functioning of the immune system, and we can meaningfully do that only if our models are interpretable and their outputs verifiable. It’s not enough to just have a black-box model give us an answer. We need to be able to understand why it gave the answer it did and which biological features it deemed important to the scientific question at hand.
Cytometric data analysis is a key component of your work to extract intelligence from cellular structures. What type of datasets/data structures do you typically use for cytometric data analysis, and how different are they from the typical datasets we use in deep learning models? How difficult is it to adapt modern deep learning frameworks for cell-analysis scenarios?
GF: The dimensionality of the data we are working with is (in relative terms) not that large: up to 50 or several hundred dimensions, depending on the technology. But the task at hand is to identify the cellular structures within the data, and that search space is literally astronomical, even with a moderate number of dimensions. The ML frameworks we are applying to cytometric data to solve these problems are different from the frameworks used for typical deep learning. Our focus is on interpretable models, and the techniques and data structures we’ve developed arise from research by the founding team. Interpretability is critical. The key part of what we are building is the ability to open the black box and show our users what, how, and why the system deemed specific features important. Without that transparency, there is no trust and verifiability. This verifiability is critical to translating discoveries to the clinic, whether you are developing therapies, diagnostics, or drugs, or identifying biomarkers.
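To get a feel for why a moderate number of dimensions yields an astronomical search space, here is a back-of-the-envelope sketch (illustrative arithmetic only, not Ozette's method): if each of d markers in a cytometry panel can be treated as positive, negative, or ignored when defining a cell phenotype, the number of candidate phenotypes grows as 3^d.

```python
# Illustrative arithmetic (not Ozette's method): each of d markers can be
# positive, negative, or ignored in a candidate phenotype definition, so the
# number of candidate phenotypes is 3**d - 1 (excluding the all-ignored case).
def num_phenotypes(d: int) -> int:
    """Count marker-combination phenotypes for a d-marker panel."""
    return 3 ** d - 1

for d in (10, 30, 50):
    print(f"{d} markers -> {num_phenotypes(d):.3e} candidate phenotypes")
```

Even at 30 markers, a modest panel by modern standards, this is already on the order of 10^14 candidate phenotypes, which is why exhaustive enumeration is off the table and structure-aware methods are needed.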
Data labeling and classification for cytometric data structures has been a known challenge to apply machine learning to cell analysis. What type of techniques and frameworks do you use at Ozette to address this challenge?
GF: We're well aware of these challenges: how many cell types are truly there, what are they, and how should they be labeled? We’ve been working on these questions for several years, and we’ve developed techniques that perform cell type discovery and annotation in an unsupervised fashion, without the need for labeled training samples. This is a unique strength of what we are building at Ozette. Our platform can resolve an order of magnitude or more cell types in the data compared to existing approaches, and we can do so in a completely unbiased and data-driven way. This allows us to build up a large corpus of annotated data describing the state of the immune system in different disease settings and how it responds to different treatment regimes. Moreover, because our platform fully annotates all cell types, we can now compare and integrate these data across studies. There is an incredible amount of information in single-cell cytometry data that we’ve found has been greatly under-utilized.
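To make the idea of unsupervised discovery plus interpretable annotation concrete, here is a deliberately generic sketch, not Ozette's algorithm: cluster synthetic "cytometry" events with a plain k-means, then name each cluster by the marker positivity of its center, so every label a reader sees (e.g. CD3+CD4+CD8-) is directly verifiable against the data. The marker panel, thresholds, and populations are all invented for illustration.

```python
import numpy as np

# Generic sketch of unsupervised cell-type discovery -- NOT Ozette's method.
# Cluster events, then derive a human-readable phenotype label for each
# cluster from marker positivity (cluster mean above/below a threshold).

rng = np.random.default_rng(0)
markers = ["CD3", "CD4", "CD8"]  # hypothetical 3-marker panel

# Synthetic events: two populations, roughly CD3+CD4+CD8- and CD3+CD4-CD8+.
pop_a = rng.normal(loc=[5.0, 5.0, 0.5], scale=0.4, size=(300, 3))
pop_b = rng.normal(loc=[5.0, 0.5, 5.0], scale=0.4, size=(300, 3))
events = np.vstack([pop_a, pop_b])

def kmeans(x, k, iters=50, seed=0):
    """Minimal k-means; keeps a center unchanged if its cluster empties."""
    r = np.random.default_rng(seed)
    centers = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = []
        for j in range(k):
            pts = x[labels == j]
            new.append(pts.mean(axis=0) if len(pts) else centers[j])
        centers = np.array(new)
    return labels, centers

labels, centers = kmeans(events, k=2)

def annotate(center, threshold=2.5):
    """Name a cluster by marker positivity of its center."""
    return "".join(m + ("+" if c > threshold else "-")
                   for m, c in zip(markers, center))

for j, c in enumerate(centers):
    print(f"cluster {j}: {annotate(c)} ({np.sum(labels == j)} events)")
```

The point of the annotation step is the interpretability discussed above: instead of an opaque cluster ID, each discovered population carries a phenotype name that a domain expert can check directly against the marker measurements.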
Methods such as self-supervision and transformers are pushing the boundaries of computer vision in recent years. How relevant are these new computer vision techniques in the world of cytometric analysis?
GF: Cytometry datasets are produced under conditions that often lead to sample-to-sample variability (both because of natural biological variation as well as variability caused by technical effects). While improvements to measurement technology have produced higher- and higher-dimensional cytometry datasets in recent years, the underlying sources of variation have not disappeared. In fact, these sources of variation have made it challenging to derive comprehensive label sets on more recent high-dimensional studies. While we are interested in applying self-supervision approaches to this challenge, as we mentioned before, we’ve taken a different feature-engineering approach in our platform that is producing novel sets of annotations describing the immune system. We are exploring applying existing computer vision approaches to derived datasets augmented by these features.
What breakthroughs in deep learning that could be relevant to immune profiling are you expecting to see in the next 3-5 years?
GF: As interpretability, reproducibility, and trustworthiness are foundations of our platform, we try to keep up with (and are very excited by) research into these attributes of not only deep learning models but ML in general. We think systems integrating different classes of models, where each component excels in a particular problem area, will find great value in analyzing biological data. We are exploring these opportunities to analyze the database of annotations we are creating.
💥 Miscellaneous – a set of rapid-fire questions
Favorite math paradox?
GF: Simpson’s paradox. It comes up surprisingly often in biological data. It’s good to be aware of it as it provides a promising avenue for investigation when measured effects are the opposite of what’s expected.
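A quick worked example of the paradox, with illustrative numbers (these figures are a textbook-style toy, not from the interview): a treatment can look worse in aggregate yet better within every subgroup, because group sizes are imbalanced.

```python
# Simpson's paradox with toy numbers: treatment beats control in each
# severity subgroup, but loses when the subgroups are pooled.
data = {
    # group: (treated_success, treated_total, control_success, control_total)
    "mild":   (81, 87, 234, 270),
    "severe": (192, 263, 55, 80),
}

def rate(s, n):
    return s / n

for group, (ts, tn, cs, cn) in data.items():
    print(f"{group}: treated {rate(ts, tn):.2f} vs control {rate(cs, cn):.2f}")

# Pool the subgroups: the direction of the effect flips.
TS = sum(v[0] for v in data.values()); TN = sum(v[1] for v in data.values())
CS = sum(v[2] for v in data.values()); CN = sum(v[3] for v in data.values())
print(f"overall: treated {rate(TS, TN):.2f} vs control {rate(CS, CN):.2f}")
```

Here the treatment wins in both subgroups (0.93 vs 0.87 for mild, 0.73 vs 0.69 for severe) but loses overall (0.78 vs 0.83), because the treated arm is weighted toward severe cases. This is exactly the situation Finak describes: a reversed aggregate effect is a cue to look for a lurking stratifying variable.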
What book can you recommend to an aspiring ML engineer?
GF: Evan and I recommend “An Introduction to Statistical Learning: with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. I’d also recommend Statistical Rethinking by Richard McElreath.
Is the Turing Test still relevant? Any clever alternatives?
GF: A computer fooling a person into thinking they’re talking to another person is quaint, but I’m not convinced it’s a great benchmark. Analogously, in single-cell analysis, a common benchmark is showing that automated analysis methods reproduce manual analysis, even though everyone agrees manual analysis approaches are biased and error-prone. It’s a pretty low bar, but everyone’s focused on it. I think we should aim higher.
Does P equal NP?
GF: On the balance of probabilities, I suspect not. But I hope I’m wrong.