🎙 Adam Wenchel/CEO of Arthur AI on ML explainability, interpretability, and fairness
There is nothing more inspiring than learning from practitioners. The experience gained by researchers, engineers, and entrepreneurs doing real ML work is a great source of insight and inspiration. Please share these interviews if you find them enriching. No subscription is needed.
👤 Quick bio / Adam Wenchel
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.
Adam Wenchel (AW): I started working in AI in the ’90s (!) as a researcher at DARPA right as I was finishing my CS degree at the University of Maryland. After a couple of years, I was lured into the startup world and haven’t looked back. My previous startup leveraged AI to detect and block cybersecurity attacks. We were acquired by Capital One in 2015. Shortly after I joined, I had the awesome opportunity to work with the CEO and CIO to start their AI team and scale it up. ML gets real when you are deploying at an enterprise with billions of dollars of leverage – especially when it impacts the financial lives of millions of people!
🛠 ML Work
Arthur AI is tackling one of the most pressing challenges in ML today. What makes ML monitoring and explainability such a difficult problem, and how does it differ from monitoring traditional software systems?
AW: We’ve all seen the headlines about AI systems going wrong. Partnership on AI maintains an extensive list at Incidentdatabase.ai. When I was deploying ML systems at Capital One, there were no available solutions for this problem, and it kept me awake at night. That’s why we started Arthur, and it’s been exciting to see how much it has resonated with ML practitioners everywhere.
I’m always astounded by how many analytical models have been deployed in the last 20+ years without any monitoring. ML amplifies the need for guardrails enormously – the complexity and adaptivity of these systems go well beyond traditional statistical models. There are time series monitoring tools for traditional software applications, cybersecurity, and network operations, not to mention trading floors and power plants. AI monitoring is different – and needs its own tools, just as each of those domains does – because ML systems have unique failure modes that require an entirely new set of metrics. On top of that, the ecosystem you are plugging into is always unique – we’ve done quite a lot of work to make sure Arthur is brilliantly simple to deploy across all popular ML stacks and platforms!
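To make the idea of drift-specific metrics concrete, here is a minimal sketch (not Arthur's implementation) of the Population Stability Index, one common statistic for detecting when a model's live inputs have drifted away from the training distribution:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5000)      # training-time feature distribution
stable = rng.normal(0, 1, 5000)     # production data, no drift
shifted = rng.normal(0.5, 1, 5000)  # production data after a mean shift

print(psi(train, stable))   # small value: distributions match
print(psi(train, shifted))  # much larger value: the inputs have drifted
```

A rule of thumb practitioners use is that PSI below roughly 0.1 indicates a stable feature, while larger values warrant investigation.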
Fairness and bias mitigation are at the forefront of ML explainability research these days. What best practices and techniques can ML teams follow to ensure fairness and minimize bias in ML models?
AW: There is quite a lot that goes into approaching fairness and bias mitigation properly in the real world. It starts with basic good hygiene: making sure your data and models are well understood and documented using tools like model cards and datasheets for datasets. The good news is these not only help with fairness but also lead to better model performance and maintainability.
The next step is to make sure you really understand the entire system and how it affects end users. Proper governance, involvement of subject-matter experts, people who bring a sociological perspective, and representation from the affected communities are key. Think critically about the application domain and prediction task as well as the data you have available. Are particular subsets of the data undersampled? What are you going to do about that? Are your labels good, or are they systematically biased in some way? Is your prediction target what you actually want to predict, or is it a potentially biased proxy?
We recently presented a workshop on this topic with Humana at the ACM’s Fairness, Accountability, and Transparency (FAccT) conference. That presentation is available online here.
Once you’ve understood the system, think carefully about your model’s goals and what you want to define as fairness. Defining fairness for a particular application can end up having a fair amount of nuance. What is it that you want to equalize? Is it worse to deny someone [x thing] when they should have had [x thing], or is it worse to give someone [x thing] when they shouldn't have had it? This is something that incorporates both business concerns and social responsibility and should shape how you think about and evaluate fairness performance and mitigation.
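These questions map directly onto computable quantities. A toy sketch (with illustrative made-up data, not from the interview) of the two error rates in question, broken out by group:

```python
import numpy as np

# Illustrative toy data: true outcomes, model decisions, and a group attribute.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])
group  = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

def error_rates(g):
    m = group == g
    fnr = np.mean(y_pred[m & (y_true == 1)] == 0)  # denied when they qualified
    fpr = np.mean(y_pred[m & (y_true == 0)] == 1)  # approved when they didn't
    return float(fnr), float(fpr)

for g in ("a", "b"):
    print(g, error_rates(g))
```

Here the two groups happen to have equal false negative rates but different false positive rates, so whether the model is "fair" depends entirely on which of the two errors you decided matters more for your application.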
ML explainability seems to vary a great deal with the target model and environment. How different is it to interpret a simple model such as a decision tree versus a complex deep learning model such as a transformer, or a language model versus a computer vision model?
AW: There are approaches that work across all algorithms (“black-box” methods) as well as algorithm-specific approaches. Examples of the former are LIME and SHAP; an example of the latter is integrated gradients. These all generate localized feature importances, which give you insight into which features led the model to make a particular decision. When you go across data types – tabular data, computer vision, NLP – it takes some work to make the outputs of these techniques actionable. For example, we present computer vision explanations as a “heat map” where regions of pixels in an image are shaded green or red depending on their impact on an image classification task.
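The core idea behind a black-box local technique like LIME can be sketched in a few lines: perturb the input near the instance you want to explain and fit a linear surrogate to the model's responses (a simplified illustration with a hypothetical model, not the full LIME algorithm):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A hypothetical black-box model whose behavior depends on the region:
# feature 1 drives the output when feature 0 is positive, feature 2 otherwise.
def black_box(X):
    return np.where(X[:, 0] > 0, 3.0 * X[:, 1], -2.0 * X[:, 2])

rng = np.random.default_rng(0)
x0 = np.array([1.0, 0.5, -0.5])              # the instance to explain

# Perturb around x0 and fit a linear surrogate to the model's outputs there.
Z = x0 + rng.normal(scale=0.1, size=(500, 3))
surrogate = LinearRegression().fit(Z, black_box(Z))

print(np.round(surrogate.coef_, 2))          # local feature importances
```

The surrogate's coefficients reveal that, in this neighborhood, feature 1 is what the model keys off – even though elsewhere in the input space the model ignores it entirely.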
Feature importance is just one facet of model interpretability. Another area we have invested in is counterfactuals. Counterfactual explanations analyze the model and the data to find the easiest path to a different, more desirable outcome. For instance, if a model identifies you as being high risk for a health condition such as diabetes or heart disease, a counterfactual explanation identifies the easiest route to lowering the risk identified by the model. For anyone who is curious, we wrote a treatise on counterfactuals which won a NeurIPS workshop best paper award this past year.
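For a linear model the intuition can be sketched directly: the nearest counterfactual is essentially the projection of the instance onto the decision boundary (a toy illustration on synthetic data, not Arthur's method):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy risk model on two hypothetical features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, 1.0]) + rng.normal(scale=0.3, size=200) > 0).astype(int)
clf = LogisticRegression().fit(X, y)

def counterfactual(x):
    """Smallest change (for a linear model) that crosses the boundary."""
    w, b = clf.coef_[0], clf.intercept_[0]
    margin = x @ w + b                 # signed distance scaled by ||w||
    step = -(margin / (w @ w)) * w     # projection onto the boundary
    return x + 1.05 * step             # nudge slightly past it

x_high_risk = np.array([1.0, 1.0])
x_cf = counterfactual(x_high_risk)
print(clf.predict([x_high_risk])[0], clf.predict([x_cf])[0])  # flips 1 -> 0
```

Real counterfactual methods add constraints this sketch ignores – keeping the suggested change plausible, actionable, and restricted to features a person can actually alter.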
One common misconception is that simple models such as decision trees or logistic regression are inherently more interpretable. That is true only to a limited extent: once you have many features (say, more than 15), no human brain can process all of the decision tree nodes or logistic regression coefficients and take away something meaningful. When you automate your statistics (ML), you need to automate the observability as well!
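A quick way to see this: fully grow a decision tree on a modest 20-feature synthetic dataset and count the nodes a human would have to read (a scikit-learn sketch):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A modest dataset: 2,000 rows, 20 features, a little label noise.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Hundreds of nodes -- far beyond what anyone can read and reason about.
print(tree.tree_.node_count)
```

The model is "interpretable" in principle – every split is visible – but in practice nobody can hold that many branches in their head at once.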
ML interpretability is a highly diverse problem that has sparked many interesting ideas to explain ML models’ behavior. Can you expand on ideas such as intrinsic vs. post-hoc or local vs. global interpretability?
AW: The complexity of the pattern matching performed by ML models is what gives them their power, but it is also what makes them inscrutable without additional tools. Since most people want more powerful algorithms and the performance that comes with them, they also need the tools that make those algorithms interpretable.
Similarly, the complexity of these powerful models means that global explanations are usually misleading. A global explanation is essentially the average of all the local explanations. But since the model can key off completely different features in different areas of your data space, the aggregate often hides what is really going on. That’s why being able to analyze the local and regional explanations is so critical – so you can really understand how your model is making predictions.
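A tiny numerical illustration of how averaging hides the signal: if a hypothetical model keys off a feature with opposite signs in two regions of the data, the global average importance is near zero even though the feature dominates every local explanation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Suppose the model's sensitivity to feature 0 is +2 in one region
# (feature 1 positive) and -2 in the other: a strong local effect everywhere.
local_imp = np.where(X[:, 1] > 0, 2.0, -2.0)

print(local_imp.mean())          # near zero: the misleading "global" view
print(np.abs(local_imp).mean())  # 2.0: what every local explanation shows
```

The global average suggests the feature barely matters, while every local explanation shows it is the main driver – which is exactly why regional analysis matters.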
What are some of your most ambitious ideas about ML monitoring and explainability? Can we get to a point in which ML models are basically able to explain themselves?
AW: If we implement AI systems effectively we have the opportunity to begin to remove generations of bias and make systems that are more accurate and helpful. Humans are good at offering explanations for why they made a decision, but quite often those explanations are inaccurate. We don’t truly understand how our brains work and implicit bias defies self-awareness. Structurally, there will always be value in models that are optimized for their target task and then monitored and validated by separate, independent models.
By subscribing, you support our mission to simplify AI education, one newsletter at a time. You can also give TheSequence as a gift.
💥 Miscellaneous – a set of rapid-fire questions
Is the Turing Test still relevant? Any clever alternatives?
AW: The Turing Test remains relevant! Language tasks are one of the fundamental pieces of human intelligence. It is extremely unlikely any single test will sufficiently prove the existence of AGI by itself.
Favorite math paradox?
AW: The Birthday Paradox.
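For readers unfamiliar with it: with only 23 people, the odds that two share a birthday already exceed 50%, which a few lines of Python confirm:

```python
import math

def p_shared(n):
    """Probability that at least two of n people share a birthday."""
    p_distinct = math.prod((365 - k) / 365 for k in range(n))
    return 1 - p_distinct

print(round(p_shared(23), 3))  # 0.507 -- crosses 50% at just 23 people
```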
Any book you would recommend to aspiring data scientists?
AW: There are many great technical books on ML, but I’d also encourage them to read some of the newer books highlighting the societal impact of these systems, such as Weapons of Math Destruction, Automating Inequality, and Race After Technology.
Does P equal NP?
AW: For the vast majority of traveling salespeople, yes.