🎙 Google’s Allen Day on Using ML in the Cryptocurrency Space

Jun 22, 2022

It’s so inspiring to learn from practitioners and thinkers. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration.

👤 Quick bio

Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.

Allen Day (AD): I work in Google Cloud’s developer relations team. Our mission is to build a best-in-class experience for cloud devs. Within this team, I advocate for Google Cloud's web3 and data & analytics products. These span the range of engineering data pipelines, from ingest through analytics and machine learning. I spend most of my time with the data processing and transformation products.

Regarding how I got into machine learning, it wasn’t through deliberate intention but rather the result of my lifelong interest in exploring and building at the intersection of computer code and DNA-based biocode. This started with self-study and learning to program a computer at six years old, and led me to pursue a graduate degree in bioinformatics, during which I learned how to use distributed systems to implement machine learning algorithms to do research in human genetics.

🛠 ML Work

In the last few years, your work has focused on applying data engineering and intelligence to blockchain/crypto datasets. Give us some context on the inspiration and nature of this work and tell us about some of the biggest challenges for processing blockchain data.

AD: I got interested in cryptocurrencies in 2013 but didn't get around to learning about the blockchain data structures until Ethereum's ICO boom in 2017. I noticed there were some structural parallels — the blockchain transaction graph looks like the graph of genetic interactions inside a cell. So I decided to apply some simple analyses to find e.g. central nodes and write a blog post about how I did it. It ended up being more data engineering work than I expected to get to a few charts in a Jupyter notebook. I decided that nobody should need to do that work again, so I open-sourced the ETL and put the processed data into a free-to-access BigQuery dataset. Then I wrote my blog post. It was very well received by the blockchain community. Many analysts and engineers reached out to me, and a community formed around the open data.
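As an illustration of the kind of "central node" analysis mentioned here (not code from the interview), the following minimal sketch computes degree centrality on a toy transfer graph. The function name, the edge format, and the addresses are all hypothetical; the original analysis may well have used a different centrality measure.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Score each address by the share of other addresses it has transacted with."""
    counterparties = defaultdict(set)
    for sender, receiver in edges:
        if sender == receiver:
            continue  # ignore self-transfers
        counterparties[sender].add(receiver)
        counterparties[receiver].add(sender)

    node_count = len(counterparties)
    if node_count < 2:
        return {node: 0.0 for node in counterparties}
    return {
        node: len(peers) / (node_count - 1)
        for node, peers in counterparties.items()
    }


# Hypothetical toy edges (sender, receiver); the labels are made up.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
scores = degree_centrality(toy_edges)
most_central = max(scores, key=scores.get)
print(most_central, scores[most_central])
```

On real chain data the edge list would come from a transactions table rather than a literal, and value-weighted or path-based measures may be more informative than a plain counterparty count.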

It became clear that we needed to address two key challenges to meet the community's needs: (1) a robust DevOps architecture (Kubernetes, Docker) to keep up to date with a blockchain network's consensus state, and (2) an extensible architecture for ETLing complex streaming data (Pub/Sub, Dataflow, Airflow) so that we could work with other blockchains such as Ethereum. I teamed up with a talented data engineer, Evgeny Medvedev, and we built the Blockchain ETL community and open-source software project.

Today at GCP we maintain ~20 of these datasets in BigQuery. There's a Kaggle community analyzing them, and Evgeny went on to build a blockchain analytics company, Nansen, based on our work.

Anonymity is a key characteristic of blockchain data but, at the same time, a major roadblock for deriving intelligence from this type of dataset. What are some ML methods that can be used to de-anonymize blockchain datasets?

AD: If we consider all of the data on all of the public blockchains, there are indeed some small areas that are effectively invisible. For the majority of the data, though, we can see the transactions. Some blockchains are account-based so we can directly see system actors. Other blockchains are transaction-based and we need to use clustering methods to build synthetic identities. In all cases, we can reduce the ledger activity to a working set of system actors.
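A minimal sketch of how such clustering can work, offered as an illustration rather than as the method used in the interviewee's projects. It assumes the common-input-ownership heuristic (addresses spent together as inputs of one transaction are treated as one actor), which the interview does not name, and it uses a hypothetical transaction format with made-up addresses.

```python
from collections import defaultdict


def cluster_addresses(transactions):
    """Group addresses that are co-spent as inputs of the same transaction."""
    parent = {}

    def find(addr):
        # Union-find lookup with path halving.
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for tx in transactions:
        inputs = tx["inputs"]
        for addr in inputs:
            find(addr)  # register every input address, even singletons
        for other in inputs[1:]:
            union(inputs[0], other)

    clusters = defaultdict(set)
    for addr in list(parent):
        clusters[find(addr)].add(addr)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy transactions; the addresses are made-up labels.
toy_transactions = [
    {"inputs": ["A", "B"], "outputs": ["X"]},
    {"inputs": ["B", "C"], "outputs": ["Y"]},
    {"inputs": ["D"], "outputs": ["Z"]},
]
clusters = cluster_addresses(toy_transactions)
print(clusters)
```

Here A, B, and C end up in one synthetic identity because B links the first two transactions, while D stays on its own. The heuristic is known to produce false merges for multi-party transactions, so a real pipeline would need extra filtering.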

From here, it's common to create continuous features, for example via dispersion modeling to estimate contamination from a ransomware payment address. It's also common to use public label data to create categorical features, for example using a random forest to find look-alikes to known labeled actors (miners, traders) based on their activity aggregated over time.
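To make the idea of a continuous "contamination" feature concrete, here is a minimal sketch, not taken from the interview. It assumes a proportional ("haircut") rule in which each transfer carries the sender's current tainted fraction, and it uses a simplified account-style ledger. The function name, the transfer format, and all balances are hypothetical.

```python
def propagate_taint(transfers, source, initial_balances):
    """Return the tainted fraction of each non-empty balance after all transfers."""
    balance = dict(initial_balances)
    # Everything initially held by the source address counts as tainted.
    tainted = {source: balance.get(source, 0.0)}

    for sender, receiver, value in transfers:  # assumed chronological order
        sender_balance = balance.get(sender, 0.0)
        if sender_balance <= 0 or value <= 0:
            continue  # nothing to move
        value = min(value, sender_balance)
        fraction = tainted.get(sender, 0.0) / sender_balance
        moved_taint = value * fraction

        balance[sender] = sender_balance - value
        tainted[sender] = tainted.get(sender, 0.0) - moved_taint
        balance[receiver] = balance.get(receiver, 0.0) + value
        tainted[receiver] = tainted.get(receiver, 0.0) + moved_taint

    return {
        addr: tainted.get(addr, 0.0) / bal
        for addr, bal in balance.items()
        if bal > 0
    }


# Hypothetical ledger: all names and amounts are made up.
balances = {"ransom": 10.0, "exchange": 15.0}
transfers = [("ransom", "mixer", 10.0), ("mixer", "exchange", 5.0)]
taint_scores = propagate_taint(transfers, "ransom", balances)
print(taint_scores)
```

The resulting per-address score lies between 0 and 1 and could serve as one column in a feature table alongside aggregated activity statistics. Other taint rules (for example first-in-first-out) would give different scores.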

You seem particularly interested in the emerging area of graph neural networks (GNNs) which, in principle, seems like a perfect fit for blockchain data structures. What are the potential and the challenges of applying GNNs to blockchain data?

AD: Yes, definitely! Graph database investment and popularity in graph analytics workloads continue to grow. Their data access capabilities are on the cusp of being generally usable and there is an opportunity to apply graph databases to blockchain data structures.

Why do we care about graph data structures at all? A graph is the ultimate generalized data structure. It captures and can represent the blockchain data with high fidelity, and it has the capability to encode rich relationships between nodes (temporal, semantic, social, spatial, functional). We've already demonstrated that there's useful inductive bias for non-graph-based methods. It seems reasonable to expect that graph-aware models like GNNs will outperform the more basic methods.

I also think it's the right time to be thinking about this. As I described earlier, most of the activity on-chain is open for all to see. But we should expect these data to become more obfuscated and opaque over time. After all, one of the fundamental technologies upon which blockchains are based is cryptography. So more hiding capabilities will be introduced, and the awareness of on-chain actors that they're living in a dark forest will also increase. This becomes an adversarial ML problem.

So we'll need the more powerful capabilities that are unlocked with GNNs, like identifying anomalous transactions, and conversely which transactions don't exist (...yet) that should. Classifying nodes with GNN embeddings and applying graph kernels to characterize neighborhoods will also prove useful.
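For readers new to GNNs, the sketch below shows the neighborhood-aggregation step that node embeddings are built from. It is an illustration only, not from the interview: real GNN layers add learned weight matrices and nonlinearities, which are omitted here, and the node names and feature vectors are hypothetical.

```python
def message_passing_step(features, edges):
    """Blend each node's feature vector with the mean of its neighbors' vectors."""
    neighbors = {node: [] for node in features}
    for src, dst in edges:
        # Treat transfers as undirected links for this sketch.
        neighbors[src].append(dst)
        neighbors[dst].append(src)

    updated = {}
    for node, own in features.items():
        incoming = [features[n] for n in neighbors[node]]
        if incoming:
            mean = [sum(column) / len(incoming) for column in zip(*incoming)]
        else:
            mean = list(own)  # isolated nodes keep their own features
        updated[node] = [0.5 * a + 0.5 * b for a, b in zip(own, mean)]
    return updated


# Hypothetical two-dimensional features per address.
node_features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
transfer_edges = [("a", "b"), ("b", "c")]
embeddings = message_passing_step(node_features, transfer_edges)
print(embeddings)
```

Repeating the step lets information travel further across the graph, so after k rounds each vector summarizes the node's k-hop neighborhood. Those vectors are what a downstream node classifier or anomaly detector would consume.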

Decentralized ML is an interesting area at the intersection of ML and blockchain runtimes but one that hasn’t achieved major traction. There are even areas such as federated learning (created by Google) which seem to be a great fit for decentralized ML, and still we have failed to see major adoption of blockchain runtimes for ML solutions. What are the major technological roadblocks for realizing the promise of decentralized ML and are we likely to see ML models running on blockchains?

AD: The theme of your question seems to be about building ML microservices that use a blockchain backplane.

We're already seeing this today with blockchain oracles: middleware solutions that address the software oracle problem. I pioneered the concept of hybrid blockchain/cloud applications with Chainlink, and the essential problem we solved was how to run intense workloads by decoupling the on-chain compute for logging the transaction from the resources needed to deliver the result. As a concrete example, this design pattern allows spinning up a Docker container to train a model or perform inference using a GPU and get results delivered on-chain. Blockchains employ checksums everywhere, so a nice feature that you get for free by doing this is responsible AI: the input dataset can be transparent and verified, and the model training/inference processes are deterministic and reproducible.
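The checksum point can be illustrated with a small sketch, which is not from the interview and does not use any Chainlink API. It assumes a hypothetical off-chain worker that receives a dataset plus the SHA-256 digest committed with the request, refuses to run if they do not match, and returns a receipt whose digests could be logged on-chain. All function names and the toy dataset are made up.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def run_verified_job(dataset: bytes, expected_digest: str, job):
    """Run a deterministic job only if the dataset matches the committed digest."""
    if sha256_hex(dataset) != expected_digest:
        raise ValueError("dataset does not match the committed digest")
    result = job(dataset)
    # Serialize with sorted keys so the same result always hashes the same way.
    payload = json.dumps(result, sort_keys=True).encode()
    return {
        "input_digest": expected_digest,
        "result": result,
        "result_digest": sha256_hex(payload),
    }


def mean_of_values(dataset: bytes):
    """Stand-in for training or inference: a deterministic summary statistic."""
    values = [float(x) for x in dataset.decode().split(",")]
    return {"mean": sum(values) / len(values)}


toy_dataset = b"1,2,3,4"
committed_digest = sha256_hex(toy_dataset)
receipt = run_verified_job(toy_dataset, committed_digest, mean_of_values)
print(receipt["result"], receipt["result_digest"][:16])
```

Because the job is deterministic, anyone holding the same dataset can rerun it and compare digests with the receipt. A real GPU training job would also need pinned seeds and library versions to stay reproducible.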

Regarding federated learning, I haven't seen an implementation that coordinates with a blockchain, but it seems possible. We can reuse the same oracle-based worker pattern described above, and converge with a MapReduce orchestrator. The techniques used to survive in the dark forest, like zero-knowledge proofs, may also be helpful here for managing privacy as blockchain-integrated ML models are brought to market.

The intersection of crypto and ML is a fascinating area full of possibilities. NFTs, decentralized finance (DeFi), programmable stablecoins, all seem areas that can be impacted by the adoption of ML methods. Could you elaborate on the potential and pragmatic applications of intelligent crypto assets?

AD: With regard to ML and NFTs, we're seeing NFTs that grant the owner access to ML-linked products and experiences, acting like a license key or a config file. ML is already being used in off-chain trading systems, and I expect we'll also see the on-chain equivalent of this, where oracle-linked ML models are integral to the automated protocols that power decentralized finance and games.

It's a great time to get involved at the intersection of ML and crypto, and it's been an honor to share with your audience some current market opportunities and areas of open inquiry. I'm excited to see more ML practitioners get involved and see what they'll create.

💥 Miscellaneous – a set of rapid-fire questions

What book can you recommend to an aspiring ML engineer?

AD: The Elements of Statistical Learning (free PDF) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman; Introduction to Information Retrieval (free PDF) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze; Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze.

Does P equal NP?

AD: I don’t think they’re equal, no. If P=NP there are of course major ramifications for cryptography and the entire stack of blockchain applications built on top of that. But it’s a tiny disruption in relation to all of our assumed limitations that get broken.

Perhaps this question is so captivating because of how close it is to the human condition. We want unlimited reach (P=NP) while operating from a place of total safety (P!=NP). But the math sublimely indicates we can’t have it both ways; this is both beautiful and terrifying.

TheSequence
