🎙 Ronen Dar, Run:AI's CTO, on managing compute resources in ML pipelines
It’s so inspiring to learn from practitioners. The experience of researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you find it enriching. No subscription is needed.
👤 Quick bio / Ronen Dar
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.
Ronen Dar (RD): I did my bachelor’s, master’s, and Ph.D. at Tel Aviv University. I met Omri Geller (Run:AI’s CEO) there; he was working toward his master’s while I was doing my Ph.D. Alongside my studies, I was also working in industry for a startup called Anobit Technologies, a maker of flash storage technology that Apple acquired during my time there. I stayed on at Apple for several years, and it was really fun having one foot in academia and the other in industry. But when I finished my Ph.D., I had to choose. At first, I chose academia and did my postdoc as a research scientist at Bell Labs in the US. My Ph.D. and postdoc showed me the importance of having easy access to computing power and what you can achieve as a researcher when you have access to large amounts of computing resources. Omri and I knew that we could put unlimited computing power into the hands of every researcher, so when we decided to start Run:AI, I made the switch from academia to becoming a founder and came back to Israel to work alongside Omri as the CTO.
🛠 ML Work
Could you tell us about the vision and inspiration for Run:AI?
RD: There are two key challenges for AI development right now, and they’re only going to increase in importance as more and more companies start doing AI. The first is that AI adoption across enterprises in nearly every industry is driving demand for more powerful computing resources, namely GPUs, to provide the levels of computing power needed for AI at scale.
The second is that it’s incredibly difficult to access the full computing power of these new GPUs. So many organizations are struggling with GPU allocation and orchestration.
Omri and I saw a gap between the amount of computing power that GPUs can offer and the amount that current orchestration tools can access and provision. A new software stack is taking shape to deal with this issue, and we wanted to be part of that stack. The vision of Run:AI is to accelerate AI-driven innovation in every industry by making it easy for researchers and IT to access and manage all their available computing power.
Optimizing hardware architectures is one of the most cumbersome and often ignored aspects of ML applications. Could you walk us through some of the main challenges of optimizing hardware for machine learning/deep learning?
RD: For organizations just beginning their AI initiatives, there is a need to optimize the algorithms themselves. Even when you have just one algorithm, one workload running and consuming compute power, it’s really difficult to optimize how that algorithm is using that computing power.
There is also an optimization challenge when you have a lot of workloads running across several GPUs. How will computing resources be shared when there are multiple workloads? How will you ensure that each researcher and team gets their fair share? How will you size each workload, and in the end, how will all of them fit together on one shared infrastructure? That isn’t easy, and that’s a different kind of optimization.
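To make that fairness question concrete, here is a minimal sketch of weighted fair-share allocation over a fixed GPU pool. The team names, weights, and the allocation heuristic are assumptions made for illustration; this is not Run:AI's actual scheduler logic.

```python
# Toy weighted fair-share division of a GPU pool across teams.
# Purely illustrative: the team names, weights, and the heuristic itself are
# assumptions made for this sketch, not Run:AI's actual scheduling logic.

def fair_share(total_gpus, demands, weights):
    """Allocate GPUs proportionally to team weights, capped at each team's demand.
    Capacity left over by under-demanding teams is redistributed to the rest."""
    alloc = {team: 0 for team in demands}
    unsatisfied = {t for t in demands if demands[t] > 0}
    remaining = total_gpus
    while remaining > 0 and unsatisfied:
        weight_sum = sum(weights[t] for t in unsatisfied)
        granted_this_round = 0
        for team in sorted(unsatisfied, key=weights.get, reverse=True):
            # Proportional share of what is still free: at least 1 GPU, never
            # more than the team still wants or than what is left this round.
            share = max(1, round(remaining * weights[team] / weight_sum))
            grant = min(share, demands[team] - alloc[team], remaining - granted_this_round)
            alloc[team] += grant
            granted_this_round += grant
        remaining -= granted_this_round
        unsatisfied = {t for t in unsatisfied if alloc[t] < demands[t]}
        if granted_this_round == 0:
            break  # nothing left to hand out
    return alloc

if __name__ == "__main__":
    # 8 GPUs, three teams: "nlp" asks for more than its share, "vision" for less.
    print(fair_share(
        total_gpus=8,
        demands={"nlp": 6, "vision": 1, "speech": 4},
        weights={"nlp": 0.5, "vision": 0.25, "speech": 0.25},
    ))  # -> {'nlp': 5, 'vision': 1, 'speech': 2}
```

The point of the sketch is the redistribution step: a team that asks for less than its share frees capacity that other teams can immediately consume, which is exactly the dynamic that static per-team quotas miss.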
Run:AI seems to rely heavily on Kubernetes for its scheduling capabilities. How relevant has the role of Kubernetes been in the evolution of modern ML architectures?
RD: Kubernetes is at the heart of what's going on today in the ML space. Right now, it's like the perfect storm is happening: there are new AI applications, so new kinds of workloads with new compute requirements. Then you also have those new computing resources, those GPUs, those deep learning accelerators. And on top of that, the world is shifting to cloud-native infrastructure. Companies are moving their infrastructure from a virtualized environment to containers, Kubernetes, and other cloud-native technologies.
The problem when you put all these things together is that Kubernetes wasn't built to run compute-intensive AI workloads on this new hardware. It was built to run microservices on consumer CPUs. There are major gaps in what Kubernetes provides today. It lacks advanced preemption mechanisms that ensure fairness and doesn’t use multiple queues to efficiently orchestrate long-running jobs. In addition, K8s is missing gang scheduling for scaling up parallel processing AI workloads to multiple distributed nodes and topology awareness for optimizing performance. Kubernetes clusters often result in resources left idle for too long, and users find themselves limited in the compute power they can consume. The Run:AI scheduler sits on top of Kubernetes and specifically targets these shortcomings to provide a made-for-AI scheduling solution.
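Gang scheduling, one of the gaps mentioned above, is easy to illustrate: a distributed training job should only be placed when all of its workers can start at once; otherwise partially started workers sit on GPUs waiting for peers that never arrive. Below is a minimal toy sketch of that all-or-nothing placement rule; the Node, Job, and gang_schedule names are invented for this example and are not Kubernetes or Run:AI APIs.

```python
# Toy illustration of gang scheduling: a multi-worker training job is placed
# only if every one of its workers can get GPUs at the same time; otherwise
# nothing is reserved and the job keeps waiting. Names are invented for this
# sketch -- this is not Kubernetes or Run:AI code.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_gpus: int

@dataclass
class Job:
    name: str
    workers: int          # replicas that must start together (e.g. DDP ranks)
    gpus_per_worker: int

def gang_schedule(job, nodes):
    """Return a worker -> node placement covering ALL workers, or None."""
    placement, free = [], {n.name: n.free_gpus for n in nodes}
    for worker in range(job.workers):
        target = next((n for n in nodes if free[n.name] >= job.gpus_per_worker), None)
        if target is None:
            return None  # the whole gang can't fit, so reserve nothing at all
        free[target.name] -= job.gpus_per_worker
        placement.append((f"{job.name}-worker-{worker}", target.name))
    for n in nodes:          # commit only once every worker has a slot
        n.free_gpus = free[n.name]
    return placement

if __name__ == "__main__":
    cluster = [Node("node-a", free_gpus=4), Node("node-b", free_gpus=4)]
    print(gang_schedule(Job("resnet-ddp", workers=3, gpus_per_worker=2), cluster))
    # all three workers placed; 2 GPUs remain free on node-b
    print(gang_schedule(Job("bert-ddp", workers=2, gpus_per_worker=2), cluster))
    # -> None: node-b's 2 free GPUs stay available for smaller jobs instead of
    #    being held idle by a half-started gang
```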
Elastic compute resource allocation has been one of the main value propositions of cloud computing platforms, but it doesn’t seem to quite work in the ML space. What are some of the key challenges of GPU resource management in ML applications, and how does it differ from traditional software solutions?
RD: One big challenge with GPU management is that, unlike CPUs and traditional applications running on CPU cores, GPUs are allocated to applications statically and exclusively. When applications start to run, they get static allocations of GPUs, and sharing those GPUs between multiple workloads is typically really inefficient. In the CPU world, there is virtualization, but with GPUs, you don't have that software layer with the ability to orchestrate workloads in a dynamic way. Manual allocation of GPUs leads to poor GPU utilization. Many organizations share with us that their typical GPU utilization is at 10-20%, with highly limited data science productivity. You need a software layer to allocate workloads on the GPUs dynamically and really let the workloads share the GPUs dynamically and efficiently.
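Here is a toy back-of-the-envelope comparison of the two allocation models described above: one exclusive GPU per job versus packing fractional jobs onto shared GPUs. The job sizes and the first-fit packing heuristic are illustrative assumptions only, not how Run:AI (or any specific product) implements GPU sharing.

```python
# Toy comparison of static whole-GPU allocation vs. fractional sharing.
# Numbers and the packing heuristic are illustrative assumptions only --
# this is not how Run:AI (or any specific product) implements GPU sharing.

def static_allocation(job_fractions):
    """One exclusive GPU per job, no matter how little of it the job uses."""
    gpus = len(job_fractions)
    return gpus, sum(job_fractions) / gpus

def fractional_allocation(job_fractions):
    """First-fit packing of fractional jobs onto shared GPUs."""
    gpus = []  # each entry is the used fraction of one GPU
    for frac in sorted(job_fractions, reverse=True):
        for i, used in enumerate(gpus):
            if used + frac <= 1.0:
                gpus[i] += frac
                break
        else:
            gpus.append(frac)  # no GPU had room, open a new one
    return len(gpus), sum(job_fractions) / len(gpus)

if __name__ == "__main__":
    # Eight notebook / light-training jobs, each using 15-40% of a GPU.
    jobs = [0.15, 0.2, 0.25, 0.3, 0.3, 0.35, 0.4, 0.4]
    print("static:    ", static_allocation(jobs))      # 8 GPUs, ~29% utilization
    print("fractional:", fractional_allocation(jobs))  # 3 GPUs, ~78% utilization
```

With these made-up job sizes, exclusive allocation burns eight GPUs at roughly 29% utilization, while simple fractional packing fits the same work onto three GPUs at close to 80%.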
ML hardware is becoming increasingly specialized, to the point that it is overwhelming to keep up with the innovations in the space. Do you see a mismatch between ML hardware and software architectures? Does clients’ hardware affect how you approach AI workload orchestration?
RD: Yeah, we do see a mismatch between ML hardware and software architectures, in that the software just doesn't answer all of the workloads’ needs. The existing software layers aren't built to fit just any hardware, which makes it really difficult for new hardware to come in and integrate with the existing ML architecture.
Hardware companies are investing heavily in their software stacks so that they can integrate with existing software architectures, but it's really, really difficult. That’s where Run:AI is trying to help. We're building our software architecture to fit any AI hardware. It’s important for us to be neutral and able to support any hardware. We think that is key to enabling innovation and beneficial competition in the AI hardware space.
💥 Recommended book
RD: Well, if you’re an ML engineer who aspires to become a founder, check out The Hard Thing About Hard Things by Ben Horowitz. He co-founded a company and sold it to HP for more than $1 billion. Then he co-founded the venture capital firm Andreessen Horowitz. He’s one of the most famous VCs in the AI world, and in the book he shares his experiences building a company.