🎙 Brian Venturo/CoreWeave on GPU-first ML infrastructure

How cryptocurrency mining led the team to challenge “big 3” cloud providers

Nov 10, 2021

It’s inspiring to learn from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you find it enriching. No subscription is needed.

👤 Quick bio / Brian Venturo

Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.

Brian Venturo (BV): I’m Brian Venturo, Co-Founder and CTO of CoreWeave. Prior to CoreWeave, I spent over a decade building and running hedge funds focused on energy markets. In 2016, Mike Intrator (CEO), Brannin McBee (CSO), and I bought our first GPU and began experimenting with cryptocurrency mining. Over the next few years, as a hobby became our sole business focus, we built a large-scale infrastructure spanning seven facilities and inched closer towards our goal of building a cloud infrastructure that provided the world’s creators and innovators access to scalable infrastructure with approachable prices – something that the industry was largely missing. Machine learning and batch processing were the first high-performance computing use cases we served, and something that we wanted to support at scale as we continued building CoreWeave Cloud. I’m happy to announce that we just raised $50 million (some news for your Sunday Scope!) to accelerate the growth of the business.

🛠 ML Work

Could you tell us about the vision and inspiration for CoreWeave Cloud? Why does the market need a GPU-first cloud platform when we can access GPU resources as part of mainstream cloud computing platforms such as AWS, GCP or Azure?

BV: When we began building CoreWeave Cloud, we set out to help empower engineers and creators to access compute on-demand at a massive scale for GPU accelerated use cases. We were all too familiar with the inflexibility and high cost of compute on legacy cloud providers and believed that we could help our clients create world-changing technology more effectively by removing barriers to scale. Machine learning and batch processing are classic examples of this – we're consistently blown away by what our clients can do when they’re able to train, iterate, fine-tune, serve models, and analyze data faster.

The challenges that our clients face with the “big 3” cloud providers can be summarized across three themes:

They have a limited variety of compute. On the “big 3” clouds, you can access a couple of high-end GPUs for distributed training and a single SKU, usually the T4, for virtually everything else. CoreWeave is NVIDIA’s first and only Elite Cloud Services Provider for compute, and we take our commitment to offer the industry’s broadest range of GPUs seriously. Our clients have access to 10+ different NVIDIA GPU types on CoreWeave Cloud and, as a result, are empowered to find the best type of compute and performance-adjusted cost for their workloads.
It’s painfully difficult to scale on-demand. Whether limited by strict GPU quotas or infrastructure inefficiencies, we hear from clients every day that they can’t get the resources they need on other cloud platforms. CoreWeave Cloud is purpose-built to handle large-scale workloads on-demand, on a bare-metal, Kubernetes-native infrastructure that delivers the industry’s fastest spin-up times and most responsive autoscaling. Our clients have confidence that they’ll be able to access the scale they need, and they know that they’re empowered to do so efficiently on CoreWeave.
It’s too expensive. To put it simply, our pricing model is designed to encourage scale. Our on-demand pricing is super competitive out of the box, but when you combine our prices with the variety of compute and infrastructure advantages that we offer, our clients typically find CoreWeave to be 50-80% less expensive than the “big 3” on a performance-adjusted basis.
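To make “performance-adjusted” concrete, here is a minimal sketch of one way such a comparison might be computed: cost per unit of work instead of cost per hour. The GPU names, hourly prices, and throughput figures below are placeholders invented for illustration; they are not CoreWeave or “big 3” numbers, and `GpuOption`, `cost_per_million`, and `cheapest` are hypothetical helpers, not part of any real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOption:
    name: str
    hourly_price_usd: float   # placeholder price, not a real quote
    items_per_second: float   # placeholder throughput from your own benchmark


def cost_per_million(option: GpuOption) -> float:
    """USD needed to process one million items at the measured throughput."""
    if option.items_per_second <= 0:
        raise ValueError("throughput must be positive")
    seconds_needed = 1_000_000 / option.items_per_second
    return option.hourly_price_usd * seconds_needed / 3600


def cheapest(options: list[GpuOption]) -> GpuOption:
    """Pick the option with the lowest cost per unit of work."""
    return min(options, key=cost_per_million)


# Made-up numbers: the larger GPU is 2x faster but 3x the hourly price.
options = [
    GpuOption("gpu-large", hourly_price_usd=3.00, items_per_second=400.0),
    GpuOption("gpu-small", hourly_price_usd=1.00, items_per_second=200.0),
]

for opt in options:
    print(f"{opt.name}: ${cost_per_million(opt):.2f} per million items")
print("best value:", cheapest(options).name)
```

With these invented inputs the slower GPU wins on cost per item even though it loses on raw speed, which is the kind of trade-off a wider choice of GPU types exposes. Real comparisons would use throughput measured on the actual workload.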

Training is one of the biggest uses of GPU resources in large-scale ML solutions. What are some of the main challenges of distributed, large-scale training, and what is the importance of selecting the right hardware topologies?

BV: This is top of mind as we finished building our state-of-the-art NVIDIA A100 distributed training cluster this year. Our partners at EleutherAI are currently using it to train GPT-NeoX-20B, which we expect to be the largest open-source language model when it’s completed later this year. Training – at any scale – is complex from a technical perspective, and for that reason, we feel that it’s really important to provide clients with options. A few examples include:

Containers vs. Virtual Machines. Our team tends to feel strongly about ML workloads running in containers and the advantages of deploying workloads as such, but some businesses rely on Virtual Machines. We support both to ensure that we can meet our clients where they are and pride ourselves on customer service when our clients are ready to take a look at migrating to Kubernetes to realize some of the resulting advantages. In parallel and distributed training setups, latency is of utmost importance. By training in containers-on-bare-metal, libraries such as NCCL are able to get a proper view of the actual hardware architecture without the risk of a virtualization layer misrepresenting PCI buses and NUMA topologies.
Hardware Selection. Not all training requires A100s with InfiniBand. GPUs such as the A40 and A6000 provide 48GB of VRAM, allowing for large batch sizes while being less expensive. Our commitment to offering a broad range of GPUs allows our customers to right-size their training platform on the most efficient compute.
Interconnect. Large-scale training requires a high-bandwidth, low-latency interconnect both inside a node, achieved via NVSwitch in the A100 cluster, and between nodes in a distributed training setup. Our distributed training clusters are built with InfiniBand RDMA, with multiple terabits of non-blocking capacity to minimize training time spent in comms.
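As a rough illustration of the right-sizing idea, the sketch below estimates how much memory a model's weights alone occupy and checks that against a GPU's VRAM. It is a first-pass lower bound only: it ignores activations, gradients, optimizer state, and framework overhead, all of which matter a great deal for training. The 48 GB figure comes from the text above; the parameter counts are nominal sizes, the 80% headroom factor is an arbitrary assumption, GB is taken as 10^9 bytes, and the function names are hypothetical.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes here)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def weights_fit(num_params: float, dtype: str, vram_gb: float,
                headroom: float = 0.8) -> bool:
    """True if the weights fit within an assumed usable fraction of VRAM.

    `headroom` is an illustrative assumption that leaves space for
    everything this estimate ignores; tune it for a real workload.
    """
    return weight_memory_gb(num_params, dtype) <= vram_gb * headroom


VRAM_GB = 48  # e.g. the 48 GB cards mentioned above

for label, params in [("6B-parameter model", 6e9), ("20B-parameter model", 20e9)]:
    for dtype in ("fp32", "fp16"):
        size = weight_memory_gb(params, dtype)
        verdict = "fits" if weights_fit(params, dtype, VRAM_GB) else "does not fit"
        print(f"{label} in {dtype}: {size:.0f} GB of weights, {verdict} in {VRAM_GB} GB")
```

A check like this only narrows the candidates; the final choice still depends on benchmarking the actual model and batch size on each GPU type.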

Possibly even more so than training, hardware selection has a huge impact on inference workloads, as performance-adjusted cost benchmarking becomes critically important for our clients serving models at scale. We recently released benchmarks across five GPU types for our managed inference service for EleutherAI’s GPT-J-6B model.

AI-first hardware is becoming more and more specialized for different types of deep learning methods. How do you see the evolution of the AI-first hardware space in the next few years?

BV: My personal view is that the training market will become more fragmented from the model serving side over the next few years. I think we’re going to see a few large groups, whether they are private institutions or crowdsourced groups, training mega-scale models with a goal of either selling them to a large public cloud under a monopolistic arrangement or open-sourcing them for the world at large to use.

I have concerns about the large cloud providers attempting to corner certain portions of the market with proprietary hardware for specific use cases and models that they own.

For open-source models, I expect there to be a lot of smaller groups that need limited amounts of compute to fine-tune the models, but the largest demand is going to be for flexible compute to serve these models at scale.

If I were to make a bet, it’s that flexible compute will continue to dominate the landscape given that it’s easier to source, use broadly, and build engineering teams to support it.

Optimizing for GPUs and other hardware accelerators is one of those tasks that ML engineers tend to ignore until it becomes a problem. What are some of the challenges with GPU accelerator optimization, and do you believe that, ultimately, those capabilities belong in platforms like CoreWeave Cloud instead of on individual models?

BV: We have countless conversations with clients who are looking to optimize for cost but haven’t optimized their models to fit in more economical GPUs. Sometimes, the team behind a project may be so overwhelmed that they can’t dedicate the time, and that is where we collect data to inform our product roadmap on how we can be more helpful to clients in the future. It’s impossible for us to optimize every model serving pipeline, but I think there is an opportunity for us to create tools that help clients get better “bang for their buck” at scale. We also see a lot of movement in this area from the framework developers. For example, TorchScript brought PyTorch up to par with the efficient execution of TensorFlow SavedModels. Models that can be converted to NVIDIA TensorRT often gain substantial improvements in inference times.

Clients who are able to invest the time – like AI Dungeon and Novel AI – often see massive improvements in performance-adjusted cost.

The emergence of deep learning seems to have brought back the dependencies between hardware and software in ML programs. Does the space need a VMware-type platform for ML workloads? When do you think we cross that chasm and ML teams can stop being concerned about the underlying hardware topology?

BV: Regarding crossing the chasm you described, some teams are already there and looking for a software provider that delivers an out-of-the-box solution, taking care of hardware and infrastructure under the hood. There are a ton of interesting companies providing solutions for MLOps, a space that is absolutely exploding, and one you covered thoughtfully in TheSequence yesterday.

I don’t think there’s a “one size fits all” solution here, nor is a potential solution to the problem – to the extent a problem exists – that specific. For larger, complex models, you are always going to want to do some hardware-specific tuning.

💥 Miscellaneous – a set of rapid-fire questions

Favorite math paradox?

Easy. Achilles and the Tortoise. Makes my mind shudder.

What book would you recommend to an aspiring ML engineer?

I am a believer in learning that the water is cold after jumping in. Learning through practice is all I’ve ever known.

Is the Turing Test still relevant? Any clever alternatives?

Maybe. I do think that the basic imitation game in the Turing test can be overcome by an NLP model at some point in the not-too-distant future. NLP models can already hold a coherent conversation with a human. They are still, however, supercomputers generating answers based on what they have learned from humans. I do believe we need a deeper, non-language-based test to truly determine if an AI can actually think and draw conclusions on its own. Think something like the story in the movie Ex Machina.

Does P equal NP?

I hope not for Bitcoin’s sake.

TheSequence
