The Sequence Chat: Vipul Ved Prakash, CEO, Together on Decentralized, Open Source Foundation Models
Together has been behind some of the most interesting releases in open source foundation models.
👤 Quick bio
Tell us a bit about yourself: your background, current role, and how you got started in machine learning.
My background is in large-scale distributed systems and information retrieval, and most of my professional career has involved solving problems of text understanding. I created an open source anti-spam filter called Vipul's Razor and founded a company based on it (Cloudmark) in the early 2000s. We used locality sensitive hashing and probabilistic classification, and got fantastic scale and results. This resulted in a long-lasting fascination with learning from unstructured data. I later founded a company (Topsy) that built a social media search and analytics system, where we used machine learning and graph methods for ranking, deduplication, and sentiment analysis. Topsy was acquired by Apple, and I directed various efforts there, including Spotlight search, federated learning systems that employed differential privacy, and Siri's open-domain Q&A capability.
🛠 ML Work
Together has emerged as one of the most prominent advocates for the open-source, decentralized approach to foundation models. Could you guide us through the inspiration and vision behind the company, as well as highlight some of the open-source projects that you have released thus far?
I see foundation models as the terminal point of the first generation of human-computer interaction where we had to laboriously and precisely instruct computers to perform a task. Foundation models open up the possibility where we can simply describe our task and shift the burden of devising a solution to computers. In this framing, foundation models represent a very broad form of human-computer interface, perhaps occupying a position similar to compilers or microprocessors.
A tremendous amount of economic and societal value of computing has come from open systems like the Internet, open programming languages and commodity microprocessors, so it seems important to us that there should be a strong open-source foundation model ecosystem.
OpenChatKit stands out as one of the pioneering open-source projects that integrates instruction following capabilities into LLMs. How challenging was the process of developing OpenChatKit, and what are some of the specific features included in its initial release?
It was challenging in several ways; it feels like light years ago, as OpenChatKit was created pre-LLaMA and pre-Alpaca. Back then, it was quite unclear what made a great chat model. We were lucky to have bet on instruction data as the key ingredient.
We also made a quite explicit decision not to use OpenAI data, in order to have something clean from the copyright side. This constrained us a bit, as many chat models now use distilled data. Instead, through a community process along with LAION, we created a dataset of 40M "weak" instructions from various sources. This dataset was later augmented with data provided by users of OpenChatKit through a feedback app, and it is available for use as OIG.
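As a concrete illustration of what a "weak" instruction example can look like once assembled into chat form for training, here is a small sketch. The field names and special tokens are assumptions for the sketch, not the exact OIG schema:

```python
# Hypothetical sketch of turning a weakly supervised source (e.g. a Q&A pair
# derived from a permissively licensed corpus) into a chat-style training
# example. Field names and special tokens are placeholders, not the exact
# OIG schema.
import json

def to_training_example(question: str, answer: str) -> dict:
    # A single-turn conversation rendered as one text field for LM training.
    text = f"<human>: {question}\n<bot>: {answer}"
    return {"text": text, "metadata": {"source": "placeholder-weak-source"}}

example = to_training_example(
    "What is locality sensitive hashing used for?",
    "It maps similar items to the same buckets, which makes near-duplicate "
    "detection and similarity search efficient at scale.",
)
print(json.dumps(example, indent=2))
```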
There is also the moderation model. Even today, to the best of our knowledge, OpenChatKit is one of the few chat models (if not the only one) that recommends a layer of moderation through a specifically designed moderation model. Building such a model from scratch was a lot of work, but it is worthwhile, as LLMs can be unintentionally offensive.
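To make the idea concrete, here is a minimal sketch of how such a moderation layer can wrap a chat model. The functions and labels are stand-ins, not the actual OpenChatKit models or label set; the point is simply that both the user input and the model output pass through a dedicated classifier before anything is shown to the user.

```python
# Minimal sketch of a moderation layer wrapped around a chat model.
# Both moderation_label() and generate_reply() are stand-ins for real models;
# the labels here are placeholders, not the actual OpenChatKit label set.

def moderation_label(text: str) -> str:
    # Stand-in for a dedicated moderation classifier; a real system would run
    # a fine-tuned model here and return a label such as "safe" or "flagged".
    banned_terms = {"example-banned-term"}
    return "flagged" if any(t in text.lower() for t in banned_terms) else "safe"

def generate_reply(prompt: str) -> str:
    # Stand-in for the chat model's generation call.
    return f"(model reply to: {prompt})"

def moderated_chat(user_message: str) -> str:
    # Screen the user input before it ever reaches the chat model.
    if moderation_label(user_message) != "safe":
        return "Sorry, I can't help with that request."
    reply = generate_reply(user_message)
    # Screen the model output too, since LLMs can be unintentionally offensive.
    if moderation_label(reply) != "safe":
        return "Sorry, I'd rather not answer that."
    return reply

if __name__ == "__main__":
    print(moderated_chat("Tell me about open-source foundation models."))
```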
RedPajama not only boasts the best codename in the foundation model realm but also represents one of Together's most ambitious undertakings. Could you elaborate on the process and architecture involved in constructing a 1.2 trillion token dataset, along with its corresponding LLMs?
For RedPajama we closely followed the recipe outlined in the LLaMA paper. We took the seven different slices of data: Common Crawl, C4, GitHub, Books, ArXiv, Wikipedia, and StackExchange, and carefully recreated the filtering process. This involved using the CCNet pipeline and several quality filters, including a linear classifier that selects for Wikipedia-like pages. We tuned the hyperparameters to get roughly the same number of tokens from each slice as described in the LLaMA paper.
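As an illustration of the "Wikipedia-like pages" step, a linear quality classifier of that kind could be sketched as follows. This is a sketch only: it uses scikit-learn rather than the actual CCNet/RedPajama tooling, and the training data shown is a placeholder.

```python
# Illustrative sketch of a linear "Wikipedia-like page" quality classifier,
# in the spirit of the CommonCrawl filtering step described above.
# Uses scikit-learn for illustration; the real pipeline is built on CCNet.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: high-quality reference pages (positive)
# vs. random crawled pages (negative).
positive_pages = ["...text of pages used as the high-quality positive class..."]
negative_pages = ["...text of random crawled pages used as the negative class..."]

texts = positive_pages + negative_pages
labels = [1] * len(positive_pages) + [0] * len(negative_pages)

clf = make_pipeline(
    HashingVectorizer(n_features=2**20, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

def keep_page(text: str, threshold: float = 0.5) -> bool:
    # Keep a crawled page only if it scores as sufficiently "Wikipedia-like".
    return clf.predict_proba([text])[0][1] >= threshold
```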
To us, an “open model” implies not just open weights and a permissive license, but also open data and an open data recipe. This allows the community to inspect the data, improve it, or filter and preprocess it differently to create a model that better fits a downstream application. We think open data and data-creation recipes are critical for monotonic progress in open source models.
GPT-JT distinguishes itself as one of the largest foundation models employing a decentralized training method. Can you provide a description of the specific training architecture and techniques employed in GPT-JT, and how they differ from traditional centralized approaches?
Distributed training is a key focus of Together's research work on reducing the costs of training and inference. GPT-JT was trained using a precursor to our CocktailSGD training method, which reduces the network requirements for training by 117x. CocktailSGD, as the name suggests, combines quantization, asynchrony, local training, and top-K compression to fine-tune large models over 1 Gbps links. This allows us to use servers distributed across data centers and connected over the open internet. It also allows for the best possible utilization of GPUs within a data center. The CocktailSGD paper has been accepted at ICML, so a detailed exploration will be published soon! We are quite optimistic that this set of techniques can be expanded and generalized to training large neural-network-based architectures.
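To give a flavor of the compression side (an illustrative sketch only, not the actual CocktailSGD implementation), top-K sparsification combined with 8-bit quantization of a gradient tensor, two of the ingredients mentioned above, might look like this in PyTorch:

```python
# Illustrative sketch of top-K sparsification plus int8 quantization of a
# gradient tensor, two of the ingredients CocktailSGD combines. This is not
# the actual CocktailSGD code, just a flavor of the compression step.
import torch

def compress_gradient(grad: torch.Tensor, k_fraction: float = 0.01):
    """Keep only the largest-magnitude entries and quantize them to int8."""
    flat = grad.reshape(-1)
    k = max(1, int(k_fraction * flat.numel()))
    # Top-K sparsification: transmit only the k largest-magnitude values.
    _, indices = torch.topk(flat.abs(), k)
    selected = flat[indices]
    # 8-bit quantization with a single per-tensor scale factor.
    scale = selected.abs().max().clamp(min=1e-8) / 127.0
    quantized = torch.clamp((selected / scale).round(), -127, 127).to(torch.int8)
    return indices, quantized, scale, flat.numel()

def decompress_gradient(indices, quantized, scale, numel, shape):
    """Reconstruct a dense (approximate) gradient from the compressed form."""
    flat = torch.zeros(numel)
    flat[indices] = quantized.to(torch.float32) * scale
    return flat.reshape(shape)

# Example round trip on a fake gradient tensor.
g = torch.randn(1024, 1024)
packed = compress_gradient(g)
g_hat = decompress_gradient(*packed, shape=g.shape)
```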
Similar to how RLHF revolutionized the utility of LLMs, there are emerging research techniques, like chain of thought-prompting and in-context learning, that exhibit incredible promise. What are some of the emerging research techniques that you believe could play a pivotal role in shaping the next generation of foundation models?
I am excited about state space models, which are sub-quadratic and support much larger contexts. There is research around applying imitation learning to address hallucinations, and work around data mixtures, like DoReMi, which will have a large impact. I think research on the data side is going to be the cornerstone of progress for the next few years.
💥 Miscellaneous – a set of rapid-fire questions
Present the arguments in favor of and against open-source foundation models compared to closed-source API distribution models. Which approach ultimately prevails in the long run?
Open-source models are transparent. We know the data that composed these models, and have better ability to reason about their behavior. This will be increasingly important as foundation models are used by regulated industries and in mission-critical applications.
Open-source models are privacy friendly. You can deploy them in infrastructure under your control and use them with sensitive data. The single-tenant SaaS model of closed-source foundation models is problematic in this way. You have to place a lot of trust in a company, especially if you are going to use closed models with sensitive customer data.
Open-source models can be customized. You can fine-tune them or pre-train them from the final (and in some cases intermediate) checkpoints on large amounts of data. We see 10-12 points of accuracy improvements on fine-tuned models.
Open-source models give you control. They won't disappear or change in unexpected ways, like a closed-source model behind an API might. Again, as these models mature and become critical to applications, developers will want more control over the weights.
Could you articulate the advantages and disadvantages of centralized and decentralized architectures for foundation models, and make a compelling case for each?
Centralized training on HPC style clusters is likely the fastest way of building models today. While there's a lot of scope for optimization in the centralized setting, we generally have good software and knowledge here, and for companies building large foundation models, it often makes sense to follow the best practices and go with well understood infrastructure.
Decentralized training wins significantly on cost. It's easier to get slightly lower-end hardware in multiple locations, and you can achieve lower upfront costs and elasticity. Centralization will likely run into scale-out limits, so the largest models are likely going to be done in decentralized settings in the future.
Architecturally, the progress feels similar to how things played out in the database world, where we have big monolithic databases, and then distributed and fault-tolerant options like DHTs appeared. There's a place for both, but I believe the techniques created in decentralized training will increasingly percolate to centralized settings.
What are the major upcoming milestones for open-source foundation models?
I expect amazing progress in open source foundation models in 2023. We'll surpass GPT-3.5 quality in the open this year, and it will be a fairly big moment for open source. I also believe we'll continue to see optimization work that reduces the cost of working with AI, given the more resource-constrained landscape of open source, and we'll see new architectures beyond transformers. I also expect SOTA open models in code, music, biology, and other niche areas.