The Sequence Chat: Hugging Face's Leandro von Werra on StarCoder and Code Generating LLMs
StarCoder is one of the most ambitious code generation foundation models released in recent times.
👤 Quick bio
Tell us a bit about yourself: your background, current role, and how you got started in machine learning and data labeling.
I originally did a master's degree in physics focusing on astrophysics, but around that time I noticed the breakthroughs happening in ML, so I decided to switch the focus of my studies towards ML. After finishing my master's thesis in ML for precision medicine, I joined a start-up as a data scientist, where I worked on a wide range of industry projects. This is also where I met Lewis Tunstall, and as language models such as BERT and GPT-2 started taking off, we decided to start working on a textbook about transformer models and the Hugging Face ecosystem. When reaching out to Hugging Face we met Thomas Wolf, the Chief Science Officer and co-founder at Hugging Face, who joined the project as a co-author. As the book came to an end, Lewis and I joined Hugging Face. Since then, I have worked as a Machine Learning Engineer on both the open-source and research teams, on projects such as Evaluate, TRL, and CodeParrot, and most recently co-leading the BigCode research collaboration.
🛠 ML Work
You are part of the StarCoder project, which was recently released by the BigCode community championed by Hugging Face and ServiceNow. Could you walk us through the inspiration and vision for the project?
Codex and Copilot had a big impact on the community and on developers, and can lead to a significant productivity boost in professional workflows. However, those models are closed source, so you cannot freely adapt them to your use-case or experiment with fine-tuning, and you need to send your data to an external API. In addition, there are big open questions around data governance: what data was the model trained on, what licenses were included, how should sources be attributed, and what if you want your code to be excluded? There are several open models, but they lack Copilot's performance and also don't fully disclose how their datasets were created and filtered.
The goal of BigCode, and subsequently StarCoder, was to address these issues and produce a high-performance code model with clear data governance structures. The project is a spiritual successor of BigScience and is run as an open research collaboration that any researcher or industry expert can join.
The StarCoder release includes two models: StarCoder and StarCoderBase. Why are there two models, and what are the key differences between them?
StarCoderBase is trained on 80+ programming languages for 1T tokens. Since a lot of developers work in Python, we continued to train StarCoder for about 35B tokens (~3% of the full training) on the Python subset, which led to a significant performance boost. Surprisingly, it also led to a performance increase in some other languages, such as R and Swift. On the other hand, we found that StarCoderBase can be better prompted to act as a tech assistant: by simply adding a few example conversations to the context (see the TA prompt) you can ask StarCoderBase to help you solve programming-related questions. StarChat (alpha) is even better at that, since it was specifically fine-tuned on conversations and instructions.
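For readers who want to try this themselves, here is a minimal sketch of prompting StarCoder for plain code completion with the transformers library. The checkpoint names match the ones published on the Hugging Face Hub (accepting the license on the Hub may be required), but the prompt, generation settings, and hardware setup are illustrative assumptions, not part of the interview.

```python
# Minimal sketch: load StarCoder from the Hub and complete a Python snippet.
# Assumes `transformers` and `accelerate` are installed and a GPU with enough
# memory is available; the settings below are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # multilingual base model: "bigcode/starcoderbase"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = "def fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```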
StarCoder was trained on over 80 programming languages from The Stack dataset. Could you explain the data curation and training process required for building such a model?
The data curation probably made up 60-80% of the whole project. There were two main ingredients to creating a good pretraining dataset. First, we applied strong near-deduplication, where similar files are removed. It might sound counterintuitive, but strongly near-deduplicating the dataset first allows you to safely train for a few epochs without performance degradation. Second, for each file extension we examined at least 100 samples and derived heuristics to exclude low-quality files (e.g. data or auto-generated files). In addition, we labelled a PII dataset for code to train a PII detector; at that scale, even applying the PII model to the whole dataset required several hundred GPU hours. We also excluded code files from users who had opted out of the dataset.
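To make the near-deduplication step a bit more concrete, here is a rough sketch of MinHash-based near-deduplication using the datasketch library. The tokenization, similarity threshold, and number of permutations are illustrative assumptions, not the exact settings used for The Stack.

```python
# Rough sketch of MinHash-based near-deduplication: keep a file only if no
# sufficiently similar file has been seen before. Parameters are illustrative.
from datasketch import MinHash, MinHashLSH

def minhash(content: str, num_perm: int = 256) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(content.split()):
        m.update(token.encode("utf-8"))
    return m

def near_deduplicate(files: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Return the keys of files to keep, dropping near-duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=256)
    kept = []
    for name, content in files.items():
        m = minhash(content)
        if lsh.query(m):  # a similar file is already in the index
            continue
        lsh.insert(name, m)
        kept.append(name)
    return kept
```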
Finally, we trained the model on 512 A100 GPUs for 24 days. The training itself was extremely smooth: we had some restarts due to hardware failures, but those mostly happened automatically. Training at that scale with modern tooling such as Megatron and BF16 precision is very smooth.
I have this romantic idea that Jupyter notebooks are ideal for pretraining coding LLMs, given that they are language-friendly. But from what I understand, they can be one of the most difficult datasets to put together. How did you guys handle Jupyter Notebooks in the process of training StarCoder?
Indeed, we also found that Jupyter notebooks are a treasure trove of interesting data with lots of tutorials and examples. We parsed the notebooks in two ways:
- we converted the notebooks to source code where the markdown cells become code comments.
- we parsed the notebooks into a structured format where the cells become text-code-output-text chains separated by special tokens. This also allows us to easily provide the whole notebook as context (incl. cell outputs) for code completion in Jupyter notebooks (see this Jupyter plugin).
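As an illustration of the second strategy, here is a rough sketch that linearizes a notebook into such a chain with nbformat. The sentinel token names and the simplified output handling are assumptions for illustration, not the exact StarCoder preprocessing code.

```python
# Sketch: turn a .ipynb file into a text-code-output chain separated by
# sentinel tokens. Token names and output handling are illustrative.
import nbformat

TEXT, CODE, OUTPUT = "<jupyter_text>", "<jupyter_code>", "<jupyter_output>"

def notebook_to_chain(path: str) -> str:
    nb = nbformat.read(path, as_version=4)
    parts = []
    for cell in nb.cells:
        if cell.cell_type == "markdown":
            parts.append(TEXT + cell.source)
        elif cell.cell_type == "code":
            parts.append(CODE + cell.source)
            # Keep only plain-text stream outputs (e.g. prints) for simplicity.
            text_outputs = [out.get("text", "") for out in cell.get("outputs", [])]
            parts.append(OUTPUT + "".join(str(t) for t in text_outputs))
    return "".join(parts)
```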
StarCoder seems like a clever combination of existing architectures, such as SantaCoder and multi-query attention. Were there any research breakthroughs in StarCoder, or would you say it was more of a crafty ML engineering effort?
Indeed, in a sense StarCoder is a combination of the best available techniques, and most of the performance can probably be attributed to careful work on the dataset. The architectural goal was to make the model easy to use and deploy and to fulfill users' needs: fast inference, cheap generation, long contexts, and infilling using context from both sides. To achieve this, we trained a moderately sized but fast model ("just" 15B parameters) with MQA to scale generation, implemented Flash Attention to train with a context window of 8192 tokens, and used the Fill-in-the-Middle objective in addition to the normal autoregressive language modeling objective.
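To make the infilling objective concrete, here is a hedged sketch of how a Fill-in-the-Middle prompt is assembled at inference time. The sentinel tokens match the ones published with the StarCoder tokenizer, but the surrounding example is an illustrative assumption; check the model card before relying on the exact format.

```python
# Sketch of a Fill-in-the-Middle prompt: the model sees the code before and
# after a gap and generates the missing middle. Example code is illustrative.
prefix = 'def remove_non_ascii(s: str) -> str:\n    """Remove non-ASCII characters."""\n    '
suffix = "\n    return result\n"

fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# Feed fim_prompt to model.generate(...) exactly as in the completion sketch
# above; the newly generated tokens are the code that fills the gap.
```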
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of AI research outside of generative AI?
I am really excited about the application of ML to science, such as health, chemistry, math, or physics. One application that excites me most is AlphaFold, which helps scientists speed up protein structure prediction at an impressive scale. Technologies like this that support scientists will help science progress even faster.
What makes a good benchmark for evaluating coding LLMs? What are the best benchmarks in the market?
The most popular one is HumanEval which tests LLMs for code on a variety of coding challenges in Python. We also used MultiPL-E which extends HumanEval to over a dozen other languages. However, HumanEval only consists of coding interview style challenges and as such does not capture the full range of programming tasks.
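As a concrete illustration of how these execution-based benchmarks work, here is a minimal sketch using the code_eval metric from the evaluate library, which computes pass@k by running candidate solutions against unit tests. The toy problem and candidates below are made up; real evaluations use the HumanEval problems, and the opt-in environment variable is required because model-generated code gets executed.

```python
# Minimal sketch of HumanEval-style functional evaluation with pass@k.
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # acknowledge that generated code will be executed

import evaluate

code_eval = evaluate.load("code_eval")

# One toy problem with its unit test, and two model-generated candidates.
test_cases = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b", "def add(a, b):\n    return a - b"]]

pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])
print(pass_at_k)  # e.g. {'pass@1': 0.5, 'pass@2': 1.0}
```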
Make the case for and against open-source foundation models relative to closed-source API distribution models. Which approach ultimately wins?
One thing we learned from releases such as Stable Diffusion or Llama is the creativity and capability of the open-source community. Within weeks of the release, the community built dozens of variants of the model as well as custom applications – more than any company or institution could come up with. Releasing a powerful code generation model allows anybody to fine-tune and adapt it to their own use-cases and will enable countless downstream applications.
While it is easier to keep control over closed-source API models, the lack of transparency makes it harder to build trust in such systems and denies researchers the opportunity to make them safer.
What are the biggest next milestones for coding LLMs?
There are lots of interesting avenues for future code LLMs! Evaluation is definitely in its infancy compared to natural language and will need to improve to better capture the user experience. In terms of generation capability, the models are getting very good at function-level completions, but they struggle with building longer, more complex structures, as well as with editing a whole codebase to implement a new feature, for example. Additionally, they are not yet able to interactively debug code, that is, execute a piece of code and improve the solution based on the error or observed behavior. Solving these challenges opens up a lot of very exciting directions!