📝 Guest Post: How to build a responsible code LLM with crowdsourcing
In this post Toloka showcases Human-in-the-Loop using StarCoder, a code LLM, as an example. They address PII risks by training a PII reduction model through crowdsourcing, employing strategies like task decomposition, clear instructions, and quality control. This successful implementation demonstrates how responsible AI and high-performing models can align.
Responsible AI starts with a responsible approach to data
The promise of Large Language Models (LLMs) is that they will help us with a variety of tasks. However, before an LLM can solve these problems in a few-shot or even zero-shot manner, it must be exposed to extremely large amounts of data. These datasets are usually scraped from the internet. There is a problem with the internet, though: it’s messy. If you use scraped data, the model might pick up private information, amplify existing biases, and, consequently, create more harm than good. Naturally, model developers implement numerous strategies and safeguards to detect and discard inappropriate prompts or model output at inference time, but models can still be manipulated into generating undesirable content.
The risks of harmful results do not align with the principles of Responsible AI. If you want to build a responsible AI solution, you need to be careful with data handling practices. This includes adhering to copyright laws, complying with the laws of the country of use, and being fully transparent about the data collection and model training processes.
All of these aspects are nearly impossible to cover without any human curation, and this is where Human-in-the-Loop comes in. We’re going to show how Human-in-the-Loop can be put to effective use in building responsible AI tools, using the example of StarCoder, a code LLM. By creating this open-source code LLM, the BigCode community, supported by Hugging Face and ServiceNow, has proven that high-performing AI solutions can be a part of responsible AI.
StarCoder’s PII challenges
StarCoder is an open-access alternative to the model that powers GitHub Copilot. The main goal of the BigCode community was to develop a code LLM that follows responsible AI guidelines, particularly those related to training data.
StarCoderBase is trained on The Stack, a 6.4 TB dataset of permissively licensed source code in 384 programming languages. The final product, StarCoder, is the StarCoderBase model fine-tuned on data sourced from the same dataset. To respect code owners’ rights, the StarCoder developers introduced a tool called “Am I in The Stack”, which lets developers check whether their code is included and opt out if desired.
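For readers who want to explore the dataset themselves, here is a minimal sketch of streaming one language subset of The Stack with the Hugging Face datasets library. It assumes you have accepted the dataset’s terms on the Hugging Face Hub and are logged in; the data_dir path and the "content" field name are shown for illustration.

```python
from datasets import load_dataset

# Stream one language subset of The Stack instead of downloading all 6.4 TB.
# Assumes the dataset's terms have been accepted on the Hugging Face Hub.
stack_python = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",   # illustrative path to a single-language subset
    split="train",
    streaming=True,
)

# Peek at the first file; the source code lives in the "content" field.
first_file = next(iter(stack_python))
print(first_file["content"][:200])
```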
Even though the data usage was legally permissible, there were risks related to Personally Identifiable Information (PII) contained in the training data. The presence of personal data poses an ethical concern, as the final model could uncontrollably output personal information during inference.
To mitigate this risk, prior to using The Stack dataset for StarCoder, the BigCode community members trained a PII reduction model and applied it to the entire dataset.
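To illustrate what applying such a model to code can look like, here is a minimal sketch that runs a token-classification model over a snippet and masks every detected span. The checkpoint name and the placeholder format are assumptions for the example, not the exact BigCode setup; substitute whatever PII detection model you have.

```python
from transformers import pipeline

# Assumption: "bigcode/starpii" stands in for your PII detection checkpoint;
# any token-classification model that returns character offsets works here.
pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",
    aggregation_strategy="simple",   # merge word pieces into whole entity spans
)

def redact(code: str) -> str:
    """Replace every detected PII span with a category placeholder."""
    entities = pii_detector(code)
    # Replace spans from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        code = code[: ent["start"]] + f"<{ent['entity_group']}>" + code[ent["end"]:]
    return code

print(redact('password = "hunter2"  # questions: jane.doe@example.com'))
```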
Building a PII reduction model
In the context of ethically sensitive tasks such as PII detection, human involvement is crucial, yet looking through 6.4 terabytes of data manually is impossible. A practical way to resolve this dilemma is to combine machine learning models with Human-in-the-Loop in a PII detection pipeline.
When working with natural language processing (NLP) and text data, which includes code, developers rarely train all their models from scratch anymore: adapting a pretrained model downstream (fine-tuning, or prompting in the case of extremely large models) has proven effective for teaching language models specific tasks. In line with this approach, the BigCode community developers trained the BERT-like, encoder-only StarEncoder model and fine-tuned it to perform a Named Entity Recognition (NER) task.
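As a rough sketch of what that fine-tuning step can look like with the Hugging Face Trainer (not the BigCode team’s exact training code): assume tokenized_splits is a DatasetDict whose examples carry tokenized inputs with aligned token-level labels, pii_labels is the list of BIO tags for the PII categories described below, and the checkpoint name is an assumption.

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

# Assumptions: `tokenized_splits` is a datasets.DatasetDict with `input_ids`,
# `attention_mask`, and aligned `labels`; `pii_labels` lists the BIO tags.
checkpoint = "bigcode/starencoder"   # assumed name of the StarEncoder checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(pii_labels)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="starencoder-pii-ner",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized_splits["train"],
    eval_dataset=tokenized_splits["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```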
To achieve good recognition quality, engineers needed a high-quality labeled dataset of code snippets with various kinds of PII, including potential edge cases. A dataset for fine-tuning needs to be large (the plan was to use approximately 12,000 items) and diverse, in this case in the types of PII represented. Given the cost and time associated with gathering such a dataset with a team of software engineers, the BigCode community decided to use crowdsourcing for labeling, and asked Toloka for help.
Secrets to success for crowdsourcing and PII detection
A commonly held belief is that tasks requiring domain knowledge, like labeling programming code, can only be done by a specially assembled group of domain experts. But experts are often difficult to find, hard to scale, and expensive to employ. This misguided belief often slows down the development of high-quality responsible AI tools, which are primarily data-driven.
Over the past 10+ years, Toloka has tackled complex data labeling and data generation tasks that require deep domain expertise, proving that tasks of this nature can be solved efficiently with crowdsourcing. Toloka’s diverse crowd naturally includes experts in multiple domains. When we apply advanced crowdsourcing techniques, even the part of the crowd without domain experience can effectively contribute to labeling tasks.
We applied our experience to the task of PII detection for the BigCode project, and we’ll share our strategies in the following sections.
Decomposition is key
When setting up a project to be labeled with crowdsourcing, the key strategy is to break down the task into easier subtasks. This is a skill that becomes second nature as you handle crowdsourcing projects.
Instead of giving the Toloka crowd (also known as Tolokers) an assignment to label every type of PII in code, we grouped PII into 7 categories and set up a separate labeling project for each. These are the types of PII:
Names: names, names in licenses, placeholder names
Emails: emails, emails in licenses, placeholder emails
Usernames: usernames, usernames in licenses, placeholder usernames
IP Addresses
Passwords
SSH/API keys
IDs
This approach made each task easier to handle and improved quality. Putting all the categories in one project would have created cognitive overload and led to poor labeling quality.
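To make the decomposition concrete, here is a small sketch of how the seven categories might be organized into separate projects, each with its own compact label set. All the names below are illustrative, not the project’s actual label schema.

```python
# Illustrative grouping of the seven PII categories into separate labeling
# projects, each with a small set of entity types (names are made up).
PII_PROJECTS = {
    "names":        ["NAME", "NAME_LICENSE", "NAME_PLACEHOLDER"],
    "emails":       ["EMAIL", "EMAIL_LICENSE", "EMAIL_PLACEHOLDER"],
    "usernames":    ["USERNAME", "USERNAME_LICENSE", "USERNAME_PLACEHOLDER"],
    "ip_addresses": ["IP_ADDRESS"],
    "passwords":    ["PASSWORD"],
    "keys":         ["SSH_KEY", "API_KEY"],
    "ids":          ["ID"],
}

def bio_tags(entity_types):
    """Expand entity types into BIO tags for token classification."""
    return ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]

# Tags for one focused project versus one project crammed with every category:
print(len(bio_tags(PII_PROJECTS["names"])))               # 7 tags
print(len(bio_tags(sum(PII_PROJECTS.values(), []))))      # 29 tags
```

The second number is the cognitive load a single all-in-one project would put on each labeler, which is exactly what the decomposition avoids.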
Start with the basics and gradually add complexity
We created a quiz for Tolokers that guided them through each category of PII, from easiest to hardest. They were assigned a skill for each category they mastered in the quiz, and they had the opportunity to opt out if they hit a point where they felt overwhelmed. We used a similar system for tasks in production. Out of the 2,896 Tolokers interested in PII labeling, 1,364 mastered all 7 categories.
Names -> Emails -> Usernames -> IP Addresses -> Passwords -> API/SSH Keys -> IDs
Maintain consistency and make tasks manageable
We kept the tasks consistent and easy to understand. Each task included exactly 50 lines of code, and each project had no more than 4 categories to label. A good rule of thumb in crowdsourcing is that if a task takes more than 2 minutes, keep decomposing it.
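As a minimal sketch of what carving source files into fixed-size tasks might look like (the 50-line figure comes from the project above; the input file name is hypothetical):

```python
def split_into_tasks(source: str, lines_per_task: int = 50):
    """Split a source file into fixed-size snippets, one labeling task each."""
    lines = source.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_task])
        for i in range(0, len(lines), lines_per_task)
    ]

with open("example.py") as f:          # hypothetical input file
    tasks = split_into_tasks(f.read())
print(f"{len(tasks)} tasks of up to 50 lines each")
```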
The user interface matters
It’s essential to make labeling tools intuitive and easy to use. For instance, it helps to use contrasting colors to highlight categories. It’s also a good practice to add an option for users to give feedback that something is wrong with the input data, like an “Ambiguous” class in this project.
We try to include all of the best practices of crowdsourcing interface development in Toloka’s Template Builder.
Give clear instructions with examples and counterexamples
People are all-purpose few-shot learners: show them a few clear examples and counterexamples, and they can recognize the same kind of item even in distorted forms and give human-readable feedback on how far an item deviates from those examples.
Set up quality control
Choosing a small group of experts to do the labeling might seem like the only way to get good quality. But that’s not always the case. Crowdsourcing allows us to use advanced techniques to measure labeling skills and maintain quality at the desired level. For the PII pipeline, we used validation projects, overlap, and hidden control tasks to manage labeling quality.
Use validation projects
For each category, we designed a chain of projects:
Quiz (to train Tolokers)
PII label generation (to find PII in code)
PII label validation (to check whether PII was found correctly)
Validation projects are needed when the correct answers are hard to check automatically. A validation project should reflect the target metrics. In the case of PII detection, the metrics were precision and recall: was each selected piece labeled correctly (precision), and was every piece of PII in the code found (recall)?
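Here is a sketch of how validated answers can be turned into those two metrics, assuming each label is reduced to a (start, end, category) span; the toy spans are made up.

```python
def span_precision_recall(predicted, reference):
    """Precision and recall over exact (start, end, category) span matches."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

# Toy example: one correct email span, one spurious name, one missed password.
pred = {(10, 28, "EMAIL"), (40, 52, "NAME")}
true = {(10, 28, "EMAIL"), (60, 68, "PASSWORD")}
print(span_precision_recall(pred, true))   # (0.5, 0.5)
```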
Use overlap
Two heads are better than one. Overlap means that the same task is completed by two or more people and the results are aggregated. In validation projects, we used majority vote to determine the correct answer and weed out low-quality results.
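A minimal sketch of majority-vote aggregation over overlapping judgments follows; real pipelines often weight votes by performer skill, which is omitted here.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common judgment, or None when there is a tie."""
    top = Counter(answers).most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None   # no clear majority: add more overlap or escalate the task
    return top[0][0]

print(majority_vote(["correct", "correct", "incorrect"]))   # correct
print(majority_vote(["correct", "incorrect"]))              # None
```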
Use control tasks
Control tasks are tasks that include correct answers and can be checked automatically, but they look like regular tasks to the crowd. We used these tasks to dynamically update each Toloker’s skill level for each category of tasks during labeling. We filtered Tolokers by skill and only allowed them to access the types of tasks they had a high skill level for. We also awarded bonuses for good quality.
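The sketch below shows the general idea with a simple rolling-accuracy skill; it is not Toloka’s actual skill formula, and the window and threshold are illustrative.

```python
def skill_from_controls(results, window: int = 10) -> float:
    """Skill as the share of the most recent control tasks answered correctly."""
    recent = results[-window:]
    return 100.0 * sum(recent) / len(recent) if recent else 0.0

def has_access(skill: float, threshold: float = 80.0) -> bool:
    """Only performers above the threshold see production tasks for a category."""
    return skill >= threshold

results = [True, True, False, True, True, True, True, True, True, True]
skill = skill_from_controls(results)       # 90.0
print(skill, has_access(skill))            # 90.0 True
```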
Responsibility to the crowd
To follow the principles of Responsible AI, crowd projects should be managed responsibly.
Keep up motivation with bonuses and fair payment.
Don’t limit the crowd’s creativity and desire to learn. We value the feedback we receive from Tolokers and use it to achieve good quality, fast.
Be transparent about the project and share how the results are going to be used.
Results
Final Pipeline
PII reduction model
Our rapid setup and labeling, completed in two weeks and four days respectively, yielded impressive results. However, with more time to improve the labeling instructions, we’re confident the accuracy would have been even higher, potentially reaching flawless ID labeling.
The PII reduction model fine-tuned on the labeled dataset achieved high F1 scores for names, emails, and IP addresses (over 90%), and a solid score for passwords (73.39%). Lower performance on keys and usernames (F1 scores of 56.66% and 59.39%, respectively) was due to the limited number of these PII types in the dataset, with only 308 instances available. IDs were excluded from the training dataset.
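For reference, per-entity F1 scores like the ones above are commonly computed from BIO-tagged sequences, for example with the seqeval library; the tags below are toy data, not the project’s evaluation set.

```python
from seqeval.metrics import classification_report

# Toy reference and predicted tag sequences for two short snippets.
y_true = [["O", "B-EMAIL", "I-EMAIL", "O", "B-PASSWORD"],
          ["B-NAME", "I-NAME", "O", "O"]]
y_pred = [["O", "B-EMAIL", "I-EMAIL", "O", "O"],
          ["B-NAME", "I-NAME", "O", "O"]]

# Prints per-entity precision, recall, and F1 (the same breakdown reported above).
print(classification_report(y_true, y_pred))
```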
To sum up
The StarCoder model surpasses every open code LLM that supports multiple programming languages and matches, if not outperforms, OpenAI’s code-cushman-001. What’s most important to us is that it follows the guidelines of Responsible AI.
Achieving these results without Human-in-the-Loop would be challenging. Crowdsourcing is an effective approach, delivering quality labeling within a limited time frame, even for tasks of varying complexity.