📝 Guest Post: Using LLMs from Hugging Face? Fix your model failure points 10x faster with Galileo Data Intelligence
Large Language Models (LLMs) are powerful assets for data scientists to leverage within their applications – Hugging Face is a leading repository for LLMs today.
However, the practical reality of working with LLMs is that the quality of the training data governs model performance. Data scientists often spend 80% of their time in Excel sheets and Python scripts trying to find the data that pulls model performance down, whether they are training a model or maintaining one in production.
In this guest post, co-founder and CEO of Galileo Vikram Chatterji explains how to:
build a Named Entity Recognition model using Hugging Face.
use Galileo, an ML data quality intelligence platform, to automatically surface errors in the training and validation data powering the LLM – all with a few lines of code, thanks to the deep Galileo<>Hugging Face integration.
find and fix the data pulling model performance down 10x faster and track these data changes over time.
Want to try Galileo for yourself? Feel free to reach out here to get a personalized demo from a member of the Galileo Data Science team.
🚀 A Few Lines of Code: Using Galileo While Training a Hugging Face Model
We’ll be using the popular CoNLLpp dataset. Using Galileo, we will quickly be able to find a host of data errors:
Find and fix 0.6% of spans that were Mislabeled (incorrect ground truth)
Identify and annotate 0.25% of spans that were missed
Pinpoint frequently erroneous tokens
STEP 1: Install `dataquality` and initialize Galileo
For this tutorial, you need at least Python 3.7. As a development environment, you can use Google Colaboratory. The first step is to install `dataquality` (Galileo's Python client) along with `datasets`, `evaluate`, and `transformers` (Hugging Face).
```
pip install dataquality datasets evaluate transformers
```

```python
import dataquality as dq

dq.init(
    task_type="text_ner",
    project_name="named_entity_recognition_conllpp",
    run_name="demo_run_conllpp_01",
)
```
STEP 2: Load, Tokenize and Log the Hugging Face 🤗Dataset
The next step is to load your dataset. For this demo, we will use the popular `conllpp` dataset, which follows the same NER data format as any other Hugging Face dataset.
Galileo's Hugging Face integration handles tokenization and label alignment for you, and behind the scenes it logs your input data automatically.
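Putting that together, a minimal sketch of this step is below. The `load_dataset` and `AutoTokenizer` calls are standard Hugging Face APIs; the import path and name of the Galileo tokenize-and-log helper shown here are assumptions based on the `dataquality` integrations, so check the Galileo docs for the exact entry point.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# NOTE: assumed helper name - see the dataquality docs for the exact
# Hugging Face tokenization integration entry point.
from dataquality.integrations.hf import tokenize_and_log_dataset

# Load the CoNLLpp dataset from the Hugging Face Hub
ds = load_dataset("conllpp")

# Any token-classification-ready checkpoint works; distilbert is just an example
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# 🔭🌕 Tokenize, align the word-level NER tags to sub-word tokens,
# and log the input data to Galileo behind the scenes
tokenized_datasets = tokenize_and_log_dataset(ds, tokenizer)
```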
STEP 3: Training the NER Model
Now we're ready to train our Hugging Face model for a Named Entity Recognition task. Normally you would simply call `trainer.train()` and be set, but we're here to drill down into this dataset and find data errors and samples the model struggles with.
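If you are setting the trainer up from scratch, a typical Hugging Face `Trainer` configuration might look like the sketch below. It reuses the `tokenizer` and `tokenized_datasets` from the previous step's sketch; the checkpoint and hyperparameters are illustrative choices, not requirements of Galileo.

```python
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# CoNLLpp uses the 9 BIO tags from CoNLL-2003:
# O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=9
)

training_args = TrainingArguments(
    output_dir="ner_conllpp",
    evaluation_strategy="epoch",   # evaluate on the validation split every epoch
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
```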
To achieve that, we wrap the trainer in Galileo's `watch` function and call `dq.finish()` at the end to publish the results to Galileo. It's THAT simple!
```python
from dataquality.integrations.transformers_trainer import watch

# 🔭🌕 Galileo Logging
watch(trainer)

trainer.train()

# 🔭🌕 Galileo Logging
dq.finish()
```
When the model finishes training, you’ll see a link to the Galileo Console.
⚠️⚠️ Find and fix data errors instantly: Data-centric model inspection with Galileo
At a glance, Galileo points out the data that is pulling your model's performance down.
The Galileo console is designed to let you explore your data in depth, while out-of-the-box alerts act as jumping-off points into problematic pockets of data. On the right, you can view your dataset in table form or in the embedding space.
DATA ERROR 1: Regions of high Data Error Potential (DEP) – a high precision ML data quality metric
The dataset is sorted by each sample's Data Error Potential score, a metric built by Galileo to give every sample a holistic data quality score and identify which samples in the dataset contribute to low or high model performance.
Right away you can see a lot of mistakes: “The” being included in the “The Netherlands” span, “Middle East Economic Survey” not being recognized as an Organization, words that were capitalized only because they began a sentence (such as “One-year” or “Sporting”) being annotated as entities, and more.
You can also color the model’s embeddings by DEP to find islands of erroneous data. This provides an X-ray-like view of the regions of your data the model had a difficult time learning from versus the data it learned easily.
You can also see what tokens tend to have high DEP. For example, “U.S.” shows up in a lot of organizations’ names (e.g. U.S. Treasury, U.S. Senate Intelligence Committee, U.S. Tennis Association) but these aren’t always annotated correctly, and the model has a hard time predicting them correctly.
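If you prefer to slice this programmatically, the hedged pandas sketch below shows one way to surface frequently erroneous tokens from a spans table exported from the console. The file name and the `text` and `data_error_potential` columns are assumptions made for illustration, not a documented Galileo export schema.

```python
import pandas as pd

# Hypothetical span-level export from the Galileo console
spans = pd.read_csv("exported_spans.csv")  # assumed columns: text, data_error_potential

# Average DEP and frequency per span text, keeping spans that occur often enough
token_dep = (
    spans.groupby("text")["data_error_potential"]
    .agg(["mean", "count"])
    .query("count >= 10")
    .sort_values("mean", ascending=False)
)

print(token_dep.head(20))  # tokens like "U.S." tend to surface near the top
```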
DATA ERROR 2: Missed Annotations
CoNLLpp, despite being a thoroughly peer-reviewed dataset, still has many missing annotations. Galileo surfaces these via the “Missed Annotations” alert. Clicking on it lets you inspect further and, in one click, add the annotations in-tool or send them to your labeling tool.
DATA ERROR 3: Errors in Labels
Often, human labelers assign an incorrect ground truth label. Again, despite CoNLLpp being a corrected dataset with only four classes (Location, Person, Organization, Misc), there are still a number of mislabeled spans.
Galileo’s “Likely Mislabeled” alert card exposes mislabeled data with high precision. Again, with one click, we can fix these samples by re-labeling within Galileo or export them to a labeling tool through Galileo’s integrations.
Conclusion
We covered how to fine-tune a model for NER tasks using the powerful Hugging Face library and how to use Galileo to inspect the quality of the model and the dataset.
This is only a fraction of what you can achieve using Galileo (more documentation here). Feel free to reach out here to get a personalized demo from a member of the Galileo data science team.
Hope this proved useful, and happy building!