Guest Post: Stop Hallucinations From Hurting Your LLM-Powered Apps
Large language model (LLM) hallucinations pose a serious threat to the successful adoption of the new wave of LLM apps. In this post, the Galileo team dives into how to prevent hallucinations from creeping in, as well as some metrics developed by the researchers at Galileo to quantify potential LLM hallucinations. They also introduce free access to the Galileo LLM Studio, powered by research-backed mechanisms to combat LLM hallucinations.
Leveraging large language models to build useful and interactive products has never been easier. This has sparked a new wave of "LLM developers" who combine LLMs (open source or closed source), frameworks such as LangChain, LlamaIndex, and LangFlow, and perhaps a vector database to spin up an LLM-powered product. A large number of startups, mid-market businesses, and large enterprises have been racing in this direction.
Given this, it becomes increasingly important to be mindful of a huge bottleneck in LLM adoption: hallucinations, i.e., models being overconfidently incorrect in their predictions.
We have already seen this crop up in Google's Bard launch, Microsoft's launch of the Tay chatbot a few years ago, and numerous similar examples recently. This poses not just a reputational risk to the business, but also a societal risk, with disinformation and biased viewpoints proliferating.
What are LLM hallucinations?
LLMs hallucinate when their predictions are based on insufficient or inaccurate training data. For instance, GPT-4 was trained on data up to September 2021. When prompted with questions about events after that date, it has a high likelihood of producing a hallucinated response. A research paper from the Center for Artificial Intelligence Research (CAiRE) defines a hallucination from an LLM as "when the generated content is nonsensical or unfaithful to the provided source content".
A popular example of this is GPT-4 being asked "Who owns Silicon Valley Bank?".
The response is "Silicon Valley Bank (SVB) is not owned by a single individual or entity. It is a subsidiary of SVB Financial Group, a publicly traded company (NASDAQ: SIVB). The ownership of SVB Financial Group is distributed among individual and institutional shareholders who own its stocks. As a publicly traded company, its ownership can change frequently as shares are bought and sold on the stock market."
In this case, GPT-4 has no idea about the recent SVB collapse. To mitigate disinformation from this "hallucinated" response, OpenAI recently added the "As of my knowledge cutoff in September 2021," prefix ahead of such responses.
Why LLM hallucinations occur
LLMs are, at the end of the day, large neural networks that predict the next token in a sequence: the next character, sub-word, or word.
In mathematical terms: given a sequence of tokens T_1, T_2, ..., T_N, the LLM learns the probability distribution of the next token T_{N+1} conditioned on the previous tokens: P(T_{N+1} | T_1, T_2, ..., T_N).
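As a concrete illustration, the snippet below is a minimal sketch that inspects the distribution P(T_{N+1} | T_1, ..., T_N) a causal LM assigns to the next token. GPT-2 from Hugging Face and the example prompt are illustrative stand-ins, not anything specific to Galileo.

```python
# Minimal sketch: inspect the next-token distribution P(T_{N+1} | T_1, ..., T_N)
# using GPT-2 from Hugging Face as a stand-in for any causal LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Silicon Valley Bank is owned by"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, N, vocab_size)

# Probability distribution over token T_{N+1}, conditioned on the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([int(i)])!r}: {p.item():.3f}")
```

If the model has seen little data about the topic, this distribution tends to be flat or to put weight on implausible continuations, which is exactly the failure mode discussed next.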
Two factors can strongly influence LLM hallucination:
Training data: when trained with insufficient data about a topic, the LLM might choose a token that has a low probability as the next one in the sequence. When building LLMs for a particular use case, such as summarizing medical terms for a patient, it is imperative to train with domain-specific data.
Prompt instruction: the instruction used in a prompt can influence the persona the LLM adopts when responding. For instance, a prompt that dictates "explain this concept to the user as if they are a 5 year old" will lead to a different response than "explain this concept to a patient as if you are a pathologist".
Quantifying LLM Hallucinations
The best ways to reduce LLM hallucinations are:
Debugging the training data to ensure your LLM is being trained with unbiased, well-balanced data that is appropriate for your use case.
Finding the right prompt instruction/template to ensure the responses are being delivered appropriately, along with caveats around potential gaps in information. This goes a long way in ensuring the user is aware of potential hallucinations.
To take this a step further, the researchers at Galileo have developed promising metrics to quantify hallucination.
"minLogProb": the minimum value of the model's log-probabilities over the tokens in its generated output. This metric is simple and cheap to compute; lower values suggest potential hallucination (see the first sketch after this list).
"ChatGPT-Friend": we generate multiple completions from the same prompt, then ask ChatGPT whether they agree with one another. If ChatGPT says they don't agree, this suggests hallucination. This can be usefully combined with minLogProb in several ways. For example, we can use minLogProb to filter the data to a subset, then run ChatGPT-Friend only on that subset, saving some of ChatGPT-Friend's inference cost (see the second sketch below).
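Here is a minimal sketch of the minLogProb idea, assuming a Hugging Face causal LM (GPT-2 as a stand-in); the exact computation Galileo uses may differ:

```python
# Minimal sketch of the "minLogProb" idea, assuming a Hugging Face causal LM
# (GPT-2 as a stand-in). The exact computation used by Galileo may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def min_log_prob(prompt: str, max_new_tokens: int = 30) -> float:
    """Generate a completion and return the minimum log-probability over
    the generated tokens; lower values suggest potential hallucination."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            return_dict_in_generate=True,
            output_scores=True,
        )
    # Log-probability of each generated token under the model.
    token_log_probs = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    return token_log_probs[0].min().item()

print(min_log_prob("Who owns Silicon Valley Bank?"))
```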
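And a hedged sketch of the ChatGPT-Friend check using the OpenAI Python SDK; the judging prompt, model name, and yes/no parsing are illustrative assumptions, not Galileo's exact implementation:

```python
# Hedged sketch of the "ChatGPT-Friend" check with the OpenAI Python SDK.
# The judging prompt, model name, and yes/no parsing are illustrative
# assumptions, not Galileo's exact implementation.
from openai import OpenAI

client = OpenAI()

def completions_agree(prompt: str, n: int = 3) -> str:
    # Sample n completions for the same prompt.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=1.0,
    )
    answers = [choice.message.content for choice in resp.choices]

    # Ask the model whether the answers are mutually consistent.
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(answers))
    judge = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Do the following answers to the same question agree with "
                f"each other? Reply 'yes' or 'no'.\n\n{numbered}"
            ),
        }],
        temperature=0.0,
    )
    # "no" suggests the completions disagree, i.e. potential hallucination.
    return judge.choices[0].message.content

print(completions_agree("Who owns Silicon Valley Bank?"))
```

As noted above, one way to control cost is to compute min_log_prob for every response and only run completions_agree on the low-scoring subset.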
Introducing the Galileo LLM Studio
Building high-performing LLM-powered apps requires careful debugging of prompts and training data. The Galileo LLM Studio provides powerful tools to do just that, powered by research-backed mechanisms to combat LLM hallucinations, and it's 100% free for the community to use.
Prompt Inspector: Build, evaluate, and manage prompts. Minimize hallucinations and costs, and find the 'right' prompt fast.
LLM Debugger: Galileo hooks into your LLM to auto-identify the data that pulls model performance down. Get algorithm-powered superpowers to inspect, fix, and improve your LLM performance.
Conclusion
If you are interested in trying the Galileo LLM Studio, join the waitlist along with thousands of developers building exciting LLM-powered apps.
The problem of model hallucinations poses a serious threat to adopting LLMs in applications at scale for everyday use. By focusing on ways to quantify the problem, as well as baking in safeguards, we can build safer, more useful products for the world and truly unleash the power of LLMs.
References & Acknowledgments
The calibration and building blocks of Galileo's LLM hallucination metric are the outcome of numerous techniques and experiments, with references to (but not limited to) the following papers and artifacts:
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models, Potsawee Manakul, Adian Liusie, Mark J. F. Gales.
OpenAssistant Conversations dataset: Democratizing Large Language Model Alignment, Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, Alexander Mattick.
Survey of Hallucination in Natural Language Generation, Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, Pascale Fung.
And last but not least, a special thanks to Robert Friel and the entire ML research team at Galileo for conducting numerous experiments on standardized benchmark datasets to develop and test the efficacy of Galileo's novel LLM hallucination metric.