Large Language Models (LLMs) Hallucinate 100% of the Time

For a long time now I’ve been telling people that large language models (LLMs) such as Google’s Gemini or OpenAI’s ChatGPT hallucinate 100% of the time. Here is my brief explanation of this claim.

An LLM is a type of artificial intelligence (AI) model that is trained via a deep learning strategy. LLMs predict, guess if you will, the next “thing” in a series of things. In the case of a chatbot that thing is text, for an image generator an image, for a music generator music, and so on. When the guess isn’t very good, or is plainly wrong, it’s common to claim that the AI has hallucinated. When the guess is of higher quality, an answer that we believe to be sufficiently accurate for our needs, we accept it and move on. On the surface, it appears that hallucination is in the eye of the beholder, but in practice the real issue is the quality of the guess. The point is that every answer produced by an LLM is effectively a hallucination, the quality of which can range from ridiculous to exceptional.
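
To make “predicting the next thing” concrete, here is a minimal sketch of next-token prediction. It assumes you have the Hugging Face transformers and torch libraries installed; the GPT-2 model and the example prompt are purely illustrative choices, and any causal language model behaves the same way.

```python
# A minimal sketch of next-token prediction, assuming the "transformers" and
# "torch" libraries are installed. GPT-2 is used only because it is small;
# any causal LLM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate a few tokens, one guess at a time.
for _ in range(5):
    with torch.no_grad():
        logits = model(input_ids).logits      # a score for every token in the vocabulary
    next_token = logits[0, -1].argmax()       # pick the single most likely "next thing"
    input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
# Whatever comes out is still a guess; for well-covered prompts it just
# tends to be a good one.
```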

Improving How LLMs Hallucinate

There are several strategies that you can follow to improve the quality of the hallucinations produced by an LLM:

  1. Choose the best LLM for your context. There is a plethora of choices when it comes to LLMs, each with its own strengths and weaknesses. If you’re using a publicly available LLM “out of the box”, and the quality of the answers it produces is important to you, then it behooves you to identify which LLM(s) are oriented towards the types of prompts you are likely to pose.
  2. Better prompting. Improving the descriptiveness of the prompt that you submit to the LLM will improve the results it produces. This is what prompt engineering is all about.
  3. Higher-quality training data. If you are building an LLM, or another type of AI model for that matter, the better the quality of the data that goes into training the model, the better the quality of the predictions that it will make (on average). AI is limited by the law of GIGO – garbage in, garbage out. I’ve been writing a fair bit about data quality (DQ) over the years, with more to come.
  4. Retrieval augmented generation (RAG). RAG is a strategy where you programmatically add relevant data to a prompt so that the LLM can base its answer on that data. The greater the accuracy of that data, and the better the wording of the generated prompt, the greater the quality of your model’s predictions. A minimal sketch of this idea appears after this list.
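
To illustrate the mechanics of RAG, here is a minimal sketch. The document snippets, the crude keyword-overlap scoring, and the prompt wording are all made-up assumptions for illustration; a real system would typically use embedding-based retrieval over a document store and then send the augmented prompt to whichever LLM you have chosen.

```python
# A minimal RAG sketch: retrieve the most relevant snippet for a question,
# then splice it into the prompt so the LLM answers from the supplied data.
# The snippets, scoring, and prompt wording are illustrative assumptions;
# production systems usually use embedding search over a document store.

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
    "Premium subscriptions include priority support and offline access.",
]

def score(question: str, doc: str) -> int:
    """Crude relevance score: the count of shared lowercase words."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

def build_rag_prompt(question: str) -> str:
    """Pick the best-matching snippet and embed it in the prompt."""
    best_doc = max(documents, key=lambda d: score(question, d))
    return (
        "Answer the question using only the context below.\n"
        f"Context: {best_doc}\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_rag_prompt("How many days do I have to return an item?")
print(prompt)
# This prompt is what you would send to your LLM of choice.
```

The key point is that the quality of the final answer is now bounded by the quality of the retrieved context and the wording of the generated prompt, which is exactly why the accuracy of that data matters so much.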

Accept the Fact that LLMs Hallucinate

In short, the question you need to ask yourself is whether a given hallucination is something you can live with.

You can read my other AI blog postings here.
