AI hallucinations in legal workflows: building verifiable pipelines

Stanford's 2024 study put hallucination rates on legal queries between 58% (GPT-4) and 88% (Llama 2). The right response is not to ban AI from legal work. It is to build pipelines where every output is verifiable, traceable, and inspectable by default.

9 min read | By SafeMediAI Editorial

In May 2023 a Manhattan attorney filed a federal brief that cited six judicial opinions. None of them existed. ChatGPT had generated them. The judge described the resulting filings as "gibberish" and "bogus."[3] Mata v. Avianca became the case most commonly cited when discussing AI in legal work, not for what it said about the law, but for what it said about the deployment.

Eight months later, Stanford's RegLab and HAI published the first large-scale empirical study of legal hallucination rates in major LLMs.[1] The findings, summarised in Stanford HAI's headline,[2] were sharper than most of the legal-tech press had admitted: a 58% hallucination rate for GPT-4 on verifiable legal questions about federal cases and 88% for Llama 2, with rates climbing as queries became more complex.

There are two unproductive responses to that data. The first is to ban AI from legal work entirely, which ignores the cost savings AI does deliver on lower-stakes tasks. The second is to claim the problem will be fixed in the next model release, which has been claimed continuously since GPT-3.

The productive response is to treat hallucination as a structural property of LLMs and build legal pipelines that absorb it. This piece walks through what that looks like.

What hallucinations actually are

A large language model generates plausible continuations of a prompt. It has no concept of truth, no internal database of verified facts to check against, and no reliable way to know what it does not know. When asked about a federal case, it will produce text that looks like a case description regardless of whether the case exists. The output is a statistical artefact of the training distribution, not a retrieval from a knowledge base.

This makes the failure mode systematic, not occasional. The legal domain is particularly exposed because:

  • Citations are dense and specific. Case names, reporter volumes, page numbers, parenthetical descriptions. The model can produce all of them in correct format while every individual element is fabricated.
  • Plausibility is uncorrelated with accuracy. A fabricated case can be more "reasonable-sounding" than a real one. There is no useful surface signal.
  • Verification is expensive. Checking citations manually takes about as long as writing the brief, undoing the productivity gain that motivated using the model in the first place.
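The first point can be made concrete with a minimal Python sketch. The regular expression is a simplified, hypothetical pattern for a few federal reporters (an assumption for illustration, not a full citation grammar), and the second citation string is fabricated for this example. Both pass the format check, which is why well-formedness says nothing about existence.

```python
import re

# Simplified, illustrative pattern: "<volume> <reporter> <page> (<court> <year>)".
# An assumption for this sketch, not a complete citation grammar.
CITATION_FORMAT = re.compile(
    r"^(?P<volume>\d+) (?P<reporter>F\. Supp\. 3d|F\.3d|F\.4th|U\.S\.) "
    r"(?P<page>\d+) \((?P<court>[A-Za-z0-9. ]+ )?(?P<year>\d{4})\)$"
)


def looks_like_citation(text: str) -> bool:
    """Return True when the string is well-formed. Says nothing about existence."""
    return CITATION_FORMAT.match(text) is not None


# Taken from the reference list of this article.
real = "678 F. Supp. 3d 443 (S.D.N.Y. 2023)"
# Fabricated for this example: same shape, no such volume.
fabricated = "1450 F.3d 212 (2d Cir. 2019)"

for candidate in (real, fabricated):
    print(candidate, "->", looks_like_citation(candidate))
```

A format check like this is still useful as a parser, but only a lookup against an authoritative database can separate the two strings.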

What the Stanford study actually measured

Dahl, Magesh, Suzgun and Ho took an evaluation set of verifiable legal questions and tested GPT-3.5, GPT-4, PaLM 2, and Llama 2 against ground truth from authoritative legal databases.[1]

Three findings travelled well beyond the paper:

  1. Hallucination rates were high even for verifiable, well-defined questions about random federal court cases. The headline figures were 58% for GPT-4, 69% for GPT-3.5, 72% for PaLM 2, and 88% for Llama 2.
  2. Complexity correlated with failure. Questions that required reasoning about precedential relationships, dissenting opinions, or holdings on specific issues failed more often than simple factual lookups.
  3. Models confidently asserted incorrect answers. Confidence is not a signal of correctness, and asking the model "are you sure?" did not reliably correct errors.

Stanford's follow-up study in 2024 went further and tested legal-specific RAG products sold by major legal-tech vendors.[5] Even those products, which combine LLMs with retrieval against curated legal databases, hallucinated on 17-33% of queries. The "RAG fixes hallucinations" claim that dominated 2023-era legal AI marketing did not survive the empirical test.

Where RAG actually helps

Retrieval-augmented generation, formalised in the Lewis et al. NeurIPS 2020 paper[4], works by retrieving relevant passages from a knowledge base at inference time and conditioning the model's generation on those passages. For legal work, the knowledge base is usually a corpus of case law, statutes, regulations, or internal precedent.

Done well, RAG produces three benefits:

  • Citations point to retrieved sources. If the model is conditioned on retrieved passages, the resulting citation is more likely to refer to a real document. The Stanford 2024 study found this is still not a guarantee, but the hallucination rate is much lower than for general-purpose models prompted without retrieval.
  • Updates without retraining. Adding a new case or statute to the knowledge base does not require retraining the model.
  • Citation traceability. The system can show the user which retrieved passage informed each claim, which is the foundation of verifiable workflows.

Where RAG breaks:

  • The retriever returns irrelevant passages and the model writes around them.
  • The retriever returns relevant passages but the model paraphrases incorrectly.
  • The retriever returns nothing and the model fills the gap with confident fabrication.
  • The model treats retrieved passages as suggestions rather than constraints.
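Two of these failure modes (irrelevant passages and empty retrieval) can be guarded against before the model is ever called. The following is a minimal sketch under stated assumptions: `Passage`, `gate_retrieval`, and `answer` are hypothetical names, the relevance score is assumed to lie in [0, 1], the 0.5 threshold is arbitrary, and `generate` stands in for whatever LLM call a deployment uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical identifier of the document in the corpus
    text: str
    score: float    # retriever relevance score, assumed to lie in [0, 1]


REFUSAL = "No relevant authority was found in the corpus."


def gate_retrieval(passages, min_score=0.5):
    """Keep only passages at or above the (arbitrary) relevance threshold."""
    return [p for p in passages if p.score >= min_score]


def answer(question, passages, generate):
    """Refuse instead of generating when nothing relevant was retrieved.

    `generate` is a stand-in for the LLM call: any callable that takes
    (question, passages) and returns text.
    """
    relevant = gate_retrieval(passages)
    if not relevant:
        return REFUSAL
    return generate(question, relevant)
```

The gate does nothing for incorrect paraphrase or for a model that ignores its context. Those need the post-generation checks described in the next section.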

Engineering around these failure modes is what separates a serious legal-AI deployment from a wrapper around ChatGPT.

What "verifiable pipeline" actually means

A legal-AI pipeline that takes hallucinations seriously has a small number of non-negotiable properties. The pattern is consistent across competent vendors and is mostly absent from the rest.

Citations must be machine-checkable

Every cited authority in the model's output should resolve to a record in a known database (Westlaw, Lexis, EUR-Lex, Norwegian Lovdata, the appropriate court's docket) at the time of generation. The check should be automatic, not manual. If a citation cannot be resolved, the system flags or removes it; it does not pass through.

This requires that the model output a structured citation format the post-processor can parse. "Mata v. Avianca, 678 F. Supp. 3d 443 (S.D.N.Y. 2023)" should be emitted as a typed object, not free text mixed into prose.
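A minimal sketch of what a typed citation and an automatic resolution check could look like follows. The field names, the `KNOWN_AUTHORITIES` dictionary, and the `resolve` function are assumptions for illustration. In a real system the lookup would query a citation database, and no vendor API is implied here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    case_name: str
    volume: int
    reporter: str
    page: int
    court: str
    year: int


# Stand-in for a real citation database, keyed by (volume, reporter, page).
KNOWN_AUTHORITIES = {
    (678, "F. Supp. 3d", 443): "Mata v. Avianca, Inc.",
}


def resolve(citation: Citation) -> Optional[str]:
    """Return the recorded case name, or None if the citation does not resolve."""
    key = (citation.volume, citation.reporter, citation.page)
    return KNOWN_AUTHORITIES.get(key)


def check_citations(citations):
    """Split citations into (resolved, flagged). Flagged ones must not pass through."""
    resolved, flagged = [], []
    for c in citations:
        (resolved if resolve(c) is not None else flagged).append(c)
    return resolved, flagged


mata = Citation("Mata v. Avianca, Inc.", 678, "F. Supp. 3d", 443, "S.D.N.Y.", 2023)
# Hypothetical citation, fabricated for this example.
fake = Citation("Doe v. Example Airlines", 1450, "F.3d", 212, "2d Cir.", 2019)
resolved, flagged = check_citations([mata, fake])
```

Because the citation is a typed object, the check does not depend on parsing prose. A stricter variant would also compare the recorded case name, court, and year against the database record.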

Retrieved passages must be visible to the user

The user reading the model's output should be able to see, alongside each claim, the specific passage from the knowledge base that supports it. Not the title of the case. The actual passage text, with the citation. Hovering, sidebars, footnote-style links are all fine UI choices. Omitting the underlying passage is not.

The reason is the failure mode where retrieval is correct but paraphrasing is wrong. The user must be able to compare the model's claim against the retrieved text without leaving the workflow.
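Part of that comparison can be automated. The sketch below pairs each claim with its supporting passage and reports any directly quoted span that does not appear verbatim in the passage. `SupportedClaim` and `unsupported_quotes` are hypothetical names, and the check only handles straight double quotes. It catches misquotation, not incorrect paraphrase, so it supplements the side-by-side display rather than replacing it.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportedClaim:
    claim: str    # sentence produced by the model
    passage: str  # retrieved passage shown next to the claim
    source: str   # citation string for the passage


def _normalise(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(item: SupportedClaim) -> list:
    """Return quoted spans in the claim that are absent from the passage."""
    quotes = re.findall(r'"([^"]+)"', item.claim)
    passage = _normalise(item.passage)
    return [q for q in quotes if _normalise(q) not in passage]


# Illustrative text invented for this example, not taken from a real opinion.
passage = "The court held that the claim was filed outside the limitation period."
good = SupportedClaim('The claim was "filed outside the limitation period".', passage, "src-1")
bad = SupportedClaim('The court found the claim "timely filed".', passage, "src-1")
```

Here `unsupported_quotes(good)` is empty, while `unsupported_quotes(bad)` returns the span that the passage does not contain.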

Confidence and uncertainty must be surfaced honestly

A model that is uncertain should say so. A model that is unable to find relevant authority should say that, not invent some. Confidence calibration in LLMs is still a research problem, but the floor is that the system should not present uncertain outputs with the same visual weight as well-supported ones. UI signals (greyed-out text, "low confidence" tags, separation of "synthesised" from "quoted" passages) carry real safety value.
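One way to implement that floor without solving calibration is to tag each claim by how it relates to its source and let the UI style it accordingly. The sketch below is a crude categorical proxy, not a confidence estimate. The `Support` categories, the substring test, and the style strings are all assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class Support(Enum):
    QUOTED = "quoted"            # claim text appears verbatim in the passage
    SYNTHESISED = "synthesised"  # a passage exists, but the wording is the model's
    UNSUPPORTED = "unsupported"  # no passage backs the claim


def classify(claim: str, passage: Optional[str]) -> Support:
    """Assign a support level using a simple case-insensitive substring test."""
    if not passage:
        return Support.UNSUPPORTED
    if claim.strip().lower() in passage.lower():
        return Support.QUOTED
    return Support.SYNTHESISED


# Hypothetical mapping from support level to visual treatment.
DISPLAY_STYLE = {
    Support.QUOTED: "normal weight, linked to source",
    Support.SYNTHESISED: "normal weight, 'synthesised' tag",
    Support.UNSUPPORTED: "greyed out, 'low confidence' tag",
}
```

The point of the mapping is that unsupported text can never render with the same visual weight as quoted text, whatever the model's tone.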

The audit trail must survive the workflow

For regulated legal work, the entire prompt, retrieval set, model output, citation-check result, and human review action need to be logged and retained. This is a Sarbanes-Oxley equivalent for the AI era: when something goes wrong, you need to be able to reconstruct exactly what the system saw and decided. This is also where the legal-AI vendor's posture toward client data confidentiality gets tested.
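A minimal sketch of such a log record follows. The field names mirror the items listed above. The hash chain that links each record to the previous one is an added assumption, included to show one way of making after-the-fact edits detectable. Retention, access control, and confidentiality are outside the scope of the sketch.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash for the first record in the chain


def _digest(body: dict) -> str:
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_audit_record(log, *, prompt, retrieved_ids, output,
                        citation_check, review_action):
    """Append one record; each record commits to the hash of the previous one."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "retrieved_ids": retrieved_ids,
        "output": output,
        "citation_check": citation_check,
        "review_action": review_action,
        "prev_hash": log[-1]["record_hash"] if log else GENESIS,
    }
    record["record_hash"] = _digest(record)
    log.append(record)
    return record


def verify_chain(log) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        if record["prev_hash"] != prev_hash or _digest(body) != record["record_hash"]:
            return False
        prev_hash = record["record_hash"]
    return True
```

In this sketch the log is an in-memory list. A deployment would write to append-only storage, but the reconstruction property is the same: each step of the workflow is captured in one record that can be checked later.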

Human review is in the loop, not on the side

The Datatilsynet sandbox reports on PrevBOT and Doorkeeper apply directly here: "there is a human in the loop" without specifics is hand-waving. For legal AI, the specifics that matter are: which outputs require review (default: all citation-bearing outputs), what the reviewer is supposed to verify (does the citation resolve, does the claim match the retrieved passage), what training the reviewers have, and what the reviewer is expected to do when the AI is wrong.
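Those specifics can be encoded so that an output cannot be released until each check has been recorded. The sketch below is one hypothetical shape: `ReviewTask`, its check names, and its status values are assumptions for illustration, covering the default rule (all citation-bearing outputs) and the two verification questions named above.

```python
from dataclasses import dataclass, field


def _default_checks():
    # None means "not yet verified by the reviewer".
    return {"citation_resolves": None, "claim_matches_passage": None}


@dataclass
class ReviewTask:
    output_id: str
    has_citations: bool
    checks: dict = field(default_factory=_default_checks)

    def requires_review(self) -> bool:
        """Default policy: every citation-bearing output is reviewed."""
        return self.has_citations

    def record(self, check: str, passed: bool) -> None:
        """Store the reviewer's verdict for one named check."""
        if check not in self.checks:
            raise KeyError(f"unknown check: {check}")
        self.checks[check] = passed

    def status(self) -> str:
        if not self.requires_review():
            return "no_review_required"
        results = list(self.checks.values())
        if any(r is None for r in results):
            return "pending"
        return "approved" if all(results) else "rejected"
```

A task stays `pending` until both checks are recorded, so "reviewed" cannot mean that a reviewer simply glanced at the output.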

A working architecture

A deployable architecture for legal AI on regulated documents that takes the above seriously tends to converge on a single shape. We have seen variants of it across competent legal-AI vendors, in-house deployments at law firms, and the public reports from supervisors.

  1. Document ingestion, with PII redaction on identifying details that are not legally relevant, run client-side or in a sovereign environment.
  2. Retrieval against a curated legal corpus (case law for the jurisdiction, statutes, regulations, in-house precedents).
  3. LLM generation conditioned on the retrieved passages, with the prompt explicitly instructing the model to ground every claim in a passage and to refuse if no relevant passage is found.
  4. Citation validation, post-hoc, against the source corpus and external citation databases. Unverified citations are flagged or removed.
  5. Source attribution surfaced in the UI so the lawyer reading the output sees the supporting passage next to each claim.
  6. Human review with explicit verification steps for every citation-bearing output.
  7. Audit log of prompt, retrieval, generation, validation, and review steps.
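The seven steps above can be sketched as a single function that wires the stages together. Everything here is a placeholder under stated assumptions: `retrieve`, `generate`, and `validate` are caller-supplied callables standing in for a real retriever, LLM, and citation database, and `redact` masks e-mail addresses only, far narrower than real PII redaction.

```python
import re


def redact(text: str) -> str:
    """Step 1 placeholder: masks e-mail addresses only."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[REDACTED]", text)


def run_pipeline(document, question, *, retrieve, generate, validate, log):
    """Run the seven stages in order and return what the reviewer will see.

    Assumed signatures (hypothetical):
      retrieve(question) -> list of passages
      generate(question, document, passages) -> {"text": str, "citations": list}
      validate(citation) -> bool
    """
    clean = redact(document)                                   # 1. ingestion + redaction
    passages = retrieve(question)                              # 2. retrieval
    if passages:
        draft = generate(question, clean, passages)            # 3. grounded generation
    else:
        draft = {"text": "No relevant authority found.", "citations": []}
    verified = [c for c in draft["citations"] if validate(c)]  # 4. citation validation
    flagged = [c for c in draft["citations"] if c not in verified]
    result = {
        "text": draft["text"],
        "passages": passages,                                  # 5. source attribution
        "verified_citations": verified,
        "flagged_citations": flagged,
        "needs_review": bool(draft["citations"]),              # 6. human review gate
    }
    log.append({"question": question, **result})               # 7. audit log
    return result
```

Note that generation is skipped entirely when retrieval returns nothing, and that flagged citations travel with the result instead of being dropped silently, so the reviewer can see what was removed.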

This is more work than wrapping ChatGPT in a custom prompt. It is the work the Mata v. Avianca sanctions, the Stanford studies, and the post-2023 wave of judicial orders requiring AI-disclosure in filings are all pointing toward.

What lawyers should actually ask vendors

The list of questions that separate honest legal-AI vendors from the rest is short.

  1. What hallucination rate do you measure, on what benchmark, and how often do you re-evaluate? A vendor that does not have a number does not know.
  2. How do you handle a case where the model produces a citation that does not resolve? Suppress, flag, retry? The honest answers are specific.
  3. Where does the source passage appear in the user UI? The right answer is "next to the claim that uses it," not "in a sidebar three clicks away."
  4. Where is the audit log kept, who can access it, and how long is it retained?
  5. What jurisdictional coverage does your retrieval corpus have? A US-trained legal AI applied to Norwegian or EU law will hallucinate more, not less. Coverage gaps must be explicit.
  6. Have you been independently evaluated? Stanford's 2024 RAG study is a useful baseline. Vendors should be able to discuss their results against published benchmarks.
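The first question presumes the vendor can state a rate together with its uncertainty. As a minimal sketch of the arithmetic, the code below computes a hallucination rate from per-query labels and a Wilson score interval around it. The function names are hypothetical, the 1.96 multiplier assumes a 95% interval, and any labels fed in would come from the vendor's own benchmark, not from the studies cited here.

```python
import math


def hallucination_rate(labels) -> float:
    """Fraction of queries labelled as hallucinated (True = hallucinated)."""
    if not labels:
        raise ValueError("need at least one labelled query")
    return sum(labels) / len(labels)


def wilson_interval(failures: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half
```

A rate measured on a few dozen queries carries a wide interval, which is why the benchmark size and the re-evaluation cadence matter as much as the headline number.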

The good news is that this conversation, awkward in 2023, is now routine in 2026. Legal-AI procurement has matured fast under the pressure of Mata, the Stanford studies, and the steady drip of judicial orders requiring disclosure. The vendors who survive the next two years will be the ones who built verifiable pipelines instead of clever prompts.

References

  1. Dahl, Magesh, Suzgun, Ho, Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, Journal of Legal Analysis (2024)
  2. Stanford HAI, AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries
  3. Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y. 2023)
  4. Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020
  5. Magesh et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, Stanford RegLab (2024)