How to Stop Your AI Chatbot from Making Things Up — A Practical Guide for Government and Education
Marcus Webb
Senior Solutions Architect, Keyspider
March 2025
13 min read
In March 2024, a US county government deployed a general-purpose AI chatbot on its website. Within 72 hours, it had told residents the wrong deadline for a property tax exemption, invented a non-existent parking amnesty programme, and confidently quoted a figure for utility bill assistance that was $400 higher than the actual amount. The chatbot was removed. The county IT director spent a week managing the fallout.
This is not a hypothetical. Variants of this story are playing out across state and local government and education organisations around the world. The pressure to deploy AI is real — from leadership, from peers, from residents who use AI assistants everywhere else and expect the same from their government. But the specific failure mode of AI hallucination — confident, fluent, plausible-sounding incorrect information — is uniquely dangerous in public sector and education contexts. Understanding how an AI assistant should be architected is the first step to deploying one safely.
This article explains why hallucinations happen, why the stakes are higher in SLED environments than anywhere else, and exactly what architectural and operational choices eliminate the risk for organisations that need to get this right.
What Hallucination Actually Is — and Why It Happens
The term 'hallucination' describes a large language model (LLM) generating output that is factually incorrect, fabricated, or not supported by any real source — presented with the same confidence and fluency as accurate information. It is not a bug in the conventional sense. It is an emergent property of how LLMs work.
LLMs are trained to predict the most statistically probable next token given the preceding context. They learn patterns of language, reasoning, and knowledge from an enormous training corpus. But they do not retrieve facts from a database and they do not have direct access to verified, current information. They generate text that is coherent and contextually appropriate — which usually means accurate, but not always.
When an LLM is asked about a topic where its training data was sparse, outdated, ambiguous, or conflicting — or when the question sits at the edge of its knowledge — it fills the gap with plausible-sounding content. It does not know what it does not know. It cannot flag uncertainty reliably. And in the context of a customer-facing chatbot on a government or university website, it will state a wrong benefit amount, a wrong deadline, or a non-existent programme with exactly the same confident, polished tone it uses when it is correct.
'A citizen cannot tell the difference between an AI that is right and an AI that sounds right. That is the fundamental problem. In a commercial context, a confident wrong answer about a product feature is inconvenient. In a government context, it can cause real harm — a missed deadline, an incorrect tax filing, a lost entitlement.'
— Digital transformation lead, NSW state agency
Why SLED Environments Face Unique Risk
The stakes are higher than in commercial contexts
When an e-commerce chatbot hallucinates a return policy, a customer is annoyed and contacts support. When a government chatbot hallucinates an eligibility requirement for a housing benefit, a vulnerable person may fail to apply for assistance they need. When a university chatbot hallucinates a scholarship deadline, a student may miss a life-changing opportunity. The asymmetry between the consequences of being right and the consequences of being wrong is vastly greater in public sector and education than in most commercial contexts.
Trust, once broken, is not easily rebuilt
Government and educational institutions operate on public trust. That trust, built over decades, can be damaged by a single highly visible AI failure in a way that takes years to repair. A viral social media post showing a government chatbot giving wrong information about emergency housing does not just embarrass the digital team — it undermines citizen confidence in the agency's ability to use technology responsibly.
Policy and legislation changes frequently
A general-purpose LLM has a training cutoff date. Government policy, legislation, and eligibility criteria change constantly — through budget cycles, legislative amendments, ministerial decisions, and administrative updates. An LLM trained before a policy change will state the old policy with the same confidence as the new one. For an agency that updated its rental assistance income thresholds in November, a chatbot trained in July is a liability.
Legal and FOI exposure
In many jurisdictions, government agencies have obligations around the accuracy of information they provide to citizens. Where an AI chatbot provides materially incorrect guidance that a citizen relies upon to their detriment, the legal exposure — and the Freedom of Information implications — are non-trivial. CISOs and general counsel in SLED organisations are increasingly aware of this, and procurement of AI tools without adequate accuracy governance is creating real liability.
The Architecture That Eliminates Hallucination: Retrieval-Augmented Generation
The solution to hallucination in SLED contexts is not a better LLM. It is a different architecture. Retrieval-augmented generation (RAG) constrains the AI's answer generation to a defined, controllable corpus of approved content — your organisation's own documents, website, and knowledge bases.
Here is how it works in practice:
1. The citizen asks a question in the chat interface.
2. A semantic retrieval system searches your indexed content — your website, policy documents, FAQs, and knowledge base — for the most relevant passages.
3. The most relevant passages are passed to the language model as context, along with the citizen's question.
4. The language model generates an answer using only the provided context. It does not draw on its general training knowledge.
5. The answer is presented to the citizen with citations — links to the source documents the answer was drawn from.
6. If the answer cannot be found in your indexed content, the system says so — directing the citizen to contact support rather than inventing an answer.
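That flow can be sketched in a few lines of code. The sketch below is illustrative only: `retrieve` and `generate` stand in for whatever semantic search and generation services your platform provides, and the 0.75 threshold is a placeholder to be tuned per deployment, not a recommended setting.

```python
from typing import Callable, Dict, List

FALLBACK = "I'm not sure about that. Please contact us directly."

def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[Dict]],  # semantic search over the approved index
    generate: Callable[[str, str], str],         # LLM call constrained to the supplied context
    min_score: float = 0.75,                     # placeholder confidence threshold
) -> Dict:
    # Steps 1-2: retrieve the most relevant passages from the approved corpus.
    passages = retrieve(question, 5)

    # Step 6: if nothing sufficiently relevant is found, decline rather than invent.
    if not passages or max(p["score"] for p in passages) < min_score:
        return {"answer": FALLBACK, "sources": []}

    # Steps 3-4: generate an answer using only the retrieved context.
    context = "\n\n".join(p["text"] for p in passages)
    answer = generate(question, context)

    # Step 5: return the answer with citations to the source documents it drew from.
    return {
        "answer": answer,
        "sources": [{"title": p["title"], "url": p["url"]} for p in passages],
    }
```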
What this means operationally
In a RAG-based system, the AI cannot say anything that is not in your content. If your policy document says the income threshold is $52,000, the chatbot says $52,000. If your policy document has not been updated since the threshold changed, that is a content governance problem — but it is visible, auditable, and fixable. A hallucinating AI creates invisible failures. A grounded AI surfaces real ones.
The Citation Layer: Making AI Outputs Verifiable
RAG architecture eliminates hallucination, but it is citations that make a grounded AI system trustworthy in the eyes of the citizen — and auditable in the eyes of your governance team.
Every answer from a well-architected AI chat system should display, immediately below the response, the source documents from which the answer was drawn. 'This information is sourced from the Housing Assistance Policy, updated February 2025.' The citizen can click through, verify the answer, and read the full policy document if they choose. The caseworker can audit what information was given and trace it back to the policy version in effect at the time.
This is not a minor UX feature. For public sector deployments, it is a fundamental trust mechanism. It transforms AI chat from 'the computer said' — an answer with no accountability — to 'the computer said, and here is the official document it drew from'. That is a governance standard that most AI tools do not meet by default.
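One way to make that concrete is to treat citations as a required part of the response payload rather than decoration. The schema below is an illustration under that assumption, not any particular product's API; the field names are invented for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

# Illustrative response schema; the field names are assumptions, not a vendor API.

@dataclass
class SourceCitation:
    title: str         # e.g. "Housing Assistance Policy"
    url: str           # link the citizen can click through to verify the answer
    last_updated: str  # version or date of the document the answer was drawn from

@dataclass
class ChatResponse:
    answer: str
    citations: List[SourceCitation]
    answered_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def is_publishable(self) -> bool:
        # Governance rule: an answer with no citation should never reach the citizen.
        return len(self.citations) > 0
```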
Content Governance: The Other Half of the Equation
RAG stops the model inventing information that is not in your corpus. It does not eliminate the risk of giving wrong answers if your own content is wrong, outdated, or ambiguous. This is the part of AI chat deployment that technology vendors prefer not to emphasise, but it is where SLED organisations most often need to invest.
Content currency
Your AI chatbot is only as current as your indexed content. If a policy document is updated in your document management system but the web page that explains it is still showing last year's figures, your chatbot will tell citizens last year's figures. Real-time indexing — where content changes in your CMS are automatically re-indexed within minutes — is a non-negotiable requirement, not a nice-to-have.
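A common way to meet that requirement is an event-driven indexing hook: the CMS notifies the indexer on every publish, update, or unpublish rather than waiting for a scheduled crawl. A minimal sketch, assuming a hypothetical event shape and hypothetical `reindex_page` and `remove_page` calls on the indexer:

```python
# Sketch of a CMS publish-event handler that keeps the chat index current.
# The event shape and the indexer's reindex_page / remove_page calls are assumptions.

def handle_cms_event(event: dict, indexer) -> None:
    """Invoked by a CMS webhook whenever a page is published, updated, or unpublished."""
    url = event["url"]

    if event["action"] in ("publish", "update"):
        # Re-index the changed page immediately, so the chatbot answers from
        # the current version rather than last year's figures.
        indexer.reindex_page(url)
    elif event["action"] == "unpublish":
        # Unpublishing also acts as the 'turn it off' mechanism: content removed
        # from the index can no longer appear in any answer.
        indexer.remove_page(url)
```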
Content scope and permissions
Not all of your content should be in scope for your citizen-facing chatbot. Internal policies, draft documents, legal advice, and staff-only guidance should be excluded from the citizen-facing index. For internal chatbots, role-based access control must be enforced at the retrieval layer — a junior caseworker should not be able to extract sensitive case notes through a chatbot that has access to an index containing those notes.
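Enforced at the retrieval layer, that control is a filter applied to candidate passages before they ever reach the language model. A minimal sketch, assuming each indexed passage carries an `allowed_roles` tag applied at indexing time:

```python
# Minimal sketch: filter retrieved passages by the requesting user's roles
# BEFORE they reach the language model. The allowed_roles tag is an assumption
# about how documents are labelled at indexing time.

def filter_by_role(passages: list[dict], user_roles: set[str]) -> list[dict]:
    permitted = []
    for passage in passages:
        allowed = set(passage.get("allowed_roles", []))
        # 'public' content is visible to everyone, including the citizen-facing bot.
        if "public" in allowed or allowed & user_roles:
            permitted.append(passage)
    return permitted

# The citizen-facing chatbot runs with no internal roles at all, so staff-only
# guidance and case notes can never enter its context window:
# citizen_passages = filter_by_role(retrieved, user_roles=set())
```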
The 'I don't know' capability
A critically important and underrated quality of a well-configured AI chat system is the ability to recognise when a question cannot be answered from available content and to say so clearly — directing the user to a human channel rather than generating a plausible but potentially wrong response.
In a commercial chatbot, a graceful fallback is a nice touch. In a government chatbot, it is an ethical requirement. Configure your confidence threshold so that low-confidence answers trigger an 'I'm not sure about that — please contact us directly' response, not a hallucinated one. The resulting call-centre call is a far better outcome than the harm caused by a confident wrong answer.
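Expressed as code, the guard is simple: if retrieval confidence is low or there is no citable source, return the escalation message instead of the drafted answer. The threshold value and the wording below are placeholders to be tuned per deployment, not recommended settings.

```python
# Guard the answer path with a confidence threshold and an explicit escalation route.
# The threshold value and the message wording are placeholders, not recommendations.

CONFIDENCE_THRESHOLD = 0.75
ESCALATION_MESSAGE = (
    "I'm not sure about that. Please contact us directly so a staff member "
    "can help you."
)

def guarded_answer(retrieval_score: float, draft_answer: str, sources: list) -> str:
    # Low confidence or no citable source: escalate to a human channel
    # instead of returning a plausible but unverified answer.
    if retrieval_score < CONFIDENCE_THRESHOLD or not sources:
        return ESCALATION_MESSAGE
    return draft_answer
```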
- Grounding via RAG reduces hallucination rates from roughly 20% to under 0.5% in grounded deployments.
- 100% of answers should have a traceable source citation.
- Under 5 minutes is the content update latency target for real-time indexed AI chat.
- Zero answers should be generated from outside your approved content in a properly grounded system.
A Procurement Checklist for SLED Leaders
When evaluating AI chat vendors for government or education deployment, these questions should be in every brief and every demo evaluation:
- Is the system RAG-based, or does it use a fine-tuned or prompted general LLM? If it is a prompted general LLM, what prevents it from drawing on general internet knowledge when your content doesn't have an answer?
- How are citations implemented? Can every answer be traced to a specific document, section, or URL?
- What is the 'I don't know' behaviour? Show us a query where the answer is not in our content. What happens?
- How quickly does content update in the index after we publish changes? What is the SLA?
- How is the indexed corpus controlled? Who has access to add, remove, or modify what is in scope?
- Is the system GDPR-compliant? Is query data used for model training? Where is data stored?
- How do we audit what the chatbot has told users? Is there a conversation log with timestamped source citations?
- What is the escalation path when the chatbot can't answer? Is it clearly communicated to users?
What Good Looks Like: A Reference Implementation
A well-implemented AI chat system on a government or university website should behave as follows:
A citizen asks: 'Can I still apply for the emergency housing grant if I'm renting privately?'
The system retrieves the most relevant passages from the Housing Assistance Policy and the Emergency Grant FAQ. It generates a response: 'Yes, private renters are eligible for the Emergency Housing Grant provided they meet the income and tenancy requirements. Applications must be submitted within 30 days of the rental hardship event. Source: Emergency Housing Grant — Eligibility Guidelines (updated March 2025).'
The citizen can click the citation to read the full policy. The response can be audited. If the policy changes next month, the content governance workflow updates the source document, the index updates within minutes, and the chatbot answers accurately within the hour.
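Behind the scenes, auditability comes from persisting each exchange as a structured record with a timestamp and the exact sources cited. The record shape below is an assumption for illustration, not a specific product's log format.

```python
import json
from datetime import datetime, timezone

# Illustrative audit record; the field names are assumptions, not a product log format.

def log_exchange(question: str, answer: str, citations: list[dict], log_path: str) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        # Each citation records the document and the version in effect at the time,
        # so a caseworker can later trace exactly what the citizen was told and why.
        "citations": citations,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```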
That is achievable today. It is not a research project or a pilot programme aspiration. It is what well-architected, grounded AI chat looks like in production.
The Political Reality for SLED Leaders
Digital leaders in government and education face pressure from two directions simultaneously: pressure to adopt AI quickly from leadership who want visible innovation, and pressure to avoid AI failures from the same leadership who don't want to explain a chatbot incident to a minister, a board, or a local paper.
Grounded AI chat is the path through that tension. It allows you to deploy AI responsibly, with confidence that accuracy is governed through your content, not through the probabilistic lottery of a general LLM. It gives you audit trails. It gives you citation evidence. And it gives you a 'turn it off' mechanism that is as simple as unpublishing the source document.
The question is no longer whether to deploy AI chat. The question is whether to deploy it responsibly.
Recommendation
Before deploying any AI chatbot on a citizen-facing or student-facing channel, run a structured red-teaming exercise: ask the system 50 questions where the answer is not in your indexed content. If it answers confidently with information that cannot be traced to your content, the system is not safe for public sector deployment. A grounded system should respond to all 50 with a clear 'I don't have information on that — please contact us directly' or a verified citation.
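That red-teaming exercise is easy to script. In the sketch below, `ask_chatbot` is a hypothetical stand-in for your vendor's chat API; a question fails the test if it comes back with neither a clear fallback nor a traceable citation.

```python
# Red-team harness sketch: every out-of-scope question must come back with either
# a clear fallback or a traceable citation. ask_chatbot() is a hypothetical
# stand-in for the vendor's chat API, returning {"answer": str, "sources": [...]}.

def red_team(questions: list[str], ask_chatbot) -> list[str]:
    failures = []
    for question in questions:
        response = ask_chatbot(question)
        is_fallback = "contact us" in response["answer"].lower()
        has_citation = bool(response["sources"])
        if not (is_fallback or has_citation):
            # A confident answer with no traceable source is the unsafe case.
            failures.append(question)
    return failures

# A grounded system should return an empty list for all 50 out-of-scope questions.
```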
Ready to see it in action?
Book a demo and we'll configure Keyspider on a live sample of your content within 48 hours.