I Built an AI Assistant That Checks Its Own Answers

Most AI apps fail for a simple reason: they trust the model too much.

The model answers. The app displays it. The user believes it. Then someone notices the answer was wrong, outdated, or completely invented.

Lovely.

After building AI tools for real workflows, I stopped treating the LLM like the whole product. Now I treat it like one smart worker inside a controlled system.

The system should retrieve facts, generate an answer, check that answer, and only then show it.

1. The problem is not intelligence, it is reliability

A normal AI chatbot works like this:

user_question -> LLM -> answer

That is fine for casual use.

But for serious work, I prefer this:

user_question
    -> retrieve relevant context
    -> generate answer
    -> verify answer against context
    -> return answer or say "not enough information"

This is slower than a basic chatbot, but much safer.

And honestly, “fast wrong answer” is not a feature.

2. The core idea: generation plus verification

The trick is simple.

Use one model call to answer. Use another model call to check whether the answer is supported by the provided context.

Here is a simplified version.

from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    title: str
    content: str

@dataclass
class AIAnswer:
    answer: str
    confidence: str
    sources: List[str]
    verified: bool

class SimpleAIWorkflow:
    def __init__(self, llm_client):
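        self.llm = llm_client

    def retrieve_context(self, question: str, documents: List[Document]) -> List[Document]:
        # Naive keyword scoring: strip punctuation so "policy?" still matches "policy".
        keywords = [word.strip(".,;:!?\"'()") for word in question.lower().split()]
        # Drop empty strings, which would otherwise match every document.
        keywords = [word for word in keywords if word]
        scored_docs = []
        for doc in documents:
            score = 0
            text = doc.content.lower()
            for keyword in keywords:
                if keyword in text:
                    score += 1
            scored_docs.append((score, doc))
        # Highest score first; keep at most three documents that matched at all.
        scored_docs.sort(key=lambda item: item[0], reverse=True)
        return [doc for score, doc in scored_docs if score > 0][:3]

    def build_answer_prompt(self, question: str, context_docs: List[Document]) -> str: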
        context = "\n\n".join(
            f"Title: {doc.title}\nContent: {doc.content}"
            for doc in context_docs
        )
        return f"""
Answer the question using only the context below.
Question:
{question}
Context:
{context}
Rules:
- Do not invent facts.
- If the context is weak, say so.
- Mention the source title.
"""

    def build_verification_prompt(self, question: str, answer: str, context_docs: List[Document]) -> str:
        context = "\n\n".join(
            f"Title: {doc.title}\nContent: {doc.content}"
            for doc in context_docs
        )
        return f"""
Check whether the answer is fully supported by the context.
Question:
{question}
Answer:
{answer}
Context:
{context}
Return only one word:
SUPPORTED or UNSUPPORTED
"""

    def answer_question(self, question: str, documents: List[Document]) -> AIAnswer:
        context_docs = self.retrieve_context(question, documents)
        if not context_docs:
            return AIAnswer(
                answer="I could not find enough information to answer this.",
                confidence="low",
                sources=[],
                verified=False
            )
        answer_prompt = self.build_answer_prompt(question, context_docs)
        answer = self.llm.generate(answer_prompt)
        verification_prompt = self.build_verification_prompt(
            question,
            answer,
            context_docs
        )
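        # Normalise the verdict: models often wrap the single word in punctuation or quotes.
        verification = self.llm.generate(verification_prompt).strip().strip(".!\"'`*").upper()
        # Exact match on purpose: anything other than a clean SUPPORTED counts as unverified.
        verified = verification == "SUPPORTED"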
        return AIAnswer(
            answer=answer if verified else "I found related information, but I cannot verify the answer confidently.",
            confidence="high" if verified else "low",
            sources=[doc.title for doc in context_docs],
            verified=verified
        )

This is not a full production system: retrieval is naive keyword matching, and the verifier is itself a model call that can be wrong. But the pattern is powerful.

Generate first. Verify second. Trust last.
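
To try the flow without a real model, the sketch below swaps in a stub client. It assumes only what the class above assumes, namely a client object with a generate(prompt) method that returns a string. StubLLM, its canned replies, VerifiedAnswer, and the condensed answer_with_verification helper are hypothetical names made up for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VerifiedAnswer:
    text: str
    verified: bool


class StubLLM:
    """Hypothetical stand-in for a real client: returns canned strings, calls no model."""

    def __init__(self, draft: str, verdict: str):
        self.draft = draft
        self.verdict = verdict
        self.prompts: List[str] = []  # every prompt received, kept for inspection

    def generate(self, prompt: str) -> str:
        self.prompts.append(prompt)
        # The verification prompt is the one that starts with "Check whether".
        if prompt.lstrip().startswith("Check whether"):
            return self.verdict
        return self.draft


def answer_with_verification(llm, question: str, context: str) -> VerifiedAnswer:
    """Condensed generate-then-verify flow: two calls, and the second gates the first."""
    draft = llm.generate(
        "Answer the question using only the context below.\n"
        f"Question:\n{question}\nContext:\n{context}"
    )
    verdict = llm.generate(
        "Check whether the answer is fully supported by the context.\n"
        f"Question:\n{question}\nAnswer:\n{draft}\nContext:\n{context}\n"
        "Return only one word:\nSUPPORTED or UNSUPPORTED"
    )
    verified = verdict.strip().upper() == "SUPPORTED"
    fallback = "I found related information, but I cannot verify the answer confidently."
    return VerifiedAnswer(text=draft if verified else fallback, verified=verified)


question = "How long do I have to request a refund?"
context = "Refund policy: purchases can be refunded within 30 days."
good = answer_with_verification(
    StubLLM(draft="Refunds are available for 30 days.", verdict="SUPPORTED"),
    question,
    context,
)
bad = answer_with_verification(
    StubLLM(draft="Refunds are available for 90 days.", verdict="UNSUPPORTED"),
    question,
    context,
)
print(good.verified, "|", good.text)
print(bad.verified, "|", bad.text)
```

Because the stub records every prompt it receives, the same setup can double as a unit test for prompt wording before any real model is involved.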

3. Why this works better than prompt engineering alone

Prompts help, but prompts are not seatbelts.

You can tell the model:

Do not hallucinate.

But that does not guarantee anything. The model may still produce a beautiful wrong answer with the confidence of a finance student five minutes before the exam.

Verification adds a second layer.

The answer is no longer accepted just because it sounds good. It must be supported by the retrieved context.

That is a major upgrade.

4. Add a confidence gate

I like adding a confidence gate before showing the final answer.

For example:

def safe_response(ai_answer: AIAnswer) -> str:
    # Unverified answers are never shown; ask for more context instead.
    if not ai_answer.verified:
        return (
            "I found some related information, but I cannot confirm the answer "
            "from the available context. Please provide more documents or ask a narrower question."
        )
    # Verified but low-confidence answers are replaced with a caution.
    if ai_answer.confidence == "low":
        return "The answer may be incomplete based on the available information."
    return ai_answer.answer

This small function can prevent many bad outputs.

The best AI systems are not the ones that always answer. They are the ones that know when not to answer.
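
In the workflow as written, confidence is "low" exactly when verified is False, so the second check in safe_response never fires on its own. One possible way to make it meaningful, sketched under the assumption that you know how many retrieved documents backed the answer, is to grade confidence separately from verification. The grade_confidence helper, the "medium" level, and the threshold of two sources are illustrative guesses, not tuned values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AIAnswer:
    answer: str
    confidence: str
    sources: List[str]
    verified: bool


def grade_confidence(verified: bool, source_count: int) -> str:
    """Hypothetical rule: unverified is always low; one source is weaker than several."""
    if not verified:
        return "low"
    return "high" if source_count >= 2 else "medium"


def safe_response(ai_answer: AIAnswer) -> str:
    if not ai_answer.verified:
        return "I cannot confirm the answer from the available context."
    if ai_answer.confidence != "high":
        # Verified but thinly sourced: show the answer with a caveat instead of hiding it.
        return ai_answer.answer + " (Based on a single source; it may be incomplete.)"
    return ai_answer.answer


single = AIAnswer(
    answer="Refunds are available for 30 days.",
    confidence=grade_confidence(verified=True, source_count=1),
    sources=["Refund policy"],
    verified=True,
)
print(safe_response(single))
```

The point of the split is that verification answers "is this supported?" while confidence answers "how well?", so the gate can refuse, caveat, or pass through.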

5. Where I would use this

This pattern is useful for:

- Customer support bots
- Internal company policy assistants
- PDF question-answering tools
- Research assistants
- Legal document search
- Finance report explanation
- University assignment helpers
- Technical documentation bots

Basically, anywhere wrong answers can create confusion.

For casual brainstorming, direct AI output is fine. For factual work, verification is worth it.

6. The part most beginners skip: logging

Every AI workflow should log what happened.

def log_ai_result(question: str, ai_answer: AIAnswer):
    print("QUESTION:", question)
    print("ANSWER:", ai_answer.answer)
    print("CONFIDENCE:", ai_answer.confidence)
    print("VERIFIED:", ai_answer.verified)
    print("SOURCES:", ", ".join(ai_answer.sources))
    print("-" * 50)

Logs help you see where the system fails.

Bad retrieval? Weak documents? Poor prompt? Overconfident answer?

Without logs, debugging AI feels like chasing a ghost with a flashlight.
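
Print statements are fine for a demo, but they are hard to filter later. A minimal sketch of the same idea using Python's standard logging and json modules follows. The logger name "ai_workflow", the format_log_record helper, and the choice to log unverified answers at WARNING are assumptions made for illustration.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class AIAnswer:
    answer: str
    confidence: str
    sources: List[str]
    verified: bool


logger = logging.getLogger("ai_workflow")  # logger name is an arbitrary choice


def format_log_record(question: str, ai_answer: AIAnswer) -> str:
    """Serialise one question/answer pair as a single JSON line."""
    record = {"question": question, **asdict(ai_answer)}
    return json.dumps(record, ensure_ascii=False)


def log_ai_result(question: str, ai_answer: AIAnswer) -> None:
    # Unverified answers go out at WARNING so they stand out when filtering.
    level = logging.INFO if ai_answer.verified else logging.WARNING
    logger.log(level, format_log_record(question, ai_answer))


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log_ai_result(
    "How long do I have to request a refund?",
    AIAnswer(
        answer="I found related information, but I cannot verify the answer confidently.",
        confidence="low",
        sources=["Refund policy"],
        verified=False,
    ),
)
```

One JSON object per line keeps the log greppable and easy to load into a notebook when you want to count how often verification fails and for which sources.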

7. Final thoughts

The future of AI development is not just bigger models.

It is better systems around models.

A serious AI app should retrieve facts, generate carefully, verify answers, log results, and refuse when evidence is weak.

That is the difference between a chatbot and an AI tool people can actually trust.

My rule now is simple:

Never let the model be the whole brain. Build the nervous system around it.