Building a RAG Evaluation Framework From Scratch

[Figure: Flowchart of a RAG evaluation pipeline with retrieval and generation stages]

Deploying a RAG system is straightforward. Knowing whether it actually works is harder. This playbook walks through building a practical evaluation framework that catches regressions before your users do.

Step 1: Curate a Golden Test Set

Start with 50–100 question-answer pairs that represent real user queries. For each pair, include:

  • The question as a user would phrase it
  • The expected answer or acceptable answer range
  • The source passages that contain the ground truth

This dataset is your single source of truth. Every evaluation metric flows from it.
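One lightweight way to represent the golden set is a small dataclass per example; the field names and sample entry below are illustrative, not a required schema:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """One entry in the golden test set."""
    question: str                # the question as a user would phrase it
    expected_answer: str         # the expected answer or acceptable range
    source_passages: list[str]   # passages that contain the ground truth

golden_set = [
    GoldenExample(
        question="What is the refund window for annual plans?",
        expected_answer="30 days from the purchase date",
        source_passages=[
            "Annual plans may be refunded within 30 days of purchase."
        ],
    ),
]
```

Keeping the set in version control alongside the pipeline code means every change to retrieval or prompts is diffed against the same fixture.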

Step 2: Measure Retrieval Quality

Before evaluating generated answers, verify that the retriever is surfacing the right documents:

  • Recall@K — what fraction of relevant passages appear in the top K retrieved chunks?
  • Mean Reciprocal Rank — how high does the first relevant chunk rank?

If retrieval is poor, no amount of generation tuning will help. Fix retrieval first.
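Both metrics are a few lines of code once you have document IDs for the retrieved chunks and the golden source passages (the IDs below are made up for illustration):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of relevant passages that appear in the top-k retrieved chunks."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """1 / rank of the first relevant chunk; 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Mean Reciprocal Rank is just reciprocal_rank averaged over the test set.
retrieved = ["d3", "d1", "d7"]   # retriever output, best first
relevant = ["d1", "d9"]          # ground-truth passages from the golden set
print(recall_at_k(retrieved, relevant, k=3))   # → 0.5 (found d1, missed d9)
print(reciprocal_rank(retrieved, relevant))    # → 0.5 (first hit at rank 2)
```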

Step 3: Measure Answer Faithfulness

A fluent answer built on unsupported claims is worse than no answer at all. For each generated response, check:

  • Attribution — can every claim in the answer be traced to a retrieved passage?
  • Contradiction — does the answer contradict any retrieved passage?

LLM-as-judge works well here: prompt a separate model to verify each claim against the source text.
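A minimal sketch of that judge loop, assuming you pass in your own `call_model` callable (prompt string in, model text out) and a prompt of your own design:

```python
JUDGE_PROMPT = """You are verifying claims against source text.
Source passages:
{passages}

Claim: {claim}

Answer with exactly one word: SUPPORTED, CONTRADICTED, or UNSUPPORTED."""

VERDICTS = {"SUPPORTED", "CONTRADICTED", "UNSUPPORTED"}

def judge_claims(claims: list[str], passages: list[str], call_model) -> dict:
    """Classify each claim against the retrieved passages via an LLM judge."""
    results = {}
    for claim in claims:
        prompt = JUDGE_PROMPT.format(passages="\n".join(passages), claim=claim)
        verdict = call_model(prompt).strip().upper()
        # Treat anything off-rubric as unsupported rather than guessing.
        results[claim] = verdict if verdict in VERDICTS else "UNSUPPORTED"
    return results

def faithfulness_score(results: dict) -> float:
    """Fraction of claims the judge marked as supported."""
    if not results:
        return 0.0
    return sum(v == "SUPPORTED" for v in results.values()) / len(results)
```

Using a separate model as the judge (not the one that generated the answer) reduces the risk of the judge sharing the generator's blind spots.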

Step 4: Measure End-to-End Quality

Finally, evaluate the full pipeline output:

  • Correctness — does the answer match the golden answer?
  • Completeness — does it cover all key points?
  • Conciseness — is it free of irrelevant information?

Score these on a simple 1–3 scale. Automate scoring with an LLM judge and spot-check 10% manually.
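Aggregating those 1–3 judge scores across the test set can be as simple as averaging each rubric dimension; the dimension names and sample scores below are illustrative:

```python
from statistics import mean

RUBRIC = ("correctness", "completeness", "conciseness")

def aggregate_scores(per_example_scores: list[dict]) -> dict:
    """Average each rubric dimension (scored 1-3) across the test set."""
    return {dim: mean(s[dim] for s in per_example_scores) for dim in RUBRIC}

scores = [
    {"correctness": 3, "completeness": 2, "conciseness": 3},
    {"correctness": 2, "completeness": 3, "conciseness": 3},
]
print(aggregate_scores(scores))
```

Tracking these averages per run makes it easy to see which dimension a change actually moved, rather than a single blended number hiding a correctness drop behind a conciseness gain.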

Step 5: Automate and Gate

Wire the evaluation into your CI pipeline:

  1. Run the golden test set on every retriever or prompt change
  2. Fail the build if Recall@5 drops below your baseline
  3. Flag answer faithfulness regressions for human review

This turns evaluation from a one-time exercise into a continuous quality gate.
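The gate itself can be a short script; the baseline numbers and tolerance below are placeholders you would replace with your own recorded values:

```python
# Assumed baselines recorded from a known-good run; adjust to your system.
BASELINE = {"recall_at_5": 0.85, "faithfulness": 0.90}

def gate(metrics: dict, baseline: dict = BASELINE, tolerance: float = 0.02):
    """Return (passed, messages). Hard-fail on retrieval regressions,
    flag faithfulness drops for human review per the policy above."""
    failures, flags = [], []
    if metrics["recall_at_5"] < baseline["recall_at_5"] - tolerance:
        failures.append(
            f"Recall@5 {metrics['recall_at_5']:.2f} is below baseline "
            f"{baseline['recall_at_5']:.2f}"
        )
    if metrics["faithfulness"] < baseline["faithfulness"] - tolerance:
        flags.append("Faithfulness regression: route to human review")
    return (len(failures) == 0), failures + flags
```

In CI, end the script with `sys.exit(0 if passed else 1)` so a retrieval regression fails the build while faithfulness flags only surface in the job log for review.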
