Deploying a RAG system is straightforward. Knowing whether it actually works is harder. This playbook walks through building a practical evaluation framework that catches regressions before your users do.
Step 1: Curate a Golden Test Set
Start with 50–100 question-answer pairs that represent real user queries. For each pair, include:
- The question as a user would phrase it
- The expected answer or acceptable answer range
- The source passages that contain the ground truth
This dataset is your single source of truth. Every evaluation metric flows from it.
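A golden example can be as simple as a small record with those three fields. Here is a minimal sketch in Python; the field names and the sample entry are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    question: str              # phrased the way a real user would ask it
    expected_answer: str       # the golden answer, or a description of the acceptable range
    source_passages: list[str] # IDs of the passages containing the ground truth

# Hypothetical entry — replace with pairs drawn from real user queries.
golden_set = [
    GoldenExample(
        question="How do I rotate my API key?",
        expected_answer="Open Settings > API Keys and click Regenerate.",
        source_passages=["docs/api-keys.md#regenerate"],
    ),
]
```

Storing the set as plain data (JSON, YAML, or a dataclass list checked into the repo) keeps it diffable and reviewable like any other test fixture.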
Step 2: Measure Retrieval Quality
Before evaluating generated answers, verify that the retriever is surfacing the right documents:
- Recall@K — what fraction of relevant passages appear in the top K retrieved chunks?
- Mean Reciprocal Rank (MRR) — averaged across queries, how high does the first relevant chunk rank?
If retrieval is poor, no amount of generation tuning will help. Fix retrieval first.
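Both metrics are a few lines each. A sketch, assuming passages are compared by ID and `retrieved_ids` is already sorted by retriever score:

```python
def recall_at_k(relevant_ids: list[str], retrieved_ids: list[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k retrieved chunks."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

def reciprocal_rank(relevant_ids: list[str], retrieved_ids: list[str]) -> float:
    """1 / rank of the first relevant chunk; 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

MRR is then just the mean of `reciprocal_rank` over every question in the golden set.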
Step 3: Measure Answer Faithfulness
An answer that sounds plausible but asserts unsupported claims is worse than no answer at all. For each generated response, check:
- Attribution — can every claim in the answer be traced to a retrieved passage?
- Contradiction — does the answer contradict any retrieved passage?
LLM-as-judge works well here: prompt a separate model to verify each claim against the source text.
Step 4: Measure End-to-End Quality
Finally, evaluate the full pipeline output:
- Correctness — does the answer match the golden answer?
- Completeness — does it cover all key points?
- Conciseness — is it free of irrelevant information?
Score these on a simple 1–3 scale. Automate scoring with an LLM judge and spot-check 10% manually.
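Once the judge has scored every example, you need two small utilities: per-dimension averages, and a deterministic sample for the manual spot-check. A sketch, assuming each judged answer is a dict with integer 1–3 scores under the three dimension names:

```python
import random

DIMENSIONS = ("correctness", "completeness", "conciseness")

def mean_scores(scored: list[dict]) -> dict[str, float]:
    """Average each 1-3 dimension score across all judged answers."""
    return {d: sum(s[d] for s in scored) / len(scored) for d in DIMENSIONS}

def spot_check_sample(scored: list[dict], fraction: float = 0.10, seed: int = 0) -> list[dict]:
    """Pick ~10% of judged answers for manual review.

    A fixed seed makes the sample reproducible, so reviewers and CI
    always look at the same subset for a given run.
    """
    rng = random.Random(seed)
    n = max(1, round(len(scored) * fraction))
    return rng.sample(scored, n)
```

Disagreements between the spot-check and the judge are a signal to tighten the judge prompt or the rubric, not just to override individual scores.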
Step 5: Automate and Gate
Wire the evaluation into your CI pipeline:
- Run the golden test set on every retriever or prompt change
- Fail the build if Recall@5 drops below your baseline
- Flag answer faithfulness regressions for human review
This turns evaluation from a one-time exercise into a continuous quality gate.
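The gate itself reduces to comparing the current run against stored baselines and deciding what blocks versus what merely flags. A minimal sketch — the metric names, baseline format, and tolerance are assumptions to adapt to your pipeline:

```python
def evaluate_gate(metrics: dict, baseline: dict, recall_tolerance: float = 0.0):
    """Compare a run against baseline metrics.

    Returns (failures, flags): failures should break the build,
    flags should be surfaced for human review.
    """
    failures, flags = [], []
    if metrics["recall_at_5"] < baseline["recall_at_5"] - recall_tolerance:
        failures.append(
            f"Recall@5 {metrics['recall_at_5']:.3f} fell below "
            f"baseline {baseline['recall_at_5']:.3f}"
        )
    if metrics["faithfulness"] < baseline["faithfulness"]:
        flags.append("Faithfulness regressed — route to human review")
    return failures, flags
```

In CI, exit nonzero when `failures` is non-empty and post `flags` to the review queue; updating the baseline file then becomes an explicit, reviewed change rather than a silent drift.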