Deploying a RAG system is straightforward. Knowing whether it actually works is harder. This playbook walks through building a practical evaluation framework that catches regressions before your users do.
Step 1: Curate a Golden Test Set
Start with 50–100 question-answer pairs that represent real user queries. For each pair, include:
- The question as a user would phrase it
- The expected answer or acceptable answer range
- The source passages that contain the ground truth
This dataset is your single source of truth. Every evaluation metric flows from it.
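A golden example can be as simple as a small record with those three fields. Here is a minimal sketch in Python; the field names and the sample entry are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    question: str              # phrased the way a real user would ask it
    expected_answer: str       # the golden answer, or a description of the acceptable range
    source_passages: list[str] # IDs of the passages containing the ground truth

# Hypothetical entry — replace with pairs drawn from real user queries.
golden_set = [
    GoldenExample(
        question="How do I rotate my API key?",
        expected_answer="Open Settings > API Keys and click Regenerate.",
        source_passages=["docs/api-keys.md#regenerate"],
    ),
]
```

Storing the set as plain data (JSON, YAML, or a dataclass list checked into the repo) keeps it diffable and reviewable like any other test fixture.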
Step 2: Measure Retrieval Quality
Before evaluating generated answers, verify that the retriever is surfacing the right documents:
- Recall@K — what fraction of relevant passages appear in the top K retrieved chunks?
- Mean Reciprocal Rank (MRR) — averaged across queries, how high does the first relevant chunk rank?
If retrieval is poor, no amount of generation tuning will help. Fix retrieval first.
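Both metrics are a few lines each. A sketch, assuming passages are compared by ID and `retrieved_ids` is already sorted by retriever score:

```python
def recall_at_k(relevant_ids: list[str], retrieved_ids: list[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k retrieved chunks."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

def reciprocal_rank(relevant_ids: list[str], retrieved_ids: list[str]) -> float:
    """1 / rank of the first relevant chunk; 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

MRR is then just the mean of `reciprocal_rank` over every question in the golden set.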
Step 3: Measure Answer Faithfulness
An answer that sounds plausible but asserts unsupported claims is worse than no answer at all. For each generated response, check:
- Attribution — can every claim in the answer be traced to a retrieved passage?
- Contradiction — does the answer contradict any retrieved passage?
LLM-as-judge works well here: prompt a separate model to verify each claim against the source text.
Step 4: Measure End-to-End Quality
Finally, evaluate the full pipeline output:
- Correctness — does the answer match the golden answer?
- Completeness — does it cover all key points?
- Conciseness — is it free of irrelevant information?
Score these on a simple 1–3 scale. Automate scoring with an LLM judge and spot-check 10% manually.
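Once the judge has scored every example, you need two small utilities: per-dimension averages, and a deterministic sample for the manual spot-check. A sketch, assuming each judged answer is a dict with integer 1–3 scores under the three dimension names:

```python
import random

DIMENSIONS = ("correctness", "completeness", "conciseness")

def mean_scores(scored: list[dict]) -> dict[str, float]:
    """Average each 1-3 dimension score across all judged answers."""
    return {d: sum(s[d] for s in scored) / len(scored) for d in DIMENSIONS}

def spot_check_sample(scored: list[dict], fraction: float = 0.10, seed: int = 0) -> list[dict]:
    """Pick ~10% of judged answers for manual review.

    A fixed seed makes the sample reproducible, so reviewers and CI
    always look at the same subset for a given run.
    """
    rng = random.Random(seed)
    n = max(1, round(len(scored) * fraction))
    return rng.sample(scored, n)
```

Disagreements between the spot-check and the judge are a signal to tighten the judge prompt or the rubric, not just to override individual scores.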
Step 5: Automate and Gate
Wire the evaluation into your CI pipeline:
- Run the golden test set on every retriever or prompt change
- Fail the build if Recall@5 drops below your baseline
- Flag answer faithfulness regressions for human review
This turns evaluation from a one-time exercise into a continuous quality gate.
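The gate itself reduces to comparing the current run against stored baselines and deciding what blocks versus what merely flags. A minimal sketch — the metric names, baseline format, and tolerance are assumptions to adapt to your pipeline:

```python
def evaluate_gate(metrics: dict, baseline: dict, recall_tolerance: float = 0.0):
    """Compare a run against baseline metrics.

    Returns (failures, flags): failures should break the build,
    flags should be surfaced for human review.
    """
    failures, flags = [], []
    if metrics["recall_at_5"] < baseline["recall_at_5"] - recall_tolerance:
        failures.append(
            f"Recall@5 {metrics['recall_at_5']:.3f} fell below "
            f"baseline {baseline['recall_at_5']:.3f}"
        )
    if metrics["faithfulness"] < baseline["faithfulness"]:
        flags.append("Faithfulness regressed — route to human review")
    return failures, flags
```

In CI, exit nonzero when `failures` is non-empty and post `flags` to the review queue; updating the baseline file then becomes an explicit, reviewed change rather than a silent drift.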