Query Agent Verification
How do we know the query agent's responses are correct — beyond eyeballing?
This page covers the automated evaluation systems, the metrics they produce, and how to use them during development and in CI.
The Problem
A RAG system can fail silently: it returns fluent, confident text that is factually wrong, incomplete, or unsupported by the retrieved context. Manual spot-checking doesn't scale and misses subtle regressions.
The project addresses this with two complementary evaluation systems:
- Ragas-based CI harness (evaluation/): runs on every push to main, catches regressions
- Scorer-based framework (src/evaluation/): programmatic, extensible, used for deeper analysis
1. CI Evaluation Harness
How it works
flowchart LR
A[Golden Dataset] --> B[CI Runner]
B --> C[Query Agent]
C --> D[Responses]
D --> E[Ragas Metrics]
D --> F[Retrieval Recall]
E --> G[Compare to Baseline]
F --> G
G -->|>5% drop| H[❌ FAIL]
G -->|3-5% drop| I[⚠️ WARNING]
G -->|OK| J[✅ PASS]
J --> K[Grafana]
H --> K
Golden dataset
The golden dataset is generated from actual text chunks in the knowledge graph.
Generation produces evaluation/golden_dataset.json, which contains question-answer pairs with source contexts derived from real KG data, so every question is answerable by the system.
Running locally
# Full eval (same as CI)
uv run python -m evaluation.harness.ci_eval
# Just generate + run without regression check
uv run python evaluation/run_eval.py
Metrics produced
| Metric | What it measures | Source |
|---|---|---|
| Faithfulness | Does the answer only use information from retrieved context? (no hallucination) | Ragas |
| Answer Relevance | Is the answer relevant to the question asked? | Ragas |
| Context Precision | Did we retrieve the right chunks? | Ragas |
| Retrieval Recall | \|expected_entities ∩ retrieved_entities\| / \|expected_entities\| | Custom |
| Mode Classification | Did the agent pick the correct query mode (local/global/hybrid)? | Custom |
| Latency | p50, p95, p99 response times | Custom |
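
As a rough illustration of the latency row, the sketch below derives p50, p95, and p99 from a list of raw response times using the nearest-rank method. The function name and the sample timings are hypothetical, and the harness may well use a different interpolation method.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times (seconds) for ten eval queries.
latencies = [0.8, 1.1, 0.9, 1.4, 2.0, 1.0, 0.7, 5.2, 1.2, 1.3]
summary = {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
print(summary)
```

With only ten samples, p95 and p99 both land on the slowest query, which is why tail percentiles are more informative on larger runs.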
Regression detection
The CI runner compares current results against a stored baseline:
- >5% drop on any metric → build fails (exit code 1)
- 3–5% drop → warning posted to PR, build passes
- No regression → green, results pushed to Grafana
Results are posted as PR comments automatically via GitHub Actions.
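
The thresholds above can be sketched roughly as follows. This is a minimal illustration, not the CI runner's code: it assumes the drop is measured relative to the baseline value, that the 3% and 5% boundaries belong to the warning band, and that every metric is higher-is-better (latency would need the opposite sign). All names and numbers are hypothetical.

```python
FAIL_THRESHOLD = 0.05  # a drop of more than 5% fails the build
WARN_THRESHOLD = 0.03  # a drop of 3-5% only produces a warning


def classify_drop(baseline: float, current: float) -> str:
    """Return "fail", "warn", or "pass" for one metric (assumes a relative drop)."""
    if baseline <= 0:
        return "pass"  # nothing meaningful to compare against
    drop = (baseline - current) / baseline
    if drop > FAIL_THRESHOLD:
        return "fail"
    if drop >= WARN_THRESHOLD:
        return "warn"
    return "pass"


def check_regressions(
    baseline: dict[str, float], current: dict[str, float]
) -> tuple[int, dict[str, str]]:
    """Classify every baseline metric and derive a process exit code."""
    statuses = {
        name: classify_drop(value, current.get(name, 0.0))
        for name, value in baseline.items()
    }
    exit_code = 1 if "fail" in statuses.values() else 0
    return exit_code, statuses


# Hypothetical baseline and current scores.
baseline = {"faithfulness": 0.90, "answer_relevance": 0.80, "context_precision": 0.75}
current = {"faithfulness": 0.84, "answer_relevance": 0.77, "context_precision": 0.76}
exit_code, statuses = check_regressions(baseline, current)
print(exit_code, statuses)
```

In this example faithfulness drops by about 6.7% (fail), answer relevance by about 3.8% (warn), and context precision improves (pass), so the exit code is 1.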
Response caching
Responses are cached in evaluation/cache/ so that re-runs are deterministic and fast. The cache key is based on the golden dataset + baseline config hash.
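
One way such a cache key could be built is sketched below. It assumes the key is a SHA-256 digest over a canonical JSON dump of the dataset and config; the actual key derivation in evaluation/ is not shown in this page, and the function name, field names, and config values are hypothetical.

```python
import hashlib
import json


def cache_key(dataset: list[dict], config: dict) -> str:
    """Hash the dataset and config together so a change to either invalidates the cache."""
    payload = json.dumps(
        {"dataset": dataset, "config": config},
        sort_keys=True,  # canonical key order, so equal content gives an equal key
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical dataset entry and baseline config.
dataset = [{"question": "hypothetical question", "answer": "hypothetical answer"}]
config = {"model": "example-model", "top_k": 5}

key_a = cache_key(dataset, config)
key_b = cache_key(dataset, {"top_k": 5, "model": "example-model"})  # same content, other key order
key_c = cache_key(dataset, {**config, "top_k": 10})  # changed config
print(key_a == key_b, key_a == key_c)  # True False
```

Sorting keys before hashing keeps the key stable across runs that load the same config in a different order, which is what makes cached re-runs repeatable.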
2. Scorer-Based Framework
For deeper analysis beyond CI pass/fail, the src/evaluation/ framework provides pluggable scorers.
Test queries
Defined in data/test_queries.yaml, organized by category:
- multi_hop_queries: require traversing multiple relationships
- global_queries: broad, community-level questions
- local_queries: specific entity lookups
- hybrid_queries: combine semantic + graph search
- disambiguation_queries: entities with ambiguous names
Each query specifies:
- query: "What proteins are associated with Alzheimer's disease?"
  expected_mode: local
  expected_keywords: ["APP", "PSEN1", "APOE"]
  reference_answer: "Key proteins include APP, PSEN1, PSEN2, and APOE..."
  hop_depth: 1
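
A query entry like the one above could be checked before a run with a small dataclass. This is only a sketch: `QuerySpec` is a hypothetical stand-in (the framework's own `TestQuery` model may differ), the allowed modes are taken from the local/global/hybrid modes named on this page, and the `hop_depth >= 1` rule is an assumption.

```python
from dataclasses import dataclass, field

VALID_MODES = {"local", "global", "hybrid"}


@dataclass
class QuerySpec:
    """Hypothetical in-memory form of one test query entry."""

    query: str
    expected_mode: str
    expected_keywords: list[str] = field(default_factory=list)
    reference_answer: str = ""
    hop_depth: int = 1

    def __post_init__(self) -> None:
        # Reject entries that would otherwise fail later, in the middle of an eval run.
        if self.expected_mode not in VALID_MODES:
            raise ValueError(f"unknown expected_mode: {self.expected_mode!r}")
        if self.hop_depth < 1:
            raise ValueError("hop_depth must be at least 1")  # assumed constraint


# The entry as it would look after parsing the YAML into a dict.
entry = {
    "query": "What proteins are associated with Alzheimer's disease?",
    "expected_mode": "local",
    "expected_keywords": ["APP", "PSEN1", "APOE"],
    "reference_answer": "Key proteins include APP, PSEN1, PSEN2, and APOE...",
    "hop_depth": 1,
}
spec = QuerySpec(**entry)
print(spec.expected_mode, spec.expected_keywords)
```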
Available scorers
| Scorer | What it does |
|---|---|
| keyword_match | Fraction of expected keywords found in the answer text |
| entity_overlap | Precision, recall, and F1 on entity names (case-insensitive) |
| embedding_similarity | Cosine similarity between response and reference answer embeddings |
| llm_judge | LLM scores the response on relevance, correctness, and completeness (0–1 each) |
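
To make the first two rows concrete, here is a standalone sketch of the arithmetic behind a keyword-match fraction and an entity-overlap precision/recall/F1. It is not the code in src/evaluation/: the function signatures are hypothetical, and the keyword check assumes plain case-insensitive substring matching, which the real scorer may refine.

```python
def keyword_match(expected_keywords: list[str], answer: str) -> float:
    """Fraction of expected keywords that occur in the answer (case-insensitive substring)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)


def entity_overlap(expected: list[str], found: list[str]) -> dict[str, float]:
    """Precision, recall, and F1 over case-insensitive entity names."""
    expected_set = {name.lower() for name in expected}
    found_set = {name.lower() for name in found}
    common = expected_set & found_set
    precision = len(common) / len(found_set) if found_set else 0.0
    recall = len(common) / len(expected_set) if expected_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if common else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


answer = "Key proteins include APP, PSEN1, and tau."
km = keyword_match(["APP", "PSEN1", "APOE"], answer)
eo = entity_overlap(["APP", "PSEN1", "APOE"], ["app", "psen1", "MAPT", "TREM2"])
print(f"keyword_match={km:.3f}", {k: round(v, 3) for k, v in eo.items()})
```

Here two of three keywords are found (0.667), and two of four returned entities are expected (precision 0.5, recall 0.667, F1 about 0.571). Note that substring matching can over-count short keywords: "APP" would also match "approach".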
Running the framework
from src.evaluation.runner import EvaluationRunner
from src.evaluation.scorers import get_scorer
from src.evaluation.adapters import YourAgentAdapter
runner = EvaluationRunner()
queries = runner.load_queries(categories=["local", "multi_hop"])
adapter = YourAgentAdapter(...)  # wraps your agent
scorers = [get_scorer("keyword_match"), get_scorer("entity_overlap"), get_scorer("llm_judge")]
summary = runner.run(adapter, queries, scorers)
# Inspect results
for result in summary.results:
    print(f"{result.query[:50]}...")
    for name, score in result.scores.items():
        print(f"  {name}: {score.value:.3f}")
print(f"\nAggregate: {summary.aggregate_scores}")
print(f"Mode accuracy: {summary.mode_accuracy:.1%}")
Adding a custom scorer
from src.evaluation.models import EvaluationResponse, MetricScore, TestQuery
from src.evaluation.scorers import register_scorer
@register_scorer("my_custom_scorer")
class MyScorer:
    def score(self, query: TestQuery, response: EvaluationResponse) -> list[MetricScore]:
        # Your logic here; compute_something is a placeholder for your own scoring function
        value = compute_something(query, response)
        return [MetricScore(metric_name="my_metric", value=value)]
3. What Each Metric Tells You
Faithfulness (Ragas)
"Is the agent making things up?"
A faithfulness score of 0.9 means 90% of claims in the answer are supported by the retrieved context. Low faithfulness = hallucination.
When it drops: The agent is generating information not present in the retrieved chunks. Check if retrieval is returning irrelevant context, or if the LLM prompt encourages speculation.
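
The 0.9 example is a simple ratio, sketched below. This only illustrates the final arithmetic: in Ragas, an LLM extracts the claims and judges each one against the retrieved context, whereas here the per-claim verdicts are supplied as hypothetical booleans.

```python
def faithfulness(verdicts: list[bool]) -> float:
    """Share of answer claims judged as supported by the retrieved context."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Ten hypothetical claims, nine of them supported by the context.
verdicts = [True] * 9 + [False]
print(faithfulness(verdicts))  # 0.9
```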
Answer Relevance (Ragas)
"Did the agent actually answer the question?"
When it drops: The agent may be going off-topic, returning tangentially related information, or misinterpreting the query.
Context Precision (Ragas)
"Are we retrieving the right chunks?"
When it drops: The vector index may need recomputation, embeddings may have drifted, or the chunking strategy is producing poor segments.
Entity Recall
"Did the agent find the entities we expected?"
Computed as |expected ∩ found| / |expected|. A recall of 0.8 means 80% of expected entities appeared in the response.
When it drops: Graph traversal depth may be insufficient, or entity consolidation may have broken links.
LLM Judge: Correctness
"Is the biomedical information factually accurate?"
An LLM (Bedrock) evaluates whether the response contains correct biomedical facts. This catches subtle errors that keyword matching misses.
Limitations: The judge LLM has its own knowledge cutoff and biases. Use as a signal, not ground truth.
4. Practical Workflow
During development
- Make your change (agent logic, prompts, retrieval strategy)
- Run the scorer framework locally against the relevant categories (see "Running the framework" in section 2)
- Compare scores before/after your change
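
The before/after comparison in the last step can be as simple as diffing two aggregate score dictionaries, for example ones saved from `summary.aggregate_scores` on each run. The sketch below assumes those aggregates are flat metric-to-float mappings; the metric names and values are hypothetical.

```python
def score_deltas(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-metric change (after minus before) for metrics present in both runs."""
    shared = sorted(before.keys() & after.keys())
    return {name: round(after[name] - before[name], 4) for name in shared}


# Hypothetical aggregates from a run before and after a prompt change.
before = {"keyword_match": 0.72, "entity_overlap_f1": 0.65, "llm_judge_correctness": 0.80}
after = {"keyword_match": 0.78, "entity_overlap_f1": 0.61, "llm_judge_correctness": 0.80}

deltas = score_deltas(before, after)
for name, delta in deltas.items():
    print(f"{name:24s} {delta:+.3f}")
```

A mixed result like this one (keyword match up, entity F1 down) is a cue to look at per-query results rather than the aggregate alone.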
Before merging
- CI runs the full eval harness automatically on push to main
- Check the PR comment for regression warnings
- If a regression is flagged, investigate which category/metric dropped
Monitoring in production
- Grafana dashboard (olink-rag-eval-e2e) shows metric trends over time
- Alert rules fire when any eval metric drops >5% from the rolling baseline
- Latency percentiles track performance degradation
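
The rolling-baseline alert can be pictured with the sketch below. The real rule lives in Grafana and its exact definition is not given here, so this assumes the baseline is the mean of the last few runs and that the drop is relative; the window size of 7, the function name, and the sample scores are all hypothetical.

```python
from statistics import mean


def rolling_baseline_alert(
    history: list[float], latest: float, window: int = 7, max_drop: float = 0.05
) -> bool:
    """True when `latest` is more than `max_drop` (relative) below the mean of the last `window` runs."""
    recent = history[-window:]
    if not recent:
        return False  # no baseline yet, so nothing to alert on
    baseline = mean(recent)
    if baseline <= 0:
        return False
    return (baseline - latest) / baseline > max_drop


# Hypothetical faithfulness scores from the last seven eval runs.
history = [0.90, 0.91, 0.89, 0.90, 0.92, 0.90, 0.91]
print(rolling_baseline_alert(history, 0.84))  # True: roughly a 7% drop
print(rolling_baseline_alert(history, 0.88))  # False: roughly a 3% drop
```

Compared with a fixed stored baseline, a rolling one adapts to gradual drift, so it catches sudden drops but can miss slow degradation.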
5. Limitations & What's Not Automated
| Gap | Mitigation |
|---|---|
| Golden dataset may not cover edge cases | Expand data/test_queries.yaml as you find failures |
| LLM judge has its own biases | Cross-reference with entity_overlap and keyword_match |
| Ragas requires ground truth contexts | Generated from actual KG — regenerate after major ingestion changes |
| No real-time user feedback loop yet | Planned: Glicko rating system from user thumbs-up/down |
| Domain expertise needed to validate | Periodic manual review by domain scientists remains essential |