Query Agent Verification

How do we know the query agent's responses are correct — beyond eyeballing?

This page covers the automated evaluation systems, the metrics they produce, and how to use them during development and in CI.


The Problem

A RAG system can fail silently: it returns fluent, confident text that is factually wrong, incomplete, or unsupported by the retrieved context. Manual spot-checking doesn't scale and misses subtle regressions.

The project addresses this with two complementary evaluation systems:

  1. Ragas-based CI harness (evaluation/) — runs on every push to main, catches regressions
  2. Scorer-based framework (src/evaluation/) — programmatic, extensible, used for deeper analysis

1. CI Evaluation Harness

How it works

flowchart LR
    A[Golden Dataset] --> B[CI Runner]
    B --> C[Query Agent]
    C --> D[Responses]
    D --> E[Ragas Metrics]
    D --> F[Retrieval Recall]
    E --> G[Compare to Baseline]
    F --> G
    G -->|>5% drop| H[❌ FAIL]
    G -->|3-5% drop| I[⚠️ WARNING]
    G -->|OK| J[✅ PASS]
    J --> K[Grafana]
    H --> K

Golden dataset

The golden dataset is generated from actual text chunks in the knowledge graph (KG):

uv run python evaluation/generate_dataset.py

This produces evaluation/golden_dataset.json containing question-answer pairs with source contexts derived from real KG data — ensuring questions are answerable by the system.

Running locally

# Full eval (same as CI)
uv run python -m evaluation.harness.ci_eval

# Just generate + run without regression check
uv run python evaluation/run_eval.py

Metrics produced

| Metric | What it measures | Source |
|---|---|---|
| Faithfulness | Does the answer only use information from retrieved context? (no hallucination) | Ragas |
| Answer Relevance | Is the answer relevant to the question asked? | Ragas |
| Context Precision | Did we retrieve the right chunks? | Ragas |
| Retrieval Recall | \|expected_entities ∩ retrieved_entities\| / \|expected_entities\| | Custom |
| Mode Classification | Did the agent pick the correct query mode (local/global/hybrid)? | Custom |
| Latency | p50, p95, p99 response times | Custom |
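
The two formula-style custom metrics can be sketched directly. This is a minimal illustration, not the harness's actual code: the function names `retrieval_recall` and `latency_percentiles` are hypothetical, the nearest-rank percentile method is an assumption, and returning 1.0 when no entities are expected is an arbitrary choice.

```python
import math


def retrieval_recall(expected_entities: set[str], retrieved_entities: set[str]) -> float:
    """|expected ∩ retrieved| / |expected|, as defined for Retrieval Recall."""
    if not expected_entities:
        # Assumption: if nothing was expected, nothing could be missed.
        return 1.0
    return len(expected_entities & retrieved_entities) / len(expected_entities)


def latency_percentiles(latencies_ms: list[float], percentiles=(50, 95, 99)) -> dict[str, float]:
    """Nearest-rank percentiles over a non-empty list of response times."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    result = {}
    for p in percentiles:
        # Nearest-rank: smallest sample with at least p% of samples at or below it.
        rank = max(1, math.ceil(p * n / 100))
        result[f"p{p}"] = ordered[rank - 1]
    return result
```

With only a handful of samples, p95 and p99 collapse onto the slowest response, so the tail percentiles become informative only once the golden dataset is reasonably large.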

Regression detection

The CI runner compares current results against a stored baseline:

  • >5% drop on any metric → build fails (exit code 1)
  • 3–5% drop → warning posted to PR, build passes
  • No regression → green, results pushed to Grafana

Results are posted as PR comments automatically via GitHub Actions.
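
The threshold logic above can be sketched as follows. This is an illustration under stated assumptions rather than the CI runner's code: it treats the drop as relative to the baseline value (the page does not say whether the 5% is relative or absolute), it only applies to higher-is-better metrics (latency would need the comparison reversed), and `classify_metric` and `classify_run` are hypothetical names.

```python
FAIL_THRESHOLD = 0.05  # a drop above 5% fails the build
WARN_THRESHOLD = 0.03  # a drop of 3-5% posts a warning


def classify_metric(baseline: float, current: float) -> str:
    """Return 'fail', 'warning', or 'pass' for one higher-is-better metric."""
    if baseline <= 0:
        return "pass"  # Assumption: no meaningful baseline to regress from.
    drop = (baseline - current) / baseline  # Assumption: relative drop.
    if drop > FAIL_THRESHOLD:
        return "fail"
    if drop >= WARN_THRESHOLD:
        return "warning"
    return "pass"


def classify_run(baseline: dict[str, float], current: dict[str, float]) -> tuple[str, int]:
    """Worst status across shared metrics, plus the process exit code."""
    severity = {"pass": 0, "warning": 1, "fail": 2}
    statuses = [
        classify_metric(baseline[name], current[name])
        for name in baseline
        if name in current
    ]
    worst = max(statuses, key=severity.__getitem__, default="pass")
    # Only a failure produces exit code 1; warnings still pass the build.
    return worst, 1 if worst == "fail" else 0
```

Values that land exactly on a threshold are sensitive to floating-point rounding, so a real implementation may want to round the drop before comparing.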

Response caching

Responses are cached in evaluation/cache/ so that re-runs are deterministic and fast. The cache key is derived from a hash of the golden dataset and the baseline config, so changing either one invalidates the cached responses.
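
One way such a cache key could be built is sketched below. The hashing scheme (SHA-256 over the dataset bytes plus a canonical JSON dump of the config), the directory layout, and the names `cache_key` and `cached_response_path` are all assumptions for illustration; the harness may do this differently.

```python
import hashlib
import json
from pathlib import Path


def cache_key(dataset_path: Path, baseline_config: dict) -> str:
    """Hash the dataset bytes together with a canonical dump of the config."""
    digest = hashlib.sha256()
    digest.update(dataset_path.read_bytes())
    # sort_keys makes the dump independent of dict insertion order.
    digest.update(json.dumps(baseline_config, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def cached_response_path(cache_dir: Path, key: str, question: str) -> Path:
    """One JSON file per question, namespaced by the run-level cache key."""
    question_hash = hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]
    return cache_dir / key[:16] / f"{question_hash}.json"
```

Because the key covers both inputs, regenerating the golden dataset or editing the config sends lookups to a fresh directory instead of serving stale responses.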


2. Scorer-Based Framework

For deeper analysis beyond CI pass/fail, the src/evaluation/ framework provides pluggable scorers.

Test queries

Defined in data/test_queries.yaml, organized by category:

  • multi_hop_queries — require traversing multiple relationships
  • global_queries — broad, community-level questions
  • local_queries — specific entity lookups
  • hybrid_queries — combine semantic + graph search
  • disambiguation_queries — entities with ambiguous names

Each query specifies:

- query: "What proteins are associated with Alzheimer's disease?"
  expected_mode: local
  expected_keywords: ["APP", "PSEN1", "APOE"]
  reference_answer: "Key proteins include APP, PSEN1, PSEN2, and APOE..."
  hop_depth: 1
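
The fields above map naturally onto a small typed record. The sketch below validates one already-parsed YAML entry using only the standard library. `QuerySpec` and `parse_query` are hypothetical names, not the framework's own `TestQuery` model; treating `query` and `expected_mode` as the only required fields is an assumption; the valid modes are taken from the Mode Classification metric (local/global/hybrid).

```python
from dataclasses import dataclass, field

VALID_MODES = {"local", "global", "hybrid"}


@dataclass(frozen=True)
class QuerySpec:
    query: str
    expected_mode: str
    expected_keywords: list[str] = field(default_factory=list)
    reference_answer: str = ""
    hop_depth: int = 1


def parse_query(entry: dict) -> QuerySpec:
    """Validate one parsed YAML entry and convert it to a QuerySpec."""
    missing = {"query", "expected_mode"} - entry.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if entry["expected_mode"] not in VALID_MODES:
        raise ValueError(f"unknown expected_mode: {entry['expected_mode']!r}")
    # Unknown keys raise TypeError here, which surfaces typos in the YAML.
    return QuerySpec(**entry)
```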

Available scorers

| Scorer | What it does |
|---|---|
| keyword_match | Fraction of expected keywords found in the answer text |
| entity_overlap | Precision, recall, and F1 on entity names (case-insensitive) |
| embedding_similarity | Cosine similarity between response and reference answer embeddings |
| llm_judge | LLM scores the response on relevance, correctness, and completeness (0–1 each) |
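
The three non-LLM scorers reduce to short formulas. The functions below are a standalone sketch of those formulas, not the framework's implementations. The names and signatures are hypothetical, case-insensitive substring matching for keywords is an assumption, and the embedding vectors are taken as plain lists of floats.

```python
import math


def keyword_match(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that occur in the answer text."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    # Assumption: case-insensitive substring match.
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


def entity_overlap(expected: set[str], found: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 on case-folded entity names."""
    exp = {e.casefold() for e in expected}
    got = {e.casefold() for e in found}
    true_positives = len(exp & got)
    precision = true_positives / len(got) if got else 0.0
    recall = true_positives / len(exp) if exp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Substring matching is permissive: a keyword such as "APP" would also match inside "apple". Word-boundary matching is stricter if that matters for your queries.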

Running the framework

from src.evaluation.runner import EvaluationRunner
from src.evaluation.scorers import get_scorer
from src.evaluation.adapters import YourAgentAdapter

runner = EvaluationRunner()
queries = runner.load_queries(categories=["local", "multi_hop"])

adapter = YourAgentAdapter(...)  # wraps your agent
scorers = [get_scorer("keyword_match"), get_scorer("entity_overlap"), get_scorer("llm_judge")]

summary = runner.run(adapter, queries, scorers)

# Inspect results
for result in summary.results:
    print(f"{result.query[:50]}...")
    for name, score in result.scores.items():
        print(f"  {name}: {score.value:.3f}")

print(f"\nAggregate: {summary.aggregate_scores}")
print(f"Mode accuracy: {summary.mode_accuracy:.1%}")

Adding a custom scorer

from src.evaluation.models import EvaluationResponse, MetricScore, TestQuery
from src.evaluation.scorers import register_scorer

@register_scorer("my_custom_scorer")
class MyScorer:
    def score(self, query: TestQuery, response: EvaluationResponse) -> list[MetricScore]:
        # compute_something is a placeholder: replace it with your own scoring logic
        value = compute_something(query, response)
        return [MetricScore(metric_name="my_metric", value=value)]

3. What Each Metric Tells You

Faithfulness (Ragas)

"Is the agent making things up?"

A faithfulness score of 0.9 means 90% of claims in the answer are supported by the retrieved context. A low score indicates hallucination.

When it drops: The agent is generating information not present in the retrieved chunks. Check if retrieval is returning irrelevant context, or if the LLM prompt encourages speculation.
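
The arithmetic behind the score is a simple ratio. In Ragas the claims are extracted and checked against the context by an LLM; the sketch below skips that step and takes the per-claim verdicts as booleans. `faithfulness_score` is a hypothetical helper, and scoring an answer with no claims as 0 is an assumption.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Share of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # Assumption: an answer with no extractable claims scores 0.
    return sum(claim_supported) / len(claim_supported)


# Ten claims, nine of them supported by the retrieved context.
verdicts = [True] * 9 + [False]
score = faithfulness_score(verdicts)
```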

Answer Relevance (Ragas)

"Did the agent actually answer the question?"

When it drops: The agent may be going off-topic, returning tangentially related information, or misinterpreting the query.

Context Precision (Ragas)

"Are we retrieving the right chunks?"

When it drops: The vector index may need recomputation, embeddings may have drifted, or the chunking strategy may be producing poor segments.

Entity Recall

"Did the agent find the entities we expected?"

Computed as |expected ∩ found| / |expected|. A recall of 0.8 means 80% of expected entities appeared in the response.

When it drops: Graph traversal depth may be insufficient, or entity consolidation may have broken links.

LLM Judge — Correctness

"Is the biomedical information factually accurate?"

An LLM (served via Bedrock) evaluates whether the response contains correct biomedical facts. This catches subtle errors that keyword matching misses.

Limitations: The judge LLM has its own knowledge cutoff and biases. Use as a signal, not ground truth.
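
Because the judge is itself an LLM, its output is worth validating before it is recorded. The sketch below assumes the judge is prompted to reply with a JSON object containing the three dimensions listed for `llm_judge` (relevance, correctness, completeness), each in the range 0 to 1. The reply format and the name `parse_judge_output` are assumptions, not the framework's actual contract.

```python
import json

DIMENSIONS = ("relevance", "correctness", "completeness")


def parse_judge_output(raw: str) -> dict[str, float]:
    """Parse the judge's JSON reply and reject missing or out-of-range scores."""
    data = json.loads(raw)
    scores = {}
    for name in DIMENSIONS:
        value = float(data[name])  # KeyError if the judge omitted a dimension
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
        scores[name] = value
    return scores
```

Failing loudly on a malformed reply keeps a confused judge from silently contributing a bogus score to the aggregate.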


4. Practical Workflow

During development

  1. Make your change (agent logic, prompts, retrieval strategy)
  2. Run the scorer framework locally against relevant categories:
    uv run python -c "
    from src.evaluation.runner import EvaluationRunner
    from src.evaluation.scorers import get_scorer
    runner = EvaluationRunner()
    queries = runner.load_queries(categories=['local'])
    # ... run and inspect
    "
    
  3. Compare scores before/after your change
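
Step 3 can be done with a small helper that diffs two score dictionaries. The sketch assumes the aggregate scores of each run are available as a flat mapping from scorer name to float (the exact shape of `summary.aggregate_scores` is not documented here), and `score_deltas` is a hypothetical name.

```python
def score_deltas(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-metric change (after minus before) for metrics present in both runs."""
    shared = sorted(before.keys() & after.keys())
    return {name: round(after[name] - before[name], 4) for name in shared}


# Illustrative numbers only, not real results.
before = {"keyword_match": 0.70, "entity_overlap": 0.60}
after = {"keyword_match": 0.75, "entity_overlap": 0.55, "llm_judge": 0.80}

for name, delta in score_deltas(before, after).items():
    print(f"{name}: {delta:+.4f}")
```

Metrics that appear in only one run are skipped, so adding a scorer between runs does not produce a misleading delta.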

Before merging

  • CI runs the full eval harness automatically on push to main
  • Check the PR comment for regression warnings
  • If a regression is flagged, investigate which category/metric dropped

Monitoring in production

  • Grafana dashboard (olink-rag-eval-e2e) shows metric trends over time
  • Alert rules fire when any eval metric drops >5% from the rolling baseline
  • Latency percentiles track performance degradation
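
The rolling-baseline rule can be illustrated in a few lines. In practice this logic would live in the Grafana alert rule rather than in Python. The sketch assumes the baseline is the mean of the last `window` runs and that the 5% drop is relative to that mean; neither detail is specified on this page, and `RollingBaselineAlert` is a hypothetical name.

```python
from collections import deque


class RollingBaselineAlert:
    """Flag a metric value that drops too far below the mean of recent runs."""

    def __init__(self, window: int = 10, max_drop: float = 0.05):
        self.history = deque(maxlen=window)  # oldest runs fall off automatically
        self.max_drop = max_drop

    def observe(self, value: float) -> bool:
        """Record a new value; return True if it should trigger an alert."""
        alert = False
        if self.history:
            baseline = sum(self.history) / len(self.history)
            alert = baseline > 0 and (baseline - value) / baseline > self.max_drop
        # The value joins the history after the check, so it never
        # contributes to the baseline it is compared against.
        self.history.append(value)
        return alert
```

A degraded run still enters the window, so a sustained regression gradually becomes the new baseline and stops alerting. Whether that is desirable depends on how the alerts are meant to be used.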

5. Limitations & What's Not Automated

| Gap | Mitigation |
|---|---|
| Golden dataset may not cover edge cases | Expand data/test_queries.yaml as you find failures |
| LLM judge has its own biases | Cross-reference with entity_overlap and keyword_match |
| Ragas requires ground truth contexts | Generated from actual KG; regenerate after major ingestion changes |
| No real-time user feedback loop yet | Planned: Glicko rating system from user thumbs-up/down |
| Domain expertise needed to validate | Periodic manual review by domain scientists remains essential |

Quick Reference

# Generate golden dataset from current KG
uv run python evaluation/generate_dataset.py

# Run full CI eval locally
uv run python -m evaluation.harness.ci_eval

# Run Ragas evaluation
uv run python evaluation/run_eval.py

# Check eval results
python -m json.tool evaluation/reports/ci_summary.json