Query Agent Verification¶

How do we know the query agent's responses are correct — beyond eyeballing?

This page covers the automated evaluation systems, the metrics they produce, and how to use them during development and in CI.

The Problem¶

A RAG system can fail silently: it returns fluent, confident text that is factually wrong, incomplete, or unsupported by the retrieved context. Manual spot-checking doesn't scale and misses subtle regressions.

The project addresses this with two complementary evaluation systems:

Ragas-based CI harness (evaluation/) — runs on every push to main, catches regressions
Scorer-based framework (src/evaluation/) — programmatic, extensible, used for deeper analysis

1. CI Evaluation Harness¶

How it works¶

flowchart LR
    A[Golden Dataset] --> B[CI Runner]
    B --> C[Query Agent]
    C --> D[Responses]
    D --> E[Ragas Metrics]
    D --> F[Retrieval Recall]
    E --> G[Compare to Baseline]
    F --> G
    G -->|>5% drop| H[❌ FAIL]
    G -->|3-5% drop| I[⚠️ WARNING]
    G -->|OK| J[✅ PASS]
    J --> K[Grafana]
    H --> K

Golden dataset¶

The golden dataset is generated from actual text chunks in the knowledge graph (KG):

uv run python evaluation/generate_dataset.py

This produces evaluation/golden_dataset.json containing question-answer pairs with source contexts derived from real KG data — ensuring questions are answerable by the system.

Running locally¶

# Full eval (same as CI)
uv run python -m evaluation.harness.ci_eval

# Just generate + run without regression check
uv run python evaluation/run_eval.py

Metrics produced¶

Metric	What it measures	Source
Faithfulness	Does the answer only use information from retrieved context? (no hallucination)	Ragas
Answer Relevance	Is the answer relevant to the question asked?	Ragas
Context Precision	Did we retrieve the right chunks?	Ragas
Retrieval Recall	`|expected_entities ∩ retrieved_entities| / |expected_entities|`	Custom
Mode Classification	Did the agent pick the correct query mode (local/global/hybrid)?	Custom
Latency	p50, p95, p99 response times	Custom
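
The Retrieval Recall formula in the table can be sketched as a small set computation. This is a minimal illustration, not the harness's actual code: the function name is hypothetical, matching is assumed to be exact on entity names, and the empty-expectation case is an assumption.

```python
def retrieval_recall(expected_entities, retrieved_entities):
    """|expected ∩ retrieved| / |expected| over entity-name sets."""
    expected = set(expected_entities)
    retrieved = set(retrieved_entities)
    if not expected:
        # Assumption: with nothing expected there is nothing to miss.
        return 1.0
    return len(expected & retrieved) / len(expected)


# Two of the three expected entities were retrieved.
recall = retrieval_recall(["APP", "PSEN1", "APOE"], ["APP", "APOE", "TREM2"])
print(f"{recall:.3f}")
```

Extra retrieved entities (here `TREM2`) do not lower recall; they would only show up in a precision-style metric.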

Regression detection¶

The CI runner compares current results against a stored baseline:

>5% drop on any metric → build fails (exit code 1)
3–5% drop → warning posted to PR, build passes
No regression → green, results pushed to Grafana

Results are posted as PR comments automatically via GitHub Actions.
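
The thresholds above can be sketched as a small classification function. This is an illustration under stated assumptions rather than the CI runner's code: the drop is assumed to be relative to the baseline value, all metrics are assumed to be higher-is-better (latency would need the opposite sign), and the function and constant names are hypothetical.

```python
FAIL_THRESHOLD = 0.05  # more than a 5% drop fails the build
WARN_THRESHOLD = 0.03  # a 3-5% drop produces a warning


def classify_regression(baseline, current):
    """Return "fail", "warning", or "pass" for higher-is-better metrics."""
    status = "pass"
    for metric, base_value in baseline.items():
        if metric not in current or base_value <= 0:
            continue  # nothing to compare against
        # Assumption: the drop is measured relative to the baseline value.
        drop = (base_value - current[metric]) / base_value
        if drop > FAIL_THRESHOLD:
            return "fail"
        if drop >= WARN_THRESHOLD:
            status = "warning"
    return status


# Illustrative numbers only, not real results.
print(classify_regression({"faithfulness": 0.80}, {"faithfulness": 0.77}))
```

A caller would map `"fail"` to exit code 1 (for example with `sys.exit(1)`) and let the other two statuses exit normally.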

Response caching¶

Responses are cached in evaluation/cache/ so that re-runs are deterministic and fast. The cache key is derived from the golden dataset and the baseline config hash, so a change to either one produces a fresh cache entry.
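
One way such a cache key could be built is to hash the dataset contents together with a canonical form of the config. This is a sketch only: the function name, the choice of SHA-256, and the config shape are assumptions, not the harness's actual scheme.

```python
import hashlib
import json


def cache_key(dataset_bytes: bytes, baseline_config: dict) -> str:
    """Hash the dataset contents together with a canonical config dump."""
    digest = hashlib.sha256()
    digest.update(dataset_bytes)
    digest.update(b"\0")  # separator so the two inputs cannot run together
    # sort_keys makes the key independent of dict insertion order.
    digest.update(json.dumps(baseline_config, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


# Hypothetical inputs for illustration.
key = cache_key(b'[{"question": "..."}]', {"fail_threshold": 0.05})
print(key[:16])
```

In practice `dataset_bytes` would be the raw contents of evaluation/golden_dataset.json, so regenerating the dataset changes the key.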

2. Scorer-Based Framework¶

For deeper analysis beyond CI pass/fail, the src/evaluation/ framework provides pluggable scorers.

Test queries¶

Defined in data/test_queries.yaml, organized by category:

multi_hop_queries — require traversing multiple relationships
global_queries — broad, community-level questions
local_queries — specific entity lookups
hybrid_queries — combine semantic + graph search
disambiguation_queries — entities with ambiguous names

Each query specifies:

- query: "What proteins are associated with Alzheimer's disease?"
  expected_mode: local
  expected_keywords: ["APP", "PSEN1", "APOE"]
  reference_answer: "Key proteins include APP, PSEN1, PSEN2, and APOE..."
  hop_depth: 1

Available scorers¶

Scorer	What it does
`keyword_match`	Fraction of expected keywords found in the answer text
`entity_overlap`	Precision, recall, and F1 on entity names (case-insensitive)
`embedding_similarity`	Cosine similarity between response and reference answer embeddings
`llm_judge`	LLM scores the response on relevance, correctness, and completeness (0–1 each)
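
To make the first two rows concrete, here is a minimal sketch of the arithmetic behind them. The real scorers live in src/evaluation/ and may differ. The function names are hypothetical, and case-insensitive substring matching for keywords is an assumption; the table only states case-insensitivity for `entity_overlap`.

```python
def keyword_match_score(answer, expected_keywords):
    """Fraction of expected keywords that occur in the answer text."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    # Assumption: plain substring matching, so "APP" also matches "happens".
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


def entity_overlap_scores(expected, found):
    """Precision, recall, and F1 on case-insensitive entity names."""
    exp = {e.lower() for e in expected}
    got = {e.lower() for e in found}
    common = exp & got
    precision = len(common) / len(got) if got else 0.0
    recall = len(common) / len(exp) if exp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(keyword_match_score("APP is implicated in amyloid processing.", ["APP", "PSEN1"]))
print(entity_overlap_scores(["APP", "PSEN1", "APOE", "PSEN2"], ["app", "apoe", "TREM2"]))
```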

Running the framework¶

from src.evaluation.runner import EvaluationRunner
from src.evaluation.scorers import get_scorer
from src.evaluation.adapters import YourAgentAdapter

runner = EvaluationRunner()
queries = runner.load_queries(categories=["local", "multi_hop"])

adapter = YourAgentAdapter(...)  # wraps your agent
scorers = [get_scorer("keyword_match"), get_scorer("entity_overlap"), get_scorer("llm_judge")]

summary = runner.run(adapter, queries, scorers)

# Inspect results
for result in summary.results:
    print(f"{result.query[:50]}...")
    for name, score in result.scores.items():
        print(f"  {name}: {score.value:.3f}")

print(f"\nAggregate: {summary.aggregate_scores}")
print(f"Mode accuracy: {summary.mode_accuracy:.1%}")

Adding a custom scorer¶

from src.evaluation.models import EvaluationResponse, MetricScore, TestQuery
from src.evaluation.scorers import register_scorer

@register_scorer("my_custom_scorer")
class MyScorer:
    def score(self, query: TestQuery, response: EvaluationResponse) -> list[MetricScore]:
        # Your logic here
        value = compute_something(query, response)
        return [MetricScore(metric_name="my_metric", value=value)]

3. What Each Metric Tells You¶

Faithfulness (Ragas)¶

"Is the agent making things up?"

A faithfulness score of 0.9 means 90% of the claims in the answer are supported by the retrieved context. A low score signals hallucination.
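
The final arithmetic behind that number is a simple ratio. In Ragas, extracting claims from the answer and judging each one against the context is done by an LLM; the sketch below assumes those per-claim verdicts are already available and shows only the last step. The function name is hypothetical.

```python
def faithfulness_ratio(claim_supported):
    """Supported claims divided by total claims; one bool per claim."""
    if not claim_supported:
        # Assumption: an answer with no extractable claims scores 0.0.
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Ten claims in the answer, nine of them backed by the retrieved context.
verdicts = [True] * 9 + [False]
print(f"{faithfulness_ratio(verdicts):.2f}")
```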

When it drops: The agent is generating information not present in the retrieved chunks. Check if retrieval is returning irrelevant context, or if the LLM prompt encourages speculation.

Answer Relevance (Ragas)¶

"Did the agent actually answer the question?"

When it drops: The agent may be going off-topic, returning tangentially related information, or misinterpreting the query.

Context Precision (Ragas)¶

"Are we retrieving the right chunks?"

When it drops: The vector index may need recomputation, embeddings may have drifted, or the chunking strategy may be producing poor segments.

Entity Recall¶

"Did the agent find the entities we expected?"

Computed as |expected ∩ found| / |expected|. A recall of 0.8 means 80% of expected entities appeared in the response.

When it drops: Graph traversal depth may be insufficient, or entity consolidation may have broken links.

LLM Judge — Correctness¶

"Is the biomedical information factually accurate?"

An LLM served through Bedrock evaluates whether the response contains correct biomedical facts. This catches subtle errors that keyword matching misses.

Limitations: The judge LLM has its own knowledge cutoff and biases. Use as a signal, not ground truth.

4. Practical Workflow¶

During development¶

Make your change (agent logic, prompts, retrieval strategy)

Run the scorer framework locally against relevant categories:

uv run python -c "
from src.evaluation.runner import EvaluationRunner
from src.evaluation.scorers import get_scorer
runner = EvaluationRunner()
queries = runner.load_queries(categories=['local'])
# ... run and inspect
"

Compare scores before/after your change
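
The before/after comparison can be as simple as a per-metric difference between two runs. This sketch assumes each run's aggregate scores are available as a plain dict of metric name to float, which may not match the exact shape of `summary.aggregate_scores`. The function name and the numbers are illustrative.

```python
def score_deltas(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-metric change (after minus before) for metrics present in both runs."""
    shared = sorted(before.keys() & after.keys())
    return {name: after[name] - before[name] for name in shared}


# Illustrative numbers only, not real results.
before = {"keyword_match": 0.70, "entity_overlap_f1": 0.60}
after = {"keyword_match": 0.75, "entity_overlap_f1": 0.55}
for name, delta in score_deltas(before, after).items():
    print(f"{name:<20} {delta:+.3f}")
```

A positive delta means the change improved that metric; metrics missing from either run are skipped rather than treated as zero.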

Before merging¶

CI runs the full eval harness automatically on push to main
Check the PR comment for regression warnings
If a regression is flagged, investigate which category/metric dropped

Monitoring in production¶

Grafana dashboard (olink-rag-eval-e2e) shows metric trends over time
Alert rules fire when any eval metric drops >5% from the rolling baseline
Latency percentiles track performance degradation
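
As a reference for reading the latency panels, p50, p95, and p99 can be computed from raw response times with the standard library. This is a sketch: the function name is hypothetical, and the interpolation method the harness uses is not specified, so `method="inclusive"` is an assumption.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 for a list of response times."""
    # quantiles(n=100) yields 99 cut points; index i holds percentile i + 1.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic latencies of 1..101 ms, for illustration only.
print(latency_percentiles(list(range(1, 102))))
```

Tail percentiles such as p99 are sensitive to sample size, so a small eval run gives only a rough estimate.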

5. Limitations & What's Not Automated¶

Gap	Mitigation
Golden dataset may not cover edge cases	Expand `data/test_queries.yaml` as you find failures
LLM judge has its own biases	Cross-reference with entity_overlap and keyword_match
Ragas requires ground truth contexts	Generated from actual KG — regenerate after major ingestion changes
No real-time user feedback loop yet	Planned: Glicko rating system from user thumbs-up/down
Domain expertise needed to validate	Periodic manual review by domain scientists remains essential

Quick Reference¶

# Generate golden dataset from current KG
uv run python evaluation/generate_dataset.py

# Run full CI eval locally
uv run python -m evaluation.harness.ci_eval

# Run Ragas evaluation
uv run python evaluation/run_eval.py

# Check eval results
python -m json.tool evaluation/reports/ci_summary.json