Benchmarks & Evaluation

Automated evaluation harnesses for measuring extraction quality, model performance, and pipeline consistency.

Available Benchmarks

| Benchmark | What it measures | Gold Standard | Run Command |
| --- | --- | --- | --- |
| Gold Standard (Parser x Model) | End-to-end extraction accuracy vs Gary's curated labels | `PUBLICATIONS_SUMMARIES` CSV | `uv run python -m evaluation.benchmark` |
| Model Comparison | Entity/relationship extraction consistency & garbage rate | PubMed abstracts (self-referencing) | `uv run python -m evaluation.extraction_model_eval` |
| KG Pipeline Consistency (planned) | Entity consolidation, relationship merging, embedding quality | Synthetic + known-answer sets | n/a |

Quick Start

# Interactive mode — guided setup for any benchmark
uv run python -m evaluation

# Gold standard benchmark (full matrix)
uv run python -m evaluation.benchmark -n 20 -r 10 -m all -p all --full-text

# Model comparison (abstracts only, fast)
uv run python -m evaluation.extraction_model_eval --abstracts 20 --runs 3

Latest Headlines

Parser Finding

PyMuPDF outperforms markitdown and docling on protein extraction F1 by ~70%. Direct PDF text extraction preserves protein names better than HTML conversion.
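For readers unfamiliar with how an extraction F1 score is typically computed, here is a minimal sketch of set-based precision, recall, and F1. It assumes predicted and gold protein names are compared as case-insensitive exact matches; the function name `prf1` and the normalization are illustrative assumptions, not the actual contents of `metrics.py`.

```python
def prf1(predicted, gold):
    """Set-based precision, recall, and F1 over normalized names.

    Assumption: names match if they are equal after stripping
    whitespace and lowercasing. The real benchmark may normalize
    differently (aliases, synonyms, fuzzy matching).
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}

    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two of three predictions are correct; two of four gold names are found.
p, r, f = prf1(["TP53", "brca1", "EGFR"], ["tp53", "BRCA1", "MYC", "KRAS"])
```

With exact matching, a parser that splits or mangles a protein name costs both a false positive and a false negative, which is one way parser choice could move F1 substantially.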

Model Finding

Mistral Large 3 achieves the highest recall across all parsers. Llama 3.1 8B offers the best cost/performance ratio, with 3× lower latency.

Reproducibility

All models achieve >77% reproducibility (Jaccard similarity across 10 runs). Llama 3.3 70B and Llama 3.1 8B reach 96-100% on docling, i.e. fully or nearly deterministic at temperature 0.
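One plausible way to turn repeated runs into a single reproducibility figure is the mean pairwise Jaccard similarity of the extracted sets. The sketch below assumes that aggregation; the source only says "Jaccard across 10 runs", so the averaging scheme and the function names are assumptions rather than a description of the actual harness.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty outputs count as identical
    return len(a & b) / len(a | b)


def reproducibility(runs):
    """Mean pairwise Jaccard similarity across repeated runs.

    `runs` is a list of extracted-entity collections, one per run.
    Returns 1.0 when every run produced the same set.
    """
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Three runs: two identical, one missing an entity.
score = reproducibility([{"TP53", "BRCA1"}, {"TP53", "BRCA1"}, {"TP53"}])
```

Under this definition a score of 100% means every run returned exactly the same set, while anything lower indicates at least one run diverged.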

Architecture

evaluation/
├── __main__.py              # Interactive TUI (arrow-key menu)
├── benchmark/               # Gold standard benchmark
│   ├── cli.py               # Argparse CLI
│   ├── gold_standard.py     # CSV parser → gold labels
│   ├── fetcher.py           # Paper download + cache + parsers
│   ├── runner.py            # Parallel LLM extraction
│   ├── metrics.py           # Precision/recall/F1/Jaccard
│   ├── reporter.py          # JSON + HTML report generation
│   └── report.html          # Interactive Chart.js template
├── extraction_model_eval.py # Model comparison (abstracts)
└── harness/                 # Query eval harness
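To illustrate how a parser x model matrix can be fanned out in parallel, here is a minimal sketch using a thread pool. The `run_matrix` function, its `extract` callback, and the `max_workers` default are hypothetical names chosen for illustration; the actual `runner.py` may be organized quite differently.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_matrix(parsers, models, extract, max_workers=4):
    """Call extract(parser, model) for every combination in parallel.

    Returns a dict keyed by (parser, model). Threads suit this kind of
    workload if the extraction step is mostly waiting on network calls.
    """
    cells = list(product(parsers, models))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with cells.
        results = pool.map(lambda cell: extract(*cell), cells)
        return dict(zip(cells, results))


def fake_extract(parser, model):
    # Stand-in for a real LLM extraction call (hypothetical).
    return f"{parser}:{model}"


matrix = run_matrix(["pymupdf", "docling"], ["llama-3.1-8b"], fake_extract)
```

Keying results by `(parser, model)` keeps each cell of the matrix addressable when computing per-cell metrics or building a report.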