
Benchmarks & Evaluation

Automated evaluation harnesses for measuring extraction quality, model performance, and pipeline consistency.


Available Benchmarks

| Benchmark | What it measures | Gold Standard | Run Command |
| --- | --- | --- | --- |
| Gold Standard (Parser x Model) | End-to-end extraction accuracy vs Gary's curated labels | PUBLICATIONS_SUMMARIES CSV | uv run python -m evaluation.benchmark |
| Model Comparison | Entity/relationship extraction consistency & garbage rate | PubMed abstracts (self-referencing) | uv run python -m evaluation.extraction_model_eval |
| KG Pipeline Consistency (planned) | Entity consolidation, relationship merging, embedding quality | Synthetic + known-answer sets | |

Quick Start

# Interactive mode — guided setup for any benchmark
uv run python -m evaluation

# Gold standard benchmark (full matrix)
uv run python -m evaluation.benchmark -n 20 -r 10 -m all -p all --full-text

# Model comparison (abstracts only, fast)
uv run python -m evaluation.extraction_model_eval --abstracts 20 --runs 3

Latest Headlines

Parser Finding

PyMuPDF outperforms markitdown and docling on protein extraction F1 by ~70%. Direct PDF text extraction preserves protein names better than HTML conversion.

Model Finding

Mistral Large 3 achieves the highest recall across all parsers. Llama 3.1 8B offers the best cost/performance ratio at 3× lower latency.

Reproducibility

All models achieve >77% reproducibility (Jaccard across 10 runs). Llama 3.3 70B and Llama 3.1 8B hit 96-100% on docling, i.e. near-deterministic to fully deterministic output at temperature 0.
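As a rough illustration of a Jaccard-based reproducibility score, the sketch below averages the pairwise Jaccard similarity of the entity sets produced by repeated runs. How the harness actually aggregates across runs is not stated here, so the mean-pairwise choice and the names `jaccard` and `reproducibility` are assumptions for illustration only.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def reproducibility(runs: list[set[str]]) -> float:
    """Mean pairwise Jaccard similarity over all pairs of runs (assumed aggregation)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        # Fewer than two runs: nothing to compare, treat as fully reproducible.
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical runs on the same paper; the third run misses one entity.
runs = [
    {"TP53", "BRCA1", "EGFR"},
    {"TP53", "BRCA1", "EGFR"},
    {"TP53", "BRCA1"},
]
score = reproducibility(runs)
print(f"reproducibility = {score:.3f}")
```

A score of 1.0 means every run returned exactly the same set; any run that adds or drops an entity pulls the average down.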


Architecture

evaluation/
├── __main__.py              # Interactive TUI (arrow-key menu)
├── benchmark/               # Gold standard benchmark
│   ├── cli.py               # Argparse CLI
│   ├── gold_standard.py     # CSV parser → gold labels
│   ├── fetcher.py           # Paper download + cache + parsers
│   ├── runner.py            # Parallel LLM extraction
│   ├── metrics.py           # Precision/recall/F1/Jaccard
│   ├── reporter.py          # JSON + HTML report generation
│   └── report.html          # Interactive Chart.js template
├── extraction_model_eval.py # Model comparison (abstracts)
└── harness/                 # Query eval harness
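
The tree lists metrics.py as the home of precision, recall, and F1. A minimal sketch of set-based scoring against gold labels might look like the following. It is not the contents of metrics.py: the function names, the upper-casing normalization, and the sample protein names are all hypothetical.

```python
def normalize(names: list[str]) -> set[str]:
    """Assumed normalization: trim whitespace, upper-case, drop empty strings."""
    return {n.strip().upper() for n in names if n.strip()}


def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 of predicted labels against gold labels."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold labels and model output for one paper.
gold = normalize(["TP53", "BRCA1", "EGFR", "MYC"])
predicted = normalize(["tp53", "BRCA1 ", "KRAS"])

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Exact string matching after normalization is the simplest option. A real harness may need alias or synonym handling for protein names, which this sketch leaves out.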