Benchmarks & Evaluation¶
Automated evaluation harnesses for measuring extraction quality, model performance, and pipeline consistency.
Available Benchmarks¶
| Benchmark | What it measures | Gold Standard | Run Command |
|---|---|---|---|
| Gold Standard (Parser x Model) | End-to-end extraction accuracy vs Gary's curated labels | PUBLICATIONS_SUMMARIES CSV |
uv run python -m evaluation.benchmark |
| Model Comparison | Entity/relationship extraction consistency & garbage rate | PubMed abstracts (self-referencing) | uv run python -m evaluation.extraction_model_eval |
| KG Pipeline Consistency (planned) | Entity consolidation, relationship merging, embedding quality | Synthetic + known-answer sets | — |
Quick Start¶
# Interactive mode — guided setup for any benchmark
uv run python -m evaluation
# Gold standard benchmark (full matrix)
uv run python -m evaluation.benchmark -n 20 -r 10 -m all -p all --full-text
# Model comparison (abstracts only, fast)
uv run python -m evaluation.extraction_model_eval --abstracts 20 --runs 3
Latest Headlines¶
Parser Finding
PyMuPDF outperforms markitdown and docling on protein extraction F1 by ~70%. Direct PDF text extraction preserves protein names better than HTML conversion.
Model Finding
Mistral Large 3 achieves highest recall across all parsers. Llama 3.1 8B offers best cost/performance ratio at 3× lower latency.
Reproducibility
All models achieve >77% reproducibility (Jaccard across 10 runs). Llama 3.3 70B and Llama 3.1 8B hit 96-100% on docling — perfectly deterministic at temperature 0.
Architecture¶
evaluation/
├── __main__.py # Interactive TUI (arrow-key menu)
├── benchmark/ # Gold standard benchmark
│ ├── cli.py # Argparse CLI
│ ├── gold_standard.py # CSV parser → gold labels
│ ├── fetcher.py # Paper download + cache + parsers
│ ├── runner.py # Parallel LLM extraction
│ ├── metrics.py # Precision/recall/F1/Jaccard
│ ├── reporter.py # JSON + HTML report generation
│ └── report.html # Interactive Chart.js template
├── extraction_model_eval.py # Model comparison (abstracts)
└── harness/ # Query eval harness