Gold Standard Benchmark (Parser × Model)

Evaluates the full extraction pipeline against Gary's curated protein/disease labels from PUBLICATIONS_SUMMARIES.

Interactive Report (charts + per-paper detail) →

Methodology

Gold standard: Papers sorted by label richness from Gary's CSV. Each has manually annotated disease areas and protein markers.
Paper fetch: Full text downloaded via article link (PDF preferred). Paywalled sources excluded. All parsers operate on the same cached raw file.
Parsing: Each parser (markitdown, pymupdf, docling) converts the cached PDF/HTML to text independently.
Extraction: Each LLM extracts proteins and diseases from the parsed text. Each model is run 10× per paper to measure reproducibility.
Scoring: Extracted entities compared against gold labels (case-insensitive match).
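The steps above amount to a parser × model grid evaluated repeatedly on one cached file. The following is a minimal sketch of that loop, not the benchmark's actual code: `run_grid` and its callable arguments are hypothetical stand-ins for the real parsers and model clients.

```python
from itertools import product
from typing import Callable


def run_grid(
    raw_file: bytes,
    parsers: dict[str, Callable[[bytes], str]],
    models: dict[str, Callable[[str], set[str]]],
    n_runs: int = 10,
) -> dict[tuple[str, str], list[set[str]]]:
    """Run every parser x model pair n_runs times on one cached paper.

    `parsers` maps a parser name to a function turning raw bytes into text;
    `models` maps a model name to a function returning extracted entities.
    Both are hypothetical interfaces used only for illustration.
    """
    # Parse once per parser so every model sees identical text.
    parsed = {name: parse(raw_file) for name, parse in parsers.items()}

    results: dict[tuple[str, str], list[set[str]]] = {}
    for (p_name, text), (m_name, extract) in product(parsed.items(), models.items()):
        # Repeat the extraction so run-to-run consistency can be measured later.
        results[(p_name, m_name)] = [extract(text) for _ in range(n_runs)]
    return results
```

Parsing once per parser and reusing the text for every model keeps the comparison fair: any difference between models on the same parser then comes from extraction alone.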

Metric Definitions

Precision = |extracted ∩ gold| / |extracted| — "of what we found, how much was correct?"
Recall = |extracted ∩ gold| / |gold| — "of what Gary listed, how much did we find?"
F1 = harmonic mean of precision and recall
Reproducibility = avg pairwise Jaccard across 10 runs — "how consistent is the model?"
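The four definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's implementation: the function names are made up here, entities are normalized by lower-casing and stripping whitespace, and two empty outputs are treated as identical (Jaccard of 1.0).

```python
from itertools import combinations


def normalize(entities):
    """Lower-case and strip entities so matching is case-insensitive."""
    return {e.strip().lower() for e in entities if e.strip()}


def precision_recall_f1(extracted, gold):
    """Set-based precision, recall and F1 for one paper."""
    ext, ref = normalize(extracted), normalize(gold)
    hits = len(ext & ref)
    precision = hits / len(ext) if ext else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def jaccard(a, b):
    """Jaccard similarity of two entity sets."""
    a, b = normalize(a), normalize(b)
    union = a | b
    # Assumption: two empty outputs count as perfectly consistent.
    return len(a & b) / len(union) if union else 1.0


def reproducibility(runs):
    """Average pairwise Jaccard similarity across repeated runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For example, `precision_recall_f1(["IL6", "tnf", "CRP"], ["IL6", "TNF", "APOE", "BRCA1"])` gives a precision of 2/3, a recall of 1/2 and an F1 of 4/7. In the results tables the reported F1 is lower than the harmonic mean of the reported precision and recall (22.0% versus roughly 27% for Mistral Large 3 on pymupdf). That would be consistent with computing the metrics per paper and then averaging, but this is an inference and not something the source states.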

Parser Comparison Summary

Parser	Avg F1	Avg Precision	Avg Recall	Avg Reproducibility	Best For
pymupdf	21.1%	40.0%	17.9%	92.8%	Best overall accuracy (PDF text extraction)
docling	16.3%	34.6%	14.2%	91.1%	Structured documents, tables
markitdown	12.5%	28.5%	10.8%	92.4%	HTML-heavy sources, speed

Key Finding

PyMuPDF wins on accuracy: direct PDF text extraction preserves protein names better than HTML-to-markdown conversion (markitdown) or document understanding (docling). Its average F1 of 21.1% is about 69% higher in relative terms than markitdown's 12.5% (8.6 percentage points) and about 29% higher than docling's 16.3%.
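The summary averages and the relative F1 gap can be checked directly from the per-model tables. The sketch below copies the F1 columns from this document, assuming the three results tables appear in the order pymupdf, docling, markitdown; `relative_gain` is a helper named here for illustration.

```python
from statistics import mean

# Per-model F1 (%) copied from the results tables in this document.
F1_BY_PARSER = {
    "pymupdf": [22.0, 21.8, 21.6, 21.2, 20.7, 20.5, 20.0],
    "docling": [19.2, 17.1, 16.7, 16.2, 15.8, 15.1, 14.2],
    "markitdown": [13.8, 13.5, 13.2, 12.4, 11.9, 11.5, 11.4],
}

# Unweighted mean over the seven models, rounded as in the summary table.
avg_f1 = {parser: round(mean(scores), 1) for parser, scores in F1_BY_PARSER.items()}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


gain_vs_markitdown = relative_gain(avg_f1["pymupdf"], avg_f1["markitdown"])
gain_vs_docling = relative_gain(avg_f1["pymupdf"], avg_f1["docling"])
```

The averages come out to 21.1, 16.3 and 12.5, matching the summary table, and the gain over markitdown is just under 69%, which is what the "~70%" figure refers to.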

Results by Parser

pymupdf (Best)

Model	Precision	Recall	F1	Reproducibility	Latency
Mistral Large 3	37.8%	21.4%	22.0%	84.7%	2531ms
Llama 3.1 8B	41.0%	17.6%	21.8%	94.9%	1224ms
Nova Micro	47.2%	17.5%	21.6%	93.3%	998ms
Nova Pro	41.2%	18.2%	21.2%	86.6%	1892ms
Nova Lite	44.1%	17.1%	20.7%	95.5%	1485ms
Llama 3.3 70B	34.5%	16.7%	20.5%	97.2%	1951ms
Claude Sonnet 4.6	34.0%	16.9%	20.0%	97.7%	4604ms

docling

Model	Precision	Recall	F1	Reproducibility	Latency
Mistral Large 3	31.7%	21.9%	19.2%	78.1%	3077ms
Claude Sonnet 4.6	28.6%	16.7%	17.1%	92.6%	4837ms
Llama 3.1 8B	38.5%	12.5%	16.7%	98.2%	1179ms
Nova Pro	35.6%	13.0%	16.2%	84.6%	1669ms
Nova Micro	41.5%	11.8%	15.8%	97.4%	946ms
Nova Lite	38.5%	12.0%	15.1%	86.7%	1327ms
Llama 3.3 70B	27.7%	11.7%	14.2%	100.0%	1783ms

markitdown

Model	Precision	Recall	F1	Reproducibility	Latency
Mistral Large 3	24.5%	15.2%	13.8%	77.9%	2428ms
Llama 3.1 8B	32.5%	10.2%	13.5%	100.0%	1137ms
Claude Sonnet 4.6	24.9%	12.0%	13.2%	93.0%	4127ms
Nova Micro	35.3%	9.5%	12.4%	91.7%	955ms
Llama 3.3 70B	22.5%	9.4%	11.9%	96.1%	1732ms
Nova Lite	30.8%	9.7%	11.5%	92.4%	1324ms
Nova Pro	29.1%	9.7%	11.4%	96.0%	1497ms

Key Observations

Model Insights

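Mistral Large 3 achieves the highest recall with every parser (21.4% pymupdf, 21.9% docling, 15.2% markitdown), so it finds the most gold entities, but it also has the lowest reproducibility with every parser (77.9% to 84.7%)
Llama 3.1 8B offers the best cost/performance: on pymupdf its F1 (21.8%) is within 0.2 points of the top score at about half the latency of Mistral Large 3 (1224ms vs 2531ms) and roughly a quarter that of Claude Sonnet 4.6 (4604ms)
Claude Sonnet 4.6 has the highest reproducibility on pymupdf (97.7%) but is the slowest model with every parser, at 4 to 5 times the latency of the fastest model (4604ms vs 998ms on pymupdf). On docling and markitdown, Llama 3.3 70B and Llama 3.1 8B respectively reach 100.0% reproducibility
Nova Micro is the fastest model with every parser (946ms to 998ms) and also has the highest precision with every parser (35.3% to 47.2%)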

Recall Ceiling

Recall is capped because Gary's gold labels include proteins from the full paper body, figures, and supplementary data, and not all of them appear in the extractable main text even with full-text parsing. A paper with 27 gold proteins may mention only 7 in extractable text, in which case recall for that paper cannot exceed 7/27 ≈ 26% no matter how good the model is.
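One way to estimate this ceiling per paper is to count how many gold labels occur anywhere in the parsed text. The sketch below is an approximation under stated assumptions: `recall_ceiling` is a hypothetical helper, and it uses a plain case-insensitive substring check, which misses synonyms and can over-match short names that appear inside longer words.

```python
def recall_ceiling(gold, parsed_text):
    """Fraction of gold labels that occur (case-insensitively) in the parsed text.

    This is an upper bound on recall for an extractor that can only return
    entities literally present in the text.
    """
    text = parsed_text.lower()
    labels = {g.strip().lower() for g in gold if g.strip()}
    if not labels:
        return 0.0
    # Naive substring test; a real check would likely need token boundaries
    # and alias handling.
    found = {label for label in labels if label in text}
    return len(found) / len(labels)
```

With 27 gold proteins of which 7 occur in the text, the function returns 7/27, roughly 0.26. Comparing measured recall against this bound separates what the parser lost from what the model missed.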

How to Run

# Interactive mode (prompts for all options)
uv run python -m evaluation

# Direct CLI — all models, all parsers, 20 papers, 10 runs each
uv run python -m evaluation.benchmark -n 20 -r 10 -m all -p all --full-text

# Single parser, single model (quick test)
uv run python -m evaluation.benchmark -n 5 -r 3 -m nova-pro -p pymupdf --full-text