Gold Standard Benchmark (Parser × Model)
Evaluates the full extraction pipeline against Gary's curated protein/disease labels from PUBLICATIONS_SUMMARIES.
Interactive Report (charts + per-paper detail) →
Methodology
- Gold standard: Papers sorted by label richness from Gary's CSV. Each has manually annotated disease areas and protein markers.
- Paper fetch: Full text downloaded via article link (PDF preferred). Paywalled sources excluded. All parsers operate on the same cached raw file.
- Parsing: Each parser (markitdown, pymupdf, docling) converts the cached PDF/HTML to text independently.
- Extraction: Each LLM extracts proteins and diseases from the parsed text. Every paper is run 10× per model to measure reproducibility.
- Scoring: Extracted entities compared against gold labels (case-insensitive match).
Metric Definitions
- Precision = |extracted ∩ gold| / |extracted| — "of what we found, how much was correct?"
- Recall = |extracted ∩ gold| / |gold| — "of what Gary listed, how much did we find?"
- F1 = harmonic mean of precision and recall
- Reproducibility = avg pairwise Jaccard across 10 runs — "how consistent is the model?"
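The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the function names, the whitespace stripping, the handling of empty sets, and the example entities are all assumptions made for the sketch.

```python
from itertools import combinations


def normalize(entities):
    """Lowercase and strip each entity so matching is case-insensitive."""
    return {e.strip().lower() for e in entities if e.strip()}


def score(extracted, gold):
    """Return (precision, recall, f1) for one run against the gold labels."""
    ext, ref = normalize(extracted), normalize(gold)
    hits = len(ext & ref)
    precision = hits / len(ext) if ext else 0.0
    recall = hits / len(ref) if ref else 0.0
    # F1 is the harmonic mean; define it as 0 when nothing matched.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def jaccard(a, b):
    """Jaccard similarity; two empty sets count as identical (assumption)."""
    a, b = normalize(a), normalize(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def reproducibility(runs):
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up example: four gold labels and three extraction runs.
gold = ["TNF", "IL6", "CRP", "Alzheimer's disease"]
runs = [["tnf", "IL6", "EGFR"], ["TNF", "il6", "EGFR"], ["TNF", "IL6"]]

p, r, f1 = score(runs[0], gold)
repro = reproducibility(runs)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} reproducibility={repro:.3f}")
```

With 10 runs per paper, `reproducibility` averages over 45 pairs. Exact set matching after lowercasing means that synonyms (for example a gene symbol versus a full protein name) count as misses.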
Parser Comparison Summary
| Parser | Avg F1 | Avg Precision | Avg Recall | Avg Reproducibility | Best For |
|---|---|---|---|---|---|
| pymupdf | 21.1% | 40.0% | 17.9% | 92.8% | Best overall accuracy (PDF text extraction) |
| docling | 16.3% | 34.6% | 14.2% | 91.1% | Structured documents, tables |
| markitdown | 12.5% | 28.5% | 10.8% | 92.4% | HTML-heavy sources, speed |
Key Finding
PyMuPDF wins on accuracy: direct PDF text extraction preserves protein names better than HTML-to-markdown conversion (markitdown) or document understanding (docling). Its average F1 is about 69% higher than markitdown's in relative terms (21.1% vs 12.5%, an 8.6-point absolute gap).
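The relative improvement can be recomputed from the summary table. The snippet below is only a worked calculation; the helper name `relative_gain` is introduced here for illustration.

```python
# Avg F1 values copied from the Parser Comparison Summary table (percent).
avg_f1 = {"pymupdf": 21.1, "docling": 16.3, "markitdown": 12.5}


def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100


gain_vs_markitdown = relative_gain(avg_f1["pymupdf"], avg_f1["markitdown"])
gain_vs_docling = relative_gain(avg_f1["pymupdf"], avg_f1["docling"])
abs_gap_points = avg_f1["pymupdf"] - avg_f1["markitdown"]

print(f"pymupdf vs markitdown: {gain_vs_markitdown:.1f}% relative, "
      f"{abs_gap_points:.1f} points absolute")
print(f"pymupdf vs docling:    {gain_vs_docling:.1f}% relative")
```

Reporting both the relative and the absolute gap avoids overstating the difference when the baseline F1 is low.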
Results by Parser
pymupdf
| Model | Precision | Recall | F1 | Reproducibility | Latency |
|---|---|---|---|---|---|
| Mistral Large 3 | 37.8% | 21.4% | 22.0% | 84.7% | 2531ms |
| Llama 3.1 8B | 41.0% | 17.6% | 21.8% | 94.9% | 1224ms |
| Nova Micro | 47.2% | 17.5% | 21.6% | 93.3% | 998ms |
| Nova Pro | 41.2% | 18.2% | 21.2% | 86.6% | 1892ms |
| Nova Lite | 44.1% | 17.1% | 20.7% | 95.5% | 1485ms |
| Llama 3.3 70B | 34.5% | 16.7% | 20.5% | 97.2% | 1951ms |
| Claude Sonnet 4.6 | 34.0% | 16.9% | 20.0% | 97.7% | 4604ms |
docling
| Model | Precision | Recall | F1 | Reproducibility | Latency |
|---|---|---|---|---|---|
| Mistral Large 3 | 31.7% | 21.9% | 19.2% | 78.1% | 3077ms |
| Claude Sonnet 4.6 | 28.6% | 16.7% | 17.1% | 92.6% | 4837ms |
| Llama 3.1 8B | 38.5% | 12.5% | 16.7% | 98.2% | 1179ms |
| Nova Pro | 35.6% | 13.0% | 16.2% | 84.6% | 1669ms |
| Nova Micro | 41.5% | 11.8% | 15.8% | 97.4% | 946ms |
| Nova Lite | 38.5% | 12.0% | 15.1% | 86.7% | 1327ms |
| Llama 3.3 70B | 27.7% | 11.7% | 14.2% | 100.0% | 1783ms |
markitdown
| Model | Precision | Recall | F1 | Reproducibility | Latency |
|---|---|---|---|---|---|
| Mistral Large 3 | 24.5% | 15.2% | 13.8% | 77.9% | 2428ms |
| Llama 3.1 8B | 32.5% | 10.2% | 13.5% | 100.0% | 1137ms |
| Claude Sonnet 4.6 | 24.9% | 12.0% | 13.2% | 93.0% | 4127ms |
| Nova Micro | 35.3% | 9.5% | 12.4% | 91.7% | 955ms |
| Llama 3.3 70B | 22.5% | 9.4% | 11.9% | 96.1% | 1732ms |
| Nova Lite | 30.8% | 9.7% | 11.5% | 92.4% | 1324ms |
| Nova Pro | 29.1% | 9.7% | 11.4% | 96.0% | 1497ms |
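A quick consistency check: assuming each summary row is the unweighted mean over the seven models (an assumption, since the averaging method is not stated), the per-model F1 values in the three tables above, taken in order, reproduce the summary Avg F1 for pymupdf, docling, and markitdown.

```python
from statistics import mean

# Per-model F1 (percent), copied from the three result tables in order.
f1_by_parser = {
    "pymupdf": [22.0, 21.8, 21.6, 21.2, 20.7, 20.5, 20.0],
    "docling": [19.2, 17.1, 16.7, 16.2, 15.8, 15.1, 14.2],
    "markitdown": [13.8, 13.5, 13.2, 12.4, 11.9, 11.5, 11.4],
}

# Avg F1 (percent) from the Parser Comparison Summary table.
summary_avg_f1 = {"pymupdf": 21.1, "docling": 16.3, "markitdown": 12.5}

# Unweighted mean across models, rounded to one decimal like the summary.
recomputed = {parser: round(mean(values), 1)
              for parser, values in f1_by_parser.items()}

for parser, value in recomputed.items():
    status = "matches" if value == summary_avg_f1[parser] else "differs from"
    print(f"{parser}: recomputed {value}% {status} summary {summary_avg_f1[parser]}%")
```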
Key Observations
Model Insights
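- Mistral Large 3 achieves the highest recall with every parser (15.2% to 21.9%), so it finds the most gold entities
- Llama 3.1 8B offers the best cost/performance: its F1 is within 0.3 points of the best model with pymupdf and markitdown (2.5 points behind with docling), at 2× to 4× lower latency than Mistral Large 3 and Claude Sonnet 4.6
- Claude Sonnet 4.6 has the highest reproducibility with pymupdf (97.7%) but 4× to 5× the latency of Nova Micro; with docling and markitdown, Llama 3.3 70B and Llama 3.1 8B respectively reach 100.0%
- Nova Micro is the fastest model with every parser (946ms to 998ms) and also has the highest precision in each table (35.3% to 47.2%)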
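The latency comparisons above can be checked directly against the reported numbers. The sketch below uses the latencies from the first Results by Parser table; the helper name `slowdown` is introduced here for illustration.

```python
# Latency in milliseconds, copied from the first Results by Parser table.
latency_ms = {
    "Mistral Large 3": 2531,
    "Llama 3.1 8B": 1224,
    "Nova Micro": 998,
    "Nova Pro": 1892,
    "Nova Lite": 1485,
    "Llama 3.3 70B": 1951,
    "Claude Sonnet 4.6": 4604,
}


def slowdown(model, baseline):
    """How many times slower `model` is than `baseline`."""
    return latency_ms[model] / latency_ms[baseline]


fastest = min(latency_ms, key=latency_ms.get)
print(f"fastest model: {fastest} ({latency_ms[fastest]}ms)")

for model in ("Mistral Large 3", "Claude Sonnet 4.6"):
    print(f"{model} vs Llama 3.1 8B: {slowdown(model, 'Llama 3.1 8B'):.2f}x")
print(f"Claude Sonnet 4.6 vs {fastest}: {slowdown('Claude Sonnet 4.6', fastest):.2f}x")
```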
Recall Ceiling
Recall is capped because Gary's gold labels include proteins from the full paper body, figures, and supplementary data, and not all of them appear in the extractable main text even with full-text parsing. A paper with 27 gold proteins may mention only 7 in extractable text, in which case recall cannot exceed 7/27 ≈ 26% regardless of parser or model.
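The ceiling also bounds F1. As a worked example for the 27-versus-7 case, the sketch below computes the best achievable recall and the resulting F1 under perfect precision; the function names are introduced here for illustration.

```python
def recall_ceiling(gold_total, gold_in_text):
    """Best possible recall when only some gold labels appear in the text."""
    return gold_in_text / gold_total


def f1_ceiling(max_recall, precision=1.0):
    """F1 (harmonic mean) at the recall ceiling for a given precision."""
    return 2 * precision * max_recall / (precision + max_recall)


# 27 gold proteins, of which only 7 are present in extractable text.
max_recall = recall_ceiling(27, 7)
best_f1 = f1_ceiling(max_recall)                   # perfect precision
f1_at_40 = f1_ceiling(max_recall, precision=0.40)  # precision near the pymupdf average

print(f"recall ceiling: {max_recall:.1%}")
print(f"F1 ceiling at 100% precision: {best_f1:.1%}")
print(f"F1 ceiling at 40% precision:  {f1_at_40:.1%}")
```

Even a model that extracts every recoverable protein with no false positives would score about 41% F1 on such a paper, which helps put the absolute F1 values in the tables in context.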
How to Run
# Interactive mode (prompts for all options)
uv run python -m evaluation
# Direct CLI — all models, all parsers, 20 papers, 10 runs each
uv run python -m evaluation.benchmark -n 20 -r 10 -m all -p all --full-text
# Single parser, single model (quick test)
uv run python -m evaluation.benchmark -n 5 -r 3 -m nova-pro -p pymupdf --full-text