Gold Standard Benchmark (Parser × Model)

Evaluates the full extraction pipeline against Gary's curated protein/disease labels from PUBLICATIONS_SUMMARIES.

Interactive Report (charts + per-paper detail) →

Methodology

  1. Gold standard: Papers sorted by label richness from Gary's CSV. Each has manually annotated disease areas and protein markers.
  2. Paper fetch: Full text downloaded via article link (PDF preferred). Paywalled sources excluded. All parsers operate on the same cached raw file.
  3. Parsing: Each parser (markitdown, pymupdf, docling) converts the cached PDF/HTML to text independently.
  4. Extraction: Each LLM extracts proteins and diseases from the parsed text. Every extraction is run 10× per paper to measure reproducibility.
  5. Scoring: Extracted entities compared against gold labels (case-insensitive match).
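
The scoring step can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function names `normalize` and `match_entities` and the sample labels are hypothetical, and trimming whitespace is an assumption on top of the case-insensitive match described above.

```python
def normalize(label: str) -> str:
    """Lowercase and trim a label so comparison ignores case and stray spaces."""
    return label.strip().casefold()


def match_entities(extracted: list[str], gold: list[str]) -> set[str]:
    """Return the normalized labels that appear in both lists."""
    extracted_set = {normalize(e) for e in extracted}
    gold_set = {normalize(g) for g in gold}
    return extracted_set & gold_set


# Hypothetical labels, for illustration only
hits = match_entities(["IL-6", "tnf ", "CRP"], ["il-6", "TNF", "GDF15"])
print(sorted(hits))
```

Exact matching after normalization is strict: aliases or spelling variants of the same protein would count as misses unless they are mapped to a common form first.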

Metric Definitions

  • Precision = |extracted ∩ gold| / |extracted| — "of what we found, how much was correct?"
  • Recall = |extracted ∩ gold| / |gold| — "of what Gary listed, how much did we find?"
  • F1 = harmonic mean of precision and recall
  • Reproducibility = avg pairwise Jaccard across 10 runs — "how consistent is the model?"
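
The definitions above translate directly into set operations. The sketch below assumes labels are already normalized; the handling of empty sets (score 0.0 for precision/recall, 1.0 for the Jaccard of two empty runs) is an assumption, since the source does not say how those cases are treated.

```python
from itertools import combinations


def precision_recall_f1(extracted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one extraction against one gold label set."""
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def jaccard(a: set[str], b: set[str]) -> float:
    """|a ∩ b| / |a ∪ b|; two empty sets are treated as identical (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def reproducibility(runs: list[set[str]]) -> float:
    """Average Jaccard similarity over every unordered pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data, for illustration only
p, r, f1 = precision_recall_f1({"il-6", "tnf", "crp"}, {"il-6", "tnf", "gdf15", "ngal"})
repro = reproducibility([{"il-6", "tnf"}, {"il-6", "tnf"}, {"il-6"}])
```

With 10 runs per paper, `reproducibility` averages over 45 pairs.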

Parser Comparison Summary

| Parser | Avg F1 | Avg Precision | Avg Recall | Avg Reproducibility | Best For |
|---|---|---|---|---|---|
| pymupdf | 21.1% | 40.0% | 17.9% | 92.8% | Best overall accuracy (PDF text extraction) |
| docling | 16.3% | 34.6% | 14.2% | 91.1% | Structured documents, tables |
| markitdown | 12.5% | 28.5% | 10.8% | 92.4% | HTML-heavy sources, speed |

Key Finding

PyMuPDF wins on accuracy: direct PDF text extraction preserves protein names better than HTML-to-markdown conversion (markitdown) or document understanding (docling). Its average F1 of 21.1% is roughly 70% higher in relative terms than markitdown's 12.5%, a gap of 8.6 percentage points.
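
The improvement figure is a relative one, which is easy to confuse with an absolute gap in percentage points. A short check using the average F1 values from the summary table (the helper name `relative_improvement` is illustrative):

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Average F1 per parser, in percent, from the Parser Comparison Summary
pymupdf_f1 = 21.1
docling_f1 = 16.3
markitdown_f1 = 12.5

rel_vs_markitdown = relative_improvement(pymupdf_f1, markitdown_f1)
rel_vs_docling = relative_improvement(pymupdf_f1, docling_f1)
abs_gap_vs_markitdown = pymupdf_f1 - markitdown_f1

print(f"vs markitdown: {rel_vs_markitdown:.1%} relative, {abs_gap_vs_markitdown:.1f} points absolute")
print(f"vs docling:    {rel_vs_docling:.1%} relative")
```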


Results by Parser

pymupdf

| Model | Precision | Recall | F1 | Reproducibility | Latency |
|---|---|---|---|---|---|
| Mistral Large 3 | 37.8% | 21.4% | 22.0% | 84.7% | 2531ms |
| Llama 3.1 8B | 41.0% | 17.6% | 21.8% | 94.9% | 1224ms |
| Nova Micro | 47.2% | 17.5% | 21.6% | 93.3% | 998ms |
| Nova Pro | 41.2% | 18.2% | 21.2% | 86.6% | 1892ms |
| Nova Lite | 44.1% | 17.1% | 20.7% | 95.5% | 1485ms |
| Llama 3.3 70B | 34.5% | 16.7% | 20.5% | 97.2% | 1951ms |
| Claude Sonnet 4.6 | 34.0% | 16.9% | 20.0% | 97.7% | 4604ms |

docling

| Model | Precision | Recall | F1 | Reproducibility | Latency |
|---|---|---|---|---|---|
| Mistral Large 3 | 31.7% | 21.9% | 19.2% | 78.1% | 3077ms |
| Claude Sonnet 4.6 | 28.6% | 16.7% | 17.1% | 92.6% | 4837ms |
| Llama 3.1 8B | 38.5% | 12.5% | 16.7% | 98.2% | 1179ms |
| Nova Pro | 35.6% | 13.0% | 16.2% | 84.6% | 1669ms |
| Nova Micro | 41.5% | 11.8% | 15.8% | 97.4% | 946ms |
| Nova Lite | 38.5% | 12.0% | 15.1% | 86.7% | 1327ms |
| Llama 3.3 70B | 27.7% | 11.7% | 14.2% | 100.0% | 1783ms |

markitdown

| Model | Precision | Recall | F1 | Reproducibility | Latency |
|---|---|---|---|---|---|
| Mistral Large 3 | 24.5% | 15.2% | 13.8% | 77.9% | 2428ms |
| Llama 3.1 8B | 32.5% | 10.2% | 13.5% | 100.0% | 1137ms |
| Claude Sonnet 4.6 | 24.9% | 12.0% | 13.2% | 93.0% | 4127ms |
| Nova Micro | 35.3% | 9.5% | 12.4% | 91.7% | 955ms |
| Llama 3.3 70B | 22.5% | 9.4% | 11.9% | 96.1% | 1732ms |
| Nova Lite | 30.8% | 9.7% | 11.5% | 92.4% | 1324ms |
| Nova Pro | 29.1% | 9.7% | 11.4% | 96.0% | 1497ms |
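
The per-parser averages in the summary table appear to be plain means over the seven models. Assuming the three result tables correspond, in order, to pymupdf, docling, and markitdown, averaging the F1 column of each reproduces the summary figures:

```python
from statistics import mean

# F1 per model (percent), copied from the three result tables in order.
# The table-to-parser assignment is an assumption checked by this calculation.
f1_by_parser = {
    "pymupdf": [22.0, 21.8, 21.6, 21.2, 20.7, 20.5, 20.0],
    "docling": [19.2, 17.1, 16.7, 16.2, 15.8, 15.1, 14.2],
    "markitdown": [13.8, 13.5, 13.2, 12.4, 11.9, 11.5, 11.4],
}

avg_f1 = {parser: round(mean(values), 1) for parser, values in f1_by_parser.items()}

for parser, value in avg_f1.items():
    print(f"{parser}: {value}%")
```

The results match the 21.1%, 16.3%, and 12.5% reported in the Parser Comparison Summary.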

Key Observations

Model Insights

  • Mistral Large 3 consistently achieves highest recall across all parsers — it finds the most proteins
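  • Llama 3.1 8B offers the best latency/accuracy trade-off: its F1 is within 0.3 points of the top model on pymupdf and markitdown, at roughly half the latency of Mistral Large 3 and about a quarter of Claude Sonnet 4.6's
  • Claude Sonnet 4.6 has the highest reproducibility on pymupdf (97.7%), although Llama 3.3 70B on docling and Llama 3.1 8B on markitdown both reach 100.0%; its latency is roughly 4× that of the fastest models
  • Nova Micro is the fastest model on every parser (946 to 998ms) and also has the highest precision on each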

Recall Ceiling

Recall is capped because Gary's gold labels draw on the full paper body, figures, and supplementary data, and not all of those proteins appear in the text a parser can extract, even with full-text parsing. A paper with 27 gold proteins may mention only 7 in extractable text.
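
To make the ceiling concrete, the 27-versus-7 example can be worked through directly. The sketch below assumes a perfect extractor (precision 1.0) to get the best possible F1; the function names are illustrative.

```python
def recall_ceiling(gold_total: int, gold_in_text: int) -> float:
    """Best achievable recall when only some gold labels occur in the parsed text."""
    return gold_in_text / gold_total


def f1_ceiling(recall_cap: float) -> float:
    """Best achievable F1 at that recall, assuming perfect precision (1.0)."""
    return 2 * recall_cap / (1 + recall_cap)


cap = recall_ceiling(gold_total=27, gold_in_text=7)
best_f1 = f1_ceiling(cap)

print(f"max recall: {cap:.1%}, max F1: {best_f1:.1%}")
```

For such a paper, recall cannot exceed about 26% and F1 cannot exceed about 41%, whatever the parser or model.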

How to Run

```bash
# Interactive mode (prompts for all options)
uv run python -m evaluation

# Direct CLI: all models, all parsers, 20 papers, 10 runs each
uv run python -m evaluation.benchmark -n 20 -r 10 -m all -p all --full-text

# Single parser, single model (quick test)
uv run python -m evaluation.benchmark -n 5 -r 3 -m nova-pro -p pymupdf --full-text
```