Olink RAG — Benchmark Report

Gold Standard: Gary's curated PUBLICATIONS_SUMMARIES CSV. Each paper has manually annotated disease areas and protein/marker names — these are the ground truth labels we measure against.

Pipeline:

Paper Selection — Papers are sorted by label richness (most annotated proteins and diseases first). Paywalled sources are excluded. A candidate pool of 3× the target count is loaded, then filtered down to the N papers whose full text was fetched successfully.
PDF Parsing — Each parser (markitdown, pymupdf, docling) converts the same cached raw file (PDF or HTML) to text. All parsers see identical source content.
LLM Extraction — Each model extracts proteins and diseases from the parsed text. Extraction is run N times per paper to measure reproducibility.
Scoring — Extracted entities are compared against the gold labels using a case-insensitive string match.
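
The paper selection step above can be sketched roughly as follows. This is a minimal illustration, not the report's actual code: the Paper record, its field names, the select_papers function, and the fetch_full_text callback are all hypothetical, and "label richness" is assumed to mean the count of gold proteins plus gold diseases.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Paper:
    # Hypothetical record; field names are illustrative, not the CSV's real columns.
    paper_id: str
    proteins: list[str] = field(default_factory=list)
    diseases: list[str] = field(default_factory=list)
    paywalled: bool = False


def select_papers(
    papers: list[Paper],
    target: int,
    fetch_full_text: Callable[[Paper], Optional[str]],
    pool_factor: int = 3,
) -> list[tuple[Paper, str]]:
    """Return up to `target` (paper, full_text) pairs, richest labels first."""
    # Drop paywalled sources before ranking.
    candidates = [p for p in papers if not p.paywalled]
    # Assumed richness score: number of gold proteins plus gold diseases.
    candidates.sort(key=lambda p: len(p.proteins) + len(p.diseases), reverse=True)
    # Load a pool larger than the target so failed fetches can be absorbed.
    pool = candidates[: pool_factor * target]

    selected = []
    for paper in pool:
        text = fetch_full_text(paper)
        if text:  # None or an empty string counts as a failed fetch
            selected.append((paper, text))
        if len(selected) == target:
            break
    return selected
```

Oversampling the pool by a fixed factor keeps the final set close to the target size even when some fetches fail, at the cost of a few extra fetch attempts.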

Metrics:

Protein Precision = |extracted ∩ gold| / |extracted| — "of what we found, how much was correct?"
Protein Recall = |extracted ∩ gold| / |gold| — "of what Gary listed, how much did we find?"
Protein F1 = 2 · Precision · Recall / (Precision + Recall) — harmonic mean of precision and recall
Disease Recall = |extracted ∩ gold| / |gold| over disease area labels — same as protein recall, but for diseases
Reproducibility (Jaccard) = average pairwise Jaccard similarity, |A ∩ B| / |A ∪ B|, across the N runs of the same paper and model — "how consistent is the model?"
Cross-Model Agreement = Jaccard between different models on same paper — "do models agree?"
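
The metrics above can be computed with plain set operations. The sketch below is illustrative only: the function names are hypothetical, stripping whitespace in addition to lowercasing is an assumption (the report only states a case-insensitive match), and returning 1.0 for two empty sets or a single run is an assumed convention.

```python
from itertools import combinations


def normalize(labels):
    """Lowercase labels for case-insensitive matching (whitespace strip is an assumption)."""
    return {label.strip().lower() for label in labels if label.strip()}


def precision_recall_f1(extracted, gold):
    """Precision, recall and F1 of extracted labels against gold labels."""
    ext, ref = normalize(extracted), normalize(gold)
    hits = len(ext & ref)
    precision = hits / len(ext) if ext else 0.0
    recall = hits / len(ref) if ref else 0.0
    # Harmonic mean; defined as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| on normalized label sets."""
    a, b = normalize(a), normalize(b)
    union = a | b
    # Assumed convention: two empty sets count as identical.
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(runs):
    """Average Jaccard over all unordered pairs of runs (reproducibility)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # assumed: a single run is trivially consistent
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Cross-model agreement uses the same jaccard function, applied to the outputs of two different models on the same paper rather than to repeated runs of one model.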

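Note: We typically extract MORE entities than Gary listed (he noted highlights, not exhaustive lists). Extractions that are correct but absent from his list still count against precision, so reduced precision alongside moderate recall is expected, and the reported precision understates true accuracy. The recall ceiling depends on whether the gold proteins actually appear in the paper text.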
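
One way to estimate the recall ceiling mentioned in the note is to check which gold labels occur anywhere in the parsed text. The helper below is a hypothetical sketch, not part of the benchmark: it uses a case-insensitive substring test, so it will miss aliases (for example a hyphenated spelling of the same protein) and may over-count short names that occur inside longer ones.

```python
def recall_ceiling(gold_labels, paper_text):
    """Return (ceiling, missing): the share of gold labels found in the text,
    and the set of gold labels that never appear (case-insensitive substring)."""
    text = paper_text.lower()
    gold = {g.strip().lower() for g in gold_labels if g.strip()}
    if not gold:
        return 0.0, set()
    # A gold label absent from the text cannot be recovered by any extractor.
    missing = {g for g in gold if g not in text}
    return 1 - len(missing) / len(gold), missing
```

Comparing measured recall against this ceiling separates labels the model missed from labels the parser or source text never contained.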


📐 Parser Comparison

📄 Gold Standard Papers

🔬 Per-Parser Model Detail

Model Comparison

Cross-Model Agreement

Per-Paper Extraction Detail