🧪 Olink RAG: Benchmark Report

KG extraction accuracy & reproducibility vs gold standard

📖 Methodology

Gold Standard: Gary's curated PUBLICATIONS_SUMMARIES CSV. Each paper has manually annotated disease areas and protein/marker names; these are the ground-truth labels we measure against.

Pipeline:

  1. Paper Selection: Papers are sorted by label richness (most proteins + diseases first). Paywalled sources are excluded. A candidate pool of 3× the target count is loaded, then filtered down to N papers whose full text was fetched successfully.
  2. PDF Parsing: Each parser (markitdown, pymupdf, docling) converts the same cached raw file (PDF or HTML) to text, so all parsers see identical source content.
  3. LLM Extraction: Each model extracts proteins and diseases from the parsed text. Extraction is run N times per paper to measure reproducibility.
  4. Scoring: Extracted entities are compared against the gold labels using a case-insensitive string match.
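
The selection and scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the `Paper` record, its field names, the `fetch_ok` callback, and the sample labels are all hypothetical, and stripping whitespace before matching is an assumption beyond the stated case-insensitive comparison.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record; field names are illustrative, not the CSV's columns.
    paper_id: str
    proteins: list = field(default_factory=list)
    diseases: list = field(default_factory=list)
    paywalled: bool = False


def select_papers(papers, target, fetch_ok, pool_factor=3):
    """Rank by label richness, load a pool of pool_factor * target,
    and keep at most `target` papers whose full text could be fetched."""
    open_access = [p for p in papers if not p.paywalled]
    ranked = sorted(
        open_access,
        key=lambda p: len(p.proteins) + len(p.diseases),
        reverse=True,
    )
    pool = ranked[: pool_factor * target]
    fetched = [p for p in pool if fetch_ok(p)]
    return fetched[:target]


def normalize(label):
    # Case-insensitive match; trimming whitespace is an added assumption.
    return label.strip().casefold()


def match(extracted, gold):
    """Return the normalized labels that appear in both lists."""
    return {normalize(e) for e in extracted} & {normalize(g) for g in gold}


# Made-up sample data, for illustration only.
papers = [
    Paper("a", ["IL6", "TNF"], ["sepsis"]),
    Paper("b", ["CRP"], []),
    Paper("c", ["IL6", "CRP", "TNF"], ["asthma", "COPD"], paywalled=True),
    Paper("d", ["IL8", "IL10"], ["IBD", "RA"]),
]
# Pretend the full-text fetch fails for paper "d".
selected = select_papers(papers, target=1, fetch_ok=lambda p: p.paper_id != "d")
print([p.paper_id for p in selected])
print(sorted(match(["il6", "Tnf ", "CXCL10"], ["IL6", "TNF"])))
```

Loading a pool larger than the target gives the filter room to drop papers whose fetch fails while still reaching the target count in most cases.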

Metrics:

  • Protein Precision = |extracted ∩ gold| / |extracted| ("of what we found, how much was correct?")
  • Protein Recall = |extracted ∩ gold| / |gold| ("of what Gary listed, how much did we find?")
  • Protein F1 = harmonic mean of precision and recall
  • Disease Recall = same as protein recall but for disease area labels
  • Reproducibility (Jaccard) = average pairwise Jaccard similarity across N runs of the same paper + model ("how consistent is the model?")
  • Cross-Model Agreement = Jaccard similarity between different models on the same paper ("do models agree?")
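
As a rough sketch of how the metrics above could be computed over sets of normalized labels (function names are illustrative, and the handling of empty sets is an assumption, since the report does not say how those cases are scored):

```python
from itertools import combinations


def precision_recall_f1(extracted, gold):
    """Set-based precision, recall, and F1.
    Empty denominators yield 0.0 (an assumption, not stated in the report)."""
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|; two empty sets count as identical (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def reproducibility(runs):
    """Average pairwise Jaccard across repeated runs of one paper + model."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run has nothing to disagree with (assumption)
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up labels, for illustration only.
extracted = {"il6", "tnf", "crp", "il8"}
gold = {"il6", "tnf", "il10"}
p, r, f1 = precision_recall_f1(extracted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571

runs = [{"il6", "tnf"}, {"il6", "tnf"}, {"il6"}]
print(round(reproducibility(runs), 3))  # 0.667
```

Cross-model agreement uses the same `jaccard` function, applied to the entity sets two different models produce for the same paper.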

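Note: We typically extract more entities than Gary listed (he noted highlights, not exhaustive lists). Entities that are correct but absent from the gold list still count against precision, so measured precision understates true precision and recall is the more informative figure. The recall ceiling depends on whether the gold proteins actually appear in the paper text.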
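
One way to estimate the recall ceiling for a paper is to check which gold proteins occur literally in the parsed text. The sketch below is an assumption about how such a check might look, not part of the described pipeline: `recall_ceiling` is a hypothetical helper, and whole-token matching will miss synonyms and alternative spellings (for example "IL-6" versus "IL6").

```python
import re


def recall_ceiling(gold_proteins, paper_text):
    """Return (fraction of gold labels found in the text, labels not found).
    Matching is case-insensitive and requires whole-token occurrences."""
    text = paper_text.casefold()
    gold = {g.strip().casefold() for g in gold_proteins if g.strip()}
    if not gold:
        return 0.0, set()
    present = set()
    for label in gold:
        # Lookarounds stop "il6" from matching inside "il6r".
        pattern = r"(?<!\w)" + re.escape(label) + r"(?!\w)"
        if re.search(pattern, text):
            present.add(label)
    return len(present) / len(gold), gold - present


# Made-up text and labels, for illustration only.
text = "Plasma IL6 and TNF were elevated; IL6R was unchanged."
ceiling, missing = recall_ceiling(["IL6", "TNF", "CXCL10"], text)
print(round(ceiling, 3), sorted(missing))  # 0.667 ['cxcl10']
```

A gold label reported as missing here cannot be recovered by any extractor working from that parsed text alone, which separates parser or source gaps from model misses.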


📄 Gold Standard Papers


🔬 Per-Parser Model Detail

Model Comparison

Per-Paper Extraction Detail