Gold Standard: Gary's curated PUBLICATIONS_SUMMARIES CSV. Each paper has manually annotated disease areas and protein/marker names; these are the ground-truth labels we measure against.
Pipeline:
- Paper Selection: Papers are sorted by label richness (most proteins + diseases first), and paywalled sources are excluded. A pool of 3× the target count is loaded, then filtered to the N papers whose full text was fetched successfully.
- PDF Parsing: Each parser (markitdown, pymupdf, docling) converts the same cached raw file (PDF or HTML) to text. All parsers see identical source content.
- LLM Extraction: Each model extracts proteins and diseases from the parsed text. Run N times per paper to measure reproducibility.
- Scoring: Extracted entities compared against gold labels (case-insensitive string match).
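The paper selection step above could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the record fields (`proteins`, `diseases`, `paywalled`), the `fetch_ok` callback, and the definition of label richness as "number of gold proteins plus number of gold diseases" are all hypothetical.

```python
from typing import Callable


def select_papers(
    papers: list[dict],
    target: int,
    fetch_ok: Callable[[dict], bool],
    pool_factor: int = 3,
) -> list[dict]:
    """Pick up to `target` papers, richest labels first (illustrative only)."""
    # Exclude paywalled sources before ranking.
    candidates = [p for p in papers if not p.get("paywalled", False)]
    # Assumed richness measure: count of gold proteins plus gold diseases.
    candidates.sort(
        key=lambda p: len(p["proteins"]) + len(p["diseases"]),
        reverse=True,
    )
    # Load a pool of pool_factor x target, then keep only papers whose
    # full-text fetch succeeded (fetch_ok is a hypothetical callback).
    pool = candidates[: pool_factor * target]
    fetched = [p for p in pool if fetch_ok(p)]
    return fetched[:target]
```

Loading a pool larger than the target leaves headroom for failed fetches, so the final set can still reach N papers without re-ranking.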
Metrics:
- Protein Precision = |extracted ∩ gold| / |extracted|: "of what we found, how much was correct?"
- Protein Recall = |extracted ∩ gold| / |gold|: "of what Gary listed, how much did we find?"
- Protein F1 = harmonic mean of precision and recall
- Disease Recall = same as protein recall but for disease area labels
- Reproducibility (Jaccard) = average pairwise Jaccard similarity across the N runs of the same paper and model: "how consistent is the model?"
- Cross-Model Agreement = Jaccard similarity between different models on the same paper: "do models agree?"
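A minimal sketch of the metrics listed above, assuming entities arrive as plain lists of strings. The function names are hypothetical, and two choices go beyond what the text specifies: whitespace is stripped in addition to lower-casing, and two empty sets are given a Jaccard similarity of 1.0.

```python
from itertools import combinations


def normalize(entities):
    """Lower-case and strip names so matching is case-insensitive."""
    return {e.strip().lower() for e in entities if e.strip()}


def precision_recall_f1(extracted, gold):
    """Set-based precision, recall, and F1 against the gold labels."""
    ext, gld = normalize(extracted), normalize(gold)
    hits = len(ext & gld)
    precision = hits / len(ext) if ext else 0.0
    recall = hits / len(gld) if gld else 0.0
    # F1 is the harmonic mean; define it as 0 when nothing matched.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def jaccard(a, b):
    """|a ∩ b| / |a ∪ b| on normalized sets."""
    a, b = normalize(a), normalize(b)
    union = a | b
    # Assumption: two empty extractions count as identical.
    return len(a & b) / len(union) if union else 1.0


def reproducibility(runs):
    """Mean pairwise Jaccard over repeated runs of one paper and model."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Cross-model agreement can reuse `jaccard` directly by passing the entity lists from two different models for the same paper. Disease recall is the second value returned by `precision_recall_f1` when called with disease labels.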
Note: We typically extract MORE entities than Gary listed (he noted highlights, not exhaustive lists). High precision + moderate recall is expected. Recall ceiling depends on whether gold proteins actually appear in the paper text.
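One way to estimate the recall ceiling mentioned in the note is to check how many gold labels occur anywhere in the parsed text. The sketch below is an assumption about how that could be done, not part of the described pipeline; `recall_ceiling` is a hypothetical helper, and it uses plain substring matching, which over-counts short names (for example, "TNF" also matches inside "TNFR1") and misses synonyms.

```python
def recall_ceiling(gold, paper_text):
    """Fraction of gold labels found (case-insensitively) in the text."""
    text = paper_text.lower()
    labels = {g.strip().lower() for g in gold if g.strip()}
    if not labels:
        return None  # undefined when the paper has no gold labels
    # Naive substring check; an assumption, see the caveats above.
    found = {g for g in labels if g in text}
    return len(found) / len(labels)
```

A measured recall close to this value would suggest the model found nearly everything that was findable in the text, while a large gap points at extraction misses rather than missing source content.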