Preprint Integration
Integrating preprint papers from bioRxiv and medRxiv with version tracking and publication linking.
Overview
Preprint integration gives access to research before it appears in a peer-reviewed journal, often 6-12 months earlier. The system fetches preprints from both bioRxiv (biological sciences) and medRxiv (health sciences) and tracks every posted version.
Features
- Unified API — single interface for bioRxiv and medRxiv
- Version tracking — all preprint versions stored with timestamps
- Publication linking — automatic linking to published PMIDs when available
- Async processing — non-blocking I/O for efficient parallel fetching
- Rate limiting — respects API limits with exponential backoff
- Deduplication — DOI-based detection of already-ingested preprints
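The async processing and rate limiting features above can be illustrated with a minimal sketch. It is not the pipeline's actual client: `fetch_with_backoff`, `fetch_all`, and their parameters (`max_retries`, `base_delay`, `max_concurrent`) are hypothetical names, and `fetch` stands in for whatever coroutine performs the real HTTP request.

```python
import asyncio
import random


async def fetch_with_backoff(fetch, url, max_retries=4, base_delay=0.5,
                             sleep=asyncio.sleep):
    """Call `fetch(url)`, retrying with exponential backoff plus jitter.

    `fetch` is a hypothetical coroutine function that raises on failure
    (for example, when the API answers with HTTP 429).
    """
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except Exception:
            # A real client would catch only retryable errors here.
            if attempt == max_retries:
                raise
            # The delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            # Up to 10% jitter keeps parallel tasks from retrying in lockstep.
            await sleep(delay + random.uniform(0, delay / 10))


async def fetch_all(fetch, urls, max_concurrent=5, **retry_kwargs):
    """Fetch many URLs in parallel while capping in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await fetch_with_backoff(fetch, url, **retry_kwargs)

    # gather returns results in the same order as `urls`.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

The semaphore bounds how many requests are in flight at once, while the backoff loop spaces out retries for any single request. Both limits are illustrative defaults, not values taken from the bioRxiv or medRxiv APIs.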
Quick Start
# bioRxiv preprints
uv run python ingest_main.py \
--source biorxiv --search-term "protein biomarker" \
--max-results 100 --database olink1 --service bedrock
# Using a queries file
uv run python ingest_main.py \
--source biorxiv --queries-file data/queries_biorxiv.txt \
--max-results 50 --database olink1 --service bedrock
How It Works
flowchart TD
A["bioRxiv/medRxiv API"] --> B["Fetch metadata<br/>(DOI, authors, dates)"]
B --> C{"Already<br/>ingested?"}
C -->|Yes| D["Skip (deduplicate)"]
C -->|No| E["Download PDF"]
E --> F["Multi-strategy extraction<br/>(Nougat → PyMuPDF → pdftotext)"]
F --> G["Chunk + LLM extraction"]
G --> H["Store with version metadata"]
H --> I{"Published<br/>version exists?"}
I -->|Yes| J["Link to PMID"]
I -->|No| K["Mark as preprint-only"]
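Two steps of the flow above, the deduplication check and the multi-strategy extraction, can be sketched as follows. This is an illustration under assumptions rather than the project's code: `normalize_doi`, `filter_new`, and `extract_text` are hypothetical helpers, records are assumed to be dicts with a `doi` key, and the extractors are passed in as plain callables standing in for Nougat, PyMuPDF, and pdftotext wrappers.

```python
from typing import Iterable


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common prefixes so duplicates compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def filter_new(records: Iterable[dict], seen_dois: set) -> list:
    """Return records whose DOI is not yet ingested; adds them to `seen_dois`."""
    fresh = []
    for record in records:
        key = normalize_doi(record["doi"])
        if key not in seen_dois:
            seen_dois.add(key)
            fresh.append(record)
    return fresh


def extract_text(pdf_path: str, extractors: list) -> tuple:
    """Try each (name, extractor) pair in order; return the first non-empty text.

    Each extractor is a hypothetical callable taking a PDF path and
    returning a string.
    """
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:
            # Record the failure and fall through to the next strategy.
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

Returning the extractor name alongside the text makes it possible to record which strategy produced each document. Whether the real deduplicator also keys on the preprint version is not stated here, so this sketch keys on the DOI alone.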
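Version Tracking
Each preprint node stores its version history. The following Cypher query lists the ten most recently posted bioRxiv preprints with their version and any linked PMID: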
MATCH (p:Publication {source: "biorxiv"})
RETURN p.doi, p.version, p.posted_date, p.published_pmid
ORDER BY p.posted_date DESC LIMIT 10;
When a preprint is later published in a journal, the system can link the preprint node to the published PMID, preserving the full provenance chain.
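As a rough illustration of version tracking and publication linking, the sketch below groups records by DOI and reports the latest version together with any linked PMID. The field names (`doi`, `version`, `posted_date`, `published_pmid`) mirror the properties in the query above; the record shape and the helper names `build_version_history` and `latest_with_link` are assumptions made for this example.

```python
from collections import defaultdict


def build_version_history(records):
    """Group preprint records by DOI and sort each group by version number."""
    history = defaultdict(list)
    for record in records:
        history[record["doi"]].append(record)
    for versions in history.values():
        # Versions may arrive as strings, so compare them numerically.
        versions.sort(key=lambda r: int(r["version"]))
    return dict(history)


def latest_with_link(versions):
    """Summarize a sorted version list: newest version plus any linked PMID."""
    latest = versions[-1]
    # Use the first PMID found on any version; None means preprint-only.
    pmid = next(
        (v["published_pmid"] for v in versions if v.get("published_pmid")),
        None,
    )
    return {
        "doi": latest["doi"],
        "version": int(latest["version"]),
        "posted_date": latest["posted_date"],
        "published_pmid": pmid,
    }
```

A `published_pmid` of `None` corresponds to the "preprint-only" branch of the flow, and a non-empty value corresponds to a preprint that has been linked to its journal publication.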
Parallel Ingestion
To run bulk bioRxiv ingestion alongside PubMed/PMC:
uv run python parallel_ingest.py \
-q data/queries_bulk.txt \
--biorxiv-queries-file data/queries_biorxiv.txt \
-d olink1 -s bedrock \
--biorxiv-max 100
Query format: bioRxiv uses simple keyword substring matching (not MeSH terms). Keep queries short and specific, for example "protein biomarker" or "cardiovascular proteomics".
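To see why short queries work better under substring matching, consider this minimal sketch. It assumes the match is case-insensitive and runs against the title and abstract; neither detail is confirmed by this page, and `matches_query` is a hypothetical helper.

```python
def matches_query(record: dict, query: str) -> bool:
    """Return True if the whole query appears as a substring of title or abstract."""
    haystack = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
    return query.strip().lower() in haystack


record = {
    "title": "Plasma Protein Biomarkers of Heart Failure",
    "abstract": "We profile circulating proteins in a clinical cohort.",
}

# The query must appear verbatim, so word order matters.
print(matches_query(record, "protein biomarker"))   # contiguous phrase in the title
print(matches_query(record, "biomarker protein"))   # same words, different order
```

Because the whole query has to appear as one contiguous string, every extra word narrows the results, and reordered or synonymous terms do not match at all.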
Key Files
| File | Role |
|---|---|
| pipeline/ingest/preprint_fetcher.py | Unified bioRxiv/medRxiv API client |
| pipeline/ingest/biorxiv_fetcher.py | bioRxiv-specific fetching logic |
| pipeline/ingest/biorxiv_ingestor.py | Full ingestion orchestration |
| pipeline/ingest/biorxiv_deduplicator.py | DOI-based deduplication |
| pipeline/processors/preprint_version_tracker.py | Version history management |