Preprint Integration

Integrating preprint papers from bioRxiv and medRxiv with version tracking and publication linking.

Overview

Preprint integration gives access to new research before peer-reviewed publication, often 6-12 months earlier. The system fetches preprints from both bioRxiv (biological sciences) and medRxiv (health sciences) and tracks every posted version.

Features

  • Unified API — single interface for bioRxiv and medRxiv
  • Version tracking — all preprint versions stored with timestamps
  • Publication linking — automatic linking to published PMIDs when available
  • Async processing — non-blocking I/O for efficient parallel fetching
  • Rate limiting — respects API limits with exponential backoff
  • Deduplication — DOI-based detection of already-ingested preprints
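
The async processing and rate limiting features above can be illustrated with a minimal sketch. This is not the pipeline's actual code: `fetch_with_backoff`, `fetch_all`, and every parameter name and default (`max_retries`, `base_delay`, `max_delay`, `concurrency`) are hypothetical, chosen only to show exponential backoff combined with bounded parallel fetching.

```python
import asyncio
import random


async def fetch_with_backoff(fetch, *, max_retries=5, base_delay=1.0,
                             max_delay=60.0, sleep=asyncio.sleep):
    """Await the zero-argument coroutine function `fetch`, retrying on failure.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...),
    is capped at max_delay, and gets up to 10% random jitter added.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(max_retries + 1):
        try:
            return await fetch()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            await sleep(delay + random.uniform(0, 0.1 * delay))


async def fetch_all(fetchers, *, concurrency=4, **retry_options):
    """Run many fetchers in parallel, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(fetch):
        async with semaphore:
            return await fetch_with_backoff(fetch, **retry_options)

    # gather preserves the order of the input fetchers in its result list
    return await asyncio.gather(*(guarded(f) for f in fetchers))
```

The jitter keeps parallel workers that fail together from retrying in lockstep, and the semaphore bounds how many requests are in flight at once.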

Quick Start

# bioRxiv preprints
uv run python ingest_main.py \
  --source biorxiv --search-term "protein biomarker" \
  --max-results 100 --database olink1 --service bedrock

# Using a queries file
uv run python ingest_main.py \
  --source biorxiv --queries-file data/queries_biorxiv.txt \
  --max-results 50 --database olink1 --service bedrock

How It Works

flowchart TD
    A["bioRxiv/medRxiv API"] --> B["Fetch metadata<br/>(DOI, authors, dates)"]
    B --> C{"Already<br/>ingested?"}
    C -->|Yes| D["Skip (deduplicate)"]
    C -->|No| E["Download PDF"]
    E --> F["Multi-strategy extraction<br/>(Nougat → PyMuPDF → pdftotext)"]
    F --> G["Chunk + LLM extraction"]
    G --> H["Store with version metadata"]
    H --> I{"Published<br/>version exists?"}
    I -->|Yes| J["Link to PMID"]
    I -->|No| K["Mark as preprint-only"]
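
Two steps of the flow above, the "already ingested?" check and the multi-strategy extraction, can be sketched as follows. This is a rough illustration under stated assumptions, not the project's implementation: `normalize_doi`, `filter_new`, and `extract_with_fallback` are hypothetical names, records are assumed to be dicts with a `doi` key, and the extractors are passed in as plain callables rather than calling Nougat, PyMuPDF, or pdftotext directly.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip a common resolver or scheme prefix."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def filter_new(records: Iterable[dict], ingested_dois: Set[str]) -> List[dict]:
    """Keep records whose DOI is neither already ingested nor repeated in this batch."""
    seen = {normalize_doi(d) for d in ingested_dois}
    fresh = []
    for record in records:
        key = normalize_doi(record["doi"])
        if key in seen:
            continue  # the "Skip (deduplicate)" branch
        seen.add(key)
        fresh.append(record)
    return fresh


def extract_with_fallback(
    pdf_path: str,
    strategies: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, extractor) in order; return the first non-empty result."""
    errors = []
    for name, extractor in strategies:
        try:
            text = extractor(pdf_path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue  # fall through to the next strategy
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all extraction strategies failed: " + "; ".join(errors))
```

Passing the strategies as an ordered list keeps the fallback order (Nougat, then PyMuPDF, then pdftotext) a matter of configuration, and returning the strategy name records which extractor produced the text.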

Version Tracking

Each preprint stores its version history. The most recently posted versions can be listed with a Cypher query:

MATCH (p:Publication {source: "biorxiv"})
RETURN p.doi, p.version, p.posted_date, p.published_pmid
ORDER BY p.posted_date DESC LIMIT 10;

When a preprint is later published in a journal, the system can link the preprint node to the published PMID, preserving the full provenance chain.
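
As a rough illustration of the idea, the sketch below models a version history and a later link to a published PMID. The field names `doi`, `source`, `version`, `posted_date`, and `published_pmid` mirror the properties in the query above; the classes `PreprintVersion` and `PreprintHistory` and their methods are hypothetical and say nothing about how `preprint_version_tracker.py` is actually written.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class PreprintVersion:
    version: int
    posted_date: date


@dataclass
class PreprintHistory:
    doi: str
    source: str  # e.g. "biorxiv" or "medrxiv"
    versions: List[PreprintVersion] = field(default_factory=list)
    published_pmid: Optional[str] = None

    def add_version(self, version: int, posted_date: date) -> bool:
        """Record a version; return False if that version is already stored."""
        if any(v.version == version for v in self.versions):
            return False
        self.versions.append(PreprintVersion(version, posted_date))
        self.versions.sort(key=lambda v: v.version)
        return True

    @property
    def latest(self) -> Optional[PreprintVersion]:
        """Highest-numbered stored version, or None if nothing is stored."""
        return self.versions[-1] if self.versions else None

    def link_publication(self, pmid: str) -> None:
        """Attach the PMID of the journal version; earlier versions are kept."""
        self.published_pmid = pmid

    @property
    def status(self) -> str:
        return "published" if self.published_pmid else "preprint-only"
```

Linking only sets `published_pmid` and leaves the stored versions untouched, which is what keeps the provenance chain from first posting to journal article intact.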

Parallel Ingestion

For bulk bioRxiv runs alongside PubMed/PMC:

uv run python parallel_ingest.py \
  -q data/queries_bulk.txt \
  --biorxiv-queries-file data/queries_biorxiv.txt \
  -d olink1 -s bedrock \
  --biorxiv-max 100

Query format

bioRxiv uses simple keyword substring matching (not MeSH terms). Keep queries short and specific: "protein biomarker", "cardiovascular proteomics".
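
To make the difference from MeSH-style searching concrete, here is a minimal sketch of phrase-level substring matching. It rests on assumptions the text above does not confirm: that matching is case-insensitive, that whitespace is collapsed, and that the title and abstract are the fields searched. `matches_query` is a hypothetical helper, not a function from the pipeline.

```python
def matches_query(query: str, title: str, abstract: str = "") -> bool:
    """Return True if the whole query phrase occurs in the title or abstract.

    Assumed behaviour: case-insensitive, whitespace-normalized substring
    matching. There is no stemming, synonym expansion, or term reordering.
    """
    needle = " ".join(query.lower().split())
    haystack = " ".join(f"{title} {abstract}".lower().split())
    return bool(needle) and needle in haystack
```

Under these assumptions, "protein biomarker" matches the title "Plasma Protein Biomarkers of Aging" but not "Biomarker Discovery for Proteins", because the words must appear adjacent and in the same order. That is why short, specific phrases tend to work better than long ones.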

Key Files

| File | Role |
|------|------|
| pipeline/ingest/preprint_fetcher.py | Unified bioRxiv/medRxiv API client |
| pipeline/ingest/biorxiv_fetcher.py | bioRxiv-specific fetching logic |
| pipeline/ingest/biorxiv_ingestor.py | Full ingestion orchestration |
| pipeline/ingest/biorxiv_deduplicator.py | DOI-based deduplication |
| pipeline/processors/preprint_version_tracker.py | Version history management |