PDF Extraction Pipeline

Multi-strategy PDF extraction with graceful fallback for scientific papers.

Overview

The extraction pipeline is a priority chain. Strategies are attempted in order, and the pipeline falls back to the next one when a strategy fails or its output does not pass quality validation:

flowchart LR
    PDF["PDF Input"]
    N["Nougat<br/><i>Complex papers</i>"]
    D["Docling<br/><i>Structured docs</i>"]
    P["PyMuPDF<br/><i>Fast extraction</i>"]
    T["pdftotext<br/><i>CLI fallback</i>"]
    PDF --> N --> D --> P --> T

Strategy	Best For	Speed	Quality
Nougat	Complex scientific papers, equations	Slow	Highest
Docling	Structured documents, tables	Medium	High
PyMuPDF	Simple PDFs, fast extraction	Fast	Good
pdftotext	Fallback when others fail	Fast	Basic

All strategies except PyMuPDF are optional. When an optional dependency is unavailable, the system degrades gracefully and relies on the remaining strategies.
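
One way to picture graceful degradation is a startup probe that checks which backends are installed. The sketch below is an illustration, not the project's code: the function name `available_strategies` is hypothetical, and the import names (`nougat`, `docling`, `fitz`) are assumptions about what each package installs.

```python
import importlib.util
import shutil

# Priority order from the flowchart above, paired with the module name each
# Python backend is ASSUMED to install (nougat-ocr -> nougat, PyMuPDF -> fitz).
PYTHON_BACKENDS = [("nougat", "nougat"), ("docling", "docling"), ("pymupdf", "fitz")]


def available_strategies() -> list[str]:
    """Return the strategies whose dependencies are present, in priority order."""
    found = [
        name
        for name, module in PYTHON_BACKENDS
        if importlib.util.find_spec(module) is not None  # None means not importable
    ]
    # pdftotext is an external binary shipped with poppler, so look it up on PATH.
    if shutil.which("pdftotext") is not None:
        found.append("pdftotext")
    return found


if __name__ == "__main__":
    print("Usable strategies:", available_strategies())
```

Probing with `find_spec` avoids importing heavy packages (and loading model weights) just to learn whether they exist.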

Installation

Core (included in pyproject.toml)

uv sync  # PyMuPDF included by default

Optional: Nougat (Recommended for Scientific Papers)

Facebook Research's learning-based PDF parser trained on scientific documents:

uv pip install nougat-ocr
# Requires ~4GB for model weights (downloaded on first use)

Optional: Docling

IBM's document understanding library:

uv pip install docling

Optional: pdftotext

# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils
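
Once poppler is installed, a CLI fallback can be as small as a guarded `subprocess` call. This is a minimal sketch under stated assumptions: the function name `extract_with_pdftotext` is hypothetical, and the 10-second default only mirrors the `pdf_timeout` value shown in the configuration section. The `-layout` flag and `-` (write to stdout) are standard pdftotext arguments.

```python
import shutil
import subprocess
from typing import Optional


def extract_with_pdftotext(pdf_path: str, timeout: float = 10.0) -> Optional[str]:
    """Return the text pdftotext produces, or None if the tool is missing or fails."""
    binary = shutil.which("pdftotext")
    if binary is None:
        return None  # poppler is not installed; let the caller move on
    try:
        # Passing "-" as the output file makes pdftotext write to stdout.
        result = subprocess.run(
            [binary, "-layout", pdf_path, "-"],
            capture_output=True,
            timeout=timeout,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None  # unreadable or missing file, encrypted PDF, etc.
    return result.stdout.decode("utf-8", errors="replace")
```

Returning `None` instead of raising keeps the caller's fallback logic uniform: a missing binary, a timeout, and a bad exit code all look the same.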

Usage

CLI

# Single PDF
uv run python ingest_main.py \
  --source pdf --pdf-files paper.pdf \
  --database olink1 --service bedrock

# Multiple PDFs
uv run python ingest_main.py \
  --source pdf --pdf-files paper1.pdf paper2.pdf paper3.pdf \
  --database olink1 --service bedrock

# Directory of PDFs
uv run python ingest_main.py \
  --source pdf --pdf-dir ./papers/ \
  --database olink1 --service bedrock

Configuration

Set options in code or via environment variables:

from pipeline.ingest.extraction_models import ExtractionConfig

config = ExtractionConfig(
    enable_nougat=True,             # Best for scientific papers
    enable_docling=True,            # Structured documents
    enable_pymupdf=True,            # Fast fallback
    enable_pdftotext=True,          # CLI fallback
    nougat_timeout=120,             # Seconds
    pdf_timeout=10,                 # Seconds for PyMuPDF/pdftotext
    remove_references=True,         # Strip References section
    remove_acknowledgements=True,   # Strip Acknowledgements section
    quality_min_length=100,         # Min chars for valid extraction
    pdf_cache_dir="/tmp/pdfs",      # Cache downloaded PDFs
    max_concurrent_extractions=5,   # Parallel extraction limit
)
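
To make the relationship between the `enable_*` flags and the priority chain concrete, here is a small sketch. `SketchConfig` and `enabled_chain` are hypothetical stand-ins, not the real `ExtractionConfig` API, and the assumption is that a disabled strategy is simply left out of the chain while the Overview's order is preserved.

```python
from dataclasses import dataclass

# Priority order taken from the Overview flowchart.
PRIORITY = ("nougat", "docling", "pymupdf", "pdftotext")


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for ExtractionConfig; only the enable flags are modeled."""

    enable_nougat: bool = True
    enable_docling: bool = True
    enable_pymupdf: bool = True
    enable_pdftotext: bool = True

    def enabled_chain(self) -> list[str]:
        # The order keywords are passed in does not matter; PRIORITY decides.
        return [name for name in PRIORITY if getattr(self, f"enable_{name}")]


if __name__ == "__main__":
    cfg = SketchConfig(enable_pymupdf=True, enable_nougat=False)
    print(cfg.enabled_chain())
```

Under these assumptions, disabling Nougat leaves a chain of Docling, PyMuPDF, and pdftotext, regardless of the order in which the flags were written.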

Quality Validation

Extracted text is validated against thresholds:

Minimum length: 100 characters (configurable)
Non-alphanumeric ratio: at most 50% of characters may be non-alphanumeric
Section removal: References and Acknowledgements stripped by default
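
The two thresholds above can be sketched as a single predicate. The function name `passes_quality` and its parameters are hypothetical, and one detail is an assumption the text does not settle: whitespace is ignored when computing the non-alphanumeric ratio, so ordinary spaces between words do not count against the text.

```python
def passes_quality(
    text: str,
    min_length: int = 100,
    max_non_alnum_ratio: float = 0.5,
) -> bool:
    """Check extracted text against the minimum-length and symbol-ratio thresholds."""
    stripped = text.strip()
    if len(stripped) < min_length:
        return False
    # ASSUMPTION: whitespace is excluded from the ratio calculation.
    visible = [ch for ch in stripped if not ch.isspace()]
    if not visible:
        return False
    non_alnum = sum(1 for ch in visible if not ch.isalnum())
    return non_alnum / len(visible) <= max_non_alnum_ratio
```

A high symbol ratio is a cheap signal for garbled output, such as font-encoding failures that turn a page into punctuation.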

If extraction fails quality checks, the next strategy in the chain is attempted.
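
The fallback behavior can be illustrated with a short loop over (name, callable) pairs. This is a sketch rather than the code in `content_extractor.py`: `run_chain`, the `Strategy` alias, and the choice to treat any exception as a failed attempt are all assumptions.

```python
from typing import Callable, Iterable, Optional

# A strategy takes a PDF path and returns extracted text (hypothetical signature).
Strategy = Callable[[str], str]


def run_chain(
    pdf_path: str,
    strategies: Iterable[tuple[str, Strategy]],
    is_valid: Callable[[str], bool],
) -> Optional[tuple[str, str]]:
    """Try each strategy in order; return (name, text) for the first valid result."""
    for name, extract in strategies:
        try:
            text = extract(pdf_path)
        except Exception:
            continue  # a crash or timeout counts as a failed attempt
        if is_valid(text):
            return name, text
        # Output failed validation, so fall through to the next strategy.
    return None  # every strategy failed


if __name__ == "__main__":
    def broken(path: str) -> str:
        raise RuntimeError("model weights missing")

    chain = [("nougat", broken), ("pymupdf", lambda path: "x" * 200)]
    print(run_chain("paper.pdf", chain, lambda t: len(t) >= 100))
```

Returning the strategy name alongside the text makes it easy to log which backend produced each document.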

Multimodal Processing

For PDFs with figures and tables, the multimodal processor extracts additional entities:

uv run python -m pipeline.processors.multimodal_processor \
  --pdf-dir ./papers/ --database olink1 --service bedrock

This uses vision models (Llama-3.2-11B-Vision via Bedrock) to:

Analyze protein structure diagrams
Extract entities from pathway charts
Interpret Western blots and microscopy images
Parse embedded tables

See Multimodal Processing for details.

Key Files

File	Role
`pipeline/ingest/extraction_models.py`	Config and data models
`pipeline/ingest/extraction_strategies.py`	Strategy implementations
`pipeline/ingest/content_extractor.py`	Strategy chain orchestration
`pipeline/ingest/docling_strategy.py`	Docling integration
`pipeline/processors/pdf_processor.py`	PDF-specific processing
`pipeline/processors/multimodal_processor.py`	Vision model extraction