Skip to content

PDF Extraction Pipeline

Multi-strategy PDF extraction with graceful fallback for scientific papers.

Overview

The extraction pipeline uses a priority chain — each strategy is attempted in order, falling back to the next on failure:

flowchart LR
    PDF["PDF Input"]
    N["Nougat<br/><i>Complex papers</i>"]
    D["Docling<br/><i>Structured docs</i>"]
    P["PyMuPDF<br/><i>Fast extraction</i>"]
    T["pdftotext<br/><i>CLI fallback</i>"]
    PDF --> N --> D --> P --> T
Strategy Best For Speed Quality
Nougat Complex scientific papers, equations Slow Highest
Docling Structured documents, tables Medium High
PyMuPDF Simple PDFs, fast extraction Fast Good
pdftotext Fallback when others fail Fast Basic

All strategies except PyMuPDF are optional — the system gracefully degrades when dependencies are unavailable.

Installation

Core (included in pyproject.toml)

uv sync  # PyMuPDF included by default

Facebook Research's learning-based PDF parser trained on scientific documents:

uv pip install nougat-ocr
# Requires ~4GB for model weights (downloaded on first use)

Optional: Docling

IBM's document understanding library:

uv pip install docling

Optional: pdftotext

# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

Usage

CLI

# Single PDF
uv run python ingest_main.py \
  --source pdf --pdf-files paper.pdf \
  --database olink1 --service bedrock

# Multiple PDFs
uv run python ingest_main.py \
  --source pdf --pdf-files paper1.pdf paper2.pdf paper3.pdf \
  --database olink1 --service bedrock

# Directory of PDFs
uv run python ingest_main.py \
  --source pdf --pdf-dir ./papers/ \
  --database olink1 --service bedrock

Configuration

In code or via environment:

from pipeline.ingest.extraction_models import ExtractionConfig

config = ExtractionConfig(
    enable_nougat=True,          # Best for scientific papers
    enable_pymupdf=True,         # Fast fallback
    enable_pdftotext=True,       # CLI fallback
    enable_docling=True,         # Structured documents
    nougat_timeout=120,          # Seconds
    pdf_timeout=10,              # Seconds for PyMuPDF/pdftotext
    remove_references=True,      # Strip References section
    remove_acknowledgements=True,# Strip Acknowledgements
    quality_min_length=100,      # Min chars for valid extraction
    pdf_cache_dir="/tmp/pdfs",   # Cache downloaded PDFs
    max_concurrent_extractions=5,# Parallel limit
)

Quality Validation

Extracted text is validated against thresholds:

  • Minimum length: 100 characters (configurable)
  • Non-alpha ratio: Max 50% non-alphanumeric characters
  • Section removal: References and Acknowledgements stripped by default

If extraction fails quality checks, the next strategy in the chain is attempted.

Multimodal Processing

For PDFs with figures and tables, the multimodal processor extracts additional entities:

uv run python -m pipeline.processors.multimodal_processor \
  --pdf-dir ./papers/ --database olink1 --service bedrock

This uses vision models (Llama-3.2-11B-Vision via Bedrock) to:

  • Analyze protein structure diagrams
  • Extract entities from pathway charts
  • Interpret Western blots and microscopy images
  • Parse embedded tables

See Multimodal Processing for details.

Key Files

File Role
pipeline/ingest/extraction_models.py Config and data models
pipeline/ingest/extraction_strategies.py Strategy implementations
pipeline/ingest/content_extractor.py Strategy chain orchestration
pipeline/ingest/docling_strategy.py Docling integration
pipeline/processors/pdf_processor.py PDF-specific processing
pipeline/processors/multimodal_processor.py Vision model extraction