PDF Extraction Pipeline
Multi-strategy PDF extraction with graceful fallback for scientific papers.
Overview
The extraction pipeline uses a priority chain. Each strategy is attempted in order, and the pipeline falls back to the next one on failure:
flowchart LR
PDF["PDF Input"]
N["Nougat<br/><i>Complex papers</i>"]
D["Docling<br/><i>Structured docs</i>"]
P["PyMuPDF<br/><i>Fast extraction</i>"]
T["pdftotext<br/><i>CLI fallback</i>"]
PDF --> N --> D --> P --> T
| Strategy | Best For | Speed | Quality |
|---|---|---|---|
| Nougat | Complex scientific papers, equations | Slow | Highest |
| Docling | Structured documents, tables | Medium | High |
| PyMuPDF | Simple PDFs, fast extraction | Fast | Good |
| pdftotext | Fallback when others fail | Fast | Basic |
All strategies except PyMuPDF are optional. When an optional dependency is unavailable, the system degrades gracefully and relies on the remaining strategies.
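The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual orchestration code: `extract_with_fallback`, the `Strategy` type, and the stub strategies are hypothetical names introduced here, and the real chain in `content_extractor.py` may differ.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a strategy takes a PDF path and returns extracted text.
Strategy = Callable[[str], str]


def extract_with_fallback(
    pdf_path: str,
    strategies: Sequence[Tuple[str, Strategy]],
    is_valid: Callable[[str], bool] = lambda text: bool(text.strip()),
) -> Tuple[Optional[str], Optional[str]]:
    """Try each strategy in priority order and return (strategy_name, text).

    Returns (None, None) if every strategy fails or yields invalid text.
    """
    for name, strategy in strategies:
        try:
            text = strategy(pdf_path)
        except Exception:
            # Any failure (missing dependency, timeout, parse error)
            # simply moves on to the next strategy in the chain.
            continue
        if is_valid(text):
            return name, text
    return None, None


# Stub strategies standing in for the real extractors (illustration only).
def broken_strategy(path: str) -> str:
    raise RuntimeError("model not installed")


def simple_strategy(path: str) -> str:
    return "Extracted text from " + path


name, text = extract_with_fallback(
    "paper.pdf",
    [("nougat", broken_strategy), ("pymupdf", simple_strategy)],
)
```

Here the first strategy raises, so the result comes from the second one. Passing a stricter `is_valid` callable would also let a strategy that returns poor text fall through to the next.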
Installation
Core (included in pyproject.toml)
Optional: Nougat (Recommended for Scientific Papers)
Facebook Research's learning-based PDF parser trained on scientific documents:
Optional: Docling
IBM's document understanding library:
Optional: pdftotext
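After installing any of the optional pieces, a quick check like the one below can show which backends the current environment can see. It is only a sketch: `available_backends` is a hypothetical helper, and the import names (`nougat`, `docling`, `fitz`) are assumptions about how the packages are exposed rather than something this project documents.

```python
import importlib.util
import shutil
from typing import Dict


def available_backends() -> Dict[str, bool]:
    """Report which extraction backends this environment can see."""
    # Assumed import names: nougat-ocr exposes `nougat`, Docling exposes
    # `docling`, and PyMuPDF is importable as `fitz`.
    modules = {"nougat": "nougat", "docling": "docling", "pymupdf": "fitz"}
    status = {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }
    # pdftotext is a command-line binary, so look for it on PATH instead.
    status["pdftotext"] = shutil.which("pdftotext") is not None
    return status


status = available_backends()
for backend, found in status.items():
    print(f"{backend}: {'available' if found else 'missing'}")
```

A backend reported as missing here would be one the chain skips.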
Usage
CLI
# Single PDF
uv run python ingest_main.py \
--source pdf --pdf-files paper.pdf \
--database olink1 --service bedrock
# Multiple PDFs
uv run python ingest_main.py \
--source pdf --pdf-files paper1.pdf paper2.pdf paper3.pdf \
--database olink1 --service bedrock
# Directory of PDFs
uv run python ingest_main.py \
--source pdf --pdf-dir ./papers/ \
--database olink1 --service bedrock
Configuration
In code or via environment:
from pipeline.ingest.extraction_models import ExtractionConfig

config = ExtractionConfig(
    enable_nougat=True,             # Best for scientific papers
    enable_pymupdf=True,            # Fast fallback
    enable_pdftotext=True,          # CLI fallback
    enable_docling=True,            # Structured documents
    nougat_timeout=120,             # Seconds
    pdf_timeout=10,                 # Seconds for PyMuPDF/pdftotext
    remove_references=True,         # Strip References section
    remove_acknowledgements=True,   # Strip Acknowledgements
    quality_min_length=100,         # Min chars for valid extraction
    pdf_cache_dir="/tmp/pdfs",      # Cache downloaded PDFs
    max_concurrent_extractions=5,   # Parallel limit
)
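To give a feel for what `remove_references` and `remove_acknowledgements` might do, here is a rough sketch of heading-based section stripping. It is not the pipeline's implementation: `remove_sections` and `HEADING_RE` are hypothetical, and the sketch assumes each section heading sits alone on its own line, which real papers do not always honour.

```python
import re
from typing import Iterable

# Assumption: a heading is a short line of letters and spaces,
# optionally numbered (e.g. "7. References").
HEADING_RE = re.compile(r"^\s*(?:\d+\.?\s+)?([A-Z][A-Za-z ]{2,40})\s*$")


def remove_sections(text: str, names: Iterable[str]) -> str:
    """Drop every line from a named heading up to the next heading-like line."""
    drop = {name.lower() for name in names}
    kept = []
    skipping = False
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A heading-like line decides whether the lines after it are kept.
            skipping = match.group(1).strip().lower() in drop
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


sample = (
    "Introduction\n"
    "Proteins matter.\n"
    "Acknowledgements\n"
    "We thank the lab.\n"
    "7. References\n"
    "[1] Smith et al. 2020.\n"
)
cleaned = remove_sections(sample, ["References", "Acknowledgements"])
```

With the sample above, only the Introduction heading and its sentence remain. Spelling variants such as "Acknowledgments" would need to be passed explicitly.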
Quality Validation
Extracted text is validated against thresholds:
- Minimum length: 100 characters (configurable)
- Non-alphanumeric ratio: at most 50% of characters may be non-alphanumeric
- Section removal: References and Acknowledgements stripped by default
If extraction fails quality checks, the next strategy in the chain is attempted.
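The two thresholds above could be checked along these lines. This is a sketch under stated assumptions: `passes_quality_checks` and its parameter names are hypothetical, and excluding whitespace from the ratio is a choice made here, not something the pipeline is known to do.

```python
def passes_quality_checks(
    text: str,
    min_length: int = 100,
    max_non_alnum_ratio: float = 0.5,
) -> bool:
    """Return True if the text meets the length and character-mix thresholds."""
    stripped = text.strip()
    if len(stripped) < min_length:
        return False
    # Assumption: whitespace is left out of the ratio so that ordinary
    # word spacing does not count against the text.
    chars = [c for c in stripped if not c.isspace()]
    if not chars:
        return False
    non_alnum = sum(1 for c in chars if not c.isalnum())
    return non_alnum / len(chars) <= max_non_alnum_ratio


good = passes_quality_checks("word " * 30)
too_short = passes_quality_checks("short")
garbled = passes_quality_checks("#$%^&*()" * 20)
```

A strategy whose output fails this kind of check would be treated like a failed extraction, so the chain moves on to the next strategy.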
Multimodal Processing
For PDFs with figures and tables, the multimodal processor extracts additional entities:
uv run python -m pipeline.processors.multimodal_processor \
--pdf-dir ./papers/ --database olink1 --service bedrock
This uses vision models (Llama-3.2-11B-Vision via Bedrock) to:
- Analyze protein structure diagrams
- Extract entities from pathway charts
- Interpret Western blots and microscopy images
- Parse embedded tables
See Multimodal Processing for details.
Key Files
| File | Role |
|---|---|
| pipeline/ingest/extraction_models.py | Config and data models |
| pipeline/ingest/extraction_strategies.py | Strategy implementations |
| pipeline/ingest/content_extractor.py | Strategy chain orchestration |
| pipeline/ingest/docling_strategy.py | Docling integration |
| pipeline/processors/pdf_processor.py | PDF-specific processing |
| pipeline/processors/multimodal_processor.py | Vision model extraction |