Multimodal Processing

Vision model integration for extracting entities from figures, tables, and diagrams in scientific PDFs.

Overview

Scientific papers carry critical information in non-text elements: pathway diagrams, Western blots, protein structure figures, and data tables. The multimodal processor uses vision models to extract structured entities from these elements.

Capabilities

| Element | Model | Extracts |
| --- | --- | --- |
| Pathway diagrams | Llama-3.2-11B-Vision (Bedrock) | Proteins, interactions, pathways |
| Protein structures | Llama-3.2-11B-Vision (Bedrock) | Domains, binding sites, modifications |
| Western blots | Llama-3.2-11B-Vision (Bedrock) | Proteins, expression levels |
| Microscopy | Llama-3.2-11B-Vision (Bedrock) | Cell types, markers, localization |
| Tables | PyMuPDF + LLM | Proteins, diseases, measurements |

Quick Start

# Process PDFs with multimodal extraction
uv run python -m pipeline.processors.multimodal_processor \
  --pdf-dir ./papers/ \
  --database olink1 \
  --service bedrock

# With checkpointing (resume on failure)
uv run python -m pipeline.processors.multimodal_processor \
  --pdf-dir ./papers/ \
  --database olink1 \
  --service bedrock \
  --checkpoint-db ./checkpoints.sqlite

Processing Pipeline

flowchart TD
    PDF["PDF Document"]

    subgraph extract["Element Extraction"]
        IMG["Extract images<br/>(PyMuPDF)"]
        TBL["Extract tables<br/>(PyMuPDF table detection)"]
    end

    subgraph vision["Vision Analysis"]
        V1["Classify image type"]
        V2["Extract entities<br/>from figures"]
        V3["Generate captions"]
    end

    subgraph table["Table Analysis"]
        T1["Column classification<br/>(LLM)"]
        T2["Entity mapping"]
        T3["Relationship inference"]
    end

    subgraph merge["Integration"]
        M1["Merge with text entities"]
        M2["Deduplicate"]
        M3["Store in graph"]
    end

    PDF --> extract
    IMG --> vision
    TBL --> table
    vision --> merge
    table --> merge
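
The Integration stage in the flowchart (merge, deduplicate, store) can be illustrated with a minimal sketch. The `Entity` class, the `merge_entities` function, and the choice to deduplicate on a case-insensitive name plus entity kind are all assumptions made for illustration; the actual merge logic in the pipeline may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical entity record; not the pipeline's real data model."""
    name: str
    kind: str  # e.g. "protein", "disease"
    sources: set = field(default_factory=set)  # e.g. {"text", "vision", "table"}


def merge_entities(*groups):
    """Merge entity lists, deduplicating on (normalized name, kind).

    Assumption: two entities are the same if their names match after
    trimming whitespace and case folding, and their kinds are equal.
    """
    merged = {}
    for group in groups:
        for ent in group:
            key = (ent.name.strip().casefold(), ent.kind)
            if key in merged:
                # Already seen: record the additional source.
                merged[key].sources |= ent.sources
            else:
                # Copy so the caller's objects are not mutated.
                merged[key] = Entity(ent.name.strip(), ent.kind, set(ent.sources))
    return list(merged.values())
```

Keeping a `sources` set per entity makes it possible to tell later whether an entity was found only in a figure, only in the text, or in both.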

Configuration

# Parallel processing with checkpointing
config = {
    "workers": 4,              # Parallel document processing
    "checkpoint_db": "cp.sqlite",  # Resume on failure
    "max_images_per_doc": 20,  # Limit vision API calls
    "min_image_size": (100, 100),  # Skip tiny images
}
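
As a rough sketch of how `min_image_size` and `max_images_per_doc` might interact, the function below filters images by size and then caps the count. The function name `select_images`, the inclusive threshold, and the rule that both width and height must meet the minimum are assumptions, not confirmed behavior of the processor.

```python
def select_images(image_sizes, config):
    """Return indices of images that would be sent to the vision model.

    image_sizes: list of (width, height) tuples in document order.
    Assumption: an image is kept only if BOTH dimensions meet the minimum,
    and the per-document cap is applied after size filtering.
    """
    min_w, min_h = config["min_image_size"]
    keep = [
        i for i, (w, h) in enumerate(image_sizes)
        if w >= min_w and h >= min_h
    ]
    return keep[: config["max_images_per_doc"]]
```

Applying the size filter before the cap means tiny icons and logos do not use up the vision API budget for a document.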

Batch Processing

The processor supports batch processing with SQLite-based checkpointing:

  • Status lifecycle: PENDING → IN_PROGRESS → COMPLETED or FAILED
  • Resume capability: skip completed files, retry failed ones
  • File change detection via MD5 hashing
  • Statistics tracking per document
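
The checkpointing behavior listed above can be sketched with the standard library alone. The table schema, the `CheckpointStore` class, and its method names are hypothetical and chosen for illustration; only the status names, the use of SQLite, and MD5-based change detection come from the description above.

```python
import hashlib
import sqlite3


def file_md5(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


class CheckpointStore:
    """Hypothetical checkpoint store: one row per file path."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "path TEXT PRIMARY KEY, md5 TEXT NOT NULL, status TEXT NOT NULL)"
        )
        self.conn.commit()

    def should_process(self, path):
        """Skip only files that are COMPLETED and unchanged on disk."""
        row = self.conn.execute(
            "SELECT md5, status FROM checkpoints WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            return True  # never seen: treat as PENDING
        md5, status = row
        return status != "COMPLETED" or md5 != file_md5(path)

    def mark(self, path, status):
        """Record the current content hash together with the new status."""
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints (path, md5, status) VALUES (?, ?, ?)",
            (path, file_md5(path), status),
        )
        self.conn.commit()
```

In this sketch a FAILED or IN_PROGRESS file is always retried on the next run, and a COMPLETED file is reprocessed only when its MD5 hash no longer matches the stored one.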

Key Files

| File | Role |
| --- | --- |
| pipeline/processors/multimodal_processor.py | Main orchestrator |
| pipeline/processors/multimodal_chunker.py | Element-aware chunking |
| pipeline/processors/vision_processor.py | Vision model integration |
| pipeline/processors/table_processor.py | Table extraction and analysis |