Extraction Architecture & Rationale

Why Two-Pass Entity Typing?

A common question when onboarding: "Why not have the LLM assign final entity types in one shot?"

The short answer: the LLM does assign types during extraction — but those types are treated as provisional. A deterministic post-processing layer refines and corrects them using ontology lookups. This isn't a workaround for context window limits; it's a deliberate separation of concerns.

The Per-Chunk Pipeline

Each text chunk passes through a 3-step LangGraph chain:

flowchart TD
    chunk["Text Chunk"]

    subgraph step1["Step 1 — LLM Extraction"]
        direction TB
        llm["LLM prompt:<br/><i>Extract entities + relationships</i>"]
        out1["Output: nodes with provisional types<br/>(Protein | Disease | Entity)<br/>+ relationships with source sentences"]
        llm --> out1
    end

    subgraph step2["Step 2 — Ontology Filtering"]
        direction TB
        validate["Validate against<br/>MONDO & UniProt dictionaries"]
        actions2["• Standardize names to canonical forms<br/>• Drop spurious/generic terms<br/>• Remove unrecognized Entity nodes<br/>• Prune orphaned relationships"]
        validate --> actions2
    end

    subgraph step3["Step 3 — Protein Entity Linking"]
        direction TB
        lookup["Query graph for existing<br/>UniProt-identified protein"]
        found{"Match<br/>found?"}
        reuse["Rewire relationships<br/>to existing node"]
        create["Create new node with<br/>uniprot_gene_name set"]
        lookup --> found
        found -->|Yes| reuse
        found -->|No| create
    end

    chunk --> step1 --> step2 --> step3

    subgraph postbatch["Post-Ingestion Batch"]
        direction TB
        scan["Scan nodes still labeled Entity"]
        match["Match against MONDO → Disease<br/>Match against UniProt → Protein"]
        relabel["Relabel node, standardize name,<br/>remove Entity label"]
        scan --> match --> relabel
    end

    step3 -->|"All chunks done"| postbatch
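
As a rough illustration of how the three per-chunk steps hand state to one another, here is a minimal sketch in plain Python. It deliberately does not use the LangGraph API; `run_chain`, the three placeholder step functions, and the tiny canonical-name table are all hypothetical stand-ins, not the project's actual code.

```python
from typing import Callable, Dict, List

# State carried between steps: the nodes and relationships of one chunk.
State = Dict[str, list]
Step = Callable[[State], State]


def run_chain(state: State, steps: List[Step]) -> State:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        state = step(state)
    return state


# Hypothetical placeholder for Step 1 (the real step calls the LLM).
def extract(state: State) -> State:
    return {**state, "nodes": [{"id": "n1", "name": "IL-6", "type": "Protein"}]}


# Hypothetical placeholder for Step 2 (the real step uses MONDO/UniProt).
def ontology_filter(state: State) -> State:
    canonical = {"IL-6": "IL6"}  # illustrative mapping only
    nodes = [{**n, "name": canonical.get(n["name"], n["name"])} for n in state["nodes"]]
    return {**state, "nodes": nodes}


# Hypothetical placeholder for Step 3 (the real step queries the graph).
def link_proteins(state: State) -> State:
    return state


result = run_chain(
    {"nodes": [], "relationships": []},
    [extract, ontology_filter, link_proteins],
)
```

Each step takes the full state and returns a new one, so a step can be tested in isolation or swapped out without touching its neighbours.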

Step 1: LLM Extraction

The LLM receives a chunk of text and returns structured JSON:

{
  "nodes": [{"id": "...", "name": "IL-6", "type": "Protein"}],
  "relationships": [{
    "source_id": "...",
    "target_id": "...",
    "type": "ASSOCIATED_WITH",
    "source_sentence": "IL-6 levels were elevated in patients with..."
  }]
}

The prompt constrains entity types to Protein, Disease, or Entity (generic fallback). It instructs the model to skip spurious generic terms and to include the verbatim source sentence for each relationship as provenance.

What the LLM is good at: understanding natural language, recognizing that something is a named entity, identifying relationships, quoting evidence.

What the LLM is unreliable at: consistent typing (it might call "hypertension" an Entity), canonical naming ("IL6" vs "IL-6" vs "Interleukin 6"), and deduplication across chunks.
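
Because the LLM's types are only provisional, the response can be parsed defensively before it reaches the ontology layer. The sketch below shows one way to do that, assuming the JSON shape shown above; the `Node` and `Relationship` dataclasses, `parse_extraction`, the fallback of unknown types to Entity, and the sample input are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple

ALLOWED_TYPES = {"Protein", "Disease", "Entity"}


@dataclass
class Node:
    id: str
    name: str
    type: str


@dataclass
class Relationship:
    source_id: str
    target_id: str
    type: str
    source_sentence: str


def parse_extraction(raw: str) -> Tuple[List[Node], List[Relationship]]:
    """Parse the LLM's JSON and coerce unexpected types to the generic fallback."""
    data = json.loads(raw)
    nodes = []
    for item in data.get("nodes", []):
        # Assumption: a type outside the allowed set is treated as "Entity".
        node_type = item.get("type")
        if node_type not in ALLOWED_TYPES:
            node_type = "Entity"
        nodes.append(Node(id=item["id"], name=item["name"], type=node_type))
    rels = [
        Relationship(
            source_id=r["source_id"],
            target_id=r["target_id"],
            type=r["type"],
            source_sentence=r.get("source_sentence", ""),
        )
        for r in data.get("relationships", [])
    ]
    return nodes, rels


# Made-up input: the second node carries a type the prompt does not allow.
raw = json.dumps({
    "nodes": [
        {"id": "n1", "name": "IL-6", "type": "Protein"},
        {"id": "n2", "name": "hypertension", "type": "Condition"},
    ],
    "relationships": [{
        "source_id": "n1",
        "target_id": "n2",
        "type": "ASSOCIATED_WITH",
        "source_sentence": "IL-6 was associated with hypertension.",
    }],
})
nodes, rels = parse_extraction(raw)
```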

Step 2: Ontology Filtering

This step validates the extracted nodes against the loaded ontologies:

  • Diseases checked against MONDO (Mondo Disease Ontology). Names standardized to canonical forms.
  • Proteins checked against UniProt/gene symbol dictionaries. Names standardized.
  • Generic Entity nodes matching known spurious terms are dropped.
  • Relationships referencing removed nodes are also dropped.

This step is fast (dictionary lookups only) and deterministic, and it catches spurious extractions such as "Data" or "Analysis" being returned as entities.
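
The filtering rules above can be sketched with plain dictionary lookups. This is a simplified illustration, not the actual `OntologyFilter`: the three lookup tables are toy data, and the choice to keep a typed node unchanged when it has no ontology match is an assumption the source does not spell out.

```python
from typing import Dict, List, Tuple

# Toy lookup tables standing in for the loaded ontologies (illustrative only).
MONDO: Dict[str, str] = {"hypertension": "hypertension", "high blood pressure": "hypertension"}
UNIPROT: Dict[str, str] = {"il6": "IL6", "il-6": "IL6", "interleukin 6": "IL6"}
SPURIOUS = {"data", "analysis"}


def filter_nodes(nodes: List[dict], relationships: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Standardize known names, drop spurious Entity nodes, prune dangling relationships."""
    kept = []
    for node in nodes:
        key = node["name"].strip().lower()
        if node["type"] == "Entity" and key in SPURIOUS:
            continue  # generic noise such as "Data" is dropped
        if node["type"] == "Disease" and key in MONDO:
            node = {**node, "name": MONDO[key]}
        elif node["type"] == "Protein" and key in UNIPROT:
            node = {**node, "name": UNIPROT[key]}
        # Assumption: nodes with no ontology match pass through unchanged.
        kept.append(node)

    kept_ids = {n["id"] for n in kept}
    pruned = [
        r for r in relationships
        if r["source_id"] in kept_ids and r["target_id"] in kept_ids
    ]
    return kept, pruned


nodes = [
    {"id": "n1", "name": "Interleukin 6", "type": "Protein"},
    {"id": "n2", "name": "High blood pressure", "type": "Disease"},
    {"id": "n3", "name": "Data", "type": "Entity"},
]
relationships = [
    {"source_id": "n1", "target_id": "n2", "type": "ASSOCIATED_WITH"},
    {"source_id": "n1", "target_id": "n3", "type": "ASSOCIATED_WITH"},
]
kept, pruned = filter_nodes(nodes, relationships)
```

In this toy run the "Data" node is dropped, the relationship pointing at it is pruned with it, and the two remaining names are rewritten to their canonical forms.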

Step 3: Protein Entity Linking

For nodes typed as Protein, the pipeline queries the existing graph for a matching UniProt-identified node:

  • Found → rewires all relationships to point to the existing node (no duplicate created)
  • Not found → creates a new node with uniprot_gene_name set for future resolution

This prevents the graph from accumulating duplicate protein nodes across ingestion runs.
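
The found/not-found branch can be illustrated without a database. In the sketch below an in-memory dictionary stands in for the graph query; `link_proteins`, `graph_index`, and the node IDs are hypothetical, and the real pipeline looks up the existing node in the graph instead.

```python
from typing import Dict, List, Tuple


def link_proteins(
    nodes: List[dict],
    relationships: List[dict],
    graph_index: Dict[str, str],
) -> Tuple[List[dict], List[dict]]:
    """graph_index maps a UniProt gene name to the ID of the node already in the graph."""
    remap: Dict[str, str] = {}
    to_create: List[dict] = []
    for node in nodes:
        if node["type"] != "Protein":
            to_create.append(node)
            continue
        gene = node["name"]
        existing_id = graph_index.get(gene)
        if existing_id is not None:
            # Found: reuse the existing node and remember the ID substitution.
            remap[node["id"]] = existing_id
        else:
            # Not found: create a new node and record its gene name.
            graph_index[gene] = node["id"]
            to_create.append({**node, "uniprot_gene_name": gene})

    # Rewire both endpoints of every relationship through the substitution map.
    rewired = [
        {
            **r,
            "source_id": remap.get(r["source_id"], r["source_id"]),
            "target_id": remap.get(r["target_id"], r["target_id"]),
        }
        for r in relationships
    ]
    return to_create, rewired


graph_index = {"IL6": "g42"}  # pretend IL6 was ingested in an earlier run
nodes = [
    {"id": "n1", "name": "IL6", "type": "Protein"},
    {"id": "n2", "name": "TNF", "type": "Protein"},
    {"id": "n3", "name": "hypertension", "type": "Disease"},
]
relationships = [{"source_id": "n1", "target_id": "n3", "type": "ASSOCIATED_WITH"}]
to_create, rewired = link_proteins(nodes, relationships, graph_index)
```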

Post-Ingestion Batch Resolution

After all chunks in a job are processed, resolve_and_label_nodes scans the database for nodes still labeled Entity (or potentially misclassified) and attempts to resolve them:

  1. Match node name against MONDO disease ontology → relabel as Disease
  2. Match node name against UniProt protein dictionary → relabel as Protein
  3. Standardize the name property to the canonical form
  4. Remove the generic Entity label
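
The four steps above can be sketched for a single node as follows. This is an in-memory approximation, not the real `resolve_and_label_nodes`: `resolve_entity`, the node dictionaries with a `labels` list, and the toy ontology tables are assumptions, and trying MONDO before UniProt simply mirrors the order of the list.

```python
from typing import Dict

# Toy lookup tables (illustrative only), keyed by lowercased name.
MONDO: Dict[str, str] = {"hypertension": "hypertension"}
UNIPROT: Dict[str, str] = {"il-6": "IL6", "il6": "IL6"}


def resolve_entity(node: dict, mondo: Dict[str, str], uniprot: Dict[str, str]) -> dict:
    """Relabel a generic Entity node if its name matches a known ontology term."""
    if "Entity" not in node["labels"]:
        return node  # already typed, nothing to do
    key = node["name"].strip().lower()
    if key in mondo:
        label, canonical = "Disease", mondo[key]
    elif key in uniprot:
        label, canonical = "Protein", uniprot[key]
    else:
        return node  # unresolved: stays labeled Entity
    # Replace the generic label and standardize the name.
    labels = [l for l in node["labels"] if l != "Entity"] + [label]
    return {**node, "name": canonical, "labels": labels}


disease = resolve_entity({"id": "n1", "name": "Hypertension", "labels": ["Entity"]}, MONDO, UNIPROT)
protein = resolve_entity({"id": "n2", "name": "IL-6", "labels": ["Entity"]}, MONDO, UNIPROT)
unknown = resolve_entity({"id": "n3", "name": "cohort", "labels": ["Entity"]}, MONDO, UNIPROT)
```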

Why Not Single-Pass?

| Concern | Single-pass (LLM does everything) | Two-pass (LLM + ontology) |
|---|---|---|
| Typing accuracy | ~80-90% (hallucinations, inconsistency) | ~99% (ontology is ground truth) |
| Canonical naming | Inconsistent across chunks | Deterministic (always maps to canonical) |
| Deduplication | Impossible per-chunk (no global view) | Handled by entity linking + batch resolution |
| New entity types | Requires prompt engineering + hope | Add ontology dictionary + filter rule |
| Cost | Higher (longer prompts to explain all rules) | Lower (LLM prompt stays simple) |
| Debuggability | Opaque (why did the LLM choose X?) | Transparent (ontology match or not) |

The LLM focuses on what it's uniquely good at: understanding language and extracting structure. The ontology layer handles what databases are good at: canonical identity, deduplication, and classification.

Adding New Entity Types

To extend the schema (e.g., adding PTMs, Cell Types, Pathways):

  1. Update the extraction prompt — Add the new type to the allowed entity_type list in KGPipeline._extract_kg_with_llm()
  2. Add validation logic — Extend OntologyFilter.filter_nodes() with an is_valid_<type>() check
  3. Load a reference ontology (optional but recommended) — e.g., PSI-MOD for PTMs, Cell Ontology for cell types
  4. Add a resolver — Extend resolve_and_label_nodes() to handle the new label
  5. Update the Neo4j schema — Add indexes/constraints for the new label

Effort grows linearly with the number of types: each one needs a new ontology dictionary, a few lines of filter logic, and the matching resolver and schema updates listed above.
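
One possible way to keep the per-type validation checks small is a registry keyed by entity type. The sketch below is an assumption about how such an extension could look, not how `OntologyFilter.filter_nodes()` is actually written; `VALIDATORS`, `register`, `is_valid`, and the toy `CELL_TYPES` set are hypothetical names.

```python
from typing import Callable, Dict

# Registry of per-type validators (hypothetical structure).
VALIDATORS: Dict[str, Callable[[str], bool]] = {}

# Toy stand-in for a loaded reference ontology such as Cell Ontology.
CELL_TYPES = {"t cell", "macrophage"}


def register(entity_type: str):
    """Decorator that records a validator under the given entity type."""
    def decorator(fn: Callable[[str], bool]) -> Callable[[str], bool]:
        VALIDATORS[entity_type] = fn
        return fn
    return decorator


@register("CellType")
def is_valid_cell_type(name: str) -> bool:
    return name.strip().lower() in CELL_TYPES


def is_valid(node: dict) -> bool:
    """Run the validator for the node's type, if one is registered."""
    check = VALIDATORS.get(node["type"])
    # Assumption: types without a registered validator pass through.
    return True if check is None else check(node["name"])


ok = is_valid({"name": "Macrophage", "type": "CellType"})
rejected = is_valid({"name": "xyz", "type": "CellType"})
untouched = is_valid({"name": "IL6", "type": "Protein"})
```

With this shape, supporting another type means loading its dictionary and registering one more function; the filter loop itself does not change.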

Key Files

| File | Role |
|---|---|
| pipeline/ingest/kg_pipeline.py | LangGraph chain, LLM extraction prompt, protein linking |
| pipeline/ingest/ontology_pipeline.py | Orchestrates filter + link steps, post-ingestion labeling |
| pipeline/processors/ontology_filter.py | MONDO/UniProt validation, node filtering, batch relabeling |
| pipeline/processors/entity_resolver.py | Multi-strategy deduplication (UniProt ID, synonyms, fuzzy match) |