Extraction Architecture & Rationale

Why Two-Pass Entity Typing?

A common question when onboarding: "Why not have the LLM assign final entity types in one shot?"

The short answer: the LLM does assign types during extraction — but those types are treated as provisional. A deterministic post-processing layer refines and corrects them using ontology lookups. This isn't a workaround for context window limits; it's a deliberate separation of concerns.

The Per-Chunk Pipeline

Each text chunk passes through a 3-step LangGraph chain:

flowchart TD
    chunk["Text Chunk"]

    subgraph step1["Step 1 — LLM Extraction"]
        direction TB
        llm["LLM prompt:<br/><i>Extract entities + relationships</i>"]
        out1["Output: nodes with provisional types<br/>(Protein | Disease | Entity)<br/>+ relationships with source sentences"]
        llm --> out1
    end

    subgraph step2["Step 2 — Ontology Filtering"]
        direction TB
        validate["Validate against<br/>MONDO & UniProt dictionaries"]
        actions2["• Standardize names to canonical forms<br/>• Drop spurious/generic terms<br/>• Remove unrecognized Entity nodes<br/>• Prune orphaned relationships"]
        validate --> actions2
    end

    subgraph step3["Step 3 — Protein Entity Linking"]
        direction TB
        lookup["Query graph for existing<br/>UniProt-identified protein"]
        found{"Match<br/>found?"}
        reuse["Rewire relationships<br/>to existing node"]
        create["Create new node with<br/>uniprot_gene_name set"]
        lookup --> found
        found -->|Yes| reuse
        found -->|No| create
    end

    chunk --> step1 --> step2 --> step3

    subgraph postbatch["Post-Ingestion Batch"]
        direction TB
        scan["Scan nodes still labeled Entity"]
        match["Match against MONDO → Disease<br/>Match against UniProt → Protein"]
        relabel["Relabel node, standardize name,<br/>remove Entity label"]
        scan --> match --> relabel
    end

    step3 -->|"All chunks done"| postbatch
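
The ordering in the diagram can be sketched in plain Python. This is a minimal illustration, not the project's LangGraph code: `ingest` and the four callables it takes are hypothetical names, and the stand-ins below only record the order in which they are called.

```python
from typing import Callable, Iterable

Graph = dict  # assumed shape: {"nodes": [...], "relationships": [...]}


def ingest(
    chunks: Iterable[str],
    extract: Callable[[str], Graph],
    filter_nodes: Callable[[Graph], Graph],
    link_proteins: Callable[[Graph], Graph],
    resolve_batch: Callable[[], None],
) -> list[Graph]:
    """Run the three per-chunk steps in order, then one batch pass at the end."""
    results = []
    for chunk in chunks:
        graph = extract(chunk)        # Step 1: provisional types from the LLM
        graph = filter_nodes(graph)   # Step 2: ontology filtering
        graph = link_proteins(graph)  # Step 3: protein entity linking
        results.append(graph)
    resolve_batch()                   # Post-ingestion: once, after all chunks
    return results


# Trivial stand-ins (hypothetical) that only record the call order.
calls: list[str] = []


def fake_extract(chunk: str) -> Graph:
    calls.append("extract")
    return {"nodes": [], "relationships": []}


def fake_filter(graph: Graph) -> Graph:
    calls.append("filter")
    return graph


def fake_link(graph: Graph) -> Graph:
    calls.append("link")
    return graph


def fake_resolve() -> None:
    calls.append("resolve")


results = ingest(["chunk A", "chunk B"], fake_extract, fake_filter, fake_link, fake_resolve)
print(calls)
# ['extract', 'filter', 'link', 'extract', 'filter', 'link', 'resolve']
```

The batch pass sits outside the loop because it needs every chunk's nodes to be in the graph before it runs.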

Step 1: LLM Extraction

The LLM receives a chunk of text and returns structured JSON:

{
  "nodes": [{"id": "...", "name": "IL-6", "type": "Protein"}],
  "relationships": [{
    "source_id": "...",
    "target_id": "...",
    "type": "ASSOCIATED_WITH",
    "source_sentence": "IL-6 levels were elevated in patients with..."
  }]
}

The prompt constrains entity types to Protein, Disease, or Entity (generic fallback). It instructs the model to skip spurious generic terms and to include the verbatim source sentence for each relationship as provenance.
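
Because the prompt can only ask for these constraints, it can help to check the reply before passing it on. The sketch below assumes the JSON shape shown above; `parse_extraction` is a hypothetical helper, and the verbatim-sentence check is an illustrative assumption rather than a documented part of the pipeline.

```python
import json

ALLOWED_TYPES = {"Protein", "Disease", "Entity"}


def parse_extraction(raw: str, chunk: str) -> dict:
    """Parse the LLM's JSON reply and keep only well-formed items."""
    data = json.loads(raw)
    # Keep nodes that have an id and one of the three allowed types.
    nodes = [
        n for n in data.get("nodes", [])
        if n.get("id") and n.get("type") in ALLOWED_TYPES
    ]
    node_ids = {n["id"] for n in nodes}

    def is_grounded(rel: dict) -> bool:
        sentence = rel.get("source_sentence") or ""
        return (
            rel.get("source_id") in node_ids
            and rel.get("target_id") in node_ids
            # Provenance: the quoted sentence must occur verbatim in the chunk.
            and bool(sentence)
            and sentence in chunk
        )

    relationships = [r for r in data.get("relationships", []) if is_grounded(r)]
    return {"nodes": nodes, "relationships": relationships}


chunk = "IL-6 levels were elevated in patients with rheumatoid arthritis."
raw = json.dumps({
    "nodes": [
        {"id": "n1", "name": "IL-6", "type": "Protein"},
        {"id": "n2", "name": "rheumatoid arthritis", "type": "Disease"},
        {"id": "n3", "name": "ELISA", "type": "Method"},  # type not allowed
    ],
    "relationships": [
        {"source_id": "n1", "target_id": "n2", "type": "ASSOCIATED_WITH",
         "source_sentence": chunk},
        {"source_id": "n1", "target_id": "n2", "type": "ASSOCIATED_WITH",
         "source_sentence": "IL-6 causes arthritis."},  # not in the chunk
    ],
})
parsed = parse_extraction(raw, chunk)
print(len(parsed["nodes"]), len(parsed["relationships"]))  # 2 1
```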

What the LLM is good at: understanding natural language, recognizing that something is a named entity, identifying relationships, quoting evidence.

What the LLM is unreliable at: consistent typing (it might call "hypertension" an Entity), canonical naming ("IL6" vs "IL-6" vs "Interleukin 6"), and deduplication across chunks.

Step 2: Ontology Filtering

This step validates extracted nodes against the loaded ontologies:

Diseases checked against MONDO (Mondo Disease Ontology). Names standardized to canonical forms.
Proteins checked against UniProt/gene symbol dictionaries. Names standardized.
Generic Entity nodes matching known spurious terms are dropped.
Relationships referencing removed nodes are also dropped.

This step is fast (dictionary lookups) and deterministic, and it catches spurious extractions such as the LLM returning "Data" or "Analysis" as entities.
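
A minimal sketch of this kind of filtering is shown below. The three dictionaries are toy, hypothetical stand-ins for the loaded MONDO and UniProt data, `filter_graph` is not the project's `OntologyFilter.filter_nodes()`, and leaving unmatched nodes unchanged is an assumption of the sketch.

```python
# Hypothetical miniature dictionaries: lowercase alias -> canonical name.
DISEASES = {"hypertension": "hypertension", "high blood pressure": "hypertension"}
PROTEINS = {"il6": "IL6", "il-6": "IL6", "interleukin 6": "IL6"}
SPURIOUS = {"data", "analysis"}

DICTIONARIES = {"Disease": DISEASES, "Protein": PROTEINS}


def filter_graph(graph: dict) -> dict:
    """Standardize names, drop spurious Entity nodes, prune orphaned relationships."""
    kept = []
    for node in graph["nodes"]:
        key = node["name"].strip().lower()
        if node["type"] == "Entity" and key in SPURIOUS:
            continue  # drop generic terms such as "Data" or "Analysis"
        lookup = DICTIONARIES.get(node["type"], {})
        # Standardize the name when the dictionary for this type knows the alias;
        # otherwise keep the node as it is (an assumption of this sketch).
        kept.append({**node, "name": lookup.get(key, node["name"])})

    kept_ids = {n["id"] for n in kept}
    # A relationship survives only if both of its endpoints survived.
    relationships = [
        r for r in graph["relationships"]
        if r["source_id"] in kept_ids and r["target_id"] in kept_ids
    ]
    return {"nodes": kept, "relationships": relationships}


graph = {
    "nodes": [
        {"id": "a", "name": "Interleukin 6", "type": "Protein"},
        {"id": "b", "name": "High blood pressure", "type": "Disease"},
        {"id": "c", "name": "Analysis", "type": "Entity"},
    ],
    "relationships": [
        {"source_id": "a", "target_id": "b", "type": "ASSOCIATED_WITH"},
        {"source_id": "c", "target_id": "a", "type": "ASSOCIATED_WITH"},
    ],
}
filtered = filter_graph(graph)
print([n["name"] for n in filtered["nodes"]])  # ['IL6', 'hypertension']
print(len(filtered["relationships"]))          # 1
```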

Step 3: Protein Entity Linking

For nodes typed as Protein, the pipeline queries the existing graph for a matching UniProt-identified node:

Found → rewires all relationships to point to the existing node (no duplicate created)
Not found → creates a new node with uniprot_gene_name set for future resolution

This prevents the graph from accumulating duplicate protein nodes across ingestion runs.
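
The found / not-found branches can be sketched with an in-memory index standing in for the graph query. Everything here is an assumption for illustration: `link_proteins`, the `existing` mapping from `uniprot_gene_name` to node id, and the use of the standardized node name as the lookup key.

```python
def link_proteins(graph: dict, existing: dict[str, str]) -> dict:
    """Reuse known protein nodes; register new ones.

    `existing` maps uniprot_gene_name -> id of a node already in the graph.
    """
    remap: dict[str, str] = {}
    nodes = []
    for node in graph["nodes"]:
        if node["type"] != "Protein":
            nodes.append(node)
            continue
        gene = node["name"]  # assumes Step 2 already standardized the name
        if gene in existing:
            # Found: point at the existing node and create nothing new.
            remap[node["id"]] = existing[gene]
        else:
            # Not found: keep the node and record it for later chunks.
            existing[gene] = node["id"]
            nodes.append({**node, "uniprot_gene_name": gene})

    # Rewire both endpoints of every relationship through the remap table.
    relationships = [
        {
            **r,
            "source_id": remap.get(r["source_id"], r["source_id"]),
            "target_id": remap.get(r["target_id"], r["target_id"]),
        }
        for r in graph["relationships"]
    ]
    return {"nodes": nodes, "relationships": relationships}


existing = {"IL6": "p-001"}  # IL6 was created by an earlier ingestion run
graph = {
    "nodes": [
        {"id": "n1", "name": "IL6", "type": "Protein"},
        {"id": "n2", "name": "TNF", "type": "Protein"},
        {"id": "n3", "name": "hypertension", "type": "Disease"},
    ],
    "relationships": [
        {"source_id": "n1", "target_id": "n3", "type": "ASSOCIATED_WITH"},
        {"source_id": "n2", "target_id": "n3", "type": "ASSOCIATED_WITH"},
    ],
}
linked = link_proteins(graph, existing)
print([n["id"] for n in linked["nodes"]])                      # ['n2', 'n3']
print([r["source_id"] for r in linked["relationships"]])       # ['p-001', 'n2']
```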

Post-Ingestion Batch Resolution

After all chunks in a job are processed, resolve_and_label_nodes scans the database for nodes still labeled Entity (or potentially misclassified) and attempts to resolve them:

Match node name against MONDO disease ontology → relabel as Disease
Match node name against UniProt protein dictionary → relabel as Protein
Standardize the name property to the canonical form
Remove the generic Entity label
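
The four actions above can be sketched over an in-memory node list. This is not the real `resolve_and_label_nodes`: the function name, the label sets, the toy dictionaries, and the choice to try the disease dictionary before the protein dictionary are all assumptions.

```python
def resolve_entities_sketch(nodes: list[dict], diseases: dict, proteins: dict) -> int:
    """Relabel Entity nodes that match a dictionary; return how many changed."""
    relabeled = 0
    for node in nodes:
        if "Entity" not in node["labels"]:
            continue  # already typed, nothing to resolve
        key = node["name"].strip().lower()
        # Try the disease dictionary first, then the protein dictionary.
        for label, dictionary in (("Disease", diseases), ("Protein", proteins)):
            if key in dictionary:
                node["name"] = dictionary[key]  # standardize to canonical form
                node["labels"] = (node["labels"] - {"Entity"}) | {label}
                relabeled += 1
                break
    return relabeled


# Hypothetical toy dictionaries: lowercase alias -> canonical name.
diseases = {"hypertension": "hypertension"}
proteins = {"il-6": "IL6", "il6": "IL6"}

nodes = [
    {"name": "Hypertension", "labels": {"Entity"}},
    {"name": "il-6", "labels": {"Entity"}},
    {"name": "cohort", "labels": {"Entity"}},  # no match: stays an Entity
]
count = resolve_entities_sketch(nodes, diseases, proteins)
print(count)                              # 2
print([n["name"] for n in nodes])         # ['hypertension', 'IL6', 'cohort']
```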

Why Not Single-Pass?

Concern	Single-pass (LLM does everything)	Two-pass (LLM + ontology)
Typing accuracy	~80-90% (hallucinations, inconsistency)	~99% (ontology is ground truth)
Canonical naming	Inconsistent across chunks	Deterministic (always maps to canonical)
Deduplication	Impossible per-chunk (no global view)	Handled by entity linking + batch resolution
New entity types	Requires prompt engineering + hope	Add ontology dictionary + filter rule
Cost	Higher (longer prompts to explain all rules)	Lower (LLM prompt stays simple)
Debuggability	Opaque (why did the LLM choose X?)	Transparent (ontology match or not)

The LLM focuses on what it's uniquely good at: understanding language and extracting structure. The ontology layer handles what databases are good at: canonical identity, deduplication, and classification.

Adding New Entity Types

To extend the schema, for example with post-translational modifications (PTMs), cell types, or pathways:

Update the extraction prompt — Add the new type to the allowed entity_type list in KGPipeline._extract_kg_with_llm()
Add validation logic — Extend OntologyFilter.filter_nodes() with an is_valid_<type>() check
Load a reference ontology (optional but recommended) — e.g., PSI-MOD for PTMs, Cell Ontology for cell types
Add a resolver — Extend resolve_and_label_nodes() to handle the new label
Update the Neo4j schema — Add indexes/constraints for the new label

The effort scales linearly with the number of types: each new type is mostly a new ontology dictionary plus a few lines of filter and resolver logic.
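
One way to keep that per-type cost small is to drive validation from a registry of dictionaries, so a new type is a single registration. The sketch below is hypothetical: `TypeRegistry` does not exist in the codebase, and the alias entries are toy examples rather than real ontology exports.

```python
from typing import Optional


class TypeRegistry:
    """Maps each entity label to an alias -> canonical-name dictionary."""

    def __init__(self) -> None:
        self._dictionaries: dict[str, dict[str, str]] = {}

    def add_type(self, label: str, aliases: dict[str, str]) -> None:
        # Store lowercase keys so lookups are case-insensitive.
        self._dictionaries[label] = {k.lower(): v for k, v in aliases.items()}

    def canonical(self, label: str, name: str) -> Optional[str]:
        """Return the canonical name, or None if the name is unknown."""
        return self._dictionaries.get(label, {}).get(name.strip().lower())

    def is_valid(self, label: str, name: str) -> bool:
        return self.canonical(label, name) is not None

    @property
    def allowed_types(self) -> list[str]:
        # "Entity" stays available as the generic fallback type.
        return sorted(self._dictionaries) + ["Entity"]


registry = TypeRegistry()
registry.add_type("Protein", {"il-6": "IL6", "interleukin 6": "IL6"})
# Adding a new type is one more registration.
registry.add_type("CellType", {"t cell": "T cell", "T lymphocyte": "T cell"})

print(registry.allowed_types)                          # ['CellType', 'Protein', 'Entity']
print(registry.canonical("CellType", "T Lymphocyte"))  # T cell
print(registry.is_valid("CellType", "hepatocyte"))     # False
```

The `allowed_types` list could feed the extraction prompt, which would keep the prompt and the validators in sync; that wiring is left out here.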

Key Files

File	Role
`pipeline/ingest/kg_pipeline.py`	LangGraph chain, LLM extraction prompt, protein linking
`pipeline/ingest/ontology_pipeline.py`	Orchestrates filter + link steps, post-ingestion labeling
`pipeline/processors/ontology_filter.py`	MONDO/UniProt validation, node filtering, batch relabeling
`pipeline/processors/entity_resolver.py`	Multi-strategy deduplication (UniProt ID, synonyms, fuzzy match)