Extraction Architecture & Rationale
Why Two-Pass Entity Typing?
A common question when onboarding: "Why not have the LLM assign final entity types in one shot?"
The short answer: the LLM does assign types during extraction — but those types are treated as provisional. A deterministic post-processing layer refines and corrects them using ontology lookups. This isn't a workaround for context window limits; it's a deliberate separation of concerns.
The Per-Chunk Pipeline
Each text chunk passes through a 3-step LangGraph chain:
```mermaid
flowchart TD
    chunk["Text Chunk"]

    subgraph step1["Step 1: LLM Extraction"]
        direction TB
        llm["LLM prompt:<br/><i>Extract entities + relationships</i>"]
        out1["Output: nodes with provisional types<br/>(Protein | Disease | Entity)<br/>+ relationships with source sentences"]
        llm --> out1
    end

    subgraph step2["Step 2: Ontology Filtering"]
        direction TB
        validate["Validate against<br/>MONDO & UniProt dictionaries"]
        actions2["• Standardize names to canonical forms<br/>• Drop spurious/generic terms<br/>• Remove unrecognized Entity nodes<br/>• Prune orphaned relationships"]
        validate --> actions2
    end

    subgraph step3["Step 3: Protein Entity Linking"]
        direction TB
        lookup["Query graph for existing<br/>UniProt-identified protein"]
        found{"Match<br/>found?"}
        reuse["Rewire relationships<br/>to existing node"]
        create["Create new node with<br/>uniprot_gene_name set"]
        lookup --> found
        found -->|Yes| reuse
        found -->|No| create
    end

    chunk --> step1 --> step2 --> step3

    subgraph postbatch["Post-Ingestion Batch"]
        direction TB
        scan["Scan nodes still labeled Entity"]
        match["Match against MONDO → Disease<br/>Match against UniProt → Protein"]
        relabel["Relabel node, standardize name,<br/>remove Entity label"]
        scan --> match --> relabel
    end

    step3 -->|"All chunks done"| postbatch
```
Step 1: LLM Extraction
The LLM receives a chunk of text and returns structured JSON:
```json
{
  "nodes": [{"id": "...", "name": "IL-6", "type": "Protein"}],
  "relationships": [{
    "source_id": "...",
    "target_id": "...",
    "type": "ASSOCIATED_WITH",
    "source_sentence": "IL-6 levels were elevated in patients with..."
  }]
}
```
The prompt constrains entity types to Protein, Disease, or Entity (generic fallback). It instructs the model to skip spurious generic terms and to include the verbatim source sentence for each relationship as provenance.
What the LLM is good at: understanding natural language, recognizing that something is a named entity, identifying relationships, quoting evidence.
What the LLM is unreliable at: consistent typing (it might call "hypertension" an Entity), canonical naming ("IL6" vs "IL-6" vs "Interleukin 6"), and deduplication across chunks.
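As a rough illustration of consuming that JSON, the sketch below parses a reply and coerces any type outside the allowed set to the generic `Entity` fallback. The names `Node`, `Relationship`, and `parse_extraction` are hypothetical and not taken from the codebase, and the coercion rule is an assumption about how out-of-schema types might be handled.

```python
import json
from dataclasses import dataclass

# The three types the prompt allows, per the description above.
ALLOWED_TYPES = {"Protein", "Disease", "Entity"}


@dataclass
class Node:
    id: str
    name: str
    type: str


@dataclass
class Relationship:
    source_id: str
    target_id: str
    type: str
    source_sentence: str


def parse_extraction(raw):
    """Parse one LLM reply (hypothetical helper, not the pipeline's own code)."""
    data = json.loads(raw)
    nodes = []
    for n in data.get("nodes", []):
        # Assumption: any type outside the allowed set becomes "Entity".
        node_type = n.get("type") if n.get("type") in ALLOWED_TYPES else "Entity"
        nodes.append(Node(id=n["id"], name=n["name"], type=node_type))

    # Keep only relationships whose endpoints were actually returned as nodes.
    known_ids = {n.id for n in nodes}
    relationships = [
        Relationship(
            source_id=r["source_id"],
            target_id=r["target_id"],
            type=r["type"],
            source_sentence=r.get("source_sentence", ""),
        )
        for r in data.get("relationships", [])
        if r["source_id"] in known_ids and r["target_id"] in known_ids
    ]
    return nodes, relationships
```

Because the types are still provisional at this point, the parser only enforces the schema; it makes no attempt to decide whether a type is correct.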
Step 2: Ontology Filtering
This step validates extracted nodes against the loaded ontologies:
- Diseases are checked against MONDO (the Mondo Disease Ontology), and their names are standardized to canonical forms.
- Proteins are checked against UniProt/gene symbol dictionaries, and their names are standardized.
- Generic `Entity` nodes matching known spurious terms are dropped.
- Relationships referencing removed nodes are also dropped.
This step is fast (dictionary lookups), deterministic, and catches LLM hallucinations like extracting "Data" or "Analysis" as entities.
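The filtering rules above can be sketched with plain dictionary lookups. This is a minimal illustration, not the `OntologyFilter` implementation: the dictionaries, the `SPURIOUS` set, and the function name `filter_chunk` are hypothetical, and dropping typed nodes that fail their lookup is a simplifying assumption.

```python
# Hypothetical in-memory dictionaries: lowercase synonym -> canonical name.
MONDO = {"hypertension": "hypertension", "high blood pressure": "hypertension"}
UNIPROT = {"il6": "IL6", "il-6": "IL6", "interleukin 6": "IL6"}
SPURIOUS = {"data", "analysis", "study"}


def filter_chunk(nodes, relationships):
    """Return (kept_nodes, kept_relationships) for one chunk's extraction."""
    kept = []
    for node in nodes:
        key = node["name"].strip().lower()
        if node["type"] == "Disease" and key in MONDO:
            kept.append({**node, "name": MONDO[key]})    # standardize name
        elif node["type"] == "Protein" and key in UNIPROT:
            kept.append({**node, "name": UNIPROT[key]})  # standardize name
        elif node["type"] == "Entity" and key not in SPURIOUS:
            kept.append(node)                            # left for later resolution
        # Assumption: everything else is dropped.

    # Prune relationships that reference a removed node.
    kept_ids = {n["id"] for n in kept}
    kept_rels = [
        r for r in relationships
        if r["source_id"] in kept_ids and r["target_id"] in kept_ids
    ]
    return kept, kept_rels
```

Every decision here is a set or dictionary membership test, which is what makes the step cheap and reproducible for a given ontology snapshot.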
Step 3: Protein Entity Linking
For nodes typed as Protein, the pipeline queries the existing graph for a matching UniProt-identified node:
- Found → rewires all relationships to point to the existing node (no duplicate created)
- Not found → creates a new node with `uniprot_gene_name` set for future resolution
This prevents the graph from accumulating duplicate protein nodes across ingestion runs.
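A minimal sketch of the found/not-found branch follows. It assumes the graph query can be stood in for by a dictionary mapping gene name to existing node id; the real pipeline queries the graph database, and the function name `link_proteins` is hypothetical.

```python
def link_proteins(nodes, relationships, existing):
    """Link Protein nodes against `existing` (gene name -> node id in the graph).

    `existing` is a stand-in for the graph lookup and is updated in place
    when a new protein node is created.
    """
    remap = {}       # chunk-local id -> id of the node already in the graph
    out_nodes = []
    for node in nodes:
        if node["type"] != "Protein":
            out_nodes.append(node)
            continue
        gene = node["name"]
        if gene in existing:
            # Found: no new node, relationships will be rewired below.
            remap[node["id"]] = existing[gene]
        else:
            # Not found: create the node and record its gene name.
            existing[gene] = node["id"]
            out_nodes.append({**node, "uniprot_gene_name": gene})

    rewired = [
        {
            **r,
            "source_id": remap.get(r["source_id"], r["source_id"]),
            "target_id": remap.get(r["target_id"], r["target_id"]),
        }
        for r in relationships
    ]
    return out_nodes, rewired
```

The sketch relies on Step 2 having already standardized names, so an exact match on the gene name is enough to detect a duplicate.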
Post-Ingestion Batch Resolution
After all chunks in a job are processed, `resolve_and_label_nodes` scans the database for nodes still labeled `Entity` (or potentially misclassified) and attempts to resolve them:
- Match node name against MONDO disease ontology → relabel as `Disease`
- Match node name against UniProt protein dictionary → relabel as `Protein`
- Standardize the `name` property to the canonical form
- Remove the generic `Entity` label
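The relabeling pass can be illustrated on an in-memory list of nodes. This is a sketch under stated assumptions, not the body of `resolve_and_label_nodes`: nodes are modeled as dictionaries with a `labels` set instead of database records, the function name `resolve_entity_nodes` is hypothetical, and MONDO is consulted before UniProt only because the list above presents the matches in that order.

```python
def resolve_entity_nodes(graph_nodes, mondo, uniprot):
    """Relabel leftover Entity nodes in place; return how many were resolved.

    `mondo` and `uniprot` map lowercase names to canonical names.
    """
    resolved = 0
    for node in graph_nodes:
        if "Entity" not in node["labels"]:
            continue
        key = node["name"].strip().lower()
        if key in mondo:
            label, canonical = "Disease", mondo[key]
        elif key in uniprot:
            label, canonical = "Protein", uniprot[key]
        else:
            continue  # still unresolved; keeps its Entity label
        # Swap the generic label for the specific one and standardize the name.
        node["labels"] = (node["labels"] - {"Entity"}) | {label}
        node["name"] = canonical
        resolved += 1
    return resolved
```

Nodes that match neither dictionary are left untouched, so a later run with an updated ontology can still resolve them.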
Why Not Single-Pass?
| Concern | Single-pass (LLM does everything) | Two-pass (LLM + ontology) |
|---|---|---|
| Typing accuracy | ~80-90% (hallucinations, inconsistency) | ~99% (ontology is ground truth) |
| Canonical naming | Inconsistent across chunks | Deterministic (always maps to canonical) |
| Deduplication | Impossible per-chunk (no global view) | Handled by entity linking + batch resolution |
| New entity types | Requires prompt engineering + hope | Add ontology dictionary + filter rule |
| Cost | Higher (longer prompts to explain all rules) | Lower (LLM prompt stays simple) |
| Debuggability | Opaque (why did the LLM choose X?) | Transparent (ontology match or not) |
The LLM focuses on what it's uniquely good at: understanding language and extracting structure. The ontology layer handles what databases are good at: canonical identity, deduplication, and classification.
Adding New Entity Types
To extend the schema (e.g., adding PTMs, Cell Types, Pathways):
- Update the extraction prompt: add the new type to the allowed `entity_type` list in `KGPipeline._extract_kg_with_llm()`
- Add validation logic: extend `OntologyFilter.filter_nodes()` with an `is_valid_<type>()` check
- Load a reference ontology (optional but recommended): e.g., PSI-MOD for PTMs, Cell Ontology for cell types
- Add a resolver: extend `resolve_and_label_nodes()` to handle the new label
- Update the Neo4j schema: add indexes/constraints for the new label
The architecture scales linearly with new types — each one is just a new ontology dictionary and a few lines of filter logic.
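One way to picture the "new dictionary plus a few lines of filter logic" idea is a registry keyed by label. The sketch below is illustrative only: `ONTOLOGIES`, `register_type`, and `is_valid` are hypothetical names, and the codebase may well use one `is_valid_<type>()` method per type instead of a shared lookup.

```python
# Hypothetical registry: label -> {lowercase name: canonical name}.
ONTOLOGIES = {
    "Disease": {"hypertension": "hypertension"},
    "Protein": {"il-6": "IL6"},
}


def register_type(label, dictionary):
    """Add a new entity type backed by its own reference dictionary."""
    ONTOLOGIES[label] = {name.lower(): canonical for name, canonical in dictionary.items()}


def is_valid(label, name):
    """True if `name` appears in the dictionary registered for `label`."""
    return name.strip().lower() in ONTOLOGIES.get(label, {})


# Adding a cell-type dictionary is a single registration call.
register_type("CellType", {"T cell": "T cell"})
```

With this shape, the validation and resolution code paths do not change when a type is added; only the registry grows.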
Key Files
| File | Role |
|---|---|
| `pipeline/ingest/kg_pipeline.py` | LangGraph chain, LLM extraction prompt, protein linking |
| `pipeline/ingest/ontology_pipeline.py` | Orchestrates filter + link steps, post-ingestion labeling |
| `pipeline/processors/ontology_filter.py` | MONDO/UniProt validation, node filtering, batch relabeling |
| `pipeline/processors/entity_resolver.py` | Multi-strategy deduplication (UniProt ID, synonyms, fuzzy match) |