Entity Consolidation
Intelligent duplicate detection and merging using a multi-strategy pipeline.
Problem
Knowledge graphs accumulate duplicate entities with slight variations:
- "Pulmonary Hypertension (PH)" vs "pulmonary hypertension (PH)"
- "IL-6" vs "IL6" vs "Interleukin 6"
- Different IDs for the same medical concept
Solution: Hybrid Consolidation
Consolidation runs both during and after ingestion:
| Phase | When | Scope | Purpose |
|---|---|---|---|
| Incremental | Every 25 chunks | Name-based only (fast) | Prevent duplicate accumulation |
| Final | After all chunks | Full resolution + ontology | Ensure complete data quality |
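The two phases in the table can be sketched as a simple ingestion loop. This is a minimal illustration, not the pipeline's actual code: `ingest_with_consolidation` and the three callables are hypothetical names, and only the 25-chunk interval comes from the table above.

```python
from typing import Callable, Iterable, List


def ingest_with_consolidation(
    chunks: Iterable[str],
    ingest_chunk: Callable[[str], None],
    consolidate_incremental: Callable[[], None],
    consolidate_final: Callable[[], None],
    chunk_threshold: int = 25,
) -> None:
    """Ingest chunks, consolidating every `chunk_threshold` chunks and once at the end."""
    for count, chunk in enumerate(chunks, start=1):
        ingest_chunk(chunk)
        # Incremental phase: fast, name-based pass at a fixed interval.
        if chunk_threshold > 0 and count % chunk_threshold == 0:
            consolidate_incremental()
    # Final phase: full resolution plus ontology, run once after all chunks.
    consolidate_final()


# Hypothetical usage: record which phase fires while ingesting 60 chunks.
events: List[str] = []
ingest_with_consolidation(
    chunks=[f"chunk-{i}" for i in range(60)],
    ingest_chunk=lambda chunk: None,
    consolidate_incremental=lambda: events.append("incremental"),
    consolidate_final=lambda: events.append("final"),
)
print(events)  # ['incremental', 'incremental', 'final']
```

With 60 chunks and an interval of 25, the incremental pass fires after chunks 25 and 50, and the final pass runs once at the end.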
Strategies (Cascading)
1. UniProt ID Matching (highest confidence)
Merges proteins sharing the same UniProt ID:
uv run python -m pipeline.processors.entity_resolver \
--database olink1 --operation consolidate-uniprot
-- Before: 3 nodes for same protein
(:Protein {name:"TP53", uniprot_id:"P04637"})
(:Protein {name:"p53", uniprot_id:"P04637"})
(:Protein {name:"Tumor Protein P53", uniprot_id:"P04637"})
-- After: 1 canonical node
(:Protein {name:"TP53", uniprot_id:"P04637", synonyms:["p53","Tumor Protein P53"]})
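The before/after example can be reproduced in memory with plain dictionaries. This is a sketch under two assumptions not stated in the source: the first node seen for a given ID becomes the canonical one, and nodes without a `uniprot_id` are left untouched. `merge_by_uniprot_id` is a hypothetical helper, not a function from `entity_resolver`.

```python
from typing import Dict, List


def merge_by_uniprot_id(nodes: List[dict]) -> List[dict]:
    """Collapse nodes sharing a uniprot_id into one node with a synonyms list."""
    canonical: Dict[str, dict] = {}
    merged: List[dict] = []
    for node in nodes:
        uid = node.get("uniprot_id")
        if uid is None:
            merged.append(dict(node))  # no ID: nothing to match on
            continue
        if uid not in canonical:
            # Assumption: the first node seen for an ID is kept as canonical.
            keeper = {**node, "synonyms": list(node.get("synonyms", []))}
            canonical[uid] = keeper
            merged.append(keeper)
        else:
            keeper = canonical[uid]
            name = node["name"]
            # Record the duplicate's name as a synonym, avoiding repeats.
            if name != keeper["name"] and name not in keeper["synonyms"]:
                keeper["synonyms"].append(name)
    return merged


proteins = [
    {"name": "TP53", "uniprot_id": "P04637"},
    {"name": "p53", "uniprot_id": "P04637"},
    {"name": "Tumor Protein P53", "uniprot_id": "P04637"},
]
result = merge_by_uniprot_id(proteins)
print(result)
```

A real merge would also have to re-point relationships from the removed nodes to the canonical one, which this sketch does not cover.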
2. Synonym & Gene Symbol Matching (high confidence)
Merges entities sharing gene symbols or synonyms from CSV dictionaries:
uv run python -m pipeline.processors.entity_resolver \
--database olink1 --operation consolidate-name-features
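Dictionary-based matching can be sketched as a lookup from any known synonym to one canonical name. The CSV layout below (`canonical,synonym` columns) and the helper names are assumptions for illustration; the source does not show the format of the pipeline's dictionaries.

```python
import csv
import io
from typing import Dict

# Hypothetical dictionary content; the real CSV files may use other columns.
SYNONYM_CSV = """canonical,synonym
IL6,IL-6
IL6,Interleukin 6
"""


def load_synonym_index(csv_text: str) -> Dict[str, str]:
    """Map every canonical name and synonym (case-insensitive) to its canonical name."""
    index: Dict[str, str] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        canonical = row["canonical"].strip()
        index[canonical.casefold()] = canonical
        index[row["synonym"].strip().casefold()] = canonical
    return index


def canonical_name(name: str, index: Dict[str, str]) -> str:
    """Return the canonical name if the dictionary knows this name, else the name itself."""
    return index.get(name.strip().casefold(), name)


index = load_synonym_index(SYNONYM_CSV)
names = ["IL-6", "IL6", "interleukin 6", "TP53"]
resolved = [canonical_name(n, index) for n in names]
print(resolved)  # ['IL6', 'IL6', 'IL6', 'TP53']
```

Entities that resolve to the same canonical name are merge candidates; names missing from the dictionary pass through unchanged and are left for the fuzzy stage.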
3. Fuzzy Name Matching (medium confidence)
Combines Levenshtein distance with semantic similarity, using a configurable threshold:
uv run python -m pipeline.processors.entity_resolver \
--database olink1 --operation consolidate --similarity-threshold 0.85
Handles typos, abbreviations, and minor variations.
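The edit-distance half of this strategy can be sketched as follows. This covers only a normalized Levenshtein similarity; the semantic-similarity component and the way the pipeline combines the two scores are not shown in the source, so they are left out. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]: 1 minus distance over the longer length."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    return name_similarity(a, b) >= threshold


print(is_fuzzy_match("Pulmonary Hypertension (PH)", "pulmonary hypertension (PH)"))  # True
print(is_fuzzy_match("Interleukin 6", "Interleukin-6"))  # True
print(is_fuzzy_match("IL-6", "IL6"))  # False
```

The last case shows a limit of pure edit distance: a single character is a quarter of a four-character name, so short abbreviations fall below 0.85 and depend on the synonym strategy instead.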
4. MONDO ID Matching
Links diseases to the MONDO ontology hierarchy and creates PARENT_OF relationships.
Full Pipeline (Recommended)
Runs all strategies in a single command, in this order: UniProt ID → Name features → Fuzzy matching → MONDO labeling → UniProt labeling → Validation.
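The cascade order can be pictured as a list of stages applied in sequence, each receiving the graph left by the previous one. The sketch below only demonstrates the ordering: the stage names mirror the list above, while `make_stub`, `run_cascade`, and the graph dictionary are hypothetical stand-ins for the real operations.

```python
from typing import Callable, Dict, List

# Stage names follow the documented order; the callables are stubs.
STAGE_ORDER = [
    "uniprot_id",
    "name_features",
    "fuzzy_matching",
    "mondo_labeling",
    "uniprot_labeling",
    "validation",
]


def make_stub(log: List[str], name: str) -> Callable[[Dict], Dict]:
    """Build a placeholder stage that records its name and returns the graph unchanged."""
    def stage(graph: Dict) -> Dict:
        log.append(name)
        return graph
    return stage


def run_cascade(graph: Dict, stages: List[Callable[[Dict], Dict]]) -> Dict:
    """Apply each stage to the output of the previous one."""
    for stage in stages:
        graph = stage(graph)
    return graph


log: List[str] = []
stages = [make_stub(log, name) for name in STAGE_ORDER]
run_cascade({"nodes": []}, stages)
print(log)
```

Running the highest-confidence stages first presumably leaves fewer candidate pairs for the fuzzier stages that follow.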
Configuration
Chunk Thresholds
| Strategy | Threshold | Performance | Duplicate Control | Use Case |
|---|---|---|---|---|
| Real-time | 1 | Slow | Excellent | Small datasets |
| Hybrid | 25-50 | Balanced | Good | Production |
| Batch-only | Disabled | Fast | Poor | Development |
Similarity Threshold
- 0.85 (default): conservative, low false positives
- 0.75: more aggressive, catches more duplicates but risks false merges
- 0.90: very conservative, only near-exact matches
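The trade-off between these settings can be seen by filtering one set of candidate pairs at each threshold. The scores below are illustrative values from a plain normalized edit similarity (one edit over the longer name's length), not output from the pipeline, and `pairs_to_merge` is a hypothetical helper.

```python
from typing import List, Tuple

# (name_a, name_b, similarity): illustrative scores, not pipeline output.
candidate_pairs: List[Tuple[str, str, float]] = [
    ("Interleukin 6", "Interleukin-6", 0.92),  # same entity, 1 edit in 13 characters
    ("IL-6", "IL6", 0.75),                     # same entity, 1 edit in 4 characters
    ("TP53", "TP63", 0.75),                    # different proteins, 1 edit in 4 characters
]


def pairs_to_merge(
    pairs: List[Tuple[str, str, float]], threshold: float
) -> List[Tuple[str, str]]:
    """Keep the pairs whose similarity reaches the threshold."""
    return [(a, b) for a, b, score in pairs if score >= threshold]


for threshold in (0.75, 0.85, 0.90):
    print(threshold, pairs_to_merge(candidate_pairs, threshold))
```

At 0.75 all three pairs merge, including the false merge of TP53 with TP63. At 0.85 and 0.90 only the Interleukin pair merges, so the genuine IL-6/IL6 duplicate is missed and has to be caught by synonym matching.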
LLM-Based Relationship Type Consolidation
Merges semantically similar relationship types (e.g., associated_with / related_to / affiliated_with). Preview the changes with a dry run before executing:
# Dry run (safe preview)
uv run python -m pipeline.processors.relationship_type_consolidator \
--database olink1 --dry-run
# Execute
uv run python -m pipeline.processors.relationship_type_consolidator \
--database olink1 --execute
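The dry-run/execute split can be sketched as a function that always reports the planned renames but rewrites edges only when asked to. In the real tool the mapping would come from the LLM; here it is hard-coded, and the function name, mapping, and edge tuples are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, relationship_type, target)

# Hypothetical mapping; in the real tool an LLM proposes these groupings.
TYPE_MAPPING: Dict[str, str] = {
    "related_to": "associated_with",
    "affiliated_with": "associated_with",
}


def consolidate_relationship_types(
    edges: List[Edge], mapping: Dict[str, str], dry_run: bool = True
) -> Tuple[List[Edge], Counter]:
    """Return (edges, plan); edges are rewritten only when dry_run is False."""
    plan = Counter(
        (rel, mapping[rel])
        for _, rel, _ in edges
        if rel in mapping and mapping[rel] != rel
    )
    if dry_run:
        return list(edges), plan  # preview only: edges unchanged
    rewritten = [(src, mapping.get(rel, rel), dst) for src, rel, dst in edges]
    return rewritten, plan


edges = [
    ("A", "related_to", "B"),
    ("C", "affiliated_with", "D"),
    ("E", "associated_with", "F"),
]
preview_edges, plan = consolidate_relationship_types(edges, TYPE_MAPPING, dry_run=True)
new_edges, _ = consolidate_relationship_types(edges, TYPE_MAPPING, dry_run=False)
print(dict(plan))
print(new_edges)
```

Reviewing the plan before executing matters here because a rename is hard to undo once the original relationship types have been overwritten.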
Performance
| Phase | Time | Scope |
|---|---|---|
| Incremental | 2-5 seconds | Name-based matching |
| Final | 30-60 seconds | Full fuzzy + hierarchy |
| MONDO load | ~5 seconds | 26,284 ontology entries |
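These figures allow a rough estimate of the total consolidation overhead for a run. The sketch assumes the per-pass times in the table stay constant as the graph grows, which may not hold for large ingestions, and it ignores the MONDO load. `estimate_overhead_seconds` is a hypothetical helper.

```python
from typing import Tuple


def estimate_overhead_seconds(
    total_chunks: int,
    chunk_threshold: int = 25,
    incremental_range: Tuple[float, float] = (2.0, 5.0),
    final_range: Tuple[float, float] = (30.0, 60.0),
) -> Tuple[int, float, float]:
    """Return (incremental passes, low estimate, high estimate) in seconds."""
    passes = total_chunks // chunk_threshold
    low = passes * incremental_range[0] + final_range[0]
    high = passes * incremental_range[1] + final_range[1]
    return passes, low, high


passes, low, high = estimate_overhead_seconds(total_chunks=1000)
print(passes, low, high)  # 40 110.0 260.0
```

For 1,000 chunks at the 25-chunk interval this gives 40 incremental passes and roughly 110 to 260 seconds of consolidation in total.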
Key Files
| File | Role |
|---|---|
| pipeline/processors/entity_resolver.py | Multi-strategy deduplication |
| pipeline/processors/ontology_filter.py | MONDO/UniProt validation and labeling |
| pipeline/processors/incremental_consolidation.py | During-ingestion consolidation |
| pipeline/processors/relationship_type_consolidator.py | LLM-based relationship type merging |