Entity Consolidation

Detects and merges duplicate entities using a cascading, multi-strategy pipeline.

Problem

Knowledge graphs accumulate duplicate entities with slight variations:

"Pulmonary Hypertension (PH)" vs "pulmonary hypertension (PH)"
"IL-6" vs "IL6" vs "Interleukin 6"
Different IDs for the same medical concept
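
The variations above can be illustrated with a small normalization sketch. The function name `normalize_name` and its rules (lowercasing, dropping parenthetical abbreviations, stripping punctuation) are assumptions chosen for illustration, not the pipeline's actual logic:

```python
import re


def normalize_name(name: str) -> str:
    """Build a comparison key (hypothetical helper, not part of the pipeline)."""
    key = name.lower()
    key = re.sub(r"\s*\([^)]*\)", "", key)  # drop "(PH)"-style abbreviations
    key = re.sub(r"[^a-z0-9]+", "", key)    # remove hyphens, spaces, punctuation
    return key


pairs = [
    ("Pulmonary Hypertension (PH)", "pulmonary hypertension (PH)"),
    ("IL-6", "IL6"),
    ("IL-6", "Interleukin 6"),
]
for left, right in pairs:
    same = normalize_name(left) == normalize_name(right)
    print(f"{left!r} vs {right!r}: same key = {same}")
```

Case and punctuation variants collapse to one key, but "IL-6" and "Interleukin 6" do not, which is why purely name-based normalization is not enough and ID or synonym matching is also needed.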

Solution: Hybrid Consolidation

Consolidation runs both during and after ingestion:

| Phase | When | Scope | Purpose |
|---|---|---|---|
| Incremental | Every 25 chunks | Name-based only (fast) | Prevent duplicate accumulation |
| Final | After all chunks | Full resolution + ontology | Ensure complete data quality |
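
The two phases can be pictured as a simple ingestion loop. This is a minimal sketch: `ingest_with_consolidation` and its callback parameters are hypothetical names, and the real scheduling in `incremental_consolidation.py` may differ:

```python
from typing import Callable, Iterable


def ingest_with_consolidation(
    chunks: Iterable[str],
    ingest: Callable[[str], None],
    incremental: Callable[[], None],
    final: Callable[[], None],
    every: int = 25,
) -> int:
    """Ingest chunks, consolidating every `every` chunks and once at the end.

    Hypothetical sketch; `every=0` disables the incremental phase.
    """
    count = 0
    for count, chunk in enumerate(chunks, start=1):
        ingest(chunk)
        if every > 0 and count % every == 0:
            incremental()  # fast, name-based pass
    final()                # full resolution + ontology pass
    return count
```

Passing `every=1` corresponds to consolidating after each chunk, and `every=0` to consolidating only once at the end.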

Strategies (Cascading)

1. UniProt ID Matching (highest confidence)

Merges proteins sharing the same UniProt ID:

uv run python -m pipeline.processors.entity_resolver \
  --database olink1 --operation consolidate-uniprot

// Before: 3 nodes for the same protein
(:Protein {name:"TP53", uniprot_id:"P04637"})
(:Protein {name:"p53", uniprot_id:"P04637"})
(:Protein {name:"Tumor Protein P53", uniprot_id:"P04637"})

// After: 1 canonical node
(:Protein {name:"TP53", uniprot_id:"P04637", synonyms:["p53","Tumor Protein P53"]})
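
The before/after example can be reproduced in plain Python. This sketch works on dictionaries rather than graph nodes, and it assumes the first node seen for an ID becomes the canonical one; how the pipeline actually chooses the canonical name is not documented here:

```python
from collections import defaultdict


def merge_by_uniprot_id(nodes):
    """Group nodes by UniProt ID and fold the extra names into `synonyms`.

    Hypothetical helper. Assumption: the first node per ID is canonical.
    Nodes without a UniProt ID are returned unchanged.
    """
    groups = defaultdict(list)
    passthrough = []
    for node in nodes:
        uniprot_id = node.get("uniprot_id")
        if uniprot_id:
            groups[uniprot_id].append(node)
        else:
            passthrough.append(node)

    merged = []
    for group in groups.values():
        canonical, *others = group
        synonyms = list(canonical.get("synonyms", []))
        for other in others:
            for name in (other["name"], *other.get("synonyms", [])):
                if name != canonical["name"] and name not in synonyms:
                    synonyms.append(name)
        merged.append({**canonical, "synonyms": synonyms})
    return merged + passthrough


nodes = [
    {"name": "TP53", "uniprot_id": "P04637"},
    {"name": "p53", "uniprot_id": "P04637"},
    {"name": "Tumor Protein P53", "uniprot_id": "P04637"},
]
print(merge_by_uniprot_id(nodes))
```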

2. Synonym & Gene Symbol Matching (high confidence)

Merges entities sharing gene symbols or synonyms from CSV dictionaries:

uv run python -m pipeline.processors.entity_resolver \
  --database olink1 --operation consolidate-name-features
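
Dictionary-based matching amounts to a lookup from any known name to one canonical symbol. The sketch below assumes a two-column CSV layout (`symbol,synonym`); that layout, the sample rows, and the function names are illustrative assumptions, not the format of the project's dictionaries:

```python
import csv
import io

# Hypothetical dictionary content; the real CSV files may use other columns.
SAMPLE_CSV = """symbol,synonym
IL6,IL-6
IL6,Interleukin 6
TP53,p53
"""


def load_synonym_index(csv_text: str) -> dict:
    """Map every symbol and synonym (case-insensitively) to its symbol."""
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        symbol = row["symbol"].strip()
        index[symbol.casefold()] = symbol
        index[row["synonym"].strip().casefold()] = symbol
    return index


def canonical_symbol(name: str, index: dict):
    """Return the canonical symbol for a name, or None if it is unknown."""
    return index.get(name.strip().casefold())


index = load_synonym_index(SAMPLE_CSV)
for name in ("IL-6", "interleukin 6", "IL6", "BRCA1"):
    print(name, "->", canonical_symbol(name, index))
```

Two entities that resolve to the same symbol are merge candidates; names absent from the dictionary fall through to the next strategy.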

3. Fuzzy Name Matching (medium confidence)

Scores candidate pairs by Levenshtein distance plus semantic similarity and merges those that clear a configurable threshold:

uv run python -m pipeline.processors.entity_resolver \
  --database olink1 --operation consolidate --similarity-threshold 0.85

Handles typos, abbreviations, and minor variations.
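
The Levenshtein half of this strategy can be sketched as follows. The normalization used here (one minus edit distance divided by the longer length) is an assumption, the semantic-similarity component is omitted, and all function names are hypothetical:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when the similarity reaches the threshold (assumed inclusive)."""
    return name_similarity(a, b) >= threshold


print(is_fuzzy_match("Pulmonary Hypertension", "Pulmonary Hypertention"))
print(is_fuzzy_match("IL-6", "IL6"))
print(is_fuzzy_match("IL-6", "IL6", threshold=0.75))
```

Under this scoring a single typo in a long name still clears 0.85, while one edit in a four-character name only reaches 0.75, so short symbols depend on the ID and synonym strategies or on a lower threshold.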

4. MONDO ID Matching

Links diseases to the MONDO ontology hierarchy and creates PARENT_OF relationships:

uv run python -m pipeline.processors.ontology_filter \
  --database olink1 --operation label_diseases
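
Creating PARENT_OF relationships from an ontology can be sketched as a lookup over a child-to-parents map. The identifiers below are placeholders rather than real MONDO IDs, and restricting edges to parents already present in the graph is an assumption about the pipeline's behavior:

```python
def parent_of_edges(graph_ids: set, parents: dict) -> list:
    """Return (parent, "PARENT_OF", child) triples for diseases in the graph.

    Hypothetical helper. `parents` maps an ontology ID to its direct parent IDs.
    Assumption: an edge is only created when both ends exist in the graph.
    """
    edges = []
    for child in sorted(graph_ids):
        for parent in parents.get(child, []):
            if parent in graph_ids:
                edges.append((parent, "PARENT_OF", child))
    return edges


# Placeholder IDs for illustration only.
ontology_parents = {
    "MONDO:child-1": ["MONDO:parent"],
    "MONDO:child-2": ["MONDO:parent", "MONDO:not-in-graph"],
}
in_graph = {"MONDO:parent", "MONDO:child-1", "MONDO:child-2"}
for edge in parent_of_edges(in_graph, ontology_parents):
    print(edge)
```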

Full Pipeline (Recommended)

Run all strategies in one command:

uv run python -m pipeline.processors.entity_resolver \
  --database olink1 --operation full

Executes: UniProt ID → Name features → Fuzzy matching → MONDO labeling → UniProt labeling → Validation.
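
A cascade of this kind can be modeled as an ordered list of stages, each receiving the output of the previous one. The runner and the two stand-in stages below are hypothetical and only demonstrate the ordering idea, with the strictest check first:

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[List[str]], List[str]]]


def run_cascade(entities: List[str], stages: List[Stage]):
    """Apply stages in order and report how many entities each one removed."""
    report = []
    for name, stage in stages:
        before = len(entities)
        entities = stage(entities)
        report.append((name, before - len(entities)))
    return entities, report


def drop_exact_duplicates(names: List[str]) -> List[str]:
    """Stand-in for a high-confidence stage: exact string equality."""
    return list(dict.fromkeys(names))


def drop_case_variants(names: List[str]) -> List[str]:
    """Stand-in for a lower-confidence stage: case-insensitive equality."""
    seen = {}
    for name in names:
        seen.setdefault(name.casefold(), name)
    return list(seen.values())


remaining, report = run_cascade(
    ["TP53", "TP53", "tp53", "IL6"],
    [("exact", drop_exact_duplicates), ("case-insensitive", drop_case_variants)],
)
print(remaining, report)
```

Running the strict stage first means the looser stage only sees candidates that the strict one could not resolve.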

Configuration

Chunk Thresholds

| Strategy | Threshold (chunks) | Performance | Duplicate Control | Use Case |
|---|---|---|---|---|
| Real-time | 1 | Slow | Excellent | Small datasets |
| Hybrid | 25-50 | Balanced | Good | Production |
| Batch-only | Disabled | Fast | Poor | Development |

Similarity Threshold

0.75: more aggressive, catches more duplicates but risks false merges
0.85 (default): conservative, low false positives
0.90: very conservative, only near-exact matches

LLM-Based Relationship Type Consolidation

Uses an LLM to merge semantically similar relationship types (e.g., associated_with / related_to / affiliated_with). Preview the proposed merges with a dry run before executing:

# Dry run (safe preview)
uv run python -m pipeline.processors.relationship_type_consolidator \
  --database olink1 --dry-run

# Execute
uv run python -m pipeline.processors.relationship_type_consolidator \
  --database olink1 --execute
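
The dry-run and execute modes can be sketched as one function that always computes a plan and only mutates data when asked. In this sketch the mapping is hard-coded where the real tool would obtain it from an LLM, and the function name and edge representation are assumptions:

```python
def consolidate_relationship_types(edges: list, mapping: dict, execute: bool = False) -> dict:
    """Plan (and optionally apply) relationship type renames.

    Hypothetical helper. `edges` is a list of (source, type, target) tuples and
    `mapping` maps a redundant type to its canonical type. Returns the number
    of edges affected per redundant type; edges change only if `execute` is True.
    """
    plan = {}
    for _, rel_type, _ in edges:
        if mapping.get(rel_type, rel_type) != rel_type:
            plan[rel_type] = plan.get(rel_type, 0) + 1
    if execute:
        edges[:] = [(src, mapping.get(rel, rel), dst) for src, rel, dst in edges]
    return plan


edges = [
    ("A", "related_to", "B"),
    ("A", "associated_with", "C"),
    ("B", "affiliated_with", "C"),
]
# Stand-in for an LLM-proposed mapping.
mapping = {"related_to": "associated_with", "affiliated_with": "associated_with"}

print(consolidate_relationship_types(edges, mapping))                # preview only
print(consolidate_relationship_types(edges, mapping, execute=True))  # applies renames
print(edges)
```

Returning the same plan in both modes lets the preview be compared against what was actually applied.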

Performance

| Phase | Time | Scope |
|---|---|---|
| Incremental | 2-5 seconds | Name-based matching |
| Final | 30-60 seconds | Full fuzzy + hierarchy |
| MONDO load | ~5 seconds | 26,284 ontology entries |

Key Files

| File | Role |
|---|---|
| `pipeline/processors/entity_resolver.py` | Multi-strategy deduplication |
| `pipeline/processors/ontology_filter.py` | MONDO/UniProt validation and labeling |
| `pipeline/processors/incremental_consolidation.py` | During-ingestion consolidation |
| `pipeline/processors/relationship_type_consolidator.py` | LLM-based relationship type merging |