Entity Consolidation

Intelligent duplicate detection and merging using a multi-strategy pipeline.

Problem

Knowledge graphs accumulate duplicate entities with slight variations:

  • "Pulmonary Hypertension (PH)" vs "pulmonary hypertension (PH)"
  • "IL-6" vs "IL6" vs "Interleukin 6"
  • Different IDs for the same medical concept
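The first two kinds of variation above (case and punctuation) can be caught by comparing normalized keys. The following is a minimal sketch, not the pipeline's actual implementation; `normalize_name` is a hypothetical helper introduced here for illustration.

```python
import re

def normalize_name(name: str) -> str:
    """Collapse case, hyphens, underscores, and whitespace into one comparison key."""
    key = name.casefold()
    return re.sub(r"[-_\s]+", "", key)

names = [
    "Pulmonary Hypertension (PH)",
    "pulmonary hypertension (PH)",
    "IL-6",
    "IL6",
    "Interleukin 6",
]

# Group the raw names by their normalized key.
groups = {}
for raw in names:
    groups.setdefault(normalize_name(raw), []).append(raw)

for key, members in groups.items():
    print(key, members)
```

Normalization alone groups "IL-6" with "IL6" but leaves "Interleukin 6" on its own, which is why the later strategies (synonym dictionaries and fuzzy matching) are still needed.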

Solution: Hybrid Consolidation

Consolidation runs both during and after ingestion:

| Phase | When | Scope | Purpose |
|---|---|---|---|
| Incremental | Every 25 chunks | Name-based only (fast) | Prevent duplicate accumulation |
| Final | After all chunks | Full resolution + ontology | Ensure complete data quality |

Strategies (Cascading)

1. UniProt ID Matching (highest confidence)

Merges proteins sharing the same UniProt ID:

uv run python -m pipeline.processors.entity_resolver \
  --database olink1 --operation consolidate-uniprot

-- Before: 3 nodes for the same protein
(:Protein {name:"TP53", uniprot_id:"P04637"})
(:Protein {name:"p53", uniprot_id:"P04637"})
(:Protein {name:"Tumor Protein P53", uniprot_id:"P04637"})

-- After: 1 canonical node
(:Protein {name:"TP53", uniprot_id:"P04637", synonyms:["p53","Tumor Protein P53"]})
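The before/after above can be reproduced in memory with a short sketch. It assumes nodes are plain dictionaries and that the first node seen for a given ID becomes the canonical one; the real resolver may choose the canonical name differently, and `merge_by_uniprot` is a hypothetical name.

```python
def merge_by_uniprot(nodes):
    """Merge nodes that share a uniprot_id; other names become synonyms."""
    canonical = {}    # uniprot_id -> merged node
    passthrough = []  # nodes without a UniProt ID are left for later strategies
    for node in nodes:
        uid = node.get("uniprot_id")
        if not uid:
            passthrough.append(dict(node))
            continue
        if uid not in canonical:
            # Assumption: the first node seen for an ID is kept as canonical.
            canonical[uid] = {**node, "synonyms": list(node.get("synonyms", []))}
            continue
        target = canonical[uid]
        for alias in [node["name"], *node.get("synonyms", [])]:
            if alias != target["name"] and alias not in target["synonyms"]:
                target["synonyms"].append(alias)
    return list(canonical.values()) + passthrough

proteins = [
    {"name": "TP53", "uniprot_id": "P04637"},
    {"name": "p53", "uniprot_id": "P04637"},
    {"name": "Tumor Protein P53", "uniprot_id": "P04637"},
]
merged = merge_by_uniprot(proteins)
print(merged)
```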

2. Synonym & Gene Symbol Matching (high confidence)

Merges entities sharing gene symbols or synonyms from CSV dictionaries:

uv run python -m pipeline.processors.entity_resolver \
  --database olink1 --operation consolidate-name-features
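As an illustration of dictionary-based matching, the sketch below builds a lookup from a synonym CSV. The column names (`symbol`, `synonyms`), the `|` separator, and the sample rows are assumptions made for this example; the pipeline's actual dictionary files may be laid out differently.

```python
import csv
import io

# Hypothetical dictionary layout: one canonical symbol per row,
# with synonyms separated by "|". Sample rows are for illustration only.
SYNONYM_CSV = """symbol,synonyms
IL6,IL-6|Interleukin 6|Interleukin-6
TP53,p53|Tumor Protein P53
"""

def load_synonym_index(handle):
    """Map every symbol and synonym (case-insensitively) to its canonical symbol."""
    index = {}
    for row in csv.DictReader(handle):
        symbol = row["symbol"].strip()
        index[symbol.casefold()] = symbol
        for synonym in row["synonyms"].split("|"):
            synonym = synonym.strip()
            if synonym:
                index[synonym.casefold()] = symbol
    return index

def canonical_symbol(name, index):
    """Return the canonical symbol for a name, or None if it is not in the dictionary."""
    return index.get(name.strip().casefold())

index = load_synonym_index(io.StringIO(SYNONYM_CSV))
print(canonical_symbol("Interleukin 6", index))
print(canonical_symbol("p53", index))
```

Two entities whose names resolve to the same canonical symbol are candidates for merging; names missing from the dictionary fall through to the next strategy.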

3. Fuzzy Name Matching (medium confidence)

Combines Levenshtein distance with semantic similarity and merges pairs that score at or above a configurable threshold:

uv run python -m pipeline.processors.entity_resolver \
  --database olink1 --operation consolidate --similarity-threshold 0.85

Handles typos, abbreviations, and minor variations.
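A minimal sketch of the edit-distance half of this strategy follows. It assumes similarity is computed as 1 minus the Levenshtein distance divided by the longer name's length, which is one common normalization and not necessarily the one the resolver uses; the semantic-similarity component is omitted entirely.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance using two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical names."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    return name_similarity(a, b) >= threshold

print(is_fuzzy_match("Pulmonary Hypertension", "Pulmonary Hypertention"))
print(is_fuzzy_match("IL-6", "IL6"))
print(is_fuzzy_match("IL-6", "IL6", threshold=0.75))
```

Under this normalization, short names are penalized heavily by a single edit: "IL-6" and "IL6" score 0.75, so they only merge at a lower threshold or via the earlier name-based strategies.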

4. MONDO ID Matching

Links diseases to ontology hierarchy and creates PARENT_OF relationships:

uv run python -m pipeline.processors.ontology_filter \
  --database olink1 --operation label_diseases
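How hierarchy edges might be derived once diseases carry ontology IDs is sketched below. The IDs are placeholders, not real MONDO identifiers, and two assumptions are made: the edge points from parent to child, and edges are only created when both diseases already exist in the graph.

```python
# Placeholder child -> parents map; a real run would load this from the MONDO ontology.
PARENTS = {
    "MONDO:child-1": ["MONDO:parent-1"],
    "MONDO:parent-1": ["MONDO:root"],
}

def parent_of_edges(disease_ids, parents):
    """Return (parent, "PARENT_OF", child) triples for diseases present in the graph."""
    present = set(disease_ids)
    edges = []
    for child in disease_ids:
        for parent in parents.get(child, []):
            if parent in present:  # skip ancestors that are not in the graph
                edges.append((parent, "PARENT_OF", child))
    return edges

edges = parent_of_edges(["MONDO:child-1", "MONDO:parent-1"], PARENTS)
print(edges)
```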

Run all strategies in one command:

uv run python -m pipeline.processors.entity_resolver \
  --database olink1 --operation full

Executes: UniProt ID → Name features → Fuzzy matching → MONDO labeling → UniProt labeling → Validation.

Configuration

Chunk Thresholds

| Strategy | Threshold (chunks) | Performance | Duplicate Control | Use Case |
|---|---|---|---|---|
| Real-time | 1 | Slow | Excellent | Small datasets |
| Hybrid | 25-50 | Balanced | Good | Production |
| Batch-only | Disabled | Fast | Poor | Development |
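The effect of the chunk threshold can be sketched as a schedule of consolidation passes. `plan_consolidation` and its `consolidate_every` parameter are hypothetical names for illustration; passing `None` stands in for the "Disabled" setting.

```python
def plan_consolidation(total_chunks, consolidate_every=25):
    """List the consolidation passes that would run during and after ingestion."""
    events = []
    for count in range(1, total_chunks + 1):
        # An incremental (name-based) pass fires every `consolidate_every` chunks.
        if consolidate_every and count % consolidate_every == 0:
            events.append(("incremental", count))
    # The final (full) pass always runs once after the last chunk.
    events.append(("final", total_chunks))
    return events

print(plan_consolidation(60, consolidate_every=25))    # hybrid
print(plan_consolidation(3, consolidate_every=1))      # real-time
print(plan_consolidation(3, consolidate_every=None))   # batch-only
```

A threshold of 1 triggers a pass after every chunk, which explains the slower throughput, while disabling it defers all deduplication to the final pass.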

Similarity Threshold

  • 0.85 (default) — conservative, low false positives
  • 0.75 — more aggressive, catches more duplicates but risks false merges
  • 0.90 — very conservative, only near-exact matches

LLM-Based Relationship Type Consolidation

Merges semantically similar relationship types (e.g., associated_with / related_to / affiliated_with). Preview the proposed merges with a dry run before executing:

# Dry run (safe preview)
uv run python -m pipeline.processors.relationship_type_consolidator \
  --database olink1 --dry-run

# Execute
uv run python -m pipeline.processors.relationship_type_consolidator \
  --database olink1 --execute
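The difference between the two modes can be sketched as follows. The mapping from variant to canonical type is hard-coded here as a stand-in for whatever the LLM proposes, and choosing `associated_with` as the canonical type is an assumption; `consolidate_relationship_types` is a hypothetical function, not the module's API.

```python
def consolidate_relationship_types(edges, mapping, dry_run=True):
    """edges: (source, rel_type, target) triples; mapping: variant -> canonical type."""
    # Count how many edges each rename would touch.
    planned = {}
    for _, rel_type, _ in edges:
        canonical = mapping.get(rel_type)
        if canonical and canonical != rel_type:
            planned[rel_type] = planned.get(rel_type, 0) + 1
    if dry_run:
        return list(edges), planned  # report only; the graph is unchanged
    rewritten = [(s, mapping.get(r, r), t) for s, r, t in edges]
    # Renaming can create duplicate edges; drop them while preserving order.
    return list(dict.fromkeys(rewritten)), planned

edges = [
    ("A", "related_to", "B"),
    ("A", "associated_with", "B"),
    ("C", "affiliated_with", "D"),
]
mapping = {"related_to": "associated_with", "affiliated_with": "associated_with"}

preview, plan = consolidate_relationship_types(edges, mapping, dry_run=True)
applied, _ = consolidate_relationship_types(edges, mapping, dry_run=False)
print(plan)
print(applied)
```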

Performance

| Phase | Time | Scope |
|---|---|---|
| Incremental | 2-5 seconds | Name-based matching |
| Final | 30-60 seconds | Full fuzzy + hierarchy |
| MONDO load | ~5 seconds | 26,284 ontology entries |

Key Files

| File | Role |
|---|---|
| pipeline/processors/entity_resolver.py | Multi-strategy deduplication |
| pipeline/processors/ontology_filter.py | MONDO/UniProt validation and labeling |
| pipeline/processors/incremental_consolidation.py | During-ingestion consolidation |
| pipeline/processors/relationship_type_consolidator.py | LLM-based relationship type merging |