Knowledge Graph Creation Guide¶
Complete workflow for building a production-ready knowledge graph with non-redundant consolidated relationships.
Overview¶
The workflow produces a graph with the following properties:
- Non-redundant entities: Multiple mentions consolidated into single canonical nodes
- Feature-rich relationships: All associations between node pairs merged with preserved metadata
- Biological accuracy: Disease ontologies (MONDO) and protein dictionaries (UniProt) for standardization
- Query-ready: Vector embeddings for semantic search
flowchart TD
A["Disease Ontology CSV<br/>Protein Dictionaries CSV"] --> B["Ingest Foundation Data<br/>(IDs & Synonyms)"]
B --> C["Literature Extraction<br/>(PubMed / bioRxiv / PMC / PDF)"]
C --> D["Entity Consolidation<br/>UniProt ID · Synonyms · Fuzzy"]
D --> E["Relationship Consolidation<br/>Merge duplicates · Preserve evidence"]
E --> F["Vector Embeddings"]
F --> G["Query-Ready Knowledge Graph"]
Prerequisites¶
# Python 3.12+, uv package manager
uv sync
# Neo4j (Docker)
docker run -d --name neo4j \
-p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/your_password \
neo4j:5.13
Environment (.env):
Phase 1: Data Preparation¶
Organize source data:
data/
├── ontologies/
│ └── mondo_disease_ontology.csv # MONDO IDs, names, hierarchies
├── proteins/
│ ├── uniprot_protein_dictionary.csv # UniProt IDs, names, gene symbols
│ └── protein_synonyms.csv # Alternative protein names
└── pubmed/
└── queries.txt # Search terms for abstract retrieval
Expected CSV formats:
# Disease ontology
mondo_id,disease_name,synonyms,parent_id,description
MONDO:0005148,Type 2 Diabetes,"T2D|Diabetes Mellitus Type 2|NIDDM",MONDO:0005015,Non-insulin dependent diabetes
# Protein dictionary
uniprot_id,protein_name,gene_symbol,synonyms,function
P04637,TP53,TP53,"Tumor Protein P53|p53|Antigen NY-CO-13",Tumor suppressor
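Both files store synonyms as a single pipe-delimited field, so a loader has to split that field after normal CSV parsing. The following is a minimal sketch of that step, not the project's own loader: `load_proteins` is a hypothetical helper name, and the sample row is copied from the format above.

```python
import csv
import io

# Sample data copied from the protein dictionary format above.
# A real run would pass an open file handle instead of a StringIO.
SAMPLE = '''uniprot_id,protein_name,gene_symbol,synonyms,function
P04637,TP53,TP53,"Tumor Protein P53|p53|Antigen NY-CO-13",Tumor suppressor
'''


def load_proteins(handle):
    """Yield one dict per row, with the synonyms field split on '|'.

    Hypothetical helper for illustration; the real ingestion code may differ.
    """
    for row in csv.DictReader(handle):
        raw = row.get("synonyms") or ""
        row["synonyms"] = [s.strip() for s in raw.split("|") if s.strip()]
        yield row


proteins = list(load_proteins(io.StringIO(SAMPLE)))
print(proteins[0]["uniprot_id"], proteins[0]["synonyms"])
```

Using `csv.DictReader` rather than splitting on commas keeps quoted fields intact, which matters if a synonym or description contains a comma.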
Phase 2: CSV Foundation¶
Create canonical entities with proper IDs and synonyms before adding literature data:
export DATABASE="olink1"
# Disease ontology (MONDO IDs)
uv run python enrichment_main.py \
--file data/ontologies/mondo_disease_ontology.csv \
--database $DATABASE \
--column-handlers "0:mondo-id,1:disease-name,2:synonyms,3:parent-id,4:prop"
# Protein dictionaries (UniProt IDs)
uv run python enrichment_main.py \
--file data/proteins/uniprot_protein_dictionary.csv \
--database $DATABASE \
--column-handlers "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms,4:prop"
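The `--column-handlers` value pairs each zero-based column index with a handler name. As an illustration of that format only, the sketch below parses such a string into a mapping. `parse_column_handlers` is a hypothetical function, and how `enrichment_main.py` actually parses or validates the flag is an assumption.

```python
def parse_column_handlers(spec: str) -> dict[int, str]:
    """Parse '0:mondo-id,1:disease-name' into {0: 'mondo-id', 1: 'disease-name'}.

    Hypothetical helper: it only illustrates the 'index:handler' format
    used by the commands above.
    """
    handlers: dict[int, str] = {}
    for item in spec.split(","):
        index_text, sep, name = item.strip().partition(":")
        if not sep or not name:
            raise ValueError(f"expected 'index:handler', got {item!r}")
        index = int(index_text)  # raises ValueError for a non-numeric index
        if index in handlers:
            raise ValueError(f"column {index} is mapped twice")
        handlers[index] = name
    return handlers


handlers = parse_column_handlers(
    "0:mondo-id,1:disease-name,2:synonyms,3:parent-id,4:prop"
)
print(handlers)
```

Checking the mapping against the CSV header before a long import is a cheap way to catch a shifted column.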
Verify:
MATCH (d:Disease) RETURN d.mondo_id, d.name, d.synonyms LIMIT 10;
MATCH (p:Protein) RETURN p.uniprot_id, p.name, p.gene_symbol LIMIT 10;
Phase 3: Literature Ingestion¶
# Single query
uv run python ingest_main.py \
--search-term "cardiovascular disease protein biomarker" \
--max-results 1000 --database $DATABASE --service bedrock
# Batch from queries file
uv run python ingest_main.py \
--queries-file data/pubmed/queries.txt \
--max-results 500 --database $DATABASE --service bedrock
After ingestion you'll have raw entities with duplicates and unmerged relationships — this is expected.
Phase 4: Entity Consolidation¶
Run the full multi-strategy pipeline:
uv run python -m pipeline.processors.entity_resolver \
--database $DATABASE --operation full
This executes (in order):
- UniProt ID consolidation: exact match on uniprot_id
- Name features consolidation: synonym/gene symbol matching
- Fuzzy matching: 85% similarity threshold
- MONDO ID labeling: disease ontology linking
- UniProt ID labeling: protein dictionary linking
- Quality validation
See Entity Consolidation for strategy details.
Phase 5: Relationship Consolidation¶
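Merge duplicate edges between the same node pair while preserving all evidence (PMIDs, confidence scores, temporal tracking):
uv run python ingest_main.py --consolidate-relationships --database $DATABASE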
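The idea behind relationship consolidation can be sketched in plain Python: group edges by (source, type, target), then pool their evidence into one edge. This is an illustrative sketch under assumptions, not the pipeline's implementation. The output property names `evidence_sources` and `avg_confidence` mirror the example queries in this guide, while the input field names (`pmid`, `confidence`, `published`), the `first_seen` and `last_seen` properties, and all sample values are placeholders.

```python
from collections import defaultdict


def consolidate(edges: list[dict]) -> list[dict]:
    """Merge edges sharing (source, type, target) into one edge with pooled evidence."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for edge in edges:
        groups[(edge["source"], edge["type"], edge["target"])].append(edge)

    merged = []
    for (source, rel_type, target), group in groups.items():
        pmids = sorted({e["pmid"] for e in group})  # de-duplicated evidence
        dates = [e["published"] for e in group]     # ISO dates sort as strings
        merged.append({
            "source": source,
            "type": rel_type,
            "target": target,
            "pmids": pmids,
            "evidence_sources": len(pmids),
            # Mean over all merged edges, including repeats of one PMID.
            "avg_confidence": sum(e["confidence"] for e in group) / len(group),
            "first_seen": min(dates),
            "last_seen": max(dates),
        })
    return merged


# Placeholder edges, not real literature data.
raw_edges = [
    {"source": "protein-A", "type": "ASSOCIATES_WITH", "target": "disease-X",
     "pmid": "PMID-placeholder-1", "confidence": 0.9, "published": "2019-03-01"},
    {"source": "protein-A", "type": "ASSOCIATES_WITH", "target": "disease-X",
     "pmid": "PMID-placeholder-2", "confidence": 0.8, "published": "2021-07-15"},
    {"source": "protein-B", "type": "ASSOCIATES_WITH", "target": "disease-X",
     "pmid": "PMID-placeholder-3", "confidence": 0.7, "published": "2020-01-10"},
]
consolidated = consolidate(raw_edges)
for edge in consolidated:
    print(edge["source"], edge["target"], edge["evidence_sources"])
```

Keeping the full PMID list alongside the counts means the aggregate properties can be recomputed later without re-ingesting the literature.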
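Phase 6: Vector Embeddings¶
Generate vector embeddings for graph nodes so the graph supports semantic search:
uv run python ingest_main.py --add-graph-embeddings --database $DATABASE --service bedrock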
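Semantic search over embeddings usually means ranking nodes by vector similarity to a query embedding. The sketch below shows that ranking with cosine similarity on toy three-dimensional vectors. It assumes cosine similarity as the metric; the embedding model, its dimensionality, and how the API actually ranks results are not specified in this guide, and `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_k(query: list[float], embeddings: dict[str, list[float]], k: int = 3):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real node embeddings.
toy_embeddings = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.9, 0.1, 0.0],
    "node-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], toy_embeddings, k=2)
print(results)
```

At production scale this linear scan would normally be replaced by a vector index, but the ranking it computes is the same.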
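Phase 7: Validation¶
Run the resolver's validation pass to check the consolidated graph:
uv run python -m pipeline.processors.entity_resolver --database $DATABASE --operation validate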
Complete Workflow Script¶
#!/bin/bash
# Stop on the first failed command, unset variable, or failed pipe stage
set -euo pipefail
export DATABASE="olink1"
# Phase numbers match the sections of this guide (Phase 1 is manual data preparation)
echo "=== Phase 2: CSV Foundation ==="
uv run python enrichment_main.py \
--file data/ontologies/mondo_disease_ontology.csv \
--database "$DATABASE" --column-handlers "0:mondo-id,1:disease-name,2:synonyms,3:parent-id,4:prop"
uv run python enrichment_main.py \
--file data/proteins/uniprot_protein_dictionary.csv \
--database "$DATABASE" --column-handlers "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms,4:prop"
echo "=== Phase 3: Literature Ingestion ==="
uv run python ingest_main.py \
--queries-file data/pubmed/queries.txt \
--max-results 500 --database "$DATABASE" --service bedrock
echo "=== Phase 4: Entity Consolidation ==="
uv run python -m pipeline.processors.entity_resolver \
--database "$DATABASE" --operation full
echo "=== Phase 5: Relationship Consolidation ==="
uv run python ingest_main.py --consolidate-relationships --database "$DATABASE"
echo "=== Phase 6: Embeddings ==="
uv run python ingest_main.py --add-graph-embeddings --database "$DATABASE" --service bedrock
echo "=== Phase 7: Validation ==="
uv run python -m pipeline.processors.entity_resolver --database "$DATABASE" --operation validate
echo "✓ Knowledge graph creation complete!"
Querying the Graph¶
# Start API server
uv run granian --interface asgi api.app:app --host 127.0.0.1 --port 8000 --reload
Example Cypher queries:
// Strongest protein-disease associations
MATCH (p:Protein)-[r:ASSOCIATES_WITH]->(d:Disease)
WHERE r.evidence_sources >= 5 AND r.avg_confidence >= 0.85
RETURN p.name, d.name, r.evidence_sources, r.avg_confidence
ORDER BY r.evidence_sources DESC LIMIT 20;
// Most studied proteins
MATCH (p:Protein)-[r:ASSOCIATES_WITH]->()
RETURN p.name, count(r) as associations, sum(r.evidence_sources) as total_evidence
ORDER BY total_evidence DESC LIMIT 10;
Troubleshooting¶
| Issue | Check | Fix |
|---|---|---|
| Low consolidation rate | MATCH (p:Protein) RETURN toFloat(count(p.uniprot_id)) / count(p) | Run --operation label_proteins first |
| Missing relationships | MATCH (n) WHERE NOT (n)--() RETURN labels(n), count(n) | Rerun --consolidate-relationships |
| Slow queries | SHOW INDEXES | Add indexes on uniprot_id, mondo_id |
| Embeddings missing | MATCH (n) WHERE n.embedding IS NULL RETURN count(n) | Rerun --add-graph-embeddings --force-regenerate |