Operations
Incremental Updates (Weekly Maintenance)
Keep the knowledge graph current with weekly PubMed updates:
# Fetch abstracts from the last 7 days
# (date -v-7d is BSD/macOS syntax; on GNU/Linux use: date -d '7 days ago' +%Y-%m-%d)
uv run python ingest_main.py \
--queries-file data/queries.txt \
--date-from $(date -v-7d +%Y-%m-%d) \
--date-to $(date +%Y-%m-%d) \
--database olink1 --service bedrock
# Consolidate and re-embed
uv run python ingest_main.py --consolidate-relationships --database olink1
uv run python ingest_main.py --add-graph-embeddings --database olink1 --service bedrock
What happens automatically: PMID deduplication, entity merging, relationship consolidation, and incremental embeddings. A run over ~100 abstracts completes in under 15 minutes.
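The `date -v-7d` form in the commands above is specific to BSD/macOS. As a portable alternative, the following minimal sketch computes the same 7-day window in Python and assembles the ingest command with the flags shown above. The helper names `update_window` and `build_ingest_command` are illustrative and not part of the project.

```python
from datetime import date, timedelta


def update_window(today: date, days: int = 7) -> tuple[str, str]:
    """Return (date_from, date_to) as YYYY-MM-DD strings covering the last `days` days."""
    start = today - timedelta(days=days)
    return start.isoformat(), today.isoformat()


def build_ingest_command(today: date) -> list[str]:
    """Assemble the weekly ingest command with a computed date window."""
    date_from, date_to = update_window(today)
    return [
        "uv", "run", "python", "ingest_main.py",
        "--queries-file", "data/queries.txt",
        "--date-from", date_from,
        "--date-to", date_to,
        "--database", "olink1",
        "--service", "bedrock",
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(build_ingest_command(date.today())))
```

Passing the resulting list to `subprocess.run(..., check=True)` would execute the update and raise if the ingest step exits non-zero.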
Weekly cron job:
0 2 * * 1 cd /path/to/olink_rag && uv run python ingest_main.py --queries-file data/queries.txt --date-from $(date -v-7d +\%Y-\%m-\%d) --date-to $(date +\%Y-\%m-\%d) --database olink1 --service bedrock >> logs/weekly_update.log 2>&1
See Incremental PubMed Update Guide for details.
Validation Scripts
# Community detection validation
uv run python scripts/validate_community_detection.py
# Synonym coverage validation
uv run python scripts/validate_synonym_coverage.py
# Incremental embeddings validation
uv run python scripts/validate_incremental_embeddings.py --database olink1
# Test query validation
uv run python scripts/validate_test_queries.py
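To run all four checks in one pass, a small wrapper along these lines may help. It is a sketch that assumes each validation script exits with a non-zero status on failure, which is not confirmed here; `run_all` and `main` are hypothetical names.

```python
import subprocess
import sys

# Commands copied from the list above.
VALIDATION_COMMANDS = [
    ["uv", "run", "python", "scripts/validate_community_detection.py"],
    ["uv", "run", "python", "scripts/validate_synonym_coverage.py"],
    ["uv", "run", "python", "scripts/validate_incremental_embeddings.py",
     "--database", "olink1"],
    ["uv", "run", "python", "scripts/validate_test_queries.py"],
]


def run_all(commands, runner=subprocess.run):
    """Run every command in order and return those that exited non-zero."""
    failed = []
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            failed.append(cmd)
    return failed


def main():
    """Run the validations, list failures on stderr, and return an exit code."""
    failures = run_all(VALIDATION_COMMANDS)
    for cmd in failures:
        print("FAILED:", " ".join(cmd), file=sys.stderr)
    return 1 if failures else 0
```

Calling `sys.exit(main())` from a script makes the wrapper usable in cron or CI, where a non-zero exit code signals that at least one validation failed.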
Community Detection
After building the KG, detect entity communities for global query mode:
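import asyncio

from pipeline.processors.community_detector import CommunityDetector
from src.core.database import Neo4jDatabase

db = Neo4jDatabase(uri="bolt://localhost:7687", user="neo4j", password="your_password")
detector = CommunityDetector(db=db, llm=your_llm_instance)  # your_llm_instance: placeholder for your LLM client
# Detect communities with a minimum size of 3
communities = detector.detect_communities(min_community_size=3)
# generate_summaries is awaited via asyncio.run, so asyncio must be imported
communities = asyncio.run(detector.generate_summaries(communities))
# Persist the community IDs
detector.write_community_ids(communities)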
See Community Detection Validation.
Node Labeling
Instead of labeling after every ingestion run, schedule it:
# Standalone labeling + UniProt mapping
uv run python ingest_main.py --label-nodes -d olink1
# Cron: every Monday and Thursday at 2am
0 2 * * 1,4 cd /path/to/olink_rag && uv run python ingest_main.py --label-nodes -d olink1 >> logs/labeling.log 2>&1
Cache Management
# Warm precomputed query cache
uv run python scripts/precompute_queries.py --service local --database neo4j
# Check cache stats
curl "http://localhost:8000/v1/cache/stats" -H "Authorization: Bearer $API_AUTH_TOKEN"
# Invalidate after graph updates
curl -X POST "http://localhost:8000/v1/cache/invalidate?database=neo4j" \
-H "Authorization: Bearer $API_AUTH_TOKEN"
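The same two endpoints can be called from Python, for example at the end of an update script. The sketch below uses only the standard library and takes the paths, port, and bearer-token header from the curl commands above. It assumes the API responds with JSON, which is not confirmed here, and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port taken from the curl examples


def cache_request(path, token, method="GET", params=None):
    """Build (but do not send) an authenticated request to a cache endpoint."""
    url = BASE_URL + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {token}"}
    )


def send(request, timeout=10):
    """Send the request and decode the body (assumption: the body is JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def refresh_after_update(token, database="neo4j"):
    """Invalidate the cache for one database, then return the current stats."""
    send(cache_request("/v1/cache/invalidate", token, "POST", {"database": database}))
    return send(cache_request("/v1/cache/stats", token))
```

A caller would typically read the token from the environment, for example `refresh_after_update(os.environ["API_AUTH_TOKEN"])` after `import os`.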
SSH/SSM Tunnel Keepalive
For overnight runs through AWS SSM tunnels:
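One way to notice a dropped tunnel during a long run is to probe the forwarded local port at a fixed interval. The sketch below is an illustration under stated assumptions, not the project's keepalive mechanism: it assumes the tunnel forwards Neo4j Bolt to `localhost:7687` (the address used in the Community Detection example), it only detects a closed port and does not itself keep a session alive, and `port_is_open` and `watch_tunnel` are hypothetical helpers.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch_tunnel(host="localhost", port=7687, interval=60.0,
                 max_checks=None, on_down=None):
    """Probe the forwarded port every `interval` seconds.

    Calls `on_down()` each time a probe fails and stops after `max_checks`
    probes (None means run until interrupted). Returns the failure count.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not port_is_open(host, port):
            failures += 1
            if on_down is not None:
                on_down()
        checks += 1
        # Skip the final sleep once the last probe has run.
        if max_checks is None or checks < max_checks:
            time.sleep(interval)
    return failures
```

The `on_down` callback is the place to restart the tunnel or send an alert; what that restart command looks like depends on how the tunnel was opened.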
Monitoring KPIs
| Metric | Target |
|---|---|
| Multi-hop success rate | > 95% |
| Community usage rate | > 80% (global queries) |
| Entity disambiguation recall | > 90% |
| Cache hit rate | > 30% |
| P95 latency | < 2000ms |
| Incremental update success rate | > 95% |
| Entity merge rate | 20-40% |
See Metrics Dashboard for API endpoints and alerting.
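The targets in the table can be encoded as data and checked against a snapshot of observed values. This is a sketch only: the metric key names are invented for illustration and are not the names returned by the metrics API, rates are expressed as fractions, and the 20-40% merge-rate range is treated as inclusive.

```python
# Targets transcribed from the table above. Key names are hypothetical.
KPI_TARGETS = {
    "multi_hop_success_rate": ("gt", 0.95),
    "community_usage_rate": ("gt", 0.80),
    "entity_disambiguation_recall": ("gt", 0.90),
    "cache_hit_rate": ("gt", 0.30),
    "p95_latency_ms": ("lt", 2000),
    "incremental_update_success_rate": ("gt", 0.95),
    "entity_merge_rate": ("between", 0.20, 0.40),
}


def meets_target(value, target):
    """Check one observed value against a ("gt" | "lt" | "between", ...) target."""
    kind = target[0]
    if kind == "gt":
        return value > target[1]
    if kind == "lt":
        return value < target[1]
    if kind == "between":
        return target[1] <= value <= target[2]
    raise ValueError(f"unknown target kind: {kind}")


def failing_kpis(observed):
    """Return the sorted names of metrics that are missing or miss their target."""
    return sorted(
        name
        for name, target in KPI_TARGETS.items()
        if name not in observed or not meets_target(observed[name], target)
    )
```

An empty list from `failing_kpis` means every target is met; a missing metric is reported as failing so that a broken collector does not pass silently.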
Biological Enrichment (Optional)
Legacy Enrichers
# Disease hierarchies from MONDO
uv run python -m pipeline.processors.disease_hierarchy_enricher --database olink1 --operation enrich
# Subcellular location data (requires UniProt IDs)
uv run python -m pipeline.processors.subcellular_location_enricher --database olink1 --operation enrich
Science Skills Enrichment (Recommended)
Enrich Protein nodes from 6 external bioinformatics APIs in one command:
# Full enrichment (all sources)
uv run python -m pipeline.processors.science_skills_enricher \
--database olink1 --sources all --batch-size 50 --rate-limit 1.0
# Targeted enrichment (specific sources and proteins)
uv run python -m pipeline.processors.science_skills_enricher \
--database olink1 --sources string,hpa,clinvar \
--proteins TP53,BRCA1,EGFR
# Preview mode (no writes)
uv run python -m pipeline.processors.science_skills_enricher \
--database olink1 --sources all --dry-run
| Source | What it adds |
|---|---|
| string | INTERACTS_WITH relationships (score-filtered PPI) |
| hpa | Tissue nodes + EXPRESSED_IN, subcellular locations |
| reactome | Pathway nodes + PARTICIPATES_IN + CHILD_OF hierarchy |
| chembl | Drug nodes + TARGETS relationships (pChEMBL-filtered) |
| clinvar | Variant nodes + HAS_VARIANT + ASSOCIATED_WITH Disease |
| openalex | Publication nodes + CITES citation network |
Suggested schedule: run monthly or after major protein ingestion updates. The enricher supports --dry-run for previews and aborts when the failure rate exceeds 50%.
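The abort-on-failure-rate behaviour can be illustrated with a generic loop. This sketch shows the general pattern only and is not the enricher's actual code: `enrich_all`, `EnrichmentAborted`, and the `min_attempts` warm-up are assumptions introduced for the example.

```python
class EnrichmentAborted(RuntimeError):
    """Raised when too many items fail during an enrichment run."""


def enrich_all(proteins, enrich_one, max_failure_rate=0.5, min_attempts=10):
    """Apply `enrich_one` to each protein, aborting if failures exceed the threshold.

    The rate is only checked once `min_attempts` items have been tried, so a
    single early failure does not abort the whole run.
    """
    succeeded, failed = [], []
    for protein in proteins:
        try:
            enrich_one(protein)
            succeeded.append(protein)
        except Exception:
            failed.append(protein)
        attempts = len(succeeded) + len(failed)
        if attempts >= min_attempts and len(failed) / attempts > max_failure_rate:
            raise EnrichmentAborted(
                f"{len(failed)}/{attempts} proteins failed (> {max_failure_rate:.0%})"
            )
    return succeeded, failed
```

Checking the rate inside the loop stops a run early when an upstream API is down, instead of sending requests for every remaining protein.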
See Integrations for full details.