Operations

Incremental Updates (Weekly Maintenance)

Keep the knowledge graph current with weekly PubMed updates:

# Fetch abstracts from the last 7 days
# (date -v-7d is BSD/macOS syntax; on GNU/Linux use: date -d '7 days ago' +%Y-%m-%d)
uv run python ingest_main.py \
  --queries-file data/queries.txt \
  --date-from $(date -v-7d +%Y-%m-%d) \
  --date-to $(date +%Y-%m-%d) \
  --database olink1 --service bedrock

# Consolidate and re-embed
uv run python ingest_main.py --consolidate-relationships --database olink1
uv run python ingest_main.py --add-graph-embeddings --database olink1 --service bedrock

What happens automatically: PMID deduplication, entity merging, relationship consolidation, and incremental embedding updates. A run of about 100 abstracts completes in under 15 minutes.
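The `date -v-7d` form used in the commands above is BSD/macOS syntax, so a portable way to compute the same seven-day window can be useful on Linux hosts. The sketch below is illustrative and not part of the repository: `weekly_window` and `build_weekly_commands` are hypothetical helper names, and the commands use only the flags shown above.

```python
from datetime import date, timedelta


def weekly_window(today: date, days: int = 7) -> tuple[str, str]:
    """Return (date_from, date_to) as YYYY-MM-DD strings covering the last `days` days."""
    return (today - timedelta(days=days)).isoformat(), today.isoformat()


def build_weekly_commands(today: date, database: str = "olink1",
                          service: str = "bedrock") -> list[list[str]]:
    """Build the fetch, consolidate, and re-embed commands in the order shown above."""
    date_from, date_to = weekly_window(today)
    base = ["uv", "run", "python", "ingest_main.py"]
    return [
        # 1. Fetch abstracts published inside the window
        base + ["--queries-file", "data/queries.txt",
                "--date-from", date_from, "--date-to", date_to,
                "--database", database, "--service", service],
        # 2. Consolidate relationships
        base + ["--consolidate-relationships", "--database", database],
        # 3. Re-embed the graph
        base + ["--add-graph-embeddings", "--database", database,
                "--service", service],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, which stops the sequence at the first failing step.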

Weekly cron job (Mondays at 02:00; `%` must be escaped as `\%` inside a crontab entry):

0 2 * * 1 cd /path/to/olink_rag && uv run python ingest_main.py --queries-file data/queries.txt --date-from $(date -v-7d +\%Y-\%m-\%d) --date-to $(date +\%Y-\%m-\%d) --database olink1 --service bedrock >> logs/weekly_update.log 2>&1

See Incremental PubMed Update Guide for details.

Validation Scripts

# Community detection validation
uv run python scripts/validate_community_detection.py

# Synonym coverage validation
uv run python scripts/validate_synonym_coverage.py

# Incremental embeddings validation
uv run python scripts/validate_incremental_embeddings.py --database olink1

# Test query validation
uv run python scripts/validate_test_queries.py
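To run all four checks in one pass, a small wrapper along these lines may help. It is a sketch under one stated assumption: that each validation script signals failure with a non-zero exit code, which is not confirmed here. `run_validations` and `run_subprocess` are hypothetical names.

```python
import subprocess
from typing import Callable, Sequence

# Script paths and flags copied from the commands above.
VALIDATIONS: list[list[str]] = [
    ["scripts/validate_community_detection.py"],
    ["scripts/validate_synonym_coverage.py"],
    ["scripts/validate_incremental_embeddings.py", "--database", "olink1"],
    ["scripts/validate_test_queries.py"],
]


def run_validations(run: Callable[[Sequence[str]], int]) -> dict[str, bool]:
    """Run every validation and map script path -> passed.

    ASSUMPTION: exit code 0 means the validation passed.
    """
    results: dict[str, bool] = {}
    for args in VALIDATIONS:
        exit_code = run(["uv", "run", "python", *args])
        results[args[0]] = exit_code == 0
    return results


def run_subprocess(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode
```

Calling `run_validations(run_subprocess)` executes the scripts for real; passing a different callable makes the wrapper easy to exercise without touching the database.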

Community Detection

After building the KG, detect entity communities for global query mode:

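import asyncio

from pipeline.processors.community_detector import CommunityDetector
from src.core.database import Neo4jDatabase

db = Neo4jDatabase(uri="bolt://localhost:7687", user="neo4j", password="your_password")
# your_llm_instance is a placeholder for your configured LLM client
detector = CommunityDetector(db=db, llm=your_llm_instance)

# Detect communities with a minimum size of 3
communities = detector.detect_communities(min_community_size=3)
# generate_summaries is a coroutine, so drive it with asyncio.run
communities = asyncio.run(detector.generate_summaries(communities))
# Write the community IDs to the graph
detector.write_community_ids(communities)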

See Community Detection Validation.
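To make the role of `min_community_size` concrete, the sketch below groups a toy graph into communities and drops groups below the threshold. It is an illustration only: connected components stand in for whatever algorithm `CommunityDetector` actually uses (not stated here), and `find_communities` is a hypothetical name.

```python
from collections import defaultdict


def find_communities(edges: list[tuple[str, str]],
                     min_community_size: int = 3) -> list[set[str]]:
    """Group nodes by connected component, keeping groups of at least the given size."""
    # Build an undirected adjacency map from the edge list.
    neighbors: dict[str, set[str]] = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen: set[str] = set()
    communities: list[set[str]] = []
    for start in list(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        component: set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        # Small groups are discarded, mirroring the size threshold.
        if len(component) >= min_community_size:
            communities.append(component)
    return communities
```

With edges TP53-MDM2, MDM2-CDKN1A, and EGFR-GRB2, a threshold of 3 keeps only the three-protein group; lowering it to 2 keeps both.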

Node Labeling

Labeling does not need to run after every ingestion; run it standalone on a schedule instead:

# Standalone labeling + UniProt mapping
uv run python ingest_main.py --label-nodes -d olink1

# Cron: every Monday and Thursday at 2am
0 2 * * 1,4 cd /path/to/olink_rag && uv run python ingest_main.py --label-nodes -d olink1 >> logs/labeling.log 2>&1

Cache Management

# Warm precomputed query cache
uv run python scripts/precompute_queries.py --service local --database neo4j

# Check cache stats
curl "http://localhost:8000/v1/cache/stats" -H "Authorization: Bearer $API_AUTH_TOKEN"

# Invalidate after graph updates
curl -X POST "http://localhost:8000/v1/cache/invalidate?database=neo4j" \
  -H "Authorization: Bearer $API_AUTH_TOKEN"
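The same two cache calls can be issued from Python, for example at the end of an update script. The sketch below only builds the requests, using the endpoints and bearer-token header from the curl examples; `cache_request`, `stats_request`, and `invalidate_request` are hypothetical helper names, and the response format is not documented here.

```python
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # host and port taken from the curl examples


def cache_request(path: str, token: str, method: str = "GET",
                  params: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request for a cache endpoint without sending it."""
    url = BASE_URL + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {token}"}
    )


def stats_request(token: str) -> urllib.request.Request:
    """GET /v1/cache/stats"""
    return cache_request("/v1/cache/stats", token)


def invalidate_request(token: str, database: str = "neo4j") -> urllib.request.Request:
    """POST /v1/cache/invalidate?database=<database>"""
    return cache_request("/v1/cache/invalidate", token, method="POST",
                         params={"database": database})
```

To send a request, pass it to `urllib.request.urlopen(...)` with the token read from `os.environ["API_AUTH_TOKEN"]`.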

SSH/SSM Tunnel Keepalive

For overnight runs through AWS SSM tunnels:

# caffeinate is macOS-only; -s keeps the system awake while on AC power
caffeinate -s bash scripts/tunnel_keepalive.sh

Monitoring KPIs

| Metric | Target |
|---|---|
| Multi-hop success rate | > 95% |
| Community usage rate | > 80% (global queries) |
| Entity disambiguation recall | > 90% |
| Cache hit rate | > 30% |
| P95 latency | < 2000 ms |
| Incremental update success rate | > 95% |
| Entity merge rate | 20-40% |

See Metrics Dashboard for API endpoints and alerting.
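The targets above can be turned into an automated check. The following is a minimal sketch, not the project's alerting logic: the metric key names are hypothetical, the thresholds are copied from the table, and treating the 20-40% merge-rate band as inclusive is an assumption.

```python
import operator
from typing import Callable

# (comparison, threshold) per metric. Key names are hypothetical;
# thresholds come from the KPI table. Rates are percentages.
TARGETS: dict[str, tuple[Callable[[float, float], bool], float]] = {
    "multi_hop_success_rate": (operator.gt, 95.0),
    "community_usage_rate": (operator.gt, 80.0),
    "entity_disambiguation_recall": (operator.gt, 90.0),
    "cache_hit_rate": (operator.gt, 30.0),
    "p95_latency_ms": (operator.lt, 2000.0),
    "incremental_update_success_rate": (operator.gt, 95.0),
}
# ASSUMPTION: the merge-rate band is inclusive at both ends.
ENTITY_MERGE_RATE_RANGE = (20.0, 40.0)


def failing_kpis(measured: dict[str, float]) -> list[str]:
    """Return the names of measured metrics that miss their target."""
    failures = [
        name for name, (compare, threshold) in TARGETS.items()
        if name in measured and not compare(measured[name], threshold)
    ]
    low, high = ENTITY_MERGE_RATE_RANGE
    rate = measured.get("entity_merge_rate")
    if rate is not None and not (low <= rate <= high):
        failures.append("entity_merge_rate")
    return failures
```

Metrics missing from `measured` are skipped rather than reported, so a partial snapshot can still be checked.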

Biological Enrichment (Optional)

Legacy Enrichers

# Disease hierarchies from MONDO
uv run python -m pipeline.processors.disease_hierarchy_enricher --database olink1 --operation enrich

# Subcellular location data (requires UniProt IDs)
uv run python -m pipeline.processors.subcellular_location_enricher --database olink1 --operation enrich

Science Skills Enrichment (Recommended)

Enrich Protein nodes from six external bioinformatics APIs with a single command:

# Full enrichment (all sources)
uv run python -m pipeline.processors.science_skills_enricher \
  --database olink1 --sources all --batch-size 50 --rate-limit 1.0

# Targeted enrichment (specific sources and proteins)
uv run python -m pipeline.processors.science_skills_enricher \
  --database olink1 --sources string,hpa,clinvar \
  --proteins TP53,BRCA1,EGFR

# Preview mode (no writes)
uv run python -m pipeline.processors.science_skills_enricher \
  --database olink1 --sources all --dry-run

| Source | What it adds |
|---|---|
| `string` | `INTERACTS_WITH` relationships (score-filtered PPI) |
| `hpa` | `Tissue` nodes + `EXPRESSED_IN`, subcellular locations |
| `reactome` | `Pathway` nodes + `PARTICIPATES_IN` + `CHILD_OF` hierarchy |
| `chembl` | `Drug` nodes + `TARGETS` relationships (pChEMBL-filtered) |
| `clinvar` | `Variant` nodes + `HAS_VARIANT` + `ASSOCIATED_WITH` Disease |
| `openalex` | `Publication` nodes + `CITES` citation network |

Suggested schedule: Run monthly or after major protein ingestion updates. The enricher supports --dry-run for previewing changes and aborts when the failure rate exceeds 50%.
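The batching, rate limiting, and failure-rate abort described above can be pictured roughly as follows. This is a sketch, not the enricher's implementation: `enrich_in_batches` and `FailureRateExceeded` are hypothetical names, and it assumes that `--rate-limit` is a pause in seconds between batches and that the failure rate is evaluated after each batch.

```python
import time
from typing import Callable


class FailureRateExceeded(RuntimeError):
    """Raised when too many lookups in a run have failed."""


def enrich_in_batches(proteins: list[str],
                      fetch: Callable[[str], dict],
                      batch_size: int = 50,
                      rate_limit: float = 1.0,
                      max_failure_rate: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> dict[str, dict]:
    """Fetch data per protein in batches, pausing between batches.

    Aborts when the cumulative failure rate exceeds `max_failure_rate`.
    """
    results: dict[str, dict] = {}
    failures = 0
    attempted = 0
    for start in range(0, len(proteins), batch_size):
        for protein in proteins[start:start + batch_size]:
            attempted += 1
            try:
                results[protein] = fetch(protein)
            except Exception:
                failures += 1  # count the failure and keep going
        # Checked once per batch so a single early error does not abort the run.
        if failures / attempted > max_failure_rate:
            raise FailureRateExceeded(f"{failures}/{attempted} lookups failed")
        if start + batch_size < len(proteins):
            sleep(rate_limit)  # ASSUMPTION: rate limit = seconds between batches
    return results
```

The `sleep` parameter is injectable so the loop can be exercised without real delays; in normal use the default `time.sleep` applies.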

See Integrations for full details.