Operations

Incremental Updates (Weekly Maintenance)

Keep the knowledge graph current with weekly PubMed updates:

# Fetch abstracts from the last 7 days
# (date -v-7d is BSD/macOS syntax; on GNU/Linux use: date -d '7 days ago' +%Y-%m-%d)
uv run python ingest_main.py \
  --queries-file data/queries.txt \
  --date-from $(date -v-7d +%Y-%m-%d) \
  --date-to $(date +%Y-%m-%d) \
  --database olink1 --service bedrock

# Consolidate and re-embed
uv run python ingest_main.py --consolidate-relationships --database olink1
uv run python ingest_main.py --add-graph-embeddings --database olink1 --service bedrock

What happens automatically: PMID deduplication, entity merging, relationship consolidation, and incremental embeddings. A run of ~100 abstracts completes in under 15 minutes.
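The 7-day window in the fetch command relies on shell `date` flags that differ between macOS and Linux. As a portable alternative, the window can be computed in Python. The following is a minimal sketch: `update_window` and `build_fetch_command` are hypothetical helper names, not part of the project, and the flags are copied from the command above.

```python
from datetime import date, timedelta
from typing import List, Tuple


def update_window(today: date, days: int = 7) -> Tuple[str, str]:
    """Return (date_from, date_to) as YYYY-MM-DD strings covering the last `days` days."""
    start = today - timedelta(days=days)
    return start.isoformat(), today.isoformat()


def build_fetch_command(today: date, database: str = "olink1",
                        service: str = "bedrock") -> List[str]:
    """Build the argument list for the weekly fetch (hypothetical helper)."""
    date_from, date_to = update_window(today)
    return [
        "uv", "run", "python", "ingest_main.py",
        "--queries-file", "data/queries.txt",
        "--date-from", date_from,
        "--date-to", date_to,
        "--database", database,
        "--service", service,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_fetch_command(date.today())))
```

The returned list can be passed to `subprocess.run(..., check=True)` from a scheduler that does not go through a shell.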

Weekly cron job:

0 2 * * 1 cd /path/to/olink_rag && uv run python ingest_main.py --queries-file data/queries.txt --date-from $(date -v-7d +\%Y-\%m-\%d) --date-to $(date +\%Y-\%m-\%d) --database olink1 --service bedrock >> logs/weekly_update.log 2>&1

See Incremental PubMed Update Guide for details.

Validation Scripts

# Community detection validation
uv run python scripts/validate_community_detection.py

# Synonym coverage validation
uv run python scripts/validate_synonym_coverage.py

# Incremental embeddings validation
uv run python scripts/validate_incremental_embeddings.py --database olink1

# Test query validation
uv run python scripts/validate_test_queries.py

Community Detection

After building the KG, detect entity communities for global query mode:

import asyncio

from pipeline.processors.community_detector import CommunityDetector
from src.core.database import Neo4jDatabase

db = Neo4jDatabase(uri="bolt://localhost:7687", user="neo4j", password="your_password")
# your_llm_instance is a placeholder for an already configured LLM client
detector = CommunityDetector(db=db, llm=your_llm_instance)

communities = detector.detect_communities(min_community_size=3)
# generate_summaries is a coroutine, so run it to completion from synchronous code
communities = asyncio.run(detector.generate_summaries(communities))
detector.write_community_ids(communities)

See Community Detection Validation.

Node Labeling

Instead of labeling after every ingestion run, schedule it:

# Standalone labeling + UniProt mapping
uv run python ingest_main.py --label-nodes -d olink1

# Cron: every Monday and Thursday at 2am
0 2 * * 1,4 cd /path/to/olink_rag && uv run python ingest_main.py --label-nodes -d olink1 >> logs/labeling.log 2>&1

Cache Management

# Warm precomputed query cache
uv run python scripts/precompute_queries.py --service local --database neo4j

# Check cache stats
curl "http://localhost:8000/v1/cache/stats" -H "Authorization: Bearer $API_AUTH_TOKEN"

# Invalidate after graph updates
curl -X POST "http://localhost:8000/v1/cache/invalidate?database=neo4j" \
  -H "Authorization: Bearer $API_AUTH_TOKEN"
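
When cache invalidation runs from a Python job rather than a shell script, the same two requests can be built with the standard library. This is a sketch only: the endpoint paths, the `database` query parameter, and the bearer token header come from the curl commands above, while the function names are hypothetical. The example builds the requests without sending them.

```python
import os
import urllib.parse
import urllib.request


def cache_request(base_url, path, token, method="GET", params=None):
    """Build (but do not send) an authenticated request to a cache endpoint."""
    url = base_url.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {token}"}
    )


def stats_request(base_url, token):
    """GET /v1/cache/stats"""
    return cache_request(base_url, "/v1/cache/stats", token)


def invalidate_request(base_url, database, token):
    """POST /v1/cache/invalidate?database=<database>"""
    return cache_request(
        base_url, "/v1/cache/invalidate", token,
        method="POST", params={"database": database},
    )


if __name__ == "__main__":
    token = os.environ.get("API_AUTH_TOKEN", "")
    req = invalidate_request("http://localhost:8000", "neo4j", token)
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```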

SSH/SSM Tunnel Keepalive

For overnight runs through AWS SSM tunnels:

caffeinate -s bash scripts/tunnel_keepalive.sh

Monitoring KPIs

Metric                             Target
Multi-hop success rate             > 95%
Community usage rate               > 80% (global queries)
Entity disambiguation recall       > 90%
Cache hit rate                     > 30%
P95 latency                        < 2000ms
Incremental update success rate    > 95%
Entity merge rate                  20-40%

See Metrics Dashboard for API endpoints and alerting.
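
The KPI targets can also be checked programmatically, for example in a scheduled health check. The sketch below encodes the targets from the table as predicates; the metric key names are hypothetical and would need to match whatever the metrics API actually returns. Rates are assumed to be fractions between 0 and 1, and the 20-40% merge-rate band is treated as inclusive.

```python
from typing import Callable, Dict, List

# Targets copied from the KPI table; key names are illustrative assumptions.
KPI_TARGETS: Dict[str, Callable[[float], bool]] = {
    "multi_hop_success_rate": lambda v: v > 0.95,
    "community_usage_rate": lambda v: v > 0.80,
    "entity_disambiguation_recall": lambda v: v > 0.90,
    "cache_hit_rate": lambda v: v > 0.30,
    "p95_latency_ms": lambda v: v < 2000,
    "incremental_update_success_rate": lambda v: v > 0.95,
    "entity_merge_rate": lambda v: 0.20 <= v <= 0.40,
}


def failing_kpis(observed: Dict[str, float]) -> List[str]:
    """Return the names of KPIs that are missing or outside their target."""
    return [
        name
        for name, within_target in KPI_TARGETS.items()
        if name not in observed or not within_target(observed[name])
    ]
```

A missing metric is reported as failing, so a broken collector is surfaced rather than silently passing.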

Biological Enrichment (Optional)

Legacy Enrichers

# Disease hierarchies from MONDO
uv run python -m pipeline.processors.disease_hierarchy_enricher --database olink1 --operation enrich

# Subcellular location data (requires UniProt IDs)
uv run python -m pipeline.processors.subcellular_location_enricher --database olink1 --operation enrich

Enrich Protein nodes from 6 external bioinformatics APIs in one command with science_skills_enricher:

# Full enrichment (all sources)
uv run python -m pipeline.processors.science_skills_enricher \
  --database olink1 --sources all --batch-size 50 --rate-limit 1.0

# Targeted enrichment (specific sources and proteins)
uv run python -m pipeline.processors.science_skills_enricher \
  --database olink1 --sources string,hpa,clinvar \
  --proteins TP53,BRCA1,EGFR

# Preview mode (no writes)
uv run python -m pipeline.processors.science_skills_enricher \
  --database olink1 --sources all --dry-run

Source      What it adds
string      INTERACTS_WITH relationships (score-filtered PPI)
hpa         Tissue nodes + EXPRESSED_IN, subcellular locations
reactome    Pathway nodes + PARTICIPATES_IN + CHILD_OF hierarchy
chembl      Drug nodes + TARGETS relationships (pChEMBL-filtered)
clinvar     Variant nodes + HAS_VARIANT + ASSOCIATED_WITH Disease
openalex    Publication nodes + CITES citation network

Suggested schedule: Run monthly or after major protein ingestion updates. The enricher supports --dry-run for previewing changes and aborts when the failure rate exceeds 50%.
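
To illustrate the abort behavior, the following is a minimal sketch of a batch loop that stops once more than half of the attempted items have failed. It is not the enricher's actual implementation: `enrich_all`, `TooManyFailures`, and the `min_attempts` warm-up threshold are assumptions introduced for the example.

```python
from typing import Callable, Iterable, List


class TooManyFailures(RuntimeError):
    """Raised when the observed failure rate exceeds the allowed maximum."""


def enrich_all(
    proteins: Iterable[str],
    enrich_one: Callable[[str], None],
    max_failure_rate: float = 0.5,
    min_attempts: int = 4,
) -> List[str]:
    """Call enrich_one on each protein; abort if the failure rate gets too high.

    min_attempts (an assumption) prevents a single early failure from
    aborting the whole run.
    """
    succeeded: List[str] = []
    failures = 0
    for attempted, protein in enumerate(proteins, start=1):
        try:
            enrich_one(protein)
            succeeded.append(protein)
        except Exception:
            failures += 1
        # Strictly greater than: exactly 50% failures does not abort.
        if attempted >= min_attempts and failures / attempted > max_failure_rate:
            raise TooManyFailures(f"{failures}/{attempted} enrichments failed")
    return succeeded
```

Checking the rate inside the loop stops a run early when an upstream API is down, instead of spending the full rate-limited pass on requests that will fail.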

See Integrations for full details.