Olink RAG Wiki

Olink-Proteomics/gav360_graphrag

CLI Reference

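Three entry points cover all operations: `ingest_main.py`, `enrichment_main.py`, and `parallel_ingest.py`. All of them require an active AWS SSO session (`aws sso login`) for Bedrock LLM access.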

ingest_main.py: Build the Knowledge Graph

The primary CLI for ingesting scientific literature and PDFs into the graph.

Common Usage

# Ingest PDFs
uv run python ingest_main.py --source pdf --pdf-files paper.pdf --service bedrock

# Ingest from PubMed
uv run python ingest_main.py --search-term "cardiovascular biomarker" --max-results 100 --service bedrock

# Ingest from PMC (full-text)
uv run python ingest_main.py --source pmc --search-term "kidney disease" --max-results 50

# Ingest from bioRxiv
uv run python ingest_main.py --source biorxiv --search-term "proteomics" --max-results 30

# Multi-query from file
uv run python ingest_main.py --queries-file data/queries_bulk.txt --max-results 200 --service bedrock

Post-Ingestion Operations

# Generate embeddings (required for semantic search)
uv run python ingest_main.py --add-graph-embeddings --database neo4j --service bedrock

# Consolidate relationships (merge duplicates)
uv run python ingest_main.py --consolidate-relationships --database neo4j

# Detect communities (Louvain clustering)
uv run python ingest_main.py --detect-communities --database neo4j

# Label untyped Entity nodes as Protein/Disease
uv run python ingest_main.py --label-nodes --database neo4j

# Enrich diseases with MONDO hierarchy
uv run python ingest_main.py --enrich-hierarchy --database neo4j
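
The post-ingestion steps above can be chained from a small driver script. The sketch below only builds the command lines from the flags shown above. The step order simply follows the listing and is an assumption, since no required order is stated here, and `post_ingestion_commands` is a hypothetical helper, not part of the repository.

```python
import shlex

# Post-ingestion flags, in the order they are listed above.
# Assumption: running them in this order is acceptable.
POST_INGESTION_STEPS = [
    "--add-graph-embeddings",
    "--consolidate-relationships",
    "--detect-communities",
    "--label-nodes",
    "--enrich-hierarchy",
]


def post_ingestion_commands(database="neo4j", service="bedrock"):
    """Return one argv list per post-ingestion step (hypothetical helper)."""
    commands = []
    for flag in POST_INGESTION_STEPS:
        argv = ["uv", "run", "python", "ingest_main.py", flag, "--database", database]
        # Only the embedding step is shown with --service in the examples above.
        if flag == "--add-graph-embeddings":
            argv += ["--service", service]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in post_ingestion_commands():
        # Printed only; pass argv to subprocess.run(argv, check=True) to execute.
        print(shlex.join(argv))
```

Building argv lists rather than shell strings avoids quoting problems if a database name ever contains spaces or shell metacharacters.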

Key Flags

Flag	Short	Description
`--source`		`pubmed` (default), `pmc`, `biorxiv`, `pdf`
`--search-term`	`-t`	Search query
`--queries-file`	`-q`	File with one query per line
`--max-results`	`-n`	Max articles per query (default: 100)
`--pdf-files`		Space-separated PDF paths
`--database`	`-d`	Neo4j database name
`--service`	`-s`	LLM service: `bedrock`, `local`, or `sagemaker-llama3`
`--force`	`-f`	Re-ingest existing articles
`--skip-node-labeling`		Skip slow node classification (fast mode)
`--enable-consolidation`		Consolidate relationships during ingestion
`--chunk-size`		Token chunk size (default: 2000)
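
A queries file for `--queries-file` holds one query per line. The sketch below shows one way to generate such a file from a list of search terms. `write_queries_file` is a hypothetical helper, and dropping blank and duplicate entries is a choice made here, not documented behaviour of `ingest_main.py`.

```python
from pathlib import Path


def write_queries_file(path, queries):
    """Write one search query per line, dropping empty entries and duplicates."""
    seen = set()
    lines = []
    for query in queries:
        # Collapse internal whitespace so a query never spans several lines.
        query = " ".join(query.split())
        if query and query not in seen:
            seen.add(query)
            lines.append(query)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

For example, `write_queries_file("queries.txt", ["cardiovascular biomarker", "kidney disease"])` produces a two-line file that can be passed as `--queries-file queries.txt`.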

enrichment_main.py: Enrich from CSV/Parquet/TSV

Add structured data (ontologies, protein dictionaries, experimental data) to the graph.

Common Usage

# Load disease ontology
uv run python enrichment_main.py \
  --file src/utils/mondo.obo \
  --database neo4j \
  --column-handlers "0:mondo-id,1:disease-name,2:synonyms"

# Load protein dictionary
uv run python enrichment_main.py \
  --file src/utils/uniprot_ids_human.csv \
  --database neo4j \
  --column-handlers "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"

# Enrich from experimental CSV
uv run python enrichment_main.py --file data/experiment.csv --service bedrock

# Analyze CSV structure only (no writes)
uv run python enrichment_main.py --file data.csv --analyze-only

Key Flags

Flag	Short	Description
`--file`	`-f`	Input file (CSV, Parquet, TSV, TXT)
`--database`	`-d`	Neo4j database name
`--service`	`-s`	LLM service for extraction
`--column-handlers`		Manual column mapping (e.g., `"0:protein-id,2:prop"`)
`--analyze-only`		Preview structure without writing
`--resume`		Skip already-processed rows
`--offset` / `--limit`		Process a specific row range
`--batch-size`		Rows per batch (default: 1000)
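
To make the interaction of `--offset`, `--limit`, and `--batch-size` concrete, the sketch below splits a row range into batches. It assumes `--offset` is a zero-based starting row and `--limit` a maximum row count, which the table above does not spell out. `batch_ranges` is a hypothetical function, not the tool's implementation.

```python
def batch_ranges(total_rows, offset=0, limit=None, batch_size=1000):
    """Yield (start, stop) half-open row ranges covering the selected rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Clamp the selection to the rows that actually exist.
    start = min(max(offset, 0), total_rows)
    stop = total_rows if limit is None else min(start + max(limit, 0), total_rows)
    for batch_start in range(start, stop, batch_size):
        yield batch_start, min(batch_start + batch_size, stop)


if __name__ == "__main__":
    # 2500 rows with the default batch size of 1000.
    print(list(batch_ranges(2500)))  # [(0, 1000), (1000, 2000), (2000, 2500)]
```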

Column Handlers

Handler	Description
`protein-id` / `uniprot-id`	Column contains protein identifiers
`mondo-id`	Column contains MONDO disease IDs
`disease-name`	Column contains disease names
`protein-name`	Column contains protein names
`gene-symbol`	Column contains gene symbols
`synonyms`	Column contains synonyms (semicolon-separated)
`prop` / `property`	Add column values as node properties
`kg` / `kg-extraction`	Run LLM extraction on column text
`skip` / `ignore`	Skip this column
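
The `--column-handlers` value is a comma-separated list of `index:handler` pairs. The sketch below parses such a string and normalizes the aliases from the table above. It assumes zero-based column indices (the examples start at 0) and treats names that share a table row as aliases. `parse_column_handlers` and `split_synonyms` are hypothetical helpers, not the project's own parser.

```python
# Handler names and aliases from the table above, mapped to one canonical name.
# Assumption: names that share a table row are interchangeable.
HANDLER_ALIASES = {
    "protein-id": "protein-id",
    "uniprot-id": "protein-id",
    "mondo-id": "mondo-id",
    "disease-name": "disease-name",
    "protein-name": "protein-name",
    "gene-symbol": "gene-symbol",
    "synonyms": "synonyms",
    "prop": "property",
    "property": "property",
    "kg": "kg-extraction",
    "kg-extraction": "kg-extraction",
    "skip": "skip",
    "ignore": "skip",
}


def parse_column_handlers(spec):
    """Parse "0:uniprot-id,2:prop" into {0: "protein-id", 2: "property"}."""
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        index_text, sep, name = item.partition(":")
        if not sep:
            raise ValueError(f"expected INDEX:HANDLER, got {item!r}")
        index = int(index_text)  # raises ValueError on a non-integer index
        name = name.strip().lower()
        if name not in HANDLER_ALIASES:
            raise ValueError(f"unknown handler {name!r} for column {index}")
        if index in mapping:
            raise ValueError(f"column {index} is mapped twice")
        mapping[index] = HANDLER_ALIASES[name]
    return mapping


def split_synonyms(cell):
    """Split a semicolon-separated synonyms cell into a clean list."""
    return [part.strip() for part in cell.split(";") if part.strip()]
```

Rejecting unknown handler names and duplicate indices up front surfaces a mistyped mapping before any rows are written to the graph.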

parallel_ingest.py: Bulk Parallel Ingestion

Run PubMed, PMC, and bioRxiv ingestion simultaneously for maximum throughput.

Common Usage

# Dry run — see what would launch
uv run python parallel_ingest.py -q data/queries_bulk.txt -d neo4j -s bedrock --dry-run

# Full parallel run
uv run python parallel_ingest.py \
  -q data/queries_bulk.txt \
  -d neo4j -s bedrock \
  --skip-node-labeling \
  --pubmed-max 500 --pmc-max 200 --biorxiv-max 100

# PubMed only with mass scraper
uv run python parallel_ingest.py \
  -q data/queries_bulk.txt \
  --skip-pmc --skip-biorxiv \
  --pubmed-mass-scraper

Key Flags

Flag	Description
`--queries-file` / `-q`	Queries file (required)
`--biorxiv-queries-file`	Separate simpler queries for bioRxiv
`--pubmed-max`	Max per query for PubMed (default: 500)
`--pmc-max`	Max per query for PMC (default: 200)
`--biorxiv-max`	Max per query for bioRxiv (default: 100)
`--skip-pubmed` / `--skip-pmc` / `--skip-biorxiv`	Skip the named source
`--skip-node-labeling`	Fast mode — label nodes later
`--dry-run`	Show commands without executing
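
The idea of running one ingestion per source in parallel can be sketched as follows. This is not how `parallel_ingest.py` is implemented (its internals are not described here). It is a minimal illustration that composes one `ingest_main.py` command per source from the flags documented above and, outside dry-run mode, runs them concurrently. `build_commands` and `run_all` are hypothetical names.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Per-source defaults taken from the flag table above.
DEFAULT_MAX = {"pubmed": 500, "pmc": 200, "biorxiv": 100}


def build_commands(queries_file, database="neo4j", service="bedrock",
                   skip=(), skip_node_labeling=False, max_results=None):
    """Build one ingest_main.py argv per source that is not skipped."""
    limits = {**DEFAULT_MAX, **(max_results or {})}
    commands = []
    for source, limit in limits.items():
        if source in skip:
            continue
        argv = [
            "uv", "run", "python", "ingest_main.py",
            "--source", source,
            "--queries-file", queries_file,
            "--max-results", str(limit),
            "--database", database,
            "--service", service,
        ]
        if skip_node_labeling:
            argv.append("--skip-node-labeling")
        commands.append(argv)
    return commands


def run_all(commands, dry_run=True):
    """Print the commands (dry run) or run them concurrently and return exit codes."""
    if dry_run:
        for argv in commands:
            print(shlex.join(argv))
        return []
    # One worker thread per command; each thread just waits on its subprocess.
    with ThreadPoolExecutor(max_workers=len(commands) or 1) as pool:
        return list(pool.map(lambda argv: subprocess.run(argv).returncode, commands))


if __name__ == "__main__":
    run_all(build_commands("data/queries_bulk.txt", skip_node_labeling=True))
```

Threads are sufficient here because the heavy work happens in the child processes; the Python side only waits for them to exit.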

Entity Resolution (standalone)

# Full consolidation (UniProt → synonyms → fuzzy)
uv run python -m pipeline.processors.entity_resolver --database neo4j --operation full

# Validate quality
uv run python -m pipeline.processors.entity_resolver --database neo4j --operation validate
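
The three-stage order named above (UniProt, then synonyms, then fuzzy) can be pictured as a lookup cascade. The sketch below is an illustration only: it uses the standard-library `difflib` for the fuzzy stage, which is a stand-in and not necessarily what `entity_resolver` uses. The function name, the index structures, and the 0.85 cutoff are all assumptions.

```python
import difflib


def resolve_entity(name, uniprot_index, synonym_index, cutoff=0.85):
    """Resolve a raw name to a canonical ID: exact ID, then synonym, then fuzzy."""
    key = name.strip()
    # Stage 1: the name is already a known UniProt accession.
    if key.upper() in uniprot_index:
        return key.upper(), "uniprot"
    # Stage 2: case-insensitive synonym lookup.
    lowered = key.lower()
    if lowered in synonym_index:
        return synonym_index[lowered], "synonym"
    # Stage 3: closest synonym by difflib similarity ratio, above the cutoff.
    close = difflib.get_close_matches(lowered, list(synonym_index), n=1, cutoff=cutoff)
    if close:
        return synonym_index[close[0]], "fuzzy"
    return None, "unresolved"


if __name__ == "__main__":
    # Illustrative data: one accession with two lower-cased synonyms.
    accessions = {"P05231"}
    synonyms = {"il6": "P05231", "interleukin-6": "P05231"}
    for raw in ("p05231", "IL6", "Interleukin 6", "albumin"):
        print(raw, resolve_entity(raw, accessions, synonyms))
```

Trying the cheapest and most precise match first means the fuzzy stage, the most error-prone one, only sees names that nothing else could resolve.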

Environment Variables

All configuration is read from environment variables or a `.env` file. Secrets are loaded automatically from AWS Secrets Manager.

Variable	Default	Description
`NEO4J_URI`	`bolt://localhost:7687`	Neo4j connection
`NEO4J_PASSWORD`	(from Secrets Manager)	Neo4j password
`NEO4J_DATABASE`	`neo4j`	Database name
`BEDROCK_MODEL_ID`	`us.meta.llama3-1-8b-instruct-v1:0`	Bedrock model
`BEDROCK_REGION`	`us-east-1`	Bedrock region
`AWS_REGION`	`eu-north-1`	General AWS region
`KG_MAX_LLM_CONCURRENCY`	`15`	Parallel LLM calls
`API_AUTH_TOKEN`	(from Secrets Manager)	API bearer token
`DISABLE_REDIS`	`true`	Skip Redis for local dev
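
As a sketch of how these variables and defaults fit together, the snippet below reads the non-secret settings from the environment with the defaults from the table. `Settings` and `load_settings` are hypothetical names, not the project's configuration module. `NEO4J_PASSWORD` and `API_AUTH_TOKEN` are left out because they come from Secrets Manager, and the set of strings accepted as true for `DISABLE_REDIS` is an assumption.

```python
import os
from dataclasses import dataclass


def _env_bool(name, default):
    """Interpret common truthy strings; fall back to the default when unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_database: str
    bedrock_model_id: str
    bedrock_region: str
    aws_region: str
    max_llm_concurrency: int
    disable_redis: bool


def load_settings():
    """Read settings from the environment, using the defaults in the table above."""
    env = os.environ
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_database=env.get("NEO4J_DATABASE", "neo4j"),
        bedrock_model_id=env.get("BEDROCK_MODEL_ID", "us.meta.llama3-1-8b-instruct-v1:0"),
        bedrock_region=env.get("BEDROCK_REGION", "us-east-1"),
        aws_region=env.get("AWS_REGION", "eu-north-1"),
        # int() raises ValueError if the variable is set to a non-number.
        max_llm_concurrency=int(env.get("KG_MAX_LLM_CONCURRENCY", "15")),
        disable_redis=_env_bool("DISABLE_REDIS", True),
    )


if __name__ == "__main__":
    print(load_settings())
```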