CLI Reference

Three entry points cover all operations. Each needs an active AWS SSO session for Bedrock LLM access, so run aws sso login first.


ingest_main.py — Build the Knowledge Graph

The primary CLI for ingesting scientific literature and PDFs into the graph.

Common Usage

# Ingest PDFs
uv run python ingest_main.py --source pdf --pdf-files paper.pdf --service bedrock

# Ingest from PubMed
uv run python ingest_main.py --search-term "cardiovascular biomarker" --max-results 100 --service bedrock

# Ingest from PMC (full-text)
uv run python ingest_main.py --source pmc --search-term "kidney disease" --max-results 50

# Ingest from bioRxiv
uv run python ingest_main.py --source biorxiv --search-term "proteomics" --max-results 30

# Multi-query from file
uv run python ingest_main.py --queries-file data/queries_bulk.txt --max-results 200 --service bedrock

Post-Ingestion Operations

# Generate embeddings (required for semantic search)
uv run python ingest_main.py --add-graph-embeddings --database neo4j --service bedrock

# Consolidate relationships (merge duplicates)
uv run python ingest_main.py --consolidate-relationships --database neo4j

# Detect communities (Louvain clustering)
uv run python ingest_main.py --detect-communities --database neo4j

# Label untyped Entity nodes as Protein/Disease
uv run python ingest_main.py --label-nodes --database neo4j

# Enrich diseases with MONDO hierarchy
uv run python ingest_main.py --enrich-hierarchy --database neo4j

Key Flags

| Flag | Short | Description |
|---|---|---|
| --source | | Source: pubmed (default), pmc, biorxiv, pdf |
| --search-term | -t | Search query |
| --queries-file | -q | File with one query per line |
| --max-results | -n | Max articles per query (default: 100) |
| --pdf-files | | Space-separated PDF paths |
| --database | -d | Neo4j database name |
| --service | -s | LLM service: bedrock, local, sagemaker-llama3 |
| --force | -f | Re-ingest existing articles |
| --skip-node-labeling | | Skip slow node classification (fast mode) |
| --enable-consolidation | | Consolidate relationships during ingestion |
| --chunk-size | | Token chunk size (default: 2000) |
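
The --queries-file format (one query per line) can be illustrated with a short Python sketch. It is not taken from ingest_main.py: the helper names parse_queries and build_ingest_command are hypothetical, and skipping blank lines is an assumption about how the real parser behaves. The command it builds uses only flags listed in the table above.

```python
def parse_queries(text):
    """Split queries-file content into one query per non-empty line.

    Assumption: blank lines are ignored; the real CLI may treat them differently.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_ingest_command(query, max_results=100, source="pubmed", service="bedrock"):
    """Build an argv list for a single query using documented flags only."""
    return [
        "uv", "run", "python", "ingest_main.py",
        "--source", source,
        "--search-term", query,
        "--max-results", str(max_results),
        "--service", service,
    ]


if __name__ == "__main__":
    sample = "cardiovascular biomarker\n\nkidney disease\n"
    for query in parse_queries(sample):
        print(build_ingest_command(query, max_results=50))
```

Passing the command as a list (rather than a single string) keeps multi-word queries intact without shell quoting.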

enrichment_main.py — Enrich from CSV/Parquet/TSV

Add structured data (ontologies, protein dictionaries, experimental data) to the graph.

Common Usage

# Load disease ontology
uv run python enrichment_main.py \
  --file src/utils/mondo.obo \
  --database neo4j \
  --column-handlers "0:mondo-id,1:disease-name,2:synonyms"

# Load protein dictionary
uv run python enrichment_main.py \
  --file src/utils/uniprot_ids_human.csv \
  --database neo4j \
  --column-handlers "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"

# Enrich from experimental CSV
uv run python enrichment_main.py --file data/experiment.csv --service bedrock

# Analyze CSV structure only (no writes)
uv run python enrichment_main.py --file data.csv --analyze-only

Key Flags

| Flag | Short | Description |
|---|---|---|
| --file | -f | Input file (CSV, Parquet, TSV, TXT) |
| --database | -d | Neo4j database name |
| --service | -s | LLM service for extraction |
| --column-handlers | | Manual column mapping (e.g., "0:protein-id,2:prop") |
| --analyze-only | | Preview structure without writing |
| --resume | | Skip already-processed rows |
| --offset / --limit | | Process a specific row range |
| --batch-size | | Rows per batch (default: 1000) |
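
How --offset, --limit, and --batch-size might interact can be sketched as follows. This is an illustration only: iter_batches is a hypothetical helper, and it assumes a zero-based offset and a limit that counts rows, neither of which is confirmed by this reference.

```python
from itertools import islice


def iter_batches(rows, offset=0, limit=None, batch_size=1000):
    """Yield lists of at most batch_size rows from rows[offset:offset + limit].

    Assumptions: offset is zero-based and limit is a row count.
    """
    stop = None if limit is None else offset + limit
    window = islice(rows, offset, stop)  # restrict to the requested row range
    while True:
        batch = list(islice(window, batch_size))  # take the next chunk
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in iter_batches(range(10), offset=2, limit=5, batch_size=2):
        print(batch)
```

Because the helper works on any iterable, it never needs the whole file in memory at once.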

Column Handlers

| Handler | Description |
|---|---|
| protein-id / uniprot-id | Column contains protein identifiers |
| mondo-id | Column contains MONDO disease IDs |
| disease-name | Column contains disease names |
| protein-name | Column contains protein names |
| gene-symbol | Column contains gene symbols |
| synonyms | Column contains synonyms (semicolon-separated) |
| prop / property | Add column values as node properties |
| kg / kg-extraction | Run LLM extraction on column text |
| skip / ignore | Skip this column |
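
To make the --column-handlers syntax concrete, here is a minimal Python sketch of how a string such as "0:uniprot-id,1:protein-name" could be parsed. It is not the parser used by enrichment_main.py. The function names are hypothetical, and it assumes that handler names joined by a slash in the table are aliases and that column indices are zero-based, as the usage examples suggest.

```python
# Assumption: names listed together with "/" are aliases of one handler.
ALIASES = {
    "uniprot-id": "protein-id",
    "property": "prop",
    "kg-extraction": "kg",
    "ignore": "skip",
}
KNOWN_HANDLERS = {
    "protein-id", "mondo-id", "disease-name", "protein-name",
    "gene-symbol", "synonyms", "prop", "kg", "skip",
}


def parse_column_handlers(spec):
    """Parse 'index:handler,index:handler' into {column_index: handler}."""
    mapping = {}
    for item in spec.split(","):
        index_text, sep, handler = item.strip().partition(":")
        index_text, handler = index_text.strip(), handler.strip()
        if not sep or not index_text.isdigit():
            raise ValueError(f"expected 'index:handler', got {item!r}")
        handler = ALIASES.get(handler, handler)  # normalise aliases
        if handler not in KNOWN_HANDLERS:
            raise ValueError(f"unknown handler {handler!r}")
        mapping[int(index_text)] = handler
    return mapping


def split_synonyms(cell):
    """Split a semicolon-separated synonyms cell, dropping empty parts."""
    return [part.strip() for part in cell.split(";") if part.strip()]


if __name__ == "__main__":
    print(parse_column_handlers("0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"))
    print(split_synonyms("IL-6; interleukin 6;;IFNB2"))
```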

parallel_ingest.py — Bulk Parallel Ingestion

Run PubMed, PMC, and bioRxiv ingestion simultaneously for maximum throughput.

Common Usage

# Dry run — see what would launch
uv run python parallel_ingest.py -q data/queries_bulk.txt -d neo4j -s bedrock --dry-run

# Full parallel run
uv run python parallel_ingest.py \
  -q data/queries_bulk.txt \
  -d neo4j -s bedrock \
  --skip-node-labeling \
  --pubmed-max 500 --pmc-max 200 --biorxiv-max 100

# PubMed only with mass scraper
uv run python parallel_ingest.py \
  -q data/queries_bulk.txt \
  --skip-pmc --skip-biorxiv \
  --pubmed-mass-scraper

Key Flags

| Flag | Description |
|---|---|
| --queries-file / -q | Queries file (required) |
| --biorxiv-queries-file | Separate, simpler queries for bioRxiv |
| --pubmed-max | Max per query for PubMed (default: 500) |
| --pmc-max | Max per query for PMC (default: 200) |
| --biorxiv-max | Max per query for bioRxiv (default: 100) |
| --skip-pubmed / --skip-pmc / --skip-biorxiv | Skip a source |
| --skip-node-labeling | Fast mode; label nodes later |
| --dry-run | Show commands without executing |
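
The idea of launching one ingestion process per source, with a dry-run mode that only prints the commands, can be sketched in Python. This is not the implementation of parallel_ingest.py: build_source_commands and run_parallel are hypothetical names, and the assumption that each source maps to one ingest_main.py invocation is a guess based on the flags documented for that script.

```python
import shlex
import subprocess

# Per-source defaults as listed in the flag table.
DEFAULT_MAX = {"pubmed": 500, "pmc": 200, "biorxiv": 100}


def build_source_commands(queries_file, database="neo4j", service="bedrock",
                          skip=(), max_results=None, skip_node_labeling=False):
    """Return one ingest_main.py argv list per source that is not skipped."""
    limits = {**DEFAULT_MAX, **(max_results or {})}
    commands = []
    for source in ("pubmed", "pmc", "biorxiv"):
        if source in skip:
            continue
        cmd = [
            "uv", "run", "python", "ingest_main.py",
            "--source", source,
            "--queries-file", queries_file,
            "--max-results", str(limits[source]),
            "--database", database,
            "--service", service,
        ]
        if skip_node_labeling:
            cmd.append("--skip-node-labeling")
        commands.append(cmd)
    return commands


def run_parallel(commands, dry_run=True):
    """Print the commands (dry run) or start them all and wait for exit codes."""
    if dry_run:
        for cmd in commands:
            print(shlex.join(cmd))
        return []
    procs = [subprocess.Popen(cmd) for cmd in commands]  # start all at once
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    run_parallel(build_source_commands("data/queries_bulk.txt", skip=("pmc",)))
```

Starting every process before waiting on any of them is what makes the sources run concurrently rather than one after another.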

Entity Resolution (standalone)

# Full consolidation (UniProt → synonyms → fuzzy)
uv run python -m pipeline.processors.entity_resolver --database neo4j --operation full

# Validate quality
uv run python -m pipeline.processors.entity_resolver --database neo4j --operation validate
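
The UniProt → synonyms → fuzzy order named above suggests a cascade in which cheaper, more reliable matches are tried first. The toy Python sketch below illustrates that general pattern only. It does not reproduce pipeline.processors.entity_resolver: the resolve function, the lookup tables, and the 0.85 similarity cutoff are all hypothetical, and the fuzzy step uses difflib from the standard library as a stand-in for whatever matcher the project uses.

```python
from difflib import get_close_matches


def resolve(name, uniprot_ids, synonyms, canonical_names, cutoff=0.85):
    """Resolve a name in three stages: exact ID, known synonym, fuzzy match.

    Returns (canonical_entity, stage), or (None, "unresolved").
    All lookup tables are illustrative inputs, not the project's data model.
    """
    key = name.strip()
    if key in uniprot_ids:                      # stage 1: exact identifier
        return uniprot_ids[key], "uniprot"
    lowered = key.lower()
    if lowered in synonyms:                     # stage 2: known synonym
        return synonyms[lowered], "synonym"
    close = get_close_matches(lowered, list(canonical_names), n=1, cutoff=cutoff)
    if close:                                   # stage 3: fuzzy string match
        return canonical_names[close[0]], "fuzzy"
    return None, "unresolved"


if __name__ == "__main__":
    # Toy lookup tables for demonstration only.
    ids = {"P05231": "IL6"}
    syns = {"ifnb2": "IL6"}
    names = {"interleukin-6": "IL6"}
    for query in ("P05231", "IFNB2", "Interleukin 6", "albumin"):
        print(query, resolve(query, ids, syns, names))
```

Returning the stage alongside the match makes it easy to audit how many entities were merged by the less certain fuzzy step.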

Environment Variables

All configuration comes from environment variables or a .env file. Secrets are loaded automatically from AWS Secrets Manager.

| Variable | Default | Description |
|---|---|---|
| NEO4J_URI | bolt://localhost:7687 | Neo4j connection |
| NEO4J_PASSWORD | (from Secrets Manager) | Neo4j password |
| NEO4J_DATABASE | neo4j | Database name |
| BEDROCK_MODEL_ID | us.meta.llama3-1-8b-instruct-v1:0 | Bedrock model |
| BEDROCK_REGION | us-east-1 | Bedrock region |
| AWS_REGION | eu-north-1 | General AWS region |
| KG_MAX_LLM_CONCURRENCY | 15 | Parallel LLM calls |
| API_AUTH_TOKEN | (from Secrets Manager) | API bearer token |
| DISABLE_REDIS | true | Skip Redis for local dev |
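
As an illustration of how these variables and their defaults could be read, here is a minimal Python sketch. It is not the project's configuration code: the Settings class and load_settings function are hypothetical, the boolean parsing of DISABLE_REDIS is an assumption, and the Secrets Manager lookup for NEO4J_PASSWORD and API_AUTH_TOKEN is deliberately left out (they stay None when unset).

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_database: str
    neo4j_password: Optional[str]   # would come from Secrets Manager if unset
    bedrock_model_id: str
    bedrock_region: str
    aws_region: str
    max_llm_concurrency: int
    api_auth_token: Optional[str]   # would come from Secrets Manager if unset
    disable_redis: bool


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default) with documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_database=env.get("NEO4J_DATABASE", "neo4j"),
        neo4j_password=env.get("NEO4J_PASSWORD"),
        bedrock_model_id=env.get("BEDROCK_MODEL_ID", "us.meta.llama3-1-8b-instruct-v1:0"),
        bedrock_region=env.get("BEDROCK_REGION", "us-east-1"),
        aws_region=env.get("AWS_REGION", "eu-north-1"),
        max_llm_concurrency=int(env.get("KG_MAX_LLM_CONCURRENCY", "15")),
        api_auth_token=env.get("API_AUTH_TOKEN"),
        # Assumption: "1", "true", and "yes" (any case) mean enabled.
        disable_redis=env.get("DISABLE_REDIS", "true").strip().lower() in {"1", "true", "yes"},
    )


if __name__ == "__main__":
    print(load_settings({"KG_MAX_LLM_CONCURRENCY": "4"}))
```

Accepting an explicit mapping instead of always reading os.environ keeps the loader easy to test without touching the real environment.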