CLI Reference
Three entry points cover all operations. Each requires an active AWS SSO session (run aws sso login first) for Bedrock LLM access.
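Since every entry point depends on that session, a wrapper script might check for one before starting a long run. The sketch below is an assumption about how such a check could look and is not part of the project: it calls aws sts get-caller-identity, a standard AWS CLI command that fails when credentials are missing or expired. The function name has_aws_session and its injectable run parameter are hypothetical.

```python
import subprocess


def has_aws_session(run=subprocess.run):
    """Return True if the AWS CLI can resolve the current caller identity."""
    try:
        result = run(
            ["aws", "sts", "get-caller-identity"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # The AWS CLI is not installed or not on PATH.
        return False
    # A non-zero exit code usually means missing or expired credentials.
    return result.returncode == 0
```

If has_aws_session() returns False, run aws sso login and retry before launching any of the commands below.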
ingest_main.py: Build the Knowledge Graph
The primary CLI for ingesting scientific literature and PDFs into the graph.
Common Usage
# Ingest PDFs
uv run python ingest_main.py --source pdf --pdf-files paper.pdf --service bedrock
# Ingest from PubMed
uv run python ingest_main.py --search-term "cardiovascular biomarker" --max-results 100 --service bedrock
# Ingest from PMC (full-text)
uv run python ingest_main.py --source pmc --search-term "kidney disease" --max-results 50
# Ingest from bioRxiv
uv run python ingest_main.py --source biorxiv --search-term "proteomics" --max-results 30
# Multi-query from file
uv run python ingest_main.py --queries-file data/queries_bulk.txt --max-results 200 --service bedrock
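The queries file holds one query per line. As a rough sketch of how such a file could be read, the helper below ignores blank lines and, as a further assumption not documented here, treats lines starting with # as comments. The name read_queries is hypothetical.

```python
from pathlib import Path


def read_queries(path):
    """Read one search query per line, dropping blanks and '#' comments."""
    queries = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Comment handling is an assumption of this sketch.
        if not line or line.startswith("#"):
            continue
        queries.append(line)
    return queries
```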
Post-Ingestion Operations
# Generate embeddings (required for semantic search)
uv run python ingest_main.py --add-graph-embeddings --database neo4j --service bedrock
# Consolidate relationships (merge duplicates)
uv run python ingest_main.py --consolidate-relationships --database neo4j
# Detect communities (Louvain clustering)
uv run python ingest_main.py --detect-communities --database neo4j
# Label untyped Entity nodes as Protein/Disease
uv run python ingest_main.py --label-nodes --database neo4j
# Enrich diseases with MONDO hierarchy
uv run python ingest_main.py --enrich-hierarchy --database neo4j
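To run all of these steps from a script, the commands can be generated rather than typed. The sketch below builds one argument list per step. It follows the order in which the steps are listed above, which is an assumption and not a confirmed requirement, and it passes --service only to the embeddings step because that is the only command shown with it. The name post_ingestion_commands is hypothetical.

```python
import shlex

# Step flags copied from the commands above; the order is an assumption.
POST_INGESTION_FLAGS = [
    "--add-graph-embeddings",
    "--consolidate-relationships",
    "--detect-communities",
    "--label-nodes",
    "--enrich-hierarchy",
]


def post_ingestion_commands(database="neo4j", service="bedrock"):
    """Build one argv list per post-ingestion step."""
    commands = []
    for flag in POST_INGESTION_FLAGS:
        argv = ["uv", "run", "python", "ingest_main.py", flag, "--database", database]
        # Only the embeddings step is shown with --service above.
        if flag == "--add-graph-embeddings":
            argv += ["--service", service]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Print the commands; nothing is executed here.
    for argv in post_ingestion_commands():
        print(shlex.join(argv))
```

Passing each list to subprocess.run(argv, check=True) would execute the steps one after another and stop at the first failure.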
Key Flags
| Flag | Short | Description |
|---|---|---|
| --source | | pubmed (default), pmc, biorxiv, pdf |
| --search-term | -t | Search query |
| --queries-file | -q | File with one query per line |
| --max-results | -n | Max articles per query (default: 100) |
| --pdf-files | | Space-separated PDF paths |
| --database | -d | Neo4j database name |
| --service | -s | LLM: bedrock, local, sagemaker-llama3 |
| --force | -f | Re-ingest existing articles |
| --skip-node-labeling | | Skip slow node classification (fast mode) |
| --enable-consolidation | | Consolidate relationships during ingestion |
| --chunk-size | | Token chunk size (default: 2000) |
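The --chunk-size flag bounds how many tokens go into each chunk. The project's tokenizer and any overlap between chunks are not documented here, so the following is only a generic illustration of fixed-size chunking, with whitespace-separated words standing in for tokens. The name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=2000):
    """Split a token list into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Words stand in for tokens; a real tokenizer would count differently.
text = "word " * 4500
chunks = chunk_tokens(text.split(), chunk_size=2000)
print([len(c) for c in chunks])  # [2000, 2000, 500]
```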
enrichment_main.py: Enrich from CSV/Parquet/TSV
Add structured data (ontologies, protein dictionaries, experimental data) to the graph.
Common Usage
# Load disease ontology
uv run python enrichment_main.py \
--file src/utils/mondo.obo \
--database neo4j \
--column-handlers "0:mondo-id,1:disease-name,2:synonyms"
# Load protein dictionary
uv run python enrichment_main.py \
--file src/utils/uniprot_ids_human.csv \
--database neo4j \
--column-handlers "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"
# Enrich from experimental CSV
uv run python enrichment_main.py --file data/experiment.csv --service bedrock
# Analyze CSV structure only (no writes)
uv run python enrichment_main.py --file data.csv --analyze-only
Key Flags
| Flag | Short | Description |
|---|---|---|
| --file | -f | Input file (CSV, Parquet, TSV, TXT) |
| --database | -d | Neo4j database name |
| --service | -s | LLM service for extraction |
| --column-handlers | | Manual column mapping (e.g., "0:protein-id,2:prop") |
| --analyze-only | | Preview structure without writing |
| --resume | | Skip already-processed rows |
| --offset / --limit | | Process a specific row range |
| --batch-size | | Rows per batch (default: 1000) |
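The --offset, --limit and --batch-size flags together decide which rows are processed and in what groups. How the tool combines them is not spelled out here. The sketch below assumes --offset is a zero-based starting row, --limit is a maximum row count, and batches are consecutive half-open ranges. The name batch_ranges is hypothetical.

```python
def batch_ranges(total_rows, offset=0, limit=None, batch_size=1000):
    """Yield (start, end) half-open row ranges covering the selected rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Clamp the requested window to the rows that actually exist.
    start = min(max(offset, 0), total_rows)
    end = total_rows if limit is None else min(start + max(limit, 0), total_rows)
    for batch_start in range(start, end, batch_size):
        yield batch_start, min(batch_start + batch_size, end)


print(list(batch_ranges(5200, offset=200, limit=2500)))
# [(200, 1200), (1200, 2200), (2200, 2700)]
```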
Column Handlers
| Handler | Description |
|---|---|
| protein-id / uniprot-id | Column contains protein identifiers |
| mondo-id | Column contains MONDO disease IDs |
| disease-name | Column contains disease names |
| protein-name | Column contains protein names |
| gene-symbol | Column contains gene symbols |
| synonyms | Column contains synonyms (semicolon-separated) |
| prop / property | Add column values as node properties |
| kg / kg-extraction | Run LLM extraction on column text |
| skip / ignore | Skip this column |
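A --column-handlers value is a comma-separated list of INDEX:HANDLER pairs. The sketch below shows one way such a string could be parsed and validated. It treats the paired names in the table (for example protein-id and uniprot-id) as aliases, which the table suggests but does not state, and the canonical names it picks are arbitrary. parse_column_handlers and split_synonyms are hypothetical helpers, not the tool's actual parser.

```python
# Alias pairs taken from the handler table; the canonical names on the
# right-hand side are an arbitrary choice for this sketch.
ALIASES = {
    "uniprot-id": "protein-id",
    "property": "prop",
    "kg-extraction": "kg",
    "ignore": "skip",
}
KNOWN = {
    "protein-id", "mondo-id", "disease-name", "protein-name",
    "gene-symbol", "synonyms", "prop", "kg", "skip",
}


def parse_column_handlers(spec):
    """Parse '0:mondo-id,1:disease-name' into {0: 'mondo-id', 1: 'disease-name'}."""
    mapping = {}
    for item in spec.split(","):
        index_text, sep, name = item.strip().partition(":")
        if not sep:
            raise ValueError(f"expected INDEX:HANDLER, got {item!r}")
        handler = ALIASES.get(name.strip(), name.strip())
        if handler not in KNOWN:
            raise ValueError(f"unknown handler {name!r}")
        mapping[int(index_text)] = handler
    return mapping


def split_synonyms(cell):
    """Split a semicolon-separated synonyms cell into clean strings."""
    return [part.strip() for part in cell.split(";") if part.strip()]


print(parse_column_handlers("0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"))
# {0: 'protein-id', 1: 'protein-name', 2: 'gene-symbol', 3: 'synonyms'}
```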
parallel_ingest.py: Bulk Parallel Ingestion
Run PubMed, PMC, and bioRxiv ingestion simultaneously for maximum throughput.
Common Usage
# Dry run — see what would launch
uv run python parallel_ingest.py -q data/queries_bulk.txt -d neo4j -s bedrock --dry-run
# Full parallel run
uv run python parallel_ingest.py \
-q data/queries_bulk.txt \
-d neo4j -s bedrock \
--skip-node-labeling \
--pubmed-max 500 --pmc-max 200 --biorxiv-max 100
# PubMed only with mass scraper
uv run python parallel_ingest.py \
-q data/queries_bulk.txt \
--skip-pmc --skip-biorxiv \
--pubmed-mass-scraper
Key Flags
| Flag | Description |
|---|---|
| --queries-file / -q | Queries file (required) |
| --biorxiv-queries-file | Separate, simpler queries for bioRxiv |
| --pubmed-max | Max per query for PubMed (default: 500) |
| --pmc-max | Max per query for PMC (default: 200) |
| --biorxiv-max | Max per query for bioRxiv (default: 100) |
| --skip-pubmed / --skip-pmc / --skip-biorxiv | Skip a source |
| --skip-node-labeling | Fast mode; label nodes later |
| --dry-run | Show commands without executing |
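To make the interplay of these flags concrete, here is a sketch of how a launcher could turn them into one command per source. It assumes, without confirmation from this page, that each source is handled by a separate ingest_main.py invocation. build_source_commands and its parameters are hypothetical names, and the per-source defaults are the ones listed in the table.

```python
# Per-source defaults as documented for --pubmed-max, --pmc-max, --biorxiv-max.
DEFAULT_MAX = {"pubmed": 500, "pmc": 200, "biorxiv": 100}


def build_source_commands(queries_file, database="neo4j", service="bedrock",
                          max_results=None, skip=(), skip_node_labeling=False,
                          biorxiv_queries_file=None):
    """Return {source: argv} for every source that is not skipped."""
    limits = {**DEFAULT_MAX, **(max_results or {})}
    commands = {}
    for source, limit in limits.items():
        if source in skip:
            continue
        # bioRxiv may use its own, simpler queries file.
        qfile = queries_file
        if source == "biorxiv" and biorxiv_queries_file:
            qfile = biorxiv_queries_file
        argv = ["uv", "run", "python", "ingest_main.py",
                "--source", source, "--queries-file", qfile,
                "--max-results", str(limit),
                "--database", database, "--service", service]
        if skip_node_labeling:
            argv.append("--skip-node-labeling")
        commands[source] = argv
    return commands
```

A dry run would simply print each argument list. A real run could start each one with subprocess.Popen and then wait for all of them to finish.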
Entity Resolution (standalone)
# Full consolidation (UniProt → synonyms → fuzzy)
uv run python -m pipeline.processors.entity_resolver --database neo4j --operation full
# Validate quality
uv run python -m pipeline.processors.entity_resolver --database neo4j --operation validate
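The full operation is described as a UniProt, then synonyms, then fuzzy cascade. The resolver's real matching rules are not documented here, so the sketch below only illustrates the general idea of trying progressively looser matches. The resolve function, the two lookup dictionaries, and the 0.85 cutoff are all assumptions, and the fuzzy stage uses difflib from the Python standard library as a stand-in.

```python
from difflib import get_close_matches


def resolve(mention, uniprot_index, synonym_index, cutoff=0.85):
    """Return (canonical_name, stage), or (None, None) if nothing matches."""
    key = mention.strip()
    # Stage 1: exact UniProt accession match.
    if key.upper() in uniprot_index:
        return uniprot_index[key.upper()], "uniprot"
    # Stage 2: case-insensitive synonym lookup.
    if key.lower() in synonym_index:
        return synonym_index[key.lower()], "synonym"
    # Stage 3: fuzzy match against the known synonyms.
    close = get_close_matches(key.lower(), list(synonym_index), n=1, cutoff=cutoff)
    if close:
        return synonym_index[close[0]], "fuzzy"
    return None, None


# Tiny illustrative indexes; a real run would build these from the graph.
uniprot_index = {"P02741": "C-reactive protein"}
synonym_index = {
    "crp": "C-reactive protein",
    "c-reactive protein": "C-reactive protein",
}

print(resolve("c-reactive protien", uniprot_index, synonym_index))
# ('C-reactive protein', 'fuzzy')
```

Running the strict stages first keeps the fuzzy stage, which is the most likely to produce false merges, as a last resort.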
Environment Variables
All configuration is read from environment variables or a .env file. Secrets are loaded automatically from AWS Secrets Manager.
| Variable | Default | Description |
|---|---|---|
| NEO4J_URI | bolt://localhost:7687 | Neo4j connection |
| NEO4J_PASSWORD | (from Secrets Manager) | Neo4j password |
| NEO4J_DATABASE | neo4j | Database name |
| BEDROCK_MODEL_ID | us.meta.llama3-1-8b-instruct-v1:0 | Bedrock model |
| BEDROCK_REGION | us-east-1 | Bedrock region |
| AWS_REGION | eu-north-1 | General AWS region |
| KG_MAX_LLM_CONCURRENCY | 15 | Parallel LLM calls |
| API_AUTH_TOKEN | (from Secrets Manager) | API bearer token |
| DISABLE_REDIS | true | Skip Redis for local dev |
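As an illustration of how these defaults and overrides fit together, the sketch below merges the documented defaults with whatever is set in the environment. It is not the project's loader: load_settings is a hypothetical name, the Secrets Manager lookup is left out (unset secrets are returned as None), and the set of values accepted as true for DISABLE_REDIS is an assumption.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "NEO4J_URI": "bolt://localhost:7687",
    "NEO4J_DATABASE": "neo4j",
    "BEDROCK_MODEL_ID": "us.meta.llama3-1-8b-instruct-v1:0",
    "BEDROCK_REGION": "us-east-1",
    "AWS_REGION": "eu-north-1",
    "KG_MAX_LLM_CONCURRENCY": "15",
    "DISABLE_REDIS": "true",
}
# No documented default: these come from AWS Secrets Manager when unset.
SECRETS = ("NEO4J_PASSWORD", "API_AUTH_TOKEN")


def load_settings(env=None):
    """Merge documented defaults with overrides from the environment."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    for name in SECRETS:
        # None signals that the value still has to be fetched elsewhere.
        settings[name] = env.get(name)
    settings["KG_MAX_LLM_CONCURRENCY"] = int(settings["KG_MAX_LLM_CONCURRENCY"])
    # Accepted truthy spellings are an assumption of this sketch.
    settings["DISABLE_REDIS"] = settings["DISABLE_REDIS"].strip().lower() in ("1", "true", "yes")
    return settings
```

Calling load_settings({"KG_MAX_LLM_CONCURRENCY": "4"}) keeps every other default and returns the concurrency as the integer 4.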