Getting Started
Prerequisites
- Python 3.12+
- uv package manager
- AWS SSO access (aws sso login)
- Neo4j 5.x (Docker locally, or AWS shared instance)
Installation
Always run through uv:
Configuration
Credentials are loaded automatically from AWS Secrets Manager after you run aws sso login. No .env file is required.
The system pulls NEO4J_PASSWORD, API keys, and other secrets from graphrag-secrets in Secrets Manager at startup.
Optional: .env overrides
To override any value (for example, to point to a local Neo4j), create a .env file:
# Only set what you want to override
NEO4J_URI=bolt://localhost:7687
NEO4J_DATABASE=neo4j
BEDROCK_MODEL_ID=us.meta.llama3-1-8b-instruct-v1:0
.env values take precedence over Secrets Manager.
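The precedence rule can be pictured as a two-layer merge in which the .env layer is applied last. The following is a minimal sketch, not the project's actual loader: parse_dotenv and resolve_settings are hypothetical helpers, and the secret values are made-up placeholders.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_settings(secrets: dict[str, str], overrides: dict[str, str]) -> dict[str, str]:
    """Merge two config layers; keys present in `overrides` win."""
    merged = dict(secrets)
    merged.update(overrides)
    return merged


# Placeholder values standing in for what Secrets Manager might return.
secrets = {
    "NEO4J_URI": "bolt://shared.example.internal:7687",
    "NEO4J_PASSWORD": "from-secrets-manager",
}
dotenv_text = """
# Only set what you want to override
NEO4J_URI=bolt://localhost:7687
NEO4J_DATABASE=neo4j
"""

settings = resolve_settings(secrets, parse_dotenv(dotenv_text))
print(settings["NEO4J_URI"])       # bolt://localhost:7687 (overridden by .env)
print(settings["NEO4J_PASSWORD"])  # from-secrets-manager (not overridden)
```

Keys that the .env file does not mention keep the value from Secrets Manager, which is why the override file only needs the entries you want to change.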
Quick Start: Build a KG from PDFs
The fastest path from zero to a queryable graph:
# 1. Start Neo4j (if not using AWS instance)
docker run -d --name neo4j -p 7687:7687 -p 7474:7474 \
-e NEO4J_AUTH=neo4j/password123 neo4j:5
# 2. Ingest PDFs
uv run python ingest_main.py \
--source pdf \
--pdf-files papers/paper1.pdf papers/paper2.pdf \
--database neo4j \
--service bedrock
# 3. Generate embeddings (required for search)
uv run python ingest_main.py \
--add-graph-embeddings \
--database neo4j \
--service bedrock
# 4. Start the API
uv run granian --interface asgi api.app:app --host 127.0.0.1 --port 8000 --reload
For a detailed walkthrough, see Build Your First KG (Demo).
Quick Start: Build a KG from PubMed
# Fetch 100 abstracts
uv run python ingest_main.py \
--search-term "cardiovascular disease protein biomarker" \
--max-results 100 \
--database neo4j \
--service bedrock
# Consolidate entities (merge duplicates)
uv run python ingest_main.py --consolidate-relationships --database neo4j
# Generate embeddings
uv run python ingest_main.py --add-graph-embeddings --database neo4j --service bedrock
Production Workflow (Full KG)
To build a comprehensive knowledge graph with ontologies:
DATABASE="psx"
# 1. Load disease ontology (MONDO IDs)
uv run python enrichment_main.py \
--file src/utils/mondo.obo \
--database $DATABASE \
--column-handlers "0:mondo-id,1:disease-name,2:synonyms"
# 2. Load protein dictionary (UniProt IDs)
uv run python enrichment_main.py \
--file src/utils/uniprot_ids_human.csv \
--database $DATABASE \
--column-handlers "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"
# 3. Ingest from all queries (31 disease areas)
uv run python parallel_ingest.py \
-q data/queries_bulk.txt \
-d $DATABASE -s bedrock \
--pubmed-max 500 --pmc-max 200 --biorxiv-max 100
# 4. Consolidate
uv run python ingest_main.py --consolidate-relationships --database $DATABASE
# 5. Embeddings
uv run python ingest_main.py --add-graph-embeddings --database $DATABASE --service bedrock
# 6. (Optional) Enrich from external bioinformatics APIs (STRING, HPA, Reactome, ChEMBL, ClinVar, OpenAlex)
uv run python -m pipeline.processors.science_skills_enricher \
--database $DATABASE --sources all --batch-size 50 --rate-limit 1.0
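The --column-handlers values in steps 1 and 2 read as comma-separated INDEX:HANDLER pairs, where the index appears to refer to a column position in the input file. The Python sketch below shows one way such a string could be parsed. parse_column_handlers is a hypothetical helper written for illustration, and the actual parsing in enrichment_main.py may differ.

```python
def parse_column_handlers(spec: str) -> dict[int, str]:
    """Turn "0:mondo-id,1:disease-name" into {0: "mondo-id", 1: "disease-name"}."""
    handlers: dict[int, str] = {}
    for item in spec.split(","):
        index_text, sep, name = item.strip().partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected INDEX:HANDLER, got {item!r}")
        index = int(index_text)  # raises ValueError for a non-numeric index
        if index in handlers:
            raise ValueError(f"column {index} is mapped twice")
        handlers[index] = name.strip()
    return handlers


mapping = parse_column_handlers("0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms")
print(mapping[2])  # gene-symbol
```

Rejecting duplicate indices and malformed pairs up front is a design choice of this sketch; it surfaces a mistyped handler string before any rows are read.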
Your First Query
# Create a session
curl -X POST "http://localhost:8000/v1/sessions" \
-H "Authorization: Bearer $API_AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{"database_name":"neo4j","llm_service":"bedrock"}'
# Query (replace SESSION_ID)
curl -X POST "http://localhost:8000/v1/sessions/SESSION_ID/query" \
-H "Authorization: Bearer $API_AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{"question": "What proteins are associated with heart disease?"}'
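The same two requests can be prepared from Python using only the standard library. This is a minimal sketch under stated assumptions: build_post and send are hypothetical helpers, and SESSION_ID stays a placeholder because the shape of the session response is not shown here.

```python
import json
import os
import urllib.request

BASE_URL = "http://localhost:8000"


def build_post(path: str, token: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request carrying the bearer token."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


token = os.environ.get("API_AUTH_TOKEN", "")

# Step 1: create a session.
session_request = build_post(
    "/v1/sessions", token, {"database_name": "neo4j", "llm_service": "bedrock"}
)

# Step 2: query (replace SESSION_ID with the ID returned by step 1).
query_request = build_post(
    "/v1/sessions/SESSION_ID/query",
    token,
    {"question": "What proteins are associated with heart disease?"},
)

print(session_request.get_method(), session_request.full_url)
```

With the API running, call send(session_request), read the session ID from the returned JSON, substitute it into the query path, and then call send on the query request.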
Or start the frontend:
Docker Development
# Full stack: API + frontend + Redis
docker compose -f docker-compose.yml -f docker-compose.dev.yml --profile dev up
Services: API on :8000, Frontend on :5173, Redis on :6379.
Common Commands
| Command | Purpose |
|---|---|
| uv run pytest | Run tests |
| uv run ruff check . | Lint |
| uv run ruff format . | Format |
| uv run mypy src/ api/ pipeline/ | Type check |
| uv run granian --interface asgi api.app:app --host 0.0.0.0 --port 8000 --reload | Start API |
Next Steps
- Build Your First KG (Demo) — 15-min hands-on walkthrough
- AWS Credentials & Neo4j Tunnel — connect to shared AWS instance
- Architecture — system design
- Schema Expansion — PTMs, variants, data model discussion
- Ingestion Pipeline — data sources and processing
- API Reference — REST endpoints