Getting Started

Prerequisites

Python 3.12+
uv package manager
AWS SSO access (aws sso login)
Neo4j 5.x (Docker locally, or AWS shared instance)

Installation

git clone https://github.com/Olink-Proteomics/graphrag_api.git
cd graphrag_api
uv sync

Always run scripts and tests through uv so they use the environment created by uv sync:

uv run python <script>
uv run pytest ...

Configuration

Credentials load automatically from AWS Secrets Manager once you have run aws sso login. No .env file is required.

# This is all you need:
aws sso login

The system pulls NEO4J_PASSWORD, API keys, and other secrets from graphrag-secrets in Secrets Manager at startup.

Optional: .env overrides

To override any value (for example, to point at a local Neo4j), create a .env file:

# Only set what you want to override
NEO4J_URI=bolt://localhost:7687
NEO4J_DATABASE=neo4j
BEDROCK_MODEL_ID=us.meta.llama3-1-8b-instruct-v1:0

.env values take precedence over Secrets Manager.
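
The precedence rule can be sketched in a few lines of Python. This is only an illustration of "overrides win", not the project's actual loader: `parse_dotenv`, `resolve_settings`, and the placeholder secret values are hypothetical names and data introduced for the example.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_settings(secrets: dict[str, str], dotenv: dict[str, str]) -> dict[str, str]:
    """Return the effective settings: .env entries win over Secrets Manager."""
    return {**secrets, **dotenv}


# Placeholder values standing in for whatever Secrets Manager returns (assumed).
secrets = {"NEO4J_URI": "bolt://shared.example:7687", "NEO4J_PASSWORD": "from-secrets"}
dotenv = parse_dotenv("# Only set what you want to override\nNEO4J_URI=bolt://localhost:7687\n")

settings = resolve_settings(secrets, dotenv)
print(settings["NEO4J_URI"])       # bolt://localhost:7687 (overridden by .env)
print(settings["NEO4J_PASSWORD"])  # from-secrets (not overridden)
```

Keys that are absent from .env keep their Secrets Manager value, which is why the .env only needs the entries you want to change.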

Quick Start: Build a KG from PDFs

The fastest path from zero to queryable graph:

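# 1. Start Neo4j locally (skip if you use the shared AWS instance).
#    For a local container, set NEO4J_URI in .env as described under Configuration.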
docker run -d --name neo4j -p 7687:7687 -p 7474:7474 \
  -e NEO4J_AUTH=neo4j/password123 neo4j:5

# 2. Ingest PDFs
uv run python ingest_main.py \
  --source pdf \
  --pdf-files papers/paper1.pdf papers/paper2.pdf \
  --database neo4j \
  --service bedrock

# 3. Generate embeddings (required for search)
uv run python ingest_main.py \
  --add-graph-embeddings \
  --database neo4j \
  --service bedrock

# 4. Start the API
uv run granian --interface asgi api.app:app --host 127.0.0.1 --port 8000 --reload

For a detailed walkthrough, see Build Your First KG (Demo).

Quick Start: Build a KG from PubMed

# Fetch and ingest up to 100 abstracts
uv run python ingest_main.py \
  --search-term "cardiovascular disease protein biomarker" \
  --max-results 100 \
  --database neo4j \
  --service bedrock

# Consolidate entities (merge duplicates)
uv run python ingest_main.py --consolidate-relationships --database neo4j

# Generate embeddings
uv run python ingest_main.py --add-graph-embeddings --database neo4j --service bedrock

Production Workflow (Full KG)

For building a comprehensive knowledge graph with ontologies:

DATABASE="psx"

# 1. Load disease ontology (MONDO IDs)
uv run python enrichment_main.py \
  --file src/utils/mondo.obo \
  --database $DATABASE \
  --column-handlers "0:mondo-id,1:disease-name,2:synonyms"

# 2. Load protein dictionary (UniProt IDs)
uv run python enrichment_main.py \
  --file src/utils/uniprot_ids_human.csv \
  --database $DATABASE \
  --column-handlers "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"

# 3. Ingest from all queries (31 disease areas)
uv run python parallel_ingest.py \
  -q data/queries_bulk.txt \
  -d $DATABASE -s bedrock \
  --pubmed-max 500 --pmc-max 200 --biorxiv-max 100

# 4. Consolidate
uv run python ingest_main.py --consolidate-relationships --database $DATABASE

# 5. Embeddings
uv run python ingest_main.py --add-graph-embeddings --database $DATABASE --service bedrock

# 6. (Optional) Enrich from external bioinformatics APIs (STRING, HPA, Reactome, ChEMBL, ClinVar, OpenAlex)
uv run python -m pipeline.processors.science_skills_enricher \
  --database $DATABASE --sources all --batch-size 50 --rate-limit 1.0
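
Because steps 1 to 5 must run in order and each depends on the previous one succeeding, it can help to drive them from a small script that stops at the first failure. The sketch below only assembles the commands shown above; `production_commands` and `run_all` are hypothetical helper names, and the optional enrichment step 6 is left out.

```python
import shlex
import subprocess


def production_commands(database: str) -> list[list[str]]:
    """Build the argv lists for steps 1-5, using only the flags shown above."""
    py = ["uv", "run", "python"]
    return [
        py + ["enrichment_main.py", "--file", "src/utils/mondo.obo",
              "--database", database,
              "--column-handlers", "0:mondo-id,1:disease-name,2:synonyms"],
        py + ["enrichment_main.py", "--file", "src/utils/uniprot_ids_human.csv",
              "--database", database,
              "--column-handlers", "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"],
        py + ["parallel_ingest.py", "-q", "data/queries_bulk.txt",
              "-d", database, "-s", "bedrock",
              "--pubmed-max", "500", "--pmc-max", "200", "--biorxiv-max", "100"],
        py + ["ingest_main.py", "--consolidate-relationships", "--database", database],
        py + ["ingest_main.py", "--add-graph-embeddings",
              "--database", database, "--service", "bedrock"],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order; check=True raises on the first non-zero exit."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Dry run: print the commands instead of executing them.
for argv in production_commands("psx"):
    print(shlex.join(argv))
```

Calling `run_all(production_commands("psx"))` from the repository root would execute the steps for real; the dry run above is a cheap way to review the commands first.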

Your First Query

# Create a session
curl -X POST "http://localhost:8000/v1/sessions" \
  -H "Authorization: Bearer $API_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"database_name":"neo4j","llm_service":"bedrock"}'

# Query (replace SESSION_ID with the ID returned by the create-session call)
curl -X POST "http://localhost:8000/v1/sessions/SESSION_ID/query" \
  -H "Authorization: Bearer $API_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"question": "What proteins are associated with heart disease?"}'
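
The same two calls can be made from Python with only the standard library. This is a sketch under assumptions: the helper names are hypothetical, `send` assumes the API answers with JSON, and the name of the field that carries the session ID is not specified here, so inspect the create-session response before building the query URL.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def _post(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request carrying the bearer token."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def session_request(token: str, database_name: str = "neo4j",
                    llm_service: str = "bedrock") -> urllib.request.Request:
    """Request that creates a session (POST /v1/sessions)."""
    payload = {"database_name": database_name, "llm_service": llm_service}
    return _post(f"{BASE_URL}/v1/sessions", token, payload)


def query_request(token: str, session_id: str, question: str) -> urllib.request.Request:
    """Request that asks a question within an existing session."""
    return _post(f"{BASE_URL}/v1/sessions/{session_id}/query", token, {"question": question})


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the body, assuming the API returns JSON."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, `send(session_request(os.environ["API_AUTH_TOKEN"]))` (after `import os`) returns the decoded create-session response, from which the session ID can be read and passed to `query_request`.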

Or start the frontend (the command assumes the gav360_graphrag_react repository is checked out next to graphrag_api):

cd ../gav360_graphrag_react && npm install && npm run dev
# Open http://localhost:5173

Docker Development

# Full stack: API + frontend + Redis
docker compose -f docker-compose.yml -f docker-compose.dev.yml --profile dev up

Services: API on :8000, Frontend on :5173, Redis on :6379.
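
To confirm the stack came up, a plain TCP check against the three ports is often enough. The sketch below is a generic connectivity probe, not a project-provided health check; `port_is_open` and `check_services` are hypothetical helpers, and a port that accepts connections does not guarantee the service behind it is fully ready.

```python
import socket

# Ports as listed above.
SERVICES = {"API": 8000, "Frontend": 5173, "Redis": 6379}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "127.0.0.1",
                   services: dict[str, int] = SERVICES) -> dict[str, bool]:
    """Probe every service port and report which ones accept connections."""
    return {name: port_is_open(host, port) for name, port in services.items()}


for name, is_up in check_services().items():
    print(f"{name}: {'up' if is_up else 'not reachable'}")
```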

Common Commands

Command	Purpose
`uv run pytest`	Run tests
`uv run ruff check .`	Lint
`uv run ruff format .`	Format
`uv run mypy src/ api/ pipeline/`	Type check
`uv run granian --interface asgi api.app:app --host 0.0.0.0 --port 8000 --reload`	Start API

Next Steps

Build Your First KG (Demo) — 15-min hands-on walkthrough
AWS Credentials & Neo4j Tunnel — connect to shared AWS instance
Architecture — system design
Schema Expansion — PTMs, variants, data model discussion
Ingestion Pipeline — data sources and processing
API Reference — REST endpoints