
Getting Started

Prerequisites

  • Python 3.12+
  • uv package manager
  • AWS SSO access (aws sso login)
  • Neo4j 5.x (Docker locally, or AWS shared instance)

Installation

git clone https://github.com/Olink-Proteomics/graphrag_api.git
cd graphrag_api
uv sync

Always run through uv:

uv run python <script>
uv run pytest ...

Configuration

Credentials are loaded automatically from AWS Secrets Manager after aws sso login. No .env file required.

# This is all you need:
aws sso login

The system pulls NEO4J_PASSWORD, API keys, and other secrets from graphrag-secrets in Secrets Manager at startup.

Optional: .env overrides

To override any value (e.g., to point to a local Neo4j), create a .env file:

# Only set what you want to override
NEO4J_URI=bolt://localhost:7687
NEO4J_DATABASE=neo4j
BEDROCK_MODEL_ID=us.meta.llama3-1-8b-instruct-v1:0

.env values take precedence over Secrets Manager.
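
The precedence rule can be pictured as a plain dictionary merge. The following is a minimal sketch, not the project's actual loader: it assumes the graphrag-secrets payload is a flat JSON object, and the function name resolve_settings and the sample values are hypothetical.

```python
import json


def resolve_settings(secret_string: str, env_overrides: dict[str, str]) -> dict[str, str]:
    """Merge Secrets Manager values with .env overrides; overrides win."""
    # Assumption: the secret is stored as a flat JSON object of key/value pairs.
    settings = json.loads(secret_string)
    # Anything set in .env replaces the value that came from Secrets Manager.
    settings.update(env_overrides)
    return settings


# Hypothetical secret payload and .env contents, for illustration only.
secret = json.dumps(
    {"NEO4J_URI": "bolt://shared-host:7687", "NEO4J_PASSWORD": "from-secrets-manager"}
)
overrides = {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_DATABASE": "neo4j"}

settings = resolve_settings(secret, overrides)
print(settings["NEO4J_URI"])  # bolt://localhost:7687
```

Keys that are absent from .env, such as NEO4J_PASSWORD here, keep their Secrets Manager value.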

Quick Start: Build a KG from PDFs

The fastest path from zero to a queryable graph:

# 1. Start Neo4j (skip if using the shared AWS instance)
docker run -d --name neo4j -p 7687:7687 -p 7474:7474 \
  -e NEO4J_AUTH=neo4j/password123 neo4j:5

# 2. Ingest PDFs
uv run python ingest_main.py \
  --source pdf \
  --pdf-files papers/paper1.pdf papers/paper2.pdf \
  --database neo4j \
  --service bedrock

# 3. Generate embeddings (required for search)
uv run python ingest_main.py \
  --add-graph-embeddings \
  --database neo4j \
  --service bedrock

# 4. Start the API
uv run granian --interface asgi api.app:app --host 127.0.0.1 --port 8000 --reload

For a detailed walkthrough, see Build Your First KG (Demo).

Quick Start: Build a KG from PubMed

# Fetch 100 abstracts
uv run python ingest_main.py \
  --search-term "cardiovascular disease protein biomarker" \
  --max-results 100 \
  --database neo4j \
  --service bedrock

# Consolidate entities (merge duplicates)
uv run python ingest_main.py --consolidate-relationships --database neo4j

# Generate embeddings
uv run python ingest_main.py --add-graph-embeddings --database neo4j --service bedrock

Production Workflow (Full KG)

For building a comprehensive knowledge graph with ontologies:

DATABASE="psx"

# 1. Load disease ontology (MONDO IDs)
uv run python enrichment_main.py \
  --file src/utils/mondo.obo \
  --database $DATABASE \
  --column-handlers "0:mondo-id,1:disease-name,2:synonyms"

# 2. Load protein dictionary (UniProt IDs)
uv run python enrichment_main.py \
  --file src/utils/uniprot_ids_human.csv \
  --database $DATABASE \
  --column-handlers "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"

# 3. Ingest from all queries (31 disease areas)
uv run python parallel_ingest.py \
  -q data/queries_bulk.txt \
  -d $DATABASE -s bedrock \
  --pubmed-max 500 --pmc-max 200 --biorxiv-max 100

# 4. Consolidate
uv run python ingest_main.py --consolidate-relationships --database $DATABASE

# 5. Embeddings
uv run python ingest_main.py --add-graph-embeddings --database $DATABASE --service bedrock

# 6. (Optional) Enrich from external bioinformatics APIs (STRING, HPA, Reactome, ChEMBL, ClinVar, OpenAlex)
uv run python -m pipeline.processors.science_skills_enricher \
  --database $DATABASE --sources all --batch-size 50 --rate-limit 1.0
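
If you prefer to drive these steps from one script, the sketch below assembles the same commands as argument lists and prints them as a dry run. The helper name build_pipeline and its enrich flag are hypothetical; the scripts, flags, and file paths are copied from the steps above.

```python
import shlex


def build_pipeline(database: str, enrich: bool = False) -> list[list[str]]:
    """Return the production workflow as a list of argv lists, in order."""
    py = ["uv", "run", "python"]
    steps = [
        # 1. Disease ontology (MONDO IDs)
        py + ["enrichment_main.py", "--file", "src/utils/mondo.obo",
              "--database", database,
              "--column-handlers", "0:mondo-id,1:disease-name,2:synonyms"],
        # 2. Protein dictionary (UniProt IDs)
        py + ["enrichment_main.py", "--file", "src/utils/uniprot_ids_human.csv",
              "--database", database,
              "--column-handlers", "0:uniprot-id,1:protein-name,2:gene-symbol,3:synonyms"],
        # 3. Bulk ingest from the query file
        py + ["parallel_ingest.py", "-q", "data/queries_bulk.txt",
              "-d", database, "-s", "bedrock",
              "--pubmed-max", "500", "--pmc-max", "200", "--biorxiv-max", "100"],
        # 4. Consolidate
        py + ["ingest_main.py", "--consolidate-relationships", "--database", database],
        # 5. Embeddings
        py + ["ingest_main.py", "--add-graph-embeddings",
              "--database", database, "--service", "bedrock"],
    ]
    if enrich:
        # 6. Optional enrichment from external bioinformatics APIs
        steps.append(py + ["-m", "pipeline.processors.science_skills_enricher",
                           "--database", database, "--sources", "all",
                           "--batch-size", "50", "--rate-limit", "1.0"])
    return steps


# Dry run: print each command instead of executing it.
for step in build_pipeline("psx"):
    print(shlex.join(step))
```

To execute rather than print, each step could be passed to subprocess.run(step, check=True), which stops the sequence at the first failing command.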

Your First Query

# Create a session
curl -X POST "http://localhost:8000/v1/sessions" \
  -H "Authorization: Bearer $API_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"database_name":"neo4j","llm_service":"bedrock"}'

# Query (replace SESSION_ID with the ID returned by the previous call)
curl -X POST "http://localhost:8000/v1/sessions/SESSION_ID/query" \
  -H "Authorization: Bearer $API_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"question": "What proteins are associated with heart disease?"}'

Or start the frontend:

cd ../gav360_graphrag_react && npm install && npm run dev
# Open http://localhost:5173
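
The two curl calls in this section can also be scripted. The sketch below uses only the Python standard library and mirrors the endpoints, headers, and payloads shown above. The helper names are hypothetical, and the name of the session ID field in the response ("session_id") is an assumption, so check the actual response before relying on it.

```python
import json
import os
import urllib.request

BASE_URL = "http://localhost:8000"


def build_request(path: str, payload: dict, token: str) -> urllib.request.Request:
    """Build an authenticated JSON POST request for the API."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


token = os.environ.get("API_AUTH_TOKEN", "")
session_request = build_request(
    "/v1/sessions", {"database_name": "neo4j", "llm_service": "bedrock"}, token
)

# With the API running, uncomment to create a session and send a question:
# session = send(session_request)
# session_id = session["session_id"]  # assumed field name
# query_request = build_request(
#     f"/v1/sessions/{session_id}/query",
#     {"question": "What proteins are associated with heart disease?"},
#     token,
# )
# print(send(query_request))
```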

Docker Development

# Full stack: API + frontend + Redis
docker compose -f docker-compose.yml -f docker-compose.dev.yml --profile dev up

Services: API on :8000, Frontend on :5173, Redis on :6379.
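
To confirm the stack came up, a small port check can help. This is a sketch that assumes the services are published on localhost at the ports listed above; the helper name port_open is hypothetical.

```python
import socket

# Ports as listed above; the host is assumed to be the local machine.
SERVICES = {"API": 8000, "Frontend": 5173, "Redis": 6379}


def port_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, port in SERVICES.items():
    status = "up" if port_open(port) else "down"
    print(f"{name:<8} :{port}  {status}")
```

An open port only shows that something is listening; it does not verify that the service behind it is healthy.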

Common Commands

Command                                                                            Purpose
uv run pytest                                                                      Run tests
uv run ruff check .                                                                Lint
uv run ruff format .                                                               Format
uv run mypy src/ api/ pipeline/                                                    Type check
uv run granian --interface asgi api.app:app --host 0.0.0.0 --port 8000 --reload    Start API

Next Steps