Build Your First Knowledge Graph

A 15-minute walkthrough: go from PDF papers to a queryable knowledge graph.

Prerequisites

| Requirement | How to get it |
| --- | --- |
| Python 3.12+ | `brew install python@3.12` |
| uv | `pip install uv` |
| AWS SSO login | `aws sso login` |
| Neo4j | Docker (below) or existing instance |

Step 1: Install (2 min)

git clone https://github.com/Olink-Proteomics/graphrag_api.git
cd graphrag_api
uv sync

Step 2: Start Neo4j (2 min)

Option A — Docker (local, quickest):

docker run -d --name neo4j \
  -p 7687:7687 -p 7474:7474 \
  -e NEO4J_AUTH=neo4j/password123 \
  neo4j:5

Option B — AWS shared instance (team):

# Open SSM tunnel (keep this terminal open)
aws ssm start-session \
  --target <ec2-instance-id> \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["<neo4j-nlb-address>"],"portNumber":["7687"],"localPortNumber":["7687"]}' \
  --region eu-north-1

Step 3: Configure (30 sec)

aws sso login

That's it. Credentials (Neo4j password, API keys) are pulled automatically from AWS Secrets Manager.

Optional: If using local Docker Neo4j (Step 2 Option A), create a minimal .env:

NEO4J_URI=bolt://localhost:7687
NEO4J_PASSWORD=password123
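
One plausible lookup order for these credentials is sketched below: a value already present in the environment (for example from .env) wins, and Secrets Manager is consulted only as a fallback. This is an illustration, not the project's actual loader. The function names and the secret ID `graphrag/neo4j` are hypothetical; the only real API used is boto3's `get_secret_value`.

```python
import os
from typing import Callable, Mapping, Optional


def fetch_secret_from_aws(secret_id: str) -> str:
    """Read a plain-string secret from AWS Secrets Manager (needs a valid SSO session)."""
    import boto3  # imported lazily so the .env path works without boto3 installed

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


def resolve_neo4j_password(
    env: Optional[Mapping[str, str]] = None,
    fetch_secret: Callable[[str], str] = fetch_secret_from_aws,
    secret_id: str = "graphrag/neo4j",  # hypothetical secret name
) -> str:
    """Prefer NEO4J_PASSWORD from the environment, else fall back to Secrets Manager."""
    env = os.environ if env is None else env
    password = env.get("NEO4J_PASSWORD")
    if password:
        return password
    return fetch_secret(secret_id)
```

With the .env above loaded, `resolve_neo4j_password()` would return `password123` without contacting AWS at all.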

Step 4: Ingest PDFs (5 min)

Put some PDFs in a folder (the example below uses papers/), then run:

uv run python ingest_main.py \
  --source pdf \
  --pdf-files papers/paper1.pdf papers/paper2.pdf \
  --database neo4j \
  --service bedrock

What happens:

1. PDF → text extraction (PyMuPDF by default; Nougat for scanned PDFs)
2. Text → 512-token chunks with overlap
3. Chunks → LLM entity extraction (proteins, diseases, relationships)
4. Entities → Neo4j graph (written with MERGE to avoid duplicates)
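
The chunking step can be sketched as a sliding window. This is a minimal illustration, not the project's implementation: the function name `chunk_tokens` is hypothetical, the 64-token overlap is an assumed value (the source only says "with overlap"), and whitespace splitting stands in for a real tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer here.
tokens = ("word " * 1000).split()
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])  # 3 [512, 512, 104]
```

The overlap means an entity mention that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some repeated text sent to the LLM.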

Expected output:

Processing paper1.pdf...
  Extracted 8 chunks
  Found 23 entities, 15 relationships
Processing paper2.pdf...
  Extracted 12 chunks
  Found 41 entities, 28 relationships

Done: 64 entities, 43 relationships in 45s

Step 5: Generate Embeddings (2 min)

uv run python ingest_main.py \
  --add-graph-embeddings \
  --database neo4j \
  --service bedrock

This adds a 768-dimensional embedding vector to every entity node, which enables semantic search.
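
To make "semantic search" concrete, here is a minimal sketch of ranking entities by cosine similarity to a query vector. It assumes cosine similarity is the metric, which the source does not state. The function names are hypothetical and the 3-dimensional vectors are made-up toy values standing in for the real 768-dimensional embeddings.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_entities(
    query_vec: List[float], entity_vecs: Dict[str, List[float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in entity_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors for illustration only; real embeddings have 768 dimensions.
entities = {
    "IL-6": [0.9, 0.1, 0.0],
    "TNF": [0.8, 0.2, 0.1],
    "hypertension": [0.0, 0.2, 0.9],
}
ranked = rank_entities([1.0, 0.0, 0.0], entities, top_k=2)
print([name for name, _ in ranked])  # ['IL-6', 'TNF']
```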

Step 6: Query (instant)

Option A — curl:

# Create session (the API must already be running on port 8000; see Terminal 1 under Option B)
SESSION=$(curl -s -X POST "http://localhost:8000/v1/sessions" \
  -H "Authorization: Bearer dev-token" \
  -H "Content-Type: application/json" \
  -d '{"database_name":"neo4j","llm_service":"bedrock"}' | python3 -c "import sys,json;print(json.loads(sys.stdin.read())['session_id'])")

# Ask a question
curl -s -X POST "http://localhost:8000/v1/sessions/$SESSION/query" \
  -H "Authorization: Bearer dev-token" \
  -H "Content-Type: application/json" \
  -d '{"question": "What proteins are mentioned in the papers?"}' | python3 -m json.tool
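
The same two calls can be made from Python using only the standard library. This is a sketch that mirrors the curl commands above (same endpoints, headers, and payloads); the helper names `build_post`, `post_json`, and `ask` are hypothetical, and the response is assumed to be JSON containing `session_id`, as the curl pipeline implies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"
TOKEN = "dev-token"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request with the same headers as the curl calls."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_post(path, payload)) as response:
        return json.load(response)


def ask(question: str) -> dict:
    """Create a session, then send one question to it."""
    session = post_json("/v1/sessions", {"database_name": "neo4j", "llm_service": "bedrock"})
    return post_json(f"/v1/sessions/{session['session_id']}/query", {"question": question})
```

With the API running, `print(ask("What proteins are mentioned in the papers?"))` would be the equivalent of the second curl command.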

Option B — start the API + frontend:

# Terminal 1: API
uv run granian --interface asgi api.app:app --host 127.0.0.1 --port 8000 --reload

# Terminal 2: Frontend (only if gav360_graphrag_react is cloned next to this repo)
cd ../gav360_graphrag_react && npm run dev

Open http://localhost:5173 and ask questions in the chat interface.

Alternative: Ingest from PubMed (no PDFs needed)

# Fetch 50 abstracts about cardiovascular protein biomarkers
uv run python ingest_main.py \
  --search-term "cardiovascular disease protein biomarker" \
  --max-results 50 \
  --database neo4j \
  --service bedrock

What's in the Graph Now?

Check via Neo4j Browser (http://localhost:7474):

// Count everything
MATCH (n) RETURN labels(n) AS type, count(n) AS count ORDER BY count DESC

// See protein-disease relationships
MATCH (p:Protein)-[r:ASSOCIATED_WITH]->(d:Disease)
RETURN p.name, type(r), d.name LIMIT 20

// Find all relationships for a protein
MATCH (p:Protein {name: "IL-6"})-[r]-(n)
RETURN p.name, type(r), labels(n), n.name
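
The last query can also be run from Python with the official `neo4j` driver (5.x), passing the protein name as a parameter instead of hard-coding it. This is a sketch under assumptions: the helper names are hypothetical, and the connection defaults are the local Docker values from Step 2, Option A.

```python
from typing import Any, Dict, List

NEIGHBOURS_QUERY = """
MATCH (p:Protein {name: $name})-[r]-(n)
RETURN p.name AS protein, type(r) AS relationship, labels(n) AS labels, n.name AS neighbour
"""


def protein_neighbours(driver: Any, name: str, database: str = "neo4j") -> List[Dict[str, Any]]:
    """Run the parameterised query and return each record as a plain dict."""
    records, _summary, _keys = driver.execute_query(
        NEIGHBOURS_QUERY, name=name, database_=database
    )
    return [record.data() for record in records]


def connect(
    uri: str = "bolt://localhost:7687",
    user: str = "neo4j",
    password: str = "password123",
):
    """Open a driver for the local Docker instance (assumed defaults)."""
    from neo4j import GraphDatabase  # requires: pip install neo4j

    return GraphDatabase.driver(uri, auth=(user, password))
```

Typical use would be `driver = connect()`, then `protein_neighbours(driver, "IL-6")`, then `driver.close()`. Using `$name` lets Neo4j cache the query plan and avoids quoting problems with unusual protein names.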

Next Steps

| Goal | Command |
| --- | --- |
| Add more papers | `uv run python ingest_main.py --source pdf --pdf-files more/*.pdf` |
| Add PubMed data | `uv run python ingest_main.py --search-term "your topic" --max-results 200` |
| Consolidate duplicates | `uv run python ingest_main.py --consolidate-relationships --database neo4j` |
| Detect communities | `uv run python ingest_main.py --detect-communities --database neo4j` |
| Bulk ingest (31 queries) | `uv run python parallel_ingest.py -q data/queries_bulk.txt -d neo4j -s bedrock` |

Troubleshooting

| Problem | Fix |
| --- | --- |
| `Token has expired` | `aws sso login` |
| `Connection refused` on Neo4j | Start Docker or SSM tunnel |
| `ThrottlingException` from Bedrock | Reduce `--max-results` or set `KG_MAX_LLM_CONCURRENCY=5` in .env |
| Slow extraction | Use `--service local` with Ollama for testing (free, no rate limits) |
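
Why lowering `KG_MAX_LLM_CONCURRENCY` helps with throttling can be illustrated with a semaphore that caps the number of in-flight requests. This is a generic sketch, not the project's code: `run_with_limit` and `fake_llm_call` are hypothetical, the sleep stands in for a Bedrock request, and the fallback of 5 is used here only for illustration.

```python
import asyncio
import os


async def run_with_limit(jobs, limit: int):
    """Run coroutine factories with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather() returns results in the same order as the jobs were given
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_llm_call(i: int) -> int:
    await asyncio.sleep(0.01)  # stands in for one LLM request
    return i * 2


limit = int(os.environ.get("KG_MAX_LLM_CONCURRENCY", "5"))
jobs = [lambda i=i: fake_llm_call(i) for i in range(20)]
results = asyncio.run(run_with_limit(jobs, limit))
print(results[:5])  # [0, 2, 4, 6, 8]
```

A lower limit trades throughput for fewer simultaneous requests, which keeps the request rate under the provider's quota.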