Build Your First Knowledge Graph
A 15-minute walkthrough: go from PDF papers to a queryable knowledge graph.
Prerequisites
| Requirement | How to get it |
|---|---|
| Python 3.12+ | brew install python@3.12 |
| uv | pip install uv |
| AWS SSO login | aws sso login |
| Neo4j | Docker (below) or existing instance |
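Before starting, it can help to confirm that the tools in the table are actually on your PATH. The following is a minimal sketch, not part of the project: the function name `check_prerequisites` is hypothetical, and `docker` is included only because Step 2 Option A uses it.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 12), tools=("uv", "aws", "docker")):
    """Return a dict mapping each requirement to True (found) or False."""
    # sys.version_info compares element-wise against a plain tuple
    report = {"python": sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        report[tool] = shutil.which(tool) is not None
    return report


for name, ok in check_prerequisites().items():
    print(f"{name:8} {'ok' if ok else 'MISSING'}")
```

A missing `docker` entry is only a problem if you plan to run Neo4j locally; it is not needed when using an existing instance.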
Step 1: Install (2 min)
Step 2: Start Neo4j (2 min)
Option A — Docker (local, quickest):
Option B — AWS shared instance (team):
# Open SSM tunnel (keep this terminal open)
aws ssm start-session \
--target <ec2-instance-id> \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["<neo4j-nlb-address>"],"portNumber":["7687"],"localPortNumber":["7687"]}' \
--region eu-north-1
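Whichever option you choose, Neo4j should end up reachable on local port 7687 (the port forwarded by the tunnel above). A small sketch for verifying that, using only the standard library; the helper name `port_open` is an illustration and not part of the project:

```python
import socket


def port_open(host="127.0.0.1", port=7687, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_open()` with the defaults returns `False` while the tunnel or container is down, which corresponds to the "Connection refused" case in the troubleshooting table.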
Step 3: Configure (30 sec)
That's it. Credentials (Neo4j password, API keys) are pulled automatically from AWS Secrets Manager.
Optional: If using local Docker Neo4j (Step 2 Option A), create a minimal .env:
Step 4: Ingest PDFs (5 min)
Put some PDFs in a folder, then:
uv run python ingest_main.py \
--source pdf \
--pdf-files papers/paper1.pdf papers/paper2.pdf \
--database neo4j \
--service bedrock
What happens:
- PDF → text extraction (PyMuPDF, or Nougat for scanned PDFs)
- Text → 512-token chunks with overlap
- Chunks → LLM entity extraction (proteins, diseases, relationships)
- Entities → Neo4j graph (MERGE to avoid duplicates)
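The chunking step in the list above can be sketched as a sliding window. This is an illustration under stated assumptions, not the project's implementation: the source only says "512-token chunks with overlap", so the overlap of 64 tokens and the function name `chunk_tokens` are made up here, and real pipelines use a model tokenizer rather than a pre-split list.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens.

    The overlap value is an assumption; the source does not state it.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

For a quick experiment, `chunk_tokens(text.split())` treats whitespace-separated words as tokens. The overlap means an entity mention that straddles a chunk boundary still appears whole in at least one chunk.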
Expected output:
Processing paper1.pdf...
Extracted 8 chunks
Found 23 entities, 15 relationships
Processing paper2.pdf...
Extracted 12 chunks
Found 41 entities, 28 relationships
Done: 64 entities, 43 relationships in 45s
Step 5: Generate Embeddings (2 min)
This adds 768-dim vectors to all entity nodes for semantic search.
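To make "semantic search" concrete: once every entity has a vector, a query vector can be compared against them by cosine similarity. The sketch below is a plain-Python illustration of that idea only. How the project actually stores and searches the vectors is not shown in the source, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, entities, k=3):
    """Rank a {name: vector} mapping by similarity to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in entities.items()]
    return sorted(scored, reverse=True)[:k]
```

In practice the vectors would be 768-dimensional and the search would run inside the database; the two-line ranking above is only meant to show what is being computed.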
Step 6: Query (instant)
Option A — curl:
# Create a session (the API must already be running on localhost:8000; see Option B, Terminal 1)
SESSION=$(curl -s -X POST "http://localhost:8000/v1/sessions" \
-H "Authorization: Bearer dev-token" \
-H "Content-Type: application/json" \
-d '{"database_name":"neo4j","llm_service":"bedrock"}' | python3 -c "import sys,json;print(json.loads(sys.stdin.read())['session_id'])")
# Ask a question
curl -s -X POST "http://localhost:8000/v1/sessions/$SESSION/query" \
-H "Authorization: Bearer dev-token" \
-H "Content-Type: application/json" \
-d '{"question": "What proteins are mentioned in the papers?"}' | python3 -m json.tool
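The same two calls can be made from Python with only the standard library. This is a sketch that mirrors the curl commands above: the endpoints, headers, and payload fields are taken from them, while the helper names (`build_request`, `post_json`, `ask`) are hypothetical and the response is assumed to be JSON containing `session_id`, as the curl example implies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"
TOKEN = "dev-token"  # same development token as in the curl example


def build_request(path, payload):
    """Build an authenticated JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def post_json(path, payload):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return json.load(resp)


def ask(question):
    """Create a session, then post one question to it."""
    session = post_json("/v1/sessions", {"database_name": "neo4j", "llm_service": "bedrock"})
    return post_json(f"/v1/sessions/{session['session_id']}/query", {"question": question})
```

With the API running, `ask("What proteins are mentioned in the papers?")` returns the decoded response; nothing is sent until `ask` or `post_json` is called.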
Option B — start the API + frontend:
# Terminal 1: API
uv run granian --interface asgi api.app:app --host 127.0.0.1 --port 8000 --reload
# Terminal 2: Frontend (if cloned)
cd ../gav360_graphrag_react && npm run dev
Open http://localhost:5173 and ask questions in the chat interface.
Alternative: Ingest from PubMed (no PDFs needed)
# Fetch 50 abstracts about cardiovascular protein biomarkers
uv run python ingest_main.py \
--search-term "cardiovascular disease protein biomarker" \
--max-results 50 \
--database neo4j \
--service bedrock
What's in the Graph Now?
Check via Neo4j Browser (http://localhost:7474):
// Count everything
MATCH (n) RETURN labels(n) AS type, count(n) AS count ORDER BY count DESC
// See protein-disease relationships
MATCH (p:Protein)-[r:ASSOCIATED_WITH]->(d:Disease)
RETURN p.name, type(r), d.name LIMIT 20
// Find all relationships for a protein
MATCH (p:Protein {name: "IL-6"})-[r]-(n)
RETURN p.name, type(r), labels(n), n.name
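The last query can also be run from Python. The sketch below assumes the official `neo4j` driver package is installed and that the database listens on `bolt://localhost:7687`; the user name, the password placeholder, and the function name `protein_neighbours` are assumptions for illustration. It passes the protein name as a `$name` parameter instead of hard-coding it in the query text.

```python
PROTEIN_NEIGHBOURS = """
MATCH (p:Protein {name: $name})-[r]-(n)
RETURN p.name AS protein, type(r) AS relation, labels(n) AS labels, n.name AS neighbour
"""


def protein_neighbours(name, uri="bolt://localhost:7687",
                       user="neo4j", password="<your-password>"):
    """Return one dict per relationship touching the named protein."""
    # Imported lazily so this module loads even where the driver is not installed
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        # Keyword arguments without a trailing underscore become query parameters
        records, _summary, _keys = driver.execute_query(
            PROTEIN_NEIGHBOURS, name=name, database_="neo4j"
        )
        return [record.data() for record in records]
```

For example, `protein_neighbours("IL-6")` corresponds to the Browser query above. Parameterizing the name avoids quoting problems and lets Neo4j reuse the query plan across proteins.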
Next Steps
| Goal | Command |
|---|---|
| Add more papers | uv run python ingest_main.py --source pdf --pdf-files more/*.pdf |
| Add PubMed data | uv run python ingest_main.py --search-term "your topic" --max-results 200 |
| Consolidate duplicates | uv run python ingest_main.py --consolidate-relationships --database neo4j |
| Detect communities | uv run python ingest_main.py --detect-communities --database neo4j |
| Bulk ingest (31 queries) | uv run python parallel_ingest.py -q data/queries_bulk.txt -d neo4j -s bedrock |
Troubleshooting
| Problem | Fix |
|---|---|
| Token has expired | aws sso login |
| Connection refused on Neo4j | Start Docker or SSM tunnel |
| ThrottlingException from Bedrock | Reduce --max-results or set KG_MAX_LLM_CONCURRENCY=5 in .env |
| Slow extraction | Use --service local with Ollama for testing (free, no rate limits) |
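To illustrate why lowering a concurrency limit helps with throttling, here is a sketch of capping in-flight LLM calls with a semaphore. It is an assumption about the general technique, not the project's code: the source only names the `KG_MAX_LLM_CONCURRENCY` variable, and `run_capped` is a hypothetical helper.

```python
import asyncio
import os

# Read the cap from the environment, falling back to 5 as in the table above
MAX_CONCURRENCY = int(os.environ.get("KG_MAX_LLM_CONCURRENCY", "5"))


async def run_capped(jobs, limit=MAX_CONCURRENCY):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Each job waits here until one of the `limit` slots is free
        async with semaphore:
            return await job()

    # gather preserves the order of `jobs` in the returned list
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

With a lower limit, fewer requests reach the provider at the same moment, which trades some throughput for fewer throttling errors.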