Neptune Massive Ingest — Operational Runbook¶
How to run large-scale paper ingestion into Neptune + Aurora pgvector on the beta environment.
Architecture¶
PubMed/PMC ──→ [ECS Task: scaled_ingest.py] ──→ Bedrock LLM ──→ SQS Queue
│
▼
[ECS Task: sqs_ingestion_worker.py]
│
┌───────────────────────┼───────────────────┐
▼ ▼
Neptune (graph) Aurora pgvector
entities + rels embeddings
│
▼
Beta Frontend (query)
Prerequisites¶
- AWS SSO login:
aws sso login - Docker running (for image builds)
- Branch:
feat/parallel_ingest
Step 1: Build & Push Image¶
Only needed when code changes:
cd /Users/apple/Developer/olink/graphrag_api
git checkout feat/parallel_ingest
# Build ingest image
docker build --platform linux/amd64 \
-t 357836458011.dkr.ecr.eu-north-1.amazonaws.com/graphrag-fastapi:latest \
-f Dockerfile.ingest .
# Push
aws ecr get-login-password --region eu-north-1 | \
docker login --username AWS --password-stdin 357836458011.dkr.ecr.eu-north-1.amazonaws.com
docker push 357836458011.dkr.ecr.eu-north-1.amazonaws.com/graphrag-fastapi:latest
Step 2: Run Extraction (Papers → SQS)¶
Launch an ECS task that fetches papers, extracts entities via Bedrock, and pushes to SQS:
aws ecs run-task \
--cluster graphrag-beta-cluster \
--task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
--launch-type FARGATE \
--network-configuration '{
"awsvpcConfiguration": {
"subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
"securityGroups": ["sg-072b4fffe90fed053"],
"assignPublicIp": "ENABLED"
}
}' \
--overrides '{
"containerOverrides": [{
"name": "fastapi",
"command": ["--queries-file", "data/queries_bulk.txt", "--max-results", "300", "--use-sqs"]
}]
}' \
--region eu-north-1
Options:
- --max-results 300 — papers per query (31 queries × 300 = ~9.3K papers)
- --source pubmed (default) / pmc / biorxiv
- Remove --use-sqs to write directly to Neptune (slower, may timeout)
Monitor:
# Task status
aws ecs describe-tasks --cluster graphrag-beta-cluster --tasks <TASK_ID> \
--region eu-north-1 --query "tasks[0].lastStatus" --output text
# Logs
aws logs get-log-events --log-group-name "graphrag-beta-ecs-logs" \
--log-stream-name "api/fastapi/<TASK_ID>" --region eu-north-1 \
--query "events[-10:].message" --output text
# Queue depth
aws sqs get-queue-attributes \
--queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue \
--attribute-names ApproximateNumberOfMessages --region eu-north-1
Step 3: Run Ingestion Worker (SQS → Neptune + Aurora)¶
Once extraction is done (or while it's running), launch the ingestion worker to drain the queue:
aws ecs run-task \
--cluster graphrag-beta-cluster \
--task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
--launch-type FARGATE \
--network-configuration '{
"awsvpcConfiguration": {
"subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
"securityGroups": ["sg-072b4fffe90fed053"],
"assignPublicIp": "ENABLED"
}
}' \
--overrides '{
"containerOverrides": [{
"name": "fastapi",
"command": ["--worker"]
}]
}' \
--region eu-north-1
Note: The --worker flag runs sqs_ingestion_worker.py which:
- Reads messages from graphrag-beta-extraction-queue
- Batch-writes nodes/relationships to Neptune via OpenCypher
- Writes embeddings to Aurora pgvector
- Deletes processed messages
- Stops when queue is empty
Step 4: Verify¶
# Check Neptune node count (via beta API)
curl -s https://graphrag-beta.mlapps.olink.systems/v1/sessions \
-H "Content-Type: application/json" \
-d '{"database_type": "neptune"}'
# Check SQS DLQ for failures
aws sqs get-queue-attributes \
--queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-dlq \
--attribute-names ApproximateNumberOfMessages --region eu-north-1
Costs¶
| Component | Cost per 10K papers |
|---|---|
| Bedrock LLM (Llama 3.1 8B) | ~$8-12 |
| ECS Fargate (extraction, ~30 min) | ~$0.50 |
| ECS Fargate (ingestion, ~10 min) | ~$0.15 |
| Neptune (running) | ~$0.35/hr |
| Aurora pgvector (running) | ~$0.20/hr |
| SQS | ~$0.01 |
Troubleshooting¶
| Error | Cause | Fix |
|---|---|---|
sqs:sendmessage AccessDenied |
Task role missing SQS policy | Add SQS policy to RagStack-beta-AppTaskRole7EC51C11-jWyEfPb24mh5 |
ValueError: invalid literal for int() with base 16: b'' |
PubMed chunked-encoding error | Fixed — retry logic handles this |
| Neptune timeout at min capacity | Too many concurrent writes | Use SQS mode (controlled drain rate) |
unrecognized arguments |
Wrong command format | Use ["--queries-file", ...] not ["python", "-m", ...] |
Weekly Automation (TODO)¶
EventBridge rule graphrag-beta-weekly-ingest exists but is not yet connected. To fully automate:
- Wire
weekly_worker.pyinto CDK beta stack - Add Step Functions: extraction task → wait → ingestion worker task
- SNS notification on completion/failure
Endpoints¶
| Resource | Value |
|---|---|
| Neptune Alpha | neptunedbcluster-y5yc1gf3wxsn |
| Neptune Beta | neptunedbcluster-rssnn6cwagnp |
| Aurora pgvector | ragstack-beta-aurorapgvectorstackaurorapgvectorclu-i8zsxoht8olc |
| SQS Queue | graphrag-beta-extraction-queue |
| SQS DLQ | graphrag-beta-extraction-dlq |
| ECS Cluster | graphrag-beta-cluster |
| Task Role | RagStack-beta-AppTaskRole7EC51C11-jWyEfPb24mh5 |
| Docker Image | 357836458011.dkr.ecr.eu-north-1.amazonaws.com/graphrag-fastapi:latest |