Neptune Massive Ingest: Operational Runbook

How to run large-scale paper ingestion into Neptune + Aurora pgvector on the beta environment.

Architecture

PubMed/PMC ──→ [ECS Task: scaled_ingest.py] ──→ Bedrock LLM ──→ SQS Queue
                                                                      │
                                                                      ▼
                                              [ECS Task: sqs_ingestion_worker.py]
                                                                      │
                                              ┌───────────────────────┼───────────────────┐
                                              ▼                                           ▼
                                        Neptune (graph)                          Aurora pgvector
                                        entities + rels                          embeddings
                                              │
                                              ▼
                                     Beta Frontend (query)

Prerequisites

- AWS SSO login: aws sso login
- Docker running (for image builds)
- Branch: feat/parallel_ingest

Step 1: Build & Push Image

Only needed when code changes:

cd /Users/apple/Developer/olink/graphrag_api
git checkout feat/parallel_ingest

# Build ingest image
docker build --platform linux/amd64 \
  -t 357836458011.dkr.ecr.eu-north-1.amazonaws.com/graphrag-fastapi:latest \
  -f Dockerfile.ingest .

# Push
aws ecr get-login-password --region eu-north-1 | \
  docker login --username AWS --password-stdin 357836458011.dkr.ecr.eu-north-1.amazonaws.com
docker push 357836458011.dkr.ecr.eu-north-1.amazonaws.com/graphrag-fastapi:latest
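
Because the task pulls the `latest` tag, it can help to confirm that the push actually landed before launching anything. A minimal sketch, assuming the ECR repository is named `graphrag-fastapi` (as the image URI suggests); the function name `latest_image_info` is illustrative and not part of the repo:

```shell
# Print the push timestamp and digest of the image currently tagged "latest".
# Assumption: repository name "graphrag-fastapi", taken from the image URI above.
latest_image_info() {
  aws ecr describe-images \
    --repository-name "graphrag-fastapi" \
    --image-ids imageTag=latest \
    --region "eu-north-1" \
    --query "imageDetails[0].[imagePushedAt,imageDigest]" \
    --output text
}
```

Running `latest_image_info` right after the push should show a timestamp from the last few minutes; an older timestamp suggests the push did not go through.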

Step 2: Run Extraction (Papers → SQS)

Launch an ECS task that fetches papers, extracts entities via Bedrock, and pushes to SQS:

aws ecs run-task \
  --cluster graphrag-beta-cluster \
  --task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
      "securityGroups": ["sg-072b4fffe90fed053"],
      "assignPublicIp": "ENABLED"
    }
  }' \
  --overrides '{
    "containerOverrides": [{
      "name": "fastapi",
      "command": ["--queries-file", "data/queries_bulk.txt", "--max-results", "300", "--use-sqs"]
    }]
  }' \
  --region eu-north-1

Options:

- --max-results 300: papers per query (31 queries × 300 = ~9.3K papers)
- --source: pubmed (default), pmc, or biorxiv
- Remove --use-sqs to write directly to Neptune (slower, and may time out)
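
The paper-count arithmetic above can be checked against the actual queries file before launching. A small sketch, assuming the file holds one query per line (the runbook does not state the file format); `estimate_papers` is a hypothetical helper name:

```shell
# Upper bound on papers fetched: number of non-blank query lines x --max-results.
# Assumption: one query per line in the queries file.
estimate_papers() {
  local queries_file="$1" max_results="$2" n_queries
  # Count only lines containing a non-space character, so blank lines are ignored.
  n_queries=$(grep -c '[^[:space:]]' "$queries_file")
  echo $(( n_queries * max_results ))
}
```

For example, `estimate_papers data/queries_bulk.txt 300` prints 9300 when the file contains 31 queries. The real count is likely lower if queries return overlapping papers.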

Monitor:

# Task status
aws ecs describe-tasks --cluster graphrag-beta-cluster --tasks <TASK_ID> \
  --region eu-north-1 --query "tasks[0].lastStatus" --output text

# Logs
aws logs get-log-events --log-group-name "graphrag-beta-ecs-logs" \
  --log-stream-name "api/fastapi/<TASK_ID>" --region eu-north-1 \
  --query "events[-10:].message" --output text

# Queue depth
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue \
  --attribute-names ApproximateNumberOfMessages --region eu-north-1
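
The monitoring commands above need a `<TASK_ID>` and have to be re-run by hand. The sketch below extracts the ID from a task ARN and polls until the task stops. The helper names and the 30-second default interval are illustrative choices, not part of the repo; a custom loop is used because an extraction run of roughly 30 minutes may outlast the built-in `aws ecs wait tasks-stopped` waiter.

```shell
CLUSTER="graphrag-beta-cluster"
REGION="eu-north-1"

# The task ID is the last path segment of the task ARN.
task_id_from_arn() {
  echo "${1##*/}"
}

# Poll until the task reaches STOPPED; succeed only if the first container exited with 0.
wait_for_task() {
  local task_id="$1" interval="${2:-30}" status exit_code
  while :; do
    status=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$task_id" \
      --region "$REGION" --query "tasks[0].lastStatus" --output text)
    echo "task $task_id: $status" >&2
    [ "$status" = "STOPPED" ] && break
    sleep "$interval"
  done
  exit_code=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$task_id" \
    --region "$REGION" --query "tasks[0].containers[0].exitCode" --output text)
  # With --output text a missing exit code prints as "None" (container never started).
  [ "$exit_code" = "0" ]
}
```

Appending `--query "tasks[0].taskArn" --output text` to the `aws ecs run-task` call prints the ARN, which can then be passed through `task_id_from_arn` and on to `wait_for_task`.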

Step 3: Run Ingestion Worker (SQS → Neptune + Aurora)

Once extraction is done (or while it's running), launch the ingestion worker to drain the queue:

aws ecs run-task \
  --cluster graphrag-beta-cluster \
  --task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
      "securityGroups": ["sg-072b4fffe90fed053"],
      "assignPublicIp": "ENABLED"
    }
  }' \
  --overrides '{
    "containerOverrides": [{
      "name": "fastapi",
      "command": ["--worker"]
    }]
  }' \
  --region eu-north-1

Note: The --worker flag runs sqs_ingestion_worker.py, which:

- Reads messages from graphrag-beta-extraction-queue
- Batch-writes nodes/relationships to Neptune via OpenCypher
- Writes embeddings to Aurora pgvector
- Deletes processed messages
- Stops when queue is empty
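
Since the worker stops when the queue is empty, it is useful to confirm the queue really is drained, including messages that are still in flight. A minimal sketch; `queue_is_drained` is a hypothetical helper, and SQS counts are approximate, so repeating the check a few times is safer than trusting a single reading:

```shell
QUEUE_URL="https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue"
REGION="eu-north-1"

# Succeeds only when no messages are waiting and none are in flight.
queue_is_drained() {
  local counts
  counts=$(aws sqs get-queue-attributes --queue-url "$QUEUE_URL" --region "$REGION" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    --query "Attributes.[ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible]" \
    --output text)
  # Text output is one tab-separated line: "<visible> <in flight>".
  set -- $counts
  echo "visible=$1 in_flight=$2" >&2
  [ "$1" = "0" ] && [ "$2" = "0" ]
}
```

If extraction is still running, an empty reading only means the worker has caught up, not that ingestion is complete.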

Step 4: Verify

# Check Neptune node count (via beta API)
curl -s https://graphrag-beta.mlapps.olink.systems/v1/sessions \
  -H "Content-Type: application/json" \
  -d '{"database_type": "neptune"}'

# Check SQS DLQ for failures
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-dlq \
  --attribute-names ApproximateNumberOfMessages --region eu-north-1
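
When the DLQ count is non-zero, the failed messages can be inspected without removing them. A sketch, assuming the message body is readable text such as JSON (the runbook does not describe the message format); `peek_dlq` is an illustrative name:

```shell
DLQ_URL="https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-dlq"
REGION="eu-north-1"

# Print up to N message bodies (default 5, SQS maximum 10) from the DLQ.
# --visibility-timeout 0 makes the messages visible again immediately.
peek_dlq() {
  aws sqs receive-message --queue-url "$DLQ_URL" --region "$REGION" \
    --max-number-of-messages "${1:-5}" --visibility-timeout 0 \
    --query "Messages[].Body" --output text
}
```

The command prints `None` when the DLQ has nothing to return.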

Costs

Component	Cost per 10K papers (hourly where marked /hr)
Bedrock LLM (Llama 3.1 8B)	~$8-12
ECS Fargate (extraction, ~30 min)	~$0.50
ECS Fargate (ingestion, ~10 min)	~$0.15
Neptune (running)	~$0.35/hr
Aurora pgvector (running)	~$0.20/hr
SQS	~$0.01
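
The table mixes per-10K-paper costs with hourly database costs, so a total depends on both the paper count and how long Neptune and Aurora stay up. The sketch below combines them, assuming the per-paper costs scale linearly from the 10K figures (an assumption, not something the table states); `estimate_cost` is a hypothetical helper:

```shell
# Rough USD cost range for N papers with H hours of Neptune + Aurora uptime.
# Assumption: Bedrock, Fargate, and SQS costs scale linearly with paper count.
estimate_cost() {
  awk -v papers="$1" -v hours="$2" 'BEGIN {
    fixed = 0.50 + 0.15 + 0.01      # Fargate extraction + ingestion + SQS, per 10K papers
    scale = papers / 10000
    db    = (0.35 + 0.20) * hours   # Neptune + Aurora pgvector, per hour
    # Bedrock is quoted as a range of 8 to 12 USD per 10K papers.
    printf "%.2f-%.2f\n", (8 + fixed) * scale + db, (12 + fixed) * scale + db
  }'
}
```

For instance, `estimate_cost 10000 1` prints `9.21-13.21`, which shows that the Bedrock calls dominate the bill for a short run.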

Troubleshooting

Error	Cause	Fix
`sqs:sendmessage AccessDenied`	Task role missing SQS policy	Add SQS policy to `RagStack-beta-AppTaskRole7EC51C11-jWyEfPb24mh5`
`ValueError: invalid literal for int() with base 16: b''`	PubMed chunked-encoding error	Fixed — retry logic handles this
Neptune timeout at min capacity	Too many concurrent writes	Use SQS mode (controlled drain rate)
`unrecognized arguments`	Wrong command format	Use `["--queries-file", ...]` not `["python", "-m", ...]`
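
For the AccessDenied row, one possible stopgap is an inline policy on the task role. This is a sketch only: the policy name `graphrag-sqs-access` is made up, the queue ARN is derived from the queue URL used in this runbook, and the set of actions is a guess at what extraction and ingestion need. If the role is managed by the CDK stack, the durable fix belongs there, since a manual change may be overwritten on the next deploy.

```shell
ROLE_NAME="RagStack-beta-AppTaskRole7EC51C11-jWyEfPb24mh5"
QUEUE_ARN="arn:aws:sqs:eu-north-1:357836458011:graphrag-beta-extraction-queue"

# Print a policy document scoped to the extraction queue only.
sqs_policy_document() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["sqs:SendMessage", "sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"],
    "Resource": "$QUEUE_ARN"
  }]
}
EOF
}

# Attach the document as an inline policy (hypothetical policy name).
attach_sqs_policy() {
  aws iam put-role-policy --role-name "$ROLE_NAME" \
    --policy-name "graphrag-sqs-access" \
    --policy-document "$(sqs_policy_document)"
}
```

Printing the output of `sqs_policy_document` first makes it easy to review the policy before calling `attach_sqs_policy`.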

Weekly Automation (TODO)

EventBridge rule graphrag-beta-weekly-ingest exists but is not yet connected. To fully automate:

1. Wire weekly_worker.py into CDK beta stack
2. Add Step Functions: extraction task → wait → ingestion worker task
3. SNS notification on completion/failure
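
One way to track progress on this TODO is to check whether the EventBridge rule has any targets. This sketch assumes that "not yet connected" means the rule has no targets and that it lives on the default event bus; neither is confirmed by the runbook, and `rule_target_count` is an illustrative name:

```shell
# Print the number of targets attached to the weekly ingest rule.
# Assumption: the rule is on the default event bus in eu-north-1.
rule_target_count() {
  aws events list-targets-by-rule --rule "graphrag-beta-weekly-ingest" \
    --region "eu-north-1" --query "length(Targets)" --output text
}
```

A result of 0 would mean the schedule fires without triggering anything.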

Endpoints

Resource	Value
Neptune Alpha	`neptunedbcluster-y5yc1gf3wxsn`
Neptune Beta	`neptunedbcluster-rssnn6cwagnp`
Aurora pgvector	`ragstack-beta-aurorapgvectorstackaurorapgvectorclu-i8zsxoht8olc`
SQS Queue	`graphrag-beta-extraction-queue`
SQS DLQ	`graphrag-beta-extraction-dlq`
ECS Cluster	`graphrag-beta-cluster`
Task Role	`RagStack-beta-AppTaskRole7EC51C11-jWyEfPb24mh5`
Docker Image	`357836458011.dkr.ecr.eu-north-1.amazonaws.com/graphrag-fastapi:latest`