Neptune Massive Ingest — Operational Runbook

How to run large-scale paper ingestion into Neptune + Aurora pgvector on the beta environment.

Architecture

PubMed/PMC ──→ [ECS Task: scaled_ingest.py] ──→ Bedrock LLM ──→ SQS Queue
                                                                      │
                                                                      ▼
                                              [ECS Task: sqs_ingestion_worker.py]
                                                                      │
                                              ┌───────────────────────┼───────────────────┐
                                              ▼                                           ▼
                                        Neptune (graph)                          Aurora pgvector
                                        entities + rels                          embeddings
                                              │                                           │
                                              └───────────────────────┬───────────────────┘
                                                                      ▼
                                                            Beta Frontend (query)

Prerequisites

  • AWS SSO login: aws sso login
  • Docker running (for image builds)
  • Branch: feat/parallel_ingest

Step 1: Build & Push Image

Only needed when code changes:

# Adjust this path to your local checkout of graphrag_api
cd /Users/apple/Developer/olink/graphrag_api
git checkout feat/parallel_ingest

# Build ingest image
docker build --platform linux/amd64 \
  -t 357836458011.dkr.ecr.eu-north-1.amazonaws.com/graphrag-fastapi:latest \
  -f Dockerfile.ingest .

# Push
aws ecr get-login-password --region eu-north-1 | \
  docker login --username AWS --password-stdin 357836458011.dkr.ecr.eu-north-1.amazonaws.com
docker push 357836458011.dkr.ecr.eu-north-1.amazonaws.com/graphrag-fastapi:latest

Step 2: Run Extraction (Papers → SQS)

Launch an ECS task that fetches papers, extracts entities via Bedrock, and pushes to SQS:

aws ecs run-task \
  --cluster graphrag-beta-cluster \
  --task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
      "securityGroups": ["sg-072b4fffe90fed053"],
      "assignPublicIp": "ENABLED"
    }
  }' \
  --overrides '{
    "containerOverrides": [{
      "name": "fastapi",
      "command": ["--queries-file", "data/queries_bulk.txt", "--max-results", "300", "--use-sqs"]
    }]
  }' \
  --region eu-north-1

Options:

  • --max-results 300: papers fetched per query (31 queries × 300 = up to 9,300 papers)
  • --source: pubmed (default), pmc, or biorxiv
  • Remove --use-sqs to write directly to Neptune (slower, and may time out)
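The container command and the JSON passed to --overrides can also be generated instead of hand-edited, which avoids shell-quoting slips when the options change. The following is a minimal Python sketch, not part of the repository: the helper names extraction_command and build_overrides are hypothetical, and it assumes --source takes its value as a separate argument, which this runbook does not spell out.

```python
import json


def extraction_command(queries_file="data/queries_bulk.txt",
                       max_results=300, source=None, use_sqs=True):
    """Build the argument list for the extraction container (hypothetical helper)."""
    command = ["--queries-file", queries_file, "--max-results", str(max_results)]
    if source is not None:
        # Assumption: the flag is passed as "--source <name>".
        command += ["--source", source]
    if use_sqs:
        command.append("--use-sqs")
    return command


def build_overrides(command, container="fastapi"):
    """Serialize the --overrides payload for `aws ecs run-task`."""
    payload = {"containerOverrides": [{"name": container, "command": list(command)}]}
    return json.dumps(payload)


if __name__ == "__main__":
    # Prints a JSON string that can be pasted after --overrides.
    print(build_overrides(extraction_command()))
```

The same build_overrides helper works for the ingestion worker by passing ["--worker"] as the command.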

Monitor:

# Task status
aws ecs describe-tasks --cluster graphrag-beta-cluster --tasks <TASK_ID> \
  --region eu-north-1 --query "tasks[0].lastStatus" --output text

# Logs
aws logs get-log-events --log-group-name "graphrag-beta-ecs-logs" \
  --log-stream-name "api/fastapi/<TASK_ID>" --region eu-north-1 \
  --query "events[-10:].message" --output text

# Queue depth
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue \
  --attribute-names ApproximateNumberOfMessages --region eu-north-1
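
For unattended runs, the queue-depth check can be wrapped in a small poller. The sketch below shells out to the same aws sqs get-queue-attributes command and additionally requests ApproximateNumberOfMessagesNotVisible (messages currently being processed), so that "empty" means nothing is waiting and nothing is in flight. The function names are hypothetical, and both counts are approximate by design, so treat a single zero reading with caution.

```python
import json
import subprocess

QUEUE_URL = ("https://sqs.eu-north-1.amazonaws.com/357836458011/"
             "graphrag-beta-extraction-queue")


def parse_depth(cli_json):
    """Return (visible, in_flight) from get-queue-attributes JSON output."""
    attrs = json.loads(cli_json).get("Attributes", {})
    visible = int(attrs.get("ApproximateNumberOfMessages", 0))
    in_flight = int(attrs.get("ApproximateNumberOfMessagesNotVisible", 0))
    return visible, in_flight


def fetch_depth(queue_url=QUEUE_URL, region="eu-north-1"):
    """Call the AWS CLI and parse its output (requires a valid SSO session)."""
    result = subprocess.run(
        ["aws", "sqs", "get-queue-attributes",
         "--queue-url", queue_url,
         "--attribute-names", "ApproximateNumberOfMessages",
         "ApproximateNumberOfMessagesNotVisible",
         "--region", region, "--output", "json"],
        capture_output=True, text=True, check=True,
    )
    return parse_depth(result.stdout)


def is_drained(depth):
    """True when no messages are waiting and none are being processed."""
    visible, in_flight = depth
    return visible == 0 and in_flight == 0
```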

Step 3: Run Ingestion Worker (SQS → Neptune + Aurora)

Once extraction is done (or while it's running), launch the ingestion worker to drain the queue:

aws ecs run-task \
  --cluster graphrag-beta-cluster \
  --task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
      "securityGroups": ["sg-072b4fffe90fed053"],
      "assignPublicIp": "ENABLED"
    }
  }' \
  --overrides '{
    "containerOverrides": [{
      "name": "fastapi",
      "command": ["--worker"]
    }]
  }' \
  --region eu-north-1

Note: The --worker flag runs sqs_ingestion_worker.py, which:

  • Reads messages from graphrag-beta-extraction-queue
  • Batch-writes nodes/relationships to Neptune via OpenCypher
  • Writes embeddings to Aurora pgvector
  • Deletes processed messages
  • Stops when the queue is empty
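
The read, write, delete, stop-when-empty cycle described in the note can be sketched as a generic drain loop. This is an illustration of the pattern, not the code in sqs_ingestion_worker.py: drain_queue and handle are hypothetical names, message bodies are assumed to be JSON, and sqs is any object exposing boto3-style receive_message and delete_message_batch methods (for example boto3.client("sqs")).

```python
import json


def drain_queue(sqs, queue_url, handle, wait_seconds=10, max_empty_polls=3):
    """Process messages until several consecutive polls return nothing.

    `handle` receives the decoded message body and should raise on failure.
    Returns the number of messages processed and deleted.
    """
    processed = 0
    empty_polls = 0
    while empty_polls < max_empty_polls:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,        # SQS maximum per receive call
            WaitTimeSeconds=wait_seconds,  # long polling
        )
        messages = response.get("Messages", [])
        if not messages:
            empty_polls += 1
            continue
        empty_polls = 0

        done = []
        for index, message in enumerate(messages):
            try:
                handle(json.loads(message["Body"]))
            except Exception:
                # Leave the message on the queue: SQS redelivers it after the
                # visibility timeout, and a redrive policy can move it to the DLQ.
                continue
            done.append({"Id": str(index), "ReceiptHandle": message["ReceiptHandle"]})

        if done:
            # Delete only what was handled successfully.
            sqs.delete_message_batch(QueueUrl=queue_url, Entries=done)
            processed += len(done)
    return processed
```

Requiring several empty polls before stopping is a deliberate choice here: a single empty receive does not prove the queue is empty, especially while the extraction task is still producing messages.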

Step 4: Verify

# Check Neptune node count (via beta API)
curl -s https://graphrag-beta.mlapps.olink.systems/v1/sessions \
  -H "Content-Type: application/json" \
  -d '{"database_type": "neptune"}'

# Check SQS DLQ for failures
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-dlq \
  --attribute-names ApproximateNumberOfMessages --region eu-north-1

Costs

| Component | Cost (per 10K papers, or hourly where noted) |
|---|---|
| Bedrock LLM (Llama 3.1 8B) | ~$8-12 |
| ECS Fargate (extraction, ~30 min) | ~$0.50 |
| ECS Fargate (ingestion, ~10 min) | ~$0.15 |
| Neptune (running) | ~$0.35/hr |
| Aurora pgvector (running) | ~$0.20/hr |
| SQS | ~$0.01 |
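
To turn these figures into a rough per-run estimate, the per-10K items can be scaled by paper count and the hourly database rates multiplied by the time the databases are attributed to the run. The sketch below uses only the numbers listed above; the function name estimate_cost is hypothetical, and it assumes costs scale linearly with the number of papers and that database hours are charged to the ingest only for its duration (about 40 minutes for extraction plus ingestion).

```python
def estimate_cost(papers=10_000, db_hours=40 / 60):
    """Return a (low, high) USD estimate for one ingest run, rounded to cents."""
    scale = papers / 10_000                 # assumption: linear scaling
    fargate_and_sqs = 0.50 + 0.15 + 0.01    # extraction + ingestion + SQS
    databases = (0.35 + 0.20) * db_hours    # Neptune + Aurora hourly rates
    low = (8 + fargate_and_sqs) * scale + databases
    high = (12 + fargate_and_sqs) * scale + databases
    return round(low, 2), round(high, 2)


if __name__ == "__main__":
    print(estimate_cost())                            # full 10K-paper run
    print(estimate_cost(papers=9_300, db_hours=0))    # 31 queries x 300, DB cost excluded
```

Bedrock dominates the total in either case, so the model choice and the number of papers are the main cost levers.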

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| sqs:sendmessage AccessDenied | Task role missing SQS policy | Add SQS policy to RagStack-beta-AppTaskRole7EC51C11-jWyEfPb24mh5 |
| ValueError: invalid literal for int() with base 16: b'' | PubMed chunked-encoding error | Fixed: retry logic handles this |
| Neptune timeout at min capacity | Too many concurrent writes | Use SQS mode (controlled drain rate) |
| unrecognized arguments | Wrong command format | Use ["--queries-file", ...] not ["python", "-m", ...] |

Weekly Automation (TODO)

EventBridge rule graphrag-beta-weekly-ingest exists but is not yet connected. To fully automate:

  1. Wire weekly_worker.py into CDK beta stack
  2. Add Step Functions: extraction task → wait → ingestion worker task
  3. SNS notification on completion/failure
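
As a starting point for step 2, the extraction → wait → ingestion flow could be expressed as a Step Functions state machine definition. The sketch below only generates the Amazon States Language JSON; it is not wired into CDK and has not been deployed. The state names, the 60-second wait, and the topic_arn parameter are assumptions, while the cluster, task definition, subnets, and security group are copied from the run-task commands in Steps 2 and 3.

```python
import json

CLUSTER = "graphrag-beta-cluster"
TASK_DEFINITION = "RagStackbetaFastApiTaskDefinition78258723:19"
NETWORK = {
    "AwsvpcConfiguration": {
        "Subnets": ["subnet-0f830824dfcaf2f42", "subnet-0fa6e404df8fdebdb"],
        "SecurityGroups": ["sg-072b4fffe90fed053"],
        "AssignPublicIp": "ENABLED",
    }
}


def ecs_task(command, next_state):
    """Task state that runs the fastapi container and waits for it to finish."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::ecs:runTask.sync",
        "Parameters": {
            "LaunchType": "FARGATE",
            "Cluster": CLUSTER,
            "TaskDefinition": TASK_DEFINITION,
            "NetworkConfiguration": NETWORK,
            "Overrides": {
                "ContainerOverrides": [{"Name": "fastapi", "Command": command}]
            },
        },
        # Any failure routes to the failure notification.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
        "Next": next_state,
    }


def notify(topic_arn, message, next_state=None):
    """Task state that publishes a message to an SNS topic."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {"TopicArn": topic_arn, "Message": message},
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def build_state_machine(topic_arn, wait_seconds=60):
    """Assemble the full definition; topic_arn is a hypothetical SNS topic."""
    extract_cmd = ["--queries-file", "data/queries_bulk.txt",
                   "--max-results", "300", "--use-sqs"]
    return {
        "Comment": "Weekly ingest sketch: extraction, wait, ingestion worker",
        "StartAt": "Extract",
        "States": {
            "Extract": ecs_task(extract_cmd, "WaitForQueue"),
            "WaitForQueue": {"Type": "Wait", "Seconds": wait_seconds, "Next": "Ingest"},
            "Ingest": ecs_task(["--worker"], "NotifySuccess"),
            "NotifySuccess": notify(topic_arn, "Weekly ingest completed"),
            "NotifyFailure": notify(topic_arn, "Weekly ingest failed", "Failed"),
            "Failed": {"Type": "Fail"},
        },
    }
```

Calling json.dumps(build_state_machine(topic_arn), indent=2) yields a definition that can be reviewed before being translated into the CDK beta stack.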

Endpoints

| Resource | Value |
|---|---|
| Neptune Alpha | neptunedbcluster-y5yc1gf3wxsn |
| Neptune Beta | neptunedbcluster-rssnn6cwagnp |
| Aurora pgvector | ragstack-beta-aurorapgvectorstackaurorapgvectorclu-i8zsxoht8olc |
| SQS Queue | graphrag-beta-extraction-queue |
| SQS DLQ | graphrag-beta-extraction-dlq |
| ECS Cluster | graphrag-beta-cluster |
| Task Role | RagStack-beta-AppTaskRole7EC51C11-jWyEfPb24mh5 |
| Docker Image | 357836458011.dkr.ecr.eu-north-1.amazonaws.com/graphrag-fastapi:latest |