Skip to content

Production Neptune Bulk Ingestion

End-to-end workflow for ingesting thousands to millions of papers into Neptune + Aurora pgvector.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        Input Sources                                  │
│  PubMed queries │ PMC Full-Text │ bioRxiv │ S3 PDFs │ CSV/TSV       │
└───────────────────────────┬──────────────────────────────────────────┘
            ┌───────────────┼───────────────┐
            ▼               ▼               ▼
    ┌──────────────┐ ┌────────────┐ ┌──────────────────┐
    │  Option A    │ │  Option B  │ │    Option C       │
    │  Direct      │ │  Batch     │ │    SQS Fan-out    │
    │  <500 docs   │ │  >500 docs │ │    Millions       │
    └──────┬───────┘ └─────┬──────┘ └────────┬─────────┘
           │               │                  │
           ▼               ▼                  ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │                    Bedrock LLM (entity extraction)                │
    │              us.meta.llama3-1-8b-instruct-v1:0                    │
    │              50–100 concurrent calls per worker                   │
    └───────────────────────────┬──────────────────────────────────────┘
                ┌───────────────┴───────────────┐
                ▼                               ▼
    ┌───────────────────────┐       ┌───────────────────────┐
    │   Neptune (graph)     │       │  Aurora pgvector       │
    │   Entities + rels     │       │  768-dim embeddings    │
    │   OpenCypher UNWIND   │       │  HNSW index            │
    └───────────────────────┘       └───────────────────────┘

Components

Component Resource Purpose
Neptune neptunedbcluster-rssnn6cwagnp Graph storage (nodes + relationships)
Aurora pgvector ragstack-beta-aurorapgvectorclu-i8zsxoht8olc Vector embeddings
SQS Queue graphrag-beta-extraction-queue Decouples extraction from ingestion
SQS DLQ graphrag-beta-extraction-dlq Failed message retry (3 attempts, 14-day retention)
ECS Cluster graphrag-beta-cluster Fargate compute for workers
S3 Bucket graphrag-upload-bucket PDF staging area
Bedrock us.meta.llama3-1-8b-instruct-v1:0 LLM entity extraction

Cost Estimates

Scale Bedrock ECS Fargate Neptune Aurora Total
500 docs (Direct) ~$0.50 ~$0.35/hr ~$0.20/hr ~$2
5,000 docs (Batch) ~$5 ~$0.65 ~$0.70 ~$0.40 ~$8
50,000 docs (SQS) ~$50 ~$5 ~$3.50 ~$2.00 ~$65

Off-peak discount

Batch mode runs 22:00–06:00 UTC at ~50% cheaper Bedrock pricing. Effective cost: ~$0.004/document.


Pre-flight Checklist

Run these checks before any ingestion run:

# 1. AWS SSO login
aws sso login
aws sts get-caller-identity  # confirm identity

# 2. Neptune cluster status (check NCU capacity)
aws neptune describe-db-clusters \
  --db-cluster-identifier neptunedbcluster-rssnn6cwagnp \
  --region eu-north-1 \
  --query "DBClusters[0].{Status:Status,ServerlessV2ScalingConfiguration:ServerlessV2ScalingConfiguration}"

# 3. Aurora pgvector connectivity
aws rds describe-db-clusters \
  --db-cluster-identifier ragstack-beta-aurorapgvectorstackaurorapgvectorclu-i8zsxoht8olc \
  --region eu-north-1 \
  --query "DBClusters[0].Status"

# 4. S3 bucket access
aws s3 ls s3://graphrag-upload-bucket/ --region eu-north-1

# 5. Bedrock model availability
aws bedrock get-foundation-model \
  --model-identifier us.meta.llama3-1-8b-instruct-v1:0 \
  --region eu-north-1 \
  --query "modelDetails.modelLifecycle.status"

Neptune NCU capacity

For bulk ingestion (>1,000 docs), ensure Neptune min NCU ≥ 2.5. Scale up before starting:

aws neptune modify-db-cluster \
  --db-cluster-identifier neptunedbcluster-rssnn6cwagnp \
  --serverless-v2-scaling-configuration MinCapacity=2.5,MaxCapacity=32 \
  --region eu-north-1


Option A: Direct Mode (<500 documents)

Best for small, ad-hoc ingestion runs. Single-process, synchronous writes to Neptune.

From PubMed queries

uv run python ingest_main.py \
  --source pubmed \
  --search-term "cardiovascular disease protein biomarker" \
  --max-results 200 \
  --target neptune \
  --service bedrock \
  --database beta

From PDF files (local)

uv run python ingest_main.py \
  --source pdf \
  --pdf-files paper1.pdf paper2.pdf paper3.pdf \
  --target neptune \
  --service bedrock \
  --database beta

From S3 PDFs

# Upload PDFs to S3 first
aws s3 cp ./papers/ s3://graphrag-upload-bucket/uploads/pdfs/ --recursive

# Ingest from S3
uv run python ingest_main.py \
  --source pdf \
  --s3-prefix uploads/pdfs/ \
  --target neptune \
  --service bedrock \
  --database beta

Expected Performance

Metric Value
Throughput ~2 docs/min (LLM-bottlenecked)
100 papers ~50 min
500 papers ~4 hours
Concurrency Single process, 15 concurrent LLM calls

Option B: Batch Mode (>500 documents, 50% cheaper)

Uses Bedrock batch inference for cheaper off-peak processing. Recommended for nightly scheduled runs.

Step 1: Upload PDFs to S3

aws s3 sync ./papers/ s3://graphrag-upload-bucket/uploads/pdfs/ --region eu-north-1

Or for PubMed queries, the worker fetches directly from PubMed APIs.

Step 2: Submit batch job

uv run python -m pipeline.ingest.weekly_worker \
  --mode batch \
  --s3-prefix uploads/pdfs/ \
  --database beta \
  --service bedrock

For PubMed queries:

uv run python -m pipeline.ingest.weekly_worker \
  --mode batch \
  --queries-file data/queries_bulk.txt \
  --max-results 500 \
  --database beta \
  --service bedrock

Step 3: Monitor progress

# Check Bedrock batch job status
aws bedrock list-model-invocation-jobs \
  --region eu-north-1 \
  --query "invocationJobSummaries[?status!='Completed'].{Id:jobArn,Status:status,Created:submitTime}"

# Neptune node count (via Gremlin status endpoint)
aws neptune-data execute-open-cypher-query \
  --cluster-endpoint neptunedbcluster-rssnn6cwagnp.cluster-xxx.eu-north-1.neptune.amazonaws.com \
  --open-cypher-query "MATCH (n) RETURN count(n) AS nodes" \
  --region eu-north-1

# SQS queue depth (if using intermediate queue)
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue \
  --attribute-names ApproximateNumberOfMessages \
  --region eu-north-1

Expected Performance

Metric Value
Throughput Thousands/hour (batch parallelism)
5,000 papers ~2 hours
Cost per document ~$0.004 (off-peak)
Bedrock concurrency Handled by AWS batch API

Option C: SQS Fan-out (Millions, Maximum Parallelism)

For the largest runs. Decouples extraction from ingestion via SQS for independent scaling.

Step 1: Launch extraction workers

Scale extraction across N ECS tasks:

# Launch extraction worker (repeat for N workers)
aws ecs run-task \
  --cluster graphrag-beta-cluster \
  --task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
      "securityGroups": ["sg-072b4fffe90fed053"],
      "assignPublicIp": "ENABLED"
    }
  }' \
  --overrides '{
    "containerOverrides": [{
      "name": "fastapi",
      "command": ["--queries-file", "data/queries_bulk.txt", "--max-results", "500", "--use-sqs"]
    }]
  }' \
  --region eu-north-1

Each worker:

  • Fetches papers from source APIs
  • Chunks text (512 tokens, 64 overlap)
  • Extracts entities via Bedrock (50–100 concurrent calls)
  • Pushes structured results to SQS

Step 2: Launch ingestion workers

# Launch SQS → Neptune ingestion worker
aws ecs run-task \
  --cluster graphrag-beta-cluster \
  --task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
      "securityGroups": ["sg-072b4fffe90fed053"],
      "assignPublicIp": "ENABLED"
    }
  }' \
  --overrides '{
    "containerOverrides": [{
      "name": "fastapi",
      "command": ["--worker"]
    }]
  }' \
  --region eu-north-1

Ingestion workers:

  • Long-poll SQS messages
  • Batch UNWIND to Neptune (5,000 nodes/tx)
  • Batch INSERT to Aurora pgvector (100 embeddings/flush)
  • Self-throttle to prevent database overload
  • Stop when queue is empty

Step 3: Monitor

# Queue depth (should decrease over time)
watch -n 30 'aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue \
  --attribute-names ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible \
  --region eu-north-1 --output table'

# DLQ (should stay at 0)
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-dlq \
  --attribute-names ApproximateNumberOfMessages \
  --region eu-north-1

# ECS task status
aws ecs list-tasks --cluster graphrag-beta-cluster --region eu-north-1

Reference

For detailed SQS fan-out architecture and troubleshooting, see Neptune Runbook.


Post-ingestion

Verify counts

# Node count by label
curl -s https://graphrag-beta.mlapps.olink.systems/health/ready \
  -H "Authorization: Bearer $API_AUTH_TOKEN" | jq .

# Or via direct Neptune query
aws neptune-data execute-open-cypher-query \
  --cluster-endpoint neptunedbcluster-rssnn6cwagnp.cluster-xxx.eu-north-1.neptune.amazonaws.com \
  --open-cypher-query "MATCH (n) RETURN labels(n)[0] AS label, count(n) AS count ORDER BY count DESC" \
  --region eu-north-1

Run deduplication (if needed)

If multiple runs overlap, deduplicate entities:

uv run python -m pipeline.processors.entity_resolver \
  --database beta \
  --target neptune \
  --operation full

Check Aurora embedding counts

# Via psql (through SSM tunnel or bastion)
psql -h ragstack-beta-xxx.cluster-xxx.eu-north-1.rds.amazonaws.com \
  -U postgres -d graphrag \
  -c "SELECT count(*) FROM embeddings;"

Scale Neptune back down

After ingestion completes, reduce Neptune NCU to save costs:

aws neptune modify-db-cluster \
  --db-cluster-identifier neptunedbcluster-rssnn6cwagnp \
  --serverless-v2-scaling-configuration MinCapacity=1.0,MaxCapacity=16 \
  --region eu-north-1

Monitoring & Troubleshooting

CloudWatch Log Groups

Log Group Contents
graphrag-beta-ecs-logs ECS task stdout/stderr (extraction + ingestion workers)
/aws/neptune/neptunedbcluster-rssnn6cwagnp/audit Neptune query audit logs
/aws/rds/cluster/ragstack-beta-.../postgresql Aurora pgvector logs
# Tail ECS logs
aws logs tail graphrag-beta-ecs-logs --follow --region eu-north-1

# Search for errors
aws logs filter-log-events \
  --log-group-name graphrag-beta-ecs-logs \
  --filter-pattern "ERROR" \
  --region eu-north-1 \
  --query "events[].message" --output text

Common Issues

Symptom Cause Fix
Neptune ReadTimeoutError Min NCU too low for write load Increase min NCU to 2.5+
Bedrock ThrottlingException Too many concurrent LLM calls Reduce KG_MAX_LLM_CONCURRENCY (default 15)
Bedrock ModelStreamErrorException Transient model error Automatic retry handles this (3 attempts)
SQS messages in DLQ Ingestion worker crash or Neptune timeout Check DLQ messages, fix, then redrive
AccessDeniedException on SQS Task role missing policy Add SQS permissions to task role
Aurora connection refused Security group / VPC mismatch Verify ECS task SG can reach Aurora SG on port 5432

Cost Tracking

# Bedrock usage (last 24h)
aws bedrock list-model-invocation-jobs \
  --region eu-north-1 \
  --query "invocationJobSummaries[?status=='Completed'].{Tokens:outputDataConfig}" \
  --output table

# Neptune NCU hours (CloudWatch)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Neptune \
  --metric-name ServerlessDatabaseCapacity \
  --dimensions Name=DBClusterIdentifier,Value=neptunedbcluster-rssnn6cwagnp \
  --start-time $(date -u -v-24H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 --statistics Average \
  --region eu-north-1

Schedule: Nightly Off-peak Runs

Ingestion is scheduled to run nightly via EventBridge (22:00 UTC, Mon–Fri) for cheapest Bedrock pricing.

EventBridge Rule

Rule: graphrag-beta-nightly-ingest
Schedule: cron(0 22 ? * MON-FRI *)
Target: ECS RunTask (batch mode)

What runs nightly

  1. Fetch new papers from last 24h (date-filtered PubMed queries)
  2. Extract entities via Bedrock batch
  3. Write to Neptune + Aurora
  4. Send SNS notification on completion/failure

Manual trigger

# Trigger the scheduled rule manually
aws events put-events \
  --entries '[{"Source":"manual","DetailType":"ScheduledIngest","Detail":"{}"}]' \
  --region eu-north-1

Disable/enable schedule

# Disable nightly runs
aws events disable-rule --name graphrag-beta-nightly-ingest --region eu-north-1

# Re-enable
aws events enable-rule --name graphrag-beta-nightly-ingest --region eu-north-1