Production Neptune Bulk Ingestion¶

End-to-end workflow for ingesting thousands to millions of papers into Neptune + Aurora pgvector.

Architecture¶

┌─────────────────────────────────────────────────────────────────────┐
│                        Input Sources                                  │
│  PubMed queries │ PMC Full-Text │ bioRxiv │ S3 PDFs │ CSV/TSV       │
└───────────────────────────┬──────────────────────────────────────────┘
                            │
            ┌───────────────┼───────────────┐
            ▼               ▼               ▼
    ┌──────────────┐ ┌────────────┐ ┌──────────────────┐
    │  Option A    │ │  Option B  │ │    Option C       │
    │  Direct      │ │  Batch     │ │    SQS Fan-out    │
    │  <500 docs   │ │  >500 docs │ │    Millions       │
    └──────┬───────┘ └─────┬──────┘ └────────┬─────────┘
           │               │                  │
           ▼               ▼                  ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │                    Bedrock LLM (entity extraction)                │
    │              us.meta.llama3-1-8b-instruct-v1:0                    │
    │              50–100 concurrent calls per worker                   │
    └───────────────────────────┬──────────────────────────────────────┘
                                │
                ┌───────────────┴───────────────┐
                ▼                               ▼
    ┌───────────────────────┐       ┌───────────────────────┐
    │   Neptune (graph)     │       │  Aurora pgvector       │
    │   Entities + rels     │       │  768-dim embeddings    │
    │   OpenCypher UNWIND   │       │  HNSW index            │
    └───────────────────────┘       └───────────────────────┘

Components¶

Component	Resource	Purpose
Neptune	`neptunedbcluster-rssnn6cwagnp`	Graph storage (nodes + relationships)
Aurora pgvector	`ragstack-beta-aurorapgvectorclu-i8zsxoht8olc`	Vector embeddings
SQS Queue	`graphrag-beta-extraction-queue`	Decouples extraction from ingestion
SQS DLQ	`graphrag-beta-extraction-dlq`	Failed message retry (3 attempts, 14-day retention)
ECS Cluster	`graphrag-beta-cluster`	Fargate compute for workers
S3 Bucket	`graphrag-upload-bucket`	PDF staging area
Bedrock	`us.meta.llama3-1-8b-instruct-v1:0`	LLM entity extraction

Cost Estimates¶

Scale	Bedrock	ECS Fargate	Neptune	Aurora	Total
500 docs (Direct)	~$0.50	—	~$0.35/hr	~$0.20/hr	~$2
5,000 docs (Batch)	~$5	~$0.65	~$0.70	~$0.40	~$8
50,000 docs (SQS)	~$50	~$5	~$3.50	~$2.00	~$65

Off-peak discount

Batch mode runs 22:00–06:00 UTC at ~50% cheaper Bedrock pricing. Effective cost: ~$0.004/document.

Pre-flight Checklist¶

Run these checks before any ingestion run:

# 1. AWS SSO login
aws sso login
aws sts get-caller-identity  # confirm identity

# 2. Neptune cluster status (check NCU capacity)
aws neptune describe-db-clusters \
  --db-cluster-identifier neptunedbcluster-rssnn6cwagnp \
  --region eu-north-1 \
  --query "DBClusters[0].{Status:Status,ServerlessV2ScalingConfiguration:ServerlessV2ScalingConfiguration}"

# 3. Aurora pgvector connectivity
aws rds describe-db-clusters \
  --db-cluster-identifier ragstack-beta-aurorapgvectorstackaurorapgvectorclu-i8zsxoht8olc \
  --region eu-north-1 \
  --query "DBClusters[0].Status"

# 4. S3 bucket access
aws s3 ls s3://graphrag-upload-bucket/ --region eu-north-1

# 5. Bedrock model availability
aws bedrock get-foundation-model \
  --model-identifier us.meta.llama3-1-8b-instruct-v1:0 \
  --region eu-north-1 \
  --query "modelDetails.modelLifecycle.status"

Neptune NCU capacity

For bulk ingestion (>1,000 docs), ensure Neptune min NCU ≥ 2.5. Scale up before starting:

aws neptune modify-db-cluster \
  --db-cluster-identifier neptunedbcluster-rssnn6cwagnp \
  --serverless-v2-scaling-configuration MinCapacity=2.5,MaxCapacity=32 \
  --region eu-north-1

Option A: Direct Mode (<500 documents)¶

Best for small, ad-hoc ingestion runs. Single-process, synchronous writes to Neptune.

From PubMed queries¶

uv run python ingest_main.py \
  --source pubmed \
  --search-term "cardiovascular disease protein biomarker" \
  --max-results 200 \
  --target neptune \
  --service bedrock \
  --database beta

From PDF files (local)¶

uv run python ingest_main.py \
  --source pdf \
  --pdf-files paper1.pdf paper2.pdf paper3.pdf \
  --target neptune \
  --service bedrock \
  --database beta

From S3 PDFs¶

# Upload PDFs to S3 first
aws s3 cp ./papers/ s3://graphrag-upload-bucket/uploads/pdfs/ --recursive

# Ingest from S3
uv run python ingest_main.py \
  --source pdf \
  --s3-prefix uploads/pdfs/ \
  --target neptune \
  --service bedrock \
  --database beta

Expected Performance¶

Metric	Value
Throughput	~2 docs/min (LLM-bottlenecked)
100 papers	~50 min
500 papers	~4 hours
Concurrency	Single process, 15 concurrent LLM calls

Option B: Batch Mode (>500 documents, 50% cheaper)¶

Uses Bedrock batch inference for cheaper off-peak processing. Recommended for nightly scheduled runs.

Step 1: Upload PDFs to S3¶

aws s3 sync ./papers/ s3://graphrag-upload-bucket/uploads/pdfs/ --region eu-north-1

Or for PubMed queries, the worker fetches directly from PubMed APIs.

Step 2: Submit batch job¶

uv run python -m pipeline.ingest.weekly_worker \
  --mode batch \
  --s3-prefix uploads/pdfs/ \
  --database beta \
  --service bedrock

For PubMed queries:

uv run python -m pipeline.ingest.weekly_worker \
  --mode batch \
  --queries-file data/queries_bulk.txt \
  --max-results 500 \
  --database beta \
  --service bedrock

Step 3: Monitor progress¶

# Check Bedrock batch job status
aws bedrock list-model-invocation-jobs \
  --region eu-north-1 \
  --query "invocationJobSummaries[?status!='Completed'].{Id:jobArn,Status:status,Created:submitTime}"

# Neptune node count (via Gremlin status endpoint)
aws neptune-data execute-open-cypher-query \
  --cluster-endpoint neptunedbcluster-rssnn6cwagnp.cluster-xxx.eu-north-1.neptune.amazonaws.com \
  --open-cypher-query "MATCH (n) RETURN count(n) AS nodes" \
  --region eu-north-1

# SQS queue depth (if using intermediate queue)
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue \
  --attribute-names ApproximateNumberOfMessages \
  --region eu-north-1

Expected Performance¶

Metric	Value
Throughput	Thousands/hour (batch parallelism)
5,000 papers	~2 hours
Cost per document	~$0.004 (off-peak)
Bedrock concurrency	Handled by AWS batch API

Option C: SQS Fan-out (Millions, Maximum Parallelism)¶

For the largest runs. Decouples extraction from ingestion via SQS for independent scaling.

Step 1: Launch extraction workers¶

Scale extraction across N ECS tasks:

# Launch extraction worker (repeat for N workers)
aws ecs run-task \
  --cluster graphrag-beta-cluster \
  --task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
      "securityGroups": ["sg-072b4fffe90fed053"],
      "assignPublicIp": "ENABLED"
    }
  }' \
  --overrides '{
    "containerOverrides": [{
      "name": "fastapi",
      "command": ["--queries-file", "data/queries_bulk.txt", "--max-results", "500", "--use-sqs"]
    }]
  }' \
  --region eu-north-1

Each worker:

Fetches papers from source APIs
Chunks text (512 tokens, 64 overlap)
Extracts entities via Bedrock (50–100 concurrent calls)
Pushes structured results to SQS

Step 2: Launch ingestion workers¶

# Launch SQS → Neptune ingestion worker
aws ecs run-task \
  --cluster graphrag-beta-cluster \
  --task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
      "securityGroups": ["sg-072b4fffe90fed053"],
      "assignPublicIp": "ENABLED"
    }
  }' \
  --overrides '{
    "containerOverrides": [{
      "name": "fastapi",
      "command": ["--worker"]
    }]
  }' \
  --region eu-north-1

Ingestion workers:

Long-poll SQS messages
Batch UNWIND to Neptune (5,000 nodes/tx)
Batch INSERT to Aurora pgvector (100 embeddings/flush)
Self-throttle to prevent database overload
Stop when queue is empty

Step 3: Monitor¶

# Queue depth (should decrease over time)
watch -n 30 'aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue \
  --attribute-names ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible \
  --region eu-north-1 --output table'

# DLQ (should stay at 0)
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-dlq \
  --attribute-names ApproximateNumberOfMessages \
  --region eu-north-1

# ECS task status
aws ecs list-tasks --cluster graphrag-beta-cluster --region eu-north-1

Reference

For detailed SQS fan-out architecture and troubleshooting, see Neptune Runbook.

Post-ingestion¶

Verify counts¶

# Node count by label
curl -s https://graphrag-beta.mlapps.olink.systems/health/ready \
  -H "Authorization: Bearer $API_AUTH_TOKEN" | jq .

# Or via direct Neptune query
aws neptune-data execute-open-cypher-query \
  --cluster-endpoint neptunedbcluster-rssnn6cwagnp.cluster-xxx.eu-north-1.neptune.amazonaws.com \
  --open-cypher-query "MATCH (n) RETURN labels(n)[0] AS label, count(n) AS count ORDER BY count DESC" \
  --region eu-north-1

Run deduplication (if needed)¶

If multiple runs overlap, deduplicate entities:

uv run python -m pipeline.processors.entity_resolver \
  --database beta \
  --target neptune \
  --operation full

Check Aurora embedding counts¶

# Via psql (through SSM tunnel or bastion)
psql -h ragstack-beta-xxx.cluster-xxx.eu-north-1.rds.amazonaws.com \
  -U postgres -d graphrag \
  -c "SELECT count(*) FROM embeddings;"

Scale Neptune back down¶

After ingestion completes, reduce Neptune NCU to save costs:

aws neptune modify-db-cluster \
  --db-cluster-identifier neptunedbcluster-rssnn6cwagnp \
  --serverless-v2-scaling-configuration MinCapacity=1.0,MaxCapacity=16 \
  --region eu-north-1

Monitoring & Troubleshooting¶

CloudWatch Log Groups¶

Log Group	Contents
`graphrag-beta-ecs-logs`	ECS task stdout/stderr (extraction + ingestion workers)
`/aws/neptune/neptunedbcluster-rssnn6cwagnp/audit`	Neptune query audit logs
`/aws/rds/cluster/ragstack-beta-.../postgresql`	Aurora pgvector logs

# Tail ECS logs
aws logs tail graphrag-beta-ecs-logs --follow --region eu-north-1

# Search for errors
aws logs filter-log-events \
  --log-group-name graphrag-beta-ecs-logs \
  --filter-pattern "ERROR" \
  --region eu-north-1 \
  --query "events[].message" --output text

Common Issues¶

Symptom	Cause	Fix
Neptune `ReadTimeoutError`	Min NCU too low for write load	Increase min NCU to 2.5+
Bedrock `ThrottlingException`	Too many concurrent LLM calls	Reduce `KG_MAX_LLM_CONCURRENCY` (default 15)
Bedrock `ModelStreamErrorException`	Transient model error	Automatic retry handles this (3 attempts)
SQS messages in DLQ	Ingestion worker crash or Neptune timeout	Check DLQ messages, fix, then redrive
`AccessDeniedException` on SQS	Task role missing policy	Add SQS permissions to task role
Aurora connection refused	Security group / VPC mismatch	Verify ECS task SG can reach Aurora SG on port 5432

Cost Tracking¶

# Bedrock usage (last 24h)
aws bedrock list-model-invocation-jobs \
  --region eu-north-1 \
  --query "invocationJobSummaries[?status=='Completed'].{Tokens:outputDataConfig}" \
  --output table

# Neptune NCU hours (CloudWatch)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Neptune \
  --metric-name ServerlessDatabaseCapacity \
  --dimensions Name=DBClusterIdentifier,Value=neptunedbcluster-rssnn6cwagnp \
  --start-time $(date -u -v-24H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 --statistics Average \
  --region eu-north-1

Schedule: Nightly Off-peak Runs¶

Ingestion is scheduled to run nightly via EventBridge (22:00 UTC, Mon–Fri) for cheapest Bedrock pricing.

EventBridge Rule¶

Rule: graphrag-beta-nightly-ingest
Schedule: cron(0 22 ? * MON-FRI *)
Target: ECS RunTask (batch mode)

What runs nightly¶

Fetch new papers from last 24h (date-filtered PubMed queries)
Extract entities via Bedrock batch
Write to Neptune + Aurora
Send SNS notification on completion/failure

Manual trigger¶

# Trigger the scheduled rule manually
aws events put-events \
  --entries '[{"Source":"manual","DetailType":"ScheduledIngest","Detail":"{}"}]' \
  --region eu-north-1

Disable/enable schedule¶

# Disable nightly runs
aws events disable-rule --name graphrag-beta-nightly-ingest --region eu-north-1

# Re-enable
aws events enable-rule --name graphrag-beta-nightly-ingest --region eu-north-1