Production Neptune Bulk Ingestion¶
End-to-end workflow for ingesting thousands to millions of papers into Neptune + Aurora pgvector.
Architecture¶
┌─────────────────────────────────────────────────────────────────────┐
│ Input Sources │
│ PubMed queries │ PMC Full-Text │ bioRxiv │ S3 PDFs │ CSV/TSV │
└───────────────────────────┬──────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌──────────────┐ ┌────────────┐ ┌──────────────────┐
│ Option A │ │ Option B │ │ Option C │
│ Direct │ │ Batch │ │ SQS Fan-out │
│ <500 docs │ │ >500 docs │ │ Millions │
└──────┬───────┘ └─────┬──────┘ └────────┬─────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Bedrock LLM (entity extraction) │
│ us.meta.llama3-1-8b-instruct-v1:0 │
│ 50–100 concurrent calls per worker │
└───────────────────────────┬──────────────────────────────────────┘
│
┌───────────────┴───────────────┐
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ Neptune (graph) │ │ Aurora pgvector │
│ Entities + rels │ │ 768-dim embeddings │
│ OpenCypher UNWIND │ │ HNSW index │
└───────────────────────┘ └───────────────────────┘
Components¶
| Component | Resource | Purpose |
|---|---|---|
| Neptune | neptunedbcluster-rssnn6cwagnp |
Graph storage (nodes + relationships) |
| Aurora pgvector | ragstack-beta-aurorapgvectorclu-i8zsxoht8olc |
Vector embeddings |
| SQS Queue | graphrag-beta-extraction-queue |
Decouples extraction from ingestion |
| SQS DLQ | graphrag-beta-extraction-dlq |
Failed message retry (3 attempts, 14-day retention) |
| ECS Cluster | graphrag-beta-cluster |
Fargate compute for workers |
| S3 Bucket | graphrag-upload-bucket |
PDF staging area |
| Bedrock | us.meta.llama3-1-8b-instruct-v1:0 |
LLM entity extraction |
Cost Estimates¶
| Scale | Bedrock | ECS Fargate | Neptune | Aurora | Total |
|---|---|---|---|---|---|
| 500 docs (Direct) | ~$0.50 | — | ~$0.35/hr | ~$0.20/hr | ~$2 |
| 5,000 docs (Batch) | ~$5 | ~$0.65 | ~$0.70 | ~$0.40 | ~$8 |
| 50,000 docs (SQS) | ~$50 | ~$5 | ~$3.50 | ~$2.00 | ~$65 |
Off-peak discount
Batch mode runs 22:00–06:00 UTC at ~50% cheaper Bedrock pricing. Effective cost: ~$0.004/document.
Pre-flight Checklist¶
Run these checks before any ingestion run:
# 1. AWS SSO login
aws sso login
aws sts get-caller-identity # confirm identity
# 2. Neptune cluster status (check NCU capacity)
aws neptune describe-db-clusters \
--db-cluster-identifier neptunedbcluster-rssnn6cwagnp \
--region eu-north-1 \
--query "DBClusters[0].{Status:Status,ServerlessV2ScalingConfiguration:ServerlessV2ScalingConfiguration}"
# 3. Aurora pgvector connectivity
aws rds describe-db-clusters \
--db-cluster-identifier ragstack-beta-aurorapgvectorstackaurorapgvectorclu-i8zsxoht8olc \
--region eu-north-1 \
--query "DBClusters[0].Status"
# 4. S3 bucket access
aws s3 ls s3://graphrag-upload-bucket/ --region eu-north-1
# 5. Bedrock model availability
aws bedrock get-foundation-model \
--model-identifier us.meta.llama3-1-8b-instruct-v1:0 \
--region eu-north-1 \
--query "modelDetails.modelLifecycle.status"
Neptune NCU capacity
For bulk ingestion (>1,000 docs), ensure Neptune min NCU ≥ 2.5. Scale up before starting:
Option A: Direct Mode (<500 documents)¶
Best for small, ad-hoc ingestion runs. Single-process, synchronous writes to Neptune.
From PubMed queries¶
uv run python ingest_main.py \
--source pubmed \
--search-term "cardiovascular disease protein biomarker" \
--max-results 200 \
--target neptune \
--service bedrock \
--database beta
From PDF files (local)¶
uv run python ingest_main.py \
--source pdf \
--pdf-files paper1.pdf paper2.pdf paper3.pdf \
--target neptune \
--service bedrock \
--database beta
From S3 PDFs¶
# Upload PDFs to S3 first
aws s3 cp ./papers/ s3://graphrag-upload-bucket/uploads/pdfs/ --recursive
# Ingest from S3
uv run python ingest_main.py \
--source pdf \
--s3-prefix uploads/pdfs/ \
--target neptune \
--service bedrock \
--database beta
Expected Performance¶
| Metric | Value |
|---|---|
| Throughput | ~2 docs/min (LLM-bottlenecked) |
| 100 papers | ~50 min |
| 500 papers | ~4 hours |
| Concurrency | Single process, 15 concurrent LLM calls |
Option B: Batch Mode (>500 documents, 50% cheaper)¶
Uses Bedrock batch inference for cheaper off-peak processing. Recommended for nightly scheduled runs.
Step 1: Upload PDFs to S3¶
Or for PubMed queries, the worker fetches directly from PubMed APIs.
Step 2: Submit batch job¶
uv run python -m pipeline.ingest.weekly_worker \
--mode batch \
--s3-prefix uploads/pdfs/ \
--database beta \
--service bedrock
For PubMed queries:
uv run python -m pipeline.ingest.weekly_worker \
--mode batch \
--queries-file data/queries_bulk.txt \
--max-results 500 \
--database beta \
--service bedrock
Step 3: Monitor progress¶
# Check Bedrock batch job status
aws bedrock list-model-invocation-jobs \
--region eu-north-1 \
--query "invocationJobSummaries[?status!='Completed'].{Id:jobArn,Status:status,Created:submitTime}"
# Neptune node count (via Gremlin status endpoint)
aws neptune-data execute-open-cypher-query \
--cluster-endpoint neptunedbcluster-rssnn6cwagnp.cluster-xxx.eu-north-1.neptune.amazonaws.com \
--open-cypher-query "MATCH (n) RETURN count(n) AS nodes" \
--region eu-north-1
# SQS queue depth (if using intermediate queue)
aws sqs get-queue-attributes \
--queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue \
--attribute-names ApproximateNumberOfMessages \
--region eu-north-1
Expected Performance¶
| Metric | Value |
|---|---|
| Throughput | Thousands/hour (batch parallelism) |
| 5,000 papers | ~2 hours |
| Cost per document | ~$0.004 (off-peak) |
| Bedrock concurrency | Handled by AWS batch API |
Option C: SQS Fan-out (Millions, Maximum Parallelism)¶
For the largest runs. Decouples extraction from ingestion via SQS for independent scaling.
Step 1: Launch extraction workers¶
Scale extraction across N ECS tasks:
# Launch extraction worker (repeat for N workers)
aws ecs run-task \
--cluster graphrag-beta-cluster \
--task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
--launch-type FARGATE \
--network-configuration '{
"awsvpcConfiguration": {
"subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
"securityGroups": ["sg-072b4fffe90fed053"],
"assignPublicIp": "ENABLED"
}
}' \
--overrides '{
"containerOverrides": [{
"name": "fastapi",
"command": ["--queries-file", "data/queries_bulk.txt", "--max-results", "500", "--use-sqs"]
}]
}' \
--region eu-north-1
Each worker:
- Fetches papers from source APIs
- Chunks text (512 tokens, 64 overlap)
- Extracts entities via Bedrock (50–100 concurrent calls)
- Pushes structured results to SQS
Step 2: Launch ingestion workers¶
# Launch SQS → Neptune ingestion worker
aws ecs run-task \
--cluster graphrag-beta-cluster \
--task-definition RagStackbetaFastApiTaskDefinition78258723:19 \
--launch-type FARGATE \
--network-configuration '{
"awsvpcConfiguration": {
"subnets": ["subnet-0f830824dfcaf2f42","subnet-0fa6e404df8fdebdb"],
"securityGroups": ["sg-072b4fffe90fed053"],
"assignPublicIp": "ENABLED"
}
}' \
--overrides '{
"containerOverrides": [{
"name": "fastapi",
"command": ["--worker"]
}]
}' \
--region eu-north-1
Ingestion workers:
- Long-poll SQS messages
- Batch UNWIND to Neptune (5,000 nodes/tx)
- Batch INSERT to Aurora pgvector (100 embeddings/flush)
- Self-throttle to prevent database overload
- Stop when queue is empty
Step 3: Monitor¶
# Queue depth (should decrease over time)
watch -n 30 'aws sqs get-queue-attributes \
--queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-queue \
--attribute-names ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible \
--region eu-north-1 --output table'
# DLQ (should stay at 0)
aws sqs get-queue-attributes \
--queue-url https://sqs.eu-north-1.amazonaws.com/357836458011/graphrag-beta-extraction-dlq \
--attribute-names ApproximateNumberOfMessages \
--region eu-north-1
# ECS task status
aws ecs list-tasks --cluster graphrag-beta-cluster --region eu-north-1
Reference
For detailed SQS fan-out architecture and troubleshooting, see Neptune Runbook.
Post-ingestion¶
Verify counts¶
# Node count by label
curl -s https://graphrag-beta.mlapps.olink.systems/health/ready \
-H "Authorization: Bearer $API_AUTH_TOKEN" | jq .
# Or via direct Neptune query
aws neptune-data execute-open-cypher-query \
--cluster-endpoint neptunedbcluster-rssnn6cwagnp.cluster-xxx.eu-north-1.neptune.amazonaws.com \
--open-cypher-query "MATCH (n) RETURN labels(n)[0] AS label, count(n) AS count ORDER BY count DESC" \
--region eu-north-1
Run deduplication (if needed)¶
If multiple runs overlap, deduplicate entities:
uv run python -m pipeline.processors.entity_resolver \
--database beta \
--target neptune \
--operation full
Check Aurora embedding counts¶
# Via psql (through SSM tunnel or bastion)
psql -h ragstack-beta-xxx.cluster-xxx.eu-north-1.rds.amazonaws.com \
-U postgres -d graphrag \
-c "SELECT count(*) FROM embeddings;"
Scale Neptune back down¶
After ingestion completes, reduce Neptune NCU to save costs:
aws neptune modify-db-cluster \
--db-cluster-identifier neptunedbcluster-rssnn6cwagnp \
--serverless-v2-scaling-configuration MinCapacity=1.0,MaxCapacity=16 \
--region eu-north-1
Monitoring & Troubleshooting¶
CloudWatch Log Groups¶
| Log Group | Contents |
|---|---|
graphrag-beta-ecs-logs |
ECS task stdout/stderr (extraction + ingestion workers) |
/aws/neptune/neptunedbcluster-rssnn6cwagnp/audit |
Neptune query audit logs |
/aws/rds/cluster/ragstack-beta-.../postgresql |
Aurora pgvector logs |
# Tail ECS logs
aws logs tail graphrag-beta-ecs-logs --follow --region eu-north-1
# Search for errors
aws logs filter-log-events \
--log-group-name graphrag-beta-ecs-logs \
--filter-pattern "ERROR" \
--region eu-north-1 \
--query "events[].message" --output text
Common Issues¶
| Symptom | Cause | Fix |
|---|---|---|
Neptune ReadTimeoutError |
Min NCU too low for write load | Increase min NCU to 2.5+ |
Bedrock ThrottlingException |
Too many concurrent LLM calls | Reduce KG_MAX_LLM_CONCURRENCY (default 15) |
Bedrock ModelStreamErrorException |
Transient model error | Automatic retry handles this (3 attempts) |
| SQS messages in DLQ | Ingestion worker crash or Neptune timeout | Check DLQ messages, fix, then redrive |
AccessDeniedException on SQS |
Task role missing policy | Add SQS permissions to task role |
| Aurora connection refused | Security group / VPC mismatch | Verify ECS task SG can reach Aurora SG on port 5432 |
Cost Tracking¶
# Bedrock usage (last 24h)
aws bedrock list-model-invocation-jobs \
--region eu-north-1 \
--query "invocationJobSummaries[?status=='Completed'].{Tokens:outputDataConfig}" \
--output table
# Neptune NCU hours (CloudWatch)
aws cloudwatch get-metric-statistics \
--namespace AWS/Neptune \
--metric-name ServerlessDatabaseCapacity \
--dimensions Name=DBClusterIdentifier,Value=neptunedbcluster-rssnn6cwagnp \
--start-time $(date -u -v-24H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 --statistics Average \
--region eu-north-1
Schedule: Nightly Off-peak Runs¶
Ingestion is scheduled to run nightly via EventBridge (22:00 UTC, Mon–Fri) for cheapest Bedrock pricing.
EventBridge Rule¶
Rule: graphrag-beta-nightly-ingest
Schedule: cron(0 22 ? * MON-FRI *)
Target: ECS RunTask (batch mode)
What runs nightly¶
- Fetch new papers from last 24h (date-filtered PubMed queries)
- Extract entities via Bedrock batch
- Write to Neptune + Aurora
- Send SNS notification on completion/failure
Manual trigger¶
# Trigger the scheduled rule manually
aws events put-events \
--entries '[{"Source":"manual","DetailType":"ScheduledIngest","Detail":"{}"}]' \
--region eu-north-1