Architecture¶

System Overview¶

Olink RAG is organized into four backend layers, with a React frontend maintained in a separate repository on top:

┌─────────────────────────────────────────────────────────┐
│  Frontend (React + Sigma.js)  — separate repo           │
├─────────────────────────────────────────────────────────┤
│  API Layer (FastAPI + Granian)                          │
│  ├── ingestion/   query/   workspace/   platform/       │
│  └── shared/ (middleware, security, rate limiting, SSE)  │
├─────────────────────────────────────────────────────────┤
│  Core Services & Agents                                 │
│  ├── agents/ (TwoPhase, Dynamic, Neo4j, LangGraph)      │
│  ├── services/ (query, ingestion, workspace, platform)   │
│  ├── factories/ (database, LLM, embedding, agent)        │
│  └── kg_tools/ (registry, CLI, MCP, LangChain adapter)   │
├─────────────────────────────────────────────────────────┤
│  Pipeline Layer                                         │
│  ├── ingest/ (fetchers, chunkers, extractors, KG pipe)   │
│  ├── enrichment/ (column analysis, node annotation)      │
│  └── processors/ (41 modules: consolidation, enrichment, │
│                    science skills integrators)            │
├─────────────────────────────────────────────────────────┤
│  Data Layer                                             │
│  ├── Neo4j (knowledge graph, vector indexes)             │
│  └── Redis (sessions, query cache, job state)            │
└─────────────────────────────────────────────────────────┘

Key Architectural Patterns¶

Factory Pattern: src/factories/ creates database, LLM, embedding, and agent instances based on configuration. It supports Ollama, Bedrock, SageMaker, and OpenAI backends.
Domain-Based Routing: The API is organized by business domain (ingestion, query, workspace, platform), each with its own router and routes subdirectory.
Two-Phase Query Architecture: Phase 1 executes all graph tools deterministically, with no LLM involved. Phase 2 makes a single LLM call to synthesize the results. Every tool is therefore guaranteed to run, and the data handed to synthesis comes from the graph rather than from model output.
Cascading Entity Consolidation: UniProt ID matching → synonym/gene symbol matching → fuzzy name matching. Each stage catches what the previous one missed.
Relationship Consolidation: Merges duplicate edges between the same entity pair while preserving all evidence (PMIDs, confidence scores, extraction methods, temporal tracking).
Registry Pattern for Tools: A single ToolRegistry is the source of truth for all 16 KG tools. The CLI, MCP, and LangChain interfaces derive from it automatically.
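
The cascading entity consolidation listed above can be pictured as a chain of matchers in which each stage only runs when the previous one found nothing. The following is a minimal sketch under stated assumptions, not the project's implementation: the `Entity` fields, the function names, and the 0.9 similarity threshold are hypothetical, and `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the pipeline actually uses.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Entity:
    """Hypothetical entity record; field names are illustrative."""
    name: str
    uniprot_id: Optional[str] = None
    synonyms: set = field(default_factory=set)


def match_uniprot(entity, canonical):
    # Stage 1: exact UniProt accession match.
    if entity.uniprot_id is None:
        return None
    for cand in canonical:
        if cand.uniprot_id == entity.uniprot_id:
            return cand
    return None


def match_synonym(entity, canonical):
    # Stage 2: case-insensitive overlap of names, synonyms, and gene symbols.
    labels = {entity.name.lower()} | {s.lower() for s in entity.synonyms}
    for cand in canonical:
        cand_labels = {cand.name.lower()} | {s.lower() for s in cand.synonyms}
        if labels & cand_labels:
            return cand
    return None


def match_fuzzy(entity, canonical, threshold=0.9):
    # Stage 3: fuzzy name similarity; the threshold is an assumed value.
    best, best_score = None, threshold
    for cand in canonical:
        score = SequenceMatcher(None, entity.name.lower(), cand.name.lower()).ratio()
        if score >= best_score:
            best, best_score = cand, score
    return best


def consolidate(entity, canonical):
    """Return (canonical entity, stage name), or (None, None) if unmatched."""
    stages = (("uniprot", match_uniprot), ("synonym", match_synonym), ("fuzzy", match_fuzzy))
    for stage, matcher in stages:
        hit = matcher(entity, canonical)
        if hit is not None:
            return hit, stage
    return None, None
```

Returning the stage name alongside the match makes it easy to audit how many merges each stage contributes, which is useful when tuning the fuzzy threshold.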

Architecture Diagrams¶

Full System¶

flowchart TB
    PUBMED[("PubMed<br/>(Entrez API)")]
    CSV[("CSVs<br/>(Ontologies, Proteins)")]

    subgraph INGEST ["INGESTION PIPELINE"]
        direction TB

        subgraph FETCH ["Data Fetching"]
            ING["ingestor.py<br/>fetch_pubmed_abstracts()"]
            SCRAPER["pubmed_mass_scraper.py"]
        end

        subgraph KG_BUILD ["KG Construction"]
            KGP["KGPipeline"]
            subgraph STATE ["StateGraph"]
                direction LR
                EXT["extract_kg<br/>LLM → JSON"]
                FILT["filter_ontology"]
                LINK["link_protein_entities<br/>UniProt matching"]
                EXT --> FILT --> LINK
            end
            KGP --> STATE
        end

        subgraph ENRICH ["Enrichment & Resolution"]
            ER["entity_resolver.py<br/>MONDO mapping"]
            IC["incremental_consolidation.py"]
            RC["relationship_counter.py"]
        end

        subgraph EMBED ["Embedding Pipeline"]
            EP["embedding_pipeline.py"]
            CHUNK["chunker.py"]
        end
    end

    subgraph STORAGE ["STORAGE"]
        direction LR
        NEO4J[("Neo4j")]
        VECTOR[("Vector Index")]
    end

    subgraph FACTORIES ["FACTORIES"]
        direction TB
        LLM_F["llm_factory → Ollama/Bedrock/SageMaker/OpenAI"]
        EMB_F["embedding_factory → SentenceTransformers"]
        DB_F["database_factory → Neo4j"]
        QA_F["query_agent_factory"]
    end

    subgraph QUERY ["QUERY LAYER"]
        direction TB
        subgraph AGENTS ["Agents"]
            DQA["DynamicQueryAgent"]
            NQA["Neo4jQueryAgent"]
            TPA["TwoPhaseAgent"]
        end
        subgraph RETRIEVAL ["Retrieval"]
            VR["VectorRetriever"]
            HR["HybridRetriever"]
            T2C["Text2CypherRetriever"]
        end
        subgraph PROCESSING ["Processing"]
            QP["QueryProcessor"]
            ED["EntityDiscovery"]
            RP["ResultProcessor<br/>CrossEncoder"]
            RF["REFRAG compression"]
        end
    end

    subgraph API ["API LAYER"]
        FAST["FastAPI + Granian"]
        QS["QueryService"]
        FAST --> QS
    end

    USER(("User"))
    REACT["React Frontend"]

    PUBMED --> ING
    CSV --> KGP
    ING --> KGP
    LINK --> ER & IC & RC
    KGP --> EP --> CHUNK
    RC --> NEO4J
    EP --> VECTOR --> NEO4J

    USER --> REACT --> FAST
    QS --> QA_F --> AGENTS
    AGENTS --> RETRIEVAL --> NEO4J
    RETRIEVAL --> RP --> RF --> LLM_F

    LLM_F -.-> EXT & AGENTS
    EMB_F -.-> EP
    DB_F -.-> KGP & AGENTS

API + Query Agent Flow¶

flowchart TB
    subgraph API["API Layer"]
        EP_SESSION["POST /v1/sessions"]
        EP_QUERY["POST /v1/sessions/{id}/query"]
    end

    subgraph SM["Session Manager"]
        SM_CREATE["create_session()"]
        SM_EXEC["execute_query()"]
    end

    subgraph QS["Query Service"]
        QS_CREATE["create_agent()"]
        QS_EXEC["execute_query()"]
        QS_REFRAG["REFRAG compression"]
    end

    subgraph AGENTS["Agent Layer"]
        TPA["TwoPhaseAgent<br/>(default)"]
        NQA["Neo4jQueryAgent"]
        DQA["DynamicQueryAgent"]
    end

    subgraph TPA_INTERNAL["TwoPhaseAgent Pipeline"]
        direction LR
        TP_DISC["Phase 1: Discovery<br/>(all tools, no LLM)"] --> TP_EXP["Phase 2: Expansion<br/>(neighbor traversal)"] --> TP_SYNTH["Synthesis<br/>(single LLM call)"]
    end

    subgraph DYN_INTERNAL["DynamicQueryAgent Pipeline"]
        direction LR
        QP["QueryProcessor"] --> ED["EntityDiscovery"] --> SS["6 Search Strategies"] --> RP["CrossEncoder rerank"]
    end

    subgraph RETRIEVERS["Neo4j Retrievers"]
        direction LR
        VR["VectorRetriever"]
        HR["HybridRetriever"]
        T2CR["Text2CypherRetriever"]
        KNN["APOC KNN"]
    end

    NEO4J[("Neo4j")]

    EP_SESSION --> SM_CREATE --> QS_CREATE
    EP_QUERY --> SM_EXEC --> QS_EXEC --> QS_REFRAG

    QS_CREATE -->|"default"| TPA
    QS_CREATE -->|"Standard"| NQA
    QS_CREATE -->|"Dynamic"| DQA

    TPA --> TPA_INTERNAL
    TPA_INTERNAL --> NEO4J
    NQA -.->|extends| DQA
    DQA --> DYN_INTERNAL
    NQA --> RETRIEVERS
    RETRIEVERS --> NEO4J
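
The TwoPhaseAgent pipeline in the diagram (discovery with all tools and no LLM, neighbor expansion, then a single synthesis call) can be sketched roughly as below. This is an illustration only: `GRAPH`, `exact_lookup`, `run_two_phase`, and the `llm` callable are hypothetical stand-ins, and a real agent would query Neo4j instead of an in-memory dict.

```python
# Hypothetical in-memory stand-in for the knowledge graph: node -> neighbors.
GRAPH = {
    "IL6": ["STAT3", "JAK1"],
    "STAT3": ["IL6"],
    "JAK1": ["IL6"],
}


def exact_lookup(query):
    # Example tool: return nodes whose name appears in the query text.
    return [node for node in GRAPH if node.lower() in query.lower()]


def run_two_phase(query, tools, llm):
    # Discovery: run every tool unconditionally; no LLM decides what to call.
    discovered = {name: tool(query) for name, tool in tools.items()}

    # Expansion: traverse one hop out from every discovered node.
    seeds = sorted({node for hits in discovered.values() for node in hits})
    expanded = {seed: GRAPH.get(seed, []) for seed in seeds}

    # Synthesis: exactly one LLM call, given only graph-derived context.
    context = "\n".join(f"{seed} -> {', '.join(nbrs)}" for seed, nbrs in expanded.items())
    prompt = f"Question: {query}\nGraph facts:\n{context}"
    return {"discovered": discovered, "expanded": expanded, "answer": llm(prompt)}
```

Because the LLM is passed in as a plain callable, the deterministic part of the pipeline can be tested with a stub that records the prompt it receives.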

Directory Structure¶

API Layer (`api/`)¶

api/
├── app.py                # FastAPI app, lifespan, middleware, router mounting
├── ingestion/            # /v1/ingestion/* — KG assembly
│   ├── router.py
│   └── routes/
├── query/                # /v1/sessions/*, /v1/evidence, /v1/communities/*
│   ├── router.py
│   └── routes/
├── workspace/            # /v1/cells/*, /v1/my-files/*, /v1/auto-discovery/*
│   ├── router.py
│   └── routes/
├── platform/             # /health, /v1/metrics/*, /v1/dashboard/*
│   ├── router.py
│   └── routes/
└── shared/               # Cross-cutting concerns
    ├── models/           # Pydantic request/response models
    ├── middleware.py      # CORS, logging, error handling
    ├── rate_limit.py      # slowapi rate limiting
    ├── security.py        # Cypher injection prevention, PII detection
    ├── sse_streamer.py    # Server-sent events for streaming
    └── validation.py      # Request/response validation

Core Library (`src/`)¶

src/
├── agents/               # Query agents (TwoPhase, Dynamic, Neo4j, LangGraph)
├── services/             # Business logic by domain
│   ├── ingestion/        #   Job management
│   ├── query/            #   Query service, sessions, memory, cache
│   ├── workspace/        #   Cells, files, auto-discovery
│   ├── platform/         #   Audit, cost, telemetry, feedback, Glicko
│   └── sdcg/             #   Semantic Dynamic Context Graph
├── factories/            # Factory pattern for DI
├── models/               # Pydantic/dataclass models
├── core/                 # Database interfaces (Neo4j)
├── cache/                # Redis caching
├── security/             # Cypher builder, guardrails, sanitizer
├── kg_tools/             # Tool registry, CLI, MCP adapter
├── mcp_server/           # MCP server for biomarker/pathway tools
├── ml/                   # ML scoring, evidence weighting
├── evaluation/           # Eval framework (scorers, adapters, runners)
├── tools/                # Graph tools, migration tools
└── utils/                # Logging, retry, token counting, mappers
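
The registry pattern behind `kg_tools/` (one ToolRegistry from which the CLI, MCP, and LangChain interfaces are derived) might look roughly like the sketch below. Everything here is assumed for illustration: `ToolSpec`, the `register` decorator, `to_cli`, `to_schemas`, and the `get_neighbors` tool are hypothetical names, and the schema dicts are a generic shape, not the actual MCP or LangChain format.

```python
import argparse
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    params: dict  # parameter name -> Python type
    func: Callable[..., Any]


class ToolRegistry:
    """Hypothetical registry: tools are declared once, adapters are derived."""

    def __init__(self):
        self._tools = {}

    def register(self, description, **params):
        def decorator(func):
            if func.__name__ in self._tools:
                raise ValueError(f"duplicate tool: {func.__name__}")
            self._tools[func.__name__] = ToolSpec(func.__name__, description, params, func)
            return func
        return decorator

    def call(self, name, **kwargs):
        return self._tools[name].func(**kwargs)

    def to_cli(self):
        # Derived interface 1: one subcommand per registered tool.
        parser = argparse.ArgumentParser(prog="kg")
        sub = parser.add_subparsers(dest="tool", required=True)
        for spec in self._tools.values():
            cmd = sub.add_parser(spec.name, help=spec.description)
            for pname, ptype in spec.params.items():
                cmd.add_argument(f"--{pname}", type=ptype, required=True)
        return parser

    def to_schemas(self):
        # Derived interface 2: plain dicts that another adapter could consume.
        return [
            {"name": s.name, "description": s.description,
             "parameters": {p: t.__name__ for p, t in s.params.items()}}
            for s in self._tools.values()
        ]


registry = ToolRegistry()


@registry.register("List neighbors of a node", node=str, limit=int)
def get_neighbors(node, limit):
    # Placeholder body; a real tool would query Neo4j.
    return [f"{node}-neighbor-{i}" for i in range(limit)]
```

Since every interface is generated from the same `ToolSpec` entries, adding a tool in one place updates all of them, which is the property the registry pattern is meant to provide.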

Pipeline Layer (`pipeline/`)¶

pipeline/
├── ingest/               # 25 modules: fetchers, chunkers, extractors, KG pipeline
├── enrichment/           # CSV/Parquet enrichment: column analysis, annotation
└── processors/           # 41 modules: entity resolution, community detection,
                          #   evidence scoring, external integrations, PDF handling,
                          #   science skills integrators (STRING, HPA, Reactome,
                          #   ChEMBL, ClinVar, OpenAlex)
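
Relationship consolidation, as described under Key Architectural Patterns, merges duplicate edges while keeping every piece of evidence. A minimal sketch follows, assuming edges arrive as dicts with the hypothetical keys shown and that duplicates are identified by `(source, type, target)`; the real grouping key and storage format may differ.

```python
from collections import defaultdict


def consolidate_relationships(edges):
    """Merge edges sharing (source, type, target); keep all evidence."""
    grouped = defaultdict(list)
    for edge in edges:
        grouped[(edge["source"], edge["type"], edge["target"])].append(edge)

    merged = []
    for (source, rel_type, target), group in grouped.items():
        merged.append({
            "source": source,
            "type": rel_type,
            "target": target,
            # Evidence is accumulated, never overwritten.
            "pmids": sorted({e["pmid"] for e in group}),
            "confidences": [e["confidence"] for e in group],
            "methods": sorted({e["method"] for e in group}),
            # ISO dates compare correctly as strings.
            "first_seen": min(e["seen"] for e in group),
            "last_seen": max(e["seen"] for e in group),
            "evidence_count": len(group),
        })
    return merged
```

Keeping the individual confidence scores instead of averaging them leaves the choice of aggregation to downstream evidence scoring.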

Data Flow¶

Ingestion Flow¶

Data Sources (PubMed, bioRxiv, PMC, PDF, CSV, External APIs)
    ↓
Fetchers (pubmed, biorxiv, pmc, pdf)
    ↓
Token-based Chunking (512 tokens, 64-token overlap, split at sentence boundaries)
    ↓
LLM Entity/Relationship Extraction (with optional gleaning)
    ↓
Neo4j Storage (nodes + relationships)
    ↓
Entity Consolidation (UniProt → synonyms → fuzzy)
    ↓
Relationship Consolidation (merge duplicates, preserve evidence)
    ↓
Vector Embeddings (sentence-transformers → Neo4j vector indexes)
    ↓
External API Enrichment (STRING, HPA, Reactome, ChEMBL, ClinVar, OpenAlex)
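
The chunking step in this flow (512 tokens, 64 overlap, sentence boundaries) can be approximated as shown below. This is a sketch under assumptions: whitespace-separated words stand in for real tokenizer tokens, a simple regex stands in for a proper sentence splitter, and `chunk_text` is a hypothetical name.

```python
import re


def count_tokens(text):
    # Assumption: whitespace tokens stand in for the real tokenizer.
    return len(text.split())


def chunk_text(text, max_tokens=512, overlap=64):
    """Pack whole sentences into chunks, carrying trailing sentences forward."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry over trailing sentences worth at most `overlap` tokens.
            carried, carried_len = [], 0
            for prev in reversed(current):
                m = count_tokens(prev)
                if carried_len + m > overlap:
                    break
                carried.insert(0, prev)
                carried_len += m
            # Drop the carry if it would push the next chunk over the limit.
            if carried_len + n > max_tokens:
                carried, carried_len = [], 0
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_tokens` is kept whole in this sketch, so a chunk can exceed the limit in that one case; a production chunker would need a fallback split for such sentences.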

Query Flow¶

User Query
    ↓
Security Guardrails (PII detection, injection check)
    ↓
Session Manager (restore history, create agent)
    ↓
Query Mode Router (auto-classify: local/global/hybrid/naive)
    ↓
Agent Execution (TwoPhase: deterministic tools → LLM synthesis)
    ↓
REFRAG Compression (optional context compression)
    ↓
SSE Streaming Response (token-by-token)
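
The final step streams the answer as server-sent events. On the wire, each event is a set of `field: value` lines ended by a blank line. The sketch below shows that framing only; `sse_event`, `stream_tokens`, the `token` and `done` event names, and the JSON payload shape are assumptions, not the contents of `sse_streamer.py`.

```python
import json


def sse_event(data, event=None):
    """Format one server-sent event frame."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" field.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens):
    """Yield one frame per token, then a final completion frame."""
    for index, token in enumerate(tokens):
        yield sse_event(json.dumps({"index": index, "token": token}), event="token")
    yield sse_event("[DONE]", event="done")
```

A generator like `stream_tokens` is the kind of object a streaming HTTP response can iterate over, with the response served using the `text/event-stream` content type.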

LLM Service Configuration¶

| Service | Backend | Use Case |
|---|---|---|
| `local` | Ollama | Development (default) |
| `bedrock` | AWS Bedrock | Production |
| `sagemaker-llama3` | AWS SageMaker | Production (custom endpoints) |
| `openai` | OpenAI API | Alternative |
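
Selecting a backend from these service names is a straightforward use of the factory pattern. The sketch below is illustrative only: `LLMClient`, `create_llm`, and the `LLM_SERVICE` environment variable are hypothetical names, and the client is a stand-in that records which backend was chosen without calling any SDK.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMClient:
    """Hypothetical stand-in; a real client would wrap the backend's SDK."""
    service: str
    backend: str


# Service names and backends as listed in the table above.
_BACKENDS = {
    "local": "Ollama",
    "bedrock": "AWS Bedrock",
    "sagemaker-llama3": "AWS SageMaker",
    "openai": "OpenAI API",
}


def create_llm(service=None):
    # LLM_SERVICE is an assumed variable name; "local" is the documented default.
    name = service or os.environ.get("LLM_SERVICE", "local")
    if name not in _BACKENDS:
        raise ValueError(
            f"unknown LLM service {name!r}; expected one of {sorted(_BACKENDS)}"
        )
    return LLMClient(service=name, backend=_BACKENDS[name])
```

Failing fast on an unknown service name surfaces configuration mistakes at startup instead of at the first query.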

Related Pages¶

Query Agents: detailed agent architecture
Ingestion Pipeline: pipeline components
Infrastructure: deployment architecture
Advanced Features (SDCG): future architecture vision
