Architecture

System Overview

Olink RAG is organized into four main backend layers, topped by a React frontend that lives in a separate repository:

┌─────────────────────────────────────────────────────────┐
│  Frontend (React + Sigma.js)  — separate repo           │
├─────────────────────────────────────────────────────────┤
│  API Layer (FastAPI + Granian)                          │
│  ├── ingestion/   query/   workspace/   platform/       │
│  └── shared/ (middleware, security, rate limiting, SSE)  │
├─────────────────────────────────────────────────────────┤
│  Core Services & Agents                                 │
│  ├── agents/ (TwoPhase, Dynamic, Neo4j, LangGraph)      │
│  ├── services/ (query, ingestion, workspace, platform)   │
│  ├── factories/ (database, LLM, embedding, agent)        │
│  └── kg_tools/ (registry, CLI, MCP, LangChain adapter)   │
├─────────────────────────────────────────────────────────┤
│  Pipeline Layer                                         │
│  ├── ingest/ (fetchers, chunkers, extractors, KG pipe)   │
│  ├── enrichment/ (column analysis, node annotation)      │
│  └── processors/ (41 modules: consolidation, enrichment, │
│                    science skills integrators)            │
├─────────────────────────────────────────────────────────┤
│  Data Layer                                             │
│  ├── Neo4j (knowledge graph, vector indexes)             │
│  └── Redis (sessions, query cache, job state)            │
└─────────────────────────────────────────────────────────┘

Key Architectural Patterns

  1. Factory Pattern — src/factories/ creates database, LLM, embedding, and agent instances based on configuration. Supports Ollama, Bedrock, SageMaker, and OpenAI backends.

  2. Domain-Based Routing — API organized by business domain (ingestion, query, workspace, platform), each with its own router and routes subdirectory.

  3. Two-Phase Query Architecture — Phase 1: deterministic execution of all graph tools (no LLM). Phase 2: single LLM call to synthesize results. Guarantees tool execution with zero hallucinated data.

  4. Cascading Entity Consolidation — UniProt ID matching → synonym/gene symbol matching → fuzzy name matching. Each stage catches what the previous missed.

  5. Relationship Consolidation — Merges duplicate edges between the same entity pair while preserving all evidence (PMIDs, confidence scores, extraction methods, temporal tracking).

  6. Registry Pattern for Tools — Single ToolRegistry is the source of truth for all 16 KG tools. CLI, MCP, and LangChain interfaces derive from it automatically.
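The cascade in pattern 4 can be sketched as a chain of matchers in which each stage only handles what the earlier stages left unresolved. This is a minimal illustration, not the project's implementation: the `Entity` dataclass, the `resolve()` function, the use of `difflib` for fuzzy matching, and the 0.85 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Entity:
    """Hypothetical canonical entity record (not the project's model)."""
    name: str
    uniprot_id: Optional[str] = None
    synonyms: set[str] = field(default_factory=set)


def resolve(candidate: Entity, canonical: list[Entity],
            fuzzy_threshold: float = 0.85) -> tuple[Optional[Entity], str]:
    """Return (match, stage) from the first stage that succeeds."""
    # Stage 1: exact UniProt ID match.
    if candidate.uniprot_id:
        for ent in canonical:
            if ent.uniprot_id == candidate.uniprot_id:
                return ent, "uniprot"

    # Stage 2: synonym / gene symbol match (case-insensitive).
    key = candidate.name.lower()
    for ent in canonical:
        if key == ent.name.lower() or key in {s.lower() for s in ent.synonyms}:
            return ent, "synonym"

    # Stage 3: fuzzy name match; keep the best score at or above the threshold.
    best, best_score = None, fuzzy_threshold
    for ent in canonical:
        score = SequenceMatcher(None, key, ent.name.lower()).ratio()
        if score >= best_score:
            best, best_score = ent, score
    return (best, "fuzzy") if best else (None, "unresolved")


canonical = [
    Entity("Interleukin-6", uniprot_id="P05231", synonyms={"IL6", "IL-6"}),
    Entity("Tumor necrosis factor", uniprot_id="P01375", synonyms={"TNF"}),
]

print(resolve(Entity("il-6"), canonical)[1])           # resolved at the synonym stage
print(resolve(Entity("Interleukin 6"), canonical)[1])  # falls through to the fuzzy stage
```

Returning the stage name alongside the match keeps a record of how each entity was resolved, which is useful when auditing fuzzy merges.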

Architecture Diagrams

Full System

flowchart TB
    PUBMED[("PubMed<br/>(Entrez API)")]
    CSV[("CSVs<br/>(Ontologies, Proteins)")]

    subgraph INGEST ["INGESTION PIPELINE"]
        direction TB

        subgraph FETCH ["Data Fetching"]
            ING["ingestor.py<br/>fetch_pubmed_abstracts()"]
            SCRAPER["pubmed_mass_scraper.py"]
        end

        subgraph KG_BUILD ["KG Construction"]
            KGP["KGPipeline"]
            subgraph STATE ["StateGraph"]
                direction LR
                EXT["extract_kg<br/>LLM → JSON"]
                FILT["filter_ontology"]
                LINK["link_protein_entities<br/>UniProt matching"]
                EXT --> FILT --> LINK
            end
            KGP --> STATE
        end

        subgraph ENRICH ["Enrichment & Resolution"]
            ER["entity_resolver.py<br/>MONDO mapping"]
            IC["incremental_consolidation.py"]
            RC["relationship_counter.py"]
        end

        subgraph EMBED ["Embedding Pipeline"]
            EP["embedding_pipeline.py"]
            CHUNK["chunker.py"]
        end
    end

    subgraph STORAGE ["STORAGE"]
        direction LR
        NEO4J[("Neo4j")]
        VECTOR[("Vector Index")]
    end

    subgraph FACTORIES ["FACTORIES"]
        direction TB
        LLM_F["llm_factory → Ollama/Bedrock/SageMaker"]
        EMB_F["embedding_factory → SentenceTransformers"]
        DB_F["database_factory → Neo4j"]
        QA_F["query_agent_factory"]
    end

    subgraph QUERY ["QUERY LAYER"]
        direction TB
        subgraph AGENTS ["Agents"]
            DQA["DynamicQueryAgent"]
            NQA["Neo4jQueryAgent"]
            TPA["TwoPhaseAgent"]
        end
        subgraph RETRIEVAL ["Retrieval"]
            VR["VectorRetriever"]
            HR["HybridRetriever"]
            T2C["Text2CypherRetriever"]
        end
        subgraph PROCESSING ["Processing"]
            QP["QueryProcessor"]
            ED["EntityDiscovery"]
            RP["ResultProcessor<br/>CrossEncoder"]
            RF["REFRAG compression"]
        end
    end

    subgraph API ["API LAYER"]
        FAST["FastAPI + Granian"]
        QS["QueryService"]
        FAST --> QS
    end

    USER(("User"))
    REACT["React Frontend"]

    PUBMED --> ING
    CSV --> KGP
    ING --> KGP
    LINK --> ER & IC & RC
    KGP --> EP --> CHUNK
    RC --> NEO4J
    EP --> VECTOR --> NEO4J

    USER --> REACT --> FAST
    QS --> QA_F --> AGENTS
    AGENTS --> RETRIEVAL --> NEO4J
    RETRIEVAL --> RP --> RF --> LLM_F

    LLM_F -.-> EXT & AGENTS
    EMB_F -.-> EP
    DB_F -.-> KGP & AGENTS

API + Query Agent Flow

flowchart TB
    subgraph API["API Layer"]
        EP_SESSION["POST /v1/sessions"]
        EP_QUERY["POST /v1/sessions/{id}/query"]
    end

    subgraph SM["Session Manager"]
        SM_CREATE["create_session()"]
        SM_EXEC["execute_query()"]
    end

    subgraph QS["Query Service"]
        QS_CREATE["create_agent()"]
        QS_EXEC["execute_query()"]
        QS_REFRAG["REFRAG compression"]
    end

    subgraph AGENTS["Agent Layer"]
        TPA["TwoPhaseAgent<br/>(default)"]
        NQA["Neo4jQueryAgent"]
        DQA["DynamicQueryAgent"]
    end

    subgraph TPA_INTERNAL["TwoPhaseAgent Pipeline"]
        direction LR
        TP_DISC["Phase 1: Discovery<br/>(all tools, no LLM)"] --> TP_EXP["Phase 2: Expansion<br/>(neighbor traversal)"] --> TP_SYNTH["Synthesis<br/>(single LLM call)"]
    end

    subgraph DYN_INTERNAL["DynamicQueryAgent Pipeline"]
        direction LR
        QP["QueryProcessor"] --> ED["EntityDiscovery"] --> SS["6 Search Strategies"] --> RP["CrossEncoder rerank"]
    end

    subgraph RETRIEVERS["Neo4j Retrievers"]
        direction LR
        VR["VectorRetriever"]
        HR["HybridRetriever"]
        T2CR["Text2CypherRetriever"]
        KNN["APOC KNN"]
    end

    NEO4J[("Neo4j")]

    EP_SESSION --> SM_CREATE --> QS_CREATE
    EP_QUERY --> SM_EXEC --> QS_EXEC --> QS_REFRAG

    QS_CREATE -->|"default"| TPA
    QS_CREATE -->|"Standard"| NQA
    QS_CREATE -->|"Dynamic"| DQA

    TPA --> TPA_INTERNAL
    TPA_INTERNAL --> NEO4J
    NQA -.->|extends| DQA
    DQA --> DYN_INTERNAL
    NQA --> RETRIEVERS
    RETRIEVERS --> NEO4J
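
To make the TwoPhaseAgent flow concrete, the sketch below runs every registered tool deterministically and then makes one synthesis call over the collected results. It is illustrative only: the tool functions, `two_phase_query`, and the `synthesize` callable (a stand-in for the LLM) are hypothetical, and the expansion step shown in the diagram is omitted for brevity.

```python
from typing import Any, Callable


# Hypothetical graph tools; real ones would query Neo4j.
def find_proteins(query: str) -> list[str]:
    return ["IL6"] if "il6" in query.lower() else []


def find_diseases(query: str) -> list[str]:
    return ["sepsis"] if "sepsis" in query.lower() else []


TOOLS: dict[str, Callable[[str], Any]] = {
    "find_proteins": find_proteins,
    "find_diseases": find_diseases,
}


def two_phase_query(query: str, synthesize: Callable[[str], str]) -> dict[str, Any]:
    # Phase 1: run every tool unconditionally; no LLM decides what to call.
    results = {name: tool(query) for name, tool in TOOLS.items()}

    # Phase 2: a single synthesis call that sees only the tool output.
    prompt = (
        f"Question: {query}\n"
        f"Tool results: {results}\n"
        "Answer using these results only."
    )
    return {"tool_results": results, "answer": synthesize(prompt)}


# Stand-in for the LLM so the sketch runs offline; it records each call.
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "stub answer"


response = two_phase_query("Is IL6 elevated in sepsis?", fake_llm)
```

Because tool execution never depends on model output, the tool results can be returned and audited separately from the synthesized answer.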

Directory Structure

API Layer (api/)

api/
├── app.py                # FastAPI app, lifespan, middleware, router mounting
├── ingestion/            # /v1/ingestion/* — KG assembly
│   ├── router.py
│   └── routes/
├── query/                # /v1/sessions/*, /v1/evidence, /v1/communities/*
│   ├── router.py
│   └── routes/
├── workspace/            # /v1/cells/*, /v1/my-files/*, /v1/auto-discovery/*
│   ├── router.py
│   └── routes/
├── platform/             # /health, /v1/metrics/*, /v1/dashboard/*
│   ├── router.py
│   └── routes/
└── shared/               # Cross-cutting concerns
    ├── models/           # Pydantic request/response models
    ├── middleware.py      # CORS, logging, error handling
    ├── rate_limit.py      # slowapi rate limiting
    ├── security.py        # Cypher injection prevention, PII detection
    ├── sse_streamer.py    # Server-sent events for streaming
    └── validation.py      # Request/response validation

Core Library (src/)

src/
├── agents/               # Query agents (TwoPhase, Dynamic, Neo4j, LangGraph)
├── services/             # Business logic by domain
│   ├── ingestion/        #   Job management
│   ├── query/            #   Query service, sessions, memory, cache
│   ├── workspace/        #   Cells, files, auto-discovery
│   ├── platform/         #   Audit, cost, telemetry, feedback, Glicko
│   └── sdcg/             #   Semantic Dynamic Context Graph
├── factories/            # Factory pattern for DI
├── models/               # Pydantic/dataclass models
├── core/                 # Database interfaces (Neo4j)
├── cache/                # Redis caching
├── security/             # Cypher builder, guardrails, sanitizer
├── kg_tools/             # Tool registry, CLI, MCP adapter
├── mcp_server/           # MCP server for biomarker/pathway tools
├── ml/                   # ML scoring, evidence weighting
├── evaluation/           # Eval framework (scorers, adapters, runners)
├── tools/                # Graph tools, migration tools
└── utils/                # Logging, retry, token counting, mappers
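
As an illustration of the registry idea behind kg_tools/, the sketch below registers a tool once and derives two interface views from the same entry. `ToolRegistry` is named in this document, but the methods shown here (`register`, `call`, `cli_help`, `as_schemas`) and the example tool are assumptions, not the project's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    func: Callable[..., object]


class ToolRegistry:
    """Hypothetical single source of truth for tool definitions."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, name: str, description: str):
        # Decorator that records the function under a unique name.
        def decorator(func):
            if name in self._tools:
                raise ValueError(f"duplicate tool: {name}")
            self._tools[name] = ToolSpec(name, description, func)
            return func
        return decorator

    def call(self, name: str, **kwargs):
        return self._tools[name].func(**kwargs)

    def cli_help(self) -> str:
        # One derived view: help text for a command-line interface.
        return "\n".join(f"{t.name}: {t.description}" for t in self._tools.values())

    def as_schemas(self) -> list[dict[str, str]]:
        # Another derived view: name/description records for an agent framework.
        return [{"name": t.name, "description": t.description}
                for t in self._tools.values()]


registry = ToolRegistry()


@registry.register("neighbors", "List nodes adjacent to an entity")
def neighbors(entity: str) -> list[str]:
    # Placeholder data instead of a graph query.
    return {"IL6": ["IL6R", "STAT3"]}.get(entity, [])
```

Adding a tool then means one `register` call; every derived interface picks it up without further wiring.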

Pipeline Layer (pipeline/)

pipeline/
├── ingest/               # 25 modules: fetchers, chunkers, extractors, KG pipeline
├── enrichment/           # CSV/Parquet enrichment: column analysis, annotation
└── processors/           # 41 modules: entity resolution, community detection,
                          #   evidence scoring, external integrations, PDF handling,
                          #   science skills integrators (STRING, HPA, Reactome,
                          #   ChEMBL, ClinVar, OpenAlex)
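
The chunkers in ingest/ are described in the ingestion flow as token-based, with 512-token chunks and a 64-token overlap. A minimal sliding-window sketch of that idea follows. It assumes pre-split tokens rather than a real tokenizer and ignores sentence boundaries, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # window sizes for a 1000-token input
```

Each window starts `size - overlap` tokens after the previous one, so the tail of one chunk is repeated at the head of the next and context spanning a boundary is not lost.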

Data Flow

Ingestion Flow

  1. Data Sources (PubMed, bioRxiv, PMC, PDF, CSV, External APIs)
  2. Fetchers (pubmed, biorxiv, pmc, pdf)
  3. Token-based Chunking (512 tokens, 64 overlap, sentence boundaries)
  4. LLM Entity/Relationship Extraction (with optional gleaning)
  5. Neo4j Storage (nodes + relationships)
  6. Entity Consolidation (UniProt → synonyms → fuzzy)
  7. Relationship Consolidation (merge duplicates, preserve evidence)
  8. Vector Embeddings (sentence-transformers → Neo4j vector indexes)
  9. External API Enrichment (STRING, HPA, Reactome, ChEMBL, ClinVar, OpenAlex)
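
The relationship consolidation step can be pictured as grouping edges by (source, relation, target) and combining their evidence instead of overwriting it. The sketch is illustrative only: the `Edge` fields, the `consolidate` function, and the placeholder PMIDs are assumptions, and temporal tracking is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical extracted edge carrying one piece of evidence."""
    source: str
    relation: str
    target: str
    pmid: str
    confidence: float
    method: str


def consolidate(edges: list[Edge]) -> list[dict]:
    # Group duplicate edges by their (source, relation, target) key.
    grouped: dict[tuple[str, str, str], list[Edge]] = defaultdict(list)
    for e in edges:
        grouped[(e.source, e.relation, e.target)].append(e)

    merged = []
    for (source, relation, target), group in grouped.items():
        merged.append({
            "source": source,
            "relation": relation,
            "target": target,
            # Evidence from every duplicate is kept on the merged edge.
            "pmids": sorted({e.pmid for e in group}),
            "confidences": [e.confidence for e in group],
            "methods": sorted({e.method for e in group}),
            "evidence_count": len(group),
        })
    return merged


# Placeholder identifiers, not real PMIDs.
edges = [
    Edge("IL6", "ASSOCIATED_WITH", "sepsis", "pmid-a", 0.9, "llm"),
    Edge("IL6", "ASSOCIATED_WITH", "sepsis", "pmid-b", 0.7, "llm"),
    Edge("TNF", "ASSOCIATED_WITH", "sepsis", "pmid-a", 0.8, "rule"),
]
merged = consolidate(edges)
```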

Query Flow

  1. User Query
  2. Security Guardrails (PII detection, injection check)
  3. Session Manager (restore history, create agent)
  4. Query Mode Router (auto-classify: local/global/hybrid/naive)
  5. Agent Execution (TwoPhase: deterministic tools → LLM synthesis)
  6. REFRAG Compression (optional context compression)
  7. SSE Streaming Response (token-by-token)
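
The final step streams the answer as server-sent events. As a rough illustration of the wire format only (how sse_streamer.py actually frames events is not shown in this document), each token can be sent as a `data:` line followed by a blank line. The JSON payload shape, the `done` event name, and the `[DONE]` sentinel are assumptions.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one server-sent event per token, then a completion event."""
    for token in tokens:
        # JSON-encode so newlines inside a token cannot break the framing.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    yield "event: done\ndata: [DONE]\n\n"


stream = list(sse_events(["IL6", " is", " elevated"]))
```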

LLM Service Configuration

Service           Backend        Use Case
local             Ollama         Development (default)
bedrock           AWS Bedrock    Production
sagemaker-llama3  AWS SageMaker  Production (custom endpoints)
openai            OpenAI API     Alternative
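
To illustrate how a factory might map these service names to backends, the sketch below uses a dispatch dictionary. Only the four service names and backend labels come from the table above; `LLMClient` and `create_llm` are placeholders invented for the example and do not reflect the real llm_factory interface.

```python
from dataclasses import dataclass


@dataclass
class LLMClient:
    """Placeholder client; a real one would wrap the backend SDK."""
    backend: str

    def describe(self) -> str:
        return f"LLM via {self.backend}"


# Service name -> backend label, taken from the table above.
BACKENDS = {
    "local": "Ollama",
    "bedrock": "AWS Bedrock",
    "sagemaker-llama3": "AWS SageMaker",
    "openai": "OpenAI API",
}


def create_llm(service: str = "local") -> LLMClient:
    """Return a client for the configured service, defaulting to local."""
    try:
        return LLMClient(BACKENDS[service])
    except KeyError:
        raise ValueError(
            f"unknown LLM service {service!r}; expected one of {sorted(BACKENDS)}"
        ) from None


client = create_llm("bedrock")
```

Failing fast on an unknown service name surfaces configuration typos at startup rather than at the first query.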