Knowledge Graph Data Model

This page documents the Neo4j schema — node labels, relationship types, and key properties — used by the Olink RAG system.


Overview

graph TD
    A[Abstract] -->|HAS_CHUNK| C[Chunk]
    C -->|CONTAINS| P[Protein]
    C -->|CONTAINS| D[Disease]
    P -->|ASSOCIATED_WITH| D
    P -->|INTERACTS_WITH| P2[Protein]
    P -->|PARTICIPATES_IN| PW[Pathway]
    P -->|HAS_GO_TERM| GO[GOTerm]
    P -->|LOCATED_IN| SL[SubcellularLocation]
    D -->|IS_SUBTYPE_OF| D2[Disease]
    R[Relationship] -->|HAS_PROVENANCE| SD[Source_Dataset]
    R -->|HAS_PROVENANCE| EE[Extraction_Event]
    R -->|HAS_PROVENANCE| AC[Actor]

Core Nodes

Protein

The primary biological entity. Identified canonically by UniProt ID.

Property            Type      Description
name                string    Display name (e.g., "Insulin")
id                  string    Internal identifier
uniprot_id          string    Canonical UniProt accession (e.g., "P01308")
uniprot_gene_name   string    Gene name from UniProt
gene_symbol         string    HGNC gene symbol (e.g., "INS")
ensembl_id          string    Ensembl gene ID
embedding           float[]   Vector embedding for semantic search

Labels: :Protein:Entity

Disease

Identified canonically by MONDO ID.

Property     Type      Description
name         string    Display name (e.g., "Type 2 diabetes mellitus")
id           string    Internal identifier
mondo_id     string    MONDO ontology ID (e.g., "MONDO:0005148")
definition   string    Ontology definition text
umls_id      string    UMLS concept ID
efo_id       string    EFO ontology ID
embedding    float[]   Vector embedding for semantic search

Labels: :Disease:Entity
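
Because proteins and diseases are keyed on external identifiers, it can help to check the shape of an identifier before writing it to the graph. The sketch below is illustrative only: the function names are hypothetical, the UniProt pattern follows the accession format published by UniProt, and the MONDO pattern assumes the seven-digit CURIE form seen in the example above.

```python
import re

# Accession format as published in the UniProt help pages.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumption: MONDO IDs are CURIEs of the form "MONDO:" followed by 7 digits.
MONDO_RE = re.compile(r"MONDO:\d{7}")


def is_uniprot_accession(value: str) -> bool:
    """Return True if the whole string is shaped like a UniProt accession."""
    return UNIPROT_RE.fullmatch(value) is not None


def is_mondo_id(value: str) -> bool:
    """Return True if the whole string is shaped like a MONDO CURIE."""
    return MONDO_RE.fullmatch(value) is not None
```

A shape check like this only rejects malformed strings; it does not confirm that the identifier exists in UniProt or MONDO.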

Chunk

A text segment extracted from a source document. The atomic unit for retrieval.

Property           Type       Description
chunk_id           string     Unique chunk identifier
text               string     Chunk text content
doc_id             string     Parent document identifier
pmid               string     PubMed ID (if from PubMed)
title              string     Source document title
publication_year   int        Year of publication
embedding          float[]    Vector embedding
ingestion_job_id   string     Job that created this chunk
ingested_at        datetime   Ingestion timestamp

Abstract

A PubMed abstract or bioRxiv preprint metadata record.

Property           Type       Description
pmid               string     PubMed ID
title              string     Paper title
authors            string[]   Author list
journal            string     Journal name
publication_date   string     Publication date
doi                string     DOI
keywords           string[]   MeSH/author keywords
ingestion_job_id   string     Job that created this record
ingested_at        datetime   Ingestion timestamp

Community

A cluster of densely connected entities, detected with the Louvain community detection algorithm.

Property       Type     Description
community_id   string   Unique community identifier
summary        string   LLM-generated natural language summary

Annotation Nodes

Pathway

Biological pathway (e.g., from Reactome, KEGG).

Property   Type     Description
name       string   Pathway name

GOTerm

Gene Ontology term annotation.

Property    Type     Description
id          string   GO term ID (e.g., "GO:0006915")
name        string   Term name
namespace   string   Ontology namespace (BP/MF/CC)

SubcellularLocation

Where a protein is located within the cell.

Property   Type     Description
name       string   Location name (e.g., "Cytoplasm")

Provenance Nodes

These track the lineage of extracted knowledge back to its source.

Source_Dataset

Property            Type       Description
source_dataset_id   string     Dataset identifier
created_at          datetime   When the dataset was registered

Extraction_Event

Property            Type       Description
ingestion_job_id    string     Job identifier
extraction_method   string     Method: llm, tabular, user_description, notebook
timestamp           datetime   When extraction occurred
chunk_id            string     Source chunk
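
Since extraction_method is restricted to a fixed set of values, a writer may want to reject anything else before creating the node. The following is a minimal sketch, not the system's actual model: the ExtractionEvent class and its to_properties helper are hypothetical, and serialising the timestamp as an ISO 8601 string is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime

# Allowed values, taken from the extraction_method row above.
EXTRACTION_METHODS = frozenset({"llm", "tabular", "user_description", "notebook"})


@dataclass(frozen=True)
class ExtractionEvent:
    """Hypothetical in-memory mirror of an Extraction_Event node."""

    ingestion_job_id: str
    extraction_method: str
    timestamp: datetime
    chunk_id: str

    def __post_init__(self) -> None:
        # Fail early on values outside the documented set.
        if self.extraction_method not in EXTRACTION_METHODS:
            raise ValueError(
                f"unknown extraction_method: {self.extraction_method!r}"
            )

    def to_properties(self) -> dict:
        """Flatten to a property map (timestamp as ISO 8601: an assumption)."""
        return {
            "ingestion_job_id": self.ingestion_job_id,
            "extraction_method": self.extraction_method,
            "timestamp": self.timestamp.isoformat(),
            "chunk_id": self.chunk_id,
        }
```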

Actor

Property     Type       Description
actor_id     string     User or system identifier
created_at   datetime   Registration timestamp

Relationships

Core Knowledge Relationships

Relationship      Source    Target            Key Properties
ASSOCIATED_WITH   Protein   Disease           confidence, score, evidence_count, pmids, extraction_method
INTERACTS_WITH    Protein   Protein           combined_score, interaction_type, source_database
CAUSES            Protein   Disease           confidence, pmids
TREATS            Protein   Disease           confidence, pmids
UPREGULATES       Protein   Protein/Disease   confidence, pmids
DOWNREGULATES     Protein   Protein/Disease   confidence, pmids
RELATED_TO        Entity    Entity            confidence, pmids
IS_SUBTYPE_OF     Disease   Disease           (ontology hierarchy)

Structural Relationships

Relationship   Source     Target            Description
HAS_CHUNK      Abstract   Chunk             Abstract contains this text chunk
CONTAINS       Chunk      Protein/Disease   Chunk mentions this entity
FROM_CHUNK     Entity     Chunk             Entity was extracted from this chunk
BELONGS_TO     Entity     Community         Entity is in this community

Annotation Relationships

Relationship      Source    Target                Description
PARTICIPATES_IN   Protein   Pathway               Protein is part of this pathway
HAS_GO_TERM       Protein   GOTerm                Protein has this GO annotation
LOCATED_IN        Protein   SubcellularLocation   Protein is found here

Provenance Relationships

Relationship     Source         Target              Description
HAS_PROVENANCE   Relationship   Source_Dataset      Knowledge came from this dataset
HAS_PROVENANCE   Relationship   Extraction_Event    Knowledge was extracted by this event
HAS_PROVENANCE   Relationship   Actor               This actor produced the knowledge
CONFIRMED_BY     Relationship   User_Context_Node   User confirmed this relationship
CONTRADICTS      Entity         Entity              Contradiction detected between assertions

Vector Indexes

Index Name         Node Label   Property    Dimensions   Similarity
node_embeddings    Entity       embedding   768          cosine
chunk_embeddings   Chunk        embedding   768          cosine

Both indexes are used by the query agent for hybrid semantic and graph search.
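
As an illustration of how the two indexes above might be declared and queried, the sketch below renders the DDL from the table values. It assumes Neo4j 5.x vector index syntax and the built-in db.index.vector.queryNodes procedure; the helper name vector_index_ddl and the constant CHUNK_SEARCH are hypothetical, and the statements should be checked against the Neo4j version actually deployed.

```python
# (index name, node label, property, dimensions, similarity) from the table above.
VECTOR_INDEXES = [
    ("node_embeddings", "Entity", "embedding", 768, "cosine"),
    ("chunk_embeddings", "Chunk", "embedding", 768, "cosine"),
]


def vector_index_ddl(name: str, label: str, prop: str, dims: int, similarity: str) -> str:
    """Render a CREATE VECTOR INDEX statement (Neo4j 5.x syntax assumed)."""
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dims}, "
        f"`vector.similarity_function`: '{similarity}'"
        "}}"
    )


# Top-k chunk lookup; $k and $embedding are supplied as query parameters.
CHUNK_SEARCH = (
    "CALL db.index.vector.queryNodes('chunk_embeddings', $k, $embedding) "
    "YIELD node, score "
    "RETURN node.chunk_id AS chunk_id, score"
)

if __name__ == "__main__":
    for spec in VECTOR_INDEXES:
        print(vector_index_ddl(*spec))
```

The query embedding passed as $embedding must have the same 768 dimensions as the indexed vectors.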

Fulltext Indexes

  • Entity names (Protein.name, Disease.name) for keyword search and fuzzy matching
  • Chunk text for BM25-style retrieval
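
Neo4j fulltext indexes are backed by Lucene, so fuzzy matching on entity names is typically expressed by appending ~ to each search term. The sketch below shows one way to build such a query string safely. The index name entity_names is hypothetical (the actual index names are not listed here), and the helper fuzzy_name_query is an illustration, not part of the system.

```python
import re

# Hypothetical index definition and lookup; the index name is an assumption.
ENTITY_NAME_INDEX_DDL = (
    "CREATE FULLTEXT INDEX entity_names IF NOT EXISTS "
    "FOR (n:Protein|Disease) ON EACH [n.name]"
)
ENTITY_NAME_SEARCH = (
    "CALL db.index.fulltext.queryNodes('entity_names', $q) "
    "YIELD node, score RETURN node.name AS name, score"
)

# Characters and operators with special meaning in Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def fuzzy_name_query(text: str) -> str:
    """Escape each whitespace-separated token and append '~' for fuzzy matching."""
    tokens = text.split()
    return " ".join(_LUCENE_SPECIAL.sub(r"\\\1", tok) + "~" for tok in tokens)
```

For example, fuzzy_name_query("IL-6 receptor") escapes the hyphen so that Lucene does not read it as a NOT operator, and the result can be passed as the $q parameter.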

Entity Resolution

Entities are consolidated using multiple strategies:

  1. UniProt ID matching — proteins with the same UniProt accession are merged
  2. MONDO ID matching — diseases with the same MONDO ID are merged
  3. Synonym matching — exact match against known synonym lists
  4. Fuzzy matching — Levenshtein distance for near-duplicates (configurable threshold)
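
The four strategies above can be sketched as a single lookup that tries them in order. This is an illustration under stated assumptions, not the system's implementation: entities are plain dicts, the resolve function and its parameters are hypothetical, names are compared case-insensitively, and max_distance=1 is an arbitrary stand-in for the configurable threshold.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def resolve(candidate: dict, known: list, synonyms: dict,
            max_distance: int = 1) -> Optional[dict]:
    """Return the existing entity that candidate should merge into, or None.

    synonyms maps a lower-cased synonym to a canonical entity name.
    """
    # Strategies 1 and 2: shared UniProt accession or MONDO ID.
    for id_field in ("uniprot_id", "mondo_id"):
        value = candidate.get(id_field)
        if value:
            for entity in known:
                if entity.get(id_field) == value:
                    return entity

    name = candidate["name"].strip().lower()

    # Strategy 3: exact synonym match.
    canonical = synonyms.get(name)
    if canonical is not None:
        for entity in known:
            if entity["name"].lower() == canonical.lower():
                return entity

    # Strategy 4: nearest name within the edit-distance threshold.
    best, best_dist = None, max_distance + 1
    for entity in known:
        dist = levenshtein(name, entity["name"].lower())
        if dist < best_dist:
            best, best_dist = entity, dist
    return best
```

Identifier matches are tried first because they are unambiguous; fuzzy matching comes last since a small edit distance can still join unrelated names (for example, short gene symbols that differ by one character).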

After consolidation, duplicate relationships are merged while preserving all evidence (PMIDs, confidence scores, provenance chains).
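
One way to picture the relationship merge is as a group-by on (source, type, target) that unions the evidence. The sketch below makes several assumptions that are not confirmed by this page: edges are plain dicts with hypothetical keys, evidence_count is taken to be the number of distinct PMIDs, and individual confidence scores are kept as a list rather than aggregated.

```python
from collections import defaultdict


def merge_relationships(edges: list) -> list:
    """Collapse edges sharing (source, type, target) while keeping all evidence."""
    grouped = defaultdict(list)
    for edge in edges:
        grouped[(edge["source"], edge["type"], edge["target"])].append(edge)

    merged = []
    for (source, rel_type, target), group in grouped.items():
        # Union of PMIDs across duplicates, de-duplicated and sorted.
        pmids = sorted({p for e in group for p in e.get("pmids", [])})
        merged.append({
            "source": source,
            "type": rel_type,
            "target": target,
            "pmids": pmids,
            "evidence_count": len(pmids),  # assumption: distinct PMIDs
            # Every original score is retained; no averaging is applied.
            "confidence_scores": [e["confidence"] for e in group if "confidence" in e],
            # Provenance entries are concatenated in input order.
            "provenance": [p for e in group for p in e.get("provenance", [])],
        })
    return merged
```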


Temporal Model

Nodes and edges carry temporal validity windows (see src/models/context_graph_models.py):

  • valid_from — when the knowledge was first observed
  • valid_to — when superseded (null = currently valid)

This enables point-in-time queries and tracking knowledge evolution.
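
A point-in-time check against these two properties can be sketched as follows. The half-open interval (valid_from inclusive, valid_to exclusive) is an assumption, as is storing both properties directly on the relationship as datetimes; the function name and the query constant are hypothetical.

```python
from datetime import datetime
from typing import Optional


def is_valid_at(valid_from: datetime, valid_to: Optional[datetime], at: datetime) -> bool:
    """True if [valid_from, valid_to) covers `at`; valid_to=None means still valid."""
    return valid_from <= at and (valid_to is None or at < valid_to)


# The same predicate in Cypher, with $at passed as a datetime parameter.
POINT_IN_TIME_QUERY = """
MATCH (p:Protein)-[r:ASSOCIATED_WITH]->(d:Disease)
WHERE r.valid_from <= $at
  AND (r.valid_to IS NULL OR $at < r.valid_to)
RETURN p.name, d.name, r.confidence
"""
```

Treating valid_to as exclusive means a superseding assertion can reuse the same timestamp as its valid_from without the two windows overlapping.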


Example Cypher Queries

// Find all proteins associated with a disease
MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(d:Disease)
WHERE d.mondo_id = 'MONDO:0005148'
RETURN p.name, p.uniprot_id, r.confidence
ORDER BY r.confidence DESC

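// Get provenance chain for a relationship.
// A Neo4j edge cannot be the start of another edge, so provenance is read from
// the reified :Relationship node described under Provenance Relationships.
// How that node is keyed to its edge is not documented on this page; the
// properties type, source_id, and target_id below are assumptions.
MATCH (rel:Relationship {type: 'ASSOCIATED_WITH'})
WHERE rel.source_id = 'P01308' AND rel.target_id = 'MONDO:0005148'
MATCH (rel)-[:HAS_PROVENANCE]->(prov)
RETURN labels(prov), properties(prov)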

// Find protein-protein interactions above a score threshold
MATCH (p1:Protein)-[r:INTERACTS_WITH]-(p2:Protein)
WHERE r.combined_score > 0.7
RETURN p1.name, p2.name, r.combined_score, r.interaction_type
LIMIT 50

// Community members and summary
MATCH (c:Community {community_id: $id})<-[:BELONGS_TO]-(e:Entity)
RETURN c.summary, collect(e.name) AS members