Knowledge Graph Data Model
This page documents the Neo4j schema — node labels, relationship types, and key properties — used by the Olink RAG system.
Overview

```mermaid
graph TD
    A[Abstract] -->|HAS_CHUNK| C[Chunk]
    C -->|CONTAINS| P[Protein]
    C -->|CONTAINS| D[Disease]
    P -->|ASSOCIATED_WITH| D
    P -->|INTERACTS_WITH| P2[Protein]
    P -->|PARTICIPATES_IN| PW[Pathway]
    P -->|HAS_GO_TERM| GO[GOTerm]
    P -->|LOCATED_IN| SL[SubcellularLocation]
    D -->|IS_SUBTYPE_OF| D2[Disease]
    R[Relationship] -->|HAS_PROVENANCE| SD[Source_Dataset]
    R -->|HAS_PROVENANCE| EE[Extraction_Event]
    R -->|HAS_PROVENANCE| AC[Actor]
```

Core Nodes
Protein
The primary biological entity. Identified canonically by UniProt ID.

| Property | Type | Description |
|---|---|---|
| name | string | Display name (e.g., "Insulin") |
| id | string | Internal identifier |
| uniprot_id | string | Canonical UniProt accession (e.g., "P01308") |
| uniprot_gene_name | string | Gene name from UniProt |
| gene_symbol | string | HGNC gene symbol (e.g., "INS") |
| ensembl_id | string | Ensembl gene ID |
| embedding | float[] | Vector embedding for semantic search |

Labels: :Protein:Entity
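Because proteins are identified canonically by UniProt ID, an upsert would naturally key on `uniprot_id`. The following is a minimal sketch under that assumption, not the system's actual ingestion code: the helper name `build_protein_merge` is hypothetical, and it only builds a parameterized Cypher statement plus its parameter map without executing anything against Neo4j.

```python
from typing import Any

# Property names transcribed from the Protein table above. `embedding` is
# omitted on the assumption that it is written by a separate embedding step.
PROTEIN_PROPERTIES = (
    "name", "id", "uniprot_id", "uniprot_gene_name", "gene_symbol", "ensembl_id",
)


def build_protein_merge(props: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Return (cypher, params) for an upsert keyed on uniprot_id (hypothetical helper)."""
    if not props.get("uniprot_id"):
        raise ValueError("uniprot_id is required: it is the canonical identifier")
    unknown = set(props) - set(PROTEIN_PROPERTIES)
    if unknown:
        raise ValueError(f"unknown Protein properties: {sorted(unknown)}")
    cypher = (
        "MERGE (p:Protein:Entity {uniprot_id: $uniprot_id}) "
        "SET p += $props "
        "RETURN p"
    )
    return cypher, {"uniprot_id": props["uniprot_id"], "props": dict(props)}


cypher, params = build_protein_merge(
    {"name": "Insulin", "uniprot_id": "P01308", "gene_symbol": "INS"}
)
print(cypher)
print(params)
```

Matching on the canonical accession in the `MERGE` clause and applying the remaining properties with `SET p += $props` keeps repeated ingestion of the same protein from creating duplicate nodes.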
Disease
Identified canonically by MONDO ID.

| Property | Type | Description |
|---|---|---|
| name | string | Display name (e.g., "Type 2 diabetes mellitus") |
| id | string | Internal identifier |
| mondo_id | string | MONDO ontology ID (e.g., "MONDO:0005148") |
| definition | string | Ontology definition text |
| umls_id | string | UMLS concept ID |
| efo_id | string | EFO ontology ID |
| embedding | float[] | Vector embedding for semantic search |

Labels: :Disease:Entity
Chunk
A text segment extracted from a source document. The atomic unit for retrieval.

| Property | Type | Description |
|---|---|---|
| chunk_id | string | Unique chunk identifier |
| text | string | Chunk text content |
| doc_id | string | Parent document identifier |
| pmid | string | PubMed ID (if from PubMed) |
| title | string | Source document title |
| publication_year | int | Year of publication |
| embedding | float[] | Vector embedding |
| ingestion_job_id | string | Job that created this chunk |
| ingested_at | datetime | Ingestion timestamp |

Abstract
A PubMed abstract or bioRxiv preprint metadata record.

| Property | Type | Description |
|---|---|---|
| pmid | string | PubMed ID |
| title | string | Paper title |
| authors | string[] | Author list |
| journal | string | Journal name |
| publication_date | string | Publication date |
| doi | string | DOI |
| keywords | string[] | MeSH/author keywords |
| ingestion_job_id | string | Job that created this record |
| ingested_at | datetime | Ingestion timestamp |

Community
A cluster of densely connected entities detected by Louvain community detection.

| Property | Type | Description |
|---|---|---|
| community_id | string | Unique community identifier |
| summary | string | LLM-generated natural language summary |

Annotation Nodes
Pathway
Biological pathway (e.g., from Reactome, KEGG).

| Property | Type | Description |
|---|---|---|
| name | string | Pathway name |

GOTerm
Gene Ontology term annotation.

| Property | Type | Description |
|---|---|---|
| id | string | GO term ID (e.g., "GO:0006915") |
| name | string | Term name |
| namespace | string | Ontology namespace: BP (biological process), MF (molecular function), or CC (cellular component) |

SubcellularLocation
Where a protein is located within the cell.

| Property | Type | Description |
|---|---|---|
| name | string | Location name (e.g., "Cytoplasm") |

Provenance Nodes
These track the lineage of extracted knowledge back to its source.
Source_Dataset
| Property | Type | Description |
|---|---|---|
| source_dataset_id | string | Dataset identifier |
| created_at | datetime | When the dataset was registered |

Extraction_Event
| Property | Type | Description |
|---|---|---|
| ingestion_job_id | string | Job identifier |
| extraction_method | string | Extraction method, one of: llm, tabular, user_description, notebook |
| timestamp | datetime | When extraction occurred |
| chunk_id | string | Source chunk |

Actor
| Property | Type | Description |
|---|---|---|
| actor_id | string | User or system identifier |
| created_at | datetime | Registration timestamp |

Relationships
Core Knowledge Relationships
| Relationship | Source | Target | Key Properties |
|---|---|---|---|
| ASSOCIATED_WITH | Protein | Disease | confidence, score, evidence_count, pmids, extraction_method |
| INTERACTS_WITH | Protein | Protein | combined_score, interaction_type, source_database |
| CAUSES | Protein | Disease | confidence, pmids |
| TREATS | Protein | Disease | confidence, pmids |
| UPREGULATES | Protein | Protein/Disease | confidence, pmids |
| DOWNREGULATES | Protein | Protein/Disease | confidence, pmids |
| RELATED_TO | Entity | Entity | confidence, pmids |
| IS_SUBTYPE_OF | Disease | Disease | (ontology hierarchy) |

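
The source and target constraints in the Core Knowledge Relationships table can be checked before an edge is written. This is an illustrative sketch only: `CORE_RELATIONSHIPS` and `is_valid_edge` are hypothetical names, and the source does not say whether the system enforces these constraints in code.

```python
# Allowed (source label, target labels) per relationship type, transcribed
# from the Core Knowledge Relationships table. RELATED_TO uses "Entity", the
# second label that Protein and Disease nodes carry (see their Labels lines).
CORE_RELATIONSHIPS = {
    "ASSOCIATED_WITH": ("Protein", {"Disease"}),
    "INTERACTS_WITH": ("Protein", {"Protein"}),
    "CAUSES": ("Protein", {"Disease"}),
    "TREATS": ("Protein", {"Disease"}),
    "UPREGULATES": ("Protein", {"Protein", "Disease"}),
    "DOWNREGULATES": ("Protein", {"Protein", "Disease"}),
    "RELATED_TO": ("Entity", {"Entity"}),
    "IS_SUBTYPE_OF": ("Disease", {"Disease"}),
}


def is_valid_edge(rel_type: str, source_labels: set[str], target_labels: set[str]) -> bool:
    """Check an edge against the table, given the full label set of each endpoint."""
    spec = CORE_RELATIONSHIPS.get(rel_type)
    if spec is None:
        return False  # not a core knowledge relationship
    source, targets = spec
    return source in source_labels and bool(targets & set(target_labels))


protein = {"Protein", "Entity"}
disease = {"Disease", "Entity"}
print(is_valid_edge("ASSOCIATED_WITH", protein, disease))
print(is_valid_edge("ASSOCIATED_WITH", disease, protein))
```

Passing the full label set of each node, rather than a single label, lets the same check cover both the specific types and the generic `Entity` rows.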
Structural Relationships

| Relationship | Source | Target | Description |
|---|---|---|---|
| HAS_CHUNK | Abstract | Chunk | Abstract contains this text chunk |
| CONTAINS | Chunk | Protein/Disease | Chunk mentions this entity |
| FROM_CHUNK | Entity | Chunk | Entity was extracted from this chunk |
| BELONGS_TO | Entity | Community | Entity is in this community |

Annotation Relationships
| Relationship | Source | Target | Description |
|---|---|---|---|
| PARTICIPATES_IN | Protein | Pathway | Protein is part of this pathway |
| HAS_GO_TERM | Protein | GOTerm | Protein has this GO annotation |
| LOCATED_IN | Protein | SubcellularLocation | Protein is found here |

Provenance Relationships
| Relationship | Source | Target | Description |
|---|---|---|---|
| HAS_PROVENANCE | Relationship | Source_Dataset | Knowledge came from this dataset |
| HAS_PROVENANCE | Relationship | Extraction_Event | Knowledge was extracted by this event |
| HAS_PROVENANCE | Relationship | Actor | This actor produced the knowledge |
| CONFIRMED_BY | Relationship | User_Context_Node | User confirmed this relationship |
| CONTRADICTS | Entity | Entity | Contradiction detected between assertions |

Vector Indexes
| Index Name | Node Label | Property | Dimensions | Similarity |
|---|---|---|---|---|
| node_embeddings | Entity | embedding | 768 | cosine |
| chunk_embeddings | Chunk | embedding | 768 | cosine |

Both indexes are used for hybrid semantic and graph search in the query agent.
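To make the "cosine" similarity setting concrete, the sketch below ranks candidates by cosine similarity with an exhaustive scan in plain Python. It is only an illustration of the metric: the function names and the three-dimensional toy vectors are assumptions (the real embeddings are 768-dimensional), and a vector index exists precisely so that this kind of full scan is not needed.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], candidates: dict[str, Sequence[float]], k: int = 3):
    """Brute-force nearest neighbours: score every candidate, keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for node embeddings (hypothetical values).
toy_embeddings = {
    "node_a": [1.0, 0.0, 0.0],
    "node_b": [0.9, 0.1, 0.0],
    "node_c": [0.0, 1.0, 0.0],
}
for name, score in top_k([1.0, 0.0, 0.0], toy_embeddings, k=2):
    print(name, round(score, 4))
```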
Fulltext Indexes
- Entity names (Protein.name, Disease.name) for keyword search and fuzzy matching
- Chunk text for BM25-style retrieval
Entity Resolution
Entities are consolidated using multiple strategies:
- UniProt ID matching — proteins with the same UniProt accession are merged
- MONDO ID matching — diseases with the same MONDO ID are merged
- Synonym matching — exact match against known synonym lists
- Fuzzy matching — Levenshtein distance for near-duplicates (configurable threshold)
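The four strategies above can be read as a cascade that tries the most reliable match first. The sketch below is one possible arrangement, not the system's implementation: the `resolve` function, the dict-based entity records, the synonym map, and the absolute edit-distance threshold `max_distance` are all assumptions made for illustration.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def resolve(mention: dict, known: list[dict], synonyms: dict[str, str],
            max_distance: int = 1) -> Optional[dict]:
    """Return the known entity that `mention` should merge into, or None."""
    # 1. Canonical ID match (uniprot_id for proteins, mondo_id for diseases).
    for key in ("uniprot_id", "mondo_id"):
        if mention.get(key):
            for entity in known:
                if entity.get(key) == mention[key]:
                    return entity
    # 2. Exact name or synonym match (synonym keys are lowercase).
    name = mention["name"].strip().lower()
    canonical = synonyms.get(name, name)
    for entity in known:
        if entity["name"].lower() == canonical:
            return entity
    # 3. Fuzzy match: closest name within the configured edit distance.
    best = min(known, key=lambda e: levenshtein(e["name"].lower(), canonical), default=None)
    if best is not None and levenshtein(best["name"].lower(), canonical) <= max_distance:
        return best
    return None


known = [
    {"name": "Insulin", "uniprot_id": "P01308"},
    {"name": "Type 2 diabetes mellitus", "mondo_id": "MONDO:0005148"},
]
synonyms = {"t2dm": "type 2 diabetes mellitus"}  # hypothetical synonym list
print(resolve({"name": "Insulim"}, known, synonyms))
```

Ordering the checks this way means a fuzzy match can never override an exact identifier match.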
After consolidation, duplicate relationships are merged while preserving all evidence (PMIDs, confidence scores, provenance chains).
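A merge that preserves evidence might look like the following sketch. The edge representation (plain dicts), the function name `merge_relationships`, the choice to keep every confidence score in a list instead of aggregating them, and the definition of `evidence_count` as the number of distinct PMIDs are all assumptions. The PMID strings are placeholders, not real identifiers.

```python
def merge_relationships(edges: list[dict]) -> list[dict]:
    """Collapse edges sharing (source, type, target) while keeping all evidence."""
    merged: dict[tuple, dict] = {}
    for edge in edges:
        key = (edge["source"], edge["type"], edge["target"])
        if key not in merged:
            merged[key] = {
                "source": edge["source"], "type": edge["type"], "target": edge["target"],
                "pmids": [], "confidences": [], "provenance": [],
            }
        combined = merged[key]
        for pmid in edge.get("pmids", []):
            if pmid not in combined["pmids"]:  # de-duplicate, keep first-seen order
                combined["pmids"].append(pmid)
        if "confidence" in edge:
            combined["confidences"].append(edge["confidence"])
        combined["provenance"].extend(edge.get("provenance", []))
    for combined in merged.values():
        combined["evidence_count"] = len(combined["pmids"])
    return list(merged.values())


edges = [
    {"source": "P01308", "type": "ASSOCIATED_WITH", "target": "MONDO:0005148",
     "pmids": ["PMID_A", "PMID_B"], "confidence": 0.9, "provenance": ["job-1"]},
    {"source": "P01308", "type": "ASSOCIATED_WITH", "target": "MONDO:0005148",
     "pmids": ["PMID_B", "PMID_C"], "confidence": 0.7, "provenance": ["job-2"]},
]
for edge in merge_relationships(edges):
    print(edge)
```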
Temporal Model
Nodes and edges carry temporal validity windows (see src/models/context_graph_models.py):
- valid_from: when the knowledge was first observed
- valid_to: when superseded (null = currently valid)
This enables point-in-time queries and tracking knowledge evolution.
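A point-in-time check over these two fields can be sketched as below. The half-open interval (valid from `valid_from` inclusive up to, but not including, `valid_to`) is an assumption, as are the function name and the sample records; the actual models in src/models/context_graph_models.py may differ.

```python
from datetime import datetime
from typing import Optional


def is_valid_at(valid_from: datetime, valid_to: Optional[datetime], at: datetime) -> bool:
    """True when `at` lies in [valid_from, valid_to); valid_to=None means still valid."""
    return valid_from <= at and (valid_to is None or at < valid_to)


# Hypothetical assertions with validity windows.
facts = [
    {"label": "superseded", "valid_from": datetime(2020, 1, 1), "valid_to": datetime(2022, 1, 1)},
    {"label": "current", "valid_from": datetime(2022, 1, 1), "valid_to": None},
]

as_of = datetime(2021, 6, 1)
visible = [f["label"] for f in facts if is_valid_at(f["valid_from"], f["valid_to"], as_of)]
print(visible)
```

With a half-open window, a fact that is superseded at a given instant and its replacement that starts at the same instant are never both valid at once.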
Example Cypher Queries

```cypher
// Find all proteins associated with a disease
MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(d:Disease)
WHERE d.mondo_id = 'MONDO:0005148'
RETURN p.name, p.uniprot_id, r.confidence
ORDER BY r.confidence DESC
```

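```cypher
// Get provenance chain for a relationship.
// Note: r is bound to a relationship here, and Cypher does not allow a
// relationship to be used as the start node of a pattern, so treat this query
// as illustrative and adapt it to how the Relationship provenance source is
// actually stored (for example, as a reified node).
MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(d:Disease)
WHERE p.uniprot_id = 'P01308' AND d.mondo_id = 'MONDO:0005148'
MATCH (r)-[:HAS_PROVENANCE]->(prov)
RETURN labels(prov), properties(prov)
```
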
```cypher
// Find protein-protein interactions above a score threshold
MATCH (p1:Protein)-[r:INTERACTS_WITH]-(p2:Protein)
WHERE r.combined_score > 0.7
RETURN p1.name, p2.name, r.combined_score, r.interaction_type
LIMIT 50
```

```cypher
// Community members and summary ($id is a query parameter)
MATCH (c:Community {community_id: $id})<-[:BELONGS_TO]-(e:Entity)
RETURN c.summary, collect(e.name) AS members
```