Knowledge Graph Data Model

This page documents the Neo4j schema — node labels, relationship types, and key properties — used by the Olink RAG system.

Overview

graph TD
    A[Abstract] -->|HAS_CHUNK| C[Chunk]
    C -->|CONTAINS| P[Protein]
    C -->|CONTAINS| D[Disease]
    P -->|ASSOCIATED_WITH| D
    P -->|INTERACTS_WITH| P2[Protein]
    P -->|PARTICIPATES_IN| PW[Pathway]
    P -->|HAS_GO_TERM| GO[GOTerm]
    P -->|LOCATED_IN| SL[SubcellularLocation]
    D -->|IS_SUBTYPE_OF| D2[Disease]
    R[Relationship] -->|HAS_PROVENANCE| SD[Source_Dataset]
    R -->|HAS_PROVENANCE| EE[Extraction_Event]
    R -->|HAS_PROVENANCE| AC[Actor]

Core Nodes

Protein

The primary biological entity, identified canonically by its UniProt accession.

Property	Type	Description
`name`	string	Display name (e.g., "Insulin")
`id`	string	Internal identifier
`uniprot_id`	string	Canonical UniProt accession (e.g., "P01308")
`uniprot_gene_name`	string	Gene name from UniProt
`gene_symbol`	string	HGNC gene symbol (e.g., "INS")
`ensembl_id`	string	Ensembl gene ID
`embedding`	float[]	Vector embedding for semantic search

Labels: :Protein:Entity

Disease

A disease entity, identified canonically by its MONDO ID.

Property	Type	Description
`name`	string	Display name (e.g., "Type 2 diabetes mellitus")
`id`	string	Internal identifier
`mondo_id`	string	MONDO ontology ID (e.g., "MONDO:0005148")
`definition`	string	Ontology definition text
`umls_id`	string	UMLS concept ID
`efo_id`	string	EFO ontology ID
`embedding`	float[]	Vector embedding for semantic search

Labels: :Disease:Entity
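
Because Protein and Disease nodes are keyed on external identifiers, it can help to check an ID's shape before it is written to the graph. The following is a minimal sketch, not part of the documented system: the function names are hypothetical, the UniProt pattern is the accession format published by UniProt, and the seven-digit MONDO pattern is an assumption based on the example ID above.

```python
import re

# Accession format published by UniProt (6- or 10-character accessions).
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Assumption: a MONDO ID is "MONDO:" followed by seven digits,
# matching the example "MONDO:0005148".
MONDO_RE = re.compile(r"MONDO:[0-9]{7}")


def is_uniprot_accession(value: str) -> bool:
    """Return True if the whole string has the shape of a UniProt accession."""
    return UNIPROT_RE.fullmatch(value) is not None


def is_mondo_id(value: str) -> bool:
    """Return True if the whole string has the assumed MONDO ID shape."""
    return MONDO_RE.fullmatch(value) is not None
```

A check like this only validates format; it does not confirm that the accession exists in UniProt or MONDO.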

Chunk

A text segment extracted from a source document. The atomic unit for retrieval.

Property	Type	Description
`chunk_id`	string	Unique chunk identifier
`text`	string	Chunk text content
`doc_id`	string	Parent document identifier
`pmid`	string	PubMed ID (if from PubMed)
`title`	string	Source document title
`publication_year`	int	Year of publication
`embedding`	float[]	Vector embedding
`ingestion_job_id`	string	Job that created this chunk
`ingested_at`	datetime	Ingestion timestamp

Abstract

A PubMed abstract or bioRxiv preprint metadata record.

Property	Type	Description
`pmid`	string	PubMed ID
`title`	string	Paper title
`authors`	string[]	Author list
`journal`	string	Journal name
`publication_date`	string	Publication date
`doi`	string	DOI
`keywords`	string[]	MeSH/author keywords
`ingestion_job_id`	string	Job that created this record
`ingested_at`	datetime	Ingestion timestamp

Community

A cluster of densely connected entities detected by Louvain community detection.

Property	Type	Description
`community_id`	string	Unique community identifier
`summary`	string	LLM-generated natural language summary

Annotation Nodes

Pathway

Biological pathway (e.g., from Reactome, KEGG).

Property	Type	Description
`name`	string	Pathway name

GOTerm

Gene Ontology term annotation.

Property	Type	Description
`id`	string	GO term ID (e.g., "GO:0006915")
`name`	string	Term name
`namespace`	string	Ontology namespace (BP/MF/CC)

SubcellularLocation

Where a protein is located within the cell.

Property	Type	Description
`name`	string	Location name (e.g., "Cytoplasm")

Provenance Nodes

These track the lineage of extracted knowledge back to its source.

Source_Dataset

Property	Type	Description
`source_dataset_id`	string	Dataset identifier
`created_at`	datetime	When the dataset was registered

Extraction_Event

Property	Type	Description
`ingestion_job_id`	string	Job identifier
`extraction_method`	string	Method: `llm`, `tabular`, `user_description`, `notebook`
`timestamp`	datetime	When extraction occurred
`chunk_id`	string	Source chunk

Actor

Property	Type	Description
`actor_id`	string	User or system identifier
`created_at`	datetime	Registration timestamp

Relationships

Core Knowledge Relationships

Relationship	Source	Target	Key Properties
`ASSOCIATED_WITH`	Protein	Disease	confidence, score, evidence_count, pmids, extraction_method
`INTERACTS_WITH`	Protein	Protein	combined_score, interaction_type, source_database
`CAUSES`	Protein	Disease	confidence, pmids
`TREATS`	Protein	Disease	confidence, pmids
`UPREGULATES`	Protein	Protein/Disease	confidence, pmids
`DOWNREGULATES`	Protein	Protein/Disease	confidence, pmids
`RELATED_TO`	Entity	Entity	confidence, pmids
`IS_SUBTYPE_OF`	Disease	Disease	(ontology hierarchy)

Structural Relationships

Relationship	Source	Target	Description
`HAS_CHUNK`	Abstract	Chunk	Abstract contains this text chunk
`CONTAINS`	Chunk	Protein/Disease	Chunk mentions this entity
`FROM_CHUNK`	Entity	Chunk	Entity was extracted from this chunk
`BELONGS_TO`	Entity	Community	Entity is in this community

Annotation Relationships

Relationship	Source	Target	Description
`PARTICIPATES_IN`	Protein	Pathway	Protein is part of this pathway
`HAS_GO_TERM`	Protein	GOTerm	Protein has this GO annotation
`LOCATED_IN`	Protein	SubcellularLocation	Protein is found here

Provenance Relationships

Relationship	Source	Target	Description
`HAS_PROVENANCE`	Relationship	Source_Dataset	Knowledge came from this dataset
`HAS_PROVENANCE`	Relationship	Extraction_Event	Knowledge was extracted by this event
`HAS_PROVENANCE`	Relationship	Actor	This actor produced the knowledge
`CONFIRMED_BY`	Relationship	User_Context_Node	User confirmed this relationship
`CONTRADICTS`	Entity	Entity	Contradiction detected between assertions

Vector Indexes

Index Name	Node Label	Property	Dimensions	Similarity
`node_embeddings`	Entity	embedding	768	cosine
`chunk_embeddings`	Chunk	embedding	768	cosine

Both indexes support hybrid semantic and graph search in the query agent.
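
To make the "cosine" column concrete, the sketch below ranks candidate vectors against a query vector by cosine similarity in plain Python. It is an illustration only: `cosine_similarity` and `top_k` are hypothetical helper names, the search is brute force, and the index itself may use an approximate search and report scores on a different scale.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], candidates: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k candidate names with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the real system both vectors would have 768 dimensions, as listed in the table above; the helper works for any matching dimension.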

Fulltext Indexes

Fulltext indexes cover:

Entity names (Protein.name, Disease.name) for keyword search and fuzzy matching
Chunk text for BM25-style retrieval
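
Neo4j fulltext indexes are backed by Apache Lucene, which does the actual scoring. The function below is only a plain-Python illustration of the BM25 formula, to show how term frequency, document length, and term rarity combine into a score. The name `bm25_scores`, the whitespace tokenizer, and the default `k1` and `b` values are assumptions made for this sketch.

```python
import math
from collections import Counter


def bm25_scores(
    query: str, docs: list[str], k1: float = 1.2, b: float = 0.75
) -> list[float]:
    """Score each document against the query with BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = query.lower().split()

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Long documents are penalised relative to the average length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A chunk that contains none of the query terms scores 0.0, so it falls to the bottom of any ranking built from these scores.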

Entity Resolution

Entities are consolidated using multiple strategies:

UniProt ID matching — proteins with the same UniProt accession are merged
MONDO ID matching — diseases with the same MONDO ID are merged
Synonym matching — exact match against known synonym lists
Fuzzy matching — Levenshtein distance for near-duplicates (configurable threshold)
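
The fuzzy-matching strategy can be sketched as follows. This is an illustrative implementation of Levenshtein distance, not the project's own code: the names `levenshtein` and `is_near_duplicate`, the case-insensitive comparison, and the default threshold of 2 edits are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def is_near_duplicate(name_a: str, name_b: str, max_distance: int = 2) -> bool:
    """Treat two names as near-duplicates if they are within max_distance edits."""
    return levenshtein(name_a.casefold(), name_b.casefold()) <= max_distance
```

A fixed edit budget is lenient for short names, so a threshold that scales with name length may be a safer choice for short gene symbols.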

After consolidation, duplicate relationships are merged while preserving all evidence (PMIDs, confidence scores, provenance chains).
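
A minimal sketch of that merge step, under stated assumptions: the `Association` class and `merge_duplicates` function are hypothetical, duplicates are identified by the (protein, disease) pair, and confidence scores are kept as a list because the rule for combining them is not specified here. Provenance chains are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Association:
    """Hypothetical in-memory view of one ASSOCIATED_WITH relationship."""
    protein_id: str
    disease_id: str
    pmids: set[str] = field(default_factory=set)
    confidences: list[float] = field(default_factory=list)


def merge_duplicates(edges: list[Association]) -> list[Association]:
    """Collapse edges with the same endpoints, keeping every PMID and score."""
    merged: dict[tuple[str, str], Association] = {}
    for edge in edges:
        key = (edge.protein_id, edge.disease_id)
        if key not in merged:
            # Copy the collections so the input edges are left untouched.
            merged[key] = Association(
                edge.protein_id,
                edge.disease_id,
                set(edge.pmids),
                list(edge.confidences),
            )
        else:
            merged[key].pmids |= edge.pmids
            merged[key].confidences.extend(edge.confidences)
    return list(merged.values())
```

Keeping the raw scores defers the choice of aggregate (maximum, mean, or something evidence-weighted) to query time.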

Temporal Model

Nodes and edges carry temporal validity windows (see src/models/context_graph_models.py):

valid_from — when the knowledge was first observed
valid_to — when superseded (null = currently valid)

This enables point-in-time queries and tracking knowledge evolution.
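
A point-in-time check over these two fields can be sketched as below. The function name `is_valid_at` is hypothetical, and treating the window as half-open (inclusive of valid_from, exclusive of valid_to) is an assumption; the models in src/models/context_graph_models.py may define the boundaries differently.

```python
from datetime import datetime
from typing import Optional


def is_valid_at(
    valid_from: datetime, valid_to: Optional[datetime], at: datetime
) -> bool:
    """Return True if `at` falls inside the window [valid_from, valid_to).

    A valid_to of None means the knowledge has not been superseded.
    """
    if at < valid_from:
        return False
    return valid_to is None or at < valid_to
```

With a half-open window, a superseded fact and its replacement never both count as valid at the exact moment of the handover.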

Example Cypher Queries

// Find all proteins associated with a disease
MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(d:Disease)
WHERE d.mondo_id = 'MONDO:0005148'
RETURN p.name, p.uniprot_id, r.confidence
ORDER BY r.confidence DESC

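// Get provenance chain for a relationship
// Caveat: in Neo4j a relationship variable such as r cannot be the endpoint of
// another pattern, so the second MATCH is schematic; adapt it to however the
// Relationship element from the overview diagram is stored.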
MATCH (p:Protein)-[r:ASSOCIATED_WITH]-(d:Disease)
WHERE p.uniprot_id = 'P01308' AND d.mondo_id = 'MONDO:0005148'
MATCH (r)-[:HAS_PROVENANCE]->(prov)
RETURN labels(prov), properties(prov)

// Find protein-protein interactions above a score threshold
MATCH (p1:Protein)-[r:INTERACTS_WITH]-(p2:Protein)
WHERE r.combined_score > 0.7
RETURN p1.name, p2.name, r.combined_score, r.interaction_type
LIMIT 50

// Community members and summary
MATCH (c:Community {community_id: $id})<-[:BELONGS_TO]-(e:Entity)
RETURN c.summary, collect(e.name) AS members