Advanced Features

Multimodal Document Processing

Vision model integration (Llama-3.2-11B-Vision via Bedrock) for analyzing biomedical PDFs:

Image analysis: protein structure diagrams, pathway charts, Western blots, microscopy
Table extraction: PyMuPDF detection + LLM-based column classification and entity mapping
Batch processing: parallel workers with SQLite checkpointing and resume capability

# Single PDF
uv run python multimodal_main.py --mode single --path ./data/paper.pdf

# Batch folder
uv run python multimodal_main.py --mode batch --path ./data/papers --recursive --max-workers 4
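
The checkpoint-and-resume behaviour of batch mode can be illustrated with a minimal sketch. It is not the code inside `multimodal_main.py`: the table name `processed`, the helper names, and the `process_one` callback are hypothetical, and only the Python standard library is used.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def init_checkpoint(db_path):
    """Open the checkpoint database and create the (hypothetical) table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed (path TEXT PRIMARY KEY, status TEXT)"
    )
    conn.commit()
    return conn


def pending_files(conn, paths):
    """Return the paths without a 'done' row, so a rerun resumes where it stopped."""
    done = {
        row[0]
        for row in conn.execute("SELECT path FROM processed WHERE status = 'done'")
    }
    return [p for p in paths if str(p) not in done]


def run_batch(conn, paths, process_one, max_workers=4):
    """Process pending files in parallel and checkpoint each result.

    Workers only run process_one; the SQLite connection is touched
    exclusively from the calling thread.
    """
    todo = pending_files(conn, paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): p for p in todo}
        for fut in as_completed(futures):
            status = "failed" if fut.exception() else "done"
            conn.execute(
                "INSERT OR REPLACE INTO processed (path, status) VALUES (?, ?)",
                (str(futures[fut]), status),
            )
            conn.commit()
    return len(todo)
```

Because every file is committed as soon as it finishes, an interrupted run loses at most the files that were in flight, and files marked `failed` are retried on the next run.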

See Multimodal Processing.

Evidence Weighting

Quality scoring for protein-disease associations based on:

Study design (0.3–1.0): RCT > meta-analysis > cohort > case-control > case report
Source type (0.3–1.0): curated DB > full-text > abstract > preprint > social media
Sample size (±0.1): bonus for n≥1000, penalty for n<50
Retraction status (0.0–1.0): full retraction = 0.0, expression of concern = 0.3

Combined: final_weight = study_design × source_type × sample_size_adj × retraction_penalty
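
As a rough sketch of the combined formula: the per-category values below are illustrative picks inside the documented 0.3–1.0 ranges, not the project's actual lookup tables, and the sample-size adjustment is assumed to be a multiplier of 1.0 ± 0.1. All dictionary keys and function names are hypothetical.

```python
# Illustrative values only; the ordering follows the lists above.
STUDY_DESIGN = {
    "rct": 1.0,
    "meta_analysis": 0.9,
    "cohort": 0.7,
    "case_control": 0.5,
    "case_report": 0.3,
}
SOURCE_TYPE = {
    "curated_db": 1.0,
    "full_text": 0.8,
    "abstract": 0.6,
    "preprint": 0.45,
    "social_media": 0.3,
}
RETRACTION = {"none": 1.0, "expression_of_concern": 0.3, "retracted": 0.0}


def sample_size_adjustment(n):
    """Assumed multiplier: +0.1 for n >= 1000, -0.1 for n < 50, else neutral."""
    if n is None:
        return 1.0
    if n >= 1000:
        return 1.1
    if n < 50:
        return 0.9
    return 1.0


def final_weight(study_design, source_type, n=None, retraction="none"):
    """Multiply the four factors, mirroring the combined formula."""
    return (
        STUDY_DESIGN[study_design]
        * SOURCE_TYPE[source_type]
        * sample_size_adjustment(n)
        * RETRACTION[retraction]
    )
```

Under these assumptions a fully retracted source always scores 0.0 regardless of the other factors, and a large RCT from a curated database can exceed 1.0; this sketch does not clamp the result.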

See Evidence Weighting.

Community Detection

Louvain algorithm for detecting entity communities, with LLM-generated summaries for global query mode:

Community IDs assigned to entity nodes
Community nodes store summaries and embeddings
Global queries retrieve relevant community summaries by semantic similarity
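
The retrieval step for global queries can be sketched as a cosine-similarity ranking over community embeddings. This is a plain-Python illustration, not the project's implementation: the dictionary layout of a community (`id`, `summary`, `embedding`) and the function names are assumptions, and in practice the ranking would more likely be delegated to a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_communities(query_embedding, communities, k=3):
    """Return the k communities whose summary embedding is closest to the query."""
    scored = [
        (cosine_similarity(query_embedding, c["embedding"]), c) for c in communities
    ]
    # Sort on the score only, so the community dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [community for _, community in scored[:k]]
```

The summaries of the returned communities would then be passed to the LLM as context for answering the global query.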

See Community Detection Validation.

Protein-Protein Interaction (PPI) Framework

Multi-source PPI integration with three merge strategies:

Strategy	Description
`merge_evidence`	Combine evidence from all sources per protein pair
`update_best`	Keep only highest-confidence interaction
`preserve_all`	Separate relationships per source

Supports STRING, FunCoup, BioGRID, and custom CSV formats.
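
To make the difference between the three strategies concrete, here is a minimal in-memory sketch. It rests on several assumptions that the framework may not share: interactions are treated as undirected, each record is a dict with hypothetical keys `a`, `b`, `source`, and `score`, and `merge_evidence` keeps the maximum score as the combined confidence.

```python
from collections import defaultdict


def pair_key(a, b):
    """Order-independent key, assuming interactions are undirected."""
    return tuple(sorted((a, b)))


def merge_interactions(records, strategy):
    """Apply one of the three merge strategies to a list of interaction records."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[pair_key(rec["a"], rec["b"])].append(rec)

    merged = []
    for (a, b), recs in grouped.items():
        if strategy == "merge_evidence":
            # One relationship per pair, listing every contributing source.
            merged.append({
                "a": a, "b": b,
                "sources": sorted(r["source"] for r in recs),
                "score": max(r["score"] for r in recs),  # assumed aggregation
            })
        elif strategy == "update_best":
            # One relationship per pair, taken from the highest-scoring source.
            best = max(recs, key=lambda r: r["score"])
            merged.append({
                "a": a, "b": b,
                "sources": [best["source"]],
                "score": best["score"],
            })
        elif strategy == "preserve_all":
            # One relationship per source, nothing is combined.
            merged.extend(
                {"a": a, "b": b, "sources": [r["source"]], "score": r["score"]}
                for r in recs
            )
        else:
            raise ValueError(f"unknown merge strategy: {strategy}")
    return merged
```

For a protein pair reported by two sources, `merge_evidence` and `update_best` each yield one relationship, while `preserve_all` yields two.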

💡 For live API access: STRING interactions can also be fetched directly via the Science Skills enricher (`uv run python -m pipeline.processors.science_skills_enricher --sources string`), which additionally provides HPA tissue expression, Reactome pathways, ChEMBL drug targets, ClinVar variants, and OpenAlex citations. See Integrations.

See PPI Framework and Multi-Source PPI Guide.

Cell-Based Conversational Interface

Tree-structured conversations for parallel hypothesis exploration:

Cell 1: "What proteins are associated with diabetes?"
  ├── Cell 1a: "Focus on Type 2 diabetes"
  │   ├── Cell 1a-i: "Show protein-protein interactions"
  │   └── Cell 1a-ii: "Find drug targets"
  └── Cell 1b: "Compare with Type 1 diabetes"
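
One way to model such a tree is a small recursive data structure. The sketch below is hypothetical (the `Cell` class, its fields, and the idea that a branch inherits the questions of its ancestors are assumptions, not the interface's actual design):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    cell_id: str
    question: str
    # Excluded from repr/eq so parent and child do not reference each other there.
    parent: Optional["Cell"] = field(default=None, repr=False, compare=False)
    children: List["Cell"] = field(default_factory=list)

    def branch(self, cell_id, question):
        """Create a child cell that explores a follow-up hypothesis."""
        child = Cell(cell_id, question, parent=self)
        self.children.append(child)
        return child

    def context(self):
        """Questions from the root down to this cell, oldest first."""
        chain = []
        node = self
        while node is not None:
            chain.append(node.question)
            node = node.parent
        return list(reversed(chain))


root = Cell("1", "What proteins are associated with diabetes?")
type2 = root.branch("1a", "Focus on Type 2 diabetes")
ppi = type2.branch("1a-i", "Show protein-protein interactions")
type1 = root.branch("1b", "Compare with Type 1 diabetes")
```

Here `ppi.context()` returns the three questions along the path 1 → 1a → 1a-i, while the sibling branch 1b is unaffected by anything asked under 1a.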

See Cell-Based Interface.

Semantic Dynamic Context Graph (SDCG)

Vision for evolving the static knowledge graph (KG) into a living knowledge fabric:

Temporal validity windows on nodes and edges
Provenance chains: dataset → row → extraction → entity → relationship
User context as graph structure: role, team, description → actual nodes
Statistical edges with confidence intervals and sample sizes
Context-aware queries: team-scoped views, role-weighted relevance
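
Two of these ideas, temporal validity windows and statistical edges, can be sketched together as a single record type. The class and field names are hypothetical and do not reflect the SDCG schema; the inclusive end date is likewise an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class StatisticalEdge:
    source: str
    target: str
    relation: str
    estimate: float
    confidence_interval: Tuple[float, float]
    sample_size: int
    valid_from: date
    valid_to: Optional[date] = None  # None means the edge is still valid

    def is_valid_on(self, day):
        """True if `day` falls inside the validity window (end date inclusive)."""
        if day < self.valid_from:
            return False
        return self.valid_to is None or day <= self.valid_to


def edges_valid_on(edges, day):
    """Filter a collection of edges down to those valid on the given day."""
    return [e for e in edges if e.is_valid_on(day)]
```

A time-scoped query would then only traverse the edges returned by `edges_valid_on`, so superseded findings stay in the graph without influencing current answers.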

Status: v1 infrastructure complete (provenance, temporal, user context, confidence scoring, contradiction detection). See SDCG Design Brief and Expansion Roadmap.

Incremental Embeddings

Generate embeddings only for new content, reusing existing ones:

Queries select only nodes matching `WHERE n.embedding IS NULL`
Neo4j vector indexes automatically include new embeddings
No manual index rebuild required
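
The incremental pass can be sketched as a loop that repeatedly fetches nodes lacking an embedding and writes one back. This is an illustration under assumptions: the `Entity` label, the `name` property, the function name, and the `embed` callback are hypothetical, and `session` is expected to behave like a Neo4j Python driver session (`session.run(query, **params)` returning records indexable by key). `elementId()` requires Neo4j 5 or later.

```python
# Hypothetical label and property names; adjust to the actual schema.
FIND_MISSING = (
    "MATCH (n:Entity) WHERE n.embedding IS NULL "
    "RETURN elementId(n) AS id, n.name AS text LIMIT $limit"
)
STORE_EMBEDDING = "MATCH (n) WHERE elementId(n) = $id SET n.embedding = $embedding"


def embed_missing_nodes(session, embed, batch_size=100):
    """Embed only nodes without an embedding; return how many were updated.

    `embed` must return a non-null vector for every text, otherwise the
    node keeps matching the IS NULL filter and the loop never finishes.
    """
    total = 0
    while True:
        rows = [
            (record["id"], record["text"])
            for record in session.run(FIND_MISSING, limit=batch_size)
        ]
        if not rows:
            return total
        for node_id, text in rows:
            session.run(STORE_EMBEDDING, id=node_id, embedding=embed(text))
        total += len(rows)
```

Nodes that already carry an embedding never match the first query, so rerunning the function after each ingestion only costs embedding calls for the new content.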

See Incremental Embeddings.

Enhanced Node Management

Two-phase embedding strategy for enriched properties:

Phase 1 (during ingestion): chunk embeddings only
Phase 2 (after enrichment): graph node embeddings with all properties

See Enhanced Node Management.

GFQL Query Patterns

Cypher patterns inspired by PyGraphistry's GFQL for multi-hop traversal, conditional filtering, hierarchical disease queries, and pathway co-occurrence analysis.

See GFQL Patterns.