Advanced Features

Multimodal Document Processing

Vision model integration (Llama-3.2-11B-Vision via Bedrock) for analyzing biomedical PDFs:

  • Image analysis: protein structure diagrams, pathway charts, Western blots, microscopy
  • Table extraction: PyMuPDF detection + LLM-based column classification and entity mapping
  • Batch processing: parallel workers with SQLite checkpointing and resume capability

# Single PDF
uv run python multimodal_main.py --mode single --path ./data/paper.pdf

# Batch folder
uv run python multimodal_main.py --mode batch --path ./data/papers --recursive --max-workers 4
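
The checkpoint-and-resume behaviour of batch mode can be pictured with a small standard-library sketch. It is not the code behind multimodal_main.py: the table schema, the function names, and the placeholder `process_pdf` are all assumptions made for illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def open_checkpoint(db_path=":memory:"):
    """Create (or reopen) the checkpoint table. The schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed (path TEXT PRIMARY KEY, status TEXT)"
    )
    conn.commit()
    return conn


def process_pdf(path):
    """Placeholder for the real per-PDF work (image analysis, table extraction)."""
    return "done"


def run_batch(paths, conn, max_workers=4):
    """Process only paths without a 'done' checkpoint; return those handled now."""
    done = {
        row[0]
        for row in conn.execute("SELECT path FROM processed WHERE status = 'done'")
    }
    pending = [p for p in paths if p not in done]
    handled = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_pdf, p): p for p in pending}
        for future in as_completed(futures):
            path = futures[future]
            status = future.result()
            # Checkpoints are written from the main thread only,
            # so a single SQLite connection is sufficient.
            conn.execute(
                "INSERT OR REPLACE INTO processed VALUES (?, ?)", (path, status)
            )
            conn.commit()
            handled.append(path)
    return sorted(handled)
```

Calling `run_batch` a second time with the same connection skips every file that was already checkpointed, which is the resume behaviour described above.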

See Multimodal Processing.

Evidence Weighting

Quality scoring for protein-disease associations based on:

  1. Study design (0.3–1.0): RCT > meta-analysis > cohort > case-control > case report
  2. Source type (0.3–1.0): curated DB > full-text > abstract > preprint > social media
  3. Sample size (±0.1): bonus for n≥1000, penalty for n<50
  4. Retraction status (0.0–1.0): full retraction = 0.0, expression of concern = 0.3

Combined: final_weight = study_design × source_type × sample_size_adj × retraction_penalty
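
As a rough illustration, the combined formula can be written as a short function. The sketch assumes that the ±0.1 sample-size rule is applied as a multiplicative factor around 1.0 (1.1 for the bonus, 0.9 for the penalty); the source does not state this, and the function names are hypothetical.

```python
def sample_size_adjustment(n):
    """Assumed mapping of the ±0.1 rule onto a factor around 1.0."""
    if n >= 1000:
        return 1.1  # bonus for large studies
    if n < 50:
        return 0.9  # penalty for very small studies
    return 1.0


def final_weight(study_design, source_type, sample_size, retraction_penalty):
    """Multiply the four components, checking the documented score ranges."""
    if not 0.3 <= study_design <= 1.0:
        raise ValueError("study_design must be in [0.3, 1.0]")
    if not 0.3 <= source_type <= 1.0:
        raise ValueError("source_type must be in [0.3, 1.0]")
    if not 0.0 <= retraction_penalty <= 1.0:
        raise ValueError("retraction_penalty must be in [0.0, 1.0]")
    return (
        study_design
        * source_type
        * sample_size_adjustment(sample_size)
        * retraction_penalty
    )
```

Because the components are multiplied, a fully retracted paper (retraction penalty 0.0) always ends up with a final weight of 0.0, whatever its other scores.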

See Evidence Weighting.

Community Detection

Louvain algorithm for detecting entity communities, with LLM-generated summaries for global query mode:

  • Community IDs assigned to entity nodes
  • Community nodes store summaries and embeddings
  • Global queries retrieve relevant community summaries by semantic similarity

See Community Detection Validation.

Protein-Protein Interaction (PPI) Framework

Multi-source PPI integration with three merge strategies:

| Strategy | Description |
|---|---|
| merge_evidence | Combine evidence from all sources per protein pair |
| update_best | Keep only highest-confidence interaction |
| preserve_all | Separate relationships per source |

Supports STRING, FunCoup, BioGRID, and custom CSV formats.
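
To make the difference between the three strategies concrete, here is a minimal sketch over in-memory records. The record shape, the function names, and the confidence values are illustrative assumptions, not the framework's actual API or real database scores. For merge_evidence the sketch keeps the per-source scores side by side, since the source does not say how they are combined.

```python
from collections import defaultdict

STRATEGIES = {"merge_evidence", "update_best", "preserve_all"}


def pair_key(a, b):
    """Order-independent key so A-B and B-A count as the same protein pair."""
    return tuple(sorted((a, b)))


def merge_interactions(records, strategy):
    """Apply one strategy to records shaped like
    {"a": ..., "b": ..., "source": ..., "confidence": ...} (hypothetical shape)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    if strategy == "preserve_all":
        # One relationship per source record, nothing is collapsed.
        return [
            {"pair": pair_key(r["a"], r["b"]), "source": r["source"],
             "confidence": r["confidence"]}
            for r in records
        ]
    grouped = defaultdict(list)
    for r in records:
        grouped[pair_key(r["a"], r["b"])].append(r)
    edges = []
    for pair, recs in sorted(grouped.items()):
        if strategy == "merge_evidence":
            # Keep every source's score on a single edge per pair.
            evidence = {r["source"]: r["confidence"] for r in recs}
            edges.append({"pair": pair, "evidence": evidence})
        else:  # update_best
            best = max(recs, key=lambda r: r["confidence"])
            edges.append({"pair": pair, "source": best["source"],
                          "confidence": best["confidence"]})
    return edges


# Illustrative records; the scores are made up for the example.
RECORDS = [
    {"a": "TP53", "b": "MDM2", "source": "STRING", "confidence": 0.9},
    {"a": "MDM2", "b": "TP53", "source": "BioGRID", "confidence": 0.7},
    {"a": "BRCA1", "b": "BARD1", "source": "STRING", "confidence": 0.8},
]
```

With these records, preserve_all yields three relationships, while merge_evidence and update_best each yield two, one per distinct protein pair.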

💡 For live API access: STRING interactions can also be fetched directly via the Science Skills enricher (uv run python -m pipeline.processors.science_skills_enricher --sources string), which additionally provides HPA tissue expression, Reactome pathways, ChEMBL drug targets, ClinVar variants, and OpenAlex citations. See Integrations.

See PPI Framework and Multi-Source PPI Guide.

Cell-Based Conversational Interface

Tree-structured conversations for parallel hypothesis exploration:

Cell 1: "What proteins are associated with diabetes?"
  ├── Cell 1a: "Focus on Type 2 diabetes"
  │   ├── Cell 1a-i: "Show protein-protein interactions"
  │   └── Cell 1a-ii: "Find drug targets"
  └── Cell 1b: "Compare with Type 1 diabetes"
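
One way to model such a tree is a small parent/child structure, sketched below. The `Cell` class and its methods are hypothetical, and the idea that a branch inherits the questions of its ancestors as context is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    label: str
    question: str
    # Excluded from repr and comparison to avoid walking the tree recursively.
    parent: Optional["Cell"] = field(default=None, repr=False, compare=False)
    children: List["Cell"] = field(default_factory=list)

    def branch(self, label, question):
        """Open a new child cell that explores a follow-up question."""
        child = Cell(label, question, parent=self)
        self.children.append(child)
        return child

    def context(self):
        """Questions from the root down to this cell (assumed inherited history)."""
        chain = []
        node = self
        while node is not None:
            chain.append(node.question)
            node = node.parent
        return list(reversed(chain))


root = Cell("1", "What proteins are associated with diabetes?")
cell_1a = root.branch("1a", "Focus on Type 2 diabetes")
cell_1a_i = cell_1a.branch("1a-i", "Show protein-protein interactions")
cell_1a_ii = cell_1a.branch("1a-ii", "Find drug targets")
cell_1b = root.branch("1b", "Compare with Type 1 diabetes")
```

Sibling cells such as 1a-i and 1a-ii share the same ancestors but never see each other's questions, which is what lets hypotheses be explored in parallel.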

See Cell-Based Interface.

Semantic Dynamic Context Graph (SDCG)

Design vision for evolving the static knowledge graph (KG) into a living knowledge fabric:

  • Temporal validity windows on nodes and edges
  • Provenance chains: dataset → row → extraction → entity → relationship
  • User context as graph structure: role, team, and description are stored as actual nodes
  • Statistical edges with confidence intervals and sample sizes
  • Context-aware queries: team-scoped views, role-weighted relevance
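
As an illustration of the first two bullets, an edge carrying a validity window and a provenance chain could be sketched as follows. The field names, the half-open interval convention, and the example values are assumptions, not the SDCG schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class TemporalEdge:
    source: str
    target: str
    valid_from: date
    valid_to: Optional[date]  # None means the edge is still considered valid
    # Provenance chain: dataset, row, extraction, entity, relationship.
    provenance: Tuple[str, ...]

    def is_valid_at(self, when):
        """True if `when` falls inside [valid_from, valid_to) (assumed half-open)."""
        if when < self.valid_from:
            return False
        return self.valid_to is None or when < self.valid_to


# Hypothetical example edge; identifiers are placeholders.
edge = TemporalEdge(
    source="ProteinA",
    target="DiseaseX",
    valid_from=date(2020, 1, 1),
    valid_to=date(2023, 1, 1),
    provenance=("dataset-1", "row-42", "extraction-7", "ProteinA", "ASSOCIATED_WITH"),
)
```

A time-scoped query would then keep only edges for which `is_valid_at` returns true for the date of interest.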

Status: v1 infrastructure complete (provenance, temporal, user context, confidence scoring, contradiction detection). See SDCG Design Brief and Expansion Roadmap.

Incremental Embeddings

Generate embeddings only for new content, reusing existing ones:

  • Selects only nodes that lack an embedding (WHERE n.embedding IS NULL)
  • Neo4j vector indexes automatically include new embeddings
  • No manual index rebuild required
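
The selection logic can be mimicked in plain Python, as in the sketch below. Only the `WHERE n.embedding IS NULL` predicate comes from this page; the rest of the Cypher string, the function name, and the toy `embed` callable are assumptions for illustration.

```python
# Only the WHERE clause is taken from the documentation; the MATCH and
# RETURN parts are an assumed minimal query around it.
MISSING_EMBEDDINGS_QUERY = "MATCH (n) WHERE n.embedding IS NULL RETURN n"


def embed_missing(nodes, embed):
    """Embed only nodes without an embedding; return how many were updated."""
    updated = 0
    for node in nodes:
        if node.get("embedding") is None:
            node["embedding"] = embed(node["text"])
            updated += 1
    return updated


def toy_embed(text):
    """Stand-in for a real embedding model: a one-dimensional 'vector'."""
    return [float(len(text))]


nodes = [
    {"text": "existing node", "embedding": [0.5]},
    {"text": "new protein", "embedding": None},
    {"text": "new disease"},
]
```

Running `embed_missing(nodes, toy_embed)` twice updates two nodes the first time and none the second, since existing embeddings are reused.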

See Incremental Embeddings.

Enhanced Node Management

Two-phase embedding strategy for enriched properties:

  1. Phase 1 (during ingestion): chunk embeddings only
  2. Phase 2 (after enrichment): graph node embeddings with all properties
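
For Phase 2, the text that gets embedded has to reflect every enriched property. A minimal sketch of one way to serialize a node is shown below; the function name, the "key: value" format, and the skipped keys are assumptions.

```python
def node_text(properties, skip=("embedding",)):
    """Join all non-empty properties into one string, in a stable key order."""
    parts = [
        f"{key}: {properties[key]}"
        for key in sorted(properties)
        if key not in skip and properties[key] is not None
    ]
    return "; ".join(parts)
```

Sorting the keys keeps the text deterministic, so re-embedding an unchanged node would see identical input.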

See Enhanced Node Management.

GFQL Query Patterns

Cypher patterns inspired by PyGraphistry's GFQL for multi-hop traversal, conditional filtering, hierarchical disease queries, and pathway co-occurrence analysis.

See GFQL Patterns.