Advanced Features¶
Multimodal Document Processing¶
Vision model integration (Llama-3.2-11B-Vision via Bedrock) for analyzing biomedical PDFs:
- Image analysis: protein structure diagrams, pathway charts, Western blots, microscopy
- Table extraction: PyMuPDF detection + LLM-based column classification and entity mapping
- Batch processing: parallel workers with SQLite checkpointing and resume capability
# Single PDF
uv run python multimodal_main.py --mode single --path ./data/paper.pdf
# Batch folder
uv run python multimodal_main.py --mode batch --path ./data/papers --recursive --max-workers 4
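The checkpoint-and-resume behaviour of batch mode can be illustrated with a minimal sketch. The table name `processed`, its columns, and the helpers `init_checkpoint`, `pending`, and `run_batch` are hypothetical and are not taken from `multimodal_main.py`; the sketch only shows one plausible way parallel workers and a SQLite checkpoint can fit together.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def init_checkpoint(db_path):
    """Open (or create) the checkpoint database. Schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "path TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def pending(conn, paths):
    """Return the paths that have no 'done' record yet."""
    rows = conn.execute("SELECT path FROM processed WHERE status = 'done'")
    done = {row[0] for row in rows}
    return [p for p in paths if p not in done]


def run_batch(conn, paths, process_one, max_workers=4):
    """Process pending paths in parallel and checkpoint each outcome.

    process_one(path) must return True on success and False on failure.
    All SQLite writes happen in the calling thread, so the connection is
    never shared with the worker threads.
    """
    todo = pending(conn, paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, ok in zip(todo, pool.map(process_one, todo)):
            conn.execute(
                "INSERT OR REPLACE INTO processed (path, status) VALUES (?, ?)",
                (path, "done" if ok else "failed"),
            )
            conn.commit()
    return todo
```

Rerunning `run_batch` with the same checkpoint skips every file already marked `done` and retries the ones marked `failed`, which is the resume behaviour described above.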
Evidence Weighting¶
Quality scoring for protein-disease associations based on:
- Study design (0.3–1.0): RCT > meta-analysis > cohort > case-control > case report
- Source type (0.3–1.0): curated DB > full-text > abstract > preprint > social media
- Sample size (±0.1): bonus for n≥1000, penalty for n<50
- Retraction status (0.0–1.0): full retraction = 0.0, expression of concern = 0.3
Combined: final_weight = study_design × source_type × sample_size_adj × retraction_penalty
See Evidence Weighting.
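As a rough illustration of the combined formula, the sketch below multiplies the four factors. Several details are assumptions rather than documented behaviour: the ±0.1 sample-size adjustment is modelled as a multiplier of 1.1 or 0.9, an unflagged paper gets a retraction penalty of 1.0, and the names `sample_size_adjustment`, `RETRACTION_PENALTY`, and `final_weight` are hypothetical. The caller supplies the study-design and source-type scores, since the per-category values are not listed here.

```python
# Assumption: only the 0.0 and 0.3 values come from the text above;
# 1.0 for "none" is inferred from the stated 0.0-1.0 range.
RETRACTION_PENALTY = {
    "none": 1.0,
    "expression_of_concern": 0.3,
    "retracted": 0.0,
}


def sample_size_adjustment(n):
    """Map a sample size to a multiplier (assumed form of the +/-0.1 rule)."""
    if n is None:
        return 1.0  # unknown sample size: no adjustment
    if n >= 1000:
        return 1.1  # bonus for large studies
    if n < 50:
        return 0.9  # penalty for very small studies
    return 1.0


def final_weight(study_design, source_type, n=None, retraction="none"):
    """Combine the four factors into one evidence weight."""
    for name, score in (("study_design", study_design), ("source_type", source_type)):
        if not 0.3 <= score <= 1.0:
            raise ValueError(f"{name} must be within 0.3-1.0, got {score}")
    return (
        study_design
        * source_type
        * sample_size_adjustment(n)
        * RETRACTION_PENALTY[retraction]
    )
```

Under these assumptions a fully retracted paper always scores 0.0 regardless of its other factors, and a large top-tier study can exceed 1.0 unless the result is clamped.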
Community Detection¶
Louvain algorithm for detecting entity communities, with LLM-generated summaries for global query mode:
- Community IDs assigned to entity nodes
- Community nodes store summaries and embeddings
- Global queries retrieve relevant community summaries by semantic similarity
See Community Detection Validation.
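The retrieval step for global queries can be sketched as a cosine-similarity ranking over community embeddings. This is a minimal in-memory illustration, not the project's implementation: the dictionary keys `id`, `summary`, and `embedding` and the function names are hypothetical, and a real deployment would more likely query a vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_communities(query_embedding, communities, k=3):
    """Return the k communities whose embeddings best match the query."""
    scored = [(cosine(query_embedding, c["embedding"]), c) for c in communities]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [community for _, community in scored[:k]]
```

The summaries of the returned communities would then be passed to the LLM as context for answering the global query.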
Protein-Protein Interaction (PPI) Framework¶
Multi-source PPI integration with three merge strategies:
| Strategy | Description |
|---|---|
| merge_evidence | Combine evidence from all sources per protein pair |
| update_best | Keep only highest-confidence interaction |
| preserve_all | Separate relationships per source |
Supports STRING, FunCoup, BioGRID, and custom CSV formats.
💡 For live API access: STRING interactions can also be fetched directly via the Science Skills enricher (uv run python -m pipeline.processors.science_skills_enricher --sources string), which additionally provides HPA tissue expression, Reactome pathways, ChEMBL drug targets, ClinVar variants, and OpenAlex citations. See Integrations.
See PPI Framework and Multi-Source PPI Guide.
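The difference between the three merge strategies is easiest to see on a toy input. The sketch below is illustrative only: the record fields (`protein_a`, `protein_b`, `source`, `confidence`), the output shape, and the choice to keep the maximum confidence under `merge_evidence` are assumptions, not the framework's actual schema or scoring rule.

```python
from collections import defaultdict


def pair_key(record):
    """Order-independent key so (A, B) and (B, A) count as the same pair."""
    return tuple(sorted((record["protein_a"], record["protein_b"])))


def merge_interactions(records, strategy):
    """Apply one of the three merge strategies to a list of PPI records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[pair_key(record)].append(record)

    merged = []
    for pair, group in grouped.items():
        if strategy == "merge_evidence":
            # One relationship per pair, listing every contributing source.
            merged.append({
                "pair": pair,
                "sources": sorted(r["source"] for r in group),
                "confidence": max(r["confidence"] for r in group),  # assumed rule
            })
        elif strategy == "update_best":
            # One relationship per pair, taken from the most confident source.
            best = max(group, key=lambda r: r["confidence"])
            merged.append({
                "pair": pair,
                "sources": [best["source"]],
                "confidence": best["confidence"],
            })
        elif strategy == "preserve_all":
            # One relationship per (pair, source).
            for r in group:
                merged.append({
                    "pair": pair,
                    "sources": [r["source"]],
                    "confidence": r["confidence"],
                })
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return merged
```

For a pair reported by two sources, `merge_evidence` and `update_best` each yield one relationship, while `preserve_all` yields two.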
Cell-Based Conversational Interface¶
Tree-structured conversations for parallel hypothesis exploration:
Cell 1: "What proteins are associated with diabetes?"
├── Cell 1a: "Focus on Type 2 diabetes"
│ ├── Cell 1a-i: "Show protein-protein interactions"
│ └── Cell 1a-ii: "Find drug targets"
└── Cell 1b: "Compare with Type 1 diabetes"
See Cell-Based Interface.
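One way to model such a conversation tree is a node that knows its parent and children, so each cell can reconstruct the chain of prompts that led to it. The `Cell` class and its `branch` and `context` methods below are a hypothetical sketch, not the interface's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Cell:
    cell_id: str
    prompt: str
    # repr=False avoids infinite recursion through parent <-> children links.
    parent: Optional["Cell"] = field(default=None, repr=False)
    children: List["Cell"] = field(default_factory=list)

    def branch(self, cell_id, prompt):
        """Create a child cell that inherits this cell's context."""
        child = Cell(cell_id, prompt, parent=self)
        self.children.append(child)
        return child

    def context(self):
        """Return the prompts from the root down to this cell."""
        chain = []
        node = self
        while node is not None:
            chain.append(node.prompt)
            node = node.parent
        return list(reversed(chain))


root = Cell("1", "What proteins are associated with diabetes?")
t2 = root.branch("1a", "Focus on Type 2 diabetes")
ppi = t2.branch("1a-i", "Show protein-protein interactions")
t1 = root.branch("1b", "Compare with Type 1 diabetes")
```

Here `ppi.context()` contains the three prompts on its own branch, while `t1.context()` contains only the root prompt and its own, so sibling hypotheses do not leak into each other.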
Semantic Dynamic Context Graph (SDCG)¶
Design vision for evolving the static knowledge graph (KG) into a living knowledge fabric:
- Temporal validity windows on nodes and edges
- Provenance chains: dataset → row → extraction → entity → relationship
- User context as graph structure: role, team, and description become actual graph nodes
- Statistical edges with confidence intervals and sample sizes
- Context-aware queries: team-scoped views, role-weighted relevance
Status: v1 infrastructure complete (provenance, temporal, user context, confidence scoring, contradiction detection). See SDCG Design Brief and Expansion Roadmap.
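To make temporal validity windows and provenance chains concrete, the sketch below attaches both to an edge and filters edges by date. Everything here is an assumption for illustration: the `Edge` fields, the half-open interval convention, and the identifier strings are hypothetical and do not describe the v1 schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str
    valid_from: date
    valid_to: Optional[date]  # None means the edge is still valid
    # Provenance chain: dataset -> row -> extraction -> entity -> relationship
    provenance: Tuple[str, ...] = ()

    def is_valid_on(self, day):
        """True if day falls inside the half-open window [valid_from, valid_to)."""
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def edges_valid_on(edges, day):
    """Keep only the edges whose validity window covers the given day."""
    return [edge for edge in edges if edge.is_valid_on(day)]
```

A query scoped to a point in time would apply a filter like `edges_valid_on` before traversal, so superseded relationships stay in the graph for audit purposes without affecting current answers.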
Incremental Embeddings¶
Generate embeddings only for new content, reusing existing ones:
- Queries for nodes with WHERE n.embedding IS NULL
- Neo4j vector indexes automatically include new embeddings
- No manual index rebuild required
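The incremental approach can be sketched as follows. The Cypher string shows what a query for unembedded nodes might look like; the `Entity` label and the `id` and `name` properties are assumptions. The `embed_missing` helper is hypothetical and works on plain dictionaries so that the idea can be shown without a database connection.

```python
# Hypothetical query: label and property names are assumptions.
MISSING_EMBEDDINGS_QUERY = (
    "MATCH (n:Entity) WHERE n.embedding IS NULL "
    "RETURN n.id AS id, n.name AS text"
)


def embed_missing(nodes, embed):
    """Embed only the nodes that lack an embedding.

    nodes: list of dicts with a 'text' key and an optional 'embedding' key.
    embed: callable mapping text to a list of floats.
    Returns the number of nodes that were newly embedded.
    """
    count = 0
    for node in nodes:
        if node.get("embedding") is None:
            node["embedding"] = embed(node["text"])
            count += 1
    return count
```

Because existing embeddings are never recomputed, a second call over the same nodes makes no embedding calls at all.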
Enhanced Node Management¶
Two-phase embedding strategy for enriched properties:
- Phase 1 (during ingestion): chunk embeddings only
- Phase 2 (after enrichment): graph node embeddings with all properties
GFQL Query Patterns¶
Cypher patterns inspired by PyGraphistry's GFQL for multi-hop traversal, conditional filtering, hierarchical disease queries, and pathway co-occurrence analysis.
See GFQL Patterns.