Development

Contributing

Code Style

Ruff for linting and formatting (88-character lines, double quotes, space indentation)
mypy strict mode (disallow_untyped_defs, disallow_incomplete_defs)
Import paths: use src., api., and pipeline. as the top-level packages
Never use stale prefixes such as src.api. or src.ingest_pipeline.
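
The import-path rule above lends itself to an automated check. The following is a minimal sketch, not part of the project: the helper name find_stale_imports is hypothetical, and the only stale prefixes it knows about are the two named in this section.

```python
import ast

# The two prefixes this section names as stale; extend as needed.
STALE_PREFIXES = ("src.api.", "src.ingest_pipeline.")


def find_stale_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, module) for each absolute import with a stale prefix."""
    hits: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Appending a dot lets "src.api" itself match while "src.apiary" does not.
            if (module + ".").startswith(STALE_PREFIXES):
                hits.append((node.lineno, module))
    return sorted(hits)
```

Calling find_stale_imports on a file's contents returns an empty list when every import already follows the convention.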

Key Conventions

Factory pattern: use src/factories/ for creating database, LLM, embedding, agent instances
Domain routers: each API domain has its own router with a routes/ subdirectory
Shared API code: cross-cutting concerns in api/shared/
CLI entry points: ingest_main.py, enrichment_main.py, parallel_ingest.py at top level
CDK is isolated: cdk_resources/ has its own pyproject.toml and uv.lock
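
As an illustration of the factory convention, here is a minimal sketch of what a database factory could look like. Every name in it (Database, Neo4jDatabase, InMemoryDatabase, create_database) is hypothetical and chosen for the example; the actual modules under src/factories/ may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Database(Protocol):
    """Interface callers depend on instead of a concrete backend."""

    def describe(self) -> str: ...


@dataclass(frozen=True)
class Neo4jDatabase:
    uri: str

    def describe(self) -> str:
        return f"neo4j at {self.uri}"


@dataclass(frozen=True)
class InMemoryDatabase:
    uri: str

    def describe(self) -> str:
        return "in-memory database"


# Registry mapping a backend name to the callable that builds it.
_BACKENDS: dict[str, Callable[[str], Database]] = {
    "neo4j": Neo4jDatabase,
    "memory": InMemoryDatabase,
}


def create_database(backend: str, uri: str = "") -> Database:
    """Build a database instance by name; reject unknown backends."""
    try:
        builder = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown database backend: {backend!r}") from None
    return builder(uri)
```

Callers ask the factory for an instance, for example create_database("memory"), and never import a concrete backend class directly, which keeps backend selection in one place.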

Running Checks

uv run ruff check .          # Lint
uv run ruff format .         # Format
uv run mypy src/ api/ pipeline/  # Type check
uv run pytest                # Unit tests
uv run pytest -m properties  # Property-based tests
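
To run all five checks in one go, a small wrapper script is one option. The sketch below is an assumption rather than an existing project script: CHECKS and run_checks are hypothetical names, and the commands are copied verbatim from the list above.

```python
import subprocess
from typing import Callable, Sequence

# The five commands listed above, in the same order.
CHECKS: list[tuple[str, list[str]]] = [
    ("lint", ["uv", "run", "ruff", "check", "."]),
    ("format", ["uv", "run", "ruff", "format", "."]),
    ("type check", ["uv", "run", "mypy", "src/", "api/", "pipeline/"]),
    ("unit tests", ["uv", "run", "pytest"]),
    ("property tests", ["uv", "run", "pytest", "-m", "properties"]),
]


def _run(command: Sequence[str]) -> int:
    """Run one command and return its exit code without raising on failure."""
    return subprocess.run(command, check=False).returncode


def run_checks(run: Callable[[Sequence[str]], int] = _run) -> list[str]:
    """Run every check and return the names of those that exited non-zero."""
    failed: list[str] = []
    for name, command in CHECKS:
        if run(command) != 0:
            failed.append(name)
    return failed
```

The runner is injectable so the wrapper can be exercised without invoking uv. A script entry point could call run_checks() and exit non-zero when the returned list is not empty.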

Onboarding New Developers

See Modular Projects for New Hires for 9 self-contained work packages ranked by independence and impact:

1. Evaluation Module Hardening (10/10 independence): learn all agents
2. MCP Server Expansion (10/10): simple pattern, fast wins
3. Fine-Tuned SLMs (9/10): high impact if ML background
4. 3D Protein Structure (9/10): greenfield, bio background
5. DisGeNET/OpenTargets (9/10): clear API integration
6. Evidence Weighting (8/10): domain understanding needed
7. OpenAPI Type Generation (8/10): mechanical refactoring
8. Playwright E2E (9/10): frontend-leaning
9. PMC Full-Text (7/10): touches core pipeline

📌 Note (June 2026): The scope of Project 5 (DisGeNET/OpenTargets) has been reduced: ChEMBL drug-target data is now available via the Science Skills integration. STRING PPI data is now also queryable for Project 2 (MCP Server Expansion). See the full document for details.

Project Roadmap

6-Month Roadmap (April–September 2026)

Phase 1, Stabilize (Weeks 1–8): router extraction, CI, onboarding, frontend cleanup
Phase 2, Accelerate (Weeks 9–18): async pipeline, entity resolution, data sources, observability
Phase 3, Scale (Weeks 19–26): auth, RBAC, Neptune, export, backup

See 6-Month Roadmap for the full sprint-by-sprint plan with Jira/GitHub Projects integration.

Restructuring Plan

The codebase is being reorganized into clearer domain boundaries:

Q&A side (proprietary): agents, KG tools, query service, memory, confidence scoring
Ingestion side (open-source candidate): pipeline, processors, enrichment, fetchers
Shared: core, models, cache, factories, security, utils

See Restructuring Sketch for the proposed directory structure and open-source strategy.

Status Reports

Current state assessments:

Backend Status: API monolith issues, service layer, security
Frontend Status: god components, missing shared client, test gaps
Pipeline & Infra Status: active specs, CI/CD, Docker

Entity Extraction Strategy

The system uses a hybrid "extract broadly, filter precisely" approach:

LLM: broad entity extraction from text (permissive by design)
Ontology filter: validates against UniProt/MONDO, removes spurious entities
User control: configurable entity types, filtering strictness, custom rules
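
The filtering half of this approach can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: Candidate, ONTOLOGY, and filter_candidates are hypothetical names, the tiny in-memory vocabularies stand in for real UniProt and MONDO lookups, and strictness is reduced to a single boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One entity proposed by the permissive LLM extraction step."""

    text: str
    entity_type: str  # e.g. "protein" or "disease"


# Stand-in vocabularies; a real filter would query UniProt and MONDO.
ONTOLOGY: dict[str, set[str]] = {
    "protein": {"tp53", "brca1"},
    "disease": {"breast cancer"},
}


def filter_candidates(
    candidates: list[Candidate],
    allowed_types: set[str],
    strict: bool = True,
) -> list[Candidate]:
    """Keep candidates of an allowed type; in strict mode also require an ontology match."""
    kept: list[Candidate] = []
    for candidate in candidates:
        if candidate.entity_type not in allowed_types:
            continue  # user control: this entity type is switched off
        known = candidate.text.lower() in ONTOLOGY.get(candidate.entity_type, set())
        if known or not strict:
            kept.append(candidate)
    return kept
```

With strict=True, a spurious candidate such as Candidate("the study", "protein") is dropped because it has no ontology match; with strict=False it passes through as long as its type is allowed.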

See User vs LLM Decisions.

Alternative Database Evaluation

Kuzu was evaluated as an alternative to Neo4j. Conclusion: it is not suitable as the primary backend (no vector search), but it could serve as a secondary analytics database.

See Kuzu Evaluation.