Development¶
Contributing¶
Code Style¶
- Ruff for linting and formatting (88-char lines, double quotes, spaces)
- mypy strict mode (disallow_untyped_defs, disallow_incomplete_defs)
- Import paths: use src., api., pipeline. as top-level packages
- Never use stale prefixes like src.api., src.ingest_pipeline.
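As a rough illustration of what these rules ask for, the sketch below shows a function that would satisfy disallow_untyped_defs and disallow_incomplete_defs: every parameter and the return value are annotated, and strings use double quotes. The function name count_mentions and its behavior are hypothetical and not taken from the codebase.

```python
from collections.abc import Iterable


def count_mentions(tokens: Iterable[str], entity: str) -> int:
    """Count case-insensitive occurrences of ``entity`` in ``tokens``."""
    # Every parameter and the return type are annotated, so mypy's
    # disallow_untyped_defs / disallow_incomplete_defs checks accept this def.
    target = entity.lower()
    return sum(1 for token in tokens if token.lower() == target)


if __name__ == "__main__":
    print(count_mentions(["TP53", "brca1", "tp53"], "TP53"))
```

Dropping any one annotation (for example the return type) is the kind of partial signature that disallow_incomplete_defs reports.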
Key Conventions¶
- Factory pattern: use src/factories/ for creating database, LLM, embedding, agent instances
- Domain routers: each API domain has its own router with a routes/ subdirectory
- Shared API code: cross-cutting concerns in api/shared/
- CLI entry points: ingest_main.py, enrichment_main.py, parallel_ingest.py at top level
- CDK is isolated: cdk_resources/ has its own pyproject.toml and uv.lock
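The factory convention can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual contents of src/factories/: the names EmbeddingClient, FakeEmbeddingClient, and create_embedding_client are all hypothetical, and the fake client only exists so the example runs without a real model.

```python
from collections.abc import Callable
from dataclasses import dataclass
from typing import Protocol


class EmbeddingClient(Protocol):
    """Interface callers depend on (hypothetical, for illustration)."""

    def embed(self, text: str) -> list[float]: ...


@dataclass(frozen=True)
class FakeEmbeddingClient:
    """Deterministic stand-in so the sketch runs without a real model."""

    dimensions: int = 4

    def embed(self, text: str) -> list[float]:
        return [float(len(text) % (i + 2)) for i in range(self.dimensions)]


# Registry mapping a provider name to a zero-argument constructor.
_PROVIDERS: dict[str, Callable[[], EmbeddingClient]] = {
    "fake": FakeEmbeddingClient,
}


def create_embedding_client(provider: str = "fake") -> EmbeddingClient:
    """Single entry point for building embedding clients."""
    try:
        constructor = _PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"Unknown embedding provider: {provider!r}") from None
    return constructor()


if __name__ == "__main__":
    client = create_embedding_client("fake")
    print(len(client.embed("abc")))
```

Callers ask the factory for an instance by name instead of constructing a concrete class, so swapping a provider only touches the registry.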
Running Checks¶
uv run ruff check . # Lint
uv run ruff format . # Format
uv run mypy src/ api/ pipeline/ # Type check
uv run pytest # Unit tests
uv run pytest -m properties # Property-based tests
Onboarding New Developers¶
See Modular Projects for New Hires for 9 self-contained work packages ranked by independence and impact:
1. Evaluation Module Hardening (10/10 independence): learn all agents
2. MCP Server Expansion (10/10): simple pattern, fast wins
3. Fine-Tuned SLMs (9/10): high impact if ML background
4. 3D Protein Structure (9/10): greenfield, bio background
5. DisGeNET/OpenTargets (9/10): clear API integration
6. Evidence Weighting (8/10): domain understanding needed
7. OpenAPI Type Generation (8/10): mechanical refactoring
8. Playwright E2E (9/10): frontend-leaning
9. PMC Full-Text (7/10): touches core pipeline
📌 Note (June 2026): Project 5 (DisGeNET/OpenTargets) scope has been reduced — ChEMBL drug-target data is now available via Science Skills integration. STRING PPI data is also now queryable for Project 2 (MCP Server). See the full document for details.
Project Roadmap¶
6-Month Roadmap (April–September 2026)¶
- Phase 1: Stabilize (Weeks 1–8). Router extraction, CI, onboarding, frontend cleanup.
- Phase 2: Accelerate (Weeks 9–18). Async pipeline, entity resolution, data sources, observability.
- Phase 3: Scale (Weeks 19–26). Auth, RBAC, Neptune, export, backup.
See 6-Month Roadmap for the full sprint-by-sprint plan with Jira/GitHub Projects integration.
Restructuring Plan¶
The codebase is being reorganized into clearer domain boundaries:
- Q&A side (proprietary): agents, KG tools, query service, memory, confidence scoring
- Ingestion side (open-source candidate): pipeline, processors, enrichment, fetchers
- Shared: core, models, cache, factories, security, utils
See Restructuring Sketch for the proposed directory structure and open-source strategy.
Status Reports¶
Current state assessments:
- Backend Status: API monolith issues, service layer, security
- Frontend Status: god components, missing shared client, test gaps
- Pipeline & Infra Status: active specs, CI/CD, Docker
Entity Extraction Strategy¶
The system uses a hybrid "extract broadly, filter precisely" approach:
- LLM: broad entity extraction from text (permissive by design)
- Ontology filter: validates against UniProt/MONDO, removes spurious entities
- User control: configurable entity types, filtering strictness, custom rules
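The two-stage idea can be sketched as below. This is only an illustration under stated assumptions: extract_candidates is a hard-coded stand-in for the LLM step, ONTOLOGY is a toy in-memory substitute for UniProt/MONDO lookups, and the names Entity, FilterConfig, and filter_entities are hypothetical rather than the project's actual API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    text: str
    type: str  # e.g. "protein" or "disease"


@dataclass
class FilterConfig:
    # User control: which entity types to keep and how strict to be.
    allowed_types: set[str] = field(default_factory=lambda: {"protein", "disease"})
    strict: bool = True  # strict mode drops anything the ontology does not know


# Toy stand-in for UniProt / MONDO lookups (contents are made up).
ONTOLOGY: dict[str, set[str]] = {
    "protein": {"tp53", "brca1"},
    "disease": {"breast cancer"},
}


def extract_candidates(text: str) -> list[Entity]:
    """Stand-in for the permissive LLM step; ignores ``text`` in this sketch."""
    return [
        Entity("TP53", "protein"),
        Entity("breast cancer", "disease"),
        Entity("the study", "disease"),  # spurious candidate
        Entity("aspirin", "drug"),  # type the user did not ask for
    ]


def filter_entities(candidates: list[Entity], config: FilterConfig) -> list[Entity]:
    """Keep allowed types; in strict mode also require an ontology match."""
    kept: list[Entity] = []
    for entity in candidates:
        if entity.type not in config.allowed_types:
            continue
        known = entity.text.lower() in ONTOLOGY.get(entity.type, set())
        if known or not config.strict:
            kept.append(entity)
    return kept


if __name__ == "__main__":
    candidates = extract_candidates("TP53 mutations in breast cancer ...")
    for entity in filter_entities(candidates, FilterConfig()):
        print(entity.type, entity.text)
```

With the default strict configuration, the spurious "the study" candidate is dropped by the ontology check and "aspirin" is dropped by the type filter; setting strict=False keeps unmatched candidates of allowed types.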
Alternative Database Evaluation¶
Kuzu was evaluated as an alternative to Neo4j. Conclusion: it is not suitable as the primary backend (no vector search), but it could serve as a secondary analytics database.
See Kuzu Evaluation.