Development

Contributing

Code Style

  • Ruff for linting and formatting (88-char lines, double quotes, spaces)
  • mypy strict mode (disallow_untyped_defs, disallow_incomplete_defs)
  • Import paths: treat src, api, and pipeline as the top-level packages, so imports start with src., api., or pipeline.
  • Never use stale prefixes such as src.api. or src.ingest_pipeline.
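
The import rule above can be checked mechanically. The following is a minimal sketch, not part of the project tooling: the helper name find_stale_imports and the sample module names (such as src.factories.create_database) are hypothetical, and the stale prefixes are only the two examples listed above.

```python
import ast

# Assumption: only the two stale prefixes named in the style rules are checked.
STALE_PREFIXES = ("src.api", "src.ingest_pipeline")


def find_stale_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, module name) for each import using a stale prefix."""
    hits: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Match the prefix itself or any submodule of it.
            if any(module == p or module.startswith(p + ".") for p in STALE_PREFIXES):
                hits.append((node.lineno, module))
    return sorted(hits)


# Hypothetical module contents used only to exercise the checker.
SAMPLE = """\
from src.factories import create_database
from src.api.shared import errors
import src.ingest_pipeline.fetchers
from api.shared import errors as api_errors
"""

for lineno, module in find_stale_imports(SAMPLE):
    print(f"line {lineno}: stale import prefix in {module}")
```

A check like this could run alongside the lint step, although Ruff and mypy remain the authoritative tools for this repository.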

Key Conventions

  • Factory pattern: use src/factories/ to create database, LLM, embedding, and agent instances
  • Domain routers: each API domain has its own router with a routes/ subdirectory
  • Shared API code: cross-cutting concerns in api/shared/
  • CLI entry points: ingest_main.py, enrichment_main.py, parallel_ingest.py at top level
  • CDK is isolated: cdk_resources/ has its own pyproject.toml and uv.lock
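
To illustrate the factory convention, here is a minimal sketch of what a factory function might look like. Every name in it (Embedder, FakeEmbedder, create_embedder, the "fake" provider key) is hypothetical and chosen for illustration; the actual modules under src/factories/ may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Embedder(Protocol):
    """Interface that callers depend on instead of a concrete backend."""

    def embed(self, text: str) -> list[float]: ...


@dataclass
class FakeEmbedder:
    """Hypothetical stand-in backend, for example for unit tests."""

    dimensions: int = 4

    def embed(self, text: str) -> list[float]:
        # Deterministic toy vector; not a real embedding.
        return [float(len(text))] * self.dimensions


# Registry mapping a provider name to a zero-argument builder.
_EMBEDDER_BUILDERS: dict[str, Callable[[], Embedder]] = {
    "fake": FakeEmbedder,
}


def create_embedder(provider: str = "fake") -> Embedder:
    """Build an embedder by provider name so callers never import a backend directly."""
    try:
        builder = _EMBEDDER_BUILDERS[provider]
    except KeyError:
        raise ValueError(f"Unknown embedding provider: {provider!r}") from None
    return builder()


embedder = create_embedder("fake")
print(embedder.embed("TP53"))
```

Routing construction through one function keeps provider selection in a single place, which is the usual motivation for this pattern.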

Running Checks

uv run ruff check .          # Lint
uv run ruff format .         # Format
uv run mypy src/ api/ pipeline/  # Type check
uv run pytest                # Unit tests
uv run pytest -m properties  # Property-based tests

Onboarding New Developers

See Modular Projects for New Hires for 9 self-contained work packages ranked by independence and impact:

  1. Evaluation Module Hardening (10/10 independence) — learn all agents
  2. MCP Server Expansion (10/10) — simple pattern, fast wins
  3. Fine-Tuned SLMs (9/10) — high impact if ML background
  4. 3D Protein Structure (9/10) — greenfield, bio background
  5. DisGeNET/OpenTargets (9/10) — clear API integration
  6. Evidence Weighting (8/10) — domain understanding needed
  7. OpenAPI Type Generation (8/10) — mechanical refactoring
  8. Playwright E2E (9/10) — frontend-leaning
  9. PMC Full-Text (7/10) — touches core pipeline

📌 Note (June 2026): Project 5 (DisGeNET/OpenTargets) scope has been reduced — ChEMBL drug-target data is now available via Science Skills integration. STRING PPI data is also now queryable for Project 2 (MCP Server). See the full document for details.

Project Roadmap

6-Month Roadmap (April–September 2026)

  • Phase 1, Stabilize (Weeks 1–8): router extraction, CI, onboarding, frontend cleanup
  • Phase 2, Accelerate (Weeks 9–18): async pipeline, entity resolution, data sources, observability
  • Phase 3, Scale (Weeks 19–26): auth, RBAC, Neptune, export, backup

See 6-Month Roadmap for the full sprint-by-sprint plan with Jira/GitHub Projects integration.

Restructuring Plan

The codebase is being reorganized into clearer domain boundaries:

  • Q&A side (proprietary): agents, KG tools, query service, memory, confidence scoring
  • Ingestion side (open-source candidate): pipeline, processors, enrichment, fetchers
  • Shared: core, models, cache, factories, security, utils

See Restructuring Sketch for the proposed directory structure and open-source strategy.

Status Reports

Current state assessments:

  • Backend Status: API monolith issues, service layer, security
  • Frontend Status: god components, missing shared client, test gaps
  • Pipeline & Infra Status: active specs, CI/CD, Docker

Entity Extraction Strategy

The system uses a hybrid "extract broadly, filter precisely" approach:

  • LLM: broad entity extraction from text (permissive by design)
  • Ontology filter: validates against UniProt/MONDO, removes spurious entities
  • User control: configurable entity types, filtering strictness, custom rules
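
The filtering half of this approach can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: Entity, KNOWN_TERMS, and filter_entities are hypothetical names, the in-memory term sets stand in for real UniProt/MONDO lookups, and the meaning given to the strict flag (non-strict keeps unvalidated entities of an enabled type) is one possible reading of "filtering strictness".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    entity_type: str  # for example "protein" or "disease"


# Hypothetical stand-in for ontology lookups (UniProt for proteins, MONDO for diseases).
KNOWN_TERMS: dict[str, set[str]] = {
    "protein": {"tp53", "brca1"},
    "disease": {"breast cancer"},
}


def filter_entities(
    candidates: list[Entity],
    allowed_types: set[str],
    strict: bool = True,
) -> list[Entity]:
    """Keep candidates of enabled types; in strict mode also require an ontology match."""
    kept: list[Entity] = []
    for entity in candidates:
        if entity.entity_type not in allowed_types:
            continue  # user control: this entity type is not enabled
        validated = entity.text.lower() in KNOWN_TERMS.get(entity.entity_type, set())
        if validated or not strict:
            kept.append(entity)
    return kept


# Stand-in for permissive LLM output, including one spurious entity.
candidates = [
    Entity("TP53", "protein"),
    Entity("the study", "protein"),
    Entity("breast cancer", "disease"),
]

print(filter_entities(candidates, {"protein", "disease"}))
```

Keeping the LLM step permissive and putting precision in a deterministic filter means the strictness can be tuned per user without re-running extraction.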

See User vs LLM Decisions.

Alternative Database Evaluation

Kuzu was evaluated as an alternative to Neo4j. Conclusion: it is not suitable as the primary backend (no vector search), but it could serve as a secondary analytics database.

See Kuzu Evaluation.