Testing

Overview

More than 1,065 tests across 5 categories, with every test assigned to a category. The unit suite completes in under 3 minutes when run in parallel (pytest-xdist, -n auto).

| Category       | Count | %    | Description                              |
|----------------|-------|------|------------------------------------------|
| Unit           | 846   | ~79% | Fast, isolated, mocked dependencies      |
| Property-based | 136   | ~13% | Hypothesis-based correctness properties  |
| Slow           | 53    | ~5%  | Long-running or report-only              |
| Integration    | 56    | ~5%  | Requires live Neo4j, Redis, or LLM       |
| Known failures | 2     | <1%  | Expected failures, don't block CI        |

Running Tests

# Default: unit tests (excludes integration/slow)
uv run pytest

# Specific categories
uv run pytest -m properties
uv run pytest -m integration
uv run pytest -m slow

# Quick run with short output
uv run pytest tests/unit/ -q --tb=short -x

# Parallel execution (default)
uv run pytest -n auto

Test Markers

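| Marker      | Description                      | Timeout |
|-------------|----------------------------------|---------|
| unit        | Fast, isolated, mocked (default) | 3 min   |
| integration | Requires live services           | 10 min  |
| slow        | Long-running                     | 15 min  |
| properties  | Hypothesis property-based        | 10 min  |
| xfail       | Expected failures                |         |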
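
One way the "default" behaviour of the unit marker could work is a collection hook that tags every test lacking a category marker. The sketch below is an assumption about the mechanism, not the project's actual conftest.py; `CATEGORY_MARKERS` and `needs_default_marker` are hypothetical names, while `pytest_collection_modifyitems`, `iter_markers`, and `add_marker` are standard pytest APIs.

```python
# conftest.py sketch (hypothetical; the project's real hook may differ)

# Category markers taken from the table above; xfail is not a category.
CATEGORY_MARKERS = {"unit", "integration", "slow", "properties"}


def needs_default_marker(marker_names):
    """Return True when a test carries none of the category markers."""
    return CATEGORY_MARKERS.isdisjoint(marker_names)


def pytest_collection_modifyitems(config, items):
    """Tag every uncategorized test as `unit` so `-m unit` selects it."""
    for item in items:
        names = {mark.name for mark in item.iter_markers()}
        if needs_default_marker(names):
            # add_marker accepts a marker name as a plain string.
            item.add_marker("unit")
```

With a hook like this, a test decorated only with `@pytest.mark.xfail` would still be collected as a unit test, whereas one decorated with `@pytest.mark.slow` would keep its own category.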

Test Organization

tests/
├── unit/                 # Fast isolated tests with mocks
├── integration/          # Tests with real services
├── properties/           # Hypothesis property-based tests
├── services/             # Service-layer tests
├── performance/          # Performance benchmarks
├── api/                  # API endpoint tests
├── fixtures/             # JSON test fixtures
├── conftest.py           # Shared fixtures (mock_llm, mock_neo4j_database, etc.)
├── assertion_helpers.py  # Reusable assertion utilities
└── mock_neo4j_database.py # Mock Neo4j for unit testing

Mocking Strategy

Key fixtures in conftest.py:

- mock_neo4j_database: mocked Neo4j with _execute_cypher
- mock_llm: mocked LLM responses
- mock_embedder: mocked embedding model
- mock_redis: mocked Redis client

Example:

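import json
# GraphTools is imported from the project source (import path not shown here).

def test_graph_tool_count_nodes(mock_neo4j_database):
    # Stub the Cypher call so the test needs no live Neo4j instance.
    mock_neo4j_database._execute_cypher.return_value = [{"count": 42}]
    tools = GraphTools(db=mock_neo4j_database)
    # The tool returns a JSON string, so decode it before asserting.
    result = tools.get_tools()[0].invoke({"label": "Protein"})
    assert json.loads(result)["count"] == 42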
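
For readers unfamiliar with how such a fixture can be built, here is a minimal sketch using `unittest.mock.MagicMock` from the standard library. It is an illustration under assumptions, not the contents of the real conftest.py: `make_mock_neo4j_database` and `count_nodes` are hypothetical names, and in a conftest the factory would normally be wrapped in `@pytest.fixture`.

```python
from unittest.mock import MagicMock


def make_mock_neo4j_database():
    """Build a stand-in database whose _execute_cypher returns no rows by default."""
    db = MagicMock(name="mock_neo4j_database")
    db._execute_cypher.return_value = []
    return db


def count_nodes(db, label):
    """Hypothetical code under test: run a count query and return the number."""
    rows = db._execute_cypher(
        "MATCH (n) WHERE $label IN labels(n) RETURN count(n) AS count",
        {"label": label},
    )
    return rows[0]["count"] if rows else 0


# Usage: each test overrides return_value, then inspects the recorded call.
db = make_mock_neo4j_database()
db._execute_cypher.return_value = [{"count": 42}]
node_count = count_nodes(db, "Protein")
```

Because the mock records its calls, a test can also assert on the query parameters that were passed, which catches mistakes that a return-value check alone would miss.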

Property-Based Testing

136 tests using Hypothesis for correctness properties across:

- Entity models, query models, session models
- Evaluation scorers and persistence
- PDF extraction priority
- Security (Cypher injection, PII detection)
- API validation
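
As an illustration of what a security property might look like, the sketch below checks a label validator against arbitrary text. The validator `is_safe_label` and its regex are hypothetical stand-ins, not the project's implementation; `given`, `st.text`, and `st.from_regex` are standard Hypothesis APIs.

```python
import re

from hypothesis import given, strategies as st

# Hypothetical rule: a label is safe only if it is a plain identifier.
_LABEL_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_safe_label(label: str) -> bool:
    """Accept only labels that can be interpolated into Cypher without escaping."""
    return _LABEL_RE.fullmatch(label) is not None


@given(st.text())
def test_accepted_labels_contain_no_cypher_metacharacters(label):
    # Property: whatever the input, an accepted label never carries
    # characters that could break out of a label position.
    if is_safe_label(label):
        assert not set(label) & set("`'\"; {}()\n")


@given(st.from_regex(r"[A-Za-z_][A-Za-z0-9_]*", fullmatch=True))
def test_plain_identifiers_are_accepted(label):
    # Property: the validator does not reject legitimate identifiers.
    assert is_safe_label(label)
```

The two properties bound the validator from both sides: the first guards against accepting too much, the second against rejecting too much.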

The science skills integration adds 15 property tests covering:

- Command construction, temp file uniqueness, output parsing (SkillScriptRunner)
- Score/threshold filtering, batch partitioning, provenance metadata (base integrator)
- Failure threshold abort, graceful degradation, dry-run prevention
- Identifier resolution ordering (STRING map→network, HPA resolve→fetch)
- CLI source validation, schema completeness, summary report completeness
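
A batch-partitioning property of the kind listed above could be phrased as an invariant over arbitrary inputs. The `partition` function here is a hypothetical stand-in for the base integrator's batching logic, which is not shown in this document.

```python
from hypothesis import given, strategies as st


def partition(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


@given(st.lists(st.integers()), st.integers(min_value=1, max_value=50))
def test_partition_preserves_items_and_respects_batch_size(items, batch_size):
    batches = partition(items, batch_size)
    # No item is lost, duplicated, or reordered.
    assert [x for batch in batches for x in batch] == items
    # No batch is empty or larger than the requested size.
    assert all(1 <= len(batch) <= batch_size for batch in batches)
```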

Evaluation Framework

Automated accuracy benchmarks in src/evaluation/:

- 50 test queries across 5 categories in data/test_queries.yaml
- 4 scorers: keyword match, entity overlap, embedding similarity, LLM judge
- Agent adapters for LangGraph and Neo4j agents
- JSON result persistence and Markdown report generation
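
To make the two simplest scorers concrete, here is a minimal sketch of keyword match and entity overlap. The function names, signatures, and scoring formulas (fraction of keywords found, Jaccard overlap) are assumptions for illustration; the scorers in src/evaluation/ may be defined differently.

```python
def keyword_match_score(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def entity_overlap_score(predicted: set[str], expected: set[str]) -> float:
    """Jaccard overlap between predicted and expected entity sets."""
    union = predicted | expected
    if not union:
        # Both sets empty: nothing was expected and nothing was predicted.
        return 1.0
    return len(predicted & expected) / len(union)


# Usage with made-up inputs (not taken from data/test_queries.yaml).
kw_score = keyword_match_score(
    "TP53 regulates apoptosis", ["tp53", "apoptosis", "BRCA1"]
)
ent_score = entity_overlap_score({"TP53", "MDM2"}, {"TP53", "BRCA1"})
```

Both scores fall in the range 0 to 1, which makes them straightforward to aggregate alongside embedding similarity or an LLM-judge score.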

# Validate test query set
uv run python scripts/validate_test_queries.py

Linting & Type Checking

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check (strict mode)
uv run mypy src/ api/ pipeline/

Ruff config: 88-character line length, Python 3.12 target, double quotes, space indentation. mypy: disallow_untyped_defs, disallow_incomplete_defs, strict_equality. Tests are excluded.

See Testing Strategy for the full testing guide.