Testing

Overview

The suite contains 1,065+ tests across 5 categories, and every test is assigned to a category. Unit tests complete in under 3 minutes when run in parallel (pytest-xdist, -n auto).

Category	Count	%	Description
Unit	846	~79%	Fast, isolated, mocked dependencies
Property-based	136	~13%	Hypothesis-based correctness properties
Slow	53	~5%	Long-running or report-only
Integration	56	~5%	Requires live Neo4j, Redis, or LLM
Known failures	2	<1%	Expected failures, don't block CI

Running Tests

# Default: unit tests (excludes integration/slow)
uv run pytest

# Specific categories
uv run pytest -m properties
uv run pytest -m integration
uv run pytest -m slow

# Quick run with short output
uv run pytest tests/unit/ -q --tb=short -x

# Parallel execution (default)
uv run pytest -n auto

Test Markers

Marker	Description	Timeout
`unit`	Fast, isolated, mocked (default)	3 min
`integration`	Requires live services	10 min
`slow`	Long-running	15 min
`properties`	Hypothesis property-based	10 min
`xfail`	Expected failures	—
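
The table lists `unit` as the default category. One way such a default could be applied is a collection hook in conftest.py that tags any test without a category marker. This is a minimal sketch under that assumption, not the project's actual hook: `CATEGORY_MARKERS` and the fallback rule are hypothetical, while `get_closest_marker` and `add_marker` are standard methods on pytest test items.

```python
# Hypothetical conftest.py hook: tag every uncategorized test as `unit`.
CATEGORY_MARKERS = ("unit", "integration", "slow", "properties")


def pytest_collection_modifyitems(config, items):
    """Give tests that carry no category marker the assumed default, `unit`."""
    for item in items:
        has_category = any(
            item.get_closest_marker(name) is not None for name in CATEGORY_MARKERS
        )
        if not has_category:
            item.add_marker("unit")
```

With a hook like this, a test is selected by `-m unit` even when its author never wrote a marker, which would be one way to keep every test categorized.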

Test Organization

tests/
├── unit/                 # Fast isolated tests with mocks
├── integration/          # Tests with real services
├── properties/           # Hypothesis property-based tests
├── services/             # Service-layer tests
├── performance/          # Performance benchmarks
├── api/                  # API endpoint tests
├── fixtures/             # JSON test fixtures
├── conftest.py           # Shared fixtures (mock_llm, mock_neo4j_database, etc.)
├── assertion_helpers.py  # Reusable assertion utilities
└── mock_neo4j_database.py # Mock Neo4j for unit testing

Mocking Strategy

Key fixtures in conftest.py:

- mock_neo4j_database: mocked Neo4j with _execute_cypher
- mock_llm: mocked LLM responses
- mock_embedder: mocked embedding model
- mock_redis: mocked Redis client

Example:

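import json  # GraphTools is imported from the project (path not shown here)


def test_graph_tool_count_nodes(mock_neo4j_database):
    # Stub the Cypher call so the test needs no live Neo4j instance.
    mock_neo4j_database._execute_cypher.return_value = [{"count": 42}]
    tools = GraphTools(db=mock_neo4j_database)
    # The first tool returns its result as a JSON string.
    result = tools.get_tools()[0].invoke({"label": "Protein"})
    assert json.loads(result)["count"] == 42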
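
The example above relies on a fixture whose definition is not shown. The sketch below illustrates how a mock of this kind could be built with `unittest.mock.MagicMock`. Both `make_mock_neo4j_database` and `count_nodes` are hypothetical stand-ins introduced for illustration; they are not the project's fixture or its `GraphTools` implementation.

```python
import json
from unittest.mock import MagicMock


def make_mock_neo4j_database(rows=None):
    """Build a stand-in database whose _execute_cypher returns canned rows."""
    db = MagicMock(name="mock_neo4j_database")
    db._execute_cypher.return_value = rows if rows is not None else []
    return db


def count_nodes(db, label):
    """Hypothetical tool body: run a count query and return JSON text."""
    rows = db._execute_cypher(
        "MATCH (n) WHERE $label IN labels(n) RETURN count(n) AS count",
        {"label": label},
    )
    return json.dumps({"label": label, "count": rows[0]["count"]})


db = make_mock_neo4j_database([{"count": 42}])
result = count_nodes(db, "Protein")
assert json.loads(result)["count"] == 42
# The mock records its calls, so a test can also check what was sent.
db._execute_cypher.assert_called_once()
```

In a conftest.py, a factory like this would typically be wrapped in a `@pytest.fixture` function so that tests receive the mock by naming it as a parameter.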

Property-Based Testing

136 tests use Hypothesis to check correctness properties across:

- Entity models, query models, session models
- Evaluation scorers and persistence
- PDF extraction priority
- Security (Cypher injection, PII detection)
- API validation

The science skills integration adds 15 property tests covering:

- Command construction, temp file uniqueness, output parsing (SkillScriptRunner)
- Score/threshold filtering, batch partitioning, provenance metadata (base integrator)
- Failure threshold abort, graceful degradation, dry-run prevention
- Identifier resolution ordering (STRING map→network, HPA resolve→fetch)
- CLI source validation, schema completeness, summary report completeness
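
To illustrate what a correctness property looks like, here is a minimal Hypothesis sketch for batch partitioning, one of the areas listed above. The `partition` helper is a hypothetical stand-in, not the base integrator's code. The property states that batching never loses, reorders, or over-packs items.

```python
from hypothesis import given, strategies as st


def partition(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


@given(
    items=st.lists(st.integers()),
    batch_size=st.integers(min_value=1, max_value=50),
)
def test_partition_preserves_items(items, batch_size):
    batches = partition(items, batch_size)
    # No batch is empty and none exceeds the size limit.
    assert all(1 <= len(batch) <= batch_size for batch in batches)
    # Concatenating the batches restores the input in its original order.
    assert [x for batch in batches for x in batch] == items
```

Hypothesis generates many input lists and batch sizes, and on failure shrinks them to a small counterexample. That is why this style suits invariants that should hold for every input rather than for a few hand-picked cases.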

Evaluation Framework

Automated accuracy benchmarks live in src/evaluation/:

- 50 test queries across 5 categories in data/test_queries.yaml
- 4 scorers: keyword match, entity overlap, embedding similarity, LLM judge
- Agent adapters for LangGraph and Neo4j agents
- JSON result persistence and Markdown report generation
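
The two simplest scorers named above could look roughly like the following sketch. The function names, the case-insensitive matching, and the choice of Jaccard overlap are all assumptions made for illustration. The scorers in src/evaluation/ may be defined differently.

```python
def keyword_match_score(answer, expected_keywords):
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)


def entity_overlap_score(predicted, expected):
    """Jaccard overlap between predicted and expected entity sets."""
    predicted_set = {entity.lower() for entity in predicted}
    expected_set = {entity.lower() for entity in expected}
    union = predicted_set | expected_set
    if not union:
        return 0.0
    return len(predicted_set & expected_set) / len(union)


# Two of three keywords appear in the answer.
print(keyword_match_score("TP53 regulates apoptosis", ["tp53", "apoptosis", "mdm2"]))
# One shared entity out of three distinct entities.
print(entity_overlap_score(["TP53", "MDM2"], ["tp53", "BRCA1"]))
```

Both functions return a value between 0 and 1, which makes scores comparable across scorers and easy to aggregate in a report.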

# Validate test query set
uv run python scripts/validate_test_queries.py

Linting & Type Checking

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check (strict mode)
uv run mypy src/ api/ pipeline/

Ruff is configured with an 88-character line length, a Python 3.12 target, double quotes, and space indentation. mypy enforces disallow_untyped_defs, disallow_incomplete_defs, and strict_equality. Tests are excluded.

See Testing Strategy for the full testing guide.