Testing
Overview
The suite contains 1,065+ tests across 5 categories, and every test is assigned to a category. Unit tests complete in under 3 minutes when run in parallel (pytest-xdist, -n auto).
| Category | Count | % | Description |
|---|---|---|---|
| Unit | 846 | ~79% | Fast, isolated, mocked dependencies |
| Property-based | 136 | ~13% | Hypothesis-based correctness properties |
| Slow | 53 | ~5% | Long-running or report-only |
| Integration | 56 | ~5% | Requires live Neo4j, Redis, or LLM |
| Known failures | 2 | <1% | Expected failures, don't block CI |
Running Tests
# Default: unit tests (excludes integration/slow)
uv run pytest
# Specific categories
uv run pytest -m properties
uv run pytest -m integration
uv run pytest -m slow
# Quick run with short output
uv run pytest tests/unit/ -q --tb=short -x
# Parallel execution (default)
uv run pytest -n auto
Test Markers
| Marker | Description | Timeout |
|---|---|---|
| unit | Fast, isolated, mocked (default) | 3 min |
| integration | Requires live services | 10 min |
| slow | Long-running | 15 min |
| properties | Hypothesis property-based | 10 min |
| xfail | Expected failures | none |
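The table lists unit as the default marker. One way such a default could be applied is a collection hook in conftest.py that tags any test lacking a category marker. This is a minimal sketch, not the project's actual hook: the CATEGORY_MARKERS tuple and the defaulting rule are assumptions, while pytest_collection_modifyitems, get_closest_marker, and add_marker are standard pytest names.

```python
# Hypothetical conftest.py fragment: give every uncategorized test the "unit" marker.
# CATEGORY_MARKERS is an assumed name, chosen here to mirror the marker table.
CATEGORY_MARKERS = ("unit", "integration", "slow", "properties")


def pytest_collection_modifyitems(config, items):
    """pytest hook, called once after test collection."""
    for item in items:
        has_category = any(
            item.get_closest_marker(name) is not None for name in CATEGORY_MARKERS
        )
        if not has_category:
            # add_marker accepts a marker name given as a string.
            item.add_marker("unit")
```

With a hook like this, a test only needs an explicit decorator (for example @pytest.mark.integration) when it belongs to a non-default category.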
Test Organization
tests/
├── unit/ # Fast isolated tests with mocks
├── integration/ # Tests with real services
├── properties/ # Hypothesis property-based tests
├── services/ # Service-layer tests
├── performance/ # Performance benchmarks
├── api/ # API endpoint tests
├── fixtures/ # JSON test fixtures
├── conftest.py # Shared fixtures (mock_llm, mock_neo4j_database, etc.)
├── assertion_helpers.py # Reusable assertion utilities
└── mock_neo4j_database.py # Mock Neo4j for unit testing
Mocking Strategy
Key fixtures in conftest.py:
- mock_neo4j_database — mocked Neo4j with _execute_cypher
- mock_llm — mocked LLM responses
- mock_embedder — mocked embedding model
- mock_redis — mocked Redis client
Example:
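import json

# GraphTools comes from the project's own tools module (import omitted here).
def test_graph_tool_count_nodes(mock_neo4j_database):
    # Stub the Cypher call so no live Neo4j instance is needed.
    mock_neo4j_database._execute_cypher.return_value = [{"count": 42}]
    tools = GraphTools(db=mock_neo4j_database)
    # Invoke the first registered tool with a node label.
    result = tools.get_tools()[0].invoke({"label": "Protein"})
    assert json.loads(result)["count"] == 42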
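To show the stubbing pattern without any project code, the following self-contained sketch builds a stand-in database with unittest.mock. The helper make_mock_neo4j_database, the count_nodes function, and the query text are hypothetical illustrations; only the _execute_cypher attribute name is taken from the fixture list above, and its real signature may differ.

```python
import json
from unittest.mock import MagicMock


def make_mock_neo4j_database() -> MagicMock:
    """Build a stand-in database; in conftest.py this would be a pytest fixture."""
    db = MagicMock(name="mock_neo4j_database")
    db._execute_cypher.return_value = []  # default: the query returns no rows
    return db


def count_nodes(db, label: str) -> str:
    """Hypothetical tool body: run a count query and return a JSON string."""
    rows = db._execute_cypher(
        "MATCH (n) WHERE $label IN labels(n) RETURN count(n) AS count",
        {"label": label},  # assumed parameter-passing style
    )
    return json.dumps({"count": rows[0]["count"] if rows else 0})


db = make_mock_neo4j_database()
db._execute_cypher.return_value = [{"count": 42}]  # canned result for this test
result = count_nodes(db, "Protein")
db._execute_cypher.assert_called_once()  # the tool issued exactly one query
```

Because the mock records its calls, a test can assert both on the returned value and on how the database was queried.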
Property-Based Testing
136 tests use Hypothesis to check correctness properties across:
- Entity models, query models, session models
- Evaluation scorers and persistence
- PDF extraction priority
- Security (Cypher injection, PII detection)
- API validation
The science skills integration adds 15 property tests covering:
- Command construction, temp file uniqueness, output parsing (SkillScriptRunner)
- Score/threshold filtering, batch partitioning, provenance metadata (base integrator)
- Failure threshold abort, graceful degradation, dry-run prevention
- Identifier resolution ordering (STRING map→network, HPA resolve→fetch)
- CLI source validation, schema completeness, summary report completeness
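As an illustration of what a batch-partitioning property might look like, here is a minimal Hypothesis sketch. The partition function is a hypothetical stand-in, not the base integrator's code, and the two properties (no items lost or reordered, no batch empty or over the size limit) are assumptions about what such a test would check.

```python
from hypothesis import given, strategies as st


def partition(items: list[int], batch_size: int) -> list[list[int]]:
    """Hypothetical helper: split items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


@given(
    items=st.lists(st.integers()),
    batch_size=st.integers(min_value=1, max_value=50),
)
def test_partition_preserves_items(items, batch_size):
    batches = partition(items, batch_size)
    # Property 1: flattening the batches gives back the original list, in order.
    assert [x for batch in batches for x in batch] == items
    # Property 2: every batch is non-empty and within the size limit.
    assert all(1 <= len(batch) <= batch_size for batch in batches)
```

Hypothesis generates many input combinations, including the empty list and a batch size of 1, so edge cases are exercised without being written out by hand.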
Evaluation Framework
Automated accuracy benchmarks in src/evaluation/:
- 50 test queries across 5 categories in data/test_queries.yaml
- 4 scorers: keyword match, entity overlap, embedding similarity, LLM judge
- Agent adapters for LangGraph and Neo4j agents
- JSON result persistence and Markdown report generation
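The two simplest scorers in the list above can be sketched in a few lines. The function names, signatures, and scoring formulas below are assumptions for illustration (a keyword hit fraction and a Jaccard overlap); the scorers in src/evaluation/ may compute their scores differently.

```python
def keyword_match(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)


def entity_overlap(predicted: set[str], expected: set[str]) -> float:
    """Jaccard overlap between predicted and expected entity sets."""
    union = predicted | expected
    if not union:
        return 1.0  # nothing expected and nothing predicted: treat as full agreement
    return len(predicted & expected) / len(union)


kw_score = keyword_match("TP53 regulates apoptosis", ["tp53", "apoptosis", "BRCA1"])
overlap_score = entity_overlap({"TP53", "MDM2"}, {"TP53", "BRCA1"})
```

Here the answer contains two of the three keywords, and the entity sets share one of three distinct entities, so the scores are 2/3 and 1/3 respectively.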
Linting & Type Checking
# Lint
uv run ruff check .
# Format
uv run ruff format .
# Type check (strict mode)
uv run mypy src/ api/ pipeline/
Ruff config: 88-character line length, Python 3.12 target, double quotes, space indentation.
mypy: disallow_untyped_defs, disallow_incomplete_defs, and strict_equality are enabled. Tests are excluded from type checking.
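To make the three mypy options concrete, the sketch below shows code that passes them, with comments describing variants each option would reject. The function names are hypothetical and not taken from the codebase.

```python
def normalize_label(label: str) -> str:
    """Fully annotated, so it satisfies both disallow_*_defs options."""
    return label.strip().capitalize()


def is_target(node_id: int, target: int) -> bool:
    """Compares two ints, so strict_equality has nothing to flag."""
    return node_id == target


# Variants mypy would reject under the listed options:
#   def normalize_label(label): ...       -> disallow_untyped_defs (no annotations)
#   def normalize_label(label: str): ...  -> disallow_incomplete_defs (return type missing)
#   node_id == "42" with node_id: int     -> strict_equality (non-overlapping types)
```

The rejected variants are kept as comments so that the block itself still type-checks and runs.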
See Testing Strategy for the full testing guide.