The Coverage Lie#

Code coverage lies. A test that exercises a line does not necessarily verify that the line does the right thing:

function add(a: number, b: number): number {
 return a + b
}

// 100% coverage - would still pass if add() returned 999
it('adds numbers', () => {
 add(2, 2)
})

Mutation testing flips the question. Instead of asking “did tests run this code?”, it asks “if I break this code, do tests fail?”

Using our add example, a mutation tester would rewrite the function like this:

// Original
function add(a: number, b: number): number {
 return a + b
}

// Mutated: swap + for -
function add(a: number, b: number): number {
 return a - b // <-- bug introduced
}

Now run the test. add(2, 2) returns 0 instead of 4. Does the test fail? No—it never checked the result. The mutant survives. Your test has a gap.
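A minimal sketch of that outcome, with the test runner stripped away. The helpers weakTest and strongTest are names made up for this illustration, not part of any framework; each returns true when the "test" passes.

```typescript
type BinaryOp = (a: number, b: number) => number

const add: BinaryOp = (a, b) => a + b

// Hand-written mutant: + swapped for -
const addMutant: BinaryOp = (a, b) => a - b

// Mirrors the test above: calls the function and asserts nothing.
function weakTest(impl: BinaryOp): boolean {
  impl(2, 2)
  return true
}

// Checks the result, so a wrong answer fails the test.
function strongTest(impl: BinaryOp): boolean {
  return impl(2, 2) === 4
}

console.log(weakTest(add), weakTest(addMutant))     // true true  (mutant survives)
console.log(strongTest(add), strongTest(addMutant)) // true false (mutant killed)
```

Even the stronger check has a blind spot: 2 + 2 and 2 * 2 are both 4, so a mutant that swaps + for * would still survive. Inputs such as (2, 3) tell those operators apart.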

The process:

Mutate: Introduce a small bug (change > to >=, swap && for ||, delete a line)
Run tests: Execute your test suite against the mutated code
Evaluate: If tests pass with the bug, your tests are weak. If tests fail, they caught it.

A mutation that tests fail to catch is a “surviving mutant”—proof of a test gap.

When Stryker Works: The Gold Standard#

When your test stack supports it, automated mutation testing with Stryker is the way to go. It is fast and deterministic, generates HTML reports, and runs in CI pipelines. This is especially valuable when you have pure functions with high test coverage and want to verify the quality of the tests themselves.

Here’s what it looks like in practice:

pnpm test:mutation
# or: stryker run

INFO ProjectReader Found 7 of 2947 file(s) to be mutated.
INFO Instrumenter Instrumented 7 source file(s) with 394 mutant(s)
INFO DryRunExecutor Initial test run succeeded. Ran 184 tests in 0 seconds.

Mutation testing [====================] 100% | 394/394 Mutants tested
(35 survived, 0 timed out)

--------------|---------|----------|------------|----------|
File          | % score | # killed | # survived | # no cov |
--------------|---------|----------|------------|----------|
All files     |   90.86 |      358 |         35 |        1 |
 backlinks.ts |   96.30 |       26 |          1 |        0 |
 callouts.ts  |   93.94 |       62 |          4 |        0 |
 graph.ts     |   91.55 |       65 |          6 |        0 |
 mentions.ts  |   91.30 |       63 |          5 |        1 |
 minimark.ts  |   82.61 |       76 |         16 |        0 |
 text.ts      |  100.00 |       34 |          0 |        0 |
 wikilinks.ts |   91.43 |       32 |          3 |        0 |
--------------|---------|----------|------------|----------|

INFO MutationTestExecutor Done in 36 seconds.

394 mutants tested across 7 files in 36 seconds. The report shows exactly which files have weak spots—minimark.ts at 82.61% needs attention, while text.ts is solid at 100%.
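The % score column can be recomputed from the other columns. The sketch below uses Stryker's metric definitions as I understand them (detected = killed + timed out; valid = detected + survived + no coverage). The Row type and mutationScore function are names chosen for this example, and only three rows from the report are included.

```typescript
interface Row {
  file: string
  killed: number
  timedOut: number
  survived: number
  noCoverage: number
}

// Mutation score = detected / valid * 100, where
// detected = killed + timed out and valid = detected + survived + no coverage.
function mutationScore(r: Row): number {
  const detected = r.killed + r.timedOut
  const valid = detected + r.survived + r.noCoverage
  return valid === 0 ? NaN : (detected / valid) * 100
}

// Numbers copied from the report above (0 mutants timed out).
const rows: Row[] = [
  { file: 'All files', killed: 358, timedOut: 0, survived: 35, noCoverage: 1 },
  { file: 'minimark.ts', killed: 76, timedOut: 0, survived: 16, noCoverage: 0 },
  { file: 'text.ts', killed: 34, timedOut: 0, survived: 0, noCoverage: 0 },
]

for (const r of rows) {
  console.log(`${r.file}: ${mutationScore(r).toFixed(2)}`)
}
// All files: 90.86
// minimark.ts: 82.61
// text.ts: 100.00
```

Mutants with no coverage count against the score just like survivors, which is why the single uncovered mutant in mentions.ts lowers both its row and the total.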

Stryker also generates an interactive HTML report where you can drill into each surviving mutant and see exactly what code change your tests failed to catch.

💪Use Stryker When You Can

If your stack supports Stryker (standard Vitest in Node mode, Jest, Mocha), use it. Deterministic tooling in your CI pipeline beats manual approaches every time. The AI agent technique in this post is for when Stryker isn’t an option.

The Vitest Browser Mode Problem#

But what if Stryker doesn’t support your stack? It doesn’t work with Vitest’s browser mode: its instrumentation assumes Node.js execution, while browser mode runs tests in a real Chromium instance via Playwright.

My setup:

Framework: Vitest 4 with browser.enabled: true
Provider: Playwright (Chromium)
Test style: Integration tests with real DOM

My testing strategy relies heavily on Vitest browser mode for realistic user flow testing. Stryker’s mutation coverage reports? Not an option. And switching to Node-based testing would mean losing the browser-specific behavior I’m actually testing.

AI Agents as Manual Mutation Testers#

The mutation testing algorithm is simple enough that an AI coding agent can execute it manually. Claude Code can:

Read your source code
Apply mutations systematically
Run pnpm test --run
Record whether tests passed or failed
Restore the original code
Report surviving mutants with suggested fixes

I adapted a Claude Code skill originally created by Paul Hammond that codifies this workflow.

The Mutation Testing Skill#

The skill defines mutation operators in priority order:

Priority 1 - Boundaries (most likely to survive):

Original	Mutate To
`<`	`<=`
`>`	`>=`
`<=`	`<`
`>=`	`>`

Priority 2 - Boolean Logic:

Original	Mutate To
`&&`	`||`
`||`	`&&`
`!condition`	`condition`

Priority 3 - Return Values:

Original	Mutate To
`return x`	`return null`
`return true`	`return false`
Early return	Remove it

Priority 4 - Statement Removal:

Original	Mutate To
`array.push(x)`	Remove
`await save(x)`	Remove
`emit('event')`	Remove

The agent applies each mutation one at a time, runs tests, records results, and restores the original code immediately.
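The loop the agent follows can be sketched in a few lines. This is an illustration under simplifying assumptions, not the skill itself: the "file" is an in-memory string, mutations are plain text replacements, and runTests stands in for pnpm test --run. The canEnter function, its three checks, and all type names are hypothetical.

```typescript
interface Operator { priority: number; from: string; to: string }
interface Result { operator: Operator; mutated: string; survived: boolean }

// A small subset of the operator tables above.
const operators: Operator[] = [
  { priority: 2, from: '&&', to: '||' },
  { priority: 1, from: '>=', to: '>' },
  { priority: 3, from: 'return true', to: 'return false' },
]

// Stand-in for a source file on disk (body of a hypothetical canEnter function).
const file = { source: 'return age >= 18 && hasTicket' }

// Stand-in for the test suite: true means every check passed.
function runTests(source: string): boolean {
  const canEnter = new Function('age', 'hasTicket', source) as
    (age: number, hasTicket: boolean) => boolean
  return canEnter(30, true) === true
    && canEnter(10, true) === false
    && canEnter(30, false) === false
}

function mutationTest(target: { source: string }, ops: Operator[]): Result[] {
  const original = target.source
  if (!runTests(original)) throw new Error('Tests must pass before mutating')

  const results: Result[] = []
  const ordered = [...ops].sort((a, b) => a.priority - b.priority)
  for (const op of ordered) {
    // One mutant per occurrence of the operator, applied one at a time.
    let at = original.indexOf(op.from)
    while (at !== -1) {
      target.source = original.slice(0, at) + op.to + original.slice(at + op.from.length)
      let passed = false
      try {
        passed = runTests(target.source)
      } catch {
        passed = false // a crashing mutant counts as killed
      } finally {
        target.source = original // always restore before the next mutation
      }
      results.push({ operator: op, mutated: op.from + ' -> ' + op.to, survived: passed })
      at = original.indexOf(op.from, at + op.from.length)
    }
  }
  return results
}

const results = mutationTest(file, operators)
for (const r of results) {
  console.log(`${r.survived ? 'SURVIVED' : 'KILLED'} ${r.mutated}`)
}
```

With these three checks the `>=` to `>` mutant survives, because nothing tests the boundary value 18, while the `&&` to `||` mutant is killed by the (10, true) case. A real agent edits files on disk and restores them from version control or a saved copy, but the shape of the loop is the same.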

Real Example: Settings Feature#

I ran this against my settings feature. The integration tests looked comprehensive—theme toggling, language switching, unit preferences. Code coverage would show high percentages.

Results: 38% mutation score (5 killed, 8 survived out of 13 mutations)

Here’s what the AI agent found:

Surviving Mutant #1: Volume Boundary Not Tested#

// Original (stores/settings.ts:65)
Math.min(Math.max(volume, 0.5), 1)

// Mutation: Change 0.5 to 0.4
Math.min(Math.max(volume, 0.4), 1)

// Result: Tests PASSED -> Mutant SURVIVED

My tests never verified the minimum volume constraint. A bug changing the minimum from 50% to 40% would ship undetected.
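One way to pin this down at the store level is to assert the clamp directly with an input that falls between the two bounds. The sketch assumes a hypothetical clampVolume helper wrapping the expression above; the real store may expose the logic differently.

```typescript
// Hypothetical helper wrapping the expression from stores/settings.ts
function clampVolume(volume: number): number {
  return Math.min(Math.max(volume, 0.5), 1)
}

// The surviving mutant: lower bound changed from 0.5 to 0.4
function clampVolumeMutant(volume: number): number {
  return Math.min(Math.max(volume, 0.4), 1)
}

// 0.45 sits between the two lower bounds, so it tells them apart:
// the original raises it to 0.5, the mutant leaves it at 0.45.
function lowerBoundTest(clamp: (v: number) => number): boolean {
  return clamp(0.45) === 0.5 && clamp(0.5) === 0.5 && clamp(2) === 1
}

console.log(lowerBoundTest(clampVolume))       // true
console.log(lowerBoundTest(clampVolumeMutant)) // false
```

Any input below 0.5 would expose the mutant; inputs at or above 0.5 cannot, which is why tests that only use "normal" volumes let it survive.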

Surviving Mutant #2: Theme DOM Class Not Verified#

// Original (composables/useTheme.ts:26)
newMode === 'dark'

// Mutation: Negate the condition
newMode !== 'dark'

// Result: Tests PASSED -> Mutant SURVIVED

My test checked that clicking the toggle changed the stored preference. It never verified that document.documentElement.classList actually received the dark class. The UI could break while tests pass.

Surviving Mutant #3: Error Handling Path Untested#

// Original (stores/settings.ts:28)
if (error) return

// Mutation: Negate the condition
if (!error) return

// Result: Tests PASSED -> Mutant SURVIVED

No test exercised the error handling branch. A bug that inverted error handling would go unnoticed.
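A test that kills this mutant has to exercise both sides of the guard. The code around the guard is not shown here, so everything in this sketch beyond `if (error) return` is assumed: the Store shape, the applyLoaded function, and its parameters are all hypothetical.

```typescript
interface Store { theme: string }
type Apply = (store: Store, error: Error | null, theme: string) => void

// Hypothetical function containing the guard from stores/settings.ts
const applyLoaded: Apply = (store, error, theme) => {
  if (error) return
  store.theme = theme
}

// The surviving mutant: condition negated
const applyLoadedMutant: Apply = (store, error, theme) => {
  if (!error) return
  store.theme = theme
}

function bothBranchesTest(apply: Apply): boolean {
  // Success path: the loaded value must be applied.
  const ok: Store = { theme: 'light' }
  apply(ok, null, 'dark')

  // Error path: state must stay untouched.
  const failed: Store = { theme: 'light' }
  apply(failed, new Error('load failed'), 'dark')

  return ok.theme === 'dark' && failed.theme === 'light'
}

console.log(bothBranchesTest(applyLoaded))       // true
console.log(bothBranchesTest(applyLoadedMutant)) // false
```

Either assertion alone is enough to catch the negated guard in this sketch; keeping both also protects against a mutant that deletes the early return entirely.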

The Fixes#

The agent suggested a specific test for each surviving mutant. Two of them:

// Vitest 4 import paths (Vitest 3 used '@vitest/browser/context')
import { expect, it } from 'vitest'
import { page, userEvent } from 'vitest/browser'

// Fix for Mutant #1: Boundary test
it('volume slider has minimum value constraint of 50%', async () => {
  const volumeSlider = page.getByTestId('timer-sound-volume-slider')
  // Read the slider's min attribute, retrying until it matches
  await expect.poll(async () => {
    const el = await volumeSlider.element()
    return el.getAttribute('min')
  }).toBe('0.5')
})

// Fix for Mutant #2: DOM verification
it('adds dark class to html element when dark mode enabled', async () => {
  const themeToggle = page.getByTestId('theme-toggle')
  await userEvent.click(themeToggle)

  // Assert on the <html> element itself, not just the stored preference
  await expect.poll(() =>
    document.documentElement.classList.contains('dark')
  ).toBe(true)
})

How to Set This Up#

Step 1: Create the Skill#

Save the skill as .claude/skills/mutation-testing/SKILL.md. The original skill it is adapted from is listed under Resources.

Step 2: Invoke It#

claude "Run mutation testing on the settings feature"

The agent will:

Find changed files on your branch
Identify testable functions
Apply mutations systematically
Report surviving mutants with suggested test fixes

Step 3: Review and Fix#

The agent produces a markdown report. Review each surviving mutant and decide:

Add the suggested test
Accept the risk (document why)
Refactor the code to be more testable

When to Use This Approach#

Good Fit	Not Ideal
Vitest browser mode (no Stryker support)	Large codebases needing full mutation coverage
Playwright component testing	CI/CD automation (manual agent invocation)
Small-to-medium codebases	Strict mutation score thresholds
Pre-merge review of specific features
Learning what makes tests effective

💪Complement, Don't Replace

This approach works best alongside your existing testing strategy. Use it to spot-check critical features before merge, not as a replacement for automated mutation testing where available.

Feature Branches, Not Pipelines

This skill shines on feature branches where you want to validate test quality before merging. Running AI agents in CI/CD pipelines is possible—you could build an automated QA agent with the Claude Agent SDK—but it adds complexity and cost. For pipeline automation, deterministic tools like Stryker remain the better choice when your stack supports them. Think of this as a developer tool for improving tests during development, not a CI gate.

Key Takeaways#

Coverage doesn’t equal confidence. High code coverage can coexist with ineffective tests.
Mutation testing reveals test gaps. By breaking code and checking if tests notice, you find what’s actually being verified.
AI agents can execute manual mutation testing. When tooling doesn’t support your stack, an agent can apply the algorithm systematically.
Focus on surviving mutants. Each one is a potential bug your tests wouldn’t catch.
This complements automated tooling rather than replacing it. Use it alongside coverage reports, and keep automated mutation testing wherever your stack supports it.

Resources#

Paul Hammond’s Mutation Testing Skill - The original skill this post is based on
Mutation Testing on Wikipedia
Stryker Mutator - When your stack supports it
My TDD workflow with Claude Code - Related approach for test-first development

URL: https://alexop.dev/posts/mutation-testing-ai-agents-vitest-browser-mode/

⇱ Mutation Testing with AI Agents When Stryker Doesn't Work | alexop.dev
