Broken and flaky tests cost engineering teams hours every week — usually for mechanical reasons like a renamed selector or a drifted fixture. Verdict automates the tedious middle of fixing them: when a test fails, it diagnoses the failure, proposes a patch, verifies it in a sandbox against your real suite, and scores the fix's quality with an LLM judge. You get a diagnosed failure, a proposed diff, and a verdict — and you decide.
Verdict ingests the failing CI run.
It gathers context and generates a candidate patch.
The patch is re-run in a sandbox, and an LLM judge scores its quality, not just whether it's green.
The diagnosed failure, the diff, and the verdict are surfaced for review. You accept, reject, or edit. Verdict never commits.
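As a rough illustration of the flow described above, here is a minimal Python sketch. It is not Verdict's actual code or API: every name in it (ReviewItem, build_review_item, apply_decision, the injected callables, and the 0.0 to 1.0 score scale) is a hypothetical stand-in chosen for this example. The point it tries to show is the shape of the pipeline: each stage feeds the next, the sandbox result and the judge's score are kept as separate signals, and the only output is something for a human to review.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ReviewItem:
    """What gets surfaced to the reviewer (hypothetical shape, not Verdict's schema)."""
    diagnosis: str
    diff: str
    sandbox_passed: bool   # did the suite go green in the sandbox?
    quality_score: float   # judge's score; a 0.0-1.0 scale is assumed here


def build_review_item(
    ci_log: str,
    diagnose: Callable[[str], str],
    propose_patch: Callable[[str, str], str],
    run_in_sandbox: Callable[[str], bool],
    judge: Callable[[str, str], float],
) -> ReviewItem:
    """Run the stages in order and return a reviewable item. Never commits."""
    diagnosis = diagnose(ci_log)             # ingest the failing run, diagnose it
    diff = propose_patch(ci_log, diagnosis)  # generate a candidate patch
    passed = run_in_sandbox(diff)            # re-run the suite with the patch applied
    score = judge(diagnosis, diff)           # score quality separately from pass/fail
    return ReviewItem(diagnosis, diff, passed, score)


def apply_decision(
    item: ReviewItem, decision: str, edited_diff: Optional[str] = None
) -> Optional[str]:
    """Return the diff the human chose to keep, or None on reject.

    Committing is left entirely to the caller; this function only
    reflects the accept / reject / edit choice.
    """
    if decision == "accept":
        return item.diff
    if decision == "edit":
        if edited_diff is None:
            raise ValueError("an edited diff is required for 'edit'")
        return edited_diff
    if decision == "reject":
        return None
    raise ValueError(f"unknown decision: {decision!r}")
```

In this sketch the stages are passed in as plain callables, so each one can be replaced with a stub when trying the flow out. Keeping sandbox_passed and quality_score as separate fields mirrors the idea that a green run alone does not make a patch good.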
Verdict produces a useful, reviewable fix roughly 70–80% of the time on our internal test corpus. We share that number plainly: it's an internal observation under known conditions, not a guarantee for every repository, and generalization to other codebases is something we're still measuring. Verdict is designed for exactly this reality: it's a reviewer's assistant, so a weak or wrong patch is cheap to reject, and nothing is ever committed without you.