Whitepaper · Companion to Release Management
"When is testing done?" is the single hardest question a test manager has to answer, because it's not actually a testing question — it's a release-decision question pretending to be a testing question. Good exit criteria make the decision legible, defensible, and measurable. Bad exit criteria make the decision political and inconsistent. This paper is about the difference.
Pairs with the Test Release Processes whitepaper (which deliberately scopes release criteria out) and the Quality Risk Analysis whitepaper (which is where the traceability comes from).
Four kinds of criteria — not one
Most teams talk about "exit criteria" as if it were a single checklist. In practice there are four distinct kinds of gates, each answering a different question. Conflating them is the source of most release-argument dysfunction.
Every release cycle needs all four. Entry criteria prevent premature work; exit criteria make completion legible; suspension criteria protect the team from pouring effort into a collapsing build; resumption criteria prevent re-entry into the same failure mode. If your process documents one without the others, you have half a process.
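One way to see how the four gate types fit together is as a small state machine, with each kind of criterion guarding one transition. This is an illustrative sketch, not a prescribed process model; the class, phase names, and callable-predicate interface are all assumptions made for the example.

```python
from enum import Enum

class Phase(Enum):
    NOT_STARTED = "not started"
    ACTIVE = "active"
    SUSPENDED = "suspended"
    DONE = "done"

class TestPhase:
    """Illustrative state machine: each gate type guards one transition.
    The four arguments are zero-argument callables returning bool
    (hypothetical interface for the example)."""

    def __init__(self, entry_ok, exit_ok, suspend_needed, resume_ok):
        self.entry_ok = entry_ok
        self.exit_ok = exit_ok
        self.suspend_needed = suspend_needed
        self.resume_ok = resume_ok
        self.state = Phase.NOT_STARTED

    def tick(self):
        if self.state is Phase.NOT_STARTED and self.entry_ok():
            self.state = Phase.ACTIVE        # entry criteria met
        elif self.state is Phase.ACTIVE and self.suspend_needed():
            self.state = Phase.SUSPENDED     # suspension criteria triggered
        elif self.state is Phase.SUSPENDED and self.resume_ok():
            self.state = Phase.ACTIVE        # resumption criteria met
        elif self.state is Phase.ACTIVE and self.exit_ok():
            self.state = Phase.DONE          # exit criteria met
        return self.state
```

The point of the sketch is the symmetry: a process that defines only `exit_ok` has no way to start cleanly, stop a collapsing build, or re-enter safely.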
A note on vocabulary
Different shops use different words. "Acceptance criteria," "release criteria," "done criteria," "quality gates," "promotion criteria" — you'll see all of them. The semantics matter less than the separation between the four questions above. Pick words that make sense for your organization and stick with them.
What a good criterion looks like
Before we get to the frameworks, here's the bar each individual criterion has to clear. A criterion is a commitment — a statement of the form "we will do X" or "we will not ship until Y" — and like all commitments it has to be honest, measurable, and falsifiable.
"Zero P1 defects at release" sounds like a strong criterion and is almost always a bad one. In practice one of two things happens: either the team finds creative reasons to downgrade a P1 to a P2 near release, or they find creative reasons not to file a P1 in the first place. Either way the metric stops measuring what it claims to measure. A better version: "all known P1 defects are either fixed, have an accepted written waiver from a named approver, or have a mitigation that has been tested end to end."
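The improved criterion can be made concrete as a check over the defect register. A minimal sketch, assuming a list of defect records; the dictionary field names (`priority`, `status`, `waiver`, `mitigation_tested_e2e`) are hypothetical, not a real tracker schema.

```python
def p1_criterion_met(defects):
    """True when every known P1 defect is fixed, has a written waiver
    from a named approver, or has an end-to-end-tested mitigation.
    Field names are illustrative."""
    for d in defects:
        if d["priority"] != "P1":
            continue
        fixed = d.get("status") == "fixed"
        waived = d.get("waiver", {}).get("approver") is not None
        mitigated = d.get("mitigation_tested_e2e", False)
        if not (fixed or waived or mitigated):
            return False
    return True
```

Note what the check does not allow: an open P1 with no waiver record fails the gate, regardless of how the defect is described. The waiver must carry a named approver, which is exactly the part the "zero P1s" phrasing lets teams quietly skip.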
The three approaches, in increasing order of rigor
QRA teaches that there are informal, checklist-based, and rigorous techniques for discovering risks. The same hierarchy applies to authoring criteria. You can do this casually and get a casual result, or you can do it rigorously and get a defensible one.
Approach 1: Coverage + defect tail (the default)
The most common approach, and the starting point for everyone. Criteria are expressed in terms of test-execution completeness and defect-register state:
- 100% of planned test cases executed
- ≥ X% pass rate on regression suite
- Zero open P1 defects (with the waiver clause above)
- Fewer than N open P2 defects, with a written plan for each
- Defect trend stable or declining for the last N days
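The list above can be evaluated mechanically. A sketch, assuming a metrics dictionary with self-explanatory keys; the threshold defaults are placeholders for the example, not recommendations.

```python
def coverage_defect_gate(metrics, max_p2=5, min_pass_rate=0.98, trend_days=7):
    """Evaluate the classic coverage-plus-defect-tail exit criteria.
    Returns the list of failed checks; an empty list means the gate passes.
    Metric keys and thresholds are illustrative."""
    failures = []
    if metrics["executed"] < metrics["planned"]:
        failures.append("not all planned test cases executed")
    if metrics["regression_pass_rate"] < min_pass_rate:
        failures.append("regression pass rate below threshold")
    if metrics["open_p1"] > 0:
        failures.append("open P1 defects remain")
    if metrics["open_p2"] > max_p2:
        failures.append("too many open P2 defects")
    # Trend check: daily open-defect counts must be non-increasing
    # over the last trend_days days.
    tail = metrics["daily_open_defects"][-trend_days:]
    if any(later > earlier for earlier, later in zip(tail, tail[1:])):
        failures.append("defect trend rising")
    return failures
```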
This works, in the sense that it's better than no criteria at all. What it misses is why those numbers are the right numbers. 100% of which planned test cases? Tests for which risks? Which requirements? Which regulations? The criteria are self-referential to the test plan — they check whether we did what we said we'd do, not whether what we did was sufficient.
Approach 2: Traceability to requirements (Simmons' contribution)
Erik Simmons, in "Requirements to Release Criteria: Testing in Context" (PNSQC 2001), made the argument that release criteria should trace back to requirements — that each criterion should be derivable from a specific requirement, quality attribute, or business commitment. This is the approach that distinguishes professional release engineering from ceremonial release engineering.
Concretely, for each requirement or quality attribute, the team asks:
- What does "this is delivered correctly" mean for this requirement?
- What evidence would convince a skeptical reviewer?
- What's the measurable threshold?
- Who signs off?
The resulting criteria read like this — each one pointing back to a source document:
- R-034 (auth timeout) — session timeout of 30 minutes verified across web, mobile, and API surfaces; zero deviations in test report #TR-112. Owner: Security lead.
- NFR-008 (P95 latency) — P95 ≤ 250ms at 2× forecast peak load, verified in staging environment matching production topology, report #PR-44. Owner: Performance lead.
- Regulation GDPR-Art-17 — right-to-erasure flow tested end to end including downstream cache invalidation, report #CR-09. Owner: Compliance lead.
This approach forces two useful things. First, any requirement with no corresponding criterion is either unimportant (in which case it shouldn't be a requirement) or we have a gap. Second, any criterion with no corresponding requirement is either protecting against something we've forgotten to write down (in which case let's write it down) or it's there for political reasons (in which case let's be honest about that).
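Both checks in that paragraph are mechanical, so they can be run rather than argued about. A sketch, assuming requirements are a set of IDs and each criterion record carries a `traces_to` field; all names are illustrative.

```python
def traceability_gaps(requirements, criteria):
    """Bidirectional traceability check.

    Returns (uncovered_reqs, orphan_criteria):
      - uncovered_reqs: requirement IDs with no criterion pointing at
        them (either a gap, or the requirement isn't really one)
      - orphan_criteria: criterion IDs whose traced requirement doesn't
        exist (either an unwritten requirement, or a political gate)
    Field names are illustrative."""
    covered = {c["traces_to"] for c in criteria}
    uncovered_reqs = requirements - covered
    orphan_criteria = [c["id"] for c in criteria
                       if c["traces_to"] not in requirements]
    return uncovered_reqs, orphan_criteria
```

Either output being non-empty is a conversation, not automatically a defect; the value is that the conversation happens explicitly.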
Approach 3: Risk-weighted + traceability (the rigorous version)
The next step up: derive criteria not just from requirements but from the quality risk register. Each high-priority risk item gets a corresponding criterion that specifies what evidence will be required before we accept that risk as sufficiently mitigated.
Figure: traceability flow, from risks to release criteria.
Each layer narrows. Not every risk item produces a release criterion — only the ones whose residual risk, after planned mitigation, is still above the team's tolerance.
A healthy release process has at least this much narrowing. A register where every risk becomes a gate is unmanageable; a process where no risks become gates is ceremonial.
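The narrowing step can be expressed directly: score each risk's residual after planned mitigation, and only risks still above tolerance become gates. The likelihood-times-impact scoring and the `mitigation_effect` field are assumptions for the sketch; use whatever scheme your risk register already defines.

```python
def criteria_from_risks(register, tolerance):
    """Derive release gates from a quality risk register.
    Only risks whose residual score, after planned mitigation, still
    exceeds tolerance produce a gate. Scoring scheme is illustrative."""
    gates = []
    for risk in register:
        residual = (risk["likelihood"] * risk["impact"]
                    * (1 - risk["mitigation_effect"]))
        if residual > tolerance:
            gates.append({"risk_id": risk["id"],
                          "residual": round(residual, 2),
                          "evidence_required": risk["evidence"]})
    return gates
```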
This is the approach we use with clients. It produces a release review meeting where every criterion has a stated purpose (the risk it's guarding), a named owner, a measurable threshold, and an evidence trail. Waivers are still possible — they always are — but they are explicit, written, and named. The political game of quiet re-interpretation is eliminated because there's nothing left to re-interpret.
Modern patterns
The criteria above describe the state of the product at a single release decision. Continuous delivery, progressive rollout, and error-budget-based operation have added three new kinds of criteria that sit alongside the classic set.
Automated quality gates in CI
A large share of what used to be exit criteria can now be expressed as automated gates in the build pipeline: unit-test pass rate, code-coverage threshold, SAST/DAST findings under a threshold, dependency scan clean, performance-regression guard, schema-contract test pass. The human release review is then a review of the exceptions — risks that can't be expressed as an automated gate — rather than a review of the whole set.
Rule of thumb: if a criterion can be expressed as an automated gate that runs on every commit, put it there. The human release review should focus on what can't be automated (integration risk, product-judgment risk, regulatory sign-off, business-readiness), not on things a CI job can decide faster and more reliably.
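A pipeline gate of this kind is just a table of named thresholds evaluated on every build. A minimal sketch; the gate names, metric keys, and thresholds below are examples, not recommendations for any particular CI system.

```python
GATES = [
    # (gate name, metric key, pass predicate) -- thresholds illustrative
    ("unit tests", "unit_pass_rate", lambda v: v == 1.0),
    ("coverage", "line_coverage", lambda v: v >= 0.80),
    ("SAST findings", "sast_high", lambda v: v == 0),
    ("perf regression", "p95_delta_ms", lambda v: v <= 10),
]

def run_ci_gates(metrics):
    """Return the names of failed gates; an empty list promotes the build."""
    return [name for name, key, passes in GATES if not passes(metrics[key])]
```

Everything expressible in this form belongs in the pipeline; the human review then reads the residue.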
Progressive rollout gates
When a release is not a single binary event but a progression (canary → 1% → 10% → 50% → 100%), each stage has its own criteria.
Each stage gate is a miniature release decision. The named owner is usually the on-call engineer or release manager, not a committee — the cadence is too fast for committee decisions. The criteria have to be operable under time pressure, which is a tougher constraint than the classic release-meeting criteria.
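A stage-gate decision that an on-call engineer can make under time pressure has to reduce to a handful of observable numbers. A sketch, assuming per-stage ceilings on error rate plus a minimum soak time; the traffic shares, error-rate ceilings, and soak durations are placeholders, not advice.

```python
STAGES = [
    # (name, traffic share, max error rate, min soak minutes)
    # -- all numbers illustrative
    ("canary", 0.001, 0.001, 30),
    ("1%",     0.01,  0.002, 60),
    ("10%",    0.10,  0.002, 120),
    ("50%",    0.50,  0.003, 240),
    ("100%",   1.00,  None,  None),
]

def next_stage_allowed(stage_idx, observed_error_rate, soak_minutes):
    """Miniature release decision for one rollout stage: error rate
    under the stage ceiling and enough soak time at this stage."""
    _, _, max_err, min_soak = STAGES[stage_idx]
    if max_err is None:
        return False  # already at 100%; nothing to promote to
    return observed_error_rate <= max_err and soak_minutes >= min_soak
```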
Error-budget-based criteria
For teams running SRE-style operations, the exit criterion for ongoing feature delivery is not "the tests pass" but "the service is within its error budget." If the budget is exhausted, feature releases are paused until reliability work restores it — regardless of how green the test suite is.
This is a meta-criterion: it sits above the per-release criteria and sometimes overrides them. Teams operating this way need both — the per-release criteria say "this particular change is safe to roll out," the error-budget criterion says "the service has headroom to absorb another change right now." Either one can veto.
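The two-veto structure fits in a few lines. A sketch, assuming availability measured over the current budget window; the function and parameter names are assumptions for the example.

```python
def release_allowed(change_is_safe, slo_target, observed_availability):
    """Meta-criterion sketch: the per-release criteria say this change
    is safe; the error budget says the service has headroom to absorb
    another change. Either one can veto."""
    error_budget = 1.0 - slo_target            # e.g. 0.001 for a 99.9% SLO
    budget_burned = 1.0 - observed_availability
    return change_is_safe and budget_burned < error_budget
```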
How to build the list
Concretely, how do you author a good release-criteria set for a release you're planning now? Here's the sequence we use:
- Start from the risk register and requirements list. For each high-priority risk and each named requirement or quality attribute, ask: what evidence convinces us this is delivered correctly? Write that as the criterion.
- Move what you can to automation. For each criterion, ask: can this be an automated CI gate? If yes, move it there and remove it from the manual release checklist.
- Define entry, suspension, and resumption. For each test phase, define the three non-exit criteria explicitly. Most release-process failures happen because one of these three is missing.
- Name an owner per criterion. Every criterion has one named owner who signs off pass/fail. If you can't name them, the criterion isn't mature yet.
- Pre-declare the waiver path. If a criterion fails at release time, what's the path to a documented waiver? Who approves? What's the acceptance condition? Pre-declaring this removes a surprising amount of release-day friction.
- Review with stakeholders before the release. The list should be agreed before testing starts, not negotiated at the release review. A criterion discovered at release time is a criterion the team couldn't meet.
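The record that steps like these produce can be captured in a single structure: one row per criterion, with its source, threshold, owner, automation status, and pre-declared waiver path. A sketch; all field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One authored release criterion; fields mirror the authoring
    steps above. Names are illustrative."""
    id: str
    source: str           # requirement or risk item it traces to
    threshold: str        # the measurable bar
    owner: str            # named person who signs off pass/fail
    automated: bool       # True -> lives in CI, off the manual checklist
    waiver_approver: str  # pre-declared waiver path

def manual_checklist(criteria):
    """The human release review covers only what automation can't."""
    return [c for c in criteria if not c.automated]
```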
The most valuable moment in a release-criteria process is the moment the list is agreed — weeks before the release. Stakeholders who sign off early, in writing, on what "done" looks like are much less able to move the goalposts later. This isn't bureaucracy; it's a commitment device that protects the team from late-cycle scope drift.
Common failure modes — a diagnostic checklist
If you're trying to figure out why your release process keeps ending in fire drills, check for the failure modes already named above: entry, suspension, or resumption criteria that were never defined; criteria with no named owner; thresholds that aren't measurable or falsifiable; and a criteria list negotiated at the release review instead of agreed before testing starts.
What this buys you
A release criteria set built this way turns the release decision from a political conversation into a data review. The conversation in the room becomes "do we have evidence for each criterion, yes or no," not "does this feel ready." The criteria are defensible to auditors, explainable to executives, and teachable to new team members. And critically — the act of authoring them forces the conversation about scope, risk, and acceptance to happen early, when change is cheap, instead of at release time, when change is expensive and emotional.
Related reading
- Test Release Processes: Seven Steps, Nine Quality Indicators — the other side of the release coin
- Quality Risk Analysis: A Complete Whitepaper — the source of the risk items that drive criteria
- Risk Perception and Cognitive Bias — why group release decisions go wrong and how to debias them
- Risk-Based Test Results Reporting — how to report the evidence behind each criterion
Sources and further reading
- Simmons, Erik. "Requirements to Release Criteria: Testing in Context." Proceedings of the Pacific Northwest Software Quality Conference (PNSQC), 2001. The seminal paper making the case for traceability-based release criteria.
- Beyer, Betsy, et al. Site Reliability Engineering. O'Reilly, 2016. Chapter on error budgets; source of the meta-criterion concept.
- Humble, Jez, and David Farley. Continuous Delivery. Addison-Wesley, 2010. Source of the automated-quality-gate pattern and the build-pipeline thinking.