Whitepaper · Updated April 2026 · 12 min read

Exit and Release Criteria: A Framework for Knowing When to Ship

Exit criteria answer the hardest question in testing: when is the work done? This paper covers the four kinds of criteria you actually need (entry, exit, suspension, resumption), the traceability-based approach that ties each criterion back to a risk or requirement, modern patterns — progressive-rollout gates, error budgets, automated quality gates in CI — and the checklist of failure modes that produce useless criteria.

Release Criteria · Exit Criteria · Test Management · Release Management · CI/CD · Quality Gates

Whitepaper · Companion to Release Management

"When is testing done?" is the single hardest question a test manager has to answer, because it's not actually a testing question — it's a release-decision question pretending to be a testing question. Good exit criteria make the decision legible, defensible, and measurable. Bad exit criteria make the decision political and inconsistent. This paper is about the difference.

Pairs with the Test Release Processes whitepaper (which deliberately scopes release criteria out) and the Quality Risk Analysis whitepaper (which is where the traceability comes from).

Four kinds of criteria — not one

Most teams talk about "exit criteria" as if it were a single checklist. In practice there are four distinct kinds of gates, each answering a different question. Conflating them is the source of most release-argument dysfunction.

  • Entry: Can we start? What must be true before a test phase, cycle, or build can begin — the minimum inputs we require before we accept responsibility for running.
  • Exit: Are we done? What must be true before a test phase or release can be declared complete — the outputs and conditions we commit to producing.
  • Suspension: Should we stop? What conditions trigger a halt to testing — build too broken to progress, blocker defect tail, infrastructure down, scope drift that invalidates the plan.
  • Resumption: Can we restart? What conditions must be met after a suspension before testing can resume — often tighter than entry criteria, because we've burned trust.

Every release cycle needs all four. Entry criteria prevent premature work; exit criteria make completion legible; suspension criteria protect the team from pouring effort into a collapsing build; resumption criteria prevent re-entry into the same failure mode. If your process documents one without the others, you have half a process.

A note on vocabulary

Different shops use different words. "Acceptance criteria," "release criteria," "done criteria," "quality gates," "promotion criteria" — you'll see all of them. The semantics matter less than the separation between the four questions above. Pick words that make sense for your organization and stick with them.

What a good criterion looks like

Before we get to the frameworks, here's the bar each individual criterion has to clear. A criterion is a commitment — a statement of the form "we will do X" or "we will not ship until Y" — and like all commitments it has to be honest, measurable, and falsifiable.

  • Measurable: no judgment calls in the statement. 'System test is complete' is not a criterion. '100% of planned test cases executed and 100% of planned risks covered at the agreed depth' is.
  • Binary: pass or fail, not 'mostly'. At decision time, the answer must be yes or no. A criterion that needs interpretation in the room is a political lever, not a gate.
  • Traceable: tied to a risk or requirement. Every criterion should answer the question 'why this one?' with a pointer to a specific risk item, requirement, regulation, or business commitment.
  • Honest: we will actually hold to it. A criterion we quietly waive under pressure is worse than no criterion — it trains the organization to treat the list as decorative.
  • Timely: measurable before we ship, not after. Criteria that require data we won't have until production (e.g., 'user satisfaction above X') belong in post-release metrics, not release gates.
  • Owned: a named person signs off. Each criterion has one owner who is accountable for calling pass or fail. Committee ownership is no ownership.
The 'zero P1 defects' trap

"Zero P1 defects at release" sounds like a strong criterion and is almost always a bad one. In practice one of two things happens: either the team finds creative reasons to downgrade a P1 to a P2 near release, or they find creative reasons not to file a P1 in the first place. Either way the metric stops measuring what it claims to measure. A better version: "all known P1 defects are either fixed, have an accepted written waiver from a named approver, or have a mitigation that has been tested end to end."

The three approaches, in increasing quality

QRA teaches that there are informal, checklist-based, and rigorous techniques for discovering risks. The same hierarchy applies to authoring criteria. You can do this casually and get a casual result, or you can do it rigorously and get a defensible one.

Approach 1: Coverage + defect tail (the default)

The most common approach, and the starting point for everyone. Criteria are expressed in terms of test-execution completeness and defect-register state:

  • 100% of planned test cases executed
  • ≥ X% pass rate on regression suite
  • Zero open P1 defects (with the waiver clause above)
  • Fewer than N open P2 defects, with a written plan for each
  • Defect trend stable or declining for the last N days
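
Criteria of this kind can be checked mechanically. A minimal sketch — the field names, thresholds, and gate list below are illustrative, not a standard:

```python
# Hypothetical evaluation of coverage + defect-tail criteria.
# All metric names and thresholds are illustrative.

def evaluate_gates(m: dict) -> list[str]:
    """Return the names of failed gates; an empty list means all gates are green."""
    gates = {
        "100% of planned test cases executed": m["executed"] == m["planned"],
        "regression pass rate >= 98%": m["regression_pass_rate"] >= 0.98,
        "zero unwaived P1 defects": m["open_p1"] <= m["waived_p1"],
        "fewer than 10 open P2 defects": m["open_p2"] < 10,
    }
    return [name for name, ok in gates.items() if not ok]

failed = evaluate_gates({
    "executed": 412, "planned": 412,
    "regression_pass_rate": 0.991,
    "open_p1": 1, "waived_p1": 1,   # one P1 open, but formally waived
    "open_p2": 6,
})
print(failed)  # → []
```

Note that the P1 gate encodes the waiver clause explicitly — an open P1 only passes if a matching waiver exists.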

This works, in the sense that it's better than no criteria at all. What it misses is why those numbers are the right numbers. 100% of which planned test cases? Tests for which risks? Which requirements? Which regulations? The criteria are self-referential to the test plan — they check whether we did what we said we'd do, not whether what we did was sufficient.

Approach 2: Traceability to requirements (Simmons' contribution)

Erik Simmons, in "Requirements to Release Criteria: Testing in Context" (PNSQC 2001), made the argument that release criteria should trace back to requirements — that each criterion should be derivable from a specific requirement, quality attribute, or business commitment. This is the approach that distinguishes professional release engineering from ceremonial release engineering.

Concretely, for each requirement or quality attribute, the team asks:

  1. What does "this is delivered correctly" mean for this requirement?
  2. What evidence would convince a skeptical reviewer?
  3. What's the measurable threshold?
  4. Who signs off?

The resulting criteria read like this — each one pointing back to a source document:

  • R-034 (auth timeout) — session timeout of 30 minutes verified across web, mobile, and API surfaces; zero deviations in test report #TR-112. Owner: Security lead.
  • NFR-008 (P95 latency) — P95 ≤ 250ms at 2× forecast peak load, verified in staging environment matching production topology, report #PR-44. Owner: Performance lead.
  • Regulation GDPR-Art-17 — right-to-erasure flow tested end to end including downstream cache invalidation, report #CR-09. Owner: Compliance lead.
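
A small data model makes that traceability auditable. As a sketch (the field names are mine, not Simmons'), every criterion carries its source, its evidence pointer, and a single owner:

```python
from dataclasses import dataclass

# Illustrative model of a traceable release criterion; field names are assumptions.
@dataclass(frozen=True)
class ReleaseCriterion:
    source: str     # requirement, NFR, or regulation it traces back to
    statement: str  # the measurable pass/fail condition
    evidence: str   # report or artifact ID backing the pass/fail call
    owner: str      # the one named person who signs off

criteria = [
    ReleaseCriterion("R-034", "30-minute session timeout verified on web, mobile, API",
                     "TR-112", "Security lead"),
    ReleaseCriterion("NFR-008", "P95 latency <= 250 ms at 2x forecast peak load",
                     "PR-44", "Performance lead"),
    ReleaseCriterion("GDPR-Art-17", "right-to-erasure flow tested end to end",
                     "CR-09", "Compliance lead"),
]

# Traceability audit: no criterion without a source, none without an owner.
orphans = [c.statement for c in criteria if not (c.source and c.owner)]
print(orphans)  # → []
```

The audit in the last two lines is the point: a criterion that cannot name its source or its owner fails the gap check described below.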

This approach forces two useful things. First, any requirement with no corresponding criterion is either unimportant (in which case it shouldn't be a requirement) or we have a gap. Second, any criterion with no corresponding requirement is either protecting against something we've forgotten to write down (in which case let's write it down) or it's there for political reasons (in which case let's be honest about that).

Approach 3: Risk-weighted + traceability (the rigorous version)

The next step up: derive criteria not just from requirements but from the quality risk register. Each high-priority risk item gets a corresponding criterion that specifies what evidence will be required before we accept that risk as sufficiently mitigated.

Traceability flow

From risks to release criteria

Each layer narrows. Not every risk item produces a release criterion — only the ones whose residual risk, after planned mitigation, is still above the team's tolerance.

[Funnel diagram — each layer narrows. Illustrative counts:]
  • Quality risk items identified: full risk register from QRA workshop (668 items)
  • Risks with planned mitigation: items covered by test cases, reviews, static analysis, etc. (428)
  • Residual risk above tolerance: items whose unmitigated tail still matters at release (228)
  • Release criteria authored: explicit gate with evidence, threshold, and owner (138)
  • Criteria passed at release decision: green gates at the release review meeting (68)

A healthy release process has at least this much narrowing. A register where every risk becomes a gate is unmanageable; a process where no risks become gates is ceremonial.

This is the approach we use with clients. It produces a release review meeting where every criterion has a stated purpose (the risk it's guarding), a named owner, a measurable threshold, and an evidence trail. Waivers are still possible — they always are — but they are explicit, written, and named. The political game of quiet re-interpretation is eliminated because there's nothing left to re-interpret.

Modern patterns

The criteria above describe the state of the product at a single release decision. Continuous delivery, progressive rollout, and error-budget-based operation have added three new kinds of criteria that sit alongside the classic set.

Automated quality gates in CI

A large share of what used to be exit criteria can now be expressed as automated gates in the build pipeline: unit-test pass rate, code-coverage threshold, SAST/DAST findings under a threshold, dependency scan clean, performance-regression guard, schema-contract test pass. The human release review is then a review of the exceptions — risks that can't be expressed as an automated gate — rather than a review of the whole set.

Rule of thumb: if a criterion can be expressed as an automated gate that runs on every commit, put it there. The human release review should focus on what can't be automated (integration risk, product-judgment risk, regulatory sign-off, business-readiness), not on things a CI job can decide faster and more reliably.
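
As a sketch, a coverage gate of this kind is just a script that reads the coverage tool's report and fails the pipeline when the threshold is missed. The report format and the 80% floor below are assumptions, not any particular tool's output:

```python
import json

THRESHOLD = 0.80  # assumed team-agreed line-coverage floor

def coverage_gate(report_path: str) -> int:
    """Return a CI exit code: 0 to let the pipeline continue, 1 to fail it."""
    with open(report_path) as f:
        report = json.load(f)
    covered = report["covered_lines"] / report["total_lines"]
    if covered < THRESHOLD:
        print(f"GATE FAIL: coverage {covered:.1%} is below {THRESHOLD:.0%}")
        return 1
    print(f"GATE PASS: coverage {covered:.1%}")
    return 0

# Simulate a report the coverage tool would have written.
with open("coverage.json", "w") as f:
    json.dump({"covered_lines": 850, "total_lines": 1000}, f)
print(coverage_gate("coverage.json"))  # → 0
```

Wired into the pipeline as a step whose nonzero exit fails the build, this runs on every commit — exactly the kind of criterion that should leave the manual checklist.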

Progressive rollout gates

When a release is not a single binary event but a progression (canary → 1% → 10% → 50% → 100%), each stage has its own criteria:

  • Canary (single instance / internal users): no error-rate degradation vs. control, no P1 telemetry signals, smoke tests pass against the live canary.
  • 1% rollout (small production slice): error rate within tolerance band for N minutes; no customer-impact incidents; key business metrics stable.
  • 10–50% (substantial production traffic): sustained error rate within band; performance metrics within SLOs; no outsized cohort impact (by region, plan tier, platform).
  • 100% (full release): all previous gates held across the full duration; no pending rollback signals; post-release plan owner confirmed.
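
A stage gate like this can be sketched as a small pure function the rollout controller calls on each evaluation tick. The tolerance band and signal names are illustrative:

```python
# Hypothetical single-stage rollout gate: compare the new cohort's error
# rate against control within a tolerance band. Thresholds are illustrative.

def stage_gate(canary_error_rate: float, control_error_rate: float,
               open_p1_signals: int, tolerance: float = 0.001) -> str:
    """Return 'promote', 'hold', or 'rollback' for the current stage."""
    if open_p1_signals > 0:
        return "rollback"                      # P1 telemetry signal: abort now
    if canary_error_rate > control_error_rate + tolerance:
        return "hold"                          # outside the band: don't advance yet
    return "promote"                           # within the band: advance the stage

print(stage_gate(0.0021, 0.0020, 0))  # → promote
print(stage_gate(0.0040, 0.0020, 0))  # → hold
```

Keeping the decision a pure function of observed metrics is what makes it operable under time pressure: the on-call engineer reads one word, not a dashboard.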

Each stage gate is a miniature release decision. The named owner is usually the on-call engineer or release manager, not a committee — the cadence is too fast for committee decisions. The criteria have to be operable under time pressure, which is a tougher constraint than the classic release-meeting criteria.

Error-budget-based criteria

For teams on SRE-style operation, the exit criterion for ongoing feature delivery is not "the tests pass" but "the service is within its error budget." If the budget is exhausted, feature releases are paused until reliability work restores it — regardless of how green the test suite is.

This is a meta-criterion: it sits above the per-release criteria and sometimes overrides them. Teams operating this way need both — the per-release criteria say "this particular change is safe to roll out," the error-budget criterion says "the service has headroom to absorb another change right now." Either one can veto.
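
The two-veto structure can be sketched directly; the SLO, window counts, and function names below are illustrative:

```python
# Illustrative error-budget meta-criterion for a 99.9% availability SLO.

def error_budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the window's error budget still unspent (can go negative)."""
    budget = (1 - slo) * total_events          # allowed bad events in the window
    bad = total_events - good_events           # observed bad events
    return (budget - bad) / budget

def release_allowed(per_release_gates_green: bool, budget_left: float) -> bool:
    # Either check can veto: the change must be safe AND the service must
    # have headroom to absorb another change right now.
    return per_release_gates_green and budget_left > 0

remaining = error_budget_remaining(slo=0.999, good_events=999_200, total_events=1_000_000)
print(round(remaining, 2))               # → 0.2  (80% of the budget spent)
print(release_allowed(True, remaining))  # → True
```

With the same green test suite but a negative `remaining`, `release_allowed` returns False — the meta-criterion overrides the per-release result.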

How to build the list

Concretely, how do you author a good release-criteria set for a release you're planning now? Here's the sequence we use:

  1. Start from the risk register and requirements list. For each high-priority risk and each named requirement or quality attribute, ask: what evidence convinces us this is delivered correctly? Write that as the criterion.
  2. Move what you can to automation. For each criterion, ask: can this be an automated CI gate? If yes, move it there and remove it from the manual release checklist.
  3. Define entry, suspension, and resumption. For each test phase, define the three non-exit criteria explicitly. Most release-process failures happen because one of these three is missing.
  4. Name an owner per criterion. Every criterion has one named owner who signs off pass/fail. If you can't name them, the criterion isn't mature yet.
  5. Pre-declare the waiver path. If a criterion fails at release time, what's the path to a documented waiver? Who approves? What's the acceptance condition? Pre-declaring this removes a surprising amount of release-day friction.
  6. Review with stakeholders before the release. The list should be agreed before testing starts, not negotiated at the release review. A criterion discovered at release time is a criterion the team couldn't meet.
The pre-commit, not the release meeting

The most valuable moment in a release-criteria process is the moment the list is agreed — weeks before the release. Stakeholders who sign off early, in writing, on what "done" looks like are much less able to move the goalposts later. This isn't bureaucracy; it's a commitment device that protects the team from late-cycle scope drift.

Common failure modes — a diagnostic checklist

If you're trying to figure out why your release process keeps ending in fire drills, work this list:

  1. Criteria discovered at release time. If new criteria appear at the release review, the list isn't agreed. Fix: pre-commit to criteria at test-plan sign-off.
  2. Criteria without owners. A list with no names is a list nobody is responsible for. Fix: one named owner per item, signed.
  3. Criteria with qualitative thresholds. 'Quality is acceptable' is not a threshold. Fix: every criterion has a measurable pass/fail.
  4. Waived criteria with no written record. Undocumented waivers train the team to treat the list as optional. Fix: every waiver in writing, with a named approver and stated reason.
  5. No suspension criteria. When the build collapses, the team grinds through it anyway because stopping feels political. Fix: explicit suspension triggers owned by the test lead.
  6. Criteria that can't be measured pre-release. User satisfaction, NPS, conversion lift — these are post-release metrics. Fix: move them to the post-release review instead of gating on them.
  7. Self-referential exit criteria. 'All planned test cases executed' with no audit of whether the plan was right. Fix: trace criteria to risks and requirements, not to the plan itself.
  8. No differentiation across the four kinds. Entry, exit, suspension, and resumption are one blurry checklist. Fix: four separate sections, each owned, each reviewed.

What this buys you

A release criteria set built this way turns the release decision from a political conversation into a data review. The conversation in the room becomes "do we have evidence for each criterion, yes or no," not "does this feel ready." The criteria are defensible to auditors, explainable to executives, and teachable to new team members. And critically — the act of authoring them forces the conversation about scope, risk, and acceptance to happen early, when change is cheap, instead of at release time, when change is expensive and emotional.

Sources and further reading

  • Simmons, Erik. "Requirements to Release Criteria: Testing in Context." Proceedings of the Pacific Northwest Software Quality Conference (PNSQC), 2001. The seminal paper making the case for traceability-based release criteria.
  • Beyer, Betsy, et al. Site Reliability Engineering. O'Reilly, 2016. Chapter on error budgets; source of the meta-criterion concept.
  • Humble, Jez, and David Farley. Continuous Delivery. Addison-Wesley, 2010. Source of the automated-quality-gate pattern and the build-pipeline thinking.

Rex Black, Inc.

Enterprise technology consulting · Dallas, Texas
