Whitepaper · Updated April 2026 · 12 min read

Exit and Release Criteria: A Framework for Knowing When to Ship

Exit criteria answer the hardest question in testing: when is the work done? This paper covers the four kinds of criteria you actually need (entry, exit, suspension, resumption), the traceability-based approach that ties each criterion back to a risk or requirement, modern patterns — progressive-rollout gates, error budgets, automated quality gates in CI — and the checklist of failure modes that produce useless criteria.

Release Criteria · Exit Criteria · Test Management · Release Management · CI/CD · Quality Gates

Whitepaper · Companion to Release Management

"When is testing done?" is the single hardest question a test manager has to answer, because it's not actually a testing question — it's a release-decision question pretending to be a testing question. Good exit criteria make the decision legible, defensible, and measurable. Bad exit criteria make the decision political and inconsistent. This paper is about the difference.

Pairs with the Test Release Processes whitepaper (which deliberately scopes release criteria out) and the Quality Risk Analysis whitepaper (which is where the traceability comes from).

Four kinds of criteria — not one

Most teams talk about "exit criteria" as if it were a single checklist. In practice there are four distinct kinds of gates, each answering a different question. Conflating them is the source of most release-argument dysfunction.

  • Entry: Can we start? What must be true before a test phase, cycle, or build can begin — the minimum inputs we require before we accept responsibility for running.
  • Exit: Are we done? What must be true before a test phase or release can be declared complete — the outputs and conditions we commit to producing.
  • Suspension: Should we stop? What conditions trigger a halt to testing — build too broken to progress, blocker defect tail, infrastructure down, scope drift that invalidates the plan.
  • Resumption: Can we restart? What conditions must be met after a suspension before testing can resume — often tighter than entry criteria, because we've burned trust.

Every release cycle needs all four. Entry criteria prevent premature work; exit criteria make completion legible; suspension criteria protect the team from pouring effort into a collapsing build; resumption criteria prevent re-entry into the same failure mode. If your process documents one without the others, you have half a process.

A note on vocabulary

Different shops use different words. "Acceptance criteria," "release criteria," "done criteria," "quality gates," "promotion criteria" — you'll see all of them. The semantics matter less than the separation between the four questions above. Pick words that make sense for your organization and stick with them.

What a good criterion looks like

Before we get to the frameworks, here's the bar each individual criterion has to clear. A criterion is a commitment — a statement of the form "we will do X" or "we will not ship until Y" — and like all commitments it has to be honest, measurable, and falsifiable.

  • Measurable: no judgment calls in the statement. 'System test is complete' is not a criterion. '100% of planned test cases executed and 100% of planned risks covered at the agreed depth' is.
  • Binary: pass or fail, not 'mostly'. At decision time, the answer must be yes or no. A criterion that needs interpretation in the room is a political lever, not a gate.
  • Traceable: tied to a risk or requirement. Every criterion should answer the question 'why this one?' with a pointer to a specific risk item, requirement, regulation, or business commitment.
  • Honest: we will actually hold to it. A criterion we quietly waive under pressure is worse than no criterion — it trains the organization to treat the list as decorative.
  • Timely: measurable before we ship, not after. Criteria that require data we won't have until production (e.g., 'user satisfaction above X') belong in post-release metrics, not release gates.
  • Owned: a named person signs off. Each criterion has one owner who is accountable for calling pass or fail. Committee ownership is no ownership.
The 'zero P1 defects' trap

"Zero P1 defects at release" sounds like a strong criterion and is almost always a bad one. In practice one of two things happens: either the team finds creative reasons to downgrade a P1 to a P2 near release, or they find creative reasons not to file a P1 in the first place. Either way the metric stops measuring what it claims to measure. A better version: "all known P1 defects are either fixed, have an accepted written waiver from a named approver, or have a mitigation that has been tested end to end."

The three approaches, in increasing quality

QRA teaches that there are informal, checklist-based, and rigorous techniques for discovering risks. The same hierarchy applies to authoring criteria. You can do this casually and get a casual result, or you can do it rigorously and get a defensible one.

Approach 1: Coverage + defect tail (the default)

The most common approach, and the starting point for everyone. Criteria are expressed in terms of test-execution completeness and defect-register state:

  • 100% of planned test cases executed
  • ≥ X% pass rate on regression suite
  • Zero open P1 defects (with the waiver clause above)
  • Fewer than N open P2 defects, with a written plan for each
  • Defect trend stable or declining for the last N days
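
Criteria of this kind can be checked mechanically. A minimal sketch — the field names, thresholds, and gate list below are illustrative, not a standard:

```python
# Hypothetical evaluation of coverage + defect-tail criteria.
# All metric names and thresholds are illustrative.

def evaluate_gates(m: dict) -> list[str]:
    """Return the names of failed gates; an empty list means all gates are green."""
    gates = {
        "100% of planned test cases executed": m["executed"] == m["planned"],
        "regression pass rate >= 98%": m["regression_pass_rate"] >= 0.98,
        "zero unwaived P1 defects": m["open_p1"] <= m["waived_p1"],
        "fewer than 10 open P2 defects": m["open_p2"] < 10,
    }
    return [name for name, ok in gates.items() if not ok]

failed = evaluate_gates({
    "executed": 412, "planned": 412,
    "regression_pass_rate": 0.991,
    "open_p1": 1, "waived_p1": 1,   # one P1 open, but formally waived
    "open_p2": 6,
})
print(failed)  # → []
```

Note that the P1 gate encodes the waiver clause explicitly — an open P1 only passes if a matching waiver exists.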

This works, in the sense that it's better than no criteria at all. What it misses is why those numbers are the right numbers. 100% of which planned test cases? Tests for which risks? Which requirements? Which regulations? The criteria are self-referential to the test plan — they check whether we did what we said we'd do, not whether what we did was sufficient.

Approach 2: Traceability to requirements (Simmons' contribution)

Erik Simmons, in "Requirements to Release Criteria: Testing in Context" (PNSQC 2001), made the argument that release criteria should trace back to requirements — that each criterion should be derivable from a specific requirement, quality attribute, or business commitment. This is the approach that distinguishes professional release engineering from ceremonial release engineering.

Concretely, for each requirement or quality attribute, the team asks:

  1. What does "this is delivered correctly" mean for this requirement?
  2. What evidence would convince a skeptical reviewer?
  3. What's the measurable threshold?
  4. Who signs off?

The resulting criteria read like this — each one pointing back to a source document:

  • R-034 (auth timeout) — session timeout of 30 minutes verified across web, mobile, and API surfaces; zero deviations in test report #TR-112. Owner: Security lead.
  • NFR-008 (P95 latency) — P95 ≤ 250ms at 2× forecast peak load, verified in staging environment matching production topology, report #PR-44. Owner: Performance lead.
  • Regulation GDPR-Art-17 — right-to-erasure flow tested end to end including downstream cache invalidation, report #CR-09. Owner: Compliance lead.
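
A small data model makes that traceability auditable. As a sketch (the field names are mine, not Simmons'), every criterion carries its source, its evidence pointer, and a single owner:

```python
from dataclasses import dataclass

# Illustrative model of a traceable release criterion; field names are assumptions.
@dataclass(frozen=True)
class ReleaseCriterion:
    source: str     # requirement, NFR, or regulation it traces back to
    statement: str  # the measurable pass/fail condition
    evidence: str   # report or artifact ID backing the pass/fail call
    owner: str      # the one named person who signs off

criteria = [
    ReleaseCriterion("R-034", "30-minute session timeout verified on web, mobile, API",
                     "TR-112", "Security lead"),
    ReleaseCriterion("NFR-008", "P95 latency <= 250 ms at 2x forecast peak load",
                     "PR-44", "Performance lead"),
    ReleaseCriterion("GDPR-Art-17", "right-to-erasure flow tested end to end",
                     "CR-09", "Compliance lead"),
]

# Traceability audit: no criterion without a source, none without an owner.
orphans = [c.statement for c in criteria if not (c.source and c.owner)]
print(orphans)  # → []
```

The audit in the last two lines is the point: a criterion that cannot name its source or its owner fails the gap check described below.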

This approach forces two useful things. First, any requirement with no corresponding criterion is either unimportant (in which case it shouldn't be a requirement) or we have a gap. Second, any criterion with no corresponding requirement is either protecting against something we've forgotten to write down (in which case let's write it down) or it's there for political reasons (in which case let's be honest about that).

Approach 3: Risk-weighted + traceability (the rigorous version)

The next step up: derive criteria not just from requirements but from the quality risk register. Each high-priority risk item gets a corresponding criterion that specifies what evidence will be required before we accept that risk as sufficiently mitigated.

Traceability flow

From risks to release criteria

Each layer narrows. Not every risk item produces a release criterion — only the ones whose residual risk, after planned mitigation, is still above the team's tolerance.

[Funnel diagram — each layer narrows. Illustrative counts:]
  • Quality risk items identified: full risk register from QRA workshop (668 items)
  • Risks with planned mitigation: items covered by test cases, reviews, static analysis, etc. (428)
  • Residual risk above tolerance: items whose unmitigated tail still matters at release (228)
  • Release criteria authored: explicit gate with evidence, threshold, and owner (138)
  • Criteria passed at release decision: green gates at the release review meeting (68)

A healthy release process has at least this much narrowing. A register where every risk becomes a gate is unmanageable; a process where no risks become gates is ceremonial.

This is the approach we use with clients. It produces a release review meeting where every criterion has a stated purpose (the risk it's guarding), a named owner, a measurable threshold, and an evidence trail. Waivers are still possible — they always are — but they are explicit, written, and named. The political game of quiet re-interpretation is eliminated because there's nothing left to re-interpret.

Modern patterns

The criteria above describe the state of the product at a single release decision. Continuous delivery, progressive rollout, and error-budget-based operation have added three new kinds of criteria that sit alongside the classic set.

Automated quality gates in CI

A large share of what used to be exit criteria can now be expressed as automated gates in the build pipeline: unit-test pass rate, code-coverage threshold, SAST/DAST findings under a threshold, dependency scan clean, performance-regression guard, schema-contract test pass. The human release review is then a review of the exceptions — risks that can't be expressed as an automated gate — rather than a review of the whole set.

Rule of thumb: if a criterion can be expressed as an automated gate that runs on every commit, put it there. The human release review should focus on what can't be automated (integration risk, product-judgment risk, regulatory sign-off, business-readiness), not on things a CI job can decide faster and more reliably.
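
As a sketch, a coverage gate of this kind is just a script that reads the coverage tool's report and fails the pipeline when the threshold is missed. The report format and the 80% floor below are assumptions, not any particular tool's output:

```python
import json

THRESHOLD = 0.80  # assumed team-agreed line-coverage floor

def coverage_gate(report_path: str) -> int:
    """Return a CI exit code: 0 to let the pipeline continue, 1 to fail it."""
    with open(report_path) as f:
        report = json.load(f)
    covered = report["covered_lines"] / report["total_lines"]
    if covered < THRESHOLD:
        print(f"GATE FAIL: coverage {covered:.1%} is below {THRESHOLD:.0%}")
        return 1
    print(f"GATE PASS: coverage {covered:.1%}")
    return 0

# Simulate a report the coverage tool would have written.
with open("coverage.json", "w") as f:
    json.dump({"covered_lines": 850, "total_lines": 1000}, f)
print(coverage_gate("coverage.json"))  # → 0
```

Wired into the pipeline as a step whose nonzero exit fails the build, this runs on every commit — exactly the kind of criterion that should leave the manual checklist.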

Progressive rollout gates

When a release is not a single binary event but a progression (canary → 1% → 10% → 50% → 100%), each stage has its own criteria:

  • Canary (single instance / internal users): no error-rate degradation vs. control, no P1 telemetry signals, smoke tests pass against the live canary.
  • 1% rollout (small production slice): error rate within tolerance band for N minutes; no customer-impact incidents; key business metrics stable.
  • 10–50% (substantial production traffic): sustained error rate within band; performance metrics within SLOs; no outsized cohort impact (by region, plan tier, platform).
  • 100% (full release): all previous gates held across the full duration; no pending rollback signals; post-release plan owner confirmed.
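
A stage gate like this can be sketched as a small pure function the rollout controller calls on each evaluation tick. The tolerance band and signal names are illustrative:

```python
# Hypothetical single-stage rollout gate: compare the new cohort's error
# rate against control within a tolerance band. Thresholds are illustrative.

def stage_gate(canary_error_rate: float, control_error_rate: float,
               open_p1_signals: int, tolerance: float = 0.001) -> str:
    """Return 'promote', 'hold', or 'rollback' for the current stage."""
    if open_p1_signals > 0:
        return "rollback"                      # P1 telemetry signal: abort now
    if canary_error_rate > control_error_rate + tolerance:
        return "hold"                          # outside the band: don't advance yet
    return "promote"                           # within the band: advance the stage

print(stage_gate(0.0021, 0.0020, 0))  # → promote
print(stage_gate(0.0040, 0.0020, 0))  # → hold
```

Keeping the decision a pure function of observed metrics is what makes it operable under time pressure: the on-call engineer reads one word, not a dashboard.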

Each stage gate is a miniature release decision. The named owner is usually the on-call engineer or release manager, not a committee — the cadence is too fast for committee decisions. The criteria have to be operable under time pressure, which is a tougher constraint than the classic release-meeting criteria.

Error-budget-based criteria

For teams on SRE-style operation, the exit criterion for ongoing feature delivery is not "the tests pass" but "the service is within its error budget." If the budget is exhausted, feature releases are paused until reliability work restores it — regardless of how green the test suite is.

This is a meta-criterion: it sits above the per-release criteria and sometimes overrides them. Teams operating this way need both — the per-release criteria say "this particular change is safe to roll out," the error-budget criterion says "the service has headroom to absorb another change right now." Either one can veto.
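
The two-veto structure can be sketched directly; the SLO, window counts, and function names below are illustrative:

```python
# Illustrative error-budget meta-criterion for a 99.9% availability SLO.

def error_budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the window's error budget still unspent (can go negative)."""
    budget = (1 - slo) * total_events          # allowed bad events in the window
    bad = total_events - good_events           # observed bad events
    return (budget - bad) / budget

def release_allowed(per_release_gates_green: bool, budget_left: float) -> bool:
    # Either check can veto: the change must be safe AND the service must
    # have headroom to absorb another change right now.
    return per_release_gates_green and budget_left > 0

remaining = error_budget_remaining(slo=0.999, good_events=999_200, total_events=1_000_000)
print(round(remaining, 2))               # → 0.2  (80% of the budget spent)
print(release_allowed(True, remaining))  # → True
```

With the same green test suite but a negative `remaining`, `release_allowed` returns False — the meta-criterion overrides the per-release result.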

How to build the list

Concretely, how do you author a good release-criteria set for a release you're planning now? Here's the sequence we use:

  1. Start from the risk register and requirements list. For each high-priority risk and each named requirement or quality attribute, ask: what evidence convinces us this is delivered correctly? Write that as the criterion.
  2. Move what you can to automation. For each criterion, ask: can this be an automated CI gate? If yes, move it there and remove it from the manual release checklist.
  3. Define entry, suspension, and resumption. For each test phase, define the three non-exit criteria explicitly. Most release-process failures happen because one of these three is missing.
  4. Name an owner per criterion. Every criterion has one named owner who signs off pass/fail. If you can't name them, the criterion isn't mature yet.
  5. Pre-declare the waiver path. If a criterion fails at release time, what's the path to a documented waiver? Who approves? What's the acceptance condition? Pre-declaring this removes a surprising amount of release-day friction.
  6. Review with stakeholders before the release. The list should be agreed before testing starts, not negotiated at the release review. A criterion discovered at release time is a criterion the team couldn't meet.
The pre-commit, not the release meeting

The most valuable moment in a release-criteria process is the moment the list is agreed — weeks before the release. Stakeholders who sign off early, in writing, on what "done" looks like are much less able to move the goalposts later. This isn't bureaucracy; it's a commitment device that protects the team from late-cycle scope drift.

Common failure modes — a diagnostic checklist

If you're trying to figure out why your release process keeps ending in fire drills, work this list:

  1. Criteria discovered at release time. If new criteria appear at the release review, the list isn't agreed. Fix: pre-commit to criteria at test-plan sign-off.
  2. Criteria without owners. A list with no names is a list nobody is responsible for. Fix: one named owner per item, signed.
  3. Criteria with qualitative thresholds. 'Quality is acceptable' is not a threshold. Fix: every criterion has a measurable pass/fail.
  4. Waived criteria with no written record. Undocumented waivers train the team to treat the list as optional. Fix: every waiver in writing, with a named approver and stated reason.
  5. No suspension criteria. When the build collapses, the team grinds through it anyway because stopping feels political. Fix: explicit suspension triggers owned by the test lead.
  6. Criteria that can't be measured pre-release. User satisfaction, NPS, conversion lift — these are post-release metrics. Fix: move them to the post-release review instead of gating on them.
  7. Self-referential exit criteria. 'All planned test cases executed' with no audit of whether the plan was right. Fix: trace criteria to risks and requirements, not to the plan itself.
  8. No differentiation across the four kinds. Entry, exit, suspension, and resumption are one blurry checklist. Fix: four separate sections, each owned, each reviewed.

What this buys you

A release criteria set built this way turns the release decision from a political conversation into a data review. The conversation in the room becomes "do we have evidence for each criterion, yes or no," not "does this feel ready." The criteria are defensible to auditors, explainable to executives, and teachable to new team members. And critically — the act of authoring them forces the conversation about scope, risk, and acceptance to happen early, when change is cheap, instead of at release time, when change is expensive and emotional.

Sources and further reading

  • Simmons, Erik. "Requirements to Release Criteria: Testing in Context." Proceedings of the Pacific Northwest Software Quality Conference (PNSQC), 2001. The seminal paper making the case for traceability-based release criteria.
  • Beyer, Betsy, et al. Site Reliability Engineering. O'Reilly, 2016. Chapter on error budgets; source of the meta-criterion concept.
  • Humble, Jez, and David Farley. Continuous Delivery. Addison-Wesley, 2010. Source of the automated-quality-gate pattern and the build-pipeline thinking.

Rex Black, Inc.

Enterprise technology consulting · Dallas, Texas
