Whitepaper · Updated April 2026 · 13 min read

Metrics for Software Testing, Part 2: Process Metrics

Process metrics measure the capability of the test process and the software process around it — defect detection effectiveness, the important-bugs ratio, defect closure period, and the reopen distribution. Part 2 of a four-part series, with benchmarks and charts you can copy.

Test Metrics · Process Metrics · Defect Detection Effectiveness · DDE · DDP · Test Process · Test Management

Series · Part 2 of 4 · Managing with facts

Process metrics measure the capability of the test process and the software process around it. They are the least-used and least-understood of the three kinds of test metrics, and they are the ones that drive process improvement. This paper walks through the four that matter most — with benchmarks, interpretation rules, and worked examples.

Four-part series: Part 1 — Why & how · Part 2 (this paper) — Process metrics · Part 3 — Project metrics · Part 4 — Product metrics

What process metrics are for

Process metrics measure process capability — how good the test process is at what it's supposed to do, how good the software process is at preventing defects in the first place, and where the leverage points are to make either one better. They let you benchmark yourself against other organizations and, more importantly, against your own previous state. They also let you decide where not to invest, because some parts of your process are already good enough.

The non-negotiable rule

Process metrics measure process capability, not team or individual capability. Most of the factors controlling a process's capability are under management's control, not the individual's. Using process metrics for performance appraisal destroys the metrics — people will game them — and destroys your ability to improve the process.

Applying the top-down framework from Part 1, process metrics ask three questions for each test-process objective:

  • Effectiveness. Is the process producing the desired result?
  • Efficiency. Is it producing the result without waste?
  • Elegance. Is the work graceful, and does it hold up in front of stakeholders?

Let's apply it.

Effectiveness — Defect Detection Effectiveness (DDE)

The single most important effectiveness metric for a test process. Also widely known as DDP (Defect Detection Percentage); we'll use DDE here.

Testing is a bug filter. DDE asks what percentage of the bugs present in the system during testing were actually caught by the test process. The general form:

                        bugs found during testing
     DDE  =  ──────────────────────────────────────────────────
              bugs found during testing  +  bugs found after

The denominator is usually approximated by counting all bugs reported during the testing phase plus all bugs reported from production through the end of the measurement window. Bugs that are present but never detected don't count toward DDE — by construction, you can't measure what you never see. For test-process evaluation this is fine: defects that never result in observable behavior don't matter for a behavior-oriented test process.
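In code, the general form is a one-liner over the two observed counts. A minimal sketch; the counts here are illustrative, not from any particular project:

```python
def dde(found_in_testing: int, found_after: int) -> float:
    """Defect Detection Effectiveness: share of observed bugs caught by testing."""
    total = found_in_testing + found_after
    if total == 0:
        raise ValueError("no observed defects in the measurement window")
    return found_in_testing / total

# Illustrative: 425 bugs found during testing, 75 reported afterward
print(f"DDE = {dde(425, 75):.0%}")  # prints: DDE = 85%
```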

For the final test phase before release — system test, acceptance test, or the pre-release hardening period on a continuous-delivery program — DDE is:

                           bugs found in final test phase
   Final DDE  =  ───────────────────────────────────────────────
                  final-test-phase bugs  +  production bugs

  • 85% · Industry baseline (typical): median final-phase DDE across dozens of client assessments. Your starting benchmark.
  • 95% · Upper practical range: many teams can reach this without sacrificing efficiency. Harder beyond.
  • ~0% · Teams that hit 100%: none. No test process finds every bug; if you're reporting 100%, you're miscounting.
  • Break-even · Cost per bug in test vs. production: when test-phase cost per bug approaches production cost per bug, the economic return on extra testing starts to turn negative.

How to interpret the number

A final-phase DDE below 85% is a red flag: the test process is leaving a noticeable tail of defects to customers. Before writing an improvement plan, investigate the cause. Common causes in modern programs include weak or outdated quality risk analysis, a test environment that diverges from production, test data that doesn't reflect real usage, and, increasingly, behavior gaps left by LLM-assisted test generation. Don't assume individuals are the problem; the data almost never supports that.

Between 85% and 95%, improvement is almost always possible at a reasonable cost. Above 95%, the marginal cost of each additional caught bug starts to rise fast. The break-even question is specific to your business: what does a customer-found bug cost you compared to a test-found bug? For safety-critical and regulated work, the acceptable cost-per-bug in testing is much higher than it is for consumer software, and DDE targets should be higher correspondingly.

The important-bugs variant

DDE for all bugs isn't enough. Most test processes exist primarily to find important bugs. Add a second metric:

   DDE-important / DDE-all   >   1

That's the relationship we want. Across dozens of assessments, we routinely see the inverse — DDE for all bugs is higher than DDE for important bugs. That's a sign that the test approach isn't prioritizing by risk. The fix is usually a proper risk-based testing strategy (see the companion papers in the risk cluster).

Why the ratio, not the difference?

You could also track the difference (DDE-important minus DDE-all). Don't. Metrics shape behavior. If testers notice that ignoring less-important bugs pushes the difference up, some of them will start to file fewer minor reports — which destroys another important objective of testing: producing complete information. The ratio doesn't create that incentive.
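A minimal sketch of the pair, with hypothetical counts: 500 bugs total (440 caught in test), of which 120 were important (114 caught in test):

```python
def dde(found_in_testing: int, found_after: int) -> float:
    """Share of observed bugs that the test process caught."""
    return found_in_testing / (found_in_testing + found_after)

# Hypothetical counts: (found in test, found after) for each bucket
dde_all = dde(440, 60)         # 88% across all bugs
dde_important = dde(114, 6)    # 95% across important bugs only

ratio = dde_important / dde_all
print(f"DDE-important/DDE-all = {ratio:.2f}")  # prints: DDE-important/DDE-all = 1.08
```

A ratio above 1 is the healthy direction: the test approach is finding the important bugs at a higher rate than bugs overall.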

Beyond final-phase DDE — phase containment across the lifecycle

DDE as we've defined it so far measures a single test phase. That's useful, and also limiting. A defect caught in system test after three earlier chances to catch it — requirements review, design review, code inspection — is a failure of the software process, even if the test phase did its job. To see the whole picture, we have to measure the efficiency of every defect-removal activity, not just the last one.

The industry term for this extended view is phase containment or defect removal efficiency across the lifecycle (the latter terminology was coined by Capers Jones in the 1990s and is now used across IEEE, ISTQB, and academic literature). The idea is the same as DDE: for each phase, what percentage of the defects present going into that phase got caught before they leaked to the next one? Plot them end to end and you get a funnel.

Phase containment — full-lifecycle DDE

Defect-removal funnel across a typical modern delivery lifecycle

Per-phase capture rates vary widely by organization and product type. These are reasonable industry benchmarks — use as a starting point, then measure your own.

  • Requirements review (ambiguity, gaps, inconsistency, overspec): caught 320 (19%), 1.3k escaped
  • Design / architecture review (interfaces, data flow, failure modes): caught 280 (21%), 1.1k escaped
  • Code review + SAST (logic, security patterns, coding standards): caught 420 (40%), 640 escaped
  • Unit test + automated contract test (function-level correctness, schema drift): caught 260 (41%), 380 escaped
  • Integration + API contract test (cross-service behavior, environment coupling): caught 180 (47%), 200 escaped
  • System + performance + security test (end-to-end correctness, NFRs, pen test): caught 140 (70%)
  • Production monitoring + customer (escaped defects, found after release): caught 60 (100%)

Filled green bar = defects caught at this phase. Gray bar = defects still present going to the next phase. A healthy program catches 70%+ of introduced defects before system test, and under 5% reach customers.
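The funnel arithmetic is easy to reproduce: start from the total injected, then at each phase divide the catch count by the defects still present. A sketch using the catch counts from the funnel above:

```python
# Phase-containment funnel recomputed from per-phase catch counts
# (counts from the funnel above; phase order matters)
phases = [
    ("Requirements review", 320),
    ("Design / architecture review", 280),
    ("Code review + SAST", 420),
    ("Unit + contract test", 260),
    ("Integration + API test", 180),
    ("System / perf / security test", 140),
    ("Production", 60),
]

introduced = sum(caught for _, caught in phases)   # 1,660 defects injected overall
remaining = introduced
for name, caught in phases:
    containment = caught / remaining   # share of defects present that this phase caught
    remaining -= caught                # defects escaping to the next phase
    print(f"{name:30s} caught {caught:3d} ({containment:4.0%}), {remaining:4d} escape")
```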

Why phase containment matters more today

The economics of where you catch a bug have not changed since the 1970s: the cost of fixing a defect grows roughly by a factor of 10 with each phase it leaks through. What has changed is how many phase-level gates you now have available:

  • Shift-left (earlier phases, higher leverage): LLM-assisted requirements review, AI-paired code review, and SAST in pre-commit hooks each add a phase-containment gate before the expensive ones.
  • Prevention (the best phase is "never introduced"): defect prevention practices like coding standards, templates, strong types, and architecture patterns show up as a lower initial defect-injection rate, not as a phase capture.
  • DevSecOps (continuous security gates): SAST/DAST/IAST and SBOM scanning move security-defect capture from system test back to CI, effectively a new early phase.
  • Observability (production as a phase): good production telemetry catches a meaningful slice of escaped defects before customers file reports; "discovery" moves earlier.

Reading the funnel

Two questions to ask of the shape:

  1. Is the funnel front-loaded? Most high-performing programs now catch 60–80% of their defects before the final system-test phase. If your funnel is bottom-heavy — most captures in system test and production — you have a shift-left opportunity that is almost always economically worthwhile.
  2. Is the escape tail short? The production bar should be short. A long production bar signals either a weak final test phase (low final-phase DDE), a weak quality-risk analysis (the right risks aren't being tested), or an environment/test-data mismatch that lets real-world failure modes escape detection.

Benchmark your phase-containment profile against itself across releases. Cross-organization benchmarks exist (IEEE, ISBSG, various industry studies) but they vary wildly by domain — regulated medical software has completely different numbers than a consumer SaaS product. Your own trend line is almost always more useful than an industry median.

Defect potentials is a separate input

Phase containment measures capture. "Defect potentials" (another Jones term) measures injection — how many defects per function point or per KLOC get introduced to begin with. Containment + injection together give a complete picture. We don't cover defect-potential estimation in this paper, but any mature metrics program eventually measures both, because the cheapest defect to remove is the one that was never introduced.

Efficiency — Defect Closure Period (DCP)

DDE is effectiveness. DCP is efficiency. It measures how quickly bugs move from filed to resolved:

   DCP (for a bug)  =  closed_date  −  opened_date     (days)

Reported per-bug, DCP is noisy. Two derived metrics are far more useful:

  • Daily DCP — the average closure period for all bugs closed on a given day.
  • Rolling DCP — the average closure period for all bugs closed on or before a given day (cumulative to date).

Chart them together against calendar time.
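A sketch of how the two series fall out of the raw (opened, closed) dates; the bugs here are illustrative:

```python
from collections import defaultdict
from datetime import date

# Illustrative (opened, closed) date pairs for a handful of bugs
bugs = [
    (date(2026, 3, 2), date(2026, 3, 6)),
    (date(2026, 3, 3), date(2026, 3, 6)),
    (date(2026, 3, 4), date(2026, 3, 9)),
    (date(2026, 3, 5), date(2026, 3, 9)),
]

# Group per-bug closure periods by the day the bug was closed
by_close_day = defaultdict(list)
for opened, closed in bugs:
    by_close_day[closed].append((closed - opened).days)

closed_so_far = []
for day in sorted(by_close_day):
    periods = by_close_day[day]
    closed_so_far.extend(periods)
    daily = sum(periods) / len(periods)                 # daily DCP
    rolling = sum(closed_so_far) / len(closed_so_far)   # rolling (cumulative) DCP
    print(f"{day}: daily DCP = {daily:.1f} d, rolling DCP = {rolling:.1f} d")
```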

Efficiency metric

Daily and rolling defect closure period

Healthy shape — rolling curve is flat and low; daily fluctuates randomly within a narrow band around it.

[Chart: daily DCP and rolling DCP over weeks W1–W12, y-axis 0–10 days to close, with upper and lower acceptable bounds.]

Upper and lower bounds come from the project plan's bug-turnaround SLA. The lower bound is not a typo — fixing bugs 'too fast' usually means mass-deferring them, which leaves the bug in the product.

The two lenses — stable and acceptable

Stable means low day-to-day variance, a rolling curve with a near-constant slope near zero, and daily values that fluctuate randomly around the rolling curve within a few days in either direction. Stability implies a process under control.

Acceptable means both curves fall within the upper and lower bounds set by the project plan. Yes, there's a lower bound too. Deferring every bug the day it's opened drives the daily DCP to zero — but the bugs are still in the product. Fixing bugs too fast usually means fixing them poorly: we audited a project once where the test team took releases two or three times a day for bugs identified hours earlier. Closure period was ~1.2 days, and multiple bugs had been reopened ten or more times because the fixes kept failing. Acceptable means honest turnaround, not minimal turnaround.
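"Stable" can be given a crude operational check: compare each daily value against the rolling curve at that point and flag days outside a fixed band. A sketch; the two-day band width is an assumption, not a standard:

```python
def stable(daily: list[float], band: float = 2.0) -> bool:
    """Crude stability check: every daily DCP value stays within +/- band days
    of the rolling (cumulative-to-date) mean at that point in time."""
    total = 0.0
    for i, d in enumerate(daily, start=1):
        total += d
        rolling = total / i
        if abs(d - rolling) > band:
            return False
    return True

print(stable([4.0, 3.5, 4.5, 3.8, 4.2]))  # narrow fluctuation around ~4 days
print(stable([2.0, 9.0, 1.0, 8.5, 2.5]))  # wild swings, not a process under control
```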

Trade-offs show up here

Early in a process-improvement program, you can usually push every metric in the right direction. At some point effectiveness, efficiency, and elegance start to trade off against each other. Pushing DDE from 95% to 98% often pushes DCP up too (harder bugs take longer to fix); pushing DCP down aggressively often pushes the reopen rate up (fast fixes are sloppy fixes). The top-down framework in Part 1 handles this honestly: set goals for balanced metrics, not for one metric at a time.

The software process — reopen count

DCP measures the test process's efficiency. A closely related metric measures the software process's efficiency at fixing bugs correctly the first time: the reopen count.

Most bug trackers can be configured to count how many times a given report has been opened (1 when first filed, incremented each time it gets reopened because the fix failed confirmation testing). Ideally every bug is opened exactly once. Every reopen represents wasted test cycles, wasted development time, and real schedule risk.

Software-process efficiency

Reopen distribution on an audited project

17% of bugs were reopened at least once. One bug was reopened 10 times.

  • Opened once: 830
  • Opened twice: 110
  • Opened 3 times: 32
  • Opened 4 times: 14
  • Opened 5 times: 8
  • Opened 6+ times: 6

Rework cost estimate: if each confirmation + regression cycle for a failed fix costs roughly 1 person-hour, the 170 bugs reopened at least once represent ~212 lost person-hours. On a planned 10-person × 6-week effort (2,400 person-hours at 40 hours per week), that's ~9% of the planned test budget consumed by software-process inefficiency.

On this particular project, 17% of reports were reopened at least once — each reopen requiring another confirmation test and typically some regression testing. Estimating one person-hour per reopen, the inefficiency works out to about 212 lost person-hours, or roughly 9% of the originally-planned test effort. That's a big number, and the cause lives in the development process, not the test process. The test team gets that number back only by surfacing it.
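The headline share can be checked straight from the histogram. Counts are from the chart above, with the 6+ bucket kept as a single bin:

```python
# Reopen histogram from the audited project (key = times opened; 6 stands for "6 or more")
open_counts = {1: 830, 2: 110, 3: 32, 4: 14, 5: 8, 6: 6}

total = sum(open_counts.values())
reopened = sum(n for times, n in open_counts.items() if times > 1)
share = reopened / total

print(f"{reopened} of {total} reports reopened at least once ({share:.0%})")
# prints: 170 of 1000 reports reopened at least once (17%)
```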

Do these metrics show up in test dashboards? Mostly no.

Here's the awkward part: process metrics rarely appear on test dashboards. Most test dashboards are project-focused — trends in bugs and test execution for the current release. That's useful but incomplete. A reopen histogram is one of the most persuasive pieces of evidence you can put in front of an engineering leader when you need to argue for better fix-review or pre-merge testing practices. It belongs on the dashboard.

The same applies to test assessments. Many maturity-model assessments (TMMi, CTP, internal QMS assessments) rely on subjective interviews to check whether practices are performed at all, without measuring how effectively, efficiently, or elegantly they are performed. A test team with a good bug tracker and a clean bug-management process can still have a very low DDE — and an assessment that doesn't include metrics will miss it. There is no understanding, and no basis for rational decisions, without metrics.

What to adopt

A minimal balanced process-metrics set, which most teams can put in place in a few weeks:

  • DDE (effectiveness): final-phase DDE and important-bug DDE. Report quarterly.
  • DCP (efficiency): daily + rolling closure period with upper and lower SLA bounds. Report on the project dashboard.
  • Reopens (fix quality): distribution of reopen counts. Report monthly or at project close.
  • +1 (elegance): a short quarterly stakeholder satisfaction survey (Likert 1–5) on the usefulness of testing reports. One signal, cheap, candid.

Start with these four, set goals against your own baseline, review quarterly, and adjust as the process matures. Resist adding more until you're using these four in real decisions.

Where this goes next

Part 3 — Project Metrics covers the most widely used test metrics of all: multi-series bug trends, test-case fulfillment, and test-execution hours. These are the metrics that typically show up on project dashboards — and the ones that are most often misread.


Rex Black, Inc. (RBI) · Enterprise technology consulting · Dallas, Texas
