Case Study · Updated April 2026 · 11 min read

A Risk-Based Testing Pilot: Six Phases, One Worked Example

A structured six-phase playbook for introducing risk-based testing on an existing product — with honest notes on where the impact-rating scale broke down and how we fixed it, plus the final RPN-to-effort mapping the pilot team used.

Risk-Based Testing · Quality Risk Analysis · RPN · Test Strategy · Pilot Programs


Most teams can agree risk-based testing is the right idea and still stall out on the first try. The methodology is only half the work — the other half is running a disciplined pilot. This is the six-phase playbook we use, written up with the actual numbers from a pilot on a mature enterprise product.

Read time: ~10 minutes. Written for test managers and engineering leaders preparing to introduce — or reset — risk-based testing on an existing product.

Why pilot at all?

Risk-based testing sounds obvious once you hear it: out of the infinite set of tests that could be run, pick the ones that address the ways the product is most likely to hurt customers. In practice, moving a team from "test what we tested last time" to "test what is riskiest now" is a cultural change as much as a methodological one. Pilots are how you make that change legible — to the team, to the business, and to you.

A good pilot has four jobs:

  • Produce a usable risk analysis for the product under test — not a demo artifact, a real one that will drive test effort for the next release.
  • Train the stakeholders who have to live with it. If analysts run the analysis alone, the business side won't trust the output.
  • Expose the places your rating scales break down. They always break down somewhere; you want to find out in week two of the pilot, not week six of the next major release.
  • Create a story you can tell the rest of the organization. Other teams adopt risk-based testing when they see it work next door, not when they see a slide deck.

The pilot described below was run on a mature, long-lived product from a large enterprise software vendor. It followed six phases. We've published the concrete numbers because the concrete numbers are what make the playbook usable.

If you want the methodology on its own, see our Risk-Based Testing webinar and the Quality Risk Analysis process checklist in the QA Library. This article is the case study. It assumes you already know what "likelihood," "impact," and "risk priority number" mean.

The six phases

Phase                             Goal                                         Who's in the room                               Approx. effort
1. Train                          Common vocabulary and rationale              Test team + key dev, PM, business stakeholders  1 day
2. Run the risk-analysis session  First full list of quality + project risks   Same group                                      ½–1 day
3. Analyze and refine             Rate every item, catch clumping,             Test manager + 1–2 analysts                     1–2 days
                                  adjust scales
4. Align testing with risk        Map risks → specs → tests; decide            Test manager + test leads                       2–3 days
                                  effort allocation
5. Guide the project with risk    Run tests in risk order; prioritize          Whole team, ongoing                             Duration of the release
                                  defects by risk
6. Assess benefits and lessons    Retrospective; carry what worked             Test management + sponsors                      ½ day
                                  into the next release

The rest of this article walks through each phase, with the decisions and the numbers.

Phase 1 — Train the stakeholders

Before anyone rates anything, everyone who will be in the risk-analysis session needs the same working vocabulary. The pilot used a one-day workshop covering:

  • The principles of and rationale for risk-based testing.
  • Categories of quality risks: functionality, performance, reliability, usability, security, installability, compatibility, maintainability, portability. (The exact list is less important than having a list the group agrees on.)
  • How to perform a quality risk analysis and align testing with risk levels.
  • How to document quality risks so the output is usable six months later.
  • How to monitor quality risks during test execution and report results back to stakeholders in risk terms.

The training was half presentation, half discussion, and included a two-hour hands-on exercise on a hypothetical project. The exercise is where you find out whether people actually understand the scale. If five people rate the same risk item three different ways during the exercise, they will do the same thing in the real session — that's a rating-scale problem, and phase 3 is going to be painful unless you fix it now.

Skipping training and going straight to the risk-analysis session is the single most common way pilots go sideways. The session turns into the training.

Phase 2 — Run the risk-analysis session

The session ran in two sub-sessions.

Sub-session A — enumerate. Participants brainstormed as many quality risk items as they could think of. The main quality risk categories were written on three whiteboards. Each participant wrote individual risks on sticky notes and posted them under the appropriate category. About three hours. The output was over 100 candidate items.

The team also enumerated 11 project risks (example: "The number and timing of QA bug discoveries delay the release date.") and 3 miscellaneous issues (example: "Have all previous release fixes been merged into the current code base?"). Project risks are not quality risks, but they affect the test program, so they belong on the same board.

Sub-session B — rate. Participants assessed likelihood and impact for each item, and deduplicated overlapping items. They used the rating scales below.

Likelihood        Rating   Comments
Very likely         1      Almost certain to happen
Likely              2      More likely to happen than not
Somewhat likely     3      About even odds
Unlikely            4      Less likely to happen than not
Very unlikely       5      Almost certain not to happen

Impact (initial)    Rating   Comments
Must-fix now         1       Top priority, "come in on Sunday" kind of issue
Must-fix schedule    2       Schedule for resolution as quickly as possible
Should fix           3       Major irritant but might wait on other issues
Good-to-fix          4       Irritant for some customers
Don't fix            5       No or limited value to fixing

Likelihood got quick inter-rater agreement on almost everything. Impact did not. Participants argued extensively about the line between Must-fix now and Must-fix schedule, which slowed the session. By the end, the team had identified 92 non-duplicate quality risk items and successfully rated impact and likelihood on about 40% of them. One team member was asked to assign tentative ratings to the rest, subject to the group's later review.

Lesson: rating-scale debates during the session are a symptom, not the problem. The impact scale itself needed surgery. That happened in phase 3.

Phase 3 — Analyze and refine

Finish rating. In the week after the session, the team worked through the unrated items. In a handful of cases, the group had to split a risk item into two — the original item was compound and the two halves deserved different impact ratings. At the end, the team had 104 fully rated quality risk items.

Compute RPNs. Risk priority number = likelihood × impact. With 5-point scales, RPNs fall between 1 (highest risk) and 25 (lowest risk). Some integer values are mathematically unreachable — no two integers between 1 and 5 multiply to 7, 11, 13, 14, 17, 18, 19, 21, 22, 23, or 24, so the histogram has natural gaps at those values.
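The RPN arithmetic is trivial but worth pinning down. A minimal sketch (the function names are ours, not the pilot team's) that also demonstrates the natural gaps:

```python
# Risk priority number on two 5-point scales: 1 = riskiest, 25 = least risky.
def rpn(likelihood: int, impact: int) -> int:
    """likelihood and impact are each rated 1 (worst) to 5 (best)."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("ratings must be integers from 1 to 5")
    return likelihood * impact

# The set of reachable RPN values: no pair of 1-5 integers multiplies
# to 7, 11, 13, 14, 17, 18, 19, 21, 22, 23, or 24.
reachable = sorted({l * i for l in range(1, 6) for i in range(1, 6)})
print(reachable)  # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 20, 25]
```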

Check for clumping. Clumping happens when too many items pile up at the same RPN, typically because the rating scale has poorly defined distinctions or raters consistently assume worst-case impact. The pilot's first histogram showed strong skew toward the left side, with many items clustered at RPN 6. The underlying distribution:

Likelihood   Count     Impact   Count
    1           5         1       10
    2           9         2       52
    3          25         3       32
    4          39         4        8
    5          26         5        2
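The clumping check itself is easy to automate once the ratings live in a structured form. A sketch, assuming a list of (likelihood, impact) pairs; the 25% flagging threshold is our choice, not the pilot's:

```python
from collections import Counter

def clumping_report(ratings, threshold=0.25):
    """Flag RPN values holding more than `threshold` of all rated items.

    `ratings` is a list of (likelihood, impact) pairs, each 1-5.
    Returns {rpn_value: fraction_of_items} for the flagged values.
    """
    rpns = Counter(l * i for l, i in ratings)
    total = sum(rpns.values())
    return {v: n / total for v, n in rpns.items() if n / total > threshold}

# Toy example: 6 of 8 items land on RPN 6 -> clearly clumped.
sample = [(2, 3)] * 6 + [(1, 4), (5, 5)]
print(clumping_report(sample))  # {6: 0.75}
```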

Likelihood looked reasonable: the product under test was mature, with a stable codebase and a seasoned development team — so most items landing in the 3–4 range was plausible. (On a newer product, that distribution would be wishful thinking.) Impact was the problem: fully half the ratings were 2, meaning "schedule for resolution as quickly as possible." Too many items were implicitly being called emergencies.

Fix the scale, not the data. Rather than re-rate each item (which would have re-opened the week-one debate), the team split the old rating-2 definition into two finer ones:

Impact (revised)            Rating   Comments
Must-fix now                  1      Top priority, drop-everything issue
Must-fix no workaround        2      Loss of important functionality, no workaround
Must-fix w/ workaround        3      Loss of important functionality, but with a workaround
Good-to-fix                   4      Irritant for some customers
Limited value to fix          5      No or limited value

Re-rating against the new scale redistributed the impact counts toward the middle and produced a much healthier RPN histogram. Roughly a day of work.

Lesson: your rating scale is a product; ship v2 when v1 misfires. Forcing the organization to live with a broken scale for the rest of the release is how risk-based testing gets a bad reputation.

Phase 4 — Align testing with risk

With the analysis refined, the team did four things:

  1. Decided how much effort each RPN band deserved.
  2. Mapped every risk item back to specifications.
  3. Mapped every risk item forward to test cases.
  4. Prioritized test cases based on the RPN of the risk(s) they cover.

The effort-allocation mapping the pilot settled on:

RPN range    Extent of testing   What that means
  1–12       Extensive           Large number of tests, broad and deep, combinations
                                 and variations of interesting conditions
 13–16       Broad               Medium number of tests covering many different
                                 interesting conditions
 17–20       Cursory             Small number of tests sampling the most interesting
                                 conditions
 21–25       Opportunity         Leverage other tests and activities; run a test or
                                 two of an interesting condition only if the
                                 opportunity is cheap
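The band boundaries in the table translate directly to code; a sketch of the pilot's mapping:

```python
def extent_of_testing(rpn: int) -> str:
    """Map an RPN (1 = riskiest, 25 = least risky) to the pilot's effort bands."""
    if not 1 <= rpn <= 25:
        raise ValueError("RPN must be between 1 and 25")
    if rpn <= 12:
        return "Extensive"
    if rpn <= 16:
        return "Broad"
    if rpn <= 20:
        return "Cursory"
    return "Opportunity"

print(extent_of_testing(6))   # Extensive
print(extent_of_testing(20))  # Cursory
```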

Cross-referencing the mapping against the final RPN histogram showed every risk item received at least cursory testing — a useful sanity check. If a large fraction of items fell into the Opportunity bucket, the team would re-examine either the ratings or the test plan.

Traceability in both directions. To keep the analysis usable as requirements changed, each risk item was mapped to the Product Requirements Specification (PRS) and the Detail Design Specification (DDS). When a requirement changed, the impacted risk items — and therefore the impacted tests — were reachable in two clicks. This is the single most under-invested-in part of most risk analyses; without it, the analysis decays into a one-time artifact instead of a living document.
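Bidirectional traceability needs nothing more exotic than two maps and an inverted index. A minimal sketch — the item, spec, and test IDs below are invented for illustration, not taken from the pilot:

```python
# A minimal two-way traceability index (all IDs are hypothetical).
risk_to_specs = {
    "R-017": ["PRS-4.2", "DDS-9.1"],
    "R-033": ["PRS-6.0"],
}
risk_to_tests = {
    "R-017": ["TC-101", "TC-102"],
    "R-033": ["TC-210"],
}

def invert(mapping):
    """Build the reverse index so a changed spec leads back to risks."""
    reverse = {}
    for key, values in mapping.items():
        for v in values:
            reverse.setdefault(v, []).append(key)
    return reverse

spec_to_risks = invert(risk_to_specs)

# "Two clicks": changed spec -> impacted risks -> impacted tests.
impacted_tests = [t for r in spec_to_risks["PRS-4.2"] for t in risk_to_tests[r]]
print(impacted_tests)  # ['TC-101', 'TC-102']
```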

Where one test covered multiple risk items, it inherited the most urgent (numerically lowest) of the associated RPNs. That test ran early.

Phase 5 — Guide the project with risk

Two things changed once execution started.

Test sequencing changed. In the old model, test assignments were based on staff expertise and availability — which sometimes meant high-importance tests sat waiting on a busy expert and ran late when they genuinely should have run early. With the risk-ordered plan, all high-priority tests ran early in the execution window and all low-priority tests ran later. The scary tests finished before the time box got compressed.

Defect prioritization changed. The team already had a severity standard (how bad is this bug in isolation?). Risk-based testing added a priority layer on top (how bad is this bug in the context of the risk item it signals?). The practical effect: defects tied to high-RPN risks were opened at higher severities than they might have been otherwise, which meant entry into Beta could gate on all high-RPN-related defects being resolved — not just on a count of open bugs.
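One way to encode that priority layer is to escalate a defect's effective severity when it signals a high-risk item. This is a sketch of the idea, not the pilot's actual rule; the one-level bump and the cutoff (aligned with the Extensive band) are our assumptions:

```python
def effective_severity(base_severity: int, risk_rpn: int,
                       high_risk_cutoff: int = 12) -> int:
    """Bump a defect one severity level (1 = worst) when it signals a
    high-risk item (RPN at or below the cutoff). The bump size and
    cutoff are illustrative assumptions, not the pilot's standard."""
    if risk_rpn <= high_risk_cutoff:
        return max(1, base_severity - 1)
    return base_severity

print(effective_severity(3, 6))   # 2  (high-risk item: escalated)
print(effective_severity(3, 20))  # 3  (low-risk item: unchanged)
```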

If you want to see the reporting side of this — how to talk to executives about residual quality risk during test execution — see the Test Results Reporting Process in the QA Library.

Phase 6 — Assess benefits and lessons

The pilot's retrospective surfaced four benefits and one big lesson.

Benefit 1 — Effort allocated intelligently within constraints. Exhaustive testing is not possible; risk-based testing lets the team consciously choose what not to test and defend that choice. Items in the Opportunity band got light coverage; items in the Extensive band got everything the team had.

Benefit 2 — Found the scary stuff first. Ordering tests by risk surfaced the most damaging problems early in the execution window, which gave developers real time to fix them without compressing the tail of the project.

Benefit 3 — Graceful response to constraint changes. Mid-way through the release, the team lost a person. Prioritized risks made it straightforward to re-assign work — and, where needed, to cut work — by taking from the bottom of the RPN stack rather than cutting from the top by accident.

Benefit 4 — Honest quality signal at release. "Here's what we tested, here's what passed, here's the residual risk you're shipping with" is a conversation executives can actually have. Test reports framed in risk terms are harder to dismiss than test reports framed in coverage percentages.

The lesson — involve business users in the analysis, not just engineers. Technical staff frame impact in technical terms: outages, functional annoyances, performance degradations. Business users frame impact in the terms the product is actually used in: lost productivity, specific broken workflows, escalation patterns to support. The likelihood column is largely technical. The impact column is largely not. A risk analysis run only by engineers will systematically under-weight the ways the product actually hurts the people who buy it. On the next pilot, include the business side from the start.

What to take from this

If you're about to run your first risk-based-testing pilot, take four things from this case study:

  • Do phase 1 before phase 2. The risk-analysis session is not the place to teach vocabulary.
  • Expect your impact scale to break. Build in the time to ship a v2 of the scale after the first session. Don't force the team to live with a broken v1 for the rest of the release.
  • Invest in bidirectional traceability. Risk → spec → test, and spec → risk → test. Without it, the analysis is a one-time artifact.
  • Invite the business side. Impact is a business question.

Every release is itself a pilot for the next release. Risk-based testing compounds — the analysis from release N is 70% of the starting point for release N+1, and each iteration takes less time and surfaces better signal.


Rex Black, Inc.

Enterprise technology consulting · Dallas, Texas
