Whitepaper · Updated April 2026 · 7 min read

Investing in Testing, Part 5: Manual or Automated?

A cost-benefit framework for test automation decisions — when automation makes sense, when manual wins, and why most failed automation programs fail for predictable, avoidable reasons.

Test Automation · Test Strategy · Testing ROI · Manual Testing · Regression Testing

Part 5 of 6 · Investing in Software Testing

Test automation pays for itself when applied correctly and destroys budgets when misapplied. This article gives a cost-benefit framework for deciding what to automate, what to leave manual, and how to avoid the common failure patterns that have sunk a generation of automation programs.

Read time: ~9 minutes. Written for QA leaders and engineering managers scoping an automation investment.

The decision that makes or breaks the automation budget

Part 4 argued that the three technique families — static, structural, behavioral — each have their place. Within structural and behavioral testing, the second decision is whether to run a given test manually or to automate it. Some tests demand automation. Some demand manual execution. Many can go either way, and the decision determines whether the automation investment shows a positive return.

Automation failures are expensive. Software Test Automation references failure rates on large automation projects as high as 50%, similar to the failure rate for the underlying software projects they support. The failures are predictable and mostly avoidable — they come from automating the wrong things, underestimating the work, or treating automation as a replacement for testing rather than a leverage tool.

When automation is the right call

These categories of test are almost always better automated:

Regression and confirmation

The flagship automation use case. Every time a new release comes out, you need to confirm that previously-working behavior still works and that bug fixes actually fixed the bug. Without automation, regression costs scale linearly with feature count — which at any real product size is fatal. With automation, regression scales near-constant in the steady state.

Monkey, random, and fuzz testing

Firing large volumes of random or generated input at a system to look for crashes, resource leaks, and data-corruption scenarios is a mechanical task. Humans can't produce input at the needed rate or with consistent statistical distribution. Automation is the only realistic option.

Load, volume, capacity, performance, and reliability

Measuring system response under realistic concurrent load requires synchronized, precisely-calibrated streams of traffic plus instrumented collection. Manual execution cannot simulate 50,000 simultaneous users. Automation isn't optional here.
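A minimal sketch of why this is a machine's job: even a toy load driver needs concurrent workers, per-request timing, and percentile math. The `checkout` stub and the parameters here are placeholders for a real target.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def checkout(user_id):
    """Stand-in for the operation under load (hypothetical)."""
    time.sleep(0.001)  # simulate ~1 ms of service time
    return 200

def run_load(concurrency=50, requests=500):
    """Fire `requests` calls across `concurrency` workers and
    collect per-request latencies plus an error count."""
    samples = []
    def timed_call(i):
        start = time.perf_counter()
        status = checkout(i)
        samples.append((time.perf_counter() - start, status))
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, range(requests)))
    times = sorted(t for t, _ in samples)
    p95 = times[int(0.95 * len(times)) - 1]
    return {"p95_s": p95, "errors": sum(1 for _, s in samples if s != 200)}
```

A production load test adds ramp-up schedules, realistic traffic mixes, and instrumented collection on the server side — all of which only widen the gap between automated and manual execution.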

Structural (unit / component / integration / contract)

Unit tests, contract tests, and API-level integration tests are authored in code and executed by the build system. Manual execution of these makes no economic sense.
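For readers outside the build pipeline, the economics are visible in the smallest example. A unit test like the one below (the discount function is illustrative) costs seconds to write and then runs on every build at near-zero marginal cost:

```python
def apply_discount(price, pct):
    """Function under test (illustrative)."""
    if not 0 <= pct <= 100:
        raise ValueError("pct out of range")
    return round(price * (1 - pct / 100), 2)

def test_apply_discount():
    # Executed automatically by the build system on every change.
    assert apply_discount(100.0, 25) == 75.0
    assert apply_discount(19.99, 0) == 19.99
```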

Continuous / scheduled monitoring

Canary tests, synthetic transactions, production health probes, and scheduled end-to-end runs all require automation by construction — they need to run on a schedule no human would sustain.

High-frequency AI evaluation loops

In AI-heavy systems, evaluation harnesses that grade model outputs against golden datasets, run regression evaluations on every prompt template change, and monitor production drift are all structural-automation problems at their core. Any team shipping LLM features needs this infrastructure.
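The core of such a harness is small. This sketch assumes a model callable and a substring-match grader — both stand-ins; real harnesses use richer grading (model-graded rubrics, exact-match, semantic similarity) — but the gate-on-pass-rate shape is the same:

```python
def evaluate(model_fn, golden_set, threshold=0.9):
    """Score a model callable against a golden dataset and gate on a
    pass-rate threshold -- the shape of a prompt-regression check.
    `model_fn` and the naive grading rule are illustrative assumptions."""
    passed = 0
    failures = []
    for case in golden_set:
        output = model_fn(case["input"])
        ok = case["expected"].lower() in output.lower()  # naive grader
        passed += int(ok)
        if not ok:
            failures.append(case["input"])
    pass_rate = passed / len(golden_set)
    return pass_rate >= threshold, pass_rate, failures
```

Run on every prompt-template change, this is exactly the regression-and-confirmation pattern above, applied to model behavior instead of code behavior.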

Automated tests have higher upfront costs (tools, framework, harness, test data) and much lower per-execution costs. They pay off when you run them many times.

When manual testing is the right call

High per-test cost, the need for human judgment, or extensive ongoing human intervention all point toward manual execution. The following categories qualify:

Installation, setup, operations, maintenance

Loading media, rebooting appliances, flipping hardware switches, walking through upgrade UIs — high human touch, low repeat count per release. Manual.

Configuration and compatibility

Reconfiguring systems, swapping devices, changing network conditions — all labor-intensive in a way automation doesn't help with.

Error handling and recovery

Unplugging a server, saturating a disk, corrupting a config file to see how the system recovers — mostly manual work with the occasional chaos-engineering tool.

Localization

Deciding whether a translation is accurate, culturally appropriate, or non-offensive requires human judgment with appropriate language skills. Currency, date, and time rendering can be automated, but the regression run frequency is low.

Usability and accessibility

Cumbersome interfaces, confusing workflows, inaccessible components — these are human-judgment calls. Automated accessibility scanners catch some structural issues (missing alt text, color contrast, ARIA roles), but they miss the experiential issues that actually hurt users.

Documentation, help, and error messages

Checking that documentation is accurate, error messages are helpful, and the product explains itself — judgment calls that require human readers.

Exploratory testing

A trained tester exploring an unfamiliar surface, following hunches, and blazing new trails is the best defect-discovery engine ever invented for complex systems. It cannot be automated — automation can only replay what's already been discovered.

Trying to automate these wastes budget. A representative horror story: a client spent a year and hundreds of thousands of dollars trying to automate configuration and compatibility tests, gave up, and had to rebuild the manual program from scratch.

The wildcards — tests that can go either way

Several test categories are genuinely a judgment call:

Functional testing

Automation works well once the functional surface is stable. Rule of thumb: get the process under control manually first, then automate the cases that will run many times. Keep some functional testing manual — manual testers find different functional bugs than automated ones. Both matter.

Use cases (user scenarios)

Chained workflows can be automated. The trick is to avoid automating scenarios that involve human intervention (email confirmations, captchas, 2FA, hardware interactions). Those become maintenance nightmares.

User interface

Basic UI automation is valuable. UI automation of rapidly-changing surfaces is a maintenance sink. A sensible rule: automate at the lowest stable contract — API, component, or data-model level — and keep UI automation thin.

Date and time handling

Automatable if the test harness can manipulate system clocks cleanly. Otherwise, manual.

The cost-benefit calculation

For any specific test, run this math:

Test cost = design cost + (per-run cost × number of planned runs)

Test benefit = likelihood of defect × cost of defect if undetected

If benefit exceeds cost, the test earns its keep. If not, either cut the test or find a cheaper way to run it (drop automation for manual; drop manual for exploratory; skip entirely).

A worked example

You're considering automating a performance test for an e-commerce checkout flow.

  • Tools and development — $25,000
  • Per-run cost (maintenance + execution) — $1,000
  • Planned runs — 10 over the first year
  • Total cost — $25,000 + ($1,000 × 10) = $35,000

Benefit estimate:

  • Likelihood of slow performance — 25% (based on your architecture review)
  • Potential customers per year — 100,000
  • Average profit per customer — $10
  • Fraction that abandon on poor performance — 50%
  • Expected loss = 0.25 × 100,000 × $10 × 0.50 = $125,000

Benefit exceeds cost by $90,000. Automate.

Now run the same numbers with 10,000 potential customers instead of 100,000. Expected loss drops to $12,500. Automation cost exceeds benefit. Either run the test manually at lower frequency, find a way to run it for under $12,500, or skip it and accept the risk.

The framework is crude — the inputs are estimates — but it disciplines the conversation. More importantly, it forces the team to put a number on the downside of not running the test, which is the hardest part of any ROI case.

Why automation programs fail

The catalog of common causes:

  • Wrong target. Automating what shouldn't be automated (configuration, compatibility, usability) or what isn't stable enough to amortize the build cost.
  • Wrong layer. Pushing too much to UI automation when the same tests could run at API or component level with one-tenth the flakiness.
  • Expecting elimination, not leverage. The successful automation program doesn't replace testers; it amplifies what they can cover. Test design, expected-result calculation, and result analysis remain manual.
  • Underestimating maintenance. Automated test suites decay without continuous maintenance. Build that maintenance cost into the program from day one.
  • Tools over strategy. Picking the tool first and the test strategy second is backward. Tool selection should fall out of a clear automation strategy, not drive it.
  • Neglecting the harness. Test framework, fixtures, data setup, and environment management are usually 70% of the real work. Underinvesting in infrastructure kills automation programs more often than bad test design.

When automation projects of this size fail, people lose credibility and sometimes jobs. Go in with realistic scope, staffed expertise (hired, contracted, or trained), and an infrastructure plan. Software Test Automation by Fewster and Graham remains a good reference for anyone starting one.

Context still matters

These two articles (Parts 4 and 5) make it sound like technique and automation choice are the only variables. They're not. When testing starts, who does it, and how the testing function relates to the rest of the organization all matter as much. Part 6 covers that — the concept of pervasive testing — and brings the series to a close.

Rex Black, Inc.

Enterprise technology consulting · Dallas, Texas