Whitepaper, AI Operating Model, ~15 min read
Most mid-market engineering organizations we meet are being asked to "do AI" by a board that has no idea what the words mean operationally. The teams respond by flailing: a task force that ships nothing, a procurement binge on tools that overlap, or a pilot that never leaves the prototype folder. None of these produce production AI.
This paper is the sequence we use to unfreeze adoption. Six stages. Named exit criteria at each one. The anti-patterns that reliably predict failure. And the first-90-days framing that ties together architecture, evaluation, and model economics.
A slide deck version is available at /decks/ai-adoption-sequence/slides.pdf.
Why the order matters more than the pieces
Every piece in this sequence exists elsewhere in the industry's advice. Pick a use case. Instrument it. Pilot. Measure. Scale. None of that is novel. What is consistently missing is the order, and the refusal to let the team jump steps.
Teams that fail usually fail the same way: they skip from "we need to do AI" directly to "let us buy a tool" or "let us hire an ML team," without ever answering the two questions that sit in between, which are "what problem are we solving" and "what does success look like numerically." Those two questions are what stages one through three of this sequence are designed to force.
Sequence integrity is the single highest-leverage discipline for a leader putting AI into a mid-market engineering org.
The six-stage sequence
[Figure: Adoption sequence. Six stages from "we need to do AI" to "we are operating AI," with named exit criteria at each stage.]
The funnel values represent conceptual slots, not team attrition. Each stage gates the next. The one reliable pattern: teams that skip a stage discover the gap two stages later, at more cost than fixing it at the time.
Each stage deserves its own close look.
Stage 1: Sponsor and problem
Before any technical work, two things must be true: a named executive sponsor owns the outcome, and the business problem is specific and numerical. "We want to use AI" is not a problem. "Ticket resolution time is 47 hours P50 and we want it below 20" is a problem. "Support agents spend 60 percent of their time on the same five queries" is a problem.
If the problem cannot be stated in one sentence with a number in it, the sponsorship is nominal and the project will drift. Send the leader back to the business before writing code.
Stage 2: Use-case selection
Inside the problem, pick one narrow use case. "Ticket triage on tier-1 requests in the billing queue" is narrower than "support automation." Narrow use cases have three properties that make them ship: bounded scope, a measurable baseline, and a willing operator who will live with the result.
The most common mistake at this stage is choosing a use case because it is impressive rather than because it is solvable and measurable. Impressive comes later, after the team has shipped three boring wins.
Stage 3: Instrumentation
Before building anything AI-shaped, measure today's performance on the use case. This is the stage that is most often skipped, and the one whose absence most reliably torpedoes the program.
The discipline: no pilot without a baseline. If the operator cannot tell you what the current P50 resolution time is, what the current classification accuracy is, or what the current per-transaction cost is, the program is building a feature it cannot evaluate. Worse, the program will be unable to defend its results to the sponsor six months in, because there is no before-and-after story.
Instrumentation sometimes takes two weeks and is unglamorous. It is also the cheapest engineering work in the entire sequence. Do it.
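To make the baseline concrete, here is a minimal sketch of the kind of measurement stage 3 produces. The ticket fields (`opened_at`, `resolved_at`) are hypothetical; the point is that "today's number" is a short script over existing logs, not a platform project:

```python
from datetime import datetime, timedelta
from statistics import median

def p50_resolution_hours(tickets):
    """Median open-to-resolution time in hours, over resolved tickets only."""
    durations = [
        (t["resolved_at"] - t["opened_at"]).total_seconds() / 3600
        for t in tickets
        if t.get("resolved_at") is not None
    ]
    return median(durations)

# Toy data standing in for a real ticket export.
start = datetime(2024, 1, 1)
tickets = [
    {"opened_at": start, "resolved_at": start + timedelta(hours=h)}
    for h in (12, 47, 80)
]
print(p50_resolution_hours(tickets))  # 47.0
```

That number, frozen before any AI work begins, is the before in the before-and-after story the sponsor will ask for.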
Stage 4: Prototype and evaluation
With the baseline in hand, build the smallest viable AI-backed solution and evaluate it with discipline.
The architecture choice between a deterministic workflow and an autonomous agent belongs in this stage, not later. Most mid-market AI features should start as workflows. We cover the reasoning in the companion paper on workflow-vs-agent architecture: the short version is that autonomy is expensive in ways teams consistently underestimate, and earned autonomy is cheaper to maintain than assumed autonomy.
The evaluation discipline is a golden set of 60 to 150 examples, scored on five dimensions, with a release gate that must pass before production traffic. The companion paper on evaluation before shipping covers the release-gate design in detail. In this stage, two artifacts leave the team's laptop: the evaluation suite and the release gate thresholds.
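One way to sketch the release gate described above. The dimension names and thresholds here are illustrative stand-ins for your own rubric; the structure, a per-dimension pass rate over the golden set compared against a hard threshold, is the part that carries:

```python
# Hypothetical dimensions and thresholds; replace with your rubric.
GATE = {
    "correctness": 0.90,
    "groundedness": 0.95,
    "format": 0.98,
    "safety": 1.00,
    "latency_ok": 0.95,
}

def gate_passes(scored_runs, gate=GATE):
    """scored_runs: one dict per golden-set example, mapping dimension -> 0/1.
    Returns (passed, failures) where failures maps dimension -> observed rate."""
    n = len(scored_runs)
    failures = {}
    for dim, threshold in gate.items():
        rate = sum(run[dim] for run in scored_runs) / n
        if rate < threshold:
            failures[dim] = round(rate, 3)
    return (not failures), failures

# A run where every example passes every dimension clears the gate.
perfect = [{d: 1 for d in GATE} for _ in range(5)]
print(gate_passes(perfect))  # (True, {})
```

A single safety failure in five examples drops that dimension to 0.8 against a 1.00 threshold, which is exactly the behavior a gate should have: safety does not average out.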
Stage 5: Productionize and rollout
Rolling out an AI feature is no different from rolling out any other risky feature except in one respect: the evaluation suite replaces the smoke test as the primary gate. Everything else is the same. Canary, 1 percent cohort, 10 percent cohort, 100 percent. Each cohort step has named exit criteria. The team holds at 100 percent for two weeks before declaring the feature stable.
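The cohort progression can be written down as data, which is itself a useful discipline: the exit criteria stop being tribal knowledge. The thresholds and hold times below are hypothetical, but the shape, eval gate plus error budget plus soak time per step, is the pattern:

```python
# Hypothetical exit criteria; tune per feature.
ROLLOUT_STEPS = [
    {"name": "canary", "traffic": 0.001, "max_error_rate": 0.010, "min_hold_days": 3},
    {"name": "1pct",   "traffic": 0.010, "max_error_rate": 0.010, "min_hold_days": 5},
    {"name": "10pct",  "traffic": 0.100, "max_error_rate": 0.005, "min_hold_days": 7},
    {"name": "full",   "traffic": 1.000, "max_error_rate": 0.005, "min_hold_days": 14},
]

def may_advance(step, observed_error_rate, days_held, eval_suite_passed):
    """Named exit criteria for one rollout step: the eval suite is the
    primary gate, then the error budget, then the minimum soak time."""
    return (
        eval_suite_passed
        and observed_error_rate <= step["max_error_rate"]
        and days_held >= step["min_hold_days"]
    )
```

Note that a failing eval suite blocks advancement regardless of how clean the production error rate looks, which is the "evaluation suite replaces the smoke test" point made concrete.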
Model choice becomes a hot decision here. Teams that default to the flagship model for every call on the assumption that "it just works" discover, at 30 days, that the bill is three to five times what a disciplined routing design would have produced. The companion paper on model selection and cost management covers cascade routing, multi-provider fallback, and the switching-cost audit in detail.
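The core of cascade routing fits in a few lines. Everything named below is a placeholder: the model names and per-call costs are invented, and `call_model` and `is_confident` stand in for your provider SDK and whatever confidence signal you trust (a verifier check, a logprob threshold, a schema validation):

```python
# Hypothetical tiers; wire costs and names to your actual providers.
CASCADE = [
    {"model": "small-fast", "cost_per_call": 0.002},
    {"model": "flagship",   "cost_per_call": 0.030},
]

def route(prompt, call_model, is_confident):
    """Answer with the cheapest tier that clears the confidence bar.
    The last tier always answers, so every prompt gets a response."""
    spent = 0.0
    for tier in CASCADE:
        answer = call_model(tier["model"], prompt)
        spent += tier["cost_per_call"]
        if is_confident(answer) or tier is CASCADE[-1]:
            return answer, spent

# Stub demo: the cheap tier's answer is rejected, so the call escalates
# and the spend is the sum of both tiers.
def fake_call(model, prompt):
    return f"{model}: drafted reply"

answer, cost = route("triage this ticket", fake_call,
                     lambda a: a.startswith("flagship"))
```

The economics follow directly: if the cheap tier handles most traffic, the blended cost per call sits far below the flagship-everywhere default, which is where the three-to-five-times gap comes from.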
Stage 6: Operate and extend
Once the feature is at 100 percent, the question flips from "is it ready" to "is it still working." Four drift sources matter: model drift (the provider updates), context drift (the data shifts), prompt drift (the team edits the system prompt), and upstream drift (tools the feature depends on change behavior).
The defense is a scheduled regression run. Weekly, on a known-good harness, against production config. Scores go to a dashboard. A drop of more than a few points is an incident, not a ticket. The evaluation paper covers the mechanics.
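The weekly check itself reduces to a small comparison against the known-good baseline. The dimension names and the three-point threshold below are illustrative; the structural point is that the output is a list of incidents, not a chart someone may or may not look at:

```python
def regression_incidents(current, baseline, max_drop=3.0):
    """Compare this week's eval scores (0-100 per dimension) to the
    known-good baseline; any drop beyond max_drop points is an incident."""
    return [
        (dim, round(base - current.get(dim, 0.0), 1))
        for dim, base in baseline.items()
        if base - current.get(dim, 0.0) > max_drop
    ]

baseline = {"correctness": 92.0, "groundedness": 95.0}
this_week = {"correctness": 87.5, "groundedness": 94.0}
print(regression_incidents(this_week, baseline))  # [('correctness', 4.5)]
```

Run on a schedule against the production config, this is the mechanism that catches all four drift sources with one instrument, because all four show up the same way: as a score drop on a harness that did not change.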
Governance belongs here too, and it belongs honestly. Governance is the policy layer that decides what the system is and is not allowed to do, the audit trail that proves the policy was followed, and the escalation path when it was not. Governance that is not audited is theater. Governance that is audited weekly is load-bearing.
The anti-patterns that reliably predict failure
Across a meaningful number of engagements, the patterns that lead to failed AI programs are consistent enough to name.
The committee that ships nothing. A cross-functional committee is formed to "drive AI adoption." It has a steering group, sub-working-groups, a charter, and a presentation cycle. It ships nothing for eight months because no single person owns a dated production commitment. The pattern is indistinguishable from strategic stalling. Mitigation: dissolve the committee; point one named engineering leader at one named use case with a budget and a deadline.
Tool-first procurement. The team buys a tool (or signs an enterprise AI platform contract) before it has instrumented the target workflow. Nine months later, the tool has driven adoption to 12 percent and the contract auto-renews at a higher tier. The tool did not fail; the order of operations did. Mitigation: no AI procurement commitment without an instrumented baseline and a specific capability gap the tool closes.
The capability in search of a problem. The engineering team falls in love with a model capability and looks for problems to apply it to. Every use case gets bent to fit the model. Every use case underperforms because the model is not solving the actual problem. Mitigation: problem-first. The model is the last decision in the sequence, not the first.
The permanent pilot. The pilot works beautifully in a Jupyter notebook. It is never integrated into the production surface because nobody scoped the integration work. The team celebrates the pilot, then quietly loses momentum over the next quarter. Mitigation: the production path is scoped in stage 2, not discovered in stage 5.
The first ninety days, realistically
A realistic first-90-days plan for a team starting from zero looks like this. It is not fast and it is not slow; it is calibrated to what actually has to happen.
- Days 1 to 10. Sponsor confirmation and problem statement. Use-case selection out of a shortlist of three. Baseline instrumentation scoped.
- Days 11 to 25. Baseline measurement completes. The team has today's number for the target metric. Evaluation rubric drafted. Release-gate thresholds proposed.
- Days 26 to 45. Prototype build. Golden set grows to 60 examples. First scored run against the eval suite. Architecture decision (workflow vs agent) documented. Model shortlist narrowed to two candidates.
- Days 46 to 60. Release gates pass on the prototype. Integration work begins. Cascade routing is designed and measured if the use case is cost-sensitive.
- Days 61 to 75. Canary on internal users. 1 percent cohort if the canary holds. Regression harness in CI.
- Days 76 to 90. 10 percent cohort. Drift dashboard live. Governance policy signed off. First monthly review scheduled.
Teams that try to compress this timeline to 30 days produce pilots that look finished and are not. Teams that let it stretch to 120 days lose momentum with the sponsor. Ninety days is where the rhythm lands.
The governance layer that is not theater
Three things distinguish useful AI governance from performative AI governance.
- Policy as code. The rules the system must follow (what it refuses, what data it can access, what tools it can invoke) are expressed in code, not in a Confluence page. Changes flow through the same CI as any other code change.
- Audit by sampling. A fixed percentage of production interactions are logged with full input, output, and decision trace. A governance function reviews a rotating sample weekly. The review writes findings; findings drive either a policy tightening or a release gate update.
- Named incident response. When a policy violation is discovered, a named owner responds within a named SLA. The response is not "add more people to the review cycle"; it is a specific change to the policy, the prompt, the release gate, or the regression suite.
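A minimal sketch of what "policy as code" means in practice. The tool and field names are hypothetical; the point is that these rules live in a file that changes through CI review and can be checked mechanically on every tool call, rather than living in a wiki page:

```python
# Hypothetical policy; changes flow through the same CI as any other code.
POLICY = {
    "allowed_tools": {"search_tickets", "draft_reply"},
    "blocked_fields": {"payment_card", "password"},
}

def policy_violations(action):
    """Return the list of violations for a proposed tool call.
    An empty list means the action is allowed."""
    violations = []
    if action["tool"] not in POLICY["allowed_tools"]:
        violations.append(f"tool not allowed: {action['tool']}")
    for field in action.get("reads", []):
        if field in POLICY["blocked_fields"]:
            violations.append(f"blocked field: {field}")
    return violations
```

Because the check returns structured violations rather than a boolean, the same function feeds all three layers: it blocks the action at runtime, its output lands in the audit log, and a nonempty result is what triggers the named incident response.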
Teams that have these three elements in place can honestly answer the question "how do you make sure the AI is operating safely" in a way that holds up to audit and to regulation. Teams that do not have them cannot, regardless of how polished the governance deck looks.
What a leader can do this week
Three concrete moves:
- Force the order. If your organization is at stage 4 without having done stages 2 and 3, pause stage 4 and go do them. The pause looks like lost momentum. The lost momentum is already there; stopping to acknowledge it costs less than the next six months of flailing.
- Make the problem statement specific and numerical. One sentence, one number. If the problem does not fit that shape, send it back to the business.
- Instrument before you build. If the team cannot tell you today's baseline for the target use case, that is stage 3, and it is non-negotiable.
If you want a second opinion on where your adoption sequence is stuck, or a compressed version of the first-90-days plan with Rex Black, Inc. as the backbone, the AI & agents practice runs a focused assessment that produces the problem statement, the use-case shortlist, the baseline plan, and the evaluation harness ready to hand to your team.
This paper is the overview for a series on putting AI into production. The three companion pieces go deep on the parts of the sequence that earn their own paper: evaluation before shipping, workflow vs agent architecture, and model selection and cost management.