
How to evaluate a UK AI consultant: five tests that filter the builders from the prompt-slingers

By Dean Griffiths

In short

Most UK AI consultants are one of three things: a prompt-slinger using ChatGPT with a logo, an agency that subcontracts the actual engineering, or a SaaS reseller rebadging someone else's product. Genuine bespoke builders are rare. Five tests filter the genuine from the rest: (1) Can they show you production code they wrote? (2) Can they walk through a previous build end-to-end including failures? (3) Can they answer specific architecture questions about where your data flows? (4) Can a previous client confirm the consultant did the engineering? (5) Did they diagnose before they sold?

The market is messy. Here are the five tests that cut through it.

"AI consultant" now describes someone who graduated from a six-week prompt-engineering bootcamp last year, a Big Four practice billing partner-day rates for slide decks, an agency that subcontracts the code to a dev shop in another timezone, and a former data scientist quietly building bespoke systems in production. They all use the same job title.

Five tests, applied in sequence, filter most of the noise.

Test 1 — Can they show you production code they wrote themselves?

Not a screenshot of a ChatGPT conversation. Not a demo dashboard. Actual code, in a repository, that someone else is paying to run in production. A consultant who writes their own code can show you commits with their name on them. A consultant who subcontracts will sidestep — "we have a development partner who handles that side" — which is fine if you wanted to hire an agency, but you should know what you're buying.

How to apply it: ask in the first conversation. "Show me a piece of production code you wrote for a previous client — even anonymised." The answer tells you what role you're actually hiring for.
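If they do share an anonymised repository, authorship is checkable in seconds. A minimal sketch, assuming a local clone and git on your PATH; it just lists commit counts per author, so the consultant's name should appear near the top:

```python
# Lists commit counts per author in a local git repository.
# Assumes git is installed and repo_path points at a local clone.
import subprocess

def authors_by_commit_count(repo_path: str) -> list[tuple[int, str]]:
    """Return (commit_count, author_name) pairs, most prolific first."""
    out = subprocess.run(
        ["git", "-C", repo_path, "shortlog", "-sn", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs = []
    for line in out.splitlines():
        count, name = line.strip().split("\t", 1)
        pairs.append((int(count), name))
    return pairs

if __name__ == "__main__":
    for count, name in authors_by_commit_count("."):
        print(f"{count:5d}  {name}")
```

Commit history can be rewritten, so treat this as a conversation starter rather than proof; the reference call in Test 4 closes the loop.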

Test 2 — Can they walk through a previous build end-to-end, including the parts that broke?

Real builds have failure modes. Integrations don't work the first time. A particular edge case takes longer than expected. The original scope misses a requirement that surfaces on contact with the business. A consultant who's actually shipped will describe these without prompting — "the EPC register API rate-limited us at v1, so we re-engineered the matching to batch and cache" — because the failures are how the build evolves.
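That fix is a pattern worth recognising when a consultant describes it. A minimal sketch of batch-and-cache, where the endpoint, fetch_epc_batch, and the limits are illustrative assumptions rather than the real EPC register API: postcodes already in the cache are skipped, and the rest go out in small, spaced batches.

```python
# A batch-and-cache sketch. CACHE, BATCH_SIZE, PAUSE_SECONDS and the
# endpoint are illustrative assumptions, not the real EPC register API.
import time
import requests

CACHE: dict[str, dict] = {}      # postcode -> cached API response
BATCH_SIZE = 25                  # requests per batch, sized to the rate limit
PAUSE_SECONDS = 1.0              # gap between batches

def fetch_epc_batch(postcodes: list[str]) -> dict[str, dict]:
    """Fetch records for any uncached postcodes, in rate-limited batches."""
    missing = [p for p in postcodes if p not in CACHE]
    for i in range(0, len(missing), BATCH_SIZE):
        for postcode in missing[i : i + BATCH_SIZE]:
            resp = requests.get(
                "https://api.example.com/epc", params={"postcode": postcode}
            )
            resp.raise_for_status()
            CACHE[postcode] = resp.json()  # cache so a retry never re-fetches
        time.sleep(PAUSE_SECONDS)          # space batches under the limit
    return {p: CACHE[p] for p in postcodes}
```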

A consultant who can only describe successes either hasn't shipped or hasn't reflected on what they shipped. Both are bad signals.

Test 3 — Can they answer specific architecture questions?

Where does your data live? Where do the prompts go? Who hosts the LLM calls? What happens when the model vendor changes their API? Where does the audit log live? What's the disaster-recovery plan? Who has access in production?

Real builders have answers. They might not be the right answers for your situation — that's a discovery conversation. But they have answers. A consultant who deflects to "we use industry-standard practices" is reading from a slide they didn't write.
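For the vendor-API question specifically, one credible answer is an adapter layer: every LLM call routed through a single interface, so a vendor change touches one class instead of the whole codebase. A minimal sketch, with illustrative names rather than any real SDK:

```python
# An adapter-layer sketch for the "what happens when the model vendor
# changes their API" question. LLMClient, EchoClient and summarise are
# illustrative names, not a real SDK.
from abc import ABC, abstractmethod

class LLMClient(ABC):
    """The one interface the rest of the codebase is allowed to call."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class EchoClient(LLMClient):
    """Stand-in for tests; a real client would wrap a vendor SDK here."""
    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt[:60]}"

def summarise(document: str, llm: LLMClient) -> str:
    # Business logic depends only on the interface, never on a vendor SDK,
    # so a vendor API change is contained to one adapter class.
    return llm.complete(f"Summarise this document:\n{document}")

if __name__ == "__main__":
    print(summarise("Five tests for evaluating an AI consultant.", EchoClient()))
```

The design choice being tested is containment: the blast radius of a vendor change should be one file, not the whole system.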

Test 4 — Can a previous client confirm the consultant did the engineering?

Reference calls. Specifically: ask the previous client who actually wrote the code, who they spoke to during the build, and what they'd do differently. The cleanest signal is when the previous client describes the same consultant doing the engineering work, not a separate development team.

For solo operators (one-person consultancies), the test is simpler — the consultant is the engineer. For agencies, the test is whether the person you'll be hiring is the person who'll be writing the code, or whether you're paying a relationship layer above the actual builders.

Test 5 — Did they diagnose before they sold?

A consultant who proposes scope in the first 15 minutes hasn't diagnosed your operation; they've pattern-matched against their existing offer. That can be fine if your operation fits a common pattern, but mid-market businesses with genuinely specific operations rarely do.

A discovery-first consultant asks where your time leaks before they propose what to build. The discovery call is the diagnostic — the build proposal comes after, costed, scoped, and with a defensible reason for each component. (More on why this matters: see the discovery methodology guide.)

Bonus red flags

  • Generic case studies. If the same outcomes appear across clients with different operations ("we saved them 40% on admin"), it usually means the consultant has a template and you're the next customer to receive it.
  • No code shown. Marketing assets, demo videos, and dashboards are not engineering evidence. Code is.
  • "Trust us" on data handling. A serious build has a clear answer for where your data flows, where it's stored, who can access it, and how it's deleted. Wave-hand answers mean the consultant either doesn't know or doesn't want to commit.
  • SaaS rebadged as bespoke. If the "bespoke" build is a thin wrapper around an existing SaaS product, you're paying bespoke prices for SaaS economics. Ask what the consultant would do differently if the SaaS vendor disappeared tomorrow.
  • Pressure to sign in the first call. A consultant who needs you to commit before you've understood the scope hasn't earned the commitment. Real builders are happy for you to think about it.

What "good" looks like (a positive signature)

The pattern that consistently produces working builds:

  • One person (or a small team) who can describe both the business problem and the technical architecture in the same conversation.
  • References to specific previous builds they can talk about in detail, including the failure modes.
  • A diagnostic-first sales conversation — they ask before they propose.
  • A clear answer to "where does my data live and who has access."
  • A commercial shape that fits the work — fixed fee for genuinely one-off builds, retainer for systems that need to keep moving with the business.
  • Willingness to say "no, this doesn't fit a bespoke build — buy the SaaS." A consultant who would rather walk away from a bad-fit deal is usually a consultant worth hiring for a good-fit one.

Next step

If you're about to hire an AI consultant, run them through the five tests before the contract. If you're considering AIMindShift specifically, the discovery call is the test you can run on us — 45–60 minutes, technical, diagnostic-first, costed bottleneck map at the end. You'll know inside the call whether the engineering depth is there.


Want to apply this to your operation?

A 45–60 minute discovery call. Map the bottlenecks. Get a costed bottleneck map — whether we build or not.

Book a Discovery Call