Why Most Evaluations Fail

Companies evaluating AI implementation partners typically compare three things: credentials (who the firm has worked with), proposals (what they say they'll do), and pricing (what it costs). Every firm in the market optimizes for all three. Credentials are curated. Proposals are written by salespeople. Pricing is structured to win, not to reflect actual delivery cost.

The evaluation that matters compares delivery capability: the structural factors that predict whether the engagement will produce running systems or strategy documents. These factors are harder to assess because they require asking uncomfortable questions about business models, staffing, and outcomes, the kind of information firms don't volunteer.

This framework is the one we wish had existed when we started NimbleBrain. We've used it to evaluate potential partners for our own clients, and we hold ourselves to the same standard. Every question below is designed to reveal a structural truth about how the firm operates: not what they claim, but how their business model actually works.

The 8-Point Evaluation Framework

1. Can they show you a running system?

Forget the demo environments, recorded videos, and anonymized case studies. Ask to see a production system that a real client is using today, handling real data, serving real users.

This is the single most important question because it’s the one most firms cannot answer. Building a demo takes days. Building a production system takes engineering discipline, integration knowledge, and operational maturity. The gap between the two is precisely where most AI engagements fail.

What to ask: “Can you give me access to a production system you’ve built? I want to see it running, not hear about it.”

Red flag: “We can’t show that due to client confidentiality.” Every firm deals with confidentiality. The ones that build real systems find ways to demonstrate capability: anonymized interfaces, open-source tools, their own internal systems. NimbleBrain publishes its tools openly: mpak.dev (the MCP registry), upjack.dev (the application framework). These aren’t demos. They’re production infrastructure.

2. Who does the actual work?

In traditional consultancies, the people who win the deal are not the people who do the work. A senior partner presents, a manager scopes, and a team of analysts and contractors executes. The talent you evaluated during the sales process disappears after the contract is signed.

What to ask: “Name the specific people who will write code on my engagement. What have they shipped in the last 12 months? Will they be full-time on my project?”

Red flag: “We’ll assemble the right team once we understand your needs.” Translation: they don’t have a dedicated team. They staff from a bench, and the quality is unpredictable.

The anti-consultancy model puts the same senior engineers in the sales conversation and the delivery room. At NimbleBrain, the person who scopes your engagement is the person who writes the code. There’s no translation layer.

3. What do you own when they leave?

This question reveals more about a firm’s business model than any other. Some partners build on proprietary platforms. If you stop paying, you lose access. Some retain IP rights to custom code. Some create dependency by design: the system can’t run without their ongoing involvement.

What to ask: “When the engagement ends, what exactly do I own? Enumerate it: code, infrastructure, documentation, models, data pipelines. Is there any component I need your ongoing involvement to operate?”

Red flag: “You’ll have access to the platform for as long as you maintain your subscription.” That’s not ownership. That’s rent.

Full ownership means: the source code is yours, the infrastructure runs in your environment, the documentation enables your team to operate independently, and nothing breaks if you never talk to the partner again.

4. What’s the timeline to first working output?

The length of time between contract signature and the first thing you can actually use tells you everything about a firm’s delivery model. Extended discovery phases, architecture reviews, and planning cycles are how consultancies de-risk their engagements at the client’s expense.

What to ask: “How many calendar days from kickoff until I have a working system processing real data? Not a plan. Not a prototype. A system my team can evaluate against real workflows.”

Red flag: “We typically start with a 6-12 week discovery and architecture phase.” Six weeks of discovery means six weeks of billable hours before any code is written. AI rewards rapid iteration in production, not extended planning in conference rooms.

NimbleBrain’s benchmark: first working system within 14 days of kickoff. That’s not a sprint goal. It’s the structural result of putting senior engineers on the problem from day one with no translation layer between strategy and execution.

5. How do they handle scope?

Open-ended scope is the default in traditional consulting: the engagement expands as “complexity is discovered.” Change orders, phase extensions, and “we found additional requirements” are not surprises; they’re the business model working as designed.

What to ask: “Is this engagement fixed-scope and fixed-price? What happens if the project takes longer than planned? Who absorbs the overrun?”

Red flag: “We bill time and materials based on actual effort.” T&M means the partner’s revenue increases the longer the project takes. The incentive to ship quickly doesn’t exist.

Fixed-scope, fixed-price flips the incentive. When the partner makes the same amount whether the project takes three weeks or eight, efficiency becomes their problem, not yours.

6. What’s their technology lock-in?

Some firms build on proprietary stacks that create dependency by design. Others lock you into specific vendors through exclusive partnerships. The technology choices made during implementation determine how much flexibility you have after the engagement ends.

What to ask: “What’s the technology stack? Is it open-source or proprietary? Can I switch providers without rebuilding? Do you have exclusive vendor relationships that influence your recommendations?”

Red flag: “Our platform is the most efficient way to deploy AI in your environment.” If the recommendation always leads back to their platform, the evaluation was never objective.

Open standards (particularly MCP for agent-tool integration) mean your systems compose with the broader ecosystem. Proprietary stacks mean you’re locked in.
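To make that concrete, here is a minimal sketch of what an open-standard integration looks like, assuming the official MCP Python SDK (the mcp package and its FastMCP helper); the server name and tool are hypothetical placeholders, not NimbleBrain code. The point is structural: the tool speaks a published protocol, so any MCP-compatible client can discover and call it, and swapping the client, model provider, or hosting environment does not mean rebuilding the integration.

```python
# Minimal MCP tool server sketch, assuming the official MCP Python SDK
# (the "mcp" package and its FastMCP helper). The server name and tool
# below are hypothetical placeholders for illustration only.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("invoice-status")  # hypothetical server name

@mcp.tool()
def invoice_status(invoice_id: str) -> str:
    """Look up the payment status for an invoice (placeholder logic)."""
    # A real implementation would query your own database or internal API.
    return f"Invoice {invoice_id}: status unknown (placeholder)"

if __name__ == "__main__":
    # Serves over stdio by default, so any MCP-compatible client can attach.
    mcp.run()
```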

7. Can previous clients operate without them?

This is the escape velocity test. A partner who delivers real value should produce clients who don’t need them anymore. If every client requires ongoing engagement, the model is designed for dependency, not delivery.

What to ask: “Give me two references from clients whose engagements have ended. I want to talk to them about what it’s like operating the system without you.”

Red flag: “All of our clients maintain an ongoing relationship with us.” There’s a difference between choosing to continue and needing to. If no former client can operate independently, the partner is building dependency, not capability.

NimbleBrain’s target is escape velocity: the point where the client’s team owns and operates the system without us. Every engagement includes knowledge transfer, documentation in Business-as-Code artifacts, and a handoff protocol. The goal is to become unnecessary.

8. What have they built for themselves?

Advisors who don’t build are guessing. The firms that write code, maintain infrastructure, and operate their own tools understand production reality at a level no pure-strategy firm can match.

What to ask: “What tools, frameworks, or platforms has your firm built? Are they open-source? Do you use them in client engagements?”

Red flag: “We’re technology-agnostic. We don’t build tools, we implement the right solution for each client.” Technology-agnostic often means technology-superficial. The firms that build tools have learned from their own failures, debugged their own production issues, and developed opinions from experience, not theory.

NimbleBrain builds the tools we use: Upjack (declarative AI applications), mpak (MCP server registry with security scanning), Synapse (protocol-native UI). Every client engagement runs on the same infrastructure we maintain and improve. Our bugs become their improvements.

Red Flags: The Quick Screening List

Before you get to the detailed evaluation, these red flags should disqualify a firm immediately:

  • “Digital transformation roadmap” as a deliverable. You’re buying a document, not a system.
  • No public code or open-source work. If they’ve never shipped anything publicly, their production capability is unverifiable.
  • Case studies with no metrics. “We helped Company X improve their AI capabilities” means nothing. What was the timeline? What was the measurable outcome?
  • “Discovery phase” longer than 2 weeks. Extended discovery is how consultancies generate revenue before delivering value.
  • They can’t name who will write your code. If the delivery team doesn’t exist yet, neither does the delivery capability.
  • The proposal describes a “platform” you’ll license. You’re buying a subscription, not a solution.
  • No former clients who operate independently. Every client is still paying = every engagement creates dependency.

How NimbleBrain Scores

We apply this framework to ourselves. Here’s the honest assessment:

  • Running system? Pass. Evidence: mpak.dev, upjack.dev, production infrastructure, publicly accessible.
  • Who does the work? Pass. Senior engineers in every engagement; the person in your Slack is the person writing code.
  • What do you own? Pass. Everything: source code, infrastructure, documentation. No ongoing dependency required.
  • Timeline to first output? Pass. 14 days average to first production system; 4-week standard engagement.
  • How is scope handled? Pass. Fixed-scope, fixed-price; we absorb overruns.
  • Technology lock-in? Pass. Open standards (MCP), open-source tools, runs in your infrastructure.
  • Can clients operate without you? Pass. Escape velocity is the explicit goal; knowledge transfer is built into every engagement.
  • What have they built? Pass. Upjack, mpak, Synapse, 21+ MCP servers. All open-source, all used in client engagements.

Where we’re NOT the right fit:

  • Multi-year enterprise transformation programs. Our model is 4-week sprints, not 18-month programs. If you need org-wide change management across 50 departments, we’re not structured for that.
  • Body-shop staffing. We don’t place individual engineers on your team long-term. We embed, build, transfer, and leave.
  • Low-budget exploration. Our engagements start at $50K. If you need a $10K proof-of-concept, we’ll point you to good alternatives.
  • You want a vendor, not a partner. If the goal is to hand off a requirements doc and check back in three months, the embed model won’t work. We need operational access and collaborative engagement.

The Decision Matrix

Each factor below compares NimbleBrain, a traditional consultancy, an in-house team, and freelancers:

  • Time to first system: NimbleBrain 14 days; traditional consultancy 3-6 months; in-house team 2-4 months; freelancers 4-8 weeks.
  • Cost (typical engagement): NimbleBrain $50K-150K fixed; traditional consultancy $200K-1M+ (open-ended); in-house team $300K-500K/yr (team cost); freelancers $15K-50K per project.
  • Who does the work: NimbleBrain senior engineers; traditional consultancy a mix of senior/junior plus contractors; in-house your team; freelancers one person.
  • You own everything: NimbleBrain yes; traditional consultancy sometimes (platform dependency); in-house yes; freelancers usually.
  • Ongoing dependency: NimbleBrain none (by design); traditional consultancy high (by design); in-house N/A; freelancers low.
  • Production experience: NimbleBrain deep (we build our own tools); traditional consultancy varies widely; in-house depends on hire quality; freelancers varies wildly.
  • Scale: NimbleBrain 1-3 focused agents per sprint; traditional consultancy enterprise-wide programs; in-house limited by team size; freelancers single project.
  • Technology lock-in: NimbleBrain open standards + open source; traditional consultancy often proprietary; in-house your choice; freelancers their preference.
  • Best for: NimbleBrain speed-to-production, ownership, and focused scope; traditional consultancy large-scale transformation and compliance-heavy work; in-house long-term capability building; freelancers narrow, defined tasks.

How to Use This Framework

Print the eight questions. Ask them of every firm you’re evaluating. Score each answer honestly. The firms that score well on all eight are structurally aligned with AI delivery. The firms that fail on two or more have business models that work against your success.
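If you want to make the scoring mechanical, the sketch below encodes the eight questions and the disqualification rule stated above (failing two or more) in a few lines of Python; the firm names and answers are placeholders, not real evaluations, and the "borderline" verdict for a single failure is our own shorthand for "press harder on that question."

```python
# Illustrative scorer for the eight-question framework.
# Firm names and answers below are placeholders, not real evaluations.
QUESTIONS = [
    "Can they show you a running system?",
    "Who does the actual work?",
    "What do you own when they leave?",
    "What's the timeline to first working output?",
    "How do they handle scope?",
    "What's their technology lock-in?",
    "Can previous clients operate without them?",
    "What have they built for themselves?",
]

def evaluate(firm: str, answers: dict) -> None:
    """Print passes out of eight and a verdict based on the failure count."""
    failures = [q for q in QUESTIONS if not answers.get(q, False)]
    if not failures:
        verdict = "structurally aligned with AI delivery"
    elif len(failures) >= 2:
        verdict = "business model works against your success"
    else:
        verdict = "borderline: press harder on the failed question"
    print(f"{firm}: {len(QUESTIONS) - len(failures)}/8 passed -> {verdict}")
    for q in failures:
        print(f"  failed: {q}")

# Hypothetical usage:
evaluate("Firm A", {q: True for q in QUESTIONS})
evaluate("Firm B", {QUESTIONS[0]: True, QUESTIONS[3]: True})
```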

Don’t take our word for any of this. Verify. Ask for references. Ask to see running systems. Ask to meet the engineers. The firms that welcome scrutiny are the ones that can deliver.


Frequently Asked Questions

Should I use this framework to evaluate NimbleBrain?

Yes. We built it for that purpose. We score ourselves transparently in the guide, including where we're not the right fit. If you're evaluating us against other firms, use the same eight questions on everyone. The framework is designed to reveal structural differences, not favor any particular vendor.

What if none of my candidates pass all eight questions?

Most won't. That's the point. The bar is set at what actually predicts AI delivery success, not at what firms typically present. If nobody passes, you have three options: lower your scope to something a less capable partner can deliver, build in-house if you have the engineering talent, or expand your search beyond the traditional consultancy market.

Is this framework biased toward NimbleBrain's model?

It's biased toward delivery models that work for AI. We built NimbleBrain around these principles because we believe they're correct, not the other way around. Fixed scope, senior builders, production speed, and ownership transfer are structural advantages for AI implementation regardless of who provides them. If another firm scores well on all eight questions, they're probably a good partner.

How is this different from a typical RFP process?

RFPs evaluate what a firm says it can do. This framework evaluates what a firm has actually done and how its business model aligns with your success. Three of the eight questions (timeline to first output, what you own when they leave, and whether previous clients operate independently) are impossible to answer with marketing materials. They require verifiable proof.

What's the single most important question on the list?

Question 1: Can they show you a running system? Not a demo. Not a video. A production system that a real client is using right now. If they can't, everything else is theoretical. The gap between "we can build this" and "we have built this" is where most AI engagements die.


Ready to put this into practice?

Email us directly: hello@nimblebrain.ai