Article · May 12, 2026

The Automation Gap

Henry Nguyen · Founder, Xiren · 11 min read

What happens when you give an AI the same problem twice — once with structure and once without.

I wanted to share something I've been curious about.

I wanted to know if AI was actually good enough to break what I'd consider a golden rule in operations... that you can't scale shit. The MIT NANDA report found that 95% of enterprise AI pilots deliver zero return, and to me that's a sign AI isn't the silver bullet everyone's hoping for. So I built a simulation to test it.

This is the long version of what I found. The short version is on LinkedIn. The code is open source. Everything below is just me walking through what I did, why I did it, and what I think it means.

The question I kept circling

I've spent a lot of time in operations-heavy small businesses. HVAC, property management, last mile logistics, that kind of work. And the pattern I kept noticing was that the companies talking the most about AI were the ones whose underlying processes were the most chaotic. They'd buy an AI tool to fix their intake, or schedule their dispatch, or write their customer responses, and three months later they'd be in the same hole they started in.

The companies that actually got value from AI looked different. Their workflows were already documented. Their data was clean. Their processes were boring and standardized. They didn't need AI to save them. They added AI to something that already worked, and it made the thing work faster and cheaper.

That gap kept bugging me. Because the narrative in the market is the opposite. The narrative is that AI fixes broken things. That you can layer it on top of a mess and it figures out the mess for you. That's what the demos suggest. That's what the sales pitches promise. But it didn't match what I was watching happen on the ground.

So I wanted to know if I was right about that, or if I was just pattern matching from a small sample. The cleanest way to test it was to build a controlled experiment where I could vary one thing at a time and see what happened.

What I actually built

I built a fake delivery company in code. Not a real one. A simulated last mile logistics operation with 15 fake drivers, 200 fake orders, and a fake operational workflow. Every order had a known correct answer baked in. What kind of package it was. How urgent it was. Which driver should handle it. Which truck it should go on. That's the ground truth.
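
For a sense of what "baked in" means, here is a minimal sketch of the order shape. The field names are illustrative, not the repo's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Order:
    # Illustrative fields; the repo's real schema differs.
    order_id: int
    true_package_type: str  # e.g. "perishable", "fragile", "standard", "oversize"
    true_priority: str      # e.g. "same_day", "next_day", "standard"
    true_driver_id: int     # the driver who should handle it
    true_truck_id: int      # the truck it should go on
```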

Then I cloned the fake company four times and varied two things between the clones.

The first thing I varied was how clean the information coming into the company was. In two of the clones, every order arrived with structured data... package type filled in from a dropdown, priority labeled, all the fields populated, customer notes that were short and clear. In the other two clones, that same order arrived as a messy customer text. Stuff like "hey can u just leave it wherever idk" or "im not home til like 3ish maybe leave w neighbor." Same underlying delivery. Wildly different quality of input.
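
To make the contrast concrete, here is a toy sketch of the two intake renderings of the same underlying order. The structured template and the pool of chaotic notes are stand-ins; the repo's generators are richer.

```python
import random

def render_structured(order: Order) -> dict:
    # Clean intake: dropdown-style fields, everything populated.
    return {
        "package_type": order.true_package_type,
        "priority": order.true_priority,
        "notes": "Leave at front desk.",
    }

# Example messy texts from the article; the repo draws from a larger pool.
CHAOTIC_NOTES = [
    "hey can u just leave it wherever idk",
    "im not home til like 3ish maybe leave w neighbor",
]

def render_chaotic(order: Order) -> dict:
    # Messy intake: the same delivery arrives as a raw customer text,
    # with none of the structured fields filled in.
    return {"raw_text": random.choice(CHAOTIC_NOTES)}
```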

The second thing I varied was whether an AI was making the operational decisions. In two of the clones, the decisions were made by simulated humans following rules with realistic error rates. In the other two, the decisions were made by Gemma 3 4B, an open weights model from Google DeepMind, running locally on my machine via Ollama. Real model. Real prompts. Real responses parsed in real time.
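
The plumbing for that is straightforward if Ollama is running locally. A sketch, assuming the ollama Python client and a prompt and output schema I've made up for illustration:

```python
import json

import ollama  # assumes a local Ollama server with gemma3:4b pulled

def classify_with_model(intake: dict) -> dict:
    # Prompt wording and JSON schema are illustrative, not the repo's prompts.
    prompt = (
        "You are a dispatcher for a delivery company. Given this order "
        'intake, reply with JSON: {"package_type": ..., "priority": ..., '
        '"driver_id": ..., "confidence": 0-100}.\n\n'
        f"Intake: {json.dumps(intake)}"
    )
    resp = ollama.chat(
        model="gemma3:4b",
        messages=[{"role": "user", "content": prompt}],
        format="json",  # ask Ollama to constrain the reply to valid JSON
    )
    return json.loads(resp["message"]["content"])
```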

So I ended up with four versions of the same company.

Condition A was messy inputs with human decisions. This is the baseline chaos. Most small operations look like this.

Condition B was structured inputs with human decisions. The dispatcher has a clean form, defined fields, and clear rules. They still make mistakes... I modeled it at 4% based on published clerical error rates... but they have something stable to work with.

Condition C was structured inputs with AI decisions. The AI gets clean data and makes the call. This is the version most AI vendors are selling.

Condition D was messy inputs with AI decisions. The AI has to parse the chaotic customer note, figure out what kind of package it is, infer urgency from vague language, pick a driver, and route the order. This is what most small operations would actually look like if they bolted on AI tomorrow.
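
The simulated humans are deliberately simple. Here is roughly how I'd model them, using the 4 percent structured and 35 percent chaos error rates discussed in the limitations section below. The flip logic is a sketch; the repo's version has more moving parts.

```python
import random

# Error rates from the article: 4% structured (BLS clerical data),
# 35% chaotic (data entry survey on unstructured intake).
HUMAN_ERROR_RATE = {"structured": 0.04, "chaos": 0.35}

PACKAGE_TYPES = ["perishable", "fragile", "standard", "oversize"]  # illustrative

def human_decision(order: Order, input_quality: str) -> str:
    # Simulated human: correct answer, except at a fixed error rate,
    # in which case they pick a wrong class uniformly at random.
    if random.random() < HUMAN_ERROR_RATE[input_quality]:
        wrong = [t for t in PACKAGE_TYPES if t != order.true_package_type]
        return random.choice(wrong)
    return order.true_package_type
```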

Then I ran the same 200 orders through all four versions and scored every single decision against the ground truth.
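
Scoring is just a comparison against the ground truth fields. In sketch form, reusing the toy pieces above (this version scores only the package-type call; the real harness scores every decision, including driver and truck assignment):

```python
def run_condition(orders: list, decide, input_quality: str) -> float:
    # Push every order through one condition's decision function and
    # return the fraction that matched ground truth.
    correct = sum(
        decide(order, input_quality) == order.true_package_type
        for order in orders
    )
    return correct / len(orders)

# e.g. Condition B, structured inputs with simulated human decisions:
# accuracy_b = run_condition(orders, human_decision, "structured")
```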

The numbers

Chaos and Manual: 51% accuracy, $19.96 average cost per order, 26% of orders cascaded into downstream failures.

Structured and Manual: 94.5% accuracy, $3.07 per order, 4% cascade rate.

Structured and AI: 100% accuracy, $0.02 per order, 0% cascade rate.

Chaos and AI: 10.5% accuracy, $28.57 per order, 80.5% cascade rate.

The condition that stuck with me is the last one.

Chaos and AI performed worse than chaos and manual. The accuracy was lower. The cost was higher. The cascade rate was three times worse. The exact same model that hit 100 percent on clean inputs dropped to 10.5 percent on messy ones. That's lower than random guessing on a four-option classification... two coin flips to pick one of four options would get you 25 percent.

And the model wasn't uncertain about its answers. The average confidence reported on the wrong classifications was 87 percent. Confidently wrong, fast, at scale.

The quote that did it for me

Here is an actual response from the model on a perishable shipment it routed to a non-refrigerated truck.

"Customer requested a silent delivery, indicating a standard delivery with high priority."

The package was clearly perishable. The customer note had nothing to do with refrigeration one way or the other. The model invented a coherent sounding justification for the wrong answer and committed to it.

That's the thing I think people miss about LLMs. They don't gracefully degrade on bad inputs. They don't say "I don't know." They produce fluent, plausible, internally consistent reasoning for whatever conclusion they arrived at, even when that conclusion is wrong. And the more polished the output sounds, the harder it is to catch in production.

This is fine when the input is structured and the model gets the answer right. It's catastrophic when the input is messy and the model gets it wrong, because nothing about the output signals "I made this up." The confidence score is high. The reasoning sounds right. The decision flows downstream into operations before anyone catches it. And in a real business, by the time someone catches it, the perishable goods are already in the back of a hot van.

What this is and what it isn't

I want to be careful here because the result is dramatic enough that it's tempting to overclaim.

This is a simulation, not a field study. The noise parameters I used for the human conditions are operational estimates based on published industry data. The chaos classification error rate of 35 percent comes from a survey on data entry error rates with unstructured intake. The 4 percent error rate for the structured human condition comes from BLS clerical error data. The cost assumptions for spoilage, redelivery, and missed windows come from logistics industry averages. They're defensible but they're not measured from a specific real operation. If the real numbers are different, the magnitude of the gap might shrink or grow.

I tested one model. Gemma 3 at 4 billion parameters. That's a small model by current standards. A bigger model like GPT-4 or Claude or Gemma 3 at 27 billion parameters might do better on chaotic inputs. I think it would, actually. But I also think the directional finding holds. The cleaner the input, the better the result. The messier the input, the more the model has to hallucinate to fill in the gaps, and the more confidently it commits to those hallucinations.

The 100 percent accuracy in Condition C looks suspicious, and I want to name that before anyone else does. Hitting 100 on 200 orders is a small sample artifact. At 10,000 orders you'd expect 1 to 5 percent error even on perfectly structured inputs. The interesting finding isn't that C hit 100. The interesting finding is that D dropped to 10.5.
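
For the statistically minded, the rule of three makes the same point: zero failures in n trials is consistent, at 95 percent confidence, with any true error rate up to roughly 3/n.

```python
# Rule of three: 0 failures in n trials -> ~3/n is the 95% upper
# bound on the true failure rate.
n = 200
upper = 3 / n  # 1.5%

# Sanity check: at a 1.5% true error rate, a perfect 200/200 run
# still happens about 5% of the time.
p_clean = (1 - upper) ** n
print(f"95% upper bound: {upper:.1%}, P(200/200 at that rate): {p_clean:.1%}")
```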

The simulation doesn't model some real things. It doesn't model dispatcher fatigue. It doesn't model the institutional knowledge a long tenured employee has. It doesn't model the way real humans escalate when something feels off. These are all places where humans outperform models in real operations, and the simulation gives the humans none of those advantages. If anything, the structured manual condition is underselling how good a real systematized operation can be.

What I think it means

I'm not going to dress this up as a sweeping conclusion. It's one experiment with documented limitations. But I'll tell you what I think when I look at the data.

I think generative AI is doing something real and valuable, but the real and valuable thing is not what most of the market is selling. The market is selling AI as a fix for broken processes. The data suggests AI is closer to a multiplier on whatever process you point it at. Point it at a clean process and it makes the clean process faster and cheaper. Point it at a chaotic process and it makes the chaotic process worse, because now your errors are confident, fast, and at scale.

I think this explains the MIT NANDA finding more than the typical explanations do. The standard explanation for the 95 percent failure rate is that the technology isn't good enough yet. The technology is fine. The model in this experiment is small and it hit 100 percent accuracy when the inputs were clean. What the technology can't do is fix the operational mess underneath it. And that's what most companies are asking it to do.

I think the implication for operators is uncomfortable. The work that actually matters before you adopt AI is the boring work. Defining your inputs. Standardizing your data. Documenting your decision rules. Cleaning up your handoffs. None of that is AI work. None of it shows up in a vendor demo. But based on this simulation and on what I keep seeing in the field, that work is doing more for your AI ROI than the choice of model ever could.

I might be wrong about all of this. The simulation might not generalize. The noise parameters might be off. A bigger model might handle chaos better than I think. I genuinely want to be wrong because the alternative is that a lot of money is being spent right now on a category of intervention that can't work without a different category of intervention happening first.

A few things I'd do differently

If I ran this again I'd vary the model size. Gemma 3 at 4B, then 12B, then 27B, plus a Claude or GPT-4 class model. That would tell me whether the chaos performance scales with model capability or whether there's a structural ceiling.

I'd vary the noise parameters and run the simulation across a grid of values. Right now my chaos error rate is fixed at 35 percent. What if it's 20? What if it's 50? Sensitivity analysis would tell me how robust the finding is.
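
The sweep itself is cheap to set up. A sketch, reusing the toy harness from earlier; the real version would also sweep the input-corruption parameters feeding the AI conditions, not just the human error rate.

```python
# Sensitivity sweep over the chaos error rate, fixed at 35% in the main run.
for chaos_rate in (0.20, 0.275, 0.35, 0.425, 0.50):
    HUMAN_ERROR_RATE["chaos"] = chaos_rate
    accuracy = run_condition(orders, human_decision, "chaos")
    print(f"chaos error rate {chaos_rate:.1%} -> accuracy {accuracy:.1%}")
```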

I'd let the AI ask clarifying questions. In the current setup it has to commit on first response, like a tool call. In a real workflow it could ask "is this perishable?" and that might dramatically change the chaos performance. I'd want to know by how much.
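
A sketch of what that loop could look like, building on the classify_with_model setup above. The one-question-per-turn protocol and the "question" key in the schema are inventions for illustration.

```python
import json

import ollama

def classify_with_questions(intake: dict, answer_fn, max_turns: int = 3) -> dict:
    # answer_fn stands in for the customer (or the ground truth) replying.
    prompt = (
        "You are a dispatcher. Either commit to a decision as JSON "
        '{"package_type": ..., "priority": ..., "driver_id": ...} or ask one '
        'clarifying question as {"question": ...}.\n\n'
        f"Intake: {json.dumps(intake)}"
    )
    messages = [{"role": "user", "content": prompt}]
    reply: dict = {}
    for _ in range(max_turns):
        resp = ollama.chat(model="gemma3:4b", messages=messages, format="json")
        reply = json.loads(resp["message"]["content"])
        if "question" not in reply:
            return reply  # the model committed to a decision
        messages.append({"role": "assistant", "content": resp["message"]["content"]})
        messages.append({"role": "user", "content": answer_fn(reply["question"])})
    return reply  # turn cap hit; keep whatever it last produced
```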

I'd model a realistic human dispatcher with adaptive judgment. Right now my humans follow rules with a fixed error rate. Real dispatchers get suspicious when something looks wrong. They escalate. They call the customer. The simulation gives the humans none of that, which probably understates how good well run human operations actually are.

The code is open

I open sourced the repo so anyone can play around with it. All the data, all the assumptions, all the analysis, everything I described is reproducible from the code. Seed is fixed at 42. If you have real numbers from your own operation you want to plug in, fork it and let me know what you find. If you think I got something wrong, I want to hear it.

The repo is at github.com/runningrestocks/automation-gap-research.

Thanks for reading.

References

  1. MIT NANDA. (2025). State of AI in Business 2025 [Quarterly deck v0.1]. https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf
