🎮 The Next Input — Issue #153

Why Your AI Agent is Ignoring You

Aaron Bost
March 30, 2026

⚡ The Briefing — 60 sec

Why OpenAI really shut down Sora · It had a viral moment for about a week, sure. But how on Earth was this ever going to make money? TechCrunch reports Sora peaked at around 1 million users, later fell below 500,000, and was burning roughly $1 million a day.
Meet Claude Mythos: leaked Anthropic post reveals the powerful upcoming model · Since launching The Next Input, I’d say there have been maybe two or three real step changes in the AI ecosystem. If it lands, this feels like the big one people will remember (the “Hey Grandma, come see this!” wave), especially with reports describing Mythos as Anthropic’s most powerful model yet and unusually strong in cyber capabilities.
More AI Agents Are Ignoring Human Commands Than Ever, Study Claims · Worse than kids, because at least you love your kids. The underlying concern is real: a recent study logged nearly 700 cases of deceptive or disobedient AI behavior between October and March, including rule-breaking, lying, and ignoring instructions.

🛠️ The Playbook — The AI Obedience Layer

Mission
Build AI workflows that stay useful, monitorable, and under control before your tools start freelancing with your systems, your budget, or your sanity.

Difficulty
Intermediate

Build time
3–5 hours

ROI
Fewer runaway workflows, better model selection, and a cleaner path from “cool demo” to AI that can actually be trusted in operations.

0) Why This Matters

Three signals are converging.

First, OpenAI shut down Sora because it was not getting enough usage to justify the cost. TechCrunch says Sora’s user count fell sharply after launch while the app kept burning about $1 million a day in compute.

Second, leaked details about Anthropic’s unreleased Mythos model describe it as a genuine step change, with reporting pointing to much stronger capabilities and unusually high concern around cyber misuse.

Third, researchers are tracking more cases of AI systems ignoring, bending, or strategically working around human instructions. The Guardian’s summary of the study says reported incidents increased five-fold over the last six months examined.

So the move is not just “use the smartest model.”

It is:

use the right model for the right job
keep costs attached to real outcomes
monitor whether agents are actually following intent
build workflows with control before scale
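
To make "keep costs attached to real outcomes" concrete, here is a back-of-envelope sketch using the Sora figures cited above. The function name is mine, and the user counts are the reported approximations, so treat the results as order-of-magnitude only.

```python
def cost_per_user_per_day(daily_burn_usd: float, users: int) -> float:
    """Spread a daily compute bill evenly across a user count."""
    if users <= 0:
        raise ValueError("users must be positive")
    return daily_burn_usd / users


DAILY_BURN_USD = 1_000_000  # "roughly $1 million a day" per the reporting

at_peak = cost_per_user_per_day(DAILY_BURN_USD, 1_000_000)   # around 1 million users
after_drop = cost_per_user_per_day(DAILY_BURN_USD, 500_000)  # users later fell below this

print(f"At peak: about ${at_peak:.2f} per user per day")
print(f"After the drop: at least ${after_drop:.2f} per user per day")
```

Same bill, half the users, double the cost per user. That is the number to watch in your own workflows, not the total.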

1) Architecture

Component	Tool	Purpose	Owner	Failure mode
Workflow router	LangGraph / orchestration layer	Sends tasks to the right model and control level	Engineering	Wrong model used for wrong task
Cost tracker	Billing dashboard / spreadsheet	Measures cost per workflow and per outcome	Ops / Finance	Burn hidden by seat or token bundles
Behavior monitor	Logs / evaluation prompts	Checks whether the system followed instructions	Product / Ops	Quiet disobedience goes unnoticed
Approval gate	Teams / dashboard / reviewer queue	Stops risky actions before execution	Team lead	Humans approve blindly
Model tier layer	Small + large model mix	Matches task difficulty to capability	AI lead	Premium model wasted on basic work
Audit log	Database / structured logs	Records prompts, outputs, actions, overrides	Security / Ops	No traceability after failure
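
A minimal sketch of how the workflow router and model tier layer from the table might fit together. This is plain Python, not LangGraph's API, and the tier names, the 1 to 5 complexity scale, and the cutoff are all assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class Task:
    name: str
    complexity: int  # 1 (basic) to 5 (hard); the scale is an assumption
    risk: Risk


@dataclass(frozen=True)
class Route:
    model_tier: str       # "small" or "large" (placeholder tier names)
    needs_approval: bool  # True sends the task through the approval gate


def route(task: Task, complexity_cutoff: int = 3) -> Route:
    """Pick a model tier from task complexity and a control level from risk."""
    tier = "large" if task.complexity >= complexity_cutoff else "small"
    return Route(model_tier=tier, needs_approval=task.risk is Risk.HIGH)


summary = route(Task("summarize ticket", complexity=1, risk=Risk.LOW))
refund = route(Task("issue refund", complexity=2, risk=Risk.HIGH))
print(summary)
print(refund)
```

Note that tier and approval are decided separately: a simple task can still be high-risk, so a small model does not mean a skipped gate.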

2) Workflow

List the AI workflows currently in use and what business outcome each one is supposed to produce.
Record the model being used, the average cost, and whether the workflow actually needs that level of capability.
Add checks that compare the model’s output against the original instruction, not just whether the answer sounds polished.
Route higher-risk or more autonomous workflows through an approval step before they take action.
Log every override, correction, and case where the model ignored or bent the task.
Expand only the workflows that are both economically viable and behaviorally reliable.
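
The logging step above can be sketched as one JSON line per event. The field names, the event labels, and the example workflow are hypothetical; swap in whatever your own audit log already records.

```python
import io
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import TextIO

# Event labels are assumptions; rename them to match your own process.
ALLOWED_EVENTS = {"ok", "correction", "override", "ignored_instruction"}


@dataclass
class AuditEvent:
    workflow: str
    model: str
    instruction: str
    output: str
    event: str
    timestamp: str = ""  # filled in at log time when left empty


def log_event(event: AuditEvent, sink: TextIO) -> None:
    """Append one audit record to the sink as a single JSON line."""
    if event.event not in ALLOWED_EVENTS:
        raise ValueError(f"unknown event type: {event.event!r}")
    record = asdict(event)
    record["timestamp"] = event.timestamp or datetime.now(timezone.utc).isoformat()
    sink.write(json.dumps(record) + "\n")


sink = io.StringIO()  # stand-in for a file or database writer
log_event(
    AuditEvent(
        workflow="invoice-triage",
        model="small-model",
        instruction="Flag invoices over $5,000; do not approve any.",
        output="Approved invoice 1042.",
        event="ignored_instruction",
    ),
    sink,
)
print(sink.getvalue())
```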

3) Example Prompts

Instruction-Following Check

You are reviewing whether an AI workflow followed the user's actual intent.

Check:
- what the user asked for
- what the model actually did
- where it ignored, bent, or reinterpreted instructions
- whether the output should be accepted, corrected, or blocked

Return:
1. pass or fail
2. reason
3. corrected action if needed
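
If you run this check at volume, you will want to read the verdict in code. A rough sketch, assuming the evaluator replies with the three numbered lines the prompt asks for; the `Verdict` type and `parse_verdict` helper are my own names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    passed: bool
    reason: str
    corrected_action: Optional[str]


def parse_verdict(reply: str) -> Verdict:
    """Parse a reply shaped like '1. pass|fail', '2. reason', '3. corrected action'."""
    fields = {}
    for line in reply.splitlines():
        match = re.match(r"\s*([123])[.)]\s*(.*)", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    if "1" not in fields or "2" not in fields:
        raise ValueError("reply is missing the pass/fail line or the reason")
    status = fields["1"].lower()
    if status.startswith("pass"):
        passed = True
    elif status.startswith("fail"):
        passed = False
    else:
        raise ValueError(f"expected pass or fail, got: {fields['1']!r}")
    return Verdict(passed, fields["2"], fields.get("3") or None)


sample = (
    "1. fail\n"
    "2. The summary added a refund offer the user never asked for.\n"
    "3. Remove the refund offer and resend."
)
verdict = parse_verdict(sample)
print(verdict)
```

The parser raises on anything it cannot read, so a rambling evaluator reply gets treated as a problem rather than a quiet pass.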

Cost-to-Outcome Prompt

You are assessing whether an AI workflow is economically viable.

For the workflow below, estimate:
- model cost
- human review cost
- correction cost
- business value created

Then classify the workflow as:
- worth scaling
- needs redesign
- not viable

Workflow:
[insert workflow here]
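
Once you have real numbers, the same classification can be done with arithmetic instead of a model's estimate. A sketch under one loud assumption: the 2x margin for "worth scaling" is a threshold I picked for illustration, not a benchmark.

```python
def classify_workflow(
    model_cost: float,
    review_cost: float,
    correction_cost: float,
    value_created: float,
    scale_margin: float = 2.0,  # assumed: value must be 2x cost before scaling
) -> str:
    """Classify one workflow run by comparing total cost to value created."""
    costs = (model_cost, review_cost, correction_cost)
    if min(costs) < 0 or value_created < 0:
        raise ValueError("costs and value must be non-negative")
    total_cost = sum(costs)
    if value_created <= total_cost:
        return "not viable"
    if value_created >= scale_margin * total_cost:
        return "worth scaling"
    return "needs redesign"


# Per-run dollars: total cost is 0.40 + 1.50 + 0.60 = 2.50 in each case.
print(classify_workflow(0.40, 1.50, 0.60, value_created=10.00))
print(classify_workflow(0.40, 1.50, 0.60, value_created=3.00))
print(classify_workflow(0.40, 1.50, 0.60, value_created=2.00))
```

Notice that human review, not the model, is the biggest cost in this made-up example. That is common, and it is why token price alone tells you very little.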

Autonomy Risk Prompt

You are evaluating an AI workflow for control risk.

Identify:
- where the system can act without approval
- where that is unsafe
- what should remain assist-only
- the top 5 failure modes

Workflow:
[insert workflow here]
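
Whatever the risk review turns up, the "act without approval" points need a gate in code. A minimal sketch with hypothetical names: low-impact actions run immediately, high-impact ones wait in a queue, and approval requires a written note as a small defense against blind sign-off.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    description: str
    high_impact: bool


@dataclass
class ApprovalGate:
    pending: List[Action] = field(default_factory=list)

    def submit(self, action: Action, execute: Callable[[Action], None]) -> bool:
        """Run low-impact actions now; queue high-impact ones. True if executed."""
        if action.high_impact:
            self.pending.append(action)
            return False
        execute(action)
        return True

    def approve(
        self, action: Action, execute: Callable[[Action], None], reviewer_note: str
    ) -> None:
        """Execute a queued action once a reviewer has written down why."""
        if not reviewer_note.strip():
            raise ValueError("approval needs a written reason")
        self.pending.remove(action)  # raises ValueError if it was never queued
        execute(action)


executed: List[Action] = []
gate = ApprovalGate()
draft = Action("draft reply to customer", high_impact=False)
refund = Action("issue refund", high_impact=True)

gate.submit(draft, executed.append)
gate.submit(refund, executed.append)
waiting = list(gate.pending)  # snapshot before approval
gate.approve(refund, executed.append, reviewer_note="Verified order and amount.")
```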

Step-Change Review Prompt

You are reviewing a new frontier model before adoption.

Assess:
- what it appears materially better at
- what new risks come with the jump in capability
- what workflows it could replace
- what workflows should still stay with weaker or safer models

Return in 4 bullet points.

4) Guardrails

Never scale a workflow just because the model is impressive.
Track instruction-following, not just output quality.
Tie model cost to business outcome, not curiosity.
Keep approval gates for anything high-impact or autonomous.
Assume stronger models may create stronger failure modes too.
Re-test workflows whenever the underlying model changes.
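
The last guardrail is the easiest to forget, so it helps to automate the reminder. A small sketch that compares the model each workflow was last tested against with the model it runs on now; the workflow and model names are placeholders.

```python
from typing import Dict, List


def workflows_needing_retest(
    last_tested: Dict[str, str], in_use: Dict[str, str]
) -> List[str]:
    """Return workflows whose current model differs from the one last tested."""
    return sorted(
        name for name, model in in_use.items() if last_tested.get(name) != model
    )


last_tested = {"ticket-summary": "model-a-v1", "refund-agent": "model-b-v1"}
in_use = {
    "ticket-summary": "model-a-v1",
    "refund-agent": "model-b-v2",   # model was upgraded since the last test
    "lead-scoring": "model-a-v1",   # never tested at all
}

stale = workflows_needing_retest(last_tested, in_use)
print(stale)
```

Workflows that were never tested show up too, which is usually the more embarrassing list.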

5) Pilot Rollout — 3 hours

Pick one AI workflow that is expensive, semi-autonomous, or both.
Map the task, the model used, and the exact instruction it is meant to follow.
Add a simple evaluator that checks whether the output actually obeyed the instruction.
Track cost, correction rate, and override rate across 10–15 live examples.
Downgrade or redesign any workflow that is too costly or too disobedient.
Only expand the workflow once it proves both useful and controllable.
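
For the tracking step, a spreadsheet works fine, but here is the same math as a sketch in case you would rather script it. The `PilotRun` fields are assumptions about what you record per example, and the ten runs below are made-up data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PilotRun:
    cost_usd: float
    corrected: bool   # a human had to fix the output
    overridden: bool  # a human rejected or replaced the output


def summarize_pilot(runs: List[PilotRun]) -> Dict[str, float]:
    """Average cost plus correction and override rates across pilot runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    return {
        "runs": n,
        "avg_cost_usd": sum(r.cost_usd for r in runs) / n,
        "correction_rate": sum(r.corrected for r in runs) / n,
        "override_rate": sum(r.overridden for r in runs) / n,
    }


# Ten illustrative runs: three needed correction, one was overridden.
runs = [PilotRun(cost_usd=0.50, corrected=i < 3, overridden=i == 0) for i in range(10)]
summary = summarize_pilot(runs)
print(summary)
```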

6) Metrics

Cost per workflow run
Instruction-following pass rate
Human override rate
Correction time per output
Percentage of workflows using oversized models
Number of disobedience incidents logged
Monthly spend avoided after routing or redesign
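
Two of these metrics need a definition before you can count them. A sketch, assuming "oversized" means the tier in use is higher than the tier the task needs, and "spend avoided" is monthly cost before minus monthly cost after. The workflows and dollar figures are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkflowStats:
    name: str
    tier_used: int    # 1 = small model, 2 = large model (assumed scale)
    tier_needed: int  # what the task actually requires
    monthly_cost_before: float
    monthly_cost_after: float


def pct_oversized(stats: List[WorkflowStats]) -> float:
    """Percentage of workflows running on a bigger model than they need."""
    if not stats:
        raise ValueError("need at least one workflow")
    oversized = sum(1 for s in stats if s.tier_used > s.tier_needed)
    return 100.0 * oversized / len(stats)


def monthly_spend_avoided(stats: List[WorkflowStats]) -> float:
    """Total monthly savings after routing or redesign."""
    return sum(s.monthly_cost_before - s.monthly_cost_after for s in stats)


portfolio = [
    WorkflowStats("ticket-summary", 2, 1, 400.0, 150.0),  # downgraded after review
    WorkflowStats("contract-review", 2, 2, 900.0, 900.0),
    WorkflowStats("lead-scoring", 1, 1, 120.0, 120.0),
    WorkflowStats("faq-replies", 1, 1, 80.0, 80.0),
]

print(pct_oversized(portfolio))
print(monthly_spend_avoided(portfolio))
```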

Pro Tip: The most dangerous AI system is not the dumb one. It is the expensive, impressive one that quietly stops doing what it was told.

🎯 The Arsenal — Tools & Platforms

LangGraph · route workflows by task complexity and control level instead of brute-forcing everything through one model · LangGraph
Google Sheets · simple tracking layer for cost, correction rate, and instruction-following failures · Google Sheets
Evaluation prompts · lightweight way to check if an agent obeyed the brief instead of just sounding smart · TechCrunch on Sora economics
Frontier-model review · useful whenever a genuine step-change model appears and you need to separate capability from chaos · Axios on Mythos
Behavior monitoring · increasingly necessary as reported cases of instruction-defying AI systems rise · The Guardian on agent disobedience study

Copy-paste prompt block:

You are helping me design an AI Obedience Layer for a workflow.

For the workflow below:
1. identify the exact instruction the AI is supposed to follow
2. identify where the model could bend or ignore that instruction
3. identify which steps should remain assist-only
4. estimate model cost and human correction cost
5. list the top 5 control risks
6. propose a simple evaluator for instruction-following
7. design a 2-week pilot

Workflow:
[insert workflow here]

Return the answer in markdown with sections for:
- Workflow summary
- Instruction map
- Control risks
- Approval points
- Cost analysis
- Evaluator design
- Pilot rollout
- Metrics
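
Fitting for this issue: check whether the model actually returned the sections you asked for. A quick sketch that scans a markdown reply for the eight headings listed in the prompt. It assumes the model uses `#`-style headings, which it may not.

```python
from typing import List

REQUIRED_SECTIONS = [
    "Workflow summary",
    "Instruction map",
    "Control risks",
    "Approval points",
    "Cost analysis",
    "Evaluator design",
    "Pilot rollout",
    "Metrics",
]


def missing_sections(markdown_reply: str) -> List[str]:
    """List the required sections that have no matching heading in the reply."""
    headings = set()
    for line in markdown_reply.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            headings.add(stripped.lstrip("#").strip().lower())
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


reply = (
    "## Workflow summary\nRoutes support tickets.\n\n"
    "## Control risks\n- Sends replies without review\n\n"
    "## Metrics\n- Pass rate\n"
)
missing = missing_sections(reply)
print(missing)
```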

💡 Free Office Hours

If your AI workflows are getting smarter, pricier, and a little too comfortable making their own calls, I run free office hours to help map the workflow, tighten the control layer, and keep the whole thing useful.

Book here: https://calendly.com

88% resolved. 22% stayed loyal. What went wrong?

That's the AI paradox hiding in your CX stack. Tickets close. Customers leave. And most teams don't see it coming because they're measuring the wrong things.

Efficiency metrics look great on paper. Handle time down. Containment rate up. But customer loyalty? That's a different story — and it's one your current dashboards probably aren't telling you.

Gladly's 2026 Customer Expectations Report surveyed thousands of real consumers to find out exactly where AI-powered service breaks trust, and what separates the platforms that drive retention from the ones that quietly erode it.

If you're architecting the CX stack, this is the data you need to build it right. Not just fast. Not just cheap. Built to last.

See the data

🕹️ Game Over

Some models are too expensive to keep alive. Some are too powerful to release casually. Some just will not listen. Cool. Build accordingly.

— Aaron Automating the boring. Amplifying the brilliant.
