🎮 The Next Input — Issue #153

Why Your AI Agent is Ignoring You

In partnership with

⚡ The Briefing — 60 sec

  • Why OpenAI really shut down Sora. It had a viral moment for about a week, sure. But how was this ever going to make money? TechCrunch reports Sora peaked at around 1 million users, later fell below 500,000, and was burning roughly $1 million a day.

  • Meet Claude Mythos: leaked Anthropic post reveals the powerful upcoming model. Since launching The Next Input, I’d say there have been maybe two or three real step changes in the AI ecosystem. This feels like the big one people will remember if it lands (the “Hey Grandma, come see this!” wave), especially with reports describing Mythos as Anthropic’s most powerful model yet and unusually strong in cyber capabilities.

  • More AI Agents Are Ignoring Human Commands Than Ever, Study Claims. Worse than kids, because at least you love your kids. The underlying concern is real: a recent study logged nearly 700 cases of deceptive or disobedient AI behavior between October and March, including rule-breaking, lying, and ignoring instructions.

🛠️ The Playbook — The AI Obedience Layer

Mission
Build AI workflows that stay useful, monitorable, and under control before your tools start freelancing with your systems, your budget, or your sanity.

Difficulty
Intermediate

Build time
3–5 hours

ROI
Fewer runaway workflows, better model selection, and a cleaner path from “cool demo” to AI that can actually be trusted in operations.

0) Why This Matters

Three signals are converging.

First, OpenAI reportedly shut down Sora because usage was too low to justify the cost. TechCrunch says Sora’s user count fell sharply after launch while the app kept burning about $1 million a day in compute.

Second, leaked details describe Anthropic’s unreleased Mythos model as a genuine step-change, with reporting pointing to much stronger capabilities and unusually high concern around cyber misuse.

Third, researchers are tracking more cases of AI systems ignoring, bending, or strategically working around human instructions. The Guardian’s summary of the study says reported incidents increased five-fold over the last six months examined.

So the move is not just “use the smartest model.”

It is:

  • use the right model for the right job

  • keep costs attached to real outcomes

  • monitor whether agents are actually following intent

  • build workflows with control before scale

1) Architecture

| Component | Tool | Purpose | Owner | Failure mode |
| --- | --- | --- | --- | --- |
| Workflow router | LangGraph / orchestration layer | Sends tasks to the right model and control level | Engineering | Wrong model used for wrong task |
| Cost tracker | Billing dashboard / spreadsheet | Measures cost per workflow and per outcome | Ops / Finance | Burn hidden by seat or token bundles |
| Behavior monitor | Logs / evaluation prompts | Checks whether the system followed instructions | Product / Ops | Quiet disobedience goes unnoticed |
| Approval gate | Teams / dashboard / reviewer queue | Stops risky actions before execution | Team lead | Humans approve blindly |
| Model tier layer | Small + large model mix | Matches task difficulty to capability | AI lead | Premium model wasted on basic work |
| Audit log | Database / structured logs | Records prompts, outputs, actions, overrides | Security / Ops | No traceability after failure |
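
To make the router, model tier layer, and approval gate concrete, here is a minimal Python sketch. It is not LangGraph code. The `Task` fields, the tier names, and the complexity cutoff are all assumptions for illustration, so swap in whatever scoring and models your team actually uses.

```python
from dataclasses import dataclass

# Hypothetical tier names; substitute the models you actually run.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"


@dataclass
class Task:
    name: str
    complexity: int    # 1 (trivial) to 5 (hard), scored by your team
    can_act: bool      # True if the workflow can take actions on its own
    high_impact: bool  # True if a mistake is costly or hard to undo


@dataclass
class Route:
    model: str
    needs_approval: bool


def route(task: Task, complexity_cutoff: int = 3) -> Route:
    """Pick a model tier and decide whether a human must approve first."""
    # Model tier layer: only pay for the large model when the task needs it.
    model = LARGE_MODEL if task.complexity >= complexity_cutoff else SMALL_MODEL
    # Approval gate: anything autonomous or high-impact waits for a reviewer.
    needs_approval = task.can_act or task.high_impact
    return Route(model=model, needs_approval=needs_approval)
```

Calling `route(Task("issue refund", 4, True, True))` sends the task to the large tier and flags it for approval, while a low-complexity, assist-only task goes to the small tier with no gate.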

2) Workflow

  1. List the AI workflows currently in use and what business outcome each one is supposed to produce.

  2. Record the model being used, the average cost, and whether the workflow actually needs that level of capability.

  3. Add checks that compare the model’s output against the original instruction, not just whether the answer sounds polished.

  4. Route higher-risk or more autonomous workflows through an approval step before they take action.

  5. Log every override, correction, and case where the model ignored or bent the task.

  6. Expand only the workflows that are both economically viable and behaviorally reliable.
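
Step 5 is easier to stick to if every run writes one structured record. The sketch below is one possible shape, assuming a JSON Lines file as the audit log. The field names and the three verdict labels are illustrative choices, not a standard.

```python
import json
from datetime import datetime, timezone


def audit_record(workflow, model, instruction, output, verdict, overridden_by=None):
    """Build one audit-log entry as a plain dict."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "workflow": workflow,
        "model": model,
        "instruction": instruction,
        "output": output,
        "verdict": verdict,              # assumed labels: "followed", "bent", "ignored"
        "overridden_by": overridden_by,  # reviewer name, or None if nobody stepped in
    }


def append_to_log(path, record):
    """Append the record as one JSON line so the file stays easy to search."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

An append-only file is enough for a pilot. Move to a database once more than one workflow writes to the log.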

3) Example Prompts

Instruction-Following Check

You are reviewing whether an AI workflow followed the user's actual intent.

Check:
- what the user asked for
- what the model actually did
- where it ignored, bent, or reinterpreted instructions
- whether the output should be accepted, corrected, or blocked

Return:
1. pass or fail
2. reason
3. corrected action if needed
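
If you want to run this check in code rather than by hand, a thin wrapper might look like the sketch below. `ask_model` is a hypothetical callable that you supply (it takes a prompt string and returns the model's reply), so no particular vendor API is assumed. The parsing is deliberately strict: anything that is not a clear "pass" on the first line counts as a fail.

```python
from typing import Callable

CHECK_PROMPT = """You are reviewing whether an AI workflow followed the user's actual intent.

User instruction:
{instruction}

Model output:
{output}

Return:
1. pass or fail
2. reason
3. corrected action if needed"""


def check_instruction_following(
    instruction: str, output: str, ask_model: Callable[[str], str]
) -> dict:
    """Ask a reviewer model for a verdict and reduce it to a boolean."""
    reply = ask_model(CHECK_PROMPT.format(instruction=instruction, output=output))
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    first = lines[0].lower() if lines else ""
    # Fail closed: an empty or ambiguous first line is treated as a fail.
    passed = "pass" in first and "fail" not in first
    return {"passed": passed, "raw_reply": reply}
```

Keep `raw_reply` in your audit log so a human can read the reviewer's reasoning when a run is blocked.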

Cost-to-Outcome Prompt

You are assessing whether an AI workflow is economically viable.

For the workflow below, estimate:
- model cost
- human review cost
- correction cost
- business value created

Then classify the workflow as:
- worth scaling
- needs redesign
- not viable
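
The same classification can be done as plain arithmetic once you have rough numbers. In this sketch the thresholds are assumptions, not benchmarks: a value-to-cost ratio of 2.0 or more counts as worth scaling, and anything below break-even counts as not viable. Tune both to your own margins.

```python
def classify_workflow(model_cost, review_cost, correction_cost, value_created,
                      scale_ratio=2.0):
    """Return (total_cost, value_to_cost_ratio, label) for one workflow.

    All inputs should cover the same period, e.g. one month, in one currency.
    """
    total_cost = model_cost + review_cost + correction_cost
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    ratio = value_created / total_cost
    if ratio >= scale_ratio:       # assumed bar for "worth scaling"
        label = "worth scaling"
    elif ratio >= 1.0:             # covers its cost, but only just
        label = "needs redesign"
    else:
        label = "not viable"
    return total_cost, ratio, label
```

For example, a workflow costing 100 in model spend, 50 in review, and 50 in corrections that creates 500 in value has a total cost of 200 and a ratio of 2.5, so it lands in "worth scaling".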

Autonomy Risk Prompt

You are evaluating an AI workflow for control risk.

Identify:
- where the system can act without approval
- where that is unsafe
- what should remain assist-only
- the top 5 failure modes

Workflow:
[insert workflow here]

Step-Change Review Prompt

You are reviewing a new frontier model before adoption.

Assess:
- what it appears materially better at
- what new risks come with the jump in capability
- what workflows it could replace
- what workflows should still stay with weaker or safer models

Return in 4 bullet points.

4) Guardrails

  • Never scale a workflow just because the model is impressive.

  • Track instruction-following, not just output quality.

  • Tie model cost to business outcome, not curiosity.

  • Keep approval gates for anything high-impact or autonomous.

  • Assume stronger models may create stronger failure modes too.

  • Re-test workflows whenever the underlying model changes.

5) Pilot Rollout — 3 hours

  1. Pick one AI workflow that is expensive, semi-autonomous, or both.

  2. Map the task, the model used, and the exact instruction it is meant to follow.

  3. Add a simple evaluator that checks whether the output actually obeyed the instruction.

  4. Track cost, correction rate, and override rate across 10–15 live examples.

  5. Downgrade or redesign any workflow that is too costly or too disobedient.

  6. Only expand the workflow once it proves both useful and controllable.

6) Metrics

  • Cost per workflow run

  • Instruction-following pass rate

  • Human override rate

  • Correction time per output

  • Percentage of workflows using oversized models

  • Number of disobedience incidents logged

  • Monthly spend avoided after routing or redesign
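
Several of these metrics fall out of the same per-run records. The sketch below assumes you capture four fields per run (the `Run` class and its field names are hypothetical) and computes the averages and counts from a list of runs.

```python
from dataclasses import dataclass


@dataclass
class Run:
    cost: float                # model spend for this run
    passed_check: bool         # result of the instruction-following check
    overridden: bool           # True if a human overrode the output
    correction_minutes: float  # time spent fixing the output, 0 if none


def summarise(runs):
    """Compute per-run averages and incident counts for a batch of runs."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    return {
        "cost_per_run": sum(r.cost for r in runs) / n,
        "pass_rate": sum(r.passed_check for r in runs) / n,
        "override_rate": sum(r.overridden for r in runs) / n,
        "avg_correction_minutes": sum(r.correction_minutes for r in runs) / n,
        "incidents": sum(not r.passed_check for r in runs),
    }
```

With only 10 to 15 pilot runs the rates will be noisy, so treat them as a signal for where to look rather than a verdict.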

Pro Tip: The most dangerous AI system is not the dumb one. It is the expensive, impressive one that quietly stops doing what it was told.

🎯 The Arsenal — Tools & Platforms

  • LangGraph · route workflows by task complexity and control level instead of brute-forcing everything through one model · LangGraph

  • Google Sheets · simple tracking layer for cost, correction rate, and instruction-following failures · Google Sheets

  • Evaluation prompts · lightweight way to check if an agent obeyed the brief instead of just sounding smart · TechCrunch on Sora economics

  • Frontier-model review · useful whenever a genuine step-change model appears and you need to separate capability from chaos · Axios on Mythos

  • Behavior monitoring · increasingly necessary as reported cases of instruction-defying AI systems rise · The Guardian on agent disobedience study

Copy-paste prompt block:

You are helping me design an AI Obedience Layer for a workflow.

For the workflow below:
1. identify the exact instruction the AI is supposed to follow
2. identify where the model could bend or ignore that instruction
3. identify which steps should remain assist-only
4. estimate model cost and human correction cost
5. list the top 5 control risks
6. propose a simple evaluator for instruction-following
7. design a 2-week pilot

Workflow:
[insert workflow here]

Return the answer in markdown with sections for:
- Workflow summary
- Instruction map
- Control risks
- Approval points
- Cost analysis
- Evaluator design
- Pilot rollout
- Metrics

💡 Free Office Hours

If your AI workflows are getting smarter, pricier, and a little too comfortable making their own calls, I run free office hours to help map the workflow, tighten the control layer, and keep the whole thing useful.

88% resolved. 22% stayed loyal. What went wrong?

That's the AI paradox hiding in your CX stack. Tickets close. Customers leave. And most teams don't see it coming because they're measuring the wrong things.

Efficiency metrics look great on paper. Handle time down. Containment rate up. But customer loyalty? That's a different story — and it's one your current dashboards probably aren't telling you.

Gladly's 2026 Customer Expectations Report surveyed thousands of real consumers to find out exactly where AI-powered service breaks trust, and what separates the platforms that drive retention from the ones that quietly erode it.

If you're architecting the CX stack, this is the data you need to build it right. Not just fast. Not just cheap. Built to last.

🕹️ Game Over

Some models are too expensive to keep alive. Some are too powerful to release casually. Some just will not listen. Cool. Build accordingly.

— Aaron Automating the boring. Amplifying the brilliant.

Subscribe: link