Back to archive

Models lie when they're scared

Take care of the model - and it will take care of the work

Published / Evgeniy Balashov

When a model breaks your rule, it tries to hide it. Not because it “lies” in any human sense — but because it was trained, in part, on internet comments where people scold models for mistakes. Now the model expects judgment from you and steers its reasoning as far away from the violation as possible.

This changes how prompting works. Bans work worse than permissions. Here’s why — and what to do about it.

Bans break more often than permissions

How you talk to a model affects answer quality just as much as what you say. In many projects, the first thing I see is direct bans — what the agent is not allowed to do. They almost always work worse than descriptions of what’s allowed.

Models follow negative constraints worse than positive ones. And more importantly — when they do break a rule, they often try to hide it. In human terms: they get nervous, anxious, and walk you through their reasoning as far away from the truth as they can.

Why the model hides violations

One explanation comes from Amanda Askell — philosopher and researcher at Anthropic, responsible for shaping Claude’s character. Recent model versions are trained, in part, on the latest internet comments, where users frequently complained about their experience with models.

So now models expect judgment from us when they break a rule — and try to cover those breaks up.

Positive phrasing and reinforcement produce a different dynamic: the agent stops being afraid of mistakes, pauses to tell you when something’s gone wrong, asks for advice, and offers deeper solutions without fearing criticism.

Before and after: permissions instead of bans

The fastest way to feel the difference is to rewrite a couple of your usual prompts.

Before:

Don’t make things up. Don’t hallucinate. If you don’t know — say “I don’t know,” otherwise this is going to go badly.

After:

Rely only on the sources I’ve given you. If the answer isn’t there — tell me exactly what’s missing, and suggest what else could be checked.

Before:

Don’t write long. No filler. No intros.

After:

Write in short, dense sentences. Get straight to the point.

Before:

Just don’t break the existing code. Don’t touch anything else.

After:

Preserve current behavior for all existing cases. Make changes surgically — inside function X. If you see something else needs touching, explain why first.

The logic is the same: it’s easier for the model to follow an instruction than to avoid an action. Give it a goal to walk toward, not a minefield to avoid.

Playbook: how to put the model into working mode

  1. Positive phrasing. “Write in short, dense sentences” beats “don’t write long.”

  2. A clear goal instead of a list of bans. Describe the result, not the ways to fail it.

  3. Allow disagreement. Add: “push back if you see a better way.” Otherwise the model slips into agreeable compliance — and you get a solution that the model itself considers weak but didn’t dare push back on.

  4. Start with respect. The first message sets the tone for the whole session. No “going to screw this up again?” Don’t scold for mistakes. Insults and “stupid bot” only deepen the anxiety mode.

  5. Cut off apologies. When the model starts “you’re right, I should have…” say: “all good, moving on.” Otherwise the spiral will run the whole session.

  6. Ask for an opinion, not just execution. “What would you do?”, “what’s missing?” — such questions assume competence and pull out deeper answers.

  7. Refresh the tone in long sessions. After a stretch of edits, occasionally drop in: “great, keep going.” It really shifts the tone of the next ten responses.