Prompt Engineering ยท AI Agents

What actually makes a prompt work in production

Most prompt advice is decoration. The patterns that actually change a model's behaviour are concrete: thresholds over adjectives, worked examples good and bad, named failure modes, enforced output formats, and where each instruction sits.

Vin Lim Founder, Astralab

Most prompt advice is decoration. A frontier model does not become more careful because you told it to "be thorough," and it does not write better code because you asked for "clean, well-structured" output. Those words carry nothing the model can act on. What changes behaviour is specific and testable: thresholds instead of adjectives, worked examples instead of descriptions, named failure modes instead of hopes, and templates instead of "please format it nicely." Prompt engineering, done well, looks more like engineering than like writing. These are the levers that actually move a model, and the ones that only feel like they do.

What actually changes a model's behaviour

The fastest way to improve a prompt is to delete the parts that do nothing and sharpen the parts that do. Almost every weak instruction has a strong counterpart that hands the model something concrete.

Does littleMoves behaviour
Adjectives ("thorough", "clean", "professional")Thresholds and counts
A rule described in proseA worked example, one good and one bad
"Please respond in JSON"A format template plus a sentinel line
"Try your best"A success criterion the output is checked against

None of this is about writing more. A short prompt built from concrete directives beats a long one padded with adjectives. The test for any sentence: if you deleted it, would the output measurably change? If not, it was decoration, and it is costing you tokens and attention for nothing.

Replace adjectives with thresholds

An adjective is a judgement the model has to guess at. A number is an instruction it can follow. "Be concise" could mean fifty words or five hundred; "keep the answer under 100 words" cannot be misread. The same move works for any quality you care about.

Vague instructionConcrete version the model can act on
"Be thorough""Check at least three places, try one boundary input, include one adversarial case"
"Be concise""Keep the answer under 100 words"
"Handle errors""On a failed write, retry once, then stop and report; never swallow the error"
"Write good code""Match the surrounding file's naming and structure; add no new dependency without a reason"

Length is the clearest case: tell the model how long to be in numbers, not in adjectives, and the output stops drifting. The principle generalises. Every time you reach for "thorough" or "clean," ask what you would actually check to know the output had that property, and write that instead. Following Anthropic's prompt-engineering guidance, being explicit and giving the model room to follow the rule beats hinting at what you want.

One worked example beats a paragraph of rules

Examples are the most load-bearing part of a prompt, and the reason is not stylistic. In-context, or few-shot, learning is how these models pick up a task from the prompt itself, the behaviour first documented at scale when GPT-3 showed that a handful of examples in the context could stand in for fine-tuning (Brown et al., 2020). A rule tells the model what you want; an example shows it, including the edge the rule forgot to mention.

The highest-value example is often the negative one. Pair a correct example with a wrong one labelled "rejected," and add a short parenthetical that names exactly why it is wrong. A verification prompt that shows a PASS written purely from reading the code, tagged "(no command was run; reading code is not verification)," prevents that mistake far more reliably than any amount of instruction to "actually test it." Anthropic's own guidance puts examples first for the same reason: they are the single most effective thing you can add, and three to five well-chosen ones, covering the edge cases, do most of the work.

Name the excuse before the model makes it

Models fail in predictable ways, and the failure usually arrives wearing the same few phrases. "Based on my reading, this looks correct." "This should work." "Without access to a browser I can't check this." Each is a rationalisation for skipping the hard part, and each is easy to pre-empt: write the excuse into the prompt and reject it in advance.

This is one of the most effective patterns we use. A block that says, in effect, "you will be tempted to stop here and explain instead of running the command; recognise that urge and run the command" changes behaviour, because it makes the shortcut legible to the model at the moment it reaches for it. The craft is to name the actual excuses, the ones you have watched the model give, not to offer generic encouragement. "Give it your best" does nothing. "If you catch yourself writing an explanation instead of running the check, stop and run the check" does.

Enforce the output format, don't hope for it

If anything downstream parses the output, "please respond in JSON" is a coin flip. Model output drifts, and one stray sentence of preamble breaks the parser. The fix has two parts: show the exact shape as a template, and end with a single literal sentinel line the caller can scan for, something like VERDICT: PASS on its own line, no bold and no variation.

The sentinel works because it gives the model an unambiguous target and the parser an unambiguous anchor. The same applies to any structured handoff between agents: define the shape, show one valid instance and one rejected instance, and require the marker at the end. Models reliably emit a fixed string when the prompt closes on the requirement to. Hope is not a parsing strategy.

Position is a lever: what goes first and last

Where a piece of information sits in the prompt changes how much weight it gets. Models attend most to the beginning and the end of a long context and least to the middle, the effect Liu and colleagues documented as "lost in the middle": accuracy on retrieving a fact drops measurably when it is buried in the centre of a long input rather than placed at either edge (Liu et al., 2023). The practical rule is to put the load-bearing instructions and examples near the top, the specific question or input near the bottom, and nothing you cannot afford to have skimmed in the dead centre.

That ordering pays a second time at the billing layer. Putting the stable part of the prompt first (the role, the rules, the examples) and the changing part last (the user input and session state) is exactly the structure that lets prompt caching charge a fraction of the price on the repeated prefix. The same discipline that makes a prompt reliable makes it cheap, which we go into in our piece on building production agents.

Pick the right imperative register

Not every instruction deserves the same force, and a prompt that shouts everything emphasises nothing. Match the register to the kind of rule.

RegisterUse it forExample
You MUST / strictly prohibitedHard rules: safety, output format"You MUST read the file before editing it"
ALWAYS / NEVERRepeatable behaviours"NEVER use an interactive flag such as -i"
Prefer / Default toSoft defaults with an override path"Prefer a new commit over amending"
Consider / RecommendedSuggestions and edge cases"Consider a sub-agent for a wide search"

Two habits keep the register honest. Spend your capital letters: one or two genuinely non-negotiable rules marked IMPORTANT land hard, while a prompt littered with capitals reads as noise and the model stops weighting any of them. And prefer direct negation, "don't swallow the error," over softer phrasing like "try to avoid swallowing errors." A blunt "don't" sticks better than a polite paragraph.

None of these levers is exotic, and that is the point. Prompt engineering is an engineering loop, not incantation. Write the prompt from concrete directives and real examples, watch where the model fails, name that failure, add one mitigation, and re-run. We write prompts the way we write code: the examples are the tests, the named failure modes are the guardrails, and anything that does not change the output gets deleted. A model will not read your intentions. It follows your instructions, exactly as specific as you made them.

Sources