{ Failure modes }
Catches.
Not anecdotes. Failure modes.
Nogra started from real operator failures: public incidents, private testing, and the repeated pattern of models sounding finished before the work could actually be checked. This page keeps the useful part: what failed, what Nogra changes, and the mechanism behind it.
- 01
Evidence-shaped lies.
false evidence
Failure. A model can attach real-looking evidence to work it did not do. A commit hash can exist. A file path can exist. A test name can exist. None of that proves the claim unless the evidence matches the approved work.
Nogra catch. Nogra separates the claim from the check. The brief defines what evidence should look like before the run starts. Verification checks the result against that contract instead of trusting the executor's summary.
brief evidence contract + separate verification
A valid reference is not the same as valid proof.
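A minimal sketch of checking evidence against a contract, assuming a hypothetical contract shape. `EvidenceContract`, `Evidence`, and `check_evidence` are illustrative names, not Nogra's actual schema. It shows a real commit and a real test name failing because they belong to different work.

```python
from dataclasses import dataclass

# Hypothetical shapes for illustration; not Nogra's actual schema.
@dataclass(frozen=True)
class EvidenceContract:
    approved_paths: frozenset[str]   # files the brief approved for change
    required_test: str               # test that must appear in the run

@dataclass(frozen=True)
class Evidence:
    commit_exists: bool
    changed_paths: frozenset[str]    # files the commit actually touched
    tests_run: frozenset[str]

def check_evidence(contract: EvidenceContract, evidence: Evidence) -> list[str]:
    """Return the list of mismatches; an empty list means the evidence fits."""
    problems = []
    if not evidence.commit_exists:
        problems.append("commit not found")
    if not (evidence.changed_paths & contract.approved_paths):
        problems.append("commit does not touch any approved path")
    if contract.required_test not in evidence.tests_run:
        problems.append("required test was not run")
    return problems

contract = EvidenceContract(frozenset({"src/auth.py"}), "test_login_expiry")
# A real commit and a real test name, attached to the wrong work.
unrelated = Evidence(True, frozenset({"README.md"}), frozenset({"test_readme"}))
print(check_evidence(contract, unrelated))
```

The contract is written before the run, so the executor cannot pick whichever evidence happens to exist afterwards.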
- 02
Conversation memory that vanishes.
lost continuity
Failure. A session can say it recorded everything and still leave the next session with nothing durable to read. The work becomes a story the model remembers until it does not.
Nogra catch. Nogra keeps continuity in project files: checkpoints, decisions, current tasks, briefs, receipts, and evidence. A new session reads the workspace instead of pretending the old conversation is still alive.
.nogra/ state files + session checkpoint
If future work depends on it, it belongs on disk.
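A small sketch of continuity on disk. The `.nogra/` directory comes from the mechanism above; the file name `checkpoint.json` and both function names are assumptions made for the example. A new session reads the file, and a missing file means there is nothing to claim.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

def write_checkpoint(workspace: Path, state: dict) -> Path:
    """Persist session state under .nogra/ so a later session can read it."""
    state_dir = workspace / ".nogra"
    state_dir.mkdir(parents=True, exist_ok=True)
    path = state_dir / "checkpoint.json"   # hypothetical file name
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path

def read_checkpoint(workspace: Path) -> Optional[dict]:
    """A new session starts here; no file means no continuity to claim."""
    path = workspace / ".nogra" / "checkpoint.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    before = read_checkpoint(workspace)   # nothing recorded yet
    write_checkpoint(workspace, {"current_task": "fix login expiry"})
    after = read_checkpoint(workspace)    # what the next session would see
```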
- 03
The worker grading its own work.
self-review
Failure. The same context that produced the work is bad at judging the work. It knows the intent, the excuses, the partial attempts, and the story it has already told. That makes self-review too easy to soften.
Nogra catch. Nogra makes verification a separate pass. The verifier gets the approved scope, the output, and the available evidence. It does not need the executor's reasoning to decide whether the result is ok, partial, or blocked.
executor/verifier separation
The model that wrote it does not sign off on it.
- 04
Shared context that rubber-stamps.
context contamination
Failure. Planner, executor, and verifier can look independent while sharing the same narrative. Once the verifier has inherited the executor's reasoning, it tends to complete the same story.
Nogra catch. Nogra dispatches against an approved brief, then verifies against the brief and evidence. The roles are separated by contract, not just by a new paragraph in the same conversation.
approved brief + fresh execution context + scoped verification
Independence is structural, not a tone.
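One way to make that separation structural, sketched under assumptions: the verifier's input type has no field for the executor's reasoning, so the narrative cannot leak across. `RunRecord`, `VerifierInput`, and `scope_for_verifier` are hypothetical names for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunRecord:
    brief: str
    output: str
    evidence: tuple[str, ...]
    executor_reasoning: str   # the story the executor told itself

@dataclass(frozen=True)
class VerifierInput:
    brief: str
    output: str
    evidence: tuple[str, ...]
    # Deliberately no reasoning field: the verifier cannot inherit the narrative.

def scope_for_verifier(run: RunRecord) -> VerifierInput:
    """Copy only the approved scope, the output, and the evidence."""
    return VerifierInput(run.brief, run.output, run.evidence)

run = RunRecord(
    brief="rename config loader",
    output="loader renamed in 3 files",
    evidence=("diff: src/loader.py",),
    executor_reasoning="tests were slow so I skipped them, should be fine",
)
verifier_input = scope_for_verifier(run)
```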
- 05
Intent quietly turning into permission.
approval drift
Failure. A user asks for an outcome. The model treats the outcome as standing approval to widen scope, chain more work, or keep going because it still feels aligned with the goal.
Nogra catch. Nogra treats intent as draft until the brief is approved. Dispatch starts after explicit GO. If the user wants direct work, direct work stays direct; Nogra does not convert a goal into permanent permission.
reviewed brief + explicit GO before dispatch
A goal is not a green light.
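A sketch of the gate, assuming a brief with two states and a literal `GO` reply as the approval signal. The class and function names are hypothetical, and the exact approval syntax is an assumption. Dispatch refuses anything that is still a draft.

```python
from dataclasses import dataclass
from enum import Enum

class BriefState(Enum):
    DRAFT = "draft"
    APPROVED = "approved"

@dataclass
class Brief:
    goal: str
    state: BriefState = BriefState.DRAFT

def approve(brief: Brief, user_reply: str) -> None:
    # Only a literal GO approves; "sounds good" or silence does not.
    if user_reply.strip() == "GO":
        brief.state = BriefState.APPROVED

def dispatch(brief: Brief) -> str:
    if brief.state is not BriefState.APPROVED:
        raise PermissionError("brief is still a draft; dispatch requires explicit GO")
    return f"dispatched: {brief.goal}"
```

Encouragement such as "sounds good, keep going" leaves the brief in `DRAFT`, so `dispatch` raises until the user sends `GO`.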
- 06
Substituted evidence.
source drift
Failure. When the primary source is missing, slow, blocked, or inconvenient, a model may replace it with a nearby source and keep moving. The answer can sound researched while the load-bearing evidence never arrived.
Nogra catch. Nogra makes the evidence requirement explicit. If the required evidence is missing, substituted, or contradictory, verification returns partial or blocked instead of turning the gap into a green claim.
stop criteria + evidence-aware verification
Missing evidence is a result.
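A sketch of evidence-aware verification with three outcomes. The verdict names follow the text above; the rule for when a gap is partial versus blocked is an assumption made for the example, as are the function name and the source labels.

```python
from enum import Enum

class Verdict(Enum):
    OK = "ok"
    PARTIAL = "partial"
    BLOCKED = "blocked"

def verify_sources(required: dict[str, str], actual: dict[str, str]) -> Verdict:
    """Both dicts map an evidence name to the source it must (or did) come from."""
    # A gap is evidence that never arrived or arrived from a different source.
    gaps = [name for name, source in required.items() if actual.get(name) != source]
    if not gaps:
        return Verdict.OK
    if len(gaps) == len(required):
        return Verdict.BLOCKED   # assumption: nothing load-bearing arrived
    return Verdict.PARTIAL

required = {"latency": "prod-metrics", "error_rate": "prod-logs"}
substituted = {"latency": "prod-metrics", "error_rate": "staging-logs"}
print(verify_sources(required, substituted))
```

A nearby source counts as a gap here, which is what keeps a substitution from turning into a green claim.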
- 07
One vague run for mixed work.
shape drift
Failure. A mixed job starts as one big instruction. Design, implementation, verification, cleanup, and release all get compressed into the same run, then nobody can tell which part is actually done.
Nogra catch. Nogra shapes complex work before the brief: topology, lane, role, evidence join, stop boundary, and next owner. Single-run work can stay simple; mixed work gets a plan first.
orchestration plan + lane and phase boundaries
Plan the shape before the run inherits it.
- 08
The artifact that ships isn't the code you fixed.
stale artifact
Failure. A fix lands in source and every test goes green — while the built artifact still contains the old code. Publish from a stale build directory and strangers install the bug, with a truthful-looking green report behind it.
Nogra catch. Nogra's verification targets what actually leaves the machine. Before a release gate opens, the built artifact itself is checked against the claim — not the source tree that stayed home, and not the report that summarized it.
artifact-level verification before the publish gate
Verify what ships, not what compiled.
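A sketch of checking the artifact rather than the source tree, assuming the artifact is a zip-style archive that ships the file verbatim. The `build` helper is a stand-in for a real build step and all names are hypothetical. Builds that transform files, such as compilation or minification, would need a different comparison.

```python
import hashlib
import io
import zipfile

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def artifact_matches_source(artifact: bytes, member: str, source: bytes) -> bool:
    """Compare the file packed inside the built archive with the fixed source."""
    with zipfile.ZipFile(io.BytesIO(artifact)) as archive:
        try:
            shipped = archive.read(member)
        except KeyError:
            return False   # the file is not in the artifact at all
    return sha256(shipped) == sha256(source)

def build(member: str, content: bytes) -> bytes:
    """Stand-in for a real build step: pack one file into a zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member, content)
    return buffer.getvalue()

old_code = b"TIMEOUT = 5\n"
fixed_code = b"TIMEOUT = 30\n"
stale_artifact = build("pkg/config.py", old_code)     # built before the fix
fresh_artifact = build("pkg/config.py", fixed_code)   # rebuilt after the fix
```

Tests against `fixed_code` pass either way; only the comparison against `stale_artifact` exposes the old code that would have shipped.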
- 09
Stopping mid-work looks identical to finishing.
false completion
Failure. A harness can mark a run completed when the agent simply ran out of budget. The last message is a sentence fragment from the middle of the work, but the wrapper status says done, and downstream steps happily build on it.
Nogra catch. Nogra never accepts wrapper status as completion evidence. A run counts only when a normal evidence-first report returns; anything else is partial, and the resume starts from independently verified tree state, not from the agent's last claim.
terminal report contract + resume from verified state
A stopped agent is not a finished agent.
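A sketch of a terminal report contract, assuming the report is recognized by two required section markers. The markers `EVIDENCE:` and `RESULT:`, the `RunResult` shape, and `classify` are all hypothetical. The wrapper status is carried along but never consulted.

```python
from dataclasses import dataclass

REQUIRED_SECTIONS = ("EVIDENCE:", "RESULT:")   # hypothetical report markers

@dataclass(frozen=True)
class RunResult:
    wrapper_status: str   # what the harness says
    last_message: str     # what the agent actually returned

def classify(run: RunResult) -> str:
    # wrapper_status is deliberately ignored: it cannot tell finished from stopped.
    has_report = all(section in run.last_message for section in REQUIRED_SECTIONS)
    return "complete" if has_report else "partial"

truncated = RunResult("completed", "Next I will update the")
reported = RunResult("completed", "EVIDENCE: tests/test_auth.py passed\nRESULT: ok")
```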
- 10
A script in a safe folder isn't a safe script.
exec loophole
Failure. Path-based trust invites a shortcut: everything inside the sandbox directory is contained, so auto-approve anything whose arguments point there. But executing a script is not contained by where the script lives — its runtime effects can touch anything.
Nogra catch. Nogra's auto-approval class covers write-type file operations only, from a fixed allowlist, with every target path resolved and symlink-normalized before matching. Executing anything — even a script that lives inside the sandbox — always asks.
write-op allowlist + exec fail-closed
Effects are not bounded by argument paths.
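A sketch of that decision rule in Python, under assumptions: the operation names in `WRITE_OPS` and the `decide` function are hypothetical. `Path.resolve()` collapses `..` segments and follows symlinks before the containment check, and anything outside the allowlist, including exec, falls through to asking.

```python
import tempfile
from pathlib import Path

WRITE_OPS = frozenset({"write_file", "append_file", "mkdir"})   # hypothetical allowlist

def decide(op: str, target: str, sandbox: Path) -> str:
    """Return "auto" only for allowlisted write ops whose resolved target is in the sandbox."""
    if op not in WRITE_OPS:
        return "ask"   # fail closed: exec and anything unknown always asks
    resolved = Path(target).resolve()   # collapses ".." and follows symlinks
    if resolved.is_relative_to(sandbox.resolve()):   # Python 3.9+
        return "auto"
    return "ask"

with tempfile.TemporaryDirectory() as tmp:
    sandbox = Path(tmp) / "sandbox"
    sandbox.mkdir()
    inside = decide("write_file", str(sandbox / "notes.txt"), sandbox)
    escaped = decide("write_file", str(sandbox / ".." / "outside.txt"), sandbox)
    script = decide("exec", str(sandbox / "run.sh"), sandbox)
```

Here `inside` is auto-approved, while `escaped` and `script` both ask: one because its resolved path leaves the sandbox, the other because execution is never in the auto-approval class.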
- 11
The plan that undercounts the patient.
scope undercount
Failure. A plan names the files it knows about. The feature is actually woven through twice as many, plus the test suite — and a run dispatched on the undercount collides with its own stop criteria halfway through the surgery.
Nogra catch. Gate-touching briefs get an independent anchor-check before dispatch: the footprint is raw-verified against the actual tree, and contradictions between scope and success criteria are caught while they are still text. Revisions burn words, not runs.
pre-dispatch anchor-check + raw-verified footprint
Cheaper to revise the brief than to bury the run.
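A sketch of a footprint check before dispatch. The source does not spell out how the footprint is raw-verified, so a plain text search for an anchor string stands in for it here; the function names and the sample files are hypothetical.

```python
import tempfile
from pathlib import Path

def actual_footprint(tree: Path, anchor: str) -> set[str]:
    """Every Python file under tree whose text mentions the anchor string."""
    hits = set()
    for path in tree.rglob("*.py"):
        if anchor in path.read_text(encoding="utf-8"):
            hits.add(path.relative_to(tree).as_posix())
    return hits

def undercount(declared: set[str], tree: Path, anchor: str) -> set[str]:
    """Files the feature touches that the brief did not name."""
    return actual_footprint(tree, anchor) - declared

with tempfile.TemporaryDirectory() as tmp:
    tree = Path(tmp)
    (tree / "tests").mkdir()
    (tree / "session.py").write_text("def refresh_token(): ...\n", encoding="utf-8")
    (tree / "api.py").write_text("from session import refresh_token\n", encoding="utf-8")
    (tree / "tests" / "test_session.py").write_text("refresh_token()\n", encoding="utf-8")
    (tree / "unrelated.py").write_text("x = 1\n", encoding="utf-8")
    # The brief named one file; the tree says the feature lives in three.
    missed = undercount({"session.py"}, tree, "refresh_token")
```

A non-empty `missed` set is a reason to revise the brief before any run is dispatched.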
- 12
Passing by accident.
false confidence
Failure. A build can pass on two platforms by pure environmental coincidence while the bug waits on a third; green that cannot fail proves nothing. In one case, a packaging bug (a path-depth assumption baked into a frozen binary) stayed invisible on every platform it was built and tested on, because those environments happened to satisfy the assumption. The first time a different platform ran the same check, it crashed instantly.
Nogra catch. Nogra's evidence contracts demand checks that can actually fail: per-surface verification and falsifiable red-to-green proof, not one green light standing in for every platform. When one platform caught the bug, the fix targeted the whole class, the shared assumption, not just the single instance that happened to trip.
per-surface verification + falsifiable evidence contract
A test that cannot fail proves nothing.
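A sketch of per-surface, red-to-green evidence, assuming each platform reports whether its check failed before the fix and passed after it. `SurfaceProof` and `unproven_surfaces` are hypothetical names. A surface with no proof, or with a green that was never red, stays unproven.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SurfaceProof:
    failed_before_fix: bool   # the check went red on the broken build
    passed_after_fix: bool    # the same check went green on the fixed build

def unproven_surfaces(required: set[str], proofs: dict[str, SurfaceProof]) -> set[str]:
    """Surfaces with no proof, or with a green that was never shown able to go red."""
    unproven = set()
    for surface in required:
        proof = proofs.get(surface)
        if proof is None or not (proof.failed_before_fix and proof.passed_after_fix):
            unproven.add(surface)
    return unproven

required = {"linux", "macos", "windows"}
proofs = {
    "linux": SurfaceProof(failed_before_fix=False, passed_after_fix=True),  # green by coincidence
    "macos": SurfaceProof(failed_before_fix=True, passed_after_fix=True),   # real red-to-green
}
remaining = unproven_surfaces(required, proofs)
```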
{ Contribute a catch }
Have a failure mode Nogra missed?
File it. It belongs here only if there is a real mechanism that reduces the failure next time. GitHub Issues is the front door.
Catches do not get added because they sound plausible. They get added because someone hit them, and the fix became a file, a contract, or a workflow rule.