John Regehr, March 26 2026.
By this point, most of us who have experimented with Claude, Codex, and other LLM-based coding agents have noticed that the current generation of these can sometimes do good work, at superhuman speed, when given some kinds of highly constrained tasks. For example, coding agents can eat a large, tricky API—such as the one for manipulating LLVM IR—for lunch, and they’ve also given me a number of fixes to non-trivial bugs in real software that could be applied as-is. On the other hand, these same tools frequently fall over in baffling ways, emitting tasteless or nonsensical code.
When an LLM has the option of doing something poorly, we simply can’t trust it to make the right choices. The solution, then, is clear: we need to take away the freedom to do the job badly. The software tools that can help us accomplish this are executable oracles. The simplest executable oracle is a test case—but test cases, even when there are a lot of them, are weak. Consider Claude’s C Compiler, which I wrote about earlier: even after passing GCC’s “torture test suite” and more, it still had 34 nasty miscompilation bugs that were within easy reach. But it wouldn’t have had those bugs if Csmith and YARPGen had been included in the testing loop that was used to bring up this compiler. These tools are better executable oracles because each of them implicitly encodes a vast collection of test cases.
This piece is about collapsing as many failure-producing degrees of freedom as possible. Zero degrees of freedom is aspirational, but a good aspiration.
Besides the miscompilations, Claude’s C Compiler also fell over in terms of quality of generated code. The compiler contains a somewhat elaborate (and plausible-looking) set of optimization passes, but they appear to make very little difference in the quality of its output. But what if the human overseeing the creation of this compiler had included an executable oracle for code quality in its testing loop? Well, I’m 100% speculating here, but my educated guess is that Claude would have been able to incorporate this feedback, and would have done a significantly better job optimizing the generated code. What would this oracle actually look like? I probably would have kept it simple: perhaps a count of the number of instructions executed when running the compiler’s output. I’d also have given it a baseline such as gcc -O0, so that the LLM would know where there was the most room for improvement.
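To make this concrete, here is a minimal sketch of such an oracle in Python. It assumes a Linux machine with perf installed and uses perf stat’s machine-readable CSV mode; the function names and structure are mine, not anything CCC actually used.

```python
import subprocess

def parse_perf_csv(stderr_text):
    """Parse the CSV output of `perf stat -x,` (which goes to stderr)
    and return the instruction count as an int. Fields are
    value,unit,event,... and the event may carry a modifier suffix
    such as "instructions:u"."""
    for line in stderr_text.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2].startswith("instructions"):
            return int(fields[0])
    raise ValueError("no instruction count found in perf output")

def count_instructions(cmd):
    """Run cmd under `perf stat -e instructions` and return the
    dynamic instruction count."""
    result = subprocess.run(
        ["perf", "stat", "-e", "instructions", "-x", ","] + cmd,
        capture_output=True, text=True,
    )
    return parse_perf_csv(result.stderr)

def quality_ratio(candidate_count, baseline_count):
    """A ratio above 1.0 means the candidate compiler's output executes
    more instructions than the gcc -O0 baseline, i.e. there is clear
    room for the LLM to improve."""
    return candidate_count / baseline_count
```

The LLM would then be told to drive quality_ratio down, with the baseline keeping the target honest.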
Summary: The LLM was given a degree of freedom with respect to the quality of CCC’s output, and consequently it did a poor job there.
My group (in collaboration with other folks) is working towards automated synthesis of dataflow transfer functions, such as those used by LLVM’s “known bits” analysis. We wrote a paper about this, where we used randomized synthesis techniques, no LLMs. Recently I asked Codex to start writing transfer functions. By itself, it’s not bad at this, but not great. However, given access to our command-line tools for evaluating the precision and verifying the soundness of a transfer function, Codex produced results that are better than anything I’ve seen either in a real compiler like LLVM, or in our own randomized synthesis results. The remaining degree of freedom that I left Codex—code size—allowed it to write pretty large transfer functions that explore some pretty deep case splits on the input structure, but capping the size of generated code is the easiest thing.
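For readers who haven’t seen known-bits facts before, here is a self-contained toy version of this setup: the standard known-bits transfer function for bitwise AND at a 4-bit width, pinched between an exhaustive soundness oracle and a precision oracle. This is an illustrative sketch, not our group’s actual tooling.

```python
from itertools import product

WIDTH = 4
MASK = (1 << WIDTH) - 1

def concretize(zeros, ones):
    """All concrete values consistent with a known-bits fact:
    bits in `ones` must be set, bits in `zeros` must be clear."""
    return [x for x in range(1 << WIDTH)
            if (x & ones) == ones and (x & zeros) == 0]

def and_transfer(az, ao, bz, bo):
    """The usual known-bits rule for AND: a result bit is known-one
    only if known-one in both inputs, and known-zero if known-zero
    in either input."""
    return (az | bz) & MASK, ao & bo

def sound(transfer):
    """Soundness oracle: every concrete result must be consistent
    with the abstract result, for every pair of abstract inputs."""
    for az, ao, bz, bo in product(range(1 << WIDTH), repeat=4):
        if az & ao or bz & bo:  # skip contradictory facts
            continue
        rz, ro = transfer(az, ao, bz, bo)
        for x in concretize(az, ao):
            for y in concretize(bz, bo):
                r = x & y
                if (r & ro) != ro or (r & rz) != 0:
                    return False
    return True

def precision(transfer):
    """Precision oracle: total number of output bits known across
    all consistent abstract input pairs (higher is better)."""
    total = 0
    for az, ao, bz, bo in product(range(1 << WIDTH), repeat=4):
        if az & ao or bz & bo:
            continue
        rz, ro = transfer(az, ao, bz, bo)
        total += bin(rz | ro).count("1")
    return total
```

Note how either oracle alone is gameable: the transfer function that claims to know nothing is trivially sound, and an unsound function can claim perfect precision. Together they leave little room to cheat.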
Summary: By pinching the LLM’s results between opposing executable oracles for soundness and precision, synthesis of dataflow transfer functions worked really well.
JustHTML is a new parser for HTML5 written in Python. Like Claude’s C Compiler, JustHTML was effectively tested into existence using a large, existing test suite. Judging from the author’s blog post, the coding agent painted itself into a corner by creating a poor software architecture. This is a difficult degree of freedom to put an executable oracle on, and the author didn’t try to do that; instead he manually walked the LLM through some refactoring tasks, finally arriving at an architecture suitable for further development. Later in the post, the author talks about adding executable oracles for correctness (using a fuzzer) and for performance.
Summary: Creative use of existing and new executable oracles led to an impressive overall result. Targeted manual intervention regarding software architecture also appears to have been a key factor.
When I teach software testing to CS students, one aspect that I emphasize is that testing is a creative activity, not a mindless pinning down of input->output mappings. When I look at the best software testing efforts out there, there’s invariably something creative and interesting hiding inside. I feel like a lot of projects leave easy testing wins sitting on the floor because nobody has carefully thought about what test oracles might be used. Finding executable oracles for LLMs feels the same to me: with a little effort and critical thinking, we can often find a programmatic way to pin down some degree of freedom that would otherwise be available to the LLM to screw up.
Next let’s look at some specific examples.
Correctness oracles abound. We have test suites, fuzzers and property-based testers, runtime sanitizers, static analyzers, linters, strong type systems, and formal verifiers. Any time such a tool can be made available to the LLM, we’ll reap the benefits in terms of not dealing with bugs the hard way, later on.
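As a tiny illustration of a correctness oracle one step up from fixed test cases, here is a differential random tester: it compares a candidate implementation against a trusted reference and, importantly, reports a concise counterexample rather than a bare failure. The names and the buggy example are mine, for illustration.

```python
import random

def check_against_oracle(candidate, oracle, gen_input, trials=1000, seed=0):
    """Differential-testing oracle: run the candidate and a trusted
    reference on random inputs; return the first mismatching input,
    or None if all trials agree."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        if candidate(list(x)) != oracle(list(x)):
            return x  # a concise counterexample, not just "failed"
    return None

# Example: a buggy "sort" that silently drops duplicates gets caught.
buggy_sort = lambda xs: sorted(set(xs))
gen = lambda rng: [rng.randrange(5) for _ in range(rng.randrange(8))]
counterexample = check_against_oracle(buggy_sort, sorted, gen)
```

Each such oracle implicitly encodes as many test cases as you have patience to run.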
Performance oracles are also abundant. We have compiler-inserted instrumentation, runtime instrumentation, heap profilers, hardware performance counters, performance regression test suites, and more. Again, any and all of these tools should be made available to the LLM, although perhaps not right at the beginning of a project. It may work better to have a feature+correctness LLM loop that runs for a while, and then optimize performance in a later phase.
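A minimal performance oracle can be nothing more than a regression gate. This sketch (function name and slack factor are mine) times a workload and fails if it exceeds a recorded baseline by more than a tolerance; taking the best of several runs reduces noise from a loaded machine.

```python
import time

def perf_gate(fn, baseline_seconds, slack=1.20, repeats=5):
    """Performance regression oracle: run fn `repeats` times and fail
    if even the best timing exceeds baseline * slack.
    Returns (ok, best_time_in_seconds)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best <= baseline_seconds * slack, best
```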
LLMs love to write weirdly excessive defensive code, and even when they don’t do this, they don’t seem to notice and remove code that becomes dead as they work. The obvious automated oracle is code coverage tools, which we should be using anyhow, to detect deficiencies in our test suites. On the other hand, simply asking an LLM to improve code coverage is an absolutely classic example of How to Misuse Code Coverage.
Every time we give the LLM a goal to optimize, we need to be thinking about preventing it from gaming the system. I’ve seen LLMs omit inconvenient benchmarks, rig the code coverage harness, and hard-code specific test cases into program logic. Some of this can only be uncovered by keeping an eye on the LLM while it’s working, and by inspecting its work afterwards.
In some cases we can make our metrics inherently robust. For example, earlier in this piece I mentioned synthesizing dataflow transfer functions. Individually, any of the goals I gave the LLM can be satisfied in degenerate ways: it’s easier to create a precise transfer function if it doesn’t have to be sound, and an empty transfer function is both efficient and sound. However, taken together, these oracles constrain the set of solutions in such a way that gaming the situation is difficult or impossible.
Software architecture, modularity, and maintainability: Picking an appropriate architecture is critical as software scales up, and I don’t know of a good executable oracle. This aspect of software is likely best addressed through the initial prompt and then through human-directed refactoring, as in the JustHTML example above. Of course, we do have metrics for modularity (from the old-school software engineering literature) but I don’t have any real optimism that this would be a useful kind of feedback to a coding agent.
Duplication and unnecessary complexity: A few weeks ago I asked a coding agent to write some LLVM instrumentation for me; in addition to the code I wanted, it generated an elaborate concurrency control structure. Since LLVM is single-threaded this was totally unnecessary. There’s no way to easily automate this kind of judgment call, nor do I know of a good way to automatically notice situations where the LLM writes a number of nearly-duplicate functions instead of properly abstracting, like a skillful human would do.
GUI polish: Creating a polished interface is doubly difficult for coding agents, which are not great at understanding images; moreover, polish itself resists automated measurement. Manual oversight and intervention seems like the only general way forward, although in specific instances it would seem possible to instruct the LLM to mirror some existing GUI.
Security: Fuzzers are obvious and amazingly useful automated oracles, but of course there are plenty of defects—misuse of a cryptography API, perhaps—that they might be unable to catch. I don’t imagine that it’s going to be a good idea to use LLM-generated code in security-critical situations, in the near future at least.
An ideal executable oracle would be fast, deterministic, local, sandbox-compatible, and have easy-to-interpret output. The more specific a tool can be, the better. For example, a compiler warning that mentions the offending columns as well as the offending line will be more actionable. A solver-based verification tool should produce a concise counterexample, not just a statement that the property could not be verified. It’s hard to overstate the importance of this: the devil really is in the details.
The command-line tools that you provide to the LLM should have a queryable interface. Here by “queryable” I simply mean that tool --help provides sufficient information for using the tool. If a tool has a more opaque interface, then you have to document its use in AGENTS.md or whatever. If a CLI tool requires particular configuration options, document this. Also, the tool’s output needs to be understandable to the model: if it is too voluminous, or is in the form of image files, then things will be more difficult.
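In Python, argparse gives you a serviceable queryable interface almost for free, provided every flag carries a real help string. Here is a sketch; the tool name and flags are hypothetical, loosely modeled on the transfer-function checker discussed earlier.

```python
import argparse

def build_parser():
    """A CLI skeleton whose --help output is sufficient to drive the
    tool, including defaults and the legal values of each option."""
    p = argparse.ArgumentParser(
        prog="tf-check",
        description="Verify soundness and measure precision of a "
                    "dataflow transfer function.",
    )
    p.add_argument("file", help="path to the transfer function source")
    p.add_argument("--width", type=int, default=8,
                   help="bitwidth to check exhaustively (default: 8)")
    p.add_argument("--mode", choices=["sound", "precision"],
                   default="sound",
                   help="which oracle to run (default: sound)")
    return p

args = build_parser().parse_args(["tf.py", "--width", "4"])
```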
LLMs are particularly poor at dealing with long-running tools: they love to kill tools too early, and also to wait a long time for tools that will never finish. Provide specific advice about this in your documentation.
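One way to take this decision away from the agent entirely is to wrap each tool invocation with a hard wall-clock limit, so the timeout policy lives in your harness rather than in the LLM’s judgment. A minimal sketch (names are mine):

```python
import subprocess

def run_tool(cmd, timeout_seconds):
    """Run a tool with a hard wall-clock limit: the agent can neither
    kill it early nor wait forever. Returns (status, stdout)."""
    try:
        r = subprocess.run(cmd, capture_output=True, text=True,
                           timeout=timeout_seconds)
        return ("ok" if r.returncode == 0 else "failed"), r.stdout
    except subprocess.TimeoutExpired:
        return "timeout", ""

status, out = run_tool(["echo", "hello"], timeout_seconds=10)
```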
I’ve found that when writing a playbook for how an LLM should use tools, providing the LLM with multiple options usually doesn’t work. The playbook should codify a linear sequence of steps, and usually it helps to add some language to the effect of “these steps are mandatory, deviations are not allowed.” Even so, LLMs are hilariously lazy and sooner or later, they’ll seek to shortcut your instructions.
Be clear with the LLM about which requirements are hard and which are soft. For example, failing test cases are a showstopper, but software that’s a little bit too slow may not be as big of a deal. In general, the playbook that you write should explain how to resolve conflicts where different executable oracles are giving different feedback. This could also be automated, for example by giving the LLM a formula for balancing throughput against code size or latency.
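Such a conflict-resolution formula can be as simple as a scoring function in which hard requirements are absolute and soft ones trade off against each other. The weights here are illustrative, not from any real project:

```python
def score(tests_pass, throughput, code_size, latency_ms):
    """Combine oracle outputs into one number for the LLM to improve.
    Failing tests are a hard constraint that no amount of speed can
    buy back; the soft goals trade off via illustrative weights."""
    if not tests_pass:
        return float("-inf")  # hard requirement: never trade away
    return throughput - 0.01 * code_size - 0.1 * latency_ms
```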
LLMs are also hilariously industrious. If a tool isn’t where you said it would be, or can’t be run from inside the sandbox, or doesn’t work due to a shared library problem or whatever, then the coding agent is likely to do something incredibly silly, like rewriting the tool from scratch or using a different (and often entirely inappropriate) tool in hopes of achieving the same effect.
Basically, there are a lot of little things that can go wrong when the coding agent uses your executable oracles. The only way to solve this, as far as I know, is to just watch what it’s doing. When it strays from the easy path you have created for it, you need to fix something on your machine, issue corrective advice (and make sure this advice gets added to a markdown file; you can’t just add it to the LLM’s current context), and so on. You should only try to initiate a long LLM run after having seen it successfully use the tools several times in a row.
Our goal should be to give an LLM coding agent zero degrees of freedom. This is aspirational at present, but it’s where we should be trying to go.
Given any uncontrolled degree of freedom—some aspect of implementing a piece of software that is important for its use case—we cannot expect LLM coding agents to reliably do a good job. Strong, automated oracles are necessary forcing functions for keeping LLMs in line. An important corollary is that since there are very important aspects of code that we can’t easily measure, such as security, modularity, maintainability, and readability, code written by the current generation of LLM coding agents is generally not suitable for use cases where those things are important.
I’m a tool builder: I’ve spent the bulk of my career building and supervising the building of tools for software developers. I adore the tool-centric view of software development. However, it’s only recently that I recognized how incredibly powerful tool-centric software engineering would become in the era of coding agents. I’ve also had fun realizing that almost anything that makes a tool more useful/usable by LLM agents also makes it more useful/usable for humans, and vice versa.