I Fuzzed, and Vibe Fixed, the Vibed C Compiler

John Regehr, March 2 2026.

Update from March 17 2026: In the first version of this piece I made an embarrassing mistake: I trusted a vibe-coded script that ran Csmith, and then reported that Csmith found no miscompilations in the version of CCC that had already been fixed based on bugs detected by YARPGen. There were, in fact, many miscompilations detected by Csmith. Furthermore, after fixing the Csmith-triggered bugs, I have fuzzed CCC using YARPGen v2, which found several more bugs. The piece is now revised to reflect these new findings.


I didn’t take much of an interest in CCC, “Claude’s C Compiler,” at first. However, after seeing a hint of what happens when you fuzz it using Csmith and YARPGen, I got curious about how wrong this compiler actually is. The results at the GitHub issue (14 miscompiles out of 101 Csmith programs and 5 miscompiles out of 101 YARPGen programs) seemed pretty bad, but consistent with what I’d heard about this compiler, which is that it occupies an odd bit of territory: it’s more sophisticated than something we could get out of a compiler course project, but it doesn’t rate at all on the scale where we consider production-grade artifacts like GCC and Clang/LLVM.

As a bit of background, Csmith and YARPGen are randomized compiler testing tools produced by my research group. They’re each responsible for detecting many hundreds of compiler defects, the most interesting of which are miscompilations, where a compiler silently produces output whose behavior diverges from the set of behaviors allowed by the relevant programming language standard, for some input program. YARPGen has (in effect) a built-in interpreter that allows it to predict the value that should be printed by a program that it generates. Csmith has no such functionality; for it to detect a miscompilation we use differential testing where we compare the behavior of executables generated by two compilers, or two modes of the same compiler (an optimizing and a non-optimizing compile, for example). Although I can’t prove it, I like to think that these tools (and others like them) have helped the production compilers that developers use every day become more robust and solid.

I connected YARPGen version 1 and CCC in a testing loop. As expected, CCC miscompiles a lot of inputs. Each time I found a miscompile I reduced it using C-Vise (which is mostly C-Reduce, but with the core rewritten in Python instead of Perl). It’s not really possible to deal with miscompile bugs triggered by large test cases; test case reduction shrinks miscompile triggers down to (typically) a few lines. Here is an example of a program that CCC miscompiled:

int printf(const char *, ...);
unsigned long long seed;
unsigned a = 3357492005;
int b;
void hash(long long *seed, int v) { *seed ^= v; }
int main() {
  b = a / (long)3;
  hash(&seed, b);
  printf("%llu\n", seed);
}

Next, not being even a little bit of a Rust programmer, I asked Codex to fix each bug and also, of course, to add a regression test. I picked Codex (“gpt-5.3-codex high,” to be exact) not because I have any particular affinity for it, but rather because it’s what my employer currently pays for, for whatever reason. Once a fix appeared to succeed, I went back and ran YARPGen some more. After 11 bug fixes, an overnight run of YARPGen (around 200,000 individual tests) could not get CCC to miscompile.

Here are the commits fixing the 11 bugs. Bug summaries are Codex’s.

Next, I tested CCC using Csmith, which revealed 18 more bugs before I was able to compile and run 200,000 test cases without seeing a miscompilation. Here are the commits fixing these. Again, both the fixes and the explanations of those fixes are by Codex.

Finally, I turned to YARPGen v2, which found 5 additional bugs:

So, what have we learned here? For one thing, the bug fixing by Codex seems pretty impressive: I gave it zero guidance other than the reduced test cases and good reference compilers (GCC and LLVM). I had half expected Codex to patch CCC in clumsy ways that would lead towards chaos instead of correctness, but that wasn’t the case at all. Codex did go badly wrong in one instance where it tried to fix a poorly-reduced input that contained undefined behavior, but it was easy enough to notice this and discard its work. Another time it decided to reformat every file in CCC, for no apparent reason. But in general, it just fixed the bugs. Are its fixes good ones in a sense other than “they seem to work?” I don’t know! I don’t care to try to understand a vibe coded C compiler.

Another thing we learned here was that, with respect to the subset of C that Csmith and YARPGen generate, CCC was within a modest edit distance (the 34 commits above) of giving a credible impression of being correct. Was that a foregone conclusion? Absolutely not. It could easily have been the case that the vibe-coded compiler was irrevocably specialized for its initial testing environment, in such a way that it was architecturally incapable of compiling C code in the more general case.

The most time-consuming part of the process here, initially, was waiting for C-Vise (which, again, is mostly C-Reduce). This was especially annoying because I specifically wrote C-Reduce to be slow, but good. But I do hate waiting for it. Later in the process, waiting for a random test case to trigger a bug took the most time. It pains me to say that waiting for Codex (which I would guess used more raw compute than any of the rest of this) was never the bottleneck. I didn’t measure how long Codex took, but most of the time it fixed a bug within 15-30 minutes.

Finally, let’s talk about the character of the bugs that I found here. They are mostly the kind of mistake that one would make if one was implementing a C compiler without reading the standard closely and carefully. They’re surface-level bugs that you would simply not find in a serious compiler. I don’t think we ever found a bug like these in GCC. We did find maybe one or two like this in Clang, but this was because we started testing Clang very early in its history: the LLVM community was bringing it up at the same time we developed Csmith. In contrast with these surface-level bugs, the vast majority of the bugs discovered by random testers in production-grade compilers are in their optimizers. There is a vast semantic surface area over which defects can occur in an aggressive optimizer.

What’s the verdict about Claude’s C Compiler? At some level it is impressive. I think there are a whole lot of programmers out there who couldn’t create a compiler this capable in six months. For example, George Necula had this to say about writing a C frontend:

When I (George) started to write CIL I thought it was going to take two weeks. Exactly a year has passed since then and I am still fixing bugs in it. This gross underestimate was due to the fact that I thought parsing and making sense of C is simple. You probably think the same. What I did not expect was how many dark corners this language has, especially if you want to parse real-world programs such as those written for GCC or if you are more ambitious and you want to parse the Linux or Windows NT sources (both of these were written without any respect for the standard and with the expectation that compilers will be changed to accommodate the program).

However, on the other hand, CCC doesn’t optimize very much and it contained these 34 fairly basic bugs in interpreting C code. From the point of view of people working with production compilers, CCC isn’t even a useful prototype. (If you’d like a more nuanced take, Chris’s piece from a couple weeks ago is good.) Moreover, there are a lot of C compilers available in Claude’s training corpus. I don’t know that any of them are written in Rust, but the modern LLMs seem really good at translating concepts between programming languages.

Is my version of CCC correct? There’s no chance of this. We created tools like Csmith and YARPGen to stress test the optimizers of already-tested compilers, but they are in no way a replacement for a really serious C conformance test suite.

If anyone feels like continuing to fuzz my fork of CCC, it’s here; make sure to get the yarpgen branch.