I Fuzzed, and Vibe Fixed, the Vibed C Compiler

John Regehr, March 2 2026.

I didn’t take much of an interest in CCC, “Claude’s C Compiler,” at first. However, after seeing a hint of what happens when you fuzz it using Csmith and YARPGen, I got curious about how wrong this compiler actually is. The results at the GitHub issue (14 miscompiles out of 101 Csmith programs and 5 miscompiles out of 101 YARPGen programs) seemed pretty bad, but consistent with what I’d heard about this compiler: it occupies an odd bit of territory where it’s more sophisticated than something we could get out of a compiler course project, yet it doesn’t even register on the scale we use for production-grade artifacts like GCC and Clang/LLVM.

As a bit of background, Csmith and YARPGen are randomized compiler testing tools produced by my research group. They’re each responsible for detecting many hundreds of compiler defects, the most interesting of which are miscompilations, where a compiler silently produces output whose behavior diverges from the set of behaviors allowed by the relevant programming language standard, for some input program. YARPGen has (in effect) a built-in interpreter that allows it to predict the value that should be printed by a program that it generates. Csmith has no such functionality; for it to detect a miscompilation we use differential testing where we compare the behavior of executables generated by two compilers, or two modes of the same compiler (an optimizing and a non-optimizing compile, for example). Although I can’t prove it, I like to think that these tools (and others like them) have helped the production compilers that developers use every day become more robust and solid.

I connected YARPGen version 1 and CCC in a testing loop. As expected, CCC miscompiles a lot of inputs. Each time I found a miscompile I reduced it using C-Vise (which is mostly C-Reduce, but with the core rewritten in Python instead of Perl). It’s not really possible to deal with miscompile bugs triggered by large test cases; test case reduction shrinks miscompile triggers down to (typically) a few lines. Here is an example of a program that CCC miscompiled:

int printf(const char *, ...);
unsigned long long seed;
unsigned a = 3357492005;
int b;
void hash(long long *seed, int v) { *seed ^= v; }
int main() {
  b = a / (long)3;
  hash(&seed, b);
  printf("%llu\n", seed);
}

Next, not being even a little bit of a Rust programmer, I asked Codex to fix each bug and also, of course, to add a regression test. I picked Codex (“gpt-5.3-codex high,” to be exact) not because I have any particular affinity for it, but rather because it’s what my employer currently pays for, for whatever reason. Once it appeared to succeed, I went back and ran YARPGen some more. After 11 bug fixes, an overnight run of YARPGen (around 200,000 individual tests) could not get CCC to miscompile. So I moved on to Csmith, and it turns out that an overnight fuzzing run using Csmith (again, around 200,000 tests) could not get my fixed version of CCC to miscompile, either.

Here are the commits fixing the 11 bugs. Bug summaries are Codex’s.

Some of these mix in other changes, and none of them have useful commit messages—I didn’t start out intending to share this little project with anyone.

So, what have we learned here? For one thing, the bug fixing by Codex seems pretty impressive: I gave it zero guidance other than the reduced test cases and good reference compilers (GCC and LLVM). I had half expected Codex to patch CCC in clumsy ways that would lead towards chaos instead of correctness, but that wasn’t the case at all. Codex did go badly wrong in one instance where it tried to fix a poorly-reduced input that contained undefined behavior, but it was easy enough to notice this and discard its work. Are its fixes good ones in a sense other than “they seem to work?” I don’t know! I don’t care to try to understand a vibe coded C compiler.

Another thing we learned here was that, with respect to the subset of C that Csmith and YARPGen generate, CCC was within some reasonable edit distance (the 11 commits above) of giving a reasonable impression of being correct. Was that a foregone conclusion? Absolutely not. It could easily have been the case that the vibe coded compiler was irrevocably specialized for its initial testing environment, in such a way that it was architecturally incapable of compiling C code in the more general case.

Finally, let’s talk about the character of the bugs that I found using YARPGen. They are mostly the kind of mistake that one would make if one were implementing a C compiler without reading the standard closely and carefully. They’re surface-level bugs that you would simply not find in a serious compiler. I don’t think we ever found a bug like this in GCC. We did find maybe one or two like this in Clang, but this was because we started testing Clang very early in its history: the LLVM community was bringing it up at the same time we developed Csmith. In contrast with these surface-level bugs, the vast majority of the bugs discovered by random testers in production-grade compilers are in their optimizers. There is a vast semantic surface area over which defects can occur in an aggressive optimizer.

What’s the verdict about Claude’s C Compiler? At some level it is impressive. I think there are a whole lot of programmers out there who couldn’t create a compiler this capable in six months. For example, George Necula had this to say about writing a C frontend:

When I (George) started to write CIL I thought it was going to take two weeks. Exactly a year has passed since then and I am still fixing bugs in it. This gross underestimate was due to the fact that I thought parsing and making sense of C is simple. You probably think the same. What I did not expect was how many dark corners this language has, especially if you want to parse real-world programs such as those written for GCC or if you are more ambitious and you want to parse the Linux or Windows NT sources (both of these were written without any respect for the standard and with the expectation that compilers will be changed to accommodate the program).

On the other hand, CCC doesn’t optimize, it contained these 11 fairly basic bugs in interpreting C code, and it undoubtedly contains more of them that are outside of Csmith/YARPGen’s scope. From the point of view of people working with production compilers, CCC isn’t even a useful prototype. (If you’d like a more nuanced take, Chris’s piece from a couple weeks ago is good.) Moreover, there are a lot of C compilers available in Codex’s training corpus. I don’t know that any of them are written in Rust, but the modern LLMs seem really good at translating concepts between programming languages.

If anyone feels like continuing to fuzz my fork of CCC, it’s here; make sure to get the yarpgen branch.