I Fuzzed, and Vibe Fixed, the Vibed C Compiler

John Regehr, March 2 2026.

Update from March 17 2026: In the first version of this piece I made an embarrassing mistake, which was trusting a vibe-coded script that ran Csmith, and then reported that there were no miscompilations due to Csmith in the version of CCC that was already fixed based on bugs detected by YARPGen. There were, in fact, many miscompilations detected by Csmith. Furthermore, after fixing the Csmith-triggered bugs, I have fuzzed CCC using YARPGen v2, which found several more bugs. The piece is now revised to reflect these new findings.

I didn’t take much of an interest in CCC, “Claude’s C Compiler,” at first. However, after seeing a hint of what happens when you fuzz it using Csmith and YARPGen, I got curious about how wrong this compiler actually is. The results at the GitHub issue (14 miscompiles out of 101 Csmith programs and 5 miscompiles out of 101 YARPGen programs) seemed pretty bad, but consistent with what I’d heard about this compiler, which is that it occupies an odd bit of territory where it’s more sophisticated than something we could get out of a compiler course project, but also that it doesn’t even rate at all on the scale where we consider production-grade artifacts like GCC and Clang/LLVM.

As a bit of background, Csmith and YARPGen are randomized compiler testing tools produced by my research group. They’re each responsible for detecting many hundreds of compiler defects, the most interesting of which are miscompilations, where a compiler silently produces output whose behavior diverges from the set of behaviors allowed by the relevant programming language standard, for some input program. YARPGen has (in effect) a built-in interpreter that allows it to predict the value that should be printed by a program that it generates. Csmith has no such functionality; for it to detect a miscompilation we use differential testing where we compare the behavior of executables generated by two compilers, or two modes of the same compiler (an optimizing and a non-optimizing compile, for example). Although I can’t prove it, I like to think that these tools (and others like them) have helped the production compilers that developers use every day become more robust and solid.

I connected YARPGen version 1 and CCC in a testing loop. As expected, CCC miscompiles a lot of inputs. Each time I found a miscompile I reduced it using C-Vise (which is mostly C-Reduce, but with the core rewritten in Python instead of Perl). It’s not really possible to deal with miscompile bugs triggered by large test cases; test case reduction shrinks miscompile triggers down to (typically) a few lines. Here is an example of a program that CCC miscompiled:

int printf(const char *, ...);
unsigned long long seed;
unsigned a = 3357492005;
int b;
void hash(long long *seed, int v) { *seed ^= v; }
int main() {
  b = a / (long)3;
  hash(&seed, b);
  printf("%llu\n", seed);
}

Next, not being even a little bit of a Rust programmer, I asked Codex to fix each bug and also, of course, to add a regression test. I picked Codex (“gpt-5.3-codex high,” to be exact) not because I have any particular affinity for it, but rather because it’s what my employer currently pays for, for whatever reason. Once it appeared to succeed, I went back and ran YARPGen some more. After 11 bug fixes, an overnight run of YARPGen (around 200,000 individual tests) could not get CCC to miscompile.

Here are the commits fixing the 11 bugs. Bug summaries are Codex’s.

4d9913e7 The narrow pass could pick a 32-bit type with mismatched signedness when rewriting a 64-bit op, changing whether later widening sign-extends or zero-extends. That can change the computed value even when low bits match. The fix only narrows when signedness semantics are preserved.
32fe7f5e Constant folding for unary minus treated unsigned results like signed negation instead of modulo-2^N arithmetic. Expressions such as negation into unsigned types could fold to the wrong constant. The fix adds typed negation that honors result width and unsigned wrapping.
abeb8fbd X86 compare-branch fusion could remove setcc materialization even after %rax was clobbered/reloaded from an unrelated stack slot. That made the fused branch test the wrong value and changed control flow. The fix tightens matching so fusion only happens for safe spill/reload patterns.
00fbea89 Comparison narrowing could drop sign-changing casts and could narrow sub-int comparisons below int width, violating C promotion semantics. This could flip signed/unsigned compare behavior and produce wrong branch decisions. The fix requires type-consistent sources and blocks unsafe sub-int narrowing.
90905856 Shift narrowing allowed rewrites even when shift counts were out of range for the narrowed width. That can change behavior (for example, narrowing << 47 from 64-bit to 32-bit semantics). The fix requires constant in-range counts before narrowing shift operations.
c01bac0f Expression lowering used storage-oriented type lookup for integer casts instead of semantic inferred type, so implicit cast direction could violate usual arithmetic conversions. This caused wrong signedness/width behavior in integer expressions like mixed-width division. The fix switches integer cast decisions to semantic inferred types.
5b0447ea Explicit cast lowering similarly used storage type instead of semantic integer type, which could lose required sign extension on integer-to-long paths. That yielded incorrect results in signed comparisons after casts. The fix uses inferred integer semantics when lowering casts.
b1c97854 The narrow pass still treated some sign-sensitive patterns as safe (notably And and some Shl cases), allowing removal of required sign-changing casts. That could alter implementation-defined conversion points and final values. The fix excludes unsafe cases and adds signedness checks on widening sources.
acc1b4a5 X86 codegen treated same-width U32 -> I32 casts as no-ops, but values live in 64-bit registers and later signed operations depend on proper sign extension. Without extending, upper bits could carry wrong semantics into 64-bit ops. The fix emits sign extension for this cast class.
ceff82eb The divide-by-constant optimization tracked many unsigned 32-bit values as if they were known signed-32-safe. That let signed transformations apply where values could exceed i32::MAX, changing division semantics. The fix separates u32 and i32 range knowledge and gates transforms accordingly.
9fe29b62 CFG simplification resolved casted constants as though casts were identity, so branch/ternary folding could ignore required truncation. That could fold to the wrong successor block (for example (i8)512 should be zero). The fix folds cast constants with proper source/target integer semantics before control-flow decisions.

Next, I tested CCC using Csmith, which revealed 18 more bugs before I was able to compile and run 200,000 test cases without seeing a miscompilation. Here are the commits fixing these. Again, both the fixes and the explanations of those fixes are by Codex.

9addd3af The parser was treating int;-style non-aggregate declarations inside struct/union bodies as if they were anonymous members. That created a fake field slot and could shift initializer mapping to real fields, producing wrong initialized values.
f9edd2ab Global initializer lowering used a sub_stride > struct_size gate that failed for singleton inner dimensions ([1]). This prevented needed recursion through brace nesting and mis-initialized multidimensional struct arrays.
06d5dcce Constant folding for || and && could fold from the RHS even when LHS truthiness was unknown. That violates short-circuit semantics and could wrongly skip LHS side effects.
4c0a8fb9 In global struct-array init lowering, the flat index tracker (current_idx) was not advanced after consuming nested elements. Excess scalar initializers could be reprocessed at an older index and overwrite the last valid element.
78e8e7fb Small packed structs carried in registers were sometimes stored/loaded using carrier widths (I32/I64) directly against exact-size objects. That could clobber adjacent bytes on stores and read out of bounds on loads for odd sizes (for example 5-7 bytes). The fix routes these paths through exact-width stores or temporary+memcpy handling.
d024eee3 Local struct-array initialization with singleton inner dimensions could leave an extra list wrapper at leaf level. Then only part of the intended struct initializer was consumed, dropping later scalar fields.
b5a33193 Const-local caching stored raw initializer constants before coercion to the declared type. On signed-char targets, values like const char c = 220; were cached as 220 instead of -36, causing incorrect constant-folded control flow.
c7ec162a The compare-branch fusion peephole pass could remove setcc materialization even when that boolean value was later stored and observably used. That changed program behavior by deleting side effects; the fix preserves materialization when stores are involved and only removes the redundant final test.
e7a16d57 Non-_Bool bitfield assignment did not apply normal assignment conversion of RHS to the bitfield expression type before truncation/masking. Narrow unsigned RHS values could be interpreted through a signed path and end up sign-extended incorrectly.
1a17cf58 The prior singleton-dimension initializer fix peeled only one redundant list layer. Cases with double singleton inner dimensions still lost scalar field initializers because extra wrappers remained.
c516fb61 Unsized array dimensions in tentative/incomplete global declarations were defaulted to 256, inflating object size and producing wrong sizeof. C requires a missing bound here to become one element at end of translation unit.
2f259a10 X86 memory-folding could replace a 32-bit load feeding a 64-bit compare with a direct 64-bit memory compare. That loses required zero-extension semantics and reads extra bytes, which can miscompile comparisons.
f438bad3 Constant coercion had overly broad fast paths for unsigned targets that returned signed constants unchanged. Negative narrow values therefore skipped required zero-extension (for example coercing -7 to u32), producing incorrect folded constants.
f73e9627 Bitfield compound assignment used a simplified fixed-type lowering path instead of normal arithmetic conversions/promotions. With unsigned-short RHS values, this could apply the operation under the wrong signedness and store the wrong bitfield result.
e23307cf Global struct-array pointer-field initialization had the same singleton-dimension recursion bug (this_stride > struct_size gate), causing brace levels not to be peeled when extent was 1 and mis-initializing scalar fields. The fix recurses whenever dimensions remain.
a0f94b81 Local struct/union array initializer recursion treated stride == struct_size as a leaf and skipped needed descent through remaining dimensions. Singleton inner dimensions ([1]) therefore dropped nested initializers; recursion is now driven by remaining dimensions, not stride comparison.
355ff5aa Constant folding for unary ~ was applying a full-width complement to values stored as I64, so unsigned 32-bit cases could get the wrong truth value (for example ~0xFFFFFFFFu folded nonzero). The fix makes unary bit-not const-eval honor promoted result width/signedness, restoring correct && short-circuit behavior.
e21ef76b Pointer-aware global scalar-array initialization ignored brace-wrapped scalar items (Initializer::List), so cases like union U g[] = {{{{6}}}}; could leave zero instead of writing 6. The fix unwraps nested braces and writes the scalar value through the pointer-aware path.

Finally, I turned to YARPGen v2, which found 5 additional bugs:

66731383 Constant evaluation treated casts to _Bool like truncating integer casts instead of normalizing nonzero values to 1. This could fold expressions such as (_Bool)61 to 61, changing loop steps and control flow. The fix applies _Bool normalization before truncation-based cast handling in both sema and IR lowering.
bf9e6fff Type resolution for GNU statement expressions could resolve the final identifier against an outer scope before declarations inside the statement expression. When an inner declaration shadowed an outer one, typeof could pick the wrong type and change signedness/width semantics. The fix makes statement-expression result type lookup prefer the compound’s local declarations.
90dc0b35 Statement-expression type resolution only recognized a bare trailing expression statement and missed cases where the final value was wrapped in a label-like statement. Then typeof on a statement expression could fail to see the real tail expression or use the wrong scope/type. The fix unwraps label/case/default wrappers before resolving the statement-expression result in sema and lowering.
25e55d85 Nested GNU statement expressions still misresolved typeof when inner declarations depended on names from an enclosing statement-expression scope or when the tail expression was more complex than a bare identifier. That could make lowering fall back to the wrong type or miss the inner scope entirely in kernel-style macro patterns. The fix threads parent scope information through nested statement-expression type resolution and adds broader scope-aware CType inference during typeof evaluation.
fcc28ed5 Scope-aware typeof resolution inside nested statement expressions inferred binary-expression types with a “wider operand wins” heuristic instead of C operator rules. Comparison expressions such as _a < _a could therefore be treated as wide integer types instead of int, changing later comparisons and producing wrong output. The fix makes the scope-aware path follow normal comparison, shift, pointer-arithmetic, vector, and arithmetic-conversion typing rules, and adds a regression test for the reduced nested-typeof case.

So, what have we learned here? For one thing, the bug fixing by Codex seems pretty impressive: I gave it zero guidance other than the reduced test cases and good reference compilers (GCC and LLVM). I had half expected Codex to patch CCC in clumsy ways that would lead towards chaos instead of correctness, but that wasn’t the case at all. Codex did go badly wrong in one instance where it tried to fix a poorly-reduced input that contained undefined behavior, but it was easy enough to notice this and discard its work. Another time it decided to reformat every file in CCC, for no apparent reason. But in general, it just fixed the bugs. Are its fixes good ones in a sense other than “they seem to work?” I don’t know! I don’t care to try to understand a vibe coded C compiler.

Another thing we learned here was that, with respect to the subset of C that Csmith and YARPGen generate, CCC was within some reasonable edit distance (the 34 commits above) of giving a reasonable impression of being correct. Was that a foregone conclusion? Absolutely not. It could easily have been the case that the vibe coded compiler was irrevocably specialized for its initial testing environment, in such a way that it was architecturally incapable of compiling C code in the more general case.

The most time-consuming part of the process here, initially, was waiting for C-Vise (which, again, is mostly C-Reduce). This was especially annoying because I specifically wrote C-Reduce to be slow, but good. But I do hate waiting for it. Later in the process, waiting for a random test case to trigger a bug took the most time. It pains me to say that waiting for Codex (which I would guess used more raw compute than any of the rest of this) was never the bottleneck. I didn’t measure how long Codex took, but most of the time it fixed a bug within 15-30 minutes.

Finally, let’s talk about the character of the bugs that I found here. They are mostly the kind of mistake that one would make if one was implementing a C compiler without reading the standard closely and carefully. They’re surface-level bugs that you would simply not find in a serious compiler. I don’t think we ever found a bug like these in GCC. We did find maybe one or two like this in Clang, but this was because we started testing Clang very early in its history: the LLVM community was bringing it up at the same time we developed Csmith. In contrast with these surface-level bugs, the vast majority of the bugs discovered by random testers in production-grade compilers are in their optimizers. There is a vast semantic surface area over which defects can occur in an aggressive optimizer.

What’s the verdict about Claude’s C Compiler? At some level it is impressive. I think there are a whole lot of programmers out there who couldn’t create a compiler this capable in six months. For example, George Necula had this to say about writing a C frontend:

When I (George) started to write CIL I thought it was going to take two weeks. Exactly a year has passed since then and I am still fixing bugs in it. This gross underestimate was due to the fact that I thought parsing and making sense of C is simple. You probably think the same. What I did not expect was how many dark corners this language has, especially if you want to parse real-world programs such as those written for GCC or if you are more ambitious and you want to parse the Linux or Windows NT sources (both of these were written without any respect for the standard and with the expectation that compilers will be changed to accommodate the program).

On the other hand, CCC doesn’t optimize very much and it contained these 34 fairly basic bugs in interpreting C code. From the point of view of people working with production compilers, CCC isn’t even a useful prototype. (If you’d like a more nuanced take, Chris’s piece from a couple weeks ago is good.) Moreover, there are a lot of C compilers available in Codex’s training corpus. I don’t know that any of them are written in Rust, but the modern LLMs seem really good at translating concepts between programming languages.

Is my version of CCC correct? There’s no chance of this. We created tools like Csmith and YARPGen to stress test the optimizers of already-tested compilers, but they are in no way a replacement for a really serious C conformance test suite.

If anyone feels like continuing to fuzz my fork of CCC, it’s here; make sure to get the yarpgen branch.