John Regehr, March 2 2026.
Update from March 17 2026: In the first version of this piece I made an embarrassing mistake: I trusted a vibe-coded script that ran Csmith, and it reported that Csmith triggered no miscompilations in the version of CCC that had already been fixed based on bugs detected by YARPGen. There were, in fact, many miscompilations detected by Csmith. Furthermore, after fixing the Csmith-triggered bugs, I fuzzed CCC using YARPGen v2, which found several more bugs. The piece is now revised to reflect these new findings.
I didn’t take much of an interest in CCC, “Claude’s C Compiler,” at first. However, after seeing a hint of what happens when you fuzz it using Csmith and YARPGen, I got curious about how wrong this compiler actually is. The results at the GitHub issue—14 miscompiles out of 101 Csmith programs and 5 miscompiles out of 101 YARPGen programs—seemed pretty bad, but consistent with what I’d heard about this compiler, which is that it occupies an odd bit of territory where it’s more sophisticated than something we could get out of a compiler course project, but also that it doesn’t even rate at all on the scale where we consider production-grade artifacts like GCC and Clang/LLVM.
As a bit of background, Csmith and YARPGen are randomized compiler testing tools produced by my research group. They’re each responsible for detecting many hundreds of compiler defects, the most interesting of which are miscompilations, where a compiler silently produces output whose behavior diverges from the set of behaviors allowed by the relevant programming language standard, for some input program. YARPGen has (in effect) a built-in interpreter that allows it to predict the value that should be printed by a program that it generates. Csmith has no such functionality; for it to detect a miscompilation we use differential testing where we compare the behavior of executables generated by two compilers, or two modes of the same compiler (an optimizing and a non-optimizing compile, for example). Although I can’t prove it, I like to think that these tools (and others like them) have helped the production compilers that developers use every day become more robust and solid.
I connected YARPGen version 1 and CCC in a testing loop. As expected, CCC miscompiles a lot of inputs. Each time I found a miscompile I reduced it using C-Vise (which is mostly C-Reduce, but with the core rewritten in Python instead of Perl). It’s not really possible to deal with miscompile bugs triggered by large test cases—test case reduction shrinks miscompile triggers down to (typically) a few lines. Here is an example of a program that CCC miscompiled:
int printf(const char *, ...);
unsigned long long seed;
unsigned a = 3357492005;
int b;
void hash(long long *seed, int v) { *seed ^= v; }
int main() {
b = a / (long)3;
hash(&seed, b);
printf("%llu\n", seed);
}

Next—not being even a little bit of a Rust programmer—I asked Codex to fix each bug and also, of course, to add a regression test. I picked Codex (“gpt-5.3-codex high,” to be exact) not because I have any particular affinity for it, but rather because it’s what my employer currently pays for, for whatever reason. Once it appeared to succeed, I went back and ran YARPGen some more. After 11 bug fixes, an overnight run of YARPGen (around 200,000 individual tests) could not get CCC to miscompile.
Here are the commits fixing the 11 bugs. Bug summaries are Codex’s.
4d9913e7
The narrow pass could pick a 32-bit type with mismatched signedness when
rewriting a 64-bit op, changing whether later widening sign-extends or
zero-extends. That can change the computed value even when low bits
match. The fix only narrows when signedness semantics are
preserved.

32fe7f5e
Constant folding for unary minus treated unsigned results like signed
negation instead of modulo-2^N arithmetic. Expressions such as negation
into unsigned types could fold to the wrong constant. The fix adds typed
negation that honors result width and unsigned wrapping.

abeb8fbd
X86 compare-branch fusion could remove setcc
materialization even after %rax was clobbered/reloaded from
an unrelated stack slot. That made the fused branch test the wrong value
and changed control flow. The fix tightens matching so fusion only
happens for safe spill/reload patterns.

00fbea89
Comparison narrowing could drop sign-changing casts and could narrow
sub-int comparisons below int width, violating C promotion
semantics. This could flip signed/unsigned compare behavior and produce
wrong branch decisions. The fix requires type-consistent sources and
blocks unsafe sub-int narrowing.

90905856
Shift narrowing allowed rewrites even when shift counts were out of
range for the narrowed width. That can change behavior (for example,
narrowing << 47 from 64-bit to 32-bit semantics). The
fix requires constant in-range counts before narrowing shift
operations.

c01bac0f
Expression lowering used storage-oriented type lookup for integer casts
instead of semantic inferred type, so implicit cast direction could
violate usual arithmetic conversions. This caused wrong signedness/width
behavior in integer expressions like mixed-width division. The fix
switches integer cast decisions to semantic inferred types.

5b0447ea
Explicit cast lowering similarly used storage type instead of semantic
integer type, which could lose required sign extension on
integer-to-long paths. That yielded incorrect results in signed
comparisons after casts. The fix uses inferred integer semantics when
lowering casts.

b1c97854
The narrow pass still treated some sign-sensitive patterns as safe
(notably And and some Shl cases), allowing
removal of required sign-changing casts. That could alter
implementation-defined conversion points and final values. The fix
excludes unsafe cases and adds signedness checks on widening
sources.

acc1b4a5
X86 codegen treated same-width U32 -> I32 casts as
no-ops, but values live in 64-bit registers and later signed operations
depend on proper sign extension. Without extending, upper bits could
carry wrong semantics into 64-bit ops. The fix emits sign extension for
this cast class.

ceff82eb
The divide-by-constant optimization tracked many unsigned 32-bit values
as if they were known signed-32-safe. That let signed transformations
apply where values could exceed i32::MAX, changing division
semantics. The fix separates u32 and i32 range
knowledge and gates transforms accordingly.

9fe29b62
CFG simplification resolved casted constants as though casts were
identity, so branch/ternary folding could ignore required truncation.
That could fold to the wrong successor block (for example
(i8)512 should be zero). The fix folds cast constants with
proper source/target integer semantics before control-flow
decisions.

Next, I tested CCC using Csmith, which revealed 18 more bugs before I was able to compile and run 200,000 test cases without seeing a miscompilation. Here are the commits fixing these. Again, both the fixes and the explanations of those fixes are by Codex.
9addd3af
The parser was treating int;-style non-aggregate
declarations inside struct/union bodies as if
they were anonymous members. That created a fake field slot and could
shift initializer mapping to real fields, producing wrong initialized
values.

f9edd2ab
Global initializer lowering used a
sub_stride > struct_size gate that failed for singleton
inner dimensions ([1]). This prevented needed recursion
through brace nesting and mis-initialized multidimensional struct
arrays.

06d5dcce
Constant folding for || and && could
fold from the RHS even when LHS truthiness was unknown. That violates
short-circuit semantics and could wrongly skip LHS side effects.

4c0a8fb9
In global struct-array init lowering, the flat index tracker
(current_idx) was not advanced after consuming nested
elements. Excess scalar initializers could be reprocessed at an older
index and overwrite the last valid element.

78e8e7fb
Small packed structs carried in registers were sometimes stored/loaded
using carrier widths (I32/I64) directly
against exact-size objects. That could clobber adjacent bytes on stores
and read out of bounds on loads for odd sizes (for example 5-7 bytes).
The fix routes these paths through exact-width stores or
temporary+memcpy handling.

d024eee3
Local struct-array initialization with singleton inner dimensions could
leave an extra list wrapper at leaf level. Then only part of the
intended struct initializer was consumed, dropping later scalar
fields.

b5a33193
Const-local caching stored raw initializer constants before coercion to
the declared type. On signed-char targets, values like
const char c = 220; were cached as 220 instead
of -36, causing incorrect constant-folded control
flow.

c7ec162a
The compare-branch fusion peephole pass could remove setcc
materialization even when that boolean value was later stored and
observably used. That changed program behavior by deleting side effects;
the fix preserves materialization when stores are involved and only
removes the redundant final test.

e7a16d57
Non-_Bool bitfield assignment did not apply normal
assignment conversion of RHS to the bitfield expression type before
truncation/masking. Narrow unsigned RHS values could be interpreted
through a signed path and end up sign-extended incorrectly.

1a17cf58
The prior singleton-dimension initializer fix peeled only one redundant
list layer. Cases with double singleton inner dimensions still lost
scalar field initializers because extra wrappers remained.

c516fb61
Unsized array dimensions in tentative/incomplete global declarations
were defaulted to 256, inflating object size and producing
wrong sizeof. C requires a missing bound here to become one
element at the end of the translation unit.

2f259a10
X86 memory-folding could replace a 32-bit load feeding a 64-bit compare
with a direct 64-bit memory compare. That loses required zero-extension
semantics and reads extra bytes, which can miscompile comparisons.

f438bad3
Constant coercion had overly broad fast paths for unsigned targets that
returned signed constants unchanged. Negative narrow values therefore
skipped required zero-extension (for example coercing -7 to
u32), producing incorrect folded constants.

f73e9627
Bitfield compound assignment used a simplified fixed-type lowering path
instead of normal arithmetic conversions/promotions. With unsigned-short
RHS values, this could apply the operation under the wrong signedness
and store the wrong bitfield result.

e23307cf
Global struct-array pointer-field initialization had the same
singleton-dimension recursion bug
(this_stride > struct_size gate), causing brace levels
not to be peeled when extent was 1 and mis-initializing scalar fields.
The fix recurses whenever dimensions remain.

a0f94b81
Local struct/union array initializer recursion treated
stride == struct_size as a leaf and skipped needed descent
through remaining dimensions. Singleton inner dimensions
([1]) therefore dropped nested initializers; recursion is
now driven by remaining dimensions, not stride comparison.

355ff5aa
Constant folding for unary ~ was applying a full-width
complement to values stored as I64, so unsigned 32-bit
cases could get the wrong truth value (for example
~0xFFFFFFFFu folded nonzero). The fix makes unary bit-not
const-eval honor promoted result width/signedness, restoring correct
&& short-circuit behavior.

e21ef76b
Pointer-aware global scalar-array initialization ignored brace-wrapped
scalar items (Initializer::List), so cases like
union U g[] = {{{{6}}}}; could leave zero instead of
writing 6. The fix unwraps nested braces and writes the
scalar value through the pointer-aware path.

Finally, I turned to YARPGen v2, which found 5 additional bugs:
66731383
Constant evaluation treated casts to _Bool like truncating
integer casts instead of normalizing nonzero values to 1.
This could fold expressions such as (_Bool)61 to
61, changing loop steps and control flow. The fix applies
_Bool normalization before truncation-based cast handling
in both sema and IR lowering.

bf9e6fff
Type resolution for GNU statement expressions could resolve the final
identifier against an outer scope before declarations inside the
statement expression. When an inner declaration shadowed an outer one,
typeof could pick the wrong type and change
signedness/width semantics. The fix makes statement-expression result
type lookup prefer the compound’s local declarations.

90dc0b35
Statement-expression type resolution only recognized a bare trailing
expression statement and missed cases where the final value was wrapped
in a label-like statement. Then typeof on a statement
expression could fail to see the real tail expression or use the wrong
scope/type. The fix unwraps label/case/default wrappers before resolving
the statement-expression result in sema and lowering.

25e55d85
Nested GNU statement expressions still misresolved typeof
when inner declarations depended on names from an enclosing
statement-expression scope or when the tail expression was more complex
than a bare identifier. That could make lowering fall back to the wrong
type or miss the inner scope entirely in kernel-style macro patterns.
The fix threads parent scope information through nested
statement-expression type resolution and adds broader scope-aware CType
inference during typeof evaluation.

fcc28ed5
Scope-aware typeof resolution inside nested statement
expressions inferred binary-expression types with a “wider operand wins”
heuristic instead of C operator rules. Comparison expressions such as
_a < _a could therefore be treated as wide integer types
instead of int, changing later comparisons and producing
wrong output. The fix makes the scope-aware path follow normal
comparison, shift, pointer-arithmetic, vector, and arithmetic-conversion
typing rules, and adds a regression test for the reduced
nested-typeof case.

So, what have we learned here? For one thing, the bug fixing by Codex seems pretty impressive: I gave it zero guidance other than the reduced test cases and good reference compilers (GCC and LLVM). I had half expected Codex to patch CCC in clumsy ways that would lead towards chaos instead of correctness, but that wasn’t the case at all. Codex did go badly wrong in one instance where it tried to fix a poorly-reduced input that contained undefined behavior, but it was easy enough to notice this and discard its work. Another time it decided to reformat every file in CCC, for no apparent reason. But in general, it just fixed the bugs. Are its fixes good ones in a sense other than “they seem to work”? I don’t know! I don’t care to try to understand a vibe-coded C compiler.
Another thing we learned here was that, with respect to the subset of C that Csmith and YARPGen generate, CCC was within some reasonable edit distance (the 34 commits above) of giving a reasonable impression of being correct. Was that a foregone conclusion? Absolutely not. It could easily have been the case that the vibe-coded compiler was irrevocably specialized for its initial testing environment, in such a way that it was architecturally incapable of compiling C code in the more general case.
The most time-consuming part of the process, initially, was waiting for C-Vise (which, again, is mostly C-Reduce). This was especially annoying because I specifically wrote C-Reduce to be slow but good. But I do hate waiting for it. Later on, waiting for a random test case to trigger a bug took over as the most time-consuming step. It pains me to say that waiting for Codex—which I would guess used more raw compute than any of the rest of this—was never the bottleneck. I didn’t measure how long Codex took, but most of the time it fixed a bug within 15-30 minutes.
Finally, let’s talk about the character of the bugs that I found here. They are mostly the kind of mistake that one would make if one were implementing a C compiler without reading the standard closely and carefully. They’re surface-level bugs that you would simply not find in a serious compiler. I don’t think we ever found a bug like these in GCC. We did find maybe one or two like this in Clang, but that was because we started testing Clang very early in its history: the LLVM community was bringing it up at the same time we were developing Csmith. In contrast with these surface-level bugs, the vast majority of the bugs discovered by random testers in production-grade compilers are in their optimizers. There is a vast semantic surface area over which defects can occur in an aggressive optimizer.
What’s the verdict about Claude’s C Compiler? At some level it is impressive. I think there are a whole lot of programmers out there who couldn’t create a compiler this capable in six months. For example, George Necula had this to say about writing a C frontend:
When I (George) started to write CIL I thought it was going to take two weeks. Exactly a year has passed since then and I am still fixing bugs in it. This gross underestimate was due to the fact that I thought parsing and making sense of C is simple. You probably think the same. What I did not expect was how many dark corners this language has, especially if you want to parse real-world programs such as those written for GCC or if you are more ambitious and you want to parse the Linux or Windows NT sources (both of these were written without any respect for the standard and with the expectation that compilers will be changed to accommodate the program).
However, on the other hand, CCC doesn’t optimize very much and it contained these 34 fairly basic bugs in interpreting C code. From the point of view of people working with production compilers, CCC isn’t even a useful prototype. (If you’d like a more nuanced take, Chris’s piece from a couple weeks ago is good.) Moreover, there are a lot of C compilers available in Codex’s training corpus. I don’t know that any of them are written in Rust, but the modern LLMs seem really good at translating concepts between programming languages.
Is my version of CCC correct? There’s no chance of this. We created tools like Csmith and YARPGen to stress test the optimizers of already-tested compilers, but they are in no way a replacement for a really serious C conformance test suite.
If anyone feels like continuing to fuzz my fork of CCC, it’s here; make sure to get the yarpgen branch.