CVDP Performance Analysis
ChipCraftX achieves 94.7% functional correctness on NVIDIA's CVDP benchmark across 302 non-agentic problems and 64.1% on 92 agentic problems — the highest reported scores on an open-source evaluation toolchain. On the four CID categories shared with ACE-RTL (NVIDIA, arXiv:2602.10218), ChipCraftX matches or beats ACE-RTL on 3 of 4 categories while using 30x fewer generation attempts.
The CVDP Benchmark
The Comprehensive Verilog Design Problems (CVDP) benchmark from NVIDIA (arXiv:2506.14074) contains 649 RTL problems derived from real chip design workflows. Unlike VerilogEval — which tests short, textbook-style snippets — CVDP problems feature specifications averaging 18KB, multiple interacting submodules, and cocotb-based Python testbenches that exercise silicon-realistic edge cases.
CVDP organizes problems into 8 CID categories spanning code completion, spec-to-RTL generation, code modification, module reuse/integration, optimization, debugging, testbench generation, and assertion writing.
Evaluation scope
All ChipCraftX results below run on the open-source CVDP subset with Icarus Verilog + cocotb: 302 non-agentic problems (5 CID categories, easy + medium) and 92 agentic problems (4 CID categories, easy + medium + hard), 394 problems in total.
Non-Agentic Results
302 problems across 5 CID categories (easy + medium difficulty). ChipCraftX uses up to 5 iterations with EDA tool feedback — compile errors and simulation results inform each retry.
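The retry mechanism can be sketched as a small loop. The three helpers below (`generate_rtl`, `compile_rtl`, `simulate`) are hypothetical stubs standing in for the model call, the Icarus Verilog compile, and the cocotb run; they illustrate the control flow, not ChipCraftX's actual interfaces.

```python
def generate_rtl(spec: str, feedback: str) -> str:
    # Stub for the LLM call: a real system prompts with spec + prior tool logs.
    return f"// RTL for {spec} (revised after: {feedback!r})"

def compile_rtl(rtl: str) -> tuple[bool, str]:
    # Stub for iverilog: returns (success, log).
    return True, ""

def simulate(rtl: str) -> tuple[bool, str]:
    # Stub for the cocotb testbench: here, passes once feedback was applied.
    return "''" not in rtl, "assertion failed at t=120ns"

def solve(spec: str, max_iters: int = 5):
    """Retry up to max_iters times, feeding EDA tool output into each retry."""
    feedback = ""
    for attempt in range(1, max_iters + 1):
        rtl = generate_rtl(spec, feedback)
        ok, log = compile_rtl(rtl)
        if ok:
            ok, log = simulate(rtl)
        if ok:
            return rtl, attempt          # passed compile + simulation
        feedback = log                   # error log informs the next attempt
    return None, max_iters               # budget exhausted
```

With these stubs, the first attempt fails simulation and the second, informed by the error log, passes — mirroring how compile errors and assertion failures feed each retry.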
| CID | Category | Problems | Pass Rate |
|---|---|---|---|
| cid002 | RTL Code Completion | 94 | 93.6% |
| cid003 | Spec-to-RTL Generation | 78 | 96.2% |
| cid004 | RTL Code Modification | 55 | 96.4% |
| cid007 | RTL Optimization | 40 | 97.5% |
| cid016 | Bug Fixing | 35 | 88.6% |
| Overall | | 302 | 94.7% |
By Difficulty
| CID | Easy | Medium |
|---|---|---|
| cid002 | 46/48 (95.8%) | 42/46 (91.3%) |
| cid003 | 41/41 (100.0%) | 34/37 (91.9%) |
| cid004 | 29/30 (96.7%) | 24/25 (96.0%) |
| cid007 | 21/22 (95.5%) | 18/18 (100.0%) |
| cid016 | 20/21 (95.2%) | 11/14 (78.6%) |
Performance is consistent across difficulty levels: medium trails easy by under 5pp on cid002 and cid004, by about 8pp on cid003, and actually exceeds easy on cid007. The one exception is cid016 (bug fixing), where medium-difficulty problems drop to 78.6%; these involve subtle timing bugs and multi-signal race conditions that are harder to diagnose from simulation output alone.
87.8% of problems pass on the first attempt (265 of 302). Only 37 problems required retries, and of those, 21 were resolved on iterations 2-4.
Head-to-Head: ChipCraftX vs ACE-RTL
ACE-RTL (NVIDIA, arXiv:2602.10218, February 2026) is the most directly comparable system: it combines a fine-tuned RTL-specialized LLM with an agentic feedback loop, evaluated on the same CVDP benchmark. Their paper reports results for 18 systems across four CID categories — the same four we share. All scores use APR (Agentic Pass Rate).
vs Agentic Systems
| Model | cid002 | cid003 | cid004 | cid016 |
|---|---|---|---|---|
| ChipCraftX (Iterative@5) | 93.6 | 96.2 | 96.4 | 88.6 |
| ACE-RTL (5x parallel, 30 iter) | 80.85 | 96.15 | 90.91 | 91.43 |
| ACE-RTL (Claude4 generator) | 80.85 | 89.74 | 81.82 | 88.57 |
| ACE-RTL-Generator (Pass@1) | 39.57 | 49.74 | 65.09 | 57.14 |
| ScaleRTL†-32B | 29.79 | 35.90 | 32.73 | 40.00 |
vs Frontier LLMs
| Model | cid002 | cid003 | cid004 | cid016 |
|---|---|---|---|---|
| ChipCraftX (Iterative@5) | 93.6 | 96.2 | 96.4 | 88.6 |
| GPT-5 | 39.36 | 47.44 | 45.45 | 60.00 |
| Claude4-Sonnet | 39.36 | 51.28 | 49.09 | 54.29 |
| o4-mini | 37.23 | 45.45 | 44.44 | 58.82 |
| DeepSeek-R1 | 39.36 | 42.31 | 43.64 | 51.43 |
| DeepSeek-v3.1 | 37.23 | 48.72 | 41.82 | 40.00 |
| Llama4-Maverick | 28.72 | 32.05 | 38.18 | 37.14 |
| Qwen3-Coder-480B | 31.91 | 35.90 | 41.82 | 42.86 |
| Kimi-K2 | 25.53 | 29.49 | 32.73 | 31.43 |
vs RTL-Specialized Models
| Model | cid002 | cid003 | cid004 | cid016 |
|---|---|---|---|---|
| ChipCraftX (Iterative@5) | 93.6 | 96.2 | 96.4 | 88.6 |
| ScaleRTL-32B | 27.66 | 33.33 | 30.91 | 37.14 |
| OriGen-7B | 21.28 | 21.79 | 16.36 | 11.43 |
| CraftRTL-15B | 11.70 | 17.95 | 16.36 | 8.57 |
| CodeV-7B | 6.38 | 7.69 | 0.00 | 0.00 |
| RTLCoder-v1.1-7B | 1.06 | 5.13 | 1.82 | 2.86 |
The Delta
| CID | ChipCraftX | ACE-RTL | Delta | Winner |
|---|---|---|---|---|
| cid002 (Code Completion) | 93.6 | 80.85 | +12.75pp | ChipCraftX |
| cid003 (Spec-to-RTL) | 96.2 | 96.15 | +0.05pp | Tied |
| cid004 (Code Modification) | 96.4 | 90.91 | +5.49pp | ChipCraftX |
| cid016 (Bug Fixing) | 88.6 | 91.43 | -2.83pp | ACE-RTL |
ChipCraftX leads convincingly on code completion (+12.75pp) and code modification (+5.49pp), ties on spec-to-RTL, and trails by 2.83pp on bug fixing. The cid016 gap likely reflects ACE-RTL's restart mechanism — when a debugging trajectory stalls, their Coordinator discards it and starts fresh with distilled insights. Our pipeline does not yet implement trajectory restarts; when an iteration stalls, we continue refining the same code rather than starting over.
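As a quick sanity check, the deltas in the table can be recomputed directly from the two score rows (numbers copied from the head-to-head tables above):

```python
# Shared-category APR scores copied from the head-to-head tables above.
chipcraftx = {"cid002": 93.6, "cid003": 96.2, "cid004": 96.4, "cid016": 88.6}
ace_rtl    = {"cid002": 80.85, "cid003": 96.15, "cid004": 90.91, "cid016": 91.43}

# Positive delta: ChipCraftX ahead; negative: ACE-RTL ahead.
deltas = {cid: round(chipcraftx[cid] - ace_rtl[cid], 2) for cid in chipcraftx}
# deltas == {"cid002": 12.75, "cid003": 0.05, "cid004": 5.49, "cid016": -2.83}
```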
Efficiency matters
ACE-RTL's best configuration runs 5 parallel trajectories of 30 iterations each, up to 150 generation attempts per problem. ChipCraftX's scores come from at most 5 sequential iterations, a 30x smaller attempt budget.
Agentic Results
92 problems across 4 CID categories (easy + medium + hard difficulty). Agentic problems provide multi-file context — existing RTL modules that the generated code must integrate with.
| CID | Category | Problems | Pass Rate |
|---|---|---|---|
| cid003 | Spec-to-RTL Generation | 34 | 85.3% |
| cid004 | RTL Code Modification | 25 | 52.0% |
| cid005 | Module Reuse / Integration | 22 | 40.9% |
| cid016 | Bug Fixing | 11 | 72.7% |
| Overall | | 92 | 64.1% |
By Difficulty
| CID | Easy | Medium | Hard |
|---|---|---|---|
| cid003 | 5/5 (100%) | 22/25 (88.0%) | 2/4 (50.0%) |
| cid004 | 4/4 (100%) | 7/15 (46.7%) | 2/6 (33.3%) |
| cid005 | 1/1 (100%) | 3/12 (25.0%) | 5/9 (55.6%) |
| cid016 | 7/7 (100%) | 1/3 (33.3%) | 0/1 (0.0%) |
The agentic subset is substantially harder — easy problems are universally solved (100% across all CIDs), but medium and hard problems expose the challenge of multi-file context. The 64.1% overall rate shows clear room for improvement, with module reuse/integration (40.9%) and hard code modification (33.3%) as primary bottlenecks.
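The 64.1% overall figure follows from the per-CID pass counts; the passed totals below are sums of the easy/medium/hard cells in the by-difficulty table:

```python
# (passed, total) per agentic CID, summed from the by-difficulty table above.
agentic = {
    "cid003": (29, 34),   # 5 + 22 + 2 passed
    "cid004": (13, 25),   # 4 + 7 + 2
    "cid005": (9, 22),    # 1 + 3 + 5
    "cid016": (8, 11),    # 7 + 1 + 0
}
passed = sum(p for p, _ in agentic.values())     # 59
total  = sum(t for _, t in agentic.values())     # 92
overall = round(100 * passed / total, 1)         # 64.1
```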
Iteration Dynamics
Non-Agentic
| Iterations | Passed | Failed | Pass Rate | Avg Time |
|---|---|---|---|---|
| 1 | 265 | 0 | 100% | 50.2s |
| 2 | 16 | 0 | 100% | 103.1s |
| 3 | 4 | 2 | 66.7% | 336.6s |
| 4 | 1 | 1 | 50.0% | 324.9s |
| 5 | 0 | 13 | 0% | 343.9s |
Agentic
| Iterations | Passed | Failed | Pass Rate | Avg Time |
|---|---|---|---|---|
| 1 | 53 | 0 | 100% | 50.7s |
| 2 | 4 | 0 | 100% | 146.5s |
| 4 | 2 | 2 | 50.0% | 287.5s |
| 5 | 0 | 31 | 0% | 246.7s |
The pattern is striking: every problem that reaches 5 iterations fails. This is a hard ceiling, not a soft degradation. If the pipeline hasn't solved a problem in 4 attempts, the 5th attempt has zero empirical chance of succeeding.
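The ceiling is visible directly in the data. A minimal check over the (passed, failed) pairs from the two iteration tables:

```python
# (passed, failed) by iteration count, copied from the two tables above.
non_agentic = {1: (265, 0), 2: (16, 0), 3: (4, 2), 4: (1, 1), 5: (0, 13)}
agentic     = {1: (53, 0), 2: (4, 0), 4: (2, 2), 5: (0, 31)}

def pass_rate_at(table: dict, iters: int) -> float:
    """Fraction of problems finishing at `iters` iterations that passed."""
    passed, failed = table[iters]
    return passed / (passed + failed)

fifth = (pass_rate_at(non_agentic, 5), pass_rate_at(agentic, 5))
# fifth == (0.0, 0.0): no problem was ever rescued by a 5th iteration.
```

Capping the budget at 4 iterations would therefore cost nothing in pass rate while saving the several hundred seconds spent on each doomed 5th attempt.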
Timing
| Status | Avg Time | Median | Max |
|---|---|---|---|
| Passed | 55.8s | 47.1s | 340.5s |
| Failed | 383.1s | 254.8s | 1,020s |
Failures take 6.9x longer on average — the system exhausts its iteration budget before admitting defeat.
Key Takeaways
- 94.7% non-agentic on CVDP demonstrates that iterative EDA feedback with a strong foundation model solves the vast majority of realistic RTL problems — not just textbook exercises.
- Efficiency over brute force: ChipCraftX matches or beats ACE-RTL on 3/4 categories with 5 iterations vs their 150 attempts.
- Agentic is the frontier: 64.1% on multi-file integration tasks shows where the real challenge lies. Code completion and spec-to-RTL are approaching saturation; module integration and debugging at scale are the next problems to solve.
- Transparency builds trust: We report our cid016 gap honestly. ACE-RTL's restart mechanism is a good idea that we plan to adopt. The 16 non-agentic failures are understood and actively being addressed.
Results as of March 2026. ACE-RTL data from arXiv:2602.10218 (February 2026). All ChipCraftX evaluations use Icarus Verilog + cocotb on the open-source CVDP subset.