Benchmark Analysis

CVDP Performance Analysis

ChipCraftX achieves 94.7% functional correctness on NVIDIA's CVDP benchmark across 302 non-agentic problems and 64.1% on 92 agentic problems — the highest reported scores on an open-source evaluation toolchain. On the four CID categories shared with ACE-RTL (NVIDIA, arXiv:2602.10218), ChipCraftX leads on 3 of 4 categories while using 30x fewer generation attempts.


The CVDP Benchmark

The Comprehensive Verilog Design Problems (CVDP) benchmark from NVIDIA (arXiv:2506.14074) contains 649 RTL problems derived from real chip design workflows. Unlike VerilogEval — which tests short, textbook-style snippets — CVDP problems feature specifications averaging 18KB, multiple interacting submodules, and cocotb-based Python testbenches that exercise silicon-realistic edge cases.
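Since CVDP's testbenches are ordinary Python, a minimal cocotb test in that style is easy to illustrate. The sketch below is hypothetical: the DUT, its `clk`/`rst`/`count` ports, and the wrap-around check are not drawn from the benchmark, and running it requires a Verilog simulator such as Icarus Verilog behind cocotb:

```python
# Hypothetical cocotb testbench in the CVDP style: plain Python driving a
# simulated Verilog DUT. The dut.clk / dut.rst / dut.count ports and the
# 8-bit wrap-around behavior are illustrative assumptions.
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge

@cocotb.test()
async def counter_wraps_at_256(dut):
    """After reset, an 8-bit counter should return to 0 on cycle 256."""
    cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())  # 100 MHz clock
    dut.rst.value = 1
    await RisingEdge(dut.clk)
    dut.rst.value = 0
    await RisingEdge(dut.clk)          # counter is at 0 here
    for _ in range(256):               # run through the full 8-bit range
        await RisingEdge(dut.clk)
    assert dut.count.value == 0, "8-bit counter failed to wrap back to 0"
```

Real CVDP testbenches follow this shape but at much larger scale, driving multiple interacting submodules through silicon-realistic stimulus.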

CVDP organizes problems into CID categories spanning code completion, spec-to-RTL generation, code modification, module reuse and integration, optimization, debugging, testbench generation, and assertion writing.

Evaluation scope

We evaluate the CID categories that use open-source tools (Icarus Verilog + cocotb): cid002, cid003, cid004, cid007, and cid016 in the non-agentic track, plus cid005 in the agentic track. Categories cid012 (Testbench Stimulus), cid013 (Testbench Checker), and cid014 (Assertion Generation) require NVIDIA Xcelium and are excluded. All per-CID comparisons with ACE-RTL are on identical problem sets.

Non-Agentic Results

302 problems across 5 CID categories (easy + medium difficulty). ChipCraftX uses up to 5 iterations with EDA tool feedback — compile errors and simulation results inform each retry.
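The feedback loop just described can be sketched as follows. The `generate_rtl`, `compile_rtl`, and `simulate` callables are hypothetical stand-ins for the model call and the Icarus Verilog/cocotb steps; this illustrates the pattern, not ChipCraftX's actual code:

```python
# Sketch of an iterate-with-EDA-feedback loop (assumed structure, not
# ChipCraftX's implementation). Each retry feeds the previous attempt's
# compile errors or simulation failures back into the next generation.
from typing import Callable, Optional

MAX_ITERATIONS = 5

def iterative_generate(
    spec: str,
    generate_rtl: Callable[[str, str], str],      # (spec, feedback) -> RTL source
    compile_rtl: Callable[[str], Optional[str]],  # -> error log, or None if clean
    simulate: Callable[[str], Optional[str]],     # -> failure log, or None if pass
) -> Optional[str]:
    feedback = ""                                 # empty on the first attempt
    for _ in range(MAX_ITERATIONS):
        rtl = generate_rtl(spec, feedback)
        compile_errors = compile_rtl(rtl)
        if compile_errors is not None:
            feedback = f"Compilation failed:\n{compile_errors}"
            continue                              # retry without simulating
        sim_failures = simulate(rtl)
        if sim_failures is None:
            return rtl                            # testbench passed
        feedback = f"Simulation failed:\n{sim_failures}"
    return None                                   # iteration budget exhausted
```

The key design point is that feedback is structured by failure stage: compile errors and simulation mismatches produce different retry prompts.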

| CID | Category | Problems | Pass Rate |
|---|---|---|---|
| cid002 | RTL Code Completion | 94 | 93.6% |
| cid003 | Spec-to-RTL Generation | 78 | 96.2% |
| cid004 | RTL Code Modification | 55 | 96.4% |
| cid007 | RTL Optimization | 40 | 97.5% |
| cid016 | Bug Fixing | 35 | 88.6% |
| Overall | | 302 | 94.7% |
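As a sanity check, the overall figure is the problem-weighted aggregate of the per-CID rows above, i.e. 286 of 302 problems:

```python
# Recompute the overall non-agentic pass rate from the per-CID rows.
results = {                      # cid: (problems, pass_rate_percent)
    "cid002": (94, 93.6),
    "cid003": (78, 96.2),
    "cid004": (55, 96.4),
    "cid007": (40, 97.5),
    "cid016": (35, 88.6),
}
passed = sum(round(n * rate / 100) for n, rate in results.values())
total = sum(n for n, _ in results.values())
print(passed, total, f"{100 * passed / total:.1f}%")  # 286 302 94.7%
```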

By Difficulty

| CID | Easy | Medium |
|---|---|---|
| cid002 | 46/48 (95.8%) | 42/46 (91.3%) |
| cid003 | 41/41 (100.0%) | 34/37 (91.9%) |
| cid004 | 29/30 (96.7%) | 24/25 (96.0%) |
| cid007 | 21/22 (95.5%) | 18/18 (100.0%) |
| cid016 | 20/21 (95.2%) | 11/14 (78.6%) |

Performance holds up well across difficulty levels: medium trails easy by 4.5pp in cid002, 8.1pp in cid003, and under 1pp in cid004, while cid007 is actually stronger at medium. The one exception is cid016 (bug fixing), where medium-difficulty problems drop to 78.6% — these involve subtle timing bugs and multi-signal race conditions that are harder to diagnose from simulation output alone.

87.7% of problems pass on the first attempt (265 of 302). Only 37 problems required retries; of those, 21 were solved in iterations 2 through 4, and the remaining 16 never passed within the budget.


Head-to-Head: ChipCraftX vs ACE-RTL

ACE-RTL (NVIDIA, arXiv:2602.10218, February 2026) is the most directly comparable system: it combines a fine-tuned RTL-specialized LLM with an agentic feedback loop, evaluated on the same CVDP benchmark. Their paper reports results for 18 systems across four CID categories — the same four we share. All scores use APR (Agentic Pass Rate).

vs Agentic Systems

| Model | cid002 | cid003 | cid004 | cid016 |
|---|---|---|---|---|
| ChipCraftX (Iterative@5) | 93.6 | 96.2 | 96.4 | 88.6 |
| ACE-RTL (5x parallel, 30 iter) | 80.85 | 96.15 | 90.91 | 91.43 |
| ACE-RTL (Claude4 generator) | 80.85 | 89.74 | 81.82 | 88.57 |
| ACE-RTL-Generator (Pass@1) | 39.57 | 49.74 | 65.09 | 57.14 |
| ScaleRTL†-32B | 29.79 | 35.90 | 32.73 | 40.00 |

vs Frontier LLMs

| Model | cid002 | cid003 | cid004 | cid016 |
|---|---|---|---|---|
| ChipCraftX (Iterative@5) | 93.6 | 96.2 | 96.4 | 88.6 |
| GPT-5 | 39.36 | 47.44 | 45.45 | 60.00 |
| Claude4-Sonnet | 39.36 | 51.28 | 49.09 | 54.29 |
| o4-mini | 37.23 | 45.45 | 44.44 | 58.82 |
| DeepSeek-R1 | 39.36 | 42.31 | 43.64 | 51.43 |
| DeepSeek-v3.1 | 37.23 | 48.72 | 41.82 | 40.00 |
| Llama4-Maverick | 28.72 | 32.05 | 38.18 | 37.14 |
| Qwen3-Coder-480B | 31.91 | 35.90 | 41.82 | 42.86 |
| Kimi-K2 | 25.53 | 29.49 | 32.73 | 31.43 |

vs RTL-Specialized Models

| Model | cid002 | cid003 | cid004 | cid016 |
|---|---|---|---|---|
| ChipCraftX (Iterative@5) | 93.6 | 96.2 | 96.4 | 88.6 |
| ScaleRTL-32B | 27.66 | 33.33 | 30.91 | 37.14 |
| OriGen-7B | 21.28 | 21.79 | 16.36 | 11.43 |
| CraftRTL-15B | 11.70 | 17.95 | 16.36 | 8.57 |
| CodeV-7B | 6.38 | 7.69 | 0.00 | 0.00 |
| RTLCoder-v1.1-7B | 1.06 | 5.13 | 1.82 | 2.86 |

The Delta

| CID | ChipCraftX | ACE-RTL | Delta | Winner |
|---|---|---|---|---|
| cid002 (Code Completion) | 93.6 | 80.85 | +12.75pp | ChipCraftX |
| cid003 (Spec-to-RTL) | 96.2 | 96.15 | +0.05pp | Tied |
| cid004 (Code Modification) | 96.4 | 90.91 | +5.49pp | ChipCraftX |
| cid016 (Bug Fixing) | 88.6 | 91.43 | -2.83pp | ACE-RTL |

ChipCraftX leads convincingly on code completion (+12.75pp) and code modification (+5.49pp), ties on spec-to-RTL, and trails by 2.83pp on bug fixing. The cid016 gap likely reflects ACE-RTL's restart mechanism — when a debugging trajectory stalls, their Coordinator discards it and starts fresh with distilled insights. Our pipeline does not yet implement trajectory restarts; when an iteration stalls, we continue refining the same code rather than starting over.
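For concreteness, a trajectory-restart policy of the kind described could be sketched as below. The stall heuristic and the `generate`/`evaluate`/`summarize_insights` callables are assumptions for illustration, not ACE-RTL's published design:

```python
# Sketch of a trajectory-restart policy (a general technique, not ACE-RTL's
# actual Coordinator): when successive iterations stop improving the test
# score, discard the trajectory and restart from the spec plus a distilled
# summary of the failures seen so far.
from typing import Callable, Optional

def solve_with_restarts(
    spec: str,
    generate: Callable[[str, str, str], str],        # (spec, insights, feedback) -> RTL
    evaluate: Callable[[str], tuple],                # -> (tests_passed, failure_log)
    summarize_insights: Callable[[list], str],       # distill failure logs -> insights
    budget: int = 15,
    stall_limit: int = 3,
) -> Optional[str]:
    insights, feedback, failure_logs = "", "", []
    best, stalled = -1, 0
    for _ in range(budget):
        rtl = generate(spec, insights, feedback)
        score, log = evaluate(rtl)
        if log == "":
            return rtl                               # all tests passed
        failure_logs.append(log)
        if score > best:
            best, stalled = score, 0                 # still making progress
        else:
            stalled += 1
        if stalled >= stall_limit:                   # trajectory has stalled:
            insights = summarize_insights(failure_logs)
            feedback, best, stalled = "", -1, 0      # restart fresh with insights
        else:
            feedback = log                           # keep refining in place
    return None
```

The contrast with our current pipeline is the restart branch: instead of continuing to refine stalled code, the loop throws the trajectory away and keeps only the distilled insights.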

Efficiency matters

ACE-RTL uses 5 parallel processes with up to 30 iterations each — a total budget of 150 generation attempts per problem. ChipCraftX uses a single process with at most 5 iterations. Despite a 30x difference in compute budget, we lead on 3 of 4 categories.

Agentic Results

92 problems across 4 CID categories (easy + medium + hard difficulty). Agentic problems provide multi-file context — existing RTL modules that the generated code must integrate with.

| CID | Category | Problems | Pass Rate |
|---|---|---|---|
| cid003 | Spec-to-RTL Generation | 34 | 85.3% |
| cid004 | RTL Code Modification | 25 | 52.0% |
| cid005 | Module Reuse / Integration | 22 | 40.9% |
| cid016 | Bug Fixing | 11 | 72.7% |
| Overall | | 92 | 64.1% |

By Difficulty

| CID | Easy | Medium | Hard |
|---|---|---|---|
| cid003 | 5/5 (100%) | 22/25 (88.0%) | 2/4 (50.0%) |
| cid004 | 4/4 (100%) | 7/15 (46.7%) | 2/6 (33.3%) |
| cid005 | 1/1 (100%) | 3/12 (25.0%) | 5/9 (55.6%) |
| cid016 | 7/7 (100%) | 1/3 (33.3%) | 0/1 (0.0%) |

The agentic subset is substantially harder — easy problems are universally solved (100% across all CIDs), but medium and hard problems expose the challenge of multi-file context. The 64.1% overall rate shows clear room for improvement, with module reuse/integration (40.9%) and hard code modification (33.3%) as primary bottlenecks.


Iteration Dynamics

Non-Agentic

| Iterations Used | Passed | Failed | Pass Rate | Avg Time |
|---|---|---|---|---|
| 1 | 265 | 0 | 100% | 50.2s |
| 2 | 16 | 0 | 100% | 103.1s |
| 3 | 4 | 2 | 66.7% | 336.6s |
| 4 | 1 | 1 | 50.0% | 324.9s |
| 5 | 0 | 13 | 0% | 343.9s |

Agentic

| Iterations Used | Passed | Failed | Pass Rate | Avg Time |
|---|---|---|---|---|
| 1 | 53 | 0 | 100% | 50.7s |
| 2 | 4 | 0 | 100% | 146.5s |
| 4 | 2 | 2 | 50.0% | 287.5s |
| 5 | 0 | 31 | 0% | 246.7s |

The pattern is striking: every problem that reaches 5 iterations fails. This is a hard ceiling, not a soft degradation. If the pipeline hasn't solved a problem in 4 attempts, the 5th attempt has zero empirical chance of succeeding.
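The cumulative solve curve implied by the non-agentic table makes that ceiling concrete: a budget of 4 iterations reaches the same final rate as a budget of 5.

```python
# Cumulative solve rate after each iteration, from the non-agentic table:
# iteration 5 adds zero solves, so capping the budget at 4 loses nothing
# while skipping the most expensive, always-failing attempts.
passed_at_iter = [265, 16, 4, 1, 0]   # problems first solved at iteration k
total = 302
solved = 0
for k, n in enumerate(passed_at_iter, start=1):
    solved += n
    print(f"after iteration {k}: {solved}/{total} = {100 * solved / total:.1f}%")
```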

Timing

| Status | Avg Time | Median | Max |
|---|---|---|---|
| Passed | 55.8s | 47.1s | 340.5s |
| Failed | 383.1s | 254.8s | 1,020s |

Failures take 6.9x longer on average — the system exhausts its iteration budget before admitting defeat.


Key Takeaways

  1. 94.7% non-agentic on CVDP demonstrates that iterative EDA feedback with a strong foundation model solves the vast majority of realistic RTL problems — not just textbook exercises.
  2. Efficiency over brute force: ChipCraftX matches or beats ACE-RTL on 3/4 categories with 5 iterations vs their 150 attempts.
  3. Agentic is the frontier: 64.1% on multi-file integration tasks shows where the real challenge lies. Code completion and spec-to-RTL are approaching saturation; module integration and debugging at scale are the next problems to solve.
  4. Transparency builds trust: We report our cid016 gap honestly. ACE-RTL's restart mechanism is a good idea that we plan to adopt. The 16 non-agentic failures are understood and actively being addressed.

Results as of March 2026. ACE-RTL data from arXiv:2602.10218 (February 2026). All ChipCraftX evaluations use Icarus Verilog + cocotb on the open-source CVDP subset.