CVDP Performance Analysis
ChipCraftX achieves 94.7% functional correctness on NVIDIA's CVDP benchmark across 302 non-agentic problems and 64.1% on 92 agentic problems — the highest reported scores on an open-source evaluation toolchain. On the four CID categories shared with ACE-RTL (NVIDIA, arXiv:2602.10218), ChipCraftX matches or beats ACE-RTL on 3 of 4 categories while using 30x fewer generation attempts.
The CVDP Benchmark
The Comprehensive Verilog Design Problems (CVDP) benchmark from NVIDIA (arXiv:2506.14074) contains 649 RTL problems derived from real chip design workflows. Unlike VerilogEval — which tests short, textbook-style snippets — CVDP problems feature specifications averaging 18KB, multiple interacting submodules, and cocotb-based Python testbenches that exercise silicon-realistic edge cases.
CVDP organizes problems into 8 CID categories spanning code completion, spec-to-RTL generation, code modification, module reuse/integration, optimization, debugging, testbench generation, and assertion writing.
Evaluation scope
All ChipCraftX results below run on the open-source CVDP subset with Icarus Verilog + cocotb: 302 non-agentic problems (5 CID categories, easy + medium) and 92 agentic problems (4 CID categories, easy + medium + hard), 394 problems in total.
Non-Agentic Results
302 problems across 5 CID categories (easy + medium difficulty). ChipCraftX uses up to 5 iterations with EDA tool feedback — compile errors and simulation results inform each retry.
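The retry mechanism can be sketched as a small loop. The three helpers below (`generate_rtl`, `compile_rtl`, `simulate`) are hypothetical stubs standing in for the model call, the Icarus Verilog compile, and the cocotb run; they illustrate the control flow, not ChipCraftX's actual interfaces.

```python
def generate_rtl(spec: str, feedback: str) -> str:
    # Stub for the LLM call: a real system prompts with spec + prior tool logs.
    return f"// RTL for {spec} (revised after: {feedback!r})"

def compile_rtl(rtl: str) -> tuple[bool, str]:
    # Stub for iverilog: returns (success, log).
    return True, ""

def simulate(rtl: str) -> tuple[bool, str]:
    # Stub for the cocotb testbench: here, passes once feedback was applied.
    return "''" not in rtl, "assertion failed at t=120ns"

def solve(spec: str, max_iters: int = 5):
    """Retry up to max_iters times, feeding EDA tool output into each retry."""
    feedback = ""
    for attempt in range(1, max_iters + 1):
        rtl = generate_rtl(spec, feedback)
        ok, log = compile_rtl(rtl)
        if ok:
            ok, log = simulate(rtl)
        if ok:
            return rtl, attempt          # passed compile + simulation
        feedback = log                   # error log informs the next attempt
    return None, max_iters               # budget exhausted
```

With these stubs, the first attempt fails simulation and the second, informed by the error log, passes — mirroring how compile errors and assertion failures feed each retry.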
| CID | Category | Problems | Pass Rate |
|---|---|---|---|
| cid002 | RTL Code Completion | 94 | 93.6% |
| cid003 | Spec-to-RTL Generation | 78 | 96.2% |
| cid004 | RTL Code Modification | 55 | 96.4% |
| cid007 | RTL Optimization | 40 | 97.5% |
| cid016 | Bug Fixing | 35 | 88.6% |
| Overall | | 302 | 94.7% |
By Difficulty
| CID | Easy | Medium |
|---|---|---|
| cid002 | 46/48 (95.8%) | 42/46 (91.3%) |
| cid003 | 41/41 (100.0%) | 34/37 (91.9%) |
| cid004 | 29/30 (96.7%) | 24/25 (96.0%) |
| cid007 | 21/22 (95.5%) | 18/18 (100.0%) |
| cid016 | 20/21 (95.2%) | 11/14 (78.6%) |
Performance is consistent across difficulty levels: medium trails easy by under 5pp on cid002 and cid004, by about 8pp on cid003, and actually exceeds easy on cid007. The one exception is cid016 (bug fixing), where medium-difficulty problems drop to 78.6%; these involve subtle timing bugs and multi-signal race conditions that are harder to diagnose from simulation output alone.
87.8% of problems pass on the first attempt (265 of 302). Only 37 problems required retries, and of those, 21 were resolved on iterations 2-4.
Head-to-Head: ChipCraftX vs ACE-RTL
ACE-RTL (NVIDIA, arXiv:2602.10218, February 2026) is the most directly comparable system: it combines a fine-tuned RTL-specialized LLM with an agentic feedback loop, evaluated on the same CVDP benchmark. Their paper reports results for 18 systems across four CID categories — the same four we share. All scores use APR (Agentic Pass Rate).
vs Agentic Systems
| Model | cid002 | cid003 | cid004 | cid016 |
|---|---|---|---|---|
| ChipCraftX (Iterative@5) | 93.6 | 96.2 | 96.4 | 88.6 |
| ACE-RTL (5x parallel, 30 iter) | 80.85 | 96.15 | 90.91 | 91.43 |
| ACE-RTL (Claude4 generator) | 80.85 | 89.74 | 81.82 | 88.57 |
| ACE-RTL-Generator (Pass@1) | 39.57 | 49.74 | 65.09 | 57.14 |
| ScaleRTL†-32B | 29.79 | 35.90 | 32.73 | 40.00 |
vs Frontier LLMs
| Model | cid002 | cid003 | cid004 | cid016 |
|---|---|---|---|---|
| ChipCraftX (Iterative@5) | 93.6 | 96.2 | 96.4 | 88.6 |
| GPT-5 | 39.36 | 47.44 | 45.45 | 60.00 |
| Claude4-Sonnet | 39.36 | 51.28 | 49.09 | 54.29 |
| o4-mini | 37.23 | 45.45 | 44.44 | 58.82 |
| DeepSeek-R1 | 39.36 | 42.31 | 43.64 | 51.43 |
| DeepSeek-v3.1 | 37.23 | 48.72 | 41.82 | 40.00 |
| Llama4-Maverick | 28.72 | 32.05 | 38.18 | 37.14 |
| Qwen3-Coder-480B | 31.91 | 35.90 | 41.82 | 42.86 |
| Kimi-K2 | 25.53 | 29.49 | 32.73 | 31.43 |
vs RTL-Specialized Models
| Model | cid002 | cid003 | cid004 | cid016 |
|---|---|---|---|---|
| ChipCraftX (Iterative@5) | 93.6 | 96.2 | 96.4 | 88.6 |
| ScaleRTL-32B | 27.66 | 33.33 | 30.91 | 37.14 |
| OriGen-7B | 21.28 | 21.79 | 16.36 | 11.43 |
| CraftRTL-15B | 11.70 | 17.95 | 16.36 | 8.57 |
| CodeV-7B | 6.38 | 7.69 | 0.00 | 0.00 |
| RTLCoder-v1.1-7B | 1.06 | 5.13 | 1.82 | 2.86 |
The Delta
| CID | ChipCraftX | ACE-RTL | Delta | Winner |
|---|---|---|---|---|
| cid002 (Code Completion) | 93.6 | 80.85 | +12.75pp | ChipCraftX |
| cid003 (Spec-to-RTL) | 96.2 | 96.15 | +0.05pp | Tied |
| cid004 (Code Modification) | 96.4 | 90.91 | +5.49pp | ChipCraftX |
| cid016 (Bug Fixing) | 88.6 | 91.43 | -2.83pp | ACE-RTL |
ChipCraftX leads convincingly on code completion (+12.75pp) and code modification (+5.49pp), ties on spec-to-RTL, and trails by 2.83pp on bug fixing. The cid016 gap likely reflects ACE-RTL's restart mechanism — when a debugging trajectory stalls, their Coordinator discards it and starts fresh with distilled insights. Our pipeline does not yet implement trajectory restarts; when an iteration stalls, we continue refining the same code rather than starting over.
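As a quick sanity check, the deltas in the table can be recomputed directly from the two score rows (numbers copied from the head-to-head tables above):

```python
# Shared-category APR scores copied from the head-to-head tables above.
chipcraftx = {"cid002": 93.6, "cid003": 96.2, "cid004": 96.4, "cid016": 88.6}
ace_rtl    = {"cid002": 80.85, "cid003": 96.15, "cid004": 90.91, "cid016": 91.43}

# Positive delta: ChipCraftX ahead; negative: ACE-RTL ahead.
deltas = {cid: round(chipcraftx[cid] - ace_rtl[cid], 2) for cid in chipcraftx}
# deltas == {"cid002": 12.75, "cid003": 0.05, "cid004": 5.49, "cid016": -2.83}
```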
Efficiency matters
ACE-RTL's best configuration runs 5 parallel trajectories of 30 iterations each, up to 150 generation attempts per problem. ChipCraftX's scores come from at most 5 sequential iterations, a 30x smaller attempt budget.
Agentic Results
92 problems across 4 CID categories (easy + medium + hard difficulty). Agentic problems provide multi-file context — existing RTL modules that the generated code must integrate with.
| CID | Category | Problems | Pass Rate |
|---|---|---|---|
| cid003 | Spec-to-RTL Generation | 34 | 85.3% |
| cid004 | RTL Code Modification | 25 | 52.0% |
| cid005 | Module Reuse / Integration | 22 | 40.9% |
| cid016 | Bug Fixing | 11 | 72.7% |
| Overall | | 92 | 64.1% |
By Difficulty
| CID | Easy | Medium | Hard |
|---|---|---|---|
| cid003 | 5/5 (100%) | 22/25 (88.0%) | 2/4 (50.0%) |
| cid004 | 4/4 (100%) | 7/15 (46.7%) | 2/6 (33.3%) |
| cid005 | 1/1 (100%) | 3/12 (25.0%) | 5/9 (55.6%) |
| cid016 | 7/7 (100%) | 1/3 (33.3%) | 0/1 (0.0%) |
The agentic subset is substantially harder — easy problems are universally solved (100% across all CIDs), but medium and hard problems expose the challenge of multi-file context. The 64.1% overall rate shows clear room for improvement, with module reuse/integration (40.9%) and hard code modification (33.3%) as primary bottlenecks.
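The 64.1% overall figure follows from the per-CID pass counts; the passed totals below are sums of the easy/medium/hard cells in the by-difficulty table:

```python
# (passed, total) per agentic CID, summed from the by-difficulty table above.
agentic = {
    "cid003": (29, 34),   # 5 + 22 + 2 passed
    "cid004": (13, 25),   # 4 + 7 + 2
    "cid005": (9, 22),    # 1 + 3 + 5
    "cid016": (8, 11),    # 7 + 1 + 0
}
passed = sum(p for p, _ in agentic.values())     # 59
total  = sum(t for _, t in agentic.values())     # 92
overall = round(100 * passed / total, 1)         # 64.1
```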
Iteration Dynamics
Non-Agentic
| Iterations | Passed | Failed | Pass Rate | Avg Time |
|---|---|---|---|---|
| 1 | 265 | 0 | 100% | 50.2s |
| 2 | 16 | 0 | 100% | 103.1s |
| 3 | 4 | 2 | 66.7% | 336.6s |
| 4 | 1 | 1 | 50.0% | 324.9s |
| 5 | 0 | 13 | 0% | 343.9s |
Agentic
| Iterations | Passed | Failed | Pass Rate | Avg Time |
|---|---|---|---|---|
| 1 | 53 | 0 | 100% | 50.7s |
| 2 | 4 | 0 | 100% | 146.5s |
| 4 | 2 | 2 | 50.0% | 287.5s |
| 5 | 0 | 31 | 0% | 246.7s |
The pattern is striking: every problem that reaches 5 iterations fails. This is a hard ceiling, not a soft degradation. If the pipeline hasn't solved a problem in 4 attempts, the 5th attempt has zero empirical chance of succeeding.
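The ceiling is visible directly in the data. A minimal check over the (passed, failed) pairs from the two iteration tables:

```python
# (passed, failed) by iteration count, copied from the two tables above.
non_agentic = {1: (265, 0), 2: (16, 0), 3: (4, 2), 4: (1, 1), 5: (0, 13)}
agentic     = {1: (53, 0), 2: (4, 0), 4: (2, 2), 5: (0, 31)}

def pass_rate_at(table: dict, iters: int) -> float:
    """Fraction of problems finishing at `iters` iterations that passed."""
    passed, failed = table[iters]
    return passed / (passed + failed)

fifth = (pass_rate_at(non_agentic, 5), pass_rate_at(agentic, 5))
# fifth == (0.0, 0.0): no problem was ever rescued by a 5th iteration.
```

Capping the budget at 4 iterations would therefore cost nothing in pass rate while saving the several hundred seconds spent on each doomed 5th attempt.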
Timing
| Status | Avg Time | Median | Max |
|---|---|---|---|
| Passed | 55.8s | 47.1s | 340.5s |
| Failed | 383.1s | 254.8s | 1,020s |
Failures take 6.9x longer on average — the system exhausts its iteration budget before admitting defeat.
Key Takeaways
- 94.7% non-agentic on CVDP demonstrates that iterative EDA feedback with a strong foundation model solves the vast majority of realistic RTL problems — not just textbook exercises.
- Efficiency over brute force: ChipCraftX matches or beats ACE-RTL on 3/4 categories with 5 iterations vs their 150 attempts.
- Agentic is the frontier: 64.1% on multi-file integration tasks shows where the real challenge lies. Code completion and spec-to-RTL are approaching saturation; module integration and debugging at scale are the next problems to solve.
- Transparency builds trust: We report our cid016 gap honestly. ACE-RTL's restart mechanism is a good idea that we plan to adopt. The 16 non-agentic failures are understood and actively being addressed.
Results as of March 2026. ACE-RTL data from arXiv:2602.10218 (February 2026). All ChipCraftX evaluations use Icarus Verilog + cocotb on the open-source CVDP subset.