Sprint Log.
What shipped, when, and what it proved. Most recent first.
The Pipeline Is the Moat: +24.5 Points on a Free, On-Prem Model
The ChipCraftX pipeline turns a free, open-weight model into a production RTL engine. Running Qwen3.6-35B on two consumer RTX 3090s - fully on-prem, $0 per run - we scored 86.4% across all 302 NVIDIA CVDP problems. The raw model on its own manages 61.9%. The architecture does the rest: +24.5 points of retrieval, hierarchical decomposition, orchestration, and EDA validation. That's the whole thesis - the model is a commodity you can swap; the pipeline is the product. It's model-agnostic by design, so every gain in open weights is one we inherit for free, at zero marginal cost, on hardware we own. Every result is scored on NVIDIA's own cocotb functional harness - real simulation, not lint, not self-grading.
- ›A free, open model on two consumer GPUs ($0/run) scores 86.4% across all 302 CVDP problems - fully on-prem
- ›The pipeline adds +24.5 points over the raw model (61.9% to 86.4%): the architecture is the moat, not the model
- ›Model-agnostic by design - as open weights improve, ChipCraftX rides them at zero marginal cost
- ›Strong where it's hardest: 94.3% on bug-fixing
- ›Scored on NVIDIA's own cocotb functional harness - real simulation, not lint, not self-grading
- ›86.4% on CVDP-302 (261/302) - Qwen3.6-35B, 2x RTX 3090, $0 marginal cost
- ›+24.5 points from the pipeline: 61.9% raw to 86.4% orchestrated
- ›Bug-fixing: 94.3%
- ›$0 per run - fully on-prem inference
“The model is the commodity. The pipeline is the product. A free model on hardware we own already does production-grade work - and as open weights get better, we get better for free.”
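The headline arithmetic can be checked from the figures quoted in this entry. The following is a minimal sketch that assumes only the numbers stated above (261 of 302 problems passed, 61.9% for the raw model). The function names are illustrative and are not part of ChipCraftX.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


def uplift_points(baseline_pct: float, pipeline_pct: float) -> float:
    """Absolute gain in percentage points (not a relative percentage)."""
    return round(pipeline_pct - baseline_pct, 1)


orchestrated = pass_rate(261, 302)  # counts quoted in this entry
raw_model = 61.9                    # raw-model score quoted in this entry
print(orchestrated, uplift_points(raw_model, orchestrated))  # prints: 86.4 24.5
```

The gain is reported in percentage points, so it is a plain difference between the two scores, not a ratio.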
ChipCraftBrain Paper Published on arXiv
Our research paper — ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration — is now published on arXiv. It formalizes the architecture behind our benchmark results: a hybrid symbolic-neural system that orchestrates six specialized agents over a 168-dimensional state space, combining pattern retrieval, hierarchical decomposition, and adaptive reinforcement learning to solve RTL generation at production accuracy.
- ›Adaptive multi-agent orchestration over a 168-dimensional state space
- ›Hybrid architecture: algorithmic solvers for logic, neural models for timing and RTL
- ›321 pattern templates + 971 open-source implementations for targeted retrieval
- ›Hierarchical decomposition with synchronized module interfaces
- ›RISC-V case study: full validated hardware generation end-to-end
- ›Zero fine-tuning — pure architectural advantage over trained competitors
- ›98.72% on VerilogEval-Human (154/156 problems)
- ›94.7% on NVIDIA CVDP (286/302 problems)
Local Testbench Model: 100% semantic accuracy, zero API cost
We trained a dedicated testbench LLM that replaces all cloud API calls for testbench generation. When it produces a testbench that compiles, that testbench passes simulation 100% of the time: zero semantic failures. The remaining 7.3% of outputs fail at compilation only, and those failures are recoverable through iteration. This drops per-run testbench cost to zero and unblocks unlimited-budget preference training for the RTL model. The model was fine-tuned on a curated dataset of ~150K verified entries. That verify-before-train discipline is why compiled outputs show zero semantic failures: the model learned only from correct examples, and it generalizes to novel specs.
- ›100% simulation pass on compiled outputs
- ›92.7% generalization rate (the remaining 7.3% are compilation-only failures)
- ›Replaces all cloud API calls for testbench generation, dropping per-run testbench cost to $0
- ›Runs entirely on local GPU infrastructure
- ›Enables unlimited-budget preference training for RTL model improvement
“Verification is the moat. Every piece we bring in-house makes the loop tighter.”
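The "recoverable through iteration" step for compilation-only failures can be pictured as a retry loop that feeds the compiler log back into the next generation attempt. This is a hedged sketch, not the ChipCraftX implementation: `generate_until_compiles`, `CompileResult`, the attempt limit, and the stub functions are all hypothetical stand-ins so the example runs without a model or a simulator.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CompileResult:
    ok: bool
    log: str = ""


def generate_until_compiles(
    generate: Callable[[str, Optional[str]], str],
    compile_tb: Callable[[str], CompileResult],
    spec: str,
    max_attempts: int = 3,  # assumed limit, not taken from the source
) -> Tuple[Optional[str], int]:
    """Return (testbench, attempts_used); testbench is None if none compiled."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        testbench = generate(spec, feedback)
        result = compile_tb(testbench)
        if result.ok:
            return testbench, attempt
        feedback = result.log  # pass the compiler error to the next attempt
    return None, max_attempts


# Hypothetical stubs: the first attempt omits a semicolon, the retry fixes it.
def fake_generate(spec: str, feedback: Optional[str]) -> str:
    return "module tb; endmodule" if feedback else "module tb endmodule"


def fake_compile(tb: str) -> CompileResult:
    ok = ";" in tb
    return CompileResult(ok=ok, log="" if ok else "syntax error: missing ';'")


tb, attempts = generate_until_compiles(fake_generate, fake_compile, "counter spec")
print(attempts)  # prints: 2
```

In a real setup the two callables would wrap the local testbench model and an actual compile step. Only compilation is retried here, because the entry reports no semantic failures once a testbench compiles.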
CVDP: 54.24-point pass rate improvement over SOTA baseline, in only 1.28 iterations on average
The previous SOTA relied on fine-tuned models, massive compute budgets, and hundreds of retries. ChipCraftBrain solves the same problems in 1.28 iterations on average — lifting the pass rate by 54.24 percentage points through a fundamentally different approach.
- ›94.7% on NVIDIA CVDP (302 problems) — the hardest public RTL benchmark
- ›Wins 3/4 shared categories vs ACE-RTL: Code Completion +12.75pp, Modification +5.49pp, Spec-to-RTL tied
- ›5 iterations vs 150 attempts — 30× more efficient
- ›GPT-5 peaked at 60%. Their fine-tuned generator: 67%. ChipCraftBrain: 94.7%.
- ›94.7% overall — #1 published result on CVDP
- ›96.4% Code Modification, 96.2% Spec-to-RTL, 93.6% Code Completion
- ›97.5% RTL Optimization — a category ACE-RTL didn't even attempt
- ›Zero fine-tuning, zero custom training data — pure architectural advantage
“We invite all other competitors to publish their scores on CVDP and similar benchmarks for fair evaluation of the systems.”
VerilogEval 98.7%, #1 on the Benchmark
ChipCraftBrain achieves a 98.72% pass rate on the full 156-problem VerilogEval functional benchmark, surpassing every published and commercial system, including MAGE (95.9%), ChipAgents (97.4%, closed-source), VFlow (83.6%), and CodeV (59.2%).
- ›154 of 156 problems passed via real testbench simulation
- ›Beats MAGE (95.9%), the best published academic result
- ›Beats ChipAgents (97.4%), the only commercial competitor, whose methodology is unpublished
- ›Average 1.14 iterations per problem, near first-try accuracy
- ›Full benchmark completed in 35.5 minutes
- ›98.72% pass rate on VerilogEval (156 problems)
- ›1.14 average iterations to solution
- ›154/156 simulation pass rate (98.72%)
- ›1.32 percentage points above ChipAgents (97.4%)
- ›2.82 percentage points above MAGE (95.9%)
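The margins over each system named in this entry follow directly from the quoted scores. A small sketch, assuming only the figures stated above (154 of 156 problems passed, and the competitor scores as listed); the variable names are illustrative.

```python
# 154 of 156 problems passed, as quoted in this entry.
ours = round(100.0 * 154 / 156, 2)

# Competitor pass rates as listed in this entry.
published = {"MAGE": 95.9, "ChipAgents": 97.4, "VFlow": 83.6, "CodeV": 59.2}

# Margin in percentage points: a plain difference, rounded to two decimals.
margins = {name: round(ours - score, 2) for name, score in published.items()}

for name, margin in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{name}: +{margin:.2f} pp")
```

Sorting by margin puts the closest competitor first, which makes the tightest comparison easy to spot.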