The Data Moat
Most RTL datasets are scraped. Ours is verified. Every training record passes through a real EDA toolchain — compile, simulate, lint — before it earns a place in the corpus.
How the corpus is made
Real silicon, not scraped text
973 production-grade RTL modules curated from 23 open-source silicon projects — AXI, PCIe, Ethernet, RISC-V cores, crypto — plus 656 hand-curated design patterns and microarchitectural idioms.
RAG-augmented generation
Our own GPU fleet generates spec → RTL → testbench triples, with knowledge-base patterns injected into every prompt. The dataset grows itself.
The EDA gate
Every record must survive a real toolchain: iverilog compile, simulation against a testbench, lint. Of the 600,007 records collected, only 188,982 (under a third) make it into production training sets.
Specialist corpora
Supervised fine-tuning (SFT) sets for RTL generation and testbench writing, direct preference optimization (DPO) pairs from rejection sampling, and 29K state-action-reward trajectories for the reinforcement learning (RL) world model.
The flywheel turns
Deployed models produce new generation trajectories. Failures become preference data, successes become golden pairs — and both flow back through the gate.
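The routing step in this loop can be sketched in a few lines. This is an illustration only: the `Trajectory` record, its fields, and `route_trajectories` are hypothetical names, and pairing each failing attempt with a passing attempt for the same spec is one common way to form preference pairs, not necessarily the rule the production pipeline uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record for one generation attempt."""
    spec: str          # specification the model was given
    rtl: str           # RTL the model produced
    passed_gate: bool  # True if the attempt survived the EDA gate


def route_trajectories(trajectories):
    """Split gate-checked attempts into golden pairs and preference pairs.

    Golden pairs are (spec, rtl) for every passing attempt. Preference pairs
    match each failing attempt with a passing attempt for the same spec.
    Failures with no passing sibling are dropped in this sketch.
    """
    by_spec = defaultdict(lambda: {"pass": [], "fail": []})
    for t in trajectories:
        by_spec[t.spec]["pass" if t.passed_gate else "fail"].append(t)

    golden, preferences = [], []
    for spec, group in by_spec.items():
        golden.extend((spec, t.rtl) for t in group["pass"])
        if group["pass"]:
            chosen = group["pass"][0]  # assumption: first passing attempt wins
            for rejected in group["fail"]:
                preferences.append(
                    {"prompt": spec, "chosen": chosen.rtl, "rejected": rejected.rtl}
                )
    return golden, preferences
```

Both outputs would still have to pass back through the gate before entering a training set; the sketch covers only the split.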
Quality is a gate, not a goal
188,982 of 600,007 collected records survive to production — a 31% acceptance rate. Most data pipelines optimize for volume; ours optimizes for what compiles and simulates.
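The acceptance rate can be checked directly from the two figures quoted here. The snippet below is a plain arithmetic check, not the repository's reproduction script; the constant names are illustrative.

```python
from fractions import Fraction

COLLECTED = 600_007  # records collected (figure quoted on this page)
ACCEPTED = 188_982   # records that survive to production (figure quoted on this page)

# Exact ratio, so the comparison against one third has no rounding error.
rate = Fraction(ACCEPTED, COLLECTED)

print(f"acceptance rate: {float(rate):.1%}")
print("under a third:", rate < Fraction(1, 3))
```

The ratio comes out at roughly 31.5%, which rounds to the 31% quoted and sits below one third.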
Compile gate
iverilog / Verilator — every module and testbench must build against a real EDA toolchain
Simulation gate
Generated RTL runs against its testbench; only verified behavior enters the corpus
Quality filter
The raw 125,777-record testbench corpus yields 99,526 records (79%) after automated repair and compile verification
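The compile, simulate, and lint steps can be sketched as three tool invocations. This is a minimal sketch under stated assumptions: the function names, the file layout, the strict `-Wall` lint setting, and the rule that any non-zero exit code fails the record are all illustrative, and the production gate may run different flags or checks.

```python
import subprocess
from pathlib import Path


def gate_commands(rtl: Path, testbench: Path, workdir: Path) -> list[list[str]]:
    """Return the tool invocations for compile, simulate, and lint, in order."""
    sim_image = workdir / "sim.vvp"
    return [
        # Compile the design and its testbench into an Icarus simulation image.
        ["iverilog", "-g2012", "-o", str(sim_image), str(rtl), str(testbench)],
        # Run the compiled simulation.
        ["vvp", str(sim_image)],
        # Lint the design alone, without building a simulator.
        ["verilator", "--lint-only", "-Wall", str(rtl)],
    ]


def passes_gate(rtl: Path, testbench: Path, workdir: Path, run=subprocess.run) -> bool:
    """Run each step in order and stop at the first non-zero exit code.

    `run` is injectable so the logic can be exercised without the tools installed.
    """
    for cmd in gate_commands(rtl, testbench, workdir):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return False
    return True
```

A testbench that reports failure only through `$display` will still exit cleanly, so a real gate would also have to inspect the simulation output rather than rely on exit codes alone.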
973 modules of real silicon
Curated from 23 open-source silicon projects across 14 design categories: 303,058 lines of production RTL that ground retrieval and generation in how hardware is actually built.
| Category | Modules |
|---|---|
| CPU Building Blocks | 174 |
| Datapath & Arithmetic | 165 |
| Protocol Bridges | 153 |
| Network & Packet | 98 |
| Peripheral Controllers | 78 |
| DSP | 77 |
| Flow Control & FIFOs | 65 |
| Memory | 37 |
| Crypto | 33 |
| System Infrastructure | 33 |
| Clock & Reset / CDC | 20 |
| SerDes & Encoding | 18 |
| Arbitration | 11 |
| Error Correction | 11 |
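The table can be cross-checked against the headline figures with a few lines of Python. The counts are copied from the table above; the dictionary name is illustrative, and this is not the repository's reproduction script.

```python
# Module counts per category, copied from the table above.
MODULES_BY_CATEGORY = {
    "CPU Building Blocks": 174,
    "Datapath & Arithmetic": 165,
    "Protocol Bridges": 153,
    "Network & Packet": 98,
    "Peripheral Controllers": 78,
    "DSP": 77,
    "Flow Control & FIFOs": 65,
    "Memory": 37,
    "Crypto": 33,
    "System Infrastructure": 33,
    "Clock & Reset / CDC": 20,
    "SerDes & Encoding": 18,
    "Arbitration": 11,
    "Error Correction": 11,
}

category_count = len(MODULES_BY_CATEGORY)
module_total = sum(MODULES_BY_CATEGORY.values())

print(f"{category_count} categories, {module_total} modules")
# prints "14 categories, 973 modules"
```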
What the models learn from
Full-testbench and stimulus modes, quality-filtered with compile verification
State-action-reward sequences captured from live generation runs
Mixture-of-experts corpora for 14B and 27B local models
Spec-to-RTL, completion, and study tasks built on real silicon modules
Claude-distilled examples, hierarchical decomposition annotations, preference pairs
Every number on this page is reproducible from source files by a single script in our data repository — the same discipline we apply to benchmark results.
Figures derived from source data on 2026-06-12.