DATA

The Data Moat.

Most RTL datasets are scraped. Ours is verified. Every training record passes through a real EDA toolchain — compile, simulate, lint — before it earns a place in the corpus.

188,982
Production training records
Deduplicated, EDA-verified, ready to train
600,007
Records collected
Raw corpus across 65 dataset files, 2.5 GB
973
Reference RTL modules
303,058 lines from real silicon projects
656
Curated knowledge entries
Design patterns, architecture notes, microarch idioms
PIPELINE

How the corpus is made

01 · SOURCE

Real silicon, not scraped text

973 production-grade RTL modules curated from 23 open-source silicon projects — AXI, PCIe, Ethernet, RISC-V cores, crypto — plus 656 hand-curated design patterns and microarchitectural idioms.

02 · SYNTHESIZE

RAG-augmented generation

Our own GPU fleet generates spec → RTL → testbench triples, with knowledge-base patterns injected into every prompt. The dataset grows itself.
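As a rough illustration of injecting knowledge-base patterns into a generation prompt, the sketch below ranks entries by keyword overlap with the spec and prepends the best matches. The `Pattern` type, `retrieve`, `build_prompt`, and the overlap scoring are all assumptions made for illustration; the actual retriever and prompt template are not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical knowledge-base entry: a title plus free-text guidance."""
    title: str
    text: str


def _tokens(s: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,:;()").lower() for w in s.split() if w.strip(".,:;()")}


def retrieve(spec: str, kb: list[Pattern], k: int = 2) -> list[Pattern]:
    """Rank entries by keyword overlap with the spec (a stand-in for a real retriever)."""
    spec_tokens = _tokens(spec)
    scored = [(len(spec_tokens & _tokens(p.title + " " + p.text)), p) for p in kb]
    scored = [(s, p) for s, p in scored if s > 0]  # drop entries with no overlap
    scored.sort(key=lambda sp: sp[0], reverse=True)  # stable: ties keep kb order
    return [p for _, p in scored[:k]]


def build_prompt(spec: str, kb: list[Pattern], k: int = 2) -> str:
    """Prepend the retrieved patterns to the spec as context."""
    hits = retrieve(spec, kb, k)
    context = "\n".join(f"- {p.title}: {p.text}" for p in hits)
    return f"Relevant design patterns:\n{context}\n\nSpec:\n{spec}"
```

A production retriever would more likely use embeddings than keyword overlap; the point here is only the shape of the step, in which retrieved patterns become part of every prompt.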

03 · VERIFY

The EDA gate

Every record must survive a real toolchain: iverilog compile, simulation against its testbench, and lint. Of 600,007 records collected, only 188,982 (under a third) make it into production training sets.
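A minimal sketch of such a gate, assuming Icarus Verilog (`iverilog`, `vvp`) and Verilator are on the PATH. The function names, the file layout, and the `TEST PASSED` marker are assumptions made for illustration: `vvp` usually exits with status 0 even when a testbench reports a failure, so the testbench is assumed to print a known string on success. The real pipeline's pass criteria are not documented here.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Sequence

# A runner takes an argv list and returns (exit_code, combined_output).
Runner = Callable[[Sequence[str]], tuple[int, str]]


def run_tool(argv: Sequence[str]) -> tuple[int, str]:
    """Default runner: execute a real tool; a missing binary or timeout counts as failure."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        return 1, str(exc)
    return proc.returncode, proc.stdout + proc.stderr


def eda_gate(rtl: str, testbench: str, run: Runner = run_tool,
             pass_marker: str = "TEST PASSED") -> tuple[bool, str]:
    """Return (accepted, stage), where stage names the failed gate or is "ok"."""
    with tempfile.TemporaryDirectory() as tmp:
        rtl_path = Path(tmp) / "dut.v"
        tb_path = Path(tmp) / "tb.v"
        sim_path = Path(tmp) / "sim.vvp"
        rtl_path.write_text(rtl)
        tb_path.write_text(testbench)

        # Gate 1: compile design and testbench together with Icarus Verilog.
        code, _ = run(["iverilog", "-g2012", "-o", str(sim_path),
                       str(rtl_path), str(tb_path)])
        if code != 0:
            return False, "compile"

        # Gate 2: simulate; the testbench is assumed to print pass_marker on success.
        code, out = run(["vvp", str(sim_path)])
        if code != 0 or pass_marker not in out:
            return False, "simulate"

        # Gate 3: lint the design alone with Verilator.
        code, _ = run(["verilator", "--lint-only", "-Wall", str(rtl_path)])
        if code != 0:
            return False, "lint"
    return True, "ok"
```

The runner is injectable so the gate logic can be exercised without the tools installed; passing a stub that returns `(0, "TEST PASSED")` for every call yields `(True, "ok")`.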

04 · TRAIN

Specialist corpora

SFT for RTL generation and testbench writing, DPO preference pairs from rejection sampling, and 29K state-action-reward trajectories for the RL world model.

05 · COMPOUND

The flywheel turns

Deployed models produce new generation trajectories. Failures become preference data, successes become golden pairs — and both flow back through the gate.
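One way to picture the routing step is sketched below: each verified success becomes a golden spec-to-RTL pair, and each failure is paired with a success for the same spec as a chosen/rejected preference example. The `Attempt` record, the `route` function, and the rule that failures without a matching success are dropped are all assumptions made for illustration, not a description of the production pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """Hypothetical record of one generation attempt and its gate verdict."""
    spec: str
    rtl: str
    passed: bool


def route(attempts: list[Attempt]) -> tuple[list[dict], list[dict]]:
    """Split attempts into golden SFT pairs and chosen/rejected preference pairs."""
    by_spec: dict[str, list[Attempt]] = defaultdict(list)
    for a in attempts:
        by_spec[a.spec].append(a)

    golden: list[dict] = []
    preferences: list[dict] = []
    for spec, group in by_spec.items():
        winners = [a for a in group if a.passed]
        losers = [a for a in group if not a.passed]
        # Every verified success is a golden spec -> RTL pair.
        golden += [{"prompt": spec, "completion": w.rtl} for w in winners]
        # A failure only becomes preference data when a success exists for the same spec.
        if winners:
            preferences += [
                {"prompt": spec, "chosen": winners[0].rtl, "rejected": loser.rtl}
                for loser in losers
            ]
    return golden, preferences
```

Under this sketch, a spec whose attempts all fail contributes nothing; whether such failures are kept for later retries is a design choice the page does not specify.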

VERIFICATION

Quality is a gate, not a goal

188,982 of 600,007 collected records survive to production — a 31% acceptance rate. Most data pipelines optimize for volume; ours optimizes for what compiles and simulates.

GATE 01

Compile gate

iverilog / Verilator — every module and testbench must build against a real EDA toolchain

GATE 02

Simulation gate

Generated RTL runs against its testbench; only verified behavior enters the corpus

GATE 03

Quality filter

The raw 125,777-record testbench corpus yields 99,526 records (79%) after automated repair and compile verification

REFERENCE LIBRARY

973 modules of real silicon

Curated from 23 open-source silicon projects across 14 design categories — 303,058 lines of production RTL that grounds retrieval and generation in how hardware is actually built.

Category                    Modules
CPU Building Blocks             174
Datapath & Arithmetic           165
Protocol Bridges                153
Network & Packet                 98
Peripheral Controllers           78
DSP                              77
Flow Control & FIFOs             65
Memory                           37
Crypto                           33
System Infrastructure            33
Clock & Reset / CDC              20
SerDes & Encoding                18
Arbitration                      11
Error Correction                 11
TRAINING CORPUS

What the models learn from

Testbench generation SFT
139,489 records

Full-testbench and stimulus modes, quality-filtered with compile verification

RL world-model trajectories
29,297 records

State-action-reward sequences captured from live generation runs

RTL generation SFT
16,439 records

Mixture-of-experts corpora for 14B and 27B local models

Reference-library derived
2,949 records

Spec-to-RTL, completion, and study tasks built on real silicon modules

Distillation, decomposition & DPO
808 records

Claude-distilled examples, hierarchical decomposition annotations, preference pairs

PROVENANCE

Every number on this page is reproducible from source files by a single script in our data repository — the same discipline we apply to benchmark results.
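The figures printed on this page can at least be cross-checked against each other. The sketch below is not the repository script mentioned above; it uses only numbers that appear on this page, and the constant and function names are chosen for illustration.

```python
# Figures exactly as printed on this page.
CATEGORY_MODULES = {
    "CPU Building Blocks": 174, "Datapath & Arithmetic": 165,
    "Protocol Bridges": 153, "Network & Packet": 98,
    "Peripheral Controllers": 78, "DSP": 77, "Flow Control & FIFOs": 65,
    "Memory": 37, "Crypto": 33, "System Infrastructure": 33,
    "Clock & Reset / CDC": 20, "SerDes & Encoding": 18,
    "Arbitration": 11, "Error Correction": 11,
}
CORPUS_RECORDS = {
    "Testbench generation SFT": 139_489,
    "RL world-model trajectories": 29_297,
    "RTL generation SFT": 16_439,
    "Reference-library derived": 2_949,
    "Distillation, decomposition & DPO": 808,
}
COLLECTED = 600_007
PRODUCTION = 188_982
TB_RAW, TB_KEPT = 125_777, 99_526


def check_page_figures() -> dict[str, int]:
    """Raise AssertionError if the printed figures disagree; return derived percentages."""
    # The 14 category counts should add up to the 973 reference modules.
    assert len(CATEGORY_MODULES) == 14
    assert sum(CATEGORY_MODULES.values()) == 973
    # The five training corpora should add up to the production record count.
    assert sum(CORPUS_RECORDS.values()) == PRODUCTION
    # "Under a third" of collected records survive.
    assert PRODUCTION * 3 < COLLECTED
    return {
        "acceptance_pct": round(100 * PRODUCTION / COLLECTED),
        "testbench_yield_pct": round(100 * TB_KEPT / TB_RAW),
    }
```

Calling `check_page_figures()` returns an acceptance rate of 31 and a testbench yield of 79, matching the percentages quoted in the Verification section.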

Figures derived from source data on 2026-06-12.