DATA

The Data Moat.

Most RTL datasets are scraped. Ours is verified. Every training record passes through a real EDA toolchain — compile, simulate, lint — before it earns a place in the corpus.

188,982

Production training records

Deduplicated, EDA-verified, ready to train

600,007

Records collected

Raw corpus across 65 dataset files, 2.5 GB

973

Reference RTL modules

303,058 lines from real silicon projects

656

Curated knowledge entries

Design patterns, architecture notes, microarch idioms

PIPELINE

How the corpus is made

01 · SOURCE

Real silicon, not scraped text

973 production-grade RTL modules curated from 23 open-source silicon projects — AXI, PCIe, Ethernet, RISC-V cores, crypto — plus 656 hand-curated design patterns and microarchitectural idioms.

02 · SYNTHESIZE

RAG-augmented generation

Our own GPU fleet generates spec → RTL → testbench triples, with knowledge-base patterns injected into every prompt. The dataset grows itself.
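To make "patterns injected into every prompt" concrete, here is a minimal sketch of retrieval-augmented prompt assembly. Everything in it is an assumption made for illustration: the `Pattern` shape, the word-overlap `score`, and the prompt wording are hypothetical, and a real retriever would more likely use embeddings than shared words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """A knowledge-base entry (hypothetical shape)."""
    title: str
    body: str


def score(spec: str, pattern: Pattern) -> int:
    """Toy relevance score: number of words shared by the spec and the pattern."""
    spec_words = set(spec.lower().split())
    pattern_words = set((pattern.title + " " + pattern.body).lower().split())
    return len(spec_words & pattern_words)


def build_prompt(spec: str, knowledge_base: list[Pattern], top_k: int = 2) -> str:
    """Place the top_k most relevant patterns ahead of the specification."""
    ranked = sorted(knowledge_base, key=lambda p: score(spec, p), reverse=True)
    context = "\n\n".join(f"### {p.title}\n{p.body}" for p in ranked[:top_k])
    return (
        "Relevant design patterns:\n"
        f"{context}\n\n"
        "Specification:\n"
        f"{spec}\n\n"
        "Write the RTL module and a testbench for it."
    )
```

Calling `build_prompt("Design a synchronous fifo with 16 entries", kb, top_k=1)` on a small knowledge base would put the FIFO pattern, and only that pattern, in front of the spec.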

03 · VERIFY

The EDA gate

Every record must survive a real toolchain: an iverilog compile, simulation against a testbench, and lint. Of 600,007 records collected, only 188,982 (under a third) make it into production training sets.

04 · TRAIN

Specialist corpora

SFT for RTL generation and testbench writing, DPO preference pairs from rejection sampling, and 29,297 state-action-reward trajectories for the RL world model.

05 · COMPOUND

The flywheel turns

Deployed models produce new generation trajectories. Failures become preference data, successes become golden pairs — and both flow back through the gate.
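One way to picture the routing described in step 05 is the sketch below. It is illustrative only: the `Attempt` record, the field names, and the pairing rule (first passing attempt as "chosen", each failing attempt as "rejected") are assumptions, not a description of the production pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One generation attempt for a spec (hypothetical record shape)."""
    spec: str
    rtl: str
    passed_gate: bool  # True if the attempt cleared the EDA gate


def route(attempts: list[Attempt]) -> tuple[list[dict], list[dict]]:
    """Split attempts into golden pairs and preference pairs.

    Golden pairs: every passing (spec, rtl).
    Preference pairs: for each spec with at least one pass, pair the first
    passing attempt (chosen) with each failing attempt (rejected).
    """
    by_spec: dict[str, list[Attempt]] = {}
    for attempt in attempts:
        by_spec.setdefault(attempt.spec, []).append(attempt)

    golden: list[dict] = []
    preference: list[dict] = []
    for spec, group in by_spec.items():
        passes = [a for a in group if a.passed_gate]
        fails = [a for a in group if not a.passed_gate]
        golden.extend({"prompt": spec, "completion": a.rtl} for a in passes)
        if passes:
            preference.extend(
                {"prompt": spec, "chosen": passes[0].rtl, "rejected": f.rtl}
                for f in fails
            )
    return golden, preference
```

Under this rule a spec whose attempts all fail contributes nothing, because a preference pair needs a verified "chosen" side.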

VERIFICATION

Quality is a gate, not a goal

188,982 of 600,007 collected records survive to production — a 31% acceptance rate. Most data pipelines optimize for volume; ours optimizes for what compiles and simulates.

GATE 01

Compile gate

iverilog / Verilator — every module and testbench must build against a real EDA toolchain

GATE 02

Simulation gate

Generated RTL runs against its testbench; only verified behavior enters the corpus

GATE 03

Quality filter

The raw 125,777-record testbench corpus yields 99,526 records (79%) after automated repair and compile verification
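A compile, simulate, and lint gate of the kind described above could be wired up roughly as follows. This is a sketch under stated assumptions, not the production gate: the function names, the `-g2012` language flag, the 60-second timeout, and the use of Verilator for the lint step are choices made for illustration, and it assumes the testbench reports failure through a nonzero simulator exit status.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]  # takes a command, returns its exit code


def run_command(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd, capture_output=True, timeout=60).returncode


def gate_commands(rtl: Path, testbench: Path, workdir: Path) -> list[list[str]]:
    """Commands for the three checks, in order: compile, simulate, lint."""
    image = str(workdir / "sim.vvp")
    return [
        ["iverilog", "-g2012", "-o", image, str(rtl), str(testbench)],  # compile
        ["vvp", image],  # simulate the compiled image
        ["verilator", "--lint-only", str(rtl)],  # lint the RTL only
    ]


def passes_gate(
    rtl: Path, testbench: Path, workdir: Path, run: Runner = run_command
) -> bool:
    """A record passes only if every command exits with status 0.

    all() short-circuits, so simulation is skipped when compilation fails.
    """
    return all(run(cmd) == 0 for cmd in gate_commands(rtl, testbench, workdir))
```

Passing the runner in as a parameter keeps the gate logic testable on a machine with no EDA tools installed.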

REFERENCE LIBRARY

973 modules of real silicon

Curated from 23 open-source silicon projects across 14 design categories — 303,058 lines of production RTL that grounds retrieval and generation in how hardware is actually built.

Category	Modules	RTL Lines
CPU Building Blocks	174	62,488
Datapath & Arithmetic	165	5,924
Protocol Bridges	153	98,871
Network & Packet	98	35,839
Peripheral Controllers	78	25,460
DSP	77	18,128
Flow Control & FIFOs	65	13,525
Memory	37	10,416
Crypto	33	4,619
System Infrastructure	33	18,179
Clock & Reset / CDC	20	1,765
SerDes & Encoding	18	5,593
Arbitration	11	915
Error Correction	11	1,336
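
The category rows can be checked against the headline totals with a few lines of Python. The figures are copied from the table above; the variable names are illustrative.

```python
# (category, modules, RTL lines), copied from the table above
ROWS = [
    ("CPU Building Blocks", 174, 62_488),
    ("Datapath & Arithmetic", 165, 5_924),
    ("Protocol Bridges", 153, 98_871),
    ("Network & Packet", 98, 35_839),
    ("Peripheral Controllers", 78, 25_460),
    ("DSP", 77, 18_128),
    ("Flow Control & FIFOs", 65, 13_525),
    ("Memory", 37, 10_416),
    ("Crypto", 33, 4_619),
    ("System Infrastructure", 33, 18_179),
    ("Clock & Reset / CDC", 20, 1_765),
    ("SerDes & Encoding", 18, 5_593),
    ("Arbitration", 11, 915),
    ("Error Correction", 11, 1_336),
]

total_modules = sum(modules for _, modules, _ in ROWS)
total_lines = sum(lines for _, _, lines in ROWS)

print(len(ROWS), total_modules, total_lines)  # 14 973 303058
```

The 14 rows sum to 973 modules and 303,058 lines, matching the figures quoted for the reference library.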

TRAINING CORPUS

What the models learn from

Testbench generation SFT

139,489 records

Full-testbench and stimulus modes, quality-filtered with compile verification

RL world-model trajectories

29,297 records

State-action-reward sequences captured from live generation runs

RTL generation SFT

16,439 records

Mixture-of-experts corpora for 14B and 27B local models

Reference-library derived

2,949 records

Spec-to-RTL, completion, and study tasks built on real silicon modules

Distillation, decomposition & DPO

808 records

Claude-distilled examples, hierarchical decomposition annotations, preference pairs
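
As a cross-check, the five corpora above add up to the production total, and dividing by the number of records collected gives the acceptance rate. The counts come from this page; the variable names are illustrative.

```python
# Record counts per corpus, copied from the list above
CORPORA = {
    "Testbench generation SFT": 139_489,
    "RL world-model trajectories": 29_297,
    "RTL generation SFT": 16_439,
    "Reference-library derived": 2_949,
    "Distillation, decomposition & DPO": 808,
}
COLLECTED = 600_007  # records collected before the gate

production = sum(CORPORA.values())
acceptance = production / COLLECTED

print(production)           # 188982
print(f"{acceptance:.1%}")  # 31.5%

# Share of the production set held by each corpus
for name, count in CORPORA.items():
    print(f"{name}: {count / production:.1%}")
```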

PROVENANCE

Every number on this page is reproducible from source files by a single script in our data repository — the same discipline we apply to benchmark results.

Figures derived from source data on 2026-06-12.