NVIDIA open-sources the synthetic data framework used to build Nemotron datasets
NVIDIA just open sourced NeMo Data Designer, the synthetic data framework used internally to build both pre-training and post-training datasets for Nemotron.
It lets you define an entire synthetic data pipeline directly in Python: structured outputs, statistical samplers, LLM-generated columns, dependency-aware field relationships, Python/SQL/remote validators, and optional LLM-as-judge scoring. It also supports a quick preview mode for fast iteration before scaling up.
Install:
```
pip install data-designer
```
A minimal example:
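```
from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    SamplerColumnConfig,
    SamplerType,
)

data_designer = DataDesigner()
config = DataDesignerConfigBuilder()

# Sampled column: pick a product category from a fixed list.
config.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"]
        ),
    )
)

# LLM column: the prompt references the sampled column by name.
config.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a short product review for a {{ product_category }} item.",
    )
)

# Generate a small preview and show one record.
preview = data_designer.preview(config_builder=config)
preview.display_sample_record()
```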
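The `{{ product_category }}` placeholder in the prompt is what ties the `review` column to the sampled column. As a rough illustration of what dependency-aware ordering can look like, here is a library-free sketch that infers a generation order from those template references. It does not use Data Designer and is not a description of its internals; `column_templates`, `referenced_columns`, `generation_order`, and the extra `summary` column are hypothetical names made up for this example.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical column specs: name -> prompt template (None for sampled columns).
column_templates = {
    "review": "Write a short product review for a {{ product_category }} item.",
    "product_category": None,
    "summary": "Summarize this review in one sentence: {{ review }}",
}

# Matches {{ name }} placeholders, with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def referenced_columns(template):
    """Return the set of column names referenced as {{ name }} in a template."""
    if template is None:
        return set()
    return set(PLACEHOLDER.findall(template))


def generation_order(templates):
    """Order columns so each one comes after every column it references."""
    graph = {name: referenced_columns(tpl) for name, tpl in templates.items()}
    unknown = set().union(*graph.values()) - set(graph)
    if unknown:
        raise ValueError(f"Templates reference undefined columns: {sorted(unknown)}")
    # TopologicalSorter takes node -> predecessors and raises CycleError on cycles.
    return list(TopologicalSorter(graph).static_order())


order = generation_order(column_templates)
print(order)
```

With these three columns the printed order is `['product_category', 'review', 'summary']`, regardless of the order in which the columns were declared.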
This release also incorporates the synthetic data tech my team originally built at Gretel (now part of NVIDIA), now generally available for anyone to use or extend.
Repo: https://github.com/NVIDIA-NeMo/DataDesigner
Hi all, I’m a co-founder of Gretel; our team and tech are now part of NVIDIA.
NeMo Data Designer is the core product we built at Gretel, and it is now the internal framework we use heavily for both pre- and post-training data in Nemotron across a variety of use cases.
The OSS version is fully general-purpose: Python-first, modular, and designed so you can mix statistical samplers, LLM columns, and seed datasets in a single pipeline.
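To make "mixing" concrete, here is a plain-Python sketch of the idea, not the Data Designer API: seed rows supply real values, a weighted sampler adds a statistical column, and a stubbed generator stands in for an LLM column. Every name here (`seed_rows`, `sample_rating`, `fake_llm`, `build_records`) is hypothetical, and the weights are arbitrary.

```python
import random

# Hypothetical seed dataset: rows supplied by the user.
seed_rows = [
    {"product_name": "USB-C hub"},
    {"product_name": "Rain jacket"},
    {"product_name": "Cast iron skillet"},
]


def sample_rating(rng):
    """Statistical sampler: draw a 1-5 star rating, skewed toward high ratings."""
    return rng.choices([1, 2, 3, 4, 5], weights=[1, 1, 2, 4, 6], k=1)[0]


def fake_llm(prompt):
    """Stand-in for an LLM call; a real pipeline would send the prompt to a model."""
    return f"[generated text for: {prompt}]"


def build_records(seed, num_records, rng):
    """Combine seed values, a sampled column, and a generated column per record."""
    records = []
    for i in range(num_records):
        row = dict(seed[i % len(seed)])  # cycle through the seed rows
        row["rating"] = sample_rating(rng)
        prompt = f"Write a {row['rating']}-star review of the {row['product_name']}."
        row["review"] = fake_llm(prompt)
        records.append(row)
    return records


records = build_records(seed_rows, num_records=5, rng=random.Random(0))
for record in records:
    print(record)
```

Passing a seeded `random.Random` keeps runs reproducible, which helps when comparing prompt changes against the same sampled values.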
Happy to answer questions or hear feedback on missing features.