NVIDIA open sources the synthetic data framework used to build Nemotron datasets

6 points by alexwatson405 3 hours ago

NVIDIA just open sourced NeMo Data Designer, the synthetic data framework used internally to build both pre-training and post-training datasets for Nemotron.

It lets you define an entire synthetic data pipeline directly in Python: structured outputs, statistical samplers, LLM-generated columns, dependency-aware field relationships, Python/SQL/remote validators, and optional LLM-as-judge scoring. It also supports a quick preview mode for fast iteration before scaling up.

Install:

```
pip install data-designer
```

A minimal example:

```
from data_designer.essentials import *

data_designer = DataDesigner()
config = DataDesignerConfigBuilder()

config.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"]
        ),
    )
)

config.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a short product review for a {{ product_category }} item.",
    )
)

preview = data_designer.preview(config_builder=config)
preview.display_sample_record()
```
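For anyone new to the idea, here is a rough, library-independent sketch of what "dependency-aware" columns mean in the example above: the `review` prompt references `{{ product_category }}`, so the category has to be sampled first. This is plain standard-library Python and is not how Data Designer is implemented. The helper names (`render`, `generation_order`, `generate_row`) are hypothetical, and the "LLM" step is replaced by simply rendering the prompt.

```python
import random
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Matches Jinja-style placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, row):
    """Fill each {{ name }} placeholder with the value already in the row."""
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)


def generation_order(templates, sampler_names):
    """Order columns so every column comes after the columns it references."""
    graph = {name: set() for name in sampler_names}  # samplers depend on nothing
    for name, template in templates.items():
        graph[name] = set(PLACEHOLDER.findall(template))
    return list(TopologicalSorter(graph).static_order())


def generate_row(samplers, templates, rng):
    """Build one record: sample the base columns, then render dependent ones."""
    row = {}
    for name in generation_order(templates, samplers):
        if name in samplers:
            row[name] = rng.choice(samplers[name])
        else:
            # Stand-in for an LLM call: we only render the prompt text here.
            row[name] = render(templates[name], row)
    return row


samplers = {
    "product_category": ["Electronics", "Clothing", "Home & Kitchen", "Books"]
}
templates = {
    "review_prompt": "Write a short product review for a {{ product_category }} item."
}

print(generate_row(samplers, templates, random.Random(0)))
```

In this sketch the ordering is derived from the placeholders themselves, so adding a column that references `review_prompt` would automatically be scheduled after it.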

This release also incorporates the synthetic data tech my team originally built at Gretel (now part of NVIDIA), which is now generally available for anyone to use or extend.

Repo: https://github.com/NVIDIA-NeMo/DataDesigner

alexwatson405 3 hours ago

Hi all, I'm a co-founder of Gretel; our team and tech are now part of NVIDIA.

NeMo Data Designer was our core product at Gretel, and it is now the internal framework we use heavily to build both pre- and post-training data for Nemotron across a variety of use cases.

The OSS version is fully general-purpose: Python-first, modular, and designed so you can mix statistical samplers, LLM columns, and seed datasets in a single pipeline.

Happy to answer questions or hear feedback on missing features.