# synthclaw

Lightweight synthetic data generation in Rust. Generate and augment datasets using OpenAI and Anthropic, with support for HuggingFace datasets.

Available as both a CLI tool and a Rust library.
## Installation

### CLI

```sh
cargo install synthclaw
```

### Library

Add to your `Cargo.toml`:

```toml
[dependencies]
synthclaw = "0.1"
```
## Quick Start

The exact flags below are illustrative; run `synthclaw --help` for the full CLI.

```sh
# Generate 50 product reviews across categories
synthclaw generate \
  --template "Generate a realistic product review for: {category}" \
  --categories electronics,books,clothing \
  --count 50

# Or use a config file
synthclaw run config.yaml
```
## CLI Usage

Subcommand and flag names shown here are a sketch; check `synthclaw --help` for the exact interface.

### Explore HuggingFace Datasets

```sh
# Search
synthclaw hf search sentiment

# Get info
synthclaw hf info cornell-movie-review-data/rotten_tomatoes

# Preview rows
synthclaw hf preview cornell-movie-review-data/rotten_tomatoes
```

### Generate Data

```sh
# From scratch with categories
synthclaw generate --config config.yaml

# Dry run (no API calls)
synthclaw generate --config config.yaml --dry-run
```
## Writing Good Prompts

The tool uses sensible system prompts by default to ensure clean outputs; you provide the user prompt template.

### Template Variables

For generate mode:

- `{category}` - the current category being generated
- `{index}` - the item number (0, 1, 2, ...)

For augment mode:

- Any column from the source data: `{text}`, `{label}`, etc.
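Conceptually, template expansion is plain placeholder substitution. The sketch below illustrates the mechanism with a hypothetical `expand` helper; it is not the crate's actual implementation:

```rust
use std::collections::HashMap;

/// Expand `{name}` placeholders in a template from a variable map.
/// Unknown placeholders are left untouched.
fn expand(template: &str, vars: &HashMap<&str, String>) -> String {
    let mut out = template.to_string();
    for (key, value) in vars {
        out = out.replace(&format!("{{{}}}", key), value);
    }
    out
}

fn main() {
    let mut vars = HashMap::new();
    vars.insert("category", "electronics".to_string());
    vars.insert("index", "0".to_string());

    let prompt = expand("Review #{index} for: {category}", &vars);
    assert_eq!(prompt, "Review #0 for: electronics");
    println!("{}", prompt);
}
```

In augment mode the variable map would instead be populated from the source record's columns (`text`, `label`, ...).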
### Good Prompt Examples

Product Reviews:

```yaml
template: |
  Generate a realistic product review for: {category}
  Requirements:
  - Customer perspective, 2-4 sentences
  - Include specific details (brand, features, price)
  - Natural tone - can be positive, negative, or mixed
```

Sentiment Data:

```yaml
template: |
  Generate a {category} movie review.
  Requirements:
  - The sentiment must clearly be {category}
  - 1-3 sentences
  - Mention specific aspects (acting, plot, visuals)
```

Data Augmentation (paraphrase):

```yaml
template: |
  Paraphrase this text while preserving meaning and sentiment:
  Original: {text}
  Paraphrase:
```

Question-Answer Generation:

```yaml
template: |
  Based on this document, generate a Q&A pair:
  Document: {text}
  Output JSON: {"question": "...", "answer": "..."}
system_prompt: |
  Generate educational Q&A pairs. Output ONLY valid JSON.
```
## Configuration

### Generate from Scratch

```yaml
name: "product_reviews"

provider:
  type: openai
  model: "gpt-4o-mini"
  temperature: 0.8

generation:
  task: generate
  count: 100
  concurrency: 10
  categories:
    - electronics
    - books
    - clothing
  template: |
    Generate a realistic {category} product review.
    2-3 sentences, customer perspective, specific details.

output:
  format: jsonl
  path: "./output/reviews.jsonl"
```
### Augment Existing Data

```yaml
name: "sentiment_augmentation"

source:
  type: huggingface
  dataset: "cornell-movie-review-data/rotten_tomatoes"
  split: "train"
  sample: 500

provider:
  type: openai
  model: "gpt-4o-mini"

generation:
  task: augment
  count_per_example: 2
  concurrency: 10
  strategy: paraphrase

output:
  format: jsonl
  path: "./output/augmented.jsonl"
```
### Custom System Prompt

Override the default system prompt when you need specific behavior:

```yaml
generation:
  template: |
    Generate a {category} example in JSON format.
  system_prompt: |
    You are a data generation assistant.
    Output ONLY valid JSON, no markdown, no explanations.
    Schema: {"text": "...", "label": "..."}
```
## Validation

Filter bad outputs and remove duplicates:

```yaml
validation:
  min_length: 20
  max_length: 1000
  json: true            # must be valid JSON
  json_schema:          # required fields
  blocklist: true       # filter "Sure!", "As an AI", etc.
  repetition: true      # filter repetitive text
  dedupe: normalized    # exact | normalized | jaccard
```
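The difference between `exact` and `normalized` dedupe can be sketched as follows: normalized dedupe lowercases and collapses whitespace before comparing, so trivial variations of the same output are caught. This is an illustration, not the crate's implementation:

```rust
use std::collections::HashSet;

/// Collapse whitespace and lowercase, so trivially different outputs
/// ("Great  product!" vs "great product!") compare equal.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase()
}

/// Keep the first occurrence of each normalized form.
fn dedupe_normalized(records: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    records
        .into_iter()
        .filter(|r| seen.insert(normalize(r)))
        .collect()
}

fn main() {
    let records = vec![
        "Great  product!".to_string(),
        "great product!".to_string(),
        "Terrible battery life.".to_string(),
    ];
    let kept = dedupe_normalized(records);
    assert_eq!(kept.len(), 2); // the two "great product" variants collapse to one
}
```

`exact` dedupe would skip the `normalize` step; `jaccard` would instead compare token-set overlap, catching near-duplicates that differ by a few words.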
## Library Usage

Type and module names below are a sketch; see the crate docs for exact paths and signatures.

```rust
use synthclaw::{create_provider, Config, HuggingFaceSource};

// Load HuggingFace dataset
let mut source = HuggingFaceSource::new("cornell-movie-review-data/rotten_tomatoes")?;
let records = source.load()?;

// Create provider and generate
let config = Config::from_file("config.yaml")?;
let provider = create_provider(&config.provider)?;
let response = provider.generate("Generate a realistic electronics product review.").await?;
```
### Validation (Library)

Validator names below mirror the config options above; the exact types are a sketch.

```rust
use synthclaw::validation::{
    validate_and_dedupe, Blocklist, DedupeStrategy, JsonValid, MaxLength, MinLength,
    ValidationPipeline,
};

// `engine` is a configured generation engine; `run` produces the raw outputs.
let results = engine.run().await?;

let pipeline = ValidationPipeline::new()
    .add(MinLength(20))
    .add(MaxLength(1000))
    .add(JsonValid)
    .add(Blocklist::default());

let validated = validate_and_dedupe(results, &pipeline, DedupeStrategy::Normalized);
println!("{} records passed validation", validated.results.len());
for r in validated.results {
    // write or inspect each validated record
}
```
## Output Formats

- `jsonl` - Line-delimited JSON (recommended for large datasets)
- `csv` - Comma-separated values
- `parquet` - Apache Parquet (efficient for analytics)
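JSONL is simply one JSON object per line, which is why it streams well for large datasets. A minimal std-only sketch of the format (a real implementation would use `serde_json` for escaping):

```rust
use std::io::Write;

/// Escape a string for inclusion in a JSON string value
/// (minimal: quotes, backslashes, newlines).
fn json_escape(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '"' => "\\\"".to_string(),
            '\\' => "\\\\".to_string(),
            '\n' => "\\n".to_string(),
            c => c.to_string(),
        })
        .collect()
}

/// Write records as line-delimited JSON: one {"text": ...} object per line.
fn write_jsonl<W: Write>(mut w: W, records: &[String]) -> std::io::Result<()> {
    for r in records {
        writeln!(w, "{{\"text\": \"{}\"}}", json_escape(r))?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut buf = Vec::new();
    write_jsonl(&mut buf, &["first".to_string(), "second".to_string()])?;
    assert_eq!(
        String::from_utf8(buf).unwrap(),
        "{\"text\": \"first\"}\n{\"text\": \"second\"}\n"
    );
    Ok(())
}
```

Because each line is independent, a crashed or interrupted run still leaves a valid, resumable file, unlike a single large JSON array.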
## Environment Variables

```sh
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```
## Roadmap

### Production Scale
- Streaming pipeline (generate → validate → write, no memory accumulation)
- Checkpointing & resume
- Retry with exponential backoff
- Rate limiting
- Budget limits
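The retry item above is a standard pattern; a sketch of what it could look like (not yet in the crate, names are illustrative):

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry a fallible operation with exponential backoff: 100ms, 200ms, 400ms, ...
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(100);
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                sleep(delay);
                delay *= 2; // in practice, add a cap and jitter to avoid thundering herds
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulate an API call that fails twice, then succeeds.
    let mut calls = 0;
    let result: Result<u32, &str> = retry_with_backoff(5, || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok(42) }
    });
    assert_eq!(result, Ok(42));
    assert_eq!(calls, 3);
}
```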
### Providers
- Gemini, Ollama, Azure OpenAI, Together AI, Groq
### Integration
- HuggingFace Hub upload
- Dataset cards