synthclaw 0.1.0

Lightweight synthetic data generation library powered by LLMs
Documentation

synthclaw

Lightweight synthetic data generation in Rust. Generate and augment datasets using LLMs (OpenAI, Anthropic) with support for HuggingFace datasets.

Available as both a CLI tool and a Rust library.

Installation

CLI

cargo install --path .

Library

[dependencies]
synth_claw = { git = "https://github.com/semioz/synthclaw" }

Quick Start

export OPENAI_API_KEY=sk-...

# Generate 50 product reviews across categories
synthclaw generate \
  --prompt "Generate a realistic {category} product review, 2-3 sentences" \
  --provider openai \
  --categories electronics,books,clothing \
  -n 50 \
  -o reviews.jsonl

# Or use a config file
synthclaw generate --config examples/configs/generate_reviews.yaml

CLI Usage

Explore HuggingFace Datasets

# Search
synthclaw datasets search "sentiment" --limit 10

# Get info
synthclaw datasets info cornell-movie-review-data/rotten_tomatoes

# Preview rows
synthclaw datasets preview cornell-movie-review-data/rotten_tomatoes --rows 5

Generate Data

# From scratch with categories
synthclaw generate \
  --prompt "Generate a {category} example" \
  --provider openai \
  --categories positive,negative \
  -n 100

# Dry run (no API calls)
synthclaw generate --dry-run --config config.yaml

Writing Good Prompts

The tool uses system prompts by default to ensure clean outputs. You provide the user prompt template.

Template Variables

For generate mode:

  • {category} - current category being generated
  • {index} - item number (0, 1, 2...)

For augment mode:

  • Any column from source data: {text}, {label}, etc.

Good Prompt Examples

Product Reviews:

template: |
  Generate a realistic product review for: {category}
  
  Requirements:
  - Customer perspective, 2-4 sentences
  - Include specific details (brand, features, price)
  - Natural tone - can be positive, negative, or mixed

Sentiment Data:

template: |
  Generate a {category} movie review.
  
  Requirements:
  - The sentiment must clearly be {category}
  - 1-3 sentences
  - Mention specific aspects (acting, plot, visuals)

Data Augmentation (paraphrase):

template: |
  Paraphrase this text while preserving meaning and sentiment:
  
  Original: {text}
  
  Paraphrase:

Question-Answer Generation:

template: |
  Based on this document, generate a Q&A pair:
  
  Document: {text}
  
  Output JSON: {"question": "...", "answer": "..."}
system_prompt: |
  Generate educational Q&A pairs. Output ONLY valid JSON.

Bad Prompts (avoid these)

# Too vague - LLM will add explanations
template: "Generate something for {category}"

# Missing variable
template: "Generate a review"  # error: no {category} but categories specified

# Wrong variable for mode
template: "Paraphrase {text}"  # error in generate mode: {text} not available

Configuration

Generate from Scratch

name: "product_reviews"

provider:
  type: openai
  model: "gpt-4o-mini"
  temperature: 0.8

generation:
  task: generate
  count: 100
  concurrency: 10
  categories:
    - electronics
    - books
    - clothing
  template: |
    Generate a realistic {category} product review.
    2-3 sentences, customer perspective, specific details.

output:
  format: jsonl
  path: "./output/reviews.jsonl"

Augment Existing Data

name: "sentiment_augmentation"

source:
  type: huggingface
  dataset: "cornell-movie-review-data/rotten_tomatoes"
  split: "train"
  sample: 500

provider:
  type: openai
  model: "gpt-4o-mini"

generation:
  task: augment
  count_per_example: 2
  concurrency: 10
  strategy: paraphrase

output:
  format: jsonl
  path: "./output/augmented.jsonl"

Custom System Prompt

Override the default system prompt when you need specific behavior:

generation:
  template: |
    Generate a {category} example in JSON format.
  system_prompt: |
    You are a data generation assistant.
    Output ONLY valid JSON, no markdown, no explanations.
    Schema: {"text": "...", "label": "..."}

Library Usage

use synth_claw::{
    config::SynthConfig,
    datasets::{HuggingFaceSource, DataSource},
    providers::{create_provider, GenerationRequest},
};

// Load HuggingFace dataset
let mut source = HuggingFaceSource::new(
    "cornell-movie-review-data/rotten_tomatoes".to_string(),
    None,
    "train".to_string(),
    None,
)?;
let records = source.load(Some(100))?;

// Create provider and generate
let config = SynthConfig::from_file(&"config.yaml".into())?;
let provider = create_provider(&config.provider)?;

let response = provider.generate(GenerationRequest {
    prompt: "Generate a movie review".to_string(),
    system_prompt: Some("Output only the review text.".to_string()),
    temperature: Some(0.7),
    max_tokens: Some(500),
}).await?;

Output Formats

  • jsonl - Line-delimited JSON (recommended for large datasets)
  • csv - Comma-separated values
  • parquet - Apache Parquet (efficient for analytics)

Environment Variables

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

Project Structure

src/
  cli/           CLI commands
  config/        YAML config schema
  datasets/      Data sources (HuggingFace, local files)
  providers/     LLM providers (OpenAI, Anthropic)
  generation/    Generation engine
  output/        Output writers (JSONL, CSV, Parquet)