synthclaw 0.1.3

Lightweight synthetic data generation library/CLI.
# Skill: synthclaw - Synthetic Data Generation

## Description
Generate synthetic data or augment existing datasets using LLMs (OpenAI, Anthropic). Supports HuggingFace datasets as a source and multiple output formats (JSONL, CSV, Parquet).

## When to Use
- User needs synthetic training data for ML models
- User wants to augment/expand an existing dataset
- User needs test data with specific patterns or categories
- User wants to paraphrase or transform text data
- User needs to explore HuggingFace datasets

## Prerequisites
- `synthclaw` installed (`cargo install synthclaw`) or otherwise available in `PATH`
- `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` environment variable set

## CLI Commands

### Generate Data from Scratch
```bash
synthclaw generate \
  --provider openai \
  --model gpt-4o-mini \
  --prompt "Generate a realistic {category} product review, 2-3 sentences" \
  --categories "positive,negative,neutral" \
  -n 100 \
  --concurrency 10 \
  -o output.jsonl
```
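
If the command is launched from a script rather than typed by hand, it can help to assemble the arguments as a list so that the prompt is never re-parsed by a shell. The following is a minimal sketch that uses only the flags shown in the command above; the helper name `build_generate_command` and its parameters are hypothetical and not part of synthclaw.

```python
import shlex

def build_generate_command(prompt, categories, count, output,
                           provider="openai", model=None, concurrency=None):
    """Assemble an argv list for `synthclaw generate` (hypothetical helper).

    Only flags that appear in the documented example are used.
    """
    cmd = ["synthclaw", "generate", "--provider", provider]
    if model is not None:
        cmd += ["--model", model]
    cmd += ["--prompt", prompt,
            "--categories", ",".join(categories),
            "-n", str(count)]
    if concurrency is not None:
        cmd += ["--concurrency", str(concurrency)]
    cmd += ["-o", output]
    return cmd

cmd = build_generate_command(
    prompt="Generate a realistic {category} product review, 2-3 sentences",
    categories=["positive", "negative", "neutral"],
    count=100,
    output="output.jsonl",
    model="gpt-4o-mini",
    concurrency=10,
)

# Shell-quoted form, useful for logging or copy-pasting.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids quoting problems with prompts that contain spaces, braces, or quotes.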

### Generate with Config File
```bash
synthclaw generate --config config.yaml
```

### Dry Run (Preview without API calls)
```bash
synthclaw generate --config config.yaml --dry-run
```

### Search HuggingFace Datasets
```bash
synthclaw datasets search "sentiment" --limit 10
```

### Get Dataset Info
```bash
synthclaw datasets info cornell-movie-review-data/rotten_tomatoes
```

### Preview Dataset Rows
```bash
synthclaw datasets preview cornell-movie-review-data/rotten_tomatoes --rows 5
```

## Config File Examples

### Generate from Scratch
```yaml
name: "product_reviews"

provider:
  type: openai
  model: "gpt-4o-mini"

generation:
  task: generate
  count: 100
  concurrency: 10
  categories:
    - positive
    - negative
    - neutral
  template: |
    Generate a realistic {category} product review for an electronic device.
    Keep it to 2-3 sentences. Output only the review text.

output:
  format: jsonl
  path: "./output/reviews.jsonl"
```

### Augment HuggingFace Dataset
```yaml
name: "sentiment_augmentation"

source:
  type: huggingface
  dataset: "cornell-movie-review-data/rotten_tomatoes"
  split: "train"
  sample: 100

provider:
  type: openai
  model: "gpt-4o-mini"

generation:
  task: augment
  count_per_example: 2
  concurrency: 10
  strategy: paraphrase
  template: |
    Paraphrase this movie review while preserving its sentiment:
    
    Original: {text}
    
    Paraphrased:

output:
  format: jsonl
  path: "./output/augmented.jsonl"
```

### Augment Local File
```yaml
name: "local_augmentation"

source:
  type: local
  path: "./data/input.jsonl"
  format: jsonl
  sample: 50

provider:
  type: anthropic
  model: "claude-haiku-4-5-20251001"

generation:
  task: augment
  count_per_example: 3
  concurrency: 5
  template: |
    Rewrite this text in a different style:
    
    Original: {content}
    
    Rewritten:

output:
  format: jsonl
  path: "./output/augmented.jsonl"
```

## Template Variables

### Generate Mode
- `{category}` - Current category being generated
- `{index}` - Item number (0, 1, 2...)

### Augment Mode
- Any column from source data: `{text}`, `{label}`, `{content}`, etc.
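
To make the placeholder semantics concrete, here is a small Python sketch of how `{name}` variables could be filled in. It illustrates the idea only and is not synthclaw's implementation; in particular, leaving unknown placeholders untouched is an assumption of this sketch, and the `render` function is hypothetical.

```python
import re

# Matches {name} where name is letters, digits, or underscores.
PLACEHOLDER = re.compile(r"\{(\w+)\}")

def render(template, values):
    """Replace each {name} with values[name]; leave unknown names as-is."""
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)
    return PLACEHOLDER.sub(substitute, template)

# Generate mode: {category} and {index} are available.
print(render("Review {index}: write a {category} product review.",
             {"category": "positive", "index": 0}))

# Augment mode: the keys are the columns of one source row (made-up row).
row = {"text": "A charming, funny film.", "label": 1}
print(render("Paraphrase this movie review:\n\nOriginal: {text}", row))
```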

## Output Formats
- `jsonl` - Line-delimited JSON (recommended)
- `csv` - Comma-separated values
- `parquet` - Apache Parquet (efficient for large datasets)

## Output Schema
Each output record contains:
```json
{
  "content": "Generated or augmented text",
  "category": "positive",
  "source_index": 42,
  "input_tokens": 150,
  "output_tokens": 45
}
```
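
Because every record carries `input_tokens` and `output_tokens`, a JSONL output file can be summarized with a few lines of Python. The sketch below assumes records shaped like the schema above and uses made-up sample records in place of a real file; `summarize_tokens` is a hypothetical helper.

```python
import io
import json
from collections import defaultdict

def summarize_tokens(lines):
    """Count records and sum token usage per category from JSONL lines."""
    totals = defaultdict(
        lambda: {"records": 0, "input_tokens": 0, "output_tokens": 0}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        bucket = totals[record.get("category")]
        bucket["records"] += 1
        # Treat a missing or null token count as zero.
        bucket["input_tokens"] += record.get("input_tokens") or 0
        bucket["output_tokens"] += record.get("output_tokens") or 0
    return dict(totals)

# Made-up sample records; with a real file use: open("output.jsonl")
sample = io.StringIO(
    '{"content": "Great device.", "category": "positive", '
    '"source_index": 0, "input_tokens": 150, "output_tokens": 45}\n'
    '{"content": "Broke in a week.", "category": "negative", '
    '"source_index": 1, "input_tokens": 148, "output_tokens": 40}\n'
    '{"content": "Love it.", "category": "positive", '
    '"source_index": 2, "input_tokens": 150, "output_tokens": 30}\n'
)

summary = summarize_tokens(sample)
for category, stats in summary.items():
    print(category, stats)
```

Multiplying the totals by your provider's per-token prices gives a rough cost estimate for a run.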

## Providers
- `openai` - Models: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, o1, o1-mini, o3-mini
- `anthropic` - Models: claude-sonnet-4-20250514, claude-haiku-4-5-20251001, claude-opus-4-5-20251101

## Common Workflows

### 1. Generate Classification Training Data
```bash
synthclaw generate \
  --provider openai \
  --prompt "Generate a {category} customer support ticket. Include the issue description and customer tone." \
  --categories "billing,technical,shipping,returns" \
  -n 200 \
  -o support_tickets.jsonl
```

### 2. Expand Small Dataset
```yaml
# First, check what columns exist (run this in a shell, not as part of the config):
#   synthclaw datasets preview my-small-dataset --rows 1

# Then augment
name: "expand_dataset"
source:
  type: huggingface
  dataset: "my-small-dataset"
  split: "train"
generation:
  task: augment
  count_per_example: 5
  strategy: paraphrase
output:
  format: parquet
  path: "./expanded.parquet"
```

### 3. Generate Test Fixtures
```bash
synthclaw generate \
  --provider openai \
  --prompt "Generate a realistic {category} JSON object for an e-commerce API test" \
  --categories "product,order,user,review" \
  -n 20 \
  -o test_fixtures.jsonl
```

## Tips
- Use `--dry-run` to verify config before making API calls
- Higher `concurrency` is faster but raises the risk of hitting API rate limits; 10-20 is a reasonable starting range
- Use `sample` in the source config to test on a subset first
- Check dataset columns with `datasets preview` before writing augment templates
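
The last tip can be partly automated: once you know the column names from `datasets preview`, a short check can flag template placeholders that do not match any column. This is a sketch under the assumption that placeholders are simple `{name}` tokens; `missing_columns` is a hypothetical helper, not a synthclaw feature.

```python
import re

def missing_columns(template, columns):
    """Return placeholder names in the template that are not in `columns`."""
    missing = []
    for name in re.findall(r"\{(\w+)\}", template):
        # Keep first-seen order and report each name once.
        if name not in columns and name not in missing:
            missing.append(name)
    return missing

template = (
    "Rewrite this text in a different style:\n\n"
    "Original: {content}\n\nRewritten:"
)

print(missing_columns(template, {"text", "label"}))  # ['content']
print(missing_columns(template, {"content"}))        # []
```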