MockForge Data
Synthetic data generation engine with faker primitives and RAG (Retrieval-Augmented Generation).
This crate provides tools for generating realistic test data, covering both traditional faker-based generation and RAG-powered synthetic data creation. It is designed to work with MockForge's mocking framework to build test datasets.
Features
- Faker Primitives: Generate realistic fake data (names, emails, addresses, etc.)
- Schema-Based Generation: Define data structures and generate conforming datasets
- RAG Integration: Use Retrieval-Augmented Generation for contextually aware data synthesis
- Template Support: Create complex data structures with variable substitution
- Multiple Output Formats: JSON, JSON Lines, YAML, and CSV support
- Relationship Handling: Generate data with cross-schema relationships
- Batch Generation: Create multiple datasets simultaneously
Quick Start
Basic Data Generation
use ;
// Define a simple user schema
let mut schema = new;
schema = schema.with_field;
schema = schema.with_field;
// Configure generation
let config = DataConfig ;
// Generate data
let mut generator = new?;
let result = generator.generate.await?;
// Access generated data
println!;
println!;
Using Faker Directly
use ;
// Quick functions for common data
let email = email;
let name = name;
let uuid = uuid;
// Enhanced faker with more options
let mut faker = new;
let address = faker.address;
let phone = faker.phone;
let date = faker.date_iso;
Template-Based Generation
use TemplateFaker;
use Value;
let mut faker = new
.with_variable;
let result = faker.generate_from_template;
RAG-Enhanced Generation
use ;
// Configure RAG
let rag_config = RagConfig ;
// Create RAG engine
let mut engine = new;
// Add context documents
engine.add_document?;
// Generate with RAG
let schema = new;
let config = DataConfig ;
let result = engine.generate_with_rag.await?;
Key Modules
Faker (faker)
Enhanced faker utilities for generating realistic fake data:
- Basic Types: Strings, numbers, booleans, dates, UUIDs
- Personal Data: Names, emails, addresses, phone numbers
- Business Data: Company names, URLs, IP addresses
- Template Support: Variable substitution with {{variable}} syntax
- Quick Functions: One-liner access to common generators
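To make the {{variable}} substitution concrete, here is a minimal std-only sketch of the idea. It is not the crate's implementation: `render_template` is a hypothetical helper written for this illustration, and the choice to leave unknown placeholders untouched is an assumption.

```rust
use std::collections::HashMap;

/// Replace every `{{name}}` placeholder in `template` with its value from `vars`.
/// Hypothetical helper for illustration; unknown placeholders are left untouched
/// so that missing variables stay visible in the output.
fn render_template(template: &str, vars: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        // Copy the literal text that precedes the placeholder.
        out.push_str(&rest[..start]);
        let after_open = &rest[start + 2..];
        match after_open.find("}}") {
            Some(end) => {
                let key = after_open[..end].trim();
                match vars.get(key) {
                    Some(value) => out.push_str(value),
                    // No value known: keep the original `{{...}}` text.
                    None => out.push_str(&rest[start..start + 2 + end + 2]),
                }
                rest = &after_open[end + 2..];
            }
            None => {
                // Unterminated "{{": emit the remainder verbatim and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

With `user` mapped to `Ada`, calling `render_template("Hello, {{user}}!", &vars)` yields `Hello, Ada!`. Whitespace inside the braces is trimmed, so `{{ user }}` resolves the same way.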
Schema (schema)
Define data structures for generation:
- Field Definitions: Type-based field specifications
- Relationships: Cross-schema foreign key relationships
- Templates: Pre-built schemas for common entities (users, products, orders)
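As a rough illustration of type-based field definitions, the sketch below generates rows that conform to a small schema. Everything here is an assumption made for the example: `FieldKind`, `XorShift64`, and `generate_rows` are hypothetical names, not the crate's types, and the seeded PRNG is only there to show how a fixed seed can make output reproducible.

```rust
/// Hypothetical field kinds; the crate's real type names may differ.
enum FieldKind {
    /// Inclusive integer range; assumes `min <= max`.
    Integer { min: i64, max: i64 },
    Boolean,
    /// One of a fixed set of values; assumes the list is non-empty.
    Choice(Vec<&'static str>),
}

/// A tiny deterministic PRNG (xorshift64) so that a seed reproduces the same rows.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so substitute a fixed non-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Value in `0..bound` (slightly biased, which is acceptable for test data).
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Generate `rows` records, each holding one rendered value per schema field.
fn generate_rows(
    schema: &[(&str, FieldKind)],
    rows: usize,
    seed: u64,
) -> Vec<Vec<(String, String)>> {
    let mut rng = XorShift64::new(seed);
    (0..rows)
        .map(|_| {
            schema
                .iter()
                .map(|(name, kind)| {
                    let value = match kind {
                        FieldKind::Integer { min, max } => {
                            let span = (max - min + 1) as u64;
                            (min + rng.below(span) as i64).to_string()
                        }
                        FieldKind::Boolean => (rng.below(2) == 1).to_string(),
                        FieldKind::Choice(options) => {
                            options[rng.below(options.len() as u64) as usize].to_string()
                        }
                    };
                    (name.to_string(), value)
                })
                .collect()
        })
        .collect()
}
```

Calling `generate_rows` twice with the same schema and seed returns identical rows, which is the property that makes seeded test fixtures repeatable.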
Generator (generator)
Core data generation engine:
- DataGenerator: Single schema generation with configuration
- BatchGenerator: Multi-schema batch processing
- Relationship Resolution: Automatic foreign key population
- Performance: Optimized for large dataset generation
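The relationship resolution idea can be sketched without the crate: generate the parent records first, then draw each child's foreign key from the parents that already exist. The `User` and `Order` types, the `Lcg` generator, and both functions below are hypothetical and only illustrate the technique, not how MockForge implements it.

```rust
/// Hypothetical record types used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
struct User {
    id: u32,
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Order {
    id: u32,
    user_id: u32,
    total_cents: u64,
}

/// Minimal linear congruential generator; deterministic for a given seed.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use the high bits, which are better distributed than the low ones.
        self.0 >> 33
    }
}

/// Parent records get sequential ids starting at 1.
fn generate_users(count: u32) -> Vec<User> {
    (1..=count)
        .map(|id| User { id, name: format!("user-{id}") })
        .collect()
}

/// Generate orders whose `user_id` always points at an existing user.
/// Returns no orders when there are no users to reference.
fn generate_orders(users: &[User], count: u32, seed: u64) -> Vec<Order> {
    if users.is_empty() {
        return Vec::new();
    }
    let mut rng = Lcg(seed);
    (1..=count)
        .map(|id| {
            let parent = &users[(rng.next_u64() % users.len() as u64) as usize];
            Order {
                id,
                user_id: parent.id,
                total_cents: 100 + rng.next_u64() % 10_000,
            }
        })
        .collect()
}
```

Because children are created only after their parents, every `user_id` is valid by construction and no separate repair pass is needed.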
RAG (rag)
Retrieval-Augmented Generation for intelligent data synthesis:
- Multiple Providers: OpenAI, Anthropic, Ollama, OpenAI-compatible APIs
- Semantic Search: Vector-based document retrieval
- Context Integration: Use existing data as generation context
- Configurable Models: Support for various LLM architectures
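Vector-based retrieval usually comes down to ranking documents by similarity between embedding vectors. The sketch below shows that ranking step with cosine similarity over small hand-written vectors. It assumes the embeddings already exist and does not call any provider; `cosine_similarity` and `top_k` are hypothetical names, and the crate may use a different metric or index.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return the indices of the `k` documents most similar to `query`, best first.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_similarity(query, d)))
        .collect();
    // Sort by descending score; `total_cmp` gives a total order over floats.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

The retrieved documents would then be passed to the model as context. A linear scan like this is fine for a handful of documents; larger corpora typically need an approximate nearest-neighbor index.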
Output Formats
Generated data can be exported in multiple formats:
use GenerationResult;
// JSON (default)
let json = result.to_json_string()?;
// JSON Lines
let jsonl = result.to_jsonl_string()?;
// Access raw data
for row in &result.data
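For readers unfamiliar with the difference between JSON and JSON Lines, the following std-only sketch serializes the same rows both ways. It is an illustration of the two formats, not the crate's serializer: the `Row` alias and all functions are hypothetical, and values are restricted to strings to keep the example short.

```rust
/// Hypothetical row type for this sketch: ordered (field, value) pairs.
type Row = Vec<(String, String)>;

/// Escape the characters that must not appear raw inside a JSON string.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize one row as a compact JSON object with string values.
fn row_to_json(row: &Row) -> String {
    let fields: Vec<String> = row
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", escape_json(k), escape_json(v)))
        .collect();
    format!("{{{}}}", fields.join(","))
}

/// JSON: a single document containing an array of all rows.
fn to_json_array(rows: &[Row]) -> String {
    let items: Vec<String> = rows.iter().map(row_to_json).collect();
    format!("[{}]", items.join(","))
}

/// JSON Lines: one complete JSON object per line, newline-terminated.
fn to_json_lines(rows: &[Row]) -> String {
    rows.iter().map(|r| row_to_json(r) + "\n").collect()
}
```

JSON Lines output can be appended to and processed one line at a time, which makes it the more convenient choice for large generated datasets.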
Configuration
DataConfig
Control generation parameters:
let config = DataConfig ;
RagConfig
Configure RAG behavior:
let rag_config = RagConfig ;
Examples
Generate Related Data
use utils;
// Generate orders with related users
let results = generate_orders_with_users.await?;
let user_data = &results;
let order_data = &results;
Custom Schema Generation
use utils;
// Generate from field definitions
let result = generate_sample_data.await?;
Integration with MockForge
This crate is designed to work with the broader MockForge ecosystem:
- MockForge Core: Use generated data in mock responses
- MockForge HTTP: Populate REST API mocks with realistic data
- MockForge GraphQL: Generate GraphQL schema-conforming data
Performance Considerations
- Memory Usage: Large datasets are generated in batches
- RAG Overhead: Semantic search adds processing time
- Parallel Generation: Use BatchGenerator for concurrent processing
- Caching: RAG engines cache embeddings for performance
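The value of caching embeddings is easiest to see in a small sketch: compute the embedding once per distinct text and reuse it afterwards. The `EmbeddingCache` type and the `toy_embed` function below are hypothetical stand-ins written for this example. A real embedding call would go to a model provider, which is what makes the cache worthwhile.

```rust
use std::collections::HashMap;

/// Caches embeddings by document text so each distinct text is embedded once.
struct EmbeddingCache<F: Fn(&str) -> Vec<f32>> {
    embed: F,
    cache: HashMap<String, Vec<f32>>,
    /// Number of times the underlying embedding function was actually called.
    misses: usize,
}

impl<F: Fn(&str) -> Vec<f32>> EmbeddingCache<F> {
    fn new(embed: F) -> Self {
        EmbeddingCache { embed, cache: HashMap::new(), misses: 0 }
    }

    /// Return the cached embedding, computing and storing it on first use.
    fn get(&mut self, text: &str) -> &[f32] {
        if !self.cache.contains_key(text) {
            self.misses += 1;
            let vector = (self.embed)(text);
            self.cache.insert(text.to_string(), vector);
        }
        &self.cache[text]
    }
}

/// Stand-in embedding (hypothetical): counts of the ASCII letters a..z.
fn toy_embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; 26];
    for b in text.bytes().filter(u8::is_ascii_alphabetic) {
        v[(b.to_ascii_lowercase() - b'a') as usize] += 1.0;
    }
    v
}
```

Requesting the same text twice through `EmbeddingCache::new(toy_embed)` increments `misses` only once. This sketch never evicts entries, so a long-running process would want a size bound.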
Contributing
See the main MockForge repository for contribution guidelines.
License
Licensed under MIT OR Apache-2.0