Data Generator
A modern, configurable synthetic RDF data generator that creates realistic data conforming to ShEx or SHACL schemas.
Features
- Configuration-driven: Use TOML/JSON configuration files to control generation parameters
- Parallel processing: Generate data using multiple threads for better performance
- Parallel writing: Automatically write to multiple files simultaneously for optimal I/O performance
- Flexible field generation: Composable field generators for different data types
- ShEx and SHACL schema support: Generate data that conforms to either ShEx shape definitions or SHACL constraints
- Auto-detection: Automatically detect schema format based on file extension
- Multiple output formats: Support for Turtle, N-Triples, JSON-LD, and more
Quick Start
You can use these commands to test the application. Execute them from the root folder (/home/diego/Documents/rudof/).
SHACL Examples
# Generate data from SHACL schema (auto-detected by .ttl extension)
# Generate with specific seed for reproducible SHACL data
# Generate from complex SHACL schema with more entities
# Use parallel processing for large SHACL datasets
ShEx Examples
# Generate data from ShEx schema (auto-detected by .shex extension)
# Generate with configuration file and ShEx schema
# Generate with inline parameters using example ShEx schema
# Generate with custom seed for reproducible ShEx results
Configuration-Driven Examples
# Use automatic parallel configuration for medium datasets (works with both formats)
# Use high-performance parallel configuration for large datasets
# Show help for all options
Sample SHACL Schema (simple_shacl.ttl)
@prefix : <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:Person a sh:NodeShape ;
sh:closed true ;
sh:property [
sh:path :name ;
sh:minCount 1;
sh:maxCount 1;
sh:datatype xsd:string ;
] ;
sh:property [
sh:path :birthDate ;
sh:maxCount 1;
sh:datatype xsd:date ;
] ;
sh:property [
sh:path :enrolledIn ;
sh:node :Course ;
] .
:Course a sh:NodeShape;
sh:closed true ;
sh:property [
sh:path :name ;
sh:minCount 1;
sh:maxCount 1;
sh:datatype xsd:string ;
] .
Sample Generated Output
From SHACL schema:
<http://example.org/Person-1> <http://example.org/name> "Diana Jones" ;
<http://example.org/enrolledIn> <http://example.org/Course-1> ;
<http://example.org/birthDate> "1971-03-12"^^<http://www.w3.org/2001/XMLSchema#date> ;
a <http://example.org/Person> .
<http://example.org/Course-1> <http://example.org/name> "Advanced Mathematics" ;
a <http://example.org/Course> .
From ShEx schema:
<http://example.org/Person-1> a <http://example.org/Person> ;
<http://example.org/name> "Fiona Rodriguez" .
<http://example.org/Course-1> a <http://example.org/Course> ;
<http://example.org/name> "Computer Science" .
Normal Start
- Create a configuration file (copy from examples below):
# Copy the simple ready-to-use config
# Or copy the comprehensive example
- Run the generator with your schema:
# For SHACL schemas (.ttl, .rdf, .nt files)
# For ShEx schemas (.shex files)
# Auto-detection works - no need to specify format
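The extension-based auto-detection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the generator's actual code: the function name detect_schema_format and the lookup table are hypothetical, and the extension mapping is taken from the comments above (.shex for ShEx; .ttl, .rdf, .nt for SHACL).

```python
from pathlib import Path

# Extension-to-format table taken from the comments above.
# The names here are illustrative, not the generator's real API.
SCHEMA_FORMATS = {
    ".shex": "ShEx",
    ".ttl": "SHACL",
    ".rdf": "SHACL",
    ".nt": "SHACL",
}


def detect_schema_format(schema_path: str) -> str:
    """Return "ShEx" or "SHACL" based on the file extension (case-insensitive)."""
    suffix = Path(schema_path).suffix.lower()
    if suffix not in SCHEMA_FORMATS:
        raise ValueError(f"cannot detect schema format for extension {suffix!r}")
    return SCHEMA_FORMATS[suffix]
```

An unknown extension raises an error in this sketch, so a caller would have to state the format explicitly in that case.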
Usage
# Generate data using configuration file (works with both ShEx and SHACL)
# Generate with inline parameters from SHACL schema
# Generate with inline parameters from ShEx schema
# Generate with custom seed for reproducible results
# Use multiple threads for faster generation
# Show help for all options
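Why a fixed seed gives reproducible results can be shown with a small Python sketch. It assumes the generator draws all random values from a single seeded random number generator. The function generate_names and the NAMES pool are hypothetical placeholders, not part of the tool.

```python
import random

# Placeholder value pool; the real generator has its own data sources.
NAMES = ["Diana Jones", "Fiona Rodriguez", "Alan Smith"]


def generate_names(count: int, seed: int) -> list:
    """Draw `count` names from a private RNG initialised with `seed`."""
    rng = random.Random(seed)  # independent of the global random state
    return [rng.choice(NAMES) for _ in range(count)]


# The same seed always produces the same sequence of values.
first_run = generate_names(5, seed=12345)
second_run = generate_names(5, seed=12345)
```

Here first_run and second_run are identical lists, which is the property a fixed seed is meant to provide.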
Configuration
See examples/config.toml
for configuration options.
Configuration Examples
Basic Configuration (config.toml)
# Basic data generation settings
[]
= 1000 # Number of entities to generate
= 12345 # Random seed for reproducible results
= "Equal" # How to distribute entities across shapes
= "Balanced" # How to handle cardinalities
# Field generation settings
[]
= "en" # Locale for generated text
= "Medium" # Data quality level
# Output configuration
[]
= "generated_data.ttl" # Output file path
= "Turtle" # Output format
= false # Whether to compress output
= true # Write generation statistics
# Parallel processing
[]
= 4 # Number of worker threads
= 100 # Entity batch size
= true # Process shapes in parallel
= true # Generate fields in parallel
Advanced Configuration with Custom Field Generators
# Advanced configuration with custom field generators
[]
entity_count = 5000
seed = 98765
entity_distribution = "Weighted"
cardinality_strategy = "Random"
# Weighted distribution for different shape types
[]
= 0.5 # 50% persons
= 0.3 # 30% organizations
= 0.2 # 20% courses
[]
locale = "en"
quality = "High"
# Custom integer generation with specific ranges
[]
= "integer"
[]
= 1
= 10000
# Custom decimal generation
[]
= "decimal"
[]
= 0.0
= 1000.0
= 2
# Custom date generation
[]
= "date"
[]
= 1980
= 2024
# Property-specific generators
[]
= "string"
= {}
[]
= "string"
[]
= [
"{firstName}.{lastName}@{domain}",
"{firstName}{lastName}{number}@{domain}",
"info@{domain}",
"contact@{domain}"
]
[]
= "string"
= {}
# Output with compression
[]
path = "large_dataset.ttl.gz"
format = "Turtle"
compress = true
write_stats = true
# High-performance parallel settings
[]
worker_threads = 8
batch_size = 250
parallel_shapes = true
parallel_fields = true
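The email templates in the advanced configuration use {placeholder} fields. As a rough illustration of how such templates might be expanded, here is a Python sketch that fills them with str.format. The function render_email and the value pools are hypothetical, and the generator's own substitution logic may differ.

```python
import random

# Templates copied from the configuration above.
TEMPLATES = [
    "{firstName}.{lastName}@{domain}",
    "{firstName}{lastName}{number}@{domain}",
    "info@{domain}",
    "contact@{domain}",
]


def render_email(template: str, rng: random.Random) -> str:
    """Fill one template with randomly chosen placeholder values."""
    values = {
        "firstName": rng.choice(["diana", "fiona"]),  # placeholder pools
        "lastName": rng.choice(["jones", "rodriguez"]),
        "number": rng.randint(1, 99),
        "domain": rng.choice(["example.org", "example.com"]),
    }
    # str.format ignores keyword arguments that a template does not use.
    return template.format(**values)


rng = random.Random(98765)
emails = [render_email(template, rng) for template in TEMPLATES]
```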
Minimal Configuration
# Minimal configuration - uses defaults for most settings
[]
entity_count = 100
[]
path = "simple_data.ttl"
Custom Entity Distribution
[]
entity_count = 2000
entity_distribution = "Custom"
# Exact entity counts per shape
[]
= 1000
= 200
= 800
[]
= "custom_distribution.ttl"
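The three distribution modes ("Equal", "Weighted", "Custom") can be illustrated with a short Python sketch. This is an assumption about how the counts could be derived, not the generator's actual algorithm. The function names and the shape names are hypothetical, and rounding leftovers are assigned by largest fractional part, which is one reasonable choice among several.

```python
def distribute_equal(total: int, shapes: list) -> dict:
    """Split `total` evenly; the first shapes absorb any remainder."""
    base, extra = divmod(total, len(shapes))
    return {shape: base + (1 if i < extra else 0) for i, shape in enumerate(shapes)}


def distribute_weighted(total: int, weights: dict) -> dict:
    """Split `total` in proportion to `weights` (largest-remainder rounding)."""
    weight_sum = sum(weights.values())
    exact = {shape: total * w / weight_sum for shape, w in weights.items()}
    counts = {shape: int(value) for shape, value in exact.items()}
    leftover = total - sum(counts.values())
    by_fraction = sorted(weights, key=lambda s: exact[s] - counts[s], reverse=True)
    for shape in by_fraction[:leftover]:
        counts[shape] += 1
    return counts


def distribute_custom(total: int, counts: dict) -> dict:
    """Accept exact per-shape counts, checking they add up to `total`."""
    if sum(counts.values()) != total:
        raise ValueError("custom counts must sum to entity_count")
    return dict(counts)
```

With the weights from the advanced configuration (0.5, 0.3, 0.2) and 5000 entities, the weighted split gives 2500, 1500 and 1000 entities. The custom counts shown above (1000, 200, 800) add up to the configured total of 2000.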
Using Configuration Files
# Use TOML configuration with any schema format
# Use JSON configuration with SHACL schema
# Use JSON configuration with ShEx schema
# Override config with command line (works with both formats)
Parallel Writing Examples
The data generator supports parallel writing to multiple files for improved I/O performance. The system can automatically detect the optimal number of files based on your dataset size and system capabilities.
Automatic File Count Detection
Set parallel_file_count = 0 to enable automatic detection:
# Small dataset (50 entities) → automatically uses 1 file
# Medium dataset (1000 entities) → automatically uses 8 files
# Large dataset (5000 entities) → automatically uses 16 files
Manual Parallel Writing Configuration
[]
path = "dataset.ttl"
format = "Turtle"
parallel_writing = true # Enable parallel writing
parallel_file_count = 8 # Write to 8 parallel files (manual setting)
Auto-Detection Configuration
[]
path = "dataset.ttl"
format = "Turtle"
parallel_writing = true # Enable parallel writing
parallel_file_count = 0 # 0 = auto-detect optimal count
Auto-detection algorithm:
- Small datasets (≤1,000 triples): 1 file (no overhead)
- Small-medium (1,001-5,000 triples): Up to 4 files
- Medium (5,001-50,000 triples): Up to 8 files (2x CPU cores)
- Large (>50,000 triples): Up to 16 files (2x CPU cores, capped)
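The tiers above can be read as a lookup on the triple count combined with a limit based on CPU cores. The Python sketch below is one possible interpretation, not the generator's actual code. It assumes every tier above the smallest is bounded by both its listed cap and twice the core count (the list only states the 2x rule for the two largest tiers), and the function name auto_file_count is hypothetical.

```python
import os
from typing import Optional


def auto_file_count(triple_count: int, cpu_cores: Optional[int] = None) -> int:
    """Pick a number of output files from the dataset size and core count."""
    cores = cpu_cores or os.cpu_count() or 1
    if triple_count <= 1_000:
        return 1  # small datasets: a single file, no parallel overhead
    if triple_count <= 5_000:
        tier_cap = 4
    elif triple_count <= 50_000:
        tier_cap = 8
    else:
        tier_cap = 16
    # Assumption: never exceed 2x the available cores, nor the tier cap.
    return max(1, min(tier_cap, 2 * cores))
```

Under these assumptions a 4-core machine would use 8 files for a dataset above 50,000 triples, while an 8-core machine would reach the cap of 16.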
Output files:
- dataset_part_001.ttl, dataset_part_002.ttl, etc.
- dataset.manifest.txt (lists all parallel files)
- dataset.stats.json (combined statistics)
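The naming pattern of these output files can be derived from the configured path. The Python sketch below shows one way to compute the names. The function parallel_output_paths is hypothetical, and the three-digit part numbering is inferred from the file names listed above.

```python
from pathlib import Path


def parallel_output_paths(output_path: str, file_count: int) -> dict:
    """Derive part, manifest and stats file names from the base output path."""
    path = Path(output_path)
    stem, suffix = path.stem, path.suffix  # "dataset", ".ttl"
    parts = [
        path.with_name(f"{stem}_part_{index:03d}{suffix}")
        for index in range(1, file_count + 1)
    ]
    return {
        "parts": parts,
        "manifest": path.with_name(f"{stem}.manifest.txt"),
        "stats": path.with_name(f"{stem}.stats.json"),
    }


paths = parallel_output_paths("dataset.ttl", 8)
```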
Performance benefits:
- Small dataset: 28.6ms vs ~35ms sequential (no significant difference)
- Medium dataset: 143.3ms vs 381ms sequential (62% faster)
- Large dataset: 601ms vs ~1200ms sequential (50% faster)
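The percentages above follow from the timings as (sequential - parallel) / sequential. A small Python check, using only the figures quoted in the list (the helper name percent_faster is illustrative):

```python
def percent_faster(parallel_ms: float, sequential_ms: float) -> float:
    """Relative time saved by the parallel run, in percent."""
    return (sequential_ms - parallel_ms) / sequential_ms * 100


medium = percent_faster(143.3, 381)   # about 62
large = percent_faster(601, 1200)     # about 50
small_gap_ms = 35 - 28.6              # only about 6 ms in absolute terms
```

For the small dataset the absolute gap is only a few milliseconds against an approximate baseline, which is why it is reported as no significant difference.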
JSON Configuration Example
Configuration Options Reference
Generation Settings
- entity_count: Total number of entities to generate
- seed: Random seed for reproducible results (optional)
- entity_distribution: How to distribute entities across shapes
  - "Equal": Equal distribution across all shapes
  - "Weighted": Use weights to control distribution
  - "Custom": Specify exact counts per shape
- cardinality_strategy: How to handle property cardinalities
  - "Minimum": Use minimum cardinality values
  - "Maximum": Use maximum cardinality values
  - "Random": Random values within cardinality range
  - "Balanced": Deterministic but varied distribution
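To make the four cardinality strategies concrete, here is a hedged Python sketch of how a value count for one property could be chosen. It is not the generator's implementation. The function pick_cardinality is hypothetical, "Balanced" is modelled here as cycling through the allowed range by entity index (one way to be deterministic but varied), and the caller is assumed to supply a finite upper bound when the schema leaves the maximum open.

```python
import random


def pick_cardinality(strategy: str, min_count: int, max_count: int,
                     entity_index: int, rng: random.Random) -> int:
    """Choose how many values to generate for a property of one entity."""
    if strategy == "Minimum":
        return min_count
    if strategy == "Maximum":
        return max_count
    if strategy == "Random":
        return rng.randint(min_count, max_count)  # inclusive on both ends
    if strategy == "Balanced":
        span = max_count - min_count + 1
        return min_count + entity_index % span  # cycles through the range
    raise ValueError(f"unknown cardinality strategy: {strategy!r}")


rng = random.Random(12345)
balanced = [pick_cardinality("Balanced", 1, 3, i, rng) for i in range(5)]
```

In this sketch the balanced sequence for a range of 1 to 3 is 1, 2, 3, 1, 2 for the first five entities.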
Field Generator Settings
- locale: Language/locale for generated text ("en", "es", "fr")
- quality: Data quality level ("Low", "Medium", "High")
- datatypes: Custom generators for specific XSD datatypes
- properties: Custom generators for specific properties
Output Settings
- path: Output file path
- format: Output format ("Turtle", "NTriples", "JSONLD", "RdfXml")
- compress: Whether to compress output file
- write_stats: Include generation statistics
- parallel_writing: Enable writing to multiple parallel files for better I/O performance
- parallel_file_count: Number of parallel files (0 = auto-detect optimal count)
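As an illustration of what the compress option implies for a path such as large_dataset.ttl.gz, the Python sketch below writes serialized RDF text either as plain text or gzip-compressed. The use of gzip is an assumption based on the .gz extension in the earlier example, and write_output is a hypothetical helper, not the generator's API.

```python
import gzip
from pathlib import Path


def write_output(path: str, rdf_text: str, compress: bool) -> Path:
    """Write serialized RDF to `path`, gzip-compressed when requested."""
    target = Path(path)
    if compress:
        # "wt" opens the gzip stream in text mode.
        with gzip.open(target, "wt", encoding="utf-8") as handle:
            handle.write(rdf_text)
    else:
        target.write_text(rdf_text, encoding="utf-8")
    return target
```

A compressed file written this way can be read back with gzip.open(path, "rt") or standard gzip tools.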
Parallel Processing
- worker_threads: Number of parallel worker threads
- batch_size: Entity batch size for processing
- parallel_shapes: Process different shapes in parallel
- parallel_fields: Generate field values in parallel
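How worker_threads and batch_size interact can be sketched as splitting the entity range into batches and handing them to a thread pool. The Python code below illustrates that pattern only. The function names are hypothetical, generate_batch is a stand-in for real entity generation, and the sketch says nothing about the generator's actual scheduling or performance.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(entity_count: int, batch_size: int) -> list:
    """Split entity indices 0..entity_count-1 into consecutive ranges."""
    return [
        range(start, min(start + batch_size, entity_count))
        for start in range(0, entity_count, batch_size)
    ]


def generate_batch(batch: range) -> list:
    """Placeholder work: produce one IRI per entity index."""
    return [f"<http://example.org/Person-{index + 1}>" for index in batch]


def generate_all(entity_count: int, batch_size: int, worker_threads: int) -> list:
    """Process all batches on a thread pool, keeping the original order."""
    batches = make_batches(entity_count, batch_size)
    with ThreadPoolExecutor(max_workers=worker_threads) as pool:
        results = pool.map(generate_batch, batches)  # map preserves order
        return [iri for batch in results for iri in batch]


entities = generate_all(entity_count=1000, batch_size=100, worker_threads=4)
```

With the basic configuration values (1000 entities, batch size 100) this produces 10 batches shared among 4 workers.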
Tips
- Start simple: Use the minimal configuration and gradually add customizations
- Test with small datasets: Use low entity counts (10-100) while configuring
- Use fixed seeds: Set a seed value for reproducible results during development
- Monitor performance: Increase worker_threads for large datasets
- Enable parallel writing: Set parallel_writing = true and parallel_file_count = 0 for automatic optimization
- Validate output: Check that the generated data conforms to your ShEx or SHACL schema
Output Files
When you run the generator with write_stats = true, you'll get:
- Data file (generated_data.ttl): The actual RDF data in your chosen format
- Statistics file (generated_data.stats.json): Generation statistics including:
  - Total triples generated
  - Entity counts per shape type
  - Generation performance metrics
  - Data distribution information
Example statistics:
Architecture
The generator is built with a modular, functional architecture:
- config/: Configuration management and validation
- field_generators/: Composable field value generators
- shape_processing/: ShEx schema parsing and analysis
- parallel_generation/: Parallel data generation engine
- output/: Multiple format output writers