rudof_generate 0.2.20

RDF data shapes implementation in Rust
Documentation

Data Generator

A modern, configurable synthetic RDF data generator that creates realistic data conforming to ShEx or SHACL schemas.

Features

  • Configuration-driven: Use TOML/JSON configuration files to control generation parameters
  • Parallel processing: Generate data using multiple threads for better performance
  • Parallel writing: Automatically write to multiple files simultaneously for optimal I/O performance
  • Flexible field generation: Composable field generators for different data types
  • ShEx and SHACL schema support: Generate data that conforms to both ShEx shape definitions and SHACL constraints
  • Auto-detection: Automatically detect schema format based on file extension
  • Multiple output formats: Support for Turtle, N-Triples, JSON-LD, and more

Quick Start

You can use these commands to test the application. Execute them from the root folder (/home/diego/Documents/rudof/).

SHACL Examples

# Generate data from SHACL schema (auto-detected by .ttl extension)
cargo run -p data_generator -- --schema examples/simple_shacl.ttl --output-file shacl_data.ttl --entities 100

# Generate with specific seed for reproducible SHACL data
cargo run -p data_generator -- --schema examples/simple_shacl.ttl --output-file shacl_reproducible.ttl --entities 50 --seed 12345

# Generate from complex SHACL schema with more entities
cargo run -p data_generator -- --schema examples/shacl/node_shacl.ttl --output-file complex_shacl_data.ttl --entities 200

# Use parallel processing for large SHACL datasets
cargo run -p data_generator -- --schema examples/simple_shacl.ttl --output-file large_shacl_data.ttl --entities 5000 --parallel 8

ShEx Examples

# Generate data from ShEx schema (auto-detected by .shex extension)
cargo run -p data_generator -- --schema examples/simple.shex --output-file shex_data.ttl --entities 100

# Generate with configuration file and ShEx schema
cargo run -p data_generator -- --config data_generator/examples/simple_config.toml --schema data_generator/examples/schema.shex

# Generate with inline parameters using example ShEx schema
cargo run -p data_generator -- --schema data_generator/examples/schema.shex --output-file quick_shex_data.ttl --entities 100

# Generate with custom seed for reproducible ShEx results
cargo run -p data_generator -- --schema data_generator/examples/schema.shex --entities 50 --seed 12345

Configuration-Driven Examples

# Use automatic parallel configuration for medium datasets (works with both formats)
cargo run -p data_generator -- --config data_generator/examples/auto_parallel.toml --schema examples/simple_shacl.ttl

# Use high-performance parallel configuration for large datasets
cargo run -p data_generator -- --config data_generator/examples/parallel_config.toml --schema examples/simple.shex

# Show help for all options
cargo run -p data_generator -- --help

Sample SHACL Schema (simple_shacl.ttl)

@prefix :       <http://example.org/> .
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

:Person a sh:NodeShape ;
   sh:closed true ;
   sh:property [
    sh:path     :name ;
    sh:minCount 1;
    sh:maxCount 1;
    sh:datatype xsd:string ;
  ] ;
  sh:property [
   sh:path     :birthDate ;
   sh:maxCount 1;
   sh:datatype xsd:date ;
  ] ;
  sh:property [
   sh:path     :enrolledIn ;
   sh:node    :Course ;
  ] .

:Course a sh:NodeShape;
   sh:closed true ;
   sh:property [
    sh:path     :name ;
    sh:minCount 1;
    sh:maxCount 1;
    sh:datatype xsd:string ;
  ] .

Sample Generated Output

From SHACL schema:

<http://example.org/Person-1> <http://example.org/name> "Diana Jones" ;
	<http://example.org/enrolledIn> <http://example.org/Course-1> ;
	<http://example.org/birthDate> "1971-03-12"^^<http://www.w3.org/2001/XMLSchema#date> ;
	a <http://example.org/Person> .
<http://example.org/Course-1> <http://example.org/name> "Advanced Mathematics" ;
	a <http://example.org/Course> .

From ShEx schema:

<http://example.org/Person-1> a <http://example.org/Person> ;
	<http://example.org/name> "Fiona Rodriguez" .
<http://example.org/Course-1> a <http://example.org/Course> ;
	<http://example.org/name> "Computer Science" .

Normal Start

  1. Create a configuration file (copy from examples below):
# Copy the simple ready-to-use config
cp data_generator/examples/simple_config.toml my_config.toml

# Or copy the comprehensive example
cp data_generator/examples/config.toml my_config.toml
  1. Run the generator with your schema:
# For SHACL schemas (.ttl, .rdf, .nt files)
data_generator --config my_config.toml --schema your_schema.ttl

# For ShEx schemas (.shex files)
data_generator --config my_config.toml --schema your_schema.shex

# Auto-detection works - no need to specify format
data_generator --config my_config.toml --schema your_schema_file

Usage

# Generate data using configuration file (works with both ShEx and SHACL)
data_generator --config config.toml --schema schema_file

# Generate with inline parameters from SHACL schema
data_generator --schema schema.ttl --output-file data.ttl --entities 1000

# Generate with inline parameters from ShEx schema
data_generator --schema schema.shex --output-file data.ttl --entities 1000

# Generate with custom seed for reproducible results
data_generator --schema schema_file --entities 500 --seed 12345

# Use multiple threads for faster generation
data_generator --schema schema_file --entities 10000 --parallel 8

# Show help for all options
data_generator --help

Configuration

See examples/config.toml for configuration options.

Configuration Examples

Basic Configuration (config.toml)

# Basic data generation settings
[generation]
entity_count = 1000          # Number of entities to generate
seed = 12345                 # Random seed for reproducible results
entity_distribution = "Equal" # How to distribute entities across shapes
cardinality_strategy = "Balanced" # How to handle cardinalities

# Field generation settings
[field_generators.default]
locale = "en"               # Locale for generated text
quality = "Medium"          # Data quality level

# Output configuration
[output]
path = "generated_data.ttl" # Output file path
format = "Turtle"           # Output format
compress = false            # Whether to compress output
write_stats = true          # Write generation statistics

# Parallel processing
[parallel]
worker_threads = 4          # Number of worker threads
batch_size = 100           # Entity batch size
parallel_shapes = true     # Process shapes in parallel
parallel_fields = true     # Generate fields in parallel

Advanced Configuration with Custom Field Generators

# Advanced configuration with custom field generators
[generation]
entity_count = 5000
seed = 98765
entity_distribution = "Weighted"
cardinality_strategy = "Random"

# Weighted distribution for different shape types
[generation.distribution_weights]
"http://example.org/Person" = 0.5        # 50% persons
"http://example.org/Organization" = 0.3  # 30% organizations
"http://example.org/Course" = 0.2        # 20% courses

[field_generators.default]
locale = "en"
quality = "High"

# Custom integer generation with specific ranges
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#integer"]
generator = "integer"
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#integer".parameters]
min = 1
max = 10000

# Custom decimal generation
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#decimal"]
generator = "decimal"
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#decimal".parameters]
min = 0.0
max = 1000.0
precision = 2

# Custom date generation
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#date"]
generator = "date"
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#date".parameters]
start_year = 1980
end_year = 2024

# Property-specific generators
[field_generators.properties."http://example.org/name"]
generator = "string"
parameters = {}

[field_generators.properties."http://example.org/email"]
generator = "string"
[field_generators.properties."http://example.org/email".parameters]
templates = [
    "{firstName}.{lastName}@{domain}",
    "{firstName}{lastName}{number}@{domain}",
    "info@{domain}",
    "contact@{domain}"
]

[field_generators.properties."http://example.org/legalName"]
generator = "string"
parameters = {}

# Output with compression
[output]
path = "large_dataset.ttl.gz"
format = "Turtle"
compress = true
write_stats = true

# High-performance parallel settings
[parallel]
worker_threads = 8
batch_size = 250
parallel_shapes = true
parallel_fields = true

Minimal Configuration

# Minimal configuration - uses defaults for most settings
[generation]
entity_count = 100

[output]
path = "simple_data.ttl"

Custom Entity Distribution

[generation]
entity_count = 2000
entity_distribution = "Custom"

# Exact entity counts per shape
[generation.custom_counts]
"http://example.org/Person" = 1000
"http://example.org/Organization" = 200
"http://example.org/Course" = 800

[output]
path = "custom_distribution.ttl"

Using Configuration Files

# Use TOML configuration with any schema format
data_generator --config config.toml --schema schema_file

# Use JSON configuration with SHACL schema
data_generator --config config.json --schema schema.ttl

# Use JSON configuration with ShEx schema
data_generator --config config.json --schema schema.shex

# Override config with command line (works with both formats)
data_generator --config config.toml --schema schema_file --entities 5000 --output-file override.ttl

Parallel Writing Examples

The data generator supports parallel writing to multiple files for improved I/O performance. The system can automatically detect the optimal number of files based on your dataset size and system capabilities.

Automatic File Count Detection

Set parallel_file_count = 0 to enable automatic detection:

# Small dataset (50 entities) → automatically uses 1 file
cargo run --bin data_generator -- -c examples/small_auto.toml -s examples/schema_file

# Medium dataset (1000 entities) → automatically uses 8 files
cargo run --bin data_generator -- -c examples/auto_parallel.toml -s examples/schema_file

# Large dataset (5000 entities) → automatically uses 16 files
cargo run --bin data_generator -- -c examples/large_auto.toml -s examples/schema_file

Manual Parallel Writing Configuration

[output]
path = "dataset.ttl"
format = "Turtle"
parallel_writing = true      # Enable parallel writing
parallel_file_count = 8      # Write to 8 parallel files (manual setting)

Auto-Detection Configuration

[output]
path = "dataset.ttl"
format = "Turtle"
parallel_writing = true      # Enable parallel writing
parallel_file_count = 0      # 0 = auto-detect optimal count

Auto-detection algorithm:

  • Small datasets (≤1,000 triples): 1 file (no overhead)
  • Small-medium (1,001-5,000 triples): Up to 4 files
  • Medium (5,001-50,000 triples): Up to 8 files (2x CPU cores)
  • Large (>50,000 triples): Up to 16 files (2x CPU cores, capped)

Output files:

  • dataset_part_001.ttl, dataset_part_002.ttl, etc.
  • dataset.manifest.txt (lists all parallel files)
  • dataset.stats.json (combined statistics)

Performance benefits:

  • Small dataset: 28.6ms vs ~35ms sequential (no significant difference)
  • Medium dataset: 143.3ms vs 381ms sequential (62% faster)
  • Large dataset: 601ms vs ~1200ms sequential (50% faster)

JSON Configuration Example

{
  "generation": {
    "entity_count": 1000,
    "seed": 12345,
    "entity_distribution": "Equal",
    "cardinality_strategy": "Balanced"
  },
  "field_generators": {
    "default": {
      "locale": "en",
      "quality": "Medium"
    },
    "datatypes": {
      "http://www.w3.org/2001/XMLSchema#integer": {
        "generator": "integer",
        "parameters": {
          "min": 1,
          "max": 10000
        }
      },
      "http://www.w3.org/2001/XMLSchema#string": {
        "generator": "string",
        "parameters": {}
      }
    },
    "properties": {
      "http://example.org/name": {
        "generator": "string",
        "parameters": {}
      }
    }
  },
  "output": {
    "path": "generated_data.ttl",
    "format": "Turtle",
    "compress": false,
    "write_stats": true
  },
  "parallel": {
    "worker_threads": 4,
    "batch_size": 100,
    "parallel_shapes": true,
    "parallel_fields": true
  }
}

Configuration Options Reference

Generation Settings

  • entity_count: Total number of entities to generate
  • seed: Random seed for reproducible results (optional)
  • entity_distribution: How to distribute entities across shapes
    • "Equal": Equal distribution across all shapes
    • "Weighted": Use weights to control distribution
    • "Custom": Specify exact counts per shape
  • cardinality_strategy: How to handle property cardinalities
    • "Minimum": Use minimum cardinality values
    • "Maximum": Use maximum cardinality values
    • "Random": Random values within cardinality range
    • "Balanced": Deterministic but varied distribution

Field Generator Settings

  • locale: Language/locale for generated text ("en", "es", "fr")
  • quality: Data quality level ("Low", "Medium", "High")
  • datatypes: Custom generators for specific XSD datatypes
  • properties: Custom generators for specific properties

Output Settings

  • path: Output file path
  • format: Output format ("Turtle", "NTriples", "JSONLD", "RdfXml")
  • compress: Whether to compress output file
  • write_stats: Include generation statistics
  • parallel_writing: Enable writing to multiple parallel files for better I/O performance
  • parallel_file_count: Number of parallel files (0 = auto-detect optimal count)

Parallel Processing

  • worker_threads: Number of parallel worker threads
  • batch_size: Entity batch size for processing
  • parallel_shapes: Process different shapes in parallel
  • parallel_fields: Generate field values in parallel

Tips

  • Start simple: Use the minimal configuration and gradually add customizations
  • Test with small datasets: Use low entity counts (10-100) while configuring
  • Use fixed seeds: Set a seed value for reproducible results during development
  • Monitor performance: Increase worker_threads for large datasets
  • Enable parallel writing: Set parallel_writing = true and parallel_file_count = 0 for automatic optimization
  • Validate output: Check generated data conforms to your ShEx schema expectations

Output Files

When you run the generator with write_stats = true, you'll get:

  1. Data file (generated_data.ttl): The actual RDF data in your chosen format
  2. Statistics file (generated_data.stats.json): Generation statistics including:
    • Total triples generated
    • Entity counts per shape type
    • Generation performance metrics
    • Data distribution information
    • Triple validity percentage
    • Shape translation loss percentage

Example statistics:

{
  "total_triples": 15248,
  "generation_time": "497ms",
  "shape_counts": {
    "http://example.org/Person": 334,
    "http://example.org/Organization": 333,
    "http://example.org/Course": 333
  },
  "conformance_metrics": {
    "total_generated_triples": 15248,
    "valid_triples": 14991,
    "triple_validity_percentage": 98.31,
    "original_schema_constraints": 42,
    "represented_constraints_in_unified": 38,
    "shape_translation_loss_percentage": 9.52
  }
}

Architecture

The generator is built with a modular, functional architecture:

  • config/: Configuration management and validation
  • field_generators/: Composable field value generators
  • shape_processing/: ShEx schema parsing and analysis
  • parallel_generation/: Parallel data generation engine
  • output/: Multiple format output writers

ShEx And SHACL Coverage Matrix

This matrix summarizes what the current generator can translate into the unified model and therefore use for generation and metric computation.

Feature ShEx SHACL Status Notes
Shape declarations Yes Yes Supported Core shape IDs are preserved
Property constraints Yes Yes Supported Property IRI and associated constraints are extracted
Cardinality Yes Yes Supported min/max in ShEx, sh:minCount/sh:maxCount in SHACL
Datatype Yes Yes Supported Mapped to unified datatype constraints
Node kind Yes Yes Supported Basic node kinds are mapped
Pattern Yes Yes Supported Regex patterns are extracted when present
Length facets Yes Yes Partially supported minLength/maxLength supported; exact length is not translated
Numeric range facets Yes Yes Supported Inclusive/exclusive bounds are extracted
Value set / enumeration Yes Yes Supported in and hasValue are represented
Shape references Yes Yes Supported Recursive or referenced shapes are mapped
Closed shapes Partial Partial Partially supported Present in the model, but not fully extracted from schemas
Triple expression composition Yes Partial Partially supported EachOf / OneOf are handled; some advanced references are not
Logical combinators Limited No Not fully supported SHACL sh:or, sh:and, sh:not, sh:xone are not fully translated
Qualified value shapes Limited No Not fully supported Not mapped in the unified model
SPARQL-based constraints No No Not supported Outside current generator scope
Semantic actions / imports Limited Limited Not supported Not preserved in the unified model