# dgen-py

The world's fastest Python random data generation, with NUMA optimization and a zero-copy interface.
## Features

- **Blazing Fast**: 10 GB/s per core, up to 300 GB/s verified
- **Ultra-Fast Allocation**: `create_bytearrays()` for 1,280x faster pre-allocation than Python (NEW in v0.2.0)
- **Controllable Characteristics**: Configurable deduplication and compression ratios
- **Reproducible Data**: Seed parameter for identical data generation (v0.1.6) with dynamic reseeding (v0.1.7)
- **Multi-Process NUMA**: One Python process per NUMA node for maximum throughput
- **True Zero-Copy**: Python buffer protocol with direct memory access (no data copying)
- **Streaming API**: Generate terabytes of data with constant 32 MB memory usage
- **Thread Pool Reuse**: Created once, reused across all operations
- **Built with Rust**: Memory-safe, production-quality implementation
## Performance

### Streaming Benchmark - 100 GB Test

Comparison of streaming random data generation methods on a 12-core system:
| Method | Throughput | Speedup vs Baseline | Memory Required |
|---|---|---|---|
| `os.urandom()` (baseline) | 0.34 GB/s | 1.0x | Minimal |
| NumPy Multi-Thread | 1.06 GB/s | 3.1x | 100 GB RAM* |
| Numba JIT Xoshiro256++ (streaming) | 57.11 GB/s | 165.7x | 32 MB RAM |
| dgen-py v0.1.5 (streaming) | 58.46 GB/s | 169.6x | 32 MB RAM |
\* NumPy requires the full dataset in memory (10 GB tested; a 100 GB run would need 100 GB of RAM)
Key Findings:
- dgen-py matches Numba's streaming performance (58.46 vs 57.11 GB/s)
- 55x faster than NumPy while using 3,000x less memory (32 MB vs 100 GB)
- Streaming architecture: Can generate unlimited data with only 32 MB RAM
- Per-core throughput: 4.87 GB/s (12 cores)
⚠️ **Critical for Storage Testing**: Only dgen-py supports configurable deduplication and compression ratios. All other methods (os.urandom, NumPy, Numba) generate purely random data with maximum entropy, making them unsuitable for realistic storage system testing. Real-world storage workloads require controllable data characteristics to test deduplication engines, compression algorithms, and storage efficiency; these capabilities are unique to dgen-py.
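dgen-py's exact algorithm isn't shown here, but the idea of a controllable compression ratio can be sketched with the stdlib alone: mix incompressible random bytes with highly compressible zeros so a chosen fraction of the stream compresses away. The function name `make_compressible` and the mixing scheme are illustrative, not dgen-py's implementation:

```python
import os
import zlib

def make_compressible(n: int, target_ratio: float) -> bytes:
    """Approximate a target compression ratio by mixing random bytes
    (incompressible) with zeros (almost free to compress)."""
    random_part = int(n / target_ratio)  # bytes that must survive compression
    return os.urandom(random_part) + bytes(n - random_part)

data = make_compressible(1024 * 1024, target_ratio=2.0)
compressed = zlib.compress(data)
ratio = len(data) / len(compressed)
print(f"achieved ratio ~ {ratio:.2f}:1")
```

Real generators interleave the compressible runs throughout the stream rather than appending them, but the measured ratio behaves the same way.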
### Multi-NUMA Scalability - GCP Emerald Rapids

Scalability testing on Google Cloud Platform Intel Emerald Rapids systems (1024 GB workload, compress=1.0):
| Instance | Physical Cores | NUMA Nodes | Aggregate Throughput | Per-Core | Scaling Efficiency |
|---|---|---|---|---|---|
| C4-8 | 4 | 1 (UMA) | 36.26 GB/s | 9.07 GB/s | Baseline |
| C4-16 | 8 | 1 (UMA) | 86.41 GB/s | 10.80 GB/s | 119% |
| C4-32 | 16 | 1 (UMA) | 162.78 GB/s | 10.17 GB/s | 112% |
| C4-96 | 48 | 2 (NUMA) | 248.53 GB/s | 5.18 GB/s | 51%* |
\* NUMA penalty: 49% per-core reduction on multi-socket systems, but still the highest absolute throughput
Key Findings:
- Excellent UMA scaling: 112-119% efficiency on single-NUMA systems (super-linear due to larger L3 cache)
- Per-core performance: 10.80 GB/s on C4-16 (3.0x improvement vs dgen-py v0.1.3's 3.60 GB/s)
- Compression tradeoff: compress=2.0 provides 1.3-1.5x speedup, but makes data compressible (choose based on your test requirements, not performance)
- Storage headroom: Even modest 8-core systems exceed 86 GB/s (far beyond typical storage requirements)
See `docs/BENCHMARK_RESULTS_V0.1.5.md` for the complete analysis.
## Installation

### From PyPI (Recommended)

```bash
pip install dgen-py
```

### System Requirements

For NUMA support (Linux only):

```bash
# Ubuntu/Debian (package names assumed; hwloc provides topology detection)
sudo apt-get install libhwloc-dev numactl

# RHEL/CentOS/Fedora
sudo dnf install hwloc-devel numactl
```

Note: NUMA support is optional. Without these libraries, the package works perfectly on single-NUMA systems (workstations, cloud VMs).
## Quick Start

### Version 0.2.0: Ultra-Fast Bulk Buffer Allocation

For scenarios where you need to pre-generate all data in memory before writing, use `create_bytearrays()` for 1,280x faster allocation than a Python list comprehension:
```python
import time
import dgen_py

# Pre-generate 24 GB in 32 MB chunks
TOTAL_SIZE = 24 * 1024**3              # 24 GB
CHUNK_SIZE = 32 * 1024**2              # 32 MB chunks
NUM_CHUNKS = TOTAL_SIZE // CHUNK_SIZE  # 768 chunks

# ✅ FAST: Rust-optimized allocation (7-11 ms for 24 GB!)
t0 = time.perf_counter()
chunks = dgen_py.create_bytearrays(NUM_CHUNKS, CHUNK_SIZE)  # argument order assumed
alloc_time = time.perf_counter() - t0

# Fill buffers with high-performance generation
gen = dgen_py.Generator()              # constructor arguments omitted here
t0 = time.perf_counter()
for buf in chunks:
    gen.fill_chunk(buf)
gen_time = time.perf_counter() - t0

# Now write to storage...
# for buf in chunks:
#     f.write(buf)
```
Performance (12-core system):

```
Allocation: 10.9 ms @ 2204 GB/s   (1,280x faster than Python!)
Generation: 1.59 s  @ 15.1 GB/s
```
Performance comparison:

| Method | Allocation Time (24 GB) | Speedup |
|---|---|---|
| Python `[bytearray(size) for _ in ...]` | 12-14 seconds | 1x (baseline) |
| `dgen_py.create_bytearrays()` | 7-11 ms | 1,280x faster |
When to use:

- ✅ Pre-generation pattern (DLIO benchmark, batch data loading)
- ✅ Need all data in RAM before writing
- ❌ Streaming: use `Generator.fill_chunk()` with a reusable buffer instead (see below)
Why it's fast:

- Uses the Python C API (`PyByteArray_Resize`) directly from Rust
- For 32 MB chunks, glibc automatically uses `mmap` (≥ 128 KB threshold)
- Zero-copy kernel page allocation, no heap fragmentation
- Bypasses Python interpreter overhead
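The pre-generate-then-write pattern itself can be exercised with stdlib primitives; here `os.urandom` stands in (slowly) for dgen-py's generator, and sizes are scaled down for a quick run:

```python
import os
import tempfile

CHUNK_SIZE = 1024**2   # 1 MB chunks (scaled down from 32 MB)
NUM_CHUNKS = 8

# Pre-allocate all buffers, then fill each one in place
chunks = [bytearray(CHUNK_SIZE) for _ in range(NUM_CHUNKS)]
for buf in chunks:
    buf[:] = os.urandom(CHUNK_SIZE)  # stand-in for gen.fill_chunk(buf)

# Now write to storage
with tempfile.NamedTemporaryFile(delete=False) as f:
    for buf in chunks:
        f.write(buf)
    path = f.name

assert os.path.getsize(path) == NUM_CHUNKS * CHUNK_SIZE
os.unlink(path)
```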
### Version 0.1.7: Dynamic Seed Changes
Dynamically change the random seed to reset the data stream or create alternating patterns without recreating the Generator:
```python
import dgen_py

gen = dgen_py.Generator(seed=1234)   # constructor arguments illustrative
buf = bytearray(32 * 1024**2)

# Generate data with seed A
gen.fill_chunk(buf)                  # Pattern A

# Switch to seed B
gen.set_seed(5678)                   # reseeding method name assumed; see the API docs
gen.fill_chunk(buf)                  # Pattern B

# Back to seed A - resets the stream!
gen.set_seed(1234)
gen.fill_chunk(buf)                  # SAME as first chunk (pattern A)
```
Use cases:
- RAID stripe testing with alternating patterns per drive
- Multi-phase AI/ML workloads (different patterns for metadata/payload/footer)
- Complex reproducible benchmark scenarios
- Low-overhead stream reset (no Generator recreation)
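The reset behavior mirrors reseeding in Python's own PRNG; a stdlib sketch of the pattern, using `random.Random` in place of dgen-py's Generator:

```python
import random

gen = random.Random(1234)        # seed A
pattern_a = gen.randbytes(32)

gen.seed(5678)                   # switch to seed B
pattern_b = gen.randbytes(32)

gen.seed(1234)                   # back to seed A resets the stream
assert gen.randbytes(32) == pattern_a
assert pattern_a != pattern_b
```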
### Version 0.1.6: Reproducible Data Generation
Generate identical data across runs for reproducible benchmarking and testing:
```python
import dgen_py

# Reproducible mode - same seed produces identical data
gen1 = dgen_py.Generator(seed=1234)  # constructor arguments illustrative
gen2 = dgen_py.Generator(seed=1234)
# ✅ gen1 and gen2 produce IDENTICAL data streams

# Non-deterministic mode (default) - different data each run
gen3 = dgen_py.Generator()           # seed=None (default)
```
Use cases:

- **Reproducible benchmarking**: Compare storage systems with identical workloads
- **Consistent testing**: Same test data across CI/CD pipeline runs
- **Debugging**: Regenerate exact data streams for issue investigation
- **Compliance**: Verifiable data generation for audits
### Streaming API (Basic Usage)
For unlimited data generation with constant memory usage, use the streaming API:
```python
import time
import dgen_py

# Generate 100 GB with streaming (only 32 MB in memory at a time)
TOTAL = 100 * 1024**3
gen = dgen_py.Generator()            # constructor arguments omitted here

# Create single reusable buffer
buffer = bytearray(32 * 1024**2)

# Stream data in chunks (zero-copy, parallel generation)
start = time.perf_counter()
remaining = TOTAL
while remaining > 0:
    nbytes = gen.fill_chunk(buffer)  # assumes fill_chunk returns the byte count
    if nbytes == 0:
        break
    # Write to file/network: buffer[:nbytes]
    remaining -= nbytes
elapsed = time.perf_counter() - start
```
Example output (8-core system):

```
Throughput: 86.41 GB/s
```
When to use:

- ✅ Generating very large datasets (larger than available RAM)
- ✅ Consistent low memory footprint (32 MB)
- ✅ Network streaming, continuous data generation
## System Information

```python
import dgen_py

# Function name assumed; reports detected cores and NUMA topology
print(dgen_py.system_info())
```
## Advanced Usage

### Multi-Process NUMA (For Multi-NUMA Systems)
For maximum throughput on multi-socket systems, use one Python process per NUMA node with process affinity pinning.
See `python/examples/benchmark_numa_multiprocess_v2.py` for a complete implementation.
Key architecture:

- One Python process per NUMA node
- Process pinning via `os.sched_setaffinity()` to local cores
- Local memory allocation on each NUMA node
- Synchronized start with `multiprocessing.Barrier`
Results:
- C4-96 (48 cores, 2 NUMA nodes): 248.53 GB/s aggregate
- C4-32 (16 cores, 1 NUMA node): 162.78 GB/s with 112% scaling efficiency
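A minimal sketch of the per-process pinning step on Linux; the node-to-core mapping below is illustrative (real code should derive it from hwloc or `/sys/devices/system/node`):

```python
import os

# Illustrative NUMA-node-to-core mapping; query the real topology in practice
NODE_CORES = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def pin_to_node(node: int) -> set:
    allowed = os.sched_getaffinity(0)  # cores this process may use
    target = NODE_CORES[node] & allowed
    if target:                         # keep the current mask if no overlap
        os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)

cores = pin_to_node(0)
print(f"worker pinned to cores: {sorted(cores)}")
```

In the multi-process setup, each worker calls this with its own node number before allocating buffers, so memory lands on the local node.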
### Chunk Size Optimization

The default chunk size is automatically optimized for your system. You can override it if needed:

```python
# chunk_size parameter name assumed; see the API docs
gen = dgen_py.Generator(chunk_size=64 * 1024**2)  # 64 MB chunks
```

Newer CPUs (Emerald Rapids, Sapphire Rapids) with larger L3 caches benefit from 64 MB chunks.
### Deduplication and Compression Ratios

Performance vs test accuracy tradeoff:

```python
# FAST: Incompressible data (1.0x baseline)
gen = dgen_py.Generator(compress_ratio=1.0)

# FASTER: More compressible (1.3-1.5x speedup)
gen = dgen_py.Generator(compress_ratio=2.0)
```
Important: Higher `compress_ratio` values improve generation performance (1.3-1.5x faster) but make the data more compressible, which may not represent your actual workload:

- `compress_ratio=1.0`: Incompressible data (realistic for encrypted files, compressed archives)
- `compress_ratio=2.0`: 2:1 compressible data (realistic for text, logs, uncompressed images)
- `compress_ratio=3.0+`: Highly compressible data (may not be realistic)

Choose based on YOUR test requirements, not performance numbers. If testing storage with compression enabled, use `compress_ratio=1.0` to avoid inflating storage efficiency metrics.

Note: `dedup_ratio` has zero performance impact (< 1% variance).
### NUMA Modes

```python
# Parameter names and values below are assumed; see the API docs

# Auto-detect topology (recommended)
gen = dgen_py.Generator(numa_mode="auto")

# Force UMA (single-socket)
gen = dgen_py.Generator(numa_mode="uma")

# Manual NUMA node binding (multi-process only)
gen = dgen_py.Generator(numa_node=0)  # Bind to node 0
```
## Architecture

### Zero-Copy Implementation

Python buffer protocol with direct memory access:
- No data copying between Rust and Python
- GIL released during generation (true parallelism)
- Memoryview creation < 0.001ms (verified zero-copy)
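The zero-copy behavior is observable from the Python side: a memoryview over a generated bytearray shares the underlying memory instead of copying it (stdlib-only sketch):

```python
buf = bytearray(32 * 1024**2)   # 32 MB buffer, e.g. filled by a generator
view = memoryview(buf)          # O(1): no bytes are copied

view[0] = 0xFF                  # writes go straight to the backing buffer
assert buf[0] == 0xFF

head = view[:1024]              # slicing a memoryview also avoids a copy
assert head.obj is buf          # still backed by the same bytearray
```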
### Parallel Generation
- 4 MiB internal blocks distributed across all cores
- Thread pool created once, reused for all operations
- Xoshiro256++ RNG (5-10x faster than ChaCha20)
- Optimal for L3 cache performance
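For reference, the Xoshiro256++ core step is only a few lines; this pure-Python sketch shows the algorithm itself (dgen-py's Rust version is what runs vectorized and in parallel across cores):

```python
MASK = (1 << 64) - 1

def rotl(x: int, k: int) -> int:
    """64-bit left rotation."""
    return ((x << k) | (x >> (64 - k))) & MASK

class Xoshiro256pp:
    def __init__(self, s0: int, s1: int, s2: int, s3: int):
        # State must not be all zeros
        self.s = [s0 & MASK, s1 & MASK, s2 & MASK, s3 & MASK]

    def next_u64(self) -> int:
        s = self.s
        result = (rotl((s[0] + s[3]) & MASK, 23) + s[0]) & MASK
        t = (s[1] << 17) & MASK
        s[2] ^= s[0]
        s[3] ^= s[1]
        s[1] ^= s[2]
        s[0] ^= s[3]
        s[2] ^= t
        s[3] = rotl(s[3], 45)
        return result

rng = Xoshiro256pp(1, 2, 3, 4)
print(hex(rng.next_u64()))
```

Its speed advantage over ChaCha20 comes from using only adds, XORs, shifts, and rotations per output word, at the cost of not being cryptographically secure.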
### NUMA Optimization
- Multi-process architecture (one process per NUMA node)
- Local memory allocation on each node
- Local core affinity (no cross-node traffic)
- Automatic topology detection via hwloc
## Use Cases
- Storage benchmarking: Generate realistic test data at 40-188 GB/s
- Network testing: High-throughput data sources
- AI/ML profiling: Simulate data loading pipelines
- Compression testing: Validate compressor behavior with controlled ratios
- Deduplication testing: Test dedup systems with known ratios
## License

Dual-licensed under MIT OR Apache-2.0.
## Credits

- Built with PyO3 and Maturin
- Uses hwlocality for NUMA topology detection
- Xoshiro256++ RNG from the `rand` crate