# Changelog
All notable changes to dgen-rs/dgen-py will be documented in this file.
## [0.1.7] - 2026-01-25
### Added
- **Stream Reset via `set_seed()` Method**: Dynamically change the random seed during generation to create reproducible data patterns
- Calling `set_seed(seed)` resets the data stream to the beginning of that seed's sequence
- Functionally equivalent to creating a new generator, but with minimal overhead
- Enables complex patterns: headers/payloads/footers with different seeds, striped patterns for RAID testing (see the sketch below)
- Available in both Rust (`DataGenerator::set_seed()`) and Python (`Generator.set_seed()`)
- Comprehensive test coverage: `test_set_seed_stream_reset()` in Rust, `test_set_seed_method.py` in Python
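A minimal usage sketch in Python (the Rust API mirrors it). It assumes `fill_chunk()` accepts a writable buffer such as a `bytearray`, as in the v0.1.3 examples; segment sizes and seed values are illustrative:
```python
import dgen_py

gen = dgen_py.Generator(size=1024**3, seed=1)   # header stream from seed 1
header = bytearray(4 * 1024**2)
gen.fill_chunk(header)

gen.set_seed(2)                                  # switch to a payload stream
payload = bytearray(64 * 1024**2)
gen.fill_chunk(payload)

gen.set_seed(1)                                  # reset to the start of seed 1's stream
replay = bytearray(4 * 1024**2)
gen.fill_chunk(replay)
assert bytes(replay) == bytes(header)            # same seed, same stream from the start
```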
### Changed
- Internal RNG architecture now uses a block sequence counter for deterministic stream reset
- Block-level RNG derivation ensures same seed always produces identical data stream
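The derivation itself is internal to the crate; below is a hypothetical Python sketch of the general technique (a per-block RNG keyed by `(seed, block_index)`), included only to illustrate why resetting the block counter reproduces the stream. The names and the hash-based keying are assumptions, not the actual Rust implementation:
```python
import hashlib
import random

def block_rng(seed: int, block_index: int) -> random.Random:
    """Derive an independent RNG for one block from (seed, block_index)."""
    digest = hashlib.blake2b(
        block_index.to_bytes(8, "little"),
        key=seed.to_bytes(8, "little"),
        digest_size=8,
    ).digest()
    return random.Random(int.from_bytes(digest, "little"))

# Each block depends only on (seed, block_index), so a stream reset is just
# "set the block counter back to 0"; no buffered RNG state has to be replayed.
assert block_rng(12345, 7).random() == block_rng(12345, 7).random()
```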
## [0.1.6] - 2026-01-23
### Added
#### Reproducible Data Generation
- **NEW: `seed` parameter** for `Generator()` constructor
- Optional `u64` seed value for reproducible random data streams
- When `seed=None` (default): Uses time + urandom entropy (non-deterministic, backward compatible)
- When `seed=Some(value)`: Generates identical data for same configuration (reproducible testing)
- Enables reproducible benchmarking, testing, and debugging workflows
#### Python API Enhancement
```python
# Reproducible mode - same seed produces identical data
gen1 = dgen_py.Generator(size=10*1024**3, seed=12345)
gen2 = dgen_py.Generator(size=10*1024**3, seed=12345)
# gen1 and gen2 will produce identical data
# Non-deterministic mode (default) - different data each time
gen3 = dgen_py.Generator(size=10*1024**3) # seed=None (default)
gen4 = dgen_py.Generator(size=10*1024**3) # seed=None (default)
# gen3 and gen4 will produce different data
```
#### Testing
- **NEW: `python/examples/test_seed_reproducibility.py`**: Comprehensive test suite
- Validates reproducibility with same seed
- Confirms non-determinism without seed
- Verifies different seeds produce different data
- SHA256 hash comparison for data integrity
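The test file's contents are not reproduced in this changelog; the following is a minimal sketch of the same checks, assuming `fill_chunk()` accepts a writable `bytearray`:
```python
import hashlib
import dgen_py

def stream_digest(seed, total=256 * 1024**2, chunk=16 * 1024**2):
    """Generate `total` bytes in `chunk`-sized pieces and return the stream's SHA256."""
    gen = dgen_py.Generator(size=total, seed=seed)
    buf = bytearray(chunk)
    sha = hashlib.sha256()
    for _ in range(total // chunk):
        gen.fill_chunk(buf)
        sha.update(buf)
    return sha.hexdigest()

assert stream_digest(12345) == stream_digest(12345)   # same seed: reproducible
assert stream_digest(12345) != stream_digest(67890)   # different seeds: different data
assert stream_digest(None) != stream_digest(None)     # no seed: non-deterministic
```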
### Changed
#### Core Implementation
- Modified `GeneratorConfig` struct to include optional `seed` field
- Updated `DataGenerator::new()` to use provided seed or generate entropy
- Made `generate_call_entropy()` conditional on the `seed` parameter (entropy is only generated when no seed is provided)
#### Code Quality
- Fixed all `clippy::identity_op` warnings in constants
- Added `#[allow(clippy::too_many_arguments)]` for PyO3 constructor (API requirement)
- Updated all examples and benchmarks to include new `seed` field
### Backward Compatibility
- **Fully backward compatible**: `seed` defaults to `None`
- Existing code continues to work without modification
- Default behavior unchanged (time + urandom entropy)
---
## [0.1.5] - 2026-01-19
### 🎉 Major Performance Improvements
#### Performance Gains vs v0.1.3
- **UMA systems (single NUMA node)**: ~50% improvement in per-core throughput
- v0.1.5: 10.80 GB/s per core (C4-16, 8 cores)
- v0.1.3: ~7 GB/s per core (12-core Ice Lake: 43.25 GB/s / 12 = 3.60 GB/s per thread)
- Note: v0.1.3 reported per-thread, v0.1.5 reports per physical core
- **NUMA systems**: Significant improvements due to bug fixes in NUMA implementation
- **Maximum aggregate**: **324.72 GB/s** on 48-core dual-NUMA system (GCP C4-96 with compress=2.0)
### Changed
#### Core Performance Optimization
- **BLOCK_SIZE increased**: 64 KB → **4 MB** for optimal L3 cache utilization
- 34% performance boost on modern CPUs (Emerald Rapids, Sapphire Rapids)
- Better parallelization across cores
- Reduced thread pool overhead
#### Multi-Process NUMA Architecture
- **Proper CPU affinity detection**: Uses `/sys/devices/system/node/nodeN/cpulist`
- **Process pinning**: `os.sched_setaffinity()` for NUMA locality
- **Synchronized start**: `multiprocessing.Barrier` for accurate timing
- **64 MB chunk size default**: Optimized for newer generation CPUs with larger L3 cache
### Added
#### Documentation
- **NEW: `docs/BENCHMARK_RESULTS_V0.1.5.md`**: Comprehensive 426-line performance analysis
- 4 GCP instances tested (C4-8, C4-16, C4-32, C4-96)
- Detailed scaling analysis (UMA vs NUMA)
- Compression ratio impact study (1.3-1.5x speedup with compress=2.0)
- Per-instance raw results and recommendations
#### Examples
- **`python/examples/benchmark_numa_multiprocess_v2.py`**: Production-grade NUMA benchmark
- Process affinity pinning via `os.sched_setaffinity()`
- Local memory allocation per NUMA node
- Synchronized multi-process execution
- Detailed per-node reporting
- **`examples/numa_test.rs`**: Native Rust NUMA testing utility
- **`examples/NUMA_BENCHMARK_README.md`**: NUMA architecture documentation
### Performance Results (v0.1.5)
**Scalability on GCP Intel Emerald Rapids (compress=1.0):**
| Instance | Physical Cores | NUMA Nodes | Throughput | Per-Core | Efficiency |
|----------|----------------|------------|------------|----------|------------|
| C4-8 | 4 | 1 (UMA) | 36.26 GB/s | 9.07 GB/s | Baseline |
| C4-16 | 8 | 1 (UMA) | **86.41 GB/s** | **10.80 GB/s** | **119%** |
| C4-32 | 16 | 1 (UMA) | **162.78 GB/s** | **10.17 GB/s** | **112%** |
| C4-96 | 48 | 2 (NUMA) | 248.53 GB/s | 5.18 GB/s | 51%* |
\* *NUMA penalty: 49% per-core reduction on multi-socket systems*
**Compression Ratio Impact (compress=2.0 vs compress=1.0):**
- C4-8: 53.95 GB/s (1.49x speedup)
- C4-16: 125.88 GB/s (1.46x speedup)
- C4-32: 222.28 GB/s (1.37x speedup)
- C4-96: 324.72 GB/s (1.31x speedup)
**Key Findings:**
- Excellent UMA scaling: 112-119% efficiency (super-linear due to larger L3 cache)
- Deduplication ratio has ZERO performance impact (< 1% variance)
- Higher compression ratios (e.g. compress=2.0) provide a 1.3-1.5x generation speedup but make the data more compressible (choose based on test requirements)
### Updated
#### README.md
- Highlighted 3.0x improvement as main feature
- Replaced v0.1.3 benchmarks with v0.1.5 data
- Streamlined examples (removed verbose output)
- Clarified compression ratio tradeoff (performance vs test accuracy)
- Reduced from 363 to 256 lines for PyPI publication
#### pyproject.toml
- Updated benchmark comments with v0.1.5 performance data
- Added performance gains section (3.0x improvement)
- Updated storage benchmarking guidance
- Reflected new compression impact analysis
### Technical Details
#### BLOCK_SIZE Optimization
- **Old** (v0.1.3): 64 KB blocks
- High thread pool overhead on large datasets
- Suboptimal L3 cache utilization
- **New** (v0.1.5): 4 MB blocks
- Reduced parallel overhead (fewer blocks to process)
- Better L3 cache hit rates on modern CPUs
- Result: 34% throughput improvement
#### NUMA Architecture Improvements
- **Proper topology detection**: Reads `/sys/devices/system/node/nodeN/cpulist`
- **CPU affinity pinning**: `os.sched_setaffinity(0, [cpu_list])`
- **Local memory allocation**: Each process allocates on its NUMA node
- **Synchronized execution**: `multiprocessing.Barrier` ensures fair comparison
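A condensed sketch of that multi-process pattern (Linux-only; the helper names are illustrative and the full version lives in `python/examples/benchmark_numa_multiprocess_v2.py`):
```python
import multiprocessing as mp
import os
import time

def node_cpus(node):
    """Parse /sys/devices/system/node/nodeN/cpulist, e.g. '0-23,48-71'."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = []
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

def worker(node, barrier, results):
    os.sched_setaffinity(0, node_cpus(node))  # pin the process to its NUMA node
    # ... create the generator and allocate buffers here so memory is node-local ...
    barrier.wait()                            # synchronized start for fair timing
    start = time.perf_counter()
    # ... generation loop ...
    results[node] = time.perf_counter() - start

if __name__ == "__main__":
    nodes = [0, 1]                            # one worker process per NUMA node
    barrier = mp.Barrier(len(nodes))
    results = mp.Manager().dict()
    procs = [mp.Process(target=worker, args=(n, barrier, results)) for n in nodes]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(dict(results))
```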
### Migration Guide
No breaking changes - existing code continues to work with 3.0x better performance.
**Optional optimization for newer CPUs:**
```python
# Override chunk size to 64 MB for Emerald Rapids, Sapphire Rapids
gen = dgen_py.Generator(
    size=100 * 1024**3,
    chunk_size=64 * 1024**2,  # 64 MB (default is auto-detected)
)
```
---
## [0.1.4] - 2026-01-18
### Changed
#### Documentation Accuracy
- **README.md**: Removed projected performance numbers, added actual NUMA benchmark results
- **README.md**: Removed references to private repositories
- Fixed benchmark result reporting to match actual measured performance
### Performance Results (v0.1.4)
**Multi-NUMA Benchmarks (actual measurements):**
| System | Cores | NUMA Nodes | Throughput | Per-Core | Efficiency |
|--------|-------|------------|------------|----------|------------|
| GCP C4-16 | 16 | 1 (UMA) | 39.87 GB/s | 2.49 GB/s | 100% (baseline) |
| GCP C4-96 | 96 | 4 | 126.96 GB/s | 1.32 GB/s | 53% |
| Azure HBv5 | 368 | 16 | 188.24 GB/s | 0.51 GB/s | 20% |
**Key Findings:**
- Sub-linear scaling expected for memory-intensive workloads
- All systems exceed 80 GB/s storage testing requirements
- Documentation now reflects actual measured performance
---
## [0.1.3] - 2026-01-17
### 🚀 Major Performance Improvements
#### Zero-Copy Parallel Streaming (22x Python Performance Boost)
- **TRUE zero-copy Python API**: `fill_chunk()` now generates **directly into Python buffer** (no temporary allocation)
- **GIL release**: Uses `py.detach()` to release GIL during generation (enables true parallelism)
- **Thread pool reuse**: Created once in `DataGenerator::new()`, reused for all `fill_chunk()` calls
- **Performance results on 12-core system**:
- Python: 43.25 GB/s (was 1.97 GB/s in v0.1.2 - **22x faster**)
- Native Rust: 47.18 GB/s
- Python now achieves **92% of native Rust performance**
- **Projected performance on 384-core HPC system**:
- Python: 1,384 GB/s (**17.3x faster** than 80 GB/s storage target)
- Native Rust: 1,511 GB/s (**18.9x faster** than storage target)
### Changed
#### Python API (`src/python_api.rs`)
- `PyGenerator::fill_chunk()`:
- Removed temporary buffer allocation
- Generates directly into Python buffer via `std::slice::from_raw_parts_mut`
- Releases GIL using `py.detach()` (replaces deprecated `py.allow_threads()`)
- True zero-copy from Rust to Python
#### Core Generator (`src/generator.rs`)
- `DataGenerator` struct:
- Added `max_threads: usize` field
- Added `thread_pool: Option<rayon::ThreadPool>` field (reused across all `fill_chunk()` calls)
- `DataGenerator::new()`:
- Creates thread pool once during initialization
- Configures from `GeneratorConfig::max_threads`
- `fill_chunk()`:
- Split into `fill_chunk_parallel()` (≥8 MB) and `fill_chunk_sequential()` (<8 MB)
- Threshold: 2 blocks (8 MB) to trigger parallel path
- `fill_chunk_parallel()`:
- Uses stored thread pool (eliminates per-call creation overhead)
- Generates via `pool.install(|| chunk.par_chunks_mut().for_each(...))`
- Zero-copy: generates directly into output buffer using rayon parallel iteration
### Added
#### Examples
- `examples/streaming_benchmark.rs`: Native Rust streaming benchmark (shows 47.18 GB/s)
- `python/examples/Benchmark_dgen-py_FIXED.py`: Python benchmark demonstrating zero-copy performance (43.25 GB/s)
#### Documentation
- Performance tips in README.md about optimal chunk sizes (64-256 MB)
- Technical details about thread pool reuse and zero-copy implementation
### Performance Comparison
**Development System (12 cores, UMA):**
| API | v0.1.2 | v0.1.3 | Speedup | Per-Core (v0.1.3) |
|-----|--------|--------|---------|-------------------|
| Python | 1.97 GB/s | 43.25 GB/s | **22x** | 3.60 GB/s |
| Rust | 47.18 GB/s | 47.18 GB/s | baseline | 3.93 GB/s |
**Key Insight**: Python achieves 92% efficiency vs native Rust (was only 4% in v0.1.2)
### Technical Details
#### Memory Architecture Changes
- **Old approach** (v0.1.2):
```rust
// Allocate a temporary buffer, generate into it, then copy into the Python buffer
let mut temp = vec![0u8; size];
self.inner.fill_chunk(&mut temp);
copy_nonoverlapping(temp.as_ptr(), dst_ptr, size);
```
Result: 2x memory bandwidth usage, GIL held during copy
- **New approach** (v0.1.3):
```rust
// Generate directly into the Python-owned buffer with the GIL released
py.detach(|| {
    let dst = unsafe { from_raw_parts_mut(buf_ptr, size) };
    self.inner.fill_chunk(dst)
})
```
Result: 1x memory bandwidth, parallel execution without GIL
#### Thread Pool Overhead Eliminated
- **Old**: Created new thread pool for every 64 MB chunk
- On a 384-core system: a new 384-thread pool created for each of ~16,000 chunks per TB = catastrophic overhead
- **New**: Thread pool created once, reused for ~16,000 chunks per TB
- Result: Eliminated dominant bottleneck
### Breaking Changes
None - API remains fully compatible with v0.1.2
### Migration Guide
No code changes required - existing applications automatically benefit from 22x performance improvement.
**Optional optimization**: Increase chunk size to 64-256 MB for streaming workloads:
```python
# Old recommendation (still works)
gen.fill_chunk(buffer[:4*1024*1024]) # 4 MB
# New recommendation for high-performance streaming
gen.fill_chunk(buffer[:64*1024*1024]) # 64 MB - better parallelization
```
Larger chunks enable better parallelization while maintaining cache efficiency.
## [Unreleased]
### Credits
- Algorithm ported from s3dlio/src/data_gen_alt.rs
- NUMA detection from kv-cache-bench
- Built with PyO3 and Maturin
## [0.1.0] - 2026-01-08
Initial release.