# Polars Integration Performance Guide
## Overview
This document covers performance characteristics and threading considerations when using XGBoost with Polars DataFrames.
## Performance Characteristics
### Conversion Overhead
The Polars integration converts DataFrames to the row-major f32 array format that XGBoost expects. The conversion cost is small relative to prediction time:
- **Small datasets** (< 1k rows): typically under 5%
- **Medium datasets** (1k-10k rows): typically 2-5%
- **Large datasets** (10k+ rows): under 3%, and shrinking as prediction time dominates
The conversion is optimized for:
- Zero-copy operations where possible
- Efficient type casting for all numeric types
- Row-wise iteration for better cache locality
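The row-major layout the conversion targets can be sketched as follows. `to_row_major_f32` is an illustrative helper showing the layout, not the crate's actual internals:

```rust
/// Flatten columnar f64 data into the row-major f32 layout XGBoost
/// expects: all features of row 0, then all features of row 1, etc.
fn to_row_major_f32(columns: &[Vec<f64>], n_rows: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(n_rows * columns.len());
    for row in 0..n_rows {
        for col in columns {
            // Cast each cell to f32 at copy time.
            out.push(col[row] as f32);
        }
    }
    out
}
```

For two columns `[1, 2]` and `[3, 4]`, the output is `[1.0, 3.0, 2.0, 4.0]`: row 0's features first, then row 1's.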
### When to Use Polars Integration
**Use Polars integration when:**
- ✅ You're already working with Polars DataFrames
- ✅ You need automatic type conversion from multiple numeric types
- ✅ You want column selection/subsetting before prediction
- ✅ Dataset size is > 100 rows (overhead is minimal)
- ✅ Code clarity and maintainability are priorities
**Use raw arrays when:**
- ⚡ Dataset is very small (< 100 rows) and called millions of times
- ⚡ You already have data in the correct f32 format
- ⚡ You're in an extremely latency-sensitive hot path
## Threading Considerations
### XGBoost Threading
XGBoost uses OpenMP for parallel prediction by default. The number of threads can be controlled:
```rust
// XGBoost uses all available cores by default (via OpenMP).
// One way to limit threads is to set OMP_NUM_THREADS before the
// OpenMP runtime initializes; a per-call nthread setting would
// require a custom wrapper.
std::env::set_var("OMP_NUM_THREADS", "1");
```
### Polars Threading
Polars also uses multiple threads for operations. When using Polars DataFrames with XGBoost:
**Potential thread contention:** Both libraries may try to use all CPU cores simultaneously, which can cause overhead.
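One mitigation is to split the core budget between the two thread pools so their combined worker count does not exceed the machine. The helper below is an illustrative policy sketch, not behavior of either library; the resulting budget would be applied via `POLARS_MAX_THREADS` and `OMP_NUM_THREADS` before either library initializes its pool:

```rust
/// Illustrative policy: give each library half the cores, with a
/// floor of one thread, so the two pools together fit the machine.
fn per_library_thread_budget(total_cores: usize) -> usize {
    (total_cores / 2).max(1)
}
```

On an 8-core machine this yields 4 threads per library, e.g. `POLARS_MAX_THREADS=4 OMP_NUM_THREADS=4`.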
### Recommendations
#### For Production Use
```bash
# Option 1: Limit Polars threads
export POLARS_MAX_THREADS=1

# Option 2: Limit XGBoost threads instead
# (let Polars use multiple threads, keep XGBoost single-threaded)
export OMP_NUM_THREADS=1
```
#### For Batch Prediction
If you're processing many DataFrames in parallel (e.g., using Rayon):
```rust
use rayon::prelude::*;

// Set POLARS_MAX_THREADS=1 to avoid nested parallelism.
let results: Vec<_> = dataframes
    .par_iter()
    .map(|df| booster.predict_dataframe(df, 0, false))
    .collect();
```
This allows outer parallelism (across DataFrames) without inner thread contention.
#### For Single Prediction
For a single prediction on a large DataFrame, use the default threading; the two libraries coordinate reasonably well for most workloads.
## Benchmarking
### Run Performance Tests
```bash
# Quick performance comparison
cargo test --features polars --test performance_test -- --ignored --nocapture
# Detailed benchmarks (requires criterion)
cargo bench --features polars polars_benchmark
```
### Test Different Thread Configurations
```bash
# Default (all cores)
cargo test --features polars test_threading_performance -- --ignored --nocapture
# Single-threaded Polars
POLARS_MAX_THREADS=1 cargo test --features polars test_threading_performance -- --ignored --nocapture
# Compare
POLARS_MAX_THREADS=4 cargo test --features polars test_threading_performance -- --ignored --nocapture
```
## Best Practices
1. **Profile your specific workload** - Performance characteristics vary by:
- Dataset size and shape
- Number of features
- Model complexity
- CPU architecture
2. **Start with defaults** - Only optimize threading if you measure contention
3. **Consider your parallelism level**:
- **No outer parallelism**: Use default threading for both
- **Parallel predictions**: Set `POLARS_MAX_THREADS=1`
- **Very latency-sensitive**: Consider raw arrays
4. **Memory vs Speed tradeoff**:
- Polars DataFrame prediction creates a temporary conversion buffer
- For millions of predictions per second, this allocation may matter
- For typical ML serving workloads (100s-1000s QPS), overhead is negligible
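If that per-call allocation shows up in profiles, one option is to amortize it by reusing a scratch buffer across predictions. `ConversionScratch` is a hypothetical helper sketch; the integration's `predict_dataframe` allocates internally, so this pattern applies only if you convert to raw arrays yourself:

```rust
/// Hypothetical scratch holder: keeps one buffer alive across calls
/// so repeated conversions reuse capacity instead of reallocating.
struct ConversionScratch {
    buf: Vec<f32>,
}

impl ConversionScratch {
    fn new() -> Self {
        ConversionScratch { buf: Vec::new() }
    }

    /// Refill the buffer with row-major f32 data, reusing prior capacity.
    fn fill(&mut self, columns: &[Vec<f64>], n_rows: usize) -> &[f32] {
        self.buf.clear(); // keeps capacity, drops old contents
        self.buf.reserve(n_rows * columns.len());
        for row in 0..n_rows {
            for col in columns {
                self.buf.push(col[row] as f32);
            }
        }
        &self.buf
    }
}
```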
## Example Results
Typical overhead on a modern CPU (example numbers):

| Dataset size | Raw f32 array | Polars DataFrame | Overhead |
|---|---|---|---|
| 100 rows | 45 μs | 47 μs | ~4% |
| 1,000 rows | 380 μs | 395 μs | ~4% |
| 10,000 rows | 3.2 ms | 3.3 ms | ~3% |
| 100,000 rows | 31 ms | 31.5 ms | ~2% |
*Note: Actual numbers depend on model complexity, feature count, and hardware.*
## Thread Safety
- ✅ `predict_dataframe()` is thread-safe if the underlying XGBoost booster supports thread-safe predictions
- ✅ Multiple threads can call `predict_dataframe()` on the same booster concurrently (XGBoost 1.0+)
- ✅ Polars DataFrame itself is thread-safe for reads
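The concurrent-read pattern looks like the sketch below. `SharedModel` is a stand-in type used only to keep the example self-contained and runnable; in real code the `Arc` would wrap the booster from the XGBoost crate:

```rust
use std::sync::Arc;
use std::thread;

/// Stand-in for a loaded booster; predictions only read shared state,
/// matching the read-only thread-safety guarantees described above.
struct SharedModel {
    bias: f32,
}

impl SharedModel {
    fn predict(&self, x: f32) -> f32 {
        x + self.bias
    }
}

/// Run one prediction per thread against the same shared model.
fn predict_concurrently(inputs: Vec<f32>) -> Vec<f32> {
    let model = Arc::new(SharedModel { bias: 1.0 });
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|x| {
            let m = Arc::clone(&model);
            thread::spawn(move || m.predict(x))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because handles are joined in spawn order, results come back in input order.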
## Troubleshooting
### High CPU usage
If you see excessive CPU usage:
```bash
export POLARS_MAX_THREADS=1
# or
export OMP_NUM_THREADS=1 # For XGBoost
```
### Inconsistent performance
- Ensure first prediction is excluded (warm-up)
- Check for thermal throttling on long-running benchmarks
- Verify no other CPU-intensive processes are running
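When timing by hand rather than with criterion, the warm-up exclusion can be done with a sketch like this (illustrative helper, not part of the crate):

```rust
use std::time::Instant;

/// Time `iters` runs of `f` in microseconds, discarding one unmeasured
/// warm-up run so cold caches and lazy initialization do not skew it.
fn mean_micros_excluding_warmup(mut f: impl FnMut(), iters: usize) -> f64 {
    f(); // warm-up run, not measured
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_secs_f64() * 1e6 / iters as f64
}
```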
### Memory usage
If memory usage is a concern:
- Consider batch size for processing
- Polars creates temporary conversion buffers
- Monitor with `heaptrack` or similar tools