# Polars Integration Performance Guide

## Overview

This document covers performance characteristics and threading considerations when using XGBoost with Polars DataFrames.

## Performance Characteristics

### Conversion Overhead

The Polars integration involves converting DataFrames to the row-major f32 array format that XGBoost expects. This conversion has minimal overhead:

- **Small datasets** (< 1k rows): overhead is small in absolute terms (typically 1-5% of total time)
- **Medium datasets** (1k-10k rows): overhead is typically 2-5%
- **Large datasets** (10k+ rows): overhead becomes insignificant relative to prediction time (< 3%)

The conversion is optimized for:
- Zero-copy operations where possible
- Efficient type casting for all numeric types
- Row-wise iteration for better cache locality
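As a rough illustration of what the conversion step does, here is a simplified, self-contained sketch (not the crate's actual implementation): `to_row_major` is a hypothetical helper, and plain `Vec<f64>` columns stand in for a DataFrame's numeric columns.

```rust
// Hypothetical sketch of the column-major -> row-major conversion the
// integration performs; `columns` stands in for a DataFrame's numeric columns.
fn to_row_major(columns: &[Vec<f64>], n_rows: usize) -> Vec<f32> {
    let n_cols = columns.len();
    let mut out = Vec::with_capacity(n_rows * n_cols);
    // Fill the buffer row by row, producing XGBoost's expected layout:
    // all of row 0's features first, then row 1's, and so on.
    for row in 0..n_rows {
        for col in columns {
            out.push(col[row] as f32);
        }
    }
    out
}

fn main() {
    // Two columns, three rows.
    let cols = vec![vec![1.0, 2.0, 3.0], vec![10.0, 20.0, 30.0]];
    let flat = to_row_major(&cols, 3);
    assert_eq!(flat, vec![1.0, 10.0, 2.0, 20.0, 3.0, 30.0]);
}
```

A single allocation sized up front plus a cast per cell is why the overhead stays in the low single-digit percentages for larger batches.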

### When to Use Polars Integration

**Use Polars integration when:**
- ✅ You're already working with Polars DataFrames
- ✅ You need automatic type conversion from multiple numeric types
- ✅ You want column selection/subsetting before prediction
- ✅ Dataset size is > 100 rows (overhead is minimal)
- ✅ Code clarity and maintainability are priorities

**Use raw arrays when:**
- ⚡ Dataset is very small (< 100 rows) and called millions of times
- ⚡ You already have data in the correct f32 format
- ⚡ You're in an extremely latency-sensitive hot path

## Threading Considerations

### XGBoost Threading

XGBoost uses OpenMP for parallel prediction by default. The number of threads can be controlled:

```rust
// XGBoost uses all available cores by default via OpenMP.
// The thread count can be limited with the OMP_NUM_THREADS environment
// variable, which must be set before the OpenMP runtime first initializes.
```

### Polars Threading

Polars also uses multiple threads for operations. When using Polars DataFrames with XGBoost:

**Potential thread contention:** Both libraries may try to use all CPU cores simultaneously, which can cause overhead.

### Recommendations

#### For Production Use

```bash
# Option 1: limit Polars to a single thread, let XGBoost parallelize
export POLARS_MAX_THREADS=1

# Option 2: limit XGBoost (OpenMP) instead, let Polars parallelize
export OMP_NUM_THREADS=1
```
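The same configuration can be applied in-process instead of in the shell. This is a sketch under one assumption worth stressing: both variables are only honored if set before the respective library (Polars' thread pool, XGBoost's OpenMP runtime) first initializes, so this belongs at the very top of `main()`.

```rust
use std::env;

fn main() {
    // In-process equivalent of the shell exports above.
    // Must run before either library spins up its thread pool.
    env::set_var("POLARS_MAX_THREADS", "1");
    env::set_var("OMP_NUM_THREADS", "4");

    assert_eq!(env::var("POLARS_MAX_THREADS").as_deref(), Ok("1"));
}
```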

#### For Batch Prediction

If you're processing many DataFrames in parallel (e.g., using Rayon):

```rust
use rayon::prelude::*;

// Set POLARS_MAX_THREADS=1 to avoid nested parallelism
let results: Vec<_> = dataframes
    .par_iter()
    .map(|df| booster.predict_dataframe(df, 0, false))
    .collect();
```

This allows outer parallelism (across DataFrames) without inner thread contention.

#### For Single Prediction

For single predictions with large DataFrames, use the default threading; both libraries coordinate reasonably well for most workloads.

## Benchmarking

### Run Performance Tests

```bash
# Quick performance comparison
cargo test --features polars --test performance_test -- --ignored --nocapture

# Detailed benchmarks (requires criterion)
cargo bench --features polars polars_benchmark
```

### Test Different Thread Configurations

```bash
# Default (all cores)
cargo test --features polars test_threading_performance -- --ignored --nocapture

# Single-threaded Polars
POLARS_MAX_THREADS=1 cargo test --features polars test_threading_performance -- --ignored --nocapture

# Compare
POLARS_MAX_THREADS=4 cargo test --features polars test_threading_performance -- --ignored --nocapture
```

## Best Practices

1. **Profile your specific workload** - Performance characteristics vary by:
   - Dataset size and shape
   - Number of features
   - Model complexity
   - CPU architecture

2. **Start with defaults** - Only optimize threading if you measure contention

3. **Consider your parallelism level**:
   - **No outer parallelism**: Use default threading for both
   - **Parallel predictions**: Set `POLARS_MAX_THREADS=1`
   - **Very latency-sensitive**: Consider raw arrays

4. **Memory vs Speed tradeoff**:
   - Polars DataFrame prediction creates a temporary conversion buffer
   - For millions of predictions per second, this allocation may matter
   - For typical ML serving workloads (100s-1000s QPS), overhead is negligible
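If the per-call allocation does matter, the conversion buffer can be amortized across calls. The sketch below is hypothetical (`fill_row_major` is not a crate API); it shows the general pattern of clearing and refilling one `Vec<f32>` instead of allocating per prediction.

```rust
// Hypothetical sketch: reuse one conversion buffer across many predictions
// instead of allocating a fresh one per call.
fn fill_row_major(columns: &[Vec<f64>], n_rows: usize, buf: &mut Vec<f32>) {
    buf.clear(); // drops the contents but keeps the existing allocation
    for row in 0..n_rows {
        for col in columns {
            buf.push(col[row] as f32);
        }
    }
}

fn main() {
    let mut buf = Vec::new();
    let batch_a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    fill_row_major(&batch_a, 2, &mut buf);
    assert_eq!(buf, vec![1.0, 3.0, 2.0, 4.0]);

    let cap_before = buf.capacity();
    let batch_b = vec![vec![5.0, 6.0], vec![7.0, 8.0]];
    fill_row_major(&batch_b, 2, &mut buf);
    // Same capacity: the second batch needed no new allocation.
    assert_eq!(buf.capacity(), cap_before);
}
```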

## Example Results

Typical overhead on a modern CPU (example numbers):

| Dataset Size | Raw Array | Polars DF | Overhead |
|--------------|-----------|-----------|----------|
| 100 rows     | 45 μs     | 47 μs     | ~4%      |
| 1,000 rows   | 380 μs    | 395 μs    | ~4%      |
| 10,000 rows  | 3.2 ms    | 3.3 ms    | ~3%      |
| 100,000 rows | 31 ms     | 31.5 ms   | ~2%      |

*Note: Actual numbers depend on model complexity, feature count, and hardware.*

## Thread Safety

- `predict_dataframe()` is thread-safe if the underlying XGBoost booster supports thread-safe predictions
- ✅ Multiple threads can call `predict_dataframe()` on the same booster concurrently (XGBoost 1.0+)
- ✅ Polars DataFrame itself is thread-safe for reads

## Troubleshooting

### High CPU usage

If you see excessive CPU usage:
```bash
export POLARS_MAX_THREADS=1
# or
export OMP_NUM_THREADS=1  # For XGBoost
```

### Inconsistent performance

- Ensure first prediction is excluded (warm-up)
- Check for thermal throttling on long-running benchmarks
- Verify no other CPU-intensive processes are running
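The warm-up point above can be handled with a small timing harness. This is a generic sketch using only the standard library (`time_excluding_warmup` is a made-up helper, not part of this crate): the first call pays one-time costs such as cold caches and lazy initialization, so it is run untimed.

```rust
use std::time::Instant;

// Run `f` once untimed as a warm-up, then time `iters` further calls.
fn time_excluding_warmup<F: FnMut()>(mut f: F, iters: usize) -> Vec<u128> {
    f(); // warm-up: absorbs one-time costs (cold caches, lazy init)
    (0..iters)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_nanos()
        })
        .collect()
}

fn main() {
    let mut acc = 0u64;
    let timings = time_excluding_warmup(|| acc = acc.wrapping_add(1), 5);
    assert_eq!(timings.len(), 5);
    assert_eq!(acc, 6); // 1 warm-up call + 5 timed calls
}
```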

### Memory usage

If memory usage is a concern:
- Consider batch size for processing
- Polars creates temporary conversion buffers
- Monitor with `heaptrack` or similar tools