calltrace-rs 1.1.4

High-performance function call tracing library for C/C++ applications using GCC instrumentation with Rust safety guarantees
# CallTrace Performance Tuning Guide

This guide provides comprehensive strategies for optimizing CallTrace performance across different use cases, from development debugging to production monitoring.

## Table of Contents

- [Performance Overview](#performance-overview)
- [Configuration Strategies](#configuration-strategies)
- [Compilation Optimizations](#compilation-optimizations)
- [Runtime Tuning](#runtime-tuning)
- [Memory Management](#memory-management)
- [Profiling and Measurement](#profiling-and-measurement)
- [Production Deployment](#production-deployment)
- [Use Case Scenarios](#use-case-scenarios)

## Performance Overview

### CallTrace Overhead Characteristics

| Configuration | Overhead per Call | Memory per 10K Calls | Use Case |
|---------------|-------------------|---------------------|----------|
| **Disabled** | < 5ns | 0 | Production baseline |
| **Basic Tracing** | 50-100ns | ~2MB | Production monitoring |
| **Argument Capture** | 500ns-2μs | ~5-10MB | Development debugging |
| **Debug Mode** | 1-5μs | ~10-20MB | Deep investigation |

### Performance Factors

1. **Function Call Frequency**
   - High-frequency functions (>1M calls/sec) accumulate proportionally more total overhead
   - Recursive functions multiply overhead by recursion depth

2. **Argument Complexity**
   - Simple types (int, float): ~100ns overhead
   - Strings: ~200-500ns overhead  
   - Complex structs: ~1-2μs overhead
   - Arrays/pointers: ~500ns-1μs overhead

3. **Call Stack Depth**
   - Memory usage grows linearly with depth
   - Deep recursion (>1000 levels) can impact performance significantly

4. **Thread Count**
   - Per-thread overhead is minimal
   - Memory usage scales with number of active threads

## Configuration Strategies

### Development Configuration (Maximum Detail)

For debugging and development, prioritize completeness over performance:

```bash
# Full featured development setup
export CALLTRACE_OUTPUT=dev_trace.json
export CALLTRACE_CAPTURE_ARGS=1
export CALLTRACE_MAX_DEPTH=1000
export CALLTRACE_DEBUG=1
export CALLTRACE_PRETTY_JSON=1

# Use debug build for additional error checking
LD_PRELOAD=./target/debug/libcalltrace.so ./your_program
```

**Expected overhead:** 10-50x slower than normal execution

### Testing Configuration (Balanced)

For integration testing and QA environments:

```bash
# Balanced testing setup
export CALLTRACE_OUTPUT=test_trace.json
export CALLTRACE_CAPTURE_ARGS=0      # Disable expensive argument capture
export CALLTRACE_MAX_DEPTH=500
export CALLTRACE_DEBUG=0
export CALLTRACE_PRETTY_JSON=1

# Use release build for better performance
LD_PRELOAD=./target/release/libcalltrace.so ./your_program
```

**Expected overhead:** 2-5x slower than normal execution

### Production Configuration (Minimal Overhead)

For production monitoring with minimal impact:

```bash
# Production-optimized setup
export CALLTRACE_OUTPUT=prod_trace.json
export CALLTRACE_CAPTURE_ARGS=0      # Never enable in production
export CALLTRACE_MAX_DEPTH=100       # Limit memory usage
export CALLTRACE_DEBUG=0             # No debug output
export CALLTRACE_PRETTY_JSON=0       # Smaller file size

# Always use release build in production
LD_PRELOAD=./target/release/libcalltrace.so ./your_program
```

**Expected overhead:** 10-20% slower than normal execution

### Minimal Configuration (Ultra-Low Overhead)

For scenarios where every nanosecond counts:

```bash
# Ultra-minimal setup - basic call counting only
export CALLTRACE_OUTPUT=/dev/null    # Or specific file when needed
export CALLTRACE_CAPTURE_ARGS=0
export CALLTRACE_MAX_DEPTH=50
export CALLTRACE_DEBUG=0
export CALLTRACE_PRETTY_JSON=0

# Release build only
LD_PRELOAD=./target/release/libcalltrace.so ./your_program
```

**Expected overhead:** < 5% impact on execution time

## Compilation Optimizations

### Target Program Compilation

Choose compilation flags based on your performance requirements:

#### Maximum Performance (Production)
```bash
# Optimized for speed, limited tracing capability
gcc -rdynamic -finstrument-functions -O3 -DNDEBUG \
    -fno-omit-frame-pointer -fno-inline-small-functions \
    your_program.c -o your_program_optimized
```

**Trade-offs:**
- ✅ Fastest execution
- ❌ Some functions may be inlined (not traced)
- ❌ Harder to debug issues

#### Balanced Performance (Testing)
```bash
# Good performance with complete tracing
gcc -rdynamic -finstrument-functions -O2 -g \
    -fno-omit-frame-pointer \
    your_program.c -o your_program_balanced
```

**Trade-offs:**
- ✅ Good performance
- ✅ Most functions traced
- ✅ Debug info available

#### Maximum Tracing (Development)
```bash
# Complete tracing capability, slower execution
gcc -rdynamic -finstrument-functions -O0 -g \
    -fno-inline -fno-omit-frame-pointer \
    your_program.c -o your_program_debug
```

**Trade-offs:**
- ✅ All functions traced
- ✅ Complete debug information
- ❌ Significantly slower execution

### CallTrace Library Compilation

Use appropriate CallTrace build for your use case:

```bash
# Development: debug build with error checking
cargo build

# Production: optimized build
cargo build --release

# Ultra-optimized: build with a custom profile
cargo build --profile production
```

```toml
# Add to Cargo.toml to define the custom profile
[profile.production]
inherits = "release"
lto = "fat"          # Full link-time optimization
codegen-units = 1    # Single codegen unit for better optimization
panic = "abort"      # Smaller binary, faster execution
```

## Runtime Tuning

### Environment Variable Optimization

Fine-tune CallTrace behavior with environment variables:

```bash
# Memory-constrained environments
export CALLTRACE_MAX_DEPTH=50          # Limit call stack depth
export CALLTRACE_PRETTY_JSON=0         # Reduce file size

# High-frequency call environments  
export CALLTRACE_MAX_DEPTH=20          # Very shallow tracing
export CALLTRACE_OUTPUT=/tmp/trace.json # Fast storage

# Multi-threaded applications
export CALLTRACE_MAX_DEPTH=200         # Higher limit for concurrent threads
```

### Dynamic Performance Control

Control tracing overhead at runtime:

```c
// In your C program - disable instrumentation for hot functions
__attribute__((no_instrument_function))
void high_frequency_function() {
    // This function won't be traced, reducing overhead
}

// Use GCC pragmas to disable instrumentation for code blocks
// (GCC also accepts -finstrument-functions-exclude-function-list=sym,sym
//  to exclude specific functions at compile time)
#pragma GCC push_options
#pragma GCC optimize ("no-instrument-functions")
void performance_critical_section() {
    // Code here won't be traced
}
#pragma GCC pop_options
```

### Signal-Based Control (Future Feature)

```c
// Example of potential runtime control (not yet implemented)
#include <signal.h>

void toggle_tracing(int sig) {
    // Toggle CallTrace on/off during execution
    // Useful for focusing on specific time periods
}

int main() {
    signal(SIGUSR1, toggle_tracing);
    // Run program, send SIGUSR1 to toggle tracing
}
```

## Memory Management

### Understanding Memory Usage

CallTrace memory consumption breakdown:

1. **Call Tree Storage:** ~200 bytes per function call
2. **DWARF Cache:** ~1KB per unique function
3. **String Interning:** ~50 bytes per unique string
4. **Thread Metadata:** ~1KB per thread
5. **JSON Buffer:** ~2x final output size during generation

### Memory Optimization Strategies

#### Limit Call Depth
```bash
# Prevent memory explosion in recursive scenarios
export CALLTRACE_MAX_DEPTH=100

# Inspect call fan-out (children per call) in traces
jq '.call_trees[].root_calls[] | .. | .children? | length' trace.json | sort -n | tail -10
```

#### Periodic Output Flushing (Future Feature)
```bash
# Hypothetical future feature - periodic output to limit memory
export CALLTRACE_FLUSH_INTERVAL=10000   # Flush every 10K calls
export CALLTRACE_OUTPUT_ROLLING=1       # Rolling output files
```

#### Memory Monitoring
```bash
# Monitor CallTrace memory usage
valgrind --tool=massif env LD_PRELOAD=./target/release/libcalltrace.so ./your_program

# Check peak memory usage
/usr/bin/time -v env LD_PRELOAD=./target/release/libcalltrace.so ./your_program

# Monitor real-time memory usage
while true; do 
    ps aux | grep your_program | grep -v grep
    sleep 1
done
```

## Profiling and Measurement

### Baseline Measurements

Always establish baseline performance before adding CallTrace:

```bash
# 1. Measure without CallTrace
time ./your_program
perf stat -e cycles,instructions,cache-misses ./your_program

# 2. Measure with CallTrace (minimal config)
time env CALLTRACE_OUTPUT=/dev/null LD_PRELOAD=./target/release/libcalltrace.so ./your_program

# 3. Measure with CallTrace (full config)
time env CALLTRACE_CAPTURE_ARGS=1 CALLTRACE_OUTPUT=trace.json LD_PRELOAD=./target/release/libcalltrace.so ./your_program
```

### Performance Profiling

Use profiling tools to understand overhead distribution:

```bash
# Profile CallTrace overhead with perf
perf record -g env LD_PRELOAD=./target/release/libcalltrace.so ./your_program
perf report

# Focus on CallTrace-specific overhead (after adding probes with
# `perf probe -x ./target/release/libcalltrace.so <function>`)
perf record -g -e 'probe_libcalltrace:*' env LD_PRELOAD=./target/release/libcalltrace.so ./your_program

# Analyze hot functions
perf top -p $(pgrep your_program)
```

### CallTrace-Specific Metrics

Analyze CallTrace's own performance from trace output:

```bash
# Extract timing statistics from trace
jq -r '.call_trees[].root_calls[] | .. | select(.duration_ns?) | .duration_ns' trace.json | \
    awk '{sum+=$1; count++} END {print "Average:", sum/count "ns", "Total calls:", count}'

# Find slowest function calls
jq -r '.call_trees[].root_calls[] | .. | select(.duration_ns?) | "\(.duration_ns) \(.function)"' trace.json | \
    sort -nr | head -20

# Analyze call frequency
jq -r '.call_trees[].root_calls[] | .. | select(.function?) | .function' trace.json | \
    sort | uniq -c | sort -nr | head -20
```

## Production Deployment

### Pre-Production Testing

Before deploying CallTrace in production:

1. **Load Testing:**
   ```bash
   # Test with production-like load
   ab -n 10000 -c 100 http://localhost:8080/api/endpoint
   
   # Compare with and without CallTrace
   siege -t 60s -c 50 http://localhost:8080/
   ```

2. **Memory Leak Testing:**
   ```bash
   # Long-running memory test
   valgrind --tool=memcheck --leak-check=full \
       env LD_PRELOAD=./target/release/libcalltrace.so ./your_program
   ```

3. **Stress Testing:**
   ```bash
   # High concurrency test
   for i in {1..100}; do
       LD_PRELOAD=./target/release/libcalltrace.so ./your_program &
   done
   wait
   ```

### Production Monitoring

Monitor CallTrace impact in production:

```bash
# Monitor system metrics
iostat -x 1    # Disk I/O impact from JSON writing
free -m        # Memory usage
top -p $(pgrep your_program)  # CPU usage

# Application-specific metrics
curl http://localhost:8080/metrics | grep response_time
```

### Rollback Strategy

Prepare for quick rollback if issues arise:

```bash
# Create wrapper script for easy enable/disable
cat > run_with_trace.sh << 'EOF'
#!/bin/bash
if [[ "$ENABLE_TRACING" == "1" ]]; then
    exec env LD_PRELOAD=./target/release/libcalltrace.so "$@"
else
    exec "$@"
fi
EOF

# Use in production
ENABLE_TRACING=1 ./run_with_trace.sh ./your_program  # With tracing
ENABLE_TRACING=0 ./run_with_trace.sh ./your_program  # Without tracing
```

## Use Case Scenarios

### Scenario 1: Debugging Performance Regression

**Goal:** Find which functions are slower than expected

**Configuration:**
```bash
# Capture detailed timing with argument information
export CALLTRACE_CAPTURE_ARGS=1
export CALLTRACE_OUTPUT=regression_analysis.json
export CALLTRACE_PRETTY_JSON=1
LD_PRELOAD=./target/debug/libcalltrace.so ./your_program
```

**Analysis:**
```bash
# Find functions taking longer than threshold
jq '.call_trees[].root_calls[] | .. | select(.duration_ns? > 1000000) | {function, duration_ns}' regression_analysis.json

# Compare with baseline trace
diff <(jq '.call_trees[].root_calls[] | .. | select(.function?) | .function' baseline.json | sort) \
     <(jq '.call_trees[].root_calls[] | .. | select(.function?) | .function' regression_analysis.json | sort)
```

### Scenario 2: Production Health Monitoring

**Goal:** Monitor application behavior with minimal overhead

**Configuration:**
```bash
# Ultra-lightweight monitoring
export CALLTRACE_OUTPUT=/var/log/app_trace.json
export CALLTRACE_CAPTURE_ARGS=0
export CALLTRACE_MAX_DEPTH=50
export CALLTRACE_PRETTY_JSON=0
LD_PRELOAD=./target/release/libcalltrace.so ./your_program
```

**Automation:**
```bash
# Rotate trace files hourly (crontab entry; `%` must be escaped in cron)
0 * * * * mv /var/log/app_trace.json /var/log/app_trace_$(date +\%Y\%m\%d_\%H).json

# Analyze for anomalies
jq '.metadata.total_calls' /var/log/app_trace_*.json | \
    awk '{if(NR==1){min=max=$1} if($1<min){min=$1} if($1>max){max=$1}} END {print "Range:", min "-" max}'
```

### Scenario 3: API Performance Analysis

**Goal:** Understand request processing flow and timing

**Configuration:**
```bash
# Capture API request flow
export CALLTRACE_OUTPUT=api_flow.json
export CALLTRACE_CAPTURE_ARGS=1  # Capture request parameters
export CALLTRACE_MAX_DEPTH=200   # Deep web framework stacks
LD_PRELOAD=./target/release/libcalltrace.so ./api_server
```

**Analysis:**
```bash
# Extract request handler timing
jq '.call_trees[] | .root_calls[] | .. | select(.function? | test("handle_request|process_api")) | {function, duration_ns}' api_flow.json

# Identify bottleneck functions
jq -r '.call_trees[] | .root_calls[] | .. | select(.duration_ns? > 100000) | "\(.duration_ns) \(.function)"' api_flow.json | sort -nr
```

### Scenario 4: Memory Allocation Profiling

**Goal:** Track memory allocation patterns

**Configuration:**
```bash
# Trace malloc/free patterns (if instrumented)
export CALLTRACE_OUTPUT=memory_trace.json
export CALLTRACE_CAPTURE_ARGS=1
# Ensure memory functions are instrumented in your build
LD_PRELOAD=./target/debug/libcalltrace.so ./your_program
```

**Note:** CallTrace doesn't automatically trace system memory functions. You'll need to instrument your own memory management functions.

## Performance Best Practices Summary

### ✅ Do's

1. **Always use release builds in production**
2. **Disable argument capture unless specifically needed**
3. **Set appropriate call depth limits**
4. **Monitor memory usage in long-running applications**
5. **Establish baseline performance measurements**
6. **Use minimal configuration for production monitoring**
7. **Test performance impact thoroughly before deployment**

### ❌ Don'ts

1. **Don't enable debug mode in production**
2. **Don't use unlimited call depth with recursive functions**
3. **Don't enable argument capture for high-frequency functions**
4. **Don't ignore memory usage growth over time**
5. **Don't deploy without performance testing**
6. **Don't use debug builds for performance measurements**
7. **Don't forget to monitor disk space for JSON output**

### 🎯 Golden Rules

1. **Measure First:** Always establish baseline performance
2. **Start Minimal:** Begin with lowest overhead configuration
3. **Iterate Gradually:** Add features as needed, measuring impact
4. **Monitor Continuously:** Watch for performance degradation over time
5. **Plan for Rollback:** Always have a way to quickly disable tracing

By following these guidelines, you can effectively use CallTrace across the entire development lifecycle, from detailed debugging to production monitoring, while maintaining acceptable performance characteristics.