# torsh-profiler

Performance profiling and analysis tools for ToRSh applications.
## Overview

This crate provides comprehensive profiling capabilities for deep learning workloads:

- **Performance Profiling**: CPU/GPU time, memory usage, operation counts
- **Memory Profiling**: Allocation tracking, peak usage, memory leaks
- **Operation Analysis**: Kernel timing, FLOPs counting, bottleneck detection
- **Visualization**: Chrome tracing, TensorBoard integration, custom views
- **Integration**: Works with the CUDA profiler, Intel VTune, Apple Instruments
## Usage

### Basic Profiling

```rust
use torsh_profiler::*;

// Profile a model
let profiler = Profiler::new()
    .record_shapes(true)
    .with_stack(true);

let output = profiler.with_profiler(|| model.forward(&input))?;

// Get results
let report = profiler.report();
println!("{}", report);
```
### Detailed Operation Profiling

```rust
// Profile with categories
let profiler = Profiler::new()
    .activities(&[Activity::Cpu, Activity::Cuda])
    .record_shapes(true)
    .profile_memory(true)
    .with_stack(true);

// Profile specific operations
profiler.start();

profiler.step("data_loading");
let batch = dataloader.next()?;

profiler.step("forward");
let output = model.forward(&batch)?;

profiler.step("loss");
let loss = criterion(&output, &batch.targets)?;

profiler.step("backward");
loss.backward()?;

profiler.step("optimizer");
optimizer.step()?;

profiler.stop();

// Export trace
profiler.export_chrome_trace("trace.json")?;
```
### Memory Profiling

```rust
use torsh_profiler::*;

// Track memory allocations
let memory_profiler = MemoryProfiler::new()
    .track_allocations(true)
    .include_stacktraces(true);

memory_profiler.start();

// Your code here
let tensors: Vec<_> = (0..100).map(|_| Tensor::randn(&[1024, 1024])).collect();

memory_profiler.stop();

// Analyze memory usage
let snapshot = memory_profiler.snapshot()?;
println!("Peak memory: {} bytes", snapshot.peak_usage);
println!("Total allocations: {}", snapshot.total_allocations);

// Find memory leaks
let leaks = memory_profiler.find_leaks()?;
for leak in leaks {
    println!("{:?}", leak);
}
```
### FLOPS Counting

```rust
use torsh_profiler::*;

// Count FLOPs for a model
let flop_counter = FlopCounter::new();
let input_shape = vec![1, 3, 224, 224];
let total_flops = flop_counter.count(&model, &input_shape)?;
println!("Total FLOPs: {}", total_flops);

// Detailed breakdown per operation
let breakdown = flop_counter.breakdown(&model, &input_shape)?;
for (op, flops) in breakdown {
    println!("{}: {} FLOPs", op, flops);
}
```
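For intuition about the numbers a FLOP counter reports, the cost of a dense layer can be computed by hand: multiplying an m×k input by a k×n weight takes roughly 2·m·k·n FLOPs (one multiply plus one add per output term). A self-contained sketch, independent of the crate's API:

```rust
/// Approximate FLOPs for a dense (fully connected) layer:
/// one multiply and one add per weight, per batch element.
fn linear_flops(batch: u64, in_features: u64, out_features: u64) -> u64 {
    2 * batch * in_features * out_features
}

fn main() {
    // A 1024 -> 512 linear layer over a batch of 32
    let flops = linear_flops(32, 1024, 512);
    println!("{} FLOPs", flops); // 33554432
}
```

Comparing such hand counts against the counter's breakdown is a quick sanity check.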
### Custom Profiling Regions

```rust
use torsh_profiler::profile;

// Profile specific code regions
profile!("my_region", {
    // Critical code here
});

// Or with an explicit profiler
let profiler = Profiler::current();
let _guard = profiler.record("my_region");
// Critical code here
// _guard automatically stops profiling when dropped
```
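The guard pattern above relies on Rust's `Drop`. A minimal, self-contained illustration of the idea (a hypothetical `ScopeGuard` type, not the crate's real guard, which additionally reports into the active profiler):

```rust
use std::time::Instant;

/// A minimal scope-timing guard: records the elapsed time when dropped,
/// so a region is timed even on early return or panic unwinding.
struct ScopeGuard {
    label: &'static str,
    start: Instant,
}

impl ScopeGuard {
    fn new(label: &'static str) -> Self {
        ScopeGuard { label, start: Instant::now() }
    }
}

impl Drop for ScopeGuard {
    fn drop(&mut self) {
        println!("{}: {:?}", self.label, self.start.elapsed());
    }
}

fn main() {
    let _guard = ScopeGuard::new("critical_section");
    let sum: u64 = (0..1_000_000u64).sum();
    println!("sum = {}", sum);
    // _guard is dropped here, printing the elapsed time for the scope
}
```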
### TensorBoard Integration

```rust
use torsh_profiler::*;

// Export to TensorBoard format
let tb_profiler = TensorBoardProfiler::new("runs/experiment");
tb_profiler.add_scalar("loss", loss_value, step)?;
tb_profiler.add_histogram("weights", &weights, step)?;
tb_profiler.add_graph(&model)?;

// Profile and export
tb_profiler.with_profiler(|| model.forward(&input))?;
```
### Advanced Analysis

```rust
use torsh_profiler::*;

// Analyze bottlenecks
let analyzer = Analyzer::new(&profiler.report());
let bottlenecks = analyzer.find_bottlenecks()?;
for bottleneck in bottlenecks.iter().take(5) {
    println!("{:?}", bottleneck);
}

// Find inefficient operations
let inefficiencies = analyzer.find_inefficiencies()?;
for issue in inefficiencies {
    println!("{:?}", issue);
}

// Memory access patterns
let memory_patterns = analyzer.analyze_memory_access()?;
println!("{:?}", memory_patterns);
```
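At its core, bottleneck detection is a ranking of operations by accumulated time. A self-contained sketch of that reduction (a hypothetical helper, not the crate's `Analyzer`):

```rust
/// Rank operations by total time and keep the top k: the essence
/// of surfacing bottlenecks from a profiling report.
fn top_k_by_time(mut ops: Vec<(&str, u64)>, k: usize) -> Vec<(&str, u64)> {
    // Sort descending by accumulated time (microseconds)
    ops.sort_by(|a, b| b.1.cmp(&a.1));
    ops.truncate(k);
    ops
}

fn main() {
    let ops = vec![("relu", 20), ("conv2d", 500), ("matmul", 900)];
    for (name, us) in top_k_by_time(ops, 2) {
        println!("{}: {} us", name, us);
    }
}
```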
### Multi-GPU Profiling

```rust
// Profile distributed training
let profiler = DistributedProfiler::new()
    .rank(rank)
    .world_size(world_size)
    .sync_enabled(true);

profiler.with_profiler(|| train_epoch())?;

// Aggregate results from all ranks
if rank == 0 {
    let aggregated = profiler.aggregate()?;
    println!("{:?}", aggregated);
}
```
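A common cross-rank aggregation is the mean and maximum step time: in synchronous training the slowest rank (the straggler) bounds overall throughput. A self-contained sketch of that reduction (a hypothetical helper, not the crate's API):

```rust
/// Reduce per-rank step timings (microseconds) to (mean, max).
/// The max identifies the straggler that gates synchronous steps.
fn aggregate(rank_times_us: &[u64]) -> (u64, u64) {
    let max = *rank_times_us.iter().max().unwrap_or(&0);
    let mean = if rank_times_us.is_empty() {
        0
    } else {
        rank_times_us.iter().sum::<u64>() / rank_times_us.len() as u64
    };
    (mean, max)
}

fn main() {
    // Step time reported by each of 4 ranks
    let times: [u64; 4] = [1200, 1250, 1190, 1900];
    let (mean, max) = aggregate(&times);
    println!("mean = {} us, max = {} us", mean, max); // mean = 1385, max = 1900
}
```

A large gap between mean and max points at load imbalance rather than slow kernels.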
### Integration with External Profilers

The profiler's annotations also show up in vendor tools:

- **NVIDIA Nsight Systems**: run the application under `nsys profile`
- **Intel VTune**: run the application under `vtune -collect hotspots`
### Profiling Configuration

```rust
// Configure via environment variables:
// TORSH_PROFILER_ENABLED=1
// TORSH_PROFILER_OUTPUT=trace.json
// TORSH_PROFILER_ACTIVITIES=cpu,cuda

// Or programmatically
ProfilerConfig::default()
    .enabled(true)
    .output_path("trace.json")
    .activities(&[Activity::Cpu, Activity::Cuda])
    .record_shapes(true)
    .profile_memory(true)
    .with_stack(true)
    .with_flops(true)
    .with_modules(true)
    .export_format(ExportFormat::ChromeTrace)
    .apply()?;
```
## Visualization

The profiler can export data in several formats:

- **Chrome Tracing**: view in `chrome://tracing`
- **TensorBoard**: integrates with the TensorBoard profiler plugin
- **Perfetto**: modern trace viewer
- **Custom JSON**: for custom analysis tools
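All of these exporters ultimately serialize timing events. The Chrome Tracing format, which both `chrome://tracing` and Perfetto load, is a JSON array of event objects; a "complete" event (`"ph": "X"`) carries a name plus microsecond timestamp and duration. A hand-rolled sketch, independent of the crate (no JSON escaping, illustration only):

```rust
/// Build one Chrome Tracing "complete" event as a JSON object string.
/// ts/dur are in microseconds; pid/tid are fixed for this sketch.
fn trace_event(name: &str, ts_us: u64, dur_us: u64) -> String {
    format!(
        r#"{{"name":"{}","ph":"X","ts":{},"dur":{},"pid":0,"tid":0}}"#,
        name, ts_us, dur_us
    )
}

fn main() {
    let events = vec![
        trace_event("forward", 0, 1500),
        trace_event("backward", 1500, 2300),
    ];
    // A trace file is simply a JSON array of such events.
    println!("[{}]", events.join(","));
}
```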
## Performance Tips

- Profile representative workloads
- Warm up before profiling (exclude the first iterations)
- Profile both training and inference
- Look for memory allocation patterns
- Check for unnecessary synchronizations
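The warm-up tip can be applied with a few lines of `std::time`: discard the first iterations so one-time costs (allocator warm-up, cold caches, lazy initialization) don't skew the average. A self-contained sketch:

```rust
use std::time::{Duration, Instant};

/// Average the runtime of `f` over `iters` runs, after `warmup`
/// unmeasured runs that absorb one-time startup costs.
fn time_avg<F: FnMut()>(mut f: F, warmup: usize, iters: usize) -> Duration {
    for _ in 0..warmup {
        f(); // excluded from measurement
    }
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters as u32
}

fn main() {
    let mut acc = 0u64;
    let avg = time_avg(|| acc = (0..10_000u64).sum(), 3, 10);
    println!("avg per iteration: {:?} (acc = {})", avg, acc);
}
```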
## License

Licensed under the Apache License, Version 2.0. See LICENSE for details.