TrustformeRS Serve
Version: 0.1.0 | Status: Stable | Tests: 216 | SLoC: 206,636 | Updated: 2026-03-21
High-performance inference server for TrustformeRS models with advanced batching, multi-protocol APIs, cloud-native deployment, and comprehensive observability.
Features
Dynamic Batching System
The dynamic batching system automatically groups inference requests to maximize throughput while maintaining low latency. Key features include:
- Intelligent Request Aggregation: Automatically collects requests into optimal batch sizes
- Priority-based Scheduling: Process critical requests first with configurable priority levels
- Adaptive Batching: Dynamically adjusts batch size and timeout based on load patterns
- Memory-aware Batching: Prevents OOM by tracking memory usage per batch
- Continuous Batching: Special mode for LLM text generation with KV cache management
- Sequence Bucketing: Groups similar-length sequences to minimize padding overhead
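To make the padding argument behind sequence bucketing concrete, here is a minimal, crate-independent sketch (the bucket width is an assumed tuning parameter, not part of the crate's API): requests are grouped by length bucket so each batch pads only to its bucket's upper bound instead of the global maximum.

```rust
use std::collections::BTreeMap;

/// Group sequence lengths into buckets of width `bucket_width`; each batch
/// then pads only to its bucket's upper bound instead of the global max.
pub fn bucket_by_length(lengths: &[usize], bucket_width: usize) -> BTreeMap<usize, Vec<usize>> {
    let mut buckets: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for &len in lengths {
        // Round the length up to the next multiple of `bucket_width`.
        let bound = ((len + bucket_width - 1) / bucket_width).max(1) * bucket_width;
        buckets.entry(bound).or_default().push(len);
    }
    buckets
}

/// Padding tokens wasted if every sequence is padded to `target` length.
pub fn padding_overhead(lengths: &[usize], target: usize) -> usize {
    lengths.iter().map(|&l| target.saturating_sub(l)).sum()
}
```

With a bucket width of 16, sequences of length 3 and 5 share the 16-token bucket and waste 24 padding tokens between them, instead of being padded to the longest sequence in the whole queue.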
Configuration Options
```rust
// Field names and module paths are illustrative; consult the crate
// documentation for the exact API.
use trustformers_serve::batching::{BatchingConfig, BatchingMode, OptimizationTarget};
use std::time::Duration;

let config = BatchingConfig {
    max_batch_size: 32,
    max_wait_time: Duration::from_millis(50),
    mode: BatchingMode::Dynamic,
    optimization_target: OptimizationTarget::Balanced,
    ..Default::default()
};
```
Batching Modes
- Fixed: Constant batch size
- Dynamic: Variable batch size based on queue depth
- Adaptive: Automatically adjusts based on load patterns
- Continuous: Special mode for LLM generation with incremental decoding
Optimization Targets
- Throughput: Maximize requests per second
- Latency: Minimize response time
- Balanced: Balance between throughput and latency
- Cost: Optimize for cloud deployment costs
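The Dynamic and Adaptive modes above can be illustrated with a small, crate-independent sketch of the core feedback rule (the thresholds and step factor are assumptions for illustration): grow the batch when the queue is deep, shrink it when the queue is shallow.

```rust
/// Minimal sketch of dynamic batch sizing: grow the batch when the queue is
/// deep (favouring throughput), shrink it when the queue is shallow
/// (favouring latency). Bounds and doubling step are assumed tuning choices.
pub fn next_batch_size(current: usize, queue_depth: usize, min: usize, max: usize) -> usize {
    if queue_depth > 2 * current {
        (current * 2).min(max) // backlog building up: batch more aggressively
    } else if queue_depth < current / 2 {
        (current / 2).max(min) // queue nearly empty: keep batches small for latency
    } else {
        current // load matches capacity: hold steady
    }
}
```

A deep queue (100 waiting requests against a batch of 8) doubles the batch size; a near-empty queue halves it, clamped to the configured bounds.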
Multi-Protocol APIs
- REST (Axum): HTTP/1.1 and HTTP/2, streaming via SSE and WebSockets
- gRPC (Tonic): High-throughput binary protocol with bidirectional streaming
- GraphQL (async-graphql): Flexible query API with subscriptions
SLO Monitoring and Observability
Built-in SLO (Service Level Objective) monitoring with Prometheus metrics, exported via once_cell lazy statics for one-time, thread-safe initialization:
- Request throughput (req/s) and token throughput (tokens/s)
- Latency percentiles (p50, p90, p95, p99) with SLO breach alerting
- Batch size distribution and queue depth histograms
- GPU/memory utilization gauges
- Cache hit rate and eviction counters
- Automatic SLO violation detection and alerting
```rust
// Illustrative configuration; check the crate docs for exact field names.
use trustformers_serve::slo::SloConfig;
use std::time::Duration;

let slo = SloConfig {
    p99_latency_target: Duration::from_millis(100),
    min_throughput_rps: 1_000.0,
    alert_on_breach: true,
    ..Default::default()
};
```
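The breach detection behind these latency percentiles reduces to a percentile computation plus a threshold check. A minimal, crate-independent sketch using the nearest-rank method (assuming a non-empty sample set):

```rust
/// Nearest-rank percentile of latency samples (ms); assumes non-empty input.
pub fn percentile(mut samples: Vec<f64>, p: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p * samples.len() as f64).ceil() as usize).max(1) - 1;
    samples[rank]
}

/// True when the observed percentile exceeds the SLO target, i.e. a breach.
pub fn slo_breached(samples: Vec<f64>, p: f64, target_ms: f64) -> bool {
    percentile(samples, p) > target_ms
}
```

For 100 samples of 1..100 ms, the p99 is 99 ms, so a 50 ms p99 target is breached while a 60 ms p50 target is not.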
Distributed Tracing
Full OpenTelemetry-compatible distributed tracing with Jaeger and Zipkin exporters:
- Per-request span creation with context propagation
- Trace sampling (head-based and tail-based)
- Baggage propagation across service boundaries
- Integration with service mesh (Istio, Linkerd)
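Context propagation across service boundaries is carried in the W3C `traceparent` header that OpenTelemetry uses. A minimal sketch of parsing it and re-emitting it for a child span, independent of any tracing crate (the span id here is a caller-supplied value, not generated):

```rust
/// Parse a W3C `traceparent` header: "version-traceid-spanid-flags",
/// e.g. "00-<32 hex>-<16 hex>-01". Returns (trace_id, parent_span_id).
pub fn parse_traceparent(header: &str) -> Option<(String, String)> {
    let parts: Vec<&str> = header.split('-').collect();
    if parts.len() == 4 && parts[1].len() == 32 && parts[2].len() == 16 {
        Some((parts[1].to_string(), parts[2].to_string()))
    } else {
        None
    }
}

/// Propagate the trace id downstream with this hop's new span id.
pub fn child_traceparent(header: &str, new_span_id: &str) -> Option<String> {
    let (trace_id, _parent_span) = parse_traceparent(header)?;
    Some(format!("00-{}-{}-01", trace_id, new_span_id))
}
```

The trace id survives every hop unchanged; only the span id is replaced, which is what lets Jaeger or Zipkin stitch the per-service spans into one trace.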
Performance Optimizer with NUMA/Topology Detection
Platform-aware performance optimizer that detects hardware topology to maximize throughput:
- Linux: Reads CPU topology from /sys/devices/system/cpu/ sysfs, NUMA node distances from /sys/devices/system/node/
- macOS: Queries CPU topology via sysctl hw.physicalcpu, hw.logicalcpu, hw.cachesize
- Automatic thread affinity binding to NUMA nodes
- Memory allocation policy optimized for NUMA topology
- Cache-line aware data structure layout
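As an example of the Linux path above, the kernel exposes each node's distance row as space-separated integers in /sys/devices/system/node/node&lt;N&gt;/distance (the node's own entry is conventionally 10, remote nodes larger). A small sketch of parsing such a row and picking the nearest remote node, operating on the string so it needs no live sysfs:

```rust
/// Parse a space-separated NUMA distance row (e.g. "10 21" on a two-socket
/// box) and pick the nearest remote node relative to `self_node`.
pub fn nearest_remote_node(distance_row: &str, self_node: usize) -> Option<usize> {
    let distances: Vec<u32> = distance_row
        .split_whitespace()
        .filter_map(|t| t.parse().ok())
        .collect();
    distances
        .iter()
        .enumerate()
        .filter(|&(node, _)| node != self_node) // skip our own (local) entry
        .min_by_key(|&(_, &d)| d)
        .map(|(node, _)| node)
}
```

On a four-node row like "10 21 31 21", node 0's nearest remote node is node 1; on a single-node machine there is no remote node at all.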
Speculative Decoding
Accelerates LLM text generation using draft models:
- Draft model generates candidate tokens in parallel
- Verifier model accepts or rejects in a single forward pass
- Configurable draft length (typically 4–8 tokens)
- Automatic fallback when speculative quality degrades
- Up to 3x throughput improvement for autoregressive models
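The accept/reject step described above can be modeled with a toy sketch in which both models are stubbed as precomputed token sequences (a real verifier scores all draft positions in a single forward pass rather than comparing fixed outputs):

```rust
/// Toy model of speculative decoding's accept/reject step: keep the longest
/// prefix of the draft that the verifier agrees with, then emit the
/// verifier's own token at the first mismatch (or after a fully accepted
/// draft). Returns (tokens_accepted, verifier_token).
pub fn accept_draft(draft: &[u32], verifier: &[u32]) -> (usize, Option<u32>) {
    let accepted = draft
        .iter()
        .zip(verifier.iter())
        .take_while(|(d, v)| d == v)
        .count();
    // The verifier always contributes one "free" token of its own, so even a
    // fully rejected draft still makes forward progress.
    (accepted, verifier.get(accepted).copied())
}
```

This is why the speedup is bounded but the output is unchanged: every emitted token is one the verifier would have produced anyway, and each verification round yields at least one token.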
Kernel Fusion
GPU kernel optimization to reduce memory bandwidth pressure:
- Vertical fusion: Sequential elementwise operations fused into single kernel
- Horizontal fusion: Independent parallel operations batched together
- Producer-consumer fusion: Eliminates intermediate tensor materialization
- Multi-pattern fusion: Combined patterns for attention and FFN blocks
- Reduced kernel launch overhead and improved L2 cache utilization
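Vertical fusion is easiest to see with a CPU analogy (the scale-then-bias pipeline here is a stand-in for any chain of elementwise GPU ops): the unfused version materializes an intermediate buffer between the two passes, while the fused version does both in one traversal.

```rust
/// Two elementwise passes with an intermediate buffer: the analogue of two
/// separate kernels, each doing a round trip through global memory.
pub fn unfused(xs: &[f32], scale: f32, bias: f32) -> Vec<f32> {
    let tmp: Vec<f32> = xs.iter().map(|x| x * scale).collect(); // intermediate materialized
    tmp.iter().map(|x| x + bias).collect()
}

/// One traversal, no intermediate allocation: the analogue of fusing the two
/// kernels into one and eliminating the intermediate tensor.
pub fn fused(xs: &[f32], scale: f32, bias: f32) -> Vec<f32> {
    xs.iter().map(|x| x * scale + bias).collect()
}
```

Both produce identical results; the fused form simply halves the number of passes over the data, which on a GPU translates to fewer kernel launches and less memory bandwidth pressure.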
Message Queue Integration
Asynchronous request ingestion via:
- Apache Kafka: High-throughput topic-based routing, consumer groups, exactly-once semantics
- RabbitMQ: AMQP protocol, priority queues, dead-letter exchanges, TTL-based expiry
Cloud Provider Support
Native integrations for major cloud platforms:
- AWS: EKS deployment, SageMaker endpoint compatibility, S3 model storage, CloudWatch metrics
- GCP: GKE autopilot, Vertex AI serving, GCS model storage, Cloud Monitoring
- Azure: AKS deployment, Azure ML serving, Blob Storage, Azure Monitor
GDPR Compliance
Data protection and privacy controls:
- Request/response data anonymization with configurable PII redaction
- Right-to-erasure support with audit trail
- Consent management with per-user opt-in/opt-out
- Data processing records (ROPA) generation
- Comprehensive audit logs with tamper-evident storage
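As a flavour of the configurable PII redaction mentioned above, here is a deliberately minimal, stdlib-only sketch that masks the local part of email-shaped tokens while keeping the domain for debugging (real redaction covers far more PII classes and should not rely on whitespace tokenization):

```rust
/// Minimal PII-redaction sketch: mask the local part of anything that looks
/// like an email address. Whitespace-tokenized on purpose to stay tiny;
/// note that runs of whitespace are collapsed to single spaces.
pub fn redact_emails(text: &str) -> String {
    text.split_whitespace()
        .map(|word| match word.find('@') {
            // Require a non-empty local part and a dotted domain.
            Some(at) if at > 0 && word[at + 1..].contains('.') => {
                format!("[REDACTED]@{}", &word[at + 1..])
            }
            _ => word.to_string(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```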
Usage Example
```rust
// Sketch of a typical server setup; module paths and names are illustrative,
// not the exact API.
use trustformers_serve::{Server, ServerConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ServerConfig::default();
    Server::new(config).await?.serve("0.0.0.0:8080").await?;
    Ok(())
}
```
Advanced Features
Memory-aware Batching
Prevents out-of-memory errors by tracking memory usage:
```rust
// Illustrative field names.
let config = BatchingConfig {
    max_batch_memory_bytes: 8 * 1024 * 1024 * 1024, // cap each batch at 8 GiB
    ..Default::default()
};
```
Priority Scheduling
Handle critical requests with higher priority:
```rust
// Illustrative: priority maps to the configurable scheduler levels above.
let critical_request = Request {
    priority: Priority::Critical,
    ..Default::default()
};
```
Continuous Batching for LLMs
Optimized for text generation with incremental decoding:
```rust
let config = BatchingConfig {
    mode: BatchingMode::Continuous, // incremental decoding with KV cache management
    ..Default::default()
};
```
Speculative Decoding Configuration
```rust
// Illustrative field names; typical draft lengths are 4-8 tokens (see above).
use trustformers_serve::speculative::SpeculativeConfig;

let spec_config = SpeculativeConfig {
    draft_length: 4,
    fallback_on_low_acceptance: true,
    ..Default::default()
};
```
Performance Tips
- Batch Size: Start with max_batch_size = 32 and adjust based on GPU memory
- Timeout: Lower timeouts (10-50ms) for latency-sensitive applications
- Bucketing: Enable sequence bucketing to reduce padding overhead
- Memory Limits: Set appropriate memory limits to prevent OOM
- NUMA Binding: Enable topology detection for multi-socket servers
- Speculative Decoding: Use for autoregressive generation to improve throughput
- Monitoring: Use built-in SLO metrics to identify SLA violations early
Examples
See the examples/ directory for comprehensive demonstrations:
- dynamic_batching_demo.rs: Complete demonstration of all batching features
- speculative_decoding_demo.rs: Speculative decoding with draft models
- kafka_integration_demo.rs: Message queue ingestion patterns
- cloud_deployment_demo.rs: AWS/GCP/Azure deployment examples
License
Licensed under Apache License, Version 2.0 (LICENSE or http://www.apache.org/licenses/LICENSE-2.0).