Module transformer

Transformer module for LLM support

This module provides:

  • Autograd engine for automatic differentiation
  • Transformer layer implementations (Attention, FFN, Norm)
  • Model loading from Safetensors format
  • Graph-structured Transformer inference
  • KV Cache and batch inference optimizations
  • Sparse attention optimizations
  • Quantization support
  • Performance optimizations (SIMD, memory pool, optimized kernels)
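As an illustration of the "Norm" layer family listed above, here is a minimal sketch of RMS normalization over a single vector. The function name `rms_norm`, its signature, and the `eps` parameter are assumptions made for this example; they are not taken from this module's `RMSNorm` type, whose actual API is not shown on this page.

```rust
/// Hypothetical free function, NOT this crate's `RMSNorm` API.
/// Scales `x` by the reciprocal of its root-mean-square, then applies a
/// per-element learned `weight`.
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "input and weight must have equal length");
    if x.is_empty() {
        return Vec::new();
    }
    // Mean of the squared activations.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    // `eps` keeps the division finite when the input is all zeros.
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

With unit weights and a negligible `eps`, the output has a mean square of approximately 1, which is the property that distinguishes RMS normalization from LayerNorm (no mean subtraction, no bias).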

Re-exports

pub use autograd::ComputeGraph;
pub use autograd::DifferentiableTensor;
pub use autograd::Op;
pub use autograd::Optimizer;
pub use layers::MultiHeadAttention;
pub use layers::RMSNorm;
pub use layers::LayerNorm;
pub use layers::RoPE;
pub use layers::FeedForward;
pub use loader::SafetensorsLoader;
pub use loader::ModelConfig;
pub use model::LlamaModel;
pub use model::LlamaConfig;
pub use generation::GenerationConfig;
pub use generation::TextGenerator;
pub use graph_transformer::GraphExecutor;
pub use graph_transformer::GraphTransformer;
pub use graph_transformer::GraphNode;
pub use graph_transformer::GraphEdge;
pub use kv_cache::KVCache;
pub use sparse_attention::SparseAttention;
pub use batch::BatchInference;
pub use quantization::QuantizedTensor;
pub use quantization::QuantizationConfig;
pub use perf::TransformerMemoryPool;
pub use perf::softmax_inplace_simd;
pub use perf::matmul_with_buffer;
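
The re-exported `softmax_inplace_simd` suggests an in-place softmax over a buffer. Its signature is not shown on this page, so the following is only a scalar reference sketch of the same mathematical operation under an assumed signature; the name `softmax_inplace_scalar` is hypothetical and no SIMD is used.

```rust
/// Hypothetical scalar reference, NOT the crate's `softmax_inplace_simd`.
/// Overwrites `values` with their softmax.
pub fn softmax_inplace_scalar(values: &mut [f32]) {
    if values.is_empty() {
        return;
    }
    // Subtract the maximum before exponentiating so exp() cannot overflow.
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for v in values.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    // For finite inputs, sum >= 1 because the maximum maps to exp(0) = 1.
    for v in values.iter_mut() {
        *v /= sum;
    }
}
```

Working in place avoids allocating a second buffer per attention row, which is presumably the motivation for pairing such a routine with a memory pool.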

Modules

autograd
Autograd engine for automatic differentiation
batch
Batch inference module for efficient throughput
generation
Text generation utilities
graph_transformer
Graph-structured Transformer core module
kv_cache
KV Cache module for efficient autoregressive generation
layers
Transformer layer implementations
loader
Model loader for loading pre-trained weights
model
LLaMA model implementation
optimization
CAD-LLM Topology Optimization Module
perf
Performance optimization utilities for Transformer inference
quantization
Quantization module for efficient inference
sparse_attention
Sparse Attention module for efficient attention computation
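
To illustrate why a KV cache helps autoregressive generation, the sketch below stores each token's key and value once and reuses them for every later query, so each decoding step only computes attention for the newest token. `ToyKvCache` and all of its methods are hypothetical names for this example; they do not describe the `kv_cache::KVCache` API, and the sketch handles a single head with no batching.

```rust
/// Hypothetical single-head cache, NOT this crate's `KVCache` type.
pub struct ToyKvCache {
    head_dim: usize,
    keys: Vec<f32>,   // row-major: one row of `head_dim` floats per cached token
    values: Vec<f32>, // same layout as `keys`
}

impl ToyKvCache {
    pub fn new(head_dim: usize) -> Self {
        assert!(head_dim > 0, "head_dim must be non-zero");
        Self { head_dim, keys: Vec::new(), values: Vec::new() }
    }

    /// Number of tokens currently cached.
    pub fn len(&self) -> usize {
        self.keys.len() / self.head_dim
    }

    /// Stores the key and value of one newly processed token.
    pub fn append(&mut self, key: &[f32], value: &[f32]) {
        assert_eq!(key.len(), self.head_dim);
        assert_eq!(value.len(), self.head_dim);
        self.keys.extend_from_slice(key);
        self.values.extend_from_slice(value);
    }

    /// Scaled dot-product attention of `query` over all cached tokens.
    pub fn attend(&self, query: &[f32]) -> Vec<f32> {
        assert_eq!(query.len(), self.head_dim);
        let mut out = vec![0.0f32; self.head_dim];
        if self.keys.is_empty() {
            return out;
        }
        let scale = 1.0 / (self.head_dim as f32).sqrt();
        // One score per cached token: (q . k) / sqrt(head_dim).
        let mut scores: Vec<f32> = self
            .keys
            .chunks_exact(self.head_dim)
            .map(|k| k.iter().zip(query).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        // Numerically stable softmax over the scores.
        let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for s in scores.iter_mut() {
            *s = (*s - max).exp();
            sum += *s;
        }
        // Weighted sum of the cached value rows.
        for (s, v) in scores.iter().zip(self.values.chunks_exact(self.head_dim)) {
            let w = s / sum;
            for (o, x) in out.iter_mut().zip(v) {
                *o += w * x;
            }
        }
        out
    }
}
```

In a generation loop under these assumptions, each step would call `append` with the new token's key and value and then `attend` with its query, rather than recomputing keys and values for the whole prefix.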