kizzasi-tokenizer
Signal quantization and tokenization for Kizzasi AGSP.
This crate provides methods for converting continuous signals into representations suitable for autoregressive prediction:
- Continuous Embedding: Direct float-to-latent projection (no discretization)
- VQ-VAE: Vector Quantized embeddings with learned codebook
- μ-law: Logarithmic quantization for audio signals (see the sketch after this list)
- Linear Quantization: Simple uniform quantization
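For reference, the sketch below implements the standard μ-law companding curve F(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255, the ITU-T G.711 value. It is a self-contained illustration of the technique only; the crate's `MuLawCodec` may expose a different API.

```rust
/// Minimal μ-law companding sketch (μ = 255, as in ITU-T G.711).
/// Illustrative only; not the signature of the crate's MuLawCodec.
const MU: f32 = 255.0;

/// Compress a sample in [-1.0, 1.0] into the μ-law domain.
fn mu_law_encode(x: f32) -> f32 {
    x.signum() * (1.0 + MU * x.abs()).ln() / (1.0 + MU).ln()
}

/// Expand a μ-law value back into the linear domain.
fn mu_law_decode(y: f32) -> f32 {
    y.signum() * ((1.0 + MU).powf(y.abs()) - 1.0) / MU
}

fn main() {
    let x = 0.1_f32;
    let y = mu_law_encode(x);
    let x_hat = mu_law_decode(y);
    println!("x = {x}, companded = {y:.4}, reconstructed = {x_hat:.4}");
}
```

The logarithmic curve spends more of the quantization range on small amplitudes, which is why this scheme is a common fit for speech and other audio signals.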
AGSP Philosophy
Unlike LLMs, which tokenize text into discrete vocabulary indices, AGSP models can work with continuous signals directly. However, discretization can still be useful (see the sketch after this list) for:
- Reducing model complexity
- Enabling cross-modal transfer
- Improving training stability
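To make the discrete-vs-continuous trade-off concrete, here is a minimal sketch of uniform (linear) quantization: each continuous sample is mapped to one of 256 indices, i.e. a small fixed "vocabulary" an autoregressive model can predict over, at the cost of quantization error. The function names are illustrative and independent of the crate's `LinearQuantizer` API.

```rust
/// Uniform quantization of a sample in [-1.0, 1.0] into `levels` bins.
/// Illustrative sketch only; not the crate's LinearQuantizer API.
fn quantize(x: f32, levels: u32) -> u32 {
    let clamped = x.clamp(-1.0, 1.0);
    // Map [-1, 1] onto the index range [0, levels - 1].
    (((clamped + 1.0) / 2.0) * (levels - 1) as f32).round() as u32
}

/// Map a bin index back to a value in [-1.0, 1.0].
fn dequantize(idx: u32, levels: u32) -> f32 {
    idx as f32 / (levels - 1) as f32 * 2.0 - 1.0
}

fn main() {
    let signal = [-0.73_f32, -0.1, 0.0, 0.42, 0.99];
    let tokens: Vec<u32> = signal.iter().map(|&x| quantize(x, 256)).collect();
    let recon: Vec<f32> = tokens.iter().map(|&t| dequantize(t, 256)).collect();
    println!("tokens = {tokens:?}");
    println!("reconstruction = {recon:?}");
}
```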
COOLJAPAN Ecosystem
This crate follows KIZZASI_POLICY.md and uses scirs2-core for all array and numerical operations.
Re-exports
pub use advanced_quant::AdaptiveQuantizer;
pub use advanced_quant::DeadZoneQuantizer;
pub use advanced_quant::NonUniformQuantizer;
pub use batch::BatchTokenizer;
pub use batch::StreamingTokenizer;
pub use entropy::compression_ratio;
pub use entropy::compute_frequencies;
pub use entropy::ArithmeticDecoder;
pub use entropy::ArithmeticEncoder;
pub use entropy::BitrateController;
pub use entropy::HuffmanDecoder;
pub use entropy::HuffmanEncoder;
pub use entropy::RangeDecoder;
pub use entropy::RangeEncoder;
pub use persistence::load_config;
pub use persistence::save_config;
pub use persistence::ModelCheckpoint;
pub use persistence::ModelMetadata;
pub use persistence::ModelVersion;
pub use specialized::DCTConfig;
pub use specialized::DCTTokenizer;
pub use specialized::FourierConfig;
pub use specialized::FourierTokenizer;
pub use specialized::KMeansConfig;
pub use specialized::KMeansTokenizer;
pub use specialized::WaveletConfig;
pub use specialized::WaveletFamily;
pub use specialized::WaveletTokenizer;
pub use advanced_features::add_batch_jitter;
pub use advanced_features::add_jitter;
pub use advanced_features::apply_batch_token_dropout;
pub use advanced_features::apply_temporal_coherence;
pub use advanced_features::apply_token_dropout;
pub use advanced_features::HierarchicalConfig;
pub use advanced_features::HierarchicalTokenizer;
pub use advanced_features::JitterConfig;
pub use advanced_features::TemporalCoherenceConfig;
pub use advanced_features::TemporalFilterType;
pub use advanced_features::TokenDropoutConfig;
pub use compat::AudioMetadata;
pub use compat::DType;
pub use compat::ModelConfig;
pub use compat::OnnxConfig;
pub use compat::PyTorchCompat;
pub use compat::TensorInfo;
pub use domain_specific::EnvironmentalTokenizer;
pub use domain_specific::EnvironmentalTokenizerConfig;
pub use domain_specific::MusicTokenizer;
pub use domain_specific::MusicTokenizerConfig;
pub use domain_specific::SpeechTokenizer;
pub use domain_specific::SpeechTokenizerConfig;
pub use transformer::FeedForward;
pub use transformer::LayerNorm;
pub use transformer::MultiHeadAttention;
pub use transformer::PositionalEncoding;
pub use transformer::TransformerConfig;
pub use transformer::TransformerEncoderLayer;
pub use transformer::TransformerTokenizer;
pub use pretraining::ContrastiveConfig;
pub use pretraining::ContrastiveLearning;
pub use pretraining::MSMConfig;
pub use pretraining::MaskedSignalModeling;
pub use pretraining::TemporalPrediction;
pub use pretraining::TemporalPredictionConfig;
pub use profiling::AllocationEvent;
pub use profiling::EventType;
pub use profiling::MemoryProfiler;
pub use profiling::MemorySnapshot;
pub use profiling::ProfileScope;
pub use profiling::ScopeStats;
pub use profiling::TimelineAnalyzer;
Modules
- advanced_features - Advanced features for tokenizer robustness and regularization
- advanced_quant - Advanced quantization strategies
- batch - Batch processing for efficient tokenization of multiple signals
- compat - Compatibility and interoperability with other frameworks
- domain_specific - Domain-specific tokenizers for specialized audio applications
- enhanced_multiscale - Enhanced multi-scale tokenization with advanced features
- entropy - Entropy coding for efficient compression
- gpu_quant - GPU-accelerated quantization operations using candle
- metrics - Quality metrics for evaluating tokenizer performance
- persistence - Model persistence and checkpoint management
- pretraining - Self-supervised pre-training for tokenizers
- profiling - Memory usage profiling and performance monitoring utilities
- serde_utils - Serialization and deserialization utilities for tokenizers
- simd_quant - SIMD-optimized quantization operations
- specialized - Specialized tokenizers for signal processing
- transformer - Transformer-based signal tokenization using self-attention mechanisms
- types - Type-safe wrappers for improved type-level safety
- utils - Common utilities and helper traits for kizzasi-tokenizer
Structs
- ContinuousTokenizer - Continuous tokenizer that projects signals to embedding space
- LinearQuantizer - Linear uniform quantizer
- MuLawCodec - μ-law companding codec
- MultiScaleTokenizer - Multi-scale hierarchical tokenizer
- PyramidTokenizer - Pyramid tokenizer with residual connections
- ReconstructionMetrics - Reconstruction loss metrics
- ScaleLevel - Configuration for a single scale level
- TrainableContinuousTokenizer - Trainable continuous tokenizer with gradient descent
- TrainingConfig - Training configuration for trainable tokenizer
Enums
- PoolMethod - Method for downsampling signals
- TokenizerError - Errors that can occur in tokenizer operations
- TokenizerType - Configuration for tokenizer selection
- UpsampleMethod - Method for upsampling signals
Traits
- Quantizer - Trait for quantization strategies
- SignalTokenizer - Trait for signal tokenization
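The two traits above suggest a strategy-style design in which quantization schemes are interchangeable behind a common interface. The crate's actual `Quantizer` signatures are not reproduced here; the sketch below uses a hypothetical trait (`QuantizeStrategy`, with assumed `quantize`/`dequantize` methods) purely to illustrate the pattern.

```rust
/// Hypothetical quantization-strategy trait; the real `Quantizer`
/// trait in kizzasi-tokenizer may differ in names and signatures.
trait QuantizeStrategy {
    fn quantize(&self, x: f32) -> u32;
    fn dequantize(&self, idx: u32) -> f32;
}

/// A uniform quantizer over [-1.0, 1.0] as one interchangeable strategy.
struct Uniform {
    levels: u32,
}

impl QuantizeStrategy for Uniform {
    fn quantize(&self, x: f32) -> u32 {
        let x = x.clamp(-1.0, 1.0);
        (((x + 1.0) / 2.0) * (self.levels - 1) as f32).round() as u32
    }
    fn dequantize(&self, idx: u32) -> f32 {
        idx as f32 / (self.levels - 1) as f32 * 2.0 - 1.0
    }
}

fn main() {
    // Strategies can be swapped behind a trait object without touching callers.
    let q: Box<dyn QuantizeStrategy> = Box::new(Uniform { levels: 16 });
    let idx = q.quantize(0.3);
    println!("token {idx} -> {:.3}", q.dequantize(idx));
}
```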
Type Aliases
- Array1 - One-dimensional array
- Array2 - Two-dimensional array
- TokenizerResult - Result type alias for tokenizer operations
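`TokenizerResult` paired with the `TokenizerError` enum follows the common Rust pattern of a crate-wide result alias. The exact definitions are not reproduced here; the sketch below shows the general pattern with hypothetical error variants.

```rust
/// Hypothetical error variants for illustration; the crate's
/// TokenizerError defines its own set of variants.
#[derive(Debug)]
enum TokenizerError {
    InvalidInput(String),
    ShapeMismatch { expected: usize, got: usize },
}

/// Crate-wide result alias, mirroring the `TokenizerResult` pattern.
type TokenizerResult<T> = Result<T, TokenizerError>;

/// Checks that a frame has the expected length before tokenization.
/// (The crate's own APIs would take its `Array1`/`Array2` aliases rather than a slice.)
fn check_frame(frame: &[f32], expected: usize) -> TokenizerResult<()> {
    if frame.is_empty() {
        return Err(TokenizerError::InvalidInput("empty frame".into()));
    }
    if frame.len() != expected {
        return Err(TokenizerError::ShapeMismatch { expected, got: frame.len() });
    }
    Ok(())
}

fn main() {
    match check_frame(&[0.1, -0.2, 0.3], 4) {
        Ok(()) => println!("frame ok"),
        Err(e) => eprintln!("tokenizer error: {e:?}"),
    }
}
```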