Crate kizzasi_tokenizer

§kizzasi-tokenizer

Signal quantization and tokenization for Kizzasi AGSP.

This crate provides methods for converting continuous signals into representations suitable for autoregressive prediction:

  • Continuous Embedding: Direct float-to-latent projection (no discretization)
  • VQ-VAE: Vector-quantized embeddings with a learned codebook
  • μ-law: Logarithmic quantization for audio signals
  • Linear Quantization: Simple uniform quantization (a standalone sketch of the μ-law and linear schemes follows this list)
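
Since μ-law companding and linear quantization are standard signal-processing techniques, the following is a minimal, self-contained sketch of the math behind them. It is plain Rust for illustration only, not this crate's API; names such as `mu_law_compress` are invented for the example.

```rust
// Minimal standalone sketch (not this crate's API): μ-law companding
// and linear uniform quantization for samples x in [-1, 1].

const MU: f32 = 255.0; // standard μ for 8-bit μ-law audio

/// Compress: F(x) = sign(x) * ln(1 + μ|x|) / ln(1 + μ)
fn mu_law_compress(x: f32) -> f32 {
    x.signum() * (1.0 + MU * x.abs()).ln() / (1.0 + MU).ln()
}

/// Expand (inverse): F⁻¹(y) = sign(y) * ((1 + μ)^|y| - 1) / μ
fn mu_law_expand(y: f32) -> f32 {
    y.signum() * ((1.0 + MU).powf(y.abs()) - 1.0) / MU
}

/// Uniformly quantize x in [-1, 1] into `levels` bins.
fn linear_quantize(x: f32, levels: u32) -> u32 {
    let t = (x.clamp(-1.0, 1.0) + 1.0) / 2.0; // map to [0, 1]
    (t * (levels - 1) as f32).round() as u32
}

/// Map a bin index back to its representative value in [-1, 1].
fn linear_dequantize(q: u32, levels: u32) -> f32 {
    q as f32 / (levels - 1) as f32 * 2.0 - 1.0
}

fn main() {
    let x = -0.03_f32; // a small-amplitude sample
    // μ-law: compress, quantize uniformly, dequantize, expand.
    let x_mu = mu_law_expand(linear_dequantize(
        linear_quantize(mu_law_compress(x), 256),
        256,
    ));
    // Plain linear quantization treats all amplitudes equally.
    let x_lin = linear_dequantize(linear_quantize(x, 256), 256);
    println!("original {x}, μ-law {x_mu:.5}, linear {x_lin:.5}");
}
```

With 256 levels, the μ-law round trip reconstructs small-amplitude samples noticeably more accurately than uniform binning, which is why μ-law suits audio, where most signal energy sits near zero.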

§AGSP Philosophy

Unlike LLMs, which tokenize text into discrete vocabulary indices, AGSP models can work with continuous signals directly. However, discretization can still be useful for:

  • Reducing model complexity
  • Enabling cross-modal transfer
  • Improving training stability

§COOLJAPAN Ecosystem

This crate follows KIZZASI_POLICY.md and uses scirs2-core for all array and numerical operations.

Re-exports§

pub use advanced_quant::AdaptiveQuantizer;
pub use advanced_quant::DeadZoneQuantizer;
pub use advanced_quant::NonUniformQuantizer;
pub use batch::BatchTokenizer;
pub use batch::StreamingTokenizer;
pub use entropy::compression_ratio;
pub use entropy::compute_frequencies;
pub use entropy::ArithmeticDecoder;
pub use entropy::ArithmeticEncoder;
pub use entropy::BitrateController;
pub use entropy::HuffmanDecoder;
pub use entropy::HuffmanEncoder;
pub use entropy::RangeDecoder;
pub use entropy::RangeEncoder;
pub use persistence::load_config;
pub use persistence::save_config;
pub use persistence::ModelCheckpoint;
pub use persistence::ModelMetadata;
pub use persistence::ModelVersion;
pub use specialized::DCTConfig;
pub use specialized::DCTTokenizer;
pub use specialized::FourierConfig;
pub use specialized::FourierTokenizer;
pub use specialized::KMeansConfig;
pub use specialized::KMeansTokenizer;
pub use specialized::WaveletConfig;
pub use specialized::WaveletFamily;
pub use specialized::WaveletTokenizer;
pub use advanced_features::add_batch_jitter;
pub use advanced_features::add_jitter;
pub use advanced_features::apply_batch_token_dropout;
pub use advanced_features::apply_temporal_coherence;
pub use advanced_features::apply_token_dropout;
pub use advanced_features::HierarchicalConfig;
pub use advanced_features::HierarchicalTokenizer;
pub use advanced_features::JitterConfig;
pub use advanced_features::TemporalCoherenceConfig;
pub use advanced_features::TemporalFilterType;
pub use advanced_features::TokenDropoutConfig;
pub use compat::AudioMetadata;
pub use compat::DType;
pub use compat::ModelConfig;
pub use compat::OnnxConfig;
pub use compat::PyTorchCompat;
pub use compat::TensorInfo;
pub use domain_specific::EnvironmentalTokenizer;
pub use domain_specific::EnvironmentalTokenizerConfig;
pub use domain_specific::MusicTokenizer;
pub use domain_specific::MusicTokenizerConfig;
pub use domain_specific::SpeechTokenizer;
pub use domain_specific::SpeechTokenizerConfig;
pub use transformer::FeedForward;
pub use transformer::LayerNorm;
pub use transformer::MultiHeadAttention;
pub use transformer::PositionalEncoding;
pub use transformer::TransformerConfig;
pub use transformer::TransformerEncoderLayer;
pub use transformer::TransformerTokenizer;
pub use pretraining::ContrastiveConfig;
pub use pretraining::ContrastiveLearning;
pub use pretraining::MSMConfig;
pub use pretraining::MaskedSignalModeling;
pub use pretraining::TemporalPrediction;
pub use pretraining::TemporalPredictionConfig;
pub use profiling::AllocationEvent;
pub use profiling::EventType;
pub use profiling::MemoryProfiler;
pub use profiling::MemorySnapshot;
pub use profiling::ProfileScope;
pub use profiling::ScopeStats;
pub use profiling::TimelineAnalyzer;

Modules§

advanced_features
Advanced features for tokenizer robustness and regularization
advanced_quant
Advanced quantization strategies
batch
Batch processing for efficient tokenization of multiple signals
compat
Compatibility and interoperability with other frameworks
domain_specific
Domain-specific tokenizers for specialized audio applications
enhanced_multiscale
Enhanced multi-scale tokenization with advanced features
entropy
Entropy coding for efficient compression
gpu_quant
GPU-accelerated quantization operations using candle
metrics
Quality metrics for evaluating tokenizer performance
persistence
Model persistence and checkpoint management
pretraining
Self-supervised pre-training for tokenizers
profiling
Memory usage profiling and performance monitoring utilities
serde_utils
Serialization and deserialization utilities for tokenizers
simd_quant
SIMD-optimized quantization operations
specialized
Specialized tokenizers for signal processing
transformer
Transformer-based signal tokenization using self-attention mechanisms
types
Type-safe wrappers for improved type-level safety
utils
Common utilities and helper traits for kizzasi-tokenizer

Structs§

ContinuousTokenizer
Continuous tokenizer that projects signals to embedding space
LinearQuantizer
Linear uniform quantizer
MuLawCodec
μ-law companding codec
MultiScaleTokenizer
Multi-scale hierarchical tokenizer
PyramidTokenizer
Pyramid tokenizer with residual connections
ReconstructionMetrics
Reconstruction loss metrics
ScaleLevel
Configuration for a single scale level
TrainableContinuousTokenizer
Trainable continuous tokenizer with gradient-descent updates
TrainingConfig
Training configuration for the trainable tokenizer

Enums§

PoolMethod
Method for downsampling signals
TokenizerError
Errors that can occur in tokenizer operations
TokenizerType
Configuration for tokenizer selection
UpsampleMethod
Method for upsampling signals

Traits§

Quantizer
Trait for quantization strategies
SignalTokenizer
Trait for signal tokenization
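
As an illustration of how these two traits plausibly divide responsibilities (Quantizer acting per sample, SignalTokenizer acting on whole signals), here is a hypothetical sketch. The trait shapes and method names below are assumptions for illustration, not the crate's actual definitions; plain slices stand in for the crate's Array1 alias.

```rust
// Hypothetical sketch only: these trait shapes are assumptions for
// illustration and are NOT the crate's actual definitions.

trait Quantizer {
    /// Map one sample to a discrete code.
    fn quantize(&self, x: f32) -> u32;
    /// Reconstruct an approximate sample from a code.
    fn dequantize(&self, q: u32) -> f32;
}

trait SignalTokenizer {
    /// Convert a whole signal into a token sequence.
    fn tokenize(&self, signal: &[f32]) -> Vec<u32>;
    /// Reconstruct an approximate signal from tokens.
    fn detokenize(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Any element-wise quantizer induces a whole-signal tokenizer.
impl<Q: Quantizer> SignalTokenizer for Q {
    fn tokenize(&self, signal: &[f32]) -> Vec<u32> {
        signal.iter().map(|&x| self.quantize(x)).collect()
    }
    fn detokenize(&self, tokens: &[u32]) -> Vec<f32> {
        tokens.iter().map(|&q| self.dequantize(q)).collect()
    }
}

/// Toy 8-bit uniform quantizer to exercise the traits.
struct EightBit;

impl Quantizer for EightBit {
    fn quantize(&self, x: f32) -> u32 {
        ((x.clamp(-1.0, 1.0) + 1.0) / 2.0 * 255.0).round() as u32
    }
    fn dequantize(&self, q: u32) -> f32 {
        q as f32 / 255.0 * 2.0 - 1.0
    }
}

fn main() {
    let tokens = EightBit.tokenize(&[-0.5, 0.0, 0.5]);
    println!("{tokens:?} -> {:?}", EightBit.detokenize(&tokens));
}
```

The blanket impl reflects a common design choice: any element-wise quantization strategy automatically yields a whole-signal tokenizer, while tokenizers that need global context (such as the DCT, wavelet, or transformer tokenizers above) would implement SignalTokenizer directly.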

Type Aliases§

Array1
One-dimensional array
Array2
Two-dimensional array
TokenizerResult
Result type alias for tokenizer operations