llama-rs: A Rust implementation of llama.cpp
High-performance LLM inference engine with support for GGUF and ONNX models.
§Features
- Full GGUF file format support (v1, v2, v3)
- All quantization formats (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, K-quants)
- Memory-mapped model loading
- CPU backend with SIMD and parallel operations
- LLaMA model architecture support
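The GGUF support listed above can be illustrated with a crate-independent header check: a GGUF file begins with the ASCII magic bytes `GGUF` followed by a little-endian `u32` format version (1, 2, or 3). The function name `gguf_version` below is hypothetical, for illustration only; it is not part of this crate's API.

```rust
// Minimal, crate-independent check of a GGUF header.
// A GGUF file starts with the magic bytes "GGUF" followed by a
// little-endian u32 format version (1, 2, or 3 as of this writing).
fn gguf_version(header: &[u8]) -> Option<u32> {
    if header.len() < 8 || &header[0..4] != b"GGUF" {
        return None;
    }
    let version = u32::from_le_bytes([header[4], header[5], header[6], header[7]]);
    Some(version)
}

fn main() {
    // A fabricated 8-byte header for illustration: magic + version 3.
    let header = [b'G', b'G', b'U', b'F', 3, 0, 0, 0];
    assert_eq!(gguf_version(&header), Some(3));
    // Anything without the magic is rejected.
    assert_eq!(gguf_version(b"NOTAGGUF"), None);
}
```

The crate's own `GgufFile::open` performs this validation (and the full metadata parse) internally.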
§Example
use llama_gguf::{GgufFile, default_backend};
// Load a GGUF model
let file = GgufFile::open("model.gguf").unwrap();
println!("Model architecture: {:?}", file.data.get_string("general.architecture"));
// Get the default backend
let backend = default_backend();
println!("Using backend: {}", backend.name());
Re-exports§
pub use config::{Config, ConfigError};
pub use engine::{ChatEngine, ChatTemplate, Engine, EngineConfig, EngineError};
pub use backend::{default_backend, Backend, BackendError};
pub use backend::tensor_parallel::{ShardingPlan, SingleDeviceTP, TPConfig, TensorParallel, merge_shards, shard_weight};
pub use gguf::{GgufBuilder, GgufData, GgufFile, GgufReader, GgufWriter, TensorToWrite, QuantizeOptions, QuantizeStats, quantize_model};
pub use model::{Architecture, InferenceContext, KVCache, LlamaModel, Model, ModelConfig, ModelError, ModelLoader, load_llama_model};
pub use model::{AttentionLayer, DeltaNetConfig, DeltaNetLayer, DeltaNetState, RecurrentState};
pub use model::{LoraAdapter, LoraAdapters, LoraConfig};
pub use model::{MoeConfig, MoeExpert, MoeLayer, MoeRouter, MoeStats};
pub use model::{SpeculativeConfig, SpeculativeDecoder, SpeculativeMode, SpeculativeStats};
pub use model::{EmbeddingConfig, EmbeddingError, EmbeddingExtractor, PoolingStrategy, TruncationStrategy, cosine_similarity, dot_product, euclidean_distance, find_nearest};
pub use model::{CachedPrefix, PrefixId, PrefixSharing, PromptCache, PromptCacheConfig, PromptCacheStats};
pub use model::{KVCacheFormat, QuantizedKVCache};
pub use model::{BlockId, BlockTable, PageAllocator, PagedKVPool, PagedSequence, DEFAULT_BLOCK_SIZE};
pub use sampling::{Grammar, GrammarSampler, GbnfGrammar, JsonGrammar, RegexGrammar, MirostatConfig, Sampler, SamplerConfig};
pub use tensor::{DType, Tensor, TensorError, TensorStorage};
pub use tokenizer::{Tokenizer, TokenizerError};
pub use huggingface::{HfClient, HfError, HfFileInfo, format_bytes};
pub use onnx::{HfConfig, OnnxError, OnnxFile, OnnxMetadata, OnnxModelLoader, OnnxTensorInfo};
pub use engine_batched::{BatchFinishReason, BatchRequest, BatchToken, BatchedEngine, BatchedEngineConfig};
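Among the re-exports are quantization helpers (`quantize_model`, `QuantizeOptions`). The Q8_0 format named in the feature list groups weights into blocks of 32 values sharing one scale; the sketch below shows the round-trip independently of this crate's API. The function names `quantize_q8_0` and `dequantize_q8_0` are hypothetical, and the scale is kept as `f32` for simplicity (the on-disk ggml layout stores it as `f16`).

```rust
const QK8_0: usize = 32; // values per Q8_0 block, matching ggml's layout

/// Quantize one block of 32 f32 values to Q8_0: a shared scale plus
/// 32 signed bytes. Hypothetical sketch, not this crate's API.
fn quantize_q8_0(block: &[f32; QK8_0]) -> (f32, [i8; QK8_0]) {
    let amax = block.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    let mut q = [0i8; QK8_0];
    for (i, &v) in block.iter().enumerate() {
        q[i] = (v / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, q)
}

/// Reconstruct the block: each byte times the shared scale.
fn dequantize_q8_0(scale: f32, q: &[i8; QK8_0]) -> [f32; QK8_0] {
    let mut out = [0.0f32; QK8_0];
    for (i, &v) in q.iter().enumerate() {
        out[i] = v as f32 * scale;
    }
    out
}

fn main() {
    let mut block = [0.0f32; QK8_0];
    for i in 0..QK8_0 {
        block[i] = (i as f32 - 16.0) / 4.0; // values in roughly [-4, 4]
    }
    let (scale, q) = quantize_q8_0(&block);
    let restored = dequantize_q8_0(scale, &q);
    // Round-trip error is bounded by half a quantization step.
    for i in 0..QK8_0 {
        assert!((block[i] - restored[i]).abs() <= scale * 0.5 + 1e-6);
    }
}
```

This per-block scale is why Q8_0 costs roughly 8.5 bits per weight rather than 8.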
Modules§
- backend
- Hardware backends for tensor operations
- client
- HTTP client for connecting to an OpenAI-compatible inference server.
- config
- TOML configuration file support for llama-gguf.
- engine
- High-level inference engine for llama-gguf.
- engine_batched
- Batched inference engine for continuous batching
- gguf
- GGUF file format parser and writer
- huggingface
- HuggingFace Hub integration for downloading GGUF models
- model
- Model architectures and inference
- onnx
- ONNX model format support
- rag
- RAG (Retrieval-Augmented Generation) support with pgvector
- sampling
- Token sampling strategies for text generation
- server
- HTTP server with OpenAI-compatible API
- tensor
- Tensor module for llama-rs
- tokenizer
- Tokenizer implementations for text encoding/decoding
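The sampling module above covers token selection strategies. The core idea behind common samplers can be sketched without this crate's `Sampler` API: scale logits by a temperature, softmax, keep the top-k candidates, and renormalize. The function name `top_k_probs` is hypothetical, for illustration only.

```rust
/// Temperature scaling + top-k filtering over raw logits, as used by
/// token samplers. Returns (token_index, probability) pairs for the k
/// most likely tokens, renormalized after filtering.
/// Generic illustration, not this crate's Sampler API.
fn top_k_probs(logits: &[f32], temperature: f32, k: usize) -> Vec<(usize, f32)> {
    // Softmax with temperature, stabilized by subtracting the max logit.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut scored: Vec<(usize, f32)> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .enumerate()
        .collect();
    // Keep only the k largest, then renormalize so probabilities sum to 1.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    let sum: f32 = scored.iter().map(|&(_, p)| p).sum();
    scored.into_iter().map(|(i, p)| (i, p / sum)).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.5, -1.0];
    let probs = top_k_probs(&logits, 1.0, 2);
    // The two largest logits survive; probabilities sum to 1.
    assert_eq!(probs[0].0, 0);
    assert_eq!(probs[1].0, 1);
    let total: f32 = probs.iter().map(|&(_, p)| p).sum();
    assert!((total - 1.0).abs() < 1e-6);
}
```

Lower temperatures sharpen the distribution toward the argmax; k = 1 reduces this to greedy decoding.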
Enums§
- Error
- Library-wide error type