llama-rs: A Rust implementation of llama.cpp
High-performance LLM inference engine with support for GGUF and ONNX models.
§Features
- Full GGUF file format support (v1, v2, v3)
- All quantization formats (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, K-quants)
- Memory-mapped model loading
- CPU backend with SIMD and parallel operations
- LLaMA model architecture support
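The GGUF support listed above can be illustrated with a crate-independent header check: a GGUF file begins with the ASCII magic bytes `GGUF` followed by a little-endian `u32` format version (1, 2, or 3). The function name `gguf_version` below is hypothetical, for illustration only; it is not part of this crate's API.

```rust
// Minimal, crate-independent check of a GGUF header.
// A GGUF file starts with the magic bytes "GGUF" followed by a
// little-endian u32 format version (1, 2, or 3 as of this writing).
fn gguf_version(header: &[u8]) -> Option<u32> {
    if header.len() < 8 || &header[0..4] != b"GGUF" {
        return None;
    }
    let version = u32::from_le_bytes([header[4], header[5], header[6], header[7]]);
    Some(version)
}

fn main() {
    // A fabricated 8-byte header for illustration: magic + version 3.
    let header = [b'G', b'G', b'U', b'F', 3, 0, 0, 0];
    assert_eq!(gguf_version(&header), Some(3));
    // Anything without the magic is rejected.
    assert_eq!(gguf_version(b"NOTAGGUF"), None);
}
```

The crate's own `GgufFile::open` performs this validation (and the full metadata parse) internally.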
§Example
use llama_gguf::{GgufFile, default_backend};
// Load a GGUF model
let file = GgufFile::open("model.gguf").unwrap();
println!("Model architecture: {:?}", file.data.get_string("general.architecture"));
// Get the default backend
let backend = default_backend();
println!("Using backend: {}", backend.name());
Re-exports§
pub use config::{Config, ConfigError};
pub use engine::{ChatEngine, ChatTemplate, Engine, EngineConfig, EngineError};
pub use backend::{default_backend, Backend, BackendError};
pub use backend::tensor_parallel::{ShardingPlan, SingleDeviceTP, TPConfig, TensorParallel, merge_shards, shard_weight};
pub use gguf::{GgufBuilder, GgufData, GgufFile, GgufReader, GgufWriter, TensorToWrite, QuantizeOptions, QuantizeStats, quantize_model};
pub use model::{Architecture, InferenceContext, KVCache, LlamaModel, Model, ModelConfig, ModelError, ModelLoader, load_llama_model};
pub use model::{AttentionLayer, DeltaNetConfig, DeltaNetLayer, DeltaNetState, RecurrentState};
pub use model::{LoraAdapter, LoraAdapters, LoraConfig};
pub use model::{MoeConfig, MoeExpert, MoeLayer, MoeRouter, MoeStats};
pub use model::{SpeculativeConfig, SpeculativeDecoder, SpeculativeMode, SpeculativeStats};
pub use model::{EmbeddingConfig, EmbeddingError, EmbeddingExtractor, PoolingStrategy, TruncationStrategy, cosine_similarity, dot_product, euclidean_distance, find_nearest};
pub use model::{CachedPrefix, PrefixId, PrefixSharing, PromptCache, PromptCacheConfig, PromptCacheStats};
pub use model::{KVCacheFormat, QuantizedKVCache};
pub use model::{BlockId, BlockTable, PageAllocator, PagedKVPool, PagedSequence, DEFAULT_BLOCK_SIZE};
pub use sampling::{Grammar, GrammarSampler, GbnfGrammar, JsonGrammar, RegexGrammar, MirostatConfig, Sampler, SamplerConfig};
pub use tensor::{DType, Tensor, TensorError, TensorStorage};
pub use tokenizer::{Tokenizer, TokenizerError};
pub use huggingface::{HfClient, HfError, HfFileInfo, format_bytes};
pub use onnx::{HfConfig, OnnxError, OnnxFile, OnnxMetadata, OnnxModelLoader, OnnxTensorInfo};
pub use engine_batched::{BatchFinishReason, BatchRequest, BatchToken, BatchedEngine, BatchedEngineConfig};
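Among the re-exports are quantization helpers (`quantize_model`, `QuantizeOptions`). The Q8_0 format named in the feature list groups weights into blocks of 32 values sharing one scale; the sketch below shows the round-trip independently of this crate's API. The function names `quantize_q8_0` and `dequantize_q8_0` are hypothetical, and the scale is kept as `f32` for simplicity (the on-disk ggml layout stores it as `f16`).

```rust
const QK8_0: usize = 32; // values per Q8_0 block, matching ggml's layout

/// Quantize one block of 32 f32 values to Q8_0: a shared scale plus
/// 32 signed bytes. Hypothetical sketch, not this crate's API.
fn quantize_q8_0(block: &[f32; QK8_0]) -> (f32, [i8; QK8_0]) {
    let amax = block.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    let mut q = [0i8; QK8_0];
    for (i, &v) in block.iter().enumerate() {
        q[i] = (v / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, q)
}

/// Reconstruct the block: each byte times the shared scale.
fn dequantize_q8_0(scale: f32, q: &[i8; QK8_0]) -> [f32; QK8_0] {
    let mut out = [0.0f32; QK8_0];
    for (i, &v) in q.iter().enumerate() {
        out[i] = v as f32 * scale;
    }
    out
}

fn main() {
    let mut block = [0.0f32; QK8_0];
    for i in 0..QK8_0 {
        block[i] = (i as f32 - 16.0) / 4.0; // values in roughly [-4, 4]
    }
    let (scale, q) = quantize_q8_0(&block);
    let restored = dequantize_q8_0(scale, &q);
    // Round-trip error is bounded by half a quantization step.
    for i in 0..QK8_0 {
        assert!((block[i] - restored[i]).abs() <= scale * 0.5 + 1e-6);
    }
}
```

This per-block scale is why Q8_0 costs roughly 8.5 bits per weight rather than 8.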
Modules§
- backend
- Hardware backends for tensor operations
- client
- HTTP client for connecting to an OpenAI-compatible inference server.
- config
- TOML configuration file support for llama-gguf.
- engine
- High-level inference engine for llama-gguf.
- engine_batched
- Batched inference engine for continuous batching
- gguf
- GGUF file format parser and writer
- huggingface
- HuggingFace Hub integration for downloading GGUF models
- model
- Model architectures and inference
- onnx
- ONNX model format support
- rag
- RAG (Retrieval-Augmented Generation) support with pgvector
- sampling
- Token sampling strategies for text generation
- server
- HTTP server with OpenAI-compatible API
- tensor
- Tensor module for llama-rs
- tokenizer
- Tokenizer implementations for text encoding/decoding
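The sampling module above covers token selection strategies. The core idea behind common samplers can be sketched without this crate's `Sampler` API: scale logits by a temperature, softmax, keep the top-k candidates, and renormalize. The function name `top_k_probs` is hypothetical, for illustration only.

```rust
/// Temperature scaling + top-k filtering over raw logits, as used by
/// token samplers. Returns (token_index, probability) pairs for the k
/// most likely tokens, renormalized after filtering.
/// Generic illustration, not this crate's Sampler API.
fn top_k_probs(logits: &[f32], temperature: f32, k: usize) -> Vec<(usize, f32)> {
    // Softmax with temperature, stabilized by subtracting the max logit.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut scored: Vec<(usize, f32)> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .enumerate()
        .collect();
    // Keep only the k largest, then renormalize so probabilities sum to 1.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    let sum: f32 = scored.iter().map(|&(_, p)| p).sum();
    scored.into_iter().map(|(i, p)| (i, p / sum)).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.5, -1.0];
    let probs = top_k_probs(&logits, 1.0, 2);
    // The two largest logits survive; probabilities sum to 1.
    assert_eq!(probs[0].0, 0);
    assert_eq!(probs[1].0, 1);
    let total: f32 = probs.iter().map(|&(_, p)| p).sum();
    assert!((total - 1.0).abs() < 1e-6);
}
```

Lower temperatures sharpen the distribution toward the argmax; k = 1 reduces this to greedy decoding.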
Enums§
- Error
- Library-wide error type