
Crate llama_gguf


llama-rs: A Rust implementation of llama.cpp

High-performance LLM inference engine with support for GGUF and ONNX models.

§Features

  • Full GGUF file format support (v1, v2, v3)
  • All quantization formats (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, K-quants)
  • Memory-mapped model loading
  • CPU backend with SIMD and parallel operations
  • LLaMA model architecture support
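As background for the quantization formats listed above: a Q4_0 block stores 32 weights as 4-bit integers plus one per-block scale, chosen from the value of largest magnitude. The sketch below is a standalone illustration of that round-trip following the conventional GGUF Q4_0 scheme, not this crate's API (the crate packs two 4-bit values per byte and uses an f16 scale on disk):

```rust
// Standalone sketch of Q4_0-style block quantization (32 weights per
// block, one scale). Illustration only, not the crate's implementation.

const QK: usize = 32;

/// Quantize one block: d = max-magnitude / -8, q = round(x / d) + 8 in 0..=15.
fn quantize_q4_0(x: &[f32; QK]) -> (f32, [u8; QK]) {
    // Value with the largest magnitude, sign preserved.
    let max = x.iter().fold(0.0f32, |m, &v| if v.abs() > m.abs() { v } else { m });
    let d = max / -8.0;
    let inv = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut q = [0u8; QK];
    for (qi, &v) in q.iter_mut().zip(x.iter()) {
        *qi = ((v * inv + 8.5) as i32).clamp(0, 15) as u8;
    }
    (d, q)
}

/// Dequantize: x ≈ (q - 8) * d.
fn dequantize_q4_0(d: f32, q: &[u8; QK]) -> [f32; QK] {
    let mut out = [0.0f32; QK];
    for (o, &qi) in out.iter_mut().zip(q.iter()) {
        *o = (qi as f32 - 8.0) * d;
    }
    out
}

fn main() {
    let mut x = [0.0f32; QK];
    for (i, v) in x.iter_mut().enumerate() {
        *v = (i as f32 - 16.0) / 16.0; // values in roughly [-1, 1)
    }
    let (d, q) = quantize_q4_0(&x);
    let y = dequantize_q4_0(d, &q);
    let max_err = x
        .iter()
        .zip(y.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!("scale = {d:.4}, max reconstruction error = {max_err:.4}");
}
```

The 4-bit budget is what gives Q4_0 its roughly 4.5 bits/weight footprint; the K-quant formats refine this with per-sub-block scales.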

§Example

use llama_gguf::{GgufFile, default_backend};

// Load a GGUF model
let file = GgufFile::open("model.gguf").unwrap();
println!("Model architecture: {:?}", file.data.get_string("general.architecture"));

// Get the default backend
let backend = default_backend();
println!("Using backend: {}", backend.name());

Re-exports§

pub use config::Config;
pub use config::ConfigError;
pub use engine::ChatEngine;
pub use engine::ChatTemplate;
pub use engine::Engine;
pub use engine::EngineConfig;
pub use engine::EngineError;
pub use backend::default_backend;
pub use backend::Backend;
pub use backend::BackendError;
pub use backend::tensor_parallel::ShardingPlan;
pub use backend::tensor_parallel::SingleDeviceTP;
pub use backend::tensor_parallel::TPConfig;
pub use backend::tensor_parallel::TensorParallel;
pub use backend::tensor_parallel::merge_shards;
pub use backend::tensor_parallel::shard_weight;
pub use gguf::GgufBuilder;
pub use gguf::GgufData;
pub use gguf::GgufFile;
pub use gguf::GgufReader;
pub use gguf::GgufWriter;
pub use gguf::TensorToWrite;
pub use gguf::QuantizeOptions;
pub use gguf::QuantizeStats;
pub use gguf::quantize_model;
pub use model::Architecture;
pub use model::InferenceContext;
pub use model::KVCache;
pub use model::LlamaModel;
pub use model::Model;
pub use model::ModelConfig;
pub use model::ModelError;
pub use model::ModelLoader;
pub use model::load_llama_model;
pub use model::AttentionLayer;
pub use model::DeltaNetConfig;
pub use model::DeltaNetLayer;
pub use model::DeltaNetState;
pub use model::RecurrentState;
pub use model::LoraAdapter;
pub use model::LoraAdapters;
pub use model::LoraConfig;
pub use model::MoeConfig;
pub use model::MoeExpert;
pub use model::MoeLayer;
pub use model::MoeRouter;
pub use model::MoeStats;
pub use model::SpeculativeConfig;
pub use model::SpeculativeDecoder;
pub use model::SpeculativeMode;
pub use model::SpeculativeStats;
pub use model::EmbeddingConfig;
pub use model::EmbeddingError;
pub use model::EmbeddingExtractor;
pub use model::PoolingStrategy;
pub use model::TruncationStrategy;
pub use model::cosine_similarity;
pub use model::dot_product;
pub use model::euclidean_distance;
pub use model::find_nearest;
pub use model::CachedPrefix;
pub use model::PrefixId;
pub use model::PrefixSharing;
pub use model::PromptCache;
pub use model::PromptCacheConfig;
pub use model::PromptCacheStats;
pub use model::KVCacheFormat;
pub use model::QuantizedKVCache;
pub use model::BlockId;
pub use model::BlockTable;
pub use model::PageAllocator;
pub use model::PagedKVPool;
pub use model::PagedSequence;
pub use model::DEFAULT_BLOCK_SIZE;
pub use sampling::Grammar;
pub use sampling::GrammarSampler;
pub use sampling::GbnfGrammar;
pub use sampling::JsonGrammar;
pub use sampling::RegexGrammar;
pub use sampling::MirostatConfig;
pub use sampling::Sampler;
pub use sampling::SamplerConfig;
pub use tensor::DType;
pub use tensor::Tensor;
pub use tensor::TensorError;
pub use tensor::TensorStorage;
pub use tokenizer::Tokenizer;
pub use tokenizer::TokenizerError;
pub use huggingface::HfClient;
pub use huggingface::HfError;
pub use huggingface::HfFileInfo;
pub use huggingface::format_bytes;
pub use onnx::HfConfig;
pub use onnx::OnnxError;
pub use onnx::OnnxFile;
pub use onnx::OnnxMetadata;
pub use onnx::OnnxModelLoader;
pub use onnx::OnnxTensorInfo;
pub use engine_batched::BatchFinishReason;
pub use engine_batched::BatchRequest;
pub use engine_batched::BatchToken;
pub use engine_batched::BatchedEngine;
pub use engine_batched::BatchedEngineConfig;
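Among the re-exports, `model` provides vector-similarity helpers (`cosine_similarity`, `dot_product`, `euclidean_distance`, `find_nearest`) for working with extracted embeddings. As a standalone illustration of the metric cosine similarity computes (not this crate's implementation or exact signature):

```rust
// Standalone sketch of cosine similarity over embedding vectors;
// shows the math behind `model::cosine_similarity`, not its code.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// cos(a, b) = a·b / (|a| |b|); returns 0.0 if either vector has zero length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let norm = dot(a, a).sqrt() * dot(b, b).sqrt();
    if norm == 0.0 { 0.0 } else { dot(a, b) / norm }
}

fn main() {
    let a = [1.0, 0.0, 1.0];
    let b = [1.0, 0.0, 1.0];
    let c = [0.0, 1.0, 0.0];
    println!("same direction: {}", cosine_similarity(&a, &b)); // 1.0
    println!("orthogonal:     {}", cosine_similarity(&a, &c)); // 0.0
}
```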

Modules§

backend
Hardware backends for tensor operations
client
HTTP client for connecting to an OpenAI-compatible inference server
config
TOML configuration file support for llama-gguf
engine
High-level inference engine for llama-gguf
engine_batched
Batched inference engine for continuous batching
gguf
GGUF file format parser and writer
huggingface
HuggingFace Hub integration for downloading GGUF models
model
Model architectures and inference
onnx
ONNX model format support
rag
RAG (Retrieval-Augmented Generation) support with pgvector
sampling
Token sampling strategies for text generation
server
HTTP server with OpenAI-compatible API
tensor
Tensor module for llama-rs
tokenizer
Tokenizer implementations for text encoding/decoding
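The `sampling` module covers strategies such as temperature scaling and top-k filtering. A standalone sketch of that common pipeline follows; the helper names here are hypothetical and the crate's `Sampler`/`SamplerConfig` API may differ:

```rust
// Standalone sketch of temperature + top-k sampling over raw logits.
// Helper names are hypothetical; see the crate's `Sampler` for the real API.

/// Softmax with temperature: p_i ∝ exp(logit_i / t). Lower t sharpens.
fn softmax_with_temperature(logits: &[f32], t: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / t).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Keep the k most probable tokens, renormalized; returns (token_id, p).
fn top_k(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<(usize, f32)> = probs.iter().cloned().enumerate().collect();
    idx.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    idx.truncate(k);
    let sum: f32 = idx.iter().map(|&(_, p)| p).sum();
    idx.into_iter().map(|(i, p)| (i, p / sum)).collect()
}

fn main() {
    let logits = [2.0f32, 1.0, 0.5, -1.0];
    let probs = softmax_with_temperature(&logits, 0.8);
    for (tok, p) in top_k(&probs, 2) {
        println!("token {tok}: p = {p:.3}");
    }
}
```

A sampler would then draw from the renormalized distribution; grammar-constrained sampling (`GrammarSampler`) additionally masks tokens the grammar forbids before this step.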

Enums§

Error
Library-wide error type

Type Aliases§

Result