docs.rs failed to build slm_ikllama-0.1.1
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.

slm_ikllama

ik_llama.cpp-backed implementation of the slm_inference trait layer.

ik_llama.cpp is a performance-oriented fork of llama.cpp with improved CPU/GPU kernels, new quantization types, first-class DeepSeek/MLA support, and fused MoE operations.

Idea

This crate wires the slm_inference abstract traits to the ik_llama.cpp runtime via the slm_ikllama_sys FFI bindings.

`slm_inference` trait	This crate's type
`SlmModelConfig`	`ModelConfig`
`SlmModel`	`Model`
`SlmContextBuilder<Context>`	`Builder`
`SlmContext`	`Context`
`SlmBatch` / `SlmToken`	`Batch` / `Token`

Key details

ModelConfig wraps llama_model_params and exposes builder methods for GPU layers, main GPU selection, mmap, mlock, and multi-GPU split mode.
Model holds a ref-counted pointer to the loaded llama_model; it is cheaply Cloneable for sharing across contexts.
Builder configures llama_context_params (context size, batch size, KV cache quantization with optional Hadamard transform, flash attention) and builds the sampler: greedy when temperature ≤ 0, otherwise top-k → top-p → temperature.
Context implements the full SlmContext protocol: batched decode via llama_decode, manual sampling from logits, tokenization via llama_tokenize, piece extraction via llama_token_to_piece, and KV cache editing (truncate, cut, drop).
The ik_llama.cpp backend is initialized exactly once (guarded by an AtomicBool) and its log output is routed to tracing.
edit_level is SlmEditLevel::Cut — the KV cache supports mid-sequence removal, enabling efficient context management without full resets.

KV cache quantization

Builder::with_gen_type_kv(k, v) accepts SlmKvType values and maps them to ik_llama.cpp quantization types with automatic Hadamard-transform activation:

`SlmKvType`	GGML type	Hadamard
`Q4`	`Q4_0`	✓
`Q5`	`Q5_0`	✓
`Q6`	`Q6_0`	✓
`Q8`	`Q8_0`	✓
`RawQ8`	`Q8_0`	—
`F16`	`F16`	—
`F32`	`F32`	—

Multi-GPU split modes

ModelConfig::with_split_mode(SplitMode) controls how the model is distributed across GPUs:

Variant	Description
`None`	Single GPU
`Layer`	Split layers and KV across GPUs
`Row`	Split with attention-layer tensor parallelism
`Tensor`	Experimental full tensor parallelism

Features