docs.rs failed to build slm_ikllama-0.1.1
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
slm_ikllama
ik_llama.cpp-backed implementation of the
slm_inference trait layer.
ik_llama.cpp is a performance-oriented fork of llama.cpp with improved CPU/GPU kernels, new quantization types, first-class DeepSeek/MLA support, and fused MoE operations.
Idea
This crate wires the slm_inference abstract traits to the ik_llama.cpp runtime via the
slm_ikllama_sys FFI bindings.
| slm_inference trait | This crate's type |
|---|---|
| `SlmModelConfig` | `ModelConfig` |
| `SlmModel` | `Model` |
| `SlmContextBuilder<Context>` | `Builder` |
| `SlmContext` | `Context` |
| `SlmBatch` / `SlmToken` | `Batch` / `Token` |
Key details
- `ModelConfig` wraps `llama_model_params` and exposes builder methods for GPU layers, main GPU selection, mmap, mlock, and multi-GPU split mode.
- `Model` holds a ref-counted pointer to the loaded `llama_model`; it is cheaply `Clone`able for sharing across contexts.
- `Builder` configures `llama_context_params` (context size, batch size, KV cache quantization with optional Hadamard transform, flash attention) and builds the sampler: greedy when `temperature ≤ 0`, otherwise top-k → top-p → temperature.
- `Context` implements the full `SlmContext` protocol: batched decode via `llama_decode`, manual sampling from logits, tokenization via `llama_tokenize`, piece extraction via `llama_token_to_piece`, and KV cache editing (`truncate`, `cut`, `drop`).
- The ik_llama.cpp backend is initialized exactly once (guarded by an `AtomicBool`) and its log output is routed to `tracing`.
- `edit_level` is `SlmEditLevel::Cut`: the KV cache supports mid-sequence removal, enabling efficient context management without full resets.
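
The sampler order described above (greedy when the temperature is not positive, otherwise top-k, then top-p, then temperature) can be illustrated with a minimal, standalone sketch. This is not the crate's code: `SamplerParams`, `sample`, and `softmax` are hypothetical names, and the uniform random draw `u` is passed in by the caller so the example stays deterministic and free of external crates.

```rust
/// Hypothetical sampler settings; the field names are illustrative only.
#[derive(Clone, Copy)]
struct SamplerParams {
    temperature: f32,
    top_k: usize,
    top_p: f32,
}

/// Numerically stable softmax over a slice of logits.
fn softmax(x: &[f32]) -> Vec<f32> {
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let e: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = e.iter().sum();
    e.iter().map(|v| v / sum).collect()
}

/// Pick a token index from raw logits.
/// `u` is a uniform random draw in [0, 1) supplied by the caller.
fn sample(logits: &[f32], p: SamplerParams, u: f32) -> usize {
    assert!(!logits.is_empty());

    // Candidate indices sorted by logit, highest first.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));

    // Greedy path: a non-positive temperature means plain argmax.
    if p.temperature <= 0.0 {
        return idx[0];
    }

    // Top-k: keep the k highest-scoring candidates (at least one).
    idx.truncate(p.top_k.clamp(1, logits.len()));

    // Top-p: keep the smallest prefix whose probability mass reaches top_p.
    let kept: Vec<f32> = idx.iter().map(|&i| logits[i]).collect();
    let mut cum = 0.0f32;
    let mut keep = idx.len();
    for (n, pr) in softmax(&kept).iter().enumerate() {
        cum += *pr;
        if cum >= p.top_p {
            keep = n + 1;
            break;
        }
    }
    idx.truncate(keep);

    // Temperature: rescale the surviving logits, renormalize, then draw.
    let scaled: Vec<f32> = idx.iter().map(|&i| logits[i] / p.temperature).collect();
    let mut cum = 0.0f32;
    for (n, pr) in softmax(&scaled).iter().enumerate() {
        cum += *pr;
        if u < cum {
            return idx[n];
        }
    }
    // Fallback for rounding error when u is very close to 1.
    idx[idx.len() - 1]
}
```

With `temperature` set to `0.0` the function always returns the index of the largest logit, regardless of `top_k`, `top_p`, or `u`.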
KV cache quantization
Builder::with_gen_type_kv(k, v) accepts SlmKvType values and maps them to ik_llama.cpp
quantization types with automatic Hadamard-transform activation:
| SlmKvType | GGML type | Hadamard |
|---|---|---|
| `Q4` | `Q4_0` | ✓ |
| `Q5` | `Q5_0` | ✓ |
| `Q6` | `Q6_0` | ✓ |
| `Q8` | `Q8_0` | ✓ |
| `RawQ8` | `Q8_0` | ✗ |
| `F16` | `F16` | ✗ |
| `F32` | `F32` | ✗ |
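
The mapping in the table can be read as a single `match`. The sketch below uses local stand-in enums (`KvType`, `GgmlType`) and a hypothetical function `map_kv_type` purely for illustration; the real `SlmKvType` comes from slm_inference and the real GGML type constants from slm_ikllama_sys, whose exact names are not shown here.

```rust
/// Local stand-in for `SlmKvType` (illustrative, not the real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KvType {
    Q4,
    Q5,
    Q6,
    Q8,
    RawQ8,
    F16,
    F32,
}

/// Local stand-in for the GGML tensor types listed in the table.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GgmlType {
    Q4_0,
    Q5_0,
    Q6_0,
    Q8_0,
    F16,
    F32,
}

/// Returns the GGML type and whether the Hadamard transform is enabled,
/// following the table row by row.
fn map_kv_type(t: KvType) -> (GgmlType, bool) {
    match t {
        KvType::Q4 => (GgmlType::Q4_0, true),
        KvType::Q5 => (GgmlType::Q5_0, true),
        KvType::Q6 => (GgmlType::Q6_0, true),
        KvType::Q8 => (GgmlType::Q8_0, true),
        // Same storage type as Q8, but without the Hadamard transform.
        KvType::RawQ8 => (GgmlType::Q8_0, false),
        KvType::F16 => (GgmlType::F16, false),
        KvType::F32 => (GgmlType::F32, false),
    }
}
```

Note that `Q8` and `RawQ8` differ only in the Hadamard flag; both store the cache as `Q8_0`.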
Multi-GPU split modes
ModelConfig::with_split_mode(SplitMode) controls how the model is distributed across GPUs:
| Variant | Description |
|---|---|
| `None` | Single GPU |
| `Layer` | Split layers and KV across GPUs |
| `Row` | Split with attention-layer tensor parallelism |
| `Tensor` | Experimental full tensor parallelism |
Features
| Feature | Description |
|---|---|
| `cuda` (default) | NVIDIA CUDA via `slm_ikllama_sys/cuda` |
| `native` (default) | CPU compiled for the host architecture via `slm_ikllama_sys/native` |