Crate ferrotorch_llama

Llama 3 (Meta LLaMA) model composition for ferrotorch.

Assembles the standard Llama decoder stack from ferrotorch primitives:

LlamaForCausalLM
├── LlamaModel
│   ├── Embedding                      (token embeddings)
│   ├── LlamaDecoderLayer × N
│   │   ├── RMSNorm                    (pre-attn)
│   │   ├── LlamaAttention             (GQA + RoPE)
│   │   ├── residual
│   │   ├── RMSNorm                    (pre-MLP)
│   │   ├── SwiGLU                     (gate/up/down projections)
│   │   └── residual
│   └── RMSNorm                        (final)
└── Linear lm_head                     (projection to vocab)
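The data flow through one LlamaDecoderLayer in the tree above can be sketched in plain Rust for a single position. This is a minimal illustration, not the crate's implementation: it uses `Vec<f32>` instead of ferrotorch tensors, all function names are hypothetical, and attention is passed in as a closure because GQA and RoPE are out of scope here.

```rust
/// RMSNorm over one hidden vector: x_i * w_i / sqrt(mean(x^2) + eps).
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SiLU activation used by the SwiGLU gate: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Dense matrix-vector product; `w` is row-major with shape [out, in].
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// SwiGLU feedforward: down(silu(gate(x)) * up(x)).
fn swiglu(x: &[f32], gate: &[Vec<f32>], up: &[Vec<f32>], down: &[Vec<f32>]) -> Vec<f32> {
    let g = matvec(gate, x);
    let u = matvec(up, x);
    let h: Vec<f32> = g.iter().zip(&u).map(|(g, u)| silu(*g) * u).collect();
    matvec(down, &h)
}

/// One pre-norm decoder layer: norm -> attention -> residual,
/// then norm -> SwiGLU -> residual. `attn` is a stand-in (assumption)
/// for the real attention block.
fn decoder_layer<F>(
    x: &[f32],
    pre_attn_norm: &[f32],
    pre_mlp_norm: &[f32],
    attn: F,
    gate: &[Vec<f32>],
    up: &[Vec<f32>],
    down: &[Vec<f32>],
    eps: f32,
) -> Vec<f32>
where
    F: Fn(&[f32]) -> Vec<f32>,
{
    let a = attn(&rms_norm(x, pre_attn_norm, eps));
    let h: Vec<f32> = x.iter().zip(&a).map(|(x, a)| x + a).collect();
    let m = swiglu(&rms_norm(&h, pre_mlp_norm, eps), gate, up, down);
    h.iter().zip(&m).map(|(h, m)| h + m).collect()
}
```

Because both sub-blocks are added back onto the residual stream, a layer whose attention and down projection output zeros passes its input through unchanged.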

Loading real weights

LlamaForCausalLM::load_hf_state_dict accepts a StateDict whose keys follow the HuggingFace transformers naming convention, rewrites those keys to the ferrotorch parameter paths, and then delegates to Module::load_state_dict. Combined with ferrotorch_serialize::load_safetensors_sharded, this gives a direct path from a downloaded Meta-Llama-3-8B checkpoint to a loaded model.

Re-exports

pub use attention::LlamaAttention;
pub use config::LlamaActivation;
pub use config::LlamaConfig;
pub use generation::GenerationConfig;
pub use generation::apply_repetition_penalty;
pub use generation::apply_temperature;
pub use generation::argmax;
pub use generation::generate;
pub use generation::generate_with_streamer;
pub use generation::sample_softmax;
pub use generation::top_k_filter;
pub use generation::top_p_filter;
pub use gguf_remap::gguf_key_to_hf;
pub use gguf_remap::gguf_to_hf_state_dict;
pub use kv_cache::LayerKvCache;
pub use kv_cache::LlamaKvCache;
pub use layer::LlamaDecoderLayer;
pub use mlp::LlamaMLP;
pub use model::LlamaForCausalLM;
pub use model::LlamaModel;
pub use quant_loaders::AwqQ4;
pub use quant_loaders::GptqQ4;
pub use quant_loaders::HqqQ4Axis1;
pub use quant_loaders::dequantize_awq_q4;
pub use quant_loaders::dequantize_gptq_q4;
pub use quant_loaders::dequantize_hqq_q4_axis1;
pub use quant_loaders::hqq_q4_axis1_to_dense;
pub use quant_loaders::hqq_state_dict_to_dense;
pub use spec_decode::LlamaHandle;
pub use spec_decode::ModelHandle;
pub use spec_decode::SpecDecodeConfig;
pub use spec_decode::SpecDecodeOutput;
pub use spec_decode::speculative_decode;
pub use ferrotorch_grammar as grammar;
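
Several of the generation re-exports above name standard logit-processing steps (temperature scaling, top-k and top-p filtering, argmax). The sketch below shows what those techniques typically compute on a plain `&mut [f32]` of logits. The function names and signatures here are hypothetical and chosen to differ from the crate's; the actual ferrotorch_llama signatures may take tensors or other argument types.

```rust
/// Temperature scaling: divide every logit by `temperature` (> 0).
fn scale_by_temperature(logits: &mut [f32], temperature: f32) {
    for l in logits.iter_mut() {
        *l /= temperature;
    }
}

/// Top-k filtering: mask everything below the k-th largest logit.
/// Ties at the threshold are all kept, so more than k may survive.
fn keep_top_k(logits: &mut [f32], k: usize) {
    if k == 0 || k >= logits.len() {
        return;
    }
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a)); // descending
    let threshold = sorted[k - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Numerically stable softmax (subtracts the max before exponentiating).
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Top-p (nucleus) filtering: keep the smallest set of highest-probability
/// tokens whose cumulative probability reaches `p`, mask the rest.
fn keep_top_p(logits: &mut [f32], p: f32) {
    let probs = softmax(logits);
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a])); // descending
    let mut cumulative = 0.0;
    let mut cutoff = order.len();
    for (rank, &idx) in order.iter().enumerate() {
        cumulative += probs[idx];
        if cumulative >= p {
            cutoff = rank + 1;
            break;
        }
    }
    for &idx in &order[cutoff..] {
        logits[idx] = f32::NEG_INFINITY;
    }
}

/// Greedy decoding: index of the largest logit (0 for empty input).
fn greedy_pick(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap_or(0)
}
```

Masked entries are set to negative infinity so that a later softmax assigns them zero probability. A sampling loop would usually apply the filters in sequence and then either sample from the softmax or take the greedy pick.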

Modules

attention: Llama attention layer.
config: Typed Llama configuration.
generation: Token-level generation for LlamaForCausalLM. (#592)
gguf_remap: Translate GGUF tensor names into the HuggingFace transformers naming convention used by crate::LlamaForCausalLM::load_hf_state_dict.
kv_cache: Per-layer key/value cache for incremental Llama decoding (#1129).
layer: Single Llama decoder layer.
mlp: Llama feedforward block.
model: Top-level Llama model + causal-LM head.
quant_loaders: Weight unpackers for HF-quantized LLM checkpoints. (#593)
spec_decode: Speculative decoding — Leviathan et al. 2023 (arXiv:2211.17192).
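
To illustrate the kind of name translation the gguf_remap module describes, here is a minimal sketch that maps common llama.cpp-style GGUF tensor names to their HuggingFace transformers equivalents. The function name `gguf_name_to_hf` is hypothetical and this is not the crate's gguf_key_to_hf; it only rewrites names, does not touch tensor data, and covers only the standard Llama tensors.

```rust
/// Translate one GGUF tensor name to the HuggingFace Llama key, or
/// return None if the name is not recognised. Illustrative only.
fn gguf_name_to_hf(name: &str) -> Option<String> {
    // Tensors that live outside the per-layer blocks.
    match name {
        "token_embd.weight" => return Some("model.embed_tokens.weight".to_string()),
        "output_norm.weight" => return Some("model.norm.weight".to_string()),
        "output.weight" => return Some("lm_head.weight".to_string()),
        _ => {}
    }

    // Per-layer tensors look like "blk.<index>.<tensor>.<suffix>".
    let rest = name.strip_prefix("blk.")?;
    let (index, tail) = rest.split_once('.')?;
    index.parse::<usize>().ok()?; // reject non-numeric layer indices
    let (tensor, suffix) = tail.rsplit_once('.')?;

    let hf = match tensor {
        "attn_norm" => "input_layernorm",
        "attn_q" => "self_attn.q_proj",
        "attn_k" => "self_attn.k_proj",
        "attn_v" => "self_attn.v_proj",
        "attn_output" => "self_attn.o_proj",
        "ffn_norm" => "post_attention_layernorm",
        "ffn_gate" => "mlp.gate_proj",
        "ffn_up" => "mlp.up_proj",
        "ffn_down" => "mlp.down_proj",
        _ => return None,
    };
    Some(format!("model.layers.{index}.{hf}.{suffix}"))
}
```

Returning `Option` lets a caller decide whether an unrecognised tensor name is an error or should simply be skipped.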