Llama 3 (Meta LLaMA) model composition for ferrotorch.
Assembles the standard Llama decoder stack from ferrotorch primitives:
LlamaForCausalLM
├── LlamaModel
│   ├── Embedding (token embeddings)
│   ├── LlamaDecoderLayer × N
│   │   ├── RMSNorm (pre-attn)
│   │   ├── LlamaAttention (GQA + RoPE)
│   │   ├── residual
│   │   ├── RMSNorm (pre-MLP)
│   │   ├── SwiGLU (gate/up/down projections)
│   │   └── residual
│   └── RMSNorm (final)
└── Linear lm_head (projection to vocab)

§Loading real weights
LlamaForCausalLM::load_hf_state_dict accepts a StateDict whose
keys use the HuggingFace transformers naming convention and rewrites
them to match the ferrotorch parameter paths before delegating to
Module::load_state_dict. Combined with
ferrotorch_serialize::load_safetensors_sharded, this gives a direct
path from a downloaded Meta-Llama-3-8B checkpoint to a loaded model.
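
The key-rewriting step described above can be pictured with a small, standalone sketch. It is not the crate's implementation: the left-hand names in the table are real HuggingFace Llama key segments, but the right-hand names are hypothetical target paths invented for illustration, and `remap_hf_key` / `remap_state_dict` are made-up helper names. The generic `T` stands in for whatever tensor type the real StateDict holds.

```rust
use std::collections::HashMap;

// ASSUMPTION: these (from, to) pairs are illustrative only. The `from`
// side uses real HuggingFace Llama key segments; the `to` side is a
// hypothetical naming scheme, not taken from ferrotorch.
const RENAMES: &[(&str, &str)] = &[
    ("model.embed_tokens.", "model.embedding."),
    (".self_attn.", ".attention."),
    (".input_layernorm.", ".pre_attn_norm."),
    (".post_attention_layernorm.", ".pre_mlp_norm."),
];

/// Rewrite one HuggingFace-style key into the assumed target naming.
/// Keys that match no rule (for example `lm_head.weight`) pass through.
fn remap_hf_key(key: &str) -> String {
    let mut out = key.to_string();
    for &(from, to) in RENAMES {
        out = out.replace(from, to);
    }
    out
}

/// Rewrite every key in a state dict, leaving the tensors untouched.
fn remap_state_dict<T>(hf: HashMap<String, T>) -> HashMap<String, T> {
    hf.into_iter().map(|(k, v)| (remap_hf_key(&k), v)).collect()
}
```

Under these assumed rules, `remap_hf_key("model.layers.0.self_attn.q_proj.weight")` yields `model.layers.0.attention.q_proj.weight`. Doing the rename as a pure key-to-key pass keeps the tensors untouched, so the result can be handed to an ordinary state-dict loader.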
Re-exports§
pub use attention::LlamaAttention;
pub use config::LlamaActivation;
pub use config::LlamaConfig;
pub use generation::GenerationConfig;
pub use generation::apply_repetition_penalty;
pub use generation::apply_temperature;
pub use generation::argmax;
pub use generation::generate;
pub use generation::generate_with_streamer;
pub use generation::sample_softmax;
pub use generation::top_k_filter;
pub use generation::top_p_filter;
pub use gguf_remap::gguf_key_to_hf;
pub use gguf_remap::gguf_to_hf_state_dict;
pub use kv_cache::LayerKvCache;
pub use kv_cache::LlamaKvCache;
pub use layer::LlamaDecoderLayer;
pub use mlp::LlamaMLP;
pub use model::LlamaForCausalLM;
pub use model::LlamaModel;
pub use quant_loaders::AwqQ4;
pub use quant_loaders::GptqQ4;
pub use quant_loaders::HqqQ4Axis1;
pub use quant_loaders::dequantize_awq_q4;
pub use quant_loaders::dequantize_gptq_q4;
pub use quant_loaders::dequantize_hqq_q4_axis1;
pub use quant_loaders::hqq_q4_axis1_to_dense;
pub use quant_loaders::hqq_state_dict_to_dense;
pub use spec_decode::LlamaHandle;
pub use spec_decode::ModelHandle;
pub use spec_decode::SpecDecodeConfig;
pub use spec_decode::SpecDecodeOutput;
pub use spec_decode::speculative_decode;
pub use ferrotorch_grammar as grammar;
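
For orientation, the kind of logit processing that the generation re-exports name (temperature scaling, top-k filtering, greedy selection) can be sketched on a plain `f32` slice. The function names and signatures below are hypothetical stand-ins chosen so they do not collide with the crate's items; the real `apply_temperature`, `top_k_filter`, and `argmax` may take different argument types and handle edge cases differently.

```rust
/// Divide every logit by `temperature` (assumed to be > 0).
/// Hypothetical stand-in, not the crate's `apply_temperature`.
fn scale_by_temperature(logits: &mut [f32], temperature: f32) {
    for l in logits.iter_mut() {
        *l /= temperature;
    }
}

/// Keep the `k` largest logits and set the rest to negative infinity.
/// Logits tied with the k-th largest value are all kept, so slightly
/// more than `k` entries may survive.
fn keep_top_k(logits: &mut [f32], k: usize) {
    if k == 0 || k >= logits.len() {
        return; // nothing to filter
    }
    let mut sorted: Vec<f32> = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a)); // descending order
    let threshold = sorted[k - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Index of the largest logit, or `None` for an empty slice.
/// On ties this returns the last maximal index.
fn greedy_pick(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

Applied to `[1.0, 4.0, 2.0, 3.0]` with temperature 2.0 and k = 2, the two helpers leave only indices 1 and 3 finite, and `greedy_pick` then selects index 1.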
Modules§
- attention
- Llama attention layer.
- config
- Typed Llama configuration.
- generation
- Token-level generation for LlamaForCausalLM. (#592)
- gguf_remap
- Translate GGUF tensor names into the HuggingFace transformers naming convention used by crate::LlamaForCausalLM::load_hf_state_dict.
- kv_cache
- Per-layer key/value cache for incremental Llama decoding (#1129).
- layer
- Single Llama decoder layer.
- mlp
- Llama feedforward block.
- model
- Top-level Llama model + causal-LM head.
- quant_loaders
- Weight unpackers for HF-quantized LLM checkpoints. (#593)
- spec_decode
- Speculative decoding: Leviathan et al. 2023 (arXiv:2211.17192).
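
As background for the spec_decode entry, the accept/reject rule from the cited paper can be sketched in a few lines. A token drafted with probability q under the draft model is accepted with probability min(1, p/q), where p is the target model's probability for the same token; on rejection, a replacement is drawn from the normalized positive part of p − q. The helper names below are hypothetical and say nothing about how `speculative_decode` is actually structured.

```rust
/// Probability of accepting a drafted token, given the target model's
/// probability `p_target` and the draft model's probability `q_draft`.
/// A token sampled from the draft always has q > 0; the guard treats
/// the degenerate q <= 0 case as a rejection (an assumption of this sketch).
fn acceptance_probability(p_target: f32, q_draft: f32) -> f32 {
    if q_draft <= 0.0 {
        return 0.0;
    }
    (p_target / q_draft).min(1.0)
}

/// Distribution to resample from after a rejection:
/// normalize(max(0, p - q)), computed per vocabulary entry.
fn residual_distribution(p: &[f32], q: &[f32]) -> Vec<f32> {
    let raw: Vec<f32> = p
        .iter()
        .zip(q)
        .map(|(&pt, &qd)| (pt - qd).max(0.0))
        .collect();
    let total: f32 = raw.iter().sum();
    if total == 0.0 {
        // p and q coincide, so a rejection cannot occur; fall back to p.
        return p.to_vec();
    }
    raw.iter().map(|r| r / total).collect()
}
```

With p = [0.5, 0.25, 0.25] and q = [0.25, 0.5, 0.25], token 0 is always accepted, token 1 is accepted half the time, and a rejection resamples token 0 with certainty.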