pub struct Llama32Generator { /* private fields */ }
Stateful LLaMA-3.2 generation handle.
Holds the (config, weight bytes, token history) and rebuilds a
prefill graph on each [step] call. Cheap to construct after
initial weight load; tokens stay in-memory between calls.
Implementations
impl Llama32Generator
pub fn from_loader(
    cfg: Llama32Config,
    loader: &mut dyn WeightLoader,
    device: Device,
) -> Result<Llama32Generator, Error>
Construct from any WeightLoader — drains it into an
internal cache so the loader is free after this call.
pub fn with_compile_seq_cap(self, cap: usize) -> Llama32Generator
Cap symbolic seq / past-seq in dynamic compile paths. Use for models
with very large max_position_embeddings (e.g. 128k) when the runner
only needs a short window.
pub fn from_loader_at(
    cfg: Llama32Config,
    loader: &mut dyn WeightLoader,
    device: Device,
    weights_path: &Path,
) -> Result<Llama32Generator, Error>
Like Self::from_loader but loads tier-1 profiles from
llama32.rlx.toml in the weights directory when present.
pub fn with_compile_profiles(
    self,
    prefill: CompileProfile,
    decode: CompileProfile,
) -> Llama32Generator
Override tier-1 compile profiles explicitly.
pub fn prefill_profile(&self) -> &CompileProfile
pub fn decode_profile(&self) -> &CompileProfile
pub fn with_prefill_cache(self, capacity: usize) -> Llama32Generator
Enable the prefill compile cache with the given LRU capacity. Useful when the same prompt length recurs across generation runs: the second and later runs skip the compile and param-attach round trip (roughly 30-50 ms per call on CPU).
pub fn with_dynamic_prefill_cache(self, capacity: usize) -> Llama32Generator
Compile prefill once with sym::SEQ, specialize per prompt length.
pub fn with_decode_cache(self, max_past: usize) -> Llama32Generator
Enable the bucketed decode compile cache spanning past-seq
values in [1, max_past]. Buckets are power-of-two
[1..2, 2..3, 3..5, 5..9, 9..17, …]. Each bucket compiles
one graph at its upper bound; a steady-state generation loop
across N tokens compiles O(log N) graphs instead of N.
Padding compute waste is bounded at 2×: actual past_seq is
at least half the bucket’s upper bound (except possibly the
smallest bucket).
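The bucketing arithmetic can be sketched as follows. This is an illustration only, not the crate's implementation: it assumes the listed ranges are half-open (Rust range style), so each bucket's inclusive upper bound is the next power of two at or above past_seq. The helper names bucket_upper_bound and graphs_compiled are hypothetical.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: inclusive upper bound of the power-of-two bucket
/// containing `past_seq`, assuming half-open buckets [1..2, 2..3, 3..5, 5..9, ...].
fn bucket_upper_bound(past_seq: usize) -> usize {
    assert!(past_seq >= 1, "past_seq must be at least 1");
    // 1 -> 1, 2 -> 2, 3..=4 -> 4, 5..=8 -> 8, 9..=16 -> 16, ...
    past_seq.next_power_of_two()
}

/// Counts the distinct buckets (one compiled graph each) touched while
/// past_seq grows from 1 to `max_past`.
fn graphs_compiled(max_past: usize) -> usize {
    let mut seen = BTreeSet::new();
    for past_seq in 1..=max_past {
        seen.insert(bucket_upper_bound(past_seq));
    }
    seen.len()
}
```

Under this reading, a loop that reaches past_seq = 1000 touches 11 buckets rather than 1000 distinct shapes, and the padded length never reaches twice the actual past_seq.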
pub fn with_dynamic_decode_cache(self, capacity: usize) -> Llama32Generator
Compile decode once with sym::PAST_SEQ, specialize per prefix length.
pub fn from_path(
    cfg: Llama32Config,
    path: &str,
    device: Device,
) -> Result<Llama32Generator, Error>
Convenience: load weights from a safetensors or GGUF path
(dispatch by extension; see rlx_core::weight_loader::load_from_path).
pub fn from_path_with_mtp(
    cfg: Llama32Config,
    path: &str,
    device: Device,
    include_mtp: bool,
) -> Result<Llama32Generator, Error>
Same as [from_path] but with MTP-head visibility control.
When include_mtp=true and the file is GGUF, MTP weights are
drained into the generator’s cache alongside the base
weights. The base inference path still ignores them — they
sit in cache for a future MTP-aware decoder. Non-GGUF formats
silently ignore the flag (safetensors files publish all
tensors uniformly; downstream code distinguishes by name).
pub fn prefill(&mut self, prompt_ids: &[u32])
Replace the token history with prompt_ids. Does not run the
model — the next [step] call processes the full sequence.
Clears any KV cache from a prior generation.
pub fn step(&mut self, opts: SampleOpts) -> Result<u32, Error>
Run one prefill over the current token history and sample the next token. The sampled token is appended to the history and returned. Call repeatedly to generate.
pub fn generate(
    &mut self,
    n: usize,
    opts: SampleOpts,
) -> Result<Vec<u32>, Error>
Run n steps and return the newly generated token ids
(excludes the prefill prompt).
pub fn step_cached(&mut self, opts: SampleOpts) -> Result<u32, Error>
Cached step: O(L) per token instead of O(L²). The first call seeds
the KV cache from the prompt via prefill-with-cache; subsequent
calls run the decode-mode graph on just the last token plus the cached
past. Output matches [step] up to floating-point differences from
reduction order in the SDPA kernel.
Invariant after each call: cache.past_seq == tokens.len() - 1
(the just-sampled token is appended but not yet in the cache;
it becomes the input for the next decode step).
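The bookkeeping behind this invariant can be modeled with a toy state machine. The sketch below is not the generator's code: ToyState and its fields are hypothetical stand-ins, and the sampled token is passed in by the caller instead of coming from a model.

```rust
/// Toy stand-in for the generator's bookkeeping (hypothetical, no model involved).
struct ToyState {
    tokens: Vec<u32>,
    past_seq: usize, // number of positions currently held in the KV cache
    seeded: bool,    // whether prefill-with-cache has run yet
}

impl ToyState {
    fn new(prompt: &[u32]) -> Self {
        ToyState { tokens: prompt.to_vec(), past_seq: 0, seeded: false }
    }

    /// Mimics one cached step; `sampled` stands in for the model's sampled token.
    fn step_cached(&mut self, sampled: u32) {
        if !self.seeded {
            // First call: prefill-with-cache covers the whole prompt.
            self.past_seq = self.tokens.len();
            self.seeded = true;
        } else {
            // Later calls: decode consumes only the last token.
            self.past_seq += 1;
        }
        // The new token is appended but not yet in the cache.
        self.tokens.push(sampled);
    }
}
```

After every call in this model, past_seq equals tokens.len() - 1, matching the invariant stated above.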
pub fn generate_cached(
    &mut self,
    n: usize,
    opts: SampleOpts,
) -> Result<Vec<u32>, Error>
Run n cached steps and return the newly generated tokens.
pub fn generate_cached_with(
    &mut self,
    n: usize,
    opts: SampleOpts,
    on_token: impl FnMut(u32),
) -> Result<Vec<u32>, Error>
Same as [generate_cached] but invokes on_token once per
freshly sampled id, inside the decode loop. The whole n-step
loop shares the bucketed compile cache, so callers wanting a
streaming UI should prefer this to calling
generate_cached(1, …) n times (which forces a fresh
compile per token at the bucket boundaries).
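The callback shape can be illustrated without the model. In the sketch below, generate_with is a hypothetical stand-in that takes the same impl FnMut(u32) parameter as generate_cached_with and emits placeholder ids; it shows only how a closure can capture mutable state (here a Vec) while the loop runs.

```rust
/// Hypothetical stand-in with the same callback shape as `generate_cached_with`.
/// It emits placeholder ids 0, 1, 2, ... instead of sampling from a model.
fn generate_with(n: usize, mut on_token: impl FnMut(u32)) -> Vec<u32> {
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        let id = i as u32; // placeholder for a freshly sampled token id
        on_token(id);      // called once per token, inside the loop
        out.push(id);
    }
    out
}

/// Collects streamed ids in a captured Vec, as a UI might append to a buffer.
fn stream_demo() -> (Vec<u32>, Vec<u32>) {
    let mut seen = Vec::new();
    let out = generate_with(3, |id| seen.push(id));
    (seen, out)
}
```

In a real caller the closure would typically detokenize the id and flush it to the display; the returned Vec still holds the full list once the loop finishes.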
pub fn config(&self) -> &Llama32Config
pub fn prefill_get_last_logits(
    &mut self,
    context: &[u32],
) -> Result<Vec<f32>, Error>
Low-level primitive: reset internal state, run prefill-with-cache
over context, and return the last position’s logits row
(P(next_token | context)). Does NOT sample or append. The
internal tokens buffer is set to context and the KV cache
is populated to past_seq = context.len().
First row of logits after prefill-with-cache (no sampling).
pub fn decode_get_logits(&mut self, input: u32) -> Result<Vec<f32>, Error>
Low-level primitive: run one decode step with the caller-
supplied input token (no sampling), advance the KV cache, and
return the resulting logits row P(next | history ++ input).
Appends input to the tokens buffer so the invariant
cache.past_seq == tokens.len() holds after this call (note:
differs from step_cached invariant because this method does
not append a sampled token).
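Because prefill_get_last_logits and decode_get_logits return a raw logits row and do no sampling, token selection is left to the caller. A minimal greedy selection over such a row might look like the sketch below; the argmax helper is hypothetical and not part of this API, and it assumes the vector index corresponds to the token id.

```rust
/// Hypothetical helper: index of the largest non-NaN logit, as a token id.
/// Returns None for an empty row or a row containing only NaNs.
fn argmax(logits: &[f32]) -> Option<u32> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, v)| !v.is_nan())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
}
```

A caller could pass the chosen id back into decode_get_logits to advance the cache by one position, which keeps sampling policy entirely outside the generator.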
Auto Trait Implementations
impl !RefUnwindSafe for Llama32Generator
impl !Sync for Llama32Generator
impl !UnwindSafe for Llama32Generator
impl Freeze for Llama32Generator
impl Send for Llama32Generator
impl Unpin for Llama32Generator
impl UnsafeUnpin for Llama32Generator
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<ST, DT> CastableFrom<ST, Initialized, Initialized> for DT
impl<ST, DT> CastableFrom<ST, Uninit, Uninit> for DT
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.