Enum ServeCommands 

pub enum ServeCommands {
    Plan {
        model: String,
        gpu: bool,
        batch_size: usize,
        seq_len: usize,
        format: String,
        quant: Option<String>,
    },
    Run {
        file: PathBuf,
        port: u16,
        host: String,
        no_cors: bool,
        no_metrics: bool,
        no_gpu: bool,
        gpu: bool,
        batch: bool,
        trace: bool,
        trace_level: String,
        profile: bool,
        backend: Option<String>,
        otlp_endpoint: Option<String>,
        context_length: usize,
        no_fp8_cache: bool,
    },
}

Inference server subcommands (plan/run).

apr serve plan computes a VRAM budget and throughput estimates and verifies contracts before a server is started. apr serve run launches the server.

Variants

Plan

Pre-flight inference capacity plan (VRAM budget, roofline, contracts)

Inspects model metadata, detects GPU hardware, and produces a capacity plan showing whether the model fits in VRAM with the requested batch size. No weights are loaded — header-only inspection.

Accepts local files (.gguf, .apr, .safetensors) or HuggingFace repo IDs (hf://org/repo or org/repo). For HF repos, only the ~2KB config.json is fetched — no weight download needed.
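The accepted input forms can be told apart with a few string checks. The sketch below is illustrative only: `ModelSource` and `classify_model_source` are hypothetical names, and the rule used for a bare org/repo (exactly one slash, no leading dot in either half) is an assumption, not apr's documented resolution logic.

```rust
use std::path::Path;

/// Hypothetical classification of the `model` argument (not apr's own type).
#[derive(Debug, PartialEq)]
enum ModelSource {
    /// Local file with one of the documented extensions.
    LocalFile(String),
    /// Hugging Face repo ID, normalized to `org/repo`.
    HfRepo(String),
    /// Anything this sketch does not recognize.
    Unknown(String),
}

fn classify_model_source(model: &str) -> ModelSource {
    // An explicit hf:// prefix always means a Hugging Face repo.
    if let Some(repo) = model.strip_prefix("hf://") {
        return ModelSource::HfRepo(repo.to_string());
    }
    // Documented local file extensions.
    let ext = Path::new(model).extension().and_then(|e| e.to_str());
    if matches!(ext, Some("gguf" | "apr" | "safetensors")) {
        return ModelSource::LocalFile(model.to_string());
    }
    // Assumed rule for a bare `org/repo`: exactly one slash, both halves
    // non-empty and not starting with '.', so "./x" is not taken for a repo.
    let parts: Vec<&str> = model.split('/').collect();
    if parts.len() == 2 && parts.iter().all(|p| !p.is_empty() && !p.starts_with('.')) {
        return ModelSource::HfRepo(model.to_string());
    }
    ModelSource::Unknown(model.to_string())
}

fn main() {
    for arg in ["hf://org/repo", "org/repo", "models/m.gguf", "notes.txt"] {
        println!("{arg} -> {:?}", classify_model_source(arg));
    }
}
```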

Fields

model: String

Model source: local path or HuggingFace repo (hf://org/repo, org/repo)

gpu: bool

Detect GPU via nvidia-smi for VRAM budget

batch_size: usize

Target batch size for throughput estimation
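How batch size feeds a roofline-style throughput estimate can be sketched as follows. This is a generic first-order model, not apr's actual planner: it assumes each decode step streams all weights once from memory, costs about 2 FLOPs per parameter per sequence, and ignores KV cache traffic. The `Roofline` struct and all figures are hypothetical.

```rust
/// Hypothetical hardware and model figures; none of these names come from apr.
struct Roofline {
    weight_bytes: f64,  // total size of the model weights in bytes
    n_params: f64,      // number of model parameters
    mem_bandwidth: f64, // memory bandwidth in bytes per second
    peak_flops: f64,    // peak compute in FLOP per second
}

/// Estimated decode throughput in tokens per second for a given batch size.
fn decode_tokens_per_sec(r: &Roofline, batch_size: usize) -> f64 {
    let b = batch_size as f64;
    // Time to stream the weights once; shared by every sequence in the batch.
    let t_mem = r.weight_bytes / r.mem_bandwidth;
    // Time to do ~2 FLOPs per parameter for each sequence in the batch.
    let t_compute = 2.0 * r.n_params * b / r.peak_flops;
    // A step takes as long as the slower of the two; it yields `b` tokens.
    b / t_mem.max(t_compute)
}

fn main() {
    let r = Roofline {
        weight_bytes: 4.0e9,
        n_params: 7.0e9,
        mem_bandwidth: 1.0e12,
        peak_flops: 1.0e14,
    };
    for batch in [1, 8, 1000] {
        println!("batch {batch}: {:.0} tok/s", decode_tokens_per_sec(&r, batch));
    }
}
```

Under this model, throughput grows linearly with batch size while the step is memory-bound and flattens at peak_flops / (2 * n_params) once it becomes compute-bound.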

seq_len: usize

Sequence length for KV cache estimation
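A common way to estimate KV cache size from sequence length is shown below. It uses the standard formula for transformer attention (one K and one V tensor per layer) and is a sketch under that assumption; apr's own estimate may include other terms. `KvShape` and its field names are hypothetical.

```rust
/// Hypothetical model-shape parameters; field names are illustrative.
struct KvShape {
    n_layers: u64,
    n_kv_heads: u64,
    head_dim: u64,
    bytes_per_elem: u64, // e.g. 2 for an F16 cache
}

/// KV cache size in bytes: K and V tensors for every layer, position, and sequence.
fn kv_cache_bytes(shape: &KvShape, seq_len: u64, batch_size: u64) -> u64 {
    2 * shape.n_layers
        * shape.n_kv_heads
        * shape.head_dim
        * seq_len
        * batch_size
        * shape.bytes_per_elem
}

fn main() {
    let shape = KvShape { n_layers: 32, n_kv_heads: 8, head_dim: 128, bytes_per_elem: 2 };
    let bytes = kv_cache_bytes(&shape, 4096, 1);
    println!("{} MiB", bytes / (1024 * 1024)); // prints "512 MiB"
}
```

The estimate scales linearly in both seq_len and batch_size, so halving either halves the cache.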

format: String

Output format: text, json, yaml

quant: Option<String>

Quantization override for HF models (e.g., Q4_K_M, Q6_K, F16)

Run

Start inference server (REST API, streaming, metrics)

Fields

file: PathBuf

Path to model file

port: u16

Port to listen on

host: String

Host to bind to
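The host and port values combine into a single socket address. The sketch below uses only the standard library and assumes the host is an IP literal; whether apr also resolves hostnames such as localhost is not stated here. `bind_addr` is a hypothetical helper name.

```rust
use std::net::{AddrParseError, IpAddr, SocketAddr};

/// Build a socket address from a host string and a port.
/// Assumption: `host` is an IPv4 or IPv6 literal, not a DNS name.
fn bind_addr(host: &str, port: u16) -> Result<SocketAddr, AddrParseError> {
    let ip: IpAddr = host.parse()?;
    Ok(SocketAddr::new(ip, port))
}

fn main() {
    match bind_addr("127.0.0.1", 8080) {
        Ok(addr) => println!("would listen on {addr}"),
        Err(e) => eprintln!("invalid host: {e}"),
    }
}
```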

no_cors: bool

Disable CORS

no_metrics: bool

Disable Prometheus metrics endpoint

no_gpu: bool

Disable GPU acceleration

gpu: bool

Force GPU acceleration (requires CUDA)

batch: bool

Enable batched GPU inference for 2X+ throughput

trace: bool

Enable inference tracing (PMAT-SHOWCASE-METHODOLOGY-001)

trace_level: String

Trace detail level (none, basic, layer)

profile: bool

Enable inline Roofline profiling (adds X-Profile headers)

backend: Option<String>

PMAT-332: Compute backend override (cuda, cpu, wgpu)
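Because the override is stored as a plain Option<String>, a caller would likely validate it before use. The sketch below shows one way to do that: the `Backend` enum and `parse_backend_override` are hypothetical, and case-insensitive matching is an assumption rather than documented behavior.

```rust
use std::str::FromStr;

/// Hypothetical enum mirroring the documented values (cuda, cpu, wgpu).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cuda,
    Cpu,
    Wgpu,
}

impl FromStr for Backend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Assumption: accept any letter case.
        match s.to_ascii_lowercase().as_str() {
            "cuda" => Ok(Backend::Cuda),
            "cpu" => Ok(Backend::Cpu),
            "wgpu" => Ok(Backend::Wgpu),
            other => Err(format!("unknown backend '{other}' (expected cuda, cpu, or wgpu)")),
        }
    }
}

/// `None` means no override was given; an unrecognized value is an error.
fn parse_backend_override(raw: Option<&str>) -> Result<Option<Backend>, String> {
    raw.map(str::parse::<Backend>).transpose()
}

fn main() {
    println!("{:?}", parse_backend_override(Some("wgpu")));
    println!("{:?}", parse_backend_override(None));
    println!("{:?}", parse_backend_override(Some("tpu")));
}
```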

otlp_endpoint: Option<String>

PMAT-485: OTLP endpoint for distributed tracing export (Jaeger/Tempo)

When set, inference spans (W3C Trace Context) are exported via OTLP. Each request becomes a parent span, and each layer becomes a child span with TensorStats. Example: --otlp-endpoint http://localhost:4317
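The parent/child relationship follows from the W3C Trace Context traceparent header, whose version-00 form is 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>. The sketch below parses that header and derives a child header that keeps the trace ID but carries a new span ID. `TraceParent` and both function names are hypothetical, and this is not how apr or any OTLP exporter is implemented.

```rust
/// Parsed W3C `traceparent` header (version 00 only).
#[derive(Debug, PartialEq)]
struct TraceParent {
    trace_id: u128,
    parent_id: u64,
    sampled: bool,
}

fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    // Fixed field widths: 32, 16, and 2 lowercase hex digits.
    if parts[1].len() != 32 || parts[2].len() != 16 || parts[3].len() != 2 {
        return None;
    }
    let is_lower_hex = |p: &&str| p.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if !parts[1..].iter().all(is_lower_hex) {
        return None;
    }
    let trace_id = u128::from_str_radix(parts[1], 16).ok()?;
    let parent_id = u64::from_str_radix(parts[2], 16).ok()?;
    let flags = u8::from_str_radix(parts[3], 16).ok()?;
    // All-zero IDs are invalid per the specification.
    if trace_id == 0 || parent_id == 0 {
        return None;
    }
    Some(TraceParent { trace_id, parent_id, sampled: flags & 0x01 == 1 })
}

/// Header for a child span: same trace ID and sampled flag, new span ID.
fn child_traceparent(parent: &TraceParent, child_span_id: u64) -> String {
    format!("00-{:032x}-{:016x}-{:02x}", parent.trace_id, child_span_id, parent.sampled as u8)
}

fn main() {
    let request = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
        .expect("valid traceparent");
    // A per-layer child span would reuse the request's trace ID.
    println!("{}", child_traceparent(&request, 0x1));
}
```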

context_length: usize

GH-286: Max context/sequence length for KV cache. Lower = less RSS.

no_fp8_cache: bool

GH-286: Skip FP8 weight cache warmup. Saves ~1.5 GB RSS.

Trait Implementations

impl Debug for ServeCommands

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl FromArgMatches for ServeCommands

fn from_arg_matches(__clap_arg_matches: &ArgMatches) -> Result<Self, Error>

Instantiate Self from ArgMatches, parsing the arguments as needed.

fn from_arg_matches_mut(__clap_arg_matches: &mut ArgMatches) -> Result<Self, Error>

Instantiate Self from ArgMatches, parsing the arguments as needed.

fn update_from_arg_matches(&mut self, __clap_arg_matches: &ArgMatches) -> Result<(), Error>

Assign values from ArgMatches to self.

fn update_from_arg_matches_mut<'b>(&mut self, __clap_arg_matches: &mut ArgMatches) -> Result<(), Error>

Assign values from ArgMatches to self.

impl Subcommand for ServeCommands

fn augment_subcommands<'b>(__clap_app: Command) -> Command

Append to Command so it can instantiate Self via FromArgMatches::from_arg_matches_mut.

fn augment_subcommands_for_update<'b>(__clap_app: Command) -> Command

Append to Command so it can instantiate self via FromArgMatches::update_from_arg_matches_mut.

fn has_subcommand(__clap_name: &str) -> bool

Test whether Self can parse a specific subcommand.
