pub struct EndpointCreateInput {
    pub template_id: String,
    pub allowed_cuda_versions: Option<Vec<CudaVersion>>,
    pub compute_type: Option<ComputeType>,
    pub cpu_flavor_ids: Option<Vec<CpuFlavorId>>,
    pub data_center_ids: Option<Vec<DataCenterId>>,
    pub execution_timeout_ms: Option<i32>,
    pub flashboot: Option<bool>,
    pub gpu_count: Option<i32>,
    pub gpu_type_ids: Option<Vec<GpuTypeId>>,
    pub idle_timeout: Option<i32>,
    pub name: Option<String>,
    pub network_volume_id: Option<String>,
    pub scaler_type: Option<ScalerType>,
    pub scaler_value: Option<i32>,
    pub vcpu_count: Option<i32>,
    pub workers_max: Option<i32>,
    pub workers_min: Option<i32>,
}
Input parameters for creating a new serverless endpoint.
This struct contains all the configuration options available when creating an endpoint, including compute specifications, scaling policies, and deployment preferences. Most fields are optional and will use RunPod defaults if not specified.
§Required Fields
Only template_id is required - all other configuration uses sensible defaults
that can be customized based on your specific workload requirements.
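For instance, a minimal sketch that sets only the required field and leans on Default for everything else (the larger examples below rely on the same Default impl; the template ID is a placeholder):
use runpod_sdk::model::EndpointCreateInput;

// Only template_id is set explicitly; every Option field falls back to its default.
let minimal = EndpointCreateInput {
    template_id: "my-template-id".to_string(), // placeholder template ID
    ..Default::default()
};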
§Examples
use runpod_sdk::model::{EndpointCreateInput, ScalerType};
use runpod_sdk::model::{ComputeType, CudaVersion, GpuTypeId};

// High-performance GPU endpoint for real-time AI inference
let inference_endpoint = EndpointCreateInput {
    template_id: "pytorch-inference-template".to_string(),
    name: Some("ai-inference-prod".to_string()),
    compute_type: Some(ComputeType::Gpu),
    gpu_count: Some(1),
    gpu_type_ids: Some(vec![GpuTypeId::NvidiaA100_80GbPcie]),
    allowed_cuda_versions: Some(vec![CudaVersion::V12_1]),
    scaler_type: Some(ScalerType::QueueDelay),
    scaler_value: Some(3),                // Scale if requests wait >3 seconds
    workers_min: Some(1),                 // Keep 1 worker always ready
    workers_max: Some(5),                 // Burst up to 5 workers
    flashboot: Some(true),                // Fast cold starts
    idle_timeout: Some(30),               // Scale down after 30s idle
    execution_timeout_ms: Some(300000),   // 5 minute timeout
    ..Default::default()
};

// Cost-optimized CPU endpoint for batch processing
let batch_endpoint = EndpointCreateInput {
    template_id: "batch-processor-template".to_string(),
    name: Some("data-batch-processor".to_string()),
    compute_type: Some(ComputeType::Cpu),
    vcpu_count: Some(8),
    scaler_type: Some(ScalerType::RequestCount),
    scaler_value: Some(10),               // 1 worker per 10 requests
    workers_min: Some(0),                 // No reserved capacity
    workers_max: Some(20),                // Allow large bursts
    flashboot: Some(false),               // Standard startup (cheaper)
    idle_timeout: Some(120),              // Longer idle time for batches
    execution_timeout_ms: Some(1800000),  // 30 minute timeout
    ..Default::default()
};
Fields§
§template_id: String
The unique string identifying the template used to create the endpoint.
Required field - specifies the container image, environment, and resource configuration that will be deployed across all workers.
Templates ensure consistent runtime environments and can be shared across multiple endpoints for standardized deployments.
§allowed_cuda_versions: Option<Vec<CudaVersion>>
If the endpoint is a GPU endpoint, acceptable CUDA versions for workers.
Constrains worker allocation to machines with compatible CUDA runtimes. Useful for ensuring compatibility with specific AI/ML framework versions that require particular CUDA versions.
Default: Any CUDA version is acceptable
GPU endpoints only: Ignored for CPU endpoints
Example: [CudaVersion::V12_1, CudaVersion::V11_8]
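As a sketch building on the variants above, pinning CUDA versions is a matter of populating this one field and letting the rest fall back to Default (template ID is a placeholder):
use runpod_sdk::model::{CudaVersion, EndpointCreateInput};

// Restrict workers to machines with CUDA 12.1 or 11.8 runtimes.
let input = EndpointCreateInput {
    template_id: "pytorch-inference-template".to_string(),
    allowed_cuda_versions: Some(vec![CudaVersion::V12_1, CudaVersion::V11_8]),
    ..Default::default()
};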
§compute_type: Option<ComputeType>
Set to GPU for GPU-accelerated workers, CPU for CPU-only workers.
Determines the type of compute resources allocated to workers:
- GPU: Workers get GPU acceleration for AI/ML workloads
- CPU: Workers get high-performance CPUs for general compute
Default: GPU
Impact: Affects available hardware types, pricing, and performance
§cpu_flavor_ids: Option<Vec<CpuFlavorId>>
If creating a CPU endpoint, list of CPU flavors for workers.
Specifies the CPU configurations that can be used for workers. The order determines rental priority - preferred flavors first.
CPU endpoints only: Ignored for GPU endpoints
Default: All available CPU flavors
Available flavors: cpu3c, cpu3g, cpu5c, cpu5g
§data_center_ids: Option<Vec<DataCenterId>>
List of data center IDs where workers can be located.
Workers are distributed across these data centers for availability, performance, and proximity to users. The system automatically selects the best available data center for each worker.
Default: All available data centers globally
Strategy: Choose data centers close to your users and data sources
Common choices:
- Global: ["US-CA-1", "EU-RO-1", "AP-JP-1"]
- Regional: ["US-TX-1", "US-CA-2"] for US-only
- Single DC: ["EU-RO-1"] for data residency requirements
§execution_timeout_ms: Option<i32>
Maximum execution time in milliseconds for individual requests.
Requests exceeding this timeout are terminated and marked as failed. Prevents runaway processes and ensures predictable resource usage.
Default: 600,000ms (10 minutes)
Range: 1,000ms to 3,600,000ms (1 second to 1 hour)
Guidelines:
- Web APIs: 30,000ms (30 seconds)
- AI inference: 300,000ms (5 minutes)
- Image processing: 600,000ms (10 minutes)
- Batch jobs: 3,600,000ms (1 hour)
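Because the field is expressed in milliseconds, one way to avoid unit mistakes is to derive the value from std::time::Duration; a minimal sketch (template ID is a placeholder):
use std::time::Duration;
use runpod_sdk::model::EndpointCreateInput;

// Express the timeout as a Duration, then convert to the i32 millisecond value the field expects.
let timeout = Duration::from_secs(5 * 60); // 5 minutes for AI inference
let input = EndpointCreateInput {
    template_id: "pytorch-inference-template".to_string(),
    execution_timeout_ms: Some(timeout.as_millis() as i32),
    ..Default::default()
};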
§flashboot: Option<bool>
Whether to enable flash boot for faster worker startup.
Flash boot dramatically reduces cold start time by using pre-warmed container images with cached dependencies and optimized initialization.
Default: false
Trade-off: Higher per-request cost for much faster startup
Best for: Interactive applications, real-time inference, low-latency requirements
Startup time: ~5-10 seconds with flash boot vs 30-60 seconds without
§gpu_count: Option<i32>
If creating a GPU endpoint, number of GPUs per worker.
Determines GPU resources allocated to each worker for parallel processing. More GPUs enable larger models and higher throughput but increase costs.
Default: 1
GPU endpoints only: Ignored for CPU endpoints
Range: 1-8 depending on GPU type availability
Use cases:
- Single GPU: Most inference workloads, small models
- Multi-GPU: Large language models, distributed training, high-throughput inference
§gpu_type_ids: Option<Vec<GpuTypeId>>
If creating a GPU endpoint, list of GPU types for workers.
Specifies GPU hardware that can be used for workers. The order determines rental priority - the system tries preferred types first.
GPU endpoints only: Ignored for CPU endpoints
Default: All available GPU types
Performance tiers:
- High-end: "NVIDIA H100 80GB HBM3", "NVIDIA A100 80GB PCIe"
- Mid-range: "NVIDIA RTX A6000", "NVIDIA A40"
- Budget: "NVIDIA RTX 4090", "NVIDIA RTX 3090"
§idle_timeout: Option<i32>
Number of seconds workers can be idle before scaling down.
Workers that haven’t processed requests for this duration are automatically terminated to reduce costs. Balance between cost optimization and cold start latency.
Default: 5 seconds
Range: 1-3600 seconds (1 second to 1 hour)
Strategy:
- Aggressive (cost-focused): 1-5 seconds
- Balanced: 5-15 seconds
- Responsive (latency-focused): 30-60 seconds
§name: Option<String>
A user-defined name for the endpoint.
Used for organization and identification in dashboards, monitoring, and API responses. The name does not need to be unique across your account.
Default: Auto-generated based on template name
Max length: 191 characters
Best practices: Use descriptive names like “prod-image-classifier” or “staging-api-v2”
§network_volume_id: Option<String>
The unique ID of a network volume to attach to workers.
Network volumes provide persistent, shared storage across all workers, useful for model weights, datasets, cached data, and other shared assets.
Default: No network volume attached
Requirements: Volume must exist in the same data centers as workers
Use cases: Model storage, dataset access, shared caching, persistent logs
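A minimal sketch of attaching shared storage; the volume ID below is a made-up placeholder, not a real volume:
use runpod_sdk::model::EndpointCreateInput;

// "nv_abc123" is a hypothetical network volume ID; substitute the ID of a volume you created.
let input = EndpointCreateInput {
    template_id: "pytorch-inference-template".to_string(),
    network_volume_id: Some("nv_abc123".to_string()),
    ..Default::default()
};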
§scaler_type: Option<ScalerType>
The scaling strategy for managing worker count.
Determines how the system automatically scales workers up/down based on request load and queue depth.
Default: QueueDelay
Strategies:
- QueueDelay: Scale based on request wait time (latency-optimized)
- RequestCount: Scale based on queue depth (throughput-optimized)
§scaler_value: Option<i32>
The scaling sensitivity parameter.
Meaning depends on the scaler_type:
For QueueDelay: Maximum seconds requests can wait before scaling up
- Lower values = more responsive scaling, higher costs
- Higher values = slower scaling, lower costs
For RequestCount: Target requests per worker
queue_size / scaler_value = target_worker_count
- Lower values = more workers, lower latency
- Higher values = fewer workers, higher latency
Default: 4
Range: 1-3600
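To make the RequestCount relationship concrete, here is an illustrative helper that mirrors the documented queue_size / scaler_value formula (rounding up so a partial bucket of requests still gets a worker) and the workers_min/workers_max bounds; it is a sketch of the arithmetic, not how RunPod's scheduler is actually implemented:
// Illustrative only: target worker count for ScalerType::RequestCount,
// clamped to the configured workers_min..=workers_max range.
fn target_workers(queue_size: i32, scaler_value: i32, workers_min: i32, workers_max: i32) -> i32 {
    // Ceiling division: a partially full "bucket" of requests still gets a worker.
    let wanted = (queue_size + scaler_value - 1) / scaler_value;
    wanted.clamp(workers_min, workers_max)
}

// 25 queued requests with scaler_value = 10 asks for 3 workers, well under workers_max = 20.
assert_eq!(target_workers(25, 10, 0, 20), 3);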
§vcpu_count: Option<i32>
If creating a CPU endpoint, number of vCPUs per worker.
Determines CPU resources allocated to each worker. More vCPUs enable higher parallelism and throughput for CPU-intensive workloads.
Default: 2 vCPUs
CPU endpoints only: Ignored for GPU endpoints
Range: 1-32 vCPUs depending on CPU flavor
Guidelines:
- Light workloads: 1-2 vCPUs
- Web APIs: 2-4 vCPUs
- Data processing: 4-16 vCPUs
- Heavy computation: 16+ vCPUs
§workers_max: Option<i32>
Maximum number of workers that can run simultaneously.
Hard limit preventing runaway scaling and controlling maximum costs. Set based on expected peak load, budget constraints, and infrastructure limits.
Default: No limit (subject to account quotas)
Range: 0-1000+ depending on account limits
Strategy: Set to 2-3x expected peak load for a safety margin
§workers_min: Option<i32>
Minimum number of workers that always remain running.
Reserved capacity providing immediate availability even during idle periods. These workers are billed at a reduced rate but ensure zero cold start latency for the first few requests.
Default: 0 (no reserved capacity)
Range: 0-100 depending on account limits
Trade-offs:
- 0: Maximum cost efficiency, but cold starts for first requests
- 1+: Immediate availability, continuous billing for reserved workers
Strategy: Set to 1 for production endpoints requiring <1s response time
Trait Implementations§
impl Clone for EndpointCreateInput
fn clone(&self) -> EndpointCreateInput
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.