Struct EndpointCreateInput

pub struct EndpointCreateInput {
    pub template_id: String,
    pub allowed_cuda_versions: Option<Vec<CudaVersion>>,
    pub compute_type: Option<ComputeType>,
    pub cpu_flavor_ids: Option<Vec<CpuFlavorId>>,
    pub data_center_ids: Option<Vec<DataCenterId>>,
    pub execution_timeout_ms: Option<i32>,
    pub flashboot: Option<bool>,
    pub gpu_count: Option<i32>,
    pub gpu_type_ids: Option<Vec<GpuTypeId>>,
    pub idle_timeout: Option<i32>,
    pub name: Option<String>,
    pub network_volume_id: Option<String>,
    pub scaler_type: Option<ScalerType>,
    pub scaler_value: Option<i32>,
    pub vcpu_count: Option<i32>,
    pub workers_max: Option<i32>,
    pub workers_min: Option<i32>,
}

Input parameters for creating a new serverless endpoint.

This struct contains all the configuration options available when creating an endpoint, including compute specifications, scaling policies, and deployment preferences. Most fields are optional and will use RunPod defaults if not specified.

§Required Fields

Only template_id is required - all other configuration uses sensible defaults that can be customized based on your specific workload requirements.

§Examples

use runpod_sdk::model::{EndpointCreateInput, ScalerType};
use runpod_sdk::model::{ComputeType, CudaVersion, GpuTypeId};

// High-performance GPU endpoint for real-time AI inference
let inference_endpoint = EndpointCreateInput {
    template_id: "pytorch-inference-template".to_string(),
    name: Some("ai-inference-prod".to_string()),
    compute_type: Some(ComputeType::Gpu),
    gpu_count: Some(1),
    gpu_type_ids: Some(vec![GpuTypeId::NvidiaA100_80GbPcie]),
    allowed_cuda_versions: Some(vec![CudaVersion::V12_1]),
    scaler_type: Some(ScalerType::QueueDelay),
    scaler_value: Some(3), // Scale if requests wait >3 seconds
    workers_min: Some(1),  // Keep 1 worker always ready
    workers_max: Some(5),  // Burst up to 5 workers
    flashboot: Some(true), // Fast cold starts
    idle_timeout: Some(30), // Scale down after 30s idle
    execution_timeout_ms: Some(300000), // 5 minute timeout
    ..Default::default()
};

// Cost-optimized CPU endpoint for batch processing
let batch_endpoint = EndpointCreateInput {
    template_id: "batch-processor-template".to_string(),
    name: Some("data-batch-processor".to_string()),
    compute_type: Some(ComputeType::Cpu),
    vcpu_count: Some(8),
    scaler_type: Some(ScalerType::RequestCount),
    scaler_value: Some(10), // 1 worker per 10 requests
    workers_min: Some(0),   // No reserved capacity
    workers_max: Some(20),  // Allow large bursts
    flashboot: Some(false), // Standard startup (cheaper)
    idle_timeout: Some(120), // Longer idle time for batches
    execution_timeout_ms: Some(1800000), // 30 minute timeout
    ..Default::default()
};
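
// Minimal endpoint relying entirely on RunPod defaults; only the required
// template_id is set. The template ID below is a placeholder, not a real one.
let minimal_endpoint = EndpointCreateInput {
    template_id: "my-template".to_string(),
    ..Default::default()
};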

Fields§

§template_id: String

The unique string identifying the template used to create the endpoint.

Required field - specifies the container image, environment, and resource configuration that will be deployed across all workers.

Templates ensure consistent runtime environments and can be shared across multiple endpoints for standardized deployments.

§allowed_cuda_versions: Option<Vec<CudaVersion>>

If the endpoint is a GPU endpoint, acceptable CUDA versions for workers.

Constrains worker allocation to machines with compatible CUDA runtimes. Useful for ensuring compatibility with specific AI/ML framework versions that require particular CUDA versions.

Default: Any CUDA version is acceptable
GPU endpoints only: Ignored for CPU endpoints

Example: [CudaVersion::V12_1, CudaVersion::V11_8]

§compute_type: Option<ComputeType>

Set to GPU for GPU-accelerated workers, CPU for CPU-only workers.

Determines the type of compute resources allocated to workers:

  • GPU: Workers get GPU acceleration for AI/ML workloads
  • CPU: Workers get high-performance CPUs for general compute

Default: GPU
Impact: Affects available hardware types, pricing, and performance

§cpu_flavor_ids: Option<Vec<CpuFlavorId>>

If creating a CPU endpoint, list of CPU flavors for workers.

Specifies the CPU configurations that can be used for workers. The order determines rental priority - preferred flavors first.

CPU endpoints only: Ignored for GPU endpoints
Default: All available CPU flavors

Available flavors: cpu3c, cpu3g, cpu5c, cpu5g

§data_center_ids: Option<Vec<DataCenterId>>

List of data center IDs where workers can be located.

Workers are distributed across these data centers for availability, performance, and proximity to users. The system automatically selects the best available data center for each worker.

Default: All available data centers globally
Strategy: Choose data centers close to your users and data sources

Common choices:

  • Global: ["US-CA-1", "EU-RO-1", "AP-JP-1"]
  • Regional: ["US-TX-1", "US-CA-2"] for US-only
  • Single DC: ["EU-RO-1"] for data residency requirements

§execution_timeout_ms: Option<i32>

Maximum execution time in milliseconds for individual requests.

Requests exceeding this timeout are terminated and marked as failed. Prevents runaway processes and ensures predictable resource usage.

Default: 600,000ms (10 minutes)
Range: 1,000ms to 3,600,000ms (1 second to 1 hour)

Guidelines:

  • Web APIs: 30,000ms (30 seconds)
  • AI inference: 300,000ms (5 minutes)
  • Image processing: 600,000ms (10 minutes)
  • Batch jobs: 3,600,000ms (1 hour)
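
Since the field takes raw milliseconds, a small helper such as the hypothetical one below (not part of the SDK) can make configurations easier to read:

// Hypothetical convenience helper: convert minutes into the millisecond
// value expected by execution_timeout_ms.
fn minutes_to_ms(minutes: i32) -> i32 {
    minutes * 60 * 1000
}

// e.g. execution_timeout_ms: Some(minutes_to_ms(30)) for a 30-minute batch job
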
§flashboot: Option<bool>

Whether to enable flash boot for faster worker startup.

Flash boot dramatically reduces cold start time by using pre-warmed container images with cached dependencies and optimized initialization.

Default: false
Trade-off: Higher per-request cost for much faster startup
Best for: Interactive applications, real-time inference, low-latency requirements
Startup time: ~5-10 seconds with flash boot vs 30-60 seconds without

§gpu_count: Option<i32>

If creating a GPU endpoint, number of GPUs per worker.

Determines GPU resources allocated to each worker for parallel processing. More GPUs enable larger models and higher throughput but increase costs.

Default: 1
GPU endpoints only: Ignored for CPU endpoints
Range: 1-8 depending on GPU type availability

Use cases:

  • Single GPU: Most inference workloads, small models
  • Multi-GPU: Large language models, distributed training, high-throughput inference

§gpu_type_ids: Option<Vec<GpuTypeId>>

If creating a GPU endpoint, list of GPU types for workers.

Specifies GPU hardware that can be used for workers. The order determines rental priority - the system tries preferred types first.

GPU endpoints only: Ignored for CPU endpoints
Default: All available GPU types

Performance tiers:

  • High-end: "NVIDIA H100 80GB HBM3", "NVIDIA A100 80GB PCIe"
  • Mid-range: "NVIDIA RTX A6000", "NVIDIA A40"
  • Budget: "NVIDIA RTX 4090", "NVIDIA RTX 3090"

§idle_timeout: Option<i32>

Number of seconds workers can be idle before scaling down.

Workers that haven’t processed requests for this duration are automatically terminated to reduce costs. This setting balances cost optimization against cold start latency.

Default: 5 seconds
Range: 1-3600 seconds (1 second to 1 hour)

Strategy:

  • Aggressive (cost-focused): 1-5 seconds
  • Balanced: 5-15 seconds
  • Responsive (latency-focused): 30-60 seconds

§name: Option<String>

A user-defined name for the endpoint.

Used for organization and identification in dashboards, monitoring, and API responses. The name does not need to be unique across your account.

Default: Auto-generated based on template name
Max length: 191 characters
Best practices: Use descriptive names like “prod-image-classifier” or “staging-api-v2”

§network_volume_id: Option<String>

The unique ID of a network volume to attach to workers.

Network volumes provide persistent, shared storage across all workers, useful for model weights, datasets, cached data, and other shared assets.

Default: No network volume attached
Requirements: Volume must exist in the same data centers as workers
Use cases: Model storage, dataset access, shared caching, persistent logs

§scaler_type: Option<ScalerType>

The scaling strategy for managing worker count.

Determines how the system automatically scales workers up/down based on request load and queue depth.

Default: QueueDelay

Strategies:

  • QueueDelay: Scale based on request wait time (latency-optimized)
  • RequestCount: Scale based on queue depth (throughput-optimized)

§scaler_value: Option<i32>

The scaling sensitivity parameter.

Meaning depends on the scaler_type:

For QueueDelay: Maximum seconds requests can wait before scaling up

  • Lower values = more responsive scaling, higher costs
  • Higher values = slower scaling, lower costs

For RequestCount: Target requests per worker

  • queue_size / scaler_value = target_worker_count
  • Lower values = more workers, lower latency
  • Higher values = fewer workers, higher latency

Default: 4
Range: 1-3600
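
As a rough sketch of the RequestCount arithmetic (with made-up numbers), the target worker count is the queue depth divided by scaler_value, bounded by workers_min and workers_max:

// Illustrative only, not SDK code: approximate RequestCount scaling math
// using hypothetical values for queue depth and configuration.
let queue_depth: i32 = 50;
let scaler_value: i32 = 10;
let (workers_min, workers_max) = (1, 20);
let target = (queue_depth / scaler_value).clamp(workers_min, workers_max);
assert_eq!(target, 5); // 50 queued requests / 10 per worker => 5 workers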

§vcpu_count: Option<i32>

If creating a CPU endpoint, number of vCPUs per worker.

Determines CPU resources allocated to each worker. More vCPUs enable higher parallelism and throughput for CPU-intensive workloads.

Default: 2 vCPUs
CPU endpoints only: Ignored for GPU endpoints
Range: 1-32 vCPUs depending on CPU flavor

Guidelines:

  • Light workloads: 1-2 vCPUs
  • Web APIs: 2-4 vCPUs
  • Data processing: 4-16 vCPUs
  • Heavy computation: 16+ vCPUs

§workers_max: Option<i32>

Maximum number of workers that can run simultaneously.

Hard limit preventing runaway scaling and controlling maximum costs. Set based on expected peak load, budget constraints, and infrastructure limits.

Default: No limit (subject to account quotas)
Range: 0-1000+ depending on account limits

Strategy: Set 2-3x expected peak load for safety margin

§workers_min: Option<i32>

Minimum number of workers that always remain running.

Reserved capacity providing immediate availability even during idle periods. These workers are billed at a reduced rate but ensure zero cold start latency for the first few requests.

Default: 0 (no reserved capacity)
Range: 0-100 depending on account limits

Trade-offs:

  • 0: Maximum cost efficiency, but cold starts for first requests
  • 1+: Immediate availability, continuous billing for reserved workers

Strategy: Set to 1 for production endpoints requiring <1s response time

Trait Implementations§

impl Clone for EndpointCreateInput

fn clone(&self) -> EndpointCreateInput

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for EndpointCreateInput

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for EndpointCreateInput

fn default() -> EndpointCreateInput

Returns the “default value” for a type.

impl<'de> Deserialize<'de> for EndpointCreateInput

fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where __D: Deserializer<'de>,

Deserialize this value from the given Serde deserializer.

impl Serialize for EndpointCreateInput

fn serialize<__S>(&self, __serializer: __S) -> Result<__S::Ok, __S::Error>
where __S: Serializer,

Serialize this value into the given Serde serializer.
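
Because the struct implements Serialize and Deserialize, it can be round-tripped through any serde-compatible format. A minimal sketch using serde_json (a separate dependency, not part of runpod_sdk; the exact JSON field names and the handling of None fields depend on serde attributes not shown on this page):

use runpod_sdk::model::EndpointCreateInput;

fn main() -> Result<(), serde_json::Error> {
    let input = EndpointCreateInput {
        template_id: "my-template".to_string(), // placeholder template ID
        ..Default::default()
    };

    // Serialize to JSON and parse it back, relying on the derived impls above.
    let json = serde_json::to_string_pretty(&input)?;
    let parsed: EndpointCreateInput = serde_json::from_str(&json)?;
    assert_eq!(parsed.template_id, input.template_id);
    Ok(())
}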
