pub struct EndpointCreateInput {
    pub template_id: String,
    pub allowed_cuda_versions: Option<Vec<CudaVersion>>,
    pub compute_type: Option<ComputeType>,
    pub cpu_flavor_ids: Option<Vec<CpuFlavorId>>,
    pub data_center_ids: Option<Vec<DataCenterId>>,
    pub execution_timeout_ms: Option<i32>,
    pub flashboot: Option<bool>,
    pub gpu_count: Option<i32>,
    pub gpu_type_ids: Option<Vec<GpuTypeId>>,
    pub idle_timeout: Option<i32>,
    pub name: Option<String>,
    pub network_volume_id: Option<String>,
    pub scaler_type: Option<ScalerType>,
    pub scaler_value: Option<i32>,
    pub vcpu_count: Option<i32>,
    pub workers_max: Option<i32>,
    pub workers_min: Option<i32>,
}
Input parameters for creating a new serverless endpoint.
This struct contains all the configuration options available when creating an endpoint, including compute specifications, scaling policies, and deployment preferences. Most fields are optional and will use RunPod defaults if not specified.
§Required Fields
Only template_id is required - all other configuration uses sensible defaults
that can be customized based on your specific workload requirements.
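For instance, a minimal sketch that sets only the required field and leans on Default for everything else (the larger examples below rely on the same Default impl; the template ID is a placeholder):
use runpod_sdk::model::EndpointCreateInput;

// Only template_id is set explicitly; every Option field falls back to its default.
let minimal = EndpointCreateInput {
    template_id: "my-template-id".to_string(), // placeholder template ID
    ..Default::default()
};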
§Examples
use runpod_sdk::model::{EndpointCreateInput, ScalerType};
use runpod_sdk::model::{ComputeType, CudaVersion, GpuTypeId};

// High-performance GPU endpoint for real-time AI inference
let inference_endpoint = EndpointCreateInput {
    template_id: "pytorch-inference-template".to_string(),
    name: Some("ai-inference-prod".to_string()),
    compute_type: Some(ComputeType::Gpu),
    gpu_count: Some(1),
    gpu_type_ids: Some(vec![GpuTypeId::NvidiaA100_80GbPcie]),
    allowed_cuda_versions: Some(vec![CudaVersion::V12_1]),
    scaler_type: Some(ScalerType::QueueDelay),
    scaler_value: Some(3),                // Scale if requests wait >3 seconds
    workers_min: Some(1),                 // Keep 1 worker always ready
    workers_max: Some(5),                 // Burst up to 5 workers
    flashboot: Some(true),                // Fast cold starts
    idle_timeout: Some(30),               // Scale down after 30s idle
    execution_timeout_ms: Some(300000),   // 5 minute timeout
    ..Default::default()
};

// Cost-optimized CPU endpoint for batch processing
let batch_endpoint = EndpointCreateInput {
    template_id: "batch-processor-template".to_string(),
    name: Some("data-batch-processor".to_string()),
    compute_type: Some(ComputeType::Cpu),
    vcpu_count: Some(8),
    scaler_type: Some(ScalerType::RequestCount),
    scaler_value: Some(10),               // 1 worker per 10 requests
    workers_min: Some(0),                 // No reserved capacity
    workers_max: Some(20),                // Allow large bursts
    flashboot: Some(false),               // Standard startup (cheaper)
    idle_timeout: Some(120),              // Longer idle time for batches
    execution_timeout_ms: Some(1800000),  // 30 minute timeout
    ..Default::default()
};
Fields§
§template_id: String
The unique string identifying the template used to create the endpoint.
Required field - specifies the container image, environment, and resource configuration that will be deployed across all workers.
Templates ensure consistent runtime environments and can be shared across multiple endpoints for standardized deployments.
§allowed_cuda_versions: Option<Vec<CudaVersion>>
If the endpoint is a GPU endpoint, acceptable CUDA versions for workers.
Constrains worker allocation to machines with compatible CUDA runtimes. Useful for ensuring compatibility with specific AI/ML framework versions that require particular CUDA versions.
Default: Any CUDA version is acceptable
GPU endpoints only: Ignored for CPU endpoints
Example: [CudaVersion::V12_1, CudaVersion::V11_8]
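As a sketch building on the variants above, pinning CUDA versions is a matter of populating this one field and letting the rest fall back to Default (template ID is a placeholder):
use runpod_sdk::model::{CudaVersion, EndpointCreateInput};

// Restrict workers to machines with CUDA 12.1 or 11.8 runtimes.
let input = EndpointCreateInput {
    template_id: "pytorch-inference-template".to_string(),
    allowed_cuda_versions: Some(vec![CudaVersion::V12_1, CudaVersion::V11_8]),
    ..Default::default()
};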
§compute_type: Option<ComputeType>
Set to GPU for GPU-accelerated workers, CPU for CPU-only workers.
Determines the type of compute resources allocated to workers:
- GPU: Workers get GPU acceleration for AI/ML workloads
- CPU: Workers get high-performance CPUs for general compute
Default: GPU
Impact: Affects available hardware types, pricing, and performance
§cpu_flavor_ids: Option<Vec<CpuFlavorId>>
If creating a CPU endpoint, list of CPU flavors for workers.
Specifies the CPU configurations that can be used for workers. The order determines rental priority - preferred flavors first.
CPU endpoints only: Ignored for GPU endpoints
Default: All available CPU flavors
Available flavors: cpu3c, cpu3g, cpu5c, cpu5g
§data_center_ids: Option<Vec<DataCenterId>>
List of data center IDs where workers can be located.
Workers are distributed across these data centers for availability, performance, and proximity to users. The system automatically selects the best available data center for each worker.
Default: All available data centers globally
Strategy: Choose data centers close to your users and data sources
Common choices:
- Global: ["US-CA-1", "EU-RO-1", "AP-JP-1"]
- Regional: ["US-TX-1", "US-CA-2"] for US-only
- Single DC: ["EU-RO-1"] for data residency requirements
§execution_timeout_ms: Option<i32>
Maximum execution time in milliseconds for individual requests.
Requests exceeding this timeout are terminated and marked as failed. Prevents runaway processes and ensures predictable resource usage.
Default: 600,000ms (10 minutes)
Range: 1,000ms to 3,600,000ms (1 second to 1 hour)
Guidelines:
- Web APIs: 30,000ms (30 seconds)
- AI inference: 300,000ms (5 minutes)
- Image processing: 600,000ms (10 minutes)
- Batch jobs: 3,600,000ms (1 hour)
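Because the field is expressed in milliseconds, one way to avoid unit mistakes is to derive the value from std::time::Duration; a minimal sketch (template ID is a placeholder):
use std::time::Duration;
use runpod_sdk::model::EndpointCreateInput;

// Express the timeout as a Duration, then convert to the i32 millisecond value the field expects.
let timeout = Duration::from_secs(5 * 60); // 5 minutes for AI inference
let input = EndpointCreateInput {
    template_id: "pytorch-inference-template".to_string(),
    execution_timeout_ms: Some(timeout.as_millis() as i32),
    ..Default::default()
};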
§flashboot: Option<bool>
Whether to enable flash boot for faster worker startup.
Flash boot dramatically reduces cold start time by using pre-warmed container images with cached dependencies and optimized initialization.
Default: false
Trade-off: Higher per-request cost for much faster startup
Best for: Interactive applications, real-time inference, low-latency requirements
Startup time: ~5-10 seconds with flash boot vs 30-60 seconds without
§gpu_count: Option<i32>
If creating a GPU endpoint, number of GPUs per worker.
Determines GPU resources allocated to each worker for parallel processing. More GPUs enable larger models and higher throughput but increase costs.
Default: 1
GPU endpoints only: Ignored for CPU endpoints
Range: 1-8 depending on GPU type availability
Use cases:
- Single GPU: Most inference workloads, small models
- Multi-GPU: Large language models, distributed training, high-throughput inference
§gpu_type_ids: Option<Vec<GpuTypeId>>
If creating a GPU endpoint, list of GPU types for workers.
Specifies GPU hardware that can be used for workers. The order determines rental priority - the system tries preferred types first.
GPU endpoints only: Ignored for CPU endpoints
Default: All available GPU types
Performance tiers:
- High-end: "NVIDIA H100 80GB HBM3", "NVIDIA A100 80GB PCIe"
- Mid-range: "NVIDIA RTX A6000", "NVIDIA A40"
- Budget: "NVIDIA RTX 4090", "NVIDIA RTX 3090"
§idle_timeout: Option<i32>
Number of seconds workers can be idle before scaling down.
Workers that haven’t processed requests for this duration are automatically terminated to reduce costs. Balance between cost optimization and cold start latency.
Default: 5 seconds
Range: 1-3600 seconds (1 second to 1 hour)
Strategy:
- Aggressive (cost-focused): 1-5 seconds
- Balanced: 5-15 seconds
- Responsive (latency-focused): 30-60 seconds
§name: Option<String>
A user-defined name for the endpoint.
Used for organization and identification in dashboards, monitoring, and API responses. The name does not need to be unique across your account.
Default: Auto-generated based on template name
Max length: 191 characters
Best practices: Use descriptive names like “prod-image-classifier” or “staging-api-v2”
§network_volume_id: Option<String>
The unique ID of a network volume to attach to workers.
Network volumes provide persistent, shared storage across all workers, useful for model weights, datasets, cached data, and other shared assets.
Default: No network volume attached
Requirements: Volume must exist in the same data centers as workers
Use cases: Model storage, dataset access, shared caching, persistent logs
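A minimal sketch of attaching shared storage; the volume ID below is a made-up placeholder, not a real volume:
use runpod_sdk::model::EndpointCreateInput;

// "nv_abc123" is a hypothetical network volume ID; substitute the ID of a volume you created.
let input = EndpointCreateInput {
    template_id: "pytorch-inference-template".to_string(),
    network_volume_id: Some("nv_abc123".to_string()),
    ..Default::default()
};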
§scaler_type: Option<ScalerType>
The scaling strategy for managing worker count.
Determines how the system automatically scales workers up/down based on request load and queue depth.
Default: QueueDelay
Strategies:
- QueueDelay: Scale based on request wait time (latency-optimized)
- RequestCount: Scale based on queue depth (throughput-optimized)
§scaler_value: Option<i32>
The scaling sensitivity parameter.
Meaning depends on the scaler_type:
For QueueDelay: Maximum seconds requests can wait before scaling up
- Lower values = more responsive scaling, higher costs
- Higher values = slower scaling, lower costs
For RequestCount: Target requests per worker
queue_size / scaler_value = target_worker_count
- Lower values = more workers, lower latency
- Higher values = fewer workers, higher latency
Default: 4
Range: 1-3600
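To make the RequestCount relationship concrete, here is an illustrative helper that mirrors the documented queue_size / scaler_value formula (rounding up so a partial bucket of requests still gets a worker) and the workers_min/workers_max bounds; it is a sketch of the arithmetic, not how RunPod's scheduler is actually implemented:
// Illustrative only: target worker count for ScalerType::RequestCount,
// clamped to the configured workers_min..=workers_max range.
fn target_workers(queue_size: i32, scaler_value: i32, workers_min: i32, workers_max: i32) -> i32 {
    // Ceiling division: a partially full "bucket" of requests still gets a worker.
    let wanted = (queue_size + scaler_value - 1) / scaler_value;
    wanted.clamp(workers_min, workers_max)
}

// 25 queued requests with scaler_value = 10 asks for 3 workers, well under workers_max = 20.
assert_eq!(target_workers(25, 10, 0, 20), 3);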
§vcpu_count: Option<i32>
If creating a CPU endpoint, number of vCPUs per worker.
Determines CPU resources allocated to each worker. More vCPUs enable higher parallelism and throughput for CPU-intensive workloads.
Default: 2 vCPUs
CPU endpoints only: Ignored for GPU endpoints
Range: 1-32 vCPUs depending on CPU flavor
Guidelines:
- Light workloads: 1-2 vCPUs
- Web APIs: 2-4 vCPUs
- Data processing: 4-16 vCPUs
- Heavy computation: 16+ vCPUs
§workers_max: Option<i32>
Maximum number of workers that can run simultaneously.
Hard limit preventing runaway scaling and controlling maximum costs. Set based on expected peak load, budget constraints, and infrastructure limits.
Default: No limit (subject to account quotas)
Range: 0-1000+ depending on account limits
Strategy: Set to 2-3x expected peak load for a safety margin
§workers_min: Option<i32>
Minimum number of workers that always remain running.
Reserved capacity providing immediate availability even during idle periods. These workers are billed at a reduced rate but ensure zero cold start latency for the first few requests.
Default: 0 (no reserved capacity)
Range: 0-100 depending on account limits
Trade-offs:
- 0: Maximum cost efficiency, but cold starts for first requests
- 1+: Immediate availability, continuous billing for reserved workers
Strategy: Set to 1 for production endpoints requiring <1s response time
Trait Implementations§
impl Clone for EndpointCreateInput
fn clone(&self) -> EndpointCreateInput
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.