commonware_deployer/ec2/mod.rs
//! AWS EC2 deployer
//!
//! Deploy a custom binary (and configuration) to any number of EC2 instances across multiple regions. View metrics and logs
//! from all instances with Grafana.
//!
//! # Features
//!
//! * Automated creation, update, and destruction of EC2 instances across multiple regions
//! * Provide a unique name, instance type, region, binary, and configuration for each deployed instance
//! * Collect metrics, profiles (when enabled), and logs from all deployed instances on a long-lived monitoring instance
//!   (accessible only to the deployer's IP)
//!
//! # Architecture
//!
//! ```txt
//!                  Deployer's Machine (Public IP)
//!                                 |
//!                                 |
//!                                 v
//!                +-----------------------------------+
//!                | Monitoring VPC (us-east-1)        |
//!                |  - Monitoring Instance            |
//!                |    - Prometheus                   |
//!                |    - Loki                         |
//!                |    - Pyroscope                    |
//!                |    - Tempo                        |
//!                |    - Grafana                      |
//!                |  - Security Group                 |
//!                |    - All: Deployer IP             |
//!                |    - 3100: Binary VPCs            |
//!                |    - 4040: Binary VPCs            |
//!                |    - 4318: Binary VPCs            |
//!                +-----------------------------------+
//!                     ^                       ^
//!                (Telemetry)             (Telemetry)
//!                     |                       |
//!                     |                       |
//! +------------------------------+    +------------------------------+
//! | Binary VPC 1                 |    | Binary VPC 2                 |
//! |  - Binary Instance           |    |  - Binary Instance           |
//! |    - Binary A                |    |    - Binary B                |
//! |    - Promtail                |    |    - Promtail                |
//! |    - Node Exporter           |    |    - Node Exporter           |
//! |    - Pyroscope Agent         |    |    - Pyroscope Agent         |
//! |  - Security Group            |    |  - Security Group            |
//! |    - All: Deployer IP        |    |    - All: Deployer IP        |
//! |    - 9090: Monitoring IP     |    |    - 9090: Monitoring IP     |
//! |    - 9100: Monitoring IP     |    |    - 9100: Monitoring IP     |
//! |    - 8012: 0.0.0.0/0         |    |    - 8765: 12.3.7.9/32       |
//! +------------------------------+    +------------------------------+
//! ```
//!
//! ## Instances
//!
//! ### Monitoring
//!
//! * Deployed in `us-east-1` with a configurable instance type (e.g., `t4g.small` for ARM64, `t3.small` for x86_64) and storage (e.g., 10GB gp2). Architecture is auto-detected from the instance type.
//! * Runs:
//!   * **Prometheus**: Scrapes binary metrics from all instances at `:9090` and system metrics from all instances at `:9100`.
//!   * **Loki**: Listens at `:3100`, storing logs in `/loki/chunks` with a TSDB index at `/loki/index`.
//!   * **Pyroscope**: Listens at `:4040`, storing profiles in `/var/lib/pyroscope`.
//!   * **Tempo**: Listens at `:4318`, storing traces in `/var/lib/tempo`.
//!   * **Grafana**: Hosted at `:3000`, provisioned with Prometheus, Loki, and Tempo datasources and a custom dashboard.
//! * Ingress:
//!   * Allows deployer IP access (TCP 0-65535).
//!   * Allows binary instance traffic to Loki (TCP 3100), Pyroscope (TCP 4040), and Tempo (TCP 4318).
//!
//! ### Binary
//!
//! * Deployed in user-specified regions with configurable ARM64 or AMD64 instance types and storage.
//! * Runs:
//!   * **Custom Binary**: Executes with `--hosts=/home/ubuntu/hosts.yaml --config=/home/ubuntu/config.conf`, exposing metrics at `:9090`.
//!   * **Promtail**: Forwards `/var/log/binary.log` to Loki on the monitoring instance.
//!   * **Node Exporter**: Exposes system metrics at `:9100`.
//!   * **Pyroscope Agent**: Forwards `perf` profiles to Pyroscope on the monitoring instance.
//! * Ingress:
//!   * Allows deployer IP access (TCP 0-65535).
//!   * Allows monitoring IP access to `:9090` and `:9100` for Prometheus.
//!   * Opens user-defined ports from the configuration.
//!
//! ## Networking
//!
//! ### VPCs
//!
//! One per region with CIDR `10.<region-index>.0.0/16` (e.g., `10.0.0.0/16` for `us-east-1`).
//!
//! ### Subnets
//!
//! A single subnet per VPC (e.g., `10.<region-index>.1.0/24`), linked to a route table with an internet gateway.
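//!
//! As a sketch, the CIDR scheme above can be expressed as follows (hypothetical helpers for
//! illustration only; the deployer derives these values internally):
//!
//! ```rust
//! /// VPC CIDR for the region at `region_index`.
//! fn vpc_cidr(region_index: u8) -> String {
//!     format!("10.{region_index}.0.0/16")
//! }
//!
//! /// Subnet CIDR for the region at `region_index`.
//! fn subnet_cidr(region_index: u8) -> String {
//!     format!("10.{region_index}.1.0/24")
//! }
//!
//! assert_eq!(vpc_cidr(0), "10.0.0.0/16");
//! assert_eq!(subnet_cidr(1), "10.1.1.0/24");
//! ```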
//!
//! ### VPC Peering
//!
//! Peering connections link the monitoring VPC to each binary VPC, with routes added to both route tables for private communication.
//!
//! ### Security Groups
//!
//! Separate security groups are created for the monitoring instance (named `{tag}`) and binary instances (named `{tag}-binary`), dynamically configured to admit deployer and inter-instance traffic.
//!
//! # Workflow
//!
//! ## `ec2 create`
//!
//! 1. Validates configuration and generates an SSH key pair, stored in `$HOME/.commonware_deployer/{tag}/id_rsa_{tag}`.
//! 2. Ensures the shared S3 bucket exists and caches observability tools (Prometheus, Grafana, Loki, etc.) if not already present.
//! 3. Uploads deployment-specific files (binaries, configs) to S3.
//! 4. Creates VPCs, subnets, internet gateways, route tables, and security groups per region (concurrently).
//! 5. Establishes VPC peering between the monitoring region and binary regions.
//! 6. Launches the monitoring instance.
//! 7. Launches binary instances.
//! 8. Caches all static config files and uploads per-instance configs (hosts.yaml, promtail, pyroscope) to S3.
//! 9. Configures monitoring and binary instances in parallel via SSH (BBR, service installation, service startup).
//! 10. Updates the monitoring security group to allow telemetry traffic from binary instances.
//! 11. Marks completion with `$HOME/.commonware_deployer/{tag}/created`.
//!
//! ## `ec2 update`
//!
//! 1. Uploads the latest binary and configuration to S3.
//! 2. Stops the `binary` service on each binary instance.
//! 3. Instances download the updated files from S3 via pre-signed URLs.
//! 4. Restarts the `binary` service, ensuring minimal downtime.
//!
//! ## `ec2 authorize`
//!
//! 1. Obtains the deployer's current public IP address (or parses the one provided).
//! 2. For each security group in the deployment, adds an ingress rule for the IP (if it doesn't already exist).
//!
//! ## `ec2 destroy`
//!
//! 1. Terminates all instances across regions.
//! 2. Deletes security groups, subnets, route tables, VPC peering connections, internet gateways, key pairs, and VPCs in dependency order.
//! 3. Deletes deployment-specific data from S3 (cached tools remain for future deployments).
//! 4. Marks destruction with `$HOME/.commonware_deployer/{tag}/destroyed`, retaining the directory to prevent tag reuse.
//!
//! ## `ec2 clean`
//!
//! 1. Deletes the shared S3 bucket and all its contents (cached tools and any remaining deployment data).
//! 2. Use this to fully clean up when you no longer need the deployer cache.
//!
//! # Persistence
//!
//! * A directory `$HOME/.commonware_deployer/{tag}` stores the SSH private key and status files (`created`, `destroyed`).
//! * The deployment state is tracked via these files, ensuring operations respect prior create/destroy actions.
//!
//! ## S3 Caching
//!
//! A shared S3 bucket (`commonware-deployer-cache`) is used to cache deployment artifacts. The bucket
//! uses a fixed name intentionally so that all users within the same AWS account share the cache. This
//! design provides two benefits:
//!
//! 1. **Faster deployments**: Observability tools (Prometheus, Grafana, Loki, etc.) are downloaded from
//!    upstream sources once and cached in S3. Subsequent deployments by any user skip the download and
//!    use pre-signed URLs to fetch directly from S3.
//!
//! 2. **Reduced bandwidth**: Instead of requiring the deployer to push binaries to each instance,
//!    unique binaries are uploaded once to S3 and then pulled from there.
//!
//! Per-deployment data (binaries, configs, hosts files) is isolated under `deployments/{tag}/` to prevent
//! conflicts between concurrent deployments.
//!
//! The bucket stores:
//! * `tools/binaries/{tool}/{version}/{platform}/(unknown)` - Tool binaries (e.g., prometheus, grafana)
//! * `tools/configs/{deployer-version}/{component}/{file}` - Static configs and service files
//! * `deployments/{tag}/` - Deployment-specific files:
//!   * `monitoring/` - Prometheus config, dashboard
//!   * `instances/{name}/` - Binary, config, hosts.yaml, promtail config, pyroscope script
//!
//! Tool binaries are namespaced by tool version and platform. Static configs are namespaced by deployer
//! version to ensure cache invalidation when the deployer is updated.
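//!
//! As a sketch, keys under this layout might be built like so (hypothetical helpers for
//! illustration; the actual key construction lives in the `s3` module):
//!
//! ```rust
//! /// Key for a cached tool binary.
//! fn tool_binary_key(tool: &str, version: &str, platform: &str) -> String {
//!     format!("tools/binaries/{tool}/{version}/{platform}")
//! }
//!
//! /// Key for a deployment-specific file, isolated by tag.
//! fn deployment_key(tag: &str, path: &str) -> String {
//!     format!("deployments/{tag}/{path}")
//! }
//!
//! assert_eq!(
//!     deployment_key("mytag", "monitoring/dashboard.json"),
//!     "deployments/mytag/monitoring/dashboard.json"
//! );
//! ```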
//!
//! # Example Configuration
//!
//! ```yaml
//! tag: ffa638a0-991c-442c-8ec4-aa4e418213a5
//! monitoring:
//!   instance_type: t4g.small # ARM64 (Graviton)
//!   storage_size: 10
//!   storage_class: gp2
//!   dashboard: /path/to/dashboard.json
//! instances:
//!   - name: node1
//!     region: us-east-1
//!     instance_type: t4g.small # ARM64 (Graviton)
//!     storage_size: 10
//!     storage_class: gp2
//!     binary: /path/to/binary-arm64
//!     config: /path/to/config.conf
//!     profiling: true
//!   - name: node2
//!     region: us-west-2
//!     instance_type: t3.small # x86_64 (Intel/AMD)
//!     storage_size: 10
//!     storage_class: gp2
//!     binary: /path/to/binary-x86
//!     config: /path/to/config2.conf
//!     profiling: false
//! ports:
//!   - protocol: tcp
//!     port: 4545
//!     cidr: 0.0.0.0/0
//! ```
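//!
//! Instance names must be unique within a deployment (duplicates are rejected with
//! `Error::DuplicateInstanceName`). A minimal sketch of that check, assuming names have been
//! collected from the parsed configuration:
//!
//! ```rust
//! use std::collections::HashSet;
//!
//! /// Returns the first duplicated name, if any.
//! fn find_duplicate(names: &[&str]) -> Option<String> {
//!     let mut seen = HashSet::new();
//!     names.iter().find(|n| !seen.insert(**n)).map(|n| n.to_string())
//! }
//!
//! assert_eq!(find_duplicate(&["node1", "node2"]), None);
//! assert_eq!(find_duplicate(&["node1", "node1"]), Some("node1".to_string()));
//! ```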

use serde::{Deserialize, Serialize};
use std::net::IpAddr;

cfg_if::cfg_if! {
    if #[cfg(feature = "aws")] {
        use thiserror::Error;
        use std::path::PathBuf;

        /// CPU architecture for EC2 instances
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        pub enum Architecture {
            Arm64,
            X86_64,
        }

        impl Architecture {
            /// Returns the architecture string used in AMI names, download URLs, and labels
            pub const fn as_str(&self) -> &'static str {
                match self {
                    Self::Arm64 => "arm64",
                    Self::X86_64 => "amd64",
                }
            }

            /// Returns the Linux library path component for jemalloc
            pub const fn linux_lib(&self) -> &'static str {
                match self {
                    Self::Arm64 => "aarch64-linux-gnu",
                    Self::X86_64 => "x86_64-linux-gnu",
                }
            }
        }

        impl std::fmt::Display for Architecture {
            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                f.write_str(self.as_str())
            }
        }

        pub mod aws;
        mod create;
        pub mod services;
        pub use create::create;
        mod update;
        pub use update::update;
        mod authorize;
        pub use authorize::authorize;
        mod destroy;
        pub use destroy::destroy;
        mod clean;
        pub use clean::clean;
        pub mod utils;
        pub mod s3;

        /// Name of the monitoring instance
        const MONITORING_NAME: &str = "monitoring";

        /// AWS region where the monitoring instance is deployed
        const MONITORING_REGION: &str = "us-east-1";

        /// File name that indicates the deployment completed
        const CREATED_FILE_NAME: &str = "created";

        /// File name that indicates the deployment was destroyed
        const DESTROYED_FILE_NAME: &str = "destroyed";

        /// Port on each instance where system metrics are exposed
        const SYSTEM_PORT: u16 = 9100;

        /// Port on the monitoring instance where logs are pushed
        const LOGS_PORT: u16 = 3100;

        /// Port on the monitoring instance where profiles are pushed
        const PROFILES_PORT: u16 = 4040;

        /// Port on the monitoring instance where traces are pushed
        const TRACES_PORT: u16 = 4318;

        /// Subcommand name
        pub const CMD: &str = "ec2";

        /// Create subcommand name
        pub const CREATE_CMD: &str = "create";

        /// Update subcommand name
        pub const UPDATE_CMD: &str = "update";

        /// Authorize subcommand name
        pub const AUTHORIZE_CMD: &str = "authorize";

        /// Destroy subcommand name
        pub const DESTROY_CMD: &str = "destroy";

        /// Clean subcommand name
        pub const CLEAN_CMD: &str = "clean";

        /// Directory where deployer files are stored
        fn deployer_directory(tag: &str) -> PathBuf {
            let base_dir = std::env::var("HOME").expect("$HOME is not configured");
            PathBuf::from(format!("{base_dir}/.commonware_deployer/{tag}"))
        }

        /// S3 operations that can fail
        #[derive(Debug, Clone, Copy)]
        pub enum S3Operation {
            CreateBucket,
            DeleteBucket,
            HeadObject,
            PutObject,
            ListObjects,
            DeleteObjects,
        }

        /// Reasons why accessing a bucket may be forbidden
        #[derive(Debug, Clone, Copy)]
        pub enum BucketForbiddenReason {
            /// Access denied (missing s3:ListBucket permission or bucket owned by another account)
            AccessDenied,
        }

        impl std::fmt::Display for BucketForbiddenReason {
            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                match self {
                    Self::AccessDenied => write!(f, "access denied (check IAM permissions or bucket ownership)"),
                }
            }
        }

        impl std::fmt::Display for S3Operation {
            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                match self {
                    Self::CreateBucket => write!(f, "CreateBucket"),
                    Self::DeleteBucket => write!(f, "DeleteBucket"),
                    Self::HeadObject => write!(f, "HeadObject"),
                    Self::PutObject => write!(f, "PutObject"),
                    Self::ListObjects => write!(f, "ListObjects"),
                    Self::DeleteObjects => write!(f, "DeleteObjects"),
                }
            }
        }

        /// Errors that can occur when deploying infrastructure on AWS
        #[derive(Error, Debug)]
        pub enum Error {
            #[error("AWS EC2 error: {0}")]
            AwsEc2(#[from] aws_sdk_ec2::Error),
            #[error("AWS security group ingress error: {0}")]
            AwsSecurityGroupIngress(#[from] aws_sdk_ec2::operation::authorize_security_group_ingress::AuthorizeSecurityGroupIngressError),
            #[error("AWS describe instances error: {0}")]
            AwsDescribeInstances(#[from] aws_sdk_ec2::operation::describe_instances::DescribeInstancesError),
            #[error("S3 operation failed: {operation} on bucket '{bucket}'")]
            AwsS3 {
                bucket: String,
                operation: S3Operation,
                #[source]
                source: Box<aws_sdk_s3::Error>,
            },
            #[error("S3 bucket '{bucket}' forbidden: {reason}")]
            S3BucketForbidden {
                bucket: String,
                reason: BucketForbiddenReason,
            },
            #[error("IO error: {0}")]
            Io(#[from] std::io::Error),
            #[error("YAML error: {0}")]
            Yaml(#[from] serde_yaml::Error),
            #[error("creation already attempted")]
            CreationAttempted,
            #[error("invalid instance name: {0}")]
            InvalidInstanceName(String),
            #[error("reqwest error: {0}")]
            Reqwest(#[from] reqwest::Error),
            #[error("SSH failed")]
            SshFailed,
            #[error("keygen failed")]
            KeygenFailed,
            #[error("service timeout ({0}): {1}")]
            ServiceTimeout(String, String),
            #[error("deployment does not exist: {0}")]
            DeploymentDoesNotExist(String),
            #[error("deployment is not complete: {0}")]
            DeploymentNotComplete(String),
            #[error("deployment already destroyed: {0}")]
            DeploymentAlreadyDestroyed(String),
            #[error("private key not found")]
            PrivateKeyNotFound,
            #[error("invalid IP address: {0}")]
            IpAddrParse(#[from] std::net::AddrParseError),
            #[error("IP address is not IPv4: {0}")]
            IpAddrNotV4(std::net::IpAddr),
            #[error("download failed: {0}")]
            DownloadFailed(String),
            #[error("S3 presigning config error: {0}")]
            S3PresigningConfig(#[from] aws_sdk_s3::presigning::PresigningConfigError),
            #[error("S3 presigning failed: {0}")]
            S3PresigningFailed(Box<aws_sdk_s3::error::SdkError<aws_sdk_s3::operation::get_object::GetObjectError>>),
            #[error("S3 builder error: {0}")]
            S3Builder(#[from] aws_sdk_s3::error::BuildError),
            #[error("duplicate instance name: {0}")]
            DuplicateInstanceName(String),
        }

        impl From<aws_sdk_s3::error::SdkError<aws_sdk_s3::operation::get_object::GetObjectError>> for Error {
            fn from(err: aws_sdk_s3::error::SdkError<aws_sdk_s3::operation::get_object::GetObjectError>) -> Self {
                Self::S3PresigningFailed(Box::new(err))
            }
        }
    }
}

/// Port on each binary instance where metrics are exposed
pub const METRICS_PORT: u16 = 9090;

/// Host deployment information
#[derive(Serialize, Deserialize, Clone)]
pub struct Host {
    /// Name of the host
    pub name: String,

    /// Region where the host is deployed
    pub region: String,

    /// Public IP address of the host
    pub ip: IpAddr,
}

/// List of hosts
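///
/// Serialized to `hosts.yaml` and shared with each binary instance. An illustrative
/// example (all values hypothetical):
///
/// ```yaml
/// monitoring: 10.0.1.5
/// hosts:
///   - name: node1
///     region: us-east-1
///     ip: 1.2.3.4
/// ```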
#[derive(Serialize, Deserialize, Clone)]
pub struct Hosts {
    /// Private IP address of the monitoring instance
    pub monitoring: IpAddr,

    /// Hosts deployed across all regions
    pub hosts: Vec<Host>,
}

/// Port configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct PortConfig {
    /// Protocol (e.g., "tcp")
    pub protocol: String,

    /// Port number
    pub port: u16,

    /// CIDR block
    pub cidr: String,
}

/// Instance configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct InstanceConfig {
    /// Name of the instance
    pub name: String,

    /// AWS region where the instance is deployed
    pub region: String,

    /// Instance type (e.g., `t4g.small` for ARM64, `t3.small` for x86_64)
    pub instance_type: String,

    /// Storage size in GB
    pub storage_size: i32,

    /// Storage class (e.g., "gp2")
    pub storage_class: String,

    /// Path to the binary to deploy
    pub binary: String,

    /// Path to the binary configuration file
    pub config: String,

    /// Whether to enable profiling
    pub profiling: bool,
}

/// Monitoring configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct MonitoringConfig {
    /// Instance type (e.g., `t4g.small` for ARM64, `t3.small` for x86_64)
    pub instance_type: String,

    /// Storage size in GB
    pub storage_size: i32,

    /// Storage class (e.g., "gp2")
    pub storage_class: String,

    /// Path to a custom dashboard file that is automatically
    /// uploaded to Grafana
    pub dashboard: String,
}

/// Deployer configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct Config {
    /// Unique tag for the deployment
    pub tag: String,

    /// Monitoring instance configuration
    pub monitoring: MonitoringConfig,

    /// Instance configurations
    pub instances: Vec<InstanceConfig>,

    /// Ports open on all instances
    pub ports: Vec<PortConfig>,
}