// commonware_deployer/ec2/mod.rs

//! AWS EC2 deployer
//!
//! Deploy a custom binary (and configuration) to any number of EC2 instances across multiple regions. View metrics and logs
//! from all instances with Grafana.
//!
//! # Features
//!
//! * Automated creation, update, and destruction of EC2 instances across multiple regions
//! * Provide a unique name, instance type, region, binary, and configuration for each deployed instance
//! * Collect metrics, profiles (when enabled), and logs from all deployed instances on a long-lived monitoring instance
//!   (accessible only to the deployer's IP)
//!
//! # Architecture
//!
//! ```txt
//!                    Deployer's Machine (Public IP)
//!                                  |
//!                                  |
//!                                  v
//!               +-----------------------------------+
//!               | Monitoring VPC (us-east-1)        |
//!               |  - Monitoring Instance            |
//!               |    - Prometheus                   |
//!               |    - Loki                         |
//!               |    - Pyroscope                    |
//!               |    - Tempo                        |
//!               |    - Grafana                      |
//!               |  - Security Group                 |
//!               |    - All: Deployer IP             |
//!               |    - 3100: Binary VPCs            |
//!               |    - 4040: Binary VPCs            |
//!               |    - 4318: Binary VPCs            |
//!               +-----------------------------------+
//!                     ^                       ^
//!                (Telemetry)             (Telemetry)
//!                     |                       |
//!                     |                       |
//! +------------------------------+  +------------------------------+
//! | Binary VPC 1                 |  | Binary VPC 2                 |
//! |  - Binary Instance           |  |  - Binary Instance           |
//! |    - Binary A                |  |    - Binary B                |
//! |    - Promtail                |  |    - Promtail                |
//! |    - Node Exporter           |  |    - Node Exporter           |
//! |    - Pyroscope Agent         |  |    - Pyroscope Agent         |
//! |  - Security Group            |  |  - Security Group            |
//! |    - All: Deployer IP        |  |    - All: Deployer IP        |
//! |    - 9090: Monitoring IP     |  |    - 9090: Monitoring IP     |
//! |    - 9100: Monitoring IP     |  |    - 9100: Monitoring IP     |
//! |    - 8012: 0.0.0.0/0         |  |    - 8765: 12.3.7.9/32       |
//! +------------------------------+  +------------------------------+
//! ```
//!
//! ## Instances
//!
//! ### Monitoring
//!
//! * Deployed in `us-east-1` with a configurable instance type (e.g., `t4g.small` for ARM64, `t3.small` for x86_64) and storage (e.g., 10GB gp2). Architecture is auto-detected from the instance type.
//! * Runs:
//!     * **Prometheus**: Scrapes binary metrics from all instances at `:9090` and system metrics from all instances at `:9100`.
//!     * **Loki**: Listens at `:3100`, storing logs in `/loki/chunks` with a TSDB index at `/loki/index`.
//!     * **Pyroscope**: Listens at `:4040`, storing profiles in `/var/lib/pyroscope`.
//!     * **Tempo**: Listens at `:4318`, storing traces in `/var/lib/tempo`.
//!     * **Grafana**: Hosted at `:3000`, provisioned with Prometheus, Loki, and Tempo datasources and a custom dashboard.
//! * Ingress:
//!     * Allows deployer IP access (TCP 0-65535).
//!     * Allows binary instance traffic to Loki (TCP 3100), Pyroscope (TCP 4040), and Tempo (TCP 4318).
//!
//! ### Binary
//!
//! * Deployed in user-specified regions with configurable ARM64 or x86_64 instance types and storage.
//! * Run:
//!     * **Custom Binary**: Executes with `--hosts=/home/ubuntu/hosts.yaml --config=/home/ubuntu/config.conf`, exposing metrics at `:9090`.
//!     * **Promtail**: Forwards `/var/log/binary.log` to Loki on the monitoring instance.
//!     * **Node Exporter**: Exposes system metrics at `:9100`.
//!     * **Pyroscope Agent**: Forwards `perf` profiles to Pyroscope on the monitoring instance.
//! * Ingress:
//!     * Deployer IP access (TCP 0-65535).
//!     * Monitoring IP access to `:9090` and `:9100` for Prometheus.
//!     * User-defined ports from the configuration.
//!
//! ## Networking
//!
//! ### VPCs
//!
//! One per region with CIDR `10.<region-index>.0.0/16` (e.g., `10.0.0.0/16` for `us-east-1`).
//!
//! ### Subnets
//!
//! Single subnet per VPC (e.g., `10.<region-index>.1.0/24`), linked to a route table with an internet gateway.
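//!
//! As a rough illustration of the addressing scheme above, the sketch below derives the VPC and subnet
//! CIDRs from a region index. The function names are hypothetical (they are not part of this module),
//! and the region index is assumed to fit in the second octet.
//!
//! ```rust
//! /// Hypothetical helper: VPC CIDR for the region at `region_index`.
//! fn vpc_cidr(region_index: u8) -> String {
//!     format!("10.{region_index}.0.0/16")
//! }
//!
//! /// Hypothetical helper: CIDR of the single subnet inside that VPC.
//! fn subnet_cidr(region_index: u8) -> String {
//!     format!("10.{region_index}.1.0/24")
//! }
//! ```
//!
//! With these helpers, `vpc_cidr(0)` yields `10.0.0.0/16` and `subnet_cidr(2)` yields `10.2.1.0/24`.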
//!
//! ### VPC Peering
//!
//! Connects the monitoring VPC to each binary VPC, with routes added to route tables for private communication.
//!
//! ### Security Groups
//!
//! Separate groups for the monitoring instance (`{tag}`) and binary instances (`{tag}-binary`), dynamically configured for deployer and inter-instance traffic.
//!
//! # Workflow
//!
//! ## `ec2 create`
//!
//! 1. Validates configuration and generates an SSH key pair, stored in `$HOME/.commonware_deployer/{tag}/id_rsa_{tag}`.
//! 2. Ensures the shared S3 bucket exists and caches observability tools (Prometheus, Grafana, Loki, etc.) if not already present.
//! 3. Uploads deployment-specific files (binaries, configs) to S3.
//! 4. Creates VPCs, subnets, internet gateways, route tables, and security groups per region (concurrently).
//! 5. Establishes VPC peering between the monitoring region and binary regions.
//! 6. Launches the monitoring instance.
//! 7. Launches binary instances.
//! 8. Caches all static config files and uploads per-instance configs (hosts.yaml, promtail, pyroscope) to S3.
//! 9. Configures monitoring and binary instances in parallel via SSH (BBR, service installation, service startup).
//! 10. Updates the monitoring security group to allow telemetry traffic from binary instances.
//! 11. Marks completion with `$HOME/.commonware_deployer/{tag}/created`.
//!
//! ## `ec2 update`
//!
//! 1. Uploads the latest binary and configuration to S3.
//! 2. Stops the `binary` service on each binary instance.
//! 3. Instances download the updated files from S3 via pre-signed URLs.
//! 4. Restarts the `binary` service, ensuring minimal downtime.
//!
//! ## `ec2 authorize`
//!
//! 1. Obtains the deployer's current public IP address (or parses the one provided).
//! 2. For each security group in the deployment, adds an ingress rule for the IP (if it doesn't already exist).
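//!
//! The handling of a user-provided IP in step 1 can be sketched as follows. This is an illustration only:
//! the helper name `ingress_cidr` and the `/32` suffix are assumptions, and errors are simplified to a
//! `String` rather than this module's `Error` type.
//!
//! ```rust
//! use std::net::IpAddr;
//!
//! /// Hypothetical helper: parse a provided IP and turn it into a single-host IPv4 CIDR.
//! fn ingress_cidr(provided: &str) -> Result<String, String> {
//!     // Reject anything that is not a syntactically valid IP address
//!     let ip: IpAddr = provided
//!         .parse()
//!         .map_err(|e| format!("invalid IP address: {e}"))?;
//!
//!     // Only IPv4 addresses are accepted in this sketch
//!     match ip {
//!         IpAddr::V4(v4) => Ok(format!("{v4}/32")),
//!         IpAddr::V6(v6) => Err(format!("IP address is not IPv4: {v6}")),
//!     }
//! }
//! ```
//!
//! Under these assumptions, `ingress_cidr("12.3.7.9")` returns `Ok("12.3.7.9/32")`, while an IPv6 address
//! or a malformed string returns an error.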
//!
//! ## `ec2 destroy`
//!
//! 1. Terminates all instances across regions.
//! 2. Deletes security groups, subnets, route tables, VPC peering connections, internet gateways, key pairs, and VPCs in dependency order.
//! 3. Deletes deployment-specific data from S3 (cached tools remain for future deployments).
//! 4. Marks destruction with `$HOME/.commonware_deployer/{tag}/destroyed`, retaining the directory to prevent tag reuse.
//!
//! ## `ec2 clean`
//!
//! 1. Deletes the shared S3 bucket and all its contents (cached tools and any remaining deployment data).
//! 2. Use this to fully clean up when you no longer need the deployer cache.
//!
//! # Persistence
//!
//! * A directory `$HOME/.commonware_deployer/{tag}` stores the SSH private key and status files (`created`, `destroyed`).
//! * The deployment state is tracked via these files, ensuring operations respect prior create/destroy actions.
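//!
//! A minimal sketch of how these marker files could be interpreted is shown below. The `DeploymentState`
//! enum and `deployment_state` function are hypothetical names introduced for illustration, and the checks
//! performed by each subcommand may differ.
//!
//! ```rust
//! use std::path::Path;
//!
//! /// Hypothetical summary of what the status files in a deployment directory imply.
//! #[derive(Debug, PartialEq, Eq)]
//! enum DeploymentState {
//!     /// The deployment directory does not exist
//!     Missing,
//!     /// The directory exists but `created` was never written
//!     Incomplete,
//!     /// `created` exists and `destroyed` does not
//!     Created,
//!     /// `destroyed` exists
//!     Destroyed,
//! }
//!
//! /// Hypothetical helper: inspect the marker files inside `dir`.
//! fn deployment_state(dir: &Path) -> DeploymentState {
//!     if !dir.exists() {
//!         DeploymentState::Missing
//!     } else if dir.join("destroyed").exists() {
//!         DeploymentState::Destroyed
//!     } else if dir.join("created").exists() {
//!         DeploymentState::Created
//!     } else {
//!         DeploymentState::Incomplete
//!     }
//! }
//! ```
//!
//! In this sketch the `destroyed` marker takes precedence over `created`, which matches the idea that a
//! destroyed tag is retained to prevent reuse.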
//!
//! ## S3 Caching
//!
//! A shared S3 bucket (`commonware-deployer-cache`) is used to cache deployment artifacts. The bucket
//! uses a fixed name intentionally so that all users within the same AWS account share the cache. This
//! design provides two benefits:
//!
//! 1. **Faster deployments**: Observability tools (Prometheus, Grafana, Loki, etc.) are downloaded from
//!    upstream sources once and cached in S3. Subsequent deployments by any user skip the download and
//!    use pre-signed URLs to fetch directly from S3.
//!
//! 2. **Reduced bandwidth**: Instead of requiring the deployer to push binaries to each instance,
//!    unique binaries are uploaded once to S3 and then pulled from there.
//!
//! Per-deployment data (binaries, configs, hosts files) is isolated under `deployments/{tag}/` to prevent
//! conflicts between concurrent deployments.
//!
//! The bucket stores:
//!   * `tools/binaries/{tool}/{version}/{platform}/{filename}` - Tool binaries (e.g., prometheus, grafana)
//!   * `tools/configs/{deployer-version}/{component}/{file}` - Static configs and service files
//!   * `deployments/{tag}/` - Deployment-specific files:
//!     * `monitoring/` - Prometheus config, dashboard
//!     * `instances/{name}/` - Binary, config, hosts.yaml, promtail config, pyroscope script
//!
//! Tool binaries are namespaced by tool version and platform. Static configs are namespaced by deployer
//! version to ensure cache invalidation when the deployer is updated.
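//!
//! To make the key layout concrete, the sketch below assembles object keys that follow the patterns
//! listed above. The function names are hypothetical and are not taken from this crate's `s3` module.
//!
//! ```rust
//! /// Hypothetical helper: key for a cached tool binary.
//! fn tool_binary_key(tool: &str, version: &str, platform: &str, filename: &str) -> String {
//!     format!("tools/binaries/{tool}/{version}/{platform}/{filename}")
//! }
//!
//! /// Hypothetical helper: key for a static config, namespaced by deployer version.
//! fn tool_config_key(deployer_version: &str, component: &str, file: &str) -> String {
//!     format!("tools/configs/{deployer_version}/{component}/{file}")
//! }
//!
//! /// Hypothetical helper: key for a per-instance file within one deployment.
//! fn instance_key(tag: &str, name: &str, file: &str) -> String {
//!     format!("deployments/{tag}/instances/{name}/{file}")
//! }
//! ```
//!
//! For example, `instance_key("my-tag", "node1", "hosts.yaml")` yields
//! `deployments/my-tag/instances/node1/hosts.yaml`, so two deployments with different tags never share a key.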
//!
//! # Example Configuration
//!
//! ```yaml
//! tag: ffa638a0-991c-442c-8ec4-aa4e418213a5
//! monitoring:
//!   instance_type: t4g.small  # ARM64 (Graviton)
//!   storage_size: 10
//!   storage_class: gp2
//!   dashboard: /path/to/dashboard.json
//! instances:
//!   - name: node1
//!     region: us-east-1
//!     instance_type: t4g.small  # ARM64 (Graviton)
//!     storage_size: 10
//!     storage_class: gp2
//!     binary: /path/to/binary-arm64
//!     config: /path/to/config.conf
//!     profiling: true
//!   - name: node2
//!     region: us-west-2
//!     instance_type: t3.small  # x86_64 (Intel/AMD)
//!     storage_size: 10
//!     storage_class: gp2
//!     binary: /path/to/binary-x86
//!     config: /path/to/config2.conf
//!     profiling: false
//! ports:
//!   - protocol: tcp
//!     port: 4545
//!     cidr: 0.0.0.0/0
//! ```

use serde::{Deserialize, Serialize};
use std::net::IpAddr;

cfg_if::cfg_if! {
    if #[cfg(feature="aws")] {
        use thiserror::Error;
        use std::path::PathBuf;

        /// CPU architecture for EC2 instances
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        pub enum Architecture {
            Arm64,
            X86_64,
        }

        impl Architecture {
            /// Returns the architecture string used in AMI names, download URLs, and labels
            pub const fn as_str(&self) -> &'static str {
                match self {
                    Self::Arm64 => "arm64",
                    Self::X86_64 => "amd64",
                }
            }

            /// Returns the Linux library path component for jemalloc
            pub const fn linux_lib(&self) -> &'static str {
                match self {
                    Self::Arm64 => "aarch64-linux-gnu",
                    Self::X86_64 => "x86_64-linux-gnu",
                }
            }
        }

        impl std::fmt::Display for Architecture {
            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                f.write_str(self.as_str())
            }
        }

        pub mod aws;
        mod create;
        pub mod services;
        pub use create::create;
        mod update;
        pub use update::update;
        mod authorize;
        pub use authorize::authorize;
        mod destroy;
        pub use destroy::destroy;
        mod clean;
        pub use clean::clean;
        pub mod utils;
        pub mod s3;

        /// Name of the monitoring instance
        const MONITORING_NAME: &str = "monitoring";

        /// AWS region where monitoring instances are deployed
        const MONITORING_REGION: &str = "us-east-1";

        /// File name that indicates the deployment completed
        const CREATED_FILE_NAME: &str = "created";

        /// File name that indicates the deployment was destroyed
        const DESTROYED_FILE_NAME: &str = "destroyed";

        /// Port on instance where system metrics are exposed
        const SYSTEM_PORT: u16 = 9100;

        /// Port on monitoring where logs are pushed
        const LOGS_PORT: u16 = 3100;

        /// Port on monitoring where profiles are pushed
        const PROFILES_PORT: u16 = 4040;

        /// Port on monitoring where traces are pushed
        const TRACES_PORT: u16 = 4318;

        /// Subcommand name
        pub const CMD: &str = "ec2";

        /// Create subcommand name
        pub const CREATE_CMD: &str = "create";

        /// Update subcommand name
        pub const UPDATE_CMD: &str = "update";

        /// Authorize subcommand name
        pub const AUTHORIZE_CMD: &str = "authorize";

        /// Destroy subcommand name
        pub const DESTROY_CMD: &str = "destroy";

        /// Clean subcommand name
        pub const CLEAN_CMD: &str = "clean";

        /// Directory where deployer files are stored
        fn deployer_directory(tag: &str) -> PathBuf {
            let base_dir = std::env::var("HOME").expect("$HOME is not configured");
            PathBuf::from(format!("{base_dir}/.commonware_deployer/{tag}"))
        }

        /// S3 operations that can fail
        #[derive(Debug, Clone, Copy)]
        pub enum S3Operation {
            CreateBucket,
            DeleteBucket,
            HeadObject,
            PutObject,
            ListObjects,
            DeleteObjects,
        }

        /// Reasons why accessing a bucket may be forbidden
        #[derive(Debug, Clone, Copy)]
        pub enum BucketForbiddenReason {
            /// Access denied (missing s3:ListBucket permission or bucket owned by another account)
            AccessDenied,
        }

        impl std::fmt::Display for BucketForbiddenReason {
            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                match self {
                    Self::AccessDenied => write!(f, "access denied (check IAM permissions or bucket ownership)"),
                }
            }
        }

        impl std::fmt::Display for S3Operation {
            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                match self {
                    Self::CreateBucket => write!(f, "CreateBucket"),
                    Self::DeleteBucket => write!(f, "DeleteBucket"),
                    Self::HeadObject => write!(f, "HeadObject"),
                    Self::PutObject => write!(f, "PutObject"),
                    Self::ListObjects => write!(f, "ListObjects"),
                    Self::DeleteObjects => write!(f, "DeleteObjects"),
                }
            }
        }

        /// Errors that can occur when deploying infrastructure on AWS
        #[derive(Error, Debug)]
        pub enum Error {
            #[error("AWS EC2 error: {0}")]
            AwsEc2(#[from] aws_sdk_ec2::Error),
            #[error("AWS security group ingress error: {0}")]
            AwsSecurityGroupIngress(#[from] aws_sdk_ec2::operation::authorize_security_group_ingress::AuthorizeSecurityGroupIngressError),
            #[error("AWS describe instances error: {0}")]
            AwsDescribeInstances(#[from] aws_sdk_ec2::operation::describe_instances::DescribeInstancesError),
            #[error("S3 operation failed: {operation} on bucket '{bucket}'")]
            AwsS3 {
                bucket: String,
                operation: S3Operation,
                #[source]
                source: Box<aws_sdk_s3::Error>,
            },
            #[error("S3 bucket '{bucket}' forbidden: {reason}")]
            S3BucketForbidden {
                bucket: String,
                reason: BucketForbiddenReason,
            },
            #[error("IO error: {0}")]
            Io(#[from] std::io::Error),
            #[error("YAML error: {0}")]
            Yaml(#[from] serde_yaml::Error),
            #[error("creation already attempted")]
            CreationAttempted,
            #[error("invalid instance name: {0}")]
            InvalidInstanceName(String),
            #[error("reqwest error: {0}")]
            Reqwest(#[from] reqwest::Error),
            #[error("SSH failed")]
            SshFailed,
            #[error("keygen failed")]
            KeygenFailed,
            #[error("service timeout({0}): {1}")]
            ServiceTimeout(String, String),
            #[error("deployment does not exist: {0}")]
            DeploymentDoesNotExist(String),
            #[error("deployment is not complete: {0}")]
            DeploymentNotComplete(String),
            #[error("deployment already destroyed: {0}")]
            DeploymentAlreadyDestroyed(String),
            #[error("private key not found")]
            PrivateKeyNotFound,
            #[error("invalid IP address: {0}")]
            IpAddrParse(#[from] std::net::AddrParseError),
            #[error("IP address is not IPv4: {0}")]
            IpAddrNotV4(std::net::IpAddr),
            #[error("download failed: {0}")]
            DownloadFailed(String),
            #[error("S3 presigning config error: {0}")]
            S3PresigningConfig(#[from] aws_sdk_s3::presigning::PresigningConfigError),
            #[error("S3 presigning failed: {0}")]
            S3PresigningFailed(Box<aws_sdk_s3::error::SdkError<aws_sdk_s3::operation::get_object::GetObjectError>>),
            #[error("S3 builder error: {0}")]
            S3Builder(#[from] aws_sdk_s3::error::BuildError),
            #[error("duplicate instance name: {0}")]
            DuplicateInstanceName(String),
        }

        impl From<aws_sdk_s3::error::SdkError<aws_sdk_s3::operation::get_object::GetObjectError>> for Error {
            fn from(err: aws_sdk_s3::error::SdkError<aws_sdk_s3::operation::get_object::GetObjectError>) -> Self {
                Self::S3PresigningFailed(Box::new(err))
            }
        }
    }
}

/// Port on binary where metrics are exposed
pub const METRICS_PORT: u16 = 9090;

/// Host deployment information
#[derive(Serialize, Deserialize, Clone)]
pub struct Host {
    /// Name of the host
    pub name: String,

    /// Region where the host is deployed
    pub region: String,

    /// Public IP address of the host
    pub ip: IpAddr,
}

/// List of hosts
#[derive(Serialize, Deserialize, Clone)]
pub struct Hosts {
    /// Private IP address of the monitoring instance
    pub monitoring: IpAddr,

    /// Hosts deployed across all regions
    pub hosts: Vec<Host>,
}

/// Port configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct PortConfig {
    /// Protocol (e.g., "tcp")
    pub protocol: String,

    /// Port number
    pub port: u16,

    /// CIDR block
    pub cidr: String,
}

/// Instance configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct InstanceConfig {
    /// Name of the instance
    pub name: String,

    /// AWS region where the instance is deployed
    pub region: String,

    /// Instance type (e.g., `t4g.small` for ARM64, `t3.small` for x86_64)
    pub instance_type: String,

    /// Storage size in GB
    pub storage_size: i32,

    /// Storage class (e.g., "gp2")
    pub storage_class: String,

    /// Path to the binary to deploy
    pub binary: String,

    /// Path to the binary configuration file
    pub config: String,

    /// Whether to enable profiling
    pub profiling: bool,
}

/// Monitoring configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct MonitoringConfig {
    /// Instance type (e.g., `t4g.small` for ARM64, `t3.small` for x86_64)
    pub instance_type: String,

    /// Storage size in GB
    pub storage_size: i32,

    /// Storage class (e.g., "gp2")
    pub storage_class: String,

    /// Path to a custom dashboard file that is automatically
    /// uploaded to Grafana
    pub dashboard: String,
}

/// Deployer configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct Config {
    /// Unique tag for the deployment
    pub tag: String,

    /// Monitoring instance configuration
    pub monitoring: MonitoringConfig,

    /// Instance configurations
    pub instances: Vec<InstanceConfig>,

    /// Ports open on all instances
    pub ports: Vec<PortConfig>,
}