//! AWS EC2 deployer
//!
//! Deploy a custom binary (and configuration) to any number of EC2 instances across multiple regions. View metrics and logs
//! from all instances with Grafana.
//!
//! # Features
//!
//! * Automated creation, update, and destruction of EC2 instances across multiple regions
//! * Provide a unique name, instance type, region, binary, and configuration for each deployed instance
//! * Collect metrics, profiles (when enabled), and logs from all deployed instances on a long-lived monitoring instance
//!   (accessible only to the deployer's IP)
//!
//! # Architecture
//!
//! ```txt
//!                    Deployer's Machine (Public IP)
//!                                  |
//!                                  |
//!                                  v
//!               +-----------------------------------+
//!               | Monitoring VPC (us-east-1)        |
//!               |  - Monitoring Instance            |
//!               |    - Prometheus                   |
//!               |    - Loki                         |
//!               |    - Pyroscope                    |
//!               |    - Tempo                        |
//!               |    - Grafana                      |
//!               |  - Security Group                 |
//!               |    - All: Deployer IP             |
//!               |    - 3100: Binary VPCs            |
//!               |    - 4040: Binary VPCs            |
//!               |    - 4318: Binary VPCs            |
//!               +-----------------------------------+
//!                     ^                       ^
//!                (Telemetry)             (Telemetry)
//!                     |                       |
//!                     |                       |
//! +------------------------------+  +------------------------------+
//! | Binary VPC 1                 |  | Binary VPC 2                 |
//! |  - Binary Instance           |  |  - Binary Instance           |
//! |    - Binary A                |  |    - Binary B                |
//! |    - Promtail                |  |    - Promtail                |
//! |    - Node Exporter           |  |    - Node Exporter           |
//! |    - Pyroscope Agent         |  |    - Pyroscope Agent         |
//! |    - Memleak Agent           |  |    - Memleak Agent           |
//! |  - Security Group            |  |  - Security Group            |
//! |    - All: Deployer IP        |  |    - All: Deployer IP        |
//! |    - 9090: Monitoring IP     |  |    - 9090: Monitoring IP     |
//! |    - 9100: Monitoring IP     |  |    - 9100: Monitoring IP     |
//! |    - 9200: Monitoring IP     |  |    - 9200: Monitoring IP     |
//! |    - 8012: 0.0.0.0/0         |  |    - 8765: 12.3.7.9/32       |
//! +------------------------------+  +------------------------------+
//! ```
//!
//! ## Instances
//!
//! ### Monitoring
//!
//! * Deployed in `us-east-1` with a configurable ARM64 instance type (e.g., `t4g.small`) and storage (e.g., 10 GB gp2).
//! * Runs:
//!     * **Prometheus**: Scrapes binary metrics from all instances at `:9090` and system metrics from all instances at `:9100`.
//!     * **Loki**: Listens at `:3100`, storing logs in `/loki/chunks` with a TSDB index at `/loki/index`.
//!     * **Pyroscope**: Listens at `:4040`, storing profiles in `/var/lib/pyroscope`.
//!     * **Tempo**: Listens at `:4318`, storing traces in `/var/lib/tempo`.
//!     * **Grafana**: Hosted at `:3000`, provisioned with Prometheus and Loki datasources and a custom dashboard.
//! * Ingress:
//!     * Deployer IP access (TCP 0-65535).
//!     * Binary instance traffic to Loki (TCP 3100), Pyroscope (TCP 4040), and Tempo (TCP 4318).
//!
//! ### Binary
//!
//! * Deployed in user-specified regions with configurable ARM64 instance types and storage.
//! * Run:
//!     * **Custom Binary**: Executes with `--hosts=/home/ubuntu/hosts.yaml --config=/home/ubuntu/config.conf`, exposing metrics at `:9090`.
//!     * **Promtail**: Forwards `/var/log/binary.log` to Loki on the monitoring instance.
//!     * **Node Exporter**: Exposes system metrics at `:9100`.
//!     * **Pyroscope Agent**: Forwards `perf` profiles to Pyroscope on the monitoring instance.
//!     * **Memleak Agent**: Exposes `memleak` metrics at `:9200`.
//! * Ingress:
//!     * Deployer IP access (TCP 0-65535).
//!     * Monitoring IP access to `:9090`, `:9100`, and `:9200` for Prometheus.
//!     * User-defined ports from the configuration.
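//!
//! The binary ingress rules listed above can be modeled as plain data. The following is a minimal sketch, not the
//! deployer's own code: the `Ingress` type and `binary_ingress` helper are hypothetical, and the sketch assumes
//! IPv4 addresses (hence the `/32` suffix) and TCP-only user ports.
//!
//! ```rust
//! /// Hypothetical description of a single TCP ingress rule.
//! #[derive(Debug, PartialEq)]
//! struct Ingress {
//!     from_port: u16,
//!     to_port: u16,
//!     cidr: String,
//! }
//!
//! /// Build the ingress rules for a binary instance's security group.
//! fn binary_ingress(deployer_ip: &str, monitoring_ip: &str, user_ports: &[(u16, &str)]) -> Vec<Ingress> {
//!     // The deployer may reach every TCP port.
//!     let mut rules = vec![Ingress {
//!         from_port: 0,
//!         to_port: 65535,
//!         cidr: format!("{deployer_ip}/32"),
//!     }];
//!     // Prometheus on the monitoring instance scrapes these three ports.
//!     for port in [9090, 9100, 9200] {
//!         rules.push(Ingress {
//!             from_port: port,
//!             to_port: port,
//!             cidr: format!("{monitoring_ip}/32"),
//!         });
//!     }
//!     // User-defined ports carry their own CIDR block.
//!     for (port, cidr) in user_ports {
//!         rules.push(Ingress {
//!             from_port: *port,
//!             to_port: *port,
//!             cidr: cidr.to_string(),
//!         });
//!     }
//!     rules
//! }
//! ```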
//!
//! ## Networking
//!
//! ### VPCs
//!
//! One per region with CIDR `10.<region-index>.0.0/16` (e.g., `10.0.0.0/16` for `us-east-1`).
//!
//! ### Subnets
//!
//! Single subnet per VPC (e.g., `10.<region-index>.1.0/24`), linked to a route table with an internet gateway.
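//!
//! As a rough illustration of this addressing scheme (not the deployer's own code), both CIDR blocks can be derived
//! from a region's index. The helper names `vpc_cidr` and `subnet_cidr` are hypothetical, and the sketch assumes the
//! index fits in a single octet.
//!
//! ```rust
//! /// Hypothetical helper: VPC CIDR block for the region at `region_index`.
//! fn vpc_cidr(region_index: u8) -> String {
//!     format!("10.{region_index}.0.0/16")
//! }
//!
//! /// Hypothetical helper: CIDR block of the single subnet inside that VPC.
//! fn subnet_cidr(region_index: u8) -> String {
//!     format!("10.{region_index}.1.0/24")
//! }
//! ```
//!
//! Because every region gets a distinct second octet, the VPC ranges never overlap, which VPC peering requires.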
//!
//! ### VPC Peering
//!
//! Connects the monitoring VPC to each binary VPC, with routes added to route tables for private communication.
//!
//! ### Security Groups
//!
//! Separate for monitoring (`{tag}`) and binary instances (`{tag}-binary`), dynamically configured for deployer and inter-instance traffic.
//!
//! # Workflow
//!
//! ## `ec2 create`
//!
//! 1. Validates configuration and generates an SSH key pair, stored in `$HOME/.commonware_deployer/{tag}/id_rsa_{tag}`.
//! 2. Creates VPCs, subnets, internet gateways, route tables, and security groups per region.
//! 3. Establishes VPC peering between the monitoring region and binary regions.
//! 4. Launches the monitoring instance, uploads service files, and installs Prometheus, Grafana, Loki, Pyroscope, and Tempo.
//! 5. Launches binary instances, uploads binaries, configurations, and `hosts.yaml`, and installs the binary alongside Promtail,
//!    Node Exporter, the Pyroscope agent, and the memleak agent.
//! 6. Configures BBR congestion control on all instances and updates the monitoring security group to accept Loki, Pyroscope,
//!    and Tempo traffic from binary instances.
//! 7. Marks completion with `$HOME/.commonware_deployer/{tag}/created`.
//!
//! ## `ec2 update`
//!
//! 1. Stops the `binary` service on each binary instance.
//! 2. Uploads the latest binary and configuration from the YAML config.
//! 3. Restarts the `binary` service, ensuring minimal downtime.
//!
//! ## `ec2 authorize`
//!
//! 1. Obtains the deployer's current public IP address (or parses the one provided).
//! 2. For each security group in the deployment, adds an ingress rule for the IP (if it doesn't already exist).
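//!
//! An ingress rule for a single address is expressed as a single-host CIDR block. The sketch below shows one way a
//! provided IP could be parsed and turned into such a block. The helper name `single_host_cidr` is hypothetical, and
//! whether the actual command accepts IPv6 addresses is an assumption here.
//!
//! ```rust
//! use std::net::{AddrParseError, IpAddr};
//!
//! /// Hypothetical helper: parse a user-provided IP and format it as a single-host CIDR block.
//! fn single_host_cidr(input: &str) -> Result<String, AddrParseError> {
//!     let ip: IpAddr = input.trim().parse()?;
//!     // A /32 (IPv4) or /128 (IPv6) prefix matches exactly one address.
//!     let prefix = match ip {
//!         IpAddr::V4(_) => 32,
//!         IpAddr::V6(_) => 128,
//!     };
//!     Ok(format!("{ip}/{prefix}"))
//! }
//! ```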
//!
//! ## `ec2 destroy`
//!
//! 1. Terminates all instances across regions.
//! 2. Deletes security groups, subnets, route tables, VPC peering connections, internet gateways, key pairs, and VPCs in dependency order.
//! 3. Marks destruction with `$HOME/.commonware_deployer/{tag}/destroyed`, retaining the directory to prevent tag reuse.
//!
//! # Persistence
//!
//! * A directory `$HOME/.commonware_deployer/{tag}` stores the SSH private key, service files, and status files (`created`, `destroyed`).
//! * The deployment state is tracked via these files, ensuring operations respect prior create/destroy actions.
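//!
//! The marker files can be read back to decide whether an operation is allowed. The following is a minimal sketch of
//! such a check, assuming only the file names documented above. The `DeploymentState` type and `deployment_state`
//! helper are hypothetical and are not part of this module.
//!
//! ```rust
//! use std::path::Path;
//!
//! /// Hypothetical summary of what the marker files imply.
//! #[derive(Debug, PartialEq)]
//! enum DeploymentState {
//!     /// No directory exists for the tag.
//!     Missing,
//!     /// The directory exists but `created` was never written.
//!     Incomplete,
//!     /// `created` exists and `destroyed` does not.
//!     Created,
//!     /// `destroyed` exists.
//!     Destroyed,
//! }
//!
//! /// Inspect the marker files inside a deployment directory.
//! fn deployment_state(dir: &Path) -> DeploymentState {
//!     if !dir.exists() {
//!         return DeploymentState::Missing;
//!     }
//!     // `destroyed` takes precedence because the directory is retained after destruction.
//!     if dir.join("destroyed").exists() {
//!         return DeploymentState::Destroyed;
//!     }
//!     if dir.join("created").exists() {
//!         return DeploymentState::Created;
//!     }
//!     DeploymentState::Incomplete
//! }
//! ```
//!
//! Under this reading, an update would proceed only from `Created`, while the other states would correspond to the
//! `DeploymentDoesNotExist`, `DeploymentNotComplete`, and `DeploymentAlreadyDestroyed` errors.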
//!
//! # Example Configuration
//!
//! ```yaml
//! tag: ffa638a0-991c-442c-8ec4-aa4e418213a5
//! monitoring:
//!   instance_type: t4g.small
//!   storage_size: 10
//!   storage_class: gp2
//!   dashboard: /path/to/dashboard.json
//! instances:
//!   - name: node1
//!     region: us-east-1
//!     instance_type: t4g.small
//!     storage_size: 10
//!     storage_class: gp2
//!     binary: /path/to/binary
//!     config: /path/to/config.conf
//!     profiling: true
//!   - name: node2
//!     region: us-west-2
//!     instance_type: t4g.small
//!     storage_size: 10
//!     storage_class: gp2
//!     binary: /path/to/binary2
//!     config: /path/to/config2.conf
//!     profiling: false
//! ports:
//!   - protocol: tcp
//!     port: 4545
//!     cidr: 0.0.0.0/0
//! ```

use serde::{Deserialize, Serialize};
use std::net::IpAddr;

cfg_if::cfg_if! {
    if #[cfg(feature="aws")] {
        use thiserror::Error;
        use std::path::PathBuf;

        pub mod aws;
        mod create;
        pub mod services;
        pub use create::create;
        mod update;
        pub use update::update;
        mod authorize;
        pub use authorize::authorize;
        mod destroy;
        pub use destroy::destroy;
        pub mod utils;

        /// Name of the monitoring instance
        const MONITORING_NAME: &str = "monitoring";

        /// AWS region where the monitoring instance is deployed
        const MONITORING_REGION: &str = "us-east-1";

        /// File name that indicates the deployment completed
        const CREATED_FILE_NAME: &str = "created";

        /// File name that indicates the deployment was destroyed
        const DESTROYED_FILE_NAME: &str = "destroyed";

        /// Port on instance where system metrics are exposed
        const SYSTEM_PORT: u16 = 9100;

        /// Port on instance where memleak metrics are exposed
        const MEMLEAK_PORT: u16 = 9200;

        /// Port on monitoring where logs are pushed
        const LOGS_PORT: u16 = 3100;

        /// Port on monitoring where profiles are pushed
        const PROFILES_PORT: u16 = 4040;

        /// Port on monitoring where traces are pushed
        const TRACES_PORT: u16 = 4318;

        /// Subcommand name
        pub const CMD: &str = "ec2";

        /// Create subcommand name
        pub const CREATE_CMD: &str = "create";

        /// Update subcommand name
        pub const UPDATE_CMD: &str = "update";

        /// Authorize subcommand name
        pub const AUTHORIZE_CMD: &str = "authorize";

        /// Destroy subcommand name
        pub const DESTROY_CMD: &str = "destroy";

        /// Directory where deployer files are stored
        fn deployer_directory(tag: &str) -> PathBuf {
            let base_dir = std::env::var("HOME").expect("$HOME is not configured");
            PathBuf::from(format!("{base_dir}/.commonware_deployer/{tag}"))
        }

        /// Errors that can occur when deploying infrastructure on AWS
        #[derive(Error, Debug)]
        pub enum Error {
            #[error("AWS EC2 error: {0}")]
            AwsEc2(#[from] aws_sdk_ec2::Error),
            #[error("AWS security group ingress error: {0}")]
            AwsSecurityGroupIngress(#[from] aws_sdk_ec2::operation::authorize_security_group_ingress::AuthorizeSecurityGroupIngressError),
            #[error("AWS describe instances error: {0}")]
            AwsDescribeInstances(#[from] aws_sdk_ec2::operation::describe_instances::DescribeInstancesError),
            #[error("IO error: {0}")]
            Io(#[from] std::io::Error),
            #[error("YAML error: {0}")]
            Yaml(#[from] serde_yaml::Error),
            #[error("creation already attempted")]
            CreationAttempted,
            #[error("invalid instance name: {0}")]
            InvalidInstanceName(String),
            #[error("reqwest error: {0}")]
            Reqwest(#[from] reqwest::Error),
            #[error("SCP failed")]
            ScpFailed,
            #[error("SSH failed")]
            SshFailed,
            #[error("keygen failed")]
            KeygenFailed,
            #[error("service timeout({0}): {1}")]
            ServiceTimeout(String, String),
            #[error("deployment does not exist: {0}")]
            DeploymentDoesNotExist(String),
            #[error("deployment is not complete: {0}")]
            DeploymentNotComplete(String),
            #[error("deployment already destroyed: {0}")]
            DeploymentAlreadyDestroyed(String),
            #[error("private key not found")]
            PrivateKeyNotFound,
            #[error("invalid IP address: {0}")]
            InvalidIpAddress(String),
        }
    }
}

/// Port on binary where metrics are exposed
pub const METRICS_PORT: u16 = 9090;

/// Host deployment information
#[derive(Serialize, Deserialize, Clone)]
pub struct Host {
    /// Name of the host
    pub name: String,

    /// Region where the host is deployed
    pub region: String,

    /// Public IP address of the host
    pub ip: IpAddr,
}

/// List of hosts
#[derive(Serialize, Deserialize, Clone)]
pub struct Hosts {
    /// Private IP address of the monitoring instance
    pub monitoring: IpAddr,

    /// Hosts deployed across all regions
    pub hosts: Vec<Host>,
}

/// Port configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct PortConfig {
    /// Protocol (e.g., "tcp")
    pub protocol: String,

    /// Port number
    pub port: u16,

    /// CIDR block
    pub cidr: String,
}

/// Instance configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct InstanceConfig {
    /// Name of the instance
    pub name: String,

    /// AWS region where the instance is deployed
    pub region: String,

    /// Instance type (only ARM-based instances are supported)
    pub instance_type: String,

    /// Storage size in GB
    pub storage_size: i32,

    /// Storage class (e.g., "gp2")
    pub storage_class: String,

    /// Path to the binary to deploy
    pub binary: String,

    /// Path to the binary configuration file
    pub config: String,

    /// Whether to enable profiling
    pub profiling: bool,
}

/// Monitoring configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct MonitoringConfig {
    /// Instance type (only ARM-based instances are supported)
    pub instance_type: String,

    /// Storage size in GB
    pub storage_size: i32,

    /// Storage class (e.g., "gp2")
    pub storage_class: String,

    /// Path to a custom dashboard file that is automatically
    /// uploaded to Grafana
    pub dashboard: String,
}

/// Deployer configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct Config {
    /// Unique tag for the deployment
    pub tag: String,

    /// Monitoring instance configuration
    pub monitoring: MonitoringConfig,

    /// Instance configurations
    pub instances: Vec<InstanceConfig>,

    /// Ports open on all instances
    pub ports: Vec<PortConfig>,
}