commonware_deployer/ec2/mod.rs
//! AWS EC2 deployer
//!
//! Deploy a custom binary (and configuration) to any number of EC2 instances across multiple regions. View metrics and logs
//! from all instances with Grafana.
//!
//! # Features
//!
//! * Automated creation, update, and destruction of EC2 instances across multiple regions
//! * Provide a unique name, instance type, region, binary, and configuration for each deployed instance
//! * Collect metrics, profiles (when enabled), and logs from all deployed instances on a long-lived monitoring instance
//!   (accessible only to the deployer's IP)
//!
//! # Architecture
//!
//! ```txt
//!                   Deployer's Machine (Public IP)
//!                                  |
//!                                  |
//!                                  v
//!                +-----------------------------------+
//!                | Monitoring VPC (us-east-1)        |
//!                |  - Monitoring Instance            |
//!                |    - Prometheus                   |
//!                |    - Loki                         |
//!                |    - Pyroscope                    |
//!                |    - Tempo                        |
//!                |    - Grafana                      |
//!                |  - Security Group                 |
//!                |    - All: Deployer IP             |
//!                |    - 3100: Binary VPCs            |
//!                |    - 4040: Binary VPCs            |
//!                |    - 4318: Binary VPCs            |
//!                +-----------------------------------+
//!                     ^                       ^
//!                (Telemetry)             (Telemetry)
//!                     |                       |
//!                     |                       |
//! +------------------------------+  +------------------------------+
//! | Binary VPC 1                 |  | Binary VPC 2                 |
//! |  - Binary Instance           |  |  - Binary Instance           |
//! |    - Binary A                |  |    - Binary B                |
//! |    - Promtail                |  |    - Promtail                |
//! |    - Node Exporter           |  |    - Node Exporter           |
//! |    - Pyroscope Agent         |  |    - Pyroscope Agent         |
//! |    - Memleak Agent           |  |    - Memleak Agent           |
//! |  - Security Group            |  |  - Security Group            |
//! |    - All: Deployer IP        |  |    - All: Deployer IP        |
//! |    - 9090: Monitoring IP     |  |    - 9090: Monitoring IP     |
//! |    - 9100: Monitoring IP     |  |    - 9100: Monitoring IP     |
//! |    - 9200: Monitoring IP     |  |    - 9200: Monitoring IP     |
//! |    - 8012: 0.0.0.0/0         |  |    - 8765: 12.3.7.9/32       |
//! +------------------------------+  +------------------------------+
//! ```
//!
//! ## Instances
//!
//! ### Monitoring
//!
//! * Deployed in `us-east-1` with a configurable ARM64 instance type (e.g., `t4g.small`) and storage (e.g., 10GB gp2).
//! * Runs:
//!     * **Prometheus**: Scrapes binary metrics at `:9090`, system metrics at `:9100`, and memleak metrics at `:9200` from all binary instances.
//!     * **Loki**: Listens at `:3100`, storing logs in `/loki/chunks` with a TSDB index at `/loki/index`.
//!     * **Pyroscope**: Listens at `:4040`, storing profiles in `/var/lib/pyroscope`.
//!     * **Tempo**: Listens at `:4318`, storing traces in `/var/lib/tempo`.
//!     * **Grafana**: Hosted at `:3000`, provisioned with Prometheus and Loki datasources and a custom dashboard.
//! * Ingress:
//!     * Allows deployer IP access (TCP 0-65535).
//!     * Allows binary instance traffic to Loki (TCP 3100), Pyroscope (TCP 4040), and Tempo (TCP 4318).
//!
//! ### Binary
//!
//! * Deployed in user-specified regions with configurable ARM64 instance types and storage.
//! * Run:
//!     * **Custom Binary**: Executes with `--hosts=/home/ubuntu/hosts.yaml --config=/home/ubuntu/config.conf`, exposing metrics at `:9090`.
//!     * **Promtail**: Forwards `/var/log/binary.log` to Loki on the monitoring instance.
//!     * **Node Exporter**: Exposes system metrics at `:9100`.
//!     * **Pyroscope Agent**: Forwards `perf` profiles to Pyroscope on the monitoring instance.
//!     * **Memleak Agent**: Exposes `memleak` metrics at `:9200`.
//! * Ingress:
//!     * Allows deployer IP access (TCP 0-65535).
//!     * Allows monitoring IP access to `:9090`, `:9100`, and `:9200` for Prometheus.
//!     * Opens user-defined ports from the configuration.
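//!
//! The binary ingress rules can be pictured as plain data. The sketch below is illustrative only and is not the
//! deployer's implementation: `IngressRule`, `binary_ingress`, and the CIDR arguments are hypothetical names, and the
//! real rules are created through the AWS SDK.
//!
//! ```rust
//! /// One TCP ingress rule: a port range and the source CIDR allowed to reach it.
//! #[derive(Debug, Clone, PartialEq)]
//! struct IngressRule {
//!     from_port: u16,
//!     to_port: u16,
//!     source_cidr: String,
//! }
//!
//! /// Assemble the rules for a binary instance (hypothetical helper).
//! fn binary_ingress(deployer_cidr: &str, monitoring_cidr: &str, user_ports: &[(u16, &str)]) -> Vec<IngressRule> {
//!     // Deployer: every TCP port.
//!     let mut rules = vec![IngressRule { from_port: 0, to_port: 65535, source_cidr: deployer_cidr.to_string() }];
//!     // Monitoring: binary, system, and memleak metrics scraped by Prometheus.
//!     for port in [9090, 9100, 9200] {
//!         rules.push(IngressRule { from_port: port, to_port: port, source_cidr: monitoring_cidr.to_string() });
//!     }
//!     // User-defined ports from the configuration.
//!     for &(port, cidr) in user_ports {
//!         rules.push(IngressRule { from_port: port, to_port: port, source_cidr: cidr.to_string() });
//!     }
//!     rules
//! }
//! ```
//!
//! With one user-defined port, `binary_ingress` yields five rules: one for the deployer, three for the monitoring
//! instance, and one for the user port.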
//!
//! ## Networking
//!
//! ### VPCs
//!
//! One per region with CIDR `10.<region-index>.0.0/16` (e.g., `10.0.0.0/16` for `us-east-1`).
//!
//! ### Subnets
//!
//! Single subnet per VPC (e.g., `10.<region-index>.1.0/24`), linked to a route table with an internet gateway.
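//!
//! The addressing scheme is easy to check by hand. A minimal sketch follows, assuming region indices are assigned in
//! list order (an assumption made for illustration); `vpc_cidr`, `subnet_cidr`, and `plan` are hypothetical helpers,
//! not functions exported by this module.
//!
//! ```rust
//! /// VPC CIDR for a region index: `10.<region-index>.0.0/16`.
//! fn vpc_cidr(region_index: u8) -> String {
//!     format!("10.{region_index}.0.0/16")
//! }
//!
//! /// Subnet CIDR inside that VPC: `10.<region-index>.1.0/24`.
//! fn subnet_cidr(region_index: u8) -> String {
//!     format!("10.{region_index}.1.0/24")
//! }
//!
//! /// Pair each region with its VPC and subnet CIDRs (indices follow list order).
//! fn plan<'a>(regions: &[&'a str]) -> Vec<(&'a str, String, String)> {
//!     (0u8..=255)
//!         .zip(regions.iter().copied())
//!         .map(|(idx, region)| (region, vpc_cidr(idx), subnet_cidr(idx)))
//!         .collect()
//! }
//! ```
//!
//! Because the second octet carries the region index, VPC ranges never overlap, which VPC peering requires.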
//!
//! ### VPC Peering
//!
//! Connects the monitoring VPC to each binary VPC, with routes added to route tables for private communication.
//!
//! ### Security Groups
//!
//! Separate groups for the monitoring instance (`{tag}`) and binary instances (`{tag}-binary`), dynamically configured for deployer and inter-instance traffic.
//!
//! # Workflow
//!
//! ## `ec2 create`
//!
//! 1. Validates configuration and generates an SSH key pair, stored in `$HOME/.commonware_deployer/{tag}/id_rsa_{tag}`.
//! 2. Creates VPCs, subnets, internet gateways, route tables, and security groups per region.
//! 3. Establishes VPC peering between the monitoring region and binary regions.
//! 4. Launches the monitoring instance, uploads service files, and installs Prometheus, Grafana, Loki, Pyroscope, and Tempo.
//! 5. Launches binary instances, uploads binaries, configurations, and `hosts.yaml`, and installs the binary together with Promtail and the other agents described above.
//! 6. Configures BBR on all instances and updates the monitoring security group to admit Loki, Pyroscope, and Tempo traffic from binary instances.
//! 7. Marks completion with `$HOME/.commonware_deployer/{tag}/created`.
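//!
//! As an illustration of the validation in step 1, the sketch below rejects instance names that are duplicated or that
//! collide with the monitoring instance. The exact rules enforced by `create` are an assumption here (inferred from
//! `InvalidInstanceName` and `MONITORING_NAME`), and `find_invalid_name` is a hypothetical helper.
//!
//! ```rust
//! use std::collections::HashSet;
//!
//! /// Reserved name of the monitoring instance (mirrors `MONITORING_NAME`).
//! const MONITORING_NAME: &str = "monitoring";
//!
//! /// Return the first name that is a duplicate or equals the reserved monitoring name.
//! fn find_invalid_name<'a>(names: &[&'a str]) -> Option<&'a str> {
//!     let mut seen = HashSet::new();
//!     names
//!         .iter()
//!         .copied()
//!         // `insert` returns false when the name was already present.
//!         .find(|name| *name == MONITORING_NAME || !seen.insert(*name))
//! }
//! ```
//!
//! A caller could map `Some(name)` to `Error::InvalidInstanceName(name.to_string())` before any AWS resource is created.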
//!
//! ## `ec2 update`
//!
//! 1. Stops the `binary` service on each binary instance.
//! 2. Uploads the latest binary and configuration from the YAML config.
//! 3. Restarts the `binary` service, so downtime is limited to the upload and restart.
//!
//! ## `ec2 authorize`
//!
//! 1. Obtains the deployer's current public IP address (or parses the one provided).
//! 2. For each security group in the deployment, adds an ingress rule for the IP (if it doesn't already exist).
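//!
//! The "add only if missing" behavior of step 2 can be sketched over an in-memory list of CIDRs. This is a simplified
//! model, not the AWS call itself: `parse_ip` and `authorize_ip` are hypothetical helpers, and the `/32` (IPv4) and
//! `/128` (IPv6) host masks are assumptions.
//!
//! ```rust
//! use std::net::IpAddr;
//!
//! /// Parse a user-supplied address, reporting failures in the style of `InvalidIpAddress`.
//! fn parse_ip(raw: &str) -> Result<IpAddr, String> {
//!     raw.trim()
//!         .parse::<IpAddr>()
//!         .map_err(|_| format!("invalid IP address: {raw}"))
//! }
//!
//! /// Add `ip` to a group's allowed CIDRs unless it is already present.
//! /// Returns `true` when a new rule was added.
//! fn authorize_ip(allowed: &mut Vec<String>, ip: IpAddr) -> bool {
//!     let cidr = match ip {
//!         IpAddr::V4(v4) => format!("{v4}/32"),
//!         IpAddr::V6(v6) => format!("{v6}/128"),
//!     };
//!     if allowed.contains(&cidr) {
//!         return false;
//!     }
//!     allowed.push(cidr);
//!     true
//! }
//! ```
//!
//! Running `authorize_ip` twice with the same address leaves the list unchanged the second time, so the command is
//! safe to repeat.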
//!
//! ## `ec2 destroy`
//!
//! 1. Terminates all instances across regions.
//! 2. Deletes security groups, subnets, route tables, VPC peering connections, internet gateways, key pairs, and VPCs in dependency order.
//! 3. Marks destruction with `$HOME/.commonware_deployer/{tag}/destroyed`, retaining the directory to prevent tag reuse.
//!
//! # Persistence
//!
//! * A directory `$HOME/.commonware_deployer/{tag}` stores the SSH private key, service files, and status files (`created`, `destroyed`).
//! * The deployment state is tracked via these files, ensuring operations respect prior create/destroy actions.
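//!
//! One way to read the state back from these files is sketched below. `DeploymentState` and `deployment_state` are
//! hypothetical names; the checks performed by the actual commands may be ordered or worded differently.
//!
//! ```rust
//! use std::path::Path;
//!
//! #[derive(Debug, PartialEq)]
//! enum DeploymentState {
//!     /// No deployer directory exists for the tag.
//!     Missing,
//!     /// The directory exists but `created` was never written.
//!     Incomplete,
//!     /// `created` exists and `destroyed` does not.
//!     Active,
//!     /// `destroyed` exists, so the tag cannot be reused.
//!     Destroyed,
//! }
//!
//! /// Classify a deployment by the status files in its directory.
//! fn deployment_state(dir: &Path) -> DeploymentState {
//!     if !dir.exists() {
//!         DeploymentState::Missing
//!     } else if dir.join("destroyed").exists() {
//!         DeploymentState::Destroyed
//!     } else if dir.join("created").exists() {
//!         DeploymentState::Active
//!     } else {
//!         DeploymentState::Incomplete
//!     }
//! }
//! ```
//!
//! The first, second, and fourth states line up with the `DeploymentDoesNotExist`, `DeploymentNotComplete`, and
//! `DeploymentAlreadyDestroyed` errors defined in this module.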
//!
//! # Example Configuration
//!
//! ```yaml
//! tag: ffa638a0-991c-442c-8ec4-aa4e418213a5
//! monitoring:
//!   instance_type: t4g.small
//!   storage_size: 10
//!   storage_class: gp2
//!   dashboard: /path/to/dashboard.json
//! instances:
//!   - name: node1
//!     region: us-east-1
//!     instance_type: t4g.small
//!     storage_size: 10
//!     storage_class: gp2
//!     binary: /path/to/binary
//!     config: /path/to/config.conf
//!     profiling: true
//!   - name: node2
//!     region: us-west-2
//!     instance_type: t4g.small
//!     storage_size: 10
//!     storage_class: gp2
//!     binary: /path/to/binary2
//!     config: /path/to/config2.conf
//!     profiling: false
//! ports:
//!   - protocol: tcp
//!     port: 4545
//!     cidr: 0.0.0.0/0
//! ```

use serde::{Deserialize, Serialize};
use std::net::IpAddr;

cfg_if::cfg_if! {
    if #[cfg(feature="aws")] {
        use thiserror::Error;
        use std::path::PathBuf;

        pub mod aws;
        mod create;
        pub mod services;
        pub use create::create;
        mod update;
        pub use update::update;
        mod authorize;
        pub use authorize::authorize;
        mod destroy;
        pub use destroy::destroy;
        pub mod utils;

        /// Name of the monitoring instance
        const MONITORING_NAME: &str = "monitoring";

        /// AWS region where the monitoring instance is deployed
        const MONITORING_REGION: &str = "us-east-1";

        /// File name that indicates the deployment completed
        const CREATED_FILE_NAME: &str = "created";

        /// File name that indicates the deployment was destroyed
        const DESTROYED_FILE_NAME: &str = "destroyed";

        /// Port on instance where system metrics are exposed
        const SYSTEM_PORT: u16 = 9100;

        /// Port on instance where memleak metrics are exposed
        const MEMLEAK_PORT: u16 = 9200;

        /// Port on monitoring where logs are pushed
        const LOGS_PORT: u16 = 3100;

        /// Port on monitoring where profiles are pushed
        const PROFILES_PORT: u16 = 4040;

        /// Port on monitoring where traces are pushed
        const TRACES_PORT: u16 = 4318;

        /// Subcommand name
        pub const CMD: &str = "ec2";

        /// Create subcommand name
        pub const CREATE_CMD: &str = "create";

        /// Update subcommand name
        pub const UPDATE_CMD: &str = "update";

        /// Authorize subcommand name
        pub const AUTHORIZE_CMD: &str = "authorize";

        /// Destroy subcommand name
        pub const DESTROY_CMD: &str = "destroy";

        /// Directory where deployer files are stored
        fn deployer_directory(tag: &str) -> PathBuf {
            let base_dir = std::env::var("HOME").expect("$HOME is not configured");
            PathBuf::from(format!("{base_dir}/.commonware_deployer/{tag}"))
        }

        /// Errors that can occur when deploying infrastructure on AWS
        #[derive(Error, Debug)]
        pub enum Error {
            #[error("AWS EC2 error: {0}")]
            AwsEc2(#[from] aws_sdk_ec2::Error),
            #[error("AWS security group ingress error: {0}")]
            AwsSecurityGroupIngress(#[from] aws_sdk_ec2::operation::authorize_security_group_ingress::AuthorizeSecurityGroupIngressError),
            #[error("AWS describe instances error: {0}")]
            AwsDescribeInstances(#[from] aws_sdk_ec2::operation::describe_instances::DescribeInstancesError),
            #[error("IO error: {0}")]
            Io(#[from] std::io::Error),
            #[error("YAML error: {0}")]
            Yaml(#[from] serde_yaml::Error),
            #[error("creation already attempted")]
            CreationAttempted,
            #[error("invalid instance name: {0}")]
            InvalidInstanceName(String),
            #[error("reqwest error: {0}")]
            Reqwest(#[from] reqwest::Error),
            #[error("SCP failed")]
            ScpFailed,
            #[error("SSH failed")]
            SshFailed,
            #[error("keygen failed")]
            KeygenFailed,
            #[error("service timeout({0}): {1}")]
            ServiceTimeout(String, String),
            #[error("deployment does not exist: {0}")]
            DeploymentDoesNotExist(String),
            #[error("deployment is not complete: {0}")]
            DeploymentNotComplete(String),
            #[error("deployment already destroyed: {0}")]
            DeploymentAlreadyDestroyed(String),
            #[error("private key not found")]
            PrivateKeyNotFound,
            #[error("invalid IP address: {0}")]
            InvalidIpAddress(String),
        }
    }
}

/// Port on binary where metrics are exposed
pub const METRICS_PORT: u16 = 9090;

/// Host deployment information
#[derive(Serialize, Deserialize, Clone)]
pub struct Host {
    /// Name of the host
    pub name: String,

    /// Region where the host is deployed
    pub region: String,

    /// Public IP address of the host
    pub ip: IpAddr,
}

/// List of hosts
#[derive(Serialize, Deserialize, Clone)]
pub struct Hosts {
    /// Private IP address of the monitoring instance
    pub monitoring: IpAddr,

    /// Hosts deployed across all regions
    pub hosts: Vec<Host>,
}

/// Port configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct PortConfig {
    /// Protocol (e.g., "tcp")
    pub protocol: String,

    /// Port number
    pub port: u16,

    /// CIDR block
    pub cidr: String,
}

/// Instance configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct InstanceConfig {
    /// Name of the instance
    pub name: String,

    /// AWS region where the instance is deployed
    pub region: String,

    /// Instance type (only ARM-based instances are supported)
    pub instance_type: String,

    /// Storage size in GB
    pub storage_size: i32,

    /// Storage class (e.g., "gp2")
    pub storage_class: String,

    /// Path to the binary to deploy
    pub binary: String,

    /// Path to the binary configuration file
    pub config: String,

    /// Whether to enable profiling
    pub profiling: bool,
}

/// Monitoring configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct MonitoringConfig {
    /// Instance type (only ARM-based instances are supported)
    pub instance_type: String,

    /// Storage size in GB
    pub storage_size: i32,

    /// Storage class (e.g., "gp2")
    pub storage_class: String,

    /// Path to a custom dashboard file that is automatically
    /// uploaded to Grafana
    pub dashboard: String,
}

/// Deployer configuration
#[derive(Serialize, Deserialize, Clone)]
pub struct Config {
    /// Unique tag for the deployment
    pub tag: String,

    /// Monitoring instance configuration
    pub monitoring: MonitoringConfig,

    /// Instance configurations
    pub instances: Vec<InstanceConfig>,

    /// Ports open on all instances
    pub ports: Vec<PortConfig>,
}