# Deployment Patterns
This guide provides architectural patterns and best practices for deploying network-protocol in production environments.
## Table of Contents
- [Deployment Topologies](#deployment-topologies)
- [Single-Node Deployment](#single-node-deployment)
- [Cluster Deployment](#cluster-deployment)
- [Edge Computing](#edge-computing)
- [Circuit Breaker Pattern](#circuit-breaker-pattern)
- [Monitoring and Observability](#monitoring-and-observability)
- [Security Considerations](#security-considerations)
- [Disaster Recovery](#disaster-recovery)
---
## Deployment Topologies
### Overview
The library supports three primary deployment patterns:
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Single-Node │ │ Cluster │ │ Edge/Hybrid │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌────┐ ┌────┐ │ │ ┌────┐ Cloud │
│ │ Server │ │ │ │ N1 │─│ N2 │ │ │ │Edge│────┐ │
│ └──────────┘ │ │ └────┘ └────┘ │ │ └────┘ ▼ │
│ │ │ │ │ │ │ │ │ ┌────┐ │
│ ┌────┴────┐ │ │ ┌─┴───────┴─┐ │ │ └───▶│Hub │ │
│ │Clients │ │ │ │ Clients │ │ │ └────┘ │
│ └─────────┘ │ │ └───────────┘ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
---
## Single-Node Deployment
### Use Cases
- Development and testing
- Low-traffic applications (<10k concurrent connections)
- Stateful applications with session affinity
- Cost-sensitive deployments
### Architecture
```rust
use network_protocol::{service::daemon, transport::tls, config::Config};
use std::sync::Arc;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Initialize logging
network_protocol::init();
// Configure TLS
let tls_config = tls::ServerConfig::builder()
.with_cert_and_key("cert.pem", "key.pem")
.build()?;
// Configure server
let config = Config {
bind_addr: "0.0.0.0:8443".parse()?,
max_connections: 1000,
tls: Some(Arc::new(tls_config)),
..Default::default()
};
// Start server
daemon::start_with_config(config).await?;
Ok(())
}
```
### Deployment Checklist
- [ ] Set up systemd service (Linux) or launchd (macOS)
- [ ] Configure log rotation
- [ ] Set resource limits (ulimit, memory)
- [ ] Enable automatic restart on failure
- [ ] Configure firewall rules
- [ ] Set up monitoring alerts
- [ ] Configure TLS certificates with auto-renewal
- [ ] Test graceful shutdown (see the sketch below)
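The last item above is worth sketching. The library's own shutdown hook is not shown here, so this is a minimal, generic tokio pattern under the assumption that your accept loop can observe a shutdown flag: a `watch` channel tells workers to stop, and a bounded grace period keeps `systemctl stop` from hanging.

```rust
use std::time::Duration;
use tokio::{signal, sync::watch, time};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Channel used to tell worker tasks to stop accepting new work
    let (shutdown_tx, mut shutdown_rx) = watch::channel(false);

    // Stand-in for the protocol server's accept loop; a real task would
    // check the shutdown flag between accepted connections.
    let server = tokio::spawn(async move {
        loop {
            tokio::select! {
                _ = shutdown_rx.changed() => break,           // shutdown requested
                _ = time::sleep(Duration::from_secs(1)) => {} // placeholder for accept()
            }
        }
    });

    // SIGINT from a terminal; under systemd you would also watch SIGTERM
    // via tokio::signal::unix
    signal::ctrl_c().await?;
    let _ = shutdown_tx.send(true);

    // Bounded grace period so `systemctl stop` never hangs indefinitely
    let _ = time::timeout(Duration::from_secs(30), server).await;
    Ok(())
}
```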
### Systemd Service Example (Linux)
```ini
[Unit]
Description=Network Protocol Server
After=network.target
[Service]
Type=simple
User=netprotocol
Group=netprotocol
WorkingDirectory=/opt/network-protocol
ExecStart=/opt/network-protocol/bin/server
Restart=always
RestartSec=10
LimitNOFILE=65536
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/network-protocol
[Install]
WantedBy=multi-user.target
```
```bash
# Enable and start service
sudo systemctl enable network-protocol
sudo systemctl start network-protocol
# Check status
sudo systemctl status network-protocol
# View logs
sudo journalctl -u network-protocol -f
```
### Resource Requirements
**Minimum:**
- CPU: 1 core
- RAM: 512 MB
- Disk: 100 MB (plus logs)
- Network: 100 Mbps
**Recommended (1000 concurrent connections):**
- CPU: 4 cores
- RAM: 4 GB
- Disk: 10 GB (with log rotation)
- Network: 1 Gbps
---
## Cluster Deployment
### Use Cases
- High availability requirements (99.9%+ uptime)
- High traffic (10k+ concurrent connections)
- Geographic distribution
- Load balancing and failover
### Architecture
```
┌──────────────┐
│ Load Balancer│
│ (HAProxy) │
└───────┬──────┘
│
┌────────────────┼────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ Primary │────│ Replica │───│ Replica │
└───────────┘ └───────────┘ └───────────┘
│ │ │
┌─────▼────────────────▼────────────────▼─────┐
│ Shared State (Redis/etcd) │
└──────────────────────────────────────────────┘
```
### Load Balancer Configuration (HAProxy)
```haproxy
global
maxconn 10000
log /dev/log local0
defaults
mode tcp
timeout connect 5s
timeout client 30s
timeout server 30s
log global
option tcplog
frontend network_protocol_frontend
bind *:8443
default_backend network_protocol_backend
backend network_protocol_backend
balance roundrobin
option tcp-check
server node1 10.0.1.10:8443 check inter 2s rise 2 fall 3
server node2 10.0.1.11:8443 check inter 2s rise 2 fall 3
server node3 10.0.1.12:8443 check inter 2s rise 2 fall 3
```
### Session Affinity
For stateful applications requiring session persistence:
```haproxy
backend network_protocol_backend
balance source # Use client IP for routing
hash-type consistent # Consistent hashing
# Or use cookie-based session affinity (requires `mode http`)
cookie SERVERID insert indirect nocache
server node1 10.0.1.10:8443 check cookie node1
server node2 10.0.1.11:8443 check cookie node2
```
### Health Checks
Implement health check endpoints:
```rust
use axum::{http::StatusCode, routing::get, Router};

async fn health_check() -> &'static str {
    "OK"
}

async fn readiness_check() -> (StatusCode, &'static str) {
    // Check if the server is ready to accept connections.
    // Return a non-2xx status when not ready so HTTP probes actually fail.
    if is_ready() {
        (StatusCode::OK, "READY")
    } else {
        (StatusCode::SERVICE_UNAVAILABLE, "NOT_READY")
    }
}

#[tokio::main]
async fn main() {
    let health_router = Router::new()
        .route("/health", get(health_check))
        .route("/ready", get(readiness_check));

    // Run the health check server on a separate port
    // (axum 0.6-style server; newer axum versions use axum::serve with a TcpListener)
    tokio::spawn(async move {
        axum::Server::bind(&"0.0.0.0:8080".parse().unwrap())
            .serve(health_router.into_make_service())
            .await
    });

    // Start the main protocol server
    network_protocol::service::daemon::start("0.0.0.0:8443").await.unwrap();
}
```
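`is_ready()` above is left to the application. A minimal sketch, assuming readiness simply means "startup finished", is a process-wide atomic flag flipped once the daemon has loaded its config and bound its listener:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Flipped to true once startup (config load, TLS init, port bind) completes
static READY: AtomicBool = AtomicBool::new(false);

fn is_ready() -> bool {
    READY.load(Ordering::Relaxed)
}

fn mark_ready() {
    READY.store(true, Ordering::Relaxed);
}
```

Richer checks (upstream reachability, pool health) can replace the flag, but keep them cheap: Kubernetes calls the probe every few seconds.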
### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: network-protocol
spec:
replicas: 3
selector:
matchLabels:
app: network-protocol
template:
metadata:
labels:
app: network-protocol
spec:
containers:
- name: server
image: myregistry/network-protocol:latest
ports:
- containerPort: 8443
name: protocol
- containerPort: 8080
name: health
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: network-protocol
spec:
type: LoadBalancer
selector:
app: network-protocol
ports:
- port: 8443
targetPort: 8443
protocol: TCP
```
---
## Edge Computing
### Use Cases
- IoT deployments
- Mobile edge computing
- Content delivery networks
- Latency-sensitive applications
### Architecture
```
Edge Nodes Regional Hub Central Cloud
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ ┌────┐ ┌────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │E1 │ │E2 │ │───────────▶│ │Regional │ │───────▶│ │ Cloud │ │
│ └────┘ └────┘ │ │ │ Hub │ │ │ │ Services │ │
│ Local │ │ └──────────┘ │ │ └──────────┘ │
│ Processing │ │ Aggregation │ │ Long-term │
└─────────────────┘ └──────────────────┘ │ Storage │
└─────────────────┘
```
### Edge Node Configuration
```rust
use network_protocol::{transport::local, config::EdgeConfig};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let config = EdgeConfig {
// Local IPC for sensors
local_socket: "/tmp/sensor.sock",
// Upstream connection to hub
hub_address: "hub.example.com:8443",
// Aggressive timeouts for edge
connection_timeout_ms: 5000,
// Buffer for offline operation
offline_buffer_size: 10000,
};
// Start edge server
start_edge_node(config).await?;
Ok(())
}
async fn start_edge_node(config: EdgeConfig) -> Result<(), Box<dyn std::error::Error>> {
    // Start the local IPC server for sensors.
    // The socket path is a `&'static str`, so copying it out first avoids
    // moving `config` into the spawned task (it is still needed below).
    let local_socket = config.local_socket;
    tokio::spawn(async move {
        local::start_server(local_socket).await
    });
// Connect to regional hub with retry logic
loop {
match connect_to_hub(&config).await {
Ok(connection) => {
handle_hub_connection(connection).await;
}
Err(e) => {
eprintln!("Hub connection failed: {}. Retrying...", e);
tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;
}
}
}
}
```
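The fixed five-second retry above is fine for a handful of nodes; a large edge fleet reconnecting at the same moment can overwhelm the hub. A common refinement is exponential backoff with an upper bound. This sketch reuses the same hypothetical `connect_to_hub` and `handle_hub_connection` helpers:

```rust
use std::time::Duration;

// Doubling backoff with a cap, reset after each successful connection
async fn connect_with_backoff(config: &EdgeConfig) {
    let mut delay = Duration::from_secs(1);
    let max_delay = Duration::from_secs(60);

    loop {
        match connect_to_hub(config).await {
            Ok(connection) => {
                delay = Duration::from_secs(1); // reset on success
                handle_hub_connection(connection).await;
            }
            Err(e) => {
                eprintln!("Hub connection failed: {e}. Retrying in {delay:?}...");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```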
### Offline-First Design
```rust
use std::collections::VecDeque;
struct OfflineBuffer {
buffer: VecDeque<Message>,
max_size: usize,
}
impl OfflineBuffer {
fn new(max_size: usize) -> Self {
Self {
buffer: VecDeque::with_capacity(max_size),
max_size,
}
}
fn push(&mut self, message: Message) {
if self.buffer.len() >= self.max_size {
// Drop oldest message when full
self.buffer.pop_front();
}
self.buffer.push_back(message);
}
async fn flush_to_hub(&mut self, hub: &mut Connection) -> Result<(), Error> {
while let Some(message) = self.buffer.pop_front() {
hub.send(message).await?;
}
Ok(())
}
}
```
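Tying the pieces together, here is a sketch of an edge loop that buffers while offline and drains the backlog whenever the hub link comes back. `Message`, `Connection`, and `connect_to_hub` are the same placeholders used above, and the mpsc receiver standing in for the local sensor feed is an assumption of this sketch:

```rust
use tokio::sync::mpsc::Receiver;

async fn edge_loop(mut sensor_rx: Receiver<Message>, config: EdgeConfig) {
    let mut buffer = OfflineBuffer::new(config.offline_buffer_size);
    let mut hub: Option<Connection> = None;

    while let Some(message) = sensor_rx.recv().await {
        buffer.push(message);

        // (Re)connect lazily and drain the backlog while the link is up
        if hub.is_none() {
            hub = connect_to_hub(&config).await.ok();
        }
        if let Some(conn) = hub.as_mut() {
            if buffer.flush_to_hub(conn).await.is_err() {
                hub = None; // drop the broken connection, keep buffering
            }
        }
    }
}
```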
---
## Circuit Breaker Pattern
The circuit breaker prevents cascading failures in distributed systems: when calls to a dependency keep failing, the breaker opens and fails fast instead of piling up timeouts, then periodically lets a trial call through to see whether the dependency has recovered.
### Implementation
```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Error returned by the breaker: either the circuit is open,
/// or the wrapped call failed with its own error.
#[derive(Debug)]
enum CircuitError<E> {
    Open,
    Inner(E),
}

#[derive(Clone)]
struct CircuitBreaker {
    failure_count: Arc<AtomicU64>,
    last_failure: Arc<AtomicU64>, // seconds since `started` at the last failure
    is_open: Arc<AtomicBool>,
    threshold: u64,
    timeout: Duration,
    started: Instant, // fixed reference point for the timestamps above
}

impl CircuitBreaker {
    fn new(threshold: u64, timeout: Duration) -> Self {
        Self {
            failure_count: Arc::new(AtomicU64::new(0)),
            last_failure: Arc::new(AtomicU64::new(0)),
            is_open: Arc::new(AtomicBool::new(false)),
            threshold,
            timeout,
            started: Instant::now(),
        }
    }

    fn is_open(&self) -> bool {
        self.is_open.load(Ordering::Relaxed)
    }

    async fn call<F, T, E>(&self, f: F) -> Result<T, CircuitError<E>>
    where
        F: FnOnce() -> Result<T, E>,
    {
        // Check if the circuit is open
        if self.is_open() {
            if self.elapsed_since_failure() < self.timeout {
                return Err(CircuitError::Open);
            }
            // Timeout elapsed: allow one trial call (half-open state)
            self.is_open.store(false, Ordering::Relaxed);
        }

        // Execute the function
        match f() {
            Ok(result) => {
                // Success - reset the failure count
                self.failure_count.store(0, Ordering::Relaxed);
                Ok(result)
            }
            Err(e) => {
                // Failure - increment the counter and record when it happened
                let failures = self.failure_count.fetch_add(1, Ordering::Relaxed) + 1;
                self.last_failure
                    .store(self.started.elapsed().as_secs(), Ordering::Relaxed);

                // Open the circuit if the threshold is exceeded
                if failures >= self.threshold {
                    self.is_open.store(true, Ordering::Relaxed);
                }
                Err(CircuitError::Inner(e))
            }
        }
    }

    fn elapsed_since_failure(&self) -> Duration {
        let last = self.last_failure.load(Ordering::Relaxed);
        let now = self.started.elapsed().as_secs();
        Duration::from_secs(now.saturating_sub(last))
    }
}
```
### Usage
```rust
let circuit_breaker = CircuitBreaker::new(
    5,                       // Open after 5 failures
    Duration::from_secs(30), // Try again after 30 seconds
);

loop {
    match circuit_breaker.call(|| connect_to_service()).await {
        Ok(connection) => {
            // Use the connection
        }
        Err(CircuitError::Open) => {
            // Circuit open - use a fallback
            use_fallback_service().await;
        }
        Err(CircuitError::Inner(e)) => {
            // Regular error - handle normally
            handle_error(e);
        }
    }
}
```
---
## Monitoring and Observability
### Metrics to Track
#### Application Metrics
```rust
use network_protocol::utils::metrics;
use std::time::Duration;
// Periodically export metrics
tokio::spawn(async {
loop {
let stats = metrics::get_stats();
// Export to monitoring system
export_metric("handshakes_total", stats.handshakes_completed);
export_metric("messages_sent", stats.messages_sent);
export_metric("messages_received", stats.messages_received);
export_metric("connections_active", stats.connections_active);
export_metric("errors_total", stats.errors_total);
tokio::time::sleep(Duration::from_secs(60)).await;
}
});
```
#### System Metrics
Monitor host-level metrics:
- **CPU**: Usage per core, load average
- **Memory**: RSS, heap usage, page faults
- **Network**: Bandwidth, packets/sec, errors
- **Disk**: I/O operations, latency, space
### Prometheus Integration
```rust
use lazy_static::lazy_static;
use prometheus::{Counter, Encoder, Gauge, Registry, TextEncoder};
lazy_static! {
static ref REGISTRY: Registry = Registry::new();
static ref CONNECTIONS: Gauge = Gauge::new("connections_active", "Active connections")
.expect("metric creation");
static ref MESSAGES: Counter = Counter::new("messages_total", "Total messages")
.expect("metric creation");
}
fn init_metrics() {
REGISTRY.register(Box::new(CONNECTIONS.clone())).unwrap();
REGISTRY.register(Box::new(MESSAGES.clone())).unwrap();
}
async fn metrics_handler() -> String {
let encoder = TextEncoder::new();
let metric_families = REGISTRY.gather();
let mut buffer = vec![];
encoder.encode(&metric_families, &mut buffer).unwrap();
String::from_utf8(buffer).unwrap()
}
```
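To expose these metrics to Prometheus, the handler can be mounted on the same port-8080 axum router used for health checks earlier; updating the instruments is a matter of calling `inc()`, `dec()`, or `set()` wherever the events occur. The hook names below are illustrative, not part of the library:

```rust
use axum::{routing::get, Router};

fn observability_router() -> Router {
    Router::new()
        .route("/health", get(|| async { "OK" }))
        .route("/metrics", get(metrics_handler))
}

// Hypothetical hooks called from your connection and message handling code
fn on_connection_opened() {
    CONNECTIONS.inc(); // gauge goes up on accept...
}

fn on_connection_closed() {
    CONNECTIONS.dec(); // ...and down on disconnect
}

fn on_message() {
    MESSAGES.inc(); // counters only ever increase
}
```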
### Alert Configuration
Example Prometheus alert rules:
```yaml
groups:
- name: network_protocol_alerts
rules:
# High error rate
- alert: HighErrorRate
expr: rate(errors_total[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate detected"
# Service down
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service is down"
# High latency
- alert: HighLatency
expr: histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "P99 latency above 1 second"
```
---
## Security Considerations
### TLS Configuration
Always use TLS in production:
```rust
use network_protocol::transport::tls;
// Load certificates
let tls_config = tls::ServerConfig::builder()
.with_cert_and_key("fullchain.pem", "privkey.pem")
.with_client_auth_optional() // For mTLS
.build()?;
```
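For mutual TLS the client also presents a certificate. The exact client-side API depends on the library version; the sketch below assumes a `tls::ClientConfig` builder that mirrors the server builder shown above, including a hypothetical `with_ca_file` for pinning the CA used to verify the server.

```rust
use network_protocol::transport::tls;

// Hypothetical client-side builder mirroring the server API above
let client_tls = tls::ClientConfig::builder()
    .with_cert_and_key("client-cert.pem", "client-key.pem") // presented to the server
    .with_ca_file("ca.pem")                                 // used to verify the server
    .build()?;
```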
### Certificate Management
```bash
# Let's Encrypt with certbot
sudo certbot certonly --standalone -d example.com
# Auto-renewal (cron): run daily; certbot only renews certificates nearing expiry
0 3 * * * certbot renew --quiet --deploy-hook "systemctl reload network-protocol"
```
### Network Segmentation
```
Internet
│
▼
┌──────────┐
│ Firewall │ (Allow 8443/tcp)
└────┬─────┘
│
▼
┌──────────┐
│ DMZ │ (Public-facing nodes)
└────┬─────┘
│
▼
┌──────────┐
│ Internal │ (Backend services)
└──────────┘
```
### Firewall Rules (iptables)
```bash
# Allow loopback and established connections
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Allow SSH for management and the protocol port
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 8443 -j ACCEPT
# Drop everything else
iptables -A INPUT -j DROP
```
---
## Disaster Recovery
### Backup Strategy
**What to back up:**
- Configuration files
- TLS certificates and keys
- Application logs (if needed)
- State databases (if applicable)
```bash
#!/bin/bash
# Daily backup script
set -euo pipefail

BACKUP_DIR="/backup/network-protocol/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Backup configuration
cp -r /etc/network-protocol "$BACKUP_DIR/"

# Backup certificates
cp -r /etc/letsencrypt "$BACKUP_DIR/"

# Compress (archive relative to the parent directory, then remove the working copy)
tar -czf "$BACKUP_DIR.tar.gz" -C "$(dirname "$BACKUP_DIR")" "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"

# Upload to S3 (optional)
aws s3 cp "$BACKUP_DIR.tar.gz" s3://backups/network-protocol/
```
### Disaster Recovery Plan
1. **Detection**: Monitor alerts for service degradation
2. **Assessment**: Determine scope of failure
3. **Failover**: Switch to standby nodes/region
4. **Recovery**: Restore service from backups
5. **Post-mortem**: Document incident and improve
### Testing DR Procedures
Run a quarterly DR drill:

1. Simulate node failure
2. Verify automatic failover
3. Test backup restoration
4. Measure recovery time (RTO)
5. Document results
---
## Best Practices Summary
### Do's ✅
- Use TLS for all production traffic
- Implement health checks and monitoring
- Configure resource limits
- Use circuit breakers for external dependencies
- Implement graceful shutdown
- Log structured data (see the sketch after this list)
- Test disaster recovery procedures
- Document configuration changes
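"Log structured data" in practice usually means one JSON object per log line, so collectors can parse fields without regexes. A minimal sketch using the `tracing` ecosystem (the JSON formatter requires the `json` feature of `tracing-subscriber`; the field names are illustrative):

```rust
use tracing::info;

fn init_logging() {
    // Emit one JSON object per log line
    tracing_subscriber::fmt()
        .json()
        .with_target(false)
        .init();
}

fn on_handshake(peer: &str, elapsed_ms: u64) {
    // Fields become JSON keys instead of being interpolated into the message
    info!(peer, elapsed_ms, "handshake completed");
}
```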
### Don'ts ❌
- Don't run as root
- Don't use self-signed certs in production
- Don't ignore security updates
- Don't skip load testing
- Don't deploy without monitoring
- Don't hardcode secrets
- Don't skip backups
---
## Additional Resources
- [Performance Tuning Guide](./TUNING.md)
- [Security Model](../THREAT_MODEL.md)
- [Architecture Overview](../ARCHITECTURE.md)
- [Kubernetes Best Practices](https://kubernetes.io/docs/concepts/configuration/overview/)
---
## Appendix: Container & Orchestration Examples
### Docker Container
**Dockerfile** (production-ready):
```dockerfile
FROM rust:1.75-alpine AS builder
RUN apk add --no-cache musl-dev openssl-dev
WORKDIR /app
COPY . .
RUN cargo build --release
FROM alpine:3.19
RUN apk add --no-cache ca-certificates && \
adduser -D -u 1000 protocol
COPY --from=builder /app/target/release/daemon /app/daemon
USER protocol
EXPOSE 8443
CMD ["/app/daemon", "--config", "/app/config.toml"]
```
### Systemd Service
**/etc/systemd/system/network-protocol.service**:
```ini
[Unit]
Description=Network Protocol Service
After=network.target
[Service]
Type=simple
User=protocol
ExecStart=/opt/network-protocol/bin/daemon --config /etc/network-protocol/config.toml
Restart=on-failure
NoNewPrivileges=true
LimitNOFILE=65536
MemoryMax=2G
[Install]
WantedBy=multi-user.target
```
### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: network-protocol
spec:
replicas: 3
selector:
matchLabels:
app: network-protocol
template:
metadata:
labels:
app: network-protocol
spec:
containers:
- name: server
image: network-protocol:1.2.1
ports:
- containerPort: 8443
resources:
limits:
memory: "1Gi"
cpu: "1000m"
requests:
memory: "256Mi"
cpu: "250m"
livenessProbe:
tcpSocket:
port: 8443
periodSeconds: 30
readinessProbe:
tcpSocket:
port: 8443
periodSeconds: 10
```
### Connection Pooling Configuration
```rust
use network_protocol::service::pool::{ConnectionPool, PoolConfig};
use std::time::Duration;
let config = PoolConfig {
min_size: 10, // Pre-warm connections
max_size: 100, // Maximum concurrent
idle_timeout: Duration::from_secs(300), // 5 min
max_lifetime: Duration::from_secs(3600), // 1 hour
};
// `factory` is the application-supplied constructor that opens new connections
let pool = ConnectionPool::new(factory, config)?;
let conn = pool.acquire().await?; // Reuses existing or creates new
```
---
For specific deployment scenarios or questions, please file an issue on GitHub.