bssh 0.5.1 - Docs.rs

# bssh Architecture Documentation

## Overview

bssh (Backend.AI SSH) is a high-performance parallel SSH command execution tool designed for managing Backend.AI clusters. This document describes the detailed architecture, implementation decisions, and design patterns used in the project.

## System Architecture

```
┌─────────────────────────────────────────────────────────┐
│                     CLI Interface                       │
│                       (main.rs)                         │
└─────────────────────┬───────────────────────────────────┘
                      │
        ┌─────────────┼───────────────┐
        ▼             ▼               ▼
┌──────────────┐ ┌───────────┐ ┌─────────────┐
│   Commands   │ │  Config   │ │    Utils    │
│   Module     │ │  Manager  │ │   Module    │
│ (commands/*) │ │(config.rs)│ │  (utils/*)  │
└──────┬───────┘ └───────────┘ └─────────────┘
        │
        ▼
┌──────────────┐           ┌──────────────┐  ┌──────────┐
│   Executor   │◄──────────┤     Node     │  │    UI    │
│  (Parallel)  │           │    Parser    │  │  System  │
│(executor.rs) │           │  (node.rs)   │  │ (ui.rs)  │
└──────┬───────┘           └──────────────┘  └──────────┘
       │
       ├──────────┬────────────┐
       ▼          ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│   SSH    │ │   SSH    │ │   SSH    │
│  Client  │ │  Client  │ │  Client  │
│ (async)  │ │ (async)  │ │ (async)  │
└──────────┘ └──────────┘ └──────────┘
```

### Modular Design (Refactored 2025-01-22)

The codebase has been restructured for better maintainability and scalability:

1. **Minimal Entry Point (`main.rs`):**
   - Reduced from 987 lines to ~150 lines
   - Only handles CLI parsing and command dispatching
   - Delegates all business logic to specialized modules

2. **Command Modules (`commands/`):**
   - `exec.rs`: Command execution with output management
   - `ping.rs`: Connectivity testing
   - `interactive.rs`: Interactive shell sessions (Phase 1 completed)
   - `list.rs`: Cluster listing
   - `upload.rs`: File upload operations
   - `download.rs`: File download operations
   - Each module is self-contained and independently testable

3. **Utility Modules (`utils/`):**
   - `fs.rs`: File system operations (glob patterns, directory walking)
   - `output.rs`: Command output file management
   - `logging.rs`: Logging initialization
   - Reusable across different commands

## Component Details

### 1. CLI Interface (`cli.rs`, `main.rs`)

**Design Decisions:**
- Uses clap v4 with derive macros for type-safe argument parsing
- Subcommand pattern for different operations (exec, list, ping, upload, download)
- Environment variable support via `env` attribute
- **Refactored (2025-01-22):** Separated command logic from main.rs

**Implementation:**
```rust
// main.rs - Minimal dispatcher
async fn main() -> Result<()> {
    let cli = Cli::parse();
    match cli.command {
        Commands::Exec { .. } => exec::execute_command(params).await,
        Commands::List => list::list_clusters(&config),
        Commands::Ping => ping::ping_nodes(nodes, ...).await,
        Commands::Upload { .. } => upload::upload_file(params, ...).await,
        Commands::Download { .. } => download::download_file(params, ...).await,
    }
}
```

**Trade-offs:**
- Derive macros increase compile time but provide better type safety
- Subcommand pattern adds complexity but improves UX
- Modular structure increases file count but improves testability

### 2. Configuration Management (`config.rs`)

**Design Decisions:**
- YAML format for human readability
- Hierarchical configuration with cluster → nodes structure
- Default values with override capability
- Full XDG Base Directory specification compliance

**Configuration Loading Priority:**
1. Backend.AI environment variables (auto-detection)
2. Current directory (`./config.yaml`)
3. XDG config directory (`$XDG_CONFIG_HOME/bssh/config.yaml` or `~/.config/bssh/config.yaml`)
4. CLI specified path (via `--config` flag)

**XDG Support:**
- Respects `$XDG_CONFIG_HOME` environment variable
- Uses `directories` crate's `ProjectDirs` for platform-specific paths
- Follows XDG Base Directory specification
- Tilde expansion for paths using `shellexpand`

**Key Features:**
- Lazy loading of configuration
- Validation at parse time
- Support for both file-based and CLI-specified nodes
- ✅ Environment variable expansion (Phase 1 - Completed 2025-08-21)
  - Supports `${VAR}` and `$VAR` syntax
  - Expands in hostnames and usernames
  - Graceful fallback for undefined variables

**Data Model:**
```rust
pub struct Config {
    pub clusters: HashMap<String, Cluster>,
    pub default_cluster: Option<String>,
    pub ssh_config: SshConfig,
}

pub struct Cluster {
    pub nodes: Vec<Node>,
    pub ssh_key: Option<PathBuf>,
    pub user: Option<String>,
}
```

### 3. Parallel Executor (`executor.rs`)

**Design Decisions:**
- Tokio-based async execution for maximum concurrency
- Semaphore-based concurrency limiting to prevent resource exhaustion
- Progress bar visualization using `indicatif`
- Streaming output collection for real-time feedback

**Concurrency Model:**
```rust
let semaphore = Arc::new(Semaphore::new(max_parallel));
let tasks: Vec<JoinHandle<Result<ExecutionResult>>> = nodes
    .into_iter()
    .map(|node| {
        let permit = semaphore.clone().acquire_owned();
        tokio::spawn(async move {
            let _permit = permit.await;
            execute_on_node(node, command).await
        })
    })
    .collect();
```

**Performance Optimizations:**
- Connection reuse within same node (planned)
- Buffered I/O for output collection
- Early termination on critical failures

### 4. SSH Client (`ssh/client.rs`)

**Library Choice: async-ssh2-tokio**
- **Why not thrussh:** async-ssh2-tokio provides simpler API and better OpenSSH compatibility
- **Why not openssh:** Need fine-grained control over connections
- **Why not ssh2:** Need async/await support for concurrent operations

**Implementation Details:**
- Async/await pattern for non-blocking I/O
- Support for both key-based and agent authentication
- Configurable timeouts and retry logic

**Security Implementation (Phase 1 - Completed 2025-08-21):**
- ✅ Host key verification with three modes:
  - `StrictHostKeyChecking::Yes` - Strict verification using known_hosts
  - `StrictHostKeyChecking::No` - Skip all verification
  - `StrictHostKeyChecking::AcceptNew` - TOFU mode (limited by library)
- ✅ CLI flag `--strict-host-key-checking` with default "accept-new"
- ✅ Uses system known_hosts file (~/.ssh/known_hosts)

**Remaining Limitations:**
- Missing SFTP support for file operations
- Accept-new mode falls back to NoCheck due to library limitations
- Connection reuse not possible with async-ssh2-tokio (see Connection Pooling section)

### 5. Connection Pooling (`ssh/pool.rs`)

**Current Status:** Placeholder implementation (Phase 3, 2025-08-21)

**Design Decision:**
After thorough analysis, connection pooling was determined to be **not beneficial** for bssh's current usage pattern. The implementation exists as a placeholder for future features.

**Analysis Results:**
- **Current Usage Pattern:** Each CLI invocation executes exactly one operation per host then terminates
- **No Reuse Scenarios:** There are no cases where connections would be reused within a single bssh execution
- **Library Limitation:** async-ssh2-tokio's `Client` type doesn't support cloning or connection reuse
- **Performance Impact:** Zero benefit for current one-shot command execution model

**When Pooling Would Be Beneficial:**
- Interactive mode with persistent shell sessions
- Watch mode for periodic command execution
- Server mode providing an HTTP API
- Batch command execution from files
- Command pipelining on the same hosts

**Implementation:**
```rust
pub struct ConnectionPool {
    _connections: Arc<RwLock<Vec<ConnectionKey>>>,  // Placeholder
    ttl: Duration,
    enabled: bool,
    max_connections: usize,
}
```

**Current Behavior:**
- Always creates new connections regardless of `enabled` flag
- Provides API surface for future pooling implementation
- No performance overhead when disabled (default)

**Recommendation:**
Focus on more impactful optimizations like:
- Connection timeout tuning
- SSH compression for large outputs
- Buffered I/O optimizations
- Early termination on critical failures
- Parallel DNS resolution

### 6. Node Management (`node.rs`)

**Design Decisions:**
- Flexible parsing supporting multiple formats
- Smart defaults (port 22, current user)
- Validation at parse time

**Supported Formats:**
- `hostname` → Simple hostname
- `user@hostname` → With username
- `hostname:port` → With custom port
- `user@hostname:port` → Full specification
- `[ipv6::addr]:port` → IPv6 support

## Data Flow

### Command Execution Flow

1. **CLI Parsing** → Parse arguments and load configuration
2. **Node Resolution** → Determine target nodes from config or CLI
3. **Executor Setup** → Create semaphore and progress bars
4. **Parallel Spawn** → Launch tokio tasks for each node
5. **SSH Connection** → Establish authenticated SSH session
6. **Command Execution** → Run command and collect output
7. **Result Aggregation** → Collect all results and report

### Error Handling Strategy

- **Connection Failures:** Report per-node, continue with others
- **Authentication Failures:** Fail fast with clear error message
- **Command Failures:** Report exit code, continue execution
- **Timeout Handling:** Configurable per-operation timeouts

## Performance Characteristics

### Benchmarks (Target)

| Nodes | Command | Time | Memory |
|-------|---------|------|--------|
| 10    | uptime  | <2s  | <50MB  |
| 100   | uptime  | <5s  | <200MB |
| 1000  | uptime  | <30s | <1GB   |

### Bottlenecks

1. **SSH Handshake:** ~200-500ms per connection
2. **Memory:** Output buffering for large responses
3. **CPU:** Minimal, mostly I/O bound

### Optimization Strategies

1. **Connection Pooling:** Reuse connections for multiple commands
2. **Pipelining:** Send multiple commands in single session
3. **Compression:** Enable SSH compression for large outputs
4. **Caching:** Cache host keys and authentication

## Interactive Mode Architecture

### Overview

Interactive mode provides persistent shell sessions with single-node or multiplexed multi-node support. This feature enables real-time interaction with cluster nodes, maintaining stateful connections for extended operations.

### Design Decisions

1. **PTY Support:** 
   - Full pseudo-terminal allocation for proper shell interaction
   - Terminal size detection and dynamic resizing
   - ANSI escape sequence support for colored output

2. **Session Management:**
   - Persistent SSH connections with keep-alive
   - Graceful reconnection on connection drops
   - Session state tracking (working directory, environment)

3. **Input/Output Multiplexing:**
   - Commands broadcast to all nodes simultaneously
   - Node-prefixed output with color coding
   - Visual status indicators (● connected, ○ disconnected)

### Implementation Details

```rust
struct NodeSession {
    node: Node,
    client: Client,
    channel: Channel<Msg>,
    working_dir: String,
    is_connected: bool,
}
```

### Modes of Operation

1. **Single-Node Mode (`--single-node`):**
   - Interactive shell on one selected node
   - Full terminal emulation
   - Command history with rustyline

2. **Multiplex Mode (default):**
   - Commands sent to all nodes
   - Synchronized output display
   - Node status tracking

### Future Enhancements (Phase 2-3)

- Node switching with `!node1`, `!node2` commands
- Session persistence and detach/reattach
- Full TUI with ratatui (split panes, monitoring)
- File manager integration
- Performance metrics visualization

## Security Model

### Current Implementation

- SSH key-based authentication
- No password storage
- Agent forwarding support

### Planned Improvements

1. **Host Key Verification:**
   - Known_hosts file support
   - TOFU (Trust On First Use) mode
   - Strict mode with pre-shared keys

2. **Audit Logging:**
   - Command execution history
   - Connection attempts
   - Authentication failures

3. **Secrets Management:**
   - Integration with system keyring
   - Encrypted configuration support

## User Interface System (`ui.rs`)

### Design Philosophy
The UI system provides a modern, clean, and elegant command-line interface with semantic colors and Unicode symbols for better visual hierarchy and user experience.

### Key Components

1. **Color Scheme:**
   - **Cyan**: Headers, prompts, and informational elements
   - **Green**: Success indicators and positive outcomes
   - **Red**: Failure indicators and errors
   - **Yellow**: Counts, numbers, and warnings
   - **Blue**: Active/processing states
   - **Dimmed**: Secondary information and decorative elements

2. **Unicode Symbols:**
   - `●` (filled circle): Status indicators (colored based on state)
   - `○` (empty circle): Pending/inactive state
   - `◐/◑` (partial circles): In-progress animations
   - `▶` (triangle): Section headers and actions
   - `•` (bullet): List items
   - `└` (corner): Error details and nested information
   - `✓/✗`: Success/failure checkmarks

3. **UI Components:**

   **NodeStatus Enum:**
   - Represents the current state of a node (Pending, Connecting, Executing, Success, Failed)
   - Provides colored symbols and text representations

   **NodeGrid:**
   - Compact grid layout for displaying multiple node statuses
   - Responsive to terminal width
   - Shows real-time status updates during execution

   **OutputFormatter:**
   - Formats command output with proper indentation and wrapping
   - Handles terminal width constraints
   - Provides consistent formatting for headers, summaries, and results

### Implementation Details

```rust
pub enum NodeStatus {
    Pending,
    Connecting,
    Executing,
    Success,
    Failed(String),
}

impl NodeStatus {
    pub fn symbol(&self) -> String {
        match self {
            NodeStatus::Pending => "○".dimmed(),
            NodeStatus::Connecting => "◐".yellow(),
            NodeStatus::Executing => "◑".blue(),
            NodeStatus::Success => "●".green(),
            NodeStatus::Failed(_) => "●".red(),
        }
    }
}
```

### Progress Indicators
- Uses `indicatif` for animated progress spinners during execution
- Custom tick characters for smooth animation: `⣾⣽⣻⣟⣯⣷⣿`
- Per-node progress bars with status messages

### Terminal Responsiveness
- Detects terminal width using `terminal_size` crate
- Adapts output formatting based on available space
- Wraps long lines intelligently while preserving indentation

### Output Examples

**Command Execution:**
```
► Executing on 3 nodes:
  echo 'test'

[node1] ⣾ Connecting...
[node2] ◑ Executing...
[node3] ● Success

✓ node1
  test output

✗ node2 - Failed
  └ Connection timeout

════════════════════════════════════════
 Summary: 3 nodes • 2 successful • 1 failed
════════════════════════════════════════
```

**Cluster Listing:**
```
▶ Available clusters

  ● production (5 nodes)
    • prod-1.example.com
    • prod-2.example.com
    ...
    
  ● staging (2 nodes)
    • stage-1.example.com
    • stage-2.example.com
```

## Testing Strategy

### Unit Tests

- Configuration parsing edge cases
- Node format parsing
- Error handling paths

### Integration Tests

- Mock SSH server for protocol testing
- Docker-based real SSH testing
- Cluster simulation

### Coverage Goals

- Core modules: >90%
- SSH client: >80%
- Overall: >85%

## Future Improvements

### Short-term (v0.2)

- [ ] Implement proper host key verification
- [ ] Add connection pooling
- [ ] Complete file copy functionality
- [ ] Add dry-run mode
- [ ] Implement output filtering

### Medium-term (v0.3)

- [ ] SFTP support for efficient file transfers
- [ ] Interactive session support (PTY)
- [ ] Command templates and scripts
- [ ] Result caching
- [ ] Parallel file distribution

### Long-term (v1.0)

- [ ] Web UI dashboard
- [ ] REST API server mode
- [ ] Kubernetes operator integration
- [ ] Metrics and monitoring
- [ ] Plugin system

## Technical Debt

1. ~~**Host Key Verification:** Currently disabled, security risk~~ ✅ Fixed in Phase 1
2. **Test Coverage:** Integration tests missing
3. **Error Messages:** Need better context and suggestions
4. **Documentation:** API docs incomplete

## Change Log

### Phase 1 - Critical Fixes (2025-08-21)

**Completed:**
1. **Host Key Verification:** Implemented three modes of verification with CLI flag
2. **List Command Bug:** Fixed logic to allow list without host specification
3. **Environment Variables:** Added expansion support for YAML configuration

**Impact:**
- Security improved with proper host key checking
- Better UX with fixed list command
- More flexible configuration with env var support

### Phase 3 - Connection Pooling Analysis (2025-08-21)

**Completed:**
1. **Connection Pool Module:** Implemented placeholder connection pool infrastructure
2. **Performance Analysis:** Determined pooling provides no benefit for current usage pattern
3. **Architecture Documentation:** Documented design decision and rationale

**Key Findings:**
- Current one-shot execution model doesn't benefit from connection pooling
- async-ssh2-tokio Client doesn't support connection reuse or cloning
- Pooling would only benefit future features like interactive mode or watch mode

**Recommendation:**
- Keep placeholder implementation for future use
- Focus on other performance optimizations with immediate impact
- Revisit when implementing persistent/interactive features

### Phase 4 - Code Structure Refactoring (2025-01-22)

**Completed:**
1. **Modular Command Structure:** Separated commands into individual modules
2. **Utility Extraction:** Created reusable utility modules for common functions
3. **Main.rs Simplification:** Reduced from 987 to ~150 lines

**New Structure:**
```
src/
├── commands/         # Command implementations
│   ├── exec.rs      # Execute command (~75 lines)
│   ├── ping.rs      # Connectivity test (~80 lines)
│   ├── list.rs      # List clusters (~50 lines)
│   ├── upload.rs    # File upload (~175 lines)
│   └── download.rs  # File download (~240 lines)
├── utils/           # Utility functions
│   ├── fs.rs        # File system utilities (~100 lines)
│   ├── output.rs    # Output management (~200 lines)
│   └── logging.rs   # Logging setup (~30 lines)
└── main.rs          # CLI dispatcher (~150 lines)
```

**Benefits:**
- **Improved Maintainability:** Each command is self-contained
- **Better Testability:** Individual modules can be tested in isolation
- **Enhanced Scalability:** New commands can be added without touching main.rs
- **Code Reusability:** Utility functions are shared across commands
- **Clear Separation of Concerns:** Each module has a single responsibility

**Metrics:**
- Main.rs size reduction: 84% (987 → 150 lines)
- Average module size: ~100 lines
- Total modules created: 9 new files
- No functionality changes, only structural improvements

## Dependencies and Licensing

All dependencies are compatible with Apache-2.0 licensing:

- `tokio`: MIT
- `async-ssh2-tokio`: MIT
- `clap`: MIT/Apache-2.0
- `serde`: MIT/Apache-2.0
- Other dependencies: Similar permissive licenses

## Appendix

### A. Configuration Schema

```yaml
# Full configuration example
clusters:
  production:
    nodes:
      - host: node1.example.com
        port: 22
        user: admin
    ssh_key: ~/.ssh/id_rsa
    known_hosts: ~/.ssh/known_hosts
    
default_cluster: production

ssh_config:
  connect_timeout: 10
  command_timeout: 300
  max_retries: 3
```

### B. Error Codes

| Code | Description |
|------|-------------|
| 1    | General error |
| 2    | Configuration error |
| 3    | Connection failed |
| 4    | Authentication failed |
| 5    | Command execution failed |
| 10   | Partial failure (some nodes failed) |

### C. Performance Tuning

Environment variables for tuning:
- `BSSH_MAX_PARALLEL`: Maximum parallel connections
- `BSSH_CONNECT_TIMEOUT`: Connection timeout in seconds
- `BSSH_BUFFER_SIZE`: Output buffer size per connection
- `RUST_LOG`: Logging level (trace/debug/info/warn/error)