# Prestige CLI
Command-line interface for S3 Parquet file operations.
## Installation
Build from source:
```bash
cargo build --release --package prestige-cli
```
The binary will be available at `target/release/prestige`.
## Commands
### compact
Consolidate small Parquet files in S3 into larger files.
#### Basic Usage
```bash
prestige compact \
  --bucket my-bucket \
  --prefix sensor_data \
  --start 1704067200 \
  --end 1704153600 \
  --target-bytes 104857600
```
#### Options
- `--bucket <BUCKET>` - S3 bucket name (required)
- `--prefix <PREFIX>` - File prefix to compact (required)
- `--start <START>` - Unix timestamp in seconds, exclusive lower bound (required)
- `--end <END>` - Unix timestamp in seconds, inclusive upper bound (required)
- `--target-bytes <BYTES>` - Target size per output file in bytes (default: 104857600 = 100MB)
- `--delete-originals` - Delete original files after successful compaction (default: true)
- `--compression <TYPE>` - Compression algorithm: snappy, gzip, lzo, brotli, lz4, zstd, uncompressed (default: snappy)
- `--row-group-size <SIZE>` - Parquet row group size (default: 10000)
- `--deduplicate` - Enable row-level deduplication (default: false)
- `--plan` - Dry-run mode: show statistics without modifying files
#### Authentication
AWS credentials can be provided via:
1. Command-line arguments:
```bash
prestige compact \
  --access-key-id AKIAXXXXXXXX \
  --secret-access-key XXXXXXXX \
  --region us-east-1 \
  ...
```
2. Environment variables:
```bash
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXX
export AWS_SECRET_ACCESS_KEY=XXXXXXXX
export AWS_REGION=us-east-1
prestige compact ...
```
3. AWS credentials file (`~/.aws/credentials`)
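The credentials file follows the standard AWS shared-credentials format, for example:

```ini
[default]
aws_access_key_id = AKIAXXXXXXXX
aws_secret_access_key = XXXXXXXX
```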
#### Plan Mode
Use `--plan` to estimate compaction results without modifying files:
```bash
prestige compact \
  --bucket my-bucket \
  --prefix sensor_data \
  --start 1704067200 \
  --end 1704153600 \
  --plan
```
Output:
```json
{
  "compacted_files_produced": 5,
  "uncompacted_files_deleted": 127,
  "records_processed": 1234567,
  "duplicate_records_eliminated": 0,
  "storage_saved_bytes": 45678900
}
```
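Because the plan is machine-readable JSON, a script can use it to gate the real run. A minimal sketch, assuming the summary is written to stdout and `jq` is installed (the 50-file threshold is arbitrary):

```bash
# Stand-in for: PLAN=$(prestige compact ... --plan)
PLAN='{"compacted_files_produced":5,"uncompacted_files_deleted":127,"records_processed":1234567}'
# Extract the number of small files the plan would consolidate.
FILES=$(printf '%s' "$PLAN" | jq -r '.uncompacted_files_deleted')
if [ "$FILES" -gt 50 ]; then
  echo "compaction worthwhile: $FILES small files"
  # prestige compact ... (same arguments, without --plan)
fi
```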
#### Deduplication
Use `--deduplicate` to remove duplicate rows during compaction:
```bash
prestige compact \
  --bucket my-bucket \
  --prefix sensor_data \
  --start 1704067200 \
  --end 1704153600 \
  --deduplicate
```
Output:
```json
{
  "status": "success",
  "files_processed": 127,
  "files_created": 5,
  "records_consolidated": 1234567,
  "bytes_saved": 45678900,
  "duplicate_records_eliminated": 61728,
  "last_processed_timestamp": "2024-01-02T00:00:00Z",
  "deletion_failures": []
}
```
**Note:** Deduplication uses row hashing and may increase processing time and memory usage proportional to the number of unique records.
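The trade-off is the same as the classic streaming-dedup idiom, which keeps one hash-table entry per unique record it has seen. A rough shell analogy (not the actual implementation):

```bash
# awk keeps one 'seen' entry per unique line, so memory grows with the
# number of unique records, matching the note above. Duplicate lines
# after the first occurrence are dropped.
printf 'a,1\nb,2\na,1\n' | awk '!seen[$0]++'
```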
#### LocalStack Testing
For local S3 testing with LocalStack:
```bash
docker run -d -p 4566:4566 localstack/localstack
# Create the bucket before compacting (requires the AWS CLI):
aws --endpoint-url http://localhost:4566 s3 mb s3://test-bucket
prestige compact \
  --bucket test-bucket \
  --prefix data \
  --start 1704067200 \
  --end 1704153600 \
  --endpoint http://localhost:4566 \
  --region us-east-1
```
## Examples
### Compact last hour of data
```bash
START=$(date -u -d '1 hour ago' +%s)  # GNU date; on BSD/macOS use: date -u -v-1H +%s
END=$(date -u +%s)
prestige compact \
  --bucket production-data \
  --prefix events \
  --start $START \
  --end $END \
  --target-bytes 209715200  # 200MB
```
### Compact with maximum compression
```bash
prestige compact \
  --bucket archive-data \
  --prefix historical \
  --start 1704067200 \
  --end 1704153600 \
  --compression zstd \
  --target-bytes 524288000  # 500MB
```
### Compact without deleting originals
```bash
prestige compact \
  --bucket backup-data \
  --prefix logs \
  --start 1704067200 \
  --end 1704153600 \
  --delete-originals=false
```
## File Naming Convention
The compactor follows these naming conventions:
- **Original files**: `{prefix}.{timestamp_millis}.parquet`
- **Compacted files**: `{prefix}.{timestamp_millis}.c.parquet`
- **Processed markers**: `{prefix}.{timestamp_millis}.parquet.processed`
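For illustration, the three names for a file written at Unix time 1704067200 (expressed in milliseconds) under the prefix `sensor_data` would be:

```bash
PREFIX=sensor_data
TS_MILLIS=1704067200000
echo "${PREFIX}.${TS_MILLIS}.parquet"            # original
echo "${PREFIX}.${TS_MILLIS}.c.parquet"          # compacted
echo "${PREFIX}.${TS_MILLIS}.parquet.processed"  # processed marker
```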
## Idempotency
The compactor is idempotent:
1. After successful compaction, `.processed` markers are created for source files
2. Subsequent runs skip files that have `.processed` markers
3. If deletion fails, the marker prevents reprocessing
4. Use `last_processed_timestamp` for checkpoint-based processing
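Point 4 can be scripted by feeding `last_processed_timestamp` back in as the next run's `--start`. A sketch, assuming the JSON summary is captured from stdout (GNU `date` shown; `sed` is used here to avoid a `jq` dependency):

```bash
# Stand-in for: SUMMARY=$(prestige compact ...)
SUMMARY='{"status":"success","last_processed_timestamp":"2024-01-02T00:00:00Z"}'
# Pull the checkpoint out of the JSON summary.
TS=$(printf '%s' "$SUMMARY" | sed -n 's/.*"last_processed_timestamp":"\([^"]*\)".*/\1/p')
# Convert it to the Unix seconds the --start flag expects.
NEXT_START=$(date -u -d "$TS" +%s)  # GNU date
echo "next --start: $NEXT_START"
# prestige compact --start "$NEXT_START" ...
```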
## Error Handling
- Deletion failures are tracked in `deletion_failures` but don't fail the operation
- Schema mismatches cause files to be skipped
- Empty or invalid parquet files are skipped
- All errors are logged to stderr