icepick 0.2.0 - Docs.rs

# AGENTS.md - icepick

## EXECUTIVE SUMMARY

**icepick** is an experimental Rust client for Apache Iceberg that provides simple access to cloud-native Iceberg catalogs (AWS S3 Tables and Cloudflare R2). Unlike the official iceberg-rust library, icepick targets WASM compilation for serverless environments and focuses on REST catalog implementations with minimal configuration. The library abstracts authentication, catalog REST APIs, and file I/O while exposing a clean, type-safe interface for reading and writing Iceberg tables.

## QUICK START

```toml
# Add to Cargo.toml
[dependencies]
icepick = "0.2"
tokio = { version = "1", features = ["full"] }
```

### AWS S3 Tables (native platforms only)

```rust
use icepick::catalog::Catalog;
use icepick::{S3TablesCatalog, spec::TableIdent};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create catalog from S3 Tables ARN
    let catalog = S3TablesCatalog::from_arn(
        "my-catalog",
        "arn:aws:s3tables:us-west-2:123456789012:bucket/my-bucket"
    ).await?;

    // Load and read a table
    let table_id = TableIdent::from_strs(&["namespace"], "table_name");
    let table = catalog.load_table(&table_id).await?;

    // Scan table data; the stream yields Arrow RecordBatch results (see Pattern 2)
    let scan = table.scan().build()?;
    let _stream = scan.to_arrow().await?;

    Ok(())
}
```

### Cloudflare R2 (WASM-compatible)

```rust
use icepick::{R2Catalog, catalog::Catalog, spec::TableIdent};

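// Native entry point shown for brevity. tokio's "full" feature set does not
// build for wasm32 targets, so call this body from the host runtime there.
#[tokio::main]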
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let catalog = R2Catalog::new(
        "my-catalog",
        "account-id",
        "bucket-name",
        "cloudflare-api-token"
    ).await?;

    let table_id: TableIdent = "namespace.table_name".parse()?;
    let table = catalog.load_table(&table_id).await?;

    Ok(())
}
```

## CORE CONCEPTS

- **REST Catalog Pattern**: All catalog operations use REST API calls with platform-specific authentication (SigV4 for AWS, bearer tokens for Cloudflare)
- **WASM Compatibility**: R2Catalog compiles to wasm32-unknown-unknown; S3TablesCatalog requires native AWS SDK and is native-only
- **Optimistic Concurrency**: Transactions commit atomically by swapping the metadata location pointer; a commit that loses the race fails with `ConcurrentModification` and must be retried against freshly loaded metadata
- **FileIO Abstraction**: Built on OpenDAL for cross-platform storage access; supports both single-operator (R2) and multi-bucket dynamic operator creation (S3 Tables)
- **Type-Safe Identifiers**: `TableIdent` and `NamespaceIdent` are dedicated types rather than raw strings, so the compiler rejects a namespace passed where a table identifier is expected
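
The optimistic concurrency bullet above can be illustrated with a compare-and-swap on the metadata location. This is a minimal, standard-library-only sketch: `PointerCatalog` and `CommitError` are hypothetical names invented for illustration and are not part of icepick, whose real catalogs perform this check through the REST API.

```rust
use std::sync::Mutex;

/// Hypothetical error type for this sketch (not icepick's `Error`).
#[derive(Debug, PartialEq)]
enum CommitError {
    /// The pointer moved after the caller loaded the table.
    ConcurrentModification { expected: String, actual: String },
}

/// Toy catalog that tracks only the current metadata location of one table.
struct PointerCatalog {
    current: Mutex<String>,
}

impl PointerCatalog {
    fn new(initial: &str) -> Self {
        Self { current: Mutex::new(initial.to_string()) }
    }

    /// Swap the pointer to `new_location` only if it still equals `old_location`.
    fn commit(&self, old_location: &str, new_location: &str) -> Result<(), CommitError> {
        let mut current = self.current.lock().unwrap();
        if current.as_str() != old_location {
            // Someone else committed first; leave the pointer untouched.
            return Err(CommitError::ConcurrentModification {
                expected: old_location.to_string(),
                actual: current.clone(),
            });
        }
        *current = new_location.to_string();
        Ok(())
    }

    /// Return the metadata location a fresh `load_table` would observe.
    fn location(&self) -> String {
        self.current.lock().unwrap().clone()
    }
}
```

A writer that loaded `v1.metadata.json` and commits after another writer has already moved the pointer to `v2.metadata.json` gets an error instead of silently overwriting the newer state.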

## API SURFACE

```
Module Structure:
├── catalog/          # Catalog implementations (S3TablesCatalog, R2Catalog)
│   ├── auth/        # Authentication (SigV4, bearer tokens)
│   └── rest/        # REST catalog protocol
├── spec/            # Iceberg specification types (Schema, TableIdent, etc.)
├── table/           # Table representation and operations
├── transaction/     # Write operations with ACID guarantees
├── scan/            # Table scanning and reading
├── io/              # FileIO abstraction over OpenDAL
├── writer/          # Parquet writing (both Iceberg and standalone)
├── reader/          # Manifest and data file reading
├── manifest/        # Iceberg manifest handling (Avro)
└── error/           # Structured error types
```

### Most Important Public Items

1. **S3TablesCatalog::from_arn()** - Create AWS S3 Tables catalog
2. **R2Catalog::new()** - Create Cloudflare R2 catalog
3. **Catalog trait** - Core operations (create_table, load_table, list_tables, drop_table)
4. **Table** - Iceberg table with scan() and transaction() methods
5. **Transaction::append().commit()** - Append data files atomically
6. **TableScan::to_arrow()** - Read table as Arrow RecordBatch stream
7. **arrow_to_parquet()** - Write Arrow data directly to S3 without Iceberg metadata

## COMMON PATTERNS

### Pattern 1: Creating and writing to a table

```rust
use icepick::catalog::Catalog;
use icepick::spec::{NestedField, PrimitiveType, Schema, TableCreation, Type};
use icepick::writer::ParquetWriter;

// Assumes `catalog`, `namespace` (a NamespaceIdent), and `arrow_batch`
// (an Arrow RecordBatch matching the schema) are already in scope.

// Build Iceberg schema (field IDs are required for Parquet field mapping)
let schema = Schema::builder()
    .with_fields(vec![
        NestedField::required_field(1, "id".to_string(),
            Type::Primitive(PrimitiveType::Long)),
        NestedField::optional_field(2, "name".to_string(),
            Type::Primitive(PrimitiveType::String)),
    ])
    .build()?;

// Create table
let table_creation = TableCreation::builder()
    .with_name("my_table")
    .with_schema(schema)
    .build()?;

let table = catalog.create_table(&namespace, table_creation).await?;

// Write data using ParquetWriter
let mut writer = ParquetWriter::new(table.schema()?.clone())?;
writer.write_batch(&arrow_batch)?;

let data_file = writer.finish(
    table.file_io(),
    format!("{}/data/{}.parquet", table.location(), uuid::Uuid::new_v4())
).await?;

// Commit transaction
table.transaction()
    .append(vec![data_file])
    .commit(&catalog)
    .await?;
```

### Pattern 2: Reading table data

```rust
use futures::StreamExt;

let table = catalog.load_table(&table_id).await?;

// Option A: Get data file list
let files = table.files().await?;
for file in files {
    println!("{} - {} rows", file.file_path, file.record_count);
}

// Option B: Stream as Arrow batches
let scan = table.scan().build()?;
let mut stream = scan.to_arrow().await?;

while let Some(batch_result) = stream.next().await {
    let batch = batch_result?;
    // Process batch
}
```

### Pattern 3: Error handling

```rust
use icepick::Error;

match catalog.load_table(&table_id).await {
    Ok(table) => { /* use table */ },
    Err(Error::NotFound { resource }) => {
        eprintln!("Table not found: {}", resource);
    },
    Err(Error::ConcurrentModification { message }) => {
        // Reload the table and retry the transaction
        eprintln!("Concurrent modification: {}", message);
    },
    Err(e) => return Err(e.into()),
}
```

### Pattern 4: Direct Parquet writes (without Iceberg metadata)

```rust
use icepick::{arrow_to_parquet, FileIO, io::AwsCredentials};
use parquet::basic::Compression;

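// Credentials are read from the environment rather than hardcoded
let file_io = FileIO::from_aws_credentials(
    AwsCredentials {
        access_key_id: std::env::var("AWS_ACCESS_KEY_ID")?,
        secret_access_key: std::env::var("AWS_SECRET_ACCESS_KEY")?,
        session_token: std::env::var("AWS_SESSION_TOKEN").ok(),
    },
    "us-west-2".to_string()
);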

// Simple write
arrow_to_parquet(&batch, "s3://bucket/data.parquet", &file_io).await?;

// With compression
arrow_to_parquet(&batch, "s3://bucket/data.parquet", &file_io)
    .with_compression(Compression::ZSTD(Default::default()))
    .await?;
```

## INTEGRATION POINTS

- **Async Runtime**: tokio (required for examples/tests, not enforced as dependency)
- **Serialization**: serde with JSON for REST API, apache-avro for manifest files
- **Arrow/Parquet**: Uses arrow 55.2.0 and parquet 55.2.0 crates directly
- **Storage Backend**: OpenDAL 0.51 with services-s3 and services-memory features
- **Authentication**:
  - Native: aws-config, aws-sdk-sts, aws-sigv4, reqwest with rustls-tls
  - WASM: reqwest with JSON (no TLS features)
- **Key Feature Flags**: None (platform selection via cfg(target_family = "wasm"))
- **Critical Dependencies**: opendal (storage abstraction), async-trait (catalog trait), thiserror (error types)
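
Platform selection through `cfg(target_family = "wasm")` rather than feature flags can be sketched as follows. `CatalogKind` and `available_catalogs` are hypothetical names used only for illustration; the sketch assumes nothing about icepick's internals beyond the platform split described in this document.

```rust
/// Hypothetical marker for the catalogs a given build can use.
#[derive(Debug, PartialEq)]
enum CatalogKind {
    S3Tables,
    R2,
}

/// Native builds: both catalogs are available.
#[cfg(not(target_family = "wasm"))]
fn available_catalogs() -> Vec<CatalogKind> {
    vec![CatalogKind::S3Tables, CatalogKind::R2]
}

/// WASM builds: only the R2 catalog is compiled in.
#[cfg(target_family = "wasm")]
fn available_catalogs() -> Vec<CatalogKind> {
    vec![CatalogKind::R2]
}
```

Only one of the two function definitions exists in any given build, so code that names a native-only item on a WASM target fails at compile time rather than at runtime.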

## CONSTRAINTS & GOTCHAS

- **MSRV**: Not explicitly documented; the crate uses the Rust 2021 edition (likely requires 1.70+)
- **Platform-specific behavior**:
  - `S3TablesCatalog` unavailable on WASM (requires AWS SDK)
  - `R2Catalog` uses `?Send` async trait on WASM (single-threaded)
  - Some error variants (e.g., `Error::InvalidArn`) only exist on native platforms
- **Performance cliffs**:
  - `arrow_to_parquet()` buffers entire Parquet file in memory before upload
  - Table scans read all data files sequentially (no filtering/projection yet)
  - No connection pooling for REST catalog calls
- **Common misuse patterns**:
  - Don't call `table.files()` in a loop - each call re-reads the manifest list and manifests, so cache the returned file list
  - Don't create new catalog instances per request - reuse them
  - Always reload table after commit to get latest metadata
- **Unsafe code**: None in the library
- **Concurrent modification handling**: Transactions will fail with `ConcurrentModification` error if table metadata changes between load and commit - client must retry with fresh table metadata
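
The client-side retry described in the last bullet might look like the bounded loop below. It is a synchronous, standard-library-only sketch: `SketchError` and `retry_on_conflict` are hypothetical and not part of icepick. In real code the closure would reload the table from the catalog and rebuild the transaction on every attempt.

```rust
/// Hypothetical error for this sketch; icepick's real `Error` has more variants.
#[derive(Debug, PartialEq)]
enum SketchError {
    ConcurrentModification,
    Other(String),
}

/// Run `attempt` up to `max_attempts` times, retrying only when it reports
/// a concurrent modification. The closure receives the 1-based attempt number.
fn retry_on_conflict<T>(
    max_attempts: u32,
    mut attempt: impl FnMut(u32) -> Result<T, SketchError>,
) -> Result<T, SketchError> {
    for n in 1..=max_attempts {
        match attempt(n) {
            // Lost the race: loop again so the closure can reload and recommit.
            Err(SketchError::ConcurrentModification) => continue,
            // Success or a non-retryable error: return immediately.
            other => return other,
        }
    }
    // All attempts conflicted (also returned when `max_attempts` is 0).
    Err(SketchError::ConcurrentModification)
}
```

Bounding the number of attempts keeps a heavily contended table from stalling a writer indefinitely; a production version would likely add a delay between attempts.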

## TESTING GUIDANCE

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use icepick::spec::{NamespaceIdent, TableIdent, Schema, NestedField, Type, PrimitiveType};
    use icepick::io::FileIO;
    use opendal::Operator;

    #[tokio::test]
    async fn test_table_operations() {
        // Setup: Create in-memory FileIO for testing
        let op = Operator::via_iter(opendal::Scheme::Memory, []).unwrap();
        let file_io = FileIO::new(op);

        // Create test schema
        let schema = Schema::builder()
            .with_fields(vec![
                NestedField::required_field(1, "id".to_string(),
                    Type::Primitive(PrimitiveType::Long))
            ])
            .build()
            .unwrap();

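        // Test operations: build metadata and a Table directly (fine in unit tests).
        // `Table` and `TableMetadata` are assumed to be in scope via `use super::*`.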
        let metadata = TableMetadata::builder()
            .with_location("memory://test/table")
            .with_current_schema(schema)
            .build()
            .unwrap();

        let table_id = TableIdent::from_strs(&["test"], "table");
        let table = Table::new(table_id, metadata, "memory://test/metadata.json".to_string(), file_io);

        assert_eq!(table.location(), "memory://test/table");
    }
}
```

### Testing with real AWS/Cloudflare services

Use environment variables for credentials in integration tests:

```rust
#[tokio::test]
#[ignore] // Only run with --ignored flag
async fn test_s3_tables_integration() {
    dotenvy::dotenv().ok();
    let arn = std::env::var("S3_TABLES_ARN").unwrap();
    let catalog = S3TablesCatalog::from_arn("test", &arn).await.unwrap();
    // ... test operations
}
```

## CONTRIBUTION VECTORS

- **Code style**: Use `cargo fmt` with default settings (no `rustfmt.toml` is present in the repository)
- **Test coverage**: Unit tests for pure functions, integration tests (with `#[ignore]`) for cloud services
- **Benchmark requirements**: None currently - performance testing is manual
- **Documentation standards**:
  - All public items require doc comments with examples
  - Use `//!` for module-level docs
  - Include `# Errors` section for fallible functions
  - Add `# Examples` with `no_run` for async/cloud examples
- **Where to add new functionality**:
  - New catalog implementations: `src/catalog/<provider>.rs` + update `catalog/mod.rs`
  - New Iceberg spec types: `src/spec/<type>.rs`
  - New write operations: Extend `Transaction` in `src/transaction.rs`
  - New read capabilities: Extend `TableScan` in `src/scan.rs`
  - Storage backend changes: `src/io/file_io.rs`

## SEMANTIC VERSIONING CONTRACT

**Breaking changes** (require major version bump):
- Changes to public trait methods (Catalog, async fn signatures)
- Removal of public types or methods
- Changes to Error enum variants (code matching on them will break)
- Modifications to `TableIdent`, `NamespaceIdent`, or other core spec types
- FileIO method signature changes

**Non-breaking changes** (minor version bump):
- New catalog implementations
- New methods on existing types
- New error variants (source-compatible only for callers whose `match` includes a catch-all arm)
- Performance improvements without API changes
- New optional features

**Patch changes**:
- Bug fixes in existing functionality
- Documentation improvements
- Internal refactoring
- Dependency updates (within semver compatibility)
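
The caveat on new error variants comes down to how callers write their `match`. The sketch below uses a hypothetical `SketchError` enum whose variants mirror the ones shown in Pattern 3. `#[non_exhaustive]` is a standard Rust attribute that forces downstream crates to include a catch-all arm; whether icepick's `Error` is marked this way is an assumption, not something this document states.

```rust
/// Hypothetical error enum for illustration only.
#[non_exhaustive]
#[derive(Debug)]
enum SketchError {
    NotFound { resource: String },
    ConcurrentModification { message: String },
}

/// Handle the variant we care about and fall back for everything else.
fn describe(err: &SketchError) -> String {
    match err {
        SketchError::NotFound { resource } => format!("missing: {resource}"),
        // Catch-all arm: this match keeps compiling if a later release
        // adds a variant, which an exhaustive match would not.
        other => format!("unexpected error: {other:?}"),
    }
}
```

A caller that lists every variant with no catch-all arm stops compiling as soon as a variant is added, which is why such additions are only conditionally non-breaking.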

## FOR AI AGENTS

### Quick Reference

When working with this library:
1. Always use `catalog::Catalog` trait for catalog operations (don't call REST endpoints directly)
2. Platform check: Use `S3TablesCatalog` for native AWS, `R2Catalog` for WASM/Cloudflare
3. Run `cargo clippy -- -D warnings` before suggesting changes
4. For architecture decisions, this is a thin wrapper over Iceberg REST protocol - prioritize simplicity over feature completeness
5. Error pattern: All errors implement Display with context - use `?` operator and let errors propagate

### Key Invariants to Maintain

- **FileIO must never expose raw OpenDAL Operator** - all file operations go through FileIO methods
- **Catalog implementations must use optimistic locking** - always pass old_metadata_location when updating
- **WASM compatibility for R2Catalog** - never use AWS SDK types in R2 code paths

### When Generating Code Using This Library

**Always:**
- Reload table after commit: `let table = catalog.load_table(&table_id).await?;`
- Use `TableIdent::from_strs(&["namespace"], "table")` for simple construction
- Include proper error handling (don't unwrap on I/O operations)
- Use `#[tokio::main]` or equivalent async runtime in examples
- Add field IDs to Iceberg schemas (required for Parquet field mapping)

**Never:**
- Construct `Table` directly (use catalog methods)
- Reuse `Transaction` after commit (it consumes self)
- Mix S3TablesCatalog with WASM targets
- Assume tables have snapshots (check with `current_snapshot()`)
- Hardcode credentials in examples (use env vars or function parameters)
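
One way to follow the rule against hardcoded credentials is to load them through a small helper. This is a standard-library-only sketch: `SketchCredentials`, `credentials_from`, and `credentials_from_env` are hypothetical names, the struct only mirrors the fields shown in Pattern 4, and the environment variable names are an assumption based on common AWS conventions.

```rust
/// Hypothetical holder mirroring the credential fields shown in Pattern 4.
#[derive(Debug, PartialEq)]
struct SketchCredentials {
    access_key_id: String,
    secret_access_key: String,
    session_token: Option<String>,
}

/// Build credentials from any key lookup, so tests can supply fixed values
/// instead of mutating the process environment.
fn credentials_from(
    lookup: impl Fn(&str) -> Option<String>,
) -> Result<SketchCredentials, String> {
    // Required keys produce an error that names the missing variable.
    let require = |key: &str| {
        lookup(key).ok_or_else(|| format!("missing environment variable {key}"))
    };
    Ok(SketchCredentials {
        access_key_id: require("AWS_ACCESS_KEY_ID")?,
        secret_access_key: require("AWS_SECRET_ACCESS_KEY")?,
        // The session token is optional, so absence is not an error.
        session_token: lookup("AWS_SESSION_TOKEN"),
    })
}

/// Production entry point: read the values from the process environment.
fn credentials_from_env() -> Result<SketchCredentials, String> {
    credentials_from(|key| std::env::var(key).ok())
}
```

Passing the lookup as a closure keeps secrets out of source code and lets examples and tests run without real credentials.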

## PERFORMANCE PROFILE

| Operation | Complexity | Notes |
|-----------|-----------|-------|
| `catalog.load_table()` | O(1) | Single REST API call + metadata JSON parse |
| `table.files()` | O(m) | Reads manifest list + m manifest files (Avro) |
| `table.scan().to_arrow()` | O(n) | Sequential read of n data files, no parallelism yet |
| `transaction.commit()` | O(m) | Write new manifest files + update metadata (atomic CAS) |
| `arrow_to_parquet()` | O(b) | Full buffer in memory before upload |

Where m = number of manifest files, n = number of data files, b = size of the Arrow batch being written

## COMPARISON MATRIX

| Feature | icepick | iceberg-rust |
|---------|---------|--------------|
| WASM Support | ✅ R2Catalog | ❌ |
| Native AWS | ✅ S3TablesCatalog | ✅ |
| Full Iceberg Spec | ❌ (REST catalogs only) | ✅ (complete) |
| Dependencies | Lightweight | Heavy (full AWS SDK) |
| Maturity | Experimental | Production (Apache) |
| Transaction API | Simplified (append only) | Full (delete, overwrite, etc.) |
| Query Optimization | None yet | Predicate pushdown, projection |

**When to use icepick**: WASM deployment, serverless environments (Cloudflare Workers), simpler API for append-only workloads, R2 Data Catalog support

**When to use iceberg-rust**: Full Iceberg feature support, non-REST catalogs (Glue, Hive, etc.), complex query patterns, production-critical workloads