safe_unzip 0.1.3

Secure zip extraction. Prevents Zip Slip and Zip Bombs.
# safe_unzip

Archive extraction that won't ruin your day. Supports **ZIP** and **TAR** formats.

## The Problem

Zip files can contain malicious paths that escape the extraction directory:

```python
import zipfile
zipfile.ZipFile("evil.zip").extractall("/var/uploads")
# Extracts ../../etc/cron.d/pwned → /etc/cron.d/pwned 💀
```

This is [Zip Slip](https://snyk.io/research/zip-slip-vulnerability), and Python's default behavior is still vulnerable.

### "But didn't Python fix this?"

Sort of. Python added warnings and `ZipInfo.filename` sanitization in 2014. In Python 3.12+, there's a `filter` parameter:

```python
# The "safe" way, but who knows this exists?
zipfile.ZipFile("evil.zip").extractall("/var/uploads", filter="data")
```

The problem: **the safe option is opt-in**. The default is still vulnerable. Most developers don't read the docs carefully enough to discover `filter="data"`.

`safe_unzip` makes security the default, not an afterthought.

## The Solution

```rust
use safe_unzip::extract_file;

extract_file("/var/uploads", "evil.zip")?;
// Err(PathEscape { entry: "../../etc/cron.d/pwned", ... })
```

```python
# Python bindings: same safety
from safe_unzip import extract_file

extract_file("/var/uploads", "evil.zip")
# Raises: PathEscapeError
```

**Security is the default.** No special flags, no opt-in safety. Every path is validated. Malicious archives are rejected, not extracted.
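
To make "every path is validated" concrete, here is a minimal std-only sketch of a lexical containment check. It is an illustration, not the code `safe_unzip` or `path_jail` actually runs; the function name `join_inside` is hypothetical, and the check does not resolve symlinks on disk.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper: joins `entry` onto `root` only if the result
/// cannot leave `root`. Purely lexical; the filesystem is never touched.
fn join_inside(root: &Path, entry: &str) -> Option<PathBuf> {
    let mut out = root.to_path_buf();
    let mut depth: usize = 0;
    for part in Path::new(entry).components() {
        match part {
            // Ordinary name: descend one level
            Component::Normal(name) => {
                out.push(name);
                depth += 1;
            }
            // `.` changes nothing
            Component::CurDir => {}
            // `..`, a leading `/`, or a drive prefix is rejected outright
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    // An empty entry name is not a usable target either
    if depth == 0 {
        None
    } else {
        Some(out)
    }
}
```

With this sketch, `join_inside(Path::new("/var/uploads"), "../../etc/cron.d/pwned")` returns `None`, while `"docs/readme.txt"` yields a path under `/var/uploads`. Rejecting every `..` is stricter than necessary, but it keeps the rule easy to audit.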

## Why Not Just Use `zip` / `tar` / `zipfile`?

Because **archive extraction is a security boundary**, and most libraries treat it as a convenience function.

| Library | Default Behavior | Safe Option |
|---------|------------------|-------------|
| Python `zipfile` | Vulnerable | `filter="data"` (opt-in, obscure) |
| Python `tarfile` | Vulnerable | `filter="data"` (opt-in, Python 3.12+) |
| Rust `zip` | Vulnerable | Manual path validation |
| Rust `tar` | Vulnerable | Manual path validation |
| `safe_unzip` | **Safe by default** | N/A (always safe) |

If you're extracting untrusted archives, you need a library designed for that threat model.

## Who Should Use This

- **Backend services** handling user-uploaded zip files
- **CI/CD systems** unpacking third-party artifacts  
- **SaaS platforms** with file import features
- **Forensics / malware analysis** pipelines
- **Anything running as a privileged user**

If your zip files only come from trusted sources you control, the standard `zip` crate is fine. If users can upload archives, use `safe_unzip`.

## Features

- **Multi-Format Support**: ZIP and TAR (`.tar`, `.tar.gz`) archives
- **Async API**: Optional tokio-based async extraction (feature flag)
- **Zip Slip Protection**: Path traversal attacks blocked via [path_jail](https://crates.io/crates/path_jail)
- **Zip Bomb Protection**: Configurable limits on size, file count, and path depth
- **Strict Size Enforcement**: Catches files that decompress larger than declared
- **Filename Sanitization**: Blocks control characters and Windows reserved names
- **Symlink Handling**: Skip or reject symlinks (no symlink-based escapes)
- **Secure Overwrite**: Removes symlinks before overwriting to prevent symlink attacks
- **Atomic File Creation**: TOCTOU-safe file creation using `O_EXCL`
- **Overwrite Policies**: Error, skip, or overwrite existing files
- **Filter Callback**: Extract only the files you want
- **Two-Pass Mode**: Validate everything before writing anything
- **Permission Stripping**: Removes setuid/setgid bits on Unix
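
The "Strict Size Enforcement" item can be pictured as a reader wrapper that counts bytes while they stream out of the decompressor. The sketch below uses only the standard library; the type name `CappedReader` is hypothetical and this is not the crate's internal reader.

```rust
use std::io::{self, Read};

/// Hypothetical wrapper: fails as soon as more than `limit` bytes
/// have been produced by the inner reader.
struct CappedReader<R> {
    inner: R,
    remaining: u64,
}

impl<R: Read> CappedReader<R> {
    fn new(inner: R, limit: u64) -> Self {
        Self { inner, remaining: limit }
    }
}

impl<R: Read> Read for CappedReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        // The entry produced more data than its declared size allows
        if n as u64 > self.remaining {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "entry exceeds declared size",
            ));
        }
        self.remaining -= n as u64;
        Ok(n)
    }
}
```

Wrapping each entry's decompressed stream this way means a lying size header is detected during the copy, so the check does not depend on trusting archive metadata.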

## Installation

**Rust:**
```toml
[dependencies]
safe_unzip = "0.1"
```

**Python:**
```bash
pip install safe-unzip
```

### Python Bindings

The Python bindings are **thin wrappers** over the Rust implementation via PyO3. This means:

- ✅ **Identical security guarantees**: same code path, same validation
- ✅ **Identical limits**: same defaults (1GB total, 10K files, 100MB per file)
- ✅ **Identical semantics**: same error conditions, same behavior
- ✅ **No re-implementation**: Python calls Rust directly, no logic duplication

Security reviewers: the Python API is a direct binding, not a port.

## Quick Start

```rust
use safe_unzip::extract_file;

// Extract with safe defaults
let report = extract_file("/var/uploads", "archive.zip")?;
println!("Extracted {} files ({} bytes)", 
    report.files_extracted, 
    report.bytes_written
);
```

## Usage Examples

### Basic Extraction

```rust
use safe_unzip::Extractor;

let report = Extractor::new("/var/uploads")?
    .extract_file("archive.zip")?;
```

### Create Destination if Missing

```rust
use safe_unzip::Extractor;

// Extractor::new() errors if destination doesn't exist (catches typos)
// Extractor::new_or_create() creates it automatically
let report = Extractor::new_or_create("/var/uploads/new_folder")?
    .extract_file("archive.zip")?;

// The convenience functions (extract_file, extract) also create automatically
use safe_unzip::extract_file;
extract_file("/var/uploads/new_folder", "archive.zip")?;
```

### Custom Limits (Prevent Zip Bombs)

```rust
use safe_unzip::{Extractor, Limits};

let report = Extractor::new("/var/uploads")?
    .limits(Limits {
        max_total_bytes: 500 * 1024 * 1024,  // 500 MB total
        max_file_count: 1_000,                // Max 1000 files
        max_single_file: 50 * 1024 * 1024,   // 50 MB per file
        max_path_depth: 10,                   // No deeper than 10 levels
    })
    .extract_file("archive.zip")?;
```
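
To show what the four limits constrain, here is a rough std-only sketch that checks declared `(name, size)` pairs against a set of caps. The `Caps` struct and `check_caps` function are hypothetical stand-ins that mirror the field names above; the real crate also enforces sizes while streaming, because declared sizes can lie.

```rust
/// Hypothetical mirror of the four limit fields shown above.
struct Caps {
    max_total_bytes: u64,
    max_file_count: usize,
    max_single_file: u64,
    max_path_depth: usize,
}

/// Checks declared (name, size) pairs and returns the first violation.
fn check_caps(entries: &[(&str, u64)], caps: &Caps) -> Result<(), String> {
    if entries.len() > caps.max_file_count {
        return Err(format!("too many files (limit: {})", caps.max_file_count));
    }
    let mut total: u64 = 0;
    for (name, size) in entries {
        // Depth = number of non-empty `/`-separated components
        let depth = name.split('/').filter(|part| !part.is_empty()).count();
        if depth > caps.max_path_depth {
            return Err(format!("'{}' is nested too deeply", name));
        }
        if *size > caps.max_single_file {
            return Err(format!("'{}' is too large", name));
        }
        // Saturating add so a huge declared size cannot wrap around
        total = total.saturating_add(*size);
        if total > caps.max_total_bytes {
            return Err(format!("archive too large (limit: {})", caps.max_total_bytes));
        }
    }
    Ok(())
}
```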

### Filter by Extension

```rust
use safe_unzip::Extractor;

// Only extract images
let report = Extractor::new("/var/uploads")?
    .filter(|entry| {
        entry.name.ends_with(".png") || 
        entry.name.ends_with(".jpg") ||
        entry.name.ends_with(".gif")
    })
    .extract_file("archive.zip")?;

println!("Extracted {} images, skipped {} other files",
    report.files_extracted,
    report.entries_skipped
);
```

### Overwrite Policies

```rust
use safe_unzip::{Extractor, OverwritePolicy};

// Skip files that already exist
let report = Extractor::new("/var/uploads")?
    .overwrite(OverwritePolicy::Skip)
    .extract_file("archive.zip")?;

// Or overwrite them
let report = Extractor::new("/var/uploads")?
    .overwrite(OverwritePolicy::Overwrite)
    .extract_file("archive.zip")?;

// Default: Error if file exists
let report = Extractor::new("/var/uploads")?
    .overwrite(OverwritePolicy::Error)  // This is the default
    .extract_file("archive.zip")?;
```

### Symlink Policies

```rust
use safe_unzip::{Extractor, SymlinkPolicy};

// Default: silently skip symlinks
let report = Extractor::new("/var/uploads")?
    .symlinks(SymlinkPolicy::Skip)
    .extract_file("archive.zip")?;

// Or reject archives containing symlinks
let report = Extractor::new("/var/uploads")?
    .symlinks(SymlinkPolicy::Error)
    .extract_file("archive.zip")?;
```

### Extraction Modes

| Mode | Speed | On Failure | Use When |
|------|-------|------------|----------|
| `Streaming` (default) | Fast (1 pass) | Partial files remain | Speed matters; you'll clean up on error |
| `ValidateFirst` | Slower (2 passes) | No files if validation fails | Can't tolerate partial state |

**⚠️ Neither mode is truly atomic.** If extraction fails mid-write (e.g., disk full), partial files remain regardless of mode. `ValidateFirst` only prevents writes when *validation* fails (bad paths, limits exceeded), not when I/O fails during extraction.

```rust
use safe_unzip::{Extractor, ExtractionMode};

// Two-pass extraction:
// 1. Validate ALL entries (no disk writes)
// 2. Extract (only if validation passed)
let report = Extractor::new("/var/uploads")?
    .mode(ExtractionMode::ValidateFirst)
    .extract_file("untrusted.zip")?;
```

Use `ValidateFirst` when you can't tolerate partial state from malicious archives. Use `Streaming` (default) when speed matters and you can clean up on error.
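
The difference between the two modes can be modelled without any archive at all. In the toy sketch below a `Vec<String>` stands in for the disk and a single rule (no `..` component) stands in for validation; all names are hypothetical and nothing here reflects the crate's internals.

```rust
/// Stand-in validation rule for this sketch: reject any `..` component.
fn is_valid(name: &str) -> bool {
    !name.split('/').any(|part| part == "..")
}

/// One pass: write entries as they arrive and stop at the first bad one.
/// Anything written before the failure stays on the "disk".
fn streaming(entries: &[&str], disk: &mut Vec<String>) -> Result<(), String> {
    for name in entries {
        if !is_valid(name) {
            return Err(format!("rejected '{}'", name));
        }
        disk.push(name.to_string());
    }
    Ok(())
}

/// Two passes: validate every entry first, then write.
/// A validation failure leaves the "disk" untouched.
fn validate_first(entries: &[&str], disk: &mut Vec<String>) -> Result<(), String> {
    if let Some(bad) = entries.iter().find(|name| !is_valid(name)) {
        return Err(format!("rejected '{}'", bad));
    }
    disk.extend(entries.iter().map(|name| name.to_string()));
    Ok(())
}
```

For the entries `a.txt`, `../evil`, `b.txt`, the streaming function fails with `a.txt` already written, while the two-pass function fails with nothing written.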

### Extracting from Memory

```rust
use safe_unzip::Extractor;
use std::io::Cursor;

let zip_bytes: Vec<u8> = download_zip_somehow();
let cursor = Cursor::new(zip_bytes);

let report = Extractor::new("/var/uploads")?
    .extract(cursor)?;
```

### TAR Extraction (New in v0.1.2)

```rust
use safe_unzip::{Driver, TarAdapter};

// Extract a .tar file
let report = Driver::new("/var/uploads")?
    .extract_tar_file("archive.tar")?;

// Extract a .tar.gz file
let report = Driver::new("/var/uploads")?
    .extract_tar_gz_file("archive.tar.gz")?;

// With options
let report = Driver::new("/var/uploads")?
    .filter(|entry| entry.name.ends_with(".txt"))
    .validation(safe_unzip::ValidationMode::ValidateFirst)
    .extract_tar_file("archive.tar")?;
```

The new `Driver` API provides a unified interface for all archive formats with the same security guarantees.

### Async Extraction (New)

Enable the `async` feature for tokio-based async extraction:

```toml
[dependencies]
safe_unzip = { version = "0.1", features = ["async"] }
```

```rust
use safe_unzip::r#async::{extract_file, extract_tar_file, AsyncExtractor};

#[tokio::main]
async fn main() -> Result<(), safe_unzip::Error> {
    // Simple async extraction
    let report = extract_file("/var/uploads", "archive.zip").await?;
    
    // TAR extraction
    let report = extract_tar_file("/var/uploads", "archive.tar").await?;
    
    // With options
    let report = AsyncExtractor::new("/var/uploads")?
        .max_total_bytes(500 * 1024 * 1024)
        .max_file_count(1000)
        .extract_file("archive.zip")
        .await?;
    
    Ok(())
}
```

Concurrent extraction of multiple archives:

```rust
use safe_unzip::r#async::{extract_file, extract_tar_bytes};

let (zip_result, tar_result) = tokio::join!(
    extract_file("/uploads/a", "first.zip"),
    extract_tar_bytes("/uploads/b", tar_data),
);
```

The async API uses `spawn_blocking` internally, so extraction runs in a thread pool without blocking the async runtime.

## Security Model

| Threat | Attack Vector | Defense |
|--------|---------------|---------|
| **Zip Slip** | Entry named `../../etc/cron.d/pwned` | `path_jail` validates every path |
| **Zip Bomb (size)** | 42KB → 4PB expansion | `max_total_bytes` limit + streaming enforcement |
| **Zip Bomb (count)** | 1 million empty files | `max_file_count` limit |
| **Zip Bomb (lying)** | Declared 1KB, decompresses to 1GB | Strict size reader detects mismatch |
| **Symlink Escape** | Symlink to `/etc/passwd` | Skip or reject symlinks |
| **Symlink Overwrite** | Create symlink, then overwrite target | Symlinks removed before overwrite |
| **Path Depth** | `a/b/c/.../1000levels` | `max_path_depth` limit |
| **Invalid Filename** | Control chars, `CON`, `NUL` | Filename sanitization |
| **Overwrite** | Replace sensitive files | `OverwritePolicy::Error` default |
| **Setuid** | Create setuid executables | Permission bits stripped |
| **Encrypted Archives** | Password handling complexity | Rejected (see [Encrypted Archives](#encrypted-archives)) |
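
As an illustration of the **Setuid** row, stripping the special bits from a Unix mode is a single mask. This is a sketch of the idea only; the function name is hypothetical, and the exact set of bits the crate clears is an assumption here (setuid and setgid, as named in the feature list).

```rust
/// Hypothetical helper: clears setuid (0o4000) and setgid (0o2000)
/// from a Unix permission mode, leaving all other bits unchanged.
fn strip_special_bits(mode: u32) -> u32 {
    mode & !0o6000
}
```

For example, an entry stored with mode `0o4755` would be written as `0o755`.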

## Default Limits

| Limit | Default | Description |
|-------|---------|-------------|
| `max_total_bytes` | 1 GB | Total uncompressed size |
| `max_file_count` | 10,000 | Number of files |
| `max_single_file` | 100 MB | Largest single file |
| `max_path_depth` | 50 | Directory nesting depth |

## Error Handling

```rust
use safe_unzip::{extract_file, Error};

match extract_file("/var/uploads", "archive.zip") {
    Ok(report) => {
        println!("Success: {} files", report.files_extracted);
    }
    Err(Error::PathEscape { entry, detail }) => {
        eprintln!("Blocked path traversal in '{}': {}", entry, detail);
    }
    Err(Error::TotalSizeExceeded { limit, would_be }) => {
        eprintln!("Archive too large: {} bytes (limit: {})", would_be, limit);
    }
    Err(Error::FileTooLarge { entry, size, limit }) => {
        eprintln!("File '{}' too large: {} bytes (limit: {})", entry, size, limit);
    }
    Err(Error::FileCountExceeded { limit }) => {
        eprintln!("Too many files (limit: {})", limit);
    }
    Err(Error::AlreadyExists { path }) => {
        eprintln!("File already exists: {}", path);
    }
    Err(Error::InvalidFilename { entry, .. }) => {
        eprintln!("Invalid filename: {}", entry);
    }
    Err(Error::EncryptedEntry { entry }) => {
        eprintln!("Encrypted entry not supported: {}", entry);
    }
    Err(e) => {
        eprintln!("Extraction failed: {}", e);
    }
}
```

## Limitations

### Format Limitations

- **ZIP and TAR only**: Other formats (7z, rar) not supported
- **Requires seekable input for ZIP**: ZIP format requires reading the central directory at the end
- **TAR is sequential**: TAR files are read in order; `ValidateFirst` mode caches entries in memory
- **No encrypted archives**: See below

### Encrypted Archives

`safe_unzip` does not support password-protected zip files. Encrypted entries are rejected with `Error::EncryptedEntry`.

If you need to extract encrypted archives:
1. Decrypt first using the `zip` crate directly
2. Then extract with `safe_unzip`

This is intentional: encryption handling is outside our security scope. Password management, key derivation, and cryptographic validation are complex domains that deserve dedicated tooling.

### Extraction Behavior

- **Partial state in Streaming mode**: If extraction fails mid-way, already-extracted files remain on disk. Use `ExtractionMode::ValidateFirst` to validate before writing.
- **Filters not applied during validation**: In `ValidateFirst` mode, limits are checked against ALL entries. Filtered entries still count toward limits. This is conservative: validation may reject archives that would succeed with filtering.

### Security Scope

These threats are **not fully addressed** (by design or complexity):

| Limitation | Reason |
|------------|--------|
| **Case-insensitive collisions** | On Windows/macOS, `File.txt` and `file.txt` map to the same file. We don't track extracted names to detect this. |
| **Unicode normalization** | `café` (NFC) vs `café` (NFD) appear identical but are different bytes. Full normalization requires ICU. |
| **Concurrent extraction** | If multiple threads/processes extract to the same destination, race conditions can occur. Use file locking or separate destinations. |
| **Sparse file attacks** | Not applicable to zip format. |
| **Hard links** | Zip format doesn't support hard links. |
| **Device files** | Zip format doesn't support special device files. |
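
Since case-insensitive collisions are left to the caller, one option is to scan entry names before extraction. The sketch below is a hypothetical pre-check, not part of `safe_unzip`; it uses `str::to_lowercase` as an approximation of filesystem case folding and does nothing about Unicode normalization.

```rust
use std::collections::HashSet;

/// Hypothetical pre-check: returns the first name that collides with an
/// earlier one once case is ignored, or `None` if all names are distinct.
fn first_case_collision<'a>(names: &[&'a str]) -> Option<&'a str> {
    let mut seen = HashSet::new();
    for &name in names {
        // `insert` returns false when the lowercased name was already present
        if !seen.insert(name.to_lowercase()) {
            return Some(name);
        }
    }
    None
}
```

Given `File.txt`, `readme.md`, `file.txt`, the function reports `file.txt` as the colliding entry.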

### TOCTOU Mitigations

For `OverwritePolicy::Error` and `OverwritePolicy::Skip`, we use **atomic file creation** (`O_CREAT | O_EXCL`) instead of check-then-create. This eliminates race conditions between checking if a file exists and creating it.

For `OverwritePolicy::Overwrite`, symlinks are removed before writing to prevent symlink-following attacks, but there's a brief window between removal and creation.
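
In Rust's standard library, `O_CREAT | O_EXCL` semantics are available through `OpenOptions::create_new`. The following is a minimal sketch of atomic creation under that assumption; the helper name `write_new` is hypothetical and this is not a copy of the crate's code.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: creates `path` and writes `data`, failing with
/// `ErrorKind::AlreadyExists` if anything is already present at `path`.
fn write_new(path: &Path, data: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // existence check and creation happen in one step
        .open(path)?;
    file.write_all(data)
}
```

Because the existence check and the creation are one operation, there is no gap in which another process can place a file or symlink at the target path.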

### Filename Restrictions

These filenames are **rejected** for security:

- Control characters (including null bytes)
- Backslashes (`\`): prevents Windows path separator confusion
- Paths longer than 1024 bytes
- Path components longer than 255 bytes
- Windows reserved names: `CON`, `PRN`, `AUX`, `NUL`, `COM1-9`, `LPT1-9`
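
The rules above can be approximated in a few lines of std-only Rust. This is a hedged sketch with hypothetical function names; in particular, treating `CON.txt` as reserved (matching on the part before the first dot) is an assumption based on Windows behavior, not something confirmed about the crate.

```rust
/// Hypothetical check for Windows device names: CON, PRN, AUX, NUL,
/// COM1-9, and LPT1-9, compared case-insensitively.
fn is_reserved(stem: &str) -> bool {
    let upper = stem.to_ascii_uppercase();
    if matches!(upper.as_str(), "CON" | "PRN" | "AUX" | "NUL") {
        return true;
    }
    let bytes = upper.as_bytes();
    bytes.len() == 4
        && (upper.starts_with("COM") || upper.starts_with("LPT"))
        && (b'1'..=b'9').contains(&bytes[3])
}

/// Hypothetical check applying the listed restrictions to one entry path.
fn is_acceptable(path: &str) -> bool {
    // Whole-path rules: length, backslashes, control characters
    if path.len() > 1024 || path.contains('\\') || path.chars().any(|c| c.is_control()) {
        return false;
    }
    // Per-component rules: length and reserved device names
    path.split('/').all(|part| {
        let stem = part.split('.').next().unwrap_or(part);
        part.len() <= 255 && !is_reserved(stem)
    })
}
```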

## Development

### Fuzzing

We use [cargo-fuzz](https://github.com/rust-fuzz/cargo-fuzz) with two targets:

```bash
# Install cargo-fuzz (requires nightly)
cargo install cargo-fuzz

# Run the main extraction fuzzer
cargo +nightly fuzz run fuzz_extract

# Run the adapter fuzzer (tests parsing layer)
cargo +nightly fuzz run fuzz_zip_adapter
```

Fuzzing targets are in `fuzz/fuzz_targets/`. Run fuzzing before releases to catch parsing edge cases.

## License

MIT OR Apache-2.0