pdf_oxide 0.3.1

The Complete PDF Toolkit: extract, create, and edit PDFs. Rust core with bindings for Python, Node, WASM, Go, and more.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
# Contributing to PDF Library

Thank you for your interest in contributing! This document provides guidelines and information for contributors.

## Table of Contents

- [Code of Conduct]#code-of-conduct
- [Getting Started]#getting-started
- [Development Setup]#development-setup
- [Project Structure]#project-structure
- [Development Workflow]#development-workflow
- [Coding Standards]#coding-standards
- [Testing]#testing
- [Documentation]#documentation
- [Submitting Changes]#submitting-changes
- [License]#license

## Code of Conduct

This project adheres to the [Contributor Covenant Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior by opening an issue or contacting the maintainers.

## Getting Started

### Prerequisites

- **Rust**: 1.70+ ([Install Rust]https://rustup.rs/)
- **Python**: 3.8+ (for Python bindings)
- **Git**: For version control
- **C Compiler**: gcc or clang (for dependencies)

### Optional Tools

- **cargo-watch**: Auto-reload on file changes
  ```bash
  cargo install cargo-watch
  ```

- **cargo-tarpaulin**: Code coverage
  ```bash
  cargo install cargo-tarpaulin
  ```

- **maturin**: Python packaging
  ```bash
  pip install maturin
  ```

## Development Setup

1. **Fork and clone** the repository:
   ```bash
   git clone https://github.com/YOUR_USERNAME/pdf_oxide.git
   cd pdf_oxide
   ```

2. **Build the project**:
   ```bash
   cargo build
   ```

3. **Run tests**:
   ```bash
   cargo test
   ```

4. **Set up pre-commit hooks** (recommended):
   ```bash
   ./scripts/setup-hooks.sh
   ```

   This installs a pre-commit hook that automatically runs:
   - Code formatting (`cargo fmt --check`)
   - Linting (`cargo clippy`)
   - Build verification (`cargo check`)
   - Library tests (`cargo test --lib`)
   - Integration tests (`cargo test --tests`)
   - Documentation tests (`cargo test --doc`)

## Project Structure

See `docs/planning/README.md` for comprehensive documentation.

```
pdf_oxide/
├── src/              # Rust source code
├── tests/            # Integration tests
├── benches/          # Performance benchmarks
├── examples/         # Usage examples
├── python/           # Python bindings (PyO3)
├── docs/planning/    # Planning documents (16 files)
└── training/         # ML training scripts
```

## Development Workflow

### 1. Pick a Task

- Check [Issues]https://github.com/yfedoseev/pdf_oxide/issues
- Look for issues labeled `help-wanted` or `good-first-issue`
- Comment on the issue to claim it

### 2. Create a Branch

```bash
git checkout -b feature/your-feature-name
# or
git checkout -b fix/your-bug-fix
```

Branch naming:
- `feature/` - New features
- `fix/` - Bug fixes
- `docs/` - Documentation updates
- `test/` - Test additions
- `refactor/` - Code refactoring

### 3. Make Changes

Write code following our [Coding Standards](#coding-standards).

### 4. Test Your Changes

```bash
# Run all tests
cargo test

# Run specific test
cargo test test_name

# Run with features
cargo test --features ml

# Watch mode (auto-reload)
cargo watch -x test
```

### 5. Format and Lint

```bash
# Format code
cargo fmt

# Run linter
cargo clippy -- -D warnings

# Fix clippy suggestions
cargo clippy --fix
```

### 6. Commit Your Changes

Follow [Conventional Commits](https://www.conventionalcommits.org/):

```bash
git commit -m "feat: add PDF object parser"
git commit -m "fix: correct unicode mapping in ToUnicode CMap"
git commit -m "docs: update API documentation"
```

Commit types:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation only
- `test`: Adding tests
- `refactor`: Code refactoring
- `perf`: Performance improvement
- `chore`: Maintenance tasks

### 7. Push and Create Pull Request

```bash
git push origin feature/your-feature-name
```

Then create a pull request on GitHub.

## Coding Standards

### Rust

#### Style

- Follow [Rust API Guidelines]https://rust-lang.github.io/api-guidelines/
- Use `rustfmt` (configured in `rustfmt.toml`)
- Maximum line length: 100 characters
- Use 4 spaces for indentation

#### Naming Conventions

```rust
// Modules and crates
mod pdf_parser;

// Types (structs, enums, traits)
struct PdfDocument;
enum Object;
trait Decoder;

// Functions and methods
fn parse_object() -> Result<Object>;

// Constants
const MAX_RECURSION_DEPTH: usize = 100;

// Static variables
static GLOBAL_CONFIG: Config = Config::new();
```

#### Error Handling

```rust
// Use Result<T> for fallible operations
pub fn parse_pdf(path: &Path) -> Result<PdfDocument> {
    // Use ? operator for error propagation
    let file = File::open(path)?;

    // Provide context when wrapping errors
    let doc = parse_file(file)
        .map_err(|e| Error::Parse(format!("Failed to parse {}: {}", path.display(), e)))?;

    Ok(doc)
}

// Avoid unwrap() in library code (only in tests and examples)
// Use expect() with descriptive messages when appropriate
```

#### Documentation

```rust
/// Parse a PDF object from bytes.
///
/// # Arguments
///
/// * `bytes` - Raw bytes containing the PDF object
///
/// # Returns
///
/// Returns the parsed object or an error if parsing fails.
///
/// # Errors
///
/// Returns `Error::Parse` if the bytes don't represent a valid PDF object.
///
/// # Examples
///
/// ```
/// use pdf_oxide::parse_object;
///
/// let bytes = b"42";
/// let obj = parse_object(bytes)?;
/// assert_eq!(obj, Object::Integer(42));
/// ```
pub fn parse_object(bytes: &[u8]) -> Result<Object> {
    // Implementation
}
```

#### Safety

- Avoid `unsafe` unless absolutely necessary
- Document all `unsafe` blocks with safety invariants
- Prefer safe abstractions from the standard library

### Python

- Follow [PEP 8]https://pep8.org/
- Use `black` for formatting
- Type hints for all public functions
- Docstrings in Google style

## Testing

### Unit Tests

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_parse_integer() {
        let obj = parse_object(b"42").unwrap();
        assert_eq!(obj, Object::Integer(42));
    }

    #[test]
    #[should_panic(expected = "Invalid")]
    fn test_invalid_input() {
        parse_object(b"invalid").unwrap();
    }
}
```

### Integration Tests

Located in `tests/`:

```rust
// tests/test_integration.rs
use pdf_oxide::PdfDocument;

#[test]
fn test_extract_text_from_simple_pdf() {
    let mut doc = PdfDocument::open("tests/fixtures/simple.pdf").unwrap();
    let text = doc.extract_text(0).unwrap();
    assert!(text.contains("Hello, World!"));
}
```

### Property-Based Testing

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn test_roundtrip(s: String) {
        let encoded = encode(&s);
        let decoded = decode(&encoded)?;
        prop_assert_eq!(s, decoded);
    }
}
```

### Coverage Goals

- **Library code**: 80%+ coverage
- **Critical paths**: 100% coverage (parsing, error handling)

Check coverage:
```bash
cargo tarpaulin --out Html
open tarpaulin-report.html
```

## Documentation

### Code Documentation

- All public items must have doc comments
- Include examples in doc comments
- Run `cargo doc` to check rendered docs

### Planning Documentation

- Update relevant `PHASE_*.md` files if modifying implementation
- Keep `README.md` up to date with new features
- Document breaking changes

### Examples

Add examples to `examples/`:

```rust
// examples/basic.rs
use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = PdfDocument::open("paper.pdf")?;
    let text = doc.extract_text(0)?;
    println!("{}", text);
    Ok(())
}
```

## Submitting Changes

### Pull Request Checklist

Before submitting a PR, ensure:

- [ ] Code compiles without warnings
- [ ] All tests pass (`cargo test`)
- [ ] Code is formatted (`cargo fmt`)
- [ ] Clippy passes (`cargo clippy -- -D warnings`)
- [ ] New code has tests
- [ ] Documentation is updated
- [ ] Commit messages follow conventions
- [ ] PR description explains changes clearly

### Pull Request Template

```markdown
## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Related Issue
Fixes #(issue number)

## Testing
Describe testing done

## Checklist
- [ ] Tests pass
- [ ] Code formatted
- [ ] Documentation updated
```

### Review Process

1. Maintainers will review your PR
2. Address feedback and push updates
3. Once approved, your PR will be merged
4. Your changes will appear in the next release

## Performance Considerations

When working on performance-critical code:

1. **Benchmark before and after**:
   ```bash
   cargo bench
   ```

2. **Profile if needed**:
   ```bash
   cargo install flamegraph
   cargo flamegraph --bench my_bench
   ```

3. **Consider memory usage**:
   - Avoid unnecessary allocations
   - Use `Cow<str>` when appropriate
   - Stream large files instead of loading entirely

## ML Features

When working on ML features (OCR, layout analysis):

- Document model requirements
- Provide ONNX conversion scripts
- Test on CPU-only systems
- Keep models small (<50MB)
- Document accuracy metrics

## License

By contributing, you agree that your contributions will be dual licensed under **MIT OR Apache-2.0**, as defined in the Apache-2.0 license, without any additional terms or conditions.

This means:
- Your code will be available under permissive open-source licenses
- Users can choose either MIT or Apache-2.0 for their needs
- See [LICENSE-MIT]LICENSE-MIT and [LICENSE-APACHE]LICENSE-APACHE for full terms

## Questions?

- Check `docs/spec/` for PDF specification references
- Read code comments and documentation
- Open an issue for questions
- Join discussions on GitHub Discussions

## Recognition

Contributors will be acknowledged in:
- GitHub contributors list
- Release notes
- `CONTRIBUTORS.md` file (coming soon)

Thank you for contributing! 🎉