zarja 0.1.3

# zarja

Extract Protocol Buffer definitions from compiled binaries.

When you compile a program that uses protobufs (Go, C++, Java, etc.), the `.proto` definitions often get embedded in the binary as `FileDescriptorProto` messages. zarja finds these embedded descriptors and reconstructs the original `.proto` source files.

## Why

You have a binary. You want to know what protobufs it uses. Maybe you're reverse engineering an API, analyzing a closed-source application, or recovering lost proto files from old builds. zarja extracts what's there.

## Installation

```bash
cargo install zarja

# or build from source
cargo build --release
./target/release/zarja --help
```

## Usage

### Single binary

```bash
# Extract all protos to current directory
zarja -f ./some-binary

# Extract to specific directory
zarja -f ./some-binary -o ./protos

# Just list what's in there
zarja -f ./some-binary --list-only
```

### Directory of binaries

```bash
# Recursively process all binaries in a directory
zarja -d /path/to/binaries -o ./protos

# See what's happening
zarja -d /path/to/binaries -o ./protos -v
```

### Output

```
$ zarja -f ./GeoServices -o ./protos --force -v
Wrote ./protos/AddressObject.proto
Wrote ./protos/geo3-slim.proto
Wrote ./protos/google/protobuf/descriptor.proto
Wrote ./protos/compressed_path.proto
Wrote ./protos/path.proto
INFO Summary: 6 found, 0 duplicates skipped, 1 conflicts renamed, 6 written
```

## How it works

### Finding descriptors

In practice, a serialized `FileDescriptorProto` starts with field 1 (the filename), a length-delimited string ending in `.proto`. zarja scans the binary for the `.proto` byte sequence, backtracks to find the field header (tag byte `0x0A`: field 1, wire type 2), then parses forward using protobuf's wire format to find where the descriptor ends.

```
Binary data:
... garbage ... 0A 0A 74 65 73 74 2E 70 72 6F 74 6F ... more fields ... garbage ...
                │  │  └────── "test.proto" ───────┘
                │  └─ length: 10 bytes (0x0A, same value as the tag byte)
                └─ field 1, wire type 2 (length-delimited)
```
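The backtracking step can be sketched roughly as follows. This is an illustration, not zarja's actual scanner: `find_candidates` is a hypothetical helper, and it assumes filenames are shorter than 128 bytes (so the length fits in one varint byte) and consist of printable ASCII.

```rust
/// Hypothetical helper (not part of zarja's API): returns the offset of the
/// `0x0A` tag byte and the filename for every plausible descriptor start.
fn find_candidates(data: &[u8]) -> Vec<(usize, String)> {
    const SUFFIX: &[u8] = b".proto";
    const TAG: u8 = 0x0A; // (field 1 << 3) | wire type 2

    let mut found = Vec::new();
    for dot in 0..data.len().saturating_sub(SUFFIX.len() - 1) {
        if !data[dot..].starts_with(SUFFIX) {
            continue;
        }
        let end = dot + SUFFIX.len(); // one past the final 'o'

        // Backtrack: try every filename length that fits in a single
        // length byte and check for "tag, length" right before the name.
        for len in (SUFFIX.len() + 1)..=127 {
            if end < len + 2 {
                break; // not enough bytes before the suffix
            }
            let start = end - len;
            let name = &data[start..end];
            if data[start - 2] == TAG
                && data[start - 1] as usize == len
                && name.iter().all(|b| b.is_ascii_graphic())
            {
                found.push((start - 2, String::from_utf8_lossy(name).into_owned()));
                break;
            }
        }
    }
    found
}
```

Checking the length byte against the distance to the end of `.proto` is what keeps a stray `0x0A` inside or before the filename from being mistaken for the field header.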

The tricky part: binaries contain lots of noise, and descriptors can be adjacent to each other or surrounded by garbage. zarja's scanner handles edge cases like:

- Filenames exactly 10 bytes long (where the length byte is also `0x0A`)
- Adjacent descriptors that share boundaries
- Corrupted or partial descriptors (skipped gracefully)
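One way to find where a descriptor ends is to walk the wire format field by field and stop at the first byte sequence that is not a plausible field. The sketch below assumes exactly that stopping rule. `descriptor_end` and the `MAX_FIELD` bound are hypothetical, and zarja's real heuristics may differ.

```rust
/// Decode one base-128 varint. Returns the value and the number of bytes read.
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    // A 64-bit varint occupies at most 10 bytes.
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // ran out of input, or the varint was too long
}

/// Assumption: a sanity bound on field numbers, chosen for illustration.
const MAX_FIELD: u64 = 15;

/// Hypothetical end-finder: returns the offset just past the last field
/// that parsed cleanly, starting at `start`.
fn descriptor_end(data: &[u8], start: usize) -> usize {
    let mut pos = start;
    while pos < data.len() {
        let Some((tag, tag_len)) = read_varint(&data[pos..]) else { break };
        let (field, wire_type) = (tag >> 3, tag & 7);
        let body = pos + tag_len;

        // Work out where this field would end, based on its wire type.
        let next = match wire_type {
            0 => read_varint(&data[body..]).map(|(_, n)| body + n), // varint
            1 => Some(body + 8),                                    // 64-bit
            5 => Some(body + 4),                                    // 32-bit
            2 => read_varint(&data[body..]).and_then(|(len, n)| {
                // Length-delimited: the payload must fit in the buffer.
                let remaining = (data.len() - body - n) as u64;
                (len <= remaining).then(|| body + n + len as usize)
            }),
            _ => None, // groups and unknown wire types end the walk
        };

        match next {
            Some(end) if (1..=MAX_FIELD).contains(&field) && end <= data.len() => pos = end,
            _ => break,
        }
    }
    pos
}
```

If the very first field fails to parse, the function returns `start` unchanged, which a caller can treat as "corrupted or partial, skip it".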

### Reconstructing proto source

Once zarja has the raw `FileDescriptorProto` bytes, it parses them with prost and walks the descriptor tree to emit valid `.proto` syntax:

```
FileDescriptorProto
├── name: "example.proto"
├── package: "myapp"
├── message_type[]
│   └── DescriptorProto
│       ├── name: "Request"
│       ├── field[]
│       │   └── FieldDescriptorProto { name: "id", number: 1, type: INT32 }
│       └── nested_type[]
└── enum_type[]
```

Becomes:

```protobuf
syntax = "proto3";

package myapp;

message Request {
  int32 id = 1;
}
```

The reconstructor handles proto2 vs proto3 syntax, nested messages, enums, oneofs, maps, services, extensions, reserved fields, and most field options.
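As a rough illustration of the tree walk, the sketch below uses small stand-in types instead of prost's generated descriptor structs. `File`, `Message`, `Field`, and `FieldType` are hypothetical and cover only a sliver of what a real descriptor holds.

```rust
// Stand-in types for illustration only; zarja works on prost's descriptor
// types, which carry many more fields than these.
enum FieldType {
    Int32,
    Str,
    Message(String), // reference to another message by name
}

impl FieldType {
    /// Map the descriptor's type to the keyword used in `.proto` source.
    fn keyword(&self) -> &str {
        match self {
            FieldType::Int32 => "int32",
            FieldType::Str => "string",
            FieldType::Message(name) => name.as_str(),
        }
    }
}

struct Field {
    name: String,
    number: i32,
    ty: FieldType,
}

struct Message {
    name: String,
    fields: Vec<Field>,
    nested: Vec<Message>,
}

struct File {
    syntax: String,
    package: String,
    messages: Vec<Message>,
}

/// Emit one message, recursing into nested messages with deeper indentation.
fn emit_message(msg: &Message, depth: usize, out: &mut String) {
    let pad = "  ".repeat(depth);
    out.push_str(&format!("{}message {} {{\n", pad, msg.name));
    for nested in &msg.nested {
        emit_message(nested, depth + 1, out);
    }
    for f in &msg.fields {
        out.push_str(&format!("{}  {} {} = {};\n", pad, f.ty.keyword(), f.name, f.number));
    }
    out.push_str(&format!("{}}}\n", pad));
}

/// Emit the file header, then every top-level message.
fn emit_file(file: &File) -> String {
    let mut out = format!("syntax = \"{}\";\n\npackage {};\n", file.syntax, file.package);
    for msg in &file.messages {
        out.push('\n');
        emit_message(msg, 0, &mut out);
    }
    out
}
```

Feeding it a `File` with package `myapp` and a single `Request` message holding `int32 id = 1` yields the same text as the example above.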

## Conflict resolution

When processing multiple binaries, you'll often find the same `.proto` file in several of them. Sometimes they're identical (duplicates), sometimes they differ (conflicts). zarja tracks content by hash and handles both:

| Situation | Behavior |
|-----------|----------|
| Same filename, same content | Skip (duplicate) |
| Same filename, different content | Rename with suffix |

Three strategies for handling conflicts:

```bash
# Append content hash (default): descriptor~a1b2c3d4.proto
zarja -d ./bins -o ./protos --conflict-strategy hash-suffix

# Append source binary name: descriptor~from-myapp.proto
zarja -d ./bins -o ./protos --conflict-strategy source-suffix

# Keep first, skip rest
zarja -d ./bins -o ./protos --conflict-strategy skip-conflicts
```
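The duplicate-versus-conflict decision for the default `hash-suffix` strategy might look like the sketch below. `Registry` and `Outcome` are hypothetical names, and the standard library's `DefaultHasher` stands in for whatever hash zarja actually uses, so the suffix values will not match zarja's output.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
enum Outcome {
    /// Write the content under this (possibly renamed) path.
    Write(String),
    /// Same filename and same content were already seen.
    SkipDuplicate,
}

/// Hypothetical registry: remembers the content hashes seen per filename.
#[derive(Default)]
struct Registry {
    seen: HashMap<String, Vec<u64>>,
}

fn content_hash(content: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

impl Registry {
    fn register(&mut self, filename: &str, content: &[u8]) -> Outcome {
        let hash = content_hash(content);
        let hashes = self.seen.entry(filename.to_string()).or_default();

        if hashes.contains(&hash) {
            return Outcome::SkipDuplicate;
        }
        let first = hashes.is_empty();
        hashes.push(hash);

        if first {
            // First time this filename appears: keep the original name.
            Outcome::Write(filename.to_string())
        } else {
            // Conflict: same name, different content. Append 8 hex digits.
            let stem = filename.strip_suffix(".proto").unwrap_or(filename);
            Outcome::Write(format!("{}~{:08x}.proto", stem, hash as u32))
        }
    }
}
```

Registering `descriptor.proto` twice with identical bytes gives `Write` then `SkipDuplicate`; a third call with different bytes gives `Write` with a `descriptor~xxxxxxxx.proto` path.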

## Binary detection

When scanning directories, zarja needs to figure out which files are actually binaries worth scanning. It uses a combination of:

1. **Extension filtering** - skips `.txt`, `.json`, `.py`, `.proto`, etc.
2. **Size filtering** - skips files < 1KB or > 500MB
3. **Magic bytes** - looks for the leading bytes of Mach-O (`0xCFFAEDFE`, a 64-bit little-endian image), ELF (`0x7F454C46`, i.e. `\x7FELF`), PE (`MZ`)
4. **Fallback** - tries files with no extension
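The four checks above could be combined along these lines. The function names are hypothetical, the extension list is only the subset named above, and "1KB" and "500MB" are assumed to mean binary units.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum BinaryKind {
    MachO,
    Elf,
    Pe,
}

/// Identify a binary format from the first bytes of a file.
fn sniff(header: &[u8]) -> Option<BinaryKind> {
    if header.starts_with(&[0xCF, 0xFA, 0xED, 0xFE]) {
        Some(BinaryKind::MachO) // 64-bit little-endian Mach-O
    } else if header.starts_with(b"\x7FELF") {
        Some(BinaryKind::Elf)
    } else if header.starts_with(b"MZ") {
        Some(BinaryKind::Pe)
    } else {
        None
    }
}

/// Extensions that are never worth scanning (illustrative subset).
fn skipped_extension(path: &Path) -> bool {
    matches!(
        path.extension().and_then(|e| e.to_str()),
        Some("txt" | "json" | "py" | "proto")
    )
}

/// Assumption: 1KB = 1024 bytes and 500MB = 500 * 1024 * 1024 bytes.
fn size_ok(len: u64) -> bool {
    (1024..=500 * 1024 * 1024).contains(&len)
}

/// Hypothetical combined check: extension, size, magic bytes, then fallback.
fn should_scan(path: &Path, len: u64, header: &[u8]) -> bool {
    if skipped_extension(path) || !size_ok(len) {
        return false;
    }
    // Fallback: files without an extension are tried even without a match.
    sniff(header).is_some() || path.extension().is_none()
}
```

Running the cheap extension and size checks first means most files are rejected without being opened at all.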

## Project structure

```
zarja/
├── crates/
│   ├── zarja-core/          # Library: scanner + reconstructor
│   │   ├── scanner/         # Binary scanning, wire format parsing
│   │   ├── proto/           # Proto reconstruction, source generation
│   │   └── error.rs         # Error types
│   └── zarja-cli/           # Binary: CLI interface
```

### Using as a library

```rust
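use zarja_core::{ProtoReconstructor, Scanner};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the whole binary into memory; the scanner works on a byte slice.
    let data = std::fs::read("./binary")?;
    let scanner = Scanner::new();

    // Each scan result holds the raw bytes of one candidate descriptor.
    for result in scanner.scan(&data)? {
        match ProtoReconstructor::from_bytes(result.as_bytes()) {
            Ok(proto) => {
                println!("// {}", proto.filename());
                println!("{}", proto.reconstruct());
            }
            // A candidate that fails to parse is reported and skipped.
            Err(e) => eprintln!("Failed to parse: {}", e),
        }
    }
    Ok(())
}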
```

## Limitations

**What gets embedded depends on the language and build:**

- **Go**: Usually embeds full descriptors for reflection. Good extraction results.
- **C++**: Depends on build flags. Sometimes only has partial descriptors or none.
- **Java**: Often embeds descriptors. Results vary by protobuf version.

**What zarja can't recover:**

- Comments from the original `.proto` files (embedded descriptors normally omit `source_code_info`)
- Original formatting and whitespace
- Complete import paths, when dependencies weren't embedded
- Custom options beyond the standard set

**Known gaps in reconstruction:**

- Some complex custom options aren't fully rendered
- `optimize_for`, `deprecated`, and a few other file options are not emitted yet
- Group fields (deprecated proto2 feature) are parsed but output is minimal

## Options

```
-f, --file <FILE>           Single binary to process
-d, --directory <DIR>       Directory of binaries (recursive)
-o, --output <DIR>          Output directory [default: .]
-v, --verbose               Increase verbosity (-v, -vv, -vvv)
    --force                 Overwrite existing files
    --dry-run               Show what would be extracted
    --list-only             List proto filenames only
    --max-descriptors <N>   Limit descriptors per file (0 = unlimited)
    --conflict-strategy     hash-suffix | source-suffix | skip-conflicts
    --format                proto | filename
```

## Examples

**Recover protos from a macOS system framework:**

```bash
zarja -f /System/Library/PrivateFrameworks/GeoServices.framework/GeoServices \
      -o ./apple-protos --force
```

**Scan an Android APK's native libraries:**

```bash
unzip app.apk -d ./unpacked
zarja -d ./unpacked/lib -o ./protos -v
```

**Quick inventory of what's in a binary:**

```bash
zarja -f ./mystery-binary --list-only
```

**Diff proto versions between two builds:**

```bash
zarja -f ./v1/server -o ./v1-protos
zarja -f ./v2/server -o ./v2-protos
diff -r ./v1-protos ./v2-protos
```

## Performance

zarja processes a ~35 MB binary in about 40 ms on an M1 Mac. The scanner is single-pass and reconstruction is a straightforward tree traversal. Memory usage is proportional to binary size, because zarja reads the whole file into memory.

## Building

```bash
git clone https://github.com/example/zarja
cd zarja
cargo build --release
cargo test
```

Minimum Rust version: 1.75

## License

MIT