# octo-flow
High-performance Rust CLI for streaming and filtering GitHub event data.
octo-flow processes massive GitHub Archive (GHArchive) datasets and transforms newline-delimited JSON (NDJSON) event streams into clean tabular reports — using constant memory and zero-copy deserialization.
The tool is designed for data pipelines, log processing, and analytics workflows where large JSON streams must be processed efficiently.
## Features
### Streaming JSON Processing
Processes NDJSON line-by-line using buffered I/O.
This allows multi-gigabyte datasets to be processed while using only a few megabytes of memory.
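The core idea can be sketched with only the standard library (the function name and the substring check are illustrative; the real pipeline uses full JSON parsing):

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Count events of a given type in an NDJSON stream.
/// Only the current line is ever held in memory.
fn count_events(reader: impl BufRead, event_type: &str) -> usize {
    let needle = format!("\"type\":\"{event_type}\"");
    reader
        .lines()
        .filter_map(Result::ok)
        // A substring check stands in for real JSON parsing here.
        .filter(|line| line.contains(&needle))
        .count()
}

fn main() {
    // A Cursor stands in for a file or stdin stream.
    let data = "{\"type\":\"WatchEvent\"}\n{\"type\":\"PushEvent\"}\n{\"type\":\"WatchEvent\"}\n";
    let count = count_events(BufReader::new(Cursor::new(data)), "WatchEvent");
    println!("{count} WatchEvent lines"); // prints: 2 WatchEvent lines
}
```

Because `lines()` pulls one buffered line at a time, memory use stays flat no matter how large the input is.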
### Zero-Copy Deserialization
Event fields are deserialized as `&str` references that borrow from the input buffer instead of allocating new `String`s.
Benefits:
- fewer heap allocations
- better cache locality
- faster processing
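A minimal sketch of the technique (the struct and field names are illustrative, not octo-flow's actual `GitHubEvent` model, and the snippet assumes `serde` and `serde_json` as dependencies):

```rust
use serde::Deserialize;

// Fields borrow from the input line rather than owning a String,
// so deserializing a line performs no per-field allocation.
#[derive(Deserialize)]
struct Event<'a> {
    #[serde(rename = "type")]
    kind: &'a str,
    #[serde(borrow)]
    repo: Repo<'a>,
}

#[derive(Deserialize)]
struct Repo<'a> {
    name: &'a str,
}

fn parse(line: &str) -> Option<Event<'_>> {
    // Borrowing is possible because `line` outlives the returned Event.
    serde_json::from_str(line).ok()
}
```

One caveat: borrowed `&str` deserialization fails for JSON strings containing escape sequences; implementations that need to handle those typically use `Cow<'a, str>` to fall back to an owned buffer.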
### Constant Memory Footprint
The tool never loads the dataset into memory.
Instead it uses a streaming architecture:
```
input stream
     ↓
 BufReader
     ↓
line iterator
     ↓
serde_json parser
     ↓
event filter
     ↓
TSV output
```
This makes octo-flow suitable for:
- large analytics datasets
- CI/CD logs
- observability pipelines
- ETL preprocessing
### Flexible Input Sources
octo-flow can read from:
- local files
- standard input (stdin)
- decompression pipelines
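For instance (filenames are illustrative), the documented `--input` flag accepts either a path or `-`:

```sh
# Local file
octo-flow --input events.ndjson

# Standard input
octo-flow --input - < events.ndjson

# Decompression pipeline
gzip -dc 2024-01-15-12.json.gz | octo-flow --input -
```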
## Example
Filter GitHub Watch events from a GHArchive dataset:
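A typical invocation using the documented flags (the input filename is illustrative):

```sh
octo-flow --input gharchive-hour.ndjson --event WatchEvent
```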
Matching events are emitted as tab-separated rows on standard output.
## Real-World Pipeline
GHArchive publishes hourly GitHub event streams as compressed NDJSON files.
octo-flow integrates naturally with shell pipelines:
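For example, one hour of the archive can be fetched and filtered in a single pipeline (the timestamp in the URL is illustrative):

```sh
# Fetch one hour of GitHub events and extract pushes as TSV
curl -s https://data.gharchive.org/2024-01-15-12.json.gz \
  | gzip -dc \
  | octo-flow --input - --event PushEvent \
  > pushes.tsv
```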
## CLI Options
| Option | Description |
|---|---|
| `--input <FILE>` | Path to NDJSON file (`-` for stdin) |
| `--event <TYPE>` | Optional GitHub event type filter |
Example event types: `PushEvent`, `PullRequestEvent`, `WatchEvent`, `ForkEvent`.
## Documentation
The project includes full Rust API documentation.
Generate the documentation locally with:
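Assuming the standard Cargo documentation workflow:

```sh
cargo doc --open
```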
This will build and open the documentation site for the octo-flow library, including the event model, streaming pipeline, and error handling.
Key components documented in the crate:

- `process_events` — core streaming event pipeline
- `GitHubEvent` — GitHub event data model
- `OctoFlowError` — structured error handling
## Performance
Benchmark on a 9.5MB NDJSON dataset (~65k events):
| Tool | Time |
|---|---|
| jq | 0.40s |
| octo-flow | 0.053s |
| grep | 0.001s |
`grep` is faster but performs no JSON parsing, so plain substring matching can produce false positives.
octo-flow provides structured parsing with near-native speed.
## Testing
The project includes both unit tests and end-to-end CLI tests.
Integration tests use `assert_cmd` to validate the compiled binary against realistic scenarios:
- CLI argument validation
- event filtering correctness
- file handling errors
Run tests:
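Assuming the standard Cargo test runner:

```sh
cargo test
```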
## Installation

### Build from Source
Clone and build with Cargo:
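A typical sequence (the repository URL is left as a placeholder):

```sh
git clone <repository-url> octo-flow
cd octo-flow
cargo build --release
```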
The compiled binary is written to `target/release/octo-flow`.
### Install via Cargo
If you have Rust installed, you can install octo-flow directly from crates.io:
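Assuming the crate is published on crates.io under the same name:

```sh
cargo install octo-flow
```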
## Why Rust?
Rust enables this tool to combine:
- C-like performance
- memory safety
- zero-cost abstractions
- predictable resource usage
These properties make Rust ideal for high-throughput data processing tools like octo-flow.
## License
MIT / Apache 2.0