octo-flow 0.1.2

High-performance Rust CLI for streaming and filtering GitHub event data from GHArchive.

octo-flow



octo-flow processes massive GitHub Archive (GHArchive) datasets, transforming newline-delimited JSON (NDJSON) event streams into clean tab-separated (TSV) reports with constant memory use and zero-copy deserialization.

The tool is designed for data pipelines, log processing, and analytics workflows where large JSON streams must be processed efficiently.


Features

Streaming JSON Processing

Processes NDJSON line-by-line using buffered I/O.

This allows multi-gigabyte datasets to be processed while using only a few megabytes of memory.
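
As a rough illustration of line-by-line streaming (a sketch, not octo-flow's actual source), the function below reads from any buffered reader and reuses a single String buffer for every record, so memory use tracks the longest line rather than the file size. The name count_records is hypothetical.

```rust
use std::io::{self, BufRead};

/// Counts non-empty lines while holding only one line in memory at a time.
pub fn count_records<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut line = String::new(); // one buffer, reused for every record
    let mut count = 0;
    loop {
        line.clear();
        // `read_line` returns 0 at end of input.
        if reader.read_line(&mut line)? == 0 {
            break;
        }
        if !line.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}
```

Because the function is generic over BufRead, the same code works for a BufReader wrapped around a file, for stdin, or for an in-memory byte slice in tests.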


Zero-Copy Deserialization

Event fields are deserialized as borrowed &str references into the input line instead of newly allocated Strings.

Benefits:

  • fewer heap allocations
  • better cache locality
  • faster processing
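
The borrowing idea can be shown without any dependencies. The sketch below is a deliberately simplified stand-in for the serde_json-based parser: the hypothetical event_type function returns a &str slice that points into the input line, so no new String is allocated. It assumes compact JSON with no whitespace after the colon and takes the first "type" key it finds, assumptions a real JSON parser does not need.

```rust
/// Returns the value of the first `"type":"..."` pair as a slice of `line`.
/// The result borrows from `line`; nothing is copied or allocated.
pub fn event_type(line: &str) -> Option<&str> {
    const KEY: &str = "\"type\":\"";
    // Position just past the opening quote of the value.
    let start = line.find(KEY)? + KEY.len();
    // Length of the value, up to its closing quote.
    let len = line[start..].find('"')?;
    Some(&line[start..start + len])
}
```

The returned slice shares its lifetime with the input line, which is why the compiler rejects any attempt to keep it alive after the line buffer is reused.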

Constant Memory Footprint

The tool never loads the whole dataset into memory.

Instead, it processes one line at a time through a streaming architecture:

input stream
↓
BufReader
↓
line iterator
↓
serde_json parser
↓
event filter
↓
TSV output

This makes octo-flow suitable for:

  • large analytics datasets
  • CI/CD logs
  • observability pipelines
  • ETL preprocessing
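
The stages in the diagram can be sketched with the standard library alone. This is an assumption-laden outline rather than the crate's implementation: the hypothetical field helper replaces the serde_json parser with a naive string search, and only the id and type columns are written instead of the full TSV row.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical helper: borrows the string value of the first `"key":"..."` pair.
fn field<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    let pattern = format!("\"{key}\":\"");
    let start = line.find(&pattern)? + pattern.len();
    let len = line[start..].find('"')?;
    Some(&line[start..start + len])
}

/// Reads NDJSON from `input`, keeps events whose type equals `wanted`
/// (or every event when `wanted` is `None`), and writes `id<TAB>type` rows.
/// Returns the number of rows written.
pub fn run_pipeline<R: BufRead, W: Write>(
    input: R,
    out: &mut W,
    wanted: Option<&str>,
) -> io::Result<u64> {
    let mut written = 0;
    // Line iterator stage: one record at a time.
    for line in input.lines() {
        let line = line?;
        // Parse stage: skip lines that lack either field.
        let (Some(id), Some(kind)) = (field(&line, "id"), field(&line, "type")) else {
            continue;
        };
        // Filter stage.
        if wanted.map_or(true, |w| w == kind) {
            // Output stage.
            writeln!(out, "{id}\t{kind}")?;
            written += 1;
        }
    }
    Ok(written)
}
```

Each line is dropped at the end of its loop iteration, so at most one record is held in memory regardless of input size.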

Flexible Input Sources

octo-flow can read from:

  • local files
  • standard input (stdin)
  • decompression pipelines

Example:

zcat 2026-03-11-15.json.gz | octo-flow --input - --event WatchEvent
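
One common way to support both files and stdin behind a single interface is to return a boxed BufRead. The sketch below shows that pattern under the assumption that "-" selects stdin, as in the example above; open_input is a hypothetical name, not a documented octo-flow function.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

/// Opens `path` for buffered reading; `-` selects standard input.
pub fn open_input(path: &str) -> io::Result<Box<dyn BufRead>> {
    if path == "-" {
        Ok(Box::new(BufReader::new(io::stdin())))
    } else {
        // A missing or unreadable file surfaces as an `io::Error`.
        Ok(Box::new(BufReader::new(File::open(path)?)))
    }
}
```

The rest of a program can then treat both sources identically, which is what lets decompression tools such as zcat feed the CLI through a pipe.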

Example

Filter GitHub Watch events from a GHArchive dataset:

octo-flow --input 2015-01-01-15.json --event WatchEvent

Example output:

2489651057	2015-01-01T15:00:03Z	SametSisartenep	visionmedia/debug	WatchEvent
2489651078	2015-01-01T15:00:05Z	comcxx11	phpsysinfo/phpsysinfo	WatchEvent
2489651080	2015-01-01T15:00:05Z	Soufien	wasabeef/awesome-android-libraries	WatchEvent

Real-World Pipeline

GHArchive publishes hourly GitHub event streams as compressed NDJSON files.

octo-flow integrates naturally with shell pipelines:

curl https://data.gharchive.org/2026-03-11-15.json.gz \
  | zcat \
  | octo-flow --input - --event WatchEvent > stars.tsv
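
A TSV file like stars.tsv is easy to consume downstream. As a hedged example, the sketch below tallies rows per repository, assuming the column order seen in the example output (id, timestamp, actor, repository, event type). The function name count_by_repo is hypothetical.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Counts rows per repository, taking the repository from the fourth column.
pub fn count_by_repo<R: BufRead>(tsv: R) -> io::Result<HashMap<String, u64>> {
    let mut counts = HashMap::new();
    for line in tsv.lines() {
        let line = line?;
        // Columns are tab-separated; index 3 is assumed to be the repository.
        if let Some(repo) = line.split('\t').nth(3) {
            *counts.entry(repo.to_owned()).or_insert(0) += 1;
        }
    }
    Ok(counts)
}
```

Rows with fewer than four columns are skipped rather than treated as errors.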

CLI Options

Option            Description
--input <FILE>    Path to NDJSON file (- for stdin)
--event <TYPE>    Optional GitHub event filter

Example event types:

  • PushEvent
  • PullRequestEvent
  • WatchEvent
  • ForkEvent

Documentation

The project includes full Rust API documentation.

Generate the documentation locally with:

cargo doc --open

This will build and open the documentation site for the octo-flow library, including the event model, streaming pipeline, and error handling.

Key components documented in the crate:

  • process_events — core streaming event pipeline
  • GitHubEvent — GitHub event data model
  • OctoFlowError — structured error handling

Performance

Benchmark on a 9.5 MB NDJSON dataset (~65k events):

Tool         Time
jq           0.40s
octo-flow    0.053s
grep         0.001s

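grep is fastest but performs no JSON parsing, so a plain text match can produce false positives, for example when the event name appears inside another field.

octo-flow performs structured parsing and was roughly 7.5x faster than jq in this benchmark (0.40s vs 0.053s).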


Testing

The project includes both unit tests and end-to-end CLI tests.

Integration tests use assert_cmd to validate the compiled binary against realistic scenarios:

  • CLI argument validation
  • event filtering correctness
  • file handling errors

Run tests:

cargo test

Installation

Build from Source

Clone and build with Cargo:

git clone https://github.com/writeonlycode/octo-flow
cd octo-flow
cargo build --release

Binary location:

target/release/octo-flow

Install via Cargo

If you have Rust installed, you can install octo-flow directly from crates.io:

cargo install octo-flow

Why Rust?

Rust enables this tool to combine:

  • C-like performance
  • memory safety
  • zero-cost abstractions
  • predictable resource usage

These properties make Rust ideal for high-throughput data processing tools like octo-flow.


License

MIT / Apache 2.0