sluice
A Rust parser and CLI for the Maven Central Nexus binary index format. Runs without a JVM, streams through the full index (~2.8 GB compressed, ~97M records) in a few minutes, and emits JSON Lines.
For a byte-level specification of the wire format, see docs/binary-format.md. The incremental-update protocol is covered in docs/incremental-updates.md.
Layout
The repo is a Cargo workspace with two crates. crates/core is the library, published as sluice-rs. It's I/O-neutral: it operates on any std::io::Read and has no knowledge of gzip, HTTP, files, or JSON. It parses the Nexus binary header and record stream, decodes fields (including CESU-8 strings), and classifies documents into descriptors, group lists, and artifact add/remove records with parsed UINFO tuples. crates/cli builds the sluice binary, which wraps the library with gzip decoding, argument parsing, and JSON Lines output.
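To make the CESU-8 step concrete, here is a minimal sketch of a decoder. It is not sluice's implementation: decode_cesu8 is a hypothetical name, and the sketch skips overlong-encoding checks. CESU-8 encodes each UTF-16 code unit separately, so a supplementary character arrives as a surrogate pair of two 3-byte sequences rather than as one 4-byte UTF-8 sequence.

```rust
/// Decodes CESU-8 bytes into a String, or returns None on malformed input.
/// Hypothetical helper for illustration; not part of the sluice API.
fn decode_cesu8(bytes: &[u8]) -> Option<String> {
    // Collect UTF-16 code units first, then let the standard library
    // validate and combine surrogate pairs.
    let mut units: Vec<u16> = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i];
        if b0 < 0x80 {
            // One byte: plain ASCII.
            units.push(u16::from(b0));
            i += 1;
        } else if (b0 & 0xE0) == 0xC0 {
            // Two bytes: 110xxxxx 10xxxxxx.
            let b1 = *bytes.get(i + 1)?;
            if (b1 & 0xC0) != 0x80 {
                return None;
            }
            units.push((u16::from(b0 & 0x1F) << 6) | u16::from(b1 & 0x3F));
            i += 2;
        } else if (b0 & 0xF0) == 0xE0 {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (includes surrogates).
            let b1 = *bytes.get(i + 1)?;
            let b2 = *bytes.get(i + 2)?;
            if (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80 {
                return None;
            }
            units.push(
                (u16::from(b0 & 0x0F) << 12)
                    | (u16::from(b1 & 0x3F) << 6)
                    | u16::from(b2 & 0x3F),
            );
            i += 3;
        } else {
            // Stray continuation bytes and 4-byte UTF-8 leads are not CESU-8.
            return None;
        }
    }
    // Rejects unpaired surrogates.
    String::from_utf16(&units).ok()
}
```

Going through UTF-16 code units keeps the surrogate handling in String::from_utf16 instead of hand-rolling it.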
Installation
Homebrew (macOS and Linux)
Cargo
Prebuilt archives
Download the archive for your platform from the latest release, extract, and move sluice onto your PATH.
From source
Quick start
# Parse a gzipped Maven Central index chunk and print artifact adds as
# JSON Lines (with stats on stderr).
Or stream the full Maven Central index straight from Apache without saving it to disk (~2.8 GB compressed, several minutes to parse):
Contributors working from a clone can use the just recipes — see Development below.
CLI options
sluice [OPTIONS] [INPUT]
- INPUT: path to a gzipped Maven index file. Reads from stdin if omitted.
- --include-removes: also emit ArtifactRemove records (type="remove") alongside adds.
- --full: emit all records, including classified artifacts (sources, javadoc, etc.) with their classifier and extension. By default, only root-level artifacts (classifier=NA) are emitted.
- --stats: print summary stats to stderr at end of run.
Output is one JSON object per line, e.g.:
With --full, classified artifacts are included and the classifier field appears:
By default, records whose classifier is anything other than NA are filtered out. Use --full to include all records.
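The filtering rules above can be sketched as a single predicate. This is an illustration, not the CLI's code: EmitOptions, Kind, and should_emit are hypothetical names. It also assumes that remove records are gated only by --include-removes and that the classifier filter applies to adds and removes alike.

```rust
/// Hypothetical mirror of the CLI flags; not the real sluice types.
#[derive(Clone, Copy, Default)]
pub struct EmitOptions {
    /// Corresponds to --include-removes.
    pub include_removes: bool,
    /// Corresponds to --full.
    pub full: bool,
}

/// Whether a record adds or removes an artifact.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Add,
    Remove,
}

/// Returns true if a record with this kind and classifier would be printed.
pub fn should_emit(kind: Kind, classifier: &str, opts: EmitOptions) -> bool {
    // Assumption: removes are dropped unless explicitly requested.
    if kind == Kind::Remove && !opts.include_removes {
        return false;
    }
    // Root-level artifacts carry the classifier "NA"; --full lifts the filter.
    opts.full || classifier == "NA"
}
```

Under these assumptions, --full alone widens the set of adds but does not bring in removes.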
Library usage
The core library reads from any std::io::Read. For gzipped index files, bring your own decompressor — flate2 works. The crate is published as sluice-rs on crates.io; the import path is sluice:
[dependencies]
sluice-rs = "0.1"
flate2 = "1"
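use std::fs::File;
use std::io::BufReader;
use flate2::read::GzDecoder;
use sluice::IndexReader;

// `path` points at a gzipped index file.
let file = File::open(path)?;
let gz = GzDecoder::new(file);
let index = IndexReader::new(BufReader::new(gz))?;
for doc in index {
    // Handle each parsed document here.
}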
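Because the library only asks for a std::io::Read, the same parsing code can be fed from a file, a decompressor, a socket, or an in-memory buffer. The sketch below shows that pattern with a hypothetical read_len_prefixed helper. Its framing (a big-endian u32 length followed by the payload) is an assumption chosen for illustration, not the documented wire format; see docs/binary-format.md for the real layout.

```rust
use std::io::{self, Read};

/// Reads a big-endian u32 length, then that many payload bytes.
/// Hypothetical helper; the framing is illustrative only.
fn read_len_prefixed<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;

    // A production parser would cap `len` before allocating,
    // since it comes from untrusted input.
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

In tests the reader can be a std::io::Cursor over a byte vector. In production it can be a gzip decoder. The function body is identical in both cases.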
Serde support
Enable the serde feature to derive Serialize on all domain types (Record, Uinfo, Document, etc.):
[dependencies]
sluice-rs = { version = "0.1", features = ["serde"] }
= "1"
use ;
// ... set up IndexReader as above ...
for doc in index
Performance
On the full Maven Central index (2.8 GB compressed, ~97M documents), sluice takes about 208 seconds end-to-end. The Java indexer-reader from Apache Maven Indexer takes about 1112 seconds on the same input.
| Tool | Mean | Relative |
|---|---|---|
| sluice (Rust) | 208s | 1.00 |
| indexer-reader (Java) | 1112s | 5.35 |
These numbers aren't directly comparable: the Java tool does additional per-record work (field expansion via RecordExpander) that sluice doesn't, so some of the gap is workload, not implementation. Output matches across all ~97M records. Methodology and reproduction steps are in docs/benchmark.md.
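For readers who want to check the arithmetic, the Relative column and the implied throughput follow directly from the figures above. The helper names are hypothetical, and 97,000,000 is the rounded record count from this section, so the throughput values are approximate: roughly 466,000 records per second for sluice and 87,000 for indexer-reader.

```rust
/// Approximate record count for the full index, as quoted above.
const RECORDS: f64 = 97_000_000.0;

/// Records processed per second for a run of the given duration.
fn throughput(records: f64, seconds: f64) -> f64 {
    records / seconds
}

/// Runtime relative to a baseline (1.00 means equal to the baseline).
fn relative(seconds: f64, baseline_seconds: f64) -> f64 {
    seconds / baseline_seconds
}

/// Rounds to two decimal places, matching the table's precision.
fn round2(value: f64) -> f64 {
    (value * 100.0).round() / 100.0
}
```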
Development
Recipes are run through just (cargo install just or brew install just):
The Rust toolchain is pinned via rust-toolchain.toml. MSRV is 1.75 for the library (sluice-rs) and 1.85 for the CLI (sluice-cli) — clap transitively requires edition2024. Lints are workspace-wide: rust_2018_idioms denied and clippy::pedantic at warn level.
Test fixtures
A small sample fixture (crates/core/tests/fixtures/chunk-sample.gz) is committed for offline testing. To regenerate it from a full Maven Central chunk:
The full fixture is not committed, which keeps clone sizes small.
License
Apache-2.0.