sluice
Until now, reading the Maven Central index required the JVM, a custom script wiring up the Java indexer-reader library, and patience. Sluice is a single binary that does the same job about 5x faster.
A fast, streaming parser for the Maven Central Nexus binary index format, plus a CLI that turns index files into JSON Lines.
For a byte-level specification of the wire format, see docs/binary-format.md. For the incremental-update protocol, see docs/incremental-updates.md.
Layout
This is a Cargo workspace with two crates:
- crates/core: sluice, the library. I/O-neutral: operates on any std::io::Read, with no knowledge of gzip, HTTP, files, or JSON. Parses the Nexus binary header and record stream, decodes fields (including CESU-8 strings), and classifies documents into descriptors, group lists, and artifact add/remove records with parsed UINFO tuples.
- crates/cli: sluice-cli, which builds the sluice binary. Handles gzip decoding, argument parsing, and JSON Lines output on stdout.
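To illustrate what decoding CESU-8 strings involves, here is a minimal sketch. It is not sluice's actual decoder, and `decode_cesu8` is a hypothetical name. CESU-8 encodes each UTF-16 code unit as a 1- to 3-byte sequence, so a supplementary character arrives as two 3-byte surrogate halves rather than one 4-byte UTF-8 sequence. The sketch rebuilds the UTF-16 code units and lets the standard library pair the surrogates. It assumes lenient handling of overlong forms, which a stricter decoder would reject.

```rust
/// Decode CESU-8 bytes into a String, or return None on malformed input.
/// Hypothetical helper for illustration; not part of the sluice API.
fn decode_cesu8(bytes: &[u8]) -> Option<String> {
    let mut units: Vec<u16> = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i];
        // The lead byte determines the sequence length and the initial payload bits.
        let (len, init) = match b0 {
            0x00..=0x7F => (1, u16::from(b0)),
            0xC0..=0xDF => (2, u16::from(b0 & 0x1F)),
            0xE0..=0xEF => (3, u16::from(b0 & 0x0F)),
            // Stray continuation byte, or a 4-byte lead that CESU-8 never uses.
            _ => return None,
        };
        // Fails with None if the sequence is truncated.
        let tail = bytes.get(i + 1..i + len)?;
        let mut unit = init;
        for &b in tail {
            if b & 0xC0 != 0x80 {
                return None; // not a continuation byte
            }
            unit = (unit << 6) | u16::from(b & 0x3F);
        }
        units.push(unit);
        i += len;
    }
    // from_utf16 combines surrogate pairs and rejects unpaired surrogates.
    String::from_utf16(&units).ok()
}
```

For example, U+1F600 is the surrogate pair D83D DE00 in UTF-16, which CESU-8 writes as the six bytes ED A0 BD ED B8 80. Plain UTF-8 would write the same character as the four bytes F0 9F 98 80, which this sketch rejects.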
Installation
Homebrew (macOS and Linux)
Cargo
Prebuilt archives
Download the archive for your platform from the latest release, extract, and move sluice onto your PATH.
From source
Quick start
# Parse a gzipped Maven Central index chunk and print artifact adds as
# JSON Lines (with stats on stderr).
Or stream the full Maven Central index straight from Apache without saving it to disk (~2.8 GB compressed, several minutes to parse):
Contributors working from a clone can use the just recipes — see Development below.
CLI options
sluice [OPTIONS] [INPUT]
- INPUT: path to a gzipped Maven index file. Reads from stdin if omitted.
- --include-removes: also emit ArtifactRemove records (type="remove") alongside adds.
- --full: emit all records, including classified artifacts (sources, javadoc, etc.), with their classifier and extension. By default, only root-level artifacts (classifier=NA) are emitted.
- --stats: print summary stats to stderr at the end of the run.
Output is one JSON object per line, e.g.:
With --full, classified artifacts are included and the classifier field appears:
By default, records whose classifier is anything other than NA are filtered out. Use --full to include all records.
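The filtering rules above can be sketched as a single predicate. This is an illustration, not the CLI's actual code: the `Kind`, `Rec`, and `Options` types and the `should_emit` function are hypothetical. It also assumes that --full and --include-removes act independently, which the option descriptions suggest but do not state outright.

```rust
/// Hypothetical record kinds, mirroring the add/remove distinction.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Add,
    Remove,
}

/// Hypothetical minimal record: only the fields the filter needs.
#[derive(Debug, Clone)]
struct Rec {
    kind: Kind,
    classifier: String, // "NA" marks a root-level artifact
}

/// Hypothetical mirror of the two filtering flags.
#[derive(Debug, Clone, Copy, Default)]
struct Options {
    include_removes: bool,
    full: bool,
}

/// Decide whether a record would be written to stdout.
fn should_emit(rec: &Rec, opts: Options) -> bool {
    // Removes are dropped unless --include-removes is set (assumed independent of --full).
    if rec.kind == Kind::Remove && !opts.include_removes {
        return false;
    }
    // Without --full, only root-level artifacts (classifier NA) pass.
    opts.full || rec.classifier == "NA"
}
```

Under these assumptions, a sources jar (classifier `sources`) is emitted only with --full, and a remove record is emitted only with --include-removes.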
Library usage
The core library is I/O-neutral — it reads from any std::io::Read. For gzipped index files, bring your own decompressor (e.g. flate2). The crate is published as sluice-rs on crates.io; the import path is sluice:
[dependencies]
sluice-rs = "0.1"
flate2 = "1"
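use std::fs::File;
use std::io::BufReader;
use flate2::read::GzDecoder;
use sluice::IndexReader;

// "index.gz" is a placeholder path to a gzipped index file.
let file = File::open("index.gz")?;
let gz = GzDecoder::new(BufReader::new(file));
let index = IndexReader::new(gz)?;
for doc in index {
    // ... handle each document ...
}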
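What "I/O-neutral" buys in practice is that parsing code written against std::io::Read accepts a file, a gzip decoder, a socket, or an in-memory buffer without change. The sketch below shows the pattern with a hypothetical helper, `read_u32_be`, that is not part of the sluice API. The big-endian byte order is an assumption made for illustration; docs/binary-format.md is the authority on the real layout.

```rust
use std::io::{self, Read};

/// Read one big-endian u32 from any byte source.
/// Hypothetical helper; the byte order is an assumption for this sketch.
fn read_u32_be<R: Read>(reader: &mut R) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    // read_exact fails with UnexpectedEof if fewer than 4 bytes remain.
    reader.read_exact(&mut buf)?;
    Ok(u32::from_be_bytes(buf))
}

/// Read every complete u32 from a source until it is exhausted.
fn read_all_u32<R: Read>(mut reader: R) -> Vec<u32> {
    let mut out = Vec::new();
    while let Ok(value) = read_u32_be(&mut reader) {
        out.push(value);
    }
    out
}
```

Because `&[u8]` implements `Read`, tests can feed byte literals directly, for example `read_all_u32(&[0u8, 0, 1, 2][..])`. The same function accepts a `GzDecoder` wrapped around a `File`.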
Serde support
Enable the serde feature to derive Serialize on all domain types (Record, Uinfo, Document, etc.):
[dependencies]
sluice-rs = { version = "0.1", features = ["serde"] }
serde_json = "1"
use sluice::IndexReader;
// ... set up IndexReader as above ...
for doc in index {
    // ... serialize each document, e.g. with serde_json ...
}
Performance
Sluice is ~5x faster than the Java Apache Maven Indexer indexer-reader on the full Maven Central index (2.8 GB compressed, ~97M documents):
| Tool | Mean | Relative |
|---|---|---|
| sluice (Rust) | 208s | 1.00 |
| indexer-reader (Java) | 1112s | 5.35 |
Both tools produce identical output across all ~97M records. The Java tool does additional per-record work (field expansion via RecordExpander) that sluice skips, so the workloads are not identical. See docs/benchmark.md for methodology, reproduction steps, and a detailed discussion.
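The Relative column is each tool's mean wall time divided by the fastest mean. The short sketch below reproduces that arithmetic and derives an approximate throughput from the figures quoted above. The function names are illustrative only, and the 97M document count is the rounded figure from this section, so the throughput is an estimate.

```rust
/// Ratio of a tool's mean time to the reference (fastest) mean time.
fn relative(mean_secs: f64, reference_secs: f64) -> f64 {
    mean_secs / reference_secs
}

/// Approximate documents processed per second.
fn docs_per_sec(documents: f64, mean_secs: f64) -> f64 {
    documents / mean_secs
}

/// Round to two decimal places, as the table does.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

With the table's numbers, `round2(relative(1112.0, 208.0))` gives 5.35, and `docs_per_sec(97.0e6, 208.0)` comes out at roughly 466,000 documents per second for sluice, against roughly 87,000 for the Java reader.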
Development
Recipes are run through just (cargo install just or brew install just):
The Rust toolchain is pinned via rust-toolchain.toml. MSRV is 1.75 for the library (sluice-rs) and 1.85 for the CLI (sluice-cli) — clap transitively requires edition2024. Lints are workspace-wide: rust_2018_idioms denied and clippy::pedantic at warn level.
Test fixtures
A small sample fixture (crates/core/tests/fixtures/chunk-sample.gz) is committed for offline testing. To regenerate it from a full Maven Central chunk:
To keep clone sizes small, the full fixture is not committed.
License
Apache-2.0.