sluice
Until now, reading the Maven Central index required the JVM, a custom script wiring up the Java indexer-reader library, and patience. Sluice is a single binary that does it about 5x faster (see Performance).
A fast, streaming parser for the Maven Central Nexus binary index format, plus a CLI that turns index files into JSON Lines.
For a byte-level specification of the wire format, see docs/binary-format.md. For the incremental-update protocol, see docs/incremental-updates.md.
Layout
This is a Cargo workspace with two crates:
- crates/core: sluice, the library. I/O-neutral: operates on any std::io::Read, with no knowledge of gzip, HTTP, files, or JSON. Parses the Nexus binary header and record stream, decodes fields (including CESU-8 strings), and classifies documents into descriptors, group lists, and artifact add/remove records with parsed UINFO tuples.
- crates/cli: sluice-cli, which builds the sluice binary. Handles gzip decoding, argument parsing, and JSON Lines output on stdout.
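To illustrate the CESU-8 decoding mentioned above, here is a minimal sketch using only the standard library. It is not the library's implementation: the name decode_cesu8 is hypothetical, and the sketch assumes the Java-style encoding in which a supplementary character is stored as two 3-byte surrogate halves. It decodes to UTF-16 code units first and lets the standard library pair the surrogates. Overlong encodings are not rejected.

```rust
/// Hypothetical helper: decode CESU-8 bytes into a String.
/// Returns None on malformed input (bad continuation bytes, truncated
/// sequences, 4-byte UTF-8 sequences, or unpaired surrogates).
pub fn decode_cesu8(bytes: &[u8]) -> Option<String> {
    let mut units: Vec<u16> = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i];
        if b0 < 0x80 {
            // Single-byte ASCII.
            units.push(u16::from(b0));
            i += 1;
        } else if b0 & 0xE0 == 0xC0 {
            // Two-byte sequence: 110xxxxx 10xxxxxx.
            let b1 = *bytes.get(i + 1)?;
            if b1 & 0xC0 != 0x80 {
                return None;
            }
            units.push((u16::from(b0 & 0x1F) << 6) | u16::from(b1 & 0x3F));
            i += 2;
        } else if b0 & 0xF0 == 0xE0 {
            // Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
            // Surrogate halves (U+D800..U+DFFF) arrive through this branch.
            let b1 = *bytes.get(i + 1)?;
            let b2 = *bytes.get(i + 2)?;
            if b1 & 0xC0 != 0x80 || b2 & 0xC0 != 0x80 {
                return None;
            }
            units.push(
                (u16::from(b0 & 0x0F) << 12)
                    | (u16::from(b1 & 0x3F) << 6)
                    | u16::from(b2 & 0x3F),
            );
            i += 3;
        } else {
            // Stray continuation byte or a 4-byte lead, neither valid in CESU-8.
            return None;
        }
    }
    // from_utf16 combines surrogate pairs and rejects unpaired halves.
    String::from_utf16(&units).ok()
}
```

For example, the six bytes ED A0 BD ED B8 80 decode to the single character U+1F600, while the same character in standard 4-byte UTF-8 is rejected by this sketch.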
Quick start
You need a Rust toolchain (pinned in rust-toolchain.toml) and the just task runner (cargo install just or brew install just).
# Fetch the latest incremental chunk into fixtures/chunk-latest.gz
# Parse it and print artifact adds as JSON Lines (with stats on stderr)
# Or parse the full Maven Central index (~2.8 GB download, ~minutes to parse)
Under the hood:
CLI options
sluice [OPTIONS] [INPUT]
- INPUT: path to a gzipped Maven index file. Reads from stdin if omitted.
- --include-removes: also emit ArtifactRemove records (type="remove") alongside adds.
- --full: emit all records, including classified artifacts (sources, javadoc, etc.), with their classifier and extension. By default, only root-level artifacts (classifier=NA) are emitted.
- --stats: print summary stats to stderr at end of run.
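The classifier and extension referred to above come from the UINFO field. As a rough sketch, assuming the layout commonly described for the Maven Indexer (groupId|artifactId|version|classifier, optionally followed by |extension, with NA standing for "no classifier"), a UINFO string could be split like this. The type UinfoSketch and the function parse_uinfo are hypothetical and are not the sluice API.

```rust
/// Hypothetical stand-in for a parsed UINFO tuple (not the sluice type).
#[derive(Debug, PartialEq, Eq)]
pub struct UinfoSketch<'a> {
    pub group_id: &'a str,
    pub artifact_id: &'a str,
    pub version: &'a str,
    /// None when the raw classifier is the sentinel "NA".
    pub classifier: Option<&'a str>,
    /// The fifth segment is assumed to be optional.
    pub extension: Option<&'a str>,
}

/// Split a UINFO string on '|'. Returns None if fewer than four
/// or more than five segments are present.
pub fn parse_uinfo(raw: &str) -> Option<UinfoSketch<'_>> {
    let mut parts = raw.split('|');
    let group_id = parts.next()?;
    let artifact_id = parts.next()?;
    let version = parts.next()?;
    let classifier = match parts.next()? {
        "NA" => None,
        other => Some(other),
    };
    let extension = parts.next();
    if parts.next().is_some() {
        return None; // more segments than this sketch expects
    }
    Some(UinfoSketch {
        group_id,
        artifact_id,
        version,
        classifier,
        extension,
    })
}
```

Borrowing the segments from the input avoids one allocation per field, which matters when the same step runs over tens of millions of records.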
Output is one JSON object per line, e.g.:
With --full, classified artifacts are included and the classifier field appears:
By default, records whose classifier is anything other than NA are filtered out. Use --full to include all records.
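The filtering rules described in this section can be summarized in a small predicate. This is a sketch under stated assumptions: Options, Kind, Record, and should_emit are hypothetical names rather than the CLI's actual types, and the sketch assumes the classifier filter applies to remove records as well as adds.

```rust
/// Hypothetical mirror of the --include-removes and --full flags.
pub struct Options {
    pub include_removes: bool,
    pub full: bool,
}

/// Hypothetical record kinds: artifact add or artifact remove.
pub enum Kind {
    Add,
    Remove,
}

/// Hypothetical minimal record: its kind plus the raw classifier ("NA" if none).
pub struct Record<'a> {
    pub kind: Kind,
    pub classifier: &'a str,
}

/// Decide whether a record is written to stdout.
pub fn should_emit(rec: &Record<'_>, opts: &Options) -> bool {
    // Removes are only emitted when explicitly requested.
    if matches!(rec.kind, Kind::Remove) && !opts.include_removes {
        return false;
    }
    // Without --full, keep only root-level artifacts (classifier = NA).
    opts.full || rec.classifier == "NA"
}
```

Keeping the decision in one pure function makes the flag interactions easy to test without touching any I/O.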
Library usage
The core library is I/O-neutral — it reads from any std::io::Read. For gzipped index files, bring your own decompressor (e.g. flate2). The crate is published as sluice-rs on crates.io; the import path is sluice:
[dependencies]
sluice-rs = "0.1"
flate2 = "1"
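use std::fs::File;
use std::io::BufReader;
use flate2::read::GzDecoder;
use sluice::IndexReader;

// Example path (the chunk fetched in Quick start); substitute your own file.
let file = File::open("fixtures/chunk-latest.gz")?;
let gz = GzDecoder::new(BufReader::new(file));
let index = IndexReader::new(gz)?;
for doc in index {
    // Handle each document here.
}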
Serde support
Enable the serde feature to derive Serialize on all domain types (Record, Uinfo, Document, etc.):
[dependencies]
sluice-rs = { version = "0.1", features = ["serde"] }
serde_json = "1"  # assumed crate name; any serde serializer works
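use sluice::IndexReader;
// ... set up IndexReader as above ...
for doc in index {
    // Serialize each document here, e.g. with serde_json::to_string (assumed dependency).
}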
Performance
Sluice is ~5x faster than the Java Apache Maven Indexer indexer-reader on the full Maven Central index (2.8 GB compressed, ~97M documents):
| Tool | Mean | Relative |
|---|---|---|
| sluice (Rust) | 208s | 1.00 |
| indexer-reader (Java) | 1112s | 5.35 |
Both tools produce identical output across all ~97M records. The Java tool does additional per-record work (field expansion via RecordExpander) that sluice skips, so the workloads are not identical. See docs/benchmark.md for methodology, reproduction steps, and a detailed discussion.
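The Relative column and the implied throughput follow directly from the mean times in the table. The short sketch below reproduces that arithmetic; the function names are illustrative, and the document count is the approximate ~97M figure quoted above, so the throughput values are estimates.

```rust
/// Ratio of the slower mean time to the faster one (the "Relative" column).
pub fn relative(slower_secs: f64, faster_secs: f64) -> f64 {
    slower_secs / faster_secs
}

/// Approximate throughput in documents per second.
pub fn docs_per_sec(docs: f64, secs: f64) -> f64 {
    docs / secs
}

/// Print the figures derived from the table: 208 s vs 1112 s over ~97M documents.
pub fn print_summary() {
    let docs = 97_000_000.0;
    println!("relative: {:.2}", relative(1112.0, 208.0));
    println!("sluice: {:.0} docs/s", docs_per_sec(docs, 208.0));
    println!("indexer-reader: {:.0} docs/s", docs_per_sec(docs, 1112.0));
}
```

With these inputs the ratio comes out at roughly 5.35, matching the table, and sluice's throughput is on the order of 466,000 documents per second.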
Development
The Rust toolchain is pinned via rust-toolchain.toml. MSRV is 1.75 for the library (sluice-rs) and 1.85 for the CLI (sluice-cli) — clap transitively requires edition2024. Lints are workspace-wide: rust_2018_idioms denied and clippy::pedantic at warn level.
Test fixtures
A small sample fixture (crates/core/tests/fixtures/chunk-sample.gz) is committed for offline testing. To regenerate it from a full Maven Central chunk:
The full fixture is not committed to keep clone sizes small.
License
Apache-2.0.