wme-stream
Streaming utilities for the Wikimedia Enterprise API.
This crate provides utilities for processing NDJSON streams from Wikimedia Enterprise APIs (Snapshot and Realtime). It handles deduplication, checkpoint/resume functionality, and efficient streaming parsing.
Features
- NDJSON Streaming - Parse newline-delimited JSON from snapshots and realtime feeds
- Deduplication - Handle duplicate articles (< 1% in snapshots) by keeping the latest version
- Checkpoint/Resume - Save and resume long-running downloads from any point
- Visitor Pattern - Process articles without full materialization for low memory usage
- Progress Events - Track download and processing progress
- Statistics - Collect processing metrics (articles/sec, bytes, errors)
Usage
Parse NDJSON from a Snapshot
use NdjsonStream;
use Article;
use StreamExt;
use File;
use BufReader;
use pin;
# async
Handle Duplicates
use dedup_stream;
use StreamExt;
use pin;
# async # where
# S: Stream,
#
Checkpoint and Resume
use ResumeCheckpoint;
# async
Visitor Pattern (Low Memory)
use ArticleVisitor;
use Article;
use Value;
Architecture
Modules
ndjson- NDJSON parsing utilitiesdedup- Duplicate detection and removalcheckpoint- Resume checkpoint functionalityvisitor- Visitor trait for streaming processing
Key Types
NdjsonStream- Parse NDJSON streamsdedup_stream()- Wrap streams to deduplicate articlesResumeCheckpoint- Save/load processing stateArticleVisitor- Trait for visitor patternProcessingStats- Track processing metricsSnapshotEvent- Progress events for UI updates
Snapshot Processing
Snapshots are delivered as .tar.gz files containing NDJSON:
- Download chunks (use parallel downloads for speed)
- Decompress tarball
- Stream NDJSON lines
- Deduplicate articles
- Process or store results
Realtime Streaming
Realtime API provides SSE or NDJSON streams:
- Connect to streaming endpoint
- Receive events (update, delete, visibility-change)
- Track partition/offset for resume
- Process events incrementally
License
This project is licensed under the terms of the workspace license.