Crate pvstream

§pvstream

Download, parse, and filter Wikimedia pageviews files as a stream.

This library provides efficient streaming access to Wikimedia’s hourly pageview dumps. It can download and parse multi-gigabyte compressed files on the fly, without holding the entire file in memory.

§Features

  • Streaming parsing: Process files as they download, minimizing memory usage
  • Flexible filtering: Filter by language, domain, page title (regex), view counts, and more
  • Performance optimization: Apply regex filters before parsing, so lines that do not match are skipped without being parsed
  • Parquet export: Convert filtered data to Parquet format for analysis
  • Rust and Python: Native Rust library with Python bindings via PyO3
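
The "filter before parsing" idea from the feature list can be sketched with the standard library alone. This is a minimal illustration, not pvstream's implementation: `Row`, `parse_line`, and `filter_rows` are hypothetical names, a plain substring check stands in for the regex filter, and the input is assumed to be already decompressed text in the space-separated `domain_code page_title count_views total_response_size` layout used by the hourly pageviews files.

```rust
use std::io::BufRead;

/// One parsed pageviews row.
/// Hypothetical type for illustration; not pvstream's own row type.
#[derive(Debug, PartialEq)]
struct Row {
    domain_code: String,
    page_title: String,
    views: u64,
}

/// Parse one line of the assumed form
/// `domain_code page_title count_views total_response_size`.
/// Returns `None` if a field is missing or the view count is not a number.
fn parse_line(line: &str) -> Option<Row> {
    let mut fields = line.split(' ');
    let domain_code = fields.next()?;
    let page_title = fields.next()?;
    let views = fields.next()?.parse::<u64>().ok()?;
    Some(Row {
        domain_code: domain_code.to_string(),
        page_title: page_title.to_string(),
        views,
    })
}

/// Lazily read lines, run a cheap text check on each raw line,
/// and parse only the lines that pass it.
/// `needle` is a plain substring standing in for a regex filter.
fn filter_rows<'a, R: BufRead + 'a>(
    reader: R,
    needle: &'a str,
) -> impl Iterator<Item = Row> + 'a {
    reader
        .lines()
        // Stop at the first I/O error instead of looping on it.
        .map_while(Result::ok)
        // Cheap check on the unparsed line; most lines are dropped here.
        .filter(move |line| line.contains(needle))
        // Parse the survivors and drop malformed lines.
        .filter_map(|line| parse_line(&line))
}
```

Because `filter_rows` returns a lazy iterator over any `BufRead`, rows are produced one line at a time and only matching lines pay the cost of field splitting and number parsing. For a quick experiment, a byte slice such as `"en.m Rust 42 0\n".as_bytes()` can serve as the reader.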

§Quick Start

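use pvstream::{stream_from_file, filter::FilterBuilder};
use std::path::PathBuf;

// Keep only rows from the English mobile domain ("en.m")
// whose page title matches the pattern "Rust".
let filter = FilterBuilder::new()
    .domain_codes(["en.m"])
    .page_title("Rust")
    .build();

// Open a local pageviews file; unwrap() panics if the stream cannot be set up.
let rows = stream_from_file(PathBuf::from("pageviews.gz"), &filter).unwrap();

// Each item is a Result, so a bad line is reported without stopping the loop.
for result in rows {
    match result {
        Ok(pageview) => println!("{:?}", pageview),
        Err(e) => eprintln!("Error: {:?}", e),
    }
}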

Modules§

filter
parse
stream

Functions§

parquet_from_file
Parse a local pageviews file and write filtered results to a Parquet file.
parquet_from_url
Download a remote pageviews file and write filtered results to a Parquet file.
stream_from_file
Decompress, stream, and parse lines from a local pageviews file.
stream_from_url
Decompress, stream, and parse lines from a remote pageviews file.

Type Aliases§

RowIterator
Iterator type returned by streaming functions.