§pvstream
Stream-download, parse, and filter Wikimedia pageviews files.
This library provides efficient streaming access to Wikimedia’s hourly pageview dumps. It can download and parse multi-gigabyte compressed files on the fly, without storing the entire file in memory.
§Features
- Streaming parsing: Process files as they download, minimizing memory usage
- Flexible filtering: Filter by language, domain, page title (regex), view counts, and more
- Performance optimization: Apply regex filters before parsing to skip non-matching lines early
- Parquet export: Convert filtered data to Parquet format for analysis
- Rust and Python: Native Rust library with Python bindings via PyO3
§Quick Start
use pvstream::{stream_from_file, filter::FilterBuilder};
use std::path::PathBuf;

let filter = FilterBuilder::new()
    .domain_codes(["en.m"])
    .page_title("Rust")
    .build();

let rows = stream_from_file(PathBuf::from("pageviews.gz"), &filter).unwrap();

for result in rows {
    match result {
        Ok(pageview) => println!("{:?}", pageview),
        Err(e) => eprintln!("Error: {:?}", e),
    }
}
Modules§
Functions§
- parquet_from_file - Parse a local pageviews file and write filtered results to a Parquet file.
- parquet_from_url - Download a remote pageviews file and write filtered results to a Parquet file (see the sketch after this list).
- stream_from_file - Decompress, stream, and parse lines from a local pageviews file.
- stream_from_url - Decompress, stream, and parse lines from a remote pageviews file.
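The exact parameters of the Parquet helpers are not shown on this page. A minimal sketch, assuming parquet_from_url takes a source URL, an output path, and a filter (the real signature may differ):

use pvstream::{parquet_from_url, filter::FilterBuilder};
use std::path::PathBuf;

// Keep English-mobile rows whose titles match "Rust".
let filter = FilterBuilder::new()
    .domain_codes(["en.m"])
    .page_title("Rust")
    .build();

// Example dump URL; the argument order (URL, output path, filter) is an assumption.
let url = "https://dumps.wikimedia.org/other/pageviews/2024/2024-01/pageviews-20240101-000000.gz";
parquet_from_url(url, PathBuf::from("pageviews.parquet"), &filter).unwrap();

parquet_from_file presumably mirrors this, with a local path in place of the URL.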
Type Aliases§
- RowIterator - Iterator type returned by streaming functions.
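Because RowIterator yields Result values, ordinary iterator adapters work on it. A small sketch building on the Quick Start above, skipping any lines that failed to parse:

use pvstream::{stream_from_file, filter::FilterBuilder};
use std::path::PathBuf;

let filter = FilterBuilder::new()
    .domain_codes(["en.m"])
    .page_title("Rust")
    .build();

// Count matching rows, discarding parse errors instead of reporting them.
let rows = stream_from_file(PathBuf::from("pageviews.gz"), &filter).unwrap();
let matching = rows.filter_map(Result::ok).count();
println!("{matching} matching pageviews");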