Crate warc_parquet
A small library providing a reader from WARC to Arrow. This implementation is written for the WARC Format 1.0 specification.
Use the Reader struct to create a reader over a WARC source. The reader expects a BufRead source, which it wraps internally with a WarcReader. Once created, the reader can be iterated to retrieve the Arrow representation of the WARC records.
The standard WARC schema is also provided via the DEFAULT_SCHEMA reference.
The warc-parquet command line utility leverages this library directly.
Example
use std::io::{BufReader, Cursor};
use warc_parquet::{Reader, DEFAULT_SCHEMA};
let file = BufReader::new(Cursor::new(b""));
let schema = DEFAULT_SCHEMA.clone();
let mut reader = Reader::new(file, schema);
for record in reader.iter_reader() {
    dbg!(record); // There won't be anything, since we provided an empty buffer.
}
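The example above feeds the reader an empty buffer. As a minimal sketch of what a non-empty BufRead source could look like, the following uses only the standard library and does not call warc_parquet at all. The helpers sample_record and first_line are hypothetical names introduced for illustration, and the field values are placeholders. It builds one WARC/1.0 record in memory and shows two ways to obtain a BufRead over it.

```rust
use std::io::{self, BufRead, BufReader, Cursor};

/// Builds one minimal WARC/1.0 record as raw bytes.
/// Hypothetical helper for this sketch; the field values are placeholders.
fn sample_record() -> Vec<u8> {
    let body = b"hello";
    let mut record = Vec::new();
    // Version line, then named fields, each terminated by CRLF.
    record.extend_from_slice(b"WARC/1.0\r\n");
    record.extend_from_slice(b"WARC-Type: warcinfo\r\n");
    record.extend_from_slice(
        b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n",
    );
    record.extend_from_slice(b"WARC-Date: 2020-01-01T00:00:00Z\r\n");
    record.extend_from_slice(format!("Content-Length: {}\r\n", body.len()).as_bytes());
    // A blank line separates the header from the content block.
    record.extend_from_slice(b"\r\n");
    record.extend_from_slice(body);
    // Each record ends with two CRLF pairs.
    record.extend_from_slice(b"\r\n\r\n");
    record
}

/// Reads the first line from any `BufRead` source, without its line ending.
fn first_line<R: BufRead>(mut source: R) -> io::Result<String> {
    let mut line = String::new();
    source.read_line(&mut line)?;
    Ok(line.trim_end().to_string())
}

fn main() -> io::Result<()> {
    // `Cursor<Vec<u8>>` implements `BufRead` directly.
    let in_memory = Cursor::new(sample_record());
    println!("{}", first_line(in_memory)?);

    // Any `Read` value (a file, a socket, a byte slice) can be wrapped in `BufReader`.
    let bytes = sample_record();
    let wrapped = BufReader::new(bytes.as_slice());
    println!("{}", first_line(wrapped)?);
    Ok(())
}
```

Either value could presumably be passed where the example above passes its empty buffer, since both satisfy the BufRead bound; that pairing is an assumption and is not exercised in this sketch.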
Structs
DEFAULT_SCHEMA: The WARC Format 1.0 schema.
Reader: A reader which transforms the given BufRead source into an Arrow representation.