Crate warc_parquet
source ·Expand description
A small library providing a reader which translates a WARC source to Arrow.
This implementation is written for the WARC Format 1.0 specification.
Users will consume the WarcToArrowReader
struct to create a new reader
of a WARC source. The reader expects some BufRead
source which it will
internally wrap with a WarcReader
. Once
created, the reader can be iterated in order to retrieve the Arrow
representation of the WARC records.
The warc-parquet
command line utility leverages this crate to provide
WARC-to-Parquet translation.
Example
use std::io::{BufReader, Cursor, Read};
use arrow::array::{BinaryArray, StringArray};
use parquet::arrow::{arrow_reader::ParquetRecordBatchReaderBuilder, ArrowWriter};
use tempfile::tempfile;
use warc_parquet::{parquet::basic::Compression, WarcToArrowReader, WARC_1_0_SCHEMA};
let warc_content = b"\
WARC/1.0\r\n\
Warc-Type: response\r\n\
Content-Length: 13\r\n\
WARC-Record-Id: <urn:test:basic-record:record-0>\r\n\
WARC-Date: 2020-07-08T02:52:55Z\r\n\
\r\n\
Hello, world!\r\n\
\r\n\
";
let input = BufReader::new(Cursor::new(warc_content));
let mut output = tempfile().unwrap();
let mut reader = WarcToArrowReader::builder(input)
.with_batch_size(1024)
.build();
let mut writer = ArrowWriter::try_new(&mut output, WARC_1_0_SCHEMA.clone(), None).unwrap();
let record_batches = reader.iter_reader();
for record_batch in record_batches {
writer.write(&record_batch.unwrap()).unwrap();
}
writer.close().unwrap();
let mut parquet_reader = ParquetRecordBatchReaderBuilder::try_new(output)
.unwrap()
.with_batch_size(1024)
.build()
.unwrap();
let record_batch = parquet_reader.next().unwrap().unwrap();
assert_eq!(
record_batch
.column_by_name("type")
.unwrap()
.as_any()
.downcast_ref::<StringArray>()
.unwrap(),
&StringArray::from(vec!["response"])
);
assert_eq!(
record_batch
.column_by_name("body")
.unwrap()
.as_any()
.downcast_ref::<BinaryArray>()
.unwrap(),
&BinaryArray::from_vec(vec![b"Hello, world!"])
);
Re-exports
pub use schema::WARC_1_0_SCHEMA;
pub use schema::WARC_1_0_SCHEMA;
pub use arrow;
pub use parquet;
Structs
- The WARC Format 1.0 schema.
- A wrapper around a
WarcReader
which provides a translation from a WARC source to an Arrow representation. The Arrow representation can then be used for different tasks, including persistence via a format such as Parquet. - A builder used to constract
WarcToArrowReader
for a given reader of WARC.