pub struct ArrowDataset { /* private fields */ }
An in-memory dataset backed by Arrow RecordBatches.
This is the primary dataset type for alimentar. It stores data as a collection of RecordBatches and provides efficient access patterns for ML training loops.
§Example
use alimentar::{ArrowDataset, Dataset};
// Load from parquet
let dataset = ArrowDataset::from_parquet("data.parquet").unwrap();
println!("Dataset has {} rows", dataset.len());
Implementations§
impl ArrowDataset
pub fn new(batches: Vec<RecordBatch>) -> Result<Self>
Creates a new ArrowDataset from a vector of RecordBatches.
§Errors
Returns an error if:
- The batches vector is empty
- The batches have inconsistent schemas
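The two error conditions above can be illustrated with a std-only sketch. It is not alimentar's implementation: `Batch`, `DatasetError`, `validate`, and `batch` are hypothetical names, and the schema is simplified to a list of (column name, type name) pairs.

```rust
// Hypothetical stand-in for an Arrow RecordBatch: only the schema and row
// count matter for this illustration.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    // (column name, type name) pairs; a simplification of an Arrow schema.
    schema: Vec<(String, String)>,
    num_rows: usize,
}

// Hypothetical error type mirroring the two documented failure cases.
#[derive(Debug, PartialEq)]
enum DatasetError {
    Empty,
    SchemaMismatch { batch_index: usize },
}

// Rejects an empty list and any batch whose schema differs from the first.
// On success, returns the total number of rows across all batches.
fn validate(batches: &[Batch]) -> Result<usize, DatasetError> {
    let first = batches.first().ok_or(DatasetError::Empty)?;
    for (i, b) in batches.iter().enumerate().skip(1) {
        if b.schema != first.schema {
            return Err(DatasetError::SchemaMismatch { batch_index: i });
        }
    }
    Ok(batches.iter().map(|b| b.num_rows).sum())
}

// Small helper to build a batch from string literals.
fn batch(cols: &[(&str, &str)], num_rows: usize) -> Batch {
    Batch {
        schema: cols
            .iter()
            .map(|(name, ty)| (name.to_string(), ty.to_string()))
            .collect(),
        num_rows,
    }
}

fn main() {
    let a = batch(&[("id", "Int32")], 3);
    let b = batch(&[("id", "Int32")], 2);
    let c = batch(&[("id", "Utf8")], 1);

    println!("{:?}", validate(&[a.clone(), b])); // matching schemas
    println!("{:?}", validate(&[a, c])); // second batch differs
    println!("{:?}", validate(&[])); // no batches at all
}
```

Comparing every batch against the first keeps the check linear in the number of batches and reports the index of the first offending batch.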
pub fn from_batch(batch: RecordBatch) -> Result<Self>
pub fn from_parquet(path: impl AsRef<Path>) -> Result<Self>
Loads a dataset from a Parquet file.
§Errors
Returns an error if:
- The file cannot be opened
- The file is not valid Parquet
- The file is empty
pub fn to_parquet(&self, path: impl AsRef<Path>) -> Result<()>
Saves the dataset to a Parquet file.
§Errors
Returns an error if:
- The file cannot be created
- Writing fails
pub fn from_ipc(path: impl AsRef<Path>) -> Result<Self>
Loads a dataset from an Arrow IPC file (Issue #2: Enhanced Data Loading)
Arrow IPC (Inter-Process Communication) format enables zero-copy data sharing. This is the native Arrow file format with optimal read performance.
§Arguments
path - Path to the Arrow IPC file (typically .arrow or .ipc extension)
§Errors
Returns an error if:
- The file cannot be opened
- The file is not valid Arrow IPC format
- The file is empty
§Example
let dataset = ArrowDataset::from_ipc("data.arrow").unwrap();
pub fn to_ipc(&self, path: impl AsRef<Path>) -> Result<()>
Saves the dataset to an Arrow IPC file (Issue #2: Enhanced Data Loading)
Creates a file in Arrow IPC format, the native Arrow format. This format provides optimal read performance for Arrow-based tools.
§Arguments
path - Path for the output file
§Errors
Returns an error if the file cannot be created or writing fails.
§Example
dataset.to_ipc("output.arrow").unwrap();
pub fn from_ipc_stream(path: impl AsRef<Path>) -> Result<Self>
Loads a dataset from an Arrow IPC stream file (Issue #2: Enhanced Data Loading)
Arrow IPC streaming format is designed for streaming scenarios where the schema is sent first, followed by record batches. The file extension is typically .arrows.
§Arguments
path - Path to the Arrow IPC stream file
§Errors
Returns an error if parsing fails or the file is empty.
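The "schema first, then record batches" ordering can be modelled with a toy reader. This is a conceptual sketch only, not the Arrow wire format or alimentar's code: `Message`, `StreamError`, and `read_stream` are hypothetical, and treating a stream with a schema but no batches as "empty" is an assumption.

```rust
// Hypothetical column data for one batch: one inner Vec per column.
type Columns = Vec<Vec<i64>>;

// Hypothetical message type modelling the order of messages in a stream.
#[derive(Debug, PartialEq)]
enum Message {
    Schema(Vec<String>), // column names only, for brevity
    Batch(Columns),
}

// Hypothetical error type for the failure modes this sketch distinguishes.
#[derive(Debug, PartialEq)]
enum StreamError {
    MissingSchema,
    UnexpectedSchema,
    ColumnCountMismatch { batch_index: usize },
    Empty,
}

// Reads a schema message, then zero or more batch messages that must match
// the schema's column count. A stream without batches is reported as empty
// (an assumption made for this sketch).
fn read_stream(messages: Vec<Message>) -> Result<(Vec<String>, Vec<Columns>), StreamError> {
    let mut iter = messages.into_iter();
    let schema = match iter.next() {
        Some(Message::Schema(s)) => s,
        _ => return Err(StreamError::MissingSchema),
    };

    let mut batches = Vec::new();
    for (i, msg) in iter.enumerate() {
        match msg {
            Message::Batch(cols) if cols.len() == schema.len() => batches.push(cols),
            Message::Batch(_) => {
                return Err(StreamError::ColumnCountMismatch { batch_index: i })
            }
            Message::Schema(_) => return Err(StreamError::UnexpectedSchema),
        }
    }

    if batches.is_empty() {
        return Err(StreamError::Empty);
    }
    Ok((schema, batches))
}

fn main() {
    let stream = vec![
        Message::Schema(vec!["id".to_string()]),
        Message::Batch(vec![vec![1, 2]]),
        Message::Batch(vec![vec![3]]),
    ];
    match read_stream(stream) {
        Ok((schema, batches)) => println!("{:?}: {} batches", schema, batches.len()),
        Err(e) => println!("error: {:?}", e),
    }
}
```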
pub fn to_ipc_stream(&self, path: impl AsRef<Path>) -> Result<()>
Saves the dataset to an Arrow IPC stream file (Issue #2: Enhanced Data Loading)
Creates a file in Arrow IPC streaming format. This format is suitable for streaming scenarios and produces slightly smaller files than the standard IPC file format.
§Arguments
path - Path for the output file (typically .arrows extension)
§Errors
Returns an error if the file cannot be created or writing fails.
pub fn from_csv_with_options(
    path: impl AsRef<Path>,
    options: CsvOptions,
) -> Result<Self>
pub fn to_csv(&self, path: impl AsRef<Path>) -> Result<()>
Saves the dataset to a CSV file.
§Errors
Returns an error if the file cannot be created or writing fails.
pub fn from_json(path: impl AsRef<Path>) -> Result<Self>
Loads a dataset from a JSON Lines (JSONL) file.
Each line in the file should be a valid JSON object representing a row.
§Errors
Returns an error if the file cannot be opened or parsed.
pub fn from_json_with_options(
    path: impl AsRef<Path>,
    options: JsonOptions,
) -> Result<Self>
Loads a dataset from a JSON Lines file with options.
§Errors
Returns an error if parsing fails or the file is empty.
pub fn to_json(&self, path: impl AsRef<Path>) -> Result<()>
Saves the dataset to a JSON Lines (JSONL) file.
Each row is written as a single JSON object on its own line.
§Errors
Returns an error if the file cannot be created or writing fails.
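To make the JSON Lines layout concrete, here is a std-only sketch that writes one JSON object per row, each on its own line. It does not call alimentar: `Row`, `to_jsonl`, and `count_records` are hypothetical, and the row type is limited to two integer fields so that no string escaping is needed.

```rust
// Hypothetical row type; a real dataset carries arbitrary Arrow columns.
struct Row {
    id: u32,
    label: u32,
}

// Serializes each row as a single JSON object followed by a newline.
fn to_jsonl(rows: &[Row]) -> String {
    let mut out = String::new();
    for r in rows {
        // `{{` and `}}` are literal braces in a format string.
        out.push_str(&format!("{{\"id\":{},\"label\":{}}}\n", r.id, r.label));
    }
    out
}

// Counts records by counting non-blank lines, one record per line.
fn count_records(jsonl: &str) -> usize {
    jsonl.lines().filter(|line| !line.trim().is_empty()).count()
}

fn main() {
    let text = to_jsonl(&[Row { id: 1, label: 0 }, Row { id: 2, label: 1 }]);
    print!("{}", text);
    println!("{} records", count_records(&text));
}
```

Because every record sits on its own line, a reader can process the file line by line without parsing the whole document first.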
pub fn from_parquet_bytes(data: &[u8]) -> Result<Self>
Loads a dataset from Parquet bytes in memory.
§Errors
Returns an error if the data is not valid Parquet.
pub fn to_parquet_bytes(&self) -> Result<Vec<u8>>
pub fn from_csv_str(data: &str) -> Result<Self>
pub fn from_json_str(data: &str) -> Result<Self>
pub fn batches(&self) -> &[RecordBatch]
Returns the underlying batches.
pub fn into_batches(self) -> Vec<RecordBatch>
Consumes the dataset and returns the underlying batches.
pub fn with_transform<T: Transform>(&self, transform: &T) -> Result<Self>
Applies a transform to create a new dataset.
§Errors
Returns an error if the transform fails on any batch.
pub fn rows(&self) -> RowIterator<'_>
Returns an iterator over rows as single-row RecordBatches.
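The idea of yielding every row as its own single-row batch can be sketched without Arrow. The code below is an illustration under simplifying assumptions, not how `RowIterator` is implemented: `Batch` is a hypothetical alias for a single `i64` column, and `rows` is a free function rather than a method.

```rust
// Hypothetical stand-in for a RecordBatch: a single i64 column.
type Batch = Vec<i64>;

// Walks the batches in order and yields each row as a one-element batch.
fn rows(batches: &[Batch]) -> impl Iterator<Item = Batch> + '_ {
    batches
        .iter()
        .flat_map(|batch| batch.iter().map(|value| vec![*value]))
}

fn main() {
    let batches = vec![vec![10, 20], vec![30]];
    for (i, row) in rows(&batches).enumerate() {
        println!("row {}: {:?}", i, row);
    }
}
```

Flattening keeps the original row order across batch boundaries, so the caller never needs to know how the data was chunked.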
Trait Implementations§
impl Clone for ArrowDataset
fn clone(&self) -> ArrowDataset
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Dataset for ArrowDataset
impl Dataset for ArrowDataset
fn get(&self, index: usize) -> Option<RecordBatch>
fn iter(&self) -> Box<dyn Iterator<Item = RecordBatch> + Send + '_>
fn num_batches(&self) -> usize
Auto Trait Implementations§
impl Freeze for ArrowDataset
impl !RefUnwindSafe for ArrowDataset
impl Send for ArrowDataset
impl Sync for ArrowDataset
impl Unpin for ArrowDataset
impl UnsafeUnpin for ArrowDataset
impl !UnwindSafe for ArrowDataset
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.