pub struct FileReader { /* private fields */ }

Implementations
impl FileReader
pub fn with_scheduler(&self, scheduler: Arc<dyn EncodingsIo>) -> Self
pub fn num_rows(&self) -> u64
pub fn metadata(&self) -> &Arc<CachedFileMetadata>
pub fn file_statistics(&self) -> FileStatistics
pub async fn read_global_buffer(&self, index: u32) -> Result<Bytes>
pub async fn read_all_metadata(scheduler: &FileScheduler) -> Result<CachedFileMetadata>
pub async fn try_open(
    scheduler: FileScheduler,
    base_projection: Option<ReaderProjection>,
    decoder_plugins: Arc<DecoderPlugins>,
    cache: &LanceCache,
    options: FileReaderOptions,
) -> Result<Self>
Opens a new file reader without any pre-existing knowledge of the file.

This will read the file schema from the file itself and thus requires a bit more I/O.

If a base_projection is provided, it will apply to all reads from the file that do not specify their own projection.
pub async fn try_open_with_file_metadata(
    scheduler: Arc<dyn EncodingsIo>,
    path: Path,
    base_projection: Option<ReaderProjection>,
    decoder_plugins: Arc<DecoderPlugins>,
    file_metadata: Arc<CachedFileMetadata>,
    cache: &LanceCache,
    options: FileReaderOptions,
) -> Result<Self>
Same as try_open but with the file metadata already loaded.
This method can also accept any kind of EncodingsIo implementation, allowing custom strategies to be used for I/O scheduling (e.g. for takes on fast disks it may be better to avoid asynchronous overhead).
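Accepting any `EncodingsIo` implementation behind an `Arc<dyn …>` is essentially the strategy pattern over a trait object. A minimal std-only sketch of that pattern (the `Io` trait and both impls below are hypothetical stand-ins, not the real `EncodingsIo` API):

```rust
use std::sync::Arc;

// Hypothetical stand-in for an I/O strategy trait like `EncodingsIo`.
trait Io {
    fn read(&self, _offset: u64, len: u64) -> Vec<u8>;
}

// One strategy: a scheduler-style implementation (modeled trivially here).
struct Scheduled;
impl Io for Scheduled {
    fn read(&self, _offset: u64, len: u64) -> Vec<u8> {
        vec![0u8; len as usize] // placeholder bytes
    }
}

// Another strategy: direct synchronous reads, avoiding async overhead.
struct Direct;
impl Io for Direct {
    fn read(&self, _offset: u64, len: u64) -> Vec<u8> {
        vec![1u8; len as usize] // placeholder bytes
    }
}

fn open_with(io: Arc<dyn Io>) -> Vec<u8> {
    // The consumer only sees the trait object, so strategies are interchangeable.
    io.read(0, 4)
}

fn main() {
    assert_eq!(open_with(Arc::new(Scheduled)), vec![0, 0, 0, 0]);
    assert_eq!(open_with(Arc::new(Direct)), vec![1, 1, 1, 1]);
    println!("ok");
}
```

The design choice is that the caller, not the reader, decides which I/O strategy fits the workload, without the reader needing to know about every strategy in advance.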
pub async fn read_tasks(
    &self,
    params: ReadBatchParams,
    batch_size: u32,
    projection: Option<ReaderProjection>,
    filter: FilterExpression,
) -> Result<Pin<Box<dyn Stream<Item = ReadBatchTask> + Send>>>
Creates a stream of “read tasks” to read the data from the file
The arguments are similar to Self::read_stream_projected but instead of returning a stream
of record batches it returns a stream of “read tasks”.
The tasks should be consumed with some kind of buffered stream adapter if CPU parallelism is desired.

Note that “read task” is a bit imprecise: the tasks are actually “decode tasks”. The reading happens asynchronously in the background. In other words, a single read task may map to multiple I/O operations, or a single I/O operation may map to multiple read tasks.
Why is this async?
Constructing the read stream requires running the decode scheduler’s
initialize step, which performs the metadata I/O (chunk metadata,
dictionaries, repetition index, …) needed to plan the read. We
drive that I/O on the awaiting task rather than smuggling it into
the stream’s first poll. This way callers control where the
scheduling I/O runs (typically inside a per-fragment
tokio::spawn), planning errors surface from the await instead of
from the first stream item, and small reads can also complete the
synchronous scheduling step before returning (see
DecoderConfig::inline_scheduling).
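The trade-off described above can be sketched in synchronous terms: an API that plans eagerly returns its planning error at the call site, while a lazily-planned stream smuggles the error into the first item. All names below are illustrative, not part of the reader's API:

```rust
// Eager planning: metadata work happens up front, so a bad plan fails
// before any iterator/stream is handed back (the `try_open`-style shape).
fn planned_read(rows: u64) -> Result<impl Iterator<Item = u64>, String> {
    if rows == 0 {
        return Err("empty plan".to_string()); // surfaces at the call site
    }
    Ok(0..rows)
}

// Lazy planning: construction always "succeeds" and the planning error
// hides until the first item is pulled.
fn lazy_read(rows: u64) -> impl Iterator<Item = Result<u64, String>> {
    let mut planned = false;
    (0..rows.max(1)).map(move |i| {
        if !planned {
            planned = true;
            if rows == 0 {
                return Err("empty plan".to_string()); // only seen on first poll
            }
        }
        Ok(i)
    })
}

fn main() {
    // Eager: the caller sees the planning error immediately.
    assert!(planned_read(0).is_err());
    // Lazy: construction succeeds; the error arrives as the first item.
    let first = lazy_read(0).next().unwrap();
    assert!(first.is_err());
    println!("ok");
}
```

The eager shape is what the `async fn` signature buys here: planning errors surface from the `await`, and the caller chooses where the planning I/O runs.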
pub async fn read_stream_projected(
    &self,
    params: ReadBatchParams,
    batch_size: u32,
    batch_readahead: u32,
    projection: ReaderProjection,
    filter: FilterExpression,
) -> Result<Pin<Box<dyn RecordBatchStream>>>
Reads data from the file as a stream of record batches
- `params` - Specifies the range (or indices) of data to read.
- `batch_size` - The maximum size of a single batch. A batch may be smaller if it is the last batch or if it is not possible to create a batch of the requested size. For example, if the batch size is 1024 and one of the columns is a string column, then some ranges of 1024 rows may contain more than 2^31 bytes of string data (the maximum size of a string column in Arrow). In that case smaller batches may be emitted.
- `batch_readahead` - The number of batches to read ahead. This controls the CPU parallelism of the read, i.e. how many batches will be decoded in parallel. It has no effect on the I/O parallelism of the read (how many I/O requests are in flight at once). This parameter is also related to backpressure: if the consumer of the stream is slow then the reader will build up batches in RAM.
- `projection` - A projection to apply to the read. This controls which columns are read from the file. The projection is NOT applied on top of the base projection; it is applied directly to the file schema.
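The `batch_size` semantics above can be sketched with a little arithmetic. This is illustrative only; the Arrow byte-overflow case is noted in a comment rather than modeled:

```rust
// Sketch of how a row range is cut into batches of at most `batch_size`
// rows: every batch is full except possibly the last. (A real reader may
// also emit smaller batches when a full batch would exceed Arrow's 2^31-byte
// limit for a single string column.)
fn batch_sizes(num_rows: u64, batch_size: u64) -> Vec<u64> {
    let mut out = Vec::new();
    let mut remaining = num_rows;
    while remaining > 0 {
        let n = remaining.min(batch_size);
        out.push(n);
        remaining -= n;
    }
    out
}

fn main() {
    // 2500 rows with batch_size 1024 -> two full batches and a short tail.
    assert_eq!(batch_sizes(2500, 1024), vec![1024, 1024, 452]);
    println!("{:?}", batch_sizes(2500, 1024));
}
```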
Why is this async?
This delegates to Self::read_tasks, which awaits the decode
scheduler’s initialize step (and, for small reads, the synchronous
scheduling that follows) before returning. See read_tasks for
details on why this work is performed up front rather than on the
stream’s first poll.
pub fn read_stream_projected_blocking(
    &self,
    params: ReadBatchParams,
    batch_size: u32,
    projection: Option<ReaderProjection>,
    filter: FilterExpression,
) -> Result<Box<dyn RecordBatchReader + Send + 'static>>
Read data from the file as an iterator of record batches
This is a blocking variant of Self::read_stream_projected that runs entirely in the
calling thread. It will block on I/O if the decode is faster than the I/O. It is useful
for benchmarking and potentially for “take”-ing small batches from fast disks.
Large scans of in-memory data will still benefit from threading (and should therefore not use this method) because we can parallelize the decode.
Note: calling this from within a tokio runtime will panic. It is acceptable to call this from a spawn_blocking context.
pub async fn read_stream(
    &self,
    params: ReadBatchParams,
    batch_size: u32,
    batch_readahead: u32,
    filter: FilterExpression,
) -> Result<Pin<Box<dyn RecordBatchStream>>>
Reads data from the file as a stream of record batches
This is similar to Self::read_stream_projected but uses the base projection
provided when the file was opened (or reads all columns if the file was
opened without a base projection).
Why is this async?
This delegates to Self::read_stream_projected, which awaits the
decode scheduler’s initialize step before returning the stream.
See Self::read_tasks for the rationale.
pub fn schema(&self) -> &Arc<Schema>
Trait Implementations
impl Clone for FileReader

fn clone(&self) -> FileReader

fn clone_from(&mut self, source: &Self)

Auto Trait Implementations
impl Freeze for FileReader
impl !RefUnwindSafe for FileReader
impl Send for FileReader
impl Sync for FileReader
impl Unpin for FileReader
impl UnsafeUnpin for FileReader
impl !UnwindSafe for FileReader