pub struct FileReader { /* private fields */ }

Implementations
impl FileReader
pub fn with_scheduler(&self, scheduler: Arc<dyn EncodingsIo>) -> Self
pub fn num_rows(&self) -> u64
pub fn metadata(&self) -> &Arc<CachedFileMetadata>
pub fn file_statistics(&self) -> FileStatistics
pub async fn read_global_buffer(&self, index: u32) -> Result<Bytes>
pub async fn read_all_metadata(scheduler: &FileScheduler) -> Result<CachedFileMetadata>
pub async fn try_open(
    scheduler: FileScheduler,
    base_projection: Option<ReaderProjection>,
    decoder_plugins: Arc<DecoderPlugins>,
    cache: &LanceCache,
    options: FileReaderOptions,
) -> Result<Self>
Opens a new file reader without any pre-existing knowledge of the file.

This will read the file schema from the file itself and thus requires a bit more I/O.

If a base_projection is provided, it will apply to all reads from the file that do not specify their own projection.
pub async fn try_open_with_file_metadata(
    scheduler: Arc<dyn EncodingsIo>,
    path: Path,
    base_projection: Option<ReaderProjection>,
    decoder_plugins: Arc<DecoderPlugins>,
    file_metadata: Arc<CachedFileMetadata>,
    cache: &LanceCache,
    options: FileReaderOptions,
) -> Result<Self>
Same as try_open but with the file metadata already loaded.
This method can also accept any kind of EncodingsIo implementation, allowing custom strategies to be used for I/O scheduling (e.g. for takes on fast disks it may be better to avoid asynchronous overhead).
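Accepting any `EncodingsIo` implementation behind an `Arc<dyn …>` is essentially the strategy pattern over a trait object. A minimal std-only sketch of that pattern (the `Io` trait and both impls below are hypothetical stand-ins, not the real `EncodingsIo` API):

```rust
use std::sync::Arc;

// Hypothetical stand-in for an I/O strategy trait like `EncodingsIo`.
trait Io {
    fn read(&self, _offset: u64, len: u64) -> Vec<u8>;
}

// One strategy: a scheduler-style implementation (modeled trivially here).
struct Scheduled;
impl Io for Scheduled {
    fn read(&self, _offset: u64, len: u64) -> Vec<u8> {
        vec![0u8; len as usize] // placeholder bytes
    }
}

// Another strategy: direct synchronous reads, avoiding async overhead.
struct Direct;
impl Io for Direct {
    fn read(&self, _offset: u64, len: u64) -> Vec<u8> {
        vec![1u8; len as usize] // placeholder bytes
    }
}

fn open_with(io: Arc<dyn Io>) -> Vec<u8> {
    // The consumer only sees the trait object, so strategies are interchangeable.
    io.read(0, 4)
}

fn main() {
    assert_eq!(open_with(Arc::new(Scheduled)), vec![0, 0, 0, 0]);
    assert_eq!(open_with(Arc::new(Direct)), vec![1, 1, 1, 1]);
    println!("ok");
}
```

The design choice is that the caller, not the reader, decides which I/O strategy fits the workload, without the reader needing to know about every strategy in advance.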
pub async fn read_tasks(
    &self,
    params: ReadBatchParams,
    batch_size: u32,
    projection: Option<ReaderProjection>,
    filter: FilterExpression,
) -> Result<Pin<Box<dyn Stream<Item = ReadBatchTask> + Send>>>
Creates a stream of “read tasks” to read the data from the file
The arguments are similar to Self::read_stream_projected but instead of returning a stream
of record batches it returns a stream of “read tasks”.
The tasks should be consumed with some kind of buffered stream adapter if CPU parallelism is desired.

Note that “read task” is a bit imprecise: the tasks are actually “decode tasks”. The reading happens asynchronously in the background. In other words, a single read task may map to multiple I/O operations, or a single I/O operation may map to multiple read tasks.
Why is this async?
Constructing the read stream requires running the decode scheduler’s
initialize step, which performs the metadata I/O (chunk metadata,
dictionaries, repetition index, …) needed to plan the read. We
drive that I/O on the awaiting task rather than smuggling it into
the stream’s first poll. This way callers control where the
scheduling I/O runs (typically inside a per-fragment
tokio::spawn), planning errors surface from the await instead of
from the first stream item, and small reads can also complete the
synchronous scheduling step before returning (see
DecoderConfig::inline_scheduling).
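The trade-off described above can be sketched in synchronous terms: an API that plans eagerly returns its planning error at the call site, while a lazily-planned stream smuggles the error into the first item. All names below are illustrative, not part of the reader's API:

```rust
// Eager planning: metadata work happens up front, so a bad plan fails
// before any iterator/stream is handed back (the `try_open`-style shape).
fn planned_read(rows: u64) -> Result<impl Iterator<Item = u64>, String> {
    if rows == 0 {
        return Err("empty plan".to_string()); // surfaces at the call site
    }
    Ok(0..rows)
}

// Lazy planning: construction always "succeeds" and the planning error
// hides until the first item is pulled.
fn lazy_read(rows: u64) -> impl Iterator<Item = Result<u64, String>> {
    let mut planned = false;
    (0..rows.max(1)).map(move |i| {
        if !planned {
            planned = true;
            if rows == 0 {
                return Err("empty plan".to_string()); // only seen on first poll
            }
        }
        Ok(i)
    })
}

fn main() {
    // Eager: the caller sees the planning error immediately.
    assert!(planned_read(0).is_err());
    // Lazy: construction succeeds; the error arrives as the first item.
    let first = lazy_read(0).next().unwrap();
    assert!(first.is_err());
    println!("ok");
}
```

The eager shape is what the `async fn` signature buys here: planning errors surface from the `await`, and the caller chooses where the planning I/O runs.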
pub async fn read_stream_projected(
    &self,
    params: ReadBatchParams,
    batch_size: u32,
    batch_readahead: u32,
    projection: ReaderProjection,
    filter: FilterExpression,
) -> Result<Pin<Box<dyn RecordBatchStream>>>
Reads data from the file as a stream of record batches
- `params` - Specifies the range (or indices) of data to read.
- `batch_size` - The maximum size of a single batch. A batch may be smaller if it is the last batch or if it is not possible to create a batch of the requested size. For example, if the batch size is 1024 and one of the columns is a string column, then some ranges of 1024 rows may contain more than 2^31 bytes of string data (the maximum size of a string column in Arrow). In that case smaller batches may be emitted.
- `batch_readahead` - The number of batches to read ahead. This controls the CPU parallelism of the read, i.e. how many batches will be decoded in parallel. It has no effect on the I/O parallelism of the read (how many I/O requests are in flight at once). This parameter is also related to backpressure: if the consumer of the stream is slow then the reader will build up batches in RAM.
- `projection` - A projection to apply to the read. This controls which columns are read from the file. The projection is NOT applied on top of the base projection; it is applied directly to the file schema.
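The `batch_size` semantics above can be sketched with a little arithmetic. This is illustrative only; the Arrow byte-overflow case is noted in a comment rather than modeled:

```rust
// Sketch of how a row range is cut into batches of at most `batch_size`
// rows: every batch is full except possibly the last. (A real reader may
// also emit smaller batches when a full batch would exceed Arrow's 2^31-byte
// limit for a single string column.)
fn batch_sizes(num_rows: u64, batch_size: u64) -> Vec<u64> {
    let mut out = Vec::new();
    let mut remaining = num_rows;
    while remaining > 0 {
        let n = remaining.min(batch_size);
        out.push(n);
        remaining -= n;
    }
    out
}

fn main() {
    // 2500 rows with batch_size 1024 -> two full batches and a short tail.
    assert_eq!(batch_sizes(2500, 1024), vec![1024, 1024, 452]);
    println!("{:?}", batch_sizes(2500, 1024));
}
```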
Why is this async?
This delegates to Self::read_tasks, which awaits the decode
scheduler’s initialize step (and, for small reads, the synchronous
scheduling that follows) before returning. See read_tasks for
details on why this work is performed up front rather than on the
stream’s first poll.
pub fn read_stream_projected_blocking(
    &self,
    params: ReadBatchParams,
    batch_size: u32,
    projection: Option<ReaderProjection>,
    filter: FilterExpression,
) -> Result<Box<dyn RecordBatchReader + Send + 'static>>
Read data from the file as an iterator of record batches
This is a blocking variant of Self::read_stream_projected that runs entirely in the
calling thread. It will block on I/O if the decode is faster than the I/O. It is useful
for benchmarking and potentially for “take”-ing small batches from fast disks.
Large scans of in-memory data will still benefit from threading (and should therefore not use this method) because we can parallelize the decode.
Note: calling this from within a tokio runtime will panic. It is acceptable to call this from a spawn_blocking context.
pub async fn read_stream(
    &self,
    params: ReadBatchParams,
    batch_size: u32,
    batch_readahead: u32,
    filter: FilterExpression,
) -> Result<Pin<Box<dyn RecordBatchStream>>>
Reads data from the file as a stream of record batches
This is similar to Self::read_stream_projected but uses the base projection
provided when the file was opened (or reads all columns if the file was
opened without a base projection).
Why is this async?
This delegates to Self::read_stream_projected, which awaits the
decode scheduler’s initialize step before returning the stream.
See Self::read_tasks for the rationale.
pub fn schema(&self) -> &Arc<Schema>
Trait Implementations
impl Clone for FileReader

fn clone(&self) -> FileReader

fn clone_from(&mut self, source: &Self)

Auto Trait Implementations
impl Freeze for FileReader
impl !RefUnwindSafe for FileReader
impl Send for FileReader
impl Sync for FileReader
impl Unpin for FileReader
impl UnsafeUnpin for FileReader
impl !UnwindSafe for FileReader