§About
get_chunk is a library for creating file iterators and streams (asynchronous iterators). Its main task is to retrieve data in chunks, especially from large files.
Key Features:
- File Chunking: Seamlessly divide files, including large ones, into chunks with each “Next” iteration.
- Modes: Choose between automatic or manual tuning based on percentage or number of bytes.
- Automatic chunking: Each “Next” iteration dynamically determines an optimal chunk size, facilitating efficient handling of even large files.
⚠️ Important Notice:
The algorithm adjusts chunk sizes for optimal performance after each “Next” call, taking available RAM into account. Crucially, this adjustment happens only after the current chunk has been delivered and before the subsequent “Next” call.
Consider a potential scenario: a chunk is 15 GB and 16 GB of RAM is initially free. If 2 GB of RAM becomes unexpectedly occupied between the current and the next “Next” call, the 15 GB chunk will still be processed. This introduces a risk: the system might reclaim resources (resulting in an io::Error), or the program might crash.
Iterators created by get_chunk do not store the entire file in memory, which matters most for large datasets. Their purpose is to fetch data from files, even very large ones, by reading in chunks.
Key Points:
- Limited File Retention: For a small file, creating an iterator may yield all of its data at once, depending on the OS. However, this does not guarantee the file persists after the iterator is created.
- Deletion Warning: Deleting a file during iterator or stream iterations will result in an error. These structures do not track the last successful position.
- No File Restoration: Attempting to restore a deleted file during iterations is not supported. These structures do not keep track of the file’s original state.
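For orientation, here is a minimal usage sketch of the blocking iterator. The module path, type name, and item type (get_chunk::iterator::FileIter yielding io::Result<Vec<u8>>) follow the crate's README rather than this page, so treat them as assumptions that may differ between versions.

```rust
use get_chunk::iterator::FileIter;

fn main() -> std::io::Result<()> {
    // Each `next()` call yields one chunk; the chunk size is re-tuned between
    // calls based on the previous read time and the currently available RAM.
    let file_iter = FileIter::new("file.txt")?;
    for chunk in file_iter {
        // Assumed item type: io::Result<Vec<u8>> (not confirmed by this page).
        let chunk: Vec<u8> = chunk?;
        println!("read {} bytes", chunk.len());
    }
    Ok(())
}
```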
§How it works
The calculate_chunk function in the ChunkSize enum determines the optimal chunk size based on various parameters. Here’s a breakdown of how the size is calculated (a standalone sketch of this logic follows the mode list below):
The variables prev and now represent the previous and current read times, respectively:
- prev: the time taken to read the data chunk in the previous iteration.
- now: the time taken to read the data chunk in the current iteration.
- Auto Mode:
  - If the previous read time (prev) is greater than zero:
    - If the current read time (now) is also greater than zero:
      - If now is less than prev, decrease the chunk size using the decrease_chunk method.
      - If now is greater than or equal to prev, increase the chunk size using the increase_chunk method.
    - If now is zero or negative, maintain the previous chunk size (prev).
  - If the previous read time is zero or negative, use the default chunk size based on the file size and available RAM.
- Percent Mode:
  - Calculate the chunk size as a percentage of the total file size using the percentage_chunk method. The percentage is capped between 0.1% and 100%.
- Bytes Mode:
  - Calculate the chunk size based on the specified number of bytes using the bytes_chunk method. The size is capped by the file size and available RAM.
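The Auto-mode branching can be restated as a standalone sketch. This is not the crate's internal calculate_chunk implementation; it assumes that the leading prev in the increase/decrease formulas below refers to the previous chunk size while the ratio uses the read times, and it applies the 0.85 RAM cap from those formulas.

```rust
/// Illustrative re-statement of the Auto-mode rules described above; not the
/// crate's actual `calculate_chunk`. Times are in seconds, sizes in bytes.
fn auto_chunk_size(
    prev_time: f64,  // read time of the previous chunk (`prev`)
    now_time: f64,   // read time of the current chunk (`now`)
    prev_chunk: f64, // previous chunk size in bytes (assumed meaning of the
                     // leading `prev` in the increase/decrease formulas)
    file_size: f64,
    ram_available: f64,
) -> f64 {
    let ram_cap = ram_available * 0.85; // never exceed 85% of available RAM

    if prev_time > 0.0 {
        if now_time > 0.0 {
            if now_time < prev_time {
                // Read got faster: shrink the chunk (decrease_chunk rule).
                (prev_chunk * (1.0 - ((prev_time - now_time) / prev_time).min(0.45))).min(ram_cap)
            } else {
                // Read got slower or stayed the same: grow the chunk (increase_chunk rule).
                (prev_chunk * (1.0 + ((now_time - prev_time) / prev_time).min(0.15))).min(ram_cap)
            }
        } else {
            // No usable current timing: keep the previous chunk size.
            prev_chunk
        }
    } else {
        // First iteration: default to 0.1% of the file size.
        (file_size * (0.1 / 100.0)).min(ram_cap)
    }
}

fn main() {
    // Reads slowed from 1.0 s to 1.2 s, so an 8 MiB chunk grows by 15% (the cap).
    let next = auto_chunk_size(1.0, 1.2, 8.0 * 1024.0 * 1024.0, 10e9, 4e9);
    println!("next chunk: {next:.0} bytes"); // ~9.2 MiB
}
```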
§Key Formulas:
- Increase Chunk Size:
(prev * (1.0 + ((now - prev) / prev).min(0.15))).min(ram_available * 0.85).min(f64::MAX)
- Decrease Chunk Size:
(prev * (1.0 - ((prev - now) / prev).min(0.45))).min(ram_available * 0.85).min(f64::MAX)
- Default Chunk Size:
(file_size * (0.1 / 100.0)).min(ram_available * 0.85).min(f64::MAX)
- Percentage Chunk Size:
(file_size * (percentage.min(100.0).max(0.1) / 100.0)).min(ram_available * 0.85)
- Bytes Chunk Size:
(bytes as f64).min(file_size).min(ram_available * 0.85)
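To make the caps concrete, here is a hedged sketch of the Percentage and Bytes formulas with worked numbers. The function names mirror percentage_chunk and bytes_chunk, but these are illustrative re-statements of the formulas above, not the crate's internals.

```rust
// Illustrative re-statements of the Percentage and Bytes chunk-size formulas.
fn percentage_chunk(file_size: f64, percentage: f64, ram_available: f64) -> f64 {
    (file_size * (percentage.min(100.0).max(0.1) / 100.0)).min(ram_available * 0.85)
}

fn bytes_chunk(bytes: u64, file_size: f64, ram_available: f64) -> f64 {
    (bytes as f64).min(file_size).min(ram_available * 0.85)
}

fn main() {
    // 10 GiB file, 4 GiB of available RAM (85% cap is about 3.4 GiB).
    let file_size = 10.0 * 1024f64.powi(3);
    let ram = 4.0 * 1024f64.powi(3);

    // 25% of the file is 2.5 GiB, which fits under the RAM cap.
    println!("percent: {:.0} bytes", percentage_chunk(file_size, 25.0, ram));

    // A 5 GiB byte request is cut down to the RAM cap (~3.4 GiB).
    println!("bytes:   {:.0} bytes", bytes_chunk(5 * 1024u64.pow(3), file_size, ram));
}
```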
Modules§
- data_size_format: The module is responsible for the size of the data
- iterator: The module is responsible for retrieval of chunks from a file
- stream: The module is responsible for async retrieval of chunks from a file
Enums§
- ChunkSize: The ChunkSize enum represents different modes for determining the chunk size in the file processing module. Regardless of the specific mode chosen, all modes adhere to the rules of the Auto mode with RAM constraints.
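As a closing illustration, a hedged sketch of selecting a mode when constructing an iterator. The variant names (Auto, Percent, Bytes) mirror the modes described above, but the set_mode method and the crate-root ChunkSize import are assumptions based on the crate's README and are not confirmed by this page.

```rust
use get_chunk::{iterator::FileIter, ChunkSize};

fn main() -> std::io::Result<()> {
    // Fixed 40 KiB chunks instead of automatic tuning; swap in
    // ChunkSize::Percent(10.0) or ChunkSize::Auto for the other modes.
    let file_iter = FileIter::new("file.txt")?.set_mode(ChunkSize::Bytes(40 * 1024));
    for chunk in file_iter {
        let _chunk: Vec<u8> = chunk?;
    }
    Ok(())
}
```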