Module polars::docs::performance [−][src]
Expand description
Performance
Understanding the memory format used by Arrow/ Polars can really increase performance of your queries. This is especially true for large string data. The figure below shows how an Arrow UTF8 array is laid out in memory.
The array ["foo", "bar", "ham"]
is encoded by
- a concatenated string
"foobarham"
- an offset array indicating the start (and end) of each string
[0, 2, 5, 8]
- a null bitmap, indicating null values
This memory structure is very cache efficient if we are to read the string values. Especially if
we compare it to a Vec<String>
.
However, if we need to reorder the Arrow UTF8 array, we need to swap around all the bytes of the
string values, which can become very expensive when we’re dealing with large strings. On the
other hand, for the Vec<String>
, we only need to swap pointers around which is only 8 bytes data
that have to be moved.
If you have a DataFrame with a large number of Utf8Chunked columns and you need to reorder them due to an operation like a FILTER, JOIN, GROUPBY, etc. than this can become quite expensive.
Categorical type
For this reason Polars has a CategoricalType.
A CategoricalChunked
is an array filled with u32
values that each represent a unique string value.
Thereby maintaining cache-efficiency, whilst also making it cheap to move values around.
Example: Single DataFrame
In the example below we show how you can cast a Utf8Chunked
column to a CategoricalChunked
.
use polars::prelude::*; fn example(path: &str) -> Result<DataFrame> { let mut df = CsvReader::from_path(path)? .finish()?; df.may_apply("utf8-column", |s| s.cast::<CategoricalType>())?; Ok(df) }
Example: Eager join multiple DataFrames on a Categorical
When the strings of one column need to be joined with the string data from another DataFrame
.
The Categorical
data needs to be synchronized (Categories in df A need to point to the same
underlying string data as Categories in df B). You can do that by turning the global string cache
on.
use polars::prelude::*; use polars::toggle_string_cache; fn example(mut df_a: DataFrame, mut df_b: DataFrame) -> Result<DataFrame> { // Set a global string cache toggle_string_cache(true); df_a.may_apply("a", |s| s.cast::<CategoricalType>())?; df_b.may_apply("b", |s| s.cast::<CategoricalType>())?; df_a.join(&df_b, "a", "b", JoinType::Inner) }
Example: Lazy join multiple DataFrames on a Categorical
A lazy Query always has a global string cache (unless you opt-out) for the duration of that query (until collect
is called).
The example below shows how you could join two DataFrames with Categorical types.
use polars::prelude::*; fn lazy_example(mut df_a: LazyFrame, mut df_b: LazyFrame) -> Result<DataFrame> { let q1 = df_a.with_columns(vec![ col("a").cast(DataType::Categorical), ]); let q2 = df_b.with_columns(vec![ col("b").cast(DataType::Categorical) ]); q1.inner_join(q2, col("a"), col("b"), None).collect() }