//! # Performance
//!
//! Understanding the memory format used by Arrow/Polars can significantly improve the performance
//! of your queries. This is especially true for large string data. The figure below shows how an
//! Arrow UTF8 array is laid out in memory.
//!
//! The array `["foo", "bar", "ham"]` is encoded as
//!
//! * a concatenated string `"foobarham"`
//! * an offset array indicating the start (and end) of each string `[0, 3, 6, 9]`
//! * a null bitmap, indicating null values
//!
//! ![](https://raw.githubusercontent.com/pola-rs/polars-static/master/docs/arrow-string.svg)
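//!
//! The layout described above can be sketched with the standard library alone. `Utf8Sketch` and
//! its methods are hypothetical names used for illustration only; they are not the Arrow or Polars
//! types. As simplifying assumptions, offsets are stored as `usize` and validity as a `Vec<bool>`
//! rather than a packed bitmap.
//!
//! ```rust
//! /// Simplified stand-in for an Arrow UTF8 array (hypothetical, not the Polars type).
//! struct Utf8Sketch {
//!     /// All string bytes, stored back to back.
//!     values: String,
//!     /// `offsets[i]..offsets[i + 1]` is the byte range of element `i`.
//!     offsets: Vec<usize>,
//!     /// `true` means the element is valid (not null).
//!     validity: Vec<bool>,
//! }
//!
//! impl Utf8Sketch {
//!     fn from_options(items: &[Option<&str>]) -> Self {
//!         let mut values = String::new();
//!         let mut offsets = vec![0];
//!         let mut validity = Vec::with_capacity(items.len());
//!         for item in items {
//!             // A null contributes no bytes, so its start and end offsets are equal.
//!             if let Some(s) = item {
//!                 values.push_str(s);
//!             }
//!             offsets.push(values.len());
//!             validity.push(item.is_some());
//!         }
//!         Utf8Sketch { values, offsets, validity }
//!     }
//!
//!     /// Look up element `i` by slicing the shared buffer; no allocation is needed.
//!     fn get(&self, i: usize) -> Option<&str> {
//!         if self.validity[i] {
//!             Some(&self.values[self.offsets[i]..self.offsets[i + 1]])
//!         } else {
//!             None
//!         }
//!     }
//! }
//! ```
//!
//! Reading element `i` is a slice into one contiguous buffer, which is what makes sequential
//! scans cache friendly.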
//!
//! This memory structure is very cache efficient when reading the string values, especially
//! compared to a `Vec<String>`, where every string lives in its own heap allocation.
//!
//! ![](https://raw.githubusercontent.com/pola-rs/polars-static/master/docs/pandas-string.svg)
//!
//! However, if we need to reorder the Arrow UTF8 array, we need to copy all the bytes of the
//! string values, which can become very expensive when we're dealing with large strings. For the
//! `Vec<String>`, on the other hand, we only need to move the fixed-size handles (pointer, length
//! and capacity, 24 bytes per string on a 64-bit target); the string bytes themselves stay in place.
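//!
//! A minimal sketch of that difference, using only the standard library. The function names
//! `take_arrow` and `swap_first_last` are hypothetical and chosen for illustration; the real
//! Arrow/Polars kernels are more involved.
//!
//! ```rust
//! /// Gather `indices` from an Arrow-style `(values, offsets)` pair.
//! /// The bytes of every selected string are copied into a new buffer.
//! fn take_arrow(values: &str, offsets: &[usize], indices: &[usize]) -> (String, Vec<usize>) {
//!     let mut new_values = String::new();
//!     let mut new_offsets = vec![0];
//!     for &i in indices {
//!         new_values.push_str(&values[offsets[i]..offsets[i + 1]]);
//!         new_offsets.push(new_values.len());
//!     }
//!     (new_values, new_offsets)
//! }
//!
//! /// Swap the first and last element of a slice of `String`s.
//! /// Only the handles are exchanged; the heap buffers holding the bytes do not move.
//! fn swap_first_last(strings: &mut [String]) {
//!     if let Some(last) = strings.len().checked_sub(1) {
//!         strings.swap(0, last);
//!     }
//! }
//! ```
//!
//! In the first function the work grows with the total number of string bytes selected; in the
//! second it is constant per element, regardless of how long the strings are.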
//!
//! If you have a [DataFrame](crate::frame::DataFrame) with a large number of
//! [Utf8Chunked](crate::datatypes::Utf8Chunked) columns and you need to reorder them due to an
//! operation like a FILTER, JOIN, GROUPBY, etc., then this can become quite expensive.
//!
//! ## Categorical type
//! For this reason Polars has a [CategoricalType](https://pola-rs.github.io/polars/polars/datatypes/struct.CategoricalType.html).
//! A `CategoricalChunked` is an array filled with `u32` values that each represent a unique string value,
//! thereby maintaining cache efficiency whilst also making it cheap to move values around.
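//!
//! The idea can be sketched as a plain dictionary encoding. This is an illustration under
//! assumptions, not the Polars implementation: `CategoricalSketch`, `encode` and `take` are
//! hypothetical names, and codes are simply assigned in order of first appearance.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! /// Hypothetical dictionary-encoded column: `codes[i]` indexes into `categories`.
//! struct CategoricalSketch {
//!     codes: Vec<u32>,
//!     categories: Vec<String>,
//! }
//!
//! /// Replace every string by a `u32` code; each distinct string is stored once.
//! fn encode(values: &[&str]) -> CategoricalSketch {
//!     let mut lookup: HashMap<&str, u32> = HashMap::new();
//!     let mut categories = Vec::new();
//!     let mut codes = Vec::with_capacity(values.len());
//!     for &v in values {
//!         let code = *lookup.entry(v).or_insert_with(|| {
//!             categories.push(v.to_string());
//!             (categories.len() - 1) as u32
//!         });
//!         codes.push(code);
//!     }
//!     CategoricalSketch { codes, categories }
//! }
//!
//! /// Reordering only gathers 4-byte codes; `categories` is left untouched.
//! fn take(cat: &CategoricalSketch, indices: &[usize]) -> Vec<u32> {
//!     indices.iter().map(|&i| cat.codes[i]).collect()
//! }
//! ```
//!
//! The codes form a contiguous `u32` buffer, so scans stay cache friendly, and a reorder moves
//! 4 bytes per row no matter how long the underlying strings are.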
//!
//! ### Example: Single DataFrame
//!
//! In the example below we show how you can cast a `Utf8Chunked` column to a `CategoricalChunked`.
//!
//! ```rust
//! use polars::prelude::*;
//!
//! fn example(path: &str) -> Result<DataFrame> {
//!     let mut df = CsvReader::from_path(path)?
//!                 .finish()?;
//!
//!     df.may_apply("utf8-column", |s| s.cast::<CategoricalType>())?;
//!     Ok(df)
//! }
//!
//! ```
//!
//! ### Example: Eager join multiple DataFrames on a Categorical
//! When the strings of one column need to be joined with the string data from another `DataFrame`,
//! the `Categorical` data needs to be synchronized: categories in df A need to point to the same
//! underlying string data as categories in df B. You can do that by turning the global string cache
//! on.
//!
//! ```rust
//! use polars::prelude::*;
//! use polars::toggle_string_cache;
//!
//! fn example(mut df_a: DataFrame, mut df_b: DataFrame) -> Result<DataFrame> {
//!     // Set a global string cache
//!     toggle_string_cache(true);
//!
//!     df_a.may_apply("a", |s| s.cast::<CategoricalType>())?;
//!     df_b.may_apply("b", |s| s.cast::<CategoricalType>())?;
//!     df_a.join(&df_b, "a", "b", JoinType::Inner)
//! }
//! ```
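//!
//! To see why the categories have to be synchronized, consider the following standard-library
//! sketch. `StringCacheSketch` is a hypothetical stand-in for a string cache, not the Polars
//! implementation; it only assumes that a cache maps each distinct string to one code.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! /// Hypothetical string cache: maps each distinct string to a single `u32` code.
//! #[derive(Default)]
//! struct StringCacheSketch {
//!     lookup: HashMap<String, u32>,
//! }
//!
//! impl StringCacheSketch {
//!     /// Encode a column, reusing codes for strings this cache has already seen.
//!     fn encode(&mut self, values: &[&str]) -> Vec<u32> {
//!         values
//!             .iter()
//!             .map(|&v| {
//!                 let next = self.lookup.len() as u32;
//!                 *self.lookup.entry(v.to_string()).or_insert(next)
//!             })
//!             .collect()
//!     }
//! }
//! ```
//!
//! If columns `["foo", "bar"]` and `["bar", "foo"]` are encoded with two separate caches, both
//! come out as `[0, 1]`, so equal codes do not mean equal strings. Encoding both with one shared
//! cache yields `[0, 1]` and `[1, 0]`, and the codes can be compared directly in a join.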
//!
//! ### Example: Lazy join multiple DataFrames on a Categorical
//! A lazy query always has a global string cache (unless you opt out) for the duration of that query (until `collect` is called).
//! The example below shows how you could join two DataFrames with Categorical types.
//!
//! ```rust
//! # #[cfg(feature = "lazy")]
//! # {
//! use polars::prelude::*;
//!
//! fn lazy_example(df_a: LazyFrame, df_b: LazyFrame) -> Result<DataFrame> {
//!
//!     let q1 = df_a.with_columns(vec![
//!         col("a").cast(DataType::Categorical),
//!     ]);
//!
//!     let q2 = df_b.with_columns(vec![
//!         col("b").cast(DataType::Categorical)
//!     ]);
//!     q1.inner_join(q2, col("a"), col("b"), None).collect()
//! }
//! # }
//! ```