pub fn replace_values_with_null(
    dataframe: DataFrame,
    null_value_list: &[&str],
    apply_to_all_columns: bool,
) -> PolarsResult<DataFrame>
Replaces values with null based on a list of matching strings, with options to apply to all columns or only string columns.
This function compares values against null_value_list and replaces them with NULL
upon a match. The comparison behavior depends on the apply_to_all_columns flag:
- If apply_to_all_columns is false (Default String Behavior):
  - Operates only on columns with DataType::String.
  - Trims leading/trailing whitespace from the original string value.
  - Compares the trimmed string against null_value_list.
  - Non-string columns and non-matching strings are untouched.
  - To nullify empty/whitespace-only strings, include "" in null_value_list.
- If apply_to_all_columns is true (Universal Behavior):
  - Operates on all columns in the DataFrame.
  - Casts the value in each column to its string representation (DataType::String).
  - Trims leading/trailing whitespace from this string representation.
  - Compares the trimmed string representation against null_value_list.
  - If a match occurs, the original value (regardless of type) is replaced with NULL.
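
The two modes described above can be modelled without Polars. The following is a minimal sketch using only the standard library: the `Cell` enum, `matches_null_list`, and `nullify_cell` are hypothetical names introduced for illustration, and the real function decides per column data type rather than per cell. Integer and boolean text here comes from Rust's `to_string`, which is only assumed to resemble Polars' cast to String.

```rust
/// Hypothetical stand-in for a single DataFrame cell; not a Polars type.
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Null,
    Str(String),
    Int(i64),
    Bool(bool),
}

/// Trims `text` and reports whether it equals any entry of `null_value_list`.
fn matches_null_list(text: &str, null_value_list: &[&str]) -> bool {
    let trimmed = text.trim();
    null_value_list.iter().any(|candidate| *candidate == trimmed)
}

/// Models the documented rule for one cell (an illustration, not the crate's code).
fn nullify_cell(cell: &Cell, null_value_list: &[&str], apply_to_all_columns: bool) -> Cell {
    let text = match cell {
        Cell::Null => return Cell::Null,
        Cell::Str(s) => s.clone(),
        // String-only mode: non-string values are left untouched.
        Cell::Int(_) | Cell::Bool(_) if !apply_to_all_columns => return cell.clone(),
        // Universal mode: compare against the value's string representation.
        Cell::Int(i) => i.to_string(),
        Cell::Bool(b) => b.to_string(),
    };
    if matches_null_list(&text, null_value_list) {
        Cell::Null
    } else {
        cell.clone()
    }
}
```

With the list `["NA", "", "123"]`, this model nullifies the strings `"  NA "` and `"   "` in either mode, but nullifies the integer `123` only when `apply_to_all_columns` is true.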
Important Considerations (especially when apply_to_all_columns = true):
- Trimming: Whitespace is always trimmed before comparison in both modes. For apply_to_all_columns = true, trimming occurs after casting to string.
- Type Casting: The universal mode relies on Polars’ default casting to String. Ensure strings in null_value_list match the trimmed string representation of numbers, booleans, dates, etc. (e.g., “3.45”, “true”, “2023-01-01”, “NA”).
- Ambiguity: A string like “123” in the list might match integer 123, float 123.0 (if its string form trims to “123”), and string " 123 ".
- Complex Types: Casting complex types (List, Struct, Binary) to String might yield unpredictable representations or errors. Use with caution.
- Performance: The universal mode (casting all values) can be slower than the string-only mode on large datasets.
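
The ambiguity point is easier to see with concrete string representations. The sketch below uses only Rust's standard formatting, which is an assumption standing in for Polars' cast to String and may differ from it; `ambiguous_matches` is a hypothetical helper. It shows that whether a float matches “123” depends entirely on how the value is rendered as text.

```rust
/// Returns the labels of sample values whose trimmed text appears in the list.
/// Hypothetical helper for illustration; the representations come from Rust's
/// own formatting, not from Polars.
fn ambiguous_matches(null_value_list: &[&str]) -> Vec<String> {
    // (label, string representation) pairs.
    let samples = vec![
        ("integer 123", 123_i64.to_string()),
        // Rust's Display drops the fractional part of a whole float.
        ("float 123.0 (Display)", format!("{}", 123.0_f64)),
        // Rust's Debug keeps the trailing ".0".
        ("float 123.0 (Debug)", format!("{:?}", 123.0_f64)),
        ("string \" 123 \"", " 123 ".to_string()),
    ];
    samples
        .into_iter()
        // Trim each representation before comparing, as both modes do.
        .filter(|(_, repr)| null_value_list.iter().any(|c| *c == repr.trim()))
        .map(|(label, _)| label.to_string())
        .collect()
}
```

Calling `ambiguous_matches(&["123"])` selects the integer, the Display-formatted float, and the padded string, but not the Debug-formatted float. Checking the actual string form that Polars produces for a column's type before choosing entries for null_value_list avoids surprises of this kind.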