Function replace_values_with_null 

pub fn replace_values_with_null(
    dataframe: DataFrame,
    null_value_list: &[&str],
    apply_to_all_columns: bool,
) -> PolarsResult<DataFrame>

Replaces values with null based on a list of matching strings, with options to apply to all columns or only string columns.

This function compares values against null_value_list and replaces them with NULL upon a match. The comparison behavior depends on the apply_to_all_columns flag:

  1. If apply_to_all_columns is false (String-Only Behavior):

    • Operates only on columns with DataType::String.
    • Trims leading/trailing whitespace from the original string value.
    • Compares the trimmed string against null_value_list.
    • Non-string columns and non-matching strings are untouched.
    • Empty and whitespace-only strings trim to ""; to nullify them, include "" in null_value_list.
  2. If apply_to_all_columns is true (Universal Behavior):

    • Operates on all columns in the DataFrame.
    • Casts the value in each column to its string representation (DataType::String).
    • Trims leading/trailing whitespace from this string representation.
    • Compares the trimmed string representation against null_value_list.
    • If a match occurs, the original value (regardless of type) is replaced with NULL.
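
The two modes above can be sketched on plain Rust values. This is a minimal model of the matching rule only, not the Polars implementation: the `Cell` enum and the `to_text` and `nullify` helpers are hypothetical names introduced for illustration, and the text rendering of integers and booleans is an assumption that may differ from Polars' own cast to String.

```rust
/// Hypothetical stand-in for a single cell value; not a Polars type.
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Str(String),
    Int(i64),
    Bool(bool),
    Null,
}

/// Text form used for the comparison. Assumption: Polars' cast to String
/// may render some types differently from Rust's `to_string`.
fn to_text(cell: &Cell) -> Option<String> {
    match cell {
        Cell::Str(s) => Some(s.clone()),
        Cell::Int(i) => Some(i.to_string()),
        Cell::Bool(b) => Some(b.to_string()),
        Cell::Null => None,
    }
}

/// Returns `Cell::Null` when the trimmed text form of `cell` is in
/// `null_value_list`; otherwise returns the cell unchanged.
fn nullify(cell: Cell, null_value_list: &[&str], apply_to_all_columns: bool) -> Cell {
    // String-only mode: non-string values are never inspected.
    if !apply_to_all_columns && !matches!(cell, Cell::Str(_)) {
        return cell;
    }
    let is_match = match to_text(&cell) {
        Some(text) => {
            // Trim first, then compare against every entry in the list.
            let trimmed = text.trim();
            null_value_list.iter().any(|candidate| *candidate == trimmed)
        }
        None => false,
    };
    if is_match {
        Cell::Null
    } else {
        cell
    }
}

fn main() {
    let null_value_list = ["NA", "", "123", "true"];
    let cells = vec![
        Cell::Str("  NA ".to_string()),
        Cell::Str("   ".to_string()),
        Cell::Int(123),
        Cell::Bool(true),
        Cell::Str("ok".to_string()),
    ];
    for apply_to_all_columns in [false, true] {
        let out: Vec<Cell> = cells
            .iter()
            .cloned()
            .map(|cell| nullify(cell, &null_value_list, apply_to_all_columns))
            .collect();
        println!("apply_to_all_columns = {}: {:?}", apply_to_all_columns, out);
    }
}
```

In this model the integer and boolean cells survive when the flag is false and become null only when it is true, while the padded and whitespace-only strings are nullified in both modes.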

Important Considerations (especially when apply_to_all_columns = true):

  • Trimming: Whitespace is always trimmed before comparison in both modes. For apply_to_all_columns = true, trimming occurs after casting to string.
  • Type Casting: The universal mode relies on Polars’ default casting to String. Ensure strings in null_value_list match the trimmed string representation of numbers, booleans, dates, etc. (e.g., “3.45”, “true”, “2023-01-01”, “NA”).
  • Ambiguity: A string like “123” in the list can match the integer 123, the float 123.0 (if its string form trims to “123”), and the string “ 123 ”.
  • Complex Types: Casting complex types (List, Struct, Binary) to String might yield unpredictable representations or errors. Use with caution.
  • Performance: The universal mode casts every value to a string before comparing, so it can be slower than the string-only mode on large datasets.
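
The trimming and ambiguity points can be checked against the comparison rule in isolation. The sketch below assumes the rule is "trim, then test for exact equality with a list entry"; `matches_null_list` is a hypothetical helper, and the string forms are supplied by hand rather than produced by a Polars cast, so it makes no claim about how Polars renders any particular type.

```rust
/// Hypothetical helper: true when the trimmed text equals an entry of the list.
fn matches_null_list(text: &str, null_value_list: &[&str]) -> bool {
    let trimmed = text.trim();
    null_value_list.iter().any(|candidate| *candidate == trimmed)
}

fn main() {
    let null_value_list = ["123", ""];
    // Hand-written string forms, standing in for whatever the cast produces.
    let string_forms = ["123", " 123 ", "123.0", "   ", "1230"];
    for text in string_forms {
        println!("{:?} -> {}", text, matches_null_list(text, &null_value_list));
    }
}
```

Under this rule a float is nullified by “123” only if its string form is exactly “123” after trimming; a form such as “123.0” needs its own entry in null_value_list.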