pub struct ArrowBytesMap<O, V>{ /* private fields */ }
Expand description

Optimized map for storing Arrow “bytes” types (String, LargeString, Binary, and LargeBinary) values that can produce the set of keys on output as GenericBinaryArray without copies.

Equivalent to HashSet<String, V> but with better performance for arrow data.

§Generic Arguments

  • O: OffsetSize (String/LargeString)
  • V: payload type

§Description

This is a specialized HashMap with the following properties:

  1. Optimized for storing and emitting Arrow byte types (e.g. StringArray / BinaryArray) very efficiently by minimizing copying of the string values themselves, both when inserting and when emitting the final array.

  2. Retains the insertion order of entries in the final array. The values are in the same order as they were inserted.

Note this structure can be used as a HashSet by specifying the value type as (), as is done by ArrowBytesSet.

This map is used by the special COUNT DISTINCT aggregate function to store the distinct values, and by the GROUP BY operator to store group values when they are a single string array.

§Example

The following diagram shows how the map would store the four strings “Foo”, NULL, “Bar”, “TheQuickBrownFox”:

  • hashtable stores entries for each distinct string that has been inserted. The entries contain the payload as well as information about the value (either an offset or the actual bytes, see Entry docs for more details)

  • offsets stores offsets into buffer for each distinct string value, following the same convention as the offsets in a StringArray or LargeStringArray.

  • buffer stores the actual byte data

  • null: stores the index and payload of the null value, in this case the second value (index 1)

┌───────────────────────────────────┐    ┌─────┐    ┌────┐
│                ...                │    │  0  │    │FooB│
│ ┌──────────────────────────────┐  │    │  0  │    │arTh│
│ │      <Entry for "Bar">       │  │    │  3  │    │eQui│
│ │            len: 3            │  │    │  3  │    │ckBr│
│ │   offset_or_inline: "Bar"    │  │    │  6  │    │ownF│
│ │         payload:...          │  │    │     │    │ox  │
│ └──────────────────────────────┘  │    │     │    │    │
│                ...                │    └─────┘    └────┘
│ ┌──────────────────────────────┐  │
│ │<Entry for "TheQuickBrownFox">│  │    offsets    buffer
│ │           len: 16            │  │
│ │     offset_or_inline: 6      │  │    ┌───────────────┐
│ │         payload: ...         │  │    │    Some(1)    │
│ └──────────────────────────────┘  │    │ payload: ...  │
│                ...                │    └───────────────┘
└───────────────────────────────────┘
                                             null
              HashTable

§Entry Format

Entries stored in a ArrowBytesMap represents a value that is either stored inline or in the buffer

This helps the case where there are many short (less than 8 bytes) strings that are the same (e.g. “MA”, “CA”, “NY”, “TX”, etc)

                                                               ┌──────────────────┐
                                                 ─ ─ ─ ─ ─ ─ ─▶│...               │
                                                │              │TheQuickBrownFox  │
                                                               │...               │
                                                │              │                  │
                                                               └──────────────────┘
                                                │               buffer of u8

                                                │
                       ┌────────────────┬───────────────┬───────────────┐
 Storing               │                │ starting byte │  length, in   │
 "TheQuickBrownFox"    │   hash value   │   offset in   │  bytes (not   │
 (long string)         │                │    buffer     │  characters)  │
                       └────────────────┴───────────────┴───────────────┘
                             8 bytes          8 bytes       4 or 8


                        ┌───────────────┬─┬─┬─┬─┬─┬─┬─┬─┬───────────────┐
Storing "foobar"        │               │ │ │ │ │ │ │ │ │  length, in   │
(short string)          │  hash value   │?│?│f│o│o│b│a│r│  bytes (not   │
                        │               │ │ │ │ │ │ │ │ │  characters)  │
                        └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┴───────────────┘
                             8 bytes         8 bytes        4 or 8

Implementations§

source§

impl<O: OffsetSizeTrait, V> ArrowBytesMap<O, V>
where V: Debug + PartialEq + Eq + Clone + Copy + Default,

source

pub fn new(output_type: OutputType) -> Self

source

pub fn take(&mut self) -> Self

Return the contents of this map and replace it with a new empty map with the same output type

source

pub fn insert_if_new<MP, OP>( &mut self, values: &ArrayRef, make_payload_fn: MP, observe_payload_fn: OP )
where MP: FnMut(Option<&[u8]>) -> V, OP: FnMut(V),

Inserts each value from values into the map, invoking payload_fn for each value if not already present, deferring the allocation of the payload until it is needed.

Note that this is different than a normal map that would replace the existing entry

§Arguments:

values: array whose values are inserted

make_payload_fn: invoked for each value that is not already present to create the payload, in order of the values in values

observe_payload_fn: invoked once, for each value in values, that was already present in the map, with corresponding payload value.

§Returns

The payload value for the entry, either the existing value or the newly inserted value

§Safety:

Note that make_payload_fn and observe_payload_fn are only invoked with valid values from values, not for the NULL value.

source

pub fn into_state(self) -> ArrayRef

Converts this set into a StringArray, LargeStringArray, BinaryArray, or LargeBinaryArray containing each distinct value that was inserted. This is done without copying the values.

The values are guaranteed to be returned in the same order in which they were first seen.

source

pub fn len(&self) -> usize

Total number of entries (including null, if present)

source

pub fn is_empty(&self) -> bool

Is the set empty?

source

pub fn non_null_len(&self) -> usize

Number of non null entries

source

pub fn size(&self) -> usize

Return the total size, in bytes, of memory used to store the data in this set, not including self

Trait Implementations§

source§

impl<O: OffsetSizeTrait, V> Debug for ArrowBytesMap<O, V>
where V: Debug + PartialEq + Eq + Clone + Copy + Default,

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

§

impl<O, V> Freeze for ArrowBytesMap<O, V>
where O: Integer<Output = O, Output = O, Output = O, Output = O, Output = O> + Eq + PartialEq + Ord + PartialOrd + Num + NumOps + Rem + Div + Mul + Sub + Add + One + Zero + AddAssign + ArrowNativeType + Sealed + Default + Copy + Clone + Sync + Send + Debug + 'static, V: Freeze,

§

impl<O, V> RefUnwindSafe for ArrowBytesMap<O, V>
where O: Integer<Output = O, Output = O, Output = O, Output = O, Output = O> + Eq + PartialEq + Ord + PartialOrd + Num + NumOps + Rem + Div + Mul + Sub + Add + One + Zero + AddAssign + ArrowNativeType + Sealed + Default + Copy + Clone + Sync + Send + Debug + 'static + RefUnwindSafe, V: RefUnwindSafe,

§

impl<O, V> Send for ArrowBytesMap<O, V>
where O: Integer<Output = O, Output = O, Output = O, Output = O, Output = O> + Eq + PartialEq + Ord + PartialOrd + Num + NumOps + Rem + Div + Mul + Sub + Add + One + Zero + AddAssign + ArrowNativeType + Sealed + Default + Copy + Clone + Sync + Send + Debug + 'static, V: Send,

§

impl<O, V> Sync for ArrowBytesMap<O, V>
where O: Integer<Output = O, Output = O, Output = O, Output = O, Output = O> + Eq + PartialEq + Ord + PartialOrd + Num + NumOps + Rem + Div + Mul + Sub + Add + One + Zero + AddAssign + ArrowNativeType + Sealed + Default + Copy + Clone + Sync + Send + Debug + 'static, V: Sync,

§

impl<O, V> Unpin for ArrowBytesMap<O, V>
where O: Integer<Output = O, Output = O, Output = O, Output = O, Output = O> + Eq + PartialEq + Ord + PartialOrd + Num + NumOps + Rem + Div + Mul + Sub + Add + One + Zero + AddAssign + ArrowNativeType + Sealed + Default + Copy + Clone + Sync + Send + Debug + 'static + Unpin, V: Unpin,

§

impl<O, V> UnwindSafe for ArrowBytesMap<O, V>
where O: Integer<Output = O, Output = O, Output = O, Output = O, Output = O> + Eq + PartialEq + Ord + PartialOrd + Num + NumOps + Rem + Div + Mul + Sub + Add + One + Zero + AddAssign + ArrowNativeType + Sealed + Default + Copy + Clone + Sync + Send + Debug + 'static + UnwindSafe, V: UnwindSafe,

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

source§

fn vzip(self) -> V

source§

impl<T> Allocation for T
where T: RefUnwindSafe + Send + Sync,