Struct RandomPartitionedDataBuilder 

Source
pub struct RandomPartitionedDataBuilder {
    pub seed: u64,
    pub num_partitions: usize,
    pub batches_per_partition: usize,
    pub rows_per_batch: usize,
    /* private fields */
}

Builder for generating test data partitions with random geometries.

This builder allows you to create deterministic test datasets with configurable geometry types, data distribution, and partitioning for testing spatial operations.

The generated data includes:

  • id: Unique integer identifier for each row
  • dist: Random floating-point distance value (0.0 to 100.0)
  • geometry: Random geometry data in the specified format (WKB or WKB View)

The strategy for generating geometries and its options are not stable and may change as the needs of testing and benchmarking evolve or as better strategies are discovered. The current strategy is as follows:

  • Points are uniformly distributed over the Self::bounds indicated
  • Linestrings are generated by calculating points on a circle of a randomly chosen size (according to Self::size_range), with the vertex count sampled using Self::vertices_per_linestring_range. The start and end points of generated linestrings are never connected.
  • Polygons are generated using a closed version of the linestring generated. They may or may not have a hole according to Self::polygon_hole_rate.
  • MultiPoint, MultiLinestring, and MultiPolygon geometries are constructed with the number of parts sampled according to Self::num_parts_range. The size of the entire feature is constrained to Self::size_range, and this space is subdivided to obtain the exact number of spaces needed. Child features are generated using the global options except with sizes sampled to approach the space given to them.
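
The linestring and polygon rules above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the function names circle_linestring and close_ring are hypothetical, and spacing the vertices evenly around the circle is an assumption (the description only says the points lie on a circle).

```rust
use std::f64::consts::TAU;

/// Hypothetical helper: places `n` vertices around a circle and leaves the
/// path open, so the first and last vertices differ (linestring-style output).
/// Even angular spacing is an assumption made for this sketch.
fn circle_linestring(center: (f64, f64), radius: f64, n: usize) -> Vec<(f64, f64)> {
    (0..n)
        .map(|i| {
            let angle = TAU * (i as f64) / (n as f64);
            (
                center.0 + radius * angle.cos(),
                center.1 + radius * angle.sin(),
            )
        })
        .collect()
}

/// Hypothetical helper: closes an open path into a polygon ring by repeating
/// the first vertex at the end. An empty path is returned unchanged.
fn close_ring(mut path: Vec<(f64, f64)>) -> Vec<(f64, f64)> {
    if let Some(&first) = path.first() {
        path.push(first);
    }
    path
}
```

Calling close_ring(circle_linestring((50.0, 50.0), 5.0, 4)) yields a five-vertex ring whose first and last vertices coincide, which matches the "closed version of the linestring" wording above.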

§Example

use sedona_testing::datagen::RandomPartitionedDataBuilder;
use sedona_geometry::types::GeometryTypeId;
use geo_types::{Coord, Rect};

let (schema, partitions) = RandomPartitionedDataBuilder::new()
    .seed(42)
    .num_partitions(4)
    .rows_per_batch(1000)
    .geometry_type(GeometryTypeId::Polygon)
    .bounds(Rect::new(Coord { x: 0.0, y: 0.0 }, Coord { x: 100.0, y: 100.0 }))
    .build()
    .unwrap();

Fields§

§seed: u64

§num_partitions: usize

§batches_per_partition: usize

§rows_per_batch: usize

Implementations§

Source§

impl RandomPartitionedDataBuilder

Source

pub fn new() -> Self

Creates a new RandomPartitionedDataBuilder with default values.

Default configuration:

  • seed: 42 (for deterministic results)
  • num_partitions: 1
  • batches_per_partition: 1
  • rows_per_batch: 10
  • geometry_type: Point
  • bounds: (0,0) to (100,100)
  • size_range: 1.0 to 10.0
  • null_rate: 0.0 (no nulls)
  • empty_rate: 0.0 (no empties)
  • vertices_per_linestring_range
  • num_parts_range: 1 to 3
  • polygon_hole_rate: 0.0 (no polygons with holes)
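
The setters on this type take self by value and return Self, which is what allows them to be chained from new(). A minimal, standard-library-only sketch of that pattern follows. SketchBuilder is a hypothetical stand-in that mirrors only the four public fields and their documented defaults; it is not the real struct and omits the private fields.

```rust
/// Hypothetical stand-in for the four public fields of the real builder.
#[derive(Debug, Clone, PartialEq)]
struct SketchBuilder {
    seed: u64,
    num_partitions: usize,
    batches_per_partition: usize,
    rows_per_batch: usize,
}

impl SketchBuilder {
    /// Starts from the defaults documented for the public fields.
    fn new() -> Self {
        Self {
            seed: 42,
            num_partitions: 1,
            batches_per_partition: 1,
            rows_per_batch: 10,
        }
    }

    /// Each setter consumes `self` and returns it, which enables chaining.
    fn seed(mut self, seed: u64) -> Self {
        self.seed = seed;
        self
    }

    fn num_partitions(mut self, num_partitions: usize) -> Self {
        self.num_partitions = num_partitions;
        self
    }

    fn rows_per_batch(mut self, rows_per_batch: usize) -> Self {
        self.rows_per_batch = rows_per_batch;
        self
    }
}
```

With this sketch, SketchBuilder::new().seed(7).num_partitions(4) overrides two values and leaves batches_per_partition and rows_per_batch at their defaults.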
Source

pub fn seed(self, seed: u64) -> Self

Sets the random seed for deterministic data generation.

Using the same seed will produce identical datasets, which is useful for reproducible tests.

§Arguments
  • seed - The random seed value
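
To illustrate why a fixed seed gives reproducible output, here is a small seeded generator written against the standard library only. It uses SplitMix64 purely as an example; which generator the crate actually uses is not stated here, and the names SplitMix64 and first_values are introduced just for this sketch.

```rust
/// SplitMix64, a small deterministic generator used here only for
/// illustration. It is not claimed to be the generator used by the builder.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Advances the state and returns the next pseudo-random value.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Collects the first `n` values produced for a given seed.
fn first_values(seed: u64, n: usize) -> Vec<u64> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| rng.next_u64()).collect()
}
```

Two calls to first_values with the same seed return identical vectors, while a different seed produces a different sequence. The same property is what makes seeded test datasets repeatable.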
Source

pub fn num_partitions(self, num_partitions: usize) -> Self

Sets the number of data partitions to generate.

Each partition contains multiple batches of data. This is useful for testing distributed processing scenarios.

§Arguments
  • num_partitions - Number of partitions to create
Source

pub fn batches_per_partition(self, batches_per_partition: usize) -> Self

Sets the number of batches per partition.

Each batch is a RecordBatch containing the specified number of rows.

§Arguments
  • batches_per_partition - Number of batches in each partition
Source

pub fn rows_per_batch(self, rows_per_batch: usize) -> Self

Sets the number of rows per batch.

This determines the size of each RecordBatch that will be generated.

§Arguments
  • rows_per_batch - Number of rows in each batch
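
The three size settings multiply together. As a rough check, and assuming every batch holds exactly rows_per_batch rows as described above, the total row count can be estimated with a helper like the following (expected_total_rows is a hypothetical name, not part of the crate):

```rust
/// Hypothetical helper: total rows across all partitions, assuming every
/// batch is full.
fn expected_total_rows(
    num_partitions: usize,
    batches_per_partition: usize,
    rows_per_batch: usize,
) -> usize {
    num_partitions * batches_per_partition * rows_per_batch
}
```

For a configuration with 4 partitions, the default of 1 batch per partition, and 1000 rows per batch, this gives 4 * 1 * 1000 = 4000 rows.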
Source

pub fn geometry_type(self, geom_type: GeometryTypeId) -> Self

Sets the type of geometry to generate.

Currently supports:

  • GeometryTypeId::Point: Random points within the specified bounds
  • GeometryTypeId::Polygon: Random diamond-shaped polygons
  • Other types default to point generation
§Arguments
  • geom_type - The geometry type to generate
Source

pub fn sedona_type(self, sedona_type: SedonaType) -> Self

Sets the Sedona data type for the geometry column.

This determines how the geometry data is stored (e.g., WKB or WKB View).

§Arguments
  • sedona_type - The Sedona type for geometry storage
Source

pub fn bounds(self, bounds: Rect) -> Self

Sets the spatial bounds for geometry generation.

All generated geometries will be positioned within these bounds. For polygons, the bounds are used to ensure the entire polygon fits within the area.

§Arguments
  • bounds - Rectangle defining the spatial bounds (min_x, min_y, max_x, max_y)
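
One way to picture how bounds constrain placement is to map two unit-interval draws onto the rectangle, and to shrink the rectangle by a shape's radius before choosing a center so the whole shape fits. The sketch below uses only the standard library; Bounds, point_in_bounds, and inset are hypothetical names, and the inset approach is an assumption about how "fits within the area" could be achieved, not a description of the crate's code.

```rust
/// Hypothetical rectangle type standing in for the bounds argument.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds {
    min_x: f64,
    min_y: f64,
    max_x: f64,
    max_y: f64,
}

/// Maps two draws `u` and `v` from [0, 1) onto a point inside `bounds`.
fn point_in_bounds(bounds: Bounds, u: f64, v: f64) -> (f64, f64) {
    (
        bounds.min_x + u * (bounds.max_x - bounds.min_x),
        bounds.min_y + v * (bounds.max_y - bounds.min_y),
    )
}

/// Shrinks `bounds` by `radius` on every side. A center chosen inside the
/// result keeps a shape of that radius within the original bounds.
fn inset(bounds: Bounds, radius: f64) -> Bounds {
    Bounds {
        min_x: bounds.min_x + radius,
        min_y: bounds.min_y + radius,
        max_x: bounds.max_x - radius,
        max_y: bounds.max_y - radius,
    }
}
```

With bounds of (0,0) to (100,100), the draws u = 0.5 and v = 0.25 map to the point (50, 25), and an inset of 10 leaves (10,10) to (90,90) for centers.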
Source

pub fn size_range(self, size_range: (f64, f64)) -> Self

Sets the size range for generated geometries.

For polygons, this controls the radius of the generated shapes. For points, this parameter is not used.

§Arguments
  • size_range - Tuple of (min_size, max_size) for geometry dimensions
Source

pub fn null_rate(self, null_rate: f64) -> Self

Sets the rate of null values in the geometry column.

§Arguments
  • null_rate - Fraction of rows that should have null geometry (0.0 to 1.0)
Source

pub fn empty_rate(self, empty_rate: f64) -> Self

Sets the rate of EMPTY geometries in the geometry column.

§Arguments
  • empty_rate - Fraction of rows that should have empty geometry (0.0 to 1.0)
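
A rate such as null_rate or empty_rate can be read as a threshold on a uniform draw from [0, 1): the row is affected when the draw falls below the rate. The following standard-library sketch shows that interpretation. It is an assumption about the mechanism, the names hits_rate and count_hits are hypothetical, and it says nothing about how the two rates interact in the crate.

```rust
/// Returns true when a uniform draw `u` in [0, 1) falls below `rate`.
fn hits_rate(u: f64, rate: f64) -> bool {
    u < rate
}

/// Counts hits over `samples` evenly spaced draws in [0, 1). Evenly spaced
/// draws stand in for random ones so the result is exact and repeatable.
fn count_hits(rate: f64, samples: usize) -> usize {
    (0..samples)
        .filter(|&i| hits_rate(i as f64 / samples as f64, rate))
        .count()
}
```

Under this reading, a rate of 0.0 never triggers, a rate of 1.0 always triggers, and a rate of 0.25 affects a quarter of the draws.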
Source

pub fn vertices_per_linestring_range( self, vertices_per_linestring_range: (usize, usize), ) -> Self

Sets the vertex count range for generated linestrings.

§Arguments
  • vertices_per_linestring_range - The minimum and maximum (inclusive) number of vertices in linestring output. This also affects polygon output, although the actual number of vertices in the polygon ring will be one more than the range indicated here to close the polygon.
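
Because the range is inclusive at both ends, a sampled count can equal either bound, and a polygon ring built from it has one extra vertex. A minimal standard-library sketch of both points follows. The names sample_inclusive and ring_vertex_count are hypothetical, and reducing a raw integer with a modulo is a simplification chosen for brevity (it is slightly biased and is not claimed to be what the crate does).

```rust
/// Maps a raw random integer onto the inclusive range `(min, max)`.
/// Panics if the range is not ordered. The modulo reduction is a
/// simplification for illustration only.
fn sample_inclusive(raw: u64, range: (usize, usize)) -> usize {
    let (min, max) = range;
    assert!(min <= max, "range must satisfy min <= max");
    let span = (max - min) as u64 + 1;
    min + (raw % span) as usize
}

/// Number of vertices in a closed polygon ring built from a linestring with
/// `sampled` vertices: the first vertex is repeated to close the ring.
fn ring_vertex_count(sampled: usize) -> usize {
    sampled + 1
}
```

For the range (3, 7), raw values 0 and 4 map to 3 and 7 respectively, so both ends are reachable, and a linestring with 4 vertices becomes a ring with 5.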
Source

pub fn num_parts_range(self, num_parts_range: (usize, usize)) -> Self

Sets the range for the number of parts in multi-part geometries.

§Arguments
  • num_parts_range - The minimum and maximum (inclusive) number of parts in multi geometry and/or collection output.
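
The struct-level description says the space for a multi-part feature is subdivided so that each part gets its own area. One simple way to do that is a grid, sketched below with the standard library only. The grid layout, the Rect tuple alias, and the function name subdivide are all assumptions made for illustration; the crate's actual subdivision scheme is not documented here.

```rust
/// (min_x, min_y, max_x, max_y). A hypothetical alias for this sketch.
type Rect = (f64, f64, f64, f64);

/// Splits `bounds` into exactly `n` grid cells, one per part.
/// Uses ceil(sqrt(n)) columns and as many rows as needed; trailing cells in
/// the last row are simply left unused.
fn subdivide(bounds: Rect, n: usize) -> Vec<Rect> {
    if n == 0 {
        return Vec::new();
    }
    let cols = (n as f64).sqrt().ceil() as usize;
    let rows = (n + cols - 1) / cols;
    let (min_x, min_y, max_x, max_y) = bounds;
    let w = (max_x - min_x) / cols as f64;
    let h = (max_y - min_y) / rows as f64;
    (0..n)
        .map(|i| {
            let (c, r) = (i % cols, i / cols);
            (
                min_x + c as f64 * w,
                min_y + r as f64 * h,
                min_x + (c + 1) as f64 * w,
                min_y + (r + 1) as f64 * h,
            )
        })
        .collect()
}
```

Requesting 5 parts produces a 3 by 2 grid and returns its first 5 cells, so the number of spaces always equals the number of parts.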
Source

pub fn polygon_hole_rate(self, polygon_hole_rate: f64) -> Self

Sets the polygon hole rate.

§Arguments
  • polygon_hole_rate - Fraction of polygons that should have an interior ring. Currently only a single interior ring is possible.
Source

pub fn schema(&self) -> SchemaRef

Returns the SchemaRef generated by this builder.

The resulting schema contains three columns:

  • id: Int32 - Unique sequential identifier for each row
  • dist: Float64 - Random distance value between 0.0 and 100.0
  • geometry: SedonaType - Random geometry data (WKB or WKB View format)
Source

pub fn build(&self) -> Result<(SchemaRef, Vec<Vec<RecordBatch>>)>

Builds the random partitioned dataset with the configured parameters.

Generates a deterministic dataset based on the seed and configuration. The resulting schema contains three columns:

  • id: Int32 - Unique sequential identifier for each row
  • dist: Float64 - Random distance value between 0.0 and 100.0
  • geometry: SedonaType - Random geometry data (WKB or WKB View format)
§Returns

A tuple containing:

  • SchemaRef: Arrow schema for the generated data
  • Vec<Vec<RecordBatch>>: Vector of partitions, each containing a vector of record batches
§Errors

Returns a datafusion_common::Result error if:

  • RecordBatch creation fails
  • Array conversion fails
  • Schema creation fails
Source

pub fn validate(&self) -> Result<()>

Validates the configured options.

This is called internally before generating batches to prevent panics from occurring while creating random output; however, it may also be called at a higher level to generate an error at a more relevant time.
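
The exact checks performed by validate are not listed here. As a hedged illustration of the kind of checks that prevent panics during random generation, the sketch below rejects rates outside 0.0 to 1.0 and ranges whose minimum exceeds their maximum. The Options struct, the String error type, and the specific rules are all hypothetical and use only the standard library.

```rust
/// Hypothetical subset of the builder's options, for illustration only.
#[derive(Debug)]
struct Options {
    null_rate: f64,
    empty_rate: f64,
    size_range: (f64, f64),
    num_parts_range: (usize, usize),
}

/// Returns an error message for the first option that fails a check.
/// These rules are assumptions, not the crate's documented behavior.
fn validate(options: &Options) -> Result<(), String> {
    let rates = [
        ("null_rate", options.null_rate),
        ("empty_rate", options.empty_rate),
    ];
    for (name, rate) in rates {
        // NaN fails this check too, because it is not contained in the range.
        if !(0.0..=1.0).contains(&rate) {
            return Err(format!("{} must be between 0.0 and 1.0, got {}", name, rate));
        }
    }
    if options.size_range.0 > options.size_range.1 {
        return Err("size_range minimum exceeds maximum".to_string());
    }
    if options.num_parts_range.0 > options.num_parts_range.1 {
        return Err("num_parts_range minimum exceeds maximum".to_string());
    }
    Ok(())
}
```

Running checks like these up front turns a bad configuration into an error at setup time instead of a failure partway through batch generation.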

Source

pub fn default_rng(seed: u64) -> impl Rng

Generates an Rng based on a seed.

Callers can also supply their own Rng.

Source

pub fn partition_reader<R: Rng + Send + 'static>( &self, rng: R, partition_idx: usize, ) -> Box<dyn RecordBatchReader + Send>

Creates a RecordBatchReader that reads a single partition.

Trait Implementations§

Source§

impl Clone for RandomPartitionedDataBuilder

Source§

fn clone(&self) -> RandomPartitionedDataBuilder

Returns a duplicate of the value.
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.
Source§

impl Debug for RandomPartitionedDataBuilder

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.
Source§

impl Default for RandomPartitionedDataBuilder

Source§

fn default() -> Self

Returns the “default value” for a type.

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V