Crate spark_connect_rs


Spark Connection Client for Rust

Currently, the Spark Connect client for Rust is highly experimental and should not be used in any production setting. It is a “proof of concept” intended to identify how to interact with a Spark cluster from Rust.

§Quickstart

Create a Spark Session and create a DataFrame from an arrow::array::RecordBatch.

use std::sync::Arc;

use spark_connect_rs::{SparkSession, SparkSessionBuilder};
use spark_connect_rs::functions::{col, lit};

use arrow::array::{ArrayRef, Int64Array, RecordBatch, StringArray};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {

    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/;user_id=example_rs")
        .build()
        .await?;

    let name: ArrayRef = Arc::new(StringArray::from(vec!["Tom", "Alice", "Bob"]));
    let age: ArrayRef = Arc::new(Int64Array::from(vec![14, 23, 16]));

    let data = RecordBatch::try_from_iter(vec![("name", name), ("age", age)])?;

    let df = spark.create_dataframe(&data).await?;

    // 2 records total
    let records = df.select(["*"])
        .with_column("age_plus", col("age") + lit(4))
        .filter(col("name").contains("o"))
        .count()
        .await?;

    Ok(())
}

Create a Spark Session and create a DataFrame from a SQL statement:

use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {

    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/;user_id=example_rs")
        .build()
        .await?;

    let df = spark.sql("SELECT * FROM json.`/datasets/employees.json`").await?;

    // Show the first 5 records
    df.filter("salary > 3000").show(Some(5), None, None).await?;

    Ok(())
}

Create a Spark Session, read a CSV file into a DataFrame, apply function transformations, and write the results:

use spark_connect_rs::{SparkSession, SparkSessionBuilder};

use spark_connect_rs::functions as F;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {

    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/;user_id=example_rs")
        .build()
        .await?;

    let paths = ["/datasets/people.csv"];

    let df = spark
        .read()
        .format("csv")
        .option("header", "True")
        .option("delimiter", ";")
        .load(paths)?;

    let df = df
        .filter("age > 30")
        .select([
            F::col("name"),
            F::col("age").cast("int")
        ]);

    df.write()
      .format("csv")
      .option("header", "true")
      .save("/opt/spark/examples/src/main/rust/people/")
      .await?;

    Ok(())
}

§Databricks Connection

Spark Connect is enabled for Databricks Runtime 13.3 LTS and above, and connecting requires the crate’s tls feature flag (feature = "tls"). The connection string for the remote session must contain the following values:

"sc://<workspace id>:443/;token=<personal access token>;x-databricks-cluster-id=<cluster-id>"
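As a sketch, such a connection string can be passed to SparkSessionBuilder::remote just like the local examples above; the workspace host, token, and cluster id below are placeholders for illustration only, and the crate must be built with the tls feature enabled. This is not runnable without a live Databricks workspace:

```rust
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder workspace host, personal access token, and cluster id;
    // substitute your own values. Requires features = ["tls"].
    let conn = "sc://my-workspace.cloud.databricks.com:443/;token=<personal access token>;x-databricks-cluster-id=<cluster-id>";

    let spark: SparkSession = SparkSessionBuilder::remote(conn)
        .build()
        .await?;

    // Run a trivial query to confirm the session is live
    spark.sql("SELECT 1").await?.show(Some(1), None, None).await?;

    Ok(())
}
```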

§Re-exports

pub use dataframe::DataFrame;
pub use dataframe::DataFrameReader;
pub use dataframe::DataFrameWriter;
pub use session::SparkSession;
pub use session::SparkSessionBuilder;

§Modules

catalog
Spark Catalog representation through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
client
Implementation of the SparkConnectServiceClient
column
Column represents a column in a DataFrame that holds a spark::Expression
conf
Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
dataframe
DataFrame representation for Spark Connection
errors
Defines a SparkError for representing failures in various Spark operations. Most of these are wrappers for tonic or arrow error messages
expressions
Traits for converting Rust Types to Spark Connect Expression Types
functions
A re-implementation of Spark functions
group
A DataFrame created with an aggregate statement
plan
Logical Plan representation
readwriter
DataFrameReader & DataFrameWriter representations
session
Spark Session containing the remote gRPC client
spark
Spark Connect gRPC protobuf translated using tonic
storage
Enum for handling Spark Storage representations
streaming
Streaming implementation for the Spark Connect Client
types
Rust Types to Spark Types
window
Utility structs for defining a window over a DataFrame