Crate dsq_core

dsq-core: Core library for dsq data processing

This crate provides the core functionality for dsq, a data processing tool that extends jq-ish syntax to work with structured data formats like Parquet, Avro, CSV, and more. dsq leverages Polars DataFrames to provide high-performance data manipulation across multiple file formats.

§Features

  • Format Flexibility: Support for CSV, TSV, Parquet, Avro, JSON Lines, Arrow, and JSON
  • Performance: Built on Polars DataFrames with lazy evaluation and columnar operations
  • Type Safety: Proper type handling with clear error messages

§Quick Start

use dsq_core::{Value, ops, io};

// Read a CSV file
let data = io::read_file_sync("data.csv", &io::ReadOptions::default())?;

// Apply operations
let result = ops::OperationPipeline::new()
    .select(vec!["name".to_string(), "age".to_string()])
    .sort(vec![ops::SortOptions::desc("age".to_string())])
    .head(10)
    .execute(data)?;

// Write to Parquet
io::write_file_sync(&result, "output.parquet", &io::WriteOptions::default())?;

§Architecture

The library is organized into several key modules:

  • value - Core value type that bridges JSON and DataFrames
  • ops - Data operations (select, filter, aggregate, join, transform)
  • io - Input/output for various file formats
  • filter - jq-compatible filter compilation and execution
  • error - Error handling and result types
  • format - File format detection and metadata
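
For quick scripts, the prelude module (listed under Modules below) is the shortest way to bring the common items into scope. Its exact contents aren't spelled out on this page, so the following is a sketch that assumes it covers the crate-root re-exports:

use dsq_core::{io, prelude::*};

// Assumes the prelude re-exports the crate-root items used below
// (SortOptions, sort_by_columns, head) and that the io-csv feature is enabled.
let data = io::read_file_sync("data.csv", &io::ReadOptions::default())?;
let oldest_first = sort_by_columns(&data, &[SortOptions::desc("age")])?;
let top = head(&oldest_first, 10)?;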

§Examples

§Basic DataFrame Operations

use dsq_core::{Value, ops::basic::*};
use polars::prelude::*;

let df = df! {
    "name" => ["Alice", "Bob", "Charlie"],
    "age" => [30, 25, 35],
    "department" => ["Engineering", "Sales", "Engineering"]
}?;

let data = Value::DataFrame(df);

// Select columns
let selected = select_columns(&data, &["name".to_string(), "age".to_string()])?;

// Sort by age
let sorted = sort_by_columns(&selected, &[SortOptions::desc("age")])?;

// Take first 2 rows
let result = head(&sorted, 2)?;

§Aggregation Operations

use dsq_core::{Value, ops::aggregate::*};

// Group by department and calculate statistics
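// `data` is assumed to be a Value::DataFrame with "department", "age",
// and "salary" columns (not shown here)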
let aggregated = group_by_agg(
    &data,
    &["department".to_string()],
    &[
        AggregationFunction::Count,
        AggregationFunction::Mean("age".to_string()),
        AggregationFunction::Sum("salary".to_string()),
    ]
)?;

§Join Operations

use dsq_core::{Value, ops::join::*};
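
// `left_data` and `right_data` are assumed to be Value::DataFrames that share an "id" column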

let keys = JoinKeys::on(vec!["id".to_string()]);
let options = JoinOptions {
    join_type: JoinType::Inner,
    ..Default::default()
};

let joined = join(&left_data, &right_data, &keys, &options)?;

§Format Conversion

use dsq_core::io;

// Convert CSV to Parquet
io::convert_file(
    "data.csv",
    "data.parquet",
    &io::ReadOptions::default(),
    &io::WriteOptions::default()
)?;

§Filter Execution

use dsq_core::filter::{FilterExecutor, ExecutorConfig};

let mut executor = FilterExecutor::with_config(
    ExecutorConfig {
        lazy_evaluation: true,
        dataframe_optimizations: true,
        ..Default::default()
    }
);

// Execute jq-style filter on DataFrame
let result = executor.execute_str(
    r#"map(select(.age > 30)) | sort_by(.name)"#,
    data
)?;

§Error Handling

All operations return Result<T> where errors are represented by the Error type:

use dsq_core::{Error, Result, TypeError, FormatError};
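
// `some_operation()` is a placeholder for any fallible dsq_core call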

match some_operation() {
    Ok(value) => println!("Success: {:?}", value),
    Err(Error::Type(TypeError::InvalidConversion { from, to })) => {
        eprintln!("Cannot convert from {} to {}", from, to);
    }
    Err(Error::Format(FormatError::Unknown(format))) => {
        eprintln!("Unknown format: {}", format);
    }
    Err(e) => eprintln!("Other error: {}", e),
}

§Performance Tips

  • Use lazy evaluation for large datasets with LazyFrame (see the sketch after this list)
  • Prefer columnar operations over row-by-row processing
  • Use appropriate data types to minimize memory usage
  • Consider using streaming for very large files that don’t fit in memory
  • Enable DataFrame-specific optimizations in the filter executor
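
As a minimal sketch of the lazy-evaluation tip, this uses Polars directly (any lazy entry points dsq_core itself exposes are not shown on this page); for files on disk, Polars' lazy scan readers such as LazyCsvReader avoid loading the whole file up front:

use polars::prelude::*;

let df = df! {
    "name" => ["Alice", "Bob", "Charlie"],
    "age" => [30, 25, 35]
}?;

// .lazy() defers execution: the filter and projection run only at collect(),
// so Polars can optimise the whole query plan before touching the data.
let out = df
    .lazy()
    .filter(col("age").gt(lit(30)))
    .select([col("name"), col("age")])
    .collect()?;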

§Feature Flags

This crate supports several optional features:

  • default - Includes all commonly used functionality
  • io-csv - CSV/TSV reading and writing support
  • io-parquet - Parquet format support
  • io-json - JSON and JSON Lines support
  • io-avro - Avro format support (planned)
  • io-arrow - Arrow IPC format support
  • filter - jq-compatible filter compilation and execution
  • repl - Interactive REPL support (for CLI usage)

Re-exports§

pub use crate::error::Error;
pub use crate::error::FilterError;
pub use crate::error::FormatError;
pub use crate::error::Result;
pub use crate::error::TypeError;
pub use ops::recommended_batch_size;
pub use ops::supports_operation;
pub use ops::Operation;
pub use ops::OperationPipeline;
pub use ops::OperationType;
pub use ops::basic::count;
pub use ops::basic::filter_values;
pub use ops::basic::head;
pub use ops::basic::reverse;
pub use ops::basic::select_columns;
pub use ops::basic::slice;
pub use ops::basic::sort_by_columns;
pub use ops::basic::tail;
pub use ops::basic::unique;
pub use ops::basic::SortOptions;
pub use ops::aggregate::group_by;
pub use ops::aggregate::group_by_agg;
pub use ops::aggregate::pivot;
pub use ops::aggregate::unpivot;
pub use ops::aggregate::AggregationFunction;
pub use ops::aggregate::WindowFunction;
pub use ops::join::inner_join;
pub use ops::join::join;
pub use ops::join::left_join;
pub use ops::join::outer_join;
pub use ops::join::right_join;
pub use ops::join::JoinKeys;
pub use ops::join::JoinOptions;
pub use ops::join::JoinType;
pub use ops::transform::Transform;
pub use utils::array;
pub use utils::object;

Modules§

error
Error types and handling
ops
Operations module for dsq
prelude
Prelude module for convenient imports
utils
Utility functions for working with dsq

Structs§

BuildInfo
Build information structure
FormatOptions
Options for format detection

Enums§

DataFormat
Supported data formats for reading and writing
Value
Core value type that bridges JSON and DataFrames

Constants§

BUILD_INFO
Build information for dsq-core
VERSION
Version information

Functions§

detect_format_from_content
Detect format from file content (magic bytes)