rustsight 1.2.4

A fast, safe CLI tool for dataset analysis and validation. Analyzes CSV files for column types, missing values, basic statistics (min/max/mean), outliers, no-variance columns, and mixed-type columns — helping you catch data quality issues before ML/AI training. Also supports binary and text file introspection.

RustSight

RustSight is a fast, safe, and extensible dataset analysis CLI tool written in Rust.
This project focuses on data validation and exploratory analysis — the exact step that comes before AI/ML model training. It works on any CSV file and can also analyze binary or text files to extract useful properties.



📦 Installation

From crates.io (Recommended)

cargo install rustsight

👉 crates.io/crates/rustsight

From Source

git clone https://github.com/omarnahdi/Dataset-Analyzer.git

cd Dataset-Analyzer

cargo build --release

./target/release/rustsight stats your_file.csv


✨ Features

CSV Dataset Analysis

  • Detects numeric vs categorical columns
  • Counts missing values per column
  • Computes basic statistics (min, max, mean) for numeric columns
  • Handles large CSV files efficiently (streaming)
  • Generates a clean, readable analysis report
  • Saves results to a _report.txt file
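As a rough illustration of how the statistics listed above can be gathered in a single streaming pass, here is a minimal sketch using only the Rust standard library. It is not RustSight's actual code: `ColumnStats`, `summarize`, and the naive comma split (which ignores quoted fields) are assumptions made for brevity, whereas the real tool relies on the csv crate.

```rust
use std::io::{BufRead, BufReader, Read};

/// Running summary for one column, updated one cell at a time (hypothetical type).
#[derive(Debug, Default, Clone, PartialEq)]
struct ColumnStats {
    missing: u64,
    numeric: u64,
    non_numeric: u64,
    min: Option<f64>,
    max: Option<f64>,
    sum: f64,
}

impl ColumnStats {
    /// Classifies a cell as missing, numeric, or non-numeric and updates the totals.
    fn update(&mut self, cell: &str) {
        let cell = cell.trim();
        if cell.is_empty() {
            self.missing += 1;
            return;
        }
        match cell.parse::<f64>() {
            // Only finite numbers count; "NaN" and "inf" are treated as text here.
            Ok(v) if v.is_finite() => {
                self.numeric += 1;
                self.sum += v;
                self.min = Some(self.min.map_or(v, |m| m.min(v)));
                self.max = Some(self.max.map_or(v, |m| m.max(v)));
            }
            _ => self.non_numeric += 1,
        }
    }

    /// Mean of the numeric cells, or None if the column has none.
    fn mean(&self) -> Option<f64> {
        if self.numeric == 0 {
            None
        } else {
            Some(self.sum / self.numeric as f64)
        }
    }
}

/// Reads CSV text line by line, so memory use does not grow with the row count.
/// Assumption: fields are split on plain commas, with no quoting support.
fn summarize<R: Read>(input: R) -> std::io::Result<Vec<(String, ColumnStats)>> {
    let mut lines = BufReader::new(input).lines();
    let header = match lines.next() {
        Some(line) => line?,
        None => return Ok(Vec::new()),
    };
    let mut columns: Vec<(String, ColumnStats)> = header
        .split(',')
        .map(|name| (name.trim().to_string(), ColumnStats::default()))
        .collect();
    for line in lines {
        let line = line?;
        let mut cells = line.split(',');
        for (_, stats) in columns.iter_mut() {
            // A short row counts as missing values for the trailing columns.
            stats.update(cells.next().unwrap_or(""));
        }
    }
    Ok(columns)
}

fn main() -> std::io::Result<()> {
    let data = "age,city\n30,Paris\n,Oslo\n50,\n";
    for (name, s) in summarize(data.as_bytes())? {
        println!(
            "{name}: missing={} numeric={} min={:?} max={:?} mean={:?}",
            s.missing, s.numeric, s.min, s.max, s.mean()
        );
    }
    Ok(())
}
```

Because only running totals are kept per column, the same loop works on files far larger than available memory; a column with both `numeric` and `non_numeric` counts above zero would be a candidate for a mixed-type flag.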

Data Validation

  • Detects columns with high missing value ratios
  • Flags no-variance columns (min == max)
  • Detects potential outliers
  • Identifies mixed-type columns
  • Prints clear validation warnings before ML usage
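The checks above can be sketched roughly as follows. This is an illustrative sketch, not RustSight's implementation: the function name `validate_column`, the 50% missing-value threshold, and the z-score rule for outliers (with a limit of 3) are all assumptions, since the README does not state which thresholds or outlier method the tool uses. Only the "may contain outliers" wording is taken from the example output.

```rust
/// Hypothetical thresholds; RustSight's real cut-offs are not documented here.
const MAX_MISSING_RATIO: f64 = 0.5;
const Z_SCORE_LIMIT: f64 = 3.0;

/// Returns validation warnings for one column given its raw cell values.
fn validate_column(name: &str, cells: &[&str]) -> Vec<String> {
    let mut warnings = Vec::new();
    let mut missing = 0usize;
    let mut text = 0usize;
    let mut values: Vec<f64> = Vec::new();

    // Classify every cell as missing, numeric, or text.
    for cell in cells.iter().map(|c| c.trim()) {
        if cell.is_empty() {
            missing += 1;
        } else {
            match cell.parse::<f64>() {
                Ok(v) if v.is_finite() => values.push(v),
                _ => text += 1,
            }
        }
    }

    // High missing value ratio.
    if !cells.is_empty() && missing as f64 / cells.len() as f64 > MAX_MISSING_RATIO {
        warnings.push(format!("Column '{name}' has a high missing value ratio"));
    }
    // Mixed types: both numeric and non-numeric cells are present.
    if !values.is_empty() && text > 0 {
        warnings.push(format!("Column '{name}' has mixed types"));
    }
    if values.len() > 1 {
        let min = values.iter().copied().fold(f64::INFINITY, f64::min);
        let max = values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        if min == max {
            // No variance: every numeric value is identical.
            warnings.push(format!("Column '{name}' has no variance"));
        } else {
            // Assumed outlier rule: any value more than Z_SCORE_LIMIT
            // population standard deviations away from the mean.
            let n = values.len() as f64;
            let mean = values.iter().sum::<f64>() / n;
            let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
            let std_dev = variance.sqrt();
            if values.iter().any(|v| ((v - mean) / std_dev).abs() > Z_SCORE_LIMIT) {
                warnings.push(format!("Column '{name}' may contain outliers"));
            }
        }
    }
    warnings
}

fn main() {
    let mut followers = vec!["10"; 19];
    followers.push("1000");
    for warning in validate_column("followers_count", &followers) {
        println!("⚠ {warning}");
    }
}
```

Unlike a streaming summary, this sketch collects the numeric values into a vector for simplicity; a single-pass variant could track the mean and variance incrementally instead.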

File Analysis

  • Counts total bytes
  • Detects UTF-8 validity
  • Counts lines and words (if text)
  • Counts non-ASCII bytes (for binaries)
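For illustration, the file properties above map onto a few standard-library calls. The sketch below is an assumption about how such a report could be built, not RustSight's own code; `FileReport` and `inspect` are hypothetical names, and the whole file is read into memory for simplicity.

```rust
/// Properties reported for an arbitrary file (hypothetical struct).
#[derive(Debug, PartialEq)]
struct FileReport {
    total_bytes: usize,
    valid_utf8: bool,
    /// Line and word counts are only meaningful for valid UTF-8 text.
    lines: Option<usize>,
    words: Option<usize>,
    non_ascii_bytes: usize,
}

/// Builds a report from the raw bytes of a file.
fn inspect(bytes: &[u8]) -> FileReport {
    // from_utf8 validates the whole buffer without copying it.
    let text = std::str::from_utf8(bytes).ok();
    FileReport {
        total_bytes: bytes.len(),
        valid_utf8: text.is_some(),
        lines: text.map(|t| t.lines().count()),
        words: text.map(|t| t.split_whitespace().count()),
        non_ascii_bytes: bytes.iter().filter(|b| !b.is_ascii()).count(),
    }
}

fn main() -> std::io::Result<()> {
    // Inspect the file given as the first argument, or a small sample otherwise.
    let bytes = match std::env::args().nth(1) {
        Some(path) => std::fs::read(path)?,
        None => "héllo world\nbye\n".as_bytes().to_vec(),
    };
    println!("{:#?}", inspect(&bytes));
    Ok(())
}
```

Returning `Option` for the line and word counts keeps the text-only metrics absent for binary input rather than reporting misleading zeros.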

🚀 Usage

Analyze a CSV dataset

rustsight stats your_dataset.csv

Validate a dataset

rustsight validate your_dataset.csv

Example output:

File: insta_data.csv
⚠ Column 'followers_count' may contain outliers
⚠ Column 'user_engagement_score' may contain outliers

Inspect any file (text or binary)

rustsight inspect your_file.txt

Help & Version

rustsight help

rustsight version


⚡ Benchmark

Tested on chicago crimes.csv — 8,500,901 rows, 22 columns.

Tool                    Time      vs Pandas
🐻‍❄️ Polars (Python)      1.42s     22.2× faster
🦆 DuckDB CLI           4.33s     7.3× faster
🦀 RustSight            5.57s     5.7× faster
🐼 Pandas (Python)      31.53s    baseline
❌ csvkit               DNF       unusably slow

Benchmarked on Windows, release build (cargo build --release), 20 threads.
RustSight is 5.7× faster than Pandas and finishes about 1.2 seconds behind DuckDB, a production C++ query engine.
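The "vs Pandas" figures are consistent with dividing the Pandas time by each tool's time. A small sketch of that arithmetic, using the timings reported above (the helper name `speedup` is purely illustrative):

```rust
/// Speedup relative to a baseline: how many times faster `time` is than `baseline`.
fn speedup(baseline: f64, time: f64) -> f64 {
    baseline / time
}

fn main() {
    // Timings in seconds, copied from the benchmark results.
    let pandas = 31.53;
    let tools = [("Polars", 1.42), ("DuckDB CLI", 4.33), ("RustSight", 5.57)];

    for (name, time) in tools {
        println!("{name}: {:.1}x faster than Pandas", speedup(pandas, time));
    }

    // Absolute gap between RustSight and DuckDB, in seconds.
    println!("RustSight vs DuckDB gap: {:.2}s", 5.57 - 4.33);
}
```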

Dataset source: Chicago Crime Dataset 2024–2026 via Kaggle.


📂 Example Datasets

Used during development (not required):

  • stockdata.csv — financial dataset
  • CVD Dataset.csv — cardiovascular health dataset

⚠ Large datasets are not bundled. You can analyze any CSV file.


🪟 Windows .exe

  1. Go to the Releases section on GitHub
  2. Download rustsight.exe
  3. Run from terminal:
rustsight stats your_file.csv

rustsight validate your_file.csv

No Rust installation required.


🛠️ Tech Stack

  • Rust — performance, memory safety
  • csv crate — efficient CSV parsing
  • CLI first design — easy automation & scripting

📝 License

MIT License


🤝 Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.

Portfolio: https://omarnahdi.dev
RustSight: https://omarnahdi.dev/work/dataset-analyzer
crates.io: https://crates.io/crates/rustsight
Learn more: https://omarnahdi.dev/writing/rustsight-cli-csv-analyzer