rustsight 1.2.4

A fast, safe CLI tool for dataset analysis and validation. Analyzes CSV files for column types, missing values, basic statistics (min/max/mean), outliers, no-variance columns, and mixed-type columns — helping you catch data quality issues before ML/AI training. Also supports binary and text file introspection.
# RustSight


RustSight is a fast, safe, and extensible **dataset analysis CLI tool written in Rust**.  
This project focuses on **data validation and exploratory analysis** — the exact step that comes *before* AI/ML model training.
It works on **any CSV file** and can also analyze **binary or text files** to extract useful properties.

[![Crates.io](https://img.shields.io/crates/v/rustsight)](https://crates.io/crates/rustsight)
[![Downloads](https://img.shields.io/crates/d/rustsight)](https://crates.io/crates/rustsight)

---

## 📦 Installation


### From crates.io (Recommended)


```bash
cargo install rustsight
```

👉 [crates.io/crates/rustsight](https://crates.io/crates/rustsight)

### From Source


```bash
git clone https://github.com/omarnahdi/Dataset-Analyzer.git
cd dataset-analyzer
cargo build --release
./target/release/rustsight stats your_file.csv
```

---

## ✨ Features


### CSV Dataset Analysis

- Detects **numeric vs categorical columns**
- Counts **missing values per column**
- Computes **basic statistics** (min, max, mean) for numeric columns
- Handles large CSV files efficiently (streaming)
- Generates a **clean, readable analysis report**
- Saves results to a `_report.txt` file

### Data Validation

- Detects columns with **high missing value ratios**
- Flags **no-variance columns** (min == max)
- Detects **potential outliers**
- Identifies **mixed-type columns**
- Prints clear **validation warnings** before ML usage

### File Analysis

- Counts total bytes
- Detects UTF-8 validity
- Counts lines and words (if text)
- Counts non-ASCII bytes (for binaries)

---

## 🚀 Usage


### Analyze a CSV dataset


```bash
rustsight stats your_dataset.csv
```

### Validate a dataset


```bash
rustsight validate your_dataset.csv
```

Example output:

```
File: insta_data.csv
⚠ Column 'followers_count' may contain outliers
⚠ Column 'user_engagement_score' may contain outliers
```

### Inspect any file (text or binary)


```bash
rustsight inspect your_file.txt
```

### Help & Version


```bash
rustsight help
rustsight version
```

---

## ⚡ Benchmark


Tested on **chicago crimes.csv** — 8,500,901 rows, 22 columns.

| Tool | Time | vs Pandas |
|------|------|-----------|
| 🐻‍❄️ Polars (Python) | 1.42s | 22.2× faster |
| 🦆 DuckDB CLI | 4.33s | 7.3× faster |
| 🦀 **RustSight** | **5.57s** | **5.7× faster** |
| 🐼 Pandas (Python) | 31.53s | baseline |
| ❌ csvkit | DNF | unusably slow |

> Benchmarked on Windows, release build (`cargo build --release`), 20 threads.  
> RustSight outperforms Pandas by **5.7×** and runs within 1.3 seconds of DuckDB — a production C++ query engine.

Dataset source: [Chicago Crime Dataset 2024–2026](https://www.kaggle.com/datasets/aliafzal9323/chicago-crime-dataset-2024-2026) via Kaggle.

---

## 📂 Example Datasets


Used during development (not required):

- `stockdata.csv` — financial dataset
- `CVD Dataset.csv` — cardiovascular health dataset

> ⚠ Large datasets are **not bundled**. You can analyze **any CSV file**.

---

## 🪟 Windows `.exe`


1. Go to the **Releases** section on GitHub
2. Download `rustsight.exe`
3. Run from terminal:

```bash
rustsight stats your_file.csv
rustsight validate your_file.csv
```

No Rust installation required.

---

## 🛠️ Tech Stack


- **Rust** — performance, memory safety
- **csv crate** — efficient CSV parsing
- **CLI first design** — easy automation & scripting

---

## 📝 License


MIT License

---

## 🤝 Contributing


Contributions are welcome! Feel free to open issues or submit pull requests.

Portfolio: https://omarnahdi.dev  
RustSight: https://omarnahdi.dev/work/dataset-analyzer  
crates.io: https://crates.io/crates/rustsight  
Learn more: https://omarnahdi.dev/writing/rustsight-cli-csv-analyzer