<h1 align="center">quasivision</h1>
<p align="center">
<a href="https://github.com/WeiChens/quasivision/stargazers"><img src="https://img.shields.io/github/stars/WeiChens/quasivision?style=for-the-badge&logo=github" alt="Stars"></a>
<a href="https://github.com/WeiChens/quasivision/network/members"><img src="https://img.shields.io/github/forks/WeiChens/quasivision?style=for-the-badge&logo=github" alt="Forks"></a>
<a href="https://github.com/WeiChens/quasivision/issues"><img src="https://img.shields.io/github/issues/WeiChens/quasivision?style=for-the-badge&logo=github" alt="Issues"></a>
<a href="https://github.com/WeiChens/quasivision/blob/main/LICENSE"><img src="https://img.shields.io/github/license/WeiChens/quasivision?style=for-the-badge" alt="License"></a>
</p>
<p align="center">
<a href="README.md">π¬π§ English</a> Β· <a href="README-zh.md">π¨π³ δΈζ</a>
</p>
A Rust-based pseudo-visual understanding tool.
Analyzes screenshots, UI mockups, and real-world photos β detects UI components (buttons, text fields, icons, images, etc.), recognizes text via OCR, identifies 860 classes of everyday objects (people, cars, phones, food, etc.) with YOLOE-26n, classifies 81 types of icon meanings, and outputs structured descriptions with visual annotations.
---
## π Table of Contents
1. [Quick Start](#quick-start)
2. [Demo Gallery](#demo-gallery)
3. [Output Overview](#output-overview)
4. [CLI Reference](#cli-reference)
5. [Output File Structure](#output-file-structure)
6. [Pipeline](#pipeline)
7. [Core Features](#core-features)
8. [FAQ](#faq)
9. [Practical Examples](#practical-examples)
10. [Proxy Configuration](#proxy-configuration)
11. [License](#license)
---
## π Quick Start <a id="quick-start"></a>
### Basic Usage
```bash
# Single image
cargo run -- --input image.png
# Try with the built-in demo
cargo run -- --input demo/ui.jpg
# Custom output directory
cargo run -- --input image.png --output ./result
# Batch process all images in a directory
cargo run -- --input ./screenshots/
# Recursive processing (include subdirectories)
cargo run -- --input ./screenshots/ --recursive
```
### Minimal Example
```bash
cargo run -- --input demo/ui.jpg
```
Results are written to `./output/ui/`.
---
## πΌοΈ Demo Gallery <a id="demo-gallery"></a>
### 1. UI Detection β Web Search Page
|  |  |
Detected UI components including text, icons, buttons, and structured blocks from a search result page β with full OCR text extraction.
### 2. Object Detection β Real-World Photo
|  |  |
Detected 6 objects with hierarchical relationships (person β cap/hat/glasses/glove/jacket), visualized with bounding boxes and labels.
**Detection result:**
```
Objects (474Γ714) β 6 found:
ββ [ 0,278 433Γ436] person (87%)
ββ [111,277 118Γ 93] cap (39%)
β ββ [111,277 118Γ 93] hat (82%)
β ββ [112,345 88Γ 38] glasses (65%)
ββ [ 1,649 46Γ 65] glove (21%)
ββ [ 55,342 373Γ372] jacket (20%)
```
### 3. Mixed Scenario β Stock Photo Gallery
|  |  |  |
A stock photo gallery page: UI detection extracts the layout structure (image grid, navigation bar, text labels), while object detection identifies photo subjects (people, faces, etc.).
---
## π€ Output Overview <a id="output-overview"></a>
### Output Format (fixed: `tree`)
Output is always in **`tree` format** (no `--format` flag needed):
```
tree Nested tree structure, JSON + plain text, AI-readable DOM
```
It generates both `elements.tree.json` (JSON tree) and `elements.tree.txt` (plain text tree).
### Output Files
| `elements.tree.json` | UI Detection | All detected UI components (buttons/text/icons/etc) |
| `elements.tree.txt` | UI Detection | Plain text summary |
| `visualization.jpg` | UI Detection | Annotated image with color-coded component borders |
| `objects.tree.json` | Object Detect | YOLOE-detected objects (860 classes) with hierarchy |
| `objects.tree.txt` | Object Detect | Object detection plain text summary |
| `objects.jpg` | Object Detect | Object detection visualization with labels |
---
## βοΈ CLI Reference <a id="cli-reference"></a>
### Basic Options
| `-i, --input` | String | **Required** | Input image path or directory |
| `-o, --output` | String | `output` | Output root directory |
| `--recursive` | bool | `false` | Recursively process subdirectories |
| `--extensions` | String | `png,jpg,jpeg,jfif` | Comma-separated image file extensions |
### UI Detection Options
| `--gradient` | u8 | `4` | Gradient threshold (dribbble/rico: 4, web: 1) |
| `--min-area` | u32 | `55` | Minimum connected component area |
| `--paragraph` | bool | `false` | Enable paragraph merging |
| `--remove-bar` | bool | `true` | Remove top/bottom navigation bars |
| `--sub-component` | bool | `true` | Detect sub-components (buttons inside images) |
| `--synthesize-text` | bool | `true` | Auto-synthesize container blocks for orphan text |
### Line / Rectangle Options
| `--line-thickness` | u32 | `8` | Maximum line thickness (pixels) |
| `--line-min-length` | f64 | `0.95` | Minimum line length ratio |
| `--rec-evenness` | f64 | `0.7` | Minimum rectangle evenness |
| `--rec-dent` | f64 | `0.25` | Maximum rectangle dent ratio |
| `--rec-corner-skip` | f64 | `0.08` | Corner tolerance (0=strict right angle, 0.08~0.12=rounded) |
### Block Detection Options
| `--block-side` | f64 | `0.15` | Block side length ratio threshold |
| `--block-grad` | u8 | `5` | Block nesting detection gradient threshold |
### Text Options
| `--text-max-h` | f64 | `0.08` | Max text height ratio (relative to image height) |
| `--text-gap` | u32 | `10` | Max word gap (pixels) |
| `--ocr` | bool | `true` | Enable OCR text recognition |
### Icon / Object Detection Options
| `--icon-classify` | bool | `true` | Enable icon meaning classification |
| `--object-detect` | bool | `true` | Enable object detection |
| `--detect-model` | String | `resources/object-detection/yoloe-26n-seg-dynamic.onnx` | YOLOE model path |
| `--detect-labels` | String | `resources/object-detection/yoloe-26n_classes.txt` | YOLOE labels file path |
| `--detect-conf` | f32 | `0.2` | Detection confidence threshold (0~1) |
| `--models-dir` | String | `resources` | Model resource root directory |
### Disabling Features
```bash
# Disable OCR (structure-only detection)
cargo run -- --input image.png --ocr false
# Disable object detection
cargo run -- --input image.png --object-detect false
# Disable icon classification
cargo run -- --input image.png --icon-classify false
# UI detection only (all optional features off)
cargo run -- --input image.png --ocr false --object-detect false --icon-classify false
```
---
## π Output File Structure <a id="output-file-structure"></a>
### Single Image Output
```
output/
βββ image_name/ # Named after the input file (without extension)
βββ elements.tree.json # UI element tree (JSON)
βββ elements.tree.txt # UI element tree (text)
βββ visualization.jpg # UI detection visualization
βββ objects.tree.json # Object detection tree (JSON)
βββ objects.tree.txt # Object detection tree (text)
βββ objects.jpg # Object detection visualization
```
> Note: `objects.*` files are only generated when `--object-detect true` and objects are found.
---
## π Pipeline <a id="pipeline"></a>
```
Input Image
β
ββ 1. Preprocessing ββββββ Grayscale, line removal, background removal
β
ββ 2. Connected Component β Gradient β CCL (Connected Component Labeling)
β
ββ 3. Rect/Line Detection β Buttons, input fields, etc.
β
ββ 4. Merge & Filter βββββ Merge overlapping regions, remove noise
β
ββ 5. Classification βββββ Block / Button / Text / Icon / Image
β β
β ββ Icon Classifier ββ 81 common icon categories (ONNX Runtime)
β β
β ββ OCR (background) β Text recognition (PaddleOCR)
β
ββ 6. Merge ββββββββββββββ Merge OCR text into UI elements
β
ββ 7. Color Detection ββββ Extract background/foreground colors
β
ββ 8. Output βββββββββββββ 5 formats + visualization annotation
```
### Parallel Execution
Object detection (YOLOE-26n) and OCR run on **background threads** in parallel with the main pipeline, adding no extra wait time.
---
## π§© Core Features <a id="core-features"></a>
### 1. UI Element Detection (Main Feature)
Detects 7 types of UI elements:
| **Block** | Container blocks (cards, list items, nav bars) |
| **Button** | Clickable buttons |
| **Text** | Text labels |
| **Icon** | Icons (small square elements) |
| **Image** | Images |
| **Input** | Input fields |
| **List Item** | List items (with checkmark indicators) |
### 2. OCR Text Recognition
- Based on PaddleOCR (PP-OCRv5) models
- Windows: DirectML GPU acceleration supported
- Auto-detects text in images
- Long text protection: meaningful text (>5 chars) bypasses height filters
### 3. Object Detection (YOLOE-26n)
- ONNX Runtime-based YOLOE-26n model (dynamic input, end-to-end NMS)
- 860 common object classes (people, cars, phones, food, animals, etc.)
- Auto-builds parent-child containment trees
- Outputs annotated `objects.jpg` visualization
- **77% smaller** than the previous YOLO-World model (11.1 MB vs 49.5 MB)
### 4. Icon Meaning Classification
- 81-class icon classification via ONNX model
- Common UI icon meanings (settings, search, share, back, etc.)
- Confidence > 40% displays candidate meanings
### 5. Color Detection
- Auto-extracts background/foreground colors per element
- Outputs hex color values
---
## β FAQ <a id="faq"></a>
### Q: Where do model files come from?
Model files are located in the `resources/` directory:
```
resources/
βββ ocr-models/
β βββ ppocrv5_mobile_det.onnx # OCR detection model
β βββ ppocrv5_mobile_rec.onnx # OCR recognition model
β βββ ppocrv5_dict.txt # Chinese dictionary
βββ icon-classifier/
β βββ icon_classifier.onnx # Icon classification model
β βββ labels.json # 81 class labels
βββ object-detection/
βββ yoloe-26n-seg-dynamic.onnx # YOLOE-26n object detection model (11 MB)
βββ yoloe-26n_classes.txt # 860 class labels
```
**Auto-download**: Missing model files are automatically downloaded from the Hugging Face repo ([WeiChens/quasivision-models](https://huggingface.co/WeiChens/quasivision-models)) on first run.
**Mirror for China users**:
```bash
set QUASIVISION_MODELS_URL=https://hf-mirror.com/WeiChens/quasivision-models/resolve/main
cargo run -- --input image.png
```
### Q: What coordinate system does the output use?
Output uses raw pixel coordinates (tree format):
```json
{
"column_min": 100,
"row_min": 200,
"column_max": 300,
"row_max": 400
}
```
All coordinates are in original pixel values (0β1000 normalization is not used).
### Q: How can I run object detection only (without UI detection)?
The current design runs the full pipeline. You can disable ancillary features with `--ocr false --icon-classify false`.
### Q: How to improve detection quality?
- **Gradient threshold**: Web pages: `--gradient 1`, App screenshots: `--gradient 4`
- **Rounded corners**: Use `--rec-corner-skip 0.12` for large rounded elements
- **Small text**: Increase `--text-max-h 0.12` to raise text height limit
### Q: What confidence threshold should I use?
| Only high-confidence objects | 0.5 |
| Balanced precision & recall | 0.2 (default) |
| Maximum recall (tolerate noise) | 0.1 |
### Q: Supported image formats?
Default: `png`, `jpg`, `jpeg`, `jfif`. Customize with `--extensions`.
---
## π‘ Practical Examples <a id="practical-examples"></a>
```bash
# App screenshot (recommended parameters)
cargo run -- -i app.png --gradient 4
# Web page detection
cargo run -- -i webpage.png --gradient 1 --rec-corner-skip 0.1
# Batch processing with recursion
cargo run -- -i ./screenshots/ --recursive
# AI-friendly output (disable non-essential features)
cargo run -- -i ui.png --icon-classify false
# High-recall detection
cargo run -- -i photo.jpg --detect-conf 0.1
# Paragraph-aware text detection
cargo run -- -i document.png --paragraph true --text-max-h 0.15
```
---
## π Proxy Configuration <a id="proxy-configuration"></a>
On Windows, quasivision automatically detects system proxy settings (compatible with Clash, V2Ray, etc.). If your proxy requires manual configuration:
```bash
# Windows (cmd)
set HTTP_PROXY=http://127.0.0.1:7890
set HTTPS_PROXY=http://127.0.0.1:7890
cargo run -- --input image.png
# macOS / Linux
HTTP_PROXY=http://127.0.0.1:7890 HTTPS_PROXY=http://127.0.0.1:7890 cargo run -- --input image.png
```
---
## π License <a id="license"></a>
- **Source code**: MIT Β© quasivision
- **PP-OCRv5**: Apache 2.0 Β© PaddlePaddle
- **YOLOE-26n-seg**: AGPL-3.0 Β© Ultralytics
- **Icon Classifier**: MIT
---
## π Also Available In
- [δΈζζζ‘£ (Chinese)](README-zh.md)