A Rust-based pseudo-visual understanding tool.
Analyzes screenshots, UI mockups, and real-world photos: detects UI components (buttons, text fields, icons, images, etc.), recognizes text via OCR, identifies 860 classes of everyday objects (people, cars, phones, food, etc.) with YOLOE-26n, classifies icons into 81 meaning categories, and outputs structured descriptions with visual annotations.
Table of Contents
- Quick Start
- Demo Gallery
- Output Overview
- CLI Reference
- Output File Structure
- Pipeline
- Core Features
- FAQ
- Practical Examples
- Proxy Configuration
- License
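Quick Start
Basic Usage
- Single image: pass the image path with -i (--input).
- Built-in demo: use the built-in demo as the input.
- Custom output directory: add -o (--output) with the target directory (default: output).
- Batch processing: pass a directory to -i to process all images in it.
- Recursive processing: add --recursive to include subdirectories.
Minimal Example
For an input image named ui (for example ui.png) and the default output directory, results are written to ./output/ui/.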
Demo Gallery
1. UI Detection: Web Search Page
| Input | Output |
|---|---|
| ![]() | ![]() |
Detected UI components including text, icons, buttons, and structured blocks from a search result page, with full OCR text extraction.
2. Object Detection: Real-World Photo
| Input | Output |
|---|---|
| ![]() | ![]() |
Detected 6 objects with hierarchical relationships (person → cap/hat/glasses/glove/jacket), visualized with bounding boxes and labels.
Detection result:
Objects (474×714) - 6 found:
└─ [  0,278 433×436] person (87%)
   ├─ [111,277 118× 93] cap (39%)
   │  ├─ [111,277 118× 93] hat (82%)
   │  └─ [112,345  88× 38] glasses (65%)
   ├─ [  1,649  46× 65] glove (21%)
   └─ [ 55,342 373×372] jacket (20%)
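Each entry in the text tree above follows the pattern `[x,y w×h] label (confidence%)`. The following is a minimal sketch of how a consumer might parse one such line; the `Detection` struct and `parse_line` function are hypothetical names introduced for illustration, and the layout is inferred only from the sample output, not from a format specification.

```rust
/// One parsed entry of a text-tree line (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct Detection {
    x: u32,
    y: u32,
    w: u32,
    h: u32,
    label: String,
    confidence_pct: u32,
}

/// Parses a line such as `├─ [111,277 118× 93] cap (39%)`.
/// Returns `None` if the line does not match the assumed layout.
fn parse_line(line: &str) -> Option<Detection> {
    let open = line.find('[')?;
    let close = line.find(']')?;
    if close < open {
        return None;
    }
    // Numbers inside the brackets, in order: x, y, width, height.
    // Any non-digit character (comma, space, '×') acts as a separator.
    let nums: Vec<u32> = line[open + 1..close]
        .split(|c: char| !c.is_ascii_digit())
        .filter(|s| !s.is_empty())
        .filter_map(|s| s.parse::<u32>().ok())
        .collect();
    if nums.len() != 4 {
        return None;
    }
    // After the bracket: "label (NN%)".
    let rest = line[close + 1..].trim();
    let paren = rest.rfind('(')?;
    let label = rest[..paren].trim().to_string();
    let confidence_pct = rest[paren + 1..]
        .trim_end_matches(')')
        .trim_end_matches('%')
        .parse::<u32>()
        .ok()?;
    Some(Detection {
        x: nums[0],
        y: nums[1],
        w: nums[2],
        h: nums[3],
        label,
        confidence_pct,
    })
}
```

Header lines such as the `Objects (...)` summary contain no bracketed box, so `parse_line` returns `None` for them and a caller can simply skip those lines.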
3. Mixed Scenario: Stock Photo Gallery
| Input | Output (UI) | Output (Objects) |
|---|---|---|
| ![]() | ![]() | ![]() |
A stock photo gallery page: UI detection extracts the layout structure (image grid, navigation bar, text labels), while object detection identifies photo subjects (people, faces, etc.).
Output Overview
Output Format (fixed: tree)
Output is always in tree format (no --format flag needed): a nested tree structure, written as both JSON and plain text, that serves as an AI-readable DOM.
Each run generates elements.tree.json (JSON tree) and elements.tree.txt (plain text tree).
Output Files
| File | Source | Description |
|---|---|---|
| elements.tree.json | UI Detection | All detected UI components (buttons, text, icons, etc.) |
| elements.tree.txt | UI Detection | Plain text summary |
| visualization.jpg | UI Detection | Annotated image with color-coded component borders |
| objects.tree.json | Object Detection | YOLOE-detected objects (860 classes) with hierarchy |
| objects.tree.txt | Object Detection | Object detection plain text summary |
| objects.jpg | Object Detection | Object detection visualization with labels |
CLI Reference
Basic Options
| Argument | Type | Default | Description |
|---|---|---|---|
| -i, --input | String | Required | Input image path or directory |
| -o, --output | String | output | Output root directory |
| --recursive | bool | false | Recursively process subdirectories |
| --extensions | String | png,jpg,jpeg,jfif | Comma-separated image file extensions |
UI Detection Options
| Argument | Type | Default | Description |
|---|---|---|---|
| --gradient | u8 | 4 | Gradient threshold (dribbble/rico: 4, web: 1) |
| --min-area | u32 | 55 | Minimum connected component area |
| --paragraph | bool | false | Enable paragraph merging |
| --remove-bar | bool | true | Remove top/bottom navigation bars |
| --sub-component | bool | true | Detect sub-components (buttons inside images) |
| --synthesize-text | bool | true | Auto-synthesize container blocks for orphan text |
Line / Rectangle Options
| Argument | Type | Default | Description |
|---|---|---|---|
| --line-thickness | u32 | 8 | Maximum line thickness (pixels) |
| --line-min-length | f64 | 0.95 | Minimum line length ratio |
| --rec-evenness | f64 | 0.7 | Minimum rectangle evenness |
| --rec-dent | f64 | 0.25 | Maximum rectangle dent ratio |
| --rec-corner-skip | f64 | 0.08 | Corner tolerance (0 = strict right angle, 0.08-0.12 = rounded) |
Block Detection Options
| Argument | Type | Default | Description |
|---|---|---|---|
| --block-side | f64 | 0.15 | Block side length ratio threshold |
| --block-grad | u8 | 5 | Block nesting detection gradient threshold |
Text Options
| Argument | Type | Default | Description |
|---|---|---|---|
| --text-max-h | f64 | 0.08 | Max text height ratio (relative to image height) |
| --text-gap | u32 | 10 | Max word gap (pixels) |
| --ocr | bool | true | Enable OCR text recognition |
Icon / Object Detection Options
| Argument | Type | Default | Description |
|---|---|---|---|
| --icon-classify | bool | true | Enable icon meaning classification |
| --object-detect | bool | true | Enable object detection |
| --detect-model | String | resources/object-detection/yoloe-26n-seg-dynamic.onnx | YOLOE model path |
| --detect-labels | String | resources/object-detection/yoloe-26n_classes.txt | YOLOE labels file path |
| --detect-conf | f32 | 0.2 | Detection confidence threshold (0-1) |
| --models-dir | String | resources | Model resource root directory |
Disabling Features
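- Disable OCR (structure-only detection): --ocr false
- Disable object detection: --object-detect false
- Disable icon classification: --icon-classify false
- UI detection only (all optional features off): --ocr false --object-detect false --icon-classify false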
Output File Structure
Single Image Output
output/
└── image_name/                # Named after the input file (without extension)
    ├── elements.tree.json     # UI element tree (JSON)
    ├── elements.tree.txt      # UI element tree (text)
    ├── visualization.jpg      # UI detection visualization
    ├── objects.tree.json      # Object detection tree (JSON)
    ├── objects.tree.txt       # Object detection tree (text)
    └── objects.jpg            # Object detection visualization
Note: objects.* files are only generated when --object-detect is true and objects are found.
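The naming rule above (output root plus the input file name without its extension) can be sketched with the standard library alone. The `output_dir` function is a hypothetical helper for scripts that need to locate results; it illustrates the documented layout and is not taken from the tool's source.

```rust
use std::path::{Path, PathBuf};

/// Returns `<output_root>/<input file name without extension>`,
/// or `None` if the input path has no file name component.
fn output_dir(output_root: &str, input: &str) -> Option<PathBuf> {
    // `file_stem` strips only the final extension: "ui.png" -> "ui".
    let stem = Path::new(input).file_stem()?;
    Some(Path::new(output_root).join(stem))
}
```

With the default output root, `output_dir("output", "shots/ui.png")` yields `output/ui`, which matches the location given in the minimal example.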
Pipeline
Input Image
  │
  ├─ 1. Preprocessing ────── Grayscale, line removal, background removal
  │
  ├─ 2. Connected Component ─ Gradient → CCL (Connected Component Labeling)
  │
  ├─ 3. Rect/Line Detection ─ Buttons, input fields, etc.
  │
  ├─ 4. Merge & Filter ───── Merge overlapping regions, remove noise
  │
  ├─ 5. Classification ───── Block / Button / Text / Icon / Image
  │     │
  │     ├─ Icon Classifier ── 81 common icon categories (ONNX Runtime)
  │     │
  │     └─ OCR (background) ─ Text recognition (PaddleOCR)
  │
  ├─ 6. Merge ────────────── Merge OCR text into UI elements
  │
  ├─ 7. Color Detection ──── Extract background/foreground colors
  │
  └─ 8. Output ───────────── Tree output (JSON + plain text) + visualization annotation
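Step 2 of the pipeline labels connected regions and, per the --min-area option, discards small ones. The sketch below shows a generic 4-connected labeling pass over a binary mask using breadth-first flood fill. It assumes the mask is the already thresholded gradient image with rows of equal length; `Component` and `connected_components` are hypothetical names, and the tool's actual implementation may differ (for example in connectivity or in how the mask is produced).

```rust
use std::collections::VecDeque;

/// Bounding box and pixel count of one connected component.
#[derive(Debug, PartialEq)]
struct Component {
    x: usize,
    y: usize,
    w: usize,
    h: usize,
    area: usize,
}

/// Labels 4-connected foreground regions in a binary mask (`true` = foreground)
/// and keeps those with at least `min_area` pixels.
fn connected_components(mask: &[Vec<bool>], min_area: usize) -> Vec<Component> {
    let rows = mask.len();
    let cols = if rows > 0 { mask[0].len() } else { 0 };
    let mut seen = vec![vec![false; cols]; rows];
    let mut out = Vec::new();
    for r in 0..rows {
        for c in 0..cols {
            if !mask[r][c] || seen[r][c] {
                continue;
            }
            // Breadth-first flood fill starting from this unvisited seed pixel.
            let (mut min_r, mut max_r, mut min_c, mut max_c) = (r, r, c, c);
            let mut area = 0;
            let mut queue = VecDeque::from([(r, c)]);
            seen[r][c] = true;
            while let Some((y, x)) = queue.pop_front() {
                area += 1;
                min_r = min_r.min(y);
                max_r = max_r.max(y);
                min_c = min_c.min(x);
                max_c = max_c.max(x);
                let mut neighbours = Vec::with_capacity(4);
                if y > 0 { neighbours.push((y - 1, x)); }
                if y + 1 < rows { neighbours.push((y + 1, x)); }
                if x > 0 { neighbours.push((y, x - 1)); }
                if x + 1 < cols { neighbours.push((y, x + 1)); }
                for (ny, nx) in neighbours {
                    if mask[ny][nx] && !seen[ny][nx] {
                        seen[ny][nx] = true;
                        queue.push_back((ny, nx));
                    }
                }
            }
            if area >= min_area {
                out.push(Component {
                    x: min_c,
                    y: min_r,
                    w: max_c - min_c + 1,
                    h: max_r - min_r + 1,
                    area,
                });
            }
        }
    }
    out
}
```

Raising `min_area` plays the same role as the documented --min-area option (default 55): isolated specks never reach the later classification steps.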
Parallel Execution
Object detection (YOLOE-26n) and OCR run on background threads in parallel with the main pipeline, so their latency overlaps with the main pipeline instead of adding to it.
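The general pattern of running background jobs alongside the main work can be sketched with `std::thread`. This is an illustration of the idea only: `run_parallel` and its closure parameters are hypothetical, and the tool's real threading code is not shown in this document.

```rust
use std::thread;

/// Starts two background jobs, runs the main job on the current thread,
/// then joins the background jobs. Returns (main, ocr, detection) results.
fn run_parallel<M, A, B, RM, RA, RB>(main_job: M, ocr_job: A, detect_job: B) -> (RM, RA, RB)
where
    M: FnOnce() -> RM,
    A: FnOnce() -> RA + Send + 'static,
    B: FnOnce() -> RB + Send + 'static,
    RA: Send + 'static,
    RB: Send + 'static,
{
    // Both background jobs start before the main job does any work.
    let ocr = thread::spawn(ocr_job);
    let det = thread::spawn(detect_job);
    let main_result = main_job();
    // Joining blocks only for whatever time the background jobs still need.
    (
        main_result,
        ocr.join().expect("OCR thread panicked"),
        det.join().expect("detection thread panicked"),
    )
}
```

Total wall-clock time is roughly that of the slowest of the three jobs rather than their sum, provided enough cores are available.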
Core Features
1. UI Element Detection (Main Feature)
Detects 7 types of UI elements:
| Category | Description |
|---|---|
| Block | Container blocks (cards, list items, nav bars) |
| Button | Clickable buttons |
| Text | Text labels |
| Icon | Icons (small square elements) |
| Image | Images |
| Input | Input fields |
| List Item | List items (with checkmark indicators) |
2. OCR Text Recognition
- Based on PaddleOCR (PP-OCRv5) models
- Windows: DirectML GPU acceleration supported
- Auto-detects text in images
- Long text protection: meaningful text (>5 chars) bypasses height filters
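The interaction between the text height limit (--text-max-h, default 0.08 of the image height) and long text protection can be sketched as a single predicate. The rule below is an assumption drawn from the descriptions above: "meaningful" is approximated as more than 5 characters after trimming, and `keep_text` is a hypothetical name.

```rust
/// Decides whether a text candidate survives the height filter.
/// Assumed rule: drop boxes taller than `max_h_ratio * image_height`,
/// unless the recognized text has more than 5 characters.
fn keep_text(box_height: u32, image_height: u32, text: &str, max_h_ratio: f64) -> bool {
    let too_tall = f64::from(box_height) > max_h_ratio * f64::from(image_height);
    let long_text = text.trim().chars().count() > 5;
    !too_tall || long_text
}
```

For a 714-pixel-high image the default limit is about 57 pixels, so an 80-pixel box reading "OK" would be dropped while an 80-pixel box reading "Settings" would be kept.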
3. Object Detection (YOLOE-26n)
- ONNX Runtime-based YOLOE-26n model (dynamic input, end-to-end NMS)
- 860 common object classes (people, cars, phones, food, animals, etc.)
- Auto-builds parent-child containment trees
- Outputs annotated objects.jpg visualization
- 77% smaller than the previous YOLO-World model (11.1 MB vs 49.5 MB)
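A parent-child containment tree can be derived from bounding boxes alone. The sketch below uses strict geometric containment and picks the smallest enclosing box as the parent; `BBox` and `parents` are hypothetical names. This is one plausible rule, not the tool's documented algorithm: in the demo output, the cap box starts one pixel above the person box, so the real rule is evidently more tolerant than this one.

```rust
/// Axis-aligned box in pixel coordinates: top-left corner plus size.
#[derive(Clone, Copy)]
struct BBox {
    x: u32,
    y: u32,
    w: u32,
    h: u32,
}

impl BBox {
    fn area(&self) -> u64 {
        u64::from(self.w) * u64::from(self.h)
    }

    /// True if `other` lies entirely inside `self` (edges may touch).
    fn contains(&self, other: &BBox) -> bool {
        other.x >= self.x
            && other.y >= self.y
            && other.x + other.w <= self.x + self.w
            && other.y + other.h <= self.y + self.h
    }
}

/// For each box, returns the index of its parent: the smallest box that
/// fully contains it, or `None` for roots. Boxes of equal area are ordered
/// by index so that two identical boxes cannot become each other's parent.
fn parents(boxes: &[BBox]) -> Vec<Option<usize>> {
    (0..boxes.len())
        .map(|i| {
            (0..boxes.len())
                .filter(|&j| j != i && boxes[j].contains(&boxes[i]))
                .filter(|&j| {
                    let (a, b) = (boxes[j].area(), boxes[i].area());
                    a > b || (a == b && j < i)
                })
                .min_by_key(|&j| boxes[j].area())
        })
        .collect()
}
```

Choosing the smallest container gives each box its nearest ancestor, so a nested chain such as photo, card, page resolves into a proper tree rather than attaching everything to the outermost box.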
4. Icon Meaning Classification
- 81-class icon classification via ONNX model
- Common UI icon meanings (settings, search, share, back, etc.)
- Candidate meanings are displayed when classification confidence exceeds 40%
5. Color Detection
- Auto-extracts background/foreground colors per element
- Outputs hex color values
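Consumers of the color fields usually need to convert between hex strings and RGB triples. The helpers below are a minimal sketch that assumes the common lowercase `#rrggbb` form; the exact format the tool emits is not specified here, and `to_hex` and `from_hex` are hypothetical names.

```rust
/// Formats an RGB triple as a lowercase `#rrggbb` string.
fn to_hex(r: u8, g: u8, b: u8) -> String {
    format!("#{:02x}{:02x}{:02x}", r, g, b)
}

/// Parses a `#rrggbb` string (either letter case) into an RGB triple.
/// Returns `None` for any other shape.
fn from_hex(s: &str) -> Option<(u8, u8, u8)> {
    let h = s.strip_prefix('#')?;
    // Checking for hex digits first also guarantees ASCII, so slicing is safe.
    if h.len() != 6 || !h.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    let r = u8::from_str_radix(&h[0..2], 16).ok()?;
    let g = u8::from_str_radix(&h[2..4], 16).ok()?;
    let b = u8::from_str_radix(&h[4..6], 16).ok()?;
    Some((r, g, b))
}
```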
FAQ
Q: Where do model files come from?
Model files are located in the resources/ directory:
resources/
├── ocr-models/
│   ├── ppocrv5_mobile_det.onnx      # OCR detection model
│   ├── ppocrv5_mobile_rec.onnx      # OCR recognition model
│   └── ppocrv5_dict.txt             # Chinese dictionary
├── icon-classifier/
│   ├── icon_classifier.onnx         # Icon classification model
│   └── labels.json                  # 81 class labels
└── object-detection/
    ├── yoloe-26n-seg-dynamic.onnx   # YOLOE-26n object detection model (11 MB)
    └── yoloe-26n_classes.txt        # 860 class labels
Auto-download: Missing model files are automatically downloaded from the Hugging Face repo (WeiChens/quasivision-models) on first run.
Mirror for China users:
Q: What coordinate system does the output use?
Output uses raw pixel coordinates in the tree format.
All coordinates are original pixel values; 0-1000 normalization is not used.
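If a downstream consumer expects boxes normalized to the 0-1000 range, the conversion from pixel values is straightforward given the image size (reported in the tree header, for example 474×714). The sketch below is a hypothetical helper, `normalize_1000`, that returns corner coordinates (x1, y1, x2, y2) rounded to the nearest integer; the corner-based layout is an assumption about what the consumer wants.

```rust
/// Converts a pixel-space box (top-left corner plus size) to 0-1000
/// normalized corners `[x1, y1, x2, y2]`, rounding to the nearest integer.
/// Returns `None` if either image dimension is zero.
fn normalize_1000(x: u32, y: u32, w: u32, h: u32, img_w: u32, img_h: u32) -> Option<[u32; 4]> {
    if img_w == 0 || img_h == 0 {
        return None;
    }
    // Integer rounding: add half the divisor before dividing.
    let scale = |v: u32, full: u32| -> u32 {
        ((u64::from(v) * 1000 + u64::from(full) / 2) / u64::from(full)) as u32
    };
    Some([
        scale(x, img_w),
        scale(y, img_h),
        scale(x + w, img_w),
        scale(y + h, img_h),
    ])
}
```

Applied to the person box from the demo (`[0,278 433×436]` in a 474×714 image), this gives `[0, 389, 914, 1000]`.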
Q: How can I run object detection only (without UI detection)?
The current design runs the full pipeline. You can disable ancillary features with --ocr false --icon-classify false.
Q: How to improve detection quality?
- Gradient threshold: use --gradient 1 for web pages and --gradient 4 for app screenshots
- Rounded corners: use --rec-corner-skip 0.12 for large rounded elements
- Small text: increase --text-max-h to 0.12 to raise the text height limit
Q: What confidence threshold should I use?
| Scenario | --detect-conf Recommended |
|---|---|
| Only high-confidence objects | 0.5 |
| Balanced precision & recall | 0.2 (default) |
| Maximum recall (tolerate noise) | 0.1 |
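The effect of the thresholds in the table can be seen by filtering the six detections from the demo photo. The sketch below assumes an inclusive comparison (confidence at or above the threshold is kept), which is not confirmed by the tool's documentation, and uses the rounded percentages from the demo output; `filter_by_conf` is a hypothetical name.

```rust
/// Keeps the labels of detections whose confidence is at or above `threshold`.
/// Confidence values are fractions in the range 0.0 to 1.0.
fn filter_by_conf<'a>(dets: &[(&'a str, f32)], threshold: f32) -> Vec<&'a str> {
    dets.iter()
        .filter(|(_, conf)| *conf >= threshold)
        .map(|(label, _)| *label)
        .collect()
}
```

With the demo values, a threshold of 0.5 keeps only person, hat, and glasses, while 0.2 and 0.1 both keep all six objects, since the weakest detection (jacket) sits at 20%.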
Q: Supported image formats?
Default: png, jpg, jpeg, jfif. Customize with --extensions.
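Practical Examples
- App screenshot (recommended parameters): for example --gradient 4
- Web page detection: for example --gradient 1
- Batch processing with recursion: pass a directory to -i and add --recursive
- AI-friendly output (disable non-essential features): turn off the optional features listed under Disabling Features as needed
- High-recall detection: --detect-conf 0.1
- Paragraph-aware text detection: --paragraph true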
Proxy Configuration
On Windows, quasivision automatically detects system proxy settings (compatible with Clash, V2Ray, etc.). If your proxy requires manual configuration:
# Windows (cmd)
set HTTP_PROXY=http://127.0.0.1:7890
set HTTPS_PROXY=http://127.0.0.1:7890
# macOS / Linux
export HTTP_PROXY=http://127.0.0.1:7890 HTTPS_PROXY=http://127.0.0.1:7890
License
- Source code: MIT © quasivision
- PP-OCRv5: Apache 2.0 © PaddlePaddle
- YOLOE-26n-seg: AGPL-3.0 © Ultralytics
- Icon Classifier: MIT






