quasivision 0.2.2

A Rust-based pseudo-visual understanding tool. Analyzes screenshots, UI mockups, and real-world photos: detects UI components, recognizes text via OCR, identifies objects with YOLOE-26n.
<h1 align="center">quasivision</h1>

<p align="center">
  <a href="https://github.com/WeiChens/quasivision/stargazers"><img src="https://img.shields.io/github/stars/WeiChens/quasivision?style=for-the-badge&logo=github" alt="Stars"></a>
  <a href="https://github.com/WeiChens/quasivision/network/members"><img src="https://img.shields.io/github/forks/WeiChens/quasivision?style=for-the-badge&logo=github" alt="Forks"></a>
  <a href="https://github.com/WeiChens/quasivision/issues"><img src="https://img.shields.io/github/issues/WeiChens/quasivision?style=for-the-badge&logo=github" alt="Issues"></a>
  <a href="https://github.com/WeiChens/quasivision/blob/main/LICENSE"><img src="https://img.shields.io/github/license/WeiChens/quasivision?style=for-the-badge" alt="License"></a>
</p>
<p align="center">
  <a href="README.md">🇬🇧 English</a> · <a href="README-zh.md">🇨🇳 中文</a>
</p>

A Rust-based pseudo-visual understanding tool.  
Analyzes screenshots, UI mockups, and real-world photos: detects UI components (buttons, text fields, icons, images, etc.), recognizes text via OCR, identifies 860 classes of everyday objects (people, cars, phones, food, etc.) with YOLOE-26n, classifies 81 types of icon meanings, and outputs structured descriptions with visual annotations.

---

## 📋 Table of Contents


1. [Quick Start](#quick-start)
2. [Demo Gallery](#demo-gallery)
3. [Output Overview](#output-overview)
4. [CLI Reference](#cli-reference)
5. [Output File Structure](#output-file-structure)
6. [Pipeline](#pipeline)
7. [Core Features](#core-features)
8. [FAQ](#faq)
9. [Practical Examples](#practical-examples)
10. [Proxy Configuration](#proxy-configuration)
11. [License](#license)

---

## 🚀 Quick Start <a id="quick-start"></a>


### Basic Usage


```bash
# Single image

cargo run -- --input image.png

# Try with the built-in demo

cargo run -- --input demo/ui.jpg

# Custom output directory

cargo run -- --input image.png --output ./result

# Batch process all images in a directory

cargo run -- --input ./screenshots/

# Recursive processing (include subdirectories)

cargo run -- --input ./screenshots/ --recursive
```

### Minimal Example


```bash
cargo run -- --input demo/ui.jpg
```

Results are written to `./output/ui/`.

---

## 🖼️ Demo Gallery <a id="demo-gallery"></a>


### 1. UI Detection: Web Search Page


|          Input           |                   Output                    |
| :----------------------: | :-----------------------------------------: |
| ![ui-input](demo/ui.jpg) | ![ui-viz](demo/output/ui/visualization.jpg) |

Detected UI components including text, icons, buttons, and structured blocks from a search result page, with full OCR text extraction.

### 2. Object Detection: Real-World Photo


|               Input                |                     Output                      |
| :--------------------------------: | :---------------------------------------------: |
| ![reality-input](demo/reality.jpg) | ![reality-viz](demo/output/reality/objects.jpg) |

Detected 6 objects with hierarchical relationships (person → cap/hat/glasses/glove/jacket), visualized with bounding boxes and labels.

**Detection result:**

```
Objects (474×714) - 6 found:
└─ [  0,278 433×436] person (87%)
   ├─ [111,277 118× 93] cap (39%)
   │  └─ [111,277 118× 93] hat (82%)
   │     └─ [112,345  88× 38] glasses (65%)
   ├─ [  1,649  46× 65] glove (21%)
   └─ [ 55,342 373×372] jacket (20%)
```

### 3. Mixed Scenario: Stock Photo Gallery


|                 Input                 |                       Output (UI)                       |                  Output (Objects)                  |
| :-----------------------------------: | :-----------------------------------------------------: | :------------------------------------------------: |
| ![mixed-input](demo/realityAndUi.jpg) | ![mixed-ui](demo/output/realityAndUi/visualization.jpg) | ![mixed-obj](demo/output/realityAndUi/objects.jpg) |

A stock photo gallery page: UI detection extracts the layout structure (image grid, navigation bar, text labels), while object detection identifies photo subjects (people, faces, etc.).

---

## 📤 Output Overview <a id="output-overview"></a>


### Output Format (fixed: `tree`)


Output is always in **`tree` format** (no `--format` flag needed):

```
tree        Nested tree structure, JSON + plain text, AI-readable DOM
```

It generates both `elements.tree.json` (JSON tree) and `elements.tree.txt` (plain text tree).

### Output Files


| File                                   | Source        | Description                                         |
| -------------------------------------- | ------------- | --------------------------------------------------- |
| `elements.tree.json`                   | UI Detection  | All detected UI components (buttons/text/icons/etc) |
| `elements.tree.txt`                    | UI Detection  | Plain text summary                                  |
| `visualization.jpg`                    | UI Detection  | Annotated image with color-coded component borders  |
| `objects.tree.json`                    | Object Detect | YOLOE-detected objects (860 classes) with hierarchy |
| `objects.tree.txt`                     | Object Detect | Object detection plain text summary                 |
| `objects.jpg`                          | Object Detect | Object detection visualization with labels          |

---

## ⚙️ CLI Reference <a id="cli-reference"></a>


### Basic Options


| Argument       | Type   | Default             | Description                                            |
| -------------- | ------ | ------------------- | ------------------------------------------------------ |
| `-i, --input`  | String | **Required**        | Input image path or directory                          |
| `-o, --output` | String | `output`            | Output root directory                                  |
| `--recursive`  | bool   | `false`             | Recursively process subdirectories                     |
| `--extensions` | String | `png,jpg,jpeg,jfif` | Comma-separated image file extensions                  |
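
As an illustration of how a comma-separated `--extensions` value can be applied while scanning a directory, here is a minimal sketch using only the Rust standard library. The function names are hypothetical, and case-insensitive matching is an assumption of the sketch, not a documented behavior of quasivision.

```rust
use std::path::Path;

/// Split a comma-separated extension list such as "png,jpg,jpeg,jfif"
/// into lowercase entries, ignoring blanks and leading dots.
fn parse_extensions(list: &str) -> Vec<String> {
    list.split(',')
        .map(|ext| ext.trim().trim_start_matches('.').to_ascii_lowercase())
        .filter(|ext| !ext.is_empty())
        .collect()
}

/// Return true when `path` ends in one of the allowed extensions.
/// Case-insensitive matching is an assumption of this sketch.
fn has_allowed_extension(path: &Path, allowed: &[String]) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.to_ascii_lowercase())
        .map_or(false, |ext| allowed.iter().any(|a| *a == ext))
}
```

With the default list, `has_allowed_extension(Path::new("shots/home.PNG"), &parse_extensions("png,jpg,jpeg,jfif"))` returns `true` under these assumptions, while a file without an extension is skipped.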

### UI Detection Options


| Argument            | Type | Default | Description                                      |
| ------------------- | ---- | ------- | ------------------------------------------------ |
| `--gradient`        | u8   | `4`     | Gradient threshold (dribbble/rico: 4, web: 1)    |
| `--min-area`        | u32  | `55`    | Minimum connected component area                 |
| `--paragraph`       | bool | `false` | Enable paragraph merging                         |
| `--remove-bar`      | bool | `true`  | Remove top/bottom navigation bars                |
| `--sub-component`   | bool | `true`  | Detect sub-components (buttons inside images)    |
| `--synthesize-text` | bool | `true`  | Auto-synthesize container blocks for orphan text |

### Line / Rectangle Options


| Argument            | Type | Default | Description                                                |
| ------------------- | ---- | ------- | ---------------------------------------------------------- |
| `--line-thickness`  | u32  | `8`     | Maximum line thickness (pixels)                            |
| `--line-min-length` | f64  | `0.95`  | Minimum line length ratio                                  |
| `--rec-evenness`    | f64  | `0.7`   | Minimum rectangle evenness                                 |
| `--rec-dent`        | f64  | `0.25`  | Maximum rectangle dent ratio                               |
| `--rec-corner-skip` | f64  | `0.08`  | Corner tolerance (0=strict right angle, 0.08~0.12=rounded) |

### Block Detection Options


| Argument       | Type | Default | Description                                |
| -------------- | ---- | ------- | ------------------------------------------ |
| `--block-side` | f64  | `0.15`  | Block side length ratio threshold          |
| `--block-grad` | u8   | `5`     | Block nesting detection gradient threshold |

### Text Options


| Argument       | Type | Default | Description                                      |
| -------------- | ---- | ------- | ------------------------------------------------ |
| `--text-max-h` | f64  | `0.08`  | Max text height ratio (relative to image height) |
| `--text-gap`   | u32  | `10`    | Max word gap (pixels)                            |
| `--ocr`        | bool | `true`  | Enable OCR text recognition                      |

### Icon / Object Detection Options


| Argument          | Type   | Default                                                 | Description                          |
| ----------------- | ------ | ------------------------------------------------------- | ------------------------------------ |
| `--icon-classify` | bool   | `true`                                                  | Enable icon meaning classification   |
| `--object-detect` | bool   | `true`                                                  | Enable object detection              |
| `--detect-model`  | String | `resources/object-detection/yoloe-26n-seg-dynamic.onnx` | YOLOE model path                     |
| `--detect-labels` | String | `resources/object-detection/yoloe-26n_classes.txt`      | YOLOE labels file path               |
| `--detect-conf`   | f32    | `0.2`                                                   | Detection confidence threshold (0~1) |
| `--models-dir`    | String | `resources`                                             | Model resource root directory        |

### Disabling Features


```bash
# Disable OCR (structure-only detection)

cargo run -- --input image.png --ocr false

# Disable object detection

cargo run -- --input image.png --object-detect false

# Disable icon classification

cargo run -- --input image.png --icon-classify false

# UI detection only (all optional features off)

cargo run -- --input image.png --ocr false --object-detect false --icon-classify false
```

---

## 📁 Output File Structure <a id="output-file-structure"></a>


### Single Image Output


```
output/
└── image_name/             # Named after the input file (without extension)
    ├── elements.tree.json  # UI element tree (JSON)
    ├── elements.tree.txt   # UI element tree (text)
    ├── visualization.jpg   # UI detection visualization
    ├── objects.tree.json   # Object detection tree (JSON)
    ├── objects.tree.txt    # Object detection tree (text)
    └── objects.jpg         # Object detection visualization

> Note: `objects.*` files are only generated when `--object-detect true` and objects are found.
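
For scripts that consume the results, the layout above can be predicted from the input path. The following is a small sketch of that mapping; `output_dir_for` and `expected_outputs` are hypothetical helper names, not part of quasivision, and the sketch only mirrors the directory tree shown in this section.

```rust
use std::path::{Path, PathBuf};

/// Directory that holds the results for one input image:
/// the output root joined with the input file name without its extension.
fn output_dir_for(output_root: &Path, input: &Path) -> Option<PathBuf> {
    let stem = input.file_stem()?;
    Some(output_root.join(stem))
}

/// Paths of the files listed in the tree above. The `objects.*` files are
/// only included when the caller expects object detection results.
fn expected_outputs(dir: &Path, include_objects: bool) -> Vec<PathBuf> {
    let mut names = vec!["elements.tree.json", "elements.tree.txt", "visualization.jpg"];
    if include_objects {
        names.extend(["objects.tree.json", "objects.tree.txt", "objects.jpg"]);
    }
    names.into_iter().map(|name| dir.join(name)).collect()
}
```

For `--input demo/ui.jpg` with the default output root, `output_dir_for(Path::new("output"), Path::new("demo/ui.jpg"))` yields `output/ui`. Because the `objects.*` files are skipped when nothing is detected, callers should still check that each path exists.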

---

## 🔄 Pipeline <a id="pipeline"></a>


```
Input Image
  │
  ├─ 1. Preprocessing ────── Grayscale, line removal, background removal
  │
  ├─ 2. Connected Component ─ Gradient → CCL (Connected Component Labeling)
  │
  ├─ 3. Rect/Line Detection ─ Buttons, input fields, etc.
  │
  ├─ 4. Merge & Filter ───── Merge overlapping regions, remove noise
  │
  ├─ 5. Classification ───── Block / Button / Text / Icon / Image
  │      │
  │      ├─ Icon Classifier ── 81 common icon categories (ONNX Runtime)
  │      │
  │      └─ OCR (background) ─ Text recognition (PaddleOCR)
  │
  ├─ 6. Merge ────────────── Merge OCR text into UI elements
  │
  ├─ 7. Color Detection ──── Extract background/foreground colors
  │
  └─ 8. Output ───────────── Tree output (JSON + text) + visualization annotation
```
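
Step 2 relies on connected component labeling. As a rough illustration of the technique (not quasivision's actual implementation), the sketch below labels 4-connected foreground regions of a binary mask with a breadth-first flood fill and drops regions smaller than a minimum area, in the spirit of `--min-area`. The `Component` type, the choice of 4-connectivity, and the rectangular `Vec<Vec<bool>>` mask are assumptions made for the example.

```rust
use std::collections::VecDeque;

/// Bounding box (inclusive pixel indices) and pixel count of one region.
#[derive(Debug, Clone, PartialEq)]
struct Component {
    column_min: usize,
    row_min: usize,
    column_max: usize,
    row_max: usize,
    area: usize,
}

/// Label 4-connected foreground regions of a rectangular binary mask and
/// return every region whose pixel count is at least `min_area`.
fn connected_components(mask: &[Vec<bool>], min_area: usize) -> Vec<Component> {
    let rows = mask.len();
    let cols = if rows == 0 { 0 } else { mask[0].len() };
    let mut visited = vec![vec![false; cols]; rows];
    let mut found = Vec::new();

    for r in 0..rows {
        for c in 0..cols {
            if !mask[r][c] || visited[r][c] {
                continue;
            }
            // Breadth-first flood fill from this unvisited foreground pixel.
            let mut comp = Component { column_min: c, row_min: r, column_max: c, row_max: r, area: 0 };
            let mut queue = VecDeque::from([(r, c)]);
            visited[r][c] = true;
            while let Some((y, x)) = queue.pop_front() {
                comp.area += 1;
                comp.column_min = comp.column_min.min(x);
                comp.column_max = comp.column_max.max(x);
                comp.row_min = comp.row_min.min(y);
                comp.row_max = comp.row_max.max(y);
                // Collect the four direct neighbours that lie inside the mask.
                let mut neighbours = Vec::with_capacity(4);
                if y > 0 { neighbours.push((y - 1, x)); }
                if y + 1 < rows { neighbours.push((y + 1, x)); }
                if x > 0 { neighbours.push((y, x - 1)); }
                if x + 1 < cols { neighbours.push((y, x + 1)); }
                for (ny, nx) in neighbours {
                    if mask[ny][nx] && !visited[ny][nx] {
                        visited[ny][nx] = true;
                        queue.push_back((ny, nx));
                    }
                }
            }
            if comp.area >= min_area {
                found.push(comp);
            }
        }
    }
    found
}
```

In a real pipeline the mask would come from thresholding the gradient image, and the surviving bounding boxes would feed the rectangle and line detection stage.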

### Parallel Execution


Object detection (YOLOE-26n) and OCR run on **background threads** in parallel with the main pipeline, adding no extra wait time.
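
The pattern described here can be sketched with `std::thread`: start the background tasks first, do the main work on the current thread, then join. This is a generic illustration under the assumption that the tasks are independent; `run_in_parallel` is a hypothetical helper, and the closures stand in for object detection, OCR, and the UI pipeline.

```rust
use std::thread;

/// Run two independent background tasks while the caller's own work
/// proceeds on the current thread, then collect all three results.
fn run_in_parallel<A, B, M>(
    background_a: impl FnOnce() -> A + Send + 'static,
    background_b: impl FnOnce() -> B + Send + 'static,
    main_work: impl FnOnce() -> M,
) -> (A, B, M)
where
    A: Send + 'static,
    B: Send + 'static,
{
    // Spawn both tasks before starting the main work so all three overlap.
    let handle_a = thread::spawn(background_a);
    let handle_b = thread::spawn(background_b);
    let main_result = main_work();
    // Joining blocks only if a background task is still running.
    let a = handle_a.join().expect("background task A panicked");
    let b = handle_b.join().expect("background task B panicked");
    (a, b, main_result)
}
```

The total wall-clock time is then roughly that of the slowest of the three tasks rather than their sum, provided enough CPU cores are available.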

---

## 🧩 Core Features <a id="core-features"></a>


### 1. UI Element Detection (Main Feature)


Detects 7 types of UI elements:

| Category      | Description                                    |
| ------------- | ---------------------------------------------- |
| **Block**     | Container blocks (cards, list items, nav bars) |
| **Button**    | Clickable buttons                              |
| **Text**      | Text labels                                    |
| **Icon**      | Icons (small square elements)                  |
| **Image**     | Images                                         |
| **Input**     | Input fields                                   |
| **List Item** | List items (with checkmark indicators)         |

### 2. OCR Text Recognition


- Based on PaddleOCR (PP-OCRv5) models
- Windows: DirectML GPU acceleration supported
- Auto-detects text in images
- Long text protection: meaningful text (>5 chars) bypasses height filters
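
The long text protection rule can be read as a simple predicate. The sketch below is one possible interpretation, assuming the height filter compares the box height to the image height (as `--text-max-h` does) and that "more than 5 chars" counts Unicode characters; `keep_text` is a hypothetical name.

```rust
/// Decide whether a text candidate survives the height filter.
/// `max_height_ratio` plays the role of `--text-max-h` (default 0.08).
fn keep_text(box_height: u32, image_height: u32, text: &str, max_height_ratio: f64) -> bool {
    // Text longer than 5 characters is treated as meaningful and
    // bypasses the height check entirely.
    if text.chars().count() > 5 {
        return true;
    }
    if image_height == 0 {
        return false;
    }
    f64::from(box_height) / f64::from(image_height) <= max_height_ratio
}
```

Under these assumptions, a 120 px tall box in a 1000 px tall image is rejected for the text "OK" (ratio 0.12 exceeds 0.08) but kept for "Sign in now".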

### 3. Object Detection (YOLOE-26n)


- ONNX Runtime-based YOLOE-26n model (dynamic input, end-to-end NMS)
- 860 common object classes (people, cars, phones, food, animals, etc.)
- Auto-builds parent-child containment trees
- Outputs annotated `objects.jpg` visualization
- **77% smaller** than the previous YOLO-World model (11.1 MB vs 49.5 MB)
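
To make the idea of a parent-child containment tree concrete, here is a minimal sketch that assigns each box the smallest other box that fully contains it. It is a simplification: strict containment is an assumption of the example, and quasivision may use a looser overlap rule. `Detection` and `assign_parents` are hypothetical names.

```rust
#[derive(Debug, Clone)]
struct Detection {
    label: String,
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

impl Detection {
    fn area(&self) -> i64 {
        i64::from(self.width) * i64::from(self.height)
    }

    /// True when `other` lies entirely inside `self` (edges may touch).
    fn contains(&self, other: &Detection) -> bool {
        self.x <= other.x
            && self.y <= other.y
            && self.x + self.width >= other.x + other.width
            && self.y + self.height >= other.y + other.height
    }
}

/// For each detection, return the index of its parent: the smallest other
/// box that fully contains it, or `None` when it is a root.
fn assign_parents(dets: &[Detection]) -> Vec<Option<usize>> {
    (0..dets.len())
        .map(|child| {
            (0..dets.len())
                .filter(|&cand| cand != child && dets[cand].contains(&dets[child]))
                // Identical boxes contain each other; only let the earlier
                // one act as parent so that no cycle can form.
                .filter(|&cand| dets[cand].area() > dets[child].area() || cand < child)
                .min_by_key(|&cand| dets[cand].area())
        })
        .collect()
}
```

The resulting parent indices are enough to print a nested tree by recursing from the entries whose parent is `None`.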

### 4. Icon Meaning Classification


- 81-class icon classification via ONNX model
- Common UI icon meanings (settings, search, share, back, etc.)
- Candidate meanings are shown only when confidence exceeds 40%

### 5. Color Detection


- Auto-extracts background/foreground colors per element
- Outputs hex color values
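
For consumers of the color fields, converting between RGB triples and hex strings is straightforward. The sketch below assumes a lowercase `#rrggbb` form; the exact casing and prefix that quasivision emits are not specified here, so treat the format as an assumption. Both function names are hypothetical.

```rust
/// Format an RGB triple as a lowercase hex color such as "#1a73e8".
fn to_hex(r: u8, g: u8, b: u8) -> String {
    format!("#{:02x}{:02x}{:02x}", r, g, b)
}

/// Parse "#rrggbb" (or "rrggbb", any letter case) back into an RGB triple.
fn from_hex(hex: &str) -> Option<(u8, u8, u8)> {
    let digits = hex.strip_prefix('#').unwrap_or(hex);
    // Require exactly six hex digits so that signs or short forms are rejected.
    if digits.len() != 6 || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    let channel = |i: usize| u8::from_str_radix(&digits[i..i + 2], 16).ok();
    Some((channel(0)?, channel(2)?, channel(4)?))
}
```

Round-tripping a value through `to_hex` and `from_hex` returns the original triple, which makes the pair convenient for comparing detected colors against a design palette.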

---

## ❓ FAQ <a id="faq"></a>


### Q: Where do model files come from?


Model files are located in the `resources/` directory:

```
resources/
├── ocr-models/
│   ├── ppocrv5_mobile_det.onnx   # OCR detection model
│   ├── ppocrv5_mobile_rec.onnx   # OCR recognition model
│   └── ppocrv5_dict.txt          # Chinese dictionary
├── icon-classifier/
│   ├── icon_classifier.onnx      # Icon classification model
│   └── labels.json               # 81 class labels
└── object-detection/
    ├── yoloe-26n-seg-dynamic.onnx # YOLOE-26n object detection model (11 MB)
    └── yoloe-26n_classes.txt     # 860 class labels
```

**Auto-download**: Missing model files are automatically downloaded from the Hugging Face repo ([WeiChens/quasivision-models](https://huggingface.co/WeiChens/quasivision-models)) on first run.

**Mirror for users in China**:

```bash
# Windows (cmd); on macOS / Linux use `export` instead of `set`
set QUASIVISION_MODELS_URL=https://hf-mirror.com/WeiChens/quasivision-models/resolve/main
cargo run -- --input image.png
```
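
An environment override like `QUASIVISION_MODELS_URL` is typically resolved as "use the variable if set, otherwise a built-in default". The sketch below illustrates that pattern only. The function names are hypothetical, the default base URL is left as a parameter because it is not stated here, and the assumption that remote paths mirror the `resources/` layout is unverified.

```rust
/// Pick the base URL for model downloads: the override when it is set and
/// non-empty, otherwise the default. Trailing slashes are removed.
fn models_base_url(override_value: Option<&str>, default_base: &str) -> String {
    let chosen = match override_value.map(str::trim) {
        Some(value) if !value.is_empty() => value,
        _ => default_base,
    };
    chosen.trim_end_matches('/').to_string()
}

/// Join a base URL and a relative file path with exactly one slash.
/// That remote paths mirror `resources/` is an assumption of this sketch.
fn model_file_url(base: &str, relative_path: &str) -> String {
    format!(
        "{}/{}",
        base.trim_end_matches('/'),
        relative_path.trim_start_matches('/')
    )
}
```

In a program, the override would be read with `std::env::var("QUASIVISION_MODELS_URL").ok()` and passed in via `.as_deref()`, which keeps the functions themselves free of global state and easy to test.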

### Q: What coordinate system does the output use?


Output uses raw pixel coordinates (tree format):

```json
{
  "column_min": 100,
  "row_min": 200,
  "column_max": 300,
  "row_max": 400
}
```

All coordinates are in original pixel values (0–1000 normalization is not used).
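
Width, height, and center follow directly from these four fields. The sketch below assumes that width is `column_max - column_min` and height is `row_max - row_min`, which is an interpretation rather than a documented guarantee; `BoundingBox` is a hypothetical type whose `summary` method imitates the `[x,y w×h]` notation of the text tree.

```rust
/// Pixel-space bounding box using the field names from the JSON output.
struct BoundingBox {
    column_min: u32,
    row_min: u32,
    column_max: u32,
    row_max: u32,
}

impl BoundingBox {
    /// Width in pixels, assuming `column_max - column_min`.
    fn width(&self) -> u32 {
        self.column_max.saturating_sub(self.column_min)
    }

    /// Height in pixels, assuming `row_max - row_min`.
    fn height(&self) -> u32 {
        self.row_max.saturating_sub(self.row_min)
    }

    /// Center point as (x, y) in pixel coordinates.
    fn center(&self) -> (f64, f64) {
        (
            (f64::from(self.column_min) + f64::from(self.column_max)) / 2.0,
            (f64::from(self.row_min) + f64::from(self.row_max)) / 2.0,
        )
    }

    /// Compact "[x,y w×h]" form with right-aligned, 3-character fields.
    fn summary(&self) -> String {
        format!(
            "[{:>3},{:>3} {:>3}×{:>3}]",
            self.column_min,
            self.row_min,
            self.width(),
            self.height()
        )
    }
}
```

For the JSON example above, this gives a 200 by 200 pixel box centered at (200, 300). To map a box onto a resized copy of the image, scale each field by the ratio of the new size to the original size.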

### Q: How can I run object detection only (without UI detection)?


This is not currently supported: the pipeline always runs UI detection. To reduce the extra work, you can disable the ancillary features with `--ocr false --icon-classify false`.

### Q: How to improve detection quality?


- **Gradient threshold**: Web pages: `--gradient 1`, App screenshots: `--gradient 4`
- **Rounded corners**: Use `--rec-corner-skip 0.12` for large rounded elements
- **Text height**: Pass `--text-max-h 0.12` to raise the text height limit (default `0.08`)

### Q: What confidence threshold should I use?


| Scenario                        | Recommended `--detect-conf` |
| ------------------------------- | :-------------------------: |
| Only high-confidence objects    |             0.5             |
| Balanced precision & recall     |        0.2 (default)        |
| Maximum recall (tolerate noise) |             0.1             |
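
The effect of the threshold is easy to see on the demo detections from the real-world photo. The sketch below filters a list of scored labels and sorts the survivors by confidence. Whether quasivision compares with `>=` or `>` is not documented here, so the inclusive comparison is an assumption, and `ScoredLabel` and `filter_by_confidence` are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ScoredLabel {
    label: String,
    confidence: f32,
}

/// Keep detections whose confidence is at least `min_conf` (inclusive,
/// by assumption) and return them sorted from most to least confident.
fn filter_by_confidence(dets: &[ScoredLabel], min_conf: f32) -> Vec<ScoredLabel> {
    let mut kept: Vec<ScoredLabel> = dets
        .iter()
        .filter(|d| d.confidence >= min_conf)
        .cloned()
        .collect();
    kept.sort_by(|a, b| b.confidence.total_cmp(&a.confidence));
    kept
}
```

Applied to the six demo objects (person 87%, hat 82%, glasses 65%, cap 39%, glove 21%, jacket 20%), a threshold of 0.5 keeps only person, hat, and glasses, while the default 0.2 keeps all six.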

### Q: Supported image formats?


Default: `png`, `jpg`, `jpeg`, `jfif`. Customize with `--extensions`.

---

## 💡 Practical Examples <a id="practical-examples"></a>


```bash
# App screenshot (recommended parameters)

cargo run -- -i app.png --gradient 4

# Web page detection

cargo run -- -i webpage.png --gradient 1 --rec-corner-skip 0.1

# Batch processing with recursion

cargo run -- -i ./screenshots/ --recursive

# AI-friendly output (disable non-essential features)

cargo run -- -i ui.png --icon-classify false

# High-recall detection

cargo run -- -i photo.jpg --detect-conf 0.1

# Paragraph-aware text detection

cargo run -- -i document.png --paragraph true --text-max-h 0.15
```

---

## 🌐 Proxy Configuration <a id="proxy-configuration"></a>


On Windows, quasivision automatically detects system proxy settings (compatible with Clash, V2Ray, etc.). If your proxy requires manual configuration:

```bash
# Windows (cmd)

set HTTP_PROXY=http://127.0.0.1:7890
set HTTPS_PROXY=http://127.0.0.1:7890
cargo run -- --input image.png

# macOS / Linux

HTTP_PROXY=http://127.0.0.1:7890 HTTPS_PROXY=http://127.0.0.1:7890 cargo run -- --input image.png
```

---

## 📄 License <a id="license"></a>


- **Source code**: MIT © quasivision
- **PP-OCRv5**: Apache 2.0 © PaddlePaddle
- **YOLOE-26n-seg**: AGPL-3.0 © Ultralytics
- **Icon Classifier**: MIT

---

## 📖 Also Available In


- [中文文档 (Chinese)](README-zh.md)