quasivision 0.2.2

A Rust-based pseudo-visual understanding tool. Analyzes screenshots, UI mockups, and real-world photos β€” detects UI components, recognizes text via OCR, identifies objects with YOLOE-26n.
Documentation

A Rust-based pseudo-visual understanding tool.
Analyzes screenshots, UI mockups, and real-world photos β€” detects UI components (buttons, text fields, icons, images, etc.), recognizes text via OCR, identifies 860 classes of everyday objects (people, cars, phones, food, etc.) with YOLOE-26n, classifies 81 types of icon meanings, and outputs structured descriptions with visual annotations.


πŸ“‹ Table of Contents

  1. Quick Start
  2. Demo Gallery
  3. Output Overview
  4. CLI Reference
  5. Output File Structure
  6. Pipeline
  7. Core Features
  8. FAQ
  9. Practical Examples
  10. Proxy Configuration
  11. License

πŸš€ Quick Start

Basic Usage

# Single image

cargo run -- --input image.png


# Try with the built-in demo

cargo run -- --input demo/ui.jpg


# Custom output directory

cargo run -- --input image.png --output ./result


# Batch process all images in a directory

cargo run -- --input ./screenshots/


# Recursive processing (include subdirectories)

cargo run -- --input ./screenshots/ --recursive

Minimal Example

cargo run -- --input demo/ui.jpg

Results are written to ./output/ui/.


πŸ–ΌοΈ Demo Gallery

1. UI Detection β€” Web Search Page

Input Output
ui-input ui-viz

Detected UI components including text, icons, buttons, and structured blocks from a search result page β€” with full OCR text extraction.

2. Object Detection β€” Real-World Photo

Input Output
reality-input reality-viz

Detected 6 objects with hierarchical relationships (person β†’ cap/hat/glasses/glove/jacket), visualized with bounding boxes and labels.

Detection result:

Objects (474Γ—714) β€” 6 found:
└─ [  0,278 433Γ—436] person (87%)
   β”œβ”€ [111,277 118Γ— 93] cap (39%)
   β”‚  └─ [111,277 118Γ— 93] hat (82%)
   β”‚     └─ [112,345  88Γ— 38] glasses (65%)
   β”œβ”€ [  1,649  46Γ— 65] glove (21%)
   └─ [ 55,342 373Γ—372] jacket (20%)

3. Mixed Scenario β€” Stock Photo Gallery

Input Output (UI) Output (Objects)
mixed-input mixed-ui mixed-obj

A stock photo gallery page: UI detection extracts the layout structure (image grid, navigation bar, text labels), while object detection identifies photo subjects (people, faces, etc.).


πŸ“€ Output Overview

Output Format (fixed: tree)

Output is always in tree format (no --format flag needed):

tree        Nested tree structure, JSON + plain text, AI-readable DOM

It generates both elements.tree.json (JSON tree) and elements.tree.txt (plain text tree).

Output Files

File Source Description
elements.tree.json UI Detection All detected UI components (buttons/text/icons/etc)
elements.tree.txt UI Detection Plain text summary
visualization.jpg UI Detection Annotated image with color-coded component borders
objects.tree.json Object Detect YOLOE-detected objects (860 classes) with hierarchy
objects.tree.txt Object Detect Object detection plain text summary
objects.jpg Object Detect Object detection visualization with labels

βš™οΈ CLI Reference

Basic Options

Argument Type Default Description
-i, --input String Required Input image path or directory
-o, --output String output Output root directory
--recursive bool false Recursively process subdirectories
--extensions String png,jpg,jpeg,jfif Comma-separated image file extensions

UI Detection Options

Argument Type Default Description
--gradient u8 4 Gradient threshold (dribbble/rico: 4, web: 1)
--min-area u32 55 Minimum connected component area
--paragraph bool false Enable paragraph merging
--remove-bar bool true Remove top/bottom navigation bars
--sub-component bool true Detect sub-components (buttons inside images)
--synthesize-text bool true Auto-synthesize container blocks for orphan text

Line / Rectangle Options

Argument Type Default Description
--line-thickness u32 8 Maximum line thickness (pixels)
--line-min-length f64 0.95 Minimum line length ratio
--rec-evenness f64 0.7 Minimum rectangle evenness
--rec-dent f64 0.25 Maximum rectangle dent ratio
--rec-corner-skip f64 0.08 Corner tolerance (0=strict right angle, 0.08~0.12=rounded)

Block Detection Options

Argument Type Default Description
--block-side f64 0.15 Block side length ratio threshold
--block-grad u8 5 Block nesting detection gradient threshold

Text Options

Argument Type Default Description
--text-max-h f64 0.08 Max text height ratio (relative to image height)
--text-gap u32 10 Max word gap (pixels)
--ocr bool true Enable OCR text recognition

Icon / Object Detection Options

Argument Type Default Description
--icon-classify bool true Enable icon meaning classification
--object-detect bool true Enable object detection
--detect-model String resources/object-detection/yoloe-26n-seg-dynamic.onnx YOLOE model path
--detect-labels String resources/object-detection/yoloe-26n_classes.txt YOLOE labels file path
--detect-conf f32 0.2 Detection confidence threshold (0~1)
--models-dir String resources Model resource root directory

Disabling Features

# Disable OCR (structure-only detection)

cargo run -- --input image.png --ocr false


# Disable object detection

cargo run -- --input image.png --object-detect false


# Disable icon classification

cargo run -- --input image.png --icon-classify false


# UI detection only (all optional features off)

cargo run -- --input image.png --ocr false --object-detect false --icon-classify false


πŸ“ Output File Structure

Single Image Output

output/
└── image_name/             # Named after the input file (without extension)
    β”œβ”€β”€ elements.tree.json  # UI element tree (JSON)
    β”œβ”€β”€ elements.tree.txt   # UI element tree (text)
    β”œβ”€β”€ visualization.jpg   # UI detection visualization
    β”œβ”€β”€ objects.tree.json   # Object detection tree (JSON)
    β”œβ”€β”€ objects.tree.txt    # Object detection tree (text)
    └── objects.jpg         # Object detection visualization

Note: objects.* files are only generated when --object-detect true and objects are found.


πŸ”„ Pipeline

Input Image
  β”‚
  β”œβ”€ 1. Preprocessing ────── Grayscale, line removal, background removal
  β”‚
  β”œβ”€ 2. Connected Component ─ Gradient β†’ CCL (Connected Component Labeling)
  β”‚
  β”œβ”€ 3. Rect/Line Detection ─ Buttons, input fields, etc.
  β”‚
  β”œβ”€ 4. Merge & Filter ───── Merge overlapping regions, remove noise
  β”‚
  β”œβ”€ 5. Classification ───── Block / Button / Text / Icon / Image
  β”‚      β”‚
  β”‚      β”œβ”€ Icon Classifier ── 81 common icon categories (ONNX Runtime)
  β”‚      β”‚
  β”‚      └─ OCR (background) ─ Text recognition (PaddleOCR)
  β”‚
  β”œβ”€ 6. Merge ────────────── Merge OCR text into UI elements
  β”‚
  β”œβ”€ 7. Color Detection ──── Extract background/foreground colors
  β”‚
  └─ 8. Output ───────────── 5 formats + visualization annotation

Parallel Execution

Object detection (YOLOE-26n) and OCR run on background threads in parallel with the main pipeline, adding no extra wait time.


🧩 Core Features

1. UI Element Detection (Main Feature)

Detects 7 types of UI elements:

Category Description
Block Container blocks (cards, list items, nav bars)
Button Clickable buttons
Text Text labels
Icon Icons (small square elements)
Image Images
Input Input fields
List Item List items (with checkmark indicators)

2. OCR Text Recognition

  • Based on PaddleOCR (PP-OCRv5) models
  • Windows: DirectML GPU acceleration supported
  • Auto-detects text in images
  • Long text protection: meaningful text (>5 chars) bypasses height filters

3. Object Detection (YOLOE-26n)

  • ONNX Runtime-based YOLOE-26n model (dynamic input, end-to-end NMS)
  • 860 common object classes (people, cars, phones, food, animals, etc.)
  • Auto-builds parent-child containment trees
  • Outputs annotated objects.jpg visualization
  • 77% smaller than the previous YOLO-World model (11.1 MB vs 49.5 MB)

4. Icon Meaning Classification

  • 81-class icon classification via ONNX model
  • Common UI icon meanings (settings, search, share, back, etc.)
  • Confidence > 40% displays candidate meanings

5. Color Detection

  • Auto-extracts background/foreground colors per element
  • Outputs hex color values

❓ FAQ

Q: Where do model files come from?

Model files are located in the resources/ directory:

resources/
β”œβ”€β”€ ocr-models/
β”‚   β”œβ”€β”€ ppocrv5_mobile_det.onnx   # OCR detection model
β”‚   β”œβ”€β”€ ppocrv5_mobile_rec.onnx   # OCR recognition model
β”‚   └── ppocrv5_dict.txt          # Chinese dictionary
β”œβ”€β”€ icon-classifier/
β”‚   β”œβ”€β”€ icon_classifier.onnx      # Icon classification model
β”‚   └── labels.json               # 81 class labels
└── object-detection/
    β”œβ”€β”€ yoloe-26n-seg-dynamic.onnx # YOLOE-26n object detection model (11 MB)
    └── yoloe-26n_classes.txt     # 860 class labels

Auto-download: Missing model files are automatically downloaded from the Hugging Face repo (WeiChens/quasivision-models) on first run.

Mirror for China users:

set QUASIVISION_MODELS_URL=https://hf-mirror.com/WeiChens/quasivision-models/resolve/main

cargo run -- --input image.png

Q: What coordinate system does the output use?

Output uses raw pixel coordinates (tree format):

{
  "column_min": 100,
  "row_min": 200,
  "column_max": 300,
  "row_max": 400
}

All coordinates are in original pixel values (0–1000 normalization is not used).

Q: How can I run object detection only (without UI detection)?

The current design runs the full pipeline. You can disable ancillary features with --ocr false --icon-classify false.

Q: How to improve detection quality?

  • Gradient threshold: Web pages: --gradient 1, App screenshots: --gradient 4
  • Rounded corners: Use --rec-corner-skip 0.12 for large rounded elements
  • Small text: Increase --text-max-h 0.12 to raise text height limit

Q: What confidence threshold should I use?

Scenario --detect-conf Recommended
Only high-confidence objects 0.5
Balanced precision & recall 0.2 (default)
Maximum recall (tolerate noise) 0.1

Q: Supported image formats?

Default: png, jpg, jpeg, jfif. Customize with --extensions.


πŸ’‘ Practical Examples

# App screenshot (recommended parameters)

cargo run -- -i app.png --gradient 4


# Web page detection

cargo run -- -i webpage.png --gradient 1 --rec-corner-skip 0.1


# Batch processing with recursion

cargo run -- -i ./screenshots/ --recursive


# AI-friendly output (disable non-essential features)

cargo run -- -i ui.png --icon-classify false


# High-recall detection

cargo run -- -i photo.jpg --detect-conf 0.1


# Paragraph-aware text detection

cargo run -- -i document.png --paragraph true --text-max-h 0.15


🌐 Proxy Configuration

On Windows, quasivision automatically detects system proxy settings (compatible with Clash, V2Ray, etc.). If your proxy requires manual configuration:

# Windows (cmd)

set HTTP_PROXY=http://127.0.0.1:7890

set HTTPS_PROXY=http://127.0.0.1:7890

cargo run -- --input image.png


# macOS / Linux

HTTP_PROXY=http://127.0.0.1:7890 HTTPS_PROXY=http://127.0.0.1:7890 cargo run -- --input image.png


πŸ“„ License

  • Source code: MIT Β© quasivision
  • PP-OCRv5: Apache 2.0 Β© PaddlePaddle
  • YOLOE-26n-seg: AGPL-3.0 Β© Ultralytics
  • Icon Classifier: MIT

πŸ“– Also Available In