# Agent Context: native-devtools-mcp
**About:** This is the AGENTS.md for **native-devtools-mcp**, an MCP (Model Context Protocol) server that enables **computer use** / **desktop automation** on macOS, Windows, and Android: screenshots, OCR, mouse/keyboard input, window management, and Android device control via ADB.
**Search keywords:** MCP, Model Context Protocol, computer use, desktop automation, UI automation, RPA, screenshots, OCR, screen reading, mouse, keyboard, macOS, Windows, Android, ADB, mobile testing, native-devtools-mcp.
**Role:** You are an agent equipped with "Computer Use" capabilities. You can see the screen, type, move the mouse, and interact with native desktop and mobile applications.
**Constraint:** You are operating a real machine. Actions are permanent. Ensure you verify the state of the screen before and after actions.
## π§ Core Reasoning Loop
For robust automation, follow this "Visual Feedback Loop":
1. **OBSERVE:** Call `take_screenshot(app_name="TargetApp")` to see the current state.
2. **LOCATE:** Analyze the image or use the OCR summary text in the response to find coordinates.
3. **ACT:** Call `click()`, `type_text()`, or `scroll()` using those coordinates.
4. **VERIFY:** Call `take_screenshot` again to confirm the action had the intended effect.
**macOS-preferred branch (native apps):** substitute OBSERVE with `take_ax_snapshot(app_name='...')`; LOCATE reads uid + bbox from the emitted tree; ACT calls `ax_click(uid)` for pressable controls, `ax_set_value(uid, text)` for text fields, or `ax_select(uid)` for `NSOutlineView` / `NSTableView` row selection (sidebars, rule lists); VERIFY re-snapshots and reads the new state. This branch does not move the cursor or steal focus, so it composes with background work.
---
## πΊοΈ Capabilities Matrix (Strategy Guide)
Use this table to choose the right tool sequence for the user's goal.
| "Click the 'Submit' button" | `find_text(text="Submit")` β `click(x, y)` | Fastest. No visual analysis needed if text is known. |
| "Click the red icon" | `take_screenshot()` β (Analyze Image) β `click(screenshot_x=..., screenshot_y=..., screenshot_origin_x=..., screenshot_origin_y=..., screenshot_scale=...)` | Visual features require full screenshot analysis. |
| "What element is at (500, 300)?" | `element_at_point(x=500, y=300)` | Returns the accessibility element at those coordinates (name, role, bounds, etc.). |
| "Type into the search bar" | `find_text(text="Search")` β `click(x, y)` β `type_text("hello")` | Must click to focus before typing. |
| "Scroll down" | `scroll(x=500, y=500, delta_y=200)` | Positive `delta_y` scrolls down. |
| "Find an open window" | `list_windows()` β `focus_window(window_id=...)` | Don't guess window names; list them first. |
| "Track what I hover over" | `start_hover_tracking(min_dwell_ms=300)` β user moves mouse β `stop_hover_tracking()` | Records element transitions with dwell filtering. |
| "Record what the user does" | `start_recording(output_dir="/tmp/rec")` β user interacts β `stop_recording()` | Captures frontmost app at ~5fps as JPEG frames. |
| "Launch Safari with debug port" | `launch_app(app_name="Safari", args=["--remote-debugging-port=9222"])` | Pass CLI args on fresh launch. |
| "Quit an app" | `quit_app(app_name="Safari")` | Graceful by default; use `force=true` to kill immediately. |
| "Click a named button in a native app (macOS)" | `take_ax_snapshot` β `ax_click(uid)` | Focus-preserving. Generation-tagged uids; fresh snapshot invalidates prior uids. |
| "Enter text into a text field via value assignment (macOS)" | `take_ax_snapshot` β `ax_set_value(uid, text)` | No key events, no IME, no undo-stack entry. Fall back to `click` + `type_text` on `not_dispatchable`. |
| "Select a sidebar row in System Settings / a `NSOutlineView` row (macOS)" | `take_ax_snapshot` β `ax_select(uid)` | Writes `AXSelectedRows` on the enclosing outline/table. Use this instead of `ax_click` for row targets β rows typically refuse `AXPress`. |
| "AX snapshot invalidation rule (macOS)" | β | Every `take_ax_snapshot` call bumps the generation. All prior uids become stale; `ax_*` tools return `snapshot_expired`. |
| "Click a button in Chrome" | `cdp_connect(port=9222)` β `cdp_find_elements(query="Submit")` β `cdp_click(uid="d1")` | CDP is more reliable than coordinates for web content. |
| "Type in a web input" | `cdp_find_elements(query="Email")` β `cdp_fill(uid="d1", value="hello")` | Works for `<input>`, `<textarea>`, and `<select>` elements. |
| "Run JS in a browser page" | `cdp_evaluate_script(function="() => document.title")` | Evaluate any JS in the selected page. |
| "Navigate to a URL" | `cdp_navigate(url="https://example.com")` | Also supports back, forward, reload. |
| "Press Enter or shortcut" | `cdp_press_key(key="Enter")` or `cdp_press_key(key="Control+A")` | Supports modifier combos. |
| "Wait for page content" | `cdp_wait_for(text=["Success"])` | Polls page text until any value appears or timeout. Pass `include_snapshot=true` to also get a DOM snapshot. |
| "Switch browser tabs" | `cdp_list_pages()` β `cdp_select_page(page_idx=1)` | List tabs, then select by index. |
| "Get browser page structure" | `cdp_take_dom_snapshot()` | Full interactive-element DOM snapshot with UIDs, roles, and labels. |
---
## π οΈ Tool Definitions & Schemas
### 1. Vision & Perception (The "Eyes")
#### `take_ax_snapshot` (cross-platform with macOS session semantics)
**Purpose:** Return a structured text snapshot of an app's accessibility tree with per-element uids, roles, names, state attributes, and (on macOS) bounding boxes.
**Schema:**
- `app_name: string` (optional) β target app; defaults to frontmost.
**macOS session state:** on macOS this tool is no longer a pure read β it writes session state. Every call bumps a monotonic generation and replaces the server's cached map of `AXUIElement` handles. Uids are emitted as `a<N>g<gen>` (e.g. `a42g3`). All uids from prior snapshots become stale for `ax_click` / `ax_set_value` / `ax_select` consumption (return `snapshot_expired`). Windows behavior is unchanged β no session, uids stay bare `a<N>`.
**Usage pattern (macOS):** snapshot immediately before each `ax_click` / `ax_set_value` / `ax_select` call to avoid `snapshot_expired`. Every branch or retry starts with a fresh snapshot.
#### `take_screenshot`
Captures pixel data and layout.
* **Inputs:**
* `mode` (string, default `"window"`): `"screen"`, `"window"`, or `"region"`.
* `app_name` (string, optional): Capture this app's window (for mode `"window"`).
* `window_id` (number, optional): Window ID (for mode `"window"`).
* `x`, `y`, `width`, `height` (numbers): Region bounds (for mode `"region"`).
* `include_ocr` (boolean, default `true`): Include OCR summary text with coordinates.
* **Returns (content list):**
```json
[
{ "type": "image", "mime": "image/jpeg", "data": "..." },
{ "type": "text", "text": "{ \"screenshot_origin_x\": 0, \"screenshot_origin_y\": 0, \"screenshot_scale\": 2.0, \"screenshot_window_id\": 1234, \"screenshot_pixel_width\": 1920, \"screenshot_pixel_height\": 1080 }" },
{ "type": "text", "text": "## OCR Text Detected (click coordinates)\n- \"File\" at (10, 10) bounds: {x: 0, y: 0, w: 50, h: 20}" }
]
```
* **Metadata fields:**
* `screenshot_origin_x`, `screenshot_origin_y`: Screen-space origin of the screenshot (top-left corner), in points.
* `screenshot_scale`: Display scale factor (e.g., 2.0 for Retina).
* `screenshot_window_id`: Window ID (only for mode `"window"`). Present even when using `app_name`.
* `screenshot_pixel_width`, `screenshot_pixel_height`: Actual pixel dimensions of the captured image.
#### `find_text`
Fast-path to get coordinates without image analysis.
* **Inputs:** `text` (string, case-insensitive substring match against accessibility element names, then OCR), `app_name` (string, optional), `window_id` (number, optional), `display_id` (number, optional).
* **Returns (JSON array):**
```json
[
{ "text": "Save", "x": 500, "y": 300, "confidence": 1.0, "bounds": { "x": 480, "y": 290, "width": 40, "height": 20 } }
]
```
* **Platform behavior:**
* **Both platforms:** Uses the **platform accessibility API** as the primary mechanism β searches the accessibility tree for elements by name. This gives precise element-level coordinates (`confidence: 1.0`). Falls back to OCR automatically if accessibility finds no matches.
* **macOS:** Accessibility API (primary), Vision OCR (fallback). Matches against element title, value, and description. Note: accessibility results use semantic names (e.g., "All Clear" instead of "AC", "Subtract" instead of "β"), so search by meaning rather than displayed symbol.
* **Windows:** UI Automation (primary), WinRT OCR (fallback). Matches against element Name property only.
#### `element_at_point`
Inspect the accessibility element at given screen coordinates.
* **Inputs:** `x` (number, required), `y` (number, required), `app_name` (string, optional β scope lookup to a specific app for faster, more precise results).
* **Returns (JSON object, fields present only when available):**
```json
{
"role": "AXButton",
"name": "Save",
"label": "Save document",
"value": "...",
"bounds": { "x": 480, "y": 290, "width": 40, "height": 20 },
"pid": 12345,
"app_name": "TextEdit"
}
```
* **Platform behavior:**
* **macOS:** Uses `AXUIElementCopyElementAtPosition`. When the result is a container (e.g. AXScrollArea in Electron apps), automatically walks the full AX tree to find the most specific element at the coordinates. With `app_name`, scopes to that app's element tree (useful when windows overlap).
* **Windows:** Uses `IUIAutomation::ElementFromPoint`. `app_name` is not yet supported (ignored).
* **Known limitation:** Some privacy-focused Electron apps (e.g. Signal) intentionally restrict their accessibility tree, exposing only top-level containers without individual UI elements. In these cases, `element_at_point` returns the outermost container (e.g. AXScrollArea or AXWindow). **Workaround:** If the returned element is a large container, use `take_screenshot(app_name=..., include_ocr=true)` and check the OCR results to identify the text/element at those coordinates.
### 2. Input & Interaction (The "Hands")
#### `click`
Simulates a mouse click.
* **Inputs:**
* **Method A (Screen Absolute):** `x` (number), `y` (number). Use with `find_text` results.
* **Method B (Window Relative):** `window_x`, `window_y`, `window_id`.
* **Method C (Screenshot Relative):** `screenshot_x`, `screenshot_y`, `screenshot_origin_x`, `screenshot_origin_y`, `screenshot_scale`. Use with `take_screenshot` visual analysis.
* `button`: "left" (default), "right", "center".
* `click_count`: 1 (default), 2 (double-click).
#### `type_text`
Types text at the *current* cursor position.
* **Inputs:** `text` (string).
* **Warning:** Always `click()` the input field first to ensure focus!
#### `ax_click` (macOS only)
**Purpose:** Press a button identified by its AX uid, without stealing cursor focus.
**Schema:**
- `uid: string` β generation-tagged uid from the most recent `take_ax_snapshot`, e.g. `"a42g3"`.
**Response:**
- Success: `{ "ok": true, "dispatched_via": "AXPress", "bbox": { "x", "y", "w", "h" } }`.
- Error: `CallToolResult::error` with JSON body `{ "error": { "code", "message", "fallback": { "x", "y" } | null } }`. Codes: `snapshot_expired`, `uid_not_found`, `not_dispatchable`, `ax_error`.
**Gotchas:**
- Any fresh `take_ax_snapshot` invalidates every prior uid β snapshot immediately before calling.
- When `fallback` is populated on `not_dispatchable`, retry via `click(fallback.x, fallback.y)`.
#### `ax_set_value` (macOS only)
**Purpose:** Write to an element's `kAXValueAttribute`. Value assignment, **not** keystroke typing.
**Schema:**
- `uid: string` β generation-tagged uid.
- `text: string` β value to assign.
**Response:** same shape as `ax_click`, with `"dispatched_via": "AXSetAttributeValue"`.
**Limits:**
- Bypasses key events (no `keydown`/`keyup` observed by apps).
- Does not participate in IME / composition.
- Does not populate the app's undo stack.
- Only works on elements that expose a writable `kAXValueAttribute` (e.g. `AXTextField`, `AXTextArea`, `AXSearchField`).
**Fallback on `not_dispatchable`:** perform `click(fallback.x, fallback.y)` to focus, then `type_text(text)` for true key-event input.
#### `ax_select` (macOS only)
**Purpose:** Select a row inside an `NSOutlineView` / `NSTableView` by writing `AXSelectedRows` on the enclosing outline/table. Use for sidebars (System Settings, Mail, Xcode, Finder), rule lists, and any native browser-style row container where `AXPress` is not supported.
**Schema:**
- `uid: string` β generation-tagged uid. Points at the row itself, a cell inside the row, or any descendant; the tool walks up to the enclosing `AXRow`.
**Response:**
- Success: `{ "ok": true, "dispatched_via": "AXSelectedRows", "bbox": { "x", "y", "w", "h" } }`. The bbox describes the resolved row (not necessarily the uid-targeted descendant).
- Error: same envelope as `ax_click` with codes `snapshot_expired`, `uid_not_found`, `no_row_ancestor`, `no_outline_container`, `not_dispatchable`, `ax_error`.
**When to reach for `ax_select` vs `ax_click`:** rows in native sidebars typically refuse `AXPress` β `ax_click` returns `not_dispatchable` or AX error `-25205` against them. Use `ax_select` for row targets. A coordinate-based fallback (`click(x, y)`) works but steals focus β the whole reason `ax_*` exists.
**Gotchas:**
- `no_row_ancestor` means the uid is not inside a row at all β you probably targeted the wrong element. Re-snapshot and pick the `AXRow` or one of its descendants.
- `no_outline_container` means the row exists but is not nested in an `AXOutline` / `AXTable` β unusual; custom row containers may need `click(fallback.x, fallback.y)`.
#### `scroll`
Scrolls at a specific screen position.
* **Inputs:** `x` (number), `y` (number), `delta_y` (integer), `delta_x` (integer, optional).
* **Direction:** Positive `delta_y` scrolls down; negative scrolls up.
### 3. Window Management
* `list_windows`: Returns array of `{ id, title, bounds, app_name }`.
* `focus_window`: Accepts `{ window_id: 123 }`, `{ app_name: "Code" }`, or `{ pid: 999 }`.
* `launch_app`: Launch an app by name. Optional `args` parameter for CLI arguments. If app is already running with no args, brings to front; with args, returns error (use `quit_app` first).
* `quit_app`: Quit a running app. Accepts `app_name` (required) and `force` (boolean, default false).
### 4. Hover Tracking
Track cursor hover transitions across UI elements. Designed for observing user navigation patterns (tooltip triggers, dropdown reveals, panel expansions).
* `start_hover_tracking`: Begin polling session. Inputs: `app_name` (optional), `poll_interval_ms` (default 100), `max_duration_ms` (default 60000), `min_dwell_ms` (default 300 β cursor must stay on element this long before recording).
* `get_hover_events`: Drain buffered events since last call. Returns JSON array of dwell events: `{ timestamp_ms, cursor: {x, y}, element: {role, name, label, bounds, app_name, pid}, dwell_ms }`.
* `stop_hover_tracking`: End session, return remaining events.
Tools appear dynamically β `get_hover_events` and `stop_hover_tracking` only show while tracking is active. Use `element_at_point` with event cursor coordinates for full element details (value, etc.).
#### Hover Tracking Example
```
1. start_hover_tracking(min_dwell_ms=500, app_name="Safari")
2. (user moves mouse around)
3. get_hover_events β [{timestamp_ms: 1200, cursor: {x: 500, y: 300}, element: {role: "AXLink", name: "Home", ...}, dwell_ms: 800}]
4. stop_hover_tracking β [remaining events]
```
### 5. Screen Recording
Record the frontmost app's window as timestamped JPEG frames. Automatically follows app switches β when the user moves to a different app, recording captures the new app's window.
* `start_recording`: Begin recording. Inputs: `output_dir` (required β directory for JPEG frames), `fps` (default 5), `max_duration_ms` (default 60000 = 1 min).
* `stop_recording`: End session, return all frame metadata as JSON array.
Each frame includes: `{ timestamp_ms, path, app_name, window_id, origin_x, origin_y, scale, pixel_width, pixel_height }`.
Tools appear dynamically β `stop_recording` only shows while recording is active.
#### Screen Recording Example
```
1. start_recording(output_dir="/tmp/recording", fps=5)
2. (user interacts with apps)
3. stop_recording β [{timestamp_ms: 1234, path: "/tmp/recording/frame_1234.jpg", app_name: "Safari", ...}, ...]
```
### 6. Browser Automation (CDP)
Connect to Chrome or Electron apps via Chrome DevTools Protocol for DOM-level element targeting. More reliable than screen coordinates for web content β handles offscreen elements, responsive layouts, and iframes.
**Setup:**
- **Chrome 136+:** Must launch with both `--remote-debugging-port=PORT` and `--user-data-dir=PATH` (non-default profile required β Chrome silently ignores the debug port with the default profile). The profile is persistent across launches.
- **Electron apps** (Signal, Discord, Slack, VS Code, etc.): Only need `--remote-debugging-port=PORT`. No `--user-data-dir` required β Electron respects the flag with its default profile.
#### Tools (always listed; calls fail with "No CDP connection" until `cdp_connect` succeeds)
* `cdp_connect(port)`: Connect to a Chrome/Electron debug port. Auto-selects the first page.
* `cdp_disconnect`: Disconnect the CDP client. The CDP tools stay listed; subsequent calls return a "not connected" error until `cdp_connect` is called again.
* `cdp_take_dom_snapshot(max_nodes?)`: Full DOM snapshot of interactive elements β returns UIDs prefixed `d` (e.g., d1, d2) with roles, labels, and parent context. Use when you need the complete page structure; for targeted lookups prefer `cdp_find_elements`. **Always take a fresh snapshot after any navigation or DOM change before resolving UIDs.**
* `cdp_find_elements(query, role?, max_results?)`: **Preferred discovery tool.** Search the live DOM for interactive elements matching a text query. Returns matches with `d`-prefixed UIDs plus a page-level inventory grouped by role β focused results without flooding context.
* `cdp_click(uid, dbl_click?)`: Click an element by UID. Scrolls into view automatically.
* `cdp_hover(uid)`: Hover over an element by UID.
* `cdp_fill(uid, value)`: Type text into an input/textarea or select an option from a `<select>`.
* `cdp_press_key(key)`: Press a key or combo (e.g., `"Enter"`, `"Control+A"`, `"Control+Shift+R"`). Modifiers: Control, Shift, Alt, Meta.
* `cdp_type_text(text, submit_key?)`: Character-by-character keyboard input into a previously focused element. Use `cdp_fill` for form fields; use this for inputs that react to each keypress.
* `cdp_handle_dialog(action, prompt_text?)`: Accept or dismiss a JS dialog (alert, confirm, prompt).
* `cdp_navigate(url?, type?)`: Navigate to a URL, or go back/forward/reload. Type: `"url"` (default), `"back"`, `"forward"`, `"reload"`.
* `cdp_new_page(url)`: Create a new tab and navigate to URL. Becomes the selected page.
* `cdp_close_page(page_idx)`: Close a tab by index. Cannot close the last page.
* `cdp_wait_for(text, timeout?, include_snapshot?)`: Wait for any value in `text` to appear on the page (polls `document.body.innerText`, default 10s timeout). Returns a one-line "text appeared after Xms" header by default; pass `include_snapshot=true` to also append a DOM snapshot.
* `cdp_evaluate_script(function, args?)`: Evaluate JS in the page. No args: `() => document.title`. With element args: `(el) => el.innerText` + `args=[{uid: "d5"}]`.
* `cdp_list_pages`: List open tabs/windows with indices. Selected page marked with `*`.
* `cdp_select_page(page_idx)`: Switch to a tab/window by index.
* `cdp_element_at_point(x, y)`: Given screen coordinates (in points), hit-test the DOM element at that position. Always returns `backend_node_id`. If a current DOM snapshot already contains the node, the response also carries the `d`-prefixed UID, role, and name; otherwise `uid` is `null` with a note to call `cdp_take_dom_snapshot` or `cdp_find_elements`. Requires an active CDP connection.
#### Key Patterns
**Finding elements in large snapshots:** Snapshots can have 500+ elements. Look for elements by their role and name β buttons, links, textboxes, and headings are the most useful. Elements like `StaticText`, `InlineTextBox`, `none`, and `generic` are usually structural and can be skipped when looking for interactive targets.
#### CDP Workflow Examples
**Chrome:**
```
1. launch_app(app_name="Google Chrome", args=["--remote-debugging-port=9222", "--user-data-dir=/tmp/chrome-profile"])
2. cdp_connect(port=9222) β "Connected. Selected page: chrome://new-tab-page/"
3. cdp_navigate(url="https://example.com")
4. cdp_find_elements(query="Search") β matches=[{uid:"d1", role:"textbox", label:"Search", ...}]
5. cdp_fill(uid="d1", value="search query")
6. cdp_press_key(key="Enter")
7. cdp_wait_for(text=["Results"])
8. cdp_find_elements(query="Submit") β matches=[{uid:"d1", role:"button", label:"Submit", ...}]
9. cdp_click(uid="d1") β "Clicked uid=d1 'Submit' (button) at (200, 300)"
```
**Electron app (e.g., Slack, Discord, Signal):**
```
1. quit_app(app_name="MyElectronApp")
2. launch_app(app_name="MyElectronApp", args=["--remote-debugging-port=9333"])
3. cdp_connect(port=9333) β "Connected. Selected page: file:///...app.html"
4. cdp_take_dom_snapshot() β uid=d1 button "New Chat" tag=button ... (full interactive surface)
5. cdp_click(uid="d5") β click a list item or button
6. cdp_take_dom_snapshot() β fresh snapshot of the new view
7. cdp_fill(uid="d12", value="hello")
```
### 7. Android Device Control
Android support is built into every release. No feature flag or separate build is required.
Android tools use the `android_` prefix. Device management tools are always available; all other tools appear after connecting to a device.
#### Device Management
* `android_list_devices`: Lists connected ADB devices. Returns `[{ "serial": "abc123", "state": "device" }]`.
* `android_connect`: Connect to a device. **Input:** `serial` (string). Unlocks all other `android_*` tools.
* `android_disconnect`: Disconnect from the current device.
#### Vision
* `android_screenshot`: Captures the device screen. Returns a PNG image + metadata `{ "device": "abc123", "width": 1080, "height": 2400, "scale": 1.0 }`.
* `android_find_text`: Find UI elements by text (case-insensitive substring). Uses `uiautomator dump` to search the accessibility tree. **Input:** `text` (string). Returns `[{ "text": "OK", "x": 540, "y": 1200, "bounds": { "x": 480, "y": 1170, "width": 120, "height": 60 } }]`.
#### Input
* `android_click`: Tap at screen coordinates. **Inputs:** `x`, `y` (numbers).
* `android_swipe`: Swipe between two points. **Inputs:** `start_x`, `start_y`, `end_x`, `end_y` (numbers), `duration_ms` (optional).
* `android_type_text`: Type text on the device. **Input:** `text` (string). Handles shell escaping automatically.
* `android_press_key`: Press a key. **Input:** `key` (string, e.g., `"KEYCODE_HOME"`, `"KEYCODE_BACK"`, `"KEYCODE_ENTER"`).
#### App & Display Info
* `android_launch_app`: Launch an app. **Input:** `package` (string, e.g., `"com.android.settings"`).
* `android_list_apps`: List installed packages.
* `android_get_display_info`: Returns `{ "width": 1080, "height": 2400, "density": 440 }`.
* `android_get_current_activity`: Returns the current foreground activity component.
#### Android Workflow Example
```
1. android_list_devices β [{"serial": "abc123", "state": "device"}]
2. android_connect(serial="abc123") β Connected
3. android_screenshot β [image + metadata]
4. android_find_text(text="Settings") β [{"text": "Settings", "x": 540, "y": 800, ...}]
5. android_click(x=540, y=800) β Tapped
6. android_screenshot β Verify result
```
**Note:** Android coordinates are absolute screen pixels (no scale conversion needed). Use `x`/`y` from `android_find_text` directly with `android_click`.
---
## π Coordinate Systems & Best Practices
**CRITICAL:** There are two ways to target clicks. Choose ONE based on your data source.
### Method A: Absolute Screen Coordinates (Recommended)
Use this when you have data from `find_text` OR `take_screenshot` (OCR results).
* **Source:** `find_text` returns `{ "x": 500, "y": 300 }`.
* **Action:** `click(x=500, y=300)`.
* **Why:** These are already global screen coordinates.
### Method B: Relative Screenshot Coordinates
Use this when you (the model) look at the *image* from `take_screenshot` and estimate positions (e.g., "The icon is at 50% width").
* **Source:** `take_screenshot` returns metadata `{ "screenshot_origin_x": 100, "screenshot_origin_y": 100, "screenshot_scale": 2.0, "screenshot_pixel_width": 1920, "screenshot_pixel_height": 1080 }`.
* **Your Vision:** You see a button at pixel `(x=50, y=50)` inside the image.
* **Action:** `click(screenshot_x=50, screenshot_y=50, screenshot_origin_x=100, screenshot_origin_y=100, screenshot_scale=2.0)`.
* **Why:** The tool handles the math to convert image-pixels to screen-pixels.
**Manual conversion (for tools that only accept screen coordinates, e.g. `drag`):**
* `screen_x = screenshot_origin_x + (screenshot_x / screenshot_scale)`
* `screen_y = screenshot_origin_y + (screenshot_y / screenshot_scale)`
---
## β‘ Intent Examples (Chain of Thought)
### "Click the 'Save' button in Notepad"
1. **Thought:** I need to find the text "Save" in the app "Notepad".
2. **Call:** `focus_window(app_name="Notepad")`
3. **Call:** `find_text(text="Save")` -> Returns `[{"text":"Save","x":200,"y":400,...}]`
4. **Call:** `click(x=200, y=400)`
### "Draw a circle in Paint"
1. **Thought:** Text search won't work for a canvas. I need to see the screen.
2. **Call:** `take_screenshot(app_name="Paint")`
3. **Analysis:** I see the canvas center at pixel (500, 500) in the image.
4. **Compute:** `start_x = screenshot_origin_x + 500 / screenshot_scale`, `start_y = screenshot_origin_y + 500 / screenshot_scale`
5. **Call:** `drag(start_x=..., start_y=..., end_x=..., end_y=...)`
### "Copy text from this window"
1. **Thought:** I can read text directly from the screenshot OCR data without using the clipboard.
2. **Call:** `take_screenshot(include_ocr=true)`
3. **Action:** Read the OCR summary text in the response (lines include clickable coordinates).
### "Open the Wi-Fi pane in System Settings (macOS)"
1. **Thought:** This is a native macOS app. The AX-dispatch branch is preferred β no mouse movement, no focus steal, and sidebar rows are a perfect fit for `ax_select`.
2. **Call:** `take_ax_snapshot(app_name="System Settings")` β tree with a row `AXRow "Wi-Fi"` at `uid="a18g3"`.
3. **Thought:** Sidebar rows are backed by `NSOutlineView` β they refuse `AXPress`, so use `ax_select` rather than `ax_click`.
4. **Call:** `ax_select(uid="a18g3")` β `{ "ok": true, "dispatched_via": "AXSelectedRows" }`.
5. **Call:** `take_ax_snapshot(app_name="System Settings")` β verify the Wi-Fi detail pane is visible. The generation bumps; any earlier uids are now stale.
### "Fill the search field in Finder (macOS)"
1. **Thought:** Native macOS app with a text field. `ax_set_value` writes `kAXValueAttribute` directly β but it doesn't fire keydown/keyup and doesn't feed IME, so if the app has key-handlers fall back to `click` + `type_text`.
2. **Call:** `take_ax_snapshot(app_name="Finder")` β `AXSearchField` at `uid="a7g2"`.
3. **Call:** `ax_set_value(uid="a7g2", text="invoice.pdf")` β `{ "ok": true, "dispatched_via": "AXSetAttributeValue" }`.
4. **Fallback on `not_dispatchable`:** `click(fallback.x, fallback.y)` to focus, then `type_text("invoice.pdf")` for real key events.
---
## πΌοΈ Template Matching (Advanced Vision)
For finding non-text UI elements like icons, shapes, or specific visual patterns, use the `find_image` tool with template matching.
### `load_image`
Load an image from a local file path and cache it for use with `find_image`.
* **Inputs:**
* `path` (string, required): Local filesystem path to the image file.
* `id_prefix` (string, optional): Prefix for the generated ID (e.g., "template", "mask").
* `max_width`, `max_height` (integer, optional): Downscale constraints (maintains aspect ratio).
* `as_mask` (boolean, default `false`): Convert to single-channel grayscale mask.
* `return_base64` (boolean, default `false`): Include base64-encoded image data in response.
* **Returns (JSON):**
```json
{
"image_id": "template-0",
"width": 64,
"height": 64,
"channels": 4,
"mime": "image/png",
"sha256": "abc123..."
}
```
### `find_image`
Find a template image within a screenshot using template matching. Returns precise click coordinates.
* **Inputs:**
* `screenshot_id` (string, optional): Screenshot ID from `take_screenshot` (preferred).
* `screenshot_image_base64` (string, optional): Base64-encoded screenshot (if no screenshot_id).
* `template_id` (string, optional): Image ID from `load_image` (preferred).
* `template_image_base64` (string, optional): Base64-encoded template (if no template_id).
* `mask_id` (string, optional): Image ID for the mask (from `load_image`).
* `mask_image_base64` (string, optional): Base64-encoded mask (white=match, black=ignore).
* `mode` (string, default `"fast"`): `"fast"` or `"accurate"`. Fast uses downscaling/early-exit for speed; accurate uses full-res, wider scales, smaller stride.
* `threshold` (number, optional): Minimum match score 0.0-1.0.
* `max_results` (integer, optional): Maximum matches to return.
* `scales` (object, optional): Scale search range `{min, max, step}`.
* `rotations` (array, optional): Rotations to try in degrees (only 0, 90, 180, 270 supported).
* **Returns (JSON):**
```json
{
"matches": [
{
"score": 0.95,
"bbox": {"x": 100, "y": 200, "w": 64, "h": 64},
"center": {"x": 132, "y": 232},
"scale": 1.0,
"rotation": 0,
"screen_x": 166,
"screen_y": 216
}
]
}
```
### Template Matching Example Flow
```
1. take_screenshot(app_name="MyApp") β screenshot_id: "screenshot-0"
2. load_image(path="/path/to/icon.png") β image_id: "image-0"
3. find_image(screenshot_id="screenshot-0", template_id="image-0")
β matches: [{screen_x: 150, screen_y: 200, ...}]
4. click(x=150, y=200)
```