vectordb-cli 1.3.2-stable

A CLI tool for semantic code search.
# vectordb-cli

A lightweight command-line tool for fast, local code search using semantic retrieval powered by ONNX models and Qdrant. Now with multi-repository and branch-aware indexing!

**Note:** This repository contains both the `vectordb-cli` command-line tool and the underlying `vectordb_lib` library.

## Table of Contents

-   [Features](#features)
-   [Use Cases](#use-cases)
-   [Supported Languages](#supported-languages)
-   [Setup](#setup)
    -   [Prerequisites](#prerequisites)
    -   [Qdrant Setup](#qdrant-setup)
    -   [Environment Setup Guides](#environment-setup-guides)
-   [Installation](#installation)
-   [Configuration](#configuration)
    -   [Environment Variables](#environment-variables)
    -   [Configuration File (`config.toml`)](#configuration-file-configtoml)
-   [Usage (CLI)](#usage-cli)
    -   [Global Options](#global-options)
    -   [Simple Commands (`simple`)](#simple-commands-simple)
        -   [`simple index`](#simple-index)
        -   [`simple query`](#simple-query)
        -   [`simple clear`](#simple-clear)
    -   [Repository Management (`repo`)](#repository-management-repo)
        -   [`repo add`](#repo-add)
        -   [`repo list`](#repo-list)
        -   [`repo use`](#repo-use)
        -   [`repo remove`](#repo-remove)
        -   [`repo use-branch`](#repo-use-branch)
        -   [`repo sync`](#repo-sync)
        -   [`repo clear`](#repo-clear)
        -   [`repo query`](#repo-query)
        -   [`repo stats`](#repo-stats)
-   [Library (`vectordb_lib`)](#library-vectordb_lib)

## Features

-   **Semantic Search:** Finds relevant code chunks based on meaning using ONNX models.
-   **Repository Management:** Manage configurations for multiple Git repositories.
-   **Branch-Aware Indexing:** Track and sync specific branches within repositories.
-   **Qdrant Backend:** Utilizes a Qdrant vector database instance for scalable storage and efficient search.
-   **Local or Remote Qdrant:** Can connect to a local Dockerized Qdrant or a remote instance.
-   **Simple Indexing (Default):** Recursively indexes specified directories (can be used alongside repository management).
-   **Configurable:** Supports custom ONNX embedding models/tokenizers and Qdrant connection details via config file or environment variables.

## Use Cases

-   **Debugging Assistance:** Use semantic search to find potentially related code sections when investigating bugs. Combine with LLMs by providing relevant code snippets found through queries for diagnosis, explanation, or generating flow charts.
-   **Code Exploration & Understanding:** Quickly locate definitions, implementations, or usages of functions, classes, or variables across large codebases or multiple repositories, even if you don't know the exact name.
-   **Finding Examples:** Locate examples of how a particular API, library function, or design pattern is used within your indexed code.
-   **Onboarding:** Help new team members find relevant code sections related to specific features or concepts they need to learn.
-   **Building AI Coding Tools:** Integrate the `vectordb_lib` library into your own AI-powered development tools, agents, or custom workflows.
-   **Documentation Search:** Index and search through Markdown documentation alongside code (Note: Current Markdown parsing is basic but will be improved).
-   **Refactoring & Auditing:** Identify code locations potentially affected by refactoring or search for specific patterns related to security or best practices.

## Supported Languages

The CLI uses tree-sitter for Abstract Syntax Tree (AST) parsing to extract meaningful code chunks (like functions, classes, structs) for indexing. This leads to more contextually relevant search results compared to simple line-based splitting.
Here is the current status of language support:

| Language   | Status         | Supported Elements                                                                  |
| :--------- | :------------- | :---------------------------------------------------------------------------------- |
| Rust       | ✅ Supported | functions, structs, enums, impls, traits, mods, macros, use, extern crates, type aliases, unions, statics, consts |
| Ruby       | ✅ Supported | modules, classes, methods, singleton_methods                                        |
| Go         | ✅ Supported | functions, methods, types (struct/interface), consts, vars                        |
| Python     | ✅ Supported | functions, classes, top-level statements                                            |
| JavaScript | ✅ Supported | functions, classes, methods, assignments                                          |
| TypeScript | ✅ Supported | functions, classes, methods, interfaces, enums, types, assignments                |
| Markdown   | ✅ Supported | headings, code blocks, list items, paragraphs                                       |
| YAML       | ✅ Supported | documents                                                                           |
| Other      | ✅ Supported | Whole file chunk (fallback_chunk)                                                 |

Files with unsupported extensions will automatically use the whole-file fallback mechanism.

**Planned Languages:**

Support for the following languages is planned for future releases:

*   Java (`.java`)
*   C# (`.cs`)
*   C++ (`.cpp`, `.h`, `.hpp`)
*   C (`.c`, `.h`)
*   PHP (`.php`)
*   Swift (`.swift`)
*   Kotlin (`.kt`, `.kts`)
*   HTML (`.html`)
*   CSS (`.css`)
*   JSON (`.json`)

## Setup

### Prerequisites

-   **Rust:** Required for building the project. Install from [rustup.rs](https://rustup.rs/).
    ```bash
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    # After installing rustup, source the Cargo environment script or restart your terminal
    source "$HOME/.cargo/env"
    ```
-   **Git:** Required for repository management features (`repo add`, `repo sync`, etc.).
-   **Build Tools:** Rust often requires a C linker and build tools.
    -   **Linux (Debian/Ubuntu):**
        ```bash
        sudo apt-get update && sudo apt-get install build-essential git-lfs libssl-dev pkg-config
        ```
    -   **macOS:** Install the Xcode Command Line Tools. If you don't have Xcode installed, running the following command in your terminal will prompt you to install them:
        ```bash
        xcode-select --install
        ```
        Install required packages using Homebrew:
        ```bash
        brew install git-lfs pkg-config
        ```
-   **Qdrant:** A Qdrant instance (v1.7.0 or later recommended) must be running and accessible. See [Qdrant Setup](#qdrant-setup).
-   **ONNX Model Files:** An ONNX embedding model and its corresponding tokenizer files are required. See [Installation](#installation) and [Configuration](#configuration).

### Qdrant Setup

`vectordb-cli` requires a running Qdrant instance. Each managed repository will have its own collection in Qdrant, named `repo_<repository_name>`.

**Option 1: Docker (Recommended for Local Use)**

```bash
docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
    qdrant/qdrant:latest
```

This starts Qdrant with its default HTTP/REST port (6333, which also serves the web UI) and gRPC port (6334, used by `vectordb-cli`) mapped to your host. Data is persisted in the `qdrant_storage` directory under your current working directory.
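
Before indexing, it can help to confirm that the container is actually answering. The following is a minimal sketch, assuming Qdrant's REST API is reachable on host port 6333 (Qdrant's default REST port) and that `curl` is installed. `QDRANT_REST_URL` and the function names are hypothetical helpers for this example; `vectordb-cli` does not read them.

```bash
# Hypothetical variable for this sketch only; vectordb-cli does not read it.
QDRANT_REST_URL="${QDRANT_REST_URL:-http://localhost:6333}"

# Collection name used for a managed repository: repo_<repository_name>.
repo_collection_name() {
    printf 'repo_%s\n' "$1"
}

# Returns 0 if the Qdrant REST API answers its health endpoint.
qdrant_is_up() {
    curl --silent --fail --max-time 5 "${QDRANT_REST_URL}/healthz" > /dev/null
}

# Returns 0 if the collection for the given repository name exists.
repo_collection_exists() {
    curl --silent --fail --max-time 5 \
        "${QDRANT_REST_URL}/collections/$(repo_collection_name "$1")" > /dev/null
}
```

For example, `qdrant_is_up && echo "Qdrant is reachable"` checks the server, and `repo_collection_exists my-project` checks whether a repository has been indexed yet.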

**Option 2: Qdrant Cloud or Other Deployment**

Follow the instructions for your chosen deployment method. You will need the **URL** (including `http://` or `https://` and the port, typically 6334 for gRPC) and, if your setup requires it, an **API Key**.

### Environment Setup Guides

For specific environment configurations (GPU acceleration), refer to the guides in the `docs/` directory:

-   [docs/CUDA_SETUP.md](./docs/CUDA_SETUP.md) (Linux with NVIDIA GPU)
-   [docs/MACOS_GPU_SETUP.md](./docs/MACOS_GPU_SETUP.md) (macOS with Metal GPU)
-   [docs/CODEBERT_SETUP.md](./docs/CODEBERT_SETUP.md) (Using CodeBERT model - *may be outdated*)

## Installation

1.  **Clone the Repository:**
    ```bash
    git clone https://gitlab.com/amulvany/vectordb-cli.git
    cd vectordb-cli
    ```

2.  **Prepare ONNX Model & Tokenizer:**
    Download or obtain your desired ONNX embedding model (`.onnx` file) and its tokenizer configuration (`tokenizer.json` and potentially other files like `vocab.txt`, `merges.txt`, etc., usually in a single directory). Place them in a known location. See [Configuration](#configuration) for how to tell the tool where these are.

    **Using the Example Model:** This repository includes an example `all-MiniLM-L6-v2` model in the `onnx/` directory, managed via Git LFS. If you followed the prerequisites and installed Git LFS, Git should handle pulling the model files automatically when you clone or pull updates. If the `.onnx` file in `onnx/model/` is small (a pointer file), you might need to run `git lfs pull` manually.

    **Note:** The tool dynamically detects the embedding dimension from the provided `.onnx` model.

3.  **Build:**
    *   **Standard (CPU):**
        ```bash
        cargo build --release
        ```
    *   **With CUDA GPU Support (Linux):** Ensure you have NVIDIA drivers, the CUDA toolkit, and `cudnn` installed (see [docs/CUDA_SETUP.md](./docs/CUDA_SETUP.md)). Then build with:
        ```bash
        cargo build --release --features ort/cuda
        ```
    *   **With Metal GPU Support (macOS):** (See [docs/MACOS_GPU_SETUP.md](./docs/MACOS_GPU_SETUP.md))
        ```bash
        cargo build --release --features ort/coreml # Or ort/metal if preferred/available
        ```

4.  **Understanding the Build Process (Linux/macOS):**
    *   The project uses a build script (`build.rs`) to simplify setup.
    *   During the build, this script automatically finds the necessary ONNX Runtime libraries (downloaded by the `ort` crate to `~/.cache/ort.pyke.io/`) including provider-specific libraries (like CUDA `.so` files or macOS `.dylib` files).
    *   It copies these libraries into the final build output directory (`target/release/lib/`).
    *   It sets the necessary RPATH (`$ORIGIN/lib` on Linux, `@executable_path/lib` on macOS) on the `vectordb-cli` executable.
    *   This means you typically **do not** need to manually set `LD_LIBRARY_PATH` (Linux) or `DYLD_LIBRARY_PATH` (macOS).

5.  **Install Binary (Optional):** Symlink the compiled binary to a location in your `PATH`.
    ```bash
    # Example for Linux/macOS to set it up globally
    sudo ln -s "$PWD/target/release/vectordb-cli" /usr/local/bin
    ```
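
Step 2 mentions that the example `.onnx` file may be only a Git LFS pointer. One way to check is sketched below, relying on the fact that LFS pointer files are small text files whose first line names the LFS spec. The function name `is_lfs_pointer` is a hypothetical helper, and the loop assumes you run it from the repository root.

```bash
# Returns 0 if the file is a Git LFS pointer stub rather than real content.
is_lfs_pointer() {
    # Pointer files begin with a line naming the LFS spec version.
    head -c 200 "$1" 2>/dev/null |
        grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Check every .onnx file under onnx/model/ (run from the repository root).
for f in onnx/model/*.onnx; do
    [ -e "$f" ] || continue    # the glob matched nothing
    if is_lfs_pointer "$f"; then
        echo "$f is an LFS pointer; run: git lfs pull"
    fi
done
```

If nothing is printed, the files found are real model data (or there are no `.onnx` files at that path).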

## Configuration

`vectordb-cli` uses a hierarchical configuration system:

1.  **Command-line Arguments:** Highest priority (e.g., `--onnx-model`, `--onnx-tokenizer-dir`; see [Global Options](#global-options)).
2.  **Environment Variables:** Second priority.
3.  **Configuration File (`config.toml`):** Lowest priority.
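
The precedence above amounts to "first non-empty value wins". The sketch below illustrates that ordering only; it is not the tool's actual implementation, and `resolve_setting` is a hypothetical name.

```bash
# Print the first non-empty value in priority order:
# command-line argument, then environment variable, then config file.
resolve_setting() {
    local cli_value="$1" env_value="$2" file_value="$3"
    if [ -n "$cli_value" ]; then
        printf '%s\n' "$cli_value"
    elif [ -n "$env_value" ]; then
        printf '%s\n' "$env_value"
    else
        printf '%s\n' "$file_value"
    fi
}

# No CLI argument given, so the environment value is chosen.
resolve_setting "" "/env/model.onnx" "/config/model.onnx"
```

In practice this means a path in `config.toml` acts as a default that an environment variable or a flag can override for a single run.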

### Environment Variables

-   `QDRANT_URL`: URL of the Qdrant gRPC endpoint. Defaults to `http://localhost:6333` if not set; a stock Qdrant serves gRPC on port 6334, so set this explicitly (e.g., `http://localhost:6334`) if the default does not connect.
-   `QDRANT_API_KEY`: API key for Qdrant authentication (optional).
-   `VECTORDB_ONNX_MODEL`: Full path to the `.onnx` model file.
-   `VECTORDB_ONNX_TOKENIZER_DIR`: Full path to the directory containing the `tokenizer.json` file.
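
As an illustration, the two model variables could be set in a shell profile and sanity-checked before running the tool. The paths are placeholders, and `check_onnx_env` is a hypothetical helper rather than part of `vectordb-cli`.

```bash
# Placeholder paths; replace with the real locations on your machine.
export VECTORDB_ONNX_MODEL="/path/to/your/model.onnx"
export VECTORDB_ONNX_TOKENIZER_DIR="/path/to/your/tokenizer_directory"

# Returns 0 only if the model file and tokenizer.json both exist.
check_onnx_env() {
    local status=0
    if [ ! -f "${VECTORDB_ONNX_MODEL:-}" ]; then
        echo "model file not found: ${VECTORDB_ONNX_MODEL:-<unset>}" >&2
        status=1
    fi
    if [ ! -f "${VECTORDB_ONNX_TOKENIZER_DIR:-}/tokenizer.json" ]; then
        echo "tokenizer.json not found in: ${VECTORDB_ONNX_TOKENIZER_DIR:-<unset>}" >&2
        status=1
    fi
    return "$status"
}
```

Running `check_onnx_env && echo "ONNX paths look good"` catches a mistyped path before an indexing run fails on it.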

### Configuration File (`config.toml`)

The tool looks for a `config.toml` file in the XDG configuration directory:

*   **Linux/macOS:** `~/.config/vectordb-cli/config.toml`

**Example `config.toml`:**

```toml
# URL for the Qdrant gRPC endpoint
qdrant_url = "http://localhost:6334"

# --- Optional: Qdrant API Key ---
# api_key = "your_qdrant_api_key"

# --- Optional: ONNX Model Configuration ---
# These are only needed if not provided via args or env vars.

# Path to the ONNX model file
onnx_model_path = "/path/to/your/model.onnx"

# Path to the directory containing tokenizer.json
# Note: Key name is `onnx_tokenizer_path`
onnx_tokenizer_path = "/path/to/your/tokenizer_directory"

# --- Repository Management ---
# The active repository (used by default for commands like sync, query)
# Set via `repo use <name>`
active_repository = "my-project"

# List of managed repositories
[[repositories]]
name = "my-project"
# Local path where the repository was cloned
local_path = "/home/user/dev/my-project"
# Branches tracked by `repo sync`
tracked_branches = ["main", "develop"]
# The branch currently checked out locally
active_branch = "main" # Updated automatically by `repo use-branch`
# Last commit hash synced for each tracked branch
# Updated automatically by `repo sync`
[repositories.last_synced_commits]
main = "a1b2c3d4e5f6..."
develop = "f6e5d4c3b2a1..."

[[repositories]]
name = "another-repo"
local_path = "/home/user/dev/another-repo"
tracked_branches = ["release-v1"]
active_branch = "release-v1"
[repositories.last_synced_commits]
release-v1 = "deadbeef..."

# ... other repositories ...
```

**Note:** You *must* provide the ONNX model and tokenizer paths via one of these methods (arguments, environment variables, or config file) for commands like `simple index`, `simple query`, `repo query`, and `repo sync` to work. The `repositories` section is managed automatically by the `repo` subcommands.
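
For a first-time setup, a starting file with only the connection and model keys is enough, since the `repo` subcommands fill in the rest. The following sketch writes such a file using the keys from the example above; `write_minimal_config` is a hypothetical helper, and it refuses to overwrite an existing file.

```bash
# Write a minimal config.toml unless one already exists.
# Arguments: <model.onnx path> <tokenizer directory> [config directory]
write_minimal_config() {
    local model="$1" tokenizer_dir="$2"
    local config_dir="${3:-$HOME/.config/vectordb-cli}"
    local config_file="$config_dir/config.toml"

    if [ -e "$config_file" ]; then
        echo "refusing to overwrite $config_file" >&2
        return 1
    fi

    mkdir -p "$config_dir"
    cat > "$config_file" <<EOF
qdrant_url = "http://localhost:6334"
onnx_model_path = "$model"
onnx_tokenizer_path = "$tokenizer_dir"
EOF
    echo "wrote $config_file"
}
```

Usage: `write_minimal_config /path/to/your/model.onnx /path/to/your/tokenizer_directory`. Paths containing double quotes or backslashes would need TOML escaping, which this sketch does not attempt.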

## Usage (CLI)

This section focuses on the `vectordb-cli` command-line tool.

### Global Options

These options can be used with most commands:

-   `-m, --onnx-model <PATH>`: Path to the ONNX model file (overrides config & env var).
-   `-t, --onnx-tokenizer-dir <PATH>`: Path to the ONNX tokenizer directory (overrides config & env var).

### Simple Commands (`simple`)

These commands operate on a default, non-repository-specific Qdrant collection (`vectordb-code-search`).

#### `simple index`

Recursively indexes files in specified directories or specific files into the default collection.

```bash
vectordb-cli simple index <PATHS>... [-e <ext>] [--extension <ext>]
```

-   `<PATHS>...`: One or more file or directory paths to index.
-   `-e <ext>`, `--extension <ext>` (Optional): Filter by file extension (without the dot, e.g., `-e rs`, `-e py`). If omitted, files are parsed according to their known extensions.

#### `simple query`

Performs a semantic search against the default collection.

```bash
vectordb-cli simple query "<query text>" [-l <limit>] [--lang <language>] [--type <element_type>]
```

-   `<query text>`: The natural language query.
-   `-l <limit>`, `--limit <limit>` (Optional): Max number of results (default: 10).
-   `--lang <language>` (Optional): Filter by language (e.g., `rust`, `python`).
-   `--type <element_type>` (Optional): Filter by code element type (e.g., `function`).

#### `simple clear`

Deletes the entire simple index collection (`vectordb-code-search`). This does **not** affect repository indices. Requires confirmation unless `-y` is provided.

```bash
vectordb-cli simple clear [-y]
```
-   `-y`: Confirm deletion without prompting.
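
Putting the `simple` commands together, an index-then-query session might look like the sketch below. It uses only the flags documented above; the `run` and `simple_search` helpers and the `DRY_RUN` variable are hypothetical. By default the script only prints the commands so they can be reviewed; set `DRY_RUN=0` to execute them.

```bash
# Print each command; execute it only when DRY_RUN=0.
run() {
    printf '+'
    printf ' %q' "$@"
    printf '\n'
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Index the Rust files in a directory, then search them for functions.
# Arguments: <directory> <query text>
simple_search() {
    local dir="$1" query="$2"
    run vectordb-cli simple index "$dir" -e rs &&
        run vectordb-cli simple query "$query" -l 5 --lang rust --type function
}

simple_search ./src "parse the configuration file"
```

Because both commands target the shared `vectordb-code-search` collection, results may include files from earlier `simple index` runs until `simple clear` is used.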

### Repository Management (`repo`)

This subcommand group manages configurations for Git repositories, allowing you to index and query specific branches within dedicated Qdrant collections (`repo_<repository_name>`).

#### `repo add`

Clones a Git repository locally (if not already present) and adds it to the managed list.

```bash
vectordb-cli repo add --url <repo-url> [--local-path <path>] [--name <repo-name>] [--branch <branch-name>] [--remote <remote_name>] [--ssh-key <path>] [--ssh-passphrase <passphrase>]
```

-   `--url <repo-url>`: The URL of the Git repository (HTTPS or SSH).
-   `--local-path <path>` (Optional): Local directory to clone into (defaults to `<config_dir>/repos/<repo_name>`).
-   `--name <repo-name>` (Optional): Name for the repository configuration (defaults to deriving from URL).
-   `--branch <branch-name>` (Optional): Initial branch to track (defaults to the repo's default).
-   `--remote <remote_name>` (Optional): Name for the Git remote (defaults to "origin").
-   `--ssh-key <path>` (Optional): Path to the SSH private key file for authentication.
-   `--ssh-passphrase <passphrase>` (Optional): Passphrase for the SSH key.

#### `repo list`

Lists all configured repositories, their URLs, local paths, tracked branches, and detected indexed languages. Indicates the active repository with a `*`.

```bash
vectordb-cli repo list
```

Example Output:
```
Managed Repositories:
 * my-project (https://github.com/user/my-project.git) -> /home/user/.config/vectordb-cli/repos/my-project
     Default Branch: main
     Active Branch: main
     Tracked Branches: ["main", "develop"]
     Indexed Languages: rust, markdown
   another-repo (https://github.com/user/another.git) -> /home/user/.config/vectordb-cli/repos/another-repo
     Default Branch: main
     Active Branch: main
     Tracked Branches: ["main"]
     Indexed Languages: python
```
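
If you need the repository names in a script, the listing can be parsed with `awk`. This is a sketch that assumes the output keeps the layout shown in the example above (a `*` marker for the active entry and a ` -> ` separator before the local path); the function names are hypothetical, and the format may change between versions.

```bash
# Read `vectordb-cli repo list` output on stdin; print the active repository.
active_repo() {
    awk '$1 == "*" { print $2; exit }'
}

# Read the same output on stdin; print every repository name, one per line.
repo_names() {
    awk '/ -> / { name = ($1 == "*") ? $2 : $1; print name }'
}
```

Usage: `vectordb-cli repo list | active_repo` or `vectordb-cli repo list | repo_names`.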

#### `repo use`

Sets a repository as the active one, used by default for other `repo` subcommands like `query`, `sync`, `use-branch`, `clear`, `stats`.

```bash
vectordb-cli repo use <name>
```
-   `<name>`: (Required) The name of the repository configuration to activate.

#### `repo remove`

Removes a repository configuration and its corresponding Qdrant collection (`repo_<name>`). This also removes the local clone by default.

```bash
vectordb-cli repo remove <name> [-y]
```
-   `<name>`: (Required) The name of the repository configuration to remove.
-   `-y`: Skip confirmation prompt.

**This operation is irreversible and deletes the Qdrant data and local clone.**

#### `repo use-branch`

Checks out a specific branch in the active repository locally and adds it to the list of tracked branches for syncing.

```bash
vectordb-cli repo use-branch <branch_name>
```
-   `<branch_name>`: (Required) The name of the branch to check out and track. Fetches from the configured remote if the branch isn't available locally.

#### `repo sync`

Fetches updates from the configured remote for the *currently checked-out, tracked branch* of the active repository (or specified repository). It calculates changes since the last sync and updates the Qdrant index accordingly (adding/modifying/deleting points).

```bash
vectordb-cli repo sync [-n <name>] [--name <name>] [-e <ext>,...] [--extensions <ext>,...] [--force]
```
-   `-n <name>`, `--name <name>` (Optional): Name of the repository to sync. Defaults to the active repository.
-   `-e <ext>,...`, `--extensions <ext>,...` (Optional): Specify file extensions to sync (without the dot, comma-separated or multiple flags: `-e rs,py` or `-e rs -e py`). If omitted, syncs files matching known parsers.
-   `--force` (Optional): Force a full re-index of the specified files for the branch, ignoring the last synced commit state.
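
Because `repo sync` only updates the branch that is currently checked out, keeping several branches indexed means switching to each one in turn. The loop below is a sketch of that pattern using the documented `repo use`, `repo use-branch`, and `repo sync` commands. `sync_branches` and the `VECTORDB_CLI` override are hypothetical conveniences for this example.

```bash
# Sync several branches of one managed repository, one after another.
# Arguments: <repository name> <branch>...
# VECTORDB_CLI (hypothetical) overrides the binary, e.g. for a dry run.
sync_branches() {
    local cli="${VECTORDB_CLI:-vectordb-cli}"
    local repo="$1" branch
    shift

    "$cli" repo use "$repo" || return 1
    for branch in "$@"; do
        # use-branch checks the branch out and adds it to the tracked list.
        "$cli" repo use-branch "$branch" || return 1
        "$cli" repo sync || return 1
    done
}
```

Usage: `sync_branches my-project main develop`. Note the side effects: the named repository becomes the active one, and the last branch in the list stays checked out in the local clone.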

#### `repo clear`

Clears the index (Qdrant collection `repo_<repo_name>`) for a specific repository without removing the repository configuration or local clone. Requires confirmation unless `-y` is provided.

```bash
vectordb-cli repo clear [-n <name>] [--name <name>] [-y]
```
-   `-n <name>`, `--name <name>` (Optional): The name of the repository index to clear. Defaults to the *active* repository.
-   `-y`: Confirm deletion without prompting.

**This operation is irreversible.**

#### `repo query`

Performs a semantic search across the indexed data for the *active repository*.

```bash
vectordb-cli repo query "<query text>" [-l <limit>] [--lang <language>] [--type <element_type>]
```
-   `<query text>`: The natural language query.
-   `-l <limit>`, `--limit <limit>` (Optional): Max number of results (default: 10).
-   `--lang <language>` (Optional): Filter by language (e.g., `rust`, `python`).
-   `--type <element_type>` (Optional): Filter by code element type (e.g., `function`).

Results display file paths (relative to the repository root), line numbers, scores, and the relevant code chunk.

#### `repo stats`

Displays statistics (like point count) about the Qdrant collection for the *active repository*.

```bash
vectordb-cli repo stats
```

## Library (`vectordb_lib`)

This crate also provides the `vectordb_lib` library, which contains the core logic for configuration, code parsing, embedding management, and interacting with the vector database.

While the CLI provides a convenient interface, you can use the library programmatically for more custom integrations.

*   **Quickstart Guide:** [docs/library_quickstart.md](./docs/library_quickstart.md)
*   **API Documentation:** [https://docs.rs/vectordb-cli](https://docs.rs/vectordb-cli)

See the crate-level documentation within the library (`src/lib.rs`) for a conceptual example and overview of the main components like `EmbeddingHandler`.

**Important Runtime Dependency:**

Users of the `vectordb_lib` library must ensure the ONNX Runtime shared libraries are available when running their application. This is because the library itself does not bundle these dependencies.

Refer to the [ONNX Runtime installation guide](https://onnxruntime.ai/docs/install/) for instructions on how to install the runtime system-wide, or ensure the necessary shared library files (`.so`/`.dylib`/`.dll`) are discoverable via the system's library path (e.g., using `LD_LIBRARY_PATH` on Linux).
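
To see whether the runtime is discoverable before launching your application, you can scan the library search path for the ONNX Runtime shared library. This is a rough sketch: `find_onnxruntime` is a hypothetical helper, it looks only at the directories you pass (or `LD_LIBRARY_PATH`), and it does not consult system-wide locations known to the dynamic linker.

```bash
# Print the first ONNX Runtime shared library found on a colon-separated
# search path (defaults to LD_LIBRARY_PATH). Returns 1 if none is found.
find_onnxruntime() {
    local search_path="${1:-${LD_LIBRARY_PATH:-}}" dir lib
    local IFS=':'
    for dir in $search_path; do
        [ -n "$dir" ] || continue
        for lib in "$dir"/libonnxruntime.so* "$dir"/libonnxruntime*.dylib; do
            if [ -e "$lib" ]; then
                printf '%s\n' "$lib"
                return 0
            fi
        done
    done
    return 1
}

find_onnxruntime || echo "libonnxruntime not found on LD_LIBRARY_PATH" >&2
```

A miss here is not conclusive, because a system-wide installation can still be resolved by the dynamic linker without `LD_LIBRARY_PATH`.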

## Development

The project has 42% unit test coverage and thorough end-to-end testing for key features.

```bash
# Run tests
cargo test

# Run clippy
cargo clippy --all-targets -- -D warnings

# Format code
cargo fmt
```

## Contributing

(Contribution guidelines)

## License

MIT License