vectordb-cli
A lightweight command-line tool for fast, local code search using semantic retrieval powered by ONNX models and Qdrant. Now with multi-repository and branch-aware indexing!
Note: This repository contains both the vectordb-cli command-line tool and the underlying vectordb_lib library.
Note: This is currently not a stable product. It works but has bugs. I am working hard to fix these bugs and increase test coverage (current coverage ~28%).
Table of Contents
- Features
- Use Cases
- Supported Languages
- Setup
- Installation
- Configuration
- Usage (CLI)
- Library (vectordb_lib)
Features
- Semantic Search: Finds relevant code chunks based on meaning using ONNX models.
- Repository Management: Manage configurations for multiple Git repositories.
- Branch-Aware Indexing: Track and sync specific branches within repositories.
- Qdrant Backend: Utilizes a Qdrant vector database instance for scalable storage and efficient search.
- Local or Remote Qdrant: Can connect to a local Dockerized Qdrant or a remote instance.
- Simple Indexing (Default): Recursively indexes specified directories (can be used alongside repository management).
- Configurable: Supports custom ONNX embedding models/tokenizers and Qdrant connection details via config file or environment variables.
Use Cases
- Debugging Assistance: Use semantic search to find potentially related code sections when investigating bugs. Combine with LLMs by providing relevant code snippets found through queries for diagnosis, explanation, or generating flow charts.
- Code Exploration & Understanding: Quickly locate definitions, implementations, or usages of functions, classes, or variables across large codebases or multiple repositories, even if you don't know the exact name.
- Finding Examples: Locate examples of how a particular API, library function, or design pattern is used within your indexed code.
- Onboarding: Help new team members find relevant code sections related to specific features or concepts they need to learn.
- Building AI Coding Tools: Integrate the vectordb_lib library into your own AI-powered development tools, agents, or custom workflows.
- Documentation Search: Index and search through Markdown documentation alongside code (Note: Current Markdown parsing is basic but will be improved).
- Refactoring & Auditing: Identify code locations potentially affected by refactoring or search for specific patterns related to security or best practices.
Supported Languages
The CLI uses tree-sitter for Abstract Syntax Tree (AST) parsing to extract meaningful code chunks (like functions, classes, structs) for indexing. This leads to more contextually relevant search results compared to simple line-based splitting. Here is the current status of language support:
| Language | Status | Supported Elements |
|---|---|---|
| Rust | ✅ Supported | functions, structs, enums, impls, traits, mods, macros, use, extern crates, type aliases, unions, statics, consts |
| Ruby | ✅ Supported | modules, classes, methods, singleton_methods |
| Go | ✅ Supported | functions, methods, types (struct/interface), consts, vars |
| Python | ✅ Supported | functions, classes, top-level statements |
| JavaScript | ✅ Supported | functions, classes, methods, assignments |
| TypeScript | ✅ Supported | functions, classes, methods, interfaces, enums, types, assignments |
| Markdown | ✅ Supported | headings, code blocks, list items, paragraphs |
| YAML | ✅ Supported | documents |
| Other | ✅ Supported | Whole file chunk (fallback_chunk) |
Files with unsupported extensions will automatically use the whole-file fallback mechanism.
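The fallback rule can be pictured as a lookup from file extension to chunking strategy. The following is a minimal sketch, not the tool's actual implementation: the Chunker type, the chunker_for function, and the extension-to-language table are all assumptions made for illustration.

```rust
/// How a file is split into chunks before embedding (hypothetical type).
#[derive(Debug, PartialEq)]
enum Chunker {
    /// A dedicated tree-sitter parser for the named language.
    Language(&'static str),
    /// No dedicated parser: the whole file becomes a single chunk.
    WholeFile,
}

/// Pick a chunker from a file extension given without the leading dot.
/// The extension table is an assumption made for this sketch.
fn chunker_for(extension: &str) -> Chunker {
    match extension.to_ascii_lowercase().as_str() {
        "rs" => Chunker::Language("rust"),
        "rb" => Chunker::Language("ruby"),
        "go" => Chunker::Language("go"),
        "py" => Chunker::Language("python"),
        "js" => Chunker::Language("javascript"),
        "ts" => Chunker::Language("typescript"),
        "md" => Chunker::Language("markdown"),
        "yaml" | "yml" => Chunker::Language("yaml"),
        // Anything else uses the whole-file fallback.
        _ => Chunker::WholeFile,
    }
}

fn main() {
    for ext in ["rs", "py", "toml"] {
        println!("{ext} -> {:?}", chunker_for(ext));
    }
}
```

Under these assumptions, a .toml file has no dedicated parser and is indexed as one whole-file chunk.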
Planned Languages:
Support for the following languages is planned for future releases:
- Java (.java)
- C# (.cs)
- C++ (.cpp, .h, .hpp)
- C (.c, .h)
- PHP (.php)
- Swift (.swift)
- Kotlin (.kt, .kts)
- HTML (.html)
- CSS (.css)
- JSON (.json)
Setup
Prerequisites
- Rust: Required for building the project. Install from rustup.rs.
  After installing rustup, source the Cargo environment script or restart your terminal.
- Git: Required for repository management features (repo add, repo sync, etc.).
- Build Tools: Rust often requires a C linker and build tools.
  - Linux (Debian/Ubuntu): Install a C linker and build tools with apt.
  - macOS: Install the Xcode Command Line Tools. If you don't have Xcode installed, running xcode-select --install in your terminal will prompt you to install them. Install any other required packages using Homebrew.
- Qdrant: A Qdrant instance (v1.7.0 or later recommended) must be running and accessible. See Qdrant Setup.
- ONNX Model Files: An ONNX embedding model and its corresponding tokenizer files are required. See Installation and Configuration.
Qdrant Setup
vectordb-cli requires a running Qdrant instance. Each managed repository will have its own collection in Qdrant, named repo_<repository_name>.
Option 1: Docker (Recommended for Local Use)
Run the Qdrant Docker image with the default gRPC port (6334, used by vectordb-cli) and HTTP/REST port (6333, which also serves the web UI) mapped to your host. Mounting a qdrant_storage directory from your current working directory into the container persists the data there.
Option 2: Qdrant Cloud or Other Deployment
Follow the instructions for your chosen deployment method. You will need the URL (including http:// or https:// and the port, typically 6334 for gRPC) and potentially an API Key if required by your setup.
Environment Setup Guides
For specific environment configurations (GPU acceleration), refer to the guides in the docs/ directory:
- docs/CUDA_SETUP.md (Linux with NVIDIA GPU)
- docs/MACOS_GPU_SETUP.md (macOS with Metal GPU)
- docs/CODEBERT_SETUP.md (Using CodeBERT model - may be outdated)
Installation
- Clone the Repository: Clone this repository with Git and change into its directory.
- Prepare ONNX Model & Tokenizer: Download or obtain your desired ONNX embedding model (.onnx file) and its tokenizer configuration (tokenizer.json and potentially other files like vocab.txt, merges.txt, etc., usually in a single directory). Place them in a known location. See Configuration for how to tell the tool where these are.
  Using the Example Model: This repository includes an example all-MiniLM-L6-v2 model in the onnx/ directory, managed via Git LFS. If you followed the prerequisites and installed Git LFS, Git should handle pulling the model files automatically when you clone or pull updates. If the .onnx file in onnx/model/ is small (a pointer file), you might need to run git lfs pull manually.
  Note: The tool dynamically detects the embedding dimension from the provided .onnx model.
- Build:
  - Standard (CPU): Build in release mode with cargo build --release.
  - With CUDA GPU Support (Linux): Ensure you have NVIDIA drivers, the CUDA toolkit, and cuDNN installed (see docs/CUDA_SETUP.md), then build with CUDA support enabled.
  - With Metal GPU Support (macOS): See docs/MACOS_GPU_SETUP.md.
- Understanding the Build Process (Linux/macOS):
  - The project uses a build script (build.rs) to simplify setup.
  - During the build, this script automatically finds the necessary ONNX Runtime libraries (downloaded by the ort crate to ~/.cache/ort.pyke.io/), including provider-specific libraries (like CUDA .so files or macOS .dylib files).
  - It copies these libraries into the final build output directory (target/release/lib/).
  - It sets the necessary RPATH ($ORIGIN/lib on Linux, @executable_path/lib on macOS) on the vectordb-cli executable.
  - This means you typically do not need to manually set LD_LIBRARY_PATH (Linux) or DYLD_LIBRARY_PATH (macOS).
- Install Binary (Optional): Symlink the compiled binary to a location in your PATH to make it available globally (Linux/macOS).
Configuration
vectordb-cli uses a hierarchical configuration system:
- Command-line Arguments: Highest priority (e.g., --onnx-model-path-arg, --onnx-tokenizer-dir-arg).
- Environment Variables: Second priority.
- Configuration File (config.toml): Lowest priority.
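The precedence order above amounts to taking the first source that provides a value. Here is a minimal sketch of that rule, assuming each source has already been read into an Option; resolve_setting and the example paths are hypothetical and not part of vectordb_lib.

```rust
/// Resolve one setting from the three sources, highest priority first:
/// command-line argument, then environment variable, then config file.
fn resolve_setting(
    cli_arg: Option<String>,
    env_var: Option<String>,
    config_file: Option<String>,
) -> Option<String> {
    cli_arg.or(env_var).or(config_file)
}

fn main() {
    // No command-line argument given, so the environment variable wins
    // over the value from config.toml.
    let model = resolve_setting(
        None,
        Some("/models/from-env.onnx".to_string()),
        Some("/models/from-config.onnx".to_string()),
    );
    println!("{:?}", model);
}
```

A None result corresponds to the case where a required path was not supplied by any of the three sources.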
Environment Variables
- QDRANT_URL: URL of the Qdrant gRPC endpoint (e.g., http://localhost:6334). Defaults to http://localhost:6334 if not set.
- QDRANT_API_KEY: API key for Qdrant authentication (optional).
- VECTORDB_ONNX_MODEL: Full path to the .onnx model file.
- VECTORDB_ONNX_TOKENIZER_DIR: Full path to the directory containing the tokenizer.json file.
Configuration File (config.toml)
The tool looks for a config.toml file in the XDG configuration directory:
- Linux/macOS: ~/.config/vectordb-cli/config.toml
Example config.toml:
# Key names other than api_key, onnx_tokenizer_path, active_repository, and
# repositories are inferred from the comments; check them against the
# config.toml the tool writes.
# URL for the Qdrant gRPC endpoint
qdrant_url = "http://localhost:6334"
# --- Optional: Qdrant API Key ---
# api_key = "your_qdrant_api_key"
# --- Optional: ONNX Model Configuration ---
# These are only needed if not provided via args or env vars.
# Path to the ONNX model file
onnx_model_path = "/path/to/your/model.onnx"
# Path to the directory containing tokenizer.json
# Note: Key name is `onnx_tokenizer_path`
onnx_tokenizer_path = "/path/to/your/tokenizer_directory"
# --- Repository Management ---
# The active repository (used by default for commands like sync, query)
# Set via `repo use <name>`
active_repository = "my-project"
# List of managed repositories
[[repositories]]
name = "my-project"
# Local path where the repository was cloned
local_path = "/home/user/dev/my-project"
# Branches tracked by `repo sync`
tracked_branches = ["main", "develop"]
# The branch currently checked out locally
active_branch = "main" # Updated automatically by `repo use-branch`
# Last commit hash synced for each tracked branch
# Updated automatically by `repo sync`
[repositories.last_synced_commits]
main = "a1b2c3d4e5f6..."
develop = "f6e5d4c3b2a1..."
[[repositories]]
name = "another-repo"
local_path = "/home/user/dev/another-repo"
tracked_branches = ["release-v1"]
active_branch = "release-v1"
[repositories.last_synced_commits]
release-v1 = "deadbeef..."
# ... other repositories ...
Note: You must provide the ONNX model and tokenizer paths via one of these methods (arguments, environment variables, or config file) for commands like index, query, and repo sync to work. The repositories section is managed automatically by the repo subcommands.
Usage (CLI)
This section focuses on the vectordb-cli command-line tool.
Global Options
These options can be used with most commands:
- -m, --onnx-model <PATH>: Path to the ONNX model file (overrides config & env var).
- -t, --onnx-tokenizer-dir <PATH>: Path to the ONNX tokenizer directory (overrides config & env var).
Simple Indexing (index)
This command indexes code based on directories specified directly, without linking to a specific managed repository. This is the simpler, older method ("default").
- <PATHS>...: One or more file or directory paths to index.
  - If a directory is provided, it will be indexed recursively.
- -e <ext>, --extension <ext>: Optional list of file extensions (without the dot) to include (e.g., -e rs -e py -e md or --extension rs --extension py). If omitted, common code extensions are attempted.
Repository Management (repo)
This subcommand group manages configurations for Git repositories, allowing you to index and query specific branches.
Important: Repository management uses separate Qdrant collections for each repository (repo_<repository_name>), distinct from the collection used by the simple index command.
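To make the collection layout concrete, here is a minimal sketch of how a collection name could be derived for each kind of index. The IndexTarget type and collection_name function are hypothetical; only the repo_ prefix and the vectordb-code-search name come from this README.

```rust
/// Collection used by the simple `index` command.
const SIMPLE_INDEX_COLLECTION: &str = "vectordb-code-search";

/// What is being indexed or searched (hypothetical type).
enum IndexTarget<'a> {
    /// The simple, directory-based index.
    Simple,
    /// A managed repository, identified by its configured name.
    Repository(&'a str),
}

/// Map a target to the Qdrant collection that holds its vectors.
fn collection_name(target: &IndexTarget) -> String {
    match target {
        IndexTarget::Simple => SIMPLE_INDEX_COLLECTION.to_string(),
        IndexTarget::Repository(name) => format!("repo_{name}"),
    }
}

fn main() {
    println!("{}", collection_name(&IndexTarget::Simple));
    println!("{}", collection_name(&IndexTarget::Repository("my-project")));
}
```

Because each repository maps to its own collection, clearing or deleting one repository's index leaves the others and the simple index untouched.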
Common Options:
- --repo-name <name>: Specifies the repository configuration to use (defaults to the active_repository in the config).
repo add
Clones a Git repository locally (if not already present) and adds it to the managed list.
- <repo-url>: The URL of the Git repository (HTTPS or SSH).
- --name: Optional name for the repository configuration (defaults to the repository name extracted from the URL).
- --branch: Optional initial branch to track (defaults to the repository's default branch).
- --ssh-key: Path to the SSH private key file for authentication (if using SSH URL).
- --ssh-passphrase: Passphrase for the SSH key (if needed).
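When --name is omitted, the configuration name comes from the repository URL. The sketch below shows one plausible way to do that extraction for both HTTPS and SSH URLs. It is an assumption about the behaviour, not the tool's actual code, and repo_name_from_url is a hypothetical helper.

```rust
/// Derive a default repository name from a Git URL (hypothetical helper).
/// Handles HTTPS URLs and scp-style SSH URLs such as git@host:user/repo.git.
fn repo_name_from_url(url: &str) -> Option<String> {
    // Drop surrounding whitespace, a trailing slash, and a ".git" suffix.
    let trimmed = url.trim().trim_end_matches('/');
    let trimmed = trimmed.strip_suffix(".git").unwrap_or(trimmed);
    // The name is the last segment after either '/' or ':'.
    let name = trimmed.rsplit(|c: char| c == '/' || c == ':').next()?;
    if name.is_empty() {
        None
    } else {
        Some(name.to_string())
    }
}

fn main() {
    for url in [
        "https://github.com/user/my-project.git",
        "git@github.com:user/another.git",
    ] {
        println!("{url} -> {:?}", repo_name_from_url(url));
    }
}
```

Passing --name explicitly avoids any ambiguity when two repositories would otherwise end up with the same derived name.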
repo list
Lists all configured repositories, their URLs, local paths, tracked branches, and detected indexed languages.
Output indicates the active repository with a *.
Managed Repositories:
* my-project (https://github.com/user/my-project.git) -> /home/user/dev/my-project
Default Branch: main
Tracked Branches: ["develop", "main"]
Indexed Languages: rust, markdown
another-repo (https://github.com/user/another.git) -> /home/user/dev/another-repo
Default Branch: main
Tracked Branches: ["main"]
Indexed Languages: python
repo use
Sets a repository as the active one, used by default for commands like query, sync, use-branch.
Arguments:
name: (Required) The name of the repository configuration to activate.
repo remove
Removes a repository configuration and optionally deletes its corresponding Qdrant collection.
# Remove configuration only
# Remove configuration AND delete Qdrant collection (requires confirmation)
Arguments:
- name: (Required) The name of the repository configuration to remove.
- --delete-collection: If set, deletes the repo_<name> collection from Qdrant.
repo use-branch
Checks out a specific branch in the active repository locally and adds it to the list of tracked branches for syncing.
# Assuming 'my-cool-project' is the active repo:
# Checkout 'develop' branch and track it
# Checkout and track a feature branch
Arguments:
- name: (Required) The name of the branch to check out and track. Fetches from origin if the branch isn't available locally.
repo sync
Fetches updates from the origin remote for the currently checked-out, tracked branch of the active repository (or specified repository). It calculates the changes since the last sync and updates the Qdrant index accordingly (adding new/modified files, deleting removed/renamed files).
# Sync the active repository's current branch
# Sync a specific repository (uses its currently checked-out tracked branch)
# Sync only specific file types in the active repository
# Force a full re-index of specified file types for the active repository
# Sync a specific repo with specific extensions
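The incremental update described for repo sync can be thought of as turning a Git diff into two lists: files to (re)index and files whose chunks should be removed. The following is a minimal sketch of that mapping. The Change and SyncPlan types and the plan_sync function are hypothetical, and indexing a renamed file under its new path is an assumption of this sketch.

```rust
/// One entry of the diff between the last synced commit and the fetched head.
enum Change {
    Added(String),
    Modified(String),
    Deleted(String),
    Renamed { from: String, to: String },
}

/// Index operations derived from a diff (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct SyncPlan {
    /// Files to parse, embed, and add to the collection.
    to_index: Vec<String>,
    /// Files whose existing chunks are deleted from the collection.
    to_remove: Vec<String>,
}

fn plan_sync(changes: Vec<Change>) -> SyncPlan {
    let mut plan = SyncPlan::default();
    for change in changes {
        match change {
            Change::Added(path) | Change::Modified(path) => plan.to_index.push(path),
            Change::Deleted(path) => plan.to_remove.push(path),
            Change::Renamed { from, to } => {
                // Assumption: the old path is removed and the new path indexed.
                plan.to_remove.push(from);
                plan.to_index.push(to);
            }
        }
    }
    plan
}

fn main() {
    let plan = plan_sync(vec![
        Change::Added("src/new.rs".to_string()),
        Change::Modified("src/lib.rs".to_string()),
        Change::Deleted("src/old.rs".to_string()),
        Change::Renamed { from: "src/a.rs".to_string(), to: "src/b.rs".to_string() },
    ]);
    println!("{:?}", plan);
}
```

With --force, the last synced commit is ignored, which in this picture corresponds to treating every matching file on the branch as added.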
Arguments:
- name (Optional, Positional): Name of the repository to sync. Defaults to the active repository.
- -e <ext>, --extensions <ext> (Optional): Specify one or more file extensions (without the dot, comma-separated or multiple flags) to include (e.g., -e rs,py or -e rs -e py). If omitted, defaults to syncing only extensions with dedicated parsers (see Supported Languages).
- --force (Optional): Force a full re-index of the specified files for the branch, ignoring the last synced commit state.
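Since -e accepts both comma-separated values and repeated flags, both forms should reduce to the same list of extensions. Here is a minimal sketch of that normalization, assuming the raw flag values arrive as a slice of strings; parse_extensions is a hypothetical helper, and dropping duplicates is an assumption of this sketch.

```rust
/// Flatten the values passed to `-e`/`--extensions` into one list.
/// Each value may itself be a comma-separated list such as "rs,py".
fn parse_extensions(values: &[&str]) -> Vec<String> {
    let mut extensions: Vec<String> = Vec::new();
    for value in values {
        for part in value.split(',') {
            let ext = part.trim();
            // Skip empty pieces and extensions already collected.
            if !ext.is_empty() && !extensions.iter().any(|seen| seen.as_str() == ext) {
                extensions.push(ext.to_string());
            }
        }
    }
    extensions
}

fn main() {
    // `-e rs,py` and `-e rs -e py` yield the same list.
    println!("{:?}", parse_extensions(&["rs,py"]));
    println!("{:?}", parse_extensions(&["rs", "py"]));
}
```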
Note: Currently only fetches from the configured remote (origin by default) and primarily supports SSH key authentication (via --ssh-key in repo add or system defaults like ssh-agent). Support for other credential types (HTTPS tokens, etc.) is planned.
Manual Testing for SSH: To test SSH key authentication, try adding a private repository using its SSH URL (git@...) and provide the path to your corresponding private key using --ssh-key. Ensure your key doesn't require a passphrase for automated testing, or provide it with --ssh-passphrase (not recommended for security). Running repo sync should then succeed if authentication works.
repo clear
Clears the index (Qdrant collection repo_<repo_name>) for a specific repository without removing the repository configuration or local clone.
- repo_name (Optional): The name of the repository index to clear. If omitted, the active repository is used.
- -y: Confirm deletion without prompting.
This operation is irreversible.
query
Performs a semantic search across the indexed data for the active repository, specified repositories, or all repositories.
Note: This command is deprecated and may be removed in the future. Use the simple query command for the simple index or rely on external tools to query repository-specific Qdrant collections (repo_<repo_name>).
- <query text>: The natural language query to search for.
- -r <repo_name>, --repo <repo_name> (Optional): Specify one or more repository names to search within. Conflicts with --all-repos.
- --all-repos (Optional): Search across all configured repositories. Conflicts with --repo.
- -b <branch>, --branch <branch> (Optional): Filter results by a specific branch name within the target repository/repositories.
- -l <limit>, --limit <limit> (Optional): Maximum number of results to return (default: 10).
- --lang <language> (Optional): Filter results by programming language (e.g., rust, python).
- --type <element_type> (Optional): Filter results by code element type (e.g., function, struct).
If neither --repo nor --all-repos is provided, the search defaults to the currently active repository.
Results are displayed with file paths (relative to the repository root for repo searches, absolute for legacy index searches), line numbers, scores, and the relevant code chunk.
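The rules for choosing which repositories a query searches (explicit --repo values, --all-repos, or the active repository by default) can be summarized in a small decision function. This is a minimal sketch under those stated rules; QueryTargets and resolve_targets are hypothetical names, not part of the CLI's code.

```rust
/// Which repositories a query should search (hypothetical type).
#[derive(Debug, PartialEq)]
enum QueryTargets {
    /// Neither flag given: search the active repository.
    Active,
    /// One or more `--repo` values given.
    Named(Vec<String>),
    /// `--all-repos` given.
    All,
}

/// Apply the selection rules; `--repo` and `--all-repos` are mutually exclusive.
fn resolve_targets(repos: Vec<String>, all_repos: bool) -> Result<QueryTargets, String> {
    match (repos.is_empty(), all_repos) {
        (false, true) => Err("--repo conflicts with --all-repos".to_string()),
        (false, false) => Ok(QueryTargets::Named(repos)),
        (true, true) => Ok(QueryTargets::All),
        (true, false) => Ok(QueryTargets::Active),
    }
}

fn main() {
    println!("{:?}", resolve_targets(Vec::new(), false));
    println!("{:?}", resolve_targets(vec!["my-project".to_string()], false));
    println!("{:?}", resolve_targets(vec!["my-project".to_string()], true));
}
```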
stats
Displays statistics about the Qdrant collections.
--repo-name: If provided, shows stats only for the specified repository's collection. Otherwise, shows stats for all repository collections and the default index collection.
list
Lists the unique files that have been indexed for the active repository.
clear
Deletes the entire simple index collection (vectordb-code-search). This does not affect repository indices.
-y: Confirm deletion without prompting.
Library (vectordb_lib)
This crate also provides the vectordb_lib library, which contains the core logic for configuration, code parsing, embedding management, and interacting with the vector database.
While the CLI provides a convenient interface, you can use the library programmatically for more custom integrations.
- Quickstart Guide: docs/library_quickstart.md
- API Documentation: https://docs.rs/vectordb-cli
See the crate-level documentation within the library (src/lib.rs) for a conceptual example and overview of the main components like EmbeddingHandler.
Important Runtime Dependency:
Users of the vectordb_lib library must ensure the ONNX Runtime shared libraries are available when running their application. This is because the library itself does not bundle these dependencies.
Refer to the ONNX Runtime installation guide for instructions on how to install the runtime system-wide, or ensure the necessary shared library files (.so/.dylib/.dll) are discoverable via the system's library path (e.g., using LD_LIBRARY_PATH on Linux).
Development
(Include instructions for setting up the dev environment, running tests, etc.)
# Run tests
# Run clippy
# Format code
Contributing
(Contribution guidelines)
License
MIT License