# ck - Semantic Code Search
**ck (seek)** finds code by meaning, not just keywords. It's grep that understands what you're looking for: search for "error handling" and find try/catch blocks, error returns, and exception handling code even when those exact words aren't present.
## Quick Start
```bash
# Install from crates.io
cargo install ck-search
# Just search: ck builds and updates indexes automatically
ck --sem "error handling" src/
ck --sem "authentication logic" src/
ck --sem "database connection pooling" src/
# Traditional grep-compatible search still works
ck -n "TODO" *.rs
# Combine both: semantic relevance + keyword filtering
ck --hybrid "connection timeout" src/
```
> **[Full Documentation](https://beaconbay.github.io/ck/)**: Installation guides, tutorials, feature deep-dives, and API reference
## Headline Features
### **AI Agent Integration (MCP Server)**
Connect ck directly to Claude Desktop, Cursor, or any MCP-compatible AI client for seamless code search integration:
```bash
# Start MCP server for AI agent integration
ck --serve
```
**Claude Code Setup:**
```bash
# Install via Claude Code CLI (recommended)
claude mcp add ck-search -s user -- ck --serve
# Note: You may need to restart Claude Code after installation
# Verify installation with:
claude mcp list # or use /mcp in Claude Code
```
**Manual Configuration (alternative):**
```json
{
  "mcpServers": {
    "ck": {
      "command": "ck",
      "args": ["--serve"],
      "cwd": "/path/to/your/codebase"
    }
  }
}
```
**Tool Permissions:** When prompted by Claude Code, approve permissions for ck-search tools (semantic_search, regex_search, hybrid_search, etc.)
**Available MCP Tools:**
- `semantic_search` - Find code by meaning using embeddings
- `regex_search` - Traditional grep-style pattern matching
- `hybrid_search` - Combined semantic and keyword search
- `index_status` - Check indexing status and metadata
- `reindex` - Force rebuild of search index
- `health_check` - Server status and diagnostics
**Built-in Pagination:** Handles large result sets gracefully with page_size controls, cursors, and snippet length management.
### **Interactive TUI (Terminal User Interface)**
Launch an interactive search interface with real-time results and multiple preview modes:
```bash
# Start TUI for current directory
ck --tui
# Start with initial query
ck --tui "error handling"
```
**Features:**
- **Multiple Search Modes**: Toggle between Semantic, Regex, and Hybrid search with `Tab`
- **Preview Modes**: Switch between Heatmap, Syntax highlighting, and Chunk view with `Ctrl+V`
- **View Options**: Toggle between snippet and full-file view with `Ctrl+F`
- **Multi-select**: Select multiple files with `Ctrl+Space`, open all in editor with `Enter`
- **Search History**: Navigate with `Ctrl+Up/Down`
- **Editor Integration**: Opens files in `$EDITOR` with line numbers (Vim, VS Code, Cursor, etc.)
- **Progress Tracking**: Live indexing progress with file and chunk counts
- **Config Persistence**: Preferences saved to `~/.config/ck/tui.json`
See [TUI.md](TUI.md) for keyboard shortcuts and detailed usage.
### **Semantic Search**
Find code by concept, not keywords. Understands synonyms, related terms, and conceptual similarity:
```bash
# These find related code even without exact keywords:
ck --sem "retry logic" # finds backoff, circuit breakers
ck --sem "user authentication" # finds login, auth, credentials
ck --sem "data validation" # finds sanitization, type checking
# Get complete functions/classes containing matches
ck --sem --full-section "error handling" # returns entire functions
```
### **Drop-in grep Compatibility**
All your muscle memory works. Same flags, same behavior, same output format:
```bash
ck -i "warning" *.log # Case-insensitive
ck -n -A 3 -B 1 "error" src/ # Line numbers + context
ck -l "error" src/ # List files with matches only
ck -L "TODO" src/ # List files without matches
ck -R --exclude "*.test.js" "bug" # Recursive with exclusions
```
### **Hybrid Search**
Combine keyword precision with semantic understanding using Reciprocal Rank Fusion:
```bash
ck --hybrid "async timeout" src/ # Best of both worlds
ck --hybrid --scores "cache" src/ # Show relevance scores with color highlighting
ck --hybrid --threshold 0.02 query # Filter by minimum relevance
```
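As a rough illustration of Reciprocal Rank Fusion, the sketch below merges a keyword ranking and a semantic ranking by summing `1 / (k + rank)` for each file. The constant `k = 60` is the value commonly used with RRF and is an assumption here, not a confirmed ck setting; the file names are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one scored ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the best hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from a keyword search and a semantic search.
keyword_hits = ["pool.rs", "timeout.rs", "retry.rs"]
semantic_hits = ["timeout.rs", "client.rs", "pool.rs"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
for path, score in fused:
    print(f"{score:.4f}  {path}")
```

Under this formula a file ranked first in only one list scores about 0.016, which may be why the hybrid thresholds in the examples (0.01, 0.02) are far smaller than the semantic ones (0.7).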
### **Automatic Delta Indexing with Chunk-Level Caching**
Semantic and hybrid searches transparently create and refresh their indexes before running. The first search builds what it needs; subsequent searches intelligently reuse cached embeddings:
- **Chunk-level incremental indexing**: Only changed chunks are re-embedded (80-90% cache hit rate for typical code changes)
- **Content-aware invalidation**: Doc comments and whitespace changes properly invalidate cache
- **Model consistency**: Prevents silent embedding corruption when switching models
- **Smart caching**: Hash-based invalidation using blake3(text + trivia) for reliable change detection
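The caching behaviour described above can be sketched roughly as follows. This is a simplified model, not ck's implementation: `hashlib.blake2b` stands in for blake3 (which is not in the Python standard library), mixing the model name into the key is an assumption about how model consistency could be enforced, and `fake_embed` is a placeholder for a real embedding model.

```python
import hashlib

def chunk_key(text, trivia, model):
    """Hash the chunk text, its trivia (comments, whitespace), and the model name."""
    h = hashlib.blake2b(digest_size=16)  # stand-in for blake3
    for part in (model, text, trivia):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    return h.hexdigest()

def embed_chunks(chunks, cache, embed, model="bge-small"):
    """Return one vector per chunk, calling `embed` only on cache misses."""
    vectors, misses = [], 0
    for text, trivia in chunks:
        key = chunk_key(text, trivia, model)
        if key not in cache:
            cache[key] = embed(text)
            misses += 1
        vectors.append(cache[key])
    return vectors, misses

def fake_embed(text):
    """Placeholder embedding: a one-element vector holding the text length."""
    return [float(len(text))]

cache = {}
before = [("fn a() {}", "// doc"), ("fn b() {}", "")]
_, first_misses = embed_chunks(before, cache, fake_embed)

# Only the doc comment of the first chunk changes, so only that chunk is re-embedded.
after = [("fn a() {}", "// doc changed"), ("fn b() {}", "")]
_, second_misses = embed_chunks(after, cache, fake_embed)
print(first_misses, second_misses)
```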
### **Smart File Filtering**
Automatically excludes cache directories, build artifacts, and respects `.gitignore` and `.ckignore` files:
```bash
# ck respects multiple exclusion layers (all are additive):
ck "pattern" . # Uses .gitignore + .ckignore + defaults
ck --no-ignore "pattern" . # Skip .gitignore (still uses .ckignore)
ck --no-ckignore "pattern" . # Skip .ckignore (still uses .gitignore)
ck --exclude "dist" --exclude "logs" . # Add custom exclusions
# .ckignore file (created automatically on first index):
# - Excludes images, videos, audio, binaries, archives by default
# - Excludes JSON/YAML config files (issue #27)
# - Uses same syntax as .gitignore (glob patterns, ! for negation)
# - Persists across searches (issue #67)
# - Located at repository root, editable for custom patterns
# Exclusion patterns use .gitignore syntax:
ck --exclude "node_modules" . # Exclude directory and all contents
ck --exclude "*.test.js" . # Exclude files matching pattern
ck --exclude "build/" --exclude "*.log" . # Multiple exclusions
# Note: Patterns are relative to the search root
```
**Why .ckignore?** While `.gitignore` handles version control exclusions, many files that *should* be in your repo aren't ideal for semantic search. Config files (`package.json`, `tsconfig.json`), images, videos, and data files add noise to search results and slow down indexing. `.ckignore` lets you focus semantic search on actual code while keeping everything else in git. Think of it as "what should I search" vs "what should I commit".
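To make the `.gitignore`-style semantics concrete, here is a deliberately simplified matcher: patterns are applied in order, the last match wins, and a leading `!` re-includes a path. Real gitignore matching has more rules (anchoring, `**`, excluded parent directories), and the pattern list is a hypothetical `.ckignore`, not the generated default.

```python
from fnmatch import fnmatchcase

def is_excluded(path, patterns):
    """Simplified gitignore-style check: last matching pattern wins, '!' negates."""
    excluded = False
    parts = path.split("/")
    for raw in patterns:
        negated = raw.startswith("!")
        pattern = (raw[1:] if negated else raw).rstrip("/")
        # Test every path component so that "node_modules" also covers its contents.
        if any(fnmatchcase(part, pattern) for part in parts):
            excluded = not negated
    return excluded

# Hypothetical .ckignore contents
patterns = ["*.json", "node_modules", "build/", "!package.json"]

for path in ["src/config.json", "package.json", "node_modules/lib/index.js", "src/main.rs"]:
    print(path, "excluded" if is_excluded(path, patterns) else "searched")
```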
## Advanced Usage
### AI Agent Integration
#### MCP Server (Recommended)
```python
# Example usage in AI agents (run inside an async function; `client` is a connected MCP client)
response = await client.call_tool("semantic_search", {
    "query": "authentication logic",
    "path": "/path/to/code",
    "page_size": 25,
    "top_k": 50,  # Limit total results (default: 100 for MCP)
    "snippet_length": 200
})
# Handle pagination
if response["pagination"]["next_cursor"]:
    next_response = await client.call_tool("semantic_search", {
        "query": "authentication logic",
        "path": "/path/to/code",
        "cursor": response["pagination"]["next_cursor"]
    })
```
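To walk every page instead of only the second one, the cursor can be followed in a loop. The sketch below assumes the response carries its hits under a `results` key (a hypothetical field name; only `pagination.next_cursor` appears in the example above) and uses a `FakeClient` stand-in so it runs without a server.

```python
import asyncio

async def search_all_pages(client, query, path, page_size=25):
    """Collect every page of a semantic_search call by following next_cursor."""
    args = {"query": query, "path": path, "page_size": page_size}
    results = []
    while True:
        response = await client.call_tool("semantic_search", args)
        results.extend(response["results"])  # "results" is an assumed field name
        cursor = response["pagination"]["next_cursor"]
        if not cursor:
            return results
        args = {"query": query, "path": path, "cursor": cursor}

class FakeClient:
    """Stand-in for an MCP client: serves three canned hits, two per page."""
    hits = ["auth.rs", "login.rs", "session.rs"]

    async def call_tool(self, name, args):
        start = int(args.get("cursor", 0))
        end = start + 2
        next_cursor = str(end) if end < len(self.hits) else None
        return {"results": self.hits[start:end],
                "pagination": {"next_cursor": next_cursor}}

all_hits = asyncio.run(
    search_all_pages(FakeClient(), "authentication logic", "/path/to/code")
)
print(all_hits)
```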
#### JSONL Output (Custom Workflows)
Perfect structured output for LLMs, scripts, and automation:
```bash
# JSONL format - one JSON object per line (recommended for agents)
ck --jsonl --sem "error handling" src/
ck --jsonl --no-snippet "function" . # Metadata only
ck --jsonl --topk 5 --threshold 0.7 "auth" # High-confidence results
# Traditional JSON (single array): use --json instead of --jsonl
```
**Why JSONL for AI agents?**
- ✅ **Streaming friendly**: Process results as they arrive
- ✅ **Memory efficient**: Parse one result at a time
- ✅ **Error resilient**: One malformed line doesn't break entire response
- ✅ **Standard format**: Used by OpenAI API, Anthropic API, and modern ML pipelines
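A minimal consumer for this kind of output might look like the following sketch. It parses one object per line and skips malformed lines, which is the resilience property listed above. The `file` and `score` fields in the sample are illustrative assumptions, not ck's documented schema.

```python
import json

def parse_jsonl(lines):
    """Yield one parsed object per line, skipping blank and malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # one bad line does not abort the rest of the stream

# Hypothetical output lines; the middle one is truncated on purpose.
sample = [
    '{"file": "src/auth.rs", "score": 0.83}',
    '{"file": "src/broken.rs", "score": ',
    '{"file": "src/login.rs", "score": 0.71}',
]
records = list(parse_jsonl(sample))
print([record["file"] for record in records])
```

In a pipeline, passing `sys.stdin` instead of `sample` would process results as they arrive from `ck --jsonl`.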
### Search & Filter Options
```bash
# Threshold filtering
ck --sem --threshold 0.7 "query" # Only high-confidence matches
ck --hybrid --threshold 0.01 "concept" # Low-confidence (exploration)
# Limit results
ck --sem --topk 5 "authentication patterns"
# Complete code sections
ck --sem --full-section "database queries" # Complete functions
ck --full-section "class.*Error" src/ # Complete classes (works with regex too)
# Relevance scoring
ck --sem --scores "machine learning" docs/
# [0.847] ./ai_guide.txt: Machine learning introduction...
# [0.732] ./statistics.txt: Statistical learning methods...
```
### Language Coverage
| Language | | | | Notes |
|----------|---|---|---|-------|
| Zig | ✅ | ✅ | ✅ | contributed by [@Nevon](https://github.com/Nevon) (PR #72) |
### Model Selection
Choose the right embedding model for your needs:
```bash
# Default: BGE-Small (fast, precise chunking)
ck --index .
# Mixedbread xsmall: Optimized for local semantic search (4K context, 384 dims)
ck --index --model mxbai-xsmall .
# Enhanced: Nomic V1.5 (8K context, optimal for large functions)
ck --index --model nomic-v1.5 .
# Code-specialized: Jina Code (optimized for programming languages)
ck --index --model jina-code .
```
**Model Comparison:**
- **`bge-small`** (default): 400-token chunks, fast indexing, good for most code
- **`mxbai-xsmall`**: 4K context window, 384 dimensions, optimized for local inference (Mixedbread)
- **`nomic-v1.5`**: 1024-token chunks with 8K model capacity, better for large functions
- **`jina-code`**: 1024-token chunks with 8K model capacity, specialized for code understanding
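As back-of-the-envelope arithmetic for why larger chunk sizes suit large functions, the sketch below counts how many chunks one function would need under each documented chunk size. It assumes chunks do not overlap and that a function is split purely by token count, both simplifications; `mxbai-xsmall` is omitted because its chunk size is not stated above.

```python
import math

# Chunk sizes in tokens, taken from the model comparison above.
CHUNK_TOKENS = {"bge-small": 400, "nomic-v1.5": 1024, "jina-code": 1024}

def chunks_needed(function_tokens, model):
    """Minimum number of chunks for one function, ignoring any chunk overlap."""
    return math.ceil(function_tokens / CHUNK_TOKENS[model])

# A hypothetical 900-token function
for model in CHUNK_TOKENS:
    print(model, chunks_needed(900, model))
```

Under these assumptions the 900-token function is split into three pieces with `bge-small` but stays whole with the 1024-token models.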
### Index Management
```bash
# Check index status
ck --status .
# Clean up and rebuild / switch models
ck --clean .
ck --switch-model mxbai-xsmall .
ck --switch-model nomic-v1.5 .
ck --switch-model nomic-v1.5 --force . # Force rebuild
# Add single file to index
ck --add new_file.rs
# File inspection (analyze chunking and token usage)
ck --inspect src/main.rs
ck --inspect --model bge-small src/main.rs # Test different models
```
**Interrupting Operations:** Indexing can be safely interrupted with Ctrl+C. The partial index is saved, and the next operation will resume from where it stopped, only processing new or changed files.
## Language Support
| Language | | | | Notes |
|----------|---|---|---|-------|
| Python | ✅ | ✅ | ✅ | Functions, classes |
| JavaScript/TypeScript | ✅ | ✅ | ✅ | Functions, classes, methods |
| Rust | ✅ | ✅ | ✅ | Functions, structs, traits |
| Go | ✅ | ✅ | ✅ | Functions, types, methods |
| Ruby | ✅ | ✅ | ✅ | Classes, methods, modules |
| Haskell | ✅ | ✅ | ✅ | Functions, types, instances |
| C# | ✅ | ✅ | ✅ | Classes, interfaces, methods |
| Dart | ✅ | ✅ | ✅ | Classes, mixins, methods |
**Text Formats:** Markdown, JSON, YAML, TOML, XML, HTML, CSS, shell scripts, SQL, log files, config files, and any other text format.
**Smart Binary Detection:** Uses ripgrep-style content analysis, automatically indexing any text file while correctly excluding binary files.
**Unsupported File Types:** Text files with unrecognized extensions (like `.org`, `.adoc`, etc.) are automatically indexed as plain text. ck detects text vs binary based on file contents, not extensions.
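Content-based detection of this kind is often implemented by looking for a NUL byte near the start of the file, which is the heuristic ripgrep uses. The sketch below shows that idea; the 8 KiB sample size and the function names are assumptions, not ck's actual values.

```python
def looks_binary(data, sample_size=8192):
    """Treat content as binary if a NUL byte appears in the leading sample."""
    return b"\x00" in data[:sample_size]

def is_text_file(path, sample_size=8192):
    """Read only the start of the file; the extension is never consulted."""
    with open(path, "rb") as handle:
        return not looks_binary(handle.read(sample_size), sample_size)

print(looks_binary(b"\x7fELF\x02\x01\x01\x00"))       # header bytes of an ELF executable
print(looks_binary("* TODO notes.org\n".encode()))    # plain text with an unrecognized extension
```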
## Installation
### From crates.io
```bash
cargo install ck-search
```
### From Source
```bash
git clone https://github.com/BeaconBay/ck
cd ck
cargo install --path ck-cli
```
### Package Managers
```bash
# Currently available:
cargo install ck-search    # ✅ Available now via crates.io
# Coming soon:
brew install ck-search     # 🚧 In development (use cargo for now)
apt install ck-search      # 🚧 In development
```
## Examples
### Finding Code Patterns
```bash
# Find authentication/authorization code
ck --sem "user permissions" src/
ck --sem "access control" src/
ck --sem "login validation" src/
# Find error handling strategies
ck --sem "exception handling" src/
ck --sem "error recovery" src/
ck --sem "fallback mechanisms" src/
# Find performance-related code
ck --sem "caching strategies" src/
ck --sem "database optimization" src/
ck --sem "memory management" src/
```
### Team Workflows
```bash
# Find related test files
ck --sem "unit tests for authentication" tests/
ck -l --sem "test" tests/ # List test files by semantic content
# Identify refactoring candidates
ck --sem "duplicate logic" src/
ck --sem "code complexity" src/
ck -L "test" src/ # Find source files without tests
# Security audit
```
### Integration Examples
```bash
# Git hooks
# CI/CD pipeline
# Code review prep
ck --hybrid --scores "performance" src/ > review_notes.txt
# Documentation generation
```
## Performance
**Field-tested on real codebases:**
- **Indexing:** ~1M LOC in under 2 minutes
- **Incremental indexing:** 80-90% cache hit rate for typical code changes (only changed chunks re-embedded)
- **Search:** Sub-500ms queries on typical codebases
- **Index size:** ~2x source code size with compression
- **Memory:** Efficient streaming for large repositories
- **Token precision:** HuggingFace tokenizers for exact model-specific token counting
## Architecture
ck uses a modular Rust workspace:
- **`ck-cli`** - Command-line interface and MCP server
- **`ck-tui`** - Interactive terminal user interface (ratatui-based)
- **`ck-core`** - Shared types, configuration, and utilities
- **`ck-engine`** - Search engine implementations (regex, semantic, hybrid)
- **`ck-index`** - File indexing, hashing, and sidecar management
- **`ck-embed`** - Text embedding providers (FastEmbed, API backends)
- **`ck-ann`** - Approximate nearest neighbor search indices
- **`ck-chunk`** - Text segmentation and language-aware parsing ([query-based chunking](docs/explanation/query-based-chunking.md))
- **`ck-models`** - Model registry and configuration management
### Index Storage
Indexes are stored in `.ck/` directories alongside your code:
```
project/
├── src/
├── docs/
└── .ck/                 # Semantic index (can be safely deleted)
    ├── embeddings.json
    ├── ann_index.bin
    └── tantivy_index/
```
The `.ck/` directory is a cache: it is safe to delete and rebuild anytime.
## Testing
```bash
# Run the full test suite
cargo test --workspace
# Test with each feature combination
cargo hack test --each-feature --workspace
```
## Contributing
ck is actively developed and welcomes contributions:
1. **Issues:** Report bugs, request features
2. **Code:** Submit PRs for bug fixes, new features
3. **Documentation:** Improve examples, guides, tutorials
4. **Testing:** Help test on different codebases and languages
### Development Setup
```bash
git clone https://github.com/BeaconBay/ck
cd ck
cargo build --workspace
cargo test --workspace
./target/debug/ck --index test_files/
./target/debug/ck --sem "test query" test_files/
```
### CI Requirements
Before submitting a PR, ensure your code passes all CI checks:
```bash
# Format code (required)
cargo fmt --all
# Run clippy linter (required - must have no warnings)
cargo clippy --workspace --all-features --all-targets -- -D warnings
# Run tests (required)
cargo test --workspace
# Check minimum supported Rust version (MSRV)
cargo hack check --each-feature --locked --rust-version --workspace
```
The CI pipeline runs on Ubuntu, Windows, and macOS to ensure cross-platform compatibility.
## Roadmap
### Current (v0.7+)
- ✅ MCP (Model Context Protocol) server for AI agent integration
- ✅ Chunk-level incremental indexing with smart embedding reuse
- ✅ grep-compatible CLI with semantic search and file listing flags
- ✅ FastEmbed integration with BGE models and enhanced model selection
- ✅ File exclusion patterns and glob support
- ✅ Threshold filtering and relevance scoring with visual highlighting
- ✅ Tree-sitter parsing and intelligent chunking for 7+ languages
- ✅ Complete code section extraction (`--full-section`)
- ✅ Clean stdout/stderr separation for reliable scripting
- ✅ Token-aware chunking with HuggingFace tokenizers
- ✅ Published to crates.io (`cargo install ck-search`)
### Next
- 🚧 Configuration file support
- 🚧 Package manager distributions (brew, apt)
- 🚧 Enhanced MCP tools (file writing, refactoring assistance)
- 🚧 VS Code extension
- 🚧 JetBrains plugin
- 🚧 Additional language chunkers (Java, PHP, Swift)
## FAQ
**Q: How is this different from grep/ripgrep/silver-searcher?**
A: ck includes all the features of traditional search tools, but adds semantic understanding. Search for "error handling" and find relevant code even when those exact words aren't used.
**Q: Does it work offline?**
A: Yes. The embedding model runs locally; after the one-time model download, searches make no network calls.
**Q: How big are the indexes?**
A: Typically 1-3x the size of your source code. The `.ck/` directory can be safely deleted to reclaim space.
**Q: Is it fast enough for large codebases?**
A: Yes. The first semantic search builds the index automatically; after that only changed files are reprocessed, keeping searches sub-second even on large projects.
**Q: Can I use it in scripts/automation?**
A: Absolutely. The `--json` and `--jsonl` flags provide structured output perfect for automated processing and AI agent integration.
**Q: What about privacy/security?**
A: Everything runs locally. No code or queries are sent to external services. The embedding model is downloaded once and cached locally.
**Q: Where are the embedding models cached?**
A: Models are cached in platform-specific directories:
- Linux/macOS: `~/.cache/ck/models/`
- Windows: `%LOCALAPPDATA%\ck\cache\models\`
- Fallback: `.ck_models/models/` in current directory
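The lookup order above can be sketched as a small function. When exactly the fallback applies is not stated, so this sketch assumes it is used whenever the relevant environment variable (`HOME` or `LOCALAPPDATA`) is missing; the function name is hypothetical.

```python
import os
import sys
from pathlib import PurePosixPath, PureWindowsPath

def model_cache_dir(platform=sys.platform, env=None):
    """Resolve the model cache directory following the locations listed above."""
    env = os.environ if env is None else env
    if platform.startswith("win"):
        base = env.get("LOCALAPPDATA")
        if base:
            return PureWindowsPath(base) / "ck" / "cache" / "models"
    else:
        home = env.get("HOME")
        if home:
            return PurePosixPath(home) / ".cache" / "ck" / "models"
    # Assumed trigger for the fallback: no usable per-user directory was found.
    return PurePosixPath(".ck_models") / "models"

print(model_cache_dir("linux", {"HOME": "/home/dev"}))
print(model_cache_dir("win32", {"LOCALAPPDATA": r"C:\Users\dev\AppData\Local"}))
print(model_cache_dir("linux", {}))
```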
## License
Licensed under either of:
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))
- MIT License ([LICENSE-MIT](LICENSE-MIT))
at your option.
## Credits
Built with:
- [Rust](https://rust-lang.org) - Systems programming language
- [FastEmbed](https://github.com/Anush008/fastembed-rs) - Fast text embeddings
- [Tantivy](https://github.com/quickwit-oss/tantivy) - Full-text search engine
- [clap](https://github.com/clap-rs/clap) - Command line argument parsing
Inspired by the need for better code search tools in the age of AI-assisted development.
---
**Start finding code by what it does, not what it says.**
```bash
cargo install ck-search
ck --sem "the code you're looking for"
```