ck-search 0.7.11

Semantic grep by embedding - find code by meaning, not just keywords
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
# ck - Semantic Code Search

[![CI](https://github.com/BeaconBay/ck/actions/workflows/ci.yaml/badge.svg)](https://github.com/BeaconBay/ck/actions/workflows/ci.yaml)
[![Crates.io](https://img.shields.io/crates/v/ck-search.svg)](https://crates.io/crates/ck-search)
[![Downloads](https://img.shields.io/crates/d/ck-search.svg)](https://crates.io/crates/ck-search)
[![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](LICENSE-MIT)
[![MSRV](https://img.shields.io/badge/rust-1.88%2B-blue.svg)](https://www.rust-lang.org)
[![Documentation](https://img.shields.io/badge/docs-beaconbay.github.io%2Fck-blue)](https://beaconbay.github.io/ck/)

**ck (seek)** finds code by meaning, not just keywords. It's grep that understands what you're looking for โ€” search for "error handling" and find try/catch blocks, error returns, and exception handling code even when those exact words aren't present.

## ๐Ÿš€ Quick Start

```bash
# Install from crates.io
cargo install ck-search

# Just search โ€” ck builds and updates indexes automatically
ck --sem "error handling" src/
ck --sem "authentication logic" src/
ck --sem "database connection pooling" src/

# Traditional grep-compatible search still works
ck -n "TODO" *.rs
ck -R "TODO|FIXME" .

# Combine both: semantic relevance + keyword filtering
ck --hybrid "connection timeout" src/
```

> **๐Ÿ“š [Full Documentation]https://beaconbay.github.io/ck/** โ€” Installation guides, tutorials, feature deep-dives, and API reference

## โœจ Headline Features

### ๐Ÿค– **AI Agent Integration (MCP Server)**
Connect ck directly to Claude Desktop, Cursor, or any MCP-compatible AI client for seamless code search integration:

```bash
# Start MCP server for AI agent integration
ck --serve
```

**Claude Desktop Setup:**

```bash
# Install via Claude Code CLI (recommended)
claude mcp add ck-search -s user -- ck --serve

# Note: You may need to restart Claude Code after installation
# Verify installation with:
claude mcp list  # or use /mcp in Claude Code
```

**Manual Configuration (alternative):**
```json
{
  "mcpServers": {
    "ck": {
      "command": "ck",
      "args": ["--serve"],
      "cwd": "/path/to/your/codebase"
    }
  }
}
```

**Tool Permissions:** When prompted by Claude Code, approve permissions for ck-search tools (semantic_search, regex_search, hybrid_search, etc.)

**Available MCP Tools:**
- `semantic_search` - Find code by meaning using embeddings
- `regex_search` - Traditional grep-style pattern matching
- `hybrid_search` - Combined semantic and keyword search
- `index_status` - Check indexing status and metadata
- `reindex` - Force rebuild of search index
- `health_check` - Server status and diagnostics

**Built-in Pagination:** Handles large result sets gracefully with page_size controls, cursors, and snippet length management.

### ๐ŸŽจ **Interactive TUI (Terminal User Interface)**
Launch an interactive search interface with real-time results and multiple preview modes:

```bash
# Start TUI for current directory
ck --tui

# Start with initial query
ck --tui "error handling"
```

**Features:**
- **Multiple Search Modes**: Toggle between Semantic, Regex, and Hybrid search with `Tab`
- **Preview Modes**: Switch between Heatmap, Syntax highlighting, and Chunk view with `Ctrl+V`
- **View Options**: Toggle between snippet and full-file view with `Ctrl+F`
- **Multi-select**: Select multiple files with `Ctrl+Space`, open all in editor with `Enter`
- **Search History**: Navigate with `Ctrl+Up/Down`
- **Editor Integration**: Opens files in `$EDITOR` with line numbers (Vim, VS Code, Cursor, etc.)
- **Progress Tracking**: Live indexing progress with file and chunk counts
- **Config Persistence**: Preferences saved to `~/.config/ck/tui.json`

See [TUI.md](TUI.md) for keyboard shortcuts and detailed usage.

### ๐Ÿ” **Semantic Search**
Find code by concept, not keywords. Understands synonyms, related terms, and conceptual similarity:

```bash
# These find related code even without exact keywords:
ck --sem "retry logic"           # finds backoff, circuit breakers
ck --sem "user authentication"   # finds login, auth, credentials
ck --sem "data validation"       # finds sanitization, type checking

# Get complete functions/classes containing matches
ck --sem --full-section "error handling"  # returns entire functions
```

### โšก **Drop-in grep Compatibility**
All your muscle memory works. Same flags, same behavior, same output format:

```bash
ck -i "warning" *.log              # Case-insensitive
ck -n -A 3 -B 1 "error" src/       # Line numbers + context
ck -l "error" src/                  # List files with matches only
ck -L "TODO" src/                   # List files without matches
ck -R --exclude "*.test.js" "bug"  # Recursive with exclusions
```

### ๐ŸŽฏ **Hybrid Search**
Combine keyword precision with semantic understanding using Reciprocal Rank Fusion:

```bash
ck --hybrid "async timeout" src/    # Best of both worlds
ck --hybrid --scores "cache" src/   # Show relevance scores with color highlighting
ck --hybrid --threshold 0.02 query  # Filter by minimum relevance
```

### โš™๏ธ **Automatic Delta Indexing with Chunk-Level Caching**
Semantic and hybrid searches transparently create and refresh their indexes before running. The first search builds what it needs; subsequent searches intelligently reuse cached embeddings:

- **Chunk-level incremental indexing**: Only changed chunks are re-embedded (80-90% cache hit rate for typical code changes)
- **Content-aware invalidation**: Doc comments and whitespace changes properly invalidate cache
- **Model consistency**: Prevents silent embedding corruption when switching models
- **Smart caching**: Hash-based invalidation using blake3(text + trivia) for reliable change detection

### ๐Ÿ“ **Smart File Filtering**
Automatically excludes cache directories, build artifacts, and respects `.gitignore` and `.ckignore` files:

```bash
# ck respects multiple exclusion layers (all are additive):
ck "pattern" .                           # Uses .gitignore + .ckignore + defaults
ck --no-ignore "pattern" .               # Skip .gitignore (still uses .ckignore)
ck --no-ckignore "pattern" .             # Skip .ckignore (still uses .gitignore)
ck --exclude "dist" --exclude "logs" .   # Add custom exclusions

# .ckignore file (created automatically on first index):
# - Excludes images, videos, audio, binaries, archives by default
# - Excludes JSON/YAML config files (issue #27)
# - Uses same syntax as .gitignore (glob patterns, ! for negation)
# - Persists across searches (issue #67)
# - Located at repository root, editable for custom patterns

# Exclusion patterns use .gitignore syntax:
ck --exclude "node_modules" .            # Exclude directory and all contents
ck --exclude "*.test.js" .                # Exclude files matching pattern
ck --exclude "build/" --exclude "*.log" . # Multiple exclusions
# Note: Patterns are relative to the search root
```

**Why .ckignore?** While `.gitignore` handles version control exclusions, many files that *should* be in your repo aren't ideal for semantic search. Config files (`package.json`, `tsconfig.json`), images, videos, and data files add noise to search results and slow down indexing. `.ckignore` lets you focus semantic search on actual code while keeping everything else in git. Think of it as "what should I search" vs "what should I commit".

## ๐Ÿ›  Advanced Usage

### AI Agent Integration

#### MCP Server (Recommended)
```python
# Example usage in AI agents
response = await client.call_tool("semantic_search", {
    "query": "authentication logic",
    "path": "/path/to/code",
    "page_size": 25,
    "top_k": 50,           # Limit total results (default: 100 for MCP)
    "snippet_length": 200
})

# Handle pagination
if response["pagination"]["next_cursor"]:
    next_response = await client.call_tool("semantic_search", {
        "query": "authentication logic",
        "path": "/path/to/code",
        "cursor": response["pagination"]["next_cursor"]
    })
```

#### JSONL Output (Custom Workflows)
Perfect structured output for LLMs, scripts, and automation:

```bash
# JSONL format - one JSON object per line (recommended for agents)
ck --jsonl --sem "error handling" src/
ck --jsonl --no-snippet "function" .        # Metadata only
ck --jsonl --topk 5 --threshold 0.7 "auth"  # High-confidence results

# Traditional JSON (single array)
ck --json --sem "error handling" src/ | jq '.file'
```

**Why JSONL for AI agents?**
- โœ… **Streaming friendly**: Process results as they arrive
- โœ… **Memory efficient**: Parse one result at a time
- โœ… **Error resilient**: One malformed line doesn't break entire response
- โœ… **Standard format**: Used by OpenAI API, Anthropic API, and modern ML pipelines

### Search & Filter Options

```bash
# Threshold filtering
ck --sem --threshold 0.7 "query"           # Only high-confidence matches
ck --hybrid --threshold 0.01 "concept"     # Low-confidence (exploration)

# Limit results
ck --sem --topk 5 "authentication patterns"

# Complete code sections
ck --sem --full-section "database queries"  # Complete functions
ck --full-section "class.*Error" src/       # Complete classes (works with regex too)

# Relevance scoring
ck --sem --scores "machine learning" docs/
# [0.847] ./ai_guide.txt: Machine learning introduction...
# [0.732] ./statistics.txt: Statistical learning methods...
```


### Language Coverage

| Language | Indexing | Chunking | AST-aware | Notes |
|----------|----------|----------|-----------|-------|
| Markdown | โœ… | โœ… | โœ… | Headings, sections, code blocks |
| Zig | โœ… | โœ… | โœ… | contributed by [@Nevon]https://github.com/Nevon (PR #72) |

### Model Selection

Choose the right embedding model for your needs:

```bash
# Default: BGE-Small (fast, precise chunking)
ck --index .

# Mixedbread xsmall: Optimized for local semantic search (4K context, 384 dims)
ck --index --model mxbai-xsmall .

# Enhanced: Nomic V1.5 (8K context, optimal for large functions)
ck --index --model nomic-v1.5 .

# Code-specialized: Jina Code (optimized for programming languages)
ck --index --model jina-code .
```

**Model Comparison:**
- **`bge-small`** (default): 400-token chunks, fast indexing, good for most code
- **`mxbai-xsmall`**: 4K context window, 384 dimensions, optimized for local inference (Mixedbread)
- **`nomic-v1.5`**: 1024-token chunks with 8K model capacity, better for large functions
- **`jina-code`**: 1024-token chunks with 8K model capacity, specialized for code understanding

### Index Management

```bash
# Check index status
ck --status .

# Clean up and rebuild / switch models
ck --clean .
ck --switch-model mxbai-xsmall .
ck --switch-model nomic-v1.5 .
ck --switch-model nomic-v1.5 --force .     # Force rebuild

# Add single file to index
ck --add new_file.rs

# File inspection (analyze chunking and token usage)
ck --inspect src/main.rs
ck --inspect --model bge-small src/main.rs  # Test different models
```

**Interrupting Operations:** Indexing can be safely interrupted with Ctrl+C. The partial index is saved, and the next operation will resume from where it stopped, only processing new or changed files.

## ๐Ÿ“š Language Support

| Language | Indexing | Tree-sitter Parsing | Semantic Chunking |
|----------|----------|-------------------|------------------|
| Python | โœ… | โœ… | โœ… Functions, classes |
| JavaScript/TypeScript | โœ… | โœ… | โœ… Functions, classes, methods |
| Rust | โœ… | โœ… | โœ… Functions, structs, traits |
| Go | โœ… | โœ… | โœ… Functions, types, methods |
| C | โœ… | โœ… | โœ… Functions, structs, enums, unions |
| C++ | โœ… | โœ… | โœ… Classes, structs, namespaces, templates |
| Markdown | โœ… | โœ… | โœ… Headings, sections, code blocks |
| Ruby | โœ… | โœ… | โœ… Classes, methods, modules |
| Haskell | โœ… | โœ… | โœ… Functions, types, instances |
| C# | โœ… | โœ… | โœ… Classes, interfaces, methods |
| Dart | โœ… | โœ… | โœ… Classes, mixins, methods |

**Text Formats:** JSON, YAML, TOML, XML, HTML, CSS, shell scripts, SQL, log files, config files, and any other text format.

**Smart Binary Detection:** Uses ripgrep-style content analysis, automatically indexing any text file while correctly excluding binary files.

**Unsupported File Types:** Text files with unrecognized extensions (like `.org`, `.adoc`, etc.) are automatically indexed as plain text. ck detects text vs binary based on file contents, not extensions.

## ๐Ÿ— Installation

### From crates.io
```bash
cargo install ck-search
```

### From Source
```bash
git clone https://github.com/BeaconBay/ck
cd ck
cargo install --path ck-cli
```

### Package Managers
```bash
# Currently available:
cargo install ck-search    # โœ… Available now via crates.io

# Coming soon:
brew install ck-search     # ๐Ÿšง In development (use cargo for now)
apt install ck-search      # ๐Ÿšง In development
```

## ๐Ÿ’ก Examples

### Finding Code Patterns
```bash
# Find authentication/authorization code
ck --sem "user permissions" src/
ck --sem "access control" src/
ck --sem "login validation" src/

# Find error handling strategies
ck --sem "exception handling" src/
ck --sem "error recovery" src/
ck --sem "fallback mechanisms" src/

# Find performance-related code
ck --sem "caching strategies" src/
ck --sem "database optimization" src/
ck --sem "memory management" src/
```

### Team Workflows
```bash
# Find related test files
ck --sem "unit tests for authentication" tests/
ck -l --sem "test" tests/           # List test files by semantic content

# Identify refactoring candidates
ck --sem "duplicate logic" src/
ck --sem "code complexity" src/
ck -L "test" src/                   # Find source files without tests

# Security audit
ck --hybrid "password|credential|secret" src/
ck --sem "input validation" src/
```

### Integration Examples
```bash
# Git hooks
git diff --name-only | xargs ck --sem "TODO"

# CI/CD pipeline
ck --json --sem "security vulnerability" . | security_scanner.py

# Code review prep
ck --hybrid --scores "performance" src/ > review_notes.txt

# Documentation generation
ck --json --sem "public API" src/ | generate_docs.py
```

## โšก Performance

**Field-tested on real codebases:**

- **Indexing:** ~1M LOC in under 2 minutes
- **Incremental indexing:** 80-90% cache hit rate for typical code changes (only changed chunks re-embedded)
- **Search:** Sub-500ms queries on typical codebases
- **Index size:** ~2x source code size with compression
- **Memory:** Efficient streaming for large repositories
- **Token precision:** HuggingFace tokenizers for exact model-specific token counting

## ๐Ÿ”ง Architecture

ck uses a modular Rust workspace:

- **`ck-cli`** - Command-line interface and MCP server
- **`ck-tui`** - Interactive terminal user interface (ratatui-based)
- **`ck-core`** - Shared types, configuration, and utilities
- **`ck-engine`** - Search engine implementations (regex, semantic, hybrid)
- **`ck-index`** - File indexing, hashing, and sidecar management
- **`ck-embed`** - Text embedding providers (FastEmbed, API backends)
- **`ck-ann`** - Approximate nearest neighbor search indices
- **`ck-chunk`** - Text segmentation and language-aware parsing ([query-based chunking]docs/explanation/query-based-chunking.md)
- **`ck-models`** - Model registry and configuration management

### Index Storage

Indexes are stored in `.ck/` directories alongside your code:

```
project/
โ”œโ”€โ”€ src/
โ”œโ”€โ”€ docs/
โ””โ”€โ”€ .ck/           # Semantic index (can be safely deleted)
    โ”œโ”€โ”€ embeddings.json
    โ”œโ”€โ”€ ann_index.bin
    โ””โ”€โ”€ tantivy_index/
```

The `.ck/` directory is a cache โ€” safe to delete and rebuild anytime.

## ๐Ÿงช Testing

```bash
# Run the full test suite
cargo test --workspace

# Test with each feature combination
cargo hack test --each-feature --workspace
```

## ๐Ÿค Contributing

ck is actively developed and welcomes contributions:

1. **Issues:** Report bugs, request features
2. **Code:** Submit PRs for bug fixes, new features
3. **Documentation:** Improve examples, guides, tutorials
4. **Testing:** Help test on different codebases and languages

### Development Setup
```bash
git clone https://github.com/BeaconBay/ck
cd ck
cargo build --workspace
cargo test --workspace
./target/debug/ck --index test_files/
./target/debug/ck --sem "test query" test_files/
```

### CI Requirements
Before submitting a PR, ensure your code passes all CI checks:

```bash
# Format code (required)
cargo fmt --all

# Run clippy linter (required - must have no warnings)
cargo clippy --workspace --all-features --all-targets -- -D warnings

# Run tests (required)
cargo test --workspace

# Check minimum supported Rust version (MSRV)
cargo hack check --each-feature --locked --rust-version --workspace
```

The CI pipeline runs on Ubuntu, Windows, and macOS to ensure cross-platform compatibility.

## ๐Ÿ—บ Roadmap

### Current (v0.7+)
- โœ… MCP (Model Context Protocol) server for AI agent integration
- โœ… Chunk-level incremental indexing with smart embedding reuse
- โœ… grep-compatible CLI with semantic search and file listing flags
- โœ… FastEmbed integration with BGE models and enhanced model selection
- โœ… File exclusion patterns and glob support
- โœ… Threshold filtering and relevance scoring with visual highlighting
- โœ… Tree-sitter parsing and intelligent chunking for 7+ languages
- โœ… Complete code section extraction (`--full-section`)
- โœ… Clean stdout/stderr separation for reliable scripting
- โœ… Token-aware chunking with HuggingFace tokenizers
- โœ… Published to crates.io (`cargo install ck-search`)

### Next (v0.6+)
- ๐Ÿšง Configuration file support
- ๐Ÿšง Package manager distributions (brew, apt)
- ๐Ÿšง Enhanced MCP tools (file writing, refactoring assistance)
- ๐Ÿšง VS Code extension
- ๐Ÿšง JetBrains plugin
- ๐Ÿšง Additional language chunkers (Java, PHP, Swift)

## โ“ FAQ

**Q: How is this different from grep/ripgrep/silver-searcher?**
A: ck includes all the features of traditional search tools, but adds semantic understanding. Search for "error handling" and find relevant code even when those exact words aren't used.

**Q: Does it work offline?**
A: Yes, completely offline. The embedding model runs locally with no network calls.

**Q: How big are the indexes?**
A: Typically 1-3x the size of your source code. The `.ck/` directory can be safely deleted to reclaim space.

**Q: Is it fast enough for large codebases?**
A: Yes. The first semantic search builds the index automatically; after that only changed files are reprocessed, keeping searches sub-second even on large projects.

**Q: Can I use it in scripts/automation?**
A: Absolutely. The `--json` and `--jsonl` flags provide structured output perfect for automated processing and AI agent integration.

**Q: What about privacy/security?**
A: Everything runs locally. No code or queries are sent to external services. The embedding model is downloaded once and cached locally.

**Q: Where are the embedding models cached?**
A: Models are cached in platform-specific directories:
- Linux/macOS: `~/.cache/ck/models/`
- Windows: `%LOCALAPPDATA%\ck\cache\models\`
- Fallback: `.ck_models/models/` in current directory

## ๐Ÿ“„ License

Licensed under either of:
- Apache License, Version 2.0 ([LICENSE-APACHE]LICENSE-APACHE)
- MIT License ([LICENSE-MIT]LICENSE-MIT)

at your option.

## ๐Ÿ™ Credits

Built with:
- [Rust]https://rust-lang.org - Systems programming language
- [FastEmbed]https://github.com/Anush008/fastembed-rs - Fast text embeddings
- [Tantivy]https://github.com/quickwit-oss/tantivy - Full-text search engine
- [clap]https://github.com/clap-rs/clap - Command line argument parsing

Inspired by the need for better code search tools in the age of AI-assisted development.

---

**Start finding code by what it does, not what it says.**

```bash
cargo install ck-search
ck --sem "the code you're looking for"
```