ck - Semantic Grep by Embedding
ck (seek) finds code by meaning, not just keywords. It's a drop-in replacement for grep that understands what you're looking for — search for "error handling" and find try/catch blocks, error returns, and exception handling code even when those exact words aren't present.
Quick Start
# Install from crates.io
cargo install ck-search
# Or build from source
# Index your project for semantic search (one-time setup)
# Search by meaning - automatically updates index for changed files
# Traditional grep-compatible search still works
# Combine both: semantic relevance + keyword filtering
Why ck?
For Developers: Stop hunting through thousands of regex false positives. Find the code you actually need by describing what it does.
For AI Agents: Get structured, semantic search results in JSONL format. Stream-friendly, error-resilient output perfect for LLM workflows, code analysis, documentation generation, and automated refactoring.
Core Features
🔍 Semantic Search
Find code by concept, not keywords. Searches understand synonyms, related terms, and conceptual similarity.
# These find related code even without exact keywords:
# Get complete functions/classes containing matches (NEW!)
⚡ Drop-in grep Compatibility
All your muscle memory works. Same flags, same behavior, same output format.
🎯 Hybrid Search
Combine keyword precision with semantic understanding using Reciprocal Rank Fusion.
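As a rough illustration of Reciprocal Rank Fusion, the sketch below merges two ranked lists by summing 1 / (k + rank) for each document. The constant k=60 is a commonly used default and an assumption here, not a value confirmed for ck; the file names and the function `reciprocal_rank_fusion` are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

keyword_hits = ["auth.rs", "login.rs", "session.rs"]  # hypothetical keyword ranking
semantic_hits = ["login.rs", "token.rs", "auth.rs"]   # hypothetical semantic ranking
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
for path, score in fused:
    print(f"{score:.4f}  {path}")
```

Documents that rank well in both lists rise to the top, while a document found by only one method still appears further down.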
🤖 Agent-Friendly Output
Structured output for LLMs, scripts, and automation. JSONL (one JSON object per line) can be parsed incrementally, which makes it more reliable for AI agents than a single JSON array.
# JSONL format - one JSON object per line (recommended for agents)
# Traditional JSON (single array)
Why JSONL for AI agents?
- ✅ Streaming friendly: Process results as they arrive, no waiting for complete response
- ✅ Memory efficient: Parse one result at a time, not entire array into memory
- ✅ Error resilient: One malformed line doesn't break entire response
- ✅ Composable: Works perfectly with Unix pipes and stream processing
- ✅ Standard format: Used by OpenAI API, Anthropic API, and modern ML pipelines
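The streaming and error-resilience points above can be sketched in a few lines of Python. This is a minimal illustration, not ck's own tooling: the field names `path` and `score` and the sample lines are assumptions made for the example.

```python
import json

def iter_jsonl(lines):
    """Yield one parsed object per valid JSONL line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # One bad line is dropped; the rest of the stream stays usable.
            continue

# Hypothetical sample output; "path" and "score" are assumed field names.
sample = [
    '{"path": "src/auth.rs", "score": 0.91}',
    '{"path": "src/broken.rs", "score": ',  # truncated line
    '{"path": "src/login.rs", "score": 0.42}',
]
confident = [r for r in iter_jsonl(sample) if r.get("score", 0.0) >= 0.8]
print(confident)
```

In a pipeline, `sys.stdin` can be passed in place of `sample`, so each result is handled as soon as its line arrives.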
JSONL Output Format:
Agent Processing Example:
# Stream-process JSONL results (memory efficient)
# High-confidence matches only
📁 Smart File Filtering
Automatically excludes cache directories, build artifacts, and respects .gitignore files.
# Respects .gitignore by default (NEW!)
# These are also excluded by default:
# .git, node_modules, target/, __pycache__
# Override defaults:
# Works with indexing too (NEW in v0.3.6!):
How It Works
1. Index Once, Search Many
# Create semantic index (one-time setup)
# Now search instantly by meaning
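Because the index is updated only for changed files, some form of change detection is needed between runs. The sketch below shows one way hash-based detection could work; SHA-256, the in-memory manifest, and the function names are assumptions for illustration and do not describe ck's actual on-disk format.

```python
import hashlib

def content_digest(data):
    """Return the SHA-256 hex digest of a file's bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()

def files_to_reindex(current_files, known_digests):
    """Return paths whose content hash differs from the recorded one.

    current_files maps path -> file bytes; known_digests maps path -> digest
    from the previous run (a hypothetical manifest) and is updated in place.
    """
    changed = []
    for path, data in current_files.items():
        digest = content_digest(data)
        if known_digests.get(path) != digest:
            changed.append(path)
            known_digests[path] = digest
    return changed

manifest = {}
print(files_to_reindex({"src/main.rs": b"fn main() {}"}, manifest))    # first run: new file
print(files_to_reindex({"src/main.rs": b"fn main() {}"}, manifest))    # unchanged
print(files_to_reindex({"src/main.rs": b"fn main() { }"}, manifest))   # edited
```

In practice the bytes would come from reading each file on disk; only the paths returned need to be re-chunked and re-embedded.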
2. Embedding Model Selection
Choose the right model for your needs when creating the index:
# Default: BGE-Small (fast, precise chunking)
# Enhanced: Nomic V1.5 (8K context, optimal for large functions)
# Code-specialized: Jina Code (optimized for programming languages)
Model Comparison:
- bge-small (default): 400-token chunks, fast indexing, good for most code
- nomic-v1.5: 1024-token chunks with 8K model capacity, better for large functions and classes
- jina-code: 1024-token chunks with 8K model capacity, specialized for code understanding
New in v0.4.5: Token-aware chunking uses actual model tokenizers for precise sizing, with model-specific chunk configurations balancing precision vs context.
Note: Model choice is set during indexing. Existing indexes will automatically use their original model.
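To make the idea of a per-model token budget concrete, here is a simplified sketch that greedily packs whole lines into chunks of at most `max_tokens` tokens. It is line-based and uses whitespace splitting as a stand-in tokenizer, both of which are assumptions; ck itself is described as using the model's real tokenizer together with language-aware chunking.

```python
def chunk_by_tokens(text, max_tokens=400, tokenize=str.split):
    """Greedily pack whole lines into chunks of at most max_tokens tokens.

    tokenize is a stand-in (whitespace split); a real setup would count
    tokens with the embedding model's own tokenizer.
    """
    chunks, current, used = [], [], 0
    for line in text.splitlines():
        n = len(tokenize(line))
        # Start a new chunk when this line would overflow the budget.
        if current and used + n > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += n
    if current:
        chunks.append("\n".join(current))
    return chunks

text = "\n".join("a b c" for _ in range(5))  # five lines of three tokens each
chunks = chunk_by_tokens(text, max_tokens=6)
print(len(chunks))
```

A single line longer than the budget becomes its own oversized chunk in this sketch; a fuller version would split such lines further.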
3. Three Search Modes
- --regex (default): Classic grep behavior, no indexing required
- --sem: Pure semantic search using embeddings (requires index)
- --hybrid: Combines regex + semantic with intelligent ranking
4. Relevance Scoring
# [0.847] ./ai_guide.txt: Machine learning introduction...
# [0.732] ./statistics.txt: Statistical learning methods...
# [0.681] ./algorithms.txt: Classification algorithms...
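For scripts that consume the scored text output shown above, a small parser can extract the score and apply a cutoff. This sketch assumes the `[score] path: snippet` layout from the example lines and paths that contain no colon; the function names are hypothetical.

```python
import re

# Matches "[0.847] ./path: snippet"; assumes the path itself contains no colon.
SCORE_LINE = re.compile(r"^\[(\d+\.\d+)\]\s+([^:]+):\s*(.*)$")

def parse_scored_line(line):
    """Split '[score] path: snippet' into (score, path, snippet), or None if it does not match."""
    match = SCORE_LINE.match(line.strip())
    if match is None:
        return None
    score, path, snippet = match.groups()
    return float(score), path, snippet

def filter_by_threshold(lines, threshold):
    """Keep only parsed hits whose score is at least threshold."""
    parsed = (parse_scored_line(line) for line in lines)
    return [hit for hit in parsed if hit is not None and hit[0] >= threshold]

output = [
    "[0.847] ./ai_guide.txt: Machine learning introduction...",
    "[0.732] ./statistics.txt: Statistical learning methods...",
    "[0.681] ./algorithms.txt: Classification algorithms...",
]
hits = filter_by_threshold(output, 0.7)
print(hits)
```

For anything beyond quick scripts, the structured JSON or JSONL output is a more robust input than parsing text lines.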
Advanced Usage
Search Specific Files
# Glob patterns work
# Multiple files
# Quoted patterns prevent shell expansion
Threshold Filtering
# Only high-confidence semantic matches
# Low-confidence hybrid matches (good for exploration)
# Get complete code sections instead of snippets (NEW!)
Top-K Results
# Limit results for focused analysis
# Great for AI agent consumption
Directory Management
# Check index status
# Clean up and rebuild
# Add single file to index
File Inspection (New in v0.4.5)
Analyze how files will be chunked for embedding with the enhanced --inspect command:
# Inspect file chunking and token usage
# Output: File info, chunk count, token statistics, and chunk details
# Example output:
# File: src/main.rs (49.6 KB, 1378 lines, 12083 tokens)
# Language: rust
#
# Chunks: 17 (tokens: min=4, max=3942, avg=644)
# 1. mod: 4 tokens | L9-9 | mod progress;
# 2. func: 1185 tokens | L88-294 | struct Cli { ... }
# 3. func: 442 tokens | L296-341 | fn expand_glob_patterns(...
# Check different model configurations
File Support
| Language | Indexing | Tree-sitter Parsing | Semantic Chunking |
|---|---|---|---|
| Python | ✅ | ✅ | ✅ Functions, classes |
| JavaScript | ✅ | ✅ | ✅ Functions, classes, methods |
| TypeScript | ✅ | ✅ | ✅ Functions, classes, methods |
| Haskell | ✅ | ✅ | ✅ Functions, types, instances |
| Rust | ✅ | ✅ | ✅ Functions, structs, traits |
| Ruby | ✅ | ✅ | ✅ Classes, methods, modules |
| Go | ✅ | ✅ | ✅ Functions, types, methods, variables |
| C# | ✅ | ✅ | ✅ Classes, interfaces, methods, variables |
Text Formats: Markdown, JSON, YAML, TOML, XML, HTML, CSS, shell scripts, SQL, log files, config files, and any other text format.
Smart Binary Detection: Uses ripgrep-style content analysis (NUL byte detection) instead of extension-based filtering, automatically indexing any text file regardless of extension while correctly excluding binary files.
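NUL-byte detection is simple to sketch: text files almost never contain a zero byte, while most binary formats do. The 8192-byte sample size and the function name below are assumptions chosen for illustration, not ck's documented values.

```python
def looks_binary(data, sample_size=8192):
    """Treat content as binary if a NUL byte appears in the first sample_size bytes."""
    return b"\x00" in data[:sample_size]

print(looks_binary(b"fn main() {}\n"))              # source text
print(looks_binary(b"\x7fELF\x02\x01\x01\x00"))      # start of an ELF header
```

Checking only a leading sample keeps the test cheap on large files, at the cost of missing a NUL byte that first appears later in the file.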
Smart Exclusions: Automatically skips .git, node_modules, target/, build/, dist/, __pycache__/, .venv, venv, and other common build/cache/virtual environment directories.
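A directory-based exclusion check like the one described can be sketched as a lookup on each path component. The set below copies the directory names listed above; the function name and the POSIX-style path handling are assumptions, and .gitignore handling is not covered.

```python
from pathlib import PurePosixPath

# Directory names taken from the list above; the real default set may be larger.
DEFAULT_EXCLUDED_DIRS = {
    ".git", "node_modules", "target", "build", "dist",
    "__pycache__", ".venv", "venv",
}

def is_excluded(path, excluded=DEFAULT_EXCLUDED_DIRS):
    """True if any directory component of path (not the file name) is excluded."""
    parts = PurePosixPath(path).parts
    return any(part in excluded for part in parts[:-1])

print(is_excluded("project/node_modules/lib/index.js"))
print(is_excluded("src/build.rs"))
```

Matching whole components rather than substrings keeps a file such as `src/build.rs` searchable even though `build/` is excluded.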
Installation
From Source
From crates.io
# Install latest release from crates.io
cargo install ck-search
Package Managers (Planned)
# Coming soon:
Architecture
ck uses a modular Rust workspace:
- ck-cli - Command-line interface and argument parsing
- ck-core - Shared types, configuration, and utilities
- ck-search - Search engine implementations (regex, BM25, semantic)
- ck-index - File indexing, hashing, and sidecar management
- ck-embed - Text embedding providers (FastEmbed, API backends)
- ck-ann - Approximate nearest neighbor search indices
- ck-chunk - Text segmentation and language-aware parsing
- ck-models - Model registry and configuration management
Index Storage
Indexes are stored in .ck/ directories alongside your code:
project/
├── src/
├── docs/
└── .ck/ # Semantic index (can be safely deleted)
    ├── embeddings.json
    ├── ann_index.bin
    └── tantivy_index/
The .ck/ directory is a cache — safe to delete and rebuild anytime.
Examples
Finding Code Patterns
# Find authentication/authorization code
# Find error handling strategies
# Find performance-related code
Integration Examples
# Git hooks
# CI/CD pipeline
# Code review prep
# Documentation generation
Team Workflows
# Find related test files
# Identify refactoring candidates
# Security audit
Configuration
Default Exclusions
# View current exclusion patterns
# These directories are excluded by default:
# .git, .svn, .hg # Version control
# node_modules, target, build # Build artifacts
# .cache, __pycache__ # Caches
# .vscode, .idea # IDE files
Custom Configuration (Planned)
# .ck/config.toml
[]
= "hybrid"
= 0.05
[]
= ["*.log", "temp/"]
= 512
= 64
[]
= "BAAI/bge-small-en-v1.5"
Performance
- Indexing: ~1M LOC in under 2 minutes (with smart exclusions and token-aware chunking)
- Search: Sub-500ms queries on typical codebases
- Index size: ~2x source code size with compression
- Memory: Efficient streaming for large repositories with span-based content extraction
- File filtering: Automatic exclusion of virtual environments and build artifacts
- Output: Clean stdout/stderr separation for reliable piping and scripting
- Token precision: HuggingFace tokenizers for exact model-specific token counting (v0.4.5+)
Testing
Run the comprehensive test suite:
# Full test suite (40+ tests)
# Quick smoke test (14 core tests)
Tests cover grep compatibility, semantic search, index management, file filtering, and more.
Contributing
ck is actively developed and welcomes contributions:
- Issues: Report bugs, request features
- Code: Submit PRs for bug fixes, new features
- Documentation: Improve examples, guides, tutorials
- Testing: Help test on different codebases and languages
Development Setup
Roadmap
Current (v0.4+)
- ✅ grep-compatible CLI with semantic search and file listing flags (-l, -L)
- ✅ FastEmbed integration with BGE models and enhanced model selection
- ✅ File exclusion patterns and glob support
- ✅ Threshold filtering and relevance scoring with visual highlighting
- ✅ Tree-sitter parsing and intelligent chunking (Python, TypeScript, JavaScript, Go, Haskell, Rust, Ruby)
- ✅ Complete code section extraction (--full-section)
- ✅ Enhanced indexing strategy with v3 semantic search optimization
- ✅ Clean stdout/stderr separation for reliable scripting
- ✅ Incremental index updates with hash-based change detection
- ✅ Token-aware chunking with HuggingFace tokenizers (v0.4.5)
- ✅ Model-specific chunk sizing and FastEmbed capacity utilization (v0.4.5)
- ✅ Enhanced --inspect command with token analysis (v0.4.5)
- ✅ Granular indexing progress with file-level and chunk-level progress bars (v0.4.5)
Next (v0.5+)
- ✅ Published to crates.io (cargo install ck-search)
- 🚧 Configuration file support
- 🚧 Package manager distributions (brew, apt)
FAQ
Q: How is this different from grep/ripgrep/silver-searcher?
A: ck keeps grep-compatible regex search and adds semantic understanding on top. Search for "error handling" and find relevant code even when those exact words aren't used.
Q: Does it work offline?
A: Yes, completely offline. The embedding model runs locally with no network calls.
Q: How big are the indexes?
A: Typically 1-3x the size of your source code, depending on content. The .ck/ directory can be safely deleted to reclaim space.
Q: Is it fast enough for large codebases?
A: Yes. Indexing is a one-time cost, and searches are sub-second even on large projects. Regex searches require no indexing and are as fast as grep.
Q: Can I use it in scripts/automation?
A: Absolutely. The --json flag provides structured output perfect for automated processing. Use --full-section to get complete functions for AI analysis.
Q: What about privacy/security?
A: Everything runs locally. No code or queries are sent to external services. The embedding model is downloaded once and cached locally.
Q: Where are the embedding models cached?
A: The embedding models (ONNX format) are downloaded and cached in platform-specific directories:
- Linux/macOS: ~/.cache/ck/models/ (or $XDG_CACHE_HOME/ck/models/ if set)
- Windows: %LOCALAPPDATA%\ck\cache\models\
- Fallback: .ck_models/models/ in the current directory (only if no home directory is found)
The models are downloaded automatically on first use and reused for subsequent runs.
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
Credits
Built with:
- Rust - Systems programming language
- FastEmbed - Fast text embeddings
- Tantivy - Full-text search engine
- clap - Command line argument parsing
Inspired by the need for better code search tools in the age of AI-assisted development.
Start finding code by what it does, not what it says.