SemTools
Semantic search and document parsing tools for the command line
A collection of high-performance CLI tools for document processing and semantic search, built with Rust for speed and reliability.
Tools
- parse: Parse documents (PDF, DOCX, etc.) into Markdown, using the LlamaParse API by default
- search: Semantic search using multilingual embeddings, with cosine similarity matching and per-line context matching
Key Features
- Fast semantic search using model2vec embeddings, without the burden of a vector database
- Reliable document parsing with caching and error handling
- Unix-friendly design with proper stdin/stdout handling
- Configurable distance thresholds and returned chunk sizes
- Multi-format support for parsing documents (PDF, DOCX, PPTX, etc.)
- Concurrent processing for better parsing performance
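To illustrate how a search can run without a vector database, here is a minimal sketch of a brute-force scan by cosine distance. It is not the SemTools implementation. It assumes every line already has an embedding vector (toy 3-dimensional vectors stand in for model2vec output), and the function names, the `max_distance` and `n_lines` parameters, and the sample data are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Treat a zero vector as unrelated to everything.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def search_lines(lines, line_vectors, query_vector, max_distance=0.3, n_lines=1):
    """Scan every line, keep those within max_distance, and attach context.

    Each hit is (distance, surrounding_lines), where the context spans
    n_lines before and after the matching line. Hits are sorted so the
    most similar (lowest distance) come first.
    """
    hits = []
    for i, vec in enumerate(line_vectors):
        distance = cosine_distance(query_vector, vec)
        if distance <= max_distance:
            start = max(0, i - n_lines)
            end = min(len(lines), i + n_lines + 1)
            hits.append((distance, lines[start:end]))
    hits.sort(key=lambda hit: hit[0])
    return hits


# Hypothetical data: four lines with hand-made embeddings.
lines = ["intro", "revenue grew", "costs fell", "appendix"]
line_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
query_vector = [0.0, 1.0, 0.0]

results = search_lines(lines, line_vectors, query_vector, max_distance=0.1, n_lines=1)
for distance, context in results:
    print(f"{distance:.3f}", context)
```

A linear scan like this is only reasonable when embeddings are cheap to compute and the corpus fits in memory, which is the trade-off that avoiding a vector database implies.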
Quick Start
Prerequisites:
- Rust + Cargo
- For the parse tool: LlamaIndex Cloud API key
Install:
# install entire crate
# install only parse
# install only search
Basic Usage:
# Parse a PDF and search for specific content
# Search within many files after parsing
# Search with custom context and thresholds
# Search from stdin
CLI Help
<FILES>...
<QUERY>
Configuration
Parse Tool Configuration
By default, the parse tool uses the LlamaParse API to parse documents.
It will look for a ~/.parse_config.json file to configure the API key and other parameters.
If that file does not exist, it falls back to the LLAMA_CLOUD_API_KEY environment variable and a set of default parameters.
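The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the `api_key` field name and the `load_parse_config` helper are hypothetical, and only the file path and the LLAMA_CLOUD_API_KEY variable name come from this README.

```python
import json
import os
import tempfile
from pathlib import Path


def load_parse_config(config_path=None, env=None):
    """Prefer the JSON config file; otherwise fall back to the environment.

    The "api_key" field name is an assumption made for this sketch.
    """
    path = Path(config_path) if config_path else Path.home() / ".parse_config.json"
    env = os.environ if env is None else env
    if path.is_file():
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    # No config file: use the environment variable (may be None if unset).
    return {"api_key": env.get("LLAMA_CLOUD_API_KEY")}


# Demonstrate both branches in a temporary directory, so no real
# ~/.parse_config.json is read or written.
fake_env = {"LLAMA_CLOUD_API_KEY": "env-key"}
with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "missing.json"
    from_env = load_parse_config(missing, env=fake_env)

    present = Path(tmp) / "parse_config.json"
    present.write_text(json.dumps({"api_key": "file-key"}), encoding="utf-8")
    from_file = load_parse_config(present, env=fake_env)

print(from_env["api_key"], from_file["api_key"])
```

In this sketch the file takes precedence whenever it exists, so a stale config file would shadow a newer key set in the environment.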
To configure the parse tool, create a ~/.parse_config.json file with the following content (defaults are shown below):
Or just set the API key via the LLAMA_CLOUD_API_KEY environment variable:
Usage Examples
Basic Document Parsing and Search
# Parse multiple documents
# Chain parsing with semantic search
# Search with distance threshold (lower = more similar)
Advanced Search Patterns
# Search multiple files directly
# Combine with grep for exact-match pre-filtering and distance thresholding
# Pipeline with content search (note the 'cat')
Unix Pipeline Integration
The tools follow Unix philosophy and work seamlessly with standard tools:
# Combine with grep for filtering (could be before or after parse/search!)
# Use with xargs for batch processing
# Save search results
Further Documentation
Future Work
- More parsing backends (something local-only would be great!)
- Allowing model selection for the search tool
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- LlamaIndex/LlamaParse for document parsing capabilities
- model2vec for fast embedding generation
- simsimd for efficient similarity computation