SemTools
Semantic search and document parsing tools for the command line
A collection of high-performance CLI tools for document processing and semantic search, built with Rust for speed and reliability.
- parse - Parse documents (PDF, DOCX, etc.) into markdown format, using the LlamaParse API by default
- search - Local semantic keyword search using multilingual embeddings, with cosine similarity matching and per-line context matching
NOTE: By default, parse uses LlamaParse as a backend. Get your API key today for free at https://cloud.llamaindex.ai. search remains local-only.
Key Features
- Fast semantic search using model2vec embeddings, without the burden of a vector database
- Reliable document parsing with caching and error handling
- Unix-friendly design with proper stdin/stdout handling
- Configurable distance thresholds and returned chunk sizes
- Multi-format support for parsing documents (PDF, DOCX, PPTX, etc.)
- Concurrent processing for better parsing performance
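To make the "no vector database" idea concrete, here is a minimal sketch of threshold-based search over per-line embeddings. It is not the tool's actual implementation: the names `cosine_distance`, `search_lines`, and `Hit` are hypothetical, and the embeddings are assumed to have been computed already (for example by a model2vec model). It only illustrates how a cosine distance threshold and a context window around each matching line could fit together.

```rust
/// Cosine distance between two equal-length vectors: 1 - cos(a, b).
/// Returns 1.0 if either vector has zero norm (treated as unrelated).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 1.0;
    }
    1.0 - dot / (norm_a * norm_b)
}

/// A matching line plus the inclusive range of surrounding lines to show.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct Hit {
    line: usize,
    start: usize,
    end: usize,
    distance: f32,
}

/// Brute-force scan: keep every line whose embedding is within
/// `max_distance` of the query, attach `n_context` lines on each side,
/// and return the hits sorted from closest to farthest.
fn search_lines(
    query: &[f32],
    line_embeddings: &[Vec<f32>],
    max_distance: f32,
    n_context: usize,
) -> Vec<Hit> {
    let mut hits: Vec<Hit> = line_embeddings
        .iter()
        .enumerate()
        .map(|(i, emb)| (i, cosine_distance(query, emb)))
        .filter(|(_, d)| *d <= max_distance)
        .map(|(i, d)| Hit {
            line: i,
            start: i.saturating_sub(n_context),
            // Clamp the context window to the last line of the input.
            end: (i + n_context).min(line_embeddings.len() - 1),
            distance: d,
        })
        .collect();
    hits.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    hits
}
```

A linear scan like this is usually fast enough for the file sizes handled on a command line, which is why no index or database is needed in this sketch.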
Quick Start
Prerequisites:
- Rust + Cargo
- For the parse tool: LlamaIndex Cloud API key
Install:
# install entire crate
# install only parse
# install only search
Basic Usage:
# Parse some files
# Search some (text-based) files
# Combine parsing and search
|
Advanced Usage:
# Combine with grep for exact-match pre-filtering and distance thresholding
| | |
# Pipeline with content search (note the 'cat')
| |
# Combine with grep for filtering (grep could be before or after parse/search!)
| |
# Save search results
| |
CLI Help
Positional arguments: <FILES>... and <QUERY> (the query).
Configuration
Parse Tool Configuration
By default, the parse tool uses the LlamaParse API to parse documents.
It will look for a ~/.parse_config.json file to configure the API key and other parameters.
Otherwise, it falls back to the LLAMA_CLOUD_API_KEY environment variable and a set of default parameters.
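The lookup order described above can be sketched as follows. This is an illustrative sketch, not the tool's actual code: `ConfigSource` and `resolve_config_source` are hypothetical names, and the file check and environment value are passed in as parameters so the logic can be exercised without touching the real filesystem.

```rust
use std::path::{Path, PathBuf};

/// Where the parse settings would come from (hypothetical type).
#[derive(Debug, PartialEq)]
enum ConfigSource {
    /// ~/.parse_config.json was found and takes priority.
    File(PathBuf),
    /// No config file; the API key comes from the environment.
    EnvVar(String),
    /// Neither a config file nor an API key is available.
    Missing,
}

/// Prefer the config file in `home`; otherwise fall back to the value of
/// the LLAMA_CLOUD_API_KEY environment variable supplied in `env_key`.
fn resolve_config_source(
    home: &Path,
    file_exists: impl Fn(&Path) -> bool,
    env_key: Option<String>,
) -> ConfigSource {
    let config_path = home.join(".parse_config.json");
    if file_exists(&config_path) {
        return ConfigSource::File(config_path);
    }
    match env_key {
        // An empty variable is treated the same as an unset one (an assumption).
        Some(key) if !key.is_empty() => ConfigSource::EnvVar(key),
        _ => ConfigSource::Missing,
    }
}
```

In a real program the arguments could be supplied as `|p| p.exists()` and `std::env::var("LLAMA_CLOUD_API_KEY").ok()`.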
To configure the parse tool, create a ~/.parse_config.json file containing the API key and any parameters you want to override.
Alternatively, set only the LLAMA_CLOUD_API_KEY environment variable and rely on the default parameters.
Agent Use Case Examples
Future Work
- More parsing backends (something local-only would be great!)
- Allowing model selection for the search tool
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- LlamaIndex/LlamaParse for document parsing capabilities
- model2vec-rs for fast embedding generation
- simsimd for efficient similarity computation