SemTools
Semantic search and document parsing tools for the command line
A high-performance CLI tool for document processing and semantic search, built with Rust for speed and reliability.
- `parse` - Parse documents (PDF, DOCX, etc.) into markdown using, by default, the LlamaParse API
- `search` - Local semantic keyword search using multilingual embeddings with cosine similarity matching and per-line context matching
- `ask` - AI agent with search and read tools for answering questions over document collections (defaults to OpenAI, but see the configuration section to learn how to connect to any OpenAI-compatible API)
- `workspace` - Workspace management for accelerating search over large collections
NOTE: By default, parse uses LlamaParse as a backend. Get your API key today for free at https://cloud.llamaindex.ai. search and workspace remain local-only. ask requires an OpenAI API key.
Key Features
- Fast semantic search using model2vec embeddings from minishlab/potion-multilingual-128M
- Reliable document parsing with caching and error handling
- Unix-friendly design with proper stdin/stdout handling
- Configurable distance thresholds and returned chunk sizes
- Multi-format support for parsing documents (PDF, DOCX, PPTX, etc.)
- Concurrent processing for better parsing performance
- Workspace management for efficient document retrieval over large collections
Installation
Prerequisites:
- For the `parse` subcommand: a LlamaIndex Cloud API key
- For the `ask` subcommand: an OpenAI API key
Install:
You can install semtools via npm:
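A sketch of the npm install, assuming the package is published under the `@llamaindex` scope as `@llamaindex/semtools` (check the npm registry for the exact name):

```bash
npm install -g @llamaindex/semtools
```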
Or via cargo:

```bash
# install entire crate
cargo install semtools

# install only select features (feature names are listed in the crate's Cargo.toml)
cargo install semtools --no-default-features --features <feature>
```
Note: Installing from npm builds the Rust binaries locally during install if a prebuilt binary is not available, which requires Rust and Cargo to be available in your environment. Install them via rustup if needed: https://www.rust-lang.org/tools/install.
Quick Start
Basic Usage:
```bash
# Parse some files
parse my_dir/*.pdf

# Search some (text-based) files
search "some keywords" *.txt *.md

# Ask questions about your documents using an AI agent
ask "What are the key findings?" report.md

# Combine parsing and search
parse my_docs/*.pdf | xargs search "API endpoints"

# Ask a question to a set of files
ask "What deadlines are mentioned?" notes/*.md

# Combine parsing with the ask agent
parse report.pdf | xargs ask "What does this report conclude?"

# Ask based on stdin content
cat meeting_notes.txt | ask "What action items were agreed on?"
```
Advanced Usage:
```bash
# Combine with grep for exact-match pre-filtering and distance thresholding
parse docs/*.pdf | xargs grep -l "API" | xargs search "authentication" --max-distance 0.3

# Pipeline with content search (note the 'xargs' on search to search files instead of stdin)
find . -name "*.md" | xargs search "configuration"

# Combine with grep for filtering (grep could be before or after parse/search!)
parse report.pdf | xargs search "budget" | grep "2024"

# Save search results from stdin search
parse meeting_notes.pdf | xargs cat | search "action items" > results.md
```
Using Workspaces:
```bash
# Create or select a workspace
# Workspaces are stored in ~/.semtools/workspaces/
workspace use my-workspace

# Activate the workspace in your shell
export SEMTOOLS_WORKSPACE=my-workspace

# All search commands will now use the workspace for caching embeddings
# The initial command is used to initialize the workspace
search "some query" my_docs/*.md

# If documents change, they are automatically re-embedded and cached
# If documents are removed, you can run prune to clean up stale files
workspace prune

# You can see the stats of a workspace at any time
# (prints the active workspace name, root path, document count, and index status)
workspace status
```
CLI Help
Each tool prints its own usage text; run `parse --help`, `search --help`, `ask --help`, or `workspace --help` for the full list of options. In brief, `parse` takes one or more `<FILES>...` to convert, while `search` and `ask` take a `<QUERY>` followed by the files to operate on.
Configuration
SemTools uses a unified configuration file at ~/.semtools_config.json that contains settings for all CLI tools. You can also specify a custom config file path using the -c or --config flag on any command.
Unified Configuration File
Create a ~/.semtools_config.json file with settings for the tools you use. All sections are optional - if not specified, sensible defaults will be used. (The `parse_kwargs` section is passed directly to LlamaParse; see its docs for available options.)
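For illustration only, a minimal sketch of such a file, assuming top-level sections named after the tools (the exact schema is shown in `example_semtools_config.json` in the repository; key names below match the per-subcommand options documented later in this section):

```json
{
  "parse": {
    "api_key": "llx-...",
    "num_ongoing_requests": 10
  },
  "ask": {
    "api_key": "sk-...",
    "model": "gpt-4o-mini"
  }
}
```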
Find out more about parsing configuration on the dedicated documentation page.
See example_semtools_config.json in the repository for a complete example.
Environment Variables
As an alternative or supplement to the config file, you can set API keys via environment variables:
```bash
# For parse tool
export LLAMA_CLOUD_API_KEY="your-llama-cloud-key"

# For ask tool
export OPENAI_API_KEY="your-openai-key"
```
Configuration Priority
Configuration values are resolved in the following priority order (highest to lowest):
- CLI arguments (e.g., `--api-key`, `--model`, `--base-url`)
- Config file (`~/.semtools_config.json` or custom path via `-c`)
- Environment variables (`LLAMA_CLOUD_API_KEY`, `OPENAI_API_KEY`)
- Built-in defaults
This allows you to set common defaults in the config file while overriding them on a per-command basis when needed.
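As an illustrative sketch (not SemTools code), this resolution order behaves like chained shell parameter defaults, where the first non-empty value wins:

```bash
# Resolve one setting: CLI arg > config file > environment variable > default
resolve() {
  cli="$1"; cfg="$2"; env_val="$3"; def="$4"
  echo "${cli:-${cfg:-${env_val:-$def}}}"
}

resolve "" "" "" "gpt-4o-mini"        # no overrides -> built-in default
resolve "" "gpt-4o" "" "gpt-4o-mini"  # config file value wins over the default
```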
Subcommand-Specific Configuration
Parse Subcommand
The parse subcommand requires a LlamaParse API key. Get your free API key at https://cloud.llamaindex.ai.
Configuration options:
- `api_key`: Your LlamaParse API key
- `base_url`: API endpoint (default: "https://api.cloud.llamaindex.ai")
- `num_ongoing_requests`: Number of concurrent requests (default: 10)
- `parse_kwargs`: Additional parsing parameters
- `check_interval`, `max_timeout`, `max_retries`, `retry_delay_ms`, `backoff_multiplier`: Retry and timeout settings
Ask Subcommand
The ask subcommand requires an OpenAI API key for the agent's LLM.
Configuration options:
- `api_key`: Your OpenAI API key
- `base_url`: Custom OpenAI-compatible API endpoint (optional, for using other providers)
- `model`: LLM model to use (default: "gpt-4o-mini")
- `max_iterations`: Maximum agent loop iterations (default: 10)
You can also override these per-command:
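For example, using the flag spellings listed under Configuration Priority (the model name and URL here are placeholders):

```bash
ask "Summarize this document" report.md --model gpt-4o --base-url https://my-proxy.example.com/v1
```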
Agent Use Case Examples
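A few illustrative sketches of agent workflows (file names and questions are placeholders):

```bash
# Summarize a directory of parsed reports
parse reports/*.pdf | xargs ask "Summarize the key risks across these reports"

# Compare documents
ask "How do these two proposals differ on pricing?" proposal_a.md proposal_b.md
```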
Future Work
- More parsing backends (something local-only would be great!)
- Improved search algorithms
- Built-in agentic search
- Persistence for speedups on repeat searches on the same files
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- LlamaIndex/LlamaParse for document parsing capabilities
- model2vec-rs for fast embedding generation
- minishlab/potion-multilingual-128M for an amazing default static embedding model
- simsimd for efficient similarity computation