# EmbedCache

Stop recomputing embeddings. Start shipping faster.

EmbedCache is a Rust library and REST API that generates text embeddings locally and caches the results. No external API calls, no per-token billing, no rate limits. Just fast, local embeddings with 22+ models.
## Why EmbedCache?
Building RAG apps, semantic search, or anything with embeddings? You've probably hit these problems:
- Recomputing the same embeddings every time you restart your app
- Paying for API calls to embed text you've already processed
- Waiting on rate limits when you need to embed thousands of documents
- Vendor lock-in to a specific embedding provider
EmbedCache fixes all of this. Embeddings are generated locally using FastEmbed and cached in SQLite. Process a URL once, get instant results forever.
## Features
- 22+ embedding models - BGE, MiniLM, Nomic, E5 multilingual, and more
- Local inference - No API keys, no costs, no rate limits
- Automatic caching - SQLite-backed, survives restarts
- LLM-powered chunking - Optional semantic chunking via Ollama/OpenAI
- Dual interface - Use as a Rust library or REST API
- Built-in docs - Swagger, ReDoc, RapiDoc, Scalar
## Quick Start
### As a Service

Start the server, then call the REST API. The request bodies below are a sketch (the JSON field names are assumptions); see the interactive docs at `/swagger` for the exact schema.

```bash
# Generate embeddings (the "texts" field name is an assumption; check /swagger for the real schema)
curl -X POST http://127.0.0.1:8081/v1/embed \
  -H 'Content-Type: application/json' \
  -d '{"texts": ["hello world", "local embeddings, no API key"]}'

# Process a URL (fetches, chunks, embeds, caches)
curl -X POST http://127.0.0.1:8081/v1/process \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com/article"}'
```
### As a Library
```toml
[dependencies]
# Crate name assumed from the project name; check crates.io for the published name.
embedcache = "0.1"
tokio = { version = "1", features = ["full"] }
```
A minimal usage sketch; the type and method names below are assumptions about the crate's API, so check the generated docs for the real signatures:

```rust
// NOTE: `EmbedCache`, `new`, and `embed` are illustrative assumptions, not confirmed API.
use embedcache::EmbedCache;

#[tokio::main]
async fn main() {
    // Open (or create) the SQLite-backed cache.
    let cache = EmbedCache::new("cache.db").expect("failed to open cache");

    // Embed a batch of texts; repeated inputs are served from the cache.
    let vectors = cache
        .embed("BGESmallENV15", &["hello world", "cached on the second call"])
        .await
        .expect("embedding failed");

    println!("got {} vectors", vectors.len());
}
```
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/v1/embed` | POST | Generate embeddings for a text array |
| `/v1/process` | POST | Fetch URL, chunk, embed, and cache |
| `/v1/params` | GET | List available models and chunkers |

Interactive docs at `/swagger`, `/redoc`, `/rapidoc`, or `/scalar`.
## Configuration

Create a `.env` file or set environment variables:

```env
SERVER_HOST=127.0.0.1
SERVER_PORT=8081
DB_PATH=cache.db
ENABLED_MODELS=BGESmallENV15,AllMiniLML6V2

# Optional: LLM-powered chunking
LLM_PROVIDER=ollama
LLM_MODEL=llama3
LLM_BASE_URL=http://localhost:11434
```
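
To sanity-check a configuration, start the server and hit `/v1/params` to see which models and chunkers are actually enabled (this assumes the usual `cargo run` binary target; adjust if you launch it differently):

```bash
# Start the server; it reads the .env file / environment variables above
cargo run --release

# In another shell: list the enabled models and chunkers
curl http://127.0.0.1:8081/v1/params
```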
## Supported Models

| Model | Dimensions | Use Case |
|---|---|---|
| `AllMiniLML6V2` | 384 | Fast, general purpose |
| `BGESmallENV15` | 384 | Best quality/speed balance |
| `BGEBaseENV15` | 768 | Higher quality |
| `BGELargeENV15` | 1024 | Highest quality |
| `MultilingualE5Base` | 768 | 100+ languages |
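
To pick a model per request, a call along these lines should work; the `model` and `texts` field names are assumptions, so confirm the exact schema in the interactive docs:

```bash
# Embed multilingual text with the E5 model
# NOTE: "model" and "texts" are assumed field names
curl -X POST http://127.0.0.1:8081/v1/embed \
  -H 'Content-Type: application/json' \
  -d '{"model": "MultilingualE5Base", "texts": ["Bonjour le monde", "Hallo Welt"]}'
```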
## Chunking Strategies

| Strategy | Description |
|---|---|
| `words` | Split by whitespace (fast, always available) |
| `llm-concept` | LLM identifies semantic boundaries |
| `llm-introspection` | LLM analyzes then chunks (highest quality) |
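
To request a specific strategy when processing a URL, something like the following should do it; the `chunker` and `url` field names are assumptions (check `/v1/params` and the interactive docs for the accepted values):

```bash
# Process a URL with LLM-based semantic chunking (needs the LLM_* settings from Configuration)
# NOTE: "url" and "chunker" are assumed field names
curl -X POST http://127.0.0.1:8081/v1/process \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com/article", "chunker": "llm-concept"}'
```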
### Custom Chunkers

Implement the `ContentChunker` trait to plug in your own strategy. The module path and method signature below are assumptions about the trait's shape; check the crate docs for the real definition:

```rust
// NOTE: the import path and the `chunk` signature are illustrative assumptions.
use async_trait::async_trait;
use embedcache::chunking::ContentChunker;

struct SentenceChunker;

#[async_trait]
impl ContentChunker for SentenceChunker {
    async fn chunk(&self, text: &str) -> Vec<String> {
        // Naive sentence split; replace with your own boundary detection.
        text.split(". ").map(str::to_string).collect()
    }
}
```
## Performance
- First request: ~100-500ms (model loading)
- Subsequent requests: ~10-50ms per text
- Cache hits: <5ms
Memory usage depends on enabled models (~200MB-800MB each).
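
Because the first request pays the model-loading cost, it can be worth warming each enabled model right after startup with a throwaway request (field name assumed, as elsewhere):

```bash
# Warm a model so real traffic skips the ~100-500ms load penalty
curl -s -X POST http://127.0.0.1:8081/v1/embed \
  -H 'Content-Type: application/json' \
  -d '{"texts": ["warmup"]}' > /dev/null
```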
## Documentation

Build docs locally:
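
For the Rust API docs, standard Cargo tooling works (this assumes no separate docs toolchain such as mdBook is required):

```bash
cargo doc --no-deps --open
```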
## Project Structure

```
src/
├── chunking/   # Text chunking (word, LLM-based)
├── embedding/  # Embedding generation (FastEmbed)
├── handlers/   # HTTP endpoints
├── cache/      # SQLite caching
├── models/     # Data types
└── utils/      # Hash generation, URL fetching
```
## Contributing
PRs welcome. Please open an issue first for major changes.
## License
GPL-3.0. See LICENSE.
## Links
Built by Skelf Research with FastEmbed and Actix-web.