NanoFTS
A high-performance full-text search engine with a Rust core, featuring efficient indexing and search for both English and Chinese text.
Features
- High Performance: Rust-powered core with sub-millisecond search latency
- LSM-Tree Architecture: Scalable to billions of documents
- Incremental Updates: Real-time document add/update/delete
- Fuzzy Search: Intelligent fuzzy matching with configurable thresholds
- Full CRUD: Complete document management operations
- Result Handles: Zero-copy result sets with set operations (AND/OR/NOT)
- NumPy Support: Direct numpy array output
- Multilingual: Support for both English and Chinese text
- Persistence: Disk-based storage with WAL recovery
- LRU Cache: Built-in caching for frequently accessed terms
- Data Import: Import from pandas, polars, arrow, parquet, CSV, JSON
Installation
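Assuming the Python bindings are published on PyPI under the same name as the Rust crate (`nanofts`), installation is the usual:

```shell
pip install nanofts
```

For using the Rust crate directly, see the Rust Usage section below.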
Quick Start
from nanofts import create_engine

# Create a search engine
engine = create_engine(index_file="./index.nfts", track_doc_terms=True)
# Add documents (field values must be strings)
engine.add_document(1, {"title": "Hello World", "content": "A quick start example"})
# Search - returns a ResultHandle object
result = engine.search("hello")
print(result.to_list())
# Update document (method name assumed; requires track_doc_terms=True)
engine.update_document(1, {"title": "Hello again"})
# Delete document (method name assumed)
engine.remove_document(1)
# Compact to persist deletions
engine.compact()
Rust Usage (Rust Core)
The Rust crate name is nanofts (minimum Rust version: rustc >= 1.75). If you are building a Rust service, you can use it directly as a pure Rust full-text search library.
Add as a dependency
Add this to your project's Cargo.toml:
[dependencies]
nanofts = "0.5.0"
Optional features:
- `mimalloc`: enabled by default; lower latency / more stable allocation performance
- `python`: enables the PyO3/NumPy bindings (only needed if you build the Python extension)
- `simd`: enables SIMD acceleration (requires nightly and `packed_simd_2`)
Minimal example: in-memory indexing and searching
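A minimal sketch, assuming an in-memory constructor `UnifiedEngine::new()` and a result handle exposing `to_vec()`; exact signatures may differ, so check the crate docs:

```rust
use std::collections::HashMap;
use nanofts::UnifiedEngine;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In-memory engine (constructor signature assumed).
    let engine = UnifiedEngine::new()?;

    // Batch ingestion: one field -> text map per document.
    let docs: Vec<(u64, HashMap<String, String>)> = vec![
        (1, HashMap::from([("title".into(), "Rust full-text search".into())])),
        (2, HashMap::from([("title".into(), "A fast search engine".into())])),
    ];
    engine.add_documents(docs)?;

    // Search; the result accessor name is assumed.
    let hits = engine.search("search")?;
    println!("{:?}", hits.to_vec());
    Ok(())
}
```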
Persistence: single-file index + WAL recovery
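A sketch of single-file persistence with WAL recovery; the constructor name `open` and its path argument are assumptions:

```rust
use std::collections::HashMap;
use nanofts::UnifiedEngine;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open (or create) a single-file index (constructor name assumed).
    let engine = UnifiedEngine::open("./index.nfts")?;

    let docs: Vec<(u64, HashMap<String, String>)> = vec![
        (1, HashMap::from([("title".into(), "Persistent document".into())])),
    ];
    engine.add_documents(docs)?;

    // Flush buffered updates; unflushed writes are replayed from the WAL
    // on the next open.
    engine.flush()?;

    // Deletes/updates stay logical until compaction.
    engine.compact()?;
    Ok(())
}
```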
Run the built-in Rust example in this repo with `cargo run --release --example <name>`; see the repo's `examples/` directory for available example names.
Performance Tuning (Rust Developer Perspective)
Build and runtime knobs
- Use release builds: `cargo build --release` / `cargo run --release` (this repo already configures `lto=fat`, `codegen-units=1`, `panic=abort`, `strip=true` for release).
- Optimize for your CPU (optional): set `RUSTFLAGS="-C target-cpu=native"` when building/running on a specific machine.
- SIMD (optional): if you enable `--features simd`, use nightly and validate the benefit for your workload.
Fastest ingestion formats and APIs
- Prefer batch ingestion: it reduces per-document overhead and lets the engine use its optimized parallel paths.
- Fastest Rust API: `UnifiedEngine::add_documents_texts(doc_ids, texts)` is the fastest ingestion path when you can pre-concatenate all searchable fields into a single `String` per document.
- Columnar ingestion: `UnifiedEngine::add_documents_columnar(doc_ids, columns)` avoids constructing a `HashMap` per document and is a good fit for Arrow/DataFrame-style input.
- Arrow zero-copy ingestion: if your data is already in Arrow (or can be represented as borrowed `&str` slices), use `UnifiedEngine::add_documents_arrow_str(doc_ids, columns)` (multi-column) or `UnifiedEngine::add_documents_arrow_texts(doc_ids, texts)` (single merged text column) to avoid `String` allocation/copy.
- Batch HashMap ingestion: `UnifiedEngine::add_documents(docs)` is still much faster than calling `add_document` in a loop.
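For instance, the pre-concatenated fast path might look like this (engine construction and the exact `add_documents_texts` signature are assumptions):

```rust
use nanofts::UnifiedEngine;

// Fastest path: one pre-merged text per document.
fn ingest_fast(engine: &UnifiedEngine) -> Result<(), Box<dyn std::error::Error>> {
    let doc_ids: Vec<u64> = vec![1, 2];
    // All searchable fields concatenated into a single String per document.
    let texts: Vec<String> = vec![
        "Rust search engine fast indexing".to_string(),
        "Full-text search with LSM tree".to_string(),
    ];
    engine.add_documents_texts(doc_ids, texts)?;
    Ok(())
}
```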
Arrow Zero-Copy API Examples
Multi-column zero-copy ingestion
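A reconstruction of the multi-column example; the constructor and the exact argument types (e.g. whether columns are `Vec<Vec<&str>>`) are assumptions:

```rust
use nanofts::UnifiedEngine;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = UnifiedEngine::new()?; // constructor signature assumed

    // Simulate Arrow StringArray data (in real use, extract from Arrow).
    let doc_ids: Vec<u64> = vec![1, 2, 3];
    let titles: Vec<&str> = vec!["Rust guide", "Search engines", "LSM trees"];
    let contents: Vec<&str> = vec![
        "A guide to Rust",
        "How search engines work",
        "Log-structured merge trees",
    ];

    // Zero-copy columnar ingestion: borrowed &str slices, one Vec per column.
    let columns: Vec<Vec<&str>> = vec![titles, contents];
    engine.add_documents_arrow_str(doc_ids, columns)?;
    Ok(())
}
```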
Single-column zero-copy ingestion (fastest for Arrow)
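Reconstructed single-column example (constructor and argument types assumed):

```rust
use nanofts::UnifiedEngine;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = UnifiedEngine::new()?; // constructor signature assumed

    // Pre-merged text from Arrow (single column): all searchable fields
    // already concatenated into one string per document.
    let doc_ids: Vec<u64> = vec![1, 2];
    let merged_texts: Vec<&str> = vec![
        "Rust guide A guide to Rust",
        "Search engines How search engines work",
    ];

    // Zero-copy single-column ingestion.
    engine.add_documents_arrow_texts(doc_ids, merged_texts)?;
    Ok(())
}
```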
Real Arrow StringArray integration
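A reconstruction using the `arrow` crate's `StringArray`; the iteration and null handling follow the standard Arrow API, while the constructor and `add_documents_arrow_str` signature are assumptions:

```rust
use arrow::array::StringArray;
use nanofts::UnifiedEngine;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = UnifiedEngine::new()?; // constructor signature assumed

    // Example with real Arrow StringArray columns.
    let title_array = StringArray::from(vec!["Rust guide", "Search engines"]);
    let content_array = StringArray::from(vec!["A guide to Rust", "How search works"]);

    // Extract zero-copy string slices from Arrow (nulls become "").
    let title_slices: Vec<&str> = title_array
        .iter()
        .map(|v| v.unwrap_or(""))
        .collect();
    let content_slices: Vec<&str> = content_array
        .iter()
        .map(|v| v.unwrap_or(""))
        .collect();

    let doc_ids: Vec<u64> = vec![1, 2];
    let columns: Vec<Vec<&str>> = vec![title_slices, content_slices];
    engine.add_documents_arrow_str(doc_ids, columns)?;
    Ok(())
}
```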
Flush/compact strategy
- `flush()` frequency: flushing periodically bounds WAL/memory usage, but flushing too often may increase IO amplification.
- Deletion persistence: deletes/updates are logical until `compact()`. If you delete a lot, compact in bigger batches rather than after every small delete wave.
- Track doc terms only when you need updates/deletes (Python: `track_doc_terms=True`); it adds extra bookkeeping on ingestion.
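That strategy might look like this in the Python API (batch sizes are illustrative; `remove_document` and the `add_documents` tuple format are assumed names):

```python
from nanofts import create_engine

engine = create_engine(index_file="./index.nfts", track_doc_terms=True)

# Flush per batch rather than per document: bounds WAL/memory growth
# without paying IO amplification on every write.
for batch_start in range(0, 100_000, 10_000):
    docs = [(i, {"text": f"document {i}"})
            for i in range(batch_start, batch_start + 10_000)]
    engine.add_documents(docs)
    engine.flush()

# Delete in bulk, then compact once, instead of compacting per delete.
for doc_id in range(0, 5_000):
    engine.remove_document(doc_id)  # method name assumed
engine.compact()
```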
Large indexes and memory footprint
- Use `lazy_load` when the index is large and you don't want to map everything into memory: `with_lazy_load(true)` / Python `lazy_load=True`.
- Tune `cache_size`: in `lazy_load` mode, cache hit rate is a major driver of latency. Iterate using `engine.stats()` (e.g., cache hit rate).
Query-side optimization
- Use boolean/batch APIs and set operations: prefer `search_and` / `search_or` or `ResultHandle::{intersect, union, difference}` to avoid repeated work.
- Fuzzy search is more expensive: `fuzzy_search` introduces extra candidate generation and edit-distance checks. Use it only when needed and tune thresholds/distances.
Benchmarking and profiling
- Benchmarks: use `cargo bench` (or your own fixed dataset) and compare A/B with realistic data scale, term distribution, and query sets.
- CPU profiling: profile release binaries to find hot spots (tokenization, bitmap ops, IO, compression/decompression). On macOS, Instruments is usually the easiest.
- Measure first: use `engine.stats()` to track search counts, cumulative time, and cache hit rate before tuning.
API Reference
Creating Engine
from nanofts import create_engine
engine = create_engine(index_file="./index.nfts", track_doc_terms=True, lazy_load=False)
Document Operations
# Add single document
engine.add_document(1, {"title": "Hello", "content": "World"})
# Add multiple documents
engine.add_documents([(2, {"title": "Foo"}), (3, {"title": "Bar"})])
# Update document (requires track_doc_terms=True; method name assumed)
engine.update_document(1, {"title": "Hello again"})
# Delete single document (method name assumed)
engine.remove_document(1)
# Delete multiple documents (method name assumed)
engine.remove_documents([2, 3])
# Flush buffer to disk
engine.flush()
# Compact index (applies deletions permanently)
engine.compact()
Search Operations
# Basic search - returns ResultHandle
result = engine.search("hello")
# Get results
ids = result.to_list()                    # List[int]
arr = result.to_numpy()                   # numpy array (accessor name assumed)
top10 = result.top(10)                    # Top N results (method name assumed)
page = result.page(2, 20)                 # Pagination (method name assumed)
# Result properties
count = len(result)                       # Total match count
is_empty = len(result) == 0               # Check if empty
found = 42 in result                      # Check if doc_id in results (membership assumed)
# Fuzzy search (for typo tolerance)
fuzzy = engine.fuzzy_search("helo")
was_fuzzy = fuzzy.is_fuzzy                # True if fuzzy matching was applied (attribute name assumed)
# Batch search (method name assumed)
results = engine.batch_search(["hello", "world"])
# AND search (intersection)
both = engine.search_and(["hello", "world"])
# OR search (union)
either = engine.search_or(["hello", "world"])
# Filter by document IDs (method name assumed)
filtered = result.filter_ids([1, 2, 3])
# Exclude specific IDs (method name assumed)
excluded = result.exclude_ids([4, 5])
Result Set Operations
# Search for different terms
py = engine.search("python")
tut = engine.search("tutorial")
# Intersection (AND)
both = py.intersect(tut)
# Union (OR)
either = py.union(tut)
# Difference (NOT)
only_py = py.difference(tut)
# Chained operations
chained = py.intersect(tut).difference(engine.search("beginner"))
Statistics
stats = engine.stats()
# {
# 'term_count': 1234,
# 'search_count': 100,
# 'fuzzy_search_count': 10,
# 'total_search_ns': 1234567,
# ...
# }
Data Import
NanoFTS supports importing from pandas DataFrames, Polars DataFrames, PyArrow Tables, Parquet files, CSV files, JSON and JSON Lines files, and lists of Python dicts.
Specifying Text Columns
By default, all columns except the ID column are indexed. You can instead specify which columns to index: for example, only 'title' and 'content', ignoring a 'metadata' column. The same option applies to all import methods.
CSV and JSON Options
You can pass additional options to the underlying pandas readers: for example, `sep=";"` for a custom CSV delimiter, or `lines=True` (a `pandas.read_json` option) for JSON Lines files.
Chinese Text Support
NanoFTS handles Chinese text using n-gram tokenization:
engine = create_engine()
engine.add_document(1, {"title": "全文搜索引擎", "content": "支持中文文本"})
# Search Chinese text
result = engine.search("搜索")
print(result.to_list())
# [1]
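Conceptually, n-gram tokenization slides a fixed-length window over a CJK string, emitting one overlapping gram per position, so any query substring of at least that length can be matched. A standalone sketch (bigrams chosen for illustration; NanoFTS's actual gram size is not documented here):

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """Slide a window of size n over the text, one gram per position."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# "全文搜索" -> overlapping bigrams
print(ngrams("全文搜索"))  # ['全文', '文搜', '搜索']
```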
Persistence and Recovery
# Create persistent index
engine = create_engine(index_file="./index.nfts")
engine.add_document(1, {"title": "Persistent"})
engine.flush()
# Close and reopen
del engine
engine = create_engine(index_file="./index.nfts")
# Data is automatically recovered
print(engine.search("Persistent").to_list())
# [1]
# Important: Use compact() to persist deletions
engine.remove_document(1)  # method name assumed
engine.compact()
# Deletions are now permanent
Memory-Only Mode
# Create in-memory engine (no persistence)
engine = create_engine()  # no index_file: assumed to run purely in memory
engine.add_document(1, {"title": "In-memory only"})
# No flush needed for in-memory mode
results = engine.search("memory").to_list()
Best Practices
For Production Use
- Always call `compact()` after bulk deletions; deletions are only persisted after compaction
- Use `track_doc_terms=True` if you need update/delete operations
- Call `flush()` periodically to persist new documents
- Use `lazy_load=True` for large indexes that don't fit in memory
Performance Tips
# Batch operations are faster
docs = [(i, {"title": f"Document {i}"}) for i in range(1000)]
engine.add_documents(docs)
# Much faster than individual add_document calls
# Use batch search for multiple queries (method name assumed)
results = engine.batch_search(["python", "rust", "search"])
# Use result set operations instead of multiple searches
# Good:
result = engine.search_and(["python", "tutorial"])
# Instead of:
# result = engine.search("python").intersect(engine.search("tutorial"))
Migration from Old API
If you're upgrading from the old FullTextSearch API:
# Old API (deprecated)
# from nanofts import FullTextSearch
# fts = FullTextSearch(index_dir="./index")
# fts.add_document(1, {"title": "Test"})
# results = fts.search("Test") # Returns List[int]
# New API
from nanofts import create_engine
engine = create_engine(index_file="./index.nfts")
engine.add_document(1, {"title": "Test"})
results = engine.search("Test").to_list()  # Returns List[int]
Key differences:
- `FullTextSearch` → `create_engine()` function
- `index_dir` → `index_file` (file path, not directory)
- Search returns `ResultHandle` instead of `List[int]`
- Call `.to_list()` to get document IDs
- Use `compact()` to persist deletions
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.