# NanoFTS
A high-performance full-text search engine with a Rust core, featuring efficient indexing and search for both English and Chinese text.
## Features
- **High Performance**: Rust-powered core with sub-millisecond search latency
- **LSM-Tree Architecture**: Scalable to billions of documents
- **Incremental Updates**: Real-time document add/update/delete
- **Fuzzy Search**: Intelligent fuzzy matching with configurable thresholds
- **Full CRUD**: Complete document management operations
- **Result Handle**: Zero-copy result with set operations (AND/OR/NOT)
- **NumPy Support**: Direct numpy array output
- **Multilingual**: Support for both English and Chinese text
- **Persistence**: Disk-based storage with WAL recovery
- **LRU Cache**: Built-in caching for frequently accessed terms
- **Data Import**: Import from pandas, polars, arrow, parquet, CSV, JSON
## Installation
```bash
pip install nanofts
```
## Quick Start
```python
from nanofts import create_engine
# Create a search engine
engine = create_engine(
    index_file="./index.nfts",
    track_doc_terms=True,  # Enable update/delete operations
)
# Add documents (field values must be strings)
engine.add_document(1, {"title": "Python Tutorial", "content": "Learn Python programming"})
engine.add_document(2, {"title": "Data Analysis", "content": "Process data with pandas"})
engine.flush()
# Search - returns ResultHandle object
result = engine.search("Python")
print(f"Found {result.total_hits} documents")
print(f"Document IDs: {result.to_list()}")
# Update document
engine.update_document(1, {"title": "Advanced Python Tutorial", "content": "Deep dive into Python"})
# Delete document
engine.remove_document(2)
# Compact to persist deletions
engine.compact()
```
## Rust Usage (Rust Core)
The Rust crate is named `nanofts` (minimum supported Rust version: `rustc >= 1.75`). If you are building a Rust service, you can use it directly as a pure-Rust full-text search library.
### Add as a dependency
Add this to your project's `Cargo.toml`:
```toml
[dependencies]
nanofts = "0.6.0"
```
Optional features:
- **`mimalloc`**: enabled by default; gives lower latency and more consistent allocation performance
- **`python`**: enable PyO3/Numpy bindings (only needed if you build the Python extension)
- **`simd`**: enable SIMD acceleration (requires nightly and `packed_simd_2`)
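Features are toggled with standard Cargo syntax; a minimal `Cargo.toml` sketch (the version matches the dependency above):
```toml
[dependencies]
# Opt into SIMD (requires a nightly toolchain)
nanofts = { version = "0.6.0", features = ["simd"] }

# Or opt out of the default mimalloc allocator:
# nanofts = { version = "0.6.0", default-features = false }
```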
### Minimal example: in-memory indexing and searching
```rust
use nanofts::{UnifiedEngine, EngineConfig, EngineResult};
use std::collections::HashMap;
fn main() -> EngineResult<()> {
    // 1) Create an in-memory engine
    let engine = UnifiedEngine::new(EngineConfig::memory_only())?;

    // 2) Add a document (field values must be String)
    let mut fields = HashMap::new();
    fields.insert("title".to_string(), "Rust Tutorial".to_string());
    fields.insert("content".to_string(), "Build a high-performance full-text search engine in Rust".to_string());
    engine.add_document(1, fields)?;

    // 3) Search
    let result = engine.search("Rust")?;
    println!("hits={}, ids={:?}", result.total_hits(), result.to_list());
    Ok(())
}
```
### Persistence: single-file index + WAL recovery
```rust
use nanofts::{UnifiedEngine, EngineConfig, EngineResult};
fn main() -> EngineResult<()> {
    let config = EngineConfig::persistent("./index.nfts")
        .with_lazy_load(true)
        .with_cache_size(10_000);
    let engine = UnifiedEngine::new(config)?;

    // ... add/update/remove ...

    // Flush new documents to disk
    engine.flush()?;
    // Deletions become permanent only after compaction
    engine.compact()?;
    Ok(())
}
```
### Run the built-in Rust example in this repo
```bash
cargo run --example basic_usage --release
```
## Performance Tuning (Rust Developer Perspective)
### Build and runtime knobs
- **Use release builds**: `cargo build --release` / `cargo run --release` (this repo already configures `lto=fat`, `codegen-units=1`, `panic=abort`, `strip=true` for release).
- **Optimize for your CPU** (optional): set `RUSTFLAGS="-C target-cpu=native"` when building/running on a specific machine.
- **SIMD** (optional): if you enable `--features simd`, use nightly and validate the benefit for your workload.
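For example, combining these knobs in a release build (a sketch; `cargo +nightly` assumes a rustup-managed nightly toolchain):
```bash
# Tune codegen for the local CPU in a release build
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Opt into SIMD on nightly, then validate the speedup on your own workload
cargo +nightly build --release --features simd
```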
### Fastest ingestion formats and APIs
- **Prefer batch ingestion**: it reduces per-document overhead and lets the engine use its optimized parallel paths.
- **Fastest Rust API**: `UnifiedEngine::add_documents_texts(doc_ids, texts)` is the fastest ingestion path when you can pre-concatenate all searchable fields into a single `String` per document.
- **Columnar ingestion**: `UnifiedEngine::add_documents_columnar(doc_ids, columns)` avoids constructing a `HashMap` per document and is a good fit for Arrow/DataFrame-style input.
- **Arrow zero-copy ingestion**: if your data is already in Arrow (or can be represented as borrowed `&str` slices), use `UnifiedEngine::add_documents_arrow_str(doc_ids, columns)` (multi-column) or `UnifiedEngine::add_documents_arrow_texts(doc_ids, texts)` (single merged text column) to avoid `String` allocation/copy.
- **Batch HashMap ingestion**: `UnifiedEngine::add_documents(docs)` is still much faster than calling `add_document` in a loop.
### Arrow Zero-Copy API Examples
#### Multi-column zero-copy ingestion
```rust
use nanofts::{UnifiedEngine, EngineConfig, EngineResult};

fn main() -> EngineResult<()> {
    let engine = UnifiedEngine::new(EngineConfig::memory_only())?;

    // Simulate Arrow StringArray data (in real use, extract from Arrow)
    let doc_ids = vec![1, 2, 3];
    let titles = vec!["Title 1", "Title 2", "Title 3"];
    let contents = vec!["Content 1", "Content 2", "Content 3"];

    // Zero-copy columnar ingestion: borrowed &str slices, no String copies
    let columns = vec![
        ("title".to_string(), titles),
        ("content".to_string(), contents),
    ];
    engine.add_documents_arrow_str(&doc_ids, columns)?;
    Ok(())
}
```
#### Single-column zero-copy ingestion (fastest for Arrow)
```rust
// Continuing with the `engine` from the previous example.
// Pre-merged text from Arrow (single column)
let doc_ids = vec![1, 2, 3];
let merged_texts = vec![
    "Title 1 Content 1",
    "Title 2 Content 2",
    "Title 3 Content 3",
];

// Zero-copy single-column ingestion
engine.add_documents_arrow_texts(&doc_ids, &merged_texts)?;
```
#### Real Arrow StringArray integration
```rust
// Example with a real Arrow StringArray (requires the `arrow-array` crate);
// `engine` and `doc_ids` continue from the examples above.
use arrow_array::StringArray;

let title_array = StringArray::from(vec!["Title 1", "Title 2", "Title 3"]);
let content_array = StringArray::from(vec!["Content 1", "Content 2", "Content 3"]);

// Extract zero-copy string slices from Arrow (nulls become "")
let title_slices: Vec<&str> = title_array.iter()
    .map(|s| s.unwrap_or(""))
    .collect();
let content_slices: Vec<&str> = content_array.iter()
    .map(|s| s.unwrap_or(""))
    .collect();

let columns = vec![
    ("title".to_string(), title_slices),
    ("content".to_string(), content_slices),
];
engine.add_documents_arrow_str(&doc_ids, columns)?;
```
### Flush/compact strategy
- **`flush()` frequency**: flushing periodically bounds WAL and memory usage, but flushing too often can increase I/O amplification.
- **Deletion persistence**: deletes and updates are logical until `compact()`. If you delete heavily, compact in larger batches rather than after every small wave of deletes (see the sketch below).
- **Track doc terms only when needed**: enable it only if you need update/delete support (Python: `track_doc_terms=True`); it adds extra bookkeeping at ingestion time.
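A minimal Python sketch of this strategy (the batch size and the `docs`/`stale_ids` variables are illustrative placeholders):
```python
FLUSH_EVERY = 10_000

# Ingest in batches, flushing periodically to bound WAL/memory growth
for i, (doc_id, fields) in enumerate(docs, start=1):
    engine.add_document(doc_id, fields)
    if i % FLUSH_EVERY == 0:
        engine.flush()
engine.flush()

# Batch up deletions, then pay the compaction cost once
engine.remove_documents(stale_ids)
engine.compact()
```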
### Large indexes and memory footprint
- **Use `lazy_load`** when the index is large and you don't want to map everything into memory: `with_lazy_load(true)` / Python `lazy_load=True`.
- **Tune `cache_size`**: in `lazy_load` mode, the cache hit rate is a major driver of latency; iterate with `engine.stats()` (e.g., watch the cache hit rate), as in the sketch below.
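For instance, from Python (a sketch; the exact keys reported by `stats()` can vary by version):
```python
engine = create_engine(
    index_file="./big_index.nfts",
    lazy_load=True,     # load index segments on demand instead of all at once
    cache_size=50_000,  # larger LRU cache for hot terms
)

# Inspect runtime counters (search counts, timings, cache behavior) while tuning
print(engine.stats())
```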
### Query-side optimization
- **Use boolean/batch APIs and set operations**: prefer `search_and` / `search_or` or `ResultHandle::{intersect, union, difference}` to avoid repeated work.
- **Fuzzy search is more expensive**: `fuzzy_search` introduces extra candidate generation and edit-distance checks. Use it only when needed and tune thresholds/distances.
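A short Python illustration of both points:
```python
# One boolean query instead of two searches combined by hand
both = engine.search_and(["python", "tutorial"])

# Reach for fuzzy matching only when the exact search comes up empty
result = engine.search("pythn")
if result.is_empty:
    result = engine.fuzzy_search("pythn", min_results=5)
```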
### Benchmarking and profiling
- **Benchmarks**: use `cargo bench` (or your own fixed dataset) and compare A/B with realistic data scale, term distribution, and query sets.
- **CPU profiling**: profile release binaries to find hot spots (tokenization, bitmap ops, IO, compression/decompression). On macOS, Instruments is usually the easiest.
- **Measure first**: use `engine.stats()` to track search counts, cumulative time, and cache hit rate before tuning.
## API Reference
### Creating Engine
```python
from nanofts import create_engine
engine = create_engine(
    index_file="./index.nfts",  # Index file path (empty string for memory-only)
    max_chinese_length=4,       # Max Chinese n-gram length
    min_term_length=2,          # Minimum term length to index
    fuzzy_threshold=0.7,        # Fuzzy search similarity threshold (0.0-1.0)
    fuzzy_max_distance=2,       # Maximum edit distance for fuzzy search
    track_doc_terms=False,      # Enable for update/delete support
    drop_if_exists=False,       # Drop existing index on creation
    lazy_load=False,            # Lazy load mode (memory efficient)
    cache_size=10000,           # LRU cache size for lazy load mode
)
```
### Document Operations
```python
# Add single document
engine.add_document(doc_id=1, fields={"title": "Hello", "content": "World"})
# Add multiple documents
docs = [
(1, {"title": "Doc 1", "content": "Content 1"}),
(2, {"title": "Doc 2", "content": "Content 2"}),
]
engine.add_documents(docs)
# Update document (requires track_doc_terms=True)
engine.update_document(1, {"title": "Updated", "content": "New content"})
# Delete single document
engine.remove_document(1)
# Delete multiple documents
engine.remove_documents([1, 2, 3])
# Flush buffer to disk
engine.flush()
# Compact index (applies deletions permanently)
engine.compact()
```
### Search Operations
```python
# Basic search - returns ResultHandle
result = engine.search("python programming")
# Get results
doc_ids = result.to_list() # List[int]
doc_ids = result.to_numpy() # numpy array
top_10 = result.top(10) # Top N results
page_2 = result.page(page=2, size=10) # Pagination
# Result properties
print(result.total_hits) # Total match count
print(result.is_empty) # Check if empty
print(1 in result) # Check if doc_id in results
# Fuzzy search (for typo tolerance)
result = engine.fuzzy_search("pythn", min_results=5)
print(result.fuzzy_used) # True if fuzzy matching was applied
# Batch search
results = engine.search_batch(["python", "rust", "java"])
# AND search (intersection)
result = engine.search_and(["python", "tutorial"])
# OR search (union)
result = engine.search_or(["python", "rust"])
# Filter by document IDs
result = engine.filter_by_ids([1, 2, 3, 4, 5])
# Exclude specific IDs
result = engine.exclude_ids([1, 2])
```
### Result Set Operations
```python
# Search for different terms
python_docs = engine.search("python")
rust_docs = engine.search("rust")
# Intersection (AND)
both = python_docs.intersect(rust_docs)
# Union (OR)
either = python_docs.union(rust_docs)
# Difference (NOT)
python_only = python_docs.difference(rust_docs)
# Chained operations
result = engine.search("python").intersect(
engine.search("tutorial")
).difference(
engine.search("beginner")
)
```
### Statistics
```python
stats = engine.stats()
print(stats)
# {
#     'term_count': 1234,
#     'search_count': 100,
#     'fuzzy_search_count': 10,
#     'total_search_ns': 1234567,
#     ...
# }
```
### Data Import
NanoFTS supports importing data from various sources:
```python
from nanofts import create_engine
engine = create_engine("./index.nfts")
# Import from pandas DataFrame
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3],
    'title': ['Hello World', '全文搜索', 'Test Document'],
    'content': ['This is a test', '支持多语言', 'Another test']
})
engine.from_pandas(df, id_column='id')
# Import from Polars DataFrame
import polars as pl
df = pl.DataFrame({
    'id': [1, 2, 3],
    'title': ['Doc 1', 'Doc 2', 'Doc 3']
})
engine.from_polars(df, id_column='id')
# Import from PyArrow Table
import pyarrow as pa
table = pa.Table.from_pydict({
    'id': [1, 2, 3],
    'title': ['Arrow 1', 'Arrow 2', 'Arrow 3']
})
engine.from_arrow(table, id_column='id')
# Import from Parquet file
engine.from_parquet("documents.parquet", id_column='id')
# Import from CSV file
engine.from_csv("documents.csv", id_column='id')
# Import from JSON file
engine.from_json("documents.json", id_column='id')
# Import from JSON Lines file
engine.from_json("documents.jsonl", id_column='id', lines=True)
# Import from Python dict list
data = [
    {'id': 1, 'title': 'Hello', 'content': 'World'},
    {'id': 2, 'title': 'Test', 'content': 'Document'}
]
engine.from_dict(data, id_column='id')
```
#### Specifying Text Columns
By default, all columns except the ID column are indexed. You can specify which columns to index:
```python
# Only index 'title' and 'content' columns, ignore 'metadata'
engine.from_pandas(df, id_column='id', text_columns=['title', 'content'])
# Same for other import methods
engine.from_csv("data.csv", id_column='id', text_columns=['title', 'content'])
```
#### CSV and JSON Options
You can pass additional options to the underlying pandas readers:
```python
# CSV with custom delimiter
engine.from_csv("data.csv", id_column='id', sep=';', encoding='utf-8')
# JSON Lines format
engine.from_json("data.jsonl", id_column='id', lines=True)
```
## Chinese Text Support
NanoFTS handles Chinese text using n-gram tokenization:
```python
engine = create_engine(
index_file="./chinese_index.nfts",
max_chinese_length=4, # Generate 2,3,4-gram for Chinese
)
engine.add_document(1, {"content": "全文搜索引擎"})
engine.flush()
# Search Chinese text
result = engine.search("搜索")
print(result.to_list()) # [1]
```
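Since the match on the interior substring "搜索" above suggests n-grams are taken over a sliding window, any query whose length falls within the configured range should match directly; a sketch continuing the example:
```python
print(engine.search("引擎").to_list())      # [1] -- matches a 2-gram
print(engine.search("搜索引擎").to_list())  # [1] -- matches a 4-gram
```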
## Persistence and Recovery
```python
# Create persistent index
engine = create_engine(index_file="./data.nfts")
engine.add_document(1, {"title": "Test"})
engine.flush()
# Close and reopen
del engine
engine = create_engine(index_file="./data.nfts")
# Data is automatically recovered
result = engine.search("Test")
print(result.to_list()) # [1]
# Important: Use compact() to persist deletions
engine.remove_document(1)
engine.compact() # Deletions are now permanent
```
## Memory-Only Mode
```python
# Create in-memory engine (no persistence)
engine = create_engine(index_file="")
engine.add_document(1, {"content": "temporary data"})
# No flush needed for in-memory mode
result = engine.search("temporary")
```
## Best Practices
### For Production Use
1. **Always call `compact()` after bulk deletions** - Deletions are only persisted after compaction
2. **Use `track_doc_terms=True`** if you need update/delete operations
3. **Call `flush()` periodically** to persist new documents
4. **Use `lazy_load=True`** for large indexes that don't fit in memory
### Performance Tips
```python
# Batch operations are faster
docs = [(i, {"content": f"doc {i}"}) for i in range(10000)]
engine.add_documents(docs) # Much faster than individual add_document calls
engine.flush()
# Use batch search for multiple queries
results = engine.search_batch(["query1", "query2", "query3"])
# Use result set operations instead of multiple searches
# Good:
result = engine.search_and(["python", "tutorial"])
# Instead of:
# result = engine.search("python").intersect(engine.search("tutorial"))
```
## Migration from Old API
If you're upgrading from the old `FullTextSearch` API:
```python
# Old API (deprecated)
# from nanofts import FullTextSearch
# fts = FullTextSearch(index_dir="./index")
# fts.add_document(1, {"title": "Test"})
# results = fts.search("Test") # Returns List[int]
# New API
from nanofts import create_engine
engine = create_engine(index_file="./index.nfts")
engine.add_document(1, {"title": "Test"})
result = engine.search("Test")
results = result.to_list() # Returns List[int]
```
Key differences:
- `FullTextSearch` → `create_engine()` function
- `index_dir` → `index_file` (file path, not directory)
- Search returns `ResultHandle` instead of `List[int]`
- Call `.to_list()` to get document IDs
- Use `compact()` to persist deletions
## License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.