# Edgesearch
Build a full text search API using Cloudflare Workers and WebAssembly.
## Features
- Uses an [inverted index](https://en.wikipedia.org/wiki/Inverted_index) and [compressed bit sets](https://roaringbitmap.org/) stored in [Cloudflare Workers KV](https://www.cloudflare.com/products/workers-kv/).
- Packing multiple index entries and documents allows storing large amounts of data in relatively few keys—no database or server required.
- Runs on Cloudflare Workers at edge locations with WebAssembly code for fast, scalable performance.
## Demos
Check out the [demo](./demo) folder for live deployed demos with source code.
## How it works
Assume we want to index a collection of documents, where a **document** is simply any nonempty UTF-8 sequence of characters. When searching, we are trying to find any documents that match one or more **terms**, which are also simply any nonempty UTF-8 sequence of characters.
The documents and their terms are provided to Edgesearch to build an index using `edgesearch build`. The relation between a document's terms and content is irrelevant to Edgesearch and terms do not necessarily have to be words from the document.
Edgesearch will then use the data to create code and data files ready to be deployed to Cloudflare using `edgesearch deploy`. The worker can also be tested locally by running `edgesearch test`.
### Data
The contents of every document are provided in a file. Each document is sequentially assigned an ID, starting from zero. Another file is provided which has the corresponding terms for each document.
Edgesearch will build a reverse index by mapping each term to a compressed bit set (using Roaring Bitmaps) of document IDs representing documents containing the term.
Each Cloudflare Workers KV key only able to store a maximum of 10 MiB of data.
For the top few terms by frequency, their bit sets are packed into the minimum amount of keys able to store them. Their key and byte offset within the key's value are recorded and stored in JS code as a map.
For the remaining terms as well as documents, they are sorted and then packed into keys as well. However, instead of storing each term's/document's key and offset, only the first term/document in a key is stored in JS alongside the position of the middle term/document in the key.
When searching for the corresponding bit set for a term, if the term is in the popular terms map, the bit set is simply retrieved directly from Cloudflare Workers KV. If it's not, a binary search is done to find the key containing the term's bit set, and then a binary search is done in the key's value to find the bit set. Retrieving the contents of a document uses the same binary search approach.
Packing multiple bit sets/documents reduces costs and deploy times, especially for large datasets. It also improves caching. For example, the [English Wikipedia demo](./demo/wiki/) has 13.4 million documents and 2.3 million terms, which when packed results in only 66 keys.
### Searching
Searching is done by looking for terms in a document.
There are three modes for each term:
- require: the term must exist in the document
- contain: at least one term with this mode must exist in the document
- exclude: the term must not exist in the document
The results are generated by doing bitwise operations across multiple bit sets.
The general computation could be summarised as:
```c
This will upload the worker script and associated WASM to Cloudflare Workers, and write every key to Cloudflare Workers KV.
```bash
edgesearch deploy \
--default-results default.txt \
--account-id CF_ACCOUNT_ID \
--account-email me@email.com \
--global-api-key CF_GLOBAL_API_KEY \
--name my-edgesearch \
--output-dir dist/worker/ \
--namespace CF_KV_NAMESPACE_ID \
--upload-data
```
### Calling the API
A [client](./client/) for the browser is available for using a deployed Edgesearch worker:
```typescript
import * as Edgesearch from 'edgesearch-client';
type Document = {
id: string;
title: string;
description: string;
};
const client = new Edgesearch.Client<Document>('my-edgesearch.me.workers.dev');
const query = new Edgesearch.Query();
query.add(Edgesearch.Mode.REQUIRE, 'world');
query.add(Edgesearch.Mode.CONTAIN, 'hello', 'welcome', 'greetings');
query.add(Edgesearch.Mode.EXCLUDE, 'bye', 'goodbye');
const results = await client.search(query);
```
## Performance
The code is reasonably fast, so most of the latency will arise from how cached the script and Workers KV data are at Cloudflare edge locations.
Keys that are not frequently retrieved from Workers KV will take longer to retrieve due to cache misses from edge locations. Script code and accompanying WASM binary may not be present at edge locations if not executed frequently. Therefore, to ensure consistent low-latency request responses, ensure that there is consistent traffic hitting the worker to keep code and data at the edge.