# RFC-0004: Search Protocol
> **Version:** 0.2.0 | **Status:** normative | **Phase:** test
---
## 1. Summary
### [RFC-0004:C-OVERVIEW] Overview (Informative) <a id="rfc-0004c-overview"></a>
Search enables agents to find content within skills using full-text queries.
The search system builds an index during compilation and supports BM25-ranked queries with snippet extraction. Rather than requiring agents to know exact section headings, search allows natural language queries like "how to configure authentication" to find relevant content.
**Design principles:**
1. **Progressive disclosure** — Search returns section references; agents use `show` to retrieve content
2. **Offline-first** — Uses SQLite FTS5, no external dependencies
3. **Extensible formats** — File format support is added incrementally
4. **Observable** — All searches are logged for analytics
**Commands:**
- `skc search <skill> <query>` — Find matching sections
*Since: v0.1.0*
---
## 2. Specification
### [RFC-0004:C-FORMATS] Supported Formats (Normative) <a id="rfc-0004c-formats"></a>
The search index MUST support the following file formats:
**v0.1.0:**
- `.md` — Markdown files, segmented by headings. Each section (heading + content until next heading of equal or higher level) becomes a searchable document. The `section` field contains the heading text.
- `.txt` — Plain text files, indexed as a single document. The `section` field MUST be an empty string (`""`).
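The Markdown segmentation rule can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `segment_markdown` is hypothetical, only ATX headings (`#` to `######`) are recognized, fenced code blocks are skipped with a simple toggle, and content before the first heading is ignored.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def segment_markdown(text):
    """Return (section, content) pairs, one per heading (hypothetical helper)."""
    lines = text.splitlines()
    headings = []  # (line_index, level, heading_text)
    in_fence = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # do not treat '#' inside code fences as headings
            continue
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            headings.append((i, len(match.group(1)), match.group(2)))

    sections = []
    for n, (start, level, title) in enumerate(headings):
        end = len(lines)
        # A section runs until the next heading of equal or higher level.
        for later_start, later_level, _ in headings[n + 1:]:
            if later_level <= level:
                end = later_start
                break
        sections.append((title, "\n".join(lines[start:end])))
    return sections
```

Under this reading, a level-1 section includes the text of its nested level-2 subsections, and each subsection is also indexed as its own document. A `.txt` file would instead yield a single pair with `""` as the section.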
**Retrieval:**
Search results from any indexed format can be retrieved using `skc open <skill> <path>`. The `open` command is not restricted to `.md` files (see [RFC-0002:C-OPEN](../rfc/RFC-0002.md#rfc-0002c-open)).
Additional formats MAY be added in future versions. The implementation MUST silently skip unsupported file types without error.
**Planned (future versions):**
- `.csv` — Each row as a searchable document
- `.json`, `.yaml`, `.toml` — Flattened key-value pairs
- Code files (`.py`, `.js`, `.ts`, `.rs`) — Plain text with comment weighting
*Since: v0.1.0*
### [RFC-0004:C-INDEX] Index Storage (Normative) <a id="rfc-0004c-index"></a>
The search index MUST be stored as a SQLite FTS5 database in the skill's runtime directory.
**Index file naming:**
To avoid collisions when multiple source directories share a runtime store, the index file MUST be named using a hash of the source path:
```
.skillc-meta/search-<hash16>.db
```
Where `<hash16>` is the first 16 hexadecimal characters (64 bits) of the SHA-256 hash of the canonicalized source directory path.
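A minimal sketch of the filename derivation follows. It assumes that canonicalization means resolving symlinks and relative components (`os.path.realpath`) and that the path is hashed as UTF-8 bytes; this RFC does not fix either detail, and the function name `index_filename` is hypothetical.

```python
import hashlib
import os


def index_filename(source_dir):
    """Return the index path for a source directory (hypothetical helper)."""
    # Assumption: canonicalization = realpath, encoded as UTF-8 before hashing.
    canonical = os.path.realpath(source_dir)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # First 16 hex characters = 64 bits of the digest.
    return f".skillc-meta/search-{digest[:16]}.db"
```

Because the path is canonicalized first, `index_filename(".")` and `index_filename(os.getcwd())` produce the same name.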
**Tokenizer preference:**
The current tokenizer preference is determined at runtime:
1. Attempt `porter unicode61`
2. If unavailable, fall back to `unicode61`
The same preference is used when creating the index (build) and when checking the recorded `tokenizer` value for staleness (search).
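One way to determine the preference at runtime is to probe an in-memory database. The sketch below is an assumption about how the probe could work, not a description of the actual implementation; `detect_tokenizer` is a hypothetical name, and it returns the short name stored in `index_meta`.

```python
import sqlite3


def detect_tokenizer():
    """Return 'porter' if 'porter unicode61' is usable, else 'unicode61'."""
    conn = sqlite3.connect(":memory:")
    try:
        candidates = (("porter", "porter unicode61"), ("unicode61", "unicode61"))
        for name, spec in candidates:
            try:
                # Creating a throwaway FTS5 table fails if the tokenizer is missing.
                conn.execute(
                    f"CREATE VIRTUAL TABLE probe USING fts5(x, tokenize='{spec}')"
                )
                return name
            except sqlite3.OperationalError:
                continue
        raise RuntimeError("FTS5 is not available in this SQLite build")
    finally:
        conn.close()
```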
**Corrupt index definition:**
An index is **corrupt** if any of:
- The database file cannot be opened as a SQLite database
- The `index_meta` table does not exist
- Any required key is missing (`skill_path`, `source_hash`, `schema_version`, `tokenizer`)
- Any required key value cannot be parsed (e.g., non-integer `schema_version`)
**Any read or parse failure during index access MUST be treated as corruption.** Implementations MUST NOT attempt to distinguish transient errors from permanent corruption; all failures map to the same handling (E002 for search, delete+rebuild for build).
**Index file selection (search):**
When searching, the implementation MUST follow these steps in order:
1. Compute the expected filename `search-<hash16>.db`
2. If file does not exist: exit with error E002 (missing)
3. Open the database and read required keys from `index_meta`; if corrupt (per definition above): exit with error E002
4. If `skill_path` does not match current source path: exit with error E003 (collision)
5. Check staleness conditions (see below): if stale, exit with error E002
6. Proceed with search
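The selection steps above can be sketched as a single check. This is an illustrative outline only: the helper names and the `current` dictionary (holding the current `skill_path`, `source_hash`, `schema_version`, and `tokenizer`) are hypothetical, and error codes are returned as strings rather than raised.

```python
import os
import sqlite3

REQUIRED_KEYS = ("skill_path", "source_hash", "schema_version", "tokenizer")


def read_index_meta(db_path):
    """Return required metadata as a dict, or None if the index is corrupt."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            rows = dict(conn.execute("SELECT key, value FROM index_meta"))
        finally:
            conn.close()
        meta = {key: rows[key] for key in REQUIRED_KEYS}
        meta["schema_version"] = int(meta["schema_version"])
        return meta
    except (sqlite3.Error, KeyError, TypeError, ValueError):
        return None  # any read or parse failure is treated as corruption


def check_index_for_search(db_path, current):
    """Return None if the index is usable, else 'E002' or 'E003'."""
    if not os.path.exists(db_path):
        return "E002"  # step 2: missing
    meta = read_index_meta(db_path)
    if meta is None:
        return "E002"  # step 3: corrupt
    if meta["skill_path"] != current["skill_path"]:
        return "E003"  # step 4: collision
    stale = (
        meta["source_hash"] != current["source_hash"]
        or meta["schema_version"] < current["schema_version"]
        or meta["tokenizer"] != current["tokenizer"]
    )
    return "E002" if stale else None  # step 5: stale, else step 6
```

The collision check runs before the staleness check, so a file belonging to another skill is never reported as merely stale.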
**E002 conditions (search):**
E002 ("unusable") is the umbrella error covering three distinct failure modes:
- **Missing**: file does not exist (step 2)
- **Corrupt**: any read/parse failure (step 3)
- **Stale**: metadata mismatch (step 5) — a subset of "unusable"
All three require `skc build` to fix.
See [RFC-0005:C-CODES](../rfc/RFC-0005.md#rfc-0005c-codes) for canonical error messages.
**Staleness conditions (search-only):**
"Stale" is a subset of "unusable" that applies when the index can be read but its metadata does not match the current skill state.
After confirming no collision (step 4), the index is **stale** if any of:
- `source_hash` does not match current manifest hash
- `schema_version` < current schema version (currently `2`)
- `tokenizer` does not match current tokenizer preference
These three fields are the only staleness conditions. Missing file and corrupt index are handled earlier (steps 2-3). The `skill_path` field is used for collision detection (step 4), not staleness.
If stale, `skc search` MUST exit with error E002.
**Index lifecycle (build):**
When `skc build` runs:
1. Compute the expected filename `search-<hash16>.db`
2. If file does not exist: proceed to step 6 to create new index
3. Open the database and read required keys from `index_meta`; if any read/parse failure: delete file, proceed to step 6 to create new index
4. If `skill_path` does NOT match current source path: exit with error E003 (collision). Stop.
5. Compare `source_hash`, `schema_version`, and `tokenizer`:
- **If** all match: skip rebuild. Done.
- **Else**: delete existing file, proceed to step 6 to create new index.
6. **Create new index** with current tokenizer preference
Build does not error on unusable indexes; it rebuilds them. Corrupt indexes are deleted without collision detection since `skill_path` cannot be reliably read.
Build behavior summary:
- **Missing**: create new index (step 2 → 6)
- **Corrupt**: delete and rebuild without collision check (step 3 → 6)
- **Collision** (`skill_path` mismatch): error E003 (step 4)
- **Up-to-date** (all metadata matches): skip rebuild (step 5)
- **Stale** (metadata differs): rebuild (step 5 → 6)
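The build decision can be expressed as a pure function over state that has already been read. In this sketch, `plan_build` and its return values are hypothetical; `meta` is the parsed `index_meta` dictionary, or `None` when the file could not be read or parsed.

```python
def plan_build(exists, meta, current):
    """Return 'create', 'rebuild', 'skip', or 'E003' (hypothetical helper)."""
    if not exists:
        return "create"  # missing: step 2 -> 6
    if meta is None:
        return "rebuild"  # corrupt: delete and recreate, no collision check
    if meta["skill_path"] != current["skill_path"]:
        return "E003"  # collision: step 4, stop
    fields = ("source_hash", "schema_version", "tokenizer")
    if all(meta[field] == current[field] for field in fields):
        return "skip"  # up-to-date: step 5
    return "rebuild"  # stale: step 5 -> 6
```

Here `'rebuild'` stands for deleting the existing file and then creating a new index with the current tokenizer preference.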
**No automatic cleanup:**
The implementation MUST NOT delete other `search-*.db` files. Multiple skills may share a runtime directory.
**Runtime directory resolution:**
The runtime directory MUST be resolved using the same logic as [RFC-0007:C-RESOLUTION](../rfc/RFC-0007.md#rfc-0007c-resolution).
**Index schema:**
The database MUST contain a virtual table using FTS5 for full-text search:
```sql
-- The preferred tokenizer is shown; use 'unicode61' when porter is unavailable.
CREATE VIRTUAL TABLE sections USING fts5(
    file,
    section,
    content,
    tokenize='porter unicode61'
);
```
The database MUST contain a `headings` table for section lookup by [RFC-0002:C-SHOW](../rfc/RFC-0002.md#rfc-0002c-show):
```sql
CREATE TABLE headings (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL,
    text TEXT NOT NULL,
    level INTEGER NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL
);
CREATE INDEX idx_headings_text ON headings(text COLLATE NOCASE);
```
Fields:
- `file` — relative path from skill root
- `text` — heading text (without `#` prefix)
- `level` — heading level (1-6)
- `start_line` — 1-based line number of heading
- `end_line` — 1-based line number of next heading (or EOF+1)
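As an illustration of the line-number conventions, the following sketch derives `headings` rows from a Markdown file. It is a simplified assumption of how rows could be produced: `heading_rows` is a hypothetical name, only ATX headings are recognized, and headings inside fenced code blocks are not filtered out.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def heading_rows(file, text):
    """Return (file, text, level, start_line, end_line) tuples."""
    lines = text.splitlines()
    found = []  # (1-based line number, level, heading text)
    for number, line in enumerate(lines, start=1):
        match = HEADING_RE.match(line)
        if match:
            found.append((number, len(match.group(1)), match.group(2)))

    rows = []
    for i, (start, level, title) in enumerate(found):
        # end_line is the line of the next heading, or EOF+1 for the last one.
        end = found[i + 1][0] if i + 1 < len(found) else len(lines) + 1
        rows.append((file, title, level, start, end))
    return rows
```

For a four-line file with headings on lines 1 and 3, the rows end at lines 3 and 5 respectively.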
The database MUST also contain a metadata table:
```sql
CREATE TABLE index_meta (
    key TEXT PRIMARY KEY,
    value TEXT
);
```
**Required metadata keys:**
- `source_hash` — Hash from [RFC-0001:C-MANIFEST](../rfc/RFC-0001.md#rfc-0001c-manifest)
- `skill_path` — Canonicalized absolute path to source directory
- `schema_version` — Integer (currently `2`)
- `indexed_at` — RFC 3339 UTC timestamp
- `tokenizer` — `porter` or `unicode61`
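A minimal sketch of writing the required keys follows. The function name and parameters are hypothetical; the timestamp format shown is one valid RFC 3339 UTC representation.

```python
import sqlite3
from datetime import datetime, timezone


def write_index_meta(conn, skill_path, source_hash, tokenizer, schema_version=2):
    """Create index_meta if needed and store all required keys."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    rows = {
        "source_hash": source_hash,
        "skill_path": skill_path,
        "schema_version": str(schema_version),  # stored as text, parsed as integer
        "indexed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tokenizer": tokenizer,
    }
    conn.executemany(
        "INSERT OR REPLACE INTO index_meta (key, value) VALUES (?, ?)",
        list(rows.items()),
    )
    conn.commit()
```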
**Schema migration:**
In-place migration is not supported. When the schema version changes, the user must run `skc build` to rebuild the index.
*Updated in v0.2.0: Added `headings` table for index-based section lookup. Bumped `schema_version` to `2`.*
*Since: v0.1.0*
### [RFC-0004:C-SEARCH] Search Command (Normative) <a id="rfc-0004c-search"></a>
**Syntax:** `skc search <skill> <query> [options]`
The search command MUST query the FTS5 index and return ranked results.
**Options:**
- `--limit N` — Maximum results (default: 10)
- `--format <text|json>` — Output format (default: text)
**Ranking:**
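Results MUST be ranked using BM25. Because FTS5's `bm25()` function returns negative values, with better matches being more negative, scores MUST be negated for display so that higher scores indicate better matches.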
**Snippet extraction:**
Snippet parameters (FTS5 `snippet()` function):
- Column: `content` (index 2)
- Start marker: `[MATCH]`
- End marker: `[/MATCH]`
- Ellipsis: `...`
- Token limit: `32`
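The ranking and snippet parameters combine into one FTS5 query. The sketch below assumes the `sections` table defined in C-INDEX; `SEARCH_SQL` and `run_search` are hypothetical names, and the caller is expected to pass an already-quoted FTS5 query string.

```python
import sqlite3

SEARCH_SQL = """
SELECT file,
       section,
       snippet(sections, 2, '[MATCH]', '[/MATCH]', '...', 32),
       -bm25(sections)
FROM sections
WHERE sections MATCH ?
ORDER BY bm25(sections)  -- most negative (best) first
LIMIT ?
"""


def run_search(conn, fts_query, limit=10):
    """Return result dicts with file, section, snippet, and negated score."""
    rows = conn.execute(SEARCH_SQL, (fts_query, limit)).fetchall()
    keys = ("file", "section", "snippet", "score")
    return [dict(zip(keys, row)) for row in rows]
```

Ordering uses the raw `bm25()` value while the selected column is negated, so the displayed score is non-negative and the best match still comes first.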
**Output format guarantees:**
**JSON is canonical (Normative).** Use JSON for machine parsing.
**JSON output (`--format json`):**
```json
{
  "query": "<original-query>",
  "results": [
    {
      "file": "<relative-path>",
      "section": "<heading-or-identifier>",
      "snippet": "...text with [MATCH]term[/MATCH]...",
      "score": <float>
    }
  ]
}
```
**Text output (`--format text`) — Informative only:**
Human-readable, NOT a stable contract. Do NOT parse.
```
<file>#<section> (score: <score>)
<snippet>
```
**No results:**
Return empty result set, exit 0. JSON: `{"query": "...", "results": []}`.
**Skill resolution:**
Per [RFC-0007:C-RESOLUTION](../rfc/RFC-0007.md#rfc-0007c-resolution).
**Error handling:**
All errors exit with status 1. See [RFC-0005:C-CODES](../rfc/RFC-0005.md#rfc-0005c-codes) for canonical error messages.
| Condition | Error code |
|---|---|
| Skill resolution failed | E001 or E010 |
| Index unusable | E002 |
| Index hash collision | E003 |
| Empty query | E004 |
| Invalid CLI option | E100 |
Note: "Index unusable" (E002) covers three cases: missing file, corrupt index, or stale metadata. See [RFC-0004:C-INDEX](../rfc/RFC-0004.md#rfc-0004c-index) for details.
*Since: v0.1.0*
### [RFC-0004:C-LOGGING] Search Logging (Normative) <a id="rfc-0004c-logging"></a>
Search commands MUST be logged per [RFC-0007:C-LOGGING](../rfc/RFC-0007.md#rfc-0007c-logging).
**Command name:** `search`
**Args format:**
```json
{
  "query": "<search-query>",
  "result_count": <number-of-results-returned>
}
```
**Error field:**
If the search fails (e.g., stale index), the error message MUST be recorded in the `error` field.
**Analytics extension (future):**
The `stats` command (per [RFC-0003](../rfc/RFC-0003.md)) SHOULD be extended to support a `searches` query type that aggregates:
- Query strings and their frequency
- Average result counts
- Zero-result query patterns
This extension is NOT part of v0.1.0. Until implemented, `--query searches` is not a valid query type and will result in error E030 per [RFC-0005:C-CODES](../rfc/RFC-0005.md#rfc-0005c-codes).
*Since: v0.1.0*
### [RFC-0004:C-QUERY-SYNTAX] Query Syntax (Normative) <a id="rfc-0004c-query-syntax"></a>
**Query semantics:**
The `<query>` argument is treated as a **bag-of-words** query with implicit AND. Each word in the query must appear somewhere in the document for a match; word order and adjacency are NOT required.
Example: Query `configure authentication` matches documents containing both words anywhere in the content, regardless of order or proximity.
**Tokenizer-dependent matching:**
Query matching behavior depends on the tokenizer used to build the index (recorded in `index_meta.tokenizer`):
- `porter`: Terms are stemmed. Query `configuring` matches indexed `configure`.
- `unicode61`: No stemming. Query `configuring` does NOT match `configure`.
FTS5 handles tokenizer selection internally based on how the index was created. The implementation does not need to branch on tokenizer type; FTS5 applies the correct tokenizer automatically. However, the recorded tokenizer value is useful for debugging, diagnostics, and user understanding of matching behavior.
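The difference between the two tokenizers can be observed directly with a throwaway in-memory index. This is a demonstration sketch only; the `matches` helper is hypothetical and not part of the search command.

```python
import sqlite3


def matches(tokenizer, indexed_text, query_term):
    """Return True if query_term matches indexed_text under the given tokenizer."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            f"CREATE VIRTUAL TABLE t USING fts5(content, tokenize='{tokenizer}')"
        )
        conn.execute("INSERT INTO t (content) VALUES (?)", (indexed_text,))
        # The term is quoted so FTS5 treats it as a literal single-term phrase.
        hit = conn.execute(
            "SELECT 1 FROM t WHERE t MATCH ?", (f'"{query_term}"',)
        ).fetchone()
        return hit is not None
    finally:
        conn.close()
```

With `porter unicode61`, the query `configuring` matches text containing `configure`; with plain `unicode61` it does not.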
**Query tokenization:**
To construct a bag-of-words AND query, the implementation MUST:
1. Split the query on ASCII whitespace only (space U+0020, tab U+0009, newline U+000A, carriage return U+000D)
2. Remove empty tokens
3. For each non-empty token:
   a. Escape internal `"` by doubling (`"` → `""`)
   b. Wrap in double quotes to make it a literal FTS5 term
4. Join all quoted tokens with spaces (implicit AND in FTS5)
**Unicode whitespace limitation:**
Non-ASCII whitespace characters (e.g., non-breaking space U+00A0, ideographic space U+3000) are NOT treated as token separators. They are included as part of the token. This is a known limitation. Users should use ASCII spaces in queries.
Example: User input `configure authentication` becomes FTS5 query `"configure" "authentication"`.
Example with quotes: User input `my "special" app` becomes FTS5 query `"my" """special""" "app"`.
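The tokenization steps and both examples can be reproduced with a short sketch. The names `build_fts_query` and `EmptyQueryError` are hypothetical; the exception stands in for error E004. An explicit character class is used because Python's `str.split()` would also split on non-ASCII whitespace, which contradicts the ASCII-only rule.

```python
import re

# ASCII whitespace only: space, tab, newline, carriage return.
ASCII_WS = re.compile(r"[ \t\n\r]+")


class EmptyQueryError(ValueError):
    """Hypothetical exception standing in for error E004."""


def build_fts_query(raw):
    """Convert user input into a bag-of-words FTS5 query with implicit AND."""
    tokens = [token for token in ASCII_WS.split(raw) if token]
    if not tokens:
        raise EmptyQueryError("E004: empty query")
    # Double any internal quotes, then wrap each token in double quotes.
    quoted = ['"' + token.replace('"', '""') + '"' for token in tokens]
    return " ".join(quoted)
```

A non-breaking space (U+00A0) inside the input stays within its token, which matches the limitation described above.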
**Why this works:**
Each quoted token is passed to FTS5 as a single-term phrase. FTS5 applies its internal tokenizer (matching the index) to each term, handling Unicode normalization and stemming as appropriate. Multiple quoted terms joined by spaces create an implicit AND query.
**Note on punctuation:**
Punctuation attached to words (e.g., `hello,` or `(world)`) is passed to FTS5 as-is. FTS5's tokenizer will strip it during matching.
**Shell quoting:**
The query is passed as a single shell argument. Users MUST quote multi-word queries:
```bash
skc search my-skill "configure authentication" # Correct
skc search my-skill configure authentication # Wrong: two positional args
```
**Empty query:**
If the query is empty or contains only ASCII whitespace, the command MUST exit with error E004. See [RFC-0005:C-CODES](../rfc/RFC-0005.md#rfc-0005c-codes) for the canonical message.
**Future extension:**
- `--raw` — Pass raw FTS5 syntax for advanced users
- `--phrase` — Require adjacency (true phrase matching)
These flags are NOT part of v0.1.0.
*Since: v0.1.0*
### [RFC-0004:C-ERRORS] Error Messages (Normative) <a id="rfc-0004c-errors"></a>
All search-related errors MUST exit with status 1 and print an error message to stderr.
**Error codes:**
See [RFC-0005:C-CODES](../rfc/RFC-0005.md#rfc-0005c-codes) for canonical error messages. This RFC uses:
| Condition | Error code |
|---|---|
| Skill resolution failed | E001 or E010 |
| Index unusable | E002 |
| Index hash collision | E003 |
| Empty query | E004 |
| Invalid CLI option | E100 |
**Usage:**
- **E001/E010**: Skill resolution failed per [RFC-0007:C-RESOLUTION](../rfc/RFC-0007.md#rfc-0007c-resolution). See [RFC-0005:C-CODES](../rfc/RFC-0005.md#rfc-0005c-codes) for when to use each.
- **E002**: Index unusable (missing, corrupt, or stale). See [RFC-0004:C-INDEX](../rfc/RFC-0004.md#rfc-0004c-index) for details.
- **E003**: Index filename exists but `skill_path` does not match (hash collision)
- **E004**: Query is empty or contains only whitespace
- **E100**: Unknown flag, missing required value, or other CLI parsing failure
**Collision handling:**
Hash collisions are rare but possible. When detected, the user MUST manually delete the conflicting index file. The error message includes the filename pattern to delete. Automatic deletion is not performed because it could destroy another skill's valid index.
**Corrupt index handling:**
An index is corrupt if the database cannot be opened, `index_meta` table is missing, required keys are missing, or key values cannot be parsed. See [RFC-0004:C-INDEX](../rfc/RFC-0004.md#rfc-0004c-index) for the full definition. During search, corruption maps to E002. During build, corrupt files are deleted and rebuilt without collision detection (since `skill_path` cannot be verified).
*Since: v0.1.0*
---
## Changelog
### v0.2.0 (2026-01-31)
C-INDEX updated with headings table, schema version bumped to 2
### v0.1.0 (2026-01-30)
Initial release