syntext 1.0.1

Hybrid code search index for agent workflows
# Security Audit Log

Findings from security audits, with identification, remediation status, and
rationale for accepted risks.

## Threat Model

syntext is a local code search tool. The index directory is mode 0700 (owner
only). The primary threat is a compromised or malicious file within the
indexed repository, not a remote attacker. Attacks requiring write access to
the index directory are low-severity because that access already implies
owner-level compromise.

## Findings

### SA-001: TOCTOU Window in Symlink Walk

**Severity:** Medium
**Status:** Mitigated (defense-in-depth)
**File:** `src/index/walk.rs`, `collect_symlink_entry`

**Identification:** The symlink validation in `collect_symlink_entry` spans
multiple syscalls: `read_link`, `symlink_metadata`, `canonicalize`, and a
second `symlink_metadata`. Between the final validation and the build-time
file read for an in-repo file symlink, a concurrent symlink swap could
redirect to an out-of-scope target. Directory symlinks are now skipped during
repository enumeration, which removes the nested-walk portion of this risk.

**Remediation:**
1. Directory symlinks are skipped during repository enumeration, so the walker
   no longer performs nested sub-walks through symlink aliases.
2. The build pipeline (`src/index/build.rs`) already applies per-file
   `open_readonly_nofollow` + `verify_fd_matches_stat` inode verification,
   catching any remaining swaps between walk discovery and content read.
3. File symlinks still require the target to resolve inside the repo root, and
   multi-hop symlink chains are rejected before indexing.

**Residual risk:** The remaining window is between validation and the later
file open for an accepted in-repo file symlink. The build-time inode check is
the backstop.
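
The build-time backstop can be sketched as follows. This is a minimal, Unix-only illustration and not the code in `src/index/build.rs`: `FileIdentity`, `identity_at_walk`, and `open_verified` are hypothetical names, and the sketch omits the `O_NOFOLLOW` flag that `open_readonly_nofollow` implies, because the standard library exposes no portable constant for it.

```rust
use std::fs::{self, File};
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Identity of a file as observed at walk time (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FileIdentity {
    pub dev: u64,
    pub ino: u64,
}

/// Record (device, inode) for `path` without following a final symlink.
pub fn identity_at_walk(path: &Path) -> io::Result<FileIdentity> {
    let meta = fs::symlink_metadata(path)?;
    Ok(FileIdentity { dev: meta.dev(), ino: meta.ino() })
}

/// Open `path`, then compare the identity of the *opened descriptor* with
/// the one recorded at walk time. `File::metadata` queries the open file,
/// not the path, so a swap after the open cannot change the answer.
pub fn open_verified(path: &Path, expected: FileIdentity) -> io::Result<File> {
    // Assumption: the real helper also passes O_NOFOLLOW here.
    let file = File::open(path)?;
    let meta = file.metadata()?;
    let actual = FileIdentity { dev: meta.dev(), ino: meta.ino() };
    if actual != expected {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "file changed between walk and open",
        ));
    }
    Ok(file)
}
```

If the path is replaced between the two calls, the descriptor refers to a different inode and the read is refused, so nothing is indexed from the swapped target.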

### SA-002: V2 Posting Offset Lower Bound Too Permissive

**Severity:** Medium
**Status:** Fixed
**File:** `src/index/segment/mod.rs`, `read_posting_list_mmap`

**Identification:** For V2 combined segments, `read_posting_list_mmap`
validated that `abs_off >= HEADER_SIZE`, but the postings section starts after
the document table, not at `HEADER_SIZE`. A crafted V2 segment with a valid
checksum could embed a dictionary entry whose posting offset pointed into the
doc table region, causing doc table bytes to be interpreted as posting data.

**Remediation:**
1. The segment footer's `postings_offset` field (bytes 8..16) is now parsed
   and stored in `SegmentLayout` and `MmapSegment`.
2. `read_posting_list_mmap` uses `postings_offset` as the lower bound when
   non-zero (V2 segments that recorded it). For segments where
   `postings_offset` is zero, it falls back to
   `doc_table_offset + doc_count * 8` (end of the doc table index array) as a
   conservative minimum.
3. `parse_segment_mmap` validates that `postings_offset` (when non-zero) falls
   within `[doc_table_offset, dict_offset]`.
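
The bound selection and the footer check can be illustrated with a small sketch. The struct below is a hypothetical subset of `SegmentLayout` holding only the fields named above; field types and method names are assumptions, and only the lower bound is modelled.

```rust
/// Hypothetical subset of the segment layout fields named in the text.
#[derive(Debug, Clone, Copy)]
pub struct SegmentLayout {
    pub doc_table_offset: u64,
    pub doc_count: u64,
    /// Zero means the segment did not record a postings offset.
    pub postings_offset: u64,
    pub dict_offset: u64,
}

impl SegmentLayout {
    /// Footer invariant: a recorded postings_offset must lie in
    /// [doc_table_offset, dict_offset].
    pub fn footer_is_valid(&self) -> bool {
        self.postings_offset == 0
            || (self.doc_table_offset..=self.dict_offset).contains(&self.postings_offset)
    }

    /// Lowest absolute offset at which a posting list may start.
    /// Returns None if the fallback arithmetic would overflow.
    pub fn posting_lower_bound(&self) -> Option<u64> {
        if self.postings_offset != 0 {
            return Some(self.postings_offset);
        }
        // Fallback: end of the doc table index array (8 bytes per document).
        self.doc_count.checked_mul(8)?.checked_add(self.doc_table_offset)
    }

    /// Reject posting offsets that point below the postings region.
    pub fn posting_offset_ok(&self, abs_off: u64) -> bool {
        matches!(self.posting_lower_bound(), Some(lower) if abs_off >= lower)
    }
}
```

With `doc_table_offset = 64` and `doc_count = 10`, the fallback bound is 144, so an offset of 100 that would previously have passed an `abs_off >= HEADER_SIZE` check is rejected.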

### SA-003: Integer Underflow in Overlay Doc Count

**Severity:** Medium (cosmetic)
**Status:** Fixed
**File:** `src/index/overlay.rs`, `build_incremental`

**Identification:** The expression
`old_overlay.docs.len() + new_files.len() - newly_changed.len()` can underflow
when a path appears in both `newly_changed` and `removed_paths` (e.g.,
`notify_change` then `notify_delete` in the same batch). In release builds, the
bare subtraction wraps to a value near `usize::MAX`, producing a misleading
`DocIdOverflow` error message; in debug builds it panics on overflow. No memory
corruption or privilege escalation results.

**Remediation:** Replaced bare arithmetic with `saturating_add` / `saturating_sub`
at both call sites in `build_incremental` (full rebuild and delta paths).
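
The difference between the two forms is easy to see in isolation. The function names below are hypothetical; the wrapping variant only reproduces what the bare expression does in a release build.

```rust
/// Fixed form: clamps at 0 and at usize::MAX instead of wrapping.
pub fn overlay_doc_count(old_docs: usize, new_files: usize, newly_changed: usize) -> usize {
    old_docs.saturating_add(new_files).saturating_sub(newly_changed)
}

/// Release-mode behaviour of the original bare arithmetic, made explicit.
pub fn overlay_doc_count_wrapping(old_docs: usize, new_files: usize, newly_changed: usize) -> usize {
    old_docs.wrapping_add(new_files).wrapping_sub(newly_changed)
}
```

For `old_docs = 1`, `new_files = 0`, `newly_changed = 3`, the wrapping form yields `usize::MAX - 1`, while the saturating form yields 0.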

## Previously Addressed (for reference)

These were fixed in earlier commits and verified during this audit:

- **Path traversal** (c492ea4): `repo_relative_path` rejects `..`, absolute,
  and prefix components. `commit_batch` canonicalizes and checks
  `starts_with(canonical_root)`. Manifest filename validation rejects `/`, `\`,
  `..`, and absolute paths. Symbol search filters absolute and `..` paths.

- **ReDoS** (ec27f9d): 10 MiB NFA/DFA size cap on all regex compilation paths.
  The `regex` crate's automaton-based engine (RE2-style, no backtracking)
  guarantees matching time linear in the input length.

- **Doc entry bounds** (c492ea4): `get_doc` validates `abs_off` within
  `[doc_table_offset, dict_offset)` and checks full variable-length entry
  (22-byte header + `path_len`) fits within the doc table region.

- **TOCTOU file reads** (c492ea4): `open_readonly_nofollow` + inode
  verification in `build.rs`, `commit_batch`, and the resolver hot path.

- **Symlink dedup** (0f3b6d9): `seen_canonical` set prevents duplicate file
  records when N symlinks point to the same in-repo file target.
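
As an illustration of the path traversal item, component-level rejection can be written with `std::path::Component`. This is a sketch, not the crate's `repo_relative_path`: the name `checked_relative_path`, its signature, and the choice to drop `.` components are assumptions.

```rust
use std::path::{Component, Path, PathBuf};

/// Return a cleaned relative path, or None if the candidate contains `..`,
/// a root, or a platform prefix (for example a Windows drive letter).
pub fn checked_relative_path(candidate: &Path) -> Option<PathBuf> {
    let mut clean = PathBuf::new();
    for component in candidate.components() {
        match component {
            Component::Normal(name) => clean.push(name),
            // "./" cannot escape the root, so it is simply dropped.
            Component::CurDir => {}
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    // An empty result names no file inside the repository.
    if clean.as_os_str().is_empty() {
        None
    } else {
        Some(clean)
    }
}
```

Rejecting `..` outright, rather than trying to normalize it, keeps the check independent of the filesystem and of symlinks along the path.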

## Verified Clean Areas

- **Unsafe blocks (2):** Both use `map_copy_read_only` (MAP_PRIVATE). Justified.
- **SQL injection:** Symbol index uses parameterized queries throughout.
- **Varint decoding:** Overflow guards on 5th byte and delta accumulation.
- **Concurrency:** ArcSwap snapshot isolation is correct. Poisoned mutex
  recovery is acceptable for idempotent bitmap caches.
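
For the varint item, the two guards can be sketched as below. The on-disk encoding is assumed here to be little-endian base-128 `u32` values with the first doc_id stored as a delta from 0; both function names are hypothetical.

```rust
/// Decode one base-128 varint as u32. Returns (value, bytes consumed),
/// or None on truncated input or a value that does not fit in 32 bits.
pub fn decode_varint_u32(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &byte) in bytes.iter().enumerate().take(5) {
        let payload = (byte & 0x7F) as u32;
        // Guard on the 5th byte: only 4 payload bits remain (32 - 4 * 7),
        // and the varint must terminate here.
        if i == 4 && (payload > 0x0F || byte & 0x80 != 0) {
            return None;
        }
        value |= payload << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // ran out of input before a terminating byte
}

/// Decode a delta-encoded doc_id list, rejecting accumulated overflow.
pub fn decode_delta_doc_ids(mut bytes: &[u8]) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    let mut current: u32 = 0;
    while !bytes.is_empty() {
        let (delta, used) = decode_varint_u32(bytes)?;
        // Guard on delta accumulation: a crafted list must not wrap.
        current = current.checked_add(delta)?;
        ids.push(current);
        bytes = &bytes[used..];
    }
    Some(ids)
}
```

Under these assumptions the bytes `[0x03, 0x02, 0x0A]` decode to the doc_ids 3, 5, and 15.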

## Round 2 Findings (2026-03-28)

### SA-004: Permissive Index Directory Mode Accepted Silently

**Severity:** High
**Status:** Fixed
**File:** `src/index/mod.rs`, `Index::open()`

**Identification:** The permissive-mode check only warned on stderr (gated
behind `config.verbose`). A pre-existing index with mode 0755 continued
operating with no user-visible signal.

**Remediation:** `Index::open()` now returns `CorruptIndex` when the index
directory has group/other bits set, unless `Config::strict_permissions` is
false. `build_index()` continues to enforce 0700 on new builds.
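
A minimal sketch of the mode check, assuming a Unix target. `check_index_dir` is a hypothetical stand-in for the logic inside `Index::open()`, and it returns an `io::Error` where the real code returns `CorruptIndex`.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// True when any group or other permission bit is set (anything beyond 0700).
pub fn has_group_or_other_bits(mode: u32) -> bool {
    mode & 0o077 != 0
}

/// Refuse a permissive index directory unless strict checking is disabled.
pub fn check_index_dir(dir: &Path, strict_permissions: bool) -> io::Result<()> {
    // mode() includes file-type bits; the mask above ignores them.
    let mode = fs::metadata(dir)?.permissions().mode();
    if strict_permissions && has_group_or_other_bits(mode) {
        return Err(io::Error::new(
            io::ErrorKind::PermissionDenied,
            format!("index directory mode {:o} grants group/other access", mode & 0o777),
        ));
    }
    Ok(())
}
```

A directory at 0755 fails the strict check, and the same directory passes after `chmod 700`.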

### SA-005: Lock Gap Between build_index and open

**Severity:** Medium
**Status:** Fixed
**File:** `src/index/build.rs`

**Identification:** The exclusive directory lock was dropped before `open()`
acquired a shared lock. Two concurrent builds could both succeed in the gap.

**Remediation:** The exclusive lock is downgraded to shared (unlock +
re-lock shared) while the writer lock is still held. The writer lock is
dropped only after the shared directory lock is acquired, closing the window.

### SA-006: segment_id Not Validated as UUID

**Severity:** Medium (latent)
**Status:** Fixed
**File:** `src/index/manifest.rs`, `Manifest::load()`

**Identification:** `segment_id` was not validated. While not currently used
in filesystem paths, a future code path could expose a path traversal.

**Remediation:** `Manifest::load()` validates that each `segment_id` parses
as a UUID.
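
To show why this closes the latent traversal, here is a dependency-free sketch that accepts only the canonical hyphenated 8-4-4-4-12 hex form. The audit does not say how `Manifest::load()` parses the value, so `is_canonical_uuid` is a hypothetical helper and may be stricter than the actual check.

```rust
/// True when `s` is 36 ASCII characters in the form
/// xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx with hex digits for every x.
pub fn is_canonical_uuid(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() != 36 {
        return false;
    }
    bytes.iter().enumerate().all(|(i, &b)| match i {
        8 | 13 | 18 | 23 => b == b'-',
        _ => b.is_ascii_hexdigit(),
    })
}
```

Because the accepted alphabet is limited to hex digits and hyphens at fixed positions, a value such as `../../etc/passwd` can never pass, whatever later code does with the identifier.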

### SA-007: MAX_POSTING_BYTES Allows 64 MB Per-Posting Allocation

**Severity:** Low
**Status:** Fixed
**File:** `src/index/segment/reader.rs`

**Identification:** A crafted `.post` file could force 64 MB allocation per
posting list. Multiple crafted grams in one query could exhaust memory.

**Remediation:** Reduced `MAX_POSTING_BYTES` from 64 MB to 8 MB. 8 MB
covers ~2M delta-varint-encoded doc_ids, well above any realistic segment.
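
The cap is only effective if it is checked before any buffer is allocated. The sketch below assumes the limit is 8 MiB (8 * 1024 * 1024 bytes) and uses a hypothetical `read_posting_bytes` helper over an in-memory slice.

```rust
/// Assumed value: the text says 8 MB; 8 MiB is used here for illustration.
pub const MAX_POSTING_BYTES: usize = 8 * 1024 * 1024;

/// Return the posting bytes at `offset`, refusing oversized or
/// out-of-range declarations before touching any memory.
pub fn read_posting_bytes(data: &[u8], offset: usize, declared_len: usize) -> Option<&[u8]> {
    if declared_len > MAX_POSTING_BYTES {
        return None; // crafted length: reject without allocating
    }
    let end = offset.checked_add(declared_len)?;
    data.get(offset..end)
}
```

At 4 bytes per encoded doc_id, 8 MiB holds 2,097,152 entries; at the 5-byte varint maximum it still holds about 1.67 million.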

### SA-008: Duplicate base64 Implementations

**Severity:** Low
**Status:** Fixed
**Files:** `src/cli/render.rs`, `tests/integration/cli.rs`

**Identification:** Two independent base64 implementations increased the
surface for encoding bugs in JSON output.

**Remediation:** Consolidated into `src/base64.rs` with RFC 4648 test vectors.
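
For reference, a standard-alphabet encoder with padding is short enough to check by hand against the RFC 4648 section 10 vectors. This is an independent sketch, not the contents of `src/base64.rs`; the function name is an assumption.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encode `input` with the standard base64 alphabet and `=` padding.
pub fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        // Positions that carry no input bits are emitted as padding.
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}
```

The RFC vectors cover every padding case: `"f"` encodes to `Zg==`, `"fo"` to `Zm8=`, and `"foo"` to `Zm9v`.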

## Round 2 Accepted Risks

### AR-001: resolve_git_binary TOCTOU

**Severity:** Medium
**File:** `src/cli/manage.rs`
**Rationale:** Inherent to the Unix exec model. The canonical path refers to
the correct inode when it is resolved; only replacing the binary in a writable
directory before the exec could exploit this. Calling `execveat` on an `O_PATH`
descriptor would close the gap on Linux 3.19+ but is not portable to macOS.

### AR-002: No Rate Limit on commit_batch Disk Writes

**Severity:** Low
**File:** `src/index/mod.rs`
**Rationale:** Requires API changes to Index (RateLimiter field or generation
cap). The overlay-full check (`OVERLAY_ENFORCE_THRESHOLD`) already bounds
total data growth. Rate limiting is a v2 consideration.

### AR-003: Thread-Local Buffer Sizing Under Large max_file_size

**Severity:** Low
**File:** `src/tokenizer/mod.rs`
**Rationale:** The shrink logic at `MIN_CAPACITY.max(needed * 4)` is correct.
Worst case is bounded by `max_file_size` (clamped to 1 GiB in SA-003 round 1).
Each rayon worker retains at most one buffer; rayon's default thread count is
bounded by CPU cores.