# SPEC: `crates/llmtxt-core/src/classify` Module
Spec version: 1.0.0
Task: T811 (S1)
Epic: T780 (Wave-2 — llmtxt@2026.4.13)
Status: APPROVED (Architect decision T780-decomposition.md)
---
## Overview
The `classify` module provides a layered content classification pipeline:
```
classify_content(bytes: &[u8]) → ClassificationResult
Layer 1: Magic-byte detection (infer@0.19)
↓ (if no match)
Layer 2: Text/binary gate (content_inspector@0.2.4)
↓ (if is_text)
Layer 3: Text heuristics (extracted from disclosure/mod.rs)
→ markdown / json / code / plain-text
WASM binding: classify_content_wasm(bytes: &[u8]) → String (JSON)
```
---
## Module Layout
```
crates/llmtxt-core/src/classify/
mod.rs — public API: classify_content(), re-exports types
magic.rs — Layer 1: magic-byte detection (infer crate wrapper)
text_gate.rs — Layer 2: text/binary gate (content_inspector wrapper)
heuristic.rs — Layer 3: markdown/JSON/code/text signals (extracted from disclosure/mod.rs)
wasm_bindings.rs — #[wasm_bindgen] exports
types.rs — ClassificationResult, ContentCategory, ContentFormat enums
tests.rs — #[cfg(test)] unit tests (30+ cases minimum)
```
The directory `crates/llmtxt-core/src/classify/` is a NEW module and does not exist prior to Wave-2. It is NOT to be confused with the existing `disclosure/` module which is separate and remains unchanged (except for the back-compat reroute in `disclosure/mod.rs`).
---
## Public Rust API (`mod.rs`)
```rust
//! Content classification pipeline for LLMtxt.
//!
//! Entry point: [`classify_content`] — takes raw bytes, returns a [`ClassificationResult`].
//!
//! Layer order:
//! 1. Magic-byte detection via `infer` crate (1.0 confidence on hit)
//! 2. Text/binary gate via `content_inspector` (determines is_text)
//! 3. Heuristic text classification (markdown / JSON / code / plain-text)
//!
//! The WASM export is [`classify_content_wasm`] in `wasm_bindings.rs`.
pub mod magic;
pub mod text_gate;
pub mod heuristic;
pub mod types;
pub mod wasm_bindings;
#[cfg(test)]
mod tests;
pub use types::{ClassificationResult, ContentCategory, ContentFormat};
/// Classify the content of `bytes` using a three-layer pipeline.
///
/// # Guarantees
///
/// - Zero-byte input returns `{ confidence: 0.0, category: Unknown, format: Unknown }` without panic.
/// - UTF-8 BOM (0xEF 0xBB 0xBF) and UTF-16 BOMs (0xFF 0xFE, 0xFE 0xFF) are stripped
/// before the heuristic pass.
/// - Binary inputs that pass magic-byte detection return `is_extractable: false`
/// (except PDF which returns `is_extractable: true`).
///
/// # Confidence Semantics
///
/// | 1.0 | Magic-byte match confirmed by `infer` |
/// | 0.8 | Strong heuristic signal (JSON parses successfully; heading present) |
/// | 0.5 | Weak heuristic match (single code signal) |
/// | 0.0 | Empty input — cannot determine |
///
/// # Examples
///
/// ```rust
/// use llmtxt_core::classify::{classify_content, ContentFormat};
///
/// let result = classify_content(b"%PDF-1.4 ...");
/// assert_eq!(result.format, ContentFormat::Pdf);
/// assert_eq!(result.confidence, 1.0);
///
/// let result = classify_content(b"# Hello\n\nThis is markdown.");
/// assert_eq!(result.format, ContentFormat::Markdown);
/// assert_eq!(result.confidence, 0.8);
///
/// let result = classify_content(b"");
/// assert_eq!(result.confidence, 0.0);
/// ```
pub fn classify_content(bytes: &[u8]) -> ClassificationResult {
use magic::detect_magic;
use text_gate::is_text;
use heuristic::classify_text;
use types::{ContentCategory, ContentFormat};
// Guard: zero-byte input
if bytes.is_empty() {
return ClassificationResult {
mime_type: "application/octet-stream".to_string(),
category: ContentCategory::Unknown,
format: ContentFormat::Unknown,
confidence: 0.0,
is_extractable: false,
};
}
// Layer 1: Magic-byte detection
if let Some(result) = detect_magic(bytes) {
return result;
}
// Layer 2: Text/binary gate
if !is_text(bytes) {
return ClassificationResult {
mime_type: "application/octet-stream".to_string(),
category: ContentCategory::Binary,
format: ContentFormat::Unknown,
confidence: 0.5,
is_extractable: false,
};
}
// Layer 3: Text heuristics (BOM-strip happens inside classify_text)
classify_text(bytes)
}
```
---
## Types (`types.rs`)
```rust
use serde::{Deserialize, Serialize};
/// Result of [`classify_content`].
///
/// Fields use `camelCase` for JSON serialization to match the TypeScript surface.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct ClassificationResult {
/// IANA MIME type string, e.g. `"application/pdf"`, `"text/markdown"`.
pub mime_type: String,
/// Coarse category bucket.
pub category: ContentCategory,
/// Specific format within the category.
pub format: ContentFormat,
/// Classification confidence in `[0.0, 1.0]`.
///
/// See [`classify_content`] documentation for semantics.
pub confidence: f32,
/// Whether text content can be extracted from this format.
///
/// `true` for text formats and PDF (future extraction support).
/// `false` for binary formats without a text layer (images, audio, zip).
pub is_extractable: bool,
}
/// Coarse content category.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub enum ContentCategory {
/// Binary format (image, audio, video, archive, etc.)
Binary,
/// Human-readable text (plain, markdown, code).
Text,
/// Structured data format (JSON, YAML, TOML).
Structured,
/// Cannot be determined (empty input or unrecognised binary).
Unknown,
}
/// Specific content format.
///
/// Variants listed below form the Wave-2 minimum coverage set (17 formats).
/// Additional formats may be added in future waves without breaking changes.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub enum ContentFormat {
// ── Binary ────────────────────────────────────────────────
/// PDF document (application/pdf) — magic: `%PDF`
Pdf,
/// PNG image (image/png) — magic: `\x89PNG\r\n\x1a\n`
Png,
/// JPEG image (image/jpeg) — magic: `\xFF\xD8\xFF`
Jpeg,
/// WebP image (image/webp) — magic: `RIFF....WEBP`
Webp,
/// AVIF image (image/avif) — magic: `ftyp` container
Avif,
/// SVG image (image/svg+xml) — magic/heuristic: `<svg`
Svg,
/// GIF image (image/gif) — magic: `GIF87a` / `GIF89a`
Gif,
/// MP4 video (video/mp4) — magic: `ftyp` container
Mp4,
/// WebM video (video/webm) — magic: `\x1A\x45\xDF\xA3`
Webm,
/// MP3 audio (audio/mpeg) — magic: `ID3` / `\xFF\xFB`
Mp3,
/// WAV audio (audio/wav) — magic: `RIFF....WAVE`
Wav,
/// OGG audio (audio/ogg) — magic: `OggS`
Ogg,
/// ZIP archive (application/zip) — magic: `PK\x03\x04`
Zip,
// ── Text ─────────────────────────────────────────────────
/// Markdown document (text/markdown)
Markdown,
/// JSON document (application/json)
Json,
/// JavaScript source (text/javascript)
JavaScript,
/// TypeScript source (text/typescript)
TypeScript,
/// Python source (text/x-python)
Python,
/// Rust source (text/x-rust)
Rust,
/// Go source (text/x-go)
Go,
/// Plain text (text/plain) — fallback for unrecognised text content
PlainText,
// ── Fallback ─────────────────────────────────────────────
/// Unknown format — empty input or unrecognised binary
Unknown,
}
```
---
## Layer 1: Magic-Byte Detection (`magic.rs`)
```rust
//! Layer 1: Magic-byte content detection using the `infer` crate.
//!
//! Returns `Some(ClassificationResult)` when a magic-byte pattern is confirmed,
//! or `None` when the type is unrecognised (caller falls through to Layer 2).
use infer;
use super::types::{ClassificationResult, ContentCategory, ContentFormat};
/// Attempt to detect the content type via magic bytes.
///
/// Returns `None` for unknown binary types — caller should proceed to Layer 2.
pub fn detect_magic(bytes: &[u8]) -> Option<ClassificationResult> {
let kind = infer::get(bytes)?;
let (format, category, is_extractable) = match kind.mime_type() {
"application/pdf" => (ContentFormat::Pdf, ContentCategory::Binary, true),
"image/png" => (ContentFormat::Png, ContentCategory::Binary, false),
"image/jpeg" => (ContentFormat::Jpeg, ContentCategory::Binary, false),
"image/webp" => (ContentFormat::Webp, ContentCategory::Binary, false),
"image/avif" => (ContentFormat::Avif, ContentCategory::Binary, false),
"image/svg+xml" => (ContentFormat::Svg, ContentCategory::Binary, false),
"image/gif" => (ContentFormat::Gif, ContentCategory::Binary, false),
"video/mp4" => (ContentFormat::Mp4, ContentCategory::Binary, false),
"video/webm" => (ContentFormat::Webm, ContentCategory::Binary, false),
"audio/mpeg" => (ContentFormat::Mp3, ContentCategory::Binary, false),
"audio/x-wav" | "audio/wav"
=> (ContentFormat::Wav, ContentCategory::Binary, false),
"audio/ogg" => (ContentFormat::Ogg, ContentCategory::Binary, false),
"application/zip" => (ContentFormat::Zip, ContentCategory::Binary, false),
// Unrecognised MIME from infer — return None, let text gate decide
_ => return None,
};
Some(ClassificationResult {
mime_type: kind.mime_type().to_string(),
category,
format,
confidence: 1.0,
is_extractable,
})
}
```
**Implementation note**: `infer::get` returns `None` for text content — it only matches known binary magic patterns. Text detection is handled by Layers 2 and 3.
**WAV MIME note**: `infer` returns `"audio/x-wav"`. Normalise to `"audio/wav"` in the arm for consistency with IANA registry.
---
## Layer 2: Text/Binary Gate (`text_gate.rs`)
```rust
//! Layer 2: Text/binary gate using the `content_inspector` crate.
//!
//! `content_inspector::inspect` uses byte-frequency analysis to distinguish
//! text from binary content. It handles UTF-8 BOM, UTF-16 LE/BE BOMs, and
//! malformed UTF-8 sequences correctly.
use content_inspector;
/// Returns `true` if `bytes` appear to be text (UTF-8, UTF-16, or similar text encoding).
///
/// Returns `false` for binary content (high frequency of null/control bytes).
///
/// # Notes
///
/// - UTF-8 BOM (`\xEF\xBB\xBF`) is treated as text.
/// - UTF-16 LE BOM (`\xFF\xFE`) and UTF-16 BE BOM (`\xFE\xFF`) are treated as text.
/// - Malformed UTF-8 sequences with high binary byte frequency → binary.
pub fn is_text(bytes: &[u8]) -> bool {
content_inspector::inspect(bytes).is_text()
}
```
---
## Layer 3: Text Heuristics (`heuristic.rs`)
```rust
//! Layer 3: Text heuristic classification.
//!
//! Extracts the signal logic currently inlined in `disclosure/mod.rs:detect_document_format`
//! into this standalone, testable module. After Wave-2 lands, `detect_document_format`
//! delegates to `classify_content` (via the back-compat reroute in T828).
//!
//! BOM stripping is performed before analysis:
//! - UTF-8 BOM: `\xEF\xBB\xBF` — strip 3 bytes
//! - UTF-16 LE: `\xFF\xFE` — strip 2 bytes
//! - UTF-16 BE: `\xFE\xFF` — strip 2 bytes
use super::types::{ClassificationResult, ContentCategory, ContentFormat};
/// Classify text bytes using heuristic signals.
///
/// Caller guarantees `bytes` is non-empty and passed Layer 2 (`is_text == true`).
///
/// Precedence order:
/// 1. JSON: valid `serde_json` parse → `Json` (confidence 0.8)
/// 2. Markdown: any heading signal → `Markdown` (confidence 0.8)
/// 3. Markdown: 2+ of 5 signals → `Markdown` (confidence 0.8)
/// 4. Code: 2+ of 5 code signals → appropriate code format (confidence 0.5)
/// 5. Fallback: `PlainText` (confidence 0.3)
pub fn classify_text(bytes: &[u8]) -> ClassificationResult {
// Strip BOMs before string conversion
let stripped = strip_bom(bytes);
// Convert to &str (lossy for robustness)
let content = match std::str::from_utf8(stripped) {
Ok(s) => s,
Err(_) => {
// Valid text (passed is_text gate) but not valid UTF-8 — treat as plain
return ClassificationResult {
mime_type: "text/plain".to_string(),
category: ContentCategory::Text,
format: ContentFormat::PlainText,
confidence: 0.5,
is_extractable: true,
};
}
};
let trimmed = content.trim();
// ── JSON detection ────────────────────────────────────────────────
if trimmed.starts_with('{') || trimmed.starts_with('[') {
if serde_json::from_str::<serde_json::Value>(trimmed).is_ok() {
return ClassificationResult {
mime_type: "application/json".to_string(),
category: ContentCategory::Structured,
format: ContentFormat::Json,
confidence: 0.8,
is_extractable: true,
};
}
// Partial JSON signals (malformed but JSON-like)
let json_signals = [
content.contains("\":"),
trimmed.starts_with('{') || trimmed.starts_with('['),
trimmed.ends_with('}') || trimmed.ends_with(']'),
];
if json_signals.iter().filter(|&&b| b).count() >= 2 {
return ClassificationResult {
mime_type: "application/json".to_string(),
category: ContentCategory::Structured,
format: ContentFormat::Json,
confidence: 0.5,
is_extractable: true,
};
}
}
// ── Markdown detection ────────────────────────────────────────────
let has_heading = content.lines().any(|l| {
let t = l.trim_start_matches(' ');
t.starts_with("# ")
|| t.starts_with("## ")
|| t.starts_with("### ")
|| t.starts_with("#### ")
|| t.starts_with("##### ")
|| t.starts_with("###### ")
});
// Short-circuit on heading (strong unambiguous signal — Wave-1 fix)
if has_heading {
return ClassificationResult {
mime_type: "text/markdown".to_string(),
category: ContentCategory::Text,
format: ContentFormat::Markdown,
confidence: 0.8,
is_extractable: true,
};
}
let markdown_signals: [bool; 5] = [
has_heading,
content.lines().any(|l| l.trim_start().starts_with("- ") || l.trim_start().starts_with("* ")),
content.lines().any(|l| l.trim_start().starts_with(|c: char| c.is_ascii_digit()) && l.contains(". ")),
content.contains("```"),
has_markdown_link(content),
];
if markdown_signals.iter().filter(|&&b| b).count() >= 2 {
return ClassificationResult {
mime_type: "text/markdown".to_string(),
category: ContentCategory::Text,
format: ContentFormat::Markdown,
confidence: 0.8,
is_extractable: true,
};
}
// ── Code detection ────────────────────────────────────────────────
let (code_signals, detected_lang) = detect_code(content);
if code_signals >= 2 {
let (mime_type, format) = match detected_lang {
Some(CodeLang::TypeScript) => ("text/typescript", ContentFormat::TypeScript),
Some(CodeLang::JavaScript) => ("text/javascript", ContentFormat::JavaScript),
Some(CodeLang::Python) => ("text/x-python", ContentFormat::Python),
Some(CodeLang::Rust) => ("text/x-rust", ContentFormat::Rust),
Some(CodeLang::Go) => ("text/x-go", ContentFormat::Go),
None => ("text/plain", ContentFormat::PlainText),
};
return ClassificationResult {
mime_type: mime_type.to_string(),
category: ContentCategory::Text,
format,
confidence: 0.5,
is_extractable: true,
};
}
// ── Fallback: plain text ─────────────────────────────────────────
ClassificationResult {
mime_type: "text/plain".to_string(),
category: ContentCategory::Text,
format: ContentFormat::PlainText,
confidence: 0.3,
is_extractable: true,
}
}
// ── Internal helpers ───────────────────────────────────────────────
/// Strips leading BOM sequences from bytes.
fn strip_bom(bytes: &[u8]) -> &[u8] {
// UTF-8 BOM: EF BB BF
if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
return &bytes[3..];
}
// UTF-16 LE BOM: FF FE
if bytes.starts_with(&[0xFF, 0xFE]) {
return &bytes[2..];
}
// UTF-16 BE BOM: FE FF
if bytes.starts_with(&[0xFE, 0xFF]) {
return &bytes[2..];
}
bytes
}
/// Detects whether a `[text](url)` markdown link pattern exists in `s`.
fn has_markdown_link(s: &str) -> bool {
let bytes = s.as_bytes();
let mut i = 0;
while i < bytes.len() {
if bytes[i] == b'['
&& let Some(cb) = bytes[i..].iter().position(|&b| b == b']')
{
let j = i + cb;
if j + 1 < bytes.len() && bytes[j + 1] == b'(' && bytes[j + 1..].contains(&b')') {
return true;
}
}
i += 1;
}
false
}
/// Language hints for code detection.
enum CodeLang { JavaScript, TypeScript, Python, Rust, Go }
/// Returns `(signal_count, Option<CodeLang>)`.
fn detect_code(content: &str) -> (usize, Option<CodeLang>) {
let code_signals: [bool; 5] = [
content.lines().any(|l| {
let t = l.trim_start();
t.starts_with("import ")
|| t.starts_with("export ")
|| t.starts_with("const ")
|| t.starts_with("let ")
|| t.starts_with("var ")
|| t.starts_with("function ")
|| t.starts_with("class ")
|| t.starts_with("def ")
|| t.starts_with("fn ")
|| t.starts_with("pub ")
|| t.starts_with("use ")
}),
content.lines().any(|l| {
l.trim_end().ends_with('{')
|| l.trim_end().ends_with(';')
|| l.trim_end().ends_with('}')
}),
content.lines().any(|l| {
let t = l.trim_start();
t.starts_with("if ")
|| t.starts_with("for ")
|| t.starts_with("while ")
|| t.starts_with("return ")
|| t.starts_with("switch ")
}),
content.contains("=>"),
content.contains(": string")
|| content.contains(": number")
|| content.contains(": boolean")
|| content.contains(": int")
|| content.contains(": void")
|| content.contains(": any"),
];
let count = code_signals.iter().filter(|&&b| b).count();
// Language detection heuristics (best-effort, not exhaustive)
let lang = if content.contains(": string") || content.contains(": boolean") || content.contains(": number") {
Some(CodeLang::TypeScript)
} else if content.contains("def ") && content.contains(":") && !content.contains("{") {
Some(CodeLang::Python)
} else if content.contains("fn ") && content.contains("pub ") {
Some(CodeLang::Rust)
} else if content.contains("func ") && content.contains("package ") {
Some(CodeLang::Go)
} else if count >= 2 {
Some(CodeLang::JavaScript)
} else {
None
};
(count, lang)
}
```
---
## WASM Bindings (`wasm_bindings.rs`)
```rust
//! WASM exports for the classify module.
//!
//! Exports a single JSON-returning function `classify_content_wasm` to avoid
//! wasm-bindgen's struct serialization complexity. The TS layer parses the JSON
//! and maps to camelCase fields.
use wasm_bindgen::prelude::*;
use super::classify_content;
/// Classify content and return a JSON string.
///
/// Input: raw bytes as a `&[u8]`. WASM callers pass a `Uint8Array`.
///
/// Output: JSON string of [`ClassificationResult`] (camelCase keys).
///
/// On serialization failure (should not occur in practice), returns:
/// `{"error":"serialize failed: <reason>","mimeType":"application/octet-stream","category":"Unknown","format":"Unknown","confidence":0.0,"isExtractable":false}`
///
/// WASM consumers MUST check for the `error` key.
#[wasm_bindgen]
pub fn classify_content_wasm(bytes: &[u8]) -> String {
let result = classify_content(bytes);
serde_json::to_string(&result).unwrap_or_else(|e| {
format!(
r#"{{"error":"serialize failed: {}","mimeType":"application/octet-stream","category":"Unknown","format":"Unknown","confidence":0.0,"isExtractable":false}}"#,
e
)
})
}
```
**Notes**:
- `ClassificationResult` derives `Serialize` with `#[serde(rename_all = "camelCase")]` — JSON keys are camelCase automatically.
- `ContentCategory` and `ContentFormat` serialize as their variant name string (e.g. `"Binary"`, `"Markdown"`).
- The TS adapter maps these to lowercase strings for the public API (see T812 spec).
---
## Cargo Dependencies (`Cargo.toml` additions)
Add to `[dependencies]` section of `crates/llmtxt-core/Cargo.toml`:
```toml
# Layer 1: magic-byte content type detection
infer = { version = "0.19", default-features = false, features = ["alloc"] }
# Layer 2: text/binary gate via byte-frequency analysis
content_inspector = "0.2.4"
```
**Feature flags**:
- `infer`: `default-features = false, features = ["alloc"]` — disables std (compatible with wasm32-unknown-unknown), enables alloc (required for Vec/String in WASM). Confirmed WASM-compatible in T780 probe.
- `content_inspector`: no feature flags needed — pure Rust, memchr dependency only, WASM-confirmed.
**Feature gating**: These dependencies are NOT gated behind `[features]`. They are always compiled. The WASM size probe showed ~102KB overhead for both crates in a minimal probe crate — well within the 526KB budget (20% of 2.57MB baseline).
---
## Confidence Semantics (Normative)
| `1.0` | Magic-byte match confirmed by `infer::get()` |
| `0.8` | Strong heuristic: JSON parse succeeds OR markdown heading present OR 2+ markdown signals |
| `0.5` | Weak heuristic: 2+ code signals (language uncertain); or content_inspector says text but no heuristic match; or malformed JSON with 2+ JSON-like signals |
| `0.3` | Plain text fallback (no other signal matched) |
| `0.0` | Zero-byte input — cannot determine |
---
## Zero-Byte Handling (Normative)
`classify_content(&[])` MUST return:
```rust
ClassificationResult {
mime_type: "application/octet-stream".to_string(),
category: ContentCategory::Unknown,
format: ContentFormat::Unknown,
confidence: 0.0,
is_extractable: false,
}
```
No panic. No `unwrap`. This contract is verified by test `test_empty_input` in `tests.rs`.
---
## BOM Handling (Normative)
The following BOMs are stripped by `strip_bom()` before heuristic analysis:
| UTF-8 | `0xEF 0xBB 0xBF` | Strip 3 bytes before heuristic |
| UTF-16 LE | `0xFF 0xFE` | Strip 2 bytes before heuristic |
| UTF-16 BE | `0xFE 0xFF` | Strip 2 bytes before heuristic |
Content that begins with a BOM passes `is_text()` check (content_inspector handles these correctly). The BOM is stripped before passing to heuristic analysis to avoid misclassifying BOM-prefixed JSON/markdown as unknown.
---
## Test Plan (`tests.rs` minimum — 30 cases)
### Magic-byte tests (13 cases — one per binary format)
```rust
test_pdf_magic() — b"%PDF-1.4 body" → Pdf, confidence 1.0
test_png_magic() — PNG header bytes → Png, confidence 1.0
test_jpeg_magic() — JPEG header bytes → Jpeg, confidence 1.0
test_webp_magic() — RIFF...WEBP bytes → Webp, confidence 1.0
test_avif_magic() — ftyp container → Avif, confidence 1.0
test_svg_magic() — <svg bytes → Svg (may be text-layer detect)
test_gif_magic() — GIF89a bytes → Gif, confidence 1.0
test_mp4_magic() — ftyp container → Mp4, confidence 1.0
test_webm_magic() — 0x1A45DFA3 bytes → Webm, confidence 1.0
test_mp3_magic() — ID3 bytes → Mp3, confidence 1.0
test_wav_magic() — RIFF...WAVE bytes → Wav, confidence 1.0
test_ogg_magic() — OggS bytes → Ogg, confidence 1.0
test_zip_magic() — PK\x03\x04 bytes → Zip, confidence 1.0
```
### Heuristic text tests (10 cases)
```rust
test_json_valid() — r#"{"key":"value"}"# → Json, confidence 0.8
test_json_array() — "[1,2,3]" → Json, confidence 0.8
test_markdown_heading() — "# Title\n\nBody" → Markdown, confidence 0.8
test_markdown_heading_only()— "# Title Only" → Markdown (Wave-1 fix validated)
test_markdown_multi_signal()— "- item\n```code```" → Markdown (2+ signals)
test_code_typescript() — "const x: string = 'hi';" → TypeScript
test_code_python() — "def foo():\n return 1" → Python
test_code_rust() — "pub fn main() { let x = 1; }" → Rust
test_plain_text() — "Hello world." → PlainText
test_empty_input() — b"" → Unknown, confidence 0.0
```
### BOM handling tests (3 cases)
```rust
test_utf8_bom_json() — "\xEF\xBB\xBF{\"a\":1}" → Json (BOM stripped)
test_utf16_le_bom() — "\xFF\xFE" + UTF-16 text → Text category
test_utf16_be_bom() — "\xFE\xFF" + UTF-16 text → Text category
```
### Edge cases (4 cases)
```rust
test_partial_json_signals() — malformed JSON with 2+ signals → Json confidence 0.5
test_binary_not_magic() — random binary bytes → Binary Unknown confidence 0.5
test_svg_text_detection() — SVG XML content (text path) → Svg or PlainText
test_code_weak_signal() — single code signal only → PlainText (not code)
```
---
## Integration with Back-Compat Reroute (T828)
After Wave-2 lands, `disclosure/mod.rs` is updated:
```rust
// In crates/llmtxt-core/src/disclosure/mod.rs
use crate::classify::{classify_content, ContentFormat};
pub fn detect_document_format(content: &str) -> &'static str {
let result = classify_content(content.as_bytes());
match result.format {
ContentFormat::Json => "json",
ContentFormat::Markdown => "markdown",
ContentFormat::JavaScript | ContentFormat::TypeScript
| ContentFormat::Python | ContentFormat::Rust | ContentFormat::Go => "code",
_ => "text", // PlainText, Binary formats, Unknown all → "text"
}
}
```
The Wave-1 markdown heading fix (short-circuit in `heuristic.rs`) is the SSoT for this logic after Wave-2 lands. The previously inlined logic in `mod.rs` is REMOVED and replaced by this delegation.
---
## WASM Module Registration
In `crates/llmtxt-core/src/lib.rs`, add the module declaration:
```rust
pub mod classify;
```
Ensure `classify_content_wasm` is re-exported at the crate root or accessible via the module path for wasm-pack. The `#[wasm_bindgen]` attribute on `classify_content_wasm` in `wasm_bindings.rs` registers it automatically when wasm-pack builds with `--target bundler` or `--target web`.
---
## Quality Gates
Before merging Wave-2 implementation:
- [ ] `cargo fmt --check` exits 0 on `crates/llmtxt-core`
- [ ] `cargo clippy -- -D warnings` exits 0 on `crates/llmtxt-core`
- [ ] `cargo test -p llmtxt-core` — all 30+ classify tests pass
- [ ] `ferrous-forge validate` exits 0
- [ ] WASM binary size: `wc -c packages/llmtxt/wasm/llmtxt_core_bg.wasm` ≤ 3,160,021 bytes
- [ ] `classify_content_wasm` accessible from `packages/llmtxt/wasm/llmtxt_core.js`