langmail
Email preprocessing for LLMs. Fast, typed, Rust-powered.
Emails are messy — nested MIME parts, quoted reply chains, HTML cruft, signatures, forwarded headers. LLMs don't need any of that. langmail strips it all away and gives you clean, structured text optimized for language model consumption.
Install
Requires Node.js 18 or later. Prebuilt native binaries are included — no Rust toolchain needed.
Quick Start
import { preprocess, preprocessString, toLlmContext } from "langmail";
import { readFileSync } from "fs";
// From a raw .eml file
const raw = readFileSync("message.eml");
const email = preprocess(raw);
// Or from a string (e.g. a decoded Gmail API response)
const rawEmailString = raw.toString("utf8");
const fromString = preprocessString(rawEmailString);
console.log(email.body);
// → "Hi Alice! Great to hear from you."
console.log(email.from);
// → { name: "Bob", email: "bob@example.com" }
// Format for an LLM prompt
console.log(toLlmContext(email));
// FROM: Bob <bob@example.com>
// TO: Alice <alice@example.com>
// SUBJECT: Re: Project update
// DATE: 2024-01-15T10:30:00Z
// CONTENT:
// Hi Alice! Great to hear from you.
API Reference
preprocess(raw)
Parse and preprocess a raw email from a Buffer.
import { preprocess } from "langmail";
import { readFileSync } from "fs";
const raw = readFileSync("message.eml");
const email = preprocess(raw);
Parameters:
| Name | Type | Description |
|---|---|---|
| raw | Buffer | Raw email bytes (RFC 5322 / EML) |
Returns: ProcessedEmail
Throws: If the input cannot be parsed as a valid RFC 5322 message.
preprocessString(raw)
Convenience wrapper that accepts a string instead of a Buffer.
import { preprocessString } from "langmail";
const email = preprocessString(rawEmailString);
Parameters:
| Name | Type | Description |
|---|---|---|
| raw | string | Raw email as a string |
Returns: ProcessedEmail
Throws: If the input cannot be parsed as a valid RFC 5322 message.
preprocessWithOptions(raw, options)
Preprocess with custom options to control quote stripping, signature removal, and body length.
import { preprocessWithOptions } from "langmail";
const email = preprocessWithOptions(raw, {
stripQuotes: true, // Remove quoted replies (default: true)
stripSignature: true, // Remove email signatures (default: true)
maxBodyLength: 4000, // Truncate body to N chars (default: 0 = no limit)
});
Parameters:
| Name | Type | Description |
|---|---|---|
| raw | Buffer | Raw email bytes |
| options | PreprocessOptions | Preprocessing options |
Returns: ProcessedEmail
Throws: If the input cannot be parsed as a valid RFC 5322 message.
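All three preprocess functions throw on unparseable input, so a batch pipeline may prefer to capture failures instead of aborting. The following is a minimal sketch, not part of langmail: `tryParse` and `ParseResult` are hypothetical names, the shape of the thrown value is not documented here and is therefore normalized to a string, and the parser is passed in as a callback so the snippet runs without langmail installed.

```typescript
// Hypothetical helper: not part of langmail.
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

function tryParse<T>(parse: () => T): ParseResult<T> {
  try {
    return { ok: true, value: parse() };
  } catch (err) {
    // The thrown value's shape is an unknown here, so reduce it to a message.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Stand-in for a parser that rejects its input.
const failed = tryParse<string>(() => {
  throw new Error("invalid message");
});
console.log(failed); // a failure result, no exception escapes
```

With langmail installed, the call would presumably look like `tryParse(() => preprocess(raw))`.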
toLlmContext(email)
Format a ProcessedEmail as a deterministic plain-text block for LLM prompts. Missing fields are omitted; the CONTENT: line is always present.
import { preprocess, toLlmContext } from "langmail";
const email = preprocess(raw);
console.log(toLlmContext(email));
// FROM: Bob <bob@example.com>
// TO: Alice <alice@example.com>
// SUBJECT: Re: Project update
// DATE: 2024-01-15T10:30:00Z
// CONTENT:
// Hi Alice! Great to hear from you.
Parameters:
| Name | Type | Description |
|---|---|---|
| email | ProcessedEmail | A preprocessed email |
Returns: string
Never throws.
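To make the "missing fields are omitted, CONTENT: is always present" rule concrete, here is a rough approximation of the documented layout. It is a sketch, not langmail's implementation: `renderContext`, `formatAddress`, and `EmailHeaderFields` are hypothetical names, and joining multiple recipients with ", " is an assumption.

```typescript
interface Address {
  name?: string;
  email: string;
}

// Subset of ProcessedEmail needed for this sketch.
interface EmailHeaderFields {
  body: string;
  subject?: string;
  from?: Address;
  to: Address[];
  date?: string;
}

// "Bob <bob@example.com>" when a display name exists, else the bare address (assumed).
function formatAddress(a: Address): string {
  return a.name ? `${a.name} <${a.email}>` : a.email;
}

// Hypothetical re-creation of the documented layout.
function renderContext(email: EmailHeaderFields): string {
  const lines: string[] = [];
  if (email.from) lines.push(`FROM: ${formatAddress(email.from)}`);
  if (email.to.length > 0) {
    // Joining recipients with ", " is an assumption.
    lines.push(`TO: ${email.to.map(formatAddress).join(", ")}`);
  }
  if (email.subject) lines.push(`SUBJECT: ${email.subject}`);
  if (email.date) lines.push(`DATE: ${email.date}`);
  lines.push("CONTENT:"); // always present, even when headers are missing
  lines.push(email.body);
  return lines.join("\n");
}

const sample = renderContext({
  body: "Hi Alice! Great to hear from you.",
  subject: "Re: Project update",
  from: { name: "Bob", email: "bob@example.com" },
  to: [{ name: "Alice", email: "alice@example.com" }],
  date: "2024-01-15T10:30:00Z",
});
console.log(sample);
```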
toLlmContextWithOptions(email, options)
Same as toLlmContext but accepts options to control rendering. Use renderMode: "ThreadHistory" to include quoted reply history as a chronological transcript.
import { preprocess, toLlmContextWithOptions } from "langmail";
const email = preprocess(raw);
// Default: only the latest message
console.log(toLlmContextWithOptions(email, { renderMode: "LatestOnly" }));
// Include thread history
console.log(toLlmContextWithOptions(email, { renderMode: "ThreadHistory" }));
// FROM: Bob <bob@example.com>
// SUBJECT: Re: Project update
// CONTENT:
// Hi Alice! Great to hear from you.
//
// THREAD HISTORY (oldest first):
// ---
// FROM: Alice <alice@example.com>
// DATE: 2024-01-14T09:00:00Z
// Alice's original message here...
// ---
Parameters:
| Name | Type | Description |
|---|---|---|
| email | ProcessedEmail | A preprocessed email |
| options | LlmContextOptions | Rendering options |
Returns: string
Never throws.
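The thread-history block shown above can be approximated from a `ThreadMessage[]` as follows. This is an illustrative sketch only: `renderThreadHistory` is a hypothetical name, and the exact separator placement between multiple messages is an assumption based on the single-message example.

```typescript
interface ThreadMessage {
  sender: string;
  timestamp?: string;
  body: string;
}

// Hypothetical renderer for the "THREAD HISTORY" section.
function renderThreadHistory(messages: ThreadMessage[]): string {
  if (messages.length === 0) return "";
  const lines: string[] = ["THREAD HISTORY (oldest first):", "---"];
  for (const message of messages) {
    lines.push(`FROM: ${message.sender}`);
    // The DATE line is skipped when no timestamp could be parsed.
    if (message.timestamp) lines.push(`DATE: ${message.timestamp}`);
    lines.push(message.body);
    lines.push("---"); // assumed: one separator after every message
  }
  return lines.join("\n");
}

console.log(
  renderThreadHistory([
    {
      sender: "Alice <alice@example.com>",
      timestamp: "2024-01-14T09:00:00Z",
      body: "Alice's original message here...",
    },
  ]),
);
```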
Output Structure
ProcessedEmail
interface ProcessedEmail {
body: string; // Clean text, ready for your LLM
subject?: string;
from?: Address;
to: Address[];
cc: Address[];
date?: string; // ISO 8601
rfcMessageId?: string; // Message-ID header (RFC 5322)
inReplyTo?: string[]; // In-Reply-To header (threading)
references?: string[]; // References header (threading)
signature?: string; // Extracted signature, if found
rawBodyLength: number; // Body length before cleaning
cleanBodyLength: number; // Body length after cleaning
primaryCta?: CallToAction; // Primary call-to-action from HTML body
threadMessages: ThreadMessage[]; // Quoted replies, oldest first
}
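The threading fields can be used to group processed emails into conversations. The sketch below is one possible approach, not something langmail provides: `threadKey` and `groupByThread` are hypothetical helpers, and the logic assumes the References list is ordered oldest first (as RFC 5322 recommends) and was not truncated by a mail client.

```typescript
// Subset of ProcessedEmail needed for this sketch.
interface ThreadingFields {
  rfcMessageId?: string;
  inReplyTo?: string[];
  references?: string[];
}

// Hypothetical helper: pick a stable key for grouping messages into conversations.
function threadKey(email: ThreadingFields): string | undefined {
  // The first References entry is treated as the thread root (assumption: not truncated).
  if (email.references && email.references.length > 0) return email.references[0];
  // Fall back to the direct parent, then to the message itself.
  if (email.inReplyTo && email.inReplyTo.length > 0) return email.inReplyTo[0];
  return email.rfcMessageId;
}

function groupByThread<T extends ThreadingFields>(emails: T[]): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const email of emails) {
    const key = threadKey(email);
    if (key === undefined) continue; // no usable ID; skipped in this sketch
    const bucket = groups.get(key) ?? [];
    bucket.push(email);
    groups.set(key, bucket);
  }
  return groups;
}
```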
Address
interface Address {
name?: string; // Display name (e.g. "Alice")
email: string; // Email address (e.g. "alice@example.com")
}
CallToAction
interface CallToAction {
url: string; // The URL the action points to
text: string; // Human-readable label
confidence: number; // Score between 0.0 and 1.0
}
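Because `confidence` is a score between 0.0 and 1.0, callers may want to ignore weak matches before passing a CTA to a model. A small sketch, with `confidentCta` as a hypothetical helper and the 0.7 default threshold chosen arbitrarily for illustration:

```typescript
interface CallToAction {
  url: string;
  text: string;
  confidence: number;
}

// Hypothetical helper; the 0.7 default is an arbitrary illustration, not a langmail value.
function confidentCta(
  cta: CallToAction | undefined,
  minConfidence = 0.7,
): CallToAction | undefined {
  return cta !== undefined && cta.confidence >= minConfidence ? cta : undefined;
}

const strong = confidentCta({ url: "https://example.com/verify", text: "Verify email", confidence: 0.92 });
const weak = confidentCta({ url: "https://example.com/blog", text: "Read more", confidence: 0.31 });
console.log(strong?.text, weak);
```

In practice the argument would be `email.primaryCta`, which is already optional on `ProcessedEmail`.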
ThreadMessage
interface ThreadMessage {
sender: string; // Sender attribution (e.g. "Max <max@example.com>")
timestamp?: string; // ISO 8601, if parseable from the attribution
body: string; // Message body (cleaned, no nested quotes)
}
PreprocessOptions
interface PreprocessOptions {
stripQuotes?: boolean; // Remove quoted replies (default: true)
stripSignature?: boolean; // Remove email signatures (default: true)
maxBodyLength?: number; // Max body chars, 0 = no limit (default: 0)
}
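The defaults and the "0 = no limit" rule can be illustrated as below. This is a sketch of the documented semantics rather than langmail's code: `resolveOptions` and `truncateBody` are hypothetical names, and the truncation counts UTF-16 code units, which may differ from how langmail counts characters.

```typescript
interface PreprocessOptions {
  stripQuotes?: boolean;
  stripSignature?: boolean;
  maxBodyLength?: number;
}

// Defaults as documented in the interface comments.
const DEFAULTS: Required<PreprocessOptions> = {
  stripQuotes: true,
  stripSignature: true,
  maxBodyLength: 0,
};

// Hypothetical helper: fill in any option the caller left out.
function resolveOptions(options: PreprocessOptions = {}): Required<PreprocessOptions> {
  return {
    stripQuotes: options.stripQuotes ?? DEFAULTS.stripQuotes,
    stripSignature: options.stripSignature ?? DEFAULTS.stripSignature,
    maxBodyLength: options.maxBodyLength ?? DEFAULTS.maxBodyLength,
  };
}

// Hypothetical helper: 0 means "no limit"; counts UTF-16 code units (assumption).
function truncateBody(body: string, maxBodyLength: number): string {
  return maxBodyLength > 0 ? body.slice(0, maxBodyLength) : body;
}

const resolved = resolveOptions({ maxBodyLength: 5 });
console.log(resolved, truncateBody("hello world", resolved.maxBodyLength));
```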
LlmContextOptions / RenderMode
interface LlmContextOptions {
renderMode?: RenderMode; // Default: "LatestOnly"
}
// TypeScript enum — JS users pass the string literals directly ("LatestOnly" or "ThreadHistory")
const enum RenderMode {
/** Only the latest message — all quoted content stripped. */
LatestOnly = "LatestOnly",
/** Chronological transcript of quoted replies below the main content. */
ThreadHistory = "ThreadHistory",
}
Features
- MIME parsing: handles nested multipart messages, attachments, and encoded headers
- HTML to text: converts HTML email bodies to clean plain text, preserving links and structure
- Quote stripping: detects and removes quoted replies from Gmail, Outlook, Apple Mail, forwarded messages, and >-prefixed lines; supports English, German, French, and Spanish
- Signature removal: strips signatures (preserved in the signature field); detected via the -- delimiter and heuristics
- CTA extraction: extracts the primary call-to-action from HTML emails via JSON-LD (potentialAction) or heuristic link scoring; filters out unsubscribe/privacy/logo links
- Thread history: extracts quoted reply blocks into structured ThreadMessage[] (oldest first); render with toLlmContextWithOptions(email, { renderMode: "ThreadHistory" })
- Whitespace cleanup: normalizes excessive blank lines and trailing spaces
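To give a feel for what a few of these steps involve, here is a deliberately simplified sketch of signature splitting, `>`-quote removal, and whitespace cleanup on plain text. It is not langmail's implementation (which runs in Rust and uses additional heuristics); `splitSignature`, `stripQuotedLines`, and `normalizeWhitespace` are hypothetical names.

```typescript
// Split a plain-text body at the conventional "-- " signature delimiter line.
function splitSignature(text: string): { body: string; signature?: string } {
  const lines = text.split("\n");
  // Scan from the end so the last delimiter line wins.
  for (let i = lines.length - 1; i >= 0; i--) {
    if (lines[i].trimEnd() === "--") {
      return {
        body: lines.slice(0, i).join("\n"),
        signature: lines.slice(i + 1).join("\n"),
      };
    }
  }
  return { body: text };
}

// Drop lines that start with ">" (ignoring leading whitespace).
function stripQuotedLines(text: string): string {
  return text
    .split("\n")
    .filter((line) => !line.trimStart().startsWith(">"))
    .join("\n");
}

// Trim trailing spaces and collapse runs of blank lines into a single blank line.
function normalizeWhitespace(text: string): string {
  return text
    .split("\n")
    .map((line) => line.trimEnd())
    .join("\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

const rawBody =
  "Thanks, see you then.   \n\n\n\n> On Monday Alice wrote:\n> Can we meet?\n-- \nBob\nAcme Inc.";
const { body, signature } = splitSignature(rawBody);
const cleaned = normalizeWhitespace(stripQuotedLines(body));
console.log(cleaned); // → "Thanks, see you then."
console.log(signature); // → "Bob\nAcme Inc."
```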
Performance
langmail uses mail-parser under the hood — a zero-copy Rust MIME parser. The preprocessing pipeline adds minimal overhead on top of the parse step.
Typical throughput on a modern machine: 10,000+ emails/second for plain text messages.
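Throughput depends heavily on message size and hardware, so it is worth measuring on your own corpus. A minimal timing harness might look like the sketch below; `measureThroughput` is a hypothetical helper, and the callback used here is a stand-in workload so the snippet runs without langmail installed.

```typescript
// Hypothetical harness: returns how many calls per second `fn` sustains.
function measureThroughput(fn: () => void, iterations: number): number {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedMs = performance.now() - start;
  // Guard against a zero reading from a coarse timer.
  return iterations / (Math.max(elapsedMs, 0.001) / 1000);
}

// Stand-in workload; replace with a real parse call when benchmarking.
const perSecond = measureThroughput(() => {
  JSON.parse('{"subject":"Re: Project update"}');
}, 10_000);
console.log(`${Math.round(perSecond)} calls/second`);
```

With langmail installed, the callback would presumably be `() => preprocess(raw)`, ideally cycling through a representative set of messages rather than a single one.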
License
MIT OR Apache-2.0
Built by the team behind Marbles.