wc-parser
A decently fast Rust library for parsing WhatsApp chat exports.
Features
- Parse WhatsApp chat exports into structured data
- Support for multiple date and time formats
- Automatic detection of date format (day/month vs month/day)
- Optional attachment parsing
- System message detection
- Multiline message support
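As a rough illustration of multiline message support, the sketch below folds continuation lines into the preceding message. It is not the library's actual implementation: `starts_new_message` and `group_lines` are hypothetical names, and the "starts with digits and a date separator" check is an assumed stand-in for the real timestamp patterns (a continuation line such as "10/10 would recommend" would fool it).

```rust
/// Hypothetical heuristic: a new message starts with an optional '[' followed by
/// one to four digits and a date separator, e.g. "12/03/2021, 21:05 - ..." or "[3.1.21, ...".
fn starts_new_message(line: &str) -> bool {
    let rest = line.strip_prefix('[').unwrap_or(line);
    let digits = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    (1..=4).contains(&digits)
        && matches!(rest.as_bytes().get(digits).copied(), Some(b'/' | b'.' | b'-'))
}

/// Groups raw export lines into messages; any line that does not look like the
/// start of a new message is appended to the previous one.
fn group_lines(input: &str) -> Vec<String> {
    let mut messages: Vec<String> = Vec::new();
    for line in input.lines() {
        if starts_new_message(line) || messages.is_empty() {
            messages.push(line.to_string());
        } else if let Some(last) = messages.last_mut() {
            last.push('\n');
            last.push_str(line);
        }
    }
    messages
}
```

A real parser would use the full timestamp pattern for this check; the grouping loop stays the same either way.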
Performance & Optimisations
wc-parser is designed to be fast and memory-efficient. Key optimisations include:
- Memory-mapped I/O: `parse_file` uses `memmap2` so chat exports are read straight from the operating-system page cache without first copying them into a `String`, keeping peak RSS low even for multi-gigabyte logs.
- Zero-copy parsing: when parsing from a `&str`, the original slice is split into `&str` line slices instead of allocating new strings; allocation happens only when constructing the final `Message` structs.
- Pre-compiled regular expressions: all regex patterns are built once, on first use, via `lazy_static!`, removing the compile cost from the hot parsing path.
- Data-parallel message processing: heavy-weight work (regex capture extraction, date/time normalisation, etc.) runs in parallel across CPU cores with `rayon` when debug output is disabled.
- Selective attachment parsing: attachment extraction is skipped entirely unless `parse_attachments = true`, saving an extra regex run per message in the common case.
- Configurable debug logging: expensive debug printing is off by default. When enabled, it switches to single-threaded execution to keep log output ordered.
- Small-footprint date handling: simple heuristics determine whether the log is day-first or month-first in a single pass, avoiding per-message branching once parsing begins.
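The day-first versus month-first heuristic can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's code: `DateOrder` and `detect_date_order` are hypothetical names, only the first two numeric fields of each date are inspected, and year-first dates are deliberately ignored.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum DateOrder {
    DayFirst,
    MonthFirst,
    Ambiguous,
}

/// Scans the dates once and stops at the first one whose fields can only be
/// read one way (a value above 12 cannot be a month).
fn detect_date_order<'a, I>(dates: I) -> DateOrder
where
    I: IntoIterator<Item = &'a str>,
{
    for date in dates {
        // Split on anything that is not a digit, so "/", "." and "-" all work.
        let mut fields = date
            .split(|c: char| !c.is_ascii_digit())
            .filter(|s| !s.is_empty())
            .filter_map(|s| s.parse::<u32>().ok());
        if let (Some(first), Some(second)) = (fields.next(), fields.next()) {
            if (13..=31).contains(&first) && (1..=12).contains(&second) {
                return DateOrder::DayFirst;
            }
            if (13..=31).contains(&second) && (1..=12).contains(&first) {
                return DateOrder::MonthFirst;
            }
        }
    }
    // Every date had both fields <= 12 (or was unparsable): no evidence either way.
    DateOrder::Ambiguous
}
```

Once the order is known it can be fixed for the rest of the parse, which is what lets the per-message path avoid re-deciding it. How an ambiguous log is resolved is a policy choice this sketch leaves open.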
Usage
Add this to your Cargo.toml:
[dependencies]
wc-parser = "0.1.2"
Basic Usage
use wc_parser::parse_string;
Advanced Usage with Options
use wc_parser::{parse_string, ParseStringOptions};
let options = ParseStringOptions ;
let messages = parse_string.unwrap;
Message Structure
Each parsed message contains:
// Located in `src/models.rs`
Supported Formats
This library supports various WhatsApp chat export formats including:
- Different date formats (DD/MM/YYYY, MM/DD/YYYY, YYYY/MM/DD, etc.)
- 12-hour and 24-hour time formats
- Various separators and punctuation
- Unicode characters and directional marks
- System messages and notifications
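To make the 12-hour/24-hour and directional-mark points concrete, here is a small sketch of time normalisation. It is an assumption-laden illustration rather than the library's routine: `to_24h` is a hypothetical helper, it only recognises English am/pm markers, and it ignores seconds.

```rust
/// Converts "9:05 PM", "9:05:30 pm" or "21:05" into (hour, minute) on a 24-hour clock.
/// Returns None when the input cannot be interpreted.
fn to_24h(raw: &str) -> Option<(u32, u32)> {
    // Drop directional marks: LRM, RLM and the embedding/override controls U+202A..U+202E.
    let cleaned: String = raw
        .chars()
        .filter(|c| !matches!(*c, '\u{200E}' | '\u{200F}' | '\u{202A}'..='\u{202E}'))
        .collect();
    let lower = cleaned.trim().to_lowercase();

    // Split off an optional am/pm suffix; only its first letter matters here.
    let (clock, meridiem) = if let Some(i) = lower.find(|c: char| c == 'a' || c == 'p') {
        (lower[..i].trim(), Some(lower.as_bytes()[i]))
    } else {
        (lower.as_str(), None)
    };

    let mut parts = clock.split(|c: char| c == ':' || c == '.');
    let hour: u32 = parts.next()?.trim().parse().ok()?;
    let minute: u32 = parts.next()?.trim().parse().ok()?;
    if minute > 59 {
        return None;
    }

    let hour = match meridiem {
        Some(b'a') if (1..=12).contains(&hour) => hour % 12, // 12 am -> 0
        Some(b'p') if (1..=12).contains(&hour) => hour % 12 + 12, // 12 pm -> 12
        None if hour < 24 => hour,
        _ => return None,
    };
    Some((hour, minute))
}
```

Because `trim` treats the narrow no-break space (U+202F) as whitespace, a time such as "9:05\u{202F}PM" is handled the same way as one written with an ordinary space.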