mail-parser 0.2.1

Fast and robust e-mail parsing library for Rust
# mail-parser

[![crates.io](https://img.shields.io/crates/v/mail-parser)](https://crates.io/crates/mail-parser)
[![build](https://github.com/stalwartlabs/mail-parser/actions/workflows/rust.yml/badge.svg)](https://github.com/stalwartlabs/mail-parser/actions/workflows/rust.yml)
[![docs.rs](https://img.shields.io/docsrs/mail-parser)](https://docs.rs/mail-parser)
[![crates.io](https://img.shields.io/crates/l/mail-parser)](http://www.apache.org/licenses/LICENSE-2.0)
[![Twitter Follow](https://img.shields.io/twitter/follow/stalwartlabs?style=social)](https://twitter.com/stalwartlabs)

_mail-parser_ is an **e-mail parsing library** written in Rust that fully conforms to the Internet Message Format standard (_RFC 5322_), the
Multipurpose Internet Mail Extensions (MIME; _RFC 2045 - 2049_) as well as other [internet messaging RFCs](#conformed-rfcs).

It also supports decoding messages in [41 different character sets](#supported-character-sets) including obsolete formats such as UTF-7.
All Unicode (UTF-*) and single-byte character sets are handled internally by the library while support for legacy multi-byte encodings of Chinese
and Japanese languages such as BIG5 or ISO-2022-JP is provided by the optional dependency [encoding_rs](https://crates.io/crates/encoding_rs).

In general, this library abides by Postel's law, also known as the [Robustness Principle](https://en.wikipedia.org/wiki/Robustness_principle), which
states that an implementation must be conservative in its sending behavior and liberal in its receiving behavior. This means that
_mail-parser_ will make a best effort to parse non-conformant e-mail messages as long as these do not deviate too much from the standard.
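
As a rough illustration of the "liberal in its receiving behavior" half of this principle, the sketch below splits a raw message into its header and body sections while accepting both the standard CRLF line endings and bare LF line endings. The function `split_header_body` is a hypothetical helper written for this README, not part of the _mail-parser_ API, and it assumes the input is valid UTF-8.

```rust
/// Hypothetical helper (not part of mail-parser): splits a raw message at
/// the first empty line, tolerating both "\r\n" and bare "\n" line endings.
/// Returns (headers, body); the body is empty when no blank line is found.
fn split_header_body(raw: &str) -> (&str, &str) {
    // Position and length of each candidate separator.
    let crlf = raw.find("\r\n\r\n").map(|idx| (idx, 4));
    let lf = raw.find("\n\n").map(|idx| (idx, 2));

    // Pick whichever separator appears first in the input.
    let first = match (crlf, lf) {
        (Some(a), Some(b)) => Some(if a.0 <= b.0 { a } else { b }),
        (a, b) => a.or(b),
    };

    match first {
        Some((idx, len)) => (&raw[..idx], &raw[idx + len..]),
        // No blank line at all: treat everything as headers.
        None => (raw, ""),
    }
}
```

A strict parser would reject the bare LF form; a lenient one treats both the same way and leaves it to the caller to decide whether the deviation matters.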

Unlike other e-mail parsing libraries that return nested representations of the different MIME parts in a message, this library 
conforms to [RFC 8621, Section 4.1.4](https://datatracker.ietf.org/doc/html/rfc8621#section-4.1.4) and provides a more human-friendly
representation of the message contents consisting of just text body parts, html body parts and attachments. Additionally, conversion to/from
HTML and plain text inline body parts is done automatically when the _alternative_ version is missing.
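
To make the idea of a flattened representation more concrete, here is a deliberately simplified sketch. The `Part` and `FlatMessage` types and the `flatten` function are hypothetical names invented for this illustration and are not the types exposed by _mail-parser_. The sketch also ignores the `multipart/alternative` and inline/attachment rules that RFC 8621 describes; it only shows how a nested tree can be walked into three flat lists.

```rust
/// Hypothetical nested MIME tree, as a nested-style parser might return it.
#[derive(Debug)]
enum Part {
    Text(String),
    Html(String),
    Binary(Vec<u8>),
    Multipart(Vec<Part>),
}

/// Hypothetical flat view: text bodies, HTML bodies and attachments.
#[derive(Debug, Default)]
struct FlatMessage {
    text_body: Vec<String>,
    html_body: Vec<String>,
    attachments: Vec<Vec<u8>>,
}

/// Walks the tree depth-first and appends every leaf to the matching list.
fn flatten(part: &Part, out: &mut FlatMessage) {
    match part {
        Part::Text(text) => out.text_body.push(text.clone()),
        Part::Html(html) => out.html_body.push(html.clone()),
        Part::Binary(bytes) => out.attachments.push(bytes.clone()),
        Part::Multipart(children) => {
            for child in children {
                flatten(child, out);
            }
        }
    }
}
```

With a view like this, a caller can ask for "the first text body" or "the first attachment" by index instead of walking the MIME tree by hand.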

Performance and memory safety were two important factors while designing _mail-parser_:

- **Zero-copy**: Practically all strings returned by this library are `Cow<str>` references to the input raw message.
- **High performance Base64 decoding** based on Chromium's decoder ([the fastest non-SIMD decoder](https://github.com/lemire/fastbase64)).
- **Fast parsing** of message header fields, character set names and HTML entities using [perfect hashing](https://en.wikipedia.org/wiki/Perfect_hash_function).
- Written in **100% safe** Rust with no required external dependencies.
- Every function in the library has been [fuzzed](#testing-fuzzing--benchmarking) and
  meticulously [tested with MIRI](#testing-fuzzing--benchmarking).
- Thoroughly **battle-tested** with millions of real-world e-mail messages dating from 1995 until today.
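
The zero-copy point can be illustrated with the standard library's `Cow<str>` type. The sketch below is not taken from the library's source; `unfold_header` is a hypothetical helper that borrows from the input when the value can be returned unchanged and allocates only when folded line breaks have to be removed.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of mail-parser): unfolds a header value.
/// Returns a borrowed slice when nothing needs to change and an owned
/// String only when line breaks have to be stripped.
fn unfold_header(raw: &str) -> Cow<'_, str> {
    if raw.contains('\r') || raw.contains('\n') {
        // Folding present: copy the value without the line-break characters.
        Cow::Owned(raw.chars().filter(|c| *c != '\r' && *c != '\n').collect())
    } else {
        // Nothing to rewrite: hand back a reference into the original buffer.
        Cow::Borrowed(raw)
    }
}
```

Callers can use the result as a `&str` in both cases, so the allocation in the slow path stays an implementation detail.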

## Usage Example

```rust
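    // Assumes the crate's public types are in scope, for example via
    // `use mail_parser::*;` (exact import paths may differ between versions).
    let input = concat!(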
        "From: Art Vandelay <art@vandelay.com> (Vandelay Industries)\n",
        "To: \"Colleagues\": \"James Smythe\" <james@vandelay.com>; Friends:\n",
        "    jane@example.com, =?UTF-8?Q?John_Sm=C3=AEth?= <john@example.com>;\n",
        "Date: Sat, 20 Nov 2021 14:22:01 -0800\n",
        "Subject: Why not both importing AND exporting? =?utf-8?b?4pi6?=\n",
        "Content-Type: multipart/mixed; boundary=\"festivus\";\n\n",
        "--festivus\n",
        "Content-Type: text/html; charset=\"us-ascii\"\n",
        "Content-Transfer-Encoding: base64\n\n",
        "PGh0bWw+PHA+SSB3YXMgdGhpbmtpbmcgYWJvdXQgcXVpdHRpbmcgdGhlICZsZHF1bztle\n",
        "HBvcnRpbmcmcmRxdW87IHRvIGZvY3VzIGp1c3Qgb24gdGhlICZsZHF1bztpbXBvcnRpbm\n",
        "cmcmRxdW87LDwvcD48cD5idXQgdGhlbiBJIHRob3VnaHQsIHdoeSBub3QgZG8gYm90aD8\n",
        "gJiN4MjYzQTs8L3A+PC9odG1sPg==\n",
        "--festivus\n",
        "Content-Type: message/rfc822\n\n",
        "From: \"Cosmo Kramer\" <kramer@kramerica.com>\n",
        "Subject: Exporting my book about coffee tables\n",
        "Content-Type: multipart/mixed; boundary=\"giddyup\";\n\n",
        "--giddyup\n",
        "Content-Type: text/plain; charset=\"utf-16\"\n",
        "Content-Transfer-Encoding: quoted-printable\n\n",
        "=FF=FE=0C!5=D8\"=DD5=D8)=DD5=D8-=DD =005=D8*=DD5=D8\"=DD =005=D8\"=\n",
        "=DD5=D85=DD5=D8-=DD5=D8,=DD5=D8/=DD5=D81=DD =005=D8*=DD5=D86=DD =\n",
        "=005=D8=1F=DD5=D8,=DD5=D8,=DD5=D8(=DD =005=D8-=DD5=D8)=DD5=D8\"=\n",
        "=DD5=D8=1E=DD5=D80=DD5=D8\"=DD!=00\n",
        "--giddyup\n",
        "Content-Type: image/gif; name*1=\"about \"; name*0=\"Book \";\n",
        "              name*2*=utf-8''%e2%98%95 tables.gif\n",
        "Content-Transfer-Encoding: Base64\n",
        "Content-Disposition: attachment\n\n",
        "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\n",
        "--giddyup--\n",
        "--festivus--\n",
    )
    .as_bytes();

    let message = Message::parse(input);

    // Parses addresses (including comments), lists and groups
    assert_eq!(
        message.get_from(),
        &Address::Address(Addr {
            name: Some("Art Vandelay (Vandelay Industries)".into()),
            address: Some("art@vandelay.com".into())
        })
    );
    assert_eq!(
        message.get_to(),
        &Address::GroupList(vec![
            Group {
                name: Some("Colleagues".into()),
                addresses: vec![Addr {
                    name: Some("James Smythe".into()),
                    address: Some("james@vandelay.com".into())
                }]
            },
            Group {
                name: Some("Friends".into()),
                addresses: vec![
                    Addr {
                        name: None,
                        address: Some("jane@example.com".into())
                    },
                    Addr {
                        name: Some("John Smîth".into()),
                        address: Some("john@example.com".into())
                    }
                ]
            }
        ])
    );

    assert_eq!(
        message.get_date().unwrap().to_iso8601(),
        "2021-11-20T14:22:01-08:00"
    );

    // RFC2047 support for encoded text in message headers
    assert_eq!(
        message.get_subject().unwrap(),
        "Why not both importing AND exporting? ☺"
    );

    // HTML and text body parts are returned conforming to RFC8621, Section 4.1.4 
    assert_eq!(
        message.get_html_body(0).unwrap().to_string(),
        concat!(
            "<html><p>I was thinking about quitting the &ldquo;exporting&rdquo; to ",
            "focus just on the &ldquo;importing&rdquo;,</p><p>but then I thought,",
            " why not do both? &#x263A;</p></html>"
        )
    );

    // HTML parts are converted to plain text (and vice versa) when missing
    assert_eq!(
        message.get_text_body(0).unwrap().to_string(),
        concat!(
            "I was thinking about quitting the “exporting” to focus just on the",
            " “importing”,\nbut then I thought, why not do both? ☺\n"
        )
    );

    // Supports nested messages as well as multipart/digest
    let nested_message = match message.get_attachment(0).unwrap() {
        MessagePart::Message(v) => v,
        _ => unreachable!(),
    };

    assert_eq!(
        nested_message.get_subject().unwrap(),
        "Exporting my book about coffee tables"
    );

    // Handles UTF-* as well as many legacy encodings
    assert_eq!(
        nested_message.get_text_body(0).unwrap().to_string(),
        "ℌ𝔢𝔩𝔭 𝔪𝔢 𝔢𝔵𝔭𝔬𝔯𝔱 𝔪𝔶 𝔟𝔬𝔬𝔨 𝔭𝔩𝔢𝔞𝔰𝔢!"
    );
    assert_eq!(
        nested_message.get_html_body(0).unwrap().to_string(),
        "<html><body>ℌ𝔢𝔩𝔭 𝔪𝔢 𝔢𝔵𝔭𝔬𝔯𝔱 𝔪𝔶 𝔟𝔬𝔬𝔨 𝔭𝔩𝔢𝔞𝔰𝔢!</body></html>"
    );

    let nested_attachment = match nested_message.get_attachment(0).unwrap() {
        MessagePart::Binary(v) => v,
        _ => unreachable!(),
    };

    assert_eq!(nested_attachment.len(), 42);

    // Full RFC2231 support for continuations and character sets
    assert_eq!(
        nested_attachment
            .get_header()
            .unwrap()
            .get_content_type()
            .unwrap()
            .get_attribute("name")
            .unwrap(),
        "Book about ☕ tables.gif"
    );

    // Integrates with Serde
    println!("{}", serde_json::to_string_pretty(&message).unwrap());
    println!("{}", serde_yaml::to_string(&message).unwrap());
```
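
For readers unfamiliar with RFC 2047, the encoded word `=?UTF-8?Q?John_Sm=C3=AEth?=` in the `To` header above is what becomes `John Smîth`. The following sketch shows one way such a word could be decoded by hand. `decode_q_word` is a hypothetical function written for this README, not the library's implementation, and it assumes the UTF-8 charset and the "Q" encoding only.

```rust
/// Hypothetical helper: decodes a single RFC 2047 encoded word that uses
/// the UTF-8 charset and the "Q" encoding. Returns None for anything else
/// (other charsets, the "B" encoding, malformed escapes, invalid UTF-8).
fn decode_q_word(word: &str) -> Option<String> {
    // An encoded word looks like "=?charset?encoding?text?=".
    let inner = word.strip_prefix("=?")?.strip_suffix("?=")?;
    let mut fields = inner.splitn(3, '?');
    let charset = fields.next()?;
    let encoding = fields.next()?;
    let text = fields.next()?;
    if !charset.eq_ignore_ascii_case("utf-8") || !encoding.eq_ignore_ascii_case("q") {
        return None;
    }

    let bytes = text.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // In the "Q" encoding an underscore stands for a space.
            b'_' => {
                out.push(b' ');
                i += 1;
            }
            // "=XX" is one byte written as two hexadecimal digits.
            b'=' => {
                let hex = bytes.get(i + 1..i + 3)?;
                let hi = (hex[0] as char).to_digit(16)?;
                let lo = (hex[1] as char).to_digit(16)?;
                out.push((hi * 16 + lo) as u8);
                i += 3;
            }
            other => {
                out.push(other);
                i += 1;
            }
        }
    }
    String::from_utf8(out).ok()
}
```

A full decoder would also handle the "B" (Base64) encoding and convert from charsets other than UTF-8.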

## Testing, Fuzzing & Benchmarking

To run the test suite:

```bash
 $ cargo test --all-features
```

or, to run the test suite with MIRI:

```bash
 $ cargo +nightly miri test --all-features
```

To fuzz the library with `cargo-fuzz`:

```bash
 $ cargo +nightly fuzz run mail_parser
```

and, to run the benchmarks:

```bash
 $ cargo +nightly bench --all-features
```

## Conformed RFCs

- [RFC 822 - Standard for ARPA Internet Text Messages](https://datatracker.ietf.org/doc/html/rfc822)
- [RFC 5322 - Internet Message Format](https://datatracker.ietf.org/doc/html/rfc5322)
- [RFC 2045 - Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies](https://datatracker.ietf.org/doc/html/rfc2045)
- [RFC 2046 - Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types](https://datatracker.ietf.org/doc/html/rfc2046)
- [RFC 2047 - MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text](https://datatracker.ietf.org/doc/html/rfc2047)
- [RFC 2048 - Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures](https://datatracker.ietf.org/doc/html/rfc2048)
- [RFC 2049 - Multipurpose Internet Mail Extensions (MIME) Part Five: Conformance Criteria and Examples](https://datatracker.ietf.org/doc/html/rfc2049)
- [RFC 2231 - MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations](https://datatracker.ietf.org/doc/html/rfc2231)
- [RFC 2183 - Communicating Presentation Information in Internet Messages: The Content-Disposition Header Field](https://datatracker.ietf.org/doc/html/rfc2183)
- [RFC 6532 - Internationalized Email Headers](https://datatracker.ietf.org/doc/html/rfc6532)
- [RFC 2152 - UTF-7 - A Mail-Safe Transformation Format of Unicode](https://datatracker.ietf.org/doc/html/rfc2152)
- [RFC 2369 - The Use of URLs as Meta-Syntax for Core Mail List Commands and their Transport through Message Header Fields](https://datatracker.ietf.org/doc/html/rfc2369)
- [RFC 2919 - List-Id: A Structured Field and Namespace for the Identification of Mailing Lists](https://datatracker.ietf.org/doc/html/rfc2919)
- [RFC 8621 - The JSON Meta Application Protocol (JMAP) for Mail (Section 4.1.4)](https://datatracker.ietf.org/doc/html/rfc8621#section-4.1.4)

## Supported Character Sets

- UTF-8
- UTF-16, UTF-16BE, UTF-16LE
- UTF-7
- US-ASCII
- ISO-8859-1 
- ISO-8859-2 
- ISO-8859-3 
- ISO-8859-4 
- ISO-8859-5 
- ISO-8859-6 
- ISO-8859-7 
- ISO-8859-8 
- ISO-8859-9 
- ISO-8859-10 
- ISO-8859-13 
- ISO-8859-14 
- ISO-8859-15 
- ISO-8859-16
- CP1250 
- CP1251 
- CP1252 
- CP1253 
- CP1254 
- CP1255 
- CP1256 
- CP1257 
- CP1258
- KOI8-R
- KOI8-U
- MACINTOSH
- IBM850
- TIS-620
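
Single-byte character sets such as the ones listed above can be decoded without an external dependency because each input byte maps to exactly one Unicode character. The sketch below is an illustration under that assumption, not the library's code: `decode_iso_8859_1` and `decode_with_table` are hypothetical names. ISO-8859-1 is the simplest case because every byte value equals its Unicode code point; other single-byte sets share the ASCII half and need a 128-entry lookup table for bytes `0x80` to `0xFF`.

```rust
/// Hypothetical helper: decodes ISO-8859-1 (Latin-1) bytes. Each byte value
/// is also the Unicode code point of the character it represents.
fn decode_iso_8859_1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical helper for other single-byte sets: bytes below 0x80 are
/// treated as ASCII, the rest are looked up in a caller-supplied table that
/// covers 0x80..=0xFF.
fn decode_with_table(bytes: &[u8], high_half: &[char; 128]) -> String {
    bytes
        .iter()
        .map(|&b| {
            if b < 0x80 {
                b as char
            } else {
                high_half[usize::from(b - 0x80)]
            }
        })
        .collect()
}
```

The tables themselves are omitted here; each character set would supply its own 128 entries.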
  
Supported character sets via the optional dependency [encoding_rs](https://crates.io/crates/encoding_rs):
  
- SHIFT_JIS
- BIG5
- EUC-JP 
- EUC-KR 
- GB18030
- GBK
- ISO-2022-JP 
- WINDOWS-874
- IBM-866

## License

Licensed under either of

 * Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

## Copyright

Copyright (C) 2020-2022, Stalwart Labs, Minter Ltd.

See [COPYING] for the license.

[COPYING]: https://github.com/stalwartlabs/mail-parser/blob/main/COPYING