Crate encoding_rs

encoding_rs is a Gecko-oriented Free Software / Open Source implementation of the Encoding Standard in Rust. Gecko-oriented means that converting to and from UTF-16 is supported in addition to converting to and from UTF-8, that the performance and streamability goals are browser-oriented, and that FFI-friendliness is a goal.

Additionally, the mem module provides functions that are useful for applications that need to be able to deal with legacy in-memory representations of Unicode.


The code is available under the Apache license, Version 2.0 or the MIT license, at your option. See the COPYRIGHT file for details. The repository is on GitHub. The crate is available on


Example programs:

Decode using the non-streaming API:

use encoding_rs::*;

let expectation = "\u{30CF}\u{30ED}\u{30FC}\u{30FB}\u{30EF}\u{30FC}\u{30EB}\u{30C9}";
let bytes = b"\x83n\x83\x8D\x81[\x81E\x83\x8F\x81[\x83\x8B\x83h";

let (cow, encoding_used, had_errors) = SHIFT_JIS.decode(bytes);
assert_eq!(&cow[..], expectation);
assert_eq!(encoding_used, SHIFT_JIS);
assert!(!had_errors);

Decode using the streaming API with minimal unsafe:

use encoding_rs::*;

let expectation = "\u{30CF}\u{30ED}\u{30FC}\u{30FB}\u{30EF}\u{30FC}\u{30EB}\u{30C9}";

// Use an array of byte slices to demonstrate content arriving piece by
// piece from the network.
let bytes: [&'static [u8]; 4] = [b"\x83",
                                 b"n\x83\x8D\x81",
                                 b"[\x81E\x83\x8F\x81[\x83",
                                 b"\x8B\x83h"];

// Very short output buffer to demonstrate the output buffer getting full.
// Normally, you'd use something like `[0u8; 2048]`.
let mut buffer_bytes = [0u8; 8];
// Rust doesn't allow us to stack-allocate a `mut str` without `unsafe`.
let mut buffer: &mut str = unsafe {
    std::mem::transmute(&mut buffer_bytes[..])
};

// How many bytes in the buffer currently hold significant data.
let mut bytes_in_buffer = 0usize;

// Collect the output to a string for demonstration purposes.
let mut output = String::new();

// The `Decoder`
let mut decoder = SHIFT_JIS.new_decoder();

// Track whether we see errors.
let mut total_had_errors = false;

// Decode using a fixed-size intermediate buffer (for demonstrating the
// use of a fixed-size buffer; normally when the output of an incremental
// decode goes to a `String` one would use `Decoder.decode_to_string()` to
// avoid the intermediate buffer).
for input in &bytes[..] {
    // The number of bytes already read from current `input` in total.
    let mut total_read_from_current_input = 0usize;

    loop {
        let (result, read, written, had_errors) =
            decoder.decode_to_str(&input[total_read_from_current_input..],
                                  &mut buffer[bytes_in_buffer..],
                                  false);
        total_read_from_current_input += read;
        bytes_in_buffer += written;
        total_had_errors |= had_errors;
        match result {
            CoderResult::InputEmpty => {
                // We have consumed the current input buffer. Break out of
                // the inner loop to get the next input buffer from the
                // outer loop.
                break;
            }
            CoderResult::OutputFull => {
                // Write the current buffer out and consider the buffer
                // empty.
                output.push_str(&buffer[..bytes_in_buffer]);
                bytes_in_buffer = 0usize;
                continue;
            }
        }
    }
}

// Process EOF
loop {
    let (result, _, written, had_errors) =
        decoder.decode_to_str(b"",
                              &mut buffer[bytes_in_buffer..],
                              true);
    bytes_in_buffer += written;
    total_had_errors |= had_errors;
    // Write the current buffer out and consider the buffer empty.
    // Need to do this here for both `match` arms, because we exit the
    // loop on `CoderResult::InputEmpty`.
    output.push_str(&buffer[..bytes_in_buffer]);
    bytes_in_buffer = 0usize;
    match result {
        CoderResult::InputEmpty => {
            // Done!
            break;
        }
        CoderResult::OutputFull => {
            // The buffer was flushed above; go around again.
            continue;
        }
    }
}

assert_eq!(&output[..], expectation);
assert!(!total_had_errors);

Web / Browser Focus

Both in terms of scope and performance, the focus is on the Web. For scope, this means that encoding_rs implements the Encoding Standard fully and doesn't implement encodings that are not specified in the Encoding Standard. For performance, this means that decoding performance is important as well as performance for encoding into UTF-8 or encoding the Basic Latin range (ASCII) into legacy encodings. Non-Basic Latin needs to be encoded into legacy encodings in only two places in the Web platform: in the query part of URLs, in which case it's a matter of relatively rare error handling, and in form submission, in which case the user action and networking tend to hide the performance of the encoder.

Deemphasizing performance of encoding non-Basic Latin text into legacy encodings enables smaller code size thanks to the encoder side using the decode-optimized data tables without having encode-optimized data tables at all. Even in decoders, smaller lookup table size is preferred over avoiding multiplication operations.
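
The trade-off above can be sketched with a toy single-byte encoding. The table contents and the names `HIGH_HALF`, `decode_byte` and `encode_char` are hypothetical and do not reflect how encoding_rs lays out its data; the sketch only shows decoding as a direct table index and encoding as a linear search of that same table.

```rust
/// Toy decode table for bytes 0x80..=0x83 of a made-up single-byte
/// encoding. (Hypothetical data chosen for illustration only.)
const HIGH_HALF: [char; 4] = ['\u{0410}', '\u{0411}', '\u{0412}', '\u{0413}'];

/// Decoding is a direct index into the table.
fn decode_byte(b: u8) -> Option<char> {
    if b < 0x80 {
        Some(b as char)
    } else {
        HIGH_HALF.get((b - 0x80) as usize).copied()
    }
}

/// Encoding reuses the same table via linear search: slower for
/// non-ASCII input, but no second, encode-optimized table is needed.
fn encode_char(c: char) -> Option<u8> {
    if c.is_ascii() {
        Some(c as u8)
    } else {
        HIGH_HALF
            .iter()
            .position(|&x| x == c)
            .map(|i| 0x80 + i as u8)
    }
}
```

The ASCII fast path in `encode_char` never touches the table, which mirrors the stated priority of encoding Basic Latin quickly.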

Additionally, performance is a non-goal for the ASCII-incompatible ISO-2022-JP and UTF-16 encodings, which are rarely used on the Web. For clarity, this means that performance is a non-goal for UTF-16 as used on the wire as an interchange encoding (UTF-16 on the [u8] side of the API). Good performance for UTF-16 used as an in-RAM Unicode representation (UTF-16 on the [u16] side of the API) is a goal.

Despite the focus on the Web, encoding_rs may well be useful for decoding email, although you'll need to implement UTF-7 decoding and label handling by other means. (Due to the Web focus, patches to add UTF-7 are unwelcome in encoding_rs itself.) Also, despite the browser focus, the hope is that non-browser applications that wish to consume Web content or submit Web forms in a Web-compatible way will find encoding_rs useful.

Streaming & Non-Streaming; Rust & C/C++

The API in Rust has two modes of operation: streaming and non-streaming. The streaming API is the foundation of the implementation and should be used when processing data that arrives piecemeal from an I/O stream. The streaming API has an FFI wrapper (as a separate crate) that exposes it to C callers. The non-streaming part of the API is for Rust callers only and is smart about borrowing instead of copying when possible. When streamability is not needed, the non-streaming API should be preferred in order to avoid copying data when a borrow suffices.
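
The "borrow instead of copy" idea can be illustrated with the standard library alone. `String::from_utf8_lossy` is not part of encoding_rs, but it returns a `Cow<str>` that borrows the input when no conversion work is needed, which is the same pattern. The helper name `decodes_without_copy` is hypothetical.

```rust
use std::borrow::Cow;

/// Returns true if decoding `bytes` as UTF-8 could borrow the input
/// instead of copying it. Uses the standard library's lossy UTF-8
/// conversion as a stand-in for a non-streaming decode.
fn decodes_without_copy(bytes: &[u8]) -> bool {
    match String::from_utf8_lossy(bytes) {
        Cow::Borrowed(_) => true,
        Cow::Owned(_) => false,
    }
}
```

Valid UTF-8 input is borrowed; input containing a malformed byte forces an owned copy, because a REPLACEMENT CHARACTER has to be written somewhere.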

There is no analogous C API for the non-streaming mode exposed via FFI, mainly because C doesn't have standard types for growable byte buffers and Unicode strings that know their length.

The C API (header file generated at target/include/encoding_rs.h when building encoding_rs) can, in turn, be wrapped for use from C++. Such a C++ wrapper can re-create the non-streaming API in C++ for C++ callers. The C binding comes with a C++14 wrapper that uses standard library + GSL types and that recreates the non-streaming API in C++ on top of the streaming API. A C++ wrapper with XPCOM/MFBT types is being developed as part of Mozilla bug 1261841.

The Encoding type is common to both the streaming and non-streaming modes. In the streaming mode, decoding operations are performed with a Decoder and encoding operations with an Encoder object obtained via Encoding. In the non-streaming mode, decoding and encoding operations are performed using methods on Encoding objects themselves, so the Decoder and Encoder objects are not used at all.

Memory management

The streaming mode never performs heap allocations (even the methods that write into a Vec<u8> or a String by taking them as arguments do not reallocate the backing buffer of the Vec<u8> or the String). That is, the streaming mode uses caller-allocated buffers exclusively.

The methods of the non-streaming mode that return a Vec<u8> or a String perform heap allocations but only to allocate the backing buffer of the Vec<u8> or the String.

Encoding is always statically allocated. Decoder and Encoder need no Drop cleanup.
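
As a rough illustration of writing into a caller-provided `String` without reallocating its backing buffer, the following sketch appends only as much as fits in the capacity the caller already reserved. The function name `append_within_capacity` is hypothetical and this is not the encoding_rs implementation.

```rust
/// Appends as much of `src` to `dst` as fits in the capacity `dst`
/// already has, never splitting a character and never reallocating.
/// Returns the number of bytes of `src` that were consumed.
fn append_within_capacity(dst: &mut String, src: &str) -> usize {
    let free = dst.capacity() - dst.len();
    // Find the largest char boundary in `src` that is <= `free`.
    let mut end = free.min(src.len());
    while !src.is_char_boundary(end) {
        end -= 1;
    }
    dst.push_str(&src[..end]);
    end
}
```

The caller stays in control of allocation: it reserves space up front and calls again with the unconsumed remainder after making more room.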

Buffer reading and writing behavior

Based on experience gained with the java.nio.charset encoding converter API and with the Gecko uconv encoding converter API, the buffer reading and writing behaviors of encoding_rs are asymmetric: input buffers are fully drained but output buffers are not always fully filled.

When reading from an input buffer, encoding_rs always consumes all input up to the next error or to the end of the buffer. In particular, when decoding, even if the input buffer ends in the middle of a byte sequence for a character, the decoder consumes all input. This has the benefit that the caller of the API can always fill the next buffer from the start, from whatever source the bytes come from, and never has to first copy the last bytes of the previous buffer to the start of the next buffer. However, when encoding, the UTF-8 input buffers have to end at a character boundary, which is a requirement for the Rust str type anyway, and UTF-16 input buffer boundaries falling in the middle of a surrogate pair result in both surrogates being treated individually as unpaired surrogates.
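
The effect of a UTF-16 buffer boundary falling inside a surrogate pair can be shown with a standard-library stand-in. `String::from_utf16_lossy` is not the encoding_rs API, and the helper names below are hypothetical; the sketch only shows what "treated individually as unpaired surrogates" means for the result.

```rust
/// Converts one UTF-16 buffer on its own, the way a caller would if it
/// split its input at arbitrary positions. Unpaired surrogates become
/// U+FFFD. (Standard-library stand-in, not the encoding_rs API.)
fn convert_chunk(chunk: &[u16]) -> String {
    String::from_utf16_lossy(chunk)
}

/// Converts each chunk independently and concatenates the results.
fn convert_chunks(chunks: &[&[u16]]) -> String {
    chunks.iter().map(|c| convert_chunk(c)).collect()
}
```

Passing the pair 0xD83D 0xDE00 in one chunk yields U+1F600, while passing the two code units in separate chunks yields two REPLACEMENT CHARACTERs, so callers should split UTF-16 input between pairs.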

Additionally, decoders guarantee that they can be fed even one byte at a time and encoders guarantee that they can be fed even one code point at a time. This has the benefit of not placing restrictions on the size of the chunks in which the content arrives, e.g. from the network.
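
A decoder that can be fed one byte at a time has to carry partial sequences in its own state. The following is a minimal sketch under simplifying assumptions: `ToyDecoder` is a hypothetical type that decodes little-endian 16-bit code units, replaces surrogates with U+FFFD, and has no end-of-stream handling. It is not how encoding_rs is implemented.

```rust
/// Toy streaming decoder for little-endian 16-bit code units.
/// Surrogates are replaced with U+FFFD to keep the sketch short.
struct ToyDecoder {
    /// First byte of a code unit whose second byte has not arrived yet.
    pending: Option<u8>,
}

impl ToyDecoder {
    fn new() -> Self {
        ToyDecoder { pending: None }
    }

    /// Consumes all of `src`, even if it ends in the middle of a code
    /// unit, and appends the decoded characters to `dst`.
    fn decode(&mut self, src: &[u8], dst: &mut String) {
        for &b in src {
            match self.pending.take() {
                None => self.pending = Some(b),
                Some(lo) => {
                    let unit = u16::from_le_bytes([lo, b]) as u32;
                    dst.push(char::from_u32(unit).unwrap_or('\u{FFFD}'));
                }
            }
        }
    }
}
```

Because the half-received code unit lives in `pending`, feeding the same bytes in one call or one byte per call produces the same output, and the caller never has to carry leftover bytes over to the next buffer.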

When writing into an output buffer, encoding_rs makes sure that the code unit sequence for a character is never split across output buffer boundaries. This may result in wasted space at the end of an output buffer, but it has two advantages: the output side of both decoders and encoders is greatly simplified compared to designs that attempt to fill output buffers exactly even when that entails splitting a code unit sequence, and when encoding_rs methods return to the caller, the output produced thus far is always valid taken as a whole. (In the case of encoding to ISO-2022-JP, the output needs to be considered as a whole, because the latest output buffer taken alone might not be valid if the transition away from the ASCII state occurred in an earlier output buffer. However, since the ISO-2022-JP decoder doesn't treat streams that don't end in the ASCII state as being in error despite the encoder generating a transition to the ASCII state at the end, the claim about the partial output taken as a whole being valid is true even for ISO-2022-JP.)
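
The output-side rule can be sketched as follows. The function name `write_whole_chars` is hypothetical; it writes UTF-8 into a fixed-size buffer and stops before any character that would not fit entirely, which is the behavior described above rather than the encoding_rs code itself.

```rust
/// Writes characters from `src` into `dst` as UTF-8, stopping before any
/// character that would not fit entirely. Returns (bytes read from
/// `src`, bytes written to `dst`).
fn write_whole_chars(src: &str, dst: &mut [u8]) -> (usize, usize) {
    let mut read = 0;
    let mut written = 0;
    for c in src.chars() {
        let len = c.len_utf8();
        if dst.len() - written < len {
            // Not enough room: leave the tail of `dst` unused rather
            // than splitting the character.
            break;
        }
        c.encode_utf8(&mut dst[written..]);
        read += len;
        written += len;
    }
    (read, written)
}
```

With an 8-byte buffer and three 3-byte characters, only two characters are written and the last two bytes of the buffer stay unused, but the six bytes written are valid UTF-8 on their own.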

Error Reporting

Based on experience gained with the java.nio.charset encoding converter API and with the Gecko uconv encoding converter API, the error reporting behaviors of encoding_rs are asymmetric: decoder errors include offsets that leave it up to the caller to extract the erroneous bytes from the input stream if the caller wishes to do so but encoder errors provide the code point associated with the error without requiring the caller to extract it from the input on its own.

On the encoder side, an error is always triggered by the most recently pushed Unicode scalar value, which makes it simple to pass the char to the caller. Also, it's very typical for the caller to wish to do something with this data: generate a numeric escape for the character. Additionally, the ISO-2022-JP encoder reports U+FFFD instead of the actual input character in certain cases, so requiring the caller to extract the character from the input buffer would require the caller to handle ISO-2022-JP details. Furthermore, requiring the caller to extract the character from the input buffer would require the caller to implement UTF-8 or UTF-16 math, which is the job of an encoding conversion library.

On the decoder side, errors are triggered in more complex ways. For example, when decoding the sequence ESC, '$', buffer boundary, 'A' as ISO-2022-JP, the ESC byte is in error, but this is discovered only after the buffer boundary when processing 'A'. Thus, the bytes in error might not be the ones most recently pushed to the decoder and the error might not even be in the current buffer.

Some encoding conversion APIs address the problem by not acknowledging trailing bytes of an input buffer as consumed if it's still possible for future bytes to cause the trailing bytes to be in error. This way, error reporting can always refer to the most recently pushed buffer. This has the problem that the caller of the API has to copy the unconsumed trailing bytes to the start of the next buffer before being able to fill the rest of the next buffer. This is annoying, error-prone and inefficient.

A possible solution would be making the decoder remember recently consumed bytes in order to be able to include a copy of the erroneous bytes when reporting an error. This has two problems: first, callers are rarely interested in the erroneous bytes, so attempts to identify them are most often just overhead anyway. Second, the rare applications that are interested typically care about the location of the error in the input stream.

To keep the API convenient for common uses and the overhead low while making it possible to develop applications, such as HTML validators, that care about which bytes were in error, encoding_rs reports the length of the erroneous sequence and the number of bytes consumed after the erroneous sequence. As long as the caller doesn't discard the 6 most recent bytes, this makes it possible for callers that care about the erroneous bytes to locate them.
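
The arithmetic a caller would do with those two numbers can be sketched as below. The function name `malformed_range` and its parameter names are hypothetical; the sketch assumes the caller also tracks the total number of bytes it has fed to the decoder so far.

```rust
use std::ops::Range;

/// Given the total number of bytes consumed from the stream so far, the
/// length of the erroneous sequence, and the number of bytes consumed
/// after it, returns the stream offsets of the erroneous bytes.
/// Returns `None` if the numbers are inconsistent.
fn malformed_range(
    total_consumed: usize,
    malformed_len: u8,
    consumed_after: u8,
) -> Option<Range<usize>> {
    let end = total_consumed.checked_sub(consumed_after as usize)?;
    let start = end.checked_sub(malformed_len as usize)?;
    Some(start..end)
}
```

For example, if ten bytes have been consumed in total, the erroneous sequence is one byte long and two bytes were consumed after it, the erroneous byte sits at stream offset 7. The range can start before the current buffer, which is why the caller needs to keep the most recent bytes around.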

No Convenience API for Custom Replacements

The Web Platform and, therefore, the Encoding Standard supports only one error recovery mode for decoders and only one error recovery mode for encoders. The supported error recovery mode for decoders is emitting the REPLACEMENT CHARACTER on error. The supported error recovery mode for encoders is emitting an HTML decimal numeric character reference for unmappable characters.

Since encoding_rs is Web-focused, these are the only error recovery modes for which convenient support is provided. Moreover, on the decoder side, there aren't really good alternatives to emitting the REPLACEMENT CHARACTER on error (other than treating errors as fatal). In particular, simply ignoring errors is a security problem, so it would be a bad idea for encoding_rs to provide a mode that encouraged callers to ignore errors.

On the encoder side, there are plausible alternatives to HTML decimal numeric character references. For example, when outputting CSS, CSS-style escapes would seem to make sense. However, instead of facilitating the output of CSS, JS, etc. in non-UTF-8 encodings, encoding_rs takes the design position that you shouldn't generate output in encodings other than UTF-8, except where backward compatibility with interacting with the legacy Web requires it. The legacy Web requires it only when parsing the query strings of URLs and when submitting forms, and those two both use HTML decimal numeric character references.

While encoding_rs doesn't make encoder replacements other than HTML decimal numeric character references easy, it does make them possible. encode_from_utf8(), which emits HTML decimal numeric character references for unmappable characters, is implemented on top of encode_from_utf8_without_replacement(). Applications that really, really want other replacement schemes for unmappable characters can likewise implement them on top of encode_from_utf8_without_replacement().
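
The layering described above can be sketched with a toy encoder. Everything here is hypothetical (`ToyResult`, the ASCII-only `encode_without_replacement`, `encode_with`, `ncr`, `css_escape`) and stands in for the real encoding_rs methods only in shape: a lower layer reports the unmappable character, and a driver decides what to emit for it.

```rust
/// Outcome of the toy encoder below (hypothetical type).
enum ToyResult {
    /// All input was encoded.
    Done,
    /// The wrapped character cannot be represented; the wrapped `usize`
    /// is how many input bytes were consumed, including that character.
    Unmappable(char, usize),
}

/// Toy "encoding" that can represent only ASCII. Stops at the first
/// unmappable character and reports it instead of replacing it.
fn encode_without_replacement(src: &str, dst: &mut Vec<u8>) -> ToyResult {
    for (i, c) in src.char_indices() {
        if c.is_ascii() {
            dst.push(c as u8);
        } else {
            return ToyResult::Unmappable(c, i + c.len_utf8());
        }
    }
    ToyResult::Done
}

/// Generic driver: calls `escape` for every unmappable character.
fn encode_with<F: Fn(char) -> String>(src: &str, escape: F) -> Vec<u8> {
    let mut dst = Vec::new();
    let mut rest = src;
    loop {
        match encode_without_replacement(rest, &mut dst) {
            ToyResult::Done => return dst,
            ToyResult::Unmappable(c, consumed) => {
                dst.extend_from_slice(escape(c).as_bytes());
                rest = &rest[consumed..];
            }
        }
    }
}

/// HTML decimal numeric character reference, e.g. `&#233;`.
fn ncr(c: char) -> String {
    format!("&#{};", c as u32)
}

/// CSS-style hexadecimal escape, e.g. `\e9 `.
fn css_escape(c: char) -> String {
    format!("\\{:x} ", c as u32)
}
```

Swapping `ncr` for `css_escape` changes the replacement scheme without touching the lower layer, which is the kind of application-side extension the paragraph above has in mind.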

No Extensibility by Design

The set of encodings supported by encoding_rs is not extensible by design. That is, Encoding, Decoder and Encoder are intentionally structs rather than traits. encoding_rs takes the design position that all future text interchange should be done using UTF-8, which can represent all of Unicode. (It is, in fact, the only encoding supported by the Encoding Standard and encoding_rs that can represent all of Unicode and that has encoder support. UTF-16LE and UTF-16BE don't have encoder support, and gb18030 cannot encode U+E5E5.) The other encodings are supported merely for legacy compatibility and not due to non-UTF-8 encodings having benefits other than being able to consume legacy content.

Considering that UTF-8 can represent all of Unicode and is already supported by all Web browsers, introducing a new encoding wouldn't add to the expressiveness but would add to compatibility problems. In that sense, adding new encodings to the Web Platform doesn't make sense, and, in fact, post-UTF-8 attempts at encodings, such as BOCU-1, have been rejected from the Web Platform. On the other hand, the set of legacy encodings that must be supported for a Web browser to be able to be successful is not going to expand. Empirically, the set of encodings specified in the Encoding Standard is already sufficient and the set of legacy encodings won't grow retroactively.

Since extensibility doesn't make sense considering the Web focus of encoding_rs and adding encodings to Web clients would be actively harmful, it makes sense to make the set of encodings that encoding_rs supports non-extensible and to take the (admittedly small) benefits arising from that, such as the size of Decoder and Encoder objects being known ahead of time, which enables stack allocation thereof.
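
The size argument can be illustrated with a generic Rust sketch. `ClosedDecoder` and `OpenDecoder` are hypothetical names and are unrelated to the actual layout of the encoding_rs types; the point is only that a closed set of variants has a size known at compile time, whereas an open trait does not.

```rust
use std::mem::size_of;

/// Closed set of variants: the compiler knows every possible state, so
/// the size of `ClosedDecoder` is a compile-time constant.
#[allow(dead_code)]
enum ClosedDecoder {
    SingleByte,
    TwoByte { pending_lead: Option<u8> },
}

/// Open set: any type may implement the trait, so a value of unknown
/// concrete type has to live behind a pointer such as `Box`.
#[allow(dead_code)]
trait OpenDecoder {
    fn feed(&mut self, byte: u8);
}

/// Size in bytes of the closed decoder, usable in constant contexts
/// such as array lengths.
const CLOSED_DECODER_SIZE: usize = size_of::<ClosedDecoder>();
```

A `ClosedDecoder` value can be placed directly in a local variable or embedded in another struct, with no heap allocation or dynamic dispatch involved.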

This does have downsides for applications that might want to put encoding_rs to non-Web uses if those non-Web uses involve legacy encodings that aren't needed for Web uses. The needs of such applications should not complicate encoding_rs itself, though. It is up to those applications to provide a framework that delegates the operations with encodings that encoding_rs supports to encoding_rs and operations with other encodings to something else (as opposed to encoding_rs itself providing an extensibility framework).


Panics

Methods in encoding_rs can panic if the API is used against the requirements stated in the documentation, if a state that's supposed to be impossible is reached due to an internal bug, or on integer overflow. When used according to documentation with buffer sizes that stay below integer overflow, in the absence of internal bugs, encoding_rs does not panic.

Panics arising from API misuse aren't documented beyond this on individual methods.
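
One way for a caller to stay clear of integer overflow when sizing buffers is checked arithmetic. The function below is a hypothetical sketch: the name `worst_case_utf8_len` and the "three output bytes per input byte plus three bytes of slack" formula are assumptions made for illustration, not a statement of how encoding_rs computes buffer sizes.

```rust
/// Hypothetical worst-case output size for decoding `byte_len` input
/// bytes to UTF-8, assuming up to three output bytes per input byte
/// plus three bytes of slack. Returns `None` instead of overflowing.
fn worst_case_utf8_len(byte_len: usize) -> Option<usize> {
    byte_len.checked_mul(3)?.checked_add(3)
}
```

Returning `None` lets the caller treat an impossibly large input as an error before any buffer is allocated.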

At-Risk Parts of the API

The foreseeable source of partially backward-incompatible API change is the way the instances of Encoding are made available.

If Rust changes to allow the entries of [&'static Encoding; N] to be initialized with statics of type &'static Encoding, the non-reference FOO_INIT public Encoding instances will be removed from the public API.

If Rust changes to make the referent of pub const FOO: &'static Encoding unique when the constant is used in different crates, the reference-typed statics for the encoding instances will be changed from static to const and the non-reference-typed _INIT instances will be removed.
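
The static arrangement discussed in this section can be sketched with a stand-in type. `ToyEncoding`, `ALL` and `same_instance` are hypothetical names; only the `FOO` / `FOO_INIT` naming follows the text above.

```rust
/// Stand-in for `Encoding` (hypothetical, much simplified).
struct ToyEncoding {
    name: &'static str,
}

/// Non-reference instance, analogous to a `FOO_INIT` item.
static FOO_INIT: ToyEncoding = ToyEncoding { name: "foo" };

/// Reference-typed static, analogous to `FOO`.
static FOO: &'static ToyEncoding = &FOO_INIT;

/// An array of references can be initialized from the `_INIT` item.
static ALL: [&'static ToyEncoding; 1] = [&FOO_INIT];

/// Identity comparison: do both references point at the same instance?
fn same_instance(a: &'static ToyEncoding, b: &'static ToyEncoding) -> bool {
    std::ptr::eq(a, b)
}
```

Because both `FOO` and `ALL[0]` are taken from the single `FOO_INIT` static, comparing them by address is reliable, which is the uniqueness property the paragraph above is concerned with.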

Mapping Spec Concepts onto the API

Spec Concept | Streaming | Non-Streaming
encoding | &'static Encoding | &'static Encoding
UTF-8 encoding | UTF_8 | UTF_8
get an encoding | Encoding::for_label(label) | Encoding::for_label(label)
get an output encoding | encoding.output_encoding() | encoding.output_encoding()
decode | let d = encoding.new_decoder(); let res = d.decode_to_*(src, dst, false); /* … */ let last_res = d.decode_to_*(src, dst, true); | encoding.decode(src)
UTF-8 decode | let d = UTF_8.new_decoder_with_bom_removal(); let res = d.decode_to_*(src, dst, false); /* … */ let last_res = d.decode_to_*(src, dst, true); | UTF_8.decode_with_bom_removal(src)
UTF-8 decode without BOM | let d = UTF_8.new_decoder_without_bom_handling(); let res = d.decode_to_*(src, dst, false); /* … */ let last_res = d.decode_to_*(src, dst, true); | UTF_8.decode_without_bom_handling(src)
UTF-8 decode without BOM or fail | let d = UTF_8.new_decoder_without_bom_handling(); let res = d.decode_to_*_without_replacement(src, dst, false); /* … (fail if malformed) */ let last_res = d.decode_to_*_without_replacement(src, dst, true); /* (fail if malformed) */ | UTF_8.decode_without_bom_handling_and_without_replacement(src)
encode | let e = encoding.new_encoder(); let res = e.encode_from_*(src, dst, false); /* … */ let last_res = e.encode_from_*(src, dst, true); | encoding.encode(src)
UTF-8 encode | Use the UTF-8 nature of Rust strings directly: write out src.as_bytes(), refill src, and repeat | Use the UTF-8 nature of Rust strings directly: src.as_bytes()
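
To make the "get an encoding" row more concrete, here is a toy label lookup. It covers only a tiny hand-picked subset of the labels in the Encoding Standard, and `toy_for_label` is a hypothetical function, not the `Encoding::for_label` implementation. It follows the standard's rule of ignoring leading and trailing ASCII whitespace and matching ASCII case-insensitively.

```rust
/// Looks up a tiny subset of Encoding Standard labels and returns the
/// name of the encoding. Leading and trailing ASCII whitespace is
/// ignored and matching is ASCII case-insensitive.
fn toy_for_label(label: &str) -> Option<&'static str> {
    let trimmed =
        label.trim_matches(|c| matches!(c, '\t' | '\n' | '\x0C' | '\r' | ' '));
    match trimmed.to_ascii_lowercase().as_str() {
        "utf-8" | "utf8" | "unicode-1-1-utf-8" => Some("UTF-8"),
        "shift_jis" | "sjis" | "ms_kanji" | "windows-31j" => Some("Shift_JIS"),
        "windows-1252" | "latin1" | "iso-8859-1" | "ascii" => Some("windows-1252"),
        _ => None,
    }
}
```

Note that the labels `ascii` and `iso-8859-1` resolve to windows-1252, and that a label such as `utf-7` is not recognized at all.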

Compatibility with the rust-encoding API

The crate encoding_rs_compat is a drop-in replacement for rust-encoding 0.2.32 that implements (most of) the API of rust-encoding 0.2.32 on top of encoding_rs.

Mapping rust-encoding concepts to encoding_rs concepts

The following table provides a mapping from rust-encoding constructs to encoding_rs ones.

rust-encoding | encoding_rs
encoding::EncodingRef | &'static encoding_rs::Encoding
encoding::all::WINDOWS_31J (not based on the WHATWG name for some encodings) | encoding_rs::SHIFT_JIS (always the WHATWG name uppercased and hyphens replaced with underscores)
encoding::all::ERROR | Not available because not in the Encoding Standard
encoding::all::ASCII | Not available because not in the Encoding Standard
encoding::all::ISO_8859_1 | Not available because not in the Encoding Standard
encoding::all::HZ | Not available because not in the Encoding Standard
enc.whatwg_name() (always lower case) | enc.name() (potentially mixed case)
enc.name() | Not available because not in the Encoding Standard
encoding::decode(bytes, encoding::DecoderTrap::Replace, enc) | enc.decode(bytes)
enc.decode(bytes, encoding::DecoderTrap::Replace) | enc.decode_without_bom_handling(bytes)
enc.encode(string, encoding::EncoderTrap::NcrEscape) | enc.encode(string)
raw_decoder.raw_feed(src, dst_string) | dst_string.reserve(decoder.max_utf8_buffer_length_without_replacement(src.len())); decoder.decode_to_string_without_replacement(src, dst_string, false)
raw_encoder.raw_feed(src, dst_vec) | dst_vec.reserve(encoder.max_buffer_length_from_utf8_without_replacement(src.len())); encoder.encode_from_utf8_to_vec_without_replacement(src, dst_vec, false)
raw_decoder.raw_finish(dst) | decoder.decode_to_string_without_replacement(b"", dst, true)
raw_encoder.raw_finish(dst) | encoder.encode_from_utf8_to_vec_without_replacement("", dst, true)
encoding::DecoderTrap::Strict | decode* methods that have _without_replacement in their name (and treating the `Malformed` result as fatal).
encoding::DecoderTrap::Replace | decode* methods that do not have _without_replacement in their name.
encoding::DecoderTrap::Ignore | It is a bad idea to ignore errors due to security issues, but this could be implemented using decode* methods that have _without_replacement in their name.
encoding::DecoderTrap::Call(DecoderTrapFunc) | Can be implemented using decode* methods that have _without_replacement in their name.
encoding::EncoderTrap::Strict | encode* methods that have _without_replacement in their name (and treating the `Unmappable` result as fatal).
encoding::EncoderTrap::Replace | Can be implemented using encode* methods that have _without_replacement in their name.
encoding::EncoderTrap::Ignore | It is a bad idea to ignore errors due to security issues, but this could be implemented using encode* methods that have _without_replacement in their name.
encoding::EncoderTrap::NcrEscape | encode* methods that do not have _without_replacement in their name.
encoding::EncoderTrap::Call(EncoderTrapFunc) | Can be implemented using encode* methods that have _without_replacement in their name.



Modules

mem: Functions for converting between different in-RAM representations of text and for quickly checking if the Unicode Bidirectional Algorithm can be avoided.



Structs

Decoder: A converter that decodes a byte stream into Unicode according to a character encoding in a streaming (incremental) manner.
Encoder: A converter that encodes a Unicode stream into bytes according to a character encoding in a streaming (incremental) manner.
Encoding: An encoding as defined in the Encoding Standard.



Enums

CoderResult: Result of a (potentially partial) decode or encode operation with replacement.
DecoderResult: Result of a (potentially partial) decode operation without replacement.
EncoderResult: Result of a (potentially partial) encode operation without replacement.



Statics

BIG5: The Big5 encoding.
BIG5_INIT: The initializer for the Big5 encoding.
EUC_JP: The EUC-JP encoding.
EUC_JP_INIT: The initializer for the EUC-JP encoding.
EUC_KR: The EUC-KR encoding.
EUC_KR_INIT: The initializer for the EUC-KR encoding.
GB18030: The gb18030 encoding.
GB18030_INIT: The initializer for the gb18030 encoding.
GBK: The GBK encoding.
GBK_INIT: The initializer for the GBK encoding.
IBM866: The IBM866 encoding.
IBM866_INIT: The initializer for the IBM866 encoding.
ISO_2022_JP: The ISO-2022-JP encoding.
ISO_2022_JP_INIT: The initializer for the ISO-2022-JP encoding.
ISO_8859_2: The ISO-8859-2 encoding.
ISO_8859_3: The ISO-8859-3 encoding.
ISO_8859_4: The ISO-8859-4 encoding.
ISO_8859_5: The ISO-8859-5 encoding.
ISO_8859_6: The ISO-8859-6 encoding.
ISO_8859_7: The ISO-8859-7 encoding.
ISO_8859_8: The ISO-8859-8 encoding.
ISO_8859_10: The ISO-8859-10 encoding.
ISO_8859_13: The ISO-8859-13 encoding.
ISO_8859_14: The ISO-8859-14 encoding.
ISO_8859_15: The ISO-8859-15 encoding.
ISO_8859_16: The ISO-8859-16 encoding.
ISO_8859_10_INIT: The initializer for the ISO-8859-10 encoding.
ISO_8859_13_INIT: The initializer for the ISO-8859-13 encoding.
ISO_8859_14_INIT: The initializer for the ISO-8859-14 encoding.
ISO_8859_15_INIT: The initializer for the ISO-8859-15 encoding.
ISO_8859_16_INIT: The initializer for the ISO-8859-16 encoding.
ISO_8859_2_INIT: The initializer for the ISO-8859-2 encoding.
ISO_8859_3_INIT: The initializer for the ISO-8859-3 encoding.
ISO_8859_4_INIT: The initializer for the ISO-8859-4 encoding.
ISO_8859_5_INIT: The initializer for the ISO-8859-5 encoding.
ISO_8859_6_INIT: The initializer for the ISO-8859-6 encoding.
ISO_8859_7_INIT: The initializer for the ISO-8859-7 encoding.
ISO_8859_8_I: The ISO-8859-8-I encoding.
ISO_8859_8_INIT: The initializer for the ISO-8859-8 encoding.
ISO_8859_8_I_INIT: The initializer for the ISO-8859-8-I encoding.
KOI8_R: The KOI8-R encoding.
KOI8_R_INIT: The initializer for the KOI8-R encoding.
KOI8_U: The KOI8-U encoding.
KOI8_U_INIT: The initializer for the KOI8-U encoding.
MACINTOSH: The macintosh encoding.
MACINTOSH_INIT: The initializer for the macintosh encoding.
REPLACEMENT: The replacement encoding.
REPLACEMENT_INIT: The initializer for the replacement encoding.
SHIFT_JIS: The Shift_JIS encoding.
SHIFT_JIS_INIT: The initializer for the Shift_JIS encoding.
UTF_8: The UTF-8 encoding.
UTF_16BE: The UTF-16BE encoding.
UTF_16BE_INIT: The initializer for the UTF-16BE encoding.
UTF_16LE: The UTF-16LE encoding.
UTF_16LE_INIT: The initializer for the UTF-16LE encoding.
UTF_8_INIT: The initializer for the UTF-8 encoding.
WINDOWS_874: The windows-874 encoding.
WINDOWS_1250: The windows-1250 encoding.
WINDOWS_1251: The windows-1251 encoding.
WINDOWS_1252: The windows-1252 encoding.
WINDOWS_1253: The windows-1253 encoding.
WINDOWS_1254: The windows-1254 encoding.
WINDOWS_1255: The windows-1255 encoding.
WINDOWS_1256: The windows-1256 encoding.
WINDOWS_1257: The windows-1257 encoding.
WINDOWS_1258: The windows-1258 encoding.
WINDOWS_1250_INIT: The initializer for the windows-1250 encoding.
WINDOWS_1251_INIT: The initializer for the windows-1251 encoding.
WINDOWS_1252_INIT: The initializer for the windows-1252 encoding.
WINDOWS_1253_INIT: The initializer for the windows-1253 encoding.
WINDOWS_1254_INIT: The initializer for the windows-1254 encoding.
WINDOWS_1255_INIT: The initializer for the windows-1255 encoding.
WINDOWS_1256_INIT: The initializer for the windows-1256 encoding.
WINDOWS_1257_INIT: The initializer for the windows-1257 encoding.
WINDOWS_1258_INIT: The initializer for the windows-1258 encoding.
WINDOWS_874_INIT: The initializer for the windows-874 encoding.
X_MAC_CYRILLIC: The x-mac-cyrillic encoding.
X_MAC_CYRILLIC_INIT: The initializer for the x-mac-cyrillic encoding.
X_USER_DEFINED: The x-user-defined encoding.
X_USER_DEFINED_INIT: The initializer for the x-user-defined encoding.