Crate encoding_rs

encoding_rs is a Gecko-oriented Free Software / Open Source implementation of the Encoding Standard in Rust. Gecko-oriented means that converting to and from UTF-16 is supported in addition to converting to and from UTF-8 and that the performance and streamability goals are browser-oriented.

Availability

The code is available under the Apache license, Version 2.0 or the MIT license, at your option. See the COPYRIGHT file for details. The repository is on GitHub. The crate is available on crates.io.

Web / Browser Focus

Both in terms of scope and performance, the focus is on the Web. For scope, this means that encoding_rs implements the Encoding Standard fully and doesn't implement encodings that are not specified in the Encoding Standard. For performance, this means that decoding performance is important as well as performance for encoding into UTF-8 or encoding the Basic Latin range (ASCII) into legacy encodings. Non-Basic Latin needs to be encoded into legacy encodings in only two places in the Web platform: in the query part of URLs, in which case it's a matter of relatively rare error handling, and in form submission, in which case the user action and networking tend to hide the performance of the encoder.

Deemphasizing performance of encoding non-Basic Latin text into legacy encodings enables smaller code size thanks to the encoder side using the decode-optimized data tables without having encode-optimized data tables at all. Even in decoders, smaller lookup table size is preferred over avoiding multiplication operations.
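The trade-off above can be illustrated with a minimal sketch in which the encoder has no table of its own and instead searches the decoder's table. This is not encoding_rs's actual code: the table `TOY_HIGH_HALF`, its made-up byte-to-character mapping, and the function names `toy_decode_byte` and `toy_encode_char` are all hypothetical and exist only for illustration.

```rust
/// Toy decode-optimized table: index `i` holds the character for byte `0x80 + i`.
/// The mapping is invented for this sketch and is not a real encoding.
const TOY_HIGH_HALF: [char; 4] = ['€', 'é', 'ñ', 'ü'];

/// Decoding is a direct index into the table (fast path for the decoder).
fn toy_decode_byte(b: u8) -> Option<char> {
    if b < 0x80 {
        // ASCII bytes map to themselves.
        Some(b as char)
    } else {
        TOY_HIGH_HALF.get((b - 0x80) as usize).copied()
    }
}

/// Encoding reuses the same table and pays for it with a linear search,
/// so no second, encode-optimized table has to be shipped.
fn toy_encode_char(c: char) -> Option<u8> {
    if c.is_ascii() {
        // ASCII stays on a fast path that needs no table at all.
        return Some(c as u8);
    }
    TOY_HIGH_HALF
        .iter()
        .position(|&t| t == c)
        .map(|i| 0x80 + i as u8)
}
```

In this sketch the ASCII path never touches the table, which mirrors the stated priority: Basic Latin encodes quickly, while other characters take the slower search in exchange for smaller code size.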

Additionally, performance is a non-goal for the ASCII-incompatible ISO-2022-JP and UTF-16 encodings, which are rarely used on the Web.

Despite the focus on the Web, encoding_rs may well be useful for decoding email, although you'll need to implement UTF-7 decoding and label handling by other means. (Due to the Web focus, patches to add UTF-7 are unwelcome in encoding_rs itself.) Also, despite the browser focus, the hope is that non-browser applications that wish to consume Web content or submit Web forms in a Web-compatible way will find encoding_rs useful.

Streaming & Non-Streaming; Rust & C/C++

The API in Rust has two modes of operation: streaming and non-streaming. The streaming API is the foundation of the implementation and should be used when processing data that arrives piecemeal from an I/O stream. The streaming API has an FFI wrapper that exposes it to C callers. The non-streaming part of the API is for Rust callers only. It is implemented on top of the streaming API and, as such, can be regarded as a set of convenience methods. There is no analogous C API exposed via FFI, mainly because C doesn't have standard types for growable byte buffers and Unicode strings that know their length.
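To show why the streaming layer is the foundation, here is a std-only sketch of a streaming UTF-8 decoder that carries an incomplete byte sequence from one chunk to the next, plus a one-shot helper built on top of it. The type `ChunkDecoder` and the function `decode_all` are hypothetical names for this illustration. They are not encoding_rs's `Decoder` or its real signatures, and the sketch only assumes the general shape of a call that takes a source chunk, a destination, and a flag marking the last chunk.

```rust
use std::str;

/// Hypothetical minimal streaming UTF-8 decoder (illustration only).
struct ChunkDecoder {
    /// Bytes of an incomplete sequence left over from the previous chunk.
    pending: Vec<u8>,
}

impl ChunkDecoder {
    fn new() -> Self {
        ChunkDecoder { pending: Vec::new() }
    }

    /// Decodes one chunk and appends the result to `dst`.
    /// `last` must be true for the final chunk of the stream.
    fn decode_to_string(&mut self, src: &[u8], dst: &mut String, last: bool) {
        self.pending.extend_from_slice(src);
        let buf = std::mem::take(&mut self.pending);
        let mut rest: &[u8] = &buf;
        loop {
            match str::from_utf8(rest) {
                Ok(s) => {
                    dst.push_str(s);
                    break;
                }
                Err(e) => {
                    let (valid, after) = rest.split_at(e.valid_up_to());
                    // `valid` was just verified by `from_utf8`.
                    dst.push_str(str::from_utf8(valid).unwrap());
                    match e.error_len() {
                        // Malformed sequence: emit U+FFFD and continue after it.
                        Some(n) => {
                            dst.push('\u{FFFD}');
                            rest = &after[n..];
                        }
                        // Truncated sequence at the very end of the stream.
                        None if last => {
                            dst.push('\u{FFFD}');
                            break;
                        }
                        // Truncated sequence mid-stream: keep it for the next chunk.
                        None => {
                            self.pending.extend_from_slice(after);
                            break;
                        }
                    }
                }
            }
        }
    }
}

/// Non-streaming convenience built on the streaming decoder.
fn decode_all(src: &[u8]) -> String {
    let mut decoder = ChunkDecoder::new();
    let mut out = String::new();
    decoder.decode_to_string(src, &mut out, true);
    out
}
```

The one-shot helper is just a single streaming call with the last-chunk flag set, which is the sense in which a non-streaming API can be layered on a streaming one.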

The C API (header file generated at target/include/encoding_rs.h when building encoding_rs) can, in turn, be wrapped for use from C++. Such a C++ wrapper could re-create the non-streaming API in C++ for C++ callers. Currently, encoding_rs comes with a C++ wrapper that uses STL+GSL types, but this wrapper doesn't provide non-streaming convenience methods at this time. A C++ wrapper with XPCOM/MFBT types is planned but does not exist yet.

The Encoding type is common to both the streaming and non-streaming modes. In the streaming mode, decoding operations are performed with a Decoder and encoding operations with an Encoder object obtained via Encoding. In the non-streaming mode, decoding and encoding operations are performed using methods on Encoding objects themselves, so the Decoder and Encoder objects are not used at all.

Mapping Spec Concepts onto the API

Each entry below gives a spec concept followed by its streaming and non-streaming equivalents.

Spec Concept: encoding
    Streaming: &'static Encoding
    Non-Streaming: &'static Encoding

Spec Concept: UTF-8 encoding
    Streaming: UTF_8
    Non-Streaming: UTF_8

Spec Concept: get an encoding
    Streaming: Encoding::for_label(label)
    Non-Streaming: Encoding::for_label(label)

Spec Concept: name
    Streaming: encoding.name()
    Non-Streaming: encoding.name()

Spec Concept: get an output encoding
    Streaming: encoding.output_encoding()
    Non-Streaming: encoding.output_encoding()

Spec Concept: decode
    Streaming:
        let d = encoding.new_decoder();
        let res = d.decode_to_*(src, dst, false);
        // …
        let last_res = d.decode_to_*(src, dst, true);
    Non-Streaming:
        encoding.decode(src)

Spec Concept: UTF-8 decode
    Streaming:
        let d = UTF_8.new_decoder_with_bom_removal();
        let res = d.decode_to_*(src, dst, false);
        // …
        let last_res = d.decode_to_*(src, dst, true);
    Non-Streaming:
        UTF_8.decode_with_bom_removal(src)

Spec Concept: UTF-8 decode without BOM
    Streaming:
        let d = UTF_8.new_decoder_without_bom_handling();
        let res = d.decode_to_*(src, dst, false);
        // …
        let last_res = d.decode_to_*(src, dst, true);
    Non-Streaming:
        UTF_8.decode_without_bom_handling(src)

Spec Concept: UTF-8 decode without BOM or fail
    Streaming:
        let d = UTF_8.new_decoder_without_bom_handling();
        let res = d.decode_to_*_without_replacement(src, dst, false);
        // … (fail if malformed)
        let last_res = d.decode_to_*_without_replacement(src, dst, true);
        // (fail if malformed)
    Non-Streaming:
        UTF_8.decode_without_bom_handling_and_without_replacement(src)

Spec Concept: encode
    Streaming:
        let e = encoding.new_encoder();
        let res = e.encode_to_*(src, dst, false);
        // …
        let last_res = e.encode_to_*(src, dst, true);
    Non-Streaming:
        encoding.encode(src)

Spec Concept: UTF-8 encode
    Streaming:
        Use the UTF-8 nature of Rust strings directly:
        write(src.as_bytes());
        write(src.as_bytes());
        write(src.as_bytes());
        // …
    Non-Streaming:
        Use the UTF-8 nature of Rust strings directly:
        src.as_bytes()
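The three UTF-8 decode variants in the mapping differ only in whether a leading BOM is stripped and whether malformed input is replaced or treated as failure. The following std-only sketch shows those three behaviors side by side. The function names are hypothetical and this is not how encoding_rs implements them. It relies only on the standard library's `String::from_utf8_lossy` and `std::str::from_utf8`.

```rust
use std::borrow::Cow;

/// The UTF-8 encoding of U+FEFF, used as a byte order mark.
const UTF8_BOM: &[u8] = b"\xEF\xBB\xBF";

/// "UTF-8 decode": strip one leading BOM if present, replace malformed
/// sequences with U+FFFD.
fn utf8_decode_with_bom_removal(src: &[u8]) -> Cow<'_, str> {
    let body = src.strip_prefix(UTF8_BOM).unwrap_or(src);
    String::from_utf8_lossy(body)
}

/// "UTF-8 decode without BOM": a leading BOM is kept as U+FEFF in the
/// output, malformed sequences are still replaced.
fn utf8_decode_without_bom_handling(src: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(src)
}

/// "UTF-8 decode without BOM or fail": no BOM handling and no replacement;
/// malformed input yields `None` instead.
fn utf8_decode_without_bom_or_fail(src: &[u8]) -> Option<&str> {
    std::str::from_utf8(src).ok()
}
```

For the input bytes `EF BB BF 68 69`, the first function yields `hi`, while the other two keep U+FEFF at the start of the output.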

Reexports

pub use ffi::*;

Modules

ffi

Structs

Decoder

A converter that decodes a byte stream into Unicode according to a character encoding in a streaming (incremental) manner.

Encoder

A converter that encodes a Unicode stream into bytes according to a character encoding in a streaming (incremental) manner.

Encoding

An encoding as defined in the Encoding Standard.

Enums

CoderResult

Result of a (potentially partial) decode or encode operation with replacement.

DecoderResult

Result of a (potentially partial) decode operation without replacement.

EncoderResult

Result of a (potentially partial) encode operation without replacement.

Constants

BIG5

The Big5 encoding.

EUC_JP

The EUC-JP encoding.

EUC_KR

The EUC-KR encoding.

GB18030

The gb18030 encoding.

GBK

The GBK encoding.

IBM866

The IBM866 encoding.

ISO_2022_JP

The ISO-2022-JP encoding.

ISO_8859_10

The ISO-8859-10 encoding.

ISO_8859_13

The ISO-8859-13 encoding.

ISO_8859_14

The ISO-8859-14 encoding.

ISO_8859_15

The ISO-8859-15 encoding.

ISO_8859_16

The ISO-8859-16 encoding.

ISO_8859_2

The ISO-8859-2 encoding.

ISO_8859_3

The ISO-8859-3 encoding.

ISO_8859_4

The ISO-8859-4 encoding.

ISO_8859_5

The ISO-8859-5 encoding.

ISO_8859_6

The ISO-8859-6 encoding.

ISO_8859_7

The ISO-8859-7 encoding.

ISO_8859_8

The ISO-8859-8 encoding.

ISO_8859_8_I

The ISO-8859-8-I encoding.

KOI8_R

The KOI8-R encoding.

KOI8_U

The KOI8-U encoding.

MACINTOSH

The macintosh encoding.

REPLACEMENT

The replacement encoding.

SHIFT_JIS

The Shift_JIS encoding.

UTF_16BE

The UTF-16BE encoding.

UTF_16LE

The UTF-16LE encoding.

UTF_8

The UTF-8 encoding.

WINDOWS_1250

The windows-1250 encoding.

WINDOWS_1251

The windows-1251 encoding.

WINDOWS_1252

The windows-1252 encoding.

WINDOWS_1253

The windows-1253 encoding.

WINDOWS_1254

The windows-1254 encoding.

WINDOWS_1255

The windows-1255 encoding.

WINDOWS_1256

The windows-1256 encoding.

WINDOWS_1257

The windows-1257 encoding.

WINDOWS_1258

The windows-1258 encoding.

WINDOWS_874

The windows-874 encoding.

X_MAC_CYRILLIC

The x-mac-cyrillic encoding.

X_USER_DEFINED

The x-user-defined encoding.