Crate encoding_rs
encoding_rs is a Gecko-oriented Free Software / Open Source implementation of the Encoding Standard in Rust. Gecko-oriented means that converting to and from UTF-16 is supported in addition to converting to and from UTF-8 and that the performance and streamability goals are browser-oriented.
Availability
The code is available under the Apache license, Version 2.0 or the MIT license, at your option. See the COPYRIGHT file for details. The repository is on GitHub. The crate is available on crates.io.
Web / Browser Focus
Both in terms of scope and performance, the focus is on the Web. For scope, this means that encoding_rs implements the Encoding Standard fully and doesn't implement encodings that are not specified in the Encoding Standard. For performance, this means that decoding performance is important as well as performance for encoding into UTF-8 or encoding the Basic Latin range (ASCII) into legacy encodings. Non-Basic Latin needs to be encoded into legacy encodings in only two places in the Web platform: in the query part of URLs, in which case it's a matter of relatively rare error handling, and in form submission, in which case the user action and networking tend to hide the performance of the encoder.
Deemphasizing performance of encoding non-Basic Latin text into legacy encodings enables smaller code size thanks to the encoder side using the decode-optimized data tables without having encode-optimized data tables at all. Even in decoders, smaller lookup table size is preferred over avoiding multiplication operations.
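The size trade-off described above can be illustrated with a toy single-byte encoding: the encoder takes a fast path for ASCII and otherwise searches the decoder's table instead of consulting a separate encode-optimized table. This is a minimal sketch, not encoding_rs's actual code; `TOY_HIGH`, `toy_decode_byte`, and `toy_encode_char` are hypothetical names, and the four-entry table is made up for illustration.

```rust
/// Decode table for bytes 0x80..=0x83 of a made-up single-byte encoding.
/// (Hypothetical data; a real legacy encoding has a much larger table.)
const TOY_HIGH: [char; 4] = ['\u{20AC}', '\u{2026}', '\u{2018}', '\u{2019}'];

/// Decoding is a direct index into the table: cheap and decode-optimized.
fn toy_decode_byte(b: u8) -> Option<char> {
    if b < 0x80 {
        Some(b as char) // ASCII maps to itself
    } else {
        TOY_HIGH.get((b - 0x80) as usize).copied()
    }
}

/// Encoding reuses the decode table. ASCII takes the fast path; anything
/// else pays for a linear search, which is the accepted cost of not
/// shipping a second, encode-optimized table.
fn toy_encode_char(c: char) -> Option<u8> {
    if c.is_ascii() {
        return Some(c as u8);
    }
    TOY_HIGH
        .iter()
        .position(|&entry| entry == c)
        .map(|index| 0x80 + index as u8)
}
```

In this sketch an unmappable character yields `None`; how a real encoder reports or replaces unmappable characters is outside the scope of the example.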
Additionally, performance is a non-goal for the ASCII-incompatible ISO-2022-JP and UTF-16 encodings, which are rarely used on the Web.
Despite the focus on the Web, encoding_rs may well be useful for decoding email, although you'll need to implement UTF-7 decoding and label handling by other means. (Due to the Web focus, patches to add UTF-7 are unwelcome in encoding_rs itself.) Also, despite the browser focus, the hope is that non-browser applications that wish to consume Web content or submit Web forms in a Web-compatible way will find encoding_rs useful.
Streaming & Non-Streaming; Rust & C/C++
The API in Rust has two modes of operation: streaming and non-streaming. The streaming API is the foundation of the implementation and should be used when processing data that arrives piecemeal from an I/O stream. The streaming API has an FFI wrapper that exposes it to C callers. The non-streaming part of the API is for Rust callers only; it is implemented on top of the streaming API and can therefore be considered merely a set of convenience methods. There is no analogous C API exposed via FFI, mainly because C doesn't have standard types for growable byte buffers and Unicode strings that know their length.
The C API (header file generated at target/include/encoding_rs.h when building encoding_rs) can, in turn, be wrapped for use from C++. Such a C++ wrapper could re-create the non-streaming API in C++ for C++ callers. Currently, encoding_rs comes with a C++ wrapper that uses STL+GSL types, but this wrapper doesn't provide non-streaming convenience methods at this time. A C++ wrapper with XPCOM/MFBT types is planned but does not exist yet.
The Encoding type is common to both the streaming and non-streaming modes. In the streaming mode, decoding operations are performed with a Decoder and encoding operations with an Encoder object obtained via Encoding. In the non-streaming mode, decoding and encoding operations are performed using methods on Encoding objects themselves, so the Decoder and Encoder objects are not used at all.
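The layering described in this section, with a non-streaming convenience function built on a streaming core, can be sketched with the standard library alone. The following is an illustration under stated assumptions, not encoding_rs's implementation: `ToyStreamingDecoder`, `decode_to_string`, and `toy_decode` are hypothetical names, the sketch handles UTF-8 only, and malformed input is replaced with U+FFFD.

```rust
/// Hypothetical streaming UTF-8 decoder. It keeps the bytes of an incomplete
/// trailing sequence so that a character split across chunks still decodes.
struct ToyStreamingDecoder {
    pending: Vec<u8>,
}

impl ToyStreamingDecoder {
    fn new() -> Self {
        ToyStreamingDecoder { pending: Vec::new() }
    }

    /// Appends the decoded form of `chunk` to `dst`. Pass `last = true` for
    /// the final chunk. Returns true if any U+FFFD replacement was emitted.
    fn decode_to_string(&mut self, chunk: &[u8], dst: &mut String, last: bool) -> bool {
        // Prepend whatever was left over from the previous chunk.
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(chunk);
        let mut rest: &[u8] = &buf;
        let mut had_errors = false;
        loop {
            match std::str::from_utf8(rest) {
                Ok(valid) => {
                    dst.push_str(valid);
                    break;
                }
                Err(e) => {
                    let (good, bad) = rest.split_at(e.valid_up_to());
                    dst.push_str(std::str::from_utf8(good).expect("prefix is valid UTF-8"));
                    match e.error_len() {
                        // Malformed sequence in the middle: replace and continue.
                        Some(len) => {
                            dst.push('\u{FFFD}');
                            had_errors = true;
                            rest = &bad[len..];
                        }
                        // Truncated sequence at the very end of the stream.
                        None if last => {
                            dst.push('\u{FFFD}');
                            had_errors = true;
                            break;
                        }
                        // Truncated sequence at a chunk boundary: keep it.
                        None => {
                            self.pending = bad.to_vec();
                            break;
                        }
                    }
                }
            }
        }
        had_errors
    }
}

/// Non-streaming convenience wrapper: one call that feeds the whole input
/// to the streaming decoder as a single, final chunk.
fn toy_decode(bytes: &[u8]) -> (String, bool) {
    let mut decoder = ToyStreamingDecoder::new();
    let mut out = String::with_capacity(bytes.len());
    let had_errors = decoder.decode_to_string(bytes, &mut out, true);
    (out, had_errors)
}
```

A caller reading from a socket would call `decode_to_string` once per received buffer with `last = false` and once more with `last = true` at end of stream, while a caller that already holds the complete input can use `toy_decode` directly.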
Mapping Spec Concepts onto the API
Spec Concept | Streaming | Non-Streaming |
---|---|---|
encoding | &'static Encoding | &'static Encoding |
UTF-8 encoding | UTF_8 | UTF_8 |
get an encoding | Encoding::for_label(label) | Encoding::for_label(label) |
name | encoding.name() | encoding.name() |
get an output encoding | encoding.output_encoding() | encoding.output_encoding() |
decode | let d = encoding.new_decoder(); | encoding.decode(src) |
UTF-8 decode | let d = UTF_8.new_decoder_with_bom_removal(); | UTF_8.decode_with_bom_removal(src) |
UTF-8 decode without BOM | let d = UTF_8.new_decoder_without_bom_handling(); | UTF_8.decode_without_bom_handling(src) |
UTF-8 decode without BOM or fail | let d = UTF_8.new_decoder_without_bom_handling(); | UTF_8.decode_without_bom_handling_and_without_replacement(src) |
encode | let e = encoding.new_encoder(); | encoding.encode(src) |
UTF-8 encode | Use the UTF-8 nature of Rust strings directly: write(src.as_bytes()); | Use the UTF-8 nature of Rust strings directly: src.as_bytes() |
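To make the "get an encoding", "name", and "get an output encoding" rows more concrete, here is a rough sketch of the semantics the Encoding Standard assigns to those concepts: labels are matched after trimming ASCII whitespace and ignoring ASCII case, and the output encoding of UTF-16LE is UTF-8. `ToyEncoding` and its statics are hypothetical stand-ins for the crate's `Encoding` type, and only a handful of the standard's labels are included.

```rust
/// Hypothetical stand-in for an encoding object; instances live in statics
/// so that callers can pass around `&'static ToyEncoding`.
#[derive(Debug)]
struct ToyEncoding {
    name: &'static str,
}

static TOY_UTF_8: ToyEncoding = ToyEncoding { name: "UTF-8" };
static TOY_UTF_16LE: ToyEncoding = ToyEncoding { name: "UTF-16LE" };
static TOY_WINDOWS_1252: ToyEncoding = ToyEncoding { name: "windows-1252" };

impl ToyEncoding {
    /// "Get an encoding": trim ASCII whitespace, then match labels
    /// ASCII-case-insensitively. Only a small subset of labels is listed.
    fn for_label(label: &str) -> Option<&'static ToyEncoding> {
        let trimmed = label.trim_matches(|c: char| c.is_ascii_whitespace());
        match trimmed.to_ascii_lowercase().as_str() {
            "utf-8" | "utf8" | "unicode-1-1-utf-8" => Some(&TOY_UTF_8),
            "utf-16le" | "utf-16" => Some(&TOY_UTF_16LE),
            "windows-1252" | "latin1" | "iso-8859-1" | "ascii" | "us-ascii" => {
                Some(&TOY_WINDOWS_1252)
            }
            _ => None,
        }
    }

    /// "Name": the canonical name of the encoding.
    fn name(&'static self) -> &'static str {
        self.name
    }

    /// "Get an output encoding": UTF-16LE maps to UTF-8 (the standard does
    /// the same for UTF-16BE and replacement, which this sketch omits);
    /// every other encoding is its own output encoding.
    fn output_encoding(&'static self) -> &'static ToyEncoding {
        if std::ptr::eq(self, &TOY_UTF_16LE) {
            &TOY_UTF_8
        } else {
            self
        }
    }
}
```

Note that `latin1` resolving to windows-1252 and an unrecognized label such as `utf-7` resolving to nothing both follow from the Encoding Standard's label table rather than from any choice made in this sketch.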
Reexports
pub use ffi::*; |
Modules
ffi |
Structs
Decoder | A converter that decodes a byte stream into Unicode according to a character encoding in a streaming (incremental) manner. |
Encoder | A converter that encodes a Unicode stream into bytes according to a character encoding in a streaming (incremental) manner. |
Encoding | An encoding as defined in the Encoding Standard. |
Enums
CoderResult | Result of a (potentially partial) decode or encode operation with replacement. |
DecoderResult | Result of a (potentially partial) decode operation without replacement. |
EncoderResult | Result of a (potentially partial) encode operation without replacement. |
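The phrase "potentially partial" means that a single call may stop before consuming all input, for example because the caller's output buffer is full, and the caller is expected to loop. The sketch below shows that calling pattern with a toy decoder that writes UTF-16 code units into a fixed-size buffer. Everything here is an assumption made for illustration: `ToyCoderResult`, its `InputEmpty` and `OutputFull` variants, `toy_decode_to_utf16`, and `decode_in_pieces` are hypothetical, and the toy decoder simply treats each byte as the code point of the same value.

```rust
/// Hypothetical result of a partial conversion step.
#[derive(Debug, PartialEq)]
enum ToyCoderResult {
    /// All input was consumed; the caller may supply more.
    InputEmpty,
    /// The output buffer filled up; the caller must drain it and call again.
    OutputFull,
}

/// Converts as many bytes of `src` as fit into `dst`, one UTF-16 code unit
/// per byte. Returns the result, bytes read, and code units written.
fn toy_decode_to_utf16(src: &[u8], dst: &mut [u16]) -> (ToyCoderResult, usize, usize) {
    let n = src.len().min(dst.len());
    for (unit, &byte) in dst[..n].iter_mut().zip(&src[..n]) {
        *unit = u16::from(byte);
    }
    let result = if n == src.len() {
        ToyCoderResult::InputEmpty
    } else {
        ToyCoderResult::OutputFull
    };
    (result, n, n)
}

/// Drives the partial operation to completion using a small reusable buffer.
fn decode_in_pieces(src: &[u8], buf_len: usize) -> Vec<u16> {
    assert!(buf_len > 0, "a zero-length buffer could never make progress");
    let mut out = Vec::new();
    let mut buf = vec![0u16; buf_len];
    let mut total_read = 0;
    loop {
        let (result, read, written) = toy_decode_to_utf16(&src[total_read..], &mut buf);
        out.extend_from_slice(&buf[..written]);
        total_read += read;
        if result == ToyCoderResult::InputEmpty {
            break;
        }
    }
    out
}
```

Returning the read and written counts alongside the result is what lets the caller resume exactly where the previous call stopped without the converter owning either buffer.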
Constants
BIG5 | The Big5 encoding. |
EUC_JP | The EUC-JP encoding. |
EUC_KR | The EUC-KR encoding. |
GB18030 | The gb18030 encoding. |
GBK | The GBK encoding. |
IBM866 | The IBM866 encoding. |
ISO_2022_JP | The ISO-2022-JP encoding. |
ISO_8859_10 | The ISO-8859-10 encoding. |
ISO_8859_13 | The ISO-8859-13 encoding. |
ISO_8859_14 | The ISO-8859-14 encoding. |
ISO_8859_15 | The ISO-8859-15 encoding. |
ISO_8859_16 | The ISO-8859-16 encoding. |
ISO_8859_2 | The ISO-8859-2 encoding. |
ISO_8859_3 | The ISO-8859-3 encoding. |
ISO_8859_4 | The ISO-8859-4 encoding. |
ISO_8859_5 | The ISO-8859-5 encoding. |
ISO_8859_6 | The ISO-8859-6 encoding. |
ISO_8859_7 | The ISO-8859-7 encoding. |
ISO_8859_8 | The ISO-8859-8 encoding. |
ISO_8859_8_I | The ISO-8859-8-I encoding. |
KOI8_R | The KOI8-R encoding. |
KOI8_U | The KOI8-U encoding. |
MACINTOSH | The macintosh encoding. |
REPLACEMENT | The replacement encoding. |
SHIFT_JIS | The Shift_JIS encoding. |
UTF_16BE | The UTF-16BE encoding. |
UTF_16LE | The UTF-16LE encoding. |
UTF_8 | The UTF-8 encoding. |
WINDOWS_1250 | The windows-1250 encoding. |
WINDOWS_1251 | The windows-1251 encoding. |
WINDOWS_1252 | The windows-1252 encoding. |
WINDOWS_1253 | The windows-1253 encoding. |
WINDOWS_1254 | The windows-1254 encoding. |
WINDOWS_1255 | The windows-1255 encoding. |
WINDOWS_1256 | The windows-1256 encoding. |
WINDOWS_1257 | The windows-1257 encoding. |
WINDOWS_1258 | The windows-1258 encoding. |
WINDOWS_874 | The windows-874 encoding. |
X_MAC_CYRILLIC | The x-mac-cyrillic encoding. |
X_USER_DEFINED | The x-user-defined encoding. |