Struct encoding_rs::Encoding

pub struct Encoding {
    // some fields omitted
}

An encoding as defined in the Encoding Standard.

An encoding defines a mapping from a u8 sequence to a char sequence and, in most cases, vice versa. Each encoding has a name, an output encoding, and one or more labels.

Labels are ASCII-case-insensitive strings that are used to identify an encoding in formats and protocols. The name of the encoding is the preferred label in the case appropriate for returning from the characterSet property of the Document DOM interface, except for the replacement encoding whose name is not one of its labels.

The output encoding is the encoding used for form submission and URL parsing on Web pages in the encoding. This is UTF-8 for the replacement, UTF-16LE and UTF-16BE encodings and the encoding itself for other encodings.

Instances

All instances of Encoding are statically allocated and have the 'static lifetime. There is precisely one unique Encoding instance for each encoding defined in the Encoding Standard.

To obtain a reference to a particular encoding whose identity you know at compile time, use a constant. There is a constant for each encoding. The constants are named in all caps with hyphens replaced with underscores (and in C/C++ have _ENCODING appended to the name). For example, if you know at compile time that you will want to decode using the UTF-8 encoding, use the UTF_8 constant (UTF_8_ENCODING in C/C++).

If you don't know what encoding you need at compile time and need to dynamically get an encoding by label, use Encoding::for_label(label).

Instances of Encoding can be compared with == (in both Rust and in C/C++).
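As a rough illustration of why comparing instances is cheap when there is exactly one static instance per encoding, the sketch below compares by address. The `Enc` type, its field, and the two statics are hypothetical stand-ins, not the crate's definitions, and the source does not say how the crate implements equality.

```rust
use std::ptr;

/// Hypothetical stand-in for an encoding object (not the crate's real type).
pub struct Enc {
    name: &'static str,
}

impl Enc {
    pub fn name(&'static self) -> &'static str {
        self.name
    }
}

// One statically allocated instance per encoding.
static UTF_8_INIT: Enc = Enc { name: "UTF-8" };
static UTF_16LE_INIT: Enc = Enc { name: "UTF-16LE" };

// The public "constants" are references to those instances.
pub static UTF_8: &Enc = &UTF_8_INIT;
pub static UTF_16LE: &Enc = &UTF_16LE_INIT;

// Assumption for this sketch: with a single instance per encoding,
// identity (address) comparison is sufficient for equality.
impl PartialEq for Enc {
    fn eq(&self, other: &Enc) -> bool {
        ptr::eq(self, other)
    }
}

impl Eq for Enc {}
```

With this shape, `UTF_8 == UTF_8` holds and `UTF_8 == UTF_16LE` does not, without comparing names.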

Streaming vs. Non-Streaming

When you have the entire input in a single buffer, you can use the convenience methods decode(), decode_with_bom_removal(), decode_without_bom_handling(), decode_without_bom_handling_and_without_replacement() and encode(). (These methods are available to Rust callers only and are not available in the C API.) Unlike the rest of the API available to Rust, these methods perform heap allocations. You should use the Decoder and Encoder objects when your input is split into multiple buffers or when you want to control the allocation of the output buffers.

Methods

impl Encoding

fn for_label(label: &[u8]) -> Option<&'static Encoding>

Implements the get an encoding algorithm.

If, after ASCII-lowercasing and removing leading and trailing whitespace, the argument matches a label defined in the Encoding Standard, Some(&'static Encoding) representing the corresponding encoding is returned. If there is no match, None is returned.

The argument is of type &[u8] instead of &str to save callers that are extracting the label from a non-UTF-8 protocol the trouble of conversion to UTF-8. (If you have a &str, just call .as_bytes() on it.)

Available via the C wrapper.
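The label matching described above can be sketched with the standard library alone. This is an illustration of the normalization step, not the crate's implementation: `name_for_label` is a hypothetical helper, it returns a name string rather than an Encoding reference, and its table covers only a handful of the labels defined in the Encoding Standard.

```rust
/// Hypothetical helper: normalizes a label and looks it up in a tiny table.
/// Returns the encoding name on a match, None otherwise.
fn name_for_label(label: &[u8]) -> Option<&'static str> {
    // ASCII whitespace as used by the Encoding Standard: TAB, LF, FF, CR, SPACE.
    fn is_ws(b: &u8) -> bool {
        matches!(*b, b'\t' | b'\n' | 0x0C | b'\r' | b' ')
    }

    // Strip leading and trailing whitespace; an all-whitespace label matches nothing.
    let start = label.iter().position(|b| !is_ws(b))?;
    let end = label.iter().rposition(|b| !is_ws(b))? + 1;

    // ASCII-lowercase the remainder; non-ASCII bytes are left untouched.
    let normalized = label[start..end].to_ascii_lowercase();

    // Only a few labels are listed here; the real table is much larger.
    match normalized.as_slice() {
        b"utf-8" | b"utf8" => Some("UTF-8"),
        b"utf-16" | b"utf-16le" => Some("UTF-16LE"),
        b"utf-16be" => Some("UTF-16BE"),
        _ => None,
    }
}
```

Taking `&[u8]` means a label read straight off the wire can be passed without first validating it as UTF-8.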

fn for_label_no_replacement(label: &[u8]) -> Option<&'static Encoding>

This method behaves the same as for_label(), except when for_label() would return Some(REPLACEMENT), this method returns None instead.

This method is useful in scenarios where an invalid label must result in a fatal error, because in those cases the caller typically wishes to treat the labels that map to the replacement encoding as fatal errors, too.

Available via the C wrapper.

fn for_name(name: &[u8]) -> Option<&'static Encoding>

If the argument matches exactly (case-sensitively; no whitespace removal performed) the name of an encoding, returns Some(&'static Encoding) representing that encoding. Otherwise, returns None.

The motivating use case for this method is interoperability with legacy Gecko code that represents encodings as name strings instead of type-safe Encoding objects. Using this method for other purposes is most likely the wrong thing to do.

XXX: Should this method be made FFI-only to discourage Rust callers?

Available via the C wrapper.

fn for_bom(buffer: &[u8]) -> Option<&'static Encoding>

Performs non-incremental BOM sniffing.

The argument must either be a buffer representing the entire input stream (non-streaming case) or a buffer representing at least the first three bytes of the input stream (streaming case).

Returns Some(UTF_8), Some(UTF_16LE) or Some(UTF_16BE) if the argument starts with the UTF-8, UTF-16LE or UTF-16BE BOM or None otherwise.

Available via the C wrapper.
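A minimal sketch of non-incremental BOM sniffing, using only the standard library. `sniff_bom` is a hypothetical function; unlike for_bom() as documented here, it returns an encoding name plus the BOM length in bytes, which is an addition made for illustration.

```rust
/// Hypothetical helper: returns the name of the encoding whose BOM starts the
/// buffer, together with the BOM length in bytes, or None if there is no BOM.
fn sniff_bom(buffer: &[u8]) -> Option<(&'static str, usize)> {
    if buffer.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(("UTF-8", 3))
    } else if buffer.starts_with(&[0xFF, 0xFE]) {
        Some(("UTF-16LE", 2))
    } else if buffer.starts_with(&[0xFE, 0xFF]) {
        Some(("UTF-16BE", 2))
    } else {
        None
    }
}
```

The longest BOM is three bytes, which is why a streaming caller needs at least the first three bytes of the stream before sniffing: a buffer holding only 0xEF 0xBB cannot be classified.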

fn name(&'static self) -> &'static str

Returns the name of this encoding.

This name is appropriate to return as-is from the DOM document.characterSet property.

Available via the C wrapper.

fn can_encode_everything(&'static self) -> bool

Checks whether the output encoding of this encoding can encode every char. (Only true if the output encoding is UTF-8.)

Available via the C wrapper.

fn is_ascii_compatible(&'static self) -> bool

Checks whether the bytes 0x00...0x7F map exclusively to the characters U+0000...U+007F and vice versa.

Available via the C wrapper.

fn output_encoding(&'static self) -> &'static Encoding

Returns the output encoding of this encoding. This is UTF-8 for UTF-16BE, UTF-16LE and replacement and the encoding itself otherwise.

Available via the C wrapper.
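The output encoding rule, and the way can_encode_everything() follows from it, can be sketched as a small mapping. `EncodingId` and both functions are hypothetical stand-ins that model only five encodings; they are not the crate's types.

```rust
/// Hypothetical identifier for a few encodings (the real set is larger).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum EncodingId {
    Utf8,
    Utf16Le,
    Utf16Be,
    Replacement,
    Windows1252,
}

/// UTF-16LE, UTF-16BE and replacement encode as UTF-8; every other encoding
/// is its own output encoding.
fn output_encoding(enc: EncodingId) -> EncodingId {
    match enc {
        EncodingId::Utf16Le | EncodingId::Utf16Be | EncodingId::Replacement => EncodingId::Utf8,
        other => other,
    }
}

/// Every char is encodable only when the output encoding is UTF-8.
fn can_encode_everything(enc: EncodingId) -> bool {
    output_encoding(enc) == EncodingId::Utf8
}
```

Under this model, can_encode_everything is true for UTF-16LE even though the bytes produced are UTF-8, not UTF-16LE.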

fn new_decoder(&'static self) -> Decoder

Instantiates a new decoder for this encoding with BOM sniffing enabled.

BOM sniffing may cause the returned decoder to morph into a decoder for UTF-8, UTF-16LE or UTF-16BE instead of this encoding.

Available via the C wrapper.

fn new_decoder_with_bom_removal(&'static self) -> Decoder

Instantiates a new decoder for this encoding with BOM removal.

If the input starts with bytes that are the BOM for this encoding, those bytes are removed. However, the decoder never morphs into a decoder for another encoding: A BOM for another encoding is treated as (potentially malformed) input to the decoding algorithm for this encoding.

Available via the C wrapper.

fn new_decoder_without_bom_handling(&'static self) -> Decoder

Instantiates a new decoder for this encoding with BOM handling disabled.

If the input starts with bytes that look like a BOM, those bytes are not treated as a BOM. (Hence, the decoder never morphs into a decoder for another encoding.)

Note: If the caller has performed BOM sniffing on its own but has not removed the BOM, the caller should use new_decoder_with_bom_removal() instead of this method to cause the BOM to be removed.

Available via the C wrapper.
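The difference between the three decoder constructors comes down to what happens to a leading BOM. The sketch below models only that decision, under the assumption that encodings are identified by name strings; `BomHandling` and `resolve_bom` are hypothetical and do not exist in the crate.

```rust
/// Hypothetical modes mirroring the three constructors described above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BomHandling {
    Sniff,  // like new_decoder()
    Remove, // like new_decoder_with_bom_removal()
    Off,    // like new_decoder_without_bom_handling()
}

/// Returns the name of the encoding to decode with and the number of leading
/// bytes to skip as a BOM.
fn resolve_bom(own: &'static str, mode: BomHandling, input: &[u8]) -> (&'static str, usize) {
    static BOMS: [(&str, &[u8]); 3] = [
        ("UTF-8", &[0xEF, 0xBB, 0xBF]),
        ("UTF-16LE", &[0xFF, 0xFE]),
        ("UTF-16BE", &[0xFE, 0xFF]),
    ];
    match mode {
        // Bytes that look like a BOM are ordinary input.
        BomHandling::Off => (own, 0),
        // Only this encoding's own BOM is skipped; the encoding never changes.
        BomHandling::Remove => {
            for (name, bom) in BOMS.iter() {
                if *name == own && input.starts_with(bom) {
                    return (own, bom.len());
                }
            }
            (own, 0)
        }
        // Any of the three BOMs overrides the encoding and is skipped.
        BomHandling::Sniff => {
            for (name, bom) in BOMS.iter() {
                if input.starts_with(bom) {
                    return (*name, bom.len());
                }
            }
            (own, 0)
        }
    }
}
```

For input starting with the UTF-8 BOM and `own` set to "windows-1252", the Sniff mode switches to UTF-8, while Remove and Off keep windows-1252 and pass the three bytes on to its decoding algorithm.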

fn new_encoder(&'static self) -> Encoder

Instantiates a new encoder for the output encoding of this encoding.

Available via the C wrapper.

fn decode(&'static self, bytes: &[u8]) -> (String, &'static Encoding, bool)

Convenience method for decoding to String with BOM sniffing and with malformed sequences replaced with the REPLACEMENT CHARACTER when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

This method implements the (non-streaming version of the) decode spec concept.

The second item in the returned tuple is the encoding that was actually used (which may differ from this encoding thanks to BOM sniffing).

The third item in the returned tuple indicates whether there were malformed sequences (that were replaced with the REPLACEMENT CHARACTER).

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder() when decoding segmented input.

This method performs a single heap allocation for the backing buffer of the String.

Available to Rust only.
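To see why a whole-buffer method must not be applied to individual segments, consider a multi-byte sequence that straddles a segment boundary. The sketch below uses the standard library's `String::from_utf8_lossy` as a stand-in for a non-streaming decode of UTF-8 input; both function names are hypothetical.

```rust
/// Decodes a complete buffer in one call, treating its end as the end of the stream.
fn decode_whole(input: &[u8]) -> String {
    String::from_utf8_lossy(input).into_owned()
}

/// Incorrect on purpose: decodes each segment as if it were a complete stream,
/// so a sequence split across two segments is reported as malformed twice.
fn decode_segments_naively(segments: &[&[u8]]) -> String {
    segments.iter().map(|s| String::from_utf8_lossy(s)).collect()
}
```

For example, U+00E9 is the two bytes 0xC3 0xA9 in UTF-8. Passed as one buffer they decode to a single character; passed as two one-byte segments, each half becomes a REPLACEMENT CHARACTER. A streaming decoder avoids this by carrying the incomplete sequence over to the next buffer.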

fn decode_with_bom_removal(&'static self, bytes: &[u8]) -> (String, bool)

Convenience method for decoding to String with BOM removal and with malformed sequences replaced with the REPLACEMENT CHARACTER when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

When invoked on UTF_8, this method implements the (non-streaming version of the) UTF-8 decode spec concept.

The second item in the returned pair indicates whether there were malformed sequences (that were replaced with the REPLACEMENT CHARACTER).

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder_with_bom_removal() when decoding segmented input.

This method performs a single heap allocation for the backing buffer of the String.

Available to Rust only.

fn decode_without_bom_handling(&'static self, bytes: &[u8]) -> (String, bool)

Convenience method for decoding to String without BOM handling and with malformed sequences replaced with the REPLACEMENT CHARACTER when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

When invoked on UTF_8, this method implements the (non-streaming version of the) UTF-8 decode without BOM spec concept.

The second item in the returned pair indicates whether there were malformed sequences (that were replaced with the REPLACEMENT CHARACTER).

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder_without_bom_handling() when decoding segmented input.

This method performs a single heap allocation for the backing buffer of the String.

Available to Rust only.

fn decode_without_bom_handling_and_without_replacement(&'static self, bytes: &[u8]) -> Option<String>

Convenience method for decoding to String without BOM handling and with malformed sequences treated as fatal when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

When invoked on UTF_8, this method implements the (non-streaming version of the) UTF-8 decode without BOM or fail spec concept.

Returns None if a malformed sequence was encountered and the result of the decode as Some(String) otherwise.

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder_without_bom_handling() when decoding segmented input.

This method performs a single heap allocation for the backing buffer of the String.

Available to Rust only.
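For the UTF-8 case only, the three BOM-explicit variants can be approximated with the standard library, which may help when comparing their return shapes. This is a sketch: the function names are hypothetical, and it assumes that `String::from_utf8_lossy` replaces malformed sequences the same way the Encoding Standard's UTF-8 decoder does.

```rust
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// No BOM handling, malformed sequences replaced with U+FFFD.
/// The bool reports whether anything was replaced.
fn utf8_decode_without_bom_handling(bytes: &[u8]) -> (String, bool) {
    match std::str::from_utf8(bytes) {
        Ok(s) => (s.to_owned(), false),
        Err(_) => (String::from_utf8_lossy(bytes).into_owned(), true),
    }
}

/// Same, but a leading UTF-8 BOM is dropped first.
fn utf8_decode_with_bom_removal(bytes: &[u8]) -> (String, bool) {
    let body = if bytes.starts_with(UTF8_BOM) {
        &bytes[UTF8_BOM.len()..]
    } else {
        bytes
    };
    utf8_decode_without_bom_handling(body)
}

/// No BOM handling, and any malformed sequence is fatal.
fn utf8_decode_or_fail(bytes: &[u8]) -> Option<String> {
    std::str::from_utf8(bytes).ok().map(str::to_owned)
}
```

Without BOM handling, a leading BOM is not an error: it decodes to U+FEFF and stays in the output.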

fn encode(&'static self, string: &str) -> (Vec<u8>, &'static Encoding, bool)

Convenience method for encoding to Vec<u8> with unmappable characters replaced with decimal numeric character references when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

This method implements the (non-streaming version of the) encode spec concept. For the UTF-8 encode spec concept, use string.as_bytes() instead of invoking this method on UTF_8.

The second item in the returned tuple is the encoding that was actually used (which may differ from this encoding thanks to some encodings having UTF-8 as their output encoding).

The third item in the returned tuple indicates whether there were unmappable characters (that were replaced with HTML numeric character references).

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_encoder() when encoding segmented output.

This method performs a single heap allocation for the backing buffer of the Vec<u8> if there are no unmappable characters and potentially multiple heap allocations if there are. These allocations are tuned for jemalloc and may not be optimal when using a different allocator that doesn't use power-of-two buckets.

Available to Rust only.
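The replacement of unmappable characters with decimal numeric character references can be sketched for a toy target that can represent only ASCII. `encode_ascii_with_ncr` is hypothetical; ASCII-only is not one of the encodings in the Encoding Standard and is used here purely to keep the mapping trivial.

```rust
/// Hypothetical encoder for an ASCII-only target. Returns the encoded bytes
/// and whether any character had to be replaced with a numeric character reference.
fn encode_ascii_with_ncr(input: &str) -> (Vec<u8>, bool) {
    // One allocation suffices when every character is mappable;
    // references are longer than the characters they replace, so the
    // buffer may grow otherwise.
    let mut out = Vec::with_capacity(input.len());
    let mut had_unmappable = false;
    for c in input.chars() {
        if c.is_ascii() {
            out.push(c as u8);
        } else {
            had_unmappable = true;
            // Decimal reference to the character's scalar value, e.g. &#8364;
            out.extend_from_slice(format!("&#{};", c as u32).as_bytes());
        }
    }
    (out, had_unmappable)
}
```

The growth of the buffer in the unmappable case corresponds to the "potentially multiple heap allocations" mentioned above.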

Trait Implementations

impl PartialEq for Encoding

fn eq(&self, other: &Encoding) -> bool

This method tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

This method tests for !=.

impl Eq for Encoding

impl Debug for Encoding

fn fmt(&self, f: &mut Formatter) -> Result

Formats the value using the given formatter.