Struct StringOffsets

pub struct StringOffsets<C: ConfigType = AllConfig> { /* private fields */ }

Converts positions within a given string between UTF-8 byte offsets (the usual in Rust), UTF-16 code units, Unicode code points, and line numbers.

Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of Unicode code points. It’s therefore necessary to adjust string offsets when communicating across programming language boundaries. StringOffsets does these adjustments.
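To make the difference between the three units concrete, here is a small sketch that uses only the Rust standard library (it does not call StringOffsets, and the helper name `lengths` is made up for illustration). It measures the same string in bytes, UTF-16 code units, and Unicode scalar values.

```rust
/// Hypothetical helper: the length of `s` in the three units discussed above.
fn lengths(s: &str) -> (usize, usize, usize) {
    let utf8 = s.len(); // UTF-8 bytes (Rust style)
    let utf16 = s.encode_utf16().count(); // UTF-16 code units (JavaScript style)
    let chars = s.chars().count(); // Unicode scalar values (Python style)
    (utf8, utf16, chars)
}
```

For `"a\u{e9}\u{1F980}"` (an ASCII letter, a two-byte character, and a four-byte character outside the Basic Multilingual Plane) this yields `(7, 4, 3)`, so an offset that is valid in one language is generally wrong in another.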

Each StringOffsets instance contains offset information for a single string. Building the data structure takes O(n) time and memory, but then most conversions are O(1).

“UTF-8 Conversions with BitRank” is a blog post explaining the implementation.

§Converting offsets

The conversion methods follow a naming scheme that uses these terms for different kinds of offsets:

utf8 - UTF-8 byte offsets (Rust style).
utf16 - UTF-16 code unit offsets (JavaScript style).
char - Count of Unicode scalar values (Python style).
utf16_pos - Zero-based line number and utf16 offset within the line.
char_pos - Zero-based line number and char offset within the line.

For example, StringOffsets::utf8_to_utf16 converts a Rust byte offset to a number that will index to the same position in a JavaScript string. Offsets are expressed as usize or Pos values.

All methods accept arguments that are past the end of the string, interpreting them as pointing to the end of the string.
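As a reference point for what these conversions compute, the following is a naive O(n) sketch written against the standard library only. The function names are hypothetical and this is not how StringOffsets is implemented; it merely reproduces the documented contract for offsets on character boundaries, including the rule that offsets past the end are treated as the end. How an offset that falls inside a multi-byte character is rounded is an arbitrary choice of this sketch.

```rust
/// Naive stand-in for a UTF-8 to UTF-16 offset conversion (illustration only).
fn naive_utf8_to_utf16(text: &str, byte_offset: usize) -> usize {
    text.char_indices()
        // Keep every character that starts before the requested byte offset.
        .take_while(|&(i, _)| i < byte_offset)
        // Each character contributes one or two UTF-16 code units.
        .map(|(_, c)| c.len_utf16())
        .sum()
}

/// Naive stand-in for a UTF-8 to char (scalar value) offset conversion.
fn naive_utf8_to_char(text: &str, byte_offset: usize) -> usize {
    text.char_indices()
        .take_while(|&(i, _)| i < byte_offset)
        .count()
}
```

Because both functions simply stop when the string runs out, an offset such as 100 into an 8-byte string produces the same result as offset 8.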

§Converting ranges

Some methods translate position ranges. These are expressed as Range<usize> except for line, which is a usize:

line - Zero-based line numbers. The range a line refers to is the whole line, including the trailing newline character if any.
lines - A range of line numbers.
utf8s - UTF-8 byte ranges.
utf16s - UTF-16 code unit ranges.
chars - Ranges of Unicode scalar values.

When mapping offsets to line ranges, it is important to use a _to_lines function in order to end up with the correct line range. We have these methods because if you tried to do it yourself you would screw it up; use them! (And see the source code for StringOffsets::utf8s_to_lines if you don’t believe us.)

§Complexity

Most operations run in O(1) time. A few require O(log n) time. The memory consumed by this data structure is typically less than the memory occupied by the actual content. In the best case, it requires ~45% of the content space. One can reduce memory requirements further by only requesting the necessary features via the configuration type.

Implementations§

impl<C: ConfigType> StringOffsets<C>

pub fn new(content: &str) -> Self

Create a new converter to work with offsets into the given string.

pub fn from_bytes(content: &[u8]) -> Self

Create a new converter to work with offsets into the given byte-string.

If content is UTF-8, this is just like StringOffsets::new. Otherwise, the conversion methods will produce unspecified (but memory-safe) results.

impl<C: ConfigType<HasLines = True>> StringOffsets<C>

pub fn len(&self) -> usize

Returns the number of bytes in the string.

pub fn is_empty(&self) -> bool

Returns whether there are no bytes in the string.

pub fn lines(&self) -> usize

Returns the number of lines in the string.

pub fn line_to_utf8_begin(&self, line_number: usize) -> usize

Return the byte offset of the first character on the specified (zero-based) line.

If line_number is greater than or equal to the number of lines in the text, this returns the length of the string.

pub fn line_to_utf8_end(&self, line_number: usize) -> usize

UTF-8 offset one past the end of a line (the offset of the start of the next line).

pub fn utf8_to_line(&self, byte_number: usize) -> usize

Return the zero-based line number of the line containing the specified UTF-8 offset. Newline characters count as part of the preceding line.
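One simple way to picture this lookup is a sorted table of line-start offsets searched with binary search. The sketch below uses that swapped-in technique (O(log n) per query), not the crate's own data structure, and `LineTable` is a hypothetical type. It also assumes that offsets at or past the end of the text map to the line count, which the text above does not spell out.

```rust
/// Hypothetical helper, not part of the crate: a sorted table of line starts.
struct LineTable {
    /// Byte offset at which each line begins, in increasing order.
    starts: Vec<usize>,
    /// Length of the text in bytes.
    len: usize,
}

impl LineTable {
    fn new(text: &str) -> Self {
        let mut starts = Vec::new();
        if !text.is_empty() {
            starts.push(0);
        }
        // A new line begins right after every newline that is not the last byte.
        starts.extend(
            text.match_indices('\n')
                .map(|(i, _)| i + 1)
                .filter(|&start| start < text.len()),
        );
        LineTable { starts, len: text.len() }
    }

    /// Zero-based line containing `byte_offset`; a newline belongs to the line it ends.
    /// Assumption of this sketch: offsets at or past the end map to the line count.
    fn utf8_to_line(&self, byte_offset: usize) -> usize {
        if byte_offset >= self.len {
            return self.starts.len();
        }
        // Count the line starts at or before the offset, then step back by one.
        self.starts.partition_point(|&start| start <= byte_offset) - 1
    }
}
```

With the text `"ab\ncd\nef"`, offset 2 (the first newline) maps to line 0 and offset 3 (the `c`) maps to line 1.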

pub fn utf8s_to_lines(&self, bytes: Range<usize>) -> Range<usize>

Returns the range of line numbers containing the substring specified by the Rust-style range bytes. Newline characters count as part of the preceding line.

If bytes is an empty range at a position within or at the beginning of a line, this returns a nonempty range containing the line number of that one line. An empty range at or beyond the end of the string translates to an empty range of line numbers.
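The rules above are easy to get wrong because the start of the byte range rounds down to the beginning of its line while the end rounds up to the end of its line. The following standalone sketch reproduces the documented behavior with a plain line-start table; all names are hypothetical, it is not the crate's code, and it assumes `bytes.start <= bytes.end`.

```rust
use std::ops::Range;

/// Byte offsets at which lines begin (hypothetical helper for this sketch).
fn line_starts(text: &str) -> Vec<usize> {
    let mut starts = Vec::new();
    if !text.is_empty() {
        starts.push(0);
    }
    // A new line begins right after every newline that is not the last byte.
    starts.extend(
        text.match_indices('\n')
            .map(|(i, _)| i + 1)
            .filter(|&start| start < text.len()),
    );
    starts
}

/// Line containing `offset`; offsets at or past the end map to the line count.
fn line_of(starts: &[usize], len: usize, offset: usize) -> usize {
    if offset >= len {
        return starts.len();
    }
    starts.partition_point(|&start| start <= offset) - 1
}

/// Naive stand-in for a byte-range to line-range conversion.
fn naive_utf8s_to_lines(text: &str, bytes: Range<usize>) -> Range<usize> {
    let starts = line_starts(text);
    let first = line_of(&starts, text.len(), bytes.start);
    // The last byte actually covered, or the start itself for an empty range.
    let last_byte = bytes.end.saturating_sub(1).max(bytes.start);
    // One past the line of that byte, capped at the number of lines.
    let end = starts.len().min(line_of(&starts, text.len(), last_byte) + 1);
    first..end
}
```

For `"ab\ncd\nef"`, the byte range `0..3` (which ends just after the first newline) maps to lines `0..1`, the empty range `3..3` maps to `1..2`, and the empty range `8..8` at the end of the text maps to the empty range `3..3`. Writing `line_of(start)..line_of(end)` instead would turn `0..2` into the empty range `0..0`, which is the kind of mistake a dedicated method avoids.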

pub fn line_to_utf8s(&self, line_number: usize) -> Range<usize>

UTF-8 offsets for the beginning and end of a line, including the newline if any.

pub fn lines_to_utf8s(&self, line_numbers: Range<usize>) -> Range<usize>

UTF-8 offsets for the beginning and end of a range of lines, including the newline if any.
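A naive way to see what this range looks like is to scan for newlines with the standard library. The sketch below is an O(n) illustration with hypothetical function names, not the crate's implementation; it assumes a line range maps to the span from the start of its first line to the start of the line just past its end.

```rust
use std::ops::Range;

/// Byte offset where `line_number` begins, or the text length if there is no such line.
fn naive_line_to_utf8_begin(text: &str, line_number: usize) -> usize {
    if line_number == 0 {
        return 0;
    }
    // Line n (n >= 1) starts right after the n-th newline.
    text.match_indices('\n')
        .nth(line_number - 1)
        .map(|(i, _)| i + 1)
        .unwrap_or(text.len())
}

/// Byte range covered by a range of lines, trailing newlines included.
fn naive_lines_to_utf8s(text: &str, lines: Range<usize>) -> Range<usize> {
    naive_line_to_utf8_begin(text, lines.start)..naive_line_to_utf8_begin(text, lines.end)
}
```

With `"ab\ncd\nef"`, lines `0..2` cover bytes `0..6`, which is the slice `"ab\ncd\n"`, and a line range entirely past the last line collapses to the empty range `8..8` at the end of the text.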

impl<C: ConfigType<HasChars = True, HasLines = True>> StringOffsets<C>

pub fn line_chars(&self, line_number: usize) -> usize

Returns the number of Unicode characters on the specified line.

pub fn line_to_char_begin(&self, line_number: usize) -> usize

UTF-32 offset of the first character of a line.

That is, return the offset that would point to the start of that line in a UTF-32 representation of the source string.

pub fn line_to_char_end(&self, line_number: usize) -> usize

UTF-32 offset one past the end of a line (the offset of the start of the next line).

pub fn line_to_chars(&self, line_number: usize) -> Range<usize>

UTF-32 offsets for the beginning and end of a line, including the newline if any.
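To illustrate how these character ranges differ from byte ranges, here is a naive standard-library sketch that walks the lines and counts scalar values. The function name is hypothetical and the linear scan is only for illustration; returning an empty range at the end for a line number past the last line is an assumption of this sketch.

```rust
use std::ops::Range;

/// Range of char (scalar value) offsets covered by a line, newline included.
fn naive_line_to_chars(text: &str, line_number: usize) -> Range<usize> {
    let mut begin = 0;
    // `split_inclusive` keeps each trailing newline attached to its line.
    for (n, line) in text.split_inclusive('\n').enumerate() {
        let count = line.chars().count();
        if n == line_number {
            return begin..begin + count;
        }
        begin += count;
    }
    // Past the last line: an empty range at the end of the text.
    begin..begin
}
```

For `"h\u{e9}\n\u{1F980}b"`, line 0 spans chars `0..3` and line 1 spans chars `3..5`, whereas the corresponding byte ranges would be `0..4` and `4..9`.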

pub fn lines_to_chars(&self, line_numbers: Range<usize>) -> Range<usize>

UTF-32 offsets for the beginning and end of a range of lines, including the newline if any.

pub fn utf8_to_char_pos(&self, byte_number: usize) -> Pos

Converts a UTF-8 offset to a zero-based line number and UTF-32 offset within the line.
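The line-and-column idea can be sketched with a single pass over the characters. The function below is a hypothetical, standard-library-only illustration that returns a `(line, column)` tuple rather than the crate's Pos type. Its handling of offsets at the very end of the text and of offsets inside a multi-byte character is a choice of this sketch and may differ from what StringOffsets returns.

```rust
/// Naive (line, char column) for a byte offset; both values are zero-based.
fn naive_utf8_to_char_pos(text: &str, byte_offset: usize) -> (usize, usize) {
    // Offsets past the end are treated as the end of the text.
    let offset = byte_offset.min(text.len());
    let mut line = 0;
    let mut col = 0;
    for (i, c) in text.char_indices() {
        if i >= offset {
            break;
        }
        if c == '\n' {
            // The newline ends the current line; the next char starts a new one.
            line += 1;
            col = 0;
        } else {
            col += 1;
        }
    }
    (line, col)
}
```

With `"h\u{e9}\n\u{1F980}b"`, byte offset 3 (the newline) maps to `(0, 2)` because the two-byte character counts as a single column, and byte offset 8 (the `b`) maps to `(1, 1)`.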

pub fn chars_to_lines(&self, chars: Range<usize>) -> Range<usize>

Returns the range of line numbers containing the substring specified by the UTF-32 range chars. Newline characters count as part of the preceding line.
