pub struct StringOffsets<C: ConfigType = AllConfig> { /* private fields */ }
Converts positions within a given string between UTF-8 byte offsets (the usual in Rust), UTF-16 code units, Unicode code points, and line numbers.
Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of Unicode code points. It’s therefore necessary to adjust string offsets when communicating across programming language boundaries. StringOffsets does these adjustments.
Each StringOffsets instance contains offset information for a single string. Building the data structure takes O(n) time and memory, but then most conversions are O(1).
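To make the three offset kinds concrete, here is a naive sketch that uses only the standard library and rescans the string on every call. The function names are hypothetical, and this is not how StringOffsets works internally; it only illustrates what the numbers mean and why they diverge once the text contains non-ASCII characters.

```rust
/// Number of UTF-16 code units that precede `byte_offset` in `s`.
/// Hypothetical helper: O(n) per call, unlike a precomputed index.
/// Offsets past the end are treated as pointing to the end of the string.
fn naive_utf8_to_utf16(s: &str, byte_offset: usize) -> usize {
    s.char_indices()
        .take_while(|&(i, _)| i < byte_offset)
        .map(|(_, c)| c.len_utf16())
        .sum()
}

/// Number of Unicode scalar values that precede `byte_offset` in `s`.
/// Same caveats as above: a linear scan, for illustration only.
fn naive_utf8_to_char(s: &str, byte_offset: usize) -> usize {
    s.char_indices()
        .take_while(|&(i, _)| i < byte_offset)
        .count()
}
```

For the string "a😀b", the byte offset 5 (the start of "b") corresponds to UTF-16 offset 3 and char offset 2, because the emoji occupies four UTF-8 bytes, two UTF-16 code units, and one scalar value.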
“UTF-8 Conversions with BitRank” is a blog post explaining the implementation.
§Converting offsets
The conversion methods follow a naming scheme that uses these terms for different kinds of offsets:
utf8 - UTF-8 byte offsets (Rust style).
utf16 - UTF-16 code unit offsets (JavaScript style).
char - Count of Unicode scalar values (Python style).
utf16_pos - Zero-based line number and utf16 offset within the line.
char_pos - Zero-based line number and char offset within the line.
For example, StringOffsets::utf8_to_utf16 converts a Rust byte offset to a number that will index to the same position in a JavaScript string. Offsets are expressed as usize or Pos values.
All methods accept arguments that are past the end of the string, interpreting them as pointing to the end of the string.
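As an illustration of the char_pos kind of offset, the following sketch computes a zero-based line number and a char offset within that line by scanning the string. The LineCol struct and the function name are hypothetical stand-ins chosen for this example; the crate's own Pos type may be shaped differently, and the real conversion does not rescan the text.

```rust
/// Hypothetical stand-in for a line/column position.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct LineCol {
    /// Zero-based line number.
    line: usize,
    /// Offset within the line, counted in Unicode scalar values.
    col: usize,
}

/// Naive O(n) conversion from a UTF-8 byte offset to a line/column pair.
/// Offsets past the end behave like the end of the string, because the
/// scan simply runs out of characters.
fn naive_utf8_to_char_pos(s: &str, byte_offset: usize) -> LineCol {
    let mut pos = LineCol { line: 0, col: 0 };
    for (i, c) in s.char_indices() {
        if i >= byte_offset {
            break;
        }
        if c == '\n' {
            // The newline closes the current line; the next char starts a new one.
            pos.line += 1;
            pos.col = 0;
        } else {
            pos.col += 1;
        }
    }
    pos
}
```

With the text "ab\ncd", byte offset 4 (the "d") maps to line 1, column 1, and any offset beyond the string maps to the same position as the end of the string.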
§Converting ranges
Some methods translate position ranges. These are expressed as Range<usize> except for line, which is a usize:
line - Zero-based line numbers. The range a line refers to is the whole line, including the trailing newline character if any.
lines - A range of line numbers.
utf8s - UTF-8 byte ranges.
utf16s - UTF-16 code unit ranges.
chars - Ranges of Unicode scalar values.
When mapping offsets to line ranges, it is important to use a _to_lines function in order to end up with the correct line range. We have these methods because if you tried to do it yourself you would screw it up; use them! (And see the source code for StringOffsets::utf8s_to_lines if you don’t believe us.)
§Complexity
Most operations run in O(1) time. A few require O(log n) time. The memory consumed by this data structure is typically less than the memory occupied by the actual content. In the best case, it requires ~45% of the content space. One can reduce memory requirements further by only requesting the necessary features via the configuration type.
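The trade-off between an O(n) build and fast lookups can be sketched with a much simpler index than the one this type uses: a sorted table of line start offsets, queried by binary search. This is an assumption-laden illustration, not the crate's data structure. The LineIndex name and its layout are hypothetical, and its lookups are O(log n) rather than O(1).

```rust
/// Hypothetical line index: one pass over the text at build time,
/// then logarithmic-time lookups.
struct LineIndex {
    /// Byte offset at which each line starts; always begins with 0.
    line_starts: Vec<usize>,
    /// Total length of the indexed text in bytes.
    len: usize,
}

impl LineIndex {
    /// O(n): record the offset that follows every newline byte.
    fn new(s: &str) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in s.bytes().enumerate() {
            if b == b'\n' {
                line_starts.push(i + 1);
            }
        }
        Self { line_starts, len: s.len() }
    }

    /// O(log n): the line containing `offset` is the last line whose start
    /// is <= offset, so a newline counts as part of the line it terminates.
    fn utf8_to_line(&self, offset: usize) -> usize {
        let offset = offset.min(self.len);
        self.line_starts.partition_point(|&start| start <= offset) - 1
    }

    /// O(1): start of the given line, or the text length if out of range.
    fn line_to_utf8_begin(&self, line: usize) -> usize {
        self.line_starts.get(line).copied().unwrap_or(self.len)
    }
}
```

A table like this costs one usize per line, which is cheap for line lookups but does not help with UTF-16 or char conversions; those need per-character information, which is where a more compact representation pays off.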
Implementations§
impl<C: ConfigType> StringOffsets<C>
pub fn new(content: &str) -> Self
Create a new converter to work with offsets into the given string.
pub fn from_bytes(content: &[u8]) -> Self
Create a new converter to work with offsets into the given byte-string.
If content is UTF-8, this is just like StringOffsets::new. Otherwise, the conversion methods will produce unspecified (but memory-safe) results.
impl<C: ConfigType<HasLines = True>> StringOffsets<C>
pub fn line_to_utf8_begin(&self, line_number: usize) -> usize
Return the byte offset of the first character on the specified (zero-based) line.
If line_number is greater than or equal to the number of lines in the text, this returns the length of the string.
pub fn line_to_utf8_end(&self, line_number: usize) -> usize
UTF-8 offset one past the end of a line (the offset of the start of the next line).
pub fn utf8_to_line(&self, byte_number: usize) -> usize
Return the zero-based line number of the line containing the specified UTF-8 offset. Newline characters count as part of the preceding line.
pub fn utf8s_to_lines(&self, bytes: Range<usize>) -> Range<usize>
Returns the range of line numbers containing the substring specified by the Rust-style range bytes. Newline characters count as part of the preceding line.
If bytes is an empty range at a position within or at the beginning of a line, this returns a nonempty range containing the line number of that one line. An empty range at or beyond the end of the string translates to an empty range of line numbers.
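The rules above can be sketched with a naive scan. The subtle part is that bytes.end is one past the last byte of the range, so the last line must be derived from bytes.end - 1 rather than from bytes.end. The helper names below are hypothetical, the code is not the crate's implementation, and the way it counts lines (a final line without a trailing newline still counts as a line) is an assumption made for this example.

```rust
use std::ops::Range;

/// Zero-based line of the byte at `offset`: the number of newline bytes
/// strictly before it, so a newline belongs to the line it terminates.
fn line_of(s: &str, offset: usize) -> usize {
    let end = offset.min(s.len());
    s.as_bytes()[..end].iter().filter(|&&b| b == b'\n').count()
}

/// Assumed line count: every newline ends a line, and trailing text
/// without a newline forms one more line.
fn line_count(s: &str) -> usize {
    let newlines = s.bytes().filter(|&b| b == b'\n').count();
    if s.is_empty() || s.ends_with('\n') {
        newlines
    } else {
        newlines + 1
    }
}

/// Naive O(n) mapping from a byte range to the range of lines it touches.
fn naive_utf8s_to_lines(s: &str, bytes: Range<usize>) -> Range<usize> {
    let start = bytes.start.min(s.len());
    let end = bytes.end.min(s.len()).max(start);
    if start >= s.len() {
        // At or beyond the end of the string: an empty range of lines.
        let n = line_count(s);
        return n..n;
    }
    let first = line_of(s, start);
    if start == end {
        // Empty range inside a line: report that one line.
        return first..first + 1;
    }
    // `end` is exclusive, so the last byte actually covered is `end - 1`.
    let last = line_of(s, end - 1);
    first..last + 1
}
```

With the text "ab\ncd\nef", the byte range 0..3 covers exactly "ab\n" and maps to lines 0..1. Using line_of(3) for the end would wrongly pull in line 1, which is the kind of off-by-one the dedicated method exists to prevent.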
pub fn line_to_utf8s(&self, line_number: usize) -> Range<usize>
UTF-8 offsets for the beginning and end of a line, including the newline if any.
impl<C: ConfigType<HasChars = True, HasLines = True>> StringOffsets<C>
pub fn line_chars(&self, line_number: usize) -> usize
Returns the number of Unicode characters on the specified line.
pub fn line_to_char_begin(&self, line_number: usize) -> usize
UTF-32 offset of the first character of a line.
That is, return the offset that would point to the start of that line in a UTF-32 representation of the source string.
pub fn line_to_char_end(&self, line_number: usize) -> usize
UTF-32 offset one past the end of a line (the offset of the start of the next line).
pub fn line_to_chars(&self, line_number: usize) -> Range<usize>
UTF-32 offsets for the beginning and end of a line, including the newline if any.
pub fn lines_to_chars(&self, line_numbers: Range<usize>) -> Range<usize>
UTF-32 offsets for the beginning and end of a range of lines, including the newline if any.
pub fn utf8_to_char_pos(&self, byte_number: usize) -> Pos
Converts a UTF-8 offset to a zero-based line number and UTF-32 offset within the line.
impl<C: ConfigType<HasWhitespace = True>> StringOffsets<C>
pub fn only_whitespaces(&self, line_number: usize) -> bool
Returns true if the specified line is empty except for whitespace.
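A naive equivalent of this check, written against the raw text rather than the precomputed index, might look like the sketch below. The function name is hypothetical, and returning None for a line number that does not exist is a choice made for this example, not a statement about what the method itself returns in that case.

```rust
/// Naive O(n) check: does the given zero-based line contain only whitespace?
/// Lines are split after each '\n', so the newline itself is part of the
/// line and counts as whitespace. Returns None if there is no such line.
fn naive_only_whitespace(s: &str, line_number: usize) -> Option<bool> {
    s.split_inclusive('\n')
        .nth(line_number)
        .map(|line| line.chars().all(char::is_whitespace))
}
```

For the text "a\n \t\nb", line 1 consists of a space, a tab, and a newline, so the check yields Some(true), while lines 0 and 2 yield Some(false).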
impl<C: ConfigType<HasChars = True>> StringOffsets<C>
pub fn utf8_to_char(&self, byte_number: usize) -> usize
Converts a UTF-8 offset to a UTF-32 offset.
pub fn char_to_utf8(&self, char_number: usize) -> usize
Converts a UTF-32 offset to a UTF-8 offset.
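For comparison, the same conversion can be done without an index by walking the string, at O(n) cost per call. This is a sketch with a hypothetical function name that relies only on the standard library; it is useful mainly as a reference when checking results.

```rust
/// Naive O(n) conversion from a char (Unicode scalar value) offset to a
/// UTF-8 byte offset. Offsets past the end map to the length of the string.
fn naive_char_to_utf8(s: &str, char_offset: usize) -> usize {
    s.char_indices()
        .nth(char_offset)
        .map_or(s.len(), |(byte_index, _)| byte_index)
}
```

In "a😀b", char offset 2 (the "b") maps to byte offset 5, since the preceding emoji takes four bytes in UTF-8.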
impl<C: ConfigType<HasChars = True, HasUtf16 = True>> StringOffsets<C>
pub fn utf8_to_utf16(&self, byte_number: usize) -> usize
Converts a UTF-8 offset to a UTF-16 offset.
impl<C: ConfigType<HasChars = True, HasLines = True, HasUtf16 = True>> StringOffsets<C>
pub fn line_to_utf16_begin(&self, line_number: usize) -> usize
UTF-16 offset of the first character of a line.
That is, return the offset that would point to the start of that line in a UTF-16 representation of the source string.
pub fn line_to_utf16_end(&self, line_number: usize) -> usize
UTF-16 offset one past the end of a line (the offset of the start of the next line).
pub fn utf8_to_utf16_pos(&self, byte_number: usize) -> Pos
Converts a UTF-8 offset to a zero-based line number and UTF-16 offset within the line.