Struct EncodingWithOffsets 

pub struct EncodingWithOffsets {
    pub ids: Vec<u32>,
    pub tokens: Vec<String>,
    pub offsets: Vec<(usize, usize)>,
}

The result of encoding text: token IDs, token strings, and each token’s character offsets in the source text.

Produced by a tokenizer’s encode_with_offsets method (or equivalent). Use it to map between character positions in the source text and token indices.

§Example

use candle_mi::EncodingWithOffsets;

let encoding = EncodingWithOffsets::new(
    vec![1, 2, 3],
    vec!["def".into(), " ".into(), "add".into()],
    vec![(0, 3), (3, 4), (4, 7)],
);

// Character 4 ('a' in "add") is in token 2
assert_eq!(encoding.char_to_token(4), Some(2));

Fields§

§ids: Vec<u32>

Token IDs.

§tokens: Vec<String>

Token strings.

§offsets: Vec<(usize, usize)>

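Character offsets of each token in the source text, as a (start, end) pair. In the example above, character 4 maps to the token with offsets (4, 7) rather than the one with (3, 4), so end is exclusive.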

Implementations§

impl EncodingWithOffsets

pub const fn new(ids: Vec<u32>, tokens: Vec<String>, offsets: Vec<(usize, usize)>) -> Self

Create a new encoding with offsets.

pub fn tokens_with_offsets(&self) -> Vec<TokenWithOffset>

Get tokens with their character offsets.
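
Because tokens and offsets are public fields, the same pairing can be sketched by zipping them. The helper below is a standalone illustration on plain slices, not the crate's implementation. The name pair_tokens_with_offsets and the tuple return type are assumptions, since the fields of TokenWithOffset are not shown on this page.

```rust
/// Hypothetical helper (not part of candle_mi): pair each token string
/// with its (start, end) offsets. Assumes both slices are parallel,
/// i.e. index i in each describes the same token.
fn pair_tokens_with_offsets(
    tokens: &[String],
    offsets: &[(usize, usize)],
) -> Vec<(String, (usize, usize))> {
    tokens.iter().cloned().zip(offsets.iter().copied()).collect()
}

fn main() {
    let tokens = vec!["def".to_string(), " ".to_string(), "add".to_string()];
    let offsets = vec![(0, 3), (3, 4), (4, 7)];
    for (token, (start, end)) in pair_tokens_with_offsets(&tokens, &offsets) {
        println!("{token:?} spans {start}..{end}");
    }
}
```

Note that zip stops at the shorter input, so mismatched lengths are silently truncated in this sketch.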

pub fn char_to_token(&self, char_pos: usize) -> Option<usize>

Find the token index that contains the given character position.

Returns None if no token spans that position.
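
The lookup can be pictured as a linear scan over the offsets. The following is a minimal standalone sketch on a plain slice, not the crate's code. It assumes half-open ranges [start, end), which matches the struct-level example where character 4 maps to token 2.

```rust
/// Hypothetical standalone helper (not part of candle_mi): index of the
/// first token whose half-open range [start, end) contains `char_pos`.
fn char_to_token(offsets: &[(usize, usize)], char_pos: usize) -> Option<usize> {
    offsets
        .iter()
        .position(|&(start, end)| start <= char_pos && char_pos < end)
}

fn main() {
    let offsets = [(0, 3), (3, 4), (4, 7)];
    // Character 4 is the first character of the third token.
    assert_eq!(char_to_token(&offsets, 4), Some(2));
    // Position 7 is one past the last token, so nothing contains it.
    assert_eq!(char_to_token(&offsets, 7), None);
}
```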

pub fn char_to_token_fuzzy(&self, char_pos: usize) -> Option<usize>

Find the token index for a character position, with fuzzy fallback.

If the exact position isn’t contained in any token, returns the index of the closest token by midpoint distance.
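
One way to read "closest token by midpoint distance" is sketched below. This is a standalone illustration under stated assumptions, not the crate's implementation: ranges are half-open, a token's midpoint is (start + end) / 2, and ties go to the earlier token. Distances are compared on doubled values so no floating point is needed.

```rust
/// Hypothetical standalone helper (not part of candle_mi).
fn char_to_token_fuzzy(offsets: &[(usize, usize)], char_pos: usize) -> Option<usize> {
    // Exact containment first.
    if let Some(idx) = offsets
        .iter()
        .position(|&(start, end)| start <= char_pos && char_pos < end)
    {
        return Some(idx);
    }
    // Fallback: minimize |char_pos - (start + end) / 2|, scaled by 2.
    // `min_by_key` returns the first minimum, so ties favor the earlier token.
    offsets
        .iter()
        .enumerate()
        .min_by_key(|&(_, &(start, end))| (2 * char_pos).abs_diff(start + end))
        .map(|(idx, _)| idx)
}

fn main() {
    // Characters 3 and 4 fall in a gap between the two tokens.
    let offsets = [(0, 3), (5, 8)];
    assert_eq!(char_to_token_fuzzy(&offsets, 3), Some(0));
    assert_eq!(char_to_token_fuzzy(&offsets, 100), Some(1));
}
```

In this sketch, an empty encoding is the only case that yields None.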

pub fn char_to_token_start(&self, char_pos: usize) -> Option<usize>

Find the token index that starts at or after the given character position.
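
A minimal sketch of this lookup on a plain slice follows. It is not the crate's code, and it assumes offsets are sorted by start position, so the first match is also the nearest one.

```rust
/// Hypothetical standalone helper (not part of candle_mi): index of the
/// first token whose start is at or after `char_pos`.
fn char_to_token_start(offsets: &[(usize, usize)], char_pos: usize) -> Option<usize> {
    offsets.iter().position(|&(start, _)| start >= char_pos)
}

fn main() {
    let offsets = [(0, 3), (3, 4), (4, 7)];
    // Position 1 is inside token 0, so the next token start is token 1.
    assert_eq!(char_to_token_start(&offsets, 1), Some(1));
    // No token starts at or after position 5.
    assert_eq!(char_to_token_start(&offsets, 5), None);
}
```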

pub fn char_range_to_tokens(&self, start_char: usize, end_char: usize) -> Vec<usize>

Find all token indices that overlap with the given character range.
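
The overlap test can be sketched as follows. This standalone version works on a plain slice and is not the crate's implementation. It assumes both the query and the token offsets are half-open ranges, so two ranges overlap when each starts before the other ends.

```rust
/// Hypothetical standalone helper (not part of candle_mi): indices of all
/// tokens whose [start, end) range overlaps [start_char, end_char).
fn char_range_to_tokens(
    offsets: &[(usize, usize)],
    start_char: usize,
    end_char: usize,
) -> Vec<usize> {
    offsets
        .iter()
        .enumerate()
        .filter(|&(_, &(start, end))| start < end_char && start_char < end)
        .map(|(idx, _)| idx)
        .collect()
}

fn main() {
    let offsets = [(0, 3), (3, 4), (4, 7)];
    // Characters 2..5 touch all three tokens.
    assert_eq!(char_range_to_tokens(&offsets, 2, 5), vec![0, 1, 2]);
    // A range past the end of the text overlaps nothing.
    assert!(char_range_to_tokens(&offsets, 7, 9).is_empty());
}
```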

pub fn token_to_char_range(&self, token_idx: usize) -> Option<(usize, usize)>

Get the character range for a token index.
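
A typical use of the returned range is recovering the token's source text. The sketch below is standalone and not the crate's code. It assumes an out-of-range index yields None and that offsets count characters rather than bytes. The slice_chars helper is hypothetical.

```rust
/// Hypothetical standalone helper (not part of candle_mi).
fn token_to_char_range(offsets: &[(usize, usize)], token_idx: usize) -> Option<(usize, usize)> {
    offsets.get(token_idx).copied()
}

/// Hypothetical helper: extract the text of a character range.
/// Counts `char`s rather than bytes, so it also works for non-ASCII text.
fn slice_chars(text: &str, (start, end): (usize, usize)) -> String {
    text.chars()
        .skip(start)
        .take(end.saturating_sub(start))
        .collect()
}

fn main() {
    let text = "def add";
    let offsets = [(0, 3), (3, 4), (4, 7)];
    let range = token_to_char_range(&offsets, 2);
    assert_eq!(range, Some((4, 7)));
    assert_eq!(range.map(|r| slice_chars(text, r)), Some("add".to_string()));
    // Index 3 is past the last token.
    assert_eq!(token_to_char_range(&offsets, 3), None);
}
```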

pub const fn len(&self) -> usize

Number of tokens.

pub const fn is_empty(&self) -> bool

Whether the encoding is empty.

pub fn label_spans(&self, spans: &[(&str, Range<usize>)]) -> Vec<String>

Label each token by which named span it overlaps with.

For each token, finds the first span (by input order) whose byte range overlaps the token’s byte range. The last token matching each span label gets "_final" appended. Tokens matching no span receive "other".

§Example
use candle_mi::EncodingWithOffsets;

let enc = EncodingWithOffsets::new(
    vec![1, 2, 3, 4],
    vec!["The".into(), " Eiffel".into(), " Tower".into(), " is".into()],
    vec![(0, 3), (3, 10), (10, 16), (16, 19)],
);
let labels = enc.label_spans(&[("subject", 0..16), ("relation", 16..19)]);
assert_eq!(labels, vec!["subject", "subject", "subject_final", "relation_final"]);
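
The labeling rule described above can be sketched in two passes, which also shows the "other" case that the example does not exercise. This is a standalone illustration on a plain offsets slice, not the crate's implementation. It assumes token offsets and span ranges are half-open and in the same unit, and that span names are distinct.

```rust
use std::ops::Range;

/// Hypothetical standalone helper (not part of candle_mi).
fn label_spans(offsets: &[(usize, usize)], spans: &[(&str, Range<usize>)]) -> Vec<String> {
    // Pass 1: first overlapping span (by input order) wins, else "other".
    let mut labels: Vec<String> = offsets
        .iter()
        .map(|&(start, end)| {
            spans
                .iter()
                .find(|(_, range)| start < range.end && range.start < end)
                .map_or_else(|| "other".to_string(), |(name, _)| (*name).to_string())
        })
        .collect();
    // Pass 2: append "_final" to the last token carrying each span label.
    for (name, _) in spans {
        if let Some(last) = labels.iter().rposition(|label| label == name) {
            labels[last].push_str("_final");
        }
    }
    labels
}

fn main() {
    let offsets = [(0, 3), (3, 10), (10, 16), (16, 19)];
    // Only " Eiffel" and " Tower" fall inside the span.
    let labels = label_spans(&offsets, &[("subject", 4..16)]);
    assert_eq!(labels, vec!["other", "subject", "subject_final", "other"]);
}
```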

Trait Implementations§

impl Clone for EncodingWithOffsets

fn clone(&self) -> EncodingWithOffsets

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for EncodingWithOffsets

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.
