pub struct TextData { /* private fields */ }
A structure representing compiled text/matching data.
This is the key structure used to compare two texts against one another. It handles pre-processing the text to n-grams, scoring, and optimizing the result to try to identify specific details about a match.
Examples
Basic scoring of two texts:
use scancode_rust::askalono::TextData;
let license = TextData::from("My First License");
let sample = TextData::from("copyright 20xx me irl\n\n // my first license");
assert_eq!(sample.match_score(&license), 1.0);
The above example is a perfect match, as identifiable copyright statements are stripped out during pre-processing.
Building on that, TextData is able to tell you where in the text a license is located:
let sample = TextData::from("copyright 20xx me irl\n// My First License\nfn hello() {\n ...");
let (optimized, score) = sample.optimize_bounds(&license);
assert_eq!((1, 2), optimized.lines_view());
assert!(score > 0.99f32, "license within text matches");
Implementations
impl TextData
pub fn new(text: &str) -> TextData
Create a new TextData structure from a string.
The given text will be normalized, then broken down into n-grams for matching. By default, the normalized text is stored inside the structure for future diagnostics. This is necessary for optimizing a match and for diffing against other texts. If you don’t want this extra data, you can call without_text to throw it out. Generally, as a user of this library you want to keep the text data, but askalono will throw it away in its own Store as it’s not needed.
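The normalize-then-n-gram step described above can be pictured with a minimal sketch. The helpers `normalize` and `bigrams` are hypothetical and not part of this crate; the sketch assumes word bigrams (n = 2) and a very simple normalizer, and the library's real pre-processing (for example, stripping copyright statements) does more than this.

```rust
use std::collections::HashMap;

/// Hypothetical normalizer: lowercases the text and keeps only
/// alphanumeric words. The real library applies more rules than this.
fn normalize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Count overlapping word bigrams (n = 2 is an assumption of this sketch).
fn bigrams(words: &[String]) -> HashMap<(String, String), u32> {
    let mut counts = HashMap::new();
    for pair in words.windows(2) {
        *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
    }
    counts
}
```

With these helpers, `normalize("// My First License")` yields the words `my`, `first`, `license`, and `bigrams` turns them into two counted pairs. Matching then works on the counted pairs rather than on the raw text.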
pub fn without_text(self) -> Self
Consume this TextData, returning one without normalized/processed text stored.
Unless you know you don’t want the text, you probably don’t want to use this. Other methods on TextData require that text is present.
pub fn lines_view(&self) -> (usize, usize)
Get the bounds of the active line view.
This represents the “active” region of lines that matches are generated from. The bounds are a 0-indexed (start, end) tuple, with inclusive start and exclusive end indices. See optimize_bounds.
This is largely for informational purposes; other methods in TextData, such as lines and match_score, will already account for the line range. However, it’s useful to call it after running optimize_bounds to discover where in the input text the match was found.
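To make the half-open convention concrete, here is a small sketch that selects the lines covered by a (start, end) view. `lines_in_view` is a hypothetical helper written for illustration; it is not a method of TextData.

```rust
/// Return the lines selected by a 0-indexed, half-open (start, end) view.
/// `lines_in_view` is a hypothetical helper, not part of the library.
fn lines_in_view<'a>(text: &'a str, view: (usize, usize)) -> Vec<&'a str> {
    let (start, end) = view;
    text.lines()
        .skip(start)
        // `end` is exclusive, so the view covers `end - start` lines.
        .take(end.saturating_sub(start))
        .collect()
}
```

Under this reading, a view of (1, 2) covers exactly one line, the second line of the text, which is consistent with the optimize_bounds example at the top of this page.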
pub fn with_view(&self, start: usize, end: usize) -> Self
Clone this TextData, creating a copy with the given view.
This will re-generate match data for the given view. It’s used in optimize_bounds to shrink/expand the view of the text to discover bounds.
Other methods on TextData respect this boundary, so it’s not needed outside this struct.
pub fn white_out(&self) -> Self
“Erase” the current lines in view and restore the view to its original bounds.
For example, consider a file with two licenses in it. One was identified (and located) with optimize_bounds. Now you want to find the other: white-out the matched lines, and re-run the overall search to find a new high score.
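The two-license workflow can be sketched with a stand-in type. `Doc` below is hypothetical and only models lines plus a view; it assumes the "original bounds" are the whole text and that erased lines become empty strings, neither of which is confirmed by this page.

```rust
/// Hypothetical stand-in for a text with an active line view.
#[derive(Clone, Debug, PartialEq)]
struct Doc {
    lines: Vec<String>,
    view: (usize, usize), // 0-indexed, half-open
}

impl Doc {
    fn new(text: &str) -> Self {
        let lines: Vec<String> = text.lines().map(String::from).collect();
        let view = (0, lines.len());
        Doc { lines, view }
    }

    /// Copy of this document restricted to the given view.
    fn with_view(&self, start: usize, end: usize) -> Self {
        Doc { lines: self.lines.clone(), view: (start, end) }
    }

    /// Blank out the lines in view, then reset the view to the whole text
    /// (assumed here to be the original bounds).
    fn white_out(&self) -> Self {
        let (start, end) = self.view;
        let lines: Vec<String> = self
            .lines
            .iter()
            .enumerate()
            .map(|(i, l)| if i >= start && i < end { String::new() } else { l.clone() })
            .collect();
        let view = (0, lines.len());
        Doc { lines, view }
    }
}
```

After the first license is located at, say, view (0, 1), calling `white_out` leaves a full-width document whose first line is blank, so a second search can only score against the remaining lines.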
pub fn match_score(&self, other: &TextData) -> f32
Compare this TextData with another, returning a similarity score.
This is what’s used during analysis to rank licenses.
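One common way to score n-gram overlap is a Dice-style coefficient, sketched below. This is an illustration under assumptions: this page does not state which formula match_score uses, and `bigram_counts` and `dice_score` are hypothetical helpers with much simpler pre-processing than the library's.

```rust
use std::collections::HashMap;

/// Count word bigrams of a lowercased text (a simplified stand-in for
/// the library's pre-processing).
fn bigram_counts(text: &str) -> HashMap<String, u32> {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    let mut counts = HashMap::new();
    for pair in words.windows(2) {
        *counts.entry(format!("{} {}", pair[0], pair[1])).or_insert(0) += 1;
    }
    counts
}

/// Dice-style similarity: twice the shared bigram count divided by the
/// total bigram count of both texts. 1.0 means identical bigram multisets.
fn dice_score(a: &HashMap<String, u32>, b: &HashMap<String, u32>) -> f32 {
    let total: u32 = a.values().sum::<u32>() + b.values().sum::<u32>();
    if total == 0 {
        return 0.0;
    }
    // A bigram is shared as many times as it occurs in both texts.
    let shared: u32 = a
        .iter()
        .map(|(gram, &n)| n.min(*b.get(gram).unwrap_or(&0)))
        .sum();
    2.0 * shared as f32 / total as f32
}
```

In this sketch, two texts that differ only in letter case score 1.0, and texts with no bigrams in common score 0.0. Extra material in either text lowers the score, which is the effect optimize_bounds tries to remove by narrowing the view.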
pub fn optimize_bounds(&self, other: &TextData) -> (Self, f32)
Attempt to optimize a known match to locate possible line ranges.
Returns a new TextData struct and a score. The returned struct is a clone of self, with its view set to the best match against other.
This will respect any views set on the TextData (an optimized result won’t go outside the original view).
Note that this won’t be 100% optimal if there are blank lines surrounding the actual match, since successive blank lines in a range will likely have the same score.
You should check the value of lines_view on the returned struct to find the line ranges.
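The idea of searching for the best-scoring line range can be sketched with an exhaustive search. This is a brute-force stand-in, not the library's algorithm (which this page does not describe); `grams`, `score`, and `best_bounds` are hypothetical helpers, and the scoring is a simplified Dice-style overlap of word bigram sets.

```rust
use std::collections::HashSet;

/// Lowercased word bigrams of a slice of lines (simplified pre-processing).
fn grams(lines: &[&str]) -> HashSet<String> {
    let words: Vec<String> = lines
        .iter()
        .flat_map(|l| l.split(|c: char| !c.is_alphanumeric()))
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect();
    words.windows(2).map(|p| format!("{} {}", p[0], p[1])).collect()
}

/// Dice-style overlap of two bigram sets.
fn score(a: &HashSet<String>, b: &HashSet<String>) -> f32 {
    if a.is_empty() && b.is_empty() {
        return 0.0;
    }
    2.0 * a.intersection(b).count() as f32 / (a.len() + b.len()) as f32
}

/// Try every half-open (start, end) range and keep the best-scoring one.
/// The full range is scored first, and ties keep the earliest candidate.
fn best_bounds(lines: &[&str], target: &HashSet<String>) -> ((usize, usize), f32) {
    let mut best = ((0, lines.len()), score(&grams(lines), target));
    for start in 0..lines.len() {
        for end in (start + 1)..=lines.len() {
            let s = score(&grams(&lines[start..end]), target);
            if s > best.1 {
                best = ((start, end), s);
            }
        }
    }
    best
}
```

Because ties never replace the current best, a range padded with lines that contribute no bigrams (blank lines, for instance) scores the same as the tight range, which illustrates why the result may not be perfectly tight around the match.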
Trait Implementations
impl<'de> Deserialize<'de> for TextData
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
Auto Trait Implementations
impl Freeze for TextData
impl RefUnwindSafe for TextData
impl Send for TextData
impl Sync for TextData
impl Unpin for TextData
impl UnwindSafe for TextData
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.