pub struct TextData { /* private fields */ }
A structure representing compiled text/matching data.
This is the key structure used to compare two texts against one another. It handles pre-processing the text to n-grams, scoring, and optimizing the result to try to identify specific details about a match.
Examples
Basic scoring of two texts:
use askalono::TextData;
let license = TextData::from("My First License");
let sample = TextData::from("copyright 20xx me irl\n\n // my first license");
assert_eq!(sample.match_score(&license), 1.0);
The above example is a perfect match, as identifiable copyright statements are stripped out during pre-processing.
Building on that, TextData is able to tell you where in the text a license is located:
let sample = TextData::from("copyright 20xx me irl\n// My First License\nfn hello() {\n ...");
let (optimized, score) = sample.optimize_bounds(&license);
assert_eq!((1, 2), optimized.lines_view());
assert!(score > 0.99f32, "license within text matches");
Implementations
impl TextData
pub fn new(text: &str) -> TextData
Create a new TextData structure from a string.
The given text will be normalized, then reduced to n-grams for
matching. By default, the normalized text is stored inside the
structure for future diagnostics; it is required for optimizing a
match and for diffing against other texts. If you don’t want this extra
data, you can call without_text to throw it out. Generally, as a user of
this library you want to keep the text data, but askalono will throw it
away in its own Store, as it’s not needed there.
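The normalize-then-n-gram idea can be illustrated with a minimal, self-contained sketch. It does not use askalono and is not its implementation: the helpers `normalize`, `bigrams`, and `dice` are hypothetical names, the normalization shown (lowercasing and dropping punctuation) is far simpler than what a real license matcher needs, and the choice of word bigrams scored with a Sørensen–Dice coefficient is an assumption made for illustration.

```rust
use std::collections::HashMap;

/// Multiset of word bigrams (hypothetical type, for illustration only).
type Bigrams = HashMap<(String, String), u32>;

/// Hypothetical normalization: lowercase the text and keep only
/// alphanumeric words, discarding punctuation and whitespace.
fn normalize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Count every pair of adjacent words (n-grams with n = 2).
fn bigrams(words: &[String]) -> Bigrams {
    let mut counts = Bigrams::new();
    for pair in words.windows(2) {
        *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
    }
    counts
}

/// Sørensen–Dice coefficient over two bigram multisets:
/// twice the shared bigram count divided by the total bigram count.
fn dice(a: &Bigrams, b: &Bigrams) -> f32 {
    let total: u32 = a.values().sum::<u32>() + b.values().sum::<u32>();
    if total == 0 {
        return 0.0;
    }
    let shared: u32 = a
        .iter()
        .map(|(k, &n)| n.min(*b.get(k).unwrap_or(&0)))
        .sum();
    2.0 * shared as f32 / total as f32
}
```

Under these assumptions, two texts that differ only in case and punctuation produce identical bigram multisets, so they score as identical.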
pub fn without_text(self) -> Self
Consume this TextData, returning one without normalized/processed
text stored.
Unless you know you don’t want the text, you probably don’t want to use
this. Other methods on TextData require that text is present.
pub fn lines_view(&self) -> (usize, usize)
Get the bounds of the active line view.
This represents the “active” region of lines that matches are generated
from. The bounds are a 0-indexed (start, end) tuple, with inclusive
start and exclusive end indices. See optimize_bounds.
This is largely for informational purposes; other methods on
TextData, such as lines and match_score, already account for
the line range. However, it’s useful to call it after running
optimize_bounds to discover where in the input text the match was
located.
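To make the half-open convention concrete, the sketch below selects the lines covered by a (start, end) view from a plain string. The function `lines_in_view` is a hypothetical helper, not part of askalono; it only shows how a 0-indexed, inclusive-start, exclusive-end pair maps onto lines.

```rust
/// Return the lines of `text` inside a 0-indexed, half-open view:
/// `start` is included, `end` is excluded.
/// Hypothetical helper for illustration; not an askalono API.
fn lines_in_view(text: &str, view: (usize, usize)) -> Vec<&str> {
    let (start, end) = view;
    text.lines()
        .skip(start)
        // saturating_sub guards against a view whose end precedes its start
        .take(end.saturating_sub(start))
        .collect()
}
```

With the sample text from the examples at the top of this page, a view of (1, 2) selects only the second line, `// My First License`.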
pub fn with_view(&self, start: usize, end: usize) -> Self
Clone this TextData, creating a copy with the given view.
This will re-generate match data for the given view. It’s used in
optimize_bounds to shrink/expand the view of the text to discover
bounds.
Other methods on TextData respect this boundary, so it’s generally not
needed outside this struct.
pub fn white_out(&self) -> Self
“Erase” the current lines in view and restore the view to its original bounds.
For example, consider a file with two licenses in it. One was identified
(and located) with optimize_bounds. Now you want to find the other:
white-out the matched lines, and re-run the overall search to find a
new high score.
pub fn match_score(&self, other: &TextData) -> f32
Compare this TextData with another, returning a similarity score.
This is what’s used during analysis to rank licenses.
pub fn optimize_bounds(&self, other: &TextData) -> (Self, f32)
Attempt to optimize a known match to locate possible line ranges.
Returns a new TextData struct and a score. The returned struct is a
clone of self, with its view set to the best match against other.
This will respect any views set on the TextData (an optimized result won’t go outside the original view).
Note that this won’t be 100% optimal if there are blank lines surrounding the actual match, since successive blank lines in a range will likely have the same score.
You should check the value of lines_view on the returned struct to
find the line ranges.
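One way to picture a bounds search is an exhaustive scan over line ranges, sketched below. This is not how askalono necessarily performs the optimization: `score` and `best_view` are hypothetical helpers, the scorer is a simple Dice coefficient over lowercase word sets, and trying every range is quadratic in the number of lines.

```rust
use std::collections::HashSet;

/// Hypothetical scorer: Dice coefficient over lowercase word sets.
fn score(sample: &str, reference: &str) -> f32 {
    let words = |s: &str| -> HashSet<String> {
        s.split(|c: char| !c.is_alphanumeric())
            .filter(|w| !w.is_empty())
            .map(|w| w.to_lowercase())
            .collect()
    };
    let (a, b) = (words(sample), words(reference));
    if a.is_empty() && b.is_empty() {
        return 0.0;
    }
    2.0 * a.intersection(&b).count() as f32 / (a.len() + b.len()) as f32
}

/// Try every half-open line range and keep the first best-scoring one.
/// Returns the (start, end) view and its score.
fn best_view(sample: &str, reference: &str) -> ((usize, usize), f32) {
    let lines: Vec<&str> = sample.lines().collect();
    // Start from the full text as the baseline view.
    let mut best = ((0, lines.len()), score(sample, reference));
    for start in 0..lines.len() {
        for end in (start + 1)..=lines.len() {
            let s = score(&lines[start..end].join("\n"), reference);
            // Strict comparison: ranges that tie keep the earlier result.
            if s > best.1 {
                best = ((start, end), s);
            }
        }
    }
    best
}
```

Because ties keep the earlier range, adding a blank line to a range does not change its score in this sketch, which mirrors the caveat above about blank lines around a match.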
Trait Implementations
impl<'de> Deserialize<'de> for TextData
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error> where
__D: Deserializer<'de>,
Deserialize this value from the given Serde deserializer.
Auto Trait Implementations
impl RefUnwindSafe for TextData
impl Send for TextData
impl Sync for TextData
impl Unpin for TextData
impl UnwindSafe for TextData
Blanket Implementations
impl<T> BorrowMut<T> for T where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.