Struct Entity

pub struct Entity {
    pub text: String,
    pub entity_type: EntityType,
    pub confidence: Confidence,
    pub normalized: Option<String>,
    pub provenance: Option<Provenance>,
    pub kb_id: Option<String>,
    pub canonical_id: Option<CanonicalId>,
    pub hierarchical_confidence: Option<HierarchicalConfidence>,
    pub visual_span: Option<Span>,
    pub discontinuous_span: Option<DiscontinuousSpan>,
    pub mention_type: Option<MentionType>,
    /* private fields */
}

A recognized named entity or relation trigger.

§Entity Structure

"Contact John at john@example.com on Jan 15"
         ^^^^    ^^^^^^^^^^^^^^^^    ^^^^^^
         PER     EMAIL               DATE
         |       |                   |
         Named   Contact             Temporal
         (ML)    (Pattern)           (Pattern)

§Core Fields (Stable API)

text, entity_type, start, end, confidence: always present (start and end are read through the start() and end() accessors)
normalized, provenance: commonly used optional fields
kb_id, canonical_id: knowledge graph and coreference support

§Extended Fields (Research/Experimental)

The following fields support advanced research applications but may evolve:

Field	Purpose	Status
`visual_span`	Multi-modal (ColPali) extraction	Experimental
`discontinuous_span`	W2NER non-contiguous entities	Experimental
`hierarchical_confidence`	Coarse-to-fine NER	Experimental

These fields are annotated with #[serde(skip_serializing_if = "Option::is_none")], so they are omitted from serialized output when unset.

§Knowledge Graph Support

For GraphRAG and coreference resolution, entities support:

kb_id: External knowledge base identifier (e.g., Wikidata Q-ID)
canonical_id: Local coreference cluster ID (links “John” and “he”)

§Normalization

Entities can have a normalized form for downstream processing:

Dates: “Jan 15” → “2024-01-15” (ISO 8601)
Money: “$1.5M” → “1500000 USD”
Locations: “NYC” → “New York City”

Fields§

§text: String

Entity text (surface form as it appears in source)

§entity_type: EntityType

Entity type classification

§confidence: Confidence

Confidence score (0.0-1.0, calibrated).

Construction via Confidence::new clamps to [0.0, 1.0]. Use .value() or Into<f64> to extract the raw score.
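The clamping behaviour described above can be pictured with a small stand-in type. `Score` below is a hypothetical newtype written only for illustration, not the crate's `Confidence`; the handling of NaN in particular is an assumption of this sketch.

```rust
/// Hypothetical stand-in for a clamped confidence score (not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Score(f64);

impl Score {
    /// Clamp the input into [0.0, 1.0]. NaN is mapped to 0.0, which is an
    /// assumption of this sketch rather than documented crate behaviour.
    pub fn new(raw: f64) -> Self {
        if raw.is_nan() {
            Score(0.0)
        } else {
            Score(raw.clamp(0.0, 1.0))
        }
    }

    /// Return the stored score.
    pub fn value(self) -> f64 {
        self.0
    }
}

impl From<Score> for f64 {
    fn from(s: Score) -> f64 {
        s.0
    }
}
```

With a type like this, callers can pass any float and rely on the stored value being in range, which is why a range check during validation rarely fires for values built through the constructor.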

§normalized: Option<String>

Normalized/canonical form (e.g., “Jan 15” → “2024-01-15”)

§provenance: Option<Provenance>

Provenance: which backend/method produced this entity

§kb_id: Option<String>

External knowledge base ID (e.g., “Q7186” for Marie Curie in Wikidata). Used for entity linking and GraphRAG applications.

§canonical_id: Option<CanonicalId>

Local coreference cluster ID. Multiple mentions with the same canonical_id refer to the same entity. Example: “Marie Curie” and “she” might share canonical_id = CanonicalId(42).

§hierarchical_confidence: Option<HierarchicalConfidence>

Hierarchical confidence (coarse-to-fine). Provides linkage, type, and boundary scores separately.

§visual_span: Option<Span>

Visual span for multi-modal (ColPali) extraction. When set, provides bounding box location in addition to text offsets.

§discontinuous_span: Option<DiscontinuousSpan>

Discontinuous span for non-contiguous entity mentions (W2NER support). When set, overrides start/end for length calculations. Example: “New York and LA [airports]” where “airports” modifies both.

§mention_type: Option<MentionType>

Mention type classification (Proper, Nominal, Pronominal, Zero).

Classifies the referring expression type for coreference resolution. Follows the Accessibility Hierarchy (Ariel 1990): Proper > Nominal > Pronominal > Zero.
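One way to make the ordering Proper > Nominal > Pronominal > Zero usable in code is to encode it in an enum's variant order. The sketch below assumes the ordering can be read as a rank; `MentionKind` and `most_explicit` are hypothetical names for this illustration, not part of the crate.

```rust
/// Hypothetical mirror of the four mention types, for illustration only.
/// Variants are declared from lowest to highest rank so that the derived
/// `Ord` yields Proper > Nominal > Pronominal > Zero.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MentionKind {
    Zero,
    Pronominal,
    Nominal,
    Proper,
}

/// Pick the highest-ranked mention in a cluster, for example to choose a
/// representative surface form (this use is an assumption of the sketch).
pub fn most_explicit<'a>(mentions: &[(&'a str, MentionKind)]) -> Option<&'a str> {
    mentions
        .iter()
        .max_by_key(|(_, kind)| *kind)
        .map(|(text, _)| *text)
}
```

For a cluster containing "she" (Pronominal), "the physicist" (Nominal), and "Marie Curie" (Proper), `most_explicit` returns "Marie Curie".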

Implementations§

impl Entity

pub fn new( text: impl Into<String>, entity_type: EntityType, start: usize, end: usize, confidence: impl Into<Confidence>, ) -> Self

Create a new entity.

use anno_core::{Entity, EntityType};

let e = Entity::new("Berlin", EntityType::Location, 10, 16, 0.95);
assert_eq!(e.text, "Berlin");
assert_eq!(e.entity_type, EntityType::Location);
assert_eq!((e.start(), e.end()), (10, 16));

pub fn start(&self) -> usize

Start character offset (inclusive, 0-indexed).

pub fn end(&self) -> usize

End character offset (exclusive).

pub fn set_start(&mut self, start: usize)

Set the start offset. For use in post-processing pipelines.

pub fn set_end(&mut self, end: usize)

Set the end offset. For use in post-processing pipelines.

pub fn with_provenance( text: impl Into<String>, entity_type: EntityType, start: usize, end: usize, confidence: impl Into<Confidence>, provenance: Provenance, ) -> Self

Create a new entity with provenance information.

pub fn with_hierarchical_confidence( text: impl Into<String>, entity_type: EntityType, start: usize, end: usize, confidence: HierarchicalConfidence, ) -> Self

Create an entity with hierarchical confidence scores.

pub fn from_visual( text: impl Into<String>, entity_type: EntityType, bbox: Span, confidence: impl Into<Confidence>, ) -> Self

Create an entity from a visual bounding box (ColPali multi-modal).

pub fn with_type( text: impl Into<String>, entity_type: EntityType, start: usize, end: usize, ) -> Self

Create an entity with default confidence (1.0).

pub fn link_to_kb(&mut self, kb_id: impl Into<String>)

Link this entity to an external knowledge base.

§Examples

use anno_core::{Entity, EntityType};
let mut e = Entity::new("Marie Curie", EntityType::Person, 0, 11, 0.95);
e.link_to_kb("Q7186");
assert_eq!(e.kb_id.as_deref(), Some("Q7186"));

pub fn set_canonical(&mut self, canonical_id: impl Into<CanonicalId>)

Assign this entity to a coreference cluster.

Entities with the same canonical_id refer to the same real-world entity.
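To illustrate how a shared cluster ID ties mentions together, here is a minimal grouping sketch. It uses plain `u64` in place of the crate's `CanonicalId` and tuples in place of `Entity`; both substitutions, and the name `group_by_cluster`, are assumptions made to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Group mention texts by cluster id. `u64` stands in for the crate's
/// `CanonicalId` (an assumption of this sketch); mentions without an id
/// are skipped.
pub fn group_by_cluster<'a>(
    mentions: &[(&'a str, Option<u64>)],
) -> HashMap<u64, Vec<&'a str>> {
    let mut clusters: HashMap<u64, Vec<&'a str>> = HashMap::new();
    for (text, id) in mentions {
        if let Some(id) = id {
            // Mentions keep their input order within each cluster.
            clusters.entry(*id).or_default().push(*text);
        }
    }
    clusters
}
```

Given "Marie Curie" and "she" both tagged with cluster 42, the resulting map has one entry for 42 holding both mentions.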

pub fn with_canonical_id(self, canonical_id: impl Into<CanonicalId>) -> Self

Builder-style method to set canonical ID.

§Example

use anno_core::{CanonicalId, Entity, EntityType};
let entity = Entity::new("John", EntityType::Person, 0, 4, 0.9)
    .with_canonical_id(42);
assert_eq!(entity.canonical_id, Some(CanonicalId::new(42)));

pub fn is_linked(&self) -> bool

Check if this entity is linked to a knowledge base.

pub fn has_coreference(&self) -> bool

Check if this entity has coreference information.

pub fn is_discontinuous(&self) -> bool

Check if this entity has a discontinuous span.

Discontinuous entities span non-contiguous text regions. Example: “New York and LA airports” contains “New York airports” as a discontinuous entity.

pub fn discontinuous_segments(&self) -> Option<Vec<Range<usize>>>

Get the discontinuous segments if present.

Returns None if this is a contiguous entity.
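The "New York airports" example can be reproduced from character-offset segments with a short standard-library sketch. `join_segments` is a hypothetical helper, and joining the pieces with a single space is a choice made for this illustration; the crate may assemble discontinuous text differently.

```rust
use std::ops::Range;

/// Join the text covered by each character-offset segment with single spaces.
/// Segments are half-open ranges over characters, not bytes. Segments that
/// are reversed or extend past the end of the text are skipped.
pub fn join_segments(source: &str, segments: &[Range<usize>]) -> String {
    let chars: Vec<char> = source.chars().collect();
    segments
        .iter()
        .filter(|seg| seg.start <= seg.end && seg.end <= chars.len())
        .map(|seg| chars[seg.start..seg.end].iter().collect::<String>())
        .collect::<Vec<_>>()
        .join(" ")
}
```

For the text "New York and LA airports", the segments 0..8 and 16..24 yield "New York airports", and 13..15 with 16..24 yield "LA airports".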

pub fn set_discontinuous_span(&mut self, span: DiscontinuousSpan)

Set a discontinuous span for this entity.

This is used by W2NER and similar models that detect non-contiguous mentions.

pub fn total_len(&self) -> usize

Get the total length covered by this entity, in characters.

Contiguous: end - start
Discontinuous: sum of segment lengths

Both cases count characters: all offsets in anno::core entity spans are character offsets (Unicode scalar values), not byte offsets.
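The two length rules can be written down as a small free function. This is a sketch under the assumption that segments are half-open character ranges; `total_len` here is a standalone illustration, not the crate's method.

```rust
use std::ops::Range;

/// Length in characters: the sum of segment lengths when segments are
/// present, otherwise `end - start`. Reversed ranges count as zero.
pub fn total_len(start: usize, end: usize, segments: Option<&[Range<usize>]>) -> usize {
    match segments {
        Some(segs) => segs.iter().map(|s| s.end.saturating_sub(s.start)).sum(),
        None => end.saturating_sub(start),
    }
}
```

A contiguous span 10..16 has length 6, while the discontinuous segments 0..8 and 16..24 have total length 16 even though they stretch across 24 characters of text.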

pub fn set_normalized(&mut self, normalized: impl Into<String>)

Set the normalized form for this entity.

§Examples

use anno_core::{Entity, EntityType};

let mut entity = Entity::new("Jan 15", EntityType::Date, 0, 6, 0.95);
entity.set_normalized("2024-01-15");
assert_eq!(entity.normalized.as_deref(), Some("2024-01-15"));

pub fn normalized_or_text(&self) -> &str

Get the normalized form, or the original text if not normalized.
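The fallback described here is a common `Option` idiom. The sketch below uses a hypothetical two-field `EntityStub` in place of `Entity` so that it compiles on its own; the crate's actual implementation is not shown in this page.

```rust
/// Hypothetical two-field stand-in for the relevant part of `Entity`.
pub struct EntityStub {
    pub text: String,
    pub normalized: Option<String>,
}

impl EntityStub {
    /// Prefer the normalized form; fall back to the surface text.
    pub fn normalized_or_text(&self) -> &str {
        self.normalized.as_deref().unwrap_or(&self.text)
    }
}
```

Returning `&str` borrowed from `self` avoids allocating a new `String` in either branch.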

pub fn method(&self) -> ExtractionMethod

Get the extraction method, if known.

pub fn source(&self) -> Option<&str>

Get the source backend name, if known.

pub fn category(&self) -> EntityCategory

Get the entity category.

pub fn is_structured(&self) -> bool

Returns true if this entity was detected via patterns (not ML).

pub fn is_named(&self) -> bool

Returns true if this entity required ML for detection.

pub fn overlaps(&self, other: &Entity) -> bool

Check if this entity overlaps with another.

pub fn overlap_ratio(&self, other: &Entity) -> f64

Calculate overlap ratio (IoU) with another entity.
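For contiguous spans, intersection over union (IoU) reduces to simple arithmetic on the offsets. The following is a minimal sketch for half-open `(start, end)` pairs; `span_iou` is a hypothetical helper, and the crate's method may treat discontinuous spans or edge cases differently.

```rust
/// Intersection-over-union of two half-open character spans `[start, end)`.
/// Returns 0.0 when the spans do not overlap or the union is empty.
pub fn span_iou(a: (usize, usize), b: (usize, usize)) -> f64 {
    let inter_start = a.0.max(b.0);
    let inter_end = a.1.min(b.1);
    // Zero when the spans are disjoint or merely touch.
    let intersection = inter_end.saturating_sub(inter_start);
    let union = a.1.saturating_sub(a.0) + b.1.saturating_sub(b.0) - intersection;
    if union == 0 {
        0.0
    } else {
        intersection as f64 / union as f64
    }
}
```

Spans 0..4 and 2..6 share two characters out of six covered in total, giving an IoU of one third; identical spans give 1.0, and adjacent spans such as 0..4 and 4..8 give 0.0.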

pub fn set_hierarchical_confidence( &mut self, confidence: HierarchicalConfidence, )

Set hierarchical confidence scores.

pub fn linkage_confidence(&self) -> Confidence

Get the linkage confidence (coarse filter score).

pub fn type_confidence(&self) -> Confidence

Get the type classification confidence.

pub fn boundary_confidence(&self) -> Confidence

Get the boundary confidence.

pub fn is_visual(&self) -> bool

Check if this entity has visual location (multi-modal).

pub const fn text_span(&self) -> (usize, usize)

Get the text span (start, end).

pub const fn span_len(&self) -> usize

Get the span length.

pub fn set_visual_span(&mut self, span: Span)

Set the visual span for multi-modal extraction.

§Note

Entity stores character offsets only. A unified TextSpan carrying both byte and char offsets is useful when you need to work with both offset systems, but building one requires the offset conversion utilities from the anno crate. For now, use anno::offset::char_to_byte_offsets() directly. Its text argument must be the original source text from which the entity was extracted, since that text is needed to compute byte offsets.

§Example

use anno_core::{Entity, EntityType};

let text = "Hello, 日本!";
let entity = Entity::new("日本", EntityType::Location, 7, 9, 0.95);
let (byte_start, byte_end) = anno::offset::char_to_byte_offsets(text, entity.start(), entity.end());
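To make the difference between character and byte offsets concrete, here is a minimal standard-library sketch of the conversion. `char_span_to_bytes` is a hypothetical helper written for this illustration; it is not the implementation of anno::offset::char_to_byte_offsets, whose exact signature is not shown on this page.

```rust
/// Convert half-open character offsets into byte offsets using only the
/// standard library. Returns `None` if the span is reversed or out of range.
pub fn char_span_to_bytes(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end {
        return None;
    }
    // Byte position of every character boundary, including the end of the text.
    let mut boundaries = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()));
    let byte_start = boundaries.nth(start)?;
    let byte_end = if end == start {
        byte_start
    } else {
        // `nth(start)` consumed `start + 1` items, so the boundary for `end`
        // is `end - start - 1` positions further along the iterator.
        boundaries.nth(end - start - 1)?
    };
    Some((byte_start, byte_end))
}
```

In "Hello, 日本!" the characters at offsets 7..9 occupy bytes 7..13, because each of the two CJK characters takes three bytes in UTF-8. Slicing the `&str` with the character offsets directly would return the wrong text or panic on a non-boundary index.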

pub fn extract_text(&self, source_text: &str) -> String

Safely extract text from source using character offsets.

Entity stores character offsets, not byte offsets. This method correctly extracts text by iterating over characters.

§Arguments

source_text - The original text from which this entity was extracted

§Returns

The extracted text, or empty string if offsets are invalid

§Example

use anno_core::{Entity, EntityType};

let text = "Hello, 日本!";
let entity = Entity::new("日本", EntityType::Location, 7, 9, 0.95);
assert_eq!(entity.extract_text(text), "日本");
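Extraction by character offsets can be sketched with iterator adaptors alone. The function below is a standalone illustration under the assumption that invalid offsets yield an empty string, as stated above; `extract_by_chars` is a hypothetical name, not the crate's method.

```rust
/// Extract the characters in `[start, end)` by iterating over `char`s rather
/// than slicing bytes. Returns an empty string for invalid offsets.
pub fn extract_by_chars(source: &str, start: usize, end: usize) -> String {
    if start >= end || end > source.chars().count() {
        return String::new();
    }
    source.chars().skip(start).take(end - start).collect()
}
```

Because the function never indexes the string by byte position, it cannot split a multi-byte character.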

pub fn extract_text_with_len( &self, source_text: &str, text_char_count: usize, ) -> String

Extract text with pre-computed text length (performance optimization).

Use this when validating/clamping multiple entities from the same text to avoid recalculating text.chars().count() for each entity.

§Arguments

source_text - The original text
text_char_count - Pre-computed character count (from text.chars().count())

§Returns

The extracted text, or empty string if offsets are invalid
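The benefit of passing a pre-computed length is that the O(n) character count runs once per text instead of once per entity. The sketch below shows that pattern with hypothetical helpers (`extract_with_len`, `extract_all`) operating on plain `(start, end)` pairs rather than `Entity` values.

```rust
/// Extract `[start, end)` given a character count computed once by the caller.
pub fn extract_with_len(source: &str, char_count: usize, start: usize, end: usize) -> String {
    if start >= end || end > char_count {
        return String::new();
    }
    source.chars().skip(start).take(end - start).collect()
}

/// Extract many spans from one text, counting its characters only once.
pub fn extract_all(source: &str, spans: &[(usize, usize)]) -> Vec<String> {
    let char_count = source.chars().count(); // O(n), done a single time
    spans
        .iter()
        .map(|&(start, end)| extract_with_len(source, char_count, start, end))
        .collect()
}
```

The caller is responsible for passing a count that matches the text; a stale or wrong count would make the bounds check unreliable.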

pub fn builder( text: impl Into<String>, entity_type: EntityType, ) -> EntityBuilder

Create a builder for fluent entity construction.

pub fn validate(&self, source_text: &str) -> Vec<ValidationIssue>

Validate this entity against the source text.

Returns a list of validation issues. Empty list means the entity is valid.

§Checks Performed

Span bounds: start < end, both within text length
Text match: text matches the span in source
Confidence range: confidence in [0.0, 1.0]
Type consistency: Custom types have non-empty names
Discontinuous consistency: If present, segments are valid

§Example

use anno_core::{Entity, EntityType};

let text = "John works at Apple";
let entity = Entity::new("John", EntityType::Person, 0, 4, 0.95);

let issues = entity.validate(text);
assert!(issues.is_empty(), "Entity should be valid");

// Invalid entity: span doesn't match text
let bad = Entity::new("Jane", EntityType::Person, 0, 4, 0.95);
let issues = bad.validate(text);
assert!(!issues.is_empty(), "Entity text doesn't match span");
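The span-bounds and text-match checks listed above can be sketched without the crate. The `Issue` enum and `check_span` function below are hypothetical and cover only those two checks; they are not the crate's `ValidationIssue` or its full validation logic.

```rust
/// Hypothetical issue type for this sketch (not the crate's `ValidationIssue`).
#[derive(Debug, PartialEq)]
pub enum Issue {
    InvalidSpan,
    OutOfBounds,
    TextMismatch,
}

/// Check span bounds and text match for one entity, using character offsets.
pub fn check_span(source: &str, text: &str, start: usize, end: usize) -> Vec<Issue> {
    let mut issues = Vec::new();
    if start >= end {
        issues.push(Issue::InvalidSpan);
        return issues;
    }
    if end > source.chars().count() {
        issues.push(Issue::OutOfBounds);
        return issues;
    }
    // Only compare text once the span is known to be inside the source.
    let actual: String = source.chars().skip(start).take(end - start).collect();
    if actual != text {
        issues.push(Issue::TextMismatch);
    }
    issues
}
```

The sketch returns early on a bad span because a text comparison is meaningless when the offsets do not describe a valid region.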

pub fn validate_with_len( &self, source_text: &str, text_char_count: usize, ) -> Vec<ValidationIssue>

Validate entity with pre-computed text length (performance optimization).

Use this when validating multiple entities from the same text to avoid recalculating text.chars().count() for each entity.

§Arguments

source_text - The original text
text_char_count - Pre-computed character count (from text.chars().count())

§Returns

Vector of validation issues (empty if valid)

pub fn is_valid(&self, source_text: &str) -> bool

Check if this entity is valid against the source text.

Convenience method that returns true if validate() returns empty.

pub fn validate_batch( entities: &[Entity], source_text: &str, ) -> HashMap<usize, Vec<ValidationIssue>>

Validate a batch of entities efficiently.

Returns a map of entity index -> validation issues. Only entities with issues are included.

§Example

use anno_core::{Entity, EntityType};

let text = "John and Jane work at Apple";
let entities = vec![
    Entity::new("John", EntityType::Person, 0, 4, 0.95),
    Entity::new("Wrong", EntityType::Person, 9, 13, 0.8),
];

let issues = Entity::validate_batch(&entities, text);
assert!(issues.contains_key(&1)); // "Wrong" does not match the span text "Jane"
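The "index to issues, failures only" shape of the result can be illustrated with a standalone sketch. `check_batch` below is hypothetical: it works on `(text, start, end)` tuples, reports issues as plain strings, and performs only bounds and text-match checks, so it should be read as an outline of the idea rather than the crate's behaviour.

```rust
use std::collections::HashMap;

/// Map the index of each failing entity to a description of its issues.
/// Entities without issues are omitted from the result.
pub fn check_batch(
    source: &str,
    entities: &[(&str, usize, usize)],
) -> HashMap<usize, Vec<String>> {
    let chars: Vec<char> = source.chars().collect(); // decoded once for the whole batch
    let mut report = HashMap::new();
    for (index, &(text, start, end)) in entities.iter().enumerate() {
        let mut issues = Vec::new();
        if start >= end || end > chars.len() {
            issues.push(format!("invalid span {start}..{end}"));
        } else {
            let actual: String = chars[start..end].iter().collect();
            if actual != text {
                issues.push(format!("expected {text:?}, found {actual:?}"));
            }
        }
        if !issues.is_empty() {
            report.insert(index, issues);
        }
    }
    report
}
```

Keying the map by index lets the caller look up the offending entity in the original slice without cloning it into the report.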

Trait Implementations§

impl Clone for Entity

fn clone(&self) -> Entity

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for Entity

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl<'de> Deserialize<'de> for Entity

fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error>

Deserialize this value from the given Serde deserializer.

impl From<&Entity> for Mention

fn from(entity: &Entity) -> Self

Converts to this type from the input type.

impl From<&Entity> for Signal<Location>

Convert an Entity to a Signal<Location>.

Uses Location::Text for the span and preserves normalized, provenance, and hierarchical_confidence fields. Discontinuous and visual spans are not handled; use GroundedDocument::from_entities for full fidelity.
