# Worka PII

[![Crates.io](https://img.shields.io/crates/v/pii.svg)](https://crates.io/crates/pii)
[![Docs](https://docs.rs/pii/badge.svg)](https://docs.rs/pii)
[![CI](https://github.com/worka-ai/pii/actions/workflows/ci.yml/badge.svg)](https://github.com/worka-ai/pii/actions/workflows/ci.yml)
[![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](https://github.com/worka-ai/pii/blob/main/LICENSE-APACHE)

Worka PII is a Rust library for detecting and anonymizing personally identifiable information (PII).
It provides deterministic, capability-aware NLP pipelines designed to run in CPU-only environments,
with explicit auditability and controlled degradation when language features are unavailable.

This crate was extracted from the Worka internal monorepo to become a standalone, reusable
component. The APIs and the RFCs are maintained here to support independent development and
external adoption.

## Features

- Deterministic PII detection with stable byte offsets
- Regex, validator, dictionary, and NER-backed recognizers
- Capability-aware pipeline (tokenization, lemma, POS, NER)
- Configurable anonymization operators (redact, mask, replace, hash)
- Optional Candle-based NER via `candle-ner` feature

## Examples

```bash
cargo run --example redact
cargo run --example extract
```

## Redaction Example

```rust
use pii::anonymize::{AnonymizeConfig, Anonymizer, Operator};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Contact Jane at jane@example.com or +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;

let redacted = Anonymizer::anonymize(text, &result.entities, &config).unwrap();
assert!(redacted.text.contains("<EMAIL>"));
```
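Because detections carry stable byte offsets, replacements have to be spliced in from the end of the text backwards so that earlier offsets stay valid. The following is a standalone sketch of that idea, not the crate's internal implementation; the byte spans are computed by hand for this particular string.

```rust
// Standalone sketch: apply (start, end, replacement) edits to a string.
// Sorting by start descending keeps earlier byte offsets valid as we splice.
fn apply_replacements(text: &str, mut edits: Vec<(usize, usize, String)>) -> String {
    edits.sort_by(|a, b| b.0.cmp(&a.0));
    let mut out = text.to_string();
    for (start, end, replacement) in edits {
        out.replace_range(start..end, &replacement);
    }
    out
}

fn main() {
    let text = "Contact Jane at jane@example.com or +1 415-555-1212.";
    // Hand-computed byte spans for the email (16..32) and phone (36..51).
    let edits = vec![
        (16, 32, "<EMAIL>".to_string()),
        (36, 51, "***********1212".to_string()),
    ];
    let redacted = apply_replacements(text, edits);
    assert!(redacted.contains("<EMAIL>"));
    println!("{redacted}");
}
```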

## Span Extraction Example

This example keeps the input text intact and uses the detected spans directly.

```rust
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Reach me at jane@example.com from 10.0.0.5.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

for detection in &result.entities {
    let span = &text[detection.start..detection.end];
    println!(
        "type={} start={} end={} value={}",
        detection.entity_type.as_str(),
        detection.start,
        detection.end,
        span
    );
}
```
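Note that `detection.start` and `detection.end` are byte offsets, not char indices; with non-ASCII text the two differ, and `&text[start..end]` is only valid on char boundaries. A small self-contained illustration (`byte_span` is a helper written for this sketch, not a crate API):

```rust
// Byte offset of a needle within text. Detected spans are byte offsets
// like these, suitable for direct slicing on char boundaries.
fn byte_span(text: &str, needle: &str) -> (usize, usize) {
    let start = text.find(needle).expect("needle present");
    (start, start + needle.len())
}

fn main() {
    // 'É' and 'à' are each two bytes in UTF-8, so the byte offset (12)
    // is larger than the char index (10).
    let text = "Écrivez à jane@example.com.";
    let (start, end) = byte_span(text, "jane@example.com");
    assert_eq!((start, end), (12, 28));
    assert_eq!(&text[start..end], "jane@example.com");
    println!("start={start} end={end}");
}
```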

## Custom Operators + Audit Log Example

This example applies per-entity operators and emits a simple audit log that
records the original value alongside the replacement.

```rust
use pii::anonymize::{AnonymizeConfig, Anonymizer, Operator};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Email jane@example.com or call +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;

let anonymized = Anonymizer::anonymize(text, &result.entities, &config).unwrap();

for item in &anonymized.items {
    let original = &text[item.entity.start..item.entity.end];
    println!(
        "type={} value={} replacement={}",
        item.entity.entity_type.as_str(),
        original,
        item.replacement
    );
}
```
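The `Mask { ch: '*', from_end: 4 }` operator above keeps the trailing characters visible. As a standalone illustration of that masking rule (a sketch of the behavior, not the crate's implementation; consult the docs for the exact semantics, e.g. whether whitespace is masked):

```rust
// Mask every char except the last `from_end` with `ch`.
// Works on chars, so multi-byte characters are masked as single units.
fn mask_from_end(value: &str, ch: char, from_end: usize) -> String {
    let total = value.chars().count();
    let keep_from = total.saturating_sub(from_end);
    value
        .chars()
        .enumerate()
        .map(|(i, c)| if i < keep_from { ch } else { c })
        .collect()
}

fn main() {
    let masked = mask_from_end("+1 415-555-1212", '*', 4);
    assert_eq!(masked, "***********1212");
    println!("{masked}");
}
```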

## Supported Entity Types (Built-in)

The following entity types are supported out of the box via built-in recognizers:

- Email
- Phone
- IpAddress (IPv4)
- Ipv6
- CreditCard
- Iban
- Ssn
- Itin
- TaxId
- Passport
- DriverLicense
- BankAccount
- RoutingNumber
- CryptoAddress
- MacAddress
- Uuid
- Vin
- Imei
- Url
- Domain
- Hostname

The following types are supported when a NER engine is enabled:

- Person
- Location
- Organization

## Custom Entities and Recognizers

You can add custom entities and recognizers to the pipeline.

```rust
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::recognizers::regex::RegexRecognizer;
use pii::types::EntityType;
use pii::{Analyzer, PolicyConfig};

let mut recognizers = default_recognizers();
let employee_id = RegexRecognizer::new(
    "regex_employee_id",
    EntityType::Custom("EmployeeId".to_string()),
    r"\bEMP-\d{4}\b",
    0.7,
    "employee_id",
).unwrap();
recognizers.push(Box::new(employee_id));

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    recognizers,
    Vec::new(),
    PolicyConfig::default(),
);
```

## Custom Pipeline

The pipeline is fully customizable: you can supply your own NLP engine, recognizers, and
context enhancers.

- Implement `NlpEngine` if you want custom tokenization, lemma/POS, or NER.
- Add domain-specific recognizers and context enhancers for tuned detection.
- Swap the default recognizers with your own curated set for strict control.
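To give a feel for what a minimal engine looks like, here is a self-contained whitespace tokenizer behind a hypothetical trait. The trait name and method (`TokenizerLike`, `tokens`) are illustrative only and do not match the crate's actual `NlpEngine` signature, which you should take from the API docs.

```rust
// Hypothetical trait shape, for illustration only; the crate's real
// `NlpEngine` trait has its own methods and types (see docs.rs).
trait TokenizerLike {
    /// Return (token, byte_start, byte_end) triples.
    fn tokens(&self, text: &str) -> Vec<(String, usize, usize)>;
}

struct WhitespaceEngine;

impl TokenizerLike for WhitespaceEngine {
    fn tokens(&self, text: &str) -> Vec<(String, usize, usize)> {
        let mut out = Vec::new();
        let mut start = None;
        for (i, c) in text.char_indices() {
            if c.is_whitespace() {
                // A whitespace char closes any token in progress.
                if let Some(s) = start.take() {
                    out.push((text[s..i].to_string(), s, i));
                }
            } else if start.is_none() {
                start = Some(i);
            }
        }
        if let Some(s) = start {
            out.push((text[s..].to_string(), s, text.len()));
        }
        out
    }
}

fn main() {
    let engine = WhitespaceEngine;
    let toks = engine.tokens("Reach me at jane@example.com");
    assert_eq!(toks[0], ("Reach".to_string(), 0, 5));
    assert_eq!(toks.last().unwrap().0, "jane@example.com");
    println!("{toks:?}");
}
```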

## Language Support and Degradation

The default `SimpleNlpEngine` is language-agnostic and provides tokenization plus sentence
splitting for any language tag. For EN/DE/ES, you can provide richer language profiles
and context terms to improve recall.

For unsupported languages:

- Regex and validator recognizers still work (language-neutral).
- Lemma/POS/NER capabilities will be absent unless your `NlpEngine` provides them.
- Context enhancement falls back to surface terms when lemma is unavailable.
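The lemma-to-surface fallback can be sketched as a capability check: prefer lemmas when the engine emits them, otherwise compare lowercased surface tokens. This is a standalone sketch of the idea, not the crate's context-enhancer code.

```rust
// Capability-aware fallback: use lemmas if available, else surface tokens.
fn match_terms(tokens: &[&str], lemmas: Option<&[&str]>, context_terms: &[&str]) -> bool {
    let candidates: Vec<String> = match lemmas {
        Some(ls) => ls.iter().map(|l| l.to_lowercase()).collect(),
        None => tokens.iter().map(|t| t.to_lowercase()).collect(),
    };
    candidates.iter().any(|c| context_terms.contains(&c.as_str()))
}

fn main() {
    let terms = ["email", "phone"];
    // With lemmas available, the inflected "Emails" lemmatizes to "email".
    assert!(match_terms(&["Emails"], Some(&["email"][..]), &terms));
    // Without lemmas, only the surface form is compared, so recall drops.
    assert!(!match_terms(&["Emails"], None, &terms));
    assert!(match_terms(&["Email"], None, &terms));
}
```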

### Adding Languages

To add a new language with higher fidelity:

1. Implement or integrate an `NlpEngine` that can emit token offsets, lemmas, POS tags, and/or NER.
2. Provide a `LanguageProfile` with context terms for that language.
3. Attach those to the analyzer via your pipeline configuration.
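Step 2 amounts to supplying per-entity context terms for the new language. The sketch below uses a hypothetical stand-in struct; the real `LanguageProfile` type and its fields must be taken from the crate's documentation.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the crate's `LanguageProfile`; it only
// illustrates the "context terms per entity type" shape from step 2.
struct ProfileSketch {
    language: &'static str,
    context_terms: HashMap<&'static str, Vec<&'static str>>,
}

// A sketch profile for French: surface terms that, found near a candidate
// span, can raise a recognizer's confidence.
fn french_profile() -> ProfileSketch {
    let mut context_terms = HashMap::new();
    context_terms.insert("Email", vec!["courriel", "e-mail", "adresse"]);
    context_terms.insert("Phone", vec!["téléphone", "tél", "portable"]);
    ProfileSketch { language: "fr", context_terms }
}

fn main() {
    let profile = french_profile();
    assert_eq!(profile.language, "fr");
    assert!(profile.context_terms["Phone"].contains(&"tél"));
    println!("profile for {} ({} entity types)", profile.language, profile.context_terms.len());
}
```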

## Specification

The full specification is in `docs/rfc-1200-pii.md` and defines the data model, pipeline behavior,
capability reporting, and conformance requirements.

## Tests

```bash
cargo test
```

## Benchmarks

```bash
cargo bench
```

Candle NER tests are ignored by default and require `--features candle-ner` plus a model:

```bash
PII_CANDLE_MODEL_DIR=/path/to/model \
  cargo test --features candle-ner --test candle_ner -- --ignored
```

You can also set `PII_CANDLE_MODEL_ID` to download a model via `hf-hub`.

## License

Licensed under either of:
- Apache License, Version 2.0
- MIT license