penmanship 0.1.0

A Unicode character lookup library for converting text patterns to Unicode characters
# penmanship - Project Context for Claude

## Project Overview

`penmanship` is a Rust library for Unicode character lookup via text patterns. It provides a simple, efficient API for converting text aliases (like `"..."`, `"alpha"`, `"(c)"`) to their corresponding Unicode characters (like `…`, `α`, `©`). The library emphasizes strict code quality, comprehensive documentation, and safe coding practices.

This is a **library-only** project (no binary):

- **Library**: Rust API for Unicode character lookup and conversion
- **Zero runtime setup overhead**: Uses perfect hash maps generated at compile time via `phf`, so no lookup table is built at startup
- **Feature-gated categories**: Optional modules for different character categories

## Key Project Characteristics

### Strict Linting and Quality Standards

- **`no_std` compatible**: Library works without the standard library for embedded use
- **Forbidden unsafe code**: `unsafe_code = "forbid"`
- **Required documentation**: All items (public and private) must be documented
  - Missing docs are denied (both code and rustdoc)
  - All functions must include `# Errors`, `# Panics`, and `# Safety` docs where applicable

### Architecture Patterns

- **Category modules**: Organized by Unicode category (punctuation, math, greek, fractions, etc.)
- **Feature flags**: Each category is a Cargo feature that can be independently enabled/disabled
- **Static mappings**: Uses `phf` (perfect hash function) maps generated at compile time, so lookups need no runtime initialization
- **Minimal dependencies for core**: Only `phf` is required for the mappings

### Available Features

All features are enabled by default via the `full` feature flag:

- `full` (default) - Enables all categories
- `punctuation` - Punctuation and typography symbols
- `math` - Mathematical operators and symbols
- `greek` - Greek letters (lowercase and uppercase)
- `fractions` - Fraction characters
- `currency` - Currency symbols
- `symbols` - Miscellaneous symbols (legal, general, card suits, music, zodiac)
- `superscripts` - Superscript characters
- `subscripts` - Subscript characters
- `html` - HTML named character references
- `emoji` - Emoji shortcode lookup (adds `emojis` dependency)

Users can selectively enable features:

```toml
# Only punctuation and math
penmanship = { version = "0.1", default-features = false, features = ["punctuation", "math"] }
```

### File Organization (One Item Per File)

- **Guideline**: Place one public item (struct, enum, or trait) per file as a general rule.
  - File names should match the item name (e.g., `Mapping` struct goes in `mapping.rs`)
  - Each file contains the item and all its implementations (Display, Default, methods, etc.)
  - Type aliases are exempt from this rule and can be grouped logically
  - When violating this guideline, include a comment explaining why:

    ```rust
    // Multiple mapping constants grouped together for cohesion.
    pub static PUNCTUATION: phf::Map<...> = ...;
    pub static MATH: phf::Map<...> = ...;
    ```

- **Benefits of this pattern**:
  - Clear file-to-type mapping for navigation
  - Focused context when editing specific types
  - Precise git history (changes to `Punctuation` only touch `punctuation.rs`)
  - Reduced merge conflicts when working on different types
  - Tests can be colocated with their specific type

### Module Organization (mod.rs as Table of Contents)

`mod.rs` files should **only** contain module declarations and re-exports.

- All implementation code (structs, enums, functions, impls, tests) must be in separate files
- `mod.rs` serves as the module's table of contents
- Each public item gets its own file following the "one item per file" convention

**Example**:

```rust
// categories/mod.rs - GOOD: Only declarations and re-exports
pub mod punctuation;
pub mod math;
pub mod greek;

pub use punctuation::PUNCTUATION;
pub use math::MATH;
pub use greek::GREEK;
```

**Benefits**: Clear module index, easier navigation, better git history, consistent with "one item per file".

### Code Structure

```text
penmanship/
├── Cargo.toml
├── README.md
├── CHANGELOG.md
├── LICENSE.txt
├── SECURITY.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── CLAUDE.md
├── .gitignore
├── .markdownlintrc
└── src/
    ├── lib.rs              # Crate root with main API
    ├── lookup.rs           # Core lookup function
    └── categories/         # Feature-gated category modules
        ├── mod.rs
        ├── punctuation.rs
        ├── math.rs
        ├── greek.rs
        ├── fractions.rs
        ├── currency.rs
        ├── symbols.rs
        ├── superscripts.rs
        ├── subscripts.rs
        ├── emoji.rs
        └── html/           # HTML entities split across multiple files
            ├── mod.rs
            ├── part1.rs
            ├── part2.rs
            └── part3.rs
```

### Development Workflow

- Build locally: `cargo build`
- Build release: `cargo build --release`
- Test: `cargo test`
- Test all features: `cargo test --all-features`
- Test without default features: `cargo test --no-default-features`
- Test specific features: `cargo test --features="punctuation,math"`
- Lint: `cargo clippy --all-targets --all-features`
- Documentation: `cargo doc --open`
- Coverage: `cargo llvm-cov --all-features --text`

### Library Usage

Users can add `penmanship` as a dependency in their `Cargo.toml`:

```toml
[dependencies]
penmanship = "0.1"
```

Basic usage:

```rust
use penmanship::lookup;

fn main() {
    // Look up Unicode characters by pattern.
    // `ch` avoids shadowing the primitive type name `char`.
    if let Some((ch, description)) = lookup("...") {
        println!("{} - {}", ch, description);  // … - horizontal ellipsis
    }

    if let Some((ch, _)) = lookup("alpha") {
        println!("{}", ch);  // α
    }

    if let Some((ch, _)) = lookup("(c)") {
        println!("{}", ch);  // ©
    }
}
```

**Key library patterns:**

- Primary function: `lookup<S: AsRef<str>>(pattern: S) -> Option<(&'static str, &'static str)>`
- Returns tuple of (character, description) or None
- Accepts any string-like type via `AsRef<str>`
- Case-sensitive for Greek letters (alpha vs Alpha)
- Supports multiple aliases for same character
- Feature flags control which categories are included (default: all via `full` feature)
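
The signature listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's implementation: the `TABLE` slice and the `lookup_str` helper are hypothetical stand-ins for the `phf` maps, and every description except "horizontal ellipsis" is a placeholder.

```rust
/// Hypothetical stand-in for the crate's `phf` maps: (pattern, character, description).
/// Descriptions other than "horizontal ellipsis" are placeholders for this sketch.
static TABLE: &[(&str, &str, &str)] = &[
    ("...", "…", "horizontal ellipsis"),
    ("alpha", "α", "greek small letter alpha"),
    ("Alpha", "Α", "greek capital letter alpha"),
    ("(c)", "©", "copyright sign"),
];

/// Non-generic worker (hypothetical), so the generic wrapper below stays tiny.
fn lookup_str(pattern: &str) -> Option<(&'static str, &'static str)> {
    TABLE
        .iter()
        .find(|(key, _, _)| *key == pattern) // exact, case-sensitive match
        .map(|&(_, ch, description)| (ch, description))
}

/// Accepts `&str`, `String`, `&String`, and anything else that is `AsRef<str>`.
pub fn lookup<S: AsRef<str>>(pattern: S) -> Option<(&'static str, &'static str)> {
    lookup_str(pattern.as_ref())
}
```

Because the match is an exact string comparison, `lookup("alpha")` and `lookup("Alpha")` resolve to different characters, and an unknown pattern yields `None` rather than an error.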

### Release Process

- Update version in `Cargo.toml` following [semver](https://semver.org/)
- Update `CHANGELOG.md` with release notes
- Create git tag: `git tag v0.x.x && git push origin v0.x.x`
- Publish to crates.io: `cargo publish`
- Current version: `0.1.0` (unreleased)

### Git Configuration

**Whitelist .gitignore**: This project uses a whitelist approach to version control. By default, all files are ignored (`*`), and only specific file types and paths are explicitly allowed:

- Documentation: `README.md`, `CHANGELOG.md`, `LICENSE.txt`, `SECURITY.md`, `CODE_OF_CONDUCT.md`, `CONTRIBUTING.md`, `CLAUDE.md`
- Rust files: `Cargo.toml`, `Cargo.lock`, `src/**/*.rs`
- Configuration: `.gitignore`, `.markdownlintrc`
- GitHub config: `.github/ISSUE_TEMPLATE/**/*.md`, `.github/workflows/**/*.yaml`
- Scripts: `scripts/**/*.py`
- Explicitly disallowed: `/target`, `/.focus`, `.local/`

This ensures only intentional source files are committed, preventing accidental inclusion of build artifacts, local config, or temporary files.

### Important Conventions

1. **Documentation First**: Write docs before implementation
2. **Module Privacy**: Document all private items too
3. **No panics**: Library functions should never panic
4. **Feature Gates**: Use `#[cfg(feature = "...")]` for optional categories
5. **Static data**: All mappings are compile-time constants using `phf`
6. **Comprehensive testing**: Test all mappings from the whitepaper

### Mapping Pattern

Each category follows a consistent mapping pattern:

1. **Category module**: Define in `categories/category_name.rs`
2. **Static map**: Use `phf::phf_map!` macro for compile-time perfect hashing
3. **Naming convention**: `CATEGORY_NAME` static (e.g., `PUNCTUATION`, `GREEK_LOWERCASE`)
4. **Map structure**: `&'static str` (pattern) → `(&'static str, &'static str)` (character, description)
5. **Feature gate**: Each category behind a feature flag

Example category structure:

```rust
// categories/punctuation.rs
use phf::phf_map;

/// Punctuation and typography mappings
#[cfg(feature = "punctuation")]
pub static PUNCTUATION: phf::Map<&'static str, (&'static str, &'static str)> = phf_map! {
    "..." => ("…", "horizontal ellipsis"),
    "en" => ("–", "en dash"),
    "em" => ("—", "em dash"),
    "--" => ("—", "em dash"),
};
```

### Common Tasks

- **Adding a new mapping**:
  1. Identify the appropriate category module
  2. Add entry to the category's `phf_map!`
  3. Add test case for the new mapping
  4. Update documentation if needed

- **Adding a new category**:
  1. Create `categories/category_name.rs` with static map
  2. Add feature to `Cargo.toml`
  3. Add module to `categories/mod.rs`
  4. Include category in main lookup function
  5. Gate with `#[cfg(feature = "category_name")]`
  6. Add comprehensive tests
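
The steps for adding a category might look roughly like the sketch below. Everything here is an assumption made for illustration: the `arrows` feature and module do not appear in the feature list, a plain slice stands in for the `phf` map, and the early return for empty patterns belongs to the sketch only.

```rust
/// Hypothetical new category. In the layout above it would live in
/// `categories/arrows.rs`, gated by an `arrows` feature added to `Cargo.toml`.
#[cfg(feature = "arrows")]
mod arrows {
    /// Stand-in for a `phf` map: (pattern, character, description).
    static ARROWS: &[(&str, &str, &str)] = &[
        ("->", "→", "rightwards arrow"),
        ("<-", "←", "leftwards arrow"),
    ];

    /// Looks up `pattern` in the arrows table.
    pub fn get(pattern: &str) -> Option<(&'static str, &'static str)> {
        ARROWS
            .iter()
            .find(|(key, _, _)| *key == pattern)
            .map(|&(_, ch, description)| (ch, description))
    }
}

/// Sketch of a main lookup with one gated category wired in.
pub fn lookup(pattern: &str) -> Option<(&'static str, &'static str)> {
    // In this sketch an empty pattern never matches, so bail out early.
    if pattern.is_empty() {
        return None;
    }

    // Compiled only when the feature is enabled; otherwise the block vanishes.
    #[cfg(feature = "arrows")]
    {
        if let Some(hit) = arrows::get(pattern) {
            return Some(hit);
        }
    }

    None
}
```

With the feature disabled, neither the module nor the lookup branch is compiled, so the category adds nothing to the build.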

### Dependencies Philosophy

- **Minimal dependencies**: Only what's absolutely necessary
- **No heavyweight runtime dependencies**: Core lookup depends only on `phf`
- **Compile-time generation**: `phf` builds its maps at compile time; only the small `phf::Map` lookup code runs at runtime
- **No optional bloat**: Feature flags keep unused categories out

### Current Dependencies

- `phf` (with `macros` feature): Compile-time perfect hash maps (required for all builds)
- `emojis` (optional, enabled via `emoji` feature): Emoji shortcode lookup database

### Testing Notes

- Use `#[cfg(test)]` modules
- Test every mapping from the whitepaper
- Test case sensitivity (alpha vs Alpha)
- Test multiple aliases for same character
- Test unknown patterns return None
- IMPORTANT: Never add dependencies without giving me a chance to review them BEFORE you add them.
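
A test module covering the checklist above could be sketched like this. The `lookup` function is a hypothetical stand-in written with `match` so the example is self-contained; the real function reads `phf` maps, and the Greek descriptions are placeholders.

```rust
/// Hypothetical stand-in for the crate's `lookup`.
pub fn lookup(pattern: &str) -> Option<(&'static str, &'static str)> {
    match pattern {
        "alpha" => Some(("α", "greek small letter alpha")),
        "Alpha" => Some(("Α", "greek capital letter alpha")),
        // Two aliases that map to the same character.
        "em" | "--" => Some(("—", "em dash")),
        _ => None,
    }
}

/// Unit tests for the stand-in lookup.
#[cfg(test)]
mod tests {
    use super::lookup;

    /// `alpha` and `Alpha` must resolve to different letters.
    #[test]
    fn greek_letters_are_case_sensitive() {
        assert_eq!(lookup("alpha").map(|(ch, _)| ch), Some("α"));
        assert_eq!(lookup("Alpha").map(|(ch, _)| ch), Some("Α"));
    }

    /// Every alias of a character returns the same entry.
    #[test]
    fn aliases_resolve_to_the_same_character() {
        assert!(lookup("em").is_some());
        assert_eq!(lookup("em"), lookup("--"));
    }

    /// Unknown patterns return `None`.
    #[test]
    fn unknown_patterns_return_none() {
        assert_eq!(lookup("definitely-not-a-pattern"), None);
    }
}
```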

### Data Sources

The library includes mappings from multiple sources:

**Custom mappings** (implemented as `phf` maps):

- Punctuation & Typography (ellipsis, dashes, quotes, guillemets)
- Inverted & Special Punctuation (interrobang, inverted marks)
- Legal & Copyright Symbols
- Fractions
- Mathematical Operators
- Greek Letters (lowercase and uppercase)
- Superscripts and Subscripts
- Currency Symbols
- Miscellaneous Symbols
- Whitespace Characters
- Card Suits
- Music Notation
- Zodiac Symbols

**External sources**:

- HTML named character references from [WHATWG HTML Living Standard](https://html.spec.whatwg.org/entities.json)
- Emoji shortcodes from the `emojis` crate

### Special Implementations

**HTML Entities**: The HTML module is split across three files (`part1.rs`, `part2.rs`, `part3.rs`) to manage the large number of entities (more than 2,200 mappings). The `lookup_html()` function chains lookups across all three maps.
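
Chaining lookups across the three maps can be sketched with `Option::or_else`. This is an illustration only: the `PART1` to `PART3` slices and the `find_in` helper are hypothetical stand-ins for the real `phf` maps, the signature of `lookup_html` is assumed, and the key format (entity name without `&` and `;`) is a guess made for the example.

```rust
/// One table row: (pattern, character, description).
type Entry = (&'static str, &'static str, &'static str);

/// Stand-ins for the maps in `part1.rs`, `part2.rs`, and `part3.rs`.
static PART1: &[Entry] = &[("amp", "&", "ampersand")];
static PART2: &[Entry] = &[("copy", "©", "copyright sign")];
static PART3: &[Entry] = &[("hellip", "…", "horizontal ellipsis")];

/// Hypothetical helper: exact-match search in a single table.
fn find_in(table: &[Entry], pattern: &str) -> Option<(&'static str, &'static str)> {
    table
        .iter()
        .find(|(key, _, _)| *key == pattern)
        .map(|&(_, ch, description)| (ch, description))
}

/// Tries each part in order; later parts are consulted only on a miss.
pub fn lookup_html(pattern: &str) -> Option<(&'static str, &'static str)> {
    find_in(PART1, pattern)
        .or_else(|| find_in(PART2, pattern))
        .or_else(|| find_in(PART3, pattern))
}
```

Because `or_else` takes a closure, a hit in an earlier part short-circuits the remaining lookups.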

**Emoji**: Uses the external `emojis` crate for comprehensive shortcode support. Patterns must include surrounding colons (e.g., `:smile:`).
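
The colon requirement can be illustrated with a small sketch. The names `lookup_emoji` and `shortcode_to_emoji` are hypothetical, the `match` table stands in for the `emojis` crate's database, and the return type is simplified to the emoji alone.

```rust
/// Stand-in for the `emojis` crate's shortcode database (two entries only).
fn shortcode_to_emoji(shortcode: &str) -> Option<&'static str> {
    match shortcode {
        "smile" => Some("😄"),
        "rocket" => Some("🚀"),
        _ => None,
    }
}

/// Returns the emoji for a `:shortcode:` pattern; both colons are required.
pub fn lookup_emoji(pattern: &str) -> Option<&'static str> {
    // `?` returns `None` as soon as either colon is missing.
    let shortcode = pattern.strip_prefix(':')?.strip_suffix(':')?;
    shortcode_to_emoji(shortcode)
}
```

Requiring the colons keeps bare words such as `smile` from being treated as emoji patterns.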

### Known Issues and Design Decisions

1. **Sequential lookup**: The main `lookup()` function checks categories sequentially. While fast with `phf`, a future optimization could combine all maps into a single compile-time map.

2. **HTML numeric entities**: Numeric HTML entities (e.g., `&#65;` or `&#x41;`) are intentionally not supported to maintain zero-allocation guarantees with static strings.
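
To see why numeric entities do not fit the static-string model, consider the sketch below: the result is a `char` computed from the input, not an entry that already exists in a table. `decode_numeric_entity` is a hypothetical caller-side helper and not part of the library; it also ignores the HTML specification's special replacement rules for certain code points.

```rust
/// Decodes `&#65;` (decimal) or `&#x41;` (hexadecimal) into a `char`.
/// Hypothetical helper for illustration; not part of the library.
pub fn decode_numeric_entity(entity: &str) -> Option<char> {
    // Strip the `&#` prefix and the `;` suffix, or give up.
    let body = entity.strip_prefix("&#")?.strip_suffix(';')?;

    // A leading `x` or `X` selects hexadecimal; otherwise decimal.
    let (digits, radix) = match body.strip_prefix('x').or_else(|| body.strip_prefix('X')) {
        Some(hex) => (hex, 16),
        None => (body, 10),
    };

    // Reject signs and other stray characters that `from_str_radix` would accept.
    if !digits.bytes().all(|b| b.is_ascii_alphanumeric()) {
        return None;
    }

    let code_point = u32::from_str_radix(digits, radix).ok()?;
    // `from_u32` returns `None` for surrogates and out-of-range values.
    char::from_u32(code_point)
}
```

Returning the decoded value as `&'static str` would require allocating or leaking a string, which is the trade-off the design decision above avoids.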

### Quality Metrics

- **`no_std` compatible**: Yes (enabled via `#![no_std]` in lib.rs)
- **Test coverage**: 100% (maintained via `cargo llvm-cov`)
- **Linting**: Zero clippy warnings with strict lints enabled
- **Documentation**: 100% coverage (all public and private items documented)
- **Edition**: Rust 2024

### Future Use

This library is designed to be a general-purpose Unicode character lookup library for use in other projects, particularly text editors, CLI tools, and markdown processors.