
# Tag Path Specification

## 1. Overview

Tag Path is a system for decomposing identifiers into canonical tag sequences, enabling semantic equivalence across naming conventions and languages.

Every identifier is a **path** — an ordered sequence of **tags** separated by convention-specific delimiters. The same concept expressed in different conventions produces the same canonical tag list.

## 2. Conventions

Tag Path recognizes six naming conventions:

| Convention | Example | Delimiter |
|-----------|---------|-----------|
| snake_case | `person_name` | `_` |
| camelCase | `personName` | case boundary |
| PascalCase | `PersonName` | case boundary |
| kebab-case | `person-name` | `-` |
| UPPER_SNAKE_CASE | `PERSON_NAME` | `_` |
| Ada_Case | `Person_Name` | `_` |

### 2.1 Convention Detection

Detection is heuristic, applied in order:

1. Contains `_` (including `__`) AND all letters are uppercase → UPPER_SNAKE_CASE
2. Contains `_` (including `__`) AND every segment starts with an uppercase letter → Ada_Case
3. Contains `_` (including `__`) → snake_case
4. Contains `-` → kebab-case
5. Starts with uppercase letter → PascalCase
6. Otherwise → camelCase
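
The ordered rules above can be sketched as a single function where the first matching rule wins. This is a minimal illustration, not the crate's actual API: the `Convention` enum and `detect_convention` function are hypothetical names, and treating "all uppercase" as "at least one uppercase letter and no lowercase letters" is an assumption.

```rust
/// The six conventions from Section 2 (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Convention {
    UpperSnake,
    Ada,
    Snake,
    Kebab,
    Pascal,
    Camel,
}

/// Applies the heuristics in order; the first matching rule wins.
pub fn detect_convention(name: &str) -> Convention {
    if name.contains('_') {
        // Rule 1 (assumed reading): at least one uppercase letter, no lowercase.
        let has_upper = name.chars().any(|c| c.is_uppercase());
        let has_lower = name.chars().any(|c| c.is_lowercase());
        if has_upper && !has_lower {
            return Convention::UpperSnake;
        }
        // Rule 2: every non-empty segment starts with an uppercase letter.
        let segments: Vec<&str> = name.split('_').filter(|s| !s.is_empty()).collect();
        let ada = !segments.is_empty()
            && segments
                .iter()
                .all(|s| s.chars().next().map_or(false, |c| c.is_uppercase()));
        // Rule 3: any other identifier containing `_` is snake_case.
        return if ada { Convention::Ada } else { Convention::Snake };
    }
    // Rule 4: hyphenated identifiers.
    if name.contains('-') {
        return Convention::Kebab;
    }
    // Rules 5 and 6: decide on the first character.
    match name.chars().next() {
        Some(c) if c.is_uppercase() => Convention::Pascal,
        _ => Convention::Camel,
    }
}
```

Because the `_` checks run first, a mixed identifier such as `createContext_auth` is classified as snake_case by this sketch.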

### 2.2 Mixed Conventions

Identifiers may mix conventions (e.g., `createContext_auth`). Tokenization always splits on underscores first, then applies camelCase splitting to each segment.

## 3. Tokenization

### 3.1 snake_case / UPPER_SNAKE_CASE

Split on `_` (single underscore). Double underscore `__` is a namespace separator (see Section 5).

### 3.2 camelCase / PascalCase

Split on case boundaries:
- Lowercase → uppercase: `personName` → `[person, Name]`
- Uppercase run → lowercase (acronym boundary): `HTMLElement` → `[HTML, Element]`
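
The two boundary rules can be sketched as follows. `split_camel` is a hypothetical helper, not a function the crate is known to export. The specification does not say how digits are handled, so this sketch assumes a digit never starts or ends a tag on its own.

```rust
/// Splits a camelCase or PascalCase identifier on case boundaries.
/// Tags keep their original casing; lowercasing happens during normalization.
pub fn split_camel(name: &str) -> Vec<String> {
    let chars: Vec<char> = name.chars().collect();
    let mut tags = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if i > 0 && c.is_uppercase() {
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // Boundary 1: lowercase followed by uppercase (`personName`).
            // Boundary 2: last letter of an uppercase run that is followed
            // by a lowercase letter (`HTMLElement`).
            let boundary = prev.is_lowercase() || (prev.is_uppercase() && next_is_lower);
            if boundary && !current.is_empty() {
                tags.push(std::mem::take(&mut current));
            }
        }
        current.push(c);
    }
    if !current.is_empty() {
        tags.push(current);
    }
    tags
}
```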

### 3.3 kebab-case

Split on `-`.

### 3.4 Normalization

All tags are lowercased in the output. The canonical form is the lowercase tag list joined by `_`.

## 4. Semantic Equivalence

Two identifiers are semantically equivalent if they produce the same canonical tag list:

```
person_name  → [person, name]
personName   → [person, name]
PersonName   → [person, name]
person-name  → [person, name]
PERSON_NAME  → [person, name]
```

All five are equivalent.
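
A minimal sketch of the equivalence check, assuming the tokenization order described in Sections 2.2 and 3: split on delimiters first, then on case boundaries, then lowercase. The names `split_case`, `canonical_tags`, and `equivalent` are hypothetical. The sketch also ignores namespace dimensions, so `__` is treated like a single delimiter here.

```rust
/// Case-boundary split (Section 3.2), kept compact for this sketch.
fn split_case(segment: &str) -> Vec<String> {
    let chars: Vec<char> = segment.chars().collect();
    let mut tags: Vec<String> = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        let boundary = i > 0 && c.is_uppercase() && {
            let prev = chars[i - 1];
            let next_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            prev.is_lowercase() || (prev.is_uppercase() && next_lower)
        };
        if boundary && !current.is_empty() {
            tags.push(std::mem::take(&mut current));
        }
        current.push(c);
    }
    if !current.is_empty() {
        tags.push(current);
    }
    tags
}

/// Canonical tag list: delimiter split, then case split, then lowercase.
pub fn canonical_tags(name: &str) -> Vec<String> {
    name.split(|c: char| c == '_' || c == '-')
        .filter(|s| !s.is_empty())
        .flat_map(split_case)
        .map(|t| t.to_lowercase())
        .collect()
}

/// Two identifiers are equivalent when their canonical tag lists match.
pub fn equivalent(a: &str, b: &str) -> bool {
    canonical_tags(a) == canonical_tags(b)
}
```

Joining the result of `canonical_tags` with `_` yields the canonical form described in Section 3.4.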

## 5. Namespace Dimensions

In snake_case and UPPER_SNAKE_CASE, `__` (double underscore) separates namespace dimensions:

```
auth0__user__validate → dimensions: [[auth0], [user], [validate]]
highest_net_worth__company_person_name → dimensions: [[highest, net, worth], [company, person, name]]
```

Namespace dimensions are not extracted for camelCase, PascalCase, or kebab-case.
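
As a rough illustration, dimension extraction for the underscore-based conventions amounts to a two-level split: first on `__`, then on `_`. The function name `dimensions` is hypothetical, and dropping empty dimensions (for example from a leading `__`) is an assumption of this sketch.

```rust
/// Splits an underscore-delimited identifier into namespace dimensions,
/// each holding its own lowercase tag list.
pub fn dimensions(name: &str) -> Vec<Vec<String>> {
    name.split("__")
        .map(|dim| {
            dim.split('_')
                .filter(|t| !t.is_empty())
                .map(|t| t.to_lowercase())
                .collect::<Vec<String>>()
        })
        // Assumption: empty dimensions are discarded.
        .filter(|dim| !dim.is_empty())
        .collect()
}
```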

## 6. Role Detection

Roles are detected from prefix/suffix patterns:

| Pattern | Role |
|---------|------|
| `create_*`, `make_*`, `new_*`, `build_*` | factory |
| `use_*` | hook |
| `set_*` | setter |
| `get_*` | getter |
| `is_*`, `has_*`, `can_*`, `should_*` | predicate |
| `on_*` | handler |
| `validate_*`, `check_*`, `verify_*` | validator |
| `*_validate`, `*_check`, `*_verify` | validator (suffix) |
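
The table can be read as a lookup on the first canonical tag, with a fallback on the last tag for the validator suffix. The sketch below makes two assumptions that the table does not state: prefix patterns take priority over suffix patterns, and a single-tag identifier can still match. `Role` and `detect_role` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Role {
    Factory,
    Hook,
    Setter,
    Getter,
    Predicate,
    Handler,
    Validator,
}

/// Detects a role from a canonical (lowercase) tag list.
pub fn detect_role(tags: &[&str]) -> Option<Role> {
    let first = *tags.first()?;
    let prefix = match first {
        "create" | "make" | "new" | "build" => Some(Role::Factory),
        "use" => Some(Role::Hook),
        "set" => Some(Role::Setter),
        "get" => Some(Role::Getter),
        "is" | "has" | "can" | "should" => Some(Role::Predicate),
        "on" => Some(Role::Handler),
        "validate" | "check" | "verify" => Some(Role::Validator),
        _ => None,
    };
    // Assumption: the suffix patterns are consulted only when no prefix matched.
    prefix.or_else(|| match *tags.last()? {
        "validate" | "check" | "verify" => Some(Role::Validator),
        _ => None,
    })
}
```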

## 7. Shape Detection

Data shapes are detected from the last tag:

| Suffix | Shape |
|--------|-------|
| `a`, `a1`, `a2`, `a3`, `list`, `array` | array |
| `r`, `record` | record |
| `m`, `map` | map |
| `set` (when first tag is not `set`) | set |
| `$` (trailing) | signal |
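
A sketch of the shape lookup, operating on a canonical tag list. `Shape` and `detect_shape` are hypothetical names. How the trailing `$` survives tokenization is not specified, so this sketch assumes it is still attached to the end of the last tag.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Shape {
    Array,
    Record,
    Map,
    Set,
    Signal,
}

/// Detects a data shape from the last tag of a canonical tag list.
pub fn detect_shape(tags: &[&str]) -> Option<Shape> {
    let last = *tags.last()?;
    // Assumption: a trailing `$` is kept on the last tag by the tokenizer.
    if last.ends_with('$') {
        return Some(Shape::Signal);
    }
    match last {
        "a" | "a1" | "a2" | "a3" | "list" | "array" => Some(Shape::Array),
        "r" | "record" => Some(Shape::Record),
        "m" | "map" => Some(Shape::Map),
        // `set` counts as a shape only when the identifier does not start
        // with `set`, where it would act as the setter prefix instead.
        "set" if tags.first() != Some(&"set") => Some(Shape::Set),
        _ => None,
    }
}
```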

## 8. Configuration (.naming.toml)

### 8.1 Schema

```toml
version = 1              # Schema version (required)
name = "<string>"        # Project/config name (required)
extends = ["<string>"]   # Parent configs to inherit from
convention = "<string>"  # Default convention
immutable = <bool>       # Tags never mutate when composing
singular = <bool>        # Tags are always singular form

[vectors]
join = "_"               # Tag join character
namespace = "__"         # Namespace separator

[patterns]
<role> = "<template>"    # Role-specific name templates

[externals]
preserve_casing = <bool>   # Keep external library casing
join_with = "<string>"     # How to join external names

[packages]
separator = "<string>"    # Package name separator
pattern = "<template>"    # Package naming template

[contexts.<context_name>]
convention = "<string>"   # Convention for this context
prefix = "<string>"       # Optional prefix
suffix = "<string>"       # Optional suffix

[tags]
open = <bool>             # Allow undeclared tags

[tags.declared.<tag_name>]
level = "<string>"        # abstraction level
domain = "<string>"       # domain classification
shape = "<string>"        # data shape
role = "<string>"         # functional role
```

### 8.2 Resolution

When multiple `.naming.toml` files exist in a directory hierarchy, they merge bottom-up (closest to the file wins). The `extends` field pulls in named presets.

## 9. CLI Interface

```
tagpath parse <NAME> [--convention <CONV>] [--format text|json]
tagpath init [--lang <LANG>] [--preset <PRESET>]
tagpath extract <PATH> [--format text|json] [--ast]
tagpath search <QUERY> <PATH> [--format text|json]
tagpath lint [<PATH>]
tagpath alias <NAME> [--convention <CONV>] [--format text|json]
tagpath prose <NAME> [--format text|json]
tagpath graph [<PATH>] [--format text|dot|json] [--query <QUERY>]
```

### 9.1 parse

Decomposes an identifier into its tag structure. Auto-detects convention unless overridden.

### 9.2 init

Generates a `.naming.toml` from a language or convention preset.

### 9.3 extract

Extracts identifiers from source files under `<PATH>`.

- Recursively walks directories, selecting files by known language extensions.
- **Regex mode** (default): Uses regex patterns to extract identifiers from source text. Works for all 39 supported languages.
- **AST mode** (`--ast`): Uses tree-sitter to parse source files into an AST and extract identifiers with context classification. Available for 14 languages (Rust, Python, JavaScript, TypeScript, TSX, Go, C, C++, Java, Ruby, PHP, C#, Swift, Kotlin). Falls back to regex for unsupported languages.
- Each extracted identifier includes: name, file path, line number, detected convention, canonical tags, and context (when using `--ast`).
- `--format text` (default) outputs one identifier per line. `--format json` outputs a JSON array of identifier records.

### 9.4 search

Performs cross-convention semantic search over source files under `<PATH>`.

- The `<QUERY>` is parsed into canonical tags using the same tokenization rules as `parse`.
- All identifiers in `<PATH>` are extracted and decomposed into canonical tags.
- An identifier matches if the query's canonical tags appear as a subsequence of the identifier's canonical tags.
- Matches across all naming conventions: searching for `"user"` finds `user_name`, `userName`, `UserName`, `user-name`, and `USER_NAME`.
- Searching for `"validate_user"` finds `validateUser`, `ValidateUser`, `validate_user`, etc.
- `--format text` (default) outputs matching identifiers with file location. `--format json` outputs a JSON array.
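
The matching rule can be sketched as an order-preserving subsequence test over canonical tags. `matches_query` is a hypothetical name. The text does not say whether the subsequence must be contiguous; this sketch assumes the standard meaning, where gaps are allowed.

```rust
/// Returns true when every query tag appears in the identifier's tags,
/// in the same order, with gaps allowed between matches.
pub fn matches_query(query: &[&str], identifier: &[&str]) -> bool {
    let mut remaining = identifier.iter();
    // For each query tag, advance through the identifier until it is found.
    // `any` consumes the iterator, so later tags can only match further right.
    query.iter().all(|q| remaining.any(|t| t == q))
}
```

With this reading, the query `validate_user` (tags `[validate, user]`) also matches an identifier such as `validate_admin_user`, and an empty query matches everything.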

### 9.5 lint

Validates source file identifiers against `.naming.toml` rules.

- Loads the nearest `.naming.toml` (with `extends` resolution) for each file.
- Extracts identifiers from source files (uses tree-sitter AST when available).
- Checks each identifier's convention against the expected convention for its context.
- Reports violations with file path, line number, identifier name, expected convention, and actual convention.

### 9.6 alias

Generates cross-convention aliases for an identifier.

- Parses the input identifier into canonical tags using the same tokenization rules as `parse`.
- Reconstructs the identifier in all 6 naming conventions: snake_case, camelCase, PascalCase, kebab-case, UPPER_SNAKE_CASE, Ada_Case.
- Optional `--convention` flag to produce only a single target convention.
- `--format text` (default) outputs one convention per line. `--format json` outputs a JSON object with `tags` and `aliases` fields.
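
Reconstruction from canonical tags is mostly a matter of joining and re-casing. The sketch below is illustrative only: `capitalize` and `aliases` are hypothetical helpers, and the returned labels are simply the convention names from Section 2, not the tool's actual output format.

```rust
/// Uppercases the first character of a tag and leaves the rest unchanged.
fn capitalize(tag: &str) -> String {
    let mut chars = tag.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Rebuilds an identifier in all six conventions from lowercase tags.
pub fn aliases(tags: &[&str]) -> Vec<(&'static str, String)> {
    let caps: Vec<String> = tags.iter().map(|t| capitalize(t)).collect();
    // camelCase keeps the first tag lowercase and capitalizes the rest.
    let camel: String = tags
        .iter()
        .enumerate()
        .map(|(i, t)| if i == 0 { t.to_string() } else { capitalize(t) })
        .collect();
    vec![
        ("snake_case", tags.join("_")),
        ("camelCase", camel),
        ("PascalCase", caps.concat()),
        ("kebab-case", tags.join("-")),
        ("UPPER_SNAKE_CASE", tags.join("_").to_uppercase()),
        ("Ada_Case", caps.join("_")),
    ]
}
```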

### 9.7 prose

Generates a human-readable prose description of an identifier.

- Parses the input identifier into canonical tags.
- Strips role prefixes (`create`, `get`, `is`, etc.) and shape suffixes (`a`, `list`, `map`, etc.) from the core tags.
- Builds a natural English phrase based on role and shape:
  - Factory: "Creates a {noun}" (e.g., `create_user_profile` → "Creates a user profile")
  - Predicate: "Checks if {subject} {predicate} {modifiers}" (e.g., `is_valid_email` → "Checks if email is valid")
  - Array shape: "Array of {noun}s" (e.g., `user_name_a` → "Array of user names")
  - No role/shape: capitalizes the noun phrase (e.g., `PersonName` → "Person name")
- `--format text` (default) outputs the prose string. `--format json` outputs a JSON object with `original`, `prose`, `tags`, `role`, and `shape` fields.

### 9.8 graph

Builds a tag co-occurrence graph from extracted identifiers.

- Extracts all identifiers from source files under `<PATH>` (defaults to `.`).
- Nodes represent individual tags (lowercase, deduplicated).
- Edges connect sequential tag pairs within identifiers (e.g., `create_user` produces edge `create -> user`).
- Edge weights count how many identifiers share that tag pair.
- Optional `--query` flag filters to a subgraph: seed nodes matching query tags plus their direct 1-hop neighbors.
- `--format dot` outputs Graphviz DOT format, which is also what `--format text` produces.
- `--format json` outputs a JSON object with `nodes` (sorted array) and `edges` (array of `{from, to, weight}` objects).
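
The node and edge construction can be sketched with ordered collections from the standard library. `TagGraph` and `build_graph` are hypothetical names. Since edge weights count identifiers, this sketch assumes a tag pair that repeats inside one identifier is counted once for that identifier.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Tag co-occurrence graph: deduplicated nodes and weighted directed edges.
pub struct TagGraph {
    pub nodes: BTreeSet<String>,
    pub edges: BTreeMap<(String, String), usize>,
}

/// Builds the graph from already-tokenized identifiers (lowercase tags).
pub fn build_graph(identifiers: &[Vec<&str>]) -> TagGraph {
    let mut nodes = BTreeSet::new();
    let mut edges = BTreeMap::new();
    for tags in identifiers {
        for tag in tags {
            nodes.insert(tag.to_string());
        }
        // Sequential pairs within one identifier, deduplicated so each
        // identifier contributes at most 1 to a given edge (assumption).
        let pairs: BTreeSet<(String, String)> = tags
            .windows(2)
            .map(|w| (w[0].to_string(), w[1].to_string()))
            .collect();
        for pair in pairs {
            *edges.entry(pair).or_insert(0) += 1;
        }
    }
    TagGraph { nodes, edges }
}
```

Using `BTreeSet` and `BTreeMap` keeps nodes and edges in sorted order, which lines up with the sorted `nodes` array mentioned for the JSON output.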

## 10. Extends Resolution

The `extends` field in `.naming.toml` enables composable configuration.

### 10.1 Syntax

```toml
# Extend a single language preset:
extends = ["rust"]

# Or extend multiple presets (applied left to right):
# extends = ["rust", "custom"]
```

### 10.2 Resolution Rules

1. Presets are resolved by name from the `lang/` and `presets/` directories.
2. When extending multiple presets, they are applied left to right — later presets override earlier ones.
3. The extending config's fields override all inherited fields at the same level.
4. Context-level merging: `[contexts.<name>]` sections merge with inherited contexts. Only the fields specified in the extending config replace the parent values; unspecified fields are retained from the parent.
5. Top-level fields (`convention`, `immutable`, `singular`, etc.) are fully replaced if present in the extending config.
6. `[tags.declared]` entries merge additively — the extending config can add new tag declarations without removing inherited ones.
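
Rules 2, 4, and 5 can be illustrated with a small overlay function. This is a sketch under stated assumptions, not the crate's implementation: the `Config` and `Context` structs are hypothetical and model only a few fields from Section 8.1, with `Option` standing in for "not specified in this config".

```rust
use std::collections::BTreeMap;

/// A `[contexts.<name>]` section; `None` means the field was not specified.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Context {
    pub convention: Option<String>,
    pub prefix: Option<String>,
    pub suffix: Option<String>,
}

/// A reduced view of `.naming.toml` for this sketch.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Config {
    pub convention: Option<String>,
    pub contexts: BTreeMap<String, Context>,
}

/// Overlays `child` on `parent`.
pub fn merge(parent: &Config, child: &Config) -> Config {
    let mut merged = parent.clone();
    // Rule 5: a top-level field is replaced when the child specifies it.
    if child.convention.is_some() {
        merged.convention = child.convention.clone();
    }
    // Rule 4: contexts merge field by field; unspecified fields are retained.
    for (name, ctx) in &child.contexts {
        let slot = merged.contexts.entry(name.clone()).or_default();
        if ctx.convention.is_some() {
            slot.convention = ctx.convention.clone();
        }
        if ctx.prefix.is_some() {
            slot.prefix = ctx.prefix.clone();
        }
        if ctx.suffix.is_some() {
            slot.suffix = ctx.suffix.clone();
        }
    }
    merged
}

/// Rule 2: presets apply left to right, then the extending config on top.
pub fn resolve(presets: &[Config], own: &Config) -> Config {
    let base = presets
        .iter()
        .fold(Config::default(), |acc, preset| merge(&acc, preset));
    merge(&base, own)
}
```

Additive merging of `[tags.declared]` (rule 6) would follow the same pattern as the context loop, inserting the child's entries into the inherited map.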