prefix-register 0.2.2

# prefix-register

[![PyPI](https://img.shields.io/pypi/v/prefix-register.svg)](https://pypi.org/project/prefix-register/)
[![License](https://img.shields.io/pypi/l/prefix-register.svg)](LICENSE)

**Status: Beta** - The API may change before the 1.0 release.

A PostgreSQL-backed namespace prefix registry for [CURIE](https://www.w3.org/TR/curie/) expansion, shortening, and prefix management.

## Features

- **Async-only** - Built for high concurrency with asyncio
- **In-memory caching** - Prefixes loaded on startup for fast CURIE expansion
- **First-prefix-wins** - Each URI keeps the first prefix registered for it; later attempts to assign a different prefix are ignored
- **Batch operations** - Efficiently process multiple prefixes/URIs in a single call
- **Longest-match shortening** - Overlapping namespaces handled correctly
- **PostgreSQL backend** - Durable, scalable storage with connection pooling
- **Startup resilience** - Optional retry with exponential backoff for container orchestration
- **Input validation** - Length limits (prefix max 64 chars, URI max 2048 chars) reject oversized input and reduce exposure to DoS

## Installation

```bash
pip install prefix-register
```

Requires Python 3.10+.

## Database Setup

Create the namespaces table in your PostgreSQL database:

```sql
CREATE TABLE IF NOT EXISTS namespaces (
    uri TEXT PRIMARY KEY,
    prefix TEXT NOT NULL UNIQUE
);
```
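
To see how the two constraints in this table interact, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. That substitution is an assumption made only so the snippet runs without a server. `try_insert` is a hypothetical helper, and `INSERT OR IGNORE` is SQLite syntax; neither is claimed to be how prefix-register writes rows.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS namespaces (
        uri TEXT PRIMARY KEY,
        prefix TEXT NOT NULL UNIQUE
    )
    """
)


def try_insert(prefix: str, uri: str) -> bool:
    """Insert a row unless it violates a constraint; return True if a row was added."""
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO namespaces (uri, prefix) VALUES (?, ?)",
        (uri, prefix),
    )
    return conn.total_changes > before


first = try_insert("foaf", "http://xmlns.com/foaf/0.1/")        # new URI, new prefix
same_uri = try_insert("friend", "http://xmlns.com/foaf/0.1/")   # URI already present
same_prefix = try_insert("foaf", "http://example.org/other/")   # prefix already taken
rows = conn.execute("SELECT prefix, uri FROM namespaces").fetchall()
```

In this sketch only the first insert adds a row. The primary key on `uri` blocks a second prefix for the same URI, and the `UNIQUE` constraint on `prefix` blocks one prefix from pointing at two URIs.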

## Quick Start

```python
import asyncio
from prefix_register import PrefixRegistry

async def main():
    # Connect to PostgreSQL (loads existing prefixes into memory)
    registry = await PrefixRegistry.new(
        "postgres://user:password@localhost:5432/mydb",
        10  # max connections in pool
    )

    # Register a namespace prefix
    await registry.store_prefix_if_new("foaf", "http://xmlns.com/foaf/0.1/")

    # Expand a CURIE to full URI
    uri = await registry.expand_curie("foaf", "Person")
    print(uri)  # http://xmlns.com/foaf/0.1/Person

    # Shorten a URI back to a CURIE
    result = await registry.shorten_uri("http://xmlns.com/foaf/0.1/Person")
    if result:
        prefix, local = result
        print(f"{prefix}:{local}")  # foaf:Person

asyncio.run(main())
```

## Examples

### Registering Namespace Prefixes

The registry uses a "first prefix wins" rule - once a URI has a prefix, subsequent attempts to register a different prefix for the same URI are ignored.

```python
# Store a single prefix - returns True if stored, False if URI already has a prefix
stored = await registry.store_prefix_if_new("foaf", "http://xmlns.com/foaf/0.1/")
if stored:
    print("New prefix registered")
else:
    print("URI already has a prefix")

# Batch store multiple prefixes (more efficient than individual calls)
prefixes = [
    ("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#"),
    ("rdfs", "http://www.w3.org/2000/01/rdf-schema#"),
    ("owl", "http://www.w3.org/2002/07/owl#"),
    ("xsd", "http://www.w3.org/2001/XMLSchema#"),
    ("schema", "https://schema.org/"),
    ("dc", "http://purl.org/dc/elements/1.1/"),
    ("dcterms", "http://purl.org/dc/terms/"),
    ("skos", "http://www.w3.org/2004/02/skos/core#"),
]
result = await registry.store_prefixes_if_new(prefixes)
print(f"Stored {result['stored']} new prefixes, skipped {result['skipped']}")
```

### Expanding CURIEs to Full URIs

CURIEs (Compact URIs) like `foaf:Person` are expanded by looking up the prefix and appending the local name.

```python
# Expand a single CURIE
uri = await registry.expand_curie("foaf", "Person")
if uri:
    print(uri)  # http://xmlns.com/foaf/0.1/Person
else:
    print("Unknown prefix")

# Expand multiple CURIEs in batch (more efficient for bulk operations)
curies = [
    ("foaf", "Person"),
    ("foaf", "name"),
    ("rdf", "type"),
    ("unknown", "Thing"),  # This prefix doesn't exist
]
results = await registry.expand_curie_batch(curies)

for (prefix, local), uri in zip(curies, results):
    if uri:
        print(f"{prefix}:{local} -> {uri}")
    else:
        print(f"{prefix}:{local} -> UNKNOWN PREFIX")

# Output:
# foaf:Person -> http://xmlns.com/foaf/0.1/Person
# foaf:name -> http://xmlns.com/foaf/0.1/name
# rdf:type -> http://www.w3.org/1999/02/22-rdf-syntax-ns#type
# unknown:Thing -> UNKNOWN PREFIX
```
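
The calls above take the prefix and local name as separate arguments. If your input arrives as a single `prefix:local` string, a small helper can split it first. The sketch below is a hypothetical convenience function (`split_curie` is not part of prefix-register) and assumes the first colon separates the prefix from the local name.

```python
from typing import Optional, Tuple


def split_curie(curie: str) -> Optional[Tuple[str, str]]:
    """Split 'prefix:local' at the first colon; return None if there is no prefix."""
    prefix, sep, local = curie.partition(":")
    if not sep or not prefix:
        return None
    return prefix, local


pairs = [split_curie(c) for c in ["foaf:Person", "rdf:type", "no-colon"]]
```

Note that a full URI such as `http://example.org/x` also contains a colon, so this helper would report `http` as its prefix. Callers that accept both CURIEs and full URIs need to tell them apart before splitting.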

### Shortening URIs to CURIEs

Convert full URIs back to compact CURIEs. The registry uses **longest-match semantics** - if multiple registered namespaces match, the longest one wins.

```python
# Shorten a single URI - returns (prefix, local_name) tuple or None
result = await registry.shorten_uri("http://xmlns.com/foaf/0.1/Person")
if result:
    prefix, local_name = result
    print(f"{prefix}:{local_name}")  # foaf:Person
else:
    print("No matching namespace found")

# Convenience method: get a formatted CURIE string, or the original URI if no match
curie = await registry.shorten_uri_or_full("http://xmlns.com/foaf/0.1/Person")
print(curie)  # "foaf:Person"

curie = await registry.shorten_uri_or_full("http://unknown.example.org/Thing")
print(curie)  # "http://unknown.example.org/Thing" (returned as-is)

# Batch shorten multiple URIs
uris = [
    "http://xmlns.com/foaf/0.1/Person",
    "http://xmlns.com/foaf/0.1/name",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://unknown.example.org/Thing",  # No matching namespace
]
results = await registry.shorten_uri_batch(uris)

for uri, result in zip(uris, results):
    if result:
        prefix, local = result
        print(f"{uri} -> {prefix}:{local}")
    else:
        print(f"{uri} -> NO MATCH")

# Output:
# http://xmlns.com/foaf/0.1/Person -> foaf:Person
# http://xmlns.com/foaf/0.1/name -> foaf:name
# http://www.w3.org/1999/02/22-rdf-syntax-ns#type -> rdf:type
# http://unknown.example.org/Thing -> NO MATCH
```

### Longest-Match Semantics

When multiple registered namespaces could match a URI, the longest one wins. This handles overlapping namespaces correctly.

```python
# Register two overlapping namespaces
await registry.store_prefix_if_new("ex", "http://example.org/")
await registry.store_prefix_if_new("exdata", "http://example.org/data#")

# This URI matches both namespaces, but exdata is longer
result = await registry.shorten_uri("http://example.org/data#Person")
prefix, local = result
print(f"{prefix}:{local}")  # exdata:Person (NOT ex:data#Person)

# This URI only matches the shorter namespace
result = await registry.shorten_uri("http://example.org/other/Thing")
prefix, local = result
print(f"{prefix}:{local}")  # ex:other/Thing
```
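
The rule itself can be illustrated without a database. The following is a minimal sketch over a plain `{prefix: uri}` dict; `shorten_longest_match` is a hypothetical function written for this explanation, and the linear scan is an assumption chosen for clarity rather than a description of the library's internals.

```python
from typing import Dict, Optional, Tuple


def shorten_longest_match(
    uri: str, namespaces: Dict[str, str]
) -> Optional[Tuple[str, str]]:
    """Return (prefix, local) for the longest namespace URI that `uri` starts with."""
    best: Optional[Tuple[str, str]] = None  # (prefix, namespace URI)
    for prefix, ns in namespaces.items():
        if uri.startswith(ns) and (best is None or len(ns) > len(best[1])):
            best = (prefix, ns)
    if best is None:
        return None
    prefix, ns = best
    return prefix, uri[len(ns):]


namespaces = {"ex": "http://example.org/", "exdata": "http://example.org/data#"}
overlapping = shorten_longest_match("http://example.org/data#Person", namespaces)
shorter_only = shorten_longest_match("http://example.org/other/Thing", namespaces)
no_match = shorten_longest_match("http://unknown.example.org/Thing", namespaces)
```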

### Looking Up Prefixes and URIs

```python
# Get the URI for a known prefix
uri = await registry.get_uri_for_prefix("foaf")
if uri:
    print(f"foaf -> {uri}")  # foaf -> http://xmlns.com/foaf/0.1/

# Get the prefix for a known URI
prefix = await registry.get_prefix_for_uri("http://xmlns.com/foaf/0.1/")
if prefix:
    print(f"http://xmlns.com/foaf/0.1/ -> {prefix}")  # -> foaf

# Get all registered prefixes
all_prefixes = await registry.get_all_prefixes()
for prefix, uri in all_prefixes.items():
    print(f"{prefix}: {uri}")

# Get count of registered prefixes
count = await registry.prefix_count()
print(f"Total prefixes: {count}")
```
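
One use for the `{prefix: uri}` dict is writing a Turtle header. The sketch below assumes a dict shaped like the one `get_all_prefixes()` returns in the example above; `turtle_prefix_header` is a hypothetical helper, not part of prefix-register, and it does no escaping of the URIs.

```python
from typing import Dict


def turtle_prefix_header(prefixes: Dict[str, str]) -> str:
    """Render a {prefix: uri} mapping as Turtle @prefix lines, sorted by prefix."""
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in sorted(prefixes.items())]
    return "\n".join(lines)


header = turtle_prefix_header({
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
})
```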

### Connection with Retry (for Container Orchestration)

When running in Kubernetes or Docker Compose, your database might not be ready when your app starts. Use `new_with_retry` to retry the initial connection with exponential backoff.

```python
# Wait for database to become available (useful in container startup)
registry = await PrefixRegistry.new_with_retry(
    "postgres://user:password@db:5432/mydb",
    max_connections=10,
    max_retries=5,         # Try up to 5 times
    initial_delay_ms=1000, # Start with 1 second delay
    max_delay_ms=30000     # Cap delay at 30 seconds
)
# Delays double on each retry: 1s -> 2s -> 4s -> 8s -> 16s (never more than max_delay_ms)
```
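
The delay sequence in the comment above can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions: the delay doubles after each attempt, and there is one delay per retry. `backoff_delays_ms` is a hypothetical function, and whether the library adds jitter is not covered here.

```python
from typing import List


def backoff_delays_ms(
    max_retries: int, initial_delay_ms: int, max_delay_ms: int
) -> List[int]:
    """Delays that double each time and never exceed max_delay_ms."""
    return [
        min(initial_delay_ms * 2 ** attempt, max_delay_ms)
        for attempt in range(max_retries)
    ]


schedule = backoff_delays_ms(max_retries=5, initial_delay_ms=1000, max_delay_ms=30000)
capped = backoff_delays_ms(max_retries=7, initial_delay_ms=1000, max_delay_ms=30000)
```

With the settings from the example, the cap is never reached in five retries; it only takes effect from the sixth delay onward, which would otherwise be 32 seconds.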

### Real-World Example: Processing RDF Data

```python
import asyncio
from prefix_register import PrefixRegistry

async def process_rdf_triples(registry, triples):
    """Convert full URIs in triples to CURIEs for display."""
    results = []

    # Collect all URIs that need shortening
    all_uris = []
    for s, p, o in triples:
        all_uris.extend([s, p, o] if isinstance(o, str) and o.startswith("http") else [s, p])

    # Batch shorten for efficiency
    shortened = await registry.shorten_uri_batch(all_uris)
    uri_to_curie = {}
    for uri, result in zip(all_uris, shortened):
        if result:
            prefix, local = result
            uri_to_curie[uri] = f"{prefix}:{local}"
        else:
            uri_to_curie[uri] = uri  # Keep original if no match

    # Format triples with CURIEs
    for s, p, o in triples:
        s_short = uri_to_curie.get(s, s)
        p_short = uri_to_curie.get(p, p)
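        # Objects that were shortened above are URIs; anything else is a literal, shown via repr()
        o_short = uri_to_curie[o] if isinstance(o, str) and o in uri_to_curie else repr(o)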
        results.append(f"{s_short} {p_short} {o_short}")

    return results

async def main():
    registry = await PrefixRegistry.new("postgres://localhost/mydb", 10)

    # Register common prefixes
    await registry.store_prefixes_if_new([
        ("foaf", "http://xmlns.com/foaf/0.1/"),
        ("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#"),
    ])

    # Sample RDF triples (as full URIs)
    triples = [
        ("http://example.org/john", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "http://xmlns.com/foaf/0.1/Person"),
        ("http://example.org/john", "http://xmlns.com/foaf/0.1/name", "John Doe"),
    ]

    formatted = await process_rdf_triples(registry, triples)
    for line in formatted:
        print(line)
    # Output:
    # http://example.org/john rdf:type foaf:Person
    # http://example.org/john foaf:name 'John Doe'

asyncio.run(main())
```

## API Reference

### PrefixRegistry

| Method | Description |
|--------|-------------|
| `new(database_url, max_connections)` | Connect to PostgreSQL and load existing prefixes |
| `new_with_retry(...)` | Connect with retry logic for transient failures |
| `store_prefix_if_new(prefix, uri)` | Store a prefix if URI doesn't have one (returns `bool`) |
| `store_prefixes_if_new(prefixes)` | Batch store prefixes (returns `{"stored": n, "skipped": m}`) |
| `get_uri_for_prefix(prefix)` | Get URI for a prefix, or `None` |
| `get_prefix_for_uri(uri)` | Get prefix for a URI, or `None` |
| `expand_curie(prefix, local_name)` | Expand CURIE to full URI, or `None` if unknown |
| `expand_curie_batch(curies)` | Batch expand; returns a list with one `str` or `None` per input, in input order |
| `shorten_uri(uri)` | Shorten to `(prefix, local)` tuple, or `None` |
| `shorten_uri_or_full(uri)` | Shorten to `"prefix:local"` string, or return original URI |
| `shorten_uri_batch(uris)` | Batch shorten; returns a list with one `(prefix, local)` tuple or `None` per input, in input order |
| `get_all_prefixes()` | Get all prefixes as `{prefix: uri}` dict |
| `prefix_count()` | Get number of registered prefixes |

## Use Cases

- **CURIE expansion** in RDF/SPARQL processing
- **Namespace management** for semantic web applications
- **Prefix discovery** from Turtle, JSON-LD, RDF/XML documents
- **URI shortening** for human-readable output
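
As an illustration of the prefix discovery use case, the sketch below pulls prefix declarations out of Turtle text with a regular expression. It is deliberately simplified and is not a Turtle parser: it ignores comments, escapes, and the empty prefix. `extract_prefixes` and `PREFIX_DECL` are hypothetical names, not part of prefix-register.

```python
import re
from typing import List, Tuple

# Matches Turtle '@prefix p: <uri> .' and SPARQL-style 'PREFIX p: <uri>' lines.
# Simplified on purpose: no comments, no escapes, no empty prefix.
PREFIX_DECL = re.compile(
    r"^\s*(?:@prefix|PREFIX)\s+([A-Za-z][\w.-]*):\s*<([^<>\s]*)>",
    re.MULTILINE,
)


def extract_prefixes(text: str) -> List[Tuple[str, str]]:
    """Return (prefix, uri) pairs in document order."""
    return PREFIX_DECL.findall(text)


turtle = """\
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://example.org/john> rdf:type foaf:Person .
"""
found = extract_prefixes(turtle)
```

The resulting list has the same `(prefix, uri)` shape that is passed to `store_prefixes_if_new` in the examples earlier in this document.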

## License

Apache-2.0