# URI Register
> **Beta Software**: This library is in active development and the API may change. While it's being used in production environments, you should pin to a specific version and test thoroughly before upgrading.
A caching PostgreSQL-backed URI register service for assigning unique integer IDs to URIs. Perfect for string interning, deduplication, and systems that need consistent global identifier mappings.
**Note:** The Rust library requires an async runtime (tokio). Python bindings support both synchronous and asynchronous usage.
## Overview
The URI Register provides a simple, fast way to assign unique integer IDs to URI strings. Once registered, a URI always returns the same ID, making it ideal for string interning and deduplication in distributed systems.
## Features
- **Simple API**: Two registration methods, `register_uri()` and `register_uri_batch()`, plus `stats()` for observability
- **Async + Sync**: Built on tokio for high concurrency, with sync wrappers for Python
- **Batch optimised**: Process thousands of URIs in a single database round-trip
- **Configurable caching**: W-TinyLFU (Moka) or LRU caching for frequently accessed URIs
- **Order preservation**: Batch operations maintain strict order correspondence
- **PostgreSQL backend**: Durable, scalable, with connection pooling
- **Automatic retry logic**: Configurable exponential backoff for transient database errors
- **Thread-safe**: Designed for concurrent access from multiple threads/processes
## Use Cases
- **String interning systems**: Reduce memory footprint by storing strings once and referencing by ID
- **URL deduplication**: Assign unique IDs to URLs across distributed crawlers
- **Global identifier systems**: Centralised ID assignment for URIs/strings in microservices
- **Data warehousing**: Efficient storage of repeated string values
- **Distributed caching**: Consistent ID assignment across cache nodes
## Installation
### Rust
Add to your `Cargo.toml`:
```toml
[dependencies]
uri-register = "0.2.0"
```
Or use as a git dependency:
```toml
[dependencies]
uri-register = { git = "https://github.com/telicent-oss/uri-register" }
```
### Python
Install from TestPyPI (during beta):
```bash
pip install --index-url https://test.pypi.org/simple/ uri-register
```
**Requirements**: Python 3.8+
**Note**: The package is currently published to TestPyPI for testing. Once stable, it will be available on the main PyPI repository.
## Setup
### 1. Database Initialisation
Before using the URI Register service, you must initialise the PostgreSQL schema.
**Run the schema creation script:**
```bash
psql -U username -d database_name -f schema.sql
```
Or execute the SQL directly:
```sql
CREATE TABLE IF NOT EXISTS uri_register (
    id BIGSERIAL PRIMARY KEY,
    uri TEXT NOT NULL,
    uri_hash UUID GENERATED ALWAYS AS (md5(uri)::uuid) STORED UNIQUE
);
```
### 2. Database Configuration
The service requires a PostgreSQL connection string. Set it as an environment variable or pass it directly:
```bash
export DATABASE_URL="postgresql://username:password@localhost:5432/database_name"
```
## Usage
### Rust Example
```rust
use uri_register::{PostgresUriRegister, UriService};

#[tokio::main]
async fn main() -> uri_register::Result<()> {
    // Connect to PostgreSQL
    let register = PostgresUriRegister::new(
        "postgres://localhost/mydb",
        "uri_register", // table name
        20,             // max connections
        10_000,         // cache size (uses Moka/W-TinyLFU by default)
    )
    .await?;

    // Register a single URI
    let id = register.register_uri("http://example.org/resource/1").await?;
    println!("Registered URI with ID: {}", id);

    // Register the same URI again - returns the same ID
    let same_id = register.register_uri("http://example.org/resource/1").await?;
    assert_eq!(id, same_id);

    // Register multiple URIs in batch (much faster!)
    let uris = vec![
        "http://example.org/resource/2".to_string(),
        "http://example.org/resource/3".to_string(),
        "http://example.org/resource/4".to_string(),
    ];
    let ids = register.register_uri_batch(&uris).await?;

    // IDs maintain order: ids[i] corresponds to uris[i]
    for (uri, id) in uris.iter().zip(ids.iter()) {
        println!("{} -> {}", uri, id);
    }

    Ok(())
}
```
### Synchronous Rust API
For synchronous Rust applications that cannot use async/await, use `SyncPostgresUriRegister`:
```rust
use uri_register::SyncPostgresUriRegister;

fn main() -> uri_register::Result<()> {
    // Connect to PostgreSQL
    let register = SyncPostgresUriRegister::new(
        "postgres://localhost/mydb",
        "uri_register", // table name
        20,             // max connections
        10_000,         // cache size (uses Moka/W-TinyLFU by default)
    )?;

    // Register a single URI (blocks until complete)
    let id = register.register_uri("http://example.org/resource/1")?;
    println!("Registered URI with ID: {}", id);

    // Register multiple URIs in batch
    let uris = vec![
        "http://example.org/resource/2".to_string(),
        "http://example.org/resource/3".to_string(),
    ];
    let ids = register.register_uri_batch(&uris)?;

    Ok(())
}
```
The synchronous API wraps the async implementation with a Tokio runtime internally. All methods have identical semantics to their async counterparts but block the calling thread until completion.
### Python Example (Synchronous)
```python
from uri_register import UriRegister

# Connect to PostgreSQL
register = UriRegister(
    "postgres://localhost/mydb",
    "uri_register",  # table name
    20,              # max connections
    10_000,          # cache size
    "moka",          # cache strategy ("moka" is default, or use "lru")
)

# Register a single URI (named uri_id to avoid shadowing the builtin id())
uri_id = register.register_uri("http://example.org/resource/1")
print(f"Registered URI with ID: {uri_id}")

# Register the same URI again - returns the same ID
same_id = register.register_uri("http://example.org/resource/1")
assert uri_id == same_id

# Register multiple URIs in batch (much faster!)
uris = [
    "http://example.org/resource/2",
    "http://example.org/resource/3",
    "http://example.org/resource/4",
]
ids = register.register_uri_batch(uris)

# IDs maintain order: ids[i] corresponds to uris[i]
for uri, batch_id in zip(uris, ids):
    print(f"{uri} -> {batch_id}")

# Get statistics
stats = register.stats()
print(f"Total URIs: {stats['total_uris']}")
```
### Python Example (Asynchronous)
```python
import asyncio

from uri_register import UriRegister


async def main():
    # Connect to PostgreSQL
    register = await UriRegister.new_async(
        "postgres://localhost/mydb",
        "uri_register",  # table name
        20,              # max connections
        10_000,          # cache size
        "moka",          # cache strategy ("moka" is default, or use "lru")
    )

    # Register a single URI (named uri_id to avoid shadowing the builtin id())
    uri_id = await register.register_uri_async("http://example.org/resource/1")
    print(f"Registered URI with ID: {uri_id}")

    # Register multiple URIs in batch (much faster!)
    uris = [
        "http://example.org/resource/2",
        "http://example.org/resource/3",
    ]
    ids = await register.register_uri_batch_async(uris)

    # Get statistics
    stats = await register.stats_async()
    print(f"Total URIs: {stats['total_uris']}")


asyncio.run(main())
```
### API Reference
The `UriService` trait provides two methods:
#### `register_uri(uri: &str) -> u64`
Register a single URI and return its ID.
- If the URI exists, returns the existing ID
- If the URI is new, creates a new ID and returns it
- Uses configurable cache (Moka/LRU) for fast repeated lookups
```rust
let id = register.register_uri("http://example.org/page").await?;
```
#### `register_uri_batch(uris: &[String]) -> Vec<u64>`
Register multiple URIs in batch and return their IDs.
- **Order preserved**: `ids[i]` corresponds to `uris[i]`
- Much faster than calling `register_uri()` multiple times
- Handles duplicate URIs in input correctly
- Cache-optimised: only queries database for cache misses
```rust
let uris = vec![
    "http://example.org/page1".to_string(),
    "http://example.org/page2".to_string(),
];
let ids = register.register_uri_batch(&uris).await?;
// Access by index
assert_eq!(ids[0], register.register_uri("http://example.org/page1").await?);
```
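The batch contract above (order preservation, duplicate handling, and querying the database only for cache misses) can be illustrated with a small in-memory model. This is a sketch for intuition only, not the library's implementation: `ToyRegister`, its dictionaries, and the `store_round_trips` counter are all hypothetical stand-ins for the PostgreSQL table and the cache.

```python
from typing import Dict, List


class ToyRegister:
    """Hypothetical in-memory model of the register contract (not the library's code)."""

    def __init__(self) -> None:
        self._store: Dict[str, int] = {}  # stands in for the PostgreSQL table
        self._cache: Dict[str, int] = {}  # stands in for the Moka/LRU cache
        self.store_round_trips = 0        # counts simulated database round-trips

    def _store_get_or_create(self, uris: List[str]) -> Dict[str, int]:
        """One simulated round-trip: look up or create IDs for distinct URIs."""
        self.store_round_trips += 1
        for uri in uris:
            if uri not in self._store:
                self._store[uri] = len(self._store) + 1
        return {uri: self._store[uri] for uri in uris}

    def register_uri_batch(self, uris: List[str]) -> List[int]:
        # Collect cache misses once each, keeping first-seen order.
        misses = list(dict.fromkeys(u for u in uris if u not in self._cache))
        if misses:
            self._cache.update(self._store_get_or_create(misses))
        # Build the result by position so ids[i] corresponds to uris[i].
        return [self._cache[u] for u in uris]

    def register_uri(self, uri: str) -> int:
        return self.register_uri_batch([uri])[0]


register = ToyRegister()
ids = register.register_uri_batch(
    ["http://example.org/a", "http://example.org/b", "http://example.org/a"]
)
print(ids)  # [1, 2, 1]

# A repeated lookup is served from the cache, so no further round-trip happens.
again = register.register_uri("http://example.org/b")
print(again, register.store_round_trips)  # 2 1
```

The duplicate input URI receives the same ID at both positions, and the whole batch costs a single simulated round-trip.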
### Statistics and Observability
The register exposes comprehensive metrics suitable for OpenTelemetry and Prometheus:
```rust
let stats = register.stats().await?;
// Database metrics
println!("Total URIs: {}", stats.total_uris);
println!("Storage size: {} bytes", stats.size_bytes);
// Cache performance metrics
println!("Cache hits: {}", stats.cache.hits);
println!("Cache misses: {}", stats.cache.misses);
println!("Cache hit rate: {:.2}%", stats.cache.hit_rate());
println!("Cache entries: {}/{}", stats.cache.entry_count, stats.cache.capacity);
// Connection pool metrics
println!("Active connections: {}", stats.pool.connections_active);
println!("Idle connections: {}", stats.pool.connections_idle);
println!("Max connections: {}", stats.pool.connections_max);
```
#### Integration with OpenTelemetry
The statistics are designed for easy integration with observability systems:
```rust
use opentelemetry::global;

// Obtain a meter from the globally configured meter provider
let meter = global::meter("uri_register");
let stats = register.stats().await?;

// Report as gauges. Instrument builders are finalised with `.build()` in
// opentelemetry 0.27+; earlier releases use `.init()` instead.
meter.u64_gauge("uri_register.cache.hits").build().record(stats.cache.hits, &[]);
meter.u64_gauge("uri_register.cache.misses").build().record(stats.cache.misses, &[]);
meter.f64_gauge("uri_register.cache.hit_rate").build().record(stats.cache.hit_rate(), &[]);
meter.u64_gauge("uri_register.cache.size").build().record(stats.cache.entry_count, &[]);
meter.u64_gauge("uri_register.pool.active").build().record(stats.pool.connections_active as u64, &[]);
meter.u64_gauge("uri_register.pool.idle").build().record(stats.pool.connections_idle as u64, &[]);
meter.u64_gauge("uri_register.total_uris").build().record(stats.total_uris, &[]);
meter.u64_gauge("uri_register.size_bytes").build().record(stats.size_bytes, &[]);
```
All metrics are cumulative since process start and safe for concurrent access.
## Cache Strategies
The URI register supports two caching strategies:
### Moka (W-TinyLFU) - Default
**Recommended for most workloads.** W-TinyLFU (Window Tiny Least Frequently Used) combines recency and frequency tracking to provide better cache hit rates than plain LRU, especially for workloads with mixed hot/cold data.
Moka is the default cache strategy, so you don't need to specify it:
```rust
let register = PostgresUriRegister::new(
    db_url,
    "uri_register",
    20,     // max connections
    10_000, // cache size
).await?;
```
To explicitly specify Moka:
```rust
use uri_register::CacheStrategy;

let register = PostgresUriRegister::new_with_cache_strategy(
    db_url,
    "uri_register",
    20,
    10_000,
    Some(CacheStrategy::Moka), // Explicitly use Moka
    None,                      // No TLS
)
.await?;
```
**Python:**
```python
register = UriRegister(
    db_url,
    "uri_register",
    20,
    10_000,
    "moka",  # W-TinyLFU algorithm
)
```
### LRU (Least Recently Used)
Simple eviction based on recency of access: when the cache is full, the least recently accessed entry is evicted first. Use this if you need eviction behaviour that is easy to predict and reason about.
```rust
use uri_register::CacheStrategy;

let register = PostgresUriRegister::new_with_cache_strategy(
    db_url,
    "uri_register",
    20,
    10_000,
    Some(CacheStrategy::Lru), // Use LRU instead of default Moka
    None,                     // No TLS
)
.await?;
```
**Python:**
```python
register = UriRegister(
    db_url,
    "uri_register",
    20,
    10_000,
    "lru",  # Simple LRU
)
```
**Performance Comparison:**
For most real-world workloads, Moka (W-TinyLFU) provides 10-30% better cache hit rates compared to LRU, especially when:
- Access patterns have varying frequency (some URIs accessed much more than others)
- There are periodic "scans" or one-time accesses that would pollute an LRU cache
- Working set size is close to cache capacity
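The "scan" case in the list above can be reproduced with a few lines of plain Python. The sketch below models only a textbook LRU cache (the `LruCache` class is hypothetical and is not the library's cache or Moka) and shows how a one-off scan pushes frequently used entries out; a frequency-aware policy such as W-TinyLFU is designed to resist exactly this pattern.

```python
from collections import OrderedDict


class LruCache:
    """Minimal LRU cache that counts hits and misses (illustration only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[str, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, key: str) -> None:
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)        # mark as most recently used
        else:
            self.misses += 1
            self.entries[key] = 1
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used


cache = LruCache(capacity=4)
hot = ["uri:hot-1", "uri:hot-2", "uri:hot-3"]

# Warm-up: the hot URIs are accessed repeatedly and stay cached.
for _ in range(5):
    for key in hot:
        cache.access(key)

# A one-off scan of URIs that are never requested again.
for i in range(10):
    cache.access(f"uri:scan-{i}")

# The scan has evicted every hot URI, so none of these accesses hit.
hits_before = cache.hits
for key in hot:
    cache.access(key)
hot_hits_after_scan = cache.hits - hits_before
print(hot_hits_after_scan)  # 0
```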
## Logging
The library uses the `tracing` crate for structured logging. Logs include connection info, cache hit/miss statistics, and batch sizes.
### Rust
Use `tracing-subscriber` to see logs:
```rust
use tracing_subscriber::EnvFilter;
// Initialize logging (typically in main())
tracing_subscriber::fmt()
    .with_env_filter(EnvFilter::from_default_env())
    .init();
// Set RUST_LOG environment variable to control log levels:
// RUST_LOG=uri_register=debug - see debug logs from uri-register
// RUST_LOG=uri_register=trace - see trace logs (cache hits/misses)
```
### Python
Logs are automatically bridged to Python's `logging` module:
```python
import logging
# Configure Python logging as usual
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s'
)
# Logs from uri-register will appear with logger name 'uri_register'
# You can also configure just the uri_register logger:
logging.getLogger('uri_register').setLevel(logging.DEBUG)
```
**Log Levels:**
- `INFO`: Connection events, configuration
- `DEBUG`: Cache statistics, batch sizes, database queries
- `TRACE`: Individual cache hits/misses (verbose)
## Performance
### Logged Tables (Default)
With default logged tables on typical hardware:
- **Single registration**: ~500-1K URIs/sec (with cache: 100K+/sec)
- **Batch registration**: ~10K-50K URIs/sec
- **Batch lookup (cached)**: ~1M+ URIs/sec (no DB round-trip)
- **Batch lookup (uncached)**: ~100K-200K URIs/sec
### Unlogged Tables (Optional)
For 2-3x faster writes at the cost of durability:
```sql
ALTER TABLE uri_register SET UNLOGGED;
```
**Performance with unlogged tables:**
- **Batch registration**: ~30K-150K URIs/sec
**WARNING**: Unlogged tables lose all data if PostgreSQL crashes. Only use this if you can rebuild the register from source data.
To revert back to logged mode:
```sql
ALTER TABLE uri_register SET LOGGED;
```
## Performance Tips
1. **Always use batch operations** when processing multiple URIs
2. **Configure connection pooling** appropriately for your workload (typical: 10-50 connections)
3. **Tune cache size** based on your working set size and available memory (typical: 10,000-100,000 entries)
4. **Batch size**: Optimal batch size is typically 1,000-10,000 URIs per operation
5. **Hash-based indexing**: The compact UUID index on `uri_hash` scales much better than indexing full URIs
6. **Consider unlogged tables** for initial bulk loading, then switch to logged
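Tips 1 and 4 are easiest to follow with a small chunking helper in front of the batch call. The sketch below is one way to do it; `chunked` and the default of 5,000 are illustrative choices, not part of the library.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(uris: Iterable[str], batch_size: int = 5_000) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size URIs (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(uris)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Works with any iterable, including generators that never hold all URIs in memory.
uris = (f"http://example.org/resource/{i}" for i in range(12_000))
batch_sizes = [len(batch) for batch in chunked(uris, batch_size=5_000)]
print(batch_sizes)  # [5000, 5000, 2000]
```

In real use, each yielded list would be passed to `register.register_uri_batch(batch)`, keeping every database round-trip within the recommended size range.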
## Architecture
```
Application
    ↓
UriService trait (2 methods)
    ↓
PostgresUriRegister impl
    ↓                    ↓
Cache (Moka/LRU)    Connection Pool (20 connections)
    ↓                    ↓
    └──────────────────→ PostgreSQL Database
```
## Schema Details
The register uses a three-column table with hash-based indexing:
- `id`: BIGSERIAL primary key (auto-incrementing 64-bit integer, returned by the API as `u64`)
- `uri`: TEXT storing the full URI (not indexed)
- `uri_hash`: UUID generated from `md5(uri)::uuid` with UNIQUE constraint (indexed)
### Why Hash-Based Indexing?
In environments with enormous numbers of URIs, maintaining a B-tree index on the full URI text becomes prohibitively expensive - both in storage and maintenance overhead. By hashing the URI to a compact 16-byte UUID, we get:
1. **Compact index**: 16 bytes per entry vs potentially hundreds of bytes for full URIs
2. **Fast lookups**: B-tree operations on fixed-size UUIDs are very efficient
3. **Automatic computation**: PostgreSQL computes the hash via `GENERATED ALWAYS AS`
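The probability of an accidental collision with MD5 (128-bit) is vanishingly small: by the birthday bound, you'd need on the order of 2^64 URIs before a collision becomes likely. MD5 is not collision-resistant against deliberately crafted inputs, however, so for absolute safety, queries should verify the full URI matches when retrieving data: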
```sql
SELECT id FROM uri_register
WHERE uri_hash = md5('http://example.com/my-uri')::uuid -- Fast index lookup
AND uri = 'http://example.com/my-uri'; -- Collision safety check
```
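To get a feel for the numbers, the hash and the birthday-bound estimate can be reproduced client-side. This is a minimal sketch: `uri_hash` and `collision_probability` are hypothetical helper names, and the hash only matches PostgreSQL's `md5(uri)::uuid` under the assumption that the database encoding is UTF-8.

```python
import hashlib
import uuid


def uri_hash(uri: str) -> uuid.UUID:
    """MD5 digest of the UTF-8 encoded URI, viewed as a 128-bit UUID."""
    return uuid.UUID(bytes=hashlib.md5(uri.encode("utf-8")).digest())


def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday-bound approximation n*(n-1)/2 / 2**bits (valid while the result is small)."""
    return n * (n - 1) / 2 / 2**bits


h = uri_hash("http://example.com/my-uri")
print(h)             # the value stored in the uri_hash column, in UUID form
print(len(h.bytes))  # 16

# Accidental collision probability for one trillion distinct URIs
p = collision_probability(10**12)
print(f"{p:.2e}")    # roughly 1.5e-15
```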
Inserts use `ON CONFLICT (uri_hash)` to handle duplicates efficiently:
```sql
INSERT INTO uri_register (uri)
VALUES ('http://example.com/my-uri')
ON CONFLICT (uri_hash)
DO UPDATE SET uri = EXCLUDED.uri -- No-op trick to return existing ID
RETURNING id;
```
## Testing
For testing purposes, an in-memory implementation is available:
```rust
// UriService is imported so that register_uri() is in scope
#[cfg(test)]
use uri_register::{InMemoryUriRegister, UriService};

#[tokio::test]
async fn test_uri_register() {
    let register = InMemoryUriRegister::new();
    let id = register.register_uri("http://example.org").await.unwrap();
    assert_eq!(id, 1); // First URI gets ID 1
}
```
## Error Handling
The library uses structured error types for better error handling and programmatic error inspection:
```rust
use uri_register::{ConfigurationError, Error, PostgresUriRegister, UriService};

// Configuration errors with specific variants
match PostgresUriRegister::new("postgres://localhost/db", "uri_register", 0, 10_000).await {
    Ok(register) => { /* use register */ }
    Err(Error::Configuration(ConfigurationError::InvalidMaxConnections(n))) => {
        eprintln!("Invalid max_connections: {}", n);
    }
    Err(Error::Configuration(ConfigurationError::InvalidCacheSize(n))) => {
        eprintln!("Invalid cache_size: {}", n);
    }
    Err(Error::Configuration(ConfigurationError::InvalidTableName(msg))) => {
        eprintln!("Invalid table_name: {}", msg);
    }
    Err(Error::Configuration(ConfigurationError::InvalidBackoff(msg))) => {
        eprintln!("Invalid backoff configuration: {}", msg);
    }
    Err(e) => eprintln!("Other error: {}", e),
}

// Database errors (connection strings are sanitised to prevent password leaks).
// Here `register` is a successfully constructed PostgresUriRegister.
match register.register_uri("http://example.org").await {
    Ok(id) => println!("Registered with ID: {}", id),
    Err(Error::Database(msg)) => eprintln!("Database error: {}", msg),
    Err(Error::InvalidUri(msg)) => eprintln!("Invalid URI: {}", msg),
    Err(e) => eprintln!("Other error: {}", e),
}
```
### Error Types
- **Configuration** - Invalid configuration parameters (structured with specific variants)
- **Database** - Database operation failures (error messages sanitised)
- **ConnectionPool** - Connection pool errors
- **Cache** - Cache operation failures
- **InvalidUri** - URI validation failures (non-RFC 3986 compliant URIs)
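The Features list mentions automatic retries with configurable exponential backoff for transient database errors. The library handles this internally; the sketch below only illustrates the general technique. Every name in it (`TransientError`, `backoff_delays`, `retry`) and every parameter value is hypothetical and does not reflect the library's actual configuration options.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def backoff_delays(initial: float, multiplier: float, max_delay: float, retries: int) -> List[float]:
    """Return the capped exponential delay to wait before each retry."""
    return [min(initial * multiplier**attempt, max_delay) for attempt in range(retries)]


def retry(operation: Callable[[], T], delays: List[float], sleep=time.sleep) -> T:
    """Run operation, sleeping for each delay in turn after a TransientError."""
    for delay in delays:
        try:
            return operation()
        except TransientError:
            sleep(delay)
    return operation()  # final attempt; any error propagates to the caller


delays = backoff_delays(initial=0.1, multiplier=2.0, max_delay=1.0, retries=5)
print(delays)  # [0.1, 0.2, 0.4, 0.8, 1.0]

# Simulated operation that fails twice with a transient error, then succeeds.
attempts = {"count": 0}


def flaky() -> int:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("connection reset")
    return 42


slept: List[float] = []
result = retry(flaky, delays, sleep=slept.append)  # record delays instead of sleeping
print(result, slept)  # 42 [0.1, 0.2]
```

Capping the delay keeps worst-case latency bounded, while the doubling gives a struggling database progressively more room to recover.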
## License
Licensed under the Apache License, Version 2.0 ([LICENSE](LICENSE) or http://www.apache.org/licenses/LICENSE-2.0).
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.