# URI Register
> **⚠️ Beta Software**: This library is in active development and the API may change. While it's being used in production environments, you should pin to a specific version and test thoroughly before upgrading.
A high-performance, async-first PostgreSQL-backed URI register service for assigning unique integer IDs to URIs. Perfect for string interning, deduplication, and systems that need consistent global identifier mappings.
**Note:** This library is async-only and requires an async runtime (tokio).
## Overview
The URI Register provides a simple, fast way to assign unique integer IDs to URI strings. Once registered, a URI always returns the same ID, making it ideal for string interning and deduplication in distributed systems.
## Features
- **Simple API**: Two core methods, `register_uri()` and `register_uri_batch()`
- **Async-only**: Built on tokio/sqlx for high concurrency
- **Batch optimised**: Process thousands of URIs in a single database round-trip
- **LRU caching**: In-memory cache for frequently accessed URIs (configurable size)
- **Order preservation**: Batch operations maintain strict order correspondence
- **PostgreSQL backend**: Durable, scalable, with connection pooling
- **Thread-safe**: Designed for concurrent access from multiple threads/processes
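
To illustrate the LRU caching feature listed above, here is a minimal sketch of a bounded URI-to-ID cache. `LruCache` is a hypothetical stand-in written for this README, not the library's actual cache; it only shows how the least recently used entry is evicted once the configured size is reached.

```python
from collections import OrderedDict
from typing import Optional


class LruCache:
    """Hypothetical bounded URI -> ID cache; not the library's implementation."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries: "OrderedDict[str, int]" = OrderedDict()

    def get(self, uri: str) -> Optional[int]:
        uri_id = self._entries.get(uri)
        if uri_id is not None:
            # A hit makes this entry the most recently used.
            self._entries.move_to_end(uri)
        return uri_id

    def put(self, uri: str, uri_id: int) -> None:
        self._entries[uri] = uri_id
        self._entries.move_to_end(uri)
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry (front of the dict).
            self._entries.popitem(last=False)


cache = LruCache(capacity=2)
cache.put("http://example.org/a", 1)
cache.put("http://example.org/b", 2)
cache.get("http://example.org/a")     # touch "a" so "b" becomes the oldest
cache.put("http://example.org/c", 3)  # over capacity: evicts "b"
```

After the last call, `"a"` and `"c"` remain cached while a lookup for `"b"` would fall through to the database.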
## Use Cases
- **String interning systems**: Reduce memory footprint by storing strings once and referencing by ID
- **URL deduplication**: Assign unique IDs to URLs across distributed crawlers
- **Global identifier systems**: Centralised ID assignment for URIs/strings in microservices
- **Data warehousing**: Efficient storage of repeated string values
- **Distributed caching**: Consistent ID assignment across cache nodes
## Installation
### Rust
Add to your `Cargo.toml`:
```toml
[dependencies]
uri-register = "0.1.2"
```
Or use as a git dependency:
```toml
[dependencies]
uri-register = { git = "https://github.com/telicent-oss/uri-register" }
```
### Python
Install from TestPyPI (during beta):
```bash
pip install --index-url https://test.pypi.org/simple/ uri-register
```
**Requirements**: Python 3.8+
**Note**: The package is currently published to TestPyPI for testing. Once stable, it will be available on the main PyPI repository.
## Setup
### 1. Database Initialisation
Before using the URI Register service, you must initialise the PostgreSQL schema.
**Run the schema creation script:**
```bash
psql -U username -d database_name -f schema.sql
```
Or execute the SQL directly:
```sql
CREATE TABLE IF NOT EXISTS uri_register (
    id BIGSERIAL PRIMARY KEY,
    uri TEXT NOT NULL UNIQUE
);
CREATE INDEX IF NOT EXISTS uri_register_uri_idx ON uri_register (uri);
```
### 2. Database Configuration
The service requires a PostgreSQL connection string. Set it as an environment variable or pass it directly:
```bash
export DATABASE_URL="postgresql://username:password@localhost:5432/database_name"
```
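
As a quick sanity check before connecting, an application might read the variable and inspect its parts. The sketch below uses only the Python standard library; `describe_database_url` and `database_url_from_env` are hypothetical helpers, not part of this package.

```python
import os
from urllib.parse import urlsplit

# Example URL from this README, used when DATABASE_URL is unset.
DEFAULT_URL = "postgresql://username:password@localhost:5432/database_name"


def database_url_from_env(default: str = DEFAULT_URL) -> str:
    """Return DATABASE_URL from the environment, or the given default."""
    return os.environ.get("DATABASE_URL", default)


def describe_database_url(url: str) -> dict:
    """Split a PostgreSQL connection URL into its parts, omitting the password."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 5432,  # PostgreSQL's default port
        "database": parts.path.lstrip("/"),
    }


print(describe_database_url(DEFAULT_URL))
```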
## Usage
### Rust Example
```rust
use uri_register::{UriService, PostgresUriRegister};

#[tokio::main]
async fn main() -> uri_register::Result<()> {
    // Connect to PostgreSQL
    let register = PostgresUriRegister::new("postgres://localhost/mydb").await?;

    // Register a single URI
    let id = register.register_uri("http://example.org/resource/1").await?;
    println!("Registered URI with ID: {}", id);

    // Register the same URI again - returns the same ID
    let same_id = register.register_uri("http://example.org/resource/1").await?;
    assert_eq!(id, same_id);

    // Register multiple URIs in batch (much faster!)
    let uris = vec![
        "http://example.org/resource/2".to_string(),
        "http://example.org/resource/3".to_string(),
        "http://example.org/resource/4".to_string(),
    ];
    let ids = register.register_uri_batch(&uris).await?;

    // IDs maintain order: ids[i] corresponds to uris[i]
    for (uri, id) in uris.iter().zip(ids.iter()) {
        println!("{} -> {}", uri, id);
    }

    Ok(())
}
```
### Python Example
```python
import asyncio

from uri_register import UriRegister


async def main():
    # Connect to PostgreSQL
    register = await UriRegister.new("postgres://localhost/mydb")

    # Register a single URI (named uri_id to avoid shadowing the built-in id())
    uri_id = await register.register_uri("http://example.org/resource/1")
    print(f"Registered URI with ID: {uri_id}")

    # Register the same URI again - returns the same ID
    same_id = await register.register_uri("http://example.org/resource/1")
    assert uri_id == same_id

    # Register multiple URIs in batch (much faster!)
    uris = [
        "http://example.org/resource/2",
        "http://example.org/resource/3",
        "http://example.org/resource/4",
    ]
    ids = await register.register_uri_batch(uris)

    # IDs maintain order: ids[i] corresponds to uris[i]
    for uri, batch_id in zip(uris, ids):
        print(f"{uri} -> {batch_id}")

    # Get statistics
    stats = await register.stats()
    print(f"Total URIs: {stats['total_uris']}")


asyncio.run(main())
```
### API Reference
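The `UriService` trait provides two core methods. Both are async and fallible, so the signatures in the headings below are abbreviated; the examples call them with `.await?`.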
#### `register_uri(uri: &str) -> u64`
Register a single URI and return its ID.
- If the URI exists, returns the existing ID
- If the URI is new, creates a new ID and returns it
- Uses LRU cache for fast repeated lookups
```rust
let id = register.register_uri("http://example.org/page").await?;
```
#### `register_uri_batch(uris: &[String]) -> Vec<u64>`
Register multiple URIs in batch and return their IDs.
- **Order preserved**: `ids[i]` corresponds to `uris[i]`
- Much faster than calling `register_uri()` multiple times
- Handles duplicate URIs in input correctly
- Cache-optimised: only queries database for cache misses
```rust
let uris = vec![
    "http://example.org/page1".to_string(),
    "http://example.org/page2".to_string(),
];
let ids = register.register_uri_batch(&uris).await?;

// Access by index
assert_eq!(ids[0], register.register_uri("http://example.org/page1").await?);
```
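
The batch guarantees above (order preservation, duplicate handling, querying only cache misses) can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: `resolve_batch` and `fake_backend` are hypothetical, a plain dict stands in for the LRU cache, and `fake_backend` stands in for the single database round-trip.

```python
from typing import Callable, Dict, List


def resolve_batch(
    uris: List[str],
    cache: Dict[str, int],
    lookup_missing: Callable[[List[str]], Dict[str, int]],
) -> List[int]:
    """Return one ID per input URI, in input order."""
    # Deduplicate while keeping first-seen order; only cache misses reach the backend.
    misses = [uri for uri in dict.fromkeys(uris) if uri not in cache]
    if misses:
        cache.update(lookup_missing(misses))
    return [cache[uri] for uri in uris]


# Stand-in backend: assigns sequential IDs and records each request it receives.
store: Dict[str, int] = {}
requests: List[List[str]] = []


def fake_backend(missing: List[str]) -> Dict[str, int]:
    requests.append(list(missing))
    for uri in missing:
        store.setdefault(uri, len(store) + 1)
    return {uri: store[uri] for uri in missing}


cache: Dict[str, int] = {}
ids = resolve_batch(["a", "b", "a", "c"], cache, fake_backend)
more = resolve_batch(["c", "d"], cache, fake_backend)
```

In this sketch the duplicate `"a"` is sent to the backend once, and the second call only asks for `"d"` because `"c"` is already cached.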
### Statistics
Get information about the register:
```rust
let stats = register.stats().await?;
println!("Total URIs: {}", stats.total_uris);
println!("Storage size: {} bytes", stats.size_bytes);
```
## Performance
### Logged Tables (Default)
With default logged tables on typical hardware:
- **Single registration**: ~500-1K URIs/sec (with cache: 100K+/sec)
- **Batch registration**: ~10K-50K URIs/sec
- **Batch lookup (cached)**: ~1M+ URIs/sec (no DB round-trip)
- **Batch lookup (uncached)**: ~100K-200K URIs/sec
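
To put these ranges in perspective, the following back-of-the-envelope calculation estimates how long one million uncached registrations would take at the quoted rates. It assumes steady throughput, which real workloads will not match exactly.

```python
def seconds_to_register(total_uris: int, uris_per_second: float) -> float:
    """Estimated wall-clock time at a steady throughput."""
    return total_uris / uris_per_second


TOTAL = 1_000_000

# Throughput ranges quoted above for logged tables, as (low, high) URIs/sec.
single = (500, 1_000)
batch = (10_000, 50_000)

# (best case, worst case) in seconds
single_range = (seconds_to_register(TOTAL, single[1]), seconds_to_register(TOTAL, single[0]))
batch_range = (seconds_to_register(TOTAL, batch[1]), seconds_to_register(TOTAL, batch[0]))

print(f"single: {single_range[0]:.0f}-{single_range[1]:.0f} s")
print(f"batch:  {batch_range[0]:.0f}-{batch_range[1]:.0f} s")
```

Under these figures, one million new URIs take roughly 17 to 33 minutes when registered one at a time, versus 20 to 100 seconds in batches.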
### Unlogged Tables (Optional)
For 2-3x faster writes at the cost of durability:
```sql
ALTER TABLE uri_register SET UNLOGGED;
```
**Performance with unlogged tables:**
- **Batch registration**: ~30K-150K URIs/sec
**WARNING**: Unlogged tables are not crash-safe. PostgreSQL truncates them after a crash or unclean shutdown, and their contents are not replicated to standby servers. Only use this if you can rebuild the register from source data.
To revert to logged mode:
```sql
ALTER TABLE uri_register SET LOGGED;
```
### LRU Cache Configuration
The service includes an in-memory LRU cache that dramatically improves performance for repeated URI lookups. Configure the cache size via environment variable:
```bash
export URI_REGISTER_CACHE_SIZE=50000 # Default: 10,000 entries
```
The cache provides near-instant lookups for frequently accessed URIs, completely bypassing database queries.
## Performance Tips
1. **Always use batch operations** when processing multiple URIs
2. **Connection pooling** is configured by default (20 max connections)
3. **Batch size**: Optimal batch size is typically 1K-10K URIs per operation
4. **Indexing**: The URI index is essential for lookup performance
5. **Consider unlogged tables** for initial bulk loading, then switch to logged
6. **Tune cache size** based on your working set size and available memory
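
Tips 1 and 3 can be combined by splitting a large input into fixed-size batches. The sketch below assumes a register object exposing the `register_uri_batch` coroutine shown in the Python example; `chunked`, `register_all`, and `CountingRegister` are hypothetical helpers, and `CountingRegister` exists only so the sketch runs without a database.

```python
import asyncio
from typing import Dict, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


async def register_all(register, uris: Iterable[str], batch_size: int = 5_000) -> List[int]:
    """Register URIs batch by batch, returning IDs in input order."""
    ids: List[int] = []
    for batch in chunked(uris, batch_size):
        ids.extend(await register.register_uri_batch(batch))
    return ids


class CountingRegister:
    """Stand-in register: sequential IDs, counts how many batch calls were made."""

    def __init__(self) -> None:
        self.calls = 0
        self._ids: Dict[str, int] = {}

    async def register_uri_batch(self, uris: List[str]) -> List[int]:
        self.calls += 1
        return [self._ids.setdefault(uri, len(self._ids) + 1) for uri in uris]


uris = [f"http://example.org/item/{n}" for n in range(12_000)]
register = CountingRegister()
ids = asyncio.run(register_all(register, uris))
```

With 12,000 URIs and a batch size of 5,000, the stand-in receives three calls instead of 12,000 single registrations.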
## Architecture
```
        Application
             ↓
UriService trait (2 methods)
             ↓
  PostgresUriRegister impl
      ↓              ↓
  LRU Cache    Connection Pool (20 connections)
      ↓              ↓
      └─────→ PostgreSQL Database
```
## Schema Details
The register uses a simple two-column table:
- `id`: BIGSERIAL primary key (auto-incrementing 64-bit integer, returned as `u64`)
- `uri`: TEXT with UNIQUE constraint (indexed)
The UNIQUE constraint prevents duplicate URIs, and the index provides fast lookups.
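
The role of the UNIQUE constraint in "register once, always get the same ID" can be demonstrated with a small stand-in. The sketch below uses SQLite from the Python standard library purely so it runs without a server; it is not the SQL this library issues, which this README does not show. In PostgreSQL, the analogous statement to `INSERT OR IGNORE` is `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

# SQLite stand-in for the two-column table described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uri_register ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " uri TEXT NOT NULL UNIQUE)"
)


def get_or_create_id(uri: str) -> int:
    """Hypothetical helper: insert the URI if it is new, then return its ID."""
    # The UNIQUE constraint turns the insert into a no-op for existing URIs.
    conn.execute("INSERT OR IGNORE INTO uri_register (uri) VALUES (?)", (uri,))
    row = conn.execute("SELECT id FROM uri_register WHERE uri = ?", (uri,)).fetchone()
    return row[0]


first = get_or_create_id("http://example.org/a")
second = get_or_create_id("http://example.org/b")
again = get_or_create_id("http://example.org/a")  # same ID as `first`
row_count = conn.execute("SELECT COUNT(*) FROM uri_register").fetchone()[0]
```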
## Testing
For testing purposes, an in-memory implementation is available:
```rust
#[cfg(test)]
mod tests {
    // `UriService` must be in scope to call the trait's methods
    use uri_register::{InMemoryUriRegister, UriService};

    #[tokio::test]
    async fn test_uri_register() {
        let register = InMemoryUriRegister::new();
        let id = register.register_uri("http://example.org").await.unwrap();
        assert_eq!(id, 1); // First URI gets ID 1
    }
}
```
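
For Python test suites, a similar test double is easy to write by hand. The sketch below is a hypothetical `InMemoryRegister` that mirrors the method names used in this README; it is not shipped with the package, and whether the Python bindings expose an in-memory implementation is not stated here.

```python
import asyncio
from typing import Dict, List


class InMemoryRegister:
    """Hypothetical test double; mirrors the method names used in this README."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    async def register_uri(self, uri: str) -> int:
        # IDs start at 1 and are never reassigned.
        return self._ids.setdefault(uri, len(self._ids) + 1)

    async def register_uri_batch(self, uris: List[str]) -> List[int]:
        # One ID per input URI, in input order; duplicates share an ID.
        return [await self.register_uri(uri) for uri in uris]


async def demo() -> List[int]:
    register = InMemoryRegister()
    first = await register.register_uri("http://example.org")
    batch = await register.register_uri_batch(
        ["http://example.org/a", "http://example.org", "http://example.org/a"]
    )
    return [first] + batch


result = asyncio.run(demo())
```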
## Examples
### Rust
```rust
use uri_register::{UriService, PostgresUriRegister};

#[tokio::main]
async fn main() -> uri_register::Result<()> {
    // Connect to PostgreSQL with full connection parameters
    // Format: postgres://user:password@host:port/database
    let database_url = "postgres://myuser:mypassword@localhost:5432/mydb";
    let max_connections = 20; // Connection pool size
    let cache_size = 50_000; // LRU cache size
    let register = PostgresUriRegister::new(database_url, max_connections, cache_size).await?;

    // Register single URIs
    let id1 = register.register_uri("http://example.org/resource/1").await?;
    let id2 = register.register_uri("http://example.org/resource/2").await?;
    println!("Registered: {} -> ID {}", "resource/1", id1);
    println!("Registered: {} -> ID {}", "resource/2", id2);

    // Register the same URI again - returns the same ID
    let id1_again = register.register_uri("http://example.org/resource/1").await?;
    assert_eq!(id1, id1_again);

    // Batch registration (faster for multiple URIs)
    let uris = vec![
        "http://example.org/a".to_string(),
        "http://example.org/b".to_string(),
        "http://example.org/c".to_string(),
        "http://example.org/a".to_string(), // Duplicate
    ];
    let ids = register.register_uri_batch(&uris).await?;

    // Order preserved: ids[i] corresponds to uris[i]
    assert_eq!(ids[0], ids[3]); // Duplicates get same ID

    // Batch registration with hashmap (automatic deduplication)
    let mapping = register.register_uri_batch_hashmap(&uris).await?;
    println!("Unique URIs: {}", mapping.len()); // 3 unique URIs

    // Get statistics
    let stats = register.stats().await?;
    println!("Total URIs: {}", stats.total_uris);
    println!("Storage: {} bytes", stats.size_bytes);

    Ok(())
}
```
### Python
```python
import asyncio

from uri_register import UriRegister


async def main():
    # Connect to PostgreSQL with full connection parameters
    # Format: postgres://user:password@host:port/database
    database_url = "postgres://myuser:mypassword@localhost:5432/mydb"
    max_connections = 20  # Connection pool size
    cache_size = 50_000  # LRU cache size
    register = await UriRegister.new(database_url, max_connections, cache_size)

    # Register single URIs
    id1 = await register.register_uri("http://example.org/resource/1")
    id2 = await register.register_uri("http://example.org/resource/2")
    print(f"Registered: resource/1 -> ID {id1}")
    print(f"Registered: resource/2 -> ID {id2}")

    # Register the same URI again - returns the same ID
    id1_again = await register.register_uri("http://example.org/resource/1")
    assert id1 == id1_again

    # Batch registration (faster for multiple URIs)
    uris = [
        "http://example.org/a",
        "http://example.org/b",
        "http://example.org/c",
        "http://example.org/a",  # Duplicate
    ]
    ids = await register.register_uri_batch(uris)

    # Order preserved: ids[i] corresponds to uris[i]
    assert ids[0] == ids[3]  # Duplicates get same ID

    # Batch registration with hashmap (automatic deduplication)
    mapping = await register.register_uri_batch_hashmap(uris)
    print(f"Unique URIs: {len(mapping)}")  # 3 unique URIs

    # Get statistics
    stats = await register.stats()
    print(f"Total URIs: {stats['total_uris']}")
    print(f"Storage: {stats['size_bytes']} bytes")


asyncio.run(main())
```
## Error Handling
The library uses custom error types for better error handling:
```rust
use uri_register::{Error, Result};

match register.register_uri("http://example.org").await {
    Ok(id) => println!("Registered with ID: {}", id),
    Err(Error::Database(e)) => eprintln!("Database error: {}", e),
    Err(Error::ConnectionPool(e)) => eprintln!("Connection pool error: {}", e),
    Err(e) => eprintln!("Other error: {}", e),
}
```
## License
Licensed under the Apache License, Version 2.0 ([LICENSE](LICENSE) or http://www.apache.org/licenses/LICENSE-2.0).
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.