# Database Connectors Guide
Profile data directly from PostgreSQL, MySQL, and SQLite databases with ISO 8000/25012 quality assessment.
## Feature Requirements
Database support requires feature flags at compile time:
```bash
cargo install dataprof --features postgres # PostgreSQL only
cargo install dataprof --features mysql # MySQL/MariaDB only
cargo install dataprof --features sqlite # SQLite only
cargo install dataprof --features all-db # All three
cargo install dataprof --features full-cli # CLI + Parquet + all databases
```
For Rust library usage:
```toml
[dependencies]
dataprof = { version = "0.6", features = ["postgres"] }
```
## Supported Databases
| Database | Feature flag | Connection string format |
|---|---|---|
| PostgreSQL | `postgres` | `postgres://user:pass@host:5432/dbname` |
| MySQL/MariaDB | `mysql` | `mysql://user:pass@host:3306/dbname` |
| SQLite | `sqlite` | `sqlite:///path/to/db.db` or `/path/to/db.db` |
## Quick Start
### CLI
```bash
# PostgreSQL -- profile a table with quality metrics
dataprof database postgres://user:pass@localhost/mydb --table users --quality
# MySQL -- custom query
dataprof database mysql://root:pass@localhost/shop --query "SELECT * FROM orders WHERE status = 'active'"
# SQLite
dataprof database sqlite:///data.db --table events
```
CLI flags:

| Flag | Description | Default |
|---|---|---|
| `--table <NAME>` | Table to profile (generates `SELECT * FROM <table>`) | -- |
| `--query <SQL>` | Custom SQL query | -- |
| `--batch-size <N>` | Rows per streaming batch | `10000` |
| `--quality` | Compute ISO 8000/25012 quality metrics | off |
You must provide either `--table` or `--query`.
### Python
All Python database functions are async; call them from an event loop (e.g. via `asyncio.run`):
```python
import asyncio
import dataprof as dp
async def main():
    # Profile a query
    report = await dp.analyze_database_async(
        "postgres://user:pass@localhost/mydb",
        "SELECT * FROM users",
        batch_size=10000,
        calculate_quality=True,
    )
    print(f"{report.rows} rows, quality: {report.quality_score}")

    # Test connection
    ok = await dp.test_connection_async("postgres://user:pass@localhost/mydb")

    # Get table schema
    columns = await dp.get_table_schema_async(
        "postgres://user:pass@localhost/mydb", "users"
    )

    # Count rows
    count = await dp.count_table_rows_async(
        "postgres://user:pass@localhost/mydb", "users"
    )

asyncio.run(main())
```
### Rust
#### Using the Profiler builder
```rust
use dataprof::Profiler;

// Call from an async context (e.g. an async main) that can propagate errors with `?`.
let report = Profiler::new()
    .connection_string("postgres://user:pass@localhost/mydb")
    .analyze_query("SELECT * FROM users")
    .await?;

println!("Rows: {}", report.execution.rows_processed);
if let Some(quality) = &report.quality {
    println!("Quality: {:.1}%", quality.score());
}
```
#### Using `analyze_database()` directly
```rust
use dataprof::{DatabaseConfig, analyze_database};
use dataprof::database::{RetryConfig, SslConfig};
let config = DatabaseConfig {
    connection_string: "postgres://user:pass@localhost/mydb".to_string(),
    batch_size: 10000,
    max_connections: Some(5),
    connection_timeout: Some(std::time::Duration::from_secs(30)),
    retry_config: Some(RetryConfig::default()),
    ssl_config: Some(SslConfig::default()),
    sampling_config: None,
    load_credentials_from_env: true,
};

let report = analyze_database(config, "SELECT * FROM users", true, None).await?;
println!("Rows: {}", report.execution.rows_processed);
```
## Connection Strings
### PostgreSQL
```
postgres://username:password@hostname:5432/database
postgresql://username:password@hostname:5432/database
```
Examples:
- `postgres://myuser:mypass@localhost:5432/production_db`
- `postgres://readonly_user@db.company.com:5432/analytics`
### MySQL/MariaDB
```
mysql://username:password@hostname:3306/database
```
Examples:
- `mysql://root:password@localhost:3306/ecommerce`
- `mysql://app_user:secret@mysql.internal:3306/app_data`
### SQLite
```
sqlite:///path/to/database.db
/path/to/database.db
:memory:
```
Examples:
- `sqlite:///home/user/data.db`
- `/var/data/app.sqlite`
- `:memory:` (in-memory database for testing)
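Connection strings are ordinary URLs, so you can sanity-check one before handing it to dataprof. A minimal illustrative helper using Python's standard `urllib.parse` (this is not part of dataprof's API; `describe_url` and `DEFAULT_PORTS` are names invented for this sketch):

```python
from urllib.parse import urlparse

# Default ports per scheme, mirroring the examples above.
DEFAULT_PORTS = {"postgres": 5432, "postgresql": 5432, "mysql": 3306}

def describe_url(conn: str) -> dict:
    """Break a database URL into its parts, filling in the scheme's default port."""
    parts = urlparse(conn)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "database": parts.path.lstrip("/"),
    }

info = describe_url("postgres://readonly_user@db.company.com/analytics")
print(info)
```

Note the second PostgreSQL example above omits both password and port; the helper shows that the port falls back to 5432 while the password is simply absent.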
## DatabaseConfig Reference
| Field | Type | Default | Description |
|---|---|---|---|
| `connection_string` | `String` | required | Database connection URL |
| `batch_size` | `usize` | `10000` | Rows per streaming batch |
| `max_connections` | `Option<u32>` | `Some(10)` | Connection pool size |
| `connection_timeout` | `Option<Duration>` | `Some(30s)` | Connection timeout |
| `retry_config` | `Option<RetryConfig>` | `Some(default)` | Retry with exponential backoff |
| `sampling_config` | `Option<SamplingConfig>` | `None` | Sampling strategy for large tables |
| `ssl_config` | `Option<SslConfig>` | `Some(default)` | SSL/TLS encryption settings |
| `load_credentials_from_env` | `bool` | `true` | Load credentials from environment variables |
## Streaming for Large Datasets
dataprof processes database results in configurable batches, so tables with millions of rows can be profiled without exhausting memory:
1. Counts total rows for progress tracking
2. Fetches data in batches (default: 10,000 rows)
3. Merges statistics from all batches
4. Reports progress
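Step 3 works because per-batch statistics such as row counts, null counts, and sums combine exactly, regardless of batch order. A self-contained sketch of the idea (illustrative only, not dataprof's internal code; `batch_stats` and `merge` are names invented here):

```python
def batch_stats(batch):
    """Row count, null count, and sum for one batch of values."""
    present = [v for v in batch if v is not None]
    return {"rows": len(batch), "nulls": len(batch) - len(present), "total": sum(present)}

def merge(a, b):
    """Combine two partial statistics; addition makes batch order irrelevant."""
    return {k: a[k] + b[k] for k in a}

# Three small "batches" standing in for streamed result sets.
batches = [[1, 2, None], [3, 4], [None, 5]]
stats = {"rows": 0, "nulls": 0, "total": 0}
for b in batches:
    stats = merge(stats, batch_stats(b))
print(stats)  # rows=7, nulls=2, total=15
```

Statistics that don't decompose this simply (e.g. exact medians) need different techniques, which is one reason sampling exists for very large tables.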
Adjust `batch_size` based on your dataset:
| Dataset size | Suggested `batch_size` |
|---|---|
| < 100k rows | `1000` -- `5000` |
| 100k -- 1M rows | `10000` -- `25000` |
| > 1M rows | `50000` -- `100000` |
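If you size batches programmatically, the table's guidance reduces to a simple heuristic. The thresholds below just mirror the table; `suggest_batch_size` is not a dataprof API:

```python
def suggest_batch_size(row_count: int) -> int:
    """Pick a batch size from the row count, following the guidance table."""
    if row_count < 100_000:
        return 5_000
    if row_count <= 1_000_000:
        return 25_000
    return 50_000

print(suggest_batch_size(50_000))      # 5000
print(suggest_batch_size(500_000))     # 25000
print(suggest_batch_size(10_000_000))  # 50000
```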
## Sampling
For very large tables, configure sampling to analyze a representative subset:
```rust
use dataprof::DatabaseConfig;
use dataprof::database::SamplingConfig;

// Quick random sample
let config = DatabaseConfig {
    sampling_config: Some(SamplingConfig::quick_sample(10000)),
    ..Default::default()
};

// Representative stratified sample
let config = DatabaseConfig {
    sampling_config: Some(SamplingConfig::representative_sample(
        25000,
        Some("category".to_string()),
    )),
    ..Default::default()
};

// Temporal sample for time-series data
let config = DatabaseConfig {
    sampling_config: Some(SamplingConfig::temporal_sample(
        50000,
        "created_at".to_string(),
    )),
    ..Default::default()
};
```
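To make the strategies concrete, here is the kind of SQL each one roughly corresponds to. This is an illustrative sketch only; dataprof's actual generated queries may differ, and the helper functions are invented for this example:

```python
def quick_sample_sql(table: str, n: int) -> str:
    # Random sample: every row has an equal chance of selection.
    return f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT {n}"

def temporal_sample_sql(table: str, n: int, time_col: str) -> str:
    # Temporal sample: favor the most recent rows in a time-series table.
    return f"SELECT * FROM {table} ORDER BY {time_col} DESC LIMIT {n}"

print(quick_sample_sql("events", 10000))
print(temporal_sample_sql("events", 50000, "created_at"))
```

A stratified sample additionally groups by the given column (here `category`) so each group is represented proportionally, which a plain `ORDER BY RANDOM()` cannot guarantee.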
## Security
### SSL/TLS Encryption
```rust
use dataprof::database::SslConfig;

// Production: requires SSL with certificate verification
let ssl = SslConfig::production();

// Development: relaxed for local testing
let ssl = SslConfig::development();

// Custom configuration
let ssl = SslConfig {
    require_ssl: true,
    verify_server_cert: true,
    ssl_mode: Some("require".to_string()),
    ca_cert_path: Some("/etc/ssl/certs/ca.pem".to_string()),
    ..Default::default()
};
```
### Environment Variables
Avoid embedding credentials in connection strings. Set environment variables instead:
| Variable | Description |
|---|---|
| `POSTGRES_HOST` | PostgreSQL hostname |
| `POSTGRES_USER` | PostgreSQL username |
| `POSTGRES_PASSWORD` | PostgreSQL password |
| `POSTGRES_DATABASE` | PostgreSQL database name |
| `POSTGRES_PORT` | PostgreSQL port (default: 5432) |
| `POSTGRES_SSL_MODE` | SSL mode (`require`, `verify-full`, etc.) |
| `POSTGRES_SSL_CA` | Path to CA certificate |
| `MYSQL_HOST` | MySQL hostname |
| `MYSQL_USER` | MySQL username |
| `MYSQL_PASSWORD` | MySQL password |
| `DATABASE_URL` | Generic connection URL |
With `load_credentials_from_env: true` (the default), dataprof will read these automatically.
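If you do assemble a URL yourself from these variables, percent-encode the password so characters like `@` and `/` don't break the URL. A minimal illustrative helper using only the Python standard library (`postgres_url_from_env` is invented for this sketch, not a dataprof function):

```python
import os
from urllib.parse import quote

def postgres_url_from_env(env=None) -> str:
    """Build a postgres:// URL from POSTGRES_* variables, quoting the password."""
    env = os.environ if env is None else env
    user = env["POSTGRES_USER"]
    password = quote(env.get("POSTGRES_PASSWORD", ""), safe="")
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    db = env["POSTGRES_DATABASE"]
    return f"postgres://{user}:{password}@{host}:{port}/{db}"

url = postgres_url_from_env({
    "POSTGRES_USER": "app",
    "POSTGRES_PASSWORD": "p@ss/word",
    "POSTGRES_DATABASE": "mydb",
})
print(url)  # postgres://app:p%40ss%2Fword@localhost:5432/mydb
```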
## Troubleshooting
### Enable Debug Logging
```bash
RUST_LOG=dataprof=debug dataprof database postgres://... --table users
```
### Test Connection (Rust)
```rust
use dataprof::{DatabaseConfig, create_connector};
let config = DatabaseConfig {
    connection_string: "postgres://user:pass@localhost/mydb".to_string(),
    ..Default::default()
};

let mut connector = create_connector(config)?;
connector.connect().await?;
let ok = connector.test_connection().await?;
println!("Connection: {}", if ok { "OK" } else { "FAILED" });
```
### Test Connection (Python)
```python
ok = await dp.test_connection_async("postgres://user:pass@localhost/mydb")
print(f"Connection: {'OK' if ok else 'FAILED'}")
```
### Common Issues
**Connection refused** -- verify the database is running and accessible. dataprof retries with exponential backoff (3 attempts by default).
**Table not found** -- check table name and schema permissions. Use fully qualified names (`schema.table`) if needed.
**SSL errors** -- for local development, use `SslConfig::development()`. For production, ensure the CA certificate path is correct.
**Feature not enabled** -- if you see `Database feature not enabled`, recompile with the appropriate feature flag (e.g. `--features postgres`).
## Performance Tips
1. **Add WHERE clauses** to filter data before profiling
2. **Use indexes** on filtered columns for faster queries
3. **Avoid SELECT \*** on very wide tables -- select relevant columns
4. **Set appropriate batch sizes** based on row count (see table above)
5. **Use connection pooling** (`max_connections`) for repeated queries