# Internal Collectors
The `internal` collector module provides self-monitoring capabilities for pg_exporter itself. Unlike other collectors that monitor PostgreSQL, these collectors monitor the exporter's own health and performance.
## Why Self-Monitoring?
When running in production, you need visibility into the exporter itself:
- **Resource Usage**: Is the exporter leaking memory? Using too much CPU?
- **Performance**: Which collectors are slow? Are scrapes failing?
- **Cardinality**: How many metrics are being exported? (Critical for Cortex/Mimir with series limits)
## Architecture
The internal collector consists of two sub-collectors:
### 1. ProcessCollector (`process.rs`)
Monitors the exporter's process resource consumption. Matches output from `scripts/monitor-exporter.sh`.
**Metrics:**
- `pg_exporter_process_cpu_percent` - Current CPU usage percentage (Gauge)
  - Matches `ps %cpu` output - instantaneous CPU usage
  - Range: 0-100% (percentage of one CPU core)
  - Example: 2.5 = using 2.5% of one core
  - No rate() needed - value is already a percentage
- `pg_exporter_process_resident_memory_bytes` - RAM usage / RSS (IntGauge)
- `pg_exporter_process_virtual_memory_bytes` - Virtual memory size / VSZ (IntGauge)
- `pg_exporter_process_open_fds` - Open file descriptors, Linux only (IntGauge)
- `pg_exporter_process_start_time_seconds` - Process start time (Gauge)
**Simple Approach:**
- Reads current CPU% directly from kernel (via sysinfo)
- No complex delta tracking or counters
- Matches what `ps %cpu` and monitor-exporter.sh show
- Easy to understand and troubleshoot
**Implementation:**
- Uses the `sysinfo` crate (v0.37) to read process info from the OS
  - Linux: reads `/proc/$PID/stat`, `/proc/$PID/status`, `/proc/$PID/fd/`
- Cached `System` object protected by `std::sync::Mutex`
- Collection time: ~1-5ms
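
To make the Linux path more concrete, here is a minimal sketch of the kind of parsing involved when reading resident memory from `/proc/$PID/status`. It is not the collector's code or `sysinfo`'s implementation; the function name `parse_vm_rss_bytes` is hypothetical and only the standard library is used.

```rust
/// Extracts the resident set size in bytes from the text of `/proc/<pid>/status`.
/// Returns `None` if the `VmRSS` line is missing or malformed.
/// (Hypothetical helper for illustration; not part of pg_exporter.)
fn parse_vm_rss_bytes(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| {
            let mut parts = rest.split_whitespace();
            let value: u64 = parts.next()?.parse().ok()?;
            // The kernel reports this field in kB (1024-byte units).
            match parts.next() {
                Some("kB") => Some(value * 1024),
                _ => None,
            }
        })
}
```

On Linux, the input could come from `std::fs::read_to_string("/proc/self/status")`; the result corresponds to what `pg_exporter_process_resident_memory_bytes` reports.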
### 2. ScraperCollector (`scraper.rs`)
Tracks scrape performance and health across all collectors.
**Metrics:**
- `pg_exporter_collector_scrape_duration_seconds{collector}` - Histogram with buckets
  - Buckets: 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s
  - Automatically creates `_bucket`, `_sum`, `_count` metrics
  - Use `histogram_quantile()` for percentiles (p50, p95, p99)
- `pg_exporter_collector_scrape_errors_total{collector}` - Error counter per collector
- `pg_exporter_collector_last_scrape_timestamp_seconds{collector}` - Last scrape timestamp
- `pg_exporter_collector_last_scrape_success{collector}` - Success indicator (1/0)
- `pg_exporter_metrics_total` - ⭐ Total active time series / cardinality (matches `curl -s 0:9432/metrics | grep -vEc '^(#|\s*$)'`)
- `pg_exporter_scrapes_total` - Total scrapes performed
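
The cardinality figure is defined by the same rule as the `grep` one-liner above: every line that is neither a comment nor blank is one series. A minimal sketch of that rule, assuming a hypothetical helper `count_series` that receives the text exposition body (the real collector may count differently internally):

```rust
/// Counts sample lines in Prometheus text exposition output, skipping
/// comment lines (`# HELP`, `# TYPE`) and blank lines, mirroring
/// `grep -vEc '^(#|\s*$)'`. (Hypothetical helper for illustration.)
fn count_series(exposition: &str) -> usize {
    exposition
        .lines()
        .filter(|line| !line.starts_with('#') && !line.trim().is_empty())
        .count()
}
```

Note that each `_bucket`, `_sum`, and `_count` line of a histogram counts as its own series under this rule, which is why histograms with many label values drive cardinality up quickly.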
**Implementation:**
- RAII `ScrapeTimer` for automatic duration recording
- Direct updates to `prometheus` metric types in the scrape hot path
- Histogram automatically exports `_bucket`, `_sum`, `_count` suffixes
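
The RAII idea behind `ScrapeTimer` can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `scraper.rs`: the fields, the `start` constructor, the `scrape` function, and the `Observations` sink are all hypothetical, and the sink stands in for the `prometheus` histogram that the real timer observes into.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Stand-in for a histogram: stores raw observations per collector.
/// (Hypothetical; the real collector observes into a `prometheus` histogram.)
type Observations = Arc<Mutex<Vec<(String, Duration)>>>;

/// Records the elapsed time when it goes out of scope.
struct ScrapeTimer {
    collector: String,
    started: Instant,
    sink: Observations,
}

impl ScrapeTimer {
    fn start(collector: &str, sink: Observations) -> Self {
        Self {
            collector: collector.to_owned(),
            started: Instant::now(),
            sink,
        }
    }
}

impl Drop for ScrapeTimer {
    fn drop(&mut self) {
        let elapsed = self.started.elapsed();
        // Recover from a poisoned lock rather than panicking inside `drop`.
        let mut guard = self.sink.lock().unwrap_or_else(|e| e.into_inner());
        guard.push((self.collector.clone(), elapsed));
    }
}

/// Example scrape: the duration is recorded even on the early-return path.
fn scrape(collector: &str, sink: Observations, fail: bool) -> Result<(), String> {
    let _timer = ScrapeTimer::start(collector, sink);
    if fail {
        return Err(format!("{collector} failed"));
    }
    Ok(())
}
```

Because the recording happens in `Drop`, every exit path from `scrape` (success, error, or `?` propagation) produces exactly one observation without any explicit bookkeeping at the call site.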
## Why Standard Library Instead of parking_lot?
We use `std::sync::Mutex` where shared process-state caching needs it instead of
external crates like `parking_lot`:
### 1. No External Dependencies
```rust
use std::sync::Mutex;

// Handle lock poisoning explicitly
let guard = match mutex.lock() {
    Ok(guard) => guard,
    Err(poisoned) => {
        warn!("Mutex poisoned, recovering");
        poisoned.into_inner()
    }
};
```
This is explicit and self-documenting. The code clearly shows how we handle panics.
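
To see the recovery pattern end to end, the following self-contained sketch poisons a mutex on purpose and then reads the value back. The helper names `lock_recovering` and `poison_and_recover` are hypothetical and exist only for this demonstration; it uses the equivalent `unwrap_or_else` form of the `match` shown above.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Locks the mutex, recovering the guard if a previous holder panicked.
fn lock_recovering<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Poisons a shared value by panicking while holding its lock,
/// then reads it back through the recovering helper.
fn poison_and_recover() -> (bool, u64) {
    let shared = Arc::new(Mutex::new(0u64));
    let worker = Arc::clone(&shared);

    // The worker updates the value, then panics while the guard is alive.
    let result = thread::spawn(move || {
        let mut guard = worker.lock().unwrap();
        *guard = 7;
        panic!("simulated failure while holding the lock");
    })
    .join();
    assert!(result.is_err());

    let was_poisoned = shared.is_poisoned();
    let value = *lock_recovering(&shared);
    (was_poisoned, value)
}
```

The write made before the panic is still visible after recovery. That is acceptable for a cache of process statistics that is refreshed on the next scrape, but it would need more care for data with invariants that a half-finished update could break.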
### 2. Educational Value
Demonstrates proper `PoisonError` handling:
- Shows awareness of panic safety
- Teaches recovery patterns
- No hidden behavior from external libraries
### 3. Minimal Dependencies
Since internal metrics are supplementary (not core PostgreSQL monitoring):
- No extra locking crate for a "nice-to-have" feature (`sysinfo` remains the only external dependency)
- Simpler dependency tree
- Easier to audit
### 4. Lock Usage Benefits
- **Mutex** protects the cached `sysinfo::System` object
- Scrape counters update `prometheus` metrics directly, without an extra lock
- Poison handling remains explicit with the `.into_inner()` pattern
- Any performance difference from `parking_lot` is negligible for our use case (low contention)
## Usage
The internal collector (exposed as the `exporter` collector on the command line) is **disabled by default**. Enable it explicitly:
```bash
pg_exporter --dsn postgresql://localhost/postgres --collector.exporter
# Automatically exports pg_exporter_process_* and pg_exporter_collector_* metrics
```
To leave it disabled, omit the flag:
```bash
# Internal metrics not needed - disabled by default
pg_exporter --dsn postgresql://localhost/postgres
```
## Prometheus Queries
### CPU Usage (%)
```promql
pg_exporter_process_cpu_percent
```
### Memory Usage (MB)
```promql
pg_exporter_process_resident_memory_bytes / 1024 / 1024
```
### Slowest Collector (p99 latency)
```promql
topk(5,
  histogram_quantile(0.99,
    sum by (collector, le) (
      rate(pg_exporter_collector_scrape_duration_seconds_bucket[5m])
    )
  )
)
```
### Failed Collectors
```promql
sum by (collector) (
rate(pg_exporter_collector_scrape_errors_total[5m])
) > 0
```
### Total Metric Cardinality (for Cortex/Mimir limits)
```promql
pg_exporter_metrics_total
```
### Stale Collectors (not scraped in 2 minutes)
```promql
time() - pg_exporter_collector_last_scrape_timestamp_seconds{collector!=""} > 120
```
## Alerting Examples
```yaml
# High memory usage
- alert: ExporterHighMemory
  expr: pg_exporter_process_resident_memory_bytes > 500 * 1024 * 1024
  for: 5m
  annotations:
    summary: "pg_exporter using >500MB RAM"
# High CPU usage (the gauge is already a percentage, so no rate() is needed)
- alert: ExporterHighCPU
  expr: pg_exporter_process_cpu_percent > 50
  for: 5m
  annotations:
    summary: "pg_exporter using >50% CPU"
# Slow collector
- alert: SlowCollector
  expr: |
    histogram_quantile(0.99,
      rate(pg_exporter_collector_scrape_duration_seconds_bucket[5m])
    ) > 1.0
  annotations:
    summary: "Collector {{ $labels.collector }} p99 latency >1s"
# Metric cardinality explosion
- alert: HighMetricCardinality
  expr: pg_exporter_metrics_total > 10000
  annotations:
    summary: "Exporting {{ $value }} metrics (may hit Cortex limits)"
# Failed collector
- alert: CollectorFailing
  expr: rate(pg_exporter_collector_scrape_errors_total[5m]) > 0
  annotations:
    summary: "Collector {{ $labels.collector }} is failing"
```
## Grafana Dashboard Panels
Add these panels to monitor the exporter:
1. **CPU Usage** - `pg_exporter_process_cpu_percent`
   - Unit: percent (0-100)
   - Thresholds: Yellow >50%, Red >80%
2. **Memory Usage** - `pg_exporter_process_resident_memory_bytes`
   - Unit: bytes
   - Alert on steady growth (leak detection)
3. **Collector Latency Heatmap** - Histogram quantiles by collector
   - Shows which collectors are slow
4. **Metric Cardinality** - `pg_exporter_metrics_total`
   - Track against your Cortex/Mimir series limits
5. **Failed Collectors** - `rate(pg_exporter_collector_scrape_errors_total[5m])`
   - Alert if any collector has errors
## Platform Support
| Platform | CPU | Memory | Threads | File Descriptors |
|----------|-----|--------|---------|------------------|
| Linux | ✅ | ✅ | ✅ | ✅ |
| macOS | ✅ | ✅ | ⚠️ Fallback | ❌ Not available |
| Windows | ✅ | ✅ | ⚠️ Fallback | ❌ Not available |
Platform-specific code is guarded with `#[cfg(target_os = "linux")]`.
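
As an illustration of that guard, here is a minimal sketch of a Linux-only file descriptor count with a fallback for other platforms. The function name `open_fd_count` is hypothetical and the body is an assumption about how such a reading could be done with the standard library, not the code in `process.rs`.

```rust
/// Counts open file descriptors by listing `/proc/self/fd` (Linux only).
/// (Hypothetical helper for illustration.)
#[cfg(target_os = "linux")]
fn open_fd_count() -> Option<usize> {
    // Each entry in this directory is one open descriptor. The count includes
    // the descriptor opened to read the directory itself.
    let entries = std::fs::read_dir("/proc/self/fd").ok()?;
    Some(entries.filter_map(Result::ok).count())
}

/// On other platforms this sketch reports nothing, so the metric is skipped.
#[cfg(not(target_os = "linux"))]
fn open_fd_count() -> Option<usize> {
    None
}
```

Returning `Option` keeps the call site identical on every platform: the caller sets the gauge only when a value is present, which matches the "Not available" cells in the table above.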
## Performance Impact
- **CPU**: <0.1% additional overhead
- **Memory**: ~10KB for cached `System` object
- **Collection time**: ~1-5ms per scrape
- **Lock contention**: Minimal (scrapes happen every 15-60 seconds)
## Comparison with `scripts/monitor-exporter.sh`
| Feature | Internal Metrics | Bash Script |
|---------|-----------------|-------------|
| Accuracy | ✅ Same (reads /proc) | ✅ Same |
| Sampling Rate | 15-60s (scrape interval) | 1-5s (configurable) |
| Historical Data | ✅ In Prometheus | ❌ Point-in-time only |
| Alerting | ✅ Prometheus alerts | ❌ Manual monitoring |
| Use Case | Production monitoring | Debugging/troubleshooting |
**Both are complementary!** Use internal metrics for production monitoring and alerts. Use the script for high-frequency debugging during incidents.
## Testing
Run tests:
```bash
cargo test --lib internal
```
Output:
```
running 9 tests
test collectors::internal::process::tests::test_process_collector_new ... ok
test collectors::internal::process::tests::test_process_collector_registers_without_error ... ok
test collectors::internal::process::tests::test_process_collector_collects_stats ... ok
test collectors::internal::scraper::tests::test_scraper_collector_new ... ok
test collectors::internal::scraper::tests::test_scraper_collector_registers_without_error ... ok
test collectors::internal::scraper::tests::test_scrape_timer_records_duration ... ok
test collectors::internal::scraper::tests::test_scrape_timer_records_error ... ok
test collectors::internal::scraper::tests::test_update_metrics_count ... ok
test collectors::internal::scraper::tests::test_increment_scrapes ... ok
test result: ok. 9 passed; 0 failed; 0 ignored; 0 measured
```
## References
- [Process Collector Source](process.rs)
- [Scraper Collector Source](scraper.rs)
## Dependencies
The internal collector only requires one external dependency:
- **sysinfo = "0.37"** - Cross-platform system information library
  - Used to read process stats from the OS
  - Well-maintained, widely used
  - Platform-specific implementations (Linux: /proc, macOS: proc_pidinfo, etc.)
All synchronization primitives use standard library (`std::sync`).