# rustcdc Troubleshooting Guide
**Version:** v0.1+
**Audience:** Operators and developers debugging rustcdc issues
---
## Table of Contents
1. [Connection Issues](#connection-issues)
2. [Checkpoint and Recovery Issues](#checkpoint-and-recovery-issues)
3. [Performance and Throughput Issues](#performance-and-throughput-issues)
4. [Data Quality Issues](#data-quality-issues)
5. [Transform and Filter Issues](#transform-and-filter-issues)
6. [Diagnostics Toolkit](#diagnostics-toolkit)
---
## Integration Scaffolding Assumptions
Command examples in this guide assume your embedder/deployment provides:
- Service controls (for example `systemctl`, container orchestration commands, or custom supervisor)
- Runtime/admin metrics endpoint (examples use `http://localhost:9090/metrics`)
- Connector/client CLIs installed for ad-hoc diagnostics (`psql`, `mysql`, `sqlcmd`)
Adapt commands to your runtime model. If your deployment does not expose these controls yet,
establish them first using the deployment guidance in `docs/deployment.md`.
---
## Connection Issues
### Symptom: "connection refused" or timeout on startup
**Error Examples:**
```
ERROR source error: failed to connect to postgres: connection refused
ERROR source error: failed to connect to mysql: timeout
ERROR source error: connection to sqlserver closed unexpectedly
```
**Diagnosis Checklist:**
1. **Verify network connectivity:**
```bash
ping -c 3 <database_host>
telnet <database_host> <port> ```
✅ Should respond; ❌ if not, check network/firewall
2. **Verify database is running:**
```bash
pg_isready -h <host> -p 5432
mysql --defaults-extra-file=<mysql-client.cnf> -h <host> -u <user> -e "SELECT 1;"
SQLCMDPASSWORD="${SQLCMDPASSWORD:?set from secret manager}" sqlcmd -S <host> -U <user> -Q "SELECT 1;"
```
✅ Should return connection OK; ❌ if not, restart database
3. **Verify credentials:**
```bash
psql "postgresql://<user>@<host>:5432/<database>"
mysql --defaults-extra-file=<mysql-client.cnf> -h <host> -u <user> <database>
SQLCMDPASSWORD="${SQLCMDPASSWORD:?set from secret manager}" sqlcmd -S <host> -U <user> -d <database>
```
✅ Should authenticate; ❌ if not, verify configured secret source and connector credentials
4. **Check user permissions:**
```sql
SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'cdc_user';
SHOW GRANTS FOR 'cdc_user'@'%';
SELECT is_member('cdc_admin');
```
5. **Check connection string format:**
- PostgreSQL: `postgresql://user:pass@host:port/database?sslmode=require`
- MySQL: `mysql://user:pass@host:port/database`
- SQL Server: `sqlserver://user:pass@host:port;database=name;Encrypt=yes`
**Resolution:**
| Network blocked | Whitelist rustcdc server IP on database firewall |
| Database not running | Restart database service |
| Invalid credentials | Verify in config file; check for special characters |
| Missing REPLICATION role (PG) | Run: `ALTER ROLE cdc_user WITH REPLICATION;` |
| Missing CDC admin (SQL Server) | Run: `ALTER ROLE cdc_user ADD MEMBER cdc_admin;` |
| TLS/Certificate error | Verify `transport` is set to `TransportConfig::tls()` or `TransportConfig::tls_with_ca_cert_path(...)` |
---
### Symptom: "TLS handshake failed" or certificate validation error
**Error Examples:**
```
ERROR source error: tls error: certificate verify failed
ERROR source error: tls error: x509: certificate signed by unknown authority
```
**Diagnosis:**
```bash
# 1. Check if database requires TLS
openssl s_client -connect <host>:<port>
# Should show certificate chain; if connection fails, TLS may be required
# 2. Verify CA certificate (if using custom CA)
openssl x509 -in /path/to/ca.pem -text -noout
# Should show certificate details; verify Subject and Issuer match your CA
# 3. Test TLS connection manually
# PostgreSQL
psql "postgresql://user:pass@host/db?sslmode=require&sslcert=/path/to/cert.pem&sslkey=/path/to/key.pem"
# MySQL
mysql -h <host> --ssl-mode=REQUIRED --ssl-ca=/path/to/ca.pem -u <user> -p<pass>
```
**Resolution:**
| Certificate not trusted | Use `TransportConfig::tls_with_ca_cert_path(...)` with the CA bundle path |
| TLS handshake fails behind proxy | Verify proxy certificates are rooted in the CA bundle configured via `TransportConfig::tls_with_ca_cert_path(...)` |
| Self-signed cert in test/air-gapped env | Use explicit opt-in `TransportConfig::tls_insecure_skip_verify()` only for non-production environments |
| Expired certificate | Request new certificate from database admin |
| Wrong CA certificate | Verify CA cert matches database server certificate issuer |
---
## Checkpoint and Recovery Issues
### Symptom: "checkpoint error" or "replication slot diverged"
**Error Examples:**
```
ERROR: source error: postgres checkpoint/slot divergence for slot '...'
ERROR checkpoint error: checkpoint file does not exist
ERROR checkpoint error: failed to read checkpoint: invalid JSON
```
**Diagnosis Checklist:**
1. **Verify checkpoint file exists and is readable:**
```bash
ls -lh /var/rustcdc/checkpoint_*.json
```
✅ Should list checkpoint files; ❌ if not, check directory permissions
2. **Verify checkpoint is valid JSON:**
```bash
cat /var/rustcdc/checkpoint_postgres.json | jq .
```
3. **For PostgreSQL: verify replication slot exists:**
```sql
SELECT slot_name, active, restart_lsn FROM pg_replication_slots WHERE slot_name = 'rustcdc_postgres_*';
```
✅ Slot exists and active; ❌ if not, may have been dropped manually
4. **For PostgreSQL: check LSN divergence:**
```sql
cat /var/rustcdc/checkpoint_postgres.json | jq '.offset.lsn'
SELECT pg_current_wal_lsn();
```
**Resolution:**
| Checkpoint corrupted | Stop rustcdc; delete checkpoint file; restart (will scan from current position) |
| Replication slot dropped | Stop rustcdc; recreate checkpoint with current LSN; restart |
| WAL/binlog purged | See [Replication Slot Divergence Recovery](runbook.md#replication-slot-divergence-recovery) |
| Checkpoint permissions | Verify `/var/rustcdc/` is writable by rustcdc process owner |
---
### Symptom: "buffer full" error or frequent checkpoint pauses
**Error Examples:**
```
ERROR checkpoint error: commit barrier buffer is full
WARNING checkpoint latency exceeding 1s
```
**Diagnosis:**
1. **Check buffer utilization:**
```bash
grep "buffer_size" /var/log/rustcdc/structured.log | tail -20
curl http://localhost:9090/metrics | grep "rustcdc_runtime_buffer_depth"
```
2. **Check checkpoint commit latency:**
```bash
curl http://localhost:9090/metrics | grep "rustcdc_runtime_checkpoint_age_ms"
```
3. **Check checkpoint store I/O:**
```bash
iostat -x 1 5 | grep sda
```
**Resolution:**
| Checkpoint store slow (disk I/O) | 1. Switch to FileCheckpoint on faster disk; 2. Increase max_buffer_size to batch more events |
| Checkpoint store slow (external backend) | Optimize backend-specific checkpoint writes and indexes; monitor write latency and contention |
| Buffer size too small | Increase `max_buffer_size` in RuntimeConfig (e.g., 50_000 → 100_000) |
| Transform errors causing queue buildup | Check transform error logs; fix failing transforms or set `transform_error_policy = Skip` |
---
## Performance and Throughput Issues
### Symptom: SQL Server high p99 latency (bursty / non-uniform)
**Expected behavior — not a bug:**
SQL Server CDC is **polling-based**. rustcdc calls `cdc.fn_cdc_get_all_changes_*`
at a configurable interval (`stream_poll_interval_ms`, default 5 000 ms). Unlike
PostgreSQL logical replication (push-based, near-zero propagation), SQL Server
events are only visible after the next poll cycle **and** after the SQL Server CDC
capture agent has written them to the change tables (typically < 5 s on an idle server).
Expected latency profile:
| p50 | ≈ stream_poll_interval_ms / 2 |
| p99 | ≈ stream_poll_interval_ms + capture agent delay |
| p99.9 | ≈ 2 × stream_poll_interval_ms (poll jitter under load) |
A p99/p50 ratio of 1 000× is **normal** (e.g. p50 = 0.3 ms measured within a poll
window, p99 = 318 ms = one poll cycle).
**Tuning for lower latency:**
```rust
SqlServerSourceConfig {
stream_poll_interval_ms: 500, // default 5000; 500–1000 ms for latency-sensitive
..SqlServerSourceConfig::default()
}
```
Lower values increase SQL Server query load. 500 ms is the practical lower bound
for most production SQL Server deployments.
---
### Symptom: Low event throughput or high latency
**Error Examples:**
```
WARNING events processed per second dropping below baseline (was 10K/sec, now 5K/sec)
WARNING snapshot progress stalled (no new chunks for 30s)
```
**Diagnosis Checklist:**
1. **Check replication lag (startup note):**
> **Startup behavior:** When rustcdc first connects to a source that has been idle or
> when the replication slot/offset has not been read recently, the connector must replay
> all WAL/binlog entries since the last confirmed LSN. This causes an **expected initial
> replication lag spike** at startup that is *not* an error. The lag metric will trend
> downward as the connector catches up. Allow at least `max_poll_wait_ms × 2` seconds
> before treating non-zero replication lag as a problem.
```bash
curl http://localhost:9090/metrics | grep "rustcdc_runtime_replication_lag_ms"
```
2. **Check source database load:**
```bash
SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;
SHOW FULL PROCESSLIST;
SELECT command, status, sql_text FROM sys.dm_exec_requests;
```
✅ Should show normal query activity; ❌ if high, source DB is overloaded
3. **Check network latency:**
```bash
ping -c 10 <database_host> | tail -1
```
4. **Check rustcdc resource utilization:**
```bash
top -p <rustcdc_pid> | grep CPU
ps aux | grep rustcdc | grep -v grep | awk '{print $6}'
lsof -p <rustcdc_pid> | wc -l
```
5. **Check transform pipeline overhead:**
```bash
curl http://otel-collector:9090/metrics | grep "rustcdc_transform_duration"
```
**Resolution:**
| Source DB overloaded | Reduce rustcdc poll frequency; scale source DB; check for long-running queries |
| Network congested | Verify network MTU (1500 default); check for packet loss (ping -c 100) |
| rustcdc CPU maxed | Increase max_poll_wait_ms (batches more events per poll); reduce transform complexity |
| rustcdc memory growing | Check for transform memory leaks; verify checkpoint is committing (check committed_count) |
| Transform pipeline slow | Profile individual transforms; consider removing non-critical transforms |
---
### Symptom: Snapshot taking too long
**Error Examples:**
```
WARNING snapshot progress: 5% complete (10 hours in, estimated 200 hours remaining)
```
**Diagnosis:**
1. **Check snapshot progress:**
```bash
grep "snapshot_chunk_received\|snapshot_complete" /var/log/rustcdc/structured.log | tail -20
```
2. **Check source table sizes:**
```sql
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables WHERE tablename IN ('users', 'orders', ...);
SELECT table_schema, table_name, ROUND(((data_length + index_length) / 1024 / 1024), 2)
FROM information_schema.tables WHERE table_name IN ('users', 'orders', ...);
SELECT OBJECT_NAME(ps.object_id), SUM(ps.row_count)
FROM sys.dm_db_partition_stats ps
WHERE OBJECT_NAME(ps.object_id) IN ('users', 'orders', ...)
GROUP BY ps.object_id;
```
3. **Check snapshot query performance:**
```bash
time psql -U cdc_user -d mydb -c "SELECT * FROM public.users LIMIT 10000;"
```
**Resolution:**
| Table too large to snapshot | 1. Reduce `snapshot_tables` list; 2. Increase `snapshot_chunk_size` (e.g., 10K → 50K); 3. Add index on clustering key |
| Source DB query slow | Add index on clustering/primary key; schedule snapshot during low-activity window |
| Network bandwidth limited | Verify network bandwidth (iperf); consider moving rustcdc to same datacenter |
| rustcdc CPU bottleneck | Scale to additional rustcdc instances; profile hot path in transform pipeline |
---
## Data Quality Issues
### Symptom: Missing events or duplicate events in output
**Error Examples:**
```
WARNING event_id=12345 received duplicate after checkpoint restart
ERROR detected missing event (event_id=12346 skipped)
```
**Diagnosis Checklist:**
1. **Verify checkpoint is committing:**
```bash
curl http://localhost:9090/metrics | grep "rustcdc_runtime_events_committed_total"
```
2. **Check for buffered events during shutdown:**
```bash
grep "drain_pending\|final_checkpoint" /var/log/rustcdc/structured.log
```
3. **Verify consumer is calling commit callbacks:**
```bash
```
4. **Check for transform filtering:**
```bash
curl http://otel-collector:9090/metrics | grep "rustcdc_events_filtered"
```
**Resolution:**
| Consumer not calling commit | Verify consumer code calls CommitBarrier::notify_consumer_accepted() + commit() |
| Checkpoint not persisted | Verify checkpoint store is writable; check FileCheckpoint path permissions |
| Process killed without graceful shutdown | Implement SIGTERM handler to flush pending events before exit |
| Transform filtering unintentional | Review transform configuration; verify filter rules are correct |
---
### Symptom: Events with incorrect data or wrong schema
**Error Examples:**
```
ERROR validation error: field 'before' is None but operation is Update
ERROR schema error: table schema not found for public.users
```
**Diagnosis:**
1. **Verify source schema is correct:**
```sql
\d public.users
DESC users;
EXEC sp_help 'dbo.users';
```
2. **Check event envelope validation:**
```bash
export RUST_LOG=rustcdc::core::event=debug
grep "validation error\|ValidationError" /var/log/rustcdc/structured.log
```
3. **Verify transform rules are correct:**
```bash
grep -A 10 "transform" /etc/rustcdc/config.toml
```
**Resolution:**
| Source schema changed (DDL) | 1. Update rustcdc snapshot_tables list; 2. Manually trigger schema refresh in SchemaHistory |
| Transform filter too broad | Review transform rules; test in development first |
| Event validation rule violated | Check docs/api.md and src/core/event.rs validation contract; verify source is generating events correctly |
---
## Transform and Filter Issues
### Symptom: Transform errors or events filtered unexpectedly
**Error Examples:**
```
ERROR transform error: route: no matching output for table public.unknown_table
ERROR transform error: mask: regex compilation failed for pattern '(?P<invalid>)'
WARNING events filtered by transform (count=50)
```
**Diagnosis:**
1. **Enable debug logging for transforms:**
```bash
export RUST_LOG=rustcdc::transform=debug
```
2. **Verify transform configuration:**
```bash
grep -A 10 "transform\|route\|filter\|mask" /etc/rustcdc/config.toml
```
3. **Test transforms in isolation:**
```bash
```
**Resolution:**
| Route table not found | Update route transform to include all source tables |
| Regex invalid | Use online regex tester (regex101.com); test pattern before deploying |
| Transform policy = Halt | If acceptable data loss, change to `transform_error_policy = Skip` |
| Column doesn't exist | Verify column name matches source schema exactly (case-sensitive) |
---
## Diagnostics Toolkit
### Essential Commands
```bash
# 1. Health check
# 2. Recent errors
# 3. Checkpoint status
cat /var/rustcdc/checkpoint_postgres.json | jq .
# 4. Source connectivity test
psql "postgresql://user:pass@host/db" -c "SELECT 1;"
mysql -h host -u user -ppass db -e "SELECT 1;"
sqlcmd -S host -U user -P pass -d db -Q "SELECT 1;"
# 5. Network diagnostics
ping -c 10 <source_host>
telnet <source_host> <port>
iperf -c <source_host> # Bandwidth test
# 6. System resource check
top -p $(pgrep -f rustcdc)
# 7. Detailed metrics
# 8. OTel trace export check
curl -s http://otel-collector:4317/... # Check exporter is responding
```
### Log Analysis
```bash
# Count errors by type
# Find slow operations
# Timeline of events
# Calculates duration of log file
# Export metrics trend
### Interactive Debugging
```bash
# Start rustcdc with maximum logging
export RUST_LOG=rustcdc=trace
export RUST_LOG_FORMAT=json
cargo run --release
# Attach debugger (if debug build)
rust-gdb --args ./target/debug/rustcdc --config config.toml
# Health endpoint (if embedded in app)
curl -v http://localhost:8080/health
```
---
**Last Updated:** May 25, 2026
**Version:** Troubleshooting Guide v0.1+
**Contributing:** Found a new troubleshooting scenario? File an issue on GitHub!