# rustcdc Troubleshooting Guide
**Version:** v0.1+
**Audience:** Operators and developers debugging rustcdc issues
---
## Table of Contents
1. [Connection Issues](#connection-issues)
2. [Checkpoint and Recovery Issues](#checkpoint-and-recovery-issues)
3. [Performance and Throughput Issues](#performance-and-throughput-issues)
4. [Data Quality Issues](#data-quality-issues)
5. [Transform and Filter Issues](#transform-and-filter-issues)
6. [Diagnostics Toolkit](#diagnostics-toolkit)
---
## Integration Scaffolding Assumptions
Command examples in this guide assume your embedder/deployment provides:
- Service controls (for example `systemctl`, container orchestration commands, or custom supervisor)
- Runtime/admin metrics endpoint (examples use `http://localhost:9090/metrics`)
- Connector/client CLIs installed for ad-hoc diagnostics (`psql`, `mysql`, `sqlcmd`)
Adapt commands to your runtime model. If your deployment does not expose these controls yet,
establish them first using the deployment guidance in `docs/deployment.md`.
---
## Connection Issues
### Symptom: "connection refused" or timeout on startup
**Error Examples:**
```
ERROR source error: failed to connect to postgres: connection refused
ERROR source error: failed to connect to mysql: timeout
ERROR source error: connection to sqlserver closed unexpectedly
```
**Diagnosis Checklist:**
1. **Verify network connectivity:**
```bash
ping -c 3 <database_host>
telnet <database_host> <port>
```
✅ Should respond; ❌ if not, check network/firewall
2. **Verify database is running:**
```bash
pg_isready -h <host> -p 5432
mysql --defaults-extra-file=<mysql-client.cnf> -h <host> -u <user> -e "SELECT 1;"
SQLCMDPASSWORD="${SQLCMDPASSWORD:?set from secret manager}" sqlcmd -S <host> -U <user> -Q "SELECT 1;"
```
✅ Should return connection OK; ❌ if not, restart database
3. **Verify credentials:**
```bash
psql "postgresql://<user>@<host>:5432/<database>"
mysql --defaults-extra-file=<mysql-client.cnf> -h <host> -u <user> <database>
SQLCMDPASSWORD="${SQLCMDPASSWORD:?set from secret manager}" sqlcmd -S <host> -U <user> -d <database>
```
✅ Should authenticate; ❌ if not, verify configured secret source and connector credentials
4. **Check user permissions:**
```sql
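-- PostgreSQL: rolreplication should be true for the CDC user
SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'cdc_user';
-- MySQL: list the grants held by the CDC user
SHOW GRANTS FOR 'cdc_user'@'%';
-- SQL Server: run as the CDC user; returns 1 if it is a member of cdc_admin
SELECT is_member('cdc_admin');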
```
5. **Check connection string format:**
- PostgreSQL: `postgresql://user:pass@host:port/database?sslmode=require`
- MySQL: `mysql://user:pass@host:port/database`
- SQL Server: `sqlserver://user:pass@host:port;database=name;Encrypt=yes`
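Passwords that contain characters such as `@`, `:` or `/` can break URL-style connection strings unless they are percent-encoded. The following is a minimal sketch, assuming the connector parses these strings as standard URLs; the helper name `urlencode` and the `DB_PASSWORD` variable are illustrative and not part of rustcdc.

```bash
# urlencode STRING: percent-encode every character except the URL "unreserved" set.
# Assumption: ASCII input only; multi-byte characters are not handled here.
urlencode() {
  local s="$1" out="" rest c
  while [ -n "$s" ]; do
    rest="${s#?}"          # everything after the first character
    c="${s%"$rest"}"       # the first character itself
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;   # "'c" yields the character code
    esac
    s="$rest"
  done
  printf '%s\n' "$out"
}

# Hypothetical usage (DB_PASSWORD is an assumed variable name):
#   url="postgresql://cdc_user:$(urlencode "$DB_PASSWORD")@<host>:5432/<database>?sslmode=require"
```

For example, `urlencode 'p@ss:w/rd'` prints `p%40ss%3Aw%2Frd`.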
**Resolution:**
| Cause | Fix |
|---|---|
| Network blocked | Allow the rustcdc server IP through the database firewall |
| Database not running | Restart database service |
| Invalid credentials | Verify credentials in the config file; check for special characters that need escaping in the connection string |
| Missing REPLICATION role (PG) | Run: `ALTER ROLE cdc_user WITH REPLICATION;` |
| Missing CDC admin (SQL Server) | Run: `ALTER ROLE cdc_admin ADD MEMBER cdc_user;` |
| TLS/Certificate error | Verify `transport` is set to `TransportConfig::tls()` or `TransportConfig::tls_with_ca_cert_path(...)` |
---
### Symptom: "TLS handshake failed" or certificate validation error
**Error Examples:**
```
ERROR source error: tls error: certificate verify failed
ERROR source error: tls error: x509: certificate signed by unknown authority
```
**Diagnosis:**
```bash
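# 1. Check whether the server presents a certificate
openssl s_client -connect <host>:<port>
# Should show the certificate chain; if the handshake fails, the server may not offer TLS on that port.
# PostgreSQL and MySQL negotiate TLS in-protocol: add `-starttls postgres` or `-starttls mysql` (OpenSSL 1.1.1+).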
# 2. Verify CA certificate (if using custom CA)
openssl x509 -in /path/to/ca.pem -text -noout
# Should show certificate details; verify Subject and Issuer match your CA
# 3. Test TLS connection manually
# PostgreSQL
psql "postgresql://user:pass@host/db?sslmode=require&sslcert=/path/to/cert.pem&sslkey=/path/to/key.pem"
# MySQL
mysql --defaults-extra-file=<mysql-client.cnf> -h <host> --ssl-mode=REQUIRED --ssl-ca=/path/to/ca.pem -u <user>
```
**Resolution:**
| Cause | Fix |
|---|---|
| Certificate not trusted | Use `TransportConfig::tls_with_ca_cert_path(...)` with the CA bundle path |
| TLS handshake fails behind proxy | Verify proxy certificates are rooted in the CA bundle configured via `TransportConfig::tls_with_ca_cert_path(...)` |
| Self-signed cert in test/air-gapped env | Use explicit opt-in `TransportConfig::tls_insecure_skip_verify()` only for non-production environments |
| Expired certificate | Request new certificate from database admin |
| Wrong CA certificate | Verify CA cert matches database server certificate issuer |
---
## Checkpoint and Recovery Issues
### Symptom: "checkpoint error" or "replication slot diverged"
**Error Examples:**
```
ERROR: source error: postgres checkpoint/slot divergence for slot '...'
ERROR checkpoint error: checkpoint file does not exist
ERROR checkpoint error: failed to read checkpoint: invalid JSON
```
**Diagnosis Checklist:**
1. **Verify checkpoint file exists and is readable:**
```bash
ls -lh /var/rustcdc/checkpoint_*.json
```
✅ Should list checkpoint files; ❌ if not, check directory permissions
2. **Verify checkpoint is valid JSON:**
```bash
jq . /var/rustcdc/checkpoint_postgres.json
```
3. **For PostgreSQL: verify replication slot exists:**
```sql
SELECT slot_name, active, restart_lsn FROM pg_replication_slots WHERE slot_name LIKE 'rustcdc_postgres_%';
```
✅ Slot exists and active; ❌ if not, may have been dropped manually
4. **For PostgreSQL: check LSN divergence:**
```bash
# LSN recorded in the checkpoint
jq '.offset.lsn' /var/rustcdc/checkpoint_postgres.json
# Current WAL position on the server
psql -U cdc_user -d mydb -Atc "SELECT pg_current_wal_lsn();"
```
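To judge how far apart the two positions are, it helps to convert both LSNs to byte offsets. A PostgreSQL LSN is a 64-bit position printed as two hexadecimal numbers (high 32 bits, a slash, low 32 bits). The sketch below assumes the checkpoint stores the LSN in that textual form; the helper names are illustrative.

```bash
# lsn_to_bytes LSN: convert an LSN such as "16/B374D848" to a byte offset.
lsn_to_bytes() {
  local hi="${1%%/*}" lo="${1##*/}"
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# lsn_gap_bytes CHECKPOINT_LSN CURRENT_LSN: bytes of WAL between the two positions.
lsn_gap_bytes() {
  echo $(( $(lsn_to_bytes "$2") - $(lsn_to_bytes "$1") ))
}

# Hypothetical usage with the two values obtained in step 4:
#   lsn_gap_bytes "<checkpoint_lsn>" "<current_wal_lsn>"
```

A negative result (checkpoint ahead of the server) or a checkpoint LSN older than the slot's `restart_lsn` would suggest divergence. The same difference can be computed server-side with `SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '<checkpoint_lsn>');`.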
**Resolution:**
| Cause | Fix |
|---|---|
| Checkpoint corrupted | Stop rustcdc; delete checkpoint file; restart (will scan from current position) |
| Replication slot dropped | Stop rustcdc; recreate checkpoint with current LSN; restart |
| WAL/binlog purged | See [Replication Slot Divergence Recovery](runbook.md#replication-slot-divergence-recovery) |
| Checkpoint permissions | Verify `/var/rustcdc/` is writable by rustcdc process owner |
---
### Symptom: "buffer full" error or frequent checkpoint pauses
**Error Examples:**
```
ERROR checkpoint error: commit barrier buffer is full
WARNING checkpoint latency exceeding 1s
```
**Diagnosis:**
1. **Check buffer utilization:**
```bash
grep "buffer_size" /var/log/rustcdc/structured.log | tail -20
curl http://localhost:9090/metrics | grep "rustcdc_runtime_buffer_depth"
```
2. **Check checkpoint commit latency:**
```bash
curl http://localhost:9090/metrics | grep "rustcdc_runtime_checkpoint_age_ms"
```
3. **Check checkpoint store I/O:**
```bash
iostat -x 1 5 | grep sda
```
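The buffer depth is easier to interpret as a share of the configured limit. The sketch below assumes the metrics endpoint serves Prometheus text format without trailing timestamps, and that `50000` stands in for your configured `max_buffer_size`; the helper names `metric_value` and `buffer_utilization_pct` are illustrative and not part of rustcdc.

```bash
# metric_value NAME: read Prometheus-style text on stdin and print the value
# of the first sample whose metric name equals NAME (labels are ignored).
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $0
      sub(/[{ ].*/, "", metric)        # keep only the metric name
      if (metric == name) { print $NF; exit }
    }'
}

# buffer_utilization_pct DEPTH MAX: print DEPTH as a whole-number percentage of MAX.
buffer_utilization_pct() {
  awk -v d="$1" -v m="$2" 'BEGIN { printf "%d\n", d * 100 / m }'
}

# Hypothetical usage (endpoint and limit are assumptions):
#   depth=$(curl -s http://localhost:9090/metrics | metric_value rustcdc_runtime_buffer_depth)
#   buffer_utilization_pct "$depth" 50000
```

A utilization that stays close to 100% points toward the "buffer size too small" or slow checkpoint store cases rather than a transient burst.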
**Resolution:**
| Cause | Fix |
|---|---|
| Checkpoint store slow (disk I/O) | 1. Switch to `FileCheckpoint` on faster disk; 2. Increase `max_buffer_size` to batch more events |
| Checkpoint store slow (external backend) | Optimize backend-specific checkpoint writes and indexes; monitor write latency and contention |
| Buffer size too small | Increase `max_buffer_size` in RuntimeConfig (e.g., 50_000 → 100_000) |
| Transform errors causing queue buildup | Check transform error logs; fix failing transforms or set `transform_error_policy = Skip` |
---
## Performance and Throughput Issues
### Symptom: SQL Server high p99 latency (bursty / non-uniform)
**Expected behavior — not a bug:**
SQL Server CDC is **polling-based**. rustcdc calls `cdc.fn_cdc_get_all_changes_*`
at a configurable interval (`stream_poll_interval_ms`, default 5 000 ms). The
PostgreSQL connector also uses short-interval SQL polling
(`pg_logical_slot_peek_binary_changes`, default 50 ms with a per-poll bounded
timeout), but its much shorter default interval means p99 latency is typically
well under 100 ms. SQL Server CDC adds the capture agent delay on top of the
poll interval, which is why the latency profile looks so different.
Expected latency profile:
| Percentile | Expected latency |
|---|---|
| p50 | ≈ stream_poll_interval_ms / 2 |
| p99 | ≈ stream_poll_interval_ms + capture agent delay |
| p99.9 | ≈ 2 × stream_poll_interval_ms (poll jitter under load) |
A p99/p50 ratio of 1 000× is **normal** (e.g. p50 = 0.3 ms measured within a poll
window, p99 = 318 ms = one poll cycle).
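The rules of thumb in the table above can be turned into a quick estimate before changing `stream_poll_interval_ms`. This is only a sketch of that arithmetic; the capture agent delay is an input you would have to measure or assume for your own SQL Server, and the function name is illustrative.

```bash
# expected_latency_ms POLL_MS CAPTURE_DELAY_MS
# Prints the rough percentile estimates from the latency profile table.
expected_latency_ms() {
  local poll_ms="$1" capture_ms="$2"
  echo "p50=$(( poll_ms / 2 )) p99=$(( poll_ms + capture_ms )) p99.9=$(( 2 * poll_ms ))"
}

# Hypothetical usage: default 5000 ms poll interval, assumed 1000 ms capture delay
#   expected_latency_ms 5000 1000
```

With those assumed inputs the estimate is `p50=2500 p99=6000 p99.9=10000`, all in milliseconds.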
**Tuning for lower latency:**
```rust
SqlServerSourceConfig {
    stream_poll_interval_ms: 500, // default 5000; 500–1000 ms for latency-sensitive workloads
    ..SqlServerSourceConfig::default()
}
```
Lower values increase SQL Server query load. 500 ms is the practical lower bound
for most production SQL Server deployments.
---
### Symptom: Low event throughput or high latency
**Error Examples:**
```
WARNING events processed per second dropping below baseline (was 10K/sec, now 5K/sec)
WARNING snapshot progress stalled (no new chunks for 30s)
```
**Diagnosis Checklist:**
1. **Check replication lag (startup note):**
> **Startup behavior:** When rustcdc first connects to a source that has been idle or
> when the replication slot/offset has not been read recently, the connector must replay
> all WAL/binlog entries since the last confirmed LSN. This causes an **expected initial
> replication lag spike** at startup that is *not* an error. The lag metric will trend
> downward as the connector catches up. Allow at least 2 × `max_poll_wait_ms` (a value in
> milliseconds) before treating non-zero replication lag as a problem.
```bash
curl http://localhost:9090/metrics | grep "rustcdc_runtime_replication_lag_ms"
```
2. **Check source database load:**
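```sql
-- PostgreSQL (requires the pg_stat_statements extension)
SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;
-- MySQL
SHOW FULL PROCESSLIST;
-- SQL Server
SELECT r.command, r.status, t.text AS sql_text
FROM sys.dm_exec_requests r CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t;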
```
✅ Should show normal query activity; ❌ if high, source DB is overloaded
3. **Check network latency:**
```bash
ping -c 10 <database_host> | tail -1
```
4. **Check rustcdc resource utilization:**
```bash
# CPU and memory (RSS in KiB) of the rustcdc process
ps -o pid,%cpu,%mem,rss -p <rustcdc_pid>
# Number of open file descriptors
lsof -p <rustcdc_pid> | wc -l
```
5. **Check transform pipeline overhead:**
```bash
curl http://otel-collector:9090/metrics | grep "rustcdc_transform_duration"
```
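Because a startup lag spike is expected to trend downward, a handful of samples says more than a single reading. The sketch below classifies a series of lag samples; the function name and the three labels it prints are illustrative choices, not rustcdc output.

```bash
# lag_trend: read replication-lag samples (ms, one per line, oldest first) on stdin
# and report whether the newest sample is lower than the oldest.
lag_trend() {
  awk 'NF { if (n == 0) first = $1 + 0; last = $1 + 0; n++ }
       END {
         if (n < 2)             print "insufficient-data"
         else if (last < first) print "catching-up"
         else                   print "not-recovering"
       }'
}

# Hypothetical usage: five samples, ten seconds apart
#   for i in 1 2 3 4 5; do
#     curl -s http://localhost:9090/metrics |
#       awk '/^rustcdc_runtime_replication_lag_ms/ { print $NF; exit }'
#     sleep 10
#   done | lag_trend
```

Only a `not-recovering` result that persists beyond the startup window described in step 1 would normally call for the resolution steps below.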
**Resolution:**
| Cause | Fix |
|---|---|
| Source DB overloaded | Reduce rustcdc poll frequency; scale source DB; check for long-running queries |
| Network congested | Verify network MTU (1500 default); check for packet loss (`ping -c 100`) |
| rustcdc CPU maxed | Increase `max_poll_wait_ms` (batches more events per poll); reduce transform complexity |
| rustcdc memory growing | Check for transform memory leaks; verify checkpoint is committing (check committed_count) |
| Transform pipeline slow | Profile individual transforms; consider removing non-critical transforms |
---
### Symptom: Snapshot taking too long
**Error Examples:**
```
WARNING snapshot progress: 5% complete (10 hours in, estimated 200 hours remaining)
```
**Diagnosis:**
1. **Check snapshot progress:**
```bash
grep "snapshot_chunk_received\|snapshot_complete" /var/log/rustcdc/structured.log | tail -20
```
2. **Check source table sizes:**
```sql
-- PostgreSQL: total size (table plus indexes)
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables WHERE tablename IN ('users', 'orders', ...);
-- MySQL: size in MB
SELECT table_schema, table_name, ROUND(((data_length + index_length) / 1024 / 1024), 2) AS size_mb
FROM information_schema.tables WHERE table_name IN ('users', 'orders', ...);
-- SQL Server: row count (heap or clustered index only, to avoid double counting)
SELECT OBJECT_NAME(ps.object_id) AS table_name, SUM(ps.row_count) AS row_count
FROM sys.dm_db_partition_stats ps
WHERE OBJECT_NAME(ps.object_id) IN ('users', 'orders', ...)
  AND ps.index_id IN (0, 1)
GROUP BY ps.object_id;
```
3. **Check snapshot query performance:**
```bash
time psql -U cdc_user -d mydb -c "SELECT * FROM public.users LIMIT 10000;"
```
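A rough remaining-time estimate helps decide whether to let a snapshot finish or to restructure it. The sketch below uses plain linear extrapolation from the percentage complete and the elapsed time, which assumes the remaining tables copy at the same rate as the ones already done; the function name is illustrative.

```bash
# snapshot_eta_hours PERCENT_DONE HOURS_ELAPSED
# Linear extrapolation: remaining = elapsed * (100 - percent) / percent.
snapshot_eta_hours() {
  awk -v pct="$1" -v elapsed="$2" 'BEGIN {
    if (pct <= 0 || pct > 100) { print "unknown"; exit }
    printf "%.1f\n", elapsed * (100 - pct) / pct
  }'
}

# Hypothetical usage: 25% complete after 10 hours
#   snapshot_eta_hours 25 10
```

With 25% done after 10 hours the estimate is `30.0` hours remaining. Large tables that have not started yet can make the real figure considerably worse.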
**Resolution:**
| Cause | Fix |
|---|---|
| Table too large to snapshot | 1. Reduce `snapshot_tables` list; 2. Increase `snapshot_chunk_size` (e.g., 10K → 50K); 3. Add index on clustering key |
| Source DB query slow | Add index on clustering/primary key; schedule snapshot during low-activity window |
| Network bandwidth limited | Verify network bandwidth (`iperf`); consider moving rustcdc to same datacenter |
| rustcdc CPU bottleneck | Scale to additional rustcdc instances; profile hot path in transform pipeline |
---
## Data Quality Issues
### Symptom: Missing events or duplicate events in output
**Error Examples:**
```
WARNING event_id=12345 received duplicate after checkpoint restart
ERROR detected missing event (event_id=12346 skipped)
```
**Diagnosis Checklist:**
1. **Verify checkpoint is committing:**
```bash
curl http://localhost:9090/metrics | grep "rustcdc_runtime_events_committed_total"
```
2. **Check for buffered events during shutdown:**
```bash
grep "drain_pending\|final_checkpoint" /var/log/rustcdc/structured.log
```
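3. **Verify consumer is calling commit callbacks:**
Review the consumer code path: each accepted event should be followed by `CommitBarrier::notify_consumer_accepted()` and `commit()`.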
4. **Check for transform filtering:**
```bash
curl http://otel-collector:9090/metrics | grep "rustcdc_events_filtered"
```
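A single reading of the committed-events counter does not show whether commits are progressing; two readings taken some seconds apart do. The sketch below compares two such samples. The function name and the labels it prints are illustrative, and it assumes the counter only resets when the process restarts.

```bash
# commit_progress BEFORE AFTER SECONDS
# Compare two samples of a monotonically increasing counter taken SECONDS apart.
commit_progress() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
    if (b < a)       print "counter-reset"     # process restarted between samples
    else if (b == a) print "stalled"           # nothing committed in the window
    else             printf "committing %.1f events/s\n", (b - a) / s
  }'
}

# Hypothetical usage: two samples of rustcdc_runtime_events_committed_total, 30 s apart
#   commit_progress 1000 4000 30
```

For the sample values shown, the output is `committing 100.0 events/s`. A `stalled` result while the source is still producing changes points at the consumer commit path or the checkpoint store.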
**Resolution:**
| Cause | Fix |
|---|---|
| Consumer not calling commit | Verify consumer code calls `CommitBarrier::notify_consumer_accepted()` and `commit()` |
| Checkpoint not persisted | Verify checkpoint store is writable; check `FileCheckpoint` path permissions |
| Process killed without graceful shutdown | Implement SIGTERM handler to flush pending events before exit |
| Transform filtering unintentional | Review transform configuration; verify filter rules are correct |
---
### Symptom: Events with incorrect data or wrong schema
**Error Examples:**
```
ERROR validation error: field 'before' is None but operation is Update
ERROR schema error: table schema not found for public.users
```
**Diagnosis:**
1. **Verify source schema is correct:**
```sql
\d public.users
DESC users;
EXEC sp_help 'dbo.users';
```
2. **Check event envelope validation:**
```bash
export RUST_LOG=rustcdc::core::event=debug
grep "validation error\|ValidationError" /var/log/rustcdc/structured.log
```
3. **Verify transform rules are correct:**
```bash
grep -A 10 "transform" /etc/rustcdc/config.toml
```
**Resolution:**
| Cause | Fix |
|---|---|
| Source schema changed (DDL) | 1. Update rustcdc `snapshot_tables` list; 2. Manually trigger schema refresh in SchemaHistory |
| Transform filter too broad | Review transform rules; test in development first |
| Event validation rule violated | Check `docs/api.md` and `src/core/event.rs` validation contract; verify source is generating events correctly |
---
## Transform and Filter Issues
### Symptom: Transform errors or events filtered unexpectedly
**Error Examples:**
```
ERROR transform error: route: no matching output for table public.unknown_table
ERROR transform error: mask: regex compilation failed for pattern '(?P<invalid>)'
WARNING events filtered by transform (count=50)
```
**Diagnosis:**
1. **Enable debug logging for transforms:**
```bash
export RUST_LOG=rustcdc::transform=debug
```
2. **Verify transform configuration:**
```bash
grep -A 10 "transform\|route\|filter\|mask" /etc/rustcdc/config.toml
```
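3. **Test transforms in isolation:**
Run each transform against a small set of representative events in a development environment before deploying it.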
**Resolution:**
| Cause | Fix |
|---|---|
| Route table not found | Update route transform to include all source tables |
| Regex invalid | Test the pattern before deploying (for example with regex101.com) |
| Transform policy = Halt | If dropping failed events is acceptable, change to `transform_error_policy = Skip` |
| Column doesn't exist | Verify column name matches source schema exactly (case-sensitive) |
---
## Diagnostics Toolkit
### Essential Commands
```bash
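# 1. Health check (if a health endpoint is embedded in your app)
curl -s http://localhost:8080/health
# 2. Recent errors
grep "ERROR" /var/log/rustcdc/structured.log | tail -20
# 3. Checkpoint status
jq . /var/rustcdc/checkpoint_postgres.json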
# 4. Source connectivity test
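psql "postgresql://<user>@<host>/<database>" -c "SELECT 1;"
mysql --defaults-extra-file=<mysql-client.cnf> -h <host> -u <user> <database> -e "SELECT 1;"
SQLCMDPASSWORD="${SQLCMDPASSWORD:?set from secret manager}" sqlcmd -S <host> -U <user> -d <database> -Q "SELECT 1;"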
# 5. Network diagnostics
ping -c 10 <source_host>
telnet <source_host> <port>
iperf -c <source_host> # Bandwidth test
# 6. System resource check
top -p "$(pgrep -d, -f rustcdc)"
# 7. Detailed metrics
curl -s http://localhost:9090/metrics | grep "rustcdc_"
# 8. OTel trace export check
curl -s http://otel-collector:4317/... # Check exporter is responding
```
### Log Analysis
```bash
# Count errors by type
# Find slow operations
# Timeline of events
# Calculate duration of log file
# Export metrics trend
```
### Interactive Debugging
```bash
# Start rustcdc with maximum logging
export RUST_LOG=rustcdc=trace
export RUST_LOG_FORMAT=json
cargo run --release
# Attach debugger (if debug build)
rust-gdb --args ./target/debug/rustcdc --config config.toml
# Health endpoint (if embedded in app)
curl -v http://localhost:8080/health
```
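For the "count errors by type" step under Log Analysis, one possible approach is to group error lines by the category that follows the log level. This is a minimal sketch, assuming plain-text log lines shaped like the error examples in this guide (`ERROR <category> error: <detail>`); JSON-formatted logs would need `jq` instead, and the function name is illustrative.

```bash
# count_errors_by_type: read log lines on stdin and print "<count> <category>",
# most frequent first. Only lines starting with ERROR are considered.
count_errors_by_type() {
  awk '/^ERROR/ {
         line = $0
         sub(/^ERROR:? +/, "", line)   # drop the level prefix
         sub(/:.*/, "", line)          # keep the text up to the first colon
         count[line]++
       }
       END { for (k in count) printf "%d %s\n", count[k], k }' | sort -rn
}

# Hypothetical usage (log path as used elsewhere in this guide):
#   count_errors_by_type < /var/log/rustcdc/structured.log
```

A category that dominates the output is usually the best place to start with the matching section of this guide.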
---
**Last Updated:** May 25, 2026
**Version:** Troubleshooting Guide v0.1+
**Contributing:** Found a new troubleshooting scenario? File an issue on GitHub!