# pxs
**pxs** (Parallel X-Sync) is an integrity-first sync and clone tool for large
mutable datasets. It keeps existing copies accurate across local paths, SSH,
and raw TCP, and it is designed to outperform `rsync` in the workloads it
targets.
The project is aimed at repeated refreshes of datasets such as:
- PostgreSQL `PGDATA`
- VM images
- large directory trees with many unchanged files
- large files that are usually modified in place instead of shifted
`pxs` is not a drop-in replacement for `rsync`. Its goal is narrower: exact and
safe synchronization first, then speed through Rust parallelism, fixed-block
delta sync, and transport choices that fit modern large-data workloads.
## What `pxs` Is For
Use `pxs` when you need:
- exact refreshes of an existing copy
- safe replacement behavior instead of in-place corruption risk
- repeated large-data sync where many files or blocks stay unchanged
- local, SSH, or raw TCP sync with one public command shape
- a tool that favors large mutable dataset workflows over general archive parity
`pxs` is a good fit when the destination already exists and you want to keep it
accurate with as little rewrite work as possible.
## What `pxs` Is Not
`pxs` is intentionally not trying to be all of `rsync`.
- It is not a full `rsync -aHAXx` replacement.
- It does not currently promise hardlink, ACL, xattr, SELinux-label, or sparse
layout parity.
- It does not target Windows.
- Raw TCP is for trusted networks and private links, not hostile networks.
If you need universal protocol compatibility or broad filesystem metadata
parity, `rsync` is still the reference point.
## Installation
Install from crates.io:
```bash
cargo install pxs
```
Build from source:
```bash
cargo build --release
```
The binary will be available at `./target/release/pxs`.
> [!IMPORTANT]
> For SSH or raw TCP sync, `pxs` must be installed and available in `$PATH` on
> both sides.
## Command Model
The public sync model is:
```bash
pxs sync SRC DST
```
The first operand is always the source. The second operand is always the
destination.
`SRC` and `DST` can be:
- local filesystem paths
- SSH endpoints like `user@host:/path`
- raw TCP endpoints like `host:port/path`
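The three operand shapes can be told apart by what precedes the first slash. The sketch below is only an illustration of those shapes: `classify_endpoint` is a hypothetical helper, not the parser `pxs` uses, and it does not try to handle IPv6 literals or other edge cases.

```bash
# Hypothetical helper: guess which transport an operand would select,
# based only on the example forms listed above.
classify_endpoint() {
  local operand="$1"
  local head="${operand%%/*}"   # text before the first slash
  local port="${head##*:}"      # text after the last colon in that prefix
  case $head in
    ?*@?*:*) echo "ssh"; return ;;   # user@host:...
  esac
  case $head in
    ?*:?*)
      case $port in
        *[!0-9]*) ;;                 # not a numeric port: treat as local
        *) echo "tcp"; return ;;     # host:port/...
      esac ;;
  esac
  echo "local"
}
```

For example, `classify_endpoint user@db2:/srv/export/pgdata` prints `ssh`, while `classify_endpoint ./snapshot.bin` prints `local`.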
### Local Examples
```bash
# File -> file
pxs sync snapshot/base.tar.zst backup/base.tar.zst
# Directory -> directory
pxs sync /var/lib/postgresql/data /srv/restore/pgdata
# Exact mirror with checksum validation and delete
pxs sync /var/lib/postgresql/data /srv/restore/pgdata --checksum --delete
```
### SSH Examples
```bash
# Remote file -> local file
pxs sync user@db2:/srv/export/snapshot.bin ./snapshot.bin
# Local file -> remote file
pxs sync ./snapshot.bin user@db2:/srv/incoming/snapshot.bin
# Remote directory -> local directory
pxs sync user@db2:/srv/export/pgdata /srv/restore/pgdata
# Local directory -> remote directory
pxs sync /var/lib/postgresql/data user@db2:/srv/incoming/pgdata
```
> [!IMPORTANT]
> SSH sync is designed for non-interactive authentication. In practice that
> means SSH keys, `ssh-agent`, or an already-established multiplexed session.
### Raw TCP Examples
Raw TCP uses `pxs sync` on the client side. The remote side runs `pxs listen`
to receive data or `pxs serve` to send it.
```bash
# Remote file -> local file
pxs sync 192.168.1.10:8080/snapshots/base.tar.zst ./snapshot.bin
# Local file -> remote file
pxs sync ./snapshot.bin 192.168.1.10:8080/incoming/snapshot.bin
# Remote directory -> local directory
pxs sync 192.168.1.10:8080/pgdata /srv/restore/pgdata
# Local directory -> remote directory
pxs sync /var/lib/postgresql/data 192.168.1.10:8080/incoming/pgdata
```
## Raw TCP Setup
Use `listen` on the receiving host to expose an allowed destination root:
```bash
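# Expose /srv as the allowed destination root
pxs listen 0.0.0.0:8080 /srv
# Alternative: the same listener with --fsync enabled
pxs listen 0.0.0.0:8080 /srv --fsync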
```
Then sync into a path beneath that root:
```bash
pxs sync ./snapshot.bin 192.168.1.10:8080/incoming/snapshot.bin
pxs sync /var/lib/postgresql/data 192.168.1.10:8080/incoming/pgdata --delete
```
Use `serve` on the source host to expose an allowed source root:
```bash
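# Expose /srv/export as the allowed source root
pxs serve 0.0.0.0:8080 /srv/export
# Alternative: the same root with --checksum enabled
pxs serve 0.0.0.0:8080 /srv/export --checksum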
```
Then sync from a path beneath that root:
```bash
pxs sync 192.168.1.10:8080/snapshots/base.tar.zst ./snapshot.bin
pxs sync 192.168.1.10:8080/pgdata /srv/restore/pgdata --checksum
```
The `host:port/path` suffix selects what to read or write inside the configured
root. Client-side per-sync flags such as `--checksum`, `--threshold`,
`--delete`, and `--ignore` travel with the `pxs sync` request.
For SSH and raw TCP, `pxs` negotiates block compression automatically. It
prefers `zstd`, falls back to `lz4` when needed, and otherwise sends blocks
uncompressed.
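The preference order can be sketched as a small selection function. This is an illustration only: `choose_compression` and the idea of a peer "advertising" a codec list are assumptions made for the example, not a description of the actual wire protocol.

```bash
# Hypothetical helper: pick a codec from the peer's advertised list using
# the preference order described above (zstd, then lz4, then none).
choose_compression() {
  local peer_codecs=" $* "   # pad with spaces so whole words can be matched
  local codec
  for codec in zstd lz4; do
    case $peer_codecs in
      *" $codec "*) echo "$codec"; return 0 ;;
    esac
  done
  echo "none"
}
```

Calling `choose_compression lz4 zstd` prints `zstd`, and calling it with no arguments prints `none`.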
## Guarantees and Safety Model
`pxs` is built around exactness first.
What it aims to preserve today:
- exact file contents
- exact file, directory, and symlink shape
- exact names, including valid Unix non-UTF8 names over local, SSH, and raw TCP
- file mode and modification time
- ownership when privileges and platform support allow it
- staged replacement so an existing destination is kept until the new object is
ready to commit
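Staged replacement is a general pattern that can be sketched in shell: write the new object beside the destination, then rename it into place. The helper below is a simplified illustration with a hypothetical name (`staged_replace`); it is not how `pxs` is implemented, and it does not cover directories, symlinks, or `fsync`.

```bash
# Hypothetical helper: replace DST with a copy of SRC without ever leaving a
# half-written DST. The old file stays in place until the copy succeeds.
staged_replace() {
  local src="$1" dst="$2" tmp
  # Stage in the destination directory so the final mv is a same-filesystem rename.
  tmp=$(mktemp "$(dirname "$dst")/.staged.XXXXXX") || return 1
  if cp -p "$src" "$tmp"; then
    mv -f "$tmp" "$dst"      # commit: swap the staged file into place
  else
    rm -f "$tmp"             # failed copy: discard the stage, keep the old DST
    return 1
  fi
}
```

If the copy fails partway, the existing destination is untouched and only the staged temporary file is discarded.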
Safety rules that are intentionally strict:
- destination roots and destination parent path components must be real
directories, not symlinks
- leaf symlinks inside the destination tree are treated as entries and may be
replaced or removed without following their targets
- raw TCP requested paths are resolved beneath the configured `listen` or
`serve` root
- raw TCP and SSH protocol paths reject absolute paths, traversal components,
and unsupported path forms
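The path rules for network requests can be illustrated with a small check. This is a sketch under stated assumptions: `resolve_under_root` is a hypothetical helper, it only covers the absolute-path and `..` cases named above, and it does not perform the symlink checks that `pxs` applies to destination components.

```bash
# Hypothetical helper: join a requested path onto a configured root, rejecting
# empty paths, absolute paths, and any ".." component.
resolve_under_root() {
  local root="$1" path="$2"
  case $path in
    ''|/*) return 1 ;;                     # empty or absolute
    ..|../*|*/..|*/../*) return 1 ;;       # traversal component
  esac
  printf '%s/%s\n' "${root%/}" "$path"
}
```

With a root of `/srv`, a request for `incoming/snapshot.bin` resolves to `/srv/incoming/snapshot.bin`, while `../etc/passwd` and `/etc/passwd` are both rejected.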
Durability and verification options:
- `--checksum` forces block comparison and enables end-to-end BLAKE3
verification for network sync
- `--delete` removes destination entries that are not present in the source
tree for directory sync
- `--fsync` flushes committed file writes, deletes, directory installs, symlink
installs, and final metadata before success is reported
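To preview which entries count as "not present in the source tree", the comparison can be approximated with standard tools. The helper name `extraneous_entries` is hypothetical, the sketch ignores `--ignore` patterns, and it assumes file names without embedded newlines.

```bash
# Hypothetical helper: list entries that exist under DST but not under SRC.
# Paths are printed relative to DST, in sorted order.
extraneous_entries() {
  local src="$1" dst="$2" rel
  (cd "$dst" && find . -mindepth 1) | LC_ALL=C sort | while IFS= read -r rel; do
    # -L catches dangling symlinks, which -e alone would report as missing.
    [ -e "$src/$rel" ] || [ -L "$src/$rel" ] || printf '%s\n' "$rel"
  done
}
```

This only reports candidates. For a real preview of what a sync would change, `--dry-run` is the supported option.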
> [!NOTE]
> Without `--checksum`, `pxs` skips unchanged files by size and modification
> time. Keep clocks synchronized between hosts if you rely on mtime-based skip
> decisions.
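
The size-and-mtime check described in the note can be approximated in shell. This is an illustrative sketch, not the comparison `pxs` performs internally: `looks_unchanged` is a hypothetical helper and its timestamp precision depends on the shell and filesystem.

```bash
# Hypothetical helper: succeed when two regular files have the same size and
# neither has a newer modification time than the other.
looks_unchanged() {
  local src="$1" dst="$2"
  [ -f "$src" ] && [ -f "$dst" ] || return 1
  [ "$(wc -c < "$src")" -eq "$(wc -c < "$dst")" ] || return 1
  [ "$src" -nt "$dst" ] && return 1   # source modified after destination
  [ "$dst" -nt "$src" ] && return 1   # destination modified after source
  return 0
}
```

Two files with identical size and mtime but different contents pass this check, which is the case `--checksum` exists to catch.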
## Performance Model
`pxs` tries to win where its model fits the workload.
- It hashes and compares blocks in parallel.
- It walks directory trees concurrently.
- It uses fixed 128 KiB blocks, which works well for many in-place update
workloads.
- It can compress outbound network block batches with negotiated `zstd` and
fall back to `lz4` when needed.
- It can keep multiple small outbound network files in flight on the main
control session.
- It can avoid SSH overhead entirely on trusted networks via raw TCP.
- It can fan out large outbound network transfers with multiple worker
sessions.
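Fixed-block comparison can be demonstrated with standard tools. The sketch below is a slow, sequential illustration of the idea only: `changed_blocks` is a hypothetical helper, it uses `cksum` (CRC) as a stand-in for the hashing `pxs` does, and it only walks blocks covered by the source file.

```bash
# Hypothetical helper: print the index of every fixed-size block that differs
# between SRC and DST. The block size defaults to 128 KiB.
changed_blocks() {
  local src="$1" dst="$2" bs="${3:-131072}"
  local size blocks i a b
  size=$(wc -c < "$src")
  blocks=$(( (size + bs - 1) / bs ))   # round up to cover a partial last block
  i=0
  while [ "$i" -lt "$blocks" ]; do
    a=$(dd if="$src" bs="$bs" skip="$i" count=1 2>/dev/null | cksum)
    b=$(dd if="$dst" bs="$bs" skip="$i" count=1 2>/dev/null | cksum)
    [ "$a" = "$b" ] || echo "$i"
    i=$((i + 1))
  done
}
```

Because block boundaries are fixed, an in-place edit dirties only the blocks it touches, while inserting bytes near the start of a file shifts every later block and makes them all compare as changed. That is why the model suits files that are modified in place rather than shifted.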
These gains are workload-dependent. `pxs` should be described as:
- faster than `rsync` in its target workloads
- not universally faster than `rsync`
- optimized for repeated large-data refreshes on modern hardware
### Threshold Behavior
`--threshold` controls when `pxs` should give up on reuse and rewrite a file
fully.
- Default: `0.1`
- Meaning: if `destination_size / source_size` is below the threshold, do a
full copy instead of block reuse
That default is intentionally low so interrupted or partially existing files can
still benefit from delta sync and resume-like reuse.
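The rule above comes down to one ratio comparison. The sketch below shows that arithmetic only: `should_full_copy` is a hypothetical helper, and treating an empty source as "no full copy" is an assumption of the example.

```bash
# Hypothetical helper: succeed (exit 0) when the destination is small enough,
# relative to the source, that a full copy is chosen over block reuse.
should_full_copy() {
  local dst_size="$1" src_size="$2" threshold="${3:-0.1}"
  awk -v d="$dst_size" -v s="$src_size" -v t="$threshold" \
    'BEGIN { exit !(s > 0 && d / s < t) }'
}
```

With the default threshold, a 5 GiB partial destination for a 100 GiB source gives a ratio of 0.05 and triggers a full copy, while a 50 GiB destination gives 0.5 and is reused block by block.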
## Benchmarking
This repository includes workload-specific comparison helpers:
- `./benchmark.sh`
- `./local_pxs_vs_rsync.sh`
- `./remote_pxs_vs_rsync.sh --source <PATH> --host <USER@HOST> --remote-root <PATH>`
Use them as targeted evidence, not as universal performance proof. When
benchmarking:
- compare equivalent integrity settings
- keep transport choices explicit
- describe the workload shape
- report both speed and correctness results
## Common Options
- `--checksum`, `-c`: force block comparison and end-to-end network verification
- `--delete`: remove extraneous destination entries during directory sync
- `--fsync`, `-f`: durably flush committed data and metadata before completion
- `--ignore`, `-i`: repeatable glob-based ignore pattern
- `--exclude-from`, `-E`: read ignore patterns from a file
- `--threshold`, `-t`: reuse threshold for block-based sync, default `0.1`
- `--dry-run`, `-n`: show what would change without mutating the destination
- `--large-file-parallel-threshold`: send outbound network files at or above
this size with chunk-parallel workers
- `--large-file-parallel-workers`: set the worker count for those large outbound
network transfers
- `--network-file-concurrency`: keep multiple smaller outbound network files in
flight on the main control session
## PostgreSQL Helper
This repository includes [`sync.sh`](./sync.sh), a PostgreSQL-oriented helper
for repeated SSH `pxs sync SRC DST` passes around `pg_backup_start()` and
`pg_backup_stop()`.
It is useful when evaluating `pxs` on the workload it was originally built for:
repeated `PGDATA` refreshes.
## Design Notes
The front page is intentionally operator-focused. Protocol flow, transport
design, and internal architecture notes live in [docs/design.md](./docs/design.md).
## Testing
Run the core validation flow:
```bash
cargo fmt --all
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
```
Podman end-to-end suites are also available for SSH and raw TCP under
`tests/podman/`.
## License
BSD-3-Clause