RSDOS
RSDOS ([R]u[S]ty [D]isk-[O]bject[S]tore) is a fast, serverless, Rust-native disk object store for dataset management.
It handles huge datasets without breaking a sweat, whether you’re juggling thousands of tiny files or streaming multi-gigabyte blobs. It’s not designed as a backup solution, but rather for storing millions of files in a compact and manageable way.
It packs data to make efficient use of disk space and deduplicates content via SHA-256 hashing.
The tool applies on-the-fly compression (zstd by default, or zlib) whenever it’s beneficial, with no manual tuning required.
It keeps I/O straightforward with streaming-based insert and extract methods, so you don’t flood your RAM when dealing with large files.
Thanks to Rust’s memory safety guarantees, RSDOS delivers great performance without the usual headaches or subtle bugs. If you’re integrating with Python, that’s covered through pyo3 bindings.
More design details can be found in the design notes.
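As a rough illustration of the SHA-256 deduplication mentioned above, here is a minimal Python sketch of content-addressed storage. It is not RSDOS code: the helper names `add_object` and `demo_dedup`, and the one-file-per-hash layout, are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def add_object(store: Path, content: bytes) -> str:
    """Store content under its SHA-256 hex digest; skip the write if it already exists."""
    hashkey = hashlib.sha256(content).hexdigest()
    target = store / hashkey
    if not target.exists():  # identical content maps to the same key, so it is stored once
        target.write_bytes(content)
    return hashkey


def demo_dedup() -> tuple[str, str, int]:
    """Insert the same bytes twice; return both keys and the number of files on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        first = add_object(store, b"hello world")
        second = add_object(store, b"hello world")
        return first, second, len(list(store.iterdir()))
```

Because the key is derived from the content itself, inserting the same bytes twice yields the same key and leaves a single copy on disk.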
Installation
Planned installation methods include:
- cargo binstall
- cargo install
- curl
- Python library (providing both Python API and CLI)
- Apt / Pacman / Brew (system packages)
MSRV
Minimum Supported Rust Version (MSRV): 1.78
Usage
CLI tool
Manage your large file datasets through the CLI:
- Initialize a new container in the current directory
# [info] Container initialized at ./container
- Add files as loose objects
# abc123... - mydata1.txt: 1.2 MB
# def456... - mydata2.bin: 3.4 MB
- Pack all loose objects for efficient storage
# [info] Packed 2 loose objects into pack file #1
- Display container status
# [container]
# Location = ./container
# Id = 0123456789abcdef
# ZipAlgo = zstd
#
# [container.count]
# Loose = 0
# Packs = 1
# Pack Files = 1
#
# [container.size]
# Loose = 0 B
# Packs = 4.6 MB
# Packs Files = 4.6 MB
Python binding
Here’s a quick-start guide for the Python API, showcasing core operations:
# 1. Create a new container (or open an existing one) at a specified path:
# 2. Initialize the container with desired settings
# 3. Add objects in loose storage
# 4. Pack all loose objects for optimal storage
# 5. Retrieve the content of the first file
Additional Tips
- Heuristics: RSDOS automatically decides whether to compress data based on size and content type (e.g., text vs. binary). You can override this with the compress parameter.
- Large Repositories: For very large sets of files, consider batch insertion (add_objects_to_pack) and periodic calls to pack_all_loose for best performance.
- Streaming Approach: When handling files that exceed available memory, always use the streaming methods (add_streamed_object, get_object_stream).
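One common way to decide whether compression is worthwhile is to compress the data and keep the result only if it is meaningfully smaller. The sketch below shows that idea with Python's standard zlib module. It is an illustration, not RSDOS's actual heuristic (which, per the tip above, looks at size and content type): the function name `maybe_compress` and the `min_size` and `max_ratio` thresholds are assumptions.

```python
import zlib


def maybe_compress(data: bytes, min_size: int = 64, max_ratio: float = 0.9) -> tuple[bytes, bool]:
    """Return (payload, compressed_flag), compressing only when it pays off.

    min_size and max_ratio are illustrative thresholds, not RSDOS defaults.
    """
    if len(data) < min_size:
        return data, False  # too small for compression to be worthwhile
    packed = zlib.compress(data, 6)
    if len(packed) <= len(data) * max_ratio:
        return packed, True  # compression saved at least (1 - max_ratio) of the size
    return data, False  # poorly compressible input is stored as-is
```

Repetitive text comes back compressed, while tiny or high-entropy inputs are returned unchanged together with a `False` flag, so the caller knows how to read them back.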
Batch Insertion
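As the name add_objects_to_pack suggests, batch insertion places many objects into a pack rather than creating one loose file per object. The sketch below illustrates that general idea in plain Python and does not use the RSDOS API: `append_to_pack`, `read_from_pack`, and the in-memory dict index are hypothetical stand-ins (a real store would presumably persist its index).

```python
import hashlib
import io
from typing import BinaryIO, Iterable


def append_to_pack(pack: BinaryIO, index: dict, contents: Iterable[bytes]) -> list[str]:
    """Append each blob to the pack and record (offset, length) under its SHA-256 key."""
    hashkeys = []
    for content in contents:
        hashkey = hashlib.sha256(content).hexdigest()
        if hashkey not in index:  # identical content is written only once
            offset = pack.seek(0, io.SEEK_END)
            pack.write(content)
            index[hashkey] = (offset, len(content))
        hashkeys.append(hashkey)
    return hashkeys


def read_from_pack(pack: BinaryIO, index: dict, hashkey: str) -> bytes:
    """Look up the object's location in the index and read exactly its bytes."""
    offset, length = index[hashkey]
    pack.seek(offset)
    return pack.read(length)
```

Writing to one large file keeps the number of filesystem entries small, which is the main motivation for packing millions of small objects.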
Streaming to and from Files
# Write from a file
# Read back into a file-like object
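The reason streaming keeps memory use flat can be shown with a short, self-contained sketch: data moves in fixed-size chunks, and the hash is updated incrementally, so the full file is never held in RAM. This is not the RSDOS implementation; `stream_copy` and `CHUNK_SIZE` are hypothetical names chosen for the example.

```python
import hashlib
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # illustrative chunk size, not an RSDOS setting


def stream_copy(src: BinaryIO, dst: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Copy src to dst chunk by chunk and return the SHA-256 hex digest of the data."""
    hasher = hashlib.sha256()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        hasher.update(chunk)
        dst.write(chunk)
    return hasher.hexdigest()
```

Any pair of binary file-like objects works here, for example two files opened with `open(path, "rb")` and `open(path, "wb")`; peak memory stays around one chunk regardless of file size.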
Disclaimer
- RSDOS is heavily inspired by aiidateam/disk-objectstore; this reimplementation aims to explore alternative designs and performance optimizations.
Progress
- Init command
- Status command (tested on large disk-objectstore)
- Add files (insert objects to loose storage)
- Stream-based reading (has_objects, get_object_hash, list_all_objects, etc.)
- Container struct
- PyO3 bindings
- Benchmarking (loose read/write, packed read/write)
- Pack (write)
- Repack (planned after initial design)
- Compression (zlib & zstd)
- Heuristics for compression
- Repack (finalize vacuuming logic)
- Migration (tools & Python wrapper for AiiDA)
- Memory footprint tracking
- Progress bar for long-running operations
- Documentation (library docs & examples)
- Validate, Optimize, Backup
- Thread safety (pack write synchronization)
- Use sled as a K/V DB (v2)
- Implement io_uring (v2)
- Compression at loose stage (v2)
- Refactor legacy packs → packed (v2)
- OpenDAL integration (v3)
- Generic container interfaces (v3)