DataCortex
A standalone JSON/NDJSON compressor that beats zstd -19 and brotli -11 on every file in the benchmarks below.
Site | crates.io | PyPI | Docs
DataCortex infers your JSON schema automatically, applies columnar reorganization and type-specific encoding, then picks whichever entropy coder (zstd or brotli) gives the smaller output. No schema files, no database, no configuration: just run `datacortex compress data.json`.
Benchmarks
Fast mode vs the best general-purpose compressors:
| File | Size | DataCortex | zstd -19 | brotli -11 | vs best |
|---|---|---|---|---|---|
| NDJSON (analytics) | 107 KB | 22.0x | 15.6x | 16.6x | +32% |
| NDJSON (10K rows) | 3.3 MB | 27.8x | 16.0x | 16.4x | +70% |
| JSON API response | 36 KB | 16.0x | 13.2x | 15.0x | +7% |
| Twitter API (nested) | 617 KB | 19.7x | 16.7x | 18.9x | +4% |
| Event tickets (repetitive) | 1.7 MB | 221.7x | 176.0x | 190.0x | +17% |
On larger structured logs:
| Data | Size | DataCortex | zstd -19 | Advantage |
|---|---|---|---|---|
| k8s structured logs (100K rows) | 9.9 MB | ~40x | 18.9x | +113% |
| nginx access logs (100K rows) | 9.5 MB | ~28x | 17.3x | +62% |
Higher is better. DataCortex wins on every file. Lossless, byte-exact decompression guaranteed.
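The "vs best" column can be reproduced approximately from the ratios in the first table. The sketch below assumes "vs best" means the percentage by which the DataCortex ratio exceeds the better of zstd -19 and brotli -11; the helper name `advantage_pct` is made up for this illustration. Because the printed ratios are already rounded, a recomputed value can differ from the table by one percentage point.

```python
def advantage_pct(ours: float, *others: float) -> int:
    """Percent by which `ours` exceeds the best competing ratio, rounded."""
    best = max(others)
    return round((ours / best - 1) * 100)

# Ratios copied from the first benchmark table: (DataCortex, zstd -19, brotli -11).
rows = {
    "NDJSON (analytics)": (22.0, 15.6, 16.6),
    "NDJSON (10K rows)": (27.8, 16.0, 16.4),
    "JSON API response": (16.0, 13.2, 15.0),
    "Twitter API (nested)": (19.7, 16.7, 18.9),
    "Event tickets (repetitive)": (221.7, 176.0, 190.0),
}

for name, (dcx, zstd, brotli) in rows.items():
    print(f"{name}: +{advantage_pct(dcx, zstd, brotli)}%")
```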
Performance
Throughput on an Apple M-series chip (Fast mode, single run, release build):
| File | Size | Ratio | Encode | Decode |
|---|---|---|---|---|
| NDJSON (10K rows) | 3.3 MB | 27.6x | 4.1 MB/s | 176 MB/s |
| GH Archive (diverse) | 10.0 MB | 7.8x | 3.2 MB/s | 574 MB/s |
| Twitter API | 617 KB | 19.7x | 2.3 MB/s | 384 MB/s |
| Event tickets | 1.7 MB | 221.6x | 8.6 MB/s | 1124 MB/s |
Decode is near-instant (176-1124 MB/s). Encode is far slower (2.3-8.6 MB/s in the table above), trading speed for compression ratios up to roughly 2x those of zstd -19 on structured logs. DataCortex is therefore best suited as a batch compressor for log storage, not as a real-time codec in throughput-critical pipelines.
Run `datacortex bench corpus/ -m fast --compare` to measure on your hardware.
Installation
Rust:
Python:
From source:
Requires Rust 1.85+.
Usage
# Compress (auto-detects format, picks best compression)
# Decompress
# Streaming (pipe-friendly)
|
# Chunked compression (for large NDJSON)
# Custom dictionary (for known schemas)
# Benchmark against zstd
# Higher compression (slower)
# Inspect a .dcx file
Compression modes
| Mode | Engine | Best for |
|---|---|---|
| fast (default) | Columnar + typed encoding + zstd/brotli | JSON/NDJSON (best ratio at high speed) |
| balanced | Context mixing (CM) engine | General text, small files |
| max | CM with larger context maps | Maximum compression |
Fast mode is recommended for JSON/NDJSON. It runs the full preprocessing pipeline (schema inference, columnar reorg, typed encoding) then picks the best entropy coder automatically.
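The "pick the best entropy coder" step amounts to compressing the same bytes with each candidate and keeping the shortest result. The sketch below illustrates the idea only: it substitutes Python's standard-library codecs (zlib, bz2, lzma) for zstd and brotli, and `pick_smallest` is a hypothetical helper, not part of the DataCortex API.

```python
import bz2
import lzma
import zlib

# Stand-in codecs from the standard library.
# DataCortex itself chooses between zstd and brotli.
CODECS = {
    "zlib": lambda b: zlib.compress(b, 9),
    "bz2": lambda b: bz2.compress(b, 9),
    "lzma": lambda b: lzma.compress(b),
}

def pick_smallest(data: bytes) -> tuple[str, bytes]:
    """Compress with every candidate codec and keep the shortest output."""
    results = {name: fn(data) for name, fn in CODECS.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]

sample = b'{"user": "alice", "status": "ok", "latency_ms": 12}\n' * 1000
name, blob = pick_smallest(sample)
print(name, len(sample), "->", len(blob))
```

A real container also has to record which codec won so the decoder knows how to reverse it; that bookkeeping is omitted here.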
Balanced/Max modes use a bit-level context mixing engine with 13 specialized models. Better for general text but slower.
Python
=
=
# File-based
# Format detection
= # "ndjson", "json", "generic"
How it works
- Format detection - auto-identifies JSON, NDJSON, or generic data
- Schema inference - discovers column types (integer, boolean, timestamp, enum, string, etc.)
- Columnar reorg - transposes row-oriented NDJSON into column-oriented layout
- Type-specific encoding - delta+varint for integers, bitmaps for booleans, epoch deltas for timestamps
- Auto-fallback - tries 6+ compression paths (zstd, brotli, with/without preprocessing) and picks the smallest
No schema files. No configuration. Fully automatic.
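To make the columnar and type-specific steps concrete, here is a minimal Python sketch of the general techniques named above: transposing NDJSON rows into columns, delta+varint encoding for integers, and bitmap packing for booleans. It is not DataCortex's implementation. All function names are invented for the example, the zigzag mapping for signed deltas is an added assumption, and the sketch assumes every row has the same keys.

```python
import json

def to_columns(ndjson: str) -> dict[str, list]:
    """Transpose rows (one JSON object per line) into one list per key."""
    rows = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
    keys = list(rows[0])  # assumes every row has the same keys
    return {k: [row[k] for row in rows] for k in keys}

def zigzag(n: int) -> int:
    """Map signed ints to unsigned so small negative deltas stay small."""
    return 2 * n if n >= 0 else -2 * n - 1

def varint(n: int) -> bytes:
    """LEB128-style encoding: 7 bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def delta_varint(values: list[int]) -> bytes:
    """Encode each value as its difference from the previous one (first from 0)."""
    out = bytearray()
    prev = 0
    for v in values:
        out += varint(zigzag(v - prev))
        prev = v
    return bytes(out)

def bool_bitmap(values: list[bool]) -> bytes:
    """Pack booleans eight to a byte, least significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

ndjson = "\n".join(
    json.dumps({"ts": 1700000000 + i, "ok": i % 3 != 0, "status": 200})
    for i in range(8)
)
cols = to_columns(ndjson)
print(len(ndjson), "bytes of NDJSON")
print(len(delta_varint(cols["ts"])), "bytes for the ts column")
print(len(bool_bitmap(cols["ok"])), "byte(s) for the ok column")
```

Grouping values by column puts similar data next to each other, so monotonically increasing timestamps collapse into a run of tiny deltas that the entropy coder then compresses well.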
Development
License
MIT