base-d 3.0.34

Universal base encoder: Encode binary data to 33+ dictionaries including RFC standards, hieroglyphs, emoji, and more
Documentation
# Base1024 Dictionary Implementation

## Overview
Added a 1024-character dictionary to demonstrate and test large dictionary support, particularly for testing the HashMap fallback path (non-ASCII characters that don't use the fast lookup table).

## What is Base1024?

Base1024 is an encoding system that uses 1024 unique characters from Unicode's CJK (Chinese, Japanese, Korean) blocks. Each character represents 10 bits of information (2^10 = 1024).

### Character Sources
- **256 chars**: CJK Unified Ideographs (U+4E00-U+4EFF)
- **256 chars**: CJK Unified Ideographs Extension A (U+3400-U+34FF)
- **256 chars**: Hangul Syllables (U+AC00-U+ACFF)
- **256 chars**: Yi Syllables (U+A000-U+A0FF)

## Performance Characteristics

### Encoding Efficiency
- **Information density**: 10 bits per character
- **Compared to base64**: 1.7x more compact (10 bits vs 6 bits per char)
- **Example**: 61 bytes → 49 characters (vs 84 for base64)

### Speed (64-byte blocks)
- **Encoding**: ~7 MiB/s
- **Decoding**: ~21 MiB/s
- Slower than base64 due to:
  - BigUint arithmetic in mathematical mode
  - HashMap lookups (no ASCII lookup table)
  - Larger character set requires more computation

### When to Use
- **Compact representation** when character count matters more than encoding speed
- **Testing** large dictionary support and HashMap fallback paths
- **Educational** purposes to demonstrate encoding efficiency tradeoffs

## Implementation Details

### Generator Script
Created `examples/generate_base1024.rs` to programmatically generate the 1024-character dictionary from Unicode blocks.

```rust
// Generates dictionary from CJK ideographs and syllables
cargo run --example generate_base1024
```

### Integration
Added to `dictionaries.toml`:
```toml
[dictionaries.base1024]
chars = "一丁丂七丄丅丆万丈三上下丌不与丏..."  # 1024 chars
mode = "base_conversion"
```

### Tests Added
- `test_base1024_large_dictionary` - Basic encode/decode functionality
- `test_base1024_uses_hashmap` - Verifies HashMap fallback (no lookup table)
- `test_base1024_efficiency` - Confirms more compact than base64

### Benchmarks
Added benchmarks to `benches/encoding.rs`:
- `bench_encode_base1024` - Encoding performance across sizes
- `bench_decode_base1024` - Decoding performance across sizes

### Example
Created `examples/base1024_demo.rs` showing:
- Dictionary properties (size, mode)
- Encoding/decoding demonstration
- Comparison with base64
- Information density explanation

## Usage

```bash
# Encode with base1024
echo "Hello, World!" | base-d -e base1024

# Decode
echo "丒乥㒱곆ꃈ乢㒅ꀶ㑌㐰仈ꁂ乃㐊" | base-d -d base1024

# List dictionaries (includes base1024)
base-d --list
```

## Technical Notes

### No Lookup Table
Base1024 uses **HashMap for all lookups** because:
- All characters are > U+255 (non-ASCII)
- Lookup table only built for ASCII dictionaries (< 256)
- This is by design to test the HashMap fallback path

### Mathematical Mode Only
- Uses `BaseConversion` encoding mode
- Power-of-two dictionaries (like 1024) could theoretically use chunked mode
- Currently implemented as mathematical for simplicity
- Future: Could add chunked mode support for base1024 (10 bits per char)

### Character Display
Some terminals/fonts may not display all CJK characters correctly:
- Requires Unicode-capable terminal
- Font must include CJK character ranges
- Characters appear as boxes/placeholders if missing from font

## Comparison with Other Dictionaries

| Dictionary | Size | Bits/Char | Efficiency | Speed | Use Case |
|----------|------|-----------|------------|-------|----------|
| base64 | 64 | 6 | 1.0x | 370 MiB/s | Standard, fast |
| base1024 | 1024 | 10 | 1.7x | 7 MiB/s | Compact, educational |
| base100 | 256 | 8 | 1.3x | Very fast | Emoji, byte-range |

## Future Enhancements

1. **Chunked Mode Implementation**
   - Add base1024 chunked mode support
   - Would improve speed significantly
   - 10-bit chunks align well with binary data

2. **Alternative Character Sets**
   - Create base1024 variants with different Unicode blocks
   - Egyptian hieroglyphics, mathematical symbols, etc.
   - Each offers different visual characteristics

3. **Base2048/Base4096**
   - Extend to even larger dictionaries
   - Base2048 = 11 bits per char (2^11)
   - Base4096 = 12 bits per char (2^12)

## Files Added/Modified

### New Files
- `examples/generate_base1024.rs` - Generator script
- `examples/base1024_demo.rs` - Usage demonstration
- `docs/BASE1024.md` - This documentation

### Modified Files
- `dictionaries.toml` - Added base1024 dictionary definition
- `src/tests.rs` - Added 3 base1024 tests
- `benches/encoding.rs` - Added base1024 benchmarks
- `README.md` - Updated dictionary count (33→34)
- `docs/PERFORMANCE.md` - Added base1024 performance notes

## Testing

All tests pass (41 total):
```bash
cargo test test_base1024
# test_base1024_large_dictionary ... ok
# test_base1024_uses_hashmap ... ok
# test_base1024_efficiency ... ok
```

## Conclusion

Base1024 successfully demonstrates:
- ✅ Large dictionary support (>256 characters)
- ✅ HashMap fallback for non-ASCII dictionaries
- ✅ Mathematical encoding mode with 1024-base
- ✅ More compact representation than base64
- ✅ Complete encode/decode cycle
- ✅ CLI integration
- ✅ Benchmark support

While slower than base64, base1024 serves as an excellent test case for large dictionaries and provides educational value in understanding encoding efficiency tradeoffs.