# rust-readmdict
A Rust implementation for reading MDict dictionary files (`.mdx` dictionaries and `.mdd` resource archives).
It is a port of the Python library at https://github.com/ffreemt/readmdict.
## Usage
### Basic Usage
To open an MDX file and display basic information:
```bash
cargo run example_resources/webster.mdx
```
Output:
```
Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353
```
### List Keys
To list the first 10 keys from the dictionary:
```bash
cargo run example_resources/webster.mdx --list-keys
```
Output:
```
Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353
Keys:
1: 12 a.m
2: 12 midnight
3: 12 p.m.
4: 20/20
5: 20/20 hindsight
6: 20 hindsight
7: .22
8: .22s
9: 24-7
10: 24/7
... and 109343 more
```
### List Keys Since a Word
To list the keys that sort at or after a given word (the first 10 matches are printed):
```bash
cargo run example_resources/webster.mdx --list-keys-since "apple"
```
Output:
```
Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353
Keys since 'apple':
1: apple
2: apple cheeked
3: apple of someone's eye
4: apple pie
5: apple pies
6: apple polisher
7: apple polishers
8: apple-cheeked
9: apples
10: applesauce
... and 99401 more
```
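
A minimal sketch of how such a listing could be produced, assuming keys are compared as raw bytes and the matches are sorted before printing. That assumption is consistent with the sample output above but is not confirmed by the source, and `keys_since` is a hypothetical helper name rather than part of the crate's API.

```rust
/// Hypothetical helper: collect every key that sorts at or after `word`
/// in plain byte order, and return the matches in ascending order.
fn keys_since<'a>(keys: impl Iterator<Item = &'a [u8]>, word: &[u8]) -> Vec<&'a [u8]> {
    // Keep only keys that compare greater than or equal to the start word.
    let mut matches: Vec<&'a [u8]> = keys.filter(|key| *key >= word).collect();
    // Dictionary files are not guaranteed to store keys in byte order,
    // so sort the matches explicitly.
    matches.sort_unstable();
    matches
}
```

Fed with the iterator returned by `keys()`, the first 10 elements of the result would be the numbered lines of a listing like the one above.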
### Look up a word and show its content
```bash
# Look up the definition of "apple"
cargo run example_resources/webster.mdx --lookup apple
# Look up a resource file in MDD
cargo run resources.mdd --lookup "image.png"
```
Example output:
```
Successfully opened MDX file: example_resources/webster.mdx
Number of entries: 109353
Looking up 'apple':
Definition:
<div class="entry">...[HTML content with definition]...</div>
```
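
The lookup can be pictured as a linear scan over the `(key, value)` records that `items()` yields. The sketch below is generic over the error type so it stands alone; `lookup` is a hypothetical helper name, and a real implementation might instead use the key offsets to jump straight to the record.

```rust
/// Hypothetical helper: return the value of the first record whose key
/// equals `word`, or `None` when no record matches.
fn lookup<I, E>(items: I, word: &[u8]) -> Result<Option<Vec<u8>>, E>
where
    I: Iterator<Item = Result<(Vec<u8>, Vec<u8>), E>>,
{
    for item in items {
        // Propagate decoding errors instead of silently skipping records.
        let (key, value) = item?;
        if key == word {
            return Ok(Some(value));
        }
    }
    Ok(None)
}
```

The same function would serve MDX definitions and MDD resources, since both iterators produce byte pairs.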
## Features
- Read MDX dictionary files
- Extract header information and metadata
- Parse and list dictionary keys
- List keys alphabetically from a specific starting word
- Look up words and display their content from MDX files
- Look up resources and display their content from MDD files
- Support for compressed key blocks (zlib)
- Handle different MDX versions (1.x and 2.x)
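
As a rough illustration of what handling a compressed block involves, the sketch below splits a block into its 8-byte prefix and payload and computes the Adler-32 checksum used to verify the decompressed data. The layout (4-byte compression type, then a big-endian Adler-32) follows the Python reader as understood here and should be treated as an assumption. `Compression` and `split_block` are hypothetical names, and the checksum is hand-written only to keep the example free of external crates.

```rust
/// Compression scheme named by the first four bytes of a block.
#[derive(Debug, PartialEq)]
enum Compression {
    None,
    Lzo,
    Zlib,
}

/// Split a block into (compression, expected Adler-32, payload).
/// Returns `None` when the block is too short or the type is unknown.
fn split_block(block: &[u8]) -> Option<(Compression, u32, &[u8])> {
    if block.len() < 8 {
        return None;
    }
    let compression = match &block[..4] {
        [0, 0, 0, 0] => Compression::None,
        [1, 0, 0, 0] => Compression::Lzo,
        [2, 0, 0, 0] => Compression::Zlib,
        _ => return None,
    };
    // The checksum is stored big-endian right after the type field.
    let checksum = u32::from_be_bytes([block[4], block[5], block[6], block[7]]);
    Some((compression, checksum, &block[8..]))
}

/// Plain Adler-32, as defined for zlib streams.
fn adler32(data: &[u8]) -> u32 {
    const MODULUS: u32 = 65_521;
    let (mut a, mut b) = (1u32, 0u32);
    for &byte in data {
        a = (a + u32::from(byte)) % MODULUS;
        b = (b + a) % MODULUS;
    }
    (b << 16) | a
}
```

For a zlib block, the payload would first be inflated (for example with `flate2`) and the checksum compared against the inflated bytes.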
## Building
```bash
cargo build --release
```
## Implementation Details
This is a Rust port of the Python readmdict library. The implementation follows a simplified file structure that closely mirrors the original Python codebase.
#### File Structure Mapping
| Python file | Rust counterpart | Purpose |
|---|---|---|
| `readmdict/__main__.py` | `src/main.rs` | CLI entry point and argument parsing |
| `readmdict/readmdict.py` | `src/readmdict.rs` | Core library with all classes (MDict, MDX, MDD) |
| `readmdict/pureSalsa20.py` | Use `salsa20` crate | Salsa20 encryption (external crate) |
| `readmdict/ripemd128.py` | Use `ripemd` crate | RIPEMD-128 hashing (external crate) |
| N/A | `src/lib.rs` | Library entry point (re-exports from `readmdict.rs`) |
##### Core Classes
| Python class | Rust type | Location |
|---|---|---|
| `MDict` (base class) | `struct MDict` | `src/readmdict.rs` |
| `MDX` (inherits MDict) | `struct Mdx` (wraps `MDict`) | `src/readmdict.rs` |
| `MDD` (inherits MDict) | `struct Mdd` (wraps `MDict`) | `src/readmdict.rs` |
##### Method-to-Method Mapping
**Utility Functions:**

| Python function | Rust function | Location |
|---|---|---|
| `_unescape_entities(text)` | `unescape_entities(text: &[u8]) -> Vec<u8>` | `src/readmdict.rs` |
| `_fast_decrypt(data, key)` | `fast_decrypt(data: &[u8], key: &[u8]) -> Vec<u8>` | `src/readmdict.rs` |
| `_mdx_decrypt(comp_block)` | `mdx_decrypt(comp_block: &[u8]) -> Result<Vec<u8>>` | `src/readmdict.rs` |
| `_salsa_decrypt(ciphertext, key)` | `salsa_decrypt(ciphertext: &[u8], key: &[u8]) -> Result<Vec<u8>>` | `src/readmdict.rs` |
| `_decrypt_regcode_by_deviceid(regcode, deviceid)` | `decrypt_regcode_by_deviceid(regcode: &[u8], deviceid: &[u8]) -> Result<Vec<u8>>` | `src/readmdict.rs` |
| `_decrypt_regcode_by_email(regcode, email)` | `decrypt_regcode_by_email(regcode: &[u8], email: &[u8]) -> Result<Vec<u8>>` | `src/readmdict.rs` |
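
A sketch of how `fast_decrypt` could look, based on the Python `_fast_decrypt` as understood here: each byte has its nibbles swapped and is then XORed with the previous ciphertext byte (seeded with `0x36`), the low byte of its index, and the cycling key. The constants are assumptions to be checked against the Python source, and the early return for an empty key is an addition made only to avoid a division by zero.

```rust
/// Sketch of the lightweight block decryption used for key block info.
fn fast_decrypt(data: &[u8], key: &[u8]) -> Vec<u8> {
    // Assumption: with no key there is nothing to undo.
    if key.is_empty() {
        return data.to_vec();
    }
    let mut previous: u8 = 0x36;
    data.iter()
        .enumerate()
        .map(|(i, &byte)| {
            // Swap the high and low nibbles of the ciphertext byte.
            let swapped = byte.rotate_left(4);
            // Mix in the previous ciphertext byte, the index and the key.
            let plain = swapped ^ previous ^ (i as u8) ^ key[i % key.len()];
            previous = byte;
            plain
        })
        .collect()
}
```

In the Python code the key passed in is a RIPEMD-128 digest derived from the block itself, which is what `mdx_decrypt` would compute before calling this function.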
**MDict Class Methods:**

| Python method | Rust method | Description |
|---|---|---|
| `__init__(fname, encoding, passcode)` | `new(fname: &str, encoding: Option<String>, passcode: Option<Passcode>) -> Result<Self>` | Constructor |
| `__len__()` | `len(&self) -> usize` | Get number of entries |
| `__iter__()` | `keys(&self) -> impl Iterator<Item = &[u8]>` | Iterator over keys |
| `keys()` | `keys(&self) -> impl Iterator<Item = &[u8]>` | Get dictionary keys |
| `_read_number(f)` | `read_number<R: Read>(&self, reader: &mut R) -> Result<u64>` | Read number from file |
| `_parse_header(header)` | `parse_header(header: &[u8]) -> Result<HashMap<String, String>>` | Parse header attributes |
| `_decode_key_block_info(data)` | `decode_key_block_info(&self, data: &[u8]) -> Result<Vec<(u64, u64)>>` | Decode key block info |
| `_decode_key_block(data, info)` | `decode_key_block(&self, data: &[u8], info: &[(u64, u64)]) -> Result<Vec<(u64, Vec<u8>)>>` | Decode key block |
| `_split_key_block(data)` | `split_key_block(&self, data: &[u8]) -> Result<Vec<(u64, Vec<u8>)>>` | Split key block into entries |
| `_read_header()` | `read_header(&mut self) -> Result<HashMap<String, String>>` | Read and parse file header |
| `_read_keys()` | `read_keys(&mut self) -> Result<Vec<(u64, Vec<u8>)>>` | Read key blocks |
| `_read_keys_brutal()` | `read_keys_brutal(&mut self) -> Result<Vec<(u64, Vec<u8>)>>` | Fallback key reading method |
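
To make the version dependence of `read_number` concrete, here is a minimal sketch using only the standard library. It assumes, following the Python reader, that numbers are big-endian and occupy 4 bytes before format version 2.0 and 8 bytes from 2.0 on. The version is passed as a parameter here instead of being read from `self`, purely to keep the example self-contained.

```rust
use std::io::{self, Read};

/// Read one big-endian number whose width depends on the format version.
fn read_number<R: Read>(reader: &mut R, version: f32) -> io::Result<u64> {
    if version >= 2.0 {
        // Version 2.x stores 8-byte numbers.
        let mut buf = [0u8; 8];
        reader.read_exact(&mut buf)?;
        Ok(u64::from_be_bytes(buf))
    } else {
        // Version 1.x stores 4-byte numbers; widen to u64 for a uniform API.
        let mut buf = [0u8; 4];
        reader.read_exact(&mut buf)?;
        Ok(u64::from(u32::from_be_bytes(buf)))
    }
}
```

Returning `u64` in both cases lets the rest of the parser ignore the width difference.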
**MDX Class Methods:**

| Python method | Rust method | Description |
|---|---|---|
| `__init__(fname, encoding, substyle, passcode)` | `new(fname: &str, encoding: Option<String>, substyle: bool, passcode: Option<Passcode>) -> Result<Self>` | Constructor |
| `items()` | `items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>>` | Iterator over key-value pairs |
| `_substitute_stylesheet(txt)` | `substitute_stylesheet(&self, txt: &str) -> String` | Apply stylesheet substitution |
| `_decode_record_block()` | `decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>>` | Decode record blocks |
**MDD Class Methods:**

| Python method | Rust method | Description |
|---|---|---|
| `__init__(fname, passcode)` | `new(fname: &str, passcode: Option<Passcode>) -> Result<Self>` | Constructor |
| `items()` | `items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>>` | Iterator over filename-content pairs |
| `_decode_record_block()` | `decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>>` | Decode record blocks |
#### Implementation Checklist
- [ ] 1. Create basic project structure (`src/lib.rs`, `src/main.rs`)
- [ ] 2. Implement core readmdict module (`src/readmdict.rs`) containing:
    - [ ] 2.1. Utility functions (`unescape_entities`, etc.)
    - [ ] 2.2. Crypto functions (`fast_decrypt`, `mdx_decrypt`, `salsa_decrypt`, etc.)
    - [ ] 2.3. Base `MDict` struct with all methods
    - [ ] 2.4. `Mdx` struct wrapping `MDict` (composition, since Rust has no inheritance)
    - [ ] 2.5. `Mdd` struct wrapping `MDict` (composition, since Rust has no inheritance)
- [ ] 3. Implement CLI interface (`src/main.rs`) matching `__main__.py`
- [ ] 4. Update `src/lib.rs` to re-export from `readmdict.rs`
- [ ] 5. Add error handling and comprehensive tests
- [ ] 6. Add documentation and usage examples
- [ ] 7. Performance optimization and benchmarking
#### Detailed Structure Plan
**src/readmdict.rs** (single file containing everything from `readmdict.py`):
```rust
// Imports and dependencies
use std::collections::HashMap;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, BufReader, Cursor};
use std::path::Path;
use byteorder::{BigEndian, LittleEndian, ReadBytesExt};
use flate2::read::ZlibDecoder;
use regex::bytes::Regex;
use encoding_rs::Encoding;
// `StreamCipher` lives in the re-exported `cipher` module in recent salsa20 releases
use salsa20::cipher::StreamCipher;
use salsa20::Salsa20;
use ripemd::{Ripemd128, Digest};
use sha2::Sha256;
use adler::adler32;

// Error types
#[derive(Debug, thiserror::Error)]
pub enum Error {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Invalid file format: {0}")]
    InvalidFormat(String),
    #[error("Unsupported compression type")]
    UnsupportedCompression,
    #[error("Encryption error: {0}")]
    Encryption(String),
    #[error("Invalid passcode")]
    InvalidPasscode,
    #[error("Checksum mismatch")]
    ChecksumMismatch,
    #[error("Encoding error: {0}")]
    Encoding(String),
    #[error("Parse error: {0}")]
    Parse(String),
}

pub type Result<T> = std::result::Result<T, Error>;

// Utility functions (direct ports from Python)
fn unescape_entities(text: &[u8]) -> Vec<u8> {
    // Convert HTML entities like &lt; &gt; &amp; &quot; back to < > & "
    // Implementation matches Python _unescape_entities
}

fn fast_decrypt(data: &[u8], key: &[u8]) -> Vec<u8> {
    // Simple XOR decryption with key cycling
    // Direct port of Python _fast_decrypt
}

fn mdx_decrypt(comp_block: &[u8]) -> Result<Vec<u8>> {
    // MDX-specific decryption algorithm
    // Direct port of Python _mdx_decrypt
}

fn salsa_decrypt(ciphertext: &[u8], key: &[u8]) -> Result<Vec<u8>> {
    // Salsa20 decryption using external crate
    // Direct port of Python _salsa_decrypt
}

fn decrypt_regcode_by_deviceid(regcode: &[u8], deviceid: &[u8]) -> Result<Vec<u8>> {
    // Device ID based decryption
    // Direct port of Python _decrypt_regcode_by_deviceid
}

fn decrypt_regcode_by_email(regcode: &[u8], email: &[u8]) -> Result<Vec<u8>> {
    // Email based decryption
    // Direct port of Python _decrypt_regcode_by_email
}

// Passcode struct
#[derive(Debug, Clone)]
pub struct Passcode {
    pub regcode: Vec<u8>,
    pub userid: String,
}

// Base MDict struct (equivalent to Python MDict class)
#[derive(Debug)]
pub struct MDict {
    fname: String,
    encoding: String,
    passcode: Option<Passcode>,
    header: HashMap<String, String>,
    key_list: Vec<(u64, Vec<u8>)>,
    num_entries: usize,
    version: f32,
    encrypt: u8,
    number_width: usize,
    key_block_offset: u64,
    record_block_offset: u64,
    stylesheet: HashMap<String, (String, String)>,
}

impl MDict {
    // Constructor - direct port of Python MDict.__init__
    pub fn new(fname: &str, encoding: Option<String>, passcode: Option<Passcode>) -> Result<Self> {
        // Initialize struct, read header, read keys
        // Handle encoding detection and passcode validation
    }

    // Length - direct port of Python MDict.__len__
    pub fn len(&self) -> usize { self.num_entries }

    // Keys iterator - direct port of Python MDict.keys
    pub fn keys(&self) -> impl Iterator<Item = &[u8]> {
        self.key_list.iter().map(|(_, key)| key.as_slice())
    }

    // Private methods - direct ports from Python
    fn read_number<R: Read>(&self, reader: &mut R) -> Result<u64> {
        // Read number based on version (4 or 8 bytes)
    }

    fn parse_header(header: &[u8]) -> Result<HashMap<String, String>> {
        // Parse XML-like header attributes
    }

    fn decode_key_block_info(&self, data: &[u8]) -> Result<Vec<(u64, u64)>> {
        // Decode key block compression info
    }

    fn decode_key_block(&self, data: &[u8], info: &[(u64, u64)]) -> Result<Vec<(u64, Vec<u8>)>> {
        // Decompress and decode key blocks
    }

    fn split_key_block(&self, data: &[u8]) -> Result<Vec<(u64, Vec<u8>)>> {
        // Split key block into individual entries
    }

    fn read_header(&mut self) -> Result<HashMap<String, String>> {
        // Read and parse file header
    }

    fn read_keys(&mut self) -> Result<Vec<(u64, Vec<u8>)>> {
        // Read key blocks with encryption support
    }

    fn read_keys_brutal(&mut self) -> Result<Vec<(u64, Vec<u8>)>> {
        // Fallback key reading for problematic files
    }
}

// MDX struct (equivalent to Python MDX class)
#[derive(Debug)]
pub struct Mdx {
    mdict: MDict,
    substyle: bool,
}

impl Mdx {
    // Constructor - direct port of Python MDX.__init__
    pub fn new(fname: &str, encoding: Option<String>, substyle: bool, passcode: Option<Passcode>) -> Result<Self> {
        let mdict = MDict::new(fname, encoding, passcode)?;
        Ok(Self { mdict, substyle })
    }

    // Items iterator - direct port of Python MDX.items
    pub fn items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
        self.decode_record_block()
    }

    // Stylesheet substitution - direct port of Python MDX._substitute_stylesheet
    fn substitute_stylesheet(&self, txt: &str) -> String {
        // Apply stylesheet definitions to text
    }

    // Record block decoder - direct port of Python MDX._decode_record_block
    fn decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
        // Decode and decompress record blocks, apply encoding and stylesheet
    }

    // Delegate methods to MDict
    pub fn len(&self) -> usize { self.mdict.len() }
    pub fn keys(&self) -> impl Iterator<Item = &[u8]> { self.mdict.keys() }
    pub fn header(&self) -> &HashMap<String, String> { &self.mdict.header }
}

// MDD struct (equivalent to Python MDD class)
#[derive(Debug)]
pub struct Mdd {
    mdict: MDict,
}

impl Mdd {
    // Constructor - direct port of Python MDD.__init__
    pub fn new(fname: &str, passcode: Option<Passcode>) -> Result<Self> {
        let mdict = MDict::new(fname, Some("UTF-16".to_string()), passcode)?;
        Ok(Self { mdict })
    }

    // Items iterator - direct port of Python MDD.items
    pub fn items(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
        self.decode_record_block()
    }

    // Record block decoder - direct port of Python MDD._decode_record_block
    fn decode_record_block(&self) -> impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>> {
        // Decode and decompress record blocks for binary data
    }

    // Delegate methods to MDict
    pub fn len(&self) -> usize { self.mdict.len() }
    pub fn keys(&self) -> impl Iterator<Item = &[u8]> { self.mdict.keys() }
    pub fn header(&self) -> &HashMap<String, String> { &self.mdict.header }
}
```
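
For the `parse_header` stub, a dependency-free sketch of the attribute extraction might look like the following. It assumes the header has already been decoded to a string and consists of `name="value"` pairs, as in the Python reader. `parse_header_attributes` is a hypothetical name, and unlike the planned function it neither takes raw bytes nor unescapes entities in the values.

```rust
use std::collections::HashMap;

/// Collect `name="value"` pairs from an already-decoded header string.
fn parse_header_attributes(header: &str) -> HashMap<String, String> {
    let mut attributes = HashMap::new();
    let mut rest = header;
    while let Some(eq) = rest.find("=\"") {
        // The name is the run of word characters directly before `="`.
        let before = &rest[..eq];
        let name_start = before
            .char_indices()
            .rev()
            .find(|&(_, c)| !(c.is_alphanumeric() || c == '_'))
            .map_or(0, |(i, c)| i + c.len_utf8());
        let name = &before[name_start..];
        // The value runs up to the next double quote.
        let value_start = eq + 2;
        let value_len = match rest[value_start..].find('"') {
            Some(len) => len,
            None => break, // unterminated attribute: stop parsing
        };
        if !name.is_empty() {
            let value = &rest[value_start..value_start + value_len];
            attributes.insert(name.to_string(), value.to_string());
        }
        // Continue after the closing quote.
        rest = &rest[value_start + value_len + 1..];
    }
    attributes
}
```

The plan above uses the `regex` crate for this step; the manual scan is shown only so the example runs without dependencies.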
**src/lib.rs** (simple re-export):
```rust
mod readmdict;
pub use readmdict::*;
```
**src/main.rs** (direct port of `__main__.py`):
```rust
use clap::Parser;
use rust_readmdict::*;
use std::fs;
use std::io::Write;
use std::path::Path;

#[derive(Parser)]
#[command(name = "readmdict")]
#[command(about = "A Rust implementation of readmdict for reading MDict dictionary files")]
struct Args {
    #[arg(short = 'x', long, help = "extract mdx to source format and extract files from mdd")]
    extract: bool,
    #[arg(short = 's', long, help = "substitute style definition if present")]
    substyle: bool,
    #[arg(short = 'd', long, default_value = "data", help = "folder to extract data files from mdd")]
    datafolder: String,
    #[arg(short = 'e', long, default_value = "", help = "encoding for the dictionary")]
    encoding: String,
    #[arg(short = 'p', long, help = "passcode in format: register_code,email_or_deviceid")]
    passcode: Option<String>,
    #[arg(help = "mdx file name")]
    filename: Option<String>,
}

fn parse_passcode(s: &str) -> Result<Passcode> {
    // Parse passcode string in format "regcode,userid"
    let parts: Vec<&str> = s.split(',').collect();
    if parts.len() != 2 {
        return Err(Error::InvalidPasscode);
    }
    Ok(Passcode {
        regcode: hex::decode(parts[0]).map_err(|_| Error::InvalidPasscode)?,
        userid: parts[1].to_string(),
    })
}

fn main() -> Result<()> {
    let args = Args::parse();

    // Handle file selection (GUI fallback would require additional crate)
    let filename = match args.filename {
        Some(f) => f,
        None => {
            eprintln!("Please specify a valid MDX/MDD file");
            std::process::exit(1);
        }
    };
    if !Path::new(&filename).exists() {
        eprintln!("Please specify a valid MDX/MDD file");
        std::process::exit(1);
    }

    // Keep the directory in `base` (like Python's os.path.splitext) so the
    // companion .mdd and the extracted files are resolved next to the input.
    let base = Path::new(&filename).with_extension("").to_string_lossy().into_owned();
    let ext = Path::new(&filename).extension().unwrap_or_default().to_str().unwrap();

    // Parse passcode if provided
    let passcode = args.passcode.as_ref()
        .map(|s| parse_passcode(s))
        .transpose()?;

    // Handle MDX files
    let mdx = if ext.to_lowercase() == "mdx" {
        let encoding = if args.encoding.is_empty() { None } else { Some(args.encoding.clone()) };
        let mdx = Mdx::new(&filename, encoding, args.substyle, passcode.clone())?;
        println!("======== {} ========", filename);
        println!(" Number of Entries : {}", mdx.len());
        for (key, value) in mdx.header() {
            println!(" {} : {}", key, value);
        }
        Some(mdx)
    } else {
        None
    };

    // Handle MDD files
    let mdd_filename = format!("{}.mdd", base);
    let mdd = if Path::new(&mdd_filename).exists() {
        let mdd = Mdd::new(&mdd_filename, passcode)?;
        println!("======== {} ========", mdd_filename);
        println!(" Number of Entries : {}", mdd.len());
        for (key, value) in mdd.header() {
            println!(" {} : {}", key, value);
        }
        Some(mdd)
    } else {
        None
    };

    // Extract files if requested
    if args.extract {
        // Extract MDX to text file
        if let Some(mdx) = &mdx {
            let output_filename = format!("{}.txt", base);
            let mut file = fs::File::create(&output_filename)?;
            for item in mdx.items() {
                let (key, value) = item?;
                file.write_all(&key)?;
                file.write_all(b"\r\n")?;
                file.write_all(&value)?;
                if !value.ends_with(b"\n") {
                    file.write_all(b"\r\n")?;
                }
                file.write_all(b"</>\r\n")?;
            }
            // Extract stylesheet if present
            if let Some(stylesheet) = mdx.header().get("StyleSheet") {
                let style_filename = format!("{}_style.txt", base);
                fs::write(&style_filename, stylesheet.replace('\n', "\r\n"))?;
            }
        }

        // Extract MDD data files
        if let Some(mdd) = &mdd {
            let data_folder = Path::new(&filename).parent().unwrap().join(&args.datafolder);
            fs::create_dir_all(&data_folder)?;
            for item in mdd.items() {
                let (key, value) = item?;
                // Resource keys may begin with a path separator; strip it so that
                // `join` does not discard `data_folder` for an absolute path.
                let filename = String::from_utf8_lossy(&key).replace('\\', "/");
                let file_path = data_folder.join(filename.trim_start_matches('/'));
                if let Some(parent) = file_path.parent() {
                    fs::create_dir_all(parent)?;
                }
                fs::write(&file_path, &value)?;
            }
        }
    }
    Ok(())
}
```
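
When extracting MDD resources, the key doubles as a file path, so it is worth normalising before it touches the file system. The sketch below is one possible approach, not the Python tool's behaviour: it assumes keys use backslashes and may start with one, and it rejects keys containing `..` so that nothing is written outside the data folder. `mdd_key_to_path` is a hypothetical helper name.

```rust
use std::path::{Component, Path, PathBuf};

/// Map an MDD resource key to a path inside `data_folder`.
/// Returns `None` for keys that would escape the folder.
fn mdd_key_to_path(data_folder: &Path, key: &str) -> Option<PathBuf> {
    // Normalise separators and drop leading ones so the key stays relative.
    let normalised = key.replace('\\', "/");
    let relative = normalised.trim_start_matches('/');
    let mut path = data_folder.to_path_buf();
    for component in Path::new(relative).components() {
        match component {
            Component::Normal(part) => path.push(part),
            Component::CurDir => {}
            // `..`, a root, or a drive prefix could leave the data folder.
            _ => return None,
        }
    }
    Some(path)
}
```

Stripping the leading separator matters because `Path::join` replaces the base path entirely when its argument is absolute.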
#### Implementation Considerations
**Key Differences from Python:**
1. **Error Handling**: Rust uses `Result<T, E>` instead of exceptions
2. **Memory Management**: No garbage collection, explicit ownership
3. **String Handling**: Distinction between `String`, `&str`, and `Vec<u8>`
4. **Iterator Patterns**: Rust iterators are lazy and zero-cost
5. **File I/O**: More explicit error handling required
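
To illustrate the first and last points, the sketch below reads the 4-byte big-endian header length that an MDX file is assumed to start with (per the Python reader) and reports failures through a `Result`. The error enum is hand-written here to stay dependency-free; the plan above derives it with `thiserror` instead, and `read_header_length` is a hypothetical function name.

```rust
use std::fmt;
use std::io::{self, Read};

#[derive(Debug)]
enum Error {
    Io(io::Error),
    InvalidFormat(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(err) => write!(f, "IO error: {}", err),
            Error::InvalidFormat(msg) => write!(f, "Invalid file format: {}", msg),
        }
    }
}

impl std::error::Error for Error {}

// Lets `?` convert an `io::Error` into our error type automatically.
impl From<io::Error> for Error {
    fn from(err: io::Error) -> Self {
        Error::Io(err)
    }
}

/// Read the big-endian header length from the start of a stream.
fn read_header_length<R: Read>(reader: &mut R) -> Result<u32, Error> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?; // a short read surfaces as Error::Io
    match u32::from_be_bytes(buf) {
        0 => Err(Error::InvalidFormat("header length is zero".to_string())),
        len => Ok(len),
    }
}
```

Where the Python code would raise an exception or hit an `assert`, the caller here has to handle or propagate each failure explicitly.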
**External Crate Dependencies:**
- `clap`: Command-line argument parsing (replaces `argparse`)
- `flate2`: Zlib compression (replaces `zlib`)
- `salsa20`: Salsa20 encryption (replaces `pureSalsa20.py`)
- `ripemd`: RIPEMD128 hashing (replaces `ripemd128.py`)
- `encoding_rs`: Text encoding support
- `regex`: Regular expressions for parsing
- `byteorder`: Binary data reading
- `thiserror`: Error type derivation
- `hex`: Hexadecimal encoding/decoding
- `adler`: Adler32 checksums
**Performance Optimizations:**
1. **Zero-copy where possible**: Use `&[u8]` slices instead of `Vec<u8>` when data doesn't need to be owned
2. **Streaming iterators**: Process records on-demand instead of loading everything into memory
3. **Efficient string handling**: Use `Cow<str>` for strings that might not need allocation
4. **Memory mapping**: Consider using `memmap2` for large files
5. **Parallel processing**: Use `rayon` for CPU-intensive operations like decompression
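
As an example of the third point, entity unescaping can return `Cow<str>` so that text without any entities is passed through without allocating. This is a sketch on `&str` rather than the byte-based signature planned above, and it assumes the same four entities as the Python `_unescape_entities`.

```rust
use std::borrow::Cow;

/// Replace the four basic HTML entities, allocating only when needed.
fn unescape_entities(text: &str) -> Cow<'_, str> {
    // Fast path: no ampersand means no entity, so borrow the input as-is.
    if !text.contains('&') {
        return Cow::Borrowed(text);
    }
    // `&amp;` goes last so that `&amp;lt;` becomes `&lt;`, not `<`.
    Cow::Owned(
        text.replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&amp;", "&"),
    )
}
```

Callers can treat the result as a `&str` in both cases and only pay for a `String` when a replacement actually happened.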
**Testing Strategy:**
1. **Unit tests**: Test each utility function and method individually
2. **Integration tests**: Test with real MDX/MDD files
3. **Property-based tests**: Use `proptest` for edge cases
4. **Benchmark tests**: Compare performance with Python implementation
5. **Compatibility tests**: Ensure output matches Python version exactly