# MRRC Architecture
This document describes the architecture of the MRRC library and key design decisions.
## Overview
MRRC is a Rust library for reading, writing, and manipulating MARC bibliographic records with Python bindings via PyO3. The library is organized into three main components:
1. **Core Rust Library** - Pure Rust MARC record parsing and manipulation
2. **Python Wrapper** - PyO3 bindings providing Python access with GIL release for concurrency
3. **Benchmarking Infrastructure** - Comprehensive performance testing and profiling
## Core Architecture
### Record Types
MRRC supports three MARC record types:
- **Bibliographic Records** - Standard library catalog records (type 'a', 'c', 'm', etc.)
- **Authority Records** - Subject headings and name authority data (type 'z')
- **Holdings Records** - Physical item location and enumeration data (types 'x', 'y', 'v', 'u')
All record types share common infrastructure through the `MarcRecord` trait.
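As an illustration of how the three families can be told apart, the sketch below classifies a record by the "type of record" code at leader position 6. The function name and the fallback to `"bibliographic"` for every other code are assumptions made for this example, not part of the MRRC API.

```python
# Leader position 6 ("type of record") codes, taken from the lists above.
AUTHORITY_TYPES = {"z"}
HOLDINGS_TYPES = {"x", "y", "v", "u"}


def classify_record(leader: str) -> str:
    """Return 'authority', 'holdings', or 'bibliographic' for a 24-character leader."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters")
    record_type = leader[6]
    if record_type in AUTHORITY_TYPES:
        return "authority"
    if record_type in HOLDINGS_TYPES:
        return "holdings"
    # Assumption: any other code ('a', 'c', 'm', ...) is treated as bibliographic.
    return "bibliographic"
```

A stricter version would enumerate the bibliographic codes and reject unknown values instead of falling through.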
### Data Structure
A MARC record consists of:
1. **Leader** - 24-byte header with metadata
2. **Control Fields** (001-009) - Fields holding raw data with no indicators or subfields
3. **Data Fields** (010+) - Variable-length fields with indicators and subfields
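The three parts can be modelled with a handful of plain types. The following is a minimal Python sketch of that shape; the class and attribute names are hypothetical and chosen for illustration only, so they should not be read as the library's actual types.

```python
from dataclasses import dataclass, field


@dataclass
class ControlField:
    tag: str   # a tag below "010"
    data: str  # raw content; no indicators, no subfields


@dataclass
class Subfield:
    code: str   # single-character subfield code, e.g. "a"
    value: str


@dataclass
class DataField:
    tag: str  # "010" and above
    indicator1: str = " "
    indicator2: str = " "
    subfields: list[Subfield] = field(default_factory=list)


@dataclass
class Record:
    leader: str  # always 24 characters
    control_fields: list[ControlField] = field(default_factory=list)
    data_fields: list[DataField] = field(default_factory=list)


def is_control_tag(tag: str) -> bool:
    """Control fields are the three-digit tags that sort below "010"."""
    return len(tag) == 3 and tag.isdigit() and tag < "010"
```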
### Parser Architecture
The core parser uses a state machine approach:
```
ISO 2709 Binary Format
↓
Record Boundary Scanner (finds 0x1D terminators)
↓
Leader Parser (24 bytes)
↓
Directory Parser (field offsets and lengths)
↓
Field Parser (control fields vs data fields)
↓
Subfield Parser (for data fields with indicators)
↓
Character Decoder (MARC-8 or UTF-8)
↓
Record Object
```
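To make the stages concrete, here is a compact Python sketch of the same pipeline for UTF-8 records. It is not the Rust implementation: the function names and the dictionary/tuple output are assumptions for this example, and MARC-8 decoding and error handling are left out.

```python
RECORD_TERMINATOR = b"\x1d"
FIELD_TERMINATOR = b"\x1e"
SUBFIELD_DELIMITER = b"\x1f"


def split_records(data: bytes):
    """Record boundary scanner: yield each record including its 0x1D terminator."""
    start = 0
    while (end := data.find(RECORD_TERMINATOR, start)) != -1:
        yield data[start:end + 1]
        start = end + 1


def parse_record(raw: bytes) -> dict:
    # Leader parser: the first 24 bytes; positions 12-16 hold the base address of data.
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])

    # Directory parser: 12-byte entries (tag 3, length 4, offset 5),
    # followed by one field terminator just before the base address.
    directory = raw[24:base - 1]
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii")
        length = int(entry[3:7])
        offset = int(entry[7:12])
        # Field parser: slice the field body and drop its trailing terminator.
        body = raw[base + offset:base + offset + length - 1]
        if tag < "010":
            fields.append((tag, body.decode("utf-8")))
        else:
            # Subfield parser: two indicator bytes, then 0x1F-delimited subfields.
            indicators = body[:2].decode("ascii")
            subfields = [
                (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
                for chunk in body[2:].split(SUBFIELD_DELIMITER)[1:]
            ]
            fields.append((tag, indicators, subfields))
    return {"leader": leader, "fields": fields}
```

Calling `parse_record` on each item from `split_records(data)` walks a whole file; a production parser would also validate lengths and offsets before slicing.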
## Python Wrapper Architecture
For a higher-level overview of how the Rust and Python code relate, how
maturin and PyO3 fit in, and what builds what, see
[Project Layout](project-layout.md).
### GIL Release Strategy: Three-Phase Model
The Python wrapper implements a three-phase pattern for GIL management during every `read_record()` call:
```
Phase 1: Read bytes (GIL held)
↓
Phase 2: Parse bytes (GIL RELEASED) ← Concurrent work happens here
↓
Phase 3: Convert to Python object (GIL re-acquired)
```
**Phase 1 (GIL held):**
- Acquire raw record bytes from source
  - Python file object: via Python `read()` method
  - File path: via Rust `std::fs::File` (no GIL overhead)
  - Bytes: already in memory (no I/O)
- Duration: Very short (I/O cached in kernel)
**Phase 2 (GIL released):**
- Parse record bytes to MARC structure (CPU-intensive work)
- Uses `py.detach()` (PyO3 0.26+; named `allow_threads` in earlier releases) to explicitly release the GIL
- Creates Rust `ParseError` without Python objects
- SmallVec buffer handles most records inline
- Duration: ~90% of total parse time
- **Result: Multiple threads can parse concurrently**
**Phase 3 (GIL re-acquired):**
- Convert Rust `ParseError` to Python exception (if needed)
- Convert Rust `Record` to Python `PyRecord`
- Return to caller
- Duration: Negligible (quick object construction)
### Why GIL Release Matters
The Python GIL (Global Interpreter Lock) serializes all Python bytecode execution. Without GIL release during parsing:
**Without GIL Release (current state of pure pymarc)**:
```
Thread 1: Read bytes (GIL) → Parse (GIL) → Convert (GIL)
Thread 2: Waiting... → Waiting... → Waiting...
Result: Threading provides no speedup (1.0x)
```
**With GIL Release (pymrrc)**:
```
Thread 1: Read (GIL) → Parse (GIL RELEASED) → Convert
Thread 2: Read (GIL) → Parse (GIL RELEASED) → Convert
Result: Threads parse in parallel (3.74x on 4 cores)
```
The key insight: parsing is CPU-intensive but doesn't need Python objects, so releasing the GIL enables true parallelism.
**Single-threaded benefit:** Even without multiple threads, Rust parsing is simply faster (~4x vs pymarc).
**Multi-threaded benefit:** With explicit `ThreadPoolExecutor`, the GIL release enables concurrent parsing across threads (additional 3.74x speedup on 4 cores).
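One way to read the 3.74x figure is as parallel efficiency, and, under the simplifying assumption that Amdahl's law describes the workload, as an implied serialized fraction. The helper names below are made up for this worked example.

```python
def parallel_efficiency(speedup: float, threads: int) -> float:
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / threads


def amdahl_serial_fraction(speedup: float, threads: int) -> float:
    """Solve speedup = 1 / (s + (1 - s) / n) for the serial fraction s."""
    return (threads / speedup - 1) / (threads - 1)


efficiency = parallel_efficiency(3.74, 4)        # 0.935
serial_share = amdahl_serial_fraction(3.74, 4)   # about 0.023
```

Under that model, a 3.74x speedup on 4 threads corresponds to roughly 93.5% efficiency and about 2.3% of the work remaining serialized.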
### ReaderBackend Enum
The unified reader supports multiple input types via a backend enum:
```rust
enum ReaderBackend {
    RustFile(std::fs::File),     // Pure Rust I/O, zero GIL
    Cursor(io::Cursor<Vec<u8>>), // In-memory, zero GIL
    PythonFile(PyObject),        // Python file object, GIL managed
}
```
**Advantages**:
- **Automatic Detection**: Input type determined at construction
- **Optimal Performance**: Each backend uses fastest available method
- **Backward Compatible**: Python file objects still work via GIL management
- **Zero-GIL Paths**: File paths and bytes bypass Python entirely
**Performance Impact**:
- File path: Pure Rust I/O, Phase 1 has minimal GIL hold
- Bytes: Zero I/O, Phase 1 is trivial
- File object: Requires GIL for `.read()`, but Phase 2 still releases it
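The detection step can be pictured as a simple dispatch on the type of the input. The Python sketch below mirrors the three enum variants by name; `select_backend` is a hypothetical helper, and the real check happens on the Rust side at construction time.

```python
import os


def select_backend(source) -> str:
    """Return the name of the backend variant a given input would map to."""
    if isinstance(source, (str, os.PathLike)):
        return "RustFile"      # file path: opened and read entirely in Rust
    if isinstance(source, (bytes, bytearray)):
        return "Cursor"        # in-memory bytes: no I/O at all
    if hasattr(source, "read"):
        return "PythonFile"    # file-like object: read() needs the GIL
    raise TypeError(f"unsupported input type: {type(source).__name__}")
```

Checking for a path or bytes before falling back to `read()` keeps the zero-GIL paths preferred whenever the caller's input allows them.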
### Batched Reader (Optimization)
For Python file objects, batching reduces GIL contention:
```
Without batching (N records):
  FOR i = 1 to N:
    Acquire GIL → Read 1 record → Release GIL → Parse

With batching (N records, batch size = 100):
  FOR batch = 1 to N/100:
    Acquire GIL → Read 100 records → Release GIL
    FOR record in batch:
      Parse record (GIL released)
```
Result: N/100 GIL acquisitions instead of N.
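The batching idea can be sketched in plain Python. `read_batch` pulls up to `batch_size` raw records from a file-like object using the five-digit record length at the start of each leader, and `gil_acquisitions` does the arithmetic from the result line, rounded up for a partial final batch. Both names are assumptions for this example.

```python
import math


def gil_acquisitions(n_records: int, batch_size: int = 100) -> int:
    """Number of read phases needed when records are fetched in batches."""
    return math.ceil(n_records / batch_size)


def read_batch(fileobj, batch_size: int = 100) -> list:
    """Read up to batch_size raw records; an empty list means end of input."""
    batch = []
    for _ in range(batch_size):
        length_bytes = fileobj.read(5)  # leader positions 0-4: record length
        if len(length_bytes) < 5:
            break
        length = int(length_bytes.decode("ascii"))
        batch.append(length_bytes + fileobj.read(length - 5))
    return batch
```

With a batch size of 100, 1,000 records need 10 read phases and 1,050 records need 11.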
### SmallVec Optimization
MARC records vary in size (typically 500-4000 bytes). The record buffer is a SmallVec with 4096 bytes of inline capacity:
```rust
SmallVec<[u8; 4096]>
```
**Benefits**:
- Inline storage for ~85-90% of records (no allocation)
- Dynamic heap allocation for oversized records
- <3% memory overhead
- Eliminates borrow checker issues in Phase 2
## Error Handling
### ParseError Enum
Custom error type allows error creation without GIL:
```rust
pub enum ParseError {
    InvalidRecord(String),
    InvalidLeader(String),
    InvalidDirectory(String),
    EncodingError(String),
}

impl From<ParseError> for PyErr {
    // Conversion happens after GIL re-acquisition in Phase 3
}
```
**Why Custom Error Type?**
- PyErr requires GIL to create
- ParseError can be created during Phase 2 (GIL released)
- Defers PyErr conversion to Phase 3 (after GIL re-acquired)
## Thread Safety
### Not Send/Sync by Design
The readers are intentionally **not** Send or Sync:
```rust
// Readers hold Python references (not Send/Sync)
pub struct PyMARCReader {
    reader: Option<ReaderType>,
    // ReaderType may contain PythonFile(PyObject) which is !Send
}
```
**Why?**
- Each thread needs its own GIL-aware reader
- Sharing readers across threads causes undefined behavior
- Forces correct usage pattern: one reader per thread
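From the Python side, the one-reader-per-thread pattern might look like the sketch below. Each worker constructs its own reader and never hands it to another thread. `open_reader` is a parameter introduced for this example: it stands for any callable that takes a path and returns an iterable of records, such as `MARCReader`.

```python
from concurrent.futures import ThreadPoolExecutor


def count_records(path, open_reader) -> int:
    """Open a fresh reader inside the worker thread and count its records."""
    reader = open_reader(path)  # created here, never shared across threads
    return sum(1 for _ in reader)


def count_all(paths, open_reader, max_workers: int = 4) -> list:
    """Process several files concurrently, one reader per worker call."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: count_records(p, open_reader), paths))
```

Only plain results (here, integers) cross thread boundaries, so no reader is ever visible to more than one thread.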
## Concurrency Model
### Two APIs for Different Use Cases
#### Standard MARCReader (Sequential - No Multi-Threading Benefit)
```python
from mrrc import MARCReader

# Simple sequential reading
reader = MARCReader("records.mrc")
for record in reader:
    process(record)  # process() stands for your own per-record handling
```
**Performance:**
- ✅ Single-threaded: **~4x faster than pymarc**
- ❌ Multi-threaded: **0.85x** (slower than single-threaded because of GIL contention)
- **Use when:** Sequential processing or single-file reads
#### ProducerConsumerPipeline (High-Performance Single-File Multi-Threading)
```python
from mrrc import ProducerConsumerPipeline

# Background producer thread reads file and parses with Rayon
pipeline = ProducerConsumerPipeline.from_file('large_file.mrc')
for record in pipeline:
    process(record)  # process() stands for your own per-record handling
```
**Verified Performance:**
- 2 threads: 2.0x speedup
- 4 threads: 3.74x speedup
- Scales with CPU core count
**How it works:**
- Background producer thread reads file in 512 KB chunks
- Bounded channel provides backpressure (1000 records)
- Rayon parses batches in parallel on all CPU cores
- Producer runs without GIL, eliminating contention
**Use when:** Processing a single large MARC file with maximum throughput from available cores
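The chunked read and the bounded channel can be illustrated with Python's standard library. This is only a sketch of the backpressure mechanism: it uses `queue.Queue` and a Python thread where the real pipeline uses a Rust channel and Rayon, it does no parsing, and all names in it are hypothetical.

```python
import queue
import threading

CHUNK_SIZE = 512 * 1024  # chunk size named in the text
QUEUE_CAPACITY = 1000    # bounded channel size named in the text
_DONE = object()         # sentinel marking the end of the stream


def _producer(fileobj, out: queue.Queue) -> None:
    pending = b""
    while chunk := fileobj.read(CHUNK_SIZE):
        pending += chunk
        # Everything before the last 0x1D is a complete record; keep the tail.
        *complete, pending = pending.split(b"\x1d")
        for raw in complete:
            out.put(raw + b"\x1d")  # blocks while the queue is full (backpressure)
    out.put(_DONE)  # any bytes after the final terminator are discarded


def iter_raw_records(fileobj):
    """Yield raw records produced by a background thread."""
    q = queue.Queue(maxsize=QUEUE_CAPACITY)
    threading.Thread(target=_producer, args=(fileobj, q), daemon=True).start()
    while (item := q.get()) is not _DONE:
        yield item
```

Because `put` blocks once 1000 records are waiting, a slow consumer throttles the producer instead of letting memory grow without bound.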
## Performance Characteristics
### Throughput (Records/Second)
| Configuration | Throughput | Relative to baseline |
|---|---|---|
| Sequential (1 thread) | 549,500 rec/s | Baseline |
| Parallel (2 threads) | ~1.1M rec/s | ~2.0x speedup |
| Parallel (4 threads) | ~2.0M rec/s | ~3.74x speedup |
### Memory Usage
| Item | Memory | Notes |
|---|---|---|
| Per reader | ~4 KB | SmallVec buffer |
| Per record (memory) | ~4 KB | Typical MARC record |
| Overhead (4 readers) | ~16 KB | Negligible |
### GIL Contention
| Phase | GIL | Duration | Work |
|---|---|---|---|
| Phase 1 | Held | Short | Read bytes only |
| Phase 2 | Released | Long | Parsing (CPU-bound) |
| Phase 3 | Held | Short | Convert to Python |
## Character Encoding
### MARC-8 Support
MARC-8 is a legacy encoding with:
- Basic Latin (ASCII)
- ANSEL Extended Latin with diacritical marks
- Greek, Cyrillic, Arabic, Hebrew scripts
- East Asian support (Chinese, Japanese, Korean)
- Combining characters with Unicode NFC normalization
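The last point can be demonstrated with Python's standard `unicodedata` module. The snippet shows only the NFC step on already-decoded text, not MARC-8 decoding itself, and `normalize_text` is a name chosen for this example.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"                 # 'e' followed by COMBINING ACUTE ACCENT
composed = normalize_text(decomposed)  # single code point U+00E9
```

After normalization the two-code-point sequence becomes the single character `é`, so equal strings compare equal regardless of how the source record encoded the diacritic.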
### UTF-8 Support
Modern MARC records use UTF-8 (detected from leader position 9).
### Automatic Detection
Character set detected from MARC leader:
- Position 9: ' ' = MARC-8, 'a' = UTF-8
- Decoder selected automatically
- Invalid bytes produce errors with context
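A minimal sketch of that selection logic, assuming the leader is available as 24 raw bytes. The function name and the returned labels are illustrative, not the library's API.

```python
def detect_encoding(leader: bytes) -> str:
    """Pick a decoder from leader position 9 (character coding scheme)."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 bytes")
    flag = leader[9:10]
    if flag == b"a":
        return "utf-8"
    if flag == b" ":
        return "marc-8"
    # Surface the offending byte so the caller gets context.
    raise ValueError(f"unknown character coding scheme at leader/09: {flag!r}")
```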
## Format Conversions
### Supported Formats
- **JSON**: Generic field-based representation
- **MARCJSON**: Standard JSON-LD format (LOC spec)
- **XML**: Field/subfield XML structure
- **CSV**: Tabular export for spreadsheets
- **Dublin Core**: Simplified 15-element metadata
- **MODS**: Metadata Object Description Schema
- **BIBFRAME**: RDF/Linked Data (bidirectional conversion)
### Conversion Approach
Each format has:
1. **Serializer**: Record → Format bytes
2. **Deserializer**: Format bytes → Record
3. **Round-trip tests**: Ensure lossless conversion
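The serializer/deserializer/round-trip trio can be sketched for a generic JSON representation as follows. The dictionary layout of the record and the function names are assumptions for this example and do not describe MRRC's actual JSON schema.

```python
import json


def to_json(record: dict) -> bytes:
    """Serializer: record to format bytes."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True).encode("utf-8")


def from_json(data: bytes) -> dict:
    """Deserializer: format bytes back to a record."""
    return json.loads(data.decode("utf-8"))


def assert_round_trip(record: dict) -> None:
    """Round-trip check: deserializing the serialized form yields the same record."""
    assert from_json(to_json(record)) == record
```

A round-trip test of this shape only proves losslessness for the values it is fed, so fixtures should include non-ASCII text, repeated fields, and empty subfields.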
## Testing
### Test Categories
1. **Unit Tests**: Individual components (parsers, builders, queries)
2. **Integration Tests**: End-to-end workflows (read → process → write)
3. **Compatibility Tests**: pymarc compatibility validation (75+ tests)
4. **Performance Tests**: Benchmarking with Criterion.rs and pytest-benchmark
5. **Encoding Tests**: MARC-8, UTF-8, multilingual records
### Test Fixtures
Located in `tests/data/`:
- `simple_book.mrc` - Basic bibliographic record
- `multi_records.mrc` - Multiple records in one file
- `simple_authority.mrc` - Authority record
- `simple_holdings.mrc` - Holdings record
- `with_control_fields.mrc` - Record with 008 field
### Benchmark Fixtures
Located in `tests/data/fixtures/`:
- `1k_records.mrc` (257 KB) - Quick benchmarks
- `10k_records.mrc` (2.5 MB) - Standard benchmarks
## Key Design Principles
1. **Rust-Idiomatic**: Uses iterators, Result types, ownership patterns
2. **Zero-Copy Where Possible**: Efficient memory usage for large workloads
3. **Format Flexibility**: Multiple serialization formats out of the box
4. **Compatibility**: Maintains data fidelity with pymarc
5. **Performance**: Concurrent I/O with intelligent GIL management
6. **Safety**: GIL release without unsafe code (except PyO3 glue)
## References
**Performance & Benchmarking:**
- [Performance Tuning Guide](../guides/performance-tuning.md) - Usage patterns and tuning
- [Benchmarking Results](../benchmarks/results.md) - Detailed performance data
- [Performance FAQ](../benchmarks/faq.md) - Quick Q&A about speedups
**Guides:**
- [Threading in Python](../guides/threading-python.md) - Thread safety and GIL behavior
**External References:**
- [MARC Standard](https://www.loc.gov/marc/)
- [ISO 2709](https://en.wikipedia.org/wiki/MARC_standards)
- [PyO3 Documentation](https://pyo3.rs/)