mrrc 0.8.2 - Docs.rs

# Benchmarking FAQ

## Performance Questions

### Is pymrrc faster than pymarc?

Early benchmarking suggested at least a 4x single-threaded speedup over
pymarc; these benchmarks need to be updated and reconsidered. The mechanism
behind the speedup is structural: each record is parsed by compiled Rust code
instead of interpreted Python. To measure the difference on your own workload,
see [Results](results.md).

### Does pymrrc automatically use multiple cores?

No. By default, pymrrc runs on one thread (like pymarc), but each record
parses faster because parsing happens in Rust.

To use multiple cores:

- Use `ProducerConsumerPipeline` for single-file processing
- Use `ThreadPoolExecutor` to process multiple files in parallel

### Do I need to change my code to get the single-threaded speedup?

No. If upgrading from pymarc, install pymrrc and existing code benefits
automatically.

### Do I need to change my code for multi-threading?

Yes. For single-file multi-threaded processing, use
`ProducerConsumerPipeline`. For multi-file processing, use
`ThreadPoolExecutor`. See the [threading guide](../guides/threading-python.md)
for examples.

### Is GIL release automatic?

Yes. Every call to `read_record()` automatically releases the GIL during
parsing. No special handling required.

---

## Understanding the Speedups

### Single-Threaded

```python
from mrrc import MARCReader

# Parsing runs in Rust; no per-record Python interpretation.
# Pass a path (not a file object) so I/O also runs in Rust, GIL-free.
reader = MARCReader("records.mrc")
while record := reader.read_record():
    title = record.title
    # ... process record ...
```

The Python wrapper adds overhead relative to pure Rust (each record crosses
the Rust/Python boundary), but record parsing itself runs at Rust speed.
File objects also work, but the GIL is held for their `.read()` calls —
prefer paths.

### Multi-Threaded

```python
from concurrent.futures import ThreadPoolExecutor
from mrrc import MARCReader

def process_file(path):
    # Each thread gets its own reader; path input keeps I/O in Rust,
    # so threads never contend on the GIL for reads.
    reader = MARCReader(path)
    while record := reader.read_record():
        ...  # process record

# Process 4 files in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(process_file, ["file1.mrc", "file2.mrc", "file3.mrc", "file4.mrc"])
```

Because the GIL is released during parsing, threads parse concurrently and
speedup scales with core count. Efficiency is sub-linear: thread management,
memory bandwidth, and the Python code between records all take their share.
Measure with your own files to find the practical ceiling on your hardware.

Use multi-threading when processing multiple independent files. For single
files, use `ProducerConsumerPipeline`.

---

## Upgrade Considerations

### When pymrrc provides clear benefit:

- Processing large files (100k+ records)
- MARC processing is a bottleneck in your workflow
- You need multi-threading for parallel file processing

### When pymrrc may not be necessary:

- Pure Python environment required (no C extensions)
- Deeply integrated custom code with pymarc internals
- Processing very small files (< 1k records)

---

## Benchmark Selection

### Which benchmark matches my use case?

- **Raw reading** (`test_benchmark_reading.py`): best case for mrrc
- **Reading + field extraction** (`test_benchmark_reading.py`): typical use
- **Round-trip read + write** (`test_benchmark_writing.py`): format conversion
- **Parallel processing** (`test_benchmark_parallel.py`,
  `test_benchmark_pipeline_parallel.py`): processing many files or one large
  file with threads (local runs only — see [Results](results.md))

### What does "rec/s" mean?

Records per second. Higher is better. Throughput depends on hardware, record
size, and what your code does with each record — measure on your own data.

---

## GIL and Threading

### What is the GIL?

Global Interpreter Lock. Python's mechanism to make the interpreter
thread-safe. Only one thread can execute Python bytecode at a time. When one
thread releases the GIL, other threads can run.

### Does pymrrc release the GIL?

Yes. During record parsing, the GIL is released via `py.detach()`. This
happens automatically with every `read_record()` call.

### Do I need to think about the GIL?

No. In single-threaded mode, it's automatic and transparent. In
multi-threaded mode, the GIL release enables concurrent parsing.

### What happens if I share a reader across threads?

Undefined behavior. The reader is not thread-safe. Create a new reader per
thread.

```python
# Wrong - do not share readers across threads
reader = MARCReader("file.mrc")
ThreadPoolExecutor().map(lambda _: reader.read_record(), range(4))

# Correct - each thread gets its own reader
def read_all(path):
    reader = MARCReader(path)
    while record := reader.read_record():
        # ... process ...

ThreadPoolExecutor().map(read_all, ["file1.mrc", "file2.mrc", "file3.mrc", "file4.mrc"])
```

---

## Performance Tuning

### How can I make pymrrc faster?

1. **File path input:** Pass file paths directly instead of file objects
2. **Multi-threading:** Use `ThreadPoolExecutor` for multiple files,
   `ProducerConsumerPipeline` for single files
3. **Batch processing:** Read multiple records at once
4. **Use Rust:** Rewrite critical path in Rust if possible

### Does record size affect performance?

Minimally. Throughput stays consistent regardless of record size.

### Does Python version matter?

Only for availability. pymrrc supports Python 3.10+. Performance is similar
across versions.

### Does OS matter?

Only for binary compatibility. pymrrc provides wheels for Linux, macOS, and
Windows. Performance is similar across platforms.

---

## Further Reading

- [Results](results.md) - Measurement infrastructure and benchmarking
  procedure
- [Threading Guide](../guides/threading-python.md) - Threading patterns and
  code examples
- [Performance Tuning](../guides/performance-tuning.md) - Usage patterns and
  optimization