Module text_reader

An iterator adapter that suppresses the CR in CRLF (\r\n) sequences in a stream of bytes.

§Overview

This module provides CrlfSuppressor, an iterator adapter to filter out CR (\r, 0x0D) when it is immediately followed by LF (\n, 0x0A), as commonly found in Windows line endings.

It also provides an extension trait CrlfSuppressorExt so you can easily call .crlf_suppressor() on any iterator over bytes (e.g., from BufReader::bytes()).

§Usage

§Basic example

use std::io::{Cursor, Error, Read};
use tpnote_lib::text_reader::CrlfSuppressorExt;

let data = b"hello\r\nworld";
let normalized: Result<Vec<u8>, Error> = Cursor::new(data)
    .bytes()
    .crlf_suppressor()
    .collect();
let s = String::from_utf8(normalized.unwrap()).unwrap();
assert_eq!(s, "hello\nworld");

§Reading from a file

use std::fs::File;
use tpnote_lib::text_reader::read_as_string_with_crlf_suppression;

let normalized = read_as_string_with_crlf_suppression(File::open("file.txt")?)?;
println!("{}", normalized);

§Implementation details

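In UTF-8, every byte of a multi-byte code point has its high bit set: lead bytes are in the range 0xC2..=0xF4 and continuation bytes are in the range 0x80..=0xBF. Since 0x0D and 0x0A are both below 0x80, they can only occur as the ASCII characters CR and LF, never as part of another code point. Searching for CRLF as raw byte values is therefore safe.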
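
The following is a minimal sketch of this idea, not the crate's actual implementation. The names `DropCrBeforeLf` and `suppress_cr` are hypothetical and exist only for illustration. The sketch works on plain `u8` items and uses `Peekable` to look one byte ahead, dropping a CR only when the next byte is LF.

```rust
use std::iter::Peekable;

/// Hypothetical stand-in for a CR-suppressing adapter (illustration only).
struct DropCrBeforeLf<I: Iterator<Item = u8>> {
    inner: Peekable<I>,
}

impl<I: Iterator<Item = u8>> Iterator for DropCrBeforeLf<I> {
    type Item = u8;

    fn next(&mut self) -> Option<u8> {
        let byte = self.inner.next()?;
        // A CR (0x0D) is dropped only if the very next byte is LF (0x0A).
        if byte == 0x0D && self.inner.peek() == Some(&0x0A) {
            // Skip the CR and yield the LF in its place.
            return self.inner.next();
        }
        // Every other byte, including a lone CR, passes through unchanged.
        Some(byte)
    }
}

/// Hypothetical helper: applies the adapter to a byte slice.
fn suppress_cr(input: &[u8]) -> Vec<u8> {
    DropCrBeforeLf {
        inner: input.iter().copied().peekable(),
    }
    .collect()
}
```

Because the bytes of a multi-byte code point are all 0x80 or above, the comparison against 0x0D and 0x0A can never match inside one, so the output remains valid UTF-8 whenever the input was.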

§See also

Structs§

CrlfSuppressor
An iterator adapter that suppresses CR (\r, 0x0D) when followed by LF (\n, 0x0A). In a valid multi-byte UTF-8 sequence, every byte is 0x80 or above (continuation bytes are in the range 0x80 to 0xBF). As 0x0D and 0x0A are below this range, we can safely search for them in a stream of bytes.

Traits§

CrlfSuppressorExt
Extension trait to add .crlf_suppressor() to any iterator over bytes.
StringExt
Adds a method to String that suppresses \r in \r\n sequences. When no \r\n is found, no memory allocation occurs.
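
One way to get the "no allocation when nothing changes" behavior is sketched below. It is an assumption for illustration, not the trait's real signature: the free function `suppress_crlf` is hypothetical and uses `Cow` to borrow the input when it contains no \r\n.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not the crate's `StringExt` method).
/// Borrows the input when there is nothing to change, so the
/// common LF-only case does not allocate.
fn suppress_crlf(s: &str) -> Cow<'_, str> {
    if !s.contains("\r\n") {
        // Fast path: no CRLF present, return the input as-is.
        return Cow::Borrowed(s);
    }
    // Slow path: allocate a new string with each CRLF replaced by LF.
    Cow::Owned(s.replace("\r\n", "\n"))
}
```

A lone \r that is not followed by \n is left untouched in both paths.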

Functions§

read_as_string_with_crlf_suppression
Reads all bytes from the given reader, suppressing CR (\r) bytes that are immediately followed by LF (\n), and returns the resulting data as a UTF-8 string.
read_with_crlf_suppression
Reads all bytes from the given reader, suppressing CR (\r) bytes that are immediately followed by LF (\n).
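
For readers who want to see the shape of such functions without the crate, here is a std-only sketch. The function names are hypothetical, and mapping invalid UTF-8 to `ErrorKind::InvalidData` is an assumption; the crate's own functions may report errors differently.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical sketch: reads everything from `reader`, dropping each
/// CR that is immediately followed by LF. I/O errors are propagated.
fn read_bytes_without_cr_before_lf<R: Read>(reader: R) -> io::Result<Vec<u8>> {
    let mut bytes = BufReader::new(reader).bytes().peekable();
    let mut out = Vec::new();
    while let Some(item) = bytes.next() {
        let byte = item?;
        // Drop the CR only when the next item is a successfully read LF.
        if byte == b'\r' && matches!(bytes.peek(), Some(Ok(b'\n'))) {
            continue;
        }
        out.push(byte);
    }
    Ok(out)
}

/// Hypothetical sketch: same as above, then validates the bytes as UTF-8.
/// Assumption: invalid UTF-8 is reported as `ErrorKind::InvalidData`.
fn read_string_without_cr_before_lf<R: Read>(reader: R) -> io::Result<String> {
    let bytes = read_bytes_without_cr_before_lf(reader)?;
    String::from_utf8(bytes).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Any type implementing `Read` can be passed in, for example a `File` or a `Cursor` over an in-memory buffer.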