Extractous is a library that extracts text from various file formats.
- Supports many file formats, including Word, Excel, PowerPoint, and PDF.
- Strives to be simple, fast, and efficient.
§Quick Start
The entry point of the Extractous API is the Extractor struct.
All public APIs are accessible through an extractor.
The extractor provides functions to extract text from files, URLs, and byte arrays.
To use an extractor, you need to:
- create and configure a new extractor
- use the extractor to extract text
- optionally, enable OCR for the extractor
§Create and configure an extractor
use extractous::Extractor;
use extractous::PdfParserConfig;

// Create a new extractor. Note it uses a consuming builder pattern
let mut extractor = Extractor::new()
    .set_extract_string_max_length(1000);

// can also perform conditional configuration
let custom_pdf_config = true;
if custom_pdf_config {
    extractor = extractor.set_pdf_config(
        PdfParserConfig::new().set_extract_annotation_text(false)
    );
}
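The comments above mention a consuming builder pattern. As a minimal sketch of what that implies, the following uses a hypothetical stand-in type, `DemoExtractor`, with made-up fields and defaults. It is not the extractous `Extractor`. It only illustrates why the conditional branch has to assign the setter's result back to the variable: each setter takes `self` by value and returns it.

```rust
// Hypothetical stand-in type used only to illustrate the pattern;
// it is not the extractous Extractor, and its defaults are arbitrary.
#[derive(Debug, PartialEq)]
struct DemoExtractor {
    max_length: i32,
    annotation_text: bool,
}

impl DemoExtractor {
    fn new() -> Self {
        DemoExtractor { max_length: 500_000, annotation_text: true }
    }

    // Consuming setter: takes ownership of `self` and hands it back.
    fn set_max_length(mut self, max_length: i32) -> Self {
        self.max_length = max_length;
        self
    }

    // Second consuming setter, so calls can be chained.
    fn set_annotation_text(mut self, enabled: bool) -> Self {
        self.annotation_text = enabled;
        self
    }
}

// Builds a configured value, optionally applying a second setting.
fn build(disable_annotations: bool) -> DemoExtractor {
    let mut extractor = DemoExtractor::new().set_max_length(1000);
    if disable_annotations {
        // The setter consumed the old value, so the result must be reassigned.
        extractor = extractor.set_annotation_text(false);
    }
    extractor
}

fn main() {
    println!("{:?}", build(true));
    println!("{:?}", build(false));
}
```

Calling a consuming setter without using its return value would drop the configured value, which is why chaining or reassignment is needed.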
§Extract text
use extractous::Extractor;

// Create a new extractor. Note it uses a consuming builder pattern
let extractor = Extractor::new().set_extract_string_max_length(1000);

// Extract text from a file; the call returns the text together with the file's metadata
let (text, metadata) = extractor.extract_file_to_string("README.md").unwrap();
println!("{}", text);
§Extract text with OCR
- Make sure Tesseract is installed with the corresponding language packs. For example, on Debian, run
  sudo apt install tesseract-ocr tesseract-ocr-deu
  to install Tesseract with the German language pack.
- If you get
  Parse error occurred : Unable to extract PDF content
  it is most likely that the OCR language pack is not installed.
use extractous::{Extractor, TesseractOcrConfig, PdfParserConfig, PdfOcrStrategy};

let file_path = "../test_files/documents/deu-ocr.pdf";

// Create a new extractor. Note it uses a consuming builder pattern
let extractor = Extractor::new()
    .set_ocr_config(TesseractOcrConfig::new().set_language("deu"))
    .set_pdf_config(PdfParserConfig::new().set_ocr_strategy(PdfOcrStrategy::OCR_ONLY));

// extract file with extractor
let (content, metadata) = extractor.extract_file_to_string(file_path).unwrap();
println!("{}", content);
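Since a missing language pack is the likely cause of the parse error described above, it can help to check the installed packs before extracting. The following is a rough sketch that relies only on the Tesseract command line (`tesseract --list-langs`) and the Rust standard library. The helper `has_language` is a hypothetical name introduced here and is not part of extractous.

```rust
use std::process::Command;

// Returns true if `lang` appears on its own line in the text printed by
// `tesseract --list-langs`. The header line never matches a language code,
// so it does not need special handling.
fn has_language(list_langs_output: &str, lang: &str) -> bool {
    list_langs_output.lines().any(|line| line.trim() == lang)
}

fn main() {
    let lang = "deu";
    match Command::new("tesseract").arg("--list-langs").output() {
        Ok(out) => {
            // Depending on the Tesseract version, the list may be written
            // to stdout or stderr, so both streams are checked.
            let mut text = String::from_utf8_lossy(&out.stdout).into_owned();
            text.push('\n');
            text.push_str(&String::from_utf8_lossy(&out.stderr));
            if has_language(&text, lang) {
                println!("language pack '{}' is installed", lang);
            } else {
                println!("language pack '{}' is missing", lang);
            }
        }
        // Typically means the tesseract binary is not on the PATH.
        Err(err) => eprintln!("could not run tesseract: {}", err),
    }
}
```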
Structs§
- Extractor - Extractor for extracting text from different file formats
- OfficeParserConfig - Microsoft Office parser configuration settings
- PdfParserConfig - PDF parsing configuration settings
- StreamReader - StreamReader implements std::io::Read
- TesseractOcrConfig - Tesseract OCR configuration settings
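Because StreamReader implements std::io::Read, any code written against the `Read` trait should be able to consume extracted text incrementally. The sketch below is an illustration under assumptions: `read_to_text` is a hypothetical helper, the chunk size is arbitrary, and a `std::io::Cursor` stands in for the reader so that the example runs without the library. The commented-out `extract_file` call is an assumed usage that should be checked against the Extractor documentation.

```rust
use std::io::{self, Cursor, Read};

// Reads everything from `reader` in chunks of `buf_size` bytes (at least 1)
// and returns the collected bytes as text.
fn read_to_text<R: Read>(mut reader: R, buf_size: usize) -> io::Result<String> {
    let mut chunk = vec![0u8; buf_size.max(1)];
    let mut bytes = Vec::new();
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            break; // end of stream
        }
        bytes.extend_from_slice(&chunk[..n]);
    }
    // Decode once at the end so that multi-byte characters split across
    // chunk boundaries stay intact.
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}

fn main() -> io::Result<()> {
    // Stand-in for a reader obtained from the library, e.g. (assumed usage):
    // let (stream, _metadata) = extractor.extract_file("README.md")?;
    let stream = Cursor::new("Hello, Extractous!".as_bytes());
    let text = read_to_text(stream, 8)?;
    println!("{}", text);
    Ok(())
}
```

Reading in chunks keeps memory use bounded while the text is being processed, which matters mostly for large documents. For small files, extracting directly to a string is simpler.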
Enums§
- CharSet - CharSet enum of all supported encodings
- Error - Represents errors returned by extractous
- PdfOcrStrategy - OCR Strategy for PDF parsing
Constants§
- DEFAULT_BUF_SIZE - Default buffer size
Type Aliases§
- ExtractResult - Result that is a wrapper of Result<T, extractous::Error>
- Metadata - Metadata type alias
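To illustrate how a result alias of this kind is typically declared and used, here is a small sketch built on stand-in types. `DemoError`, `DemoResult`, `DemoMetadata`, and `first_value` are all hypothetical names, and the map-of-string-lists shape chosen for `DemoMetadata` is an assumption made for illustration, not a statement about the actual Metadata alias.

```rust
use std::collections::HashMap;
use std::fmt;

// Hypothetical stand-in error type; the real extractous::Error has its own variants.
#[derive(Debug, PartialEq)]
enum DemoError {
    Parse(String),
}

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoError::Parse(msg) => write!(f, "parse error: {}", msg),
        }
    }
}

// Alias in the style described above: a Result with the error type fixed.
type DemoResult<T> = Result<T, DemoError>;

// Assumed shape, for illustration only: keys mapped to lists of values.
type DemoMetadata = HashMap<String, Vec<String>>;

// Returns the first value stored under `key`, or an error if there is none.
fn first_value(metadata: &DemoMetadata, key: &str) -> DemoResult<String> {
    metadata
        .get(key)
        .and_then(|values| values.first())
        .cloned()
        .ok_or_else(|| DemoError::Parse(format!("no value for key '{}'", key)))
}

fn main() {
    let mut metadata = DemoMetadata::new();
    metadata.insert("Content-Type".to_string(), vec!["text/markdown".to_string()]);

    // Matching on the result avoids the panics that `unwrap` would cause.
    match first_value(&metadata, "Content-Type") {
        Ok(value) => println!("{}", value),
        Err(err) => eprintln!("{}", err),
    }
}
```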