//! Extractous is a library that extracts text from various file formats.
//! * Supports many file formats, such as Word, Excel, PowerPoint, and PDF.
//! * Strives to be simple, fast, and efficient.
//!
//! # Quick Start
//! The entry point of the Extractous API is the [`Extractor`] struct.
//! All public APIs are accessible through an extractor.
//! The extractor provides functions to extract text from files, URLs, and byte arrays.
//! To use an extractor, you need to:
//! - [create and configure a new extractor](#create-and-config-an-extractor)
//! - [use the extractor to extract text](#extract-text)
//! - optionally, [enable OCR for the extractor](#extract-text-with-ocr)
//!
//! ## Create and config an extractor
//!
//! ```no_run
//! use extractous::Extractor;
//! use extractous::PdfParserConfig;
//!
//! // Create a new extractor. Note it uses a consuming builder pattern
//! let mut extractor = Extractor::new()
//!     .set_extract_string_max_length(1000);
//!
//! // can also perform conditional configuration
//! let custom_pdf_config = true;
//! if custom_pdf_config {
//!     extractor = extractor.set_pdf_config(
//!         PdfParserConfig::new().set_extract_annotation_text(false)
//!     );
//! }
//!
//! ```
//!
//! ## Extract text
//!
//! ```no_run
//! use extractous::Extractor;
//!
//! // Create a new extractor. Note it uses a consuming builder pattern
//! let extractor = Extractor::new().set_extract_string_max_length(1000);
//!
//! // Extract text from a file
//! let (text, metadata) = extractor.extract_file_to_string("README.md").unwrap();
//! println!("{}", text);
//!
//! ```
//!
//! ## Extract text with OCR
//! * Make sure Tesseract is installed with the corresponding language packs. For example, on Debian, run `sudo apt install tesseract-ocr tesseract-ocr-deu` to install Tesseract with the German language pack.
//! * If you get `Parse error occurred : Unable to extract PDF content`, the most likely cause is a missing OCR language pack.
//!
//! ```no_run
//! use extractous::{Extractor, TesseractOcrConfig, PdfParserConfig, PdfOcrStrategy};
//!
//! let file_path = "../test_files/documents/deu-ocr.pdf";
//!
//! // Create a new extractor. Note it uses a consuming builder pattern
//! let extractor = Extractor::new()
//!     .set_ocr_config(TesseractOcrConfig::new().set_language("deu"))
//!     .set_pdf_config(PdfParserConfig::new().set_ocr_strategy(PdfOcrStrategy::OCR_ONLY));
//!
//! // extract file with extractor
//! let (content, metadata) = extractor.extract_file_to_string(file_path).unwrap();
//! println!("{}", content);
//!
//! ```
/// Default buffer size
pub const DEFAULT_BUF_SIZE: usize = 32768;
// errors module
mod errors;
pub use errors::*;
// config module is the configuration interface
mod config;
pub use config::*;
// extractor module is the main public api interface
mod extractor;
pub use extractor::*;
// tika module, not exposed outside this crate