Struct PdfiumTextPage

pub struct PdfiumTextPage { /* private fields */ }

Rust interface to FPDF_TEXTPAGE

Implementations

impl PdfiumTextPage

pub fn load_web_links(&self) -> PdfiumResult<PdfiumPageLink>

Get information about weblinks in a page.

Comments:

Weblinks are links implicitly embedded in the text of PDF pages. PDF also has a type of annotation called “link” (FPDFTEXT doesn’t deal with that kind of link). The FPDFTEXT weblink feature is useful for automatically detecting links in the page contents. For example, text such as https://www.example.com will be detected, so applications can let the user click on those characters to activate the link, even if the PDF doesn’t come with link annotations.

pub fn char_count(&self) -> PdfiumResult<i32>

Get number of characters in a page.

Generated characters, such as additional space characters and newline characters, are also counted.

Comments:

Characters in a page form a “stream”; inside the stream, each character has an index. Many of the FPDFTEXT functions take these index parameters. The first character in the page has an index value of zero.

pub fn count_rects(&self, start_index: i32, count: i32) -> PdfiumResult<i32>

Counts the number of rectangular areas occupied by a segment of text.

Parameters:

start_index - Index for the start character.
count - Number of characters, or -1 for all remaining.

Returns:

The number of rectangles, or an Err on a bad start_index (FPDFText_CountRects reports -1 in that case).

Comments:

This function, along with FPDFText_GetRect, can be used by applications to detect the position of a text segment on the page, so the proper areas can be highlighted. The FPDFText_* functions automatically merge small character boxes into bigger ones if those characters are on the same line and use the same font settings.
Caches the result for subsequent FPDFText_GetRect() calls.
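
The merging described above can be pictured with a small standard-library sketch. It is an illustration only, not PDFium's actual algorithm: `CharBox`, `font_id`, `same_line` and `merge_boxes` are hypothetical names, and "same line" is assumed here to mean that two boxes overlap vertically.

```rust
/// Hypothetical character box in PDF "user space" (y grows upwards).
#[derive(Debug, Clone, Copy, PartialEq)]
struct CharBox {
    left: f64,
    bottom: f64,
    right: f64,
    top: f64,
    /// Stand-in for "same font settings"; an assumption for this sketch.
    font_id: u32,
}

/// Assumed line test: the two boxes overlap vertically.
fn same_line(a: &CharBox, b: &CharBox) -> bool {
    a.bottom < b.top && b.bottom < a.top
}

/// Merges consecutive boxes that share a line and a font into one rectangle.
fn merge_boxes(chars: &[CharBox]) -> Vec<CharBox> {
    let mut merged: Vec<CharBox> = Vec::new();
    for &c in chars {
        if let Some(last) = merged.last_mut() {
            if last.font_id == c.font_id && same_line(last, &c) {
                // Grow the previous rectangle to cover this character too.
                last.left = last.left.min(c.left);
                last.right = last.right.max(c.right);
                last.bottom = last.bottom.min(c.bottom);
                last.top = last.top.max(c.top);
                continue;
            }
        }
        // Different line or font: start a new rectangle.
        merged.push(c);
    }
    merged
}
```

Under these assumptions, two adjacent characters on one line collapse into a single rectangle, while a character on the next line starts a new one. An application highlighting a text segment would then draw one highlight per merged rectangle.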

pub fn find(&self, findwhat: &str, flags: PdfiumSearchFlags, start_index: i32) -> PdfiumSearchIterator

Start a search.

Parameters:

findwhat - A unicode match pattern.
flags - Option flags.
start_index - Start from this character. -1 for end of the page.

Example from the repository:

examples/text_extract_search.rs

pub fn example_search() -> PdfiumResult<()> {
    // Load the PDF document to search within
    let document = PdfiumDocument::new_from_path("resources/groningen.pdf", None)?;

    // Get the first page (index 0) for searching
    let page = document.page(0)?;

    // Extract text objects from the page for searching
    let text = page.text()?;

    // Search for "amsterdam" with case-insensitive matching
    // PdfiumSearchFlags::empty() means no special search flags (case-insensitive by default)
    // The last parameter (0) is the starting position for the search
    let search = text.find("amsterdam", PdfiumSearchFlags::empty(), 0);
    println!("Found amsterdam {} times", search.count());

    // Search for "groningen" with case-insensitive matching
    let search = text.find("groningen", PdfiumSearchFlags::empty(), 0);
    println!(
        "Found groningen {} times (case insensitive)",
        search.count()
    );

    // Search for "Groningen" with case-sensitive matching
    // MATCH_CASE flag enforces exact case matching
    let search = text.find("Groningen", PdfiumSearchFlags::MATCH_CASE, 0);
    println!("Found Groningen {} times (case sensitive)", search.count());

    // Perform another case-insensitive search to iterate through results
    let search = text.find("groningen", PdfiumSearchFlags::empty(), 0);

    // Iterate through each search result to extract detailed information
    for result in search {
        // Extract the text fragment at the found position
        // result.index() gives the character position where the match starts
        // result.count() gives the length of the matched text
        let fragment = text.extract(result.index(), result.count());
        println!(
            "Found groningen (case insensitive) at {}, fragment = '{fragment}'",
            result.index()
        );
    }

    // Expected output:
    //
    // Found amsterdam 0 times
    // Found groningen 5 times (case insensitive)
    // Found Groningen 5 times (case sensitive)
    // Found groningen (case insensitive) at 14, fragment = 'Groningen'
    // Found groningen (case insensitive) at 232, fragment = 'Groningen'
    // Found groningen (case insensitive) at 475, fragment = 'Groningen'
    // Found groningen (case insensitive) at 920, fragment = 'Groningen'
    // Found groningen (case insensitive) at 1050, fragment = 'Groningen'

    Ok(())
}

pub fn get_bounded_text(&self, left: f64, top: f64, right: f64, bottom: f64, buffer: &mut c_ushort, buflen: i32) -> i32

Function: FPDFText_GetBoundedText

Extract Unicode text within a rectangular boundary on the page.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
left - Left boundary.
top - Top boundary.
right - Right boundary.
bottom - Bottom boundary.
buffer - Caller-allocated buffer to receive UTF-16 values.
buflen - Number of UTF-16 values (not bytes) that buffer is capable of holding.

Returns:

If buffer is NULL or buflen is zero, return number of UTF-16 values (not bytes) of text present within the rectangle, excluding a terminating NUL. Generally you should pass a buffer at least one larger than this if you want a terminating NUL, which will be provided if space is available. Otherwise, return number of UTF-16 values copied into the buffer, including the terminating NUL when space for it is available.

Comments:

If the buffer is too small, as much text as will fit is copied into it. May return a split surrogate in that case.
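
Once the buffer has been filled, the UTF-16 values still need to be turned into a Rust String. The following is a minimal sketch using only the standard library; `decode_utf16_buffer` is a hypothetical helper that is not part of this crate, and `written` stands for the count returned by the call. It drops one terminating NUL if present and decodes lossily, so a split surrogate at the end of a truncated buffer becomes U+FFFD instead of causing an error.

```rust
/// Decodes the first `written` UTF-16 values of `buffer` into a String.
/// `written` is assumed to be the value returned by the bounded-text call.
fn decode_utf16_buffer(buffer: &[u16], written: usize) -> String {
    // Never read past the end of the caller-allocated buffer.
    let len = written.min(buffer.len());
    let mut units = &buffer[..len];

    // The returned count may include a terminating NUL; strip one if present.
    if let Some((&0, rest)) = units.split_last() {
        units = rest;
    }

    // Lossy decoding maps an unpaired surrogate to U+FFFD.
    String::from_utf16_lossy(units)
}
```

A caller following the sizing advice above would query the length first, allocate one extra slot for the NUL, and then pass the filled buffer to a helper like this one.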

pub fn get_char_angle(&self, index: i32) -> f32

Get character rotation angle.

Parameters:

index - Zero-based index of the character.

Returns:

On success, return the angle value in radians. The value will always be greater than or equal to 0. If index is out of bounds, then return -1.
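
Because the return value mixes a sentinel (-1) with an angle in radians, callers may want to normalise it before use. A minimal sketch, assuming a hypothetical helper named `angle_in_degrees` that is not part of this crate:

```rust
/// Converts the raw return value into degrees.
/// Any negative value is treated as the out-of-bounds sentinel.
fn angle_in_degrees(raw: f32) -> Option<f32> {
    if raw < 0.0 {
        None
    } else {
        Some(raw.to_degrees())
    }
}
```

With this, a quarter turn (π/2 radians) maps to roughly 90 degrees, and the -1 sentinel maps to None.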

pub fn get_char_box(&self, index: i32) -> PdfiumResult<PdfiumRect>

Get bounding box of a particular character.

Parameters:

index - Zero-based index of the character.

Returns:

The position of the character box as a PdfiumRect, or an Err if index is out of bounds.

Comments:

All positions are measured in PDF “user space”.

pub fn get_char_index_at_pos(&self, x: f64, y: f64, x_tolerance: f64, y_tolerance: f64) -> i32

Function: FPDFText_GetCharIndexAtPos

Get the index of a character at or nearby a certain position on the page.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
x - X position in PDF “user space”.
y - Y position in PDF “user space”.
x_tolerance - An x-axis tolerance value for character hit detection, in point units.
y_tolerance - A y-axis tolerance value for character hit detection, in point units.

Returns:

The zero-based index of the character at, or nearby, the point (x,y). If there is no character at or nearby the point, the return value will be -1. If an error occurs, -3 will be returned.

pub fn get_char_index_from_text_index(&self, n_text_index: i32) -> PdfiumResult<i32>

Get the character index in this PdfiumTextPage’s internal character list.

Parameters:

n_text_index - Index of the text returned from FPDFText_GetText().

Returns:

The index of the character in the internal character list. -1 for error.

pub fn get_char_origin(&self, index: i32, x: &mut f64, y: &mut f64) -> PdfiumResult<()>

Function: FPDFText_GetCharOrigin

Get origin of a particular character.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.
x - Pointer to a double number receiving x coordinate of the character origin.
y - Pointer to a double number receiving y coordinate of the character origin.

Returns:

Whether the call succeeded. If false, x and y are unchanged.

Comments:

All positions are measured in PDF “user space”.

pub fn get_fill_color(&self, index: i32, r: &mut u32, g: &mut u32, b: &mut u32, a: &mut u32) -> PdfiumResult<()>

Function: FPDFText_GetFillColor

Get the fill color of a particular character.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.
R - Pointer to an unsigned int number receiving the red value of the fill color.
G - Pointer to an unsigned int number receiving the green value of the fill color.
B - Pointer to an unsigned int number receiving the blue value of the fill color.
A - Pointer to an unsigned int number receiving the alpha value of the fill color.

Returns:

Whether the call succeeded. If false, |R|, |G|, |B| and |A| are unchanged.

pub fn get_font_info(&self, index: i32, buffer: Option<&mut [u8]>, buflen: c_ulong, flags: &mut i32) -> c_ulong

Function: FPDFText_GetFontInfo

Get the font name and flags of a particular character.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.
buffer - A buffer receiving the font name.
buflen - The length of |buffer| in bytes.
flags - Optional pointer to an int receiving the font flags. These flags should be interpreted per PDF spec 1.7 Section 5.7.1 Font Descriptor Flags.

Returns:

On success, return the length of the font name, including the trailing NUL character, in bytes. If this length is less than or equal to |buflen|, |buffer| is set to the font name and |flags| is set to the font flags. |buffer| is in UTF-8 encoding. Return 0 on failure.
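
The two outputs of this call, a NUL-terminated UTF-8 name and a flags bit field, both need a little decoding. The sketch below uses only the standard library. The helper names `font_name_from_buffer` and `describe_font_flags` are hypothetical and not part of this crate; the bit positions follow the font descriptor flags table in the PDF 1.7 reference, where bit 1 is the lowest-order bit.

```rust
/// Font descriptor flag masks (PDF 1.7, Section 5.7.1); bit 1 is `1 << 0`.
const FONT_FLAGS: [(i32, &str); 9] = [
    (1 << 0, "FixedPitch"),
    (1 << 1, "Serif"),
    (1 << 2, "Symbolic"),
    (1 << 3, "Script"),
    (1 << 5, "Nonsymbolic"),
    (1 << 6, "Italic"),
    (1 << 16, "AllCap"),
    (1 << 17, "SmallCap"),
    (1 << 18, "ForceBold"),
];

/// Returns the names of all flags set in `flags`, in bit order.
fn describe_font_flags(flags: i32) -> Vec<&'static str> {
    FONT_FLAGS
        .iter()
        .filter(|(mask, _)| (flags & *mask) != 0)
        .map(|(_, name)| *name)
        .collect()
}

/// Reads the font name up to the first NUL byte (or the end of the buffer).
/// Returns None if those bytes are not valid UTF-8.
fn font_name_from_buffer(buffer: &[u8]) -> Option<&str> {
    let end = buffer.iter().position(|&b| b == 0).unwrap_or(buffer.len());
    std::str::from_utf8(&buffer[..end]).ok()
}
```

For instance, a flags value of 34 has bits 2 and 6 set, which this sketch reports as Serif and Nonsymbolic.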

pub fn get_font_size(&self, index: i32) -> f64

Function: FPDFText_GetFontSize

Get the font size of a particular character.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.

Returns:

The font size of the particular character, measured in points (about 1/72 inch). This is the typographic size of the font (so called “em size”).

pub fn get_font_weight(&self, index: i32) -> PdfiumResult<i32>

Function: FPDFText_GetFontWeight

Get the font weight of a particular character.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.

Returns:

On success, return the font weight of the particular character. If |text_page| is invalid, if index is out of bounds, or if the character’s text object is undefined, return -1.

pub fn get_loose_char_box(&self, index: i32, rect: &mut FS_RECTF) -> PdfiumResult<()>

Function: FPDFText_GetLooseCharBox

Get a “loose” bounding box of a particular character, i.e., covering the entire glyph bounds, without taking the actual glyph shape into account.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.
rect - Pointer to a FS_RECTF receiving the character box.

Returns:

On success, return TRUE and fill in |rect|. If |text_page| is invalid, or if index is out of bounds, then return FALSE, and the |rect| out parameter remains unmodified.

Comments:

All positions are measured in PDF “user space”.

pub fn get_matrix(&self, index: i32, matrix: &mut FS_MATRIX) -> PdfiumResult<()>

Function: FPDFText_GetMatrix

Get the effective transformation matrix for a particular character.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage().
index - Zero-based index of the character.
matrix - Pointer to a FS_MATRIX receiving the transformation matrix.

Returns:

On success, return TRUE and fill in |matrix|. If |text_page| is invalid, or if index is out of bounds, or if |matrix| is NULL, then return FALSE, and |matrix| remains unmodified.

pub fn get_rect(&self, rect_index: i32, left: &mut f64, top: &mut f64, right: &mut f64, bottom: &mut f64) -> PdfiumResult<()>

Function: FPDFText_GetRect

Get a rectangular area from the result generated by FPDFText_CountRects.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
rect_index - Zero-based index for the rectangle.
left - Pointer to a double value receiving the rectangle left boundary.
top - Pointer to a double value receiving the rectangle top boundary.
right - Pointer to a double value receiving the rectangle right boundary.
bottom - Pointer to a double value receiving the rectangle bottom boundary.

Returns:

On success, return TRUE and fill in |left|, |top|, |right|, and |bottom|. If |text_page| is invalid then return FALSE, and the out parameters remain unmodified. If |text_page| is valid but |rect_index| is out of bounds, then return FALSE and set the out parameters to 0.

pub fn get_stroke_color(&self, index: i32, r: &mut u32, g: &mut u32, b: &mut u32, a: &mut u32) -> PdfiumResult<()>

Function: FPDFText_GetStrokeColor

Get the stroke color of a particular character.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.
R - Pointer to an unsigned int number receiving the red value of the stroke color.
G - Pointer to an unsigned int number receiving the green value of the stroke color.
B - Pointer to an unsigned int number receiving the blue value of the stroke color.
A - Pointer to an unsigned int number receiving the alpha value of the stroke color.

Returns:

Whether the call succeeded. If false, |R|, |G|, |B| and |A| are unchanged.

pub fn extract(&self, start_index: i32, count: i32) -> String

Extracts a section of Unicode text from the page as a String.

Parameters:

start_index - Index of the start character.
count - Number of UCS-2 values to be extracted.

Returns:

A String containing the requested part of the text.

Comments:

UTF-16 and UCS-2 are both character encoding schemes for representing Unicode text.
- UCS-2: the 2-byte Universal Character Set encoding.
  - Fixed-length encoding that uses 2 bytes (16 bits) per character.
  - Supports only the Basic Multilingual Plane (BMP), which includes Unicode code points from U+0000 to U+FFFF (65,536 characters).
- UTF-16: the 16-bit Unicode Transformation Format.
  - Variable-length encoding that uses 2 or 4 bytes per character.
  - Can represent all Unicode code points (U+0000 to U+10FFFF), including those outside the BMP.
  - Backward compatible with UCS-2 for BMP characters, as they are encoded identically.
If the page contains UTF-16 4-byte characters, they are handled as two UCS-2 values and may get split up depending on start_index and count. A split produces an invalid UTF-16 sequence, which is returned as REPLACEMENT_CHARACTER. See the test-case.
This function ignores characters without UCS-2 representations. It considers all characters on the page, even those that are not visible when the page has a cropbox. To filter out the characters outside of the cropbox, use FPDF_GetPageBoundingBox() and FPDFText_GetCharBox().
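
The surrogate-splitting behaviour described above can be reproduced with the standard library alone. This is a sketch of the effect, not the crate's implementation: `extract_units` is a hypothetical helper that slices a string by UTF-16 code units, the way a UCS-2 style start_index and count pair would, and then decodes lossily.

```rust
/// Takes `count` UTF-16 code units starting at `start` and decodes them
/// lossily, so an unpaired surrogate becomes U+FFFD (REPLACEMENT CHARACTER).
fn extract_units(text: &str, start: usize, count: usize) -> String {
    let units: Vec<u16> = text.encode_utf16().collect();

    // Clamp the requested range to the available code units.
    let end = start.saturating_add(count).min(units.len());
    let start = start.min(end);

    String::from_utf16_lossy(&units[start..end])
}
```

For the input "a😀b", which occupies four UTF-16 code units, a range covering both halves of the emoji returns it intact, while a range that ends or starts in the middle of the pair yields U+FFFD in place of the broken half.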

pub fn full(&self) -> String

Gets the full text of the page as string.

Example from the repository:

examples/text_extract_search.rs

pub fn example_extract_text() -> PdfiumResult<()> {
    // Load the PDF document from the specified file path
    // The second parameter (None) indicates no password is required
    let document = PdfiumDocument::new_from_path("resources/chapter1.pdf", None)?;

    // Iterate through all pages in the document
    // enumerate() provides both the index and the page object
    for (index, page) in document.pages().enumerate() {
        // Extract the full text content from the current page
        // The ? operators handle potential errors at each step:
        // - page? ensures the page loaded successfully
        // - .text()? extracts text objects from the page
        // - .full() gets the complete text content as a string
        let text = page?.text()?.full();

        // Print formatted output for each page
        println!("Page {}", index + 1); // Pages are 1-indexed for user display
        println!("------");
        println!("{text}");
        println!(); // Empty line for separation between pages
    }

    // Expected output:
    //
    // Page 1
    // ------
    //
    // Page 2
    // ------
    // Ruskin
    // House.
    // 156. Charing
    // Cross Road.
    // London
    // George Allen.
    //
    // Page 3
    // ------
    //
    // Page 4
    // ------
    // I
    // Chapter I.
    // T is a truth universally acknowledged, that a single man in possession of a good
    // fortune must be in want of a wife.
    // However little known the feelings or views of such a man may be on his first
    // entering a neighbourhood, this truth is so well fixed in the minds of the surrounding
    // families, that he is considered as the rightful property of some one or other of their
    // daughters.
    // “My dear Mr. Bennet,” said his lady to him one day, “have you heard that
    // Netherfield Park is let at last?”
    // ...

    Ok(())
}

pub fn get_text_index_from_char_index(&self, n_char_index: i32) -> PdfiumResult<i32>

Get the text index in this PdfiumTextPage’s internal character list.

Parameters:

n_char_index - Index of the character in the internal character list.

Returns:

The index of the text returned from FPDFText_GetText(). -1 for error.

pub fn get_text_object(&self, index: i32) -> PdfiumResult<PdfiumPageObject>

Function: FPDFText_GetTextObject

Get the FPDF_PAGEOBJECT associated with a given character.

Parameters:

index - Zero-based index of the character.

Returns:

The associated text object for the character at index, or NULL on error. The returned text object, if non-null, is of type |FPDF_PAGEOBJ_TEXT|. The caller does not own the returned object.

pub fn get_unicode(&self, index: i32) -> u32

Get Unicode of a character in a page.

Parameters:

index - Zero-based index of the character.

Returns:

The Unicode of the particular character.

Notes:

If a character is not encoded in Unicode and the Foxit engine can’t convert it to Unicode, the return value will be zero.
This does not support UTF-16 4-byte characters.
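
Since the value comes back as a plain u32, turning it into a Rust char takes one extra step. A minimal sketch, assuming a hypothetical helper named `unicode_to_char` that is not part of this crate; it treats zero as "no Unicode value", following the note above, and relies on `char::from_u32` to reject values that are not valid scalar values, such as a lone surrogate.

```rust
/// Converts a raw code value into a `char`.
/// Zero is treated as "no Unicode mapping"; invalid scalar values yield None.
fn unicode_to_char(code: u32) -> Option<char> {
    if code == 0 {
        None
    } else {
        char::from_u32(code)
    }
}
```

A value of 0x47 converts to 'G', while 0 and a lone surrogate such as 0xD83D both convert to None.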

pub fn has_unicode_map_error(&self, index: i32) -> PdfiumResult<bool>

Function: FPDFText_HasUnicodeMapError

Determine whether a character in a page has an invalid Unicode mapping.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.

Returns:

1 if the character has an invalid Unicode mapping. 0 if the character has no known Unicode mapping issues. -1 if there was an error.

pub fn is_generated(&self, index: i32) -> PdfiumResult<bool>

Function: FPDFText_IsGenerated

Determine whether a character in a page is generated by PDFium.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.

Returns:

1 if the character is generated by PDFium. 0 if the character is not generated by PDFium. -1 if there was an error.

pub fn is_hyphen(&self, index: i32) -> PdfiumResult<bool>

Function: FPDFText_IsHyphen

Determine whether a character in a page is a hyphen.

Parameters:

text_page - Handle to a text page information structure. Returned by FPDFText_LoadPage function.
index - Zero-based index of the character.

Returns:

1 if the character is a hyphen. 0 if the character is not a hyphen. -1 if there was an error.