§Abbreviation Extractor
Abbreviation Extractor is a high-performance Rust library with Python bindings for extracting abbreviation-definition pairs from text, particularly focused on biomedical literature. It implements an improved version of the Schwartz-Hearst algorithm as described in:
A. Schwartz and M. Hearst (2003) A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Biocomputing, 451-462.
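To make the cited algorithm concrete, here is a minimal Python sketch of its central matching step: given a short form and the text preceding it, walk both strings right to left and return the candidate definition. This follows the procedure from the 2003 paper as commonly described, not this library's improved implementation; the function name `find_best_long_form` is chosen for illustration and is not part of the crate's API.

```python
def find_best_long_form(short_form, long_form):
    """Return the trailing part of long_form that matches short_form, or None."""
    l_index = len(long_form) - 1
    # Walk the short form right to left, matching each character in the long form.
    for s_index in range(len(short_form) - 1, -1, -1):
        char = short_form[s_index].lower()
        if not char.isalnum():
            continue  # punctuation in the short form is ignored
        while l_index >= 0:
            matches = long_form[l_index].lower() == char
            # The first short-form character must start a word in the long form.
            mid_word = (
                s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
            )
            if matches and not mid_word:
                break
            l_index -= 1
        if l_index < 0:
            return None  # ran out of long-form text before matching everything
        l_index -= 1
    # Extend back to the start of the word holding the first matched character.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


print(find_best_long_form("WHO", "The World Health Organization"))
```

The right-to-left scan is what lets the match skip leading words such as "The" that are not part of the definition.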
§Overview
This library provides functionality to extract abbreviation-definition pairs from text. It supports both single-threaded and parallel processing, making it suitable for various text processing tasks. The library is designed with a focus on biomedical literature but can be applied to other domains as well.
Key features of the library include:
- Support for parallel processing of large datasets
- Customizable extraction parameters like selecting the most common or first definition for each abbreviation
- Python bindings for easy integration with Python projects
- Tokenization of input text for more accurate extraction
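The difference between keeping the most common definition and keeping the first one can be sketched in a few lines of plain Python. This is an illustration of the two policies only, assuming input as `(abbreviation, definition)` tuples; the helper name `select_definitions` and the tie-breaking rule (earliest definition wins) are assumptions, not the library's documented behavior.

```python
from collections import Counter


def select_definitions(pairs, most_common=False):
    """Pick one definition per abbreviation from (abbreviation, definition) pairs."""
    seen = {}
    for abbreviation, definition in pairs:
        seen.setdefault(abbreviation, []).append(definition)
    if most_common:
        # Counter.most_common orders equal counts by first occurrence,
        # so ties fall back to the earliest definition.
        return {a: Counter(defs).most_common(1)[0][0] for a, defs in seen.items()}
    # Otherwise keep the first definition encountered.
    return {a: defs[0] for a, defs in seen.items()}


pairs = [
    ("WHO", "World Health Organization"),
    ("WHO", "World Heritage Organization"),
    ("WHO", "World Heritage Organization"),
]
print(select_definitions(pairs))
print(select_definitions(pairs, most_common=True))
```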
§Basic Usage
§Rust
use abbreviation_extractor::{extract_abbreviation_definition_pairs, AbbreviationOptions};
let text = "The World Health Organization (WHO) is a specialized agency.";
let options = AbbreviationOptions::default();
let result = extract_abbreviation_definition_pairs(text, options).unwrap();
for pair in result {
    println!("Abbreviation: {}, Definition: {}", pair.abbreviation, pair.definition);
}
§Python
from abbreviation_extractor import extract_abbreviation_definition_pairs
text = "The World Health Organization (WHO) is a specialized agency."
result = extract_abbreviation_definition_pairs(text)
for pair in result:
    print(f"Abbreviation: {pair.abbreviation}, Definition: {pair.definition}")
§Customizing Extraction
You can customize the extraction process using AbbreviationOptions:
use abbreviation_extractor::{extract_abbreviation_definition_pairs, AbbreviationOptions};
let text = "The World Health Organization (WHO) is a specialized agency. \
The World Heritage Organization (WHO) is different.";
// Get only the most common definition for each abbreviation
let options = AbbreviationOptions::new(true, false, true);
let result = extract_abbreviation_definition_pairs(text, options);
// Get only the first definition for each abbreviation
let options = AbbreviationOptions::new(false, true, true);
let result = extract_abbreviation_definition_pairs(text, options);
// Disable tokenization (if the input is already tokenized)
let options = AbbreviationOptions::new(false, false, false);
let result = extract_abbreviation_definition_pairs(text, options);
§Parallel Processing
To process multiple texts in parallel, use the extract_abbreviation_definition_pairs_parallel function:
§Rust
use abbreviation_extractor::{extract_abbreviation_definition_pairs_parallel, AbbreviationOptions};
let texts = vec![
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
];
let options = AbbreviationOptions::default();
let result = extract_abbreviation_definition_pairs_parallel(texts, options);
for extraction in result.extractions {
    println!("Abbreviation: {}, Definition: {}", extraction.abbreviation, extraction.definition);
}
§Python
from abbreviation_extractor import extract_abbreviation_definition_pairs_parallel
texts = [
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
]
result = extract_abbreviation_definition_pairs_parallel(texts)
for extraction in result.extractions:
    print(f"Abbreviation: {extraction.abbreviation}, Definition: {extraction.definition}")
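For readers curious about the general shape of this kind of fan-out, the sketch below runs a per-text function over a list of texts on a thread pool and merges the results in input order. It uses only the Python standard library; `find_parenthesized` is a deliberately naive, hypothetical stand-in for a real extractor, and nothing here reflects how the library schedules its own work internally.

```python
import re
from concurrent.futures import ThreadPoolExecutor

PAREN = re.compile(r"\(([^()]+)\)")


def find_parenthesized(text):
    """Hypothetical stand-in extractor: return the contents of each (...) span."""
    return PAREN.findall(text)


def extract_parallel(texts, num_threads=4):
    """Apply the stand-in extractor to every text and flatten the results."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        per_text = pool.map(find_parenthesized, texts)  # preserves input order
        return [item for items in per_text for item in items]


texts = [
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
]
print(extract_parallel(texts))
```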
§Processing Large Files
To extract abbreviations from large files, use the extract_abbreviations_from_file function:
§Rust
use abbreviation_extractor::{extract_abbreviations_from_file, AbbreviationOptions, FileExtractionOptions};
let file_path = "path/to/your/large/file.txt";
let abbreviation_options = AbbreviationOptions::default();
let file_options = FileExtractionOptions::default();
let result = extract_abbreviations_from_file(file_path, abbreviation_options, file_options);
for extraction in result.extractions {
    println!("Abbreviation: {}, Definition: {}", extraction.abbreviation, extraction.definition);
}
§Python
from abbreviation_extractor import extract_abbreviations_from_file
file_path = "path/to/your/large/file.txt"
result = extract_abbreviations_from_file(file_path)
for extraction in result.extractions:
    print(f"Abbreviation: {extraction.abbreviation}, Definition: {extraction.definition}")
You can customize the file extraction process by specifying additional parameters:
result = extract_abbreviations_from_file(
    file_path,
    most_common_definition=True,
    first_definition=False,
    tokenize=True,
    num_threads=4,
    show_progress=True,
    chunk_size=2048 * 1024,  # 2MB chunks
)
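A chunk size like the one above implies that the file is read in pieces rather than loaded whole. One plausible way to do that is sketched below: accumulate lines until a size threshold is reached, so no line is ever split across two chunks. The generator name `iter_chunks` and the line-boundary rule are assumptions made for illustration; the library's actual chunking strategy is not described here.

```python
import io


def iter_chunks(stream, chunk_size):
    """Yield pieces of roughly chunk_size characters, split only at line ends."""
    lines, size = [], 0
    for line in stream:
        lines.append(line)
        size += len(line)
        if size >= chunk_size:
            yield "".join(lines)
            lines, size = [], 0
    if lines:
        yield "".join(lines)  # final, possibly smaller, chunk


# A small in-memory stream stands in for a large file.
sample = io.StringIO("a (A)\nb (B)\nc (C)\n")
for chunk in iter_chunks(sample, chunk_size=8):
    print(repr(chunk))
```

Splitting only at line ends keeps a parenthesized short form together with the text on the same line that defines it.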
§Functions
The main functions provided by this library are:
- extract_abbreviation_definition_pairs: Extracts abbreviation-definition pairs from a single text.
- extract_abbreviation_definition_pairs_parallel: Extracts abbreviation-definition pairs from multiple texts in parallel.
- extract_abbreviations_from_file: Extracts abbreviation-definition pairs from a large file.
For detailed information on each function, please refer to their individual documentation.
§Structs/Enums
- AbbreviationOptions: Struct for customizing abbreviation extraction.
- FileExtractionOptions: Struct for customizing file extraction with extract_abbreviations_from_file.
- AbbreviationDefinition: Struct for storing abbreviation-definition pairs.
- ExtractionResult: Struct returned by extract_abbreviation_definition_pairs_parallel and extract_abbreviations_from_file.
- ExtractionError: Enum for error handling.
§Modules
- candidate: Defines the Candidate struct used in the extraction process.
- extraction: Contains the core logic for extracting abbreviation-definition pairs.
- utils: Utility functions and regular expressions used in the extraction process.
- abbreviation_definitions: Defines the AbbreviationDefinition and AbbreviationOptions structs.
Re-exports§
pub use abbreviation_definitions::AbbreviationDefinition;
pub use abbreviation_definitions::AbbreviationOptions;
pub use abbreviation_definitions::ExtractionError;
pub use abbreviation_definitions::ExtractionResult;
pub use abbreviation_definitions::FileExtractionOptions;
pub use candidate::Candidate;
pub use extraction::best_candidates;
pub use extraction::extract_abbreviation_definition_pairs;
pub use extraction::extract_abbreviation_definition_pairs_parallel;
pub use extraction::extract_abbreviations_from_file;
pub use extraction::get_definition;
pub use extraction::select_definition;
Modules§
- abbreviation_definitions
- candidate
- extraction: This module contains the core logic for extracting abbreviation-definition pairs from text. It implements the Schwartz-Hearst algorithm for identifying abbreviations and their definitions in biomedical text.
- utils