§Abbreviation Extractor
Abbreviation Extractor is a high-performance Rust library with Python bindings for extracting abbreviation-definition pairs from text, particularly focused on biomedical literature. It implements an improved version of the Schwartz-Hearst algorithm as described in:
A. Schwartz and M. Hearst (2003) A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Biocomputing, 451-462.
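To make the cited algorithm more concrete, the following is a minimal sketch of its core matching step as the paper describes it: walk the short form from right to left, find each character in the text that precedes the parentheses, and require the first character to start a word. The function name `best_long_form` is hypothetical and is not part of this crate's API; the crate describes its own implementation as an improved version, so its details may differ.

```rust
/// Hypothetical helper (not part of this crate's API): given a short form and
/// the text preceding it, return the trailing part of `candidate` that the
/// short form matches when scanned right to left.
pub fn best_long_form(short_form: &str, candidate: &str) -> Option<String> {
    let short: Vec<char> = short_form.chars().collect();
    let long: Vec<char> = candidate.chars().collect();
    // Signed cursor into `long` so that it can step past the start (-1).
    let mut l: isize = long.len() as isize - 1;

    // Consume the short form from its last character to its first.
    for s in (0..short.len()).rev() {
        let c = short[s].to_ascii_lowercase(); // ASCII-only case folding
        if !c.is_alphanumeric() {
            continue; // punctuation in the short form is ignored
        }
        loop {
            if l < 0 {
                return None; // ran out of candidate text
            }
            let same = long[l as usize].to_ascii_lowercase() == c;
            let word_start = l == 0 || !long[(l - 1) as usize].is_alphanumeric();
            // The first short-form character must begin a word.
            if same && (s != 0 || word_start) {
                break;
            }
            l -= 1;
        }
        l -= 1;
    }

    // Extend left to the beginning of the word holding the first match.
    let mut start = (l + 1) as usize;
    while start > 0 && long[start - 1] != ' ' {
        start -= 1;
    }
    Some(long[start..].iter().collect())
}
```

Called with `"WHO"` and `"The World Health Organization"`, this sketch returns `Some("World Health Organization")`; it returns `None` when some short-form character cannot be found.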
§Overview
This library provides functionality to extract abbreviation-definition pairs from text. It supports both single-threaded and parallel processing, making it suitable for various text processing tasks. The library is designed with a focus on biomedical literature but can be applied to other domains as well.
Key features of the library include:
- Support for parallel processing of large datasets
- Customizable extraction parameters like selecting the most common or first definition for each abbreviation
- Python bindings for easy integration with Python projects
- Tokenization of input text for more accurate extraction
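The difference between keeping the first definition and keeping the most common definition can be sketched with plain standard-library code. Everything here is an assumption made for illustration: `Pair`, `first_definitions`, and `most_common_definitions` are hypothetical names, and the tie-breaking rule (earliest definition wins) is a choice of this sketch, not documented behavior of the crate.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for an extracted pair (the crate's own type is
/// `AbbreviationDefinition`, whose full layout is not shown here).
#[derive(Debug, Clone, PartialEq)]
pub struct Pair {
    pub abbreviation: String,
    pub definition: String,
}

impl Pair {
    pub fn new(abbreviation: &str, definition: &str) -> Self {
        Pair {
            abbreviation: abbreviation.to_string(),
            definition: definition.to_string(),
        }
    }
}

/// Keep only the first definition seen for each abbreviation, in input order.
pub fn first_definitions(pairs: &[Pair]) -> Vec<Pair> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut out = Vec::new();
    for p in pairs {
        // `insert` returns true only the first time an abbreviation appears.
        if seen.insert(p.abbreviation.as_str()) {
            out.push(p.clone());
        }
    }
    out
}

/// Keep the most frequent definition for each abbreviation.
/// Ties go to the definition that appears first (an assumption of this sketch).
pub fn most_common_definitions(pairs: &[Pair]) -> Vec<Pair> {
    // Count how often each (abbreviation, definition) combination occurs.
    let mut counts: HashMap<(&str, &str), usize> = HashMap::new();
    for p in pairs {
        *counts
            .entry((p.abbreviation.as_str(), p.definition.as_str()))
            .or_insert(0) += 1;
    }
    // Walk the input in order so the output order is deterministic.
    let mut best: Vec<(&Pair, usize)> = Vec::new();
    for p in pairs {
        let n = counts[&(p.abbreviation.as_str(), p.definition.as_str())];
        match best.iter().position(|(b, _)| b.abbreviation == p.abbreviation) {
            Some(i) => {
                if n > best[i].1 {
                    best[i] = (p, n);
                }
            }
            None => best.push((p, n)),
        }
    }
    best.into_iter().map(|(p, _)| p.clone()).collect()
}
```

For an input where "WHO" is defined once as "World Health Organization" and twice as "World Heritage Organization", the first strategy keeps the former and the second keeps the latter.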
§Basic Usage
§Rust
use abbreviation_extractor::{extract_abbreviation_definition_pairs, AbbreviationOptions};
let text = "The World Health Organization (WHO) is a specialized agency.";
let options = AbbreviationOptions::default();
let result = extract_abbreviation_definition_pairs(text, options).unwrap();
for pair in result {
    println!("Abbreviation: {}, Definition: {}", pair.abbreviation, pair.definition);
}
§Python
from abbreviation_extractor import extract_abbreviation_definition_pairs
text = "The World Health Organization (WHO) is a specialized agency."
result = extract_abbreviation_definition_pairs(text)
for pair in result:
    print(f"Abbreviation: {pair.abbreviation}, Definition: {pair.definition}")
§Customizing Extraction
You can customize the extraction process using AbbreviationOptions:
use abbreviation_extractor::{extract_abbreviation_definition_pairs, AbbreviationOptions};
let text = "The World Health Organization (WHO) is a specialized agency. \
            The World Heritage Organization (WHO) is different.";
// Get only the most common definition for each abbreviation
let options = AbbreviationOptions::new(true, false, true);
let result = extract_abbreviation_definition_pairs(text, options);
// Get only the first definition for each abbreviation
let options = AbbreviationOptions::new(false, true, true);
let result = extract_abbreviation_definition_pairs(text, options);
// Disable tokenization (if the input is already tokenized)
let options = AbbreviationOptions::new(false, false, false);
let result = extract_abbreviation_definition_pairs(text, options);
§Parallel Processing
For processing multiple texts in parallel, you can use the extract_abbreviation_definition_pairs_parallel function:
§Rust
use abbreviation_extractor::{extract_abbreviation_definition_pairs_parallel, AbbreviationOptions};
let texts = vec![
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
];
let options = AbbreviationOptions::default();
let result = extract_abbreviation_definition_pairs_parallel(texts, options);
for extraction in result.extractions {
    println!("Abbreviation: {}, Definition: {}", extraction.abbreviation, extraction.definition);
}
§Python
from abbreviation_extractor import extract_abbreviation_definition_pairs_parallel
texts = [
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
]
result = extract_abbreviation_definition_pairs_parallel(texts)
for extraction in result.extractions:
    print(f"Abbreviation: {extraction.abbreviation}, Definition: {extraction.definition}")
§Processing Large Files
For extracting abbreviations from large files, you can use the extract_abbreviations_from_file function:
§Rust
use abbreviation_extractor::{extract_abbreviations_from_file, AbbreviationOptions, FileExtractionOptions};
let file_path = "path/to/your/large/file.txt";
let abbreviation_options = AbbreviationOptions::default();
let file_options = FileExtractionOptions::default();
let result = extract_abbreviations_from_file(file_path, abbreviation_options, file_options);
for extraction in result.extractions {
    println!("Abbreviation: {}, Definition: {}", extraction.abbreviation, extraction.definition);
}
§Python
from abbreviation_extractor import extract_abbreviations_from_file
file_path = "path/to/your/large/file.txt"
result = extract_abbreviations_from_file(file_path)
for extraction in result.extractions:
    print(f"Abbreviation: {extraction.abbreviation}, Definition: {extraction.definition}")
You can customize the file extraction process by specifying additional parameters:
result = extract_abbreviations_from_file(
    file_path,
    most_common_definition=True,
    first_definition=False,
    tokenize=True,
    num_threads=4,
    show_progress=True,
    chunk_size=2048 * 1024  # 2MB chunks
)
§Functions
The main functions provided by this library are:
- extract_abbreviation_definition_pairs: Extracts abbreviation-definition pairs from a single text.
- extract_abbreviation_definition_pairs_parallel: Extracts abbreviation-definition pairs from multiple texts in parallel.
- extract_abbreviations_from_file: Extracts abbreviation-definition pairs from a large file.
For detailed information on each function, please refer to their individual documentation.
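As a rough picture of what processing multiple texts in parallel involves, the sketch below splits a slice of texts across scoped standard-library threads and gathers the per-text results in input order. It is only an illustration under stated assumptions: `map_texts_parallel` is a hypothetical helper, the per-text function `f` stands in for a single-text extractor, and the crate may well use a different threading mechanism internally.

```rust
use std::thread;

/// Hypothetical helper (not part of this crate's API): apply `f` to every
/// text using up to `num_threads` worker threads, preserving input order.
pub fn map_texts_parallel<T, F>(texts: &[&str], num_threads: usize, f: F) -> Vec<T>
where
    T: Send,
    F: Fn(&str) -> T + Sync,
{
    let num_threads = num_threads.max(1);
    // Ceiling division; `max(1)` keeps `chunks` from panicking on empty input.
    let chunk_len = ((texts.len() + num_threads - 1) / num_threads).max(1);

    thread::scope(|scope| {
        // Spawn one worker per contiguous chunk of texts before joining any.
        let handles: Vec<_> = texts
            .chunks(chunk_len)
            .map(|chunk| {
                let f = &f;
                scope.spawn(move || chunk.iter().map(|&t| f(t)).collect::<Vec<T>>())
            })
            .collect();

        // Joining in spawn order keeps results aligned with the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Any function of type `Fn(&str) -> T` can be plugged in for `f`; scoped threads let the workers borrow the input slice without cloning the texts.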
§Structs/Enums
- AbbreviationOptions: Defines the AbbreviationOptions struct for customizing abbreviation extraction
- FileExtractionOptions: Defines the FileExtractionOptions struct for customizing file extraction for extract_abbreviations_from_file
- AbbreviationDefinition: Defines the AbbreviationDefinition struct for storing abbreviation-definition pairs
- ExtractionResult: Defines the ExtractionResult struct returned by extract_abbreviation_definition_pairs_parallel and extract_abbreviations_from_file
- ExtractionError: Defines the ExtractionError enum for error handling
§Modules
- candidate: Defines the Candidate struct used in the extraction process
- extraction: Contains the core logic for extracting abbreviation-definition pairs
- utils: Utility functions and regular expressions used in the extraction process
- abbreviation_definitions: Defines the AbbreviationDefinition and AbbreviationOptions structs
Re-exports§
pub use abbreviation_definitions::AbbreviationDefinition;
pub use abbreviation_definitions::AbbreviationOptions;
pub use abbreviation_definitions::ExtractionError;
pub use abbreviation_definitions::ExtractionResult;
pub use abbreviation_definitions::FileExtractionOptions;
pub use candidate::Candidate;
pub use extraction::best_candidates;
pub use extraction::extract_abbreviation_definition_pairs;
pub use extraction::extract_abbreviation_definition_pairs_parallel;
pub use extraction::extract_abbreviations_from_file;
pub use extraction::get_definition;
pub use extraction::select_definition;
Modules§
- abbreviation_definitions
- candidate
- extraction: This module contains the core logic for extracting abbreviation-definition pairs from text. It implements the Schwartz-Hearst algorithm for identifying abbreviations and their definitions in biomedical text, as described in Schwartz and Hearst (2003).
- utils