Abbreviation Extractor
Abbreviation Extractor is a high-performance Rust library with Python bindings for extracting abbreviation-definition pairs from text, with a particular focus on biomedical text. It implements an improved version of the Schwartz-Hearst algorithm that aims for higher accuracy and speed, and it is based on the original Python implementation.
Speed Comparison With Other Abbreviation Extraction Libraries
Extraction Accuracy Comparison With Other Abbreviation Extraction Libraries
Features
- Fast and accurate extraction of abbreviation-definition pairs, with built-in tokenization
- Support for both single-threaded and parallel processing
- Python bindings for easy integration with Python projects
- Customizable extraction parameters, such as selecting the most common or the first definition for each abbreviation
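To give a feel for what the extraction does, here is a minimal pure-Python sketch of the classic Schwartz-Hearst matching step. It is not this library's implementation: the function names `is_valid_short_form`, `find_best_long_form`, and `extract_pairs` are hypothetical, the short-form checks are simplified, and the improvements this library makes over the original algorithm are not reproduced.

```python
import re
from typing import List, Optional, Tuple


def is_valid_short_form(short: str) -> bool:
    """Simplified short-form check: 2-10 characters, at most two words,
    at least one letter, and an alphanumeric first character."""
    return (
        2 <= len(short) <= 10
        and len(short.split()) <= 2
        and any(c.isalpha() for c in short)
        and short[0].isalnum()
    )


def find_best_long_form(short: str, candidate: str) -> Optional[str]:
    """Match the short form's characters right to left against the candidate
    text; the first character must start a word in the long form."""
    l_index = len(candidate) - 1
    for s_index in range(len(short) - 1, -1, -1):
        curr = short[s_index].lower()
        if not curr.isalnum():
            continue  # skip punctuation inside the short form
        while (l_index >= 0 and candidate[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None  # ran out of text before matching every character
        l_index -= 1
    # Extend the match back to the start of the word it begins in.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Find 'long form (SHORT)' patterns in a single sentence."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short = match.group(1).strip()
        if not is_valid_short_form(short):
            continue
        words = sentence[: match.start()].split()
        # Search window from the original paper: min(|A| + 5, 2 * |A|) words.
        max_words = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short, candidate)
        if long_form:
            pairs.append((short, long_form))
    return pairs


if __name__ == "__main__":
    text = "The World Health Organization (WHO) is a specialized agency."
    print(extract_pairs(text))
```

The right-to-left scan is what lets the matcher skip the leading "The" in the example sentence: the search stops as soon as every character of the short form has been matched.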
Installation
Rust
Add this to your Cargo.toml:
= "0.1.0"
Python
pip install abbreviation-extractor-rs
Usage
Rust
use ;
let text = "The World Health Organization (WHO) is a specialized agency.";
let options = default;
let result = extract_abbreviation_definition_pairs;
for pair in result
Python
=
=
Customizing Extraction
Python
=
# Get only the most common definition for each abbreviation
=
# Get only the first definition for each abbreviation
=
# Disable tokenization (if the input is already tokenized)
=
# Combine options
=
Rust
use ;
let text = "The World Health Organization (WHO) is a specialized agency. The World Heritage Organization (WHO) is different.";
// Get only the most common definition for each abbreviation
let options = new;
let result = extract_abbreviation_definition_pairs;
// Get only the first definition for each abbreviation
let options = new;
let result = extract_abbreviation_definition_pairs;
// Disable tokenization (if the input is already tokenized)
let options = new;
let result = extract_abbreviation_definition_pairs;
for pair in result
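The difference between the "most common" and "first" options can be illustrated with a small standalone sketch. This is not the library's API: `select_definitions` and its `strategy` parameter are hypothetical names used only to show the selection logic on a list of already extracted `(abbreviation, definition)` tuples.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def select_definitions(
    pairs: Iterable[Tuple[str, str]], strategy: str = "most_common"
) -> Dict[str, str]:
    """Keep one definition per abbreviation.

    strategy="first" keeps the first definition seen in the text;
    strategy="most_common" keeps the most frequent one (ties go to the
    definition that appeared first).
    """
    grouped = defaultdict(list)
    for abbreviation, definition in pairs:
        grouped[abbreviation].append(definition)

    selected = {}
    for abbreviation, definitions in grouped.items():
        if strategy == "first":
            selected[abbreviation] = definitions[0]
        elif strategy == "most_common":
            selected[abbreviation] = Counter(definitions).most_common(1)[0][0]
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return selected


if __name__ == "__main__":
    pairs = [
        ("WHO", "World Health Organization"),
        ("WHO", "World Heritage Organization"),
        ("WHO", "World Heritage Organization"),
    ]
    print(select_definitions(pairs, "first"))
    print(select_definitions(pairs, "most_common"))
```

With the sample pairs above, the two strategies disagree: "first" keeps the definition that appears earliest, while "most_common" keeps the one that occurs twice.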
Benchmark
The comparison below shows how Abbreviation Extractor performs against two other libraries, Schwartz-Hearst (abbreviation-extraction) and ScispaCy, in terms of accuracy and speed.
Performance Comparison of Abbreviation Extractor Against Other Libraries
| Abbreviation | Ground Truth | abbreviation-extractor (This Library) | abbreviation-extraction | ScispaCy |
|---|---|---|---|---|
| '3-meAde' | '3-methyl-adenine' | '3-methyl-adenine' | '3-methyl-adenine' | 'N/A' |
| '5'UTR' | '5' untranslated region' | '5' untranslated region' | 'N/A' | 'N/A' |
| '5LO' | '5-lipoxygenase' | '5-lipoxygenase' | '5-lipoxygenase' | 'N/A' |
| 'AAV' | 'adeno-associated virus' | 'adeno-associated virus' | 'associated virus' | 'adeno-associated virus' |
| 'ACP' | 'Enoyl-acyl carrier protein' | 'Enoyl-acyl carrier protein' | 'acyl carrier protein' | 'Enoyl-acyl carrier protein' |
| 'ADIOL' | '5-androstene-3beta, 17beta-diol' | '5-androstene-3beta, 17beta-diol' | 'androstene-3beta, 17beta-diol' | '5-androstene-3beta, 17beta-diol' |
| 'cAMP' | 'cyclic AMP' | 'cyclic AMP' | 'N/A' | |
| 'ALAD' | '5-aminolaevulinic acid dehydratase' | '5-aminolaevulinic acid dehydratase' | 'N/A' | '5-aminolaevulinic acid dehydratase' |
| 'AMPK' | 'AMP-activated protein kinase' | 'AMP-activated protein kinase' | 'N/A' | 'AMP-activated protein kinase' |
| 'AP' | 'apurinic/apyrimidinic site' | 'apurinic/apyrimidinic site' | 'apyrimidinic site' | 'apurinic/apyrimidinic site' |
| 'AcCoA' | 'acetyl coenzyme A' | 'acetyl coenzyme A' | 'N/A' | 'acetyl coenzyme A' |
| 'Ahr' | 'aryl hydrocarbon receptor' | 'aryl hydrocarbon receptor' | 'N/A' | 'aryl hydrocarbon receptor' |
| 'BD' | 'binding domain' | 'binding domain' | 'N/A' | 'binding domain' |
| '8-OxoG' | '7,8-dihydro-8-oxoguanine' | '7,8-dihydro-8-oxoguanine' | '8-oxoguanine' | 'N/A' |
| 'dsRNA' | 'double-stranded RNA' | 'double-stranded RNA' | 'double-stranded RNA' | 'N/A' |
| 'BERI' | 'Biomolecular Engineering Research Institute' | 'Biomolecular Engineering Research Institute' | 'N/A' | 'Biomolecular Engineering Research Institute' |
| 'CTLs' | 'cytotoxic T lymphocytes' | 'cytotoxic T lymphocytes' | 'N/A' | 'N/A' |
| 'C-RBD' | 'C-terminal RNA binding domain' | 'C-terminal RNA binding domain' | 'N/A' | 'C-terminal RNA binding domain' |
| 'CAP' | 'cyclase-associated protein' | 'cyclase-associated protein' | 'N/A' | 'cyclase-associated protein' |
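One way to turn a table like this into a single score is an exact-match rate per library. The sketch below assumes that scoring rule (the README does not state how accuracy was measured) and uses three rows copied from the table as sample data; `exact_match_rate` is a hypothetical helper.

```python
from typing import Dict, List

# Three rows copied from the table above: ground truth plus each library's output.
ROWS: List[Dict[str, str]] = [
    {
        "truth": "adeno-associated virus",
        "abbreviation-extractor": "adeno-associated virus",
        "abbreviation-extraction": "associated virus",
        "ScispaCy": "adeno-associated virus",
    },
    {
        "truth": "5-lipoxygenase",
        "abbreviation-extractor": "5-lipoxygenase",
        "abbreviation-extraction": "5-lipoxygenase",
        "ScispaCy": "N/A",
    },
    {
        "truth": "binding domain",
        "abbreviation-extractor": "binding domain",
        "abbreviation-extraction": "N/A",
        "ScispaCy": "binding domain",
    },
]


def exact_match_rate(rows: List[Dict[str, str]], library: str) -> float:
    """Fraction of rows where the library's definition equals the ground truth."""
    hits = sum(1 for row in rows if row[library] == row["truth"])
    return hits / len(rows)


if __name__ == "__main__":
    for name in ("abbreviation-extractor", "abbreviation-extraction", "ScispaCy"):
        print(f"{name}: {exact_match_rate(ROWS, name):.2f}")
```

Exact matching is strict: a partial definition such as 'associated virus' counts as a miss, just like 'N/A'.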
Speed Comparison with Other Abbreviation Extraction Libraries
API Reference
For detailed API documentation, please refer to the Rust docs or the Python module docstrings.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License.
Acknowledgements
This library is based on the Schwartz-Hearst algorithm:
The implementation is inspired by the original Python implementation by Phil Gooch: abbreviation-extraction