Crate abbreviation_extractor

§Abbreviation Extractor

Abbreviation Extractor is a high-performance Rust library with Python bindings for extracting abbreviation-definition pairs from text, particularly focused on biomedical literature. It implements an improved version of the Schwartz-Hearst algorithm as described in:

A. Schwartz and M. Hearst (2003) A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Biocomputing, 451-462.
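For intuition, the core matching step of the original Schwartz-Hearst algorithm can be sketched in a few lines of Python. This is a port of the matching routine published with the paper, not this crate's improved implementation; the function name `find_best_long_form` is chosen here for illustration. It scans the short form right to left, finds each character in the candidate text, and requires the first character to start a word.

```python
def find_best_long_form(short_form, long_form):
    """Return the shortest suffix of long_form that matches short_form, or None.

    Sketch of the matching step from Schwartz and Hearst (2003); this is an
    illustration and not the crate's own code.
    """
    s_index = len(short_form) - 1
    l_index = len(long_form) - 1
    while s_index >= 0:
        curr = short_form[s_index].lower()
        # Non-alphanumeric characters in the short form are ignored.
        if not curr.isalnum():
            s_index -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l_index >= 0 and long_form[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    # Extend back to the beginning of the word containing the first match.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


definition = find_best_long_form("WHO", "The World Health Organization")
no_match = find_best_long_form("XYZ", "World Health Organization")
```

Here `definition` is "World Health Organization" (the leading "The" is dropped), and `no_match` is `None` because no "y" precedes the "z" in the candidate text.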

§Overview

The library extracts abbreviation-definition pairs from text in either single-threaded or parallel mode, so it can handle anything from a single string to a large corpus. It is designed with a focus on biomedical literature but can be applied to other domains as well.

Key features of the library include:

  • Support for parallel processing of large datasets
  • Customizable extraction parameters, such as selecting the most common or the first definition for each abbreviation
  • Python bindings for easy integration with Python projects
  • Tokenization of input text for more accurate extraction

§Basic Usage

§Rust

use abbreviation_extractor::{extract_abbreviation_definition_pairs, AbbreviationOptions};

let text = "The World Health Organization (WHO) is a specialized agency.";
let options = AbbreviationOptions::default();
let result = extract_abbreviation_definition_pairs(text, options).unwrap();

for pair in result {
    println!("Abbreviation: {}, Definition: {}", pair.abbreviation, pair.definition);
}

§Python

from abbreviation_extractor import extract_abbreviation_definition_pairs

text = "The World Health Organization (WHO) is a specialized agency."
result = extract_abbreviation_definition_pairs(text)

for pair in result:
    print(f"Abbreviation: {pair.abbreviation}, Definition: {pair.definition}")
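To illustrate what counts as a candidate abbreviation in a sentence like the one above, here is a minimal sketch of candidate detection. It applies the validity conditions given in the Schwartz-Hearst paper (two to ten characters, at most two words, at least one letter, alphanumeric first character). The names `is_valid_short_form` and `find_candidates` are hypothetical, and the library's own rules may differ.

```python
import re

# Text inside a pair of parentheses, without nested parentheses.
PAREN = re.compile(r"\(([^()]+)\)")


def is_valid_short_form(candidate):
    """Check the short-form conditions from Schwartz and Hearst (2003)."""
    return (
        2 <= len(candidate) <= 10
        and len(candidate.split()) <= 2
        and any(ch.isalpha() for ch in candidate)
        and candidate[0].isalnum()
    )


def find_candidates(text):
    """Return parenthesized strings that look like abbreviations."""
    found = []
    for match in PAREN.finditer(text):
        candidate = match.group(1).strip()
        if is_valid_short_form(candidate):
            found.append(candidate)
    return found


candidates = find_candidates(
    "The World Health Organization (WHO) is a specialized agency (founded in 1948)."
)
```

Only "WHO" survives: "founded in 1948" is longer than ten characters and has more than two words.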

§Customizing Extraction

You can customize the extraction process using AbbreviationOptions:

use abbreviation_extractor::{extract_abbreviation_definition_pairs, AbbreviationOptions};

let text = "The World Health Organization (WHO) is a specialized agency. \
            The World Heritage Organization (WHO) is different.";

// Get only the most common definition for each abbreviation
let options = AbbreviationOptions::new(true, false, true);
let result = extract_abbreviation_definition_pairs(text, options);

// Get only the first definition for each abbreviation
let options = AbbreviationOptions::new(false, true, true);
let result = extract_abbreviation_definition_pairs(text, options);

// Disable tokenization (if the input is already tokenized)
let options = AbbreviationOptions::new(false, false, false);
let result = extract_abbreviation_definition_pairs(text, options);
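The difference between the "most common" and "first" strategies can be shown with a small, library-independent Python sketch. It works on plain `(abbreviation, definition)` tuples; the helper names `select_first` and `select_most_common` are hypothetical, and how the library itself breaks ties is not specified here.

```python
from collections import Counter


def select_first(pairs):
    """Keep the first definition seen for each abbreviation."""
    chosen = {}
    for abbreviation, definition in pairs:
        chosen.setdefault(abbreviation, definition)
    return chosen


def select_most_common(pairs):
    """Keep the most frequent definition for each abbreviation."""
    counts = {}
    for abbreviation, definition in pairs:
        counts.setdefault(abbreviation, Counter())[definition] += 1
    return {
        abbreviation: counter.most_common(1)[0][0]
        for abbreviation, counter in counts.items()
    }


pairs = [
    ("WHO", "World Heritage Organization"),
    ("WHO", "World Health Organization"),
    ("WHO", "World Health Organization"),
]
first = select_first(pairs)
most_common = select_most_common(pairs)
```

With this input, `first` maps "WHO" to "World Heritage Organization", while `most_common` maps it to "World Health Organization".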

§Parallel Processing

For processing multiple texts in parallel, you can use the extract_abbreviation_definition_pairs_parallel function:

§Rust

use abbreviation_extractor::{extract_abbreviation_definition_pairs_parallel, AbbreviationOptions};

let texts = vec![
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
];

let options = AbbreviationOptions::default();
let result = extract_abbreviation_definition_pairs_parallel(texts, options);

for extraction in result.extractions {
    println!("Abbreviation: {}, Definition: {}", extraction.abbreviation, extraction.definition);
}

§Python

from abbreviation_extractor import extract_abbreviation_definition_pairs_parallel

texts = [
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
]

result = extract_abbreviation_definition_pairs_parallel(texts)

for extraction in result.extractions:
    print(f"Abbreviation: {extraction.abbreviation}, Definition: {extraction.definition}")
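Conceptually, parallel extraction fans the texts out to workers and then merges the per-text results. The sketch below shows that shape using only the Python standard library. `toy_extract` is a deliberately naive, hypothetical stand-in for the real extractor, and a thread pool is used only to show the pattern (in CPython, threads do not speed up pure-Python work); it makes no claim about how the library schedules its work.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def toy_extract(text):
    """Naive stand-in: pair each "(ABBR)" with as many preceding words as it has letters."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        abbreviation = match.group(1)
        words = text[: match.start()].split()
        pairs.append((abbreviation, " ".join(words[-len(abbreviation):])))
    return pairs


def extract_parallel(texts, max_workers=4):
    """Run toy_extract on every text and flatten the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_text = list(pool.map(toy_extract, texts))
    return [pair for pairs in per_text for pair in pairs]


texts = [
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
]
results = extract_parallel(texts)
```

Because `pool.map` preserves input order, `results` lists the WHO, UN, and EU pairs in the same order as the texts.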

§Processing Large Files

For extracting abbreviations from large files, you can use the extract_abbreviations_from_file function:

§Rust

use abbreviation_extractor::{extract_abbreviations_from_file, AbbreviationOptions, FileExtractionOptions};

let file_path = "path/to/your/large/file.txt";
let abbreviation_options = AbbreviationOptions::default();
let file_options = FileExtractionOptions::default();

let result = extract_abbreviations_from_file(file_path, abbreviation_options, file_options);

for extraction in result.extractions {
    println!("Abbreviation: {}, Definition: {}", extraction.abbreviation, extraction.definition);
}

§Python

from abbreviation_extractor import extract_abbreviations_from_file

file_path = "path/to/your/large/file.txt"
result = extract_abbreviations_from_file(file_path)

for extraction in result.extractions:
    print(f"Abbreviation: {extraction.abbreviation}, Definition: {extraction.definition}")

You can customize the file extraction process by specifying additional parameters:

result = extract_abbreviations_from_file(
    file_path,
    most_common_definition=True,
    first_definition=False,
    tokenize=True,
    num_threads=4,
    show_progress=True,
    chunk_size=2048 * 1024  # 2MB chunks
)
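To give a feel for what a chunk size means when reading a large file, here is a minimal sketch of chunked reading that extends each chunk to the end of the current line, so no sentence is split across chunks. The function name `read_line_aligned_chunks` is hypothetical, the size is counted in characters rather than bytes, and this is only one plausible strategy, not a description of the library's internals.

```python
import os
import tempfile


def read_line_aligned_chunks(path, chunk_size):
    """Yield pieces of roughly chunk_size characters, each ending at a line break."""
    with open(path, "r", encoding="utf-8") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Read on to the end of the current line so text is not cut mid-line.
            yield chunk + handle.readline()


content = "alpha beta (AB)\ngamma delta (GD)\nepsilon zeta (EZ)\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    chunks = list(read_line_aligned_chunks(path, chunk_size=10))
```

Each chunk can then be handed to a worker independently, and concatenating the chunks reproduces the original file content.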

§Functions

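The main functions provided by this library are:

  • extract_abbreviation_definition_pairs: Extracts abbreviation-definition pairs from a single text
  • extract_abbreviation_definition_pairs_parallel: Extracts abbreviation-definition pairs from multiple texts in parallel
  • extract_abbreviations_from_file: Extracts abbreviation-definition pairs from a large file

For detailed information on each function, please refer to its individual documentation.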

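§Structs/Enums

  • AbbreviationDefinition: An extracted abbreviation together with its definition
  • AbbreviationOptions: Options for customizing the extraction process
  • FileExtractionOptions: Options for extracting abbreviations from a file
  • ExtractionResult: The result of parallel or file-based extraction, holding the extractions
  • ExtractionError: Errors that can occur during extraction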

§Modules

  • candidate: Defines the Candidate struct used in the extraction process
  • extraction: Contains the core logic for extracting abbreviation-definition pairs
  • utils: Utility functions and regular expressions used in the extraction process
  • abbreviation_definitions: Defines the AbbreviationDefinition and AbbreviationOptions structs

Re-exports§

pub use abbreviation_definitions::AbbreviationDefinition;
pub use abbreviation_definitions::AbbreviationOptions;
pub use abbreviation_definitions::ExtractionError;
pub use abbreviation_definitions::ExtractionResult;
pub use abbreviation_definitions::FileExtractionOptions;
pub use candidate::Candidate;
pub use extraction::best_candidates;
pub use extraction::extract_abbreviation_definition_pairs;
pub use extraction::extract_abbreviation_definition_pairs_parallel;
pub use extraction::extract_abbreviations_from_file;
pub use extraction::get_definition;
pub use extraction::select_definition;

Modules§

abbreviation_definitions
candidate
extraction
This module contains the core logic for extracting abbreviation-definition pairs from text. It implements the Schwartz-Hearst algorithm for identifying abbreviations and their definitions in biomedical text, as described by Schwartz and Hearst (2003).
utils