pub struct KeywordExtractionConfig<'a> {
pub sentence_embeddings_config: SentenceEmbeddingsConfig,
pub tokenizer_stopwords: Option<HashSet<&'a str>>,
pub tokenizer_pattern: Option<Regex>,
pub tokenizer_forbidden_ngram_chars: Option<&'a [char]>,
pub scorer_type: KeywordScorerType,
pub ngram_range: (usize, usize),
pub num_keywords: usize,
pub diversity: Option<f64>,
pub max_sum_candidates: Option<usize>,
}
Fields
sentence_embeddings_config: SentenceEmbeddingsConfig
SentenceEmbeddingsConfig defining the sentence embeddings model to use.
tokenizer_stopwords: Option<HashSet<&'a str>>
Optional set of tokenizer stopwords to exclude from the keyword candidate list. Defaults to a list of English stopwords.
tokenizer_pattern: Option<Regex>
Optional tokenization regex pattern. Defaults to a sequence of word characters.
tokenizer_forbidden_ngram_chars: Option<&'a [char]>
Optional list of characters that should not be included in ngrams (useful to filter out ngrams spanning punctuation marks).
scorer_type: KeywordScorerType
KeywordScorerType used to rank keyword candidates.
ngram_range: (usize, usize)
N-gram range (inclusive) for keywords. (1, 2) would consider all 1- and 2-word grams as keyword candidates.
num_keywords: usize
Number of keywords to return.
diversity: Option<f64>
Optional diversity parameter used for the MaximalMarginRelevance ranker; defaults to 0.5. A higher diversity (closer to 1.0) gives more weight to returning varied keywords, at the cost of lower relevance to the original document.
max_sum_candidates: Option<usize>
Optional number of candidate sets considered by the MaxSum ranker. Higher values are more likely to identify a global optimum for the ranker criterion, but are also more likely to include sets that are less relevant to the input document. Larger values also carry a higher computational and memory cost (scaling as N²).
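A minimal sketch of how this config might be assembled and used with the keyword extraction pipeline. It assumes the rust-bert pipeline conventions: a `Default` implementation for the remaining fields, a `From<SentenceEmbeddingsModelType>` conversion for `SentenceEmbeddingsConfig`, and a `KeywordExtractionModel` with a `predict` method returning keywords with `text` and `score` fields; verify these against the crate version you use.

```rust
use rust_bert::pipelines::keywords_extraction::{
    KeywordExtractionConfig, KeywordExtractionModel, KeywordScorerType,
};
use rust_bert::pipelines::sentence_embeddings::{
    SentenceEmbeddingsConfig, SentenceEmbeddingsModelType,
};

fn main() -> anyhow::Result<()> {
    // Assemble a config: MaximalMarginRelevance ranking with a slightly
    // higher diversity than the 0.5 default, considering unigrams and
    // bigrams as keyword candidates.
    let config = KeywordExtractionConfig {
        sentence_embeddings_config: SentenceEmbeddingsConfig::from(
            SentenceEmbeddingsModelType::AllMiniLmL6V2,
        ),
        scorer_type: KeywordScorerType::MaximalMarginRelevance,
        ngram_range: (1, 2),
        num_keywords: 5,
        diversity: Some(0.7),
        ..Default::default()
    };

    // Loading the model fetches the sentence-embeddings weights on first use.
    let model = KeywordExtractionModel::new(config)?;
    let output = model.predict(&[
        "Rust is a multi-paradigm programming language focused on \
         performance and safety.",
    ])?;
    for keyword in &output[0] {
        println!("{} ({:.3})", keyword.text, keyword.score);
    }
    Ok(())
}
```

With `MaxSum` as the `scorer_type`, `max_sum_candidates` would control the candidate-set budget instead of `diversity`.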