Module string_kernels


String Kernel Approximations

This module implements various string kernel approximation methods for sequence and text analysis. String kernels measure similarity between sequences of symbols (characters, words, etc.) by counting shared subsequences or n-grams.

§Key Features

  • N-gram Kernels: Count shared n-grams between sequences
  • Spectrum Kernels: Fixed-length contiguous substring kernels
  • Subsequence Kernels: Count all shared subsequences with gaps
  • Edit Distance Approximations: Approximate edit distance kernels
  • Mismatch Kernels: Allow for mismatches in n-gram comparisons
  • Weighted Subsequence Kernels: Weight subsequences by length and gaps
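To make the subsequence-with-gaps idea concrete, the following is a minimal sketch of a gap-weighted subsequence kernel in the style of Lodhi et al. (2002). It is not this module's implementation: the function name `subsequence_kernel`, its signature, and the decay parameter `lambda` are assumptions chosen for illustration. Each shared subsequence of length `k` contributes `lambda` raised to the number of characters it spans in `s` plus the number it spans in `t`, so occurrences with large gaps count less. The sketch uses the straightforward recursion, which costs O(k · |s| · |t|²); faster formulations exist.

```rust
/// Illustrative gap-weighted subsequence kernel K_k(s, t).
/// Hypothetical helper, not part of this module's API.
fn subsequence_kernel(s: &str, t: &str, k: usize, lambda: f64) -> f64 {
    let s: Vec<char> = s.chars().collect();
    let t: Vec<char> = t.chars().collect();
    let (m, n) = (s.len(), t.len());
    // No subsequence of length k can be shared if either string is shorter.
    if k == 0 || m < k || n < k {
        return 0.0;
    }

    // prev[a][b] holds the auxiliary value K'_{i-1} for the prefixes
    // s[..a] and t[..b]. K'_0 is 1 everywhere.
    let mut prev = vec![vec![1.0_f64; n + 1]; m + 1];

    for i in 1..k {
        // cur[a][b] = K'_i(s[..a], t[..b]); zero when a prefix is shorter than i.
        let mut cur = vec![vec![0.0_f64; n + 1]; m + 1];
        for a in i..=m {
            for b in i..=n {
                // Extending s by one character lengthens every open span by 1.
                let mut sum = lambda * cur[a - 1][b];
                // Match the last character of s[..a] against each position j in t[..b].
                for j in 1..=b {
                    if t[j - 1] == s[a - 1] {
                        sum += prev[a - 1][j - 1] * lambda.powi((b - j + 2) as i32);
                    }
                }
                cur[a][b] = sum;
            }
        }
        prev = cur;
    }

    // Close each subsequence with a final matching pair of characters.
    let mut total = 0.0;
    for a in 1..=m {
        for j in 1..=n {
            if s[a - 1] == t[j - 1] {
                total += prev[a - 1][j - 1] * lambda * lambda;
            }
        }
    }
    total
}
```

For example, with `lambda = 0.5` and `k = 2`, the strings "cat" and "car" share only the subsequence "ca", which spans two characters in each string, so the kernel value is 0.5^4 = 0.0625.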

§Mathematical Background

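The string kernel between sequences s and t is the inner product of their feature vectors: K(s, t) = Σ_u φ(s)[u] · φ(t)[u]

Here the sum runs over all substrings u, and φ(s)[u], the component of the feature map φ(s) indexed by u, counts the occurrences of u in s.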
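The formula above can be sketched directly for the spectrum (contiguous n-gram) case. The helpers below are illustrative only: the names `ngram_counts` and `spectrum_kernel` are assumptions made for this example and are not the API of `SpectrumKernel` or `NGramKernel`. The sketch builds φ(s) as a map from each character n-gram to its count, then sums the products of counts over the n-grams that appear in both strings.

```rust
use std::collections::HashMap;

/// Feature map φ(s): counts of every contiguous character n-gram in `s`.
/// Hypothetical helper, not part of this module's API.
fn ngram_counts(s: &str, n: usize) -> HashMap<String, usize> {
    let chars: Vec<char> = s.chars().collect();
    let mut counts = HashMap::new();
    // `windows(0)` would panic, and short strings contain no n-grams.
    if n == 0 || chars.len() < n {
        return counts;
    }
    for window in chars.windows(n) {
        let gram: String = window.iter().collect();
        *counts.entry(gram).or_insert(0) += 1;
    }
    counts
}

/// K(s, t) = Σ_u φ(s)[u] · φ(t)[u], where u ranges over n-grams.
/// N-grams missing from either string contribute zero to the sum.
fn spectrum_kernel(s: &str, t: &str, n: usize) -> usize {
    let phi_s = ngram_counts(s, n);
    let phi_t = ngram_counts(t, n);
    phi_s
        .iter()
        .map(|(u, &count_s)| count_s * phi_t.get(u).copied().unwrap_or(0))
        .sum()
}
```

For instance, with n = 2 the strings "banana" and "bandana" share the 2-grams "ba", "an", and "na" with counts (1, 1), (2, 2), and (2, 1), so the kernel value is 1·1 + 2·2 + 2·1 = 7.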

§References

  • Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis
  • Lodhi, H., et al. (2002). Text classification using string kernels

Structs§

EditDistanceKernel
Edit distance approximation kernel
FittedEditDistanceKernel
Fitted edit distance kernel
FittedMismatchKernel
Fitted mismatch kernel
FittedNGramKernel
Fitted n-gram kernel
FittedSpectrumKernel
Fitted spectrum kernel
FittedSubsequenceKernel
Fitted subsequence kernel (computes full kernel matrix)
MismatchKernel
Mismatch kernel that allows k mismatches in n-grams
NGramKernel
N-gram kernel for sequences
SpectrumKernel
Spectrum kernel for fixed-length contiguous substrings
SubsequenceKernel
Subsequence kernel that counts all shared subsequences (with gaps)

Enums§

NGramMode
N-gram extraction mode