rustling 0.8.0

Language Models
===============

Rustling provides n-gram language models: an unsmoothed maximum likelihood (MLE) model, plus Lidstone and Laplace models that apply additive smoothing.

MLE (Maximum Likelihood Estimation)
------------------------------------

The :py:class:`~rustling.lm.MLE` model estimates probabilities directly from raw n-gram counts, with no smoothing, so an n-gram that never occurs in the training data gets a probability of zero.

.. code-block:: python

   from rustling.lm import MLE

   model = MLE(order=2)
   model.fit([
       ["the", "cat", "sat"],
       ["the", "dog", "ran"],
   ])

   # Score a word given context
   print(model.score("cat", ["the"]))   # 0.5
   print(model.score("dog", ["the"]))   # 0.5

   # Log probability (base 2)
   print(model.logscore("cat", ["the"]))  # -1.0
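
To make the numbers above concrete, here is a minimal pure-Python sketch of a bigram
maximum likelihood estimate. It is not rustling's implementation: the helper
``mle_score`` is hypothetical, and the sketch assumes that no sentence-boundary
padding symbols are added to the training sentences.

.. code-block:: python

   import math
   from collections import Counter

   sentences = [
       ["the", "cat", "sat"],
       ["the", "dog", "ran"],
   ]

   # Count each bigram and how often each word appears as a context.
   bigram_counts = Counter()
   context_counts = Counter()
   for sentence in sentences:
       for prev, word in zip(sentence, sentence[1:]):
           bigram_counts[(prev, word)] += 1
           context_counts[prev] += 1


   def mle_score(word, context):
       """Relative frequency of ``word`` after the last word of ``context``."""
       prev = context[-1]
       if context_counts[prev] == 0:
           return 0.0  # unseen context: nothing to normalize by
       return bigram_counts[(prev, word)] / context_counts[prev]


   print(mle_score("cat", ["the"]))             # 0.5
   print(math.log2(mle_score("cat", ["the"])))  # -1.0
   print(mle_score("bird", ["the"]))            # 0.0

"the" is followed once by "cat" and once by "dog", so each gets 1/2. The unseen
bigram ("the", "bird") gets exactly zero, which is the case the smoothed models
are meant to handle.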

Lidstone Smoothing
------------------

The :py:class:`~rustling.lm.Lidstone` model adds a constant ``gamma`` to every n-gram count
before normalizing, so that unseen n-grams receive a non-zero probability.

.. code-block:: python

   from rustling.lm import Lidstone

   model = Lidstone(order=2, gamma=0.1)
   model.fit([
       ["the", "cat", "sat"],
       ["the", "dog", "ran"],
   ])

   # Unseen n-grams get non-zero probability
   print(model.score("bird", ["the"]))  # > 0
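
The effect of ``gamma`` can be sketched in pure Python with the textbook additive
smoothing formula ``(count + gamma) / (context_count + gamma * V)``. This is an
illustration, not rustling's code: ``lidstone_score`` is a hypothetical helper, and
the vocabulary size ``V`` is assumed here to be the number of distinct training
words plus one slot for unknown words. Rustling may count its vocabulary
differently, so its scores need not match these values.

.. code-block:: python

   from collections import Counter

   sentences = [
       ["the", "cat", "sat"],
       ["the", "dog", "ran"],
   ]

   bigram_counts = Counter()
   context_counts = Counter()
   vocab = set()
   for sentence in sentences:
       vocab.update(sentence)
       for prev, word in zip(sentence, sentence[1:]):
           bigram_counts[(prev, word)] += 1
           context_counts[prev] += 1

   # Assumption: reserve one extra vocabulary slot for unknown words.
   vocab_size = len(vocab) + 1


   def lidstone_score(word, context, gamma, vocab_size):
       """Additive-smoothed bigram probability of ``word`` after ``context``."""
       prev = context[-1]
       smoothed_count = bigram_counts[(prev, word)] + gamma
       smoothed_total = context_counts[prev] + gamma * vocab_size
       return smoothed_count / smoothed_total


   # Unseen bigram: small but non-zero (roughly 0.038 under these assumptions).
   print(lidstone_score("bird", ["the"], gamma=0.1, vocab_size=vocab_size))
   # Seen bigram: slightly below the MLE value of 0.5 (roughly 0.423).
   print(lidstone_score("cat", ["the"], gamma=0.1, vocab_size=vocab_size))

Because ``gamma`` is added once per vocabulary entry in the denominator, the
smoothed scores for a given context still sum to one.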

Laplace Smoothing
-----------------

The :py:class:`~rustling.lm.Laplace` model is Lidstone smoothing with ``gamma=1``, also known as add-one smoothing.

.. code-block:: python

   from rustling.lm import Laplace

   model = Laplace(order=2)
   model.fit([
       ["the", "cat", "sat"],
       ["the", "dog", "ran"],
   ])

   print(model.score("cat", ["the"]))
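
A short worked example shows how ``gamma=1`` compares with a smaller ``gamma``. It
uses exact fractions and the counts from the training data above: "the" occurs as a
context twice and is followed by "cat" once. The helper ``additive_score`` is
hypothetical, and the vocabulary size of 6 (five distinct words plus one
unknown-word slot) is an assumption, so the value printed by rustling may differ.

.. code-block:: python

   from fractions import Fraction


   def additive_score(count, context_count, gamma, vocab_size):
       """Additive (Lidstone) estimate; gamma=1 gives Laplace smoothing."""
       return Fraction(count + gamma) / (context_count + gamma * vocab_size)


   VOCAB_SIZE = 6  # assumed: 5 distinct training words + 1 unknown-word slot

   # Laplace (gamma=1): seen bigram ("the", "cat") and an unseen bigram.
   print(additive_score(1, 2, gamma=1, vocab_size=VOCAB_SIZE))  # 1/4
   print(additive_score(0, 2, gamma=1, vocab_size=VOCAB_SIZE))  # 1/8

   # Lidstone with gamma=1/10 gives the unseen bigram far less mass.
   print(additive_score(0, 2, gamma=Fraction(1, 10), vocab_size=VOCAB_SIZE))  # 1/26

With so little training data, add-one smoothing moves a large share of the
probability mass to unseen words, which is why a smaller ``gamma`` is often chosen.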

Text Generation
---------------

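All models can generate text by weighted random sampling: each next word is drawn at random, weighted by its probability under the model given the preceding context.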

.. code-block:: python

   from rustling.lm import MLE

   model = MLE(order=2)
   model.fit([
       ["the", "cat", "sat", "on", "the", "mat"],
       ["the", "dog", "ran", "to", "the", "park"],
   ])

   # Generate words with a random seed for reproducibility
   words = model.generate(num_words=5, random_seed=42)
   print(words)

   # Generate with a text seed (starting context)
   words = model.generate(num_words=3, text_seed=["the"], random_seed=42)
   print(words)
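
The idea behind weighted random sampling can be sketched for a bigram model using
only the standard library. This is an assumption-laden illustration rather than
rustling's algorithm: the ``generate`` function below is a hypothetical stand-in
that weights candidates by raw bigram counts, always requires a ``text_seed``, and
stops early when the current word was never followed by anything in training.

.. code-block:: python

   import random
   from collections import Counter, defaultdict

   sentences = [
       ["the", "cat", "sat", "on", "the", "mat"],
       ["the", "dog", "ran", "to", "the", "park"],
   ]

   # successors[prev][word] = number of times ``word`` followed ``prev``.
   successors = defaultdict(Counter)
   for sentence in sentences:
       for prev, word in zip(sentence, sentence[1:]):
           successors[prev][word] += 1


   def generate(num_words, text_seed, random_seed=None):
       """Sample up to ``num_words`` words, each weighted by its bigram count."""
       rng = random.Random(random_seed)  # private RNG, reproducible per seed
       prev = text_seed[-1]
       output = []
       for _ in range(num_words):
           options = successors.get(prev)
           if not options:
               break  # dead end: no observed continuation
           words = list(options)
           weights = list(options.values())
           prev = rng.choices(words, weights=weights, k=1)[0]
           output.append(prev)
       return output


   print(generate(5, ["the"], random_seed=42))

Using a dedicated ``random.Random`` instance means that the same ``random_seed``
yields the same output without touching the global random state.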