oxillama-py 0.1.3

Python bindings for OxiLLaMa LLM inference engine

Progress hooks
==============

Every ``generate*`` method on :class:`oxillama_py.Engine`,
:class:`oxillama_py.SpeculativeEngine`, and :class:`oxillama_py.AsyncEngine`
accepts a ``progress=`` keyword argument that drives a polymorphic progress
display.  The hook is throttled on the Rust side (default: 50 ms or 4 tokens,
whichever comes first) and is finalised exactly once, even when generation
ends through a Python exception, cancellation, or end-of-sequence.

Accepted argument types
-----------------------

``progress=`` accepts any of the following:

* ``None`` — no progress reporting (the default).
* Any ``tqdm``/``tqdm.notebook.tqdm`` instance — duck-typed via
  ``update``/``set_postfix_str``/``close``.
* Any ``ipywidgets.IntProgress`` (or ``FloatProgress``) instance — duck-typed
  via ``value``/``max`` plus ``"Progress"`` in the class name.
* Any ``Callable[[ProgressEvent], None]`` — invoked once per throttled tick
  (and always for the first and last tokens).

The Rust side dispatches to the appropriate adapter via
:func:`oxillama_py.progress.make_progress_adapter` so all duck-typing logic
lives in pure Python.
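
The duck-typing rules listed above can be approximated in a few lines.  The
sketch below only illustrates the checks described on this page:
``classify_progress_target`` is a hypothetical helper, not part of
``oxillama_py``, and both the order of the checks and the ``TypeError`` for
unsupported values are assumptions.  The real
:func:`oxillama_py.progress.make_progress_adapter` may behave differently.

.. code-block:: python

   def classify_progress_target(target):
       """Label the adapter kind a ``progress=`` value would map to.

       Hypothetical helper mirroring the documented rules; not library code.
       """
       if target is None:
           return "none"
       # tqdm-like: exposes update / set_postfix_str / close.
       if all(
           callable(getattr(target, name, None))
           for name in ("update", "set_postfix_str", "close")
       ):
           return "tqdm"
       # ipywidgets-like: has value / max and "Progress" in the class name.
       if (
           hasattr(target, "value")
           and hasattr(target, "max")
           and "Progress" in type(target).__name__
       ):
           return "widget"
       if callable(target):
           return "callable"
       # Assumption: anything else is rejected.
       raise TypeError(f"unsupported progress target: {type(target).__name__}")


   class FakeBar:
       """Stand-in for a tqdm instance."""

       def update(self, n=1): ...
       def set_postfix_str(self, s): ...
       def close(self): ...


   class FakeIntProgress:
       """Stand-in for ipywidgets.IntProgress."""

       value = 0
       max = 128


   print(classify_progress_target(None))               # none
   print(classify_progress_target(FakeBar()))          # tqdm
   print(classify_progress_target(FakeIntProgress()))  # widget
   print(classify_progress_target(lambda event: None)) # callable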

ProgressEvent contract
----------------------

.. autoclass:: oxillama_py.progress.ProgressEvent
   :members:

Examples
--------

tqdm in a notebook
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from tqdm.auto import tqdm
   from oxillama_py import Engine, EngineConfig

   engine = Engine(EngineConfig("model.gguf"))
   engine.load_model()

   with tqdm(desc="Generating", unit="tok") as bar:
       text = engine.generate("Hello", max_tokens=128, progress=bar)

ipywidgets progress widget
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import ipywidgets as widgets
   from IPython.display import display

   # ``engine`` is the loaded Engine from the tqdm example above.
   bar = widgets.IntProgress(min=0, max=128, description="Generating")
   display(bar)
   text = engine.generate("Hello", max_tokens=128, progress=bar)
   # When generation completes the bar snaps to 100 % and ``bar_style``
   # becomes "success".  On cancellation it becomes "warning"; on error,
   # "danger".

Custom callable
~~~~~~~~~~~~~~~

.. code-block:: python

   def on_progress(event):
       print(
           f"{event.tokens_generated}/{event.tokens_total}: "
           f"{event.tokens_per_sec:.1f} tok/s"
       )

   text = engine.generate("Hello", max_tokens=128, progress=on_progress)
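
A callable does not have to be a plain function.  As a sketch, the class
below keeps the last event and the peak throughput so they can be inspected
after generation.  It relies only on the ``tokens_generated``,
``tokens_per_sec`` and ``is_final`` attributes used elsewhere on this page.
``ProgressRecorder`` is a hypothetical name, and ``SimpleNamespace`` objects
stand in for real ``ProgressEvent`` instances so that the snippet runs
without a model.

.. code-block:: python

   from types import SimpleNamespace


   class ProgressRecorder:
       """Callable that records progress events (illustrative, not library code)."""

       def __init__(self):
           self.ticks = 0
           self.peak_tokens_per_sec = 0.0
           self.last_event = None

       def __call__(self, event):
           self.ticks += 1
           self.peak_tokens_per_sec = max(
               self.peak_tokens_per_sec, event.tokens_per_sec
           )
           self.last_event = event


   recorder = ProgressRecorder()
   # Simulated events; a real run would pass ``progress=recorder`` to ``generate``.
   for n, rate, final in [(1, 10.0, False), (5, 42.5, False), (8, 40.0, True)]:
       recorder(
           SimpleNamespace(tokens_generated=n, tokens_per_sec=rate, is_final=final)
       )

   print(recorder.ticks, recorder.peak_tokens_per_sec, recorder.last_event.is_final)
   # 3 42.5 True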

Tuning the throttle
~~~~~~~~~~~~~~~~~~~

Two keyword arguments tune the throttle gates (both default to ``None``,
which falls back to 50 ms / 4 tokens):

* ``progress_throttle_ms`` — minimum milliseconds between consecutive
  callback fires.
* ``progress_throttle_tokens`` — minimum number of decoded tokens between
  consecutive callback fires.

The first decoded token always fires, and a synthetic final event always
fires after generation completes (with ``is_final=True``).
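
To make the gate behaviour concrete, here is a toy Python model of the rule
described above.  The real throttle lives on the Rust side; ``ThrottleGate``
is a hypothetical class, and reading the two gates as "fire when either
threshold is reached" is an assumption based on the "whichever first" wording.
The synthetic final event is not modelled, since it fires regardless of the
gate.

.. code-block:: python

   class ThrottleGate:
       """Toy model of the documented throttle; not the library's implementation."""

       def __init__(self, throttle_ms=None, throttle_tokens=None):
           # ``None`` falls back to the documented defaults (50 ms / 4 tokens).
           self.throttle_ms = 50 if throttle_ms is None else throttle_ms
           self.throttle_tokens = 4 if throttle_tokens is None else throttle_tokens
           self.last_fire_ms = None
           self.tokens_since_fire = 0

       def should_fire(self, now_ms):
           """Call once per decoded token with a monotonic timestamp in ms."""
           self.tokens_since_fire += 1
           first = self.last_fire_ms is None  # the first token always fires
           due = first or (
               now_ms - self.last_fire_ms >= self.throttle_ms
               or self.tokens_since_fire >= self.throttle_tokens
           )
           if due:
               self.last_fire_ms = now_ms
               self.tokens_since_fire = 0
           return due


   gate = ThrottleGate()
   # Ten tokens decoded 10 ms apart: timestamps 0, 10, ..., 90.
   fired = [i for i in range(10) if gate.should_fire(now_ms=i * 10)]
   print(fired)  # [0, 4, 8]

With fast decoding the token gate dominates, as in the run above.  With slow
decoding (more than 50 ms per token) the time gate lets every token through.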

Capturing the decoded text
~~~~~~~~~~~~~~~~~~~~~~~~~~

By default the ``text_so_far`` field of
:class:`~oxillama_py.progress.ProgressEvent` is the empty string, because
populating it would force an O(n) string copy on every fired tick.  Pass
``progress_capture_text=True`` to opt in:

.. code-block:: python

   text = engine.generate(
       "Hello",
       max_tokens=128,
       progress=lambda evt: print(evt.text_so_far[-32:]),
       progress_capture_text=True,
   )

Strict error handling
~~~~~~~~~~~~~~~~~~~~~

Exceptions raised inside the progress callback are swallowed by default so
that a misbehaving widget cannot abort generation.  Pass
``strict_progress=True`` to re-raise the first swallowed exception once
generation completes.
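
The difference between the two modes can be sketched with a small stand-in
driver.  ``run_with_progress`` is a hypothetical function written for this
illustration, not the library's implementation.  It assumes the callback
keeps being invoked after a failure, which this page does not state.

.. code-block:: python

   def run_with_progress(events, callback, strict_progress=False):
       """Toy driver: deliver ``events`` to ``callback`` as described above."""
       first_error = None
       delivered = 0
       for event in events:
           try:
               callback(event)
           except Exception as exc:  # a misbehaving hook must not abort the loop
               if first_error is None:
                   first_error = exc  # keep only the first failure
           delivered += 1
       # Only strict mode surfaces the stashed exception, and only at the end.
       if strict_progress and first_error is not None:
           raise first_error
       return delivered


   def flaky(event):
       if event == 2:
           raise ValueError("widget went away")


   print(run_with_progress([1, 2, 3], flaky))  # 3
   try:
       run_with_progress([1, 2, 3], flaky, strict_progress=True)
   except ValueError as exc:
       print("re-raised:", exc)  # re-raised: widget went away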

Migrating from ``TqdmProgress``
-------------------------------

The v0.1.1 :class:`oxillama_py.tqdm_helper.TqdmProgress` shim still works
and is still re-exported from the package top level, but it now emits a
:class:`DeprecationWarning`.  To migrate, drop the wrapper and pass the bar
directly:

.. code-block:: python

   # Before (v0.1.1):
   from tqdm.auto import tqdm
   from oxillama_py import TqdmProgress
   bar = tqdm(desc="Generating", unit="tok")
   engine.generate_streaming(prompt, callback=TqdmProgress(bar))
   bar.close()

   # After (v0.1.3+):
   from tqdm.auto import tqdm
   with tqdm(desc="Generating", unit="tok") as bar:
       engine.generate(prompt, progress=bar)
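
One way to find leftover call sites during a migration is to escalate
deprecation warnings to errors with the standard :mod:`warnings` module.
The sketch below uses ``legacy_shim``, a hypothetical stand-in for a
deprecated helper such as ``TqdmProgress``; its warning message is invented,
and the point at which the real shim emits its warning may differ.

.. code-block:: python

   import warnings


   def legacy_shim():
       """Hypothetical stand-in for a deprecated helper."""
       warnings.warn(
           "legacy_shim is deprecated; pass the bar via progress=",
           DeprecationWarning,
           stacklevel=2,
       )
       return "ok"


   with warnings.catch_warnings():
       # Escalate deprecations to errors so leftover call sites fail loudly.
       warnings.simplefilter("error", DeprecationWarning)
       try:
           legacy_shim()
       except DeprecationWarning as exc:
           print("found legacy use:", exc)

The same effect is available for a whole test run with
``python -W error::DeprecationWarning``.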