SRT (SubRip Subtitle)
=====================

The ``rustling.srt`` module provides tools for parsing
`SubRip <https://en.wikipedia.org/wiki/SubRip>`_ subtitle (``.srt``) files.

An ``.srt`` file is a plain-text format where each subtitle block has a
sequence number, a time range, and one or more lines of text:

.. code-block:: text

   1
   00:02:16,612 --> 00:02:19,376
   Senator, we're making
   our final approach into Coruscant.

   2
   00:02:19,482 --> 00:02:21,609
   Very good, Lieutenant.
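
To make the block structure concrete, the following is a minimal standard-library sketch that splits SRT text into an index, a pair of millisecond time marks, and the subtitle text. It only illustrates the format; it is not how ``rustling`` parses files, and the names ``TIME_RANGE``, ``to_ms``, and ``parse_blocks`` are hypothetical helpers introduced for this example.

.. code-block:: python

   import re

   # Time-range line, e.g. "00:02:16,612 --> 00:02:19,376".
   TIME_RANGE = re.compile(
       r"(\d+):(\d{2}):(\d{2}),(\d{3}) --> (\d+):(\d{2}):(\d{2}),(\d{3})"
   )


   def to_ms(hours, minutes, seconds, millis):
       """Convert the four timestamp fields to milliseconds."""
       total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
       return total_seconds * 1000 + int(millis)


   def parse_blocks(text):
       """Return a list of (index, (start_ms, end_ms), text) tuples."""
       blocks = []
       # Blocks are separated by one or more blank lines.
       for chunk in re.split(r"\n\s*\n", text.strip()):
           lines = chunk.splitlines()
           match = TIME_RANGE.match(lines[1].strip()) if len(lines) > 1 else None
           if match is None:
               continue  # this sketch silently skips malformed blocks
           fields = match.groups()
           time_marks = (to_ms(*fields[:4]), to_ms(*fields[4:]))
           blocks.append((int(lines[0]), time_marks, "\n".join(lines[2:])))
       return blocks


   sample = "\n".join([
       "1",
       "00:02:16,612 --> 00:02:19,376",
       "Senator, we're making",
       "our final approach into Coruscant.",
       "",
       "2",
       "00:02:19,482 --> 00:02:21,609",
       "Very good, Lieutenant.",
   ])

   for index, time_marks, text in parse_blocks(sample):
       print(index, time_marks, repr(text))

The sketch skips malformed blocks instead of reporting them, which keeps it short; a production parser would likely be stricter.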

Loading Data
------------

:func:`~rustling.read_srt`
^^^^^^^^^^^^^^^^^^^^^^^^^^

The quickest way to load SRT data is with :func:`~rustling.read_srt`.
It accepts a file path, directory, ZIP archive, git URL, or HTTP URL
and figures out the right loading strategy automatically:

.. code-block:: python

   import rustling

   # From a local .srt file
   srt = rustling.read_srt("path/to/movie.srt")

   # From a directory (recursively finds all .srt files)
   srt = rustling.read_srt("path/to/subtitles/")

   # From a ZIP archive
   srt = rustling.read_srt("path/to/subtitles.zip")

   # From a git repository
   srt = rustling.read_srt("https://github.com/user/corpus.git")

   # From a URL (ZIP files are automatically detected and extracted)
   srt = rustling.read_srt("https://example.com/subtitles.zip")

Using the class methods directly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you need finer control — for example, to pass specific files,
filter by regex, change the file extension, control caching, or parse
in-memory strings — use the :py:class:`~rustling.srt.SRT` class methods directly:

.. code-block:: python

   from rustling.srt import SRT

From specific files:

.. code-block:: python

   srt = SRT.from_files(["path/to/file1.srt", "path/to/file2.srt"])

From a directory with a regex filter:

.. code-block:: python

   srt = SRT.from_dir("path/to/subtitles/", match=r"episode_01")

The ``extension`` parameter controls which file extension to look for (default: ``".srt"``).
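
As a rough illustration of what a recursive search with an extension and a regex filter involves, here is a standard-library sketch. It is not the ``rustling`` implementation: ``find_files`` is a hypothetical helper, and applying the pattern as a search against the full path (rather than the file name only) is an assumption made for this example.

.. code-block:: python

   import re
   from pathlib import Path


   def find_files(root, match=None, extension=".srt"):
       """Recursively collect files under ``root`` that end in ``extension``.

       If ``match`` is given, keep only paths where the regex is found
       (assumption: the pattern is searched in the whole POSIX-style path).
       """
       paths = sorted(p for p in Path(root).rglob(f"*{extension}") if p.is_file())
       if match is not None:
           pattern = re.compile(match)
           paths = [p for p in paths if pattern.search(p.as_posix())]
       return paths

Calling ``find_files("path/to/subtitles/", match=r"episode_01")`` would return only the ``.srt`` files whose path contains ``episode_01``.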

From a ZIP archive:

.. code-block:: python

   srt = SRT.from_zip("path/to/subtitles.zip")

From a git repository:

.. code-block:: python

   srt = SRT.from_git("https://github.com/user/corpus.git")

From a URL (ZIP files are automatically detected and extracted):

.. code-block:: python

   srt = SRT.from_url("https://example.com/subtitles.zip")

From in-memory strings:

.. code-block:: python

   # srt_string_1 and srt_string_2 are str objects holding SRT-formatted text
   srt = SRT.from_strs([srt_string_1, srt_string_2])

Parallel processing
^^^^^^^^^^^^^^^^^^^

All loading methods accept a ``parallel`` parameter (default: ``True``),
which enables parallel parsing of multiple files. Pass ``parallel=False``
to parse them sequentially.

Accessing Subtitle Data
-----------------------

Call :py:meth:`~rustling.srt.SRT.utterances` to get a flat list of all
subtitle blocks across all files:

.. code-block:: python

   import rustling

   srt = rustling.read_srt("movie.srt")

   for utterance in srt.utterances():
       print(utterance.index, utterance.time_marks, utterance.line)

An :py:class:`~rustling.srt.Utterance` has the following properties:

- ``index`` -- 1-based sequence number from the SRT file.
- ``line`` -- The subtitle text (multiline text preserved with ``\n``).
- ``time_marks`` -- Start and end time in milliseconds as a ``tuple[int, int]``.
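
Because ``time_marks`` holds plain integers in milliseconds, durations and display strings can be derived with ordinary arithmetic. The sketch below works on a bare tuple shaped like ``time_marks``; ``format_timestamp`` is a hypothetical helper written for this example, not part of ``rustling``.

.. code-block:: python

   def format_timestamp(ms):
       """Render milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
       seconds, millis = divmod(ms, 1000)
       minutes, seconds = divmod(seconds, 60)
       hours, minutes = divmod(minutes, 60)
       return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


   # A (start, end) pair in milliseconds, shaped like ``time_marks``.
   time_marks = (136612, 139376)
   start, end = time_marks

   duration_ms = end - start
   time_range = f"{format_timestamp(start)} --> {format_timestamp(end)}"

   print(duration_ms)  # 2764
   print(time_range)   # 00:02:16,612 --> 00:02:19,376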

:py:class:`~rustling.srt.Utterance` objects can also be constructed directly:

.. code-block:: python

   from rustling.srt import Utterance

   utt = Utterance(index=1, line="Hello world.", time_marks=(0, 1500))

Converting to CHAT
------------------

An :py:class:`~rustling.srt.SRT` reader can convert its data to CHAT format
for use with `CHILDES <https://childes.talkbank.org/>`_ / TalkBank tools.

.. code-block:: python

   import rustling

   srt = rustling.read_srt("recording.srt")

   # Convert to a CHAT object
   chat = srt.to_chat()

   # Or get CHAT-formatted strings
   chat_strs = srt.to_chat_strs()

   # Or write .cha files directly
   srt.to_chat_files("output_dir/")

Because SRT files carry no participant information, every utterance is
assigned the default participant code ``"SPK"`` (Speaker). CHAT utterances
are single-line, so multiline subtitle text is joined with a space in the
CHAT output.
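
The line-joining step can be sketched in plain Python. This is an illustration under stated assumptions, not the ``rustling`` implementation: ``to_chat_main_line`` is a hypothetical helper, and the exact output of ``to_chat_strs()`` (headers, utterance terminators, time marks) may differ from what is shown here.

.. code-block:: python

   def to_chat_main_line(line, participant="SPK"):
       """Build a CHAT-style main tier line from subtitle text."""
       # Collapse the line breaks of a multiline subtitle into single spaces.
       parts = [part.strip() for part in line.split("\n")]
       text = " ".join(part for part in parts if part)
       # CHAT main tiers start with "*", the participant code, ":" and a tab.
       return f"*{participant}:\t{text}"


   subtitle = "Senator, we're making\nour final approach into Coruscant."
   main_line = to_chat_main_line(subtitle)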

Converting to ELAN
------------------

An :py:class:`~rustling.srt.SRT` reader can convert its data to ELAN format.

.. code-block:: python

   import rustling

   srt = rustling.read_srt("recording.srt")

   # Convert to an ELAN object
   elan = srt.to_elan()

   # Or get EAF XML strings
   eaf_strs = srt.to_elan_strs()

   # Or write .eaf files directly
   srt.to_elan_files("output_dir/")

The conversion creates a single alignable tier named ``"SPK"`` (Speaker)
with one annotation per subtitle block.

Converting to TextGrid
----------------------

An :py:class:`~rustling.srt.SRT` reader can convert its data to
`TextGrid <https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html>`_
format for use with Praat.

.. code-block:: python

   import rustling

   srt = rustling.read_srt("recording.srt")

   # Convert to a TextGrid object
   textgrid = srt.to_textgrid()

   # Or get TextGrid-formatted strings
   textgrid_strs = srt.to_textgrid_strs()

   # Or write .TextGrid files directly
   srt.to_textgrid_files("output_dir/")

The conversion creates a single IntervalTier named ``"SPK"`` (Speaker)
with one interval per subtitle block.
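
Two details of the target format are worth keeping in mind: TextGrid times are expressed in seconds, and the intervals of a Praat interval tier are contiguous. The sketch below shows one way millisecond time marks could be mapped onto such a tier, with empty intervals filling the gaps between subtitles. Whether and how ``rustling`` fills gaps is not stated here, so treat the gap handling as an assumption; ``to_intervals`` is a hypothetical helper.

.. code-block:: python

   def to_intervals(utterances):
       """Map (start_ms, end_ms, text) triples to contiguous intervals in seconds."""
       intervals = []
       cursor = 0.0  # end of the last interval emitted so far
       for start_ms, end_ms, text in utterances:
           xmin, xmax = start_ms / 1000, end_ms / 1000
           if xmin > cursor:
               # Fill the silence before this subtitle with an empty interval.
               intervals.append((cursor, xmin, ""))
           intervals.append((xmin, xmax, text))
           cursor = xmax
       return intervals


   utterances = [(0, 1500, "Hello world."), (2000, 3250, "Goodbye.")]
   intervals = to_intervals(utterances)

The sketch assumes the subtitles are sorted by start time and do not overlap.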

Collection Operations
---------------------

An :py:class:`~rustling.srt.SRT` reader behaves like a collection of files.
You can iterate, slice, combine, and modify it:

.. code-block:: python

   import rustling

   srt = rustling.read_srt("path/to/subtitles/")

   # File count and paths
   print(srt.n_files)
   print(srt.file_paths)

   # Iteration and slicing
   for single_file in srt:
       print(single_file.n_files)  # 1

   subset = srt[0:3]

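   # Combining (srt1, srt2, and srt3 are separate readers; paths are placeholders)
   srt1 = rustling.read_srt("path/to/season1/")
   srt2 = rustling.read_srt("path/to/season2/")
   srt3 = rustling.read_srt("path/to/season3/")
   combined = srt1 + srt2
   srt1 += srt2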

   # Appending and extending
   srt1.append(srt2)
   srt1.extend([srt2, srt3])

   # Removing
   last = srt.pop()
   first = srt.pop_left()
   srt.clear()