CoNLL-U (Universal Dependencies)
=================================

The ``rustling.conllu`` module provides tools for parsing
`CoNLL-U <https://universaldependencies.org/format.html>`_ files,
the standard format for `Universal Dependencies <https://universaldependencies.org/>`_ datasets.

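A CoNLL-U file is plain text in which sentences are separated by blank
lines and comment lines start with ``#``. Each token line has 10
tab-separated fields:
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC.
An underscore (``_``) marks an unspecified field. The example below aligns
columns with spaces for readability; real files use a single tab between fields.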

.. code-block:: text

   # sent_id = 1
   # text = The cat sat on the mat.
   1   The     the     DET     DT      Definite=Def|PronType=Art   2   det     _   _
   2   cat     cat     NOUN    NN      Number=Sing                 3   nsubj   _   _
   3   sat     sit     VERB    VBD     Mood=Ind|Tense=Past         0   root    _   _
   4   on      on      ADP     IN      _                           6   case    _   _
   5   the     the     DET     DT      Definite=Def|PronType=Art   6   det     _   _
   6   mat     mat     NOUN    NN      Number=Sing                 3   obl     _   SpaceAfter=No
   7   .       .       PUNCT   .       _                           3   punct   _   SpaceAfter=No
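
To make the layout concrete, the following sketch parses CoNLL-U text with
plain Python. It only illustrates the format and is not how ``rustling`` is
implemented; the ``parse_conllu`` helper and its dictionary output are
hypothetical names chosen for this example.

.. code-block:: python

   # Field names in the order defined by the CoNLL-U specification.
   FIELDS = ("id", "form", "lemma", "upos", "xpos",
             "feats", "head", "deprel", "deps", "misc")


   def parse_conllu(text):
       """Split CoNLL-U text into sentences of comments and token dicts."""
       sentences = []
       comments, tokens = [], []
       for line in text.splitlines():
           if not line.strip():
               # A blank line closes the current sentence.
               if comments or tokens:
                   sentences.append({"comments": comments, "tokens": tokens})
                   comments, tokens = [], []
           elif line.startswith("#"):
               # Comment line: drop the leading "#" and surrounding spaces.
               comments.append(line[1:].strip())
           else:
               values = line.split("\t")
               if len(values) != len(FIELDS):
                   raise ValueError(f"expected 10 fields, got {len(values)}: {line!r}")
               tokens.append(dict(zip(FIELDS, values)))
       # Keep a final sentence that is not followed by a blank line.
       if comments or tokens:
           sentences.append({"comments": comments, "tokens": tokens})
       return sentences


   sample = (
       "# sent_id = 1\n"
       "# text = The cat sat.\n"
       "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
       "2\tcat\tcat\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_\n"
       "3\tsat\tsit\tVERB\tVBD\tMood=Ind|Tense=Past\t0\troot\t_\tSpaceAfter=No\n"
       "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
       "\n"
   )

   sentences = parse_conllu(sample)
   for token in sentences[0]["tokens"]:
       print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])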

Loading Data
------------

:func:`~rustling.read_conllu`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The quickest way to load CoNLL-U data is with :func:`~rustling.read_conllu`.
It accepts a file path, directory, ZIP archive, git URL, or HTTP URL
and figures out the right loading strategy automatically:

.. code-block:: python

   import rustling

   # From a local .conllu file
   conllu = rustling.read_conllu("path/to/data.conllu")

   # From a directory (recursively finds all .conllu files)
   conllu = rustling.read_conllu("path/to/ud-treebank/")

   # From a ZIP archive
   conllu = rustling.read_conllu("path/to/treebank.zip")

   # From a git repository (e.g., a Universal Dependencies treebank)
   conllu = rustling.read_conllu("https://github.com/UniversalDependencies/UD_English-EWT.git")

   # From a URL (ZIP files are automatically detected and extracted)
   conllu = rustling.read_conllu("https://example.com/treebank.zip")
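
As a rough mental model of the dispatch described above, the sketch below
classifies a source string by its shape. The ``guess_source_kind`` function and
its rules are assumptions made for illustration; the checks ``rustling``
actually performs are not documented here and may differ.

.. code-block:: python

   from pathlib import Path
   from urllib.parse import urlparse


   def guess_source_kind(source):
       """Return 'git', 'url', 'zip', 'dir', or 'file' for a source string.

       Hypothetical heuristic, not rustling's actual logic.
       """
       parsed = urlparse(source)
       if parsed.scheme in ("http", "https"):
           # Assumption: a remote path ending in ".git" is a git repository.
           return "git" if parsed.path.endswith(".git") else "url"
       path = Path(source)
       if path.suffix.lower() == ".zip":
           return "zip"
       # A trailing separator or an existing directory indicates a directory.
       if source.endswith(("/", "\\")) or path.is_dir():
           return "dir"
       return "file"


   for source in (
       "path/to/data.conllu",
       "path/to/ud-treebank/",
       "path/to/treebank.zip",
       "https://github.com/UniversalDependencies/UD_English-EWT.git",
       "https://example.com/treebank.zip",
   ):
       print(guess_source_kind(source), source)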

Using the class methods directly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you need finer control -- for example, to pass specific files,
filter by regex, change the file extension, control caching, or parse
in-memory strings -- use the :py:class:`~rustling.conllu.CoNLLU` class methods directly:

.. code-block:: python

   from rustling.conllu import CoNLLU

From specific files:

.. code-block:: python

   conllu = CoNLLU.from_files(["path/to/train.conllu", "path/to/test.conllu"])

From a directory with a regex filter:

.. code-block:: python

   conllu = CoNLLU.from_dir("path/to/treebank/", match=r"test")

The ``extension`` parameter controls which file extension to look for (default: ``".conllu"``).
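
The effect of combining ``match`` and ``extension`` can be pictured with the
standard library alone. In this sketch, ``find_files`` is a hypothetical helper,
and applying the regex to the path relative to the directory is an assumption;
``from_dir`` may match against a different part of the path.

.. code-block:: python

   import re
   import tempfile
   from pathlib import Path


   def find_files(root, match=None, extension=".conllu"):
       """Recursively list files with the given extension, optionally
       keeping only relative paths that the regex matches."""
       pattern = re.compile(match) if match else None
       found = []
       for path in sorted(Path(root).rglob(f"*{extension}")):
           if not path.is_file():
               continue
           relative = path.relative_to(root).as_posix()
           if pattern is None or pattern.search(relative):
               found.append(relative)
       return found


   with tempfile.TemporaryDirectory() as root:
       # Create a tiny fake treebank directory to search.
       for name in ("en-ud-train.conllu", "en-ud-test.conllu", "README.md"):
           Path(root, name).write_text("", encoding="utf-8")
       all_files = find_files(root)
       test_files = find_files(root, match=r"test")

   print(all_files)
   print(test_files)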

From a ZIP archive:

.. code-block:: python

   conllu = CoNLLU.from_zip("path/to/treebank.zip")

From a git repository:

.. code-block:: python

   conllu = CoNLLU.from_git("https://github.com/UniversalDependencies/UD_English-EWT.git")

From a URL (ZIP files are automatically detected and extracted):

.. code-block:: python

   conllu = CoNLLU.from_url("https://example.com/treebank.zip")

From in-memory strings:

.. code-block:: python

   # conllu_string_1 and conllu_string_2 are str objects holding CoNLL-U text
   conllu = CoNLLU.from_strs([conllu_string_1, conllu_string_2])
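
In-memory strings must use real tab characters between the 10 fields. One way
to assemble such a string is sketched below; ``to_conllu`` is a hypothetical
helper written for this example, not part of ``rustling``.

.. code-block:: python

   def to_conllu(rows, comments=()):
       """Build CoNLL-U text for one sentence from 10-field rows."""
       lines = [f"# {comment}" for comment in comments]
       for row in rows:
           if len(row) != 10:
               raise ValueError(f"expected 10 fields, got {len(row)}")
           # Fields are joined with a single tab, as the format requires.
           lines.append("\t".join(str(value) for value in row))
       # A blank line terminates the sentence.
       return "\n".join(lines) + "\n\n"


   conllu_string_1 = to_conllu(
       [(1, "Hello", "hello", "INTJ", "_", "_", 0, "root", "_", "_")],
       comments=["sent_id = 1", "text = Hello"],
   )
   print(conllu_string_1)

Strings built this way have the shape expected by
``CoNLLU.from_strs([conllu_string_1, ...])``.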

Parallel processing
^^^^^^^^^^^^^^^^^^^

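All loading methods accept a ``parallel`` parameter (default: ``True``)
that enables parallel parsing when multiple files are loaded. Pass
``parallel=False`` to parse files one at a time, for example
``CoNLLU.from_dir("path/to/treebank/", parallel=False)``.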

Accessing Data
--------------

Sentences
^^^^^^^^^

Call :py:meth:`~rustling.conllu.CoNLLU.sentences` to get a flat list of all
sentences across all files:

.. code-block:: python

   import rustling

   conllu = rustling.read_conllu("treebank.conllu")

   for sentence in conllu.sentences():
       print(sentence.comments)  # list[str] or None
       for token in sentence.tokens():
           print(token.id, token.form, token.lemma, token.upos, token.deprel)

Tokens
^^^^^^

A :py:class:`~rustling.conllu.Token` has the following properties, corresponding
to the 10 CoNLL-U fields:

- ``id`` -- Word index: an integer for regular words, a range such as ``"1-2"`` for multiword tokens, or a decimal such as ``"1.1"`` for empty nodes.
- ``form`` -- Word form or punctuation symbol.
- ``lemma`` -- Lemma or stem of the word.
- ``upos`` -- Universal POS tag.
- ``xpos`` -- Language-specific POS tag, or ``"_"``.
- ``feats`` -- Morphological features, or ``"_"``.
- ``head`` -- Head of the current word (``"0"`` for root), or ``"_"``.
- ``deprel`` -- Universal dependency relation to HEAD, or ``"_"``.
- ``deps`` -- Enhanced dependency graph, or ``"_"``.
- ``misc`` -- Any other annotation, or ``"_"``.
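
Two of these fields often need a little post-processing: ``id`` comes in three
shapes, and ``feats`` packs several ``Name=Value`` pairs into one string. The
sketch below handles both with plain Python. It assumes the values are
available as strings; ``classify_id`` and ``parse_feats`` are hypothetical
helpers, not ``rustling`` functions.

.. code-block:: python

   import re


   def classify_id(token_id):
       """Return 'word', 'multiword', or 'empty' for a CoNLL-U ID string."""
       if re.fullmatch(r"[1-9][0-9]*", token_id):
           return "word"
       if re.fullmatch(r"[1-9][0-9]*-[1-9][0-9]*", token_id):
           return "multiword"  # range such as "1-2"
       if re.fullmatch(r"[0-9]+\.[1-9][0-9]*", token_id):
           return "empty"  # decimal such as "1.1"
       raise ValueError(f"not a valid CoNLL-U ID: {token_id!r}")


   def parse_feats(feats):
       """Turn 'A=x|B=y' into {'A': 'x', 'B': 'y'}; '_' means no features."""
       if feats == "_":
           return {}
       return dict(pair.split("=", 1) for pair in feats.split("|"))


   print(classify_id("3"), classify_id("1-2"), classify_id("1.1"))
   print(parse_feats("Definite=Def|PronType=Art"))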

Comments
^^^^^^^^

A :py:class:`~rustling.conllu.Sentence` has a ``comments`` property that returns
the comment lines (without the leading ``#``), or ``None`` if there are no comments:

.. code-block:: python

   sentence = conllu.sentences()[0]
   if sentence.comments:
       for comment in sentence.comments:
           print(comment)  # e.g., "sent_id = 1" or "text = The cat sat."
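
Many comments follow a ``key = value`` convention, such as ``sent_id`` and
``text``. A small sketch for turning a comment list into a dictionary follows;
``comments_to_dict`` is a hypothetical helper, and comments without an ``=``
sign are simply skipped.

.. code-block:: python

   def comments_to_dict(comments):
       """Collect 'key = value' comments into a dict; accepts None."""
       metadata = {}
       for comment in comments or []:
           key, separator, value = comment.partition("=")
           if separator:
               metadata[key.strip()] = value.strip()
       return metadata


   metadata = comments_to_dict(["sent_id = 1", "text = The cat sat."])
   print(metadata["sent_id"], metadata["text"])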

Converting to CHAT
------------------

A :py:class:`~rustling.conllu.CoNLLU` reader can convert its data to CHAT format
for use with `CHILDES <https://childes.talkbank.org/>`_ / TalkBank tools.

.. code-block:: python

   import rustling

   conllu = rustling.read_conllu("treebank.conllu")

   # Convert to a CHAT object
   chat = conllu.to_chat()

   # Or get CHAT-formatted strings
   chat_strs = conllu.to_chat_strs()

   # Or write .cha files directly
   conllu.to_chat_files("output_dir/")

The conversion maps CoNLL-U token fields to CHAT morphology and grammar tiers:

- ``%mor`` tier: ``UPOS|LEMMA`` (with ``&FEATS`` appended if features are present)
- ``%gra`` tier: ``ID|HEAD|DEPREL``

Since CoNLL-U files have no participant information, a default participant code
``"SPK"`` (Speaker) is used.
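
The field mapping above can be sketched in a few lines of plain Python. This
is a literal reading of the two patterns and not the library's converter: the
``mor_item`` and ``gra_item`` helpers are hypothetical, and appending the raw
FEATS string after ``&`` is an assumption, so the exact output of
``to_chat()`` may differ.

.. code-block:: python

   def mor_item(upos, lemma, feats="_"):
       """Build one %mor item as UPOS|LEMMA, plus &FEATS when present."""
       item = f"{upos}|{lemma}"
       if feats and feats != "_":
           item += f"&{feats}"  # assumption: FEATS is appended verbatim
       return item


   def gra_item(token_id, head, deprel):
       """Build one %gra item as ID|HEAD|DEPREL."""
       return f"{token_id}|{head}|{deprel}"


   # (id, lemma, upos, feats, head, deprel) for a two-word sentence
   tokens = [
       ("1", "cat", "NOUN", "Number=Sing", "2", "nsubj"),
       ("2", "sit", "VERB", "_", "0", "root"),
   ]

   mor_tier = "%mor:\t" + " ".join(mor_item(t[2], t[1], t[3]) for t in tokens)
   gra_tier = "%gra:\t" + " ".join(gra_item(t[0], t[4], t[5]) for t in tokens)
   print(mor_tier)
   print(gra_tier)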

Collection Operations
---------------------

A :py:class:`~rustling.conllu.CoNLLU` reader behaves like a collection of files.
You can iterate, slice, combine, and modify it:

.. code-block:: python

   import rustling

   conllu = rustling.read_conllu("path/to/treebank/")

   # File count and paths
   print(conllu.n_files)
   print(conllu.file_paths)

   # Iteration and slicing
   for single_file in conllu:
       print(single_file.n_files)  # 1

   subset = conllu[0:3]

   # Combining (conllu1, conllu2, and conllu3 are separately loaded CoNLLU readers)
   combined = conllu1 + conllu2
   conllu1 += conllu2

   # Appending and extending
   conllu1.append(conllu2)
   conllu1.extend([conllu2, conllu3])

   # Removing
   last = conllu.pop()
   first = conllu.pop_left()
   conllu.clear()