rustling 0.8.0 - Docs.rs

.. _chat_transcriptions:

Transcriptions and Annotations
==============================

Conversational data formatted in CHAT provides transcriptions with rich
annotations for both linguistic and extra-linguistic information.
``rustling.chat`` is designed to extract data and annotations in CHAT and expose them
in Python data structures for flexible data analyses and modeling work.
This page explains how ``rustling.chat`` represents CHAT data and annotations.

CHAT Format
-----------

To see how the CHAT format translates to ``rustling.chat``, let's look at the very first
two utterances in Eve's data in the American English
`Brown <https://childes.talkbank.org/access/Eng-NA/Brown.html>`_
dataset on CHILDES (data file: ``Brown/Eve/010600a.cha``),
where apparently Eve demands cookies in the first utterance
and her mother responds with a question for confirmation in the second utterance:

.. code-block::

    *CHI:	more cookie . [+ IMP]
    %mor:	adj|more-Cmp-S1 noun|cookie .
    %gra:	1|2|AMOD 2|2|ROOT 3|2|PUNCT
    %int:	distinctive , loud
    *MOT:	you 0v more cookies ?
    %mor:	pron|you-Prs-Acc-S2 adj|more-Cmp-S1 noun|cookie-Plur ?
    %gra:	1|3|NSUBJ 2|3|AMOD 3|3|ROOT 4|3|PUNCT

``rustling.chat`` handles CHAT data by paying attention to the following:

* **Participants:**
  The two participants are ``CHI`` and ``MOT``.
  In CHILDES, it is customary to denote the target child (i.e., Eve in this example)
  by ``CHI`` and the child's mother by ``MOT``.
  The asterisk ``*`` that comes just before the participant code signals
  a transcription line, known as the main tier in CHAT.
  Each utterance must begin with this main tier.

* **Transcriptions:**
  The two main tiers are ``more cookie . [+ IMP]`` from Eve
  and ``you 0v more cookies ?`` from her mother.
  The transcriptions are word-segmented by spaces
  (even for languages that don't have such orthographic conventions as English does).
  Punctuation marks are also treated as "words".
  Annotations such as ``[+ IMP]`` and ``0v`` here can be found in transcriptions.

* **Dependent tiers:**
  Between one utterance and the next, there are often what's known
  as dependent tiers, signaled by ``%`` and
  associated with the transcription line just immediately above;
  Eve's utterance has the dependent tiers ``%mor``
  (morphological information), ``%gra`` (grammatical relations),
  and ``%int`` (intonation),
  whereas Eve's mother's has only ``%mor`` and ``%gra``.
  Although certain dependent tiers are more standardized and more commonly found
  in CHILDES datasets (especially ``%mor`` and ``%gra``),
  none of the dependent tiers are obligatory in a CHAT utterance.

* **The %mor tier:**
  The morphological information aligns one-to-one to the segmented words
  (including punctuation marks) in the main tier.;
  Annotations in the main tier are ignored.
  In each item of ``%mor``, the part-of-speech tag is on the left of the pipe ``|``,
  e.g., ``adj`` for an adjective in ``adj|more-Cmp-S1`` aligned to ``more`` in Eve's utterance.
  Inflectional and derivational information is on the right of ``|``,
  e.g., ``you-Prs-Acc-S2`` for the second-person, singular, accusative, personal pronoun
  in ``pron|you-Prs-Acc-S2`` aligned to ``you`` in Eve's mother's line.

* **The %gra tier:**
  CHAT represents grammatical relations in terms of heads and dependents in
  dependency grammar.
  Every item on the ``%gra`` tier corresponds one-to-one to the segmented words
  in the transcription (and therefore one-to-one to the ``%mor`` items as well).
  In Eve's mother's ``%gra``, ``2|3|AMOD`` means ``more`` at position 2 of the utterance
  is a dependent of the word ``cookies`` at position 3 as the head,
  and that the relation is one of adjectival modification.

* **Other tiers:**
  Apart from ``%mor`` and ``%gra``, other dependent tiers may appear in CHAT data files.
  Some of them contain more linguistic information, e.g., ``%int`` for intonation
  in Eve's utterance here, and others contain contextual information about the
  utterance or recording session.
  Many of these tiers are used only as needed (``%int`` not used in Eve's mother's
  utterance in this example).

Once you have a :class:`~rustling.chat.CHAT` object,
several methods are available for accessing the transcriptions and annotations.
Which method suits your need best depends on which level of information you need.
The following sections introduce these :class:`~rustling.chat.CHAT` methods.
As an example, let's work with the
`Brown <https://childes.talkbank.org/access/Eng-NA/Brown.html>`_
dataset of American English on CHILDES
(see :ref:`chat_read` for how to download and read this dataset):

.. code-block:: python

    import rustling
    brown = rustling.read_chat("path/to/your/local/Brown.zip")


Filtering by File
-----------------

The Brown dataset contains data for the three children Adam, Eve, and Sarah.
Let's first take a look at how the Brown dataset is structure,
because we need to separate the children's data for analysis:

.. code-block:: python

    brown.n_files
    # 214
    brown.file_paths
    # ['Brown/Adam/020304.cha',
    #  'Brown/Adam/020318.cha',
    #  ...
    #  'Brown/Eve/010600a.cha',
    #  'Brown/Eve/010600b.cha',
    #  ...
    #  'Brown/Sarah/020305.cha',
    #  'Brown/Sarah/020307.cha',
    #  ...]

The three children's data is organized in subdirectories under their respective name.
The :meth:`~rustling.chat.CHAT.filter` method can be used to create a new :class:`~rustling.chat.CHAT`
from the data matching a subdirectory path:

.. code-block:: python

    eve = brown.filter(files="Eve")
    eve.n_files
    # 20
    eve.head()
    # *CHI:  more             cookie       .
    # %mor:  adj|more-Cmp-S1  noun|cookie  .
    # %gra:  1|2|AMOD         2|2|ROOT     3|2|PUNCT
    # %int:  distinctive , loud

    # *MOT:  you                  more             cookies           ?
    # %mor:  pron|you-Prs-Acc-S2  adj|more-Cmp-S1  noun|cookie-Plur  ?
    # %gra:  1|3|NSUBJ            2|3|AMOD         3|3|ROOT          4|3|PUNCT

    # *MOT:  how_about      another              graham        cracker       ?
    # %mor:  intj|howabout  det|another-Def-Ind  noun|graham   noun|cracker  ?
    # %gra:  1|4|DISCOURSE  2|4|DET              3|4|COMPOUND  4|4|ROOT      5|4|PUNCT

    # *MOT:  would            that           do             just        as          well       ?
    # %mor:  aux|would-Fin-S  pron|that-Dem  verb|do-Inf-S  adv|just    adv|as      adv|well   ?
    # %gra:  1|3|AUX          2|3|NSUBJ      3|6|ROOT       4|5|ADVMOD  5|3|ADVMOD  6|5|FIXED  7|3|PUNCT

    # *MOT:  here      .
    # %mor:  adv|here  .
    # %gra:  1|1|ROOT  2|1|PUNCT

The string ``"Eve"`` appears in the file paths for Eve's data,
which is what we've passed in to the ``files`` keyword argument of :meth:`~rustling.chat.CHAT.filter`
for filtering. There are 20 CHAT data files for Eve in Brown.


Filtering by Participant
------------------------

To filter by participant, use the ``participants`` keyword argument.
Let's further filter ``eve`` into child speech and child-directed speech:

.. code-block:: python

    eve_chi = eve.filter(participants="CHI")  # child speech
    eve_chi.head()
    # *CHI:  more             cookie       .
    # %mor:  adj|more-Cmp-S1  noun|cookie  .
    # %gra:  1|2|AMOD         2|2|ROOT     3|2|PUNCT
    # %int:  distinctive , loud

    # *CHI:  more             cookie       .
    # %mor:  adj|more-Cmp-S1  noun|cookie  .
    # %gra:  1|2|AMOD         2|2|ROOT     3|2|PUNCT
    # %int:  distinctive , loud

    # *CHI:  more             juice       ?
    # %mor:  adj|more-Cmp-S1  noun|juice  ?
    # %gra:  1|2|AMOD         2|2|ROOT    3|2|PUNCT

    # *CHI:  Fraser        .
    # %mor:  propn|Fraser  .
    # %gra:  1|1|ROOT      2|1|PUNCT
    # %com:  pronounces Fraser as fr&jdij .

    # *CHI:  Fraser        .
    # %mor:  propn|Fraser  .
    # %gra:  1|1|ROOT      2|1|PUNCT


    eve_cds = eve.filter(participants="^(?!CHI$)")  # child-directed speech, regex ^(?!CHI$) for "not CHI"
    eve_cds.head()
    # *MOT:  you                  more             cookies           ?
    # %mor:  pron|you-Prs-Acc-S2  adj|more-Cmp-S1  noun|cookie-Plur  ?
    # %gra:  1|3|NSUBJ            2|3|AMOD         3|3|ROOT          4|3|PUNCT

    # *MOT:  how_about      another              graham        cracker       ?
    # %mor:  intj|howabout  det|another-Def-Ind  noun|graham   noun|cracker  ?
    # %gra:  1|4|DISCOURSE  2|4|DET              3|4|COMPOUND  4|4|ROOT      5|4|PUNCT

    # *MOT:  would            that           do             just        as          well       ?
    # %mor:  aux|would-Fin-S  pron|that-Dem  verb|do-Inf-S  adv|just    adv|as      adv|well   ?
    # %gra:  1|3|AUX          2|3|NSUBJ      3|6|ROOT       4|5|ADVMOD  5|3|ADVMOD  6|5|FIXED  7|3|PUNCT

    # *MOT:  here      .
    # %mor:  adv|here  .
    # %gra:  1|1|ROOT  2|1|PUNCT

    # *MOT:  here      you                  go                       .
    # %mor:  adv|here  pron|you-Prs-Nom-S2  verb|go-Fin-Ind-Pres-S2  .
    # %gra:  1|3|ROOT  2|3|NSUBJ            3|1|ADVCL-RELCL          4|1|PUNCT

The ``participants`` argument of :meth:`~rustling.chat.CHAT.filter` supports
regex matching (which is also true for the ``files`` argument, though not illustrated here).
We've taken advantage of this capability to filter Eve's data down to
child-directed speech, by the regular expression ``"^(?!CHI$)"``
for "not CHI".


Words
-----

The :class:`~rustling.chat.CHAT` method :meth:`~rustling.chat.CHAT.words`
returns the transcriptions as segmented words.

Calling :meth:`~rustling.chat.CHAT.words` with no arguments gives a
flat list of all the words:

.. code-block:: python

    eve_chi.words()[:9]
    # ['more', 'cookie', '.', 'more', 'cookie', '.', 'more', 'juice', '?']
    len(eve_chi.words())
    # 44119
    eve_cds.words()[:9]
    # ['you', 'more', 'cookies', '?', 'how_about', 'another', 'graham', 'cracker', '?']
    len(eve_cds.words())
    # 76198

To preserve the utterance-level structure, pass in ``by_utterance=True``
so that an inner list is created around the words from each utterance:

.. code-block:: python

    eve_chi.words(by_utterance=True)[:5]
    # [['more', 'cookie', '.'],
    #  ['more', 'cookie', '.'],
    #  ['more', 'juice', '?'],
    #  ['Fraser', '.'],
    #  ['Fraser', '.']]
    len(eve_chi.words(by_utterance=True))
    # 12113
    eve_cds.words(by_utterance=True)[:5]
    # [['you', 'more', 'cookies', '?'],
    #  ['how_about', 'another', 'graham', 'cracker', '?'],
    #  ['would', 'that', 'do', 'just', 'as', 'well', '?'],
    #  ['here', '.'],
    #  ['here', 'you', 'go', '.']]
    len(eve_cds.words(by_utterance=True))
    # 14807

Eve's data comes from 20 CHAT data files.
To get the file-level structure, pass in ``by_file=True``.
Each inner list contains the flat words from one file:

.. code-block:: python

    eve_chi_by_file = eve_chi.words(by_file=True)
    len(eve_chi_by_file)
    # 20
    eve_chi_by_file[0][:9]
    # ['more', 'cookie', '.', 'more', 'cookie', '.', 'more', 'juice', '?']
    eve_cds_by_file = eve_cds.words(by_file=True)
    len(eve_cds_by_file)
    # 20
    eve_cds_by_file[0][:9]
    # ['you', 'more', 'cookies', '?', 'how_about', 'another', 'graham', 'cracker', '?']

Passing both ``by_utterance=True`` and ``by_file=True`` gives a list of files,
where each file is a list of utterances, and each utterance is a list of words:

.. code-block:: python

    eve_chi_both = eve_chi.words(by_utterance=True, by_file=True)
    len(eve_chi_both)
    # 20
    len(eve_chi_both[0])
    # 741
    eve_chi_both[0][:5]
    # [['more', 'cookie', '.'],
    #  ['more', 'cookie', '.'],
    #  ['more', 'juice', '?'],
    #  ['Fraser', '.'],
    #  ['Fraser', '.']]
    eve_cds_both = eve_cds.words(by_utterance=True, by_file=True)
    len(eve_cds_both)
    # 20
    len(eve_cds_both[0])
    # 847
    eve_cds_both[0][:5]
    # [['you', 'more', 'cookies', '?'],
    #  ['how_about', 'another', 'graham', 'cracker', '?'],
    #  ['would', 'that', 'do', 'just', 'as', 'well', '?'],
    #  ['here', '.'],
    #  ['here', 'you', 'go', '.']]


Tokens
------

While :meth:`~rustling.chat.CHAT.words` gives you transcriptions as plain strings,
:meth:`~rustling.chat.CHAT.tokens` gives you the ``%mor`` and ``%gra``
annotations bundled with each word:

.. code-block:: python

    eve_chi.tokens()[:3]
    # [Token(word='more', pos='adj', mor='more-Cmp-S1', gra=Gra(dep=1, head=2, rel='AMOD')),
    #  Token(word='cookie', pos='noun', mor='cookie', gra=Gra(dep=2, head=2, rel='ROOT')),
    #  Token(word='.', pos='', mor='.', gra=Gra(dep=3, head=2, rel='PUNCT'))]

Each element is a :class:`~rustling.chat.Token` object
with the attributes ``word``, ``pos``, ``mor``, and ``gra``:

.. code-block:: python

    first_token = eve_chi.tokens()[0]
    first_token.word
    # 'more'
    first_token.pos
    # 'adj'
    first_token.mor
    # 'more-Cmp-S1'
    first_token.gra
    # Gra(dep=1, head=2, rel='AMOD')

The ``gra`` attribute is a :class:`~rustling.chat.Gra` object,
with the attributes
``dep`` (the position of the word in the utterance),
``head`` (position of the head word),
and ``rel`` (the grammatical relation):

.. code-block:: python

    first_token.gra.dep
    # 1
    first_token.gra.head
    # 2
    first_token.gra.rel
    # 'AMOD'

Like :meth:`~rustling.chat.CHAT.words`,
:meth:`~rustling.chat.CHAT.tokens` also accepts
``by_utterance`` and ``by_file`` to organize the results
at the utterance and file level, respectively.

Clitics
^^^^^^^

In CHAT, clitics are morphemes that attach to a host word but carry their own
part-of-speech and morphological information on the ``%mor`` tier.
Postclitics are marked with ``~`` and preclitics with ``$``.
For example, the contraction *that's* is annotated as
``pro:dem|that~cop|be&3S`` -- the demonstrative pronoun *that* followed by
the postclitic copula *be*.
When ``rustling.chat`` parses such forms, the host word's :class:`~rustling.chat.Token`
receives the transcribed word (e.g., ``"that's"``), while clitic tokens
get an empty string for their ``word`` attribute but retain their ``pos``,
``mor``, and ``gra`` annotations.
This means the number of tokens in an utterance can exceed the number of words,
because each clitic produces its own :class:`~rustling.chat.Token`:

.. code-block:: python

    from rustling.chat import CHAT

    # "that's good ." with %mor: pro:dem|that~cop|be&3S adj|good .
    chat_str = (
        "@UTF8\n@Begin\n"
        "@Participants:\tCHI Target_Child\n"
        "*CHI:\tthat's good .\n"
        "%mor:\tpro:dem|that~cop|be&3S adj|good .\n"
        "@End\n"
    )
    reader = CHAT.from_strs([chat_str])
    tokens = reader.tokens(by_utterance=True)[0]
    len(tokens)
    # 4 (three words, but four tokens because of the postclitic)
    tokens[0].word, tokens[0].pos
    # ("that's", 'pro:dem')
    tokens[1].word, tokens[1].pos
    # ('', 'cop')           # postclitic: empty word, but POS is retained
    tokens[2].word, tokens[2].pos
    # ('good', 'adj')
    tokens[3].word, tokens[3].pos
    # ('.', '')


Utterances
----------

The :meth:`~rustling.chat.CHAT.utterances` method returns
:class:`~rustling.chat.Utterance` objects that bundle together
the participant, tokens, original tiers, and time marks for each utterance:

.. code-block:: python

    eve_chi.utterances()[0]
    # Utterance(participant='CHI', tokens=[...3 tokens], time_marks=None)

Each :class:`~rustling.chat.Utterance` object has the following attributes:

* ``participant`` -- the speaker code (e.g., ``'CHI'``, ``'MOT'``).
* ``tokens`` -- a list of :class:`~rustling.chat.Token` objects,
  the same kind introduced in the Tokens section above.
* ``audible`` -- the audibly faithful transcription of this utterance,
  with CHAT coding conventions stripped out while preserving
  repetitions and retracings as they were heard; ``None`` for changeable headers.
* ``tiers`` -- a dictionary of the original, unparsed tier lines.
* ``time_marks`` -- a tuple of ``(start, end)`` in milliseconds, or ``None``.
* ``changeable_header`` -- a :class:`~rustling.chat.ChangeableHeader` object
  if this entry is a mid-file header, or ``None`` for regular utterances.

Let's inspect these attributes on the first utterance of Eve's child speech:

.. code-block:: python

    u = eve_chi.utterances()[0]
    u.participant
    # 'CHI'
    u.tokens
    # [Token(word='more', pos='adj', mor='more-Cmp-S1', gra=Gra(dep=1, head=2, rel='AMOD')),
    #  Token(word='cookie', pos='noun', mor='cookie', gra=Gra(dep=2, head=2, rel='ROOT')),
    #  Token(word='.', pos='', mor='.', gra=Gra(dep=3, head=2, rel='PUNCT'))]
    u.tokens[0].word
    # 'more'
    u.tokens[0].pos
    # 'adj'
    u.audible
    # 'more cookie .'
    u.time_marks is None
    # True

The ``tokens`` here are exactly the same :class:`~rustling.chat.Token` objects
returned by :meth:`~rustling.chat.CHAT.tokens` --
each with the ``word``, ``pos``, ``mor``, and ``gra`` attributes
as described in the Tokens section above.

Like :meth:`~rustling.chat.CHAT.words` and :meth:`~rustling.chat.CHAT.tokens`,
:meth:`~rustling.chat.CHAT.utterances` accepts ``by_file``
to organize the results at the file level:

.. code-block:: python

    len(eve_chi.utterances())  # number of utterances, in Eve's child speech data
    # 12113
    eve_chi_by_file = eve_chi.utterances(by_file=True)
    len(eve_chi_by_file)  # number of files, in Eve's child speech data
    # 20
    len(eve_chi_by_file[0])  # number of utterances in the 1st file of Eve's child speech data
    # 741


Audibly Faithful Transcription
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``audible`` attribute of an :class:`~rustling.chat.Utterance` object
gives you a transcription that faithfully represents what was audibly spoken,
with CHAT coding conventions (e.g., ``[+ IMP]``) stripped out
while preserving repetitions and retracings as they were heard:

.. code-block:: python

    from rustling.chat import CHAT

    # Repetitions marked with [x N] are expanded:
    data1 = CHAT.from_strs(["*CHI:\tno [x 3] ."])
    data1.utterances()[0].audible
    # 'no no no .'

    # Retracings are kept as spoken:
    data2 = CHAT.from_strs(["*CHI:\tI want [/] I want cookie ."])
    data2.utterances()[0].audible
    # 'I want I want cookie .'

This transcription is useful for tasks where the goal is to model
the actual speech signal, such as automatic speech recognition (ASR)
and forced alignment, where to the extent possible the text matches what was audibly produced.
For changeable header entries, ``audible`` is ``None``.


Changeable Headers
^^^^^^^^^^^^^^^^^^

CHAT data files can contain mid-file headers marked by ``@`` in the source data,
such as ``@G``, ``@Comment``, ``@Date``, and ``@Situation``.
These "changeable headers" (as they're called in the official CHAT documentation)
signal metadata changes within a recording session
(as opposed to the file-level headers that appear at the top of a CHAT file).

When :meth:`~rustling.chat.CHAT.utterances` encounters a mid-file header,
it includes it in the returned list as an :class:`~rustling.chat.Utterance` object
whose ``changeable_header`` attribute is set
(while ``participant``, ``tokens``, and ``tiers`` are all ``None``):

.. code-block:: python

    eve = brown.filter(files="Eve")
    utts = eve.utterances()
    headers = [u for u in utts if u.changeable_header is not None]
    len(headers)
    # 49
    h = headers[0]
    h.changeable_header
    # <builtins.ChangeableHeader_Date object at ...>
    h.changeable_header.value
    # '17-OCT-1962'
    h.participant is None
    # True
    h.tokens is None
    # True
    h.tiers is None
    # True

You can use ``isinstance`` with :class:`~rustling.chat.ChangeableHeader` variants
to classify the headers you find.
For example, to collect all dates, comments, and situations from Eve's data:

.. code-block:: python

    from rustling.chat import ChangeableHeader
    found_dates = []
    found_comments = []
    found_situations = []
    for u in eve.utterances():
        if u.changeable_header is not None:
            ch = u.changeable_header
            if isinstance(ch, ChangeableHeader.Date):
                found_dates.append(ch.value)
            elif isinstance(ch, ChangeableHeader.Comment):
                found_comments.append(ch.value)
            elif isinstance(ch, ChangeableHeader.Situation):
                found_situations.append(ch.value)
    found_dates[:5]
    # ['17-OCT-1962', '31-OCT-1962', '28-NOV-1962', '10-DEC-1962', '12-DEC-1962']
    found_comments[:3]
    # ['end of episode', '15:00-16:00', '30-JAN-1963 , 10:45-11:45']
    found_situations[:2]
    # ['Eve is playing with large wooden beads. she sorts them by colors , although she often fails to use color names appropriately.',
    #  'Father is going to have apple']


Time Marks
^^^^^^^^^^

Many of the more recent CHILDES datasets (especially starting from the 1990s)
come with digitized audio and video data associated with the text-based CHAT data files.
In these datasets, an utterance in the CHAT file has time marks to indicate
its start and end time (in milliseconds) in the corresponding audio and/or video data.
If the information is available, the ``time_marks`` attribute of an
:class:`~rustling.chat.Utterance` object is a tuple of two integers,
e.g., ``(0, 1073)``, for ``·0_1073·`` found at the end of the CHAT main tier.


Original Tiers
^^^^^^^^^^^^^^

You may sometimes need the original, unparsed transcription lines,
because they contain information (e.g., annotations for pauses) that is dropped
when :class:`~rustling.chat.Token` objects are constructed
from the cleaned-up words aligned with ``%mor`` and ``%gra``.
Or you may need access to other ``%`` tiers,
e.g., ``%int`` for intonation or ``%com`` for comments.
The ``tiers`` attribute of an :class:`~rustling.chat.Utterance` object
gives you a dictionary of all the original tiers of the utterance
for your custom needs:

.. code-block:: python

    u = eve_chi.utterances()[0]
    u.tiers
    # {'%gra': '1|2|AMOD 2|2|ROOT 3|2|PUNCT',
    #  '%int': 'distinctive , loud',
    #  'CHI': 'more cookie . [+ IMP]',
    #  '%mor': 'adj|more-Cmp-S1 noun|cookie .'}

The dictionary keys include the participant code (``'CHI'``) for the main tier
and the dependent tier names (``'%mor'``, ``'%gra'``, ``'%int'``, etc.).
Notice that the main tier retains the original transcription ``'more cookie . [+ IMP]'``,
including the ``[+ IMP]`` annotation that is not part of the parsed tokens.


.. _chat_from_utterances:

Creating a ``CHAT`` Object from ``Utterance`` Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a list of :class:`~rustling.chat.Utterance` objects
(e.g., after filtering or transforming utterances programmatically),
you can construct a new :class:`~rustling.chat.CHAT` reader from them
using the :meth:`~rustling.chat.CHAT.from_utterances` classmethod.
The resulting reader behaves like any other :class:`~rustling.chat.CHAT` object,
so you can call :meth:`~rustling.chat.CHAT.words`, :meth:`~rustling.chat.CHAT.tokens`,
and other methods on it as usual:

.. code-block:: python

    eve_chi = eve.filter(participants="CHI")
    utts = eve_chi.utterances()

    # Create a new reader from the first 10 utterances
    subset = chat.CHAT.from_utterances(utts[:10])
    subset.words()[:9]
    # ['more', 'cookie', '.', 'more', 'cookie', '.', 'more', 'juice', '?']
    len(subset.utterances())
    # 10

    # Round-trip: reconstructing a reader preserves all data
    reconstructed = chat.CHAT.from_utterances(utts)
    reconstructed.words() == eve_chi.words()
    # True