.. _chat_quickstart:

Quickstart
==========

Reading CHAT data
-----------------

First, we need some CHAT data to work with.
Because the CHAT data format is primarily associated with the TalkBank and CHILDES
ecosystem, it is natural to use one of the datasets available there.
A classic dataset is `Brown <https://childes.talkbank.org/access/Eng-NA/Brown.html>`_
for American English from CHILDES.
On that page, once you have logged in (account setup is free),
you can download the full set of CHAT transcripts as a ZIP archive to your local drive.
:func:`~rustling.read_chat` can read this ZIP archive directly:

.. code-block:: python

    import rustling
    brown = rustling.read_chat("path/to/your/local/Brown.zip")
    brown.n_files
    # 214
    brown.file_paths
    # ['Brown/Adam/020304.cha',
    #  'Brown/Adam/020318.cha',
    #  ...
    #  'Brown/Eve/010600a.cha',
    #  'Brown/Eve/010600b.cha',
    #  ...
    #  'Brown/Sarah/020305.cha',
    #  'Brown/Sarah/020307.cha',
    #  ...]
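
To illustrate what listing the transcripts inside a ZIP archive involves, here is a minimal
standard-library sketch. It builds a tiny in-memory archive with made-up contents,
and the helper ``list_cha_members`` is hypothetical; none of this reflects
how :func:`~rustling.read_chat` is actually implemented:

.. code-block:: python

    import io
    import zipfile


    def list_cha_members(archive):
        """Return the sorted names of the .cha members of a ZIP archive."""
        with zipfile.ZipFile(archive) as zf:
            return sorted(name for name in zf.namelist() if name.endswith(".cha"))


    # Build a small in-memory archive that stands in for a real download.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("Brown/Eve/010600a.cha", "@Begin\n@End\n")
        zf.writestr("Brown/Adam/020304.cha", "@Begin\n@End\n")
        zf.writestr("Brown/0metadata.cdc", "not a transcript")

    buffer.seek(0)
    cha_paths = list_cha_members(buffer)
    print(cha_paths)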

Brown has 214 ``.cha`` data files.
They are organized in subdirectories for the three children: Adam, Eve, and Sarah.
``brown`` is a :class:`~rustling.chat.CHAT` instance that represents the Brown data
and provides various methods.
But before we get into those, let's filter the data down to one of the children,
so that later results don't mix data from all three children.
With ``brown`` in hand, we can filter based on knowledge of the file paths,
targeting those that contain the string ``"Eve"``:

.. code-block:: python

    eve = brown.filter(files="Eve")
    eve.n_files
    # 20
    eve.file_paths
    # ['Brown/Eve/010600a.cha',
    #  'Brown/Eve/010600b.cha',
    #  'Brown/Eve/010700a.cha',
    #  ...]
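
To make path-based selection concrete, the following sketch mimics it on a few of
the paths listed above using only the standard library. It assumes the pattern is
applied as a regular-expression search over each path, which may differ from what
:meth:`~rustling.chat.CHAT.filter` does internally; ``select_paths`` is a hypothetical helper:

.. code-block:: python

    import re

    paths = [
        "Brown/Adam/020304.cha",
        "Brown/Eve/010600a.cha",
        "Brown/Eve/010600b.cha",
        "Brown/Sarah/020305.cha",
    ]


    def select_paths(paths, pattern):
        """Keep the paths in which the pattern is found anywhere."""
        return [path for path in paths if re.search(pattern, path)]


    eve_paths = select_paths(paths, "Eve")
    print(eve_paths)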

``eve`` is also a :class:`~rustling.chat.CHAT` instance, but with only 20 data files.
Let's check out some of the data access methods below.

More on :ref:`chat_read`.

Header Information
------------------

CHAT transcript files store metadata in header lines beginning with ``@``.
Among other things, ``eve`` has Eve's age at the time of each recording,
which ranges from 1 year and 6 months to 2 years and 3 months:

.. code-block:: python

    eve.ages()
    # [Age('1;06.00'),
    #  Age('1;06.00'),
    #  Age('1;07.00'),
    #  Age('1;07.00'),
    #  Age('1;08.00'),
    #  Age('1;09.00'),
    #  Age('1;09.00'),
    #  Age('1;09.00'),
    #  Age('1;10.00'),
    #  Age('1;10.00'),
    #  Age('1;11.00'),
    #  Age('1;11.00'),
    #  Age('2;00.00'),
    #  Age('2;00.00'),
    #  Age('2;01.00'),
    #  Age('2;01.00'),
    #  Age('2;02.00'),
    #  Age('2;02.00'),
    #  Age('2;03.00'),
    #  Age('2;03.00')]
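
CHAT writes ages as ``years;months.days``, typically in the ``@ID`` header line.
As a rough sketch of how such a string maps to a number of months, the snippet below
pulls the age out of an illustrative ``@ID`` line and converts it by hand.
The sample line, the helper ``age_to_months``, and the 30-day month are assumptions
for illustration, not rustling's own behavior:

.. code-block:: python

    def age_to_months(age):
        """Convert a CHAT age string such as '1;06.00' to months."""
        years, _, rest = age.partition(";")
        months, _, days = rest.partition(".")
        total = int(years) * 12 + int(months or 0)
        # Assumption: treat a month as 30 days when adding the day part.
        return total + int(days or 0) / 30


    # Illustrative @ID header line; the age is the fourth |-separated field.
    id_line = "@ID:\teng|Brown|CHI|1;06.00|female|||Target_Child|||"
    fields = id_line.split("\t", 1)[1].split("|")
    age = fields[3]
    print(age, age_to_months(age))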

More on :ref:`chat_headers`.

Transcriptions and Annotations
------------------------------

:meth:`~rustling.chat.CHAT.words` is one of the methods for accessing the transcriptions:

.. code-block:: python

    words = eve.words()  # list of strings, for all the words across all 20 files
    len(words)  # total word count
    # 120317
    words[:8]
    # ['more', 'cookie', '.', 'you', 'more', 'cookies', '?', 'how_about']

By default, :meth:`~rustling.chat.CHAT.words`
returns a flat list of results from all the files.
If we are interested in the results for individual files,
the method has the optional boolean parameter ``by_file``:

.. code-block:: python

    words_by_files = eve.words(by_file=True)  # list of lists of strings, each inner list for one file
    len(words_by_files)  # one inner list per file, so 20 for eve
    # 20
    for words_one_file in words_by_files:
        print(len(words_one_file))

    # 5833
    # 5272
    # 2500
    # 5765
    # 5742
    # 4355
    # 5352
    # 8934
    # 4474
    # 4573
    # 4207
    # 6218
    # 4459
    # 5240
    # 8109
    # 7378
    # 10910
    # 8427
    # 6931
    # 5638
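
The flat and per-file results are two views of the same data. The sketch below
flattens a made-up nested list with the standard library, then checks that the
per-file counts printed above add up to the total word count reported earlier:

.. code-block:: python

    from itertools import chain

    # Made-up stand-in for a by-file result: one inner list per file.
    nested = [["more", "cookie", "."], ["you", "more", "cookies", "?"]]
    flat = list(chain.from_iterable(nested))
    print(flat)

    # Per-file word counts copied from the output above.
    per_file_counts = [
        5833, 5272, 2500, 5765, 5742, 4355, 5352, 8934, 4474, 4573,
        4207, 6218, 4459, 5240, 8109, 7378, 10910, 8427, 6931, 5638,
    ]
    total = sum(per_file_counts)
    print(total)
    # 120317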

Apart from transcriptions, CHAT data has rich annotations for linguistic
and extra-linguistic information.
Specifically, many CHAT datasets on CHILDES have the ``%mor`` and ``%gra`` tiers
for morphological information and grammatical relations, respectively.
A CHAT data object such as ``eve`` from above has all this information readily available
to you via :meth:`~rustling.chat.CHAT.tokens`
-- think of :meth:`~rustling.chat.CHAT.tokens`
as :meth:`~rustling.chat.CHAT.words` with annotations:

.. code-block:: python

    some_tokens = eve.tokens()[:5]
    some_tokens
    # [Token(word='more', pos='adj', mor='more-Cmp-S1', gra=Gra(dep=1, head=2, rel='AMOD')),
    #  Token(word='cookie', pos='noun', mor='cookie', gra=Gra(dep=2, head=2, rel='ROOT')),
    #  Token(word='.', pos='', mor='.', gra=Gra(dep=3, head=2, rel='PUNCT')),
    #  Token(word='you', pos='pron', mor='you-Prs-Acc-S2', gra=Gra(dep=1, head=3, rel='NSUBJ')),
    #  Token(word='more', pos='adj', mor='more-Cmp-S1', gra=Gra(dep=2, head=3, rel='AMOD'))]

    # The Token class is a dataclass. A Token instance has attributes as shown above.
    for token in some_tokens:
        print(token.word, token.pos)

    # more adj
    # cookie noun
    # .
    # you pron
    # more adj
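
Because tokens expose plain attributes, ordinary Python tools apply to them.
The sketch below tallies part-of-speech tags with ``collections.Counter``.
To stay self-contained it uses a hypothetical ``SimpleToken`` stand-in that
mirrors only the ``word`` and ``pos`` fields shown above:

.. code-block:: python

    from collections import Counter
    from dataclasses import dataclass


    @dataclass
    class SimpleToken:
        """Hypothetical stand-in with two of the fields shown above."""

        word: str
        pos: str


    sample_tokens = [
        SimpleToken("more", "adj"),
        SimpleToken("cookie", "noun"),
        SimpleToken(".", ""),
        SimpleToken("you", "pron"),
        SimpleToken("more", "adj"),
    ]

    # Skip punctuation, which has an empty part-of-speech tag.
    pos_counts = Counter(token.pos for token in sample_tokens if token.pos)
    print(pos_counts.most_common())
    # [('adj', 2), ('noun', 1), ('pron', 1)]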

Beyond the ``%mor`` and ``%gra`` tiers,
an utterance has yet more information from the original CHAT data file.
If you need information such as the unsegmented transcription, time marks,
or any unparsed tiers, :meth:`~rustling.chat.CHAT.utterances` is what you need:

.. code-block:: python

    utterance = eve.utterances()[0]
    utterance
    # Utterance(participant='CHI', tokens=[...3 tokens], time_marks=None)
    utterance.tiers  # original, unparsed tiers for annotations etc.
    # {'CHI': 'more cookie . [+ IMP]',
    #  '%gra': '1|2|AMOD 2|2|ROOT 3|2|PUNCT',
    #  '%mor': 'adj|more-Cmp-S1 noun|cookie .',
    #  '%int': 'distinctive , loud'}
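
To see how the raw ``%mor`` and ``%gra`` strings relate to the token fields shown
earlier, here is a simplified sketch that splits the two tiers above and pairs
their items by position. The helpers ``parse_mor`` and ``parse_gra`` are hypothetical,
and the sketch ignores complications such as clitics and compounds that a real parser must handle:

.. code-block:: python

    mor_tier = "adj|more-Cmp-S1 noun|cookie ."
    gra_tier = "1|2|AMOD 2|2|ROOT 3|2|PUNCT"


    def parse_mor(item):
        """Split 'pos|mor' into (pos, mor); punctuation has no pos."""
        pos, sep, mor = item.partition("|")
        return (pos, mor) if sep else ("", item)


    def parse_gra(item):
        """Split 'dep|head|rel' into (dep, head, rel) with integer indices."""
        dep, head, rel = item.split("|")
        return int(dep), int(head), rel


    pairs = [
        (parse_mor(mor_item), parse_gra(gra_item))
        for mor_item, gra_item in zip(mor_tier.split(), gra_tier.split())
    ]
    for pair in pairs:
        print(pair)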

As we dig into the data, it helps to distinguish
child speech (i.e., participant = "CHI") from child-directed speech.
Use :meth:`~rustling.chat.CHAT.filter` to filter ``eve`` down to the desired subset
of the data:

.. code-block:: python

    eve_chi = eve.filter(participants="CHI")  # child speech
    eve_chi.utterances()[:5]
    # [Utterance(participant='CHI', tokens=[...3 tokens], time_marks=None),
    #  Utterance(participant='CHI', tokens=[...3 tokens], time_marks=None),
    #  Utterance(participant='CHI', tokens=[...3 tokens], time_marks=None),
    #  Utterance(participant='CHI', tokens=[...2 tokens], time_marks=None),
    #  Utterance(participant='CHI', tokens=[...2 tokens], time_marks=None)]

    eve_cds = eve.filter(participants="^(?!CHI$)")  # child-directed speech, regex ^(?!CHI$) for "not CHI"
    eve_cds.utterances()[:5]
    # [Utterance(participant='MOT', tokens=[...4 tokens], time_marks=None),
    #  Utterance(participant='MOT', tokens=[...5 tokens], time_marks=None),
    #  Utterance(participant='MOT', tokens=[...7 tokens], time_marks=None),
    #  Utterance(participant='MOT', tokens=[...2 tokens], time_marks=None),
    #  Utterance(participant='MOT', tokens=[...4 tokens], time_marks=None)]
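
The negative lookahead in ``^(?!CHI$)`` can be checked in isolation with Python's
``re`` module. This sketch assumes the pattern is applied as a search against each
participant code, with semantics similar to Python's; the code ``CHI2`` is made up
to show that only the exact string ``CHI`` is excluded:

.. code-block:: python

    import re

    not_chi = re.compile(r"^(?!CHI$)")

    # "CHI2" is a made-up code, included to show the effect of the "$" anchor.
    codes = ["CHI", "MOT", "FAT", "CHI2"]
    matched = [code for code in codes if not_chi.search(code)]
    print(matched)
    # ['MOT', 'FAT', 'CHI2']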

More on :ref:`chat_transcriptions`.


Word Frequencies and Ngrams
---------------------------

For word combinatorics, check out :meth:`~rustling.chat.CHAT.word_ngrams`.
A special case of general interest is word frequencies,
which are unigrams (ngrams with n = 1):

.. code-block:: python

    word_unigrams_chi = eve_chi.word_ngrams(1)
    type(word_unigrams_chi)
    # Ngrams  # stores ngrams efficiently, otherwise works like a collections.Counter
    word_counter_chi = word_unigrams_chi.to_counter()
    type(word_counter_chi)
    # collections.Counter  # https://docs.python.org/3/library/collections.html#collections.Counter
    word_counter_chi.most_common(10)
    # [(('.',), 10389),
    #  (('?',), 1449),
    #  (('I',), 1197),
    #  (('that',), 1047),
    #  (('a',), 883),
    #  (('it',), 799),
    #  (('Fraser',), 682),
    #  (('you',), 636),
    #  (('the',), 558),
    #  (('my',), 519)]

    word_counter_cds = eve_cds.word_ngrams(1).to_counter()
    word_counter_cds.most_common(10)
    # [(('.',), 9682),
    #  (('?',), 4909),
    #  (('you',), 3080),
    #  ((',',), 2090),
    #  (('the',), 1966),
    #  (('it',), 1565),
    #  (('what',), 1550),
    #  (('a',), 1324),
    #  (('I',), 899),
    #  (('is',), 894)]

Note that ngrams are represented as Python tuples, which is also true for unigrams
as shown. This short illustration already shows some of the characteristic differences
between child speech and child-directed speech, e.g., more questions and second-person
pronouns in child-directed speech than in child speech.
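
For intuition about the tuple representation, ngram counting can be sketched with
``collections.Counter`` alone. The two utterances below are the first seven words
shown earlier; the helper ``ngrams`` is hypothetical, and counting within each
utterance (never across utterance boundaries) is an assumption of this sketch:

.. code-block:: python

    from collections import Counter


    def ngrams(words, n):
        """Return all contiguous n-word tuples in a list of words."""
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


    utterances = [
        ["more", "cookie", "."],
        ["you", "more", "cookies", "?"],
    ]

    unigram_counts = Counter(g for utt in utterances for g in ngrams(utt, 1))
    bigram_counts = Counter(g for utt in utterances for g in ngrams(utt, 2))

    print(unigram_counts.most_common(1))
    # [(('more',), 2)]
    print(sorted(bigram_counts))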

To check out the top word bigrams:

.. code-block:: python

    eve_chi.word_ngrams(2).to_counter().most_common(5)
    # [(('it', '.'), 356),
    #  (('that', '?'), 326),
    #  (('yeah', '.'), 326),
    #  (('no', '.'), 296),
    #  (('there', '.'), 253)]

    eve_cds.word_ngrams(2).to_counter().most_common(5)
    # [(('what', '?'), 503),
    #  (('it', '.'), 347),
    #  (('on', 'the'), 327),
    #  (('are', 'you'), 308),
    #  (('in', 'the'), 301)]

More on :ref:`chat_frequencies`.


Developmental Measures
----------------------

To get the mean length of utterance (MLU), use :meth:`~rustling.chat.CHAT.mlu`:

.. code-block:: python

    eve_chi.mlu()  # mean length of utterance in morphemes
    # [1.43,
    #  1.82,
    #  2.15,
    #  2.07,
    #  2.16,
    #  2.4,
    #  2.43,
    #  2.4,
    #  2.86,
    #  2.72,
    #  2.69,
    #  3.4,
    #  3.5,
    #  2.83,
    #  3.54,
    #  3.24,
    #  3.61,
    #  3.2,
    #  3.8,
    #  2.24]

The result is the MLU in morphemes for each of Eve's CHAT files in order.
Because the result is a list of floats, it can be passed directly to
plotting code, for example:

.. code-block:: python

    import rustling

    # matplotlib and seaborn required for this code snippet
    import matplotlib.pyplot as plt
    import seaborn as sns

    brown = rustling.read_chat("path/to/your/local/Brown.zip")
    eve = brown.filter(files="Eve")
    eve_chi = eve.filter(participants="CHI")
    ages_in_months = [age.in_months() if age else None for age in eve_chi.ages()]

    plt.figure(figsize=(8, 5))
    sns.lineplot(
        x=ages_in_months,
        y=eve_chi.mlu(),
        errorbar=None,
    )

    plt.title("Mean Length of Utterance in Morphemes for Brown's Eve")
    plt.xlabel("Age in months")
    plt.ylabel("MLU (morphemes)")
    plt.xticks(ages_in_months)

    plt.savefig("brown_eve_mlum.png")
    plt.close()

.. image:: ../_static/brown_eve_mlum.png
   :alt: Mean Length of Utterance in Morphemes for Brown's Eve
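
Several of Eve's recordings share the same age in months, and seaborn's ``lineplot``
by default averages the y values that share an x value. The sketch below reproduces
that averaging with the standard library, using the MLU values printed above and
ages converted by hand from the age list (years times 12 plus months, an assumption of this sketch):

.. code-block:: python

    from collections import defaultdict
    from statistics import mean

    # Ages in months, converted by hand from the ages listed earlier.
    ages_in_months = [18, 18, 19, 19, 20, 21, 21, 21, 22, 22,
                      23, 23, 24, 24, 25, 25, 26, 26, 27, 27]
    # MLU values copied from the output above, in file order.
    mlu_values = [1.43, 1.82, 2.15, 2.07, 2.16, 2.4, 2.43, 2.4, 2.86, 2.72,
                  2.69, 3.4, 3.5, 2.83, 3.54, 3.24, 3.61, 3.2, 3.8, 2.24]

    # Group the per-file values by age, then average within each group.
    by_month = defaultdict(list)
    for month, value in zip(ages_in_months, mlu_values):
        by_month[month].append(value)

    mean_mlu_by_month = {month: mean(values) for month, values in sorted(by_month.items())}
    for month, value in mean_mlu_by_month.items():
        print(month, f"{value:.2f}")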

More on :ref:`chat_measures`.

Questions?
----------

If you have any questions, comments, or bug reports, please open an issue
at the `GitHub repository <https://github.com/jacksonllee/rustling/issues>`_, or
contact `Jackson L. Lee <https://jacksonllee.com/>`_.