deepfrog 0.2.0

A deep learning NLP suite (PoS, lemmatiser, NER) with FoLiA XML support

DeepFrog - NLP Suite

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

Introduction

DeepFrog aims to be a (partial) successor of the Dutch NLP suite Frog. Whereas the various NLP modules in Frog were built on k-NN classifiers, DeepFrog builds on deep learning techniques and can use a variety of neural transformers.

Our deliverables are multi-faceted:

  1. Fine-tuned neural network models for Dutch NLP that can be compared with Frog and are directly usable with Huggingface's Transformers library for Python (or rust-bert for Rust).
  2. Training pipelines for the above models (see training).
  3. A software tool that integrates multiple models (not limited to Dutch!) and provides a single pipeline solution for end-users.
    • It fully supports FoLiA XML input/output.
    • Its usage is not limited to the models we provide.

Models

We aim to make available various models for Dutch NLP.

RobBERT v1 Part-of-Speech (CGN tagset) for Dutch

Model page with instructions: https://huggingface.co/proycon/robbert-pos-cased-deepfrog-nld

Uses the pre-trained model RobBERT (a RoBERTa model), fine-tuned on part-of-speech tags with the full corpus that is also used by Frog. Uses the tag set of the Corpus Gesproken Nederlands (CGN); that corpus is a subset of the training data.
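As a minimal sketch of what "directly usable with the Transformers library" could look like in Python, the snippet below wraps the generic `token-classification` pipeline around the model identifier from the model page. The helper names `tag_sentence` and `format_predictions` are hypothetical and only serve the illustration; the model page remains the authoritative source for usage instructions. The pipeline reports predictions per subword token, so the output may need further merging into words.

```python
def tag_sentence(text, model_id="proycon/robbert-pos-cased-deepfrog-nld"):
    """Run a token-classification pipeline and return its raw predictions.

    Hypothetical helper: the import is done lazily so that the formatting
    helper below can be used without the transformers package installed.
    """
    from transformers import pipeline

    # Downloads the model from the Hugging Face Hub on first use.
    tagger = pipeline("token-classification", model=model_id)
    return tagger(text)


def format_predictions(predictions):
    """Render pipeline output (a list of dicts) as 'token/TAG' pairs."""
    return " ".join(f"{p['word']}/{p['entity']}" for p in predictions)
```

Calling `format_predictions(tag_sentence("Dit is een test."))` would then yield one `token/TAG` pair per subword token; the exact tokens and tags depend on the model and are not shown here.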

Test Evaluation:

f1 = 0.9708171206225681
loss = 0.07882563415198372
precision = 0.9708171206225681
recall = 0.9708171206225681

RobBERT v2 Part-of-Speech (CGN tagset) for Dutch

Model page with instructions: https://huggingface.co/proycon/robbert2-pos-cased-deepfrog-nld

Uses the pre-trained model RobBERT v2 (a RoBERTa model), fine-tuned on part-of-speech tags with the full corpus that is also used by Frog. Uses the tag set of the Corpus Gesproken Nederlands (CGN); that corpus is a subset of the training data.

Test Evaluation:

f1 = 0.9664560038891591
loss = 0.09085878504153627
precision = 0.9659863945578231
recall = 0.9669260700389105
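
The reported f1 is the harmonic mean of precision and recall. The short check below recomputes it from the two figures listed above; the function name `f1_score` is just an illustrative helper, not part of DeepFrog.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above for RobBERT v2 part-of-speech tagging.
precision = 0.9659863945578231
recall = 0.9669260700389105
reported_f1 = 0.9664560038891591

computed_f1 = f1_score(precision, recall)

# The recomputed value agrees with the reported one up to rounding error.
assert math.isclose(computed_f1, reported_f1, rel_tol=1e-9)
```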

BERT Part-of-Speech (CGN tagset) for Dutch

Model page with instructions: https://huggingface.co/proycon/bert-pos-cased-deepfrog-nld

Uses the pre-trained model BERTje (a BERT model), fine-tuned on part-of-speech tags with the full corpus that is also used by Frog. Uses the tag set of the Corpus Gesproken Nederlands (CGN); that corpus is a subset of the training data.

Test Evaluation:

f1 = 0.9737354085603113
loss = 0.0647074995296342
precision = 0.9737354085603113
recall = 0.9737354085603113

RobBERT SoNaR1 Named Entities for Dutch

Model page with instructions: https://huggingface.co/proycon/robbert-ner-cased-sonar1-nld

Uses the pre-trained model RobBERT (a RoBERTa model), fine-tuned on named entities from the SoNaR1 corpus (as also used by Frog). Provides the basic tags PER, LOC, ORG, PRO, EVE, and MISC.

Test Evaluation (note: this is a simple token-based evaluation rather than an entity-based one!):

f1 = 0.9170731707317074
loss = 0.023864904676364467
precision = 0.9306930693069307
recall = 0.9038461538461539
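
To illustrate why the token-based caveat above matters, here is a small sketch that contrasts token-level and entity-level scoring on a toy sentence. It assumes BIO-style tags (`B-PER`, `I-PER`, `O`), and both the toy data and all function names are hypothetical; it is not the evaluation code used to produce the figures above.

```python
def _prf(true_positives, n_predicted, n_gold):
    """Precision, recall and F1 from raw counts, guarding against zero division."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_prf(gold, pred, outside="O"):
    """Token-level scores: every non-O token is judged on its own."""
    tp = sum(1 for g, p in zip(gold, pred) if p != outside and g == p)
    n_pred = sum(1 for p in pred if p != outside)
    n_gold = sum(1 for g in gold if g != outside)
    return _prf(tp, n_pred, n_gold)


def extract_entities(tags, outside="O"):
    """Collect (start, end, type) spans from a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing sentinel tag flushes a span that runs to the end.
    for i, tag in enumerate(list(tags) + [outside]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_prf(gold, pred):
    """Entity-level scores: a span counts only if boundaries and type both match."""
    gold_spans = extract_entities(gold)
    pred_spans = extract_entities(pred)
    return _prf(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))


# Toy data: the prediction truncates the two-token PER entity.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "O"]

token_scores = token_prf(gold, pred)    # precision 1.0, recall 2/3
entity_scores = entity_prf(gold, pred)  # precision 0.5, recall 0.5
```

On this toy input the token-level F1 is higher than the entity-level F1, because the truncated entity still earns credit for its first token. Token-based figures can therefore look more favourable than entity-based ones for the same predictions.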

Note: the tokenisation in this model is English rather than Dutch

RobBERT v2 SoNaR1 Named Entities for Dutch

Model page with instructions: https://huggingface.co/proycon/robbert2-ner-cased-sonar1-nld

Uses the pre-trained model RobBERT v2 (a RoBERTa model), fine-tuned on named entities from the SoNaR1 corpus (as also used by Frog). Provides the basic tags PER, LOC, ORG, PRO, EVE, and MISC.

Test Evaluation:

f1 = 0.8878048780487806
loss = 0.03555946223787032
precision = 0.900990099009901
recall = 0.875

BERT SoNaR1 Named Entities for Dutch

Model page with instructions: https://huggingface.co/proycon/bert-ner-cased-sonar1-nld

Uses the pre-trained model BERTje (a BERT model), fine-tuned on named entities from the SoNaR1 corpus (as also used by Frog). Provides the basic tags PER, LOC, ORG, PRO, EVE, and MISC.

Test Evaluation (note: this is a simple token-based evaluation rather than an entity-based one!):

f1 = 0.9519230769230769
loss = 0.02323892477299803
precision = 0.9519230769230769
recall = 0.9519230769230769