Benchmarks
==========


Tokenization
------------

The following benchmarks illustrate tokenization accuracy (F1 score) on `UD treebanks <https://universaldependencies.org/>`_:

======= ========= =========  =========== ======= 
  lang   dataset   regexp     spacy 2.1   vtext            
======= ========= =========  =========== ======= 
  en     EWT        0.812     0.972       0.966   
  en     GUM        0.881     0.989       0.996   
  de     GSD        0.896     0.944       0.964   
  fr     Sequoia    0.844     0.968       0.971   
======= ========= =========  =========== ======= 
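For reference, tokenization F1 can be computed by exact character-span matching between predicted and gold tokens. The sketch below is an illustration only, not the evaluation harness used for the table above; the regexp pattern and example sentence are hypothetical:

```python
import re

def token_spans(text, tokens):
    """Convert a token sequence into (start, end) character spans."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def tokenization_f1(text, gold_tokens, pred_tokens):
    """F1 over exact span matches between gold and predicted tokenizations."""
    gold = set(token_spans(text, gold_tokens))
    pred = set(token_spans(text, pred_tokens))
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

text = "Don't panic!"
gold = ["Do", "n't", "panic", "!"]        # UD-style gold segmentation
pred = re.findall(r"\w+|[^\w\s]", text)   # naive regexp tokenizer
print(round(tokenization_f1(text, gold, pred), 3))  # → 0.444
```

Only ``panic`` and ``!`` match the gold spans here, which is why naive regexp tokenizers score well below segmentation-aware ones on UD data.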

and English tokenization speed in millions of words per second (MWPS):

================== ========== =========== ==========
 .                   regexp     spacy 2.1   vtext
================== ========== =========== ==========
 **Speed (MWPS)**   3.1        0.14        2.1
================== ========== =========== ==========
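As a rough illustration of how such a figure can be obtained (this is a hypothetical micro-benchmark, not the harness behind the table above), MWPS is simply tokens produced divided by wall-clock seconds, divided by one million. Here with a naive regexp tokenizer:

```python
import re
import time

# Illustrative corpus; real benchmarks use natural text, not repetition.
text = "The quick brown fox jumps over the lazy dog. " * 20_000
pattern = re.compile(r"\w+|[^\w\s]")

start = time.perf_counter()
tokens = pattern.findall(text)
elapsed = time.perf_counter() - start

# Millions of words (tokens) per second of wall-clock time.
mwps = len(tokens) / elapsed / 1e6
print(f"{len(tokens)} tokens in {elapsed:.4f}s -> {mwps:.2f} MWPS")
```

Absolute numbers depend heavily on the CPU and on the text, so MWPS figures are only comparable when measured on the same corpus and machine.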


Text vectorization
------------------

Below are benchmarks for converting
textual data to a sparse document-term matrix using the 20 newsgroups dataset,
run on an Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz:

 ===============================  =====================  ==================  ==================
  Speed (MB/s)                     scikit-learn 0.20.1    vtext (n_jobs=1)    vtext (n_jobs=4)
 ===============================  =====================  ==================  ==================
  CountVectorizer.fit              14                     104                 225
  CountVectorizer.transform        14                     82                  303
  CountVectorizer.fit_transform    14                     70                  NA
  HashingVectorizer.transform      19                     89                  309
 ===============================  =====================  ==================  ==================
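To make the task concrete, here is a minimal pure-Python sketch of what CountVectorizer-style vectorization produces: the three CSR arrays of a sparse document-term matrix (the same representation scikit-learn and vtext return via ``scipy.sparse.csr_matrix``). It is an illustration under a trivial whitespace analyzer, not the benchmarked implementation:

```python
from collections import Counter

def count_vectorize(docs):
    """Sketch of CountVectorizer.fit_transform: docs -> CSR-style arrays.

    Returns (vocabulary, indptr, indices, data): row i of the matrix is
    indices[indptr[i]:indptr[i+1]] (term ids) with counts data[...].
    """
    vocabulary = {}
    indptr, indices, data = [0], [], []
    for doc in docs:
        counts = Counter(doc.lower().split())  # trivial whitespace analyzer
        for term, count in counts.items():
            idx = vocabulary.setdefault(term, len(vocabulary))
            indices.append(idx)
            data.append(count)
        indptr.append(len(indices))
    return vocabulary, indptr, indices, data

docs = ["the cat sat", "the cat ate the fish"]
vocab, indptr, indices, data = count_vectorize(docs)
print(sorted(vocab))  # → ['ate', 'cat', 'fish', 'sat', 'the']
print(indptr)         # → [0, 3, 7]
```

A HashingVectorizer replaces the growing ``vocabulary`` dict with a fixed hash of each term, which is why it avoids a ``fit`` step entirely.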

Note, however, that these two estimators in vtext currently support only a fraction of
scikit-learn's functionality.