.\" finalfrontier 0.9.4: Train/use word embeddings with subword units
.\" Automatically generated by Pandoc 2.7.3
.\"
.TH "FINALFRONTIER-SKIPGRAM" "1" "Sep 8, 2018" "" ""
.hy
.SH NAME
.PP
\f[B]finalfrontier skipgram\f[R] \[en] train word embeddings with
subword representations
.SH SYNOPSIS
.PP
\f[B]finalfrontier skipgram\f[R] [\f[I]options\f[R]] \f[I]corpus\f[R]
\f[I]output\f[R]
.SH DESCRIPTION
.PP
The \f[B]finalfrontier skipgram\f[R] subcommand trains word embeddings
using data from a \f[I]corpus\f[R].
The corpus should have tokens separated by spaces and sentences
separated by newlines.
After training, the embeddings are written to \f[I]output\f[R] in the
finalfusion format.
.SH OPTIONS
.TP
.B \f[C]--buckets\f[R] \f[I]EXP\f[R]
The bucket exponent.
finalfrontier uses 2\[ha]\f[I]EXP\f[R] buckets to store subword
representations.
Each subword representation (n-gram) is hashed and mapped to a bucket
based on this hash.
Using more buckets will result in fewer bucket collisions between
subword representations at the cost of memory use.
The default bucket exponent is \f[I]21\f[R] (approximately 2 million
buckets).
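.RS
.PP
As a rough sketch of the memory cost, the bucket count for a given
exponent can be computed with shell arithmetic.
The 300 dimensions are the default of \f[C]--dims\f[R]; the 4 bytes per
vector component are an assumption (single-precision floats) that this
page does not state, and the estimate covers only the bucket matrix:
.IP
.nf
\f[C]
# Number of buckets for the default exponent 21
echo $((1 << 21))                 # 2097152
# Assumed: 300 dimensions, 4 bytes per component
echo $(( (1 << 21) * 300 * 4 ))   # 2516582400 bytes, about 2.3 GiB
\f[R]
.fi
.PP
Under the same assumptions, each increment of the exponent doubles this
figure.
.RE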
.TP
.B \f[C]--context\f[R] \f[I]CONTEXT_SIZE\f[R]
Words within \f[I]CONTEXT_SIZE\f[R] tokens of a focus word are used to
learn the representation of the focus word.
The default context size is \f[I]10\f[R].
.TP
.B \f[C]--dims\f[R] \f[I]DIMENSIONS\f[R]
The dimensionality of the trained word embeddings.
The default dimensionality is 300.
.TP
.B \f[C]--discard\f[R] \f[I]THRESHOLD\f[R]
The discard threshold influences how often frequent words are discarded
from training.
The default discard threshold is \f[I]1e-4\f[R].
.TP
.B \f[C]--epochs\f[R] \f[I]N\f[R]
The number of training epochs.
The number of necessary training epochs typically decreases with the
corpus size.
The default number of epochs is \f[I]15\f[R].
.TP
.B \f[C]-f\f[R], \f[C]--format\f[R] \f[I]FORMAT\f[R]
The output format.
This must be one of \f[I]fasttext\f[R], \f[I]finalfusion\f[R],
\f[I]word2vec\f[R], \f[I]text\f[R], or \f[I]textdims\f[R].
.RS
.PP
All formats except \f[I]finalfusion\f[R] result in a loss of
information: \f[I]word2vec\f[R], \f[I]text\f[R], and \f[I]textdims\f[R]
store neither subword embeddings nor hyperparameters.
The \f[I]fasttext\f[R] format does not store all hyperparameters.
.PP
The \f[I]fasttext\f[R] format can only be used in conjunction with
\f[C]--subwords buckets\f[R] and \f[C]--hash-indexer fasttext\f[R].
.RE
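.RS
.PP
Combining these constraints, a fastText-compatible training run might
look as follows; the corpus and output file names are only illustrative:
.IP
.nf
\f[C]
finalfrontier skipgram --format fasttext --subwords buckets \[rs]
  --hash-indexer fasttext dewiki.txt dewiki-fasttext.bin
\f[R]
.fi
.RE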
.TP
.B \f[C]--hash-indexer\f[R] \f[I]INDEXER\f[R]
The indexer to use when bucket-based subwords are used (see
\f[C]--subwords\f[R]).
The possible values are \f[I]finalfusion\f[R] or \f[I]fasttext\f[R].
Default: finalfusion
.RS
.PP
\f[I]finalfusion\f[R] uses the FNV-1a hasher, whereas \f[I]fasttext\f[R]
emulates the (broken) implementation of FNV-1a in fastText.
Use of \f[I]finalfusion\f[R] is recommended, unless the resulting
embeddings should be compatible with fastText.
.RE
.TP
.B \f[C]--lr\f[R] \f[I]LEARNING_RATE\f[R]
The learning rate determines what fraction of a gradient is used for
parameter updates.
The default initial learning rate is \f[I]0.05\f[R]; the learning rate
decreases monotonically during training.
.TP
.B \f[C]--maxn\f[R] \f[I]LEN\f[R]
The maximum n-gram length for subword representations.
Default: 6
.TP
.B \f[C]--mincount\f[R] \f[I]FREQ\f[R]
The minimum count controls the discarding of infrequent words.
Words occurring fewer than \f[I]FREQ\f[R] times are not considered during
training.
The default minimum count is 5.
.TP
.B \f[C]--minn\f[R] \f[I]LEN\f[R]
The minimum n-gram length for subword representations.
Default: 3
.TP
.B \f[C]--model\f[R] \f[I]MODEL\f[R]
The model to use for training word embeddings.
The choices here are: \f[I]dirgram\f[R] for the directional skip-gram
model (Song et al., 2018), \f[I]skipgram\f[R] for the skip-gram model
(Mikolov et al., 2013), and \f[I]structgram\f[R] for the structured
skip-gram model (Ling et al., 2015).
.RS
.PP
The structured skip-gram model takes the position of a context word into
account and results in embeddings that are typically better suited for
syntax-oriented tasks.
.PP
The dependency embeddings model is supported by the separate
\f[C]finalfrontier deps\f[R](1) subcommand.
.PP
The default model is \f[I]skipgram\f[R].
.RE
.TP
.B \f[C]--ngram-mincount\f[R] \f[I]FREQ\f[R]
The minimum n-gram frequency.
n-grams occurring fewer than \f[I]FREQ\f[R] times are excluded from
training.
This option is only applicable with the \f[I]ngrams\f[R] argument of the
\f[C]subwords\f[R] option.
.TP
.B \f[C]--ngram-target-size\f[R] \f[I]SIZE\f[R]
The target size for the n-gram vocabulary.
At most \f[I]SIZE\f[R] n-grams are included for training.
Only n-grams appearing more frequently than the n-gram at \f[I]SIZE\f[R]
are included.
This option is only applicable with the \f[I]ngrams\f[R] argument of the
\f[C]subwords\f[R] option.
.TP
.B \f[C]--ns\f[R] \f[I]FREQ\f[R]
The number of negatives to sample per positive example.
Default: 5
.TP
.B \f[C]--subwords\f[R] \f[I]SUBWORDS\f[R]
The type of subword embeddings to train.
The possible types are \f[I]buckets\f[R], \f[I]ngrams\f[R], and
\f[I]none\f[R].
Subword embeddings are used to compute embeddings for unknown words by
summing embeddings of n-grams within unknown words.
.RS
.PP
The \f[I]none\f[R] type does not use subwords.
The resulting model will not be able to assign embeddings to unknown
words.
.PP
The \f[I]ngrams\f[R] type stores subword n-grams explicitly.
The included n-gram lengths are specified using the \f[C]minn\f[R] and
\f[C]maxn\f[R] options.
The frequency threshold for n-grams is configured with the
\f[C]ngram-mincount\f[R] option.
.PP
The \f[I]buckets\f[R] type maps n-grams to buckets using the FNV-1a hash
(see \f[C]--hash-indexer\f[R]).
The considered n-gram lengths are specified using the \f[C]minn\f[R] and
\f[C]maxn\f[R] options.
The number of buckets is controlled with the \f[C]buckets\f[R] option.
.RE
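.RS
.PP
For instance, explicit n-grams of length 3 to 6 could be trained as
sketched below.
The frequency threshold of 50 and the file names are arbitrary
illustrative values, not recommendations:
.IP
.nf
\f[C]
finalfrontier skipgram --subwords ngrams --minn 3 --maxn 6 \[rs]
  --ngram-mincount 50 dewiki.txt dewiki-ngrams.bin
\f[R]
.fi
.RE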
.TP
.B \f[C]--target-size\f[R] \f[I]SIZE\f[R]
The target size for the token vocabulary.
At most \f[I]SIZE\f[R] tokens are included for training.
Only tokens appearing more frequently than the token at \f[I]SIZE\f[R]
are included.
.TP
.B \f[C]--threads\f[R] \f[I]N\f[R]
The number of threads to use for parallelization during training.
The default is to use half of the logical CPUs of the machine, capped at
20 threads.
Increasing the number of threads increases the probability of update
collisions, requiring more epochs to reach the same loss.
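.RS
.PP
A possible invocation that sets the thread count explicitly and raises
the number of epochs above the default of 15; the values 8 and 20 and
the file names are illustrative only:
.IP
.nf
\f[C]
finalfrontier skipgram --threads 8 --epochs 20 \[rs]
  dewiki.txt dewiki-skipgram.bin
\f[R]
.fi
.RE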
.TP
.B \f[C]--zipf\f[R] \f[I]EXP\f[R]
Exponent \f[I]s\f[R] used in the Zipf distribution
\f[C]p(k) = 1 / (k\[ha]s H_N)\f[R] for negative sampling.
Default: 0.5
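.RS
.PP
To get a feel for the exponent, the distribution can be tabulated for a
toy vocabulary.
The sketch below assumes that \f[I]k\f[R] is the frequency rank of a
word, that \f[I]N\f[R] is the vocabulary size, and that H_N is the
normalizer, that is, the sum of 1 / k\[ha]s over all ranks; this is the
usual reading of the formula, but this page does not spell it out:
.IP
.nf
\f[C]
awk \[aq]BEGIN {
  s = 0.5; n = 4                              # exponent, toy vocabulary size
  for (k = 1; k <= n; k++) h += 1 / k\[ha]s   # normalizer H_N
  for (k = 1; k <= n; k++) printf "%d %.2f\[rs]n", k, 1 / (k\[ha]s * h)
}\[aq]
\f[R]
.fi
.PP
With \f[I]s\f[R] = 0.5 and four ranks, the probabilities come out at
roughly 0.36, 0.25, 0.21, and 0.18.
A smaller exponent flattens the distribution; a larger one concentrates
sampling on the highest-ranked words.
.RE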
.SH EXAMPLES
.PP
Train embeddings on \f[I]dewiki.txt\f[R] using the skip-gram model:
.IP
.nf
\f[C]
finalfrontier skipgram dewiki.txt dewiki-skipgram.bin
\f[R]
.fi
.PP
Train embeddings with dimensionality 200 on \f[I]dewiki.txt\f[R] using
the structured skip-gram model with a context window of 5 tokens:
.IP
.nf
\f[C]
finalfrontier skipgram --model structgram --context 5 --dims 200 \[rs]
  dewiki.txt dewiki-structgram.bin
\f[R]
.fi
.SH SEE ALSO
.PP
\f[C]finalfrontier\f[R](1), \f[C]finalfrontier-deps\f[R](1)
.SH AUTHORS
Daniel de Kok.