The same style of ngram extraction is always used at index time and at
query time.
Each ngram type uses the ngram size configuration differently.
All ngram styles use Unicode codepoints as the definition of a character.
For example, a 3-gram might contain up to 12 bytes, if it contains 3 Unicode
codepoints that each require 4 UTF-8 code units.
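As a quick check of the codepoint-versus-byte distinction, the following
sketch counts both for a single 3-gram. The three characters are arbitrary
choices made for illustration; any codepoints above U+FFFF would behave the
same way.

```python
# Three codepoints outside the Basic Multilingual Plane. Each one needs
# 4 bytes in UTF-8. The specific characters are arbitrary.
gram = "\U0001F600\U0001F601\U0001F602"

codepoints = len(gram)                  # Python's len counts codepoints
utf8_bytes = len(gram.encode("utf-8"))  # size of the UTF-8 encoding in bytes

print(codepoints, utf8_bytes)  # 3 12
```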
This is the traditional style of ngram, where a sliding window of size
N is moved across the entire content to be indexed. For example, the
3-grams for the string homer are hom, ome and mer.
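A minimal sketch of sliding-window extraction, assuming nothing beyond the
description here. The function name window_ngrams is hypothetical, and the
handling of input shorter than N is an assumption of this sketch rather
than documented behavior.

```python
def window_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive codepoints in text, left to right."""
    # Python strings index by codepoint, which matches the definition of a
    # character used here. Input shorter than n yields an empty list in
    # this sketch (an assumption, not documented behavior).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(window_ngrams("homer", 3))  # ['hom', 'ome', 'mer']
```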
This style of ngram produces progressively longer ngrams, where each
ngram is anchored to the start of a word. Words are determined simply by
splitting on whitespace.
For example, the edge ngrams of homer simpson, where the max ngram
size is 5, would be: hom, home, homer, sim, simp, simps. Generally,
for this ngram type, one wants to use a large maximum ngram size,
perhaps somewhere close to the length in characters of the longest word
in the corpus.
Note that there is no way to set the minimum ngram size (which is 3).
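The edge style can be sketched as follows. The names edge_ngrams and
MIN_NGRAM_SIZE are hypothetical, and skipping words shorter than the
minimum size is an assumption of this sketch; the text does not say how
such words are treated.

```python
MIN_NGRAM_SIZE = 3  # fixed minimum described in the text


def edge_ngrams(text: str, max_size: int) -> list[str]:
    """Return the prefixes of each whitespace-separated word in text,
    from MIN_NGRAM_SIZE up to max_size codepoints long."""
    grams = []
    for word in text.split():  # split on runs of whitespace
        # A prefix cannot be longer than the word itself.
        longest = min(max_size, len(word))
        # Words shorter than MIN_NGRAM_SIZE contribute nothing here
        # (an assumption of this sketch).
        for n in range(MIN_NGRAM_SIZE, longest + 1):
            grams.append(word[:n])
    return grams


print(edge_ngrams("homer simpson", 5))
# ['hom', 'home', 'homer', 'sim', 'simp', 'simps']
```

With a maximum size of 5, the word simpson is never indexed in full, which
illustrates why a maximum close to the longest word length is usually
preferable for this style.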