§Name_Gen
A library for reading names and generating derived names, using ngrams to guess the next character in the sequence.
§Intended Goal
A primary goal of this project is to have a smaller memory footprint than implementations that store lists of names with the library or rely on a lookup within a corpus, while also providing a mechanism that produces some know
§Recommended usage
- Create a mutable instance of NameExperiments. N=2 or N=3 are reasonable starting points.
- Utilize the Name struct to handle raw &str of name text, or perform manual conversion from &str to &[Option<char>]. Dedupe if desired.
- Iterate through names and run NameExperiments::read_positive_sample on each.
- Utilize NameExperiments::build_random_name. Apply external analysis to separate valid names from non-names.
- Reinforce the weights within the NameExperiments by continuing to call NameExperiments::read_positive_sample and NameExperiments::read_negative_sample using valid and invalid names.
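The manual conversion from &str to &[Option<char>] mentioned in the steps above can be sketched as follows. The helper name to_option_chars is hypothetical and not part of this crate, and the sketch assumes every character maps to Some(c); the crate's own Name struct may pad or filter characters differently.

```rust
/// Hypothetical helper (not part of the crate): wrap each character of the
/// raw text in Some so the result can be borrowed as &[Option<char>].
fn to_option_chars(text: &str) -> Vec<Option<char>> {
    text.chars().map(Some).collect()
}

/// Borrowing the Vec yields the slice type that the text describes.
fn as_sample(chars: &Vec<Option<char>>) -> &[Option<char>] {
    chars.as_slice()
}
```

A caller would keep the Vec alive and pass a borrowed slice of it wherever &[Option<char>] is expected.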
§Examples
// Track the three preceding characters (N = 3).
let mut name_guess_experiments: NameExperiments<3> = NameExperiments::new();
// A slice of string slices; `&[str]` would not compile.
let orc_names: &[&str] = &["Morgash", "Nargul", "Snarlgash"];
let names = Name::new_from_batch(
    orc_names,
    "male",
    PaddingBias::Left,
    Some("Orc"),
    None,
    None,
    None,
);
// Read every name as a positive sample.
for n in names.iter() {
    let _ = name_guess_experiments.read_positive_sample(&n.text).unwrap();
}
// Build a new name from the recorded weights.
let new_name = name_guess_experiments.build_random_name(Some(16)).unwrap();
println!("Hello, {}!", new_name);
§Implementation details explained
This library exports the NameExperiments struct, which supports the analysis and extraction of probability distributions of character combinations.
To start, define a new NameExperiments with a generic const parameter N. N indicates how many characters to look backwards while analyzing a name (values of N less than 2 will result in a panic when NameExperiments::new() is called).
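To make the role of N concrete, here is a minimal sketch of counting which character follows each window of N preceding characters. It is an illustration only, not this crate's implementation: NgramCounts and its methods are hypothetical names, a HashMap stands in for the crate's preallocated storage, and None is assumed to mark padding before the first character.

```rust
use std::collections::HashMap;

/// Illustrative counter: maps the N preceding characters to counts of the
/// character that was observed next.
struct NgramCounts<const N: usize> {
    counts: HashMap<[Option<char>; N], HashMap<char, u32>>,
}

impl<const N: usize> NgramCounts<N> {
    fn new() -> Self {
        // Mirrors the documented rule that N below 2 is rejected.
        assert!(N >= 2, "N must be at least 2");
        Self { counts: HashMap::new() }
    }

    /// Slide a window of N characters across the name, starting from an
    /// all-None (padding) context, and count each following character.
    fn read(&mut self, name: &str) {
        let mut context: [Option<char>; N] = [None; N];
        for c in name.chars() {
            *self.counts.entry(context).or_default().entry(c).or_insert(0) += 1;
            // Shift the window left by one and append the new character.
            context.rotate_left(1);
            context[N - 1] = Some(c);
        }
    }

    /// The most frequently observed character after `context`, if any.
    /// Ties are broken arbitrarily.
    fn most_likely_next(&self, context: [Option<char>; N]) -> Option<char> {
        self.counts
            .get(&context)?
            .iter()
            .max_by_key(|(_, n)| **n)
            .map(|(c, _)| *c)
    }
}
```

After reading "Morgash" and "Snarlgash" with N=2, the context "ga" has only ever been followed by "s", so that is what most_likely_next returns for it. A larger N captures longer patterns at the cost of many more possible contexts.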
The NameExperiments::read_positive_sample function can be called on each name in a list. This library assumes that a user will utilize the text field in the included Name struct, but this can be bypassed by passing an array slice of Option<char> into read_positive_sample.
Note: The read_positive_sample function makes no attempt to de-duplicate text that has already been read. If the same name is read into a NameExperiments struct more than once, the weights around that name’s character sequences will become stronger. This might not be the intent; users of this library are advised to apply filtering or de-duplication earlier in their data pipeline.
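One possible way to de-duplicate before reading samples is sketched below. The function dedupe_names is a hypothetical helper, not part of this crate, and treating names that differ only by case as duplicates is an assumption that may not suit every data set.

```rust
use std::collections::HashSet;

/// Hypothetical pre-processing step: drop repeated names (ignoring case)
/// while keeping the first occurrence and the original order.
fn dedupe_names<'a>(names: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    names
        .iter()
        .copied()
        // HashSet::insert returns false when the value was already present.
        .filter(|name| seen.insert(name.to_lowercase()))
        .collect()
}
```

The returned list can then be passed on to whatever step builds the samples, so each name reinforces the weights only once.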
Aside from gaining data about names, the NameExperiments struct can also read array slices of characters that are decidedly not names. The determination of what is or isn’t a name is up to the user of the API. As a starting point, this can help to de-weight ngrams that would result in long sequences of vowels, consonants, or simply letters that don’t often follow one another.
Use NameExperiments::read_negative_sample to update weights that should correspond to de-emphasized character sequences.
Note: Again, read_negative_sample does not de-duplicate names.
§Runtime Memory impact
Under the hood, the weights of the samples are stored within four Vec instances whose sizes are allocated when new() is called. Two of the Vec instances hold observations about character sequences, along with the count of each (N+1)th character observed, in an array whose length corresponds to the number of ValidChar variants.
The other two Vec instances hold observation data about character type sequences and the character types that follow them, in an array whose length corresponds to the number of CharType variants.
All observations are stored in u8 format to minimize the memory impact of the weights (see Intended Goal), but analysis of larger data sets with frequent occurrences of the same ngram sets may prove this primitive too small.
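The limit of a u8 weight can be illustrated with the standard saturating arithmetic methods. This is only a sketch of one way to cope with the small range: the functions reinforce and penalize are hypothetical, and this documentation does not state whether the crate saturates, wraps, or panics when a count passes 255.

```rust
/// Illustrative only: increase a u8 weight, stopping at 255 instead of
/// wrapping back to 0.
fn reinforce(weight: &mut u8) {
    *weight = weight.saturating_add(1);
}

/// Illustrative only: decrease a u8 weight, stopping at 0.
fn penalize(weight: &mut u8) {
    *weight = weight.saturating_sub(1);
}
```

With saturation, an ngram seen 300 times and one seen 255 times end up with the same weight, which is the loss of resolution that the paragraph above warns about.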
Given an N`` number of preceding characters assuming that there are 29 valid characters and 10 character types the
NameExperimentholds two
Vecof capacity
29^Nand each array within the vec will be size 29 bytes. Meanwhile the two char_type sample weights will be
10^Nwith arrays of size 10 bytes. In the case of
N=2memory footprint is estimated to be 51 kB. In the case of
N=3` memory footprint is estimated to be 1.4 MB.
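The arithmetic behind these estimates can be reproduced with a short function. The name estimated_bytes is hypothetical; the figures of 29 valid characters, 10 character types, and one byte per weight come from the text above, and the result ignores any allocator or struct overhead.

```rust
/// Estimated bytes of weight storage for a given N, using the assumptions
/// stated in the text: 29 valid characters, 10 character types, u8 weights.
fn estimated_bytes(n: u32) -> u64 {
    const VALID_CHARS: u64 = 29;
    const CHAR_TYPES: u64 = 10;
    // Two Vec of 29^N entries, each entry an array of 29 bytes.
    let char_tables = 2 * VALID_CHARS.pow(n) * VALID_CHARS;
    // Two Vec of 10^N entries, each entry an array of 10 bytes.
    let type_tables = 2 * CHAR_TYPES.pow(n) * CHAR_TYPES;
    char_tables + type_tables
}
```

For N=2 this gives 48,778 + 2,000 = 50,778 bytes (about 51 kB), and for N=3 it gives 1,414,562 + 20,000 = 1,434,562 bytes (about 1.4 MB). Nearly all of the footprint comes from the character tables rather than the character type tables.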
For reference: a system that loads a corpus of names (of average length 8) could hold around 22,400 names in 1.4 MB, but it would depend on the user to provide the names.
§TODO
- Export and import weights to facilitate storage and retrieval between reinforcement sessions.
- Measure runtime memory impact and compare it to the estimates.
- The estimates provided in the runtime memory impact section imply that names could be generated with significantly lower memory consumption if the system relied on lower dimensions of character encoding (e.g. character type classifications) instead of using lengthier ngrams.
Structs§
- Name - A stack allocated struct to hold information about the name being created.
- NameExperiments - A data structure that holds a variety of weights from reading lists of names and not-names. A NameExperiments struct is the primary way to read and generate derived names based on a body of text.
Enums§
- CharType - A tagged enum to label characters as having kinds of phonetic sounds. Analysis of ngrams of character sounds can potentially discover a rational reason why specific sounds may occur in a group of names. The goal of this enum is to group categories so that their relationship to one another can be studied in a given identity group.
- PaddingBias - A tagged enum to flag whether the name is left or right biased in terms of null padding.
- ValidChar - An enum of characters that lets Rust make better use of pattern matching in code elsewhere.