easy_ml/
naive_bayes.rs

/*!
Naïve Bayes examples

# Naïve Bayes

The naïve Bayes assumption is that all features in the labelled data are independent of each other
given the class they correspond to. This means the probability of some input given a class
can be computed as the product of each individual feature in that input conditioned on that class.

By Bayes' Theorem we can relate the probability of the class given the input to the probability
of the input given the class. As a classifier only needs to determine which class some input
is most likely to be, we can compare just the product of the probability of the input given a class
and the probability of that class.

## Bayes' Theorem

posterior = ( prior x likelihood ) / evidence

P(C<sub>k</sub> | **x**) = ( P(C<sub>k</sub>) * P(**x** | C<sub>k</sub>) ) / P(**x**)

P(C<sub>k</sub> | **x**) ∝ P(C<sub>k</sub>) * P(**x** | C<sub>k</sub>)

where C<sub>k</sub> is the kth class and **x** is the input to classify.

## Classifier

Taking logs on Bayes' rule yields

log(P(C<sub>k</sub> | **x**)) ∝ log(P(C<sub>k</sub>)) + log(P(**x** | C<sub>k</sub>))

and applying the naïve Bayes assumption gives

log(P(C<sub>k</sub> | **x**)) ∝ log(P(C<sub>k</sub>)) + the sum over all i features of (log(P(x<sub>i</sub> | C<sub>k</sub>)))

Then to determine the class we take the class corresponding to the largest
log(P(C<sub>k</sub> | **x**)).

How the individual probability of a feature conditioned on a class is computed depends on the
type of the data.

For categorical data this is simply occurrences of the category in the class / total samples in
the class. In practice Laplace smoothing (adding one occurrence of each category to the counts)
may be used to avoid computing a probability of 0 when some category doesn't have any samples
for a class.

For continuous data we can model the feature as distributed according to a Gaussian distribution.

## Simple Naïve Bayes Example with F-1 score analysis

Naïve Bayes can be done by hand (with a calculator), which is what the example below shows.
We have a list of data about the environment and want to predict if we should go outside based
on the conditions. Some of these features are categorical values and others are real valued.

```
use easy_ml::distributions::Gaussian;
use easy_ml::linear_algebra;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Weather {
    Stormy, Rainy, Cloudy, Clear, Sunny
}

type WindSpeed = f64;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Pandemic {
    Pandemic, NoPandemic
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum Decision {
    GoOut, StayIn
}

#[derive(Clone, Copy, PartialEq, Debug)]
struct Observation {
    weather: Weather,
    wind: WindSpeed,
    pandemic: Pandemic,
    decision: Decision,
}

impl Observation {
    fn new(
        weather: Weather, wind: WindSpeed, pandemic: Pandemic, decision: Decision
    ) -> Observation {
        Observation {
            weather,
            wind,
            pandemic,
            decision
        }
    }
}

let observations = vec![
    Observation::new(Weather::Clear, 0.5, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Clear, 0.9, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Clear, 0.8, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Stormy, 0.7, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Rainy, 0.1, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Rainy, 0.5, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Rainy, 0.6, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Rainy, 0.7, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Cloudy, 0.3, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Cloudy, 0.5, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Cloudy, 0.2, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Cloudy, 0.8, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Sunny, 0.3, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Sunny, 0.9, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Sunny, 0.5, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Sunny, 0.5, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Sunny, 0.5, Pandemic::Pandemic, Decision::StayIn),
    Observation::new(Weather::Clear, 0.1, Pandemic::Pandemic, Decision::StayIn),
    Observation::new(Weather::Clear, 0.9, Pandemic::Pandemic, Decision::StayIn)
];

fn predict(
    observations: &[Observation], weather: Weather, wind: WindSpeed, pandemic: Pandemic
) -> Decision {
    let total = observations.len() as f64;
    // first compute the number of each class in the data
    let total_stay_in = observations.iter()
        .filter(|observation| observation.decision == Decision::StayIn)
        .count() as f64;
    let total_go_out = observations.iter()
        .filter(|observation| observation.decision == Decision::GoOut)
        .count() as f64;

    let weather_log_probability_stay_in = {
        // compute how many rows in the data are this weather and stay in
        let total = observations.iter()
            .filter(|observation| observation.decision == Decision::StayIn)
            .filter(|observation| observation.weather == weather)
            .count() as f64;
        // there are 5 variants for the weather and we use Laplace smoothing
        // to avoid introducing zero probabilities, +1 / +5 treats the data
        // as if there is one stay in for each weather type.
        ((total + 1.0) / (total_stay_in + 5.0)).ln()
    };

    let weather_log_probability_go_out = {
        // compute how many rows in the data are this weather and go out
        let total = observations.iter()
            .filter(|observation| observation.decision == Decision::GoOut)
            .filter(|observation| observation.weather == weather)
            .count() as f64;
        // there are 5 variants for the weather and we use Laplace smoothing
        // to avoid introducing zero probabilities, +1 / +5 treats the data
        // as if there is one go out for each weather type.
        ((total + 1.0) / (total_go_out + 5.0)).ln()
    };

    // we're modelling the wind as a Gaussian so we get a likelihood by
    // computing the probability density for the wind we have, which is
    // largest when the wind matches the mean wind for stay in and falls
    // towards 0 the further away it gets. We take the natural log so this
    // term is on the same scale as the other log probabilities.
    let wind_speed_model_stay_in: Gaussian<WindSpeed> = Gaussian::approximating(
        observations.iter()
            .filter(|observation| observation.decision == Decision::StayIn)
            .map(|observation| observation.wind)
    );
    let wind_log_probability_stay_in = wind_speed_model_stay_in.probability(&wind).ln();

    let wind_speed_model_go_out: Gaussian<WindSpeed> = Gaussian::approximating(
        observations.iter()
            .filter(|observation| observation.decision == Decision::GoOut)
            .map(|observation| observation.wind)
    );
    let wind_log_probability_go_out = wind_speed_model_go_out.probability(&wind).ln();

    let pandemic_log_probability_stay_in = {
        // compute how many rows in the data are this pandemic and stay in
        let total = observations.iter()
            .filter(|observation| observation.decision == Decision::StayIn)
            .filter(|observation| observation.pandemic == pandemic)
            .count() as f64;
        // there are 2 variants for the pandemic type and we use Laplace smoothing
        // to avoid introducing zero probabilities, +1 / +2 treats the data
        // as if there is one stay in for each pandemic type.
        ((total + 1.0) / (total_stay_in + 2.0)).ln()
    };

    let pandemic_log_probability_go_out = {
        // compute how many rows in the data are this pandemic and go out
        let total = observations.iter()
            .filter(|observation| observation.decision == Decision::GoOut)
            .filter(|observation| observation.pandemic == pandemic)
            .count() as f64;
        // there are 2 variants for the pandemic type and we use Laplace smoothing
        // to avoid introducing zero probabilities, +1 / +2 treats the data
        // as if there is one go out for each pandemic type.
        ((total + 1.0) / (total_go_out + 2.0)).ln()
    };

    let prior_log_probability_stay_in = (total_stay_in / total).ln();
    let prior_log_probability_go_out = (total_go_out / total).ln();

    let posterior_log_probability_stay_in = prior_log_probability_stay_in
        + weather_log_probability_stay_in
        + wind_log_probability_stay_in
        + pandemic_log_probability_stay_in;
    let posterior_log_probability_go_out = prior_log_probability_go_out
        + weather_log_probability_go_out
        + wind_log_probability_go_out
        + pandemic_log_probability_go_out;

    if posterior_log_probability_go_out > posterior_log_probability_stay_in {
        Decision::GoOut
    } else {
        Decision::StayIn
    }
}

let test_data = vec![
    Observation::new(Weather::Sunny, 0.8, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Sunny, 0.2, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Stormy, 0.2, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Cloudy, 0.3, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Rainy, 0.8, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Rainy, 0.5, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Stormy, 0.6, Pandemic::Pandemic, Decision::StayIn),
    Observation::new(Weather::Rainy, 0.1, Pandemic::Pandemic, Decision::StayIn),
];

let predictions = test_data.iter()
    .map(|data| predict(&observations, data.weather, data.wind, data.pandemic))
    .collect::<Vec<Decision>>();

println!("Test data and predictions\n{:?}", test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .collect::<Vec<(Observation, Decision)>>());

println!("Accuracy: {:?}", test_data.iter()
    .zip(predictions.clone())
    .map(|(data, decision)| if data.decision == decision { 1.0 } else { 0.0 })
    .sum::<f64>() / (test_data.len() as f64));
// To compute Recall and Precision it is necessary to decide which class should be
// considered the positive and which should be the negative. For medical diagnosis
// the positive is always diagnosing something as present. For this example we take
// GoOut to be the positive case.

// True Positives are where the model predicts the positive class when it is
// the correct decision, eg in this scenario, going outside when it should
// decide to go outside
let true_positives = test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .filter(|(data, _)| data.decision == Decision::GoOut)
    .filter(|(_, decision)| decision == &Decision::GoOut)
    .count() as f64;

// False Positives are when the model predicts the positive class when it is
// not the correct decision, eg in this scenario, going outside when it should
// decide to stay in
let false_positives = test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .filter(|(data, _)| data.decision == Decision::StayIn)
    .filter(|(_, decision)| decision == &Decision::GoOut)
    .count() as f64;

// True Negatives are when the model predicts the negative class when it is
// the correct decision, eg in this scenario, staying in when it should
// decide to stay in because there's a pandemic and no good reason to go outside
let true_negatives = test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .filter(|(data, _)| data.decision == Decision::StayIn)
    .filter(|(_, decision)| decision == &Decision::StayIn)
    .count() as f64;

// False Negatives are when the model predicts the negative class when it is
// not the correct decision, eg in this scenario, staying in when it should
// decide to go outside
let false_negatives = test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .filter(|(data, _)| data.decision == Decision::GoOut)
    .filter(|(_, decision)| decision == &Decision::StayIn)
    .count() as f64;

// Precision measures how good the model is at identifying the positive
// class (you can trivially get 100% precision by never predicting the
// positive class, as this means you can't get a false positive).
let precision = true_positives / (true_positives + false_positives);

// Recall is the true positive rate which is how good the model is at
// identifying the positive class out of all the positive cases (you can
// trivially get 100% recall by always predicting the positive class).
let recall = true_positives / (true_positives + false_negatives);

// The F-1 score is the harmonic mean of precision and recall, combining
// them into a single accuracy-like measure.
// In this case the two classes are roughly equally likely, so the F-1
// score and Accuracy are similar. However, if the model had learned
// to always predict GoOut, then it would still have an accuracy of
// roughly 50% because of the equal likelihood, whereas its F-1 score
// would be much lower than 50%.
let f1_score = linear_algebra::f1_score(precision, recall);
println!("F1-Score: {:?}", f1_score);

assert!(f1_score > 0.8);
```

## 3 Class Naïve Bayes Example
[See submodule for 3 Class Naïve Bayes Example](three_class)
*/

#[rustfmt::skip]
pub mod three_class {
/*!
# 3 Class Naïve Bayes Example

For this example some population data is generated for a fictional alien race as I didn't
have any real datasets to hand. This alien race has 3 sexes (mysteriously no individuals
are ever intersex or trans) and is sexually dimorphic, meaning we can try to determine their
sex from various measurements.

As with humans, a Gaussian distribution for physical characteristics is sensible due to
evolutionary and biological factors.

The example includes categorical features such as marking color and real valued
features such as height in order to show how both can be modelled with Naïve Bayes.

Note that most of the code below is for generating and clustering the data to perform
Naïve Bayes on, rather than the Naïve Bayes itself.

### Clustering

After generating the unlabelled data, clustering is performed on a very small subset of that data
to find three cluster centres, which are then used to assign sex class labels to the whole dataset.
This creates a labelled dataset to perform Naïve Bayes on, to see if we can predict sex given
the various features.

Because we cluster only a very small subset of the data, and the data has many dimensions, we can
expect by chance that there will be large gaps in the subset.

## Matrix APIs

```
use easy_ml::matrices::Matrix;
use easy_ml::matrices::views::{MatrixView, MatrixRange};
use easy_ml::linear_algebra;
use easy_ml::distributions::Gaussian;

use rand::{Rng, SeedableRng};
use rand::distr::{Iter, StandardUniform};
use rand_chacha::ChaCha8Rng;

#[derive(Clone, Copy, PartialEq, Debug)]
struct Alien {
    height: f64,
    primary_marking_color: AlienMarkingColor,
    tail_length: f64,
    metabolic_rate: f64,
    spiked_tail: bool,
    sex: AlienSex,
}

#[derive(Clone, Copy, Eq, PartialEq, Debug)]
enum AlienMarkingColor {
    Red = 1, Yellow, Orange, Blue, Purple, White, Black
}

#[derive(Clone, Copy, Eq, PartialEq, Debug)]
enum AlienSex {
    A, B, C
}

/*
 * ===============================
 * =       Data Generation       =
 * ===============================
 */

// rather than explicitly defining a generative function that already knows the relationship
// between alien characteristics, the samples are generated without an assigned sex
// and then clustered using k-means

// use a fixed seed random generator from the rand crate
let mut random_generator = ChaCha8Rng::seed_from_u64(16);
let mut random_numbers: Iter<StandardUniform, &mut ChaCha8Rng, f64> =
    (&mut random_generator).sample_iter(StandardUniform);

/**
 * Generates height data for creating the alien dataset.
 */
fn generate_heights(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    // the average height shall be 1.5 meters with a standard deviation of 0.25
    let heights_distribution = Gaussian::new(1.5, 0.25 * 0.25);
    let mut heights = heights_distribution.draw(random_numbers, samples).unwrap();
    // ensure all aliens are at least 0.25 meters tall
    heights.drain(..).map(|x| x.max(0.25)).collect()
}

/**
 * Generates tail length data for creating the alien dataset.
 */
fn generate_tail_length(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    // the average length shall be 1.25 meters with more variation in tail length
    let tails_distribution = Gaussian::new(1.25, 0.5 * 0.5);
    let mut tails = tails_distribution.draw(random_numbers, samples).unwrap();
    // ensure all tails are at least 0.5 meters long
    tails.drain(..).map(|x| x.max(0.5)).collect()
}

/**
 * Generates color data for creating the alien dataset.
 *
 * Note that floats are still returned despite this being a category because we need all the
 * data types to be the same for clustering
 */
fn generate_colors(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    let mut colors = Vec::with_capacity(samples);
    for _ in 0..samples {
        let x = random_numbers.next().unwrap();
        if x < 0.2 {
            colors.push(AlienMarkingColor::Red as u8 as f64);
        } else if x < 0.3 {
            colors.push(AlienMarkingColor::Yellow as u8 as f64);
        } else if x < 0.45 {
            colors.push(AlienMarkingColor::Orange as u8 as f64);
        } else if x < 0.59 {
            colors.push(AlienMarkingColor::Blue as u8 as f64);
        } else if x < 0.63 {
            colors.push(AlienMarkingColor::Purple as u8 as f64);
        } else if x < 0.9 {
            colors.push(AlienMarkingColor::White as u8 as f64);
        } else {
            colors.push(AlienMarkingColor::Black as u8 as f64);
        }
    }
    colors
}

/**
 * Recovers the color type which is the closest match to the input floating point color
 */
fn recover_generated_color(color: f64) -> AlienMarkingColor {
    let numerical_colors = [
        AlienMarkingColor::Red as u8 as f64,
        AlienMarkingColor::Yellow as u8 as f64,
        AlienMarkingColor::Orange as u8 as f64,
        AlienMarkingColor::Blue as u8 as f64,
        AlienMarkingColor::Purple as u8 as f64,
        AlienMarkingColor::White as u8 as f64,
        AlienMarkingColor::Black as u8 as f64,
    ];
    let colors = [
        AlienMarkingColor::Red,
        AlienMarkingColor::Yellow,
        AlienMarkingColor::Orange,
        AlienMarkingColor::Blue,
        AlienMarkingColor::Purple,
        AlienMarkingColor::White,
        AlienMarkingColor::Black,
    ];
    // look for the closest fit, as manipulated floating points may not be exact
    let color_index = numerical_colors.iter()
        // take the absolute difference so an exact match will become 0
        .map(|c| (c - color).abs())
        .enumerate()
        // find the element with the smallest difference in the list
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
        // discard the difference
        .map(|(index, _)| index)
        // retrieve the value from the option
        .unwrap();
    colors[color_index]
}

/**
 * Generates metabolic rate data for creating the alien dataset.
 */
fn generate_metabolic_rate(
    samples: usize, random_numbers: &mut impl Iterator<Item = f64>
) -> Vec<f64> {
    // the average rate shall be 100 heart beats per minute with a standard deviation of 20
    let metabolic_rate_distribution = Gaussian::new(100.0, 20.0 * 20.0);
    let mut metabolic_rates = metabolic_rate_distribution.draw(random_numbers, samples).unwrap();
    // ensure all rates are at least 50 and at most 200
    metabolic_rates.drain(..).map(|x| x.clamp(50.0, 200.0)).collect()
}

/**
 * Generates spiked tail data for creating the alien dataset.
 */
fn generate_spiked_tail(
    samples: usize, random_numbers: &mut impl Iterator<Item = f64>
) -> Vec<f64> {
    let mut spikes = Vec::with_capacity(samples);
    for _ in 0..samples {
        let x = random_numbers.next().unwrap();
        if x < 0.4 {
            // 0 shall represent a non spiked tail
            spikes.push(0.0)
        } else {
            // 1 shall represent a spiked tail
            spikes.push(1.0)
        }
    }
    spikes
}

/**
 * Recovers the spiked tail type which is the closest match to the input floating point
 */
fn recover_generated_spiked_tail(spiked: f64) -> bool {
    spiked >= 0.5
}

// We shall generate 1000 samples for the dataset
const SAMPLES: usize = 1000;

// Collect all the float typed features into a matrix
let unlabelled_dataset = {
    let mut dataset = Matrix::column(generate_heights(SAMPLES, &mut random_numbers));
    dataset.insert_column_with(1,
        generate_colors(SAMPLES, &mut random_numbers).drain(..));
    dataset.insert_column_with(2,
        generate_tail_length(SAMPLES, &mut random_numbers).drain(..));
    dataset.insert_column_with(3,
        generate_metabolic_rate(SAMPLES, &mut random_numbers).drain(..));
    dataset.insert_column_with(4,
        generate_spiked_tail(SAMPLES, &mut random_numbers).drain(..));
    dataset
};

/*
 * ===============================
 * =          Clustering         =
 * ===============================
 */

// Create a subset of the first 30 samples from the full dataset to use for clustering
let unlabelled_subset = MatrixView::from(
    MatrixRange::from(&unlabelled_dataset, 0..30, 0..unlabelled_dataset.columns())
);

// We rescale all the features using each feature's mean and variance because
// we will use euclidean distance as the distance metric, and our features
// have very different scales. This avoids the distance metric being
// dominated by any particular feature.

let mut means_and_variances = Vec::with_capacity(unlabelled_subset.columns());

// The normalised subset is computed taking the mean and variance from the subset,
// these means and variances will be needed later to apply to the rest of the data.
let mut normalised_subset = {
    let mut normalised_subset = unlabelled_subset.map(|x| x);
    for feature in 0..normalised_subset.columns() {
        let mean = linear_algebra::mean(normalised_subset.column_iter(feature));
        let variance = linear_algebra::variance(normalised_subset.column_iter(feature));
        // save the data for normalisation and denormalisation for each feature
        // for use later
        means_and_variances.push(vec![ mean, variance ]);

        for row in 0..normalised_subset.rows() {
            let x = normalised_subset.get(row, feature);
            normalised_subset.set(row, feature, (x - mean) / variance);
        }
    }
    normalised_subset
};

// create a 5 x 2 matrix where each row is the mean and variance of each of the 5 features
let means_and_variances = Matrix::from(means_and_variances);

// pick the first 3 samples as the starting points for the cluster centres
// and place them into a 3 x 5 matrix where we have 3 rows of cluster centres
// and 5 features which are all normalised
let mut clusters = Matrix::from(vec![
    normalised_subset.row_iter(0).collect(),
    normalised_subset.row_iter(1).collect(),
    normalised_subset.row_iter(2).collect()]);

// add a 6th column to the subset to track the closest cluster for each sample
const CLUSTER_ID_COLUMN: usize = 5;
normalised_subset.insert_column(CLUSTER_ID_COLUMN, -1.0);

// set a threshold at which we consider the cluster centres to have converged
const CHANGE_THRESHOLD: f64 = 0.001;

// track how much the means have changed each update
let mut absolute_changes = -1.0;

// loop until we go under the CHANGE_THRESHOLD, reassigning points to the nearest
// cluster then cluster centres to their mean of points
while absolute_changes == -1.0 || absolute_changes > CHANGE_THRESHOLD {
    // assign each point to the nearest cluster centre by euclidean distance
    for point in 0..normalised_subset.rows() {
        let mut closest_cluster = -1.0;
        let mut least_squared_distance = f64::MAX;
        for cluster in 0..clusters.rows() {
            // we don't actually need to square root the distances for finding
            // which is least because least squared distance is the same as
            // least distance
            let squared_distance = {
                let mut sum = 0.0;
                for feature in 0..clusters.columns() {
                    let cluster_coordinate = clusters.get(cluster, feature);
                    let point_coordinate = normalised_subset.get(point, feature);
                    sum += (cluster_coordinate - point_coordinate).powi(2);
                }
                sum
            };

            if squared_distance < least_squared_distance {
                closest_cluster = cluster as f64;
                least_squared_distance = squared_distance;
            }
        }
        // save the cluster that is closest to each point in the 6th column
        normalised_subset.set(point, CLUSTER_ID_COLUMN, closest_cluster);
    }

    // update cluster centres to the mean of their points
    absolute_changes = 0.0;
    for cluster in 0..clusters.rows() {
        // construct a list of the points this cluster owns
        let owned = normalised_subset.column_iter(CLUSTER_ID_COLUMN)
            // zip together the id values with their index
            .enumerate()
            // exclude the points that aren't assigned to this cluster
            .filter(|(_, id)| (*id as usize) == cluster)
            // drop the cluster ids from each item and copy over the data
            // for each point for each feature
            .map(|(index, _)| {
                // for each point copy all its data in each feature excluding the
                // final cluster id column into a new vec
                normalised_subset.row_iter(index)
                    // taking the first 5 excludes the 6th column due to 0 indexing
                    .take(CLUSTER_ID_COLUMN)
                    .collect::<Vec<f64>>()
            })
            // collect into a vector of vectors containing each feature's data
            .collect::<Vec<Vec<f64>>>();
        // pass the vector of vectors into a matrix so we have
        // a matrix where each row is the data of a point this cluster owns
        let owned = Matrix::from(owned);

        // construct a vector of the mean for each feature that this cluster
        // now has
        let new_means = {
            let mut means = Vec::with_capacity(owned.columns());

            for feature in 0..owned.columns() {
                let mean = owned.column_iter(feature).sum::<f64>() / (owned.rows() as f64);
                means.push(mean);
            }

            means
        };

        // update each new mean for the cluster
        for feature in 0..clusters.columns() {
            let previous_mean = clusters.get(cluster, feature);
            // track the absolute difference between the new mean and the old one
            // so we know when to stop updating the clusters
            absolute_changes += (previous_mean - new_means[feature]).abs();

            println!(
                "Cluster {:?} mean for feature {:?} is now {:?} was {:?}",
                cluster, feature, new_means[feature], previous_mean
            );
            clusters.set(cluster, feature, new_means[feature]);
        }
    }
}

println!(
    "Denormalised clusters at convergence:\n{:?}\n{:.3}",
    vec![ "H", "C", "T", "M", "S" ],
    clusters.map_with_index(|x, _cluster, feature| {
        let mean = means_and_variances.get(feature, 0);
        let variance = means_and_variances.get(feature, 1);
        (x * variance) + mean
    }));

// Now we will assign every alien in the full dataset a sex using these cluster centres
let mut aliens: Vec<Alien> = Vec::with_capacity(unlabelled_dataset.rows());

fn assign_alien_sex(index: u8) -> AlienSex {
    if index == 0 {
        AlienSex::A
    } else if index == 1 {
        AlienSex::B
    } else {
        AlienSex::C
    }
}

for i in 0..unlabelled_dataset.rows() {
    let alien_data = Matrix::column(unlabelled_dataset.row_iter(i).collect());
    // normalise the alien data first, using the same means and variances
    // the cluster centres were computed with
    let normalised_alien_data = alien_data.map_with_index(|x, feature, _| {
        let mean = means_and_variances.get(feature, 0);
        let variance = means_and_variances.get(feature, 1);
        // normalise each feature in the alien data
        (x - mean) / variance
    });
    let mut distances = Vec::with_capacity(clusters.rows());
    for j in 0..clusters.rows() {
        let cluster_data = Matrix::row(clusters.row_iter(j).collect());
        // use squared euclidean distance to compare the alien with the cluster,
        // the cluster is already normalised
        let difference = cluster_data - normalised_alien_data.transpose();
        let sum_of_squares = (&difference * difference.transpose()).scalar();
        distances.push(sum_of_squares);
    }

    // find the cluster with the lowest distance to each point and get its index
    let chosen_cluster = distances.iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
        .map(|(i, _)| i)
        .unwrap();

    // convert each float to the correct data type
    aliens.push(Alien {
        height: alien_data.get(0, 0),
        primary_marking_color: recover_generated_color(alien_data.get(1, 0)),
        tail_length: alien_data.get(2, 0),
        metabolic_rate: alien_data.get(3, 0),
        spiked_tail: recover_generated_spiked_tail(alien_data.get(4, 0)),
        sex: assign_alien_sex(chosen_cluster as u8),
    })
}
737
738println!("First 10 aliens");
739for i in 0..10 {
740    println!("{:?}", aliens[i]);
741}
742
743// Put the aliens in a matrix for convenience
744let aliens = Matrix::row(aliens);
745
746println!("Sex A aliens total: {}", aliens.row_reference_iter(0)
747    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::A { 1 } else { 0 }));
748
749println!("Sex B aliens total: {}", aliens.row_reference_iter(0)
750    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::B { 1 } else { 0 }));
751
752println!("Sex C aliens total: {}", aliens.row_reference_iter(0)
753    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::C { 1 } else { 0 }));
754
755/**
756 * ===============================
757 * =         Naïve Bayes         =
758 * ===============================
759 */

// Each class is roughly one third so we should not have a strong prior toward any
// particular class

// In order to evaluate the performance of the Naïve Bayes classifier we will hold out
// the last 100 aliens from the dataset and use them as test data

let training_data = Matrix::column(aliens.row_iter(0).take(900).collect());
let test_data = Matrix::column(aliens.row_iter(0).skip(900).collect());

/**
 * Predicts the most probable alien sex for each test input alien (disregarding
 * the sex field in those inputs)
 *
 * For the real valued features the probabilities are computed by modelling
 * the features (conditioned on each class) as Gaussian distributions.
 * For categorical features Laplacian smoothing of the counts is used to
 * estimate probabilities of the features (conditioned on each class).
 */
fn predict_aliens(training_data: &Matrix<Alien>, test_data: &Matrix<Alien>) -> Matrix<AlienSex> {
    let mut relative_log_probabilities = Vec::with_capacity(3);

    for class in &[ AlienSex::A, AlienSex::B, AlienSex::C ] {
        let training_data_class_only = training_data.column_iter(0)
            .filter(|a| &a.sex == class)
            .collect::<Vec<Alien>>();

        // compute how likely each class is in the training set
        let prior = (training_data_class_only.len() as f64) / (training_data.rows() as f64);

        // We model the real valued features as Gaussians, note that these
        // are Gaussian distributions over only the training data of each class
        let heights: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.height)
        );
        let tail_lengths: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.tail_length)
        );
        let metabolic_rates: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.metabolic_rate)
        );

        // gradually build up the sum of log probabilities to get the
        // log of the prior * likelihood which will be proportional to the posterior
        let relative_log_probabilities_of_class = test_data.column_reference_iter(0)
        .map(|alien| {
            // probability of the alien sex and the alien
            let mut log_relative_probability = prior.ln();

            // Compute the probability using the Gaussian model for each real valued feature.
            // Due to floating point precision limits and the variance for some of these
            // Gaussian models being extremely small (0.01 for heights) we
            // check if a probability computed is zero or extremely close to zero
            // and if so increase it a bit to avoid computing -inf when we take the log.

            let mut height_given_class = heights.probability(&alien.height);
            if height_given_class.abs() <= 1e-12 {
                height_given_class = 1e-12;
            }
            log_relative_probability += height_given_class.ln();

            let mut tail_given_class = tail_lengths.probability(&alien.tail_length);
            if tail_given_class.abs() <= 1e-12 {
                tail_given_class = 1e-12;
            }
            log_relative_probability += tail_given_class.ln();

            let mut metabolic_rates_given_class = metabolic_rates.probability(
                &alien.metabolic_rate);
            if metabolic_rates_given_class.abs() <= 1e-12 {
                metabolic_rates_given_class = 1e-12;
            }
            log_relative_probability += metabolic_rates_given_class.ln();

            // compute the probability of the categorical features using laplacian smoothing
            let color_of_class = training_data_class_only.iter()
                .map(|a| a.primary_marking_color)
                // count how many aliens of this class have this color
                .fold(0, |acc, color|
                    acc + if color == alien.primary_marking_color { 1 } else { 0 });
            // with laplacian smoothing we assume there is one datapoint for each color
            // which avoids zero probabilities but does not distort the probabilities much
            // there are 7 color types so we add 7 to the total
            let color_given_class = ((color_of_class + 1) as f64)
                / ((training_data_class_only.len() + 7) as f64);
            log_relative_probability += color_given_class.ln();

            let spiked_tail_of_class = training_data_class_only.iter()
                .map(|a| a.spiked_tail)
                // count how many aliens of this class have a spiked tail or not
                .fold(0, |acc, spiked| acc + if spiked == alien.spiked_tail { 1 } else { 0 });
            // again we assume one alien of the class with a spiked tail and one without
            // to avoid zero probabilities
            let spiked_tail_given_class = ((spiked_tail_of_class + 1) as f64)
                / ((training_data_class_only.len() + 2) as f64);
            log_relative_probability += spiked_tail_given_class.ln();

            if log_relative_probability == f64::NEG_INFINITY {
                println!("Individual probs P:{} H:{} T:{} M:{} C:{} S:{}",
                    prior, height_given_class, tail_given_class, metabolic_rates_given_class,
                    color_given_class, spiked_tail_given_class);
            }

            log_relative_probability
        }).collect();

        relative_log_probabilities.push(relative_log_probabilities_of_class);
    }

    // collect the relative probability estimates for each class and each alien
    // into a 3 x 100 matrix
    let probabilities = Matrix::from(relative_log_probabilities);

    let predictions = (0..probabilities.columns()).map(|i| {
        let predicted_class_index = probabilities.column_iter(i)
            .enumerate()
            // find the class with the highest relative probability estimate
            .max_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
            // discard the probability
            .map(|(index, _)| index)
            // retrieve the value from the option
            .unwrap();
        if predicted_class_index == 0 {
            AlienSex::A
        } else if predicted_class_index == 1 {
            AlienSex::B
        } else {
            AlienSex::C
        }
    }).collect();

    Matrix::column(predictions)
}
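
// A quick standalone check of the laplacian smoothing used in predict_aliens,
// using hypothetical counts for the 7 colors: adding one to each count keeps
// every probability non-zero while the distribution still sums to one
let color_counts = [30usize, 10, 15, 14, 4, 27, 0];
let total: usize = color_counts.iter().sum();
let smoothed = color_counts.iter()
    .map(|&count| (count + 1) as f64 / (total + color_counts.len()) as f64)
    .collect::<Vec<f64>>();
assert!(smoothed.iter().all(|&probability| probability > 0.0));
assert!((smoothed.iter().sum::<f64>() - 1.0).abs() < 1e-12);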

let predictions = predict_aliens(&training_data, &test_data);

println!("First 10 test aliens and predictions");
for i in 0..10 {
    println!("Predicted: {:?} Input: {:?}", predictions.get(i, 0), test_data.get(i, 0));
}

let accuracy = test_data.column_iter(0)
    .zip(predictions.column_iter(0))
    .map(|(alien, prediction)| if alien.sex == prediction { 1 } else { 0 })
    .sum::<u16>() as f64 / (test_data.rows() as f64);

println!("Accuracy {}%", accuracy * 100.0);

/**
 * ===============================
 * =           Analysis          =
 * ===============================
 */

// We can get a better sense of how well our classifier has done by
// printing the confusion matrix

// Construct a confusion matrix of actual x predicted classes, using A as 0, B as 1 and C as 2
// for indexing. If the accuracy was 100% we would see only non-zero numbers on the diagonal
// as every prediction would be the actual class.
let confusion_matrix = {
    let mut confusion_matrix = Matrix::empty(0, (3, 3));

    // loop through all the actual and predicted classes to fill the confusion matrix
    // with the total occurrences of each possible combination
    for (actual, predicted) in test_data.column_iter(0).zip(predictions.column_iter(0)) {
        match actual.sex {
            AlienSex::A => {
                match predicted {
                    AlienSex::A => confusion_matrix.set(0, 0, confusion_matrix.get(0, 0) + 1),
                    AlienSex::B => confusion_matrix.set(0, 1, confusion_matrix.get(0, 1) + 1),
                    AlienSex::C => confusion_matrix.set(0, 2, confusion_matrix.get(0, 2) + 1),
                }
            },
            AlienSex::B => {
                match predicted {
                    AlienSex::A => confusion_matrix.set(1, 0, confusion_matrix.get(1, 0) + 1),
                    AlienSex::B => confusion_matrix.set(1, 1, confusion_matrix.get(1, 1) + 1),
                    AlienSex::C => confusion_matrix.set(1, 2, confusion_matrix.get(1, 2) + 1),
                }
            },
            AlienSex::C => {
                match predicted {
                    AlienSex::A => confusion_matrix.set(2, 0, confusion_matrix.get(2, 0) + 1),
                    AlienSex::B => confusion_matrix.set(2, 1, confusion_matrix.get(2, 1) + 1),
                    AlienSex::C => confusion_matrix.set(2, 2, confusion_matrix.get(2, 2) + 1),
                }
            }
        }
    }

    confusion_matrix
};

println!("Confusion matrix: Rows are actual class, Columns are predicted class\n{}",
    confusion_matrix);
println!("  A  B  C");
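
// We can also compute the per-class precision, recall and F-1 scores from the
// confusion matrix (a sketch, assuming the layout above: rows are actual
// classes and columns are predicted classes)
for class in 0..3 {
    let true_positives = confusion_matrix.get(class, class) as f64;
    let actual_total: i32 = (0..3).map(|predicted| confusion_matrix.get(class, predicted)).sum();
    let predicted_total: i32 = (0..3).map(|actual| confusion_matrix.get(actual, class)).sum();
    // precision: of everything predicted as this class, how much really was this class
    let precision = true_positives / (predicted_total as f64);
    // recall: of everything really in this class, how much did we predict as this class
    let recall = true_positives / (actual_total as f64);
    // F-1 score is the harmonic mean of precision and recall
    let f1_score = 2.0 * (precision * recall) / (precision + recall);
    println!("Class {} Precision {:.3} Recall {:.3} F-1 {:.3}", class, precision, recall, f1_score);
}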
```

## Tensor APIs


```
use easy_ml::tensors::Tensor;
use easy_ml::tensors::views::{TensorView, TensorStack};
use easy_ml::linear_algebra;
use easy_ml::distributions::Gaussian;

use rand::{Rng, SeedableRng};
use rand::distr::{Iter, StandardUniform};
use rand_chacha::ChaCha8Rng;

#[derive(Clone, Copy, PartialEq, Debug)]
struct Alien {
    height: f64,
    primary_marking_color: AlienMarkingColor,
    tail_length: f64,
    metabolic_rate: f64,
    spiked_tail: bool,
    sex: AlienSex,
}

#[derive(Clone, Copy, Eq, PartialEq, Debug)]
enum AlienMarkingColor {
    Red = 1, Yellow, Orange, Blue, Purple, White, Black
}

#[derive(Clone, Copy, Eq, PartialEq, Debug)]
enum AlienSex {
    A, B, C
}

/**
 * ===============================
 * =       Data Generation       =
 * ===============================
 */

// rather than explicitly defining a generative function that already knows the relationship
// between alien characteristics, the samples are generated without an assigned sex
// and then clustered using k-means

// use a fixed seed random generator from the rand crate
let mut random_generator = ChaCha8Rng::seed_from_u64(16);
let mut random_numbers: Iter<StandardUniform, &mut ChaCha8Rng, f64> =
    (&mut random_generator).sample_iter(StandardUniform);

/**
 * Generates height data for creating the alien dataset.
 */
fn generate_heights(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    // the average height shall be 1.5 meters with a standard deviation of 0.25
    let heights_distribution = Gaussian::new(1.5, 0.25 * 0.25);
    let mut heights = heights_distribution.draw(random_numbers, samples).unwrap();
    // ensure all aliens are at least 0.25 meters tall
    heights.drain(..).map(|x| x.max(0.25)).collect()
}

/**
 * Generates tail length data for creating the alien dataset.
 */
fn generate_tail_length(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    // the average length shall be 1.25 meters with more variation in tail length
    let tails_distribution = Gaussian::new(1.25, 0.5 * 0.5);
    let mut tails = tails_distribution.draw(random_numbers, samples).unwrap();
    // ensure all tails are at least 0.5 meters long
    tails.drain(..).map(|x| x.max(0.5)).collect()
}

/**
 * Generates color data for creating the alien dataset.
 *
 * Note that floats are still returned despite this being a category because we need all the
 * data types to be the same for clustering
 */
fn generate_colors(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    let mut colors = Vec::with_capacity(samples);
    for _ in 0..samples {
        let x = random_numbers.next().unwrap();
        if x < 0.2 {
            colors.push(AlienMarkingColor::Red as u8 as f64);
        } else if x < 0.3 {
            colors.push(AlienMarkingColor::Yellow as u8 as f64);
        } else if x < 0.45 {
            colors.push(AlienMarkingColor::Orange as u8 as f64);
        } else if x < 0.59 {
            colors.push(AlienMarkingColor::Blue as u8 as f64);
        } else if x < 0.63 {
            colors.push(AlienMarkingColor::Purple as u8 as f64);
        } else if x < 0.9 {
            colors.push(AlienMarkingColor::White as u8 as f64);
        } else {
            colors.push(AlienMarkingColor::Black as u8 as f64);
        }
    }
    colors
}

/**
 * Recovers the color type which is the closest match to the input floating point color
 */
fn recover_generated_color(color: f64) -> AlienMarkingColor {
    let numerical_colors = [
        AlienMarkingColor::Red as u8 as f64,
        AlienMarkingColor::Yellow as u8 as f64,
        AlienMarkingColor::Orange as u8 as f64,
        AlienMarkingColor::Blue as u8 as f64,
        AlienMarkingColor::Purple as u8 as f64,
        AlienMarkingColor::White as u8 as f64,
        AlienMarkingColor::Black as u8 as f64,
    ];
    let colors = [
        AlienMarkingColor::Red,
        AlienMarkingColor::Yellow,
        AlienMarkingColor::Orange,
        AlienMarkingColor::Blue,
        AlienMarkingColor::Purple,
        AlienMarkingColor::White,
        AlienMarkingColor::Black,
    ];
    // look for the closest fit, as manipulated floating points may not be exact
    let color_index = numerical_colors.iter()
        // take the absolute difference so an exact match will become 0
        .map(|c| (c - color).abs())
        .enumerate()
        // find the element with the smallest difference in the list
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
        // discard the difference
        .map(|(index, _)| index)
        // retrieve the value from the option
        .unwrap();
    colors[color_index]
}
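
// a quick check of the closest-match recovery with illustrative noisy values:
// small floating point error still maps back to the intended color variant
assert_eq!(recover_generated_color(1.0000001), AlienMarkingColor::Red);
assert_eq!(recover_generated_color(6.9), AlienMarkingColor::Black);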

/**
 * Generates metabolic rate data for creating the alien dataset.
 */
fn generate_metabolic_rate(
    samples: usize, random_numbers: &mut impl Iterator<Item = f64>
) -> Vec<f64> {
    // the average rate shall be 100 heart beats per minute with a standard deviation of 20
    let metabolic_rate_distribution = Gaussian::new(100.0, 20.0 * 20.0);
    let mut metabolic_rates = metabolic_rate_distribution.draw(random_numbers, samples).unwrap();
    // ensure all rates are at least 50 and at most 200
    metabolic_rates.drain(..).map(|x| x.clamp(50.0, 200.0)).collect()
}

/**
 * Generates spiked tail data for creating the alien dataset.
 */
fn generate_spiked_tail(
    samples: usize, random_numbers: &mut impl Iterator<Item = f64>
) -> Vec<f64> {
    let mut spikes = Vec::with_capacity(samples);
    for _ in 0..samples {
        let x = random_numbers.next().unwrap();
        if x < 0.4 {
            // 0 shall represent a non spiked tail
            spikes.push(0.0)
        } else {
            // 1 shall represent a spiked tail
            spikes.push(1.0)
        }
    }
    spikes
}

/**
 * Recovers the spiked tail type which is the closest match to the input floating point
 */
fn recover_generated_spiked_tail(spiked: f64) -> bool {
    spiked >= 0.5
}

// We shall generate 1000 samples for the dataset
const SAMPLES: usize = 1000;

// Each sample has 5 features
const FEATURES: usize = 5;

// Collect all the float typed features into a 2 dimensional tensor of samples x data
let unlabelled_dataset = {
    let heights = Tensor::from(
        [("sample", SAMPLES)], generate_heights(SAMPLES, &mut random_numbers)
    );
    let colours = Tensor::from(
        [("sample", SAMPLES)], generate_colors(SAMPLES, &mut random_numbers)
    );
    let tail_lengths = Tensor::from(
        [("sample", SAMPLES)], generate_tail_length(SAMPLES, &mut random_numbers)
    );
    let metabolic_rates = Tensor::from(
        [("sample", SAMPLES)], generate_metabolic_rate(SAMPLES, &mut random_numbers)
    );
    let spiked_tail = Tensor::from(
        [("sample", SAMPLES)], generate_spiked_tail(SAMPLES, &mut random_numbers)
    );

    TensorView::from(
        TensorStack::<_, [_; 5], 1>::from(
            [ heights, colours, tail_lengths, metabolic_rates, spiked_tail ],
            // we stack along the second axis that we're creating to get our rows of samples
            // this leaves the existing axis in all our individual column vectors to be each
            // feature's data
            (1, "feature")
        )
    )
};

/**
 * ===============================
 * =          Clustering         =
 * ===============================
 */

// Create a subset of the first 30 samples from the full dataset to use for clustering
const SUBSET_SAMPLES: usize = 30;
let unlabelled_subset = unlabelled_dataset.range([("sample", 0..SUBSET_SAMPLES)])
    .expect("The unlabelled data set should have at least 30 samples");

// We normalise all the features to 0 mean and 1 standard deviation because
// we will use euclidean distance as the distance metric, and our features
// have very different variances. This avoids the distance metric being
// dominated by any particular feature.

let mut means_and_variances = Vec::with_capacity(FEATURES);

// The normalised subset is computed taking the mean and variance from the subset,
// these means and variances will be needed later to apply to the rest of the data.
let mut normalised_subset = {
    let mut normalised_subset = unlabelled_subset.map(|x| x);
    for feature in 0..FEATURES {
        let mut feature_data = normalised_subset.select_mut([("feature", feature)]);
        let mean = linear_algebra::mean(feature_data.iter());
        let variance = linear_algebra::variance(feature_data.iter());
        // save the data for normalisation and denormalisation for each feature
        // for use later
        means_and_variances.push((mean, variance));

        for x in feature_data.iter_reference_mut() {
            *x = (*x - mean) / variance;
        }
    }
    normalised_subset
};

// create a tensor where each row is a tuple of the mean and variance of each of the
// 5 features
let means_and_variances = Tensor::from([("feature", FEATURES)], means_and_variances);

// pick the first 3 samples as the starting points for the cluster centres
// and place them into a 3 x 5 tensor where we have 3 rows of cluster centres
// and 5 features which are all normalised
const CLUSTERS: usize = 3;
let mut clusters = TensorView::from(
    TensorStack::<_, [_; 3], 1>::from(
        [
            normalised_subset.select([("sample", 0)]).source(),
            normalised_subset.select([("sample", 1)]).source(),
            normalised_subset.select([("sample", 2)]).source()
        ],
        (0, "cluster")
    )
).map(|x| x);

// add a 6th column to the subset to track the closest cluster for each sample
const CLUSTER_ID_COLUMN: usize = 5;
let mut normalised_subset = {
    let mut data = normalised_subset.iter();
    let mut tensor = Tensor::empty([("sample", SUBSET_SAMPLES), ("feature", FEATURES + 1)], -1.0);
    for ([_row, feature], x) in tensor.iter_reference_mut().with_index() {
        *x = match feature {
            0..=4 => data.next().unwrap(),
            _ => -1.0
        };
    }
    tensor
};

// set a threshold at which we consider the cluster centres to have converged
const CHANGE_THRESHOLD: f64 = 0.001;

// track how much the means have changed each update
let mut absolute_changes = -1.0;

// loop until we go under the CHANGE_THRESHOLD, reassigning points to the nearest
// cluster then cluster centres to their mean of points
while absolute_changes == -1.0 || absolute_changes > CHANGE_THRESHOLD {
    // assign each point to the nearest cluster centre by euclidean distance
    for point in 0..SUBSET_SAMPLES {
        let mut closest_cluster = -1.0;
        let mut least_squared_distance = f64::MAX;
        let clusters_indexing = clusters.index();
        let mut samples_indexing = normalised_subset.index_mut();
        for cluster in 0..CLUSTERS {
            // we don't actually need to square root the distances for finding
            // which is least because least squared distance is the same as
            // least distance
            let squared_distance = {
                let mut sum = 0.0;

                for feature in 0..FEATURES {
                    let cluster_coordinate = clusters_indexing.get([cluster, feature]);
                    let point_coordinate = samples_indexing.get([point, feature]);
                    sum += (cluster_coordinate - point_coordinate).powi(2);
                }
                sum
            };

            if squared_distance < least_squared_distance {
                closest_cluster = cluster as f64;
                least_squared_distance = squared_distance;
            }
        }
        // save the cluster that is closest to each point in the 6th column
        *samples_indexing.get_ref_mut([point, CLUSTER_ID_COLUMN]) = closest_cluster;
    }

    // update cluster centres to the mean of their points
    absolute_changes = 0.0;
    for cluster in 0..CLUSTERS {
        // construct a list of the points this cluster owns
        let owned = normalised_subset.select([("feature", CLUSTER_ID_COLUMN)]).iter()
            // zip together the id values with their index
            .enumerate()
            // exclude the points that aren't assigned to this cluster
            .filter(|(_, id)| (*id as usize) == cluster)
            // drop the cluster ids from each item and copy over the data
            // for each point for each feature
            .map(|(index, _)| {
                // for each point copy all its data in each feature excluding the
                // final cluster id column into a new vec
                normalised_subset.select([("sample", index)]).iter()
                    // taking the first 5 excludes the 6th column due to 0 indexing
                    .take(CLUSTER_ID_COLUMN)
                    .collect::<Vec<f64>>()
            })
            // collect into a vector of vectors containing each feature's data
            .collect::<Vec<Vec<f64>>>();
        // pass the vector of vectors into a tensor so we have
        // a tensor where each row is the data of a point this cluster owns
        let owned_points = owned.len();
        let owned = Tensor::from([("owned", owned.len())], owned);

        // construct a vector of the mean for each feature that this cluster
        // now has
        let new_means = {
            let mut means = Vec::with_capacity(FEATURES);

            for feature in 0..FEATURES {
                let mean = owned.iter().map(|x| x[feature]).sum::<f64>() / (owned_points as f64);
                means.push(mean);
            }

            means
        };

        // update each new mean for the cluster
        for feature in 0..FEATURES {
            let mut clusters_indexing = clusters.index_mut();
            let previous_mean = clusters_indexing.get([cluster, feature]);
            // track the absolute difference between the new mean and the old one
            // so we know when to stop updating the clusters
            absolute_changes += (previous_mean - new_means[feature]).abs();

            println!(
                "Cluster {:?} mean for feature {:?} is now {:?} was {:?}",
                cluster, feature, new_means[feature], previous_mean
            );
            *clusters_indexing.get_ref_mut([cluster, feature]) = new_means[feature];
        }
    }
}

println!(
    "Denormalised clusters at convergence:\n{:?}\n{:.3}",
    vec![ "H", "C", "T", "M", "S" ],
    clusters.map_with_index(|[_cluster, feature], x| {
        let indexing = means_and_variances.index();
        let (mean, variance) = indexing.get([feature]);
        (x * variance) + mean
    }));

// Now we will assign every alien in the full dataset a sex using these cluster centres
let mut aliens: Vec<Alien> = Vec::with_capacity(SAMPLES);

fn assign_alien_sex(index: u8) -> AlienSex {
    if index == 0 {
        AlienSex::A
    } else if index == 1 {
        AlienSex::B
    } else {
        AlienSex::C
    }
}

for i in 0..SAMPLES {
    let alien_data = unlabelled_dataset.select([("sample", i)]).map(|x| x);
    // normalise the alien data first so comparisons are on unit variance
    // and zero mean
    let normalised_alien_data = alien_data.map_with_index(|[feature], x| {
        let indexing = means_and_variances.index();
        let (mean, variance) = indexing.get([feature]);
        // normalise each feature in the alien data
        (x - mean) / variance
    });
    let mut distances = Vec::with_capacity(CLUSTERS);
    for j in 0..CLUSTERS {
        let cluster_data = clusters.select([("cluster", j)]);
        // use squared euclidean distance to compare the alien with the cluster,
        // the cluster is already normalised so we sum the squared differences
        // between the cluster centre and the alien over all 5 features
        let sum_of_squares: f64 = cluster_data.iter()
            .zip(normalised_alien_data.iter())
            .map(|(cluster_coordinate, alien_coordinate)| {
                (cluster_coordinate - alien_coordinate).powi(2)
            })
            .sum();
        distances.push(sum_of_squares);
    }

    // find the cluster with the lowest distance to each point and get its index
    let chosen_cluster = distances.iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
        .map(|(i, _)| i)
        .unwrap();

    // convert each float to the correct data type
    let alien_data_indexing = alien_data.index();
    aliens.push(Alien {
        height: alien_data_indexing.get([0]),
        primary_marking_color: recover_generated_color(alien_data_indexing.get([1])),
        tail_length: alien_data_indexing.get([2]),
        metabolic_rate: alien_data_indexing.get([3]),
        spiked_tail: recover_generated_spiked_tail(alien_data_indexing.get([4])),
        sex: assign_alien_sex(chosen_cluster as u8),
    })
}

println!("First 10 aliens");
for i in 0..10 {
    println!("{:?}", aliens[i]);
}

// Put the aliens in a tensor for convenience
let aliens = Tensor::from([("sample", SAMPLES)], aliens);

println!("Sex A aliens total: {}", aliens.iter_reference()
    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::A { 1 } else { 0 }));

println!("Sex B aliens total: {}", aliens.iter_reference()
    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::B { 1 } else { 0 }));

println!("Sex C aliens total: {}", aliens.iter_reference()
    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::C { 1 } else { 0 }));

/**
 * ===============================
 * =         Naïve Bayes         =
 * ===============================
 */

// Each class is roughly one third so we should not have a strong prior toward any
// particular class

// In order to evaluate the performance of the Naïve Bayes classifier we will hold out
// the last 100 aliens from the dataset and use them as test data

const TRAINING_SAMPLES: usize = 900;
const TESTING_SAMPLES: usize = 100;
let training_data = Tensor::from(
    [("sample", TRAINING_SAMPLES)], aliens.iter().take(TRAINING_SAMPLES).collect()
);
let test_data = Tensor::from(
    [("sample", TESTING_SAMPLES)], aliens.iter().skip(TRAINING_SAMPLES).collect()
);
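
// sanity check that the training and test split accounts for every alien exactly once
assert_eq!(TRAINING_SAMPLES + TESTING_SAMPLES, SAMPLES);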

/**
 * Predicts the most probable alien sex for each test input alien (disregarding
 * the sex field in those inputs)
 *
 * For the real valued features the probabilities are computed by modelling
 * the features (conditioned on each class) as Gaussian distributions.
 * For categorical features Laplacian smoothing of the counts is used to
 * estimate probabilities of the features (conditioned on each class).
 */
fn predict_aliens(
    training_data: &Tensor<Alien, 1>, test_data: &Tensor<Alien, 1>
) -> Tensor<AlienSex, 1> {
    let mut relative_log_probabilities: Vec<f64> = Vec::with_capacity(300);

    for class in &[ AlienSex::A, AlienSex::B, AlienSex::C ] {
        let training_data_class_only = training_data.iter()
            .filter(|a| &a.sex == class)
            .collect::<Vec<Alien>>();

        // compute how likely each class is in the training set
        let prior = (training_data_class_only.len() as f64) / (TRAINING_SAMPLES as f64);

        // We model the real valued features as Gaussians, note that these
        // are Gaussian distributions over only the training data of each class
        let heights: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.height)
        );
        let tail_lengths: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.tail_length)
        );
        let metabolic_rates: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.metabolic_rate)
        );

        // gradually build up the sum of log probabilities to get the
        // log of the prior * likelihood, which will be proportional to the posterior
        let mut relative_log_probabilities_of_class = test_data.iter_reference()
        .map(|alien| {
            // log probability of the alien sex, before adding each feature's likelihood
            let mut log_relative_probability = prior.ln();

            // Compute the probability using the Gaussian model for each real valued feature.
            // Due to floating point precision limits and the variance for some of these
            // Gaussian models being extremely small (0.01 for heights) a computed
            // probability can underflow to zero or very close to it, so we clamp it to
            // a tiny positive value to avoid getting -inf when we take the log.

            let height_given_class = heights.probability(&alien.height).max(1e-12);
            log_relative_probability += height_given_class.ln();

            let tail_given_class = tail_lengths.probability(&alien.tail_length).max(1e-12);
            log_relative_probability += tail_given_class.ln();

            let metabolic_rates_given_class = metabolic_rates
                .probability(&alien.metabolic_rate)
                .max(1e-12);
            log_relative_probability += metabolic_rates_given_class.ln();

            // compute the probability of the categorical features using laplacian smoothing
            let color_of_class = training_data_class_only.iter()
                // count how many aliens of this class have the same color as this alien
                .filter(|a| a.primary_marking_color == alien.primary_marking_color)
                .count();
            // with laplacian smoothing we assume there is one datapoint for each color,
            // which avoids zero probabilities but does not distort the probabilities much;
            // there are 7 color types so we add 7 to the total
            let color_given_class = ((color_of_class + 1) as f64)
                / ((training_data_class_only.len() + 7) as f64);
            log_relative_probability += color_given_class.ln();

            let spiked_tail_of_class = training_data_class_only.iter()
                // count how many aliens of this class match this alien's spiked tail value
                .filter(|a| a.spiked_tail == alien.spiked_tail)
                .count();
            // again we assume one alien of the class with a spiked tail and one without
            // to avoid zero probabilities
            let spiked_tail_given_class = ((spiked_tail_of_class + 1) as f64)
                / ((training_data_class_only.len() + 2) as f64);
            log_relative_probability += spiked_tail_given_class.ln();

            if log_relative_probability == f64::NEG_INFINITY {
                // sanity check, this should not print as the probabilities are clamped
                println!("Individual probs P:{} H:{} T:{} M:{} C:{} S:{}",
                    prior, height_given_class, tail_given_class, metabolic_rates_given_class,
                    color_given_class, spiked_tail_given_class);
            }

            log_relative_probability
        }).collect();

        relative_log_probabilities.append(&mut relative_log_probabilities_of_class);
    }

    // collect the relative probability estimates for each class and each alien
    // into a 3 x 100 matrix
    let probabilities = Tensor::from(
        [("class", 3), ("sample", TESTING_SAMPLES)], relative_log_probabilities
    );

    let predictions = (0..TESTING_SAMPLES).map(|i| {
        let predicted_class_index = probabilities.select([("sample", i)])
            .iter()
            .enumerate()
            // find the class with the highest relative probability estimate
            .max_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
            // discard the probability
            .map(|(index, _)| index)
            // retrieve the value from the option
            .unwrap();
        match predicted_class_index {
            0 => AlienSex::A,
            1 => AlienSex::B,
            _ => AlienSex::C,
        }
    }).collect();

    Tensor::from([("sample", TESTING_SAMPLES)], predictions)
}

let predictions = predict_aliens(&training_data, &test_data);

println!("First 10 test aliens and predictions");
for i in 0..10 {
    println!("Predicted: {:?} Input: {:?}", predictions.index().get([i]), test_data.index().get([i]));
}

let accuracy = test_data.iter_reference()
    .zip(predictions.iter_reference())
    .map(|(alien, prediction)| if alien.sex == *prediction { 1 } else { 0 })
    .sum::<u16>() as f64 / (TESTING_SAMPLES as f64);

println!("Accuracy {}%", accuracy * 100.0);

// ===============================
// =           Analysis          =
// ===============================

// We can get a better sense of how well our classifier has done by
// printing the confusion matrix

// Construct a confusion matrix of actual x predicted classes, using A as 0, B as 1 and C as 2
// for indexing. If the accuracy was 100% we would see non-zero numbers only on the diagonal
// as every prediction would be the actual class.
let confusion_matrix = {
    let mut confusion_matrix = Tensor::empty([("actual_class", 3), ("predicted_class", 3)], 0);
    let mut confusion_matrix_indexing = confusion_matrix.index_mut();

    // loop through all the actual and predicted classes to fill the confusion matrix
    // with the total occurrences of each possible combination
    let class_index = |sex: &AlienSex| match sex {
        AlienSex::A => 0,
        AlienSex::B => 1,
        AlienSex::C => 2,
    };
    for (actual, predicted) in test_data.iter_reference().zip(predictions.iter_reference()) {
        *confusion_matrix_indexing
            .get_ref_mut([class_index(&actual.sex), class_index(predicted)]) += 1;
    }

    confusion_matrix
};

println!("Confusion matrix: Rows are actual class, Columns are predicted class\n{}",
    confusion_matrix);
println!("  A  B  C");
```
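
Per-class precision, recall and F-1 scores can be read straight off the confusion matrix:
precision is true positives over the predicted column total, recall is true positives over
the actual row total, and F-1 is their harmonic mean. A minimal standalone sketch, using a
plain 3 x 3 array with made-up counts in place of the `Tensor` above:

```
fn main() {
    // hypothetical confusion matrix: rows are actual classes (A, B, C),
    // columns are predicted classes; these counts are illustrative only
    let confusion: [[f64; 3]; 3] = [
        [20.0, 8.0, 5.0],
        [6.0, 22.0, 5.0],
        [4.0, 12.0, 18.0],
    ];
    for class in 0..3 {
        // true positives for a class sit on the diagonal
        let true_positives = confusion[class][class];
        // everything predicted as this class: the column total
        let predicted: f64 = (0..3).map(|row| confusion[row][class]).sum();
        // everything actually of this class: the row total
        let actual: f64 = confusion[class].iter().sum();
        let precision = true_positives / predicted;
        let recall = true_positives / actual;
        // F-1 is the harmonic mean of precision and recall
        let f1 = 2.0 * (precision * recall) / (precision + recall);
        println!(
            "Class {}: precision {:.3} recall {:.3} F-1 {:.3}",
            ["A", "B", "C"][class], precision, recall, f1
        );
    }
}
```

Macro-averaging the three F-1 scores gives a single summary number that, unlike accuracy,
is not dominated by whichever class happens to be most common in the test set.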

The above code prints 60% accuracy, which isn't amazing: if we had learned the cluster centers
that labelled the data in the first place we should be able to reach 100% accuracy. However,
60% is still far better than random guessing on a three class problem. As Naïve Bayes has no
hyperparameters it is well suited to establishing a baseline to compare other examples, such
as [Logistic Regression](super::super::logistic_regression), against.
*/

}