easy_ml/naive_bayes.rs
/*!
Naïve Bayes examples

# Naïve Bayes

The naïve Bayes assumption is that all features in the labelled data are independent of each other
given the class they correspond to. This means the probability of some input given a class
can be computed as the product of the probabilities of each individual feature in that input
conditioned on that class.

By Bayes' Theorem we can relate the probability of the class given the input to the probability
of the input given the class. As a classifier only needs to determine which class some input
is most likely to be, we can compare just the product of the probability of the input given a class
and the probability of that class.

## Bayes' Theorem

posterior = ( prior x likelihood ) / evidence

P(C<sub>k</sub> | **x**) = ( P(C<sub>k</sub>) * P(**x** | C<sub>k</sub>) ) / P(**x**)

P(C<sub>k</sub> | **x**) ∝ P(C<sub>k</sub>) * P(**x** | C<sub>k</sub>)

where C<sub>k</sub> is the kth class and **x** is the input to classify.

## Classifier

Taking logs of Bayes' rule yields

log(P(C<sub>k</sub> | **x**)) ∝ log(P(C<sub>k</sub>)) + log(P(**x** | C<sub>k</sub>))

and given the naïve Bayes assumption

log(P(C<sub>k</sub> | **x**)) ∝ log(P(C<sub>k</sub>)) + the sum over all i features of (log(P(x<sub>i</sub> | C<sub>k</sub>)))

Then to determine the class we take the class corresponding to the largest
log(P(C<sub>k</sub> | **x**)).

Computing the individual probabilities of a feature conditioned on a class depends on the
type of the data.

For categorical data this is simply occurrences in the class / total occurrences in the class. In
practice Laplacian smoothing (adding one to the count of each category for each class) may be used
to avoid computing a probability of 0 when some category doesn't have any samples for a class.

For continuous data we can model the feature as distributed according to a Gaussian distribution.

## Simple Naïve Bayes Example with F-1 score analysis

Naïve Bayes can be done by hand (with a calculator), which is what the below example will show.
We have a list of data about the environment and want to predict if we should go outside based
on the conditions. Some of these are categorical values and others are real valued.

```
use easy_ml::distributions::Gaussian;
use easy_ml::linear_algebra;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Weather {
    Stormy, Rainy, Cloudy, Clear, Sunny
}

type WindSpeed = f64;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Pandemic {
    Pandemic, NoPandemic
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum Decision {
    GoOut, StayIn
}

#[derive(Clone, Copy, PartialEq, Debug)]
struct Observation {
    weather: Weather,
    wind: WindSpeed,
    pandemic: Pandemic,
    decision: Decision,
}

impl Observation {
    fn new(
        weather: Weather, wind: WindSpeed, pandemic: Pandemic, decision: Decision
    ) -> Observation {
        Observation {
            weather,
            wind,
            pandemic,
            decision
        }
    }
}

let observations = vec![
    Observation::new(Weather::Clear, 0.5, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Clear, 0.9, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Clear, 0.8, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Stormy, 0.7, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Rainy, 0.1, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Rainy, 0.5, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Rainy, 0.6, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Rainy, 0.7, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Cloudy, 0.3, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Cloudy, 0.5, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Cloudy, 0.2, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Cloudy, 0.8, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Sunny, 0.3, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Sunny, 0.9, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Sunny, 0.5, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Sunny, 0.5, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Sunny, 0.5, Pandemic::Pandemic, Decision::StayIn),
    Observation::new(Weather::Clear, 0.1, Pandemic::Pandemic, Decision::StayIn),
    Observation::new(Weather::Clear, 0.9, Pandemic::Pandemic, Decision::StayIn)
];

fn predict(
    observations: &Vec<Observation>, weather: Weather, wind: WindSpeed, pandemic: Pandemic
) -> Decision {
    let total = observations.len() as f64;
    // first compute the number of each class in the data
    let total_stay_in = observations.iter()
        .filter(|observation| observation.decision == Decision::StayIn)
        .count() as f64;
    let total_go_out = observations.iter()
        .filter(|observation| observation.decision == Decision::GoOut)
        .count() as f64;

    let weather_log_probability_stay_in = {
        // compute how many rows in the data are this weather and stay in
        let total = observations.iter()
            .filter(|observation| observation.decision == Decision::StayIn)
            .filter(|observation| observation.weather == weather)
            .count() as f64;
        // there are 5 variants for the weather and we use Laplacian smoothing
        // to avoid introducing zero probabilities, +1 / +5 treats the data
        // as if there is one stay in for each weather type.
        ((total + 1.0) / (total_stay_in + 5.0)).ln()
    };

    let weather_log_probability_go_out = {
        // compute how many rows in the data are this weather and go out
        let total = observations.iter()
            .filter(|observation| observation.decision == Decision::GoOut)
            .filter(|observation| observation.weather == weather)
            .count() as f64;
        // there are 5 variants for the weather and we use Laplacian smoothing
        // to avoid introducing zero probabilities, +1 / +5 treats the data
        // as if there is one go out for each weather type.
        ((total + 1.0) / (total_go_out + 5.0)).ln()
    };

    // we're modelling the wind as a Gaussian so we get a probability by
    // computing the probability density for the wind we have; it is largest
    // when the wind is the same as the mean wind for stay in, and the
    // further away from the mean the closer to 0 it gets. We take the log
    // of the density so it can be summed with the other log probabilities.
    let wind_speed_model_stay_in: Gaussian<WindSpeed> = Gaussian::approximating(
        observations.iter()
            .filter(|observation| observation.decision == Decision::StayIn)
            .map(|observation| observation.wind)
    );
    let wind_log_probability_stay_in = wind_speed_model_stay_in.probability(&wind).ln();

    let wind_speed_model_go_out: Gaussian<WindSpeed> = Gaussian::approximating(
        observations.iter()
            .filter(|observation| observation.decision == Decision::GoOut)
            .map(|observation| observation.wind)
    );
    let wind_log_probability_go_out = wind_speed_model_go_out.probability(&wind).ln();

    let pandemic_log_probability_stay_in = {
        // compute how many rows in the data are this pandemic and stay in
        let total = observations.iter()
            .filter(|observation| observation.decision == Decision::StayIn)
            .filter(|observation| observation.pandemic == pandemic)
            .count() as f64;
        // there are 2 variants for the pandemic type and we use Laplacian smoothing
        // to avoid introducing zero probabilities, +1 / +2 treats the data
        // as if there is one stay in for each pandemic type.
        ((total + 1.0) / (total_stay_in + 2.0)).ln()
    };

    let pandemic_log_probability_go_out = {
        // compute how many rows in the data are this pandemic and go out
        let total = observations.iter()
            .filter(|observation| observation.decision == Decision::GoOut)
            .filter(|observation| observation.pandemic == pandemic)
            .count() as f64;
        // there are 2 variants for the pandemic type and we use Laplacian smoothing
        // to avoid introducing zero probabilities, +1 / +2 treats the data
        // as if there is one go out for each pandemic type.
        ((total + 1.0) / (total_go_out + 2.0)).ln()
    };

    // the priors are also logged so they can be summed with the log likelihoods
    let prior_log_probability_stay_in = (total_stay_in / total).ln();
    let prior_log_probability_go_out = (total_go_out / total).ln();

    let posterior_log_probability_stay_in = prior_log_probability_stay_in
        + weather_log_probability_stay_in
        + wind_log_probability_stay_in
        + pandemic_log_probability_stay_in;
    let posterior_log_probability_go_out = prior_log_probability_go_out
        + weather_log_probability_go_out
        + wind_log_probability_go_out
        + pandemic_log_probability_go_out;

    if posterior_log_probability_go_out > posterior_log_probability_stay_in {
        Decision::GoOut
    } else {
        Decision::StayIn
    }
}

let test_data = vec![
    Observation::new(Weather::Sunny, 0.8, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Sunny, 0.2, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Stormy, 0.2, Pandemic::NoPandemic, Decision::StayIn),
    Observation::new(Weather::Cloudy, 0.3, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Rainy, 0.8, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Rainy, 0.5, Pandemic::NoPandemic, Decision::GoOut),
    Observation::new(Weather::Stormy, 0.6, Pandemic::Pandemic, Decision::StayIn),
    Observation::new(Weather::Rainy, 0.1, Pandemic::Pandemic, Decision::StayIn),
];

let predictions = test_data.iter()
    .map(|data| predict(&observations, data.weather, data.wind, data.pandemic))
    .collect::<Vec<Decision>>();

println!("Test data and predictions\n{:?}", test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .collect::<Vec<(Observation, Decision)>>());

println!("Accuracy: {:?}", test_data.iter()
    .zip(predictions.clone())
    .map(|(data, decision)| if data.decision == decision { 1.0 } else { 0.0 })
    .sum::<f64>() / (test_data.len() as f64));

// To compute Recall and Precision it is necessary to decide which class should be
// considered the positive and which should be the negative. For medical diagnosis
// the positive case is always diagnosing something as present. For this example we
// take GoOut to be the positive case.

// True Positives are where the model predicts the positive class when it is
// the correct decision, eg in this scenario, going outside when it should
// decide to go outside
let true_positives = test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .filter(|(data, _)| data.decision == Decision::GoOut)
    .filter(|(_, decision)| decision == &Decision::GoOut)
    .count() as f64;

// False Positives are when the model predicts the positive class when it is
// not the correct decision, eg in this scenario, going outside when it should
// decide to stay in
let false_positives = test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .filter(|(data, _)| data.decision == Decision::StayIn)
    .filter(|(_, decision)| decision == &Decision::GoOut)
    .count() as f64;

// True Negatives are when the model predicts the negative class when it is
// the correct decision, eg in this scenario, staying in when it should
// decide to stay in because there's a pandemic and no good reason to go outside
let true_negatives = test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .filter(|(data, _)| data.decision == Decision::StayIn)
    .filter(|(_, decision)| decision == &Decision::StayIn)
    .count() as f64;

// False Negatives are when the model predicts the negative class when it is
// not the correct decision, eg in this scenario, staying in when it should
// decide to go outside
let false_negatives = test_data.iter()
    .cloned()
    .zip(predictions.clone())
    .filter(|(data, _)| data.decision == Decision::GoOut)
    .filter(|(_, decision)| decision == &Decision::StayIn)
    .count() as f64;

// Precision measures how good the model is at identifying the positive
// class (you can trivially get 100% precision by never predicting the
// positive class, as this means you can't get a false positive).
let precision = true_positives / (true_positives + false_positives);

// Recall is the true positive rate, which is how good the model is at
// identifying the positive class out of all the positive cases (you can
// trivially get 100% recall by always predicting the positive class).
let recall = true_positives / (true_positives + false_negatives);

// The F-1 score is the harmonic mean of precision and recall and
// averages them into a single accuracy-like measure.
// In this case the two classes are roughly equally likely, so the F-1
// score and Accuracy are similar. However, if the model had learned
// to always predict GoOut, then it would still have an accuracy of
// roughly 50% because of the equal likelihood, whereas its F-1 score
// would be much lower than 50%.
let f1_score = linear_algebra::f1_score(precision, recall);
println!("F1-Score: {:?}", f1_score);

assert!(f1_score > 0.8);
```

## 3 Class Naïve Bayes Example
[See submodule for 3 Class Naïve Bayes Example](three_class)
*/

#[rustfmt::skip]
pub mod three_class {
/*!
# 3 Class Naïve Bayes Example

For this example some population data is generated for a fictional alien race as I didn't
have any real datasets to hand. This alien race has 3 sexes (mysteriously no individuals
are ever intersex or trans) and is sexually dimorphic, meaning we can try to determine their
sex from various measurements.

As with humans, a Gaussian distribution for physical characteristics is sensible due to
evolutionary and biological factors.

The example includes categorical features such as marking color and real valued
features such as height in order to show how both can be modelled with Naïve Bayes.

Note that most of the code below is for generating and clustering the data to perform
Naïve Bayes on, not the Naïve Bayes itself.

### Clustering

After generating the unlabelled data, clustering is performed on a very small subset of that data
to find three cluster centres, which are then used to assign sex class labels to the whole dataset.
This creates a labelled dataset to perform Naïve Bayes on, to see if we can predict sex given
the various features.

Because we cluster only a very small subset of the data, and our data has many dimensions, by
chance we can expect there to be large gaps in the subset.

## Matrix APIs

```
use easy_ml::matrices::Matrix;
use easy_ml::matrices::views::{MatrixView, MatrixRange};
use easy_ml::linear_algebra;
use easy_ml::distributions::Gaussian;

use rand::{Rng, SeedableRng};
use rand::distr::{Iter, StandardUniform};
use rand_chacha::ChaCha8Rng;

#[derive(Clone, Copy, PartialEq, Debug)]
struct Alien {
    height: f64,
    primary_marking_color: AlienMarkingColor,
    tail_length: f64,
    metabolic_rate: f64,
    spiked_tail: bool,
    sex: AlienSex,
}

#[derive(Clone, Copy, Eq, PartialEq, Debug)]
enum AlienMarkingColor {
    Red = 1, Yellow, Orange, Blue, Purple, White, Black
}

#[derive(Clone, Copy, Eq, PartialEq, Debug)]
enum AlienSex {
    A, B, C
}

/*
 * ===============================
 * = Data Generation =
 * ===============================
 */

// rather than explicitly defining a generative function that already knows the relationship
// between alien characteristics, the samples are generated without an assigned sex
// and then clustered using k-means

// use a fixed seed random generator from the rand crate
let mut random_generator = ChaCha8Rng::seed_from_u64(16);
let mut random_numbers: Iter<StandardUniform, &mut ChaCha8Rng, f64> =
    (&mut random_generator).sample_iter(StandardUniform);

/**
 * Generates height data for creating the alien dataset.
 */
fn generate_heights(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    // the average height shall be 1.5 meters with a standard deviation of 0.25
    let heights_distribution = Gaussian::new(1.5, 0.25 * 0.25);
    let mut heights = heights_distribution.draw(random_numbers, samples).unwrap();
    // ensure all aliens are at least 0.25 meters tall
    heights.drain(..).map(|x| x.max(0.25)).collect()
}

/**
 * Generates tail length data for creating the alien dataset.
 */
fn generate_tail_length(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    // the average length shall be 1.25 meters with more variation in tail length
    let tails_distribution = Gaussian::new(1.25, 0.5 * 0.5);
    let mut tails = tails_distribution.draw(random_numbers, samples).unwrap();
    // ensure all tails are at least 0.5 meters long
    tails.drain(..).map(|x| if x > 0.5 { x } else { 0.5 }).collect()
}

/**
 * Generates color data for creating the alien dataset.
 *
 * Note that floats are still returned despite this being a category because we need all the
 * data types to be the same for clustering.
 */
fn generate_colors(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    let mut colors = Vec::with_capacity(samples);
    for _ in 0..samples {
        let x = random_numbers.next().unwrap();
        if x < 0.2 {
            colors.push(AlienMarkingColor::Red as u8 as f64);
        } else if x < 0.3 {
            colors.push(AlienMarkingColor::Yellow as u8 as f64);
        } else if x < 0.45 {
            colors.push(AlienMarkingColor::Orange as u8 as f64);
        } else if x < 0.59 {
            colors.push(AlienMarkingColor::Blue as u8 as f64);
        } else if x < 0.63 {
            colors.push(AlienMarkingColor::Purple as u8 as f64);
        } else if x < 0.9 {
            colors.push(AlienMarkingColor::White as u8 as f64);
        } else {
            colors.push(AlienMarkingColor::Black as u8 as f64);
        }
    }
    colors
}

/**
 * Recovers the color type which is the closest match to the input floating point color.
 */
fn recover_generated_color(color: f64) -> AlienMarkingColor {
    let numerical_colors = [
        AlienMarkingColor::Red as u8 as f64,
        AlienMarkingColor::Yellow as u8 as f64,
        AlienMarkingColor::Orange as u8 as f64,
        AlienMarkingColor::Blue as u8 as f64,
        AlienMarkingColor::Purple as u8 as f64,
        AlienMarkingColor::White as u8 as f64,
        AlienMarkingColor::Black as u8 as f64,
    ];
    let colors = [
        AlienMarkingColor::Red,
        AlienMarkingColor::Yellow,
        AlienMarkingColor::Orange,
        AlienMarkingColor::Blue,
        AlienMarkingColor::Purple,
        AlienMarkingColor::White,
        AlienMarkingColor::Black,
    ];
    // look for the closest fit, as manipulated floating points may not be exact
    let color_index = numerical_colors.iter()
        // take the absolute difference so an exact match will become 0
        .map(|c| (c - color).abs())
        .enumerate()
        // find the element with the smallest difference in the list
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
        // discard the difference
        .map(|(index, _)| index)
        // retrieve the value from the option
        .unwrap();
    colors[color_index]
}

/**
 * Generates metabolic rate data for creating the alien dataset.
 */
fn generate_metabolic_rate(
    samples: usize, random_numbers: &mut impl Iterator<Item = f64>
) -> Vec<f64> {
    // the average rate shall be 100 heart beats per minute with a standard deviation of 20
    let metabolic_rate_distribution = Gaussian::new(100.0, 20.0 * 20.0);
    let mut metabolic_rates = metabolic_rate_distribution.draw(random_numbers, samples).unwrap();
    // ensure all rates are at least 50 and at most 200
    metabolic_rates.drain(..).map(|x| x.clamp(50.0, 200.0)).collect()
}

/**
 * Generates spiked tailness data for creating the alien dataset.
 */
fn generate_spiked_tail(
    samples: usize, random_numbers: &mut impl Iterator<Item = f64>
) -> Vec<f64> {
    let mut spikes = Vec::with_capacity(samples);
    for _ in 0..samples {
        let x = random_numbers.next().unwrap();
        if x < 0.4 {
            // 0 shall represent a non spiked tail
            spikes.push(0.0)
        } else {
            // 1 shall represent a spiked tail
            spikes.push(1.0)
        }
    }
    spikes
}

/**
 * Recovers the spiked tail type which is the closest match to the input floating point.
 */
fn recover_generated_spiked_tail(spiked: f64) -> bool {
    spiked >= 0.5
}

// We shall generate 1000 samples for the dataset
const SAMPLES: usize = 1000;

// Collect all the float typed features into a matrix
let unlabelled_dataset = {
    let mut dataset = Matrix::column(generate_heights(SAMPLES, &mut random_numbers));
    dataset.insert_column_with(1,
        generate_colors(SAMPLES, &mut random_numbers).drain(..));
    dataset.insert_column_with(2,
        generate_tail_length(SAMPLES, &mut random_numbers).drain(..));
    dataset.insert_column_with(3,
        generate_metabolic_rate(SAMPLES, &mut random_numbers).drain(..));
    dataset.insert_column_with(4,
        generate_spiked_tail(SAMPLES, &mut random_numbers).drain(..));
    dataset
};

/*
 * ===============================
 * = Clustering =
 * ===============================
 */

// Create a subset of the first 30 samples from the full dataset to use for clustering
let unlabelled_subset = MatrixView::from(
    MatrixRange::from(&unlabelled_dataset, 0..30, 0..unlabelled_dataset.columns())
);

// We normalise all the features to 0 mean and 1 standard deviation because
// we will use euclidean distance as the distance metric, and our features
// have very different variances. This avoids the distance metric being
// dominated by any particular feature.

let mut means_and_variances = Vec::with_capacity(unlabelled_subset.columns());

// The normalised subset is computed taking the mean and variance from the subset;
// these means and variances will be needed later to apply to the rest of the data.
let mut normalised_subset = {
    // copy the view into a Matrix we can mutate
    let mut normalised_subset = unlabelled_subset.map(|x| x);
    for feature in 0..normalised_subset.columns() {
        let mean = linear_algebra::mean(normalised_subset.column_iter(feature));
        let variance = linear_algebra::variance(normalised_subset.column_iter(feature));
        // save the data for normalisation and denormalisation for each feature
        // for use later
        means_and_variances.push(vec![ mean, variance ]);

        for row in 0..normalised_subset.rows() {
            let x = normalised_subset.get(row, feature);
            // subtract the mean and divide by the standard deviation
            // (the square root of the variance) to standardise each feature
            normalised_subset.set(row, feature, (x - mean) / variance.sqrt());
        }
    }
    normalised_subset
};

// create a 5 x 2 matrix where each row is the mean and variance of each of the 5 features
let means_and_variances = Matrix::from(means_and_variances);

// pick the first 3 samples as the starting points for the cluster centres
// and place them into a 3 x 5 matrix where we have 3 rows of cluster centres
// and 5 features which are all normalised
let mut clusters = Matrix::from(vec![
    normalised_subset.row_iter(0).collect(),
    normalised_subset.row_iter(1).collect(),
    normalised_subset.row_iter(2).collect()]);

// add a 6th column to the subset to track the closest cluster for each sample
const CLUSTER_ID_COLUMN: usize = 5;
normalised_subset.insert_column(CLUSTER_ID_COLUMN, -1.0);

// set a threshold at which we consider the cluster centres to have converged
const CHANGE_THRESHOLD: f64 = 0.001;

// track how much the means have changed each update
let mut absolute_changes = -1.0;

// loop until we go under the CHANGE_THRESHOLD, reassigning points to the nearest
// cluster, then moving the cluster centres to the mean of their points
while absolute_changes == -1.0 || absolute_changes > CHANGE_THRESHOLD {
    // assign each point to the nearest cluster centre by euclidean distance
    for point in 0..normalised_subset.rows() {
        let mut closest_cluster = -1.0;
        let mut least_squared_distance = f64::MAX;
        for cluster in 0..clusters.rows() {
            // we don't actually need to square root the distances for finding
            // which is least because least squared distance is the same as
            // least distance
            let squared_distance = {
                let mut sum = 0.0;
                for feature in 0..clusters.columns() {
                    let cluster_coordinate = clusters.get(cluster, feature);
                    let point_coordinate = normalised_subset.get(point, feature);
                    sum += (cluster_coordinate - point_coordinate).powi(2);
                }
                sum
            };

            if squared_distance < least_squared_distance {
                closest_cluster = cluster as f64;
                least_squared_distance = squared_distance;
            }
        }
        // save the cluster that is closest to each point in the 6th column
        normalised_subset.set(point, CLUSTER_ID_COLUMN, closest_cluster);
    }

    // update cluster centres to the mean of their points
    absolute_changes = 0.0;
    for cluster in 0..clusters.rows() {
        // construct a list of the points this cluster owns
        let owned = normalised_subset.column_iter(CLUSTER_ID_COLUMN)
            // zip together the id values with their index
            .enumerate()
            // exclude the points that aren't assigned to this cluster
            .filter(|(_, id)| (*id as usize) == cluster)
            // drop the cluster ids from each item and copy over the data
            // for each point for each feature
            .map(|(index, _)| {
                // for each point copy all its data in each feature excluding the
                // final cluster id column into a new vec
                normalised_subset.row_iter(index)
                    // taking the first 5 excludes the 6th column due to 0 indexing
                    .take(CLUSTER_ID_COLUMN)
                    .collect::<Vec<f64>>()
            })
            // collect into a vector of vectors containing each feature's data
            .collect::<Vec<Vec<f64>>>();
        // pass the vector of vectors into a matrix so we have
        // a matrix where each row is the data of a point this cluster owns
        let owned = Matrix::from(owned);

        // construct a vector of the mean of each feature over the points
        // this cluster now owns
        let new_means = {
            let mut means = Vec::with_capacity(owned.columns());

            for feature in 0..owned.columns() {
                let mean = owned.column_iter(feature).sum::<f64>() / (owned.rows() as f64);
                means.push(mean);
            }

            means
        };

        // update each new mean for the cluster
        for feature in 0..clusters.columns() {
            let previous_mean = clusters.get(cluster, feature);
            // track the absolute difference between the new mean and the old one
            // so we know when to stop updating the clusters
            absolute_changes += (previous_mean - new_means[feature]).abs();

            println!(
                "Cluster {:?} mean for feature {:?} is now {:?} was {:?}",
                cluster, feature, new_means[feature], previous_mean
            );
            clusters.set(cluster, feature, new_means[feature]);
        }
    }
}

println!(
    "Denormalised clusters at convergence:\n{:?}\n{:.3}",
    vec![ "H", "C", "T", "M", "S" ],
    clusters.map_with_index(|x, _cluster, feature| {
        let mean = means_and_variances.get(feature, 0);
        let variance = means_and_variances.get(feature, 1);
        // multiply by the standard deviation and add the mean to undo the normalisation
        (x * variance.sqrt()) + mean
    }));

// Now we will assign every alien in the full dataset a sex using these cluster centres
let mut aliens: Vec<Alien> = Vec::with_capacity(unlabelled_dataset.rows());

fn assign_alien_sex(index: u8) -> AlienSex {
    if index == 0 {
        AlienSex::A
    } else if index == 1 {
        AlienSex::B
    } else {
        AlienSex::C
    }
}

for i in 0..unlabelled_dataset.rows() {
    let alien_data = Matrix::column(unlabelled_dataset.row_iter(i).collect());
    // normalise the alien data first so comparisons are on unit variance
    // and zero mean
    let normalised_alien_data = alien_data.map_with_index(|x, feature, _| {
        let mean = means_and_variances.get(feature, 0);
        let variance = means_and_variances.get(feature, 1);
        // normalise each feature in the alien data
        (x - mean) / variance.sqrt()
    });
    let mut distances = Vec::with_capacity(clusters.rows());
    for j in 0..clusters.rows() {
        // sum the squared euclidean distance between the normalised alien
        // and the cluster centre, which is already normalised
        let sum_of_squares = clusters.row_iter(j)
            .zip(normalised_alien_data.column_iter(0))
            .map(|(c, p)| (c - p) * (c - p))
            .sum::<f64>();
        distances.push(sum_of_squares);
    }

    // find the cluster with the lowest distance to each point and get its index
    let chosen_cluster = distances.iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
        .map(|(i, _)| i)
        .unwrap();

    // convert each float to the correct data type
    aliens.push(Alien {
        height: alien_data.get(0, 0),
        primary_marking_color: recover_generated_color(alien_data.get(1, 0)),
        tail_length: alien_data.get(2, 0),
        metabolic_rate: alien_data.get(3, 0),
        spiked_tail: recover_generated_spiked_tail(alien_data.get(4, 0)),
        sex: assign_alien_sex(chosen_cluster as u8),
    })
}

println!("First 10 aliens");
for i in 0..10 {
    println!("{:?}", aliens[i]);
}

// Put the aliens in a matrix for convenience
let aliens = Matrix::row(aliens);

println!("Sex A aliens total: {}", aliens.row_reference_iter(0)
    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::A { 1 } else { 0 }));

println!("Sex B aliens total: {}", aliens.row_reference_iter(0)
    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::B { 1 } else { 0 }));

println!("Sex C aliens total: {}", aliens.row_reference_iter(0)
    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::C { 1 } else { 0 }));

/*
 * ===============================
 * = Naïve Bayes =
 * ===============================
 */

// Each class is roughly one third so we should not have a strong prior towards a particular class

// In order to evaluate the performance of the Naïve Bayes classifier we will hold out
// the last 100 aliens from the dataset and use them as test data

let training_data = Matrix::column(aliens.row_iter(0).take(900).collect());
let test_data = Matrix::column(aliens.row_iter(0).skip(900).collect());

/**
 * Predicts the most probable alien sex for each test input alien (disregarding
 * the sex field in those inputs).
 *
 * For the real valued features the probabilities are computed by modelling
 * the features (conditioned on each class) as Gaussian distributions.
 * For categorical features Laplacian smoothing of the counts is used to
 * estimate probabilities of the features (conditioned on each class).
 */
fn predict_aliens(training_data: &Matrix<Alien>, test_data: &Matrix<Alien>) -> Matrix<AlienSex> {
    let mut relative_log_probabilities = Vec::with_capacity(3);

    for class in &[ AlienSex::A, AlienSex::B, AlienSex::C ] {
        let training_data_class_only = training_data.column_iter(0)
            .filter(|a| &a.sex == class)
            .collect::<Vec<Alien>>();

        // compute how likely each class is in the training set
        let prior = (training_data_class_only.len() as f64) / (training_data.rows() as f64);

        // We model the real valued features as Gaussians, note that these
        // are Gaussian distributions over only the training data of each class
        let heights: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.height)
        );
        let tail_lengths: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.tail_length)
        );
        let metabolic_rates: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.metabolic_rate)
        );

        // gradually build up the sum of log probabilities to get the
        // log of the prior * likelihood which will be proportional to the posterior
        let relative_log_probabilities_of_class = test_data.column_reference_iter(0)
            .map(|alien| {
                // probability of the alien sex and the alien
                let mut log_relative_probability = prior.ln();

                // Compute the probability using the Gaussian model for each real valued feature.
                // Due to floating point precision limits and the variance for some of these
                // Gaussian models being extremely small (0.01 for heights) we
                // check if a probability computed is zero or extremely close to zero
                // and if so increase it a bit to avoid computing -inf when we take the log.

                let mut height_given_class = heights.probability(&alien.height);
                if height_given_class.abs() <= 0.000000000001 {
                    height_given_class = 0.000000000001;
                }
                log_relative_probability += height_given_class.ln();

                let mut tail_given_class = tail_lengths.probability(&alien.tail_length);
                if tail_given_class.abs() <= 0.000000000001 {
                    tail_given_class = 0.000000000001;
                }
                log_relative_probability += tail_given_class.ln();

                let mut metabolic_rates_given_class = metabolic_rates.probability(
                    &alien.metabolic_rate);
                if metabolic_rates_given_class.abs() <= 0.000000000001 {
                    metabolic_rates_given_class = 0.000000000001;
                }
                log_relative_probability += metabolic_rates_given_class.ln();

                // compute the probability of the categorical features using laplacian smoothing
                let color_of_class = training_data_class_only.iter()
                    .map(|a| a.primary_marking_color)
                    // count how many aliens of this class have this color
                    .fold(0, |acc, color|
                        acc + if color == alien.primary_marking_color { 1 } else { 0 });
                // with laplacian smoothing we assume there is one datapoint for each color
                // which avoids zero probabilities but does not distort the probabilities much
                // there are 7 color types so we add 7 to the total
                let color_given_class = ((color_of_class + 1) as f64)
                    / ((training_data_class_only.len() + 7) as f64);
                log_relative_probability += color_given_class.ln();

                let spiked_tail_of_class = training_data_class_only.iter()
                    .map(|a| a.spiked_tail)
                    // count how many aliens of this class have a spiked tail or not
                    .fold(0, |acc, spiked| acc + if spiked == alien.spiked_tail { 1 } else { 0 });
                // again we assume one alien of the class with a spiked tail and one without
                // to avoid zero probabilities
                let spiked_tail_given_class = ((spiked_tail_of_class + 1) as f64)
                    / ((training_data_class_only.len() + 2) as f64);
                log_relative_probability += spiked_tail_given_class.ln();

                if log_relative_probability == std::f64::NEG_INFINITY {
                    println!("Individual probs P:{} H:{} T:{} M:{} C:{} S:{}",
                        prior, height_given_class, tail_given_class, metabolic_rates_given_class,
                        color_given_class, spiked_tail_given_class);
                }

                log_relative_probability
            }).collect();

        relative_log_probabilities.push(relative_log_probabilities_of_class);
    }

    // collect the relative probability estimates for each class and each alien
    // into a 3 x 100 matrix of classes x aliens
    let probabilities = Matrix::from(relative_log_probabilities);

    let predictions = (0..probabilities.columns()).map(|i| {
        let predicted_class_index = probabilities.column_iter(i)
            .enumerate()
            // find the class with the highest relative probability estimate
            .max_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
            // discard the probability
            .map(|(index, _)| index)
            // retrieve the value from the option
            .unwrap();
        if predicted_class_index == 0 {
            AlienSex::A
        } else if predicted_class_index == 1 {
            AlienSex::B
        } else {
            AlienSex::C
        }
    }).collect();

    Matrix::column(predictions)
}

let predictions = predict_aliens(&training_data, &test_data);

println!("First 10 test aliens and predictions");
for i in 0..10 {
    println!("Predicted: {:?} Input: {:?}", predictions.get(i, 0), test_data.get(i, 0));
}

let accuracy = test_data.column_iter(0)
    .zip(predictions.column_iter(0))
    .map(|(alien, prediction)| if alien.sex == prediction { 1 } else { 0 })
    .sum::<u16>() as f64 / (test_data.rows() as f64);

println!("Accuracy {}%", accuracy * 100.0);
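
// Why sum log probabilities instead of multiplying raw probabilities?
// Multiplying many values less than one quickly underflows f64 to zero,
// while the sum of their logs stays finite (the 0.01 values here are
// hypothetical, purely to illustrate the underflow):
let product: f64 = (0..200).map(|_| 0.01_f64).product();
let log_sum: f64 = (0..200).map(|_| 0.01_f64.ln()).sum();
// the product has underflowed to 0 but the log sum is still usable
println!("product: {} log sum: {}", product, log_sum);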

/**
 * ===============================
 * = Analysis =
 * ===============================
 */

// We can get a better sense of how well our classifier has done by
// printing the confusion matrix

// Construct a confusion matrix of actual x predicted classes, using A as 0, B as 1 and C as 2
// for indexing. If the accuracy was 100% we would see only non zero numbers on the diagonal
// as every prediction would be the actual class.
let confusion_matrix = {
    let mut confusion_matrix = Matrix::empty(0, (3, 3));

    // loop through all the actual and predicted classes to fill the confusion matrix
    // with the total occurrences of each possible combination
    for (actual, predicted) in test_data.column_iter(0).zip(predictions.column_iter(0)) {
        match actual.sex {
            AlienSex::A => {
                match predicted {
                    AlienSex::A => confusion_matrix.set(0, 0, confusion_matrix.get(0, 0) + 1),
                    AlienSex::B => confusion_matrix.set(0, 1, confusion_matrix.get(0, 1) + 1),
                    AlienSex::C => confusion_matrix.set(0, 2, confusion_matrix.get(0, 2) + 1),
                }
            },
            AlienSex::B => {
                match predicted {
                    AlienSex::A => confusion_matrix.set(1, 0, confusion_matrix.get(1, 0) + 1),
                    AlienSex::B => confusion_matrix.set(1, 1, confusion_matrix.get(1, 1) + 1),
                    AlienSex::C => confusion_matrix.set(1, 2, confusion_matrix.get(1, 2) + 1),
                }
            },
            AlienSex::C => {
                match predicted {
                    AlienSex::A => confusion_matrix.set(2, 0, confusion_matrix.get(2, 0) + 1),
                    AlienSex::B => confusion_matrix.set(2, 1, confusion_matrix.get(2, 1) + 1),
                    AlienSex::C => confusion_matrix.set(2, 2, confusion_matrix.get(2, 2) + 1),
                }
            }
        }
    }

    confusion_matrix
};

println!("Confusion matrix: Rows are actual class, Columns are predicted class\n{}",
    confusion_matrix);
println!(" A B C");
```
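The module title promises F-1 score analysis: per class precision, recall and F-1 can be read
off a confusion matrix like the one printed above. A minimal standalone sketch, using
hypothetical counts rather than the example's actual output:

```
// Per class precision, recall and F-1 from true positive (tp), false
// positive (fp) and false negative (fn) counts.
fn precision(tp: f64, fp: f64) -> f64 { tp / (tp + fp) }
fn recall(tp: f64, fn_: f64) -> f64 { tp / (tp + fn_) }
fn f1_score(tp: f64, fp: f64, fn_: f64) -> f64 {
    let (p, r) = (precision(tp, fp), recall(tp, fn_));
    // F-1 is the harmonic mean of precision and recall
    2.0 * (p * r) / (p + r)
}

fn main() {
    // hypothetical 3 x 3 confusion matrix, rows are actual, columns are predicted
    let confusion = [
        [25.0, 5.0, 3.0],
        [4.0, 20.0, 6.0],
        [2.0, 7.0, 28.0],
    ];
    for class in 0..3 {
        // the diagonal entry is the correct predictions for this class
        let tp = confusion[class][class];
        // rest of the column: other classes predicted as this one
        let fp: f64 = (0..3).filter(|&i| i != class).map(|i| confusion[i][class]).sum();
        // rest of the row: this class predicted as another
        let fn_: f64 = (0..3).filter(|&j| j != class).map(|j| confusion[class][j]).sum();
        println!("Class {}: F-1 {:.3}", class, f1_score(tp, fp, fn_));
    }
}
```

Macro averaging the three per class F-1 scores gives a single summary figure that, unlike raw
accuracy, is not flattered by a classifier that neglects one of the classes.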

## Tensor APIs

```
use easy_ml::tensors::Tensor;
use easy_ml::tensors::views::{TensorView, TensorStack};
use easy_ml::linear_algebra;
use easy_ml::distributions::Gaussian;

use rand::{Rng, SeedableRng};
use rand::distr::{Iter, StandardUniform};
use rand_chacha::ChaCha8Rng;

#[derive(Clone, Copy, PartialEq, Debug)]
struct Alien {
    height: f64,
    primary_marking_color: AlienMarkingColor,
    tail_length: f64,
    metabolic_rate: f64,
    spiked_tail: bool,
    sex: AlienSex,
}

#[derive(Clone, Copy, Eq, PartialEq, Debug)]
enum AlienMarkingColor {
    Red = 1, Yellow, Orange, Blue, Purple, White, Black
}

#[derive(Clone, Copy, Eq, PartialEq, Debug)]
enum AlienSex {
    A, B, C
}

/**
 * ===============================
 * = Data Generation =
 * ===============================
 */

// Rather than explicitly defining a generative function that already knows the relationship
// between alien characteristics, the samples are generated without an assigned sex
// and then clustered using k-means

// use a fixed seed random generator from the rand crate
let mut random_generator = ChaCha8Rng::seed_from_u64(16);
let mut random_numbers: Iter<StandardUniform, &mut ChaCha8Rng, f64> =
    (&mut random_generator).sample_iter(StandardUniform);

/**
 * Generates height data for creating the alien dataset.
 */
fn generate_heights(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    // the average height shall be 1.5 meters with a standard deviation of 0.25
    let heights_distribution = Gaussian::new(1.5, 0.25 * 0.25);
    let mut heights = heights_distribution.draw(random_numbers, samples).unwrap();
    // ensure all aliens are at least 0.25 meters tall
    heights.drain(..).map(|x| x.max(0.25)).collect()
}

/**
 * Generates tail length data for creating the alien dataset.
 */
fn generate_tail_length(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    // the average length shall be 1.25 meters with more variation in tail length
    let tails_distribution = Gaussian::new(1.25, 0.5 * 0.5);
    let mut tails = tails_distribution.draw(random_numbers, samples).unwrap();
    // ensure all tails are at least 0.5 meters long
    tails.drain(..).map(|x| x.max(0.5)).collect()
}

/**
 * Generates color data for creating the alien dataset.
 *
 * Note that floats are still returned despite this being a category because we need all the
 * data types to be the same for clustering
 */
fn generate_colors(samples: usize, random_numbers: &mut impl Iterator<Item = f64>) -> Vec<f64> {
    let mut colors = Vec::with_capacity(samples);
    for _ in 0..samples {
        let x = random_numbers.next().unwrap();
        if x < 0.2 {
            colors.push(AlienMarkingColor::Red as u8 as f64);
        } else if x < 0.3 {
            colors.push(AlienMarkingColor::Yellow as u8 as f64);
        } else if x < 0.45 {
            colors.push(AlienMarkingColor::Orange as u8 as f64);
        } else if x < 0.59 {
            colors.push(AlienMarkingColor::Blue as u8 as f64);
        } else if x < 0.63 {
            colors.push(AlienMarkingColor::Purple as u8 as f64);
        } else if x < 0.9 {
            colors.push(AlienMarkingColor::White as u8 as f64);
        } else {
            colors.push(AlienMarkingColor::Black as u8 as f64);
        }
    }
    colors
}

/**
 * Recovers the color type which is the closest match to the input floating point color
 */
fn recover_generated_color(color: f64) -> AlienMarkingColor {
    let numerical_colors = [
        AlienMarkingColor::Red as u8 as f64,
        AlienMarkingColor::Yellow as u8 as f64,
        AlienMarkingColor::Orange as u8 as f64,
        AlienMarkingColor::Blue as u8 as f64,
        AlienMarkingColor::Purple as u8 as f64,
        AlienMarkingColor::White as u8 as f64,
        AlienMarkingColor::Black as u8 as f64,
    ];
    let colors = [
        AlienMarkingColor::Red,
        AlienMarkingColor::Yellow,
        AlienMarkingColor::Orange,
        AlienMarkingColor::Blue,
        AlienMarkingColor::Purple,
        AlienMarkingColor::White,
        AlienMarkingColor::Black,
    ];
    // look for the closest fit, as manipulated floating points may not be exact
    let color_index = numerical_colors.iter()
        // take the absolute difference so an exact match will become 0
        .map(|c| (c - color).abs())
        .enumerate()
        // find the element with the smallest difference in the list
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
        // discard the difference
        .map(|(index, _)| index)
        // retrieve the value from the option
        .unwrap();
    colors[color_index]
}

/**
 * Generates metabolic rate data for creating the alien dataset.
 */
fn generate_metabolic_rate(
    samples: usize, random_numbers: &mut impl Iterator<Item = f64>
) -> Vec<f64> {
    // the average rate shall be 100 heart beats per minute with a standard deviation of 20
    let metabolic_rate_distribution = Gaussian::new(100.0, 20.0 * 20.0);
    let mut metabolic_rates = metabolic_rate_distribution.draw(random_numbers, samples).unwrap();
    // ensure all rates are at least 50 and at most 200
    metabolic_rates.drain(..).map(|x| x.clamp(50.0, 200.0)).collect()
}

/**
 * Generates spiked tailness data for creating the alien dataset.
 */
fn generate_spiked_tail(
    samples: usize, random_numbers: &mut impl Iterator<Item = f64>
) -> Vec<f64> {
    let mut spikes = Vec::with_capacity(samples);
    for _ in 0..samples {
        let x = random_numbers.next().unwrap();
        if x < 0.4 {
            // 0 shall represent a non spiked tail
            spikes.push(0.0)
        } else {
            // 1 shall represent a spiked tail
            spikes.push(1.0)
        }
    }
    spikes
}

/**
 * Recovers the spiked tail type which is the closest match to the input floating point
 */
fn recover_generated_spiked_tail(spiked: f64) -> bool {
    spiked >= 0.5
}

// We shall generate 1000 samples for the dataset
const SAMPLES: usize = 1000;

// Each sample has 5 features
const FEATURES: usize = 5;

// Collect all the float typed features into a 2 dimensional tensor of samples x data
let unlabelled_dataset = {
    let heights = Tensor::from(
        [("sample", SAMPLES)], generate_heights(SAMPLES, &mut random_numbers)
    );
    let colours = Tensor::from(
        [("sample", SAMPLES)], generate_colors(SAMPLES, &mut random_numbers)
    );
    let tail_lengths = Tensor::from(
        [("sample", SAMPLES)], generate_tail_length(SAMPLES, &mut random_numbers)
    );
    let metabolic_rates = Tensor::from(
        [("sample", SAMPLES)], generate_metabolic_rate(SAMPLES, &mut random_numbers)
    );
    let spiked_tail = Tensor::from(
        [("sample", SAMPLES)], generate_spiked_tail(SAMPLES, &mut random_numbers)
    );

    TensorView::from(
        TensorStack::<_, [_; 5], 1>::from(
            [ heights, colours, tail_lengths, metabolic_rates, spiked_tail ],
            // we stack along the second axis that we're creating to get our rows of samples
            // this leaves the existing axis in all our individual column vectors to be each
            // feature's data
            (1, "feature")
        )
    )
};

/**
 * ===============================
 * = Clustering =
 * ===============================
 */

// Create a subset of the first 30 samples from the full dataset to use for clustering
const SUBSET_SAMPLES: usize = 30;
let unlabelled_subset = unlabelled_dataset.range([("sample", 0..SUBSET_SAMPLES)])
    .expect("The unlabelled data set should have at least 30 samples");

// We normalise all the features to 0 mean and 1 standard deviation because
// we will use euclidean distance as the distance metric, and our features
// have very different variances. This avoids the distance metric being
// dominated by any particular feature.

let mut means_and_variances = Vec::with_capacity(FEATURES);

// The normalised subset is computed taking the mean and variance from the subset,
// these means and variances will be needed later to apply to the rest of the data.
let normalised_subset = {
    let mut normalised_subset = unlabelled_subset.map(|x| x);
    for feature in 0..FEATURES {
        let mut feature_data = normalised_subset.select_mut([("feature", feature)]);
        let mean = linear_algebra::mean(feature_data.iter());
        let variance = linear_algebra::variance(feature_data.iter());
        // save the data for normalisation and denormalisation for each feature
        // for use later
        means_and_variances.push((mean, variance));

        for x in feature_data.iter_reference_mut() {
            *x = (*x - mean) / variance;
        }
    }
    normalised_subset
};

// create a tensor where each row is a tuple of the mean and variance of each of the
// 5 features
let means_and_variances = Tensor::from([("feature", FEATURES)], means_and_variances);

// pick the first 3 samples as the starting points for the cluster centres
// and place them into a 3 x 5 tensor where we have 3 rows of cluster centres
// and 5 features which are all normalised
const CLUSTERS: usize = 3;
let mut clusters = TensorView::from(
    TensorStack::<_, [_; 3], 1>::from(
        [
            normalised_subset.select([("sample", 0)]).source(),
            normalised_subset.select([("sample", 1)]).source(),
            normalised_subset.select([("sample", 2)]).source()
        ],
        (0, "cluster")
    )
).map(|x| x);

// add a 6th column to the subset to track the closest cluster for each sample
const CLUSTER_ID_COLUMN: usize = 5;
let mut normalised_subset = {
    let mut data = normalised_subset.iter();
    let mut tensor = Tensor::empty([("sample", SUBSET_SAMPLES), ("feature", FEATURES + 1)], -1.0);
    for ([_row, feature], x) in tensor.iter_reference_mut().with_index() {
        *x = match feature {
            0 | 1 | 2 | 3 | 4 => data.next().unwrap(),
            _ => -1.0
        };
    }
    tensor
};

// set a threshold at which we consider the cluster centres to have converged
const CHANGE_THRESHOLD: f64 = 0.001;

// track how much the means have changed each update
let mut absolute_changes = -1.0;

// loop until we go under the CHANGE_THRESHOLD, reassigning points to the nearest
// cluster then updating cluster centres to the mean of their points
while absolute_changes == -1.0 || absolute_changes > CHANGE_THRESHOLD {
    // assign each point to the nearest cluster centre by euclidean distance
    for point in 0..SUBSET_SAMPLES {
        let mut closest_cluster = -1.0;
        let mut least_squared_distance = std::f64::MAX;
        let clusters_indexing = clusters.index();
        let mut samples_indexing = normalised_subset.index_mut();
        for cluster in 0..CLUSTERS {
            // we don't actually need to square root the distances for finding
            // which is least because least squared distance is the same as
            // least distance
            let squared_distance = {
                let mut sum = 0.0;

                for feature in 0..FEATURES {
                    let cluster_coordinate = clusters_indexing.get([cluster, feature]);
                    let point_coordinate = samples_indexing.get([point, feature]);
                    sum += (cluster_coordinate - point_coordinate).powi(2);
                }
                sum
            };

            if squared_distance < least_squared_distance {
                closest_cluster = cluster as f64;
                least_squared_distance = squared_distance;
            }
        }
        // save the cluster that is closest to each point in the 6th column
        *samples_indexing.get_ref_mut([point, CLUSTER_ID_COLUMN]) = closest_cluster;
    }

    // update cluster centres to the mean of their points
    absolute_changes = 0.0;
    for cluster in 0..CLUSTERS {
        // construct a list of the points this cluster owns
        let owned = normalised_subset.select([("feature", CLUSTER_ID_COLUMN)]).iter()
            // zip together the id values with their index
            .enumerate()
            // exclude the points that aren't assigned to this cluster
            .filter(|(_, id)| (*id as usize) == cluster)
            // drop the cluster ids from each item and copy over the data
            // for each point for each feature
            .map(|(index, _)| {
                // for each point copy all its data in each feature excluding the
                // final cluster id column into a new vec
                normalised_subset.select([("sample", index)]).iter()
                    // taking the first 5 excludes the 6th column due to 0 indexing
                    .take(CLUSTER_ID_COLUMN)
                    .collect::<Vec<f64>>()
            })
            // collect into a vector of vectors containing each feature's data
            .collect::<Vec<Vec<f64>>>();
        // pass the vector of vectors into a tensor so we have
        // a tensor where each row is the data of a point this cluster owns
        let owned_points = owned.len();
        let owned = Tensor::from([("owned", owned.len())], owned);

        // construct a vector of the mean for each feature that this cluster
        // now has
        let new_means = {
            let mut means = Vec::with_capacity(FEATURES);

            for feature in 0..FEATURES {
                let mean = owned.iter().map(|x| x[feature]).sum::<f64>() / (owned_points as f64);
                means.push(mean);
            }

            means
        };

        // update each new mean for the cluster
        for feature in 0..FEATURES {
            let mut clusters_indexing = clusters.index_mut();
            let previous_mean = clusters_indexing.get([cluster, feature]);
            // track the absolute difference between the new mean and the old one
            // so we know when to stop updating the clusters
            absolute_changes += (previous_mean - new_means[feature]).abs();

            println!(
                "Cluster {:?} mean for feature {:?} is now {:?} was {:?}",
                cluster, feature, new_means[feature], previous_mean
            );
            *clusters_indexing.get_ref_mut([cluster, feature]) = new_means[feature];
        }
    }
}

println!(
    "Denormalised clusters at convergence:\n{:?}\n{:.3}",
    vec![ "H", "C", "T", "M", "S" ],
    clusters.map_with_index(|[_cluster, feature], x| {
        let indexing = means_and_variances.index();
        let (mean, variance) = indexing.get([feature]);
        (x * variance) + mean
    }));

// Now we will assign every alien in the full dataset a sex using these cluster centres
let mut aliens: Vec<Alien> = Vec::with_capacity(SAMPLES);

fn assign_alien_sex(index: u8) -> AlienSex {
    if index == 0 {
        AlienSex::A
    } else if index == 1 {
        AlienSex::B
    } else {
        AlienSex::C
    }
}

for i in 0..SAMPLES {
    let alien_data = unlabelled_dataset.select([("sample", i)]).map(|x| x);
    // normalise the alien data first so comparisons are on unit variance
    // and zero mean
    let normalised_alien_data = alien_data.map_with_index(|[feature], x| {
        let indexing = means_and_variances.index();
        let (mean, variance) = indexing.get([feature]);
        // normalise each feature in the alien data
        (x - mean) / variance
    });
    let mut distances = Vec::with_capacity(CLUSTERS);
    for j in 0..CLUSTERS {
        let cluster_data = clusters.select([("cluster", j)]);
        // use euclidean distance to compare the alien with the cluster, the cluster
        // is already normalised
        // taking the elementwise difference first gives us the same differences as a
        // 1 x 5 tensor and a 5 x 1 tensor, so matrix multiplication of the pair
        // gives us the sum of squared differences
        let difference = cluster_data.map(|x| x) - normalised_alien_data.map(|x| x);
        let sum_of_squares = (
            difference.clone().expand([(0, "samples")]) * difference.expand([(1, "alien")])
        ).first();
        distances.push(sum_of_squares);
    }

    // find the cluster with the lowest distance to each point and get its index
    let chosen_cluster = distances.iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
        .map(|(i, _)| i)
        .unwrap();

    // convert each float to the correct data type
    let alien_data_indexing = alien_data.index();
    aliens.push(Alien {
        height: alien_data_indexing.get([0]),
        primary_marking_color: recover_generated_color(alien_data_indexing.get([1])),
        tail_length: alien_data_indexing.get([2]),
        metabolic_rate: alien_data_indexing.get([3]),
        spiked_tail: recover_generated_spiked_tail(alien_data_indexing.get([4])),
        sex: assign_alien_sex(chosen_cluster as u8),
    })
}

println!("First 10 aliens");
for i in 0..10 {
    println!("{:?}", aliens[i]);
}

// Put the aliens in a tensor for convenience
let aliens = Tensor::from([("sample", SAMPLES)], aliens);

println!("Sex A aliens total: {}", aliens.iter_reference()
    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::A { 1 } else { 0 }));

println!("Sex B aliens total: {}", aliens.iter_reference()
    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::B { 1 } else { 0 }));

println!("Sex C aliens total: {}", aliens.iter_reference()
    .fold(0, |accumulator, alien| accumulator + if alien.sex == AlienSex::C { 1 } else { 0 }));

/**
 * ===============================
 * = Naïve Bayes =
 * ===============================
 */

// Each class is roughly one third so we should not have a strong prior towards any particular class

// In order to evaluate the performance of the Naïve Bayes classifier we will hold out
// the last 100 aliens from the dataset and use them as test data

const TRAINING_SAMPLES: usize = 900;
const TESTING_SAMPLES: usize = 100;
let training_data = Tensor::from(
    [("sample", TRAINING_SAMPLES)], aliens.iter().take(TRAINING_SAMPLES).collect()
);
let test_data = Tensor::from(
    [("sample", TESTING_SAMPLES)], aliens.iter().skip(TRAINING_SAMPLES).collect()
);

/**
 * Predicts the most probable alien sex for each test input alien (disregarding
 * the sex field in those inputs)
 *
 * For the real valued features the probabilities are computed by modelling
 * the features (conditioned on each class) as Gaussian distributions.
 * For categorical features laplacian smoothing of the counts is used to
 * estimate probabilities of the features (conditioned on each class).
 */
fn predict_aliens(
    training_data: &Tensor<Alien, 1>, test_data: &Tensor<Alien, 1>
) -> Tensor<AlienSex, 1> {
    let mut relative_log_probabilities: Vec<f64> = Vec::with_capacity(3 * TESTING_SAMPLES);

    for class in &[ AlienSex::A, AlienSex::B, AlienSex::C ] {
        let training_data_class_only = training_data.iter()
            .filter(|a| &a.sex == class)
            .collect::<Vec<Alien>>();

        // compute how likely each class is in the training set
        let prior = (training_data_class_only.len() as f64) / (TRAINING_SAMPLES as f64);

        // We model the real valued features as Gaussians, note that these
        // are Gaussian distributions over only the training data of each class
        let heights: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.height)
        );
        let tail_lengths: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.tail_length)
        );
        let metabolic_rates: Gaussian<f64> = Gaussian::approximating(
            training_data_class_only.iter().map(|a| a.metabolic_rate)
        );

        // gradually build up the sum of log probabilities to get the
        // log of the prior * likelihood which will be proportional to the posterior
        let mut relative_log_probabilities_of_class = test_data.iter_reference()
            .map(|alien| {
                // probability of the alien sex and the alien
                let mut log_relative_probability = prior.ln();

                // Compute the probability using the Gaussian model for each real valued feature.
                // Due to floating point precision limits and the variance for some of these
                // Gaussian models being extremely small (0.01 for heights) we
                // check if a probability computed is zero or extremely close to zero
                // and if so increase it a bit to avoid computing -inf when we take the log.

                let mut height_given_class = heights.probability(&alien.height);
                if height_given_class.abs() <= 0.000000000001 {
                    height_given_class = 0.000000000001;
                }
                log_relative_probability += height_given_class.ln();

                let mut tail_given_class = tail_lengths.probability(&alien.tail_length);
                if tail_given_class.abs() <= 0.000000000001 {
                    tail_given_class = 0.000000000001;
                }
                log_relative_probability += tail_given_class.ln();

                let mut metabolic_rates_given_class = metabolic_rates.probability(
                    &alien.metabolic_rate);
                if metabolic_rates_given_class.abs() <= 0.000000000001 {
                    metabolic_rates_given_class = 0.000000000001;
                }
                log_relative_probability += metabolic_rates_given_class.ln();

                // compute the probability of the categorical features using laplacian smoothing
                let color_of_class = training_data_class_only.iter()
                    .map(|a| a.primary_marking_color)
                    // count how many aliens of this class have this color
                    .fold(0, |acc, color|
                        acc + if color == alien.primary_marking_color { 1 } else { 0 });
                // with laplacian smoothing we assume there is one datapoint for each color
                // which avoids zero probabilities but does not distort the probabilities much
                // there are 7 color types so we add 7 to the total
                let color_given_class = ((color_of_class + 1) as f64)
                    / ((training_data_class_only.len() + 7) as f64);
                log_relative_probability += color_given_class.ln();

                let spiked_tail_of_class = training_data_class_only.iter()
                    .map(|a| a.spiked_tail)
                    // count how many aliens of this class have a spiked tail or not
                    .fold(0, |acc, spiked| acc + if spiked == alien.spiked_tail { 1 } else { 0 });
                // again we assume one alien of the class with a spiked tail and one without
                // to avoid zero probabilities
                let spiked_tail_given_class = ((spiked_tail_of_class + 1) as f64)
                    / ((training_data_class_only.len() + 2) as f64);
                log_relative_probability += spiked_tail_given_class.ln();

                if log_relative_probability == std::f64::NEG_INFINITY {
                    println!("Individual probs P:{} H:{} T:{} M:{} C:{} S:{}",
                        prior, height_given_class, tail_given_class, metabolic_rates_given_class,
                        color_given_class, spiked_tail_given_class);
                }

                log_relative_probability
            }).collect();

        relative_log_probabilities.append(&mut relative_log_probabilities_of_class);
    }

    // collect the relative probability estimates for each class and each alien
    // into a 3 x 100 tensor of classes x aliens
    let probabilities = Tensor::from(
        [("class", 3), ("sample", TESTING_SAMPLES)], relative_log_probabilities
    );

    let predictions = (0..TESTING_SAMPLES).map(|i| {
        let predicted_class_index = probabilities.select([("sample", i)])
            .iter()
            .enumerate()
            // find the class with the highest relative probability estimate
            .max_by(|(_, a), (_, b)| a.partial_cmp(b).expect("NaN should not be in list"))
            // discard the probability
            .map(|(index, _)| index)
            // retrieve the value from the option
            .unwrap();
        if predicted_class_index == 0 {
            AlienSex::A
        } else if predicted_class_index == 1 {
            AlienSex::B
        } else {
            AlienSex::C
        }
    }).collect();

    Tensor::from([("sample", TESTING_SAMPLES)], predictions)
}

let predictions = predict_aliens(&training_data, &test_data);

println!("First 10 test aliens and predictions");
for i in 0..10 {
    println!(
        "Predicted: {:?} Input: {:?}",
        predictions.index().get([i]), test_data.index().get([i])
    );
}

let accuracy = test_data.iter_reference()
    .zip(predictions.iter_reference())
    .map(|(alien, prediction)| if alien.sex == *prediction { 1 } else { 0 })
    .sum::<u16>() as f64 / (TESTING_SAMPLES as f64);

println!("Accuracy {}%", accuracy * 100.0);

/**
 * ===============================
 * = Analysis =
 * ===============================
 */

// We can get a better sense of how well our classifier has done by
// printing the confusion matrix

// Construct a confusion matrix of actual x predicted classes, using A as 0, B as 1 and C as 2
// for indexing. If the accuracy was 100% we would see only non zero numbers on the diagonal
// as every prediction would be the actual class.
let confusion_matrix = {
    let mut confusion_matrix = Tensor::empty([("actual_class", 3), ("predicted_class", 3)], 0);
    let mut confusion_matrix_indexing = confusion_matrix.index_mut();

    // loop through all the actual and predicted classes to fill the confusion matrix
    // with the total occurrences of each possible combination
    for (actual, predicted) in test_data.iter_reference().zip(predictions.iter_reference()) {
        match actual.sex {
            AlienSex::A => {
                match predicted {
                    AlienSex::A => *confusion_matrix_indexing.get_ref_mut([0, 0]) += 1,
                    AlienSex::B => *confusion_matrix_indexing.get_ref_mut([0, 1]) += 1,
                    AlienSex::C => *confusion_matrix_indexing.get_ref_mut([0, 2]) += 1,
                }
            },
            AlienSex::B => {
                match predicted {
                    AlienSex::A => *confusion_matrix_indexing.get_ref_mut([1, 0]) += 1,
                    AlienSex::B => *confusion_matrix_indexing.get_ref_mut([1, 1]) += 1,
                    AlienSex::C => *confusion_matrix_indexing.get_ref_mut([1, 2]) += 1,
                }
            },
            AlienSex::C => {
                match predicted {
                    AlienSex::A => *confusion_matrix_indexing.get_ref_mut([2, 0]) += 1,
                    AlienSex::B => *confusion_matrix_indexing.get_ref_mut([2, 1]) += 1,
                    AlienSex::C => *confusion_matrix_indexing.get_ref_mut([2, 2]) += 1,
                }
            }
        }
    }

    confusion_matrix
};

println!("Confusion matrix: Rows are actual class, Columns are predicted class\n{}",
    confusion_matrix);
println!(" A B C");
```

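Both examples estimate the categorical feature probabilities with laplacian smoothing. As a
standalone sketch of just that calculation (the counts below are hypothetical, not taken from
the generated dataset):

```
/// Laplacian (add one) smoothed estimate of P(category | class): one pseudo
/// observation is added per category so no probability is ever exactly zero.
fn laplace_smoothed(count_in_class: usize, class_total: usize, categories: usize) -> f64 {
    (count_in_class + 1) as f64 / (class_total + categories) as f64
}

fn main() {
    // a colour never observed for this class still gets a small non zero probability
    println!("unseen colour: {}", laplace_smoothed(0, 300, 7));
    // the smoothed probabilities over all 7 colours still sum to 1
    // (up to floating point rounding)
    let counts = [120, 60, 50, 40, 20, 8, 2];
    let total: f64 = counts.iter().map(|&c| laplace_smoothed(c, 300, 7)).sum();
    println!("total: {}", total);
}
```
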
The above code prints 60% accuracy, which isn't amazing: had we learned the cluster centres
that labelled the data in the first place we could reach 100% accuracy. Still, 60% is far
better than guessing on a three class problem, and since Naïve Bayes has no hyperparameters
it makes a convenient baseline to compare other examples such as
[Logistic Regression](super::super::logistic_regression) against.
*/

}