1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314
/*!
Logistic regression example
Logistic regression can be used for classification. By performing linear regression on a logit
function a linear classifier can be obtained that retains probabilistic semantics.
Given some data on a binary classification problem (ie a
[Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution)),
transforming the probabilities with a logit function:
<pre>log(p / (1 - p))</pre>
where p is the probability of success, ie P(y=True|x), puts them in the -infinity to infinity
range. If we assume a simple linear model over two inputs x1 and x2 then:
<pre>log(p / (1 - p)) = w0 + w1*x1 + w2*x2</pre>
For more complex data [basis functions](super::linear_regression)
can be used on the inputs to model non linearity. Once we have a model we can define the
objective function to maximise to learn the weights for. Once we have fixed weights
we can estimate the probability of new data by taking the inverse of the logit function
(the sigmoid function):
<pre>1 / (1 + e^(-(w0 + w1*x1 + w2*x2)))</pre>
which maps back into the 0 - 1 range and produces a probability for the unseen data. We can then
choose a cutoff at say 0.5 and we have a classifier that ouputs True for any unseen data estimated
to have a probability ≥ 0.5 and False otherwise.
# Arriving at the update rule
If the samples are independent of each other, ie knowing P(y<sub>1</sub>=True|x<sub>1</sub>)
tells you nothing about P(y<sub>1</sub>=True|x<sub>2</sub>), as is the case in a bernoulli
distribution, then the probability of P(**y**|X) is the product of each
P(y<sub>i</sub>|**x<sub>i</sub>**). For numerical stability reasons we often want to take logs
of the probability, which transforms the product into a sum.
log(P(**y**|X)) = the sum over all i data of (log(P(y<sub>i</sub>|**x<sub>i</sub>**)))
Our model sigmoid(**w**<sup>T</sup>**x**) is already defined as P(y<sub>i</sub>=True|**x<sub>i</sub>**) so
P(y<sub>i</sub>) = p<sub>i</sub> if y<sub>i</sub> = 1 and 1 - p<sub>i</sub> if y<sub>i</sub> = 0,
where p<sub>i</sub> = P(y<sub>i</sub>=True|**x<sub>i</sub>**) = sigmoid(**w**<sup>T</sup>**x**)
this can be converted into a single equation because a<sup>0</sup> = 1
P(y<sub>i</sub>) = (p<sub>i</sub>^y<sub>i</sub>) * ((1 - p<sub>i</sub>)^(1 - y<sub>i</sub>))
putting the two equations together gives the log probability we want to maximise in terms of
p<sub>i</sub>, which is itself in terms of our model's weights
log(P(**y**|X)) = the sum over all i data of (log((p<sub>i</sub>^y<sub>i</sub>) * (1 - p<sub>i</sub>)^(1 - y<sub>i</sub>)))
by log rules we can remove the exponents
log(P(**y**|X)) = the sum over all i data of (y<sub>i</sub> * log(p<sub>i</sub>) + (1 - y<sub>i</sub>) * log(1 - p<sub>i</sub>)))
we want to maximise P(**y**|X) with our weights so we take the derivative with respect to **w**
d(log(P(**y**|X))) / d**w** = the sum over all i data of ((y<sub>i</sub> - p(y<sub>i</sub>=True|**x<sub>i</sub>**))**x<sub>i</sub>**)
where p(y<sub>i</sub>=True|**x<sub>i</sub>**) = 1 / (1 + e^(-(w0 + w1 * x1 + w2 * x2))) as defined
earlier. This derivative will maximise log(P(**y**|X)), and as logs are monotonic P(**y**|X) as
well, when it equals 0. Unfortunatly there is no closed form solution so we must perform gradient
descent to fit **w**. In this example i is small enough we perform gradient descent over all the
training data, for big data problems stochastic gradient descent would scale better.
The update rule:
**w<sub>new</sub>** = **w<sub>old</sub>** + learning_rate * (the sum over all i data of (y<sub>i</sub> - [1 / (1 + e^(-(**w**<sup>T</sup>**x**)))])**x<sub>i</sub>**))
# Logistic regression example
```
// Actual types of datasets logistic regression might be performed on include diagnostic
// datasets such as cancer/not cancer diagnosis and various measurements of patients.
// More abstract datasets could be related to coin flipping.
//
// To ensure our example is not overly complicated but requires two dimensions for the inputs
// to estimate the probability distribution of the random variable we sample from, we model
// two clusters from different classes to classify. As each class is arbitary, we assign the first
// class as the True case that the model should predict >0.5 probability for, and the second
// class as the False case that the model should predict <0.5 probability for.
use easy_ml::matrices::Matrix;
use easy_ml::matrices::slices::{Slice2D, Slice};
use easy_ml::distributions::MultivariateGaussian;
use rand::{Rng, SeedableRng};
use rand::distributions::{DistIter, Standard};
use rand_chacha::ChaCha8Rng;
use textplots::{Chart, Plot, Shape};
// use a fixed seed random generator from the rand crate
let mut random_generator = ChaCha8Rng::seed_from_u64(13);
// define two cluster centres using two 2d gaussians, making sure they overlap a bit
let class1 = MultivariateGaussian::new(
Matrix::column(vec![ 2.0, 3.0 ]),
Matrix::from(vec![
vec![ 1.0, 0.1 ],
vec![ 0.1, 1.0 ]]));
// make the second cluster more spread out so there will be a bit of overlap with the first
// in the (0,0) to (1, 1) area
let class2 = MultivariateGaussian::new(
Matrix::column(vec![ -2.0, -1.0 ]),
Matrix::from(vec![
vec![ 2.5, 1.2 ],
vec![ 1.2, 2.5 ]]));
// Generate 200 points for each cluster
let points = 200;
let mut random_numbers: DistIter<Standard, &mut ChaCha8Rng, f64> =
(&mut random_generator).sample_iter(Standard);
// unwrap is perfectly safe if and only if we know we have supplied enough random numbers
let class1_points = class1.draw(&mut random_numbers, points).unwrap();
let class2_points = class2.draw(&mut random_numbers, points).unwrap();
// Plot each class of the generated data in a scatter plot
println!("Generated data points");
/**
* Helper function to print a scatter plot of a provided matrix with x, y in each row
*/
fn scatter_plot(data: &Matrix<f64>) {
// textplots expects a Vec<(f32, f32)> where each tuple is a (x,y) point to plot,
// so we must transform the data from the cluster points slightly to plot
let scatter_points = data.column_iter(0)
// zip is used to merge the x and y columns in the data into a single tuple
.zip(data.column_iter(1))
// finally we map the tuples of (f64, f64) into (f32, f32) for handing to the library
.map(|(x, y)| (x as f32, y as f32))
.collect::<Vec<(f32, f32)>>();
Chart::new(180, 60, -8.0, 8.0)
.lineplot(Shape::Points(&scatter_points))
.display();
}
println!("Classs 1");
scatter_plot(&class1_points);
println!("Classs 2");
scatter_plot(&class2_points);
// for ease of use later we insert a 0th column into both class's points so w0 + w1*x1 + w2*x2
// can be computed by w^T x
let class1_inputs = {
let mut design_matrix = class1_points.clone();
design_matrix.insert_column(0, 1.0);
design_matrix
};
let class2_inputs = {
let mut design_matrix = class2_points.clone();
design_matrix.insert_column(0, 1.0);
design_matrix
};
/**
* The sigmoid function, taking values in the [-inf, inf] range and mapping
* them into the [0, 1] range.
*/
fn sigmoid(x: f64) -> f64 {
1.0 / (1.0 + ((-x).exp()))
}
/**
* The logit function, taking values in the [0, 1] range and mapping
* them into the [-inf, inf] range
*/
fn logit(x: f64) -> f64 {
(x / (1.0 - x)).ln()
}
// First we initialise the weights matrix to some initial values
let mut weights = Matrix::column(vec![ 1.0, 1.0, 1.0 ]);
/**
* The log of the likelihood function P(**y**|X). This is what we want to update our
* weights to maximise as we want to train the model to predict y given **x**,
* where y is the class and **x** is the two features the model takes as input.
* It should be noted that something has probably gone wrong if you ever get 100%
* performance on your training data, either your training data is linearly seperable
* or you are overfitting and the weights won't generalise to predicting the correct
* class given unseen inputs.
*
* This function is mostly defined for completeness, we maximise it using the derivative
* and never need to compute it.
*/
fn log_likelihood(
weights: &Matrix<f64>, class1_inputs: &Matrix<f64>, class2_inputs: &Matrix<f64>
) -> f64 {
// The probability of predicting all inputs as the correct class is the product
// of the probability of predicting each individal input and class correctly, as we
// assume each sample is independent of each other.
let mut likelihood = 1_f64.ln();
// the model should predict 1 for each class 1
let predictions = (class1_inputs * weights).map(sigmoid);
for i in 0..predictions.rows() {
likelihood += predictions.get(i, 0).ln();
}
// the model should predict 0 for each class 2
let predictions = (class2_inputs * weights).map(sigmoid).map(|p| 1.0 - p);
for i in 0..predictions.rows() {
likelihood += predictions.get(i, 0).ln();
}
likelihood
}
/**
* The derivative of the negative log likelihood function, which we want to set to 0 in order
* to maximise P(**y**|X).
*/
fn update_function(
weights: &Matrix<f64>, class1_inputs: &Matrix<f64>, class2_inputs: &Matrix<f64>
) -> Matrix<f64> {
let mut derivative = Matrix::column(vec![ 0.0, 0.0, 0.0 ]);
// compute y - predictions for all the first class of inputs
let prediction_errors = (class1_inputs * weights).map(sigmoid).map(|p| 1.0 - p);
for i in 0..prediction_errors.rows() {
// compute diff * x_i
let diff = prediction_errors.get(i, 0);
let ith_error = Matrix::column(class1_inputs.row_iter(i).collect()).map(|x| x * diff);
derivative = derivative + ith_error;
}
// compute y - predictions for all the second class of inputs
let prediction_errors = (class2_inputs * weights).map(sigmoid).map(|p| 0.0 - p);
for i in 0..prediction_errors.rows() {
// compute diff * x_i
let diff = prediction_errors.get(i, 0);
let ith_error = Matrix::column(class2_inputs.row_iter(i).collect()) * diff;
derivative = derivative + ith_error;
}
derivative
}
let learning_rate = 0.002;
let mut log_likelihood_progress = Vec::with_capacity(25);
// For this example we cheat and have simply found what number of iterations and learning rate
// yields a correct decision boundry so don't actually check for convergence. In a real example
// you would stop once the updates for the weights become 0 or very close to 0.
for i in 0..25 {
let update = update_function(&weights, &class1_inputs, &class2_inputs);
weights = weights + (update * learning_rate);
log_likelihood_progress.push(
(i as f32, log_likelihood(&weights, &class1_inputs, &class2_inputs) as f32)
);
}
println!("Log likelihood over 25 iterations (bigger is better as logs are monotonic)");
Chart::new(180, 60, 0.0, 15.0)
.lineplot(Shape::Lines(&log_likelihood_progress))
.display();
println!("Decision boundry after 25 iterations");
decision_boundry(&weights);
// The model should have learnt to classify class 1 correctly at the expected value
// of the cluster
assert!(
sigmoid(
(weights.transpose() * Matrix::column(vec![ 1.0, 2.0, 3.0])).scalar()
) > 0.5);
// The model should have learnt to classify class 2 correctly at the expected value
// of the cluster
assert!(
sigmoid(
(weights.transpose() * Matrix::column(vec![ 1.0, -2.0, -1.0])).scalar()
) < 0.5);
/**
* A utility function to plot the decision boundry of the model. As the terminal plotting
* library doesn't support colored plotting at the time of writing this is a little challenging
* to do given we have two dimensions of inputs and one dimension of output which is also real
* valued as logistic regression computes probability. This could best be done with a 3d
* plot or a heatmap, but is done with this function by taking 0.5 as the cutoff for
* classification, generating a grid of points in the two dimensional space and classifying all
* of them, then plotting the ones classified as class 1.
*
* TODO: work out how to retrieve a line from these points so the decision boundry is easier
* to see
*/
fn decision_boundry(weights: &Matrix<f64>) {
// compute a matrix of coordinate pairs from (-8.0, -8.0) to (8.0, 8.0)
let grid_values = Matrix::empty(0.0, (160, 160));
// create a matrix of tuples combining every combination of coordinates
let grid_values = grid_values.map_with_index(|_, i, j| (
(i as f64 - 80.0) * 0.1, (j as f64 - 80.0) * 0.1)
);
// iterate through every tuple and see if the model predicts class 1
let points = grid_values.column_major_iter()
.map(|(x1, x2)| {
let input = Matrix::column(vec![ 1.0, x1, x2 ]);
let prediction = sigmoid((weights.transpose() * input).scalar());
return if prediction > 0.5 {
(x1, x2, 1)
} else {
(x1, x2, 0)
}
})
.filter(|(_, _, class)| class == &1)
.map(|(x1, x2, _)| (x1 as f32, x2 as f32))
.collect::<Vec<(f32, f32)>>();
Chart::new(180, 60, -8.0, 8.0)
.lineplot(Shape::Points(&points))
.display();
}
```
*/