the activation is the sigmoid function
Cost is cross-entropy
Chain
        W[L]            W[L-1]
            \                 \
C <- A[L] <- Z[L] <- A[L-1] <- Z[L-1] <- ...
            /                 /
        b[L]            b[L-1]
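before the derivatives, here is a minimal numpy sketch of the forward pass this chain describes, just so the symbols have shapes. all sizes, seeds and variable names are my own picks for illustration, nothing here is fixed by the notes:

    import numpy as np

    def sig(z):
        # sigmoid activation
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy(a, y):
        # cross-entropy cost, averaged over the batch
        return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

    rng = np.random.default_rng(0)
    m = 4                               # batch size, one column per sample
    A_prev = rng.random((3, m))         # A[L-1], layer L-1 has 3 neurons
    W = rng.standard_normal((2, 3))     # W[L], layer L has 2 neurons
    b = rng.standard_normal((2, 1))     # b[L], broadcast over the batch

    Z = W @ A_prev + b                  # Z[L] = W[L] * A[L-1] + b[L]
    A = sig(Z)                          # A[L] = sig(Z[L])
    Y = rng.integers(0, 2, (2, m))      # fake 0/1 targets, shapes only
    C = cross_entropy(A, Y)             # C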
& - partial derivative symbol (because it's easier to type on a keyboard than the actual ∂ symbol)
 &C      &C     &A[L]
----- = ----- * -----
&Z[L]   &A[L]   &Z[L]
that simplifies to A[L] - Y, where Y means the real data (the targets)
why?
C = -[Y * ln(A[L]) + (1-Y) * ln(1-A[L])] (that's the cross entropy from above)
 &C
----- = -(Y/A[L] - (1-Y)/(1-A[L]))
&A[L]
and
&A[L]
----- = sig`(Z[L]) =math_shenanigans= sig(Z[L]) * (1-sig(Z[L])) = A[L](1-A[L])
&Z[L]
by multiplying them we get (dropping the [L] index to keep the next line short)
-(Y/A - (1-Y)/(1-A)) * A(1-A) = -[Y(1-A) - (1-Y)*A] = -[Y - YA - A + YA] = -[Y - A] = A - Y
yeah, some math shenanigans are used there. short version: sig(z) = 1/(1+e^-z), so sig`(z) = e^-z / (1+e^-z)^2 = sig(z) * (1 - sig(z)). ask gpt if you want the full step-by-step derivative of the sigmoid function, it's too long to spell out here
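and a quick numeric sanity check of the A[L] - Y result (a sketch with made-up shapes: nudge one entry of Z[L] and compare the finite-difference slope of C against A[L] - Y; C is summed over entries here so the per-entry slope is exactly A[L] - Y):

    import numpy as np

    def sig(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(z, y):
        # cross entropy of sig(z), summed over all entries
        a = sig(z)
        return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

    rng = np.random.default_rng(1)
    Z = rng.standard_normal((2, 4))
    Y = rng.integers(0, 2, (2, 4)).astype(float)

    analytic = sig(Z) - Y                        # the A[L] - Y formula
    eps = 1e-6
    Zp = Z.copy()
    Zp[0, 0] += eps                              # nudge one entry of Z[L]
    numeric = (cost(Zp, Y) - cost(Z, Y)) / eps   # finite-difference slope

    print(analytic[0, 0], numeric)               # agree to ~5-6 digits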
to get the earlier layers we reuse the previously calculated derivative, where
  &C       &C      &Z[L]    &A[L-1]
------- = ----- * ------- * -------
&Z[L-1]   &Z[L]   &A[L-1]   &Z[L-1]
where
 &C
----- = calculated earlier, will call it delta later
&Z[L]
Z[L] = W[L] * A[L-1] + b[L];
 &Z[L]
------- = W[L] (here W[L] is just the coefficient of A[L-1], and the constant b[L] drops out)
&A[L-1]
(yep, that's all we get from that calculation)
A[L-1] = sig(Z[L-1])
where did I see that formula before?
&A[L-1]
------- = sig`(Z[L-1]) = sig(Z[L-1]) * (1-sig(Z[L-1])) = A[L-1] * (1-A[L-1])
&Z[L-1]
so we end up with
  &C
------- = (W[L])^T * delta, then elementwise * A[L-1] * (1 - A[L-1])
&Z[L-1]
yes, the W[L] really is transposed: delta lives in layer L (one row per neuron there), and (W[L])^T maps those errors back onto the neurons of layer L-1. also note the mix: (W[L])^T * delta is a proper matrix product, while the * A[L-1](1 - A[L-1]) part is elementwise
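a sketch of pushing delta one layer back, reusing the made-up layout from above (2 neurons in layer L, 3 in layer L-1). the transpose is what makes the shapes work out:

    import numpy as np

    rng = np.random.default_rng(2)
    m = 4
    W = rng.standard_normal((2, 3))      # W[L]
    A_prev = rng.random((3, m))          # A[L-1] = sig(Z[L-1])
    delta = rng.standard_normal((2, m))  # &C/&Z[L], from the step before

    # &C/&Z[L-1]: matrix product with W[L]^T, elementwise with sig`
    delta_prev = (W.T @ delta) * A_prev * (1 - A_prev)
    print(delta_prev.shape)              # (3, 4), matches layer L-1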
now, how do we get the weight and bias gradients from the Z derivative...
 &C      &C     &Z[L]
----- = ----- * -----
&W[L]   &Z[L]   &W[L]
and
 &C      &C     &Z[L]
----- = ----- * -----
&b[L]   &Z[L]   &b[L]
Z[L] = W[L] * A[L-1] + b[L]
so
&Z[L]
----- = A[L-1] (this time A[L-1] is the coefficient and W[L] is the variable; the constant b[L] drops out again)
&W[L]
so we end up with the previously calculated derivative of Z times A[L-1]: &C/&W[L] = delta * (A[L-1])^T (remember to transpose A[L-1] so the matrix multiplication lines up and produces the shape of W[L]). and it really is one matrix multiplication here, no looping over the samples
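sketch with the same made-up shapes: the single matrix product delta @ A[L-1]^T multiplies and sums over the batch in one go, then 1/m turns that sum into an average:

    import numpy as np

    rng = np.random.default_rng(3)
    m = 4
    delta = rng.standard_normal((2, m))  # &C/&Z[L]
    A_prev = rng.random((3, m))          # A[L-1]

    dW = (delta @ A_prev.T) / m          # &C/&W[L], shape (2, 3) like W[L]
    print(dW.shape)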
and with bias
&Z[L]/&b[L] = 1, because this time the whole W[L] * A[L-1] part is the constant that drops out. so &C/&b[L] is just delta itself, one value per neuron that looks the same for every sample in the batch, and we can just sum the error over the batch: sum of the &C/&Z[L] deltas * 1/m (m = count of samples, used to get the average over the batch). don't ask questions, it's 2:30 right now, and I can't think.
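and the bias gradient as a sketch (same made-up shapes): the gradient is delta itself, so we just sum over the batch axis and divide by m. keepdims keeps the (2, 1) column shape so it still broadcasts like b[L]:

    import numpy as np

    rng = np.random.default_rng(4)
    m = 4
    delta = rng.standard_normal((2, m))            # &C/&Z[L]

    db = np.sum(delta, axis=1, keepdims=True) / m  # &C/&b[L], shape (2, 1)
    print(db.shape)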
I'll end with this, maybe someone will go through it, I don't care. Maybe you will learn something, or you will have a reason to shit on me.