the activation is the sigmoid function
Cost is cross-entropy
Chain
        W[L]            W[L-1]
            \                 \
C <- A[L] <- Z[L] <- A[L-1] <- Z[L-1] <- ...
            /                 /
        b[L]            b[L-1]
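before the derivatives, here is a minimal numpy sketch of the forward pass this chain describes, just so the symbols have shapes. all sizes, seeds and variable names are my own picks for illustration, nothing here is fixed by the notes:

    import numpy as np

    def sig(z):
        # sigmoid activation
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy(a, y):
        # cross-entropy cost, averaged over the batch
        return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

    rng = np.random.default_rng(0)
    m = 4                               # batch size, one column per sample
    A_prev = rng.random((3, m))         # A[L-1], layer L-1 has 3 neurons
    W = rng.standard_normal((2, 3))     # W[L], layer L has 2 neurons
    b = rng.standard_normal((2, 1))     # b[L], broadcast over the batch

    Z = W @ A_prev + b                  # Z[L] = W[L] * A[L-1] + b[L]
    A = sig(Z)                          # A[L] = sig(Z[L])
    Y = rng.integers(0, 2, (2, m))      # fake 0/1 targets, shapes only
    C = cross_entropy(A, Y)             # C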
& - partial derivative symbol (because it's easier to type on a keyboard than the actual ∂ symbol)
 &C      &C     &A[L]
----- = ----- * -----
&Z[L]   &A[L]   &Z[L]
that simplifies to A[L] - Y, where Y means the real data (the targets)
why?
C = -[Y * ln(A[L]) + (1-Y) * ln(1-A[L])] (that's the cross entropy from above)
 &C
----- = -(Y/A[L] - (1-Y)/(1-A[L]))
&A[L]
and
&A[L]
----- = sig`(Z[L]) =math_shenanigans= sig(Z[L]) * (1-sig(Z[L])) = A[L](1-A[L])
&Z[L]
by multiplying them we get (dropping the [L] index to keep the next line short)
-(Y/A - (1-Y)/(1-A)) * A(1-A) = -[Y(1-A) - (1-Y)*A] = -[Y - YA - A + YA] = -[Y - A] = A - Y
yeah, some math shenanigans are used there. short version: sig(z) = 1/(1+e^-z), so sig`(z) = e^-z / (1+e^-z)^2 = sig(z) * (1 - sig(z)). ask gpt if you want the full step-by-step derivative of the sigmoid function, it's too long to spell out here
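and a quick numeric sanity check of the A[L] - Y result (a sketch with made-up shapes: nudge one entry of Z[L] and compare the finite-difference slope of C against A[L] - Y; C is summed over entries here so the per-entry slope is exactly A[L] - Y):

    import numpy as np

    def sig(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(z, y):
        # cross entropy of sig(z), summed over all entries
        a = sig(z)
        return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

    rng = np.random.default_rng(1)
    Z = rng.standard_normal((2, 4))
    Y = rng.integers(0, 2, (2, 4)).astype(float)

    analytic = sig(Z) - Y                        # the A[L] - Y formula
    eps = 1e-6
    Zp = Z.copy()
    Zp[0, 0] += eps                              # nudge one entry of Z[L]
    numeric = (cost(Zp, Y) - cost(Z, Y)) / eps   # finite-difference slope

    print(analytic[0, 0], numeric)               # agree to ~5-6 digits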
to get the earlier layers we reuse the previously calculated derivative, where
  &C       &C      &Z[L]    &A[L-1]
------- = ----- * ------- * -------
&Z[L-1]   &Z[L]   &A[L-1]   &Z[L-1]
where
 &C
----- = calculated earlier, will call it delta later
&Z[L]
Z[L] = W[L] * A[L-1] + b[L];
 &Z[L]
------- = W[L] (here W[L] is just the coefficient of A[L-1], and the constant b[L] drops out)
&A[L-1]
(yep, that's all we get from that calculation)
A[L-1] = sig(Z[L-1])
where did I see that formula before?
&A[L-1]
------- = sig`(Z[L-1]) = sig(Z[L-1]) * (1-sig(Z[L-1])) = A[L-1] * (1-A[L-1])
&Z[L-1]
so we end up with
  &C
------- = (W[L])^T * delta, then elementwise * A[L-1] * (1 - A[L-1])
&Z[L-1]
yes, the W[L] really is transposed: delta lives in layer L (one row per neuron there), and (W[L])^T maps those errors back onto the neurons of layer L-1. also note the mix: (W[L])^T * delta is a proper matrix product, while the * A[L-1](1 - A[L-1]) part is elementwise
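a sketch of pushing delta one layer back, reusing the made-up layout from above (2 neurons in layer L, 3 in layer L-1). the transpose is what makes the shapes work out:

    import numpy as np

    rng = np.random.default_rng(2)
    m = 4
    W = rng.standard_normal((2, 3))      # W[L]
    A_prev = rng.random((3, m))          # A[L-1] = sig(Z[L-1])
    delta = rng.standard_normal((2, m))  # &C/&Z[L], from the step before

    # &C/&Z[L-1]: matrix product with W[L]^T, elementwise with sig`
    delta_prev = (W.T @ delta) * A_prev * (1 - A_prev)
    print(delta_prev.shape)              # (3, 4), matches layer L-1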
now, how do we get the weight and bias gradients from the Z derivative...
 &C      &C     &Z[L]
----- = ----- * -----
&W[L]   &Z[L]   &W[L]
and
 &C      &C     &Z[L]
----- = ----- * -----
&b[L]   &Z[L]   &b[L]
Z[L] = W[L] * A[L-1] + b[L]
so
&Z[L]
----- = A[L-1] (this time A[L-1] is the coefficient and W[L] is the variable; the constant b[L] drops out again)
&W[L]
so we end up with the previously calculated derivative of Z times A[L-1]: &C/&W[L] = delta * (A[L-1])^T (remember to transpose A[L-1] so the matrix multiplication lines up and produces the shape of W[L]). and it really is one matrix multiplication here, no looping over the samples
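sketch with the same made-up shapes: the single matrix product delta @ A[L-1]^T multiplies and sums over the batch in one go, then 1/m turns that sum into an average:

    import numpy as np

    rng = np.random.default_rng(3)
    m = 4
    delta = rng.standard_normal((2, m))  # &C/&Z[L]
    A_prev = rng.random((3, m))          # A[L-1]

    dW = (delta @ A_prev.T) / m          # &C/&W[L], shape (2, 3) like W[L]
    print(dW.shape)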
and with bias
&Z[L]/&b[L] = 1, because this time the whole W[L] * A[L-1] part is the constant that drops out. so &C/&b[L] is just delta itself, one value per neuron that looks the same for every sample in the batch, and we can just sum the error over the batch: sum of the &C/&Z[L] deltas * 1/m (m = count of samples, used to get the average over the batch). don't ask questions, it's 2:30 right now, and I can't think.
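and the bias gradient as a sketch (same made-up shapes): the gradient is delta itself, so we just sum over the batch axis and divide by m. keepdims keeps the (2, 1) column shape so it still broadcasts like b[L]:

    import numpy as np

    rng = np.random.default_rng(4)
    m = 4
    delta = rng.standard_normal((2, m))            # &C/&Z[L]

    db = np.sum(delta, axis=1, keepdims=True) / m  # &C/&b[L], shape (2, 1)
    print(db.shape)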
I'll end with this, maybe someone will go through it, I don't care. Maybe you will learn something, or you will have a reason to shit on me.