Neural Networks and Calculus - More Basics

Another Calculus lesson for Neural Networks
Published April 15, 2021

There has been another calculus workshop. Since I’m a scrub-tier data sciencer I find this stuff very valuable, so I’m going to go through it again. This time I will be more focused on the questions and my solutions, as I have a limited amount of time available to me.

Previously we defined the loss of the $(x, y)$ point classifier using mean squared error. This was based on the sigmoid output of our network, giving:

$$L(\text{bias}, w_x, w_y, \text{target}) = \left( \frac{1}{1 + e^{\text{bias} - w_x x - w_y y}} - \text{target} \right)^2$$
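As a quick sanity check, here is a minimal Python sketch of this loss for a single point (the exponent convention is my reading of the formula above):

```python
import math

def loss(bias: float, wx: float, wy: float, x: float, y: float, target: float) -> float:
    """Squared error of the sigmoid classifier on a single (x, y) point.

    Assumes the convention above, where the bias acts as a threshold:
    the exponent is bias - wx*x - wy*y.
    """
    output = 1 / (1 + math.exp(bias - wx * x - wy * y))
    return (output - target) ** 2
```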

We set $\text{bias}=1$ and $w_y=1$ and then attempted to derive the minimum value for $w_x$, which was somewhat fraught. It could be done more easily by calculating the derivative of the function and then working out the point at which the derivative is zero.

If we set the bias and $w_y$ as above and then use two fixed points:

| x | y | target |
|----|---|--------|
| 2  | 0 | 1 |
| -1 | 1 | 0 |

Then we can write out the MSE as follows:

$$\begin{aligned}
L(w_x) &= \frac{1}{2}\left(\frac{1}{1+e^{1-2w_x}} - 1\right)^2 + \frac{1}{2}\left(\frac{1}{1+e^{w_x}}\right)^2 \\
&= \frac{1}{2}\left(\frac{1}{1+e^{1-2w_x}} - 1\right)^2 + \frac{1}{2} \cdot \frac{1}{1+2e^{w_x}+e^{2w_x}} \\
&= \frac{1}{2}\left(\frac{1}{1+e^{1-2w_x}} - 1\right)^2 + \frac{1}{2+4e^{w_x}+2e^{2w_x}} \\
&= \frac{1}{2}\left(1 + \frac{1}{1+2e^{1-2w_x}+e^{2-4w_x}} - \frac{2}{1+e^{1-2w_x}}\right) + \frac{1}{2+4e^{w_x}+2e^{2w_x}} \\
&= \frac{1}{2} + \frac{1}{2+4e^{1-2w_x}+2e^{2-4w_x}} - \frac{1}{1+e^{1-2w_x}} + \frac{1}{2+4e^{w_x}+2e^{2w_x}}
\end{aligned}$$
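To make sure nothing was dropped in the expansion, a quick numeric check (the function names are mine) that the first and last lines agree:

```python
import math

def mse_first_line(wx: float) -> float:
    """The unexpanded loss: mean of the two squared errors."""
    first = (1 / (1 + math.exp(1 - 2 * wx)) - 1) ** 2
    second = (1 / (1 + math.exp(wx))) ** 2
    return first / 2 + second / 2

def mse_last_line(wx: float) -> float:
    """The fully expanded form from the final line above."""
    return (
        1 / 2
        + 1 / (2 + 4 * math.exp(1 - 2 * wx) + 2 * math.exp(2 - 4 * wx))
        - 1 / (1 + math.exp(1 - 2 * wx))
        + 1 / (2 + 4 * math.exp(wx) + 2 * math.exp(2 * wx))
    )

for wx in (-2.0, 0.0, 1.5, 3.0):
    assert math.isclose(mse_first_line(wx), mse_last_line(wx))
```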

I wonder if the derivative can be taken by just working on the denominators? I can check my work with Wolfram Alpha. It may be that expanding the equation wasn’t required.
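For reference, the reciprocal rule is what would make that work term by term:

$$\frac{d}{dw}\left(\frac{1}{g(w)}\right) = -\frac{g'(w)}{g(w)^2}$$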

I’ve just tested this on Wolfram Alpha and the derivative it calculates is complicated. I think this is beyond me at the moment.
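Even without the symbolic derivative, a central finite difference gives a feel for it. A sketch (the step size is my choice):

```python
import math

def mse(wx: float) -> float:
    """The two-point loss from above."""
    first = (1 / (1 + math.exp(1 - 2 * wx)) - 1) ** 2
    second = (1 / (1 + math.exp(wx))) ** 2
    return first / 2 + second / 2

def derivative(f, x: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for wx in (-2, 0, 2, 4, 8):
    print(wx, derivative(mse, wx))
```

The derivative comes out negative at every point I try, so the loss just keeps shrinking as $w_x$ grows, which would explain why deriving a minimum was fraught.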

So the next question that was introduced was calculating derivatives for multiple variables at once. The core principle is that you can treat an N-variable derivative as N separate partial derivatives, one for each variable. When calculating the derivative with respect to one variable you treat the other variables as constants.

For example, to calculate the derivative of $f(x,y)=x^2+xy$ you calculate $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$:

$$\begin{aligned}
f(x,y) &= x^2 + xy \\
\frac{\partial f}{\partial x} &= 2x + y \\
\frac{\partial f}{\partial y} &= x
\end{aligned}$$
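A quick check with sympy (assuming it is installed) agrees:

```python
import sympy

x, y = sympy.symbols("x y")
f = x**2 + x * y

print(sympy.diff(f, x))  # 2*x + y
print(sympy.diff(f, y))  # x
```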

The two exercises are:

$$f(x,y) = x^2 y^2 \qquad f(x,y) = e^{x^2+y^2}$$

The first one is straightforward:

$$\begin{aligned}
f(x,y) &= x^2 y^2 \\
\frac{\partial f}{\partial x} &= 2xy^2 \\
\frac{\partial f}{\partial y} &= 2yx^2
\end{aligned}$$

I’m not sure how to calculate the derivative for the powers. Ugh. It’s the chain rule.

$$h(x) = (g \circ f)(x) = g(f(x))$$

$$(g \circ f)' = (g' \circ f) \cdot f'$$

You multiply the derivatives of the two functions. Remember that the derivative of the outer function is evaluated at the value of the inner function.

Also, $e^x$ is its own derivative.

$$\begin{aligned}
h(x, y) &= e^{x^2+y^2} \\
f(x, y) &= x^2 + y^2 \qquad g(x) = e^x \qquad g(f(x, y)) = e^{f(x, y)} \\
\frac{\partial f}{\partial x} &= 2x \qquad \frac{\partial f}{\partial y} = 2y \qquad \frac{dg}{dx} = e^x \\
\frac{\partial h}{\partial x} &= \frac{dg}{dx}(f(x, y)) \cdot \frac{\partial f}{\partial x} = e^{x^2+y^2} \cdot 2x \\
\frac{\partial h}{\partial y} &= \frac{dg}{dx}(f(x, y)) \cdot \frac{\partial f}{\partial y} = e^{x^2+y^2} \cdot 2y
\end{aligned}$$
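Both exercises can also be checked with sympy (again assuming it is available):

```python
import sympy

x, y = sympy.symbols("x y")

# First exercise: f(x, y) = x^2 * y^2
f = x**2 * y**2
assert sympy.diff(f, x) == 2 * x * y**2
assert sympy.diff(f, y) == 2 * y * x**2

# Second exercise: the chain rule result for h(x, y) = e^(x^2 + y^2)
h = sympy.exp(x**2 + y**2)
assert sympy.diff(h, x) == sympy.exp(x**2 + y**2) * 2 * x
assert sympy.diff(h, y) == sympy.exp(x**2 + y**2) * 2 * y
```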

Wolfram Alpha agrees with me.

The next challenge is to compute the Jacobian matrix for a given vector-valued function.

A vector is a list of N numbers, and a vector-valued function is one that maps a vector to a vector. We can express a vector-valued function in its general form as:

$$\mathbb{R}^n \to \mathbb{R}^m$$

Here the function maps a vector of size n to a vector of size m. We can think of a vector as a matrix where one of the dimensions is 1. This means that applying a vector-valued function is like multiplying by a matrix, where the matrix is m by n.

$$\mathbb{R}^{1 \times n} \cdot (\mathbb{R}^n \to \mathbb{R}^m) \to \mathbb{R}^{m \times 1}$$

$$\mathbb{R}^{1 \times n} \cdot \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times 1}$$

We are restricting this function to a given point and then attempting to calculate a gradient at that point, the gradient being the derivative of the function at that point.

I think my explanation is really confused here. The vector-valued function does not have to be linear, it can be anything. The derivative is calculated by taking a specific point $p$ and then calculating how the function tends to change around that point.

$$f(p + \Delta p) \approx f(p) + C \, \Delta p$$

For this to work $C$ must be a matrix of the desired shape, as defined above. I feel like this confusion just shows that I need to understand this better.

This matrix is introduced as the Jacobian matrix, and it may not exist. It is also referred to as the total derivative of $f$.

For an $\mathbb{R}^3 \to \mathbb{R}^2$ function the Jacobian matrix would be:

$$J_f = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3}
\end{pmatrix}$$

The exercise is to calculate the Jacobian matrix for an $\mathbb{R}^2 \to \mathbb{R}^3$ function where:

$$f(x_1, x_2) = \begin{pmatrix} x_1 - x_2 \\ x_2^2 \\ x_1^2 \end{pmatrix}$$

I’m expecting the matrix of derivatives to look like this:

$$J_f = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \\
\frac{\partial f_3}{\partial x_1} & \frac{\partial f_3}{\partial x_2}
\end{pmatrix}$$

$$f_1(x_1, x_2) = x_1 - x_2 \qquad f_2(x_1, x_2) = x_2^2 \qquad f_3(x_1, x_2) = x_1^2$$

$$\begin{aligned}
\frac{\partial f_1}{\partial x_1} &= 1 & \frac{\partial f_1}{\partial x_2} &= -1 \\
\frac{\partial f_2}{\partial x_1} &= 0 & \frac{\partial f_2}{\partial x_2} &= 2x_2 \\
\frac{\partial f_3}{\partial x_1} &= 2x_1 & \frac{\partial f_3}{\partial x_2} &= 0
\end{aligned}$$

$$J_f = \begin{pmatrix} 1 & -1 \\ 0 & 2x_2 \\ 2x_1 & 0 \end{pmatrix}$$
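sympy has Matrix.jacobian, which computes this directly (assuming sympy is available) and agrees:

```python
import sympy

x1, x2 = sympy.symbols("x1 x2")
f = sympy.Matrix([x1 - x2, x2**2, x1**2])

print(f.jacobian([x1, x2]))
# Matrix([[1, -1], [0, 2*x2], [2*x1, 0]])
```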

My formatting could use work, however this seems legit? I’ve run out of time at this point so I’m going to leave it here.