Neural Networks and Calculus - More Basics

Another Calculus lesson for Neural Networks
Published

April 15, 2021

There has been another calculus workshop. Since I’m a scrub-tier data scientist I find this stuff very valuable, so I’m going to go through it again. This time I will be more focused on the questions and my solutions, as I have a limited amount of time available to me.

Previously we defined the loss of the x, y point classifier using mean squared error, based on the output of our network:

\[ L(bias, w_x, w_y, target) = \left( \frac{1}{1 + e^{-bias - w_x x - w_y y}} - target \right)^2 \]
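As a sanity check for myself, here’s that loss written out as code (a minimal sketch; the function name `loss` and the argument order are just my own choices):

```python
import math

def loss(bias, w_x, w_y, x, y, target):
    # network output: sigmoid of bias + w_x * x + w_y * y
    prediction = 1 / (1 + math.exp(-bias - w_x * x - w_y * y))
    # squared error against the target
    return (prediction - target) ** 2
```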

We set \(bias = -1\) and \(w_y = 1\) and then attempted to derive the minimum value for \(w_x\), which was somewhat fraught. It could be done more easily by calculating the derivative of the function and then working out the point at which the derivative is zero.

If we set the \(bias\) and \(w_y\) as above and then use two fixed points:

| x | y | target |
|---:|---:|---:|
| 2 | 0 | 1 |
| -1 | 1 | 0 |

Then we can write out the MSE as follows:

\[ \begin{aligned} L(w_x) &= \frac{1}{2} \left ( \frac{1}{1 + e^{1 - 2 w_x}} - 1 \right)^2 + \frac{1}{2} \left( \frac{1}{1 + e^{w_x}} \right)^2 \\ L(w_x) &= \frac{1}{2} \left ( \frac{1}{1 + e^{1 - 2 w_x}} - 1 \right)^2 + \frac{1}{2} \frac{1}{1 + 2 e^{w_x} + e^{2 w_x}} \\ L(w_x) &= \frac{1}{2} \left ( \frac{1}{1 + e^{1 - 2 w_x}} - 1 \right)^2 + \frac{1}{2 + 4 e^{w_x} + 2 e^{2 w_x}} \\ L(w_x) &= \frac{1}{2} \left ( 1 + \frac{1}{1 + 2 e^{1 - 2 w_x} + e^{2 - 4 w_x}} - \frac{2}{1 + e^{1 - 2 w_x}} \right) + \frac{1}{2 + 4 e^{w_x} + 2 e^{2 w_x}} \\ L(w_x) &= \frac{1}{2} + \frac{1}{2 + 4 e^{1 - 2 w_x} + 2 e^{2 - 4 w_x}} - \frac{1}{1 + e^{1 - 2 w_x}} + \frac{1}{2 + 4 e^{w_x} + 2 e^{2 w_x}} \end{aligned} \]
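Rather than doing all the algebra, I can also just evaluate this loss over a range of \(w_x\) values to see how it behaves (a sketch with numpy; the grid range is an arbitrary choice, and the sweep only shows the shape of the loss rather than proving where an exact minimum is):

```python
import numpy as np

def mse(w_x):
    # point (x=2, y=0, target=1): prediction is 1 / (1 + e^(1 - 2*w_x))
    p1 = 1 / (1 + np.exp(1 - 2 * w_x))
    # point (x=-1, y=1, target=0): prediction is 1 / (1 + e^(w_x))
    p2 = 1 / (1 + np.exp(w_x))
    return 0.5 * (p1 - 1) ** 2 + 0.5 * p2 ** 2

# sweep a range of w_x values and report where the loss is smallest on the grid
w = np.linspace(-10, 10, 2001)
values = mse(w)
print(w[np.argmin(values)], values.min())
```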

I wonder if I can take the derivative by just applying it to the denominators (treating each term as a reciprocal and using the chain rule)? I can check my work with Wolfram Alpha. This may mean that expanding the equation wasn’t required.

I’ve just tested this on Wolfram Alpha and the derivative it calculates is complicated. I think this is beyond me at the moment.
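For the record, sympy can do the same symbolic differentiation locally, which at least saves typing the formula into Wolfram Alpha (a sketch; I’m differentiating the un-expanded form of the loss, which I suspect is easier than my expanded version):

```python
import sympy as sp

w_x = sp.symbols('w_x')
# the un-expanded mean squared error from above
L = sp.Rational(1, 2) * (1 / (1 + sp.exp(1 - 2 * w_x)) - 1) ** 2 \
    + sp.Rational(1, 2) * (1 / (1 + sp.exp(w_x))) ** 2

print(sp.simplify(sp.diff(L, w_x)))
```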

So the next question that was introduced was calculating derivatives for multiple variables at once. The core principle is that you can treat an N-variable derivative as N separate derivatives, one for each variable. When calculating the derivative with respect to one variable you treat the other variables as constants.

For example, to calculate the derivative of \(f(x,y) = x^2 + xy\) you calculate \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\):

\[ \begin{aligned} f(x, y) &= x^2 + xy \\ \frac{\partial f}{\partial x} &= 2x + y \\ \frac{\partial f}{\partial y} &= x \end{aligned} \]
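This is easy to check with sympy (a small sketch):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + x * y
print(sp.diff(f, x))  # 2*x + y
print(sp.diff(f, y))  # x
```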

The two exercises are:

\[ \begin{aligned} f(x, y) &= x^2 y^2 \\ f(x, y) &= e^{x^2 + y^2} \end{aligned} \]

The first one is straightforward:

\[ \begin{aligned} f(x, y) &= x^2 y^2 \\ \frac{\partial f}{\partial x} &= 2x y^2 \\ \frac{\partial f}{\partial y} &= 2y x^2 \end{aligned} \]
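A quick numerical check of those partials with central differences (a sketch; the point (1.5, -0.5) is an arbitrary choice):

```python
import numpy as np

def f(x, y):
    return x**2 * y**2

# central-difference approximation of the partials at an arbitrary point
x0, y0, h = 1.5, -0.5, 1e-6
dfdx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
dfdy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)
print(dfdx, 2 * x0 * y0**2)  # should agree to several decimal places
print(dfdy, 2 * y0 * x0**2)  # should agree to several decimal places
```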

I’m not sure how to calculate the derivative when the exponent is itself a function. Ugh. It’s the chain rule.

\[ \begin{aligned} h(x) &= (g \circ f)(x) = g(f(x)) \\ (g \circ f)' &= (g' \circ f) \cdot f' \end{aligned} \]

You multiply the derivatives of the two functions. Remember that the derivative of the outer function is evaluated at the value of the inner function.
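A tiny numeric illustration of the chain rule (a sketch; \(f(x) = x^2\) and \(g(x) = \sin(x)\) are arbitrary choices of mine):

```python
import math

f = lambda x: x**2       # inner function
g = math.sin             # outer function
h = lambda x: g(f(x))    # the composition g(f(x))

# chain rule: h'(x) = g'(f(x)) * f'(x) = cos(x**2) * 2*x
x0 = 0.7
analytic = math.cos(x0**2) * 2 * x0
eps = 1e-6
numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)
print(analytic, numeric)  # should agree to several decimal places
```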

Also, \(e^x\) is its own derivative.

\[ \begin{aligned} h(x, y) &= e^{x^2 + y^2} \\ f(x, y) &= x^2 + y^2 \\ g(u) &= e^{u} \\ h(x, y) &= g(f(x, y)) = e^{f(x, y)} \\ \\ \frac{\partial f}{\partial x} &= 2x \\ \frac{\partial f}{\partial y} &= 2y \\ \\ \frac{d g}{d u} &= e^u \\ \\ \frac{\partial h}{\partial x} &= \frac{d g}{d u}(f(x, y)) \cdot \frac{\partial f}{\partial x} \\ \frac{\partial h}{\partial x} &= e^{x^2+y^2} \cdot 2x \\ \\ \frac{\partial h}{\partial y} &= \frac{d g}{d u}(f(x, y)) \cdot \frac{\partial f}{\partial y} = e^{x^2+y^2} \cdot 2y \end{aligned} \]

Wolfram Alpha agrees with me.
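The same check works offline with sympy (sketch):

```python
import sympy as sp

x, y = sp.symbols('x y')
h = sp.exp(x**2 + y**2)
print(sp.diff(h, x))  # 2*x*exp(x**2 + y**2)
print(sp.diff(h, y))  # 2*y*exp(x**2 + y**2)
```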


The next challenge is to compute the Jacobian matrix of a given vector valued function.

A vector is a list of \(N\) numbers, and a vector valued function is one that maps a vector to a vector. We can express a vector valued function in its general form as:

\[ \mathbb{R}^n \longrightarrow \mathbb{R}^m \]

Here the function maps a vector of size \(n\) to a vector of size \(m\). We can think of a vector as a matrix where one of the dimensions is 1. A linear vector valued function then acts like an \(m\) by \(n\) matrix multiplying an \(n \times 1\) column vector:

\[ \begin{aligned} \mathbb{R}^{n \times 1} &\longrightarrow \mathbb{R}^{m \times 1} \\ \mathbb{R}^{m \times n} \cdot \mathbb{R}^{n \times 1} &\longrightarrow \mathbb{R}^{m \times 1} \end{aligned} \]
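To make this concrete, here’s a toy vector valued function from \(\mathbb{R}^3\) to \(\mathbb{R}^2\) (my own made-up example, and deliberately not linear):

```python
import numpy as np

def f(v):
    # a made-up (non-linear) function from R^3 to R^2
    x1, x2, x3 = v
    return np.array([x1 + x2 * x3, x1**2 - x3])

print(f(np.array([1.0, 2.0, 3.0])))  # a vector of size 2
```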

We restrict the function to a given point and then attempt to calculate a gradient at that point, the gradient being the derivative of the function there.

I think my explanation is a bit confused here. The vector valued function does not have to be linear; it can be anything. The derivative is calculated by taking a specific point \(p\) and working out how the output changes for small changes around it.

\[ f(p + h) \approx f(p) + C h \]

For this to work \(C\) must be a matrix of the right shape, \(m\) by \(n\) as defined above. I feel like my confusion here just shows that I need to understand this better.

This matrix \(C\) is introduced as the Jacobian matrix, and it may not exist. It is also referred to as the total derivative of \(f\) at the point.

For an \(\mathbb{R}^3 \longrightarrow \mathbb{R}^2\) function the Jacobian matrix would be:

\[ Jf = \left( \begin{array}{ccc} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3} \end{array} \right) \]
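One way to approximate a Jacobian like this numerically is central differences over each input (a sketch of my own, reusing the toy function from above; the step size is an arbitrary choice):

```python
import numpy as np

def jacobian(f, p, eps=1e-6):
    # approximate the m x n Jacobian of f at p with central differences
    p = np.asarray(p, dtype=float)
    m = f(p).size
    J = np.zeros((m, p.size))
    for j in range(p.size):
        step = np.zeros_like(p)
        step[j] = eps
        J[:, j] = (f(p + step) - f(p - step)) / (2 * eps)
    return J

# the toy R^3 -> R^2 function from earlier
f = lambda v: np.array([v[0] + v[1] * v[2], v[0]**2 - v[2]])
print(jacobian(f, [1.0, 2.0, 3.0]))  # a 2 x 3 matrix
```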

The exercise is to calculate the Jacobian matrix for an \(\mathbb{R}^2 \longrightarrow \mathbb{R}^3\) function where:

\[ f(x_1, x_2) = \left ( \begin{array}{c} x_1 - x_2 \\ x^2_2 \\ -x^2_1 \end{array} \right ) \]

I’m expecting the matrix of derivatives to look like this:

\[ Jf = \left( \begin{array}{cc} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \\ \frac{\partial f_3}{\partial x_1} & \frac{\partial f_3}{\partial x_2} \end{array} \right) \]

\[ \begin{aligned} f_1(x_1, x_2) &= x_1 - x_2 \\ f_2(x_1, x_2) &= x^2_2 \\ f_3(x_1, x_2) &= -x^2_1 \end{aligned} \]

\[ \begin{aligned} \frac { \partial f_1 } { \partial x_1 } &= 1 \\ \frac { \partial f_1 } { \partial x_2 } &= -1 \\ \frac { \partial f_2 } { \partial x_1 } &= 0 \\ \frac { \partial f_2 } { \partial x_2 } &= 2x_2 \\ \frac { \partial f_3 } { \partial x_1 } &= -2x_1 \\ \frac { \partial f_3 } { \partial x_2 } &= 0 \\ \end{aligned} \]

\[ Jf = \left( \begin{array}{cc} 1 & -1 \\ 0 & 2x_2 \\ -2x_1 & 0 \end{array} \right) \]
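sympy has a `jacobian` method on matrices, so I can double check my answer (sketch):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = sp.Matrix([x1 - x2, x2**2, -x1**2])
print(f.jacobian([x1, x2]))
# Matrix([[1, -1], [0, 2*x2], [-2*x1, 0]])
```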

So this seems legit. I’ve run out of time at this point, so I’m going to leave it here.