To keep myself entertained between classes this fall semester (2023), I decided to try to learn a little about machine learning mathematics. I've heard of the Python library TensorFlow, which does a lot of the heavy lifting for you (in c++, so likely not at the expense of Python's slow interpreter). I am interested in learning these types of python libraries, but first I wanted to develop a very strong familiarity with the basics and mathematics of machine learning. I also am trying to get as familiar as possible with C for my own personal enrichment. As of starting this project, I thought it would be nice to have a machine learning library in C for any of those who didn't want to use python for whatever reason. I realize now that this was foolish. My code probably doesn't have much utility for anyone else, but it was a very good learning experience for me.
I started my journey by scouring a book by Michael Nielsen on machine learning. The book is a few years old and behind the times now, but I'm not going to be setting the state of the art here so I'm not too concerned. I quickly learned that training a neural network really is just a minimization problem, and the whole thing can be elegantly expressed as pure math instead of as a programming problem. I started taking notes like I was in any other math class. The first big concept to learn was gradient descent.
For those who have taken a multivariable calculus class, gradient descent is a pretty straightforward exercise. For those who haven't, the core idea is still not too inaccessable. We can think of a neural network as a huge, crazy black box with a million different parameters to tweak. We throw a bunch of data into the box with an idea of what we want the network to spit back out. As long as we can quantify how far off our network is from the ideal output, we can use this quantity as a function of these million parameters. The idea here is to find the multivariable derivative of this error metric, usually called a cost function. If we can find a method to figure out how tweaking each parameter changes our output, then we can use that information to tweak them in just the right way to minimize the cost and maximize how close our output is to the ideal. This multivariable differential is called the gradient. Once we have our table of which parameters to tweak and in what way, we then choose how large a step to take and we actually change all of our parameter values. This is very similar to the idea of walking down a hill until you reach the lowest valley you can find. Because the network is a complex mathematical entity with many moving parts, the hard part is figuring out how to even calculate this gradient. That's where backpropagation comes in.
Backpropagation is the method by which a gradient is calculated for a neural network. The important thing to remember is that all backpropagation does is find the gradient of the cost function. The complicated part is figuring out how every single parameter should be tweaked to minimize the cost. The first step is to figure out how the output should change in order to minimize the cost. Usually this means the output will have a random set of activations that do not correlated to the input at all. Our cost will likely be high. We know that the cost is a function of each output neuron's activation, so we can take the derivative of the cost with respect to each activation value. Once we have this differential, we then think of our output activations as functions of the activation of the previous neuron, as well as the weight and bias for the current neuron. We can find this differential with respect to these values. We then consider the activations of the previous neuron as functions of the parameters of the neurons before that, and so on and so forth. This is why the process is called backpropagation, we start by looking at how the output ought to change and work backwards from there. The complicated part is the nuts and bolts of the math required to compute all this, which I will not cover here. I highly recommend checking out Nielsen's online book.
I started my program by building structures to hold all the neuron, weight, and bias data for each layer. My goal was to have one structure which holds the entirety of the network for ease of reference. I decided to use a structure to represent a single layer in a network. This structure holds vectors which represent the data for each neuron, and a matrix for the weights. I used the GNU Scientific library for my vector maths, which may have been a little overkill but it allows me to work with complex math structures easily. It will also help when I expand into convolutional networks.
Once I was sure I had my network data structures working, I started building code for importing data into the network. The example problem I used to get the network functioning was the classic digit recognization problem. The idea is to train a network to reconize and correctly categorize a low- resolution image of a handwritten digit, in black and white. In order to import an image, I had to open the file containing the images and scan through each line (where one line is one image) and import the 0-255 greyscale values into my dayta structure. Firstly, the network only works with values from 0-1, so I had to map the 0-255 values to 0-1. I decided to use a simple linear mapping between values, as opposed to some non-linear function like the sigmoid. I did experiment with the sigmoid function once the network was working, but the results were not as good. I suspect this is because the sigmoid is capable of mapping an infinite domain of values to a range of 0-1, meaning you could plug in an arbitrarily large number in and get a number restricted to the finite range out. This is very useful for calculating network activations, when the weights and biases can sum up to very large numbers. But if your data input is already restricted to a finite domain, then a simple linear map is better. Long story short, after some tweaking I had the image data properly converted and imported to my network's input.
The next task was to get my network to properly feedforward, that is, to properly calculate the activations for every neuron in every layer up to the output. Of course, the untrained network will output nonsense, but the important part is that it is properly calculated. The calculus will not work otherwise. Once the output can be calculated from the input data, then the cost needs to be calculated. I started using mean squared error for my cost function, then switched to log-likelihood later. I know the cost is working properly if it is very high for the untrained network, usually in the ballpark of 1-3. I can't really know for sure that the cost is working until I see it going down during the training process, but I was confident at this point it was doing okay. Once this was all up and running, I could move on to backpropagation.
The first really serious programming challenge was getting backpropagation up and running, although because the algorithm is so well-documented it boiled down to effectively implementing the algorithm in code. The first step is to find the 'error' of the output layer, which really just means figuring out how the output should change to decrease the cost. Then the error for the layer before it is calculated, and so on, until we get to the input layer. The input activations cannot be controlled by the network of course, but we can control the weights and biases connected to the inputs. We can figure out how the weights and biases ought to change, then we tweak them. This is the gradient descent step. We choose a value for the small step we should take, then we change the value of each weight and bias by the product of that step and the corresponding differential. The idea is analogous to stepping down a hill, but if the hill was a surface in a 13,000-dimensional space. We know that these steps are working if we watch the cost go down after each descent. After many hours of watching the cost jump around or not move at all or return a divide by zero error, I eventually was able to get my program to descent properly and actually recognize features of different digits.
Watching the network's cost function minimize is extremely satisfying and important, but it doesn't matter at all if we aren't actually categorizing digits properly. It is possible for the network to get stuck in small valleys of its cost function, where it categorizes all data the same way and gets a mildly acceptable cost, or to spit out 0 on all of its output neurons, or something like that. It's also common for the network to learn to recognize the training dataset with 100% accuracy, but when shown new data it wasn't trained on it gets totally lost. This is sort of like the difference between memorization and actual understanding of a concept in human brains. So in order to avoid this, we show the network a set of digits it wasn't trained on and test how many it correctly categorizes. This gives us a direct accuracy metric that shows us how well it has actually learned, instead of 'memorizing' the dataset. The general task list for my program is to train the network, test the network, and return the score so the user can tell how well the network has learned. The first version of my code got an accuracy of around 75-80%. After implementing some better cost functions, better activation functions, and tweaking the backpropagation a little, I was able to get around 90% max accuracy. This is not nearly as high as Nielsen's program, which achieved around 95-96% with the same methods. Frankly, I am still trying to figure out where the disparity comes from. I am relatively happy with the network so far however.
The next step for this project is to implement a convolutional neural network, which is a different architecture specialize in image processing. This requires a lot more math and complexity, but it is doable (I think.) Currently, I have been able to get the data structure up and running, but I have not implemented feedforwarding and backpropagation. I will make the code available on my GitHub once I can get this working. So far, the project has been extremely educational for me, not only on the math of machine learning but also for my general literacy in C. I would recommend that any software engineer with some spare time and hunger for learning and challenge try this themselves.