141 lines
11 KiB
HTML
Executable File
141 lines
11 KiB
HTML
Executable File
<title>MACHINE LEARNING IN C</title>
|
|
<body>
|
|
<div id="body">
|
|
<h1>LEARNING NEURAL NETWORKING IN C</h1>
|
|
<h6>10/19/2023</h6>
|
|
<p>
|
|
To keep myself entertained between classes this fall semester (2023), I decided to try to learn a
|
|
little about machine learning mathematics. I've heard of the Python library TensorFlow, which does
|
|
a lot of the heavy lifting for you (in c++, so likely not at the expense of Python's slow interpreter).
|
|
I am interested in learning these types of python libraries, but first I wanted to develop a very
|
|
strong familiarity with the basics and mathematics of machine learning. I also am trying to get
|
|
as familiar as possible with C for my own personal enrichment. As of starting this project, I thought
|
|
it would be nice to have a machine learning library in C for any of those who didn't want to use
|
|
python for whatever reason. I realize now that this was foolish. My code probably doesn't have much
|
|
utility for anyone else, but it was a very good learning experience for me.
|
|
</p>
|
|
<h3>DOING MY RESEARCH</h3>
|
|
<p>
|
|
I started my journey by scouring a
|
|
<a href="http://neuralnetworksanddeeplearning.com/index.html">book by Michael Nielsen</a> on machine
|
|
learning. The book is a few years old and behind the times now, but I'm not going to be setting the
|
|
state of the art here so I'm not too concerned. I quickly learned that training a neural network really
|
|
is just a minimization problem, and the whole thing can be elegantly expressed as pure math instead of
|
|
as a programming problem. I started taking notes like I was in any other math class. The first big
|
|
concept to learn was gradient descent.
|
|
</p>
|
|
<p>
|
|
For those who have taken a multivariable calculus class, gradient descent is a pretty straightforward
|
|
exercise. For those who haven't, the core idea is still not too inaccessable. We can think of
|
|
a neural network as a huge, crazy black box with a million different parameters to tweak. We throw
|
|
a bunch of data into the box with an idea of what we want the network to spit back out. As long as we
|
|
can quantify how far off our network is from the ideal output, we can use this quantity as a function
|
|
of these million parameters. The idea here is to find the multivariable derivative of this error metric,
|
|
usually called a cost function. If we can find a method to figure out how tweaking each parameter
|
|
changes our output, then we can use that information to tweak them in just the right way to minimize the
|
|
cost and maximize how close our output is to the ideal. This multivariable differential is called the
|
|
gradient. Once we have our table of which parameters to tweak and in what way, we then choose how large
|
|
a step to take and we actually change all of our parameter values. This is very similar to the idea
|
|
of walking down a hill until you reach the lowest valley you can find. Because the network is a
|
|
complex mathematical entity with many moving parts, the hard part is figuring out how to even calculate
|
|
this gradient. That's where backpropagation comes in.
|
|
</p>
|
|
<p>
|
|
Backpropagation is the method by which a gradient is calculated for a neural network. The important
|
|
thing to remember is that all backpropagation does is find the gradient of the cost function. The
|
|
complicated part is figuring out how every single parameter should be tweaked to minimize the cost.
|
|
The first step is to figure out how the output should change in order to minimize the cost. Usually this
|
|
means the output will have a random set of activations that do not correlated to the input at all.
|
|
Our cost will likely be high. We know that the cost is a function of each output neuron's activation,
|
|
so we can take the derivative of the cost with respect to each activation value. Once we have this
|
|
differential, we then think of our output activations as functions of the activation of the previous
|
|
neuron, as well as the weight and bias for the current neuron. We can find this differential with
|
|
respect to these values. We then consider the activations of the previous neuron as functions of the
|
|
parameters of the neurons before that, and so on and so forth. This is why the process is called
|
|
backpropagation, we start by looking at how the output ought to change and work backwards from there.
|
|
The complicated part is the nuts and bolts of the math required to compute all this, which I will
|
|
not cover here. I highly recommend checking out Nielsen's online book.
|
|
</p>
|
|
<h3>IMPLEMENTATION</h3>
|
|
<p>
|
|
I started my program by building structures to hold all the neuron, weight, and bias data for each
|
|
layer. My goal was to have one structure which holds the entirety of the network for ease of reference.
|
|
I decided to use a structure to represent a single layer in a network. This structure holds vectors
|
|
which represent the data for each neuron, and a matrix for the weights. I used the GNU Scientific
|
|
library for my vector maths, which may have been a little overkill but it allows me to work
|
|
with complex math structures easily. It will also help when I expand into convolutional networks.
|
|
</p>
|
|
<p>
|
|
Once I was sure I had my network data structures working, I started building code for importing data
|
|
into the network. The example problem I used to get the network functioning was the classic digit
|
|
recognization problem. The idea is to train a network to reconize and correctly categorize a low-
|
|
resolution image of a handwritten digit, in black and white. In order to import an image, I
|
|
had to open the file containing the images and scan through each line (where one line is one image)
|
|
and import the 0-255 greyscale values into my dayta structure. Firstly, the network only works
|
|
with values from 0-1, so I had to map the 0-255 values to 0-1. I decided to use a simple linear
|
|
mapping between values, as opposed to some non-linear function like the sigmoid. I did experiment
|
|
with the sigmoid function once the network was working, but the results were not as good. I suspect
|
|
this is because the sigmoid is capable of mapping an infinite domain of values to a range of 0-1,
|
|
meaning you could plug in an arbitrarily large number in and get a number restricted to the finite
|
|
range out. This is very useful for calculating network activations, when the weights and biases can
|
|
sum up to very large numbers. But if your data input is already restricted to a finite domain,
|
|
then a simple linear map is better. Long story short, after some tweaking I had the image data properly
|
|
converted and imported to my network's input.
|
|
</p>
|
|
<p>
|
|
The next task was to get my network to properly <i>feedforward</i>, that is, to properly calculate
|
|
the activations for every neuron in every layer up to the output. Of course, the untrained network
|
|
will output nonsense, but the important part is that it is properly calculated. The calculus will not
|
|
work otherwise. Once the output can be calculated from the input data, then the cost needs to be
|
|
calculated. I started using mean squared error for my cost function, then switched to log-likelihood
|
|
later. I know the cost is working properly if it is very high for the untrained network, usually
|
|
in the ballpark of 1-3. I can't really know for sure that the cost is working until I see it going
|
|
down during the training process, but I was confident at this point it was doing okay. Once this was
|
|
all up and running, I could move on to backpropagation.
|
|
</p>
|
|
<p>
|
|
The first really serious programming challenge was getting backpropagation up and running, although
|
|
because the algorithm is so well-documented it boiled down to effectively implementing the algorithm
|
|
in code. The first step is to find the <i>'error'</i> of the output layer, which really just means
|
|
figuring out how the output should change to decrease the cost. Then the error for the layer before
|
|
it is calculated, and so on, until we get to the input layer. The input activations cannot be controlled
|
|
by the network of course, but we can control the weights and biases connected to the inputs. We
|
|
can figure out how the weights and biases ought to change, then we tweak them. This is the gradient
|
|
descent step. We choose a value for the small step we should take, then we change the value
|
|
of each weight and bias by the product of that step and the corresponding differential. The idea is
|
|
analogous to stepping down a hill, but if the hill was a surface in a 13,000-dimensional space.
|
|
We know that these steps are working if we watch the cost go down after each descent. After many hours
|
|
of watching the cost jump around or not move at all or return a divide by zero error, I eventually
|
|
was able to get my program to descent properly and actually recognize features of different digits.
|
|
</p>
|
|
<p>
|
|
Watching the network's cost function minimize is extremely satisfying and important, but it doesn't
|
|
matter at all if we aren't actually categorizing digits properly. It is possible for the network
|
|
to get stuck in small valleys of its cost function, where it categorizes all data the same way
|
|
and gets a mildly acceptable cost, or to spit out 0 on all of its output neurons, or something like
|
|
that. It's also common for the network to learn to recognize the training dataset with 100% accuracy,
|
|
but when shown new data it wasn't trained on it gets totally lost. This is sort of like the difference
|
|
between memorization and actual understanding of a concept in human brains. So in order to avoid this,
|
|
we show the network a set of digits it wasn't trained on and test how many it correctly categorizes.
|
|
This gives us a direct accuracy metric that shows us how well it has actually learned, instead of
|
|
'memorizing' the dataset. The general task list for my program is to train the network, test the
|
|
network, and return the score so the user can tell how well the network has learned. The first
|
|
version of my code got an accuracy of around 75-80%. After implementing some better cost functions,
|
|
better activation functions, and tweaking the backpropagation a little, I was able to get
|
|
around 90% max accuracy. This is not nearly as high as Nielsen's program, which achieved around 95-96%
|
|
with the same methods. Frankly, I am still trying to figure out where the disparity comes from. I am
|
|
relatively happy with the network so far however.
|
|
</p>
|
|
<p>
|
|
The next step for this project is to implement a <i>convolutional</i> neural network, which is
|
|
a different architecture specialize in image processing. This requires a lot more math and complexity,
|
|
but it is doable (I think.) Currently, I have been able to get the data structure up and running,
|
|
but I have not implemented feedforwarding and backpropagation. I will make the code available
|
|
on my GitHub once I can get this working. So far, the project has been extremely educational for me,
|
|
not only on the math of machine learning but also for my general literacy in C. I would recommend
|
|
that any software engineer with some spare time and hunger for learning and challenge try this
|
|
themselves.
|
|
</p>
|
|
</div>
|
|
</body>
|