jameswitzeman.net/content/articles/mlib-c.html

<title>MACHINE LEARNING IN C</title>
<body>
    <div id="body">
        <h1>LEARNING NEURAL NETWORKING IN C</h1>
        <h6>10/19/2023</h6>
	<p>
		To keep myself entertained between classes this fall semester (2023), I decided to try to learn a
		little about machine learning mathematics. I've heard of the Python library TensorFlow, which does
		a lot of the heavy lifting for you (in c++, so likely not at the expense of Python's slow interpreter).
		I am interested in learning these types of python libraries, but first I wanted to develop a very
		strong familiarity with the basics and mathematics of machine learning. I also am trying to get
		as familiar as possible with C for my own personal enrichment. As of starting this project, I thought
		it would be nice to have a machine learning library in C for any of those who didn't want to use
		python for whatever reason. I realize now that this was foolish. My code probably doesn't have much
		utility for anyone else, but it was a very good learning experience for me.
	</p>
	<h3>DOING MY RESEARCH</h3>
	<p>
		I started my journey by scouring a
		<a href="http://neuralnetworksanddeeplearning.com/index.html">book by Michael Nielsen</a> on machine
		learning. The book is a few years old and behind the times now, but I'm not going to be setting the
		state of the art here so I'm not too concerned. I quickly learned that training a neural network really
		is just a minimization problem, and the whole thing can be elegantly expressed as pure math instead of
		as a programming problem. I started taking notes like I was in any other math class. The first big
		concept to learn was gradient descent.
	</p>
	<p>
		For those who have taken a multivariable calculus class, gradient descent is a pretty straightforward
		exercise. For those who haven't, the core idea is still not too inaccessable. We can think of
		a neural network as a huge, crazy black box with a million different parameters to tweak. We throw
		a bunch of data into the box with an idea of what we want the network to spit back out. As long as we
		can quantify how far off our network is from the ideal output, we can use this quantity as a function
		of these million parameters. The idea here is to find the multivariable derivative of this error metric,
		usually called a cost function. If we can find a method to figure out how tweaking each parameter
		changes our output, then we can use that information to tweak them in just the right way to minimize the
		cost and maximize how close our output is to the ideal. This multivariable differential is called the
		gradient. Once we have our table of which parameters to tweak and in what way, we then choose how large
		a step to take and we actually change all of our parameter values. This is very similar to the idea
		of walking down a hill until you reach the lowest valley you can find. Because the network is a
		complex mathematical entity with many moving parts, the hard part is figuring out how to even calculate
		this gradient. That's where backpropagation comes in.
	</p>
	<p>
		Backpropagation is the method by which a gradient is calculated for a neural network. The important
		thing to remember is that all backpropagation does is find the gradient of the cost function. The
		complicated part is figuring out how every single parameter should be tweaked to minimize the cost.
		The first step is to figure out how the output should change in order to minimize the cost. Usually this
		means the output will have a random set of activations that do not correlated to the input at all.
		Our cost will likely be high. We know that the cost is a function of each output neuron's activation,
		so we can take the derivative of the cost with respect to each activation value. Once we have this
		differential, we then think of our output activations as functions of the activation of the previous
		neuron, as well as the weight and bias for the current neuron. We can find this differential with
		respect to these values. We then consider the activations of the previous neuron as functions of the
		parameters of the neurons before that, and so on and so forth. This is why the process is called
		backpropagation, we start by looking at how the output ought to change and work backwards from there.
		The complicated part is the nuts and bolts of the math required to compute all this, which I will
		not cover here. I highly recommend checking out Nielsen's online book.
	</p>
	<h3>IMPLEMENTATION</h3>
	<p>
		I started my program by building structures to hold all the neuron, weight, and bias data for each
		layer. My goal was to have one structure which holds the entirety of the network for ease of reference.
		I decided to use a structure to represent a single layer in a network. This structure holds vectors
		which represent the data for each neuron, and a matrix for the weights. I used the GNU Scientific
		library for my vector maths, which may have been a little overkill but it allows me to work
		with complex math structures easily. It will also help when I expand into convolutional networks.
	</p>
	<p>
		Once I was sure I had my network data structures working, I started building code for importing data
		into the network. The example problem I used to get the network functioning was the classic digit
		recognization problem. The idea is to train a network to reconize and correctly categorize a low-
		resolution image of a handwritten digit, in black and white. In order to import an image, I
		had to open the file containing the images and scan through each line (where one line is one image)
		and import the 0-255 greyscale values into my dayta structure. Firstly, the network only works
		with values from 0-1, so I had to map the 0-255 values to 0-1. I decided to use a simple linear
		mapping between values, as opposed to some non-linear function like the sigmoid. I did experiment
		with the sigmoid function once the network was working, but the results were not as good. I suspect
		this is because the sigmoid is capable of mapping an infinite domain of values to a range of 0-1,
		meaning you could plug in an arbitrarily large number in and get a number restricted to the finite
		range out. This is very useful for calculating network activations, when the weights and biases can
		sum up to very large numbers. But if your data input is already restricted to a finite domain,
		then a simple linear map is better. Long story short, after some tweaking I had the image data properly
		converted and imported to my network's input.
	</p>
	<p>
		The next task was to get my network to properly <i>feedforward</i>, that is, to properly calculate
		the activations for every neuron in every layer up to the output. Of course, the untrained network
		will output nonsense, but the important part is that it is properly calculated. The calculus will not
		work otherwise. Once the output can be calculated from the input data, then the cost needs to be
		calculated. I started using mean squared error for my cost function, then switched to log-likelihood
		later. I know the cost is working properly if it is very high for the untrained network, usually
		in the ballpark of 1-3. I can't really know for sure that the cost is working until I see it going
		down during the training process, but I was confident at this point it was doing okay. Once this was
		all up and running, I could move on to backpropagation.
	</p>
	<p>
		The first really serious programming challenge was getting backpropagation up and running, although
		because the algorithm is so well-documented it boiled down to effectively implementing the algorithm
		in code. The first step is to find the <i>'error'</i> of the output layer, which really just means
		figuring out how the output should change to decrease the cost. Then the error for the layer before
		it is calculated, and so on, until we get to the input layer. The input activations cannot be controlled
		by the network of course, but we can control the weights and biases connected to the inputs. We
		can figure out how the weights and biases ought to change, then we tweak them. This is the gradient
		descent step. We choose a value for the small step we should take, then we change the value
		of each weight and bias by the product of that step and the corresponding differential. The idea is
		analogous to stepping down a hill, but if the hill was a surface in a 13,000-dimensional space.
		We know that these steps are working if we watch the cost go down after each descent. After many hours
		of watching the cost jump around or not move at all or return a divide by zero error, I eventually
		was able to get my program to descent properly and actually recognize features of different digits.
	</p>
	<p>
		Watching the network's cost function minimize is extremely satisfying and important, but it doesn't
		matter at all if we aren't actually categorizing digits properly. It is possible for the network
		to get stuck in small valleys of its cost function, where it categorizes all data the same way
		and gets a mildly acceptable cost, or to spit out 0 on all of its output neurons, or something like
		that. It's also common for the network to learn to recognize the training dataset with 100% accuracy,
		but when shown new data it wasn't trained on it gets totally lost. This is sort of like the difference
		between memorization and actual understanding of a concept in human brains. So in order to avoid this,
		we show the network a set of digits it wasn't trained on and test how many it correctly categorizes.
		This gives us a direct accuracy metric that shows us how well it has actually learned, instead of
		'memorizing' the dataset. The general task list for my program is to train the network, test the
		network, and return the score so the user can tell how well the network has learned. The first
		version of my code got an accuracy of around 75-80%. After implementing some better cost functions,
		better activation functions, and tweaking the backpropagation a little, I was able to get
		around 90% max accuracy. This is not nearly as high as Nielsen's program, which achieved around 95-96%
		with the same methods. Frankly, I am still trying to figure out where the disparity comes from. I am
		relatively happy with the network so far however.
	</p>
	<p>
		The next step for this project is to implement a <i>convolutional</i> neural network, which is
		a different architecture specialize in image processing. This requires a lot more math and complexity,
		but it is doable (I think.) Currently, I have been able to get the data structure up and running,
		but I have not implemented feedforwarding and backpropagation. I will make the code available
		on my GitHub once I can get this working. So far, the project has been extremely educational for me,
		not only on the math of machine learning but also for my general literacy in C. I would recommend
		that any software engineer with some spare time and hunger for learning and challenge try this
		themselves.
	</p>
    </div>
</body>