Neural Network from Scratch in 100 lines of Python code

Papers in 100 Lines of Code
4 min read · Feb 23, 2023
Neural Network from Scratch: Tutorial

While Neural Networks can now be created, trained, and tested in just three lines of code, it is crucial to understand their inner workings. Although most people have a basic understanding of the forward pass, gradients, and the backward pass, only a few are familiar with essential techniques such as the Jacobian-vector product or the log-sum-exp trick, which prevent neural networks from running into memory overflow and numerical instability.

In this tutorial, we will guide you through implementing a Neural Network from scratch using Python in just 100 lines of code. Our focus is on implementation, and we will not delve into the technicalities behind each equation. However, if you are interested in gaining a deeper understanding of neural networks, I have a five-hour course on Udemy that covers this topic in detail.

Multilayer perceptron (MLP)

Neural Network from Scratch: MLP

Implementing the MLP is a straightforward process involving two sets of learnable parameters: the weights W and the bias b. To ensure proper initialization, the Xavier initialization method is commonly used. The forward pass is a direct implementation of the MLP equation. For the backward pass, we need to compute the gradient of the loss with respect to the weights and bias, which enables us to update them with gradient descent. We can achieve this by multiplying the Jacobian of the MLP transformation with the gradient of the loss with respect to the MLP’s outputs. Instead of computing and storing the Jacobian, which would require a significant amount of memory, we can compute the result directly using the Jacobian-vector product. This technique is essential when implementing neural networks; a naive implementation without it would not scale because memory usage would explode. The Jacobian is typically sparse, so the Jacobian-vector product can be rewritten analytically and computed directly.

After calculating the gradient of the loss with respect to the weights and bias, we compute the gradient of the loss with respect to the inputs. This is what the previous module receives in order to compute the gradient of the loss with respect to its own parameters, following the chain rule of derivatives.
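
Here is a minimal sketch of such a layer in NumPy. The class and method names (Linear, forward, backward) are my own choices for this illustration: the forward pass caches the inputs, and the backward pass computes the Jacobian-vector products analytically instead of materializing any Jacobian.

```python
import numpy as np

class Linear:
    """Fully connected layer: y = x @ W + b."""
    def __init__(self, in_features, out_features):
        # Xavier (Glorot) uniform initialization keeps activation variance stable
        bound = np.sqrt(6.0 / (in_features + out_features))
        self.W = np.random.uniform(-bound, bound, (in_features, out_features))
        self.b = np.zeros(out_features)
        self.grad_W = np.zeros_like(self.W)
        self.grad_b = np.zeros_like(self.b)

    def forward(self, x):
        self.x = x                           # cache the inputs for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_output):
        # Jacobian-vector products: the full Jacobians are never stored
        self.grad_W = self.x.T @ grad_output       # dL/dW
        self.grad_b = grad_output.sum(axis=0)      # dL/db
        return grad_output @ self.W.T              # dL/dx, passed to the previous module
```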

Sequential Neural Network

Neural Network from Scratch: Sequential NN

Neural networks consist of various components, including MLPs, activation functions, and more complex layers. To combine these blocks, we can create a sequential Neural Network class that takes a list of modules as input and wraps them so they can be treated as a single block.

During the forward pass, the inputs are passed through each layer to obtain the outputs of the sequential NN. Similarly, during the backward pass, we start from the gradient of the loss and propagate it backwards through the modules, from the last layer to the first.
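
A possible sketch of this wrapper, assuming modules that expose the same forward/backward interface as the Linear layer above:

```python
class Sequential:
    """Chains modules so they behave as a single block."""
    def __init__(self, modules):
        self.modules = modules

    def forward(self, x):
        # Push the inputs through each layer in order
        for module in self.modules:
            x = module.forward(x)
        return x

    def backward(self, grad_output):
        # Propagate the gradient backwards, from the last layer to the first
        for module in reversed(self.modules):
            grad_output = module.backward(grad_output)
        return grad_output
```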

Rectified Linear Units (ReLU)

Neural Network from Scratch: ReLU

Non-linear activations are essential for neural networks: they introduce the non-linearity without which a stack of linear layers collapses into a single linear map, and they are what allow neural networks to act as universal function approximators.

ReLU is a popular activation function that is easy to implement. It involves setting all negative values to 0 during the forward pass. For the backward pass, we can analytically compute the Jacobian-vector product and implement it directly.
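
A sketch of ReLU under the same module interface; because its Jacobian is diagonal, the Jacobian-vector product reduces to an element-wise mask:

```python
class ReLU:
    """Element-wise max(x, 0)."""
    def forward(self, x):
        self.mask = x > 0              # remember where the inputs were positive
        return x * self.mask

    def backward(self, grad_output):
        # Analytic Jacobian-vector product: gradients only flow where x > 0
        return grad_output * self.mask
```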

LogSoftmax

Neural Network from Scratch: LogSoftmax

When performing classification, the softmax or log softmax activation functions are crucial (you can check my course if you are interested in regression). They convert the neural network’s unconstrained logit outputs into per-class probabilities that are greater than zero and sum to one.

While it’s possible to implement the log softmax function by applying the softmax equation and then taking its logarithm, this approach is numerically unstable: the exponentials can overflow, and the logarithm of vanishingly small probabilities blows up. The best way to implement it is by rewriting the equation using the log-sum-exp trick. Once the Jacobian-vector product has been computed analytically, the backward pass can be implemented directly.
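
A sketch of a numerically stable LogSoftmax module; subtracting the per-row maximum before exponentiating is the log-sum-exp trick, and the backward pass is the analytic Jacobian-vector product:

```python
import numpy as np

class LogSoftmax:
    """log softmax(x) = x - logsumexp(x), computed stably."""
    def forward(self, x):
        # log-sum-exp trick: subtract the row maximum so the exponentials cannot overflow
        x_max = x.max(axis=1, keepdims=True)
        self.log_probs = x - x_max - np.log(np.exp(x - x_max).sum(axis=1, keepdims=True))
        return self.log_probs

    def backward(self, grad_output):
        # Analytic JVP: dL/dx = dL/dy - softmax(x) * sum_j dL/dy_j
        softmax = np.exp(self.log_probs)
        return grad_output - softmax * grad_output.sum(axis=1, keepdims=True)
```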

Negative Log Likelihood Loss (NLLLoss)

Neural Network from Scratch: NLLLoss

The loss function is the final component in the sequential pipeline, and unlike the other modules, it takes two values as input during its forward pass: the predictions and the target.

In classification tasks, the negative log likelihood loss is commonly used. Minimizing it maximizes the total log likelihood of the data, which amounts to summing the log likelihoods predicted by the model for the target classes and negating the result.

In the backward direction, the loss is the first layer, and so it does not take a gradient as input. Instead, it returns the gradient of the loss with respect to the prediction, which is then fed into the previous layers.
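
A sketch of the loss module, assuming it receives the log probabilities produced by LogSoftmax together with integer class targets:

```python
import numpy as np

class NLLLoss:
    """Negative log likelihood, averaged over the batch."""
    def forward(self, log_probs, targets):
        self.log_probs, self.targets = log_probs, targets
        n = log_probs.shape[0]
        # Pick the log probability of the target class for every sample
        return -log_probs[np.arange(n), targets].mean()

    def backward(self):
        # First module of the backward pass: no incoming gradient
        n = self.log_probs.shape[0]
        grad = np.zeros_like(self.log_probs)
        grad[np.arange(n), self.targets] = -1.0 / n
        return grad
```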

Optimizer

Neural Network from Scratch: Optimizer

The optimizer receives a sequential neural network as input. Whenever the step function is called, it performs a plain gradient descent step to update the neural network’s weights. Alternatively, more advanced optimization methods such as Adam could be used (follow me for my forthcoming post about its implementation).
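
A minimal gradient descent optimizer sketch; it assumes parametric modules expose W, b and the gradients accumulated during the backward pass, as in the Linear sketch above:

```python
class SGD:
    """Plain gradient descent over every parametric module of the network."""
    def __init__(self, network, lr=0.1):
        self.network, self.lr = network, lr

    def step(self):
        for module in self.network.modules:
            if hasattr(module, "W"):          # only layers with parameters are updated
                module.W -= self.lr * module.grad_W
                module.b -= self.lr * module.grad_b
```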

Train

Neural Network from Scratch: Training loop

Undoubtedly, the training loop is the most straightforward section. It requires no mathematical knowledge and simply entails invoking the previous modules. Since we are doing supervised learning, during each epoch we randomly sample a mini-batch of inputs and targets, use the model to make predictions, and compute the loss between the predictions and the targets. Then, we backpropagate the loss and update the weights of the NN with the optimizer.
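
A sketch of such a loop, reusing the modules defined above; the epoch count, batch size, and logging interval are arbitrary choices for illustration:

```python
import numpy as np

def train(model, criterion, optimizer, X, y, epochs=1000, batch_size=128):
    for epoch in range(epochs):
        # Sample a random mini-batch of inputs and targets
        idx = np.random.randint(0, X.shape[0], batch_size)
        inputs, targets = X[idx], y[idx]

        log_probs = model.forward(inputs)              # forward pass
        loss = criterion.forward(log_probs, targets)   # compute the loss

        model.backward(criterion.backward())           # backpropagate
        optimizer.step()                               # gradient descent update

        if epoch % 100 == 0:
            print(f"epoch {epoch}: loss {loss:.4f}")
```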

Putting it all together

Neural Network from Scratch: Main loop

Having implemented all the necessary components, we can now put them together. After loading and preprocessing the data, we initialize a neural network and an optimizer, and launch the training process. Merely 20 seconds of training on a CPU is enough to reach a test accuracy as high as 98%.
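
Putting the sketches together might look as follows. The data here is a random stand-in so the snippet runs on its own; to reproduce the reported accuracy you would load and preprocess a real dataset (and evaluate on a held-out test set) instead:

```python
import numpy as np

# Stand-in data: replace with your own loading/preprocessing
# (e.g. flattened, normalized images) to reproduce the reported result.
X_train = np.random.randn(1024, 784)
y_train = np.random.randint(0, 10, 1024)

model = Sequential([
    Linear(784, 128),
    ReLU(),
    Linear(128, 10),
    LogSoftmax(),
])
criterion = NLLLoss()
optimizer = SGD(model, lr=0.1)

train(model, criterion, optimizer, X_train, y_train, epochs=500)

# Accuracy on the same stand-in data (use a proper test set in practice)
predictions = model.forward(X_train).argmax(axis=1)
print("accuracy:", (predictions == y_train).mean())
```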

I hope this story was helpful to you. If it was, consider clapping for this story, and do not forget to follow for more tutorials related to Machine Learning.

[Full Code] [Udemy Course]
