Training a Neural Network in Python Deep Learning
We will now learn how to train a neural network. We will also learn about the backpropagation algorithm and backward pass in Python deep learning.
For a neural network to produce the desired output, we must find good values for its weights. We do this with iterative gradient descent. We start from a random initialization of the weights. We then repeatedly use forward propagation to make predictions on some subset of the data, calculate the corresponding cost function C, and move each weight w in the direction that reduces the cost, by an amount proportional to dC/dw, the derivative of the cost function with respect to that weight. The proportionality constant is called the learning rate.
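The update loop described above can be sketched on a toy problem. This is only a minimal illustration, not a full network: it assumes a model with a single weight w, a made-up dataset in which y = 3x, a mean squared error cost, and a hypothetical learning rate of 0.1. For simplicity it uses the whole dataset at every step rather than a subset.

```python
import random

# Toy data (assumption for illustration): targets follow y = 3 * x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [3.0 * x for x in xs]

def cost(w):
    """Mean squared error C of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def dcost_dw(w):
    """Analytic derivative dC/dw of the cost above."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
w = random.uniform(-1.0, 1.0)   # random initialization
learning_rate = 0.1             # hypothetical value

for step in range(100):
    # Move against the gradient, scaled by the learning rate.
    w -= learning_rate * dcost_dw(w)

print(w, cost(w))
```

After the loop, w should sit very close to 3, the value that generated the data.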
Gradients can be computed efficiently using the backpropagation algorithm. The key observation of backpropagation, or the backward pass, is that because of the chain rule of differentiation, the gradient at each neuron can be calculated from the gradients at the neurons its outgoing edges lead to. We therefore compute gradients backwards: first for the output layer, then for the last hidden layer, then for the hidden layer before it, and so on, ending at the layer nearest the input.
Backpropagation is usually implemented on a computational graph, in which each neuron is expanded into many nodes, each performing a simple mathematical operation such as addition or multiplication. The edges of the computational graph carry no weights; instead, each weight becomes a node of its own. The backpropagation algorithm is then run on the computational graph. Once the calculation is complete, only the gradients of the weight nodes are needed for the update. The remaining gradients can be discarded.
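To make the computational graph idea concrete, here is a minimal sketch for a single sigmoid neuron with a squared error cost. The function name forward_backward and the choice of cost are assumptions made for illustration. Each line of the forward pass is one node of the graph, and the backward pass visits those nodes in reverse order.

```python
import math

def forward_backward(x, y, w, b):
    """Return the cost and its gradients with respect to w and b."""
    # Forward pass: each line is one node of the graph.
    m = w * x                         # multiplication node
    z = m + b                         # addition node
    a = 1.0 / (1.0 + math.exp(-z))    # sigmoid node
    c = (a - y) ** 2                  # cost node

    # Backward pass: reverse order, applying the chain rule at each node.
    dc_da = 2.0 * (a - y)
    dc_dz = dc_da * a * (1.0 - a)     # derivative of sigmoid is a * (1 - a)
    dc_dm = dc_dz                     # addition passes the gradient through
    dc_db = dc_dz
    dc_dw = dc_dm * x                 # derivative of w * x with respect to w is x

    # Only the weight-node gradients are returned; dc_da, dc_dz and dc_dm
    # were needed along the way and can now be discarded.
    return c, dc_dw, dc_db

print(forward_backward(x=1.5, y=1.0, w=0.4, b=-0.2))
```

A quick way to check a sketch like this is to compare dc_dw with a finite-difference estimate obtained by nudging w slightly and re-running the forward pass.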
Gradient Descent Optimization Technique
A commonly used optimization method that adjusts weights according to the error they cause is called “gradient descent.”
Gradient is another name for slope, and slope, in its usual form on an X-Y graph, describes how two variables relate to each other: rise over run, change in distance over change in time, and so on. In this case, the slope we care about is the relationship between the network’s error and a single weight; in other words, how the error changes as that weight changes.
More precisely, we want to find which weight produces the smallest error. We want to find weights that accurately represent the signal contained in the input data and translate it into a correct classification.
As a neural network learns, it slowly adjusts its many weights so that they can correctly map signals into meaning. The relationship between the network’s error and each weight is a derivative, dE/dw, which measures how much a slight change in that weight changes the error.
Each weight is just one factor in a deep network involving many transformations; the signal from a weight passes through activations and sums over several layers, so we use the chain rule from calculus to trace back through the activations and outputs of the network. This allows us to see the weight in question and its relationship to the overall error.
The two variables, error and weight, are connected through a third variable, the activation, through which the weight’s signal passes. We can therefore calculate how a change in weight affects the error by first calculating how a change in activation affects the error, and then how a change in weight affects the activation.
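Written as a formula, with E the error and w the weight as above, and with a standing for the activation (the letter a is an assumption here, since the text does not assign the activation a symbol), the chain rule reads:

```latex
\frac{dE}{dw} = \frac{dE}{da} \cdot \frac{da}{dw}
```

In a deeper network the same pattern repeats, with one extra factor for every layer the signal passes through between the weight and the error.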
The basic idea of deep learning is simply this: adjust the model’s weights based on the error it produces, until you can no longer reduce the error.
Deep nets train slowly if the gradient values are small and quickly if they are large. Any inaccuracies in training will lead to inaccurate outputs. The process of computing gradients from the output back to the input is called backpropagation, or backwards propagation. As we know, forward propagation starts at the input and works forward. Backpropagation works in reverse, calculating gradients from right to left.
Each time we calculate a gradient, all previous gradients up to that point are used.
Let’s start at a node in the output layer. The edges into that node use its gradient. As we go back through the hidden layers, things get more complicated, because each gradient is a product of the gradients that were computed before it. The product of two numbers between 0 and 1 is a smaller number, so gradient values get smaller and smaller. As a result, the layers nearest the input take longer to train, and accuracy suffers.
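The shrinking effect is easy to see numerically. The sketch below is a deliberately simplified illustration, not a real network: it assumes sigmoid activations, ignores the weights entirely, and gives every layer the largest slope a sigmoid can have (0.25, at z = 0), then multiplies one such factor per layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # never larger than 0.25

# Assumption: each of the 10 layers contributes the maximum sigmoid slope.
gradient = 1.0
gradients_by_depth = []
for layer in range(10):
    gradient *= sigmoid_derivative(0.0)
    gradients_by_depth.append(gradient)

# First entry: one layer back from the output. Last entry: ten layers back.
print(gradients_by_depth[0], gradients_by_depth[-1])
```

Even in this favorable setting the factor ten layers back is 0.25 to the tenth power, which is below one millionth.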
Challenges of Deep Learning Algorithms
Both shallow and deep neural networks present certain challenges, such as overfitting and computation time. DNNs are susceptible to overfitting because they use additional layers of abstraction to model rare dependencies in the training data.
Regularization methods such as dropout, early stopping, data augmentation, and transfer learning are applied during training to combat overfitting. Dropout regularization randomly omits units from hidden layers during training, which helps avoid rare dependencies. DNNs involve several training parameters, such as size (the number of layers and units per layer), learning rate, and initial weights. Searching for the optimal parameters is not always practical because of the cost in time and computational resources. Various tricks, such as batching (computing the gradient on several training examples at once), can speed up computation. The large processing power of GPUs also greatly aids training, because the required matrix and vector calculations are well suited to GPUs.
Dropout
Dropout is a popular regularization technique for neural networks. Deep neural networks are particularly prone to overfitting.
Now let’s look at what dropout is and how it works.
In the words of Geoffrey Hinton, one of the pioneers of deep learning, “If you have a deep neural network and it’s not overfitting, you should probably use a larger network and dropout.”
Dropout is a technique where, during each iteration of gradient descent, we remove a randomly selected set of nodes. This means we randomly ignore some nodes, as if they didn’t exist.
Each neuron has a probability q of being retained and a probability 1-q of being dropped. The value of q can be different for each layer in the neural network. A value of q = 0.5 for hidden layers, with all input units retained (q = 1, that is, a drop probability of 0), works well for a wide range of tasks.
During evaluation and prediction, dropout is not used. Instead, the output of each neuron is multiplied by q so that the input to the next layer has the same expected value as it had during training.
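The idea behind dropout is this: in a neural network without dropout regularization, neurons develop co-dependencies on one another, which can lead to overfitting. Randomly dropping neurons makes it harder for any one neuron to rely on the presence of specific others.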
Implementation Tips
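In libraries like TensorFlow and PyTorch, dropout is implemented by setting the outputs of randomly selected neurons to zero. In other words, the neurons still exist, but their outputs are overwritten with 0. Both libraries also scale up the surviving outputs during training, so no rescaling is needed at evaluation time.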
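The scheme described in this section (mask units during training, scale by q at evaluation) can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it; the function names dropout_train and dropout_eval and the made-up layer output are assumptions.

```python
import random

def dropout_train(activations, q, rng):
    """Training: keep each activation with probability q, otherwise output 0."""
    return [a if rng.random() < q else 0.0 for a in activations]

def dropout_eval(activations, q):
    """Evaluation: keep every unit, scaled by q to match the training-time expectation."""
    return [a * q for a in activations]

rng = random.Random(0)      # seeded so the sketch is repeatable
hidden = [1.0] * 10000      # made-up output of a hidden layer
q = 0.5                     # retention probability for a hidden layer

dropped = dropout_train(hidden, q, rng)
train_mean = sum(dropped) / len(dropped)
eval_mean = sum(dropout_eval(hidden, q)) / len(hidden)
print(train_mean, eval_mean)
```

The two printed means should be close to each other, which is the point of the scaling: the next layer sees roughly the same average input in both modes.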
Early Stopping
Neural networks are trained with gradient descent, an iterative algorithm, so we must decide when to stop iterating.
The idea behind early stopping is straightforward: we stop training when the error starts to increase. The error here is the one measured on validation data, a portion of the training data that is held out and used to tune hyperparameters. In this case, the hyperparameter is the stopping criterion.
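A minimal sketch of this stopping rule follows. It assumes the validation error of each epoch is already available as a list, and it adds a hypothetical patience parameter (not mentioned above) so that a single noisy epoch does not end training immediately.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return (best_epoch, best_error), stopping once the validation error
    has failed to improve for `patience` consecutive epochs."""
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        # In real training, one epoch of gradient descent would run here and
        # `error` would then be measured on the validation set.
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation error keeps rising: stop training
    return best_epoch, best_error

# Made-up validation curve: improves, then starts to rise.
curve = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(curve))
```

In practice one would also save the weights from the best epoch and restore them after stopping.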
Data Augmentation
In this process, we increase the amount of data we have, or augment it, by using existing data and applying some transformations to it. The specific transformations used depend on the task we’re trying to accomplish. Furthermore, the transformations that aid the neural network also depend on its architecture.
For example, in many computer vision tasks, such as object classification, an effective data augmentation technique is to add new data points that are cropped or translated versions of the original data.
When a computer receives an image as input, it receives an array of pixel values. Suppose the entire image is shifted left by 15 pixels: the array changes, but the object it shows does not. Applying many different shifts in different directions yields an augmented dataset that is many times the size of the original dataset.
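Shifting can be sketched on a tiny image stored as a list of rows. The helper names shift_left and shift_right and the choice to pad the vacated edge with zeros are assumptions for illustration; real pipelines usually rely on an image library instead.

```python
def shift_left(image, pixels, fill=0):
    """Shift every row left by `pixels`, padding the right edge with `fill`."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [row[pixels:] + [fill] * pixels for row in image]

def shift_right(image, pixels, fill=0):
    """Shift every row right by `pixels`, padding the left edge with `fill`."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[:width - pixels] for row in image]

# Made-up 2 x 4 grayscale image.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]

# Original plus four shifted copies, all sharing the original label.
augmented = [image] + [shift(image, p)
                       for shift in (shift_left, shift_right)
                       for p in (1, 2)]
print(len(augmented))
```

Each shifted copy keeps the label of the original image, so one labeled example becomes five.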
Transfer Learning
The process of taking a pretrained model and “fine-tuning” it with our own dataset is called transfer learning. There are several ways to do this. Here is one common approach.
- We start with a model that has been pretrained on a large dataset. We remove the last layer of the network and replace it with a new layer with random weights.
- We then freeze the weights of all other layers and train the network normally. Freezing a layer here means not changing its weights during gradient descent or optimization.
The concept behind this is that the pretrained model will act as a feature extractor, and only the last layer will be trained on the current task.
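The freezing idea can be sketched without any deep learning library. Everything in this example is made up for illustration: frozen_weights stands in for a pretrained network body, head is the new last layer, and the three data points are an invented task. Only head is updated during training.

```python
import random

# Stand-in for the pretrained layers (assumption: learned elsewhere).
# They are frozen, so training never modifies them.
frozen_weights = [[0.5, -0.2], [0.1, 0.8]]

def extract_features(x):
    """Frozen layers: a fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in frozen_weights]

def predict(head, x):
    return sum(h * f for h, f in zip(head, extract_features(x)))

# New last layer with random initial weights; only these are trained.
rng = random.Random(0)
head = [rng.uniform(-0.1, 0.1) for _ in range(2)]

# Made-up dataset for the new task.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]

def loss(head):
    return sum((predict(head, x) - y) ** 2 for x, y in data) / len(data)

initial_loss = loss(head)
frozen_before = [row[:] for row in frozen_weights]
learning_rate = 0.1   # hypothetical value

for _ in range(500):
    for x, y in data:
        features = extract_features(x)
        error = predict(head, x) - y
        # Gradient step on the head only; frozen_weights is left untouched.
        head = [h - learning_rate * 2 * error * f
                for h, f in zip(head, features)]

final_loss = loss(head)
print(initial_loss, final_loss)
```

The loss drops even though the frozen layers never change, which is what it means for the pretrained part to act as a fixed feature extractor.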