Training Convolutional Networks with Weight–wise Adaptive Learning Rates
Abstract
Current state–of–the–art Deep Learning classification with Convolutional Neural Networks achieves very impressive results, which are, in some cases, close to human-level performance. However, training these models to their optimal performance requires very long training periods, usually with the Stochastic Gradient Descent method. We show that by applying more modern methods, which adapt a separate learning rate for each weight rather than using a single, global learning rate for the entire network, we are able to reach close to state–of–the–art performance on the same architectures while improving both training time and accuracy.
1 Introduction
In the field of supervised classification with Deep Learning methods, specifically Convolutional Networks [1], it is typical to train a model on a large dataset for a very large number of epochs, which equates to long training periods [2]. One of the common patterns of training these models is the use of the Stochastic Gradient Descent (SGD) method with Momentum. In the field of traditional ANNs, improved gradient-based update rules have been developed, which greatly improve training speed and classification performance [3, 4, 5, 6, 7]. Some of these rules have already been used in conjunction with Convolutional Neural Networks [8], and new ones have been developed specifically with Deep Learning in mind [9, 10]. We conducted a study to show whether it is possible to train Convolutional Neural Networks efficiently in relatively short times, and concluded that there are still some improvements that can be made, in the form of weight-wise learning rates. We formulated a new gradient-based update rule, called Weight–wise Adaptive learning rates with Moving average Estimator (WAME), which draws on some of the strengths of existing algorithms and adds a so–called per-weight acceleration factor. We then conducted an experimental comparison between WAME and other update rules to show that it reaches better results in less training time on some well–known benchmark datasets in computer vision.
2 Background
In this section, we provide a brief overview of previous work relevant to this paper.
The Resilient propagation method– Rprop
Rprop [3] is a weight update algorithm which does not depend on the magnitude of the partial derivative of the error with respect to the weights, ∂E(t)/∂wij. Instead, it makes use of the sign of the product of the current and previous gradients, (∂E(t)/∂wij) · (∂E(t−1)/∂wij), to determine whether an adaptive step size ∆ij should be increased or decreased multiplicatively. Some variants based on this basic principle have been proposed in the literature [4, 11]. With small adaptations [8], this approach has been shown to be usable with modern Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) equipped with Dropout [12].
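The sign-based step size adaptation described above can be sketched for a single weight as follows. This is a minimal illustration of the basic principle, not the exact variant of [3], [4] or [11]: the function name, the factors eta_plus = 1.2 and eta_minus = 0.5, and the bounds delta_min and delta_max are assumptions chosen for the example, and no weight backtracking is performed.

```python
def rprop_step(w, grad, prev_grad, delta,
               eta_plus=1.2, eta_minus=0.5,
               delta_min=1e-6, delta_max=50.0):
    """One sign-based update for a single weight.

    All default values are illustrative assumptions.
    Returns the new weight and the new step size.
    """
    product = grad * prev_grad
    if product > 0:
        # Same sign as the previous gradient: grow the step size.
        delta = min(delta * eta_plus, delta_max)
    elif product < 0:
        # Sign flipped, so a minimum was overshot: shrink the step size.
        delta = max(delta * eta_minus, delta_min)
    # Only the sign of the gradient is used, never its magnitude.
    if grad > 0:
        w -= delta
    elif grad < 0:
        w += delta
    return w, delta
```

Calling rprop_step(1.0, 0.5, 0.2, 0.1) and rprop_step(1.0, 1000.0, 0.2, 0.1) moves the weight by the same amount, which illustrates the independence from the gradient magnitude.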
In Deep Learning, being able to train in mini–batches is often cited as an important feature of the learning algorithm, because doing so provides speed and accuracy improvements. However, this is not possible with Rprop. In SGD the gradient values follow a stochastic process, and in order to maintain convergence an update rule has to be consistent across multiple mini–batches: e.g. if the sum of several successive partial derivatives for a weight is zero, then the sum of the updates for that weight must also be zero. Because Rprop uses only the sign of each gradient, it does not satisfy this requirement.
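A small worked example may make the consistency requirement concrete. The gradient values below are hypothetical, and the step size is held fixed (ignoring its adaptation) to keep the arithmetic simple: nine mini–batches give a gradient of 0.1 and one gives −0.9, so the gradients sum to zero.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

# Hypothetical gradients of one weight over ten successive mini-batches.
# They sum to zero, so the weight should end up where it started.
grads = [0.1] * 9 + [-0.9]

step = 0.01  # fixed step size, an assumption made for simplicity

# Magnitude-based updates (plain SGD): the updates cancel out.
sgd_total = sum(-step * g for g in grads)

# Sign-based updates: nine steps down and only one step up.
sign_total = sum(-step * sign(g) for g in grads)
```

Here sgd_total is zero up to floating-point rounding, while sign_total is eight net steps in the negative direction, so the sign-based rule drifts even though the gradients average out.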
The Root mean square propagation method– RMSprop
The RMSprop method, introduced by Hinton et al. during lectures delivered on Coursera [9], has not been studied extensively in the literature, as far as we are aware. The development of the method was motivated by the need to train using mini–batches and although it was inspired by Rprop, it does not use the sign of the derivatives as Rprop does. RMSprop attempts to alleviate the dependence on the size of the partial derivatives of the error with respect to the weight that causes phenomena like the vanishing gradients problem or getting stuck in saddle points when training with SGD.
This is achieved by dividing each gradient value by a divisor θij(t); the result is then multiplied by the learning rate λ(t) at time t. The divisor θij(t) is updated as an exponentially decaying mean of the square of the gradient, with decay rate α, as in Eq. 1, where α = 0.9 is suggested.
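As a rough sketch of this update for a single weight, the code below follows the commonly used formulation of RMSprop, in which the gradient is divided by the square root of the running mean of its square. The function name, the learning rate default and the small constant eps (added to avoid division by zero) are assumptions made for the example.

```python
import math

def rmsprop_step(w, grad, theta, lr=0.001, alpha=0.9, eps=1e-8):
    """One RMSprop-style update for a single weight.

    theta is the running mean of the squared gradient.
    lr and eps are illustrative assumptions; alpha = 0.9 is the
    suggested decay rate.
    Returns the new weight and the new running mean.
    """
    # Exponentially decaying mean of the squared gradient.
    theta = alpha * theta + (1.0 - alpha) * grad * grad
    # Rescale the gradient by the root of the running mean.
    w = w - lr * grad / (math.sqrt(theta) + eps)
    return w, theta
```

Starting from theta = 0, a gradient of 2.0 and a gradient of 200.0 produce almost the same step, which shows how the rescaling reduces the dependence on the gradient magnitude.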
Adaptive Moment Estimation– Adam
Adam [10] is closely related to RMSprop, especially when RMSprop is used with momentum. The main difference between the two methods is that, while the momentum component of RMSprop is calculated on the rescaled gradient ∆wij(t − 1), Adam utilises running means of the first and second moments of the gradient. Adam also includes an initial bias correction factor, because the running means are initialised at 0.
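The role of the two running means and of the bias correction can be sketched for a single weight as follows. The function name is illustrative, and the defaults (learning rate 0.001, decay rates 0.9 and 0.999, eps = 1e-8) are the values commonly suggested for Adam rather than settings taken from the experiments reported here.

```python
import math

def adam_step(w, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for a single weight at step t (t >= 1).

    m and v are the running means of the gradient and of its square,
    both initialised at 0. Returns the new weight, m and v.
    """
    # Running means of the first and second moments of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction: compensates for the zero initialisation,
    # which would otherwise pull early estimates towards 0.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

At the first step (t = 1) the corrected estimates equal the gradient and its square, so the weight moves by approximately lr in the direction opposite to the gradient.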
3 Weight–wise Adaptive Learning Rates with Moving Average Estimator– WAME
In this section we propose a new algorithm for mini–batch and online training which employs a per–weight acceleration factor ζij, used multiplicatively in the weight update. This leads to adaptively tuning the learning rate of each weight, effectively producing weight–wise adaptive learning rates. The acceleration factor is evolved following the same criterion as ∆ij in Rprop: the sign of the product of the current and previous gradients. We clip these values to the range [ζmin, ζmax] to avoid runaway effects. If used directly, this factor would have the same non–linear effect as ∆ij in Rprop, so in order to use it in mini–batch and online training, we have to apply smoothing that will guarantee an asymptotic agreement between different batch sizes. To achieve this, we divide by an exponentially–weighted moving average (EWMA) of ζij, with exponential decay α. The full update rule is presented in Algorithm 1. Empirically, we have established that good values for the hyperparameters are α = 0.9, η+ = 1.2, η− = 0.1, ζmin = 0.01 and ζmax = 100; these are used in all experiments reported in Section 4.
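The evolution of the acceleration factor and of its moving average can be sketched for a single weight as follows. This covers only the part of the rule described in the paragraph above and is not a reproduction of Algorithm 1: the final weight update is left out, the function and variable names are illustrative, and leaving ζij unchanged when the product of gradients is zero is an assumption.

```python
def wame_acceleration(zeta, z_avg, grad, prev_grad,
                      alpha=0.9, eta_plus=1.2, eta_minus=0.1,
                      zeta_min=0.01, zeta_max=100.0):
    """Update the per-weight acceleration factor and its EWMA.

    zeta is the acceleration factor and z_avg its exponentially
    weighted moving average. Defaults are the hyperparameter values
    quoted in the text. Returns the new (zeta, z_avg).
    """
    product = grad * prev_grad
    if product > 0:
        # Gradients agree in sign: accelerate, clipped at zeta_max.
        zeta = min(zeta * eta_plus, zeta_max)
    elif product < 0:
        # Gradients disagree in sign: decelerate, clipped at zeta_min.
        zeta = max(zeta * eta_minus, zeta_min)
    # Assumption: zeta is left unchanged when the product is zero.
    # Smooth the factor with an EWMA of decay alpha.
    z_avg = alpha * z_avg + (1.0 - alpha) * zeta
    return zeta, z_avg
```

For example, starting from zeta = 1.0 and z_avg = 0.0 (assumed initial values), two gradients of the same sign give zeta = 1.2, while a sign change gives zeta = 0.1; repeated agreement cannot push zeta above ζmax.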
4 Experimental Results
We tested WAME on three commonly used benchmark datasets in Deep Learning: MNIST, CIFAR–10 and CIFAR–100. Details about these datasets are given below and the architectures used are presented in Table 1. We followed common practice [13, 14]: MNIST uses training, validation and test sets, while CIFAR–10 and CIFAR–100 have no validation sets, with the test set used for validation purposes. No augmentations were applied to the training data. We compared WAME to SGD with momentum, RMSprop and Adam, using commonly used mini–batch sizes. We did not include Rprop in our evaluation because it is already known not to work well on mini–batches [9]. We used the same architecture for each algorithm. We performed our comparison by generating a random set of initial weights and then using them as the starting point for each algorithm, so as to avoid any possible distortion of the results by the random initialisation process. We repeated each experiment 20 times, each time with a different random seed, and took the mean of each value. All experiments were run on the Toupee Deep Learning experimentation library, which is based on Keras. It is available at http://github.com/nitbix/toupee.
(a) MNIST
64 conv, 5 × 5
64 conv, 1 × 1
2 × 2 max-pooling
128 conv, 5 × 5
128 conv, 1 × 1
2 × 2 max-pooling
Dense, 1024 nodes
50% dropout
(b) CIFAR–10, CIFAR–100
2 × 96 conv, 3 × 3
96 conv, 3 × 3, 2 × 2 strides
2 × 192 conv, 3 × 3
192 conv, 3 × 3, 2 × 2 strides
192 conv, 3 × 3
192 conv, 1 × 1
10 conv, 1 × 1
global average pooling
Table 1: Network structures used in the experiments.
Results are reported in Table 2, showing the number of epochs required to train the CNNs to reach the best validation accuracy, and the test accuracy recorded at that epoch. Figure 1 shows how WAME is more effective during training, by having an average loss that is consistently below that of Adam and RMSprop. A Friedman aligned ranks test shows that the accuracy improvement over Adam and RMSprop is not significant (p–value = 0.0974), while the speed improvement is significant at the 5% level both w.r.t. epoch counts and, if we exclude SGD (which has lower accuracy), real time.
MNIST [13] is a dataset for labelling pre–processed images of hand–written digits. We used the network in Table 1a, trained for 100 epochs. The difference in test accuracy between Adam, RMSprop and WAME does not appear to be significant, but the improvement in speed is noticeable. MNIST is a dataset that is easy to overfit, and the accuracy levels obtained are close to the state–of–the–art without data augmentation, so the absence of a significant decay in learning ability is a positive result.