Training Convolutional Networks with Weight–wise Adaptive Learning Rates
Abstract
Current state–of–the–art Deep Learning classification with Convolutional Neural Networks achieves very impressive results, which are, in some cases, close to human-level performance. However, training these methods to their optimal performance requires very long training periods, usually with the Stochastic Gradient Descent method. We show that by applying more modern methods, which adapt a different learning rate for each weight rather than using a single, global learning rate for the entire network, we are able to reach close to state–of–the–art performance on the same architectures, while improving training time and accuracy.
1 Introduction
In the field of supervised classification with Deep Learning methods, specifically Convolutional Networks [1], it is typical to train a model on a large dataset for a large number of epochs, which equates to long training periods [2]. One of the common patterns for training these models is the use of the Stochastic Gradient Descent (SGD) method with Momentum. In the field of traditional ANNs, improved gradient-based update rules have been developed which greatly improve training speed and classification performance [3, 4, 5, 6, 7]. Some of these rules have already been used in conjunction with Convolutional Neural Networks [8], and new ones have been developed specifically with Deep Learning in mind [9, 10]. We conducted a study into whether it is possible to train Convolutional Neural Networks efficiently in relatively short times, and concluded that there are still improvements to be made, in the form of weight-wise learning rates. We formulated a new gradient-based update rule, called Weight–wise Adaptive learning rates with Moving average Estimator – WAME, which draws on some of the strengths of existing algorithms and adds a so–called per-weight acceleration factor. We then conducted an experimental comparison between WAME and other update rules to show that it reaches better results in less training time on some well–known benchmark datasets in computer vision.
2 Background
In this section, we provide a brief overview of previous work relevant to this paper.
The Resilient propagation method – Rprop
Rprop [3] is a weight update algorithm which does not depend on the magnitude of the partial derivative of the error with respect to the weights, ∂E(t)/∂wij. Instead, it uses the sign of the product of the current and previous gradients, ∂E(t)/∂wij · ∂E(t−1)/∂wij, to determine whether an adaptive step size ∆ij should be increased or decreased multiplicatively. Some variants based on this basic principle have been proposed in the literature [4, 11]. With small adaptations [8], this approach has been shown to be usable with modern Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) equipped with Dropout [12].
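The sign-based adaptation of ∆ij can be sketched as follows. This is a simplified NumPy illustration (it omits the backtracking step of some Rprop variants), using the commonly cited defaults η+ = 1.2, η− = 0.5; the step-size bounds are assumptions for the example.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One simplified Rprop update per weight.

    The adaptive step size grows by eta_plus while the gradient keeps
    its sign, shrinks by eta_minus when the sign flips, and the weight
    moves by -sign(grad) * step, independently of the gradient magnitude.
    """
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    w = w - np.sign(grad) * step
    return w, step

# Minimising E(w) = w^2 / 2, whose gradient is w itself:
w = np.array([5.0])
prev_grad = np.array([0.0])
step = np.array([0.1])
for _ in range(100):
    grad = w.copy()
    w, step = rprop_step(w, grad, prev_grad, step)
    prev_grad = grad
```

Note how only the sign of the gradient enters the weight move; the magnitude is absorbed entirely into the evolution of the step size.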
In Deep Learning, being able to train in mini–batches is often cited as an important feature of the learning algorithm, because doing so provides speed and accuracy improvements. However, this is not possible with Rprop: in SGD the gradient values follow a stochastic process, and in order to maintain convergence an update rule has to remain consistent across multiple mini–batches, e.g. if the sum of several successive partial derivatives for a weight is zero, then the sum of the updates for that weight must also be zero.
The Root mean square propagation method – RMSprop
The RMSprop method, introduced by Hinton et al. in lectures delivered on Coursera [9], has not been studied extensively in the literature, as far as we are aware. Its development was motivated by the need to train using mini–batches, and although it was inspired by Rprop, it does not use the sign of the derivatives as Rprop does. RMSprop attempts to remove the dependence on the magnitude of the partial derivatives of the error with respect to the weights, which causes phenomena such as the vanishing gradients problem or getting stuck in saddle points when training with SGD.
This is achieved with the introduction of a divisor θij(t) for each gradient value, which is then combined with the learning rate λ(t) at time t. The θij(t) is updated as an exponentially decaying mean of the square of the gradient, with decay rate α, as in Eq. 1, where α = 0.9 is suggested:

θij(t) = α · θij(t − 1) + (1 − α) · (∂E(t)/∂wij)².    (1)
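A minimal NumPy sketch of this rule, assuming the standard formulation in which the gradient is divided by the square root of the decaying mean of its square; the small constant eps for numerical stability is our addition, not part of the description above.

```python
import numpy as np

def rmsprop_step(w, grad, theta, lr=0.001, alpha=0.9, eps=1e-8):
    """One RMSprop update per weight.

    theta is an exponentially decaying mean of the squared gradient
    (Eq. 1 with decay rate alpha); dividing the gradient by sqrt(theta)
    removes the dependence on the gradient's magnitude.
    """
    theta = alpha * theta + (1 - alpha) * grad ** 2
    w = w - lr * grad / (np.sqrt(theta) + eps)
    return w, theta

# Minimising E(w) = w^2 / 2 from w = 1:
w = np.array([1.0])
theta = np.array([0.0])
for _ in range(200):
    grad = w.copy()
    w, theta = rmsprop_step(w, grad, theta, lr=0.01)
```

Because the step is roughly lr · sign(grad) once theta tracks the squared gradient, RMSprop behaves similarly to a sign-based method while remaining usable with mini-batches.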
Adaptive Moment Estimation – Adam
Adam [10] is closely related to RMSprop, especially when RMSprop is used with momentum. The main difference between the two methods is that while the momentum component of RMSprop is calculated on the rescaled gradient
∆wij (t − 1), Adam utilises running means of the first and second moments of the gradient. Adam also includes an initial bias correction factor, because the running means are initialised at 0.
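The running means of the first and second moments and the bias correction can be sketched as follows; this is a minimal NumPy illustration of the rule in [10], with its suggested default hyperparameters.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update per weight (t is the step counter, starting at 1).

    m and v are running means of the first and second moments of the
    gradient; because both are initialised at 0, they are rescaled by
    the bias-correction factors (1 - beta^t).
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# A single step from w = 1 with gradient 1: after bias correction
# both m_hat and v_hat equal 1, so the move is approximately lr.
w = np.array([1.0])
m = np.array([0.0])
v = np.array([0.0])
w, m, v = adam_step(w, np.array([1.0]), m, v, t=1)
```

The first step illustrates why the bias correction matters: without it, m = 0.1 and v = 0.001 would produce a badly mis-scaled initial update.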
3 Weight–wise Adaptive learning rates with Moving average Estimator – WAME
In this section we propose a new algorithm for mini–batch and online training which employs a per–weight acceleration factor ζij, used multiplicatively in the weight update. This leads to adaptively tuning the learning rate of each weight, effectively producing weight–wise adaptive learning rates. The acceleration factor is evolved following the same criterion as ∆ij in Rprop: the sign of the product of the current and previous gradients. We clip these values to [ζmin, ζmax] to avoid runaway effects. If used directly, this factor would have the same non–linear effect as ∆ij in Rprop, so in order to use it in mini–batches and online, we have to apply smoothing that will guarantee asymptotic agreement between different batch sizes. To achieve this, we apply an exponentially–weighted moving average (EWMA) with exponential decay α to ζij. The full update rule is presented in Algorithm 1. Empirically, we have established that good values for the hyperparameters are α = 0.9, η+ = 1.2, η− = 0.1, ζmin = 0.01, ζmax = 100; these are used in all experiments reported in Section 4.
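Since Algorithm 1 is not reproduced here, the following NumPy sketch is our reading of the description above: ζij evolves like Rprop's ∆ij, is clipped to [ζmin, ζmax] and smoothed through an EWMA z, and the smoothed factor then scales a gradient step normalised by an EWMA θ of the squared gradient. The learning rate value and the exact way z and θ enter the update are assumptions of this sketch, not the published algorithm verbatim.

```python
import numpy as np

def wame_step(w, grad, prev_grad, zeta, z, theta, lr=1e-4,
              alpha=0.9, eta_plus=1.2, eta_minus=0.1,
              zeta_min=0.01, zeta_max=100.0, eps=1e-12):
    """One sketched WAME update per weight.

    zeta is the per-weight acceleration factor, grown/shrunk by the sign
    of the product of current and previous gradients and clipped to
    [zeta_min, zeta_max]; z is its EWMA (the smoothing that makes the
    rule usable with mini-batches); theta is an EWMA of the squared
    gradient, as in RMSprop.
    """
    prod = grad * prev_grad
    zeta = np.where(prod > 0, np.minimum(zeta * eta_plus, zeta_max), zeta)
    zeta = np.where(prod < 0, np.maximum(zeta * eta_minus, zeta_min), zeta)
    z = alpha * z + (1 - alpha) * zeta           # smoothed acceleration
    theta = alpha * theta + (1 - alpha) * grad ** 2
    w = w - lr * z * grad / (theta + eps)
    return w, zeta, z, theta

# A few steps on E(w) = w^2 / 2 from w = 1:
w = np.array([1.0])
prev_grad = np.array([0.0])
zeta = np.array([1.0])
z = np.array([0.0])
theta = np.array([0.0])
for _ in range(5):
    grad = w.copy()
    w, zeta, z, theta = wame_step(w, grad, prev_grad, zeta, z, theta)
    prev_grad = grad
```

The hyperparameter values in the signature are those reported above (α = 0.9, η+ = 1.2, η− = 0.1, ζmin = 0.01, ζmax = 100).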
4 Experimental results
We tested WAME on three commonly used benchmark datasets in Deep Learning: MNIST, CIFAR–10 and CIFAR–100. Details about these datasets are given below, and the architectures used are presented in Table 1. We followed common practice [13, 14]: MNIST uses training, validation and test sets, while CIFAR–10 and CIFAR–100 have no validation sets, with the test set used for validation purposes. No augmentations were applied to the training data. We compared WAME to SGD with momentum, RMSprop and Adam, using commonly used
Table 1: Network structures used in the experiments.

(a) MNIST:
64 conv, 5 × 5
64 conv, 1 × 1
2 × 2 max-pooling
128 conv, 5 × 5
128 conv, 1 × 1
2 × 2 max-pooling
Dense, 1024 nodes
50% dropout

(b) CIFAR–10, CIFAR–100:
2 × 96 conv, 3 × 3
96 conv, 3 × 3, 2 × 2 strides
2 × 192 conv, 3 × 3
192 conv, 3 × 3, 2 × 2 strides
192 conv, 3 × 3
192 conv, 1 × 1
10 conv, 1 × 1
global average pooling
mini–batch sizes. We did not include Rprop in our evaluation because it is already known not to work well on mini–batches [9]. We used the same architecture for each algorithm. We performed our comparison by generating a random set of initial weights and then using them as the starting point for each algorithm, so as to avoid any possible distortion of the results by the random initialisation process. We repeated each experiment 20 times, each time with a different random seed, and took the means of each value. All experiments were run on the Toupee Deep Learning experimentation library, which is based on Keras and is available at http://github.com/nitbix/toupee. Results are reported in Table 2, showing the number of epochs required to train the CNNs to reach the best validation accuracy, and the test accuracy recorded at that epoch. Figure 1 shows that WAME is more effective during training, with an average loss consistently below that of Adam and RMSprop. A Friedman aligned-ranks test shows that the accuracy improvement over Adam and RMSprop is not significant (p–value = 0.0974), while the speed improvement is significant at the 5% level both in epoch counts and, if we exclude SGD (which has lower accuracy), in real time.
MNIST [13] is a dataset of pre–processed images of hand–written digits. We used the network in Table 1a, trained for 100 epochs. The difference in test accuracy between Adam, RMSprop and WAME does not appear to be significant, but the improvement in speed is noticeable. MNIST is a dataset that is easy to overfit, and the accuracy levels obtained are close to the state–of–the–art without data augmentation, so the absence of a significant loss of learning ability is a positive result.