The stochastic delta rule: Faster and more accurate deep learning through adaptive weight noise

Noah Frazier-Logue

Research output: Contribution to journalLetter

1 Scopus citations

Abstract

Multilayer neural networks have led to remarkable performance on many kinds of benchmark tasks in text, speech, and image processing. Nonlinear parameter estimation in hierarchical models is known to be subject to over fitting and misspecification. One approach to these estimation and related problems (e.g., saddle points, colinearity, feature discovery) is called Dropout. The Dropout algorithm removes hidden units according to a binomial random variable with probability p prior to each update, creating random “shocks” to the network that are averaged over updates (thus creating weight sharing). In this letter, we reestablish an older parameter search method and show that Dropout is a special case of this more general model, stochastic delta rule (SDR), published originally in 1990. Unlike Dropout, SDR redefines each weight in the network as a random variable with mean μwi j and standard deviation σwi j. Each weight random variable is sampled on each forward activation, consequently creating an exponential number of potential networks with shared weights (accumulated in the mean values). Both parameters are updated according to prediction error, thus resulting in weight noise injections that reflect a local history of prediction error and local model averaging. SDR therefore implements a more sensitive local gradient dependent simulated annealing per weight converging in the limit to a Bayes optimal network. We run tests on standard benchmarks (CIFAR and ImageNet) using a modified version of DenseNet and show that SDR outperforms standard Dropout in top-5 validation error by approximately 13% with DenseNet-BC 121 on ImageNet and find various validation error improvements in smaller networks. We also show that SDR reaches the same accuracy that Dropout attains in 100 epochs in as few as 40 epochs, as well as improvements in training error by as much as 80%.

Original languageEnglish (US)
Pages (from-to)1018-1032
Number of pages15
JournalNeural computation
Volume32
Issue number5
DOIs
StatePublished - May 1 2020

All Science Journal Classification (ASJC) codes

  • Arts and Humanities (miscellaneous)
  • Cognitive Neuroscience

Fingerprint Dive into the research topics of 'The stochastic delta rule: Faster and more accurate deep learning through adaptive weight noise'. Together they form a unique fingerprint.

  • Cite this