
LSTM: A Search Space Odyssey

Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber

Abstract: Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful fANOVA framework. In total, we summarize the results of 5400 experimental runs (≈ 15 years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.

Index Terms: Recurrent neural networks, Long Short-Term Memory, LSTM, sequence learning, random search, fANOVA.

I. INTRODUCTION

Recurrent neural networks with Long Short-Term Memory (which we will concisely refer to as LSTMs) have emerged as an effective and scalable model for several learning problems related to sequential data. Earlier methods for attacking these problems have either been tailored towards a specific problem or did not scale to long time dependencies.

LSTMs on the other hand are both general and effective at capturing long-term temporal dependencies. They do not suffer from the optimization hurdles that plague simple recurrent networks (SRNs) [1, 2] and have been used to advance the state of the art for many difficult problems. This includes handwriting recognition [3-5] and generation [6], language modeling [7] and translation [8], acoustic modeling of speech [9], speech synthesis [10], protein secondary structure prediction [11], and analysis of audio [12] and video data [13], among others.

The central idea behind the LSTM architecture is a memory cell which can maintain its state over time, and non-linear gating units which regulate the information flow into and out of the cell. Most modern studies incorporate many improvements that have been made to the LSTM architecture since its original formulation [14, 15]. However, LSTMs are now applied to many learning problems which differ significantly in scale and nature from the problems that these improvements were initially tested on. A systematic study of the utility of the various computational components which comprise LSTMs (see Figure 1) was missing.

This paper fills that gap and systematically addresses the open question of improving the LSTM architecture.

We evaluate the most popular LSTM architecture (vanilla LSTM; Section II) and eight different variants thereof on three benchmark problems: acoustic modeling, handwriting recognition, and polyphonic music modeling. Each variant differs from the vanilla LSTM by a single change. This allows us to isolate the effect of each of these changes on the performance of the architecture. Random search [16-18] is used to find the best-performing hyperparameters for each variant on each problem, enabling a reliable comparison of the performance of the different variants. We also provide insights gained about hyperparameters and their interaction using fANOVA [19].
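To make the tuning procedure concrete, the following minimal Python sketch illustrates random search over hyperparameters. The sampled hyperparameters and their ranges are illustrative placeholders rather than the search space actually used in this study, and train_and_evaluate stands in for training one LSTM variant and reporting its validation error.

```python
import random

def sample_hyperparameters(rng):
    """Draw one configuration. The hyperparameters and ranges here are
    illustrative placeholders, not the search space used in the paper."""
    return {
        "hidden_size":   rng.choice([64, 128, 256]),
        "learning_rate": 10 ** rng.uniform(-6, -2),   # log-uniform draw
        "momentum":      rng.uniform(0.0, 0.99),
        "input_noise":   rng.uniform(0.0, 1.0),       # e.g. std. dev. of Gaussian input noise
    }

def random_search(train_and_evaluate, n_trials=200, seed=0):
    """Evaluate independently sampled configurations and keep the best one.

    `train_and_evaluate(hp)` is a placeholder that trains one LSTM variant
    with hyperparameters `hp` and returns its validation error.
    """
    rng = random.Random(seed)
    best_score, best_hp = float("inf"), None
    for _ in range(n_trials):
        hp = sample_hyperparameters(rng)
        score = train_and_evaluate(hp)
        if score < best_score:
            best_score, best_hp = score, hp
    return best_score, best_hp
```

Because every configuration is drawn independently, the trials can run in parallel, and the pooled results can later be analysed with tools such as fANOVA.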

II. VANILLA LSTM

The LSTM setup most commonly used in the literature was originally described by Graves and Schmidhuber [20]. We refer to it as vanilla LSTM and use it as a reference for comparison of all the variants. The vanilla LSTM incorporates changes by Gers et al. [21] and Gers and Schmidhuber [22] into the original LSTM [15] and uses full gradient training. Section III provides descriptions of these major LSTM changes.

A schematic of the vanilla LSTM block can be seen in Figure 1. It features three gates (input, forget, output), block input, a single cell (the Constant Error Carousel), an output activation function, and peephole connections. The output of the block is recurrently connected back to the block input and all of the gates. (Some studies omit the peephole connections, described in Section III-B.)

Figure 1. Detailed schematic of the Simple Recurrent Network (SRN) unit (left) and a Long Short-Term Memory block (right) as used in the hidden layers of a recurrent neural network. The gate activation function is always sigmoid; the block input and output activation functions (g and h) are usually tanh.

A. Forward Pass

Let $x^t$ be the input vector at time $t$, $N$ be the number of LSTM blocks and $M$ the number of inputs. Then we get the following weights for an LSTM layer:

- Input weights: $W_z, W_i, W_f, W_o \in \mathbb{R}^{N \times M}$
- Recurrent weights: $R_z, R_i, R_f, R_o \in \mathbb{R}^{N \times N}$
- Peephole weights: $p_i, p_f, p_o \in \mathbb{R}^{N}$
- Bias weights: $b_z, b_i, b_f, b_o \in \mathbb{R}^{N}$

Then the vector formulas for a vanilla LSTM layer forward pass can be written as:

$\bar{z}^t = W_z x^t + R_z y^{t-1} + b_z$, $\quad z^t = g(\bar{z}^t)$ (block input)
$\bar{i}^t = W_i x^t + R_i y^{t-1} + p_i \odot c^{t-1} + b_i$, $\quad i^t = \sigma(\bar{i}^t)$ (input gate)
$\bar{f}^t = W_f x^t + R_f y^{t-1} + p_f \odot c^{t-1} + b_f$, $\quad f^t = \sigma(\bar{f}^t)$ (forget gate)
$c^t = z^t \odot i^t + c^{t-1} \odot f^t$ (cell state)
$\bar{o}^t = W_o x^t + R_o y^{t-1} + p_o \odot c^t + b_o$, $\quad o^t = \sigma(\bar{o}^t)$ (output gate)
$y^t = h(c^t) \odot o^t$ (block output)

where $\sigma$, $g$ and $h$ are point-wise non-linear activation functions. The logistic sigmoid ($\sigma(x) = \frac{1}{1 + e^{-x}}$) is used as the gate activation function, and the hyperbolic tangent ($g(x) = h(x) = \tanh(x)$) is usually used as the block input and output activation function. Point-wise multiplication of two vectors is denoted by $\odot$.
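For concreteness, the forward equations above can be sketched in NumPy as follows, assuming the logistic sigmoid for the gates and $g = h = \tanh$; the dictionary-based parameter layout and array shapes are our own illustrative choices, not code from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_seq, W, R, p, b, g=np.tanh, h=np.tanh):
    """Vanilla LSTM forward pass for one layer.

    x_seq : array of shape (T, M), the input vectors x^t
    W, R  : dicts keyed by 'z','i','f','o' with shapes (N, M) and (N, N)
    p     : dict keyed by 'i','f','o' with shape (N,), the peephole weights
    b     : dict keyed by 'z','i','f','o' with shape (N,), the biases
    Returns the block outputs y^t as an array of shape (T, N).
    """
    T, _ = x_seq.shape
    N = b['z'].shape[0]
    y = np.zeros(N)          # y^{t-1}, previous block output
    c = np.zeros(N)          # c^{t-1}, previous cell state
    ys = np.zeros((T, N))
    for t in range(T):
        x = x_seq[t]
        z = g(W['z'] @ x + R['z'] @ y + b['z'])                      # block input
        i = sigmoid(W['i'] @ x + R['i'] @ y + p['i'] * c + b['i'])   # input gate
        f = sigmoid(W['f'] @ x + R['f'] @ y + p['f'] * c + b['f'])   # forget gate
        c = z * i + c * f                                            # cell state c^t
        o = sigmoid(W['o'] @ x + R['o'] @ y + p['o'] * c + b['o'])   # output gate
        y = h(c) * o                                                 # block output y^t
        ys[t] = y
    return ys
```

Note that the output gate peeks at the updated cell state $c^t$, whereas the input and forget gates use $c^{t-1}$, exactly as in the equations above.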

B. Backpropagation Through Time

The deltas inside the LSTM block are then calculated as:

$\delta y^t = \Delta^t + R_z^T \delta\bar{z}^{t+1} + R_i^T \delta\bar{i}^{t+1} + R_f^T \delta\bar{f}^{t+1} + R_o^T \delta\bar{o}^{t+1}$
$\delta\bar{o}^t = \delta y^t \odot h(c^t) \odot \sigma'(\bar{o}^t)$
$\delta c^t = \delta y^t \odot o^t \odot h'(c^t) + p_o \odot \delta\bar{o}^t + p_i \odot \delta\bar{i}^{t+1} + p_f \odot \delta\bar{f}^{t+1} + \delta c^{t+1} \odot f^{t+1}$
$\delta\bar{f}^t = \delta c^t \odot c^{t-1} \odot \sigma'(\bar{f}^t)$
$\delta\bar{i}^t = \delta c^t \odot z^t \odot \sigma'(\bar{i}^t)$
$\delta\bar{z}^t = \delta c^t \odot i^t \odot g'(\bar{z}^t)$

Here $\Delta^t$ is the vector of deltas passed down from the layer above. If $E$ is the loss function, it formally corresponds to $\frac{\partial E}{\partial y^t}$, but not including the recurrent dependencies. The deltas for the inputs are only needed if there is a layer below that needs training, and can be computed as follows:

$\delta x^t = W_z^T \delta\bar{z}^t + W_i^T \delta\bar{i}^t + W_f^T \delta\bar{f}^t + W_o^T \delta\bar{o}^t$

Finally, the gradients for the weights are calculated as follows, where $\star$ can be any of $\{\bar{z}, \bar{i}, \bar{f}, \bar{o}\}$, and $\langle \star_1, \star_2 \rangle$ denotes the outer product of two vectors:

$\delta W_\star = \sum_{t=0}^{T} \langle \delta\star^t, x^t \rangle$ $\qquad$ $\delta p_i = \sum_{t=0}^{T-1} c^t \odot \delta\bar{i}^{t+1}$
$\delta R_\star = \sum_{t=0}^{T-1} \langle \delta\star^{t+1}, y^t \rangle$ $\qquad$ $\delta p_f = \sum_{t=0}^{T-1} c^t \odot \delta\bar{f}^{t+1}$
$\delta b_\star = \sum_{t=0}^{T} \delta\star^t$ $\qquad$ $\delta p_o = \sum_{t=0}^{T} c^t \odot \delta\bar{o}^t$
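As an illustration, one step of this delta recursion can be sketched as follows, assuming $g = h = \tanh$ and that the forward pass cached the listed pre-activations and activations. The variable names and cache layout are hypothetical, and the accumulation of the weight gradients over time (the sums above) is left to the surrounding training loop.

```python
import numpy as np

def lstm_backward_step(Delta_t, cache_t, c_prev, nxt, R, p):
    """Deltas inside one LSTM block at time t (the BPTT recursion above).

    Delta_t : dE/dy^t from the layer above, shape (N,)
    cache_t : forward values at time t: pre-activations 'zbar','ibar','fbar','obar',
              activations 'z','i','o', and cell state 'c'
    c_prev  : cell state c^{t-1}
    nxt     : deltas 'dz','di','df','do','dc' and forget gate 'f' from time t+1
              (all zeros at the last time step)
    R, p    : recurrent and peephole weights, keyed by 'z','i','f','o'
    """
    def dsig(a):                     # derivative of the sigmoid at pre-activation a
        s = 1.0 / (1.0 + np.exp(-a))
        return s * (1.0 - s)

    def dtanh(a):                    # derivative of tanh evaluated at a
        return 1.0 - np.tanh(a) ** 2

    # delta y^t: error from above plus recurrent contributions from t+1
    dy = (Delta_t
          + R['z'].T @ nxt['dz'] + R['i'].T @ nxt['di']
          + R['f'].T @ nxt['df'] + R['o'].T @ nxt['do'])
    # output gate, cell, forget gate, input gate and block input deltas
    do = dy * np.tanh(cache_t['c']) * dsig(cache_t['obar'])
    dc = (dy * cache_t['o'] * dtanh(cache_t['c'])
          + p['o'] * do + p['i'] * nxt['di'] + p['f'] * nxt['df']
          + nxt['dc'] * nxt['f'])
    df = dc * c_prev * dsig(cache_t['fbar'])
    di = dc * cache_t['z'] * dsig(cache_t['ibar'])
    dz = dc * cache_t['i'] * dtanh(cache_t['zbar'])
    return {'dy': dy, 'dz': dz, 'di': di, 'df': df, 'do': do, 'dc': dc}
```

The weight gradients then accumulate the outer products $\langle \delta\star^t, x^t \rangle$ and $\langle \delta\star^{t+1}, y^t \rangle$ over time, exactly as in the sums above.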

III. HISTORY OF LSTM

The initial version of the LSTM block [14, 15] included (possibly multiple) cells, input and output gates, but no forget gate and no peephole connections. The output gate, unit biases, or input activation function were omitted for certain experiments. Training was done using a mixture of Real Time Recurrent Learning (RTRL) [23, 24] and Backpropagation Through Time (BPTT) [24, 25]. Only the gradient of the cell was propagated back through time, and the gradient for the other recurrent connections was truncated. Thus, that study did not use the exact gradient for training. Another feature of that version was the use of full gate recurrence, which means that all the gates received recurrent inputs from all gates at the previous time-step in addition to the recurrent inputs from the block outputs. This feature did not appear in any of the later papers.

A. Forget Gate

The first paper to suggest a modification of the LSTM architecture introduced the forget gate [21], enabling the LSTM to reset its own state. This allowed learning of continual tasks such as embedded Reber grammar.

B. Peephole Connections

Gers and Schmidhuber [22] argued that in order to learn precise timings, the cell needs to control the gates. So far this was only possible through an open output gate. Peephole connections (connections from the cell to the gates, blue in Figure 1) were added to the architecture in order to make precise timings easier to learn.

Additionally, the output activation function was omitted, as there was no evidence that it was essential for solving the problems that LSTM had been tested on so far.

C. Full Gradient

The final modification towards the vanilla LSTM was done by Graves and Schmidhuber [20]. This study presented the full backpropagation through time (BPTT) training for LSTM networks with the architecture described in Section II, and presented results on the TIMIT [26] benchmark. Using full BPTT had the added advantage that LSTM gradients could be checked using finite differences, making practical implementations more reliable.

D. Other Variants

Since its introduction the vanilla LSTM has been the most commonly used architecture, but other variants have been suggested too. Before the introduction of full BPTT training, Gers et al. [27] utilized a training method based on Extended Kalman Filtering which enabled the LSTM to be trained on some pathological cases at the cost of high computational complexity.
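As a generic illustration of the finite-difference check mentioned in Section III-C (not the authors' implementation), the sketch below compares analytic gradients against central differences; loss_fn and grad_fn are placeholders for a full forward pass and BPTT pass on a small batch.

```python
import numpy as np

def finite_difference_check(loss_fn, grad_fn, params, eps=1e-5, n_checks=20, seed=0):
    """Compare analytic gradients with central finite differences.

    loss_fn(params) -> scalar loss E (placeholder for a full LSTM forward pass)
    grad_fn(params) -> array shaped like params with dE/dparams (placeholder for BPTT)
    Returns the largest relative error over a random subset of parameter entries.
    """
    rng = np.random.default_rng(seed)
    analytic = grad_fn(params).ravel()
    max_rel_err = 0.0
    for idx in rng.choice(params.size, size=min(n_checks, params.size), replace=False):
        original = params.flat[idx]
        params.flat[idx] = original + eps
        loss_plus = loss_fn(params)
        params.flat[idx] = original - eps
        loss_minus = loss_fn(params)
        params.flat[idx] = original                       # restore the parameter
        numeric = (loss_plus - loss_minus) / (2.0 * eps)  # central difference
        denom = max(abs(numeric) + abs(analytic[idx]), 1e-12)
        max_rel_err = max(max_rel_err, abs(numeric - analytic[idx]) / denom)
    return max_rel_err
```

With double precision and eps around 1e-5, relative errors on the order of 1e-6 or below typically indicate that the analytic gradients are consistent with the loss.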

