
Neural Networks for Time Series Prediction

Transcription of Neural Networks for Time Series Prediction

Neural Networks for Time Series Prediction
15-486/782: Artificial Neural Networks, Fall 2006
(based on earlier slides by Dave Touretzky and Kornel Laskowski)

What is a Time Series?
A sequence of vectors (or scalars) which depend on time t. In this lecture we will deal exclusively with scalars:
{x(t_0), x(t_1), \ldots, x(t_{i-1}), x(t_i), x(t_{i+1}), \ldots}
It is the output of some process P that we are interested in: P \to x(t).

Examples of Time Series
- Dow-Jones Industrial Average
- sunspot activity
- electricity demand for a city
- number of births in a community
- air temperature in a building
These phenomena may be discrete or continuous.

Discrete Phenomena
- Dow-Jones Industrial Average closing value each day
- sunspot activity each day
Sometimes data have to be aggregated to get meaningful values: births per minute might not be as useful as births per month.

Continuous Phenomena
t is real-valued, and x(t) is a continuous signal. To get a time series {x[t]}, we must sample the signal at discrete points. With uniform sampling, if our sampling period is \Delta t, then
{x[t]} = {x(0), x(\Delta t), x(2\Delta t), x(3\Delta t), \ldots}   (1)
To ensure that x(t) can be recovered from x[t], \Delta t must be chosen according to the Nyquist sampling theorem.

Nyquist Sampling Theorem
If f_{max} is the highest frequency component of x(t), then we must sample at a rate at least twice as high:
1/\Delta t = f_{sampling} > 2 f_{max}   (2)
Why? Otherwise we will see aliasing of frequencies in the range [f_{sampling}/2, f_{max}].
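As a quick illustration of Eqs (1) and (2) (my own sketch, not from the slides; it assumes numpy is available and uses arbitrary toy frequencies): a 9 Hz cosine sampled at only 10 Hz produces exactly the same samples as a 1 Hz cosine, while sampling above the Nyquist rate keeps the two signals distinct.

    import numpy as np

    def sample(f, f_sampling, n_samples=20):
        # Eq (1): uniform samples x[n] = x(n * dt) with dt = 1 / f_sampling
        t = np.arange(n_samples) / f_sampling
        return np.cos(2 * np.pi * f * t)

    # Under-sampled: f_sampling = 10 Hz is below 2 * f_max = 18 Hz, so the
    # 9 Hz signal aliases onto 1 Hz and the two sample sequences coincide.
    print(np.allclose(sample(9.0, 10.0), sample(1.0, 10.0)))   # True

    # Sampling at 40 Hz > 2 * f_max keeps the two signals distinguishable.
    print(np.allclose(sample(9.0, 40.0), sample(1.0, 40.0)))   # False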

Studying Time Series
In addition to describing either discrete or continuous phenomena, time series can also be deterministic vs stochastic, governed by linear vs nonlinear dynamics, etc. Time series are the focus of several overlapping disciplines:
- Information Theory deals with describing stochastic time series.
- Dynamical Systems Theory deals with describing and manipulating mostly nonlinear deterministic time series.
- Digital Signal Processing deals with describing and manipulating mostly linear time series, both deterministic and stochastic.
We will use concepts from all three.

Possible Types of Processing
- predict future values of x[t]
- classify a series into one of a few classes ("price will go up", "price will go down", "sell now", "no change")
- describe a series using a few parameter values of some model
- transform one time series into another (oil prices -> interest rates)

The Problem of Predicting the Future
Extending backward from time t, we have the time series {x[t], x[t-1], \ldots}.

From this, we now want to estimate x at some future time:
\hat{x}[t+s] = f(x[t], x[t-1], \ldots)
s is called the horizon of prediction. We will come back to this; in the meantime, let's predict just one time sample into the future, s = 1. This is a function approximation problem. Here is how we will solve it:
1. Assume a generative model.
2. For every point x[t_i] in the past, train the generative model with what preceded t_i as the inputs and what followed t_i as the outputs.
3. Now run the model to predict \hat{x}[t+s] from {x[t], \ldots}.

Embedding
Time is constantly moving forward, and temporal data is hard to deal with. If we set up a shift register of delays, we can retain successive values of our time series. Then we can treat each past value as an additional spatial dimension in the input space to our predictor. This implicit transformation of a one-dimensional time vector into an infinite-dimensional spatial vector is called embedding. The input space to our predictor must be finite: at each instant t, truncate the history to only the T most recent values. T is called the embedding dimension.

Using the Past to Predict the Future
[Figure: a tapped delay line of delay elements holding x(t-1), x(t-2), ..., x(t-T), feeding a function f that outputs the prediction \hat{x}(t+1).]
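Here is a minimal sketch of the tapped delay line as a data transformation (my own illustration, assuming numpy; the helper name embed and the toy series are not from the slides): each input row holds the T most recent values and the target is the sample that follows them.

    import numpy as np

    def embed(x, T):
        # Each row is the truncated history x[t-1], ..., x[t-T]; target is x[t].
        X = np.array([x[t - T:t][::-1] for t in range(T, len(x))])
        y = x[T:]
        return X, y

    x = np.sin(0.3 * np.arange(100))        # a toy series
    X, y = embed(x, T=5)
    print(X.shape, y.shape)                 # (95, 5) (95,)
    print(X[0], y[0])                       # x[4], x[3], ..., x[0]  and  x[5]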

Linear Systems
It is possible that P, the process whose output we are trying to predict, is governed by linear dynamics. The study of linear systems is the domain of Digital Signal Processing (DSP). DSP is concerned with linear, translation-invariant (LTI) operations on data streams. These operations are implemented by filters; the analysis and design of filters effectively forms the core of this field. Filters operate on an input sequence u[t], producing an output sequence x[t]. They are typically described in terms of their frequency response, e.g. low-pass, high-pass, band-stop, etc. There are two basic filter architectures, known as the FIR filter and the IIR filter.

Finite Impulse Response (FIR) Filters
Characterized by q + 1 coefficients:
x[t] = \sum_{i=0}^{q} \beta_i u[t-i]   (3)
FIR filters implement the convolution of the input signal with a given coefficient vector {\beta_i}. They are known as Finite Impulse Response because, when the input u[t] is the impulse function, the output x is only as long as q + 1, which must be finite.
[Figure: an impulse input and the corresponding finite-length impulse response.]

Infinite Impulse Response (IIR) Filters
Characterized by p coefficients:
x[t] = \sum_{i=1}^{p} \alpha_i x[t-i] + u[t]   (4)
In IIR filters, the input u[t] contributes directly to x[t] at time t, but, crucially, x[t] is otherwise a weighted sum of its own past values. These filters are known as Infinite Impulse Response because, in spite of both the impulse function and the vector {\alpha_i} being finite in duration, the response only asymptotically decays to zero. Once one of the x[t]'s is non-zero, it will make non-zero contributions to future values of x[t] ad infinitum.

FIR and IIR Differences
[Figure: block diagrams of the two architectures in DSP notation. The FIR filter combines the delayed inputs u[t], u[t-1], ..., u[t-q] with weights \beta_0, ..., \beta_q; the IIR filter feeds back the delayed outputs x[t-1], ..., x[t-p] with weights \alpha_1, ..., \alpha_p and adds u[t].]
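The following is a direct, unoptimized transcription of Eqs (3) and (4) into numpy (my own sketch; the coefficient values are arbitrary). Feeding an impulse into each filter shows why one response is finite and the other is not.

    import numpy as np

    def fir(u, beta):
        # Eq (3): x[t] = sum_{i=0..q} beta_i * u[t-i]
        q = len(beta) - 1
        x = np.zeros(len(u))
        for t in range(len(u)):
            for i in range(q + 1):
                if t - i >= 0:
                    x[t] += beta[i] * u[t - i]
        return x

    def iir(u, alpha):
        # Eq (4): x[t] = sum_{i=1..p} alpha_i * x[t-i] + u[t]
        p = len(alpha)
        x = np.zeros(len(u))
        for t in range(len(u)):
            x[t] = u[t]
            for i in range(1, p + 1):
                if t - i >= 0:
                    x[t] += alpha[i - 1] * x[t - i]
        return x

    impulse = np.zeros(10)
    impulse[0] = 1.0
    print(fir(impulse, beta=[1.0, 0.5, 0.25]))  # non-zero for only q + 1 = 3 steps
    print(iir(impulse, alpha=[0.9]))            # decays geometrically, never exactly 0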

DSP Process Models
We are interested in modeling a particular process, for the purpose of predicting its future values. Digital Signal Processing (DSP) theory offers three classes of possible linear process models:
- Autoregressive (AR[p]) models
- Moving Average (MA[q]) models
- Autoregressive Moving Average (ARMA[p, q]) models

Autoregressive (AR[p]) Models
An AR[p] model assumes that at its heart is an IIR filter applied to some (unknown) internal signal, \epsilon[t]; p is the order of that filter.
x[t] = \sum_{i=1}^{p} \alpha_i x[t-i] + \epsilon[t]   (5)
This is simple, but adequately describes many complex phenomena (e.g. speech production over short intervals). If on average \epsilon[t] is small relative to x[t], then we can estimate x[t] using
\hat{x}[t] \equiv x[t] - \epsilon[t]   (6)
         = \sum_{i=1}^{p} w_i x[t-i]   (7)
This is an FIR filter! The w_i's are estimates of the \alpha_i's.

Estimating AR[p] Parameters
Batch version:
x[t] \approx \hat{x}[t]   (8)
     = \sum_{i=1}^{p} w_i x[t-i]   (9)
\begin{bmatrix} x[p+1] \\ x[p+2] \\ \vdots \end{bmatrix} = \begin{bmatrix} x[1] & x[2] & \cdots & x[p] \\ x[2] & x[3] & \cdots & x[p+1] \\ \vdots & & & \vdots \end{bmatrix} w   (10)
Can use linear regression. Example: speech recognition. Assume that over small windows of time, speech is governed by a static AR[p] model. To learn w is to characterize the vocal tract during that window. This is called Linear Predictive Coding (LPC).

Estimating AR[p] Parameters
Incremental version (same equation):
x[t] \approx \hat{x}[t] = \sum_{i=1}^{p} w_i x[t-i]
For each sample, modify each w_i by a small \Delta w_i to reduce the sample squared error (x[t] - \hat{x}[t])^2. This is one iteration of LMS. Example: noise cancellation. Predict the next sample \hat{x}[t] and generate -\hat{x}[t] at the next time step t. Used in noise-cancelling headsets for office, car, aircraft, etc.
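A small numerical sketch of both estimation procedures (my own, assuming numpy; the AR[2] coefficients, noise level, and learning rate are arbitrary choices): the batch version solves the lagged linear system of Eq (10) by least squares, and the incremental version applies the LMS update described above.

    import numpy as np

    rng = np.random.default_rng(0)
    true_alpha = np.array([1.5, -0.9])                  # a stable AR[2] process
    x = np.zeros(2000)
    for t in range(2, len(x)):
        x[t] = true_alpha @ x[t-2:t][::-1] + 0.1 * rng.standard_normal()

    # Batch version: rows of lagged values, solved by least squares (Eq 10).
    p = 2
    X = np.array([x[t-p:t][::-1] for t in range(p, len(x))])   # x[t-1], ..., x[t-p]
    y = x[p:]
    w_batch, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w_batch)                                      # close to [1.5, -0.9]

    # Incremental version: one small LMS step per sample.
    w, eta = np.zeros(p), 0.1
    for t in range(p, len(x)):
        past = x[t-p:t][::-1]
        err = x[t] - w @ past                           # sample error x[t] - x_hat[t]
        w += eta * err * past                           # delta w_i = eta * err * x[t-i]
    print(w)                                            # also ends up near [1.5, -0.9]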

Moving Average (MA[q]) Models
An MA[q] model assumes that at its heart is an FIR filter applied to some (unknown) internal signal, \epsilon[t]; q + 1 is the order of that filter.
x[t] = \sum_{i=0}^{q} \beta_i \epsilon[t-i]   (11)
Sadly, we cannot assume that \epsilon[t] is negligible; x[t] would then have to be negligible too. If our goal was to describe a noisy signal x[t] with specific frequency characteristics, we could set \epsilon[t] to white noise and the {w_i} would just subtract the frequency components that we do not want. Seldom used alone in practice: by using Eq (11) to estimate x[t], we are not making explicit use of past values of x[t].

Autoregressive Moving Average (ARMA[p, q]) Models
A combination of the AR[p] and MA[q] models:
x[t] = \sum_{i=1}^{p} \alpha_i x[t-i] + \sum_{i=1}^{q} \beta_i \epsilon[t-i] + \epsilon[t]   (12)
To estimate future values of x[t], assume that \epsilon[t] at time t is small relative to x[t]. We can obtain estimates of past values of \epsilon[t] at time t-i from past true values of x[t] and past values of \hat{x}[t]:
\hat{\epsilon}[t-i] = x[t-i] - \hat{x}[t-i]   (13)
The estimate for x[t] is then
\hat{x}[t] = \sum_{i=1}^{p} \alpha_i x[t-i] + \sum_{i=1}^{q} \beta_i \hat{\epsilon}[t-i]   (14)

Linear DSP Models as Linear NNs
DSP Filter | DSP Model | NN Connections
FIR        | MA[q]     | feedforward
IIR        | AR[p]     | recurrent
[Figure: an AR[p] model drawn as a linear recurrent network: inputs x(t-1), ..., x(t-p) with weights \alpha_1, ..., \alpha_p, plus the internal signal \epsilon(t), producing x(t).]
Train the weights using backprop, as with the AR[p] estimate above.
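Below is a sketch of one-step ARMA prediction following Eqs (12)-(14) (my own illustration, assuming numpy; the coefficients are arbitrary and, unlike in practice, are taken as known rather than estimated). Past residual estimates are formed with Eq (13) and reused in Eq (14).

    import numpy as np

    alpha = np.array([0.7, -0.2])            # AR coefficients, p = 2
    beta = np.array([0.4])                   # MA coefficients, q = 1

    rng = np.random.default_rng(1)
    eps = 0.1 * rng.standard_normal(300)     # the (normally unobserved) internal signal
    x = np.zeros(300)
    for t in range(2, len(x)):               # generate data according to Eq (12)
        x[t] = alpha @ x[t-2:t][::-1] + beta[0] * eps[t-1] + eps[t]

    x_hat = np.zeros(300)
    eps_hat = np.zeros(300)
    for t in range(2, len(x)):
        # Eq (14): prediction from past x and past residual estimates
        x_hat[t] = alpha @ x[t-2:t][::-1] + beta[0] * eps_hat[t-1]
        # Eq (13): residual estimate once the true x[t] has been observed
        eps_hat[t] = x[t] - x_hat[t]

    print(np.mean((x[10:] - x_hat[10:]) ** 2))   # close to var(eps) = 0.01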

Nonlinear AR[p] Models
Once we've moved to NNs, there is nothing to stop us from replacing the sums with a nonlinear activation function like tanh(\cdot). Nonlinear models are more powerful, but need more training data, and are less well behaved (overfitting, local minima, etc). TDNNs can be viewed as NAR[p] models. An example of a nonlinear ARMA neural net appears on the next slide.

Nonlinear ARMA[p, q] Models
[Figure: a feedforward network of nonlinear units f, trained with backprop; the inputs are past values x[t-3], x[t-2], x[t-1] and past residual estimates \hat{\epsilon}[t-3], \hat{\epsilon}[t-2], \hat{\epsilon}[t-1], and the prediction \hat{x}[t] is subtracted from x[t] to form the next residual.]

Jordan Nets
A Jordan net can be viewed as a variant of a NARMA model. The network has no memory; it remembers only the output from the previous timestep.

The Case for Alternative Memory Models
Uniform sampling is simple but has limitations.
[Figure: the delay-line predictor with inputs x(t-1), x(t-2), ..., x(t-T) feeding f to produce \hat{x}(t+1).]
It can only look back T equispaced time steps. To look far into the past, T must be large, giving a complicated model: many parameters, slow to train.

A Change of Notation
\bar{x}_i[t] = x[t-i+1]   (15)
This is just a reformulation. \bar{x}_i[t] is a memory term, allowing us to elide the tapped delay line from our diagrams.
[Figure: the same predictor redrawn with memory terms \bar{x}_1[t], \bar{x}_2[t], \bar{x}_3[t], ..., \bar{x}_{T+1}[t] in place of the explicit delay line.]

Propose Non-uniform Sampling
\bar{x}_i[t] = x[t - d_i],   d_i \in \mathbb{N}   (16)
d_i is an integer delay; for example, for four inputs, d could be {1, 2, 4, 8}. This is a generalization: if d were {1, 2, 3, 4}, we would be back to uniform sampling.
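A tiny sketch of the memory terms of Eq (16) (my own, assuming numpy; the helper name delay_inputs and the toy series are not from the slides): with d = {1, 2, 4, 8}, four inputs reach eight steps into the past, while d = {1, 2, 3, 4} recovers the uniform tapped delay line.

    import numpy as np

    def delay_inputs(x, t, d):
        # Memory terms of Eq (16): xbar_i[t] = x[t - d_i]
        return np.array([x[t - di] for di in d])

    x = np.arange(20, dtype=float)                   # toy series with x[t] = t
    print(delay_inputs(x, t=10, d=[1, 2, 4, 8]))     # [9. 8. 6. 2.]
    print(delay_inputs(x, t=10, d=[1, 2, 3, 4]))     # uniform case: [9. 8. 7. 6.]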

Convolutional Memory Terms
Mozer has suggested treating each memory term as a convolution of x[t] with a kernel function:
\bar{x}_i[t] = \sum_{\tau=1}^{t} c_i[t - \tau] \, x[\tau]   (17)
Delay lines, non-uniformly and uniformly sampled, can be expressed using this notation, with the kernel function defined by:
c_i[t] = 1 if t = d_i, and 0 otherwise   (18)
[Figure: the delay-line kernel c_i[t].]

Exponential Trace Memory
The idea: remember past values as an exponentially decaying weighted average of the input.
c_i[t] = (1 - \mu_i) \mu_i^{t},   \mu_i \in (-1, +1)   (19)
\mu_i is the decay rate (a discount factor). Each \bar{x}_i uses a different decay rate. No outputs are forgotten; they just "fade away".
[Figure: the exponential trace kernel c_i[t].]

Exponential Trace Memory, cont'd
A nice feature: if all \mu_i \equiv \mu, we don't have to do the convolution at each time step. Compute incrementally:
\bar{x}_i[t] = (1 - \mu) x[t] + \mu \bar{x}_i[t-1]   (20)
Example: a Jordan net with memory.
[Figure: a Jordan net with plan and state inputs, a hidden layer and an output layer; the state units hold a decaying trace of past outputs.]
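The following check (my own sketch, assuming numpy, with indices starting at 0 rather than 1) confirms that the incremental update of Eq (20) reproduces the explicit convolution of Eq (17) with the exponential kernel of Eq (19).

    import numpy as np

    mu = 0.9
    rng = np.random.default_rng(2)
    x = rng.standard_normal(200)

    # Incremental form, Eq (20): xbar[t] = (1 - mu) * x[t] + mu * xbar[t-1]
    xbar = np.zeros(len(x))
    xbar[0] = (1 - mu) * x[0]
    for t in range(1, len(x)):
        xbar[t] = (1 - mu) * x[t] + mu * xbar[t - 1]

    # Convolution form, Eqs (17) and (19), evaluated at the last time step T
    T = len(x) - 1
    kernel = (1 - mu) * mu ** np.arange(T + 1)   # c[k] = (1 - mu) * mu**k
    conv = np.sum(kernel * x[T::-1])             # sum over tau of c[T - tau] * x[tau]
    print(np.isclose(xbar[T], conv))             # True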

Special Case: Binary Sequences
Let x[t] \in {0, 1}, with \mu = 0.5. Then \bar{x}[t] is a bit string, treated as a floating point fraction:
x[t] = {1}              ->  \bar{x}[t] = .1
x[t] = {1, 0}           ->  \bar{x}[t] = .01
x[t] = {1, 0, 0}        ->  \bar{x}[t] = .001
x[t] = {1, 0, 0, 1}     ->  \bar{x}[t] = .1001
x[t] = {1, 0, 0, 1, 1}  ->  \bar{x}[t] = .11001
The earliest bit becomes the least significant bit of \bar{x}[t].

Memory Depth and Resolution
Depth is how far back the memory goes. Resolution is the degree to which information about individual sequence elements is preserved. At fixed model order, we have a tradeoff:
- Tapped delay line: low depth, high resolution.
- Exponential trace: high depth, low resolution.

Gamma Memory (deVries & Principe)
c_i[t] = \binom{t}{d_i} (1 - \mu_i)^{d_i + 1} \mu_i^{t - d_i} if t \ge d_i, and 0 otherwise   (21)
d_i is an integer; \mu_i \in [0, 1].
[Figure: the gamma kernel for d_i = 4 and a particular \mu_i.]
If d_i = 0, this is exponential trace memory. As \mu_i \to 0, this becomes the tapped delay line. We can trade depth for resolution by adjusting d_i and \mu_i. Gamma functions form a basis for a family of kernel functions. (A small numerical check of this kernel appears after the Memory Content slide below.)

Memory Content
We don't have to store the raw x[t]; we can store any transformation we like. For example, we can store the internal state of the NN. Example: Elman net.
[Figure: an Elman net with plan and context inputs, a hidden layer and an output layer; the context units hold a copy of the previous hidden-layer activity.]
Think of this as a 1-tap delay line storing f(x[t]), the hidden layer.
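As promised under the Gamma Memory slide, here is a small numerical check of the kernel in Eq (21) (my own sketch; the helper name gamma_kernel and the particular mu values are not from the slides): with d_i = 0 it reproduces the exponential trace kernel of Eq (19), and as \mu_i shrinks toward 0 its mass concentrates at t = d_i, like the tapped-delay kernel of Eq (18).

    from math import comb

    def gamma_kernel(t, d, mu):
        # Eq (21): c[t] = C(t, d) * (1 - mu)**(d + 1) * mu**(t - d) for t >= d, else 0
        if t < d:
            return 0.0
        return comb(t, d) * (1 - mu) ** (d + 1) * mu ** (t - d)

    # d = 0 recovers the exponential trace kernel (1 - mu) * mu**t of Eq (19).
    print([round(gamma_kernel(t, d=0, mu=0.6), 4) for t in range(5)])
    # [0.4, 0.24, 0.144, 0.0864, 0.0518]

    # As mu shrinks, the kernel concentrates at t = d, approaching Eq (18).
    print([round(gamma_kernel(t, d=4, mu=0.01), 4) for t in range(8)])
    # [0.0, 0.0, 0.0, 0.0, 0.951, 0.0475, 0.0014, 0.0]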

Horizon of Prediction
So far we have covered many neural net architectures which could be used for predicting the next sample in a time series. What if we need a longer forecast, i.e. not \hat{x}[t+1] but \hat{x}[t+s], with the horizon of prediction s > 1? Three options:
- Train on {x[t], x[t-1], x[t-2], \ldots} to predict x[t+s].
- Train to predict all x[t+i], 1 \le i \le s (good for small s).
- Train to predict x[t+1] only, but iterate to get \hat{x}[t+s] for any s (a small sketch of this option appears at the end of this transcription).

Predicting Sunspot Activity
Fessant, Bengio and Collobert. Sunspots affect ionospheric propagation of radio waves. Telecom companies want to predict sunspot activity six months in advance. Sunspots follow an 11-year cycle, varying from 9 to 14 years. Monthly sunspot data goes back more than 150 years. The authors focus on predicting IR5, a smoothed index of monthly sunspot activity.

Fessant et al: the IR5 Sunspots Series
IR5[t] = \frac{1}{5}\,(R[t-3] + R[t-2] + R[t-1] + R[t] + R[t+1])
where R[t] is the mean sunspot number for month t and IR5[t] is the desired output.

Fessant et al: Simple Feedforward NN (1087 weights)
Output: {\hat{x}[t], \ldots, \hat{x}[t+5]}
Input: {x[t-40], \ldots, x[t-1]}

Fessant et al: Modular Feedforward NN (552 weights)
Output: \hat{x}[t+5] (via intermediate outputs {\hat{x}[t], \ldots, \hat{x}[t+5]})
Input: {x[t-40], \ldots, x[t-1]}

Fessant et al: Elman NN (786 weights)
Output: {\hat{x}[t], \ldots, \hat{x}[t+5]}
Input: {x[t-40], \ldots, x[t-1]}
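Referring back to the three options under Horizon of Prediction, here is a minimal sketch of the third one (my own, assuming numpy; the AR order, horizon, and toy series are arbitrary choices): fit a one-step linear predictor, then iterate it, feeding each prediction back in as an input, until the horizon s is reached.

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.sin(0.2 * np.arange(300)) + 0.05 * rng.standard_normal(300)

    p, s = 10, 6                                   # model order and horizon
    X = np.array([x[t - p:t][::-1] for t in range(p, len(x))])
    y = x[p:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)      # one-step linear predictor

    history = list(x)
    for _ in range(s):                             # iterate: feed predictions back in
        past = np.array(history[-p:][::-1])        # most recent p values, newest first
        history.append(float(w @ past))
    print(history[-s:])                            # x_hat[t+1], ..., x_hat[t+s]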

