
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Haoyi Zhou 1, Shanghang Zhang 2, Jieqi Peng 1, Shuai Zhang 1, Jianxin Li 1, Hui Xiong 3, Wancai Zhang 4
1 Beihang University, 2 UC Berkeley, 3 Rutgers University, 4 SEDD Company

Abstract: Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and the inherent limitation of the encoder-decoder architecture.

To address these issues, we design an efficient Transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves O(L log L) time complexity and memory usage, and has comparable performance on sequences' dependency alignment; (ii) self-attention distilling, which highlights dominating attention by halving the cascading layer input and efficiently handles extremely long input sequences; (iii) a generative style decoder, which, while conceptually simple, predicts the long time-series sequences in one forward operation rather than step by step, drastically improving the inference speed of long-sequence predictions.
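As an illustration of characteristic (i), the following is a minimal single-head PyTorch sketch of a ProbSparse-style attention step: each query's "sparsity" is estimated on a random subset of keys, only the top-u queries (u on the order of log L) attend over all keys, and the remaining queries fall back to the mean of the values. The function name, the sampling factor, and the mean-value fallback are illustrative assumptions, not the authors' released implementation.

```python
# Minimal single-head sketch of a ProbSparse-style attention step (PyTorch).
# Names and the sampling factor are illustrative, not the paper's reference code.
import math
import torch

def probsparse_attention(Q, K, V, sample_factor=5):
    """Q, K, V: (L, d) tensors for one head; returns an (L, d) output."""
    L, d = Q.shape
    u = min(L, int(sample_factor * math.ceil(math.log(L))))         # active queries, ~c*ln(L)
    n_sample = min(L, int(sample_factor * math.ceil(math.log(L))))  # sampled keys per query

    # 1) Estimate each query's sparsity score on a random key subset (with replacement):
    #    M(q) = max_j <q, k_j>/sqrt(d) - mean_j <q, k_j>/sqrt(d)
    idx = torch.randint(0, L, (n_sample,))
    sampled = Q @ K[idx].T / math.sqrt(d)                 # (L, n_sample)
    M = sampled.max(dim=-1).values - sampled.mean(dim=-1)

    # 2) Non-selected ("lazy") queries fall back to the mean of V,
    #    which is roughly what near-uniform attention would return anyway.
    top = M.topk(u).indices
    out = V.mean(dim=0, keepdim=True).expand(L, d).clone()

    # 3) Full attention only for the selected queries: O(u * L) = O(L log L).
    scores = Q[top] @ K.T / math.sqrt(d)                  # (u, L)
    out[top] = torch.softmax(scores, dim=-1) @ V
    return out

x = torch.randn(96, 64)
print(probsparse_attention(x, x, x).shape)                # torch.Size([96, 64])
```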

Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.

Introduction

Time-series forecasting is a critical ingredient across many domains, such as sensor network monitoring (Papadimitriou and Yu 2006), energy and smart grid management, economics and finance (Zhu and Shasha 2002), and disease propagation analysis (Matsubara et al. 2014). In these scenarios, we can leverage a substantial amount of time-series data on past behavior to make a forecast in the long run, namely long sequence time-series forecasting (LSTF). However, existing methods are mostly designed under the short-term problem setting, like predicting 48 points or less (Hochreiter and Schmidhuber 1997; Li et al. 2018; Yu et al. 2017; Liu et al. 2019; Qin et al. 2017; Wen et al. 2017). The increasingly long sequences strain the models' prediction capacity to the point where this trend is holding back research on LSTF.

As an empirical example, Fig.(1) shows the forecasting results on a real dataset, where the LSTM network predicts the hourly temperature of an electrical transformer station from the short-term period (12 points, 0.5 days) to the long-term period (480 points, 20 days).

Figure 1: (a) LSTF can cover an extended period compared with short sequence predictions, making a vital distinction in policy-planning and investment-protecting. (b) The prediction capacity of existing methods limits LSTF's performance; e.g., starting from length=48, MSE rises unacceptably high and the inference speed (predictions/sec) drops sharply.

The overall performance gap is substantial when the prediction length is greater than 48 points (the solid star in Fig.(1b)): the MSE rises to unsatisfactory levels, the inference speed drops sharply, and the LSTM model starts to fail. The major challenge for LSTF is to enhance the prediction capacity to meet the increasingly long sequence demand, which requires (a) extraordinary long-range alignment ability and (b) efficient operations on long sequence inputs and outputs.

Recently, Transformer models have shown superior performance in capturing long-range dependency compared with RNN models. The self-attention mechanism can reduce the maximum length of network signals' traveling paths to the theoretical shortest O(1) and avoid the recurrent structure, whereby Transformer shows great potential for the LSTF problem. Nevertheless, the self-attention mechanism violates requirement (b) due to its L-quadratic computation and memory consumption on L-length inputs/outputs. Some large-scale Transformer models pour in resources and yield impressive results on NLP tasks (Brown et al. 2020), but training on dozens of GPUs and the expensive deployment cost make these models unaffordable for the real-world LSTF problem.
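To make the L-quadratic cost concrete, the short sketch below shows that canonical scaled dot-product attention materializes an L x L score matrix, followed by back-of-the-envelope memory figures for that matrix alone; the chosen sequence lengths and the float32 assumption are illustrative.

```python
# Canonical scaled dot-product attention (single head) builds an L x L score
# matrix, which is the source of the quadratic time and memory cost.
import math
import torch

def canonical_attention(Q, K, V):
    L, d = Q.shape
    scores = Q @ K.T / math.sqrt(d)              # (L, L): quadratic in L
    return torch.softmax(scores, dim=-1) @ V

# Rough memory of the score matrix alone, per head, in float32:
for L in (48, 480, 4800):
    print(L, f"{L * L * 4 / 1e6:.1f} MB")        # ~0.0 MB, ~0.9 MB, ~92.2 MB
```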

The efficiency of the self-attention mechanism and the Transformer architecture becomes the bottleneck of applying them to LSTF problems. Thus, in this paper, we seek to answer the question: can we improve Transformer models to be computation-, memory-, and architecture-efficient, while maintaining higher prediction capacity? Vanilla Transformer (Vaswani et al. 2017) has three significant limitations when solving the LSTF problem:

1. The quadratic computation of self-attention. The atom operation of the self-attention mechanism, namely the canonical dot-product, causes the time complexity and memory usage per layer to be O(L^2).

2. The memory bottleneck in stacking layers for long inputs. The stack of J encoder/decoder layers makes the total memory usage O(J · L^2), which limits the model's scalability in receiving long sequence inputs.

3. The speed plunge in predicting long outputs. The dynamic decoding of vanilla Transformer makes step-by-step inference as slow as an RNN-based model (Fig.(1b)), as illustrated by the sketch after this list.
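The sketch below contrasts limitation 3's step-by-step (dynamic) decoding with the one-shot generative decoding adopted later in the paper; the stand-in `model`, the feature width, and the start-token length are assumptions for illustration, not the paper's interface.

```python
# Step-by-step (dynamic) decoding vs. one-shot generative decoding (PyTorch).
import torch

def dynamic_decode(model, x_enc, start_token, pred_len):
    """Vanilla-Transformer style: one forward pass per predicted step."""
    dec_in = start_token                               # (B, L_token, d)
    for _ in range(pred_len):                          # pred_len forward passes
        y = model(x_enc, dec_in)
        dec_in = torch.cat([dec_in, y[:, -1:]], dim=1)
    return dec_in[:, -pred_len:]

def generative_decode(model, x_enc, start_token, pred_len, d):
    """Informer-style: pad zero placeholders once, predict all steps in one pass."""
    placeholder = torch.zeros(start_token.size(0), pred_len, d)
    dec_in = torch.cat([start_token, placeholder], dim=1)
    return model(x_enc, dec_in)[:, -pred_len:]         # single forward pass

# A trivial stand-in "model" so the sketch runs; it just projects the decoder input.
proj = torch.nn.Linear(8, 8)
model = lambda x_enc, x_dec: proj(x_dec)
x_enc, token = torch.randn(1, 96, 8), torch.randn(1, 48, 8)
print(dynamic_decode(model, x_enc, token, 24).shape)       # torch.Size([1, 24, 8])
print(generative_decode(model, x_enc, token, 24, 8).shape) # torch.Size([1, 24, 8])
```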

There are some prior works on improving the efficiency of self-attention. The Sparse Transformer (Child et al. 2019), LogSparse Transformer (Li et al. 2019), and Longformer (Beltagy, Peters, and Cohan 2020) all use heuristic methods to tackle limitation 1 and reduce the complexity of the self-attention mechanism to O(L log L), where their efficiency gain is limited (Qiu et al. 2019). Reformer (Kitaev, Kaiser, and Levskaya 2019) also achieves O(L log L) with locally-sensitive hashing self-attention, but it only works on extremely long sequences. More recently, Linformer (Wang et al. 2020) claims linear complexity O(L), but the projection matrix cannot be fixed for real-world long sequence input, which may carry the risk of degradation to O(L^2).

Transformer-XL (Dai et al. 2019) and Compressive Transformer (Rae et al. 2019) use auxiliary hidden states to capture long-range dependency, which could amplify limitation 1 and work against breaking the efficiency bottleneck. All these works mainly focus on limitation 1, while limitations 2 and 3 remain unsolved in the LSTF problem. To enhance the prediction capacity, we tackle all these limitations and achieve improvement beyond efficiency in the proposed Informer.

To this end, our work delves explicitly into these three issues. We investigate the sparsity in the self-attention mechanism, make improvements to the network components, and conduct extensive experiments. The contributions of this paper are summarized as follows:

• We propose Informer to successfully enhance the prediction capacity in the LSTF problem, which validates the Transformer-like model's potential value to capture individual long-range dependency between long sequence time-series outputs and inputs.

• We propose the ProbSparse self-attention mechanism to efficiently replace canonical self-attention. It achieves O(L log L) time complexity and O(L log L) memory usage on dependency alignments.

• We propose the self-attention distilling operation to privilege dominating attention scores in J-stacking layers and sharply reduce the total space complexity to O((2 - ε)L log L), which helps receive long sequence input (a sketch of this halving step follows below).

• We propose the generative style decoder to acquire long sequence output with only one forward step needed, simultaneously avoiding cumulative error spreading during the inference phase.

Figure 2: Informer model overview (encoder with multi-head ProbSparse self-attention over inputs X_en; decoder with multi-head ProbSparse self-attention and multi-head attention over inputs X_de = {X_token, X_0}, a concatenated feature map, and a fully connected layer producing the output).
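Below is a rough sketch of the self-attention distilling step named in the second bullet: a Conv1d with ELU activation followed by stride-2 max-pooling halves the sequence length between encoder layers, so the stacked footprint shrinks geometrically (L + L/2 + L/4 + ... < 2L, hence the (2 - ε) factor). The exact layer composition and hyperparameters here are assumptions, not the released implementation.

```python
# One distilling step: halve the temporal dimension between encoder layers.
import torch
import torch.nn as nn

class DistillLayer(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (batch, L, d_model)
        x = x.transpose(1, 2)              # (batch, d_model, L) for Conv1d
        x = self.pool(self.act(self.norm(self.conv(x))))
        return x.transpose(1, 2)           # (batch, ~L/2, d_model)

x = torch.randn(2, 96, 64)
for layer in (DistillLayer(64), DistillLayer(64)):
    x = layer(x)                           # length 96 -> 48 -> 24
print(x.shape)                             # torch.Size([2, 24, 64])
```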

