Comparative Study of CNN and RNN for Natural Language Processing

Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze
CIS, LMU Munich, Germany; IBM Research, USA
7 Feb 2017

Abstract

Deep neural networks (DNNs) have revolutionized the field of natural language processing (NLP). Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN), the two main types of DNN architectures, are widely explored to handle various NLP tasks. CNN is supposed to be good at extracting position-invariant features and RNN at modeling units in sequence. The state-of-the-art on many NLP tasks often switches due to the battle of CNNs and RNNs. This work is the first systematic comparison of CNN and RNN on a wide range of representative NLP tasks, aiming to give basic guidance for DNN selection.

1 Introduction

Natural language processing (NLP) has benefited greatly from the resurgence of deep neural networks (DNNs), due to their high performance with less need of engineered features. There are two main DNN architectures: convolutional neural network (CNN) (LeCun et al., 1998) and recurrent neural network (RNN) (Elman, 1990). Gating mechanisms have been developed to alleviate some limitations of the basic RNN, resulting in two prevailing RNN types: long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and gated recurrent unit (GRU) (Cho et al., 2014).

Generally speaking, CNNs are hierarchical and RNNs sequential architectures. How should we choose between them for processing language? Based on the characterization hierarchical (CNN) vs. sequential (RNN), it is tempting to choose a CNN for classification tasks like sentiment classification since sentiment is usually determined by some key phrases; and to choose RNNs for a sequence modeling task like language modeling as it requires flexible modeling of context dependencies. But current NLP literature does not support such a clear conclusion. For example, RNNs perform well on document-level sentiment classification (Tang et al., 2015); and Dauphin et al. (2016) recently showed that gated CNNs outperform LSTMs on language modeling tasks, even though LSTMs had long been seen as better suited. In summary, there is no consensus on DNN selection for any particular NLP problem.

This work compares CNNs, GRUs and LSTMs systematically on a broad array of NLP tasks: sentiment/relation classification, textual entailment, answer selection, question-relation matching in Freebase, Freebase path query answering and part-of-speech tagging.

Our experiments support two key findings. (i) CNNs and RNNs provide complementary information for text classification tasks. Which architecture performs better depends on how important it is to semantically understand the whole sequence. (ii) Learning rate changes performance relatively smoothly, while changes to hidden size and batch size result in large fluctuations.

2 Related Work

To our knowledge, there has been no systematic comparison of CNN and RNN on a large array of NLP tasks.

Vu et al. (2016) investigate CNN and basic RNN (i.e., no gating mechanisms) for relation classification. They report higher performance of CNN than RNN and give evidence that CNN and RNN provide complementary information: while the RNN computes a weighted combination of all words in the sentence, the CNN extracts the most informative ngrams for the relation and only considers their resulting activations.
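The contrast drawn by Vu et al. (2016) can be made concrete with a small pure-Python sketch. It is only an illustration under stated assumptions, not the models evaluated in this paper: the function names `cnn_maxpool` and `rnn_last_state`, the toy 2-dimensional embeddings and all weight values are invented for the example, and filler positions are assumed to be zero vectors. The sketch places the same two-token phrase at the front or at the back of a sequence and compares what a max-pooled convolution and a basic (ungated) RNN produce.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def cnn_maxpool(seq, W, b, w):
    """Zero-padded convolution over all w-grams (tanh), then max-pooling."""
    d = len(seq[0])
    pad = [[0.0] * d for _ in range(w - 1)]
    padded = pad + seq + pad
    feats = []
    for i in range(len(padded) - w + 1):
        # Concatenate the embeddings of the w tokens in this window.
        window = [x for tok in padded[i:i + w] for x in tok]
        feats.append([math.tanh(a + bj) for a, bj in zip(matvec(W, window), b)])
    # Per feature dimension, keep only the strongest w-gram activation.
    return [max(col) for col in zip(*feats)]


def rnn_last_state(seq, U, V):
    """Basic Elman-style RNN: every token is folded into the running state."""
    h = [0.0] * len(V)
    for x in seq:
        h = [math.tanh(a + c) for a, c in zip(matvec(U, x), matvec(V, h))]
    return h


# Hypothetical toy parameters (assumptions, chosen by hand).
W_conv = [[0.5, -0.3, 0.8, 0.1], [-0.2, 0.7, 0.4, -0.6]]  # d x (w*d), w = 2
b_conv = [0.1, -0.1]
U_rnn = [[0.9, 0.2], [-0.4, 0.6]]
V_rnn = [[0.5, 0.0], [0.0, 0.5]]

not_, good, zero = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
front = [not_, good, zero, zero, zero]  # phrase at the start
back = [zero, zero, zero, not_, good]   # same phrase at the end

pooled_front = cnn_maxpool(front, W_conv, b_conv, 2)
pooled_back = cnn_maxpool(back, W_conv, b_conv, 2)
state_front = rnn_last_state(front, U_rnn, V_rnn)
state_back = rnn_last_state(back, U_rnn, V_rnn)

print(pooled_front == pooled_back)  # True: same set of w-grams, same maxima
print(state_front == state_back)    # False: the final state depends on order
```

In this toy setting both sequences yield the same set of w-gram windows, so the pooled vector is identical, whereas the recurrent state keeps being transformed by the trailing zero inputs and therefore differs. With non-zero filler tokens the windows at the phrase boundary would change, so the exact equality should be read as a property of this constructed case only.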
Both Wen et al. (2016) and Adel and Schütze (2017) support CNN over GRU/LSTM for classification of long sentences. In addition, Yin et al. (2016) achieve better performance of attention-based CNN than attention-based LSTM for answer selection. Dauphin et al. (2016) further argue that a fine-tuned gated CNN can also model long-context dependency, getting new state-of-the-art in language modeling above all RNN competitors.

In contrast, Arkhipenko et al. (2016) compare word2vec (Mikolov et al., 2013), CNN, GRU and LSTM in sentiment analysis of Russian tweets, and find GRU outperforms LSTM and CNN.

In empirical evaluations, Chung et al. (2014) and Jozefowicz et al. (2015) found there is no clear winner between GRU and LSTM. In many tasks, they yield comparable performance and tuning hyperparameters like layer size is often more important than picking the ideal architecture.

Figure 1: Three typical DNN architectures: (a) CNN, (b) GRU, (c) LSTM.

3 Models

This section gives a brief introduction of CNN, GRU and LSTM.

Convolutional Neural Network (CNN)

Input Layer. Sequence $x$ contains $n$ entries. Each entry is represented by a $d$-dimensional dense vector; thus the input $x$ is represented as a feature map of dimensionality $d \times n$. Figure 1(a) shows the input layer as the lower rectangle with multiple columns.

Convolution Layer is used for representation learning from sliding $w$-grams. For an input sequence with $n$ entries $x_1, x_2, \ldots, x_n$, let vector $c_i \in \mathbb{R}^{wd}$ be the concatenated embeddings of $w$ entries $x_{i-w+1}, \ldots, x_i$, where $w$ is the filter width and $0 < i < n + w$. Embeddings for $x_i$, $i < 1$ or $i > n$, are zero padded. We then generate the representation $p_i \in \mathbb{R}^d$ for the $w$-gram $x_{i-w+1}, \ldots, x_i$ using the convolution weights $W \in \mathbb{R}^{d \times wd}$:

$$p_i = \tanh(W \cdot c_i + b) \tag{1}$$

where bias $b \in \mathbb{R}^d$.

Maxpooling. All $w$-gram representations $p_i$ ($i = 1, \ldots, n + w - 1$) are used to generate the representation of input sequence $x$ by maxpooling: $x_j = \max(p_{1,j}, p_{2,j}, \ldots)$ ($j = 1, \ldots, d$).

Gated Recurrent Unit (GRU)

GRU, as shown in Figure 1(b), models text $x$ as follows:

$$z = \sigma(x_t U^z + h_{t-1} W^z) \tag{2}$$
$$r = \sigma(x_t U^r + h_{t-1} W^r) \tag{3}$$
$$s_t = \tanh(x_t U^s + (h_{t-1} \circ r) W^s) \tag{4}$$
$$h_t = (1 - z) \circ s_t + z \circ h_{t-1} \tag{5}$$

$x_t \in \mathbb{R}^d$ represents the token in $x$ at position $t$, $h_t \in \mathbb{R}^h$ is the hidden state at $t$, supposed to encode the history $x_1, \ldots, x_t$. $z$ and $r$ are two gates. All $U \in \mathbb{R}^{d \times h}$, $W \in \mathbb{R}^{h \times h}$ are parameters.

Long Short-Term Memory (LSTM)

LSTM is denoted in Figure 1(c). It models the word sequence $x$ as follows:

$$i_t = \sigma(x_t U^i + h_{t-1} W^i + b_i) \tag{6}$$
$$f_t = \sigma(x_t U^f + h_{t-1} W^f + b_f) \tag{7}$$
$$o_t = \sigma(x_t U^o + h_{t-1} W^o + b_o) \tag{8}$$
$$q_t = \tanh(x_t U^q + h_{t-1} W^q + b_q) \tag{9}$$
$$p_t = f_t \circ p_{t-1} + i_t \circ q_t \tag{10}$$
$$h_t = o_t \circ \tanh(p_t) \tag{11}$$

LSTM has three gates: input gate $i_t$, forget gate $f_t$ and output gate $o_t$. All gates are generated by a sigmoid function over the ensemble of input $x_t$ and the preceding hidden state $h_{t-1}$. To generate the hidden state at the current step $t$, it first generates a temporary result $q_t$ by a tanh non-linearity over the ensemble of input $x_t$ and the preceding hidden state $h_{t-1}$, then combines this temporary result $q_t$ with history $p_{t-1}$ by input gate $i_t$ and forget gate $f_t$ respectively to get an updated history $p_t$, and finally uses output gate $o_t$ over this updated history $p_t$ to get the final hidden state $h_t$.

4 Experiments

Tasks

Sentiment Classification (SentiC) on Stanford Sentiment Treebank (SST) (Socher et al., 2013). In this dataset, the task is to predict the sentiment (positive or negative) of movie reviews. We use the given split of 6920 train, 872 dev and 1821 test sentences.

... dataset. We use the subtask that assumes that there is at least one correct answer for a question. The corresponding dataset consists of 20,360 question-candidate pairs in train, 1,130 in dev and 2,352 in test where we adopt the standard setup of only considering questions with correct answers in test. The task is to choose the correct answer(s) from some candidates for a question. Measures: MAP and MRR.

Question Relation Match (QRM). We utilize WebQSP (Yih et al., 2016) dataset to create a large-scale relation detection task, benefitting from the availability of labeled semantic parses of questions. For each question, we (i) select the topic entity from the parse; (ii) select all the relations/relation chains (length 2) connecting to the topic entity; and (iii) set the relations/relation-chains in the labeled parse as positive and all the others as negative. Following Yih et al. (2016) and Xu et al. (2016), we formulate this task as a sequence matching problem. Ranking-loss is used for training. Measure: accuracy.

Path Query Answering (PQA) on the path query dataset released by Guu et al. (2015). It contains KB paths like $e_h, r_0, r_1, \ldots, r_t, e_t$, where head entity $e_h$ and relation sequence $r_0, r_1, \ldots, r_t$ are encoded to predict the tail entity $e_t$. There are 6,266,058/27,163/109,557 paths in train/dev/test, respectively. Measure:

Part-of-Speech Tagging on WSJ. We use the setup of (Blitzer et al., 2006; Petrov and McDon-