### Transcription of Product-based Neural Networks for User Response Prediction

**Product-based Neural Networks for User Response Prediction**

Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu (Shanghai Jiao Tong University); Ying Wen, Jun Wang (University College London). arXiv preprint, 1 Nov 2016.

*Abstract* — Predicting user responses, such as clicks and conversions, is of great importance and has found its usage in many Web applications including recommender systems, web search and online advertising. The data in those applications is mostly categorical and contains multiple fields; a typical representation is to transform it into a high-dimensional sparse binary feature representation via one-hot encoding. Facing the extreme sparsity, traditional models may limit their capacity to mine shallow patterns from the data, i.e. low-order feature combinations. Deep models such as deep neural networks, on the other hand, cannot be directly applied to the high-dimensional input because of the huge feature space. In this paper, we propose Product-based Neural Networks (PNN) with an embedding layer to learn a distributed representation of the categorical data, a product layer to capture interactive patterns between inter-field categories, and further fully connected layers to explore high-order feature interactions. Our experimental results on two large-scale real-world ad click datasets demonstrate that PNNs consistently outperform the state-of-the-art models on various metrics.

I. INTRODUCTION

Learning and predicting user response now plays a crucial role in many personalization tasks in information retrieval (IR), such as recommender systems, web search and online advertising. The goal of user response prediction is to estimate the probability that the user will provide a predefined positive response (e.g., clicks or purchases) in a given context [1]. This predicted probability indicates the user's interest in a specific item, such as a news article, a commercial item or an advertising post, and influences subsequent decision making such as document ranking [2] and ad bidding [3].

The data collection in these IR tasks is mostly in a multi-field categorical form, for example [Weekday=Tuesday, Gender=Male, City=London], which is normally transformed into high-dimensional sparse binary features via one-hot encoding [4]. For example, the three field vectors above are one-hot encoded and concatenated as

    [0, 1, 0, 0, 0, 0, 0]   [0, 1]        [0, 0, 1, 0, ..., 0, 0]
     Weekday=Tuesday         Gender=Male   City=London

Many machine learning models, including linear logistic regression [5], non-linear gradient boosting decision trees [4] and factorization machines [6], have been proposed to work on such high-dimensional sparse binary features and produce high-quality user response predictions. However, these models highly depend on feature engineering in order to capture high-order latent patterns [7].

Recently, deep neural networks (DNNs) [8] have shown great capability in classification and regression tasks, including computer vision [9], speech recognition [10] and natural language processing [11]. It is promising to adopt DNNs for user response prediction, since DNNs could automatically learn more expressive feature representations and deliver better prediction performance. To improve interaction modeling over multi-field categorical data, [12] presented an embedding methodology based on pre-training a factorization machine. Based on the concatenated embedding vectors, multi-layer perceptrons (MLPs) were built to explore feature interactions. However, the quality of the embedding initialization is largely limited by the factorization machine. More importantly, the "add" operations of the perceptron layer may not be useful for exploring the interactions of categorical data across multiple fields. Previous work [1], [6] has shown that local dependencies between features from different fields can be effectively explored by feature-vector "product" operations rather than "add" operations.

To utilize the learning ability of neural networks and mine the latent patterns of data more effectively than MLPs, in this paper we propose the Product-based Neural Network (PNN), which (i) starts from an embedding layer without the pre-training used in [12], (ii) builds a product layer on the embedded feature vectors to model inter-field feature interactions, and (iii) further distills high-order feature patterns with fully connected MLPs. We present two types of PNNs, with inner-product and outer-product operations in the product layer, to efficiently model the interactive patterns.

We take CTR estimation in online advertising as the working example to explore the learning ability of our PNN model. Extensive experimental results on two large-scale real-world datasets demonstrate the consistent superiority of our model over state-of-the-art user response prediction models [6], [13], [12] on various metrics.

II. RELATED WORK

The response prediction problem is normally formulated as a binary classification problem with prediction likelihood or cross entropy as the training objective [14]. Area under the ROC curve (AUC) and relative information gain (RIG) are common evaluation metrics for response prediction accuracy [15]. From the modeling perspective, linear logistic regression (LR) [5], [16], non-linear gradient boosting decision trees (GBDT) [4] and factorization machines (FM) [6] are widely used in industrial applications. However, these models are limited in mining high-order latent patterns or learning quality feature representations.

Deep learning is able to explore high-order latent patterns as well as generalize expressive data representations [11]. The input data of DNNs are usually dense real vectors, whereas the handling of multi-field categorical data has not been well studied. Factorization-machine supported neural networks (FNN) were proposed in [12] to learn embedding vectors of categorical data via a pre-trained FM.
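As a concrete illustration of the multi-field one-hot encoding described above, and of the field-wise embedding lookup that FNN-style models build on top of it, the following is a minimal sketch; the field vocabularies and the embedding size are invented for the example, not taken from the paper:

```python
import numpy as np

# Hypothetical field vocabularies (sizes invented for illustration).
fields = {
    "Weekday": ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
    "Gender": ["Female", "Male"],
    "City": ["Beijing", "Shanghai", "London", "Paris"],
}

def one_hot_encode(sample):
    """Concatenate per-field one-hot vectors into one sparse binary vector."""
    parts = []
    for name, vocab in fields.items():
        v = np.zeros(len(vocab), dtype=np.int8)
        v[vocab.index(sample[name])] = 1  # exactly one active bit per field
        parts.append(v)
    return np.concatenate(parts)

x = one_hot_encode({"Weekday": "Tue", "Gender": "Male", "City": "London"})
# The vector length is the sum of the vocabulary sizes; only N bits are set.

# Field-wise embedding: each field's one-hot selects one row of that
# field's embedding matrix, yielding one dense vector per field.
rng = np.random.default_rng(0)
embed_dim = 4  # invented embedding size
embeddings = {name: rng.normal(size=(len(vocab), embed_dim))
              for name, vocab in fields.items()}

def embed(sample):
    return [embeddings[name][fields[name].index(sample[name])]
            for name in fields]

f = embed({"Weekday": "Tue", "Gender": "Male", "City": "London"})
# f is a list of N dense field vectors, ready for concatenation
# (FNN-style) or for pairwise products (PNN-style).
```

In an FNN-style model the dense field vectors would simply be concatenated and fed to an MLP; the point of PNN, developed below, is to combine them with explicit product operations first.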

The Convolutional Click Prediction Model (CCPM) was proposed in [13] to predict ad clicks with convolutional neural networks (CNNs). However, in CCPM the convolutions are only performed on neighboring fields in a certain alignment, which fails to model the full interactions among non-neighboring features. Recurrent neural networks (RNNs) were leveraged to model user queries as a series of user contexts to predict ad click behavior [17]. The product unit neural network (PUNN) [18] was proposed to build high-order combinations of the inputs. However, PUNN can neither learn local dependencies nor produce bounded outputs to fit the response rate.

In this paper, we demonstrate the way our PNN models learn local dependencies and high-order feature interactions.

III. DEEP LEARNING FOR CTR ESTIMATION

We take CTR estimation in online advertising [14] as a working example to formulate our model and explore its performance on various metrics. The task is to build a prediction model to estimate the probability of a user clicking a specific ad in a given context.

Each data sample consists of multiple fields of categorical data such as user information (City, Hour, etc.), publisher information (Domain, Ad slot, etc.) and ad information (Ad creative ID, Campaign ID, etc.) [19]. All the information is represented as a multi-field categorical feature vector.

Fig. 1: Product-based Neural Network Architecture. [Figure omitted: input fields Field 1, Field 2, ..., Field N feed an embedding layer, then a product layer, then fully connected hidden layers producing the output.]

A. Product-based Neural Network

The architecture of the PNN model is illustrated in Fig. 1. From a top-down perspective, the output of PNN is a real number $\hat{y} \in (0, 1)$ as the predicted CTR:

$$\hat{y} = \sigma(W_3 l_2 + b_3), \qquad (1)$$

where $W_3 \in \mathbb{R}^{1 \times D_2}$ and $b_3 \in \mathbb{R}$ are the parameters of the output layer, $l_2 \in \mathbb{R}^{D_2}$ is the output of the second hidden layer, and $\sigma(x)$ is the sigmoid activation function $\sigma(x) = 1/(1 + e^{-x})$. We use $D_i$ to denote the dimension of the $i$-th hidden layer.

The output $l_2$ of the second hidden layer is constructed as

$$l_2 = \mathrm{relu}(W_2 l_1 + b_2), \qquad (2)$$

where $l_1 \in \mathbb{R}^{D_1}$ is the output of the first hidden layer. The rectified linear unit, defined as $\mathrm{relu}(x) = \max(0, x)$, is chosen as the activation function for the hidden layer outputs since it offers outstanding performance and efficient computation.

The first hidden layer is fully connected with the product layer. Its inputs consist of linear signals $l_z$ and quadratic signals $l_p$.
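The top-down computation of Equations (1)–(2) can be sketched end to end. The transcription cuts off before the construction of $l_1$ from the linear signals $l_z$ and the quadratic signals $l_p$ is specified, so the sketch below assumes the inner-product variant, where the quadratic signals are pairwise inner products of the field embeddings; all field counts, dimensions, and weights are invented for illustration:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

N, M = 3, 4    # number of fields and embedding dimension (invented)
D1, D2 = 8, 8  # hidden layer dimensions D_1, D_2 (invented)

# Embedded feature vectors f_1..f_N (output of the embedding layer).
f = [rng.normal(size=M) for _ in range(N)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# Product layer: linear signals from the embeddings themselves, and
# quadratic signals from pairwise products -- here inner products
# (the inner-product variant; the paper also defines an outer-product one).
z = np.concatenate(f)                                            # length N*M
p = np.array([np.dot(fi, fj) for fi, fj in combinations(f, 2)])  # N(N-1)/2 pairs

# First hidden layer: fully connected with the product layer
# (weights invented; the exact l_1 construction is assumed here).
Wz = rng.normal(size=(D1, z.size))
Wp = rng.normal(size=(D1, p.size))
b1 = np.zeros(D1)
l1 = relu(Wz @ z + Wp @ p + b1)

# Eq. (2): second hidden layer.
W2, b2 = rng.normal(size=(D2, D1)), np.zeros(D2)
l2 = relu(W2 @ l1 + b2)

# Eq. (1): output layer, predicted CTR in (0, 1).
W3, b3 = rng.normal(size=(1, D2)), 0.0
y_hat = sigmoid(W3 @ l2 + b3).item()
```

The sigmoid at the top bounds the output to (0, 1), which is exactly the property the paper notes PUNN lacks when fitting a response rate.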