### Transcription of A arXiv:1609.02907v4 [cs.LG] 22 Feb 2017

1 Published as a conference paper at ICLR 2017 SEMI-SUPERVISEDCLASSIFICATION WITHGRAPHCONVOLUTIONALNETWORKST homas N. KipfUniversity of WellingUniversity of AmsterdamCanadian Institute for Advanced Research present a scalable approach for semi-supervised learning on graph-structureddata that is based on an efficient variant of convolutional neural networks whichoperate directly on graphs. We motivate the choice of our convolutional archi-tecture via a localized first-order approximation of spectral graph model scales linearly in the number of graph edges and learns hidden layerrepresentations that encode both local graph structure and features of nodes. Ina number of experiments on citation networks and on a knowledge graph datasetwe demonstrate that our approach outperforms related methods by a consider the problem of classifying nodes (such as documents) in a graph (such as a citationnetwork), where labels are only available for a small subset of nodes.

2 This problem can be framedas graph-based semi-supervised learning, where label information is smoothed over the graph viasome form of explicit graph-based regularization (Zhu et al., 2003; Zhou et al., 2004; Belkin et al.,2006; Weston et al., 2012), by using a graph Laplacian regularization term in the loss function:L=L0+ Lreg,withLreg= i,jAij f(Xi) f(Xj) 2=f(X)> f(X).(1)Here,L0denotes the supervised loss the labeled part of the graph,f( )can be a neural network-like differentiable function, is a weighing factor andXis a matrix of node feature vectorsXi. =D Adenotes the unnormalized graph Laplacian of an undirected graphG= (V,E)withNnodesvi V, edges(vi,vj) E, an adjacency matrixA RN N(binary or weighted) anda degree matrixDii= jAij. The formulation of Eq. 1 relies on the assumption that connectednodes in the graph are likely to share the same label.

3 This assumption, however, might restrictmodeling capacity, as graph edges need not necessarily encode node similarity, but could containadditional this work, we encode the graph structure directly using a neural network modelf(X,A)andtrain on a supervised targetL0for all nodes with labels, thereby avoiding explicit graph-basedregularization in the loss function. Conditioningf( )on the adjacency matrix of the graph willallow the model to distribute gradient information from the supervised lossL0and will enable it tolearn representations of nodes both with and without contributions are two-fold. Firstly, we introduce a simple and well-behaved layer-wise prop-agation rule for neural network models which operate directly on graphs and show how it can bemotivated from a first-order approximation of spectral graph convolutions (Hammond et al.)

4 , 2011).Secondly, we demonstrate how this form of a graph-based neural network model can be used forfast and scalable semi-supervised classification of nodes in a graph. Experiments on a number ofdatasets demonstrate that our model compares favorably both in classification accuracy and effi-ciency (measured in wall-clock time) against state-of-the-art methods for semi-supervised [ ] 22 Feb 2017 Published as a conference paper at ICLR 20172 FASTAPPROXIMATECONVOLUTIONS ONGRAPHSIn this section, we provide theoretical motivation for a specific graph-based neural network modelf(X,A)that we will use in the rest of this paper. We consider a multi-layer Graph ConvolutionalNetwork (GCN) with the following layer-wise propagation rule:H(l+1)= ( D 12 A D 12H(l)W(l)).(2)Here, A=A+INis the adjacency matrix of the undirected graphGwith added the identity matrix, Dii= j AijandW(l)is a layer-specific trainable weight matrix.

5 ( )denotes an activation function, such as theReLU( ) = max(0, ).H(l) RN Dis the matrix of ac-tivations in thelthlayer;H(0)=X. In the following, we show that the form of this propagation rulecan be motivated1via a first-order approximation of localized spectral filters on graphs (Hammondet al., 2011; Defferrard et al., 2016). consider spectral convolutions on graphs defined as the multiplication of a signalx RN(ascalar for every node) with a filterg =diag( )parameterized by RNin the Fourier domain, :g ? x=Ug U>x,(3)whereUis the matrix of eigenvectors of the normalized graph LaplacianL=IN D 12AD 12=U U>, with a diagonal matrix of its eigenvalues andU>xbeing the graph Fourier transformofx. We can understandg as a function of the eigenvalues ofL, ( ). Evaluating Eq. 3 iscomputationally expensive, as multiplication with the eigenvector matrixUisO(N2).

6 Furthermore,computing the eigendecomposition ofLin the first place might be prohibitively expensive for largegraphs. To circumvent this problem, it was suggested in Hammond et al. (2011) thatg ( )can bewell-approximated by a **truncated** expansion in terms of Chebyshev polynomialsTk(x)up toKthorder:g ( ) K k=0 kTk( ),(4)with a rescaled =2 max IN. maxdenotes the largest eigenvalue ofL. RKis now avector of Chebyshev coefficients. The Chebyshev polynomials are recursively defined asTk(x) =2xTk 1(x) Tk 2(x), withT0(x) = 1andT1(x) =x. The reader is referred to Hammond et al.(2011) for an in-depth discussion of this back to our definition of a convolution of a signalxwith a filterg , we now have:g ? x K k=0 kTk( L)x,(5)with L=2 maxL IN; as can easily be verified by noticing that(U U>)k=U kU>.

7 Note thatthis expression is nowK-localized since it is aKth-order polynomial in the Laplacian, it dependsonly on nodes that are at maximumKsteps away from the central node (Kth-order neighborhood).The complexity of evaluating Eq. 5 isO(|E|), linear in the number of edges. Defferrard et al.(2016) use thisK-localized convolution to define a convolutional neural network on neural network model based on graph convolutions can therefore be built by stacking multipleconvolutional layers of the form of Eq. 5, each layer followed by a point-wise non-linearity. Now,imagine we limited the layer-wise convolution operation toK= 1(see Eq. 5), a function that islinear therefore a linear function on the graph Laplacian provide an alternative interpretation of this propagation rule based on the Weisfeiler-Lehman algorithm(Weisfeiler & Lehmann, 1968) in Appendix as a conference paper at ICLR 2017In this way, we can still recover a rich class of convolutional filter functions by stacking multiplesuch layers, but we are not limited to the explicit parameterization given by, , the Chebyshevpolynomials.

8 We intuitively expect that such a model can alleviate the problem of overfitting onlocal neighborhood structures for graphs with very wide node degree distributions, such as socialnetworks, citation networks, knowledge graphs and many other real-world graph datasets. Addition-ally, for a fixed computational budget, this layer-wise linear formulation allows us to build deepermodels, a practice that is known to improve modeling capacity on a number of domains (He et al.,2016).In this linear formulation of a GCN we further approximate max 2, as we can expect that neuralnetwork parameters will adapt to this change in scale during training. Under these approximationsEq. 5 simplifies to:g ? x 0x+ 1(L IN)x= 0x 1D 12AD 12x,(6)with two free parameters 0and 1. The filter parameters can be shared over the whole application of filters of this form then effectively convolve thekth-order neighborhood ofa node, wherekis the number of successive filtering operations or convolutional layers in the neuralnetwork practice, it can be beneficial to constrain the number of parameters further to address overfittingand to minimize the number of operations (such as matrix multiplications) per layer.

9 This leaves uswith the following expression:g ? x (IN+D 12AD 12)x,(7)with a single parameter = 0= 1. Note thatIN+D 12AD 12now has eigenvalues inthe range[0,2]. Repeated application of this operator can therefore lead to numerical instabilitiesand exploding/vanishing gradients when used in a deep neural network model. To alleviate thisproblem, we introduce the followingrenormalization trick:IN+D 12AD 12 D 12 A D 12, with A=A+INand Dii= j can generalize this definition to a signalX RN CwithCinput channels ( aC-dimensionalfeature vector for every node) andFfilters or feature maps as follows:Z= D 12 A D 12X ,(8)where RC Fis now a matrix of filter parameters andZ RN Fis the convolved signalmatrix. This filtering operation has complexityO(|E|FC), as AXcan be efficiently implementedas a product of a sparse matrix with a dense introduced a simple, yet flexible modelf(X,A)for efficient information propagation ongraphs, we can return to the problem of semi-supervised node classification.

10 As outlined in the in-troduction, we can relax certain assumptions typically made in graph-based semi-supervised learn-ing by conditioning our modelf(X,A)both on the dataXand on the adjacency matrixAof theunderlying graph structure. We expect this setting to be especially powerful in scenarios where theadjacency matrix contains information not present in the dataX, such as citation links between doc-uments in a citation network or relations in a knowledge graph. The overall model, a multi-layerGCN for semi-supervised learning, is schematically depicted in Figure the following, we consider a two-layer GCN for semi-supervised node classification on a graphwith a symmetric adjacency matrixA(binary or weighted). We first calculate A= D 12 A D 12ina pre-processing step. Our forward model then takes the simple form:Z=f(X,A) = softmax( AReLU( AXW(0))W(1)).