
Neural Networks and Introduction to Deep Learning


1 Introduction

Deep learning is a set of learning methods attempting to model data with complex architectures combining different non-linear transformations. The elementary bricks of deep learning are the neural networks, which are combined to form the deep neural networks. These techniques have enabled significant progress in the fields of sound and image processing, including facial recognition, speech recognition, computer vision, automated language processing, and text classification (for example spam recognition).

Potential applications are very numerous. A spectacular example is the AlphaGo program, which learned to play the game of Go by the deep learning method, and beat the world champion in 2016.

There exist several types of architectures for neural networks:

- The multilayer perceptrons, which are the oldest and simplest ones.
- The Convolutional Neural Networks (CNN), particularly adapted for image processing.
- The recurrent neural networks, used for sequential data such as text or time series.

These models are based on deep cascades of layers. They need clever stochastic optimization algorithms, careful initialization, and also a clever choice of the structure.

They lead to very impressive results, although very few theoretical foundations are available till now.

The main references for this course are:

- Ian Goodfellow, Yoshua Bengio and Aaron Courville: Deep Learning, MIT Press.
- Bishop (1995): Neural Networks for Pattern Recognition, Oxford University Press.
- The Elements of Statistical Learning by T. Hastie et al. [3].
- Hugo Larochelle (Sherbrooke): http…larocheh/
- Christopher Olah's blog.
- Deep learning course, Charles Ollion and Olivier Grisel.

2 Neural networks

An artificial neural network is an application, non linear with respect to its parameters $\theta$, that associates to an entry $x$ an output $y = f(x, \theta)$.

For the sake of simplicity, we assume that $y$ is unidimensional, but it could also be multidimensional. This application $f$ has a particular form that we will make precise below. Neural networks can be used for regression or classification. As usual in statistical learning, the parameters $\theta$ are estimated from a learning sample. The function to minimize is not convex, leading to local minimizers. The success of the method came from a universal approximation theorem due to Cybenko (1989) and Hornik (1991). Moreover, Le Cun (1986) proposed an efficient way to compute the gradient of a neural network, called backpropagation of the gradient, that allows to obtain a local minimizer of the quadratic criterion.

2.1 Artificial neuron

An artificial neuron is a function $f_j$ of the input $x = (x_1, \dots, x_d)$, weighted by a vector of connection weights $w_j = (w_{j,1}, \dots, w_{j,d})$, completed by a neuron bias $b_j$, and associated to an activation function $\phi$, namely

$$y_j = f_j(x) = \phi(\langle w_j, x \rangle + b_j).$$

Several activation functions can be considered:

- The identity function $\phi(x) = x$.
- The sigmoid (or logistic) function $\phi(x) = \frac{1}{1 + \exp(-x)}$.
- The hyperbolic tangent function ("tanh") $\phi(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} = \frac{\exp(2x) - 1}{\exp(2x) + 1}$.
- The hard threshold function $\phi_\beta(x) = \mathbf{1}_{x \geq \beta}$.
- The Rectified Linear Unit (ReLU) activation function $\phi(x) = \max(0, x)$.

Here is a schematic representation of an artificial neuron, where $\Sigma = \langle w_j, x \rangle + b_j$:

[Figure 1: schematic representation of an artificial neuron (source: andrewjames…)]

Figure 2 represents the activation functions described above.

[Figure 2: Activation functions]
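As an illustration, here is a minimal NumPy sketch of an artificial neuron with the activation functions listed above. The function and variable names (`neuron`, `w`, `b`, `phi`) are our choices for this sketch, not notation from the course.

```python
import numpy as np

# Activation functions listed above
def identity(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)  # equals (exp(2x) - 1) / (exp(2x) + 1)

def hard_threshold(x, beta=0.0):
    return (x >= beta).astype(float)

def relu(x):
    return np.maximum(0.0, x)

def neuron(x, w, b, phi=sigmoid):
    """Artificial neuron: y = phi(<w, x> + b)."""
    return phi(np.dot(w, x) + b)

# Example with a 3-dimensional input
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neuron(x, w, b, phi=relu))
```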

Historically, the sigmoid was the most widely used activation function, since it is differentiable and allows to keep values in the interval $[0, 1]$. Nevertheless, it is problematic, since its gradient is very close to $0$ when $|x|$ is not close to $0$. Figure 3 represents the sigmoid function and its derivative.

[Figure 3: Sigmoid function (in black) and its derivative (in red)]

For neural networks with a high number of layers (which is the case for deep learning), this causes trouble for the backpropagation algorithm used to estimate the parameters (backpropagation is explained in the following). This is why the sigmoid function was supplanted by the rectified linear function. This function is not differentiable at $0$, but in practice this is not really a problem, since the probability of having an entry exactly equal to $0$ is generally null.
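To see this vanishing-gradient issue numerically, the following short sketch (ours, not from the course) evaluates the derivative of the sigmoid, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, away from the origin:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # sigma'(x) = sigma(x) * (1 - sigma(x))

# The derivative peaks at 0.25 (for x = 0) and vanishes quickly as |x|
# grows, which starves the backpropagated gradient in deep sigmoid networks.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"sigma'({x}) = {sigmoid_prime(x):.2e}")
```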

The ReLU function also has a sparsification effect. The ReLU function and its derivative are equal to $0$ for negative values, and no information can be obtained in this case for such a unit; this is why it is advised to add a small positive bias to ensure that each unit is active. Several variations of the ReLU function are considered to make sure that all units have a non-vanishing gradient and that for $x < 0$ the derivative is not equal to $0$, namely

$$\phi(x) = \max(x, 0) + \alpha \min(x, 0),$$

where $\alpha$ is either a fixed parameter set to a small positive value, or a parameter to estimate.
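This variation can be sketched as follows (the helper name `leaky_relu` is ours); with $\alpha$ fixed this is commonly called the leaky ReLU, and with $\alpha$ estimated during training it is the parametric ReLU (PReLU):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """phi(x) = max(x, 0) + alpha * min(x, 0).

    alpha is either a small positive constant (leaky ReLU) or a
    parameter estimated during training (parametric ReLU).
    """
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))  # [-0.03 -0.005 0. 2.]: non-zero slope for x < 0
```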

2.2 Multilayer perceptron

A multilayer perceptron (or neural network) is a structure composed of several hidden layers of neurons, where the output of a neuron of a layer becomes the input of a neuron of the next layer. Moreover, the output of a neuron can also be the input of a neuron of the same layer or of a neuron of a previous layer (this is the case for recurrent neural networks). On the last layer, called the output layer, we may apply an activation function different from the one used for the hidden layers, depending on the type of problem we have at hand: regression or classification. Figure 4 represents a neural network with three input variables, one output variable, and two hidden layers.

[Figure 4: A basic neural network (source: …)]

Multilayer perceptrons have a basic architecture, since each unit (or neuron) of a layer is linked to all the units of the next layer but has no link with the neurons of the same layer.

The parameters of the architecture are the number of hidden layers and the number of neurons in each layer. The activation functions also have to be chosen by the user. For the output layer, as mentioned previously, the activation function is generally different from the one used on the hidden layers. In the case of regression, we apply no activation function on the output layer. For binary classification, the output gives a prediction of $P(Y = 1 \mid X)$; since this value is in $[0, 1]$, the sigmoid activation function is generally used. For multi-class classification, the output layer contains one neuron per class $i$, giving a prediction of $P(Y = i \mid X)$.

The sum of all these values has to be equal to $1$. The multidimensional function softmax is generally used:

$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.$$

Let us summarize the mathematical formulation of a multilayer perceptron with $L$ hidden layers. We set $h^{(0)}(x) = x$. For $k = 1, \dots, L$ (hidden layers),

$$a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x),$$
$$h^{(k)}(x) = \phi(a^{(k)}(x)).$$

For $k = L + 1$ (output layer),

$$a^{(L+1)}(x) = b^{(L+1)} + W^{(L+1)} h^{(L)}(x),$$
$$h^{(L+1)}(x) = \psi(a^{(L+1)}(x)) := f(x, \theta),$$

where $\phi$ is the activation function and $\psi$ is the output layer activation function (for example softmax for multiclass classification).
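This formulation translates directly into a forward pass. Below is a minimal sketch of $f(x, \theta)$ for a multilayer perceptron with ReLU hidden activations and a softmax output; the function names, layer sizes, and random initialization are illustrative choices of ours, not prescriptions from the course.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - np.max(z)       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mlp_forward(x, weights, biases, phi=relu, psi=softmax):
    """h(0) = x; hidden layers: h(k) = phi(b(k) + W(k) h(k-1));
    output layer: f(x, theta) = psi(b(L+1) + W(L+1) h(L))."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):   # k = 1, ..., L
        h = phi(b + W @ h)
    W, b = weights[-1], biases[-1]                # k = L + 1
    return psi(b + W @ h)

# Example: 3 inputs, two hidden layers of 4 units, 3 output classes.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 3]
weights = [rng.normal(0.0, 0.1, size=(m, n))
           for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

p = mlp_forward(np.array([0.2, -1.0, 0.5]), weights, biases)
print(p, p.sum())  # predicted class probabilities, summing to 1
```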

