
Introduction to Statistical Machine Learning


Transcription of Introduction to Statistical Machine Learning

Introduction to Statistical Machine Learning
Marcus Hutter
Canberra, ACT, 0200, Australia
Machine Learning Summer School MLSS-2008, 2-15 March, Kioloa
ANU, RSISE, NICTA

Abstract
This course provides a broad introduction to the methods and practice of Statistical Machine Learning, which is concerned with the development of algorithms and techniques that learn from observed data by constructing stochastic models that can be used for making predictions and decisions. Topics covered include Bayesian inference and maximum likelihood modeling; regression, classification, density estimation, clustering, principal component analysis; parametric, semi-parametric, and non-parametric models; basis functions, neural networks, kernel methods, and graphical models; deterministic and stochastic optimization; overfitting, regularization, and ...

Table of Contents
- Intro / Overview / Preliminaries
- Linear Methods for Regression
- Nonlinear Regression
- Model Assessment & Selection
- ... & (Re)Active ...

1 INTRO/OVERVIEW/PRELIMINARIES

- What is Machine Learning?
- Why Learn?
- Related Fields
- Applications of Machine Learning
- Supervised / Unsupervised / Reinforcement Learning
- Dichotomies in Machine Learning
- Mini-Introduction to Probabilities

What is Machine Learning?
Machine Learning is concerned with the development of algorithms and techniques that allow computers to learn. Learning in this context is the process of gaining understanding by constructing models of observed data with the intention to use them for prediction.
Related fields:
- Artificial Intelligence: smart algorithms
- Statistics: inference from a sample
- Data Mining: searching through large volumes of data
- Computer Science: efficient algorithms and complex models

Why Learn?
There is no need to "learn" to calculate payroll. Learning is used when:
- human expertise does not exist (navigating on Mars),
- humans are unable to explain their expertise (speech recognition),
- the solution changes in time (routing on a computer network),
- the solution needs to be adapted to particular cases (user biometrics).
Example: It is easier to write a program that learns to play checkers or backgammon well by self-play than to convert the expertise of a master player into a program.

Handwritten Character Recognition
An example of a difficult Machine Learning problem.
Task: Learn the general mapping from pixel images to digits from examples.

Applications of Machine Learning
Machine Learning has a wide spectrum of applications, including: natural language processing, search engines, medical diagnosis, detecting credit card fraud, stock market analysis, bio-informatics, classifying DNA sequences, speech and handwriting recognition, object recognition in computer vision, playing games (learning by self-play: Checkers, Backgammon), robot ...

Some Fundamental Types of Learning
- Supervised Learning: Classification, Regression
- Unsupervised Learning: Association, Clustering, Density estimation
- Reinforcement Learning: Agents
- Others: Semi-Supervised Learning, Active Learning

Supervised Learning
- Prediction of future cases: use the rule to predict the output for future inputs.
- Knowledge extraction: the rule is easy to understand.
- Compression: the rule is simpler than the data it explains.
- Outlier detection: exceptions that are not covered by the rule, e.g. fraud.

Classification
Example: credit scoring. Differentiating between low-risk and high-risk customers from their Income and Savings.
Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
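The discriminant above is just a pair of threshold tests, so it can be coded directly. Below is a minimal Python sketch; the threshold values THETA1, THETA2 and the example customers are made up for illustration and are not taken from the lecture.

```python
# Minimal sketch of the credit-scoring discriminant above.
# THETA1, THETA2 and the example customers are illustrative values,
# not taken from the lecture.

THETA1 = 40_000   # income threshold (hypothetical)
THETA2 = 10_000   # savings threshold (hypothetical)

def classify(income, savings):
    """IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

customers = [(55_000, 20_000), (30_000, 5_000), (80_000, 2_000)]
for income, savings in customers:
    print(f"income={income}, savings={savings} -> {classify(income, savings)}")
```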

Regression
Example: price y = f(x) + noise of a used car as a function of its age x.

Unsupervised Learning
- Learning "what normally happens"
- No output
- Clustering: grouping similar instances
- Example applications: customer segmentation in CRM; image compression: color quantization; bioinformatics: learning motifs

Reinforcement Learning
- Learning a policy: a sequence of outputs
- No supervised output, but delayed reward
- Credit assignment problem
- Game playing
- Robot in a maze
- Multiple agents, partial observability, ...

Dichotomies in Machine Learning
Scope of my lecture vs. scope of other lectures:
- (machine) learning / statistical vs. logic/knowledge-based (GOFAI)
- induction ⇒ prediction vs. decision ⇒ action
- regression vs. classification
- independent identically distributed vs. sequential / non-iid
- online learning vs. offline/batch learning
- passive prediction vs. active learning
- parametric vs. non-parametric
- conceptual/mathematical vs. computational issues
- exact/principled vs. heuristic
- supervised learning vs. unsupervised / RL learning

Probability Basics
Probability is used to describe uncertain events; the chance or belief that something is or will be true.
Example: fair six-sided die:
- Sample space: Ω = {1, 2, 3, 4, 5, 6}
- Events: Even = {2, 4, 6}, Odd = {1, 3, 5} ⊆ Ω
- Probability: P(6) = 1/6, P(Even) = P(Odd) = 1/2
- Outcome: 6 ∈ E
- Conditional probability: P(6|Even) = P(6 and Even) / P(Even) = (1/6)/(1/2) = 1/3
General axioms:
- P({}) = 0 ≤ P(A) ≤ 1 = P(Ω)
- P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
- P(A ∩ B) = P(A|B) P(B)

Probability Jargon
Example: (un)fair coin: Ω = {Tail, Head} ≃ {0, 1}. P(1) = θ ∈ [0, 1]:
- Likelihood: P(1101|θ) = θ·θ·(1−θ)·θ = θ³(1−θ)
- Maximum Likelihood (ML) estimate: θ̂ = arg max_θ P(1101|θ) = 3/4
- Prior: if we are indifferent, then P(θ) = const.
- Evidence: P(1101) = Σ_θ P(1101|θ) P(θ) = 1/20 (actually ∫ dθ)
- Posterior: P(θ|1101) = P(1101|θ) P(θ) / P(1101) ∝ θ³(1−θ) (BAYES RULE!)
- Maximum a Posteriori (MAP) estimate: θ̂ = arg max_θ P(θ|1101) = 3/4
- Predictive distribution: P(1|1101) = P(11011)/P(1101) = 2/3
- Expectation: E[f|...] = Σ_θ f(θ) P(θ|...), e.g. E[θ|1101] = 2/3
- Variance: Var(θ) = E[(θ − Eθ)²|1101] = 2/63
- Probability density: P(θ) = (1/ε) P([θ, θ+ε]) for ε → 0
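The coin example can be checked numerically. The sketch below approximates the integrals over θ on a grid (uniform prior, data 1101); it is an illustration only, but it approximately reproduces the values quoted above: ML = MAP = 3/4, evidence = 1/20, E[θ|1101] = P(1|1101) = 2/3, Var = 2/63.

```python
# Numerical check of the coin example (data 1101, uniform prior over theta).
# Integrals over [0,1] are approximated by grid averages.
import numpy as np

theta = np.linspace(0.0, 1.0, 100_001)
likelihood = theta**3 * (1 - theta)        # P(1101|theta) = theta^3 (1 - theta)
prior = np.ones_like(theta)                # indifference: P(theta) = const = 1

evidence = (likelihood * prior).mean()              # ~ 1/20
posterior = likelihood * prior / evidence           # Bayes rule: P(theta|1101)
theta_ml = theta[np.argmax(likelihood)]             # ~ 3/4
theta_map = theta[np.argmax(posterior)]             # ~ 3/4 (equals ML for flat prior)
post_mean = (theta * posterior).mean()              # E[theta|1101] ~ 2/3
post_var = ((theta - post_mean)**2 * posterior).mean()   # ~ 2/63
predictive = post_mean                               # P(1|1101) = E[theta|1101]

print(theta_ml, theta_map, evidence, post_mean, post_var, predictive)
```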

2 LINEAR METHODS FOR REGRESSION
- Linear Regression
- Coefficient Subset Selection
- Coefficient Shrinkage
- Linear Methods for Classification
- Linear Basis Function Regression (LBFR)
- Piecewise linear, Splines, Wavelets
- Local Smoothing & Kernel Regression
- Regularization & 1D Smoothing Splines

Linear Regression
Fitting a linear function to the data:
- Input "feature" vector x := (1 ≡ x^(0), x^(1), ..., x^(d)) ∈ IR^(d+1)
- Real-valued noisy response y ∈ IR
- Linear regression model: ŷ = f_w(x) = w_0 x^(0) + ... + w_d x^(d)
- Data: D = ((x_1, y_1), ..., (x_n, y_n))
- Error or loss function, e.g. residual sum of squares: Loss(w) = Σ_{i=1}^n (y_i − f_w(x_i))²
- Least squares (LSQ) regression: ŵ = arg min_w Loss(w)
- Example: a person's weight y as a function of age x_1, ...

Coefficient Subset Selection
Problems with least squares regression if d is large:
- Overfitting: the plane fits the data well (perfectly for d ≥ n), but predicts (generalizes) badly.
- Interpretation: we want to identify a small subset of features that are important/relevant for prediction.
Solution 1: Subset selection: take those k out of d features that minimize the LSQ error.

Coefficient Shrinkage
Solution 2: Shrinkage methods: shrink the least squares ŵ by penalizing the Loss:
- Ridge regression: add λ‖w‖²₂
- Lasso: add λ‖w‖₁
Bayesian linear regression: compute the MAP estimate arg max_w P(w|D) from prior P(w) and sampling model P(D|w). Weights of low-variance components shrink most.
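A minimal numpy sketch of least squares and ridge shrinkage on synthetic data follows; the data-generating weights, noise level and λ are made-up illustration values, and for simplicity the bias weight w_0 is penalized together with the rest.

```python
# Sketch of least squares and ridge regression on synthetic data.
# True weights, noise level and lambda are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # x^(0) = 1 (bias)
w_true = np.array([0.5, 2.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Least squares: w = arg min_w sum_i (y_i - f_w(x_i))^2
w_lsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regression: add lambda * ||w||^2 to the loss
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ y)

print("LSQ:  ", w_lsq)
print("Ridge:", w_ridge)   # coefficients are shrunk toward zero
```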

Linear Methods for Classification
Example: Y = {spam, non-spam} ≃ {−1, 1} (or {0, 1}).
Reduction to regression: regard y ∈ IR ⇒ ŵ from linear regression.
Binary classification: if f_ŵ(x) > 0 then ŷ = 1 else ŷ = −1.
Probabilistic classification: predict the probability that a new x is in class y: P(y=1|x,D) / P(y=0|x,D) := f_ŵ(x).
Improvements: Linear Discriminant Analysis (LDA), Logistic Regression, Perceptron, Maximum Margin Hyperplane, Support Vector Machine. Generalization to ...

Linear Basis Function Regression (LBFR)
= powerful generalization of and reduction to linear regression!
Problem: the response y ∈ IR is often not linear in x ∈ IR^d.
Solution: transform x ⇝ φ(x) with φ : IR^d → IR^p; the response y ∈ IR is now linear in φ.
Examples:
- Linear regression: p = d and φ_i(x) = x^(i).
- Polynomial regression: d = 1 and φ_i(x) = x^i.
- Piecewise constant regression: d = 1 with φ_i(x) = 1 for i ≤ x < i+1 and 0 else.
- Piecewise polynomials ...
- Splines ...

[Figures: 2D Spline LBFR and 1D Symmlet-8 Wavelets]

Local Smoothing & Kernel Regression
Estimate f(x) by averaging the y_i for all x_i a-close to x:
f̂(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i)
Nearest-Neighbor kernel: K(x, x_i) = 1 if |x − x_i| < a and 0 else.
Generalization to other K, e.g. quadratic (Epanechnikov) kernel: K(x, x_i) = max{0, a² − (x − x_i)²}.
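The local smoothing estimator above is easy to implement. Below is a small Python sketch of f̂(x) = Σ_i K(x, x_i) y_i / Σ_i K(x, x_i) with the quadratic (Epanechnikov) kernel; the training data and the window width a are made up for illustration.

```python
# Sketch of the kernel regression estimator
#   f_hat(x) = sum_i K(x, x_i) y_i / sum_i K(x, x_i)
# with the Epanechnikov kernel K(x, x_i) = max{0, a^2 - (x - x_i)^2}.
# Training data and window width a are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0.0, 2 * np.pi, 100))
y_train = np.sin(x_train) + 0.2 * rng.normal(size=x_train.size)

def f_hat(x, a=0.5):
    K = np.maximum(0.0, a**2 - (x - x_train)**2)   # Epanechnikov weights
    if K.sum() == 0.0:                             # no x_i within distance a of x
        return np.nan
    return (K * y_train).sum() / K.sum()

for x in np.linspace(0.0, 2 * np.pi, 5):
    print(f"x={x:.2f}  f_hat(x)={f_hat(x):.3f}  sin(x)={np.sin(x):.3f}")
```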

Regularization & 1D Smoothing Splines
To avoid overfitting if the function class is large:
f̂ = arg min_f { Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f″(x))² dx }
- λ = 0 ⇒ f̂ = any function through the data
- λ = ∞ ⇒ f̂ = least squares line fit
- 0 < λ < ∞ ⇒ f̂ = piecewise cubic with continuous derivative

3 NONLINEAR REGRESSION
- Artificial Neural Networks
- Kernel Trick
- Maximum Margin Classifier
- Sparse Kernel Methods / SVMs

Artificial Neural Networks 1
ANNs as non-linear function approximators: single hidden layer feed-forward neural network.
- Hidden layer: z_j = h(Σ_{i=0}^d w^(1)_{ji} x_i)
- Output: f_{w,k}(x) = σ(Σ_{j=0}^M w^(2)_{kj} z_j)
- Sigmoidal activation functions h() and σ() ⇒ f is non-linear.
- Goal: find the network weights best modeling the data.
- Back-propagation algorithm: minimize Σ_{i=1}^n ‖y_i − f_w(x_i)‖² by gradient descent.

Artificial Neural Networks 2
- Avoid overfitting by early stopping or small M.
- Avoid local minima by random initial weights or stochastic gradient descent.
Example: image processing.

Kernel Trick
The kernel trick allows one to reduce the functional minimization to a finite-dimensional optimization:
- Let L(y, f(x)) be any loss function and J(f) be a penalty quadratic in f.
- Then the minimum of the penalized loss Σ_{i=1}^n L(y_i, f(x_i)) + λ J(f) has the form f(x) = Σ_{i=1}^n α_i K(x, x_i),
- with α minimizing Σ_{i=1}^n L(y_i, (Kα)_i) + λ α⊤Kα,
- and kernel K_ij = K(x_i, x_j) following from J.

Maximum Margin Classifier
ŵ = arg max_{w: ‖w‖=1} min_i { y_i (w⊤ φ(x_i)) }, with y ∈ {−1, 1}.
- Linear boundary for φ_b(x) = x^(b).
- The boundary is determined by the Support Vectors (circled data points).
- The margin is negative if the classes are not separable.

Sparse Kernel Methods / SVMs
Non-linear boundary for general φ_b(x):
- ŵ = Σ_{i=1}^n a_i φ(x_i) for some a.
- f(x) = ŵ⊤ φ(x) = Σ_{i=1}^n a_i K(x_i, x) depends on φ only via the kernel K(x_i, x) = Σ_{b=1}^d φ_b(x_i) φ_b(x).
- Huge time savings if d ≫ n.
Example kernels K(x, x′): polynomial (1 + ⟨x, x′⟩)^d, Gaussian exp(−‖x − x′‖²₂), neural network tanh(κ⟨x, x′⟩).
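As a concrete instance of the kernel trick, take the squared loss L(y, f) = (y − f)². Then minimizing Σ_i (y_i − (Kα)_i)² + λ α⊤Kα gives α = (K + λI)⁻¹ y (assuming K is invertible), i.e. kernel ridge regression. The sketch below uses a Gaussian kernel; the data, bandwidth σ and λ are made-up illustration values, not part of the lecture.

```python
# Kernel trick with squared loss (kernel ridge regression): for
# L(y, f) = (y - f)^2, the alpha minimizing sum_i (y_i - (K alpha)_i)^2
# + lambda * alpha^T K alpha is alpha = (K + lambda I)^{-1} y (K invertible),
# and f(x) = sum_i alpha_i K(x, x_i). Data and hyperparameters are illustrative.
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_new = np.array([[0.0], [0.5]])
f_new = gaussian_kernel(X_new, X) @ alpha    # f(x) = sum_i alpha_i K(x, x_i)
print(f_new)                                 # predictions close to sin(3x)
```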

4 MODEL ASSESSMENT & SELECTION
- Example: Polynomial Regression
- Training=Empirical versus Test=True Error
- Empirical Model Selection
- Theoretical Model Selection
- The Loss Rank Principle for Model Selection

Example: Polynomial Regression
- A straight line does not fit the data well (large training error) ⇒ high bias ⇒ poor predictive performance.
- A high-order polynomial fits the data perfectly (zero training error) ⇒ high variance (overfitting) ⇒ poor prediction too!
- A reasonable polynomial degree d performs well. How to select d? Minimizing the training error obviously does not work.
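The bias/variance behaviour described above can be reproduced in a few lines: fit polynomials of increasing degree d to a small training sample and compare the empirical error Err_D(f) with the error on a large held-out sample from the same distribution (a proxy for Err_P(f)). The data-generating function and sample sizes below are made up for illustration.

```python
# Training (empirical) error Err_D vs. an estimate of the true error Err_P
# for polynomials of increasing degree d. Data and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(20)
x_test, y_test = sample(1000)   # large fresh sample approximates Err_P(f)

for d in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, d)                 # least squares fit
    err_train = np.mean((y_train - np.polyval(coeffs, x_train))**2)
    err_test = np.mean((y_test - np.polyval(coeffs, x_test))**2)
    print(f"d={d}  Err_D={err_train:.3f}  Err_P~{err_test:.3f}")
```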

Training=Empirical versus Test=True Error
- Learn a functional relation f for data D = {(x_1, y_1), ..., (x_n, y_n)}.
- We can compute the empirical error on past data: Err_D(f) = (1/n) Σ_{i=1}^n (y_i − f(x_i))².
- Assume the data D is a sample from some distribution P.
- We want to know the expected true error on future examples: Err_P(f) = E_P[(y − f(x))²].
- How good an estimate of Err_P(f) is Err_D(f)?
- Problem: Err_D(f) decreases with increasing model complexity, but not Err_P(f).

Empirical Model Selection
How to select the complexity parameter: kernel width a, penalization constant λ, number k of nearest neighbors, or the polynomial degree d?
Empirical test-set-based methods: regress on the training set and minimize the empirical error w.r.t. the complexity parameter (a, λ, k, d) on a separate test set.
Refinements: cross-validation, bootstrap, ...

Theoretical Model Selection
How to select the complexity or flexibility or smoothness parameter: kernel width a, penalization constant λ, number k of nearest neighbors, or the polynomial degree d?
For parametric regression with d parameters: Bayesian model selection, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Minimum Description Length (MDL). They all add a penalty proportional to d to the loss.
For non-parametric linear regression: add the trace of the on-data regressor (= effective number of parameters) to the loss.
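Cross-validation, one of the refinements mentioned on the Empirical Model Selection slide, can be sketched as follows for choosing the polynomial degree d; the data and the number of folds are made up for illustration.

```python
# Empirical model selection of the polynomial degree d by K-fold
# cross-validation. Data and fold count are illustrative.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(3 * x) + 0.3 * rng.normal(size=60)

def cv_error(x, y, d, folds=5):
    idx = rng.permutation(len(x))
    errs = []
    for test_idx in np.array_split(idx, folds):
        train_idx = np.setdiff1d(idx, test_idx)
        coeffs = np.polyfit(x[train_idx], y[train_idx], d)
        errs.append(np.mean((y[test_idx] - np.polyval(coeffs, x[test_idx]))**2))
    return float(np.mean(errs))

scores = {d: cv_error(x, y, d) for d in range(1, 10)}
best_d = min(scores, key=scores.get)
print(scores)
print("selected degree:", best_d)
```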

