AN INTRODUCTION TO MACHINE LEARNING

WITH APPLICATIONS IN R

MICHAEL CLARK
CENTER FOR SOCIAL RESEARCH
UNIVERSITY OF NOTRE DAME

Contents

Preface
Introduction: Explanation & Prediction
Some Terminology
Tools You Already Have
The Standard Linear Model
Logistic Regression
Expansions of Those Tools
Generalized Linear Models
Generalized Additive Models
The Loss Function
Continuous Outcomes
Squared Error
Absolute Error
Negative Log-likelihood
R Example
Categorical Outcomes
Misclassification
Binomial Log-likelihood
Exponential
Hinge Loss
Regularization
R Example
Bias-Variance Tradeoff
Bias & Variance
The Tradeoff
Diagnosing Bias-Variance Issues & Possible Solutions
Worst Case Scenario
High Variance
High Bias
Cross-Validation
Adding Another Validation Set
K-fold Cross-Validation
Leave-one-out Cross-Validation
Bootstrap
Other Stuff
Model Assessment & Selection
Beyond Classification Accuracy: Other Measures of Performance

Process Overview
Data Preparation
Define Data and Data Partitions
Feature Scaling
Feature Engineering
Discretization
Model Selection
Model Assessment
Opening the Black Box
The Dataset
R Implementation
Feature Selection & The Data Partition
k-nearest Neighbors
Strengths & Weaknesses
Neural Nets
Strengths & Weaknesses
Trees & Forests
Strengths & Weaknesses
Support Vector Machines
Strengths & Weaknesses
Other
Unsupervised Learning
Clustering
Latent Variable Models
Graphical Structure
Imputation
Ensembles
Bagging
Boosting
Stacking
Feature Selection & Importance
Textual Analysis
Bayesian Approaches
More Stuff
Summary
Cautionary Notes
Some Guidelines
Conclusion
Brief Glossary of Common Terms

Preface

The purpose of this document is to provide a conceptual introduction to statistical or machine learning (ML) techniques for those that might not normally be exposed to such approaches during their typical required statistical training [1].

[1] I generally have in mind social science researchers, but hopefully the material is kept general enough for other disciplines.

Machine learning [2] can be described as a form of statistics, often even utilizing well-known and familiar techniques, that has a bit of a different focus than traditional analytical practice in the social sciences and other disciplines. The key notion is that flexible, automatic approaches are used to detect patterns within the data, with a primary focus on making predictions on future data.

[2] Also referred to as applied statistical learning, statistical engineering, data science, or data mining in other contexts.

If one surveys the number of techniques available in ML without context, it will surely be overwhelming in terms of the sheer number of approaches, and also the various tweaks and variations of them. However, the specifics of the techniques are not as important as the more general concepts that would be applicable in most every ML setting, and indeed in many traditional ones as well. While there will be examples using the R statistical environment and descriptions of a few specific approaches, the focus here is more on ideas than application [3], and is kept at the conceptual level as much as possible.

[3] Indeed, there is evidence that with large enough samples many techniques converge to similar performance.

However, some applied examples of more common techniques will be provided as well. As for prerequisite knowledge, I will assume a basic familiarity with the regression analyses typically presented to those in applied disciplines, particularly those of the social sciences. Regarding programming, one should be at least somewhat familiar with using R and RStudio, and either of my introductions here and here will be plenty. Note that I won't do as much explaining of the R code as in those introductions, and in some cases I will be more concerned with getting to a result than with clearly detailing the path to it. Armed with the introductory knowledge found in those documents, if there are parts of R code that are unclear one has the tools to investigate and discover the details for oneself, which results in more learning. The latest version of this document is dated May 2, 2013 (original March 2013).

Introduction: Explanation & Prediction

For any particular analysis conducted, emphasis can be placed on understanding the underlying mechanisms which have specific theoretical underpinnings, versus a focus that dwells more on performance and, more to the point, future performance.

These are not mutually exclusive goals in the least, and probably most studies contain a little of both in some form or fashion. I will refer to the former emphasis as that of explanation, and the latter as that of prediction.

For studies with a more explanatory focus, traditionally the analysis concerns a single data set. For example, one assumes a data generating distribution for the response, and one evaluates the overall fit of a single model to the data at hand, in terms of R-squared and statistical significance for the various predictors in the model. One assesses how well the model lines up with the theory that led to the analysis, and modifies it accordingly, if need be, for future studies to consider. Other studies may look at predictions for specific, possibly hypothetical, values of the predictors, or examine the particular nature of individual predictors' effects. In many cases, only a single model is fit. In general though, little attempt is made to explicitly understand how well the model will do with future data, but we hope to have gained greater insight as to the underlying mechanisms guiding the response of interest.

Following Breiman (2001), this would be more akin to the data modeling culture.

For the other type of study, focused on prediction, newer techniques are available that are far more focused on performance, not only for the current data under examination but for the future data to which the selected model might be applied. While still possible, relative predictor importance is less of an issue, and oftentimes there may be no particular theory to drive the analysis. There may be thousands of input variables, such that no simple summary would likely be possible. Many of the techniques applied in such analyses are quite powerful, and steps are taken to ensure better results for new data. Again referencing Breiman (2001), this perspective is more of the algorithmic modeling culture.

While the two approaches are not exclusive, I present two extreme views of the situation:

To paraphrase provocatively, machine learning is statistics minus any checking of models and assumptions. ~ Brian Ripley

... the focus in the statistical community on data models has:
Led to irrelevant theory and questionable scientific conclusions;
Kept statisticians from using more suitable algorithmic models;
Prevented statisticians from working on exciting new problems.

~ Leo Breiman, 2001

Respective departments of computer science and statistics now overlap more than ever as more relaxed views seem to prevail today, but there are potential drawbacks to placing too much emphasis on either approach historically associated with them. Models that "just work" have the potential to be dangerous if they are little understood. Situations for which much time is spent sorting out the details of an ill-fitting model suffer the converse problem: some (though often perhaps very little actual) understanding with little pragmatism. While this paper will focus on the more algorithmic approaches, guidance will be provided with an eye toward their use in situations where the typical data modeling approach would be applied, thereby hopefully shedding some light on a path toward obtaining the best of both worlds.

Some Terminology

For those used to statistical concepts such as dependent variables, clustering, and predictors, you will have to get used to some differences in terminology [4], such as targets, unsupervised learning, and inputs.

[4] See this for a comparison.

This doesn't take too much getting used to, even if it is somewhat annoying when one is first starting out. I won't be too beholden to either set of terms in this paper, and it should be clear from the context what's being referred to. Also, I will mostly start off with the non-ML terms and note the ML version in brackets to help with the orientation.

Tools You Already Have

One thing that is important to keep in mind as you begin is that standard techniques are still available, although we might tweak them or do more with them. So having a basic background in statistics is all that is required to get started with machine learning.

The Standard Linear Model

All introductory statistics courses will cover linear regression in great detail, and it certainly can serve as a starting point here. We can describe it as follows in matrix notation:

y ~ N(μ, σ²)
μ = Xβ

where y is a normally distributed vector of responses [target] with mean μ and constant variance σ². X is a typical model matrix, i.e. a matrix of predictor variables in which the first column is a vector of 1s for the intercept [bias [5]], and β is the vector of coefficients [weights] corresponding to the intercept and predictors in the model.

[5] Yes, you will see "bias" refer to an intercept, and also mean something entirely different in our discussion of bias vs. variance.

What might be given less focus in applied courses, however, is how often the linear model won't be the best tool for the job, or even applicable in the form it is presented. Because of this, many applied researchers are still hammering screws with it, even as the explosion of statistical techniques of the past quarter century has rendered obsolete many current introductory statistical texts that are written for those disciplines. Even so, the concepts one gains in learning the standard linear model are generalizable, and even a few modifications of it, while still maintaining the basic design, can render it very effective in situations where it is applicable.

Typically in fitting [learning] a model we tend to talk about R-squared and the statistical significance of the coefficients for a small number of predictors. For our purposes, let the focus instead be on the residual sum of squares [6], with an eye toward its reduction and model comparison. We will not have a situation in which we are only considering one model fit, and so we must find one that reduces the sum of the squared errors, but without unnecessary complexity and overfitting, concepts we'll return to later. Furthermore, we will be much more concerned with the model's fit on new data [generalization].

[6] Σ(y − f(x))², where f(x) is a function of the model predictors, in this context a linear combination of them (Xβ).
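To make that shift in emphasis concrete, here is a minimal R sketch that fits a standard linear model and computes the residual sum of squares directly; the mtcars data and the particular predictors are arbitrary stand-ins used only for illustration.

# Minimal sketch: fit a standard linear model and compute the residual
# sum of squares, the quantity emphasized above, rather than looking
# only at R-squared and p-values. Data and predictors are placeholders.
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)   # sum of (y - f(x))^2
rss

# Model comparison in the same spirit: a simpler model for contrast
fit0 <- lm(mpg ~ wt, data = mtcars)
sum(residuals(fit0)^2)

On its own, a lower residual sum of squares will always favor the more complex model, which is why the later discussion of overfitting and of evaluating fit on new data matters.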

Logistic Regression

Logistic regression is often used where the response is categorical in nature, usually with a binary outcome in which some event occurs or does not occur [label]. One could still use the standard linear model here, but you could end up with nonsensical predictions that fall outside the 0-1 range regarding the probability of the event occurring, to go along with other shortcomings. Furthermore, it is no more effort, nor is any understanding lost, in using logistic regression over the linear probability model. It is also good to keep logistic regression in mind as we discuss other classification approaches later on.

Logistic regression is also typically covered in an introduction to statistics for applied disciplines because of the pervasiveness of binary responses, or responses that have been made as such [7].
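As a small illustration of the point about bounded predictions, the following sketch fits a logistic regression with R's glm and compares its predicted probabilities to those of a linear probability model; the mtcars data and the binary am variable are assumptions made purely for demonstration.

# Minimal sketch: logistic regression via glm with a binomial family.
# Predicted probabilities stay within the 0-1 range, unlike predictions
# from a standard linear model applied to a binary outcome.
# The data and predictors are placeholders for illustration.
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit_logit)                    # coefficients on the log-odds scale

p_hat <- predict(fit_logit, type = "response")
range(p_hat)                          # probabilities, bounded by 0 and 1

# The linear probability model offers no such guarantee
fit_lpm <- lm(am ~ wt + hp, data = mtcars)
range(predict(fit_lpm))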

