
Springer Texts in Statistics - University of Southern ...

Springer Texts in Statistics
Series Editors: G. Casella, S. Fienberg, I. Olkin

Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
An Introduction to Statistical Learning with Applications in R

Gareth James, Department of Information and Operations Management, University of Southern California, Los Angeles, CA, USA
Trevor Hastie, Department of Statistics, Stanford University, Stanford, CA, USA
Daniela Witten, Department of Biostatistics, University of Washington, Seattle, WA, USA
Robert Tibshirani, Department of Statistics, Stanford University, Stanford, CA, USA

ISSN 1431-875X
ISBN 978-1-4614-7137-0, ISBN 978-1-4614-7138-7 (eBook)
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013936251
© Springer Science+Business Media New York 2013 (Corrected at 4th printing 2014)

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.






Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made.

The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media

To our parents:
Alison and Michael James
Chiara Nappi and Edward Witten
Valerie and Patrick Hastie
Vera and Sami Tibshirani

and to our families:
Michael, Daniel, and Catherine
Ari
Samantha, Timothy, and Lynda
Charlie, Ryan, Julie, and Cheryl

Preface

Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a recently developed area in Statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines.

With the explosion of Big Data problems, statistical learning has become a very hot field in many scientific areas as well as marketing, finance, and other business disciplines. People with statistical learning skills are in high demand.

One of the first books in this area, The Elements of Statistical Learning (ESL) (Hastie, Tibshirani, and Friedman), was published in 2001, with a second edition in 2009.

ESL has become a popular text not only in statistics but also in related fields. One of the reasons for ESL's popularity is its relatively accessible style. But ESL is intended for individuals with advanced training in the mathematical sciences. An Introduction to Statistical Learning (ISL) arose from the perceived need for a broader and less technical treatment of these topics. In this new book, we cover many of the same topics as ESL, but we concentrate more on the applications of the methods and less on the mathematical details. We have created labs illustrating how to implement each of the statistical learning methods using the popular statistical software package R. These labs provide the reader with valuable hands-on experience.

This book is appropriate for advanced undergraduates or master's students in Statistics or related quantitative fields or for individuals in other disciplines who wish to use statistical learning tools to analyze their data. It can be used as a textbook for a course spanning one or two semesters.

We would like to thank several readers for valuable comments on preliminary drafts of this book: Pallavi Basu, Alexandra Chouldechova, Patrick Danaher, Will Fithian, Luella Fu, Sam Gross, Max Grazier G'Sell, Courtney Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan, and Xin Lu Tan.

It's tough to make predictions, especially about the future.
- Yogi Berra

Gareth James, Los Angeles, USA
Daniela Witten, Seattle, USA
Trevor Hastie, Palo Alto, USA
Robert Tibshirani, Palo Alto, USA

Contents

Preface
1 Introduction
2 Statistical Learning
  What Is Statistical Learning?

  Why Estimate f?
  How Do We Estimate f?
  The Trade-Off Between Prediction Accuracy and Model Interpretability
  Supervised Versus Unsupervised Learning
  Regression Versus Classification Problems
  Measuring the Quality of Fit
  The Bias-Variance Trade-Off
  Graphics
  Indexing Data
  Additional Graphical and Numerical Summaries
  Exercises
3 Linear Regression
  Simple Linear Regression
  Estimating the Coefficients
  Assessing the Accuracy of the ...
  Multiple Linear Regression
  Estimating the Regression Coefficients
  Some Important Questions
  Other Considerations in the Regression Model
  The Marketing Plan
  Comparison of Linear Regression ...
  Simple Linear Regression
  Multiple Linear Regression
  Interaction Terms
  Non-linear Transformations of the Predictors
  Writing Functions
  Exercises
4 Classification
  Why Not Linear Regression?

  Estimating the Regression Coefficients
  Logistic Regression for > ...
  Linear Discriminant Analysis
  Using Bayes' Theorem for Classification
  Linear Discriminant Analysis for p = 1
  Linear Discriminant Analysis for p > 1
  A Comparison of Classification Methods
  Lab: Logistic Regression, LDA, QDA, and KNN
  The Stock Market Data
  Linear Discriminant Analysis
  An Application to Caravan Insurance Data
  Exercises
5 Resampling Methods
  Cross-Validation
  Leave-One-Out Cross-Validation
  Bias-Variance Trade-Off for k-Fold Cross-Validation
  Cross-Validation on Classification Problems
  The Bootstrap
  Leave-One-Out Cross-Validation
  The Bootstrap
  Exercises
6 Linear Model Selection and Regularization
  Subset Selection
  Best Subset Selection
  Stepwise Selection
  Choosing the Optimal Model
  Dimension Reduction Methods
  Principal Components Regression
  Partial Least Squares

  High-Dimensional Data
  What Goes Wrong in High Dimensions?
  Regression in High Dimensions
  Interpreting Results in High Dimensions
  Lab 1: Subset Selection Methods
  Best Subset Selection
  Forward and Backward Stepwise Selection
  Choosing Among Models Using the ...
  Lab 2: Ridge Regression and the Lasso
  Lab 3: PCR and PLS Regression
  Principal Components Regression
  Partial Least Squares
  Exercises
7 Moving Beyond Linearity
  Step Functions
  Regression Splines
  Piecewise Polynomials
  Constraints and Splines
  The Spline Basis Representation
  Choosing the Number and Locations of the Knots
  Comparison to Polynomial Regression
  Smoothing Splines
  An Overview of Smoothing Splines
  Choosing the Smoothing Parameter
  Local Regression
  Generalized Additive Models
  GAMs for Regression Problems
  GAMs for Classification Problems
  Lab: Non-linear Modeling
  Polynomial Regression and Step Functions

  Exercises
8 Tree-Based Methods
  The Basics of Decision Trees
  Regression Trees
  Advantages and Disadvantages of Trees
  Bagging, Random Forests, Boosting
  Bagging
  Random Forests
  Fitting Classification Trees
  Bagging and Random Forests
  Exercises
9 Support Vector Machines
  What Is a Hyperplane?
  Classification Using a Separating Hyperplane
  Construction of the Maximal Margin Classifier
  Support Vector Classifiers
  Overview of the Support Vector Classifier
  Details of the Support Vector Classifier
  Support Vector Machines
  Classification with Non-linear Decision Boundaries
  The Support Vector Machine
  An Application to the Heart Disease Data
  One-Versus-All Classification
  Lab: Support Vector Machines
  Support Vector Classifier
  Support Vector Machine
  ROC Curves
  SVM with Multiple Classes
  Application to Gene Expression Data

  Exercises
10 Unsupervised Learning
  The Challenge of Unsupervised Learning
  What Are Principal Components?
  Another Interpretation of Principal Components
  Other Uses for Principal Components
  NCI60 Data Example
  Clustering the Observations of the NCI60 Data
Index

1 Introduction

An Overview of Statistical Learning

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy. With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data. To provide an illustration of some applications of statistical learning, we briefly discuss three real-world data sets that are considered in this book.

Wage Data

In this application (which we refer to as the Wage data set throughout this book), we examine a number of factors that relate to wages for a group of males from the Atlantic region of the United States.

In particular, we wish to understand the association between an employee's age and education, as well as the calendar year, on his wage. Consider, for example, the left-hand panel of the figure, which displays wage versus age for each of the individuals in the data set. There is evidence that wage increases with age but then decreases again after approximately age 60. The blue line, which provides an estimate of the average wage for a given age, makes this trend clearer.

G. James et al., An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, © Springer Science+Business Media New York 2013

FIGURE. Wage data, which contains income survey information for males from the central Atlantic region of the United States. Left: wage as a function of age. On average, wage increases with age until about 60 years of age, at which point it begins to decline. Center: wage as a function of year. There is a steady increase of approximately $10,000 in the average wage between 2003 and 2009. Right: Boxplots displaying wage as a function of education, with 1 indicating the lowest level (no high school diploma) and 5 the highest level (an advanced graduate degree).
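
The idea of estimating the average wage for a given age can be illustrated with a small sketch. The book's own labs use R and the real Wage data; the Python below is only an illustration under stated assumptions. The data are simulated, the quadratic shape peaking at age 60 is chosen to mimic the pattern described in the text, and the function names `simulate_wages` and `mean_wage_by_age_bin` are hypothetical. Averaging within ten-year age bins is a crude stand-in for the smooth curve in the figure, not the method used to draw it.

```python
import random
from collections import defaultdict


def simulate_wages(n=3000, seed=0):
    """Simulated (age, wage) pairs; a hypothetical stand-in for the Wage data.

    Wages are in thousands of dollars. The shape is an assumption:
    it rises toward age 60 and declines afterwards.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        age = rng.randint(18, 80)
        trend = 120 - 0.05 * (age - 60) ** 2
        # Add noise; the floor keeps every simulated wage positive.
        data.append((age, max(trend + rng.gauss(0, 15), 5.0)))
    return data


def mean_wage_by_age_bin(data, width=10):
    """Average wage within each age bin, keyed by the bin's lower edge."""
    totals = defaultdict(lambda: [0.0, 0])
    for age, wage in data:
        lo = (age // width) * width
        totals[lo][0] += wage
        totals[lo][1] += 1
    return {lo: s / k for lo, (s, k) in sorted(totals.items())}


if __name__ == "__main__":
    for lo, avg in mean_wage_by_age_bin(simulate_wages()).items():
        print(f"ages {lo}-{lo + width_minus_one}: mean wage {avg:.1f}"
              if (width_minus_one := 9) else "")
```

On the simulated data the binned averages climb through the working years and drop in the oldest bins, which is the same qualitative pattern the text reports for the real survey. A smoother estimate would replace the bins with a regression fit.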

