Transcription of Applied Statistics with R - GitHub Pages
Applied Statistics with R
David Dalpiaz

Contents

1 Introduction: About This Book; Conventions; Acknowledgements; License
2 Introduction to R: Getting Started; Basic Calculations; Getting Help; Installing Packages
3 Data and Programming: Data Types; Data Structures; Vectors; Vectorization; Logical Operators; More Vectorization; Matrices; Lists; Data Frames; Programming Basics; Control Flow; Functions
4 Summarizing Data: Summary Statistics; Plotting; Histograms; Barplots; Boxplots; Scatterplots
5 Probability and Statistics in R: Probability in R; Distributions; Hypothesis Tests in R; One Sample t-Test: Review; One Sample t-Test: Example; Two Sample t-Test: Review; Two Sample t-Test: Example; Simulation; Paired Differences; Distribution of a Sample Mean
6 R Resources: Beginner Tutorials and References; Intermediate References; Advanced References; Quick Comparisons to Other Languages; RStudio and RMarkdown Videos; RMarkdown Template
7 Simple Linear Regression: Modeling; Simple Linear Regression Model; Least Squares Approach; Making Predictions; Residuals; Variance Estimation; Decomposition of Variation; Coefficient of Determination; The lm Function; Maximum Likelihood Estimation (MLE) Approach; Simulating SLR; History
8 Inference for Simple Linear Regression: Gauss-Markov Theorem; Sampling Distributions; Simulating Sampling Distributions; Standard Errors; Confidence Intervals for Slope and Intercept; Hypothesis Tests; Tests in R; Significance of Regression, t-Test; Confidence Intervals in R; Confidence Interval for Mean Response; Prediction Interval for New Observations; Confidence and Prediction Bands; Significance of Regression, F-Test
9 Multiple Linear Regression: Matrix Approach to Regression; Sampling Distribution; Single Parameter Tests; Confidence Intervals; Confidence Intervals for Mean Response; Prediction Intervals; Significance of Regression; Nested Models; Simulation
10 Model Building: Family, Form, and Fit; Fit; Form; Family; Assumed Model, Fitted Model; Explanation versus Prediction; Explanation; Prediction; Summary
11 Categorical Predictors and Interactions: Dummy Variables; Interactions; Factor Variables; Factors with More Than Two Levels; Parameterization; Building Larger Models
12 Analysis of Variance: Experiments; Two-Sample t-Test; One-Way ANOVA; Factor Variables; Some Simulation; Power; Post Hoc Testing; Two-Way ANOVA
13 Model Diagnostics: Model Assumptions; Checking Assumptions; Fitted versus Residuals Plot; Breusch-Pagan Test; Histograms; Q-Q Plots; Shapiro-Wilk Test; Unusual Observations; Leverage; Outliers; Influence; Data Analysis Examples; Good Diagnostics; Suspect Diagnostics
14 Transformations: Response Transformation; Variance Stabilizing Transformations; Box-Cox Transformations; Predictor Transformation; Polynomials; A Quadratic Model; Overfitting and Extrapolation; Comparing Polynomial Models; poly() Function and Orthogonal Polynomials; Inhibit Function; Data Example
15 Collinearity: Exact Collinearity; Collinearity; Variance Inflation Factor; Simulation
16 Variable Selection and Model Building: Quality Criterion; Akaike Information Criterion; Bayesian Information Criterion; Adjusted R-Squared; Cross-Validated RMSE; Selection Procedures; Backward Search; Forward Search; Stepwise Search; Exhaustive Search; Higher Order Terms; Explanation versus Prediction; Explanation; Prediction
17 Logistic Regression: Generalized Linear Models; Binary Response; Fitting Logistic Regression; Fitting Issues; Simulation Examples; Working with Logistic Regression; Testing with GLMs; Wald Test; Likelihood-Ratio Test; Confidence Intervals; Confidence Intervals for Mean Response; Formula Syntax; Deviance; Classification; Evaluating Classifiers
18 Beyond: What's Next; RStudio; Tidy Data; Visualization; Web Applications; Experimental Design; Machine Learning; Deep Learning; Time Series; Bayesianism; High Performance Computing
19 Appendix

Chapter 1 Introduction

Welcome to Applied Statistics with R!

About This Book

This book was originally (and currently) designed for use with STAT 420, Methods of Applied Statistics, at the University of Illinois at Urbana-Champaign. It may certainly be used elsewhere, but any references to "this course" in this book specifically refer to STAT 420.

This book is under active development. When possible, it is best to access the text online to be sure you are using the most up-to-date version. Also, the html version provides additional features such as changing text size, font, and colors.
If you are in need of a local copy, a pdf version is continuously maintained. However, because a pdf uses pages, the formatting may not be as functional. (In other words, the author needs to go back and spend some time working on the pdf formatting.)

Since this book is under active development, you may encounter errors ranging from typos, to broken code, to poorly explained topics. If you do, please let us know! Simply send an email and we will make the changes as soon as possible. (dalpiaz2 AT illinois DOT edu) Or, if you know RMarkdown and are familiar with GitHub, make a pull request and fix an issue yourself! This process is partially automated by the edit button in the top-left corner of the html version. If your suggestion or fix becomes part of the book, you will be added to the list at the end of this chapter. We'll also link to your GitHub account, or personal website upon request.

This text uses MathJax to render mathematical notation for the web. Occasionally, but rarely, a JavaScript error will prevent MathJax from rendering correctly. In this case, you will see the "code" instead of the expected mathematical equations. From experience, this is almost always fixed by simply refreshing the page. You'll also notice that if you right-click any equation you can obtain the MathML Code (for copying into Microsoft Word) or the TeX command used to generate the equation.

$$a^2 + b^2 = c^2$$

Conventions

R code will be typeset using a monospace font which is syntax highlighted.

    a = 3
    b = 4
    sqrt(a^2 + b^2)

R output lines, which would appear in the console, will begin with ##. They will generally not be syntax highlighted.

    ## [1] 5

We use the quantity p to refer to the number of parameters in a linear model, not the number of predictors. Don't worry if you don't know what this means yet!

Acknowledgements

Material in this book was heavily influenced by:

- Alex Stepanov: Longtime instructor of STAT 420 at the University of Illinois at Urbana-Champaign. The author of this book actually took Alex's STAT 420 class many years ago!
  Alex provided or inspired many of the examples in the text.
- David Unger: Another STAT 420 instructor at the University of Illinois at Urbana-Champaign. Co-taught with the author during the summer of 2016 while this book was first being developed. Provided endless hours of copy editing and countless suggestions.
- James Balamuta: Current graduate student at the University of Illinois at Urbana-Champaign. Provided the initial push to write this book by introducing the author to the bookdown package in R. Also a frequent contributor via GitHub.

Your name could be here! Suggest an edit! Correct a typo! If you submit a correction and would like to be listed below, please provide your name as you would like it to appear, as well as a link to a GitHub, LinkedIn, or personal website.

Daniel McQuillan Mason Rubenstein Yuhang Wang Zhao Liu Jinfeng Xiao Somu Palaniappan Michael Hung-Yiu Chan Eloise Rosen Kiomars Nassiri Jeff Gerlach Brandon Ching Ray Fix Tyler Kim Yeongho Kim Elmar Langholz Thai Duy Cuong Nguyen Junyoung Kim Sezgin Kucukcoban Tony Ma Radu Manolescu Dileep Pasumarthi Sihun Wang Joseph Wilson Yingkui Lin Andy Siddall Nishant Balepur Durga Krovi Raj Krishnan Ed Pureza Siddharth Singh Schillaci Mcinnis Ivan Valdes Castillo Tony Mu Salman Yousaf Yutaro Nishiyama Regina Sahani Goonetilleke Paul Zuradzki Will Tsai Ellen Veomett David

License

Figure: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International license.

Chapter 2 Introduction to R

Getting Started

R is both a programming language and software environment for statistical computing, which is free and open-source. To get started, you will need to install two pieces of software:

- R, the actual programming language. Choose your operating system, and select the most recent version.
- RStudio, an excellent IDE for working with R. Note, you must have R installed to use RStudio. RStudio is simply an interface used to interact with R.

The popularity of R is on the rise, and every day it becomes a better tool for statistical analysis. It even generated this book! (A skill you will learn in this course.) There are many good resources for learning R.

The following few chapters will serve as a whirlwind introduction to R. They are by no means meant to be a complete reference for the R language, but simply an introduction to the basics that we will need along the way.
Several of the more important topics will be re-stressed as they are actually needed for analyses.

These introductory R chapters may feel like an overwhelming amount of information. You are not expected to pick up everything the first time through. You should try all of the code from these chapters, then return to them a number of times as you return to the concepts when performing analyses.

R is used both for software development and data analysis. We will operate in a grey area, somewhere between these two tasks. Our main goal will be to analyze data, but we will also perform programming exercises that help illustrate certain concepts.

RStudio has a large number of useful keyboard shortcuts. A list of these can be found using a keyboard shortcut, the keyboard shortcut to rule them all:

- On Windows: Alt + Shift + K
- On Mac: Option + Shift + K

The RStudio team has developed a number of cheatsheets for working with both R and RStudio. This particular cheatsheet for Base R will summarize many of the concepts in this document.
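
Since the text encourages trying code right away, here is a minimal first session to run in the RStudio console once R and RStudio are installed. It is only a sketch, not part of the original text: the variable names `hypotenuse`, `fit`, `n_predictors`, and `n_parameters` are illustrative, and the built-in `cars` dataset and the model `dist ~ speed` are assumptions chosen only to show the difference between the number of parameters and the number of predictors in a linear model.

```r
# Confirm that R is available and report its version.
# The exact string depends on your installation, so no output is shown.
R.version.string

# A basic calculation in the style of the Conventions example.
a = 3
b = 4
hypotenuse = sqrt(a^2 + b^2)  # illustrative name, not from the text
hypotenuse
## [1] 5

# Parameters versus predictors, using the built-in cars data
# (dataset and model are assumptions made for illustration).
fit = lm(dist ~ speed, data = cars)
n_predictors = 1                  # speed is the only predictor
n_parameters = length(coef(fit))  # intercept and slope
n_parameters
## [1] 2
```

Lines beginning with `##` show what the console would print. The simple model has one predictor but two parameters, because the intercept is counted as well.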