
Chapter 3: Linear Regression

Once we've acquired data with multiple variables, one very important question is how the variables are related. For example, we could ask for the relationship between people's weights and heights, or study time and test scores, or two animal populations. Regression is a set of techniques for estimating relationships, and we'll focus on them for the next two chapters. In this chapter, we'll focus on finding one of the simplest types of relationship: linear. This process is unsurprisingly called linear regression, and it has many applications. For example, we can relate the force for stretching a spring and the distance that the spring stretches (Hooke's law, shown in Figure 3.1a), or explain how many transistors the semiconductor industry can pack into a circuit over time (Moore's law, shown in Figure 3.1b).

Despite its simplicity, linear regression is an incredibly powerful tool for analyzing data. While we'll focus on the basics in this chapter, the next chapter will show how just a few small tweaks and extensions can enable more complex analyses.

[Figure 3.1: Examples of where a line fit explains physical phenomena and engineering feats. (a) In classical mechanics, one could empirically verify Hooke's law by dangling a mass with a spring and seeing how much the spring is stretched; axes are force on spring (Newtons) vs. amount of stretch (mm), annotated with the fitted line and its $r^2$. (b) In the semiconductor industry, Moore's law is an observation that the number of transistors on an integrated circuit doubles roughly every two years. The Moore's law image is by Wgsimon (own work) [CC BY-SA or GFDL], via Wikimedia Commons.]

But just because fitting a line is easy doesn't mean that it always makes sense.

Let's take another look at Anscombe's quartet to underscore this point.

Example: Anscombe's Quartet Revisited
Recall Anscombe's Quartet: 4 datasets with very similar statistical properties under a simple quantitative analysis, but that look very different. Here they are again, but this time with linear regression lines fitted to each one:

[Figure: the four datasets of Anscombe's quartet, each plotted with its fitted regression line.]

For all 4 of them, the slope of the regression line is 0.500 (to three decimal places) and the intercept is 3.00 (to two decimal places). This just goes to show: visualizing data can often reveal patterns that are hidden by pure numeric analysis!
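This is easy to check numerically. Below is a minimal sketch using NumPy's `polyfit`, with the quartet's values hard-coded from Anscombe's 1973 paper:

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values; IV differs.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
quartet = {
    "I":   (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "II":  (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV":  (np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float),
            np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}

for name, (x, y) in quartet.items():
    slope, intercept = np.polyfit(x, y, deg=1)  # degree-1 least-squares fit
    print(f"{name}: slope={slope:.3f}, intercept={intercept:.2f}")
# All four datasets print slope=0.500 and intercept=3.00,
# despite looking nothing alike when plotted.
```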

We begin with simple linear regression, in which there are only two variables of interest (e.g., weight and height, or force used and distance stretched). After developing intuition for this setting, we'll then turn our attention to multiple linear regression, where there are more variables.

Disclaimer: While some of the equations in this chapter might be a little intimidating, it's important to keep in mind that as a user of statistics, the most important thing is to understand their uses and limitations. Toward this end, make sure not to get bogged down in the details of the equations, but instead focus on understanding how they fit into the big picture.

3.1 Simple linear regression

We're going to fit a line $y = \beta_0 + \beta_1 x$ to our data. Here, $x$ is called the independent variable or predictor variable, and $y$ is called the dependent variable.

Before we talk about how to do the fit, let's take a closer look at the important quantities from the fit:

- $\beta_1$ is the slope of the line: this is one of the most important quantities in any linear regression analysis. A value very close to 0 indicates little to no relationship; large positive or negative values indicate large positive or negative relationships, respectively. For our Hooke's law example earlier, the slope is the spring constant. (Footnote: since the spring constant $k$ is defined by $F = -kx$, where $F$ is the force and $x$ is the stretch, the slope in Figure 3.1a is actually the inverse of the spring constant.)

- $\beta_0$ is the intercept of the line.

In order to actually fit a line, we'll start with a way to quantify how good a line is. We'll then use this to fit the "best" line we can.

One way to quantify a line's "goodness" is to propose a probabilistic model that generates data from lines. Then the "best" line is the one for which data generated from the line is "most likely". This is a commonly used technique in statistics: proposing a probabilistic model and using the probability of data to evaluate how good a particular model is.
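As a preview of what "most likely" will mean, here is a minimal sketch that scores a candidate line by the Gaussian log-likelihood it assigns to the data. The function name is ours, and for simplicity the noise scale `sigma` is assumed known:

```python
import numpy as np

def line_log_likelihood(x, y, b0, b1, sigma=1.0):
    """Gaussian log-likelihood of the data (x, y) under the line y = b0 + b1*x,
    assuming the noise standard deviation sigma is known."""
    residuals = y - (b0 + b1 * x)
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(residuals**2) / (2 * sigma**2)
```

For a fixed `sigma` the first term is a constant, so ranking candidate lines by this score is the same as ranking them by their sum of squared residuals, which is exactly the criterion that appears below.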

Let's make this more concrete.

A probabilistic model for linearly related data

We observe paired data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where we assume that, as a function of $x_i$, each $y_i$ is generated by using some true underlying line $y = \beta_0 + \beta_1 x$ that we evaluate at $x_i$, and then adding some Gaussian noise. Formally,

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i. \qquad (3.1)$$

Here, the noise $\varepsilon_i$ represents the fact that our data won't fit the model perfectly. We'll model $\varepsilon_i$ as being Gaussian: $\varepsilon_i \sim N(0, \sigma^2)$. Note that the intercept $\beta_0$, the slope $\beta_1$, and the noise variance $\sigma^2$ are all treated as fixed (i.e., deterministic) but unknown quantities.

Solving for the fit: least-squares regression

Assuming that this is actually how the data $(x_1, y_1), \ldots, (x_n, y_n)$ we observe are generated, then it turns out that we can find the line for which the probability of the data is highest by solving the following optimization problem:

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2, \qquad (3.2)$$

where $\min_{\beta_0, \beta_1}$ means "minimize over $\beta_0$ and $\beta_1$".
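Before deriving the solution, it can help to see the generative story of Equation (3.1) in code. A minimal simulation sketch; the "true" parameter values below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# "True" parameters -- chosen arbitrarily for this sketch.
beta0_true, beta1_true, sigma = 2.0, 0.5, 1.0

n = 50
x = rng.uniform(0, 10, size=n)
noise = rng.normal(0, sigma, size=n)      # eps_i ~ N(0, sigma^2)
y = beta0_true + beta1_true * x + noise   # Equation (3.1)
```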

The minimization in Equation (3.2) is known as the least-squares linear regression problem. Given a set of points, the solution is:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} \qquad (3.3)$$

$$\phantom{\hat{\beta}_1} = r \, \frac{s_y}{s_x}, \qquad (3.4)$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad (3.5)$$

where $\bar{x}$, $\bar{y}$, $s_x$, and $s_y$ are the sample means and standard deviations for $x$ values and $y$ values, respectively, and $r$ is the correlation coefficient, defined as

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right). \qquad (3.6)$$

(Footnote: This is an important point: the assumption of Gaussian noise leads to squared error as our minimization criterion. We'll see more regression techniques later that use different distributions and therefore different cost functions.)

[Figure 3.2: An illustration of correlation strength. Each plot shows data with a particular correlation coefficient $r$. Values farther from 0 (outside) indicate a stronger relationship than values closer to 0 (inside). Negative values (left) indicate an inverse relationship, while positive values (right) indicate a direct relationship.]
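Equations (3.3) through (3.6) translate directly into code. A minimal sketch (the function name is ours):

```python
import numpy as np

def least_squares_fit(x, y):
    """Simple linear regression via the closed form in Equations (3.3)-(3.6)."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)                      # sample standard deviations
    r = np.sum((x - xbar) * (y - ybar) / (sx * sy)) / (n - 1)  # Equation (3.6)
    beta1_hat = r * sy / sx                                    # Equation (3.4)
    beta0_hat = ybar - beta1_hat * xbar                        # Equation (3.5)
    return beta0_hat, beta1_hat, r
```

Running `least_squares_fit` on the simulated `x`, `y` from the earlier sketch should return estimates close to the true values 2.0 and 0.5, with the gap shrinking as $n$ grows.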

By examining the second equation for the estimated slope $\hat{\beta}_1$, we see that since the sample standard deviations $s_x$ and $s_y$ are positive quantities, the correlation coefficient $r$, which is always between $-1$ and $1$, measures how much $x$ is related to $y$ and whether the trend is positive or negative. Figure 3.2 illustrates different correlation strengths.

The square of the correlation coefficient $r^2$ will always be positive and is called the coefficient of determination. As we'll see later, this is also equal to the proportion of the total variability that's explained by a linear model.
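That interpretation of $r^2$ can be checked directly: compute one minus the ratio of residual to total sum of squares and compare it with the squared correlation. A minimal sketch (the function name is ours):

```python
import numpy as np

def variance_explained(x, y):
    """Proportion of total variability explained by the least-squares line:
    1 - SS_residual / SS_total. For simple linear regression this equals r^2."""
    r = np.corrcoef(x, y)[0, 1]
    beta1 = r * y.std(ddof=1) / x.std(ddof=1)
    beta0 = y.mean() - beta1 * x.mean()
    y_hat = beta0 + beta1 * x
    ss_res = np.sum((y - y_hat) ** 2)   # variability left over after the fit
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variability in y
    return 1.0 - ss_res / ss_tot        # numerically equal to r**2
```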

As an extremely crucial remark, correlation does not imply causation! We devote the entire next page to this point, which is one of the most common sources of error in statistical analyses.

Example: Correlation and Causation
Just because there's a strong correlation between two variables, there isn't necessarily a causal relationship between them. For example, drowning deaths and ice-cream sales are strongly correlated, but that's because both are affected by the season (summer vs. winter). In general, there are several possible cases, as illustrated below:

(a) Causal link: Even if there is a causal link between $x$ and $y$, correlation alone cannot tell us whether $y$ causes $x$ or $x$ causes $y$.
(b) Hidden cause: A hidden variable $z$ causes both $x$ and $y$, creating the correlation.
(c) Confounding factor: A hidden variable $z$ and $x$ both affect $y$, so the results also depend on the value of $z$.
(d) Coincidence: The correlation just happened by chance (e.g., the strong correlation between sun cycles and the number of Republicans in Congress, as shown below).
(e) The number of Republican senators in Congress (red) and the sunspot number (blue, before 1986) / inverted sunspot number (blue, after 1986).

This figure comes from ….

[Figure 3.3: Different explanations for correlation between two variables. In this diagram, arrows represent causation.]

3.1.1 Tests and Intervals

Recall from last time that in order to do hypothesis tests and compute confidence intervals, we need to know our test statistic, its standard error, and its distribution. We'll look at the standard errors for the most important quantities and their interpretation. Any statistical analysis software can compute these quantities automatically, so we'll focus on interpreting and understanding what comes out.

Caution: All the statistical tests here crucially depend on the assumption that the observed data actually come from the probabilistic model defined in Equation (3.1)!

Slope

For the slope $\beta_1$, our test statistic is

$$t_{\hat{\beta}_1} = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}, \qquad (3.7)$$

which has a Student's $t$ distribution with $n - 2$ degrees of freedom.
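A minimal sketch of this test in code. The names are ours, and the standard-error formula used here, $s_{\hat{\beta}_1} = \sqrt{\hat{\sigma}^2 / \sum_i (x_i - \bar{x})^2}$ with $\hat{\sigma}^2 = \text{SS}_{\text{res}}/(n-2)$, is the usual estimate for simple linear regression rather than one quoted from these notes; any statistical software reports the same quantities:

```python
import numpy as np
from scipy import stats

def slope_test(x, y, beta1_null=0.0):
    """t statistic (Equation (3.7)) and two-sided p-value for H0: beta1 = beta1_null."""
    n = len(x)
    beta1_hat = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)
    beta0_hat = y.mean() - beta1_hat * x.mean()
    residuals = y - (beta0_hat + beta1_hat * x)
    sigma2_hat = np.sum(residuals ** 2) / (n - 2)            # estimated noise variance
    se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))  # standard error of slope
    t = (beta1_hat - beta1_null) / se_beta1
    p = 2 * stats.t.sf(abs(t), df=n - 2)                     # Student's t, n-2 dof
    return t, p
```

The default null hypothesis `beta1_null=0.0` corresponds to the most common question: is there any linear relationship at all?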

