Transcription of Probability Theory and Statistics
Probability Theory and Statistics
With a view towards the natural sciences

Lecture notes

Niels Richard Hansen
Department of Mathematical Sciences
University of Copenhagen
November 2010

Preface

The present lecture notes have been developed over the last couple of years for a course aimed primarily at the students taking a Master's in bioinformatics at the University of Copenhagen. There is an increasing demand for a general introductory statistics course at the Master's level at the university, and the course has also become a compulsory course for the Master's in eScience. Both educations emphasize a computational and data-oriented approach to science. In particular, the aim of the notes is to combine the mathematical and theoretical underpinning of statistics and statistical data analysis with computational methodology and practical applications. Hopefully the notes pave the way for an understanding of the foundation of data analysis with a focus on the probabilistic model and the methodology that we can develop from this point of view.
In a single course there is no hope that we can present all models and all relevant methods that the students will need in the future, and for this reason we develop general ideas so that new models and methods can be more easily approached by students after the course. We can, on the other hand, not develop the theory without a number of good examples to illustrate its use. Due to the history of the course most examples in the notes are biological in nature but span a range of different areas, from molecular biology and biological sequence analysis over molecular evolution and genetics to toxicology and various assays.

Those who take the course are expected to become users of statistical methodology in a subject matter field and potentially also developers of models and methodology in such a field. It is therefore intentional that we focus on the fundamental principles and develop these principles, which are by nature mathematical. Advanced mathematics is, however, kept out of the main text. Instead, a number of math boxes can be found in the notes.
Relevant, but mathematically more sophisticated, issues are treated in these math boxes. The main text does not depend on results developed in the math boxes, but the interested and capable reader may find them worth reading.

The formal mathematical prerequisite for reading the notes is a standard calculus course in addition to a few useful mathematical facts collected in an appendix. The reader who is not so accustomed to the symbolic language of mathematics may, however, find the material challenging to begin with.

To fully benefit from the notes it is also necessary to obtain and install the statistical computing environment R. It is evident that almost all applications of statistics today require the use of computers for computations and very often also simulations. R is a free, full-fledged programming language and should be regarded as such. Previous experience with programming is thus beneficial but not necessary. R is a language developed for statistical data analysis and it comes with a huge number of packages, which makes it a convenient framework for handling most standard statistical analyses, for implementing novel statistical procedures, for doing simulation studies, and, last but not least, it does a fairly good job at producing high quality graphics.

We all have to crawl before we can walk, let alone run.
We begin the notes with the simplest models but develop a sustainable theory that can embrace the more advanced ones.

Last, but not least, I owe special thanks to Jessica Kasza for detailed comments on an earlier version of the notes and for correcting a number of grammatical mistakes.

2010
Niels Richard Hansen

Contents

1 Introduction
    Notion of probabilities
    Statistics and statistical models
2 Probability
    Introduction
    Sample spaces
    Probability measures
    Probability measures on discrete sets
    Descriptive methods
    Mean and variance
    Probability measures on the real line
    Descriptive methods
    Histograms and kernel density estimation
    Mean and variance
    Quantiles
    Conditional probabilities and independence
    Random variables
    Transformations of random variables
    Joint distributions, conditional distributions and independence
    Random variables and independence
    Random variables and conditional distributions
    Transformations of independent variables
    Simulations
    Local alignment - a case study
    Multivariate distributions
    Conditional distributions and conditional densities
    Descriptive methods
    Transition probabilities
3 Statistical models and …
    Statistical modeling
    Classical sampling distributions
    Statistical inference
    Parametric statistical models
    Estimators and estimates
    Maximum likelihood estimation
    Hypothesis testing
    Two sample t-test
    Likelihood ratio tests
    Multiple testing
    Confidence intervals
    Parameters of interest
    Regression
    Ordinary linear regression
    Non-linear regression
    Bootstrapping
    The empirical measure and non-parametric bootstrapping
    The percentile method
4 Mean and …
    Expectations
    The empirical mean
    More on expectations
    Variance
    Multivariate distributions
    Properties of the empirical approximations
    Monte Carlo integration
    Asymptotic theory
    MLE and asymptotic theory
    Entropy
Appendix A
    Obtaining and running R
    Manuals, FAQs and online help
    The R language, functions and scripts
    Functions, expression evaluation, and objects
    Writing functions and scripts
    Graphics
    Packages
    Bioconductor
    Literature
    Other resources
Appendix B
    Sets
    Combinatorics
    Limits and infinite sums
    Integration
    Gamma and beta integrals
    Multiple integrals

Notion of probabilities

Flipping coins and throwing dice are two commonly occurring examples in an introductory course on probability theory and statistics. They represent archetypical experiments where the outcome is uncertain: no matter how many times we roll the dice we are unable to predict the outcome of the next roll. We use probabilities to describe the uncertainty; a fair, classical dice has probability 1/6 for each side to turn up. Such computations can to some extent be handled based on intuition, common sense and high school mathematics.
In the popular dice game Yahtzee the probability of getting a Yahtzee (five of a kind) in a single throw is for instance

\[
\frac{6}{6^5} = \frac{1}{6^4} \approx 0.00077.
\]

The argument for this and many similar computations is based on the pseudo theorem that the probability for any event equals

\[
\frac{\text{number of favourable outcomes}}{\text{number of possible outcomes}}.
\]

Getting a Yahtzee consists of the six favourable outcomes with all five dice facing the same side upwards. We call the formula above a pseudo theorem because, as we will show in a later section, it is only the correct way of assigning probabilities to events under a very special assumption about the probabilities describing our experiment. The special assumption is that all outcomes are equally probable, something we tend to believe if we don't know any better, or can see no way that one outcome should be more likely than others.

However, without some training most people will either get it wrong or have to give up if they try computing the probability of anything except the most elementary events, even when the pseudo theorem applies.
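The pseudo theorem can be checked by brute force for the single-throw Yahtzee. The following is a minimal sketch in Python, assuming five fair dice so that all 6^5 ordered outcomes are equally probable; the helper name `classical_probability` is our own and serves only as an illustration.

```python
from fractions import Fraction
from itertools import product


def classical_probability(outcomes, is_favourable):
    """Favourable outcomes divided by possible outcomes.

    Only valid under the assumption that all outcomes are equally probable.
    """
    outcomes = list(outcomes)
    favourable = sum(1 for outcome in outcomes if is_favourable(outcome))
    return Fraction(favourable, len(outcomes))


# All 6^5 = 7776 ordered results of throwing five dice once.
all_throws = product(range(1, 7), repeat=5)

# A Yahtzee is a throw in which all five dice show the same face.
p_yahtzee = classical_probability(all_throws, lambda throw: len(set(throw)) == 1)

print(p_yahtzee)  # 1/1296, that is 1/6^4
```

Using exact fractions rather than floating point numbers makes it easy to compare the result with the formula 6/6^5 = 1/6^4.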
There exist numerous tricky probability questions where intuition somehow breaks down and wrong conclusions can be drawn if one is not extremely careful. A good challenge could be to compute the probability of getting a Yahtzee in three throws with the usual rules and provided that we always hold as many equal dice as possible.

Figure: The relative frequency of times that the first dice sequence comes out before the second sequence, as a function of the number of times the dice game has been played.

The Yahtzee problem can in principle be solved by counting: simply write down all combinations and count the number of favourable and possible combinations. Then the pseudo theorem applies. It is a futile task but in principle a possibility.

In many cases it is, however, impossible to rely on counting even in principle. As an example we consider a simple dice game with two participants: first I choose a sequence of three dice throws, and then you choose a different sequence of three throws. We throw the dice until one of the two sequences comes out as three subsequent throws, and I win if my sequence comes out first and otherwise you win.
For instance, if the throws end with my sequence and your sequence has not occurred before that, then I win. It is natural to ask with what probability you will win this game. In addition, it is clearly a quite boring game, since we have to throw a lot of dice and simply wait for one of the two sequences to occur. Another question could therefore be to ask how boring the game is. Can we for instance compute the probability of having to throw the dice more than 100, or perhaps 500, times before either of the two sequences shows up?

The problem that we encounter here is first of all that the pseudo theorem does not apply, simply because there is an infinite number of favourable as well as possible outcomes. The event that you win consists of all finite sequences of throws ending with your sequence without my sequence occurring somewhere as three subsequent throws. Moreover, these outcomes are certainly not equally probable. By developing the theory of probabilities we obtain a framework for solving problems like this and doing many other even more subtle computations. And if we cannot compute the solution we might be able to obtain an answer to our questions using computer simulations.
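A computer simulation of the dice game could look as follows. This is only a sketch in Python under stated assumptions: the two sequences `MINE` and `YOURS` are hypothetical choices made for illustration, the dice is assumed fair, and the function names are our own. The simulation estimates the probability that you win together with the probabilities of having to throw more than 100 and more than 500 times.

```python
import random

# Hypothetical sequences, chosen for illustration only.
MINE = (1, 4, 3)
YOURS = (2, 1, 4)


def play_game(mine, yours, rng):
    """Throw a fair dice until one of the two sequences has come out.

    Returns the winner ("me" or "you") and the number of throws used.
    The two sequences are assumed to be different.
    """
    last_three = ()
    n = 0
    while True:
        n += 1
        # Keep only the three most recent throws.
        last_three = (last_three + (rng.randint(1, 6),))[-3:]
        if last_three == mine:
            return "me", n
        if last_three == yours:
            return "you", n


def simulate(mine, yours, n_games, seed=0):
    """Play n_games games and return relative frequencies as estimates."""
    rng = random.Random(seed)
    games = [play_game(mine, yours, rng) for _ in range(n_games)]
    return {
        "you_win": sum(winner == "you" for winner, _ in games) / n_games,
        "more_than_100": sum(n > 100 for _, n in games) / n_games,
        "more_than_500": sum(n > 500 for _, n in games) / n_games,
    }


if __name__ == "__main__":
    print(simulate(MINE, YOURS, n_games=5000))
```

The relative frequencies are only approximations of the probabilities, and they vary with the seed and the number of games played.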
Moreover, the notes introduce probability theory as the foundation for doing statistics. The probability theory will provide a framework where it becomes possible to clearly formulate our statistical questions and to clearly express the assumptions upon which the answers rest.

Figure: Playing the dice game 5000 times, this graph shows how the games are distributed according to the number of times, n, we had to throw the dice before one of the two sequences occurred (horizontal axis: n, vertical axis: frequency).

Enough about dice games! After all, these notes are about probability theory and statistics with applications to the natural sciences. Therefore we will try to take examples and motivations from real biological, physical and chemical problems, but it can also be rewarding intellectually to focus on simple problems like those from a dice game to really understand the fundamental issues.
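As one such simple problem, the Yahtzee challenge of three throws can be approached by simulation. The sketch below, in Python, assumes five fair dice and reads the rule "hold as many equal dice as possible" as follows: after each throw we keep the largest group of equal dice, switching to another face if it forms a larger group, and rethrow the rest. The function names are our own, and the result is a Monte Carlo estimate, not an exact computation.

```python
import random
from collections import Counter


def yahtzee_within_three_throws(rng):
    """Simulate one turn and return True if it ends with five of a kind."""
    dice = [rng.randint(1, 6) for _ in range(5)]
    for throw in range(3):
        # Face value and size of the largest group of equal dice.
        value, count = Counter(dice).most_common(1)[0]
        if count == 5:
            return True
        if throw < 2:
            # Hold the largest group and rethrow the remaining dice.
            dice = [value] * count + [rng.randint(1, 6) for _ in range(5 - count)]
    return False


def estimate_yahtzee_probability(n_turns, seed=0):
    """Relative frequency of turns that end with a Yahtzee."""
    rng = random.Random(seed)
    hits = sum(yahtzee_within_three_throws(rng) for _ in range(n_turns))
    return hits / n_turns


if __name__ == "__main__":
    print(estimate_yahtzee_probability(100_000))
```

With a large number of simulated turns the relative frequency settles in the neighbourhood of 0.046, far above the single-throw probability of 1/6^4.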