Think Stats: Probability and Statistics for Programmers

Allen B. Downey

Green Tea Press
Needham, Massachusetts

Copyright 2011 Allen B. Downey.

Green Tea Press
9 Washburn Ave
Needham MA 02492

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-NonCommercial Unported License.

The original form of this book is LaTeX source code. Compiling this code has the effect of generating a device-independent representation of a textbook, which can be converted to other formats. The LaTeX source for this book is available online.

The cover for this book is based on a photo by Paul Friel, who made it available under the Creative Commons Attribution license.

Preface

Why I wrote this book

Think Stats: Probability and Statistics for Programmers is a textbook for a new kind of introductory prob-stat class. It emphasizes the use of statistics to explore large datasets.

It takes a computational approach, which has several advantages:

- Students write programs as a way of developing and testing their understanding. For example, they write functions to compute a least squares fit, residuals, and the coefficient of determination. Writing and testing this code requires them to understand the concepts and implicitly corrects misunderstandings.
- Students run experiments to test statistical behavior. For example, they explore the Central Limit Theorem (CLT) by generating samples from several distributions. When they see that the sum of values from a Pareto distribution doesn't converge to normal, they remember the assumptions the CLT is based on.
- Some ideas that are hard to grasp mathematically are easy to understand by simulation. For example, we approximate p-values by running Monte Carlo simulations, which reinforces the meaning of the p-value.
- Using discrete distributions and computation makes it possible to present topics like Bayesian estimation that are not usually covered in an introductory class.
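The first advantage above mentions functions for a least squares fit, residuals, and the coefficient of determination. As a rough illustration only, such functions might look like the sketch below. The function names (`least_squares`, `residuals`, `coef_determination`) and the sample points are assumptions made for this example; they are not taken from the book's accompanying code.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def least_squares(xs, ys):
    """Return (intercept, slope) of the line that minimizes squared error."""
    x_bar, y_bar = mean(xs), mean(ys)
    # The slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed y minus the value predicted by the fitted line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def coef_determination(ys, res):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    y_bar = mean(ys)
    ss_res = sum(r ** 2 for r in res)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up points (an assumption for illustration) lying roughly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

intercept, slope = least_squares(xs, ys)
res = residuals(xs, ys, intercept, slope)
r2 = coef_determination(ys, res)
print(f"intercept={intercept:.3f} slope={slope:.3f} R^2={r2:.4f}")
```

Writing the three pieces as separate functions makes each one easy to test on its own: for a least squares line the residuals should sum to approximately zero, and R^2 should approach 1 as the points approach a straight line.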

One exercise on Bayesian estimation, for example, asks students to compute the posterior distribution for the German tank problem, which is difficult analytically but surprisingly easy computationally.

Because students work in a general-purpose programming language (Python), they are able to import data from almost any source. They are not limited to data that has been cleaned and formatted for a particular statistics tool.

The book lends itself to a project-based approach. In my class, students work on a semester-long project that requires them to pose a statistical question, find a dataset that can address it, and apply each of the techniques they learn to their own data.

To demonstrate the kind of analysis I want students to do, the book presents a case study that runs through all of the chapters. It uses data from two sources:

- The National Survey of Family Growth (NSFG), conducted by the Centers for Disease Control and Prevention (CDC) to gather information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men's and women's health.

- The Behavioral Risk Factor Surveillance System (BRFSS), conducted by the National Center for Chronic Disease Prevention and Health Promotion to track health conditions and risk behaviors in the United States.

Other examples use data from the IRS, the Census, and other sources.

How I wrote this book

When people write a new textbook, they usually start by reading a stack of old textbooks. As a result, most books contain the same material in pretty much the same order. Often there are phrases, and errors, that propagate from one book to the next; Stephen Jay Gould pointed out an example in his essay, "The Case of the Creeping Fox Terrier" (a fox terrier is a breed of dog that is about half the size of a Hyracotherium).

I did not do that. In fact, I used almost no printed material while I was writing this book, for several reasons:

- My goal was to explore a new approach to this material, so I didn't want much exposure to existing approaches.
- Since I am making this book available under a free license, I wanted to make sure that no part of it was encumbered by copyright restrictions.

- Many readers of my books don't have access to libraries of printed material, so I tried to make references to resources that are freely available on the Internet.
- Proponents of old media think that the exclusive use of electronic resources is lazy and unreliable. They might be right about the first part, but I think they are wrong about the second, so I wanted to test my theory.

The resource I used more than any other is Wikipedia, the bugbear of librarians everywhere. In general, the articles I read on statistical topics were very good (although I made a few small changes along the way). I include references to Wikipedia pages throughout the book and I encourage you to follow those links; in many cases, the Wikipedia page picks up where my description leaves off. The vocabulary and notation in this book are generally consistent with Wikipedia, unless I had a good reason to deviate.

Other resources I found useful were Wolfram MathWorld and (of course) Google. I also used two books, David MacKay's Information Theory, Inference, and Learning Algorithms, which is the book that got me hooked on Bayesian statistics, and Press et al.'s Numerical Recipes in C. But both books are available online, so I don't feel too bad.

Allen B. Downey
Needham MA

Allen B. Downey is a Professor of Computer Science at the Franklin W. Olin College of Engineering.

Contributor List

If you have a suggestion or correction, please send it by email. If I make a change based on your feedback, I will add you to the contributor list (unless you ask to be omitted).

If you include at least part of the sentence the error appears in, that makes it easy for me to search. Page and section numbers are fine, too, but not quite as easy to work with. Thanks!

- Lisa Downey and June Downey read an early draft and made many corrections and suggestions.
- Steven Zhang found several errors.
- Andy Pethan and Molly Farison helped debug some of the solutions, and Molly spotted several typos.
- Andrew Heine found an error in my error function.
- Dr. Nikolas Akerblom knows how big a Hyracotherium is.
- Alex Morrow clarified one of the code examples.
- Jonathan Street caught an error in the nick of time.

- Gábor Lipták found a typo in the book and the relay race solution.
- Many thanks to Kevin Smith and Tim Arnold for their work on plasTeX, which I used to convert this book to DocBook.
- George Caplan sent several suggestions for improving clarity.
- Julian Ceipek found an error and a number of typos.
- Stijn Debrouwere, Leo Marihart III, Jonathan Hammler, and Kent Johnson found errors in the first print edition.
- Dan Kearney found a typo.
- Jeff Pickhardt found a broken link and a typo.
- Jörg Beyer found typos in the book and made many corrections in the docstrings of the accompanying code.
- Tommie Gannert sent a patch file with a number of corrections.
- Alexander Gryzlov suggested a clarification in an exercise.
- Martin Veillette reported an error in one of the formulas for Pearson's correlation.
- Christoph Lendenmann submitted several errata.
- Haitao Ma noticed a typo and sent me a note.

Contents

1 Statistical thinking for programmers

2 Descriptive statistics
3 Cumulative distribution functions
4 Continuous distributions
5 Probability
6 Operations on distributions
7 Hypothesis testing

8 Estimation
9 Correlation

Chapter 1
Statistical thinking for programmers

This book is about turning data into knowledge. Data is cheap (at least relatively); knowledge is harder to come by.

I will present three related pieces:

Probability is the study of random events. Most people have an intuitive understanding of degrees of probability, which is why you can use words like "probably" and "unlikely" without special training, but we will talk about how to make quantitative claims about those degrees.

Statistics is the discipline of using data samples to support claims about populations. Most statistical analysis is based on probability, which is why these pieces are usually presented together.

Computation is a tool that is well-suited to quantitative analysis, and computers are commonly used to process statistics. Also, computational experiments are useful for exploring concepts in probability and statistics.

The thesis of this book is that if you know how to program, you can use that skill to help you understand probability and statistics. These topics are often presented from a mathematical perspective, and that approach works well for some people. But some important ideas in this area are hard to work with mathematically and relatively easy to approach computationally.

The rest of this chapter presents a case study motivated by a question I heard when my wife and I were expecting our first child: do first babies tend to arrive late?

Do first babies arrive late?

If you Google this question, you will find plenty of discussion. Some people claim it's true, others say it's a myth, and some people say it's the other way around: first babies come early.

In many of these discussions, people provide data to support their claims.
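One way to see what a computational approach to a question like this can look like is a permutation test, which estimates a p-value by simulation. The sketch below is an illustration, not the analysis used in the book. The function name `permutation_p_value`, the group sizes, and the mean and spread of the simulated pregnancy lengths are all assumptions; the data are synthetic, both groups are drawn from a single distribution, and none of the numbers come from the NSFG.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, iters=2000, seed=17):
    """Estimate a two-sided p-value for the difference in group means.

    The group labels are shuffled repeatedly; the p-value is the fraction
    of shuffles whose absolute difference in means is at least as large
    as the difference actually observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)  # reassign values to the two groups at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / iters


# Synthetic pregnancy lengths in weeks. The mean (39) and standard
# deviation (2.5) are arbitrary values chosen for illustration only.
data_rng = random.Random(1)
first_babies = [data_rng.gauss(39, 2.5) for _ in range(200)]
other_babies = [data_rng.gauss(39, 2.5) for _ in range(200)]

p_value = permutation_p_value(first_babies, other_babies)
print(f"estimated p-value: {p_value:.3f}")
```

Because both synthetic groups come from the same distribution, any difference between their means is due to chance. With real survey records in place of the synthetic lists, the same function would estimate how often a difference at least as large as the observed one would arise by chance alone.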
