An introduction to biostatistics: part 1 - Academic Divisions

Cavan Reilly
September 4, 2019

Table of contents
Introduction to data analysis
Uncertainty
Probability
Conditional probability
Random variables
Discrete random variables
The law of large numbers
The central limit theorem
Associations between variables
Relative risk
Odds ratio
Statistical inference
Pearson's chi-squared test
Fisher's exact test
Inference for 2 groups
Inference with more than 2 groups

Data Analysis
One should start by examining what sorts of variables one has.
We distinguish between categorical and continuous variables: only a few possible values for the first and, in principle, an infinite number of possible values for the second.
Summaries: simple summaries of each variable, such as tables and the median.

Data Analysis
Let's read in some data from a genetic association study:
> fms <- (" ~ ")
> ls()
[1] "fms"
> dim(fms)
[1] 1397 347

Let's take a look at the variable called Race
> table(fms$Race)
African Am  Am Indian  Asian  Caucasian  Hispanic  Other
        44          1     97        791        52     49

Data Analysis
but we don't have this for all subjects
> table( (fms$Race))
FALSE  TRUE
 1034   363
We can also see how 2 variables look together
> table(fms$apoe_c472t,fms$Race)
    African Am  Am Indian  Asian  Caucasian  Hispanic  Other
CC           6          0      9        101         3      6
CT           1          0      3         26         0      0

Uncertainty
If I collected a different data set I would produce different results.
To provide a way to think about this, we think of the data we see as a realization of a random process.

The data set we have is only 1 of many we could have observed.
We model our observed data as values taken on by a random variable.

Probability
Probability is a mathematical framework that allows one to make statements about phenomena with uncertain outcomes.
For some experiment with an uncertain outcome (e.g. flipping a coin) the sample space is the collection of all possible outcomes (e.g. heads or tails).
The possible outcomes are called events and probability is a map from the collection of events to a number in the interval [0,1] in a way that larger values indicate the event is more likely.
We use notation like $A$ is an event and $P(A)$ is the probability of that event.

Probability
There are just a few rules that probabilities must satisfy
1. for all events, $A_i$, $0 \le P(A_i) \le 1$
2. if $S$ is the collection of all possible events, $P(S) = 1$
3. if $A_1$ and $A_2$ are mutually exclusive (i.e. they can't both occur in one trial of the experiment) then $P(A_1 \text{ or } A_2) = P(A_1) + P(A_2)$.

Probability
So, for example, if an experiment only has 2 possible outcomes and they are equally likely then the probability of each is one half.
The odds on an event is the number of times you expect to see the event happen for every time it doesn't happen.
So if the probability of an event is $k/N$ the odds on that event are $k$ to $N - k$.
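The link between a probability and the corresponding odds can be sketched in R as follows. This is only an illustration: the helper names `prob_to_odds` and `odds_to_prob` are hypothetical (they are not from the slides), and the values of `k` and `N` are arbitrary.

```r
# Hypothetical helpers (not from the slides): convert between a
# probability p and the odds p / (1 - p).
prob_to_odds <- function(p) {
  stopifnot(all(p >= 0 & p < 1))
  p / (1 - p)
}

odds_to_prob <- function(odds) {
  stopifnot(all(odds >= 0))
  odds / (1 + odds)
}

# An event with probability k/N has odds of k to N - k.
k <- 3
N <- 4
odds <- prob_to_odds(k / N)   # 0.75 / 0.25, i.e. 3 to 1
prob <- odds_to_prob(odds)    # converts the odds back to k/N
```

Writing odds as a single number (here 3) rather than as "3 to 1" is a convenience for computation; both describe the same quantity.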

> table(fms$apoe_c472t)
 CC  CT
128  30
So the observed odds on CC is 128 to 30 and the observed probability of seeing CC is 128/(128 + 30).

Conditional probability
Two events, $A_1$ and $A_2$, are independent if $P(A_1 \text{ and } A_2) = P(A_1)P(A_2)$.
The conditional probability of event $A_1$ given that event $A_2$ has occurred is given by $P(A_1 \text{ and } A_2)/P(A_2)$ and is denoted $P(A_1|A_2)$.
So what is $P(A_1|A_2)$ in terms of $P(A_2|A_1)$?
If 2 events are independent then $P(A_1|A_2) = P(A_1)$: this is the intuitive basis for understanding what independence means.

Random variables
A random variable is a quantity that takes on certain values with certain probabilities.
For example, if I toss a coin and assign $X$ the value of 0 if there is a head and 1 if there is a tail, then $X$ is a random variable.
The distribution of a random variable is the values that the random variable takes and the probability that it takes those values.
We model the data we observe as the result of a randomized experiment and consequently as random variables.

Discrete random variables
We can simulate random variables in R. For example, 100 fair coins in R
> (234)
> rbinom(100,size=1,p=.5)
 [1] 1 1 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1 1
[38] 0 1 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1 1 1 1 1 0 1 0
[75] 0 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 1 1 0 0 1 1 0 0 1 1
and 100 unfair coins,
> (234)
> rbinom(100,size=1,p=.05)
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[75] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0

Properties of random variables
The expectation of a random variable depends on its distribution and can be calculated as follows:
if $X$ is the random variable then its expectation is given by
$E X = \sum_k k \, P(X = k)$,
while the variance of a random variable is given by
$\mathrm{Var}\, X = \sum_k (k - E X)^2 P(X = k)$.
So if $X$ is 1 with probability $p$ and is otherwise 0 what is its expected value? Its variance?
There are a number of commonly used probability models that are encountered, both discrete and continuous.

Discrete random variables
The Bernoulli distribution: a single success or failure random variable.

Distribution only depends on the probability of a success.
The Binomial distribution: the number of successes in $n$ independent trials where each trial is a success or failure (and all trials have the same success probability). Distribution depends on the success probability and the total number of trials.
The Multinomial distribution is a generalization of the Binomial distribution to the case where each trial can have more than 2 outcomes. Distribution depends on success probabilities and number of trials.
The multinomial distribution is a natural model for nucleic acid sequences (since each position is one of 4 possible nucleotides).

The law of large numbers
Many possible models, none of which are particularly natural except perhaps the normal distribution (also called the Gaussian distribution).
Origin of the normal distribution: there are 2 critical results in probability theory.
The law of large numbers: if I have an infinitely long sequence of independent realizations of a random variable, then the sample mean will converge to the expectation of the random variable.
So if I compute the difference between the expectation and the sample mean it will go to zero.
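The convergence just described can be sketched with simulated 0/1 draws. This is a minimal illustration, not the author's code: the seed, the success probability and the sample sizes are arbitrary assumptions. It also applies the expectation and variance formulas to a variable that is 1 with probability p and 0 otherwise.

```r
# Sketch of the law of large numbers for 0/1 (Bernoulli) draws.
# The seed, p and the sample sizes are arbitrary illustrative choices.
set.seed(1)
p <- 0.6

# Expectation and variance from the defining sums over the values 0 and 1:
# the expectation works out to p and the variance to p * (1 - p).
expectation <- 0 * (1 - p) + 1 * p
variance <- (0 - expectation)^2 * (1 - p) + (1 - expectation)^2 * p

# Absolute gap between the sample mean and the expectation
# for increasingly long sequences of independent draws.
n_values <- c(10, 100, 10000, 1000000)
gaps <- sapply(n_values, function(n) {
  abs(mean(rbinom(n, size = 1, prob = p)) - expectation)
})
names(gaps) <- n_values
print(gaps)
```

The gaps need not shrink at every step, since each is a single random realization, but they tend toward zero as the number of draws grows.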

> mean(rbinom(10,size=1,p=.6))
[1]

The central limit theorem
But, if I multiply this difference by the square root of the number of samples, the resulting product will behave like a random variable.
In fact one can determine the distribution of this random variable: this distribution is the normal distribution.
We often describe this in terms of the probability distribution of the sample mean, which is called its sampling distribution:
$\bar{x} \sim N(\mu, \sigma^2/n)$.
Here $\mu$ is the expectation of the individual observations, $\sigma^2$ is their variance and $n$ is the number of independent observations.
Can you use the expression for the approximate sampling distribution to verify the first claim on this slide?

Continuous random variables
For this reason, some argue, traits which are the outcome of many different factors, each with a small impact, are well modeled as normal random variables.
It frequently transpires that if the data are modeled as normally distributed, calculations that are otherwise impossible become easy.
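The sampling distribution of the sample mean can be checked by simulation. The sketch below assumes 0/1 draws with success probability 0.6; the seed, the sample size and the number of replicates are arbitrary choices made for illustration.

```r
# Sketch of the central limit theorem using 0/1 draws.
# Seed, n and the number of replicates are arbitrary choices.
set.seed(2)
p <- 0.6       # expectation (mu) of each individual draw
n <- 400       # observations per sample
reps <- 10000  # number of simulated samples

# rbinom(reps, n, p) / n gives reps sample means of n Bernoulli draws.
sample_means <- rbinom(reps, size = n, prob = p) / n

# Theory: the sample mean is approximately N(mu, sigma^2 / n),
# with sigma^2 = p * (1 - p) for a 0/1 variable.
theory_sd <- sqrt(p * (1 - p) / n)
observed_sd <- sd(sample_means)

# sqrt(n) * (sample mean - mu) should have spread close to sigma.
scaled <- sqrt(n) * (sample_means - p)
print(c(theory_sd = theory_sd,
        observed_sd = observed_sd,
        scaled_sd = sd(scaled)))
```

A histogram of `sample_means` (for example with `hist(sample_means)`) gives a visual check of the bell shape.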

Quantiles are typically used to summarize continuous variables.
> summary(fms$ )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
                                                    365

Associations between variables
When you have 2 categorical variables, tables provide the most convenient way to summarize the association.
> table(fms$resistin_c30t,fms$resistin_g540a)
     AA  GA  GG
CC   75 281 356
CT    0  11  10
When should one use row or column percents? Does it matter here?

Associations between variables
We frequently are interested in variables with 2 categories where one category indicates health status (e.g. have cancer).
Often we would like to know if some other dichotomous variable increases the likelihood that one has this health outcome (e.g. smoking).
The risk is the probability that one has a particular health outcome.
If this risk is modified by another dichotomous variable, then we are often interested in how the risk varies with this other variable.

Relative risk
One can look at the difference in risk, P(cancer|smoke) - P(cancer|don't smoke), but for rare health outcomes this will always be a small number.
More commonly we look at the ratio rather than the difference: this is called the relative risk.
To directly estimate P(cancer|smoke) we would need to set up a study where we enrolled many smokers and follow them in time.
This is expensive and time consuming; it is much easier to recruit subjects with cancer and ascertain if they smoke.
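The risk difference and the relative risk can be computed from a 2 by 2 table of counts. The counts below are hypothetical, invented purely for illustration of a prospective study; they do not come from the FMS data or from the slides.

```r
# Hypothetical cohort counts, invented purely for illustration
# (they are not from the FMS data or the slides).
counts <- matrix(c(30, 970,
                   10, 990),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(exposure = c("smoke", "dont_smoke"),
                                 outcome = c("cancer", "no_cancer")))

# Risk = P(cancer | exposure group), i.e. the row proportions.
risk <- counts[, "cancer"] / rowSums(counts)

# Difference in risk and ratio of risks (the relative risk).
risk_difference <- risk["smoke"] - risk["dont_smoke"]
relative_risk <- risk["smoke"] / risk["dont_smoke"]
print(c(risk_difference = unname(risk_difference),
        relative_risk = unname(relative_risk)))
```

With these made-up counts the risk difference is small even though the risk in one group is a multiple of the risk in the other, which is why the ratio is often preferred for rare outcomes.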

Odds Ratio
Much like the relative risk is a ratio of probabilities, the odds ratio is a ratio of odds.
The big difference is that one can estimate the odds ratio using data from a retrospective study.
Also, the odds ratio and relative risk are approximately equal when the event is rare.

Associations between variables
Let's create a dichotomous variable that indicates if someone has a BMI greater than 30 and see if this is related to having metabolic syndrome.
> table(fms$Met_syn,fms$ >30)
    FALSE TRUE
0     688   53
1      37   39
How would you summarize this table?
Does a simple formula for the odds ratio exist?

Statistical inference
If I collected another data set in the same fashion as this one, do you think we would find a positive association?
To address this question, we frequently pose the question as follows: what is the probability that I would observe as big a difference as I have observed if there was not an association between these 2 variables?
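Returning to the 2 by 2 table of metabolic syndrome against BMI greater than 30, one way to summarize it is with an odds ratio. The sketch below re-enters the four printed counts by hand (the object and dimension names are assumptions made for illustration) and shows that the ratio of the two odds equals the cross-product ratio of the table.

```r
# Counts copied from the printed Met_syn by BMI > 30 table.
# The object name and dimnames are illustrative assumptions.
tab <- matrix(c(688, 53,
                 37, 39),
              nrow = 2, byrow = TRUE,
              dimnames = list(Met_syn = c("0", "1"),
                              BMI_over_30 = c("FALSE", "TRUE")))

# Odds of metabolic syndrome within each BMI group.
odds_high_bmi <- tab["1", "TRUE"] / tab["0", "TRUE"]    # 39 / 53
odds_low_bmi <- tab["1", "FALSE"] / tab["0", "FALSE"]   # 37 / 688

# The ratio of the two odds reduces to the cross-product ratio.
odds_ratio <- odds_high_bmi / odds_low_bmi
cross_product <- (tab["0", "FALSE"] * tab["1", "TRUE"]) /
  (tab["0", "TRUE"] * tab["1", "FALSE"])
print(c(odds_ratio = odds_ratio, cross_product = cross_product))
```

Both quantities come to roughly 13.7, so in this sample the odds of metabolic syndrome are much higher among subjects with a BMI over 30.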

Statistical inference
About 9% of the subjects have metabolic syndrome and about 10% of subjects have a BMI greater than 30, so if these 2 variables are independent then the probability of both of these properties being true is about 1%.
However we have observed about 5% of our subjects having a BMI over 30 and having metabolic syndrome: is this simply too large to call chance variation?

Statistical inference
Imagine simulating this table under the assumption of independence.
To do so we will assume that the margins of the table are just as we have observed.
92 subjects have BMI greater than 30 and if there is no association between the 2 variables about 9% of these subjects will have metabolic syndrome, so conduct a few simulations
> rbinom(1,92,p=.09)
[1] 10
> rbinom(1,92,p=.09)
[1] 6
> rbinom(1,92,p=.09)
[1] 8

Statistical inference
These values are much smaller than our observed value of 39, in fact looking at a million of these variables we are never even close to the observed value of 39
> summary(rbinom(1000000,92,p=.09))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
So what we are seeing doesn't even occur 1 in a million times if these 2 variables are independent.

Statistical inference
The process of using observed data to make statements about unobserved data (e.g. data we will observe in the future) is called statistical inference.
Statistical hypothesis tests are used to produce estimates of the probability of observing something more extreme than what we have observed.
These probabilities are called p-values and when a p-value is less than a chosen threshold (conventionally 0.05) we say that the result is statistically significant.
To test for an association between 2 dichotomous variables one uses a test procedure called Pearson's chi-squared test.

Pearson's Chi-squared test
In R this is accomplished with the following code
> (table(fms$Met_syn,fms$ >30))
Pearson's Chi-squared test with Yates' continuity correction
data: table(fms$Met_syn, fms$ > 30)
X-squared = , df = 1, p-value <
Since the p-value is so small the software just reports that it is less than a very small number.

Pearson's Chi-squared test
If there were fewer observations it would be more difficult to assess if there was an association:
> table(fms$Met_syn[1:40],fms$ [1:40]>30)
    FALSE TRUE
0      26    3
1       1    0
and if we conduct Pearson's test we get a warning

Pearson's Chi-squared test
> (table(fms$Met_syn[1:40],fms$ [1:40]>30))
Pearson's Chi-squared test with Yates' continuity correction
data: table(fms$Met_syn[1:40], fms$ [1:40] > 30)
X-squared = , df = 1, p-value = 1
Warning message:
In (table(fms$Met_syn[1:40], fms$ [1:40] > 30)).
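Because the FMS data are not loaded here, the two tests can be sketched by re-entering the printed counts as matrices. The object names are assumptions for illustration. The sketch uses base R's `chisq.test` (which applies the continuity correction to 2 by 2 tables by default) and `fisher.test`, the exact test listed in the table of contents, as one possible alternative when cell counts are small.

```r
# Counts re-entered by hand from the two printed tables;
# object names are illustrative assumptions.
full_tab <- matrix(c(688, 53,
                      37, 39),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Met_syn = c("0", "1"),
                                   BMI_over_30 = c("FALSE", "TRUE")))
small_tab <- matrix(c(26, 3,
                       1, 0),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Met_syn = c("0", "1"),
                                    BMI_over_30 = c("FALSE", "TRUE")))

# Pearson's chi-squared test on the full table.
full_test <- chisq.test(full_tab)
print(full_test)

# With very small expected counts chisq.test warns that the chi-squared
# approximation may be incorrect; the warning is suppressed here only
# to keep the sketch quiet.
small_chisq <- suppressWarnings(chisq.test(small_tab))

# Fisher's exact test does not rely on the large-sample approximation.
small_fisher <- fisher.test(small_tab)
print(small_fisher$p.value)
```

On the small table neither test gives evidence of an association, in line with the slide's point that fewer observations make an association harder to assess.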
