Transcription of Introduction to Quantitative Genomics / Genetics
Introduction to Quantitative Genomics / Genetics
BTRY 7210: Topics in Quantitative Genomics and Genetics
September 10, 2008
Jason G. Mezey

Outline
History and Intuition.
Statistical Framework.
Current Approaches.
Current Challenges.

History
Quantitative Genomics / Genetics may be loosely defined as the field concerned with the statistical modeling of the genetics of complex phenotypes.
Relevant history:
1900-1980: statistical analysis of the patterns of inheritance (the resemblance between relatives).
1980-2002: mapping (= identification) of the genetic loci responsible for most Mendelian diseases (diseases where alleles at a single genetic locus determine disease).
2002-present: the age of Genomics, with the first convincing mapping of genetic loci for complex traits (cases where genotype cannot be inferred directly from the phenotype).
Techniques come from three relatively distinct fields: evolutionary biology, agricultural sciences, medical sciences.
All three share a common goal: mapping genetic loci.

Biology
The genome of an individual is an important determinant of individual phenotype.
As a consequence, specific polymorphisms (SNPs, INDELs, transposable elements, chromosome aberrations, etc.) can produce differences among individuals in a population.
This is because different states of a polymorphism can result in differences in the biological processes that produce a phenotype. Such polymorphisms are causal.
The goal of Quantitative Genomics is to identify such causal polymorphisms, the genetic locus in which they are present, or their general genomic location.

Populations
We analyze a population (more specifically, a sample) of individuals to map causal polymorphisms or genetic loci.
At a minimum, data for mapping include measurements of phenotypes (= traits) and genotypes (= the set of genomic positions where individuals have different polymorphism states, or genetic loci where there are different alleles).
Since genomes are organized into chromosomes, genotypes in close physical proximity will be correlated. This is called Linkage Disequilibrium, and it means we need not have measured the causal polymorphism to map a genetic locus (the polymorphisms analyzed are genetic markers).
Currently, the data for mapping experiments typically include both phenotypic measurements and genotypes (>500K SNPs) for 100s to 10K+ individuals.
The challenge: using these data to identify individual causal polymorphisms or genetic loci.

The Statistical Model
Our objective is to draw conclusions about a population from a sample.
The sample space in this case is $\Omega = \{\mathcal{P} \times \mathcal{G}\}$, where $\mathcal{P}$ is the set of possible phenotypes and $\mathcal{G}$ is the set of possible genotypes.
On this sample space we define a probability function $p(\Omega)$ and the random variables $Y(\Omega)$ and $X(\Omega)$, where:
$Y$ maps to a $k$-dimensional vector of phenotypic measurements (each of which may be continuous or discrete).
$X$ maps to an $m$-dimensional vector where each element $X_j$ is a dummy variable representing the genotype at genomic position $j$ (usually taking values -1, 0, 1).
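
As a rough illustration of the genotype coding and of linkage disequilibrium between markers, the sketch below codes genotypes as -1, 0, 1 and measures the squared correlation between marker pairs. The function names (`code_genotypes`, `ld_r2`) and the toy genotype vectors are assumptions made for this example, and the squared Pearson correlation of unphased genotypes is used only as a common approximation to LD, not as the lecture's definition.

```python
import numpy as np

def code_genotypes(allele_counts):
    """Map counts of one allele (0, 1, 2 copies) to dummy values (-1, 0, 1)."""
    return np.asarray(allele_counts) - 1

def ld_r2(x_a, x_b):
    """Squared Pearson correlation between two coded genotype vectors."""
    r = np.corrcoef(x_a, x_b)[0, 1]
    return r ** 2

# Hypothetical genotypes for 8 individuals at three markers.
marker_1 = code_genotypes([0, 1, 2, 2, 1, 0, 1, 2])
marker_2 = code_genotypes([0, 1, 2, 2, 1, 0, 2, 2])  # differs from marker_1 in one individual
marker_3 = code_genotypes([1, 1, 0, 2, 2, 0, 1, 1])  # largely unrelated to marker_1

r2_near = ld_r2(marker_1, marker_2)
r2_far = ld_r2(marker_1, marker_3)
print(f"r^2 (marker 1, marker 2): {r2_near:.3f}")
print(f"r^2 (marker 1, marker 3): {r2_far:.3f}")
```

If marker 1 were causal but unmeasured, a test at marker 2 would still pick up most of its signal, which is why measured markers can stand in for causal polymorphisms.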
The Statistical Model
We assume that $Y$ and a causal subset $q < m$ of the genotypes $X$ are related by the following:
$Y = g(X) + \epsilon$
where $g(X) = E(Y|X)$ and $\epsilon$ is a random variable which accounts for the difference between the phenotypic vector of an individual $i$ and the expected value given the genotype.
For the purposes of Quantitative Genomic inference it is convenient to use Generalized Linear Models (GLM) to represent this relationship. These have the following properties (for a single phenotype $Y$):
The random component of the variable $Y$ has a distribution in the exponential family.
The link function $\gamma$ relates the random vector $X$ and parameter vector $\beta$ to $Y$: $E(Y|X) = \gamma^{-1}(X\beta)$.
The variance of $Y|X$ is a function of $E(Y|X)$.

A Recognizable Case
Consider the following for the GLM:
The random component $\epsilon$ follows a normal distribution with mean zero and unknown variance $\sigma^2_\epsilon$.
The link function is the identity function: $\gamma^{-1}(X\beta) = X\beta$.
The variance of $Y|X$ is a constant: $V(Y|X) = \sigma^2_\epsilon$.
In this case, the GLM is the simple linear regression model, which can be written (for a single phenotype and a single polymorphism $X$) as follows:
$Y = \mu + X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2_\epsilon)$

GLM Inference
Our goal is inference concerning GLM parameters using a sample.
For our purposes, we are interested in estimation and in testing hypotheses concerning GLM parameters (generally the $\beta$s).
There are two broad inference approaches:
Frequentist: do not assume parameters are random variables.
  Example, Maximum Likelihood Estimation: $\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} L(\theta|Y)$.
  Example, Likelihood Ratio Tests: $\Lambda = \sup_{\theta \in \Theta_0} L(\theta|Y) \,/\, \sup_{\theta \in \Theta} L(\theta|Y)$.
Bayesian: $p(\theta|Y) \propto p(Y|\theta)\,p(\theta)$.
  Example, estimation using the median of the posterior $p(\theta|Y)$.
  Example, we can test using a Bayes factor or a credible interval of the posterior.

Quantitative Genomic Inference
The most basic problem in Quantitative Genomics is determining which of the $q < m$ genotypes in $X$ are in linkage disequilibrium with polymorphisms which have causal effects on a phenotype $Y$.
Intuitively, these are cases where experimentally substituting one allele for another would produce a change in the expected phenotype:
$A_i A_j \rightarrow A_k A_l,\ ij \neq kl \ \Rightarrow\ \Delta E(Y) \neq 0$
The simplest approach for identifying such cases is to fit a GLM for each genotype.

Quantitative Genomic Inference
For example, if we have measured a normally distributed phenotype, such as human height, for $m$ genotypes we can fit $m$ simple linear regression models of the form:
$Y = \mu + X\beta + \epsilon$
and perform the following hypothesis test in each case:
$H_0: \beta = 0$
$H_A: \beta \neq 0$
For runs of genotypes in linkage disequilibrium where we reject $H_0$, we consider this reasonable evidence that we have mapped the causal polymorphism to a physical location in the genome.
This approach of applying individual tests for every genotype is currently the most commonly applied approach for mapping.

Alternative Individual Genotype Tests
The GLM is the quantitative genetic model that is the foundation of Quantitative Genomic / Genetic analysis (mapping loci, additive genetic variance, etc.).
However, intuitively, the parameterized GLM approach maps by testing for differences among the means of groups in a sample partitioned according to genotype (continuous traits), or for differences in the frequency of genotypes in (discrete) trait categories.
This means that any statistical approach which tests for such differences may be used for mapping genetic loci on a marker-by-marker basis.
An incomplete list of common approaches:
Continuous traits: Parametric (GLMs, ANOVAs, t-tests), Non-Parametric (Kruskal-Wallis, permutation-based).
Discrete traits: GLM, $\chi^2$, Cochran-Armitage, Fisher's Exact.
Pedigree Based: Transmission-Disequilibrium Test (TDT), sib-pair test.

Experimental Mapping Designs
Association Analysis / Linkage Disequilibrium Mapping refers to designs where the individuals in the sample are not related.
Linkage Analysis / Pedigree Analysis refers to designs where individuals in the sample are highly related and the relationships are generally known.
Controlled Breeding Designs / QTL Analysis (F2, RILs, NILs) refers to cases where the relationships among individuals are both known and controlled.

Challenges: Multiple Tests
The problem: If we perform tests for 500K markers, this is a severe multiple testing problem: we expect significant results by chance.
Question: How to control Type I error (false positives) without sacrificing power? The problem is made more difficult because our tests are correlated.
A few approaches: False Discovery Rates (FDR), Bonferroni, in combination with Principal Component Analysis (PCA), Permutation, Hidden Markov Models (HMM).

Challenges: Haplotypes
A haplotype is a combination of alleles transmitted together. Tests based on haplotypes (instead of individual polymorphisms) can sometimes be more powerful.
Question: How to infer haplotype structures that produce the best tests?
A few approaches: phasing using unrelated individuals and using family structure, haplotype testing in combination with linkage.

Challenges: Population Structure
The problem: If two populations differ in the mean value of a phenotype, or in the frequency of a discrete phenotype (disease), then every genotype that varies between them will produce a positive result.
Question: How to identify cryptic population structure and correct for it when testing?
A few approaches: STRUCTURE, PCA, incorporating population structure as a co-factor in the GLM, tests which are robust to structure.

Challenges: Shared Ancestry
The problem: If members of the population are related, this can lead to a reduction of power.
Question: How to estimate ancestry and incorporate this into our models?
A few approaches: pedigree-based estimation, haplotype-based estimates, mixed models with a co-ancestry factor, direct modeling.

Challenges: Multi-Locus Approaches
The problem: If there is more than one contributing genotype, this can lead to less power and to false positives when using individual marker testing approaches.
Question: How to fit a model with $p$ parameters when sample size $N$ is small without over-fitting (the large $p$, small $N$ problem)?
A few approaches: Bayesian hierarchical models, algorithmic searches and model selection (AIC, DIC, penalized likelihood).
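
To make the large $p$, small $N$ problem concrete, here is a minimal sketch of one penalized-likelihood fit. It assumes a ridge (squared-coefficient) penalty under normal errors, chosen here only because it has a simple closed form; the slides do not say which penalty is intended. The helper `ridge_fit`, the simulated genotypes, and the effect sizes are all hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 (closed-form solution)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p = 30, 200                       # far more markers than individuals

# Simulated genotypes coded -1, 0, 1, then centered per marker.
X = rng.integers(0, 3, size=(n, p)).astype(float) - 1.0
X -= X.mean(axis=0)

# Two hypothetical causal markers; all other effects are zero.
beta_true = np.zeros(p)
beta_true[[5, 50]] = 1.0
y = X @ beta_true + rng.normal(scale=0.5, size=n)
y -= y.mean()                        # centering stands in for the intercept

beta_weak = ridge_fit(X, y, lam=1.0)      # light shrinkage
beta_strong = ridge_fit(X, y, lam=100.0)  # heavy shrinkage

print("coefficient norm, lam=1:  ", np.linalg.norm(beta_weak))
print("coefficient norm, lam=100:", np.linalg.norm(beta_strong))
```

Ordinary least squares has no unique solution here because $X^T X$ is singular when $p > N$; the penalty term makes the system solvable, at the cost of shrinking all estimates toward zero.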
Challenges: Epistasis
The problem: The complete genetic model includes interactions among polymorphisms; in fact there are $3^q$ possible parameters.
Question: How to identify pair-wise effects when power is low? How to account for these effects overall, treating them as nuisance parameters?
A few approaches: exhaustive pairwise tests, hierarchical Bayesian modeling, kernel methods.

Other Challenges
Genetic architecture (Common Disease-Common Variant, Many Rare Variants) and determining the most powerful approaches for different architectures.
Imputation and testing approaches for missing genotypes.
Experimental designs for controlled breeding approaches.
Computationally efficient approaches for linkage analysis.
Coalescent-based approaches.

Next Week
I will lead discussion for the following paper: Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies; Hoggart et al.
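
As a recap of the marker-by-marker approach from the Quantitative Genomic Inference slides, the sketch below fits one simple linear regression per marker, tests $H_0: \beta = 0$ with a likelihood ratio test, and applies a Bonferroni threshold. The simulated data, the effect size, the causal marker index, and the helper `lrt_pvalue` are assumptions for illustration only. The p-value uses the asymptotic chi-square (1 df) distribution of $-2 \ln \Lambda$.

```python
import math
import numpy as np

def lrt_pvalue(y, x):
    """LRT of H0: beta = 0 in y = mu + x * beta + eps, with normal errors."""
    n = len(y)
    rss_null = np.sum((y - y.mean()) ** 2)            # model with mu only
    design = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss_alt = np.sum((y - design @ coef) ** 2)        # model with mu and beta
    # With normal errors, -2 ln(Lambda) = n * ln(RSS_null / RSS_alt).
    stat = max(n * math.log(rss_null / rss_alt), 0.0)
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

rng = np.random.default_rng(0)
n, m = 200, 50                                        # individuals, markers
genotypes = rng.integers(0, 3, size=(n, m)) - 1       # coded -1, 0, 1
causal = 10                                           # hypothetical causal marker
y = 1.0 * genotypes[:, causal] + rng.normal(size=n)

pvals = np.array([lrt_pvalue(y, genotypes[:, j]) for j in range(m)])

alpha = 0.05
bonferroni = alpha / m                                # per-test threshold
hits = np.flatnonzero(pvals < bonferroni)
print("markers passing Bonferroni:", hits.tolist())
```

The markers here are simulated independently, so there is no linkage disequilibrium; in real data, neighboring markers would be correlated and the Bonferroni threshold would be conservative.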