
Chapter 8 Hidden Markov Chains - math.rutgers.edu




Chapter 8. Hidden Markov Chains

Definition and basic theory

Let X1, X2, X3, ... be the nucleotide bases observed along a randomly chosen DNA strand. The Markov chain models introduced so far for modelling {Xn} have made the assumption of homogeneity, that is, that the transition probabilities

    a_{ij} = P(X_{n+1} = j | X_n = i)

are the same for all sites n. (Since n is generally thought of as time in Markov chain theory, this property is often called time-homogeneity.) As different parts of a DNA strand may have different functions or chemical properties, the assumption of homogeneity may be very inaccurate. A case in point is the phenomenon of CpG-islands. In some regions of DNA, the two-base sequence 5' GC 3' appears with markedly greater frequency than in other regions. The culprit is a process called DNA methylation, whereby methyl groups (-CH3) attach themselves to nucleotide bases. The methyl group binds to base C in the pair and promotes its decay, or rather, transformation, to T. This process operates commonly over most of the DNA sequence.

(I read that in human DNA it reduces the occurrence of GC to about one-fifth of what one would expect; I guess this means that the frequency is one-fifth of what one would expect from an independent-site model, but biologists don't seem to be given to such accuracy of statement (grumble, grumble).) However, there are regions, called CpG-islands, which are often close to genes and are associated with regulating gene expression, where this decay is prevented and a higher frequency of GC is maintained. Suppose we built a Markov chain to model the succession of bases inside CpG islands alone and a separate Markov chain to model the bases outside CpG islands. The transition probability a_{GC} would differ substantially in the two models. In models estimated from biological data in the book, Biological Sequence Analysis, by Durbin, Eddy, Krogh, and Mitchison, the values for a_{GC} are ... and ... in the respective cases. A homogeneous Markov chain model for entire DNA sequences would miss the CpG islands.

An obvious way to handle CpG islands using Markov chains would be to relax homogeneity by allowing the transition probabilities to depend on n:

    a^{(n)}_{ij} = P(X_{n+1} = j | X_n = i).

There are at least two problems with this approach. It is first of all impractical. If we want to model sequences of even moderate length, say 10^3 base pairs, the model will have 12,000 parameters, because we will need to specify 1000 4x4 transition matrices, each of which has 12 free parameters. Second, the inhomogeneous model would not allow random variation in the location of CpG islands; their locations would be fixed around those sites n for which the transition probability a^{(n)}_{GC} exceeds some suitably large threshold.

In this section we introduce so-called hidden Markov models (HMM). They require many fewer parameters than general, inhomogeneous Markov chain models, yet can model randomly changing transition probabilities. They are popular and flexible models for many other biological sequence problems as well. Finally, they retain much of the simplicity of Markov models, and so are still relatively easy to compute with.

Let A := {A1, ..., AK} be a finite set, which we refer to as an alphabet. In biological sequence analysis, A is either the DNA alphabet {A, G, C, T}, or the protein alphabet, a set of 20 capital letters used as code letters for the 20 amino acids appearing in proteins.

A hidden Markov model for a random sequence {X1, X2, X3, ...} of letters from A is constructed as follows. We imagine a finite set of states, labelled from 1 through N, each of which has a device that emits letters from A. These emitters will be responsible for producing {Xn}, which is called the observed sequence or observed process. It is important for the flexibility of the model that different states emit the letters with different probabilities. Now we add a start state 0, and introduce an auxiliary, homogeneous Markov chain Z0, Z1, Z2, ... evolving in S = {0, 1, ..., N}. We shall always assume Z0 = 0, since 0 is the start state; in some models we may also want to add to S a stop state that does not emit letters. A visit to the stop state terminates the chain. {Zn} is the hidden chain. In the model, it determines the sequence of states that emit the letters of the observed process. That is, state Z1 emits letter X1, state Z2 emits letter X2, and so on.
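The construction just described is generative: the hidden chain moves, and each visited state emits one letter. A minimal sketch in Python, assuming a hypothetical two-state model over the DNA alphabet (all the numerical values below are invented for illustration and do not come from the text):

```python
import random

random.seed(1)

# Hypothetical hidden-chain transition probabilities; state 0 is the
# start state, and states 1 and 2 are the emitting states.
A = {0: {1: 0.5, 2: 0.5},
     1: {1: 0.95, 2: 0.05},
     2: {1: 0.05, 2: 0.95}}
# Hypothetical emission probabilities e_k(x) for each emitting state.
E = {1: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     2: {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def sample_hmm(n):
    """Run the hidden chain Z1..Zn from Z0 = 0; each visited state
    emits one letter, producing the observed sequence X1..Xn."""
    path, letters = [], []
    state = 0
    for _ in range(n):
        nxt, probs = zip(*A[state].items())
        state = random.choices(nxt, weights=probs)[0]
        path.append(state)
        syms, eprobs = zip(*E[state].items())
        letters.append(random.choices(syms, weights=eprobs)[0])
    return path, letters

path, letters = sample_hmm(12)
print("hidden:  ", path)          # the path s1..sn (unobserved in practice)
print("observed:", "".join(letters))
```

The point of the model is that only the second printed line is available to an observer; the first is hidden.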

To complete the definition of the model, it is necessary to specify the joint probabilities of sequences of emissions and paths of the hidden chain. The crucial assumption is that, given the sequence of visited states, the successive emissions are mutually independent. The precise mathematical meaning of this assumption is expressed in ( ) in the formal definition.

A hidden Markov chain model is a pair of processes ({Zn}, {Xn}), where {Zn} is a Markov chain and {Xn} is the sequence of letters in alphabet A emitted by the states visited by {Zn}, such that the joint emission and path probabilities are calculated as follows. Let A denote the transition probability matrix for {Zn}. For each state k, 1 <= k <= N, let

    e_k(x) := P(state k emits letter x).

Then

    P(X1 = x1, ..., Xn = xn; Z1 = s1, ..., Zn = sn)
        = [P(Z1 = s1) e_{s1}(x1)] [a_{s1 s2} e_{s2}(x2)] ... [a_{s_{n-1} s_n} e_{s_n}(xn)].    ( )

Our first goal is to understand the formula ( ) for the joint probabilities. To simplify notation, we shall write

    p(x1 ... xn; s1 ... sn) = P(X1 = x1, ..., Xn = xn; Z1 = s1, ..., Zn = sn).

Since it is assumed that Z0 = 0, where 0 is the start state, P(Z1 = s1) = a_{0 s1}, and ( ) can be rewritten:

    p(x1 ... xn; s1 ... sn) = [a_{0 s1} e_{s1}(x1)] [a_{s1 s2} e_{s2}(x2)] ... [a_{s_{n-1} s_n} e_{s_n}(xn)]
        = \prod_{i=1}^{n} a_{s_{i-1} s_i} e_{s_i}(x_i).    ( )

The last expression makes the formula easy to remember; the factor a_{s_{i-1} s_i} e_{s_i}(x_i) is the probability for the chain to move from state s_{i-1} to s_i and then to emit letter x_i from s_i. The probability of the entire path is just the product of these successive transition-with-emission probabilities along the path. Notice that if we took away the emission probabilities from the formula, we would be left just with the product a_{0 s1} ... a_{s_{n-1} s_n}, which is the probability of the path Z1 = s1, ..., Zn = sn of the hidden chain. Figure 1 illustrates the operation of the hidden Markov model.

[Figure 1: the hidden chain moves 0 -> Z1 = s1 -> Z2 = s2 -> Z3 = s3 -> ... with transition probabilities a_{0 s1}, a_{s1 s2}, a_{s2 s3}, a_{s3 s4}, and each visited state s_i emits the letter x_i with probability e_{s_i}(x_i).]

Example 1. This example is adapted from the text, Biological Sequence Analysis, by Durbin, Eddy, Krogh, and Mitchison, wherein it is called the occasionally dishonest casino. Imagine a betting game which depends on independent coin tosses. Let us call the coin tosser the casino. The casino has two coins. Coin 1 is fair and coin 2 is loaded with probability .3 of heads.

At each play, the casino will change the coin being flipped with a small probability, say .1, but the player does not know what coin is being tossed. All the player sees is the sequence of heads and tails that result. Suppose that the casino chooses which coin to start with at random, with each coin having the same probability of being chosen.

This coin tossing process is a hidden Markov chain. The alphabet A is {H, T}. The state space S contains three states: the start state 0, the state corresponding to tossing coin 1, which we call state 1, and the state corresponding to tossing coin 2, which we call state 2. The hidden Markov chain {Zn} evolving in S determines which coin, fair or foul, is tossed in each play. According to the given prescription, the transition probability matrix for {Zn} is

          0    1    2
     0    0   .5   .5
     1    0   .9   .1
     2    0   .1   .9

The emission probabilities are

    e1(H) = .5,  e1(T) = .5,
    e2(H) = .3,  e2(T) = .7.

As an example we compute

    p(TTTHHTTHTT; 2222221111)
        = [(.5)(.7)] [(.9)(.7)]^2 [(.9)(.3)]^2 [(.9)(.7)] [(.1)(.5)] [(.9)(.5)]^3
        = (.5)^5 (.9)^8 (.7)^4 (.3)^2 (.1).

Example 2. A hidden Markov model for CpG islands. In this example we shall build a HMM for DNA with CpG islands. The model we discuss here won't be the most general one, because, in the interests of simplicity, we shall try to keep the number of parameters in the model to the minimum necessary. We will assume that DNA strictly inside a CpG island can be modelled well by a homogeneous Markov chain, and that we have built such a model. Denote the transition probability matrix of this model by

    B1 :=  ( a_AA  a_AC  a_AG  a_AT
             a_CA  a_CC  a_CG  a_CT
             a_GA  a_GC  a_GG  a_GT
             a_TA  a_TC  a_TG  a_TT )    ( )

Likewise, assume we have a homogeneous Markov model for DNA outside of CpG-islands, specified by a transition probability matrix

    B2 :=  ( b_AA  b_AC  b_AG  b_AT
             b_CA  b_CC  b_CG  b_CT
             b_GA  b_GC  b_GG  b_GT
             b_TA  b_TC  b_TG  b_TT )    ( )

Numerical values for these matrices may be estimated by using the maximum likelihood estimates discussed in section 6 of the lecture on Markov chains. For human DNA data, Durbin, Eddy, Krogh and Mitchison calculate the following (which appear to be for DNA read in the 3' 5' direction).
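Before turning to numerical values for B1 and B2, it is worth checking the casino computation of Example 1 mechanically: the joint probability p(x1 ... xn; s1 ... sn) is just a product of transition-with-emission factors. The sketch below encodes the casino chain (fair coin in state 1, heads probability .3 for the loaded coin, switch probability .1, as in the worked product above) and recomputes p(TTTHHTTHTT; 2222221111):

```python
# Casino HMM of Example 1: states 0 (start), 1 (fair coin), 2 (loaded coin).
A = [[0.0, 0.5, 0.5],   # from the start state 0
     [0.0, 0.9, 0.1],   # from state 1
     [0.0, 0.1, 0.9]]   # from state 2
E = {1: {"H": 0.5, "T": 0.5},
     2: {"H": 0.3, "T": 0.7}}

def joint_prob(letters, path):
    """p(x1..xn; s1..sn) = prod_i a_{s_{i-1} s_i} e_{s_i}(x_i), with s0 = 0."""
    p, prev = 1.0, 0
    for x, s in zip(letters, path):
        p *= A[prev][s] * E[s][x]   # move to state s, then emit letter x
        prev = s
    return p

p = joint_prob("TTTHHTTHTT", [2, 2, 2, 2, 2, 2, 1, 1, 1, 1])
print(p)  # equals (.5)^5 (.9)^8 (.7)^4 (.3)^2 (.1)
```

The same function evaluates the joint probability of any observed sequence together with any candidate hidden path.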

    B1 = ...     B2 = ...

The object is to create a HMM that occasionally and randomly switches between the two different Markov chain models. To do this, we shall construct a state space S for the hidden chain consisting of nine states: the start state 0, four states A1, C1, G1, T1, and another four states A2, C2, G2, T2. (The start state is included only for consistency with the general framework developed above.) The states A1 and A2 emit only A; that is, e_{A1}(A) = 1 and e_{A2}(A) = 1. Likewise C1 and C2 emit only C, and similarly for the other states; emissions are not random in this model. However, whether it is, for example, A1 or A2 that emits A is not observed. The hidden chain will determine whether sites are emitted in a CpG island or not. CpG islands correspond to sequences of sites for which the hidden chain is moving about in the set of states {A1, C1, G1, T1}; the site is not in a CpG island if the hidden chain is instead in the set {A2, C2, G2, T2}. In short, the states {A1, C1, G1, T1} emit letters in CpG islands, while the states {A2, C2, G2, T2} emit letters outside of the islands.

We want the chain to spend relatively long stretches in either set of states, with transitions according to the respective matrices B1 and B2, and with occasional transitions from one set to the other. Specifically, assume there are two small parameters, say ε and δ, such that

(i) If Zi is in {A1, C1, G1, T1} (site i is in a CpG island), then with probability ε, the island will end at site i, and Z_{i+1} takes on one of the values {A2, C2, G2, T2}, each state being equally likely. With probability 1 − ε, the island does not end, and the move from Zi to Z_{i+1} takes place according to the transition probability matrix B1.

(ii) If Zi is in {A2, C2, G2, T2} (site i is not in a CpG island), then with probability δ, an island will begin at site i + 1, and Z_{i+1} takes on one of the values {A1, C1, G1, T1}, each state being equally likely. With probability 1 − δ, an island does not begin, and the move from Zi to Z_{i+1} takes place according to the transition probability matrix B2.

These assumptions translate into the transition probabilities for the hidden chain. For example, according to assumption (i),

    P(Z_{i+1} = A2 | Z_i = A1) = ε (1/4),
    P(Z_{i+1} = A1 | Z_i = A1) = (1 − ε) a_AA.    ( )

By following the pattern in these equalities, we can write the full state transition matrix.
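Because the switching assumptions pin down every entry, the hidden chain's transition matrix can be assembled mechanically. In the sketch below the island and non-island matrices and the two small switching parameters (written eps and delta) are illustrative placeholders, not the estimates quoted from Durbin et al.; the start state is assumed, for simplicity, to jump to each emitting state with equal probability, a choice the text leaves unspecified.

```python
# Hypothetical 4x4 base-transition matrices (rows/columns in the order
# A, C, G, T); placeholder values, each row summing to 1.
B1 = [[0.18, 0.27, 0.43, 0.12],   # inside a CpG island
      [0.17, 0.37, 0.27, 0.19],
      [0.16, 0.34, 0.37, 0.13],
      [0.08, 0.36, 0.38, 0.18]]
B2 = [[0.30, 0.21, 0.28, 0.21],   # outside the islands
      [0.32, 0.30, 0.08, 0.30],
      [0.25, 0.25, 0.30, 0.20],
      [0.18, 0.24, 0.29, 0.29]]
eps, delta = 0.01, 0.001          # assumed small switching parameters

# Hidden-chain states: 0 = start, 1..4 = A1,C1,G1,T1, 5..8 = A2,C2,G2,T2.
A = [[0.0] * 9 for _ in range(9)]
for j in range(1, 9):             # assumed uniform start distribution
    A[0][j] = 1 / 8
for i in range(4):
    for j in range(4):
        A[1 + i][1 + j] = (1 - eps) * B1[i][j]    # stay in the island
        A[1 + i][5 + j] = eps / 4                 # island ends, uniform target
        A[5 + i][5 + j] = (1 - delta) * B2[i][j]  # stay outside
        A[5 + i][1 + j] = delta / 4               # island begins, uniform target

# Every row of a transition probability matrix must sum to 1.
print(all(abs(sum(row) - 1.0) < 1e-12 for row in A))  # True
```

Entries such as A[1][1] = (1 − eps) * B1[0][0] and A[1][5] = eps / 4 reproduce the pattern of the two equalities displayed above for P(Z_{i+1} = A1 | Z_i = A1) and P(Z_{i+1} = A2 | Z_i = A1).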

