Transcription of Introduction Sequence Alignment
1 , , Fall 2011 IntroductionSequence AlignmentMotivation: assess similarity of sequences and learn about theirevolutionary relationshipWhy do we want to know this?Example:SequencesACCCGAACTATCCTA alignAlignmentACCCGAAC--TATCC-TAHomology : Alignment reasonable, if sequences homologousACCGAACCCGAACCTATCCTATCTCACTAD efinition ( Sequence Homology)Two or more sequences arehomologousiffthey evolved from a common ancestor.[Homology in anatomy] , , Fall 2011 IntroductionPlan (and Some Preliminaries) First: study only pairwise , such that 6 . is called thegap elements of are twosequences a,b . For pairwise Sequence comparison: define edit distance, definealignment distance, show equivalence of distances, definealignment problem and efficient algorithmgap penalties, local Alignment Later: extend pairwise Alignment to multiple alignmentDefinition (Alphabet, words)Analphabet is a finite set (ofsymbols/characters).
2 +denotesthe set ofnon-empty wordsof , +:= i>0 i. A wordx nhaslength n, written|x|. := + { }, where denotestheempty wordof length , , Fall 2011 IntroductionLevenshtein DistanceDefinitionTheLevenshtein Distancebetween two words/sequences is theminimal number of substitutions, insertions and deletions totransform one into the and ACTA have (at most) distance 3:ACCCGA ACCGA ACCTA ACTAIn biology, operations have different cost. (Why?) , , Fall 2011 IntroductionEdit Distance: OperationsDefinition (Edit Operations)Anedit operationis a pair (x,y) ( { }6= ( , ). We call(x,y) substitutioniffx6= andy6= deletioniffy= insertioniffx= For sequencesa,b, writea (x,y)b, iffais transformed tobbyoperation (x,y). Furthermore, writea Sb, iffais transformedtobby a Sequence of edit (C, )ACCGA (G,T)ACCTA ( ,T)ATCCTAACCCGA (C, ),(G,T),( ,T)ATCCTAR ecall: 6 ,a,bare sequences in , , Fall 2011 IntroductionEdit Distance: Cost and Problem DefinitionDefinition (Cost, Edit Distance)Letw: ( { })2 R, such thatw(x,y) is thecost of an editoperation(x,y).)
3 Thecost of a Sequence of edit operationsS=e1,..,enis w(S) =n i=1w(e1).Theedit distance of sequences a and bisdw(a,b) = min{ w(S)|a Sb}. , , Fall 2011 IntroductionEdit Distance: Cost and Problem DefinitionDefinition (Cost, Edit Distance)Letw: ( { })2 R, such thatw(x,y) is thecost of an editoperation(x,y). Thecost of a Sequence of edit operationsS=e1,..,enis w(S) =n i=1w(e1).Theedit distance of sequences a and bisdw(a,b) = min{ w(S)|a Sb}.Is the definition reasonable?Definition (Metric)A functiond:X2 Ris calledmetriciff 1.)d(x,y) = 0 iffx=y2.)d(x,y) =d(y,x) 3.)d(x,y) d(x,z) +d(z,y).Remarks: 1.) for metric d,d(x,y) 0, 2.)dwis metric iffw(x,y) 0,3.) In the following, assumedwis , , Fall 2011 IntroductionEdit Distance: Cost and Problem DefinitionDefinition (Cost, Edit Distance)Letw: ( { })2 R, such thatw(x,y) is thecost of an editoperation(x,y). Thecost of a Sequence of edit operationsS=e1.
4 ,enis w(S) =n i=1w(e1).Theedit distance of sequences a and bisdw(a,b) = min{ w(S)|a Sb}.Remarks Natural evolution-motivated problem definition. Not obvious how to compute edit distance efficiently define Alignment , , Fall 2011 IntroductionAlignment DistanceDefinition ( Alignment )A pair of wordsa ,b ( { }) is calledalignment ofsequences a and b(a andb are calledalignment strings), iff1.|a |=|b |2. for all 1 i |a |:a i6= orb i6= 3. deleting all gap symbols froma yieldsaand deleting all fromb yieldsbExamplea=ACGGATb=CCGCTT possible alignments area =AC-GG-ATb =-CCGCT-Tora =ACGG---ATb =--CCGCT-Tor ..(exponentially many)edit operations of first Alignment : (A,-),(-,C),(G,C),(-,T),(A,-) , , Fall 2011 IntroductionAlignment DistanceDefinition (Cost of Alignment , Alignment Distance)Thecost of the Alignment (a ,b ), given a cost functionwon editoperations isw(a ,b ) =|a | i=1w(a i,b i)Thealignment distanceof a and b isDw(a,b) = min{w(a ,b )|(a ,b ) is Alignment ofaandb}.
5 , , Fall 2011 IntroductionAlignment Distance = Edit DistanceTheorem (Equivalence of Edit and Alignment Distance)For metric w , dw(a,b) =Dw(a,b).Recall:Definition (Edit Distance)Theedit distance of a and bisdw(a,b) = min{ w(S)|atransformed tobby }.Definition ( Alignment Distance)Thealignment distanceof a and b isDw(a,b) = min{w(a ,b )|(a ,b ) is Alignment ofaandb}. , , Fall 2011 IntroductionAlignment Distance = Edit DistanceTheorem (Equivalence of Edit and Alignment Distance)For metric w , dw(a,b) =Dw(a,b).Remarks Proof idea:dw(a,b) Dw(a,b): Alignment yields Sequence of edit opsDw(a,b) dw(a,b): Sequence of edit ops yields equal orbetter Alignment (needs triangle inequality) Reduces edit distance to Alignment distance We will see: the Alignment distance is computed efficiently bydynamic programming (usingBellman s Principle ofOptimality). , , Fall 2011 IntroductionPrinciple of Optimality and Dynamic ProgrammingPrinciple of Optimality: Optimal solutions consist of optimal partial solutions Example: Shortest PathIdea of Dynamic Programming (DP): Solve partial problems first and materialize results (recursively) solve larger problems based on smaller onesRemarks The principle is valid for the Alignment distance problem Principle of Optimality enables the programming method DP Dynamic programming is widely used in ComputationalBiology and you will meet it quite often in this , , Fall 2011 IntroductionAlignment MatrixIdea:choose Alignment distances of solutions and define matrix of these partial :=|a|,m:=|b|.
6 Definition ( Alignment matrix)Thealignment matrixofaandbis the (n+ 1) (m+ 1)-matrixD:= (Dij)0 i n,0 j mdefined byDij:=Dw( , )(= min{w(a ,b )|(a ,b ) is Alignment }).Notational remarks aiis the i-th character ofa the sequenceaxax+ (subsequenceofa). by ifx> , , Fall 2011 IntroductionAlignment Matrix ExampleExample a=AT,b=AAGT w(x,y) ={0iffx=y1otherwiseA A G TATR emark: The Alignment matrixDcontains the Alignment distance(=edit distance) ofaandbinDn, , , Fall 2011 IntroductionAlignment Matrix ExampleExample a=AT,b=AAGT w(x,y) ={0iffx=y1otherwiseA A G TAT0 1 2 3 4120 1 22311 2 Remark: The Alignment matrixDcontains the Alignment distance(=edit distance) ofaandbinDn, , , Fall 2011 IntroductionNeedleman-Wunsch AlgorithmClaimFor (a ,b ) Alignment ofaandbwith lengthr=|a |,w(a ,b ) =w(a 1,b 1) +w(a r,b r).TheoremFor the Alignment matrix D of a and b, holds that D0,0= 0 for all1 i n: Di,0= ik=1w(ak, ) =Di 1,0+w(ai, ) for all1 j m: D0,j= jk=1w( ,bk) =D0,j 1+w( ,bj) Dij= min Di 1,j 1+w(ai,bj)(match)Di 1,j+w(ai, )(deletion)Di,j 1+w( ,bj)(insertion)Remark: The theorem claims that each prefix Alignment distancecan be computed from a constant number of smaller ?}}
7 ?? , , Fall 2011 IntroductionNeedleman-Wunsch AlgorithmClaimFor (a ,b ) Alignment ofaandbwith lengthr=|a |,w(a ,b ) =w(a 1,b 1) +w(a r,b r).TheoremFor the Alignment matrix D of a and b, holds that D0,0= 0 for all1 i n: Di,0= ik=1w(ak, ) =Di 1,0+w(ai, ) for all1 j m: D0,j= jk=1w( ,bk) =D0,j 1+w( ,bj) Dij= min Di 1,j 1+w(ai,bj)(match)Di 1,j+w(ai, )(deletion)Di,j 1+w( ,bj)(insertion)Remark: The theorem claims that each prefix Alignment distancecan be computed from a constant number of smaller : Induction over i+ , , Fall 2011 IntroductionNeedleman-Wunsch Algorithm (Pseudocode)D0,0:= 0fori:= 1 tondoDi,0:=Di 1,0+w(ai, )end forforj:= 1 tomdoD0,j:=D0,j 1+w( ,bj)end forfori:= 1 tondoforj:= 1 tomdoDi,j:= min Di 1,j 1+w(ai,bj)Di 1,j+w(ai, )Di,j 1+w( ,bj)end forend , , Fall 2011 IntroductionBack to ExampleExample a=AT,b=AAGT w(x,y) ={0iffx=y1otherwiseA A G TAT0 1 2 3 41 02 Open: how to find best Alignment ?}
8 , , Fall 2011 IntroductionBack to ExampleExample a=AT,b=AAGT w(x,y) ={0iffx=y1otherwiseA A G TAT0 1 2 3 4120 1 22311 2 Open: how to find best Alignment ? , , Fall 2011 IntroductionTracebackw(x,y) ={0iffx=y1otherwiseA A G TAT0 1 2 3 4120 1 22311 2 Remarks Start in (n,m). For every (i,j) determine optimal case. Not necessarily unique. Sequence oftrace arrowslet s infer best , , Fall 2011 IntroductionTracebackw(x,y) ={0iffx=y1otherwiseA A G TAT0 1 2 3 4120 1 22311 2 Remarks Start in (n,m). For every (i,j) determine optimal case. Not necessarily unique. Sequence oftrace arrowslet s infer best , , Fall 2011 IntroductionComplexity compute one entry: three cases, constant time nmentries fill matrix inO(nm) time traceback:O(n+m) time TOTAL:O(n2) time and space (assumingm n)Remarks assumingm nis since we can exchangeaandb space complexity can be improved toO(n) for computation ofdistance (simple, store only current and last row ) andtraceback (more involved; Hirschberg-algorithm uses Divideand Conquer for computing trace) , , Fall 2011 IntroductionPlan We have seen how to compute the pairwise edit distance andthe corresponding optimal Alignment .}}}
9 Before going multiple, we will look at two further specialtopics for pairwise Alignment : more realistic, non-linear gap cost and similarity scores and local , , Fall 2011 IntroductionAlignment Cost RevisitedMotivation: The alignmentsGA--TGAAGTandG-A-TGAAGT have the same editdistance. The first one is biologically more reasonable: it is more likelythat evolution introduces one large gap than two small ones. This means: gap cost should be non-linear, sub-additive! , , Fall 2011 IntroductionGap PenaltyDefinition (Gap Penalty)Agap penaltyis a functiong:N Rthat is sub-additive, (k+l) g(k) +g(l).Agapin an Alignment stringa is a substring ofa that consists ofonly gap symbols and is maximally extended. a is themulti-set of gaps ina .Thealignment cost with gap penalty g of(a ,b ) iswg(a ,b ) = 1 r |a |,wherea r6= ,b r6= w(a r,b r)(cost of mismatchs)+ x a ] b g(|x|)(cost of gaps)Example:a =ATG---CGAC--GCb =-TGCGGCG-CTTTC a ={---,--}, b ={-,-} , , Fall 2011 IntroductionGeneral sub-additive gap penaltyTheoremLet D be the Alignment matrix of a and b with cost w and gappenalty g , such that Dij=wg( , ).
10 Then: D0,0= 0 for all1 i n: Di,0=g(i) for all1 j m: D0,j=g(j) Dij= min Di 1,j 1+w(ai,bj)(match)min1 k iDi k,j+g(k)(deletion of length k)min1 k jDi,j k+g(k)(insertion of length k)Remarks ComplexityO(n3) time,O(n2) space pseudocode, correctness, traceback left as exercise much more realistic, but significantly more expensive thanNeedleman-Wunsch can we improve it? , , Fall 2011 IntroductionAffine gap costDefinitionA gap penalty is affine, iff there are real constants and , suchthat for allk N:g(k) = + Affine gap penalties almost as good as general ones:Distinguishing gap opening ( ) and gap extension cost ( ) is biologically reasonable . The minimal Alignment cost with affine gap penalty can becomputed inO(n2) time! (Gotoh algorithm) , , Fall 2011 IntroductionGotoh algorithm: sketch onlyIn addition to the Alignment matrixD, define two furthermatrices/states. Ai,j:= cost of best Alignment , ,that ends with deletionai|.