Transcription of Finding Regulatory Motifs in DNA Sequences - UCSD CSE
An Introduction to Bioinformatics Algorithms
Finding Regulatory Motifs in DNA Sequences

Outline
- Implanting Patterns in Random Text
- Gene Regulation
- Regulatory Motifs
- The Gold Bug Problem
- The Motif Finding Problem
- Brute Force Motif Finding
- The Median String Problem
- Search Trees
- Branch-and-Bound Motif Search
- Branch-and-Bound Median String Search
- Consensus and Pattern Branching: Greedy Motif Search
- PMS: Exhaustive Motif Search

Sample
atgaccgggatactgataccgtatttggcctagg
cgtacacattagataaacgtatgaagtacgttagactcgg
cgccgccgacccctattttttgagcagatttagtgacctg
gaaaaaaaatttgagtacaaaacttttccgaatactgggc
ataaggtacatgagtatccctgggatgacttttgggaaca
ctatagtgctctcccgatttttgaatatgtaggatcattc
gccagggtccgagctgagaattggatgaccttgtaagtgt
tttccacgcaatcgcgaaccaacgcggacccaaaggcaag
accgataaaggagatcccttttgcggtaatgtgccgggag
gctggttacgtagggaagccctaacggacttaatggccca
cttagtccacttataggtcaatcatgttcttgtgaatgga
tttttaactgagggcatagaccgcttggcgcacccaaatt
cagtgtgggcgagcgcaacggttttggcccttgttagagg
cccccgtactgatggaaactttcaattatgagagagctaa
tctatcgcgtgcgtgttcataacttgagttggtttcgaaa
atgctctggggcacatacaagaggagtcttccttatcagt
taatgctgtatgacactatgtattggcccattggctaaaa
gcccaacttgacaaatggaagatagaatccttgcatttca
acgtatgccgaaccgaaagggaagctggtgagcaacgaca
gattcttacgtgcattagctcgcttccggggatctaatag
cacgaagcttctgggtactgatagca

Implanting the Motif: AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGG
ggcgtacacattagataaacgtatgaagtacgttagactc
ggcgccgccgacccctattttttgagcagatttagtgacc
tggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGG
atgagtatccctgggatgacttAAAAAAAAGGGGGGG
tgctctcccgatttttgaatatgtaggatcattcgccagg
gtccgagctgagaattggatgAAAAAAAAGGGGGGG
tccacgcaatcgcgaaccaacgcggacccaaaggcaagac
cgataaaggagatcccttttgcggtaatgtgccgggaggc
tggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGG
cttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGG
gaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGG
caattatgagagagctaatctatcgcgtgcgtgttcataa
cttgagttAAAAAAAAGGGGGGG
ctggggcacatacaagaggagtcttccttatcagttaatg
ctgtatgacactatgtattggcccattggctaaaagccca
acttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGG
accgaaagggaagctggtgagcaacgacagattcttacgt
gcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Where is it now?
atgaccgggatactgataaaaaaaagggggggggcgtaca
cattagataaacgtatgaagtacgttagactcggcgccgc
cgacccctattttttgagcagatttagtgacctggaaaaa
aaatttgagtacaaaacttttccgaataaaaaaaaagggg
gggatgagtatccctgggatgacttaaaaaaaaggggggg
tgctctcccgatttttgaatatgtaggatcattcgccagg
gtccgagctgagaattggatgaaaaaaaagggggggtcca
cgcaatcgcgaaccaacgcggacccaaaggcaagaccgat
aaaggagatcccttttgcggtaatgtgccgggaggctggt
tacgtagggaagccctaacggacttaataaaaaaaagggg
gggcttataggtcaatcatgttcttgtgaatggatttaaa
aaaaaggggggggaccgcttggcgcacccaaattcagtgt
gggcgagcgcaacggttttggcccttgttagaggcccccg
taaaaaaaagggggggcaattatgagagagctaatctatc
gcgtgcgtgttcataacttgagttaaaaaaaagggggggc
tggggcacatacaagaggagtcttccttatcagttaatgc
tgtatgacactatgtattggcccattggctaaaagcccaa
cttgacaaatggaagatagaatccttgcataaaaaaaagg
gggggaccgaaagggaagctggtgagcaacgacagattct
tacgtgcattagctcgcttccggggatctaatagcacgaa
gcttaaaaaaaaggggggga

Implanting the Motif with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGG
Gggcgtacacattagataaacgtatgaagtacgttagact
cggcgccgccgacccctattttttgagcagatttagtgac
ctggaaaaaaaatttgagtacaaaacttttccgaatacAA
tAAAAcGGcGGGatgagtatccctgggatgacttAAAAtA
AtGGaGtGGtgctctcccgatttttgaatatgtaggatca
ttcgccagggtccgagctgagaattggatgcAAAAAAAGG
GattGtccacgcaatcgcgaaccaacgcggacccaaaggc
aagaccgataaaggagatcccttttgcggtaatgtgccgg
gaggctggttacgtagggaagccctaacggacttaatAtA
AtAAAGGaaGGGcttataggtcaatcatgttcttgtgaat
ggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaa
attcagtgtgggcgagcgcaacggttttggcccttgttag
aggcccccgtAtAAAcAAGGaGGGccaattatgagagagc
taatctatcgcgtgcgtgttcataacttgagttAAAAAAt
AGGGaGccctggggcacatacaagaggagtcttccttatc
agttaatgctgtatgacactatgtattggcccattggcta
aaagcccaacttgacaaatggaagatagaatccttgcatA
ctAAAAAGGaGcGGaccgaaagggaagctggtgagcaacg
acagattcttacgtgcattagctcgcttccggggatctaa
tagcacgaagcttActAAAAAGGaGcGGa

Geez. Where is it now?!
atgaccgggatactgatagaagaaaggttgggggc
gtacacattagataaacgtatgaagtacgttagactcggc
gccgccgacccctattttttgagcagatttagtgacctgg
aaaaaaaatttgagtacaaaacttttccgaatacaataaa
acggcgggatgagtatccctgggatgacttaaaataatgg
agtggtgctctcccgatttttgaatatgtaggatcattcg
ccagggtccgagctgagaattggatgcaaaaaaagggatt
gtccacgcaatcgcgaaccaacgcggacccaaaggcaaga
ccgataaaggagatcccttttgcggtaatgtgccgggagg
ctggttacgtagggaagccctaacggacttaatataataa
aggaagggcttataggtcaatcatgttcttgtgaatggat
ttaacaataagggctgggaccgcttggcgcacccaaattc
agtgtgggcgagcgcaacggttttggcccttgttagaggc
ccccgtataaacaaggagggccaattatgagagagctaat
ctatcgcgtgcgtgttcataacttgagttaaaaaataggg
agccctggggcacatacaagaggagtcttccttatcagtt
aatgctgtatgacactatgtattggcccattggctaaaag
cccaacttgacaaatggaagatagaatccttgcatactaa
aaaggagcggaccgaaagggaagctggtgagcaacgacag
attcttacgtgcattagctcgcttccggggatctaatagc
acgaagcttactaaaaaggagcgga

Is Finding a (15,4) Motif Hard?
[Figure: match/mismatch pattern between two aligned motif instances: |..|||.|..|||]

Problem
Find a motif in a sample of
- 20 random sequences (600 nt long),
- each sequence containing an implanted pattern of length 15,
- each pattern appearing with 4 mismatches as a (15,4)-motif.

Gene Regulation
- A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed.
- How can one gene have such drastic effects?

Regulatory Proteins
- Gene X encodes a regulatory protein, a transcription factor (TF).
- The 20 unexpressed genes rely on gene X's TF to induce transcription.
- A single TF may regulate multiple genes.

Regulatory Regions
- Every gene contains a regulatory region (RR), typically stretching 100-1000 bp upstream of the transcriptional start site.
- Located within the RR are the transcription factor binding sites (TFBS), also known as motifs, specific for a given transcription factor.
- TFs influence gene expression by binding to a specific location in the respective gene's regulatory region: the TFBS.

Transcription Factor Binding Sites
- A TFBS can be located anywhere within the regulatory region.
- A TFBS may vary slightly across different regulatory regions, since non-essential bases could mutate.

Motifs and Transcriptional Start Sites
[Figure: five genes with the motif instances ATCCCG, TTCCGG, ATCCCG, ATGCCG and ATGCCC in their regulatory regions]

Transcription Factors and Motifs
[Figure]

Motif Logo
- Motifs can mutate at unimportant bases.
- The five motifs in five different genes have mutations in positions 3 and 5.
- Representations called motif logos illustrate the conserved and variable regions of a motif.
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA

Motif Logos: An Example
[Figure: example motif logo]

Motifs
- Genes are turned on or off by regulatory proteins.
- These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase.
- A regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS).
- So finding the same motif in multiple genes' regulatory regions suggests a regulatory relationship amongst those genes.

Motifs: Complications
- We do not know the motif sequence.
- We do not know where it is located relative to the gene's start.
- Motifs can differ slightly from one gene to the next.
- How do we discern it from random motifs?
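
How far motif instances can drift apart is easy to quantify with the Hamming distance. The sketch below is an illustration rather than part of the slides: the name `hamming_distance` is an assumption, and the three instances are copied from the sample implanted with four mutations (lowercase letters mark the mutated positions).

```python
from itertools import combinations

# Implanted pattern and three of its mutated instances, copied from the
# four-mutation sample; lowercase letters are the mutated positions.
PATTERN = "AAAAAAAAGGGGGGG"
INSTANCES = ["AgAAgAAAGGttGGG", "cAAtAAAAcGGcGGG", "AAAAtAAtGGaGtGG"]


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ.

    The comparison ignores case, because case is only used here to
    highlight mutations.
    """
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


if __name__ == "__main__":
    # Distance of every instance to the implanted pattern.
    for inst in INSTANCES:
        print(inst, "vs pattern:", hamming_distance(PATTERN, inst))
    # Distance between every pair of instances.
    for a, b in combinations(INSTANCES, 2):
        print(a, "vs", b, ":", hamming_distance(a, b))
```

Each instance is 4 mismatches away from the pattern, yet the pairwise distances between the instances are 7, 6 and 8. Two instances of a (15,4)-motif can differ in as many as 8 of their 15 positions, which is why comparing instances with each other gives a much weaker signal than comparing them with the (unknown) pattern.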
Motif Finding Analogy
The Motif Finding Problem is similar to the problem posed by Edgar Allan Poe (1809-1849) in his story "The Gold Bug".

The Gold Bug Problem
Given a secret message:
53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;
46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;
4069285);)6!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?34;48)4+;
161;:188;+?;
Decipher the message encrypted in the fragment.

Hints for The Gold Bug Problem
Additional hints:
- The encrypted message is in English.
- Each symbol corresponds to one letter in the English alphabet.
- No punctuation marks are encoded.

The Gold Bug Problem: Symbol Counts
A naive approach to solving the problem:
- Count the frequency of each symbol in the encrypted message.
- Find the frequency of each letter of the alphabet in the English language.
- Compare the frequencies from the previous two steps, look for a correlation, and map each symbol to a letter of the alphabet.

Symbol Frequencies in the Gold Bug Message
Gold Bug message:
Symbol:     8  ;  4  )  +  *  5  6  (  !  1  0  2  9  3  :  ?  `  -  ]  .
Frequency: 34 25 19 16 15 14 12 11  9  8  7  6  5  5  4  4  3  2  1  1  1
English language, from most frequent to least frequent:
e t a o i n s r h l d c u m f p g w y b v k x j q z

The Gold Bug Message Decoding: First Attempt
By simply mapping the most frequent symbols to the most frequent letters of the alphabet:
sfiilfcsoorntaeuroaikoaiotecrntaeleyrcooestvenpinelefheeosnlt
arhteenmrnwteonihtaesotsnlupnihtamsrnuhsnbaoeyent
acrmuesotorleoaiitdhimtaecedtepeidtaelestaoaeslsueecrnedhimtaetheetahiwfataeoait
drdtpdeetiwt
The result does not make sense.

The Gold Bug Problem: l-tuple Count
A better approach: examine the frequencies of l-tuples, that is, combinations of 2 symbols, 3 symbols, etc.
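
Counting l-tuples amounts to a sliding-window tally. The following is a minimal sketch, assuming the message is read as one continuous string (the spaces and line breaks of the transcript are removed); `count_tuples` is a hypothetical helper name, not something defined in the slides.

```python
from collections import Counter

# The encrypted message as transcribed above, joined into one string.
MESSAGE = (
    "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
    "46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;"
    "4069285);)6!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?34;48)4+;"
    "161;:188;+?;"
)


def count_tuples(text: str, l: int) -> Counter:
    """Tally every window of l consecutive symbols in text."""
    if l < 1:
        raise ValueError("l must be at least 1")
    return Counter(text[i:i + l] for i in range(len(text) - l + 1))


if __name__ == "__main__":
    # l = 1 gives plain symbol counts; l = 2 and l = 3 give pairs and triples.
    for l in (1, 2, 3):
        print(l, count_tuples(MESSAGE, l).most_common(3))
```

With l = 1 this reproduces the symbol counts used by the naive approach (for example, 34 occurrences of the symbol 8). With l = 3, the most frequent tuple in the message as transcribed here is ;48, which occurs 7 times.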
"the" is the most frequent 3-tuple in English, and ";48" is the most frequent 3-tuple in the encrypted text.
Make inferences about unknown symbols by examining other frequent l-tuples.

The Gold Bug Problem: the ";48" Clue
Mapping "the" to ";48" and substituting all occurrences of the symbols:
53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t
h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*-h)e`e*t
h0692e5)t)6!e)h++t1(+9the0e1te:e+1t