Introduction BIOINFORMATICS - Gerstein Lab

(c) Mark Gerstein, 1999, Yale

What is Bioinformatics?
(Molecular) Bio-informatics
One idea for a definition? BIOINFORMATICS is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying informatics techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large scale.
BIOINFORMATICS is MIS for Molecular Biology Information.

Biology: an Information Science
Central Dogma of Molecular Biology
  DNA -> RNA -> Protein -> Phenotype -> DNA
  Molecules: Sequence, Structure, Function
  Processes: Mechanism, Specificity, Regulation
Central Paradigm for Bioinformatics
  Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
  Large Amounts of Information
  Standardized
  Statistical
(idea from D Brutlag, Stanford, graphics from S Strobel)
Genetic material
Information transfer (mRNA)
Protein synthesis (tRNA/mRNA)
Some catalytic activity
Most cellular functions are performed or facilitated by proteins.

Primary biocatalyst
Cofactor transport/storage
Mechanical motion/support
Immune protection
Control of growth/differentiation

Biology Information - DNA
Raw DNA Sequence
  Coding or Not?
  Parse into genes?
  4 bases: AGCT
  ~1 K in a gene, ~2 M in genome

atggcaattaaaattggtatcaatggttttggtc
gtatcggccgtatcgtattccgtgcagcacaacaccgtga
tgacattgaagttgtaggtattaacgacttaatcgacgtt
gaatacatggcttatatgttgaaatatgattcaactcacg
gtcgtttcgacggcactgttgaagtgaaagatggtaactt
agtggttaatggtaaaactatccgtgtaactgcagaacgt
gatccagcaaacttaaactggggtgcaatcggtgttgata
tcgctgttgaagcgactggtttattcttaactgatgaaac
tgctcgtaaacatatcactgcaggcgcaaaaaaagttgta
ttaactggcccatctaaagatgcaacccctatgttcgttc
gtggtgtaaacttcaacgcatacgcaggtcaagatatcgt
ttctaacgcatcttgtacaacaaactgtttagctccttta
gcacgtgttgttcatgaaactttcggtatcaaagatggtt
taatgaccactgttcacgcaacgactgcaactcaaaaaac
tgtggatggtccatcagctaaagactggcgcggcggccgc
ggtgcatcacaaaacatcattccatcttcaacaggtgcag
cgaaagcagtaggtaaagtattacctgcattaaacggtaa
attaactggtatggctttccgtgttccaacgccaaacgta
tctgttgttgatttaacagttaatcttgaaaaaccagctt
cttatgatgcaatcaaacaagcaatcaaagatgcagcgga
aggtaaaacgttcaatggcgaattaaaaggcgtattaggt
tacactgaagatgctgttgtttctactgacttcaacggtt
gtgctttaacttctgtatttgatgcagacgctggtatcgc
attaactgattctttcgttaaattggtatc.

Caaaaatagggttaatatgaatctcgatctccattttgtt
catcgtattcaacaacaagccaaaactcgtacaaatatga
ccgcacttcgctataaagaacacggcttgtggcgagatat
ctcttggaaaaactttcaagagcaactcaatcaactttct
cgagcattgcttgctcacaatattgacgtacaagataaaa
tcgccatttttgcccataatatggaacgttgggttgttca
tgaaactttcggtatcaaagatggtttaatgaccactgtt
cacgcaacgactacaatcgttgacattgcgaccttacaaa
ttcgagcaatcacagtgcctatttacgcaaccaatacagc
ccagcaagcagaatttatcctaaatcacgccgatgtaaaa
attctcttcgtcggcgatcaagagcaatacgatcaaacat
tggaaattgctcatcattgtccaaaattacaaaaaattgt
agcaatgaaatccaccattcaattacaacaagatcctctt
tcttgcacttgg

Biology Information: Protein Sequence
20 letter alphabet
  ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain
~200 K known protein sequences

d1dhfa_  LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__  LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_  ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__  TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF

d1dhfa_  LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__  LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_  ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI
d3dfr__  TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF

d1dhfa_  VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__  VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_  ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA
d3dfr__  ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV

d1dhfa_  -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__  -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_  -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV

Biology Information: Macromolecular Structure
DNA/RNA/Protein
  Almost all protein
(RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page)

Biology Information: Protein Structure Details
Statistics on Number of XYZ triplets
  200 residues/domain -> 200 CA atoms, separated by A
  Avg. Residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic A
  => ~1500 xyz triplets (=8x200) per protein domain
10 K known domains, ~300 folds

ATOM     1  C    ACE    0  1 GKY 67
ATOM     2  O    ACE    0  1 GKY 68
ATOM     3  CH3  ACE    0  1 GKY 69
ATOM     4  N    SER    1  1 GKY 70
ATOM     5  CA   SER    1  1 GKY 71
ATOM     6  C    SER    1  1 GKY 72
ATOM     7  O    SER    1  1 GKY 73
ATOM     8  CB   SER    1  1 GKY 74
ATOM     9  OG   SER    1  1 GKY 75
ATOM    10  N    ARG    2  1 GKY 76
ATOM    11  CA   ARG    2  1 GKY 77
ATOM    12  C    ARG    2  1 GKY
      1444  CB   LYS  186  1 GKY1510
ATOM  1445  CG   LYS  186  1 GKY1511
ATOM  1446  CD   LYS  186  1 GKY1512
ATOM  1447  CE   LYS  186  1 GKY1513
ATOM  1448  NZ   LYS  186  1 GKY1514
ATOM  1449  OXT  LYS  186  1 GKY1515
TER   1450       LYS  186  1 GKY1516

[...] the World of Sequences
1995   Bacteria, ~1600 genes [Science 269: 496]
1997   Eukaryote, 13 Mb, ~6 K genes [Nature 387: 1]
1998   Animal, ~100 Mb, ~20 K genes [Science 282: 1945]
2000?  Human, ~3 Gb, ~100 K genes [???]

Biology Information: Whole Genomes
The Revolution Driving Everything
Fleischmann, Adams, White, O., Clayton, Kirkness, Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, Fraser, Smith (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae rd." Science 269: 496-512.
(Picture adapted from TIGR website)
Integrative Data
  1995, HI (bacteria): [...] Mb & 1600 genes done
  1997, yeast: 13 Mb & ~6000 genes for yeast
  1998, worm: ~100 Mb with 19 K genes
  1999: >30 completed genomes!
  2003, human: 3 Gb & 100 K [...]
"[...] sequence now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better [...]"
  G A Petsko, Nature 401: 115-116 (1999)

Expression Datasets: the Transcriptosome
Also: SAGE; Samson and Church, Chips; Aebersold, Protein Expression
Young/Lander, Chips, Abs. [...]
[...], array, Rel. Exp. over Timecourse
Snyder, Transposons, Protein

[...] Data
(courtesy of J Hager)
Yeast Expression Data in Academia: levels for all 6000 genes!
Can only sequence genome once but can do an infinite variety of these array experiments
at 10 time points, 6000 x 10 = 60K floats
telling signal from background

Whole-Genome Experiments
Systematic Knockouts
  Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., Chu, A. M., Connelly, C., Davis, K., Dietrich, F., Dow, S. W., El Bakkoury, M., Foury, F., Friend, S. H., Gentalen, E., Giaever, G., Hegemann, J. H., Jones, T., Laub, M., Liao, H., Davis, R. W. & et al. (1999). Functional characterization of the genome by gene deletion and parallel [...], 901-6
2 hybrids, linkage maps
  Hua, S. B., Luo, Y., Qiu, M., Chan, E., Zhou, H. & Zhu, L. (1998). Construction of a modular yeast two-hybrid cDNA library from human EST clones for the human genome protein linkage [...], 143-52
  For yeast: 6000 x 6000 / 2 ~ 18 M interactions

Biology Information: Other Integrative Data
Information to understand genomes
  Metabolic Pathways (glycolysis), traditional biochemistry
  Regulatory Networks
  Whole Organisms Phylogeny, traditional zoology
  Environments, Habitats, ecology
  The Literature (MEDLINE)
The [...]
(Pathway drawing from P Karp's EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack)

[...] of Data Matched by Development of Computer Technology
CPU vs Disk & Net
  As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial
Driving Force in Bioinformatics
(Internet picture adapted from D Brutlag, Stanford)
[Plot: CPU Instruction Time (ns) by year, 1979-1995]

[...] is born!

(courtesy of Finn Drablos)

Character of Molecular Biology Information: Redundancy and Multiplicity
Different Sequences Have the Same Structure
Organism has many similar genes
Single Gene May Have Multiple Functions
Genes are grouped into Pathways
Genomic Sequence Redundancy due to the Genetic Code
How do we find the similarities?
Integrative Genomics - genes, structures, functions, pathways, expression levels, regulatory systems, ...

[...] Paradigm for Scientific Computing
Because of increase in data and improvement in computers, new calculations become possible
But BIOINFORMATICS has a new style of [...]
Two Paradigms
  Physics
    Prediction based on physical principles
    Exact Determination of Rocket Trajectory
    Supercomputer, CPU
  Biology
    Classifying information and discovering unexpected relationships
    globin ~ colicin ~ plastocyanin ~ repressor
    networks, federated database

Types of Informatics in Bioinformatics
Databases
  Building, Querying
  Object DB
Text String Comparison
  Text Search
  1D Alignment
  Significance Statistics
  Alta Vista, grep
Finding Patterns
  AI / Machine Learning
  Clustering
  Datamining
Geometry
  Robotics
  Graphics (Surfaces, Volumes)
  Comparison and 3D Matching (Vision, recognition)

Physical Simulation
  Newtonian Mechanics
  Electrostatics
  Numerical Algorithms
  Simulation

Topics -- Genome Sequence
Finding Genes in Genomic DNA
  introns
  exons
  promoters
Characterizing Repeats in Genomic DNA
  Statistics
  Patterns
Duplications in the Genome

Topics -- Protein Sequence
Sequence Alignment
  non-exact string matching, gaps
  How to align two strings optimally via Dynamic Programming
  Local vs Global Alignment
  Suboptimal Alignment
  Hashing to increase speed (BLAST, FASTA)
  Amino acid substitution scoring matrices
Multiple Alignment and Consensus Patterns
  How to align more than one sequence and then fuse the result in a consensus representation
  Transitive Comparisons
  HMMs, Profiles
  Motifs
Scoring schemes and Matching statistics
  How to tell if a given alignment or match is statistically significant
  A P-value (or an e-value)?
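
The topic "how to align two strings optimally via Dynamic Programming" can be sketched in a few lines. What follows is a minimal illustration of global (Needleman-Wunsch style) alignment with a linear gap penalty, not the lecture's own code. The function name `global_align` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example; real protein alignments would use a substitution matrix instead of a flat match/mismatch score.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b by dynamic programming.

    Scoring values are illustrative assumptions, not taken from the slides.
    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap in b
                score[i][j - 1] + gap,      # b[j-1] against a gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_align("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the two lengths. That cost is the motivation for the hashing shortcuts (BLAST, FASTA) listed on the same slide. A local alignment differs mainly in clamping each cell at zero and tracing back from the highest-scoring cell rather than the corner.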

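One simple way to approach "how to tell if a given alignment or match is statistically significant" is a shuffling test: score the real pair, then score the first sequence against many shuffled copies of the second, and report the fraction of shuffles that do at least as well. This is a swapped-in illustration, not necessarily the statistic the lecture develops, and database search tools derive e-values analytically rather than by shuffling. The names `align_score` and `shuffle_p_value`, the scoring values, and the trial count are all assumptions made for the sketch.

```python
import random


def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score only, keeping two rows of the DP table."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_p_value(a, b, trials=200, seed=0):
    """Empirical P-value: how often a shuffled b scores at least as well as b."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = align_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # preserves composition, destroys order
        if align_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


seq = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical example sequence
print(shuffle_p_value(seq, seq, trials=50))
```

Shuffling keeps the amino acid composition fixed, so a small value says the order of residues, not just their frequencies, drives the score. The smallest P-value this procedure can report is 1 / (trials + 1), so more trials are needed to resolve very significant matches.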