Example: biology

Arc Diagrams: Visualizing Structure in Strings

Arc Diagrams: Visualizing Structure in StringsMartin WattenbergIBM ResearchOne Rogers StreetCambridge MA paper introduces a new visualization method, thearc diagram, which is capable of representing complexpatterns of repetition in string data. Arc diagramsimprove over previous methods such as dotplots becausethey scale efficiently for Strings that contain manyinstances of the same subsequence. This paper describesdesign and implementation issues related to arc diagramsand shows how they may be applied to visualize suchdiverse data as music, text, and compiled : string, sequence, visualization, arc diagram,music, text, code1.

Arc Diagrams: Visualizing Structure in Strings Martin Wattenberg IBM Research One Rogers Street Cambridge MA 02142 mwatten@us.ibm.com Abstract This paper introduces a new visualization method, the

Tags:

  Visualizing

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Transcription of Arc Diagrams: Visualizing Structure in Strings

1 Arc Diagrams: Visualizing Structure in StringsMartin WattenbergIBM ResearchOne Rogers StreetCambridge MA paper introduces a new visualization method, thearc diagram, which is capable of representing complexpatterns of repetition in string data. Arc diagramsimprove over previous methods such as dotplots becausethey scale efficiently for Strings that contain manyinstances of the same subsequence. This paper describesdesign and implementation issues related to arc diagramsand shows how they may be applied to visualize suchdiverse data as music, text, and compiled : string, sequence, visualization, arc diagram,music, text, code1.

2 IntroductionFrom text to DNA to melodies, many data sets come inthe form of a string, or sequence of symbols. Just as withquantitative data, it is often desirable to perform graphicalexploratory analysis on a string to find importantstructural way to reveal a string s Structure is to exploit thefact that sequences often contain significant repeatedsubsequences. Melodies, for instance, are usually basedon combinations of smaller repeated musical passages;text has repeated words and phrases. A natural way tovisualize Structure is to use these repeated units existing methods use repetition to visualizestring Structure , but each has significant drawbacks forcomplex Strings .

3 In this paper we introduce the arcdiagram, a new visualization method for representingsequence Structure by highlighting repeated describe how arc diagrams can find patterns in text,compiled code and, most fruitfully, in Existing methods for string visualizationMany methods have been proposed to display stringstructure visually. The H-Curve and W-Curve [HR83,W93] transform sequences into curves in 3D such curves are capable of showing fine detail,they can be hard to interpret and it can be difficult to spotsmaller repeated substrings.

4 The "Chaos Game"representation of a sequence [J90], in effect a 2 Dhistogram depicting the frequencies of various motifs, isefficient for showing which small substrings arefrequently repeated, but can run into difficultiesdistinguishing long subsequences with similar , chaos game representations remove muchordering information, making them unsuitable fordomains where ordering method, popular in analyzing DNAsequences, is the dotplot [CH92]. A dotplot is a visualauto-correlation matrix; in its simplest form, a string of nsymbols is represented by an n x n image inwhich the pixel at coordinates (i,j) is colored black if ai=ajand white otherwise.

5 This image often provides a goodpicture of the string s Structure , with repeatedsubsequences showing up clearly as diagonal lines. Inmany respects dotplots are an excellent visualizationmethod: They can handle very large data sets, are resistantto noise, and can show both small and large-scalestructures. However, the matrix-style presentation of adotplot means that if a substring is repeated n times, itwill give rise to n2 corresponding visual features. As aresult, dotplots can be confusing when applied to stringswith frequently repeated non-visual method of describing a long string is tosummarize it by describing which subsequences arerepeated.

6 For instance, musicians have long described theglobal Structure of musical compositions by summariessuch as "AABB" (meaning a subsequence, denoted by A,is repeated and then followed by a different subsequence,B, that is also repeated.) This simple symbolic notation iseasy to understand and provides a broad overview of thedata, but obliterates smaller is natural to seek a visual analogue of this theorists, starting with Heinrich Schenker, haveused a system of hand-drawn arcs to indicate structuralunits (see, for example, [S69]). However Schenkeriandiagrams, which are intrinsically subjective and manual,are unsuitable for automation or for showing features onmultiple scales.

7 One commercial software package,TimeSketch [T02], uses half-disks to delineate differentsections of a piece, coloring related passages with thesame color to indicate musical form. The TimeSketchsoftware requires human definition of related passages,and because it uses color for differentiation does not scalewell for sequences that have many different The arc diagramThis paper introduces the arc diagram. An arc diagramgeneralizes the musical AABB notation by using apattern-matching algorithm to find repeated substrings,and then representing them visually as translucent a TimeSketch diagram, an arc diagram can beconstructed automatically and can represent the structureof a sequence with many different repeated subsequencesand multiple scales of repetition.

8 Unlike a dotplot, it canefficiently represent sequences where individualsubsequences are repeated many times. An arc diagram is built around the idea of Visualizing onlya subset of all possible pairs of matching substrings. Bychoosing to highlight just the subsequences essential tounderstanding the string s Structure , the method canconvey all critical Structure while avoiding the quadraticscaling problem of a dotplot. We now define these essential substring pairs for a given string 1. A maximal matching pair is a pair ofsubstrings of S, X and Y, which are:1.

9 Identical. X and Y consist of the same sequenceof Non-overlapping. X and Y do not Consecutive. X occurs before Y, and there is nosubstring Z, identical to X and Y, whosebeginning falls between the beginning of X andthe beginning of Maximal. There do not exist longer identicalnon-overlapping subsequences X and Y with X containing X and Y containing Y . For example, in the sequence 123a123 , the two 123 substrings form a maximal matching pair, but the two 12 substrings do not. It is tempting to base a visualization method on maximalmatching pairs alone, but an awkward situation ariseswhen a pattern is repeated many times in immediatesuccession.

10 For instance in the string 10101010, the onlymaximal matching pair consists of the first and last 1010 substrings, implying that the string has two mainstructural components. In a sense, this division into twolarge substrings is spurious; it would be more accurate todescribe the string as composed of four small repeatedunits. This is the motivation for the following 2. A repetition region R is a substring R of Ssuch that R is made up of a string P repeated two or moretimes in immediate succession. Each repetition of P iscalled a fundamental substring for example, in the string ABC010101, the substring 010101 is a repetition region.


Related search queries