Education

A Quick Guide for Developing Effective Bioinformatics Programming Skills

Joel T. Dudley1,2,3*, Atul J. Butte2,3

1 Program in Biomedical Informatics, Stanford University School of Medicine, Stanford, California, United States of America, 2 Department of Pediatrics, Stanford University School of Medicine, Stanford, California, United States of America, 3 Lucile Packard Children's Hospital, Palo Alto, California, United States of America

Introduction

Bioinformatics programming skills are becoming a necessity across many facets of biology and medicine, owing in part to the continuing explosion of biological data aggregation and the complexity and scale of questions now being addressed through modern bioinformatics. Although many are now receiving formal training in bioinformatics through various university degree and certificate programs, this training is often focused strongly on bioinformatics methodology, leaving many important and practical aspects of bioinformatics to self-education and experience. The following set of guidelines distills several key principles of effective bioinformatics programming, which the authors learned through insights gained across many years of combined experience developing popular bioinformatics software applications and database systems in both academic and commercial settings [1-6].
Successful adoption of these principles will serve beginner and experienced bioinformaticians alike in career development and the pursuit of professional and scientific goals.

The Importance of Building Your Technology Toolbox

Given the diversity and complex nature of problems in biology, medicine, and bioinformatics, it is imperative to be able to approach each problem with a comprehensive knowledge of available computational tools so that the best tools can be selected for the problem at hand. The most fundamental and versatile tools in your technology toolbox are programming languages. While most modern programming languages are capable of any number of computational feats, some are better suited to particular tasks than others. For example, the R language [7] is almost unparalleled in its statistical computing capabilities, whereas the Lisp language is well designed for problems in artificial intelligence, and Erlang [8] excels in fault-tolerant and distributed systems.
Given the learning and practice required to become an effective user of a programming language, it is provident not only to gain basic proficiency in a diversity of languages but also to set aside the time and energy to gain mastery of at least one language. With programming language mastery comes knowledge of and access to advanced language features and libraries, more efficient programming, and less time spent reading manuals and making mistakes. Although there are many languages in which it would be appropriate and effective to seek mastery for bioinformatics, modern interpreted scripting languages, such as Perl [9], Python [10], and Ruby [11], are among the most preferred and prudent choices [12]. These languages simplify the programming process by obviating the need to manage many low-level details of program execution (e.g., memory management), allowing the programmer to focus foremost on application logic and to rapidly prototype programs in an interpreted and easily extensible environment.
Any effort to choose from among these capable languages is ultimately founded in personal preference. Nonetheless, it should be noted that Perl and Python benefit from a relatively longer established tradition, and consequently more widespread use, in the field of bioinformatics. These facts should not serve to discourage the use of programming languages other than Perl or Python. Java, for example, which is popular in both academic curricula and industry, has served as the basis for many successful bioinformatics applications. Even so, programmers stand to benefit greatly from the many software tools, libraries, and educational materials available supporting the use of Perl and Python for bioinformatics [13-17].

In many cases, modern scripting languages can be bridged to other languages such that one is able to leverage the advanced features of other languages without abandoning the scripting language environment. Examples include the RPy library [18], which provides an interface between Python and the R language, and JRuby [19], a Java-based Ruby interpreter that enables interaction between the Ruby language and Java. Even if no formal scripting language interface is available for a particular software library, it is often possible to generate a scripting language interface using tools such as the Simplified Wrapper and Interface Generator (SWIG)
[20] or to simply wrap an existing executable using the scripting language. Under this paradigm, one becomes capable of envisioning composite solutions that incorporate the strengths of multiple language technologies, instead of being limited by the capabilities of a particular language.

Outside of programming languages, there exists a multitude of software tools, libraries, and applications pertinent to various aspects of bioinformatics, and it is worthwhile to invest time in gaining broad knowledge of the most popular of these resources across the spectrum of bioinformatics. Additionally, we encourage proficiency in the use and maintenance of a Web server system, such as Apache [21], as a survey of the bioinformatics literature clearly demonstrates an increasing trend towards the Web-based development, delivery, and utilization of bioinformatics tools and resources.

Citation: Dudley JT, Butte AJ (2009) A Quick Guide for Developing Effective Bioinformatics Programming Skills.
PLoS Comput Biol 5(12): e1000589.

Editor: Fran Lewitter, Whitehead Institute, United States of America

Published December 24, 2009

Copyright: © 2009 Dudley, Butte. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: JTD is supported by an NIH Training Grant T15 LM007033. The authors received no specific funding for this article.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail:

PLoS Computational Biology | 2009 | Volume 5 | Issue 12 | e1000589

The Benefits and Opportunities of Open Source Communities

Too often there is an urge among programmers to reinvent the wheel despite the availability of existing solutions. In some cases this can be an innocent and useful learning exercise, yet in most cases it is an improvident and wasteful one.
For many common problems in bioinformatics (e.g., parsing file formats or working with nucleotide data), others have often already implemented a solution, and in many cases these solutions are easily found in publicly available open source software. While general Internet search engines can be useful in locating existing bioinformatics source code, there are specialized search engines, such as Koders [22] and Google Code Search [23], that are specifically designed to search across publicly available source code. These specialized search engines offer code-specific search options, such as the ability to constrain the search to specific programming languages or software licensing schemes. It is worthwhile to use these tools to search for existing open source code that might serve as inspiration for your own program code, or even be repurposed as the basis for your own projects.
It should be noted, however, that if you decide to repurpose open source code, you should fully understand the license under which that code is distributed and ensure that the redistribution terms set forth by the original authors are respected. Furthermore, as the modern bioinformatician will invariably benefit from the vast body of publicly available open source code, it is good citizenship to release your own bioinformatics source code to the public under an open source license when possible.

In bioinformatics, it is fortunate that solutions to many common tasks and problems have been codified into standardized, open source software frameworks [24]. These frameworks are often comprehensive, rigorously tested, documented, and supported by vibrant and helpful user communities. Language-specific, open source bioinformatics frameworks are at the forefront of this effort, with BioPerl [25,26], Biopython [27], BioRuby [28,29], BioJava [30], and Bioconductor [31] emerging as some of the most mature and widely used. Outside of pure bioinformatics, there are a number of useful open source frameworks worth investigating, such as SciPy [32] and NumPy [33] for scientific computing in Python and Ruby on Rails [34] for rapid Web application development.
We would also urge those newer to bioinformatics and programming in general to engage these software framework communities as both a user and a contributor. Taking the time to understand the source code behind these frameworks and their system design can be highly educational, and members of framework user communities are often more than willing to constructively critique another's source code and program designs. Furthermore, active participation in an open source bioinformatics project can be noted on one's resume or CV as on-the-job bioinformatics experience, which can often be hard to gain for fledgling students and practitioners of bioinformatics.

The Importance of UNIX Skills

Even if you don't choose to run a UNIX-based operating system (OS) on your personal workstation, knowledge of UNIX is tremendously useful in bioinformatics. Although the Windows platform is perfectly adequate for bioinformatics, the simple truth is that the majority of bioinformatics computation happens on UNIX-based computer systems.
A portion of this circumstance may be attributable to a tradition of scientific computing on UNIX and the availability of many free, open source UNIX-based operating systems, such as Linux. Even so, it can be argued that a UNIX-based OS offers several advantages when it comes to facilitating bioinformatics. Perhaps one of the most compelling reasons to learn UNIX is to avoid programming altogether by leveraging the flexible and extensible UNIX shell environment. UNIX systems provide access to a vast array of specialized utilities that are executed by a command interpreter known as the UNIX shell. Although these commands are often limited to very specialized functionality (e.g., the cat command simply concatenates and prints files), the UNIX pipe operator, |, makes it possible to create ad hoc software pipelines by connecting the output of one command to the input of another. The software pipeline paradigm is common in bioinformatics [35], where many biological questions are evaluated by chaining specialized bioinformatics tools together into an analysis pipeline (e.g., BLAST search → multiple sequence alignment → phylogenetic analysis) using a scripting language.
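
As a minimal sketch of driving such a pipeline from a scripting language, the Python example below connects three standard UNIX utilities with pipes, mirroring the shell command cat FILE | grep '^>' | wc -l to count the records in a FASTA file. The function name count_fasta_records is hypothetical and chosen only for illustration, and the sketch assumes a UNIX-like system on which cat, grep, and wc are available on the PATH.

```python
import subprocess


def count_fasta_records(path):
    """Count FASTA header lines by chaining cat, grep, and wc with pipes.

    Illustrative helper (not from the article); it mirrors the shell
    pipeline:  cat PATH | grep '^>' | wc -l
    """
    # Stage 1: print the file to a pipe.
    cat = subprocess.Popen(["cat", path], stdout=subprocess.PIPE)

    # Stage 2: keep only header lines; stdin is the output of stage 1.
    grep = subprocess.Popen(
        ["grep", "^>"], stdin=cat.stdout, stdout=subprocess.PIPE
    )
    # Close our copy of the pipe so cat is notified if grep exits early.
    cat.stdout.close()

    # Stage 3: count the lines that survived the filter.
    wc = subprocess.Popen(
        ["wc", "-l"], stdin=grep.stdout, stdout=subprocess.PIPE
    )
    grep.stdout.close()

    output, _ = wc.communicate()
    cat.wait()
    grep.wait()  # grep exits with status 1 when nothing matches; ignored here

    # Some wc implementations pad the number with spaces, so strip first.
    return int(output.decode().strip())
```

Calling count_fasta_records on a FASTA file returns the number of lines that begin with ">". The same pattern could in principle chain bioinformatics executables, although an existing open source framework may already provide a tested wrapper for a given tool.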