
Documentation for structure software: Version 2

J. K. Pritchard (a), Xiaoquan Wen (a), Daniel Falush (b) (1, 2, 3)
(a) Department of Human Genetics, University of Chicago
(b) Department of Statistics, University of Oxford

Software … 2, 2010

1. Our other colleagues in the structure project are Peter Donnelly, Matthew Stephens and Melissa …
2. The first version of this program was developed while the authors (JP, MS, PD) were in the Department of Statistics, University of …
3. Questions about structure should be addressed to the online forum. Please check this document and search the previous discussion before posting.




Contents

1 Introduction
    Overview
    What's new in Version …
2 Format for the data file
    Components of the data file
    Rows
    Individual/genotype data
    Missing genotype data
    Formatting errors
3 Modelling decisions for the …
    Ancestry Models
    Allele frequency models
    How long to run the program
4 Missing data, null alleles and dominant markers
    Dominant markers, null alleles, and polyploid genotypes
5 Estimation of K (the number of populations)
    Steps in estimating K
    Mild departures from the model can lead to overestimating K
    Informal pointers for choosing K; is the structure real?
    Isolation by distance data
6 Background LD and other …
    Sequence data, tightly linked SNPs and haplotype data
    Multimodality
    Estimating admixture proportions when most individuals are admixed
7 Running structure from the command line
    Program parameters
    Parameters in file mainparams
    Parameters in file extraparams
    Command-line changes to parameter values
8 Front end
    Download and installation
    Overview
    Building a project
    Configuring a parameter set
    Running …
    Batch runs
    Exporting parameter files from the front end
    Importing results from the command-line program
    Analyzing the results
9 Interpreting the text output
    Output to screen during run
    Printout of Q
    Printout of Q when using prior population information
    Printout of allele-frequency divergence
    Printout of estimated allele frequencies (P)
    Site by site output for linkage model
10 Other resources for use …
    Plotting structure results

    Importing bacterial MLST data into structure format
11 How to cite this program
12 Bibliography

1 Introduction

The program structure implements a model-based clustering method for inferring population structure using genotype data consisting of unlinked markers. The method was introduced in a paper by Pritchard, Stephens and Donnelly (2000a) and extended in sequels by Falush, Stephens and Pritchard (2003a, 2007). Applications of our method include demonstrating the presence of population structure, identifying distinct genetic populations, assigning individuals to populations, and identifying migrants and admixed individuals.

We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus.

Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. It is assumed that within populations, the loci are at Hardy-Weinberg equilibrium and linkage equilibrium. Loosely speaking, individuals are assigned to populations in such a way as to achieve both.

The model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, including microsatellites, SNPs and RFLPs. The model assumes that markers are not in linkage disequilibrium (LD) within subpopulations, so we can't handle markers that are extremely close together.
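To make the equilibrium assumptions concrete: under Hardy-Weinberg and linkage equilibrium within a population, the probability of a multilocus genotype is proportional to the product of that population's frequencies for each observed allele copy. The Python sketch below uses this to compute membership probabilities for one individual. It is a simplified illustration, not the program's algorithm: it assumes the allele frequencies are already known and strictly positive, a uniform prior over populations, and no admixture, whereas the published model treats the allele frequencies as unknown. The function name `assignment_posterior` and the toy frequencies are hypothetical.

```python
from math import exp, log


def assignment_posterior(genotype, freqs):
    """Posterior probability that one individual comes from each population.

    genotype: list of (allele, allele) tuples, one per locus; None = missing.
    freqs:    freqs[k][locus][allele] is the (assumed known, positive)
              frequency of that allele in population k.
    Assumes a uniform prior over populations and no admixture.
    """
    log_liks = []
    for pop in freqs:
        total = 0.0
        for locus, alleles in enumerate(genotype):
            for allele in alleles:
                if allele is None:
                    continue  # missing copies contribute nothing
                # HWE + linkage equilibrium: allele copies are independent,
                # so their log-frequencies simply add up. The factor 2 for
                # heterozygotes is the same in every population and cancels.
                total += log(pop[locus][allele])
        log_liks.append(total)

    # Normalise in log space to avoid underflow with many loci.
    peak = max(log_liks)
    weights = [exp(value - peak) for value in log_liks]
    norm = sum(weights)
    return [w / norm for w in weights]


# Toy example (hypothetical numbers): two populations, two loci.
freqs = [
    [{"A": 0.9, "a": 0.1}, {"B": 0.8, "b": 0.2}],  # population 0
    [{"A": 0.2, "a": 0.8}, {"B": 0.3, "b": 0.7}],  # population 1
]
posterior = assignment_posterior([("A", "A"), ("B", "b")], freqs)
print(posterior)
```

With these toy frequencies the individual is assigned mostly to population 0, because both copies at the first locus are the allele that is common there.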

Starting with Version …, we can now deal with weakly linked markers.

Although the computational approaches implemented here are fairly powerful, some care is needed in running the program in order to ensure sensible answers. For example, it is not possible to determine suitable run-lengths theoretically, and this requires some experimentation on the part of the user. This document describes the use and interpretation of the software and supplements the published papers, which provide more formal descriptions and evaluations of the methods.

Overview

The software package structure consists of several parts.

The computational part of the program was written in C. We distribute source code as well as executables for various platforms (currently Mac, Windows, Linux, Sun). The C executable reads a data file supplied by the user. There is also a Java front end that provides various helpful features for the user, including simple processing of the output. You can also invoke structure from the command line instead of using the front end.

This document includes information about how to format the data file, how to choose appropriate models, and how to interpret the results.

It also has details on using the two interfaces (command line and front end) and a summary of the various user-defined parameters.

What's new in Version …

This release (April 2009) introduces new models for improving structure inference for data sets where (1) the data are not informative enough for the usual structure models to provide accurate inference, but (2) the sampling locations are correlated with population membership. In this situation, by making explicit use of sampling location information, we give structure a boost, often allowing much improved performance (Hubisz et al., 2009). We hope to release further improvements in the coming …

             loca  locb  locc  locd  loce
George    1    -9   145    66     0    92
George    1    -9    -9    64     0    94
Paula     1   106   142    68     1    92
Paula     1   106   148    64     0    94
Matthew   2   110   145    -9     0    92
Matthew   2   110   148    66     1    -9
Bob       2   108   142    64     1    94
Bob       2    -9   142    -9     0    94
Anja      1   112   142    -9     1    -9
Anja      1   114   142    66     1    94
Peter     1    -9   145    66     0    -9
Peter     1   110   145    -9     1    -9
Carsten   2   108   145    62     0    -9
Carsten   2   110   145    64     1    92

Table 1: Sample data file. Here MARKERNAMES=1, LABEL=1, POPDATA=1, NUMINDS=7, NUMLOCI=5, and MISSING=-9.
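The layout described in the Table 1 caption can be read with a few lines of Python. The sketch below is not part of the structure distribution. It assumes whitespace-separated fields, a first row of marker names (MARKERNAMES=1), then a label (LABEL=1), a population identifier (POPDATA=1) and NUMLOCI allele values on every data row, with two rows per individual as in the sample. The function name `parse_structure_data` is hypothetical, and the column values reflect one reading of the flattened sample table.

```python
SAMPLE = """\
loca locb locc locd loce
George  1  -9 145  66 0  92
George  1  -9  -9  64 0  94
Paula   1 106 142  68 1  92
Paula   1 106 148  64 0  94
Matthew 2 110 145  -9 0  92
Matthew 2 110 148  66 1  -9
Bob     2 108 142  64 1  94
Bob     2  -9 142  -9 0  94
Anja    1 112 142  -9 1  -9
Anja    1 114 142  66 1  94
Peter   1  -9 145  66 0  -9
Peter   1 110 145  -9 1  -9
Carsten 2 108 145  62 0  -9
Carsten 2 110 145  64 1  92
"""


def parse_structure_data(text, num_loci, markernames=True, label=True,
                         popdata=True, missing=-9):
    """Parse a structure-style data file into (markers, records).

    Each record is (label, population, alleles); alleles equal to the
    missing-data code are returned as None.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]

    markers = None
    if markernames:
        markers = rows.pop(0)
        if len(markers) != num_loci:
            raise ValueError(f"expected {num_loci} marker names, got {len(markers)}")

    expected = num_loci + int(label) + int(popdata)
    records = []
    for fields in rows:
        if len(fields) != expected:
            raise ValueError(f"expected {expected} fields, got {len(fields)}: {fields}")
        name = fields.pop(0) if label else None
        pop = int(fields.pop(0)) if popdata else None
        alleles = [None if int(f) == missing else int(f) for f in fields]
        records.append((name, pop, alleles))
    return markers, records


markers, records = parse_structure_data(SAMPLE, num_loci=5)
individuals = {name for name, _, _ in records}
print(len(records), "rows for", len(individuals), "individuals")
```

Checking the field count on every row is a cheap way to catch the kind of formatting errors that a fixed-width layout makes easy to introduce.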

