PhyloInformatics Workshop
22 October, 07 08:30 AM - 24 October, 07 12:00 PM
e-Science Institute, 15 South College Street, Edinburgh
Organisers: Prof. Roderic Page, Prof. Vincent Moulton and Prof. Mike Steel
Any slides or other material generated as a result of this event can be found at: www.nesc.ac.uk/action/esi/contribution.cfm?Title=710
Programme
ABSTRACTS

Phylogenomic supertrees: the end of the road or the light at the end of the tunnel?
Author: Olaf R.P. Bininda-Emonds (Friedrich-Schiller-Universität Jena)
Supertrees are seen by many as a stopgap measure to produce comprehensive phylogenies until sufficient molecular data (in terms of taxonomic coverage) become available to yield equivalent, "real" ones. In this talk, I examine what role, if any, supertrees might play under the latter scenario of abundant phylogenomic data. Clearly, the traditional application of the supertree framework will still be ideally suited for partitioned analyses of disparate data types (e.g., morphology, sequence data, gap-coded sequence data, or rare genomic changes and other genome-level "metadata"). But a paradigm shift is likely needed if supertrees are to play any important role in the analysis of taxonomically comprehensive but "homogeneous" data sets (e.g., supermatrices of pure DNA sequence data). Here, instead of being the end product of the phylogenetic analysis, supertrees would become an intermediate result within a divide-and-conquer approach, thereby evolving into a computational tool to potentially increase the speed (and accuracy?) of large-scale supermatrix analyses. I conclude by examining this possibility in detail, including its feasibility and what characteristics are needed for it to represent a real improvement over conventional search strategies.

TaxMan: a workbench for large-scale, multigene phylogenetic analysis
Authors: Martin Jones and Mark Blaxter (Edinburgh University)
As sequencing efforts go broader (more taxa) and deeper (more genes), it has become more difficult for researchers to stay abreast of the tide of data. In particular, whole-genome sequencing and expressed sequence tag projects can deliver large amounts of relevant, but unannotated, data for phylogenetics.
We have developed a data management system, "TaxMan", for phylogenetic analyses that: (a) assembles sequence datasets from taxa defined by the user (e.g. "all insects"); (b) from these datasets identifies individual sequence entries corresponding to genes of interest (from a list defined by the user); (c) collates sequences by gene and by species, generating consensus sequences where relevant; (d) generates and stores aligned sequence datasets (using a relational database); (e) facilitates selection of data subsets (of taxa and genes) that suit particular phylogenetic questions; and (f) stores phylogenetic trees resulting from analysis of these subsets to ease comparison and synthesis. TaxMan is freely available for download and use. TaxMan was written by Martin Jones.

A Search Engine for Phylogenetic Tree Databases
Authors: Duhong Chen (Iowa State University), Mukul S. Bansal (Iowa State University), J. Gordon Burleigh (NESCent), David Fernández-Baca (Iowa State University)
The rapid growth of phylogenetic information necessitates the development of tools to store and access phylogenetic data. These tools should enable users to search for phylogenetic trees containing specified taxa and to compare a specified phylogenetic hypothesis to existing phylogenetic trees. In this talk, we present PhyloFinder, a search engine that efficiently implements a variety of taxonomic and phylogenetic queries. Its features include the ability to deal with synonymous taxon names (in part by relying on TBMap) and to provide spelling suggestions for taxa. PhyloFinder can also quickly identify database trees that contain the query tree or subtrees similar to it. The system provides visualization tools that highlight the query results and provide links to NCBI and TBMap. While PhyloFinder has so far only been tested using trees from TreeBASE, the search engine can, in principle, enhance the utility of any tree database.
PhyloFinder is available at http://pilin.cs.iastate.edu/phylofinder/
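The two kinds of query described in this abstract (trees containing specified taxa, and trees containing a query tree) can be illustrated with a small Python sketch. This is not PhyloFinder's implementation: the nested-tuple tree encoding, the function names, the toy database and the synonym table are all assumptions made for illustration. It relies on the fact that a rooted tree is determined by its clusters (the leaf sets below each internal node), so a database tree contains the query tree when its clusters, restricted to the query's taxa, match those of the query.

```python
from typing import Dict, FrozenSet, List, Set, Union

# Assumed encoding: a rooted tree is a leaf name (str) or a tuple of subtrees.
Tree = Union[str, tuple]


def normalise(tree: Tree, synonyms: Dict[str, str]) -> Tree:
    """Replace each leaf name by its accepted name when a synonym is known."""
    if isinstance(tree, str):
        return synonyms.get(tree, tree)
    return tuple(normalise(child, synonyms) for child in tree)


def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the set of leaf names of a tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree: Tree, keep: FrozenSet[str]) -> Set[FrozenSet[str]]:
    """Non-trivial clusters of the tree after restricting it to the taxa in keep."""
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node]) & keep
        below = frozenset().union(*(visit(child) for child in node))
        if 1 < len(below) < len(keep):
            found.add(below)
        return below

    visit(tree)
    return found


def contains_taxa(tree: Tree, taxa) -> bool:
    """Taxonomic query: does the tree include every requested taxon?"""
    return set(taxa) <= leaves(tree)


def contains_tree(db_tree: Tree, query: Tree) -> bool:
    """Phylogenetic query: does db_tree, restricted to the query taxa, equal the query?"""
    taxa = leaves(query)
    return taxa <= leaves(db_tree) and clusters(db_tree, taxa) == clusters(query, taxa)


def search(database: Dict[str, Tree], synonyms: Dict[str, str], query: Tree) -> List[str]:
    """Return the names of all database trees that contain the query tree."""
    query = normalise(query, synonyms)
    return [name for name, tree in database.items()
            if contains_tree(normalise(tree, synonyms), query)]


# Toy data, invented for this example only.
SYNONYMS = {"Pan satyrus": "Pan troglodytes"}
DATABASE = {
    "tree_1": ((("Homo sapiens", "Pan satyrus"), "Gorilla gorilla"), "Pongo pygmaeus"),
    "tree_2": (("Homo sapiens", "Gorilla gorilla"), ("Pan troglodytes", "Mus musculus")),
}
QUERY = (("Homo sapiens", "Pan troglodytes"), "Gorilla gorilla")

print(search(DATABASE, SYNONYMS, QUERY))
```

In this toy database both trees contain all three query taxa, but only tree_1 groups them the way the query does, and it is found only because the synonym is resolved first. Searching for "similar" subtrees would need a tree distance rather than exact cluster equality, which this sketch does not attempt.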
Automated phylogenetic taxonomy in Fungi
Author: David S. Hibbett (Clark University)
Taxonomy (perhaps the original "bioinformatics" discipline) has three core goals: to discover and describe the diversity of life; to reconstruct comprehensive phylogenetic trees linking all species; and to translate trees into phylogeny-based classifications. These challenges are especially difficult in megadiverse, poorly known groups, such as Fungi. In recent years, tremendous progress has been made in reconstructing relationships among the major clades in the fungal branch of the tree of life, and numerous taxonomic and environmental studies have generated abundant data corresponding to the "leaves" on the tree. Fungal systematics is failing, however, in several key ways: the available data are not being integrated into comprehensive phylogenies; the emerging phylogenies are not being efficiently translated into classifications; and the burgeoning environmental sequences are not being integrated into phylogenies based on rigorously identified materials. These deficiencies limit the information content of current trees and their derivative classifications. We have developed a Perl pipeline called mor that seeks to close the gaps between data generation, phylogenetic reconstruction, and classification. In brief, mor queries GenBank on a weekly basis for ribosomal RNA sequences of mushroom-forming fungi, which it uses to construct ever-growing alignments and phylogenetic trees. The phylogenetic analysis in mor uses a backbone constraint based on recent multi-locus studies. As of this writing, the tree in mor includes 3009 terminals, making it by far the largest fungal phylogeny (although not all groups of Fungi are included). Following phylogenetic reconstruction, mor parses the tree using phylogenetic taxon definitions, and it then creates "clade viewer pages" that present the defined sub-tree, a list of included sequences, and an alignment of sequences for that clade.
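The step of parsing a tree with phylogenetic taxon definitions can be sketched as follows. This is an illustration in Python, not mor's Perl code, and it assumes node-based definitions only: a name is attached to the least inclusive clade containing a set of specifier terminals. The tree encoding, function names, terminal labels and definitions are all hypothetical.

```python
def leaf_set(tree):
    """Return the set of terminal labels in a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaf_set(child)
    return result


def smallest_clade(tree, specifiers):
    """Return the least inclusive subtree containing all specifiers, or None."""
    specifiers = set(specifiers)
    if not specifiers:
        raise ValueError("a node-based definition needs at least one specifier")
    if not specifiers <= leaf_set(tree):
        return None
    if not isinstance(tree, str):
        # If a single child already holds every specifier, the clade lies inside it.
        for child in tree:
            found = smallest_clade(child, specifiers)
            if found is not None:
                return found
    return tree


def clade_members(tree, definitions):
    """Map each defined name to the sorted terminals of its clade (None if absent)."""
    members = {}
    for name, specifiers in definitions.items():
        clade = smallest_clade(tree, specifiers)
        members[name] = None if clade is None else sorted(leaf_set(clade))
    return members


# Hypothetical terminals and definitions, for illustration only.
TREE = ((("A1", "A2"), "A3"), ("B1", "B2"))
DEFINITIONS = {
    "clade_A": ["A1", "A3"],   # least inclusive clade containing A1 and A3
    "clade_B": ["B1", "B2"],
    "clade_X": ["A1", "Z9"],   # Z9 is not in the tree, so the name cannot be applied
}

print(clade_members(TREE, DEFINITIONS))
```

Each resulting list of terminals corresponds to the "list of included sequences" that a clade viewer page would present. Note that A2 is picked up by clade_A even though it is not a specifier, which is how a name applied to a growing tree absorbs newly added terminals.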
In the future, mor will be enhanced to include sequences of the internal transcribed spacers (ITS) of ribosomal RNA genes, which are widely used in species-level systematic studies and environmental surveys. Because ITS sequences are too variable to be aligned across distantly related taxa, their incorporation will require automated supertree analyses. Mor, which remains a prototype application, demonstrates that the venerable old science of taxonomy is amenable to automation.

Challenges and open problems: a wish list for phyloinformatics
Author: Roderic Page (University of Glasgow)
This talk gives a personal view of phyloinformatics, from the perspective of a biologist wanting a database of evolutionary trees that can be queried. Among the topics covered are supertrees, visualising very large trees, and integrating trees with other data sources and types. Some possible solutions are proposed, but the emphasis is on the problems that remain to be solved.
Working with Trees in the Phyloinformatic Age
Authors: William H. Piel (Yale Peabody Museum), Hilmar Lapp (NESCent and Duke University)
It is anticipated that the growth of phylogenetic knowledge will soon exceed our ability to assemble, organize, and make sense of it by hand. Even minor branches of the tree of life will expand to enormous size as more and more species are included. This growth is not only in physical size, but is also reflected in the number and diversity of published phylogenetic results that need to be triaged, evaluated, and accounted for. The graphical interfaces that allow us to rearrange clades by dragging a mouse pointer will no longer be effective when trees have hundreds of thousands of nodes. Moreover, there will be too many published trees for us to evaluate these hypotheses individually. We will need computers to take over tasks that we normally perform at a human scale, which in part will require a phylogenetic query language (PQL). I will discuss various approaches and solutions for designing this PQL.

Developing taxonomic name services to enhance findability
Author: David Remsen (Global Biodiversity Information Facility)
The Global Biodiversity Information Facility has recently launched the implementation version of its data portal, which provides access to over 100,000,000 primary occurrence taxon records (specimen and observational data). Numerous challenges arise in providing effective access pathways to these data, particularly those that draw upon the primary biological metadata component of these records: the taxon name itself. In this presentation I will provide a brief overview of the new data portal and our strategies for addressing these access impediments, both internally and through the development of a "global names architecture".

Data mining GenBank for phylogenetic inference: what can 80,000 phylogenies tell us about the tree of life?
Author: Michael J. Sanderson (University of Arizona)
Molecular sequence data for 160,000 species, some 10% of all described species on earth, are archived in NCBI's GenBank, providing a rich resource for phylogenetic inference. We have constructed a web-accessible database tailored for phylogenetic inference, based on a taxonomically enriched subset of data for eukaryotes in GenBank. This "Phylota Browser" was assembled by building clusters of locally homologous sequences from all-versus-all BLAST searches for all but the deepest nodes in NCBI’s taxonomy tree. A data availability matrix for each node was constructed, which reports whether a given cluster-by-taxon entry has a sequence in the database. This provides a useful view of what new sequences must be obtained to complete the matrix for subsequent supermatrix or supertree construction. Smaller and more complete subsets of the data can be constructed manually or via formal algorithms. The data can be downloaded for individual clusters or for sets of clusters for subsequent alignment and analysis. To illustrate the utility of the database, we examined 80,000 potentially phylogenetically informative clusters distributed across all eukaryotes. We tallied a measure of support for each taxon in the NCBI tree using these phylogenetic results. This measure calculates the fraction of a clade’s taxa that are "well-supported", meaning that their summed measures of support across all the clusters they are found in exceed a specified value. The distribution of this support was plotted on the NCBI taxonomy tree to reveal areas that have received relatively more or less attention. Not surprisingly, the phylogenetic neighborhood of model organisms has strong support, but this strong support sometimes extends far beyond these taxa in generally well-studied clades such as vertebrates.
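The data availability matrix and the "well-supported" fraction described in this abstract can be made concrete with a short Python sketch. It is not the Phylota Browser's code: the function names, the toy clusters, the per-cluster support values and the threshold are assumptions for illustration, and the abstract does not say how the underlying support values are derived.

```python
def availability_matrix(clusters, taxa):
    """Build a taxon-by-cluster table: True where the cluster has a sequence for the taxon."""
    cluster_ids = sorted(clusters)
    rows = {taxon: [taxon in clusters[cid] for cid in cluster_ids] for taxon in taxa}
    return cluster_ids, rows


def missing_entries(cluster_ids, rows):
    """List the (taxon, cluster) cells that would have to be sequenced to fill the matrix."""
    return [(taxon, cid)
            for taxon, row in rows.items()
            for cid, present in zip(cluster_ids, row)
            if not present]


def well_supported_fraction(clade_taxa, support, threshold):
    """Fraction of a clade's taxa whose support, summed over clusters, exceeds threshold.

    Assumes clade_taxa is non-empty.
    """
    totals = {taxon: 0.0 for taxon in clade_taxa}
    for (taxon, _cluster), value in support.items():
        if taxon in totals:
            totals[taxon] += value
    well_supported = sum(1 for total in totals.values() if total > threshold)
    return well_supported / len(totals)


# Toy data, invented for this example only.
TAXA = ["t1", "t2", "t3"]
CLUSTERS = {"c1": {"t1", "t2", "t3"}, "c2": {"t1", "t3"}}
SUPPORT = {
    ("t1", "c1"): 0.5, ("t1", "c2"): 0.75,
    ("t2", "c1"): 0.25,
    ("t3", "c1"): 0.5, ("t3", "c2"): 0.5,
}

ids, rows = availability_matrix(CLUSTERS, TAXA)
print(missing_entries(ids, rows))
print(well_supported_fraction(TAXA, SUPPORT, threshold=1.0))
```

With these toy values only t1 has summed support above the threshold of 1.0 (t3 reaches exactly 1.0, which does not exceed it), so one third of the clade counts as well-supported, and the single missing cell is the sequence of t2 for cluster c2.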
Approximating the Subtree Distance between Phylogenies
Author: Charles Semple (University of Canterbury, New Zealand)
Phylogenetic (evolutionary) trees are used in evolutionary biology to represent the tree-like evolution of a collection of present-day species. In this context, the graph-theoretic operation of subtree-prune-and-regraft is a basic tool in the study and analysis of phylogenetic trees. This operation is used in a variety of ways: (i) to quantify the dissimilarity between two phylogenetic trees, (ii) to provide a lower bound on the number of reticulation events in the case that evolution happened in a non-tree-like way, and (iii) as a search tool for selecting the best tree in reconstruction algorithms. For (i) and (ii), one is interested in finding the minimum number of subtree-prune-and-regraft operations needed to transform one phylogenetic tree into another. While this problem is computationally hard, there are approximation and fixed-parameter algorithms for computing this number. In this talk, we describe some of these algorithms. This is joint work with Magnus Bordewich (Durham University, UK) and Catherine McCartin (Massey University, New Zealand).

Crunching Huge Phylogenies: A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM BlueGene
Author: Alexandros Stamatakis (Ecole Polytechnique Federale de Lausanne)
Despite the impressive progress that has been achieved with the new generation of Maximum Likelihood (ML) search algorithms for phylogeny reconstruction, the computation of bootstrap (BS) support values and the analysis of large, and thus memory-intensive, multi-gene datasets still represent a major computational challenge for phyloinformatics. Initially, I will present novel BS heuristics that have been implemented in RAxML and are currently available via the RAxML web-server prototype at http://phylobench.vital-it.ch/raxml-bb/.
On average, the rapid BS method is over 12 times faster than the standard RAxML BS algorithm, e.g., 100 BS replicates for the 500 Zilla dataset take less than 2 hours on a single AMD Opteron. In addition, it is between 18 and 125 times faster than competing programs (PHYML/GARLI), and the speed gain increases with alignment size. Scalability and accuracy have been tested on over 20 diverse (Archaea, Bacteria, Plants, Viruses, Mammals, Fish) DNA and AA real-world datasets comprising 125 up to 7,764 sequences under plain and partitioned models. The Pearson correlation coefficient between rapid and standard BS support values drawn on the best-scoring ML tree averages 0.97. The weighted topological distance between the majority rule consensus trees of rapid and standard BS is below 0.06 in all cases and averages 0.04. Coupled with parallelism, the rapid BS method opens up new possibilities such as practical assessment of the double bootstrap procedure or development of experimentally tested bootstopping criteria. In the second part of my talk I will outline how ML-based inference of large multi-gene datasets can efficiently be parallelized on the massively parallel IBM BlueGene supercomputer architecture and on medium-sized Linux clusters with fast interconnects. On the BlueGene we obtained a speedup of 890 on 1,024 processors for the largest alignment, in terms of memory footprint, analyzed under ML to date (270 sequences and 500,000 base pairs).
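To put the reported BlueGene figure in context, a speedup of 890 on 1,024 processors can be converted into a parallel efficiency (speedup divided by processor count). The short Python calculation below uses only the two numbers given in the abstract; the function name is an assumption for illustration.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Parallel efficiency: the fraction of ideal linear speedup actually achieved."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Figures quoted in the abstract: speedup of 890 on 1,024 BlueGene processors.
efficiency = parallel_efficiency(890, 1024)
print(f"{efficiency:.1%}")  # prints "86.9%"
```

In other words, each processor delivers on average about 87% of the work it would contribute under perfect linear scaling.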
Prospects for enabling phylogenetically-informed comparative biology on the web
Authors: Todd Vision (US National Evolutionary Synthesis Center), Hilmar Lapp (National Evolutionary Synthesis Center)
A wide variety of biological research questions, in fields ranging from community ecology to genomics, can take advantage of phylogenetic trees to provide evolutionary historical context. Such investigations as they are carried out today typically use data and software customized by individual investigators. In this talk, we ask: what technologies are needed to enable such investigations on the fly, over the web, even by researchers unfamiliar with phylogenetics, in order to realize the promise of comparative biology outside of the evolutionary biology community? We also present the model being employed at the U.S. National Evolutionary Synthesis Center for the open development of phyloinformatics technologies.
Robustness of supertree methods for reconciling dense incompatible data
Author: Stephen J. Willson (Iowa State University)
Given a collection of rooted phylogenetic trees with overlapping sets of leaves, a supertree S is a single tree whose set of leaves is the union of the input sets of leaves and such that S agrees with each input tree when restricted to the leaves of the input tree. Typically with trees from real data, no supertree exists, and various methods may be utilized to reconcile the incompatibilities in the input trees. This talk focuses on a measure of robustness of a supertree method called its "radius" R. For example, if R = 1/10, then whenever T is a candidate binary tree and for all rooted triples ab|c in T we have that {a,b,c} occur together in some input tree and that more than 90% of the input triples involving {a,b,c} are in fact ab|c (a strong assumption), then the method outputs T as the supertree; but this might fail if 90% is replaced by 89%. It is shown that the maximal possible radius for a method is R = 1/2. Many familiar methods, both for supertrees and consensus trees, have R = 0, indicating that they need not output a tree T that would seem to be the natural correct answer. Some methods with the maximal possible R = 1/2 are indicated. Extensions may be presented concerning supertree methods that rely on input distance information as well as the topology of the input trees.
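The hypothesis in the radius example can be checked mechanically. The Python sketch below is not a supertree method and is not taken from the talk: it only tests, for a candidate tree T and a radius R, whether every rooted triple ab|c of T is displayed by more than a fraction 1 - R of the input triples on {a,b,c}. The nested-tuple tree encoding and all function names are assumptions, and only resolved triples of the input trees are counted, so binary input trees are assumed.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def cluster_list(tree):
    """Leaf sets below each internal node of a nested-tuple rooted tree."""
    found = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.append(below)
        return below

    visit(tree)
    return found


def rooted_triples(tree):
    """Return {(frozenset({a, b}), c)} for every resolved rooted triple ab|c of tree."""
    if isinstance(tree, str):
        return set()
    found = cluster_list(tree)
    all_leaves = max(found, key=len)  # the root cluster contains every leaf
    triples = set()
    for trio in combinations(sorted(all_leaves), 3):
        for c in trio:
            pair = frozenset(trio) - {c}
            # ab|c holds when some cluster contains a and b but not c.
            if any(pair <= cluster and c not in cluster for cluster in found):
                triples.add((pair, c))
    return triples


def satisfies_radius_condition(candidate, input_trees, radius):
    """True if each triple of candidate has input support strictly above 1 - radius."""
    counts = Counter()
    for tree in input_trees:
        counts.update(rooted_triples(tree))
    for pair, c in rooted_triples(candidate):
        trio = pair | {c}
        total = sum(n for (p, x), n in counts.items() if p | {x} == trio)
        if total == 0:  # {a, b, c} never occur together in an input tree
            return False
        if Fraction(counts[(pair, c)], total) <= 1 - radius:
            return False
    return True


# Toy input: three trees displaying ab|c and one displaying ac|b.
CANDIDATE = (("a", "b"), "c")
INPUTS = [(("a", "b"), "c")] * 3 + [(("a", "c"), "b")]

print(satisfies_radius_condition(CANDIDATE, INPUTS, Fraction(1, 2)))
print(satisfies_radius_condition(CANDIDATE, INPUTS, Fraction(1, 10)))
```

Here ab|c accounts for 3 of the 4 input triples on {a,b,c}. That is more than 1/2, so a method with R = 1/2 would be guaranteed to return the candidate, but it is not more than 9/10, so R = 1/10 gives no guarantee on this input. Exact fractions are used so that the strict inequality is not affected by floating-point rounding.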
This event is sponsored by eSI in association with the following organisations:
The e-Science Institute