TraitRateProp Server - Phenotypic trait changes and site-specific sequence evolution

Detecting trait-dependent evolutionary rate shifts in sequence sites
Prof. Itay Mayrose Lab - Plant evolution, bioinformatics, and comparative genomics

Abstract
Introduction
Methodology

Phenotypic trait evolution
Sequence evolution
Connecting the trait and sequence evolutionary processes

Input

Phylogenetic tree
Sequence data
Character (trait) data
Additional parameters

Output
References

Abstract

TraitRateProp is a probabilistic method that allows testing whether the rate of sequence evolution of an examined protein or genomic region is associated with a binary phenotypic character trait. The method further allows the detection of specific sequence sites whose evolutionary rate is most noticeably affected following the character transition, suggesting a shift in functional/structural constraints.

Introduction

TraitRateProp detects cases in which some or all sequence positions in a given gene (protein) exhibit evolutionary rate shifts that are associated with the state of a binary phenotypic trait. The trait can be related to a genomic attribute (e.g., the presence/absence of a certain gene family) or to an organismal trait (e.g., an environmental or ecological preference, life history attribute, or morphological feature). Given a rooted ultrametric species tree, a multiple sequence alignment (MSA), and the trait states of the extant species (coded as either '0' or '1'), TraitRateProp allows for: (1) testing whether the evolutionary rate of the input sequence data is associated with the given trait data; and (2) if an association is detected, inferring the sequence positions whose evolutionary rate is most likely to be associated with the trait data. TraitRateProp is based on the maximum-likelihood paradigm and provides two maximum likelihood estimators (MLEs) regarding the co-evolution of sequence and trait data: the relative rate parameter, r, which is the ratio between the sequence evolutionary rates under states '1' and '0', and the parameter p, which is the proportion of positions in the sequence whose evolutionary rate is associated with the phenotypic state. The model, the likelihood estimation procedures, and the associated statistical tests are described in full in Levy Karin et al. (2017) and Mayrose and Otto (2011).

Methodology

TraitRateProp combines models of sequence evolution and of phenotypic trait evolution in a single likelihood framework by first reconstructing a large number of possible evolutionary histories of the phenotypic trait along the phylogeny. Each such history is inferred using the stochastic mapping approach (Nielsen 2002) and is consistent with the observed phenotypic states of the extant species. The method compares a null model, in which a single sequence rate matrix is fit to the data, with an alternative model, in which two sequence rate matrices, each corresponding to one of the phenotypic states, are fit to the sequence data.
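To illustrate the idea of a trait history that is consistent with observed states, the following is a minimal sketch of stochastic mapping on a single branch whose two end states are known. It uses simple rejection sampling: a two-state process is simulated from the start state and kept only if it finishes in the required end state. The function name, the rate names q01 and q10, and the numeric values are assumptions made for this illustration; mapping a full tree also requires sampling the states at internal nodes, which is not shown.

```python
import random

def sample_branch_history(start, end, length, q01, q10, rng, max_tries=100000):
    """Sample one trait history on a branch by rejection sampling.

    Returns a list of (state, duration) segments that starts in `start`,
    finishes in `end`, and whose durations sum to `length`.
    """
    leave_rate = {0: q01, 1: q10}  # rate of leaving each state (hypothetical names)
    for _ in range(max_tries):
        state, elapsed, segments = start, 0.0, []
        while True:
            wait = rng.expovariate(leave_rate[state])
            if elapsed + wait >= length:
                # No further change before the branch ends.
                segments.append((state, length - elapsed))
                break
            segments.append((state, wait))
            elapsed += wait
            state = 1 - state  # binary trait: flip the state
        if state == end:
            return segments  # accept: consistent with the observed end state
    raise RuntimeError("no simulated history matched the end state")

rng = random.Random(1)
history = sample_branch_history(start=0, end=1, length=1.0, q01=0.8, q10=0.5, rng=rng)
```

Repeating the call many times yields a collection of histories over which downstream quantities can be averaged.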

Phenotypic trait evolution:

A two-state Markov model is used to describe the evolution of the phenotypic trait along the tree.
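As a small worked example, the transition probabilities of a generic two-state continuous-time Markov model have a closed form. The sketch below assumes a parameterization with a rate q01 for 0 to 1 changes and a rate q10 for 1 to 0 changes; these names, and the exact parameterization used by the server, are assumptions here.

```python
import math

def two_state_transition_probs(q01, q10, t):
    """Return the 2x2 matrix P(t) with P[i][j] = Pr(state j at time t | state i at time 0)."""
    total = q01 + q10
    pi0 = q10 / total  # stationary frequency of state 0
    pi1 = q01 / total  # stationary frequency of state 1
    decay = math.exp(-total * t)
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]

P = two_state_transition_probs(q01=0.8, q10=0.5, t=0.3)
```

Each row of P(t) sums to 1, P(0) is the identity matrix, and for long branches both rows approach the stationary frequencies.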

Sequence evolution:

Connecting the trait and sequence evolutionary processes:

Dependence of the rate of sequence evolution on the phenotypic state is modelled by allowing the sequence evolutionary rate of some positions, termed "phenotype-dependent positions", to vary depending on whether the phenotypic trait is in state '0' or '1'. Specifically, a rate parameter r1 applies when the character state is '1', and a rate parameter r0 when the character state is '0'. Thus, for a phenotype-dependent position, the sequence rate matrix is multiplied by either r1 or r0, according to the character state. The parameter r denotes the ratio between r1 and r0.
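Because the same rate matrix is scaled on every part of a branch, multiplying it by r0 or r1 is equivalent to stretching or shrinking the time spent in each state. The sketch below shows this for one branch whose trait history is given as (state, duration) segments. The Jukes-Cantor (JC69) model is used only because its transition probability has a short closed form; it, the function names, and the numeric values are assumptions for this illustration and not necessarily what the server uses.

```python
import math

def effective_length(history, r0, r1):
    """Sum of segment durations, each weighted by the rate of its trait state."""
    return sum((r1 if state == 1 else r0) * duration for state, duration in history)

def jc69_same_prob(distance):
    """JC69 probability that a nucleotide is the same at both ends of a branch."""
    return 0.25 + 0.75 * math.exp(-4.0 * distance / 3.0)

# Hypothetical history: 0.3 time units in state '0', then 0.7 in state '1'.
history = [(0, 0.3), (1, 0.7)]
r0, r1 = 1.0, 2.0
r = r1 / r0  # relative rate parameter

dependent_length = effective_length(history, r0, r1)   # 0.3*1.0 + 0.7*2.0
independent_length = sum(d for _, d in history)        # rate unaffected by the trait

p_same_dependent = jc69_same_prob(dependent_length)
p_same_independent = jc69_same_prob(independent_length)
```

With r greater than 1, the phenotype-dependent position behaves as if the branch were longer, so it is less likely to be unchanged than a position that ignores the trait.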

The TraitRateProp joint genotype-phenotype likelihood framework describes an evolutionary process along a phylogeny T and considers two types of data: sequence data (DS) and character states (DC) of the extant species. The likelihood of the model is the joint probability of DS and DC given the model parameters θ. This joint probability is expressed as the probability of observing DC times the probability of observing DS conditioned on having observed DC. Under these settings, likelihood computations based on the sequence data require knowledge of the character state in each part of T, i.e., the complete reconstructed history of character changes. As this history is unknown, the marginal probability of DS given DC and θ is approximated by integrating over many possible character histories h (as obtained using the stochastic mapping approach).
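The factorization described in this paragraph can be written out as follows. The symbols D_S, D_C, θ and h follow the text; the number of sampled histories H and the index i are notation introduced here, and the equal-weight average assumes that the histories h_i are drawn from their distribution given D_C and θ, as stochastic mapping is designed to do.

```latex
P(D_S, D_C \mid \theta) = P(D_C \mid \theta)\, P(D_S \mid D_C, \theta)

P(D_S \mid D_C, \theta)
  = \mathbb{E}_{h \mid D_C, \theta}\!\left[ P(D_S \mid h, \theta) \right]
  \approx \frac{1}{H} \sum_{i=1}^{H} P(D_S \mid h_i, \theta)
```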

The likelihood of each sequence position k is computed using a mixture of the likelihoods under two scenarios: either the position evolved independently of the character state, or the position belongs to the phenotype-dependent category. In this mixture model, the parameter p specifies the probability that a position belongs to the phenotype-dependent category. By fixing p to 1, the user can test the hypothesis that the evolutionary rate of all sequence sites is associated with the examined trait.
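The per-position mixture can be sketched in a few lines. The code assumes that the two per-position likelihoods (trait-independent and phenotype-dependent) have already been computed; the function names and numbers are hypothetical. The last function applies Bayes' rule to the mixture and is included only as an illustration; it is not claimed to be the score reported by the server.

```python
import math

def site_mixture(lik_indep, lik_dep, p):
    """Likelihood of one position: weight p on the phenotype-dependent category."""
    return (1.0 - p) * lik_indep + p * lik_dep

def log_likelihood(site_liks, p):
    """Sum of per-position log mixture likelihoods over (lik_indep, lik_dep) pairs."""
    return sum(math.log(site_mixture(li, ld, p)) for li, ld in site_liks)

def posterior_dependent(lik_indep, lik_dep, p):
    """Posterior probability that a position is phenotype-dependent (Bayes' rule)."""
    return p * lik_dep / site_mixture(lik_indep, lik_dep, p)

# Hypothetical per-position likelihoods: (independent, dependent).
site_liks = [(0.02, 0.06), (0.05, 0.01), (0.03, 0.03)]
total = log_likelihood(site_liks, p=0.25)
post_first = posterior_dependent(0.02, 0.06, p=0.25)
```

Setting p to 1 reduces every position to the phenotype-dependent likelihood, which corresponds to the hypothesis that all sites are associated with the trait.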

Input

A rooted ultrametric phylogenetic tree with branch lengths (Newick format).
A multiple sequence alignment (MSA) of the sequence data of the extant species (Fasta format).
The character states of the extant species coded as either '0' or '1' (Fasta format).
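Before submitting, it can help to check that the sequence and trait files agree with each other. The sketch below is a hypothetical pre-submission check and is not part of the server: it assumes the trait file holds one record per species whose sequence is the single character '0' or '1', and it does not parse the Newick tree.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def check_inputs(msa_text, trait_text):
    """Return a list of human-readable problems; an empty list means none were found."""
    msa = parse_fasta(msa_text)
    traits = parse_fasta(trait_text)
    problems = []
    if len({len(seq) for seq in msa.values()}) > 1:
        problems.append("MSA sequences differ in length")
    for name, state in traits.items():
        if state not in ("0", "1"):
            problems.append(f"trait of {name} is not '0' or '1'")
    if set(msa) != set(traits):
        problems.append("species names differ between MSA and trait file")
    return problems

msa_text = ">sp1\nACGT\n>sp2\nAC-T\n"
trait_text = ">sp1\n0\n>sp2\n1\n"
problems = check_inputs(msa_text, trait_text)
```

For the two small example files above, no problems are found.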

In addition, the user should indicate the type of sequence input (DNA or protein). The user can also control the search range of the r parameter and whether the p parameter should be optimized. Fixing the p parameter to 1 runs the program in TraitRate mode, which assumes that the evolutionary rate of all sequence sites is associated with the examined trait. Finally, for protein data, the user can provide a 3D structural model in PDB file format. In this case, the site-specific predictions of TraitRateProp are projected onto the provided 3D protein structure.

Output

TraitRateProp directs you to a web page called "TraitRateProp Job Status Page". This web page is automatically updated every 30 seconds, showing messages regarding the different stages of the server activity.

A runtime estimate is computed based on the size of the provided input. A basic linear regression model through the origin was pre-computed from simulated datasets that contained 1,000 sequence positions (MSA length) and a varying number of species.

As the runtime is expected to increase linearly with the number of species (N) and the number of sequence positions (L), the TraitRateProp web server uses the following formula to estimate the runtime in seconds:

When the calculation finishes, results are printed to this page and provided in several links. These results include:

Parameter estimates and model comparison: the result of a chi-squared hypothesis test comparing the null model (no association between shifts in the phenotypic trait and the rate of sequence evolution) with the alternative model is printed to the results page, together with the estimates of r and p under the alternative model. A full report of the parameter estimates under the null and alternative models, and of the comparison between them, is provided in a downloadable file.
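A chi-squared comparison of nested models is commonly carried out as a likelihood-ratio test, sketched below under that assumption. The number of degrees of freedom used by the server is not stated here, so df is left to the caller; only df = 1 and df = 2 are handled because their tail probabilities have simple closed forms. The function name and log-likelihood values are hypothetical.

```python
import math

def lrt_p_value(loglik_null, loglik_alt, df):
    """Likelihood-ratio statistic and its chi-squared p-value for df = 1 or 2."""
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    if df == 1:
        # Chi-squared(1) tail: P(Z^2 > x) = erfc(sqrt(x / 2))
        p_value = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        # Chi-squared(2) tail: exp(-x / 2)
        p_value = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only handles df = 1 or df = 2")
    return stat, p_value

# Hypothetical log-likelihoods of the null and alternative models.
stat, p_value = lrt_p_value(loglik_null=-1000.0, loglik_alt=-996.0, df=2)
```

A small p-value indicates that the alternative model, with trait-dependent rates, fits the data significantly better than the null model.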

A colored version of the input MSA: each column is colored according to its TraitRateProp Bayes-factor score, which indicates how strongly its evolutionary rate is associated with the trait data. For visualization purposes, the continuous TraitRateProp position-specific scores are partitioned into a discrete scale of 6 bins, where dark blue corresponds to maximal association and white to no association. These scores are also provided as a downloadable text file.

The ultrametric input tree colored by trait state: red indicates '1' and black indicates '0'.

If the user provided a PDB file:

The TraitRateProp per-position scores are projected onto the 3D protein structure.

References

Levy Karin E., Wicke S., Pupko T., and Mayrose I. 2017. An integrated model of phenotypic trait changes and site-specific sequence evolution. Syst. Biol. In press.

Mayrose I. and Otto S.P. 2011. A likelihood method for detecting trait-dependent shifts in the rate of molecular evolution. Mol. Biol. Evol. 28:759-770.

Nielsen R. 2002. Mapping mutations on phylogenies. Syst. Biol. 51:729-739.

Uzzell T. and Corbin K.W. 1971. Fitting discrete probability distributions to evolutionary events. Science 172:1089-1096.

Wakeley J. 1993. Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. J. Mol. Evol. 37:613-623.

Jones D.T., Taylor W.R., and Thornton J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275-282.

Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314.