In our analysis, where a gene has alternative isoforms we consider only the longest peptide isoform, so that there is a 1-to-1 gene/peptide relation. The orthologous genes are the best reciprocal peptide BLAST hits between the two species considered (Best Reciprocal Hit: BRH), chosen by a discriminative function based on score, e-value, % identity and finally % positivity. When the discriminative function is unambiguous and returns a single BRH, we obtain a Unique BRH (UBRH). When it returns more than one, we obtain multiple BRHs (MBRH, which stands for one of (M)any BRHs). For closely related species, i.e. within the vertebrates or within the arthropods, where some conservation of gene order (synteny) is expected, we identify additional orthologous pairs by combining reciprocal BLAST with location information. These additional orthologue predictions are named Reciprocal Hit based on Synteny (RHS).
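The BRH step can be sketched as follows. This is a minimal Python illustration, not the Ensembl pipeline code: the `Hit` record, the function names, and the rule that a pair is labelled UBRH only when neither gene appears in another reciprocal pair are assumptions made for the example. The only detail taken from the text is the order of the criteria (score, e-value, % identity, % positivity).

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit of a query peptide against a target peptide (hypothetical record)."""
    query: str
    target: str
    score: float
    evalue: float
    identity: float
    positivity: float


def rank_key(hit):
    # Compare on score first, then e-value (lower is better, hence the minus
    # sign), then % identity, then % positivity.
    return (hit.score, -hit.evalue, hit.identity, hit.positivity)


def best_hits(hits):
    """Map each query to the set of targets tied for the best rank key."""
    by_query = defaultdict(list)
    for hit in hits:
        by_query[hit.query].append(hit)
    best = {}
    for query, query_hits in by_query.items():
        top = max(rank_key(h) for h in query_hits)
        best[query] = {h.target for h in query_hits if rank_key(h) == top}
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return {(gene_a, gene_b): "UBRH" or "MBRH"} for all reciprocal best hits."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = [
        (a, b)
        for a, targets in best_ab.items()
        for b in targets
        if a in best_ba.get(b, set())
    ]
    # Assumed labelling rule: a pair is unique only if neither gene takes
    # part in any other reciprocal pair.
    seen_a = Counter(a for a, _ in pairs)
    seen_b = Counter(b for _, b in pairs)
    return {
        (a, b): "UBRH" if seen_a[a] == 1 and seen_b[b] == 1 else "MBRH"
        for a, b in pairs
    }
```

Ties in the rank key are what produce MBRHs here: if two targets are indistinguishable on all four criteria, both are kept as best hits.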
Human/Chimpanzee exception: The human/chimpanzee orthologue predictions were obtained in a completely different manner. The chimpanzee genome is a low-coverage (4X) assembly whose quality is too poor to generate a gene set with the classical Ensembl gene build pipeline. Instead, the gene set produced by Ensembl was generated by "projection" of human genes onto the chimpanzee genome through whole-genome BLASTz alignments between human and chimp, filtered for orthologous sequence alignments. The human/chimpanzee orthologous genes are therefore obtained de facto, and are labelled Derived from Whole Genome Alignments (DWGA).
Clicking on the putative orthologue gene will take you to a 'GeneView' display of that gene within the web site for the other species. See also the 'SyntenyView' help page. Where no orthologue predictions are shown, you may wish to explore the Ensembl protein family (see below).
dN (number of non-synonymous substitutions / number of non-synonymous sites) and dS (number of synonymous substitutions / number of synonymous sites) values were generated using the codeml program included in the PAML package (1). With the parameters we have used, codeml performs pairwise Maximum Likelihood calculations of dN and dS for each orthologous pair. We have used the F3x4 codon evolution model (2). This takes into account both the bias deriving from the different probabilities of transition (T<->C and A<->G) versus transversion (T/C<->A/G) mutations, and the bias due to different nucleotide frequencies at the three codon positions. For more details on the parameters, the codeml parameter file can be found in ensembl-compara/scripts/homology/codeml.ctl.
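The two biases mentioned above can be illustrated with a short Python sketch. It does not reproduce the codeml likelihood calculation; it only shows how a substitution is classified as a transition or a transversion, and how F3x4-style codon frequencies are built as the product of position-specific nucleotide frequencies, renormalised over sense codons. The function names are hypothetical, and the sketch assumes the standard genetic code and an in-frame, uppercase A/C/G/T coding sequence.

```python
from itertools import product

BASES = "TCAG"
PURINES = {"A", "G"}
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def substitution_type(x, y):
    """Classify a single-base change as a transition or a transversion."""
    if x == y:
        raise ValueError("identical bases are not a substitution")
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    return "transition" if (x in PURINES) == (y in PURINES) else "transversion"


def f3x4_codon_frequencies(cds):
    """Expected sense-codon frequencies from per-position base frequencies."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence contains no complete codon")
    # counts[p][b] = number of codons with base b at codon position p
    counts = [dict.fromkeys(BASES, 0) for _ in range(3)]
    for i in range(usable):
        counts[i % 3][cds[i]] += 1
    raw = {}
    for codon in map("".join, product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue
        p = 1.0
        for pos, base in enumerate(codon):
            p *= counts[pos][base] / n_codons
        raw[codon] = p
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no probability mass on sense codons")
    # Renormalise so the 61 sense-codon frequencies sum to 1.
    return {codon: p / total for codon, p in raw.items()}
```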
dN and dS values are only provided for orthologues from some species pairs, i.e. human/mouse, human/rat, and mouse/rat. Orthologues for other species pairs are too divergent for dS to be an accurate measure: most synonymous sites will have been subjected to more than one mutation, and ancestral changes cannot be reliably inferred from extant sequences (i.e. dS is saturated).
Orthologue predictions for human/mouse, human/rat and mouse/rat may not be perfect. Incorrect assignments will manifest as anomalously high dS values. We have, therefore, applied a cut-off of twice the median value of all dS for each species pair as the criterion for displaying the dN/dS ratio. Predicted orthology relationships with dS above this threshold are likely to be errors.
The following dS threshold values have been used:
Species pair | dS threshold |
---|---|
human/mouse | 1.25560 |
human/rat | 1.26930 |
mouse/rat | 0.40780 |
Therefore, dN/dS ratios for human/mouse orthologues, for example, are only displayed when dS <= 1.25560; otherwise "--" is shown.
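The display rule can be written down as a small Python sketch. The threshold values are those listed above; the function names, the five-decimal output format, and the handling of dS = 0 (shown as "--" to avoid dividing by zero) are assumptions made for the example and are not taken from the Ensembl code.

```python
from statistics import median

# dS thresholds quoted in the text, keyed by unordered species pair.
DS_THRESHOLDS = {
    frozenset(("human", "mouse")): 1.25560,
    frozenset(("human", "rat")): 1.26930,
    frozenset(("mouse", "rat")): 0.40780,
}


def ds_cutoff(ds_values):
    """Cut-off as described: twice the median of all dS values for a species pair."""
    return 2 * median(ds_values)


def format_dn_ds(dn, ds, species_a, species_b):
    """Return the dN/dS ratio as text, or "--" when it should not be displayed."""
    threshold = DS_THRESHOLDS.get(frozenset((species_a, species_b)))
    if threshold is None:
        return "--"  # no dN/dS values are provided for this species pair
    if ds <= 0 or ds > threshold:
        return "--"  # ratio undefined (assumed handling) or dS above threshold
    return f"{dn / ds:.5f}"
```

For instance, `format_dn_ds(0.1, 0.5, "human", "mouse")` gives a ratio, while the same call with dS = 1.3 gives "--".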
1. Yang, Z. (1997)
"PAML: a program package for phylogenetic analysis by maximum likelihood."
Comput. Appl. Biosci. 13, 555-556.
2. Goldman, N. & Yang, Z. (1994)
"A codon-based model of nucleotide substitution for protein-coding DNA sequences."
Mol. Biol. Evol. 11, 725-736.
3. Stein, L.D. et al. (2003)
"The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics."
PLoS Biol. 1, 166-192.
The paralogous genes presently displayed within Ensembl are best defined as 'recently duplicated genes'. For each organism where these recent duplications have been identified, this has been done using a particular definition of what is recent. For example, recent human duplications are those that appear to have occurred since the divergence of the human and rodent lineages. Currently, these gene duplications are identified for just three species: human, mouse and rat. This will be expanded to the full range of Ensembl species in the very near future.
The recent duplications for each species were detected using a combination of homology searches followed by genetic distance-based filtering. For each organism, potential recent duplicates of each gene were derived using a simple BLAST homology search. Where it made sense to do so, the database used for the homology search included not only all genes from the organism in question, but also genes from species that represent a useful outgroup. Matches with outgroup sequences helped provide a dynamic cut-off during later filtering. In the case of human genes, the closest match to a mouse or rat gene provided a cut-off point defining the time boundary for what constitutes a 'recent' duplication. For genes that did not have a homologous sequence from an outgroup species, an arbitrary cut-off was applied.
Once a set of homologous genes was obtained by BLAST search, each hit sequence was aligned with the query (a codon-based alignment was generated). The Ks genetic distance was then calculated for each query-hit gene pair using the Nei and Gojobori method (4) as implemented in the codeml program from the PAML software package (5). Intra-species hit sequences were considered to be recent duplicates if they were more closely related to the query than the nearest outgroup sequence (or, where no related inter-species genes were found, if their genetic distance was below an arbitrary cut-off).
Species | Outgroup species | Arbitrary cut-off |
---|---|---|
Homo sapiens | Mus musculus, Rattus norvegicus | 0.6 |
Mus musculus | Homo sapiens | 0.6 |
Rattus norvegicus | Homo sapiens | 0.6 |
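The filtering rule described above can be sketched in Python as follows. The function and argument names are hypothetical, the Ks values are assumed to have been computed beforehand (the alignment and codeml steps are not shown), and treating "more closely related" as a strict inequality is an assumption.

```python
ARBITRARY_KS_CUTOFF = 0.6  # value from the table above


def recent_duplicates(paralogue_ks, outgroup_ks, cutoff=ARBITRARY_KS_CUTOFF):
    """Select intra-species hits that count as recent duplicates of a query gene.

    paralogue_ks: dict mapping each intra-species hit to its Ks distance
                  from the query gene.
    outgroup_ks:  Ks distances from the query to its outgroup-species hits
                  (empty if no outgroup homologue was found).
    """
    # Dynamic cut-off: the nearest outgroup sequence. Fall back to the
    # arbitrary cut-off when there is no outgroup homologue.
    limit = min(outgroup_ks) if outgroup_ks else cutoff
    return sorted(gene for gene, ks in paralogue_ks.items() if ks < limit)
```

For example, with intra-species hits at Ks 0.1, 0.5 and 0.9 and a nearest outgroup hit at Ks 0.4, only the first hit is kept; with no outgroup hit, the 0.6 cut-off keeps the first two.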
4. Nei, M. & Gojobori, T. (1986)
"Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions."
Mol. Biol. Evol. 3, 418-426.
5. Yang, Z. (1997)
"PAML: a program package for phylogenetic analysis by maximum likelihood."
Comput. Appl. Biosci. 13, 555-556.