Annotation

The Ensembl mouse gene annotations were substantially improved in the previous release (e!61, 1 February 2011) by using updated Ensembl genebuild pipeline code and incorporating data resources that have become available since the last NCBIM37 genebuild (April 2007). The new resources include an updated mouse-specific repeat library, additional RefSeq and UniProt protein sequence data for annotating the coding regions of protein-coding genes, and new cDNAs and ESTs for annotating the untranslated regions (UTRs) of protein-coding genes. Extensive data quality checks were performed to remove gene/transcript models that had erroneous structures (e.g. interlocking transcripts with long introns on the same strand) or were supported by dubious evidence (e.g. cDNA fragments with short, wrongly annotated open reading frames).

The Ensembl annotations were then merged with Vega annotations (mainly generated by the HAVANA team at the Wellcome Trust Sanger Institute) at the transcript level. As in previous releases since October 2007, release 61 provided the combined Ensembl-Vega gene set. Transcripts from the two annotation sources were merged if they shared the same internal exon-intron boundaries (i.e. had an identical splicing pattern), with slight differences in the terminal exons allowed. Importantly, all Vega source transcripts (regardless of merge status) were included in the final merged gene set.
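The merge criterion can be illustrated with a minimal sketch. This is not the Ensembl pipeline code: the function names `intron_chain` and `can_merge`, the coordinate convention, and the example coordinates are all hypothetical. The sketch also makes two assumptions that the text does not state: it places no limit on how far the terminal exons may differ, and it treats single-exon transcripts (which have no internal boundaries) as not mergeable.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) on the same sequence, start <= end


def intron_chain(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the internal exon-intron boundaries of a transcript.

    Each entry pairs the end of one exon with the start of the next one,
    so the outer ends of the two terminal exons are deliberately ignored.
    """
    ordered = sorted(exons)
    return [(left[1], right[0]) for left, right in zip(ordered, ordered[1:])]


def can_merge(a_exons: List[Exon], a_strand: int,
              b_exons: List[Exon], b_strand: int) -> bool:
    """True if two transcripts share an identical splicing pattern."""
    if a_strand != b_strand:
        return False
    chain_a = intron_chain(a_exons)
    chain_b = intron_chain(b_exons)
    # Assumption: single-exon transcripts (empty chain) are not merged.
    return bool(chain_a) and chain_a == chain_b


# Hypothetical coordinates: same introns, different terminal exon ends.
ensembl_tr = [(100, 200), (300, 400), (500, 650)]
vega_tr = [(120, 200), (300, 400), (500, 600)]
# Hypothetical variant whose second exon starts at a different position.
variant_tr = [(100, 200), (320, 400), (500, 650)]

print(can_merge(ensembl_tr, 1, vega_tr, 1))     # True
print(can_merge(ensembl_tr, 1, variant_tr, 1))  # False
```

Comparing only the intron chain is what lets the terminal exons differ: the outer start of the first exon and the outer end of the last exon never enter the comparison.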

In the current release, the combined Ensembl-Vega gene set from release 61 was patched to correct gene/transcript models previously truncated due to the presence of selenocysteine residues (encoded by the UGA codon) in their translations. In addition, the gene set was updated to maintain its consistency with the latest CCDS gene set, bringing the number of CCDS models up to 22158 (as of 9 February 2011) from 17637 (as of 22 September 2010).

As a result, the release 62 gene set consists of 36814 genes and 93805 transcripts. Of the 93805 transcripts, 18.96% (17788) were the result of merging Ensembl and Vega annotations, 22.18% (20805) originated from Ensembl, 52.05% (48821) originated from Vega, and the remaining 6.81% (6391) were incorporated from other sources (e.g. immunoglobulin gene segments/transcripts imported from IMGT data). Most of the non-merged Vega transcripts were alternative splice variants and/or non-coding transcripts, which complement the Ensembl annotations; the latter focus on providing a conservative set of protein-coding genes/transcripts.
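The breakdown by source can be rechecked from the transcript counts quoted above. The short sketch below does only that arithmetic; the variable names are illustrative, and the count for "other" sources is derived by subtraction rather than taken from the text.

```python
# Transcript counts as quoted in the text for the release 62 gene set.
total = 93805
by_source = {"merged": 17788, "ensembl": 20805, "vega": 48821}

# Whatever is not accounted for by the three named sources is "other".
by_source["other"] = total - sum(by_source.values())


def percent(count: int, whole: int) -> float:
    """Share of `whole` represented by `count`, rounded to two decimals."""
    return round(100 * count / whole, 2)


shares = {source: percent(count, total) for source, count in by_source.items()}

for source, count in by_source.items():
    print(f"{source:8s} {count:6d} {shares[source]:6.2f}%")
```

Here the derived "other" count is 6391 transcripts.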

Additional manual annotation of this genome can be found in Vega.

HEROIC

Additional functional genomics data produced by the HEROIC project (High-throughput Epigenetic Regulatory Organisation In Chromatin) is available to download from the Ensembl Projects HEROIC portal.