The ENCODE project aims to discover all functional elements in the human genome. Ensembl is involved in ENCODE in two ways:
Ensembl works in close coordination with the UCSC group, which is the data coordination centre (DCC) for ENCODE.
On June 14th, 2007, the paper detailing the analysis of the ENCODE pilot project was published in Nature along with 36 different companion papers in Genome Research. Dr Birney led the analysis for the main paper. The key results of this paper are:
A number of key intermediate files from the analysis, along with other ENCODE resources, are available for FTP download.
Building on this initial analysis, Ensembl aims to provide a richer annotation of the human genome. The key concept, which we have introduced from Ensembl 45, is the "Regulatory Build". The Regulatory Build aims to provide a single "best guess" set of regulatory elements, with growing annotation of those elements from different experiments. This Regulatory Build augments the standard "Gene Build" (itself incorporating both protein-coding and non-protein-coding genes) to form the union of functional elements in the genome.
We will also augment our Gene Build to take account of new functional datasets such as CAGE and PET tags across the genome. These tags are already available in Ensembl for human and mouse, using data generated by the RIKEN group in Japan and the GIS group in Singapore. We will use these markers of transcription start sites and termini to provide more accurate transcript definitions.
An initial Regulatory Build has been developed in Ensembl by Paul Flicek and colleagues. It integrates 8 genome-wide datasets, mainly in pre-publication "resource" status, including the DNaseI Hypersensitive site set from Greg Crawford's group at Duke University, a set of 6 histone modifications from Martin Hirst's group at the BGGC at Vancouver, and the CTCF dataset from Bing Ren's group at UCSD.
This set takes DNaseI Hypersensitive Sites, CTCF binding regions and H3K4me3 as three "focus regions", each defining a potential element. The union of these three foci defines 110,000 elements across the genome. We then took all the factors (including an additional 5 histone marks) to look for specific patterns diagnostic of certain features. We developed a number of patterns which were highly enriched for gene starts, genic regions and distal regions (away from genes).
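The "union of foci" step can be illustrated with a small interval-merging sketch. This is not the Ensembl implementation: it assumes that focus regions from different datasets are combined into one element whenever they overlap on the same chromosome, and the names `Region` and `union_focus_regions`, as well as the coordinates, are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A hypothetical focus region: inclusive coordinates plus its source dataset."""
    chrom: str
    start: int
    end: int
    source: str  # e.g. "DNaseI", "CTCF" or "H3K4me3"


def union_focus_regions(regions):
    """Merge overlapping focus regions into elements.

    Assumption: two regions belong to the same element if they are on the
    same chromosome and their coordinates overlap. Each element records
    which focus datasets contributed to it.
    """
    elements = []
    for r in sorted(regions, key=lambda r: (r.chrom, r.start, r.end)):
        last = elements[-1] if elements else None
        if last and last["chrom"] == r.chrom and r.start <= last["end"]:
            # Overlaps the current element: extend it and note the source.
            last["end"] = max(last["end"], r.end)
            last["sources"].add(r.source)
        else:
            # No overlap: start a new element.
            elements.append(
                {"chrom": r.chrom, "start": r.start, "end": r.end,
                 "sources": {r.source}}
            )
    return elements


# Made-up example coordinates, not real data.
regions = [
    Region("1", 100, 200, "DNaseI"),
    Region("1", 150, 260, "H3K4me3"),
    Region("1", 500, 560, "CTCF"),
    Region("2", 120, 180, "CTCF"),
]
elements = union_focus_regions(regions)
for e in elements:
    print(e["chrom"], e["start"], e["end"], sorted(e["sources"]))
```

In a sketch like this, the set of contributing sources per element is the natural place to attach further evidence, such as the additional histone marks, before looking for patterns.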
The results of this analysis can be seen in the "Regulatory features" track on Ensembl displays in human (on by default) and are available to download from our FTP site.
We are extremely grateful to the Crawford and Hirst laboratories for the use of their data in pre-publication status, in line with the open data access of ENCODE and the human genome project, and to Bing Ren's group at UCSD for the CTCF dataset (Kim, et al. 2007. Cell 128:1231-45). In the future we hope to integrate more functional datasets and to use genetic association studies, such as those published in Stranger et al., to provide the link between elements and genes.
For general questions about the dataset or how to access it, please email the helpdesk at helpdesk@ensembl.org. If you wish to learn more about future plans, please email Steve Searle for questions on transcription information, and Paul Flicek and Ewan Birney for questions on regulatory information. (All of us are on the helpdesk email, which is internally tracked to ensure a response to each question, so we recommend this route. However, we realise that some people have specific strategic questions which they may be more comfortable sending to us directly.)