Version 2.1 of the script requires version 63 or later of the Ensembl Core and Variation APIs, together with their dependencies, to run in normal mode. See instructions for details. To use the cache, the gzip and zcat utilities are required. No explicit installation of the script is necessary.
The Variant Effect Predictor script can be downloaded from Ensembl's FTP site.
It is also included as part of the ensembl-variation module of the Ensembl API - you can find it in the ensembl-variation/scripts/examples/ directory.
The script is run on the command line as follows:
perl variant_effect_predictor.pl [options]
where [options] represent a set of flags and options to the script. These can be listed using the flag --help:
perl variant_effect_predictor.pl --help
By default the script connects to the public Ensembl database server at ensembldb.ensembl.org; other connection options are available.
Most users will need only a few of the options described below; in most cases the following command is enough to get started:
perl variant_effect_predictor.pl -i input.txt -o output.txt
where input.txt contains data in one of the compatible input formats, and output.txt is the output file created by the script.
Options can be passed as the full string (e.g. --format), or as the shortest unique string among the options (e.g. --form for --format, since there is another option --force_overwrite).
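The abbreviation rule above can be illustrated with a short sketch. It is not the script's own option parser; `resolve_option` and the `OPTIONS` subset are hypothetical names used only to show how a prefix is accepted when it matches exactly one option name.

```python
def resolve_option(prefix, known_options):
    """Return the single option name that `prefix` abbreviates.

    Raises ValueError if the prefix matches no option or more than one.
    """
    if prefix in known_options:  # an exact name always wins
        return prefix
    matches = [name for name in known_options if name.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown option: {prefix}")
    raise ValueError(f"ambiguous option {prefix}: {', '.join(sorted(matches))}")


# A small, illustrative subset of the option names listed in this document.
OPTIONS = ["format", "force_overwrite", "input_file", "output_file"]

print(resolve_option("form", OPTIONS))  # format
```

With this subset, "for" would be rejected as ambiguous because it is a prefix of both format and force_overwrite.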
NOTE: Whole-genome mode is now the default run mode for the VEP script. In the rare case that you would prefer to run the script in the old per-variant mode, you can force this with --no_whole_genome.
Flag | Alternate | Description |
---|---|---|
--help | | Display help message and quit |
--verbose | -v | Output longer status messages as the script runs. This option can be used to generate the basis of a configuration file - see --config below. Not used by default |
--quiet | -q | Suppress status and warning messages. Not used by default |
--no_progress | | Don't show progress bars. Progress bars shown by default |
--config [filename] | | Load configuration options from a config file. The config file should consist of whitespace-separated pairs of option names and settings, e.g. "output_file my_output.txt species mus_musculus format vcf host useastdb.ensembl.org". This is useful if you find yourself using the same configuration options each time. You can quickly create such a file by setting the flags as normal and running the script in verbose (-v) mode. This will output lines that can be copied to a config file and loaded on the next run using --config. Note that any options specified in the normal way overwrite those in the config file. Not used by default |
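As a rough illustration of the config file format described above, the sketch below parses whitespace-separated option/value pairs and then lets command-line settings overwrite them. It is not the script's own parser: `parse_config` and `merge_options` are hypothetical helpers, and the sketch assumes every option in the file carries exactly one value.

```python
def parse_config(text):
    """Parse whitespace-separated name/value pairs into a dict."""
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("config must contain complete name/value pairs")
    return dict(zip(tokens[0::2], tokens[1::2]))


def merge_options(config_options, command_line_options):
    """Options given on the command line overwrite those from the config file."""
    merged = dict(config_options)
    merged.update(command_line_options)
    return merged


# The example settings from the table above.
config_text = """
output_file my_output.txt
species     mus_musculus
format      vcf
host        useastdb.ensembl.org
"""

# Hypothetical run where --species is also given on the command line.
options = merge_options(parse_config(config_text), {"species": "homo_sapiens"})
print(options["species"], options["format"])
```

Here the species from the command line replaces the one in the file, while format and host are taken from the file unchanged.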
Flag | Alternate | Description |
---|---|---|
--species [species] | | Species for your data. This can be the Latin name, e.g. "homo_sapiens", or any Ensembl alias, e.g. "mouse". Specifying the Latin name can speed up the initial database connection, as the registry does not have to load all available database aliases on the server. Default = "homo_sapiens" |
--input_file [filename] | -i | Input file name. If not specified, the script will attempt to read from STDIN. |
--format [vcf\|pileup] | | Input file format. By default, the script auto-detects the input file format. Using this option you can force the script to read the input file as VCF or pileup format. Not used by default |
--output_file [filename] | -o | Output file name. To write to STDOUT, specify STDOUT as the output file name - this will force --quiet mode. Default = "variant_effect_output.txt" |
--force_overwrite | --force | By default, the script will fail with an error if the output file already exists. You can force the overwrite of the existing file by using this flag. Not used by default |
Flag | Alternate | Description |
---|---|---|
--host [hostname] | | Manually define the database host to connect to. Users in the US may find connection and transfer speeds quicker using our East coast mirror, useastdb.ensembl.org. Default = "ensembldb.ensembl.org" |
--user [username] | -u | Manually define the database username. Default = "anonymous" |
--password [password] | --pass | Manually define the database password. Not used by default |
--port [number] | | Manually define the database port. Default = 5306 |
--genomes | | Override the default connection settings with those for the Ensembl Genomes public MySQL server. Required when using any of the Ensembl Genomes species. Not used by default |
--db_version [number] | --db | Force the script to connect to a specific version of the Ensembl databases. Not recommended as there will usually be conflicts between software and database versions. Not used by default |
--registry [filename] | | Defining a registry file overwrites other connection settings and uses those found in the specified registry file to connect. Not used by default |
Flag | Alternate | Description |
---|---|---|
--terms [ensembl\|so\|ncbi] | -t | The type of consequence terms to output. The Ensembl terms are described here. The Sequence Ontology is a joint effort by genome annotation centres to standardise descriptions of biological sequences. The NCBI terms are those used by dbSNP, and are the least complete set - where no NCBI term is available, the script will output the Ensembl term. Default = "ensembl" |
--sift [p\|s\|b] | | Human only. SIFT predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids. The VEP can output the prediction term (p), score (s) or both (b). Using SIFT requires a database connection - while it can be used with --cache, the database will still be accessed to retrieve SIFT data. Not used by default |
--polyphen [p\|s\|b] | --poly | Human only. PolyPhen is a tool which predicts the possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. The VEP can output the prediction term (p), score (s) or both (b). Using PolyPhen requires a database connection - while it can be used with --cache, the database will still be accessed to retrieve PolyPhen data. Not used by default |
--condel [p\|s\|b] | | Human only. Condel computes a weighted average of the scores (WAS) of several computational tools aimed at classifying missense mutations as likely deleterious or likely neutral. The VEP currently presents a Condel WAS from SIFT and PolyPhen. The VEP can output the prediction term (p), score (s) or both (b). Using Condel requires a database connection - while it can be used with --cache, the database will still be accessed to retrieve Condel data. Not used by default |
--regulatory | | Look for overlaps with regulatory regions. The script can also report whether a variant falls in a high information position within a transcription factor binding site. Output lines have a Feature type of RegulatoryFeature or MotifFeature. Using this option requires a database connection - while it can be used with --cache, the database will still be accessed to retrieve regulatory data. Not used by default |
--hgvs | | Add HGVS nomenclature based on Ensembl stable identifiers to the output. Both coding and protein sequence names are added where appropriate. Currently it is not possible to generate HGVS identifiers from the cache; a database connection must be made. Not used by default |
--gene | | Force the gene column to be populated. This is disabled by default unless using --cache. Gene column not populated by default |
--protein | | Add the Ensembl protein identifier to the output where appropriate. Not used by default |
--hgnc | | Add the HGNC gene identifier (where available) to the output. Not used by default |
--most_severe | | Output only the most severe consequence per variation. Transcript-specific columns will be left blank. Not used by default |
--summary | | Output only a comma-separated list of all observed consequences per variation. Transcript-specific columns will be left blank. Not used by default |
Flag | Alternate | Description |
---|---|---|
--check_ref | | Force the script to check the supplied reference allele against the sequence stored in the Ensembl Core database. Lines that do not match are skipped. Not used by default |
--coding_only | | Only return consequences that fall in the coding regions of transcripts. Not used by default |
--check_existing | | Check for the existence of variants that are co-located with your input. By default the alleles are not compared - to do so, use --check_alleles. Not used by default |
--check_alleles | | When checking for existing variants, only report a co-located variant if none of the alleles supplied are novel. For example, if the user input has alleles A/G, and an existing co-located variant has alleles A/C, the co-located variant will not be reported. Strand is also taken into account - in the same example, if the user input has alleles T/G but on the negative strand, then the co-located variant will be reported since its alleles match the reverse complement of the user input. Not used by default |
--chr [list] | | Select a subset of chromosomes to analyse from your file. Any data not on these chromosomes in the input will be skipped. The list can be comma-separated, with "-" characters representing an interval. For example, to include chromosomes 1, 2, 3, 10 and X you could use --chr 1-3,10,X. Not used by default |
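The --check_alleles rule in the table above can be sketched as follows. This is an illustration rather than the script's implementation: `report_colocated` is a hypothetical function, alleles are written as slash-separated strings, and the existing variant's alleles are assumed to be given on the forward strand.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(allele):
    """Reverse-complement a nucleotide string; other characters pass through."""
    return allele.translate(COMPLEMENT)[::-1]


def report_colocated(input_alleles, input_strand, existing_alleles):
    """Return True if none of the supplied alleles is novel.

    input_strand is 1 for the forward strand and -1 for the negative strand;
    negative-strand input is reverse-complemented before comparison.
    """
    supplied = input_alleles.split("/")
    if input_strand < 0:
        supplied = [reverse_complement(allele) for allele in supplied]
    known = set(existing_alleles.split("/"))
    return all(allele in known for allele in supplied)


# The two cases described in the table.
print(report_colocated("A/G", 1, "A/C"))   # False: G is a novel allele
print(report_colocated("T/G", -1, "A/C"))  # True: T/G reverse-complements to A/C
```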
Flag | Alternate | Description |
---|---|---|
--no_whole_genome | | Force the script to run in non-whole-genome mode. This was the original default mode for the VEP script, but has now been superseded by whole-genome mode, which is the default. In non-whole-genome mode, variants are analysed one at a time, with no caching of transcript data. Not used by default |
--cache | | Enables use of the cache. By default the VEP will only read from the cache - use --write_cache to enable writing. Not used by default |
--dir [directory] | | Specify the base cache directory to use. This should be on a filesystem with around 600MB free (for human; other species may vary). Default = "$HOME/.vep/" |
--buffer [number] | | Sets the internal buffer size, corresponding to the number of variations that are read into memory simultaneously. Set this lower to use less memory at the expense of longer run time, and higher to use more memory with a faster run time. Default = 5000 |
--write_cache | | Enable writing to the cache. Not used by default |
--build [all\|list] | | Build a complete cache for the selected species from the database. Either specify a list of chromosomes (see --chr for how to do this), or use --build all to build for all top-level chromosomes. WARNING: Do not use this flag when connected to one of the public databases - please instead download a pre-built cache or build against a local database. Not used by default |
--compress [command] | | By default the VEP uses the utility zcat to decompress cached files. On some systems zcat may not be installed or may misbehave; by specifying one of --compress gzcat or --compress "gzip -dc" you may be able to bypass these problems. Not used by default |
--skip_db_check | | ADVANCED Force the script to use a cache built from a different host than specified with --host. Only use this if you are sure the two hosts are compatible (e.g. ensembldb.ensembl.org can be considered compatible with useastdb.ensembl.org as the data is mirrored between the two). Not used by default |
--cache_region_size [size] | | ADVANCED The size in base-pairs of the region covered by one file in the cache. By default this is 1MB, which produces approximately 500 files maximum per sub-directory in human. Reducing this can reduce memory usage and run time when you use a cache built this way. Note that you must specify the same --cache_region_size both when building or writing to the cache and when reading from it. Not used by default |
perl variant_effect_predictor.pl -o stdout
perl variant_effect_predictor.pl -i variants.txt -regulatory
perl variant_effect_predictor.pl -i variants.vcf.txt -format vcf -hgnc -t so
perl variant_effect_predictor.pl -i variants.txt -o variants_output.txt -force -check_existing -coding_only -hgvs
perl variant_effect_predictor.pl -i variants.txt -r ensembl.registry -sift b -polyphen p -condel s
perl variant_effect_predictor.pl -i variants.txt -genomes -species arabidopsis_thaliana -b 10000
perl variant_effect_predictor.pl -config vep.ini -i variants.txt -q
perl variant_effect_predictor.pl -cache -dir /home/vep/mycache/ -i variants.txt -compress gzcat
The VEP script can use a variety of data sources to retrieve transcript information that is used to predict consequence types. Which one you choose to use should depend on your requirements and available resources.
By default, the script is configured to connect to Ensembl's public MySQL instance at ensembldb.ensembl.org. For users in the US (or for any user geographically closer to the East coast of the USA than to Ensembl's data centre in Cambridge, UK), a mirror server is available at useastdb.ensembl.org. To use the mirror, use the flag --host useastdb.ensembl.org.
Users of Ensembl Genomes species (e.g. plants, fungi, microbes) should use their public MySQL instance; the connection parameters for this can be automatically loaded by using the flag --genomes.
Users with small data sets (100s of variants) should find using the default connection settings adequate. Those with larger data sets, or those who wish to use the script in a batch manner, should consider one of the alternatives below.
It is possible to set up a local MySQL mirror with the databases for your species of interest installed. For instructions on installing a local mirror, see here. You will need a MySQL server that you can connect to from the machine where you will run the script (this can be the same machine). For most of the functionality of the VEP, you will only need the Core database (e.g. homo_sapiens_core_63_37) installed. In order to find co-located variations or to use SIFT, PolyPhen or Condel, it is also necessary to install the relevant variation database (e.g. homo_sapiens_variation_63_37).
To connect to your mirror, you can either set the connection parameters using --host, --port, --user and --password, or use a registry file. Registry files contain all the connection parameters for your database, as well as any species aliases you wish to set up:
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "core",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_core_63_37'
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "variation",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_variation_63_37'
);

Bio::EnsEMBL::Registry->add_alias("Homo_sapiens","human");
For more information on the registry and registry files, see here.
From version 2.1 onwards, the VEP is able to use cached data on disk in place of reading from the database. Using the cache is probably the fastest and most efficient way to use the VEP script, as in most cases no network connections are made and most data is read from local disk. The diagrams below illustrate the model of caching that is used.
[Diagrams not shown: Normal mode, Cache mode, Build mode]
It is possible to use any combination of cache and database; when using the cache, the cache takes precedence, with the database being used when the relevant data is not found in the cache.
Cache files are compressed using the gzip utility. By default zcat is used to decompress the files, although gzcat or gzip itself can be used instead - you must have one of these utilities installed in your path to use the cache.
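Before running with --cache, it can be useful to confirm that one of the decompression utilities mentioned above is actually on your path. The following is a minimal sketch, not part of the VEP: `find_decompressor` is a hypothetical helper that checks the three commands named in this document in order.

```python
import shutil

# Candidate commands named in this document, in order of preference.
CANDIDATES = ["zcat", "gzcat", "gzip -dc"]


def find_decompressor(candidates=CANDIDATES, which=shutil.which):
    """Return the first candidate whose executable is on the PATH, else None."""
    for command in candidates:
        executable = command.split()[0]  # "gzip -dc" -> "gzip"
        if which(executable):
            return command
    return None


command = find_decompressor()
if command is None:
    print("No gzip-compatible utility found; the cache cannot be read")
elif command == "zcat":
    print("zcat is available; the default settings should work")
else:
    print(f'Consider passing --compress "{command}"')
```

Note that this only checks that the executable exists; it does not detect a zcat that is installed but misbehaves.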
The easiest solution is to download a pre-built cache for your species; this eliminates the need to connect to the database while the script is running (except when using certain output options).
Species | Cache file |
---|---|
Human (Homo sapiens) | Download |
Mouse (Mus musculus) | Download |
Rat (Rattus norvegicus) | Download |
Zebrafish (Danio rerio) | Download |
Cow (Bos taurus) | Download |
mv homo_sapiens_vep_62.tar.gz ~/.vep/
cd ~/.vep/
tar xfz homo_sapiens_vep_62.tar.gz
Caches for several species, and indeed different Ensembl releases of the same species, can be stored in the same cache base directory. The files are stored in the following directory hierarchy: $HOME -> .vep -> species -> version -> chromosome
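The directory hierarchy above can be made concrete with a small sketch. `cache_dir` is a hypothetical helper, and the sketch assumes that "version" is the Ensembl release number and that directory names are the plain species, version and chromosome names; check your own cache directory for the exact names used.

```python
from pathlib import Path


def cache_dir(species, version, chromosome, base=None):
    """Build the path base -> species -> version -> chromosome.

    If no base is given, the documented default $HOME/.vep is used.
    """
    root = Path(base) if base is not None else Path.home() / ".vep"
    return root / species / str(version) / str(chromosome)


# Hypothetical lookup for human chromosome 21 under a custom --dir.
print(cache_dir("homo_sapiens", 62, "21", base="/home/vep/mycache"))
```

Because species and version each get their own directory level, several species and several releases can share one base directory without their files colliding.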
It is possible to build your own cache using the VEP script. You should NOT use this command when connected to the public MySQL instances - the process takes a long time, meaning the connection can break unexpectedly and you will be violating Ensembl's reasonable use policy on the public servers. You should either download one of the pre-built caches, or create a local copy of your database of interest to build the cache from.
You may wish to build a full cache if you have a custom Ensembl database with data not found on the public servers, or you may wish to create a minimal cache covering only a certain set of chromosome regions. Cache files are compressed using the gzip utility; this must be installed in your path to write cache files.
To build a cache "on-the-fly", use the --cache and --write_cache flags when you run the VEP with your input. Only cache files overlapping your input variants will be created; the next time you run the script with this cache, the data will be read from the cache instead of the database. Any data not found in the cache will be read from the database (and then written to the cache if --write_cache is enabled). If your data covers a relatively small proportion of your genome of interest (for example, a few genes of interest), it can be OK to use the public MySQL servers when building a partial cache.
perl variant_effect_predictor.pl -cache -dir /my/cache/dir/ -write_cache -i input.txt
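The read order described above (cache first, then database, with optional write-back) can be sketched as follows. This is a conceptual illustration only: `fetch_region` is a hypothetical function, and plain dictionaries stand in for the on-disk cache and the database.

```python
def fetch_region(region, cache, database, write_cache=False):
    """Return (data, source) for a region, preferring the cache."""
    if region in cache:
        return cache[region], "cache"
    data = database[region]  # stands in for a database query
    if write_cache:
        cache[region] = data  # later runs can skip the database
    return data, "database"


# Hypothetical region keys and contents, for illustration only.
database = {"21:1-1000000": "transcripts A", "21:1000001-2000000": "transcripts B"}
cache = {}

print(fetch_region("21:1-1000000", cache, database, write_cache=True)[1])  # database
print(fetch_region("21:1-1000000", cache, database)[1])                    # cache
```

Only regions that are actually requested end up in the cache, which is why an on-the-fly cache covers just the parts of the genome that overlap your input.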
To build a cache from scratch, use the flag --build all, or e.g. --build 1-5,X to build just a subset of chromosomes. You do not need to specify any of the usual input options when building a cache:
perl variant_effect_predictor.pl -host dbhost -user username -pass password -port 3306 -build 21 -dir /my/cache/dir/
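The chromosome list syntax shared by --build and --chr (comma-separated names, with "-" marking an interval) can be illustrated with a short sketch. `expand_chromosomes` is a hypothetical function, not the script's own parser; it assumes intervals are only expanded between whole numbers and does not handle the special value "all".

```python
def expand_chromosomes(spec):
    """Expand a list such as "1-3,10,X" into individual chromosome names."""
    names = []
    for part in spec.split(","):
        start, dash, end = part.partition("-")
        if dash and start.isdigit() and end.isdigit():
            # Numeric interval: include both end points.
            names.extend(str(n) for n in range(int(start), int(end) + 1))
        else:
            names.append(part)
    return names


print(expand_chromosomes("1-3,10,X"))  # ['1', '2', '3', '10', 'X']
print(expand_chromosomes("1-5,X"))     # ['1', '2', '3', '4', '5', 'X']
```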
The cache stores the following information:
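It does not store any information pertaining to, and therefore cannot be used for, the following: SIFT, PolyPhen and Condel predictions (--sift, --polyphen, --condel), regulatory features (--regulatory) and HGVS names (--hgvs).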
Enabling one of these options with --cache will cause the script to warn you in its status output with something like the following:
2011-06-16 16:24:51 - INFO: Database will be accessed for SIFT/PolyPhen/Condel and HGVS
When using the public database servers, the VEP script requests transcript and variation data that overlap the loci in your input file. As such, these coordinates are transmitted over the network to a public server, which may not be suitable for those with sensitive or private data. Users should note that only the coordinates are transmitted to the server; no other information is sent.
By using a full downloaded cache or a local database, it is possible to avoid network connections to public servers entirely, and so keep your data private.