Running SnpEff
We show some basic examples how to use SnpEff.
Basic example: Installing SnpEff
Obviously the first step to use the program is to install it (for details, take a look at the download page. You have to download the core program and then uncompress the ZIP file. In Windows systems, you can just double click and copy the contents of the ZIP file to wherever you want the program installed. If you have a Unix or a Mac system, the command line would be:
# Download using wget
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
# If you prefer to use 'curl' instead of 'wget', you can type:
#     curl -L https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip > snpEff_latest_core.zip
# Install
unzip snpEff_latest_core.zip
Basic example: Annotate using SnpEff
Let's assume you have a VCF file and you want to annotate the variants in that file.
An example file is provided in examples/test.chr22.vcf (this data is from the 1000 Genomes project, so the reference genome is the human genome GRCh37).
You can annotate the file by running the following command (as an input, we use a Variant Call Format (VCF) file available in SnpEff's examples directory).
java -Xmx8g -jar snpEff.jar GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf
# Here is how the output looks like
$ head examples/test.chr22.ann.vcf
##SnpEffVersion="4.1 (build 2015-01-07), by Pablo Cingolani"
##SnpEffCmd="SnpEff  GRCh37.75 examples/test.chr22.vcf "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' ">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' ">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
22  17071756    .   T   C   .   .   ANN=C|3_prime_UTR_variant|MODIFIER|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.*11A>G|||||11|,C|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397A>G|||||4223|
22  17072035    .   C   T   .   .   ANN=T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557||,T|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>A|||||3944|
22  17072258    .   C   A   .   .   ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1183G>T|p.Gly395Cys|1443/2034|1183/1674|395/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>T|||||3721|
22  17072674    .   G   A   .   .   ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.767C>T|p.Pro256Leu|1027/2034|767/1674|256/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397C>T|||||3305|
As you can see, SnpEff added functional annotations in the ANN info field (eigth column in the VCF output file).
Details about the 'ANN' field format can be found in the ANN Field section and in VCF annotation about standard 'ANN' field. Note: Older SnpEff version used 'EFF' field (details about the 'EFF' field format can be found in the EFF Field section).
You can also annotate using the "verbose" mode (command line option -v), this makes SnpEff to show a lot of information which can be useful for debugging.
Here output is edited for brevity:
$ java -Xmx8g -jar snpEff.jar -v GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf
00:00:00.000    Reading configuration file 'snpEff.config'. Genome: 'GRCh37.75'
00:00:00.434    done
00:00:00.434    Reading database for genome version 'GRCh37.75' from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/snpEffectPredictor.bin' (this might take a while)
00:00:00.434    Database not installed
    Attempting to download and install database 'GRCh37.75'
00:00:00.435    Reading configuration file 'snpEff.config'. Genome: 'GRCh37.75'
00:00:00.653    done
00:00:00.654    Downloading database for 'GRCh37.75'
00:00:00.655    Connecting to http://downloads.sourceforge.net/project/snpeff/databases/v4_0/snpEff_v4_0_GRCh37.75.zip
00:00:01.721    Local file name: 'snpEff_v4_0_GRCh37.75.zip'
.............................................
00:01:31.595    Download finished. Total 177705174 bytes.
00:01:31.597    Extracting file 'data/GRCh37.75/motif.bin' to '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/motif.bin'
00:01:31.597    Creating local directory: '/home/pcingola/snpEff_v4_0/./data/GRCh37.75'
00:01:31.652    Extracting file 'data/GRCh37.75/nextProt.bin'
00:01:31.707    Extracting file 'data/GRCh37.75/pwms.bin'
00:01:31.707    Extracting file 'data/GRCh37.75/regulation_CD4.bin'
...
00:01:32.038    Extracting file 'data/GRCh37.75/snpEffectPredictor.bin'
00:01:32.881    Unzip: OK
00:01:32.881    Done
00:01:32.881    Database installed.
00:01:58.779    done
00:01:58.813    Reading NextProt database from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/nextProt.bin'
00:02:01.448    NextProt database: 523361 markers loaded.
00:02:01.448    Adding transcript info to NextProt markers.
00:02:02.180    NextProt database: 706289 markers added.
00:02:02.181    Loading Motifs and PWMs
00:02:02.181        Loading PWMs from : /home/pcingola/snpEff_v4_0/./data/GRCh37.75/pwms.bin
00:02:02.203        Loading Motifs from file '/home/pcingola/snpEff_v4_0/./data/GRCh37.75/motif.bin'
00:02:02.973        Motif database: 284122 markers loaded.
00:02:02.973    Building interval forest
00:02:41.857    done.
00:02:41.858    Genome stats :
#-----------------------------------------------
# Genome name                : 'Homo_sapiens'
# Genome version             : 'GRCh37.75'
# Has protein coding info    : true
# Genes                      : 63677
# Protein coding genes       : 23172
#-----------------------------------------------
# Transcripts                : 215170
# Avg. transcripts per gene  : 3.38
#-----------------------------------------------
# Checked transcripts        :
#               AA sequences : 104254 ( 114.79% )
#              DNA sequences : 179360 ( 83.36% )
#-----------------------------------------------
# Protein coding transcripts : 90818
#              Length errors :  14349 ( 15.80% )
#  STOP codons in CDS errors :     39 ( 0.04% )
#         START codon errors :   8721 ( 9.60% )
#        STOP codon warnings :  21788 ( 23.99% )
#              UTR sequences :  87724 ( 40.77% )
#               Total Errors :  21336 ( 23.49% )
#-----------------------------------------------
# Cds                        : 792087
# Exons                      : 1306656
# Exons with sequence        : 1306656
# Exons without sequence     : 0
# Avg. exons per transcript  : 6.07
# WARNING!                   : Mitochondrion chromosome 'MT' does not have a mitochondrion codon table (codon table = 'Standard'). You should update the config file.
#-----------------------------------------------
# Number of chromosomes      : 297
# Chromosomes names [sizes]  :
#       'HG1292_PATCH' [250051446]
#       'HG1287_PATCH' [249964560]
#       'HG1473_PATCH' [249272860]
#       'HG1471_PATCH' [249269426]
#       'HSCHR1_1_CTG31' [249267852]
#       'HSCHR1_2_CTG31' [249266025]
#       'HSCHR1_3_CTG31' [249262108]
#       'HG999_2_PATCH' [249259300]
#       'HG989_PATCH' [249257867]
#       'HG999_1_PATCH' [249257505]
#       'HG1472_PATCH' [249251918]
#       '1' [249250621]
#       'HG1293_PATCH' [249140837]
#       'HG686_PATCH' [243297375]
#       'HSCHR2_1_CTG12' [243216362]
#       'HSCHR2_2_CTG12' [243205453]
#       'HSCHR2_1_CTG1' [243205406]
#       'HG953_PATCH' [243199374]
#       '2' [243199373]
.....
.....
#-----------------------------------------------
00:02:59.416    Predicting variants
WARNINGS: Some warning were detected
Warning type    Number of warnings
WARNING_TRANSCRIPT_INCOMPLETE   8215
WARNING_TRANSCRIPT_NO_START_CODON   3483
00:03:04.327    Creating summary file: snpEff_summary.html
00:03:04.891    Creating genes file: snpEff_genes.txt
00:03:17.334    done.
00:03:17.336    Logging
00:03:18.337    Checking for updates...
Notice how SnpEff automatically downloads and installs the database. Next time SnpEff will use the local version, so the installation step is only done once.
The annotated variants will be in the new file "test.chr22.ann.vcf".
Warning
SnpEff creates a file called "snpEff_summary.html" showing basic statistics about the analyzed variants. Take a quick look at it.
Info
We used the java parameter -Xmx8g to increase the memory available to the Java Virtual Machine to 4G. SnpEff's human genome database is large and it has to be loaded into memory. If your computer doesn't have at least 4G of memory, you probably won't be able to run this example.
Info
If you are running SnpEff from a directory different than the one it was installed, you will have to specify where the config file is. This is done using the '-c' command line option:
java -Xmx8g -jar snpEff.jar -c path/to/snpEff/snpEff.config -v GRCh37.75 test.chr22.vcf > test.chr22.ann.vcf
Detailed examples
Take a look at several detailed examples in our examples page.
Specify a configuration file
Sometimes you need to specify the path to the config file. For instance, when you run SnpEff from a different directory than your install directory, you have to specify where the config file is located using the '-c' command line option.
java -Xmx8g path/to/snpEff/snpEff.jar -c path/to/snpEff/snpEff.config GRCh37.75 path/to/snps.vcf
Info
Since version 4.1B, you can use the -configOption command line option to override any value in the config file
Java memory options
By default the amount of memory set by a java process is set too low. If you don't assign more memory to the process, you will most likely have an "OutOfMemory" error.
You should set the amount of memory in your java virtual machine to, at least, 2 Gb.
This can be easily done using the Java command line option -Xmx.
E.g. In this example I use 4Gb:
# Run using 4 Gb of memory
java -Xmx8g snpEff.jar hg19 path/to/your/files/snps.vcf
Note: There is no space between -Xmx and 4G.
Running SnpEff in the Cloud
You can run SnpEff in a "the Cloud" exactly the same way as running it on your local computer. You should not have any problems at all.
Here is an example of installing it and running it on an Amazon EC2 instance (virtual machine):
$ ssh -i ./aws_amazon/my_secret_key.pem ec2-user@ec2-54-234-14-244.compute-1.amazonaws.com
       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|
[ec2-user@ip-10-2-202-163 ~]$ wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
[ec2-user@ip-10-2-202-163 ~]$ unzip snpEff_latest_core.zip
[ec2-user@ip-10-2-202-163 ~]$ cd snpEff/
[ec2-user@ip-10-2-202-163 snpEff]$ java -jar snpEff.jar download -v hg19
00:00:00.000    Downloading database for 'hg19'
...
00:00:36.340    Done
[ec2-user@ip-10-2-202-163 snpEff]$ java -Xmx8g -jar snpEff.jar dump -v hg19 > /dev/null
00:00:00.000    Reading database for genome 'hg19' (this might take a while)
00:00:20.688    done
00:00:20.688    Building interval forest
00:00:33.110    Done.
Loading the database
One of the first things SnpEff has to do is to load the database. Usually it takes from a few seconds to a couple of minutes, depending on database size. Complex databases, like human, require more time to load. After the database is loaded, SnpEff can analyze thousands of variants per second.
Command line vs Web interface
In order to run SnpEff you need to be comfortable running command from a command line terminal. If you are not, then it is probably a good idea to ask you systems administrator to install a Galaxy server and use the web interface. You can also use the open Galaxy server, but functionality may be limited and SnpEff versions may not be updated frequently.