Variant Calling Workflow
Overview
Teaching: 35 min
Exercises: 25 minQuestions
How do I find sequence variants between my samples and a reference genome?
Objectives
Describe the steps involved in variant calling
Describe the types of data formats encountered during variant calling
Use command line tools to perform variant calling
Alignment to a Reference Genome
We have already trimmed our reads so now the next step is alignment of our quality reads to the reference genome.
We perform read alignment or mapping to determine where in the genome our reads originated from. There are a number of tools to choose from and, while there is no gold standard, there are some tools that are better suited for particular NGS analyses. We will be using the Burrows-Wheeler Aligner (BWA), which is a software package for mapping low-divergent sequences against a large reference genome. The alignment process consists of two steps:
- Indexing the reference genome
- Aligning the reads to the reference genome
Setting up
First we will copy the reference genome data into our data/
directory, as well as a
set of trimmed FASTQ files to work with in this lesson, and a set of untrimmed files
for the next one.
$ cd /work/group/username/dc_workshop
$ cp -r /common/demo/dc/.dc_sampledata_lite/ref_genome data/
$ cp -r /common/demo/dc/.dc_sampledata_lite/trimmed_fastq_small data/
You will also need to create directories for the results that will be generated as
part of this workflow. We can do this in a single line of code because mkdir
can
accept multiple new directory names as input.
$ mkdir results/sai results/sam results/bam results/bcf results/vcf
Loading modules
It’s worth noting here that all of the software we are using for this workshop has been pre-installed on Crane. You’ll need to load the software modules first. Today we are working on the interactive node (that you called using
srun
), but you could also submit jobs usingsbatch
.
Index the reference genome
Our first step is to index the reference genome for use by BWA. This helps speed up our alignment.
To Index or Not To Index
Indexing the reference only has to be run once. The only reason you would want to create a new index is if you are working with a different reference genome or you are using a different tool for alignment.
$ module load bwa
$ bwa index data/ref_genome/ecoli_rel606.fasta
While the index is created, you will see output something like this:
[bwa_index] Pack FASTA... 0.05 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 3.21 seconds elapse.
[bwa_index] Update BWT... 0.05 sec
[bwa_index] Pack forward-only FASTA... 0.03 sec
[bwa_index] Construct SA from BWT and Occ... 1.11 sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa index data/ref_genome/ecoli_rel606.fasta
[main] Real time: 5.031 sec; CPU: 4.464 sec
Note here that the version of bwa
is included in its output: [main] Version: 0.7.12-r1039
Align reads to reference genome
The alignment process consists of choosing an appropriate reference genome to map our reads against and then deciding on an aligner. BWA consists of three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for sequences ranging from 70bp to 1Mbp. BWA-SW and BWA-MEM share similar features such as long-read support and split alignment, however BWA-MEM (the latest algorithm) is generally recommended for high-quality queries as it is faster and more accurate.
Since we are working with short reads we will be using BWA-backtrack. The general usage for BWA-backtrack is:
$ bwa aln ref_genome.fasta input_file.fastq > output_file.sai
This will create a .sai
file which is an intermediate file containing the suffix
array indexes.
Have a look at the BWA options page. While we are running BWA with the default parameters here, your use case might require a change of parameters.
Read the Manual
Always read the manual page for any tool before using and make sure the options you use are appropriate for your data.
We’re going to start by aligning the reads from just one of the samples in our dataset
(SRR097977.fastq
). Later, we’ll be iterating this whole process on all of our sample
files.
$ bwa aln \
data/ref_genome/ecoli_rel606.fasta \
data/trimmed_fastq_small/SRR097977.fastq_trim.fastq \
> results/sai/SRR097977.aligned.sai
You will see output like this:
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 8.88 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.45 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 406070 sequences have been processed.
[main] Version: 0.7.12-r1039
[main] CMD: bwa aln data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR097977.fastq_trim.fastq
[main] Real time: 15.120 sec; CPU: 14.873 sec
Alignment cleanup
Post-alignment processing of the alignment file includes:
- Converting output SAI alignment file to a BAM file
- Sorting the BAM file by coordinate
Convert the format of the alignment to SAM/BAM
The SAI file is not a standard alignment output file and will need to be converted into a SAM file before we can do any downstream processing.
SAM/BAM format
The SAM file, is a tab-delimited text file that contains information for each individual read and its alignment to the genome. While we do not have time to go in detail of the features of the SAM format, the paper by Heng Li et al. provides a lot more detail on the specification.
The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access of the data contained within the file.
The file begins with an optional header. The header is used to describe source of data, reference sequence, method of alignment, etc., this will change depending on the aligner being used. Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner specific information. An example entry from a SAM file is displayed below with the different fields highlighted.
Better viewing of tab-delimited files
Tab separated files can sometimes be hard to view in the terminal. You may find that the headers don’t line up with columns and that long lines wrap to the next line. Try using
less -S -x10 filename
to view the file.-S
causes less to not wrap long lines, use the left and right arrow keys to see more of the line.-x10
expands how many spaces each tab displays as to 10 character which will help the columns to line up. You can try larger values for x to make the columns wider.
First we will use the bwa samse
command to convert the .sai file to SAM format.
The usage for bwa samse
is
$ bwa samse ref_genome.fasta input_file.sai input_file.fastq > output_file.sam
The code in our case will look like:
$ bwa samse data/ref_genome/ecoli_rel606.fasta \
results/sai/SRR097977.aligned.sai \
data/trimmed_fastq_small/SRR097977.fastq_trim.fastq \
> results/sam/SRR097977.aligned.sam
Your output will start out something like this:
[bwa_aln_core] convert to sequence coordinate... 1.44 sec
[bwa_aln_core] refine gapped alignments... 0.15 sec
[bwa_aln_core] print alignments... 0.55 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] convert to sequence coordinate... 0.83 sec
[bwa_aln_core] refine gapped alignments... 0.07 sec
[bwa_aln_core] print alignments... 0.30 sec
[bwa_aln_core] 406070 sequences have been processed.
[main] Version: 0.7.12-r1039
[main] CMD: bwa samse data/ref_genome/ecoli_rel606.fasta results/sai/SRR097977.aligned.sai data/trimmed_fastq_small/SRR097977.fastq_trim.fastq
[main] Real time: 4.074 sec; CPU: 3.887 sec
Next we convert the SAM file to BAM format for use by downstream tools. We use the
samtools
program with the view
command and tell this command that the input is in
SAM format (-S
) and to output BAM format (-b
):
$ module load samtools
$ samtools view -S -b results/sam/SRR097977.aligned.sam > results/bam/SRR097977.aligned.bam
Sort BAM file by coordinates
Next we sort the BAM file using the sort
command from samtools
.
$ samtools sort results/bam/SRR097977.aligned.bam > results/bam/SRR097977.aligned.sorted.bam
More Than One Way to . . . sort a SAM/BAM File
SAM/BAM files can be sorted in multiple ways, e.g. by location of alignment on the chromosome, by read name, etc. It is important to be aware that different alignment tools will output differently sorted SAM/BAM, and different downstream tools require differently sorted alignment files as input.
Variant calling
A variant call is a conclusion that there is a nucleotide difference vs. some
reference at a given position in an individual genome or transcriptome, often referred
to as a Single Nucleotide Polymorphism (SNP). The call is usually accompanied by an
estimate of variant frequency and some measure of confidence. Similar to other steps
in this workflow, there are number of tools available for variant calling. In this
workshop we will be using bcftools
, but there are a few things we need to do before
actually calling the variants.
Step 1: Calculate the read coverage of positions in the genome
Do the first pass on variant calling by counting read coverage with samtools mpileup:
$ samtools mpileup -g \
-f data/ref_genome/ecoli_rel606.fasta \
results/bam/SRR097977.aligned.sorted.bam \
> results/bcf/SRR097977_raw.bcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
We have now generated a file with coverage information for every base. To identify variants, we now will use a different tool from the samtools suite called bcftools.
Step 2: Detect the single nucleotide polymorphisms (SNPs)
Identify SNPs using bcftools:
$ module load bcftools
$ bcftools call -vm \
-O b \
results/bcf/SRR097977_raw.bcf \
> results/bcf/SRR097977_variants.bcf
Step 3: Filter and report the SNP variants in variant calling format (VCF)
Filter the SNPs for the final output in VCF format, using vcfutils.pl
:
$ bcftools view \
results/bcf/SRR097977_variants.bcf \
| vcfutils.pl varFilter - \
> results/vcf/SRR097977_final_variants.vcf
bcftools view
converts the binary format of bcf files into human readable format
(tab-delimited) for vcfutils.pl
to perform the filtering. Note that the output is in
VCF format, which is a text format.
Explore the VCF format
$ less results/vcf/SRR097977_final_variants.vcf
You will see the header (which describes the format), the time and date the file was created, the version of bcftools that was used, the command line parameters used, and some additional information:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.3.1+htslib-1.3.1
##samtoolsCommand=samtools mpileup -g -f data/ref_genome/ecoli_rel606.fasta results/bam/SRR097977.aligned.sorted.bam
##reference=file://data/ref_genome/ecoli_rel606.fasta
##contig=<ID=NC_012967.1,length=4629812>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
.
.
.
.
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and a
lt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.2-4-g1fedb8b-dirty+htslib-1.2.1-7-ga233385
##bcftools_callCommand=call -vm -O b results/bcf/SRR097977_raw.bcf
##bcftools_viewVersion=1.2-4-g1fedb8b-dirty+htslib-1.2.1-7-ga233385
##bcftools_viewCommand=view results/bcf/SRR097977_variants.bcf
Followed by information on each of the variations observed:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT results/bam/SRR097977.aligned.sorted.bam
NC_012967.1 9972 . T G 68 . DP=3;VDB=0.220755;SGB=-0.511536;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,1,2;MQ=37 GT:PL 1/1:96,9,0
NC_012967.1 10563 . G A 26.3447 . DP=2;VDB=0.36;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,0,2;MQ=37 GT:PL 1/1:54,6,0
NC_012967.1 81158 . A C 151 . DP=8;VDB=0.250641;SGB=-0.636426;MQSB=1.01283;MQ0F=0;AC=2;AN=2;DP4=0,0,4,3;MQ=37 GT:PL 1/1:178,21,0
NC_012967.1 139812 . G T 8.07754 . DP=4;VDB=0.56;SGB=-0.453602;RPB=1;MQB=1;MQSB=1;BQB=0.25;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,2,1,1;MQ=37 GT:PL 0/1:40,0,49
NC_012967.1 216480 . C T 65 . DP=4;VDB=0.235765;SGB=-0.511536;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,1,2;MQ=37 GT:PL 1/1:92,9,0
NC_012967.1 247796 . T C 60 . DP=3;VDB=0.102722;SGB=-0.511536;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,1,2;MQ=33 GT:PL 1/1:88,9,0
NC_012967.1 281923 . G T 40.3352 . DP=2;VDB=0.06;SGB=-0.453602;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,1,1;MQ=37 GT:PL 1/1:68,6,0
NC_012967.1 553093 . C T 15.4599 . DP=2;VDB=0.88;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=37 GT:PL 1/1:43,6,0
This is a lot of information, so let’s take some time to make sure we understand our output.
The first few columns represent the information we have about a predicted variation.
Column | Info |
---|---|
CHROM | Contig location where the variation occurs |
POS | Position within the contig where the variation occurs |
ID | A . until we add annotation information |
REF | Reference genotype (forward strand) |
ALT | Sample genotype (forward strand) |
QUAL | Phred-scaled probability that the observed variant exists at this site (higher is better) |
FILTER | A . if no quality filters have been applied, PASS if a filter is passed, or the name of the filters this variant failed |
In an ideal world, the information in the QUAL
column would be all we needed to
filter out bad variant calls. However, in reality we need to filter on multiple other
metrics.
The last two columns contain the genotypes and can be tricky to decode.
Column | Info |
---|---|
FORMAT | Lists in order the metrics presented in the final column |
results | Lists the values associated with those metrics in order |
For our file, the metrics presented are DP:VDB:SGB:MQSB:MQOF:AC:AN:DP4:MQ:GT:PL
.
Metric | Definition |
---|---|
GT | Genotype of this sample which for a diploid genome is encoded with a 0 for the REF allele, 1 for the first ALT allele, 2 for the second and so on. So 0/0 means homozygous reference, 0/1 is heterozygous, and 1/1 is homozygous for the alternate allele. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc |
PL | Phred-scaled likelihoods of the given genotypes |
GQ | Genotype quality Phred-scaled confidence for the genotype |
AD | Allele depth, i.e. the unfiltered number of reads that support each of the reported alleles |
DP | (Allele) Depth, i.e. the filtered number of reads that support each of the reported alleles |
The Broad Institute’s VCF guide is an excellent place to learn more about VCF file format.
Exercise
Use the
grep
,cut
, andless
commands you’ve learned to extract thePOS
andQUAL
columns from your output file (without the header lines). What is the position of the first variant to be called with aQUAL
value of less than 4?Solution
$ cut results/vcf/SRR097977_final_variants.vcf -f 6,2 | grep -v "##" | less
POS QUAL 9972 68 10563 26.3447 81158 151 139812 8.07754 216480 65 247796 60 281923 40.3352 . . . 788403 4.97012 911613 3.88886 . . .
Position 911613 has a score of 3.88886.
Assess the alignment (visualization) - optional step
It is often instructive to look at your data in a genome browser. Visualization will allow you to get a “feel” for the data, as well as detecting abnormalities and problems. Also, exploring the data in such a way may give you ideas for further analyses. As such, visualization tools are useful for exploratory analysis. In this lesson we will describe two different tools for visualization; a light-weight command-line based one and the Broad Institute’s Integrative Genomics Viewer (IGV) which requires software installation and transfer of files.
In order for us to visualize the alignment files, we will need to index the BAM file
using samtools
:
$ samtools index results/bam/SRR097977.aligned.sorted.bam
Viewing with tview
Samtools
implements a very simple text alignment viewer based on the GNU ncurses
library,
called tview
. This alignment viewer works with short indels and shows
Maq consensus.
It uses different colors to display mapping quality or base quality, subjected to
users’ choice. Samtools viewer is known to work with an 130 GB alignment swiftly. Due
to its text interface, displaying alignments over network is also very fast.
In order to visualize our mapped reads we use tview
, giving it the sorted bam file
and the reference file:
$ samtools tview \
results/bam/SRR097977.aligned.sorted.bam \
data/ref_genome/ecoli_rel606.fasta
1 11 21 31 41 51 61 71 81 91 101 111 121
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATAC
..................................................................................................................................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ..................N................. ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,........................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ..................N................. ,,,,,,,,,,,,,,,,,,,,,,,,,,,.............................
...................................,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .................................... ................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.................................... .................................... ,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .................................... ,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .......
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ............................. ,,,,,,,,,,,,,,,,,g,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ...........................T....... ,,,,,,,,,,,,,,,,,,,,,,,c, ......
......................... ................................ ,g,,,,,,,,,,,,,,,,,,, ...........................
,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,, ..........................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ................................T.. .............................. ,,,,,,
........................... ,,,,,,g,,,,,,,,,,,,,,,,, .................................... ,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,, .................................... ................................... ....
.................................... ........................ ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ....
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
........................ .................................. ............................. ....
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .................................... ..........................
............................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ....................................
................................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ..................................
.................................... ,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ............................ ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
The first line of output shows the genome coordinates in our reference genome. The
second line shows the reference genome sequence. The third lines shows the consensus
sequence determined from the sequence reads. A .
indicates a match to the reference
sequence, so we can see that the consensus from our sample matches the reference in
most locations. That is good! If that wasn’t the case, we should probably reconsider
our choice of reference.
Below the horizontal line, we can see all of the reads in our sample aligned with the
reference genome. Only positions where the called base differs from the reference are
shown. You can use the arrow keys on your keyboard to scroll or type ?
for a help
menu. Type Ctrl^C
to exit tview
.
Viewing with IGV
IGV is a stand-alone browser, which has the advantage of being installed locally and providing fast access. Web-based genome browsers, like Ensembl or the UCSC browser, are slower, but provide more functionality. They not only allow for more polished and flexible visualization, but also provide easy access to a wealth of annotations and external data sources. This makes it straightforward to relate your data with information about repeat regions, known genes, epigenetic features or areas of cross-species conservation, to name just a few.
In order to use IGV, we will need to transfer some files to our local machine. We know
how to do this with scp
. Open a new tab in your terminal window and create a new
folder. We’ll put this folder on our Desktop for demonstration purposes, but in
general you should avoid proliferating folders and files on your Desktop and instead
organize files within a directory structure like we’ve been using in our
dc_workshop
directory.
$ mkdir ~/Desktop/files_for_igv
$ cd ~/Desktop/files_for_igv
Now we will transfer our files to that new directory. The commands to scp
always go
in the terminal window that is connected to your local computer (not to Crane).
On Linux/Mac systems:
$ scp username@crane.unl.edu:/work/group/username/dc_workshop/results/bam/SRR097977.aligned.sorted.bam ~/Desktop/files_for_igv
$ scp username@crane.unl.edu:/work/group/username/dc_workshop/results/bam/SRR097977.aligned.sorted.bam.bai ~/Desktop/files_for_igv
$ scp username@crane.unl.edu:/work/group/username/dc_workshop/data/ref_genome/ecoli_rel606.fasta ~/Desktop/files_for_igv
$ scp username@crane.unl.edu:/work/group/username/dc_workshop/results/vcf/SRR097977_final_variants.vcf ~/Desktop/files_for_igv
On Windows:
$ pscp username@crane.unl.edu:/work/group/username/dc_workshop/results/bam/SRR097977.aligned.sorted.bam %HOMEPATH%\Desktop\files_for_igv
$ pscp username@crane.unl.edu:/work/group/username/dc_workshop/results/bam/SRR097977.aligned.sorted.bam.bai %HOMEPATH%\Desktop\files_for_igv
$ pscp username@crane.unl.edu:/work/group/username/dc_workshop/data/ref_genome/ecoli_rel606.fasta %HOMEPATH%\Desktop\files_for_igv
$ pscp username@crane.unl.edu:/work/group/username/dc_workshop/results/vcf/SRR097977_final_variants.vcf %HOMEPATH%\Desktop\files_for_igv
You will need to type your Crane password each time you call scp
.
Alternatively, you can copy everything using one command, and type the password only once. Each path/file is separated by a space and all files to be copied are delimited by a a single quote (‘):
On Linux/Mac systems:
$ scp username@crane.unl.edu:'\
/work/group/username/dc_workshop/results/bam/SRR097977.aligned.sorted.bam \
/work/group/username/dc_workshop/results/bam/SRR097977.aligned.sorted.bam.bai \
/work/group/username/dc_workshop/data/ref_genome/ecoli_rel606.fasta \
/work/group/username/dc_workshop/results/vcf/SRR097977_final_variants.vcf' \
~/Desktop/files_for_igv/
On Windows systems:
$ pscp username@crane.unl.edu:'\
/work/group/username/dc_workshop/results/bam/SRR097977.aligned.sorted.bam \
/work/group/username/dc_workshop/results/bam/SRR097977.aligned.sorted.bam.bai \
/work/group/username/dc_workshop/data/ref_genome/ecoli_rel606.fasta \
/work/group/username/dc_workshop/results/vcf/SRR097977_final_variants.vcf' \
%HOMEPATH%\Desktop\files_for_igv
Next we need to open the IGV software. If you haven’t done so already, you can
download IGV from the
Broad Institute’s page,
double-click the .zip
file to unzip it.
- Open IGV.
- Load our reference genome file (
ecoli_rel606.fasta
) into IGV using Genomes->Load Genomes from File… option. - Load our BAM file (
SRR097977.aligned.sorted.bam
) using the File->Load from File… option. - Load our VCF file (
SRR097977_final_variants.vcf
) using the File->Load from File… option.
Your IGV browser should look like the screenshot below:
There should be two tracks: one corresponding to our BAM file and the other for our VCF file.
In the VCF track, each bar across the top of the plot shows the allele fraction for a single locus. The second bar shows the genotypes for each locus in each sample. We only have one sample called here so we only see a single line. Dark blue = heterozygous, Cyan = homozygous variant, Grey = reference. Filtered entries are transparent.
Zoom in to inspect variants you see in your filtered VCF file to become more familiar with IGV. See how quality information corresponds to alignment information at those loci. Use this website and the links therein to understand how IGV colors the alignments.
Exercise
Try looking at the two closest variants found in SRR097977 at positions 2039200 and 2039206? These two variants are low quality, both with a score of 7.24351.
Solution
Zoom into the proper range by setting the view manually via the search bar found next to the genome and query sequence selection dropdowns.
Now that we’ve run through our workflow for a single sample, we want to repeat this workflow for our other five samples. However, we don’t want to type each of these individual steps again five more times. That would be very time consuming and error-prone, and would become impossible as we gathered more and more samples. Luckily, we already know the tools we need to use to automate this workflow and run it on as many files as we want using a single line of code. Those tools are: wildcards, for loops, and bash scripts. We’ll use all three in the next lesson.
Key Points
Bioinformatics command line tools are collections of commands that can be used to carry out bioinformatics analyses
To use the most powerful bioinformatics tools, you’ll need to use the command line
There are many different file formats for storing genomics data. It’s important to understand these file formats and know how to convert between them