Trimming and Filtering
Overview
Teaching: 30 min
Exercises: 25 minQuestions
How can I get rid of sequence data that doesn’t meet my quality standards?
Objectives
Clean FASTQ reads using Trimmommatic
Select and set multiple options for command line bioinformatics tools
Write
for
loops with two variables
Cleaning Reads
In the previous lesson, we took a high-level look at the quality of each of our samples using FastQC. We visualized per-base quality graphs showing the distribution of read quality at each base across all reads in a sample and extracted information about which samples fail which quality checks. We know that all of our samples failed at least one of the quality metrics used by FastQC. This doesn’t mean, though, that our samples should be thrown out! It’s very common to have some reads within a sample, or some positions (especially near the beginning or end of reads) across all reads that are low quality and should be discarded. We will use a program called Trimmomatic to filter poor quality reads and trim poor quality bases from our samples.
Trimmomatic Options
To use Trimmomatic on Crane, load the trimmomatic
module with:
$ module load trimmomatic
On Crane, we start Trimmomatic by calling java
along with the path to the program
(which is made active by loading the module). To run Trimmomatic we would use,
java -jar $TM_HOME/trimmomatic.jar
however, to alleviate typing that each time,
let’s create an
alias
for it by running:
alias runtrimmomatic="java -jar $TM_HOME/trimmomatic.jar"
(Note that there should be no spaces around the equal sign.)
You can type alias runtrimmomatic
to see what the alias does and you can type
module help trimmomatic
for more details on the tool itself.
$ runtrimmomatic
Note: runtrimmomatic
is a command you’ll only see if you create the alias
(aliases are convenient, but be sure to add the alias to your project docs).
On other systems, Trimmomatic is called with java -jar $TM_HOME/trimmomatic.jar
.
Trimmomatic is a program written in the Java programming language.
You don’t need to learn Java to use Trimmomatic (FastQC is also
written in Java), but the fact that it’s a Java program helps
explain the syntax that is used to run Trimmomatic. java
tells our
computer that we’re running a Java program. -jar
is an option specifying that we’re going to specify the location of
the Java program we want to run. The Java program itself will have
a .jar
file extension.
That’s just the basic command, however. Trimmomatic has a variety of options and parameters. We will need to specify what options we want to use for our analysis. Here are some of the options:
Option | Meaning |
---|---|
-threads |
The number of processors you want Trimmomatic to use |
SE /PE |
Whether your reads are single or paired-end |
-phred33 /-phred64 |
The encoding system for your quality scores |
In addition to these options, there are various trimming steps available:
Step | Meaning |
---|---|
SLIDINGWINDOW |
Perform sliding window trimming, cutting once the average quality within the window falls below a threshold |
LEADING |
Cut bases off the start of a read, if below a threshold quality |
TRAILING |
Cut bases off the end of a read, if below a threshold quality |
CROP |
Cut the read to a specified length |
HEADCROP |
Cut the specified number of bases from the start of the read |
MINLEN |
Drop an entire read if it is below a specified length |
TOPHRED33 |
Convert quality scores to Phred-33 |
TOPHRED64 |
Convert quality scores to Phred-64 |
We will use only a few of these options and trimming steps in our analysis. It is important to understand the steps you are using to clean your data. For more information about the Trimmomatic arguments and options, see the Trimmomatic manual.
Software Versions
One of the biggest causes for irreproducibility in Bioinformatics is forgetting/neglecting to document the version of a program used. In this case, we are using Trimmomatic v0.33. Be sure to note what version of a tool you are using and do not use multiple versions of a tool in one project if possible.
Keen observers will see that the Trimmomatic manual linked above is for Trimmomatic v0.32; this is the latest manual and should apply to all version of Trimmomatic v0.32 and above.
Versions are usually written in some form of
major.minor.development
designation. In this case, we are working with “major version 0, minor version 33, and no development version number”.By convention, major version updates are not compatible, but minor and development versions are safe to update. For example, moving from v0.32 to v0.33 is safe, but if tomorrow Trimmomatic v1.0 was released, moving from v0.33 to v1.0 would likely break our workflows.
We said above that a basic command for Trimmomatic looks like this:
$ runtrimmomatic SE
However, an example complete command for Trimmomatic will look something like this:
$ runtrimmomatic SE \
-threads 4 \
-phred64 \
SRR_1056.fastq \
SRR_1056_trimmed.fastq \
ILLUMINACLIP:SRR_adapters.fa \
SLIDINGWINDOW:4:20
On Crane it’s essential to specify -threads
or Trimmomatic will use more threads on the compute node than you requested.
In the above example, we’ve told Trimmomatic:
Code | Meaning |
---|---|
SE |
Single end file as input |
-threads 4 |
Use four computing threads to run (this will speed up our run) |
-phred64 |
Input uses phred-64 encoding for quality scores |
SRR_1056.fastq |
Input file name |
SRR_1056_trimmed.fastq |
Output file name |
ILLUMINACLIP:SRR_adapters.fa |
Clip the Illumina adapters from the input file using the adapter sequences listed in SRR_adapters.fa |
SLIDINGWINDOW:4:20 |
Use a sliding window of size 4 that will remove bases if their phred score is below 20 |
Running Trimmomatic
Now we will run Trimmomatic on our data. To begin, navigate to your untrimmed_fastq
data directory:
$ cd /work/group/username/dc_workshop/data/untrimmed_fastq
We are going to run Trimmomatic on one of our single-end samples. We will use a sliding window of size 4 that will remove bases if their phred score is below 20 (like in our example above). We will also discard any reads that do not have at least 20 bases remaining after this trimming step.
$ runtrimmomatic SE \
-threads 4 \
SRR098283.fastq \
SRR098283.fastq_trim.fastq \
SLIDINGWINDOW:4:20 \
MINLEN:20
TrimmomaticSE: Started with arguments: -threads 4 SRR098283.fastq SRR098283.fastq_trim.fastq SLIDINGWINDOW:4:20 MINLEN:20
Quality encoding detected as phred33
Input Reads: 21564058 Surviving: 17030985 (78.98%) Dropped: 4533073 (21.02%)
TrimmomaticSE: Completed successfully
Exercise
Use the output from your Trimmomatic command to answer the following questions.
- What percent of reads did we discard from our sample?
- What percent of reads did we keep?
Solution
- 21.02%
- 78.98%
You may have noticed that Trimmomatic automatically detected the quality encoding of
our sample. It is always a good idea to double-check this or to enter the quality
encoding manually (-phred33
/-phred64
).
We can confirm that we have our output file:
$ ls SRR098283*
SRR098283.fastq SRR098283.fastq_trim.fastq
The output file is also a FASTQ file. It should be smaller than our input file because we’ve removed reads. We can confirm this:
$ ls SRR098283* -l -h
-r-------- 1 username group 3.9G Dec 13 14:30 SRR098283.fastq
-rw-r--r-- 1 username group 3.0G Dec 14 14:37 SRR098283.fastq_trim.fastq
We’ve just successfully run Trimmomatic on one of our FASTQ files!
However, there is some bad news. Trimmomatic can only operate on
one sample at a time and we have more than one sample. The good news
is that we can use a for
loop to iterate through our sample files
quickly!
$ for infile in *.fastq
> do
> outfile="${infile}"_trim.fastq
> runtrimmomatic SE -threads 4 "${infile}" "${outfile}" SLIDINGWINDOW:4:20 MINLEN:20
> done
The new part in our for
loop is the line:
> outfile="${infile}"_trim.fastq
infile
is the first variable in our loop and takes the value of each of the FASTQ
files in our directory. outfile
is the second variable in our loop and is defined by
adding _trim.fastq
to the end of our input file name. Use {}
to wrap the variable
so that _trim.fastq
will not be interpreted as part of the variable name. In
addition, quoting the shell variables is a good practice AND necessary if your
variables have spaces in them. For more, check
Bash Pitfalls.
There are no spaces before or after the =
.
Go ahead and run the for loop. It should take a few minutes for Trimmomatic to run for each of our six input files. Once it’s done running, take a look at your directory contents.
$ ls
SRR097977.fastq SRR098027.fastq_trim.fastq
SRR098283.fastq SRR097977.fastq_trim.fastq
SRR098028.fastq SRR098283.fastq_trim.fastq
SRR098026.fastq SRR098028.fastq_trim.fastq
SRR098283.fastq_trim.fastq_trim.fastq SRR098026.fastq_trim.fastq
SRR098281.fastq SRR098027.fastq
SRR098281.fastq_trim.fastq
If you look very closely, you’ll see that you have three files for the SRR098283
sample. This is because we already had the SRR098283.fastq_trim.fastq
file in our
directory when we started our for
loop (because we had run Trimmomatic on just that
one file already). Our for
loop included this file in our list of .fastq
files and
created a new output file named SRR098283.fastq_trim.fastq_trim.fastq
, which is the
result of running Trimmomatic on our already trimmed file.
SRR098283.fastq_trim.fastq
and SRR098283.fastq_trim.fastq_trim.fastq
should be
identical. If you look at your Trimmomatic output in the terminal window, you will see:
TrimmomaticSE: Started with arguments: -threads 1 SRR098283.fastq_trim.fastq SRR098283.fastq_trim.fastq_trim.fastq SLIDINGWINDOW:4:20 MINLEN:20
Quality encoding detected as phred33
Input Reads: 17030985 Surviving: 17030985 (100.00%) Dropped: 0 (0.00%)
TrimmomaticSE: Completed successfully
This shows that when we re-trimmed our trimmed file, no new reads were dropped. This is a good thing!
Exercise
Earlier we looked at the first read in our
SRR098026.fastq
file and saw that it was very poor quality.$ head -n4 SRR098026.fastq
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35 NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN +SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35 !!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
After filtering out bad reads, what is the first remaining read for this sample? What does its quality look like?
Solution
$ head -n4 SRR098026.fastq_trim.fastq
@SRR098026.342 HWUSI-EAS1599_1:2:1:3:655 length=35 GGATNGGCCTTGTATTTATGATTCTCNGAGTCTGT +SRR098026.342 HWUSI-EAS1599_1:2:1:3:655 length=35 BB@B!B@AACBBABCCCCBBBBBB@@!B?B<ABB@
Comparing this with our quality scale:
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI | | | | | Quality score: 0........10........20........30........40
We can see that the scores are mostly in the 30+ range. This is pretty good.
We’ve now completed the trimming and filtering steps of our quality
control process! Before we move on, let’s move our trimmed FASTQ files
to a new subdirectory within our data/
directory. We can also remove
our extra, double-trimmed file for the SRR098283
sample.
$ cd /work/group/username/dc_workshop/data/untrimmed_fastq
$ mkdir ../trimmed_fastq
$ rm SRR098283.fastq_trim.fastq_trim.fastq
$ mv *_trim* ../trimmed_fastq
$ cd ../trimmed_fastq
$ ls
SRR097977.fastq_trim.fastq SRR098028.fastq_trim.fastq
SRR098026.fastq_trim.fastq SRR098281.fastq_trim.fastq
SRR098027.fastq_trim.fastq SRR098283.fastq_trim.fastq
Bonus Exercise (Advanced)
Now that we’ve quality controlled our samples, they should perform better on the quality tests run by FastQC. Go ahead and re-run FastQC on your trimmed FASTQ files and visualize the HTML files to see whether your per base sequence quality is higher after trimming.
Solution
On Crane window do:
$ fastqc /work/group/username/dc_workshop/data/trimmed_fastq/*.fastq
In a new tab in your terminal do:
$ mkdir ~/Desktop/fastqc_html/trimmed $ scp username@crane.unl.edu:/work/group/username/dc_workshop/data/trimmed_fastq/*.zip ~/Desktop/fastqc_html/trimmed/ $ open ~/Desktop/fastqc_html/trimmed/*/*.html
Before trimming, one of the sequences gave a warning and another failed the per base sequence quality test. After filtering, all sequences pass that test.
Key Points
The options you set for the command line tools you use are important!
Data cleaning is an essential step in a genomics workflow