==============================
Basic (single-genome) assembly
==============================

(You'll need to have screed and khmer installed; if you ran through
:doc:`reads_and_qc` then you already do. Otherwise, follow the
instructions at the top of that tutorial.)

We're going to work through an assembly pipeline that uses a Brown Lab
approach called `digital normalization `__, which I'll talk about
tomorrow. For now, just view it as a way to get decent assemblies
faster than you might otherwise ;).

As usual, we've both pre-prepared the data for you and given you the
instructions to do it yourself. It should take ~20 minutes to run
through these commands yourself, OR you can just go ahead and
download the pre-prepared data (see the section after this one).

.. note::

   Many of the commands below take 5-10 minutes to run, or longer. You
   may want to look at using a program called 'screen' to run long-running
   programs safely -- see :doc:`amazon/using-screen`.

Before we do anything with data, make an assembly directory::

   cd /mnt
   mkdir assembly
   cd assembly
Preparing the data yourself
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Download the quality-filtered data (see :doc:`reads_and_qc` to make them
yourself)::

   curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.pe.qc.fq.gz
   curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.se.qc.fq.gz

.. note::

   If you already downloaded these files during the mapping tutorial,
   you can do::

      cp /mnt/ecoli/SRR*.qc.fq.gz .

Now, run it through digital normalization::

   /usr/local/share/khmer/scripts/normalize-by-median.py -x 1e9 -N 4 -k 20 -C 20 -p SRR390202.pe.qc.fq.gz --savehash normC20k20.kh
   /usr/local/share/khmer/scripts/normalize-by-median.py -x 1e9 -N 4 -k 20 -C 20 SRR390202.se.qc.fq.gz --savehash normC20k20.kh --loadhash normC20k20.kh

The above commands produce '.keep' files containing the normalized
reads; the first command normalizes the interleaved paired-end file
(note the ``-p`` flag) and the second the single-end file.
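
Digital normalization keeps a read only if the estimated coverage of its
locus -- the median abundance of the read's k-mers seen so far -- is still
below the cutoff ``-C``. Here is a toy Python sketch of that idea; it uses
an exact dictionary where khmer uses a probabilistic counting table, and is
an illustration only, not khmer's actual code::

   from collections import defaultdict
   from statistics import median

   def normalize(reads, k=4, cutoff=3):
       """Keep reads whose median k-mer abundance (so far) is below cutoff."""
       counts = defaultdict(int)
       kept = []
       for read in reads:
           kmers = [read[i:i+k] for i in range(len(read) - k + 1)]
           if median(counts[km] for km in kmers) < cutoff:
               kept.append(read)
               for km in kmers:
                   counts[km] += 1
       return kept

   # high-coverage duplicates are discarded; the novel read is kept
   reads = ['ACGTACGT'] * 10 + ['TTTTGGGG']
   print(len(normalize(reads)))  # 4

Note that the order of reads matters: once a locus reaches the cutoff,
further reads from it are dropped, which is why high-coverage data shrinks
dramatically while low-coverage loci are left alone.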

Now, remove low-abundance k-mers as likely errors::

   /usr/local/share/khmer/scripts/filter-abund.py normC20k20.kh *.keep
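
What filter-abund.py is doing, conceptually: walk along each read and
truncate it at the first k-mer whose count (from the saved counting table)
falls below an abundance cutoff. A toy sketch of that idea, using a
hypothetical ``trim_low_abund`` helper and a plain dict standing in for
khmer's counting table -- not khmer's actual code::

   def trim_low_abund(read, counts, k=4, cutoff=2):
       """Truncate the read so its first low-abundance k-mer is removed."""
       for i in range(len(read) - k + 1):
           if counts.get(read[i:i+k], 0) < cutoff:
               return read[:i + k - 1]
       return read

   counts = {'ACGT': 5, 'CGTA': 5, 'GTAC': 1}
   print(trim_low_abund('ACGTAC', counts))  # ACGTA -- trimmed at rare 'GTAC'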

This will produce a set of '.abundfilt' files that contain the
error-trimmed data. The paired-end abundfilt file will now contain newly
orphaned reads -- reads whose mate was removed by the trimming. The
following command separates the orphans out into a .se file, while reads
that are still paired go into a .pe file::

   /usr/local/share/khmer/scripts/extract-paired-reads.py *.pe.qc.fq.gz.keep.abundfilt
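
Conceptually, extract-paired-reads.py scans the interleaved file and checks
whether adjacent records still share a read-name prefix (e.g. names ending
in /1 and /2). A toy sketch with a hypothetical ``split_pairs`` helper that
operates on read names only -- not khmer's actual implementation::

   def split_pairs(names):
       """Given read names in file order, return (still_paired, orphans)."""
       paired, orphans = [], []
       prev = None
       for name in names:
           base = name.rsplit('/', 1)[0]
           if prev is not None and prev.rsplit('/', 1)[0] == base:
               paired.extend([prev, name])   # prev and name form a pair
               prev = None
           else:
               if prev is not None:
                   orphans.append(prev)      # prev never found its mate
               prev = name
       if prev is not None:
           orphans.append(prev)
       return paired, orphans

   paired, orphans = split_pairs(['a/1', 'a/2', 'b/1', 'c/1', 'c/2'])
   print(paired)   # ['a/1', 'a/2', 'c/1', 'c/2']
   print(orphans)  # ['b/1']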

After error trimming, we run another round of digital normalization on the
PE and SE files, respectively::

   /usr/local/share/khmer/scripts/normalize-by-median.py -k 20 -C 5 -x 2e8 -N 4 -p *.abundfilt.pe --savehash normC5k20.kh
   /usr/local/share/khmer/scripts/normalize-by-median.py -k 20 -C 5 -x 2e8 -N 4 *.abundfilt.se *.se.qc.fq.gz.keep.abundfilt --loadhash normC5k20.kh --savehash normC5k20.kh

We now end up with three files: ``SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep``,
``SRR390202.pe.qc.fq.gz.keep.abundfilt.se.keep``, and
``SRR390202.se.qc.fq.gz.keep.abundfilt.keep``. The first file is still
paired-end/interleaved (try typing ``head
SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep``), and the other two contain
only orphaned sequences (left OR right reads, never both).

Downloading pre-prepared data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Alternatively, you can download the already-prepared data sets::

   curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.gz
   curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.pe.qc.fq.gz.keep.abundfilt.se.keep.gz
   curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.se.qc.fq.gz.keep.abundfilt.keep.gz
   gunzip *.keep.gz

Using Velvet to do assembly
===========================

Velvet is a fast and decent assembler. It's no longer considered one of
the best assemblers, but it's robust and easy to use, so we like using it
in tutorials.

Install the Velvet software::

   cd /root
   curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
   tar xzf velvet_1.2.10.tgz
   cd velvet_1.2.10
   make MAXKMERLENGTH=51
   cp velvet? /usr/local/bin

Return to the assembly directory::

   cd /mnt/assembly

Then run Velvet for several different k values::

   for k in 21 31 41
   do
      velveth ecoli.$k $k -shortPaired -fastq SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep -short SRR390202.pe.qc.fq.gz.keep.abundfilt.se.keep SRR390202.se.qc.fq.gz.keep.abundfilt.keep
      velvetg ecoli.$k -exp_cov auto
   done

This produces a bunch of directories, 'ecoli.21' and 'ecoli.31' and
'ecoli.41', each containing a file 'contigs.fa'. You can look at the
results like so::

   head ecoli.21/contigs.fa

To look at some statistics, you'll need a program. We provide one as
part of khmer, called 'assemstats3.py'::

   python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.??/contigs.fa

If you want to see the N50 for each data set, you can run 'assemstats2.py'::

   python /usr/local/share/khmer/sandbox/assemstats2.py 1000 ecoli.21/contigs.fa
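
The N50 that assemstats2.py reports is the contig length L such that
contigs of length >= L together cover at least half of the total assembly.
A minimal sketch of the statistic (the khmer script additionally applies
the minimum-length filter you pass on the command line)::

   def n50(lengths):
       """Return the N50 of a list of contig lengths."""
       total = sum(lengths)
       running = 0
       for L in sorted(lengths, reverse=True):
           running += L
           if running * 2 >= total:
               return L

   print(n50([100, 80, 60, 40, 20]))  # 80 (100 + 80 >= 300 / 2)

Bigger N50 generally means a more contiguous assembly, but it says nothing
about correctness -- misassemblies can inflate it.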

Using SPAdes to do assembly
===========================

SPAdes is a nifty assembler that performs very well on single-cell
samples and microbial genomes; see `the SPAdes manual `__.

To get SPAdes running, first install CMake::

   apt-get -y install cmake

Then, install SPAdes::

   wget http://spades.bioinf.spbau.ru/release2.5.1/SPAdes-2.5.1.tar.gz
   tar -xzf SPAdes-2.5.1.tar.gz
   cd SPAdes-2.5.1
   PREFIX=/usr/local ./spades_compile.sh

Go back to the data directory::

   cd /mnt/assembly

And, finally, run it. This will take a few hours... ::

   cp SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.fq
   spades.py --pe1-12 SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.fq -o ecoli.spades

This produces a file 'ecoli.spades/scaffolds.fasta' that you can look at::

   python /usr/local/share/khmer/sandbox/assemstats2.py 1000 ecoli.spades/scaffolds.fasta

or compare::

   python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.??/contigs.fa ecoli.spades/scaffolds.fasta
.. Using IDBA to do assembly
.. =========================
.. (Note: this does not currently work for me; IDBA crashes. --titus)
.. You can also try out the `IDBA assembler `__::
.. curl -O http://hku-idba.googlecode.com/files/idba-1.1.1.tar.gz
.. tar xzf idba-1.1.1.tar.gz
.. cd idba-1.1.1/
.. ./configure && make
.. cp bin/idba_ud /usr/local/bin
.. Now, go back to the data directory::
.. cd /mnt/assembly
.. and run it IDBA. Note, here you are making a copy of the .keep file
.. so that it has a '.fq' on the end; this is because IDBA expects the
.. file name to indicate the sequence type::
.. cp SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.fq
.. idba_ud --pre_correction -l SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.fq -o idba.dn.d

----

We can also run a BLAST server now to check out our assembly -- go see
:doc:`blastkit`.
.. BLAST comparison
.. mapping
.. exercise
.. blastkit