============================== Basic (single-genome) assembly ============================== (You'll need to have screed and khmer installed; if you ran through :doc:`reads_and_qc` then you already do. Otherwise, follow the instructions at the top of that tutorial.) We're going to work through an assembly pipeline that uses a Brown Lab approach called `digital normalization `__, which I'll talk about tomorrow. For now, just view it as a way to get decent assemblies faster than you might otherwise ;). As usual, we've both pre-prepared the data for you and given you the instructions to do it yourself. It should take ~20 minutes to run through these commands yourself, OR you can just go ahead and download (section after this one). .. note:: Many of the commands below take 5-10 minutes to run, or longer. You may want to look at using a program called 'screen' to run long-running programs safely -- see :doc:`amazon/using-screen`. Before we do anything with data, make an assembly directory:: cd /mnt mkdir assembly cd assembly Preparing the data yourself ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Download the quality-filtered data (see :doc:`reads_and_qc` to make them yourself):: curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.pe.qc.fq.gz curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.se.qc.fq.gz .. note:: If you already have downloaded these files during the mapping tutorial, you can do:: cp /mnt/ecoli/SRR*.qc.fq.gz . Now, run it through digital normalization:: /usr/local/share/khmer/scripts/normalize-by-median.py -x 1e9 -N 4 -k 20 -C 20 -p SRR390202.pe.qc.fq.gz --savehash normC20k20.kh /usr/local/share/khmer/scripts/normalize-by-median.py -x 1e9 -N 4 -k 20 -C 20 SRR390202.se.qc.fq.gz --savehash normC20k20.kh --loadhash normC20k20.kh The above commands produce '.keep' files; the first command normalizes the paired-end file (-p) and the second does the single-end file. Now, remove low-abundance k-mers as likely errors:: /usr/local/share/khmer/scripts/filter-abund.py normC20k20.kh *.keep This will produce a set of files '.abundfilt' that contain the error-trimmed data. The paired-end abundfilt file will contain newly orphaned reads now, ones where left or right reads were removed without their pair being removed; the following command separates orphans into a .se file, while paired reads are placed in a .pe file:: /usr/local/share/khmer/scripts/extract-paired-reads.py *.pe.qc.fq.gz.keep.abundfilt After error trimming, we run another round of digital normalization:: /usr/local/share/khmer/scripts/normalize-by-median.py -k 20 -C 5 -x 2e8 -N 4 -p *.abundfilt.pe --savehash normC5k20.kh /usr/local/share/khmer/scripts/normalize-by-median.py -k 20 -C 5 -x 2e8 -N 4 *.abundfilt.se *.se.qc.fq.gz.keep.abundfilt --loadhash normC5k20.kh --savehash normC5k20.kh on the PE and SE files variously. We now end with three files: ``SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep``, ``SRR390202.pe.qc.fq.gz.keep.abundfilt.se.keep``, and ``SRR390202.se.qc.fq.gz.keep.abundfilt.keep``. The first file is still paired end/interleaved (try typing ``head SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep``) and the other two contain only orphaned sequences (left OR right, never both). Downloading pre-prepared data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Alternatively, you can download the already prepared data sets:: curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.gz curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.pe.qc.fq.gz.keep.abundfilt.se.keep.gz curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.se.qc.fq.gz.keep.abundfilt.keep.gz gunzip *.keep.gz Using Velvet to do assembly =========================== Velvet is a fast and decent assembler. It's no longer considered one of the best assemblers, but it's robust and easy to use so we like using it in tutorials. Install the Velvet software:: cd /root curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz tar xzf velvet_1.2.10.tgz cd velvet_1.2.10 make MAXKMERLENGTH=51 cp velvet? /usr/local/bin Return to the assembly directory:: cd /mnt/assembly Then run velvet for several different k values:: for k in 21 31 41 do velveth ecoli.$k $k -shortPaired -fastq SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep -short SRR390202.pe.qc.fq.gz.keep.abundfilt.se.keep SRR390202.se.qc.fq.gz.keep.abundfilt.keep velvetg ecoli.$k -exp_cov auto done This produces a bunch of directories, 'ecoli.21' and 'ecoli.31' and 'ecoli.41', each containing a file 'contigs.fa'. You can look at the results like so:: head ecoli.21/contigs.fa To look at some statistics, you'll need a program. We provide one as part of khmer, called 'assemstats3.py':: python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.??/contigs.fa If you want to see the N50 for each data set, you can run 'assemstats2.py':: python /usr/local/share/khmer/sandbox/assemstats2.py 1000 ecoli.21/contigs.fa Using SPAdes to do assembly =========================== SPAdes is a nifty assembler that performs very well on single-cell samples and microbial genomes; see `the SPAdes manual `__. To get spades running, first, install CMake:: apt-get -y install cmake Then, install spades:: wget http://spades.bioinf.spbau.ru/release2.5.1/SPAdes-2.5.1.tar.gz tar -xzf SPAdes-2.5.1.tar.gz cd SPAdes-2.5.1 PREFIX=/usr/local ./spades_compile.sh Go back to the data directory:: cd /mnt/assembly And, finally, run it. This will take a few hours... :: cp SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.fq spades.py --pe1-12 SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.fq -o ecoli.spades This produces a file 'ecoli.spades/scaffolds.fasta' that you can look at:: python /usr/local/share/khmer/sandbox/assemstats2.py 1000 ecoli.spades/scaffolds.fasta or compare:: python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.??/contigs.fa ecoli.spades/scaffolds.fasta .. Using IDBA to do assembly .. ========================= .. (Note: this does not currently work for me; IDBA crashes. --titus) .. You can also try out the `IDBA assembler `__:: .. curl -O http://hku-idba.googlecode.com/files/idba-1.1.1.tar.gz .. tar xzf idba-1.1.1.tar.gz .. cd idba-1.1.1/ .. ./configure && make .. cp bin/idba_ud /usr/local/bin .. Now, go back to the data directory:: .. cd /mnt/assembly .. and run it IDBA. Note, here you are making a copy of the .keep file .. so that it has a '.fq' on the end; this is because IDBA expects the .. file name to indicate the sequence type:: .. cp SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.fq .. idba_ud --pre_correction -l SRR390202.pe.qc.fq.gz.keep.abundfilt.pe.keep.fq -o idba.dn.d ---- We can also go run a BLAST server now to check out our assembly -- go see :doc:`blastkit`. .. BLAST comparison .. mapping .. exercise .. blastkit