Basic (single-genome) assembly

(You’ll need to have screed and khmer installed; if you ran through Understanding Read Formats and Quality Controlling Data then you already do. Otherwise, follow the instructions at the top of that tutorial.)

We’re going to work through an assembly pipeline that uses a Brown Lab approach called digital normalization, which I’ll talk about tomorrow. For now, just view it as a way to get decent assemblies faster than you might otherwise ;).

As usual, we’ve both pre-prepared the data for you and given you the instructions to do it yourself. It should take ~20 minutes to run through these commands yourself, OR you can just go ahead and download (section after this one).


Many of the commands below take 5-10 minutes to run, or longer. You may want to look at using a program called ‘screen’ to run long-running programs safely – see Using ‘screen’.

Before we do anything with data, make an assembly directory:

cd /mnt
mkdir assembly
cd assembly

Preparing the data yourself

Download the quality-filtered data (see Understanding Read Formats and Quality Controlling Data to make them yourself):

curl -O
curl -O


If you already have downloaded these files during the mapping tutorial, you can do:

cp /mnt/ecoli/SRR*.qc.fq.gz .

Now, run it through digital normalization:

/usr/local/share/khmer/scripts/ -x 1e9 -N 4 -k 20 -C 20 -p --savehash

/usr/local/share/khmer/scripts/ -x 1e9 -N 4 -k 20 -C 20 --savehash --loadhash

The above commands produce ‘.keep’ files; the first command normalizes the paired-end file (-p) and the second does the single-end file.

Now, remove low-abundance k-mers as likely errors:

/usr/local/share/khmer/scripts/ *.keep

This will produce a set of files ‘.abundfilt’ that contain the error-trimmed data. The paired-end abundfilt file will contain newly orphaned reads now, ones where left or right reads were removed without their pair being removed; the following command separates orphans into a .se file, while paired reads are placed in a .pe file:

/usr/local/share/khmer/scripts/ *.pe.qc.fq.gz.keep.abundfilt

After error trimming, we run another round of digital normalization:

/usr/local/share/khmer/scripts/ -k 20 -C 5 -x 2e8 -N 4 -p * --savehash

/usr/local/share/khmer/scripts/ -k 20 -C 5 -x 2e8 -N 4 * *.se.qc.fq.gz.keep.abundfilt --loadhash --savehash

on the PE and SE files variously. We now end with three files:,, and The first file is still paired end/interleaved (try typing head and the other two contain only orphaned sequences (left OR right, never both).

Downloading pre-prepared data

Alternatively, you can download the already prepared data sets:

curl -O
curl -O
curl -O
gunzip *.keep.gz

Using Velvet to do assembly

Velvet is a fast and decent assembler. It’s no longer considered one of the best assemblers, but it’s robust and easy to use so we like using it in tutorials.

Install the Velvet software:

cd /root
curl -O
tar xzf velvet_1.2.10.tgz
cd velvet_1.2.10
cp velvet? /usr/local/bin

Return to the assembly directory:

cd /mnt/assembly

Then run velvet for several different k values:

for k in 21 31 41
  velveth ecoli.$k $k -shortPaired -fastq -short
  velvetg ecoli.$k -exp_cov auto

This produces a bunch of directories, ‘ecoli.21’ and ‘ecoli.31’ and ‘ecoli.41’, each containing a file ‘contigs.fa’. You can look at the results like so:

head ecoli.21/contigs.fa

To look at some statistics, you’ll need a program. We provide one as part of khmer, called ‘’:

python /usr/local/share/khmer/sandbox/ 1000 ecoli.??/contigs.fa

If you want to see the N50 for each data set, you can run ‘’:

python /usr/local/share/khmer/sandbox/ 1000 ecoli.21/contigs.fa

Using SPAdes to do assembly

SPAdes is a nifty assembler that performs very well on single-cell samples and microbial genomes; see the SPAdes manual.

To get spades running, first, install CMake:

apt-get -y install cmake

Then, install spades:

tar -xzf SPAdes-2.5.1.tar.gz
cd SPAdes-2.5.1
PREFIX=/usr/local ./

Go back to the data directory:

cd /mnt/assembly

And, finally, run it. This will take a few hours...

cp --pe1-12 -o ecoli.spades

This produces a file ‘ecoli.spades/scaffolds.fasta’ that you can look at:

python /usr/local/share/khmer/sandbox/ 1000 ecoli.spades/scaffolds.fasta

or compare:

python /usr/local/share/khmer/sandbox/ 1000 ecoli.??/contigs.fa ecoli.spades/scaffolds.fasta

We can also go run a BLAST server now to check out our assembly – go see BLASTing your assembled data.

comments powered by Disqus