Basic EC2, command line, and BLAST

Log into your AWS account, spin up a machine, log into the machine with SSH, and install Dropbox (see Getting started with Amazon EC2).

Install BLAST and some other software

You should be starting at the prompt ending in ‘#’, after logging in via SSH.

First, download and install BLAST:

cd /root

curl -O
tar xzf blast-2.2.24-x64-linux.tar.gz
cp blast-2.2.24/bin/* /usr/local/bin
cp -r blast-2.2.24/data /usr/local/blast-data

Download and install some useful scripts:

git clone /usr/local/share/ngs-scripts

Create a working directory on a large disk, and change to that working directory:

cd /mnt
mkdir blast
cd blast

Download the E. coli MG1655 protein data set:

curl -O

This fetches that URL and saves its contents to the local disk as ‘NC_000913.faa’.

Grab a Prokka-generated set of proteins (we’ll learn to do this on Thursday):

curl -O

Let’s take a quick look at these files:

head ecoli0104.faa
head NC_000913.faa
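Both files are in FASTA format: each record starts with a ‘>’ header line, followed by one or more lines of sequence. If you want to go beyond ‘head’, here is a minimal sketch of a FASTA reader in plain Python 3. The function name ‘read_fasta’ and the two example records are made up for illustration; they are not part of the tutorial's scripts or data.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example, standing in for a real .faa file.
example = io.StringIO(">prot1 made-up protein\nMKV\nLAA\n>prot2 another one\nMST\n")
records = list(read_fasta(example))
print(len(records), "records")
for header, seq in records:
    print(header, len(seq))
```

To count the proteins in one of the downloaded files, you could pass ‘open("NC_000913.faa")’ in place of the example handle.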

Format the MG1655 protein set as a BLAST database, then BLAST the O104 protein set against it:

formatdb -i NC_000913.faa -o T -p T
blastall -i ecoli0104.faa -d NC_000913.faa -p blastp -e 1e-12 -o 0104.x.NC

Look at the output file:

head 0104.x.NC

Let’s convert the BLAST output to a CSV file:

pip install screed
python /usr/local/share/ngs-scripts/blast/ ecoli0104.faa NC_000913.faa 0104.x.NC > 0104.x.NC.csv

This creates a file ‘0104.x.NC.csv’, which you can open in Excel. If you’ve installed Dropbox, you can copy it into your Dropbox folder:

cp 0104.x.NC.csv /root/Dropbox/

Reciprocal BLAST calculation

Do the reciprocal BLAST, too:

formatdb -i ecoli0104.faa -o T -p T
blastall -i NC_000913.faa -d ecoli0104.faa -p blastp -e 1e-12 -o NC.x.0104

Extract the reciprocal best hits:

python /usr/local/share/ngs-scripts/blast/ ecoli0104.faa NC_000913.faa 0104.x.NC NC.x.0104 > ortho.csv

This generates a file ‘ortho.csv’, containing the ortholog assignments and their annotations. Now copy that over to Dropbox and open it in Excel:

cp ortho.csv /root/Dropbox/
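The idea behind a reciprocal best hit is simple: protein A in one genome and protein B in the other are paired only if B is A's best hit in the forward search and A is B's best hit in the reverse search. The sketch below illustrates that logic in Python 3 on made-up data. It is not the ngs-scripts implementation; the function names, the (query, subject, e-value) tuple layout, and the protein IDs are all assumptions chosen for illustration, and "best" here simply means lowest e-value.

```python
def best_hits(hits):
    """Map each query to the subject with the lowest e-value.

    hits: iterable of (query, subject, evalue) tuples.
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Hypothetical hits: O104 proteins searched against MG1655, and the reverse.
forward = [
    ("o104_1", "mg_1", 1e-50),
    ("o104_1", "mg_2", 1e-20),
    ("o104_2", "mg_2", 1e-30),
]
reverse = [
    ("mg_1", "o104_1", 1e-48),
    ("mg_2", "o104_1", 1e-25),
    ("mg_2", "o104_2", 1e-22),
]

print(reciprocal_best_hits(forward, reverse))
```

In this toy data only ‘o104_1’ and ‘mg_1’ pick each other, so only that pair is reported; ‘o104_2’ has ‘mg_2’ as its best hit, but ‘mg_2’ prefers ‘o104_1’.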

A few post-tutorial links

Explore the NCBI bacterial genome site here:

  • ‘.faa’ files are protein data sets;
  • ‘.fna’ files are genomic DNA;
  • the rest are annotation files of various kinds.

BONUS: plotting the BLAST e-value distribution from the CSV

IPython Notebook is running on these machines, and we can use it to analyze the files. For example, let’s grab a notebook to plot BLAST e-value distributions:

cd /usr/local/notebooks
curl -O

Now, go to ‘https://’ + your machine name, enter password ‘beacon’, and open the ‘plot-blast-evalues’ notebook.
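E-values span many orders of magnitude, so a distribution plot is usually easier to read on a -log10 scale. The sketch below shows one way to bin e-values like that using only the Python 3 standard library. It is an illustration, not the contents of the notebook; the function names, the bin width of 10, the cap of 200 for e-values reported as 0.0, and the sample values are all assumptions.

```python
import math
from collections import Counter


def neg_log10(evalue, cap=200.0):
    """Return -log10(evalue), capped so that an e-value of 0.0 is still usable."""
    if evalue <= 0:
        return cap
    return min(-math.log10(evalue), cap)


def evalue_bins(evalues, width=10):
    """Count e-values per bin of -log10(evalue); keys are the lower bin edges."""
    counts = Counter()
    for evalue in evalues:
        score = neg_log10(evalue)
        counts[int(score // width) * width] += 1
    return counts


# Hypothetical e-values, standing in for a column read from the CSV file.
sample = [3e-15, 5e-13, 2e-47, 0.0]
bins = evalue_bins(sample)

# Print a crude text histogram, one row per bin.
for edge in sorted(bins):
    print("%4d %s" % (edge, "#" * bins[edge]))
```

With real data you would read the e-value column out of ‘0104.x.NC.csv’ first; the column layout of that file is not described here, so check its header before adapting this.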
