Basic EC2, command line, and BLAST
Log in to your AWS account, spin up a machine, log in to the machine with SSH, and install Dropbox (see Getting started with Amazon EC2).
Install BLAST and some other software
You should be starting at the prompt ending in ‘#’, after logging in via SSH.
First, download and install BLAST:
cd /root
curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz
tar xzf blast-2.2.24-x64-linux.tar.gz
cp blast-2.2.24/bin/* /usr/local/bin
cp -r blast-2.2.24/data /usr/local/blast-data
Download and install some useful scripts:
git clone https://github.com/ngs-docs/ngs-scripts /usr/local/share/ngs-scripts
Create a working directory on a large disk, and change to that working directory:
cd /mnt
mkdir blast
cd blast
Download the E. coli MG1655 protein data set:
curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa
This fetches the file at that URL and saves it to the local disk as ‘NC_000913.faa’.
Grab a Prokka-generated set of proteins (we’ll learn to do this on Thursday):
curl -O http://athyra.idyll.org/~t/ecoli0104.faa
Let’s take a quick look at these files:
head ecoli0104.faa
head NC_000913.faa
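Both files are in FASTA format: each record starts with a ‘>’ header line, followed by one or more lines of sequence. If you want more than the first ten lines that head shows, here is a minimal Python sketch that reads FASTA records and reports how many there are. The function name read_fasta and the two sample proteins are made up for illustration; they are not part of BLAST or ngs-scripts.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented two-record sample standing in for a real .faa file.
SAMPLE = """>protA hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>protB another made-up protein
MSDNE
"""

records = list(read_fasta(io.StringIO(SAMPLE)))
print(len(records), "records")
for name, seq in records:
    # Print the identifier (first word of the header) and the sequence length.
    print(name.split()[0], len(seq))
```

To try it on the real data, replace io.StringIO(SAMPLE) with open("NC_000913.faa") or open("ecoli0104.faa").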
Format the MG1655 protein set as a BLAST database, then BLAST the O104 protein set against it:
formatdb -i NC_000913.faa -o T -p T
blastall -i ecoli0104.faa -d NC_000913.faa -p blastp -e 1e-12 -o 0104.x.NC
Look at the output file:
head 0104.x.NC
Let’s convert the BLAST output to a CSV file:
pip install screed
python /usr/local/share/ngs-scripts/blast/blast-to-csv-with-names.py ecoli0104.faa NC_000913.faa 0104.x.NC > 0104.x.NC.csv
This creates a file ‘0104.x.NC.csv’, which you can open in Excel. If you’ve installed Dropbox, you can copy it into your Dropbox folder:
cp 0104.x.NC.csv /root/Dropbox/
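The conversion above is handled by the ngs-scripts helper, which works on the default BLAST report. To give a rough idea of what turning BLAST hits into CSV involves, here is a sketch that assumes something different: tabular output, which blastall writes when given ‘-m 8’, with twelve tab-separated fields per hit. The column labels, the function name tabular_to_csv, and the two sample hits are my own inventions, and this is not how blast-to-csv-with-names.py is implemented.

```python
import csv
import io

# The twelve fields of BLAST tabular output; these labels are informal.
COLUMNS = ["query", "subject", "pct_identity", "aln_length", "mismatches",
           "gap_opens", "q_start", "q_end", "s_start", "s_end",
           "evalue", "bitscore"]


def tabular_to_csv(blast_handle, csv_handle):
    """Copy tab-separated BLAST hits into a CSV with a header row.

    Returns the number of hits written.
    """
    writer = csv.writer(csv_handle)
    writer.writerow(COLUMNS)
    n_hits = 0
    for line in blast_handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}")
        writer.writerow(fields)
        n_hits += 1
    return n_hits


# Two invented hits, not real results.
SAMPLE = (
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t1\t200\t1e-110\t395\n"
    "q2\ts7\t45.00\t120\t66\t2\t5\t124\t10\t129\t3e-20\t88.2\n"
)

out = io.StringIO()
count = tabular_to_csv(io.StringIO(SAMPLE), out)
print(count, "hits written")
print(out.getvalue())
```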
Reciprocal BLAST calculation
Do the reciprocal BLAST, too:
formatdb -i ecoli0104.faa -o T -p T
blastall -i NC_000913.faa -d ecoli0104.faa -p blastp -e 1e-12 -o NC.x.0104
Extract the reciprocal best hits:
python /usr/local/share/ngs-scripts/blast/blast-to-ortho-csv.py ecoli0104.faa NC_000913.faa 0104.x.NC NC.x.0104 > ortho.csv
This generates a file ‘ortho.csv’, containing the ortholog assignments and their annotations. Now copy that over to Dropbox and open it in Excel:
cp ortho.csv /root/Dropbox/
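The idea behind a reciprocal best hit is simple: protein A in one set and protein B in the other are paired only if B is A’s best hit and A is B’s best hit. Here is a minimal sketch of that logic, assuming "best" means lowest e-value and that hits are already available as (query, subject, e-value) tuples. The function names and the sample hits are invented, and the actual criteria used by blast-to-ortho-csv.py may differ.

```python
def best_hits(hits):
    """Map each query to its best subject.

    `hits` is an iterable of (query, subject, evalue) tuples. The lowest
    e-value wins; on a tie the first hit seen is kept.
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented hits, not real BLAST results.
a_vs_b = [
    ("a1", "b1", 1e-50),
    ("a1", "b2", 1e-20),
    ("a2", "b2", 1e-30),
    ("a3", "b1", 1e-15),  # a3 likes b1, but b1 prefers a1
]
b_vs_a = [
    ("b1", "a1", 1e-48),
    ("b2", "a2", 1e-31),
    ("b2", "a1", 1e-19),
]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

Note that a3 drops out: its best hit is b1, but b1’s best hit is a1, so the relationship isn’t reciprocal.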
A few post-tutorial links
Explore the NCBI bacterial genome site here: http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria
- ‘.faa’ files are protein data sets;
- ‘.fna’ files are genomic DNA;
- the rest are annotation files of various kinds.
BONUS: plotting the BLAST e-value distribution from the CSV
IPython Notebook is running on these machines, and we can use that to do data analysis on the files. For example, let’s grab a notebook to plot blast e-value distributions:
cd /usr/local/notebooks
curl -O http://athyra.idyll.org/~t/plot-blast-evalues.ipynb
Now go to ‘https://’ followed by your machine name, enter the password ‘beacon’, and open the ‘plot-blast-evalues’ notebook.
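If you’d like to see the basic idea without the notebook, here is a small sketch that bins e-values by order of magnitude and prints a text histogram. It is not the notebook’s code, and it assumes you have already pulled the e-values out of the CSV into a list of floats; the sample values, the function name evalue_histogram, and the choice to put e-values reported as 0.0 into a bin labelled -200 are all made up for illustration.

```python
import math
from collections import Counter


def evalue_histogram(evalues, zero_bin=-200):
    """Count e-values per order of magnitude.

    Each e-value lands in the bin floor(log10(e)). BLAST reports extremely
    small e-values as 0.0, which has no logarithm, so those go into the
    arbitrary bin `zero_bin` instead.
    """
    counts = Counter()
    for e in evalues:
        if e <= 0:
            counts[zero_bin] += 1
        else:
            counts[math.floor(math.log10(e))] += 1
    return dict(sorted(counts.items()))


# Invented e-values, not real BLAST results.
evalues = [0.0, 3e-50, 5e-50, 2e-13]

hist = evalue_histogram(evalues)
for exponent, n in hist.items():
    # One '#' per hit in each order-of-magnitude bin.
    print(f"1e{exponent:<5} {'#' * n}")
```

Binning on a log scale is the natural choice here because e-values span hundreds of orders of magnitude; a linear histogram would pile nearly everything into a single bar.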