Basic EC2, command line, and BLAST
==================================

Log into your AWS account, spin up a machine, log into the machine
with SSH, and install Dropbox (see :doc:`amazon/index`).

Install BLAST and some other software
-------------------------------------

You should be starting at the prompt ending in '#', after logging in via SSH.

First, download and install BLAST::

   cd /root
   curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz
   tar xzf blast-2.2.24-x64-linux.tar.gz
   cp blast-2.2.24/bin/* /usr/local/bin
   cp -r blast-2.2.24/data /usr/local/blast-data

.. @CTB navigating NCBI

Download and install some useful scripts::

   git clone https://github.com/ngs-docs/ngs-scripts /usr/local/share/ngs-scripts

Create a working directory on a large disk, and change to that working directory::

   cd /mnt
   mkdir blast
   cd blast

Download the E. coli MG1655 protein data set::

   curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa

This fetches that URL and saves the contents as 'NC_000913.faa' on the local disk.

Grab a Prokka-generated set of proteins (we'll learn to do this on Thursday)::

   curl -O http://athyra.idyll.org/~t/ecoli0104.faa

Let's take a quick look at these files::

   head ecoli0104.faa
   head NC_000913.faa

Format the MG1655 protein set as a BLAST database, then BLAST the O104 protein set against it::

   formatdb -i NC_000913.faa -o T -p T
   blastall -i ecoli0104.faa -d NC_000913.faa -p blastp -e 1e-12 -o 0104.x.NC

Look at the output file::

   head 0104.x.NC

Let's convert the BLAST output to a CSV file::

   pip install screed
   python /usr/local/share/ngs-scripts/blast/blast-to-csv-with-names.py ecoli0104.faa NC_000913.faa 0104.x.NC > 0104.x.NC.csv

This creates a file '0104.x.NC.csv', which you can open in Excel.
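
If you would rather peek at the CSV on the machine itself before opening it in Excel, a few lines of Python will do. This is only a minimal sketch: it assumes Python 3 is available and assumes nothing about the column layout written by ``blast-to-csv-with-names.py``. The helper name ``peek_csv`` is made up for illustration.

```python
import csv
import os


def peek_csv(path, n=5):
    """Return the first n rows of a CSV file, each row as a list of strings."""
    rows = []
    with open(path, newline="") as fp:
        for i, row in enumerate(csv.reader(fp)):
            if i >= n:
                break
            rows.append(row)
    return rows


if __name__ == "__main__":
    # '0104.x.NC.csv' is the file created by the conversion step above.
    if os.path.exists("0104.x.NC.csv"):
        for row in peek_csv("0104.x.NC.csv"):
            print(row)
```

Run it from the ``/mnt/blast`` working directory so the relative file name resolves.
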
If you've installed Dropbox, you can copy the file into your Dropbox folder::

   cp 0104.x.NC.csv /root/Dropbox/

Reciprocal BLAST calculation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Do the reciprocal BLAST, too::

   formatdb -i ecoli0104.faa -o T -p T
   blastall -i NC_000913.faa -d ecoli0104.faa -p blastp -e 1e-12 -o NC.x.0104

Extract the reciprocal best hits::

   python /usr/local/share/ngs-scripts/blast/blast-to-ortho-csv.py ecoli0104.faa NC_000913.faa 0104.x.NC NC.x.0104 > ortho.csv

This generates a file 'ortho.csv', containing the ortholog assignments and their annotations.

Now copy *that* over to Dropbox and open it in Excel::

   cp ortho.csv /root/Dropbox/

A few post-tutorial links
-------------------------

Explore the NCBI bacterial genome site here: http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

- '.faa' files are protein data sets;
- '.fna' files are genomic DNA;
- the rest are annotation files of various kinds.

BONUS: plotting the BLAST e-value distribution from the CSV
-----------------------------------------------------------

IPython Notebook is running on these machines, and we can use that to do data analysis on the files. For example, let's grab a notebook to plot BLAST e-value distributions::

   cd /usr/local/notebooks
   curl -O http://athyra.idyll.org/~t/plot-blast-evalues.ipynb

Now, go to 'https://' + your machine name, enter password 'beacon', and open the 'plot-blast-evalues' notebook.
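
To get a feel for what an e-value distribution plot summarizes, the sketch below bins e-values by order of magnitude. It is not the notebook's code: the function name ``evalue_histogram`` and the ``floor`` clamp are assumptions made for illustration, and you would need to pull the e-value column out of the CSV yourself, since its position is not documented here. BLAST reports very strong hits as 0.0, so those are clamped before taking the logarithm.

```python
import math
from collections import Counter


def evalue_histogram(evalues, floor=1e-200):
    """Count e-values per power-of-ten bin; e.g. 3e-15 falls in bin -15."""
    counts = Counter()
    for e in evalues:
        # BLAST prints 0.0 for extremely small e-values; clamp to avoid log10(0).
        e = max(e, floor)
        counts[math.floor(math.log10(e))] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up e-values, purely to show the shape of the result.
    hist = evalue_histogram([3e-15, 5e-15, 2e-40, 0.0])
    for exponent, count in hist.items():
        print("1e%d: %s" % (exponent, "#" * count))
```

Binning on a log scale is the natural choice here because e-values from a single search routinely span well over a hundred orders of magnitude.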