Methods for using RNAmapper on Galaxy.

This tutorial briefly explains how to use RNAmapper in Galaxy. Galaxy is a vast toolbox for data analysis, going far beyond what is presented here. However, you have the option to use RNAmapper in a turnkey way by using the provided workflows as described below. Most of this tutorial is concerned with uploading data. Once your data is uploaded, all you need to do is click on a button and come back 24 – 48 hrs later for the completed analysis. If you wish to look at any of the intermediate data, it will also be available for you to explore or analyze in different ways. 

There is a learning curve to using Galaxy, but there are several good webcasts that will help (Galaxy help, Galaxy screencasts), and there is specific help for implementing an Amazon version (Galaxy Amazon help).

The Nechiporuk lab at OHSU in Portland has been doing some "beta" testing for us with great success, having mapped and identified several mutants using RNAmapper. Katie Drerup has created a great "how to" document describing how she does it using the RNAmapper download. This pipeline is particularly useful because she found that the native fastq method was not working, so it shows how she gets from fastq files all the way to the mutant. Thanks Katie! Katie pipeline.txt


USING YOUR RNAMAPPER GALAXY PROGRAM

Create an Account / Sign in

First open Firefox; this will automatically start the RNAmapper server (see the ReadMe on the Desktop if it does not automatically open to RNAmapper). Before you can use Galaxy, you need to sign in. You can do this in the “User” menu in the top menu bar.

002.png

To start, use the email address galaxy@hms.harvard.edu with the password galaxy. This will give you administrator privileges. You can later register other accounts and elevate them to administrator privilege if you so choose.

0004.png

Congratulations! You are now ready to use the RNAmapper/Galaxy instance!

OPTIONAL. The above will work for all your mapping needs. But, if you really, really want to have your own username and password then first sign in to galaxy as above. You then must change the entry in universe_wsgi.ini, found in the main galaxy folder on the desktop (see image below), and then register the account in galaxy in the normal browser interface.

your very own galaxy

Congratulations! You are again ready to use your RNAmapper/Galaxy instance!



Uploading data through the browser (<2GB per file)
The simplest way to upload data into Galaxy is directly through the browser, using the “Get Data” menu in the toolbar. For technical reasons unrelated to Galaxy, the file size limit is 2 GB per file. A mapped RNAseq dataset (.bam) is usually below this size. However, you can also break up your datasets into several files below 2 GB and upload them in series. If you wish to upload very large files or many files at once, use one of the options described below.
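If your raw reads are in an uncompressed FASTQ file larger than 2 GB, one way to break them up is sketched here in Python. This is only an illustration and not part of RNAmapper or Galaxy: the function name split_fastq and the naming of the part files are made up for this example, and it assumes that every read occupies exactly four lines. It does not apply to .bam files, which cannot simply be cut at line boundaries.

```python
import os

MAX_BYTES = 2 * 1024**3  # the 2 GB browser upload limit mentioned in the text


def split_fastq(path, max_bytes=MAX_BYTES, out_prefix=None):
    """Split an uncompressed FASTQ file into parts no larger than max_bytes.

    Assumes four lines per read, so a read is never cut in half.
    Returns the list of part file names that were written.
    (Hypothetical helper; the ".partNNN.fastq" naming is an assumption.)
    """
    out_prefix = out_prefix or os.path.splitext(path)[0]
    parts = []
    out = None
    written = 0
    with open(path, "rb") as src:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            size = sum(len(line) for line in record)
            # Start a new part when the next read would exceed the limit.
            if out is None or written + size > max_bytes:
                if out is not None:
                    out.close()
                name = "%s.part%03d.fastq" % (out_prefix, len(parts) + 1)
                out = open(name, "wb")
                parts.append(name)
                written = 0
            out.writelines(record)
            written += size
    if out is not None:
        out.close()
    return parts
```

Calling split_fastq("mutant.fastq") would write mutant.part001.fastq, mutant.part002.fastq and so on next to the original file, each of which can then be uploaded through “Get Data”.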

0006.png

To upload a file through the “Get Data” tool, assign the proper file format and select the file’s location. You should also assign the Genome version you wish to use.

Once you have specified this information, click “Execute” and wait for the file to complete uploading. The upload speed will depend on the available bandwidth of your network connection.

0010.png

Once the upload completes, the data will show up in your history. You can preview it by clicking on the “eye” icon, change file formats or metadata with the “pencil” icon or delete it with the “X” button.

0012.png



Uploading data larger than 2GB
Downloaded version.
If your file is bigger than 2GB, you can transfer files into the shared folder you set up when creating your Virtual Machine (ideally on your "native" operating system's Desktop, or wherever you put it). You will then make RNAmapper aware of the data (below).

Amazon online version. If you are working on the Amazon cloud and your file is bigger than 2GB, or you want to upload many files at once, you will need an FTP client like Filezilla (http://filezilla-project.org/). Download, install, and start FileZilla now (or use your own client).

0014.png

Under File>Site Manager, create a new site.

0016.png

Configure your site. The “Host Name” is your instance's Public DNS (e.g. ec2-23-20-224-231.compute-1.amazonaws.com), and the Port should be 22 (remember the firewall?). Change the protocol to “SFTP”, the “Logon type” to “normal” and the user name to “ubuntu”. I also like to rename my site to something catchy, like “Galaxy”, for flavor. Once you are done, click OK.

0018.png

We now need to tell FileZilla to use the keyfile we got from Amazon for verification. Go to “Edit>Settings”, then select “SFTP” and “Add a keyfile” and open the Amazon keyfile from wherever you saved it on your computer. FileZilla may ask to convert the keyfile into a “supported format”; let it. Just name the new key and save it in a convenient location (same folder?). Note that if you share your keyfile with another user on a different computer, e.g. by email, you may need to set the keyfile’s security property to allow read/write permission to only one user. Failure to do so may lead to Amazon rejecting the key as insufficiently private.
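On Linux or Mac OS, restricting the keyfile to a single user can be done from a short Python script, as sketched here. The function name restrict_keyfile is made up for this example, and the file name in the usage note is a placeholder for wherever you saved your own key. On Windows, file permissions work differently and this sketch does not apply.

```python
import os
import stat


def restrict_keyfile(path):
    """Allow only the file's owner to read and write the keyfile (mode 600).

    Returns the resulting permission bits so the change can be checked.
    (Hypothetical helper for illustration.)
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, restrict_keyfile("my_amazon_key.pem") removes all access for other users of the computer while leaving your own read/write access in place.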

0020.png

Now connect to your Amazon Instance from the Site Manager (make sure the protocol is set to SFTP). Once you are connected, the server directory structure will appear in the middle right window, and replace “Not connected to any server”. If this fails, retrace your steps and make sure all settings are correct and you do not have any typos in any of your fields.

0022.png

Once you are connected, copy your data directly into the “/galaxy_upload/FTP” directory (e.g. “galaxy_upload/FTP/thisiswhereyourfilebelongs.fastq”) or create a new directory in “galaxy_upload” and copy your data there (“galaxy_upload/myfolder/this_is_also_fine.fastq”). Galaxy will *not* see any files directly in the “galaxy_upload” directory (galaxy_upload/invisiblefile.fastq) and no files in subdirectories of subdirectories (e.g. “galaxy_upload/FTP/Galaxydoesnotseethis/myfile.txt”).
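The placement rule can be restated as a small check: a file is visible only if it sits exactly one folder below “galaxy_upload”. The Python sketch below encodes just that rule as described in this tutorial; it is not how Galaxy itself decides, and the function name galaxy_can_see is made up for illustration.

```python
from pathlib import PurePosixPath


def galaxy_can_see(path, upload_root="galaxy_upload"):
    """Return True if path is a file exactly one folder below upload_root.

    This only mirrors the rule stated in the tutorial text; it does not
    query Galaxy. (Hypothetical helper.)
    """
    parts = PurePosixPath(path).parts
    if parts and parts[0] == "/":
        parts = parts[1:]  # ignore a leading slash
    return len(parts) == 3 and parts[0] == upload_root
```

With the examples from the text, galaxy_can_see returns True for “galaxy_upload/FTP/thisiswhereyourfilebelongs.fastq” and “galaxy_upload/myfolder/this_is_also_fine.fastq”, and False for “galaxy_upload/invisiblefile.fastq” and “galaxy_upload/FTP/Galaxydoesnotseethis/myfile.txt”.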

Once the data is uploaded onto the server, proceed with “Making Galaxy aware of uploaded data”.

Making Galaxy aware of the uploaded data
Before you can analyze your data, you will need to do one last thing: import the data into Galaxy’s database. To do that, log into Galaxy on your instance as an admin (galaxy@hms.harvard.edu). Then, select “Manage data libraries” from the tool menu and “create a new data library.”

0024.png

You can now add datasets to this library.

0026.png

In order to add datasets, choose “Upload a directory of files” as upload option, and choose your data directory. For the Downloaded version of RNAmapper the folder will be either /media/sf_virtualbox_shared or /media/sf__galaxyShare (depending on the instance). For the Online version of RNAmapper the default directory is /FTP. Since you have already uploaded your files, you do not need to copy them into the Galaxy database and take up extra space; just choose the “Link to files…” option. You can also set the genome flag at this time (as of July 2012: Zv9). Proceed by clicking “Upload to library.”

0028.png

Your data will take a minute or so to register with the Galaxy database. Just wait around on the Data Library page until it is uploaded (job complete). Then import the data into your history, and you are ready to analyze it on your very own Amazon instance! For a tutorial on how to use Galaxy in general, visit http://wiki.g2.bx.psu.edu/Learn/Screencasts .

0030.png

Launching the RNAmapper workflow (pipeline)
Launching the RNAmapper pipeline just takes a couple of mouse clicks. You will need at least five datasets to do so.

1) Your mutant dataset (one .bam or the .fastq read files)

    - you will need to have uploaded these as described above.

2) Your wildtype sibling dataset (one .bam or the .fastq read files)

    - you will need to have uploaded these as described above.

3) A transcript dataset for the current genome annotation

    - found in shared data>data libraries>GTFs>Zv9.65.gtf

4/5) One or two sets of "harmless" wildtype SNPs

    - shared data>data libraries>WT_SNP_data>WT_SNP_set01.vcf

    - shared data>data libraries>WT_SNP_data>WT_SNP_set02.vcf

Once you have all five datasets in your history, use the “Workflows” menu in the “Tools” menu.

0032.png

Choose the flavor of pipeline you wish to run.

1) RNAmapper from reads will use fastQ files directly from the sequencer to perform the mapping. This will take ~48 hrs.

*** An issue with Galaxy prevents the selection of multiple inputs, and you likely have multiple fastQ files for your mutant and for your wildtype that you are trying to map. You can get around this by concatenating all mutant fastQs into one file (and likewise all wildtype fastQs), then uploading these two files and using them for mapping. Alternatively, follow Katie Drerup's method: Katie's pipeline

2) RNAmapper from bam will start with reads already aligned to the genome (.bam files). This will take ~24 hrs.
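For the “from reads” option, joining several fastQ files into one can be done with a few lines of Python, sketched here. The function name concatenate_files and the file names in the usage note are placeholders, not something RNAmapper provides. The sketch simply appends the files byte for byte in the order given, so each input file should end with a newline.

```python
import shutil


def concatenate_files(inputs, output):
    """Append each input file to output, in the order given.

    Files are copied as raw bytes; nothing is parsed or validated.
    (Hypothetical helper for illustration.)
    """
    with open(output, "wb") as dst:
        for name in inputs:
            with open(name, "rb") as src:
                shutil.copyfileobj(src, dst)
    return output
```

For example, concatenate_files(["mut_lane1.fastq", "mut_lane2.fastq"], "mutant_all.fastq") produces one mutant file; repeat the call with the wildtype files to get the second input for the workflow.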

0034.png

Now assign the datasets to the proper slots as indicated, and run the workflow.

0036.png
0038.png
0040.png

You should get a confirmation of the workflow launch. Your History will be populated with grey dataset placeholders that will eventually begin processing (yellow) and complete (green). Should an error occur at any step of the pipeline, the dataset will turn red and give you an error message including some indication of what went wrong. A common mistake is failing to specify the file format correctly when uploading the data. If you have questions, check www.rnamapper.org or email RNAmapper@gmail.com for support.



Evaluating RNAmapper outputs
RNAmapper will compile all critical information about the analyzed dataset in a single report file in html format. This report is generated by the “ReportMakeR” tool in the last step of the analysis. You can peruse it directly in Galaxy by clicking on the “eye” icon, or you can download it to your computer (disk icon). The report contains the name of your mutant, the genome-wide scan for linkage (image), a table listing the statistics of the scan, and the size and position of the predicted critical interval.

0042.png

It further contains a more detailed scan of the predicted candidate chromosome and critical interval, as well as a prediction of the position of the lesion within it, and a list of predicted SNP effects of those SNPs within the interval that were unique to the mutant dataset by category.

0044.png

Finally, the report contains an analysis of all changes to transcripts (expression level, isoform shift, etc.) within the critical interval, again, separated by category.

0046.png

Given the final report, the user can now attempt to make sense of it biologically by integrating the computational predictions with his or her knowledge of the phenotype of the mutant at hand and pursue promising candidates experimentally.

*** REPORTMAKER sometimes doesn't give candidate results. The next update should fix this. Until then, the data is there; you just have to dig a little bit. Below are some notes from Katie Drerup, who has been beta testing the software.

Katie says: "If ReportMakeR does not identify potentially causative SNPs, mine your data!
    a) The third box from the top in your history should be titled SnpEff on dataXX. If you click on the title, the box will open and reveal that you have tabular data. There is another similarly titled box above this one but the data is in html format. You want the tabular data!
    b) Click on the disk icon to download and save this file. It should save to your downloads folder in your "home" folder. There is a shortcut to this folder in the dock at the top.
    c) Open this SnpEff file in Excel
        - To do this, first open Excel and then use the File-open path to open this file. It will not work to simply drag it in!
        - This file has all of the homozygous SNPs found in your mutant on all chromosomes.
    d) Once open, erase the SNPs found on other chromosomes to make the file easier to use.
    e) Now scan through your potential SNPs in order of interest. I chose to sort the Excel file by "effect", which is the predicted change caused by the SNP. After sorting, I looked first at the stop_gained, stop_lost, and splice acceptor/donor changes. Non-synonymous coding mutations are a bit trickier and, as such, I left them to last."
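If you would rather not do steps c) to e) by hand, the same filtering and sorting can be sketched in Python. Treat this as a starting point only: the column names "Chromo" and "Effect" are assumptions about the header of the tabular SnpEff file and may need to be changed to match yours, the function names are made up, and the ranking (stop_gained, stop_lost, splice changes, then non-synonymous, then everything else) is one reading of Katie's order of interest.

```python
import csv


def effect_rank(effect):
    """Rank an effect name: lower numbers are looked at first.

    The order follows Katie's notes; where "everything else" goes is an
    assumption made for this sketch.
    """
    e = effect.lower()
    if "stop_gained" in e:
        return 0
    if "stop_lost" in e:
        return 1
    if "splice" in e:
        return 2
    if "non_synonymous" in e:
        return 3
    return 4


def candidate_snps(path, chromosome, chrom_col="Chromo", effect_col="Effect"):
    """Keep SNPs on one chromosome and sort them by predicted effect.

    chrom_col and effect_col are assumed header names; adjust them to
    the header line of your own tab-separated file.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = [row for row in reader if row[chrom_col] == chromosome]
    # sorted() is stable, so SNPs with the same rank keep their file order.
    return sorted(rows, key=lambda row: effect_rank(row[effect_col]))
```

Calling candidate_snps("snpeff_output.tabular", "7") would return the rows for chromosome 7 with the most clear-cut candidates first; the file name and chromosome here are placeholders.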

IT IS CRUCIAL FOR THE USER TO EVALUATE EACH OF THE CANDIDATES IN FURTHER DETAIL.

The potential problems involved in the bioinformatics pipeline are described on the Command Line 101 page.

Using Individual RNAmapper tools

Although the RNAmapper pipeline is fully automated, it is possible to use any of the tools within the pipeline individually. All tools are available from the tool menu (on the left) in either the “NGS:RNAmapper” section or the "NGS:MEGAmapper" section (MEGAmapper is a separately developed set of tools for mapping mutations using whole genome sequencing data).

0048.png

A large number of parameters can be specified within most tools if so desired. This can be useful to accommodate a user’s specific experimental needs. Users can always build their own workflows using the “workflow” menu in the top toolbar (also see link).

0050.png

In addition to a brief description of the tool’s intended purpose, tools do come with their own documentation, which is appended to the tools directly and can be freely perused.

0052.png


Stopping / Deleting your Instance when done
If you are using the Online version of RNAmapper, don’t forget to “STOP” (shut down) or even “Terminate” (delete) your instance once your data has been processed and downloaded and your project is complete. You will also need to delete your “Snapshots” and “Volumes” at that point if you do not want Amazon to charge you for storing your data (~$0.13 per GB per month).
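To get a feeling for what leftover storage costs, the rate quoted above can be put into a one-line calculation. The Python sketch below uses the ~$0.13 per GB per month figure from this tutorial as its default; current Amazon prices may differ, and the function name is made up for illustration.

```python
def monthly_storage_cost(gigabytes, rate_per_gb=0.13):
    """Estimate the monthly storage charge in dollars.

    rate_per_gb defaults to the approximate figure quoted in this
    tutorial; check Amazon's current pricing before relying on it.
    """
    return gigabytes * rate_per_gb
```

For example, monthly_storage_cost(100) gives roughly 13 dollars per month for 100 GB of Snapshots and Volumes left behind.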

0054.png
 

If you have any questions, comments or suggestions, please contact RNAmapper at gmail dot com

GPL by Nikolaus Obholzer, 2012