Methods for using RNAmapper on Galaxy.
This tutorial briefly explains how to use RNAmapper in Galaxy. Galaxy is a vast toolbox for data analysis, going far beyond what is presented here. However, you can use RNAmapper in a turnkey way through the provided workflows, as described below. Most of this tutorial is concerned with uploading data. Once your data is uploaded, all you need to do is click a button and come back 24-48 hrs later for the completed analysis. Any intermediate data will also be available for you to explore or analyze in different ways. There is a learning curve to using Galaxy, but several good webcasts will help (Galaxy help, Galaxy screencasts), and there is specific help for implementing an Amazon version (Galaxy Amazon help).

The Nechiporuk lab at OHSU in Portland has been doing some "beta" testing for us with great success, having mapped/IDed several mutants using RNAmapper. Katie Drerup has created a great "how to" document describing how she does it using the RNAmapper download. This pipeline is particularly useful because she found that the native fastq method was not working, so it covers how she gets from fastq files all the way to the mutant. Thanks Katie! (Katie pipeline.txt)

USING YOUR RNAMAPPER GALAXY PROGRAM

Create an Account / Sign in

First open Firefox; this will automatically start the RNAmapper server (see the ReadMe on the Desktop if it does not automatically open to RNAmapper). Before you can use Galaxy, you need to sign in. You can do this in the “User” menu in the top menu bar. To start, use the email address galaxy@hms.harvard.edu with the password galaxy. This will give you administrator privileges. You can later register other accounts and elevate them to administrator privilege if you so choose. Congratulations! You are now ready to use the RNAmapper/Galaxy instance!

Uploading data through the browser (<2GB per file)

The simplest way to upload data into Galaxy is directly through the browser, using the “Get Data” menu in the toolbar. For technical reasons unrelated to Galaxy, the file size limit is 2 GB per file. A mapped RNAseq dataset (.bam) is usually below this size. However, you can also break up your datasets into several files below 2 GB and upload them in series. If you wish to upload very large files or many files at once, use one of the options described below.

To upload a file through the “Get Data” tool, assign the proper file format and select the file’s location. You should also assign the genome version you wish to use. Once you have specified this information, click “Execute” and wait for the file to complete uploading. The upload speed will depend on the available bandwidth of your network connection.
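
If you have many files, it can help to sort them by size before deciding which upload route to use. The following is only a sketch, not part of RNAmapper or Galaxy: the function names are made up for illustration, and it assumes that "2 GB" means 2 * 1024**3 bytes (the tutorial does not say whether the limit is binary or decimal).

```python
from pathlib import Path

# Assumption: the 2 GB browser limit is interpreted as 2 * 1024**3 bytes.
BROWSER_LIMIT_BYTES = 2 * 1024**3


def sizes_on_disk(paths):
    """Map each existing file path to its size in bytes."""
    return {str(p): Path(p).stat().st_size for p in paths}


def split_by_upload_route(sizes, limit=BROWSER_LIMIT_BYTES):
    """Split file names into (browser_upload, large_file_route).

    `sizes` maps a file name to its size in bytes. Files strictly below
    the limit can go through the browser; everything else needs the
    shared folder or SFTP route.
    """
    browser = sorted(name for name, size in sizes.items() if size < limit)
    large = sorted(name for name, size in sizes.items() if size >= limit)
    return browser, large
```

For example, `split_by_upload_route(sizes_on_disk(["mutant.bam", "mutant_reads.fastq"]))` (hypothetical file names) returns one list for browser upload and one list for the large-file route.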

Once the upload completes, the data will show up in your history. You can preview it by clicking on the “eye” icon, change file formats or metadata with the “pencil” icon, or delete it with the “X” button.

Uploading data larger than 2GB

Downloaded version. If your file is bigger than 2GB, you can transfer files into the shared folder you set up when creating your Virtual Machine (ideally on your "native" operating system's Desktop, or wherever you put it). You will then make RNAmapper aware of the data (below).

Amazon online version. If you are working on the Amazon cloud and your file is bigger than 2GB, or you want to upload many files at once, you will need an FTP client like FileZilla (http://filezilla-project.org/). Download, install, and start FileZilla now (or use your own client). Under File>Site Manager, create a new site and configure it. The “Host Name” is your Instance's Public DNS (e.g. ec2-23-20-224-231.compute-1.amazonaws.com), and the Port should be 22 (remember the firewall?). Change the protocol to “SFTP”, the “Logon type” to “normal” and the user name to “ubuntu”. I also like to rename my site to something catchy, like “Galaxy”, for flavor. Once you are done, click OK.

We now need to tell FileZilla to use the keyfile we got from Amazon for verification. Go to “Edit>Settings”, then select “SFTP” and “Add a keyfile”, and open the Amazon keyfile from wherever you saved it on your computer. FileZilla may ask to convert the keyfile into a “supported format”; let it. Just name the new key and save it in a convenient location (same folder?). Note that if you share your keyfile with another user on a different computer, e.g. by email, you may need to set the keyfile’s security property to allow read/write permission to only one user. Failure to do so may lead to Amazon rejecting the key as insufficiently private. Now connect to your Amazon Instance from the Site Manager (make sure the protocol is set to SFTP).
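
On Linux or macOS, "read/write permission to only one user" corresponds to file mode 0600 (the same effect as `chmod 600` in a terminal). A minimal sketch, assuming a POSIX system; the function name is hypothetical and this is not a step FileZilla or Amazon performs for you:

```python
import os
import stat


def restrict_keyfile(path):
    """Make a keyfile readable and writable by its owner only (mode 0600).

    Returns the resulting permission bits so the caller can verify them.
    Assumes a POSIX file system; Windows handles permissions differently.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling `restrict_keyfile("my_amazon_key.pem")` (a placeholder file name) should return `0o600` on such systems.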

Once you are connected, the server directory structure will appear in the middle right window, replacing “Not connected to any server”. If this fails, retrace your steps and make sure all settings are correct and there are no typos in any of your fields. Once you are connected, copy your data directly into the “/galaxy_upload/FTP” directory (e.g. “galaxy_upload/FTP/thisiswhereyourfilebelongs.fastq”) or create a new directory in “galaxy_upload” and copy your data there (“galaxy_upload/myfolder/this_is_also_fine.fastq”). Galaxy will *not* see any files directly in the “galaxy_upload” directory (galaxy_upload/invisiblefile.fastq) and no files in subdirectories of subdirectories (e.g. “galaxy_upload/FTP/Galaxydoesnotseethis/myfile.txt”). Once the data is uploaded onto the server, proceed with “Making Galaxy aware of uploaded data”.

Making Galaxy aware of uploaded data

Before you can analyze your data, you will need to do one last thing: import the data into Galaxy’s database. To do that, log into Galaxy on your instance as an admin (galaxy@hms.harvard.edu). Then select “Manage data libraries” from the tool menu and “create a new data library.” You can now add datasets to this library. To add datasets, choose “Upload a directory of files” as the upload option and choose your data directory. For the Downloaded version of RNAmapper the folder will be either /media/sf_virtualbox_shared or /media/sf__galaxyShare (depending on the instance). For the Online version of RNAmapper the default directory is /FTP. Since you have already uploaded your files, you do not need to copy them into the Galaxy database and take up extra space; just choose the “Link to files…” option. You can also set the genome flag at this time (as of July 2012: Zv9). Proceed by clicking “Upload to library.”

Your data will take a minute or so to register with the Galaxy database. Just wait on the Data Library page until it is uploaded (job complete). Then import the data into your history, and you are ready to analyze it on your very own Amazon instance! For a tutorial on how to use Galaxy in general, visit:
http://wiki.g2.bx.psu.edu/Learn/Screencasts
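
If files you transferred by SFTP do not appear when you choose a data directory, the placement rule from the upload step is the first thing to check: a file must sit exactly one folder below “galaxy_upload”. The rule can be sketched as follows. The function name is hypothetical, and this only restates the rule given in this tutorial; it does not inspect Galaxy itself.

```python
from pathlib import PurePosixPath

UPLOAD_ROOT = "galaxy_upload"  # directory name taken from the tutorial text


def galaxy_can_see(path, root=UPLOAD_ROOT):
    """Return True if `path` sits exactly one folder below the upload root."""
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    if root not in parts:
        return False
    # Components after the upload root: expect [subdirectory, filename].
    below_root = parts[parts.index(root) + 1:]
    return len(below_root) == 2
```

With the example paths used earlier, `galaxy_can_see("galaxy_upload/FTP/thisiswhereyourfilebelongs.fastq")` is true, while `galaxy_can_see("galaxy_upload/invisiblefile.fastq")` is false.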

Launching the RNAmapper workflow (pipeline)

To run the workflow, you need the following datasets in your history:

1) Your mutant dataset (one .bam or the .fastq read files) - you will need to have uploaded these as described above.
2) Your wildtype sibling dataset (one .bam or the .fastq read files) - you will need to have uploaded these as described above.
3) A transcript dataset for the current genome annotation - found in shared data>data libraries>GTFs>Zv9.65.gtf
4/5) One or two sets of "harmless" wildtype SNPs:
   - shared data>data libraries>WT_SNP_data>WT_SNP_set01.vcf
   - shared data>data libraries>WT_SNP_data>WT_SNP_set02.vcf

Once you have all five datasets in your history, use the “Workflows” menu in the “Tools” menu. Choose the flavor of pipeline you wish to run.

1) RNAmapper from reads will use fastQ files directly from the sequencer to perform the mapping. This will take ~48 hrs.
   *** An issue with Galaxy prevents the selection of multiple inputs, and you likely have multiple fastQ files for your mutant and your wildtype that you are trying to map. You can get around this either by concatenating all mutant fastQs into one file (likewise all wildtype) and then uploading and using each of these single files for mapping, or by following Katie Drerup's method (Katie's pipeline).
2) RNAmapper from bam will start with reads already aligned to the genome (.bam files). This will take ~24 hrs.

Now assign the datasets to the proper slots as indicated, and run the workflow. You should get a confirmation of the workflow launch. Your History will be populated with grey dataset placeholders that will eventually begin processing (yellow) and complete (green). Should an error occur at any step of the pipeline, the dataset will turn red and give you an error message including some indication of what went wrong. A common mistake is failing to specify the file format correctly when uploading the data.
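
The concatenation workaround for multiple fastQ files mentioned above can be done before upload with a few lines of Python. This is a sketch under stated assumptions, not an RNAmapper tool: the function name and file names are hypothetical, and the inputs are assumed to be uncompressed fastQ files.

```python
def concatenate_fastq(inputs, output, chunk_size=1 << 20):
    """Append the files in `inputs`, in order, into a single `output` file.

    Files are copied in binary chunks so large read files are never held
    in memory. If an input does not end with a newline, one is added so
    the next file's first record starts on its own line.
    """
    with open(output, "wb") as out:
        for path in inputs:
            last_byte = b"\n"  # an empty input needs no separator
            with open(path, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")
    return output
```

For example, `concatenate_fastq(["mut_lane1.fastq", "mut_lane2.fastq"], "mutant_all.fastq")` would produce one mutant file to upload; repeat with the wildtype files to get the second input.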

If you have questions, check www.rnamapper.org or email RNAmapper@gmail.com for support.

Evaluating RNAmapper outputs

RNAmapper will compile all critical information about the analyzed dataset in a single report file in html format. This report is produced by the “ReportMakeR” tool in the last step of the analysis. You can peruse it directly in Galaxy by clicking on the “eye” icon, or you can download it to your computer (disk icon). The report contains the name of your mutant, the genome-wide scan for linkage (image), a table listing the statistics of the scan, and the size and position of the predicted critical interval. It further contains a more detailed scan of the predicted candidate chromosome and critical interval, a prediction of the position of the lesion within it, and a list, by category, of predicted effects of those SNPs within the interval that were unique to the mutant dataset. Finally, the report contains an analysis of all changes to transcripts (expression level, isoform shift, etc.) within the critical interval, again separated by category. Given the final report, the user can now attempt to make sense of it biologically by integrating the computational predictions with his or her knowledge of the phenotype of the mutant at hand, and pursue promising candidates experimentally.

*** ReportMakeR sometimes doesn't give candidate results. The next update should have this fixed; however, the data is there, you just have to dig a little bit. Below are some notes from Katie Drerup, who has been beta testing the software. Katie says: "If ReportMakeR does not identify potentially causative SNPs, mine your data!"

IT IS CRUCIAL FOR THE USER TO EVALUATE EACH OF THE CANDIDATES IN FURTHER DETAIL. The potential problems involved in the bioinformatics pipeline are described on the Command Line 101 page.

Using Individual RNAmapper tools

A large number of parameters can be specified within most tools if so desired. This can be useful to accommodate a user’s specific experimental needs.
Users can always build their own workflows using the “workflow” menu in the top toolbar (also see link). In addition to a brief description of its intended purpose, each tool comes with its own documentation, which is appended to the tool directly and can be freely perused.

Stopping / Deleting your Instance when done

If you are using the Online version of RNAmapper, don’t forget to “STOP” (shut down) or even “Terminate” (delete) your instance once you have finished processing your data, have downloaded it, and your project is complete. You will also need to delete your “Snapshots” and “Volumes” once your project is done if you do not want Amazon to charge you for storing your data (~$0.13 per GB per month).

If you have any questions, comments or suggestions, please contact RNAmapper at gmail dot com

GPL by Nikolaus Obholzer, 2012