Methods for performing the RNAmapper pipeline via command line. This methodology requires a working knowledge of command line usage and installation/implementation of programs running within a unix/linux environment. There are many tutorials on how to use such tools -- Google is your friend. Below you will find a brief overview but be sure to read the step-by-step guide as it has more detailed explanations. Step-by-step command line RNAmapper pipeline: pdf Custom R scripts necessary for the pipeline: RNAmappe.R, RNAidentifie.R, RNAeffecto.R, VEPsorte.R, VCFmerge.R R script documentation: pdf txt license Links to other programs used in pipeline: R FastQC Bowtie TopHat Cufflinks SAMtools VEP SIFT Mapping The first steps of the pipeline are concerned with mapping the mutation. The basic idea is to find quality SNPs in the wildtype data to use as mapping markers. Good quality markers will have high coverage and a high percentage of heterozygosity -- both of these measures ensure that the marker will provide useful mapping information. The allele frequency of the markers is then analyzed in the mutant RNA-Seq data. Any position that is unlinked to the mutation will be heterozygous, while linked regions will approach homozygosity. This mutant allele frequency is then plotted against the genome, chromsome by chromsome, with regions of linkage appearing as peaks. Genome Wide Allele Frequency (click to enlarge) The RNAmapper pipeline creates a file called _RNAgenomeFrequency.jpg in which the black marks represent average allele frequencies of marker SNPs at their chromosomal positions. Each chromosome is noted at the bottom. The y-axis is the allele frequency from 0.5 to 1.0 (homozygous). In this example, the mutation maps to chromsome 5. Individual Chromosome Allele Frequency (click to enlarge) The RNAmapper pipeline also makes individual chromosome graphs called mut_chrX.jpg in which the red marks represent the average allele frequencies at chromosomal positions (the same values as the black in the genome wide graph). The black marks represent individual marker allele frequencies. The individual frequencies have much more noise than the "smoothed" average allele frequency. The region of linkage is left up to the user, however the vertical lines note the edges of the region where the average allele frequency is within 2% of homozygosity (i.e. ave. freq. > 0.98). Thus far mutations we have mapped have been found near the middle of the peak of the average of allele frequency but the decision on the region to further evaluate is left to the user. Candidate Identification The RNA-Seq data allows not only for mapping, but also for identifying candidate mutations within the region of linkage. The next steps in the bioinformatic pipeline uses the RNA-Seq data within the region of linkage (as defined by the user) to identify candidate mutations that affect coding sequence (e.g. nonsense and missense muations), expression levels (e.g. expression downregulated in mutant), and splicing (e.g. inclusion of intronic sequence in mutant transcripts). Coding Sequence Candidates are identified by identifying all SNPs within the region of linkage and evaluating whether the changes alter gene sequences with the potential to causes deleterious lesions. The pipeline also compares the linked SNPs against separately identified wildtype SNPs and removes these from further consideration; this assumes that SNPs identified in wildtype populations are not causative for the phenotype of interest. A table of possible changes is output with several columns of interest. Example of SNP candidate table (click to enlarge) The table is organized with information on the allele being evaluated, information from the mutant RNA-Seq data for that position, and information from known wildtype sequencing. Two important points are exemplified in the table above: 1) The nonsense mutation identified is duplicated (highlighted in yellow). This is due to the program evaluating the effect of the SNP allele on each transcript in ENSEMBL's database. In this case, there are two transcripts for the gene, and the SNP causes a STOP in both. 2) THE OUTPUT TABLE IS FAR FROM PERFECT! The second candidate on the above list is one example of a problem with the bioinformatics pipeline that cannot currently be overcome. At this position there were two alternative calls, C and A, with the reference being T. In these rows, the A leads to a misssense mutation whereas the C is a synonomous change. In this example the A call is sequencing error, but that information is not accessible in the format analyzed by the pipeline (there are actually 784 C calls, and only 1 A call, at this location). THE USER HAS TO EVALUATE EACH CANDIDATE TO ENSURE QUALITY. In this case, loading the aligned sequence files into IGV (link) and assessing this position clearly demonstrates the lack of support for the A call and thus can be ruled out as an candidate of interest. See more details in the step-by-step guide (pdf). Expression level candidates are identified by using Cufflinks (link) to compare transcript levels between mutant and wildtype data, and then extracting only those changes within the region of linkage. We have found that mutations that cause decreased transcript levels due to nonsense mediated decay can be identified, but any mutation affecting expression level is, in principle, identifiable. However, if the mutation effects a noncoding element (enhancer, promoter, etc) the RNA-Seq data will not see the change itself, only the effect (altered expression) of the change and further work will be needed to identify the linked mutation. Below is an example table with columns of special interest noted. Example of Expression candidate table (click to enlarge) See the Cufflinks (link) site for details on each column. In the pipeline here, we extract only those genes that show significant differences in expression level between mutant and wildtype data. In this case, the nonsense mutation in egr2b lead to a 25-fold downregulation of the gene. Alternative splicing candidates are identified by using IGV (link) to visually assess the boundaries of each intron/exon boundary. While Cufflinks can evaluate isoform differences between two samples, we have found that it does not reliably identify mutation induced changes to splicing patterns. THE USER HAS TO EVALUATE THE LINKED REGION TO IDENTIFY SPLICE ALTERING MUTATIONS. While this is laborious to do "by hand" it can yield very useful information. When we used a mutation in vangl2 known to include 15bp of intronic sequence in the mutant transcript, we were able to identify the change in the mutant RNA-seq data. Such information is invaluable for the mutation hunter. Below is an IGV view of the vangl2 splice altering mutation. RNA-Seq data from vangl2 mutant and wildtype sibling viewed in IGV (click to enlarge) The first group of transcripts are alignments from vangl2 mutants, the second from their wildtype siblings. The inclusion of intronic sequence in the mutant transcripts is apparent as compared to the normal wildtype transcripts. |