Galaxy provides multiple tools for performing RNA-seq analysis. This exercise introduces these Galaxy tools (TopHat -> Cufflinks -> Cuffcompare -> Cuffdiff) and guides you through using them to carry out an analysis similar to the RNASeq Analysis Lab exercise. The tutorial helps you become familiar with Galaxy tools and Galaxy workflows, and gives a basic understanding of RNA-seq analysis.
Here are the sample datasets that we are going to use:
RNA sample data from leaves:
RNA sample data from xylem (woody tissue):
We can copy the following input file into the current history from the shared data libraries (Shared Data > Data Libraries > Ptrichocarpa_129_gene > select Ptrichocarpa_129_gene.gtf > Import to current history > Go). We are going to use this Ptrichocarpa_129_gene.gtf dataset as the reference annotation for the Cuffcompare tool (NGS: RNA Analysis > Cuffcompare; Use Reference Annotation: Yes; Reference Annotation: select Ptrichocarpa_129_gene.gtf) to identify predicted transcripts in common between the two samples.
To understand the workflow and the Galaxy tools, we are going to work with a random 1% of the reads only (the full datasets are comparatively large and would take significant run time to process). We use the Random Fastq extract tool, which extracts 1% of random entries from a pair of FASTQ files. Each entry spans four lines: a header line starting with @, the sequence, a separator line starting with +, and the quality string.
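The idea behind paired subsampling can be sketched in a few lines of Python. This is only an illustration and not the code of the Galaxy tool: the function names `read_fastq` and `sample_pairs` are made up here, and the sketch assumes that both files list the mates in the same order and contain the same number of records. The key point is that one random decision is made per pair, so a read and its mate are always kept or dropped together.

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-tuples: (header, sequence, separator, quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def sample_pairs(lines_1, lines_2, fraction=0.01, seed=0):
    """Keep each read pair with probability `fraction` (hypothetical helper).

    Assumes the two inputs are mate files in the same order, so that
    record i of the first file pairs with record i of the second.
    """
    rng = random.Random(seed)
    kept_1, kept_2 = [], []
    for rec_1, rec_2 in zip(read_fastq(lines_1), read_fastq(lines_2)):
        # One draw per pair keeps both mates in sync.
        if rng.random() < fraction:
            kept_1.append(rec_1)
            kept_2.append(rec_2)
    return kept_1, kept_2


if __name__ == "__main__":
    # Tiny made-up input, only to show the call.
    mates_1 = ["@r1/1\n", "ACGT\n", "+\n", "IIII\n"]
    mates_2 = ["@r1/2\n", "TGCA\n", "+\n", "IIII\n"]
    kept_1, kept_2 = sample_pairs(mates_1, mates_2, fraction=1.0)
    print(len(kept_1), len(kept_2))
```

With real data, the two arguments would be open file handles for the forward and reverse FASTQ files. Because each pair is kept independently, the result holds roughly 1% of the pairs rather than an exact count.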
Now we have a pair of FASTQ files for both xylem and leaves. FASTQ files can use several different quality encodings, so we use the FASTQ Groomer tool to convert our files into the standard Sanger format. We then run the FastQC tool on the Sanger-formatted pairs of FASTQ files to reveal any problems in our data that we should be aware of before doing further analysis. It is also important to trim our data (for instance, keeping bases 10 to 90) before we start the analysis. We can use the Trim sequences tool to cut bases from the sequences in a FASTQ file.
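What grooming and trimming do to a single read can be sketched as follows. This is a hedged illustration, not the implementation of FASTQ Groomer or Trim sequences: the function names are invented, the input is assumed to be in the Illumina 1.3+ encoding (Phred+64), which is only one of the formats the Groomer accepts, and the trim coordinates are assumed to be 1-based and inclusive.

```python
def phred64_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Sanger (Phred+33).

    Assumption: the input uses the Illumina 1.3+ offset of 64.
    """
    converted = []
    for char in quality:
        score = ord(char) - 64
        if score < 0:
            raise ValueError("quality character below the Phred+64 range")
        converted.append(chr(score + 33))
    return "".join(converted)


def trim_record(record, first=10, last=90):
    """Keep bases `first` through `last` of one FASTQ record.

    `record` is (header, sequence, separator, quality) without newlines.
    Coordinates are assumed to be 1-based and inclusive.
    """
    header, sequence, separator, quality = record
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Sequence and quality must be cut identically to stay aligned.
    return (header, sequence[first - 1:last], separator, quality[first - 1:last])


if __name__ == "__main__":
    # Made-up read, only to show the calls.
    read = ("@read1", "ACGTACGTACGT", "+", "hhhhhhhhhhhh")
    groomed = (read[0], read[1], read[2], phred64_to_sanger(read[3]))
    print(trim_record(groomed, first=2, last=5))
```

The sequence and the quality string are always sliced with the same bounds; trimming only one of them would leave the record invalid for downstream tools.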
Now we are going to use the step 1 data (the quality-checked and trimmed reads) with TopHat. We use TopHat to map the RNA-seq reads to the reference genome and to find splice junctions. TopHat aligns the RNA-seq reads to our reference genome using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
Once we have mapped the reads with TopHat (the step 2 results), we need to assemble the reads into complete transcripts that can be analyzed for differential expression. Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples. We therefore use the TopHat BAM files as inputs for Cufflinks to perform de novo transcript assembly. We can then find transcripts that have multiple exons by looking at the assembled transcripts dataset.
Cuffcompare helps analyze the transcribed fragments (transfrags) in an assembly by comparing assembled transcripts to a reference annotation and tracking Cufflinks transcripts across multiple experiments (e.g., across a time course). Cuffcompare produces a combined transcripts dataset that is needed in the next steps.
Cuffdiff takes a GTF2/GFF3 file of transcripts as input, along with two or more SAM files containing the fragment alignments for two or more samples. It produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes.
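The multi-exon check on the assembled transcripts can also be done outside Galaxy. The sketch below counts exon records per transcript in a GTF file. It assumes the usual nine tab-separated GTF columns, with the feature type in column 3 and a `transcript_id "..."` attribute in column 9; the function name `multi_exon_transcripts` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the transcript_id attribute in GTF column 9.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def multi_exon_transcripts(gtf_lines):
    """Return {transcript_id: exon_count} for transcripts with 2+ exons."""
    exon_counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are counted
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exon_counts[match.group(1)] += 1
    return {tid: count for tid, count in exon_counts.items() if count > 1}


if __name__ == "__main__":
    # Made-up records in GTF layout, not real Cufflinks output.
    sample = [
        'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
        'chr1\tCufflinks\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
        'chr1\tCufflinks\texon\t900\t950\t.\t-\t.\tgene_id "G2"; transcript_id "T2";\n',
    ]
    print(multi_exon_transcripts(sample))
```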
We can send the Cuffdiff gene expression results to popgenie.org for in-depth analysis of gene expression.
First we need to use the Filter tool (Filter and Sort > Filter, with the condition c14=="yes") to keep only the significant genes from the Cuffdiff gene expression results.
Then we need to use the Cut tool to keep only the gene column (Text Manipulation > Cut > c3).
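The filter and cut steps amount to a simple pass over a tab-separated table, which can be sketched in Python as follows. This is an illustration only: the function name `significant_genes` is made up, and the column positions (flag in column 14, gene in column 3) are taken from the filter and cut settings above rather than from any particular Cuffdiff version.

```python
def significant_genes(rows, flag_col=14, gene_col=3):
    """Return the gene column of rows whose flag column equals "yes".

    Columns are 1-based, matching Galaxy's c14 / c3 notation.
    """
    genes = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < max(flag_col, gene_col):
            continue  # skip short or malformed rows
        # A header row is dropped too, since its flag is not "yes".
        if fields[flag_col - 1] == "yes":
            genes.append(fields[gene_col - 1])
    return genes


if __name__ == "__main__":
    # Made-up 14-column rows; only columns 3 and 14 matter here.
    def make_row(gene, flag):
        fields = ["."] * 14
        fields[2] = gene
        fields[13] = flag
        return "\t".join(fields) + "\n"

    table = [make_row("gene", "significant"),
             make_row("geneA", "yes"),
             make_row("geneB", "no")]
    print(significant_genes(table))
```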
Now we can send the Galaxy output data to PopGenIE (Send Data > Popgenie GO Tools).