Published Pages | chanaka | RNASeq Analysis Lab step by step

Galaxy provides multiple tools for performing RNA-seq analysis. This exercise introduces these Galaxy tools (TopHat -> Cufflinks -> Cuffcompare -> Cuffdiff) and guides you through using them to carry out an exercise similar to the RNASeq Analysis Lab. The tutorial helps you become familiar with Galaxy tools and Galaxy workflows, and gives a basic understanding of RNA-seq analysis.

Here are the sample datasets that we are going to use:

RNA sample data from leaves:

Galaxy Dataset | asp5_leaf_read1.P001.fq

Aspen leaf paired-end reads (2 x 50 bp); the target insert size was 200 bp.

Galaxy Dataset | asp5_leaf_read2.P001.fq

Aspen leaf paired-end reads (2 x 50 bp); the target insert size was 200 bp.

RNA sample data from xylem (woody tissue):

Galaxy Dataset | asp5_xylem_read1.P001.fq

Aspen xylem paired-end reads (2 x 50 bp); the target insert size was 200 bp.

Galaxy Dataset | asp5_xylem_read2.P001.fq

Aspen xylem paired-end reads (2 x 50 bp); the target insert size was 200 bp.

Additional input file

We can copy the following input file into the existing history from the shared data libraries (Shared Data > Data Libraries > Ptrichocarpa_129_gene > select Ptrichocarpa_129_gene.gtf > Import to current history > Go). We will use this Ptrichocarpa_129_gene.gtf dataset with the Cuffcompare tool under NGS: RNA Analysis (Use Reference Annotation: Yes; Reference Annotation: select Ptrichocarpa_129_gene.gtf) to identify predicted transcripts in common between the two samples.

Galaxy Dataset | Ptrichocarpa_129_gene.gtf


Step 1: Quality Control and Filtering.

To understand the workflow and the Galaxy tools, we are going to extract a random 1% of the reads (the whole datasets are comparatively large and would take significant run time to process). We use the Random Fastq extract tool to extract 1% of random entries from a pair of FASTQ files. Each entry, that is, each sequence, spans four lines, as shown below.

@SOLEXA14:2:1:10:263#0/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SOLEXA14:2:1:10:263#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
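The idea behind extracting a random subset from a pair of FASTQ files can be sketched in a few lines of Python. This is only an illustration of the principle, not the code of the Galaxy tool: the function names `read_fastq` and `sample_pairs` are made up for this example, and each pair is kept with a probability of 1%, so the result is approximately (not exactly) 1% of the reads. The important point is that both mates of a pair are kept or dropped together, so the two output files stay in the same order.

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-tuples: (header, sequence, '+' line, quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def sample_pairs(read1_lines, read2_lines, fraction=0.01, seed=0):
    """Keep each read pair with probability `fraction`; mates stay in sync."""
    rng = random.Random(seed)
    kept1, kept2 = [], []
    for rec1, rec2 in zip(read_fastq(read1_lines), read_fastq(read2_lines)):
        if rng.random() < fraction:
            kept1.append(rec1)
            kept2.append(rec2)
    return kept1, kept2


# Small made-up input: 1000 read pairs with dummy sequences and qualities.
demo_read1, demo_read2 = [], []
for i in range(1000):
    demo_read1 += [f"@read{i}/1", "ACGT", "+", "IIII"]
    demo_read2 += [f"@read{i}/2", "TGCA", "+", "IIII"]

subset1, subset2 = sample_pairs(demo_read1, demo_read2, fraction=0.01, seed=42)
print(len(subset1), "pairs kept out of 1000")
```

With real files, the two `*_lines` arguments could be open file handles for the read1 and read2 FASTQ files, as long as the files do not end with extra blank lines.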

Now we have a pair of FASTQ files for both xylem and leaves. FASTQ files come in various quality formats, so we use the FASTQ Groomer tool to convert them into the Sanger standard format. We then run the FastQC tool on the Sanger-formatted pairs of FASTQ files to spot problems in our data that we should be aware of before doing any further analysis. It is also important to trim our data (for instance, 10 to 90) before we start the analysis. We can use the Trim sequences tool to cut bases from the sequences in a FASTQ file.
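To make the conversion and trimming steps more concrete, here is a minimal Python sketch of what they do to a single record. It assumes the input qualities are Phred scores with an ASCII offset of 64 (as written by Illumina 1.3+ pipelines), which is an assumption about the data and not something stated above; FASTQ Groomer itself handles more input variants. The function names `illumina_to_sanger` and `trim_record` are hypothetical and are not part of Galaxy.

```python
def illumina_to_sanger(quality):
    """Convert a Phred+64 quality string to the Sanger (Phred+33) encoding."""
    scores = [ord(ch) - 64 for ch in quality]
    if min(scores, default=0) < 0:
        raise ValueError("quality string does not look Phred+64 encoded")
    return "".join(chr(score + 33) for score in scores)


def trim_record(header, sequence, plus, quality, first, last):
    """Keep bases first..last (1-based, inclusive) of one FASTQ record."""
    return header, sequence[first - 1:last], plus, quality[first - 1:last]


# 'B' is Phred score 2 in the +64 encoding and becomes '#' in Sanger.
print(illumina_to_sanger("BBBB"))

# Keep bases 2 to 5 of a made-up 8-base read.
print(trim_record("@r1", "ACGTACGT", "+", "IIIIHHHH", first=2, last=5))
```

Sequence and quality are always sliced with the same coordinates, because each quality character belongs to the base at the same position.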

QC and Filtering

Workflow for Quality Control and Filtering

Quality Control and Filtering workflow Results

Step 2: TopHat summary

Now we use the step 1 data (the QC and filtered results) with TopHat. We use TopHat to map the RNA-seq reads to the reference genome and find splice junctions. TopHat aligns RNA-seq reads to our reference genome using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

Tophat Summary

Tophat summary workflow

Tophat summary workflow results

Step 3: Cufflinks, Cuffcompare and Cuffdiff

Once we have mapped the reads with TopHat (the step 2 results), we need to assemble the reads into complete transcripts that can be analyzed for differential expression. Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples. We therefore use the TopHat BAM files as inputs for Cufflinks to perform de novo transcript assembly. We can then find transcripts that have multiple exons by looking at the assembled transcripts dataset.

Cuffcompare helps analyze the transcribed fragments (transfrags) in an assembly by comparing assembled transcripts to a reference annotation and tracking Cufflinks transcripts across multiple experiments (e.g., across a time course). Cuffcompare produces a combined transcripts dataset that is needed in the next steps.

Cuffdiff takes a GTF2/GFF3 file of transcripts as input, along with two or more SAM files containing the fragment alignments for two or more samples. It produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes.
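Looking for transcripts with multiple exons in the Cufflinks assembled transcripts dataset can also be done programmatically. The following Python sketch counts the exon lines per transcript_id in a GTF file and lists the transcripts with more than one exon. The function names and the example lines are made up for illustration, and the sketch assumes a tab-separated GTF in which each exon line carries a transcript_id attribute.

```python
import re
from collections import Counter

# Matches the transcript_id attribute in the ninth GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exon_counts(gtf_lines):
    """Count the exon features per transcript_id in GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def multi_exon_transcripts(gtf_lines):
    """Return the sorted IDs of transcripts that have more than one exon."""
    return sorted(t for t, n in exon_counts(gtf_lines).items() if n > 1)


# Made-up example: one two-exon transcript and one single-exon transcript.
example_gtf = [
    'chr1\tCufflinks\ttranscript\t100\t900\t1000\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1";',
    'chr1\tCufflinks\texon\t100\t300\t1000\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1";',
    'chr1\tCufflinks\texon\t600\t900\t1000\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "2";',
    'chr1\tCufflinks\texon\t2000\t2500\t1000\t-\t.\tgene_id "CUFF.2"; transcript_id "CUFF.2.1"; exon_number "1";',
]
print(multi_exon_transcripts(example_gtf))
```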

Tophat summary workflow

Tophat summary workflow results

Step 4: Export data to PopGenIE

We can send the Cuffdiff gene expression results to popgenie.org for in-depth analysis of gene expression.

Now we use the Filter tool (Filter and Sort > Filter > c14=='yes') to keep only the significant genes from the above list.

We then use the Cut tool to keep only the gene column (Text Manipulation > Cut > c3).
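The Filter and Cut steps amount to keeping the rows whose 14th column is "yes" and then keeping only the 3rd column. As a rough Python sketch of the same logic (the function name `significant_genes` and the example rows are made up; the column positions are the ones used in the steps above):

```python
def significant_genes(tabular_lines, flag_column=14, gene_column=3):
    """Return the gene column of every row whose flag column equals 'yes'.

    Column numbers are 1-based, like c14 and c3 in Galaxy.
    """
    genes = []
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < max(flag_column, gene_column):
            continue  # skip short or empty lines
        if fields[flag_column - 1] == "yes":
            genes.append(fields[gene_column - 1])
    return genes


def make_row(gene, significant):
    """Build a made-up 14-column row; unused columns are filled with '-'."""
    fields = ["-"] * 14
    fields[2] = gene          # column 3: gene
    fields[13] = significant  # column 14: significance flag
    return "\t".join(fields)


example_rows = [
    make_row("geneA", "yes"),
    make_row("geneB", "no"),
    make_row("geneC", "yes"),
]
print(significant_genes(example_rows))
```

A header row is skipped automatically as long as its 14th field is not the literal text "yes".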

Now we can send the Galaxy output data to PopGenIE (Send Data > Popgenie GO Tools).

In progress