RNA-Seq Analysis with Grosmannia clavigera – Part 1, Background

INTRO

T-Bioinfo is a flexible bioinformatics computational system that provides an intuitive, user-friendly interface for building pipelines of bioinformatics algorithms for the analysis of big data.

GROSMANNIA CLAVIGERA: EXAMPLE OF RNA-SEQ DATA ANALYSIS

Identifying Regulation Pathways for Monoterpene Response

The fungus Grosmannia clavigera spreads to pine trees via the mountain pine beetle. These beetles are able to mine the tree and lay eggs while avoiding the tree’s defenses.

Total cumulative losses from the mountain pine beetle are projected to be 752 million cubic metres (58%) of the merchantable pine volume by 2017.

The spores of the blue stain fungus germinate and produce a thread-like mass that colonizes the phloem and sapwood, overcoming the tree’s defenses. In this tutorial, we will investigate (i) how G. clavigera survives under the toxic conditions created by monoterpenes and (ii) which genes are involved in removing monoterpenes and consuming them as a carbon source.

G. clavigera was grown under differing conditions and sequenced in order to identify the enzymatic pathways involved in monoterpene detoxification and utilization as an energy source.

These samples were sequenced on an Illumina sequencer, which produces reads in FASTQ format. The reads reflect which genes were expressed in each sample.
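To make the FASTQ format concrete, here is a minimal sketch of a reader. It assumes the common four-line layout (header starting with "@", sequence, a "+" separator line, quality string) and Phred+33 quality encoding, which is typical of recent Illumina output but is not stated in this tutorial. The names FastqRecord, parse_fastq and phred33_scores are our own, and the two reads are made-up examples.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if header == "":
            return  # end of file
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred33_scores(quality: str) -> List[int]:
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality]


# Two made-up reads, for illustration only.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
records = list(parse_fastq(io.StringIO(example)))
for record in records:
    print(record.read_id, record.sequence, phred33_scores(record.quality))
```

Each read carries its own per-base quality scores, which is what the error correction stage described later can draw on.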

G. clavigera wild-type and the ABC transporter mutant were grown on two media. The first medium was malt extract agar (MEA), an enriched medium containing the full set of nutrient sources needed for successful fungal growth. The second medium was Yeast Nitrogen Base (YNB), a minimal medium lacking both the carbon source and the amino acids necessary for growth. Both media were used to grow G. clavigera with and without the addition of monoterpenes.
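The design described above can be written out as a small table of conditions. This is a sketch that assumes a full factorial layout (every strain on every medium, with and without monoterpenes); the tutorial does not say whether every combination was actually sequenced, and time points are left out here. The variable and function names are our own.

```python
from itertools import product

STRAINS = ["wild-type", "ABC transporter mutant"]
MEDIA = ["MEA", "YNB"]
MONOTERPENES = [False, True]


def build_design():
    """Enumerate every strain x medium x monoterpene combination."""
    design = []
    for strain, medium, treated in product(STRAINS, MEDIA, MONOTERPENES):
        sign = "+" if treated else "-"
        design.append({
            "strain": strain,
            "medium": medium,
            "monoterpenes": treated,
            "label": f"{strain} | {medium} | {sign}monoterpenes",
        })
    return design


design = build_design()
for condition in design:
    print(condition["label"])
```

Writing the conditions down this way makes it easier to decide later which pairs of samples to compare in the differential expression step.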

UNDERSTANDING RNA-SEQ AND CHOOSING THE RIGHT APPROACH TO DATA ANALYSIS

RNA-seq analysis consists of several stages that prepare and analyze the data. Several pipelines could be used for our example. Let’s look at each stage in detail.

ERROR CORRECTION

Several error correction algorithms are available on the T-Bioinfo platform. Each one analyzes the data for repeats and unique regions to identify errors in the sequencing data.
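One common way to use repeated versus unique content for error detection is a k-mer count: a short subsequence that appears in many reads is probably real, while one that appears only once may contain a sequencing error. The sketch below illustrates that general idea only. It is not the algorithm used by the platform, and the reads, the value of k and the min_count threshold are all made up.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(read, counts, k, min_count=2):
    """Return start positions of k-mers in the read seen fewer than min_count times."""
    return [
        i for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    ]


# Three identical reads and one read with a single substitution (C -> G at index 5).
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "ACGTAGGTAC"]
counts = count_kmers(reads, k=4)
print(suspicious_positions(reads[3], counts, k=4))
print(suspicious_positions(reads[0], counts, k=4))
```

The rare k-mers in the last read all overlap the substituted base, which is how the error can be localized.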

MAPPING ON GENOME

RNA transcription data shows which genes are active in the organism. The reads are aligned to the corresponding regions of the genome. The mapping stage uses various algorithms to detect putative exons and genes by mapping exonic reads onto the genome.
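As a toy illustration of mapping, the sketch below finds exact matches of a read in a genome using a k-mer index, then merges overlapping alignments into contiguous covered blocks that stand in for putative exons. Real mappers tolerate mismatches and are far more efficient; the sequences, the value of k and the function names here are made up for illustration.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every length-k substring of the genome to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k):
    """Return 0-based positions where the read matches the genome exactly."""
    hits = []
    for start in index.get(read[:k], []):
        if genome[start:start + len(read)] == read:
            hits.append(start)
    return hits


def putative_exons(alignments):
    """Merge half-open (start, end) alignments into contiguous covered blocks."""
    blocks = []
    for start, end in sorted(alignments):
        if blocks and start <= blocks[-1][1]:
            blocks[-1][1] = max(blocks[-1][1], end)
        else:
            blocks.append([start, end])
    return [tuple(block) for block in blocks]


genome = "TTGACCGTAGGCTAACGT"
index = build_index(genome, k=4)
print(map_read("CCGTAGG", genome, index, k=4))
print(putative_exons([(0, 5), (3, 8), (20, 25)]))
```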

MAPPING ON JUNCTIONS

Next we need to locate the exon junctions, the points where two exons are joined after the intron between them has been removed. In this section, we find links between exons using spliced reads and paired reads. Thus, we arrive at putative isoforms, or splice variants.
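A spliced read is one whose first part matches the end of one exon and whose second part matches the start of another, with a gap (the intron) between them on the genome. The sketch below shows the idea with exact string matching. The function name find_junction, the min_anchor parameter and the sequences are our own, and a real pipeline would first try to map the read contiguously before looking for a split.

```python
def find_junction(read, genome, min_anchor=3):
    """Return (left_end, right_start) if the read splits across a gap, else None.

    Coordinates are 0-based; the gap occupies genome[left_end:right_start].
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            left_end = left_pos + len(left)
            # Require at least one skipped base between the two parts.
            right_pos = genome.find(right, left_end + 1)
            if right_pos != -1:
                return (left_end, right_pos)
            left_pos = genome.find(left, left_pos + 1)
    return None


# Made-up genome: exon "ATGGCC", a 12-base intron, then exon "GATTCA".
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GATTCA"
spliced_read = "GGCC" + "GATT"  # spans the junction between the two exons

print(find_junction(spliced_read, genome))
print(find_junction("CCCCCCCC", genome))
```

Each junction found this way becomes a link between two exons, which is the input to the isoform detection stage.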

DETECTION OF ISOFORMS

The next step is to detect isoforms as paths across links. We have two options: to use exon junctions or to use paired-read links. Once the isoforms are constructed, we can proceed to measure expression.
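Treating isoforms as paths across links can be sketched as a small graph problem: exons are nodes, each junction or paired-read link is a directed edge, and every path from a first exon to a last exon is a candidate isoform. The example below assumes the links form a graph without cycles. The exon names and links are made up, and real assemblers also weigh read support when choosing among paths.

```python
def enumerate_isoforms(links):
    """List every path from an exon with no incoming link to one with no outgoing link.

    links maps an exon name to the exons it is linked to downstream.
    """
    downstream = {exon for targets in links.values() for exon in targets}
    exons = set(links) | downstream
    starts = sorted(exons - downstream)
    isoforms = []

    def extend(path):
        next_exons = links.get(path[-1], [])
        if not next_exons:
            isoforms.append(path)
            return
        for exon in next_exons:
            extend(path + [exon])

    for start in starts:
        extend([start])
    return isoforms


# E2 is skipped in one variant: E1 links to both E2 and E3.
links = {"E1": ["E2", "E3"], "E2": ["E3"], "E3": []}
isoforms = enumerate_isoforms(links)
for isoform in isoforms:
    print(" -> ".join(isoform))
```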

EXPRESSION OF GENES AND ISOFORMS

With the isoforms constructed, we can measure how many reads support each gene and isoform, and so interpret the transcription data as the regions of genes that are actively used in the cell.
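Raw read counts depend on both transcript length and sequencing depth, so expression is usually reported in normalized units. One widely used unit for single-end data is RPKM (reads per kilobase of transcript per million mapped reads). The sketch below computes it from made-up numbers; the tutorial does not say which unit the platform reports, so treat this only as an illustration of the normalization.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("Length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
value = rpkm(read_count=500, transcript_length_bp=2000, total_mapped_reads=10_000_000)
print(value)
```

Doubling either the transcript length or the sequencing depth halves the value, which is what makes samples and genes comparable.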

DIFFERENTIAL EXPRESSION

Differential expression compares several conditions, seeking to link changes in gene expression to a specific factor.
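The simplest measure of a difference between two conditions is the log2 fold change. The sketch below computes it with a pseudocount to avoid dividing by zero. This is only the effect size: dedicated tools also test whether the difference is statistically significant. The gene names and expression values are made up.

```python
import math


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 ratio of treated to control expression, with a pseudocount."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


# Hypothetical expression values: (without monoterpenes, with monoterpenes).
expression = {
    "gene_A": (7.0, 31.0),
    "gene_B": (10.0, 10.0),
    "gene_C": (15.0, 3.0),
}
changes = {
    gene: log2_fold_change(control, treated)
    for gene, (control, treated) in expression.items()
}
for gene, change in sorted(changes.items(), key=lambda item: -abs(item[1])):
    print(gene, change)
```

A value of 2 means a four-fold increase in the treated condition, and a value of -2 means a four-fold decrease.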

PLATFORM OVERVIEW

Now that we have looked at the analysis conceptually, let’s see how data acquired from sequenced biological samples is analyzed on the T-Bioinfo platform to understand transcriptomic regulation in the G. clavigera response to toxic monoterpene environments over time.

First, the correct data types are selected. This prepares the platform to offer the pipelines that work with those data types. Transcription data and the reference genome have to be identified by type.

The second step is building a pipeline suitable to the data and goals of analysis. The platform will suggest possible steps each time a module is selected, preventing the user from building a pipeline unable to handle the data.
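The idea of suggesting only valid next steps can be sketched as a lookup table of allowed transitions. The table below is hypothetical: it lists only the modules used in the walkthrough later in this tutorial, while the real platform offers more options at most steps and its internal rules are not documented here.

```python
# Hypothetical transition table limited to the modules named in this tutorial.
ALLOWED_NEXT = {
    "data input": ["TopHat"],
    "TopHat": ["Cufflinks"],
    "Cufflinks": ["Cuffmerge"],
    "Cuffmerge": ["Cuffdiff"],
    "Cuffdiff": ["data output"],
    "data output": [],
}


def next_options(module):
    """Modules that may follow the given one (the highlighted buttons)."""
    return ALLOWED_NEXT.get(module, [])


def validate_pipeline(steps):
    """Return True if the pipeline starts at data input and every step is allowed."""
    if not steps or steps[0] != "data input":
        return False
    for current, following in zip(steps, steps[1:]):
        if following not in next_options(current):
            return False
    return True


pipeline = ["data input", "TopHat", "Cufflinks", "Cuffmerge", "Cuffdiff", "data output"]
print(validate_pipeline(pipeline))
print(validate_pipeline(["data input", "Cuffdiff"]))
```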

Next, the data has to be uploaded. The upload options become active once the pipeline is built. From this stage on, the pipeline can no longer be modified.

You will now have the option to run the pipeline. When you press the green button, your pipeline is prepared to run and is placed in the queue. You will be able to monitor its progress and will be notified of its current status by email.

SCREEN RECORDING

The first thing we will do as we go about the analysis is look at this section on top.

As a default, this section works with an annotated model genome. You will see several options of genomes that are pre-loaded on the platform. In our case, we will be using the RGenome function, meaning that the genome is not annotated. The genome file will have to be uploaded at the end of this process.

The data format we will use is FASTQ, and we will use single-end data.

In order to upload two datasets, corresponding to the 2 experiments that we will compare, we will have to use 2 groups and specify that each group contains 1 file.

Now we are ready to select our pipeline. The highlighted buttons are the options available after a module is selected. We can start by pressing “data input”.

For now, we can ignore the specific parameters for these modules and keep the preset values. Click save after seeing this pop-up.

Let’s take a look at the T-Bioinfo interface again in detail:

Every time a button is pressed, you will see an informational pop-up that also allows you to modify some parameters for the corresponding algorithm.

As you can see, once we complete data input, a number of options become available. We will now select TopHat, a mapping algorithm.

Once we select TopHat, one primary option is available besides visualization. We will select Cufflinks, since we are interested in differential expression. This algorithm builds isoforms for our data.

Once the isoforms are built, we need to integrate the data into a single GTF file. This is done with Cuffmerge.

Now we have 2 options, one related to expression and the other to differential expression. We will select Cuffdiff, which will compare the expression data.

Once we are finished, we can click on “data output” and proceed to upload data.
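For readers who want to see what this pipeline looks like outside the platform, the sketch below assembles the equivalent command lines for the four tools without running them. How T-Bioinfo invokes the tools internally is not described in this tutorial, so this is an assumption based on the tools' standard usage. All file and directory names are hypothetical, TopHat expects a prebuilt Bowtie index, and the assemblies.txt file (a list of the Cufflinks transcripts.gtf paths) would have to be written before running Cuffmerge.

```python
import shlex


def build_commands(genome_index, genome_fasta, reads_by_group, out_dir="results"):
    """Assemble TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff commands as argument lists."""
    commands = []
    bam_files = []
    for group, reads in reads_by_group.items():
        tophat_out = f"{out_dir}/tophat_{group}"
        # Map the reads; TopHat writes accepted_hits.bam into its output directory.
        commands.append(["tophat", "-o", tophat_out, genome_index, reads])
        bam = f"{tophat_out}/accepted_hits.bam"
        bam_files.append(bam)
        # Assemble isoforms for this group; Cufflinks writes transcripts.gtf.
        commands.append(["cufflinks", "-o", f"{out_dir}/cufflinks_{group}", bam])
    # Merge the per-group assemblies listed in assemblies.txt into one GTF file.
    commands.append([
        "cuffmerge", "-s", genome_fasta, "-o", f"{out_dir}/merged",
        f"{out_dir}/assemblies.txt",
    ])
    # Compare expression between the groups using the merged annotation.
    commands.append([
        "cuffdiff", "-o", f"{out_dir}/cuffdiff",
        f"{out_dir}/merged/merged.gtf", *bam_files,
    ])
    return commands


# Hypothetical file names for the two groups being compared.
commands = build_commands(
    genome_index="genome_index",
    genome_fasta="genome.fa",
    reads_by_group={"groupA": "groupA.fastq", "groupB": "groupB.fastq"},
)
for command in commands:
    print(shlex.join(command))
```

With two groups this produces six commands: one TopHat and one Cufflinks run per group, followed by a single Cuffmerge and a single Cuffdiff run.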

To upload NGS data, you can either store the files on your computer and upload through this dialogue window, or copy the download links (such as from the SRA database) into a url.plst file with links. The other file that needs to be uploaded is the Genome file, under “Upload References”. The files can now be seen in the window.

Now the pipeline is ready to be run, so we can select the “Run Pipeline” button. The pipeline will then be listed under “my pipelines”, where its status is shown as a percentage of the job completed.

The differential expression results are returned as a table. Several “data mining” algorithms can then be used to find correlations between transcription changes, across all genes that show variation, and experimental factors such as the presence of a carbon source, the presence of monoterpenes, and time. We performed factor analysis on the data in this table to see these changes. This statistical analysis still needs to be interpreted from a biological standpoint.

Our next step will be to compare the results from our analysis against the results published by the authors of the research paper. Next, let’s look at the biological side, specifically at the enzymatic pathways utilized by the G. clavigera cell in these different conditions.

When monoterpenes are added to the MEA medium, membrane remodeling, stress proteins, and transporters become most active. Monoterpenes can alter the structure of the cell membrane, making it “leaky”. To survive, the cell must continuously rebuild its membrane.

ABC transporters export monoterpenes from inside the cell to the outside.

Monoterpenes can alter the native protein structures within the cell. Stress proteins, such as heat shock proteins, assist with the proper folding and stability of proteins within the cell.

Lastly, the genome of G. clavigera showed an increase in transposable elements. This rearrangement of the genome may suggest that the fungus is adapting to stressful conditions.

In the absence of monoterpenes, little to no differences in expression are seen when comparing the wild type and mutant.

With the addition of monoterpenes, the expression profiles of the mutant strain, which lacks a vital ABC transporter, differ from those of the wild type. It is likely that the monoterpene concentration was much higher within the mutant cells, as the mutant is less capable of removing monoterpenes from the cell.

Monoterpenes are destined for utilization within the mitochondria. An up-regulation of the metabolism associated with monoterpene utilization can be observed at this point.

Overall, G. clavigera is capable of catabolizing monoterpenes as a food source, but this pathway is used less when other nutrients are available.

Thank you for watching this example of an RNA-seq pipeline applied to G. clavigera samples to detect genes responsible for the detoxification and utilization of monoterpenes.

To run your RNA-seq pipeline, go to T-BioInfo.com