RNA-Seq Data Analysis – Arabidopsis thaliana

Arabidopsis thaliana is a well-studied organism that is often used as a model organism in plant biology. Some of its important advantages include:

-Small size of genome
-Short generation time
-Large number of offspring
-Small plant size

As a result, many research projects use this plant to generate Next Generation Sequencing (NGS) data, contributing to a shared knowledge base that includes:

-Raw Genome and Transcriptome reads
-Complete genome sequence
-Genome structure and its annotation
-Gene product (protein) information
-Expression of genes and isoforms
-Genome maps

One of the most active areas in terms of RNA transcription is the specialization of the meristem tissues during flowering. Flowering is initiated in the meristem at the level of gene pathways. The classical ABC model postulates a group of genes that encode the transcription factors needed to turn on the genes for sepal, petal, stamen and carpel development. These master switches fall into three groups: A, B and C. The switches turn on pathways that cause a cell to develop into a specific tissue type. The proteins encoded by the ABC genes are transcription factors: they bind to DNA as dimers and recruit RNA polymerase to start gene expression. In this way the ABC genes activate the expression of other genes and cause cells of the meristem to form the different parts of the flower.

Further research suggests a more complex regulation of tissue development at the transcriptome and translatome levels. As an example, we will follow the project of Professor Meyerowitz, who published an article on Arabidopsis flowering called “Cell-type specific analysis of translating RNAs in developing flowers reveals new levels of control”. In this article Professor Meyerowitz and co-authors compare expression in the transcriptome and translatome of several meristem tissues that are under development across stages of flowering in Arabidopsis thaliana. The goal was to identify additional levels of regulation, beyond the transcriptome, that are involved in flowering.

The main difference between transcription and translation in the cell is that transcription occurs within the nucleus, where mRNA is copied from DNA. The transcripts are then transported to the cytoplasm, where translation occurs as the ribosome generates proteins from the mRNA. While the gene sequences determine which mRNA is produced, various other mechanisms influence which proteins are produced.

To prepare the data, mRNA was isolated from total RNA by poly-A selection; we call this the transcriptome. At the same time, immunoprecipitation was used to isolate mRNA undergoing translation; we call this the translatome.

Expression levels were compared in three meristem tissues in order to identify known and new tissue-specific active transcripts. The transcriptome and translatome were also compared with respect to post-transcriptional processing such as intron splicing.

All the available data for this project is publicly accessible in the SRA database.

In order to analyze data from this project on the T-BioInfo platform we need to work with three data sets: the assembled chromosomes of the annotated genome, its annotation (.gtf) file, and the NGS reads of the transcriptome samples.
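As a rough illustration of what an annotation .gtf file carries, the sketch below parses a few GTF-style records and groups transcript IDs by gene. The sample records, the gene and transcript identifiers, and the helper names `parse_attributes` and `transcripts_per_gene` are made up for illustration; they are not taken from the real Arabidopsis annotation or from the T-BioInfo platform.

```python
import re
from collections import defaultdict

# Hypothetical GTF records (9 tab-separated columns); not real annotation data.
SAMPLE_GTF = (
    'Chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "GENE_A.1";\n'
    'Chr1\texample\texon\t300\t400\t.\t+\t.\tgene_id "GENE_A"; transcript_id "GENE_A.1";\n'
    'Chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "GENE_A.2";\n'
    'Chr1\texample\texon\t900\t990\t.\t-\t.\tgene_id "GENE_B"; transcript_id "GENE_B.1";\n'
)

# Matches attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTRIBUTE.findall(field))


def transcripts_per_gene(gtf_text):
    """Group the transcript IDs seen on exon records by their gene ID."""
    genes = defaultdict(set)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.split("\t")
        if len(columns) != 9 or columns[2] != "exon":
            continue  # keep only well-formed exon records
        attributes = parse_attributes(columns[8])
        genes[attributes["gene_id"]].add(attributes["transcript_id"])
    return {gene: sorted(ids) for gene, ids in genes.items()}


isoforms = transcripts_per_gene(SAMPLE_GTF)
for gene, transcript_ids in sorted(isoforms.items()):
    print(gene, transcript_ids)
```

A gene with more than one transcript ID in this mapping has more than one annotated isoform, which is the information the isoform-level comparison relies on.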

Comparing datasets:

For our purposes we are going to compare the flowering stage 4 translatome with the stages 6-7 translatome, using the 12 tissue-specific translatome samples. Pools of transcriptome and translatome from the entire meristem are also cross-compared across the flowering stages. First, we find the expression of isoforms in the transcriptome and translatome samples. Then we compare the transcriptome and translatome for stage 4, where we look for divergent profiles of isoforms of the same gene, and compare the detected expression profiles with the authors' results. After that we compare the translatome between stage 4 and stages 6-7. Finally, we find specific isoform profiles for the different tissues under the regulation of the transcription factors AP1, AP3 and AG.
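One simple way to think about "divergent profiles of isoforms of the same gene" is to convert isoform expression into fractions of the gene total and compare those fractions between transcriptome and translatome. The sketch below does this under stated assumptions: the expression values, the identifiers, the function names and the 0.25 threshold are all hypothetical, and this is not the method used by the platform or the authors.

```python
def isoform_fractions(expression):
    """Convert isoform expression values for one gene into fractions of the gene total."""
    total = sum(expression.values())
    if total == 0:
        return {name: 0.0 for name in expression}
    return {name: value / total for name, value in expression.items()}


def profile_divergence(transcriptome, translatome):
    """Largest absolute difference in isoform fraction between the two samples."""
    a = isoform_fractions(transcriptome)
    b = isoform_fractions(translatome)
    return max(abs(a.get(name, 0.0) - b.get(name, 0.0)) for name in set(a) | set(b))


# Hypothetical isoform expression values for two genes (arbitrary units).
transcriptome = {
    "GENE_A": {"GENE_A.1": 80.0, "GENE_A.2": 20.0},
    "GENE_B": {"GENE_B.1": 50.0, "GENE_B.2": 50.0},
}
translatome = {
    "GENE_A": {"GENE_A.1": 30.0, "GENE_A.2": 70.0},
    "GENE_B": {"GENE_B.1": 26.0, "GENE_B.2": 24.0},
}

THRESHOLD = 0.25  # arbitrary cut-off chosen for this illustration

divergent = sorted(
    gene for gene in transcriptome
    if profile_divergence(transcriptome[gene], translatome[gene]) > THRESHOLD
)
print(divergent)
```

Working with fractions rather than raw values separates a change in isoform usage from a change in the overall expression level of the gene.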

RNA-Seq Logical Graph

To analyze the sequence data we will follow this logical graph of RNA-Seq analysis. It takes into account that various kinds of data can be used at the start, such as annotated and unannotated genomes, as well as data that requires correction.

Once we have an annotated genome with a .gtf file, we will proceed to detect exons, map them on junctions and construct isoforms. These isoforms will allow us to see which gene isoforms are expressed and to compare stages of flowering as well as tissues.

Data Pools

As a first step we are going to compare the stage 4 translatome pool of tissue samples with the stage 4 transcriptome pool. After that we will compare the stage 4 and stages 6-7 transcriptome data.

The T-BioInfo platform combines multiple analysis algorithms into pipelines. These pipelines are built interactively, showing a limited selection of options based on user preference, the type of data and the input/output specification of each algorithm. For this project we will be using two pipelines: the first is Bowtie-i, IsoLasso, Cuffmerge, RSCM; the second is TopHat, Cufflinks, Cuffmerge, NRSEM. We will also compare these two methods in our analysis.

After the analysis is complete we want to prepare a visual representation of the results using principal component analysis (PCA).

Principal component analysis is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance, that is, it accounts for as much of the variability in the data as possible. Each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e. uncorrelated with) the preceding components. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric.

PCA is sensitive to the relative scaling of the original variables.
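The definition above, and the sensitivity to scaling, can be sketched in a few lines of dependency-free Python. The example builds the covariance matrix, extracts its leading eigenvectors, and runs PCA on the same toy data before and after standardization. Power iteration with deflation is used here only because it is short and needs no external library; it is an assumption of this sketch, not the procedure used by the T-BioInfo platform. The data values and function names are hypothetical.

```python
import math


def center(rows):
    """Subtract the column mean from every value."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in rows]


def standardize(rows):
    """Center each column and divide it by its sample standard deviation."""
    centered = center(rows)
    n, p = len(centered), len(centered[0])
    sds = [math.sqrt(sum(row[j] ** 2 for row in centered) / (n - 1)) for j in range(p)]
    return [[row[j] / sds[j] for j in range(p)] for row in centered]


def covariance(rows):
    """Sample covariance matrix of the columns (a symmetric p x p matrix)."""
    centered = center(rows)
    n, p = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    p = len(matrix)
    start = [float(i + 1) for i in range(p)]
    length = math.sqrt(sum(x * x for x in start))
    vector = [x / length for x in start]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    image = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v * w for v, w in zip(vector, image))  # Rayleigh quotient
    return eigenvalue, vector


def pca(rows, n_components=2):
    """Return (components, explained variance ratios) from the covariance matrix."""
    matrix = covariance(rows)
    p = len(matrix)
    total_variance = sum(matrix[i][i] for i in range(p))
    components, ratios = [], []
    for _ in range(n_components):
        eigenvalue, vector = leading_eigenpair(matrix)
        components.append(vector)
        ratios.append(eigenvalue / total_variance)
        # Deflation: remove the component just found so the next one is orthogonal to it.
        matrix = [[matrix[i][j] - eigenvalue * vector[i] * vector[j] for j in range(p)]
                  for i in range(p)]
    return components, ratios


# Hypothetical observations: the first variable is on a much larger scale than the second.
samples = [[1000.0, 1.0], [2000.0, 1.2], [3000.0, 0.9], [4000.0, 1.4], [5000.0, 1.1]]

raw_components, raw_ratios = pca(samples)
scaled_components, scaled_ratios = pca(standardize(samples))

print("raw PC1 loadings:   ", raw_components[0], "ratio:", raw_ratios[0])
print("scaled PC1 loadings:", scaled_components[0], "ratio:", scaled_ratios[0])
```

On the raw values the first component is dominated by the large-scale variable; after standardization both variables contribute equally. This is why expression values are usually brought to a comparable scale before PCA.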

The initial statistical analysis of the Meyerowitz tissue data took into consideration the translatome of three tissues at two stages and in two replicates, combined with the transcriptome and translatome pools and contrasted with the authors' data. It gave us this picture.

The PC1 and PC2 components, which together account for 55% of the total variance, do not show biological differentiation. They differentiate the T-BioInfo analysis from the authors' analysis and also reflect a divergence between replicates.

The third component, PC3, which accounts for 6.5% of the total variance, gave a very clear biological picture: the same pattern of regulation across tissues and stages in all three types of analysis (the authors', Cufflinks and IsoLasso) and in both replicates. Namely, the general pattern of isoform regulation is as follows.

1. AG tissue is regulated in the same way at the two time stages, 4 and 6-7. In stage 4 its regulation (up or down) is the same as that of AP3.
2. AG is regulated opposite to AP3 in stages 6-7.
3. AG is regulated opposite to AP1 in stages 4 and 6-7.
4. AP1 is regulated similarly in stages 4 and 6-7.
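The four statements above can be checked for consistency by encoding each tissue and stage as a relative direction, +1 or -1. This is only a reasoning aid: the overall sign is arbitrary (flipping every value gives the same picture), the encoding assumes regulation can be reduced to a single up/down direction, and the names `REGULATION` and `relation` are hypothetical.

```python
# Relative direction of regulation per (tissue, stage). Only agreement or
# disagreement between entries is meaningful; the overall sign is arbitrary.
REGULATION = {
    ("AG", "4"): +1,
    ("AG", "6-7"): +1,
    ("AP3", "4"): +1,
    ("AP3", "6-7"): -1,
    ("AP1", "4"): -1,
    ("AP1", "6-7"): -1,
}


def relation(first, second):
    """Return 'same' if two (tissue, stage) entries share a direction, else 'opposite'."""
    return "same" if REGULATION[first] == REGULATION[second] else "opposite"


# The four statements from the list.
print("1.", relation(("AG", "4"), ("AG", "6-7")), "/", relation(("AG", "4"), ("AP3", "4")))
print("2.", relation(("AG", "6-7"), ("AP3", "6-7")))
print("3.", relation(("AG", "4"), ("AP1", "4")), "/", relation(("AG", "6-7"), ("AP1", "6-7")))
print("4.", relation(("AP1", "4"), ("AP1", "6-7")))

# Relations that follow from the four statements under this encoding.
print("AP3 stage 4 vs stages 6-7:", relation(("AP3", "4"), ("AP3", "6-7")))
print("AP1 vs AP3 in stages 6-7: ", relation(("AP1", "6-7"), ("AP3", "6-7")))
```

Under this encoding the statements are mutually consistent, and they imply that AP3 changes direction between stage 4 and stages 6-7 while AP1 and AP3 agree in stages 6-7.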

As we have found, AG is opposite to AP1 in both stages. AG is similar to AP3 in stage 4 and opposite to AP3 in stages 6-7. The expression of the majority of isoforms in the pools looks as follows. Expression in the transcriptome differs from expression in the translatome, but not by much. Both average expression levels are coordinated with translatome expression in AP1 at both stages and with AP3 in stages 6-7.

We then carried out a more comprehensive analysis of isoforms for the pools of transcriptome and translatome, where a pool is all the samples from the meristem at stage 4 or at stages 6-7. We performed both methods of analysis. Expression profiles of genes and isoforms were combined, and these profiles were analyzed using PCA. The results of the PCA show several groups. These groups can be aligned by comparing genes and isoforms, transcriptome and translatome, the Cufflinks and IsoLasso pipelines, and the two replicates.

Characteristic profiles of the marginal distribution of combined gene and isoform profiles across transcriptome and translatome samples are shown on the PCA plane. These marginal patterns demonstrate that the variability of expression is driven by two factors: first, the difference between genes and isoforms, and second, the difference between transcriptome and translatome.

These graphs show the average profiles for the margins on the PCA plane.

Here we see combined profiles of genes and isoforms. The first half of each profile is the profile of a gene, and the second half is the profile of its isoform.
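The layout of such a combined profile can be sketched as a simple concatenation: the gene values across samples come first, followed by the isoform values over the same samples in the same order. The sample labels, the expression values and the function names below are hypothetical and only illustrate the structure.

```python
# Hypothetical sample order shared by both halves of a combined profile.
SAMPLES = ["transcriptome_stage4", "translatome_stage4",
           "transcriptome_stage6-7", "translatome_stage6-7"]


def combine_profiles(gene_profile, isoform_profile):
    """Concatenate a gene profile and one of its isoform profiles into one vector."""
    if len(gene_profile) != len(isoform_profile):
        raise ValueError("both profiles must cover the same samples")
    return list(gene_profile) + list(isoform_profile)


def split_profile(combined):
    """Recover the gene half and the isoform half of a combined profile."""
    half = len(combined) // 2
    return combined[:half], combined[half:]


# Hypothetical expression values, one per sample in SAMPLES.
gene_profile = [12.0, 10.5, 15.0, 14.0]
isoform_profile = [7.0, 3.5, 9.0, 8.5]

combined = combine_profiles(gene_profile, isoform_profile)
gene_half, isoform_half = split_profile(combined)
print(combined)
```

Because both halves follow the same sample order, a PCA over such vectors can pick up differences between a gene and its isoform as well as differences between transcriptome and translatome.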