RNA-Seq Educational Project - Whole Transcriptome Profiling of Cancer Tumors in Mouse PDX Models

Project Title: Difficulties in Tumor-Stroma Total RNA-seq

The “omics” field is expanding rapidly, likely driven by the plummeting cost of sequencing and sequencers. However, processing, analyzing, and interpreting “omics” data can take months and require large teams. Pine Biotech’s T-Bioinfo platform enables faster and easier analysis, integration, and visualization of big data sets, allowing for faster discovery.

Educational Project

In recent years, understanding the biology of cancer has become increasingly complex. The tumor-stroma interaction involves tens to hundreds of cell types and an even larger number of cellular interactions. Understanding and treating such a complex system requires “omics” data. A recent publication, “Whole transcriptome profiling of patient-derived xenograft models as a tool to identify both tumor and stromal specific biomarkers”, investigated the use of RNA-seq as a hypothesis-free tool to generate independent tumor and stromal biomarkers and to understand the interactions between tumor and stroma [1].

Pine Biotech Educational Goals

Our goal was to separate this study into educational sections that could:


  1. Demonstrate the power of RNA-seq to allow for simultaneous analysis of human tumors and tumor-stroma interactions
  2. Break complex concepts down into small, understandable projects with the goal of education.
  3. Demonstrate the strength of the T-Bioinfo Approach in RNA-seq Analysis.


From a complex data set, a focused data set was extracted to look at the transcriptional differences between the breast cancer subtypes and stromal activation in four different mouse models.

This data set was analyzed using unsupervised and supervised analysis methods on the T-Bioinfo platform. The goal was to identify differences in expression profiles and to select representative tumor genes that could be considered biomarker candidates. Additionally, we identified differences in the transcriptional stromal response to tumor type, as stromal cells play a large role in determining tumor malignancy. Together, these results show the importance of RNA-seq in building a baseline of human and mouse transcriptomes representing numerous cancer types.

This data set allows for significant insight into the specific expression of each breast cancer subtype. To show the importance of this study and the capabilities of the T-BioInfo platform, we followed the steps outlined in Figure 1.

Figure 1: T-BioInfo Analysis Pipeline

Total RNA-Seq

Whole-transcriptome analysis with total RNA sequencing detects coding sequences (genes and isoforms) as well as multiple forms of non-coding RNA.

Analysis using the T-Bioinfo Platform:

Pine Biotech-T-BioInfo Platform

Pine Biotech’s T-Bioinfo platform is a flexible, distributed bioinformatics computing system. It provides an intuitive, user-friendly interface for building pipelines to analyze big data. The platform combines analysis of heterogeneous data types and machine learning approaches with data integration and modeling.

Converting Complex Gene Lists into 2D Maps:

Principal Component Analysis (PCA)

Principal Component Analysis is a statistical method that projects multidimensional data into a lower-dimensional space. Put simply, it is a technique used to emphasize variation and bring out strong patterns in a data set. Figure 3 shows PCA plots before and after batch effect correction.
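As a rough illustration of the idea, the sketch below estimates the first principal component of a tiny expression matrix using power iteration in plain Python. The matrix values, the function names, and the choice of power iteration are assumptions made for this example; this is not the T-Bioinfo implementation, and real analyses would normally rely on an established numerical library.

```python
import math

def center_columns(rows):
    """Subtract each gene's mean so the analysis captures variation, not offsets."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]

def project(rows, axis):
    """Project every sample (row) onto a single axis."""
    return [sum(v * a for v, a in zip(row, axis)) for row in rows]

def first_principal_component(rows, iterations=100):
    """Estimate the direction of greatest variance by power iteration on X^T X."""
    centered = center_columns(rows)
    n_genes = len(centered[0])
    # Deterministic start vector; a start exactly orthogonal to the leading
    # direction would not converge, which is a known limit of this sketch.
    axis = [1.0 / math.sqrt(n_genes)] * n_genes
    for _ in range(iterations):
        scores = project(centered, axis)
        updated = [sum(s * row[j] for s, row in zip(scores, centered))
                   for j in range(n_genes)]
        norm = math.sqrt(sum(v * v for v in updated))
        if norm == 0.0:
            break
        axis = [v / norm for v in updated]
    return axis, project(centered, axis)

# Toy data (made up): 4 samples x 3 genes. Gene 3 is constant across samples.
samples = [
    [10.0, 2.0, 5.0],
    [11.0, 2.5, 5.0],
    [2.0, 9.0, 5.0],
    [1.0, 8.5, 5.0],
]
axis, scores = first_principal_component(samples)
print([round(s, 2) for s in scores])
```

In this toy matrix the first two samples and the last two samples land on opposite sides of zero along the first component, which is the kind of grouping a PCA plot makes visible. The constant gene contributes nothing to the component.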

Batch Effect Correction

Batch effect correction is needed when a batch effect is evident. Batch effects arise from technical sources of variation, for example when two technicians run separate subsets of an experiment [2].
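One of the simplest ways to picture a batch adjustment is per-batch mean-centering: each gene's batch mean is removed and the overall mean is added back. The sketch below assumes a small samples-by-genes table and hypothetical batch labels "A" and "B". It is only an illustration of the concept, not the correction method used on the T-Bioinfo platform, and it is much cruder than the dedicated methods evaluated in [2].

```python
from collections import defaultdict

def center_by_batch(expression, batches):
    """Shift every batch so its per-gene mean matches the overall per-gene mean.

    expression: list of samples, each a list with one value per gene.
    batches:    batch label for each sample, in the same order.
    """
    n_genes = len(expression[0])
    grand_mean = [sum(sample[g] for sample in expression) / len(expression)
                  for g in range(n_genes)]

    # Group the samples by batch label.
    by_batch = defaultdict(list)
    for sample, batch in zip(expression, batches):
        by_batch[batch].append(sample)

    batch_mean = {
        batch: [sum(sample[g] for sample in rows) / len(rows) for g in range(n_genes)]
        for batch, rows in by_batch.items()
    }

    # Caution: if batch is confounded with biology (for example, one subtype
    # per batch), this step also removes the biological signal.
    return [
        [value - batch_mean[batch][g] + grand_mean[g] for g, value in enumerate(sample)]
        for sample, batch in zip(expression, batches)
    ]

# Toy data (made up): gene 1 is shifted upward by 4 units in batch "B".
expression = [[5.0, 1.0], [7.0, 3.0], [9.0, 1.0], [11.0, 3.0]]
batches = ["A", "A", "B", "B"]
corrected = center_by_batch(expression, batches)
print(corrected)
```

After the adjustment, the two batches share the same mean for every gene, while the differences between samples within a batch are preserved.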

Factor Regression Analysis

Factor Regression Analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.

Figure 2: Factor Table


 Figure 3: PCA images before and after Batch Effect Correction



Data pre-processing included PCA and batch effect correction. Next, the data were subjected to factor regression analysis, comparing both the breast cancer subtypes and the mouse models. This identified a few hundred genes that differed significantly between the groups. From these, we identified a group of genes specific to SCID mice as well as lncRNAs specific to subtypes of breast cancer.
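To give a feel for how genes that differ between groups can be flagged, the sketch below computes a one-way ANOVA F-statistic per gene and ranks genes by it. This is a simplified stand-in chosen for illustration, not the factor regression analysis run on the platform. The gene names, group labels, and expression values are all hypothetical, and a real analysis would also need p-values and multiple-testing correction.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for one gene.

    groups: list of lists, each holding the expression values of one group.
    """
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation explained by group membership versus variation within groups.
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    ss_within = sum((value - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for value in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    if ss_within == 0.0:
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)

# Toy data (made up): expression of two genes in two hypothetical subtypes.
genes = {
    "gene_a": {"subtype_1": [1.0, 2.0, 3.0], "subtype_2": [7.0, 8.0, 9.0]},
    "gene_b": {"subtype_1": [4.0, 5.0, 6.0], "subtype_2": [5.0, 4.0, 6.0]},
}
f_stats = {name: anova_f(list(groups.values())) for name, groups in genes.items()}
ranked = sorted(f_stats, key=f_stats.get, reverse=True)
print(ranked, f_stats)
```

A gene whose group means are far apart relative to the spread within each group receives a large F-statistic and rises to the top of the ranking.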

In summary, this study posed a number of computational challenges: it required a concatenated (mouse-human) genome, sample numbers differed greatly between cancer subtypes, and gene expression showed a batch effect. This project gives a brief overview of the trends found in the data set and shows a few examples of how the T-Bioinfo platform can be used with complex data sets to find meaningful results. The study highlights the importance of applying machine learning techniques to complex RNA-seq data sets and shows how the T-Bioinfo platform can make this a quick and easy process for researchers who are not comfortable with the terminal or coding.
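When reads are aligned to a concatenated mouse-human genome, the resulting counts have to be separated by species before tumor (human) and stroma (mouse) can be analyzed independently. The sketch below assumes that genes carry Ensembl-style identifiers, where human gene IDs begin with "ENSG" and mouse gene IDs begin with "ENSMUSG". The identifiers and counts shown are placeholders, and whether the original study used this naming scheme is an assumption.

```python
def split_by_species(counts):
    """Split a gene-count table into human, mouse, and unrecognized parts.

    counts: dict mapping a gene identifier to a read count.
    """
    human, mouse, other = {}, {}, {}
    for gene_id, count in counts.items():
        if gene_id.startswith("ENSMUSG"):
            mouse[gene_id] = count   # stromal (mouse) transcript
        elif gene_id.startswith("ENSG"):
            human[gene_id] = count   # tumor (human) transcript
        else:
            other[gene_id] = count   # keep anything unexpected for inspection
    return human, mouse, other

# Placeholder identifiers and made-up counts, for illustration only.
counts = {
    "ENSG00000000001": 120,
    "ENSG00000000002": 35,
    "ENSMUSG00000000001": 48,
    "spike_in_1": 7,
}
human, mouse, other = split_by_species(counts)
print(len(human), len(mouse), len(other))
```

Keeping unrecognized identifiers in a separate table, rather than dropping them, makes it easier to notice annotation problems early.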





[1] J. R. Bradford, M. Wappett, G. Beran, A. Logie, O. Delpuech, H. Brown, J. Boros, N. J. Camp, R. McEwen, A. M. Mazzola, C. D’Cruz, and S. T. Barry, “Whole transcriptome profiling of patient-derived xenograft models as a tool to identify both tumor and stromal specific biomarkers,” Oncotarget, vol. 7, no. 15, 2016.

[2] C. Chen, K. Grennan, J. Badner, D. Zhang, E. Gershon, L. Jin, and C. Liu, “Removing batch effects in analysis of expression microarray data: An evaluation of six batch adjustment methods,” PLoS One, vol. 6, no. 2, 2011.

Referenced files:

It’s recommended to use the SVL files due to their small size, as they simply reference data already on the Pine Biotech servers. However, full copies of the educational dataset are also available.