Data Mining / Post Processing

To use the Data Mining analysis section, you need one or more tables in the proper format. The section works with genomic, transcriptomic, proteomic, or metabolomic data, provided each table has an ID column and sample-level data columns.

  1. PCA Analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The components capture the underlying structure in the data: they are the directions of greatest variance, along which the data is most spread out. Mathematically, these directions are the eigenvectors of the data's covariance matrix, and the eigenvectors with the largest eigenvalues are the axes that describe the data structure best. By projecting multi-dimensional gene profiles onto the first two components, PCA plots them on a 2D plane, producing a graph in which groups of associated profiles are easy to see.
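
The projection described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the profile values are made up, the function names (`center_columns`, `covariance_matrix`, `leading_eigenpair`, `pca`) are hypothetical, and the eigenvectors are found with simple power iteration plus deflation, which is an assumption chosen to keep the example free of third-party libraries.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every variable (sample) has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of the columns of an already-centered table."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def leading_eigenpair(matrix, iterations=200):
    """Power iteration: (eigenvalue, unit eigenvector) for the largest eigenvalue.

    The start vector is arbitrary; the method fails only if it happens to be
    exactly orthogonal to the leading eigenvector.
    """
    size = len(matrix)
    start = [float(i + 1) for i in range(size)]
    start_norm = math.sqrt(sum(x * x for x in start))
    vec = [x / start_norm for x in start]
    for _ in range(iterations):
        product = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vec = [x / norm for x in product]
    product = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
    value = sum(v * p for v, p in zip(vec, product))  # Rayleigh quotient
    return value, vec

def pca(rows, n_components=2):
    """Return per-row scores on the first components and explained-variance ratios."""
    centered = center_columns(rows)
    cov = covariance_matrix(centered)
    size = len(cov)
    total_variance = sum(cov[i][i] for i in range(size))
    components, explained = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        explained.append(value / total_variance)
        # Deflation: remove the component just found so the next pass
        # converges to the next-largest eigenvector.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(size)]
               for i in range(size)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, explained

# Made-up profiles: one row per gene, one column per sample.
gene_ids = ["gene1", "gene2", "gene3", "gene4"]
profiles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0],
    [2.9, 2.1, 1.1],
]

scores, explained = pca(profiles)
for gene_id, (pc1, pc2) in zip(gene_ids, scores):
    print(f"{gene_id}: PC1={pc1:+.3f} PC2={pc2:+.3f}")
```

Plotting the two scores of each gene as (x, y) coordinates gives the 2D view: in this made-up data, gene1 and gene2 land on one side of the first axis and gene3 and gene4 on the other. The sign of each axis is arbitrary, so only the relative positions are meaningful.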


  2. BiAssociation

BiAssociation works with a table of genomic, transcriptomic, proteomic, or metabolomic data, which may be produced by the RNA-seq section or another section. The table has one gene per row and one sample per column; each cell holds the expression level (a number) of that gene in that sample.

gene ID    sample 1    sample 2
gene1      12.1        7.05
gene2      1.2         7.05
gene3      5.8         7.05
gene4      4           7.05

The BiAssociation method takes one or two tables like this and creates a network of associations. The associations can be between genes, or between genes and items of another type, such as proteins. An association records that two items are connected and how strong that connection is.

The resulting table contains two ID columns (gene-gene, gene-protein, gene-metabolite, etc.) with the association score in the column between them.

gene ID    score    protein ID
gene1      12.1     protein5
gene2      3        protein4
gene3      7.8      protein1
gene4      24.5     protein2
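
To make the shape of the input and output concrete, here is a minimal sketch that builds an association table from two input tables. The text does not say how BiAssociation computes its score, so the Pearson correlation used here is a stand-in assumption, and so are the function names (`pearson`, `associate`), the `min_abs_score` cutoff, and the made-up sample values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def associate(table_a, table_b, min_abs_score=0.8):
    """Score every pair (row of table_a, row of table_b) and keep the strong ones.

    Both tables map an ID to a list of values measured on the same samples,
    in the same order. Returns (id_a, score, id_b) rows, strongest first.
    """
    edges = []
    for id_a, profile_a in table_a.items():
        for id_b, profile_b in table_b.items():
            score = pearson(profile_a, profile_b)
            if abs(score) >= min_abs_score:
                edges.append((id_a, score, id_b))
    return sorted(edges, key=lambda edge: -abs(edge[1]))

# Made-up input tables: one row per ID, one value per sample.
genes = {
    "gene1": [12.1, 7.05, 3.0],
    "gene2": [1.2, 7.05, 9.0],
}
proteins = {
    "protein5": [10.0, 6.0, 2.5],
    "protein4": [5.0, 5.1, 4.9],
}

edges = associate(genes, proteins)
for gene_id, score, protein_id in edges:
    print(f"{gene_id}  {score:+.3f}  {protein_id}")
```

Each output row is one edge of the association network, with the score sitting between the two IDs as in the result table. Passing the same table for both arguments would give gene-gene associations, although that would also pair each gene with itself.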

  3. P-clustering

P-clustering creates a graph of nodes by clustering gene (or protein or metabolite) profiles according to the similarity of their responses. Profiles that are highly similar are grouped into the same cluster.

gene ID    cluster
gene1      cluster1
gene2      cluster1
gene3      cluster2
gene4      cluster2
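
One simple way to obtain such a table is sketched below. The text does not describe the P-clustering algorithm itself, so this example is an assumption: it links two profiles whenever their Pearson correlation reaches a threshold and then treats each connected group of the resulting graph as a cluster. The function names, the 0.9 threshold, and the profile values are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cluster_profiles(profiles, threshold=0.9):
    """Group profiles whose responses are similar.

    Builds a graph with one node per profile and an edge between every pair
    whose correlation is at least `threshold`, then labels each connected
    component as one cluster.
    """
    ids = list(profiles)
    neighbours = {profile_id: [] for profile_id in ids}
    for index, first in enumerate(ids):
        for second in ids[index + 1:]:
            if pearson(profiles[first], profiles[second]) >= threshold:
                neighbours[first].append(second)
                neighbours[second].append(first)

    labels = {}
    next_cluster = 1
    for start in ids:
        if start in labels:
            continue
        name = f"cluster{next_cluster}"
        next_cluster += 1
        # Depth-first walk over everything reachable from this node.
        stack = [start]
        while stack:
            node = stack.pop()
            if node in labels:
                continue
            labels[node] = name
            stack.extend(neighbours[node])
    return labels

# Made-up profiles: two genes rise across the samples, two fall.
profiles = {
    "gene1": [1.0, 2.0, 3.0],
    "gene2": [2.0, 4.0, 6.5],
    "gene3": [3.0, 2.0, 1.0],
    "gene4": [6.0, 4.2, 2.0],
}

labels = cluster_profiles(profiles)
for gene_id, cluster in labels.items():
    print(gene_id, cluster)
```

With these values the two rising genes share one cluster and the two falling genes share the other, which matches the layout of the table above.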

  4. Factor Analysis

Factor analysis is performed on a factorial experiment: an experiment whose design consists of two or more factors (such as time or environment), each with discrete possible levels, and whose experimental units take on all possible combinations of these levels across all such factors. Such an experiment makes it possible to study the effect of each factor on the response variable, as well as the effects of interactions between factors on the response variable.[1]

To compute the main effect of a factor “A”, subtract the average response of all experimental runs for which A was at its low level from the average response of all experimental runs for which A was at its high level.[1]
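
The main-effect rule above can be written out directly. The sketch below assumes a two-level design in which each factor is coded -1 (low) or +1 (high); the factor names, the response values, and the function names are made up for illustration. The interaction estimate uses the usual two-level convention (average response where the two factors are at the same level minus the average where they differ), which is an addition not spelled out in the text.

```python
from statistics import mean

def main_effect(runs, factor):
    """Average response at the factor's high level minus the average at its low level."""
    high = [response for levels, response in runs if levels[factor] == +1]
    low = [response for levels, response in runs if levels[factor] == -1]
    return mean(high) - mean(low)

def interaction_effect(runs, factor_a, factor_b):
    """Two-level interaction: runs where the levels match minus runs where they differ."""
    same = [r for levels, r in runs if levels[factor_a] * levels[factor_b] == +1]
    differ = [r for levels, r in runs if levels[factor_a] * levels[factor_b] == -1]
    return mean(same) - mean(differ)

# Made-up expression levels of one gene in a 2x2 factorial experiment.
# Each run is (factor levels, response).
runs = [
    ({"substance": -1, "heat": -1}, 4.0),
    ({"substance": +1, "heat": -1}, 8.0),
    ({"substance": -1, "heat": +1}, 5.0),
    ({"substance": +1, "heat": +1}, 9.0),
]

substance_effect = main_effect(runs, "substance")
heat_effect = main_effect(runs, "heat")
combined_effect = interaction_effect(runs, "substance", "heat")

print("substance:", substance_effect)
print("heat:", heat_effect)
print("substance x heat:", combined_effect)
```

Here the substance raises expression by 4.0 on average, heat raises it by 1.0, and the interaction is 0.0 because the substance has the same effect at both heat levels.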

To find out which factors affect a gene, you perform factorial analysis. The analysis determines whether a factor, such as the presence of a substance or another environmental condition, affects the gene's expression level.

gene ID    factor1    factor2    combination
gene1      +
gene3      +
gene4      0          0          0

[1] Wikipedia contributors. Factorial experiment [Internet]. Wikipedia, The Free Encyclopedia; 2015 Jan 2, 19:57 UTC [cited 2015 May 22]. Available from: