Principal component analysis
Overview
Teaching: 90 min
Exercises: 30 min
Questions
What is principal component analysis (PCA) and when can it be used?
How can we perform a PCA in R?
How many principal components are needed to explain a significant amount of variation in the data?
How to interpret the output of PCA using loadings and principal components?
Objectives
Identify situations where PCA can be used to answer research questions using high-dimensional data.
Perform a PCA on high-dimensional data.
Select the appropriate number of principal components.
Interpret the output of PCA.
Introduction
Imagine a dataset which contains many variables ($p$), close to the total number of rows in the dataset ($n$). Some of these variables are highly correlated and several form groups which you might expect to represent the same overall effect. Such datasets are challenging to analyse for several reasons, with the main problem being how to reduce dimensionality in the dataset while retaining the important features.
In this episode we will explore principal component analysis (PCA) as a popular method of analysing high-dimensional data. PCA is an unsupervised statistical method which allows large datasets of correlated variables to be summarised into smaller numbers of uncorrelated principal components that explain most of the variability in the original dataset. This is useful, for example, during initial data exploration as it allows correlations among data points to be observed and principal components to be calculated for inclusion in further analysis (e.g. linear regression). An example of PCA might be reducing several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature.
Advantages and disadvantages of PCA
Advantages:
- It is a relatively easy to use and popular method.
- Various software/packages are available to run a PCA.
- The calculations used in a PCA are easy to understand for statisticians and non-statisticians alike.
Disadvantages:
- It assumes that variables in a dataset are correlated.
- It is sensitive to the scale at which input variables are measured. If input variables are measured at different scales, the variables with large variance relative to the scale of measurement will have greater impact on the principal components relative to variables with smaller variance. In many cases, this is not desirable.
- It is not robust against outliers, meaning that very large or small data points can have a large effect on the output of the PCA.
- PCA assumes a linear relationship between variables which is not always a realistic assumption.
- It can be difficult to interpret the meaning of the principal components, especially when including them in further analyses (e.g. inclusion in a linear regression).
Supervised vs unsupervised learning
Most statistical problems fall into one of two categories: supervised or unsupervised learning. Supervised learning problems, such as linear regression, are analyses in which each observation has at least one independent variable ($x$) as well as a dependent variable ($y$). In supervised learning problems the aim is to predict the value of the response for future observations or to understand the relationship between the dependent variable and the predictors. In unsupervised learning there is no dependent variable ($y$) for each observation, only a series of independent variables. In this situation there is no need for prediction, as there is no dependent variable to predict (hence the analysis can be thought of as being unsupervised by the dependent variable). Instead, statistical analysis can be used to understand relationships between the independent variables or between observations themselves. Unsupervised learning problems often occur when analysing high-dimensional datasets in which there is no obvious dependent variable to be predicted, but the analyst would like to understand more about patterns between groups of observations or reduce dimensionality so that a supervised learning process may be used.
Challenge 1
Descriptions of three datasets and research questions are given below. For which of these might PCA be considered a useful tool for analysing data so that the research questions may be addressed?
- An epidemiologist has data collected from different patients admitted to hospital with infectious respiratory disease. They would like to determine whether length of stay in hospital differs in patients with different respiratory diseases.
- An online retailer has collected data on user interactions with its online app and has information on the number of times each user interacted with the app, what products they viewed per interaction, and the type and cost of these products. The retailer would like to use this information to predict whether or not a user will be interested in a new product.
- A scientist has assayed gene expression levels in 1000 cancer patients and has data from probes targeting different genes in tumour samples from patients. She would like to create new variables representing relative abundance of different groups of genes to i) find out if genes form subgroups based on biological function and ii) use these new variables in a linear regression examining how gene expression varies with disease severity.
- All of the above.
Solution
In the first case, a regression model would be more suitable; perhaps a survival model. In the second, again a regression model, likely linear or logistic, would be more suitable. In the third example, PCA can help to identify modules of correlated features that explain a large amount of variation within the data.
Therefore the answer here is 3.
What is a principal component?
The first principal component is the direction of the data along which the observations vary the most. The second principal component is the direction of the data along which the observations show the next highest amount of variation. For example, Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first principal component direction of the data, which is the direction along which there is greatest variability in the data. Projecting points onto this line (i.e. by finding the location on the line closest to the point) would give a vector of points with the greatest possible variance. The next highest amount of variability in the data is represented by the line perpendicular to the first principal component, which represents the second principal component (green line).
The second principal component is a linear combination of the variables that is uncorrelated with the first principal component. There are as many principal components as there are variables in your dataset, but as we’ll see, some are more useful at explaining your data than others. By definition, the first principal component explains more variation than other principal components.

[Figure 1: Biodiversity index versus percentage area left fallow for 50 farms, with the first (red) and second (green) principal component directions.]
The animation below illustrates how principal components are calculated from data. You can imagine that the black line is a rod and each red dashed line is a spring. The energy of each spring is proportional to its squared length. The direction of the first principal component is the one that minimises the total energy of all of the springs. In the animation below, the springs pull the rod, finding the direction of the first principal component when they reach equilibrium. The positions of the points projected onto the rod then give the first principal component scores. This is explained in more detail on this Q&A website.

[Animation: springs pulling a rod into the direction of the first principal component.]
The first principal component’s scores ($Z_1$) are calculated using the equation:
\[Z_1 = a_{11}X_1 + a_{21}X_2 + \dots + a_{p1}X_p\]
Here, $X_1 \dots X_p$ represent the variables in the original dataset and $a_{11} \dots a_{p1}$ represent the principal component loadings, which can be thought of as the degree to which each variable contributes to the calculation of the principal component. We will come back to principal component scores and loadings further below.
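To make this formula concrete, here is a minimal simulated sketch (the toy data and variable names below are made up for illustration and are not part of the lesson datasets). It checks that the first principal component scores produced by R's prcomp() function (used later in this episode) match the linear combination above, and that their variance is larger than the variance of either original variable.
# Toy data: two correlated variables
set.seed(42)
x1  <- rnorm(100)
toy <- cbind(x1 = x1, x2 = 0.5 * x1 + rnorm(100, sd = 0.4))
toy.pca  <- prcomp(toy)               # centres the data by default
loadings <- toy.pca$rotation[, 1]     # a11 and a21 in the equation above
# Z1 computed directly from the equation, using centred data
toy.centred <- scale(toy, center = TRUE, scale = FALSE)
z1 <- toy.centred %*% loadings
all.equal(as.vector(z1), unname(toy.pca$x[, 1]))  # TRUE: same scores
var(z1)             # variance of the PC1 scores
apply(toy, 2, var)  # variances of the original variables (both smaller)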
How do we perform a PCA?
A prostate cancer dataset
The prostate
dataset represents data from 97
men who have prostate cancer. The data come from a study which examined the
correlation between the level of prostate specific antigen and a number of
clinical measures in men who were about to receive a radical prostatectomy.
The data have 97 rows and 9 columns.
Columns include:
- lcavol (log-transformed cancer volume),
- lweight (log-transformed prostate weight),
- lbph (log-transformed amount of benign prostate enlargement),
- svi (seminal vesicle invasion),
- lcp (log-transformed capsular penetration; amount of spread of cancer in outer walls of prostate),
- gleason (Gleason score; grade of cancer cells),
- pgg45 (percentage Gleason scores 4 or 5),
- lpsa (log-transformed prostate specific antigen; level of PSA in blood),
- age (patient age in years).
Here we will calculate principal component scores for each row in this dataset, using five principal components (one for each variable included in the PCA). We will include the five continuous clinical variables in the prostate dataset in our PCA, so that we can create fewer variables representing clinical markers of cancer progression. Standard PCAs are carried out using continuous variables only.
First, we will examine the prostate
dataset (originally part of the
lasso2
package):
prostate <- readRDS(here("data/prostate.rds"))
head(prostate)
X lcavol lweight age lbph svi lcp gleason pgg45 lpsa
1 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
2 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
3 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
4 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
5 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
6 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
Note that each row of the dataset represents a single patient.
We will create a subset of the data including only the clinical variables we want to use in the PCA.
pros2 <- prostate[, c("lcavol", "lweight", "lbph", "lcp", "lpsa")]
head(pros2)
lcavol lweight lbph lcp lpsa
1 -0.5798185 2.769459 -1.386294 -1.386294 -0.4307829
2 -0.9942523 3.319626 -1.386294 -1.386294 -0.1625189
3 -0.5108256 2.691243 -1.386294 -1.386294 -0.1625189
4 -1.2039728 3.282789 -1.386294 -1.386294 -0.1625189
5 0.7514161 3.432373 -1.386294 -1.386294 0.3715636
6 -1.0498221 3.228826 -1.386294 -1.386294 0.7654678
Do we need to standardise the data?
Now we compare the variances between variables in the dataset.
apply(pros2, 2, var)
lcavol lweight lbph lcp lpsa
1.389157 0.246642 2.104840 1.955102 1.332476
par(mfrow = c(1, 2))
hist(pros2$lweight, breaks = "FD")
hist(pros2$lbph, breaks = "FD")

[Figure: histograms of lweight and lbph.]
Note that variance is greatest for lbph
and lowest for lweight
. It is clear
from this output that we need to scale each of these variables before including
them in a PCA analysis to ensure that differences in variances between variables
do not drive the calculation of principal components. In this example we
standardise all five variables to have a mean of 0 and a standard
deviation of 1.
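As a quick check (a small sketch that is not part of the original analysis), we can standardise the variables with scale() and confirm that each one then has mean 0 and standard deviation 1. Below, prcomp() will do this for us when we set scale = TRUE.
# Standardise the five clinical variables and check the result
pros2.std <- scale(pros2)
round(colMeans(pros2.std), 10)  # means are all (approximately) zero
apply(pros2.std, 2, sd)         # standard deviations are all equal to one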
Challenge 2
Why might it be necessary to standardise variables before performing a PCA?
Can you think of datasets where it might not be necessary to standardise variables? Discuss.
- To make the results of the PCA interesting.
- If you want to ensure that variables with different ranges of values contribute equally to analysis.
- To allow the feature matrix to be calculated faster, especially in cases where there are a lot of input variables.
- To allow both continuous and categorical variables to be included in the PCA.
- All of the above.
Solution
2. Scaling the data isn’t guaranteed to make the results more interesting. It also won’t affect how quickly the output is calculated, and it does not allow categorical variables to be included in a PCA (PCA is carried out on continuous variables only).
It is done to ensure that all features have equal weighting in the resulting PCs.
You may not want to standardise datasets which contain continuous variables all measured on the same scale (e.g. gene expression data or RNA sequencing data). In this case, variables with very little sample-to-sample variability may represent only random noise, and standardising the data would give these extra weight in the PCA.
Next we will carry out a PCA using the prcomp() function in base R. The input data (pros2) is a data frame of numeric variables. Note that the scale = TRUE argument is used to standardise the variables to have a mean of 0 and a standard deviation of 1.
pca.pros <- prcomp(pros2, scale = TRUE, center = TRUE)
pca.pros
Standard deviations (1, .., p=5):
[1] 1.5648756 1.1684678 0.7452990 0.6362941 0.4748755
Rotation (n x k) = (5 x 5):
PC1 PC2 PC3 PC4 PC5
lcavol 0.5616465 -0.23664270 0.01486043 0.22708502 -0.75945046
lweight 0.2985223 0.60174151 -0.66320198 -0.32126853 -0.07577123
lbph 0.1681278 0.69638466 0.69313753 0.04517286 -0.06558369
lcp 0.4962203 -0.31092357 0.26309227 -0.72394666 0.25253840
lpsa 0.5665123 -0.01680231 -0.10141557 0.56487128 0.59111493
How many principal components do we need?
We have calculated one principal component for each variable in the original dataset. How do we choose how many of these are necessary to represent the true variation in the data, without having extra components that are unnecessary?
Let’s look at the relative importance of each component using summary
.
summary(pca.pros)
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 1.5649 1.1685 0.7453 0.63629 0.4749
Proportion of Variance 0.4898 0.2731 0.1111 0.08097 0.0451
Cumulative Proportion 0.4898 0.7628 0.8739 0.95490 1.0000
This returns the proportion of variance in the data explained by each of the (p = 5) principal components. In this example, PC1 explains approximately 49% of variance in the data, PC2 27% of variance, PC3 a further 11%, PC4 approximately 8% and PC5 around 5%.
Let us visualise this. A plot of the amount of variance accounted for by each PC is also called a scree plot. Note that the amount of variance accounted for by a principal component is also called its eigenvalue, and thus the y-axis in scree plots is often labelled “eigenvalue”.
Often, scree plots show a characteristic pattern where, initially, the variance drops rapidly with each additional principal component. But then there is an “elbow” after which the variance decreases more slowly. The variance explained by the components before the elbow is sometimes interpreted as structural variance that is relevant and should be retained, while the variance explained by components after the elbow may be treated as noise and discarded.
# calculate the percentage of variance explained by each principal component
varExp <- (pca.pros$sdev^2) / sum(pca.pros$sdev^2) * 100
# create a new data frame with five rows, one for each principal component
varDF <- data.frame(Dimensions = 1:length(varExp), varExp = varExp)
# plot the percentage of variance explained against component number
plot(varDF)

[Figure: scree plot showing the percentage of variance explained by each principal component of the prostate data.]
The screeplot shows that the first principal component explains most of the variance in the data (almost 50%) and each subsequent principal component explains less and less of the total variance. The first two principal components explain >70% of variance in the data. But what do these two principal components mean?
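Note that base R also provides a screeplot() method for prcomp objects, which gives a quick one-line alternative to the manual plot above; its y-axis shows the variance (eigenvalue) of each component rather than a percentage.
# Built-in scree plot for a prcomp object
screeplot(pca.pros, type = "lines", main = "pca.pros")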
What are loadings and principal component scores?
Most PCA functions will produce two main output matrices: the principal component scores and the loadings. The matrix of principal component scores has as many rows as there were observations in the input matrix. These scores are what is usually visualised or used for down-stream analyses. The matrix of loadings (also called rotation matrix) has as many rows as there are features in the original data. It contains information about how the (usually centered and scaled) original data relate to the PC scores.
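We can see this structure directly in the prcomp() output for the prostate data (a small check, not part of the original text): the scores have one row per patient and the loadings have one row per clinical variable.
# Scores: one row per observation (97 patients), one column per PC
dim(pca.pros$x)
# Loadings (rotation matrix): one row per original variable (5), one column per PC
dim(pca.pros$rotation)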
When calling a PCA object generated with prcomp()
, the loadings are printed by default:
pca.pros
Standard deviations (1, .., p=5):
[1] 1.5648756 1.1684678 0.7452990 0.6362941 0.4748755
Rotation (n x k) = (5 x 5):
PC1 PC2 PC3 PC4 PC5
lcavol 0.5616465 -0.23664270 0.01486043 0.22708502 -0.75945046
lweight 0.2985223 0.60174151 -0.66320198 -0.32126853 -0.07577123
lbph 0.1681278 0.69638466 0.69313753 0.04517286 -0.06558369
lcp 0.4962203 -0.31092357 0.26309227 -0.72394666 0.25253840
lpsa 0.5665123 -0.01680231 -0.10141557 0.56487128 0.59111493
The principal component scores are obtained by carrying out matrix multiplication of the (usually centered and scaled) original data times the loadings. The following callout demonstrates this.
Computing a PCA “by hand”
The rotation matrix obtained in a PCA is identical to the eigenvectors of the covariance matrix of the data. Multiplying these with the (centered and scaled) data yields the PC scores:
pros2.scaled <- scale(pros2) # centre and scale the Prostate data
pros2.cov <- cov(pros2.scaled) # generate covariance matrix
pros2.cov
           lcavol   lweight         lbph          lcp      lpsa
lcavol  1.0000000 0.1941283  0.027349703  0.675310484 0.7344603
lweight 0.1941283 1.0000000  0.434934636  0.100237795 0.3541204
lbph    0.0273497 0.4349346  1.000000000 -0.006999431 0.1798094
lcp     0.6753105 0.1002378 -0.006999431  1.000000000 0.5488132
lpsa    0.7344603 0.3541204  0.179809410  0.548813169 1.0000000
pros2.eigen <- eigen(pros2.cov) # perform eigen decomposition
pros2.eigen # The slot $vectors = rotation of the PCA
eigen() decomposition
$values
[1] 2.4488355 1.3653171 0.5554705 0.4048702 0.2255067

$vectors
           [,1]        [,2]        [,3]        [,4]        [,5]
[1,] -0.5616465  0.23664270  0.01486043  0.22708502  0.75945046
[2,] -0.2985223 -0.60174151 -0.66320198 -0.32126853  0.07577123
[3,] -0.1681278 -0.69638466  0.69313753  0.04517286  0.06558369
[4,] -0.4962203  0.31092357  0.26309227 -0.72394666 -0.25253840
[5,] -0.5665123  0.01680231 -0.10141557  0.56487128 -0.59111493
# generate PC scores by hand, using matrix multiplication
my.pros2.pcs <- pros2.scaled %*% pros2.eigen$vectors
# compare results
par(mfrow = c(1, 2))
plot(pca.pros$x[, 1:2], main = "prcomp()")
abline(h = 0, v = 0, lty = 2)
plot(my.pros2.pcs[, 1:2], main = "\"By hand\"", xlab = "PC1", ylab = "PC2")
abline(h = 0, v = 0, lty = 2)
[Figure: PC1 versus PC2 scores from prcomp() (left) and computed by hand (right).]
par(mfrow=c(1,1)) # Note that the axis orientations may be swapped but the relative positions of the dots should be the same in both plots.
One way to visualise how principal components relate to the original variables
is by creating a biplot. Biplots usually show two principal components plotted
against each other. Observations are sometimes labelled with numbers. The
contribution of each original variable to the principal components displayed
is then shown by arrows (generated from those two columns of the rotation matrix that
correspond to the principal components shown). NB, there are several biplot
implementations in different R libraries. It is thus a good idea to specify
the desired package when calling biplot()
. A biplot of the first two principal
components can be generated as follows:
stats::biplot(pca.pros, xlim = c(-0.3, 0.3))

[Figure: biplot of the first two principal components of the prostate data.]
This biplot shows the position of each patient on a 2-dimensional plot where
loadings can be observed via the red arrows associated with each of
the variables. The variables lpsa
, lcavol
and lcp
are associated with
positive values on PC1 while positive values on PC2 are associated with the
variables lbph
and lweight
. The length of the arrows indicates how much
each variable contributes to the calculation of each principal component.
The left and bottom axes show normalised principal component scores. The axes on the top and right of the plot are used to interpret the loadings, where loadings are scaled by the standard deviation of the principal components (pca.pros$sdev) times the square root of the number of observations.
Finally, you need to know that PC scores and rotations may have different slot names, depending on the PCA implementation you use. Here are some examples:
| library::command()  | PC scores | Loadings   |
|---------------------|-----------|------------|
| stats::prcomp()     | $x        | $rotation  |
| stats::princomp()   | $scores   | $loadings  |
| PCAtools::pca()     | $rotated  | $loadings  |
Using PCA to analyse gene expression data
In this section you will carry out your own PCA using the Bioconductor package PCAtools
applied to gene expression data to explore the topics covered above.
PCAtools
provides functions that can be used to explore data via PCA and
produce useful figures and analysis tools. The package is made for the somewhat unusual
Bioconductor style of data tables (observations in columns, features in rows). When
using Bioconductor data sets and PCAtools
, it is thus not necessary to transpose the data.
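For data stored the other way around (observations in rows, as in the prostate example above), you would transpose the matrix before passing it to pca(). Here is a hedged sketch using the prostate data from earlier; it is not part of the original lesson, and the scale argument simply standardises the clinical variables as before.
library("PCAtools")
# pca() expects features in rows and samples in columns, so the prostate
# data (patients in rows) are transposed first
pc.pros <- pca(t(as.matrix(pros2)), scale = TRUE)
head(pc.pros$rotated[, 1:3])   # PC scores, one row per patient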
A gene expression dataset of cancer patients
The dataset we will be analysing in this lesson includes two subsets of data:
- a matrix of gene expression data showing microarray results for different probes used to examine gene expression profiles in 91 different breast cancer patient samples.
- metadata associated with the gene expression results detailing information from patients from whom samples were taken.
Let’s load the PCAtools
package and the data.
library("PCAtools")
We will first load the microarray breast cancer gene expression data and associated metadata, downloaded from the Gene Expression Omnibus.
library("SummarizedExperiment")
cancer <- readRDS(here::here("data/cancer_expression.rds"))
mat <- assay(cancer)
metadata <- colData(cancer)
View(mat)
#nrow=22215 probes
#ncol=91 samples
View(metadata)
#nrow=91
#ncol=8
all(colnames(mat) == rownames(metadata))
[1] TRUE
#Check that column names and row names match
#If they do should return TRUE
The ‘mat’ variable contains a matrix of gene expression profiles for each sample. Rows represent gene expression measurements and columns represent samples. The ‘metadata’ variable contains the metadata associated with the gene expression data, including the name of the study from which the data originate, the age of the patient from whom the sample was taken, whether or not an oestrogen receptor was involved in their cancer, and the grade and size of the cancer for each sample (represented by rows).
Microarray data are difficult to analyse for several reasons. Firstly, they are typically high-dimensional and therefore are subject to the same difficulties associated with analysing high dimensional data outlined above (i.e. p>n, large numbers of rows, multiple possible response variables, curse of dimensionality). Secondly, formulating a research question using microarray data can be difficult, especially if not much is known a priori about which genes code for particular phenotypes of interest. Finally, exploratory analysis, which can be used to help formulate research questions and display relationships, is difficult using microarray data due to the number of potentially interesting response variables (i.e. expression data from probes targeting different genes).
If researchers hypothesise that groups of genes (e.g. biological pathways) may be associated with different phenotypic characteristics of cancers (e.g. histologic grade, tumour size), using statistical methods that reduce the number of columns in the microarray matrix to a smaller number of dimensions representing groups of genes would help visualise the data and address research questions regarding the effect different groups of genes have on disease progression.
Using PCAtools, we will apply a PCA to the cancer gene expression data, plot the amount of variation in the data explained by each principal component, and plot the most important principal components against each other, as well as explore what each principal component represents.
Challenge 3
1. Apply a PCA to the cancer gene expression data using the pca() function from PCAtools. You can use the help files in PCAtools to find out about the pca() function (type help("pca") or ?pca in R).
2. Let us assume we only care about the principal components accounting for the top 80% of the variance in the dataset. Use the removeVar argument in pca() to remove the PCs accounting for the bottom 20%.
3. As in the example using the prostate data above, examine the first 5 rows and columns of the rotated data and loadings from your PCA.
Solution
pc <- pca(mat, metadata = metadata)
# Many PCs explain a very small amount of the total variance in the data
# Remove the lower 20% of PCs with lower variance
pc <- pca(mat, metadata = metadata, removeVar = 0.2)
# Explore other arguments provided in pca
pc$rotated[1:5, 1:5]
               PC1        PC2        PC3        PC4        PC5
GSM65752 -29.79105  43.866788   3.255903 -40.663138 15.3427597
GSM65753 -37.33911 -15.244788  -4.948201  -6.182795  9.4725870
GSM65755 -29.41462   7.846858 -22.880525 -16.149669 22.3821009
GSM65757 -33.35286   1.343573 -22.579568   2.200329 15.0082786
GSM65758 -40.51897  -8.491125   5.288498  14.007364  0.8739772
pc$loadings[1:5, 1:5]
                    PC1          PC2          PC3        PC4          PC5
206378_at -0.0024680993 -0.053253543 -0.004068209 0.04068635  0.015078376
205916_at -0.0051557973  0.001315022 -0.009836545 0.03992371  0.038552048
206799_at  0.0005684075 -0.050657061 -0.009515725 0.02610233  0.006208078
205242_at  0.0130742288  0.028876408  0.007655420 0.04449641 -0.001061205
206509_at  0.0019031245 -0.054698479 -0.004667356 0.01566468  0.001306807
which.max(pc$loadings[, 1])
[1] 49
pc$loadings[49, ]
                      PC1           PC2           PC3           PC4           PC5
215281_x_at    0.03752947  -0.007369379   0.006243377  -0.008242589  -0.004783206
                      PC6           PC7           PC8           PC9          PC10
215281_x_at    0.01194012  -0.002822407   -0.01216792   0.001137451 -0.0009056616
                     PC11          PC12          PC13          PC14          PC15
215281_x_at   -0.00196034 -0.0001676705    0.00699201  -0.002897995 -0.0005044658
                     PC16          PC17          PC18          PC19          PC20
215281_x_at -0.0004547916   0.002277035  -0.006199078   0.002708574  -0.006217326
                     PC21          PC22          PC23          PC24          PC25
215281_x_at    0.00516745   0.007625912   0.003434534   0.005460017   0.001477415
                     PC26          PC27          PC28          PC29          PC30
215281_x_at   0.002350428  0.0007183107 -0.0006195515  0.0006349803    0.00413627
                     PC31          PC32          PC33          PC34          PC35
215281_x_at  0.0001322301   0.003182956  -0.002123462  -0.001042769  -0.001729869
                     PC36          PC37          PC38          PC39          PC40
215281_x_at  -0.006556369   0.005766949   0.002537993 -0.0002846248   -0.00018195
                     PC41          PC42          PC43          PC44          PC45
215281_x_at -0.0007970789   0.003888626  -0.008210075 -0.0009570174  0.0007998935
                     PC46          PC47          PC48          PC49          PC50
215281_x_at -0.0006931441  -0.005717836   0.005189649   0.002591188  0.0007810259
                     PC51          PC52          PC53          PC54          PC55
215281_x_at   0.006610815   0.005371134  -0.001704796  -0.002286475   0.001365417
                     PC56          PC57          PC58          PC59          PC60
215281_x_at   0.003529892  0.0003375981   0.009895923  -0.001564423  -0.006989092
                     PC61          PC62          PC63          PC64          PC65
215281_x_at   0.000971273   0.001345406  -0.003575415 -0.0005588113   0.006516669
                     PC66          PC67          PC68          PC69          PC70
215281_x_at  -0.008770186   0.006699641    0.01284606  -0.005041574   0.007845653
                     PC71          PC72          PC73          PC74          PC75
215281_x_at   0.003964697   -0.01104367  -0.001506485  -0.001583824   0.003798343
                     PC76          PC77          PC78          PC79          PC80
215281_x_at   0.004817252  -0.001290033  -0.004402926  -0.003440367 -0.0001646198
                     PC81          PC82          PC83          PC84          PC85
215281_x_at   0.003923775   0.003179556 -0.0004388192  9.664648e-05   0.003501335
                     PC86          PC87          PC88          PC89          PC90
215281_x_at   -0.00112973   0.006489667 -0.0005039785  -0.004296355  -0.002751513
                     PC91
215281_x_at  -0.001877408
which.max(pc$loadings[, 2])
[1] 27
pc$loadings[27, ]
                      PC1           PC2           PC3           PC4           PC5
211122_s_at    0.01649085    0.05090275  -0.003378728    0.05178144  -0.003742393
                      PC6           PC7           PC8           PC9          PC10
211122_s_at   -0.00543753   -0.03522848  -0.006333521    0.01575401   0.004732546
                     PC11          PC12          PC13          PC14          PC15
211122_s_at   0.004687599   -0.01349892   0.005207937   -0.01731898    0.02323893
                     PC16          PC17          PC18          PC19          PC20
211122_s_at   -0.02069509    0.01477432   0.005658529    0.02667751   -0.01333503
                     PC21          PC22          PC23          PC24          PC25
211122_s_at  -0.003254036   0.003572342    0.01416779  -0.005511838   -0.02582847
                     PC26          PC27          PC28          PC29          PC30          PC31
211122_s_at    0.03405417   -0.01797345    0.01826328   0.005123959    0.01300763     0.0127127
                     PC32          PC33          PC34          PC35          PC36
211122_s_at   0.002477672    0.01933214    0.03017661   -0.01935071   -0.01960912
                     PC37          PC38          PC39          PC40          PC41
211122_s_at   0.004411188   -0.01263612   -0.02019279   -0.01441513    -0.0310399
                     PC42          PC43          PC44          PC45          PC46
211122_s_at   -0.02540426  0.0007949801   -0.00200195   -0.01748543   0.006881834
                     PC47          PC48          PC49          PC50          PC51
211122_s_at   0.006690698  -0.004000732   -0.02747926  -0.006963189   -0.02232332
                     PC52          PC53          PC54          PC55          PC56
211122_s_at -0.0003089115   -0.01604491   0.005649511   -0.02629501    0.02332997
                     PC57          PC58          PC59          PC60          PC61
211122_s_at   -0.01248022   -0.01563245   0.005369433   0.009445262  -0.005209349
                     PC62          PC63          PC64          PC65          PC66
211122_s_at    0.01787645    0.01629425    0.02457665   -0.02384242   0.002814479
                     PC67          PC68          PC69          PC70          PC71
211122_s_at  0.0004584731   0.007939733  -0.009554166  -0.003967123    0.01825668
                     PC72          PC73          PC74          PC75          PC76
211122_s_at   -0.00580374   -0.02236727   0.001295688   -0.02264723   0.006855855
                     PC77          PC78          PC79          PC80          PC81
211122_s_at   0.004995447  -0.008404118    0.00442875  -0.001027912   0.006104406
                     PC82          PC83          PC84          PC85          PC86
211122_s_at   -0.01988441   0.009667348  -0.008248781    0.01198369    0.01221713
                     PC87          PC88          PC89          PC90          PC91
211122_s_at  -0.003864842   -0.02876816   -0.01771452   -0.02164973   0.003346046
The function pca() is used to perform PCA, and uses as inputs a matrix (mat) containing continuous numerical data in which rows are data variables and columns are samples, and metadata associated with the matrix in which rows represent samples and columns represent data variables. It has options to centre or scale the input data before a PCA is performed, although in this case gene expression data do not need to be transformed prior to PCA being carried out as variables are measured on a similar scale (values are comparable between rows). The output of the pca() function includes a lot of information, such as loading values for each variable (loadings), principal component scores (rotated) and the amount of variance in the data explained by each principal component.
The rotated data show the principal component scores for each sample and each principal component. The loadings show the contribution each variable makes to each principal component.
Scaling variables for PCA
When running pca() above, we kept the default setting, scale = FALSE. That means genes with higher variation in their expression levels should have higher loadings, which is what we are interested in. Whether or not to scale variables for PCA will depend on your data and research question.
Note that this is different from normalising gene expression data. Gene expression data have to be normalised before downstream analyses can be carried out. This is to reduce the effect of technical and other potentially confounding factors. We assume that the expression data we use have been normalised previously.
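If scaling were wanted, it could be requested directly when calling pca() (a sketch only, assuming the pc object setup from Challenge 3; the scaled version is not used in the rest of this episode).
# Request unit-variance scaling of each gene before the PCA is carried out
pc.scaled <- pca(mat, metadata = metadata, removeVar = 0.2, scale = TRUE)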
Choosing how many components are important to explain the variance in the data
As in the example using the prostate
dataset we can use a screeplot to
compare the proportion of variance in the data explained by each principal
component. This allows us to understand how much information in the microarray
dataset is lost by projecting the observations onto the first few principal
components and whether these principal components represent a reasonable
amount of the variation. The proportion of variance explained should sum to one.
There are no clear guidelines on how many principal components should be included in PCA: your choice depends on the total variability of the data and the size of the dataset. We often look at the ‘elbow’ on the screeplot as an indicator that the addition of principal components does not contribute much to explaining the remaining variance, or we choose an arbitrary cut-off for the proportion of variance explained.
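Using the pc object from Challenge 3, the percentage of variance explained by each component is stored in pc$variance, so both rules of thumb can be applied directly (a small sketch; findElbowPoint() is a helper provided by PCAtools).
# Number of components needed to pass an arbitrary cut-off of 80% of
# the total variance explained
which(cumsum(pc$variance) > 80)[1]
# Estimate the position of the "elbow" of the screeplot
findElbowPoint(pc$variance)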
Challenge 4
Using the screeplot() function in PCAtools, create a screeplot to show the proportion of variance explained by each principal component. Explain the output of the screeplot in terms of the proportion of variance in the data explained by each principal component.
Solution
screeplot(pc, axisLabSize = 5, titleLabSize = 8)
[Figure: screeplot of the gene expression PCA showing the percentage of variance explained by each principal component, with a line for the cumulative percentage.]
Note that the first principal component (PC1) explains more variation than other principal components (which is always the case in PCA). The screeplot shows that the first principal component only explains ~33% of the total variation in the microarray data and many principal components explain very little variation. The red line shows the cumulative percentage of explained variation with increasing principal components. Note that in this case 18 principal components are needed to explain over 75% of variation in the data. This is not an unusual result for complex biological datasets including genetic information, as clear relationships between groups are sometimes difficult to observe in the data. The screeplot shows that using a PCA we have reduced 91 predictors to 18 in order to explain a significant amount of variation in the data. See additional arguments in the screeplot function for improving the appearance of the plot.
Investigating the principal components
Once the most important principal components have been identified using
screeplot()
, these can be explored in more detail by plotting principal components
against each other and highlighting points based on variables in the metadata.
This will allow any potential clustering of points according to demographic or
phenotypic variables to be seen.
We can use biplots to look for patterns in the output from the PCA. Note that there
are two functions called biplot()
, one in the package PCAtools
and one in
stats
. Both functions produce biplots but their scales are different!
Challenge 5
Create a biplot of the first two principal components from your PCA using the biplot() function in PCAtools. See help("PCAtools::biplot") for arguments and their meaning. For instance, lab or colby may be useful.
Examine whether the data appear to form clusters. Explain your results.
Solution
biplot(pc, lab = NULL, colby = 'Grade', legendPosition = 'top')
[Figure: biplot of PC1 versus PC2 for the gene expression data, coloured by tumour grade.]
The biplot shows the position of patient samples relative to PC1 and PC2 in a 2-dimensional plot. Note that two groups are apparent along the PC1 axis according to expression of different genes, while no separation can be seen along the PC2 axis. Labels for each patient sample are added by default, but can be removed if there is too much overlap in names. Note that PCAtools does not scale the biplot in the same way as the biplot() function in the stats package.
Let’s consider this biplot in more detail, and also display the loadings:
Challenge 6
Use the colby and lab arguments in biplot() to explore whether these two groups may cluster by patient age or by whether or not the sample expresses the oestrogen receptor gene (ER+ or ER-).
Note: You may see a warning about ggrepel. This happens when there are many labels but little space for plotting. This is not usually a serious problem - not all labels will be shown.
Solution
PCAtools::biplot(pc, lab = paste0(pc$metadata$Age,'years'), colby = 'ER', hline = 0, vline = 0, legendPosition = 'right')
Warning: ggrepel: 35 unlabeled data points (too many overlaps). Consider increasing max.overlaps
[Figure: biplot of PC1 versus PC2 labelled by patient age and coloured by ER status.]
It appears that one cluster has more ER+ samples than the other group.
So far, we have only looked at a biplot of PC1 versus PC2, which only gives part of the picture. The pairsplot() function in PCAtools can be used to create multiple biplots including different principal components.
pairsplot(pc)

[Figure: pairsplot showing biplots for pairs of the first five principal components.]
The plots show two apparent clusters involving the first principal component only. No other clusters are found involving other principal components. Each dot is coloured differently along a gradient of blues. This can potentially help to identify the same observation/individual in several panels. Here too, the colby argument allows you to set custom colours.
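For example, colouring every panel by tumour grade (a brief sketch; any column of the metadata could be used instead):
# Colour each point by the tumour grade recorded in the metadata
pairsplot(pc, colby = "Grade")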
Finally, it can sometimes be of interest to compare how certain variables contribute
to different principal components. This can be visualised with plotloadings()
from
the PCAtools
package. The function checks the range of loadings for each
principal component specified (default: first five PCs). It then selects the features
in the top and bottom 5% of these ranges and displays their loadings. This behaviour
can be adjusted with the rangeRetain
argument, which has 0.1 as the default value (i.e.
5% on each end of the range). NB, if there are too many labels to be plotted, you will see
a warning. This is not a serious problem.
plotloadings(pc, c("PC1"), rangeRetain = 0.1)

[Figure: loadings plot for PC1.]
plotloadings(pc, c("PC2"), rangeRetain = 0.1)

[Figure: loadings plot for PC2.]
plotloadings(pc, c("PC1", "PC2"), rangeRetain = 0.1)

[Figure: loadings plot for PC1 and PC2 together.]
You can see how the third code line produces more dots, some of which do not have extreme loadings. This is because all loadings selected for any PC are shown for all other PCs. For instance, it is plausible that features which have high loadings on PC1 may have lower ones on PC2.
Using PCA output in further analysis
The output of PCA can be used to interpret data or can be used in further analyses. For example, PCA outputs new variables (principal components) which represent several variables in the original dataset. These new variables are useful for further exploring data, for example, comparing principal component scores between groups or including the new variables in linear regressions. Because the principal components are uncorrelated with one another, they can be included together in a single linear regression.
Principal component regression
PCA is often used to reduce large numbers of correlated variables into fewer uncorrelated variables that can then be included in linear regression or other models. This technique is called principal component regression (PCR), and it allows researchers to examine the effect of several correlated explanatory variables on a single response variable in cases where a high degree of correlation initially prevents them from being included in the same model. PCR is just one example of how principal components can be used in further analysis of data. When carrying out PCR, the variable of interest (response/dependent variable) is regressed against the principal components calculated using PCA, rather than against each individual explanatory variable from the original dataset. As there are as many principal components created from PCA as there are variables in the dataset, we must select which principal components to include in PCR. This can be done by examining the amount of variation in the data explained by each principal component (see above).
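As an illustrative sketch (not part of the original lesson), a simple PCR could be carried out with the prostate data from earlier: treat lpsa as the response, summarise the four remaining clinical variables with a PCA, and regress lpsa on the first two principal component scores.
# Principal component regression sketch using the prostate data
pros.x <- prostate[, c("lcavol", "lweight", "lbph", "lcp")]
pca.x  <- prcomp(pros.x, center = TRUE, scale = TRUE)
summary(pca.x)   # how much variance each component explains
# Use the first two PC scores as predictors of lpsa
pcr.data <- data.frame(lpsa = prostate$lpsa, pca.x$x[, 1:2])
pcr.fit  <- lm(lpsa ~ PC1 + PC2, data = pcr.data)
summary(pcr.fit)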
Further reading
- James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013) An Introduction to Statistical Learning with Applications in R. Chapter 6.3 (Dimension Reduction Methods), Chapter 10 (Unsupervised Learning).
- Jolliffe, I.T. & Cadima, J. (2016) Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374.
- Johnstone, I.M. & Titterington, D.M. (2009) Statistical challenges of high-dimensional data. Phil. Trans. R. Soc. A 367.
- PCA: A Practical Guide to Principal Component Analysis, Analytics Vidhya.
- A One-Stop Shop for Principal Component Analysis, Towards Data Science.
Key Points
A principal component analysis is a statistical approach used to reduce dimensionality in high-dimensional datasets (i.e. where $p$ is equal to or greater than $n$).
PCA may be used to create a low-dimensional set of features from a larger set of variables. Examples of when a PCA may be useful include reducing high-dimensional datasets to fewer variables for use in a linear regression and for identifying groups with similar features.
PCA is a useful dimensionality reduction technique used in the analysis of complex biological datasets (e.g. high throughput data or genetics data).
The first principal component represents the dimension along which there is maximum variation in the data. Subsequent principal components represent dimensions with progressively less variation.
Screeplots and biplots may be used to show: 1. how much variation in the data is explained by each principal component and 2. how data points cluster according to principal component scores and which variables are associated with these scores.