Hello Friends,
In the last video on Multivariate Analysis,
we saw an introduction to multivariate
analysis, some of the important concepts used
in it, and an overview of the various tools
and techniques that are part of it.
In this video, we are going to learn the first
multivariate analysis tool in Minitab, with
the help of a practical example for easy
understanding and better clarity.
So, let’s begin…
Principal Components Analysis:
The Principal Components Analysis is used
to identify a smaller number of uncorrelated
variables, also called "principal components",
from a large set of data.
With this analysis, you create new variables
(principal components) that are linear combinations
of the observed variables.
The goal of principal components analysis
is to explain the maximum amount of variance
with the fewest number of principal components.
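To make this idea concrete outside Minitab, here is a minimal Python sketch
of what a principal components analysis computes. It assumes NumPy is
available and uses randomly generated stand-in data, not the data from this
video. The variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-in data: 30 rows, 8 columns (NOT the data from the video).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
X[:, 1] += 0.8 * X[:, 0]  # make two columns correlated

# Standardize each column so every variable has mean 0 and variance 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)

# eigh returns eigenvalues in ascending order; put the largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each principal component is a linear combination of the standardized variables.
scores = Z @ eigenvectors

print("eigenvalues:", np.round(eigenvalues, 3))
print("proportion of variance:", np.round(eigenvalues / eigenvalues.sum(), 3))
```

In this sketch the new variables are uncorrelated with each other, and the
variance of each one equals its eigenvalue, so the first few components
carry most of the variation.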
For example, a bank requires eight pieces
of information from loan applicants: income,
education level, age, length of time at current
residence, length of time with current employer,
savings, debt, and the number of credit cards.
A bank administrator wants to analyze this
data to determine the best way to group and
report it.
The administrator collects this information
for 30 loan applicants.
Here, the administrator performs a principal
component analysis to reduce the number of
variables to make the data easier to analyze.
The administrator wants enough components
to explain at least 90% of the variation in
the data.
Data considerations for Principal Components
Analysis:
To ensure that your results are valid, consider
the following guidelines when you collect
data, perform the analysis, and interpret
your results.
In the case of Principal Components Analysis,
the data requirements are simple:
you should have at least two variables,
and the measurements for each variable should
be recorded in a separate numeric column.
Example of Principal Components Analysis:
Let’s continue with the same example.
Conduct Principal Component Analysis (PCA)
in Minitab:
To conduct a Principal Component Analysis
in Minitab, please follow the steps:
1.
Enter or copy the data into a Minitab worksheet,
with the data for each variable in its own column,
as shown in the picture.
2.
Select Stat > Multivariate > Principal Components.
3.
In Variables, enter C1-C8.
4.
In Number of components to compute, keep
the field blank.
This field is where you enter the number of
principal components that you want Minitab
to calculate.
If you have a large number of variables, you
may want to specify a smaller number of components
to reduce the amount of output.
If you do not know how many components to
enter, as in our example, you can leave this
field blank.
5.
In Type of Matrix, keep the default selection
of Correlation.
This option sets the type of matrix that is
used to calculate the principal components.
• Correlation: Use this when your variables
have different scales and you want to weight
all the variables equally.
Our example falls in this category.
• Covariance: Use this when your variables
use the same scale, or when your variables
have different scales, but you want to give
more emphasis to variables with higher variances.
6.
Under Graphs, select the graphs you want
to see for the analysis.
• Scree plot: Use a scree plot to identify the
number of components that explain most of
the variation in the data.
• Score plot for the first 2 components: Use
the score plot to look for clusters, trends,
and outliers in the first two principal components.
• Loading plot for the first 2 components: Use
the loading plot to visually interpret the
first two principal components.
• Biplot for the first 2 components: Use the
biplot to look for clusters, trends, and outliers
while interpreting the first two
principal components.
The biplot overlays the score plot and the
loading plot on the same graph.
• Outlier plot: Use the outlier plot to identify
outliers in the data.
7.
Click OK in each dialog box to get the results.
We will get the results of the analysis in
the Session window and in the Graph windows.
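The choice between Correlation and Covariance in step 5 can be illustrated
with a small Python sketch. It assumes NumPy and uses two made-up variables
on very different scales. The names income and cards, and their values, are
hypothetical and are not taken from the worksheet in this video.

```python
import numpy as np

# Two hypothetical variables on very different scales.
rng = np.random.default_rng(1)
n = 30
income = rng.normal(50_000, 10_000, size=n)  # large numbers, large variance
cards = rng.normal(3, 1, size=n)             # small numbers, small variance
X = np.column_stack([income, cards])

def first_component(matrix):
    """Return the eigenvector that belongs to the largest eigenvalue."""
    values, vectors = np.linalg.eigh(matrix)
    return vectors[:, np.argmax(values)]

pc1_cov = first_component(np.cov(X, rowvar=False))
pc1_corr = first_component(np.corrcoef(X, rowvar=False))

# Signs of eigenvectors are arbitrary, so compare absolute coefficients.
print("covariance  PC1 coefficients:", np.round(np.abs(pc1_cov), 3))
print("correlation PC1 coefficients:", np.round(np.abs(pc1_corr), 3))
```

With the covariance matrix, the first component is almost entirely the
high-variance variable. With the correlation matrix, both variables get the
same weight, which is why Correlation suits data recorded in different units.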
Interpretation of Results:
In these results, use the cumulative proportion
to determine the amount of variance that the
principal components explain.
Retain the principal components that explain
an acceptable level of variance.
The acceptable level depends on your application.
For descriptive purposes, you may only need
80% of the variance explained.
However, if you want to perform other analyses
on the data, you may want to have at least
90% of the variance explained by the principal
components.
This is the case in our example.
The first four principal components explain
90.7% of the variation in the data.
Therefore, the administrator decides to use
these components to analyze loan applicants.
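The cumulative proportion rule can be sketched in a few lines of Python.
This is only an illustration: the function name components_needed is
hypothetical, and the eigenvalues below are made-up numbers chosen for the
example, not the values from the Minitab output in this video.

```python
import numpy as np

def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative proportion
    of variance reaches the threshold (for example 0.90)."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(eigenvalues)), cumulative

# Illustrative eigenvalues only (they sum to 8, as for 8 standardized variables).
illustrative_eigenvalues = [3.6, 2.1, 1.1, 0.5, 0.4, 0.15, 0.1, 0.05]

k90, cumulative = components_needed(illustrative_eigenvalues, 0.90)
k80, _ = components_needed(illustrative_eigenvalues, 0.80)

print("cumulative proportion:", np.round(cumulative, 4))
print("components for 90%:", k90)
print("components for 80%:", k80)
```

A stricter threshold keeps more components, so the choice between 80% and
90% directly changes how many components you carry into later analyses.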
You can also use the size of the eigenvalues
to determine the number of principal components.
Retain the principal components with the largest
eigenvalues, for example, those with eigenvalues
greater than 1.
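As a rough sketch of the eigenvalue rule, assuming NumPy: when the
correlation matrix is used, each standardized variable contributes a
variance of 1, so a component with an eigenvalue above 1 explains more than
a single original variable does. The eigenvalues below are made-up numbers
for illustration, not the values from this example.

```python
import numpy as np

# Illustrative eigenvalues only, not taken from the Minitab output.
eigenvalues = np.array([3.6, 2.1, 1.1, 0.5, 0.4, 0.15, 0.1, 0.05])

# Keep the components whose eigenvalue is greater than 1.
retained = eigenvalues[eigenvalues > 1.0]

print("components retained:", len(retained))
print("variance they explain:", round(retained.sum() / eigenvalues.sum(), 4))
```

The eigenvalue rule and the cumulative proportion rule do not always agree,
so it is worth checking both before settling on a number of components.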
The scree plot orders the eigenvalues from
largest to smallest.
The ideal pattern is a steep curve, followed
by a bend, and then a straight line.
Use the components in the steep curve before
the first point that starts the line trend.
The loading plot visually shows the results
for the first two components.
Age, Residence, Employ, and Savings have large
positive loadings on component 1, so this
component measures long-term financial stability.
Debt and Credit Cards have large negative
loadings on component 2, so this component
primarily measures an applicant's credit history.
Use the outlier plot to identify outliers.
Any point that is above the reference line
is an outlier.
Outliers can significantly affect the results
of your analysis.
In these results, there are no outliers.
All the points are below the reference line.
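Minitab's outlier plot is based on the Mahalanobis distance of each
observation. The following Python sketch shows how such distances can be
computed, assuming NumPy and randomly generated stand-in data. The cutoff
value here is a hypothetical placeholder, not the reference line that
Minitab draws.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row from the column means."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Row-wise quadratic form: centered_i @ cov_inv @ centered_i
    squared = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(squared)

# Synthetic stand-in data: 30 observations, 8 variables (NOT the video's data).
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 8))

distances = mahalanobis_distances(X)
cutoff = 4.0  # hypothetical reference value, chosen only for illustration
outliers = np.flatnonzero(distances > cutoff)

print("largest distance:", round(float(distances.max()), 3))
print("rows above the cutoff:", outliers)
```

Rows whose distance falls above the chosen cutoff would be the ones to
inspect more closely before trusting the components.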
The first principal component accounts for
44.3% of the total variance.
The variables that correlate the most with
the first principal component (PC1) are Age
(0.484), Residence (0.466), Employ (0.459),
and Savings (0.404).
The first principal component is positively
correlated with all four of these variables.
Therefore, increasing values of Age, Residence,
Employ, and Savings increase the value of
the first principal component.
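Here is a small Python sketch of that last point. It treats the four
values quoted above as the PC1 coefficients for those variables and assumes
the variables are standardized. The coefficients for the other four
variables are not given here, so the function computes only a partial
contribution to the PC1 score, and the applicant values are hypothetical.

```python
# PC1 coefficients quoted above; the other four variables are not listed,
# so this is only part of the full PC1 score.
pc1_coefficients = {
    "Age": 0.484,
    "Residence": 0.466,
    "Employ": 0.459,
    "Savings": 0.404,
}

def partial_pc1(standardized_values):
    """Contribution of the four listed variables to the PC1 score."""
    return sum(pc1_coefficients[name] * standardized_values[name]
               for name in pc1_coefficients)

# Hypothetical applicant at the average of every variable (all z-scores 0).
average_applicant = {name: 0.0 for name in pc1_coefficients}

# Same applicant, but one standard deviation older.
older_applicant = dict(average_applicant, Age=1.0)

change = partial_pc1(older_applicant) - partial_pc1(average_applicant)
print("change in PC1 contribution:", round(change, 3))
```

Because all four coefficients are positive, raising any of these variables
while holding the others fixed raises the first principal component.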
