How Principal Component Analysis (PCA) Works
Anyone who has tried to build machine learning models with many features will already have had a glimpse of principal component analysis, or PCA for short.
Including more features when training machine learning models can make performance worse. Increasing the number of features does not always improve classification accuracy.
When the data does not contain enough features, the model is likely to underfit, and when the data contains too many features, it is likely to overfit. The problems caused by too many features are known as the curse of dimensionality.
Therefore, we apply dimensionality reduction by selecting the optimal set of lower dimensionality features in order to improve classification accuracy.
Following are the techniques to perform the dimensionality reduction:
- Feature Selection
- Feature Extraction
If you are not sure about PCA (principal component analysis) and the need for dimensionality reduction, don't worry.
You are in the right place. In this article, we are going to cover everything.
Let’s start the discussion with the curse of dimensionality and its impact on building machine learning models.
Curse of Dimensionality
Curse of Dimensionality can be defined as:
The set of problems that arise when we work with high-dimensional data.
The dimension of a dataset is directly related to the number of features that are present in a dataset.
High-dimensional data can be defined as a dataset having a large number of attributes, generally of the order of a hundred or more.
Some of the difficulties with high-dimensional data arise during analysis and visualization, when we try to identify patterns. Others show up when we train machine learning models.
The curse of dimensionality can be defined in other words as:
The rise of difficulties due to the presence of high dimensional data when we train the machine learning models.
The two most commonly discussed aspects of the curse of dimensionality are:
- Distance concentration
- Data sparsity
Before we learn about data sparsity and distance concentration, let’s understand the curse of dimensionality with an example.
Understanding the Curse of Dimensionality with regression Example
We know that as the number of features or dimensions grows in a dataset, the amount of data needed to generalize grows exponentially, and the available data becomes sparse.
So, in high-dimensional data, objects appear dissimilar and sparse, which prevents common data organization strategies from being efficient.
Let’s see how high-dimensional data is a curse with the help of the following example.
Consider two points, 0 and 1, on a line, a unit distance away from each other.
We introduce another axis, again at a unit distance, so the points become (0,0) and (1,1).
Repeating this for more dimensions and simulating uniformly distributed points gives the following result:
In one dimension, about 1% of the uniformly distributed points are outliers. In 50 dimensions, almost 60% of the points are outliers.
Similarly, in 100 dimensions, almost 90% of the points are outliers.
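The simulation code did not survive in this copy of the article, so the following is only a minimal sketch of the kind of experiment described. It assumes that an "outlier" is a point lying within a small margin of the boundary of the unit hypercube. The margin value and the function names are illustrative assumptions, so the percentages it prints need not match the figures quoted above.

```python
import numpy as np

def boundary_fraction(n_dims, margin=0.01, n_points=20_000, seed=0):
    """Estimate the share of uniform points within `margin` of any face of the unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    # A point counts as an "outlier" (assumed definition) if any coordinate is near 0 or 1.
    near_edge = (points < margin) | (points > 1 - margin)
    return near_edge.any(axis=1).mean()

def exact_boundary_fraction(n_dims, margin=0.01):
    """Closed form: a point avoids the margin in every dimension with probability (1 - 2*margin)**n_dims."""
    return 1 - (1 - 2 * margin) ** n_dims

for d in (1, 2, 50, 100):
    print(d, round(boundary_fraction(d), 3), round(exact_boundary_fraction(d), 3))
```

Whatever margin is chosen, the share of points near the boundary grows quickly with the number of dimensions, which is the effect the example is meant to show.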
Data Sparsity
To accurately predict the outcome for a given input data sample, the supervised machine learning models are trained.
While the model is being trained, part of the data is used for training, and the rest is used to evaluate how the model performs on unseen data.
This evaluation step helps us gain an understanding of whether the model is generalized or not.
You can consider any of the below articles for splitting the dataset into train and test.
- Building a decision tree by splitting the data into train and test datasets.
- Building a random forest algorithm in python.
Model generalization can be defined as the ability of the model to accurately predict the outcome for unseen input data.
The unseen input data must come from the same distribution as the data used to train the model.
The accuracy of the generalized model’s prediction on the unseen data should be very close to its accuracy on the training data.
The efficient way to build a generalized model is by capturing a variety of possible combinations of the values of predictor variables and their corresponding targets.
For example, suppose we have to predict a target that depends on two attributes, i.e., age group and gender. Ideally, we then have to capture the targets for all possible combinations of values of these two attributes.
The model can generalize well if the training data lets it learn the mapping between the attribute values and the target.
The model would predict the target accurately as long as the future unseen data comes from the same distribution (a combination of values).
Age group levels
- Children (0-14 Years)
- Youth (15-24 Years)
- Adult (25-60 Years)
- Senior (61 and over)
Gender Levels
- Male
- Female
Age Group | Gender | Target |
---|---|---|
Children | Male | T1 |
Youth | Male | T2 |
Adult | Male | T3 |
Senior | Male | T4 |
Children | Female | T5 |
Youth | Female | T6 |
Adult | Female | T7 |
Senior | Female | T8 |
In the above example, we considered the dependence of the target value on gender and age group only.
If we also consider the dependence of the target value on a third attribute, say body type, then the number of training samples required to cover all the combinations increases sharply.
As the table above shows, two variables require eight training samples (4 age groups × 2 genders). With a third variable that has three levels, we need 8 × 3 = 24 samples, and so on.
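The growth in the number of combinations can be checked with a short sketch. The age group and gender levels come from the lists above. The three body type levels are hypothetical, since the article does not name them.

```python
from itertools import product

age_groups = ["Children", "Youth", "Adult", "Senior"]
genders = ["Male", "Female"]
# Hypothetical levels for the third attribute; the article does not list them.
body_types = ["Slim", "Average", "Heavy"]

# Every combination of attribute values needs at least one training sample.
two_attr = list(product(age_groups, genders))
three_attr = list(product(age_groups, genders, body_types))

print(len(two_attr))    # 8
print(len(three_attr))  # 24
```

Each extra attribute multiplies the count by its number of levels, so the required data grows exponentially with the number of attributes.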
Distance Concentration
Distance concentration can be defined as:
The problem of convergence of all pairwise distances to the same value as the data dimensionality increases.
Some of the machine learning models, such as clustering or nearest neighbors’ methods, make use of distance-based metrics to identify the proximity of the samples.
The concept of similarity or proximity of the samples may not be qualitatively relevant in higher dimensions due to distance concentration.
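Distance concentration can be observed numerically. The sketch below, which assumes uniformly distributed random points and uses a hypothetical helper named `relative_contrast`, measures how far apart the nearest and farthest neighbours of one query point are, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(n_dims, n_points=200, seed=0):
    """Return (max - min) / min over the distances from one query point to all other points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query, others = points[0], points[1:]
    dists = np.linalg.norm(others - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the dimensionality grows.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 2))
```

When this ratio approaches zero, the nearest and farthest neighbours are almost equally far away, which is why distance-based methods lose discriminating power in high dimensions.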
Implications of the Curse of Dimensionality
The curse of dimensionality has the following implications:
- Due to the large number of features, optimization problems become infeasible.
- The probability of observing any particular point keeps falling because of the sheer number of possible points in an n-dimensional space.
Mitigating Curse of Dimensionality
To overcome the problems associated with high-dimensional data, techniques known as dimensionality reduction techniques are applied.
These techniques fall into one of the following two categories:
- Feature selection
- Feature extraction
Feature selection Methods
In feature selection techniques, we test the attributes on the basis of their worth, and then they are selected or eliminated.
Following are some of the commonly used Feature selection techniques:
Low Variance filter
The process flow of this technique is as under:
- The variance of all the attributes in a dataset is compared.
- The attributes having sufficiently low variance are discarded.
- The attributes that do not possess much variance assume a constant value, thus having no contribution to the model’s predictability.
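The steps above can be sketched with NumPy. The threshold value and the synthetic columns are assumptions chosen for illustration. In practice the threshold depends on the scale of each attribute.

```python
import numpy as np

def low_variance_filter(X, threshold=0.01):
    """Keep only the columns of X whose variance exceeds `threshold`."""
    variances = X.var(axis=0)
    keep = variances > threshold
    return X[:, keep], keep

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),               # varies freely, variance close to 1
    np.full(100, 32.0),                 # constant column, variance 0
    rng.normal(scale=0.001, size=100),  # almost constant column
])

X_reduced, kept = low_variance_filter(X)
print(kept)             # [ True False False]
print(X_reduced.shape)  # (100, 1)
```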
High Correlation filter
In this technique, the steps are as under:
- The pairwise correlation between attributes is determined.
- One of the attributes in the pair that has a significantly high correlation is eliminated and the other retained.
- In the eliminated attribute, the variability is captured through the retained attribute.
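A minimal sketch of these steps follows. The 0.9 threshold, the helper name `high_correlation_filter`, and the synthetic height, leg length, and grade columns are all assumptions made for the illustration.

```python
import numpy as np

def high_correlation_filter(X, threshold=0.9):
    """Return the column indices to keep, dropping the later column of each highly correlated pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        # Keep column j only if it is not strongly correlated with a column already kept.
        if all(corr[j, i] <= threshold for i in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
leg_length = 0.5 * height + rng.normal(0, 1, size=200)  # nearly collinear with height
grade = rng.normal(70, 15, size=200)                    # unrelated to the other two
X = np.column_stack([height, leg_length, grade])

print(high_correlation_filter(X))  # [0, 2]
```

Here the leg length column is dropped because almost all of its variability is already captured by height.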
Multicollinearity
Multicollinearity occurs when there is a high degree of correlation between two or more independent variables in a regression model.
It means that one independent variable can be determined or predicted from another independent variable.
The Variance Inflation Factor (VIF) is a well-known technique used to detect multicollinearity. Attributes having high VIF values, usually greater than 10, are discarded.
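The VIF of an attribute is 1 / (1 − R²), where R² is obtained by regressing that attribute on all the other attributes. The sketch below computes it with plain NumPy least squares. The function name and the synthetic data are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """Compute VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n_samples), others])  # add an intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r_squared = 1 - residuals.var() / y.var()
        vifs[j] = 1 / (1 - r_squared)
    return vifs

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # almost a copy of x1
x3 = rng.normal(size=200)                  # independent of the others
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

With this data the first two attributes get VIF values far above 10, while the independent third attribute stays close to 1.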
Feature Ranking
The attributes can be ranked by decision tree models such as CART (Classification and Regression Trees) based on their importance or contribution to the model’s predictability.
The lower-ranked variables in high dimensional data could be eliminated to reduce the dimensions.
Forward selection
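When a multiple linear regression model is built with high-dimensional data, only one attribute is selected at the beginning to build the regression model. The remaining attributes are then added one at a time, and an attribute is kept only if it improves the model.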
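One common form of forward selection can be sketched as follows. It is an assumption about the procedure, not the article's own code: at each round it adds the attribute that raises R² the most and stops when the gain falls below a hypothetical `min_gain` threshold.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1 - residuals.var() / y.var()

def forward_selection(X, y, min_gain=0.01):
    """Greedily add the attribute that improves R^2 the most until the gain is below `min_gain`."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        best_j = max(scores, key=scores.get)
        if scores[best_j] - best_score < min_gain:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Only attributes 0 and 3 influence the target in this synthetic example.
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=300)
print(forward_selection(X, y))  # [0, 3]
```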
Feature Extraction Methods
Feature extraction techniques combine high-dimensional attributes into a smaller number of low-dimensional components. There are a number of such techniques:
- Independent Component Analysis
- Principal Component Analysis
- Autoencoder
- Partial Least Squares
We will be discussing the Principal Component Analysis in detail.
Principal Component Analysis (PCA)
Karl Pearson invented Principal Component Analysis in 1901 as an analog of the principal axis theorem. Harold Hotelling later developed it independently and gave it its name in the 1930s.
Principal Component Analysis or PCA can be defined as:
A dimensionality-reduction technique that transforms high-dimensional correlated data into a lower-dimensional set of uncorrelated components, also referred to as principal components.
The lower-dimensional principal components capture the majority of the information in the high dimensional dataset.
Data with ‘n’ dimensions is transformed into ‘n’ principal components. A subset of these ‘n’ principal components is then selected based on the percentage of variance in the data that the principal components are intended to capture.
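Selecting the subset by a target percentage of variance can be sketched as follows. This is a minimal illustration: the eigenvalues are made-up numbers, and `components_for_variance` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of components whose cumulative share of variance reaches `target`."""
    ordered = np.sort(eigenvalues)[::-1]
    cumulative = np.cumsum(ordered) / ordered.sum()
    # argmax returns the first position where the condition becomes True.
    return int(np.argmax(cumulative >= target) + 1)

# Made-up variances of five principal components (cumulative shares: 0.5, 0.8, 0.9, 0.96, 1.0).
eigenvalues = np.array([5.0, 3.0, 1.0, 0.6, 0.4])

print(components_for_variance(eigenvalues, target=0.95))  # 4
print(components_for_variance(eigenvalues, target=0.85))  # 3
```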
We can also define Principal Component Analysis (PCA) as an exploratory approach to reduce the dataset’s dimensionality to 2D or 3D.
It is used in exploratory data analysis and for making predictive models.
Principal Component Analysis can be described as a linear transformation of the data set to a new coordinate system, defined as follows:
- The greatest variance of any projection of the data set lies on the first axis.
- Similarly, the second greatest variance lies on the second axis, and so on.
Purpose of Principal Component Analysis
Principal component analysis (PCA) is used for the following purposes:
- To visualize the high dimensionality data.
- To introduce improvements in classification.
- To obtain a compact description.
- To capture as much variance in the data as possible.
- To decrease the number of dimensions in the dataset.
- To search for patterns in the dataset of high dimensionality.
- To discard noise
How Principal Component Analysis (PCA) Works
In short, principal component analysis (PCA) can be defined as:
Transforming and reshaping a large number of variables into a smaller number of unrelated variables known as principal components (PCs), developed to capture as much of the variance in the dataset as possible.
Objectives of PCA
The following are the main mathematical objectives of PCA:
- Finding an orthonormal basis for the data
- Sorting the dimensions in the order of importance
- Discarding the low significant dimensions
- Focusing on uncorrelated and Gaussian components
Steps involved in PCA
The following are the main steps involved in Principal Component Analysis.
- Standardization of the data.
- Calculation of the covariance matrix.
- Finding the eigenvalues and eigenvectors for the covariance matrix.
- Plotting the vectors on the scaled data.
Problem depicting PCA requirement
Let’s suppose that there are 100 students in a class having “k” different features like
- age,
- height,
- hair color,
- weight,
- grade, and many more.
It is possible that most of the features are not useful in describing the student. For this reason, it is important to identify the valuable features that characterize the person.
The analysis based on observing different features of a student:
- Every student is described by a vector of length k, i.e., k characteristic features such as (height, weight, hair_color, grade), for example (181, 68, black, 99).
- Each column represents one student vector, so there are n = 100 columns. Here, n represents the number of students.
- Together, these columns form a k × n matrix.
- Each student lies in a k-dimensional vector space.
Principal Component Analysis Features
PCA keeps some kinds of features and ignores the rest, as listed below.
PCA Ignored Features
- Linearly dependent or collinear features. e.g., height and leg size.
- Constant features. e.g., Number of teeth.
- Noisy features which are constant. e.g., hair thickness
PCA Key Features to Keep
- Low covariance or non-collinear features
- Features that are variable and have high variance, e.g., grade, age.
Math Behind Principal Component Analysis
It is important to understand the mathematical logic involved before kickstarting PCA. Eigenvalues and eigenvectors play essential roles in PCA.
Eigenvectors and eigenvalues
The core of PCA lies in the eigenvectors and eigenvalues of the covariance (or correlation) matrix.
Eigenvectors determine the direction of the new attribute space, and the magnitude is determined by eigenvalues.
Let’s consider a simple example depicting the calculation of eigenvalues and eigenvectors.
Let X represent a square matrix. The function scipy.linalg.eig computes the eigenvalues and eigenvectors of a square matrix.
The matrix X looks like this:
[[1, 0],
[0, -2]]
The function la.eig (with scipy.linalg imported as la) returns a tuple (eigvals, eigvecs), where eigvals is a 1D NumPy array of complex numbers giving the eigenvalues of X.
The second element, eigvecs, is a 2D NumPy array with the corresponding eigenvectors in its columns.
The eigenvalues of the matrix X are:
[1.+0.j -2.+0.j]
The corresponding eigenvectors are:
[[1. 0.], [0. 1.]]
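The code for this example is missing from this copy of the article. A minimal sketch consistent with the description above, assuming `scipy.linalg` is imported under the alias `la` that the text uses, might look like this:

```python
import numpy as np
import scipy.linalg as la

# The 2 x 2 matrix from the example above.
X = np.array([[1, 0],
              [0, -2]])

# eigvals is a 1D complex array; the columns of eigvecs are the eigenvectors.
eigvals, eigvecs = la.eig(X)
print(eigvals)
print(eigvecs)

# Check the defining relation X v = lambda v for every eigenpair.
for value, vector in zip(eigvals, eigvecs.T):
    assert np.allclose(X @ vector, value * vector)
```

Because X is diagonal, its eigenvalues are simply the diagonal entries and its eigenvectors are the coordinate axes.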
The main objective of PCA is to reduce the dimensionality of data by projecting it onto a smaller subspace whose axes are formed by the eigenvectors.
All the eigenvectors have a length of 1, so they define only the directions of the new axes. The eigenvectors with the highest eigenvalues are the ones that carry the most information about the distribution of our data.
Covariance Matrix
The classic PCA approach computes the covariance matrix, where each element is the covariance between two attributes.
The sample covariance between two attributes X and Y over n observations is:
cov(X, Y) = (1 / (n − 1)) · Σ (x_i − x̄)(y_i − ȳ)
First, the data matrix is created, and then the covariance matrix is computed from it. Eigenvalues and eigenvectors can also be calculated using the correlation matrix.
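As an illustration, the covariance matrix can be computed by hand and compared with NumPy's `np.cov`. The sketch uses the sample covariance (dividing by n − 1) and a few made-up height and weight values. The helper name `covariance_matrix` is hypothetical.

```python
import numpy as np

def covariance_matrix(X):
    """Sample covariance of the columns of X, where each row is one observation."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered / (X.shape[0] - 1)

# Made-up (height, weight) values for four students.
X = np.array([[181.0, 68.0],
              [170.0, 60.0],
              [165.0, 55.0],
              [190.0, 85.0]])

C = covariance_matrix(X)
print(C)

# np.cov treats rows as variables by default, hence rowvar=False.
print(np.allclose(C, np.cov(X, rowvar=False)))  # True
```

The diagonal of the result holds the variance of each attribute, and the off-diagonal entries hold the covariance between pairs of attributes.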
Applications of PCA
The typical applications of PCA are as under:
Data Visualization: PCA makes data easy to explore by bringing out strong patterns in the relevant dataset.
Data Compression: The amount of the given data can be reduced by decreasing the number of eigenvectors used to reconstruct the original data matrix.
Noise Reduction: PCA cannot eliminate noise. It can only reduce it. Denoising with PCA decreases the influence of the noise as much as possible.
Image Compression: Principal component analysis reduces the dimensions of the image and uses the reduced dimensions to reconstruct an image that retains most of its quality.
Face Recognition: EigenFaces is an approach generated using PCA, which performs face recognition and reduces statistical complexity in face image recognition.
Principal Component Analysis Implementation in Python
Next, we get the values of a and b. Now, let's implement PCA with the covariance matrix.
After standardizing a, we apply PCA with two components. To check the eigenvectors, we print them.
Eigenvectors
Eigenvalues
Sorted component
Plotting PCA with several components:
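The original code and its output are missing from this copy of the article, so the following is only a sketch of the workflow the section describes, written with NumPy. The arrays `a` (features) and `b` (labels) are synthetic stand-ins generated here for illustration, not the article's dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for the article's `a` (features) and `b` (labels).
n_samples = 150
b = rng.integers(0, 3, size=n_samples)
latent = rng.normal(size=(n_samples, 2)) + b[:, None]
a = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(n_samples, 4))

# 1. Standardize each feature to zero mean and unit variance.
a_std = (a - a.mean(axis=0)) / a.std(axis=0)

# 2. Covariance matrix of the standardized features.
cov = np.cov(a_std, rowvar=False)

# 3. Eigen-decomposition; eigh is meant for symmetric matrices and returns real values.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort the components by decreasing eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Keep two components and project the data onto them.
components = eigenvectors[:, :2]
projected = a_std @ components

explained = eigenvalues / eigenvalues.sum()
print("Eigenvectors:\n", eigenvectors)
print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained)
print("Projected shape:", projected.shape)
```

The two columns of `projected` can then be drawn as a scatter plot, colored by `b`, to visualize the data in two dimensions.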
Conclusion
We know that massive datasets are increasingly widespread in all sorts of disciplines. To interpret such datasets, the dimensionality is reduced in a way that preserves as much of the information in the data as possible.
PCA reduces to an eigenvalue and eigenvector problem. We make use of PCA to remove collinearity before training models such as neural networks and linear regression.
Furthermore, we can use PCA to avoid multicollinearity and to decrease the number of variables.
Each principal component is a linear combination of the p original features.
Working with a few such linear combinations reduces the number of plots necessary for visual analysis while retaining most of the information present in the data. In machine learning, feature reduction is an essential preprocessing step.
Therefore, PCA is an effective step of preprocessing for compression and noise removal in the data. It finds a new set of variables smaller than the original set of variables and thus reduces a dataset’s dimensionality.
Frequently Asked Questions (FAQs) On Principal Component Analysis
1. What is Principal Component Analysis (PCA)?
PCA is a statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
2. How Does PCA Reduce the Dimensionality of Data?
PCA reduces dimensionality by identifying directions, or 'principal components,' along which the variation in the data is maximal and projecting the data onto these fewer dimensions.
3. What is a Principal Component?
A principal component is a linear combination of the original variables that captures the maximum variance in the data, with each successive component capturing the maximum of the remaining variance subject to being orthogonal to the preceding components.
4. Why is Reducing Dimensions Useful in Machine Learning?
Reducing dimensions can help in alleviating issues due to the curse of dimensionality, reduce overfitting, and decrease computational costs, while retaining most information.
5. How Many Principal Components Should I Choose?
The number of principal components is often chosen by looking at the explained variance plot and selecting enough components to account for a high percentage of the variance (commonly ≥ 95%).
6. Does PCA Always Improve the Performance of Machine Learning Models?
Not necessarily. While PCA can improve model performance by reducing noise and computational burden, it might also discard information that was useful for prediction.
7. Can PCA be Used for Both Regression and Classification?
Yes, PCA can be used for feature reduction in both regression and classification problems.
8. Is PCA Sensitive to Scaling of Data?
Yes, PCA is affected by the scale of the features, so it’s important to standardize data before applying PCA.
9. What's the Difference Between PCA and Factor Analysis?
PCA is a dimensionality reduction technique, while factor analysis is a modeling technique aimed at discovering the latent structure in data. PCA components explain the total variance in the data, while factor analysis explains the covariance among variables.
10. How is PCA Different from SVD (Singular Value Decomposition)?
PCA is often implemented using SVD. While they are related, SVD is a more general matrix decomposition method, and PCA specifically seeks to find the principal components that maximize variance.
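The relationship can be checked numerically. In the sketch below, which uses random synthetic data, the variances obtained from the eigenvalues of the covariance matrix match the squared singular values of the centered data divided by n − 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
Xc = X - X.mean(axis=0)  # PCA works on centered data

# Route 1: eigenvalues of the covariance matrix, sorted in decreasing order.
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Route 2: singular values of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = s**2 / (Xc.shape[0] - 1)

print(np.allclose(eigvals, svd_variances))  # True
```

The rows of `Vt` are the principal directions, equal to the covariance eigenvectors up to sign.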
11. Can PCA Handle Categorical Data?
PCA is not designed for categorical data directly. Categorical features should be suitably encoded or otherwise transformed into numerical format before PCA is applied.
12. Does PCA Work with Missing Data?
PCA requires a complete dataset. Missing data needs to be imputed before applying PCA, or alternative techniques like Probabilistic PCA which can handle missing values can be used.
13. What Are the Limitations of PCA?
Limitations include sensitivity to the scale of the data, loss of information due to dimensionality reduction, and the assumption that principal components with the largest variance are the most informative.