
Factor Analysis

“A client is to me a mere unit, a factor in a problem.”
— Arthur Conan Doyle

Factor Analysis:

Factor Analysis (FA) is a method for revealing relationships between assumed latent variables and manifest variables. A latent variable is a concept that cannot be measured directly but is assumed to be related to several directly measurable features in the data; those measurable features are called manifest variables.

There are two main forms of FA: exploratory and confirmatory. Exploratory FA is designed to uncover relationships between manifest variables and factors without any assumption about which manifest variables are related to which factors. Confirmatory FA tests whether a specific factor model provides an adequate fit for the correlations between specific manifest variables.

The FA model is very similar to multiple linear regression because the measurable manifest variables are regressed on the latent variables. In other words, an FA model assumes that the observed relationships among the manifest variables are due to the relationships of those variables to the latent variables.
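
To make the regression analogy concrete, here is a minimal simulation sketch (the loading values and dimensions are made up for illustration) that generates five manifest variables from two latent factors plus unique error terms:

import numpy as np

rng = np.random.default_rng(0)
n_obs, n_factors, n_manifest = 1000, 2, 5

# Unobserved (latent) factor scores for each observation
factors = rng.standard_normal((n_obs, n_factors))

# Loadings play the role of regression coefficients of the
# manifest variables on the latent factors (values made up)
loadings = np.array([[0.9, 0.0],
                     [0.8, 0.1],
                     [0.7, 0.2],
                     [0.1, 0.8],
                     [0.0, 0.9]])

# Unique (error) terms, one per manifest variable
unique = 0.5 * rng.standard_normal((n_obs, n_manifest))

# Each manifest variable is a linear combination of the factors plus its unique term
manifest = factors @ loadings.T + unique

FA works in the opposite direction: given only the manifest variables, it tries to recover the loadings (up to rotation).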

 

Parameters in the FA Model:

There are two main approaches to estimating the parameters of an FA model: principal factor analysis and maximum likelihood factor analysis. Principal FA works from the correlation matrix of the manifest variables. Here, there are two common ways to obtain starting estimates of the communalities: use the squared multiple correlation of each manifest variable with all of the other variables, OR use the largest of the absolute values of that variable's correlations with the other manifest variables.
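
As a quick illustration of the two starting estimates (a minimal sketch, assuming we already have the correlation matrix R of the manifest variables as a NumPy array):

import numpy as np

def communality_guesses(R):
    # Squared multiple correlation of variable i with all others:
    # 1 - 1 / (R^-1)[i, i]
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    # Largest absolute correlation of variable i with any other variable
    max_abs = np.abs(R - np.eye(R.shape[0])).max(axis=1)
    return smc, max_abs

Either set of guesses replaces the ones on the diagonal of the correlation matrix before the factors are extracted.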

Find the Number of Factors:

It is very important to choose the right number of factors: too few factors produce too many high loadings, while too many factors result in a fragmented model. How to choose the right number of factors is discussed further in the programming section of this chapter.

Rotation:

Rotation is a process that makes an FA solution more interpretable. Every rotation has a goal. Orthogonal rotations require that the factors be uncorrelated; their goal is generalizability and simplicity. In contrast, oblique rotations allow correlated factors, and their goal is to produce the best fit.

The rules of rotation:

  • Interpretation is easier with strong loadings;

  • Each row in the factor matrix must contain at least one zero;

  • Each column must contain at least k zeros, where k is the number of factors;

  • Every pair of columns of the factor matrix should have variables whose loadings are strong in one column but vanish in the other;

  • If the number of factors is above four, every pair of columns should have several variables with zero loadings in both columns;

  • In every pair of columns, only a few variables should have non-zero loadings in both columns.

Rotation Techniques:

Orthogonal rotation: Its major advantage is simplicity, because the loadings are simply correlations between the factors and the observed features.

  • Varimax: Produces a few large loadings and many loadings close to 0.

  • Quartimax: Forces each variable to correlate highly with one factor. Makes large loadings extremely large and small loadings extremely small. It maximizes the variance across the rows of the factor matrix by raising the loadings to the fourth power.

  • Oblimax: Used when the assumption of homogeneously distributed error cannot be applied and may be replaced by a principle of maximum kurtosis, according to D. R. Saunders (The rationale for an “oblimax” method of transformation in factor analysis).

  • Equimax: An attempted improvement on varimax. The rotation adjusts to the number of factors being rotated, resulting in a more uniformly distributed set of factors than varimax. Creates less generic factors.

Oblique rotation: More complex. Factor pattern coefficients are regression coefficients: the observed features can be reproduced by multiplying the factors by the factor pattern coefficients. Factor structure coefficients are correlation coefficients between the factors and the observed features.

  • Promax: Raises the loadings of an orthogonal solution to a power. Computationally inexpensive, so it can be used for large data sets.

  • Oblimin: Attempts a simple factor pattern matrix structure by using a parameter that controls the degree of correlation among the factors. It seeks the best structure while minimizing (a) the powers of the loadings and (b) the correlation between factors. Analysts can set the magnitude of the correlation between the factors.

  • Quartimin: A good solution for complex data, but biased towards highly intercorrelated features when generating factors. (Read Applied Factor Analysis by R. J. Rummel.)
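
All of these rotations can be requested by name through the rotation argument of FactorAnalyzer. A minimal sketch, assuming the scaled_baseball DataFrame prepared later in this post:

from factor_analyzer import FactorAnalyzer

# Fit the same data under several rotations and compare the loadings
for rot in ['varimax', 'quartimax', 'promax', 'oblimin']:
    fa = FactorAnalyzer(n_factors=4, rotation=rot)
    fa.fit(scaled_baseball)
    print(rot)
    print(fa.loadings_.round(2))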

Factor Analysis with factor_analyzer in Python:

First, we load the required packages and ingest the data. We use the same data (the baseball data) as in the Principal Components Analysis chapter.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
from factor_analyzer.factor_analyzer import calculate_kmo

# Read the baseball data (file name assumed; point this at your local copy)
baseball_df = pd.read_csv('baseball.csv')
baseball_df.head()
result = baseball_df.columns
print(result)
Index(['INDEX', 'TARGET_WINS', 'TEAM_BATTING_H', 'TEAM_BATTING_2B',
       'TEAM_BATTING_3B', 'TEAM_BATTING_HR', 'TEAM_BATTING_BB',
       'TEAM_BATTING_SO', 'TEAM_BASERUN_SB', 'TEAM_BASERUN_CS',
       'TEAM_BATTING_HBP', 'TEAM_PITCHING_H', 'TEAM_PITCHING_HR',
       'TEAM_PITCHING_BB', 'TEAM_PITCHING_SO', 'TEAM_FIELDING_E',
       'TEAM_FIELDING_DP'],
      dtype='object')

# Sort by wins and build a tertile grouping of the target (not used by FA itself)
baseball_df = baseball_df.sort_values('TARGET_WINS')
baseball_df['WINS_GROUP'] = pd.qcut(baseball_df['TARGET_WINS'], 3, labels=['1st 33%','2nd 33%','3rd 33%'])

# Drop the index, the target and grouping columns, and TEAM_BATTING_HBP
final_df = baseball_df.drop(['TEAM_BATTING_HBP', 'INDEX', 'WINS_GROUP', 'TARGET_WINS'], axis=1)

# Mean-impute missing values
final_df = final_df.fillna(final_df.mean())
result = final_df.columns

# Standardize the features before factor analysis
scaler = StandardScaler()
scaled_baseball = final_df.copy()
scaled_baseball = pd.DataFrame(scaler.fit_transform(scaled_baseball), columns=scaled_baseball.columns)
scaled_baseball.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2276 entries, 0 to 2275
Data columns (total 14 columns):
TEAM_BATTING_H      2276 non-null float64
TEAM_BATTING_2B     2276 non-null float64
TEAM_BATTING_3B     2276 non-null float64
TEAM_BATTING_HR     2276 non-null float64
TEAM_BATTING_BB     2276 non-null float64
TEAM_BATTING_SO     2276 non-null float64
TEAM_BASERUN_SB     2276 non-null float64
TEAM_BASERUN_CS     2276 non-null float64
TEAM_PITCHING_H     2276 non-null float64
TEAM_PITCHING_HR    2276 non-null float64
TEAM_PITCHING_BB    2276 non-null float64
TEAM_PITCHING_SO    2276 non-null float64
TEAM_FIELDING_E     2276 non-null float64
TEAM_FIELDING_DP    2276 non-null float64
dtypes: float64(14)
memory usage: 249.1 KB

Adequacy Checks:

Now that the data is ingested, we must check whether factor analysis is feasible. Bartlett's test of sphericity checks for intercorrelation between the manifest variables by comparing the observed correlation matrix to the identity matrix. If factor analysis is an appropriate method, the correlation matrix will differ from the identity matrix and the test will be significant. Luckily, Bartlett's test on our baseball data produced a significant p-value of 0.0.

Next, the Kaiser-Meyer-Olkin (KMO) test checks whether the manifest variables are suitable for factor analysis by estimating the proportion of variance among them that might be common variance. KMO values range between 0 and 1, and a value under 0.6 suggests that the dataset is inappropriate for factor analysis. Our data is still appropriate, with a KMO of 0.65.

For more on the Bartlett Sphericity Test and on the KMO test, visit the factor_analyzer home page.

# CHECK ADEQUACY
# Bartlett: p-value should be ~0 (statistically significant)
chi_square_value, p_value = calculate_bartlett_sphericity(scaled_baseball)
print(chi_square_value, p_value)

# KMO: value should be above 0.6
kmo_all, kmo_model = calculate_kmo(scaled_baseball)
print(kmo_model)

25927.970341891145 0.0
0.649081704257186

Number of Factors:

First, let us quickly run a preliminary factor analysis without any rotation. This step aids the decision about the number of factors to use in the solution: we get the eigenvalues of the initial solution and plot them on a scree plot, showing the eigenvalue for each successive factor. Factors with eigenvalues greater than or equal to 1 should be considered when choosing the number of factors, because a factor with an eigenvalue of 1 accounts for at least as much variance as a single feature. The highly subjective elbow method can also be used. Our scree plot suggests four or five factors.

 
[Scree plot: eigenvalues by number of factors]
#EXPLORATORY FACTOR ANALYSIS
fa = FactorAnalyzer(10, rotation=None)
fa.fit(scaled_baseball)

FactorAnalyzer(bounds=(0.005, 1), impute='median', is_corr_matrix=False,
               method='minres', n_factors=10, rotation=None, rotation_kwargs={},
               use_smc=True)

# GET EIGENVALUES: get_eigenvalues() returns the eigenvalues of the
# original correlation matrix and of the common-factor solution
ev, v = fa.get_eigenvalues()
print(ev)
print(v)

[4.82803775 2.16198932 1.74301502 1.45229663 0.95769957 0.82668734
 0.60604294 0.50273078 0.33357804 0.19899495 0.16015927 0.12622135
 0.087932   0.01461505]
[ 4.56895784e+00  1.96965781e+00  1.46222314e+00  1.18651683e+00
  4.28384036e-01  2.49077679e-01  3.39698376e-02 -3.85219812e-03
 -4.32093191e-02 -5.84983971e-02 -7.14656872e-02 -1.06756688e-01
 -1.84699846e-01 -2.57518572e-01]

# SCREE PLOT of the original eigenvalues
plt.scatter(range(1,scaled_baseball.shape[1]+1),ev)
plt.plot(range(1,scaled_baseball.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()
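
As a quick numeric complement to the scree plot, we can count how many eigenvalues satisfy the Kaiser criterion (a small sketch using the ev array computed above):

# Kaiser criterion: retain factors whose eigenvalue is at least 1
n_kaiser = int((ev >= 1).sum())
print(n_kaiser, 'eigenvalues are >= 1')   # 4 with the eigenvalues above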

Based on the scree plot, let us choose a solution with four factors. This can be specified when using the factor_analyzer package. We can start rotating these factors. Since FA is an iterative method, it is good to stick to a general process:

  1. Start with the Varimax rotation

  2. The method can be set to minres, ml, or principal. We can start with minres while performing the varimax rotation.

  3. Change the method to maximum likelihood but still use Varimax rotation.

  4. Two logical choices are available for the communality starting guesses: always start with the squared multiple correlation (smc) and try the maximum absolute correlation second. We can specify this by setting use_smc=True.

  5. Compare the solutions and keep the one that works best (a comparison sketch follows this list).

  6. Evaluate factor loadings and consider a different factor solution: one higher and one lower than the chosen k (in our current case four).

  7. If we partition the data, we can now try the solution on test data.
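
Here is a minimal sketch of the compare-and-keep loop from steps 2-5; total communality is used as one convenient summary, but the loadings themselves should be inspected as well:

# Fit competing solutions and compare how much variance they explain
for method in ['minres', 'ml']:
    for smc in [True, False]:
        fa = FactorAnalyzer(4, rotation='varimax', method=method, use_smc=smc)
        fa.fit(scaled_baseball)
        print(method, 'use_smc=%s:' % smc,
              'total communality = %.3f' % fa.get_communalities().sum())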

# Create factor analysis object and perform factor analysis
fa = FactorAnalyzer(4, rotation="varimax", method='minres', use_smc=True)
fa.fit(scaled_baseball)

FactorAnalyzer(bounds=(0.005, 1), impute='median', is_corr_matrix=False,
               method='minres', n_factors=4, rotation='varimax',
               rotation_kwargs={}, use_smc=True)

fa.loadings_

array([[-0.15918495, -0.2146815 ,  0.96113832, -0.06729736],
       [ 0.3176907 ,  0.11172103,  0.5917639 ,  0.09160824],
       [-0.73733795, -0.22262378,  0.2592684 , -0.06357984],
       [ 0.8616868 ,  0.3663672 ,  0.23162022,  0.11539842],
       [ 0.1462587 ,  0.81209274,  0.18702536,  0.35049663],
       [ 0.67188371,  0.37042894, -0.29515322,  0.15355697],
       [-0.54745691, -0.05588575, -0.02114782,  0.11179349],
       [-0.32065793,  0.04397383, -0.05735513, -0.04428859],
       [-0.05061899, -0.77880608,  0.1760783 ,  0.37193236],
       [ 0.82111309,  0.27670262,  0.29121739,  0.18523814],
       [-0.06212532,  0.11480601,  0.19174278,  0.97948732],
       [ 0.29467493, -0.16989868, -0.23843604,  0.55367096],
       [-0.42013622, -0.76701491,  0.05370052,  0.03993374],
       [ 0.31767905,  0.19478329,  0.26189601,  0.10783185]])

fa.get_communalities()
array([0.99883689, 0.48488363, 0.60211253, 0.91480177, 0.40168745,
       0.7080446 , 0.19872045, 0.08476792, 0.88228986, 0.85554147,
       0.21876294, 0.44758682, 0.74645865, 0.22131811])
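
For an orthogonal solution like this one, each communality is simply the sum of the squared loadings in that variable's row, which we can verify directly:

import numpy as np

# Communality of each variable = row sum of squared loadings
row_ssq = (fa.loadings_ ** 2).sum(axis=1)
print(np.allclose(row_ssq, fa.get_communalities()))   # True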

Several other parameters are available; for a full list, see the factor_analyzer package documentation.

Interpreting Factor Loadings:

Finally, let us attach the variable names to the factor loadings matrix. Below, I list the features that load most strongly on each of our four factors. If the groupings make sense, we can name the factors. Unfortunately, several features appear to have only weak and similar loadings on more than one factor. This means we would have to continue the project: try a different rotation, more than four factors, a different method, and/or the alternative to smc (maximum absolute correlation). The point is that we can explain and describe our 14 features with just a handful of factors (in this case four)!

Factor 1: TEAM_BATTING_HR, TEAM_BATTING_SO, TEAM_PITCHING_HR

Factor 2: TEAM_BATTING_BB

Factor 3: TEAM_BATTING_H, TEAM_BATTING_2B

Factor 4: TEAM_PITCHING_BB, TEAM_PITCHING_SO

loadings = pd.DataFrame(fa.loadings_, columns=['Factor 1', 'Factor 2', 'Factor 3', 'Factor 4'], index=final_df.columns)
print('Factor Loadings \n%s' %loadings)

Factor Loadings 
                  Factor 1  Factor 2  Factor 3  Factor 4
TEAM_BATTING_H   -0.159185 -0.214681  0.961138 -0.067297
TEAM_BATTING_2B   0.317691  0.111721  0.591764  0.091608
TEAM_BATTING_3B  -0.737338 -0.222624  0.259268 -0.063580
TEAM_BATTING_HR   0.861687  0.366367  0.231620  0.115398
TEAM_BATTING_BB   0.146259  0.812093  0.187025  0.350497
TEAM_BATTING_SO   0.671884  0.370429 -0.295153  0.153557
TEAM_BASERUN_SB  -0.547457 -0.055886 -0.021148  0.111793
TEAM_BASERUN_CS  -0.320658  0.043974 -0.057355 -0.044289
TEAM_PITCHING_H  -0.050619 -0.778806  0.176078  0.371932
TEAM_PITCHING_HR  0.821113  0.276703  0.291217  0.185238
TEAM_PITCHING_BB -0.062125  0.114806  0.191743  0.979487
TEAM_PITCHING_SO  0.294675 -0.169899 -0.238436  0.553671
TEAM_FIELDING_E  -0.420136 -0.767015  0.053701  0.039934
TEAM_FIELDING_DP  0.317679  0.194783  0.261896  0.107832
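
Rather than scanning the matrix by eye, we can also pull out the strong loadings programmatically (a small sketch; the 0.5 cutoff is an arbitrary choice):

# List the variables with |loading| > 0.5 on each factor
for factor in loadings.columns:
    strong = loadings.index[loadings[factor].abs() > 0.5]
    print(factor + ':', ', '.join(strong))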