DataSklr


Factor Analysis


Factor Analysis (FA) is a method for revealing relationships between assumed latent variables and manifest variables. A latent variable is a concept that cannot be measured directly but is assumed to be related to several directly measurable features in the data, called manifest variables.

There are two main forms of FA: exploratory and confirmatory. Exploratory FA is designed to uncover relationships between manifest variables and factors without any assumption about which manifest variables are related to which factors. Confirmatory FA tests whether a specific factor model provides an adequate fit for the correlations between specific manifest variables.

The FA model is very similar to multiple linear regression because the measurable manifest variables are regressed on the latent variables. In other words, an FA model assumes that the observed relationships among manifest variables are due to the relationships of those manifest variables to the latent variables.
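The regression analogy can be written out explicitly. In the standard common factor model (textbook notation, not taken from this post), each manifest variable is a linear combination of the latent factors plus a unique term:

```latex
x_i = \lambda_{i1} f_1 + \lambda_{i2} f_2 + \dots + \lambda_{ik} f_k + u_i
```

where \(x_i\) is the \(i\)-th manifest variable, \(f_1, \dots, f_k\) are the latent factors, the \(\lambda_{ij}\) are the factor loadings (the "regression coefficients"), and \(u_i\) is the unique variance not explained by the factors.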

Parameters in the FA Model:

There are two main approaches to estimating the parameters of an FA model: principal factor analysis and maximum likelihood factor analysis. Principal FA works from the correlation matrix of the manifest variables. Here, there are two possible ways to obtain an initial estimate of the communality of each variable: use the squared multiple correlation coefficient of the variable with all other manifest variables, OR use the largest absolute correlation coefficient between the variable and any other manifest variable.
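The two communality starting guesses can be sketched directly with NumPy (a generic illustration on synthetic data, not code from the original post):

```python
import numpy as np

# Synthetic stand-in: 500 observations of 6 features driven by 2 latent factors.
rng = np.random.default_rng(0)
X = (rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6))
     + rng.normal(scale=0.5, size=(500, 6)))

R = np.corrcoef(X, rowvar=False)            # p x p correlation matrix

# 1) squared multiple correlation (SMC) of each variable with all the others
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

# 2) largest absolute correlation of each variable with any other variable
abs_R = np.abs(R - np.eye(R.shape[0]))      # zero out the diagonal first
max_abs = abs_R.max(axis=0)
```

Both vectors have one entry per manifest variable and lie between 0 and 1; they serve only as starting values that the iterative fit then refines.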

Find the Number of Factors:

It is very important to choose the right number of factors: too few factors produce too many high loadings, while too many factors result in a fragmented model. How to choose the right number of factors is discussed further in the programming section of this chapter.

Rotation:

Rotation is a process that makes an FA solution more interpretable. Every rotation has a goal. Orthogonal rotations require that the factors be uncorrelated; their goal is generalizability and simplicity. In contrast, oblique rotations allow for correlated factors, and their goal is to produce the best fit.

The rules of rotation:

  • Interpretation is easier with strong loadings;

  • Each row in the factor matrix must contain at least one zero;

  • Each column must contain at least k zeros, where k is the number of factors;

  • Every pair of columns of the factor matrix should have variables whose loadings are strong in one column but near zero in the other;

  • If the number of factors is four or more, every pair of columns should have several variables with zero loadings in both columns;

  • In every pair of columns, only a few variables should have non-zero loadings in both columns.

Rotation Techniques:

Orthogonal rotation: The major advantage is simplicity, since the loadings are simply correlations between the factors and the observed features.

  • Varimax: Produces a few large loadings and many near-zero loadings.

  • Quartimax: Forces a given variable to correlate highly with one factor. Makes large loadings extremely large and small loadings extremely small. Maximizes the variance across the rows of the factor matrix by raising the loadings to the fourth power.

  • Oblimax: Used when the assumption of homogeneously distributed error does not apply and may be replaced by a principle of maximum kurtosis, according to D. R. Saunders (The rationale for an "oblimax" method of transformation in factor analysis).

  • Equimax: An attempted improvement on varimax. The rotation adjusts to the number of factors being rotated, resulting in a more uniformly distributed set of factors than varimax. Creates less generic factors.

Oblique rotation: More complex. Factor pattern coefficients are regression coefficients: the observed features can be reconstructed by multiplying the factors by the factor pattern coefficients. Factor structure coefficients are the correlation coefficients between the factors and the observed features.

  • Promax: Raises the loadings of an orthogonal solution to a power. Computationally inexpensive, so it can be used for large data sets.

  • Oblimin: Attempts a simple factor pattern matrix structure by using a parameter that controls the degree of correlation among the factors. It finds the best structure while minimizing a) the powers of the loadings and b) the correlation between factors. Analysts can set the magnitude of the correlation between the factors.

  • Quartimin: A good solution for complex data, but biased toward highly intercorrelated features when generating factors. (See Applied Factor Analysis by R. J. Rummel.)

Factor Analysis with factor_analyzer in Python:

First, we must load the required packages and then ingest the data. Here, we use the same baseball data as in the Principal Components Analysis chapter.


Adequacy Checks:

Now that the data is ingested, we must check whether factor analysis is feasible. Bartlett's Sphericity Test checks for intercorrelation among the manifest variables by comparing the observed correlation matrix with the identity matrix. If factor analysis is an appropriate method, the correlation matrix will differ from the identity matrix and the test will be significant. Luckily, the Bartlett Sphericity Test on our baseball data produced a significant p-value of 0.0.

Next, the Kaiser-Meyer-Olkin (KMO) test checks whether the manifest variables are suitable for factor analysis by computing the proportion of variance among them that might be common variance. KMO values range between 0 and 1; a value under 0.6 suggests the dataset is inappropriate for factor analysis. Our data remains appropriate, with a KMO of 0.65.

For more on the Bartlett Sphericity Test and on the KMO test, visit the factor_analyzer home page.


Number of Factors:

First, let us quickly run a preliminary factor analysis without any rotation. This step helps decide how many factors to use in a solution. We get the eigenvalues of the initial solution and plot them on a scree plot, which shows the eigenvalues against the number of factors. By the Kaiser criterion, only factors with eigenvalues greater than or equal to 1 should be considered, since a factor with an eigenvalue of 1 accounts for at least the variance of a single feature. The highly subjective elbow method can also be used. Our scree plot suggests four or five factors.


Based on the scree plot, let us choose a solution with four factors, which can be specified in the factor_analyzer package. We can then start rotating these factors. Since FA is an iterative method, it is good to follow a general process:

  1. Start with the Varimax rotation

  2. The method can be set to minres, ml, or principal. We can start with minres while performing Varimax rotation.

  3. Change the method to maximum likelihood but still use Varimax rotation.

  4. Two choices are available for the starting guesses of the communalities: always start with smc (squared multiple correlation) and try maximum absolute correlation second. We can specify this by setting use_smc=True.

  5. Compare the solutions and keep the one that works the best.

  6. Evaluate factor loadings and consider a different factor solution: one higher and one lower than the chosen k (in our current case four).

  7. If we partition the data, we can now try the solution on test data.


Several other parameters are available. For a full list, see the factor_analyzer package documentation.

Interpreting Factor Loadings:

Finally, let us attach the variable names to the factor loadings matrix. I listed the features that have the strongest loadings on each of our four factors. If these groupings make sense, we can name our factors. Unfortunately, several features appear to have only weak and similar loadings on more than one factor. This means that we would have to continue the project: try a different rotation, more than four factors, a different method, and/or the alternative to smc (maximum absolute correlation). The point is that we can explain and describe our 14 features with just a handful of factors (in this case, four)!

Factor 1: TEAM_BATTING_HR, TEAM_BATTING_SO, TEAM_PITCHING_H

Factor 2: TEAM_BATTING_BB

Factor 3: TEAM_BATTING_H, TEAM_BATTING_2B

Factor 4 : TEAM_PITCHING_BB, TEAM_PITCHING_BB
