The primary objectives of an exploratory factor analysis (EFA) are to determine (1) the number of common factors influencing a set of measures, (2) the strength of the relationship between each factor and each observed measure, and (3) the factor scores.
Some common uses of EFA are:
To reduce a large number of variables to a smaller number of factors for modeling purposes, where the large number of variables precludes modeling all the measures individually.
To establish that multiple tests measure the same factor, thereby giving justification for administering fewer tests. Factor analysis originated a century ago with Charles Spearman's attempts to show that a wide variety of mental tests could be explained by a single underlying intelligence factor.
To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items which cross-load on more than one factor.
To select a subset of variables from a larger set, based on which original variables have the highest correlations with the principal component factors.
To create a set of factors to be treated as uncorrelated variables as one approach to handling multicollinearity in such procedures as multiple regression.
To identify the nature of the constructs underlying responses in a specific content area.
To determine what sets of items “hang together” in a questionnaire.
To demonstrate the dimensionality of a measurement scale. Researchers often wish to develop scales that respond to a single characteristic.
To determine what features are most important when classifying a group of items.
To generate “factor scores” representing values of the underlying constructs for use in other analyses.
An example
We illustrate how to conduct exploratory factor analysis using the data from the classic 1939 study by Karl J. Holzinger and Frances Swineford. In the study, twenty-six tests intended to measure a general factor and five specific factors were administered to seventh and eighth grade students in two schools, the Grant-White School (\(n = 145\)) and the Pasteur School (\(n = 156\)). The data used in this example include nineteen tests intended to measure four domains: spatial ability, verbal ability, speed, and memory. In addition, only the data from the 145 students in the Grant-White School are used.
The data are saved in the file GrantWhite.csv. The 26 tests are described below. The 19 used in the example are those listed in the vector fa.var in the R code for preparing the data.
straight
scores on straight and curved capitals test, test 13
wordr
scores on word recognition test, test 14
numberr
scores on number recognition test, test 15
figurer
scores on figure recognition test, test 16
object
scores on object-number test, test 17
numberf
scores on number-figure test, test 18
figurew
scores on figure-word test, test 19
deduct
scores on deduction test, test 20
numeric
scores on numerical puzzles test, test 21
problemr
scores on problem reasoning test, test 22
series
scores on series completion test, test 23
arithmet
scores on Woody-McCall mixed fundamentals, form I test, test 24
paperrev
scores on additional paper form board test, test 25
flagssub
scores on flags test, test 26
Exploratory factor analysis
The usual exploratory factor analysis involves (1) preparing the data, (2) determining the number of factors, (3) estimating the model, (4) rotating the factors, (5) estimating the factor scores, and (6) interpreting the results.
Preparing data
In EFA, a correlation matrix is analyzed. The following R code calculates the correlation matrix. To simplify the other steps, we save the correlation matrix in the data file GWcorr.csv so that it can be used later.
> usedata('GrantWhite')
> fa.var<-c('visual', 'cubes', 'paper', 'lozenge',
+ 'general', 'paragrap', 'sentence', 'wordc',
+ 'wordm', 'add', 'code', 'counting', 'straight',
+ 'wordr', 'numberr', 'figurer', 'object', 'numberf',
+ 'figurew')
>
> fadata<-GrantWhite[,fa.var]
> ## correlation matrix
> fa.cor<-cor(fadata)
> ## part of the correlation matrix
> round(fa.cor[1:5,1:5],3)
        visual cubes paper lozenge general
visual   1.000 0.326 0.372   0.449   0.328
cubes    0.326 1.000 0.190   0.417   0.275
paper    0.372 0.190 1.000   0.366   0.309
lozenge  0.449 0.417 0.366   1.000   0.381
general  0.328 0.275 0.309   0.381   1.000
>
> ## save the correlation matrix
> write.csv(fa.cor, '../data/GWcorr.csv')
>
Determining the number of factors
With the correlation matrix in hand, we first decide on the number of factors. There are several ways to do so, and all of the methods described here are based on the eigenvalues of the correlation matrix, which can be obtained in R with the function eigen(). Two properties are worth noting. First, the number of eigenvalues is the same as the number of variables. Second, the sum of all the eigenvalues is equal to the number of variables.
The basic idea can be related to the variance explained in regression analysis. Because we work with the correlation matrix, the variance of each variable can be taken as 1. For a total of \(p\) variables, the total variance is therefore \(p\). In factor analysis, we try to find a small number of factors that can explain a large portion of the total variance. Each eigenvalue corresponds to the variance explained by one factor. If the eigenvalue corresponding to a factor is large, the variance explained by that factor is large. Therefore, the eigenvalues can be used to select the number of factors.
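The two properties of the eigenvalues can be checked with a short sketch. It assumes the built-in R dataset Harman74.cor (correlations among 24 tests from the same Holzinger and Swineford study) as a stand-in for the 19-variable GWcorr.csv matrix, so the numbers it prints will not match those of this example.

```r
# Stand-in data (assumption): Harman74.cor from R's datasets package,
# a list whose $cov element is a 24 x 24 correlation matrix
R <- Harman74.cor$cov

# Eigenvalues of the correlation matrix, returned in decreasing order
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values

length(ev) == ncol(R)          # one eigenvalue per variable
all.equal(sum(ev), ncol(R))    # eigenvalues sum to the number of variables
round(ev / ncol(R), 3)         # share of the total variance for each eigenvalue
```

Dividing each eigenvalue by the number of variables gives the proportion of the total variance associated with it, which is the quantity the selection rules rely on.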
Rule 1
The first rule, often called the Kaiser criterion, is to retain as many factors as there are eigenvalues larger than 1. In this example, four eigenvalues are larger than 1. Therefore, this rule suggests 4 factors.
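As a minimal sketch of Rule 1, the count of eigenvalues above 1 can be computed directly. The built-in Harman74.cor matrix is again an assumed stand-in for the example data, and the name n.rule1 is introduced only for illustration.

```r
# Stand-in correlation matrix (assumption): 24 tests, not the 19 used in the text
R <- Harman74.cor$cov
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values

# Rule 1: number of eigenvalues larger than 1
n.rule1 <- sum(ev > 1)
n.rule1
```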
Rule 2
Another way is to select the smallest number of factors whose cumulative eigenvalues account for 80% of the total variance. That is, the sum of the eigenvalues of the selected factors should be at least 80% of the sum of all eigenvalues.
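Rule 2 can be sketched in the same way. The 80% threshold comes from the text; the stand-in matrix Harman74.cor and the names cum.prop and n.rule2 are assumptions made for illustration.

```r
# Stand-in correlation matrix (assumption): 24 tests, not the 19 used in the text
R <- Harman74.cor$cov
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values

# Cumulative proportion of the total variance
cum.prop <- cumsum(ev) / sum(ev)

# Rule 2: smallest number of factors reaching at least 80%
n.rule2 <- which(cum.prop >= 0.80)[1]
n.rule2
round(cum.prop[1:n.rule2], 3)
```

This rule often retains many more factors than Rule 1, so the two are best compared rather than applied mechanically.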
Cattell's Scree plot
Cattell's scree plot shows the eigenvalues on the Y axis against the factor number on the X axis. The plot looks like the side of a mountain, and "scree" refers to the debris fallen from a mountain and lying at its base. As one moves to the right, toward later components/factors, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, Cattell's scree test says to drop all further components/factors after the one starting the elbow. For this example, we can identify 4 factors based on the scree plot.
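A scree plot needs only base R graphics. The sketch below again assumes the stand-in matrix Harman74.cor, so the position of the elbow may differ from the one in this example; the dashed line at 1 is added here only as a visual reference for Rule 1.

```r
# Stand-in correlation matrix (assumption): 24 tests, not the 19 used in the text
R <- Harman74.cor$cov
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues against factor number; look for the elbow
plot(seq_along(ev), ev, type = "b",
     xlab = "Factor number", ylab = "Eigenvalue",
     main = "Scree plot")
abline(h = 1, lty = 2)   # reference line for the eigenvalue-larger-than-1 rule
```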
Once the number of factors is decided, we can conduct exploratory factor analysis using the R function factanal(). The R input and output for this example are given below.
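A minimal sketch of such a call is given here. It is not the original input of this example: it assumes the built-in list Harman74.cor (24 tests, 145 observations) in place of the 19-variable correlation matrix, and the object name fit is introduced for illustration.

```r
# Stand-in data (assumption): Harman74.cor is a list with the correlation
# matrix ($cov) and the sample size ($n.obs), which factanal() accepts as covmat
fit <- factanal(covmat = Harman74.cor, factors = 4, rotation = "none")

# Prints Uniquenesses, Loadings, the variance table, and the chi-square test
print(fit)
```

With a plain correlation matrix instead of a list, the sample size would have to be supplied through the n.obs argument for the chi-square test to be computed.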
In EFA, each observed variable consists of two parts, the common factor part and the unique part. The common factor part is based on the four factors, which are also called the common factors. The unique part is also called the unique factor, which is specific to each observed variable.
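Taking the variable visual as an example, this decomposition can be written as follows. The loading symbols \(\lambda_{visual,j}\) and the factor symbols \(F_j\) are notation assumed here for illustration, and \(u_{visual}\) denotes the unique factor of visual.

\[
\text{visual} = \lambda_{visual,1} F_1 + \lambda_{visual,2} F_2 + \lambda_{visual,3} F_3 + \lambda_{visual,4} F_4 + u_{visual}
\]

If the observed variables are standardized, the common factors are uncorrelated with unit variance, and the unique factor is uncorrelated with the common factors, the variance of visual splits into two pieces:

\[
1 = \sum_{j=1}^{4} \lambda_{visual,j}^{2} + \mathrm{Var}(u_{visual})
\]

The first piece is the part of the variance explained by the common factors, and the second is the variance of the unique factor.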
Note that the factor loadings are in the Loadings section of the output. The loadings are the regression coefficients obtained when each manifest indicator (observed variable) is regressed on the latent factors. The variances of the unique factors are in the Uniquenesses section. For \(u_{visual}\), the variance is 0.465. The uniquenesses of the other variables are read in the same way.
The remaining section concerns the variance explained by the factors. SS loadings is the sum of squared loadings for each factor, that is, the total variance in the 19 variables explained by that factor. For example, the first factor explains 5.722 units of variance, which is about 30.1% of the total (5.722/19). Proportion Var is the proportion of the total variance in the observed variables explained by each factor. Cumulative Var is the cumulative proportion of variance explained by the factors up to and including the current one.
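The three rows of this table can be reproduced from the loading matrix. The sketch below assumes a 4-factor fit to the stand-in data Harman74.cor (24 variables), so its values differ from those quoted above; the last line only checks the arithmetic from the text.

```r
# Stand-in fit (assumption): 4 unrotated factors on the built-in 24-test data
fit <- factanal(covmat = Harman74.cor, factors = 4, rotation = "none")
L <- unclass(fit$loadings)       # plain numeric matrix of loadings

ss.loadings <- colSums(L^2)      # SS loadings: variance explained by each factor
prop.var <- ss.loadings / nrow(L)  # Proportion Var: divide by number of variables
cum.var <- cumsum(prop.var)      # Cumulative Var

round(rbind(ss.loadings, prop.var, cum.var), 3)

# Arithmetic from the text: 5.722 out of a total variance of 19
round(5.722 / 19, 3)
```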
A chi-square test is conducted to evaluate whether the factor model is sufficient to explain the observed data. The null hypothesis is that a 4-factor model is sufficient. For this model, the chi-square statistic is 102.06 with 101 degrees of freedom. The p-value is 0.452, which is larger than .05. Therefore, we fail to reject the null hypothesis and conclude that the 4-factor model fits the data adequately.
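The degrees of freedom and the p-value can be verified by hand. The sketch uses the usual degrees-of-freedom formula for a model with p variables and k factors, \(((p-k)^2 - (p+k))/2\), together with the chi-square statistic reported above; the variable names are introduced for illustration.

```r
p <- 19   # number of observed variables
k <- 4    # number of factors

# Degrees of freedom of the k-factor model
dof <- ((p - k)^2 - (p + k)) / 2
dof

# Upper-tail probability of the reported chi-square statistic
pval <- pchisq(102.06, df = dof, lower.tail = FALSE)
round(pval, 3)
```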
Factor rotation
Although we have identified 4 factors and found that the 4-factor model fits well, the unrotated factor loadings show no clear pattern that would help us understand the factors. Factor rotation makes the output easier to read and is usually necessary to facilitate the interpretation of factors. The aim is to find a simple solution in which each factor has a small number of large loadings and a large number of zero (or small) loadings. There are many different rotation methods, including orthogonal rotations such as varimax, quartimax, and equimax, as well as oblique rotations. The PROMAX rotation is one kind of oblique rotation and is widely used. After PROMAX rotation, the factors are allowed to be correlated.
The output of PROMAX rotation is shown below. In the output, we use print(fa.res, cut=0.2) to show only the factor loadings that are greater than 0.2 in absolute value. Note that after rotation, many loadings are actually smaller than 0.2. The pattern of the factor loadings is much clearer now. For example, the variable visual has a large loading of 0.747 on Factor 2 but loadings smaller than 0.2 on the other three factors. In this case, we might say that the variable visual is mainly influenced by Factor 2.
Unlike the variable visual, the variable straight has large loadings on both Factor 2 and Factor 4. In other words, straight measures both factors rather than just a single factor.
We can also see that the primary indicators for Factor 1 are general, paragrap, sentence, wordc, and wordm. For Factor 4, the indicators include add, code, counting, and straight.
The correlations among the factors are given in the Factor Correlations section of the output. For example, the correlation between Factor 1 and Factor 2 is 0.368. Note that the test of the model is the same after rotation as without rotation, because rotation does not change the fit of the model.
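The rotation step can be sketched as follows, again assuming the stand-in data Harman74.cor rather than the matrix used in this example. The argument cutoff is the full name of the print option abbreviated as cut above. Recovering the factor correlations from the rotmat component is an assumption about how recent versions of R store the rotation, so that part is guarded.

```r
# Stand-in fits (assumption): built-in 24-test data, 4 factors
fit.none <- factanal(covmat = Harman74.cor, factors = 4, rotation = "none")
fit.pm   <- factanal(covmat = Harman74.cor, factors = 4, rotation = "promax")

# Show only loadings larger than 0.2 in absolute value
print(fit.pm, cutoff = 0.2)

# Factor correlations implied by the rotation matrix (if it is stored)
if (!is.null(fit.pm$rotmat)) {
  tmat <- solve(fit.pm$rotmat)
  phi <- tmat %*% t(tmat)
  print(round(phi, 3))
}

# Rotation leaves the chi-square test of the model unchanged
all.equal(fit.none$STATISTIC, fit.pm$STATISTIC)
```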
Based on the rotated factor loadings, we can name the factors in the model. This can be done by identifying the large loadings on each factor. For example, Factor 1 is indicated by general, paragrap, sentence, wordc, and wordm, all of which are related to the verbal aspect of cognitive ability. One way to name the factor is to call it a verbal factor. Similarly, the second can be called the spatial factor, the third the memory factor, and the last the speed factor.
Factor scores
Sometimes, the purpose of factor analysis is to estimate the score of each latent construct/factor for each participant. Factor scores can be used in further data analysis. In general, there are two methods for estimating factor scores: the regression method and the Bartlett method. The Bartlett method is often preferred because it gives unbiased estimates of the factor scores. For example, the following code obtains the Bartlett factor scores. As an illustration of their use, a linear regression on the factor scores is also fitted.
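Factor scores require the raw data rather than a correlation matrix. The following sketch is not the original code of this example: it simulates a hypothetical data set with two factors, six indicators (x1 to x6), and an outcome y, all of which are assumptions made so that the code runs on its own.

```r
set.seed(123)
n <- 300

# Two hypothetical latent factors and six indicators, three per factor
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- data.frame(
  x1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  x2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  x3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  x4 = 0.8 * f2 + rnorm(n, sd = 0.6),
  x5 = 0.7 * f2 + rnorm(n, sd = 0.7),
  x6 = 0.6 * f2 + rnorm(n, sd = 0.8)
)
# Hypothetical outcome that depends on both factors
y <- 0.5 * f1 + 0.3 * f2 + rnorm(n)

# Fit the 2-factor model and request Bartlett factor scores
fit <- factanal(X, factors = 2, scores = "Bartlett")
fs <- as.data.frame(fit$scores)   # one row per participant
head(fs)

# Use the estimated scores as predictors in a linear regression
reg <- lm(y ~ Factor1 + Factor2, data = fs)
summary(reg)
```

Replacing scores = "Bartlett" with scores = "regression" gives the regression-method scores for comparison.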
To cite the book, use:
Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.