Exploratory Factor Analysis
The primary objectives of an exploratory factor analysis (EFA) are to determine (1) the number of common factors influencing a set of measures, (2) the strength of the relationship between each factor and each observed measure, and (3) the factor scores.
Some common uses of EFA are:
- To reduce a large number of variables to a smaller number of factors for modeling purposes, where the large number of variables precludes modeling all the measures individually.
- To establish that multiple tests measure the same factor, thereby giving justification for administering fewer tests. Factor analysis originated a century ago with Charles Spearman's attempts to show that a wide variety of mental tests could be explained by a single underlying intelligence factor.
- To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items which cross-load on more than one factor.
- To select a subset of variables from a larger set, based on which original variables have the highest correlations with the principal component factors.
- To create a set of factors to be treated as uncorrelated variables as one approach to handling multicollinearity in such procedures as multiple regression.
- To identify the nature of the constructs underlying responses in a specific content area.
- To determine what sets of items “hang together” in a questionnaire.
- To demonstrate the dimensionality of a measurement scale. Researchers often wish to develop scales that respond to a single characteristic.
- To determine what features are most important when classifying a group of items.
- To generate “factor scores” representing values of the underlying constructs for use in other analyses.
An example
We illustrate how to conduct exploratory factor analysis using the data from the classic 1939 study by Karl J. Holzinger and Frances Swineford. In the study, twenty-six tests intended to measure a general factor and five specific factors were administered to seventh- and eighth-grade students in two schools, the Grant-White School (\(n = 145\)) and the Pasteur School (\(n = 156\)). The data used in this example include nineteen tests intended to measure four domains: spatial ability, verbal ability, speed, and memory. In addition, only data from the 145 students in the Grant-White School are used.
The data are saved in the file GrantWhite.csv. The 26 tests are described below; the 19 used in the example are tests 1 through 19 (visual through figurew).
> usedata('GrantWhite')
> head(GrantWhite)
  X...id female grade agey agem school visual cubes paper lozenge general
1    201      0     7   13    0      0     23    19    13       4      46
2    202      1     7   11   10      0     33    22    12      17      43
3    203      0     7   12    6      0     34    24    14      22      36
4    204      0     7   11   11      0     29    23    12       9      38
5    205      0     7   12    5      0     16    25    11      10      51
6    206      1     7   12    6      0     30    25    12      20      42
  paragrap sentence wordc wordm add code counting straight wordr numberr
1       10       17    22    10  69   65       82      156   173      91
2        8       17    30    10  65   60       98      195   174      81
3       11       19    27    19  50   49       86      228   168      84
4        9       19    25    11 114   59      103      144   130      84
5        8       25    28    24 112   54      122      160   184      98
6       10       23    28    18  94   84      113      201   188      86
  figurer object numberf figurew deduct numeric problemr series arithmet
1      96      8       2      10     21      12       17     11       17
2     106      9      15      17     33      12       22     31       32
3     101      1       7      16     45      10       43     21       18
4     101     10      15      14     25      21       26     19       28
5      99      9       9      15     28      16       35     21       25
6     116     10      10      16     36      14       27     18       30
  paperrev flagssub
1       13       25
2       20       37
3       19       40
4       11       44
5       10       28
6       16       42
visual | scores on visual perception test, test 1 |
cubes | scores on cubes test, test 2 |
paper | scores on paper form board test, test 3 |
lozenge | scores on lozenges test, test 4 |
general | scores on general information test, test 5 |
paragrap | scores on paragraph comprehension test, test 6 |
sentence | scores on sentence completion test, test 7 |
wordc | scores on word classification test, test 8 |
wordm | scores on word meaning test, test 9 |
add | scores on add test, test 10 |
code | scores on code test, test 11 |
counting | scores on counting groups of dots test, test 12 |
straight | scores on straight and curved capitals test, test 13 |
wordr | scores on word recognition test, test 14 |
numberr | scores on number recognition test, test 15 |
figurer | scores on figure recognition test, test 16 |
object | scores on object-number test, test 17 |
numberf | scores on number-figure test, test 18 |
figurew | scores on figure-word test, test 19 |
deduct | scores on deduction test, test 20 |
numeric | scores on numerical puzzles test, test 21 |
problemr | scores on problem reasoning test, test 22 |
series | scores on series completion test, test 23 |
arithmet | scores on Woody-McCall mixed fundamentals, form I test, test 24 |
paperrev | scores on additional paper form board test, test 25 |
flagssub | scores on flags test, test 26 |
Exploratory factor analysis
The usual exploratory factor analysis involves (1) preparing the data, (2) determining the number of factors, (3) estimating the model, (4) rotating the factors, (5) estimating the factor scores, and (6) interpreting the results.
Preparing data
In EFA, a correlation matrix is analyzed. The following R code calculates the correlation matrix. To simplify the other steps, we save the correlation matrix in the data file GWcorr.csv, which will be used later.
> usedata('GrantWhite')
> fa.var<-c('visual', 'cubes', 'paper', 'lozenge',
+           'general', 'paragrap', 'sentence', 'wordc',
+           'wordm', 'add', 'code', 'counting', 'straight',
+           'wordr', 'numberr', 'figurer', 'object', 'numberf',
+           'figurew')
>
> fadata<-GrantWhite[,fa.var]
> ## correlation matrix
> fa.cor<-cor(fadata)
> ## part of the correlation matrix
> round(fa.cor[1:5,1:5],3)
        visual cubes paper lozenge general
visual   1.000 0.326 0.372   0.449   0.328
cubes    0.326 1.000 0.190   0.417   0.275
paper    0.372 0.190 1.000   0.366   0.309
lozenge  0.449 0.417 0.366   1.000   0.381
general  0.328 0.275 0.309   0.381   1.000
>
> ## save the correlation matrix
> write.csv(fa.cor, '../data/GWcorr.csv')
Determining the number of factors
With the correlation matrix, we first decide the number of factors. There are several ways to do this, and the ones described here are all based on the eigenvalues of the correlation matrix. The eigenvalues obtained from R are shown below. First, note that the number of eigenvalues is the same as the number of variables. Second, the sum of all the eigenvalues is equal to the number of variables, because the eigenvalues sum to the trace of the correlation matrix, whose diagonal elements are all 1.
> usedata('GWcorr', row.names=1)
>
> fa.eigen <- eigen(GWcorr)
> fa.eigen$values
 [1] 6.3041871 1.9473919 1.5265417 1.4877579 0.9398040 0.8747401 0.7639373
 [8] 0.6559871 0.6508332 0.5719815 0.5481270 0.4640505 0.4371337 0.4070784
[15] 0.3655828 0.3201049 0.3064701 0.2312029 0.1970878
>
> sum(fa.eigen$values)
[1] 19
> cumsum(fa.eigen$values)
 [1]  6.304187  8.251579  9.778121 11.265879 12.205683 13.080423 13.844360
 [8] 14.500347 15.151180 15.723162 16.271289 16.735339 17.172473 17.579552
[15] 17.945134 18.265239 18.571709 18.802912 19.000000
> cumsum(fa.eigen$values)/19
 [1] 0.3317993 0.4342936 0.5146379 0.5929410 0.6424043 0.6884433 0.7286505
 [8] 0.7631762 0.7974305 0.8275348 0.8563836 0.8808073 0.9038144 0.9252396
[15] 0.9444808 0.9613284 0.9774584 0.9896270 1.0000000
The basic idea can be related to the variance explained in regression analysis. Because a correlation matrix is analyzed, the variance of each variable is 1. For a total of \(p\) variables, the total variance is therefore \(p\). In factor analysis, we try to find a small number of factors that can explain a large portion of the total variance. Each eigenvalue corresponds to the variance explained by one factor, so a large eigenvalue means that the corresponding factor explains a large amount of variance. Therefore, the eigenvalues can be used to select the number of factors.
Rule 1
The first rule for deciding the number of factors is to use the number of eigenvalues larger than 1. In this example, four eigenvalues are larger than 1, so this rule suggests 4 factors.
Rule 2
Another way is to select the smallest number of factors whose eigenvalues together account for 80% of the total variance. That is, the sum of the eigenvalues of the selected factors should be at least 80% of the sum of all the eigenvalues.
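Both rules can be checked directly from the eigenvalues printed earlier. The following is a minimal sketch that hard-codes those printed values so that it runs without the data file; the object names ev, n.rule1, and n.rule2 are illustrative and not part of the original analysis.

```r
# Eigenvalues of the 19 x 19 correlation matrix, copied from the R output
ev <- c(6.3041871, 1.9473919, 1.5265417, 1.4877579, 0.9398040, 0.8747401,
        0.7639373, 0.6559871, 0.6508332, 0.5719815, 0.5481270, 0.4640505,
        0.4371337, 0.4070784, 0.3655828, 0.3201049, 0.3064701, 0.2312029,
        0.1970878)

# Rule 1: number of eigenvalues larger than 1
n.rule1 <- sum(ev > 1)

# Rule 2: smallest number of factors whose cumulative share of the
# total variance reaches 80%
cum.prop <- cumsum(ev) / sum(ev)
n.rule2 <- which(cum.prop >= 0.80)[1]

n.rule1
n.rule2
```

For these data, Rule 1 gives 4 factors while Rule 2 gives 10, so the two rules do not have to agree.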
Cattell's Scree plot
Cattell's scree plot is a plot of the eigenvalues on the Y axis against the number of factors on the X axis. The plot looks like the side of a mountain, and "scree" refers to the debris fallen from a mountain and lying at its base. As one moves to the right, toward later components/factors, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, Cattell's scree test says to drop all further components/factors after the one starting the elbow. For this example, we can identify 4 factors based on the scree plot below.
> usedata('GWcorr', row.names=1)
>
> fa.eigen <- eigen(GWcorr)
>
> plot(fa.eigen$values, type='b', ylab='Eigenvalues', xlab='Factor')
Estimation of model / Factor analysis
Once the number of factors is decided, we can conduct the exploratory factor analysis using the R function factanal(). The R input and output for this example are given below.
> usedata('GrantWhite')
> fa.var<-c('visual', 'cubes', 'paper', 'lozenge',
+           'general', 'paragrap', 'sentence', 'wordc',
+           'wordm', 'add', 'code', 'counting', 'straight',
+           'wordr', 'numberr', 'figurer', 'object', 'numberf',
+           'figurew')
>
> fadata<-GrantWhite[,fa.var]
>
> fa.res<-factanal(x=fadata, factors=4, rotation='none')
> fa.res

Call:
factanal(x = fadata, factors = 4, rotation = "none")

Uniquenesses:
  visual    cubes    paper  lozenge  general paragrap sentence    wordc
   0.465    0.742    0.712    0.549    0.344    0.306    0.287    0.493
   wordm      add     code counting straight    wordr  numberr  figurer
   0.270    0.360    0.553    0.377    0.411    0.684    0.710    0.560
  object  numberf  figurew
   0.470    0.573    0.777

Loadings:
Factor1 Factor2 Factor3 Factor4
visual 0.536 0.176 0.392 -0.249
cubes 0.330 0.302 -0.228
paper 0.440 0.110 0.247 -0.147
lozenge 0.505 0.358 -0.253
general 0.762 -0.238 -0.113
paragrap 0.759 -0.338
sentence 0.762 -0.322 -0.166
wordc 0.701
wordm 0.762 -0.381
add 0.455 0.475 -0.451
code 0.545 0.367 0.103
counting 0.434 0.593 -0.238 -0.162
straight 0.592 0.393 -0.289
wordr 0.394 0.149 0.362
numberr 0.352 0.139 0.219 0.315
figurer 0.435 0.183 0.425 0.192
object 0.445 0.241 0.522
numberf 0.454 0.383 0.221 0.157
figurew 0.389 0.115 0.133 0.202

               Factor1 Factor2 Factor3 Factor4
SS loadings      5.722   1.625   1.065   0.945
Proportion Var   0.301   0.086   0.056   0.050
Cumulative Var   0.301   0.387   0.443   0.492

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 102.06 on 101 degrees of freedom.
The p-value is 0.452
In EFA, each observed variable consists of two parts, the common factor part and the unique part. The common factor part is based on the four factors, which are also called the common factors. The unique part is also called the unique factor, and it is specific to each observed variable.
Using the variable visual as an example, we have
\[ visual = 0.536\times Factor1 + 0.176\times Factor2 + 0.392\times Factor3 - 0.249\times Factor4 + u_{visual} \]
Note that the factor loadings are from the Loadings section of the output. The loadings are the regression coefficients obtained when the manifest indicators (observed variables) are regressed on the latent factors. The variance of the unique factor is given in the Uniquenesses section. For \(u_{visual}\), the variance is 0.465. The other variables can be interpreted in the same way.
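The Loadings and Uniquenesses sections are linked. Because the analysis is based on standardized variables and the unrotated factors are uncorrelated, the uniqueness of a variable equals 1 minus the sum of its squared loadings (its communality). A minimal sketch using the printed loadings of visual is given below; the object names are illustrative.

```r
# Unrotated loadings of `visual`, copied from the Loadings section
lambda.visual <- c(Factor1 = 0.536, Factor2 = 0.176,
                   Factor3 = 0.392, Factor4 = -0.249)

# Communality: variance of `visual` explained by the four common factors
communality <- sum(lambda.visual^2)

# Uniqueness: the remaining variance, i.e., the variance of u_visual
uniqueness <- 1 - communality

round(c(communality = communality, uniqueness = uniqueness), 3)
```

The computed uniqueness agrees with the printed value of 0.465 up to the rounding of the loadings.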
The remaining section is related to the variance explained by the factors. SS loadings is the sum of squared loadings for each factor, which is the total variance across all 19 variables explained by that factor. For example, the first factor explains 5.722 units of variance, or about 30.1% (5.722/19) of the total. Proportion Var is the proportion of the total variance in the observed variables/indicators explained by each factor. Cumulative Var is the cumulative proportion of variance explained by the factors up to and including the current one.
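The three rows of this section can be reproduced from one another. The sketch below starts from the printed SS loadings of the unrotated solution, since the individual loadings smaller than 0.1 are not printed and the column sums cannot be recomputed exactly from the table; the object names are illustrative.

```r
# SS loadings of the unrotated solution, copied from the output
ss <- c(Factor1 = 5.722, Factor2 = 1.625, Factor3 = 1.065, Factor4 = 0.945)
p <- 19  # number of observed variables = total variance

prop.var <- ss / p           # Proportion Var
cum.var <- cumsum(prop.var)  # Cumulative Var

round(rbind(prop.var, cum.var), 3)
```

Together the four factors account for a little under half of the total variance in the 19 tests.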
A test is conducted of whether the factor model is sufficient to explain the observed data. The null hypothesis is that a 4-factor model is sufficient. For this model, the chi-square statistic is 102.06 with 101 degrees of freedom. The p-value for the chi-square test is 0.452, which is larger than .05. Therefore, we fail to reject the null hypothesis that the 4-factor model fits the data well.
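The degrees of freedom and the p-value of this test can be verified by hand. The sketch below assumes the usual degrees-of-freedom formula for an EFA model with \(p\) variables and \(k\) factors, \(((p-k)^2 - (p+k))/2\), and uses the rounded chi-square statistic from the output, so the p-value is only approximate.

```r
p <- 19  # number of observed variables
k <- 4   # number of common factors

# Degrees of freedom of the test that k factors are sufficient
df <- ((p - k)^2 - (p + k)) / 2

# Upper-tail probability of the reported chi-square statistic
p.value <- pchisq(102.06, df = df, lower.tail = FALSE)

df
round(p.value, 3)
```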
Factor rotation
Although we have identified 4 factors and found that the 4-factor model fits the data well, we cannot find a clear pattern in the unrotated factor loadings that would give us a deep understanding of the factors. Factor rotation makes the output more understandable and is usually necessary to facilitate the interpretation of the factors. The aim is to find a simple solution in which each factor has a small number of large loadings and a large number of zero (or small) loadings. There are many different rotation methods, such as the varimax, quartimax, and equimax rotations, which are orthogonal, and oblique rotations. The PROMAX rotation is one kind of oblique rotation and is widely used. After PROMAX rotation, the factors are allowed to be correlated.
The output of the PROMAX rotation is shown below. In the output, we use print(fa.res, cut=0.2) to show only the factor loadings whose absolute values are greater than 0.2. Note that after rotation, many loadings are smaller than 0.2 in absolute value. The pattern of the factor loadings is much clearer now. For example, the variable visual has a large loading of 0.747 on Factor 2 but loadings smaller than 0.2 on the other three factors. In this case, we might say that the variable visual is mainly influenced by Factor 2.
Unlike the variable visual, the variable straight has large loadings on both Factor 2 and Factor 4. In other words, straight measures both factors rather than just a single factor.
We can also see that the primary indicators for Factor 1 are general, paragrap, sentence, wordc, and wordm. For Factor 4, the indicators include add, code, counting, and straight.
The correlations among the factors are given in the Factor Correlations section. For example, the correlation between Factor 1 and Factor 2 is 0.368. Note that after rotation, the test of the model is the same as without rotation.
> usedata('GrantWhite')
> fa.var<-c('visual', 'cubes', 'paper', 'lozenge',
+           'general', 'paragrap', 'sentence', 'wordc',
+           'wordm', 'add', 'code', 'counting', 'straight',
+           'wordr', 'numberr', 'figurer', 'object', 'numberf',
+           'figurew')
>
> fadata<-GrantWhite[,fa.var]
>
> fa.res<-factanal(x=fadata, factors=4, rotation='promax')
> print(fa.res, cut=0.2)

Call:
factanal(x = fadata, factors = 4, rotation = "promax")

Uniquenesses:
  visual    cubes    paper  lozenge  general paragrap sentence    wordc
   0.465    0.742    0.712    0.549    0.344    0.306    0.287    0.493
   wordm      add     code counting straight    wordr  numberr  figurer
   0.270    0.360    0.553    0.377    0.411    0.684    0.710    0.560
  object  numberf  figurew
   0.470    0.573    0.777

Loadings:
Factor1 Factor2 Factor3 Factor4
visual 0.747
cubes 0.571
paper 0.485
lozenge 0.683
general 0.760
paragrap 0.806
sentence 0.862
wordc 0.555
wordm 0.856
add -0.245 0.806
code 0.290 0.420
counting 0.773
straight 0.489 0.484
wordr 0.567
numberr 0.544
figurer 0.376 0.501
object -0.244 0.766
numberf 0.271 0.446
figurew 0.381

               Factor1 Factor2 Factor3 Factor4
SS loadings      3.109   2.231   1.947   1.801
Proportion Var   0.164   0.117   0.102   0.095
Cumulative Var   0.164   0.281   0.384   0.478

Factor Correlations:
        Factor1 Factor2 Factor3 Factor4
Factor1   1.000   0.368   0.517   0.457
Factor2   0.368   1.000   0.435   0.432
Factor3   0.517   0.435   1.000   0.545
Factor4   0.457   0.432   0.545   1.000

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 102.06 on 101 degrees of freedom.
The p-value is 0.452
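Rotation changes the loadings but not the uniquenesses or the test of model fit, which is why those parts of the two outputs are identical. This can be checked on any data set. The sketch below assumes the built-in ability.cov data (six ability tests, shipped with R's datasets package) rather than the Grant-White data, and the object names fit.none and fit.promax are illustrative.

```r
# `ability.cov` is a list holding a covariance matrix of six ability
# tests together with the number of observations
fit.none   <- factanal(factors = 2, covmat = ability.cov, rotation = "none")
fit.promax <- factanal(factors = 2, covmat = ability.cov, rotation = "promax")

# Rotation changes the loadings ...
round(unclass(fit.none$loadings), 3)
round(unclass(fit.promax$loadings), 3)

# ... but not the uniquenesses or the chi-square test statistic
all.equal(fit.none$uniquenesses, fit.promax$uniquenesses)
all.equal(fit.none$STATISTIC, fit.promax$STATISTIC)
```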
Interpret the results from EFA
Based on the rotated factor loadings, we can name the factors in the model. This can be done by identifying the salient (large) loadings. For example, Factor 1 is indicated by general, paragrap, sentence, wordc, and wordm, all of which are related to the verbal aspect of cognitive ability. One way to name the factor is to call it a verbal factor. Similarly, the second factor can be called the spatial factor, the third the memory factor, and the last one the speed factor.
Factor scores
Sometimes, the purpose of factor analysis is to estimate the score of each latent construct/factor for each participant. Factor scores can then be used in further data analysis. In general, there are two methods for estimating factor scores: the regression method and the Bartlett method. The Bartlett method generally works better. For example, the following code obtains the Bartlett factor scores. To illustrate how the scores can be used in a further analysis, a linear regression of Factor2 on Factor1 is also fitted.
> usedata('GrantWhite')
> fa.var<-c('visual', 'cubes', 'paper', 'lozenge',
+           'general', 'paragrap', 'sentence', 'wordc',
+           'wordm', 'add', 'code', 'counting', 'straight',
+           'wordr', 'numberr', 'figurer', 'object', 'numberf',
+           'figurew')
>
> fadata<-GrantWhite[,fa.var]
>
> fa.res<-factanal(x=fadata, factors=4, rotation='promax', scores='Bartlett')
> head(fa.res$scores)
         Factor1    Factor2    Factor3    Factor4
[1,] -0.43901139 -1.7896772 -0.7416310 -1.1652786
[2,] -0.64032039  0.3301758  0.1765352 -0.6575473
[3,] -0.05713828  0.8185462 -1.3589951 -1.4069599
[4,] -0.55427892 -1.0738916 -0.7036627  0.2676821
[5,]  0.68178108 -1.7044904  0.1877187  0.5317372
[6,]  0.21943745  0.3296831  1.0497671  0.2068044
>
> summary(lm(Factor2 ~ Factor1, data=as.data.frame(fa.res$scores)))

Call:
lm(formula = Factor2 ~ Factor1, data = as.data.frame(fa.res$scores))

Residuals:
    Min      1Q  Median      3Q     Max
-2.6901 -0.5732  0.0375  0.6267  2.8976

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.759e-16  8.340e-02   0.000        1
Factor1      4.631e-01  7.975e-02   5.808 3.94e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.004 on 143 degrees of freedom
Multiple R-squared:  0.1908, Adjusted R-squared:  0.1852
F-statistic: 33.73 on 1 and 143 DF,  p-value: 3.941e-08
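To see how the two scoring methods relate, both can be requested from factanal() through its scores argument and then compared. The sketch below does not use the Grant-White data. Instead, it simulates a hypothetical data set with two uncorrelated factors and four indicators each, so the loadings in Lambda, the sample size, and all object names are assumptions made only for illustration.

```r
set.seed(123)
n <- 300

# Hypothetical population loadings: two factors, four indicators each
Lambda <- rbind(c(0.8, 0), c(0.7, 0), c(0.6, 0), c(0.5, 0),
                c(0, 0.8), c(0, 0.7), c(0, 0.6), c(0, 0.5))

f <- matrix(rnorm(n * 2), n, 2)       # true factor scores
u.sd <- sqrt(1 - rowSums(Lambda^2))   # standard deviations of unique factors
e <- sweep(matrix(rnorm(n * 8), n, 8), 2, u.sd, "*")
x <- f %*% t(Lambda) + e              # observed variables
colnames(x) <- paste0("y", 1:8)

# Same model, two ways of estimating the factor scores
fit.reg  <- factanal(x, factors = 2, scores = "regression")
fit.bart <- factanal(x, factors = 2, scores = "Bartlett")

# Correlation between the two kinds of scores, factor by factor
diag(cor(fit.reg$scores, fit.bart$scores))
```

In a clean setting like this one the two sets of scores are expected to be highly correlated, so the choice between them mainly affects the scaling of the scores rather than the ordering of the participants.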
To cite the book, use:
Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.