Exploratory Factor Analysis

The primary objectives of an exploratory factor analysis (EFA) are to determine (1) the number of common factors influencing a set of measures, (2) the strength of the relationship between each factor and each observed measure, and (3) the factor scores.

Some common uses of EFA are:

To reduce a large number of variables to a smaller number of factors for modeling purposes, where the large number of variables precludes modeling all the measures individually.
To establish that multiple tests measure the same factor, thereby giving justification for administering fewer tests. Factor analysis originated a century ago with Charles Spearman's attempts to show that a wide variety of mental tests could be explained by a single underlying intelligence factor.
To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items which cross-load on more than one factor.
To select a subset of variables from a larger set, based on which original variables have the highest correlations with the principal component factors.
To create a set of factors to be treated as uncorrelated variables as one approach to handling multicollinearity in such procedures as multiple regression.
To identify the nature of the constructs underlying responses in a specific content area.
To determine what sets of items “hang together” in a questionnaire.
To demonstrate the dimensionality of a measurement scale. Researchers often wish to develop scales that respond to a single characteristic.
To determine what features are most important when classifying a group of items.
To generate “factor scores” representing values of the underlying constructs for use in other analyses.

An example

We illustrate how to conduct exploratory factor analysis using the data from the classic 1939 study by Karl J. Holzinger and Frances Swineford. In the study, twenty-six tests intended to measure a general factor and five specific factors were administered to seventh and eighth grade students in two schools, the Grant-White School ($n = 145$) and Pasteur School ($n = 156$). The data used in this example include nineteen tests intended to measure four domains: spatial ability, verbal ability, speed, and memory. In addition, only data from the 145 students in the Grant-White School are used.

The data are saved in the file GrantWhite.csv. The 26 tests are described below; the 19 used in the example are tests 1 through 19 (visual through figurew).

> usedata('GrantWhite')
> head(GrantWhite)
  X...id female grade agey agem school visual cubes paper lozenge general
1    201      0     7   13    0      0     23    19    13       4      46
2    202      1     7   11   10      0     33    22    12      17      43
3    203      0     7   12    6      0     34    24    14      22      36
4    204      0     7   11   11      0     29    23    12       9      38
5    205      0     7   12    5      0     16    25    11      10      51
6    206      1     7   12    6      0     30    25    12      20      42
  paragrap sentence wordc wordm add code counting straight wordr numberr
1       10       17    22    10  69   65       82      156   173      91
2        8       17    30    10  65   60       98      195   174      81
3       11       19    27    19  50   49       86      228   168      84
4        9       19    25    11 114   59      103      144   130      84
5        8       25    28    24 112   54      122      160   184      98
6       10       23    28    18  94   84      113      201   188      86
  figurer object numberf figurew deduct numeric problemr series arithmet
1      96      8       2      10     21      12       17     11       17
2     106      9      15      17     33      12       22     31       32
3     101      1       7      16     45      10       43     21       18
4     101     10      15      14     25      21       26     19       28
5      99      9       9      15     28      16       35     21       25
6     116     10      10      16     36      14       27     18       30
  paperrev flagssub
1       13       25
2       20       37
3       19       40
4       11       44
5       10       28
6       16       42
>

visual	scores on visual perception test, test 1
cubes	scores on cubes test, test 2
paper	scores on paper form board test, test 3
lozenge	scores on lozenges test, test 4
general	scores on general information test, test 5
paragrap	scores on paragraph comprehension test, test 6
sentence	scores on sentence completion test, test 7
wordc	scores on word classification test, test 8
wordm	scores on word meaning test, test 9
add	scores on add test, test 10
code	scores on code test, test 11
counting	scores on counting groups of dots test, test 12
straight	scores on straight and curved capitals test, test 13
wordr	scores on word recognition test, test 14
numberr	scores on number recognition test, test 15
figurer	scores on figure recognition test, test 16
object	scores on object-number test, test 17
numberf	scores on number-figure test, test 18
figurew	scores on figure-word test, test 19
deduct	scores on deduction test, test 20
numeric	scores on numerical puzzles test, test 21
problemr	scores on problem reasoning test, test 22
series	scores on series completion test, test 23
arithmet	scores on Woody-McCall mixed fundamentals, form I test, test 24
paperrev	scores on additional paper form board test, test 25
flagssub	scores on flags test, test 26

Exploratory factor analysis

The usual exploratory factor analysis involves (1) preparing the data, (2) determining the number of factors, (3) estimating the model, (4) rotating the factors, (5) estimating factor scores, and (6) interpreting the results.

Preparing data

In EFA, a correlation matrix is analyzed. The following R code calculates the correlation matrix. To simplify the later steps, we save the correlation matrix in the data file GWcorr.csv so that it can be reused.

> usedata('GrantWhite')
> fa.var<-c('visual', 'cubes', 'paper', 'lozenge', 
+ 'general', 'paragrap', 'sentence', 'wordc', 
+ 'wordm', 'add', 'code', 'counting', 'straight', 
+ 'wordr', 'numberr', 'figurer', 'object', 'numberf', 
+ 'figurew')
> 
> fadata<-GrantWhite[,fa.var]
> ## correlation matrix
> fa.cor<-cor(fadata)
> ## part of the correlation matrix
> round(fa.cor[1:5,1:5],3)
        visual cubes paper lozenge general
visual   1.000 0.326 0.372   0.449   0.328
cubes    0.326 1.000 0.190   0.417   0.275
paper    0.372 0.190 1.000   0.366   0.309
lozenge  0.449 0.417 0.366   1.000   0.381
general  0.328 0.275 0.309   0.381   1.000
> 
> ## save the correlation matrix
> write.csv(fa.cor, '../data/GWcorr.csv')
>
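The usedata() calls in this chapter appear to rely on a helper from the book's online environment rather than base R (this is an assumption on our part). Outside that environment, a saved correlation matrix could be written and read back roughly as follows. This is only a minimal sketch: it uses a temporary file and the 3 x 3 corner of the correlation matrix shown above, and the file location is hypothetical.

```r
# Minimal sketch: round-trip a correlation matrix through a CSV file in base R.
# The values are the visual/cubes/paper corner of the matrix printed above.
vars <- c("visual", "cubes", "paper")
small.cor <- matrix(c(1.000, 0.326, 0.372,
                      0.326, 1.000, 0.190,
                      0.372, 0.190, 1.000),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(vars, vars))

# Hypothetical location; the chapter itself writes to '../data/GWcorr.csv'.
tmp <- tempfile(fileext = ".csv")
write.csv(small.cor, tmp)

# row.names = 1 turns the first column (the variable names) back into row names.
small.back <- as.matrix(read.csv(tmp, row.names = 1))
print(small.back)
```

Converting with as.matrix() matters because functions such as eigen() expect a numeric matrix rather than a data frame.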

Determining the number of factors

With the correlation matrix, we first decide the number of factors. There are several ways to do this, and the methods described here are all based on the eigenvalues of the correlation matrix. The eigenvalues obtained in R are shown below. First, note that the number of eigenvalues is the same as the number of variables. Second, the sum of all the eigenvalues is equal to the number of variables.

> usedata('GWcorr', row.names=1)
> 
> fa.eigen <- eigen(GWcorr)
> fa.eigen$values
 [1] 6.3041871 1.9473919 1.5265417 1.4877579 0.9398040 0.8747401 0.7639373
 [8] 0.6559871 0.6508332 0.5719815 0.5481270 0.4640505 0.4371337 0.4070784
[15] 0.3655828 0.3201049 0.3064701 0.2312029 0.1970878
> 
> sum(fa.eigen$values)
[1] 19
> cumsum(fa.eigen$values)
 [1]  6.304187  8.251579  9.778121 11.265879 12.205683 13.080423 13.844360
 [8] 14.500347 15.151180 15.723162 16.271289 16.735339 17.172473 17.579552
[15] 17.945134 18.265239 18.571709 18.802912 19.000000
> cumsum(fa.eigen$values)/19
 [1] 0.3317993 0.4342936 0.5146379 0.5929410 0.6424043 0.6884433 0.7286505
 [8] 0.7631762 0.7974305 0.8275348 0.8563836 0.8808073 0.9038144 0.9252396
[15] 0.9444808 0.9613284 0.9774584 0.9896270 1.0000000
>

The basic idea can be related to the variance explained in regression analysis. With a correlation matrix, each variable has a variance of 1. For a total of $p$ variables, the total variance is therefore $p$. In factor analysis, we try to find a small number of factors that can explain a large portion of the total variance. Each eigenvalue corresponds to the variance associated with one factor; for example, the first eigenvalue, 6.304, accounts for 6.304/19 = 33.2% of the total variance. If the eigenvalue corresponding to a factor is large, the variance explained by that factor is large. Therefore, the eigenvalues can be used to select the number of factors.
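The link between eigenvalues and total variance can be checked on a small scale. The sketch below uses only the 5 x 5 part of the correlation matrix printed earlier, so its eigenvalues are not those of the full 19-variable analysis; it simply illustrates that the eigenvalues of a correlation matrix sum to the number of variables.

```r
# The 5 x 5 part of the correlation matrix shown in the "Preparing data" step.
vars <- c("visual", "cubes", "paper", "lozenge", "general")
sub.cor <- matrix(c(1.000, 0.326, 0.372, 0.449, 0.328,
                    0.326, 1.000, 0.190, 0.417, 0.275,
                    0.372, 0.190, 1.000, 0.366, 0.309,
                    0.449, 0.417, 0.366, 1.000, 0.381,
                    0.328, 0.275, 0.309, 0.381, 1.000),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(vars, vars))

sub.eigen <- eigen(sub.cor)$values   # returned in decreasing order
total.var <- sum(sub.eigen)          # equals the number of variables (the trace)
prop.var  <- sub.eigen / total.var   # share of total variance per eigenvalue

print(round(sub.eigen, 3))
print(total.var)
print(round(cumsum(prop.var), 3))
```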

Rule 1

The first rule for deciding the number of factors is to use the number of eigenvalues larger than 1 (often called the Kaiser criterion). In this example, four eigenvalues are larger than 1 (6.304, 1.947, 1.527, and 1.488). Therefore, this rule suggests 4 factors.

Rule 2

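Another way is to select the smallest number of factors whose eigenvalues cumulatively account for at least 80% of the total variance. That is, if we add up the eigenvalues of the selected factors, the total should be at least 80% of the sum of all eigenvalues. In this example, the cumulative proportion is 0.797 after 9 factors and 0.828 after 10, so this rule would call for 10 factors, far more than Rule 1.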
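Both rules are easy to apply in code. The following sketch copies the eigenvalues from the output above (so the data file is not needed); the variable names are ours.

```r
# Eigenvalues copied from the fa.eigen$values output above.
ev <- c(6.3041871, 1.9473919, 1.5265417, 1.4877579, 0.9398040, 0.8747401,
        0.7639373, 0.6559871, 0.6508332, 0.5719815, 0.5481270, 0.4640505,
        0.4371337, 0.4070784, 0.3655828, 0.3201049, 0.3064701, 0.2312029,
        0.1970878)

# Rule 1: count the eigenvalues larger than 1.
n.rule1 <- sum(ev > 1)

# Rule 2: smallest number of factors whose cumulative share reaches 80%.
cum.prop <- cumsum(ev) / sum(ev)
n.rule2 <- which(cum.prop >= 0.80)[1]

print(c(rule1 = n.rule1, rule2 = n.rule2))
```

When the rules disagree, as they can, the choice is usually made by weighing several criteria together with the interpretability of the resulting factors.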

Cattell's Scree plot

Cattell's scree plot is a plot of the eigenvalues on the Y axis against the number of factors on the X axis. The plot looks like the side of a mountain, and "scree" refers to the debris fallen from a mountain and lying at its base. As one moves to the right, toward later components/factors, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, Cattell's scree test says to drop all further components/factors after the one starting the elbow. For this example, we can identify 4 factors based on the scree plot below.

> usedata('GWcorr', row.names=1)
> 
> fa.eigen <- eigen(GWcorr)
> 
> plot(fa.eigen$values, type='b', ylab='Eigenvalues', xlab='Factor')
>
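The elbow can also be examined numerically by looking at the drop between successive eigenvalues. This is only a rough aid to reading the plot, not part of Cattell's test itself; the eigenvalues are copied from the output above.

```r
# Eigenvalues copied from the fa.eigen$values output above.
ev <- c(6.3041871, 1.9473919, 1.5265417, 1.4877579, 0.9398040, 0.8747401,
        0.7639373, 0.6559871, 0.6508332, 0.5719815, 0.5481270, 0.4640505,
        0.4371337, 0.4070784, 0.3655828, 0.3201049, 0.3064701, 0.2312029,
        0.1970878)

# drops[k] is the fall from eigenvalue k to eigenvalue k + 1.
drops <- -diff(ev)
print(round(drops[1:8], 3))

# Position of the largest drop once the first (dominant) one is set aside.
elbow.after <- which.max(drops[-1]) + 1
print(elbow.after)
```

After the first, very large drop, the largest fall occurs between the fourth and fifth eigenvalues, which is consistent with retaining 4 factors.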

Estimation of model / Factor analysis

Once the number of factors is decided, we can conduct exploratory factor analysis using the R function factanal(). The R input and output for this example are given below.

> usedata('GrantWhite')
> fa.var<-c('visual', 'cubes', 'paper', 'lozenge', 
+ 'general', 'paragrap', 'sentence', 'wordc', 
+ 'wordm', 'add', 'code', 'counting', 'straight', 
+ 'wordr', 'numberr', 'figurer', 'object', 'numberf', 
+ 'figurew')
> 
> fadata<-GrantWhite[,fa.var]
> 
> fa.res<-factanal(x=fadata, factors=4, rotation='none')
> fa.res

Call:
factanal(x = fadata, factors = 4, rotation = "none")

Uniquenesses:
  visual    cubes    paper  lozenge  general paragrap sentence    wordc 
   0.465    0.742    0.712    0.549    0.344    0.306    0.287    0.493 
   wordm      add     code counting straight    wordr  numberr  figurer 
   0.270    0.360    0.553    0.377    0.411    0.684    0.710    0.560 
  object  numberf  figurew 
   0.470    0.573    0.777 

Loadings:
         Factor1 Factor2 Factor3 Factor4
visual    0.536   0.176   0.392  -0.249 
cubes     0.330           0.302  -0.228 
paper     0.440   0.110   0.247  -0.147 
lozenge   0.505           0.358  -0.253 
general   0.762  -0.238  -0.113         
paragrap  0.759  -0.338                 
sentence  0.762  -0.322  -0.166         
wordc     0.701                         
wordm     0.762  -0.381                 
add       0.455   0.475  -0.451         
code      0.545   0.367           0.103 
counting  0.434   0.593  -0.238  -0.162 
straight  0.592   0.393          -0.289 
wordr     0.394           0.149   0.362 
numberr   0.352   0.139   0.219   0.315 
figurer   0.435   0.183   0.425   0.192 
object    0.445   0.241           0.522 
numberf   0.454   0.383   0.221   0.157 
figurew   0.389   0.115   0.133   0.202 

               Factor1 Factor2 Factor3 Factor4
SS loadings      5.722   1.625   1.065   0.945
Proportion Var   0.301   0.086   0.056   0.050
Cumulative Var   0.301   0.387   0.443   0.492

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 102.06 on 101 degrees of freedom.
The p-value is 0.452 
>

In EFA, each observed variable consists of two parts, the common factor part and the unique part. The common factor part is based on the four factors, which are also called the common factors. The unique part is also called the unique factor, which is specific to each observed variable.

Using the variable visual as an example, we have

\[ visual = 0.536\times Factor1 + 0.176\times Factor2 + 0.392\times Factor3 - 0.249\times Factor4 + u_{visual} \]

Note that the factor loadings come from the Loadings section of the output. The loadings are the regression coefficients obtained when the manifest indicators (observed variables) are regressed on the latent factors. The variance of each unique factor is given in the Uniquenesses section. For $u_{visual}$, the variance is 0.465. The other variables can be interpreted in the same way.
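Because the unrotated factors are uncorrelated and the analysis is based on the correlation matrix (so each variable has variance 1), the variance of visual explained by the common factors, its communality, is the sum of its squared loadings, and the uniqueness is what remains. A quick check using the rounded values from the output above:

```r
# Unrotated loadings of visual, copied from the Loadings section above.
load.visual <- c(Factor1 = 0.536, Factor2 = 0.176, Factor3 = 0.392, Factor4 = -0.249)

communality <- sum(load.visual^2)   # variance explained by the common factors
uniqueness  <- 1 - communality      # remaining variance of the standardized variable

print(round(c(communality = communality, uniqueness = uniqueness), 3))
```

The result agrees with the reported uniqueness of 0.465 up to rounding of the printed loadings.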

The remaining section of the output concerns the variance explained by the factors. SS loadings is the sum of squared loadings for each factor, that is, the total variance in the 19 variables explained by that factor. For example, the first factor explains a total variance of 5.722, which is about 30.1% (5.722/19) of the total. Proportion Var is the proportion of the total variance in the observed variables/indicators explained by each factor. Cumulative Var is the cumulative proportion of variance explained by the factors up to and including the current one.
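These quantities can be recomputed from the printed output. The sketch below uses the rounded Factor1 loadings and uniquenesses shown above, so the results match the reported values only up to rounding.

```r
# Unrotated Factor1 loadings for the 19 variables, copied from the output above.
f1 <- c(0.536, 0.330, 0.440, 0.505, 0.762, 0.759, 0.762, 0.701, 0.762, 0.455,
        0.545, 0.434, 0.592, 0.394, 0.352, 0.435, 0.445, 0.454, 0.389)

ss1   <- sum(f1^2)          # SS loadings for Factor1
prop1 <- ss1 / length(f1)   # Proportion Var for Factor1

# Uniquenesses copied from the output above, in the same variable order.
uniq <- c(0.465, 0.742, 0.712, 0.549, 0.344, 0.306, 0.287, 0.493, 0.270, 0.360,
          0.553, 0.377, 0.411, 0.684, 0.710, 0.560, 0.470, 0.573, 0.777)

# Variance explained by all four factors together is the total variance (19)
# minus the unique variance.
total.common <- length(uniq) - sum(uniq)
cum.var      <- total.common / length(uniq)

print(round(c(SS1 = ss1, Prop1 = prop1, Total = total.common, CumVar = cum.var), 3))
```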

A test is conducted of whether the factor model is sufficient to explain the observed data. The null hypothesis is that a 4-factor model is sufficient. For this model, the chi-square statistic is 102.06 with 101 degrees of freedom. The p-value for the chi-square test is 0.452, which is larger than .05. Therefore, we fail to reject the null hypothesis that the 4-factor model is sufficient, which suggests that the model fits the data adequately.
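The degrees of freedom and p-value can be reproduced from the reported statistic. The sketch assumes the usual degrees-of-freedom formula for a maximum likelihood factor model with p variables and k factors, ((p - k)^2 - (p + k))/2; the statistic 102.06 is the rounded value from the output.

```r
p <- 19   # number of observed variables
k <- 4    # number of factors

# Degrees of freedom of the likelihood ratio test for a k-factor model.
df <- ((p - k)^2 - (p + k)) / 2

chisq   <- 102.06   # rounded statistic from the factanal() output
p.value <- pchisq(chisq, df = df, lower.tail = FALSE)

print(c(df = df, p.value = round(p.value, 3)))
```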

Factor rotation

Although we have identified 4 factors and found that the 4-factor model fits well, we cannot find a clear pattern in the unrotated factor loadings that helps us understand the factors. Factor rotation makes the output more understandable and is usually necessary to facilitate the interpretation of factors. The aim is to find a simple solution in which each factor has a small number of large loadings and a large number of zero (or small) loadings. There are many different rotation methods, such as the varimax, quartimax, and equimax rotations (which are orthogonal) and various oblique rotations. The PROMAX rotation is one kind of oblique rotation and is widely used. After PROMAX rotation, the factors are allowed to be correlated.

The output of PROMAX rotation is shown below. In the output, we use print(fa.res, cut=0.2) to show only the factor loadings that are greater than 0.2 in absolute value. Note that after rotation, many loadings are actually smaller than 0.2. The pattern of the factor loadings is much clearer now. For example, the variable visual has a large loading of 0.747 on Factor 2 but loadings smaller than 0.2 on the other three factors. In this case, we might say that the variable visual is mainly influenced by Factor 2.

Unlike the variable visual, the variable straight has large loadings on both Factor 2 (0.489) and Factor 4 (0.484). In other words, straight measures both factors rather than just a single factor.

We can also see that the primary indicators for Factor 1 are general, paragrap, sentence, wordc, and wordm. For Factor 4, the indicators include add, code, counting, and straight.

The correlations among the factors are given in the Factor Correlations section. For example, the correlation between Factor 1 and Factor 2 is 0.368. Note that after rotation, the test of the model is the same as without rotation.

> usedata('GrantWhite')
> fa.var<-c('visual', 'cubes', 'paper', 'lozenge', 
+ 'general', 'paragrap', 'sentence', 'wordc', 
+ 'wordm', 'add', 'code', 'counting', 'straight', 
+ 'wordr', 'numberr', 'figurer', 'object', 'numberf', 
+ 'figurew')
> 
> fadata<-GrantWhite[,fa.var]
> 
> fa.res<-factanal(x=fadata, factors=4, rotation='promax')
> print(fa.res, cut=0.2)

Call:
factanal(x = fadata, factors = 4, rotation = "promax")

Uniquenesses:
  visual    cubes    paper  lozenge  general paragrap sentence    wordc 
   0.465    0.742    0.712    0.549    0.344    0.306    0.287    0.493 
   wordm      add     code counting straight    wordr  numberr  figurer 
   0.270    0.360    0.553    0.377    0.411    0.684    0.710    0.560 
  object  numberf  figurew 
   0.470    0.573    0.777 

Loadings:
         Factor1 Factor2 Factor3 Factor4
visual            0.747                 
cubes             0.571                 
paper             0.485                 
lozenge           0.683                 
general   0.760                         
paragrap  0.806                         
sentence  0.862                         
wordc     0.555                         
wordm     0.856                         
add              -0.245           0.806 
code                      0.290   0.420 
counting                          0.773 
straight          0.489           0.484 
wordr                     0.567         
numberr                   0.544         
figurer           0.376   0.501         
object           -0.244   0.766         
numberf           0.271   0.446         
figurew                   0.381         

               Factor1 Factor2 Factor3 Factor4
SS loadings      3.109   2.231   1.947   1.801
Proportion Var   0.164   0.117   0.102   0.095
Cumulative Var   0.164   0.281   0.384   0.478

Factor Correlations:
        Factor1 Factor2 Factor3 Factor4
Factor1   1.000   0.368   0.517   0.457
Factor2   0.368   1.000   0.435   0.432
Factor3   0.517   0.435   1.000   0.545
Factor4   0.457   0.432   0.545   1.000

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 102.06 on 101 degrees of freedom.
The p-value is 0.452 
>
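The test statistic and uniquenesses are the same with and without rotation because rotation re-expresses the same fitted model: the correlations implied by the rotated loadings together with the factor correlations equal those implied by the unrotated loadings. The sketch below illustrates this with a small, made-up loading matrix (not the Grant-White results) and the promax() function from the stats package.

```r
# Hypothetical unrotated loadings for 6 variables on 2 factors (illustration only).
L <- matrix(c(0.7, 0.6, 0.8,  0.5,  0.6,  0.7,
              0.4, 0.5, 0.3, -0.4, -0.5, -0.3), ncol = 2)

rot <- promax(L, m = 4)                       # m = 4 is the default power
P   <- unclass(rot$loadings)                  # rotated loadings, equal to L %*% rot$rotmat
Phi <- solve(t(rot$rotmat) %*% rot$rotmat)    # correlations among the rotated factors

common.before <- L %*% t(L)                   # common part implied before rotation
common.after  <- P %*% Phi %*% t(P)           # common part implied after rotation

print(max(abs(common.before - common.after))) # numerically zero
print(round(Phi, 3))
```

Since the diagonal of the common part gives the communalities, the uniquenesses are unchanged as well; only the way the common variance is divided among the factors differs.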

Interpret the results from EFA

Based on the rotated factor loadings, we can name the factors in the model. This can be done by identifying the salient (large) loadings. For example, Factor 1 is indicated by general, paragrap, sentence, wordc, and wordm, all of which are related to the verbal aspect of cognitive ability. One way to name the factor is to call it a verbal factor. Similarly, the second can be called the spatial factor, the third the memory factor, and the last one the speed factor.
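One simple way to organize this step is to assign each variable to the factor on which it loads most strongly and to flag variables with more than one sizable loading. The sketch below uses a few rows of the rotated output; loadings hidden by cut=0.2 are set to 0 purely for illustration, and the 0.4 threshold for a sizable loading is our assumption, not a rule from the analysis above.

```r
# A few rows of the PROMAX-rotated loadings; entries not printed in the output
# (absolute value below 0.2) are set to 0 here for illustration only.
load.rot <- rbind(
  visual   = c(0,      0.747, 0,     0),
  general  = c(0.760,  0,     0,     0),
  add      = c(0,     -0.245, 0,     0.806),
  straight = c(0,      0.489, 0,     0.484),
  object   = c(0,     -0.244, 0.766, 0)
)
colnames(load.rot) <- paste0("Factor", 1:4)

# Primary factor: the column with the largest absolute loading in each row.
primary <- colnames(load.rot)[apply(abs(load.rot), 1, which.max)]
names(primary) <- rownames(load.rot)

# Cross-loading flag: more than one loading at or above the assumed threshold.
threshold <- 0.4
cross <- rowSums(abs(load.rot) >= threshold) > 1

print(data.frame(primary = primary, cross.loading = cross))
```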

Factor scores

Sometimes, the purpose of factor analysis is to estimate the score of each participant on each latent construct/factor. Factor scores can be used in further data analysis. In general, there are two common methods for estimating factor scores: the regression method and the Bartlett method. The Bartlett method is often preferred because it gives unbiased estimates of the factor scores. For example, the following code obtains the Bartlett factor scores. As an example of further analysis, a linear regression of Factor 2 on Factor 1 is also fitted.

> usedata('GrantWhite')
> fa.var<-c('visual', 'cubes', 'paper', 'lozenge', 
+ 'general', 'paragrap', 'sentence', 'wordc', 
+ 'wordm', 'add', 'code', 'counting', 'straight', 
+ 'wordr', 'numberr', 'figurer', 'object', 'numberf', 
+ 'figurew')
> 
> fadata<-GrantWhite[,fa.var]
> 
> fa.res<-factanal(x=fadata, factors=4, rotation='promax', scores='Bartlett')
> head(fa.res$scores)
         Factor1    Factor2    Factor3    Factor4
[1,] -0.43901139 -1.7896772 -0.7416310 -1.1652786
[2,] -0.64032039  0.3301758  0.1765352 -0.6575473
[3,] -0.05713828  0.8185462 -1.3589951 -1.4069599
[4,] -0.55427892 -1.0738916 -0.7036627  0.2676821
[5,]  0.68178108 -1.7044904  0.1877187  0.5317372
[6,]  0.21943745  0.3296831  1.0497671  0.2068044
> 
> summary(lm(Factor2 ~ Factor1, data=as.data.frame(fa.res$scores)))

Call:
lm(formula = Factor2 ~ Factor1, data = as.data.frame(fa.res$scores))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6901 -0.5732  0.0375  0.6267  2.8976 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.759e-16  8.340e-02   0.000        1    
Factor1      4.631e-01  7.975e-02   5.808 3.94e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.004 on 143 degrees of freedom
Multiple R-squared:  0.1908,	Adjusted R-squared:  0.1852 
F-statistic: 33.73 on 1 and 143 DF,  p-value: 3.941e-08

>
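To see how the two scoring methods relate, they can be compared on the same fitted model. The sketch below uses simulated data with a known two-factor structure rather than the Grant-White data, so the numbers are illustrative only; the loadings 0.8, 0.7, and 0.6, the sample size, and the seed are arbitrary choices.

```r
set.seed(123)
n  <- 300
f1 <- rnorm(n)   # true scores on the first simulated factor
f2 <- rnorm(n)   # true scores on the second simulated factor

# Unique part scaled so that each simulated variable has variance close to 1.
noise <- function(l) rnorm(n, sd = sqrt(1 - l^2))

sim <- data.frame(
  x1 = 0.8 * f1 + noise(0.8), x2 = 0.7 * f1 + noise(0.7), x3 = 0.6 * f1 + noise(0.6),
  x4 = 0.8 * f2 + noise(0.8), x5 = 0.7 * f2 + noise(0.7), x6 = 0.6 * f2 + noise(0.6)
)

# Same model, two scoring methods (default varimax rotation).
fit.bart <- factanal(sim, factors = 2, scores = "Bartlett")
fit.reg  <- factanal(sim, factors = 2, scores = "regression")

# Correlation between the two sets of scores, factor by factor.
score.cor <- diag(cor(fit.bart$scores, fit.reg$scores))
print(round(score.cor, 3))

# Variances of the scores under each method.
print(round(rbind(Bartlett   = apply(fit.bart$scores, 2, var),
                  regression = apply(fit.reg$scores, 2, var)), 3))
```

The two sets of scores are typically very highly correlated; they differ mainly in scale, since regression scores tend to be shrunken toward zero relative to Bartlett scores.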

To cite the book, use: Zhang, Z. & Wang, L. (2017-2025). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.