Computing factor scores and confidence intervals for CFA models with missing data

Hugo_Storm · March 21, 2025, 1:39am

Hey everyone, I’m trying to figure out how to deal with missing data in my CFA model. I want to calculate factor scores and their confidence intervals, but I’m not sure how to handle the NAs in my dataset.

I’ve got a sample dataset with 500 observations and 5 items. Some of the items have missing values. Here’s a quick look at my data structure:

# Create sample data
n <- 500

df <- data.frame(
  ID = 1:n,
  state = sample(c('CA', 'NY', 'TX', 'FL', 'IL'), n, replace = TRUE),
  school = paste('School', 1:n),
  gender = sample(c('M', 'F'), n, replace = TRUE),
  q1 = runif(n),
  q2 = runif(n),
  q3 = runif(n),
  q4 = runif(n),
  q5 = runif(n)
)

# Add some NAs
df[sample(1:n, 50), 'q1'] <- NA
df[sample(1:n, 75), 'q3'] <- NA
df[sample(1:n, 60), 'q5'] <- NA

# CFA model
model <- 'factor =~ q1 + q2 + q3 + q4 + q5'

# Fit model
library(lavaan)
fit <- cfa(model, data = df)

I’m using lavaan for the CFA, but I’m stuck on how to get the factor scores and their confidence intervals with the missing data. Any tips or code examples would be super helpful! Thanks!

Ethan85 · March 26, 2025, 8:35pm

I’ve run into similar issues with missing data in CFA models. One approach that’s worked well for me is full information maximum likelihood (FIML) estimation. As far as I know, the missing = 'fiml' argument belongs to lavaan’s cfa() and sem() functions rather than to the separate ‘sem’ package, so it’s simplest to stay in lavaan, which you’re already using.

Here’s a basic example of how you might proceed:

library(lavaan)

# One-factor model in lavaan syntax
model <- 'factor =~ q1 + q2 + q3 + q4 + q5'

# FIML uses every observed response instead of dropping incomplete rows;
# std.lv = TRUE fixes the factor variance to 1 so all five loadings are estimated
fit <- cfa(model, data = df, missing = 'fiml', std.lv = TRUE)

# One factor score per respondent (regression method)
scores <- lavPredict(fit, method = 'regression')

This should give you factor scores even with missing data. For confidence intervals, you might consider using bootstrapping techniques. The ‘boot’ package in R can be helpful for this purpose.

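Remember that the appropriateness of FIML depends on your missing data mechanism. If the data are MCAR or MAR and the model is correctly specified, FIML should give consistent estimates. If the missingness depends on the unobserved values themselves (MNAR), the estimates can be biased.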
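Before leaning on FIML it can help to look at how much is missing and in which patterns. Here’s a minimal base-R sketch, assuming the item columns are named q1 to q5 as in Hugo’s data. The `items` data frame is a made-up stand-in so the snippet runs on its own, and the NAs in it are MCAR by construction:

```r
# Hypothetical stand-in for the q1..q5 columns of df
set.seed(1)
items <- as.data.frame(matrix(runif(500 * 5), ncol = 5,
                              dimnames = list(NULL, paste0("q", 1:5))))
items[sample(500, 50), "q1"] <- NA
items[sample(500, 75), "q3"] <- NA
items[sample(500, 60), "q5"] <- NA

# Share of missing values per item
miss_rate <- colMeans(is.na(items))

# Missing-data patterns: 1 = observed, 0 = missing, one string per row
pattern <- apply(!is.na(items), 1,
                 function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- sort(table(pattern), decreasing = TRUE)

# Rows with all five items observed (what listwise deletion would keep)
n_complete <- sum(complete.cases(items))

print(round(miss_rate, 3))
print(pattern_counts)
print(n_complete)
```

A table like this can’t prove MCAR or MAR, but it does show how many rows listwise deletion would throw away and whether a few patterns dominate.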

Luke87 · March 26, 2025, 10:21am

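hey hugo, i’ve dealt with this before. if you fit the model with missing = 'fiml', lavaan uses full information maximum likelihood (FIML) to handle the NAs, and lavPredict() will then give you factor scores even for rows with missing items. for confidence intervals, try bootstrapping with lavaan::bootstrapLavaan(). note that its default FUN returns the parameter estimates, so you’d need to pass your own function to get at the factor scores. might take a while to run tho. hope this helps!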
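To make the bootstrap idea concrete, here’s a rough sketch with a hand-written loop instead of bootstrapLavaan(). Assumptions on my part: the data are simulated with a real one-factor structure (the runif() items in the question are pure noise), `fit_fiml()` and `score_rows()` are hypothetical helpers rather than lavaan functions, and the scores are regression scores computed by hand from the FIML estimates, which should correspond to what lavPredict()’s regression method returns.

```r
library(lavaan)
set.seed(42)

# Simulated stand-in data with a genuine one-factor structure (an assumption)
n <- 300
eta <- rnorm(n)
loadings <- c(0.8, 0.7, 0.6, 0.7, 0.5)
items <- as.data.frame(sapply(loadings, function(l) l * eta + rnorm(n, sd = 0.6)))
names(items) <- paste0("q", 1:5)
items[sample(n, 30), "q1"] <- NA
items[sample(n, 45), "q3"] <- NA
items[sample(n, 36), "q5"] <- NA

model <- 'factor =~ q1 + q2 + q3 + q4 + q5'

# Hypothetical helper: FIML fit with the factor variance fixed to 1
fit_fiml <- function(d) cfa(model, data = d, missing = "fiml", std.lv = TRUE)

# Hypothetical helper: regression factor scores E[factor | observed items],
# computed row by row so each respondent uses only the items they answered
score_rows <- function(fit, d) {
  est <- lavInspect(fit, "est")
  lam <- as.numeric(est$lambda)  # loadings
  nu  <- as.numeric(est$nu)      # item intercepts
  th  <- diag(est$theta)         # residual variances
  psi <- as.numeric(est$psi)     # factor variance
  apply(as.matrix(d[, rownames(est$lambda)]), 1, function(y) {
    o <- !is.na(y)
    if (!any(o)) return(0)  # no items answered: fall back to the factor mean
    Sigma <- psi * tcrossprod(lam[o]) + diag(th[o], nrow = sum(o))
    psi * sum(lam[o] * solve(Sigma, y[o] - nu[o]))
  })
}

fit <- fit_fiml(items)
scores <- score_rows(fit, items)

# Nonparametric bootstrap: refit on resampled rows, then re-score the ORIGINAL
# respondents with each set of bootstrap estimates
B <- 100  # kept small for speed; use many more replicates in practice
boot_scores <- matrix(NA_real_, nrow = n, ncol = B)
for (b in seq_len(B)) {
  fit_b <- fit_fiml(items[sample(n, replace = TRUE), ])
  if (lavInspect(fit_b, "converged")) {
    boot_scores[, b] <- score_rows(fit_b, items)
  }
}

# 95% percentile interval per respondent
ci <- t(apply(boot_scores, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE))
result <- data.frame(score = scores, lower = ci[, 1], upper = ci[, 2])
print(head(result))
```

One caveat: these intervals only reflect sampling uncertainty in the model parameters. They don’t include the measurement error in an individual’s score, so they’ll usually look narrower than the real uncertainty about a person’s factor level.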

Owen_Galaxy · March 23, 2025, 7:35pm

Hey there Hugo_Storm!

Dealing with missing data in CFA can be tricky, but it’s totally doable. Have you considered using multiple imputation? It’s a neat technique that could work well for your situation.

Here’s a thought - what if you tried the mice package? It’s pretty robust for handling missing data. You could impute multiple datasets, run your CFA on each, and then pool the results. Might give you more stable estimates.

Just curious, have you explored why you have missing data? Sometimes the pattern of missingness can be informative. Maybe there’s an interesting story hiding in those NAs!

What do you think about trying something like this:

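library(mice)
library(lavaan)

# Impute only the items; ID and school are unique per row and carry no information
items <- df[, paste0('q', 1:5)]
imputed_data <- mice(items, m = 5, printFlag = FALSE)
# Fit the CFA to each completed dataset (with() can't pass the data to cfa())
fits <- lapply(1:5, function(i) cfa(model, data = complete(imputed_data, i)))
# Pool the point estimates by averaging; semTools::runMI() or the lavaan.mi
# package should also pool the standard errors for you
pooled_results <- rowMeans(sapply(fits, coef))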

This could give you a good starting point. What do you reckon? Have you tried anything like this before?
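If you want pooled standard errors too, Rubin’s rules are short enough to write by hand. Here’s a small base-R sketch. `pool_rubin()` is a hypothetical helper, not a mice or lavaan function, and the numbers are made up for illustration. With real fits you’d pull the estimates and standard errors for one parameter from parameterEstimates() on each imputed-data fit.

```r
# Pool M estimates of one parameter with Rubin's rules
# est: point estimates from the M imputed-data fits
# se:  their standard errors
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)            # pooled point estimate
  w <- mean(se^2)               # within-imputation variance
  b <- var(est)                 # between-imputation variance
  t_var <- w + (1 + 1 / m) * b  # total variance
  c(estimate = q_bar, se = sqrt(t_var))
}

# Made-up loading estimates from m = 5 imputations, for illustration only
pooled <- pool_rubin(est = c(0.71, 0.68, 0.74, 0.70, 0.72),
                     se  = c(0.050, 0.052, 0.049, 0.051, 0.050))
print(pooled)
```

The between-imputation term is what a single imputed dataset can’t give you: it’s the extra uncertainty that comes from not knowing the missing values.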

I’d love to hear more about your project. What’s the factor you’re trying to measure? Sounds intriguing!