Dealing with missing data in CFA: How to estimate factor scores and confidence intervals?

I’m working on a Confirmatory Factor Analysis (CFA) project and I’ve run into a snag. My dataset has some missing values (NAs) and I’m not sure how to handle this when calculating factor scores and their confidence intervals. Here’s what I’ve done so far:

I created a sample dataset with 500 observations and 5 items. Each item has some NAs randomly inserted. Then I set up a simple CFA model for a ‘Satisfaction’ factor using all 5 items.

# Create sample data (seed set so the example is reproducible)
set.seed(123)
n = 500
df = data.frame(
  ID = 1:n,
  item1 = sample(c(NA, runif(n, 0, 1)), n, replace = TRUE),
  item2 = sample(c(NA, runif(n, 0, 1)), n, replace = TRUE),
  item3 = sample(c(NA, runif(n, 0, 1)), n, replace = TRUE),
  item4 = sample(c(NA, runif(n, 0, 1)), n, replace = TRUE),
  item5 = sample(c(NA, runif(n, 0, 1)), n, replace = TRUE)
)

# CFA model
model = 'Satisfaction =~ item1 + item2 + item3 + item4 + item5'

# Fit model (with lavaan's default missing = 'listwise', rows containing any NA are dropped)
library(lavaan)
fit = cfa(model, data = df, estimator = 'MLR')

Now, how do I calculate factor scores and their confidence intervals with this NA-containing data? Any help would be greatly appreciated!

Hey Owen_Galaxy! Interesting question you’ve got there about handling missing data in CFA. I’ve been working with similar issues lately, and it can be a real head-scratcher, right?

Have you considered using Full Information Maximum Likelihood (FIML) estimation? It’s pretty nifty for dealing with missing data in structural equation models. You could modify your code like this:

fit = cfa(model, data = df, estimator = 'MLR', missing = 'fiml')

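This way, lavaan will use every observed value to estimate the model parameters instead of dropping incomplete rows, and it does so without imputing anything. Keep in mind that FIML assumes the data are missing completely at random (MCAR) or missing at random (MAR).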
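To see what the switch to FIML buys you, here is a minimal sketch that compares how many rows each fit actually uses. The data are simulated by me with a real one-factor structure (the loadings, the 10% missingness and the name `dat` are my assumptions, not your dataset), so treat the numbers as illustrative only.

```r
library(lavaan)

set.seed(1)
n <- 500

# Hypothetical data: five items driven by one common factor
eta <- rnorm(n)
dat <- as.data.frame(sapply(1:5, function(j) 0.7 * eta + rnorm(n, sd = 0.7)))
names(dat) <- paste0("item", 1:5)

# Delete exactly 50 values (10%) from each item, completely at random
for (j in names(dat)) dat[sample(n, 50), j] <- NA

model <- 'Satisfaction =~ item1 + item2 + item3 + item4 + item5'

fit_listwise <- cfa(model, data = dat, estimator = "MLR")
fit_fiml     <- cfa(model, data = dat, estimator = "MLR", missing = "fiml")

n_listwise <- lavInspect(fit_listwise, "nobs")  # rows with no NA at all
n_fiml     <- lavInspect(fit_fiml, "nobs")      # rows with at least one observed item

c(listwise = n_listwise, fiml = n_fiml)
```

The listwise fit throws away every row with even a single NA, while the FIML fit keeps any row that has at least one observed item.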

For factor scores, you might want to look into the lavPredict() function. Called on a model fitted with missing = 'fiml', it estimates each person’s score from whichever items they actually answered. Something like:

scores = lavPredict(fit)
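One practical wrinkle is matching the scores back to your ID column. Here is a rough sketch of how I would do it, again on simulated data (the data-generating values and the name `dat` are my own assumptions). It relies on `lavInspect(fit, "case.idx")`, which reports the row numbers that entered the model, so the merge still lines up if lavaan had to drop a completely empty row.

```r
library(lavaan)

set.seed(2)
n <- 500

# Hypothetical one-factor data with an ID column, standing in for the real df
eta <- rnorm(n)
dat <- as.data.frame(sapply(1:5, function(j) 0.7 * eta + rnorm(n, sd = 0.7)))
names(dat) <- paste0("item", 1:5)
for (j in names(dat)) dat[sample(n, 50), j] <- NA
dat$ID <- seq_len(n)

model <- 'Satisfaction =~ item1 + item2 + item3 + item4 + item5'
fit <- cfa(model, data = dat, estimator = "MLR", missing = "fiml")

scores <- lavPredict(fit)             # matrix with one column per factor
used   <- lavInspect(fit, "case.idx") # rows of dat that the model actually used

out <- data.frame(ID = dat$ID[used],
                  Satisfaction = scores[, "Satisfaction"])
head(out)
```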

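As for confidence intervals, that’s a bit trickier with missing data. Have you thought about using bootstrapping? Refitting the model on resampled rows and re-scoring the original cases each time would give you a distribution of factor scores per person, and its percentiles can serve as an interval.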
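Here is one way the bootstrap idea could be sketched. This is my own rough take, not an official lavaan recipe: resample rows, refit with FIML, then score the original people with each replicate's parameters via the `newdata` argument of lavPredict(). The simulated data, `B = 100` and the percentile interval are all assumptions on my part. Note that an interval built this way reflects uncertainty in the model parameters only, not the measurement error in an individual score.

```r
library(lavaan)

set.seed(3)
n <- 300

# Hypothetical one-factor data with 10% missing values per item
eta <- rnorm(n)
dat <- as.data.frame(sapply(1:5, function(j) 0.7 * eta + rnorm(n, sd = 0.7)))
names(dat) <- paste0("item", 1:5)
for (j in names(dat)) dat[sample(n, 30), j] <- NA
dat <- dat[rowSums(!is.na(dat)) > 0, ]  # drop rows with no observed item at all

model <- 'Satisfaction =~ item1 + item2 + item3 + item4 + item5'

# Point estimates of the factor scores from the full sample
fit <- cfa(model, data = dat, estimator = "MLR", missing = "fiml")
point <- lavPredict(fit)[, "Satisfaction"]

B <- 100  # kept small for illustration; use more replicates in practice
boot_scores <- matrix(NA_real_, nrow = B, ncol = nrow(dat))

for (b in seq_len(B)) {
  resampled <- dat[sample(nrow(dat), replace = TRUE), ]
  # Only point estimates are needed here, so standard errors are skipped for speed
  fit_b <- cfa(model, data = resampled, missing = "fiml", se = "none")
  if (!lavInspect(fit_b, "converged")) next  # leave failed replicates as NA
  # Score the ORIGINAL cases with this replicate's parameter estimates
  boot_scores[b, ] <- lavPredict(fit_b, newdata = dat)[, "Satisfaction"]
}

# Percentile interval for each person across the replicates
ci <- t(apply(boot_scores, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE))
result <- data.frame(score = point, lower = ci[, 1], upper = ci[, 2])
head(result)
```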

What do you think about these approaches? Have you tried any other methods for dealing with the missing data? I’m really curious to hear more about your project and what you’ve discovered so far!

Dealing with missing data in CFA can be challenging. I’ve found that the multiple imputation approach works well in practice. You might want to consider using the ‘mice’ package in R for this. Here’s a basic workflow:

  1. Impute missing values multiple times
  2. Run your CFA model on each imputed dataset
  3. Pool the results to get final estimates and confidence intervals

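This method tends to give less biased estimates than listwise deletion, which discards every incomplete row, and more honest standard errors than single imputation, which treats the filled-in values as if they had been observed. For factor scores, you can calculate them for each imputed dataset and then average across imputations.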
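The three steps above, plus the score averaging, can be sketched roughly as follows. This is a hand-rolled version under my own assumptions: simulated data, `m = 5` imputations, predictive mean matching, Rubin's rules coded manually, and a simple normal-approximation interval (the textbook version uses a t reference distribution). Add-on packages such as semTools have offered helpers that automate the pooling, so check their current documentation before relying on this.

```r
library(mice)
library(lavaan)

set.seed(4)
n <- 500

# Hypothetical one-factor data with 10% missing values per item
eta <- rnorm(n)
dat <- as.data.frame(sapply(1:5, function(j) 0.7 * eta + rnorm(n, sd = 0.7)))
names(dat) <- paste0("item", 1:5)
for (j in names(dat)) dat[sample(n, 50), j] <- NA

model <- 'Satisfaction =~ item1 + item2 + item3 + item4 + item5'

# 1. Impute missing values multiple times
m <- 5
imp <- mice(dat, m = m, method = "pmm", printFlag = FALSE, seed = 4)

# 2. Run the CFA on each completed dataset
fits <- lapply(seq_len(m), function(i) {
  cfa(model, data = mice::complete(imp, i), estimator = "MLR")
})

# 3. Pool parameter estimates with Rubin's rules
est        <- sapply(fits, coef)                          # parameters x imputations
var_within <- sapply(fits, function(f) diag(vcov(f)))     # squared SEs per imputation
q_bar      <- rowMeans(est)                               # pooled estimate
total_var  <- rowMeans(var_within) + (1 + 1 / m) * apply(est, 1, var)

pooled <- data.frame(estimate = q_bar, se = sqrt(total_var))
pooled$lower <- pooled$estimate - 1.96 * pooled$se  # normal approximation
pooled$upper <- pooled$estimate + 1.96 * pooled$se
pooled

# Factor scores: compute per imputed dataset, then average across imputations
score_mat  <- sapply(fits, function(f) lavPredict(f)[, "Satisfaction"])
avg_scores <- rowMeans(score_mat)
head(avg_scores)
```

The spread of each row of `score_mat` across imputations also gives a feel for how much a person's score depends on the imputed values.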

Remember to assess the pattern of missingness in your data. If it’s Missing Completely at Random (MCAR) or Missing at Random (MAR), multiple imputation should work well. For Missing Not at Random (MNAR), you need a method that models the missingness mechanism itself, such as a selection or pattern-mixture model.

Have you explored the missingness patterns in your dataset? That could provide valuable insights for choosing the most appropriate method.
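If you haven't looked yet, a quick base-R check along these lines is a reasonable start. It is only a sketch on made-up data (the 10% missingness and the name `dat` are my assumptions), and it describes the patterns without testing whether the data are MCAR or MAR. The mice package's md.pattern() produces a similar tabulation.

```r
set.seed(5)
n <- 500

# Hypothetical items with 10% of values removed per item
dat <- as.data.frame(sapply(1:5, function(j) runif(n)))
names(dat) <- paste0("item", 1:5)
for (j in names(dat)) dat[sample(n, 50), j] <- NA

# Share of missing values per item
miss_rate <- colMeans(is.na(dat))
miss_rate

# Number of missing items per respondent
table(rowSums(is.na(dat)))

# Frequency of each missingness pattern (1 = missing, 0 = observed, in item order)
pattern <- apply(is.na(dat), 1, function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- sort(table(pattern), decreasing = TRUE)
head(pattern_counts)
```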

Yo Owen, missing data’s a pain! FIML’s good, but have you tried multiple imputation? It’s pretty sweet for this stuff. You can use the mice package in R to impute the missing values, then run your CFA on each imputed dataset and combine the results for pooled estimates and CIs. Just a thought, let me know if you need more details!