Hey everyone, I’m trying to figure out how to deal with missing data in my CFA model. I want to calculate factor scores and their confidence intervals, but I’m not sure how to handle the NAs in my dataset.
I’ve got a sample dataset with 500 observations and 5 items. Some of the items have missing values. Here’s a quick look at my data structure:
# Create sample data
n <- 500
df <- data.frame(
  ID = 1:n,
  state = sample(c('CA', 'NY', 'TX', 'FL', 'IL'), n, replace = TRUE),
  school = paste('School', 1:n),
  gender = sample(c('M', 'F'), n, replace = TRUE),
  q1 = runif(n),
  q2 = runif(n),
  q3 = runif(n),
  q4 = runif(n),
  q5 = runif(n)
)
# Add some NAs
df[sample(1:n, 50), 'q1'] <- NA
df[sample(1:n, 75), 'q3'] <- NA
df[sample(1:n, 60), 'q5'] <- NA
# CFA model
model <- 'factor =~ q1 + q2 + q3 + q4 + q5'
# Fit model
library(lavaan)
fit <- cfa(model, data = df)
I’m using lavaan for the CFA, but I’m stuck on how to get the factor scores and their confidence intervals with the missing data. Any tips or code examples would be super helpful! Thanks!
I’ve encountered similar issues with missing data in CFA models. One approach that has worked well for me is full information maximum likelihood (FIML) estimation. The ‘sem’ package supports it, and lavaan (which you’re already using) can do it directly through its missing argument.
FIML should give you factor scores even with missing data, since all available responses contribute to estimation. For confidence intervals, you might consider bootstrapping; the ‘boot’ package in R can be helpful for this purpose.
Remember that the appropriateness of FIML depends on your missing data mechanism. If you’re dealing with MCAR or MAR data, FIML should provide unbiased estimates.
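Here’s a minimal sketch of what I mean, sticking with lavaan since you already use it. missing = "ml" turns on FIML; the bootstrap part uses the boot package, refits the model on each resample, and then re-scores your original rows, which is one way to get percentile intervals for the scores. Treat it as a starting point, not a definitive recipe — in particular it assumes lavPredict() can score rows with missing items when the model was fitted with FIML (check your lavaan version).

library(lavaan)
library(boot)

# Fit with FIML so incomplete rows still contribute to estimation
fit_fiml <- cfa(model, data = df, missing = "ml")

# Factor scores for all rows, including those with NAs
scores <- lavPredict(fit_fiml, type = "lv")

# Bootstrap: refit on resampled rows, then re-score the ORIGINAL data,
# so each replication returns one score per original observation
boot_scores <- function(data, idx) {
  fit_b <- tryCatch(
    cfa(model, data = data[idx, ], missing = "ml"),
    error = function(e) NULL
  )
  if (is.null(fit_b)) return(rep(NA_real_, nrow(df)))  # non-convergence
  as.numeric(lavPredict(fit_b, newdata = df, type = "lv"))
}

set.seed(2024)
bs <- boot(df, statistic = boot_scores, R = 200)  # increase R in practice

# Percentile 95% interval for each person's factor score
ci <- t(apply(bs$t, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE))
head(cbind(score = as.numeric(scores), lower = ci[, 1], upper = ci[, 2]))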
hey hugo, i’ve dealt with this before. fit the model with missing = "ml" so lavaan uses full information maximum likelihood (FIML), then lavPredict() will give you factor scores even for rows with NAs. for confidence intervals, try bootstrapping with lavaan::bootstrapLavaan(). might take a while to run tho. hope this helps!
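rough sketch of what i mean (double-check ?bootstrapLavaan — the FUN here re-scores your original data from each bootstrap refit, which is my own workaround for per-person intervals, not an official lavaan feature, and it assumes lavPredict() handles NAs in newdata after a FIML fit):

library(lavaan)

# fit with FIML so rows with NAs still get scored
fit_ml <- cfa(model, data = df, missing = "ml")
scores <- lavPredict(fit_ml, type = "lv")

# bootstrapLavaan refits the model on resampled data R times;
# FUN turns each refit into factor scores for the original rows
bs <- bootstrapLavaan(
  fit_ml, R = 200,
  FUN = function(x) as.numeric(lavPredict(x, newdata = df, type = "lv"))
)

# percentile 95% intervals, one row per observation
ci <- t(apply(bs, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE))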
Dealing with missing data in CFA can be tricky, but it’s totally doable. Have you considered using multiple imputation? It’s a neat technique that could work well for your situation.
Here’s a thought - what if you tried the mice package? It’s pretty robust for handling missing data. You could impute multiple datasets, run your CFA on each, and then pool the results. Might give you more stable estimates.
Just curious, have you explored why you have missing data? Sometimes the pattern of missingness can be informative. Maybe there’s an interesting story hiding in those NAs!
What do you think about trying something like this:
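(a rough sketch, assuming the mice and lavaan packages; the “pooling” at the end just averages the scores across imputations and uses the between-imputation spread as a stability check, not full Rubin’s-rules intervals)

library(mice)
library(lavaan)

# impute the questionnaire items only (ID and school are identifiers, so they're left out)
imp <- mice(df[, c("q1", "q2", "q3", "q4", "q5")], m = 20, seed = 123, printFlag = FALSE)

# fit the CFA and extract factor scores in each completed dataset
# (assumes the factor keeps the same orientation across imputations,
#  i.e. the default marker-variable scaling on q1)
score_list <- lapply(1:imp$m, function(i) {
  dat_i <- complete(imp, i)
  fit_i <- cfa(model, data = dat_i)
  as.numeric(lavPredict(fit_i, type = "lv"))
})
score_mat <- do.call(cbind, score_list)   # n x m matrix of factor scores

# pool: average across imputations; between-imputation SD as a stability check
pooled <- data.frame(
  ID    = df$ID,
  score = rowMeans(score_mat),
  bw_sd = apply(score_mat, 1, sd)
)
head(pooled)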