Missing values
In SPSS, there are several methods to handle missing data, each suitable for different analytical
goals. Here's a concise explanation of the key methods: Listwise Deletion, Pairwise Deletion,
EM (Expectation-Maximization), and Regression Imputation.
🔹 1. Listwise Deletion (Complete Case Analysis)
Description: SPSS excludes any case (row) that has a missing value in any variable used in the
analysis.
Use in SPSS:
Default in many procedures (e.g., Analyze > Regression > Linear)
Option: “Exclude cases listwise”
Pros:
Simple and easy to implement
Keeps sample structure intact
Cons:
Can lead to substantial data loss
Introduces bias if data is not MCAR (Missing Completely At Random)
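The same behaviour is easy to reproduce outside SPSS. As a minimal sketch in pandas (synthetic data; the variable names echo the handwriting features discussed later and are purely illustrative), `dropna()` on the analysis variables is exactly listwise deletion:

```python
import numpy as np
import pandas as pd

# toy data in the style of the graphology example; names are illustrative
df = pd.DataFrame({
    "spacing":     [1.2, 2.5, np.nan, 4.1],
    "inclination": [-3.0, np.nan, -2.2, -4.8],
    "group":       ["epilepsy", "control", "epilepsy", "control"],
})

# listwise deletion: drop every case with a gap in any analysis variable
# (the pandas analogue of SPSS's "Exclude cases listwise")
used = df[["spacing", "inclination"]].dropna()
```

Here two of four cases survive, which illustrates how quickly listwise deletion can shrink a sample.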
🔹 2. Pairwise Deletion
Description: Uses all available cases for each pair of variables. A case with a missing value
on one variable in a pair is excluded only from the calculations involving that pair, not from
the whole analysis.
Use in SPSS:
In correlation or covariance analysis
Option: “Exclude cases pairwise”
Pros:
Uses more data than listwise
Better for exploratory analysis
Cons:
Results in inconsistent sample sizes across comparisons
May produce non-positive definite correlation matrices
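For contrast with listwise deletion, pandas computes correlations pairwise by default, mirroring SPSS's "Exclude cases pairwise". A small sketch on illustrative data also makes the inconsistent-N drawback visible:

```python
import numpy as np
import pandas as pd

# illustrative data with scattered gaps
df = pd.DataFrame({
    "spacing":     [1.0, 2.0, np.nan, 4.0, 5.0],
    "inclination": [-2.0, np.nan, -1.0, -4.0, -5.0],
    "pressure":    [0.5, np.nan, 0.9, np.nan, 1.1],
})

# pairwise deletion: each coefficient uses whatever complete pairs exist
r = df.corr()  # pandas is pairwise-complete by default

# the N behind each coefficient differs -- the "inconsistent N" drawback
n_pairs = df.notna().astype(int).T @ df.notna().astype(int)
```

Here the spacing–pressure correlation rests on only 2 cases while spacing–inclination rests on 3, so the coefficients in `r` are not directly comparable.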
🔹 3. EM (Expectation-Maximization) Algorithm
Description: An iterative method that estimates means, covariances, and regression parameters
assuming a multivariate normal distribution.
Use in SPSS:
Analyze > Missing Value Analysis > EM algorithm
Pros:
Produces unbiased estimates under MAR (Missing At Random)
Preserves correlations among variables
Good for descriptive statistics and imputation before modeling
Cons:
Does not generate multiple datasets, so uncertainty from imputation is not fully
captured
Not suitable alone for inferential analysis
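Under the hood, the EM option iterates between conditionally filling the gaps (E-step) and re-estimating the mean vector and covariance matrix from the completed data (M-step). A minimal NumPy sketch of that algorithm — not SPSS itself, and assuming MAR missingness and multivariate normality — looks like this:

```python
import numpy as np

def em_mvn(X, n_iter=50):
    """EM estimates of mean/covariance for multivariate-normal data with
    NaN gaps (the model behind SPSS's EM option). Sketch only: assumes
    MAR missingness and rough normality."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)              # available-case starting values
    Sigma = np.diag(np.nanvar(X, axis=0))
    X_imp = np.where(miss, mu, X)
    for _ in range(n_iter):
        S_acc = np.zeros((p, p))            # conditional-covariance correction
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            if not o.any():                 # fully missing case
                X_imp[i] = mu
                S_acc += Sigma
                continue
            Soo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
            Smo = Sigma[np.ix_(m, o)]
            # E-step: conditional mean of the missing block given the observed
            X_imp[i, m] = mu[m] + Smo @ Soo_inv @ (X_imp[i, o] - mu[o])
            # ...plus its conditional covariance
            S_acc[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - Smo @ Soo_inv @ Smo.T
        # M-step: re-estimate the parameters from the completed data
        mu = X_imp.mean(axis=0)
        d = X_imp - mu
        Sigma = (d.T @ d + S_acc) / n
    return mu, Sigma, X_imp

# demo: correlated data with ~30% gaps in the second feature
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 5.0], [[1.0, 0.8], [0.8, 1.0]], size=400)
X[rng.random(400) < 0.3, 1] = np.nan
mu_hat, Sigma_hat, X_completed = em_mvn(X)
```

Note that the M-step adds the conditional covariance (`S_acc`), which is what keeps the estimated correlations from collapsing the way they would under plain mean imputation.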
🔹 4. Regression Imputation
Description: Predicts missing values using regression equations built from other variables in the
dataset.
Use in SPSS:
Analyze > Missing Value Analysis > Regression (estimation method)
Pros:
Easy to apply
Maintains relationships between variables
Cons:
Can underestimate variance (imputed values lie on regression line)
Can overfit if predictors are highly collinear
Not recommended for final statistical inference
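The mechanics — and the variance problem — can be sketched in a few lines of NumPy/pandas. The data below are synthetic, with variable names borrowed from the handwriting example for illustration:

```python
import numpy as np
import pandas as pd

# synthetic handwriting-style features; names and values are illustrative
rng = np.random.default_rng(1)
spacing = rng.normal(10, 2, 200)
inclination = -20 + 0.8 * spacing + rng.normal(0, 1, 200)  # negative values
inclination[rng.random(200) < 0.25] = np.nan               # ~25% missing
df = pd.DataFrame({"spacing": spacing, "inclination": inclination})

# fit OLS on the complete cases: inclination ~ spacing
complete = df.dropna()
A = np.c_[np.ones(len(complete)), complete["spacing"]]
beta, *_ = np.linalg.lstsq(A, complete["inclination"].to_numpy(), rcond=None)

# impute from the fitted line; no residual noise is added, which is
# exactly why the imputed column's variance is understated
miss = df["inclination"].isna()
df.loc[miss, "inclination"] = beta[0] + beta[1] * df.loc[miss, "spacing"]
```

Every imputed point sits exactly on the fitted line, so the completed variable is artificially smooth — the reason this method is flagged as unsuitable for final inference.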
🧠 Summary Table
| Method | Handles Missingness | Variance Preserved | Suitable for Inference | Notes |
|---|---|---|---|---|
| Listwise | No | ✅ Yes | ❌ Biased if not MCAR | Data loss risk |
| Pairwise | Partial | ⚠️ Sometimes | ❌ Inconsistent N | Good for correlations |
| EM Algorithm | Yes (MAR) | ✅ Yes | ⚠️ No (not multiple) | Good for summary stats |
| Regression Impute | Yes (MAR) | ❌ Underestimated | ❌ No | Risk of overfitting |
| MICE (not in base SPSS) | Yes (MAR) | ✅ Yes | ✅ Yes | Best for robust analysis |
Imputation is the process of filling in missing data. The choice of imputation method depends
on:
the type of variable (numeric, categorical),
the mechanism of missingness (MCAR, MAR, MNAR),
and the analytical goals (e.g., preserving variance, predictive modeling, causal
inference).
✅ Single vs. Multiple Imputation
| Feature | Single Imputation | Multiple Imputation |
|---|---|---|
| Definition | Fill in missing values once | Create multiple datasets with different imputations |
| Examples | Mean/Median Imputation, kNN, Regression | MICE, Bayesian Imputation |
| Captures Uncertainty | ❌ No | ✅ Yes |
| Bias Risk | ⚠️ High risk of bias | 🔻 Lower risk with proper modeling |
| Variance Underestimated | ✅ Yes | ❌ No – preserves natural variability |
| Analysis Complexity | ✅ Simple | 🔺 Requires pooling of results across datasets |
| Use in Inferential Models | ❌ Often discouraged | ✅ Recommended for inferential statistics |
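The variance row is easy to demonstrate: a quick sketch with synthetic data shows single (mean) imputation shrinking exactly the spread that multiple imputation is designed to preserve:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 1000)
x_miss = x.copy()
x_miss[rng.random(1000) < 0.3] = np.nan   # ~30% missing completely at random

# single mean imputation: every gap gets the same value
x_mean = np.where(np.isnan(x_miss), np.nanmean(x_miss), x_miss)

sd_obs = np.nanstd(x_miss)   # spread of the observed values
sd_imp = np.std(x_mean)      # spread after mean imputation -- smaller
```

With ~30% missingness the standard deviation shrinks by a factor of roughly √0.7 ≈ 0.84, which is what the "Variance Underestimated" row warns about.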
📘 Best Practices for Your Graphology Dataset
You have:
Quantitative features derived from handwriting (e.g., spacing, inclination).
Some missingness likely due to measurement failure or partial administration.
Inclination variables with only negative values, needing distribution-sensitive
imputation.
Aim to compare groups (e.g., epilepsy vs. control), meaning you must preserve variance
and uncertainty.
🧠 Recommended Approach: Multiple Imputation with Iterative Method (MICE)
Why: MICE (Multiple Imputation by Chained Equations) handles complex multivariate
missingness and accounts for the uncertainty of imputation, which is essential for valid
group comparisons and preserving statistical inference.
Estimator: Bayesian Ridge or Random Forest are good defaults.
Iterations: Usually 10–20 are sufficient; you can increase if convergence is not reached.
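This recommendation can be sketched with scikit-learn's IterativeImputer, one common MICE implementation (the data here are synthetic, with means -10 and 3 standing in for inclination- and spacing-like features):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# synthetic correlated features with ~20% of values missing at random
rng = np.random.default_rng(7)
X = rng.multivariate_normal([-10.0, 3.0], [[1.0, 0.6], [0.6, 1.0]], size=300)
X[rng.random(X.shape) < 0.2] = np.nan

# MICE-style multiple imputation: sample_posterior=True draws each fill-in
# from the (default BayesianRidge) posterior, so the m completed datasets
# differ and carry the imputation uncertainty
m = 10
means = []
for seed in range(m):
    imp = IterativeImputer(sample_posterior=True, max_iter=15,
                           random_state=seed)
    means.append(imp.fit_transform(X).mean(axis=0))

# pool the per-dataset estimates (the averaging half of Rubin's rules)
pooled = np.mean(means, axis=0)
```

For the Random Forest variant, pass `estimator=RandomForestRegressor(...)` instead; note that `sample_posterior=True` then no longer applies, since forests do not expose a predictive posterior, so the between-dataset variation must come from the estimator's own randomness.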