Missing Data: How to address it?
Posted on Tue 29 April 2025 in articles
In the previous article on missing data, we explored common missing data patterns, their definitions, and the drawbacks of Complete-Case (CC) analysis. This time, we'll focus on handling different types of missing data using Multiple Imputation (MI). By the end of this article, you’ll not only understand the theory behind missing data but also learn how to build your own models to address it.
Multiple Imputation Purpose
MI arises from the idea that missing values don't have a single correct answer but rather a range of plausible values. Instead of filling in a single estimate, MI acknowledges this uncertainty by drawing multiple possible values based on the data's distribution.
Consider our previous example of student grades. If we impute a missing grade as 80%, why stop there? A student with similar performance might have received a 77%, 81%, or even 84%. These alternatives are not just arbitrary; they reflect the natural variability in that student's performance. The key question is: how likely are these other values compared to the one we chose? MI addresses this by generating multiple (\(D\)) datasets with different imputed values, each sampled from a realistic distribution. By analyzing all of them together, we get a robust and unbiased estimate, reducing the risk of underestimating uncertainty in our final conclusions.
Multiple Imputation Definition
Let's continue with the student example from the previous article. To estimate the average student grade, denoted as \(\theta\), we must account for both observed (\(Y_{\textrm{obs}}\)) and missing data (\(Y_{\textrm{miss}}\)). The full posterior distribution of \(\theta\), given both components, is expressed as:

\[ p(\theta \mid Y_{\textrm{obs}}, \; Y_{\textrm{miss}}) \]
Since we don’t actually observe \(Y_{\textrm{miss}}\), this posterior is hypothetical: it represents the distribution we would have obtained if the missing data were known. Using Bayes’ theorem, we express it in proportional form:

\[ p(\theta \mid Y_{\textrm{obs}}, \; Y_{\textrm{miss}}) \propto L(\theta \mid Y_{\textrm{obs}}, \; Y_{\textrm{miss}}) \, p(\theta) \]
where:
- \(L(\theta \mid Y_{\textrm{obs}}, \; Y_{\textrm{miss}})\) is the likelihood function with observed and missing components
- \(p(\theta)\) is our prior belief about \(\theta\)
In practice, since \(Y_{\textrm{miss}}\) is unknown, we need to marginalize over its distribution, integrating out the uncertainty:

\[ p(\theta \mid Y_{\textrm{obs}}) = \int p(\theta, \; Y_{\textrm{miss}} \mid Y_{\textrm{obs}}) \, dY_{\textrm{miss}} \]
The problem is how to integrate it out in a reasonable manner. One way to do this is to assume that the observed data (\(Y_{\textrm{obs}}\)) can be used to reasonably impute the missing data units (\(Y_{\textrm{miss}}\)). Factoring the joint distribution accordingly turns the previous integral into:

\[ p(\theta \mid Y_{\textrm{obs}}) = \int p(\theta \mid Y_{\textrm{obs}}, \; Y_{\textrm{miss}}) \, p(Y_{\textrm{miss}} \mid Y_{\textrm{obs}}) \, dY_{\textrm{miss}} \]
What we will create is a model for \( \boldsymbol{p(Y_{\textrm{miss}} \mid Y_{\textrm{obs}})} \) in this integral, allowing us to draw multiple plausible values from this conditional distribution. MI approximates the integral by generating multiple datasets (\(D\)) with different imputed values sampled from this model, and then combining the estimate \(\hat{\theta}_d\) from each imputed dataset \(d\) into a pooled average (\(\bar{\theta}\)) that serves as an unbiased estimate of the average grade of our students:

\[ \bar{\theta} = \frac{1}{D} \sum_{d=1}^{D} \hat{\theta}_d \]
The total variance combines the within-imputation variance (the average sampling variance \(W_d\) of the estimate within each dataset \(d\)) and the "between-imputation" variance, which refers to the variability of the estimates across the \(D\) datasets:

\[ T = \underbrace{\frac{1}{D} \sum_{d=1}^{D} W_d}_{\textrm{within}} \; + \; \left(1 + \frac{1}{D}\right) \underbrace{\frac{1}{D-1} \sum_{d=1}^{D} \left(\hat{\theta}_d - \bar{\theta}\right)^2}_{\textrm{between}} \]
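As a concrete sketch of this pooling step, the snippet below applies Rubin's rules to purely hypothetical per-dataset estimates (every number, including the seed, is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
D = 100
theta_hat = rng.normal(0.66, 0.01, size=D)  # point estimate from each imputed dataset
W = rng.uniform(1e-5, 2e-5, size=D)         # within-imputation variance of each estimate

theta_bar = theta_hat.mean()                # pooled point estimate
W_bar = W.mean()                            # average within-imputation variance
B = theta_hat.var(ddof=1)                   # between-imputation variance
T = W_bar + (1 + 1 / D) * B                 # total variance via Rubin's rules
se = np.sqrt(T)
ci = (theta_bar - 1.96 * se, theta_bar + 1.96 * se)
```

Note that \(T\) is always larger than the within-imputation variance alone, which is exactly how MI avoids understating uncertainty.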
Modeling MAR
Given that under MAR both the outcome (\(y\)) and the missingness mechanism (\(m\)) depend only on the observed features (\(x\)), our imputation model will be structured accordingly. Specifically, the imputation of \(y\) will be modeled conditional on \(x\), reflecting the relationship \(P(y \mid x)\). To implement this approach, we will use Stan to build a Bayesian MI model, defined as follows:
```stan
data {
  int<lower=0> N_obs;                                // Number of observed Y values
  int<lower=0> N_mis;                                // Number of missing Y values
  int<lower=1> N_total;                              // Total number of data points
  int<lower=1> K;                                    // Number of predictors
  matrix[N_total, K] X;                              // Design matrix
  vector[N_obs] Y_obs;                               // Observed values of Y
  array[N_obs] int<lower=1, upper=N_total> obs_idx;  // Indices of observed Y
  array[N_mis] int<lower=1, upper=N_total> mis_idx;  // Indices of missing Y
}
parameters {
  vector[K] beta;        // β coefficients (one per predictor)
  real<lower=1e-3> phi;  // Dispersion for Beta distribution
}
transformed parameters {
  vector<lower=0, upper=1>[N_obs] mu_obs;
  mu_obs = inv_logit(X[obs_idx] * beta);
}
model {
  beta ~ normal(0, 2);
  phi ~ gamma(2, 0.1);
  for (i in 1:N_obs)
    Y_obs[i] ~ beta(mu_obs[i] * phi, (1 - mu_obs[i]) * phi);
}
generated quantities {
  vector<lower=0, upper=1>[N_mis] Y_mis;
  vector<lower=0, upper=1>[N_mis] mu_mis;
  mu_mis = inv_logit(X[mis_idx] * beta);
  for (i in 1:N_mis)
    Y_mis[i] = beta_rng(mu_mis[i] * phi, (1 - mu_mis[i]) * phi);
}
```
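Once the model has been fit, each posterior draw of \(\beta\) and \(\phi\) yields one imputed dataset through the generated quantities block. The sketch below mimics that step in Python using simulated stand-ins for the posterior draws; the shapes, seed, and parameter values are all assumptions for illustration, not output from the actual model:

```python
import numpy as np

rng = np.random.default_rng(11)

def inv_logit(z):
    return 1 / (1 + np.exp(-z))

D, N_mis, K = 100, 390, 2  # imputed datasets, missing rows, predictors

# Hypothetical design-matrix rows for the students with missing grades
X_mis = np.column_stack([np.ones(N_mis), rng.uniform(0, 10, size=N_mis)])

# Stand-ins for posterior draws of beta and phi (one row/value per draw)
beta_draws = rng.normal([0.0, 0.1], 0.02, size=(D, K))
phi_draws = rng.gamma(20.0, 1.0, size=D)

# One imputed dataset per posterior draw, mirroring generated quantities
imputed = np.empty((D, N_mis))
for d in range(D):
    mu = inv_logit(X_mis @ beta_draws[d])
    imputed[d] = rng.beta(mu * phi_draws[d], (1 - mu) * phi_draws[d])
```

Each row of `imputed` is one complete fill-in of the missing grades, and analyzing the \(D\) rows separately before pooling is what the MI estimate rests on.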
To illustrate our point, we simulate a missing data rate of close to 49% (390 of 800 students). Despite the loss, we still retain 410 students, which is generally sufficient to estimate the mean. The real challenge, however, lies in the MAR mechanism, which systematically shifts the observed mean.
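For intuition, here is a minimal sketch of how a MAR mechanism of this kind could be simulated; the feature (study hours), the coefficients, and the seed are hypothetical choices, not the exact setup used in this article:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 800

# Hypothetical observed feature: grades increase with study hours
hours = rng.uniform(0, 10, size=N)
grades = np.clip(0.4 + 0.03 * hours + rng.normal(0, 0.05, size=N), 0.01, 0.99)

# MAR: the probability of missingness depends only on the observed feature,
# here making high-effort (and thus high-grade) students more likely to be missing
p_miss = 1 / (1 + np.exp(-(-0.1 + 0.3 * (hours - 5))))
missing = rng.random(N) < p_miss

y_obs = grades[~missing]  # what a complete-case analysis would keep
```

Because the students most likely to be missing also tend to have higher grades, the complete-case mean lands below the true mean even though the mechanism never looks at the grades themselves.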
After fitting the model and generating one hundred imputed datasets (\(D = 100\)), we estimate the average grade and its 95% confidence interval (CI). The table below compares the estimates from MI and CC analysis under the simulated MAR mechanism, along with the true mean from the original (complete) dataset.
Method | Average | Lower CI | Higher CI |
---|---|---|---|
Original Data | 66.6% | 65.7% | 67.4% |
MI | 66.1% | 65.1% | 67.1% |
MAR CC | 64.8% | 63.6% | 65.9% |
Compared to the original dataset, the mean estimated via MI is notably closer to the true mean than the mean obtained using CC analysis. Notice that the CI of the mean from the CC analysis does not include the true mean of the original dataset.
To formally assess statistical significance of these differences, we perform hypothesis tests to compare the estimated means against the true mean. Thus, our hypotheses are:
- \(H_0\): The estimated mean is equal to the true mean
- \(H_1\): The estimated mean is different from the true mean
The CC analysis yields a \(p\)-value of 0.003, indicating a statistically significant difference from the original mean. In contrast, the MI estimate does not differ significantly from the true mean, with a \(p\)-value of 0.295.
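Such a one-sample test can be run as in the sketch below; the simulated sample (its mean, spread, and size) is purely illustrative and not the article's actual data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean = 0.666  # mean of the original, complete dataset

# Hypothetical complete-case sample whose mean sits below the truth
cc_sample = rng.normal(0.64, 0.06, size=410)

# H0: the sample mean equals the true mean
t_stat, p_value = stats.ttest_1samp(cc_sample, popmean=true_mean)
```

A small \(p\)-value here leads us to reject \(H_0\) and conclude the estimate is systematically biased away from the true mean.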
The figure below displays the t-distribution corresponding to both tests, along with the locations of their t-statistics and the significance threshold. This comparison highlights the importance of addressing missing data and shows how MI reliably recovers the original mean.
Modeling MNAR
Given that under MNAR the missingness mechanism (\(m\)) depends directly on the unobserved outcome (\(y\)), we must jointly model both processes. Specifically, we model \(P(y \mid x)\) for the outcome and \(P(m \mid y)\) for the missingness mechanism. The modeling of the missingness mechanism is achieved via a Bernoulli component that predicts the probability that each value is missing. This Bernoulli component ties the two models together by using the imputed values of \(y\) as inputs when modeling the probability of missingness, thereby enforcing the MNAR dependency structure during estimation.
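The distinction can be made concrete with a small simulation in which the chance of being missing depends on the grade itself; all values below (distribution, coefficients, seed) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 800
grades = np.clip(rng.normal(0.66, 0.12, size=N), 0.01, 0.99)

# MNAR: missingness depends directly on the (possibly unobserved) grade,
# so higher grades are more likely to go missing
p_miss = 1 / (1 + np.exp(-(-3.0 + 4.0 * grades)))
missing = rng.random(N) < p_miss

# Complete-case analysis keeps only the observed grades, biasing the mean down
cc_mean = grades[~missing].mean()
```

Since no observed feature fully explains this missingness, only a joint model of the outcome and the missingness indicator, like the Stan model below, can correct the bias.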
```stan
data {
  int<lower=0> N_obs;                      // Number of observed Y values
  int<lower=0> N_mis;                      // Number of missing Y values
  int<lower=1> N_total;                    // Total number of data points
  int<lower=1> K;                          // Number of predictors
  matrix[N_total, K] X;                    // Design matrix
  vector[N_obs] Y_obs;                     // Observed Y values
  array[N_total] int<lower=0, upper=1> M;  // Missingness indicator
}
parameters {
  vector<lower=0, upper=1>[N_mis] Y_mis;  // Imputed Y values
  vector[K] beta;                         // β coefficients
  real<lower=1e-3> phi;                   // Dispersion for Beta distribution
  real alpha0;                            // Intercept for missingness model
  real alpha1;                            // Y-dependent missingness
}
model {
  beta ~ normal(0, 2);
  phi ~ gamma(2, 0.1);
  alpha0 ~ normal(0, 2);
  alpha1 ~ normal(0, 2);

  vector[N_total] Y;
  vector[N_total] mu;

  // Collect observed and imputed Y values
  int pos_obs = 1;
  int pos_mis = 1;
  for (i in 1:N_total) {
    if (M[i] == 0) {
      Y[i] = Y_obs[pos_obs];
      pos_obs += 1;
    } else {
      Y[i] = Y_mis[pos_mis];
      pos_mis += 1;
    }
  }
  mu = inv_logit(X * beta);

  // Imputation model, which depends on X
  for (i in 1:N_total) {
    Y[i] ~ beta(mu[i] * phi, (1 - mu[i]) * phi);
  }

  // Missingness model, which depends on Y.
  // It connects the imputed values to the probability of missingness.
  for (i in 1:N_total) {
    M[i] ~ bernoulli_logit(alpha0 + alpha1 * Y[i]);
  }
}
generated quantities {
  vector[N_mis] Y_mis_draw;
  Y_mis_draw = Y_mis;
}
```
Just as for MAR, the table below compares the estimates using MI, CC analysis, and the original data.
Method | Average | Lower CI | Higher CI |
---|---|---|---|
Original Data | 66.6% | 65.7% | 67.4% |
MI | 66.0% | 65.0% | 67.0% |
MNAR CC | 63.4% | 62.3% | 64.5% |
Compared to the original dataset, the mean estimated via MI is notably closer to the true mean than the mean obtained using CC analysis under the MNAR mechanism. Notice that the CI of the mean from the CC analysis does not include the true mean of the original dataset. In this case, the evidence against the CC estimate is even stronger: the hypothesis test yields a \(p\)-value below 0.001 for the CC analysis, indicating a statistically significant difference, while the MI estimate shows no significant difference, with a \(p\)-value of 0.275.
Conclusion
In this article, we moved beyond theory and showed how to build practical models to address missing data under both MAR and MNAR mechanisms. By leveraging MI, we demonstrated that properly modeling missingness — rather than ignoring it — leads to estimates that remain faithful to the true underlying values.
Through examples and complete model code, you have now seen how to build MI models and understand the risks of relying solely on CC analysis. Handling missing data is not just about filling gaps; it’s about safeguarding the integrity and credibility of your insights.
Mastering these techniques opens the door to more robust analyses, more confident decisions, and ultimately, more impactful work.
As always, you may refer to all the code used to build this article on my Github.