Missing Data: Why does it matter?
Posted on Mon 24 February 2025 in articles
Missing data is a challenge every data science and AI/ML team encounters. It’s an inevitable part of working with real-world data—yet it often feels like a barrier to getting started on the 'real work'.
Under pressure from tight deadlines or stakeholder expectations, teams may resort to quick fixes, such as dropping missing values. But these shortcuts almost always backfire. Poorly handled missing data can distort predictions and lead to flawed inferences, yet its impact is often hidden—making it difficult to see just how much harm these rushed decisions cause.
When dealing with missing data, there are two broad approaches you can take:
- Complete-Case (CC) Analysis: remove every instance (row) that contains a missing value
- Imputation: fill in the missing values with estimates
This article will demystify missing data theory, helping you understand its effects and the risks of CC analysis. By the end, you’ll be better equipped to make informed choices that preserve the integrity of your analyses.
Why is the Data Missing?
The first step in tackling missing data is understanding why it’s missing. Without this knowledge, any method applied to address it is more guesswork than science. Your missing data can fall into one of three mechanisms:
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)
To explain these mechanisms, consider a scenario involving student exam grades (\(y\)) and class attendance (\(x\)). Additionally, we will have a binary variable (\(m\)) that indicates missingness, where \(m=1\) means the grade is missing and \(m=0\) means it is observed. The image below shows the simulated dataset that was created for this article.
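The exact simulation code isn't reproduced here, but a minimal sketch of how such a dataset could be generated looks like the following; the coefficients, noise level, and column names are assumptions for illustration, not the values used to produce the figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 400  # matches the sample size used in the variance example below

# Attendance (x): share of classes attended, roughly between 30% and 100%
attendance = rng.uniform(30, 100, size=n)

# Grade (y): assumed to rise with attendance, plus some random noise
grade = np.clip(30 + 0.6 * attendance + rng.normal(0, 8, size=n), 0, 100)

df = pd.DataFrame({"attendance": attendance, "grade": grade})
print(df.head())
```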
MCAR
MCAR occurs when data is missing purely at random, with no relationship to observed (\(x\)) or unobserved (\(y\)) variables. The probability of missingness is constant across all values of \(x\) and \(y\): \(\textrm{P}(m|y,x) = \textrm{P}(m)\). For example, a student missing an exam due to an injury is an instance of MCAR—it happens independently of their academic performance or other characteristics.
To illustrate how MCAR affects data, I simulated a scenario where each student's grade has a 30% probability of being missing (i.e., \(\textrm{P}(m=1)=0.30\)). Below, you can see a comparison between the original grade distribution and the distribution after applying the MCAR mechanism and conducting a CC analysis:
The grade distribution remains centered around the same region but has wider tails, indicating increased uncertainty. This uncertainty arises because fewer data points are available.
For example, suppose the original dataset contains 400 observations, but after applying CC analysis, only 200 remain. The variance of the sample mean is \(\textrm{Var}(\bar{y}) = \frac{\sigma^2}{n}\), where \(\sigma^2\) is the variance of the grades and \(n\) is the sample size. If \(\sigma^2\) stays roughly the same while \(n\) decreases, the variance of the estimate increases.
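Plugging the example numbers into this formula makes the loss of precision explicit: halving the sample size doubles the variance of the estimated mean grade.

\[
\textrm{Var}(\bar{y})_{\textrm{full}} = \frac{\sigma^2}{400}
\quad\longrightarrow\quad
\textrm{Var}(\bar{y})_{\textrm{CC}} = \frac{\sigma^2}{200} = 2 \times \frac{\sigma^2}{400}
\]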
Thus, when using CC analysis under an MCAR mechanism, you can expect more uncertain estimates due to the reduced sample size. However, they should remain unbiased.
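Continuing from the simulation sketch above, the MCAR masking and CC step can be written in a few lines (again a sketch, not the exact code behind the figures):

```python
import numpy as np

rng = np.random.default_rng(0)

# MCAR: every grade has the same 30% chance of being masked,
# independent of both attendance and the grade itself
df["m"] = rng.binomial(1, 0.30, size=len(df))           # m = 1 -> grade is missing
df["grade_observed"] = df["grade"].where(df["m"] == 0)

# Complete-case analysis: keep only the rows with an observed grade
cc = df.dropna(subset=["grade_observed"])

print(df["grade"].mean(), cc["grade_observed"].mean())  # means stay close (unbiased)
print(df["grade"].sem(), cc["grade_observed"].sem())    # standard error grows as n shrinks
```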
MAR
MAR occurs when missingness depends on observed variables (\(x\)) but not directly on the unobserved values (\(y\)). For example, students with low attendance (\(x\)) may be more likely to skip exams. This relationship is expressed as: \(\textrm{P}(m | y,x) = \textrm{P}(m | x)\)
When data is MAR, CC analysis is unsuitable. Since missingness (\(m\)) depends on an observed variable (\(x\)), excluding missing cases can introduce bias, leading to invalid conclusions.
To illustrate this, I simulated an MAR mechanism where students with lower attendance had a higher probability of missing grades. The figure below compares the original grade distribution with the distribution after applying the MAR mechanism and performing CC analysis:
The resulting distribution shifts significantly, becoming more left-skewed and moving further away from the original data. This demonstrates how CC analysis under MAR can distort results, leading to unreliable conclusions.
Since missingness is related to observed values (\(x\)), imputation methods are required to properly handle MAR data. These methods leverage the relationship between \(x\) and \(y\) to estimate missing values. A detailed discussion on imputation techniques will be covered in a future post.
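A rough sketch of an attendance-dependent (MAR) mask, again continuing from the simulated dataset above, could look like this; the logistic form and its coefficients are assumptions chosen only to make low attendance imply a high chance of a missing grade.

```python
import numpy as np

rng = np.random.default_rng(1)

# MAR: the probability of a missing grade depends only on attendance (observed),
# not on the grade itself
logit = 3.0 - 0.06 * df["attendance"]        # ~77% missing at 30% attendance, ~5% at 100%
p_missing = 1 / (1 + np.exp(-logit))
df["m"] = rng.binomial(1, p_missing)         # m = 1 -> grade is missing
df["grade_observed"] = df["grade"].where(df["m"] == 0)

# CC analysis now drops mostly low-attendance (and typically low-grade) students,
# so the observed grade distribution drifts away from the original one
cc = df.dropna(subset=["grade_observed"])
print(df["grade"].mean(), cc["grade_observed"].mean())
```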
MNAR
MNAR is the most challenging type of missingness, occurring when missingness depends directly on the missing values themselves (\(y\)). For example, students who anticipate poor grades because they didn't take time to prepare for the exam may choose to skip it, regardless of their attendance record. This relationship is expressed as: \(\textrm{P}(m | y,x) = \textrm{P}(m | y)\)
Unlike MCAR and MAR, handling MNAR data is particularly difficult because the missing values themselves influence the probability of missingness. This means standard imputation methods may not work effectively without making strong assumptions about the missing data mechanism.
While imputation is possible, it requires additional assumptions about the relationship between \(x\) and \(y\) and careful modeling of the missingness pattern.
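For completeness, an MNAR mask can be sketched in the same way, except the probability of missingness now depends on the (unobservable) grade itself; as before, the functional form is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# MNAR: the probability of a missing grade depends on the grade itself,
# e.g. students expecting a poor result are more likely to skip the exam
logit = 4.0 - 0.08 * df["grade"]             # low grades -> high chance of missingness
p_missing = 1 / (1 + np.exp(-logit))
df["m"] = rng.binomial(1, p_missing)         # m = 1 -> grade is missing
df["grade_observed"] = df["grade"].where(df["m"] == 0)

# The masked grades are systematically lower than the observed ones,
# which attendance alone cannot fully explain or recover
print(df.loc[df["m"] == 1, "grade"].mean(), df.loc[df["m"] == 0, "grade"].mean())
```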
Determining the Missingness Mechanisms
Now that we've explored the different types of missing data mechanisms, let's focus on how to determine which one might be affecting your dataset. While no method is foolproof, the following four approaches can be used:
- Visualization: Plot the probability of missingness against a variable's range, such as attendance. The plot below illustrates simulated data where each value of \(y\) has a 25% chance of being missing, with Attendance (\(x\)) on the x-axis. Since the probability of missingness remains flat across attendance levels, this suggests MCAR. Minor fluctuations at extreme values (e.g., 30% attendance) occur because such observations are rare in the dataset. In contrast, if the probability of missingness increases or decreases systematically with \(x\) or \(y\), the data is likely MAR or MNAR. Keep in mind that you need a large number of data points to build reliable plots or to create buckets for ranges of \(x\).
- Statistical Tests: To complement visualizations, you can perform statistical tests such as Little’s MCAR test, which uses a \(\chi^2\) distribution to assess whether the mechanism is MCAR. Additionally, the Wald-Wolfowitz runs test can evaluate whether elements of a sequence are mutually independent, helping to detect patterns (e.g., MAR) in the missingness mechanism.
- Modeling: Fit a predictive model, such as logistic regression, with the missingness indicator (\(m\)) as the outcome. If no covariate other than the intercept (\(\beta_0\)) has a statistically significant effect, the missingness pattern is likely MCAR. If the coefficient on an observed covariate such as attendance (\(\beta_1\)) is significant, the mechanism points to MAR; if the coefficient on the grade itself (\(\beta_2\)) is significant, it points to MNAR (testing this requires \(y\) to be imputed first, since the missing grades are unobserved). If both \(\beta_1\) and \(\beta_2\) are statistically significant, it's possible you have more than a single missing data mechanism in your data. This approach assumes a well-fitted model that satisfies the required assumptions; a minimal sketch of this check appears after this list.
- Subject Matter Expertise: Involving individuals with in-depth knowledge of the data, processes, or context is invaluable when investigating the reasons behind missing data. Subject matter experts can offer insights or direct you toward likely causes of missingness. For instance, in the simulated student example, consulting teachers or the grade assignment committee—who may have encountered similar cases over the years—can help clarify why certain grades are missing.
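To illustrate the modelling approach mentioned above, here is a minimal sketch using statsmodels to regress the missingness indicator from one of the masked datasets above on attendance; the 5% significance level and the single-covariate model are assumptions, and the MNAR check on \(\beta_2\) is omitted because it would require the grades to be imputed first.

```python
import statsmodels.formula.api as smf

# Model the missingness indicator (m = 1 -> grade missing) as a function of attendance.
# Under MCAR only the intercept should matter; a significant attendance coefficient
# is evidence for MAR.
model = smf.logit("m ~ attendance", data=df).fit()
print(model.summary())

if model.pvalues["attendance"] < 0.05:
    print("Missingness depends on attendance -> evidence against MCAR (possibly MAR)")
else:
    print("No evidence that missingness depends on attendance")
```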
Conclusion
Understanding and properly handling missing data is crucial for producing reliable analyses and predictions. Dropping rows through CC analysis can lead to increased uncertainty and biased results if the missingness mechanism is not MCAR.
In a future article we will explore multiple imputation—a personal favourite of mine—to effectively handle complex cases like MAR and MNAR.
All code used in this article is available on my GitHub. Little and Rubin’s work on missing data theory, particularly their book, has been instrumental in shaping this field and is highly recommended. The book I would suggest reading can be found here.