Most practical methods of dealing with missing data are based on the theoretical work of Rubin [28] and Little and Rubin [29], who formulated conditions under which the damage of missingness would be minimized. However, the guarantees provided by this theory are rather weak, and its taxonomy of missing data problems is rather coarse.

Specifically, Rubin’s theory divides problems into three categories: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Performance guarantees and some testability results are available for MCAR and MAR, while the vast space of MNAR problems has remained relatively unexplored.

Viewing missingness from a causal perspective evokes the following questions:

Q1. What must the world be like for a given statistical procedure to produce satisfactory results?

Q2. Can we tell from the postulated world whether any method exists that produces consistent estimates of the parameters of interest?

Q3. Can we tell from data whether the postulated world should be rejected?

To answer these questions, the user must articulate features of the problem in some formal language and capture both the inter-relationships among the variables of interest and the missingness process. In particular, the model should identify the variables that cause values of other variables to be missing.

The graph in Figure 4(a) depicts a typical missingness process, where missingness in *Z* is explained by *X* and *Y*, which are fully observed. Taking such a graph, *G*, as a representation of reality, we define two properties relative to a partially observed dataset *D*.

**Definition 4** *(Recoverability)*

*A probabilistic relationship Q is said to be recoverable in G if there exists a consistent estimate* $\hat{Q}$ *of Q for any dataset D generated by G. In other words, in the limit of large samples, the estimator should produce an estimate of Q as if no data were missing*.

**Definition 5** *(Testability)*

*A missingness model G is said to be* testable *if any of its implications is refutable by data with the same sets of fully and partially observed variables*.

Figure 4 (a) Graph describing a MAR missingness process. *X* and *Y* are fully observed variables, *Z* is partially observed and ${Z}^{\ast}$ is a proxy for *Z*. ${R}_{z}$ is a binary variable that acts as a switch: ${Z}^{\ast}=Z$ when ${R}_{z}=0$ and ${Z}^{\ast}=m$ when ${R}_{z}=1$. (b) Graph representing a MNAR process. (The proxies ${Z}^{\ast},{X}^{\ast}$, and ${Y}^{\ast}$ are not shown.)

Figure 5 Recoverability of the joint distribution in MCAR, MAR, and MNAR. Joint distributions are recoverable in areas marked (*S*) and (*M*) and proven to be non-recoverable in area (*N*).

While some recoverability and testability results are known for MCAR and MAR [30, 31], the theory of structural models permits us to extend these results to the entire class of MNAR problems, namely, the class of problems in which at least one missingness mechanism (${R}_{z}$) is triggered by variables that are themselves victims of missingness (e.g. *X* and *Y* in Figure 4(b)). The results of this analysis are summarized in Figure 5, which partitions the class of MNAR problems into three major regions with respect to recoverability of the joint distribution.

1. *M* (Markovian${}^{+}$) – Graphs with no latent variables, no variable that is a parent of its missingness mechanism and no missingness mechanism that is an ancestor of another missingness mechanism.

2. *S* (Sequential-MAR) – Graphs for which there exists an ordering ${X}_{1},{X}_{2},\dots ,{X}_{n}$ of the variables such that for every *i* we have ${X}_{i}\perp\!\!\!\perp ({R}_{{x}_{i}},{R}_{{y}_{i}})\mid {Y}_{i}$, where ${Y}_{i}\subseteq \{{X}_{i+1},\dots ,{X}_{n}\}$. Such sequences yield the estimand $P(X)={\prod}_{i}P({X}_{i}|{Y}_{i},{R}_{{x}_{i}}=0,{R}_{{y}_{i}}=0)$, in which every term in the product is estimable from the data.

3. *N* (Proven to be Non-recoverable) – Graphs in which there exists a pair $(X,{R}_{x})$ such that *X* and ${R}_{x}$ are connected by an edge or by a path on which every intermediate node is a collider.

The area labeled “*O*” consists of all *other* problem structures, and we conjecture this class to be empty. All problems in areas $(M)$ and $(S)$ are recoverable.
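The three conditions defining the Markovian${}^{+}$ class are purely structural, so membership can be checked mechanically. The Python sketch below is illustrative only: the graph encoding (a dictionary mapping each node to its set of parents), the names `ancestors` and `is_markovian_plus`, and the example graph are all assumptions introduced here. In particular, the example graph is hypothetical and is not a transcription of Figure 4(b).

```python
def ancestors(parents, node):
    """Return all ancestors of `node` in a DAG encoded as {child: set_of_parents}."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def is_markovian_plus(parents, mechanism_of, latent=()):
    """Check the three structural conditions of the Markovian+ class.

    parents      -- {node: set of parent nodes} (hypothetical encoding)
    mechanism_of -- {partially observed variable: its missingness mechanism}
    latent       -- names of latent variables, if any
    """
    # Condition 1: no latent variables.
    if latent:
        return False
    # Condition 2: no variable is a parent of its own missingness mechanism.
    for var, mech in mechanism_of.items():
        if var in parents.get(mech, ()):
            return False
    # Condition 3: no missingness mechanism is an ancestor of another one.
    mechs = set(mechanism_of.values())
    for mech in mechs:
        if ancestors(parents, mech) & (mechs - {mech}):
            return False
    return True


# Hypothetical MNAR graph: every variable is partially observed, and Rz is
# triggered by X and Y, which are themselves subject to missingness.
example = {
    "Y": {"X"},
    "Z": {"X", "Y"},
    "Rz": {"X", "Y"},
    "Ry": {"X"},
    "Rx": {"Z"},
}
mech = {"X": "Rx", "Y": "Ry", "Z": "Rz"}

# Two variants that each violate one condition.
self_masking = dict(example, Rz={"X", "Y", "Z"})   # Z -> Rz
chained_mechs = dict(example, Ry={"X", "Rz"})      # Rz -> Ry
```

Under these assumptions, `is_markovian_plus(example, mech)` holds, while the two variants fail because of the self-masking edge and the edge between mechanisms, respectively. The check says nothing about the Sequential-MAR or non-recoverable regions, which require *d*-separation tests.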

To illustrate, Figure 4(a) is MAR, because *Z* is *d*-separated from ${R}_{z}$ by *X* and *Y*, which are fully observed. Consequently, $P(X,Y,Z)$ can be written

$$P(X,Y,Z)=P(Z|Y,X)P(X,Y)=P(Z|Y,X,{R}_{z}=0)P(X,Y)$$

and the r.h.s. is estimable. Figure 4(b), however, is not MAR because all variables that *d*-separate *Z* from ${R}_{z}$ are themselves partially observed. It nevertheless allows for the recovery of $P(X,Y,Z)$ because it complies with the conditions of the Markovian${}^{+}$ class, though not with the Sequential-MAR class, since no admissible ordering exists. However, if *X* were fully observed, the following decomposition of $P(X,Y,Z)$ would yield an admissible ordering:

$$\begin{array}{rl}P(X,Y,Z)& =P(Z|X,Y)P(Y|X)P(X)\\ & =P(Z|X,Y,{R}_{z}=0,{R}_{y}=0)P(Y|X,{R}_{y}=0)P(X)\end{array}\qquad (18)$$

in which every term is estimable from the data. The licenses to insert the *R* terms into the expression are provided by the corresponding *d*-separation conditions in the graph. The same licenses would prevail had ${R}_{z}$ and ${R}_{y}$ been connected by a common latent parent, which would have disqualified the model from being Markovian${}^{+}$ but retained its membership in the Sequential-MAR category.
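The MAR decomposition for Figure 4(a) can be checked numerically. The Python sketch below simulates a process of that shape (binary *X*, *Y*, *Z*, with missingness in *Z* driven by *X* and *Y*) and compares the graph-based estimate of $P(Z=1)$, obtained by summing $P(Z=1|x,y,{R}_{z}=0)P(x,y)$ over $x,y$, with the naive complete-case proportion. All probabilities, function names, and the use of `None` for the missing marker *m* are assumptions chosen for illustration; they do not come from the paper.

```python
import random
from collections import Counter


def simulate(n, seed=0):
    """Draw n rows (x, y, z_star) from a hypothetical MAR process.

    z_star is the proxy for Z: it equals z when r_z == 0 and None when r_z == 1.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        y = int(rng.random() < 0.5)
        p = 0.1 + 0.4 * x + 0.4 * y        # assumed P(Z=1 | x, y)
        z = int(rng.random() < p)
        r_z = int(rng.random() < p)        # assumed P(R_z=1 | x, y); 1 = missing
        rows.append((x, y, z if r_z == 0 else None))
    return rows


def recovered_p_z1(rows):
    """Estimate P(Z=1) as sum over (x, y) of P(Z=1 | x, y, R_z=0) * P(x, y)."""
    n = len(rows)
    n_xy = Counter((x, y) for x, y, _ in rows)
    n_obs = Counter((x, y) for x, y, zs in rows if zs is not None)
    n_z1 = Counter((x, y) for x, y, zs in rows if zs == 1)
    # Assumes every (x, y) cell has at least one observed Z (positivity).
    return sum(n_z1[xy] / n_obs[xy] * n_xy[xy] / n for xy in n_obs)


def complete_case_p_z1(rows):
    """Naive estimate: proportion of Z=1 among rows where Z is observed."""
    observed = [zs for _, _, zs in rows if zs is not None]
    return sum(observed) / len(observed)


if __name__ == "__main__":
    rows = simulate(100_000)
    print("graph-based estimate:", recovered_p_z1(rows))
    print("complete-case estimate:", complete_case_p_z1(rows))
```

With the parameters assumed here, the true value of $P(Z=1)$ is 0.5, whereas the complete-case proportion converges to 0.34, because rows with high *X* and *Y* are both more likely to have $Z=1$ and more likely to be missing. The graph-based estimate corrects for this by reweighting within each $(x,y)$ cell.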

Note that the order of estimation is critical in the two MNAR examples considered and depends on the structure of the graph; no model-blind estimator exists for the MNAR class [32, 33].

The partition of the MNAR territory into recoverable vs non-recoverable models is also query-dependent. For example, some problems permit unbiased estimation of queries such as $P(Y|X)$ and $P(Y)$ but not of $P(X,Y)$. Note further that MCAR and MAR are nested subsets of the “Sequential-MAR” class; all three permit the recovery of the joint distribution. A version of Sequential-MAR is discussed in Gill and Robins [34] and Zhou et al. [35], but finding a recovering sequence in any given model is a task that requires graphical tools.

Graphical models also permit the partitioning of the MNAR territory into testable vs nontestable models [36]. A testable model entails at least one conditional independence claim that can be tested under missingness. Here we note that some testable implications of fully recoverable distributions are not testable under missingness. For example, $P(X,Y,Z,{R}_{z})$ is recoverable in Figure 4(a) since the graph is in $(M)$ (it is also in MAR) and this distribution advertises the conditional independence $Z\perp\!\!\!\perp {R}_{z}\mid X,Y$. Yet, $Z\perp\!\!\!\perp {R}_{z}\mid X,Y$ is not testable by any data in which the probability of observing *Z* is non-zero (for all $x,y$) [33, 37]. Any such data can be construed as if generated by the model in Figure 4(a), where the independence holds. In Figure 4(b), on the other hand, the independence $(Z\perp\!\!\!\perp {R}_{x}\mid Y,{R}_{y},{R}_{z})$ is testable, and so is $({R}_{z}\perp\!\!\!\perp {R}_{y}\mid X,{R}_{x})$.

**Summary Result 4** *(Recoverability from missing data) [38]*

*The feasibility of recovering relations from missing data can be determined in polynomial time, provided the missingness process is encoded in a causal diagram that falls in areas* $M,S$*, or N of Figure 5*.

Thus far we have dealt with the recoverability of joint and conditional probabilities. Extensions to causal relationships are discussed in Mohan and Pearl [37].
