## 1 Introduction

When the outcome of interest occurs infrequently, effect estimation can be particularly challenging. For example, a recent study sought to examine the impact of planned place of delivery (obstetric unit or not) on perinatal mortality and neonatal morbidities, which occurred in 250 of 63,827 births (0.39 %) (Birthplace in England Collaborative Group 2011). Due to the paucity of individual birth events, however, the researchers estimated the effect on a composite outcome measure. Likewise, tuberculosis is a main cause of mortality among HIV+ people (World Health Organization 2013). Evaluating strategies to reduce its transmission is essential, but difficult due to the disease’s relatively low incidence. Along the same lines, international consortia have been established to investigate the burden and treatment of uncommon cancers (e. g. RARECARENet 2014). While these outcomes are rare, a better understanding of their occurrence is likely to have important policy and health implications.

For binary outcomes or proportions, parametric logistic regression is often used to estimate the conditional odds ratio, given the exposure and measured covariates. Several researchers have investigated the performance of this approach when the outcome is extremely rare (e. g. Concato et al. 1993; Harrell et al. 1996; Peduzzi et al. 1996; King and Zeng 2001; Harrell 2001; Braitman and Rosenbaum 2002; Cepeda et al. 2003; Vittinghoff and McCulloch 2007). For example, simulations by Peduzzi et al. (1996) illustrated that estimates could be biased and inference unreliable if the number of outcomes per independent variable in the regression model was less than 10. The authors also found problems with estimator convergence, statistical power and the validity of significance tests (i. e. type I error rates and confidence interval coverage). Harrell et al. (1996) cautioned against over-fitting and encouraged the use of cross-validation or bootstrapping for model validation. Moreover, King and Zeng (2001) found that standard logistic models could substantially under-estimate the probability of the outcome and offered a bias correction with accompanying software.

When dealing with rare events, several researchers have recommended estimators based on the propensity score, which is the conditional probability of being exposed, given the covariates (Rosenbaum and Rubin 1983). These methods avoid estimation of the conditional mean outcome and thereby are expected to perform well when there are very few outcome events (e. g. Joffe and Rosenbaum 1999; Braitman and Rosenbaum 2002; Patorno et al. 2014). Simulations by Cepeda et al. (2003) suggested that propensity score methods were less biased and more efficient than logistic regression for the mean outcome, when the number of events per independent variable in the regression model was less than 8. The authors also cautioned that the performance of propensity score methods depended on the strength of the relationship between the covariates and the exposure.

Targeted minimum loss-based estimation (TMLE) is a general methodology for the construction of semiparametric, efficient substitution estimators (van der Laan and Rubin 2006; van der Laan and Rose 2011). A TMLE for a single time point exposure can be implemented as follows. First, the conditional expectation of the outcome, given the exposure and covariates, is estimated with parametric regression or with a more flexible approach, such as SuperLearner (van der Laan et al. 2007). Second, information on the exposure-covariate relation (i. e. the propensity score) is incorporated to improve this initial estimator. The propensity score can also be estimated parametrically or with a more flexible approach. Informally, this “targeting” step helps remove some of the residual bias due to incomplete adjustment for confounding. More formally, this targeting step serves to solve the efficient score equation. Finally, the targeted predictions of the outcome under the exposure and under no exposure are averaged over the sample and contrasted on the relevant scale.
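The steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes main-terms logistic regressions for both the conditional mean outcome and the propensity score (in place of SuperLearner), targets the risk difference, and uses simulated data. The function names (`fit_logistic`, `tmle_risk_difference`) and the data-generating values are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_logistic(X, y, offset=None, n_iter=25):
    """Logistic regression coefficients by Newton-Raphson (no regularization)."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta + offset)
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def tmle_risk_difference(W, A, Y):
    """Standard TMLE of the risk difference for a binary exposure and outcome."""
    n = len(Y)
    ones, zeros = np.ones(n), np.zeros(n)
    # Step 1: initial estimate of the conditional mean outcome given (A, W).
    beta_q = fit_logistic(np.column_stack([ones, A, W]), Y)
    Q1 = expit(np.column_stack([ones, ones, W]) @ beta_q)
    Q0 = expit(np.column_stack([ones, zeros, W]) @ beta_q)
    QA = np.where(A == 1, Q1, Q0)
    # Step 2: estimate the propensity score P(A = 1 | W).
    X_g = np.column_stack([ones, W])
    g1 = expit(X_g @ fit_logistic(X_g, A))
    # Step 3: targeting. Regress Y on the "clever covariate" with the
    # initial fit as offset, then update the initial predictions.
    H1, H0 = 1.0 / g1, -1.0 / (1.0 - g1)
    HA = np.where(A == 1, H1, H0)
    eps = fit_logistic(HA[:, None], Y, offset=logit(QA))[0]
    Q1_star = expit(logit(Q1) + eps * H1)
    Q0_star = expit(logit(Q0) + eps * H0)
    # Step 4: average the targeted predictions and contrast them.
    psi = np.mean(Q1_star - Q0_star)
    # Influence-curve-based standard error.
    QA_star = np.where(A == 1, Q1_star, Q0_star)
    ic = HA * (Y - QA_star) + Q1_star - Q0_star - psi
    return psi, np.sqrt(np.var(ic) / n), ic

# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, expit(0.4 * W[:, 0] - 0.3 * W[:, 1]))
Y = rng.binomial(1, expit(-2.0 + 1.0 * A + 0.5 * W[:, 0]))
psi, se, ic = tmle_risk_difference(W, A, Y)
```

Because the fluctuation parameter is fit by maximum likelihood, the empirical mean of `ic` is zero up to numerical error, which is the sense in which the targeting step solves the efficient score equation.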

Thereby, TMLE requires estimation of both the conditional mean outcome and the propensity score, and achieves a number of desirable asymptotic properties (van der Laan and Rose 2011). The standardized estimator is asymptotically normal with mean 0 and variance given by the variance of its influence curve. The TMLE is also double robust: if either the conditional mean outcome or the propensity score is consistently estimated, we will have a consistent estimate of the parameter of interest. If both functions are consistently estimated (at a fast enough rate), the TMLE will be efficient and achieve the lowest possible asymptotic variance among a large class of estimators. Finally, TMLE is a substitution estimator, providing robustness in the face of positivity violations (when there is no or little variability in the exposure within certain covariate strata) and rare outcomes (e. g. Stitelman and van der Laan 2010; Gruber and van der Laan 2010; Sekhon et al. 2011; Petersen et al. 2012; Gruber and van der Laan 2013; Lendle et al. 2013). Building on the work of Gruber and van der Laan (2010), this paper proposes a new TMLE for the semiparametric statistical model

## 2 The estimation problem

We are interested in estimating the impact of a binary exposure

To translate our scientific question into a causal parameter, let us define

In words, we only observe the counterfactual outcome corresponding to the observed treatment

This causal parameter is the difference in expected counterfactual outcomes if everyone in the target population were exposed and if everyone in that target population were unexposed. For a binary outcome,

To express

where the summation generalizes to the integral for continuous covariates and

The challenge, addressed in this paper, is estimation of

where

The information for learning the target parameter is captured by the sample size

(van der Vaart 1998). Low information can occur when the propensity score

We consider the semiparametric statistical model

for some bounds

The variance of the efficient influence curve (Eq. [1]) and the information for estimating the target parameter (Eq. [2]) depend on whether the true distribution

In the following section, we propose a new TMLE, which guarantees that the initial and targeted estimator of

Before introducing the new TMLE, we first caution that the statistical model

## 3 Estimating effects with rare outcomes

Suppose we have

In some cases, we can estimate

where

Data-adaptive or machine learning algorithms can incorporate our limited model knowledge, while smoothing over areas of data with weak support. SuperLearner, for example, allows a set of pre-specified algorithms to compete and selects the best algorithm using cross-validation (i. e. data-splitting) (van der Laan et al. 2007). The selected algorithm can be used to predict the outcomes for all units given the exposure and their measured covariates *a priori*-specified can result in misleading inference.) Furthermore, estimating the conditional expectation

TMLE provides a solution to several of these challenges (van der Laan and Rose 2011). TMLE incorporates data-adaptive estimation with an additional targeting step to reduce bias for

and the following logistic regression model to update the initial estimator:

Here,

The standard TMLE algorithm for a binary or bounded continuous outcome is not optimal for our semiparametric statistical model

### 3.1 The rare outcomes TMLE

To incorporate model knowledge, we propose a linear transformation of the rare outcome

The analogous transformation was proposed by Gruber and van der Laan (2010) for the TMLE of a bounded continuous outcome

Therefore, the conditional mean of the transformed outcome

As a result, the negative quasi-log-likelihood (Wedderburn 1974; McCullagh 1983) is a valid loss function for initial estimation and targeting of the transformed mean

The proof is analogous to Lemma 1 in Gruber and van der Laan (2010) and thus omitted here.

To update an initial estimator of

Through this transformation, the initial and targeted estimates are guaranteed to satisfy the model constraints. This will provide robustness. The targeted estimates can then be rescaled and substituted in the parameter mapping. The proposed loss function and parametric submodel define a new TMLE of the target parameter

### 3.2 Step-by-step implementation

We present the rare outcomes TMLE for the semiparametric model

*Step 1: Transform the outcome*. We first transform the outcome

*Step 2: Estimate the transformed mean*. An initial estimate of the conditional mean of the transformed outcome

*Step 3: Estimate the propensity score*. An initial estimate of the conditional probability of being exposed given the covariates

*Step 4: Target the initial estimator*. Run logistic regression of the transformed outcome

*Step 5: Transform and plug-in the targeted estimates*. The targeted estimates of the conditional mean of the transformed outcome, denoted

We obtain a point estimate by substituting in the targeted estimates of the conditional mean under the exposure

*Step 6: Obtain inference*. Under regularity conditions,

Alternative approaches for variance estimation include the non-parametric bootstrap or a substitution estimator for the variance. The former might be problematic with rare binary outcomes as some bootstrapped samples may not have any events. The latter would guarantee the bounds on the variance are respected and is an area of future work.
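To see why the non-parametric bootstrap can struggle with rare binary outcomes, consider how often a resample contains no events at all. The short calculation below is an illustration under assumed numbers: the sample size of 1,000 and the event counts are hypothetical and are not taken from the simulations in this paper.

```python
import math

def prob_no_events(n, k):
    """Probability that a bootstrap resample of size n, drawn with replacement
    from a sample containing k events, contains none of those events."""
    return (1.0 - k / n) ** n

# Hypothetical sample size and event counts.
n = 1000
zero_event_probs = {k: prob_no_events(n, k) for k in (1, 2, 5, 10)}

# For large n the probability is close to exp(-k).
approximations = {k: math.exp(-k) for k in zero_event_probs}
```

With a single event, roughly 37 % of resamples contain no events; with five events the probability is below 1 %, so the problem is mainly a concern when the event count is in the low single digits.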

By construction, the rare outcomes TMLE solves the efficient score equation. Specifically, the empirical mean of the efficient influence curve at the targeted estimator
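The six steps can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the linear transformation has the form used by Gruber and van der Laan (2010), namely the outcome minus the lower bound divided by the width of the interval, takes the lower bound to be 0 by default, and uses main-terms logistic regressions in place of SuperLearner. The names `rare_outcome_tmle`, `lower` and `upper` and the simulated data are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_quasi_logistic(X, y, offset=None, n_iter=25):
    """Newton-Raphson for the logistic quasi-log-likelihood.
    y need not be binary; only its conditional mean must lie in (0, 1)."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta + offset)
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def rare_outcome_tmle(W, A, Y, upper, lower=0.0):
    n = len(Y)
    ones, zeros = np.ones(n), np.zeros(n)
    # Step 1: transform the outcome (assumed form of the transformation).
    Y_t = (Y - lower) / (upper - lower)
    # Step 2: initial estimate of the transformed mean, kept inside (0, 1).
    beta_q = fit_quasi_logistic(np.column_stack([ones, A, W]), Y_t)
    Q1 = expit(np.column_stack([ones, ones, W]) @ beta_q)
    Q0 = expit(np.column_stack([ones, zeros, W]) @ beta_q)
    QA = np.where(A == 1, Q1, Q0)
    # Step 3: estimate the propensity score.
    X_g = np.column_stack([ones, W])
    g1 = expit(X_g @ fit_quasi_logistic(X_g, A))
    # Step 4: target on the transformed scale.
    H1, H0 = 1.0 / g1, -1.0 / (1.0 - g1)
    HA = np.where(A == 1, H1, H0)
    eps = fit_quasi_logistic(HA[:, None], Y_t, offset=logit(QA))[0]
    # Step 5: map the targeted estimates back to the original scale and plug in.
    R1 = lower + (upper - lower) * expit(logit(Q1) + eps * H1)
    R0 = lower + (upper - lower) * expit(logit(Q0) + eps * H0)
    psi = np.mean(R1 - R0)
    # Step 6: influence-curve-based standard error.
    RA = np.where(A == 1, R1, R0)
    ic = HA * (Y - RA) + R1 - R0 - psi
    return psi, np.sqrt(np.var(ic) / n), R1, R0, ic

# Hypothetical simulated data in which the conditional risk never exceeds 5 %.
rng = np.random.default_rng(1)
n, u = 5000, 0.05
W = rng.normal(size=(n, 2))
A = rng.binomial(1, expit(0.4 * W[:, 0] - 0.3 * W[:, 1]))
Y = rng.binomial(1, u * expit(-1.0 + 0.8 * A + 0.5 * W[:, 0]))
psi, se, R1, R0, ic = rare_outcome_tmle(W, A, Y, upper=u)
```

Every predicted risk in `R1` and `R0` lies between the bounds by construction, so the plug-in estimate cannot leave the range implied by the model, and the empirical mean of `ic` is zero up to numerical error.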

### 3.3 Selecting the upper bound $u$ with cross-validation

Thus far, we have assumed that the upper bound *a priori*. Specifically, the upper bound can be selected by minimizing the cross-validated risk of candidate estimators

## 4 Simulations & data application

The following simulation studies compare the finite sample performance of the standard TMLE with the proposed rare outcomes TMLE (rTMLE). For comparison, we also include a propensity score matching (PSM) estimator (Rosenbaum and Rubin 1983), inverse probability of treatment weighted (IPTW) estimator (Hernán et al. 2000) and augmented inverse probability of treatment weighted (AIPTW) estimator (Robins 2000; van der Laan and Robins 2003). The first two methods rely solely on estimation of the propensity score and may have superior performance with very rare outcomes. AIPTW requires estimation of both the conditional mean outcome and the propensity score, but is double robust and asymptotically efficient under consistent estimation of both functions. AIPTW is an estimating-equation-based estimator (i. e. not a substitution estimator) and thereby can result in impossible parameter estimates (e. g. probabilities less than 0 or greater than 1) (Lendle et al. 2013).
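A toy calculation shows how an estimating-equation-based estimator can leave the parameter space when outcomes are rare. The sketch below uses the standard textbook form of the AIPTW estimate of the marginal risk under one exposure level, which may differ in detail from the estimator used in these simulations; the four-observation data set and the function names are hypothetical.

```python
import numpy as np

def aiptw_arm_mean(A, Y, g1, Q_a, a):
    """AIPTW estimate of the mean outcome under exposure level a (0 or 1).
    g1 is the estimated propensity score P(A = 1 | W); Q_a is the predicted
    outcome for each unit under exposure level a."""
    prob_a = g1 if a == 1 else 1.0 - g1
    weight = (A == a) / prob_a
    return np.mean(weight * (Y - Q_a) + Q_a)

def substitution_arm_mean(Q_a):
    """Plug-in (substitution) estimate: the average of predicted probabilities."""
    return np.mean(Q_a)

# Hypothetical data: no events among the unexposed, one unit with a large weight.
A = np.array([0, 0, 1, 1])
Y = np.array([0, 0, 0, 1])
g1 = np.array([0.9, 0.5, 0.5, 0.5])
Q0 = np.full(4, 0.01)  # predicted risk of 1 % under no exposure

aiptw_risk0 = aiptw_arm_mean(A, Y, g1, Q0, a=0)    # about -0.02, i.e. negative
plug_in_risk0 = substitution_arm_mean(Q0)          # 0.01, inside [0, 1]
```

The weighted residual term is unconstrained, so a single unexposed unit with a large weight and no event is enough to push the AIPTW risk estimate below zero, whereas an average of predicted probabilities cannot leave the unit interval.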

For the PSM estimator, we used the Matching package (Sekhon 2011) for 1:1 matching based on the estimated propensity score and calculated the point estimate as

with

A point estimate from AIPTW is attained by directly solving the efficient score equation:

where

### 4.1 Simulation 1: individual-level data

The finite sample performance of the estimators was assessed by drawing 2,000 samples of sizes

The exposure

With this exposure mechanism, there were no positivity violations; the propensity score was bounded between 20 % and 92 %. Finally, the binary outcome was drawn from a Bernoulli distribution with probability

The resulting marginal probability of the outcome was

The results for this simulation are presented in Figures 1 and 2. As expected, the PSM estimator and IPTW exhibited low bias when the regression model for the propensity score

In theory, the other estimators solve the efficient score equation directly (AIPTW) or during the targeting step (TMLE, rTMLE). As a result, these estimators are double robust. For these simulations, however, the standard TMLE exhibited substantial bias when estimating

The performance of AIPTW and rTMLE did not suffer when fitting

Under the null, the PSM estimator and IPTW had good type I error control when the propensity was correctly specified (Results not shown). Under misspecification, their type I error rates exceeded 20 %. The standard TMLE also suffered from high type I error rates, when both regression models were correctly specified. AIPTW and rTMLE maintained nominal type I error rates for both sample sizes and all regression specifications.

### 4.2 Simulation 2: group-level data

For this set of simulations, we focused on clustered data, where the covariates, exposure and outcome were measured or aggregated to the cluster-level. For 2,000 simulations of

As suggested by the previous simulations, a possible approach for estimation with rare outcomes is using a small adjustment set for

The simulation results are given in Table 1. Since the estimated propensity score was based on the correct regression model, all algorithms exhibited low bias. When using an unadjusted (initial) estimator

Table 1: Estimator performance over 2,000 simulations of

| | Treatment effect: | | | | Null: | | | |
|---|---|---|---|---|---|---|---|---|
| | | MSE | Cov. | Power | ^{a} | MSE | Cov. | |
| PSM | 0.31 | 1.9E-6 | 0.99 | 0.40 | 0.01 | 2.3E-6 | 0.98 | 0.02 |
| IPTW | 0.31 | 1.3E-6 | 1.00 | 0.00 | 0.00 | 1.4E-6 | 1.00 | 0.00 |
| AIPTW | 0.30 | 1.0E-6 | 1.00 | 0.34 | 0.00 | 1.1E-6 | 1.00 | 0.00 |
| TMLE | 0.30 | 1.0E-6 | 1.00 | 0.40 | 0.00 | 1.1E-6 | 1.00 | 0.00 |
| rTMLE | 0.30 | 1.0E-6 | 1.00 | 0.40 | 0.00 | 1.1E-6 | 1.00 | 0.00 |
| rTMLE | 0.30 | 7.4E-7 | 0.94 | 0.94 | 0.00 | 7.8E-7 | 0.94 | 0.06 |

Notes:

### 4.3 Application

For a practical demonstration, we used data from the New York Social Environment Study to examine the association between permissive neighborhood drunkenness norms and the prevalence of alcohol use disorder at the neighborhood level

For all estimators, SuperLearner was used for (initial) estimation of the conditional mean

The results are presented in Table 2. The point estimates from the PSM estimator (0.66 %) and IPTW (0.82 %) were positive, but their confidence intervals were wide and included the null. The point estimates from AIPTW (0.61 %), TMLE (0.88 %) and rTMLE (0.85 %) were also all positive, and the confidence intervals for the TMLEs did not include the null. While the double robust estimators are expected to perform similarly asymptotically, their finite sample performance is expected to differ. In particular, both TMLEs benefit from being substitution estimators (Stitelman and van der Laan 2010; Gruber and van der Laan 2010; Sekhon et al. 2011; Petersen et al. 2012; Gruber and van der Laan 2013; Lendle et al. 2013). rTMLE further benefits by using model knowledge on the bounds of the mean of the rare outcome. Overall, the results suggest that there is an increased risk of alcohol use disorder among neighborhoods with more permissive drunkenness norms. This finding is in line with previous work in the population, which suggested a significant association between neighborhood norms about drinking and binge drinking (Ahern et al. 2008).

Table 2: Point estimates, variance estimates and confidence intervals in the applied data example.

| | | | | | 95 % CI |
|---|---|---|---|---|---|
| PSM | | | 0.66 | 0.0038 | (–0.54, 1.87) |
| IPTW | 2.06 | 1.24 | 0.82 | 0.0040 | (–0.42, 2.07) |
| AIPTW | 2.13 | 1.52 | 0.61 | 0.0015 | (–0.14, 1.36) |
| TMLE | 2.18 | 1.29 | 0.88 | 0.0015 | (0.12, 1.64) |
| rTMLE | 2.18 | 1.33 | 0.85 | 0.0015 | (0.10, 1.60) |

## 5 Discussion

In this paper, we proposed a new TMLE for evaluating causal effects and estimating associations with very rare outcomes and high dimensional data. The rare outcomes TMLE (rTMLE) is based on harnessing knowledge in the semiparametric model

In simulations, the proposed rare outcomes TMLE performed as well as or outperformed the alternative estimators. The PSM estimator and IPTW were biased under misspecification of the propensity score. The standard TMLE algorithm suffered from bias and poor confidence interval coverage when the adjustment set for the conditional mean outcome was large. Both AIPTW and rTMLE were robust to model misspecification and yielded consistent estimates if either the conditional mean outcome or the propensity score was consistently estimated. AIPTW, however, is not a substitution estimator and yielded negative (impossible) risk estimates. In contrast, the proposed TMLE respected the global knowledge in the statistical model. Our simulations further highlighted the potential for data-adaptive estimation to avoid parametric assumptions and to increase power.

We focused on situations where the conditional mean of a rare outcome was bounded from below by

The proposed TMLE generalizes readily to the estimation of other parameters, including risk ratios, odds ratios and the impacts of longitudinal exposures. The TMLE is also applicable to other sampling designs. Specifically, case-control studies are commonly employed to increase robustness and efficiency of the analysis of rare events. There are several well-established methods that correct for selection on the outcome (Anderson 1972; Prentice and Breslow 1978; King and Zeng 2001; Robins 1999; Mansson et al. 2007). For example, van der Laan (2008) presented a general mapping of loss functions, substitution estimators and estimating equations developed for prospective sampling (i. e. cohort sampling) into loss functions, substitution estimators and estimating equations for biased sampling (e. g. case-control sampling). The estimator’s properties, such as double robustness and asymptotic efficiency, are preserved under the mapping. As noted by van der Laan (2008), however, the sample size needed to detect effects on the order of the outcome prevalence will be very large, unless the conditional probability of the outcome is bounded from above by some small constant

For Simulation 1, the number of negative estimates for the marginal risk under no exposure from AIPTW in 2,000 simulated data sets.

| | | |
|---|---|---|
| Both Correct | 120 | 5 |
| | 260 | 11 |
| | 130 | 6 |

Suppose the outcome

Now consider the outcome

The variance of the efficient influence curve at the true distribution

More generally, we can consider the outcome

Since the outcome

Suppose we have

Under regularity conditions, the asymptotic variance of the first estimator

where

The second TMLE bounds the estimated mean

Since

Full R code is available on the author’s website: http://works.bepress.com/laura_balzer/25.

## References

Abadie, A., and Imbens, G. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74:235–267.

Abadie, A., and Imbens, G. (2015). Matching on the estimated propensity score. Technical report, NBER Technical Working Paper. http://www.nber.org/papers/w15301.

Ahern, J., Galea, S., Hubbard, A., Midanik, L., and Syme, S. L. (2008). “Culture of drinking” and individual problems with alcohol use. American Journal of Epidemiology, 167:1041–1049.

Anderson, J. (1972). Separate sample logistic discrimination. Biometrika, 59:19–35.

Beck, N., King, G., and Zeng, L. (2000). Improving quantitative studies of international conflict: A conjecture. American Political Science Review, 94:21–25.

Bickel, P., Klaassen, C., Ritov, Y., and Wellner, J. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press.

Birthplace in England Collaborative Group. (2011). Perinatal and maternal outcomes by planned place of birth for healthy women with low risk pregnancies: The Birthplace in England national prospective cohort study. British Medical Journal, 343:d7400.

Braitman, L., and Rosenbaum, P. (2002). Rare outcomes, common treatments: Analytic strategies using propensity scores. Annals of Internal Medicine, 137:693–696.

Cepeda, M., Boston, R., Farrar, J., and Strom, B. (2003). Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. American Journal of Epidemiology, 158:280–287.

Concato, J., Feinstein, A., and Holford, T. (1993). The risk of determining risk with multivariable models. Annals of Internal Medicine, 118:201–210.

Gruber, S., and van der Laan, M. (2010). A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. The International Journal of Biostatistics, 6:Article 26.

Gruber, S., and van der Laan, M. (2013). An application of targeted maximum likelihood estimation to the meta-analysis of safety data. Biometrics, 69:254–262.

Harrell, F. Jr. (2001). Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis. Berlin, Heidelberg, New York: Springer.

Harrell, F. Jr, Lee, K., and Mark, D. (1996). Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15:361–387.

Hasin, D., Stinson, F., Ogburn, E., and Grant, B. (2007). Prevalence, correlates, disability, and comorbidity of DSM-IV alcohol abuse and dependence in the United States. Archives of General Psychiatry, 64:830–842.

Hernán, M., Brumback, B., and Robins, J. (2000). Marginal structural models to estimate the causal effect of Zidovudine on the survival of HIV-positive men. Epidemiology, 11:561–570.

Joffe, M., and Rosenbaum, P. (1999). Invited commentary: Propensity scores. American Journal of Epidemiology, 150:327–333.

King, G., and Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9:137–163.

Lendle, S., Fireman, B., and van der Laan, M. (2013). Targeted maximum likelihood estimation in safety analysis. Journal of Clinical Epidemiology, 66:S91–S98.

Mansson, R., Joffe, M., Sun, W., and Hennessy, S. (2007). On the estimation and use of propensity scores in case-control and case-cohort studies. American Journal of Epidemiology, 166:332–339.

McCullagh, P. (1983). Quasi-likelihood functions. The Annals of Statistics, 11:59–67.

National Institute on Alcohol Abuse and Alcoholism. (2013). Alcohol use disorder: a comparison between DSM-IV and DSM-5, NIH Publication No. 13–7999. http://pubs.niaaa.nih.gov/publications/dsmfactsheet/dsmfact.pdf.

Neyman, J. (1923). Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes (in Polish). English translation by D.M. Dabrowska and T.P. Speed (1990). Statistical Science, 5:465–480.

Patorno, E., Glynn, R., Hernández-Díaz, S., Liu, J., and Schneeweiss, S. (2014). Studies with many covariates and few outcomes: Selecting covariates and implementing propensity-score-based confounding adjustments. Epidemiology, 25:268–278.

Pearl, J. (2000). Causality: Models, Reasoning and Inference, 2nd ed. New York: Cambridge University Press.

Peduzzi, P., Concato, J., Kemper, E., Holford, T., and Feinstein, A. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 49:1373–1379.

Petersen, M., Porter, K., Gruber, S., Wang, Y., and van der Laan, M. (2012). Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research, 21:31–54.

Polley, E., and van der Laan, M. (2013). SuperLearner: Super Learner Prediction. http://CRAN.R-project.org/package=SuperLearner, R package version 2.0-10.

Prentice, R., and Breslow, N. (1978). Retrospective studies and failure time models. Biometrika, 65:153–158.

R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org.

RARECARENet. (2014). Information network on rare cancers. http://www.rarecarenet.eu/rarecarenet/.

Robins, J. (1986). A new approach to causal inference in mortality studies with sustained exposure periods–application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512.

Robins, J. (1999). [Choice as an alternative to control in observational studies]: Comment. Statistical Science, 14:281–293.

Robins, J. (2000). Robust estimation in sequentially ignorable missing data and causal inference models. In: 1999 Proceedings of the American Statistical Association, Alexandria, VA: American Statistical Association, 6–10.

Rosenbaum, P., and Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.

Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701.

Rubin, D. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75:591–593.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 6:34–58.

Sekhon, J. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42:1–52.

Sekhon, J., Gruber, S., Porter, K., and van der Laan, M. (2011). Propensity-score-based estimators and c-TMLE. In: Targeted Learning: Causal Inference for Observational and Experimental Data, M. van der Laan and S. Rose (Eds.), 343–364. New York, Dordrecht, Heidelberg, London: Springer.

Stitelman, O., and van der Laan, M. (2010). Collaborative targeted maximum likelihood for time-to-event data. The International Journal of Biostatistics, 6:Article 21.

van der Laan, M. (2008). Estimation based on case-control designs with known prevalence probability. The International Journal of Biostatistics, 4:Article 17.

van der Laan, M., Polley, E., and Hubbard, A. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6:Article 25.

van der Laan, M., and Robins, J. (2003). Unified Methods for Censored Longitudinal Data and Causality. New York, Berlin, Heidelberg: Springer-Verlag.

van der Laan, M., and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. New York, Dordrecht, Heidelberg, London: Springer.

van der Laan, M., and Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics, 2:Article 11.

van der Vaart, A. (1998). Asymptotic Statistics. New York: Cambridge University Press.

Vittinghoff, E., and McCulloch, C. (2007). Relaxing the rule of ten events per variable in logistic and Cox regression. American Journal of Epidemiology, 165:710–718.

Wedderburn, R. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61:439–447.

World Health Organization. (2013). Global tuberculosis report 2013. Geneva.