In many psychological, biological, and medical trials, more than two treatment groups are involved. In these situations, one is interested in detecting any significant difference among the treatment means $\mu_1, \dots, \mu_a$, i.e. in testing the global null hypothesis $H_0: \mu_1 = \dots = \mu_a$, and, particularly, in detecting specific significant differences, i.e. in performing multiple comparisons accompanied by simultaneous confidence intervals (SCI). In randomized clinical trials, the computation of SCI is required by regulatory authorities: “Estimates of treatment effects should be accompanied by confidence intervals, whenever possible…” (ICH E9 Guideline 1998, chap. 5.5, p. 25). Hereby, the family-wise error rate should be strongly controlled.
In statistical practice, however, the usual way to detect specific significant differences among the effects of interest, and to compute SCI, consists of three steps: (1) the global null hypothesis is tested by an appropriate procedure, e.g. analysis of variance (ANOVA), (2) if the global null hypothesis is rejected, multiple comparisons are carried out to test individual hypotheses, e.g. the $\ell$th partial null hypothesis, and (3) in the final step, SCI for the treatment effects of interest are computed. Although stepwise procedures using different approaches on the same data are quite common in practice, they may have the undesirable property that the global null hypothesis is rejected while none of the individual hypotheses is, and vice versa. This means that the global test procedure and the multiple testing procedure may be non-consonant with each other (see Gabriel 1969 and Hsu). Further, the confidence intervals may include the null value, i.e. the value of no treatment effect, even if the corresponding individual null hypotheses have been rejected. This means that the individual test decisions and the corresponding confidence intervals may be incompatible. It is well known that the classical Bonferroni adjustment can be used to perform multiple comparisons as well as to compute compatible SCI. This approach, however, has low power, particularly when the test statistics are not independent.
In recent years, multiple contrast test procedures (MCTPs) with accompanying compatible SCI for linear contrasts were derived by Mukerjee et al. and Bretz et al. The procedures are based on the exact multivariate distribution of a vector of t-test statistics, where each test statistic corresponds to an individual null hypothesis. An individual hypothesis is rejected if the corresponding test statistic exceeds a critical value obtained from the distribution of the vector of t-test statistics. The global null hypothesis is rejected if any individual hypothesis is rejected. Therefore, the individual and global test decisions are consonant and coherent. These MCTPs take the correlation between the test statistics into account and can be used for testing arbitrary contrasts, e.g. many-to-one, all-pairs, or even average comparisons. Thus, MCTPs provide an extensive tool for powerful multiple comparisons, for the computation of compatible SCI, and for testing the global null hypothesis. The results by Bretz et al. were extended to general linear models by Hothorn et al., to heteroscedastic models by Hasler and Hothorn and Herberich et al., and to ranking procedures by Konietschke and Hothorn, Konietschke et al., and Konietschke et al. For a comprehensive overview of existing methods, we refer to Bretz et al.
Comparing the MCTP and the global testing procedure ANOVA, one notices that both procedures can be used to test the global null hypothesis. From a practical point of view, MCTPs are superior to the ANOVA in that they indicate which levels cause the overall statistical significance and in that they offer SCI. In quite restricted homoscedastic normal models, both procedures are exact level tests. Arias-Castro et al. studied global and multiple testing procedures under sparse alternatives and emphasize “Because ANOVA is such a well established method, it might surprise the reader – but not the specialist – to learn that there are situations where the Max test, though apparently naive, outperforms ANOVA by a wide margin” [9, p. 2534]. Whether, and to what extent, the MCTP loses power in detecting global alternatives has not yet been investigated. Thus, exact power comparisons remain an open problem.
It is the aim of this article to investigate the exact power of the MCTP and of the ANOVA to detect global alternatives. To give a fair comparison, we restrict our analysis to those linear contrasts which are embedded in the ANOVA, i.e. contrasts which compare each mean to the overall mean. In particular, we compute the least favorable configuration (LFC) of the alternative, i.e. the alternative which is detected with minimal power, for both the ANOVA and the MCTP. The results indicate that the LFCs of both procedures are identical. Exact power calculations show that their powers to detect the LFCs are equal.
2 Statistical model and test statistics
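We consider a completely randomized one-way layout

$$X_{ij} \sim N(\mu_i, \sigma^2), \quad i = 1, \dots, a; \; j = 1, \dots, n_i,$$

where the index $i$ denotes the level of the treatment group, and $j$ denotes the $j$th unit within the $i$th group. Throughout this article, let $N = \sum_{i=1}^{a} n_i$ denote the total sample size, $\boldsymbol{\mu} = (\mu_1, \dots, \mu_a)'$ the vector of expectations, $\boldsymbol{\mu}/\sigma$ its scaled version, and let $\boldsymbol{\Lambda} = \mathrm{diag}(n_1, \dots, n_a)$ denote the diagonal matrix of the sample sizes. Furthermore, let $\bar{\mathbf{X}}_{\cdot} = (\bar{X}_{1\cdot}, \dots, \bar{X}_{a\cdot})'$ denote the vector of means, let $\bar{X}_{\cdot\cdot}$ denote the overall mean, and let $s^2 = \frac{1}{N-a} \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2$ denote the pooled sample variance.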
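Our aim is to test the null hypothesis $H_0: \mu_1 = \dots = \mu_a$ versus the alternative $H_1: \mu_i \neq \bar{\mu}_{\cdot}$ for at least one $i$, where $\bar{\mu}_{\cdot} = \frac{1}{a} \sum_{i=1}^{a} \mu_i$ is the mean of expectations. The global null hypothesis can be equivalently written as

$$H_0: \mathbf{P}_a \boldsymbol{\mu} = \mathbf{0}, \qquad \mathbf{P}_a = \mathbf{I}_a - \tfrac{1}{a} \mathbf{J}_a.$$

The contrast matrix $\mathbf{P}_a$ is also known as the centering matrix, where $\mathbf{I}_a$ denotes the unit matrix, and $\mathbf{J}_a$ denotes the $a \times a$-matrix of 1’s. Throughout this article, $\mathbf{P}_a$ will be called the Grand-mean-type contrast matrix. Each row vector $\mathbf{c}_\ell'$ of $\mathbf{P}_a$ is one contrast and will be used later for testing the individual hypotheses $H_0^{(\ell)}: \mathbf{c}_\ell' \boldsymbol{\mu} = 0$, i.e. $\mu_\ell = \bar{\mu}_{\cdot}$, for $\ell = 1, \dots, a$. The ANOVA-F-test

$$F = \frac{1}{(a-1)\, s^2} \sum_{i=1}^{a} n_i \left( \bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot} \right)^2$$

is the commonly used statistic for testing $H_0$. As is well known, $F$ is exactly $F(a-1, N-a, \lambda)$-distributed, where $\lambda$ denotes the non-centrality parameter. Clearly, under $H_0$, $\lambda$ is equal to zero. It follows from the definition of $F$ that this global testing procedure is the scaled sum of the squared contrasts in means. Therefore, it cannot provide any information about which means differ significantly from the overall mean. The MCTP using the contrasts $\mathbf{c}_\ell$, on the contrary, consists of the vector of t-test type statistics

$$\mathbf{T} = (T_1, \dots, T_a)', \quad \text{where} \quad T_\ell = \frac{\mathbf{c}_\ell' \bar{\mathbf{X}}_{\cdot}}{s \sqrt{\mathbf{c}_\ell' \boldsymbol{\Lambda}^{-1} \mathbf{c}_\ell}}$$

is the modified t-test statistic for testing $H_0^{(\ell)}$. Thus, $\mathbf{T}$ consists of the scaled single contrasts $\mathbf{c}_\ell' \bar{\mathbf{X}}_{\cdot}$. We note that the MCTP is not restricted to comparisons to the overall mean. For example, Dunnett-type many-to-one comparisons can be performed by using the contrast matrix

$$\begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ -1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ -1 & 0 & 0 & \cdots & 1 \end{pmatrix},$$

Tukey-type all-pairs comparisons can be conducted using

$$\begin{pmatrix} -1 & 1 & 0 & \cdots & 0 & 0 \\ -1 & 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 \end{pmatrix},$$

and by replacing the contrasts $\mathbf{c}_\ell$ in the statistics $T_\ell$ by the row vectors of the chosen contrast matrix. For a detailed overview of different kinds of contrasts, we refer to Bretz et al. The ANOVA-F-test, however, is restricted to the comparisons to the overall mean described by the centering matrix. Therefore, we will only compare the ANOVA with the Grand-mean-type MCTP. As further results, we will also investigate the powers of the MCTP using the Dunnett-type or Tukey-type contrast matrix. For convenience, we will write the different contrasts in a unified way by a non-specified contrast matrix $\mathbf{C}$ throughout this article.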
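The three kinds of contrast matrices discussed above can be written down in a few lines. The following is a minimal sketch in Python using only the standard library; the function names are hypothetical and chosen for illustration, and taking the first group as the control in the many-to-one case is an assumption.

```python
from itertools import combinations


def grand_mean_contrasts(a):
    """Centering matrix I_a - (1/a) J_a: row i compares mean i with the mean of all a means."""
    return [[(1.0 if i == j else 0.0) - 1.0 / a for j in range(a)] for i in range(a)]


def dunnett_contrasts(a):
    """Many-to-one contrasts: each of groups 2..a minus group 1 (assumed control)."""
    rows = []
    for i in range(1, a):
        row = [0.0] * a
        row[0], row[i] = -1.0, 1.0
        rows.append(row)
    return rows


def tukey_contrasts(a):
    """All-pairs contrasts: group j minus group i for every pair i < j."""
    rows = []
    for i, j in combinations(range(a), 2):
        row = [0.0] * a
        row[i], row[j] = -1.0, 1.0
        rows.append(row)
    return rows


if __name__ == "__main__":
    for name, build in [("grand-mean", grand_mean_contrasts),
                        ("Dunnett", dunnett_contrasts),
                        ("Tukey", tukey_contrasts)]:
        print(name)
        for row in build(4):
            print("  ", row)
```

Every row sums to zero, which is what makes it a contrast; for $a$ groups the three constructions yield $a$, $a-1$, and $a(a-1)/2$ rows, respectively.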
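Bretz et al. have shown that $\mathbf{T}$ follows a multivariate $t$-distribution with $\nu = N - a$ degrees of freedom, correlation matrix $\mathbf{R}$ and non-centrality parameter vector

$$\boldsymbol{\delta} = (\delta_1, \dots, \delta_q)', \qquad \delta_\ell = \frac{\mathbf{c}_\ell' \boldsymbol{\mu}}{\sigma \sqrt{\mathbf{c}_\ell' \boldsymbol{\Lambda}^{-1} \mathbf{c}_\ell}},$$

where $q$ denotes the number of contrasts. Under the global null hypothesis $H_0$, the non-centrality parameter vector is equal to $\mathbf{0}$. The correlation matrix $\mathbf{R}$ is known and only depends on the sample sizes $n_i$ in the model. It can be easily computed by standardizing the covariance matrix of $\mathbf{C} \bar{\mathbf{X}}_{\cdot}$, which is proportional to $\mathbf{C} \boldsymbol{\Lambda}^{-1} \mathbf{C}'$. For a detailed explanation, we refer to Bretz et al. The individual null hypothesis $H_0^{(\ell)}$ will be rejected at multiple level $\alpha$ of significance, if $|T_\ell| \geq t_{1-\alpha}(\nu, \mathbf{R})$, where $t_{1-\alpha}(\nu, \mathbf{R})$ denotes the $(1-\alpha)$-equicoordinate quantile from the central multivariate $t$-distribution with $\nu$ degrees of freedom and correlation matrix $\mathbf{R}$, that is

$$P\left( \bigcap_{\ell=1}^{q} \left\{ |T_\ell| \leq t_{1-\alpha}(\nu, \mathbf{R}) \right\} \right) = 1 - \alpha \quad \text{under } H_0.$$

In particular, compatible $(1-\alpha)$-SCI for the treatment effects $\mathbf{c}_\ell' \boldsymbol{\mu}$ are given by

$$\left[ \mathbf{c}_\ell' \bar{\mathbf{X}}_{\cdot} \pm t_{1-\alpha}(\nu, \mathbf{R}) \, s \sqrt{\mathbf{c}_\ell' \boldsymbol{\Lambda}^{-1} \mathbf{c}_\ell} \right], \quad \ell = 1, \dots, q.$$

The global null hypothesis will be rejected, if

$$\max_{\ell = 1, \dots, q} |T_\ell| \geq t_{1-\alpha}(\nu, \mathbf{R}).$$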
Apparently, the ANOVA-F statistic and the MCTP statistics consist of the same contrasts and the same error estimate $s^2$. The difference between the procedures is that the ANOVA-F-test uses the scaled sum of the squares of the contrasts, whereas the MCTP uses the maximum of the scaled single contrasts. The impact of these two different principles on the powers of the tests will be investigated in the next section.
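To make the "sum of squares versus maximum" distinction concrete, the sketch below computes the ANOVA F statistic and the contrast-wise t-type statistics from the same group means and the same pooled variance. It is an illustration under stated assumptions rather than the authors' implementation: the function names and the toy data are invented, the overall mean inside the F statistic is the usual sample-size weighted mean, and the t-type statistics use the standard form (contrast estimate divided by its estimated standard error).

```python
from math import sqrt


def group_summaries(groups):
    """Return group sizes, group means and the pooled sample variance."""
    sizes = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    pooled_var = within / (sum(sizes) - len(groups))
    return sizes, means, pooled_var


def anova_f(groups):
    """One-way ANOVA F statistic (sample-size weighted overall mean)."""
    sizes, means, pooled_var = group_summaries(groups)
    overall = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)
    between = sum(n * (m - overall) ** 2 for n, m in zip(sizes, means))
    return between / ((len(groups) - 1) * pooled_var)


def contrast_t_statistics(groups, contrasts):
    """t-type statistic for every contrast row, all sharing the pooled variance."""
    sizes, means, pooled_var = group_summaries(groups)
    stats = []
    for c in contrasts:
        estimate = sum(ci * m for ci, m in zip(c, means))
        std_error = sqrt(pooled_var * sum(ci ** 2 / n for ci, n in zip(c, sizes)))
        stats.append(estimate / std_error)
    return stats


# Toy balanced data (made up for illustration): a = 3 groups with n = 3 each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
a = len(groups)
# Rows of the centering matrix: each group mean versus the mean of all means.
centering = [[(1.0 if i == j else 0.0) - 1.0 / a for j in range(a)] for i in range(a)]

f_stat = anova_f(groups)
t_stats = contrast_t_statistics(groups, centering)
max_abs_t = max(abs(t) for t in t_stats)
print("F =", f_stat)
print("T =", t_stats)
print("max |T| =", max_abs_t)
```

In this balanced example with Grand-mean-type contrasts, the squared t-type statistics add up to $a$ times the F statistic, so the F-test aggregates exactly those contrasts of which the MCTP takes the largest one in absolute value.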
3 Power comparisons of the ANOVA and MCTP
The ANOVA-F-test is a quadratic test statistic, while $\mathbf{T}$, or rather the single contrasts embedded in $\mathbf{T}$, are linear statistics. Roughly speaking, the two methods are therefore not analytically comparable. We, therefore, consider the power of the MCTP to detect the global alternative $H_1: \mathbf{c}_\ell' \boldsymbol{\mu} \neq 0$ for at least one $\ell$. Due to the abundance of possible alternatives, we will compute the LFC of both the ANOVA and the MCTP, i.e. the alternatives which are detected with minimal power. The powers to detect their LFCs can then be fairly compared. As pointed out in Section 2, the vector of t-test statistics $\mathbf{T}$ follows a multivariate $t$-distribution with $\nu$ degrees of freedom, correlation matrix $\mathbf{R}$, and non-centrality parameter vector $\boldsymbol{\delta}$. Thus, the power of $\mathbf{T}$ to detect $H_1$ at significance level $\alpha$ can be defined by

$$\beta(\boldsymbol{\mu}) = P_{\boldsymbol{\mu}} \left( \max_{\ell} |T_\ell| \geq t_{1-\alpha}(\nu, \mathbf{R}) \right).$$
Note that the Grand-mean-type contrast matrix has rank $a - 1$; hence, the correlation matrix $\mathbf{R}$ is singular and the distribution of $\mathbf{T}$ cannot have a density with respect to Lebesgue measure. The exact power of the MCTP, however, can be computed by using the $(a-1)$-variate regular multivariate t-distribution function of the t-statistics computed with $a - 1$ linearly independent contrasts, together with an appropriate transformation of the integration region.
Now, it is our purpose to consider the two conditions
and to establish the configuration of the for which the power function is minimized, i.e. we compute the LFC of such that
Note that in unbalanced designs, the power is not invariant under permutations of the coordinates of $\boldsymbol{\mu}$, which follows from the definition of the multivariate t-distribution. To get a useful result, we, therefore, restrict the computation to balanced designs. The LFCs of $\boldsymbol{\mu}$ for Grand-mean-type and Tukey-type MCTPs are given in Theorem 1.
, so that . Then, if
Let , so that . Then, if
It follows from Theorem 1 that, under each of the two restrictions, the corresponding LFC is the configuration that is detected with minimal power. In particular, Hayter and Liu [15, 16] computed the LFCs of the ANOVA-F-test under both restrictions. It turns out that the ANOVA and the MCTP have the same LFCs. The powers of the two procedures to detect their LFCs are compared in Section 3.1.
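The singular correlation matrix mentioned earlier in this section can be inspected directly in the balanced case. The sketch below is an illustration only (the function name and the choice of $a = 4$ groups with $n = 10$ units each are assumptions): it standardizes $\mathbf{C} \boldsymbol{\Lambda}^{-1} \mathbf{C}'$ for the rows of the centering matrix and shows that every off-diagonal correlation equals $-1/(a-1)$, so that each row of the correlation matrix sums to zero and the matrix cannot be of full rank.

```python
from math import sqrt


def contrast_correlation(contrasts, sizes):
    """Standardise C diag(1/n_i) C' into the correlation matrix of the contrasts."""
    q = len(contrasts)
    cov = [[sum(ck * cl / n for ck, cl, n in zip(contrasts[k], contrasts[l], sizes))
            for l in range(q)] for k in range(q)]
    return [[cov[k][l] / sqrt(cov[k][k] * cov[l][l]) for l in range(q)]
            for k in range(q)]


def centering_rows(a):
    """Rows of I_a - (1/a) J_a (each mean versus the mean of all means)."""
    return [[(1.0 if i == j else 0.0) - 1.0 / a for j in range(a)] for i in range(a)]


if __name__ == "__main__":
    a, n = 4, 10  # balanced design, values chosen for illustration
    corr = contrast_correlation(centering_rows(a), [n] * a)
    for row in corr:
        # Each row sums to (numerically) zero, so the vector of ones lies in the null space.
        print([round(r, 4) for r in row], "row sum:", round(sum(row), 12))
```

Because the vector of ones is mapped to zero, one of the $a$ statistics is redundant, which is why an $(a-1)$-variate regular distribution suffices for the power computation.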
3.1 Numerical comparisons
The computations of the exact powers of both procedures to detect their LFCs under the restrictions and , respectively, are of particular interest. In Figure 1, the exact power curves (type-I error level ) of both procedures for levels with sample sizes are displayed (restriction upper row; restriction lower row).
It can be readily seen from Figure 1 that the powers of the ANOVA and the MCTP to detect their LFCs appear to be equal. Under one of the two restrictions, the MCTP has a slightly higher power than the ANOVA. Hence, because they offer more information in terms of local test decisions and SCI, MCTPs are preferable for statistical inference.
Next, we compute the minimal required sample size to detect the LFCs for a given difference , different power levels , and different type-I error levels  and . The results under the restriction for the ANOVA, Grand-mean-type, and Tukey-type MCTP, respectively, are given in Table 1.
Table 1 shows that slightly smaller sample sizes are required to detect the LFC using the Grand-mean-type MCTP than with the ANOVA, particularly for increasing numbers of factor levels and decreasing under the restriction . For the Tukey-type MCTP, no homogeneous behavior can be detected. In Table 2, the minimal required sample sizes for the LFC detection under the restriction are displayed. The minimal sample size to detect the LFC using the ANOVA is slightly smaller than using the Grand-mean-type MCTP. The smallest sample size is revealed with the Tukey-type MCTP.
3.2 Power investigations for selected alternatives
The LFCs provide only two possible candidates among an infinite number of alternatives. In this section, we investigate the powers of the two procedures to detect different kinds of alternatives, namely
with varying sample sizes, numbers of factor levels $a$, and magnitudes of the alternative. The results are displayed in Figure 2. It can be readily seen from Figure 2 that the powers of both procedures depend strongly on the chosen kind of alternative. The ANOVA seems to be more powerful for trend patterns (alternative 1 and alternative 2), while being slightly less powerful for umbrella alternatives (alternative 3). Finally, we investigate the powers of the procedures to reject a point alternative of the form
with varying numbers of groups $a$ and a fixed sample size per group. The results are displayed in Figure 3. It follows from Figure 3 that the powers of the ANOVA to reject the two chosen alternatives are monotonically decreasing in $a$, while the powers of the MCTP are nearly constant in $a$.
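Readers who want to explore alternatives of their own can approximate both power functions by simulation. The sketch below is a Monte Carlo approximation, not the exact computation used in this article: it assumes a balanced design with unit variance and Grand-mean-type contrasts, estimates both critical values from simulated null data instead of taking them from the F and multivariate t distributions, and uses hypothetical function names and an arbitrary seed.

```python
import random
from math import sqrt


def f_and_max_t(groups):
    """ANOVA F and max |T_l| for Grand-mean-type contrasts in a balanced design."""
    a, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    overall = sum(means) / a
    pooled_var = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (a * (n - 1))
    f_stat = n * sum((m - overall) ** 2 for m in means) / ((a - 1) * pooled_var)
    # For a row c of the centering matrix, sum(c_i^2 / n) equals (1 - 1/a) / n.
    std_error = sqrt(pooled_var * (1.0 - 1.0 / a) / n)
    max_t = max(abs(m - overall) for m in means) / std_error
    return f_stat, max_t


def simulate(mu, n, n_sim, rng):
    """Draw n_sim data sets with group means mu and unit variance."""
    return [f_and_max_t([[rng.gauss(m, 1.0) for _ in range(n)] for m in mu])
            for _ in range(n_sim)]


def empirical_quantile(values, prob):
    """Simple order-statistic estimate of the prob-quantile."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(prob * len(ordered)))]


def monte_carlo_power(mu, n, alpha=0.05, n_sim=2000, seed=1):
    """Estimated rejection rates (F-test, max-t test) for the mean vector mu."""
    rng = random.Random(seed)
    null_draws = simulate([0.0] * len(mu), n, n_sim, rng)
    f_crit = empirical_quantile([f for f, _ in null_draws], 1.0 - alpha)
    t_crit = empirical_quantile([t for _, t in null_draws], 1.0 - alpha)
    alt_draws = simulate(mu, n, n_sim, rng)
    power_f = sum(f > f_crit for f, _ in alt_draws) / n_sim
    power_t = sum(t > t_crit for _, t in alt_draws) / n_sim
    return power_f, power_t


if __name__ == "__main__":
    # One shifted group versus a linear trend, a = 4 groups with n = 10 each.
    for mu in ([0.0, 0.0, 0.0, 1.0], [0.0, 0.4, 0.8, 1.2]):
        print(mu, monte_carlo_power(mu, n=10))
```

The estimates carry simulation error of the order of $1/\sqrt{n_{\text{sim}}}$ from both the critical values and the rejection rates, so small differences between the two procedures should not be over-interpreted at this number of replications.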
ANOVA procedures are commonly applied in statistical practice when more than two samples are compared. They can only be used, however, to test the global null hypothesis, which is often not the main question of the practitioners. Specific information for the local group levels in terms of multiple contrasts, adjusted p-values, and SCI is of particular practical importance. Bretz et al. proposed exact MCTP and SCI which allow for arbitrary user-defined contrasts, e.g. Tukey-type, Dunnett-type, or even changepoint comparisons. Adjusted p-values and SCI for pre-defined or user-defined contrasts can be easily computed using the R packages multcomp and mvtnorm. These procedures provide local information as well as SCI as required by international regulatory authorities. Thus, from a practical point of view, they are preferable for making statistical inferences. Since both the MCTPs and the ANOVA-type procedures can be used to test the same overall null hypothesis, the remaining question is “How much is the price in terms of a loss in power” which needs to be paid for the additional information offered by the MCTP. For the set of all possible kinds of alternatives, the ANOVA is a uniformly most powerful unbiased and invariant test procedure. In this article, we compared the exact power of both the MCTP and the ANOVA and we computed their LFCs to reject the global null hypothesis under two different restrictions. It turned out that both kinds of procedures have the same LFCs under each of the two restrictions. Exact power calculations additionally showed that the power curves of both tests at their LFCs are essentially equal. This gives a reason to claim that MCTPs are not inferior to the ANOVA. Since the LFCs are only two configurations among an infinite number of possible alternatives, however, the question “Are MCTPs superior to the ANOVA?” cannot be answered in general.
The ANOVA is sensitive to many shapes, including convex and concave mean profiles, whereas the MCTPs are mostly sensitive to the pre-specified kind of alternative. The ANOVA, however, cannot provide information on which factor levels cause the statistical difference. Moreover, MCTPs also provide directional decisions, whereas the quadratic form of the F-test provides only two-sided decisions.
We restricted our analysis to one-way normal designs with homoscedastic variances. The investigation of higher-way layouts, e.g. two-way ANOVA models, analysis of covariance models, etc., will be part of future research.
Proof of Theorem 1
The proof follows the same ideas as the proof of Theorems 1 and 2 in Hayter and Liu . By conditioning on the value of the random variable , it is apparent that for any ,
where the function for and is defined as
Here, denote independent normal random variables with variances and means , respectively. Note that is the multivariate distribution function, which can be computed by using the corresponding -variate regular multivariate normal distribution. Now, for any , we have the following four properties for the function
, where the operator permutes coordinates.
is log-concave , i.e. for , and for all ,
The log-concavity of implies by induction that for any
, where , and
Properties 1 and 5 imply that for all .
Proof of Theorem 1.1
Suppose that . Let denote the vectors obtained by permuting and leaving in place. Let and note that . Now, by properties 1–6, it follows that for any ,
Proof of Theorem 1.2
Suppose that . Let denote the vectors obtained by permuting and leaving and in place. Let . Then, by properties 1–6, it follows that for any ,
The proof for Tukey-type comparisons is very similar and is, therefore, omitted; see Hayter and Liu.
The authors are grateful to an Associate Editor and two anonymous referees for helpful comments which considerably improved the article. This work was supported by the German Research Foundation projects DFG-Br 655/16–1 and Ho 1687/9–1.
Bretz F, Genz A, Hothorn LA. On the numerical availability of multiple comparison procedures. Biom J 2001;43:645–56.
Mukerjee H, Robertson T, Wright FT. Comparison of several treatments with a control using multiple contrasts. J Am Stat Assoc 1987;82:902–10.
Hothorn T, Bretz F, Westfall P. Simultaneous inference in general parametric models. Biom J 2008;50:346–63.
Hasler M, Hothorn LA. Multiple contrast tests in the presence of heteroscedasticity. Biom J 2008;50:793–800.
Herberich E, Sikorski J, Hothorn T. A robust procedure for comparing multiple means under heteroscedasticity in unbalanced designs. PLoS One 2010. DOI:10.1371/journal.pone.0009788.
Konietschke F, Hothorn LA. Evaluation of toxicological studies using a nonparametric Shirley-type trend test for comparing several dose levels with a control group. Stat Biopharm Res 2012;4:14–27.
Konietschke F, Libiger O, Hothorn LA. Nonparametric evaluation of quantitative traits in population-based association studies when the genetic model is unknown. PLoS One 2012;7:e31242. DOI:10.1371/journal.pone.0031242.
Djira GD, Hothorn LA. Detecting relative changes in multiple comparisons with an overall mean. J Qual Technol 2009;41:60–5.
Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 1955;50:1096–121.
Tukey JW. The problem of multiple comparisons. Dittoed manuscript, Department of Statistics, Princeton University, Princeton, NJ, 1953.
Bretz F. Powerful modifications of Williams’ test on trend. Ph.D. thesis, University of Hannover, 1999.
Genz A, Kwong KS. Numerical evaluation of singular multivariate normal distributions. J Stat Comput Simul 2000;68:1–21.
Hayter AJ, Liu W. The power function of the studentised range test. Ann Stat 1990;18:465–8.
Hayter AJ, Liu W. A method of power assessment for tests comparing several treatments with a control. Commun Stat Theory Methods 1992;21:1871–89.
Hothorn T, Bretz F, Westfall P. multcomp: simultaneous inference in general parametric models. R package version 0.8-15, 2012. Available at: http://CRAN.R-project.org/
Genz A, Bretz F, Miwa T, Mi X, Leisch F, Scheipl F, et al. mvtnorm: multivariate normal and t distributions. R package version 0.9-9994, 2012. Available at: http://CRAN.R-project.org/
Prekopa A. On logarithmic concave measures and functions. Acta Sci Math 1973;34:335–43.
Horn M, Vollandt R. Sample sizes for comparisons of k treatments with a control based on different definitions of power. Biom J 1998;40:589–612.
Hsu JC. Multiple comparisons: theory and methods. London: Chapman and Hall, 1996.
Liu W. On sample size determination of Dunnett’s procedure for comparing several treatments with a control. J Stat Plann Inference 1997;62:255–61.
ICH. Statistical principles for clinical trials. Guideline, International Conference on Harmonization, 1998. Available at: http://private.ich.org
Bretz F, Genz A. Numerical computation of multivariate t-probabilities with application to power calculation of multiple contrasts. J Stat Comput Simul 1999;63:361–78.
Hayter AJ, Hurn M. Power comparisons between the F-test, the studentised range test, and an optimal test of the equality of several normal means. J Stat Comput Simul 1992;42:173–85.
Gabriel KR. Simultaneous test procedures: some theory of multiple comparisons. Ann Math Stat 1969;40:224–50.
Bretz F, Hothorn T, Westfall P. Multiple comparisons using R. Boca Raton, FL: Chapman & Hall/CRC Press, 2010.