The number of response categories in ordered response models.

The choice of the number m of response categories is a crucial issue in the categorization of a continuous response. The paper exploits the property of proportional odds models that allows ordinal responses with different numbers of categories to be generated from the same underlying variable. It investigates the asymptotic efficiency of the estimators of the regression coefficients and the accuracy of the derived inferential procedures when m varies. The analysis is based on models with closed-form information matrices, so that the asymptotic efficiency can be evaluated analytically, without the need for simulations. The paper proves that a finer categorization augments the information content of the data and, consequently, shows that the asymptotic efficiency and the power of the tests on the regression coefficients increase with m. The impact on the efficiency of the estimators of the loss of information produced by merging categories is also considered, highlighting its risks, especially when merging is performed in its extreme form of dichotomization. Furthermore, the appropriate value of m for various sample sizes is explored, pointing out that a large number of categories can offset the limited amount of information of a small sample through a better quality of the data. Finally, two case studies, on the quality of life of chemotherapy patients and on the perception of pain, both based on discretized continuous scales, illustrate the main findings of the paper.


Introduction
A critical point in surveys with rating questions is the choice of the number m of response categories to use in the discretization of a measurement obtained on a continuous scale (in which the only marks are those related to the minimum and the maximum level). Although the categorization of continuous measurements implies a loss of information, it is a widespread practice in various fields, such as medicine and epidemiology, where researchers often split a continuous scale into ordered categories to make the interpretation of the results easier. Examples are in [1][2][3][4][5], among others, where categorization is performed without prearranged meaningful categories. Furthermore, [6] underline the benefits of transforming continuous responses into ordinal categories when the measurement variable is skewed, because ordinal response models handle floor and ceiling effects better than linear models.
Within the statistical literature the choice of m is discussed by [7], who studies the beneficial impact of an increasing number of categories on standard errors. [8] instead points out that a large m allows a more powerful detection of associations between variables, a result confirmed by [9] with reference to tests on differential item functioning. Furthermore, [10] show that a larger value of m reduces the impact of response errors on the local robustness properties of the estimators in the modeling framework for ordinal data known as CUB models [11].
The current paper investigates the impact that the choice of m has on the efficiency of the estimators when a continuous response variable is discretized and analyzed through a proportional odds model (POM) [12,13] (Section 4). The latter naturally arises when the rating is supposed to be driven by an underlying continuous variable: each rating corresponds to an interval on the support of this variable, and choosing m is equivalent to deciding into how many classes the support is to be partitioned. With respect to alternative modeling frameworks, the POM is extremely parsimonious and is, by far, the most widely applied model in the biomedical context (see [14][15][16], among others).
Closely related to the choice of the number of categories is the practice of combining values/scores by collapsing adjacent categories of an ordinal response, another relevant and widespread issue which arises in processing sample information after data collection. It may be pursued to overcome sparseness problems, to simplify interpretation or to deal with extreme response styles [17]. For a given dataset, changing the number of categories can affect the inferential results obtained from the data [18,19]. Other studies focus on merging categories to reduce the size of contingency tables by using the homogeneity of the corresponding rows (or columns) or the structure criterion (see [20] and references therein). Section 6 of the current paper shows that collapsing categories, in the case of a univariate ordinal response, reduces the information content of the sample, generating a loss of efficiency which becomes extremely high in the case of dichotomization.
Another critical point in data analysis concerns the appropriate number of categories with respect to a given sample size n [21]. Section 7 shows that increasing the number of categories enhances the efficiency of the estimators even if n is small. The relationship between m and n represents a crucial point, discussed also in the closely related field of the association among categorical variables (see [20,22,23], among others).
In summary, the paper handles three topics regarding the data analysis of discretized continuous variables and ordinal variables: the choice of the number of response categories, the consequences of collapsing categories, and the relationship between m and the sample size n. It is organized as follows. The next Section provides a brief overview of the POM, whereas Section 3 describes the models used for the analysis. The information matrices of these models are analytically derivable, so that the evaluation of the asymptotic efficiency of the estimators of the regression coefficients can be carried out without the need for simulations. The impact of the choice of the number of response categories on efficiency and on hypothesis testing is investigated in Sections 4 and 5. The effect of various forms of merging categories is examined in Section 6, and the relationship between the number of categories and the sample size is analyzed in Section 7. In Section 8 two case studies from the medical context illustrate the main findings of the paper: the first concerns the perceived health-related quality of life of chemotherapy patients, whereas the second deals with the pain perceived by women during labor. Final remarks end the paper.

Theoretical background
In the POM framework, the ordinal response Y depends on an underlying continuous variable Y* through the relationship

Y = j if and only if α_{j−1} < Y* ≤ α_j, for j = 1, …, m,   (2.1)

where −∞ = α_0 < α_1 < ⋯ < α_m = +∞ are the thresholds of the support of Y*. The choice of m determines into how many intervals the support of Y* is divided.
The variable Y*, in turn, depends on p ≥ 1 covariates, so that for the i-th statistical unit we have the regression model

Y*_i = x_i′β + ε_i,   (2.2)

where β = (β_1, …, β_p)′ is the vector of the regression coefficients and ε_i is the error term. The collapsibility property of the POM is exploited to analyze how the choice of m affects the efficiency of the estimators and the accuracy of the derived inferential procedures. Let θ = (α′, β′)′ be the parameter vector, where α = (α_1, …, α_{m−1})′ is the vector of the thresholds. Given an observed random sample (y_i, x_i), for i = 1, 2, …, n, the log-likelihood function is

ℓ_n(θ) = Σ_{i=1}^{n} Σ_{j=1}^{m} I[y_i = j] log P(Y = j | x_i),

where I[·] is an indicator function which takes value 1 when its argument holds and 0 otherwise, and the corresponding score function is S_n(θ) = Σ_{i=1}^{n} S(θ; y_i, x_i) (see [24] for the analytic expression of S(θ; y_i, x_i)). The maximum likelihood estimator (MLE) is the solution θ̂ = (α̂′, β̂′)′ of S_n(θ̂) = 0.
The generic (r, s)-th term of the information matrix I(θ; x) for a single statistical unit, conditionally on X = x, is given by

I_{rs}(θ; x) = E[S_r(θ; Y, x) S_s(θ; Y, x) | X = x],   (2.3)

where S_r(θ; y, x) is the element of the score function related to the r-th element θ_r of θ. The elements of the unconditional information matrix I(θ) are obtained by taking the expectation of (2.3) with respect to the distribution of the covariates, I_{rs}(θ) = E_X[I_{rs}(θ; X)]. The asymptotic variance-covariance matrix of the MLE is I(θ)^{−1}. In particular, for the estimator β̂_k of the single coefficient β_k, the asymptotic variance is Var(β̂_k) = [I(θ)^{−1}]_{kk}, where [I(θ)^{−1}]_{kk} is the element on the diagonal of I(θ)^{−1} corresponding to β_k.
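These definitions admit a simple numerical check. The following minimal sketch (pure Python, logit link; the threshold values, coefficient value and covariate value are arbitrary illustrative choices, and the score is approximated by central finite differences of log P(Y = j | x) rather than taken from the analytic expression in [24]) computes the conditional information matrix as the expectation of the outer product of the score and verifies that each score component has zero conditional mean:

```python
import math

def pom_probs(alpha, beta, x):
    """P(Y = j | x) under a cumulative-logit POM: P(Y <= j | x) = F(alpha_j - beta*x)."""
    F = lambda t: 1.0 / (1.0 + math.exp(-t))
    cum = [0.0] + [F(a - beta * x) for a in alpha] + [1.0]
    return [cum[j + 1] - cum[j] for j in range(len(alpha) + 1)]

def score_matrix(alpha, beta, x, h=1e-6):
    """sc[r][j] = d log P(Y = j | x) / d theta_r, with theta = (alpha_1,...,alpha_{m-1}, beta)."""
    theta = list(alpha) + [beta]
    k, m = len(theta), len(alpha) + 1
    sc = []
    for r in range(k):
        tp = theta[:]; tp[r] += h
        tm = theta[:]; tm[r] -= h
        pp = pom_probs(tp[:-1], tp[-1], x)
        pm = pom_probs(tm[:-1], tm[-1], x)
        sc.append([(math.log(pp[j]) - math.log(pm[j])) / (2 * h) for j in range(m)])
    return sc

alpha, beta, x = [-1.0, 0.0, 1.0], 1.5, 0.3   # m = 4 categories, one covariate
p = pom_probs(alpha, beta, x)
sc = score_matrix(alpha, beta, x)
k = len(alpha) + 1

# Conditional information I_rs(theta; x) = E[S_r S_s | x]: expectation over Y given x
info = [[sum(p[j] * sc[r][j] * sc[s][j] for j in range(len(p))) for s in range(k)]
        for r in range(k)]

# Each score component has zero conditional expectation
mean_score = [sum(p[j] * sc[r][j] for j in range(len(p))) for r in range(k)]
```

The zero-mean property of the score is what makes the expected outer product coincide with the information matrix; it holds here up to the finite-difference error.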

The models
To investigate the asymptotic efficiency of the estimators of the regression coefficients when m varies, we focus on models whose (unconditional) information matrix can be analytically derived through (2.3). In particular we consider the following three underlying regression models.
-Model 1 (with a continuous covariate). The variable Y* depends on a continuous covariate, Y* = Xβ + ε, where X ∼ N(0, 1) and β = 1.5. The information matrix is given by

I(θ) = ∫ I(θ; x) φ(x) dx,

where φ(⋅) is the standard normal density function.
-Model 2 (with dichotomous covariates). The variable Y* depends on two dichotomous covariates, Y* = X_1 β_1 + X_2 β_2 + ε, where X_1 ∼ Ber(0.5), X_2 ∼ Ber(0.25) and X_1 and X_2 are mutually independent. The regression coefficients are β_1 = 1.5 and β_2 = 0.7. Denote the conditional information matrix, given X_1 = x_1 and X_2 = x_2, by I(θ; x_1, x_2); the information matrix is given by

I(θ) = Σ_{x_1} Σ_{x_2} I(θ; x_1, x_2) P(X_1 = x_1) P(X_2 = x_2).

-Model 3 (with mixed covariates). The variable Y* depends on a continuous covariate, a dichotomous one and their interaction. The regression model is Y* = X_1 β_1 + X_2 β_2 + X_1 X_2 β_3 + ε, where X_1 ∼ N(0, 1), X_2 ∼ Ber(0.5) and X_1 and X_2 are mutually independent. The regression coefficients are β_1 = 2.7, β_2 = 1.5 and β_3 = 0.7. In obvious notation, the information matrix is given by

I(θ) = Σ_{x_2} [∫ I(θ; x_1, x_2) φ(x_1) dx_1] P(X_2 = x_2).

Notice that Model 3 is generally used in the analysis of differential item functioning in grading scales (see [9] and references therein).
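Although the paper evaluates these information matrices analytically, the same quantities can be approximated numerically. As a self-contained sketch under stated assumptions (pure Python, Model 1 with the probit link, equal-probability thresholds chosen for convenience, expectation over X by a simple midpoint rule on a truncated grid, scores by finite differences), the following reproduces the paper's qualitative finding that the asymptotic variance of β̂ decreases when m grows from 3 to 7:

```python
import math

def Phi(t): return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
def phi(t): return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi_inv(p):
    lo, hi = -10.0, 10.0
    for _ in range(100):                 # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p: lo = mid
        else: hi = mid
    return 0.5 * (lo + hi)

def cat_probs(alpha, beta, x):
    """P(Y = j | x) under the probit POM: P(Y <= j | x) = Phi(alpha_j - beta*x)."""
    cum = [0.0] + [Phi(a - beta * x) for a in alpha] + [1.0]
    return [max(cum[j + 1] - cum[j], 1e-12) for j in range(len(alpha) + 1)]

def information(alpha, beta, h=1e-5, grid=300, lim=3.5):
    """Unconditional information of Model 1: expectation over X ~ N(0,1)."""
    theta = list(alpha) + [beta]
    k, m = len(theta), len(alpha) + 1
    info = [[0.0] * k for _ in range(k)]
    dx = 2.0 * lim / grid
    for g in range(grid):
        x = -lim + (g + 0.5) * dx
        w = phi(x) * dx
        p = cat_probs(alpha, beta, x)
        sc = []
        for r in range(k):
            tp = theta[:]; tp[r] += h
            tm = theta[:]; tm[r] -= h
            pp = cat_probs(tp[:-1], tp[-1], x)
            pm = cat_probs(tm[:-1], tm[-1], x)
            sc.append([(math.log(pp[j]) - math.log(pm[j])) / (2 * h) for j in range(m)])
        for j in range(m):
            for r in range(k):
                for s in range(k):
                    info[r][s] += w * p[j] * sc[r][j] * sc[s][j]
    return info

def inverse(M):
    """Gauss-Jordan inverse of a small positive definite matrix."""
    k = len(M)
    A = [list(M[i]) + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [v - f * u for v, u in zip(A[r], A[c])]
    return [row[k:] for row in A]

beta = 1.5
sd = math.sqrt(1.0 + beta * beta)        # Y* = beta*X + eps is N(0, 1 + beta^2)
avar = {}
for m in (3, 7):
    alpha = [sd * Phi_inv(j / m) for j in range(1, m)]   # equal-probability thresholds
    avar[m] = inverse(information(alpha, beta))[-1][-1]  # asymptotic Var(beta-hat)
```

The variance for m = 7 comes out smaller than for m = 3, in line with the efficiency gains reported in Tables 1-4, although the exact figures here depend on the grid truncation and the finite-difference step.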
We consider the probit, the logit and the complementary log-log link functions, which assume the Gaussian, the logistic and the extreme value distribution for ε_i in (2.2), respectively.
Furthermore, in the analysis of the asymptotic efficiency, four different sets of thresholds are considered. Following the literature [7,25,26], a first set is given by equidistant thresholds ED_j, which satisfy the constraint ED_j − ED_{j−1} = h. Alternative thresholds ID_j are obtained by considering smaller classes around the median of Y*, and by progressively increasing the length of the classes by a factor of (1 + 1/m) when moving towards the tails. A further set of thresholds EP_j splits the support of Y* into classes of equal probability, so that P(Y = j) = 1/m for j = 1, …, m. A final set of thresholds DP_j corresponds to classes with larger probability at the center and probability decreasing, by a factor of (1 − 1/m), when moving from the center towards the extremes. Details on the construction of the four sets of thresholds are given in the supplementary material. Each set of thresholds generates its specific distribution of the ordinal response from the same underlying variable. The distributions produced by the four sets of thresholds, in Model 1 with the logit link and m = 7, are displayed in Figure 1 and appear markedly different.
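The ED_j and EP_j sets are straightforward to construct. A minimal sketch (illustrated, for simplicity, on a standard logistic latent variable rather than on the full Model 1 latent distribution; the centring of the equidistant set at zero and the unit step h are arbitrary choices, and the ID_j and DP_j constructions are left to the supplementary material):

```python
import math

def ed_thresholds(m, h=1.0):
    """Equidistant thresholds: ED_j - ED_{j-1} = h, centred at zero."""
    return [(j - m / 2.0) * h for j in range(1, m)]

def ep_thresholds(m, quantile):
    """Equal-probability thresholds: each of the m classes receives probability 1/m."""
    return [quantile(j / m) for j in range(1, m)]

logistic_quantile = lambda p: math.log(p / (1.0 - p))
logistic_cdf = lambda t: 1.0 / (1.0 + math.exp(-t))

m = 7
ED = ed_thresholds(m)
EP = ep_thresholds(m, logistic_quantile)

# Category probabilities induced by the EP thresholds: each equals 1/m
cum = [0.0] + [logistic_cdf(a) for a in EP] + [1.0]
probs = [cum[j + 1] - cum[j] for j in range(m)]
```

Each set of thresholds, applied to the same latent variable, induces a different distribution of the ordinal response, which is exactly the mechanism behind Figure 1.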
Since the analytical expression of the (unconditional) information matrix is available for the three models, most of the analyses carried out in the following sections, and in particular the evaluation of the asymptotic efficiency of the estimators, the approximation of the power of the test and the assessment of the loss of efficiency produced by collapsing categories and dichotomization, do not require simulations. Numerical experiments are carried out only for Figure 3 and, more extensively, in Section 7, where the appropriate choice of m for a given sample size n is investigated. The purpose is to take into account numerical issues which may arise for specific combinations of m and n, especially when n is small. Estimation is implemented in the R package MASS [27] and the simulation is always performed on 10 000 samples. Both the analytical results and the simulation experiments consider a number of parameters increasing with m, since estimation involves m − 1 thresholds in addition to the p regression coefficients.
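The simulation workflow (generate latent values, discretize, fit the POM) can be sketched outside R as well. The following pure-Python toy version, offered only as an illustration, holds the thresholds fixed at their true values so that the fit reduces to a one-dimensional concave maximization in β, solved by ternary search; the sample size, threshold values and search interval are arbitrary choices, not taken from the paper:

```python
import math, random

random.seed(42)

def simulate_pom(n, alpha, beta):
    """Draw (y, x) from a Model-1-type POM with the logit link:
    Y* = beta*X + eps, eps standard logistic, Y = j iff alpha_{j-1} < Y* <= alpha_j."""
    data = []
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        u = min(max(random.random(), 1e-12), 1.0 - 1e-12)
        ystar = beta * x + math.log(u / (1.0 - u))   # logistic draw by inversion
        y = sum(1 for a in alpha if ystar > a)       # category index 0..m-1
        data.append((y, x))
    return data

def loglik(beta, alpha, data):
    F = lambda t: 1.0 / (1.0 + math.exp(-t))
    ll = 0.0
    for y, x in data:
        cum = [0.0] + [F(a - beta * x) for a in alpha] + [1.0]
        ll += math.log(max(cum[y + 1] - cum[y], 1e-300))
    return ll

def profile_mle(alpha, data, lo=0.0, hi=3.0, iters=60):
    """Ternary search: the cumulative-logit log-likelihood is concave in beta
    once the thresholds are held fixed."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loglik(m1, alpha, data) < loglik(m2, alpha, data):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

alpha = [-2.0, -0.7, 0.7, 2.0]                 # m = 5 fixed thresholds
data = simulate_pom(1500, alpha, beta=1.5)
beta_hat = profile_mle(alpha, data)
```

In the paper the thresholds are of course estimated jointly with β (via MASS::polr in R); fixing them here merely keeps the sketch short while preserving the generate-then-fit structure of the experiments.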

The asymptotic efficiency of the estimators
When m increases, each category of Y corresponds to a smaller class on the support of Y*, and the resulting finer categorization yields more information on the underlying variable. In this context it is useful to recall that, given an event E, its information content is −log{P(E)}: the smaller P(E), the larger the information ([28]; p. 4), since the occurrence of rare events is more informative than the occurrence of likely ones. Now suppose we have a finer discretization of the support of Y* in m classes C_1, …, C_m, and a coarser categorization of Y* in m′ classes C̃_1, …, C̃_{m′}, with m′ < m. Let Y and Ỹ be the responses obtained from the discretization in m and m′ classes, respectively. When Y = j, Ỹ takes a value k such that C_j ∩ C̃_k ≠ ∅. Since Ỹ is based on a coarser categorization, it is reasonable to assume that P(Y = j) = P(Y* ∈ C_j) ≤ P(Ỹ = k) = P(Y* ∈ C̃_k). Under this condition, the following result holds.

Proposition 1. Let Y* ∼ F. Given two alternative discretizations of its support in the classes C_1, …, C_m and C̃_1, …, C̃_{m′}, with m′ < m, such that P(Y* ∈ C_j) ≤ P(Y* ∈ C̃_k) whenever C_j ∩ C̃_k ≠ ∅, then

−log P(Y = j) ≥ −log P(Ỹ = k).   (4.1)

Equation (4.1) implies that the information content of Y is larger than that of Ỹ. Hence a finer categorization provides more information content to the data, which, in turn, yields more efficient estimators.
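The effect is easy to verify numerically. A small sketch (m = 6 equal-probability classes, chosen purely for illustration) compares the information content of the finer categorization with that of the coarser one obtained by merging the first two classes, both per outcome and on average (the average of −log P is the Shannon entropy, which never increases under merging):

```python
import math

# Category probabilities for m = 6 equal-probability classes of the latent
# variable, and the coarser version obtained by merging the first two classes.
fine = [1.0 / 6.0] * 6
coarse = [fine[0] + fine[1]] + fine[2:]

# Information content -log P of each outcome under the two categorizations
info_fine = [-math.log(p) for p in fine]
info_coarse = [-math.log(p) for p in coarse]

# Expected information (Shannon entropy) of the two categorizations
H_fine = sum(p * i for p, i in zip(fine, info_fine))
H_coarse = sum(p * i for p, i in zip(coarse, info_coarse))
```

The merged outcome carries strictly less information than either of the outcomes it replaces, and the expected information drops accordingly.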
Tables 1-4 illustrate the asymptotic efficiency of the estimators of the regression coefficients in Models 1, 2 and 3 for the four sets of thresholds. In Model 1 there is a single regression coefficient, so that the efficiency is measured by its asymptotic variance Var(β̂). In the case of Models 2 and 3, where there is a vector of regression coefficients, the efficiency is measured by the trace of the portion of the asymptotic variance-covariance matrix related to the regression coefficients, i.e. by the sum of the asymptotic variances of the elements of β̂. The outcomes clearly point out that, for all the sets of thresholds, the efficiency of β̂ increases with m, in accordance with the larger amount of information associated with a finer categorization. The decrease of the asymptotic variances is especially marked for low values of m, while it tapers off when m increases, and this behavior is shared by the three models and the three links. Increasing m up to 7 usually yields considerable gains in efficiency, while only marginal benefits are obtained by increasing m beyond 10.
As concerns the impact of the thresholds on the asymptotic efficiency, Tables 1-4 show that no set of thresholds outperforms the others, and the optimal set depends on the model as well as on the value of m and on the link (see also Figures S.1, S.2, and S.3 in the supplementary material). In Model 1, with the probit and the logit link, the DP_j thresholds produce more efficient estimators for small m, while when m ≥ 6 the ED_j and ID_j thresholds are to be preferred. In the same model, the best results for the complementary log-log link are generally obtained with the DP_j thresholds. In Model 2, with the probit and the complementary log-log link, larger efficiency is usually achieved by using the DP_j thresholds for small m and the ID_j thresholds for larger m, while with the logit link the best thresholds are the EP_j. In Model 3 with the probit link the DP_j thresholds appear preferable for small m and the ID_j and ED_j ones for m > 7, while the best thresholds for the logit link and the complementary log-log link are the ID_j and the DP_j. Although no set of thresholds dominates the others, appreciable differences in efficiency are observed only for small m, while the efficiency of the estimators obtained with the four sets of thresholds gets progressively closer when m increases. Hence a sufficiently large m can reduce the loss of efficiency produced by an inappropriate choice of the thresholds. In the following sections of the paper, the ED_j thresholds will be considered. This choice is motivated by their simplicity and frequency of use in applications (whenever no other indications arise from the concrete problem at hand), without implying any preference in this direction. Analogous analyses for the other sets of thresholds are reported in the supplementary material (Section S.1).
The efficiency of β̂ affects also the efficiency of the derived measures used to investigate the impact of the covariates and, in particular, the efficiency of the estimators of the odds ratio (OR). As an example, consider the OR associated with a unit increase in a covariate: being a smooth transformation of the corresponding regression coefficient, its estimator inherits, via the delta method, the (in)efficiency of β̂.

Hypothesis testing
The choice of m affects also the power of the test. Consider the hypotheses on a single regression coefficient

H_0: β_k = 0 vs H_1: β_k ≠ 0,

tested through the Wald statistic W = β̂_k / se(β̂_k), which under H_0 is asymptotically distributed as a standard normal. Hence the power of the test can be approximated by

γ(β_k) ≈ 1 − Φ(z − β_k √n / σ_k) + Φ(−z − β_k √n / σ_k),   (5.1)

where z is the critical value of the standard normal distribution at the chosen two-sided significance level, σ_k² = [I(θ)^{−1}]_{kk} and Φ(⋅) is the standard normal distribution function. To investigate the impact of the choice of m on γ(β_k) we consider the null hypothesis H_0: β_3 = 0 in Model 3. It implies that the interaction between X_1 and X_2 is omitted from the latent model, which becomes Y* = X_1 β_1 + X_2 β_2 + ε. Table 5 shows the power of the test, computed analytically through (5.1), at the 5% significance level, for the sample sizes n = 250, 500 and the three links (see also the analogous Tables S.1 of the supplementary material). The results in Table 5 are based on the analytically evaluated asymptotic efficiency. To take into account also the numerical issues which may arise when the estimation is actually implemented, Figure 3 shows the power of the test assessed through a simulation with the logit link, for sample sizes between 100 and 500. The magnitude of γ(β_k) is mainly determined by the sample size. Nevertheless, the gain in power achieved by increasing the number of categories can be appreciated, especially when the initial m is small, say m ≤ 6, though the marginal benefits decrease with m.
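The power approximation described above can be sketched as follows (pure Python; the asymptotic variances 9.0 and 6.0 are hypothetical placeholders standing for a coarser and a finer categorization, not values from Table 5):

```python
import math

def Phi(t): return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def Phi_inv(p):
    lo, hi = -10.0, 10.0
    for _ in range(100):                 # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p: lo = mid
        else: hi = mid
    return 0.5 * (lo + hi)

def wald_power(beta, avar, n, level=0.05):
    """Approximate power of the two-sided Wald test of H0: beta = 0, when the
    estimator has per-observation asymptotic variance avar and sample size n."""
    z = Phi_inv(1.0 - level / 2.0)
    delta = beta / math.sqrt(avar / n)   # noncentrality of the Wald statistic
    return 1.0 - Phi(z - delta) + Phi(-z - delta)

# A finer categorization (larger m) yields a smaller asymptotic variance and
# therefore more power at the same sample size.
power_coarse = wald_power(0.7, avar=9.0, n=250)
power_fine = wald_power(0.7, avar=6.0, n=250)
```

The same function shows the dominant role of n: doubling the sample size raises the power at either variance level.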

Merging categories
A widespread practice in data analysis is collapsing adjacent categories into one larger category (see [8,18,19], among others). This is typically done with the extreme categories when there is a concern about their frequencies being very low (for instance in the case of extreme response styles, which cause the observations to concentrate on only one side of the scale). An alternative reason for merging arises when a limited sample size yields unobserved categories or categories with very low frequencies. Finally, the reduction of the number of categories may be aimed at simplifying the interpretation, reaching its limit when the response is dichotomized.
In the main literature on categorical data, merging categories is a common practice for reducing the dimension of contingency tables (see [29], among others) and for avoiding sparseness or small cell entries, especially at the edges of the classification scale. However, an easier interpretation of the model is also a recurrent motivation (see [20], among others), and the guiding criteria for merging are homogeneity and structure (see [30][31][32] for further details).
In this paper merging categories is considered for a single variable, i.e. the ordinal response Y. Because of the collapsibility property of the POM, the regression parameters remain unchanged when the categories are merged. Nevertheless, it is important to point out that collapsing categories reduces the information content of the sample outcome, as shown by the following proposition.

Proposition 2. Collapsing the categories of an ordinal response reduces the information content of the sample outcome. See the Appendix for the proof.
Clearly the loss of information produced by collapsing is likely to turn into a loss of efficiency. Here we investigate the impact of various forms of merging categories. When the extreme categories are involved, we consider both merging performed symmetrically on the two sides of the scale and merging implemented on one side only. A selection of cases is illustrated in the current section, while a more extensive investigation is carried out in the supplementary material (Section S.2). Furthermore, the impact of halving the number of categories is analyzed, and finally the effect of dichotomizing the response is examined (see also Section S.1.4 of the supplementary material for similar analyses with the ID_j, EP_j and DP_j thresholds).
For the first form of merging, Model 1 with the logit link is considered. In this model the distribution of the underlying variable is symmetric, so that (to avoid low extreme frequencies) it is reasonable to join both the first two categories and the last two. Consequently the first and the last thresholds α_1 and α_{m−1} are neglected, and the categories are based on the remaining thresholds α_2, …, α_{m−2}. This procedure reduces the number of categories from m to m − 2. Table 6 shows the asymptotic efficiency ratio between the "before merging" estimator β̂_m and the "after merging" estimator β̂_{m−2}. The efficiency loss is considerable when m is small (m = 5, 6, 7). When the number of categories is reduced from 5 to 3 the loss of efficiency can be as high as almost 22%. The loss of efficiency is restrained, not exceeding 5%, when the number of categories is m > 7 and the probability of the extreme categories (which disappear) is below 0.025. Similar results for the value of m (which should be fairly large, say m ≥ 10) and the probability of the vanishing categories (which should be sufficiently low, say around 0.025) are observed also for the other models and for the probit link (see Tables S.2.1-S.2.5 of the supplementary material).
To investigate the consequences of merging when it occurs on one side only, reference is still made to Model 1, but the link is now the complementary log-log one, so that the underlying variable has a skewed distribution with low probability on the first categories. When the first two categories are merged the number of scale points is reduced from m to m − 1, since the threshold α_1 is neglected. Consequently the "before merging" estimator is β̂_m and the "after merging" estimator is β̂_{m−1}. Table 7 shows the efficiency ratio between β̂_m and β̂_{m−1}. When merging occurs on one side only, the loss of efficiency is less dramatic than when both tails are involved: it is below 5% when m ≥ 6 and the probability of the first category is sufficiently low. The impact of halving the number of scale points is measured by the ratio Σ_k Var(β̂_{⌈m/2⌉,k}) / Σ_k Var(β̂_{m,k}), where Var(β̂_{m,k}) is the asymptotic variance of the k-th element of the estimator β̂_m obtained from m categories. Halving the number of scale points can have a remarkably high price in terms of efficiency, especially when the number of covariates increases or the link is the probit or the complementary log-log one (the neglected information turns out to be especially valuable in these cases). Furthermore, consistently with previous results, the negative effect of merging is larger when the initial value of m is small, since collapsing then produces a much coarser categorization. Finally, a common practice in applications is to reduce an ordinal response to a dichotomous one to ease interpretation (see [33,34], among others). Table 9 shows the cost in terms of efficiency to be paid for dichotomization, measured by the analogous ratio Σ_k Var(β̂_{2,k}) / Σ_k Var(β̂_{m,k}). The loss of efficiency due to dichotomization can be extremely severe (see also [35] and [36] for similar results). If a response with 4 categories is dichotomized, the efficiency ratio varies between roughly 1.2 and more than 5 (see Model 3). The loss of efficiency steadily increases with m.
In the worst case, Model 3 with the probit or the complementary log-log link, if a 10-point response is dichotomized the efficiency ratio largely exceeds 10, and it gets even worse for larger m. Table 9 shows also a different pattern for the three link functions: although dichotomization has a considerable impact for all of them, the estimators obtained with the logit link appear to exploit the reduced amount of information better, limiting the loss of efficiency. These outcomes call for a recommendation against the use of dichotomization. Similar suggestions can also be found in [8,12,37,38]. In particular, [36] describe dichotomization as an arbitrary choice of the researcher and show that the loss of efficiency can be exacerbated by the selection of an inappropriate cut-point.
Overall, the above results point out that a reduction of the number of categories, in any of the forms considered here, decreases the amount of information and can produce a remarkable loss of efficiency, especially when the merging involves the central categories (with higher frequencies) or reaches the limiting case of dichotomization.

Choice of m for given n
A question which frequently arises when setting the number of response options concerns the appropriate number of categories for a given sample size n. On the one hand, the positive relationship between efficiency and m suggests a large number of categories. On the other hand, if the sample size n is small, one or more categories may produce no observations when m increases.
In this regard, it is to be pointed out that, although in other statistical contexts categories with null frequencies give rise to the well-known problems of sparse data, in the POM context a missing category in the sample produces no computational problems. Indeed, when one or more categories are unobserved, the estimation of the model can still be carried out by considering only the sampled categories.
The relationship between m and n is investigated through a simulation (with the details given in Section 3) to take into account the numerical issues which may arise for specific combinations of n and m, especially when the sample size is small. The analysis is carried out in the context of Model 3 with the logit link, though similar results are obtained for the other models and the other links, as reported in Section S.3 of the supplementary material (see also Section S.1.5 for thresholds different from the ED_j).
Let m_obs be the observed number of categories. Table 10 shows the percentage of samples such that m_obs < m. This percentage is extremely large for n = 50 or n = 100, though it rapidly decreases as n increases: it is below 1.5% for n = 300 and becomes negligible for n = 400. These results indicate that, in order to avoid unobserved categories, the sample size should be n ≥ 300 (see also Tables S.3.1-S.3.8 of the supplementary material for analogous results on the other models and the other links). Table 11 shows the sum of the mean square errors of β̂ computed on the same samples as Table 10. It is to be pointed out that, regardless of whether the observed number of categories corresponds to m or not, the efficiency of β̂ increases with m for any sample size, in agreement with the results of Section 4. Although [39] notice that when n is small and m is large maximum likelihood can yield biased estimators of the regression coefficients, the larger amount of information provided by a finer categorization produces a reduction in variance sufficiently large to offset the bias, yielding decreasing mean square errors. Hence the circumstance that some categories may be missing in the sample does not alter the positive relationship between the efficiency of β̂ and m, which holds for any sample size. These results indicate that a small n requires a larger number of categories, to compensate the limited availability of data with more information on the underlying variable, i.e. with a better quality of the data. The choice of m becomes less crucial when n increases, because the waste of information produced by a coarser categorization is balanced by a larger amount of data. Hence, m needs to be large especially when n is small.

Case studies
Two case studies concerning the Linear Analogue Self-Assessment (LASA) scale and the visual analog pain scale (VAPS) are considered. The aim of the analysis is to show the impact of the choice of m in the discretization of scales which are endpoint-anchored lines. Researchers are interested in investigating the self-assessment of the quality of life (in the first example) and the perceived pain (in the second example), originally measured on an interval scale. In both examples an increasing number of categories reduces the standard errors of the estimates and improves their significance. The second case study also involves a small sample size, illustrating the advisability of a relatively large m when the number of interviewees is limited.

Quality of life measured on linear analogue self-assessment scale
Data stem from the ANZ0001 trial conducted by the ANZ Breast Cancer Trials Group with the aim of assessing health-related quality of life of patients with advanced breast cancer [40]. Our analysis focuses on the overall quality of life, recorded on an LASA scale, normalized to (0, 100) where 0 represents 'as bad as it can be' and 100 'as good as it can be'. The treatments intermittent capecitabine (IC) and continuous capecitabine (CC) are compared with the standard combination treatment (CMF), each with its own protocol.
The chemotherapy cycle number (cycle num.) and the body surface area in m² (body surface) are recorded for each assessment of the quality of life, in addition to the treatment (Treatment). The dataset, which contains 2473 observations, is available in the R package ordinalCont [41]; see also [42]. The regression model corresponding to (2.2) is

Y*_i = β_1 cycle num_i + β_2 body surface_i + β_3 Z^IC_i + β_4 Z^CC_i + ε_i,

where Z^IC_i and Z^CC_i are dichotomous variables which identify the modalities IC and CC of the nominal variable Treatment, while CMF is the reference category.
The LASA scale has been discretized into equal-length intervals with m varying between 3 and 15. The fitted models with the logit link are shown in Table 12.
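The equal-length discretization used here is straightforward to implement. A small sketch (the closed right endpoint convention and the 1..m labelling are obvious choices, not prescriptions from the paper):

```python
import math

def discretize(score, m, lo=0.0, hi=100.0):
    """Map a continuous score on (lo, hi) to one of m equal-length ordered
    categories, labelled 1..m."""
    j = int(math.floor((score - lo) / (hi - lo) * m)) + 1
    return min(max(j, 1), m)   # scores exactly at hi fall in category m

scores = [0.0, 12.5, 33.3, 50.0, 99.9, 100.0]
cats7 = [discretize(s, 7) for s in scores]   # -> [1, 1, 3, 4, 7, 7]
```

Refitting the POM on the same scores with different values of m is then just a matter of changing the second argument, which is how the sequence of models in Table 12 is produced.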
Consistently with the analytical results of the previous sections, the standard errors of the estimators decrease with m. Consequently the estimated coefficient of the variable Z^IC, which is not significant for m = 3 and m = 5, becomes significant for m ≥ 7, pointing out that this type of Treatment can negatively affect the patients' quality of life. Different effects of the two treatments IC and CC can be tested by considering the null hypothesis β_4 − β_3 = 0. The t-statistic, for varying m, is reported in Table 13. As m increases, it becomes evident that the CC treatment leads to a better quality of life than the IC treatment. There is instead no significant difference between CC and CMF. These outcomes show that an increasing number of categories may enhance model specification.

Pain measured on visual analog pain scale
Data are about a small sample of 56 women aged between 23 and 44 years, interviewed about the pain they perceived during labor until childbirth. These women delivered at the Città di Roma Hospital (Rome) or at the Saint Raffaele Hospital (Milan), and most of them attended the hospitals' childbirth preparation classes there. Details on these data are in [43]. The perceived pain has been collected by means of a VAPS. It consists of a slide rule with the patient's side unmarked and the observer's side marked from 0 to 100 mm, where 0 represents 'no pain' and 100 represents 'worst pain ever'. We consider discretizations of the VAPS with a number of intervals from m = 3 to m = 11, the maximum rating considered for the analysis of pain (for comparison with the Numerical Rating Pain Scale, see [44]). The position of the unborn child (head), the participation in the pre-birth course (course) and the occurrence of previous events in which the women perceived pain (previous pain) are the three covariates. The regression model corresponding to (2.2) is

Y*_i = β_1 head_i + β_2 course_i + β_3 previous pain_i + ε_i.

The fitted models with the complementary log-log link are in Table 14. In accordance with previous results, the standard errors decrease with m. The estimated coefficients of Head and Course, which are not significant for m = 3, become significant for m ≥ 5. The hypothesis β_1 = β_2 = 0, which implies that Head and Course do not affect the perceived pain, can be tested through the likelihood ratio (LR) test. The corresponding statistic and the related p-value are reported in Table 15. A large m is required to detect the relevance of these two covariates as explanatory factors. Despite the small sample size, a large number of categories is necessary to convey power to the tests.

Final remarks
The paper exploits the collapsibility property of cumulative models with the proportional odds assumption, which allows ordinal responses with different numbers of categories to be generated from the same underlying variable, and investigates the impact of the choice of m on the reliability of inferential analyses. It proves that increasing m augments the information content of the data, yielding more efficient estimators and more powerful tests. However, the benefits of increasing m are considerable when the initial number of categories is small, and become progressively smaller as m increases. The analyses carried out in the paper suggest values of m between 7 and 10. This range of values also limits the impact of inappropriate thresholds used in the categorization of continuous measurements. Since the variance of the estimators decreases with m, the opportunity of merging categories should be carefully evaluated. Combining extreme categories should be applied only when m ≥ 10 and the probability of the vanishing category is sufficiently small (say around 0.025). Halving the number of categories appears an inconvenient procedure in terms of efficiency. Dichotomization is the most critical practice, because it produces an extremely severe loss of information and, consequently, of efficiency, which can be only partially restrained by choosing the logit link instead of alternative link functions.
Finally, numerical simulations show that increasing m enhances the efficiency even if the sample size is small. A high number of scale points is recommended to gather all the information contained in the sample, especially if the sample is of limited size. These experiments also indicate that a sample size n ≥ 300 largely avoids unobserved categories and produces sufficiently efficient estimators.
These findings are illustrated through two case studies based on the discretization of continuous scales. In both cases an increasing number of categories reveals the relevance of explanatory variables which may remain undetected if the categorization is too coarse. Hence a large m enhances model specification.

Proof of Proposition 2.
Let Y = (Y_1, …, Y_m) be a random vector with a multinomial distribution with parameter π = (π_1, π_2, …, π_m), where π_j ≥ 0 for j = 1, …, m, and Σ_{j=1}^{m} π_j = 1. For a single trial its probability mass function is

P(Y = y) = ∏_{j=1}^{m} π_j^{y_j}.

Suppose, without loss of generality, that the first two categories are merged. Since Y_1 = 1 and Y_2 = 1 are incompatible events, the event (Y_1 = 1) ∪ (Y_2 = 1) is observed with probability π_1 + π_2. The probability mass function of the merged vector Y^M = (Y_1 + Y_2, Y_3, …, Y_m) is

P(Y^M = y^M) = (π_1 + π_2)^{y_1 + y_2} ∏_{j=3}^{m} π_j^{y_j}.

Since π_1 + π_2 ≥ π_j for j = 1, 2, it follows that

−log P(Y = y) ≥ −log P(Y^M = y^M).   (A.1)

Inequality (A.1) shows that there is more information in the original distribution, with a larger number of categories, than in the distribution derived from merging, i.e. collapsing categories reduces the amount of sample information. □