Nonparametric methods of statistical inference for double-censored data with applications

This article introduces new nonparametric statistical methods for prediction when the data contain right-censored and left-censored observations simultaneously. The methods can be considered new versions of Hill's $A_{(n)}$ assumption for double-censored data. Two bounds are derived to predict the survival function for one future observation $X_{n+1}$ based on each version, and these bounds are compared through two examples. The proposed methods offer two interesting features. The first is a detailed graphical presentation of the effects of right and left censoring. The second is that lower and upper survival functions can be derived.


Introduction
This article introduces new nonparametric statistical methods for prediction using past data that contain right-censored and left-censored observations simultaneously; this kind of data is called double-censored data in the literature. The methods are proposed to learn about one future observation based on the original sample with few mathematical assumptions. For real-valued data, Hill [1,2] presented the $A_{(n)}$ assumption for prediction when there is little knowledge about the underlying distribution, and this assumption provides certain probabilities for one future observation based on the past observations. For right-censored data, Berliner and Hill [3,4] generalized the $A_{(n)}$ assumption, and the generalizations are considered for survival analysis. In this article, related methods are developed by using the information in the censored observations, and they are referred to as generalizations type A and type B of the $A_{(n)}$ assumption.
This article is organized as follows. In Section 2, a brief overview of the $A_{(n)}$ assumption is given for real-valued data. Section 3 presents the generalizations of the $A_{(n)}$ assumption for right-censored data. Section 4 presents the generalizations of the $A_{(n)}$ assumption for double-censored data along with their corresponding justifications. In Section 5, we present bounds for probabilities and survival functions based on the generalizations of the $A_{(n)}$ assumption proposed for double-censored data. In Section 6, the newly proposed generalizations are used in two applications. In this article, we assume that ties occur with probability zero to keep the notation simple, but Section 7 briefly discusses how ties can be dealt with. The final section provides some concluding remarks and discusses related future research.

$A_{(n)}$ assumption for real-valued univariate data
In 1968, Hill [1] introduced the $A_{(n)}$ assumption for prediction when there is no prior knowledge about the underlying distribution. The data support is partitioned into $n+1$ intervals using the observed data points. Let $X_1, X_2, \ldots, X_n, X_{n+1}$ be continuous and exchangeable real-valued random quantities, and let $x_1, x_2, \ldots, x_n$ be the corresponding observed data points. Furthermore, let $x_{(1)} < x_{(2)} < \cdots < x_{(n)}$ be the ordered observations and let $x_{(0)}$ and $x_{(n+1)}$ denote the lower and upper ends of the data support. For one future observation $X_{n+1}$, the $A_{(n)}$ assumption is
$$P(X_{n+1} \in I_i) = \frac{1}{n+1}, \quad i = 1, 2, \ldots, n+1,$$
where $I_i = (x_{(i-1)}, x_{(i)})$. It should be noted that the $A_{(n)}$ assumption is suitable for providing bounds for probabilities, known as imprecise probabilities, by using the fundamental theorem of probability of De Finetti [5], but it is not sufficient to derive precise probabilities for many functions of interest. The bounds created for probabilities can lead to valuable information when an event is uncertain or when indeterminacy is caused by restricted information [6,7]. Augustin and Coolen [8] derived strong consistency properties for nonparametric predictive inference in interval probability theory based on the $A_{(n)}$ assumption, and multiple examples have been presented by Coolen [9], Coolen and Coolen-Schrijner [10], Coolen et al. [11], and Coolen and Van der Laan [12].
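The partition behind the $A_{(n)}$ assumption can be illustrated with a minimal Python sketch. The function name `a_n_intervals` and the sample values are hypothetical, and the end points $-\infty$ and $\infty$ are an assumed convention for real-valued data rather than something fixed by the text above.

```python
from fractions import Fraction
import math

def a_n_intervals(data):
    """Return the n+1 open intervals created by the data, each with mass 1/(n+1).

    Assumption: the support is the whole real line, so the outer end points
    are -inf and +inf. Ties are assumed not to occur.
    """
    xs = sorted(data)
    if len(set(xs)) != len(xs):
        raise ValueError("ties are assumed not to occur")
    endpoints = [-math.inf] + xs + [math.inf]
    mass = Fraction(1, len(xs) + 1)
    return [((endpoints[i], endpoints[i + 1]), mass) for i in range(len(xs) + 1)]

# Hypothetical sample of n = 3 observations
intervals = a_n_intervals([2.1, 0.7, 3.5])
for (left, right), mass in intervals:
    print(f"({left}, {right}): {mass}")
```

Exact fractions are used so that the masses can be checked to sum to one without rounding error.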
Based on data including $n$ event observations, the $A_{(n)}$ assumption provides a partially specified predictive probability distribution for one future observation $X_{n+1}$ via the probabilities assigned to the intervals $I_i$. These probabilities can be used to derive lower and upper probabilities for any event of interest in terms of $X_{n+1}$ [8]. If we are interested in the event $X_{n+1} \in A$, with $A$ a set of nonnegative real values, then the lower probability for this event, denoted by $\underline{P}(X_{n+1} \in A)$, is derived by summing only the probabilities for $X_{n+1}$ on intervals which are completely within the set $A$. The upper probability for the event $X_{n+1} \in A$, denoted by $\overline{P}(X_{n+1} \in A)$, is derived by summing all the probabilities for $X_{n+1}$ on intervals which have a nonempty intersection with the set $A$. The $A_{(n)}$ assumption has been discussed for multiple statistical inferences in the literature; see, e.g., Hill [2] and De Finetti [5] for a more detailed presentation. Several important related works build on the assumption; see, e.g., Dempster [13] and Lane and Sudderth [14] for detailed information. For right-censored data, Berliner and Hill [3] and Coolen and Yan [4] proposed two versions of $A_{(n)}$, and the proposed assumptions will be presented in the next section.
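The summation rule for lower and upper probabilities can be sketched as follows for the special case where $A$ is an open interval $(a, b)$. The function name `lower_upper_probability`, the choice of $-\infty$ and $\infty$ as outer end points, and the sample values are assumptions made only for this illustration.

```python
from fractions import Fraction
import math

def lower_upper_probability(data, a, b):
    """Bounds for P(X_{n+1} in (a, b)) under the A_(n) assumption.

    Lower bound: sum the masses of intervals completely within (a, b).
    Upper bound: sum the masses of intervals that intersect (a, b).
    """
    xs = sorted(data)
    pts = [-math.inf] + xs + [math.inf]   # assumed outer end points
    mass = Fraction(1, len(xs) + 1)
    lower = upper = Fraction(0)
    for left, right in zip(pts, pts[1:]):
        if a <= left and right <= b:      # (left, right) lies within (a, b)
            lower += mass
        if left < b and a < right:        # nonempty intersection with (a, b)
            upper += mass
    return lower, upper

# Hypothetical data; bounds for the event 1.5 < X_5 < 3.5
print(lower_upper_probability([1, 2, 3, 4], 1.5, 3.5))
# Survival-type event X_5 > 2.5
print(lower_upper_probability([1, 2, 3, 4], 2.5, math.inf))
```

Taking $b = \infty$ gives bounds for a survival-type event $X_{n+1} > t$, which is the form used later for lower and upper survival functions.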

Generalizations of $A_{(n)}$ assumption for right-censored univariate data
The $A_{(n)}$ assumption for real-valued data assigns certain predictive probabilities to the open intervals created by the observed data points, with no further assumptions or restrictions on the spread of probability within the intervals. In 1988, the $A_{(n)}$ assumption was generalized to data containing right-censored observations by Berliner and Hill [3]. They use the same technique and provide a partial probability distribution via certain values for the next future observation $X_{n+1}$, and the probability masses are assigned to open intervals with no further constraints on the spread of the probability mass within each interval. Let $X_1, X_2, \ldots, X_n$ be exchangeable positive random quantities, and suppose there are $u$ event observations and $v$ right-censored observations, where $v = n - u$. The event observations and the right-censored observations are each ordered increasingly. They assign a certain probability to the event that one future observation $X_{n+1}$ falls between any two consecutive ordered event observations. This assumption was the first attempt to generalize the $A_{(n)}$ assumption to right-censored data, and it was a good starting point for dealing with data that include right-censored observations. Coolen and Yan [4] provided an alternative approach, named the right-censoring $A_{(n)}$ assumption, which is preferable because, unlike the approach of Berliner and Hill, it does not neglect the censored observations when creating the intervals that partition the sample space.
The right-censoring $A_{(n)}$ assumption, rc-$A_{(n)}$, provides a partial probability distribution for one future observation, and it is specified via M-function values [4]. The random quantities $X_1, X_2, \ldots, X_n$ are assumed to be exchangeable, nonnegative and real-valued, and there are $u$ event observations and $v$ right-censoring observations. The event and right-censoring observations are ordered, and the right-censoring times within each interval $I_i$ between consecutive event observations are ordered as well, where $l_i$ is the number of censored observations in $I_i$. Finally, the open intervals on which the M-function values are specified are formed.
Definition (M-function). The M-function value is a probability mass partially specified for each interval, so that the next future observation $X_{n+1}$ falls in an interval with a probability equal to its M-function value [4].
The rc-$A_{(n)}$ assumption is that the probability distribution for one positive random quantity $X_{n+1}$, based on data including $u$ event observations and $v$ censored observations, is partially assigned through M-function values [4]. These values are expressed in terms of $\tilde{n}_{c_r}$, the number of observations not experiencing the event of study just before time $c_r$, plus one.
From studying the rc-$A_{(n)}$ assumption, two observations are worth listing to give a wider picture of the assumption. First, Coolen and Yan [4] presented a new version of the $A_{(n)}$ assumption for data containing right-censored observations by using the same technique as the $A_{(n)}$ assumption, but the intervals overlap due to the censored observations. They provided a partial probability distribution via the M-function values for a random quantity $X_{n+1}$, and the probability masses are assigned to open intervals with no further constraints on the spread of the probability mass within each interval. The probability mass specified for an interval $(a, b)$ is denoted by $M_{X_{n+1}}(a, b)$ and interpreted as the M-function value for $X_{n+1}$ on $(a, b)$. Second, the M-function values lie between 0 and 1, and the values have to sum to one over all the intervals created [4].

Generalizations of $A_{(n)}$ assumption for double-censored univariate data
This section provides two generalizations of the $A_{(n)}$ assumption along with their justifications. For the justifications, we include detailed information on the exact nature of the noninformative censoring assumption implicit in the generalizations, and the justifications are provided in two stages. First, the influence of double-censored observations on the $A_{(n)}$ assumption is presented, with no further assumptions added. In the second stage, further postdata and predata assumptions, which are strongly related to $A_{(n)}$, are considered to derive partially specified predictive probability distributions for the random quantities that were double censored. The further postdata assumption is for right-censored data, and the further predata assumption is for left-censored data.
Based on the $A_{(n)}$ assumption, a partially specified probability distribution for $X_{n+1}$ is provided by using the observed data points. As the first stage in our justification of the generalizations of $A_{(n)}$, the influence of double-censored data on the partially specified probability distribution for $X_{n+1}$ is considered without further constraints, and this leads to a generalization of the $A_{(n)}$ assumption for double-censored data. This generalization is referred to as $\tilde{A}_{(n)}$, and it is defined and justified as follows.
Definition ($\tilde{A}_{(n)}$). The $\tilde{A}_{(n)}$ assumption is that the probability distribution for one positive random quantity $X_{n+1}$, based on data including $u$ event observations together with right-censored observations $rc_{(j)}$ and left-censored observations $lc_{(w)}$, is partially specified through certain probabilities: a probability mass $\frac{1}{n+1}$ is assigned to each interval between consecutive ordered event observations, to each interval $(rc_{(j)}, \infty)$ from a right-censored observation to infinity, and to each interval $(0, lc_{(w)})$ from 0 to a left-censored observation.
The justification of the $\tilde{A}_{(n)}$ assumption, in relation to the $A_{(n)}$ assumption, is as follows. The intervals between consecutive event observations are each assigned a minimum probability mass of $\frac{1}{n+1}$ by the $A_{(n)}$ assumption. If we consider one such interval, then the total mass in it could be more than $\frac{1}{n+1}$ due to the presence of double-censored observations. Without further assumptions, any additional probability mass due to such double-censored observations does not have to be restricted to lie within this interval, which leads us to the probabilities on intervals from a right-censored observation to infinity and from 0 to a left-censored observation. If a right-censored or left-censored observation lies in such an interval, the corresponding unobserved data point would split an interval into subintervals. However, due to the lack of information about the true times of double-censored observations, we cannot assign a probability $\frac{1}{n+1}$ to the subintervals. Therefore, only a lower bound of $\frac{1}{n+1}$ can be stated for the probability that $X_{n+1}$ falls in such an interval.
The only information known about a right-censored observation $rc_{(j)}$ is that the corresponding event would occur after time $rc_{(j)}$. This implies that if this time were observed, then one of the intervals between consecutive event observations would be split into two intervals, and each interval would have a probability $\frac{1}{n+1}$ based on the $A_{(n)}$ assumption. However, because it is only known that the corresponding event would exceed $rc_{(j)}$, the only statement about this probability mass $\frac{1}{n+1}$ for $X_{n+1}$ that can be justified, with no further constraints, is that it will fall in $(rc_{(j)}, \infty)$, and hence, $P(X_{n+1} \in (rc_{(j)}, \infty)) = \frac{1}{n+1}$.
For a left-censored time $lc_{(w)}$, we know that the corresponding event occurred before time $lc_{(w)}$. From this information, the event occurred in one of the intervals between consecutive event observations, and this would split that interval into two intervals, each with a probability $\frac{1}{n+1}$ based on the $A_{(n)}$ assumption. But, because it is only known that the corresponding event occurred before $lc_{(w)}$, the only statement about this probability mass $\frac{1}{n+1}$ for $X_{n+1}$ that can be justified, with no further constraints, is that it will fall in $(0, lc_{(w)})$, and hence, $P(X_{n+1} \in (0, lc_{(w)})) = \frac{1}{n+1}$.
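The first-stage assignment justified above can be sketched in Python. This is only an illustration under stated assumptions: the function names `a_tilde_masses` and `survival_bounds` and the sample data are hypothetical, and the survival bounds apply the summation rule for lower and upper probabilities directly to the first-stage masses, which is not necessarily how the bounds based on generalization types A and B are derived.

```python
from fractions import Fraction
import math

def a_tilde_masses(events, right_censored, left_censored):
    """Assign mass 1/(n+1) to every event interval, to (rc, inf) for each
    right-censored time rc, and to (0, lc) for each left-censored time lc."""
    n = len(events) + len(right_censored) + len(left_censored)
    mass = Fraction(1, n + 1)
    pts = [0.0] + sorted(events) + [math.inf]
    intervals = [((l, r), mass) for l, r in zip(pts, pts[1:])]
    intervals += [((rc, math.inf), mass) for rc in sorted(right_censored)]
    intervals += [((0.0, lc), mass) for lc in sorted(left_censored)]
    return intervals

def survival_bounds(intervals, t):
    """Lower and upper probability for the event X_{n+1} > t."""
    # Lower: intervals lying entirely in (t, inf); upper: intervals meeting (t, inf)
    lower = sum((m for (l, r), m in intervals if l >= t), Fraction(0))
    upper = sum((m for (l, r), m in intervals if r > t), Fraction(0))
    return lower, upper

# Hypothetical data with n = 7: three events, two right- and two left-censored
masses = a_tilde_masses([3, 5, 8], right_censored=[4, 6], left_censored=[2, 7])
print(sum(m for _, m in masses))     # masses sum to one
print(survival_bounds(masses, 4.5))
```

With $u$ event observations there are $u + 1$ event intervals, so the total number of intervals is $n + 1$ and the masses sum to one.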

Generalization type A
To link the first stage to the second stage of the justification of generalization type A, each probability assigned to an interval $(rc_{(j)}, \infty)$ is uniformly distributed over the event intervals occurring after $rc_{(j)}$ and the interval including the right-censored observation $rc_{(j)}$. For the left-censored observations, each probability assigned to an interval $(0, lc_{(w)})$ is uniformly distributed over the event intervals occurring before $lc_{(w)}$ and the interval including the left-censored observation $lc_{(w)}$. These assumptions lead to a probability function that is expressed with the indicator function $I(\cdot)$.
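One possible reading of the type A redistribution is sketched below in Python. The exact formula is not reproduced here, so this is an interpretation of the verbal description only: the function name `type_a_masses`, the sample data, and the rule that a censored mass is shared equally among the event interval containing the censoring time and all event intervals on the relevant side are assumptions.

```python
from bisect import bisect_right
from fractions import Fraction
import math

def type_a_masses(events, right_censored, left_censored):
    """Spread each censored mass 1/(n+1) uniformly over event intervals.

    Assumed reading: a right-censored mass goes to the event interval that
    contains rc and to all later event intervals; a left-censored mass goes
    to the event interval that contains lc and to all earlier ones.
    """
    xs = sorted(events)
    n = len(xs) + len(right_censored) + len(left_censored)
    unit = Fraction(1, n + 1)
    pts = [0.0] + xs + [math.inf]
    masses = [unit] * (len(xs) + 1)      # interval k is (pts[k], pts[k + 1])
    for rc in right_censored:
        k = bisect_right(xs, rc)         # index of the event interval containing rc
        share = unit / (len(masses) - k)
        for i in range(k, len(masses)):
            masses[i] += share
    for lc in left_censored:
        k = bisect_right(xs, lc)         # index of the event interval containing lc
        share = unit / (k + 1)
        for i in range(k + 1):
            masses[i] += share
    return list(zip(zip(pts, pts[1:]), masses))

# Hypothetical data with n = 7
result = type_a_masses([3, 5, 8], right_censored=[4, 6], left_censored=[2, 7])
for interval, mass in result:
    print(interval, mass)
```

Under this reading all probability ends up on the event intervals, so the result has the same shape as an $A_{(n)}$ partition with unequal masses.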

Generalization type B
To link the first stage to the second stage of the justification of generalization type B, each probability assigned to an interval $(rc_{(j)}, \infty)$ is uniformly distributed over the event intervals and the interval $(rc_{(j)}, t_{rc_{(j)}})$, where $t_{rc_{(j)}}$ is the first event time greater than $rc_{(j)}$. For the left-censored observations, each probability assigned to an interval $(0, lc_{(w)})$ is uniformly distributed over the event intervals occurring before $lc_{(w)}$ and the interval $(t_{lc_{(w)}}, lc_{(w)})$, where $t_{lc_{(w)}}$ is the first event time less than $lc_{(w)}$. These assumptions lead to the corresponding probability functions.
In the simulated example, $x_o$ is the time and $d_o$ is the censoring indicator. Each generalization of the $A_{(n)}$ assumption leads to a partially specified probability distribution, via the probabilities calculated by equations (6) and (7), for the survival time of a future observation $X_{n+1}$, which we refer to as the random quantity $X_8$ in the simulated example. The probabilities related to the intervals are presented in Tables 2 and 3, and these assigned probabilities can be used to derive bounds for the survival function of $X_8$. The graphical presentations give a wider picture of the double-censored data, and from a predictive perspective, they can easily be interpreted. When the data set is large and many observations are events rather than censored, the proposed methods become nearly identical, as the following example shows.

AIDS example
The data set is from a cohort of drug users recruited in a detoxification program in Badalona (Spain) [20]. On the basis of the data, we may predict the lower and upper survival functions for the elapsed time from starting IV drugs to AIDS diagnosis for the next observation. The data cover 232 patients infected with HIV: 136 left AIDS-free (right censored), 14 died with AIDS without prior diagnosis (left censored), and 82 had AIDS while in the program (noncensored). Figure 2 presents the lower and upper survival functions based on the generalizations of the $A_{(n)}$ assumption. It is reasonable that they are nearly identical and that the differences between the lower and upper survival functions become small, given the large sample size with many uncensored observations.

Treating such cases of ties
For simplicity, we have assumed throughout this article that there are no ties in the data sets, but in real studies, ties could occur in different scenarios. There are seven kinds of ties that could occur: tied event observations, tied right-censoring observations, tied left-censoring observations, ties among event and right-censoring observations, ties among event and left-censoring observations, ties among left-censoring and right-censoring observations, and ties among event and left-censoring observations with right-censoring observations. In the first three situations, we break the ties by adding a small value to the tied observations. In the fourth situation, we assume that the right-censoring times occur after the event observations, an assumption that has been widely used in the literature [3,21]. In the fifth situation, we assume that the left-censoring times occur before the event observations. In the sixth situation, we assume that the right-censoring times occur after the left-censoring times.
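The tie-handling conventions above can be sketched as a small preprocessing step in Python. The function name `break_ties`, the labels `"lc"`, `"event"` and `"rc"`, and the size of the shift `eps` are hypothetical choices; `eps` should be small relative to the gaps between distinct observed times, and placing left-censoring before right-censoring at a tied time is an assumed ordering.

```python
import math

def break_ties(observations, eps=1e-6):
    """Return observations with strictly increasing times.

    observations: list of (time, kind) with kind in {"lc", "event", "rc"}.
    At a tied time, left-censored values are placed before events and events
    before right-censored values; later tied values are shifted by eps.
    """
    rank = {"lc": 0, "event": 1, "rc": 2}
    ordered = sorted(observations, key=lambda o: (o[0], rank[o[1]]))
    result = []
    previous = -math.inf
    for time, kind in ordered:
        if time <= previous:             # tie (or overtaken by an earlier shift)
            time = previous + eps
        result.append((time, kind))
        previous = time
    return result

# Hypothetical data with ties at times 3 and 5
cleaned = break_ties([(5, "rc"), (5, "event"), (5, "lc"), (3, "event"), (3, "event")])
for time, kind in cleaned:
    print(time, kind)
```

After this step the data contain no ties, so methods that assume distinct observation times can be applied unchanged.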

Figure 1: The lower and upper survival functions for $X_8$ based on the generalizations of the $A_{(n)}$ assumption.
then the unobserved data point $t_{rc}$ corresponding to that right-censored observation could occur in , then the unobserved data point $t_{lc}$ corresponding to that left-censored observation could occur in