In the current status model, a positive variable of interest X with distribution function F0 is not observed directly. A censoring variable T ∼ G is observed instead together with the indicator Δ = (X ≤ T). curstatCI provides functions to estimate the distribution function F0 and to construct pointswise confidence intervals around F0(t) based on an observed sample (T1, Δ1), …, (Tn, Δn) of size n from the observable random vector (T, Δ). The methods used in this package are described in Groeneboom and Hendrickx (2017). More details on the current status model can be found in Groeneboom and Jongbloed (2014).
To illustrate the usage of the functions provided in the curstatCI package, we consider a sample (X1, …, Xn) of size n = 1000 from the truncated exponential distribution on [0, 2] with distribution function F0(t) = (1 − exp (−t))/(1 − exp (−2) if t ∈ [0, 2]. The observation points (T1, …, Tn) are sampled from a uniform distribution on [0, 2]. This data setting is also considered in the simulation section of Groeneboom and Hendrickx (2017). The R-code to generate the observable random vector (t1, δ1), …, (tn, δn) is given below.
set.seed(100)
n<-1000
t<-rep(NA, n)
delta<-rep(NA, n)
for(i in (1:n) ){
x<-runif(1)
y<--log(1-(1-exp(-2))*x)
t[i]<-2*runif(1);
if(y<=t[i]){ delta[i]<-1}
else{delta[i]<-0}}
The MLE Fn of F0 is defined as the maximizer of the log likelihood given by (up to a constant not depending on F) $$ \sum_{i=1}^n \Delta_i\log F(T_i) + (1-\Delta_i)\log(1-F(T_i)),$$ over all possible distribution functions F. As can be seen from its structure, the log likelihood only depends on the value that F takes at the observed time points Ti. The values in between are irrelevant as long as F is increasing. The MLE Fn is a step function which can be characterized by the left derivative of the greatest convex minorant of the cumulative sumdiagran consisting of the points P0 = (0, 0) and $$P_i=\left(\sum_{j=1}^i w_j,\sum_{j=1}^i f_{1j}\right),\,i=1,\dots,m,$$
where the wj are weights, given by the number of observations at point T(j), assuming that T(1) < … < T(m) (m being the number of different observations in the sample) are the order statistics of the sample (T1, Δ1), …, (Tn, Δn) and where f1j is the number of Δk equal to one at the jth order statistic of the sample. When no ties are present in the data, wj = 1, m = n and f1j = Δ(j), where Δ(j) corresponds to T(j).
The function ComputeMLE in the package curstatCI computes the values of the MLE at the distinct jump points of the stepwise MLE. The current status data needs to be formatted as follows. The first column contains the observations t(1) < … < t(m) in ascending order. The second and third columns contain the variables f1j and wj corresponding to t(j).
## [,1] [,2] [,3]
## [1,] 0.0007901406 0 1
## [2,] 0.0064286250 0 1
## [3,] 0.0075551583 0 1
## [4,] 0.0121238651 0 1
## [5,] 0.0150197768 0 1
## [6,] 0.0179643347 0 1
The function ComputeMLE returns the jump points and corresponding values of the MLE.
## x mle
## 1 0.00000000 0.00000000
## 2 0.04461921 0.05405405
## 3 0.12463964 0.15000000
## 4 0.18832966 0.18181818
## 5 0.21623244 0.21428571
## 6 0.24208302 0.25000000
## 7 0.25054101 0.30555556
## 8 0.32536475 0.38461538
## 9 0.34771374 0.38888889
## 10 0.39735814 0.45454545
## 11 0.55754520 0.50000000
## 12 0.63327266 0.51515152
## 13 0.70567068 0.54838710
## 14 0.74357757 0.61363636
## 15 0.82979603 0.66666667
## 16 0.83427289 0.69230769
## 17 0.90577147 0.69902913
## 18 1.13136189 0.75000000
## 19 1.14422337 0.81967213
## 20 1.38870142 0.85714286
## 21 1.40659256 0.90000000
## 22 1.48428796 0.92307692
## 23 1.55291452 0.97297297
## 24 1.77735579 0.98666667
## 25 1.91620524 1.00000000
The number of jump points in the simulated data example equals 24. The MLE is zero at all points smaller than the first jump point 0.04461921 and one at all points larger than or equal to the last jump point 1.91620524. A picture of the step function Fn is given below.
Starting from the nonparametric MLE Fn, a smooth estimator F̃nh of the distribution function F0 can be obtained by smoothing the MLE Fn using a kernel function K and bandwidth h > 0. We use the triweight kernel defined by $$ K(t)=\frac{35}{32}\left(1-t^2\right)^31_{[-1,1]}(t).$$ The SMLE is next defined by $$ \tilde F_{nh}(t)=\int\mathbb K\left(\frac{t-x}{h}\right)\,d F_n(x), $$ where 𝕂(t) = ∫−∞tK(x) dx. The function ComputeSMLE computes the values of the SMLE in the points x based on a pre-specified bandwidth choice h. A user-specified bandwidth vector bw of size length(x) is used for each point in the vetor x. A data-driven bandwidth choice for each point in x is returned by the function ComputeBW which is described later.
For our simulated data set, a picture of the SMLE F̃nh(t) using a bandwidth h = 2n−1/5, evaluated in the points t = 0.02, 0.04, …, 1.98 together with the true distribution function F0 is obtained as follows:
grid<-seq(0.02,1.98, 0.02)
bw<-rep(2*n^-0.2,length(grid))
smle<-ComputeSMLE(data=A, x=grid, bw=bw)
plot(grid, smle,type='l', ylim=c(0,1), main="",ylab="",xlab="",las=1)
lines(grid, (1-exp(-grid))/(1-exp(-2.0)), col=2, lty=2)
The nonparametric bootstrap procedure, which consists of resampling (with replacement) the (Ti, Δi) from the original sample is consistent for generating the limiting distribution of the SMLE F̃nh(t) under current status data (see Groeneboom and Hendrickx (2017)). As a consequence, valid pointwise confidence intervals for the distribution function F0(t) can be constructed using a bootstrap sample (T1*, Δ1*), …(Tn*, Δn*). The 1 − α bootstrap confidence intervals generated by the function ComputeConfIntervals are given by $$ \left[\tilde F_{nh}(t)-Q_{1-\alpha/2}^*(t)\sqrt{S_{nh}(t)}, \tilde F_{nh}(t)-Q_{\alpha/2}^*(t)\sqrt{S_{nh}(t)}\right],$$ where Qα*(t) is the αth quantile of B values of Wnh*(t) defined by, $$W_{nh}^*(t)=\left\{\tilde F_{nh}^*(t)-\tilde F_{nh}(t)\right\}/\sqrt{S_{nh}^*(t)},$$ where F̃nh*(t) is the SMLE in the bootstrap sample and Snh(t) is given by: $$S_{nh}(t)=\frac1{(nh)^{2}}\sum_{i=1}^n K\left(\frac{t-T_i}h\right)^2\left(\Delta_i-F_n(T_i)\right)^2.$$ Snh*(t) is the bootstrap analogue of Snh(t) obtained by replacing (Ti, Δi) in the expression above by (Ti*, Δi*). The number of bootstrap samples B used by the function ComputeSMLE equals B = 1000.
The bandwidth for estimating the SMLE F̃nh at point t, obtained by minimizing the pointwise Mean Squared Error (MSE) using the subsampling principle in combination with undersmoothing is given by the function ComputeBW. To obtain an approximation of the optimal bandwidth minimizing the pointwise MSE, 1000 bootstrap subsamples of size m = o(n) are generated from the original sample using the subsampling principle and ct, opt is selected as the minimizer of $$ \sum_{b=1}^{1000}\left\{\tilde F_{m,cm^{-1/5}}^b(t) - \tilde F_{n,c_0n^{-1/5}}(t) \right\}^2,$$ where F̃n, c0n−1/5 is the SMLE in the original sample of size n using an initial bandwidth c0n−1/5. The bandwidth used for estimating the SMLE is next given by h = ct, optn−1/4.
When the bandwidth h is small, it can happen that no time points are observed within the interval [t − h, t + h]. As a consequence the estimate of the variance Snh(t) is zero and the Studentized Confidence intervals are unsatisfactory. If this happens, the function ComputeConfIntervals returns the classical 1 − α confidence intervals given by: [F̃nh(t) − Z1 − α/2*(t), F̃nh(t) − Zα/2*(t)], where Zα*(t) is the αth quantile of B values of Vnh*(t) defined by, Vnh*(t) = F̃nh*(t) − F̃nh(t).
Besides the upper and lower bounds of the 1 − α bootstrap confidence intervals, the function ComputeConfIntervals also returns the output of the functions ComputeMLE and ComputeSMLE. The theory for the construction of pointwise confidence intervals is limited to points within the observation interval. It is not recomended to construct intervals in points smaller resp. larger than the smallst resp. largest observed time point.
The general approach for constructing the confidence intervals is the following: First decide upon the points where the confidence intervals need to be computed. If no particular interest in certain points is requested, it is useful to consider a grid of points within the interval starting from the smallest observed time point ti until the largest observed time point tj. The function ComputeConfIntervals can also deal with positive values outside this interval but some numerical instability is to be expected and the results are no longer trustworthy.
## [1] 0.0007901406 1.9997072918
Next, select the bandwidth vector for estimating the SMLE F̃nh(t) for each point in the grid. If no pre-specified bandwidth value is preferred, a data-driven bandwidth vector can be obtained by the function ComputeBW.
## The computations took 8.668544 seconds
The bandwidth vector obatined by the function ComputeBW is used as input bandwidth for the function ComputeConfIntervals.
## The program produces the Studentized nonparametric bootstrap confidence intervals for the cdf, using the SMLE.
##
## Number of unique observations: 1000
## Sample size n = 1000
## Number of Studentized Intervals = 199
## Number of Non-Studentized Intervals = 0
## The computations took 13.68592 seconds
The function ComputeConfIntervals informs about the number of times the Studentized resp. classical bootstrap confidence intervals are calculated. The default method is the Studentized bootstrap confidence interval. In this simulated data example, the variance estimate for the Studentized confidence intervals is available for each point in the grid and out$Studentized equals grid.
## $names
## [1] "MLE" "SMLE" "CI" "Studentized"
## [5] "NonStudentized"
## numeric(0)
A picture of the the SMLE together with the pointwise confidence intervals in the gridpoints and the true distribution function F0 is given below:
left<-out$CI[,1]
right<-out$CI[,2]
plot(grid, out$SMLE,type='l', ylim=c(0,1), main="",ylab="",xlab="",las=1)
lines(grid, left, col=4)
lines(grid, right, col=4)
segments(grid,left, grid, right)
lines(grid, (1-exp(-grid))/(1-exp(-2.0)), col=2)
Niels Keiding (1991) considered a cross-sectional study on the Hepatitis A virus from Bulgaria. In 1964 samples were collected from school children and blood donors on the presence or absence of Hepatitis A immunity. In total n = 850 individuals ranging from 1 to 86 years old were tested for immunization. It is assumed that, once infected with Hepatitis A, lifelong immunity is achieved. To estimate the sero-prevalence for Hepatitis A in Bulgaria, 95% confidence intervals around the distribution function for the time to infection are computed using the ComputeConfIntervals function in the package curstatCI. Since only 22 out of the 850 individuals were older than 75 years, who, moreover, all had antibodies for Hepatitis A, it seems sensible to restrict the range to [1,75]. The resulting confidence intervals are obtained as follows:
## t freq1 freq2
## 1 1 3 16
## 2 2 3 15
## 3 3 3 16
## 4 4 4 13
## 5 5 7 12
## 6 6 4 15
grid<-1:75
bw<-ComputeBW(data=hepatitisA,x=grid)
out<-ComputeConfIntervals(data=hepatitisA,x=grid,alpha=0.05, bw=bw)
The estimated prevalence of Hepatitis A at the age of 18 is 0.51, about half of the infections in Bulgaria happen during childhood.
## [1] 0.5109369
left<-out$CI[,1]
right<-out$CI[,2]
plot(grid, out$SMLE,type='l', ylim=c(0,1), main="",ylab="",xlab="",las=1)
lines(grid, left, col=4)
lines(grid, right, col=4)
segments(grid,left, grid, right)
The confidence interval around F̃nh(1)
is computed using the classical confidence interval instead of the
Studentized confidence interval.
## [1] 1
N. Keiding et al. (1996) considered a current status data set on the prevalence of rubella in 230 Austrian males with ages ranging from three months up to 80 years. Rubella is a highly contagious childhood disease spread by airborne and droplet transmission. The symptoms (such as rash, sore throat, mild fever and swollen glands) are less severe in children than in adults. Since the Austrian vaccination policy against rubella only vaccinated girls, the male individuals included in the data set represent an unvaccinated population and (lifelong) immunity could only be acquired if the individual got the disease. Pointwise confidence intervals are useful to investigate the time to immunization (i.e. the time to infection) against rubella.
## t freq1 freq2
## 1 0.2740 0 1
## 2 0.3781 0 1
## 3 0.5288 0 1
## 4 0.5342 0 1
## 5 0.9452 1 1
## 6 0.9479 0 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.274 8.868 25.595 28.970 44.888 80.118
grid<-1:80
bw<-ComputeBW(data=rubella,x=grid)
out<-ComputeConfIntervals(data=rubella,x=grid,alpha=0.05, bw=bw)
The SMLE increases steeply in the ages before adulthood which is in line with the fact that rubella is considered as a childhood disease.
left<-out$CI[,1]
right<-out$CI[,2]
plot(grid, out$SMLE,type='l', ylim=c(0,1), main="",ylab="",xlab="",las=1)
lines(grid, left, col=4)
lines(grid, right, col=4)
segments(grid,left, grid, right)