1 Introduction

Survival analysis relates the time to occurrence of an event to the covariates that influence it. It finds wide application in medicine, economics, finance, and engineering. Regardless of the field of application, it is common to observe more than one event per subject during the study period; such events are called recurrent or multiple events. The underlying challenge is that the occurrence of one event is likely to be influenced by earlier events, even if they are of a different type.

The Cox proportional hazards model (Cox 1972), a well-known approach for the analysis of survival data, has been extended to multiple events as well as to events with non-proportional hazards. The main difficulty in extending the proportional hazards model to multiple events is the intra-subject correlation (Therneau and Grambsch 2013; Hougaard 2012). Several extensions of the Cox model belonging to the class of intensity-process models have been proposed to address this correlation, such as the Andersen and Gill model (AG) (Andersen and Gill 1982), the Prentice, Williams, and Peterson models (PWP) (Prentice et al. 1981), and the Wei, Lin, and Weissfeld model (WLW) (Wei et al. 1989). Frailty models (Vaupel et al. 1979) and multi-state models (MSM) (Hougaard 1999) are also widely used for multiple-event problems. The choice among these models depends on the research question under investigation. The models are illustrated using data collected to study the effect of class size on defect proneness in the Mozilla product (Gunes et al. 2007, 2008).

Gunes et al. found that the quality assurance activities in biomedical open-source software (OSS) were not sufficient for mission-critical use, which motivates studying the characteristics of defect-prone open-source software modules so that preventive actions can be taken (Gunes et al. 2007). The event of interest in this problem is a defect fix made to a C++ class in the open-source Mozilla project. Since a class can experience more than one defect, the data fall into a recurrent-event setup. The purpose of the analysis is to obtain the instantaneous hazard rate of fixing a defect at time t for a C++ class. Such data have been handled with various methodologies in epidemiology and clinical trials, some of which are introduced here using the defect proneness data. The data layout required by each method and a comparison of the models are also discussed.

2 Data

Defect prediction modeling is an integral area of software maintenance. Most often, the size of the software is one of the factors associated with defect proneness (El-Emam et al. 2000). Gunes et al. collected data from the Mozilla project to study this association (Gunes et al. 2007). The data consist of the complete history of all C++ classes introduced to the Mozilla project from May 29, 2002, to February 22, 2006, which was publicly available in the PROMISE repository (Sayyad Shirabad and Menzies 2005). The complete data were accessed from Open Machine Learning (OpenML) (Vanschoren et al. 2013). The event of interest of the study was a defect fix made to a C++ class added by developers to the public code repository of the Mozilla product. The revision history of the followed-up classes was extracted from the Concurrent Versions System (CVS) repository. From the list of revisions, defect fixes were identified by the keywords “bug”, “fix”, and “defect” in the commit log message.

The time taken for repeated defect fixes of each C++ class was recorded in minutes. Each class introduced into the system was followed up either until the end of the observation period or until the class was deleted. If a class was removed in order to fix a defect, the removal was counted as an event; otherwise the observation was censored. Even though classes were introduced to the system at different calendar times, the starting time of each class is defined as zero. The only covariate observed in the study (Gunes et al. 2007) was the size of the class, since the study was conducted to establish the impact of module size on defect proneness. Class size was measured as the number of lines of code, excluding comment lines and blank lines. It is a time-varying covariate, since the length of the code changes over time. A total of 8828 defect fixes, belonging to 4089 classes, were recorded during the study period. The observations for each revision are recorded with the fields:

  • ID: A unique identifier for each class.

  • Start: Zero when a class is first introduced; otherwise, the time at which the revised class is reintroduced to the system.

  • End: Next revision time of the class or the deletion time, or the time at which the study period ended, whichever occurs first.

  • Event: An indicator of whether the revision was a defect fix.

  • Size: Number of lines of code at the time of the revision.

The median (interquartile range) class size was 126 (36–418) lines. The data were truncated at a maximum of four defect fixes per class because of the anticipated instability of the estimates beyond that. Log-transformed size was incorporated as a covariate in the models, as done in previous studies (Gunes et al. 2007, 2008).

Table 1 Data layout for different models

The analysis of repeated events can be approached in two different ways: (i) by considering the total time (in minutes) since study entry, or (ii) by considering the gap time (in minutes) between successive events. The data must be arranged appropriately to fit the different models. The formats for the AG, PWP, WLW, MSM, and frailty models are described in Table 1 using the entries of the first, second, and third classes, and a sketch of how the derived columns can be built is given below.
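
As an illustration only, the following R sketch derives the columns used later (log-transformed size and the defect number at risk). The data frame name mozilla and the lower-case column names are assumptions for illustration, not the variable names of the original study.

    # Sketch of the counting-process layout, assuming a data frame 'mozilla'
    # with one row per revision and the fields id, start, end, event and size.
    library(survival)
    library(dplyr)

    mozilla <- mozilla %>%
      group_by(id) %>%
      arrange(start, .by_group = TRUE) %>%
      mutate(
        ln_size = log(size),                           # log-transformed class size
        enum    = cumsum(lag(event, default = 0)) + 1  # defect number currently at risk
      ) %>%
      ungroup() %>%
      filter(enum <= 4)                                # keep at most four defect fixes

    # For the gap-time (PWP-GT) layout, the clock is reset at every defect fix,
    # so the interval endpoints are re-expressed as time since the previous fix.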

3 Models

This section introduces the models using the data described in Sect. 2, demonstrating the relationship between defect proneness and the size of an OSS class.

3.1 Andersen and Gill (AG) model

The intensity of recurring defect fixes is modeled through the realization of a counting process (counting the observed defect fixes up to a particular time t) (Andersen and Gill 1982). The defect fixes are assumed to be independent and identically distributed, so defects are treated the same irrespective of their order. The AG model gives the best fit if the risk of further defects remains the same regardless of previous defects (Hosmer Jr et al. 2008). The likelihood of a defect fix at time t is unaffected by previously observed defects, and hence the baseline hazard function is common to all defects (Amorim and Cai 2015).

The advantages of the model are its compatibility with complicated censoring schemes as well as with time-dependent covariates, which can be used to bring in the dependence between defects of the same class (Ullah et al. 2014). Moreover, the AG model is implemented in most statistical software.
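
A minimal sketch of an AG fit with the survival package (Therneau 2015) is given below; it relies on the hypothetical data frame and column names introduced in Sect. 2.

    # Andersen-Gill model: common baseline hazard on the counting-process time
    # scale; cluster(id) requests robust standard errors for within-class correlation.
    fit_ag <- coxph(Surv(start, end, event) ~ ln_size + cluster(id),
                    data = mozilla)
    summary(fit_ag)  # exp(coef) is the hazard ratio for a one-unit increase in ln(size)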

3.2 Prentice, Williams and Peterson (PWP) model

This model is commonly used to analyze survival data with ordered recurrent events (Prentice et al. 1981). The hazard function is modeled by stratifying the classes such that only classes with a defect in the previous stratum are considered at risk in successive strata, while all classes are at risk in the first stratum. The risk set for the kth defect fix at time t includes only those classes that have already experienced the \((k-1)\)th defect but have not yet experienced the kth defect at time t (Amorim and Cai 2015; Castañeda and Gerritse 2010). A separate column is included in the data to encode this stratification. The baseline hazard, which varies across defects, is a function of time since study entry. The model can incorporate both overall and defect-specific effects of the covariate, and it is best suited to a small number of defects observed on a large number of classes. The hazard function for the kth defect of the ith class is given in Table 2.
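
A sketch of the PWP total-time fit, assuming the stratification column enum (defect number at risk) derived in the sketch of Sect. 2:

    # PWP total-time model: baseline hazard stratified by defect number, so the
    # risk set for the k-th defect contains only classes with k-1 prior defects.
    fit_pwp_tt <- coxph(Surv(start, end, event) ~ ln_size + strata(enum) + cluster(id),
                        data = mozilla)
    summary(fit_pwp_tt)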

Table 2 Model specifications for recurrent event survival models

3.3 Prentice, Williams and Peterson gap-time (PWP-GT) model

The PWP-GT model is a modification of the PWP model in which the time to a defect is measured from the immediately preceding defect fix (Prentice et al. 1981). In the data layout, the row immediately following every defect fix therefore has a starting time of zero. Unlike the PWP model, the baseline hazard is a function of time since the previous defect fix. This model is often preferred because it allows the effect of the varying class size between any two successive defect fixes to be incorporated.

PWP models are conditional models: a class is at risk for a new defect only if it has sustained a previous defect, and hence the hazard of a class at any time is conditional on the previous defect fixes made to the same class. More specifically, both the PWP and the PWP-GT model can be seen as stratified versions of the Cox model (Cox 1972), with the dependence between defects incorporated through the strata. The hazard function for the kth defect of the ith class can be found in Table 2.
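
The corresponding gap-time fit only changes the time scale. The sketch below assumes hypothetical columns gstart and gstop holding the interval endpoints measured from the previous defect fix:

    # PWP gap-time model: the clock restarts at every defect fix, so the
    # interval endpoints are expressed as time since the preceding defect fix.
    fit_pwp_gt <- coxph(Surv(gstart, gstop, event) ~ ln_size + strata(enum) + cluster(id),
                        data = mozilla)
    summary(fit_pwp_gt)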

3.4 Wei, Lin, and Weissfeld (WLW) model

Unlike the other models, the WLW model targets the mean number of defect fixes (Wei et al. 1989). It can be used when the investigator is interested in the rate or the expected number of defect fixes conditional on the covariate. All recurring defects of a class are treated as a single counting process. Moreover, if no time-varying (time-dependent) covariate is included in the AG model to account for the event history, its point estimates coincide with those of the WLW model (Amorim and Cai 2015). In such cases, the WLW model is preferred over the AG model due to its comparative flexibility. The model avoids specifying the dependence structure among the different defect fixes received by the same class (Wei et al. 1989; Metcalfe and Thompson 2007).

One drawback of the model is an overestimation of the effect of size, since all classes are treated as at risk for the maximum number of defects reported even if they developed far fewer defects. In the data layout, every class therefore has four rows, because the maximum number of defects considered was four. An advantage of the model is its compatibility with unordered events.
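
A sketch of the WLW fit, assuming a hypothetical marginal-format data frame mozilla_wlw in which every class contributes one row per defect number k = 1, ..., 4, with time measured from study entry to the kth defect fix or censoring:

    # WLW marginal model: every class is at risk for all four defect numbers;
    # strata(enum) gives defect-specific baseline hazards and cluster(id) gives
    # robust standard errors for the within-class correlation.
    fit_wlw <- coxph(Surv(time, event) ~ ln_size + strata(enum) + cluster(id),
                     data = mozilla_wlw)
    summary(fit_wlw)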

3.5 Frailty model

The frailty model accounts for unobserved heterogeneity across classes in the hazard function (Vaupel et al. 1979). It is useful when the classes differ in their underlying risk. A random effect is introduced into the AG model to account for the excess risk associated with individual classes. Conditional frailty models can be fitted by extending the same methodology to the PWP models.

As with mixed models (McLean et al. 1991) in the analysis of longitudinal data, this model allows a class-specific interpretation of the estimates. If the random effects are small, a larger number of defect fixes per class is required to estimate them reliably; otherwise, a smaller number of defect fixes is adequate to proceed with the frailty model.
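
A sketch of a shared frailty extension of the AG fit; frailty(id) adds a class-specific random effect (gamma-distributed by default in the survival package):

    # Shared frailty model: a random effect per class captures the unobserved
    # heterogeneity on top of the AG specification.
    fit_frailty <- coxph(Surv(start, end, event) ~ ln_size + frailty(id),
                         data = mozilla)
    summary(fit_frailty)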

3.6 Multi-state model (MSM)

In an MSM, the instantaneous hazard of progressing to a particular state, conditional on occupying another state, is modeled. The model provides a covariate effect for each transition. A five-state model is introduced for the defect fix data: defect-free classes belong to state 1; classes with one, two, and three defects belong to states 2, 3, and 4, respectively; and state 5 indicates the more vulnerable classes with more than three defects.

Fig. 1 Graphical representation of the 5-state MSM fitted to the data

A state is called absorbing if there is no path out of it, and transient otherwise. In these data, states 1, 2, 3, and 4 allow further transitions and hence are transient. State 5 is absorbing, since it was defined as the state containing the most vulnerable classes. Because a defective class cannot return to a defect-free state, the transitions are irreversible. The possible transitions are from state 1 to state 2, state 2 to state 3, state 3 to state 4, and state 4 to state 5. Figure 1 represents the 5-state model fitted to the data. A column is incorporated to indicate the state a class currently occupies. The advantage of the MSM is that estimates are obtained for each transition in the data, giving the influence of the covariate on each transition (Table 4).

Depending on the assumptions made about the transition rates, different multi-state models (Markov, semi-Markov, non-Markov, etc.) can be fitted to multiple-event survival data (Meira-Machado et al. 2009).
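
Because the transitions here are strictly progressive (state k can only move to state k + 1), each transition can be fitted with its own Cox model, as sketched below; the column state is a hypothetical indicator of the state occupied at the start of each interval. Dedicated packages such as mstate provide a more complete treatment.

    # One Cox model per transition (state s -> state s + 1), s = 1, ..., 4.
    msm_fits <- lapply(1:4, function(s)
      coxph(Surv(start, end, event) ~ ln_size,
            data = subset(mozilla, state == s)))
    lapply(msm_fits, function(f) exp(coef(f)))  # transition-specific hazard ratios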

A brief description of the models discussed in Sects. 3.1–3.6, including the risk set considered, the baseline hazard function, and the dependence structure, can be found in Table 2. In the functional forms of the hazard functions, \(\lambda _{0}\) denotes the baseline hazard, i.e., the underlying hazard without any covariate effect, while X and \(\beta \) stand for the covariate and the unknown regression coefficients, respectively.
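
For orientation, the standard functional forms as commonly written in the recurrent-event literature (e.g., Amorim and Cai 2015) are reproduced below; the notation in Table 2 may differ slightly:

\[
\begin{aligned}
\text{AG:}\quad & \lambda _{i}(t) = \lambda _{0}(t)\exp \{X_{i}(t)\beta \},\\
\text{PWP (total time):}\quad & \lambda _{ik}(t) = \lambda _{0k}(t)\exp \{X_{i}(t)\beta _{k}\},\\
\text{PWP-GT:}\quad & \lambda _{ik}(t) = \lambda _{0k}(t - t_{i,k-1})\exp \{X_{i}(t)\beta _{k}\},\\
\text{WLW:}\quad & \lambda _{ik}(t) = \lambda _{0k}(t)\exp \{X_{i}(t)\beta _{k}\},\\
\text{Frailty:}\quad & \lambda _{i}(t) = z_{i}\,\lambda _{0}(t)\exp \{X_{i}(t)\beta \},
\end{aligned}
\]

where \(t_{i,k-1}\) is the time of the \((k-1)\)th defect fix of class i and \(z_{i}\) is the class-specific frailty.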

4 Results

A total of 8828 defects were fixed during the observation period for 4089 classes. The above-mentioned models were fitted using the survival package (Therneau 2015) in RStudio (RStudio Team 2016). Parameter estimates obtained from the models are given in Tables 3 and 4. The hazard ratio is the exponential of the parameter estimate. The estimates from the different models differ considerably because of their differing assumptions and properties.

Table 3 Parameter estimates obtained from different models
Table 4 Parameter estimates obtained from multi-state model

From the AG model, the observed hazard ratio is 1.60 with a 95% confidence interval of (1.58, 1.62), i.e., a one-unit increase in ln(size) of a class is associated with a 60% higher risk of recurrence.

The hazard ratio from the PWP model is 1.32 (1.28, 1.36). The risk set for the PWP model is formed from the classes having the same number of previous defects, so for larger numbers of defects the strata become too small. The number of defects was truncated at four to overcome this limitation.

According to the frailty model, the instantaneous risk of recurrence increases by 16% for a one-unit increase in ln(size).

According to the WLW model, the hazard of a defect fix increases by 15% with a one-unit increase in ln(size).

The multi-state model gives the hazard ratio associated with each transition for a unit increase in the size of the class. The results can be found in Table 4.

The goodness of fit was assessed using the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the log-likelihood. The PWP models performed best in terms of AIC, BIC, and log-likelihood. From the properties of the PWP model, it is clear that the model is best suited to ordered events, so this result is expected. The AG model treats the defects as the same irrespective of their order, but Fig. 2 shows that the survival probability decreases as the number of prior defects increases. The AG model gives the best fit only if the risk of further defects remains the same regardless of previous defects, which is not the case for these data.
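
As an illustration, such a comparison can be carried out with the fitted objects from the earlier sketches (hypothetical names):

    # Goodness-of-fit summary across the fitted recurrent-event models.
    fits <- list(AG = fit_ag, PWP_TT = fit_pwp_tt, PWP_GT = fit_pwp_gt,
                 WLW = fit_wlw, Frailty = fit_frailty)
    sapply(fits, function(f) c(AIC = AIC(f), logLik = as.numeric(logLik(f))))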

Fig. 2 Kaplan–Meier estimates of the survival function stratified by the number of previous defects

Figure 2 shows the Kaplan–Meier estimates of the survival function, stratified by the number of previous defects. As the number of previous defects increases, survival decreases considerably. More specifically, according to the stratum-wise PWP model, classes with one previous defect are 2.3 times more susceptible to further defects, and classes with two, three, and four prior defects are 3.9, 4.3, and 9.6 times more prone to further defects.

5 Conclusions

The characteristics of different recurrent event models were introduced with the help of the defect fix data. As expected, each defect made a class more susceptible to further defects. Parameter estimates obtained from the various models differed considerably because of the disparity in their properties and assumptions. From the AIC, BIC, and log-likelihood values, we infer that the PWP model gives the best fit for the defect fix data. Still, the selection of an appropriate model depends on factors such as the research question, the type of defects, and so on.

The models discussed above were mostly introduced in the context of epidemiology and clinical trials, but the same scenario arises frequently in science and technology as well. In software quality assessment, recurrent event models can give more meaningful insights, and this paper illustrates different approaches to recurrent event data that can be applied further in quality modeling.