So finally my work in defect prediction was published. Here is a presentation.
September 14, 2008
March 5, 2008
Intuitively linking CVS and Bugzilla
The main problem with using software repositories in defect prediction is the lack of integration between the CVS history files and the defect tracking system. You can link the problem reports (PRs) and modification reports (MRs) using the PR identification number, which is available both in the MRs in CVS and in the PRs in Bugzilla. However, the real challenge is to associate the bug reports in Bugzilla with specific Firefox releases. The data collection took place at moment t3, and the goal was to collect the bugs that were in the source code at the moment of the release, t1 (figure below). This is not trivial, as the following example illustrates. Suppose a defect was in the source code at the time of the release, t1. If the defect was fixed after the release, say at t2 or t’, then at t3, when we collected the data, the bug is labeled as resolved.
We approximate the defects that were in the source code at the time of a Firefox release t1 as all the bugs with a creation timestamp before t1 AND:
1. with the status set to CLOSED, RESOLVED or VERIFIED after t1, OR
2. with a CVS commit timestamp after t1, OR
3. with the status NEW, ASSIGNED or REOPENED at t3.
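The three criteria can be sketched as a small filter. This is only an illustration: the `Bug` record, its field names and the function `in_release` are hypothetical, and they do not reflect the actual Bugzilla or CVS schema or the scripts used in this work.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Statuses that mean the bug is still open at collection time (t3).
OPEN_STATUSES = {"NEW", "ASSIGNED", "REOPENED"}


@dataclass
class Bug:
    # Hypothetical record; the field names are assumptions for this sketch.
    bug_id: int
    created: datetime
    status_at_t3: str                        # status when the data was collected
    closed_at: Optional[datetime] = None     # when it became CLOSED/RESOLVED/VERIFIED
    committed_at: Optional[datetime] = None  # timestamp of the linked CVS commit


def in_release(bug: Bug, t1: datetime) -> bool:
    """Approximate whether the bug was in the source code at release time t1."""
    if bug.created >= t1:
        return False  # reported after the release: not considered
    closed_after = bug.closed_at is not None and bug.closed_at > t1
    committed_after = bug.committed_at is not None and bug.committed_at > t1
    still_open = bug.status_at_t3 in OPEN_STATUSES
    return closed_after or committed_after or still_open
```

Calling `in_release(bug, t1)` for every collected report and keeping the ones that return `True` gives the approximate defect set for the release at t1.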
It may happen that a bug was fixed and the commit message exists in the CVS history file at t2, but the bug status was never updated in Bugzilla. It may also be the case that the commit message in CVS does not reflect the change performed: it has no PR identification number associated with it, even though the change resolves a problem that is reported in Bugzilla at t’. We also selected only the PRs with the severity marked as blocker, critical, major, normal or minor and, where applicable, with the resolution set to FIXED. The problem with this approach is that there may be defects in the code that are undiscovered at moment t1 and are reported only after the release. Because there is no way to tell to which release such bugs belong, we simply did not consider them.
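The severity and resolution selection could look roughly like the following. The function name `keep_report` and its parameters are made up for this sketch. Reading "where applicable" as "only for reports that are already resolved" is an assumption.

```python
# Severities kept in the data set, as listed in the text.
SEVERITIES = {"blocker", "critical", "major", "normal", "minor"}
# Statuses for which a resolution is expected (assumption for this sketch).
RESOLVED_STATUSES = {"RESOLVED", "VERIFIED", "CLOSED"}


def keep_report(severity: str, status: str, resolution: str = "") -> bool:
    """Return True if a problem report passes the severity/resolution selection."""
    if severity.lower() not in SEVERITIES:
        return False
    if status in RESOLVED_STATUSES:
        # A resolution applies only to resolved reports; keep FIXED ones.
        return resolution == "FIXED"
    return True  # open reports carry no resolution yet
```

For example, a report marked as an enhancement or resolved as a duplicate would be dropped, while an open report of normal severity would be kept.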
Mining for defects – Mozilla Firefox
The main problem with using software repositories in defect prediction is the lack of integration between the CVS history files and the defect tracking system. You can link the problem reports (PRs) with the modification reports (MRs) using the PR identification number, which is available both in the MRs in CVS and in the PRs in Bugzilla.
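A rough sketch of this linking step is shown here. It assumes that commit messages mention the report as "bug 12345" or "Bug #12345". That convention, the regular expression and the function name `link_commits_to_bugs` are assumptions made for illustration, not the procedure actually used.

```python
import re
from collections import defaultdict

# Assumed convention: the PR number follows the word "bug", optionally with "#".
BUG_ID = re.compile(r"\bbug\s*#?\s*(\d+)", re.IGNORECASE)


def link_commits_to_bugs(commits):
    """Map each PR number to the revisions whose commit message mentions it.

    commits: iterable of (revision, message) pairs taken from the CVS history.
    """
    links = defaultdict(list)
    for revision, message in commits:
        for match in BUG_ID.finditer(message):
            links[int(match.group(1))].append(revision)
    return dict(links)
```

Commits whose message carries no PR number simply do not appear in the result, which is exactly the case where a fix cannot be linked back to Bugzilla.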
A real challenge is to associate the bug reports in Bugzilla with specific Firefox releases. The data collection takes place at moment t3, and the goal is to collect the bugs that are in the source code at the moment of the release, t1. This is not trivial, as the following example illustrates. Suppose a defect is in the source code at the time of the release, t1. If the defect is fixed after the release, say at t2 or t’, then at t3, when the data is collected, the bug is labeled as resolved.
But we are dealing with an open source environment. It may happen that a bug was fixed and the commit message exists in the CVS history file at t2, but the bug status was never updated in Bugzilla. It may also be the case that the commit message in CVS does not reflect the change performed: it has no PR identification number associated with it, even though the change resolves a problem that is reported in Bugzilla at t’.
There is a lot of debate about whether size and complexity can predict defects. We argue that size and complexity metrics have value for defect prediction, and that research should instead focus on the extent to which size and complexity can predict defects, or on the particular cases in which defects can be predicted from these metrics. In this context, we present our interpretation of the results.
Data Collection – Mining for Defects
Firefox is a non-critical software system, and it is widely recognized that it contains post-release defects. Open source software (OSS) facilitates the collection of data to be used in defect prediction models. An important requirement for OSS code is that it should be rigorously modular, self-contained and self-explanatory, to allow development at remote sites. Therefore, the data for prediction models in OSS can be retrieved from the source code version repositories (CVS) and the bug tracking systems (Bugzilla). On the other hand, OSS development is characterized by the lack of a formal process, poor design and architecture, and development tools that are not comparable to those used in commercial development. Few of the defect prediction approaches used in commercial software can be applied directly to OSS development; however, results obtained from OSS prediction models can be used in an industry environment.
1. Versions: Firefox is based on independent Mozilla Core components layered together. Due to this architecture some of Mozilla’s applications share many components, but they are fundamentally different in functionality.
The Mozilla source code is organized in several branches. The trunk is the main branch, the central source code that is used for continuous and ongoing development. Trunk builds contain the very latest changes and updates. However, the trunk can also be very unstable at times. When development is started for a specific Mozilla version a new branch is created. At conception, a derived branch contains everything that the principal branch contains. Firefox 1.0 branch was derived from Mozilla Branch 1.7 while Firefox 1.5 from Mozilla Branch 1.8. Firefox branches that are forked from the existing Mozilla branch will be used for all future releases of Firefox. The term release is used in OSS development to refer to different types of releases: major and minor, alpha and beta.
Firefox Branch 1.5 resynchronized the code base with the trunk, which contained additional features not available in Firefox 1.0. On the other hand, in release 1.5.0.3 the focus was not on adding features but on improving security-related aspects, which were bypassed in version 1.5.0. This peculiarity of the three selected releases allowed us to test whether the performance of a defect prediction model increases when it is trained on data collected from major releases instead of minor ones.
2. Module Selection: The reason behind branching is that components that need to be prepared for a future release are at the same time continuously developed on the trunk. A distinction needs to be made between Firefox-specific source code, i.e. code that does not support any other Mozilla application, and the Mozilla components that support Firefox.
3. Metrics: To derive the product metrics for each source file, Understand C++ can be used. The tool computes the source code metrics for C and C++ programs and generates metrics reports. The reports contain three categories of metrics: project level, file level, and function level. They also contain object-oriented metrics for the .cpp files.
March 4, 2008
Firefox Development Process
Firefox is based on independent Mozilla Core components layered together. Due to this architecture some of Mozilla’s applications share many components, but they are fundamentally different in functionality.
The Mozilla source code is organized in several branches. The trunk is the main branch, the central source code that is used for continuous and ongoing development. Trunk builds contain the very latest changes and updates. However, the trunk can also be very unstable at times. When development is started for a specific Mozilla version a new branch is created. At conception, a derived branch contains everything that the principal branch contains (Figure 1). Firefox 1.0 branch was derived from Mozilla Branch 1.7 while Firefox 1.5 from Mozilla Branch 1.8. Firefox branches that are forked from the existing branch will be used for all future releases of Firefox. The term release is used in OSS development to refer to different types of releases: major and minor, alpha and beta. Due to data availability constraints, we have only considered two major releases, 1.0 and 1.5, and a minor release, 1.5.0.3, in our work presented here.
Firefox Branch 1.5 resynchronized the code base with the trunk, which contained additional features not available in Firefox 1.0. On the other hand, in release 1.5.0.3 the focus was not on adding features but on improving security-related aspects, which were bypassed in version 1.5.0. This characteristic of the three selected releases allowed us to test whether the performance of a defect prediction model increases when it is trained on data collected from major releases instead of minor ones.
The reason behind branching is that components that need to be prepared for a future release are at the same time continuously developed on the trunk. A distinction needs to be made between Firefox-specific source code, i.e. code that does not support any other Mozilla application, and the Mozilla components that support Firefox.