The main problem with using software repositories in defect prediction is the lack of integration between the CVS history files and the defect tracking system. Problem reports (PRs) and modification reports (MRs) can be linked through the PR identification number, which appears both in the MRs in CVS and in the PRs in Bugzilla. The real challenge, however, is to associate the bug reports in Bugzilla with specific Firefox releases. Data collection took place at moment t3, and the goal was to collect the bugs that were in the source code at the moment of the release, t1 (Figure below). This is not trivial, as the following example illustrates. Suppose a defect was in the source code at release time t1. If the defect was fixed after the release, say at t2 or t’, then at t3, when we collected the data, the bug is labeled as resolved, so its current status alone no longer shows that it was present at t1.
We approximate the defects that were in the source code at the time of a Firefox release t1 as all the bugs with a creation timestamp before t1 AND:
1. whose status was set to CLOSED, RESOLVED or VERIFIED after t1, OR
2. whose CVS commit timestamp is after t1, OR
3. whose status is NEW, ASSIGNED or REOPENED at t3.
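The selection rule above can be sketched as a single predicate. This is only an illustrative sketch, not the tooling used in the study: the `BugRecord` type, its field names, and the function `present_at_release` are hypothetical, and it assumes that the relevant timestamps have already been extracted from Bugzilla and from the CVS history file.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Statuses that mean the bug is still open at collection time (t3).
OPEN_STATUSES = {"NEW", "ASSIGNED", "REOPENED"}


@dataclass
class BugRecord:
    """Hypothetical record combining Bugzilla and CVS information for one PR."""
    created: datetime                        # creation timestamp in Bugzilla
    status_at_collection: str                # status observed at t3
    resolved_at: Optional[datetime] = None   # when status became CLOSED/RESOLVED/VERIFIED
    committed_at: Optional[datetime] = None  # timestamp of the linked CVS commit, if any


def present_at_release(bug: BugRecord, t1: datetime) -> bool:
    """Approximate whether the bug was in the source code at release time t1."""
    # Precondition: the bug must have been reported before the release.
    if bug.created >= t1:
        return False
    # Condition 1: marked CLOSED, RESOLVED or VERIFIED only after t1.
    closed_after = bug.resolved_at is not None and bug.resolved_at > t1
    # Condition 2: the fix was committed to CVS only after t1.
    committed_after = bug.committed_at is not None and bug.committed_at > t1
    # Condition 3: still open when the data was collected.
    still_open = bug.status_at_collection in OPEN_STATUSES
    return closed_after or committed_after or still_open
```

Because the three conditions are joined by OR, a bug whose Bugzilla status was never updated can still be selected through its commit timestamp, and the reverse also holds.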
It may happen that a bug was fixed and the commit message exists in the CVS history file at t2, but the bug status was never updated in Bugzilla. It may also be the case that the commit message in CVS does not reflect the change performed: it has no PR identification number associated with it, even if the change resolves a problem and is reported in Bugzilla at t’. We also selected only the PRs with the severity marked as blocker, critical, major, normal or minor and, where applicable, with the resolution set to FIXED. A limitation of this approach is that there may be defects in the code that were still undiscovered at moment t1 and were reported only after the release. Because there is no way to tell which release such bugs belong to, we did not consider them.
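To illustrate how PRs and MRs can be linked, and why some commits remain unlinked, the sketch below extracts PR identification numbers from commit messages. The regular expression is an assumption about how identifiers are written in the messages (for example "bug 12345"); the actual convention in the CVS history may differ. The names `extract_pr_ids` and `link_commits` are hypothetical.

```python
import re

# Assumed convention: the PR number follows the word "bug" or "PR",
# optionally with a "#". Real commit messages may use other forms.
BUG_ID_PATTERN = re.compile(r"\b(?:bug|pr)\s*#?\s*(\d+)", re.IGNORECASE)


def extract_pr_ids(commit_message):
    """Return the set of PR numbers mentioned in one commit message."""
    return {int(number) for number in BUG_ID_PATTERN.findall(commit_message)}


def link_commits(commits, known_pr_ids):
    """Link (revision, message) pairs to known PR numbers.

    Returns a dict mapping each PR number to its revisions, and a list
    of revisions that could not be linked to any known PR.
    """
    linked = {}
    unlinked = []
    for revision, message in commits:
        ids = extract_pr_ids(message) & known_pr_ids
        if ids:
            for pr_id in sorted(ids):
                linked.setdefault(pr_id, []).append(revision)
        else:
            # No identifier, or an identifier that matches no known PR:
            # this is the case where the commit cannot be tied to a bug.
            unlinked.append(revision)
    return linked, unlinked
```

Commits that end up in the unlinked list correspond to the situation described above, where a change may resolve a problem but carries no usable PR identification number.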