The main problem with using software repositories for defect prediction is the lack of integration between the CVS history files and the defect tracking system. PRs can be linked to MRs through the PR identification number, which appears both in the MRs in CVS and in the PRs in Bugzilla.
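The linking step can be illustrated with a minimal Python sketch. It assumes that commit messages mention the report as "bug 12345" or "Bug #12345"; this convention, the function names, and the (revision, message) input shape are all hypothetical and are not taken from the data set described here.

```python
import re
from collections import defaultdict

# Assumed convention (hypothetical): commit messages cite the report
# as "bug 12345" or "Bug #12345".
BUG_ID_PATTERN = re.compile(r"\bbug\s*#?\s*(\d+)", re.IGNORECASE)


def extract_pr_ids(message):
    """Return the set of PR identification numbers cited in a commit message."""
    return {int(number) for number in BUG_ID_PATTERN.findall(message)}


def link_prs_to_mrs(commits, known_pr_ids):
    """Map each known PR id to the revisions whose message cites it.

    commits      -- iterable of (revision, message) pairs from the CVS history
    known_pr_ids -- set of PR ids that actually exist in the defect tracker
    """
    links = defaultdict(list)
    for revision, message in commits:
        # Keep only ids that exist in the tracker, to limit false matches.
        for pr_id in extract_pr_ids(message) & known_pr_ids:
            links[pr_id].append(revision)
    return dict(links)
```

Restricting matches to identifiers that exist in the tracker is a design choice of this sketch: it discards numbers that merely look like PR ids, at the cost of missing commits whose message cites no identifier at all.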
A real challenge is to associate the bug reports in Bugzilla with specific Firefox releases. The data collection takes place at moment t3, and the goal is to collect the bugs that were in the source code at the moment of the release, t1. This is not trivial, as the following example illustrates. Suppose that at the time of the release, t1, a defect was in the source code. If the defect was fixed after the release, say at t2 or t’, then at t3, when the data is collected, the bug is labeled as resolved, so its current status alone does not show that it was present in the release.
In an open source environment, however, the two repositories are not always kept consistent. A bug may have been fixed, with the commit message present in the CVS history file at t2, while its status was never updated in Bugzilla. It may also be the case that the commit message in CVS does not reflect the change performed: it carries no PR identification number, even though the change resolves a problem that is reported in Bugzilla at t’.
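The inconsistencies described above can be made explicit with a small Python sketch. The record layout, the function names, and the rule that a fix dated after t1 suggests the bug was present in the release are assumptions made for illustration only; they are not the procedure used in this study.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class BugRecord:
    """Hypothetical view of one PR as seen at collection time t3."""
    pr_id: int
    resolved_in_bugzilla: Optional[datetime]  # None if still open at t3
    fix_commit_in_cvs: Optional[datetime]     # None if no commit cites the PR id


def reconcile(bug: BugRecord) -> str:
    """Describe how the two repositories agree or disagree on one PR."""
    in_cvs = bug.fix_commit_in_cvs is not None
    in_bugzilla = bug.resolved_in_bugzilla is not None
    if in_cvs and not in_bugzilla:
        return "fixed in CVS, status not updated in Bugzilla"
    if in_bugzilla and not in_cvs:
        return "resolved in Bugzilla, no linked commit"
    if in_cvs and in_bugzilla:
        return "consistent"
    return "unresolved"


def fixed_after_release(bug: BugRecord, t1: datetime) -> bool:
    """Assumed heuristic: the earliest evidence of a fix is later than t1."""
    evidence = [t for t in (bug.fix_commit_in_cvs, bug.resolved_in_bugzilla)
                if t is not None]
    return bool(evidence) and min(evidence) > t1
```

A record classified as anything other than "consistent" would need manual inspection, since neither repository alone can confirm when the fix took place.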
There is considerable debate about whether size and complexity can predict defects. We argue that size and complexity metrics are valuable for defect prediction, and that research should instead focus on the extent to which size and complexity can predict defects, or on the particular cases in which defects can be predicted from these metrics. In this context, we present our interpretation of the results.