Dots of Tech Perception

October 28, 2008

On Role Mining

Role mining is the wrong way to approach the role definition problem. We need to distinguish between actual roles and potential (candidate) roles.

  • The actual roles form a complete and definite set, and each of them has semantics in the organizational context under consideration.
  • Role mining defines a set of potential roles. We can only come up with an incomplete solution to the role mining problem: by means of approximation and heuristics, we can find an almost optimal solution, or one that works reasonably well, which is only a subset of all the potential roles.
  • Whereas all the actual roles are assumed to exist and to be useful in an enterprise context, an overwhelming majority of the candidate roles will be eliminated.

So instead of trying to find all candidate roles, why not redefine the problem and start from the correct premises to derive the set of correct roles? In the end it all comes down to the old saying: why buy the cow (i.e., define the set of ALL candidate roles) when you can have the milk for free (i.e., the actual roles)?

October 10, 2008

Role Definition: Role Mining vs Role Engineering

Filed under: Role Mining,role definition — GG @ 4:17 am

For the purpose of role definition, two strategies can be used: top-down and bottom-up. The top-down strategy essentially breaks down an organizational system to gain insight into its constituent business units. Each department is then refined in detail, sometimes through more than two additional levels (e.g., groups, teams), until the entire specification is reduced to business roles. The bottom-up strategy pieces together entitlement objects in order to define technical roles. The individual entitlements are first specified in great detail. They are then linked together to form technical roles, which in turn are linked at a number of levels, depending on the desired granularity, until a complete top-level system is formed.

In short, bottom-up role definition focuses on analyzing the entitlements in target systems and is typically performed by a technical team, while the top-down strategy is centered on user attributes and is performed by a business team. Regardless of the approach, the enterprise should first capture the scope of role definition and then decide on the strategy to use. Two main objectives are prevalent when defining roles: defining roles for administrative purposes and defining roles for security and compliance purposes.

First, when the goal of the role definition effort is to validate and verify the technical environment and to correct inaccurate functional and security controls, a bottom-up strategy is recommended. This strategy starts with collecting and analyzing entitlements. The roles resulting from a bottom-up strategy can be both technical roles and business roles. For the resulting roles to become usable, they need perspective from business users and owners.

Second, when the main objective of role definition is to validate and verify the administrative environment and to correct inaccurate structural controls, a top-down approach is recommended. This strategy focuses on creating business roles by analyzing job responsibilities as well as the relationships among responsibilities. However, for these roles to become usable, they need perspective and input from resource owners.

This differentiation allows the two strategies to be implemented independently during the role definition process. Up to a certain point, the two strategies can be applied in parallel. They should meet in the middle to create a mapping structure between the business roles and technical roles. Technical teams can make considerable progress toward creating a role structure for an enterprise, but the roles must ultimately be approved by business owners. Similarly, business representatives are able to create high-level role definitions but will have some difficulty in mapping these roles to the entitlement infrastructure.

After the strategy for role definition (bottom-up, top-down or a combination of both) is established based on the project scope, the process scope must be established. The role definition process depends on what needs to be done in order to define roles and, to some extent, on how it should be done (e.g., establishing role guidelines or a quality model for role definition). Both top-down and bottom-up strategies can be tactically implemented using processes like role engineering and role mining. If the scope of the role definition process is to enumerate the set of roles without cardinality constraints, the process tries to answer the question “How many roles?” and not “Why so many roles?” In this case, the outcome of the definition effort will be represented as actions to be taken on the entitlements and users for which data is being or has been collected. This process applies best to organizations with a small number of employees, or with highly specialized business units with a small number of users. However, if the process of role definition aims to correct or improve user attributes and entitlement data and/or the workflow of future administrative tasks, then cardinality constraints must be taken into consideration, and the important question becomes “Why so many roles?” In this case, the outcome of the definition effort will be actions to be performed on the process that produced the data. Therefore, the focus of the process will be on data cleaning, data consolidation and data synchronization.

The role definition process can be viewed, by extrapolation, as requirements analysis. Coyne was the first to use the term role engineering (Coyne, 1996). Requirements analysis methods can be regarded as process-oriented, data-oriented and control-oriented (Thayer & Dorfman, 1997). From the point of view of role definition, a process-oriented method would take into consideration the way the systems transform user and entitlement data into roles, with less emphasis on the user and entitlement data itself and on control aspects. Data-oriented methods for role definition would emphasize the state of the provisioning system as a user and entitlement data structure. Control-oriented methods would emphasize synchronization, exclusion, concurrence, and role activation and deactivation. All three methods can be used in the activities performed at different stages of a role definition effort. Careful consideration should be given to the scope of each phase, and an appropriate engineering method should be selected.

Requirements engineering activities associated with role definition may be classified in terms of:

  • role elicitation: discovering, reviewing, and understanding roles and constraints
  • role analysis: defining role structure and constraints
  • role specification: refining and documenting role structure and constraints, clearly and precisely
  • role verification: ensuring that the roles are complete, correct, consistent, and clean

Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data items. Data mining often involves the analysis of data stored in a data warehouse. By extrapolation, role mining can be defined as the use of automated data analysis techniques to uncover previously undetected relationships among users and entitlements. Role mining involves the analysis of user information (attributes and entitlements) stored in various systems. Data mining techniques can be used to discover entitlement-to-technical-role associations. In the same way, associations between user attributes and business roles can be discovered using pattern recognition techniques. But the final decisions on entitlement-to-technical-role and user-attribute-to-business-role definitions can only be made in a business-centric process. Data mining activities associated with role definition may be structured in terms of:

  • role data collection
  • role data preparation
  • role modeling
  • role evaluation

Data mining for role definition is nothing more than a tool that can assist the role definition process. Still, this approach can reduce the problem space, either enough to solve the problem outright or enough that the optimum solution can be found with a (worst-case) exponential method. However, it is very difficult to use data mining approaches to mine for roles in an industrial environment. The main reason is that they fail to discover roles with semantic meaning. Existing role mining problem definitions only use user entitlement information. Since entitlements are symbols without meaning, this limits the ability to identify usable roles. Mining on both user attributes and entitlements might be a future research direction to consider.
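To make the limitation concrete, here is a minimal sketch of one naive mining heuristic: group users who hold exactly the same set of entitlements and treat each shared set as a candidate role. The function name, the `min_users` threshold and the sample data are all hypothetical, and this is not a full role mining algorithm.

```python
from collections import defaultdict


def mine_candidate_roles(user_entitlements, min_users=2):
    """Group users by identical entitlement sets (a naive heuristic).

    Returns a dict mapping each shared entitlement set to the sorted
    list of users holding exactly that set.
    """
    groups = defaultdict(list)
    for user, entitlements in user_entitlements.items():
        groups[frozenset(entitlements)].append(user)
    # Keep only the sets shared by at least `min_users` users.
    return {
        ents: sorted(users)
        for ents, users in groups.items()
        if len(users) >= min_users
    }


# Hypothetical user-to-entitlement data.
assignments = {
    "jane": {"E1", "E2"},
    "john": {"E1", "E2"},
    "mary": {"E3"},
}
candidates = mine_candidate_roles(assignments)
```

The only candidate found is the set {E1, E2}, shared by jane and john. Nothing in the output says what that set means to the business, which is exactly the semantic gap described above.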

Provisioning – A Definition

Filed under: provisioning,role definition — GG @ 3:25 am

In an enterprise environment, provisioning mechanisms are used to ensure that users have access only to the entitlements that they need in order to perform the responsibilities assigned to them throughout their full life-cycle (i.e., from employment to separation). Provisioning mechanisms attempt to automate the previously manual responsibilities of the human resources and information technology departments. More formally, provisioning can be viewed as all the life-cycle steps required to set up, maintain and terminate user access to a directory and to data target systems.

The life-cycles to be defined in order to assign users the required level of access depend on the chosen provisioning model (e.g., rule-based provisioning, role-based provisioning). Role-based provisioning is provisioning based on the roles that users can play in an enterprise environment (e.g., team leader role, division manager role). It can be viewed in terms of three life-cycles: the user life-cycle, the role life-cycle and the entitlement life-cycle.

The entitlement life-cycle uses entitlement objects as abstractions of privileges that are currently held by users. An entitlement object instance can be approved or pending approval, fulfilled/active or pending fulfillment/activation, removed or pending removal. Pending entitlement objects may also represent privileges to be added to a user, privileges to be removed, or changes in the configuration of a current privilege.

The role life-cycle uses abstract role objects. Roles dictate what electronic access and physical assets are to be provided to a user, either automatically or manually, and how each entitlement is to be encapsulated in a role. Roles can be in the same states as entitlements.

The user life-cycle defines user objects as abstractions of users who are hired, transferred, promoted, going on vacation and/or separated (i.e., leaving the organization). User object entities are the current users with their organizational responsibilities and duties, function, rank, location and other attributes (e.g., visa status). User objects may represent employees, contractors, vendors, partners, customers or other types of relationship with the organization under consideration. An implementation of a user object can be represented on target systems by authentication credentials or user profile attributes.
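The entitlement states listed above can be sketched as a small state machine. The state names come from the text, but the order of the transitions is an assumption made for illustration, since the post lists the states without saying how an entitlement moves between them.

```python
from enum import Enum


class EntitlementState(Enum):
    PENDING_APPROVAL = "pending approval"
    APPROVED = "approved"
    PENDING_FULFILLMENT = "pending fulfillment"
    ACTIVE = "fulfilled/active"
    PENDING_REMOVAL = "pending removal"
    REMOVED = "removed"


# Assumed linear ordering of the states (not specified in the text).
NEXT_STATE = {
    EntitlementState.PENDING_APPROVAL: EntitlementState.APPROVED,
    EntitlementState.APPROVED: EntitlementState.PENDING_FULFILLMENT,
    EntitlementState.PENDING_FULFILLMENT: EntitlementState.ACTIVE,
    EntitlementState.ACTIVE: EntitlementState.PENDING_REMOVAL,
    EntitlementState.PENDING_REMOVAL: EntitlementState.REMOVED,
}


def advance(state):
    """Move an entitlement to its next state; REMOVED is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is a terminal state")
    return NEXT_STATE[state]
```

Since roles can be in the same states as entitlements, the same sketch would apply to role objects.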

The consolidated user attributes in a provisioning system contain logically integrated user, role and entitlement object instances. Entitlement and role objects cannot exist in a provisioning system without a user object, but user objects can exist without role and entitlement objects. Any of the above life-cycles is triggered when user attributes change due to transfers or separations, or when data for a new user is detected. A change in a user object instance will trigger a role or entitlement life-cycle task.

October 7, 2008

Roles for Provisioning

Roles can help simplify entitlement and user provisioning inside an organization. By using roles, the organization’s management can gain visibility into the entitlements allocated to users. Roles are also desirable to organizations wishing to deploy provisioning systems or to measure access compliance, because of their potential for improving security and control. Even if the advantages of using roles in an enterprise environment are acknowledged, no agreement exists on the theoretical definition of a role. From the perspective of role theory, roles are the culturally defined norms—rights, duties, expectations, and standards for behavior—associated with a given social position.

By extending this definition to an enterprise environment, roles can be defined as company-specific norms (duties and responsibilities) associated with a given functional or organizational position. Hence, roles are defined by operational needs and functional constraints. Roles limit access to data and enable the administration of user and entitlement objects. Roles can be used to describe a class of entitlements from a technical point of view. They are also used by business units to represent aspects of the organizational structure. Organizations often use the term role to cover both meanings, which immediately makes roles very difficult to define. That is why, in the context of role definition for provisioning, we will use two types of roles: business roles and technical roles.

A business role (organizational or structural role) is an articulation of a business responsibility required to perform a job function. It is a named collection of business functions (duties or responsibilities) that can be performed by users in an organization. Business roles are derived from operational procedures and policies. They are created and managed by business users, such as managers and team leaders. A business role can be either static or dynamic.

Dynamic business roles depend on rules to determine role-to-user assignment. Allocation rules define who must be assigned to a particular role and under what circumstances. For example, a business unit manager, John, can create a rule to grant the business role Statistician to all active users in the Statistics department who have the job description Statistician. All users in the selected department who meet the condition described by this rule are automatically added to the list of users of the Statistician role. Automatically entitling users to business roles relies on data available from human resources target systems. When the attribute values that control automatically assigned roles change, user assignments to roles change as well.

Static business roles are assigned through manual requests. Unlike dynamic business roles, which rely on rules, static business roles must be requested manually for each individual user who is entitled to the specific role. For example, consider the static business role “mission travel”. John, a manager, decides to assign this role to his employee Jane, who is traveling on a mission to Europe. John must manually request this role for Jane. Ideally, when Jane comes back from the mission, John should not have to manually revoke the role from Jane. The role definition should consider environmental attributes (e.g., role duration), and the role assignment should be revoked automatically when a given condition (e.g., role duration is less than 3 months) is satisfied.
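The Statistician allocation rule described above can be sketched as a predicate over user attributes. The `User` fields and function names are hypothetical stand-ins for whatever the human resources target system actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # Hypothetical HR attributes used by the allocation rule.
    name: str
    department: str
    job_description: str
    active: bool = True


def statistician_rule(user):
    """Active users in the Statistics department whose job description is Statistician."""
    return (
        user.active
        and user.department == "Statistics"
        and user.job_description == "Statistician"
    )


def assign_dynamic_role(users, rule):
    """Return the names of the users who currently satisfy the rule."""
    return sorted(u.name for u in users if rule(u))


users = [
    User("Jane", "Statistics", "Statistician"),
    User("Paul", "Statistics", "Economist"),
    User("Ann", "Statistics", "Statistician", active=False),
]
members = assign_dynamic_role(users, statistician_rule)
```

Only Jane satisfies the rule here. Re-running the assignment whenever HR attributes change is what keeps a dynamic role's membership current without manual requests.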

A technical role (IT or functional role) is a named collection of entitlements and can be mapped to business roles in order to assign users to a set of entitlements. Suppose that Jane is a new employee in the Statistics Department of Mother Organization and that among her main responsibilities is data dissemination to partner organizations. Then John, Jane’s manager, should give her access to the Data Dissemination business role. The Data Dissemination role is further divided into the technical roles Dissemination Access to Partner Organization 1 and Dissemination Access to Partner Organization 2, depending on the information system used for exchange between Mother Organization and the partner organizations. In other words, the technical role Dissemination Access to Partner Organization 1 is a collection of entitlements (Create New Data Set, Publish Data Set and See Reports of Published Data) on the specific target system used for data exchange between Partner Organization 1 and Mother Organization. Technical roles can be refined based on target systems (e.g., various exchange systems between organizations), based on various levels of security within the same target system, or based on user attributes.
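The mapping in this example can be sketched as two lookup tables, one from business roles to technical roles and one from technical roles to entitlements. The dictionary layout and function name are assumptions. The entitlements of the Partner Organization 2 role are not given in the text, so they are left empty.

```python
# Technical roles: named collections of entitlements on a target system.
TECHNICAL_ROLES = {
    "Dissemination Access to Partner Organization 1": {
        "Create New Data Set",
        "Publish Data Set",
        "See Reports of Published Data",
    },
    # Entitlements for this role are not listed in the text.
    "Dissemination Access to Partner Organization 2": set(),
}

# Business roles contain technical roles, not direct entitlements.
BUSINESS_ROLES = {
    "Data Dissemination": [
        "Dissemination Access to Partner Organization 1",
        "Dissemination Access to Partner Organization 2",
    ],
}


def entitlements_for(business_role):
    """Resolve a business role to the union of its technical roles' entitlements."""
    resolved = set()
    for technical_role in BUSINESS_ROLES[business_role]:
        resolved |= TECHNICAL_ROLES[technical_role]
    return resolved


jane_entitlements = entitlements_for("Data Dissemination")
```

Assigning Jane the single business role is enough to derive her entitlements on each exchange system. John never has to handle individual entitlements.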

The industry is split into two camps when it comes to role definition: the pragmatic one, which thinks that there is no need for multiple categories of roles (Courion, 2007), and the more academic one, which believes that there is a need for distinction (Eurekify Ltd, 2008). I am in the latter camp and think that business and technical roles have practical applications.

There is an immediate and urgent need for role-based provisioning, which may not involve role types. However, using the business and technical role types generates options for additional uses of each type within and outside the span of provisioning. Moreover, this differentiation introduces the necessary level of abstraction between the technical and business organizational levels. The main argument in favor of using role types is that they make role-based provisioning easier to understand over time. Hence, the distinction between business and technical roles helps role maintenance. One additional argument in favor of the separation between business and technical roles is that it allows the implementation of management and security controls, such as:

  • all business roles must be approved by the personnel administrative department
  • all business roles must be approved and activated by business unit managers
  • only business roles can be directly assigned to users upon manual request
  • business roles can only contain other roles, and not direct entitlements
  • technical roles can only be approved by resource owners

September 14, 2008

Paper Published

Filed under: bugzilla,defect prediction,mozilla,mozilla firefox — GG @ 6:19 pm

So finally my work in defect prediction was published. Here is a presentation.

March 5, 2008

Intuitively linking CVS and Bugzilla

The main problem with using software repositories in defect prediction is the lack of integration between the CVS history files and defect tracking systems. You can link the problem reports (PRs) and modification reports (MRs) using the PR identification number available both in the MRs in CVS and in the PRs in Bugzilla. However, the real challenge is to associate the bug reports in Bugzilla with specific Firefox releases. The data collection process took place at moment t3, and the goal was to collect the bugs that were in the source code at the moment of the release, t1 (Figure below). This is not trivial, as the following example illustrates. Suppose that at the time of the release, t1, a defect was in the source code. If the defect was solved after the release, say at t2 or t’, the bug is labeled as resolved at t3, when we collected the data.

We approximate the defects that were in the source code at the time of a Firefox release, t1, as all the bugs with a creation timestamp before t1 AND:

1. with the status set to CLOSED, RESOLVED or VERIFIED after t1 OR
2. with a committed-to-CVS timestamp after t1 OR
3. with the status NEW, ASSIGNED or REOPENED at t3.
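The selection criteria can be sketched as a single predicate. The `Bug` record, its field names and the sample dates are assumptions made for illustration. The `status` field holds the status observed at collection time t3, and `status_changed` records when that status was set.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

OPEN_STATUSES = {"NEW", "ASSIGNED", "REOPENED"}
DONE_STATUSES = {"CLOSED", "RESOLVED", "VERIFIED"}


@dataclass
class Bug:
    bug_id: int
    created: datetime
    status: str                                 # status observed at t3
    status_changed: Optional[datetime] = None   # when that status was set
    committed: Optional[datetime] = None        # linked CVS commit, if any


def in_release(bug, t1):
    """Approximate whether the bug was in the source code at release time t1."""
    if bug.created >= t1:
        return False
    resolved_after = (
        bug.status in DONE_STATUSES
        and bug.status_changed is not None
        and bug.status_changed > t1
    )
    committed_after = bug.committed is not None and bug.committed > t1
    still_open = bug.status in OPEN_STATUSES
    return resolved_after or committed_after or still_open


t1 = datetime(2005, 11, 29)  # illustrative release time
bugs = [
    Bug(1, datetime(2005, 10, 1), "RESOLVED", datetime(2005, 12, 5)),
    Bug(2, datetime(2005, 10, 1), "RESOLVED", datetime(2005, 11, 1),
        committed=datetime(2005, 10, 30)),
    Bug(3, datetime(2005, 12, 10), "NEW"),
    Bug(4, datetime(2005, 9, 1), "ASSIGNED"),
]
selected = [b.bug_id for b in bugs if in_release(b, t1)]
```

Bugs 1 and 4 are selected. Bug 2 was fixed before the release, and bug 3 was only reported afterwards.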

It may happen that a bug was solved and the commit message exists in the CVS history file at t2, but the bug status was not modified in Bugzilla. It may also be the case that the commit message in CVS does not reflect the change performed or does not have a PR identification number associated with it, even though the change resolves a problem that is reported in Bugzilla at t’. We also selected only the PRs with the severity marked as blocker, critical, major, normal or minor and, where applicable, with the resolution set to FIXED. The problem with this approach is that there may be defects in the code that were undiscovered at moment t1 and that will be reported after the release. Because there is no way to tell to which release such bugs belong, we simply did not consider them.

Mining for defects – Mozilla Firefox

Filed under: bugzilla,defect prediction,mozilla,mozilla firefox — GG @ 1:58 am

The main problem with using software repositories in defect prediction is the lack of integration between the CVS history files and defect tracking systems. You can link the problem reports (PRs) with the modification reports (MRs) using the PR identification number available both in the MRs in CVS and in the PRs in Bugzilla.

A real challenge is to associate the bug reports in Bugzilla with specific Firefox releases. The data collection process takes place at moment t3, and the goal is to collect the bugs that are in the source code at the moment of the release, t1. This is not trivial, as the following example illustrates. Suppose that at the time of the release, t1, a defect was in the source code. If the defect was solved after the release, say at t2 or t’, the bug is labeled as resolved at t3, when the data is collected.

But we are dealing with an open source environment. It may happen that a bug was solved and the commit message exists in the CVS history file at t2, but the bug status was not modified in Bugzilla. It may also be the case that the commit message in CVS does not reflect the change performed or does not have a PR identification number associated with it, even though the change resolves a problem that is reported in Bugzilla at t’.

There is a lot of debate about whether size and complexity can predict defects. We argue that size and complexity metrics have value for defect prediction and that research should rather focus on the extent to which size and complexity can predict defects, or on the particular cases in which defects can be predicted from these metrics. In this context, we present our interpretation of the results.

Data Collection – Mining for Defects

Filed under: bugzilla,defect prediction,mozilla,mozilla firefox — GG @ 1:50 am

Firefox is a non-critical software system, and it is widely recognized that it contains post-release defects. OSS facilitates the collection of data to be used in defect prediction models. An important requirement for OSS code is that it should be rigorously modular, self-contained and self-explanatory, to allow development at remote sites. Therefore, the data that can be used for prediction models in OSS can be retrieved from the source code version repositories (CVS) and bug tracking systems (Bugzilla). On the other hand, OSS development is characterized by the lack of a formal process, poor design and architecture, and development tools that are not comparable to those used in commercial development. Few of the defect prediction approaches used for commercial software can be directly applied to OSS development; however, results obtained from OSS prediction models can be used in an industry environment.

1. Versions: Firefox is based on independent Mozilla Core components layered together. Due to this architecture some of Mozilla’s applications share many components, but they are fundamentally different in functionality.

The Mozilla source code is organized in several branches. The trunk is the main branch, the central source code that is used for continuous and ongoing development. Trunk builds contain the very latest changes and updates. However, the trunk can also be very unstable at times. When development is started for a specific Mozilla version a new branch is created. At conception, a derived branch contains everything that the principal branch contains. Firefox 1.0 branch was derived from Mozilla Branch 1.7 while Firefox 1.5 from Mozilla Branch 1.8. Firefox branches that are forked from the existing Mozilla branch will be used for all future releases of Firefox. The term release is used in OSS development to refer to different types of releases: major and minor, alpha and beta.

Firefox Branch 1.5 resynchronized the code base with the trunk, which contained additional features not available in Firefox 1.0. On the other hand, in release 1.5.0.3 the focus was not on adding features but on improving security-related aspects, which were bypassed in version 1.5.0. This peculiarity of the three selected releases allowed us to test whether the performance of a defect prediction model increases when it is trained on data collected from major releases instead of minor ones.

2. Module Selection: The reason behind branching is that components that need to be prepared for a future release are at the same time continuously developed on the trunk. A distinction needs to be made between Firefox-specific source code, i.e. code that does not support any other Mozilla application, and the Mozilla components that support Firefox.

3. Metrics: To derive the product metrics for each source file, Understand C++ can be used. The tool computes source code metrics for C and C++ programs and generates metrics reports. The reports contain three categories of metrics: project level, file level, and function level. They also contain object-oriented metrics for the .cpp files.

March 4, 2008

Mozilla Bugzilla Reporting Process – aka a bug’s lifecycle

Filed under: bugzilla,defect prediction,mozilla firefox — GG @ 6:51 am

The Mozilla project relies on Bugzilla, a defect tracking system, to monitor problem reports (PRs), i.e. bugs. A PR in Bugzilla has several pre-defined attributes. Some fields, such as the PR identification number and creation timestamp, are set when the report is first filed. Other fields, such as the product, component, and severity, are selected by the testers when the report is filed and may be changed over the lifetime of the report. Still other fields routinely change over time, such as the current status of the report and, if resolved, its resolution state.

Studying the lifecycle of a bug facilitates linking the Bugzilla PRs and the CVS Modification Reports (MRs). The status and resolution fields define bugs as evolving entities that change over time. When a tester enters a new bug in Bugzilla, the status of the bug is set to UNCONFIRMED. The Mozilla quality assurance team then looks at it, confirms that the bug exists and changes its status to NEW. After a developer looks at the bug and either accepts it or assigns it to someone else, the bug’s status becomes ASSIGNED. Once the bug is fixed, its status changes to RESOLVED. Finally, the quality assurance team verifies that the bug was indeed fixed, and the status is set to VERIFIED and then CLOSED. If the quality assurance team is not satisfied with the solution, then the bug is REOPENED and the process starts again.

A report can be RESOLVED in various ways. Bugzilla PRs indicate this in the resolution field. If the bug was solved and this resulted in a change to the code base, the bug is resolved as FIXED. When a developer determines that the bug is a duplicate of an existing report, it is marked as DUPLICATE. If the developer is unable to reproduce the defect, the resolution is set to WORKSFORME. If the report describes a problem that will not be fixed, or is not an actual bug, the report is marked as WONTFIX or INVALID.
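The status changes described above can be sketched as a transition table. This is a simplification that covers only the path described in this post, not every transition Bugzilla permits. The REOPENED to ASSIGNED edge is an assumption about what "the process starts again" means.

```python
# Simplified status transitions, following the lifecycle described above.
TRANSITIONS = {
    "UNCONFIRMED": {"NEW"},
    "NEW": {"ASSIGNED"},
    "ASSIGNED": {"RESOLVED"},
    "RESOLVED": {"VERIFIED", "REOPENED"},
    "VERIFIED": {"CLOSED"},
    "REOPENED": {"ASSIGNED"},  # assumption: work resumes with a developer
    "CLOSED": set(),
}

# Values of the resolution field mentioned in the text.
RESOLUTIONS = {"FIXED", "DUPLICATE", "WORKSFORME", "WONTFIX", "INVALID"}


def is_valid_history(statuses):
    """Check that each consecutive pair of statuses is an allowed transition."""
    return all(
        later in TRANSITIONS[earlier]
        for earlier, later in zip(statuses, statuses[1:])
    )


happy_path = ["UNCONFIRMED", "NEW", "ASSIGNED", "RESOLVED", "VERIFIED", "CLOSED"]
reopened = ["ASSIGNED", "RESOLVED", "REOPENED", "ASSIGNED", "RESOLVED"]
skipped = ["NEW", "CLOSED"]
```

A check like this can help flag status histories that skip steps when bug data is cleaned before being linked to CVS.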

In Bugzilla terminology, a bug can be anything that needs to be tracked. Some entries are not real bugs, i.e. defects, but rather enhancements. When analyzing a report in Bugzilla, the quality assurance team rates the severity of the bug using one of the following labels: blocker, critical, major, normal, minor, trivial, or enhancement.

While Bugzilla contains information about defects, it does not contain information about the location of the defects in the source code. Instead, this information is captured in the CVS log files. CVS Modification Reports (MRs) keep the complete history of every file in the project, including when and what was modified. Bonsai, Mozilla’s web interface to its CVS repository, can be used to retrieve the MRs related to source files, the comments associated with the files, and the timestamp of the commit message. Each comment acknowledges the people who submitted the change and contains the relevant PR identification numbers (if any). Every number that appeared in an MR’s comment field was a potential link to a bug, indicating that the commit resolved a PR. We selected a number as a candidate bug id if the following two conditions were met: the number was fewer than 6 digits long, and the comment message contained one of the keywords bug, bug id, id or # before the number.
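The two conditions for a candidate bug id can be sketched as a regular expression. The pattern below is one possible reading of the rule, not the expression used in the original study. It requires the keyword to immediately precede the number, and that strictness is an assumption.

```python
import re

# A keyword (bug id, bug, id or #), optional separator, then 1 to 5 digits
# that are not followed by another digit (i.e. fewer than 6 digits in total).
BUG_ID = re.compile(
    r"(?:\b(?:bug\s+id|bug|id)|#)\s*[:#]?\s*(\d{1,5})(?!\d)",
    re.IGNORECASE,
)


def candidate_bug_ids(comment):
    """Return the candidate PR identification numbers found in a commit comment."""
    return [int(number) for number in BUG_ID.findall(comment)]


ids_simple = candidate_bug_ids("Fix for bug 12345, r=foo")
ids_mixed = candidate_bug_ids("Bug #289 and bug id 77")
ids_too_long = candidate_bug_ids("bug 1234567")
```

Candidates extracted this way still have to be checked against Bugzilla, since a matching number is only a potential link.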

Firefox Development Process

Filed under: mozilla,mozilla firefox — GG @ 6:47 am

Firefox is based on independent Mozilla Core components layered together. Due to this architecture some of Mozilla’s applications share many components, but they are fundamentally different in functionality.

The Mozilla source code is organized in several branches. The trunk is the main branch, the central source code that is used for continuous and ongoing development. Trunk builds contain the very latest changes and updates. However, the trunk can also be very unstable at times. When development is started for a specific Mozilla version a new branch is created. At conception, a derived branch contains everything that the principal branch contains (Figure 1). Firefox 1.0 branch was derived from Mozilla Branch 1.7 while Firefox 1.5 from Mozilla Branch 1.8. Firefox branches that are forked from the existing branch will be used for all future releases of Firefox. The term release is used in OSS development to refer to different types of releases: major and minor, alpha and beta. Due to data availability constraints, we have only considered two major releases, 1.0 and 1.5, and a minor release, 1.5.0.3, in our work presented here.

Firefox Branch 1.5 resynchronized the code base with the trunk, which contained additional features not available in Firefox 1.0. On the other hand, in release 1.5.0.3 the focus was not on adding features but on improving security-related aspects, which were bypassed in version 1.5.0. This characteristic of the three selected releases allowed us to test whether the performance of a defect prediction model increases when it is trained on data collected from major releases instead of minor ones.

The reason behind branching is that components that need to be prepared for a future release are at the same time continuously developed on the trunk. A distinction needs to be made between Firefox-specific source code, i.e. code that does not support any other Mozilla application, and the Mozilla components that support Firefox.
