Rutgers Lib 29880 - PDF 1
MINING PROBLEM
by
Qi Guo
Role-based access control (RBAC) has become the norm for enforcing
systems. Roles, which are nothing but sets of permissions when seman-
tics are unavailable, represent organizational agents that perform certain job
a set of roles and associate permissions to them, is essential before all the
There are two basic approaches towards role engineering: top-down and
bottom-up. The key problem with the top-down approach is that it is likely
which makes role engineering tedious, time consuming and very difficult to
neering process especially when business semantics are not available. Also, it
starts from the existing permissions and aggregates them into roles. There-
mining.
A number of approaches exist for role mining and majority of them em-
candidate roles. While offering justifications for the identified roles, there is
no integrative view of the entire set of roles. For insightful bottom-up analy-
sis, we need to define interestingness metrics for roles. The objective of this
find the solutions to solve them. To this end, we have made the following
include the Basic-RMP, the Edge-RMP and the Minimal Perturbation Role
Mining Problem (i.e., the MinPert-RMP) and the Role Hierarchy Mining
Problem (the RHMP). For each major category, we have also considered
its different variations which we feel have strong pragmatic significance. For
example, we have explored the δ-approx Basic-RMP and the MinNoise Basic-RMP (simply called the δ-approx RMP and the MinNoise RMP, respectively, throughout the dissertation), which are variants of the Basic-RMP, the
tive which is both good/meaningful and in the view of the entire collection
each role mining is important since without it and consequently a clear end
goal, role mining would be aimless and mining approaches would become
incomparable. Also, meaningful and diverse objectives for different role min-
needs in the process of role generation. It is true that many solutions for role mining exist so far; unfortunately, none of the previous work formally introduces objective functions (even from the perspective of a single role) or formally proves that the decision versions of all the RMPs defined are NP-
In this dissertation, we have formally defined our model of noise, the degree
of noise and the approach to check the accuracy of results, since each of
these factors has a significant impact on how we measure the accuracy of our
proposed algorithms.
the RMP variants. We have introduced the concept of metrics, the ways of
role mining problems are. (i.e., how close the proposed approaches are to the
optimality).
ACKNOWLEDGEMENTS
Obviously, without her help, I wouldn’t have achieved what I have today
my work with patience throughout the time. I also wish to thank Dr. Jaideep
My gratitude goes also to Dr. Adam, head of CIMIC, the center that sup-
ported my work.
all the people who stood by my side. Thank you guys for helping me in big
program.
ABSTRACT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF TABLES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
CHAPTER 1. INTRODUCTION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Role Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Role Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Problem Statement and Contributions . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER 3. PRELIMINARIES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Role-Based Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 L1 norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Boolean Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Jaccard Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Role-based access control (RBAC) [10, 76, 27, 55, 59, 30, 93, 57, 62, 3, 28,
56, 79, 29, 84, 74, 5] has long been recognized as a normative access control
convenient but reduces the complication of access control since the number of
and/or duties within the organization [49]. As a result, RBAC has been
that RBAC will maintain its increasing prevalence since the growing demand
where roles represent organizational agents that perform certain job functions
within the organization, users are human beings and permissions are a set of
RBAC reference model [26], roles describe the relationship between users and
ing [21, 2, 89] requires defining roles and assigning permissions to them.
Role engineering is essential before all the benefits of RBAC can be real-
this, organizations are often reluctant to move to RBAC. Therefore, the in-
creasing popularity of RBAC calls for efficient solutions for role engineering
There are two basic approaches towards role engineering: top-down and
bottom-up. Under the top-down approach, roles are defined by carefully ana-
independent manner. These functional units are then associated with per-
defining a particular job function and then creating a role for this job function
accurately, the top-down approach has two major weaknesses: 1) Often, the
another and then incorporate them in the form of roles. This is tedious and
time consuming. It is also a difficult task since, often, there are dozens of
them. Taken together, these arguments show that relying solely on a top-down approach is, in most cases, not viable, although some case studies [86]
a high cost).
The bottom-up approach starts from the existing permissions before RBAC
is implemented and aggregates them into roles. A bottom-up model may not
approach excels in the fact that much of the role engineering process can be
referred to as role mining which we will discuss in detail in the next section.
It is also worth noting that one may use a mixture of these two ap-
candidate roles which can then be examined to determine if they are appro-
Many works [87, 54, 35, 38, 72, 12] employ clustering techniques or their
variants to discover roles. However, they all suffer from a serious weakness: the clusters they produce are not allowed to overlap one another. Therefore, each
major issue since permissions are seldom used by one role only and may
of its major limitations is that the approach lacks a formal definition of the
essential, since it defines an agreed-upon role mining goal to achieve and provides
role mining would be aimless and make no sense anymore. Another limita-
In this dissertation, we have made two main contributions to the role mining
realm.
entire collection of roles in contrast to one single role. We are the first
mining paradigm even from the perspective of one single role. On the
other hand, the extensive applications of RBAC urgently call for the
The role mining problems under investigation include, but are not limited to,
the Basic-RMP, the Edge-RMP and the MinPert-RMP. For each prob-
lem, we also considered its different variations which we feel have strong
2. We formulated the role mining problem as a binary matrix decomposition problem [63, 91, 51, 78, 88, 58, 53]. Therefore,
tions of the same binary matrix, with each of them associated with a
We believe we are the first to formulate the role mining problem this
tion.
• A user may play several different roles, and the user may have
This is essential for role mining since ignoring any of them would make the job easier, but may result in a less accurate set of roles. This
For example, by missing the first observation, role mining will be triv-
paradigm.
which we feel typical and significant doesn’t necessarily mean that the num-
The dissertation has explored key issues related to each role mining prob-
tion has explored the following issues with regard to each role mining problem
for they are given the opportunities to pick a specific problem (and
better.
lem. This is critical since we need to use them as a guide to more effi-
ciently search for a solution, i.e., we don’t want to fall into the swamp
ing problems defined. For some of the role mining problems, we did this from scratch, without any direct help from prior works in the literature, believing that we could not find equivalent solutions from
the same level of efficiency as the existing ones, we reused what has already
have investigated the relation of certain role mining problems with the
problems already identified, studied and solved in data mining and data
given approach is towards the optimality. The less distant the value
all possible values for a metric could be infinitely close to 1 but will
view of the fact that real-world data is not always accurate, and multiple causes can result in data deviating from being correct. Therefore,
of result accuracy.
limitations of the work are that they lack an integrative view of the entire set of roles when justifying the roles identified; therefore, no clear and good objectives are available by which those works can be evaluated and compared with each other. In other words, there was no formal metric defined before against which we can evaluate the result of a given role mining algorithm or compare the results of two algorithms. This dissertation has formally pro-
generated roles from a different angle. In other words, this dissertation has
filled the gap by defining a suite of objectives to be associated with the role
mining problem, and formulating the role mining problem as nothing but
tions to them, and also used metrics, the quantitative measures, to evaluate
nitions on which the overall research is based. Chapter 4 formally presents the role mining problem (the Basic-RMP) and its variants, the δ-approx RMP and the MinNoise RMP. It discusses the definition, complexity analysis and
tively. Chapter 7 presents the Role Hierarchy Mining Problem (the RHMP).
to noise, it discusses the noise model, the degree of noise, and how the accuracy of
our algorithms is checked. Chapter 9 presents the tool called Role Mining
Chapter 10 summarizes the research work shown in the dissertation and describes
CHAPTER 2
RELATED WORK
and correct set of roles. This process, known as role engineering [18], has
plish the task of role engineering, which can be categorized into three ap-
Coyne [18] is the first to describe the role engineering problem, and to
determine the needed permissions of roles using use cases has been proposed
the roles. Shin et al. [90] present a system-centric approach that examines
proach, which derives permissions from objects and their methods and then derives roles from these permissions. Neumann and Strembeck [75]
are then aggregated into roles. Epstein and Sandhu [20] propose to use
UML for facilitating role engineering, where roles can be defined in either
the role life-cycle including role analysis, role design, role management, and
role maintenance.
role mining approach called ORCA, which discovers roles by merging per-
are merged determines the outcome of roles. Moreover, it does not allow
overlapping roles (i.e., a user cannot play multiple roles), which is a sig-
limitations.
they first consider role minimization, the process of finding a smallest collec-
relation. They introduce fast graph reductions that allow recovery of the
above present heuristic ways to find a set of candidate roles. While offering
justifications for the identified roles, there is no integrative view of the entire
ness metrics for roles. [100] takes a first step towards this by ordering candi-
date roles on the basis of their support (i.e., roles that are more prevalent are
ordered higher). However, this metric still is quite ad-hoc and preliminary.
Also, while one may come up with interestingness metrics for a role by itself,
this does not directly lead to the notion of a good collection of roles. Indeed,
this is critical for the security administrator to gain confidence and be able
to fully utilize the output of any role mining algorithm beyond a piece-meal
fashion.
treat each permission equally, and there is no notion of weight associated with each permission. [105] introduced the notion of the weight of a permission
Repetitious computation over the NSM can improve the quality and utility
of the similarity. These reinforced similarities are more practical and useful
So far, all role mining approaches assume clean input data, which, as we all know, is too idealistic to be true in practice; noisy input data is pervasive.
[42] suggested a cleaning approach by dividing the role mining problem into
two steps: noise removal and candidate role generation. It used non-binary
systems which adopt RBAC for role management. Accordingly, the associ-
with the update task. By using this approach, it is possible to check whether
checking techniques.
All work proposed so far is based on either a top-down or a bottom-up approach. As we know, neither a pure top-down approach nor a pure bottom-
and Role Mining elements into a comprehensive framework for role creation.
HyDRo considers existing user information and access right structures with-
by the same research group ensures the hybrid integration, cleansing, and se-
for hybrid role mining that incorporates top-down business information into a
Colantonio et al. [12] propose another hybrid approach that extends the
the same business process or organizational units within the same branch.
Molloy et al. [72] also proposed a hybrid role mining algorithm based on for-
While all the above proposals attempt to discover a set of roles with a
certain optimality notion, they do not discover hierarchies. Molloy et al. [71]
have proposed an approach using formal concept analysis for role hierarchy
abstract cost function. The authors believe that their work is safer than other proposals [31, 60] since they prefer to under-assign permissions rather than over-
synthetic as well as real datasets [94], they compared nine role mining al-
gorithms. Takabi et al. [97, 96] have defined the problem of mining role
hierarchy with minimal perturbation. They have also defined two measures:
a measure for goodness of an RBAC state and another measure for minimal
called StateMiner to find an RBAC state with the smallest structural complexity that is as close as possible to both the deployed RBAC state and the optimal state.
real world, proposed the enterprise RBAC model, which uses a two-level lay-
ered role hierarchy. In such a role hierarchy, there are two types of roles:
functional roles and business roles. Permissions are only assigned to func-
tional roles. Business roles are connected to functional roles and inherit all
permissions from the connected functional roles. Finally, users are only as-
signed business roles and inherit all permissions from the assigned business
roles.
mal binary matrix decomposition and its variants using binary integer programming [24, 92, 104, 37, 52, 8, 4, 39, 13, 6, 65, 14]. They present the binary
discover an optimal and correct set of roles from existing permissions, referred
to as the role mining problem. Their modeling allows them to directly adopt
the huge body of heuristic solutions and tools developed for binary integer
programming into the Role Mining Problem. Going one step further, Lu et al. ex-
tended role mining problem with both positive and negative role assignment
user negatively, that user cannot have any permission contained by that role,
this extended binary matrix decomposition (EBMD) has broad potential ap-
plications since there are many types of real datasets, which could be better
described with both the set difference operation and the set union operation,
formal definition of the role mining problem. Apparently, the lack of this dilutes the research effort and has a negative effect on how results are applied
and compared. Frank et al. [64] proposed a novel definition of the role mining
have. In detail, they first identify the most important requirements for RBAC
that any real-world RBAC configuration should satisfy. Then they analyze
their weaknesses. Finally, they proposed a new definition of the role mining
CHAPTER 3
PRELIMINARIES
We adopt the NIST standard Role-Based Access Control (RBAC)
model [26]. For the sake of simplicity, we do not consider sessions, role
Definition 1 (RBAC)
• U, ROLES, OPS, and OBJ are the sets of users, roles, operations, and objects, respectively.
relation.
• PRMS (the set of permissions) ⊆ {(op, obj) | op ∈ OPS ∧ obj ∈ OBJ}
assignments.
3.2 L1 norm
The L1 -metric allows us to count the difference between two matrices – i.e.,
to figure out how good an approximation one is of the other. When the
L1 -metric is 0, the two matrices are identical. Other metrics (and distances)
can also be used – [67] discusses some alternatives and their implications.
¹Note that in the original NIST standard [26], PA was defined as PA ⊆ PRMS × ROLES, a many-to-many mapping of permission-to-role assignments.
$\| v - w \|_1 = \sum_{i=1}^{d} |v_i - w_i|.$
ural way, i.e., if A and B are matrices in $X^{n \times m}$, for some set X, then

$\| A - B \|_1 = \sum_{i=1}^{n} \| a_i - b_i \|_1 = \sum_{i=1}^{n} \sum_{j=1}^{m} |a_{ij} - b_{ij}|.$
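Over 0/1 matrices, this metric simply counts the cells where two matrices disagree. A minimal Python sketch (the helper names are ours, not from the dissertation):

```python
def l1_vec(v, w):
    # ||v - w||_1 = sum_i |v_i - w_i|
    return sum(abs(a - b) for a, b in zip(v, w))

def l1_mat(A, B):
    # ||A - B||_1 = sum of the row-wise vector L1 norms
    return sum(l1_vec(ra, rb) for ra, rb in zip(A, B))

# For binary matrices this counts mismatching entries;
# a value of 0 means the two matrices are identical.
```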
discovered that role mining problems can also be viewed as the boolean
The Jaccard coefficient is a well known statistic used for comparing the sim-
Given two sets A and B, the Jaccard coefficient is $JC_{AB} = \frac{|A \cap B|}{|A \cup B|}$.
While the above Jaccard coefficient computes the similarity between two
sets, the dissimilarity between two sets, called the Jaccard distance, is $JD_{AB} = 1 - JC_{AB}$. In the above example, $JD_{AB} = \frac{4}{7} \approx 0.571$.
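Both quantities follow directly from set operations. In the sketch below, the sets are hypothetical stand-ins chosen so that |A ∩ B| = 3 and |A ∪ B| = 7, matching the example above:

```python
def jaccard_coefficient(A, B):
    # JC_AB = |A ∩ B| / |A ∪ B|
    return len(A & B) / len(A | B)

def jaccard_distance(A, B):
    # JD_AB = 1 - JC_AB
    return 1 - jaccard_coefficient(A, B)

A = {1, 2, 3, 4, 5}   # hypothetical example sets with |A ∩ B| = 3
B = {3, 4, 5, 6, 7}   # and |A ∪ B| = 7
print(round(jaccard_distance(A, B), 3))  # prints 0.571 (= 4/7)
```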
CHAPTER 4
THE BASIC ROLE MINING PROBLEMS
In this chapter, we formally present the role mining problem (the Basic-RMP)
and its variants, the δ-approx RMP and the MinNoise RMP. Given m users,
a 1 in cell (i, j) indicates the assignment of role j to user i. Similarly, the role-
which is critical to the notion of accuracy of the roles. The L1 norm defined
where M(UA), M(PA), and M(UPA) denote the matrix representations of
should be within δ of UPA.
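This consistency check can be stated directly in code: take the Boolean product of M(UA) and M(PA) and measure its L1 distance from M(UPA). The following sketch uses plain nested lists and helper names of our own:

```python
def bool_product(UA, PA):
    # (UA ⊗ PA)[i][j] = 1 iff user i has some role r with permission j.
    k, n = len(PA), len(PA[0])
    return [[int(any(row[r] and PA[r][j] for r in range(k)))
             for j in range(n)] for row in UA]

def l1_distance(A, B):
    return sum(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def delta_consistent(UA, PA, UPA, delta):
    # UA and PA are δ-consistent with UPA iff
    # || M(UA) ⊗ M(PA) - M(UPA) ||_1 <= δ
    return l1_distance(bool_product(UA, PA), UPA) <= delta
```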
In this section, we present the Basic-RMP and two of its variants, the δ-
Given the user-permission matrix, the basic Role Mining problem asks us
of roles. Put another way, it asks us what is the minimum number of roles
necessary to fully describe the given data (and what are those roles, and the
have 1000 users and 100 permissions. The size of UPA is 5000 (i.e., 5000 user-permission assignments are allowed out of the possible 100,000). Now,
suppose 100 roles are required to exactly match the given user-permission assignment, but that with some approximation allowed, the minimum number of roles required is only 60. As long as we
do not add any spurious permissions (i.e., no extra 1s are added), the second
case is clearly better than the first, since we significantly reduce the number
current state of the organization. Permissions and, to a lesser extent, roles
are dynamic. Thus while exact match may be the best descriptor in the static
case, it is probably not good for the dynamic case. Approximate match might
number of roles, k.
It should be clear that the basic Role Mining Problem defined earlier is
simply a special case of the δ-approx RMP (with δ set to 0). Instead of
the approximation. We call this the Minimal Noise Role Mining Problem
(the MinNoise RMP). Thus, we fix the number of roles that we would like
to find, but now we want to find those roles that incur minimal difference
with respect to the original user-permission matrix (UP A). The security
administrator might want to do this when he is looking for the top-k roles
that describe the problem space well enough, and are still (in some sense)
robust to noise.
where M(UA), M(PA), and M(UPA) denote the matrix representations of
4.1 shows a sample user-privilege assignment (UPA) for 4 users and 5 privileges. Tables 4.1(a) and 4.1(b) depict a user-role assignment (UA) and role-
p1 p2 p3 p4 p5
u1 0 1 0 0 1
u2 1 1 1 0 1
u3 1 1 0 1 1
u4 1 1 1 0 0
and ROLES are optimal. It is not possible to completely describe the given
UP A with less than 3 roles. Tables 4.2(a) and 4.2(b) depict the optimal
depict one possible optimal minimal noise UA and P A. Tables 4.3(a) and
4.3(b) depict another optimal UA and P A for the MinNoise RMP. Both rep-
resent correct solutions to the MinNoise RMP, though the second one does
the complexity of these problems. The Role Mining Problem, the δ-approx RMP, and the MinNoise RMP are all optimization problems. The
to consider the complexity of the problems, we now frame the decision versions of these problems.
that |ROLES| ≤ k?
that
where M(UA), M(PA), and M(UPA) denote the matrix representations of
We can now prove that the decision RMP, the decision δ-approx RMP,
and the decision MinNoise RMP are all NP-complete (indeed, some of these results have already been obtained in the literature [67, 22]). Proving that a
The problem π′ used in the reduction is the “set basis problem”, defined
below:
• The decision Role Mining Problem is in NP. The set of roles ROLES,
(a) A 4 × 7 user-to-permission assignment (UPA); (b) the same matrix, in which shaded areas indicate tiles, the 3 identified roles (shading not reproducible in text):

     p1 p2 p3 p4 p5 p6 p7
u1    1  1  0  0  1  1  1
u2    0  0  0  1  1  1  1
u3    1  1  0  1  1  0  0
u4    1  1  0  0  0  0  0
UA:      R1 R2 R3
u1        1  0  1
u2        0  1  1
u3        1  1  0
u4        1  0  0

PA:      p1 p2 p3 p4 p5 p6 p7
R1        1  1  0  0  0  0  0
R2        0  0  0  1  1  0  0
R3        0  0  0  0  1  1  1
UP A. Thus, every set c ∈ C stands for one user u. Now, the answer
to the decision role mining problem directly provides the answer to the
mapping).
• The decision δ-approx RMP is in NP. The set of roles ROLES, the
Thus, every set c ∈ C stands for one user u. δ is set to 0. Now, the
• The decision MinNoise RMP is in NP. The set of roles ROLES, the
the answer to the decision MinNoise RMP directly provides the answer
In the following sections, we show that the Basic-RMP along with several
variants can be mapped to other problems already studied in the data mining
and data analysis literature. We discuss the complexity for each variant along
lem with the Tiling Databases problem. This mapping allows us to directly
original Database Tiling paper by Geerts et al. [22] looked at a set of five
problems, one of which exactly matches the role mining problem. We now
describe the relevant problems studied and then discuss their implications.
Tiling Databases
viewed as the number of objects and the number of columns, n, can be viewed
attribute j (i.e., some relationship exists between object i and attribute j).
well as all the rows that have 1s in all the columns in I. The area of a tile is
defined as the cardinality of the tile (i.e., the number of 1s in the tile).
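The area of the tile induced by a set of columns can be computed by counting the rows that have 1s in all of those columns. A small sketch, using the 4 × 7 UPA of Figure 4.1 (function name is ours):

```python
def tile_area(matrix, cols):
    # Rows supporting the tile: those with a 1 in every column of `cols`.
    rows = [r for r in matrix if all(r[c] == 1 for c in cols)]
    # Area = number of 1s in the all-1s block = |rows| * |cols|.
    return len(rows) * len(cols)

UPA = [
    [1, 1, 0, 0, 1, 1, 1],  # u1
    [0, 0, 0, 1, 1, 1, 1],  # u2
    [1, 1, 0, 1, 1, 0, 0],  # u3
    [1, 1, 0, 0, 0, 0, 0],  # u4
]
print(tile_area(UPA, [0, 1]))  # prints 6: tile {p1, p2} x {u1, u3, u4}
```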
below.
of the matrix with area equal to the total number of 1s in the matrix and
To see that the Minimum Tiling problem corresponds exactly to the Basic-
RMP, one must first see how a tile corresponds to a role. As defined above, a
tile is just a block of 1s – i.e., a collection of rows and columns that all have
missions. Thus, inherently, in any tile, the collection of the columns provides
the role-to-permission assignment (P A) for that role. At the same time, the
collection of rows denotes those users/entities that have that role – thus the
role. As such, any tiling corresponds to a set of roles and their role/permission
and user/role assignments. If the tiling completely covers the entire matrix –
then all 1s have been covered, meaning that all user/permission assignments
have been covered. Since each tile corresponds to a role, if the tiling is min-
imal and covers the entire matrix, this means that we have found a set of
minimal roles such that they completely describe the given user-permission
assignment.
the transactions and columns denote the items. We may order transactions
shows a tiling of the matrix consisting of 3 tiles. The shaded region represents
a tile. Thus, Tile 1={(1,1), (1,2), (3,1), (3,2), (4,1), (4,2)}, Tile 2={(2,4),
(2,5), (3,4), (3,5)} and Tile 3={(1,5), (1,6), (1,7), (2,5), (2,6), (2,7)}. As
one can see, Tiles 2 and 3 overlap on cell (2, 5). Figure 4.1(b) also gives the
minimum tiling of the matrix. It is not possible to find a tiling that covers the entire matrix with fewer than 3 tiles. We can view the same problem from
a role. Figure 4.1(c) and 4.1(d) show an optimal UA and P A, such that
sense that it is impossible to find only two roles such that UA and P A will
be 0-consistent with UP A.
as follows.
Theorem 4.3.1 The Minimum Tiling problem is identical to the basic Role
Mining Problem.
To show that the two problems are identical we show that their inputs and
outputs exactly match. Thus, for every input instance, the output of both
• For any problem instance, the Minimum Tiling problem returns a set
of tiles that completely cover the input while minimizing the number
the set of columns C, in the tile. For each column c ∈ C, add the
a tiling of the matrix. This tiling covers the entire matrix since UA′
the same as |ROLES ′| which is less than |ROLES|. But that means
RMP.
Since the Minimum Tiling problem is equivalent to the Basic-RMP, the al-
gorithms developed for Minimum Tiling now directly apply. [22] proposes
area over a given threshold. A depth first search strategy is used to find all
large tiles. [22] proves that the Minimum Tiling problem can be approximated
(a) A 4 × 3 user-to-permission assignment (UPA):

     p1 p2 p3
u1    1  1  1
u2    1  0  0
u3    1  1  1
u4    1  1  1

(b) Decomposition of UPA into UA and PA with k=2:

UA:      R1 R2        PA:      p1 p2 p3
u1        0  1        R1        1  0  0
u2        1  0        R2        0  1  1
u3        0  1
u4        1  1

A second decomposition:

UA:      R1 R2        PA:      p1 p2 p3
u1        0  1        R1        1  0  0
u2        1  0        R2        1  1  1
u3        1  1
u4        1  1
within a factor of O(log mn), given an oracle that finds, for any database D
and tiling T , the tile t such that the area(T ∪ t) is the maximum (i.e., the
oracle returns the tile which covers as much of the remaining uncovered part
by adapting the maximal tile algorithm. [22] provides more detail on this.
In the first phase, we find a minimum tiling for the given UP A. In the second
phase, we convert the tiling into ROLES, UA, and P A. As described earlier,
phase 1 uses a simple greedy strategy of adding the largest uncovered tile to
the current tiling, until UP A is completely covered (i.e., the largest uncovered
tile remaining is empty). Algorithm 2 describes the procedure for finding the
Algorithm 1 RMP(UPA)
1: {Find the minimum tiling for UPA}
2: T ← {}
3: while (T′ ← LUTM(UPA, T)) ≠ {} do
4:   T ← T ∪ T′
5: end while
6: {Convert the minimum tiling into UA and PA}
7: ROLES ← {}, UA ← {}, PA ← {}
8: for each tile t ∈ T do
9:   Create a new role R and add it to ROLES
10:  Extract the set of permissions P in the tile
11:  For each permission p ∈ P, add the assignment {p, R} to PA
12:  Extract the set of users U in the tile
13:  For each user u ∈ U, add the assignment {u, R} to UA
14: end for
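A runnable sketch of this two-phase procedure is given below. Since the LUTM oracle is nontrivial, this version approximates it by drawing candidate column sets from the rows of UPA and their pairwise intersections; this is our FastMiner-style simplification, not the algorithm of [22]:

```python
from itertools import combinations

def greedy_rmp(UPA):
    """Phase 1: greedily add the candidate tile covering the most
    still-uncovered 1s. Phase 2: convert the tiling into UA and PA."""
    m, n = len(UPA), len(UPA[0])
    rows = [frozenset(j for j in range(n) if UPA[i][j]) for i in range(m)]
    # Candidate column sets: each user's permission set plus pairwise
    # intersections (a stand-in for the LUTM oracle).
    cands = set(rows) | {a & b for a, b in combinations(rows, 2)}
    cands.discard(frozenset())
    ones = {(i, j) for i in range(m) for j in range(n) if UPA[i][j]}
    covered, roles = set(), []
    while covered != ones:
        best, best_users, best_gain = None, None, 0
        for cols in cands:
            users = [i for i in range(m) if cols <= rows[i]]
            gain = sum((i, j) not in covered for i in users for j in cols)
            if gain > best_gain:
                best, best_users, best_gain = cols, users, gain
        if best is None:  # no candidate adds coverage; stop defensively
            break
        roles.append((best_users, sorted(best)))
        covered |= {(i, j) for i in best_users for j in best}
    UA = [[int(i in us) for us, _ in roles] for i in range(m)]
    PA = [[int(j in ps) for j in range(n)] for _, ps in roles]
    return roles, UA, PA
```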
simply assume some canonical order over the permissions. The key idea behind the algorithm is that all large tiles containing a permission i ∈ PRMS, but not containing any permission lower than i (according to the canonical order) can be found in the so-called i-conditional database [40]. In our
assignments that contain P , but from which all permissions before the last
permission in P and that last permission itself would have been removed.
Now, any large tile that is found in this conditional database, at once im-
simply need to add |P| to the width of the area (|P′|) and multiply this with |U(P′)| [22]. We modify the original LTM algorithm [22] to return the largest
uncovered tile. For this, we keep track of the current largest uncovered tile,
LT, and its uncovered area, LTarea. The main steps of the algorithm are as
follows:
Step 1: Originally, LT and LTarea are initialized to the empty set and 0, re-
Step 2: Line 3 starts the main loop of the algorithm, and iterates over each
tile being considered is larger than the current known best, the best uncovered tile seen so far. Here, we need to clarify what we mean
by uncovered area. For any tile, the uncovered area is the number of 1s
that the tile covers that are not already covered in the existing tiling –
i.e., the uncovered area refers to that part of the tile that is new and
Referring back to Figure 4.1(b), assume that the current tiling consists
of Tile 1 and Tile 2. Now, the covered area is simply the distinct
number of 1s included in the Tiling. In our case, since the tiles do not
overlap, the overall covered area is equal to 10 (6 for Tile 1 and 4 for
Tile 2).
5 (since the total number of 1s in Tile 3 is 6, and one out of those 1s, at
Step 4: Finally, line 13 invokes the algorithm recursively to calculate the largest
finish after all the permissions have been considered. The algorithm
found in [22].
Algorithm 2 LUTM(UPA, T)
1: P ← {}
2: LT ← {}, AreaLT ← 0
3: for ∀p ∈ PRMS do
4:   if uncovered area of t(P ∪ {p}) > AreaLT then
5:     LT ← t(P ∪ {p})
6:     Update AreaLT to the uncovered area of t(P ∪ {p})
7:   end if
8:   {Create the conditional database for recursion}
9:   UPA(P∪{p}) ← {}
10:  for (∀q | (q ∈ PRMS) ∧ (q > p)) do
11:    Add (q, U({p}) ∩ U({q})) to UPA(P∪{p})
12:  end for
13:  Compute T((P ∪ {p}), UPA(P∪{p})) recursively
14: end for
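The notion of uncovered area used in Steps 2 and 3 is easy to state in code. The example below reproduces the Figure 4.1 computation, where Tile 3 adds 5 new 1s over a tiling consisting of Tiles 1 and 2 (helper names are ours):

```python
def cells(users, cols):
    # All (user, permission) cells of the all-1s block.
    return {(u, c) for u in users for c in cols}

def uncovered_area(tile_cells, covered):
    # 1s the tile would newly cover, i.e., cells not already in the tiling.
    return len(tile_cells - covered)

# Tiles 1 and 2 of Figure 4.1 (0-indexed users and permissions).
covered = cells([0, 2, 3], [0, 1]) | cells([1, 2], [3, 4])
tile3 = cells([0, 1], [4, 5, 6])
print(uncovered_area(tile3, covered))  # prints 5: 6 cells, one already covered
```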
Essentially, the solutions for the Database Tiling Problem ask for pruning of an
case.) Thus, if the pruning strategies do not work, or even in cases where
stead, we have come up with a solution based on the FastMiner [100] algo-
rithm that can significantly cut down on the cost while still retaining very
good accuracy. Even better, this solution can also be modified to work for
the δ-approx RMP. Essentially, a single algorithm can serve both the RMP
that one can implement one single algorithm and tune it to obtain the results of different RMP variants. This makes it easy for security administrators to pick and choose the RMP variant that is applicable to the organization at
phases. In the first phase, we generate a set of candidate roles from the
ing all unique user pairs. In general, any technique can be used to generate
the candidate roles. In the second phase, we select the final roles from among
database tiling. Essentially, the best candidate role is selected from the
compute the uncovered area of that role. The uncovered area of a role can
already covered by any of the roles in ROLES (the current minimum tiling).
This can easily be used for the δ-approx RMP with a minor modifica-
within δ. Since the Basic-RMP is only a special case of the δ-approx RMP
(with δ = 0), we formally present only the algorithm for the δ-approx RMP.
is sorted in descending order by the area of the roles, the length of each
total area of that role is less than the currently seen maximum uncovered
area, we know that it is impossible for that role to be the best. Indeed, since
the roles are sorted, we know that none of the roles following this can be
the best role either. Therefore, we immediately stop the iteration and use
the best role found so far. This can significantly help in reducing the overall
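Under the assumption that candidates arrive sorted by total area in descending order, this early-exit selection step reads roughly as follows (a sketch with our own names):

```python
def pick_best_role(candidates, covered):
    """candidates: (users, perms) pairs sorted by |users|*|perms| descending.
    Returns the candidate with the largest uncovered area, stopping early
    once no remaining candidate's total area can beat the best seen."""
    best, best_unc = None, 0
    for users, perms in candidates:
        if len(users) * len(perms) <= best_unc:
            break  # sorted order: no later candidate can do better
        unc = sum((u, p) not in covered for u in users for p in perms)
        if unc > best_unc:
            best, best_unc = (users, perms), unc
    return best, best_unc
```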
and 4 permissions depicted in [100]. Figure 4.4(a) shows the sample database
different roles is 2⁴ = 16 (i.e., the size of the power set). Figure 4.3 depicts 15
of those roles (all excluding the empty set - i.e., the role with no permissions).
Now, in the first phase of our algorithm, the FastMiner algorithm [100] is
used to select the set of candidate roles. FastMiner proceeds as follows: First,
the set InitRoles gets initialized to {{p1 , p2 , p4 }, {p2 , p3 , p4 }, {p2 , p3 }, {p4}, {}}.
These roles, along with their corresponding counts, are marked with rectangles in Figure 4.3. The empty set, along with its count of 2, is not shown in the figure since
it does not add to the computation. Now, all pairs of unique users are inter-
sected to find the remaining candidate roles. These result in two additional
roles, {{p2, p4}, {p2}}, which are marked with ovals in the figure. Since {p2, p3}, {p4} and
{} are also the result of some intersections, at the end of phase 2, GenRoles
gets set to the roles {{p2 , p3 }, {p2 , p4 }, {p2}, {p4 }, {}}. Finally, the generated
roles are matched to the corresponding counts which are 6,8,11,10, and 2.
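The candidate-generation step just described can be sketched as follows (a minimal, hedged reconstruction: the actual FastMiner [100] additionally tracks the counts used later to compute areas):

```python
# A minimal sketch of FastMiner-style candidate generation.
from itertools import combinations

def generate_candidates(upa):
    """upa: dict mapping each user to a frozenset of permissions."""
    init_roles = {frozenset(ps) for ps in upa.values()}  # unique user profiles
    # Intersect every pair of unique profiles to generate further candidates.
    gen_roles = {a & b for a, b in combinations(init_roles, 2)}
    return init_roles | gen_roles
```

On the running example, this yields seven candidate roles in total, including the intersection-generated {p2, p4} and {p2}.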
Together, the initial roles and the generated roles form the set of candidate roles. Figures 4.4(b)–4.4(e) show the 4 iterations. Each figure shows the database with the
covered part shaded, as well as the uncovered area of each role considered
in the iteration. In the first iteration, the role {p2 , p4 } is picked since its
uncovered area (16) is the largest. Note that since the uncovered area 16 is
greater than the area of the next largest role ({p1 , p2 , p4 }, 15), the iteration
correctly breaks right there without even looking at the remaining candidate
roles. In the second iteration, the role {p2 , p3 } is picked since its current
uncovered area (9) is the largest. Again, the iteration does not check the
last role ({p2 , p3 , p4 }) since its total area is not larger than the maximum
uncovered area seen so far in that iteration. In the third iteration, since the
role {p1 , p2 , p4 } has the largest uncovered area (5), it is picked. Starting from
this iteration, all of the roles are considered since the maximum uncovered
area is smaller than the area of all of the roles. In the fourth iteration, the
role {p4 } with the maximum uncovered area (2) is picked. This covers the
entire UPA and terminates the algorithm. Figure 4.4(f) shows the complete tiling. Although the optimization based on sorting only helps us for two iterations in this example, in general it helps significantly. This has been borne out by the experimental results.
Also note that the algorithm will terminate faster for non-zero values of δ.
For example, for δ = 2, the algorithm will terminate at the end of iteration
three since UP A is covered within δ, and there will only be 3 returned roles.
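The complete selection phase traced above can be sketched as follows (a hedged reconstruction, not the dissertation's exact Algorithm; `candidates` maps each role, a frozenset of permissions, to its associated users, and `upa` maps each user to a permission set):

```python
# Greedy tiling with area-sorted pruning for the δ-approx RMP (sketch).

def uncovered_area(role, users, covered):
    # Number of (user, permission) cells of this role's tile not yet covered.
    return sum(1 for u in users for p in role if (u, p) not in covered)

def select_roles(candidates, upa, delta=0):
    total = sum(len(ps) for ps in upa.values())          # total 1s in the UPA
    covered, chosen = set(), []
    # Sort once, by total area (|permissions| * |associated users|), descending.
    order = sorted(candidates, key=lambda r: len(r) * len(candidates[r]),
                   reverse=True)
    while total - len(covered) > delta:
        best, best_uc = None, 0
        for role in order:
            if len(role) * len(candidates[role]) <= best_uc:
                break                     # pruning: no later role can win
            uc = uncovered_area(role, candidates[role], covered)
            if uc > best_uc:
                best, best_uc = role, uc
        if best is None:
            break                         # nothing left can add coverage
        chosen.append(best)
        covered |= {(u, p) for u in candidates[best] for p in best}
    return chosen
```

Run on the candidate roles of this example, the sketch reproduces the walkthrough: four roles for δ = 0 and three roles for δ = 2.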
[Figure 4.4. (a) The sample UPA of 13 users (u1–u13) over permissions p1–p4, together with the candidate roles, their associated users, original/generated/total counts, and areas; (b)–(e) the candidate table after iterations 1–4, showing each role's uncovered area, with pruned roles marked X; (f) the fully covered UPA after the final iteration.]
Computational Complexity
The overall complexity comprises the complexity of the candidate generation phase and the complexity of the candidate selection phase. Since the FastMiner algorithm uses pairwise intersection of unique users to generate candidate roles, it requires O(n^2) time, where n is the number of users. Since at most n roles are necessary to describe the UPA, there are at most n iterations of candidate selection, each scanning up to O(n^2) candidates. Thus, in the absolute worst case, the overall cost is O(n^3), which is still significantly better than the exponential worst case of tiling.
However, in practice, due to the sorting and quick termination strategy, the actual running time is far lower.
While our solution works extremely well for non-zero values of δ, it still takes a long time for solving the Basic-RMP problem (i.e., when δ = 0). The reason for this is quite clear: essentially, at every iteration a smaller uncovered area remains, which blunts our pruning strategy, and at every iteration the complete list of candidate roles has to be scanned. With up to n^2 candidates, this list gets very big. A further optimization can significantly reduce the overall time, though at the cost of giving a slightly non-optimal final set of
roles. The key is to observe that all of the candidate roles are not relevant at
every iteration. Though, at the start, any of the candidates could be chosen as the best role, in later iterations, roles that do not cover any of the remaining uncovered cells cannot be the best. Therefore, at any iteration, we could simply remove all of the users who have already been covered and only generate candidates from the remaining users. As
more and more users are covered, this strategy keeps significantly decreasing
the number of candidate roles. Indeed the strategy can be applied as many
times as desired through the run of the algorithm. The reason why this may
generate a set of roles that is not optimal is due to the way the candidate roles are generated: since FastMiner only intersects pairs of users, regenerating candidates from the remaining users may never generate a candidate role that could actually be the best role. If instead CompleteMiner was used, this problem would be avoided (though the number of candidates could then be exponential). In practice, for most situations using this optimization gives close to the optimal set of roles, while significantly reducing the overall time required. Algorithm 4 gives the details of this optimization using the FastMiner algorithm. Note that this can be done at the start of any iteration.
We now show how the MinNoise RMP maps to the Discrete Basis problem. This mapping again allows us to directly borrow existing results. The work in [68] studies a set of three related problems and shows that these are NP-complete. We now describe the relevant problems studied and then discuss their implications.
The Discrete Basis problem [68] studies the problem of finding a basis
from given data. Similar to Principal Component Analysis (PCA), the discrete basis can be used for data analysis and/or compression. Unlike PCA, the discrete basis problem only considers boolean data.
We have already introduced some of the notation used for defining the
discrete basis problem from [68]. Formally, the discrete basis problem is
defined as follows:
Given a matrix C ∈ {0, 1}^{n×d} and a positive integer k ≤ min{n, d}, find a matrix B ∈ {0, 1}^{k×d} minimizing

    ||C − S ⊗ B||_1        (4.7)

over the possible choices of S ∈ {0, 1}^{n×k}. The Discrete Basis Problem only asks for a discrete basis; a related problem, the Basis Usage Problem, asks for the best usage matrix S given the basis B. Together, the Discrete Basis Problem and the Basis Usage Problem correspond to the MinNoise RMP, which establishes the equivalence.
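To make the notation concrete, the boolean product ⊗ and the L1 error of Eq. (4.7) can be sketched as follows (the matrices here are illustrative, not taken from the text):

```python
import numpy as np

def bool_product(S, B):
    # Boolean matrix product: cell (i, j) is 1 iff S[i, k] = B[k, j] = 1
    # for some k (a logical OR of ANDs rather than an arithmetic sum).
    return (S @ B > 0).astype(int)

def l1_error(C, S, B):
    # ||C - S ⊗ B||_1 as in Eq. (4.7): the number of mismatched cells.
    return int(np.abs(C - bool_product(S, B)).sum())

C = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 1, 1]])
B = np.array([[1, 1, 0],       # basis vector (role) 1
              [0, 1, 1]])      # basis vector (role) 2
S = np.array([[1, 0],
              [0, 1],
              [1, 1]])         # usage: row 3 combines both basis vectors
```

Here C is reconstructed exactly, so the L1 error is 0; perturbing S or B generally introduces mismatched cells that the norm counts.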
In the context of the discrete basis problem, the input is a boolean matrix, where the rows and columns might stand for anything – users and permissions, or documents and words. For now, we assume that these show the users and their permissions. Given k (< min{m, n}), Figure 4.2(b) shows one possible decomposition into a usage matrix and a basis. Indeed this is the best (optimal) decomposition possible for the given input matrix. Note that the discrete basis problem only asks for the optimal basis
B (i.e., the role-permission assignment PA), while the basis usage problem asks for the optimal usage matrix S (i.e., the user-role assignment UA). (We keep the notation for the matrix product and the L1 norm as in the DBP paper [68], even though it differs slightly from that used for the RMP.) In our
case, the MinNoise RMP asks for both PA and UA together.
However, splitting the problem into two parts (i.e., finding optimal PA, and then finding optimal UA given PA) does help in the case of the Basic-RMP. For the Basic-RMP, we wish to exactly match the given UPA. In this case, while the discrete basis problem (finding optimal PA) remains NP-hard, the basis usage problem becomes easy. A simple algorithm for the basis usage problem in this case is as follows: For
each user and for each role, if the set of permissions of the role is a subset
of the permissions of the user, then assign that role to that user. Since we
only assign a role to a user as long as all of its permissions are owned by
the user, there are no mistakes (and we have an exact match). Obviously,
this assumes that the provided basis is complete (i.e., that each user can be
exactly described using some subset of the roles), and thus all of the required
roles are assigned to the user. Thus, after going through the entire set of
users and permissions, we automatically come up with the optimal UA. The
running time of this algorithm is clearly polynomial in the size of the input
[67].
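A sketch of this subset-based assignment (assuming, as the text does, that the provided basis is complete):

```python
# Assign a role to a user exactly when the role's permission set is a
# subset of the user's permissions; with a complete basis this yields an
# exact reconstruction of the UPA.

def assign_roles(upa, roles):
    # upa: dict user -> set of permissions; roles: list of permission sets
    return {user: [i for i, role in enumerate(roles) if role <= perms]
            for user, perms in upa.items()}
```

The union of the roles assigned to each user then reproduces that user's row of the UPA exactly.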
Miettinen [67] also shows that the discrete basis problem cannot be approximated within any factor unless P = NP. This essentially shuts the door on any attempt to find an approximation algorithm with guaranteed bounds. However, heuristics based on association rule mining are proposed and seem to give fairly good results on simulated data. Again, [67] provides further details on this. Other heuristics can also be devised to select the best candidates to describe the dataset. As part of future work, we intend to explore such heuristics.
We now present a heuristic solution for the MinNoise RMP problem based on greedy candidate selection, which produces good results while still being fairly efficient in most cases, and better in certain cases.
Note that, the same solution can also be modified to solve several variants
of this problem, including the Database Tiling problem [22] and the δ-approx RMP. The input is the UPA and the desired number of roles. The algorithm proceeds in three independent phases. In the first phase, candidate roles are generated using the FastMiner algorithm of Vaidya et al. [100]. The basic idea is to intersect the permissions of every pair of users to generate candidate roles. Along with this, for the sake of completeness, candidate roles are created consisting of each user's permissions, and separately consisting of each permission. Thus the total candidate role set consists of at most n(n − 1)/2 + n + m candidate roles. After removing duplicates, for each role, we associate with it the users whose permission sets fully enclose the set of permissions of the candidate; we describe the third phase shortly. We sort the roles according to the area (i.e., the number of permissions in the role multiplied by the number of associated users).
In the second phase, we select the final set of roles from among these
tiling. Essentially, the best candidate role is selected from the remaining
candidate roles until the original user set can be completely reconstituted or
until we have the required number of roles. In detail, in each iteration, out
of all remaining candidate roles, we select the one with the largest uncovered
area. The uncovered area of the role can be easily computed by finding the
number of 1s in the user set that are not already covered by any of the roles
already selected.
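In matrix terms, this computation can be sketched as follows (illustrative: `upa` and `covered` are 0/1 arrays of the same shape, and a candidate's tile is the cross-product of its associated users and its permissions):

```python
import numpy as np

def uncovered_area(upa, covered, user_idx, perm_idx):
    # Build the candidate role's tile: the cells at the cross-product of
    # its associated users (rows) and its permissions (columns).
    tile = np.zeros_like(upa)
    tile[np.ix_(user_idx, perm_idx)] = 1
    # Count the 1s of the UPA inside the tile that are not yet covered.
    return int(((upa == 1) & (tile == 1) & (covered == 0)).sum())
```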
In the third phase, we associate additional users with each of the selected roles generated from the previous phase. Instead of associating only those users whose permission set totally encloses a role, we slightly relax the restriction to include those users that will benefit from such an association. Specifically, a user is associated with a role if the number of permissions shared between it and the role is greater than half the size of the role's permission set (i.e., more than half of the number of permissions the role has). In other words, a user is only associated with a role if it has a net gain due to the association. Clearly, this will introduce inaccuracies, since there may be permissions in the role which are not owned by the user; however, we are guaranteed that the number of such permissions is less than half of the role's size. We call this variant MinNoise with Errors (since errors can be introduced in the third phase). If we stop after the second phase, there are no extraneous errors introduced, though the algorithm may not do as well; this is called MinNoise without Errors. We compare both variants in the experimental evaluation.
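The relaxed third-phase association can be sketched as follows (a hedged reconstruction of the "with errors" variant):

```python
# A user joins a role when it shares a strict majority of the role's
# permissions; role permissions the user lacks become introduced errors.

def relaxed_associate(upa, roles):
    ua = {user: [] for user in upa}
    for user, perms in upa.items():
        for i, role in enumerate(roles):
            if 2 * len(role & perms) > len(role):   # strict majority test
                ua[user].append(i)
    return ua
```

The permissions a role grants beyond what an associated user owns are the introduced errors; by the majority test, they number fewer than half of the role's permissions.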
One problem with the above algorithm is that the candidate selection process can potentially take quite a lot of time (since there may be n^2 candidate roles to scan in each iteration). However, since the candidate list is sorted in descending order by the area of the roles, the length of each iteration can be significantly reduced. In each iteration, we keep track of the maximum uncovered area seen so far. While scanning through the list of sorted roles, we can stop as soon as we come to a role whose area is less than the maximum uncovered area seen so far. Since the actual area of that role is less than the maximum uncovered area, it is impossible for this role to be the best. Indeed, since the roles are sorted, we know that none of the roles following this can be the best role either. Therefore, the scan for the iteration can be terminated and the best role found so far used. This can significantly help in reducing the overall time. Algorithm 5 gives the details.
We now trace through an example run of the algorithm. Figure 4.5(a) shows the example user permission assignment, where each row represents a user and each column a permission. The value of 1 in cell {i, j} indicates that user i has permission j.
The algorithm mainly consists of four parts. Lines 1-5 generate the can-
didate roles. Lines 6-15 associate the users with each of the candidate
roles and then sort the roles by their areas in descending order. Lines 16-28
select the roles from the candidate set in a greedy manner. Lines 29-37 as-
sociate the users with the selected roles. In the following, we explain each
step in detail. The candidate roles are generated from 3 sources. First, we uniquely incorporate the permission set associated with each user. Second, we intersect the permission sets of every pair of users obtained from the previous step; for this example, this generates roles such as {P1, P2}, {P2}, {P1}, and {P2, P3}. Finally, we add each individual permission, i.e., {P0}, {P1}, {P2}, {P3}, {P4}, into the candidate set. The second part of the
algorithm associates users with each of the candidate roles as long as the
permission set of a user fully encloses that of the candidate role. Then the
candidate roles are sorted in descending order by the area (i.e., the number of
In the third part, we keep selecting the best role from the candidate set
until the expected number of roles has been reached or the original UPA has been completely reconstituted. By the best role, we mean the role which has the maximum uncovered area. In this example, we have set the desired number of roles in advance. Figure 4.5(b) shows the first iteration, in which the role with the largest area, {P0, P2}, is selected; since nothing is covered yet, its uncovered area equals its total area, apparently the largest so far. The iteration stops right here, since the next role {P0, P1} has a total area of only 7, which is smaller than the uncovered area seen so far, so its uncovered area cannot be greater. This is also true for all following roles since all roles are sorted in
descending order of their areas. Therefore, we move the best role from the
candidate set to the selected role set. Figure 4.5(c) shows the second iteration. Here, the uncovered area of the role under consideration is its total area of 6 minus the area (i.e., both cells {U1, P0} and {U6, P0}) already covered by {P0, P2} selected in the first iteration. The iteration continues
calculating the uncovered area of each role down the candidate list and stops
after the role {P1 } since the next role {P1 , P2 } cannot have an uncovered area
greater than the current largest one. Similarly, the third iteration, shown in Figure 4.5(d), proceeds in the same manner.
The last part of the algorithm associates the users with the selected roles. The exact association has been conducted in part 2 of the algorithm. The relaxed association additionally admits users whose permission sets may not fully enclose the role. Figure 4.5(e) shows the roles selected by our algorithm and how they cover the original user permission assignment. From the left table in Figure 4.5(e), we
Table 4.5. Dataset parameters

Dataset   NRoles   NUsers   NPermissions   MRolesUsr   MPermissionsRole
data1     100      2000     100            3           10
data2     100      2000     500            3           50
data3     100      2000     1000           3           100
data4     100      2000     2000           3           200
can see that there are five 1s not covered at cells {U0 , P2 }, {U2 , P2 },{U2 , P3 },
{U4, P3}, {U5, P0}. The right table in Figure 4.5(e) also shows the coverage achieved by the roles selected by the DBP algorithm. We can see that it also leaves the same five 1s uncovered; on top of that, it covers additional cells erroneously.
[Figure 4.5(a)–(d). The example UPA of 10 users (U0–U9) over permissions P0–P4, and the sorted candidate roles (permission sets, associated users, areas, and uncovered areas) across the selection iterations, with pruned roles marked X; (e) the coverage of the UPA by the roles selected by our algorithm (left) and by the DBP algorithm (right).]
Figure 4.5. An example of the heuristic solution for the MinNoise RMP problem
Our evaluation covers datasets ranging from small to large organizations, with a few to a large number of roles. We want to see the effect of δ in terms of reduction in time and number of roles. We would also like to see the effect of our optimization for the Basic-RMP.
Ideally, this should be done on real data sets. However, it is very difficult
to find different real data sets with the variety of parameters that we would
like to evaluate. Therefore, we used the same synthetic test data generator
used in [100], which allows us to set the parameters of our choice and enables us to evaluate all of these aspects.
First we briefly describe how the test data generator works: First a set of roles is created. For each role, a random number of permissions up to a certain maximum are chosen to form the role. The maximum number of permissions a role can have is set as a parameter of the algorithm. Next, the users are created. For each user, a random number of roles
are chosen. Again, the maximum number of concurrent roles a user can have
is set as a parameter of the algorithm. Finally, the user permissions are set
according to the roles to which the user has been assigned. Algorithm 6
gives the details. It is obvious that this generator uses a simple randomization strategy in order to generate the test data. This allows us to test in a content-neutral setting, since no real business semantics are used to create specific associations or weights. Thus, one could
actually enhance the test data generator itself to include more semantics
and create more realistic data. However, this would require expert domain knowledge, for example to incorporate meaningful user attributes and permission hierarchies into the test data generation. A survey of organizational role structures could provide such semantics to create very realistic datasets. While this is a very worthy goal, due to the high effort required, we leave it to future work. Our current generator suffices for now.
For each parameter set, 5 data sets were created, and the role mining algorithm was run on each of the created data sets. All results reported for a specific parameter set are averaged over the 5 runs.
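The generator described above can be sketched as follows (parameter names are illustrative and mirror the tables; Algorithm 6 in the text is the authoritative version):

```python
import random

def generate_data(n_roles, n_users, n_perms, m_roles_usr, m_perms_role, seed=0):
    rng = random.Random(seed)
    perms = range(n_perms)
    # 1) Create roles: each gets a random number of permissions up to the max.
    roles = [frozenset(rng.sample(perms, rng.randint(1, m_perms_role)))
             for _ in range(n_roles)]
    # 2) Create users: each gets a random number of roles up to the max,
    # 3) and the user's permissions are the union of the assigned roles'.
    upa = {}
    for user in range(n_users):
        assigned = rng.sample(roles, rng.randint(1, m_roles_usr))
        upa[user] = frozenset().union(*assigned)
    return roles, upa
```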
For each set of experiments, we report the speed as well as the number
of roles reported by the algorithm. Since we are not guaranteed that the roles found will match the original roles (any roles that suitably cover the UPA are fine), we report the number of roles found rather than comparing against the original roles.
In the first set of experiments, we kept the number of users and roles constant while varying the number of permissions (and therefore the number of permissions per role). Table 4.5 describes the test parameters.
Figure 4.6(b) shows the time taken by the algorithm, while Figure 4.6(a)
shows the number of roles found by the algorithm. There are four lines for
each of the four datasets. For each line, there are five plot points signifying
values for δ of 0, 1, 5, 10, and 20. It can be clearly seen that a number of roles close to the optimal is found for the Basic-RMP (i.e., with δ = 0). What is even more interesting is the effect of nonzero δ: for example, for δ = 5%, the number of roles is close to 80, on average. This means that we can cover 95% of the original UPA without any errors with only 80% of the original roles. Clearly, these 80 roles could be viewed as the more significant ones. In terms of speed, the algorithm performs well for non-zero values of δ, though it is quite slow for δ = 0. However, one must note that the Basic-RMP optimization was not used for this experiment.
However, it was used in the second set of experiments to judge the effect.
In the second set of experiments, we kept the number of roles and per-
missions constant while varying the number of users. Table 4.6 describes the
test parameters. Figure 4.7(b) shows the time taken by the algorithm, while
Table 4.6. Dataset parameters

Dataset   NRoles   NUsers   NPermissions   MRolesUsr   MPermissionsRole
data1     100      100      1000           3           100
data2     100      500      1000           3           100
data3     100      1000     1000           3           100
data4     100      2000     1000           3           100
Figure 4.7(a) shows the accuracy of the algorithm. The accuracy results are
quite similar to the first set of experiments. Indeed, our algorithm always
performed extremely well for a wide variety of parameters beyond those re-
ported here. The interesting result lies in the speed of the algorithm. Now, the speed is drastically better, and the running time does not significantly increase with the number of users.
In this section, we present the results of the experiments that we have conducted to validate our algorithm and to compare our results with those generated by the DBP algorithm [77]. In order to validate it, we have used the
same test data generator used for FastMiner. The test data generator first
creates a set of roles such that, for each role, a random number of permissions, up to a maximum set as a parameter of the algorithm, is chosen. Next, the users are created such that, for each user, a random number of roles are chosen. Again, the maximum number of concurrent roles a user associates with is set as a parameter. Finally, the permissions are set according to the assigned roles. We test two variants of our algorithm: one called MinNoise with errors, and the other called MinNoise without errors. The algorithm without errors does not have the last user association phase. To compare them with the DBP algorithm, we ran two types of experiments. In the first
experimental test, we generate 3 test data groups by running the test data
generator 3 times. Therefore, their structure is the same even though the
actual data varies. More specifically, each group consists of 4 different data
sets of different sizes. All data sets in a group have 200 permissions, but the number of users of each data set varies over 100, 200, 500, and 1000.
Figure 4.8 shows the test results averaged over the 3 different runs when
20 roles are desired. The value on the y-axis in each figure represents the difference between the original user permission assignment and the reconstructed user permission assignment, while the x-axis represents the number of users of each data set in that group. As we can see, with the increase in the sizes of the data sets, our algorithm (MN with errors) increasingly outperforms DBP. This shows that our algorithm creates roles with higher reconstruction accuracy. We also want to verify that this advantage remains consistent even when different numbers of roles are required. To test this, we
ran 2 tests. Each test ran on only one data set. But each test was run 5
times with different values of k from 5, 10, 15, 20 to 25. Figure 4.9.(a) and
(b) show the results on the 2 tests where Figure 4.9.(a) uses data set of 200
users and 200 permissions, and Figure 4.9.(b) has the data set with 500 users
and 200 permissions. Again, one can see that, regardless of the value of k, MN with errors performs best, while the variant that does not make any errors cannot outperform the DBP, and in fact fares the worst among the three. This shows the utility of the final user association phase, where users are associated with the candidate vectors even at the cost of introducing inaccuracy. One other interesting point is that the final phase associates a user with a role if at least half of the role permissions are also present in the user.
However, this may not gain us anything, as those permissions may already
be covered due to different roles. One can see this effect in the experiments,
when the errors sometimes increase with an increase in the number of roles.
This can be easily fixed by only associating a user with a role in the final
phase when the number of uncovered permissions common to the role is more
than half of the number of permissions in the role. We plan to do this in the future.
[Figure 4.6. (a) Number of roles found and (b) running time, for datasets d1–d4 at δ = 0, 1, 5, 10, and 20.]
[Figure 4.7. (a) Number of roles found and (b) running time, for datasets d1–d4 at δ = 0, 1, 5, 10, and 20.]
[Figure 4.8. Reconstruction difference versus number of users (100, 200, 500, 1000) for DBP, MN with errors, and MN without errors.]
[Figure 4.9(a). Reconstruction difference versus k (5, 10, 15, 20, 25) for DBP, MN with errors, and MN without errors (200 users, 200 permissions).]
[Figure 4.9(b). Reconstruction difference versus k (5, 10, 15, 20, 25) for DBP, MN with errors, and MN without errors (500 users, 200 permissions).]
CHAPTER 5
THE EDGE-RMP
The Basic-RMP asks: what is the minimum number of roles needed to describe the given data (and what are those roles, and the corresponding user assignments)?
The Edge-RMP instead asks for a user-role assignment UA and a role-permission assignment PA 0-consistent with UPA and minimizing the sum of the sizes of the two assignments (i.e., |UA| + |PA|).
Although, based on the above two definitions, one may at first glance think that the Basic-RMP and the Edge-RMP are closely related, they are in reality distinct problems. In other words, a solution for the Basic-RMP does not directly yield a solution for the Edge-RMP, as the following example shows.
(a) The UPA:

        p1  p2  p3
   u1    1   1   0
   u2    0   1   1
   u3    1   1   1
   u4    1   1   1
   u5    1   1   1
   u6    1   1   1

Basic-RMP solution:

   (b) UA:              (c) PA:
        R1  R2               p1  p2  p3
   u1    1   0          R1    1   1   0
   u2    0   1          R2    0   1   1
   u3    1   1
   u4    1   1
   u5    1   1
   u6    1   1

Edge-RMP solution:

   (d) UA:              (e) PA:
        R1  R2  R3           p1  p2  p3
   u1    1   0   0      R1    1   1   0
   u2    0   1   0      R2    0   1   1
   u3    0   0   1      R3    1   1   1
   u4    0   0   1
   u5    0   0   1
   u6    0   0   1
Figure 5.1. An example showing that the Basic-RMP and the Edge-RMP are different
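The edge counts claimed in Figure 5.1 can be checked mechanically (role and user sets transcribed from the figure):

```python
# Basic-RMP decomposition of Figure 5.1: 2 roles.
basic_pa = {'R1': {'p1', 'p2'}, 'R2': {'p2', 'p3'}}
basic_ua = {'u1': {'R1'}, 'u2': {'R2'},
            **{u: {'R1', 'R2'} for u in ('u3', 'u4', 'u5', 'u6')}}

# Edge-RMP decomposition of Figure 5.1: 3 roles, fewer edges.
edge_pa = {'R1': {'p1', 'p2'}, 'R2': {'p2', 'p3'}, 'R3': {'p1', 'p2', 'p3'}}
edge_ua = {'u1': {'R1'}, 'u2': {'R2'},
           **{u: {'R3'} for u in ('u3', 'u4', 'u5', 'u6')}}

def edge_count(ua, pa):
    # |UA| + |PA|: one edge per user-role and per role-permission assignment.
    return sum(len(rs) for rs in ua.values()) + sum(len(ps) for ps in pa.values())
```

Both decompositions reconstruct the same UPA, yet the Basic-RMP solution uses 2 roles and 14 edges while the Edge-RMP solution uses 3 roles and only 13 edges.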
Figure 5.1(a) shows the user-permission assignment UPA. Figures 5.1(b) and (c) represent the solution to the Basic-RMP: the number of identified roles is |R| = 2, but the number of edges is |UA| + |PA| = 14. Figures 5.1(d) and (e) represent the solution to the Edge-RMP problem: the minimum number of edges after the decomposition is |UA| + |PA| = 13. However, the number of roles, |R| = 3, is not minimal. This example shows that the optimal solution for the Basic-RMP is not necessarily optimal for the Edge-RMP.
Intuitively, looking at the given example, one may think that this com-
plexity is solely due to the repetition of users – i.e., that if no two users have
the exact same set of rights, the Edge-RMP and the Basic-RMP might be
the same. However, this is not the case either. Irrespective of repetition of users, one can construct examples having different solutions for both. This conclusively shows that the two are completely different problems. The Edge-RMP can be considered more complex than the Basic-RMP. This is due to the repetition of users: for the Basic-RMP, the number of identical users is irrelevant.
If a set of roles can describe a particular user, it can describe any number
of identical users. While this is true with the Edge-RMP as well, the overall
cost (|UA| + |P A|) does change depending on the number of users. Thus, it
might make sense to create a special role for 1000 users who perform three related job functions. Here the Edge-RMP may find more roles, as long as they reduce the overall number of edges. Since it has to take the number of identical users into account, the Edge-RMP behaves differently from the Basic-RMP. In order to consider the complexity of the problems, we now frame the decision version of the Edge-RMP. Interestingly, to the best of our knowledge, most (if not all) of the prior NP-completeness proofs in this area follow the standard strategy [36]:
1. showing that π is in NP;
2. transforming a known NP-complete problem π′ to π.
The problem π ′ used to reduce from is the “Vertex Cover problem” defined
below:
• The decision Edge Role Mining Problem is in NP: the set of roles, together with UA and PA, serves as a certificate that can be verified in polynomial time to be consistent with the UPA and to satisfy |UA| + |PA| ≤ k.
Without loss of generality, we trivially assume that all of the vertices in V have at least one edge; this can be ensured by simply removing all isolated vertices, since these can never be part of a minimal vertex cover. The reduction creates permissions xi and yi
for i ∈ [1, . . . , n], and {ai,j }, {bi,j }, {ci,j }, {di,j } for each edge in E
(between vertices vi and vj ). Since there are n total vertices and m total
set {xi , yi}. For each edge e between vertices vi and vj in E (assume
i < j), we define 5 users u1i,j , . . . , u5i,j with the following permissions:
to show that G has a vertex cover of size at most k if and only if the
The intuitive idea behind the proof follows the proof idea of Jiang and
set of roles, ROLES must have either the role {xi , yi} or the two roles
second case, |UA|+|P A| = 2+2 = 4. These are the only two meaningful
clearly would not be required). Let V1 = {vi | both {xi } and {yi } are in
ROLES}. We can show that for a fixed < vi , vj >∈ E at least 4 roles
and 18 links are necessary to cover the five users (in addition to the sets
UA as follows: For every vi in the cover we include both {xi } and {yi }
4 roles {ai,j , bi,j }, {ci,j , di,j }, {yj , bi,j , ci,j }, {xj , di,j , ai,j }. Now, all of the
as follows:
1. u1i,j = {xi } ∪ {ai,j , bi,j },
slightly different: {bi,j , ci,j }, {di,j , ai,j }, {xi , ai,j , bi,j }, {yi , ci,j , di,j }. The
role assignment is also quite similar with u1i,j and u3i,j being present in
the roles created and the rest being composed out of two roles. Finally,
is in V1 . In all three cases, only 4 sets are created and the resulting
vertices and m edges in the graph, the total number of sets created is
the fact that ROLES must have the set {xi , yi } to represent the user
ROLES are required (in addition to the sets ci , cj {xi }, and {xj }) to
cover the users u1i,j , . . . , u5i,j . Further, at least 5 sets are required to
= 3n + k ′ + 20m − 2m′
= 3n + k + 18m + (k ′ − k + 2m − 2m′ )
links are sufficient if e ∈ E ′ . In this case, we can simply use the construction
noted above of using 4 roles to represent the 5 users, and the number of
construction that requires 20 extra links and to show that it cannot be done
roles {ai,j , bi,j }, {ci,j , di,j }, {xi }, {yi }, {yj , bi,j , ci,j }, and {xj , di,j , ai,j }. Now,
Lemma 5.1.2 The permissions for any role have to be a subset of the permissions of some user.
To see this, suppose there is a role Ri whose permission set is not a subset of the permissions of any user. Then, Ri has at least two permissions p1 and p2 which are not owned together by any user. In this case, the role Ri cannot be assigned to any of the users, since doing so would grant a user a permission it does not have.
Lemma 5.1.3 If the users can be decomposed into disjoint sets based on their permissions, then the global solution is the sum of the solutions for the individual sets.
By Lemma 5.1.2, any role created can only be a subset of the permissions
for some user. Therefore, if there are two users whose permission sets are
completely disjoint, no role assigned to the first user can ever be assigned to
the second user and vice versa. (To see this, note that any role assigned to
that user has to have at least one permission assigned to that user. Since the
permission sets are disjoint, this role cannot then be assigned to the other.)
The above lemmas show that no solution can reduce the number of extra links, |UA| + |PA|, below 20 for a fixed edge. The second lemma shows that we can consider edges in isolation, while the first lemma allows us to enumerate all possible solutions for this fixed set. A brute-force search then confirms that no decomposition requires fewer than 20 extra links.
We now present a heuristic solution for the Edge-RMP, which first generates candidate roles and then aims to identify the minimal set of roles. For the sake of exposition, we will assume that there are n users, m permissions and q candidate roles identified. Our goal is to select a set of roles from the candidate roles to minimize |UA| + |PA|, where |UA| gives the size of the user-role assignment (i.e., |UA| is the sum
of the number of roles assigned to each user) and |P A| gives the size of the
phases. The first phase is the same as the heuristic solution for the Basic-
RMP. That is, we generate a set of candidate roles from the UP A using the
we select the final roles from among these candidates. For this selection, we
adopt a greedy strategy different from database tiling. Essentially, the best
candidate role is selected from the remaining candidate roles until the origi-
remaining candidate role, we compute the number of new 1s added into both
|UA| and |P A|. In addition, we compute the uncovered area of that role,
i.e., the area not already covered by any of the roles in ROLES (the current
minimum tiling).
The greedy strategy is, in each iteration, to pick the role which introduces
the minimal number of new 1s into both |UA| and |P A|. However, this
1. If two roles introduce the same number of new 1s, the one with the
2. The role picked should have uncovered area greater than or equal to 1.
more 1s than role a, but the number of 1s it introduces is less than the
We define the first constraint since it carries the greediness further, in
the sense that in each step, if all other conditions are the same, picking
the role with the largest uncovered area will also generate the fewest roles.
We expect this to cut the number of 1s further. The second constraint
guarantees that each picked role will contribute to the termination of the
role selection process. The last constraint serves two intentions. First, we
try to avoid the trivial case of always picking a role that introduces the
fewest new 1s but contains only a single permission. Second, we strike a
balance between two types of greediness. The greedy choice of the role with the minimum
in view of the whole role selection process. The other greedy choice (shown in the
first constraint), which picks the role with the largest uncovered area, can terminate the role
in both |UA| and |P A|. The last constraint mainly resolves the conflict
between the two contributing factors when we cannot satisfy both. Note that this
In the following we use the same example as above to show the difference
For the Basic-RMP, in the first iteration, we picked the role {p2 , p4 }
since it has the largest uncovered area. In the figure, we show the first iteration of
selecting a candidate role for the Edge-RMP. All candidate roles are initially
sorted by their covered area. Then we compute their uncovered area and the
new 1s introduced. Roles {p2 , p4 }, {p1 , p2 , p4 }, {p2 }, and {p4 } are not selected since the
number of new 1s they introduce is not minimal. Role {p2 , p3 } introduces fewer new 1s than role
{p2 , p4 }, but {p1 , p2 , p4 } has a larger uncovered area even though
they introduce the same number of new 1s into |UA| and |P A|. Finally, role
{p2 , p3 , p4 } is selected.
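The selection loop described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the UP A and role representations (dicts and frozensets) are assumptions, and the third, only partially preserved constraint is omitted.

```python
# Sketch of the greedy Edge-RMP candidate selection. Roles and user
# permission sets are frozensets; the third constraint from the text is
# omitted since its full statement is garbled in this copy.

def edge_rmp_select(upa, candidates):
    """upa: dict user -> frozenset of permissions.
    candidates: list of frozensets (candidate roles).
    Repeatedly picks the role adding the fewest new 1s to |UA| + |PA|,
    breaking ties by the larger uncovered area (constraint 1) and
    skipping roles whose uncovered area is below 1 (constraint 2)."""
    total = {(u, p) for u, ps in upa.items() for p in ps}
    covered, selected = set(), []
    while covered < total:
        best_key, best = None, None
        for role in candidates:
            if role in selected:
                continue
            # a role can only go to users holding all its permissions
            # (the Lemma 5.1.2-style restriction)
            users = [u for u, ps in upa.items() if role <= ps]
            area = sum(1 for u in users for p in role
                       if (u, p) not in covered)
            if area < 1:                   # constraint 2: must make progress
                continue
            cost = len(users) + len(role)  # new 1s in UA plus new 1s in PA
            key = (cost, -area)            # constraint 1 as the tie-breaker
            if best_key is None or key < best_key:
                best_key, best = key, role
        if best is None:                   # no candidate covers anything new
            break
        selected.append(best)
        covered |= {(u, p) for u, ps in upa.items()
                    if best <= ps for p in best}
    return selected
```

The loop terminates because every selected role must cover at least one previously uncovered cell of the UP A.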
CHAPTER 6
THE MINPERT-RMP
Minimality is a good notion, since it allows one to formally define the prob-
as a best approximation for realizing good descriptive roles. [101] shows that
this problem is NP-hard by reducing the NP-complete set basis problem to it. An optimal set of roles is desir-
roles.
However, adopting this minimal set of roles suffers from the following
limitations. First, since this process does not consider job functions or any
semantics, the discovered set of roles may not accurately represent the or-
role mining process completely ignores the existing set of roles. This probably
roles from scratch is not permissible for organizations that have an RBAC in
place. This is because, if the organization has spent considerable effort in role
engineering (perhaps using the top-down approach), these efforts are simply
a waste. Moreover, changes to the role set may result in drastic changes
to the way in which organizations conduct their businesses. This may cause,
for example, violations of separation of duty constraints that are defined on roles. Furthermore, there may
Therefore, although one would like to use an optimal set of roles, migrat-
ing to this new set of roles from the existing set of roles (called deployed roles, or DROLES)
has proposed an approach that identifies a set of roles (ROLES) that are
as close as possible to both DROLES and the optimal set of roles. We de-
note this problem as the Minimal Perturbation RMP (also known as the
time, it gives a formal way to measure the goodness of the current RBAC
assignments.
ing 8 roles, i.e., the optimal set of roles is: {{1, 3, 9}, {11, 3, 8}, {3, 8, 9},
{4, 5, 8}, {10, 5, 7}, {1, 10}, {2}, {3, 4}}. Assume that the deployed roles
consist of each permission in its own role; i.e., there are 12 deployed roles.
Hence DROLES = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10},
{11}, {12}}. The MinPert-RMP also has a weight parameter that allows
us to select how close the generated roles should be to the deployed set of
roles. With the weight set to 0, the following 9 roles are generated by our
algorithm: {{1, 3, 8, 11}, {1, 3, 9}, {1, 10}, {2, 3}, {3, 4}, {3, 5, 7, 8, 10, 11},
{3, 8, 9}, {4, 5, 8}, {5, 7, 10}}. With the weight set to 0.2, the following 10
roles are generated: {{1}, {1, 3, 8, 11}, {2, 3}, {3, 4}, {3, 8, 9}, {3, 8, 9, 11},
{3, 9}, {4, 5, 8}, {5, 7, 10}, {10}}. Finally, with the weight set to 1, the fol-
lowing 11 roles are generated: {{1}, {1, 3, 8, 11}, {2, 3}, {3}, {3, 8, 9, 11},
{3, 9}, {4}, {5}, {5, 7, 10}, {8}, {10}}. We can see that though the number
of roles is increasing, the roles generated are getting closer to the set of deployed
roles. The algorithm allows one to come up with a suitable set of roles in between the
6.1 Definition
the similarity / distance between pairs of roles and pairs of sets of roles. Since
a role is nothing but a set of permissions, we can use the Jaccard coefficient
simulate the real time situations where different permissions and roles carry
Permissions Weighting
based on the scope of their effects upon it: global weight (G) and local
spect to other permissions among all the deployed roles whereas local weight
there is only one global weight per permission, in contrast to the multiple
local weights a permission can have; in fact, it can have as many local weights
notion.
There are numerous ways to define how global weight can be calculated. A
legitimate strategy should support two assertions: first, the numerical values
it should carry at least some weight. On the other hand, no permission should
claim all weight unless in the extreme case where it is the sole permission
essentially the notion of weight doesn’t show its influence. However, this is
from all others based on its popularity in users, i.e., the more users who are
be proportional to its popularity among users. They may vary inversely (i.e., they
the perspective of the Chief Executive Officer role sitting at the top of the
role hierarchy, those permissions, which are barely authorized to users other
may be of great importance to it since they are exclusive to only a few
are not that critical to the safety of the whole role-based secure systems,
naturally they concern the high-ranked roles less. From this point onward,
this definition to cater to the cases when both show the relation of inverse
proportionality.
can be calculated in a similar fashion. The two assertions still hold, i.e., the
exclusive of 0, in the sense that each permission will still carry some level of
significance for a given role even though this permission may
not even be included in that specific role. Therefore, a given permission may
only be included in certain roles but will have weights for each of the deployed
roles. One implication of this is that in any deployed role, the sum of weights
fact that a permission’s weight may vary across different roles. A naive
but legitimate way of local weighting is to assign the same weight to both
constituent permissions and permissions excluded from the role. In this case,
the global weight and all local weights will end up being the same for each
calculate its local weight in a role the same way we calculate its global weight,
evenly distribute the remaining weights over the permissions excluded from
the role. This is better not because we ensure more weight for the constituent
permissions (as a matter of fact, the opposite might be true), but
because we treat the included permissions specially, in the sense that the weight
For example, consider a UP A with three users {u1 , u2, u3 } who, respec-
are two users for each of p1 , p2 , p4 , and three users for p3 , we have the sum-
that if needed, we may add the second subscript specifying the global envi-
ronment such as the name of the specific UP A or of the role set for clarity).
in R1 , while L{p1 ,R2 } and L{p2 ,R2 } equally share the remaining weight, which is 4/9,
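The weighting scheme can be sketched as follows. Since the user permission sets behind this example are only partly preserved, the sketch assumes a UP A with the same per-permission counts (two users each for p1 , p2 , p4 and three users for p3 ); the function names are illustrative.

```python
# Popularity-based permission weighting, sketched under an assumed UPA
# with two users each for p1, p2, p4 and three users for p3.
from fractions import Fraction

def global_weights(upa):
    """Global weight of a permission: the number of users holding it
    divided by the total number of 1s in the UPA."""
    total = sum(len(ps) for ps in upa.values())
    perms = set().union(*upa.values())
    return {p: Fraction(sum(p in ps for ps in upa.values()), total)
            for p in perms}

def local_weights(role, upa):
    """Constituent permissions keep their global weight; the remaining
    weight is split evenly over permissions excluded from the role."""
    g = global_weights(upa)
    w = {p: g[p] for p in role}
    excluded = [p for p in g if p not in role]
    if excluded:
        share = (1 - sum(w.values())) / len(excluded)
        for p in excluded:
            w[p] = share
    return w
```

With these counts the global weight of p3 is 3/9 and, for a role containing p3 and p4, the excluded p1 and p2 each receive 2/9 — the 4/9 remainder shared evenly, as in the example.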
ability of the user input, here the user could be the role system administrator
or just a normal user who can access and manipulate the organizational role
mining systems. Typically, there are three types of privileges a user is autho-
rized to. First, a user can specify the priority order between local weighting
and global weighting. By default, if both are available, local weighting will
be adopted as opposed to global weighting, but the user can always override
this rule. Second, a user can arbitrarily assign weights to permissions
locally or globally or both. That is, the user can assign weights either to
each of all permissions or only to some permissions, in the latter case, the
weights are not specified. In local weighting, if the user specifies weights for
be from the included permissions of the role. Then the constituent permissions
for which weights are not specified and the excluded permissions will
evenly share the remaining weight. For instance, in the above example, if the
user specifies that the global weight and local weight of p3 are 0.4 and 0.3
respectively, then the global and local weights for the other permissions are as
follows: Gp1 = 0.2, L{p1 ,R1 } = 0.35, L{p1 ,R2 } = 0, Gp2 = 0.2, L{p2 ,R1 } = 0.35,
L{p2 ,R2 } = 0, Gp4 = 0.2, L{p4 ,R1 } = 0, L{p4 ,R2 } = 0.7. Third, the user can
in the above example, the user can specify that, globally, p1 is 20 percent
more important than any other permissions. We already see from the above
percent more important than others. The calculation will be similar, except
that the specified 30 percent weight will be evenly distributed between p2 and
p3 .
Roles Weighting
local weights (L) for each deployed role. Global weight measures the significance
of a role among all the deployed roles in UA whereas local weight
ministic once a set of roles have been generated and will remain unchanged
as long as the deployed role set stays the same. Therefore, each role has only
one global weight for a given weighting strategy, whereas it can have multiple
local weights, each of which is associated with a specific user; in fact, the
number of local weights for a role can be as large as the number of users,
given that the local weights differ from each other.
There are various strategies to calculate the global weight of a role given
a UA and a set of deployed roles and all of them shall support the same two
local weighting of a role shares the same assertions with the local weighting
of a permission.
to define the same importance among deployed roles. So each role will share
the situation where weighting is not used since assigning the same weight to
each one defeats the purpose of introducing the notion of weight, which is to
authorized users divided by the sum of the number of users authorized to each
of all deployed roles, that is, the number of 1s in UA. The local weighting of
one role w.r.t. one user can be similarly defined: for the assigned roles, one
can count the number of authorized users divided by the sum of the number
of users authorized to each of the roles assigned to that user; for all roles not
included in the locality of the user, we just distribute the remaining weights
evenly among them. This strategy calculates the weight of each assigned role based on the
relative popularity of the role among users with respect to all other roles,
and treats all other roles not assigned to the user in an undifferentiated manner.
For example, consider a UA with three users {u1 , u2 , u3 } who are autho-
rized to role sets {R1 , R2 },{R2 },{R1 , R3 } respectively. We use the aforemen-
tioned popularity based strategy to generate the local and global weights of
each role. Since we have five 1s in UA, and R1 and R2 each has two autho-
the local weights for R1 , R2 and R3 as follows: L{R1 ,u1 } = 2/4, L{R1 ,u2 } = 0,
L{R1 ,u3 } = 2/3, L{R2 ,u1 } = 2/4, L{R2 ,u2 } = 1, L{R2 ,u3 } = 0, L{R3 ,u1 } = L{R3 ,u2 } = 0,
L{R3 ,u3 } = 1/3.
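The popularity-based role weighting can be sketched directly from this example; the function names are illustrative only.

```python
# Popularity-based role weighting over a UA (dict user -> set of roles),
# reproducing the worked example above.
from fractions import Fraction

def role_global_weights(ua):
    """Global weight of a role: its number of authorized users divided
    by the number of 1s in UA."""
    total = sum(len(rs) for rs in ua.values())
    roles = set().union(*ua.values())
    return {r: Fraction(sum(r in rs for rs in ua.values()), total)
            for r in roles}

def role_local_weights(user, ua):
    """Assigned roles split the weight in proportion to their user
    counts; roles not assigned to the user evenly share what remains."""
    roles = set().union(*ua.values())
    counts = {r: sum(r in rs for rs in ua.values()) for r in roles}
    assigned = ua[user]
    denom = sum(counts[r] for r in assigned)
    w = {r: Fraction(counts[r], denom) for r in assigned}
    rest = [r for r in roles if r not in assigned]
    if rest:
        share = (1 - sum(w.values())) / len(rest)
        for r in rest:
            w[r] = share
    return w
```

On the example UA this reproduces G_R1 = 2/5 and L{R1 ,u3 } = 2/3.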
The weights of roles, in real time, are generated based on the availabil-
ity of the user input pretty much the same way as permission weighting.
In summary, a user can define the priority order between local and global
assign weights either to all of the deployed roles or only to a subset of them;
in the latter case, the remaining weight will be evenly distributed among the
unspecified roles. This is applicable to both local and global weighting. In addition, the
user can specify the weight either by numerical real numbers or by the rela-
tive importance of one or more among all roles. For instance, in the above
example, if the user specifies that the global weight and local weight of R1 are
0.6 and 0.4 respectively, then the global and local weights for the other roles are
as follows: GR2 = 0.2, L{R2 ,u1 } = 0.6, L{R2 ,u2 } = 1, L{R2 ,u3 } = 0, GR3 = 0.2,
R2 as follows. Note that in the definition, we use local weights for permissions
by default; in real time, the user can overwrite them with the corresponding
global weights. This is also true for the roles-to-roles similarity defined
shortly.
sim(R1 , R2 ) = (Σpi ∈ni L{pi ,R2 }) / (Σpj ∈nu L{pj ,R2 }), d(R1 , R2 ) = 1 − sim(R1 , R2 ).
Note that each permission in the union of R1 and R2 can have two local
resolve the conflict, that is, we can either use local weights from the first
In this definition, we use the local weight from the right parameter, but this can
Since there is only one permission common to both R1 and R2 (i.e., p2 ) and
The above definition has several favorable properties. When two roles are
(and distance) is a value between 0 and 1. We can also easily extend the
definition to measure distance between two sets of roles. First, we define the
set of roles SR1 , we define the similarity between R1 and SR1 as follows:
sim(R1 , SR1 ) = maxR2 ∈SR1 sim(R1 , R2 ) and the distance d(R1 , SR1 ) = 1 −
sim(R1 , SR1 ).
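Under a weighted-Jaccard reading of the definitions above (weight of the shared permissions over the weight of the union), a minimal sketch is the following; with w=None every permission counts equally, which reduces to the plain Jaccard coefficient used in the examples.

```python
# Role-to-role and role-to-set similarity under a weighted-Jaccard
# reading of the definitions; w maps a permission to its weight, and
# w=None gives the unweighted Jaccard coefficient.

def role_sim(r1, r2, w=None):
    """Weight of the shared permissions over the weight of the union."""
    weight = (lambda p: 1) if w is None else (lambda p: w[p])
    return (sum(weight(p) for p in r1 & r2)
            / sum(weight(p) for p in r1 | r2))

def role_set_sim(r, roles):
    """Similarity of a role to a set of roles: its best match in the set."""
    return max(role_sim(r, other) for other in roles)
```

For the example below, role_set_sim({p1 , p2 , p3 }, SR1 ) evaluates to 0.4, the maximum pairwise similarity.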
between the role and any role in the other set. We choose this as an effective
measure since if an identical role is found in the other set, the similarity
{p1 , p2 , p3 } and the set of roles SR1 = {{p4 , p5 }, {p3 , p6 }, {p1 , p2 , p4 , p7 }}. In
sim(R1 , {p1 , p2 , p4 , p7 }) = 2/5 = 0.4. Since the max similarity is 0.4, sim(R1 , SR1 )
maximum, etc., if the situation warrants it. For now, we think that this is a
only one other role or to a set of roles. Similarly, it is not clear if a role can
be involved in more than one matching (i.e., once a role is picked as part of
by extending the earlier Role − Role metric. Thus, we could simply define
the similarity as the number of identical roles divided by the number of total
SR1 and SR2 , we define the similarity between SR1 and SR2 as follows: Let
ni = S where S = {(u, v) | u ∈ SR1 , v ∈ SR2 and sim(u, v) = 1}.
Thus ni = SR1 ∩ SR2 . Also, let nu = SR1 ∪ SR2 . We define
sim(SR1 , SR2 ) = (ΣRi ∈ni G{Ri ,SR2 }) / (ΣRj ∈nu G{Rj ,SR2 }), d(SR1 , SR2 ) = 1 − sim(SR1 , SR2 ).
For example, consider two sets of roles SR1 = {{p1 , p2 }, {p2 , p4 }, {p3 , p4 ,
p5 }}, and SR2 = {{p2 , p4 }, {p3 , p4 , p6 }}. In this case, only one role in SR1
is identical to a role in SR2 (the role {p2 , p4 }). Therefore, according to the
between roles. For example, the roles {p3 , p4 , p5 } and {p3 , p4 , p6 } differ only
general, this could be worse, especially with larger roles which may not be
identical but are very similar. Instead, we would like to have a definition of
counting identical roles, we take the maximum similarity each role has with
the other set of roles, and average across all of the roles. If there is an identical
role in the earlier set, the earlier defined similarity metric will also give 1 as
desired. The advantage is that this also works for non-identical roles. To
two sets of roles SR1 and SR2 , we define the similarity between SR1 and
SR2 as follows: if the sizes of the two sets are not equal, without loss of
generality, assume that SR1 is the smaller set. Then, sim(SR1 , SR2 ) =
For example, consider two sets of roles SR1 = {{p1 , p2 }, {p3 , p4 }} and
SR2 = {{p1 , p2 }, {p3 , p5 }}. The similarity between the four possible pairs
more sophisticated fashion. For example, we may want to count the average
similarity, while eliminating the outliers (or while tolerating a certain number
of low similarity roles, etc.). For now we do not bother about this. Also note
that both of the above measures still assume that a role can only be mapped
to one other role. In general this is not true. For example, let one set have the
role {p1 , p2 , p3 } and the other set have the roles {{p1 }, {p2 , p3 }}. While our
similarity measures will give scores of 0.33 and 0.66 respectively for the two
roles, it should be clear that taken together, the sets of roles are quite similar.
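The average-of-best-match measure can be sketched as follows. Since the definition's denominator did not fully survive extraction, the sketch assumes we average over the smaller of the two sets.

```python
# Set-to-set similarity: for each role in the smaller set, take its best
# Jaccard match in the other set, then average those maxima. Averaging
# over the smaller set is an assumption where the definition is garbled.

def jaccard(a, b):
    """Plain Jaccard coefficient between two permission sets."""
    return len(a & b) / len(a | b)

def set_sim(sr1, sr2):
    """Average of each smaller-set role's best match in the other set."""
    small, large = sorted((list(sr1), list(sr2)), key=len)
    return (sum(max(jaccard(r, s) for s in large) for r in small)
            / len(small))
```

On the example above ({{p1 , p2 }, {p3 , p4 }} vs. {{p1 , p2 }, {p3 , p5 }}) the maxima are 1 and 1/3, giving 2/3.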
Now that we know how to measure similarity, we still need to find a way to
incorporate it into the role selection. Thus, given two different objectives, we
single global function. For the Basic-RMP, we minimize the number of roles.
roles and deployed roles. Many ways of combining the two objectives are
number between 0 and 1 while the number of roles is between 0 and min(m, n)
way to resolve this is to use a linear combination of the two. We would also
like to weigh the relative contribution of the two factors. Therefore we define
into a comparable range. With all of the prior definitions in place, we can
ing the combination function of the number of roles and the distance between
LES)).
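The combination function itself did not survive extraction here. Consistent with the surrounding discussion — a distance in [0, 1], a role count normalized into a comparable range by min(m, n), and a user-chosen weight w for the relative contribution — a plausible linear form is:

```latex
\min_{ROLES}\;\; (1-w)\cdot\frac{|ROLES|}{\min(m,\,n)} \;+\; w\cdot d(ROLES,\ DROLES), \qquad 0 \le w \le 1.
```

With w = 0 this reduces to minimizing the number of roles (the Basic-RMP objective); with w = 1 it only minimizes the distance to the deployed roles.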
initions, we assume that all permissions are given equal weight. In real
situations this may not be the case; however, our definitions can be
Role definition to include it. Permission weights could be set by the user or
RMP, the same idea applies to all of the variants – the δ-approx RMP and
the MinNoise RMP – proposed in [101]. All these problems can be extended
6.2 Complexity
In the first phase, we generate a set of candidate roles. This is currently done
general, any technique can be used to generate the candidate roles. In the
second phase, we select the final roles from among these candidates. For this
is selected from the remaining candidate roles until the original UP A can
candidate role, we compute the uncovered area of that role as well as the
similarity of that role to the deployed roles. The uncovered area of a role
not already covered by any of the roles in ROLES. The similarity of the
calculated by taking the area, similarity, and weight into consideration. One
in descending order by the area of the roles, the length of each iteration can
that role is less than the currently seen maximum score, we know that it
is impossible for that role to be the best (since the similarity, and weight
are bounded between 0 and 1, the total area gives the upper bound on the
maximum score from that role). Indeed, since the roles are sorted, we know
that none of the roles following this can be the best role either. Therefore, we
immediately stop the iteration and use the best role found so far. This can
significantly help in reducing the overall time. Algorithm 7 gives the details.
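The sorted scan with early termination can be sketched as below. The exact scoring function is not fully preserved in this copy, so the sketch assumes score = area · ((1 − w) + w · similarity), which keeps the stated property that a role's area upper-bounds its score.

```python
# Early-terminating best-role scan. Candidates must be sorted by area,
# descending; the scoring combination is an assumption (the text only
# guarantees that the area bounds the score from above).

def best_role(sorted_cands, sim_to_deployed, w):
    """sorted_cands: (role, uncovered_area) pairs, area descending.
    sim_to_deployed: role -> similarity in [0, 1]; w in [0, 1].
    Stops as soon as the area bound falls below the best score seen,
    since all later roles have smaller areas and thus smaller scores."""
    best, best_score = None, -1.0
    for role, area in sorted_cands:
        if area <= best_score:
            break  # early termination: no later role can win
        score = area * ((1 - w) + w * sim_to_deployed(role))
        if score > best_score:
            best, best_score = role, score
    return best
```

In the worst case this is still a linear scan, but in practice the break usually fires after a few candidates.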
strate the working of the algorithm. We use the same hypothetical organization
described in the introduction in Figure 6.1(a), along with the set of deployed
roles shown in Figure 6.1(b). We go through the run of the algorithm when
run with a weight for similarity of 0.2. In this case, 10 roles are found with
9). Since it would be quite tedious to show all of the 10 iterations (one role
is picked in each iteration), we instead just show a few of the iterations. Fig-
ure 6.2(a) shows the very first iteration where the role {p5 , p7 , p10 } is chosen
as the role with the best score (maximum uncovered area and similarity).
Figure 6.2(b) shows the fifth iteration when the role {p1 } is picked. Finally,
Figure 6.2(c) shows the final iteration when the role {p11 } is picked and
the remaining uncovered area at that point drops to 0 which terminates the
algorithm.
complexity of the candidate generation phase and the complexity of the can-
p0 p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11
u0 0 1 1 1 0 0 0 0 0 1 0 0
u1 0 0 1 1 1 1 0 1 0 0 1 0
u2 0 1 0 0 1 1 0 0 1 0 1 0
u3 0 0 0 1 0 1 0 1 1 1 1 0
u4 0 1 0 0 0 1 0 1 0 0 1 0
u5 0 1 0 1 0 0 0 0 1 1 0 1
u6 0 0 0 1 0 1 0 1 1 1 1 1
u7 0 0 0 1 1 1 0 0 1 0 0 0
u8 0 0 0 1 0 0 0 0 1 1 0 0
u9 0 1 0 0 0 0 0 0 0 0 1 0
u10 0 0 0 0 1 1 0 0 1 0 0 0
u11 0 0 0 1 1 0 0 0 0 0 0 0
u12 0 1 0 0 1 1 0 1 1 0 1 0
u13 0 1 0 1 0 1 0 1 1 0 1 1
u14 0 0 1 1 1 0 0 0 1 1 0 0
u15 0 0 0 1 1 0 0 0 0 0 0 0
(a) UP A
p0 p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11
r1 1 0 0 0 0 0 0 0 0 0 0 0
r2 0 1 0 0 0 0 0 0 0 0 0 0
r3 0 0 1 0 0 0 0 0 0 0 0 0
r4 0 0 0 1 0 0 0 0 0 0 0 0
r5 0 0 0 0 1 0 0 0 0 0 0 0
r6 0 0 0 0 0 1 0 0 0 0 0 0
r7 0 0 0 0 0 0 1 0 0 0 0 0
r8 0 0 0 0 0 0 0 1 0 0 0 0
r9 0 0 0 0 0 0 0 0 1 0 0 0
r10 0 0 0 0 0 0 0 0 0 1 0 0
r11 0 0 0 0 0 0 0 0 0 0 1 0
r12 0 0 0 0 0 0 0 0 0 0 0 1
(b) Deployed Roles
didate selection phase. Since the FastMiner algorithm uses pairwise intersec-
tion of unique users to generate candidate roles, it requires O(n2 ) time, where
n is the number of users. Since at most n roles are necessary to describe the
candidate selection. Thus in the absolute worst case, the overall cost is O(n3 )
which is still significantly better than the exponential worst case of tiling.
However, in practice, due to the sorting and quick termination strategy, the
To check the effect of weight on the results, we ran some experiments. Figure
6.3 shows the number of roles as a function of the weight. Figure 6.4 shows
the similarity of the roles generated to the deployed roles as a function of the
weight. The number of users was set to 200, number of permissions set to
400.
p0 p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11
r1 0 1 0 0 0 0 0 0 0 0 0 0
r2 0 1 0 1 0 0 0 0 1 0 0 1
r3 0 0 1 1 0 0 0 0 0 0 0 0
r4 0 0 0 1 1 0 0 0 0 0 0 0
r5 0 0 0 1 0 0 0 0 1 1 0 0
r6 0 0 0 1 0 0 0 0 1 1 0 1
r7 0 0 0 1 0 0 0 0 0 1 0 0
r8 0 0 0 0 1 1 0 0 1 0 0 0
r9 0 0 0 0 0 1 0 1 0 0 1 0
r10 0 0 0 0 0 0 0 0 0 0 1 0
with conflicting roles. There are two types of SOD, Static Separation of
Duty (SSOD) and Dynamic Separation of Duty (DSOD). SSOD refers to the
constraint that, at any time, a user can’t be assigned to two roles which have
conflicting interests. Let us say R1 and R2 are in conflict with each other; then a user assigned to R1 is
automatically disqualified for R2 not only at the same time but at
all times. SSOD is implemented mainly in order to avoid fraudulent
conduct. A typical example is that a user cannot perform the duties of both
associated with it. More specifically, if we identify that two roles R1 and R2
[Figure 6.2(a): Iteration 1 — the UP A of Figure 6.1(a) with the cells covered so far shaded; Uncovered Area = 53.]
[Figure 6.2(b): Iteration 5 — Uncovered Area = 12.]
[Figure 6.2(c): Iteration 10 — Uncovered Area = 0.]
[Figure 6.3: Number of Roles as a function of the Similarity Weight.]
[Figure 6.4: Similarity to the deployed roles as a function of the Similarity Weight.]
are under the dynamic separation of duty constraint, a user could get assigned
to these two roles statically at the time of user-role assignment, but at run
time, he cannot be assigned both at the same time. Notice that even though
DSOD can always be enforced via SSOD, its usage is justified by the fact
that, compared with SSOD, it is less restrictive and provides a higher level
of flexibility.
SSOD enforced by the original deployed roles under the assumption that
both the roles discovered by the MinPert-RMP and the original deployed roles are
R1 and R2 are two deployed roles which bear conflicting interests. Then
there exists at least one permission in each of R1 and R2 which are at odds
the fact that deployed roles are accurate, p1 and p2 will not be authorized
to any single user together in the original UP A. Therefore, the pair will not
discovered roles will be different from original UP A, i.e., the errors will be
argument assumes that two roles are conflicting with each other, it is still
However, the MinPert-RMP may violate DSOD given the same assump-
tion upon the accuracy of both discovered and deployed roles. For example,
{p1 , p2 , p3 },{p1 },{p2 , p3 } respectively. Assume that the deployed role set con-
sists of DR1 = {p1 },DR2 = {p2 },DR3 = {p3 }. If we define DSOD between
DR2 and DR3 , we know that p2 and p3 are in conflict at run time such that
no user can have both permissions at the same time. However, the MinPert-
RMP will generate the two roles R1 = {p1 },R2 = {p2 , p3 }. The solution is
still accurate but violates the DSOD, which states that p2 and p3 cannot be
We first define, in the context of DSOD, a few terms for convenience of ex-
of permissions which, if assigned to the same role, will cause the violation
conflicting role. Note that the notions of conflicting roles and conflicting permissions
can be used in SSOD; a subtle difference from the notion of conflicting
of roles among which conflict exists. We don’t use the singular form (i.e., a
conflicting role) since the conflict occurs in the context of two or more than
Our approach to eliminate DSOD conflicts is based on the fact that per-
missions responsible for the conflict are known. We call it preprocessing since
the potential conflicting permissions are removed before the evaluation and
selection of discovered roles from the candidate list. Clearly, this is a safe ap-
proach, since the DSOD will not be violated. The preprocessing for conflict
elimination can be incorporated into Algorithm 7. It will take place right af-
ter a candidate set of roles ROLES are created. Then we identify, according
example shown in Figure 6.5 in which p1 and p3 are assumed to be the conflict-
UP A. Figure 6.5(b) shows the candidate roles generated as the first step
r5 . Therefore, they are removed from the candidate list since those two roles
violate the DSOD. Figure 6.5(c) shows the candidate roles after removing
those which violate DSOD. The shaded rows in Figure 6.5(c) represent the
roles discovered by Algorithm 7. Figure 6.5(d) and Figure 6.5(e) show the
UA and P A.
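The preprocessing step can be sketched as a simple filter; representing the conflicts as sets of mutually exclusive permissions is an assumption about the input format.

```python
# DSOD preprocessing: drop every candidate role that contains a full
# conflicting permission set, before role selection runs.

def remove_dsod_violators(candidates, conflicts):
    """candidates: iterable of frozensets (candidate roles).
    conflicts: iterable of sets of mutually exclusive permissions.
    Returns only the conflict-free candidates."""
    conflicts = [frozenset(c) for c in conflicts]
    return [r for r in candidates if not any(c <= r for c in conflicts)]
```

This is safe by construction: no selected role can ever grant a conflicting permission set to a single session.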
strategy which postpones the elimination of conflicting roles until right after
the role discovery. In detail, once the discovered roles are generated, we
identify, among them, the conflicting roles and subsequently, all conflicting
(d) UA
     r1 r4 r10 r11
u1    1  0   0   0
u2    1  1   1   0
u3    1  1   0   1
u4    0  1   1   0

(e) PA
     p1 p2 p3 p4 p5
r1    0  1  0  0  1
r4    1  1  0  0  0
r10   0  0  1  0  0
r11   0  0  0  1  0
users from UA. Then we form a small-scale UP A which consists of all per-
missions, but only the conflicting users identified in the previous step. For the
second time, we discover the roles from the candidate list, from which the
conflicting roles have been removed this time. Then the newly generated UA and
P A are consolidated with their counterparts derived from the first run.
therefore, the discovery effort proves to be futile and the time has been
effectively wasted. (2) In the second run of discovering roles out of the smaller
the conflicting roles before role discovery, but this effort is considered to be
it in postprocessing.
Conservative Strategy
then this permission set probably does not form a role. The reason for using
the word probably is that it is possible that such a role exists but no user
has been assigned to that role. However, in such a case, role mining is not
conflicting roles. This means that there exists at least one permission in
each of DRi and DRj which are in conflict with each other. Without loss of
generality, let’s assume that they are pi ∈ DRi and pj ∈ DRj . Therefore, we
know that there doesn’t exist a user u who is authorized to both DRi and
that the values of both cells {u, pi} and {u, pj } are 1, otherwise, UP A and
the deployed role set are not consistent and the deployed role set has errors.
exists a role Ri in the discovered role set by the MinPert-RMP, such that
pi and pj ∈ Ri . However, we know from above that none of the users are
inate Ri from the discovered set, as this contradicts the fact that Ri is in
identify conflicting roles. Given a set of deployed roles DROLES and a set
of discovered roles ROLES, there might be a few possible relations between
may become multiple ones in ROLES. For example, let us say DR1 and DR2
are two conflicting deployed roles where DR1 = {p1 , p2 , p3 }, and DR2 =
original single SSOD in DROLES turns into two SSODs: one between R1 and
R3 , the other between R2 and R4 . Second, opposite to the previous case, there
might exist multiple SSODs in DROLES that all get merged into one
in ROLES; we will see this if we twist the above example a bit by assuming
that R1 , R2 , R3 , R4 are from DROLES and DR1 and DR2 are in ROLES,
can be easily seen if the roles involved in SSOD in DROLES remain the
same in ROLES.
deployed roles at all. For each of the permission sets involved in a conflict,
we just search ROLES to find all roles each of which contains all those
permissions. We can form a set S1 containing all these roles. Then we do the
same for each of the involved permission sets to get S2 , S3 and so on. Then
each combination consisting of a role from each of such sets will form a new
{p1 , p2 },{p5 }, {p7 , p8 , p9 }, this means that any given user can’t be assigned
R4 = {p7 , p8 , p9 , p10 }. Now, we first form the role set S1 while each role in S1
one role from each set. So we get two SSODs, one involves R1 , R2 , R4 , the
other involves R1 , R3 , R4 .
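The migration just described can be sketched as a cross product over the per-set matches; the conflict-set and role representations below are assumptions about the input format.

```python
# SSOD migration for the case where the conflicting permission sets are
# known: gather, for each set, the discovered roles containing all of its
# permissions (S1, S2, ...), then emit every combination of one role per
# set as a migrated SSOD.
from itertools import product

def migrate_ssod(conflict_sets, roles):
    """conflict_sets: iterable of permission sets; roles: discovered
    roles as frozensets. Returns the migrated SSODs as role tuples."""
    buckets = [[r for r in roles if frozenset(ps) <= r]
               for ps in conflict_sets]
    return list(product(*buckets))
```

For the example above, with one role containing {p1 , p2 }, two roles containing p5 , and one role containing {p7 , p8 , p9 }, the cross product yields exactly the two SSODs described.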
of two steps. We will first form a Candidate Role Set CRS which is composed
of all such discovered roles that each contains at least one permission
from each of the conflicting roles in this SSOD. Then, we run through Candi-
identify all such combinations that, in each of them, the union of permissions
from all constituent roles is a superset of the union of permissions from
DROLES. Given each observed SSOD, Lines 2-3 construct CRS and, one
step further, CRC. In CRS, we incorporate all candidate roles each of which
contains at least one permission from each of the conflicting roles involved
candidate roles, but this is the best we can do in the context of information
insufficiency. Lines 4-8 form an inner for loop, in which each set C in CRC is checked
We use an example to explain the strategy in detail for the case of in-
{p1 , p2 , p3 }, and DR2 = {p4 , p5 , p6 } (note that in the example, we limit the
conflict to two roles for simplicity; the strategy the example
tries to bring home is actually applicable to cases in which more than two
roles are involved). We know no more than the fact that DR1 conflicts with
conflict in permission level. Therefore, any one or more than one of the 49
conflict. (The 49 is derived since each of DR1 and DR2 has 7 non-empty
if the conflicting relation between DR1 and DR2 is migrated over to R1 and
that the SSOD has been migrated onto R1 and R2 from DR1 and DR2 when
all possible permission combinations in DR1 and DR2 will be included in all
subset combinations formed between R1 and R2 , that is, when the union of
there exists a SSOD between R1 and R2 since the union of both would be
This strategy is conservative in the sense that we might not mark all possible
called the liberal strategy. The goal of this strategy is to mark all SSODs in ROLES without overlooking any single one of them. The price we pay for this is that we bring in noisy SSODs, i.e., we generate more SSODs than we should, and some of them are not the correct ones. The major difference between those two strategies is, after building CRS and
that in each of them, the union of permissions from all its constituent roles covers at least one permission from each of the conflicting roles in the original SSOD (rather than being a superset of the union of permissions from all roles in the original SSOD). Therefore, we might notice that, given the same SSOD migration problem, the result set of the liberal strategy is a superset of
permissions from DR1, say p1, and one from DR2, say p4. Therefore, we claim that there is an SSOD among them. However, this might not be correct if p2
Optimization
We can improve the SSOD migration strategy proposed in the previous section.
For the liberal strategy, we know that we include noisy SSODs. We can
Line 5 identifies the sets in CRC such that for each set, the union of permissions from all roles in that set covers at least one permission from each of the conflicting roles in the original SSOD. Second, we add Lines 10-17 for optimization purposes. For the resultant SSODs, the optimization intends to trim off those which are noise. For each user, the process checks the union of permissions in all the roles assigned to this user against the union of permissions in each generated SSOD; if the former contains the latter, the SSOD will be removed.
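The trimming step can be sketched as follows (an illustrative reconstruction with assumed names, not the dissertation's pseudocode): since deployed assignments are taken to be correct, any generated SSOD whose full permission union is already held by a single user must be noise.

```python
def trim_noisy_ssods(ssods, user_roles, role_perms):
    """Drop any generated SSOD whose full permission union is already
    held by some single user: deployed assignments are assumed correct,
    so such an SSOD must be noise."""
    kept = []
    for ssod in ssods:
        ssod_perms = set().union(*(role_perms[r] for r in ssod))
        violated = any(
            ssod_perms <= set().union(*(role_perms[r] for r in assigned))
            for assigned in user_roles.values() if assigned
        )
        if not violated:
            kept.append(ssod)
    return kept

role_perms = {"R1": {"p1", "p2"}, "R2": {"p3"}, "R3": {"p4"}}
user_roles = {"u1": ["R1", "R2"], "u2": ["R3"]}
# u1 already holds p1, p2, p3, so an SSOD {R1, R2} must be noise:
print(trim_noisy_ssods([{"R1", "R2"}, {"R1", "R3"}], user_roles, role_perms))
# keeps only {R1, R3}
```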
(Note that we assume that deployed roles are all correct in the sense that the
has been assigned two roles DR1 = {p1, p2, p3} and DR2 = {p4, p6}. Since
Pragmatic Strategy
all possible new SSODs in ROLES will be identified, even though we pay a price by possibly identifying more than we should, i.e., irrelevant role groups are also marked as conflicting roles. However, they might not be. The
section, we come up with a pragmatic strategy which, for each original SSOD
Another disadvantage comes from the fact that the identified one might not even be the real one. However, we pick it over other candidates since it has higher
each of those roles, we first identify the closest role in ROLES by calculating its similarity with all roles in ROLES. The similarity function has
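One plausible instantiation of such a similarity function (our assumption for illustration; the dissertation does not fix the measure at this point) is the Jaccard coefficient over permission sets:

```python
def jaccard(a, b):
    """Similarity of two roles viewed as permission sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def closest_role(deployed, roles):
    """Pick the role in ROLES most similar to a deployed role."""
    return max(roles, key=lambda name: jaccard(deployed, roles[name]))

roles = {"R1": {"p1", "p2", "p3"}, "R2": {"p4", "p5"}}
print(closest_role({"p1", "p2"}, roles))   # R1
```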
CHAPTER 7
THE ROLE HIERARCHY MINING PROBLEM
Even though the basic concept of role hierarchy is quite standard [46, 47, 69,
83, 70, 15, 1, 16, 17, 45], the structural specifics of building a hierarchy are not
requirements needed by our approach. Note that we use the symbols ≻, ≻≻ and
and each edge e ∈ E represents a direct relation between the two incident
2. ∀ r ∈ ROLES, permissions(r) ≠ ∅
explicitly assigned to it and those that are inherited from its descendants,
and users(r) comprises of the assigned users(r) as well as all the users
that are eligible to play a descendent role (as a result of inheriting the per-
missions). Besides the role ID r, each vertex v also contains users(r) and
permissions(r). For the sake of simplicity, from this point on, we use the
role hierarchy. Each edge e indicates the direct relation between two incident
creation of dummy roles. A dummy role without any permissions can have no
other role in the graph. Isolated roles cannot exist in the hierarchy. This
requirement allows the existence of one and only one super role which has
all of the permissions. This role prevents any role from being disconnected
from the hierarchy. It might create some dummy links between itself and
others which would be isolated otherwise without the existence of the super
role. We restrict the number of such roles to one to keep the maintenance
overhead to a minimum.
Here, we also define another flavor of role hierarchy which we call complete
role hierarchy.
{r1, r2, . . ., rx} ≠ ∅, {(ri, r1), (r1, r2), . . ., (rx−1, rx), (rx, rj)} ⊆ E } or
(ri, rj) ∈ E.
between any pair of roles need to be captured in the role hierarchy, either
where V is the set of vertices and E is the set of edges. The transitive closure
In this section, we formally define the basic Role Hierarchy Mining Prob-
lem(the RHMP) and its two variants, the Role Hierarchy Building Problem
(the RHBP) and Minimal Perturbation Role Hierarchy Mining Problem (the
MinPert-RHMP).
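Completeness of a hierarchy is defined via the transitive closure of its graph. As a concrete illustration (a generic sketch, not code from the dissertation), the closure of a role graph can be computed by simple Warshall-style reachability:

```python
def transitive_closure(vertices, edges):
    """Return all (a, b) pairs such that b is reachable from a."""
    reach = {v: set() for v in vertices}
    for a, b in edges:
        reach[a].add(b)
    for k in vertices:               # Floyd-Warshall over booleans
        for a in vertices:
            if k in reach[a]:
                reach[a] |= reach[k]
    return {(a, b) for a in vertices for b in reach[a]}

# The chain r1 -> r5 -> r4 also puts (r1, r4) in the closure:
V = ["r1", "r5", "r4"]
E = [("r1", "r5"), ("r5", "r4")]
print(transitive_closure(V, E))
```

Two complete hierarchies built over the same role set, such as Figures 7.1(a) and (b), must produce identical closure sets even though their edge sets differ.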
The RHBP asks us to build a hierarchy out of the existing role set, such
number of complete role hierarchies built from a given set of roles could be substantial. These hierarchies could all be correct since they are complete and, therefore, have the same transitive closure. Yet, some are superior to others from the perspective of role administrators, due to the fact that a smaller number of edges implies less maintenance workload for the administrators. Given a deployed role set of roles r1 = {p1, p2, p3, p4}, r2 = {p1}, r3 = {p1, p2}, r4 = {p1, p3}, r5 = {p1, p3, p4}, there could exist multiple role hierarchies which are complete. Figures 7.1.(a) and (b) are two of the complete hierarchies built from this role set. They have the same transitive closure, but Figure 7.1.(a) is superior to Figure 7.1.(b) since it has fewer edges. Actually Figure 7.1.(a) is
Figure 7.1. An example of a set of complete role hierarchies from a given
role set
problem. However, before doing so, since this problem involves creating a
new set of roles, it must be ensured that the set of roles created correctly
The basic idea is to count the difference between the number of original
long as the difference is within δ, the discovered roles, user-role and role-
UP A. However, this does not take role hierarchies into consideration at all.
‖M(UPA) − M(UPAd)‖₁ ≤ δ
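The bound can be checked directly on the boolean matrices; an illustrative sketch (plain Python lists, names assumed):

```python
def l1_difference(UPA, UPA_d):
    """Number of cells where the original and the reconstructed boolean
    user-permission matrices disagree (the L1 norm of their difference)."""
    return sum(
        abs(a - b)
        for row, row_d in zip(UPA, UPA_d)
        for a, b in zip(row, row_d)
    )

UPA   = [[1, 1, 0],
         [0, 1, 1]]
UPA_d = [[1, 0, 0],
         [0, 1, 1]]
delta = 1
print(l1_difference(UPA, UPA_d) <= delta)   # True: only one cell differs
```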
RH (i.e., including both direct and indirect permission mappings due to the
3. { ∃ R ⊆ V | R = {r1, r2, . . ., rx} ≠ ∅, {(ra, r1), (r1, r2), . . ., (rx−1, rx), (rx, rb)} ⊆ E, i ∈ users(ra), j ∈ permissions(rb) }
This concept of mapping assures that, for any cell {i, j} with value of 1
any cell {i, j} with value of 0 in UP Ad , there exists no role in the RH, such
and the discovered roles and role hierarchy. We can now define the minimal
would like to achieve optimality so that the total sum of the number of roles
prefer to strike a balance between getting closer towards optimality (in terms
of the roles and role hierarchy) and towards causing as little disruption to
existing deployed roles to a degree that the quantified disruptions and the
the optimal roles need to be discovered that cause as little disruption to the
When there are no deployed roles, the hierarchy mining must include role
mining as well. Essentially, the optimal set of roles and role hierarchy must
|ROLES| + |E|.
assignment P A, and a complete role hierarchy RH such that both the pair
The above definition is suitable for cases where no existing roles are deployed yet or role mining is still at its initial stage; therefore, abandoning the
7.2.(b) and (d) together show an optimal solution for the RHMP. It generates
3 roles, r1 = {p1, p2, p3, p4, p5, p6}, r2 = {p1, p2, p5, p6}, r3 = {p5, p6}, and
2 edges in the hierarchy. For this UP A, there are no role sets which could
have the sum of the number of roles and the number of edges in the hierarchy
        p1 p2 p3 p4 p5 p6           p1 p2 p3 p4 p5 p6           p1 p2 p3 p4 p5 p6
   u1    1  1  0  0  1  1      r1    1  1  1  1  1  1      r1    1  1  1  1  1  1
   u2    1  1  1  1  1  1      r2    1  1  0  0  1  1      r2    1  1  0  0  0  0
   u3    0  0  0  0  1  1      r3    0  0  0  0  1  1      r3    0  0  0  0  1  1

[Figure 7.2 panels (d) and (e): the hierarchies corresponding to the two role sets above; in (d), r1 = {p1, . . ., p6} ≻ r2 = {p1, p2, p5, p6} ≻ r3 = {p5, p6}.]
less than 5. However, there may exist multiple optimal solutions for a specific
Given a deployed role set r1 = {p1, p2, p3, p4}, r2 = {p1}, r3 = {p1, p2}, r4 = {p1, p3}, r5 = {p1, p3, p4}, there could exist multiple role hierarchies which are complete. Figures 7.1.(a) and (b) are two of the hierarchies built from this role set. They have the same transitive closure, but Figure 7.1.(a) is superior to Figure 7.1.(b) since it has fewer edges. Actually Figure
Section 7.2.1 describes the RH-Builder which aims to address the Role Hier-
from any given set of roles. The completeness property guarantees that
two roles are connected in the hierarchy if and only if they have an inheri-
{r1, r2, . . ., rx} ≠ ∅, {(ri, r1), (r1, r2), . . ., (rx−1, rx), (rx, rj)} ⊆ E } or
(ri, rj) ∈ E.
This observation has two implications. First, it states that any two roles with
it implies that redundancy occurs whenever an indirect path and a direct link between the same pair of roles coexist. Since the relation represented by the direct link is already implicit in the indirect path, for the sake of workload reduction in role management, the direct link can be removed. This is preferable to removing any link from the indirect path, since doing so could
roles in DROLES into the hierarchy which can be fully represented by its
the deployed role set. The basic idea is to check the inheritance relation
of the role r with each direct descendant ri of the super role sr. Based on
/ neither), different operations are performed. After the for loop, r will be
placed at the best position in the current hierarchy. Currently best has two
are inserted, all the inheritance relations associated with r will be represented
the created hierarchy is optimal. Second, when other roles are inserted after
the details:
Line 1 adds the edge (sr, r) into the graph. Note that a super role sr
incorporates all permissions in UP A. This role can ensure that all roles are
connected to the graph by at least linking with the super role if it has no
containment relation with any other roles. Now the containment relation of
r is checked with every direct descendant ri of the super role sr. Lines 3-4
linked with any role in that subtree). Lines 5-8 handle the case where r fully contains ri. This relation means that all permissions in ri are also associated with r. If this is the case, the RH-Builder removes the direct link (sr, ri) and replaces it with the indirect path ((sr, r), (r, ri)). Then the subtree rooted at ri will be flagged to ensure that the roles in it won't be checked: r cannot establish a link with any of those roles because r can reach any of them through a path via ri; therefore, any links between r and them are redundant. The RH-Builder also marks ri as the descendant in case (r, ri) can be
removed if later r is linked with ancestors of ri. Next, if, due to a previous iteration of comparison, there exists a direct link (r, rj) between r and one
the edge (r, rj) should also be replaced by an indirect path via ri. Lines 9-11
ri, since the relation between r and ri is exactly the same as the relation between r and sr. Therefore, the RH-Builder just assigns ri the descendant
this case, the RH-Builder does a breadth-first search for any descendant rj of ri such that r fully contains rj. If one is found, the link (r, rj) will be added into E; accordingly, all edges (r, rk) between r and any direct or indirect descendant
As we can see, the RH-Builder is greedy in the sense that after each comparison between r and any role in the hierarchy, adjustments take place to remove as many edges as possible. Those edges are redundant
paths. Therefore, the RH-Builder avoids the co-existence of a direct link and an indirect path between the same pair of roles. However, under the assumption that the non-redundancy constraint is not violated for each role, there could
roles. That is, two roles could be linked via different paths. By allowing this, we
archy via the RH-Builder from the deployed role set {{p1}, {p1,p2}, {p1,p3}, {p1,p3,p4}, {p1,p2,p3}}. We will insert the roles in the order shown above.
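The net effect of the insertions can be sketched in batch form: link role a directly above role b exactly when a's permission set strictly contains b's with no third role strictly in between. This is our simplification for illustration, not the RH-Builder's actual incremental code (it also ignores the dummy super role and duplicate permission sets):

```python
def build_hierarchy(roles):
    """Hasse diagram of the containment order: edge (a, b) iff a > b
    with no role strictly in between, i.e. the transitive reduction."""
    edges = []
    for a, pa in roles.items():
        for b, pb in roles.items():
            if pb < pa and not any(
                pb < pm < pa for pm in roles.values()
            ):
                edges.append((a, b))
    return edges

roles = {
    "r1": {"p1", "p2", "p3", "p4"},   # contains all permissions here
    "r2": {"p1"},
    "r3": {"p1", "p2"},
    "r4": {"p1", "p3"},
    "r5": {"p1", "p3", "p4"},
}
print(sorted(build_hierarchy(roles)))
# [('r1', 'r3'), ('r1', 'r5'), ('r3', 'r2'), ('r4', 'r2'), ('r5', 'r4')]
```

On the five-role example of Figure 7.1 this yields the five-edge hierarchy with no coexisting direct link and indirect path.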
First, r1 = {p1,p2,p3,p4} is created as the super role. Figure 7.3.(a) shows the insertion of r2 = {p1}. In Figure 7.3.(b), since r3 ⊇ r2, the edge (r1, r2) has been replaced by the new edges (r1, r3), (r3, r2), denoted by directed dotted lines. In Figure 7.3.(c), r4 is inserted; since r4 overlaps r3, it performs the breadth-first search for
has been added into E. Figures 7.3.(d) and (e) show how r5 = {p1,p3,p4} is inserted. In Figure 7.3.(d), r5 is first compared with r3; since they overlap, r5 searches down the subtree rooted at r3 for a containment relation and consequently
and creates edge (r5, r4) since r5 ⊇ r4; meanwhile, (r1, r4) is removed. Then
Next we prove that the role hierarchy built from the RH-Builder is opti-
mal.
[Figure 7.3. Panels (a)-(f): step-by-step construction of the hierarchy as r2 = {p1}, r3 = {p1,p2}, r4 = {p1,p3}, r5 = {p1,p3,p4} and r6 = {p1,p2,p3} are inserted under the super role r1 = {p1,p2,p3,p4}.]
and G′(V, E′) to denote any arbitrarily picked graph which has the same
and G′ have the same transitive closure, there must exist at least one path
(ri, r1), (r1, r2), . . ., (rx−1, rx), (rx, rj)} ⊆ E. Apparently, not all of those edges are in G; otherwise, the coexistence of the edge (ri, rj) and the indirect path
loss of generality, let's assume that e′ = (rm, rn) is the only edge which is in G′
but not in G (the argument will be the same in cases where multiple edges in G′ are missing in G). Since G and G′ have the same transitive closure, there must exist at least one path in G which links rm with rn. Among all
and an indirect path linking ri and rj in G, which is {(ri, r1), (r1, r2), . . .,
If G′ is exactly the same as G, the theorem holds since there only exists one complete role hierarchy. Otherwise, we assume that there exists at least one edge e′ which is in G′ but not in G. Since all edges in G are also in G′, we know that G′ has at least one more edge e′ than G. This proves that G is the transitive reduction of any arbitrary graph which has the same transitive closure as G.
The theorem that the RH-Builder is optimal can also be shown by the
fact that removal of any edge out of the hierarchy built by the RH-Builder
any given edge in G forms the only path between the two incident vertices.
derived subsequently. In the following we prove that there is only one optimal
Theorem 7.2.2 The optimal role hierarchy and the set of deployed roles are
in a one-to-one correspondence.
Let's assume that G1 and G2 are two different optimal hierarchies built from a set of deployed roles DROLES. Then there must exist at least one edge e incident on ri and rj such that e is in one hierarchy but not in the
tween ri and rj. Since (ri, rj) is not an edge in G2, but all optimal hier-
exists R ⊆ ROLES, R = {r1, r2, . . ., rx} ≠ ∅ such that {(ri, r1), (r1, r2), . . ., (rx−1, rx), (rx, rj)} ⊆ E. Of course, not all these edges will be in G1 at the
least one edge in the above list is not in G1. Without loss of generality, let's
not an edge in G1, but all optimal hierarchies need to meet the requirement
direct path between ri and rj since {(ri, r1), (r1, r2), . . ., (rm, r1′), (r1′, r2′), . . ., (r′x−1, rx′), (rx′, rn), . . ., (rx−1, rx), (rx, rj)} ⊆ E. Meanwhile, we know that
implies that G1 is not optimal, which is in conflict with the assumption made
Based on the two theorems above, we know that the RH-Builder will
build one and only one optimal hierarchy with the minimal number of edges.
all of the edges in the current graph at most once. Since the number of edges
each node), the overall complexity of the RH-Builder is O(|V|³) (the worst
In this section, we propose another algorithm called the RH-Miner, for the
case where we have no initial roles. Essentially, the RH-Miner breaks down
the problem into two steps. First, a minimal set of roles is generated using one
a given UP A has been formally defined by Vaidya et al. [101] as the Role
Mining Problem (the RMP) (which is also mapped to the Tiling Databases
Once we generate the minimal set of roles, we can apply the RH-Builder
to it to come up with the hierarchy with the minimal number of edges. Our
One weakness of our algorithm is that it not only serializes the generation of the role set and hierarchy, but also specifies the order in which roles are to be generated
the discovery of roles. Thus the solution reached by our algorithm may not really be optimal.
Example 7 Figure 7.4 provides two solutions to the RHMP given a set of
7.4.(b),(d) and 7.4.(c),(e) show two solutions. Both of them adopt the idea of generating a minimal set of roles first, followed by building the hierarchy with the minimal number of edges. Each solution generates a set of roles; the two role sets are different, but they have the same cardinality of 3 and therefore both are optimal. The hierarchies based on the previously generated roles are also optimal, in the sense that, given the same set of roles, no better solutions are available which can achieve fewer edges. However, we may
see that the solution represented by Figure 7.4.(b),(d) is superior to the other.¹

¹ We count the edges involved with dummy super roles. However, those edges will not affect the conclusion, since even if we don't count them, one solution is still superior to the other.
        p1 p2 p3 p4 p5 p6           p1 p2 p3 p4 p5 p6           p1 p2 p3 p4 p5 p6
   u1    1  1  0  0  1  1      r1    1  1  1  1  1  1      r1    1  1  0  0  1  1
   u2    1  1  1  1  1  1      r2    1  1  0  0  1  1      r2    0  0  1  1  0  0
   u3    0  0  0  0  1  1      r3    0  0  0  0  1  1      r3    0  0  0  0  1  1

[Panels (d) and (e): the corresponding hierarchies; in (e), a dummy super role sr = {p1, . . ., p6} sits above r1 = {p1,p2,p5,p6} and r2 = {p3,p4}.]
Figure 7.4. An example showing the difference between the optimal hierarchy
and the hierarchy generated by the RH-Miner
each subproblem might not achieve the optimality of the RHMP. Part of
our future work is to look for solutions better than the RH-Miner to address
the RHMP.
CHAPTER 8
ROBUSTNESS TO NOISE
The experimental results of the prior chapters show that our algorithm per-
the above experiments was that the access control data is error-free. But this assumption is often unsupported in real-life access control data sets. Indeed, in reality, there are often many errors, and no data is quite clean. Permissions are accidentally assigned to those who do not need them, or are
users may not have all the permissions that others performing similar job functions have. In such cases, the old adage frequently applies – "garbage in, garbage out". We need to ensure that such is not the case for our algorithm
several issues need to be clearly defined. First, what is our model of noise? Second, what is the degree of noise? Third, how is the accuracy of results
First, let us consider how to meaningfully model noise. The presence of
noise essentially means that errors have occurred in the data – i.e., the actual
data does not match the real data. In our case, the data consists of access
representation, one can view the data in the form of a boolean matrix –
denote this as disallowed permission). In this case, error means that the
actual boolean matrix is different from the desired boolean matrix. Three
a 0, but not vice-versa. This could happen when a user is only given a
of permissions for some initial assignments but not the full set he will
It is clear that general noise actually includes both additive noise and
subtractive noise. Thus, the presence of general noise implies the presence
not equal. In actuality, the degree of additive and subtractive noise depends
on the number of 0s and 1s. All else being equal, any general noise will
noise proportional to the number of 1s. Typically, the access control matrix
will be sparse with fewer 1s and many more 0s. Therefore, additive noise will
be more likely to occur. This corresponds to real situations where users are more likely to have more permissions than they need (additive noise) than fewer permissions than they need (subtractive noise), because otherwise they could not perform their job functions. For now, we explore the robustness
ordinary situations. Even though we feel this is likely, this may not be
the case in certain situations. To test for this, in the future, we plan to
Along with the model of noise, we also need to consider the degree of noise.
data. However, how does this truly apply? Take the example of a dataset
with 2000 users and 500 permissions. The total size of the dataset is 1,000,000
bits. 1% of this equates to 10000 bits. Does this mean that 10000 bits should
be flipped? This might seem reasonable except for the fact that the access
this case, flipping 10000 bits would completely obliterate the true data. One
must realize that unlike digital communications, in access control data, the
signal is characterized only by the 1s, and not by the 0s. This implies that
hands out a total of 4000 permissions, is it likely that he would make 10000
Finally, we need to consider how to add noise to the data. Again, assume
2000 users, 500 permissions and 4000 allowed subject-object pairs (4000 1s).
If we wanted to introduce 10% general noise into this data set, how should
we proceed? A simple way to add noise is to pick 400 bits at random from
the 1,000,000 bits and flip them. But is this correct – should exactly 400
bits be flipped? The key issue is whether by 10% noise we mean that the
noise is exactly 10% of the data or whether we mean that the probability
of error in the data is 10%. The first case corresponds to flipping 400 bits.
However, in the second case, we should go through all 1,000,000 bits and flip
each bit with a probability of 10%. Though either way is fine, we argue that
the second way of introducing noise more closely approximates real life.
In our experiments we considered 1%, 5%, 10%, and 20% noise and in-
• if x ∈ UPA, x ∈ UPA′ with probability 1 − p
• if x ∉ UPA, x ∈ UPA′ with probability p
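The flip model above translates directly into code; a small sketch (seeded so runs are repeatable; function and variable names are ours):

```python
import random

def add_noise(UPA, p, seed=0):
    """Flip each cell of the boolean user-permission matrix
    independently with probability p (general noise)."""
    rng = random.Random(seed)
    return [
        [bit ^ 1 if rng.random() < p else bit for bit in row]
        for row in UPA
    ]

UPA = [[1, 1, 0, 0],
       [0, 1, 1, 0]]
noisy = add_noise(UPA, p=0.10)
print(noisy)
```

Because the matrix is typically sparse, most flipped cells are 0s turning into 1s, so this procedure produces mostly additive noise, matching the discussion above.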
The final issue to consider is the way of checking the accuracy of the algorithm. An easy way to do this is simply to report the same accuracy figures
compute how many of the original roles are found in the results. This is fine, except for one caveat. When we match roles, the matches are exact. While this is fine when there is no noise, in the presence of noise there is a good possibility that we may find approximate roles rather than the real roles. In
for now, we restrict ourselves to exact matches and report the results obtained; approximate matching would likely give better results.
We now discuss the actual experimental results concerning noise. For all of the data generated earlier in Section 4.4.1, we generated noisy data and reran the experiments to compute the accuracy. Since the noise addition process was random, this was run three times on each data set. Thus, for each of the five variants of the four datasets for the four experiments, three noisy
for each noise level. Accuracy results were computed and averaged over the
with noise addition of 1%, 5%, 10%, and 20%, thus leading to a total of 240 ∗ 4 = 960 experiments. The results from these experiments are reproduced in the form of eight charts. Essentially, for each of the four experiments carried out in Section 4.4.1, we depict two charts showing the accuracy in the top 1x results and the accuracy in the top 2x results. Each chart contains the results at all 4 noise levels as well as the original no-noise
reference level (from the original experiments). This helps to visually see the
Figures 8.1(a) and 8.1(b) display the effect of noise when the number of permissions and roles is kept constant, and only the number of users is varied. It can be seen that 20% noise significantly degrades the accuracy of the algorithm, while noise up to 10% is still tolerable. The overall results for 2x accuracy are still quite good. Figures 8.2(a) and 8.2(b) display the noise results when only the number of permissions is kept constant while both the number of users and roles are varied. Again, the overall results are encouraging. Accuracy significantly drops off only for 20% noise with the highest number of permissions. The reason for this is the fact that as the data set is sparse, even minor noise can make a big difference when finding
For the third set of experiments, Figures 8.3(a) and 8.3(b) show the effect of noise when the number of users is increased. The results here follow the results observed in the absence of noise. Overall, the results are quite good
For the fourth set of experiments, Figures 8.4(a) and 8.4(b) display the
This is partly due to the fact that as the density of the dataset increases and the number of intersections increases, even minor amounts of noise can make a huge difference when finding exact matches. Again, we expect that
this case.
Overall, the noise results are encouraging and give us more confidence in
[Two charts: Accuracy (percent roles found) vs. Permissions/Roles (Permissions/User) ratio (150 (15), 30 (3), 15 (1.5), 3 (0.3)) at 0%, 1%, 5%, 10%, and 20% noise.]
Figure 8.1. Effect of Increasing Number of Users and Roles (in Constant
Proportion) with Constant Permissions
[Figures 8.2(a),(b): Accuracy (percent roles found) vs. Permissions/Roles (Permissions/User) ratio (1 (0.05), 5 (0.25), 10 (0.5), 20 (1)) at noise levels 0%-20%.]
[Figures 8.3(a),(b): Accuracy (percent roles found) vs. User/Role ratio (2.5, 5, 15, 25) at noise levels 0%-20%.]
[Figures 8.4(a),(b): Accuracy (percent roles found) vs. maximum number of roles per user (2, 3, 5, 10) at noise levels 0%-20%.]
CHAPTER 9
ROLE MINING SYSTEMS
All the RMPs and their variants have significance in solving the role engineering problem and help security administrators to pick and choose the one
the tool called the Role Mining System to facilitate security administrators
The role mining system (RMS) consists of four functional modules: the RMP and Algorithm Selector, the Test Dataset Generator, the Role Mining Processor and the Algorithm Analyzer.
The whole role mining request starts from the RMP Selector component, where a user selects a specific role mining problem and an algorithm. The suite is actually open in the sense that it could be extended to include more role mining problems. The user needs to choose a role mining problem and optionally
[Figure 9.1. Architecture of the RMS: a Web-based GUI connected to the RMP Selector / Algorithm Selector, Test Dataset Generator, Role Mining Processor, and Algorithm Analyzer modules, backed by a testing dataset DB and a testing result DB.]
its variations which are the δ-approx variant and the MinNoise variant. The
that the user can select more than one algorithm at one time. We have
ORCA [87] and all other significant role mining approaches, we also plan to
In this module, the user has the option of either choosing existing data sets provided by various sources, for example, the data set by Ulrike Steffens [87], or randomly generating a data set for testing purposes. For the latter,
which are necessary for the data set to be generated appropriately. On top of that, the module simulates reality by enabling the user to introduce noise into the data set to be generated. That is, the user has options to choose the
Also, the user can indicate the degree of noise as a percentage. The user can also choose the number of times to run for the set of parameters chosen in the previous module. Since the test data generator algorithm is randomized, the
incorporate reasonable semantics into the generated data to make them more
This module carries out the role mining process based on the selected role mining problem, algorithm and test dataset. Note that when the test dataset is randomly generated in the module above, the number of times that each of the chosen algorithms will be run on the dataset is based on the parameter
that the dataset is run when being generated in the Test Dataset Generator
for data generation 10 times, each of the chosen algorithms for role mining will be run on each of the created data sets. Then all results reported for a specific parameter set are averaged over the 10 runs. The results after the processing will be both directly returned to the user and saved in the testing result database for future reference. Once the result is received, the
Algorithm Analyzer
The Algorithm Analyzer module aims to analyze the effect of different fac-
the effect of one parameter at a time by keeping others fixed. Since we have
dependent manner. Therefore, for each of the three features, the user could
parameter(s) the user is interested in. In this module, the user can analyze any combination of the three features. Note that the noise robustness factor is automatically set ON for each of the chosen algorithms if the test data is randomly generated with noise, or the user selects an existing dataset
enable the user to compare the algorithms in different aspects if more than
system, which includes a front-end user interface, a web server in the middle and a database server at the back end. The Web-based user interface is implemented in JSP and HTML. The middle tier interacting with users runs on an IIS Web server. JDBC is used to connect to the database. We used the Oracle 10g DBMS to store the test datasets collected from various sources, and the
leave the remaining modules as future work. Note that we will keep adding new features and algorithms into this system with the further exploration of the role mining paradigm. Figure 9.2(a) shows the login page; the user should sign up first on the Register page shown in Figure 9.2(b) before logging in.
The system prompts the user to sign up since it enables the user to customize the views. In addition, the user account could enable the user to save the running results onto the server for future reference. In the following,
Figure 9.2(c) shows the page for the basic-RMP. Here the user has two ways to provide the data in order to run the basic-RMP: either the real data can be uploaded, or the system can generate a random dataset with the provided parameters for user count and privilege count. The results can be shown
we plan to give the user more options on choosing how the result should be shown. The user may opt for having only the final role sets generated by the
of the algorithms. Alternatively, the user could also have only the first n (say 10) iterations shown in detail for further understanding of how the algorithm functions.
shown in Figure 9.2(d) generates the roles without introducing errors. This
means that discovered roles can exactly represent the UPA, the tab MinNoi-
seV7 shown in Figure 9.2(e) introduces the inexactness in a sense that the
pletely clean, therefore mistakes can always be made, and asking for roles
some sense, it may be possible to find more fundamental roles by only trying
For example, suppose we have 1,000 users and 100 permissions, and the size
of the UPA is 5,000 (i.e., 5,000 user-permission assignments are granted out
of the possible 100,000). Now, suppose 100 roles are required to exactly
match the given user-permission assignment, while under an approximate
match the minimum number of roles required is only 60. As long as we do
not add any spurious permissions (i.e., no extra 1s are added), the second
case is clearly better than the first, since we significantly reduce the number
of roles.
Moreover, an exact match only reflects the current state of the organization.
Permissions and (to a lesser extent) roles are dynamic; thus, while an exact
match may be the best descriptor in the static case, it is probably not good
for the dynamic case, where an approximate match might be preferable.
The implementation employs several pruning techniques; more details can be
found in [22]. (Essentially, the number of candidate roles is exponential in
the worst case.) Thus, if the pruning strategies do not work, the cost can be
prohibitive. The main difference between our algorithm and the solution to
the Database Tiling Problem is that ours is based on the FastMiner
algorithm, which can significantly cut down on the cost while still retaining
very good accuracy.
Figure 9.2: Screenshots of the system: (a) index (login) page, (b) register
page, (c) Basic-RMP page, (d) MinNoiseV6 page, (e) MinNoiseV7 page.
CHAPTER 10
SUMMARY AND FUTURE RESEARCH
Role engineering is considered essential before all the benefits of RBAC can
be realized. On the other hand, it is one of the major challenges and
costliest parts of deploying RBAC, and the extensive application of RBAC
calls for efficient solutions to role mining.
10.1 Contributions
A major limitation of these works is that they lack an integrative view of
the entire set of roles when justifying the roles identified. Therefore, no
criteria can be used to evaluate those works. This dissertation pioneered this
area and filled this gap in the following aspects:
First, each role mining problem under investigation has a different objective,
one that is both meaningful and defined with respect to the entire collection
of roles rather than any single role. We are the first to introduce such
objectives into role mining problems: to the best of our knowledge, this does
not exist in previous research in the role mining paradigm (even from the
perspective of a single role), even though the extensive application of RBAC
urgently calls for associating meaningful and diverse objectives with the role
mining problem in order to meet specific organizational needs.
We formulated each problem with a suitable objective. The role mining
problems under investigation include, but are not limited to, the Basic-RMP,
the Edge-RMP, the MinPert-RMP, and the RHMP. For each major problem,
we also considered different variations which we feel have strong pragmatic
significance; for example, the δ-approx Basic-RMP and the MinNoise
Basic-RMP are variants of the Basic-RMP. Therefore, it is almost
impractical to enumerate all of them. The fact that the problems treated in
the dissertation are typical and significant does not mean that the number of
role mining problems is limited; role mining is rather an extensible and
open-ended research area in which new problems and objectives will
continue to emerge.
Second, we formulate the role mining problem as a binary matrix
decomposition under a suitable optimization criterion. We believe we are the
first to formulate the role mining problem this way. Traditional clustering
approaches have the drawback that a permission can associate with at most
one role, whereas in RBAC:
• A user may play several different roles, and the user may have a certain
permission due to more than one of those roles (since multiple roles may
contain the same permission).
Modeling this is essential for role mining, since ignoring it would make the
job easier but may result in a more inaccurate set of roles. This also
differentiates our solutions from the traditional clustering ones.
Third, we analyzed the theoretical complexity of each role mining problem.
This is critical, since we need to use the complexity results as a guide to
search for a solution more efficiently; i.e., we do not want to fall into the
swamp of seeking an exact solution to an intractable problem.
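To make the decomposition view concrete, here is a minimal sketch (our own illustration, not the dissertation's code) of the Boolean matrix product UPA = UA ⊗ PA. Note how permission 1 sits in both roles and user 0 obtains it through both roles at once, which ordinary clustering cannot express:

```python
# Boolean matrix product: UPA[i][k] = OR over j of (UA[i][j] AND PA[j][k]).
def bool_product(ua, pa):
    n_perms = len(pa[0])
    return [[any(ua[i][j] and pa[j][k] for j in range(len(pa)))
             for k in range(n_perms)]
            for i in range(len(ua))]

# Hypothetical example: 3 users, 2 roles, 4 permissions.
UA = [[1, 1],   # user 0 plays both roles
      [1, 0],
      [0, 1]]
PA = [[1, 1, 0, 0],   # role 0's permissions
      [0, 1, 1, 1]]   # note: permission 1 belongs to BOTH roles

UPA = bool_product(UA, PA)
# User 0 holds permission 1 through role 0 and role 1; under ordinary
# arithmetic the entry would be 2, but the Boolean product keeps it at 1.
print([[int(v) for v in row] for row in UPA])
```

Under the Boolean semantics, overlapping roles reconstruct the UPA exactly, which is why the decomposition (rather than a partition) is the right model here.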
Fourth, for some role mining problems, instead of "reinventing the wheel"
and constructing a new solution that is no more efficient than existing ones,
we reused what has already been proved to work. That is, we investigated
the relation of certain role mining problems with problems already identified,
studied, and solved in the data mining and data analysis literature. These
role mining problems turn out to be equivalent to those solved problems, so
the existing solutions can be directly applied to ours. This serves as one of
the novel aspects of the dissertation. In particular, we relate variants of the
Role Mining Problem to the Database Tiling problem and the Discrete
Basis problem, so that their existing solutions can be directly applied.
In addition, we propose a candidate-role-enumeration-based role mining
process, which has three major advantages compared to earlier role mining
proposals. (1) The first step mines candidate roles. A complete subset
enumeration is exponential in the worst case, so if pruning strategies do not
work the cost is prohibitive; our FastMiner algorithm instead generates
candidates by intersecting all unique user pairs. The results show that
FastMiner can significantly cut down on the cost while still retaining very
good accuracy. (2) We select the final roles from among these candidates.
For this selection, a greedy heuristic is used: the best candidate role is
repeatedly selected from the remaining candidate roles until the original
UPA is covered. For each remaining candidate role, we compute the
uncovered area of that role, i.e., the number of cells in M(UPA) that are not
already covered by any of the roles in ROLES (the set of roles selected so
far).
(3) The same framework works for the δ-approx RMP. Instead of
terminating when the UPA is completely reconstituted, the algorithm for the
δ-approx RMP stops as soon as the uncovered area drops to δ or below; the
Basic-RMP is then simply the special case of the δ-approx RMP with δ = 0.
Essentially, a single algorithm can serve both RMP variants by simply
setting a parameter. We believe this is significant in that one can implement
one single algorithm and tune it to obtain the results of different RMP
variants. This allows security administrators to pick and choose the RMP
variant that is applicable to their situation.
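The two-phase process can be sketched roughly as follows (hypothetical code: the candidate generation by pairwise intersection is FastMiner-style, the greedy loop uses the uncovered-area metric, and all names and the tiny UPA are ours):

```python
def mine_candidates(upa):
    """FastMiner-style candidate roles: each user's permission set plus the
    intersection of every unique pair of users' permission sets."""
    rows = [frozenset(p for p, v in enumerate(row) if v) for row in upa]
    cands = set(rows)
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            inter = rows[i] & rows[j]
            if inter:
                cands.add(inter)
    return cands

def greedy_select(upa, delta=0):
    """Repeatedly pick the candidate with the largest uncovered area until
    at most `delta` cells of the UPA remain uncovered (delta=0: Basic-RMP)."""
    uncovered = {(u, p) for u, row in enumerate(upa)
                 for p, v in enumerate(row) if v}
    cands = mine_candidates(upa)
    roles = []

    def area(role):
        # cells the role would newly cover, over users holding all its perms
        return sum(1 for u in range(len(upa))
                   if all(upa[u][p] for p in role)
                   for p in role if (u, p) in uncovered)

    while len(uncovered) > delta:
        best = max(cands, key=area)
        if area(best) == 0:
            break
        roles.append(best)
        for u in range(len(upa)):
            if all(upa[u][p] for p in best):
                uncovered -= {(u, p) for p in best}
    return roles, uncovered

UPA = [[1, 1, 0, 0],
       [1, 1, 1, 1],
       [0, 1, 1, 1]]
roles, left = greedy_select(UPA, delta=0)
print(len(roles), len(left))   # 2 roles, nothing left uncovered
```

Setting `delta=0` reproduces the Basic-RMP behaviour described above; any larger `delta` yields the δ-approx variant from the very same loop.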
Finally, we propose a noise model that helps in evaluating the algorithms in
the presence of noise. In practice, permissions are often assigned to those
who do not need them, or are not revoked once a permission is no longer
needed. In addition, some users may not have all the permissions that others
performing similar job functions have. Since noise of this kind is
unavoidable, an approximate variant of the RMP could use the noise to its
benefit when discovering roles and may produce more fundamental roles.
We also propose the concept of noise degree, and define the noise robustness
of the mining algorithms with respect to it.
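One plausible way to realize such a noise model in experiments (a sketch under our own reading; the dissertation's exact definition of noise degree may differ) is to flip a chosen fraction of the UPA's cells, modeling both spurious and missing assignments:

```python
import random

def add_noise(upa, degree, seed=0):
    """Flip a `degree` fraction of UPA cells: models both permissions granted
    to users who do not need them and assignments missing for similar users."""
    rng = random.Random(seed)
    cells = [(u, p) for u in range(len(upa)) for p in range(len(upa[0]))]
    flipped = rng.sample(cells, int(degree * len(cells)))
    noisy = [row[:] for row in upa]
    for u, p in flipped:
        noisy[u][p] = 1 - noisy[u][p]   # grant removed or spuriously added
    return noisy

clean = [[1, 1, 0, 0],
         [1, 1, 1, 1],
         [0, 1, 1, 1]]
noisy = add_noise(clean, degree=0.25)       # noise degree of 25%
flips = sum(noisy[u][p] != clean[u][p]
            for u in range(3) for p in range(4))
print(flips)   # exactly 3 of the 12 cells differ
```

An algorithm's noise robustness can then be assessed by mining roles from `noisy` and comparing the reconstruction against `clean` as the degree grows.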
10.2 Future Research
We have related the role mining problem to two problems in the area of
data mining and data analysis, the database tiling problem and the discrete
basis problem, and existing solutions for these problems can be directly
applied to solve the Basic-RMP and the MinNoise RMP. Moreover,
computing the Schein (Boolean) rank of a matrix is essentially the same
problem as the Basic-RMP. Other properties of the Boolean rank have also
been studied [9]. It would be interesting to investigate what other results
are directly applicable to our problem and see
if they offer new insight into our domain. Bipartite graphs and bicliques are
another way of defining the RMP and its variants. Several papers have
looked at different variants of this (e.g., [41] and [80]), though most
concentrate on related but different problems. We still need to see which
solutions among this work can be utilized for our problem. Moreover, most
role mining approaches employ a quite rudimentary criterion (e.g., one
based on user counts) to decide which candidate roles are most useful. This
does not take into consideration any semantics or other contextual
information.
Other future work includes using the semantics associated with permissions.
The only data set available to us had no semantics, so our process found
roles from the assignments alone; semantics could be used to further refine
the roles. Permissions that are semantically related could indicate whether a
discovered candidate is actually a role or a blend of two or more roles, and
semantics would also be helpful in interpreting the discovered roles. We
have also studied the approximate role mining problem (also known as the
discrete basis problem), which is the problem of finding the roles that can
best approximate the original user-permission assignment. Our
experimental results demonstrate that our proposed solution outperforms
the prior solution for the DBP. In the future, we plan to improve our
algorithm in two ways. First, the final association between users and
permissions can be done in a better fashion, to ensure that using more roles
never leads to a worse approximation. Second, we could associate users with
candidate roles right at the start, and then correspondingly choose the final
roles; this is likely to give much better results.
Recently, Lu et al. [60] presented a unified framework for modeling the
optimal binary matrix decomposition and its variants using binary integer
programming. Such modeling allows them to directly adopt the huge body
of heuristic solutions and tools developed for binary integer programming.
Our future research effort will be to compare the efficiency of our algorithm
with theirs. Existing solutions from other domains can likewise be borrowed
and directly applied to solve the Basic-RMP and the Edge-RMP; in the
future, we will explore other domains which could be similarly related to
role mining, and study the feasibility of utilizing their solutions for our
problem. In addition, there is more work to be done on our
algorithm for the minimal perturbation RMP. Our experimental results
indicate that the minimal perturbation RMP is a good way of balancing the
deployed roles versus the optimal roles. It is important to note that, in
many cases, while migrating to a new set of roles, one should be able to
specify a desired set of unchanged parameters; these could include certain
user-role or role-permission assignments. Currently, the minimal
perturbation RMP only finds the new set of roles, but does not provide a
mapping between the existing set of roles and the new set; providing such a
mapping would involve additional work.
Our current approach towards constructing a role hierarchy is that once we
generate the minimal set of roles, we build the hierarchy over them with the
minimal number of edges. However, this algorithm essentially breaks one
problem into two sequential steps, which not only serializes the generation
of the role set and the hierarchy, but also creates extra constraints and
overlooks the possibility that the hierarchy could be created simultaneously
with the discovery of roles. Thus the solution reached may be suboptimal;
future work will be to find a better solution which removes these
constraints. We also need to set up standards against which a role hierarchy
can be evaluated, so that two role hierarchies built out of the same UPA can
be compared. Exploiting the semantics of the objects and their attributes
will also help to discover more meaningful hierarchies. We have evaluated
the solutions proposed in this dissertation using both simulated and real
data sets, and we will also develop solutions for the MinPert RHMP, which
is likely to be of the most use to organizations.
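For reference, the second step of the current two-step approach, building a hierarchy with the minimal number of edges over an already-mined role set, can be sketched as the transitive reduction (Hasse diagram) of the subset order; the code and names below are our own illustration, assuming roles are plain permission sets:

```python
def build_hierarchy(roles):
    """Hasse diagram of the subset order: an edge (r, s) means role r is
    directly senior to role s (r strictly contains s, nothing in between).
    This is the minimal edge set whose transitive closure gives the order."""
    edges = []
    for r in roles:
        for s in roles:
            # keep r -> s only if no intermediate role t with s < t < r
            if s < r and not any(s < t < r for t in roles):
                edges.append((r, s))
    return edges

roles = [frozenset({1}), frozenset({1, 2}),
         frozenset({1, 2, 3}), frozenset({4})]
for senior, junior in build_hierarchy(roles):
    print(sorted(senior), '->', sorted(junior))
```

The serialization problem discussed above shows up here directly: `build_hierarchy` can only connect roles it is given, so any hierarchy-aware choice of the roles themselves is already foreclosed.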
REFERENCES
2006.
16:726–750, 2006.
for Boolean matrices and their applications, pages 123–133. Springer-
Verlag.
[10] Fred H. Cate. Privacy interests: Security, privacy, and the role of law.
[14] Gérard Cornuéjols. Valid inequalities for mixed integer linear programs.
cess control models and technologies, pages 145–154, New York, NY,
and communications security, pages 85–92, New York, NY, USA, 2003.
ACM.
control models and technologies, pages 228–236, New York, NY, USA,
2006. ACM.
[19] Alina Ene, William Horne, Nikola Milosavljevic, Prasad Rao, Robert
2001.
[21] Pete Epstein and Ravi Sandhu. Towards a UML based approach to role
Role-based access control, pages 135–143, New York, NY, USA, 1999.
ACM.
Springer-Verlag.
642, 2003.
125, 1997.
[27] David Ferraiolo, Rick Kuhn, and Ravi Sandhu. RBAC standard ratio-
Access control models and technologies, pages 142–143, New York, NY,
1996. ACM.
[32] Mario Frank, Andreas P. Streich, David Basin, and Joachim M. Buh-
cations security, pages 101–111, New York, NY, USA, 2009. ACM.
discovery and data mining, pages 73–83, New York, NY, USA, 1999.
ACM Press.
1979.
edge management, pages 748–757, New York, NY, USA, 2006. ACM.
[38] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Rock: A robust
25(5):345–366, 2000.
[39] Tarik Hadzic and J. N. Hooker. Postoptimality analysis for integer pro-
[40] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns with-
[42] Ian Molloy, Ninghui Li, Yuan (Alan) Qi, Jorge Lobo, and Luke Dickens.
Mining roles with noisy data. In SACMAT ’10: Proceedings of the 15th
[43] Tao Jiang and Bala Ravikumar. Minimal NFA problems are hard. In
1991. Springer-Verlag.
[44] Jinwei Hu, Yan Zhang, Ruixuan Li, and Zhengding Lu. Role updat-
[46] James B D Joshi, Elisa Bertino, and Arif Ghafoor. Temporal hier-
[47] James B. D. Joshi, Elisa Bertino, Arif Ghafoor, and Yue Zhang. Formal
2002.
[49] Axel Kern, Martin Kuhlmann, Andreas Schaad, and Jonathan Moffett.
els and technologies, pages 3–11, New York, NY, USA, 2003. ACM.
[52] Matthias Koppe and Robert Weismantel. An algorithm for mixed in-
69, 2006.
[54] Martin Kuhlmann, Dalia Shohat, and Gerhard Schimpf. Role mining
[55] Ninghui Li, Ji-Won Byun, and Elisa Bertino. A critique of the ANSI
[58] Tao Li. A general model for clustering binary data. In KDD ’05:
Yongxi Cheng, Yingchao Zhao, and Jing Zhang. Core role-based ac-
[61] Haibing Lu, Jaideep Vaidya, Vijayalakshmi Atluri, and Yuan Hong.
[62] Emil Lupu and Morris Sloman. Reconciling role based management
and role based access control. In RBAC ’97: Proceedings of the second
[64] Mario Frank, Joachim M. Buhmann, and David Basin. On the defini-
[67] Pauli Miettinen. The discrete basis problem. Master's thesis.
[70] Jonathan D. Moffett and Emil C. Lupu. The uses of role hierarchies in
on Role-based access control, pages 153–160, New York, NY, USA, 1999.
ACM.
[71] Ian Molloy, Hong Chen, Tiancheng Li, Qihua Wang, Ninghui Li, Elisa
Bertino, Seraphin Calo, and Jorge Lobo. Mining roles with semantic
on Access control models and technologies, pages 21–30, New York, NY,
[72] Ian Molloy, Ninghui Li, Tiancheng Li, Ziqing Mao, Qihua Wang, and
on Role-based access control, pages 39–44, New York, NY, USA, 2000.
ACM.
[76] Qun Ni, Elisa Bertino, Jorge Lobo, and Seraphin B. Calo. Access
43, 2009.
[78] Feng Pan, Xiang Zhang, and Wei Wang. Crd: fast co-clustering on
2008. ACM.
[79] Joon S. Park, Ravi Sandhu, and Gail-Joon Ahn. Role-based access
control on the web. ACM Trans. Inf. Syst. Secur., 4(1):37–71, 2001.
2000.
[84] Ravi Sandhu, David Ferraiolo, and Richard Kuhn. The nist model for
[85] Ravi S. Sandhu et al. Role-based access control models. IEEE Computer, 29(2):38–47, 1996.
[87] Jürgen Schlegelmilch and Ulrike Steffens. Role mining with ORCA. In
[88] Bao-Hong Shen, Shuiwang Ji, and Jieping Ye. Mining discrete patterns
data mining, pages 757–766, New York, NY, USA, 2009. ACM.
[89] Dongwan Shin, Gail-Joon Ahn, Sangrae Cho, and Seunghun Jin. On
els and technologies, pages 169–178, New York, NY, USA, 2003. ACM.
[90] Dongwan Shin, Gail-Joon Ahn, Sangrae Cho, and Seunghun Jin. On
[91] Ajit P. Singh and Geoffrey J. Gordon. Relational learning via collec-
[92] Srinath Sridhar, Fumei Lam, Guy E. Blelloch, R. Ravi, and Russell
man. Efficient policy analysis for administrative role based access con-
[95] Andreas P. Streich, Mario Frank, David Basin, and Joachim M. Buh-
[101] Jaideep Vaidya, Vijay Atluri, and Qi Guo. The role mining problem:
[102] Jaideep Vaidya, Vijayalakshmi Atluri, and Qi Guo. The role mining
on Computer Science, page 364, New York, NY, USA, 1987. ACM.
on Access control models and technologies, New York, NY, USA, 2010.
ACM.
VITA
Qi Guo