A Logical Model for Multiversion Data
Warehouses
Waqas Ahmed1,⋆ , Esteban Zimányi1 , and Robert Wrembel2,⋆⋆
1
Dept. of Computer & Decision Engineering (CoDE),
Université Libre de Bruxelles, Belgium
{waqas.ahmed,ezimanyi}@ulb.ac.be
2
Institute of Computing Science,
Poznań University of Technology, Poland
[email protected]
Abstract. Data warehouse systems integrate data from heterogeneous
sources. These sources are autonomous in nature and change independently of a data warehouse. Owing to changes in data sources, the content
and the schema of a data warehouse may need to be changed for accurate
decision making. Slowly changing dimensions and temporal data warehouses are the available solutions to manage changes in the content of
the data warehouse. Multiversion data warehouses are capable of managing changes in the content and the structure simultaneously however,
they are relatively complex and not easy to implement. In this paper, we
present a logical model of a multiversion data warehouse which is capable of handling schema changes independently of changes in the content.
We also introduce a new hybrid table version approach to implement the
multiversion data warehouse.
1
Introduction
A data warehouse (DW) is a repository of historical, subject-oriented, and heterogeneous data that is integrated from external data sources (EDSs). An inherent feature of EDSs is that they are not static and their schema may change
as a result of adaptation of new technologies, changes in the modeled reality, or
changes in the business requirements. As a result of schema changes in EDSs,
DWs may become obsolete and thus need to be redesigned. Often, after applying
schema changes, users demand to preserve the existing content and the schema
in a DW. The Multiversion data warehouse (MVDW) is an available solution
to manage the schema changes in a DW. This solution is based on the schema
versioning approach where every change produces a new DW version and the
old content and the schema are also kept available.
The content of a DW changes as a result of periodic loading of new data into
it. One particular scenario that requires consideration is changes in the states of
⋆
⋆⋆
This research is funded by the Erasmus Mundus Joint Doctorate IT4BI-DC.
The Polish National Science Center (NCN), grant No. 2011/01/B/ST6/05169.
L. Bellatreche and M.K. Mohania (Eds.): DaWaK 2014, LNCS 8646, pp. 23–34, 2014.
c Springer International Publishing Switzerland 2014
24
W. Ahmed, E. Zimányi, and R. Wrembel
existing dimension members. For analysis purposes, it may be important for the
user to keep the history of changes in the states of members. Slowly changing
dimensions [8] and temporal data warehouses [7] are the solutions to manage
changes in the content of a DW. These solutions maintain the content history
by associating timestamps with the values. Though these solutions partially
solve the problem of managing DW content changes, they are unable to separate
different DW states that describe different real world scenarios.
Most of the available MVDW proposals try to solve the issues of content
and schema changes simultaneously which makes it complicated to understand
and implement these proposals. Some solutions for MVDW present metamodels
to maintain metadata supporting the life cycle of a DW. These metamodels
store DW versions in separate physical structures. Creation and maintenance of
these separate data structures makes a MVDW complex and may also negatively
impact the query performance. If the user only needs to manage the schema
changes in a DW then a simplified model can serve this purpose.
We envision a MVDW in which content and structure change functionality can
be implemented as independent plug-ins. In this paper we propose a model of the
MVDW that supports structural changes in dimensions and facts. The proposed
model is relatively simple and helps to identify the desired features of schema
versioning. We also introduce a new hybrid table version (HTV) approach to
implement the MVDW. This paper is organized as follows: Section 2 discusses
the approaches related to handling the evolution of DWs. Section 3 presents a
running example and defines the requirements for the MVDW. Section 4 presents
a model of the MVDW while Sect. 5 proposes an approach to implement it.
Finally, Sect. 6 concludes the paper by providing a summary and considerations
for future research.
2
Related Work
In [7], the proposals for managing DW evolutions are classified into three broad
categories: schema modification, schema evolution, and schema versioning. The
systems that allow schema modification support changes in a data source schema
but as a result of these changes, the existing data may become unavailable.
Schema evolution is supported when a system has capability of accommodating
schema changes while preserving existing data. Schema versioning is a mechanism through which systems store data using multiple schema versions.
Most of the research related to schema versioning deals with the issues of
managing schema and content changes at the same time and therefore, presents
solutions to manage schema versioning for temporal databases. Brahmia et al. [3]
presented one such schema version management mechanism for multi-temporal
databases. The proposed approach however, does not provide a generic model
to manage schema versioning and is only specific to relational databases. The
model presented in [5] also supports changes in both the content and structure
of data. This model supports structural changes to the dimension members only
and does not explain how to map the members from one version to another.
A Logical Model for Multiversion Data Warehouses
25
In [1], the authors presented a formal model of a MVDW. This model supports the real schema versions, which represent the changes in the real world,
and also provides the capability of deriving alternative schema versions. The alternative schema versions can be used for what-if analysis . In [12], the authors
presented a logical model for the implementation of a MVDW and discussed
various constraints to maintain data integrity across DW versions. They used
the work presented in [13] to query data from multiple versions of a DW. [10] is
a prototype implementation which manages schema versions by creating graph
based metamodels. An augmented schema [6] is also created to perform data
transformations among different schema versions.
The aforementioned approaches to MVDWs track the history of content and
schema changes and can query multiple DW versions but they maintain DW
versions in separate structures. The creation and maintenance of these structures is relatively complex. Further, the approaches to handling DW evolution
either manage changes in the content only [9], changes in the schema only [2],
or changes in the content and schema simultaneously [12]. Data warehouse versioning approaches support both changes in the content and the schema at the
same time but none of the existing versioning approaches deals with issues of
schema and content evolution independently of each other.
Three approaches may be used to convert a database structure after applying
schema changes: single table version (STV), multiple table version (MTV), and
partial multiple table version (PMTV) [11]. A disadvantage of the STV approach
is the null space overhead: null values are introduced as a result of both the addition and the deletion of attributes. There is no null space overhead when using
the MTV approach as the deleted attributes are dropped from the new version.
However, multiple table versions increase the maintenance overhead. Also, running a query that spans multiple table versions requires data adaptations and
table joins which negatively affect the query performance. The PMTV approach
also requires joins to construct the complete schema of the table. Joins are costly
operations and may have a negative impact on the system performance.
3
Multiversion Data Warehouses
A multiversion data warehouse is a collection of its versions. Each DW version
is composed of a schema version and an instance version. A schema version
describes the structure of the data within a time period. The instance version
represents the data that is stored using a particular schema version. We assume
that at a given time instant, only one DW version is current and used to store
data. A new version can be derived by applying changes to the current version
only. To represent the period for which a version was used to store data, each
DW version is assigned a closed-open time interval [9], represented by begin application time (BAT) and end application time (EAT). The EAT for the current
version is set to UC (until-changed). It is possible to create alternative schema
versions [12] using the model presented in Sect. 4 but for simplicity’s sake, we
do not consider the branching versioning model. Figure 1 shows an example of
multiple DW versions.
26
W. Ahmed, E. Zimányi, and R. Wrembel
V0
V1
t0
T0
V2
t1
T1
Time
t2
T2
UC
Fig. 1. Multiple versions of a data warehouse
Suppose that the initial version V0 was created at time T0 . The BAT and
EAT for V0 are set to T0 and U C, respectively. Figure 2a shows the DW schema
in version V0 which contains three dimensions: Time, Product, and Geography.
The Geography dimension has a hierarchy, which consists of three levels: Store,
City, and Region. Fact Sales relates these dimensions and contains measures
Quantity and Amount. The granularity of dimension Geography is at level Store.
Consider that the following schema changes were applied to the DW. (1) At
time T1 , attribute Manager was added to and attribute Area was deleted from
level Store. As a result, version V1 [T1 , U C) was created and the EAT of V0 was
changed from U C to T1 . Figure 2b shows the schema of level Store in V1 . (2) At
time T2 , level Store was deleted and new level State was added between levels
City and Region. This schema modification changed the granularity of dimension
Geography and thenceforth all fact members were assigned to a city. Moreover,
level City rolled-up to level State. Since V1 was the current version at that time,
version V2 [T2 , U C) was derived from V1 and the EAT of V1 was changed from
U C to T 1. Figure 2c shows the schema of V2 .
Changes in dimensions and/or facts create a new DW version. The possible
changes to a dimension include (1) adding a new attribute to a level, (2) deleting
an attribute from a level, (3) changing the domain of an attribute, (4) adding a
level to a hierarchy, and (5) deleting a level from a hierarchy. The schema of a
fact changes when (1) a new dimension is added, (2) a dimension is deleted, (3)
a measure is added, (4) a measure is deleted, and (5) the domain of a measure
is changed. In certain cases, a schema change in the dimension also requires
changes in the fact. For example, adding a new level into a hierarchy at the
lowest granularity will also require a new attribute to be added to the fact which
will link the new level to the fact. Similarly, in case of removing the lowest level
from a hierarchy, an attribute is removed from the fact to unlink the deleted
level and a new attribute is added into it to link the next level of the hierarchy.
Thus, effects of adding and deleting the lowest level of a hierarchy on the DW
schema are similar to adding and deleting a new dimension to the DW.
It is worth mentioning that since our versioning model deals with the structural changes and not with the temporal evolution of facts and dimension members, at any instant t, all facts and dimension members in the MVDW are valid.
To summarize the requirements of a multiversion data warehouse, we can say
that (1) in a MVDW, new data is loaded using current version only; (2) a MVDW
must retain all the data loaded into it throughout its lifespan; and (3) all the data
stored in the MVDW must be viewable using any MVDW schema version.
A Logical Model for Multiversion Data Warehouses
Time
TimeKey
Day
Week
Month
Year
Store
Sales
Product
StoreKey
ProductKey
TimeKey
Quantity
Amount
ProductKey
Name
Category
Packaging
Supplier
City
StoreKey
Name
Address
Area
CityKey
Store
RegionKey
Name
Representative
(a) DW schema in version V0
Sales
CityKey
ProductKey
TimeKey
Quantity
Amount
StoreKey
Name
Address
Manager
CityKey
Region
CityKey
Name
Representative
RegionKey
City
CityKey
Name
Representative
StateKey
27
(b) Schema of level
Store in version V1
State
StateKey
Name
Population
RegionKey
Region
RegionKey
Name
Representative
(c) DW schema in version V2
Fig. 2. Multiple schema versions of the running example
The first requirement is straightforward. We elaborate the second and third
requirements with the help of examples. The initial state of level Store in version
V0 at an instant t0 in [T0 , T1 ) is shown in Fig. 3a. As a result of the first schema
change, version V1 is derived from V0 . Figure 3b shows the schema of level Store in
V1 . The shaded column represents the deleted attribute. The second requirement
is that when deriving version V1 from V0 , for the existing store members, the
DW must retain the values of the deleted attribute Area. Suppose that the user
decides to add a new store member s4 at an instant t1 in [T1 ,T2 ). The third
requirement states that this newly added member must also be available in
version V0 . So, when accessing the store members using version V0 , the user
should be able to access all the members stored in V0 and V1 . As attribute
Manager does not exist in V0 , it is not available in V0 at t1 . This situation is
shown in Fig. 3c where shaded and crossed out cells represent the unavailable
attribute. Similarly, attribute Area does not exist in V1 therefore, the value of
attribute Area for s4 is not available in V0 .
As a result of the second schema change, version V2 is derived from V1 . In V2
both the dimensions and the fact are affected. Figure 3d shows the state of fact
Sales in V0 . After second schema change, the fact members are assigned to level
City and that is why attribute StoreKey is deleted and CityKey is added to the
new version of fact Sales. Since in previous versions level Store rolled-up to level
28
W. Ahmed, E. Zimányi, and R. Wrembel
Store
City
Key Name Address Area Key
s1
s2
s3
Store1
Store2
Store3
ABC
DEF
HIJ
20
30
50
c1
c1
c2
(a) Store at t0 in version V0
Store
City
Key Name Address Area Manager Key
s1
s2
s3
s4
Store1
Store2
Store3
Store4
ABC
DEF
HIJ
KLM
20
30
50
Store
City
Key Name Address Area Manager Key
c1
c1
c2
c2
John
s1
s2
s3
s4
(b) Store at t1 in version V1
Store1
Store2
Store3
Store4
ABC
DEF
HIJ
KLM
20
30
50
John
c1
c1
c2
c2
(c) Store at t1 in version V0
Store Product Time
Key
Key
Key Quantity Amount
s1
s2
s3
s4
p1
p1
p2
p2
t1
t1
t2
t3
5
3
2
3
20
15
18
9
(d) Sales at t0 in version V0
Store City Product Time
Key
Key Quantity Amount
Key Key
s1
s2
s3
s4
c1
c2
c2
c3
p1
p1
p1
p2
p2
p1
t1
t1
t1
t2
t3
t4
5
3
8
2
3
2
Store City Product Time
Key Key
Key
Key Quantity Amount
20
15
35
18
9
25
s1
s2
s3
s4
(e) Sales at t2 in version V2
c1
c2
c2
c3
p1
p1
p1
p2
p2
p1
t1
t1
t1
t2
t3
t4
5
3
8
2
3
2
20
15
35
18
9
25
(f) Sales at t2 in version V0
City
Key Name Representative RegionKey
c1 Brussels
John
r1
c2
Liège
Doe
r2
c3 Charleroi
Ahmed
r2
City
State Region
Key Name Representative Key
Key
c1 Brussels
John
e1
r1
c2
Liège
Doe
e2
r2
c3 Charleroi
Ahmed
e3
r2
(g) City at t0 in version V0
(h) City at t2 in version V2
City
Key
Name
c1
c2
c3
Brussels
Liège
Charleroi
State Region
Representative Key
Key
John
Doe
Ahmed
e1
e2
e3
r1
r2
r2
(i) City at t2 in version V0
Fig. 3. Contents of the DW in multiple versions of the running example
City, it is possible to obtain the values of CityKey for the existing fact members.
While doing so, the fact members belonging to the stores that are located in
the same city should be combined provided that the other key attributes are the
same. One such example is shown in Fig. 3e where the first two fact members
belong to stores s1 and s2. These stores are located in the same city c1 and
the values of the other dimensions are the same, that is why the fact members
A Logical Model for Multiversion Data Warehouses
29
associated to these stores are combined and represented as a single member in
the new version of fact Sales. If the user tries to access fact Sales in V0 at an
instant t2 in [T1 , T2 ), the value of StoreKey for the last fact member will not be
available as this attribute is not present in the new version of fact Sales. This
situation is depicted in Fig. 3f.
In V2 , a new level State is also added between levels City and Region. Figure 3g
shows level City in V0 . Figure 3h shows that the effect of adding a level into the
DW schema is similar to adding attribute StateKey to level City and deleting
attribute RegionKey from it. If the members of level City are accessed in V0 at
time t2 then the information about the regions will not be available. Figure 3i
depicts this scenario.
Suppose at some point the user decides to add a new dimension Supplier to
the DW. This addition will require a new version of fact Sales because the new
facts must be linked to Supplier as well. It is worth mentioning that the supplier
information will not be available for the existing facts and they will roll-up to
unknown supplier.
4
A Multiversion Data Warehouse Model
In this section, we first introduce the formal definition of a data warehouse model
and then we extend the definition for a multiversion data warehouse.
Definition 1 (Multidimensional schema). Multidimensional schema S has
a name and is composed of (1) the set of dimensions D = {D1 , . . . , Dn } and (2)
the set of facts F = {F1 , . . . , Fn }.
Dimension Di ∈ D, i = 1, . . . , n, has a name and is composed of (1) the
set of levels L = {L1 , . . . , Ln , All}, (2) aggregation relation R, and (3) the set
of hierarchies H = {H1 , . . . , Hn }. Level L1 ∈ L is called the base level of the
dimension and every dimension has a unique level All. The dimension names are
unique in D.
Level Lj ∈ L, j = 1, . . . , n, is defined by its schema Lj (A1 : T1 , . . . , An : Tn ),
where Lj is the level name and it is unique in L. Each attribute Ak , k = 1, . . . , n,
is defined over domain Tk and attribute name Ak is unique in Lj . Level All ∈ Di
does not have any attribute.
Aggregation relation R is a partial order binary relation on L ∈ Di , i =
1, . . . , n, and contains ordered pairs of form Lp , Lq , where Lp and Lq are levels
belonging to L. R∗ denotes the transitive closure of R such that if L1 , L2 and
L2 , L3 also belong to R, then L1 , L3 belongs to R∗ . Any level Lj ∈ L is,
directly or transitively, reachable in R∗ from the base level and any level Lj ∈ L
reaches in R∗ , directly or transitively, top level All.
Hierarchy Hm ∈ H ⊆ R∗ , m = 1, . . . , n, and has a unique name. Each hierarchy Hm ∈ H begins from the base level and has top level All.
Fact Fi ∈ F , i = 1, . . . , n, is defined by its schema Fi (R1 : L1 , . . . , Rm : Lm ,
M1 : T1 , . . . , Mn : Tn ), where Rs , s = 1, . . . , m, is a role name, Ls is a base level of
a dimension Di ∈ D and each measure Mt , t = 1, . . . , n, is defined over domain
Tt . Role name Rs and measure name Mt are unique in Fi .
⊔
⊓
30
W. Ahmed, E. Zimányi, and R. Wrembel
Definition 2 (Multidimensional instance). Multidimensional instance I is
composed of dimension instance and fact instance.
An instance of dimension Di ∈ D is as follows: For each level Lj ∈ Di , the
set of members MLj = {m1 , . . . , mn } where member mk ∈ MLj , k = 1, . . . , n, is
uniquely identifiable. Level All has a special member all. For each Lj ∈ Di with
schema Lj (A1 : T1 , . . . , An : Tn ), a subset of MLj × T1 × . . . × Tn . For each pair of
L
level names Lp , Lq in R ∈ Di , a partial function Roll upLqp from MLp to MLq .
An instance of fact Fi , which is defined by its schema Fi (R1 : L1 , . . . , Rm :
⊔
⊓
Lm , M1 : T1 , . . . , Mn : Tn ), is a subset of ML1 × . . . × MLm × T1 × . . . × Tn .
Definition 3 (Multiversion multidimensional schema). M̊ultiversion multidimensional schema Smv has a name and is composed of (1) the set of multiversion dimensions Dv = {D1v , . . . , Dnv }, (2) the set of multiversion facts F v =
{F1v , . . . , Fnv }, and (3) the set of schema versions S v = {S1 , . . . , Sn }.
Multiversion dimension Div ∈ Dv , i = 0, . . . , n, defines the set {Div1 , . . . , Divn }
v
of versions of dimension Di . Dimension version Di j ∈ Div , j = 0, . . . , n, is a
v
dimension as defined in Def. 1. The dimension names for all Di j ∈ Div are the
same.
vn
v
v1
Multiversion fact Fm
} of
, . . . , Fm
∈ F v , m = 1, . . . , n, defines the set {Fm
vj
versions of fact Fm . Fact version Fm , j = 0, . . . , n, is a fact as defined in Def. 1.
v
v
are the same.
The fact names for all Fmj ∈ Fm
v
Schema version Sl ∈ S , l = 0, . . . , n, has an associated time interval Tl =
[Bl , El ) and is a multidimensional schema as defined in Def. 1, that is, it is
composed of a set of dimensions D = {D1 , . . . , Dn } and a set of facts F =
v
{F1 , . . . , Fn }, where each Di ∈ D is a dimension version Di j ∈ Div and each
vj
v
Fm ∈ F is a fact version Fm ∈ Fm . Only one version of dimension Div and fact
v
can exist in D and in F , respectively. The time intervals associated to all
Fm
schema versions in S v are disjoint, contiguous, and their union cover the time
interval since the creation of the first version until now.
⊔
⊓
Example 1. The initial schema of the multiversion data warehouse in Fig. 2a
can be represented as follows:
v
, DTv , DPv }, F v = {FSv }, S v = {S0 }, where G, T , P , and S denote
Dv = {DG
Geography, Time,
andSales,
v0 Product,
v0
v0
respectively.
v
v
v0
DG
=
D
,
D
=
D
, DPv =
DP , FSv =
FS , and
G
T
T
v0 v0 v0 v0
DG , DT , DP , FS
. For brevity, we omit the schema definitions
S0 =
of the levels and the fact.
As a result of the first schema change, a new version of the Geography dimension is derived from the previous version and the other dimensions and
the fact
Thus, the
MVDW schemais modified
remain
unchanged.
as follows:
v
v0
v1
v1
= DG
, DG
, DTv0 , DPv0 , FSv0 .
, S v = {S0 , S1 }, S1 = DG
DG
Finally, new versions of the dimension Geography and fact Sales are derived
as a result of the second schema
The resulting
the MVDW is
v0 change.
v0 schema
of
v
v1
v2
v
v1
v
=
D
,
D
,
D
,
F
=
F
,
F
,
S
=
{S0 , S1 , S2 },
modified
as
follows:
D
G
G
G
G
S
S
S
v2 v0 v0 v1
, DT , DP , FS
S2 = DG
.
⊔
⊓
A Logical Model for Multiversion Data Warehouses
31
Definition 4 (Multiversion multidimensional global schema). The multiversion multidimensional global schema is a multidimensional schema as defined in Def. 1 and it is constructed as follows:
In the global schema of dimension Di ∈ Ds, L ∈ Di is the union of all the
levels existing in all versions of Di ; the schema of level Li ∈ L is the union of all
the attributes of Li from all versions of Di in which Li is present; aggregation
relation R is the union of the aggregation relations from all version of Di , and
H ∈ Di is empty. Since the global schema is for system use only, there is no need
to maintain hierarchies in it.
The global schema of fact Fi ∈ F consists of the set of base levels B and the
set of measures C, where B is the union of all the base levels in all versions of Fi
and C is the union of all the measures in all versions of Fi .
⊔
⊓
The global schema, defined in Def. 4, is a traditional multidimensional schema
and its instance is obtained by using Def. 2. A multiversion multidimensional
instance is actually an instance of the global schema. Figure 3b, including the
shaded column Area, shows the global instance of level Store. The MVDW also
contains a transformation function T (Si ) which transforms the global instance
into the instance of schema Si ∈ S v , as defined in Def. 3. This transformation
function can be implemented as view definitions. Figure 6b shows how the global
instances of Store can be transformed into the instance of Store in version V1 .
5
Implementation of the Multiversion DW
We discuss next how the single table version (STV) and the multiple table version
(MTV) approaches can be used to implement the MVDW. Then, we present a
new hybrid table versioning approach to implement the MVDW.
In the STV approach, the newly added attributes are appended to the existing
ones and the deleted attributes are not dropped from the table. A default or null
value is stored for the deleted or the unavailable attributes. Figure 4 shows the
effect of the schema changes on the relational implementation of a DW that uses
the STV approach. Figure 4a shows the state of the DW after the first schema
change where attribute Manager is added and Area is deleted from table Store.
Records s1, s2, and s3 have null values for attribute Manager because its value is
unknown for these records. As attribute Area has been deleted, all newly added
records such as s4, will have null values for it. These null values may incur a space
overhead in case of huge amount of data. Some DBMSs partially resolve the
issue of null space overhead by offering specific features but the implementation
of these features has its own limitations1 .
In the MTV approach, each change in the schema of a table produces a new
version of the table. Figure 5a shows the new version of table Store which is created as a result of the first schema change. This version includes the newly added
attribute Manager and excludes the deleted attribute Area. The new version of
table Store results in a new version of table Sales because Sales uses attribute
1
http://technet.microsoft.com/en-us/library/cc280604.aspx
32
W. Ahmed, E. Zimányi, and R. Wrembel
Store
Key Name Address Area City Manager
s1
s2
s3
s4
Store1
Store2
Store3
Store4
ABC
DEF
HIJ
KLM
20
30
50
null
c1
c1
c2
c2
null
null
null
John
Store City Product Time
Key Key
Key
Key Quantity Amount
s1
s2
s3
s4
null
c1
c2
c2
c2
c3
p1
p1
p2
p2
p1
(a) Store in V2
t1
t1
t2
t3
t4
5
3
2
3
2
20
15
18
9
25
(b) Sales in V2
City
State Region
Key Name Representative Key
Key
c1 Brussels
John
s1
r1
c2
Liège
Doe
s2
r2
c3 Charleroi
Ahmed
s3
r2
(c) City in V2
Fig. 4. State of the data warehouse using the STV approach
Store
Key Name Address City Manager
s4 Store4 KLM
c2
John
City Product Time
Key
Key
Key Quantity Amount
c3
p1
t4
2
25
(a) Store in V2
(b) Sales in V2
City
Key
Name
c1
c2
c3
Brussels
Liège
Charleroi
State
Representative Key
John
Doe
Ahmed
s1
s2
s3
(c) City in V2
Fig. 5. State of the data warehouse using the MTV approach
StoreKey of Store as a foreign key. A new foreign key constraint is required to
associate the new version of Store with fact table Sales. This is possible by creating a new version of table Sales and using StoreKey attribute of the new version
of Store in it. As a result of the second schema change, level Store is deleted from
dimension Geography and a version of table Sales is created because henceforth,
the facts are assigned to level City. Figures 5b and 5c show tables Sales and City
created as a result of the second schema change.
An advantage of the MTV approach over the STV one is that it does not
require the null values to be stored for the dropped columns thus it prevents the
storage space overhead. The disadvantages of this approach are that the data
belonging to a table can be accessed either by creating materialized views or
performing joins. The materialized views introduce the problem of view maintenance whereas, depending upon the data size in the DW and the number of
existing versions, the join operations may become a performance overhead.
We propose a new Hybrid Table Version (HTV) approach for implementing
the MVDW. Usually, the dimension tables in a DW have fewer records as compared to the number of records in the fact tables. We propose for changes in
dimension schema, a single table version for each dimension level throughout
A Logical Model for Multiversion Data Warehouses
SELECT StoreKey, Name,
Address, Area, City
FROM Store
(a) Store using V0
SELECT
FROM
WHERE
GROUP BY
UNION
SELECT
FROM
SELECT StoreKey, Name, Address,
Manager, City
FROM Store
(b) Store using V1
33
SELECT StoreKey, TimeKey,
ProductKey, Quantity,
Amount
FROM Sales
(c) Sales using V0
CityKey, TimeKey, ProductKey, SUM(Quantity) AS Quantity, SUM(Amount) AS Amount
Sales V0 S, Store T
Sales.StoreKey = Store.StoreKey
City, TimeKey, ProductKey
CityKey, TimeKey, ProductKey, Quantity, Amount
Sales V1
(d) Sales using V2
Fig. 6. Derivation of version instances from the global instance using views
the lifespan of the DW. This table version is defined as the union of all the
attributes that have ever been defined for the dimension level. If the attributes
are added or deleted from the level, they are treated in the same way as they
are treated in the STV approach. Since, the data is loaded more frequently into
fact tables and they contain more records than dimension tables, it is more advantageous to create a new table versions for every change in the schema of the
facts. We are aware that the creation of a new structure for every fact version
may negatively impact the query performance but indexing techniques [4] can
be used to address the issue of efficiency. The HTV approach avoids the null
space overhead in case of fact tables and limits the number of joins by managing
dimension versions in a single table. In this way, the HTV approach combines
the advantages of both the STV and MTV approaches. For brevity, we do not
show the the state of the DW after schema changes using HTV approach but
the state of levels Store and City can be envisioned as shown in Figs. 4a and 4c,
respectively and the fact Sales is represented by Figs. 3d and 5b.
The data from the multiple versions of the MVDW can be accessed by defining a set of views. Whenever a new version of a dimension or fact is created, a
view definition is also created to access the existing members using this newly
created version. To access the new members using the existing versions, existing
view definitions need to be modified. For example, the members of level Store
can be accessed in versions V0 and V1 using the views defined in Figs. 6a and 6b,
respectively. Similarly, the views defined in Figs. 6c and 6d return the fact members in versions V0 and V2 , respectively. As a result of second schema change, the
granularity of dimension Geography was changed. The view in Fig. 6d aggregates
the existing sales facts to display them at the granularity of level City.
6
Conclusions
In this paper, we presented (1) a logical model of a multiversion data warehouse
(MVDW), and (2) the hybrid table version (HTV) approach to implement
34
W. Ahmed, E. Zimányi, and R. Wrembel
the MVDW. This approach combines the benefits of both the single table version
(STV) and the multiple table version (MTV) approaches and creates new table
versions only for the fact tables. As future work, we plan to combine our approach
with temporal data warehouses so that the history of both the changes in the structure and content of the DW can be maintained. We plan to extend the presented
model in such a way that the functionality of schema and content changes can be
implemented as independent plug-ins. We are also working on the experimental
evaluation of the HTV approach and its impact on query performance. Further,
we have plans to develop a query language and data structures for MVDWs.
References
1. Bebel, B., Eder, J., Koncilia, C., Morzy, T., Wrembel, R.: Creation and management of versions in multiversion data warehouse. In: Proc. of ACM SAC, pp.
717–723. ACM (2004)
2. Blaschka, M., Sapia, C., Höfling, G.: On schema evolution in multidimensional
databases. In: Mohania, M., Tjoa, A.M. (eds.) DaWaK 1999. LNCS, vol. 1676, pp.
153–164. Springer, Heidelberg (1999)
3. Brahmia, Z., Mkaouar, M., Chakhar, S., Bouaziz, R.: Efficient management of
schema versioning in multi-temporal databases. International Arab Journal of Information Technology 9(6), 544–552 (2012)
4. Chmiel, J.: Indexing multiversion data warehouse: From ROWID-Based multiversion join index to bitmap-based multiversion join index. In: Grundspenkis, J.,
Kirikova, M., Manolopoulos, Y., Novickis, L. (eds.) ADBIS 2009. LNCS, vol. 5968,
pp. 71–78. Springer, Heidelberg (2010)
5. Eder, J., Koncilia, C., Morzy, T.: The COMET metamodel for temporal data warehouses. In: Pidduck, A.B., Mylopoulos, J., Woo, C.C., Ozsu, M.T. (eds.) CAiSE
2002. LNCS, vol. 2348, pp. 83–99. Springer, Heidelberg (2002)
6. Golfarelli, M., Lechtenbörger, J., Rizzi, S., Vossen, G.: Schema versioning in data
warehouses: Enabling cross-version querying via schema augmentation. Data &
Knowledge Engineering 59(2), 435–459 (2006)
7. Golfarelli, M., Rizzi, S.: A survey on temporal data warehousing. International
Journal of Data Warehousing and Mining 5(1), 1–17 (2009)
8. Kimball, R., Ross, M.: The Data Warehouse Toolkit: The Definitive Guide to
Dimensional Modeling, 3rd edn. John Wiley & Sons (2013)
9. Malinowski, E., Zimányi, E.: A conceptual model for temporal data warehouses and
its transformation to the ER and the object-relational models. Data & Knowledge
Engineering 64(1), 101–133 (2008)
10. Rizzi, S., Golfarelli, M.: X-time: Schema versioning and cross-version querying in
data warehouses. In: Proc. of ICDE, pp. 1471–1472. IEEE (2007)
11. Wei, H.-C., Elmasri, R.: Schema versioning and database conversion techniques for
bi-temporal databases. Annals of Mathematics and Artificial Intelligence 30(1-4),
23–52 (2000)
B.: Metadata management in a multiversion data warehouse.
12. Wrembel, R., Bebel,
֒
In: Spaccapietra, S., et al. (eds.) Journal on Data Semantics VIII. LNCS, vol. 4380,
pp. 118–157. Springer, Heidelberg (2007)
13. Wrembel, R., Morzy, T.: Managing and querying versions of multiversion data
warehouse. In: Ioannidis, Y., et al. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 1121–
1124. Springer, Heidelberg (2006)