Aggregate Data Models


Data Model
• A data model is a representation that we use
to perceive and manipulate our data.
• It allows us to:
– Represent the data elements under analysis, and
– Represent how these are related to each other
• This representation depends on our
perception.
Data Model: Database View
• In the database field, it describes how we
interact with the data in the database.
• This is distinct from the storage model:
– The storage model describes how the database
stores and manipulates the data internally.
• In an ideal world:
– We should be ignorant of the storage model, but
– In practice we need at least some insight into it
to achieve decent performance
Data Models: Example
• A data model is the model of the specific data
in an application
• A developer might point to an
entity-relationship diagram and refer to it as
the data model containing
– customers,
– orders and
– products
Data Model: Definition
• In this course, “data model” refers to
the model by which the database
organizes data.
• More formally, this can be described as a
metamodel
Data Model of the Last Decades
• The dominant data model of the last decades
has been the relational data model.
1. It can be represented as a set of tables.
2. Each table has rows, with each row
representing some entity of interest.
3. We describe entities through columns
4. A column may refer to another row in the
same or different table (relationship).
NoSQL Data Model
• It moves away from the relational data model
• Each NoSQL database has a different model
– Key-value,
– Document,
– Column-family,
– Graph, and
– Sparse (Index based)
• Of these, the first three share a common
characteristic (Aggregate Orientation).
Relational Model
vs
Aggregate Model
Relational Model
• The relational model takes the information that
we want to store and divides it into tuples (rows).
• However, a tuple is a limited data structure.
• It captures a set of values.
• So, we can’t nest one tuple within another to get
nested records.
• Nor can we put a list of values or tuples within
another.
Relational Model
• This simplicity characterizes the relational
model
• It allows us to think of data manipulation as
operations that:
– Take tuples as input, and
– Return tuples
• Aggregate orientation takes a different
approach.
Aggregate Model
• It recognizes that we often want to operate on data
units with a more complex structure than a set of
tuples.
• We can think in terms of a complex record that allows:
– Lists,
– Maps,
– And other data structures to be nested inside it
• Key-value, document, and column-family databases
use this complex structure.
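A minimal Python sketch of this contrast, using made-up order data. The relational side is a set of flat tuples linked by ids, while the aggregate side is one record with a list of maps nested inside it. All names and values here are illustrative assumptions, not part of the original slides.

```python
# Relational view: flat tuples only; nesting is expressed through ids.
orders = [(99, 1)]                       # (order_id, customer_id)
order_items = [(99, 27, 32.45),          # (order_id, product_id, price)
               (99, 31, 12.00)]

# Aggregate view: one record with a list of maps nested inside it.
order_aggregate = {
    "id": 99,
    "customer_id": 1,
    "items": [
        {"product_id": 27, "price": 32.45},
        {"product_id": 31, "price": 12.00},
    ],
}


def assemble(order_id):
    """Rebuild the nested record from the flat tuples (a join on order_id)."""
    oid, cid = next(o for o in orders if o[0] == order_id)
    items = [{"product_id": p, "price": price}
             for (o, p, price) in order_items if o == order_id]
    return {"id": oid, "customer_id": cid, "items": items}
```

The `assemble` helper is hypothetical; it only shows that the nested form has to be reconstructed by the application when the data is stored as flat tuples.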
Aggregate Model
• Aggregate is a term that comes from
Domain-Driven Design [Evans03]
– An aggregate is a collection of related objects that
we wish to treat as a unit. It is a unit for data
manipulation and for the management of consistency.
• We like to update aggregates with atomic
operations
• We like to communicate with our data storage
in terms of aggregates
Aggregate Models
• This definition matches well how key-value,
document, and column-family databases work.
• With aggregates it is easier to work on a cluster,
since an aggregate is a natural unit for replication
and sharding.
• Aggregates are also easier for application
programmers to work with, since they ease the
impedance mismatch problem of relational databases.
Example of Relational Model
• Assume we are building an e-commerce website;
• We have to store information about: users, products,
orders, shipping addresses, billing addresses, and
payment data.
Example of Relational Model
• As good relational soldiers:
– Everything is normalized
– No data is repeated in multiple tables
– We have referential integrity
Example of Relational Model
Example of Aggregate Model
• We have two aggregates: Customers and Orders
• We use the black diamond composition to show
how data fits into the aggregate structure

A possible aggregation
Example of Aggregate Model
• The customer contains a list of billing addresses;
• The order contains a list of: order items, a shipping address, and
payments
• The payment itself contains a billing address for that payment
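As a rough sketch, the two aggregates described above might look like the following Python dictionaries. Field names and values are illustrative assumptions. Note that the address is copied by value into each place it is needed, and that the order refers to the customer only by id.

```python
import json

customer = {
    "id": 1,
    "name": "Martin",
    "billing_addresses": [{"city": "Chicago"}],
}

order = {
    "id": 99,
    "customer_id": 1,   # link between aggregates, kept as an id
    "order_items": [
        {"product_id": 27, "price": 32.45, "product_name": "NoSQL Distilled"},
    ],
    "shipping_address": {"city": "Chicago"},
    "payments": [
        # the payment carries its own copy of the billing address
        {"txn_id": "t-001", "billing_address": {"city": "Chicago"}},
    ],
}

# Each aggregate serializes to a single self-contained value.
order_blob = json.dumps(order)
```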
Example of Aggregate Model
• A single address appears 3 times, but instead of using an id it is copied each time
• This fits a domain where we don’t want shipping, payment and billing addresses to
change
• What is the difference with respect to a relational representation?
Example of Aggregate Model
• The link between customer and the order is a
relationship between aggregates
Example of Aggregate Model
• A link from an order item would cross into a separate
aggregate structure for products (not considered
here)
• This is a kind of denormalization, similar to the
tradeoffs made with relational databases, but it is
more common with aggregates because we want to
minimize the number of aggregates we access.
Example of Aggregate Model
• We aggregate to minimize the number of
aggregates we access during data interaction
• The important thing to notice is that:
– We have to think about how we access the data
– We make this part of our thinking when developing the
application data model
• We could draw our aggregate boundaries differently;
it really depends on the data access patterns.
• No universal answer for how to draw aggregate boundaries
• It depends entirely on how you tend to manipulate data!
– Accesses on a single order at a time: first solution
– Accesses on customers with all orders: second solution
• Context-specific
– some applications will prefer one or the other
– even within a single system
• Focus on the unit of interaction with the data storage
• Pros:
– it helps greatly with running on a cluster: data will be manipulated
together, and thus should live on the same node!
• Cons:
– an aggregate structure may help with some data interactions but be
an obstacle for others.
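The two boundary choices named above (one order at a time versus customers with all their orders) can be sketched in Python as follows. The data and the `to_customer_aggregate` helper are hypothetical, meant only to show how the same information is grouped differently.

```python
# First solution: customers and orders are separate aggregates.
customers = {1: {"id": 1, "name": "Martin"}}
orders = {
    99: {"id": 99, "customer_id": 1, "total": 32.45},
    100: {"id": 100, "customer_id": 1, "total": 12.00},
}


def to_customer_aggregate(customer_id):
    """Second solution: embed all of a customer's orders in the customer."""
    customer = dict(customers[customer_id])
    customer["orders"] = [
        {k: v for k, v in o.items() if k != "customer_id"}
        for o in orders.values() if o["customer_id"] == customer_id
    ]
    return customer


# One order at a time: a single lookup in the first layout.
single_order = orders[99]
# A customer with all orders: a single value in the second layout.
full_customer = to_customer_aggregate(1)
```

In the first layout, listing all orders of a customer needs a scan or a secondary index; in the second, reading one order means loading the whole customer.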
Consider a Student information system consisting of 3 entities namely,
Student_info, Course_info, and Marksheet.
Following are the frequent queries in the workload:
1. List the details of students admitted to ‘F.Y.B.Sc’ course.
2. List the details of students staying in ‘Kothrud’ area and studying in
‘T.Y.B.Sc’
3. Find the maximum score value for ‘Databases’ subject
4. List the number of students failing in the subject ‘Computer networks’
(marks < 40)

Given the above workload, derive an aggregate boundary, for aggregating the
three entities. Justify your answer.
Consequences of Aggregate Models
No Distributable Storage
• Relational mapping captures data elements
and their relationships well.
• It does not need any notion of an aggregate entity,
because it uses foreign-key relationships.
• But we cannot distinguish relationships that
represent aggregations from those that don’t.
• As a result, we cannot take advantage of that
knowledge to store and distribute our data.
Marking Aggregate Tools
• Many data modeling techniques provide ways to
mark aggregate structures in relational models
• However, they do not provide semantics that
help in distinguishing relationships
• When working with aggregate-oriented
databases, we have a clear view of the semantics
of the data.
• We can focus on the unit of interaction with the
data storage.
Aggregate Ignorant
• Relational databases are aggregate-ignorant,
since they have no concept of an aggregate
• Graph databases are also aggregate-ignorant.
• This is not always bad.
• In domains where it is difficult to draw
aggregate boundaries, aggregate-ignorant
databases are useful.
Aggregate and Operations
• An order is a good aggregate when:
– A customer is making and reviewing an order, and
– The retailer is processing orders
• However, when the retailer wants to analyze its
product sales over the last months,
aggregates are trouble.
• We need to dig into every aggregate to extract
the sales history.
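A small Python sketch of why this kind of analysis is awkward with order aggregates: to total sales per product, every order has to be opened and its items inspected. The data layout and the `sales_by_product` function are assumptions made for illustration.

```python
from collections import defaultdict

orders = [
    {"id": 99, "items": [{"product_id": 27, "qty": 1, "price": 30.0},
                         {"product_id": 31, "qty": 2, "price": 10.0}]},
    {"id": 100, "items": [{"product_id": 27, "qty": 3, "price": 30.0}]},
]


def sales_by_product(all_orders):
    """Total revenue per product; has to visit every order aggregate."""
    totals = defaultdict(float)
    for order in all_orders:            # full scan over all aggregates
        for item in order["items"]:
            totals[item["product_id"]] += item["qty"] * item["price"]
    return dict(totals)
```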
Aggregate and Operations
• Aggregates may help with some operations and not
with others.
• In cases where there is no clear view,
aggregate-ignorant databases are the best option.
• But remember the point that drove us to
aggregate models (cluster distribution).
• Running databases on a cluster is needed when
dealing with huge quantities of data.
Running on a Cluster
• A cluster gives several advantages in computation
power and data distribution
• However, it requires us to minimize the number of
nodes we query when gathering data
• By explicitly including aggregates, we give the
database an important hint about which
information should be stored together
• But we still have a problem when querying
historical data
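One way to picture "stored together" is a sketch in which each whole aggregate is assigned to a node by a stable hash of its key. This is a simplification under stated assumptions: the node names are hypothetical, and real databases use their own partitioning schemes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical cluster


def node_for(aggregate_key: str) -> str:
    """Pick a node from a stable hash of the aggregate key."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def place(aggregates: dict) -> dict:
    """Group whole aggregates by the node that would store them."""
    layout = {node: {} for node in NODES}
    for key, value in aggregates.items():
        layout[node_for(key)][key] = value
    return layout
```

Because the unit of placement is the aggregate, reading one order with all its items touches a single node, whereas a query over all orders may have to visit every node.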
Aggregates and Transactions
ACID transactions
• Relational databases allow us to manipulate any
combination of rows from any table in a single
transaction.
• ACID transactions are:
– Atomic,
– Consistent,
– Isolated, and
– Durable
• Here the main point is atomicity.
Atomicity & RDBMS
• Many rows spanning many tables are updated
in a single atomic operation
• It either succeeds or fails entirely
• Concurrent operations are isolated from each
other, so we cannot see partial updates
• However, relational databases can still fail.
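A minimal sketch of all-or-nothing behavior across tables, using Python's built-in `sqlite3` module with an in-memory database. The table layout and the `place_order` helper are assumptions for illustration; the point is that a failure midway leaves neither table changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, product_id INTEGER)")
conn.commit()


def place_order(fail=False):
    """Insert into two tables in one transaction; roll back on any error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO orders VALUES (99, 1)")
            conn.execute("INSERT INTO order_items VALUES (99, 27)")
            if fail:
                raise RuntimeError("simulated failure mid-transaction")
    except RuntimeError:
        pass


def counts():
    """Return the number of rows in (orders, order_items)."""
    n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    n_items = conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0]
    return n_orders, n_items
```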
Atomicity & NoSQL
• In general, aggregate-oriented NoSQL databases
don’t support atomicity that spans multiple
aggregates.
• This means that if we need to update multiple
aggregates we have to manage that in the
application code.
• Thus atomicity is one of the considerations
when deciding how to divide our data into
aggregates
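A hedged sketch of what "managing it in the application code" can look like. It assumes a store whose `put` is atomic for a single aggregate only; the store, keys, and the `close_order` function are hypothetical. The application undoes the first write itself if the second one fails.

```python
import copy

store = {
    "order:99": {"status": "open"},
    "customer:1": {"open_orders": 1},
}


def put(key, value):
    """Assumed to be atomic for a single aggregate."""
    store[key] = copy.deepcopy(value)


def close_order(order_key, customer_key, fail_midway=False):
    """Update two aggregates; undo the first write if the second fails."""
    old_order = copy.deepcopy(store[order_key])
    put(order_key, {**old_order, "status": "closed"})
    try:
        if fail_midway:
            raise RuntimeError("simulated failure before second write")
        customer = store[customer_key]
        put(customer_key,
            {**customer, "open_orders": customer["open_orders"] - 1})
    except RuntimeError:
        put(order_key, old_order)   # compensating write by the application
        return False
    return True
```

Unlike a database transaction, this gives no isolation: between the two writes, other clients could observe the order as closed while the customer still counts it as open.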
Aggregates Models on NoSQL
Key-Value and Document
• Key-value and Document databases are strongly
aggregate-oriented.
• Both of these types of databases consist of lots of
aggregates, each with a key used to get the data.
• The two types of databases differ in that:
– In a key-value store the aggregate is opaque (a blob)
– In a document database we can see structure in the
aggregate.
Key-Value and Document
• The advantage of opacity is that we can store
whatever we like in the aggregate.
• The database may impose some size limit, but
otherwise we have freedom
• A document store imposes limits on what we
can place in it, by defining a structure for the
data.
Key-Value and Document
• With a key-value store we can only access an
aggregate by its key
• With a document store:
– We can submit queries based on fields,
– We can retrieve part of the aggregate, and
– The database can create indexes based on the fields
of the aggregate.
• But in practice they are used differently
Key-Value and Document
• In practice, the line between key-value and
document gets a bit blurry.
• An ID field is often put in a document database to do a
key-value style lookup
• With key-value databases, we mostly expect to look up
aggregates using a key
• With document databases, we mostly expect to
submit some form of query on the internal
structure of the documents.
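The difference in access style can be sketched in plain Python, without any real database. The key-value side stores an opaque blob that can only be fetched by key; the document side exposes fields that can be queried and partially retrieved. The `kv_get` and `find` functions are hypothetical stand-ins, not the API of any particular product.

```python
import json

# Key-value store: the value is an opaque blob; only lookup by key.
kv_store = {
    "order:99": json.dumps({"customer_id": 1, "city": "Chicago"}).encode(),
}


def kv_get(key):
    """Return the stored bytes; the store itself cannot look inside."""
    return kv_store[key]


# Document store: the structure is visible, so fields can be queried.
documents = [
    {"_id": 99, "customer_id": 1, "city": "Chicago"},
    {"_id": 100, "customer_id": 2, "city": "Boston"},
]


def find(field, value, projection=None):
    """Return matching documents, optionally only some of their fields."""
    hits = [d for d in documents if d.get(field) == value]
    if projection is None:
        return hits
    return [{k: d[k] for k in projection if k in d} for d in hits]
```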
Column-Family Stores
• One of the most influential NoSQL databases
was Google’s BigTable [Chang et al.]
• Its name evokes a tabular structure, which it
realizes with sparse columns and no schema.
• It is better to think of this structure not as a
table, but as a two-level map.
Column-Family Stores
• These BigTable-style data models are referred
to as column stores.
• Pre-NoSQL column stores like C-Store used
SQL and the relational model.
• What makes column stores different is
how they physically store data.
• Most databases have rows as the unit of storage,
which helps write performance
Column-Family Stores
• However, there are many scenarios where:
– Writes are rare, but
– You need to read a few columns of many rows at
once
• In these situations, it’s better to store groups of
columns for all rows as the basic storage unit.
• These kinds of databases are called column
stores or column-family databases
Column-Family Stores
• Column-family databases have a two-level aggregate
structure.
• As in a key-value store, the first key is the row
identifier.
• The difference is that retrieving a key returns a map
of more detailed values.
• These second-level values are referred to as columns.
• Fixing a row, we can access all of its column families
or a particular element.
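The two-level map can be sketched with nested Python dictionaries: the first key selects the row, and the result is itself a map of column families and columns. The row key, family names, and accessor functions are illustrative assumptions.

```python
rows = {
    "customer:456": {                                        # row key
        "profile": {"name": "Martin", "city": "Chicago"},    # column family
        "orders": {"order:99": "2 items", "order:100": "1 item"},
    },
}


def get_row(row_key):
    """First level: the row key returns a map of column families."""
    return rows[row_key]


def get_family(row_key, family):
    """All columns of one column family for the given row."""
    return rows[row_key][family]


def get_column(row_key, family, column):
    """Second level: a single column value within a family."""
    return rows[row_key][family][column]
```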
Example of Column Model
Column-Family Stores
• They organize their columns into families.
• Each column is part of a family, and the column
family acts as a unit of access.
• The data for a particular column family
is then accessed together.
Column-Family Stores:
How to structure data
• In row-oriented:
– each row is an aggregate (For example the customer
with id 456),
– with column families representing useful chunks of
data (profile, order history) within that aggregate
• In column-oriented:
– each column family defines a record type (e.g.
customer profiles) with rows for each of the records.
– You can think of a row as the join of records in all
column families
Key Points
• An aggregate is a collection of data that we interact with as
a unit.
• Aggregates form the boundaries for ACID operations with
the database
• Key-value, document, and column-family databases can all
be seen as forms of aggregate-oriented database
• Aggregates make it easier for the database to manage data
storage over clusters
• Aggregate-oriented databases work best when most data
interaction is done with the same aggregate
• Aggregate-ignorant databases are better when interactions
use data organized in many different formations
