Aggregate Data Models
Data Model
• A data model is a representation that we use
to perceive and manipulate our data.
• It allows us to:
– Represent the data elements under analysis, and
– Represent how these are related to each other
• This representation depends on our
perception.
Data Model: Database View
• In the database field, it describes how we
interact with the data in the database.
• This is distinct from the storage model:
– The storage model describes how the database stores and
manipulates the data internally.
• In an ideal world:
– We should be ignorant of the storage model, but
– In practice we need at least some insight into it to
achieve decent performance
Data Models: Example
• A data model can mean the model of the specific data
in an application
• A developer might point to an entity-relationship
diagram and refer to it as the data model containing
– customers,
– orders and
– products
Data Model: Definition
• In this course, “data model” refers to the
model by which the database organizes data.
• More formally, this can be called a
metamodel
Last Decades Data Model
• The dominant data model of the last decades
was the relational data model.
1. It can be represented as a set of tables.
2. Each table has rows, with each row
representing some entity of interest.
3. We describe entities through columns
4. A column may refer to another row in the
same or different table (relationship).
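The four points above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not part of the original slides: the table names (customers, orders), the column names, and the sample values are all assumptions chosen for the example.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# 1. The data is represented as a set of tables.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customers(id),"  # 4. column referring to another row
    " total REAL)"
)

# 2. Each row represents one entity; 3. columns describe it.
conn.execute("INSERT INTO customers VALUES (1, 'Ann')")
conn.execute("INSERT INTO orders VALUES (99, 1, 32.5)")

# Following the relationship requires a join across tables.
rows = conn.execute(
    "SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id"
).fetchall()
```

Here `rows` holds one tuple per order, pairing the order id with the name of the customer it refers to.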
NoSQL Data Model
• It moves away from the relational data model
• Each NoSQL database has a different model
– Key-value,
– Document,
– Column-family,
– Graph, and
– Sparse (Index based)
• Of these, the first three share a common
characteristic (Aggregate Orientation).
Relational Model vs Aggregate Model
Relational Model
• The relational model takes the information that
we want to store and divides it into tuples (rows).
• However, a tuple is a limited data structure.
• It captures a set of values.
• So, we can’t nest one tuple within another to get
nested records.
• Nor can we put a list of values or tuples within
another.
Relational Model
• This simplicity characterizes the relational
model
• It allows us to think of data manipulation as
operations that:
– Take tuples as input, and
– Return tuples
• Aggregate orientation takes a different
approach.
Aggregate Model
• It recognizes that we often want to operate on data units
with a more complex structure than a set of
tuples.
• We can think in terms of a complex record that allows:
– Lists,
– Maps,
– And other data structures to be nested inside it
• Key-value, document, and column-family databases
use this complex structure.
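As a rough sketch of such a complex record, the Python dataclasses below nest a list, a map, and a list of sub-records inside one unit. The class and field names (Record, LineItem, tags, attributes) are hypothetical and only serve to show the nesting that a flat tuple cannot express.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LineItem:
    product: str
    quantity: int


@dataclass
class Record:
    key: str
    tags: List[str] = field(default_factory=list)             # nested list
    attributes: Dict[str, str] = field(default_factory=dict)  # nested map
    items: List[LineItem] = field(default_factory=list)       # nested records


# One record carries all of its nested data with it.
record = Record(
    key="order-99",
    tags=["gift"],
    attributes={"currency": "USD"},
    items=[LineItem("book", 2), LineItem("pen", 5)],
)

# The nested structure is navigated directly, without joins.
total_quantity = sum(item.quantity for item in record.items)
```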
Aggregate Model
• Aggregate is a term coming from Domain-
Driven Design [Evans03]
– An aggregate is a collection of related objects that
we wish to treat as a unit. It is a unit for data
manipulation and management for consistency.
• We like to update aggregates with atomic
operations
• We like to communicate with our data storage
in terms of aggregates
Aggregate Models
• This definition matches well how key-value,
document, and column-family databases work.
• With aggregates it is easier to work on a cluster,
since the aggregate is a natural unit for replication and sharding.
• Aggregates are also easier for application
programmers to work with, since they ease the impedance
mismatch problem of relational databases.
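To illustrate why an aggregate is a convenient unit for sharding, the sketch below assigns each whole aggregate to one node by hashing its key. The function name shard_for, the key format, and the four-node cluster are assumptions for illustration; real databases use their own placement schemes.

```python
import hashlib


def shard_for(aggregate_key: str, num_nodes: int) -> int:
    """Map an aggregate key to a node index with a stable hash."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Hypothetical cluster of 4 nodes, each holding whole aggregates.
NUM_NODES = 4
nodes = {index: {} for index in range(NUM_NODES)}

aggregates = {
    "order:99": {"customer_id": 1, "items": ["book"]},
    "order:100": {"customer_id": 2, "items": ["pen", "ink"]},
}

# Each aggregate is placed as a unit, so reading it touches one node.
for key, aggregate in aggregates.items():
    nodes[shard_for(key, NUM_NODES)][key] = aggregate

home_node = shard_for("order:99", NUM_NODES)
```

Because the whole aggregate lives under one key, everything needed to serve "order:99" is found on `home_node`.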
Example of Relational Model
• Assume we are building an e-commerce website
• We have to store information about: users, products,
orders, shipping addresses, billing addresses, and
payment data.
Example of Relational Model
• As good relational soldiers:
– Everything is normalized
– No data is repeated in multiple tables
– We have referential integrity
Example of Relational Model
Example of Aggregate Model
• We have two aggregates: Customers and Orders
• We use the black diamond composition to show
how data fits into the aggregate structure
A possible aggregation
Example of Aggregate Model
• The customer contains a list of billing addresses;
• The order contains a list of: order items, a shipping address, and
payments
• The payment itself contains a billing address for that payment
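A possible rendering of these two aggregates as nested Python dictionaries, serialized to JSON, is sketched below. The field names and sample values are illustrative assumptions; only the nesting (customer with billing addresses, order with items, shipping address, and payments that carry their own billing address) follows the description above.

```python
import json

# Customer aggregate: contains a list of billing addresses.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_addresses": [{"city": "Chicago"}],
}

# Order aggregate: contains order items, a shipping address, and payments.
order = {
    "id": 99,
    "customer_id": 1,  # link between aggregates, kept as an id
    "order_items": [
        {"product_id": 27, "price": 32.45, "product_name": "NoSQL Distilled"}
    ],
    "shipping_address": {"city": "Chicago"},
    "payments": [
        {
            "card_number": "1000-1000",
            # The billing address is copied into the payment, not referenced.
            "billing_address": {"city": "Chicago"},
        }
    ],
}

# Each aggregate can be stored and retrieved as one unit.
stored = json.dumps(order)
restored = json.loads(stored)
```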
Example of Aggregate Model
• A single address appears 3 times, but instead of using an id it is copied each time
• This fits a domain where we don’t want the shipping, payment, and billing addresses to
change
• What is the difference with respect to a relational representation?
Example of Aggregate Model
• The link between customer and the order is a
relationship between aggregates
Example of Aggregate Model
• A link from an order item would cross into a separate
aggregate structure for products (not considered
here)
• Copying data into an aggregate is a kind of denormalization, similar to the
trade-offs made with relational databases, but it is more common with
aggregates because we want to minimize the
number of aggregates we access.
Example of Aggregate Model
• We aggregate to minimize the number of
aggregates we access during data interaction
• The important thing to notice is that:
– We have to think about how the data will be accessed
– We make this part of our thinking when developing the
application data model
• We could draw our aggregate boundaries differently; the choice
really depends on the data access patterns.
• No universal answer for how to draw aggregate boundaries
• It depends entirely on how you tend to manipulate data!
– Accesses on a single order at a time: first solution
– Accesses on customers with all orders: second solution
• Context-specific
– some applications will prefer one or the other
– even within a single system
• Focus on the unit of interaction with the data storage
• Pros:
– it helps greatly with running on a cluster: data will be manipulated
together, and thus should live on the same node!
• Cons:
– an aggregate structure may help with some data interactions but be
an obstacle for others.
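The two boundary choices can be compared with a small sketch. It starts from the first solution (customers and orders as separate aggregates) and derives the second (each customer embedding all of its orders). The helper embed_orders and the sample data are hypothetical.

```python
# First solution: customers and orders are separate aggregates.
customers = [{"id": 1, "name": "Martin"}]
orders = [
    {"id": 99, "customer_id": 1, "total": 32.45},
    {"id": 100, "customer_id": 1, "total": 10.0},
]


def embed_orders(customers, orders):
    """Second solution: one aggregate per customer, with orders nested inside."""
    orders_by_customer = {}
    for order in orders:
        # The customer_id link becomes implicit once the order is nested.
        embedded = {k: v for k, v in order.items() if k != "customer_id"}
        orders_by_customer.setdefault(order["customer_id"], []).append(embedded)
    return [
        dict(customer, orders=orders_by_customer.get(customer["id"], []))
        for customer in customers
    ]


customer_aggregates = embed_orders(customers, orders)
```

With the first layout, reading one order touches one small aggregate; with the second, reading a customer brings all orders along in a single access.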
Consider a Student information system consisting of 3 entities namely,
Student_info, Course_info, and Marksheet.
Following are the frequent queries in the workload:
1. List the details of students admitted to ‘[Link]’ course.
2. List the details of students staying in ‘Kothrud’ area and studying in
‘[Link]’
3. Find the maximum score value for ‘Databases’ subject
4. List the number of students failing in the subject ‘Computer networks’
(marks < 40)
Given the above workload, derive an aggregate boundary, for aggregating the
three entities. Justify your answer.
Consequences of Aggregate Models
No Distributable Storage
• Relational mapping captures data elements
and their relationships well.
• It does not need any notion of an aggregate entity,
because it uses foreign key relationships.
• But we cannot distinguish relationships that
represent aggregations from those that don’t.
• As a result, we cannot take advantage of that
knowledge to store and distribute our data.
Marking Aggregate Tools
• Many data modeling techniques provide ways to
mark aggregate structures in relational models
• However, they do not provide semantics that
help distinguish aggregate relationships from other relationships
• When working with aggregate-oriented
databases, we have a clear view of the semantics
of the data.
• We can focus on the unit of interaction with the
data storage.
Aggregate Ignorant
• Relational databases are aggregate-ignorant,
since they have no concept of an aggregate
• Graph databases are also aggregate-ignorant.
• This is not always bad.
• In domains where it is difficult to draw
aggregate boundaries, aggregate-ignorant
databases are useful.
Aggregate and Operations
• An order is a good aggregate when:
– A customer is making and reviewing an order, and
– When the retailer is processing orders
• However, when the retailer wants to analyze its
product sales over the last months, the
aggregates get in the way.
• We need to read every order aggregate to extract
the sales history.
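The cost of this kind of analysis can be sketched as follows: to total sales per product, every order aggregate has to be opened, even though only the line items are of interest. The order layout and the function name sales_by_product are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical order aggregates, each stored as one unit.
orders = [
    {"id": 99, "items": [{"product": "book", "quantity": 2, "price": 10.0}]},
    {
        "id": 100,
        "items": [
            {"product": "book", "quantity": 1, "price": 10.0},
            {"product": "pen", "quantity": 3, "price": 1.5},
        ],
    },
]


def sales_by_product(orders):
    """Scan every order aggregate and sum revenue per product."""
    totals = defaultdict(float)
    for order in orders:  # every aggregate must be read
        for item in order["items"]:
            totals[item["product"]] += item["quantity"] * item["price"]
    return dict(totals)


sales = sales_by_product(orders)
```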
Aggregate and Operations
• Aggregates may help with some operations and not with
others.
• In cases where there is no clear view of how the data will be used,
aggregate-ignorant databases are the best option.
• But remember the point that drove us to
aggregate models (cluster distribution).
• Running databases on a cluster is needed when
dealing with huge quantities of data.
Running on a Cluster
• Running on a cluster gives several advantages in computation
power and data distribution
• However, it requires us to minimize the number of
nodes we query when gathering data
• By explicitly including aggregates, we tell the
database which information should be stored together
• But we still have the problem of querying
historical data
Aggregates and Transactions
ACID transactions
• Relational databases allow us to manipulate any
combination of rows from any table in a single
transaction.
• ACID transactions are:
– Atomic,
– Consistent,
– Isolated, and
– Durable
• The main point here is atomicity.
Atomicity & RDBMS
• Many rows spanning many tables are updated
in a single atomic operation
• It either succeeds or fails entirely
• Concurrent operations are isolated, so we
cannot see partial updates
• However, relational databases can still fail.
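A minimal sketch of all-or-nothing behaviour across two tables, using Python's built-in sqlite3 module. The tables (orders, stock), the CHECK constraint, and the sample values are assumptions made for the example; the point is only that a failure in the second statement also undoes the first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute(
    "CREATE TABLE stock ("
    " product TEXT PRIMARY KEY,"
    " quantity INTEGER CHECK (quantity >= 0))"
)
conn.execute("INSERT INTO stock VALUES ('book', 1)")
conn.commit()

try:
    # The connection used as a context manager commits on success
    # and rolls back if an exception is raised inside the block.
    with conn:
        conn.execute("INSERT INTO orders VALUES (1, 20.0)")
        # Only 1 book in stock: this violates the CHECK constraint.
        conn.execute(
            "UPDATE stock SET quantity = quantity - 2 WHERE product = 'book'"
        )
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock_left = conn.execute(
    "SELECT quantity FROM stock WHERE product = 'book'"
).fetchone()[0]
```

After the failed transaction, neither table shows a partial update: the order row is gone and the stock is unchanged.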
Atomicity & NoSQL
• Aggregate-oriented NoSQL databases generally don’t support atomicity that spans
multiple aggregates; a single aggregate is updated atomically.
• This means that if we need to update multiple
aggregates atomically, we have to manage that in the
application code.
• Thus atomicity is one of the considerations
when deciding how to divide up our data into
aggregates
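One way application code might handle an update that spans two aggregates is to undo the first write if the second fails. The sketch below uses a toy in-memory store; the class KeyValueStore, the function update_two_aggregates, and the fail_on_second flag (which simulates a failure) are all hypothetical. Note that this compensation is not truly atomic: another client could observe the intermediate state between the two writes.

```python
class KeyValueStore:
    """Toy in-memory store: each put replaces one whole aggregate."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, aggregate):
        self._data[key] = aggregate


def update_two_aggregates(store, key_a, new_a, key_b, new_b, fail_on_second=False):
    """Write two aggregates; restore the first if the second write fails."""
    old_a = store.get(key_a)
    store.put(key_a, new_a)
    try:
        if fail_on_second:
            raise RuntimeError("simulated failure on second write")
        store.put(key_b, new_b)
    except RuntimeError:
        store.put(key_a, old_a)  # compensating write, done by the application
        return False
    return True


store = KeyValueStore()
store.put("customer:1", {"open_orders": 0})
store.put("order:99", {"status": "new"})

failed = update_two_aggregates(
    store, "customer:1", {"open_orders": 1}, "order:99", {"status": "placed"},
    fail_on_second=True,
)
state_after_failure = (store.get("customer:1"), store.get("order:99"))

succeeded = update_two_aggregates(
    store, "customer:1", {"open_orders": 1}, "order:99", {"status": "placed"},
)
```

Keeping data that must change together inside one aggregate avoids this kind of code altogether.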
Aggregates Models on NoSQL
Key-Value and Document
• Key-value and Document databases are strongly
aggregate-oriented.
• Both types of database consist of many
aggregates, each with a key used to get at the data.
• The two types of database differ in that:
– In a key-value store the aggregate is opaque (a blob)
– In a document database we can see a structure in the
aggregate.
Key-Value and Document
• The advantage of opacity is that we can store
whatever we like in the aggregate.
• The database may impose some size limit, but
otherwise we have complete freedom
• A document store imposes limits on what we
can place in it, by defining the allowed structure of the
data.
Key-Value and Document
• With a key-value store, we can only access an aggregate by its key
• With a document database:
– We can submit queries based on fields,
– We can retrieve part of the aggregate, and
– The database can create indexes based on the fields
of the aggregate.
• But in practice they are used differently
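The difference in access styles can be sketched with plain Python containers standing in for the two kinds of store. Nothing here is a real database API: kv_store, doc_store, and the helpers find_by_field, project, and build_index are hypothetical names used to mirror the three document capabilities listed above.

```python
import json

# Key-value style: the store maps a key to opaque bytes it cannot interpret.
kv_store = {
    "order:99": json.dumps(
        {"id": 99, "customer_id": 1, "total": 32.45}
    ).encode("utf-8"),
}
blob = kv_store["order:99"]                        # lookup by key only
order_from_kv = json.loads(blob.decode("utf-8"))   # the application decodes it

# Document style: the store can see the fields of each aggregate.
doc_store = [
    {"id": 99, "customer_id": 1, "total": 32.45},
    {"id": 100, "customer_id": 2, "total": 10.0},
]


def find_by_field(docs, field, value):
    """Query based on a field of the aggregate."""
    return [doc for doc in docs if doc.get(field) == value]


def project(doc, fields):
    """Retrieve only part of the aggregate."""
    return {name: doc[name] for name in fields if name in doc}


def build_index(docs, field):
    """Index aggregates by the value of one field."""
    index = {}
    for doc in docs:
        index.setdefault(doc.get(field), []).append(doc["id"])
    return index


matches = find_by_field(doc_store, "customer_id", 2)
totals_only = project(doc_store[0], ["total"])
customer_index = build_index(doc_store, "customer_id")
```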
Key-Value and Document
• In practice, the line between key-value and
document gets a bit blurry.
• An ID field is often put in a document database to do a
key-value style lookup
• With key-value databases, we mostly expect to look up aggregates
using a key
• With document databases, we mostly expect to
submit some form of query on the internal
structure of the documents.
Column-Family Stores
• One of the most influential NoSQL databases
was Google’s BigTable [Chang et al.]
• Its name evokes a tabular structure, which it realizes
with sparse columns and no schema.
• It is better to think of this structure not as a
table, but as a two-level map.
Column-Family Stores
• These BigTable-style data models are often referred
to as column stores.
• Pre-NoSQL column stores like C-Store used
SQL and the relational model.
• What makes column stores different is
how they physically store data.
• Most databases have rows as the unit of storage,
which helps write performance
Column-Family Stores
• However, there are many scenarios where:
– Writes are rare, but
– You need to read a few columns of many rows at
once
• In these situations, it’s better to store groups of
columns for all rows as the basic storage unit.
• These kinds of databases are called column
stores or column-family databases
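The trade-off between the two layouts can be sketched in memory. The row layout keeps all fields of one record together, while the column layout keeps all values of one field together, so reading a single column of many rows touches only that column. The sample records and field names are assumptions for illustration.

```python
# Row-oriented layout: each record is stored as one unit.
row_store = [
    {"id": 1, "name": "Ann", "city": "Pune", "score": 71},
    {"id": 2, "name": "Raj", "city": "Delhi", "score": 38},
    {"id": 3, "name": "Mia", "city": "Pune", "score": 90},
]

# Column-oriented layout: the values of each column are stored together.
column_store = {
    column: [row[column] for row in row_store] for column in row_store[0]
}

# Reading one column of many rows:
# the row layout visits every full record...
max_score_rows = max(row["score"] for row in row_store)
# ...while the column layout reads a single list.
max_score_columns = max(column_store["score"])

# Writing one new record: one append in the row layout,
# but one append per column in the column layout.
new_row = {"id": 4, "name": "Leo", "city": "Goa", "score": 55}
row_store.append(new_row)
for column, value in new_row.items():
    column_store[column].append(value)
```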
Column-Family Stores
• Column-family databases have a two-level aggregate
structure.
• As in a key-value store, the first key is the row
identifier.
• The difference is that retrieving a key returns a map
of more detailed values.
• These second-level values are referred to as columns.
• For a given row, we can access all the column families
or a particular element.
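The two-level structure can be sketched as a Python dictionary of dictionaries: the first key is the row identifier and the second is the column. Writing column names as "family:qualifier" is an assumption made here to show column families; the helper names and sample values are hypothetical as well.

```python
# First level: row identifier. Second level: column name -> value.
table = {
    "1234": {
        "profile:name": "Martin",
        "profile:billing_address": "Chicago",
        "orders:OR1001": "order data 1",
        "orders:OR1002": "order data 2",
    },
    # Rows are sparse: this row has no "orders" columns at all.
    "5678": {"profile:name": "Pramod"},
}


def get_row(table, row_key):
    """Retrieve the whole second-level map for a row."""
    return table[row_key]


def get_family(table, row_key, family):
    """Retrieve all columns of one column family for a row."""
    prefix = family + ":"
    return {
        column: value
        for column, value in table[row_key].items()
        if column.startswith(prefix)
    }


def get_column(table, row_key, column):
    """Retrieve one particular element of a row."""
    return table[row_key][column]


profile = get_family(table, "1234", "profile")
customer_name = get_column(table, "1234", "profile:name")
```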
Example of Column Model
Column-Family Stores
• They organize their columns into families.
• Each column is part of a family, and the column
family acts as the unit of access.
• The data for a particular column family
is therefore usually accessed together.
Column-Family Stores: How to structure data
• Row-oriented view:
– each row is an aggregate (for example, the customer
with id 456),
– with column families representing useful chunks of
data (profile, order history) within that aggregate
• Column-oriented view:
– each column family defines a record type (e.g.
customer profiles) with rows for each of the records.
– You can think of a row as the join of records in all
column families
Key Points
• An aggregate is a collection of data that we interact with as
a unit.
• Aggregates form the boundaries for ACID operations with
the database
• Key-value, document, and column-family databases can all
be seen as forms of aggregate-oriented database
• Aggregates make it easier for the database to manage data
storage over clusters
• Aggregate-oriented databases work best when most data
interaction is done with the same aggregate
• Aggregate-ignorant databases are better when interactions
use data organized in many different formations