BGD Mod 2 QB Solns
Ans: A NoSQL (originally referring to "non-SQL" or "non-relational") database provides a mechanism for
storage and retrieval of data that is modeled in means other than the tabular relations used in relational
databases. NoSQL databases are increasingly used in big data and real-time web applications.
• Concurrency
Enterprise applications tend to have many people looking at the same body of data at once, possibly
modifying that data. Most of the time they are working on different areas of that data, but occasionally
they operate on the same bit of data. As a result, we have to worry about coordinating these interactions
to avoid such things as double booking of hotel rooms.
Concurrency is notoriously difficult to get right, with all sorts of errors that can trap even the most
careful programmers. Since enterprise applications can have lots of users and other systems all working
concurrently, there’s a lot of room for bad things to happen. Relational databases help handle this by
controlling all access to their data through transactions. While this isn’t a cure-all (you still have to
handle a transactional error when you try to book a room that’s just gone), the transactional mechanism
has worked well to contain the complexity of concurrency.
Transactions also play a role in error handling. With transactions, you can make a change, and if an error
occurs during the processing of the change, you can roll back the transaction to clean things up.
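To make the rollback idea concrete, here is a minimal sketch using plain JDBC; the connection URL, table, and column names are invented for illustration and are not part of the original text:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BookingExample {
    public static void bookRoom(String roomId, String guestId) throws SQLException {
        // URL, credentials and schema are placeholders for illustration only
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/hotel", "app", "secret")) {
            conn.setAutoCommit(false);            // start a transaction
            try (PreparedStatement stmt = conn.prepareStatement(
                    "UPDATE rooms SET guest_id = ? WHERE id = ? AND guest_id IS NULL")) {
                stmt.setString(1, guestId);
                stmt.setString(2, roomId);
                int updated = stmt.executeUpdate();
                if (updated == 0) {
                    // the room was just taken by someone else: signal a transactional error
                    throw new SQLException("Room already booked");
                }
                conn.commit();                    // make the change visible to other users
            } catch (SQLException e) {
                conn.rollback();                  // an error occurred: undo the change to clean things up
                throw e;
            }
        }
    }
}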
• Integration
Enterprise applications live in a rich ecosystem that requires multiple applications, written by different
teams, to collaborate in order to get things done. This kind of inter-application collaboration is awkward
because it means pushing the human organizational boundaries. Applications often need to use the same
data and updates made through one application have to be visible to others.
A common way to do this is shared database integration [Hohpe and Woolf] where multiple applications
store their data in a single database. Using a single database allows all the applications to use each
other’s data easily, while the database’s concurrency control handles multiple applications in the same
way as it handles multiple users in a single application.
Let’s assume we have to build an e-commerce website; we are going to be selling items directly to customers
over the web, and we will have to store information about users, our product catalog, orders, shipping
addresses, billing addresses, and payment data.
In this model, we have two main aggregates: customer and order. We’ve used the black-diamond
composition marker in UML to show how data fits into the aggregation structure. The customer contains a
list of billing addresses; the order contains a list of order items, a shipping address, and payments. The
payment itself contains a billing address for that payment.
A single logical address record appears three times in the example data, but instead of using IDs it’s treated
as a value and copied each time. This fits the domain where we would not want the shipping address, nor
the payments’ billing address, to change. In a relational database, we would ensure that the address rows
aren’t updated for this case, making a new row instead. With aggregates, we can copy the whole address
structure into the aggregate as we need to.
The link between the customer and the order isn’t within either aggregate—it’s a relationship between
aggregates. Similarly, the link from an order item would cross into a separate aggregate structure for
products, which we haven’t gone into. We’ve shown the product name as part of the order item here—this
kind of denormalization is similar to the trade-offs with relational databases, but is more common with
aggregates because we want to minimize the number of aggregates we access during a data interaction. The
important thing to notice here isn’t the particular way we’ve drawn the aggregate boundary so much as the
fact that you have to think about accessing that data—and make that part of your thinking when developing
the application data model. Indeed, we could draw our aggregate boundaries differently, putting all the orders
for a customer into the customer aggregate.
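As a rough sketch, the two aggregates could be expressed in application code along the following lines; the class and field names are illustrative only:

import java.util.List;

// Address is treated as a value and copied into each aggregate that needs it
record Address(String street, String city) {}

record OrderItem(String productId, String productName, int quantity) {}
record Payment(String cardNumber, Address billingAddress) {}

// The Order aggregate: everything inside it is stored and retrieved as one unit
record Order(String orderId,
             String customerId,            // a reference to the Customer aggregate, not containment
             List<OrderItem> items,
             Address shippingAddress,
             List<Payment> payments) {}

// The Customer aggregate
record Customer(String customerId, String name, List<Address> billingAddresses) {}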
3. Write a note on:
Graph Database:- Most NoSQL databases were inspired by the need to run on clusters, which led to
aggregate-oriented data models of large records with simple connections.
• Graph databases are motivated by a different frustration with relational databases and thus have an
opposite model: small records with complex interconnections.
• The fundamental data model of a graph database is very simple: nodes connected by edges.
• Once you have built up a graph of nodes and edges, a graph database allows you to query that
network with query operations designed with this kind of graph in mind.
• This is where the important differences between graph and relational databases come in. Although
relational databases can implement relationships using foreign keys, the joins required to navigate
around can get quite expensive—which means performance is often poor for highly connected data
models.
• The emphasis on relationships makes graph databases very different from aggregate-oriented
databases.
• ACID transactions need to cover multiple nodes and edges to maintain consistency.
• The only thing they have in common with aggregate-oriented databases is their rejection of the relational
model, and the upsurge in attention they received around the same time as the rest of the NoSQL field.
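For illustration, a toy sketch of the underlying model in plain Java: nodes connected by edges, with a "friends of friends" query expressed as a traversal rather than a join (all names are made up for this example):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TinyGraph {
    private final Map<String, Set<String>> edges = new HashMap<>();   // node -> neighbours

    void addEdge(String from, String to) {                            // small records, many connections
        edges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // "friends of friends": a query expressed as graph traversal, not as a relational join
    Set<String> friendsOfFriends(String node) {
        Set<String> result = new HashSet<>();
        for (String friend : edges.getOrDefault(node, Set.of())) {
            result.addAll(edges.getOrDefault(friend, Set.of()));
        }
        result.remove(node);
        return result;
    }
}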
Schemaless Database:- A common theme across all the forms of NoSQL databases is that they are
schemaless.
• In a relational database, you first have to define a schema: a defined structure for the database which says
what tables exist, which columns exist, and what data types each column can hold.
• With NoSQL databases, storing data is much more casual:
A key-value store allows you to store any data you like under a key.
A document database effectively does the same thing, since it makes no restrictions on the structure of
the documents you store.
Column-family databases allow you to store any data under any column you like.
Graph databases allow you to freely add new edges and freely add properties to nodes and edges as
you wish.
• Schemalessness is appealing, and it certainly avoids many problems that exist with fixed-schema
databases, but it brings some problems of its own.
• Having an implicit schema in the application code results in some problems.
• A schemaless database shifts the schema into the application code that accesses it. This becomes
problematic if multiple applications, developed by different people, access the same database.
• Schemalessness does have a big impact on changes to a database's structure over time, particularly
for more uniform data. Although it's not practiced as widely as it ought to be, changing a relational
database's schema can be done in a controlled way.
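A small sketch of what an implicit schema in application code looks like: the database will store any map of fields, so the knowledge of which fields exist and what types they hold lives in code like this (the "quantity" field and the class name are hypothetical):

import java.util.Map;

class OrderReader {
    // The database imposes no structure; this code is where the "schema" actually lives.
    static int quantityOf(Map<String, Object> doc) {
        Object qty = doc.get("quantity");            // assumes a field named "quantity" exists
        if (qty instanceof Number n) {
            return n.intValue();
        }
        return 0;                                    // older documents may not have the field at all
    }
}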
Materialized Views:- Aggregate-oriented models have their own advantages, as well as a few disadvantages.
• Relational databases have an advantage here because their lack of aggregate structure allows them
to support accessing data in different ways.
• Views provide a mechanism to hide from the client whether data is derived data or base data.
• Materialized views are views that are computed in advance and cached on disk.
• There are two rough strategies for building a materialized view. The first is the eager approach, where
you update the materialized view at the same time you update the base data for it.
• The second is the lazy approach, where batch jobs update the materialized views at regular intervals.
• Materialized views can be used within the same aggregate.
• Using different column families for materialized views is a common feature of column-family databases.
An advantage of doing this is that it allows you to update the materialized view within the same atomic
operation.
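A sketch of the eager approach in plain Java: the summary (the materialized view) is updated in the same operation as the base data, whereas a lazy approach would recompute the summaries in a periodic batch job; all names here are illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SalesStore {
    record Sale(String productId, int quantity) {}

    private final List<Sale> sales = new ArrayList<>();               // base data: individual sale records
    private final Map<String, Integer> totalSold = new HashMap<>();   // materialized view: precomputed totals

    // Eager approach: the view is updated in the same operation as the base data
    synchronized void recordSale(String productId, int quantity) {
        sales.add(new Sale(productId, quantity));
        totalSold.merge(productId, quantity, Integer::sum);           // keep the summary in step
    }

    synchronized int totalFor(String productId) {
        return totalSold.getOrDefault(productId, 0);                  // readers use the precomputed total
    }
}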
4, 5. What is a distribution model? Explain the different models with a neat architectural diagram.
Distribution Model:- The primary driver of interest in NoSQL has been its ability to run databases on a
large cluster. As data volumes increase, it becomes more difficult and expensive to scale up, i.e., buy a
bigger server to run the database on. A more appealing option is to scale out: run the database on a
cluster of servers.
• Depending on your distribution model, you can get a data store that gives you the ability to handle
larger quantities of data and the ability to process greater read or write traffic.
• There are two paths to data distribution: replication and sharding.
• Replication takes the same data and copies it over multiple nodes.
• Sharding puts different data on different nodes.
• Replication and sharding are orthogonal techniques: we can use either or both of them.
• Replication comes in two forms: master-slave and peer-to-peer.
• Different techniques of data distribution models are:
Ø Single Server
Ø Sharding
Ø Master-Slave Replication
Ø Peer-to-Peer Replication
Single Server
• The first and simplest distribution option is the one we would most often recommend: no distribution
at all. Run the database on a single machine that handles all the reads and writes to the data store.
• Although a lot of NoSQL databases are designed around the idea of running on a cluster, it can make
sense to use NoSQL with a single-server distribution model if the data model of the NoSQL store is better
suited to the application.
• Graph databases are the obvious category here; these work best in a single-server configuration.
• If we can get away without distributing our data, we will always choose a single-server approach.
Sharding
• A busy data store is often busy because different people are accessing different parts of the dataset.
• In these circumstances we can support horizontal scalability by putting different parts of the data
onto different servers, a technique that's called sharding.
• In the ideal case, we have different users all talking to different server nodes. Each user only has to
talk to one server, and so gets rapid responses from that server. The load is balanced out nicely between
servers: for example, if we have ten servers, each one only has to handle 10% of the load.
• Several factors can help improve performance when sharding:
• Physical location: place the data close to where it is being accessed.
• Keep the load even: spread the data so that each server gets an even share of the load.
• Many NoSQL databases offer auto-sharding, where the database takes on the responsibility of
allocating data to shards
and ensuring that data access goes to the right shard. This can make it much easier to use sharding in
an application.
• Sharding is particularly valuable for performance because it can improve both read and write
performance. Sharding
provides a way to horizontally scale writes.
• Sharding does little to improve resilience when used alone. Although the data is on different nodes,
a node failure
makes that shard’s data unavailable just as surely as it does for a single-server solution. The resilience
benefit it does
provide is that only the users of the data on that shard will suffer; however, it’s not good to have a
database with part of
its data missing.
• Some databases are intended from the beginning to use sharding, in which case it's wise to run them on
a cluster from the very beginning of development, and certainly in production.
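A minimal sketch of hash-based shard routing, the kind of decision an auto-sharding database makes on the application's behalf; the node addresses are placeholders:

import java.util.List;

class ShardRouter {
    private final List<String> shardNodes;        // e.g. ["node-0:9042", "node-1:9042"], placeholders only

    ShardRouter(List<String> shardNodes) {
        this.shardNodes = shardNodes;
    }

    // The same key always maps to the same shard, so each user's data stays on one node
    String nodeFor(String key) {
        int shard = Math.floorMod(key.hashCode(), shardNodes.size());
        return shardNodes.get(shard);
    }
}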
Master-Slave Replication
• With master-slave distribution, you replicate data across multiple nodes. One node is designated as
the master, or primary.
This master is the authoritative source for the data and is usually responsible for processing any
updates to that data.
• The other nodes are slaves, or secondaries. A replication process synchronizes the slaves with the
master.
• Master-slave replication is most helpful for scaling when you have a read-intensive dataset.
• We can scale horizontally to handle more read requests by adding more slave nodes and ensuring
that all read requests
are routed to the slaves.
• A second advantage of master-slave replication is read resilience: if the master fails, the slaves can still
handle read requests. Again, this is useful if most of your data access is reads.
• The failure of the master does eliminate the ability to handle writes until either the master is restored or
a new master is appointed. However, having slaves as replicas of the master does speed up recovery after
a failure of the master, since a slave can be appointed as the new master very quickly.
• Masters can be appointed manually or automatically. Manual appointing typically means that when
you configure your
cluster, you configure one node as the master. With automatic appointment, you create a cluster of
nodes and they elect
one of themselves to be the master. Apart from simpler configuration, automatic appointment means
that the cluster can
automatically appoint a new master when a master fails, reducing downtime.
• The main drawback of master-slave replication is inconsistency: different clients, reading from different
slaves, may see different values because the changes haven't all propagated to the slaves yet.
• In the worst case, that can mean that a client cannot read a write it just made. Even if you use
master-slave replication
just for hot backup this can be a concern, because if the master fails, any updates not passed on to
the backup are lost.
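A rough sketch of how an application might route requests under master-slave replication: all writes go to the master and reads are spread across the slaves; connection handling is omitted and the names are illustrative:

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ReplicaRouter {
    private final String master;                 // authoritative node: receives all writes
    private final List<String> slaves;           // read-only replicas, synchronized from the master
    private final AtomicInteger next = new AtomicInteger();

    ReplicaRouter(String master, List<String> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    String writeNode() {
        return master;                           // writes must go through the master
    }

    String readNode() {
        // round-robin over the slaves; a just-written value may not have replicated yet
        return slaves.get(Math.floorMod(next.getAndIncrement(), slaves.size()));
    }
}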
Peer-to-Peer Replication
• Master-slave replication helps with read scalability but doesn’t help with scalability of writes.
• It provides resilience against failure of a slave, but not of a master.
• The master is still a bottleneck and a single point of failure. Peer-to-peer replication attacks these
problems by not having a master.
• All the replicas have equal weight, they can all accept writes, and the loss of any of them doesn't
prevent access to the data store.
Ans:
• The first stage in a map-reduce job is the map. A map is a function whose input is a single
aggregate and whose output is a bunch of key-value pairs.
• A map operation operates only on a single record; the reduce function takes multiple map outputs
with the same key and combines their values.
• Each application of the map function is independent of all the others. This allows them to be safely
parallelizable, so that a map-reduce framework can create efficient map tasks on each node and
freely allocate each order to a map task. This yields a great deal of parallelism and locality of data
access.
• The reduce function reduces the many key-value pairs emitted for a single product down to one, with
the totals for the quantity and revenue. While the map function is limited to working only on data from a
single aggregate, the reduce function can use all values emitted for a single key.
• The map-reduce framework arranges for map tasks to be run on the correct nodes to process all
the documents and for data to be moved to the reduce function.
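As a sketch of the idea in plain Java (no framework involved), a map function that emits one (productId, totals) pair per order line item, and a reduce function that combines everything emitted for a single product; the record names and fields are illustrative:

import java.util.List;
import java.util.Map;

class MapReduceSketch {
    record LineItem(String productId, int quantity, double price) {}
    record Order(String orderId, List<LineItem> items) {}
    record Totals(int quantity, double revenue) {}

    // Map: sees one aggregate (an order) at a time and emits key-value pairs
    static List<Map.Entry<String, Totals>> map(Order order) {
        return order.items().stream()
                .map(i -> Map.entry(i.productId(), new Totals(i.quantity(), i.quantity() * i.price())))
                .toList();
    }

    // Reduce: sees all values emitted for one key and combines them into a single total
    static Totals reduce(String productId, List<Totals> values) {
        int qty = 0;
        double revenue = 0;
        for (Totals t : values) {
            qty += t.quantity();
            revenue += t.revenue();
        }
        return new Totals(qty, revenue);
    }
}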
7. What are the phases of a MapReduce program? Explain in detail.
Ans.
1. InputFiles
The data that is to be processed by the MapReduce task is stored in input files. These
input files are stored in the Hadoop Distributed File System (HDFS). The file format is
arbitrary; line-based log files and binary formats can also be used.
2. InputFormat
It specifies the input specification for the job. InputFormat validates the MapReduce
job's input specification and splits up the input files into logical InputSplit instances.
Each InputSplit is then assigned to an individual Mapper. TextInputFormat is the
default InputFormat.
3. InputSplit
It represents the data to be processed by an individual Mapper. An InputSplit typically
presents a byte-oriented view of the input; it is the RecordReader's responsibility to
process it and present a record-oriented view. The default InputSplit is the FileSplit.
4. RecordReader
RecordReader reads the <key, value> pairs from the InputSplit. It converts the byte-
oriented view of the input into a record-oriented view and presents it to the Mapper
implementations for processing.
It is responsible for respecting record boundaries and presenting the Map tasks with
keys and values: the RecordReader breaks the data into <key, value> pairs for input
to the Mapper.
5. Mapper
Mapper maps the input <key, value> pairs to a set of intermediate <key, value> pairs.
It processes the input records from the RecordReader and generates new <key,
value> pairs, which may be different from the input <key, value> pairs.
The generated <key, value> pairs are the output of the Mapper, known as the
intermediate output. These intermediate outputs are written to the local disk.
The Mappers' output is not stored on the Hadoop Distributed File System because it is
temporary data, and writing it to HDFS would create unnecessary copies.
The output of the Mappers is then passed to the Combiner for further processing.
6. Combiner
It is also known as the ‘Mini-reducer’. Combiner performs local aggregation on the
output of the Mappers. This helps in minimizing data transfer between the Mapper
and the Reducer.
After the execution of the Combiner function, the output is passed to the Partitioner
for further processing.
7. Partitioner
The Partitioner comes into the picture only when the MapReduce job is configured with
more than one Reducer; with a single Reducer, no Partitioner is needed. Based on the
key, the Partitioner decides which Reducer each intermediate <key, value> pair is
sent to.
8. Shuffling and Sorting
The intermediate outputs of the Mappers are shuffled (transferred) to the nodes
running the Reducers and sorted by key before being presented to the Reducer.
9. Reducer
The Reducer reduces the set of intermediate values that share a key to a smaller set of
values. The output of the Reducer is the final output, which is stored in the Hadoop
Distributed File System.
10. RecordWriter
RecordWriter writes the output (key, value pairs) of Reducer to an output file. It
writes the MapReduce job outputs to the FileSystem.
11. OutputFormat
The OutputFormat specifies the way in which these output key-value pairs are written
to the output files. It validates the output specification for a MapReduce job.
OutputFormat basically provides the RecordWriter implementation used for writing
the output files of the MapReduce job. The output files are stored in a FileSystem.
Hence, in this manner, MapReduce works over the Hadoop cluster in different phases.
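These phases correspond to pluggable pieces of the standard org.apache.hadoop.mapreduce Java API. The classic word-count job below is a sketch of how they are wired together; the WordMapper and SumReducer classes are this example's own, not part of the original text:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {

    // Mapper: the RecordReader hands it one <offset, line> pair at a time; it emits <word, 1>
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);            // intermediate <key, value> pair
            }
        }
    }

    // Reducer (also reused as the Combiner): sums all values that share a key
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);    // InputFormat: validates input, creates InputSplits
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(SumReducer.class);            // Combiner: local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setNumReduceTasks(2);                          // more than one Reducer, so a Partitioner is involved
        job.setOutputFormatClass(TextOutputFormat.class);  // OutputFormat/RecordWriter: writes the final pairs

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input files, typically on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // final output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}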