Module 5_NoSQL databases
What is NoSQL?
• A NoSQL database is a non-relational data management system that does not require
a fixed schema.
• It avoids joins, and is easy to scale.
• The major purpose of using a NoSQL database is for distributed data stores with
humongous data storage needs.
• NoSQL is used for Big data and real-time web apps. For example, companies like
Twitter, Facebook and Google collect terabytes of user data every single day.
• NoSQL database stands for “Not Only SQL” or “Not SQL.”
• Carlo Strozzi introduced the term NoSQL in 1998.
• A traditional RDBMS uses SQL syntax to store and retrieve data for further insights.
• By contrast, NoSQL covers a wide range of database technologies that can store
structured, semi-structured, unstructured and polymorphic data.
Types of Databases
Why NoSQL?
• The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data.
• The system response time becomes slow when you use RDBMS for massive volumes
of data.
• To resolve this problem, we could “scale up” our systems by upgrading our existing
hardware. This process is expensive.
• The alternative for this issue is to distribute database load on multiple hosts
whenever the load increases. This method is known as “scaling out.”
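Scaling out can be illustrated with a minimal sketch. It assumes a hypothetical list of host names and a simple hash-modulo placement rule; production systems often use consistent hashing instead, so treat this only as an illustration of spreading keys over several hosts.

```python
import hashlib

# Hypothetical host names, used only for illustration.
HOSTS = ["db-host-0", "db-host-1", "db-host-2"]


def host_for(key: str, hosts: list[str]) -> str:
    """Pick the host that stores `key` using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]


# Place 1000 keys and count how many land on each host.
keys = [f"user:{i}" for i in range(1000)]
placement = {}
for k in keys:
    placement.setdefault(host_for(k, HOSTS), []).append(k)

for host in HOSTS:
    print(host, len(placement.get(host, [])))
```

Adding a host to HOSTS spreads the same keys over more machines, which is the idea behind scaling out. With plain modulo placement most keys move when the host count changes, which is one reason real systems prefer other placement schemes.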
Brief History of NoSQL Databases
•1998- Carlo Strozzi uses the term NoSQL for his lightweight, open-source relational
database
•2000- Graph database Neo4j is launched
•2004- Google BigTable is launched
•2005- CouchDB is launched
•2007- The research paper on Amazon Dynamo is released
•2008- Facebook open-sources the Cassandra project
•2009- The term NoSQL was reintroduced
Features of NoSQL
• Non-relational
• Schema-free
• Simple API
• Distributed
Non-relational
ACID
Schema-free
Simple API
• Offers easy-to-use interfaces for storing and querying data.
• Key-value databases store data as a hash table in which each key is unique and the
value can be JSON, a BLOB (Binary Large Object), a string, etc.
• For example, a key-value pair may contain a key like “Website” associated with a
value like “Karunya”.
Key-Value Pair
• The key-value store is one of the most basic kinds of NoSQL database. It is used
much like a collection, dictionary or associative array. Key-value stores let the
developer store schema-less data. They work well for data such as shopping cart contents.
• Redis, Dynamo and Riak are examples of key-value stores. Dynamo and Riak follow
the design described in Amazon’s Dynamo paper; Redis is an in-memory store with a
different design.
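A minimal sketch of the key-value model, using a plain Python dictionary as the hash table. The class name KeyValueStore and the keys "cart:42" and "avatar:42" are made up for illustration and do not correspond to the API of Redis, Dynamo or Riak.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store: unique keys mapped to opaque values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Writing an existing key overwrites its value, so keys stay unique.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("Website", "Karunya")                      # string value
store.put("cart:42", json.dumps({"items": ["pen", "notebook"]}))  # JSON value
store.put("avatar:42", b"\x00\x01\x02")              # binary (BLOB-like) value

print(store.get("Website"))
print(json.loads(store.get("cart:42"))["items"])
```

The store never inspects the value. It only looks entries up by key, which is why the values can have any shape.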
Column-based
• Column-oriented databases work on columns and are based on the BigTable paper by
Google. Every column is treated separately. The values of a single column are
stored contiguously.
• They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN
etc., as the data is readily available in a column.
• HBase, Cassandra and Hypertable are examples of column-based databases.
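The difference between row and column layout can be sketched in a few lines. The sample rows and field names below are invented for illustration; the point is only that an aggregate reads one contiguous list instead of every full row.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "city": "Chennai", "amount": 120},
    {"id": 2, "city": "Mumbai", "amount": 80},
    {"id": 3, "city": "Chennai", "amount": 200},
]

# Column-oriented layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregations touch only the "amount" column.
amounts = columns["amount"]
total = sum(amounts)        # SUM
count = len(amounts)        # COUNT
average = total / count     # AVG
minimum = min(amounts)      # MIN

print(total, count, average, minimum)
```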
Document-Oriented
• A document-oriented NoSQL database stores and retrieves data as a key-value pair,
but the value part is stored as a document, in a format such as JSON or XML.
The database understands the value, so it can be queried.
• In the diagram, the left side shows a relational table with rows and columns, and
the right side shows a document database whose structure is similar to JSON. With a
relational database, you have to know in advance which columns you have. In a
document database, data is stored as JSON-like objects, and you do not need to
define the structure up front, which makes it flexible.
• The document type is mostly used for CMS systems, blogging platforms, real-time
analytics and e-commerce applications. It should not be used for complex transactions
that require multiple operations or queries against varying aggregate structures.
• Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular
document-oriented DBMS systems.
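A minimal sketch of the document model, assuming an in-memory dictionary as the collection. The helper find and the sample documents are hypothetical and are not the query API of MongoDB or CouchDB; they only show that documents in one collection can have different fields and still be queried by field.

```python
import json

# One collection; each document has its own shape (no fixed schema).
collection = {
    "post:1": {"type": "post", "title": "Intro to NoSQL", "tags": ["nosql", "db"]},
    "post:2": {"type": "post", "title": "CAP theorem", "author": {"name": "Asha"}},
    "product:7": {"type": "product", "name": "Keyboard", "price": 25.0},
}


def find(coll, **criteria):
    """Return documents whose top-level fields match every criterion."""
    return [
        doc for doc in coll.values()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


posts = find(collection, type="post")
print(json.dumps(posts, indent=2))
```

Because the store understands the document's fields, it can filter on type or name, which a pure key-value store cannot do.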
Graph-Based
• A graph database stores entities as well as the relations among those entities.
Each entity is stored as a node, and each relationship as an edge. An edge describes
a relationship between nodes. Every node and edge has a unique identifier.
• Compared to a relational database, where tables are loosely connected, a graph
database is multi-relational in nature. Traversing relationships is fast because they
are already stored in the database and do not need to be computed at query time.
• Graph databases are mostly used for social networks, logistics and spatial data.
• Neo4j, InfiniteGraph, OrientDB and FlockDB are some popular graph-based databases.
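A minimal sketch of nodes, edges and traversal, assuming a tiny "FOLLOWS" social graph invented for illustration. It uses plain Python structures and a breadth-first search; it is not the query language of Neo4j or any other product.

```python
from collections import deque

# Nodes and edges each carry a unique identifier.
nodes = {1: {"name": "Alice"}, 2: {"name": "Bob"},
         3: {"name": "Carol"}, 4: {"name": "Dave"}}
edges = [
    {"id": "e1", "from": 1, "to": 2, "type": "FOLLOWS"},
    {"id": "e2", "from": 2, "to": 3, "type": "FOLLOWS"},
    {"id": "e3", "from": 1, "to": 4, "type": "FOLLOWS"},
]

# Relationships are stored up front, so traversal is a direct lookup.
adjacency = {}
for edge in edges:
    adjacency.setdefault(edge["from"], []).append(edge["to"])


def reachable_within(start, max_hops):
    """Breadth-first search: node ids reachable from `start` in at most `max_hops` edges."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, hops + 1))
    return found


print([nodes[n]["name"] for n in reachable_within(1, 2)])
```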
Query Mechanism tools for NoSQL
• The most common data retrieval mechanism is REST-based retrieval of a value
by its key/ID with a GET request on the resource.
• Document stores support more complex queries because they understand the value
in a key-value pair. For example, CouchDB allows defining views with MapReduce.
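The idea behind a MapReduce view can be sketched as follows. This is plain Python written for illustration, not CouchDB's actual view API (CouchDB views are normally written as JavaScript functions); the names map_doc, reduce_values and build_view and the sample order documents are assumptions.

```python
from collections import defaultdict

# Sample documents, invented for illustration.
docs = [
    {"_id": "o1", "customer": "asha", "total": 30},
    {"_id": "o2", "customer": "ravi", "total": 12},
    {"_id": "o3", "customer": "asha", "total": 8},
]


def map_doc(doc):
    """Map step: emit (key, value) pairs for one document."""
    yield doc["customer"], doc["total"]


def reduce_values(values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def build_view(documents):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


view = build_view(docs)
print(view)
```

The resulting view maps each customer to the sum of their order totals, which is the kind of derived index a document store can maintain because it understands the fields inside each value.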
• Before 2012, users could write MapReduce programs in programming languages such
as Java, Python, and Ruby.
• They could also use Pig, a language used to transform data. No matter what language
was used, its implementation depended on the MapReduce processing model.
• In May 2012, during the release of Hadoop version 2.0, YARN was introduced.
• Hadoop is no longer limited to the MapReduce framework, as YARN supports
multiple processing models in addition to MapReduce, such as Spark.
• Other features of YARN include significant performance improvement and a flexible
execution engine.
What is the CAP Theorem?
• The CAP theorem is also called Brewer’s theorem. It states that it is impossible for a
distributed data store to offer more than two out of three guarantees:
• Consistency
• Availability
• Partition Tolerance
• Consistency:
• The data should remain consistent even after the execution of an operation. This
means that once data is written, any future read request should return that data. For
example, after updating the order status, all the clients should be able to see the
same data.
• Availability:
• The database should always be available and responsive. It should not have any
downtime.
• Partition Tolerance:
• Partition Tolerance means that the system should continue to function even if the
communication among the servers is not stable. For example, the servers can be
partitioned into multiple groups which may not communicate with each other. Here,
if part of the database is unavailable, other parts are always unaffected.
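The trade-off can be sketched with a toy two-replica cluster. The class names and the "CP"/"AP" mode labels are assumptions made for this illustration, and the model is deliberately simplified (one writer, two replicas, a manual partition flag). During a partition, the "CP" cluster refuses writes to stay consistent, while the "AP" cluster stays available but lets the replicas diverge.

```python
class Replica:
    def __init__(self):
        self.data = {}


class TinyCluster:
    """Two replicas; `mode` is "CP" or "AP" (labels assumed for this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when A and B cannot communicate

    def write(self, key, value):
        """Client writes through replica A."""
        if self.partitioned:
            if self.mode == "CP":
                # Give up availability to keep both replicas identical.
                raise RuntimeError("unavailable: cannot reach replica B")
            # "AP": accept the write locally; B is now stale.
            self.a.data[key] = value
            return
        self.a.data[key] = value
        self.b.data[key] = value

    def read_from_b(self, key):
        return self.b.data.get(key)


cp = TinyCluster("CP")
cp.write("order", "placed")
cp.partitioned = True
try:
    cp.write("order", "shipped")
    cp_rejected = False
except RuntimeError:
    cp_rejected = True

ap = TinyCluster("AP")
ap.write("order", "placed")
ap.partitioned = True
ap.write("order", "shipped")

print(cp_rejected, cp.read_from_b("order"))
print(ap.a.data["order"], ap.read_from_b("order"))
```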
Eventual Consistency
• Eventual consistency arises when copies of data are kept on multiple machines
to get high availability and scalability. Changes made to any data item on one
machine have to be propagated to the other replicas.
• Data replication may not be instantaneous, as some copies will be updated
immediately while others are updated in due course of time. These copies may be
mutually inconsistent for a while, but in due course of time they become consistent.
Hence the name eventual consistency.
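A minimal sketch of this behaviour, assuming three in-memory replicas and a queue of pending updates that are applied one at a time. The class and method names are hypothetical; real systems propagate changes over the network and must also resolve conflicting writes, which this sketch ignores.

```python
from collections import deque


class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class EventuallyConsistentStore:
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.pending = deque()  # updates not yet applied to other replicas

    def write(self, key, value):
        """Apply the write to the first replica now; queue it for the rest."""
        primary, *others = self.replicas
        primary.data[key] = value
        for replica in others:
            self.pending.append((replica, key, value))

    def propagate_one(self):
        """Deliver a single queued update, if any."""
        if self.pending:
            replica, key, value = self.pending.popleft()
            replica.data[key] = value

    def read_all(self, key):
        return [r.data.get(key) for r in self.replicas]


store = EventuallyConsistentStore(["r1", "r2", "r3"])
store.write("status", "shipped")
before = store.read_all("status")   # replicas disagree
store.propagate_one()
during = store.read_all("status")   # partially propagated
store.propagate_one()
after = store.read_all("status")    # all replicas agree

print(before, during, after)
```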