Cassandra Unit 4
Components of Cassandra
The key components of Cassandra are as follows −
Node − It is the place where data is stored.
Data center − It is a collection of related nodes.
Cluster − A cluster is a component that contains one or more data centers.
Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every
write operation is written to the commit log.
Mem-table − A mem-table is a memory-resident data structure. After the write is
recorded in the commit log, the data is written to the mem-table. Sometimes a
single column family will have multiple mem-tables.
SSTable − It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
7.2 Features of Cassandra
1. Peer-to-peer network
There is no master-slave architecture, so there is no single point of failure. All
the nodes in the network are identical. If a node fails or goes offline, throughput
is affected, but a single node failure does not crash the entire system, and work
can be carried out as usual. Cassandra is a peer-to-peer distributed system
across homogeneous nodes. It ensures that data is distributed across all nodes in the
cluster. Each node exchanges information across the cluster every second.
Let us see how a Cassandra node writes. Each write is appended sequentially to the
commit log. A write is successful only if it is written to the commit log. Data is then
indexed and pushed to an in-memory structure called the Memtable. When the
Memtable is full, its contents are flushed to an SSTable (Sorted String Table) data
file on disk. An SSTable is immutable and append-only: once written, it is never
modified. It is stored on disk sequentially and is maintained for each Cassandra table.
The partitioning and replication of all writes are performed automatically across the
cluster.
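The write path described above can be sketched in a few lines of Python. This is only a toy model, not Cassandra's implementation: the class name Node, the entry-count threshold memtable_limit, and the clearing of the commit log after a flush are simplifying assumptions made for illustration.

```python
class Node:
    """Toy model of one node's write path (not Cassandra's real code)."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only list standing in for the on-disk commit log
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable, sorted snapshots flushed from the memtable
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (entry count)

    def write(self, key, value):
        # 1. Append to the commit log first so the write can be recovered after a crash.
        self.commit_log.append((key, value))
        # 2. Apply the write to the memtable.
        self.memtable[key] = value
        # 3. Flush once the memtable reaches its threshold.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Simplification: flushed data no longer needs the log for recovery.
        self.commit_log.clear()


node = Node(memtable_limit=3)
for key, value in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    node.write(key, value)
```

After the third write the memtable is flushed as one sorted SSTable; the fourth write sits only in the commit log and the new memtable.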
3. Partitioner
A partitioner determines how data is distributed across the nodes of a cluster.
It also determines the node on which to place the very first copy of the data. A
partitioner is a hash function that computes the token of the partition key. The
partition key is part of the primary key, so it also helps to identify a row uniquely.
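As a rough illustration of how a token maps a partition key to a node, here is a minimal Python sketch. It assumes MD5 as a stand-in hash, one token per node, and node tokens derived by hashing hypothetical node names; real clusters assign tokens differently.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (MD5 is only a stand-in for this sketch)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(node_names):
    """Give each node one token (an assumption) and sort the ring by token."""
    return sorted((token_for(name), name) for name in node_names)


def owner_of(ring, partition_key: str) -> str:
    """Return the first node whose token is >= the key's token, wrapping around."""
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(partition_key))
    return ring[index % len(ring)][1]


ring = build_ring(["node-a", "node-b", "node-c"])
first_replica = owner_of(ring, "roll-1")
```

The same key always hashes to the same token, so every coordinator in the cluster agrees on which node holds the first copy.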
4. Replication factor: It is the number of machines in the cluster that will receive copies
of the same data. A replication strategy is used to place replicas in the ring. The
strategies include simple strategy (rack-unaware strategy) and network topology
strategy (datacenter-shared strategy). Network topology strategy is preferred
since it is easy to expand to multiple data centers.
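A minimal sketch of replica placement under the simple strategy, assuming the ring is already given as a list of node names in token order: the first replica goes to the node chosen by the partitioner and the remaining replicas go to the next nodes clockwise. The function name place_replicas and the node names are hypothetical.

```python
def place_replicas(ring, first_index, replication_factor):
    """Pick replica nodes by walking clockwise from the first replica."""
    # A cluster cannot hold more distinct copies than it has nodes.
    count = min(replication_factor, len(ring))
    return [ring[(first_index + step) % len(ring)] for step in range(count)]


ring = ["node-a", "node-b", "node-c", "node-d"]  # nodes in token order (hypothetical)
replicas = place_replicas(ring, first_index=2, replication_factor=3)
```

With a replication factor of 3 and the first copy on node-c, the walk wraps around the ring and ends on node-a.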
5. Anti-entropy and read repair: Cassandra is fault tolerant because data is replicated
on one or more nodes. A client can connect to any node to read data. If the replicas
contacted for a read are not consistent, the read operation blocks, because a few of
the nodes may respond with an out-of-date value. In such a case, Cassandra initiates
a read repair operation to bring the replicas with old values up to date.
Anti-entropy means comparing all the replicas of each piece of data and updating
each replica with the newest version.
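Read repair can be pictured with the following Python sketch. It assumes each replica is a dictionary mapping a key to a (timestamp, value) pair and that the larger timestamp wins; the function name and sample data are illustrative only.

```python
def read_with_repair(replicas, key):
    """Return the newest value for `key` and overwrite stale replicas with it."""
    # Collect the versions held by the replicas that have the key.
    responses = [replica[key] for replica in replicas if key in replica]
    # The version with the largest timestamp is treated as the newest.
    newest = max(responses, key=lambda versioned: versioned[0])
    repaired = 0
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # bring the out-of-date replica up to date
            repaired += 1
    return newest[1], repaired


replica_1 = {"roll-1": (200, "SURAJ K")}
replica_2 = {"roll-1": (100, "SURAJ")}   # out-of-date copy
replica_3 = {"roll-1": (200, "SURAJ K")}
value, repaired = read_with_repair([replica_1, replica_2, replica_3], "roll-1")
```

Here one replica holds an older version, so the read returns the newer value and repairs that single replica.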
6. Writes in Cassandra:
When a write request comes to a node, it is first logged in the commit log. Then
Cassandra writes the data to the mem-table. Every write applied to the mem-table
is therefore also recorded separately in the commit log. The mem-table holds data
temporarily in memory, while the commit log keeps the transaction records for
recovery purposes. When the mem-table is full, data is flushed to an SSTable data
file. An SSTable is immutable and append-only.
7. Hinted handoffs
Hinting is a data repair technique applied during write operations. When replica
nodes are unavailable to accept a mutation, either due to failure or more commonly
routine maintenance, coordinators attempting to write to those replicas store
temporary hints on their local filesystem for later application to the unavailable
replica. Hints are an important way to help reduce the duration of data
inconsistency. Hinted handoff is the process by which Cassandra applies hints to
unavailable nodes.
Whenever a node becomes unresponsive or down for a certain period of time, all the
write requests to that node will fail. Cassandra keeps a copy of the partitions that
were supposed to be written to the unresponsive node within a local hints table. The
hints are stored for a certain configurable period of time. When the downed node
comes up and starts gossiping with the rest of the nodes in the cluster, the hints are
replayed to the downed node, and data is back in sync between all the replica
nodes.
Assume there is a cluster of three nodes, A, B and C, with replication factor 2. Node C
is down for some reason. A client makes a write request to node A. Node A is the
coordinator and serves as a proxy between the client and the nodes on which the
replicas are placed. The client writes Row X to node A. Node A writes Row X to node B
and stores a hint for node C. The hint contains the following information.
1. Location of the node on which the replica is to be placed.
2. Version metadata
3. The actual data.
When Node C recovers and is functional again, Node A acts on the hint by
forwarding the data to Node C.
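The three-node example above can be sketched in Python as follows. The classes Replica and Coordinator are hypothetical, hints are kept in an in-memory list rather than on the local filesystem, and the hint expiry period is left out.

```python
class Replica:
    """A replica node that may be up or down (toy model)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.rows = {}

    def apply(self, key, value, timestamp):
        self.rows[key] = (timestamp, value)


class Coordinator:
    """Coordinator that stores hints for replicas it cannot reach."""

    def __init__(self, name):
        self.name = name
        self.hints = []  # each hint: (target replica, version metadata, key, data)

    def write(self, key, value, timestamp, replicas):
        for replica in replicas:
            if replica.up:
                replica.apply(key, value, timestamp)
            else:
                # Replica is down: remember where the data must go, its version, and the data.
                self.hints.append((replica, timestamp, key, value))

    def replay_hints(self):
        pending = []
        for replica, timestamp, key, value in self.hints:
            if replica.up:
                replica.apply(key, value, timestamp)
            else:
                pending.append((replica, timestamp, key, value))
        self.hints = pending


node_a = Coordinator("A")
node_b, node_c = Replica("B"), Replica("C")
node_c.up = False                      # node C is down
node_a.write("row-x", "value-x", timestamp=1, replicas=[node_b, node_c])
hints_while_down = len(node_a.hints)   # one hint is waiting for node C
node_c.up = True                       # node C recovers
node_a.replay_hints()
```

Once the hint is replayed, node B and node C hold the same version of Row X and the coordinator has no hints left.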
8. Tunable consistency
Cassandra supports both strong consistency and eventual consistency.
In strong consistency, each update propagates to all locations where that piece of
data resides. All the servers that should have a copy of the data will have it
before the client is acknowledged with a success.
In eventual consistency, the client is acknowledged with a success as soon as a part
of the cluster acknowledges the write.
Read consistency: how many replicas must respond before the result is sent
to the client application.
Write consistency: how many replicas must complete the write before the
acknowledgement is sent to the client application.
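A short Python sketch of how read and write consistency combine. It relies on two standard rules: a quorum is floor(RF / 2) + 1 replicas, and a read is guaranteed to see the latest acknowledged write when the read and write replica counts together exceed the replication factor. The function names are illustrative.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """Read and write sets must overlap in at least one replica: R + W > RF."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
quorum_reads_and_writes = is_strongly_consistent(q, q, rf)
one_read_one_write = is_strongly_consistent(1, 1, rf)
```

With a replication factor of 3, quorum reads plus quorum writes overlap in at least one replica, while reading and writing a single replica each gives only eventual consistency.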
Cassandra Query Language
Users can access Cassandra through its nodes using Cassandra Query Language (CQL).
CQL treats the database (keyspace) as a container of tables. Programmers use cqlsh, a
prompt for working with CQL, or separate application language drivers.
Clients approach any of the nodes for their read-write operations. That node (the
coordinator) acts as a proxy between the client and the nodes holding the data.
Write Operations
Every write on a node is first captured in the commit log kept on that node. The
data is then stored in the mem-table. Whenever the mem-table is full, data is
written to an SSTable data file. All writes are automatically partitioned and
replicated throughout the cluster. Cassandra periodically consolidates the SSTables,
discarding unnecessary data.
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom
filter to find the appropriate SSTable that holds the required data.
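The read path can be sketched in Python with a small Bloom filter. This is an illustration under assumptions: the filter size, the use of SHA-256 with seeds, and the sample rows are all hypothetical. A Bloom filter can report false positives but never false negatives, so it is safe to skip an SSTable when the filter says the key is absent.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by an integer bit set."""

    def __init__(self, size_bits=64, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = 0

    def _positions(self, key):
        # Derive several bit positions from seeded hashes of the key.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> position) & 1 for position in self._positions(key))


def read(key, memtable, sstables):
    """Check the memtable first, then only the SSTables whose filter may hold the key."""
    if key in memtable:
        return memtable[key]
    for bloom, rows in reversed(sstables):  # newest SSTable first
        if bloom.might_contain(key) and key in rows:
            return rows[key]
    return None


rows = {"roll-1": "SURAJ"}
bloom = BloomFilter()
for stored_key in rows:
    bloom.add(stored_key)
memtable = {"roll-2": "ANU"}
sstables = [(bloom, rows)]
from_sstable = read("roll-1", memtable, sstables)
from_memtable = read("roll-2", memtable, sstables)
missing = read("roll-9", memtable, sstables)
```

The filter only narrows down which SSTables to open; the final lookup in the SSTable itself decides whether the key is really there.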
7.5 Keyspaces
A keyspace is an object that holds column families and user-defined types.
In Cassandra, a keyspace is similar to a database in an RDBMS. A keyspace
holds column families, indexes, user-defined types, data center awareness,
the replication strategy used in the keyspace, the replication factor, etc.
1. Simple Strategy: Simple strategy is used when you have just one data
center. In this strategy, the first replica is placed on the node selected
by the partitioner.
DESCRIBE TABLES;
INSERT DATA:
INSERT INTO STUDENT_INFO(ROLL, STDNAME, DATEOFJOIN, LASTEXAMPERCENT)
VALUES(1, 'SURAJ', '2018-03-22', 77.6);
Queries in CQL:
To delete a row:
DELETE FROM student_info WHERE roll = 2;
Objective: to use ALLOW FILTERING with the SELECT statement when
searching a range of rows. To execute a query despite its unpredictable
performance, append ALLOW FILTERING. An ORDER BY clause, if used, must
name a clustering column and come before ALLOW FILTERING.
SELECT * FROM project WHERE pname = 'dbms' ALLOW FILTERING;
7.7 Collections
Cassandra collections are a convenient way to store multiple elements in a
single column. There are limitations on Cassandra collections.
Cassandra Set
A Set stores a group of elements and returns them in sorted order when queried.
Syntax
Here is the syntax of the Set collection that stores multiple email addresses for a
teacher.
Cassandra List
When the order of elements matters, the list is used.
Cassandra Map
The map is a collection type that is used to store key-value pairs. As its name
implies, it maps one thing to another. It is often used to store timestamp-related
information. Each element of the map is stored as a Cassandra column. Each
element can be individually queried, modified, and deleted.
For example, if you want to save a course name with its prerequisite course name,
a map collection can be used.
To load data into a counter column, or to increase or decrease the value of the counter, use the
UPDATE command. Cassandra rejects USING TIMESTAMP or USING TTL in the command to
update a counter column.
Procedure
Create a table for the counter column.
cqlsh> USE cycling;
CREATE TABLE popular_count (id UUID PRIMARY KEY, popularity counter);
Loading data into a counter column is different than other tables. The data is
updated rather than inserted.
UPDATE cycling.popular_count
SET popularity = popularity + 1
WHERE id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;
Additional increments or decrements will change the value of the counter column.
UPDATE cycling.popular_count
SET popularity = popularity + 1
WHERE id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;
The USING TTL keywords can be used to insert data into a table for a specific duration,
given in seconds. To determine the current time-to-live for a record, use the TTL function.
INSERT INTO userlogin (userid, password) VALUES (1, '87^4') USING TTL 30;
Ex: Change data type of the column sampleid to int from text.
EXPORT TO CSV