Cassandra Unit 4


Unit 4

7.1 Introduction to Cassandra


1. It is a NoSQL database. Cassandra is a high-performance distributed database from
Apache that is highly scalable and designed to manage very large amounts of
structured data. It provides high availability with no single point of failure. There is no
master-slave architecture.
2. It distributes and manages very large amounts of data across commodity servers.
3. It is a column-oriented database designed around peer-to-peer symmetric nodes
instead of a master-slave architecture.
4. It adheres to the A (availability) and P (partition tolerance) properties of the CAP
theorem and handles consistency using the BASE approach.
Companies such as Twitter, Netflix, Cisco, Adobe, eBay, and Rackspace have successfully
deployed Cassandra.

Cassandra has a distributed architecture that is capable of handling a huge amount of
data. Data is placed on different machines with a replication factor greater than one to
attain high availability without a single point of failure.

Cassandra is open source, distributed, decentralized (no single point of failure),
column-oriented, peer-to-peer, and elastically scalable.

Components of Cassandra
The key components of Cassandra are as follows −
• Node − It is the place where data is stored.
• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every
write operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After the commit log,
the data is written to the mem-table. Sometimes, for a single column family,
there will be multiple mem-tables.
• SSTable − It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
7.2 Features of Cassandra
1. Peer-to-peer network
There is no master-slave architecture, so there is no single point of failure. All
the nodes in the network are identical. If a node fails or goes offline, throughput
is affected, but the failure of a single node does not bring down the entire system,
and work can be carried out as usual. Cassandra is a peer-to-peer distributed system
across homogeneous nodes. It ensures that data is distributed across all nodes in the
cluster. Each node exchanges information across the cluster every second.
Let us see how a Cassandra node writes. Each write is written to the commit log
sequentially. A write is successful only if it is written to the commit log. Data is then
indexed and pushed to an in-memory structure called the Memtable. When the
Memtable is full, its contents are flushed to an SSTable (Sorted String Table) data
file on disk. The SSTable is immutable and append-only. It is stored on disk
sequentially and is maintained for each Cassandra table. The partitioning and
replication of all writes are performed automatically across the cluster.
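The write path described above (commit log, then Memtable, then flush to an SSTable) can be sketched in a few lines of Python. This is only a toy model for study purposes: the class name `Node`, the `memtable_limit` threshold, and the use of a row count as the flush trigger are assumptions made for illustration, not how Cassandra is implemented.

```python
class Node:
    """Toy model of one node's write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only log used for crash recovery
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable, key-sorted "files" flushed from the memtable
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (row count)

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. sequential append to the commit log
        self.memtable[key] = value            # 2. update the in-memory structure
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the threshold is reached

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


node = Node(memtable_limit=2)
node.write("roll:2", "Sonu")
node.write("roll:1", "Suraj")  # second write reaches the threshold and triggers a flush
node.write("roll:3", "Bhuvi")  # stays in the memtable until the next flush
```

After these three writes the commit log holds all three entries, one sorted SSTable holds the first two rows, and the third row is still in the Memtable.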

2. Gossip and failure detection

The gossip protocol is used for intra-ring communication. It is a peer-to-peer
communication protocol that eases the discovery and sharing of location and state
information with other nodes in the cluster. At its core it is a simple and robust system:
a node only has to send its messages to a subset of the other nodes. To repair
unread nodes, Cassandra uses an anti-entropy version of the gossip protocol.
In gossip, nodes periodically exchange state information about themselves and about
other nodes they know about. The gossip protocol runs every second and exchanges
state messages with up to three other nodes in the cluster.
Because nodes exchange information about themselves and about the other nodes
that they have gossiped about, all nodes quickly learn about the other nodes in the
cluster.
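To see why gossip spreads cluster knowledge quickly, the following Python sketch simulates rounds in which every node exchanges what it knows with up to three random peers. It is a simplified illustration: real gossip messages carry versioned state, whereas here each node's "state" is just the set of node names it has heard of, and the function names, the seed, and the `max_rounds` cap are assumptions.

```python
import random


def gossip_round(views, rng, fanout=3):
    """One round: every node exchanges its view with up to `fanout` random peers."""
    names = list(views)
    for name in names:
        others = [n for n in names if n != name]
        peers = rng.sample(others, k=min(fanout, len(others)))
        for peer in peers:
            merged = views[name] | views[peer]  # both sides end up with the union
            views[name] = merged
            views[peer] = set(merged)


def rounds_until_converged(node_names, seed=0, max_rounds=100):
    """Return the number of rounds until every node knows every other node."""
    rng = random.Random(seed)
    views = {n: {n} for n in node_names}  # each node starts knowing only itself
    everyone = set(node_names)
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == everyone for view in views.values()):
            return round_no
    return None  # did not converge within max_rounds


rounds_needed = rounds_until_converged(["A", "B", "C", "D", "E", "F"])
```

With a handful of nodes, `rounds_needed` is typically a very small number, which matches the claim that all nodes quickly learn about each other.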

3. Partitioner
A partitioner determines how data is distributed across the nodes in a cluster.
It also determines the node on which to place the very first copy of the data. A partitioner
is a hash function that computes the token of the partition key. The partition key helps
to identify a row uniquely.
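As a rough illustration of the partitioner idea, the sketch below hashes a partition key to a token and finds the node that owns it on a ring. It uses MD5 in the spirit of Cassandra's RandomPartitioner, but the 128-bit token range, the `Ring` class, and the node tokens are simplifying assumptions; current Cassandra versions default to a Murmur3-based partitioner.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (MD5 digest read as an unsigned integer)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


class Ring:
    """Nodes placed on a ring by token; a key belongs to the next node clockwise."""

    def __init__(self, node_tokens):
        # node_tokens maps a node name to its (hypothetical) token
        self.entries = sorted((tok, name) for name, tok in node_tokens.items())
        self.tokens = [tok for tok, _ in self.entries]

    def primary_node(self, partition_key):
        """First node whose token is >= the key's token, wrapping around the ring."""
        idx = bisect.bisect_left(self.tokens, token_for(partition_key))
        return self.entries[idx % len(self.entries)][1]


ring = Ring({"A": 2**126, "B": 2**127, "C": 3 * 2**126, "D": 2**128 - 1})
owner = ring.primary_node("roll:1")  # the same key always maps to the same node
```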
4. Replication factor: It is the number of machines in the cluster that will receive copies
of the same data. A replication strategy is used to place replicas in the ring. The
strategies are SimpleStrategy (formerly the rack-unaware strategy) and
NetworkTopologyStrategy (formerly the datacenter-shard strategy). NetworkTopologyStrategy
is preferred since it is easy to expand to multiple data centers.
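For a single data center, replica placement can be pictured as walking clockwise around the ring from the node chosen by the partitioner. The Python function below is a minimal sketch of that idea; the function name and the list-based ring are assumptions, and rack or data center awareness is deliberately ignored.

```python
def simple_strategy_replicas(ring_order, primary, replication_factor):
    """Return the nodes holding a row: the primary plus the next nodes clockwise.

    ring_order lists the node names in clockwise ring order.
    """
    if replication_factor > len(ring_order):
        # Restriction of this sketch only: it needs one distinct node per replica.
        raise ValueError("this sketch needs replication_factor <= number of nodes")
    start = ring_order.index(primary)
    return [ring_order[(start + i) % len(ring_order)]
            for i in range(replication_factor)]


replicas = simple_strategy_replicas(["A", "B", "C", "D"], primary="C",
                                    replication_factor=3)
# replicas == ["C", "D", "A"]: the walk wraps around the end of the ring
```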
5. Anti-entropy and read repair: Cassandra is fault tolerant because data is replicated on
one or many nodes. A client can connect to any node to read data. If the replicas do not
return consistent data, the read operation blocks: a few of the nodes may respond with an
out-of-date value, in which case Cassandra initiates a read repair operation to bring
the replicas with old values up to date.
Anti-entropy means comparing all the replicas of each piece of data and updating
each replica with the newest version.
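Read repair can be illustrated with a small Python sketch in which each replica stores a value together with its write timestamp, and a read returns the newest value while updating any stale replicas. Representing replicas as plain dictionaries and choosing the newest value purely by timestamp are assumptions made for this example.

```python
def read_with_repair(replicas, key):
    """Read `key` from all replicas, return the newest value, and repair stale copies.

    Each replica is a dict mapping key -> (value, write_timestamp).
    """
    responses = [replica[key] for replica in replicas if key in replica]
    if not responses:
        return None
    newest = max(responses, key=lambda pair: pair[1])  # highest timestamp wins
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # bring the out-of-date replica up to date
    return newest[0]


replica_a = {"roll:3": ("Bhuvan", 1)}  # stale value
replica_b = {"roll:3": ("Bhuvi", 2)}   # newest value
replica_c = {}                         # missed the write entirely
result = read_with_repair([replica_a, replica_b, replica_c], "roll:3")
```

After the read, all three replicas hold the newest version of the row.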
6. Writes in Cassandra:
When a write request comes to a node, it is first logged in the commit log. Then
Cassandra writes the data to the mem-table. Every write applied to the mem-table is
therefore also recorded separately in the commit log. The mem-table holds data
temporarily in memory, while the commit log records the writes for recovery
purposes. When the mem-table is full, data is flushed to the SSTable data file. The
SSTable is immutable and append-only.
7. Hinted handoffs
Hinting is a data repair technique applied during write operations. When replica
nodes are unavailable to accept a mutation, either due to failure or more commonly
routine maintenance, coordinators attempting to write to those replicas store
temporary hints on their local filesystem for later application to the unavailable
replica. Hints are an important way to help reduce the duration of data
inconsistency. Hinted handoff is the process by which Cassandra applies hints to
unavailable nodes.
Whenever a node becomes unresponsive or down for a certain period of time, all the
write requests to that node will fail. Cassandra keeps a copy of the partitions that
were supposed to be written to the unresponsive node within a local hints table. The
hints are stored for a certain configurable period of time. When the downed node
comes up and starts gossiping with the rest of the nodes in the cluster, the hints are
replayed to the downed node, and data is back in sync between all the replica
nodes.
Assume there is a cluster of three nodes, A, B, and C, with a replication factor of 2, and
that node C is down for some reason. A client makes a write request to node A. Node A is the
coordinator and serves as a proxy between the client and the nodes on which the
replica is placed. The client writes Row X to node A. Node A writes Row X to node B
and stores a hint for node C. The hint contains the following information:
1. Location of the node on which the replica is to be placed.
2. Version metadata
3. The actual data.
When node C recovers and is functional again, node A reacts to the hint by
forwarding the data to node C.
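The node A, B, C scenario can be traced with a small Python model. The classes `Replica` and `Coordinator` and the in-memory hint list are hypothetical; in Cassandra hints are stored on the coordinator's local filesystem and expire after a configurable period, which this sketch does not model.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}  # key -> (value, timestamp)


class Coordinator:
    def __init__(self):
        # Each hint records the target replica, version metadata (timestamp), and the data.
        self.hints = []

    def write(self, replicas, key, value, timestamp):
        for replica in replicas:
            if replica.up:
                replica.data[key] = (value, timestamp)
            else:
                self.hints.append((replica, key, value, timestamp))

    def replay_hints(self):
        """Forward stored hints to replicas that are back up; keep the rest."""
        remaining = []
        for replica, key, value, timestamp in self.hints:
            if replica.up:
                replica.data[key] = (value, timestamp)
            else:
                remaining.append((replica, key, value, timestamp))
        self.hints = remaining


node_b, node_c = Replica("B"), Replica("C")
node_c.up = False                      # node C is down
node_a = Coordinator()                 # node A acts as coordinator
node_a.write([node_b, node_c], "Row X", "payload", timestamp=1)
hints_while_down = len(node_a.hints)   # one hint stored for node C
node_c.up = True                       # node C recovers
node_a.replay_hints()                  # node A forwards the data to node C
```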
8. Tunable consistency
Cassandra supports both strong consistency and eventual consistency.
Strong consistency: each update propagates to all locations where that piece of
data resides. It ensures that all the servers that should have a copy of the data
have it before the client is acknowledged with a success.
Eventual consistency: the client is acknowledged with a success as soon as a part
of the cluster acknowledges the write.
Read consistency: how many replicas must respond before the result is sent
to the client application.
Write consistency: how many replicas must complete the write before the
acknowledgement is sent to the client application.
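A common rule of thumb, not stated in these notes but widely used with Cassandra, is that reads see the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (RF), because the two sets must then overlap. The helper names below are assumptions for illustration.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)                                     # 2 of 3 replicas
quorum_both = is_strongly_consistent(q, q, rf)     # QUORUM reads and writes overlap
one_and_one = is_strongly_consistent(1, 1, rf)     # ONE/ONE is only eventually consistent
```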
Cassandra Query Language
Users can access Cassandra through its nodes using Cassandra Query Language (CQL).
CQL treats the database (keyspace) as a container of tables. Programmers use cqlsh, a
command prompt for working with CQL, or separate application language drivers.
Clients approach any of the nodes for their read-write operations. That node (the coordinator)
acts as a proxy between the client and the nodes holding the data.
Write Operations
Every write is first captured in the commit log on the node. The data is then
stored in the mem-table. Whenever the mem-table is full,
data is written to an SSTable data file. All writes are automatically partitioned and
replicated throughout the cluster. Cassandra periodically consolidates the SSTables,
discarding unnecessary data.
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom
filter to find the appropriate SSTable that holds the required data.
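The role of the bloom filter in the read path can be sketched as follows: a bloom filter can say "definitely not here" or "possibly here", so SSTables whose filter rejects the key are skipped without being searched. The filter size, the number of hash functions, and the use of SHA-256 are arbitrary choices for this illustration, not Cassandra's actual parameters.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=256, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single integer

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        """False means the key is definitely absent; True means it may be present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def read(key, memtable, sstables):
    """Check the memtable first, then only SSTables whose bloom filter may hold the key.

    sstables is a list of (bloom_filter, data_dict) pairs, newest first.
    """
    if key in memtable:
        return memtable[key]
    for bloom, data in sstables:
        if bloom.might_contain(key) and key in data:
            return data[key]
    return None


bloom = BloomFilter()
bloom.add("roll:1")
sstables = [(bloom, {"roll:1": "Suraj"})]
from_memtable = read("roll:3", {"roll:3": "Bhuvi"}, sstables)
from_sstable = read("roll:1", {}, sstables)
```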

7.3 CQL data types

7.4 CQLSH

cqlsh is the command prompt in Cassandra, used to run CQL commands.

cqlsh>

7.5 Keyspaces

A keyspace is an object that holds column families and user-defined types.
In Cassandra, a keyspace is similar to a database in an RDBMS. A keyspace holds
column families, indexes, user-defined types, data center awareness, the
replication strategy used in the keyspace, the replication factor, etc.

CREATE KEYSPACE KeyspaceName WITH replication = {'class': 'StrategyName',
'replication_factor': NumberOfReplicas};
Various Components of Cassandra Keyspace

• Strategy: The strategy name is declared when the keyspace is created. There are two
kinds of strategies in Cassandra.

1. Simple Strategy: Simple strategy is used when you have just one data
center. In this strategy, the first replica is placed on the node selected
by the partitioner. The remaining replicas are placed in the clockwise direction in
the ring without considering rack or node location.

2. Network Topology Strategy: Network topology strategy is used when you
have more than one data center. In this strategy, you have to provide a
replication factor for each data center separately. Network topology
strategy places replicas on nodes in the clockwise direction within the same
data center. This strategy attempts to place replicas on different racks.

• Replication Factor: The replication factor is the number of replicas of the data
placed on different nodes. A replication factor of 3 is a good choice for avoiding
failures. A replication factor of more than two ensures no single point of failure.
If a server goes down or a network problem occurs, the other replicas
continue to provide service.
• Example: The following "Create Keyspace" command creates a keyspace in Cassandra.

CREATE KEYSPACE University WITH
replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

After successful execution of the "Create Keyspace" command, the keyspace University
is created in Cassandra with strategy "SimpleStrategy" and replication factor 3.

Example 2: CREATE KEYSPACE Students WITH REPLICATION = {'class':
'SimpleStrategy', 'replication_factor': 1};

Objective: to list all the existing keyspaces

DESCRIBE KEYSPACES;

Objective: to get details on the existing keyspaces

SELECT * FROM system.schema_keyspaces;

NOTE: In Cassandra 3.0 and later, this information is in system_schema.keyspaces.

Objective: to use the keyspace "students"

Syntax: USE keyspace_name;
USE students;

Objective: to create the table 'student_info'

CREATE TABLE student_info (roll int PRIMARY KEY, stdname text,
dateofjoin timestamp, lastexampercent double);

Objective: to display all tables in the current keyspaces

DESCRIBE TABLES;

Objective: to describe the table student_info

DESCRIBE TABLE student_info;

7.6 CRUD (CREATE, READ, UPDATE, DELETE) OPERATIONS

INSERT DATA:
INSERT INTO STUDENT_INFO (ROLL, STDNAME, DATEOFJOIN, LASTEXAMPERCENT)
VALUES (1, 'SURAJ', '2018-03-22', 77.6);
Queries in CQL:

SELECT * FROM STUDENT_INFO;

SELECT * FROM STUDENT_INFO WHERE ROLL IN (1, 2, 3);

NOTE: The column in the WHERE clause should be the primary key or an indexed column.

To create an index on the student name column, stdname:

CREATE INDEX ON student_info (stdname);

SELECT * FROM STUDENT_INFO WHERE stdname = 'sonu';

Objective: to limit the number of rows in the output (display 2 rows)

SELECT * FROM student_info LIMIT 2;

Objective: to create an alias
SELECT roll AS ROLLNUM FROM STUDENT_INFO;

Objective: to update a table

UPDATE student_info SET STDNAME = 'BHUVI' WHERE ROLL = 3;

NOTE: Updating a primary key value is not allowed.

To delete one or more columns from a row:

DELETE lastexampercent FROM student_info WHERE roll = 2;

To delete a row:
DELETE FROM student_info WHERE roll = 2;
Objective: to use ALLOW FILTERING with the SELECT statement when
searching a range of rows. To execute a query despite unpredictable
performance, use ALLOW FILTERING.

SELECT * FROM project WHERE pname = 'dbms' ALLOW FILTERING;

NOTE: An ORDER BY clause, if used, must name a clustering column and come before
ALLOW FILTERING.

7.7 Collections

Cassandra collections are a convenient way to store multiple elements in a single
column. Collections have the following limitations.

• A Cassandra collection cannot store more than 64 KB of data.
• Keep a collection small to limit the overhead of querying it,
because the entire collection needs to be traversed.
• If you store more than 64 KB of data in a collection, only 64 KB can be
queried, which results in loss of data.

There are three types of collections that Cassandra supports.

Cassandra Set
A Set stores a group of elements and returns them in sorted order when queried.

Syntax

Here is the syntax of a Set collection that stores multiple email addresses for a
teacher.

CREATE TABLE University.Teacher
(
id int,
Name text,
Email set<text>,
PRIMARY KEY (id)
);
Example: The statement above creates the table "Teacher" with the "Email" column as a
collection.
The following statement inserts data into the collection.

INSERT INTO University.Teacher (id, Name, Email) VALUES (1, 'Guru99', {'[email protected]', '[email protected]'});

Cassandra List
A list is used when the order of elements matters.

Cassandra Map
The map is a collection type that is used to store key-value pairs. As its name
implies, it maps one thing to another. It is often used to store timestamp-related
information. Each element of the map is stored as a Cassandra column. Each
element can be individually queried, modified, and deleted.

For example, if you want to save a course name with its prerequisite course name,
a map collection can be used.

7.8 Using a counter

A counter is a special column whose value is changed in increments. For example, we
may need a counter column to count the number of times a particular book is
issued from the library to a student.

To load data into a counter column, or to increase or decrease the value of the counter, use the
UPDATE command. Cassandra rejects USING TIMESTAMP or USING TTL in the command to
update a counter column.

Procedure
• Create a table for the counter column.
cqlsh> USE cycling;
CREATE TABLE popular_count (id UUID PRIMARY KEY, popularity counter);

• Loading data into a counter column is different from loading other tables. The data is
updated rather than inserted.
UPDATE cycling.popular_count
SET popularity = popularity + 1
WHERE id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;

• Take a look at the counter value and note that popularity has a value of 1.

SELECT * FROM cycling.popular_count;

• Additional increments or decrements change the value of the counter column.
UPDATE cycling.popular_count
SET popularity = popularity + 1
WHERE id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;

• Take a look at the counter value and note that popularity has a value of 2.

SELECT * FROM cycling.popular_count;

7.9 Determining time-to-live (TTL) for a column


Data in a column, other than a counter column, can have an optional expiration period
called TTL (time to live). The client request may specify a TTL value for the data. The TTL
is specified in seconds. The TTL function can be used to retrieve the remaining TTL.

CREATE TABLE userlogin (userid int PRIMARY KEY, password text);

The USING TTL keywords can be used to insert data into a table for a specific duration
in seconds. To determine the current time-to-live for a record, use the TTL function.

INSERT INTO userlogin (userid, password) VALUES (1, '87^4') USING TTL 30;

SELECT TTL(password) FROM userlogin WHERE userid = 1;
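The behaviour of a TTL can be pictured with a tiny Python model: the value is stored with an expiry time, and the remaining TTL counts down until the value disappears. The class `ExpiringCell` and the explicit `now` argument are assumptions made so the example is deterministic; this is not how Cassandra stores cells.

```python
class ExpiringCell:
    """A column value that expires `ttl_seconds` after it was written."""

    def __init__(self, value, ttl_seconds, written_at):
        self.value = value
        self.expires_at = written_at + ttl_seconds

    def remaining_ttl(self, now):
        """Seconds left before expiry, or None once the value has expired."""
        left = self.expires_at - now
        return left if left > 0 else None

    def read(self, now):
        """Return the value while it is still live, otherwise None."""
        return self.value if self.remaining_ttl(now) is not None else None


cell = ExpiringCell("87^4", ttl_seconds=30, written_at=100)
ttl_after_10s = cell.remaining_ttl(now=110)  # 20 seconds remain
value_after_30s = cell.read(now=130)         # expired, so nothing is returned
```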

7.10 ALTER COMMANDS

ALTER commands change the structure of a table or column.

Example: change the data type of the column sampleid from text to int.

ALTER TABLE sample ALTER sampleid TYPE int;

To delete a column with ALTER TABLE:

ALTER TABLE sample DROP sampleid;

To drop a table (column family):

DROP COLUMNFAMILY sample;

To drop a database (keyspace):

DROP KEYSPACE students;

7.11 IMPORT AND EXPORT

EXPORT TO CSV
