Basics of Big Data
3.5 Datafication
Summary
Keywords
Self-Assessment Questions
Suggested Reading
2019
All rights reserved. No part of this unit may be reproduced, transmitted or utilised in any form or by
any means, electronic or mechanical, including photocopying, recording or by any information storage
or retrieval system without written permission from the publisher.
Acknowledgement
Every attempt has been made to trace the copyright holders of materials reproduced in this unit.
Should any infringement have occurred, SCDL apologises for the same and will be pleased to make
necessary corrections in future editions of this unit.
Objectives
Whenever we book a cab or order a parcel from Amazon, we rarely think about how easy it has become
to do these things without leaving our room. A few years ago this was not the case. Cabs had to be
hailed with hand signals on the road, and many of us did not trust online stores enough to order from
them. These comforts have become possible because storing and analysing data has become cheaper
and faster. What has also changed is the amount of data we generate. Most people today have social
media accounts, a phone, a laptop and perhaps a tablet or two, and all of these generate data
continuously. As these data points are generated, they are also stored in parallel. Have you ever
wondered how Amazon manages and stores the data from all of its users? How much storage space
does it need? And even if it manages to store such huge amounts of data, won’t it take forever to
analyse it? A decade ago, the answer to whether any of this could be done would have been “No, it is
not possible”. But we have come a long way since then.
Today all of this is possible, and it is possible only because of Big Data applications and the
technological advances behind them. This raises the question: what is Big Data, and why has it
become so important?
“Big Data is a phrase used to mean a massive volume of both structured and
unstructured data that is so large it is difficult to process using traditional
database and software techniques. In most enterprise scenarios the volume
of data is too big or it moves too fast or it exceeds current processing
capacity.” – Webopedia
Thus, in short, Big Data refers to amounts of data so large that traditional SQL databases cannot
store or process them. Nor is it necessary that all of this data is structured. A Big Data system can
handle both plain or free-form text (unstructured data) and data in a tabular format (structured
data). As a result, it can help companies and conglomerates take intelligent, data-driven decisions.
The figure below, which shows the five Vs of Big Data, is one of the most common figures you will see
when you search for Big Data on Google.
a) Volume
b) Variety
c) Value
d) Velocity
e) Veracity
Let us talk about each of them in brief to understand what it means and why it is important.
a) Volume
Volume is nothing but the amount of data that needs to be stored. Let us say you are building a
product which captures data from all users of a particular browser once they have given their
permission. Suppose there are only 5 users at the start and the number slowly grows. The tools
you used at the start, such as Excel or a single SQL database, will no longer be enough: as the
users multiply, the volume of incoming data becomes enormous and can no longer be stored in
the same traditional database. This is where volume plays a pivotal role in deciding the
technology that needs to be used for the product back end.
b) Variety
The data that we see in Excel or in most traditional databases is mostly structured. In the
world of Big Data, it can be anything: you can store text, images and voice notes in a
database. This creates variety among the data types you are going to store, and this variety is
a driving force in selecting the Big Data technology.
c) Value
Not all the data that is stored will be valuable to us. But does that mean we should not store
data that has no value today? A particular field or column in a dataset might not be useful
now, but that does not mean it will never be used in the future. That is the reason why the
database should be built in such a way that it is easy to segregate valuable and less valuable
data at the source, and equally easy to combine them when required.
d) Velocity
In today’s world, speed is the name of the game. If analytics can be done in real time, it
changes a lot of things. Thus it is imperative to know beforehand the speed at which you need
to collect and store data, the timeframe of data capture and the way in which the data should
be processed. This is one of the most important aspects of Big Data.
e) Veracity
Most real-world datasets are not perfect. Data usually contains many inconsistencies when it
is captured: you will often find missing values, wrong data types in the wrong columns, and so
on. It is important to replace those inconsistencies with suitable values, so you should be
able to transform the dataset and replace missing values with imputed ones easily.
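To make the idea of imputation concrete, here is a minimal sketch in Python. The field name `age`, the sample rows and the choice of the mean as the replacement value are assumptions made only for illustration; real projects may prefer the median, a model-based estimate, or dropping the row.

```python
from statistics import mean


def parse_number(value):
    """Return value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def impute_ages(rows):
    """Return a copy of rows in which unusable 'age' values are replaced by the mean."""
    parsed = [parse_number(row.get("age")) for row in rows]
    valid = [v for v in parsed if v is not None]
    fill = mean(valid) if valid else None  # nothing to impute from if no valid value exists
    return [
        {**row, "age": v if v is not None else fill}
        for row, v in zip(rows, parsed)
    ]


# Hypothetical captured data: one missing value and one wrong data type.
rows = [
    {"name": "A", "age": "30"},
    {"name": "B", "age": None},
    {"name": "C", "age": "forty"},
    {"name": "D", "age": 50},
]
cleaned = impute_ages(rows)
```

Here the two unusable values are replaced by the mean of the valid ones, while the valid values are only converted to a consistent numeric type.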
3.3 BIG DATA & DATA SCIENCE
Data science is an interdisciplinary field that combines statistics, mathematics, data acquisition, data
cleansing, mining and programming to extract insights and information. When data sets get so big that
they cannot be analysed by traditional data processing tools, they become ‘Big Data’. That massive
amount of data is useless if it is not analysed and processed.
Both Data Science and Big Data are related to data-driven decision making, but they are significantly
different. Data-driven decision making (with the expectation of better decisions and increased value)
is a process that involves different stages:
1) Capturing of Data
2) Processing & Storing Data
3) Analysing and Generating Insights
4) Decision & Actions
Big Data is typically involved in processing and storing the data (Step 2), and that holds in all
scenarios. Big Data technology helps reduce the cost of processing large volumes of data and also
makes a few typical analyses feasible.
Data Science is involved in analysing the data and generating insights (Step 3). It uses statistical,
mathematical and machine learning algorithms to turn data into insights. Whether or not the data is
"Big Data", we can use Data Science to support data-driven decisions and take better decisions.
For example, if you want to mail your friend a 100 MB file, the mail system will not allow it. So, for
the mail system, this file is “Big Data”. But if you upload the same file to a cloud drive, you will be
able to do that easily. Hence, the definition of Big Data changes from system to system.
Some of the technologies which work with Big Data are Hadoop, Apache Spark etc.
Nearly every industry has begun investing in big data analytics, but some are investing more heavily
than others. According to IDC, banking, discrete manufacturing, process manufacturing,
federal/central government, and professional services are among the biggest spenders. IDC forecast
that together those industries would spend $72.4 billion on big data and business analytics in 2017,
climbing to $101.5 billion by 2020.
The fastest growth in spending on big data technologies is occurring within banking, healthcare,
insurance, securities and investment services, and telecommunications.
It’s noteworthy that three of those industries lie within the financial sector, which has many
particularly strong use cases for big data analytics, such as fraud detection, risk management and
customer service optimisation.
The list of technology vendors offering big data solutions is seemingly infinite.
Many of the big data solutions that are particularly popular right now fit into one of the following 5
categories:
1) Hadoop Ecosystem
2) Apache Spark
3) Data Lakes
4) In-Memory Databases
5) NoSQL Databases
1) Hadoop Ecosystem
Let us first look at Hadoop itself and then at the ecosystem that has grown around it.
Hadoop is an open-source, scalable and fault-tolerant framework written in Java. It efficiently
processes large volumes of data on a cluster of commodity hardware.
Hadoop is not only a storage system but a platform for large-scale data storage as well as processing.
Hadoop is an open-source tool from the ASF (Apache Software Foundation). Being an open-source
project means that it is freely available and that we can even change its source code as per our
requirements: if a certain functionality does not fulfil your need, you can change it. Much of the
Hadoop code has been written by Yahoo, IBM, Facebook and Cloudera.
A cluster is a group of systems connected via a network, typically a LAN. Apache Hadoop provides
parallel processing of data as it works on multiple machines simultaneously.
1) HDFS
HDFS is a distributed file system which is provided in Hadoop as a primary storage service.
It is used to store large data sets on multiple nodes. HDFS is deployed on low cost
commodity hardware.
So, if you have ten computers (nodes), each with a hard drive of 1 TB, and you install
Hadoop on top of these ten machines, you get a raw storage capacity of 10 TB in total.
This means that a single file far larger than any one drive can be stored in HDFS in a
distributed fashion across these ten machines. Note that HDFS keeps several copies of
each block for fault tolerance (three by default), so the usable capacity is lower than
the raw 10 TB.
There are many features of HDFS which make it suitable for storing large data, such as
scalability, data locality and fault tolerance.
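The idea of spreading one large file over several nodes can be sketched as follows. This is only an illustration under stated assumptions: the function names and node names are hypothetical, the round-robin placement is a simplification (real HDFS uses a rack-aware placement policy), and the block size of 128 MB and replication factor of 3 are the usual Hadoop defaults.

```python
def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into fixed-size blocks and assign each block to `replication` nodes.

    Returns {block_index: [node, ...]}. Placement is plain round-robin,
    which is a simplification of what a real cluster does.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    n_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# A hypothetical 300 MB file on a four-node cluster needs three blocks.
placement = place_blocks(300, ["n1", "n2", "n3", "n4"])
still_readable = readable_after_failure(placement, "n3")
```

Because every block lives on three different nodes, the file remains readable when any single node fails, which is the fault tolerance mentioned above.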
2) Map Reduce
Map Reduce is the processing layer of Hadoop. The Map Reduce programming model is
designed for processing large volumes of data in parallel by dividing the work into a set
of independent tasks.
You only need to express your business logic in the way Map Reduce works; the rest is
taken care of by the framework. The work (the complete job) submitted by the user to the
master is divided into smaller pieces of work (tasks), which are assigned to the slaves.
In Map Reduce, the input is a list and the output is again a list. Map Reduce is the heart
of Hadoop: Hadoop is so powerful and efficient because Map Reduce processes data in
parallel.
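The classic word count illustrates the model. The sketch below runs on a single machine in plain Python and does not use the Hadoop API; the function names are hypothetical. It only shows the three logical steps: map each input record to key-value pairs, group the pairs by key (the shuffle), and reduce each group to one result.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into a list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different machine, since the
    # map tasks are independent of one another.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data", "Big Data is big"])
```

In a real cluster the framework distributes the map and reduce tasks over the nodes and performs the shuffle over the network; the programmer supplies only the map and reduce logic.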
3) YARN
Apache YARN (“Yet Another Resource Negotiator”) is the resource management layer of
Hadoop. YARN was introduced in Hadoop 2.x. It allows different data processing engines,
such as graph processing, interactive processing, stream processing and batch processing,
to run and process data stored in HDFS (Hadoop Distributed File System).
Apart from resource management, YARN is also used for job scheduling. YARN extends the
power of Hadoop to other evolving technologies, so that they can take advantage of HDFS
(a reliable and widely used storage system) and of an economical cluster.
Apache YARN is also considered the data operating system of Hadoop 2.x. The YARN-based
architecture of Hadoop 2.x provides a general-purpose data processing platform which is
not limited to Map Reduce.
It enables Hadoop to support purpose-built data processing systems other than Map
Reduce, and allows several different frameworks to run on the same hardware where
Hadoop is deployed.
Now that we have understood what Hadoop is, let us try to understand what the Hadoop
ecosystem is.
The Hadoop ecosystem refers to the various components of the Apache Hadoop software library,
as well as to the accessories and tools provided by the Apache Software Foundation.
Figure Source: https://www.oreilly.com/library/view/apache-hiveessentials/9781788995092/e846ea02-6894-45c9-983a-03875076bb5b.xhtml
The above figure shows the various components of Hadoop ecosystem. Some of the components
are explained as follows:
a) Hive
Apache Hive is an open-source data warehouse system for querying and analysing large
datasets stored in Hadoop files. Hive performs three main functions:
Data summarisation
Query processing
Analysis
Hive automatically translates queries written in HiveQL, its SQL-like language, into Map
Reduce jobs which execute on Hadoop.
b) Pig
Apache Pig is a high-level language platform for analysing and querying huge datasets that are
stored in HDFS.
Pig uses the Pig Latin language, which plays a role similar to that of SQL. It loads the data,
applies the required filters and dumps the data in the required format.
c) HBase
Apache HBase is a distributed database designed to store structured data in tables that can
have billions of rows and millions of columns.
HBase is a scalable, distributed NoSQL database built on top of HDFS. It provides real-time
access to read or write data in HDFS.
d) HCatalog
HCatalog allows different components available in Hadoop, such as Map Reduce, Hive and Pig,
to easily read and write data from the cluster. HCatalog is a key component of Hive that
enables users to store their data in any format and structure.
e) Avro
Avro is an open source project that provides data serialization and data exchange services for
Hadoop. These services can be used together or independently.
Using Avro, programs written in different languages can exchange big data.
The services mentioned above are the ones generally present in the Hadoop ecosystem. Not every
one of these technologies will be required in every deployment.
Thus Hadoop is a very important part of Big Data and, most of it being open source, it can be
modified as per requirement.
2) Apache Spark
Apache Spark is a general-purpose and lightning-fast cluster computing system. It provides
high-level APIs in Java, Scala, Python and R.
Spark is often quoted as running workloads up to 100 times faster than Hadoop Map Reduce when
data is processed in memory, and up to 10 times faster when it is processed on disk.
Apache Spark was introduced in 2009 in the UC Berkeley R&D Lab, which later became the AMPLab.
It was open sourced in 2010 under a BSD license.
In 2013, Spark was donated to the Apache Software Foundation, where it became a top-level Apache
project in 2014. Spark does not run on Hadoop Map Reduce; it has its own execution engine, which
extends the Map Reduce model to efficiently support more types of computations.
Spark can be used along with Map Reduce in the same Hadoop cluster or can be used alone as a
processing framework. Spark applications can also run on YARN.
Spark applications can be written in Java, R, Python and Scala. However, Scala is often the most
favoured one because Spark itself is written in Scala.
Apache PySpark
PySpark is the Python API for Spark. Spark is a big data processing platform that provides the
capability to process petabyte-scale data.
Using PySpark you can write a Spark application to process data and run it on the Spark platform.
AWS provides a managed Spark platform through EMR.
Using PySpark you can read data from various file formats such as CSV, Parquet and JSON, or from
databases, and do analysis on top of it.
It is because of such features that Spark is widely preferred in industry these days. Whether
start-ups or Fortune 500 companies, organisations are adopting Apache Spark to build, scale and
innovate their applications.
Spark is widely used across industries, from finance to entertainment.
3) Data Lakes
A data lake is a reservoir which can store vast amounts of raw data in its native format. This data
can include unstructured data (emails, documents, PDFs) and binary data (images, audio, video),
among other types.
The purpose of a data lake, a capacious and agile platform, is to hold all the data of an enterprise
in one central place.
This allows us to do comprehensive reporting, visualisation and analytics, and eventually glean
deep business insights.
But keep in mind that a data lake and a data warehouse are different things.
Contrary to a data warehouse, where data is processed and stored in files and folders, a data lake
has a flat architecture: it stores all the data without any prior processing, which reduces the
time required for compilation. The data in a data lake is retained in its original format until
it is needed.
Data lakes provide agility and flexibility, making it easier to make changes. Though the reason for
storing data in a data lake is not predefined, the main objective of building one is to offer an
unrefined view of the data to data scientists whenever needed.
A data lake also supports ingestion, i.e. connectors that load data from different data sources
into the lake. Data lake storage is more scalable and cost efficient and allows fast data
exploration.
If not designed correctly, Data Lake can soon become toxic. Some of the guiding principles for
designing Data Lake are:
Data within the data lake is stored in the same format as that of the source. The idea is to store
data quickly with minimal processing to make the process fast and cost efficient.
Data within the data lake is reconciled with the source every time a new data set is loaded, to
ensure that it is a mirror copy of data inside the source.
Data within the data lake is well documented to ensure correct interpretation of data. Data
catalogue and definitions are made available to all authorised users through a convenient channel.
Data within the data lake can be traced back to its source to ensure integrity of data.
Data within the data lake is secured through a controlled access mechanism. It is generally made
available to data analysts and data scientists to explore further.
Data within the data lake is generally large in volume. The idea is to store as much data as possible,
without worrying about which data elements are going to be useful and which are not. This
enables an exploratory environment, where users can keep looking at more data and build reports
or analytical models in an incremental fashion.
Data within the data lake is stored in the form of daily copies of data so that previous versions of
data can be easily accessed for exploration. Accumulation of historic data over time enables
companies to do trend analysis as well as build intelligent machine learning models that can learn
from previous data to predict outcomes.
Data within the data lake is generally stored in open source big data platforms like Hadoop to
ensure minimum storage costs. This also enables very efficient querying and processing of large
volumes of data during iterative data exploration and analysis.
Data within the data lake is stored in the format that it is received from the source, and is not
necessarily structured. The idea is to put minimum effort into storing data in the data lake.
All efforts to organize and decipher data happen post loading.
When a business question arises, the data lake can be queried for relevant data, and that smaller
set of data can then be analysed to help answer the question.
Data lakes have thus become a major part of the enterprise architecture building process.
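The principle of storing data as received and interpreting it only when a question arises can be sketched as follows. This is a toy, in-process illustration and not a real data lake: the `lake` list, the source names and the helper functions are all hypothetical, and only JSON and CSV payloads are understood.

```python
import csv
import io
import json

lake = []  # raw payloads, kept exactly as they were received


def ingest(source, fmt, payload):
    """Store the payload untouched, tagged with where it came from."""
    lake.append({"source": source, "format": fmt, "raw": payload})


def read_records(entry):
    """Interpret a raw payload only at query time."""
    if entry["format"] == "json":
        data = json.loads(entry["raw"])
        return data if isinstance(data, list) else [data]
    if entry["format"] == "csv":
        return list(csv.DictReader(io.StringIO(entry["raw"])))
    return []  # unknown formats simply stay in the lake, unparsed


def query(predicate):
    """Return every record, from any source, that satisfies the predicate."""
    return [rec for entry in lake for rec in read_records(entry) if predicate(rec)]


# Two hypothetical sources deliver the same kind of data in different formats.
ingest("web_orders", "json", '[{"customer": "c1", "amount": 250}, {"customer": "c2", "amount": 90}]')
ingest("store_orders", "csv", "customer,amount\nc1,120\nc3,300\n")

# Business question: which orders were worth more than 100?
big_orders = query(lambda rec: float(rec["amount"]) > 100)
```

Nothing is converted or validated at load time, which keeps ingestion cheap; the cost of interpreting the data is paid only by the queries that need it.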
4) In Memory Databases
An in-memory database is a data store that primarily uses the main memory of a computer. Since
this main memory has the fastest access time, data stored in main memory affords the most speed
for database applications.
Mainstream databases mostly store data in a permanent store (such as a hard disk or network
storage), which increases access time; they are thus not as fast as in-memory databases.
Mission-critical applications which need very fast response times, such as medical and telecom
applications, have always relied on in-memory databases. However, the recent development of
memory devices that can hold large amounts of data at a very low price has made in-memory
databases very attractive to commercial applications as well.
In-memory databases generally store data in proprietary forms. There are several open-source in-
memory databases that store data in a ‘key-value’ format. So, in that sense, these databases are
not similar to traditional relational databases that use SQL.
All properly constructed DBMSs are actually in-memory databases for query purposes at some
level, because they only really query data that is in memory, i.e. in their buffer caches. The
difference is that a database that claims to be in-memory will always have the entire database
resident in memory from start-up, while more traditional databases use a demand-loading scheme,
copying data from permanent storage to memory only when it is called for.
So, even if your Oracle, Informix, DB2, PostgreSQL, MySQL or MS SQL Server instance has sufficient
memory allocated to it to keep your entire database in memory, the first few queries will run
slower than later queries, until all of the data has been called for directly by queries or pulled
into memory by read-ahead activity.
A true in-memory database system will have a period at start-up when it will either refuse to
respond to queries or will suspend them until the entire database can be loaded in from storage
after which all queries will be served as quickly as possible.
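A small way to experiment with the idea is SQLite's in-memory mode, available through Python's standard `sqlite3` module by connecting to the special name `:memory:`. SQLite is used here only as a convenient stand-in and is an assumption of this sketch, not one of the products discussed above; the table and values are hypothetical.

```python
import sqlite3


def build_db(path):
    """Create a small table of sensor readings in the database at `path`."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("s1", 20.5), ("s1", 21.5), ("s2", 19.0)],
    )
    conn.commit()
    return conn


# ":memory:" keeps the whole database in main memory; nothing is written to disk.
mem = build_db(":memory:")
avg_s1 = mem.execute(
    "SELECT AVG(value) FROM readings WHERE sensor = ?", ("s1",)
).fetchone()[0]

# A second in-memory connection is a separate, empty database: the data
# lives only as long as the connection that holds it.
other = sqlite3.connect(":memory:")
tables_in_other = other.execute("SELECT name FROM sqlite_master").fetchall()
```

The trade-off is visible even in this toy: access is fast because no disk is involved, but the data disappears with the process unless it is persisted separately.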
5) NoSQL Databases
NoSQL refers to a general class of storage engines that store data in a non-relational format. This
is in contrast to traditional RDBMS in which data is stored in tables that have data that relate to
each other. NoSQL stands for "Not Only SQL" and isn't meant as a rejection of traditional
databases.
There are different kinds of NoSQL databases for different jobs. They can be categorised broadly
into four different buckets:
a) Key-Value Stores: These are very simple in that you simply define a key for a binary object.
It is very common for programmers to simply store large serialised objects in these kinds of
databases. Examples are Cassandra (often also classified as a column-oriented store) and
Oracle NoSQL.
b) Document Store: Stores "documents", also based on a key-value system although more
structured. The most common implementation is based on the JSON (JavaScript Object
Notation) standard, which can be thought of as similar in structure to XML. Examples are
MongoDB and CouchDB.
c) Graph DB: Stores data as "graphs", which allow you to define complex relationships between
objects. Very common for something like storing relationships of people in a social network.
An example is Neo4j.
d) Column Oriented: Data is stored in columns rather than rows (this is a tricky concept to get
at first). Allows for great compression and for building tables that are very large (hundreds of
thousands of columns, billions/trillions of rows). An example is HBase.
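The four buckets can be contrasted with ordinary Python data structures. The sketch below is only an analogy under stated assumptions: the names and sample values are hypothetical, and none of this uses the API of any real NoSQL product.

```python
import json

# a) Key-value: the store knows only the key; the value is an opaque blob.
kv_store = {"user:1": json.dumps({"name": "Asha", "city": "Pune"})}

# b) Document: the value is structured, so it can be searched by field.
documents = {
    "user:1": {"name": "Asha", "city": "Pune", "orders": [101, 102]},
    "user:2": {"name": "Ravi", "city": "Mumbai", "orders": []},
}


def find(field, value):
    """Return the keys of all documents whose field equals value."""
    return [key for key, doc in documents.items() if doc.get(field) == value]


# c) Graph: relationships are first-class, here as adjacency sets.
friends = {"Asha": {"Ravi"}, "Ravi": {"Asha", "Meera"}, "Meera": {"Ravi"}}


def friends_of_friends(person):
    """People two steps away who are not already direct friends."""
    direct = friends.get(person, set())
    reachable = {fof for friend in direct for fof in friends.get(friend, set())}
    return reachable - direct - {person}


# d) Column oriented: each column is stored together, so an aggregate
# over one column does not need to touch the others.
columns = {
    "name": ["Asha", "Ravi"],
    "city": ["Pune", "Mumbai"],
    "spend": [250.0, 90.0],
}
total_spend = sum(columns["spend"])
```

The same facts about the same people appear in all four shapes; what differs is which question each shape answers cheaply.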
In general, NoSQL databases excel when you need something that can both read and write large
amounts of data quickly. And since they scale horizontally, just adding more servers tends to improve
performance with little effort. Facebook, for example, has used one for its Inbox.
Other examples might be a user's online game profile or the storage of large amounts of legal
documents.
An RDBMS is still the best option for handling large numbers of atomic-level transactions (i.e. we
likely won't see things like banking systems or supply chain management systems run on a NoSQL
database). This is also because many NoSQL databases are not fully ACID compliant (basically, two
people looking at the same key might see different values, depending on when the data was accessed).
NoSQL databases are used in a large share of the applications we use every day. They are a very
important component of the overall Big Data structure.
3.5 DATAFICATION
Datafication is a new concept which refers to “how we render into data many aspects of the
uncontrollable and qualitative factors into a quantified form”. In other words, this new term
represents our ability to collect data on aspects of our lives that have never been quantified before
and to turn it into value, e.g. valuable knowledge.
Every time we go to a big store to buy a product, the store staff ask us to fill in a form and get a
card for that store. Earlier, this practice did not exist. So what changed?
Earlier, when we bought products, there was no traceability: the store had no record of which
product was bought by which person. But now, as we swipe a card associated with that store
whenever we buy anything, the store can associate the product with us. This helps it send us offers
on our mobile and email.
This is DATAFICATION. Earlier, this information was qualitative in nature and was not captured. But
now, with the introduction of the card system, stores are able to know this for nearly 40% of
customers.
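The store-card example can be sketched in a few lines. Everything here is hypothetical (the card numbers, products and field names are invented for illustration); the point is only that a purchase made with a card becomes a data point attached to a customer, while a purchase without one stays anonymous.

```python
from collections import defaultdict


def build_profiles(transactions):
    """Group purchased products by loyalty card; count purchases made without a card."""
    profiles = defaultdict(list)
    anonymous = 0
    for txn in transactions:
        card = txn.get("card_id")
        if card is None:
            anonymous += 1  # no card swiped: the purchase cannot be traced to a person
        else:
            profiles[card].append(txn["product"])
    return dict(profiles), anonymous


transactions = [
    {"card_id": "C001", "product": "tea"},
    {"card_id": None, "product": "rice"},
    {"card_id": "C001", "product": "biscuits"},
    {"card_id": "C002", "product": "soap"},
    {"card_id": None, "product": "milk"},
]
profiles, anonymous = build_profiles(transactions)
share_traceable = (len(transactions) - anonymous) / len(transactions)
```

With the profiles in hand, the store can decide which offers to send to which card holder, which was not possible when every purchase was anonymous.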
Datafication is not only about the data, but refers to the process of collecting data, as well as the
tools and the technologies that support data collection. In the business context, an organization
uses data to monitor processes, support decision-making and plan short- and long-term strategies.
Many start-up companies have been established on the hype of big data, by extracting value from it.
In a few years, no business will be able to operate without exploiting the data available, while
whole industries may face complete re-engineering.
But keep in mind that Datafication is not Digitalization. The latter term describes the process of
using digital technologies to restructure our society, businesses and personal lives. It began with
the rise of computers and their introduction in organizations. In the following years, new
technologies such as the Internet of Things have been gradually integrated into our lives and have
revolutionized them. Datafication, however, represents the next phase of evolution, in which data
production and proper collection are already a reality and society tends to establish processes for
the extraction of valuable knowledge.
1. Give one example in your day to day life where you see DATAFICATION happening.
Summary
Keywords
Big Data: Big data is data that exceeds the processing capacity of conventional database
systems.
Datafication: Datafication refers to the collective tools, technologies and processes used to
transform an organisation to a data-driven enterprise.
Self-Assessment Questions
1. How are Data Science and Big Data connected? Explain with an example.
2. Where do you think Big Data can be most useful in today’s world?
Suggested Reading
1. Big Data: A Revolution That Will Transform How We Live, Work, and Think. - Book by
Kenneth Cukier and Viktor Mayer-Schönberger.
2. Big Data For Dummies - Book by Alan Nugent, Fern Halper, Judith Hurwitz, and Marcia
Kaufman.
3. Big Data at Work: Dispelling the Myths, Uncovering the Opportunities - Book by Thomas H.
Davenport.