20AI402 Data Analytics Unit-1
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document contains
proprietary information and is intended only to the respective group / learning
community as intended. If you are not the addressee you should not disseminate,
distribute or copy through e-mail. Please notify the sender immediately by e-mail
if you have received this document by mistake and delete this document from
your system. If you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of this
information is strictly prohibited.
20AI402
DATA ANALYTICS
Department: CSE
Batch/Year: 2020-2024 /IV YEAR
Created by:
Ms. Sajithra S / Asst. Professor
Ms. Gayathri S / Asst. Professor
Date: 03-08-2023
1.Table of Contents
1. Contents
2. Course Objectives
3. Pre-Requisites
4. Syllabus
5. Course Outcomes
7. Lecture Plan
9. Lecture Notes
10. Assignments
12. Part B Questions
16. Assessment Schedule
3. PRE-REQUISITES
Semester VI: Data Science Fundamentals
Semester II: Python Programming
Semester I: C Programming
4. SYLLABUS
20AI402 DATA ANALYTICS          L T P C
                                3 0 0 3
UNIT I INTRODUCTION 9
UNIT V APPLICATIONS 9
Application: Sales and Marketing – Industry Specific Data Mining – microRNA Data
Analysis Case Study – Credit Scoring Case Study – Data Mining Nontabular Data.
TOTAL: 45 PERIODS
5. COURSE OUTCOMES
[CO-PO mapping matrix: CO2-CO5 mapped against the programme outcomes, mapping strengths 1-3]
Lecture Plan
Unit - I
LECTURE PLAN – Unit 1- INTRODUCTION
S.No | Topic                                                  | Periods | CO | Taxonomy Level | Date       | Mode of Delivery
2    | Definition of Big Data                                 | 1       | 1  | K1             | 10/08/2023 | PPT / Chalk & Talk
3    | Challenges with Big Data                               | 1       | 1  | K1             | 12/08/2023 | PPT / Chalk & Talk
4    | Traditional Business Intelligence (BI) versus Big Data | 1       | 1  | K1             | 16/08/2023 | PPT / Chalk & Talk
5    | Introduction to Big Data Analytics                     | 1       | 1  | K1             | 18/08/2023 | PPT / Chalk & Talk
6    | Classification of Analytics                            | 1       | 1  | K1             | 19/08/2023 | PPT / Chalk & Talk
7    | Analytics Tools                                        | 1       | 1  | K1             | 22/08/2023 | PPT / Chalk & Talk
8    | Importance of Big Data Analytics                       | 1       | 1  | K1             | 23/08/2023 | PPT / Chalk & Talk
Activity name:
Compare and contrast traditional business intelligence (BI) and Big Data under the following heads:
1. Purpose
2. Components
3. Tools
4. Characteristics
5. Applied Fields
Students will gain a better understanding of traditional business intelligence and big data, and of their applications and uses in day-to-day life.
Guidelines to do the activity:
3) Conduct peer review. (Each team will be reviewed by all other teams and mentors.)
Useful link:
https://www.educba.com/business-intelligence-vs-big-data/
UNIT-I
INTRODUCTION
9. LECTURE NOTES
1. Evolution of Big Data
The 1970s and before was the era of mainframes; the data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s, the era of data-intensive applications. The World Wide Web (WWW) and the Internet of Things (IoT) have since led to an onslaught of structured, unstructured, and multimedia data.
2. Definition of Big Data
The following are a few responses to the question "Define big data" that we have heard over time:
1. Anything beyond the human and technical infrastructure needed to support storage, processing and analysis.
- Gartner IT Glossary
For the sake of easy comprehension, we will look at the definition in three parts.
Definition of big data - Gartner
"Enhanced insight and decision making" refers to deriving deeper, richer and more meaningful insights, and then using those insights to make faster and better decisions that yield business value and, thus, a competitive edge.
3. Challenges with Big Data
1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: "Will all this data be useful for analysis?", "Do we work with all this data or a subset of it?", "How will we separate the knowledge from the noise?", etc.
2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity and easy upgrading/downgrading are concerned. However, it also complicates the decision of whether to host big data solutions outside the enterprise.
3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may become irrelevant and obsolete just a few hours after being generated.
5. There are other challenges with respect to the capture, storage, preparation, search, analysis, transfer, security and visualization of big data. Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big a dataset should be for it to be considered "big data." Here we deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic, and therefore there is a need to ingest the data as quickly as possible.
6. Classification of Analytics
There are two schools of thought:
1. Those that classify analytics into basic, operationalized, and advanced analytics.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
❖ Basic analytics: This primarily is slicing and dicing of data to help with basic
business insights. This is about reporting on historical data, basic
visualization, etc.
❖ Operationalized analytics: It is operationalized analytics if it gets woven into
the enterprise’s business processes.
❖ Advanced analytics: This largely is about forecasting for the future by way of
predictive and prescriptive modelling.
Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table
6.1.
Table 6.1: Analytics 1.0 versus Analytics 2.0 versus Analytics 3.0

Analytics 1.0: Small and structured data sources. Data stored in enterprise data warehouses or data marts. Data was internally sourced.

Analytics 2.0: Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop. Data was often externally sourced.

Analytics 3.0: A blend of big data and traditional analytics to yield insights and offerings with speed and impact. Data is both internally and externally sourced.
Figure 6.1 shows the subtle growth of analytics from Descriptive → Diagnostic → Predictive → Prescriptive analytics.
7. Analytics Tools
1. MS Excel
Templates do most of the set-up and design work for us, so we can focus on our data. In Excel 2013, there are templates for budgets, calendars, forms, reports, and more.
The new Quick Analysis tool lets us convert our data into a chart or table in two steps or less. We can preview our data with conditional formatting, sparklines, or charts, and make our choice stick in just one click.
3. IBM SPSS Modeler
IBM SPSS Modeler is a data mining and text analytics software application from IBM. It is used to build predictive models and conduct other analytic tasks. It has a visual interface which allows users to leverage statistical and data mining algorithms without programming.
4. Statistica
• It is a worldwide proven software suite that can be used to analyze and visualize data, to create insights and predictions, and to deliver tailored solutions.
5. Salford Systems
• The Salford Predictive Modeler (SPM) software suite is a highly accurate and ultra-fast platform for developing predictive, descriptive, and analytical models.
• The SPM software suite’s data mining technologies span classification,
regression, survival analysis, missing value analysis, data binning and
clustering/segmentation.
• SPM algorithms are considered to be essential in sophisticated data science
circles.
6. WPS
Open-source analytics tools:
1. R Analytics
● R analytics is data analytics using R programming language, an open-source
language used for statistical computing or graphics.
● This programming language is often used in statistical analysis and data
mining. It can be used for analytics to identify patterns and build practical
models.
● R can not only help analyze an organization's data, but can also be used in the creation and development of software applications that perform statistical analysis.
2. Weka
It contains a collection of visualization tools and algorithms for data analysis and predictive modeling, coupled with a graphical user interface.
8. Importance of Big Data Analytics
1. Reactive-Business Intelligence:
What does Business Intelligence (BI) help us with? It allows the businesses to
make faster and better decisions by providing the right information to the
right person at the right time in the right format. It is about analysis of the
past or historical data and then displaying the findings of the analysis or
reports in the form of enterprise dashboards, alerts, notifications, etc. It has
support for both pre-specified reports as well as ad hoc querying.
2. Reactive - Big Data Analytics:
Here the analysis is done on huge datasets, but the approach is still reactive, as it is based on static data.
3. Proactive - Analytics:
This is to support futuristic decision making by the use of data mining, predictive modelling, text mining, and statistical analysis. This analysis is not on big data, as it still uses traditional database management practices and therefore has severe limitations on storage capacity and processing capability.
10. ASSIGNMENTS
1. You have just got a book issued from the library. What details about the book can be placed in an RDBMS table?
2. Share your experience as a customer on an e-commerce site. Comment
on the big data that gets created on a typical e-commerce site.
3. How does Big Data Analytics influence the business decision-making
and business priorities as well?
4. How would the concept of Big Data be useful for the future possibilities of an organization?
5. What are key skill sets and behavioral characteristics of a data scientist?
11. PART A: Q & A - UNIT I
2. How does the traditional BI environment differ from the big data environment? (CO1, K1)
● volume
● value
● variety
● velocity
● veracity
❖ Basic analytics: This primarily is slicing and dicing of data to help with basic
business insights. This is about reporting on historical data, basic
visualization, etc.
❖ Advanced analytics: This largely is about forecasting for the future by way of
predictive and prescriptive modelling.
● MS Excel
● SAS
● IBM SPSS Modeler
● Statistica
● Salford Systems
● WPS (World Programming System)
11. What are the two open-source analytics tools? (CO1, K1)
❖ R Analytics
● R analytics is data analytics using R programming language, an open-source
language used for statistical computing or graphics.
❖ Weka
• The software stands out for its user-friendliness, including the complete visualization of analytical processes in the project interface.
● It allows the businesses to make faster and better decisions by providing the
right information to the right person at the right time in the right format.
● It is about analysis of the past or historical data and then displaying the
findings of the analysis or reports in the form of enterprise dashboards,
alerts, notifications, etc.
● It has support for both pre-specified reports as well as ad hoc querying.
● IBM SPSS Modeler is a data mining and text analytics software application
from IBM.
● It is used to build predictive models and conduct other analytic
tasks.
● It has a visual interface which allows users to leverage statistical and data
mining algorithms without programming.
19. What is the purpose of Salford Systems? (CO1, K1)
• The Salford Predictive Modeler (SPM) software suite is a highly accurate and
ultra-fast platform for developing predictive, descriptive, and analytical
models.
NPTEL: https://onlinecourses.nptel.ac.in/noc21_cs45/preview
Coursera: https://www.coursera.org/learn/python-data-analysis
Udemy: https://www.udemy.com/topic/data-science/
MOOC: https://mooc.es/course/introduction-to-data-science-in-python/
edX: https://learning.edx.org/course/course-v1:Microsoft+DAT208x+2T2016/home
14. REAL TIME APPLICATIONS
1. Transportation
For example, during the wedding season or the holiday season, transport facilities are prepared to accommodate the heavy number of passengers traveling from one place to another, using prediction tools and techniques.
2. Logistics
Logistics companies like DHL, FedEx, etc. use data analytics to manage their overall operations. Using data analytics, they can figure out the best shipping routes and approximate delivery times, and can also track the real-time status of goods dispatched using GPS trackers. From the moment a shipment leaves its origin until it reaches its buyer, every position is tracked, which minimizes the loss of goods.
3. Web Search
Whenever you hit the search button, search engines use data analytics algorithms to deliver the best search results within a limited time frame. The searched text is treated as a keyword, and all the related pieces of information are presented in a sorted manner that one can easily understand. For example, when you search for a product on Amazon, that product keeps showing up on your social media profiles, or you are shown its details to convince you to buy it.
4. Manufacturing
Data analytics helps the manufacturing industries maintain their overall working through tools like prediction analysis, regression analysis, budgeting, etc. A unit can figure out the number of products that need to be manufactured from the data collected and analyzed on demand samples, and likewise in many other ways.
Apache Pig
Pig is a high-level platform or tool used to process large datasets. It provides a high level of abstraction over MapReduce, along with a high-level scripting language, known as Pig Latin, which is used to develop data-analysis code.
First, to process the data stored in HDFS, programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks; these tasks are not visible to the programmers, which provides a high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Need for Pig: One limitation of MapReduce is that its development cycle is very long. Writing the mapper and the reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces development time through its multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: 200 lines of Java code can often be written in only 10 lines of Pig Latin. Programmers who have SQL knowledge need even less effort to learn Pig Latin.
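The mapper/reducer boilerplate that Pig abstracts away can be sketched in miniature. The following is an illustrative, single-process word count in Python (our own example, not from the course text); a real job would run the same three phases distributed across a Hadoop cluster, and the equivalent Pig Latin would be a few GROUP/COUNT lines.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: sum the counts collected for each word.
    return key, sum(values)

lines = ["pig runs on hadoop", "pig latin scripts run on hadoop"]
pairs = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts["pig"], counts["hadoop"])  # prints: 2 2
```

In Java MapReduce each of these phases becomes its own class plus job-configuration code, which is exactly the overhead that Pig Latin's scripted, multi-query approach removes.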
Types of Data Models in Apache Pig: It consists of the following four types of data models:
● Atom: An atomic data value, stored as a string. The main advantage of this model is that the value can be used both as a number and as a string.
TEXT BOOKS:
1.Subhashini Chellappan, Seema Acharya, “Big Data and Analytics”, 2nd edition,
Wiley Publications, 2019.
2.Suresh Kumar Mukhiya and Usman Ahmed, “Hands-on Exploratory Data Analysis
with Python”, Packt publishing, March 2020.
3. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman, "Mining of Massive Datasets, v2.1", Cambridge University Press, 2019.
4.Glenn J. Myatt, Wayne P. Johnson, Making Sense of Data II : A Practical Guide To
Data Visualization, Advanced Data Mining Methods, and Applications, Wiley 2009.
REFERENCES:
1.Nelli, F., Python Data Analytics: with Pandas, NumPy and Matplotlib, Apress,
2018.
2.Bart Baesens," Analytics in a Big Data World: The Essential Guide to Data Science
and its Applications", John Wiley & Sons, 2014
3.Min Chen, Shiwen Mao, Yin Zhang, Victor CM Leung, Big Data: Related
Technologies, Challenges and Future Prospects, Springer, 2014.
4.Michael Minelli, Michele Chambers, Ambiga Dhiraj, “Big Data, Big Analytics:
Emerging Business Intelligence and Analytic Trends”, John Wiley & Sons, 2013.
5.Marcello Trovati, Richard Hill, Ashiq Anjum, Shao Ying Zhu, “Big Data Analytics
and cloud computing – Theory, Algorithms and Applications”, Springer
International Publishing, 2016
18. MINI PROJECT SUGGESTION
Mini Project:
The project should contain the following components
• Real-time dataset
• Data preparation & Transformation
• Handling missing Data
• Data Storage
• Algorithm for data analytics
• Data visualization: Charts, Heatmap, Crosstab, Treemap
Thank you