20AI402 Data Analytics Unit-1
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document contains
proprietary information and is intended only to the respective group / learning
community as intended. If you are not the addressee you should not disseminate,
distribute or copy through e-mail. Please notify the sender immediately by e-mail
if you have received this document by mistake and delete this document from
your system. If you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of this
information is strictly prohibited.
20AI402
DATA ANALYTICS
Department: CSE
Batch/Year: 2020-2024 /IV YEAR
Created by:
Ms. Sajithra S / Asst. Professor
Ms. Gayathri S / Asst. Professor
Date: 03-08-2023
1.Table of Contents
1. Contents
2. Course Objectives
3. Pre-Requisites
4. Syllabus
5. Course Outcomes
7. Lecture Plan
9. Lecture Notes
10. Assignments
12. Part B Questions
16. Assessment Schedule
3. PRE-REQUISITES
Semester VI: Data Science Fundamentals
Semester II: Python Programming
Semester I: C Programming
4. SYLLABUS
20AI402 DATA ANALYTICS          L T P C
                                3 0 0 3
UNIT I INTRODUCTION 9
UNIT V APPLICATIONS 9
Application: Sales and Marketing – Industry Specific Data Mining – microRNA Data
Analysis Case Study – Credit Scoring Case Study – Data Mining Nontabular Data.
TOTAL: 45 PERIODS
5. COURSE OUTCOMES
[CO-PO mapping matrix: CO2-CO5 mapped against the programme outcomes, mapping strengths 1-3]
Lecture Plan
Unit - I
LECTURE PLAN – Unit 1- INTRODUCTION
S.No | Topic                                                  | Periods | CO | Taxonomy Level | Date       | Mode of Delivery
2    | Definition of Big Data                                 | 1       | 1  | K1             | 10/08/2023 | PPT / Chalk & Talk
3    | Challenges with Big Data                               | 1       | 1  | K1             | 12/08/2023 | PPT / Chalk & Talk
4    | Traditional Business Intelligence (BI) versus Big Data | 1       | 1  | K1             | 16/08/2023 | PPT / Chalk & Talk
5    | Introduction to Big Data Analytics                     | 1       | 1  | K1             | 18/08/2023 | PPT / Chalk & Talk
6    | Classification of Analytics                            | 1       | 1  | K1             | 19/08/2023 | PPT / Chalk & Talk
7    | Analytics Tools                                        | 1       | 1  | K1             | 22/08/2023 | PPT / Chalk & Talk
8    | Importance of Big Data Analytics                       | 1       | 1  | K1             | 23/08/2023 | PPT / Chalk & Talk
Activity name:
Compare and contrast traditional business intelligence (BI) and Big Data under the following heads:
1. Purpose
2. Components
3. Tools
4. Characteristics
5. Applied Fields
Students will gain a better understanding of traditional business intelligence and big data, and of their applications and uses in day-to-day life.
Guidelines to do the activity:
3) Conduct peer review. (Each team will be reviewed by all other teams and mentors.)
Useful link:
https://www.educba.com/business-intelligence-vs-big-data/
UNIT-I
INTRODUCTION
9. LECTURE NOTES
1. Evolution of Big Data
The 1970s and before was the era of mainframes; the data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s, the era of data-intensive applications. The World Wide Web (WWW) and the Internet of Things (IoT) have since led to an onslaught of structured, unstructured, and multimedia data.
2. Definition of Big Data
The following are a few responses to the question "Define big data" that we have heard over time:
1. Anything beyond the human and technical infrastructure needed to support storage, processing and analysis.
- Gartner IT Glossary
For the sake of easy comprehension, we will look at the definition in three parts.
Definition of big data - Gartner
"Enhanced insight and decision making" refers to deriving deeper, richer and more meaningful insights, and then using those insights to make faster and better decisions that yield business value and, thus, a competitive edge.
3. Challenges with Big Data
1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: "Will all this data be useful for analysis?", "Do we work with all this data or a subset of it?", "How will we separate the knowledge from the noise?", etc.
2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity and easy upgrading/downgrading are concerned. However, it also complicates the decision of whether to host big data solutions outside the enterprise.
3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may become irrelevant and obsolete just a few hours after being generated.
5. There are other challenges with respect to the capture, storage, preparation, search, analysis, transfer, security and visualization of big data. Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big a dataset should be for it to be considered "big data." Here we deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic, and therefore there is a need to ingest the data as quickly as possible.
6. Classification of Analytics
There are two schools of thought:
1. Those that classify analytics into basic, operationalized, and advanced analytics.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
❖ Basic analytics: This primarily is slicing and dicing of data to help with basic
business insights. This is about reporting on historical data, basic
visualization, etc.
❖ Operationalized analytics: It is operationalized analytics if it gets woven into
the enterprise’s business processes.
❖ Advanced analytics: This largely is about forecasting for the future by way of
predictive and prescriptive modelling.
Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table
6.1.
Table 6.1: Analytics 1.0 versus Analytics 2.0 versus Analytics 3.0

Analytics 1.0: Small and structured data sources. Data stored in enterprise data warehouses or data marts. Data was internally sourced.

Analytics 2.0: Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop. Data was often externally sourced.

Analytics 3.0: A blend of big data and traditional analytics to yield insights and offerings with speed and impact. Data is both internally and externally sourced.
Figure 6.1 shows the subtle growth of analytics from Descriptive → Diagnostic → Predictive → Prescriptive analytics.
7. Analytics Tools
1. MS Excel
Templates do most of the set-up and design work for us, so we can focus on our data. In Excel 2013, there are templates for budgets, calendars, forms, reports, and more.
The new Quick Analysis tool lets us convert our data into a chart or table in two steps or less. We can preview our data with conditional formatting, sparklines, or charts, and make our choice stick in just one click.
3. IBM SPSS Modeler
IBM SPSS Modeler is a data mining and text analytics software application from IBM. It is used to build predictive models and conduct other analytic tasks. It has a visual interface which allows users to leverage statistical and data mining algorithms without programming.
4. Statistica
• It is a worldwide proven software suite that can be used to analyze and visualize data, to create insights and predictions, and to deliver tailored solutions.
5. Salford Systems
• The Salford Predictive Modeler (SPM) software suite is a highly accurate and ultra-fast platform for developing predictive, descriptive, and analytical models.
• The SPM software suite’s data mining technologies span classification,
regression, survival analysis, missing value analysis, data binning and
clustering/segmentation.
• SPM algorithms are considered to be essential in sophisticated data science
circles.
6. WPS
Open-source analytics tools:
1. R Analytics
● R analytics is data analytics using R programming language, an open-source
language used for statistical computing or graphics.
● This programming language is often used in statistical analysis and data
mining. It can be used for analytics to identify patterns and build practical
models.
● R can not only help analyze an organization's data, but can also be used in the creation and development of software applications that perform statistical analysis.
2. Weka
It contains a collection of visualization tools and algorithms for data analysis and predictive modeling, coupled with a graphical user interface.
8. Importance of Big Data Analytics
1. Reactive-Business Intelligence:
What does Business Intelligence (BI) help us with? It allows the businesses to
make faster and better decisions by providing the right information to the
right person at the right time in the right format. It is about analysis of the
past or historical data and then displaying the findings of the analysis or
reports in the form of enterprise dashboards, alerts, notifications, etc. It has
support for both pre-specified reports as well as ad hoc querying.
2. Reactive - Big Data Analytics:
Here the analysis is done on huge datasets, but the approach is still reactive, as it is based on static data.
3. Proactive - Analytics:
This is to support futuristic decision making by the use of data mining, predictive modelling, text mining, and statistical analysis. This analysis is not on big data, as it still uses traditional database management practices and therefore has severe limitations on storage capacity and processing capability.
10. ASSIGNMENTS
1. You have just got a book issued from the library. What details about the book can be placed in an RDBMS table?
2. Share your experience as a customer on an e-commerce site. Comment
on the big data that gets created on a typical e-commerce site.
3. How does Big Data Analytics influence the business decision-making
and business priorities as well?
4. How would the concept of Big Data be useful for the future possibilities of an organization?
5. What are key skill sets and behavioral characteristics of a data scientist?
11. PART A: Q & A - UNIT I
2. How does the traditional BI environment differ from the big data environment? (CO1, K1)
● volume
● value
● variety
● velocity
● veracity
❖ Basic analytics: This primarily is slicing and dicing of data to help with basic
business insights. This is about reporting on historical data, basic
visualization, etc.
❖ Advanced analytics: This largely is about forecasting for the future by way of
predictive and prescriptive modelling.
● MS Excel
● SAS
● IBM SPSS Modeler
● Statistica
● Salford Systems
● WPS (World Programming System)
11. What are the two open-source analytics tools? (CO1, K1)
❖ R Analytics
● R analytics is data analytics using R programming language, an open-source
language used for statistical computing or graphics.
❖ Weka
• The software stands out for its user-friendliness, including the complete visualization of analytical processes in the project interface.
● It allows the businesses to make faster and better decisions by providing the
right information to the right person at the right time in the right format.
● It is about analysis of the past or historical data and then displaying the
findings of the analysis or reports in the form of enterprise dashboards,
alerts, notifications, etc.
● It has support for both pre-specified reports as well as ad hoc querying.
● IBM SPSS Modeler is a data mining and text analytics software application
from IBM.
● It is used to build predictive models and conduct other analytic
tasks.
● It has a visual interface which allows users to leverage statistical and data
mining algorithms without programming.
19. What is the purpose of Salford Systems? (CO1, K1)
• The Salford Predictive Modeler (SPM) software suite is a highly accurate and
ultra-fast platform for developing predictive, descriptive, and analytical
models.
NPTEL: https://onlinecourses.nptel.ac.in/noc21_cs45/preview
Coursera: https://www.coursera.org/learn/python-data-analysis
Udemy: https://www.udemy.com/topic/data-science/
MOOC: https://mooc.es/course/introduction-to-data-science-in-python/
edX: https://learning.edx.org/course/course-v1:Microsoft+DAT208x+2T2016/home
14. REAL TIME APPLICATIONS
1. Transportation
For example, during the wedding season or the holiday season, transport facilities are prepared to accommodate the heavy number of passengers traveling from one place to another, using prediction tools and techniques.
2. Logistics
Logistics companies like DHL, FedEx, etc. use data analytics to manage their overall operations. Using data analytics, they can figure out the best shipping routes and approximate delivery times, and can also track the real-time status of goods dispatched using GPS trackers. From the moment a shipment leaves its origin until it reaches its buyer, every position is tracked, which minimizes the loss of goods.
3. Web Search
Whenever you hit the search button, search engines use data analytics algorithms to deliver the best search results within a limited time frame. The searched text is treated as a keyword, and all the related pieces of information are presented in a sorted manner that one can easily understand. For example, when you search for a product on Amazon, that product keeps showing up on your social media profiles, or you are shown its details to convince you to buy it.
4. Manufacturing
Data analytics helps the manufacturing industries maintain their overall working through tools like prediction analysis, regression analysis, budgeting, etc. A unit can figure out the number of products that need to be manufactured from the data collected and analyzed on demand samples, and likewise in many other ways.
Apache Pig
Pig is a high-level platform or tool used to process large datasets. It provides a high level of abstraction over MapReduce, along with a high-level scripting language, known as Pig Latin, which is used to develop data-analysis code.
First, to process the data stored in HDFS, programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks; these tasks are not visible to the programmers, which provides a high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Need for Pig: One limitation of MapReduce is that its development cycle is very long. Writing the mapper and the reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces development time through its multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: 200 lines of Java code can often be written in only 10 lines of Pig Latin. Programmers who have SQL knowledge need even less effort to learn Pig Latin.
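The mapper/reducer boilerplate that Pig abstracts away can be sketched in miniature. The following is an illustrative, single-process word count in Python (our own example, not from the course text); a real job would run the same three phases distributed across a Hadoop cluster, and the equivalent Pig Latin would be a few GROUP/COUNT lines.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: sum the counts collected for each word.
    return key, sum(values)

lines = ["pig runs on hadoop", "pig latin scripts run on hadoop"]
pairs = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts["pig"], counts["hadoop"])  # prints: 2 2
```

In Java MapReduce each of these phases becomes its own class plus job-configuration code, which is exactly the overhead that Pig Latin's scripted, multi-query approach removes.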
Types of Data Models in Apache Pig: It consists of the following four types of data models:
● Atom: An atomic data value, stored as a string. The main advantage of this model is that the value can be used both as a number and as a string.
TEXT BOOKS:
1.Subhashini Chellappan, Seema Acharya, “Big Data and Analytics”, 2nd edition,
Wiley Publications, 2019.
2.Suresh Kumar Mukhiya and Usman Ahmed, “Hands-on Exploratory Data Analysis
with Python”, Packt publishing, March 2020.
3. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman, "Mining of Massive Datasets, v2.1", Cambridge University Press, 2019.
4.Glenn J. Myatt, Wayne P. Johnson, Making Sense of Data II : A Practical Guide To
Data Visualization, Advanced Data Mining Methods, and Applications, Wiley 2009.
REFERENCES:
1.Nelli, F., Python Data Analytics: with Pandas, NumPy and Matplotlib, Apress,
2018.
2.Bart Baesens," Analytics in a Big Data World: The Essential Guide to Data Science
and its Applications", John Wiley & Sons, 2014
3.Min Chen, Shiwen Mao, Yin Zhang, Victor CM Leung, Big Data: Related
Technologies, Challenges and Future Prospects, Springer, 2014.
4.Michael Minelli, Michele Chambers, Ambiga Dhiraj, “Big Data, Big Analytics:
Emerging Business Intelligence and Analytic Trends”, John Wiley & Sons, 2013.
5.Marcello Trovati, Richard Hill, Ashiq Anjum, Shao Ying Zhu, “Big Data Analytics
and cloud computing – Theory, Algorithms and Applications”, Springer
International Publishing, 2016
18. MINI PROJECT SUGGESTION
Mini Project:
The project should contain the following components
• Real-time dataset
• Data preparation & Transformation
• Handling missing Data
• Data Storage
• Algorithm for data analytics
• Data visualization: Charts, Heatmap, Crosstab, Treemap
Thank you