21BCAD5C01 IDA Module 1 Notes
Module - 1
Syllabus of the Module
Introductory Concepts (12 Hours)
Overview of Data Science and Data Analytics, Types of Analytics: Descriptive, Diagnostics, Predictive
and Prescriptive; Data Ubiquity, Nature of Data: Structured, Unstructured, Big Data; Advantages of
Data-Driven Decisions, Data Science Process, Applications of Data Science in various fields, Data
Science Roles, Data Security, Privacy, and Ethical Issues.
Figure-1: Modified Data Science Venn Diagram originally suggested by Drew Conway
(2013)
This data is generated from different sources like financial logs, text files, multimedia forms,
sensors, machines and instruments.
Simple BI tools are not capable of processing this huge volume and variety of data.
This is why we need more complex and advanced analytical tools and algorithms for
processing, analyzing and extracting meaningful insights out of it.
Today, the whole world contributes to massive data growth in colossal volumes. The World
Economic Forum states that by the end of 2025, the daily global data generation will reach 463
exabytes (1 exabyte = 2^60 bytes) of data! (Figure-2)
managing data, but to statisticians it was not data science because it did not involve much
analysis of the data.
Statistics had been used mostly in military applications and more mundane logistics and
demographic reporting. Then deterministic engineering applications and industrial optimization
grew dominant and drew most of the public's attention.
In the late 1960s, statistical software packages, most notably BMDP (Bio-Medical Data
Package) and later SPSS (Statistical Package for the Social Sciences) and SAS (Statistical
Analysis System), were developed, and applied statisticians became very important people in
the 1970s.
Age of Data-Wrangling: 1980s-2000s
Statistical analysis changed a lot after the 1970s. PC sales had reached almost a million per
year by 1980. Now companies had IT departments. From the early 1990s, sales of PCs were
fueled by Pentium-class speeds, GUIs, the Internet, and affordable, user-friendly software,
including spreadsheets with statistical functions.
Statistics was going through a phase of explosive evolution. By the mid-1980s, statistical
analysis was no longer considered the exclusive domain of professionals. With PCs and
statistical software proliferating and universities providing a statistics course for a wide
variety of degrees, it became common for non-professionals to conduct their own analyses.
Another major event in 1983 was the introduction of Lotus 1-2-3. The spreadsheet
software provided users with the ability to manage their data, perform calculations, and create
charts. In 1985, Microsoft Excel was introduced and, within a decade, surpassed Lotus 1-2-3 to
become the dominant spreadsheet software. The popularity of Microsoft Excel fueled
data analysis tasks in management information reporting systems. BI (Business Intelligence)
emerged in 1989, mainly in major corporations.
Against that backdrop of applied statistics came the explosion of data wrangling capabilities
(i.e., transforming or mapping data from one form to another with an intent of making it
suitable for analysis). Relational databases and SQL (pronounced "sequel") data retrieval became the trend.
Technology also exerted its influence. Not only were PCs becoming faster but, perhaps more
importantly, hard disk drives were getting bigger and less expensive. This led to data
warehousing, and eventually, the emergence of Big Data. Big data brought Data Mining and
black-box modeling / machine learning models. The R programming language profoundly impacted
statistical analysis and data mining tasks.
In the late 90s, instant messaging, blogging, and social media evolved and became very
popular in a short span of time. The amount of data generated by and available from the
Internet skyrocketed. Big Data became big and posed serious challenges for organizations.
Technologies such as Hadoop and Spark, and NoSQL databases such as MongoDB and Cassandra, evolved
to manage it. Big Data required special software, like Hadoop, not just because of its volume
but also because much of it was unstructured.
In 2001, William S. Cleveland coined the term "Data Science". Shortly thereafter, the "CODATA
Data Science Journal" was launched in April 2002 by the International Council for Science's
Committee on Data for Science and Technology, and in January 2003 Columbia University began
publishing "The Journal of Data Science".
Descriptive Analytics
Descriptive data analytics is the simplest type of analytics and the foundation (or backbone) on which
the other types are built. It allows you to pull trends and relationships from raw data and concisely
describe what happened or is currently happening, but it doesn't dig deeper.
Descriptive analytics answers the question, "What happened?". It can also answer related questions
such as when it happened, where it happened, and how many times it happened.
For example, imagine you’re analyzing your company’s data and find there’s a seasonal surge in sales
for one of your products: a video game console. Here, descriptive analytics can tell you, “This video
game console experiences an increase in sales in October, November, and December each year.”
As another example, you may be responsible for reporting on which media channels drive the most traffic
to the product page of your company’s website. Using descriptive analytics, you can analyze the page’s
traffic data to determine the number of users from each source. You may decide to take it one step
further and compare traffic source data to historical data from the same sources. This can enable you
to update your team on movement; for instance, highlighting that traffic from paid advertisements
increased 20 percent year over year.
Another example of descriptive analytics that may be familiar to you is financial statement
analysis. Financial statements are periodic reports that detail financial information about a business
and, together, give a holistic view of a company’s financial health.
Data visualization is a natural fit for communicating descriptive analysis because charts, graphs, and
maps can show trends in data—as well as dips and spikes—in a clear, easily understandable way.
Basic statistical software such as Microsoft Excel, or data visualization tools such as Google Charts
and Tableau, can help parse data, identify trends and relationships between variables, and display
the information visually.
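For instance, a minimal sketch of such descriptive analysis in Python with pandas (the table, column names and figures below are hypothetical) could look like this:

    import pandas as pd

    # Hypothetical monthly sales data for the video game console example
    sales = pd.DataFrame({
        "month": ["Sep", "Oct", "Nov", "Dec", "Jan"],
        "units_sold": [1200, 2100, 2600, 3100, 1100],
    })

    # Describe what happened: count, mean, spread and the peak month
    print(sales["units_sold"].describe())
    print(sales.loc[sales["units_sold"].idxmax()])   # month with the highest sales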
Diagnostic Analytics
Diagnostic analytics addresses the next logical question, "Why did this happen?"
Taking the analysis a step further, this type includes comparing coexisting trends or movement,
uncovering correlations between variables, and determining causal relationships where possible.
Continuing the aforementioned example, you may dig into video game console users’ demographic
data and find that they’re between the ages of eight and 18. The customers, however, tend to be between
the ages of 35 and 55. Analysis of customer survey data reveals that one primary motivator for
customers to purchase the video game console is to gift it to their children. The spike in sales in the
October, November and December months may be due to the festivals and holidays that include gift-
giving.
Diagnostic data analytics is useful for getting at the root of an organizational issue.
Diagnostic analysis can be done manually, using an algorithm, or with statistical software (such as
SPSS or Microsoft Excel).
There are several concepts you need to understand before diving into diagnostic analytics: hypothesis
testing, correlation versus causation, and diagnostic regression analysis.
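As a small illustration of checking a correlation between two variables, the sketch below (with hypothetical columns such as ad_spend and sales) uses pandas and SciPy; it is an example, not a prescribed method:

    import pandas as pd
    from scipy import stats

    # Hypothetical data: monthly advertising spend vs. monthly sales
    df = pd.DataFrame({
        "ad_spend": [10, 12, 15, 18, 22, 25],
        "sales":    [100, 110, 130, 150, 170, 190],
    })

    # Pearson correlation coefficient and p-value (a hypothesis test of "no correlation")
    r, p_value = stats.pearsonr(df["ad_spend"], df["sales"])
    print(f"correlation r = {r:.2f}, p-value = {p_value:.4f}")
    # Note: a strong correlation alone does not establish causation.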
Predictive Analytics
Predictive analytics answers the question, "What is likely to happen?" The predictions could be for the
near future, such as the next day or next month, or for the more distant future, such as the upcoming year.
By analyzing historical data in tandem with industry trends, you can make informed predictions about
what the future could hold for your company.
For instance, knowing that video game console sales have spiked in October, November, and December
every year for the past decade provides you with ample data to predict that the same trend will occur
next year. Backed by upward trends in the video game industry as a whole, this is a reasonable
prediction to make.
Making predictions for the future can help your organization formulate strategies based on likely
scenarios.
Predictive analysis can be conducted manually or using machine-learning algorithms. Either way,
historical data is used to make assumptions about the future.
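A minimal sketch of such a prediction, assuming a hypothetical table of past December console sales, could fit a simple trend line with scikit-learn:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical historical data: year vs. December console sales
    years = np.array([[2019], [2020], [2021], [2022], [2023]])
    sales = np.array([2800, 3000, 3200, 3500, 3700])

    # Fit a linear trend on historical data and predict the next year
    model = LinearRegression().fit(years, sales)
    print("Predicted December sales for 2024:", model.predict([[2024]])[0])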
Prescriptive Analytics
Prescriptive analytics is often described as "the future of data analytics". This type of analysis goes beyond explanations
and predictions to recommend the best course of action moving forward. Prescriptive analytics takes
into account all possible factors in a scenario and suggests actionable takeaways. This type of analytics
can be especially useful when making data-driven decisions.
Rounding out the video game example: What should your team decide to do given the predicted trend
in seasonality due to winter gift-giving? Perhaps you decide to run an A/B test (a method of two-sample
hypothesis testing for comparing the outcomes of two different choices, A and B) with two ads: one
that caters to product end-users (children) and one targeted to customers (their parents). The data from
that test can inform how to capitalize on the seasonal spike and its supposed cause even further. Or,
maybe you decide to increase marketing efforts in September with festival / holiday-themed messaging
to try to extend the spike into the following months.
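The A/B test mentioned above is, at its core, a two-sample hypothesis test. A small sketch using SciPy follows; the conversion counts for ads A and B are invented purely for illustration:

    from scipy import stats

    # Hypothetical daily conversions observed under ad A and ad B
    ad_a = [31, 28, 35, 30, 33, 29, 32]
    ad_b = [38, 36, 41, 39, 35, 40, 37]

    # Two-sample t-test: is the difference in means statistically significant?
    t_stat, p_value = stats.ttest_ind(ad_a, ad_b)
    print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
    # A small p-value (e.g. < 0.05) suggests the two ads perform differently.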
While manual prescriptive analysis is doable and accessible, machine-learning algorithms are often
employed to help parse through large volumes of data to recommend the optimal next step. Algorithms
use “if” and “else” statements, which work as rules for parsing data. If a specific combination of
requirements is met, an algorithm recommends a specific course of action.
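A toy sketch of such rule-based recommendation logic is given below; the thresholds and actions are invented purely for illustration:

    def recommend_action(month: str, predicted_demand: int) -> str:
        """If a specific combination of conditions is met, recommend a specific action."""
        if month in ("Oct", "Nov", "Dec") and predicted_demand > 3000:
            return "Increase inventory and run holiday-themed ads"
        elif month == "Sep" and predicted_demand > 2000:
            return "Start early-season marketing to extend the seasonal spike"
        else:
            return "Maintain regular marketing spend"

    print(recommend_action("Nov", 3400))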
DIKW Pyramid
The DIKW (Data, Information, Knowledge, Wisdom) Pyramid represents the relationships among
data, information, knowledge and wisdom. The hierarchical representation moves from the lower levels
to the higher ones: first comes data, then information, next knowledge, and finally wisdom.
Each step answers different questions about the initial data and adds value to it. The more we enrich
our data with meaning and context, the more knowledge and insights we get out of it so we can take
better, informed and data-driven decisions. At the top of the pyramid, we have turned the knowledge
and insights into a learning experience that guides our actions.
We can also say that, if data and information are like a look back to the past, knowledge and wisdom
are associated with what we do now and what we want to achieve in the future.
Knowledge provides understanding and explanation of why something is happening. Example: why is there
a decrease in sales?
Wisdom helps in doing the right thing; it tells us what the best thing to do is and enables the right
decisions. Example: what is the right action to nullify the factors that cause the decrease in sales
and to increase the sales?
The four types of data analytics add value to the raw data and contribute to the DIKW pyramid. When
you process the data, you get information; when you derive hidden patterns from the data, you get
knowledge; and when you understand and apply that knowledge to take the right decisions, you become
intelligent and wise.
Data Ubiquity
According to experts "What we're seeing right now is the end of the era of software. Hardware had its
20- to 30-year run. Software had its 20- to 30-year run".
So, what's next? An era that experts call the "age of data ubiquity," one in which a new generation of
data-centric apps exploit massive data sets generated by both individuals and enterprises.
Data will dominate your life. Whether you're walking, talking, driving or shopping, you are generating
data valuable to the public and private sectors. Organizations and their customers generate huge
amounts of data on a daily basis.
In the era of social media and Internet of Things, data is everywhere. Your old refrigerator was a dumb
refrigerator. Your new refrigerator is a smart refrigerator. It records time, temperatures, vibrations. It
records the electricity it's consuming. It's connected to the Internet for better monitoring and control.
The plant in your house is going to be instrumented, and it's going to tell you when it's dry and needs
to be watered, and maybe it will be able to turn on the sprinkler when it knows it's thirsty and needs
water.
When you walk into a big retail store, you are recognised as a shopper; your movement and shopping
behaviour are analysed, and custom advertisements are displayed on digital signage to attract you and
provide you a better shopping experience.
In addition to the connected devices we use today (smartphones, tablets, PCs and smartwatches),
sensors attached to or embedded in other physical objects generate data as well. IDC estimates that
by 2025 there will be over 40 billion IoT devices on earth, generating almost half of the world's total
digital data.
The bulk of data generated comes from three primary sources: social data, machine data and
transactional data.
Transactional data is generated from all the daily transactions that take place both online and
offline. Invoices, payment orders, storage records, delivery receipts – all are characterized as
transactional data. However, data alone is almost meaningless, and most organizations struggle
to make sense of the data that they are generating and how it can be put to good use.
Machine data is defined as information which is generated by industrial equipment, sensors
that are installed in machinery, and even web logs which track user behavior. This type of data
is expected to grow exponentially as the Internet of Things grows, becomes more pervasive and
expands around the world. Sensors such as medical devices, smart meters, road CCTV cameras,
satellites, and drones will deliver high velocity, value, volume and variety of data in the very
near future.
Social data comes from the Posts, Likes, Tweets & Retweets, Comments, Video Uploads, and
general media (e.g., image, text docs) that are uploaded and shared via the world’s favorite
social media platforms. This kind of data provides invaluable insights into consumer behavior
and sentiment and can be enormously influential in marketing analytics. The public web is
another good source of social data.
Nature of Data
Data generated at different sources are of many different types, such as audio, video, text, image,
streaming and each of them tends to require different tools and techniques to process. Data can be
distinguished along many dimensions. The most important one is the degree of organization. When
differentiating the degree of organization between data, we classify it into three different types:
structured, semi-structured and unstructured data. The distinction is not always clear-cut and should
be understood from a processing point of view. The type of data has implications for how the data
should be stored, how machine-readable it is, and therefore how easy it is to analyze.
Structured Data
Structured data has a high degree of organization. It has been formatted and transformed into a
well-defined data model / schema, consisting of rows and columns, so that it is machine-readable and
can easily be extracted through algorithms. Examples: relational data, Excel sheets, CSV files, etc.
Structured data is generated by both humans and machines.
Unstructured Data
Unstructured data has an internal structure but is not structured via predefined data models or schema.
It may be textual or non-textual, and human- or machine-generated. This data is difficult to process
due to its complex arrangement and lack of specific formatting. Examples: audio, video, image and text
files, PDF files, etc.
Typical machine-generated unstructured data includes:
• Satellite imagery: Weather data, landforms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric
data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
Semi-Structured Data
Semi-structured data has some degree of organization in it. It is not as rigorously formatted as
structured data, but also not as messy as unstructured data. This degree of organization is typically
achieved with some sort of tags and markings that identify separate data elements, which enables data
analysts to determine information groupings and hierarchies. These markers can be commas, colons, or
other delimiters. Examples: markup languages such as HTML and XML; the open standard JSON
(JavaScript Object Notation); data stored in NoSQL databases.
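For example, a small JSON document (a hypothetical customer record) shows how tags name each data element; Python's built-in json module can parse it directly:

    import json

    # Semi-structured data: tags such as "name" and "orders" identify each element,
    # but records need not follow a rigid, table-like schema.
    record = '{"name": "Asha", "age": 28, "orders": [{"item": "console", "qty": 1}]}'

    data = json.loads(record)
    print(data["name"], data["orders"][0]["item"])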
Big Data
Big data refers to data that is so large, fast or complex that it’s very difficult or impossible to process
using traditional methods.
Big data analytics is the use of advanced processing and analysis techniques against very large, diverse
data sets that include structured, semi-structured and unstructured data, from different sources, and in
different sizes, from terabytes (10^12 bytes) to zettabytes (10^21 bytes).
The act of accessing and storing large amounts of information for analytics has been around for a long
time. But the concept of big data gained momentum in the early 2000s when industry analyst Doug
Laney articulated the now-mainstream definition of big data as the three V’s: volume, velocity and
variety.
Some more Vs have emerged over the past few years: veracity, value and variability. Data has
intrinsic value. But it’s of no use until that value is discovered. Equally important: How truthful is your
data—and how much can you rely on it?
Veracity: The truthfulness or reliability of the data, which refers to the quality of data. Because data
comes from so many different sources, the data quality of captured data can vary greatly; it’s difficult
to link, match, cleanse and transform data across systems; and erroneous data can undermine the
accuracy of the analysis.
Value: The worth in information that can be achieved by the processing and analysis of large datasets.
Value also can be measured by an assessment of the usability of information that is retrieved from the
analysis of big data.
Variability: In addition to the increasing velocities and varieties of data, data flows are unpredictable
– changing often and varying greatly. It’s challenging, but businesses need to know when something
is trending in social media, and how to manage daily, seasonal and event-triggered peak data loads.
Advantages of Data-Driven Decisions
How exactly data can be incorporated into the decision-making process will depend on a number of
factors, such as your business goals and the types and quality of data you have access to. Though
data-driven decision-making has existed in business in one form or another for centuries, it's a truly
modern phenomenon.
Benefits
1. You’ll Make More Confident Decisions
Once you begin collecting and analyzing data, you’re likely to find that it’s easier to reach a confident
decision about virtually any business challenge, whether you’re deciding to launch or discontinue a
product, adjust your marketing message, branch into a new market, or something else entirely.
2. You’ll Become More Proactive
When you first implement a data-driven decision-making process, it’s likely to be reactionary in nature.
The data tells a story, which you and your organization must then react to.
While this is valuable in its own right, it’s not the only role that data and analysis can play within your
business. Given enough practice and the right types and quantities of data, it’s possible to leverage it
in a more proactive way—for example, by identifying business opportunities before your competition
does, or by detecting threats before they grow too serious.
3. You Can Realize Cost Savings and Profitability
When your decision-making becomes more data-driven, you can realize:
• Improved efficiency and productivity in organizational processes
• Better financial performance
• Identification and creation of new product and service revenue
• Improved customer acquisition and retention
• Improved customer experiences
• Competitive advantage
After a discussion with the marketing team, you decide to focus on the problem: “How can we identify
potential customers who are more likely to buy our product?”
The next step is to figure out what data you have available to answer the above question.
2. Data Acquisition
After defining the problem, you will need to collect the required data to derive insights and turn the
business problem into a probable solution.
The required data may already be available within the organization; if not, it can be sourced or
purchased from external providers.
If you think the data available is not sufficient, then you must make arrangements to collect new data.
Many companies store the sales data they have in customer relationship management (CRM) systems.
3. Data preparation
The data you have collected may contain errors such as invalid entries, inconsistent values, missing /
null values, duplicate values, and many more. First, you need to make sure the data is clean and free
from all possible errors.
Since data collection is an error-prone process, in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing removes
false values from a data source and inconsistencies across data sources, data integration enriches data
sources by combining information from multiple data sources, and data transformation ensures that
the data is in a suitable format for use in your models.
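A minimal pandas sketch of these three subphases is shown below; the table, column names and the enrichment source are hypothetical:

    import pandas as pd

    # Hypothetical raw data with a duplicate row and a missing value
    raw = pd.DataFrame({
        "customer_id": [1, 2, 2, 3],
        "amount": [250.0, 180.0, 180.0, None],
    })

    # Data cleansing: remove duplicates and fill missing values
    clean = raw.drop_duplicates()
    clean = clean.fillna({"amount": clean["amount"].mean()})

    # Data integration: enrich with information from another (hypothetical) source
    regions = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["North", "South", "East"]})
    clean = clean.merge(regions, on="customer_id", how="left")

    # Data transformation: bring values into a format suitable for modelling
    clean["amount_thousands"] = clean["amount"] / 1000.0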
4. Data exploration
This is one of the most crucial steps in a data science process. It is popularly known as EDA
(Exploratory Data Analysis). Data exploration is concerned with building a deeper understanding of
your data. You try to understand how variables interact with each other, the distribution of the
data, and whether there are outliers. You also identify correlations and trends between the dependent
and independent variables.
To achieve this, you mainly use descriptive statistics, and data visualization techniques.
Data visualisation makes identifying patterns and trends much easier than just looking at thousands of
rows of a dataset and relying on statistics alone.
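A brief EDA sketch in pandas and matplotlib (using a tiny hypothetical DataFrame as a stand-in for the prepared data):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Tiny stand-in for the prepared dataset; 55 looks like a possible outlier
    df = pd.DataFrame({
        "price": [10, 12, 11, 13, 55],
        "units": [100, 95, 98, 90, 20],
    })

    print(df.describe())   # distributions: mean, std, quartiles
    print(df.corr())       # correlations between variables
    df.hist()              # quick visual check of each distribution
    plt.show()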
5. Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data and
to transform data into a form that a model can understand better. Feature engineering also deals with
data and pattern anomalies:
• Data Transformation techniques: Log transform, Square-root transformation, Categorical
encoding, Power functions and Scaling.
• Dealing with: skewed data, bias mitigation, binning, and outlier detection. Outlier detection
identifies data points or observations that deviate from a dataset's normal behaviour. Outlier data
can indicate critical incidents, such as a technical glitch, or potential hazards; these need to be
treated accordingly, otherwise they could mislead our final model(s).
After transforming the data into the right format and dealing with potential data hazards, we often
end up with many features, especially with a high-dimensional dataset. We cannot feed all the features
to the machine learning model; that would heavily overfit the model. Instead, we have to choose the
right subset of features. This is called feature selection.
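A small sketch of a few of these steps with pandas and scikit-learn is given below; the columns, values and thresholds are hypothetical, and the right transforms always depend on the data:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({
        "income": [30_000, 45_000, 52_000, 1_200_000],   # skewed, with an outlier
        "city":   ["A", "B", "A", "C"],
    })

    # Log transform to reduce skew, scaling, and categorical (one-hot) encoding
    df["income_log"] = np.log1p(df["income"])
    df["income_scaled"] = StandardScaler().fit_transform(df[["income_log"]]).ravel()
    df = pd.get_dummies(df, columns=["city"])

    # Simple outlier flag using the interquartile range (IQR) rule
    q1, q3 = df["income_log"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["is_outlier"] = ~df["income_log"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)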
6. Model Building
Once the data with reduced features are ready to be modelled, the data is divided into training and test
sets. Model building (see Figure-4) involves model training and evaluation in iteration. We create
a Baseline model and keep increasing or decreasing the complexity of the model by Hyperparameter
Tuning to get desired accuracy.
The model is trained using the training set, and the performance of the trained model is evaluated on
the test dataset.
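A minimal sketch of this train/test workflow in scikit-learn, using synthetic data purely as a stand-in for the prepared dataset:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Synthetic data standing in for the feature-engineered dataset
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a baseline model and evaluate it on the held-out test set
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))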
Once the model is deployed, it is monitored for its performance. If required you may go back to one or
more previous stages to recalibrate the model to improve its performance.
Due to the iterative nature of the process, it is sometimes called Data Science process life cycle.
Data science has been the most popular field with applications in a wide range of domains. It has been
revolutionizing the way we perceive data and make data-driven decisions. Given below are some of the
popular domains and important applications, although the list is not exhaustive.
Banking, Finance and Insurance
Sales and Marketing
Healthcare and Pharma
Internet and Social media
Travel and Transport
Ola and Uber employ data science to improve pricing and delivery routes, as well as optimal
resource allocation, by combining numerous factors such as consumer profiles, geography,
economic indicators, and logistics providers.
Dynamic Pricing
Faster route and ETA (Estimated Time of Arrival) prediction
To protect individuals’ privacy, ensure you’re storing data in a secure database so it doesn’t end up in
the wrong hands. Data security methods that help protect privacy include dual-authentication password
protection and file encryption.
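As a small illustration of file/data encryption (one of the methods mentioned above), the sketch below uses the Fernet recipe from the Python cryptography package; it is a toy example, not a complete security design:

    from cryptography.fernet import Fernet

    # Generate a key and keep it secret (e.g. in a key-management system)
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # Encrypt sensitive data before storing it
    token = fernet.encrypt(b"customer: Asha, phone: 99999-00000")

    # Only holders of the key can decrypt it
    print(fernet.decrypt(token).decode())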
Data security focuses on systems in place that prevent malicious external attempts to access, steal, or
destroy data, whereas data privacy focuses on the ethical and legal use and access to sensitive data and
PII. Data security and data privacy work together to ensure your customers’ safety and anonymity.
Even professionals who regularly handle and analyze sensitive data can make mistakes. One way to
prevent slip-ups is by de-identifying a dataset. A dataset is de-identified when all pieces of PII are
removed, leaving only anonymous data. This enables analysts to find relationships between variables
of interest without attaching specific data points to individual identities.
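A simple de-identification sketch in pandas is shown below (the columns are hypothetical, and real de-identification needs more care, for example against re-identification from quasi-identifiers):

    import hashlib
    import pandas as pd

    df = pd.DataFrame({
        "name":  ["Asha", "Ravi"],          # PII
        "email": ["a@x.com", "r@y.com"],    # PII
        "age":   [34, 41],
        "spend": [1200, 800],
    })

    # Replace direct identifiers with a one-way hashed ID, then drop the PII columns
    df["subject_id"] = df["email"].apply(lambda e: hashlib.sha256(e.encode()).hexdigest()[:10])
    deidentified = df.drop(columns=["name", "email"])
    print(deidentified)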
4. Intention
When discussing the context of ethics, intentions matter. Before collecting data, ask yourself why you
need it, what you’ll gain from it, and what changes you’ll be able to make after analysis. If your
intention is to hurt others, profit from your subjects’ weaknesses, or any other malicious goal, it’s not
ethical to collect their data.
When your intentions are good—for instance, collecting data to gain an understanding of women’s
healthcare experiences so you can create an app to address a pressing need—you should still assess
your intention behind the collection of each piece of data.
Are there certain data points that don’t apply to the problem at hand? For instance, is it necessary to
ask if the participants struggle with their mental health? This data could be sensitive, so collecting it
when it’s unnecessary isn’t ethical. Strive to collect the minimum viable amount of data, so you’re
taking as little as possible from your subjects while making a difference.
5. Outcomes
Even when intentions are good, the outcome of data analysis can cause inadvertent harm to individuals
or groups of people.
For example, a voters' mood/opinion survey may negatively impact a party during an election season.