21BCAD5C01 IDA Module 1 Notes
Module - 1
Syllabus of the Module
Introductory Concepts (12 Hours)
Overview of Data Science and Data Analytics, Types of Analytics: Descriptive, Diagnostics, Predictive
and Prescriptive; Data Ubiquity, Nature of Data: Structured, Unstructured, Big Data; Advantages of
Data-Driven Decisions, Data Science Process, Applications of Data Science in various fields, Data
Science Roles, Data Security, Privacy, and Ethical Issues.
Figure-1: Modified Data Science Venn Diagram originally suggested by Drew Conway
(2013)
This data is generated from different sources like financial logs, text files, multimedia forms,
sensors, machines and instruments.
Simple BI tools are not capable of processing this huge volume and variety of data.
This is why we need more complex and advanced analytical tools and algorithms for
processing, analyzing and extracting meaningful insights out of it.
Today, the whole world contributes to massive data growth in colossal volumes. The World
Economic Forum states that by the end of 2025, the daily global data generation will reach 463
exabytes (1 exabyte = 2^60 bytes) of data! (Figure-2)
managing data, but to statisticians it was not data science because it did not involve much
analysis of the data.
Statistics had been used mostly in military applications and more mundane logistics and
demographic reporting. Then deterministic engineering applications and industrial optimization
grew dominant and drew most of the public's attention.
In the late 1960s, statistical software packages, most notably BMDP (Bio-Medical Data
Package) and later SPSS (Statistical Package for the Social Sciences) and SAS (Statistical
Analysis System), were developed, and applied statisticians became very important people in
the 1970s.
Age of Data-Wrangling: 1980s-2000s
Statistical analysis changed a lot after the 1970s. PC sales had reached almost a million per
year by 1980. Now companies had IT departments. From the early 1990s, sales of PCs were
fueled by Pentium-class speeds, GUIs, the Internet, and affordable, user-friendly software,
including spreadsheets with statistical functions.
Statistics was going through a phase of explosive evolution. By the mid-1980s, statistical
analysis was no longer considered the exclusive domain of professionals. With PCs and
statistical software proliferating and universities providing a statistics course for a wide
variety of degrees, it became common for non-professionals to conduct their own analyses.
Another major event in 1983 was the introduction of Lotus 1-2-3. The spreadsheet
software provided users with the ability to manage their data, perform calculations, and create
charts. In 1985, Microsoft Excel was introduced and, within a decade, surpassed Lotus 1-2-3 to
become the dominant spreadsheet software. The popularity of Microsoft Excel fueled
data analysis tasks in management information reporting systems. BI (Business Intelligence)
emerged in 1989, mainly in major corporations.
Against that backdrop of applied statistics came the explosion of data wrangling capabilities
(i.e., transforming or mapping data from one form to another with an intent of making it
suitable for analysis). Relational databases and SQL (pronounced "sequel") data retrieval became the trend.
Technology also exerted its influence. Not only were PCs becoming faster but, perhaps more
importantly, hard disk drives were getting bigger and less expensive. This led to data
warehousing, and eventually, the emergence of Big Data. Big data brought Data Mining and
black-box modeling / machine learning models. The R programming language profoundly impacted
statistical analysis and data mining tasks.
In the late 90s, instant messaging, blogging, and social media evolved and became very
popular in a short span of time. The amount of data generated by and available from the
Internet skyrocketed. Big Data became big and posed serious challenges for organizations.
Technologies such as Hadoop and Spark, and NoSQL databases such as MongoDB and Cassandra, evolved
to manage it. Big Data required special software, like Hadoop, not just because of its volume
but also because much of it was unstructured.
In 2001, William S. Cleveland coined the term "Data Science". Shortly thereafter, the "CODATA
Data Science Journal" was launched in April 2002 by the International Council for Science's
Committee on Data for Science and Technology, and in January 2003 Columbia University began
publishing "The Journal of Data Science".
Descriptive Analytics
Descriptive data analytics is the simplest type of analytics and the foundation (or backbone) on which
the other types are built. It allows you to pull trends and relationships from raw data and concisely
describe what happened or is currently happening, but it doesn't dig deeper.
Descriptive analytics answers the question, "What happened?". It can also answer related questions
such as when it happened, where it happened, and how many times it happened.
For example, imagine you’re analyzing your company’s data and find there’s a seasonal surge in sales
for one of your products: a video game console. Here, descriptive analytics can tell you, “This video
game console experiences an increase in sales in October, November, and December each year.”
As another example, you may be responsible for reporting on which media channels drive the most traffic
to the product page of your company’s website. Using descriptive analytics, you can analyze the page’s
traffic data to determine the number of users from each source. You may decide to take it one step
further and compare traffic source data to historical data from the same sources. This can enable you
to update your team on movement; for instance, highlighting that traffic from paid advertisements
increased 20 percent year over year.
Another example of descriptive analytics that may be familiar to you is financial statement
analysis. Financial statements are periodic reports that detail financial information about a business
and, together, give a holistic view of a company’s financial health.
Data visualization is a natural fit for communicating descriptive analysis because charts, graphs, and
maps can show trends in data—as well as dips and spikes—in a clear, easily understandable way.
Basic statistical software such as Microsoft Excel, or data visualization tools such as Google Charts
and Tableau, can help parse data, identify trends and relationships between variables, and display
the information visually.
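For instance, a minimal sketch of such descriptive analysis in Python with pandas (the table, column names and figures below are hypothetical) could look like this:

    import pandas as pd

    # Hypothetical monthly sales data for the video game console example
    sales = pd.DataFrame({
        "month": ["Sep", "Oct", "Nov", "Dec", "Jan"],
        "units_sold": [1200, 2100, 2600, 3100, 1100],
    })

    # Describe what happened: count, mean, spread and the peak month
    print(sales["units_sold"].describe())
    print(sales.loc[sales["units_sold"].idxmax()])   # month with the highest sales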
Diagnostic Analytics
Diagnostic analytics addresses the next logical question, "Why did this happen?"
Taking the analysis a step further, this type includes comparing coexisting trends or movement,
uncovering correlations between variables, and determining causal relationships where possible.
Continuing the aforementioned example, you may dig into video game console users’ demographic
data and find that they’re between the ages of eight and 18. The customers, however, tend to be between
the ages of 35 and 55. Analysis of customer survey data reveals that one primary motivator for
customers to purchase the video game console is to gift it to their children. The spike in sales in the
October, November and December months may be due to the festivals and holidays that include gift-
giving.
Diagnostic data analytics is useful for getting at the root of an organizational issue.
Diagnostic analysis can be done manually, using an algorithm, or with statistical software (such as
SPSS or Microsoft Excel).
There are several concepts you need to understand before diving into diagnostic analytics: hypothesis
testing, correlation versus causation, and diagnostic regression analysis.
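As a small illustration of checking a correlation between two variables, the sketch below (with hypothetical columns such as ad_spend and sales) uses pandas and SciPy; it is an example, not a prescribed method:

    import pandas as pd
    from scipy import stats

    # Hypothetical data: monthly advertising spend vs. monthly sales
    df = pd.DataFrame({
        "ad_spend": [10, 12, 15, 18, 22, 25],
        "sales":    [100, 110, 130, 150, 170, 190],
    })

    # Pearson correlation coefficient and p-value (a hypothesis test of "no correlation")
    r, p_value = stats.pearsonr(df["ad_spend"], df["sales"])
    print(f"correlation r = {r:.2f}, p-value = {p_value:.4f}")
    # Note: a strong correlation alone does not establish causation.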
Predictive Analytics
Predictive analytics answers the question, "What is likely to happen?" The predictions could be for the
near future, such as the next day or next month, or for the more distant future, such as the upcoming year.
By analyzing historical data in tandem with industry trends, you can make informed predictions about
what the future could hold for your company.
For instance, knowing that video game console sales have spiked in October, November, and December
every year for the past decade provides you with ample data to predict that the same trend will occur
next year. Backed by upward trends in the video game industry as a whole, this is a reasonable
prediction to make.
Making predictions for the future can help your organization formulate strategies based on likely
scenarios.
Predictive analysis can be conducted manually or using machine-learning algorithms. Either way,
historical data is used to make assumptions about the future.
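A minimal sketch of such a prediction, assuming a hypothetical table of past December console sales, could fit a simple trend line with scikit-learn:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical historical data: year vs. December console sales
    years = np.array([[2019], [2020], [2021], [2022], [2023]])
    sales = np.array([2800, 3000, 3200, 3500, 3700])

    # Fit a linear trend on historical data and predict the next year
    model = LinearRegression().fit(years, sales)
    print("Predicted December sales for 2024:", model.predict([[2024]])[0])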
Prescriptive Analytics
Prescriptive analytics is often described as "the future of data analytics". This type of analysis goes beyond explanations
and predictions to recommend the best course of action moving forward. Prescriptive analytics takes
into account all possible factors in a scenario and suggests actionable takeaways. This type of analytics
can be especially useful when making data-driven decisions.
Rounding out the video game example: What should your team decide to do given the predicted trend
in seasonality due to winter gift-giving? Perhaps you decide to run an A/B test (a method of two-sample
hypothesis testing for comparing the outcomes of two different choices, A and B) with two ads: one
that caters to product end-users (children) and one targeted to customers (their parents). The data from
that test can inform how to capitalize on the seasonal spike and its supposed cause even further. Or,
maybe you decide to increase marketing efforts in September with festival / holiday-themed messaging
to try to extend the spike into the following months.
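The A/B test mentioned above is, at its core, a two-sample hypothesis test. A small sketch using SciPy follows; the conversion counts for ads A and B are invented purely for illustration:

    from scipy import stats

    # Hypothetical daily conversions observed under ad A and ad B
    ad_a = [31, 28, 35, 30, 33, 29, 32]
    ad_b = [38, 36, 41, 39, 35, 40, 37]

    # Two-sample t-test: is the difference in means statistically significant?
    t_stat, p_value = stats.ttest_ind(ad_a, ad_b)
    print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
    # A small p-value (e.g. < 0.05) suggests the two ads perform differently.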
While manual prescriptive analysis is doable and accessible, machine-learning algorithms are often
employed to help parse through large volumes of data to recommend the optimal next step. Algorithms
use “if” and “else” statements, which work as rules for parsing data. If a specific combination of
requirements is met, an algorithm recommends a specific course of action.
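A toy sketch of such rule-based recommendation logic is given below; the thresholds and actions are invented purely for illustration:

    def recommend_action(month: str, predicted_demand: int) -> str:
        """If a specific combination of conditions is met, recommend a specific action."""
        if month in ("Oct", "Nov", "Dec") and predicted_demand > 3000:
            return "Increase inventory and run holiday-themed ads"
        elif month == "Sep" and predicted_demand > 2000:
            return "Start early-season marketing to extend the seasonal spike"
        else:
            return "Maintain regular marketing spend"

    print(recommend_action("Nov", 3400))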
DIKW Pyramid
The DIKW (Data, Information, Knowledge, Wisdom) Pyramid represents the relationships among
data, information, knowledge and wisdom. The hierarchical representation moves from the lower levels
to the higher ones: first comes data, then information, next knowledge, and finally wisdom.
Each step answers different questions about the initial data and adds value to it. The more we enrich
our data with meaning and context, the more knowledge and insights we get out of it so we can take
better, informed and data-driven decisions. At the top of the pyramid, we have turned the knowledge
and insights into a learning experience that guides our actions.
We can also say that, if data and information are like a look back to the past, knowledge and wisdom
are associated with what we do now and what we want to achieve in the future.
Knowledge provides understanding and explanation of why something is happening. Example: why is there
a decrease in sales?
Wisdom helps in doing the right thing; it tells us what the best thing to do is and enables the right
decisions. Example: what is the right action to nullify the factors that cause the decrease in sales
and to increase the sales?
The four types of data analytics add value to the raw data and contribute to the DIKW pyramid. When
you process the data, you get information; when you derive hidden patterns from the data, you get
knowledge; and when you understand and apply that knowledge to take the right decisions, you become
intelligent and wise.
Data Ubiquity
According to experts "What we're seeing right now is the end of the era of software. Hardware had its
20- to 30-year run. Software had its 20- to 30-year run".
So, what's next? An era that experts call the "age of data ubiquity," one in which a new generation of
data-centric apps exploit massive data sets generated by both individuals and enterprises.
Data will dominate your life. Whether you're walking, talking, driving or shopping, you are generating
data valuable to the public and private sectors. Organizations and their customers generate huge
amounts of data on a daily basis.
In the era of social media and Internet of Things, data is everywhere. Your old refrigerator was a dumb
refrigerator. Your new refrigerator is a smart refrigerator. It records time, temperatures, vibrations. It
records the electricity it's consuming. It's connected to the Internet for better monitoring and control.
The plant in your house is going to be instrumented, and it's going to tell you when it's dry and needs
to be watered, and maybe it will be able to turn on the sprinkler when it knows it's thirsty and needs
water.
When you walk into a big retail store, you are recognised as a shopper; your movement and shopping
behaviour are analysed, and custom advertisements are displayed on digital signage to attract you and
provide you a better shopping experience.
In addition to the connected devices we use today (smartphones, tablets, PCs and smartwatches),
sensors attached to or embedded in other physical objects generate data as well. IDC estimates that
by 2025 there will be over 40 billion IoT devices on earth, generating almost half of the world's total
digital data.
The bulk of data generated comes from three primary sources: social data, machine data and
transactional data.
Transactional data is generated from all the daily transactions that take place both online and
offline. Invoices, payment orders, storage records, delivery receipts – all are characterized as
transactional data. However, data alone is almost meaningless, and most organizations struggle
to make sense of the data that they are generating and how it can be put to good use.
Machine data is defined as information which is generated by industrial equipment, sensors
that are installed in machinery, and even web logs which track user behavior. This type of data
is expected to grow exponentially as the Internet of Things grows, becomes more pervasive and
expands around the world. Sensors such as medical devices, smart meters, road CCTV cameras,
satellites, and drones will deliver high velocity, value, volume and variety of data in the very
near future.
Social data comes from the Posts, Likes, Tweets & Retweets, Comments, Video Uploads, and
general media (e.g., image, text docs) that are uploaded and shared via the world’s favorite
social media platforms. This kind of data provides invaluable insights into consumer behavior
and sentiment and can be enormously influential in marketing analytics. The public web is
another good source of social data.
Nature of Data
Data generated at different sources are of many different types, such as audio, video, text, image,
streaming and each of them tends to require different tools and techniques to process. Data can be
distinguished along many dimensions. The most important one is the degree of organization. When
differentiating the degree of organization between data, we classify it into three different types:
structured, semi-structured and unstructured data. The distinction is not always clear-cut and should
be understood from a processing point of view. The type of data has implications for how the data
should be stored, how machine-readable it is, and therefore how easy it is to analyze.
Structured Data
Structured data has a high degree of organization. It has been formatted and transformed into a
well-defined data model / schema, consisting of rows and columns, so that it is machine-readable and
can easily be extracted through algorithms. Examples: relational data, Excel sheets, CSV files, etc.
Structured data is generated by both humans and machines.
Unstructured Data
Unstructured data has an internal structure but is not structured via predefined data models or schema.
It may be textual or non-textual, and human- or machine-generated. This data is difficult to process
due to its complex arrangement and lack of specific formatting. Examples: audio, video, image and text
files, PDF files, etc.
Typical machine-generated unstructured data includes:
• Satellite imagery: Weather data, landforms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric
data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
Semi-Structured Data
Semi-structured data has some degree of organization in it. It is not as rigorously formatted as
structured data, but also not as messy as unstructured data. This degree of organization is typically
achieved with some sort of tags and markings that identify separate data elements, which enables data
analysts to determine information groupings and hierarchies. These markers can be commas, colons, or
other delimiters. Examples: markup languages such as HTML and XML; the open standard JSON
(JavaScript Object Notation); data stored in NoSQL databases.
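For example, a small JSON document (a hypothetical customer record) shows how tags name each data element; Python's built-in json module can parse it directly:

    import json

    # Semi-structured data: tags such as "name" and "orders" identify each element,
    # but records need not follow a rigid, table-like schema.
    record = '{"name": "Asha", "age": 28, "orders": [{"item": "console", "qty": 1}]}'

    data = json.loads(record)
    print(data["name"], data["orders"][0]["item"])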
Big Data
Big data refers to data that is so large, fast or complex that it’s very difficult or impossible to process
using traditional methods.
Big data analytics is the use of advanced processing and analysis techniques against very large, diverse
data sets that include structured, semi-structured and unstructured data, from different sources, and in
different sizes, from terabytes (10^12 bytes) to zettabytes (10^21 bytes).
The act of accessing and storing large amounts of information for analytics has been around for a long
time. But the concept of big data gained momentum in the early 2000s when industry analyst Doug
Laney articulated the now-mainstream definition of big data as the three V’s: volume, velocity and
variety.
Some more Vs have emerged over the past few years: veracity, value and variability. Data has
intrinsic value. But it’s of no use until that value is discovered. Equally important: How truthful is your
data—and how much can you rely on it?
Veracity: The truthfulness or reliability of the data, which refers to the quality of data. Because data
comes from so many different sources, the data quality of captured data can vary greatly; it’s difficult
to link, match, cleanse and transform data across systems; and erroneous data can undermine the
accuracy of the analysis.
Value: The worth in information that can be achieved by the processing and analysis of large datasets.
Value also can be measured by an assessment of the usability of information that is retrieved from the
analysis of big data.
Variability: In addition to the increasing velocities and varieties of data, data flows are unpredictable
– changing often and varying greatly. It’s challenging, but businesses need to know when something
is trending in social media, and how to manage daily, seasonal and event-triggered peak data loads.
Advantages of Data-Driven Decisions
How exactly data can be incorporated into the decision-making process will depend on a number of
factors, such as your business goals and the types and quality of data you have access to. Though
data-driven decision-making has existed in business in one form or another for centuries, it's a truly
modern phenomenon.
Benefits
1. You’ll Make More Confident Decisions
Once you begin collecting and analyzing data, you’re likely to find that it’s easier to reach a confident
decision about virtually any business challenge, whether you’re deciding to launch or discontinue a
product, adjust your marketing message, branch into a new market, or something else entirely.
2. You’ll Become More Proactive
When you first implement a data-driven decision-making process, it’s likely to be reactionary in nature.
The data tells a story, which you and your organization must then react to.
While this is valuable in its own right, it’s not the only role that data and analysis can play within your
business. Given enough practice and the right types and quantities of data, it’s possible to leverage it
in a more proactive way—for example, by identifying business opportunities before your competition
does, or by detecting threats before they grow too serious.
3. You Can Realize Cost Savings and Profitability
When your decision-making becomes more data-driven, you can realize:
• Improved efficiency and productivity in organizational processes
• Better financial performance
• Identification and creation of new product and service revenue
• Improved customer acquisition and retention
• Improved customer experiences
• Competitive advantage
After a discussion with the marketing team, you decide to focus on the problem: “How can we identify
potential customers who are more likely to buy our product?”
The next step is to figure out what data you have available to answer the above question.
2. Data Acquisition
After defining the problem, you will need to collect the required data to derive insights and turn the
business problem into a probable solution.
The required data may already be available within the organization; if not, it can be sourced or
purchased from external providers.
If you think the data available is not sufficient, then you must make arrangements to collect new data.
Many companies store the sales data they have in customer relationship management (CRM) systems.
3. Data preparation
The data you have collected may contain errors such as invalid entries, inconsistent values, missing /
null values, duplicate values, and many more. First, you need to make sure the data is clean and free
from all possible errors.
Since data collection is an error-prone process, in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing removes
false values from a data source and inconsistencies across data sources, data integration enriches data
sources by combining information from multiple data sources, and data transformation ensures that
the data is in a suitable format for use in your models.
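A minimal pandas sketch of these three subphases is shown below; the table, column names and the enrichment source are hypothetical:

    import pandas as pd

    # Hypothetical raw data with a duplicate row and a missing value
    raw = pd.DataFrame({
        "customer_id": [1, 2, 2, 3],
        "amount": [250.0, 180.0, 180.0, None],
    })

    # Data cleansing: remove duplicates and fill missing values
    clean = raw.drop_duplicates()
    clean = clean.fillna({"amount": clean["amount"].mean()})

    # Data integration: enrich with information from another (hypothetical) source
    regions = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["North", "South", "East"]})
    clean = clean.merge(regions, on="customer_id", how="left")

    # Data transformation: bring values into a format suitable for modelling
    clean["amount_thousands"] = clean["amount"] / 1000.0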
4. Data exploration
This is one of the most crucial steps in a data science process. It is popularly known as EDA
(Exploratory Data Analysis). Data exploration is concerned with building a deeper understanding of
your data. You try to understand how variables interact with each other, the distribution of the
data, and whether there are outliers. You also identify correlations and trends between the dependent
and independent variables.
To achieve this, you mainly use descriptive statistics, and data visualization techniques.
Data visualisation makes identifying patterns and trends much easier than just looking at thousands of
rows of a dataset and relying on statistics alone.
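A brief EDA sketch in pandas and matplotlib (using a tiny hypothetical DataFrame as a stand-in for the prepared data):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Tiny stand-in for the prepared dataset; 55 looks like a possible outlier
    df = pd.DataFrame({
        "price": [10, 12, 11, 13, 55],
        "units": [100, 95, 98, 90, 20],
    })

    print(df.describe())   # distributions: mean, std, quartiles
    print(df.corr())       # correlations between variables
    df.hist()              # quick visual check of each distribution
    plt.show()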
5. Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data and
to transform data into a form that a model can understand better. Feature engineering also deals with
data and pattern anomalies:
• Data Transformation techniques: Log transform, Square-root transformation, Categorical
encoding, Power functions and Scaling.
• Dealing with: skewed data, bias mitigation, binning, and outlier detection. Outlier detection
identifies data points or observations that deviate from a dataset's normal behaviour. Outlier data
can indicate critical incidents, such as a technical glitch, or potential hazards; these need to be
treated accordingly, otherwise they could mislead our final model(s).
After transforming the data into the right format and dealing with potential data hazards, we often
end up with many features, especially with a high-dimensional dataset. We cannot feed all the features
to the machine learning model; that would heavily overfit the model. Instead, we have to choose the
right subset of features. This is called feature selection.
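A small sketch of a few of these steps with pandas and scikit-learn is given below; the columns, values and thresholds are hypothetical, and the right transforms always depend on the data:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({
        "income": [30_000, 45_000, 52_000, 1_200_000],   # skewed, with an outlier
        "city":   ["A", "B", "A", "C"],
    })

    # Log transform to reduce skew, scaling, and categorical (one-hot) encoding
    df["income_log"] = np.log1p(df["income"])
    df["income_scaled"] = StandardScaler().fit_transform(df[["income_log"]]).ravel()
    df = pd.get_dummies(df, columns=["city"])

    # Simple outlier flag using the interquartile range (IQR) rule
    q1, q3 = df["income_log"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["is_outlier"] = ~df["income_log"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)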
6. Model Building
Once the data with reduced features are ready to be modelled, the data is divided into training and test
sets. Model building (see Figure-4) involves model training and evaluation in iteration. We create
a Baseline model and keep increasing or decreasing the complexity of the model by Hyperparameter
Tuning to get desired accuracy.
The model is trained using the training set, and the performance of the trained model is evaluated on
the test dataset.
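A minimal sketch of this train/test workflow in scikit-learn, using synthetic data purely as a stand-in for the prepared dataset:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Synthetic data standing in for the feature-engineered dataset
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a baseline model and evaluate it on the held-out test set
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))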
Once the model is deployed, it is monitored for its performance. If required you may go back to one or
more previous stages to recalibrate the model to improve its performance.
Due to the iterative nature of the process, it is sometimes called Data Science process life cycle.
Data science has been the most popular field with applications in a wide range of domains. It has been
revolutionizing the way we perceive data and make data-driven decisions. Given below are some of the
popular domains and important applications, although the list is not exhaustive.
Banking, Finance and Insurance
Sales and Marketing
Healthcare and Pharma
Internet and Social media
Travel and Transport
Ola and Uber employ data science to improve pricing and delivery routes, as well as optimal
resource allocation, by combining numerous factors such as consumer profiles, geography,
economic indicators, and logistics providers.
Dynamic Pricing
Faster route and ETA (Estimated Time of Arrival) prediction
To protect individuals’ privacy, ensure you’re storing data in a secure database so it doesn’t end up in
the wrong hands. Data security methods that help protect privacy include dual-authentication password
protection and file encryption.
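As a small illustration of file/data encryption (one of the methods mentioned above), the sketch below uses the Fernet recipe from the Python cryptography package; it is a toy example, not a complete security design:

    from cryptography.fernet import Fernet

    # Generate a key and keep it secret (e.g. in a key-management system)
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # Encrypt sensitive data before storing it
    token = fernet.encrypt(b"customer: Asha, phone: 99999-00000")

    # Only holders of the key can decrypt it
    print(fernet.decrypt(token).decode())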
Data security focuses on systems in place that prevent malicious external attempts to access, steal, or
destroy data, whereas data privacy focuses on the ethical and legal use and access to sensitive data and
PII. Data security and data privacy work together to ensure your customers’ safety and anonymity.
Even professionals who regularly handle and analyze sensitive data can make mistakes. One way to
prevent slip-ups is by de-identifying a dataset. A dataset is de-identified when all pieces of PII are
removed, leaving only anonymous data. This enables analysts to find relationships between variables
of interest without attaching specific data points to individual identities.
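A simple de-identification sketch in pandas is shown below (the columns are hypothetical, and real de-identification needs more care, for example against re-identification from quasi-identifiers):

    import hashlib
    import pandas as pd

    df = pd.DataFrame({
        "name":  ["Asha", "Ravi"],          # PII
        "email": ["a@x.com", "r@y.com"],    # PII
        "age":   [34, 41],
        "spend": [1200, 800],
    })

    # Replace direct identifiers with a one-way hashed ID, then drop the PII columns
    df["subject_id"] = df["email"].apply(lambda e: hashlib.sha256(e.encode()).hexdigest()[:10])
    deidentified = df.drop(columns=["name", "email"])
    print(deidentified)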
4. Intention
When discussing the context of ethics, intentions matter. Before collecting data, ask yourself why you
need it, what you’ll gain from it, and what changes you’ll be able to make after analysis. If your
intention is to hurt others, profit from your subjects’ weaknesses, or any other malicious goal, it’s not
ethical to collect their data.
When your intentions are good—for instance, collecting data to gain an understanding of women’s
healthcare experiences so you can create an app to address a pressing need—you should still assess
your intention behind the collection of each piece of data.
Are there certain data points that don’t apply to the problem at hand? For instance, is it necessary to
ask if the participants struggle with their mental health? This data could be sensitive, so collecting it
when it’s unnecessary isn’t ethical. Strive to collect the minimum viable amount of data, so you’re
taking as little as possible from your subjects while making a difference.
5. Outcomes
Even when intentions are good, the outcome of data analysis can cause inadvertent harm to individuals
or groups of people.
For example, a voters' mood/opinion survey may negatively impact a party during an election season.