Chapter 1
Amar Sahay
Essentials of Data Science and Analytics:
Statistical Tools, Machine Learning, and R-Statistical Software Overview
To Priyanka Nicole, Our Love and Joy
Description
This text provides a comprehensive overview of Data Science. With
continued advancement in storage and computing technologies, data
science has emerged as one of the most desired fields in driving busi-
ness decisions. Data science employs techniques and methods from
many other fields such as statistics, mathematics, computer science,
and information science. Besides the methods and theories drawn from
several fields, data science applies visualization techniques using specially
designed big data software and statistical programming languages, such
as R and Python. Data science has wide applications in
the areas of Machine Learning (ML) and Artificial Intelligence (AI).
The book has four parts divided into different chapters. These chapters
explain the core of data science. Part I of the book introduces the field
of data science, the disciplines it comprises, and its scope, along with the
future outlook and career prospects. This part also explains analytics,
business analytics, and business intelligence, and their similarities to and
differences from data science. Since data is at the core of data science, Part
II is devoted to explaining data, big data, and other features of data.
One full chapter is devoted to data analysis, creating visuals, pivot tables,
and other applications using Excel with Office 365. Part III explains the
statistics behind data science. It uses several chapters to explain statistics
and its importance, numerical and data visualization tools and methods,
probability, and probability distribution applications in data science.
Other chapters in Part III cover sampling, estimation, and hypothesis
testing. All of these are integral parts of data science applications. Part IV of
the book provides the basics of Machine Learning (ML) and R-statistical
software. Data science has wide applications in the areas of Machine
Learning (ML) and Artificial Intelligence (AI) and R-statistical software
is widely used by data science professionals. The book also outlines a brief
history, the body of knowledge, skills, and education requirements for
data scientists and data science professionals. Some statistics on job growth
and prospects are also summarized. A career in data science is ranked as
the third best job in America for 2020 by Glassdoor and was ranked the
number one best job from 2016 to 2019.[29]
Primary Audience
Scope
Keywords
data science; data analytics; business analytics; business intelligence;
data analysis; decision making; descriptive analytics; predictive analytics;
prescriptive analytics; statistical analysis; quantitative techniques; data
mining; predictive modeling; regression analysis; modeling; time-series
forecasting; optimization; simulation; machine learning; neural networks;
artificial intelligence
Contents
Preface
Acknowledgments
Online References
Additional Readings
About the Author
Index
Preface
This book is about Data Science, one of the fastest growing fields with
applications in almost all disciplines. The book provides a comprehensive
overview of data science.
have made it possible to process and analyze this huge data with smarter
storage spaces.
Data science is a multidisciplinary field that involves the ability to
understand, process, and visualize data in the initial stages followed by
applications of statistics, modeling, mathematics, and technology to
address and solve analytically complex problems using structured and
unstructured data. At the core of data science is data. It is about using
this data in creative and effective ways to help businesses in making
data-driven business decisions.
The field of data science is vast and has a wide scope. The terms data
science, data analytics, business analytics, and business intelligence are often
used interchangeably, even by professionals in the field. All these areas
are related, with data science having the largest
scope. This book tries to outline the tools, techniques, and applications of
data science and explain the similarities and differences of this field with
data analytics, analytics, business analytics, and business intelligence.
The knowledge of statistics in data science is as important as the
applications of computer science. Statistics is the science of data and variation.
Statistics, data analysis, and statistical analysis constitute major
applications of data science. Therefore, a significant part of this book
emphasizes the statistical concepts needed to apply data science in the real
world. It provides a solid foundation of statistics applied to data science.
Data visualization and other descriptive and inferential tools, knowledge
of which is critical for data science professionals, are discussed in
detail. The book also introduces the basics of machine learning, which is
now a major part of data science, and introduces the statistical programming
language R, which is widely used by data scientists. A chapter-by-chapter
synopsis is provided.
Chapter 1 provides an overview of data science by defining and out-
lining the tools and techniques. It describes the differences and similar-
ities between data science and data analytics. This chapter also discusses
the role of statistics in data science, a brief history of data science, knowl-
edge and skills for data science professionals, and a broad view of data
science with associated areas. The body of knowledge essential for data
science and the different tools and technologies used in data science are also
part of this chapter. Finally, the chapter looks into the future outlook of data
science as a field and the career path for data scientists. The major topics
discussed in Chapter 1 are:
(a) broad view of data science with associated areas, (b) data science body
of knowledge, (c) technologies used in data science, (d) future outlook,
and (e) career path for data science professionals and data scientists.
The other concepts related to data science including analytics, busi-
ness analytics, and business intelligence (BI) are discussed in subsequent
chapters. Data science continues to evolve as one of the most sought-after
areas by companies. The job outlook for this area continues to be one of
the strongest of all fields.
The discussion topic of Chapter 2 is analytics and business analytics.
One of the major areas of data science is analytics and business analyt-
ics. These terms are often used interchangeably with data science. We
outline the differences between the two along with the explanation of
different types of analytics and the tools used in each one. The deci-
sion-making process in data science heavily makes use of analytics and
business analytics tools and these are integral parts of data analysis. We,
therefore, felt it necessary to explain and describe the role of analytics
in data science. Analytics is the science of analysis—the processes by
which we analyze data, draw conclusions, and make decisions. Business
analytics (BA) covers a vast area. It is a complex field that encompasses
visualization, statistics and modeling, optimization, simulation-based
modeling, and statistical analysis. It uses descriptive, predictive, and pre-
scriptive analytics including text and speech analytics, web analytics, and
other application-based analytics, and much more. This chapter also discusses
different predictive models and predictive analytics. Flow diagrams
outlining the tools of descriptive, predictive, and prescriptive
analytics are presented in this chapter. The decision-making tools in analytics
are part of data science.
Chapter 3 draws a comparison between business intelligence (BI)
and business analytics. Business analytics, data, analytics, and advanced
analytics fall under the broad area of business intelligence (BI). The broad
scope of BI and the distinction between BI and business analytics
(BA) tools are outlined in this chapter.
Chapter 4 is devoted to the study of the collection, presentation, and
classification of data. Data science is about the study of data.
Data are of various types and are collected using different means. This
chapter explains the types of data and their classification with examples.
Companies collect massive amounts of data. The volume of data
collected and analyzed by businesses is so large that it is referred to as
“Big Data.” The volume, variety, and speed (velocity) with which data
are collected require specialized tools and techniques, including specially
designed big data software for analysis.
In Chapter 5, we introduce Excel, a widely available and widely used software
package for data visualization and analysis. A number of graphs and charts with
stepwise instructions are presented. There are several packages available as
add-ins to Excel to enhance its capabilities. The chapter presents basic to
more involved features and capabilities. The chapter is divided into sections,
beginning with “Getting Started with Excel,” followed by several applications
including formatting data as a table, filtering and sorting data, and
simple calculations. Other applications in this chapter are analyzing data
using pivot tables and pivot charts, descriptive statistics using Excel, visualizing
data using Excel charts and graphs, visualizing categorical data (bar
charts, pie charts, cross tabulation), and exploring the relationship between
two and three variables (scatter plot, bubble graph, and time-series plot).
Excel is a very widely used software application in data science.
Chapters 6 and 7 deal with the basics of statistical analysis for data
science. Statistics, data analysis, and analytics are at the core of data
science applications. Statistics involves making decisions from
data. Making effective decisions using statistical methods and data requires
an understanding of three areas of statistics: (1) descriptive statistics,
(2) probability and probability distributions, and (3) inferential statistics.
Descriptive statistics involves describing the data using graphical and
numerical methods. Graphical and numerical methods are used to create
visual representations of the variables or data and to calculate various statistics
to describe the data. Graphical tools are also helpful in identifying
patterns in the data. This chapter discusses data visualization tools.
A number of graphical techniques are explained with their applications.
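As a small illustration of the numerical side of descriptive statistics
described above, the following Python sketch summarizes a sample using only
the standard library. The data values and variable names are hypothetical,
invented for this example, and do not come from the book.

```python
import statistics

# Hypothetical sample: daily order counts (invented for illustration only).
orders = [12, 15, 11, 18, 14]

# Numerical descriptive statistics: center and spread of the sample.
mean_orders = statistics.mean(orders)      # arithmetic mean
median_orders = statistics.median(orders)  # middle value of the sorted sample
stdev_orders = statistics.stdev(orders)    # sample standard deviation (divides by n - 1)

print(f"mean={mean_orders}, median={median_orders}, stdev={stdev_orders:.2f}")
```

The same three summaries are what a graphical tool such as a box plot or
histogram would convey visually: where the data are centered and how widely
they vary.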
There has been an increasing amount of pressure on businesses to pro-
vide high-quality products and services. This is critical to improving their
market share in this highly competitive market. Not only is it critical for
businesses to meet and exceed customer needs and requirements, it is also
Introduction
Data science is about extracting knowledge and insights from data. The
tools and techniques of data science are used to drive business and process
decisions. It can be seen as a major data-driven approach
to decision making. Data science is a multidisciplinary field that involves
the ability to understand, process, and visualize data in the initial stages
followed by applications of statistics, modeling, mathematics, and tech-
nology to address and solve analytically complex problems using struc-
tured and unstructured data. At the core of data science is data. It is about
using this data in creative and effective ways to help businesses in making
data-driven business decisions.
The knowledge of statistics in data science is as important as the
applications of computer science. Companies now collect massive
The initial chapters of the book introduce data science and closely
related areas. The terms data science, data analytics, business analytics,
and business intelligence are often used interchangeably, even by
professionals in the field. Therefore, Chapter 1, which provides an overview
of data science, is followed by two chapters that explain the relationship
between data science, analytics, and business intelligence. Analytics itself
is a wide area, and different forms of analytics, including descriptive,
predictive, and prescriptive analytics, are used by companies to drive major
business decisions. Chapters 2 and 3 outline the differences and similarities
between data science, analytics, and business intelligence. Chapter 2
also outlines the tools of descriptive, predictive, and prescriptive analytics
along with the most recent and emerging technologies of machine learning
and artificial intelligence. Since the field of data science is about
data, a chapter is devoted to data and data types. Chapter 4 provides
definitions of data, different forms of data, and their types, followed by some
tools and techniques for working with data. One of the major objectives
of data science is to make sense of the massive amounts of data companies
collect. One way of making sense of data is to apply the data
visualization or graphical techniques used in data analysis. Understanding
other tools and techniques for working with data is also important.
A chapter is devoted to data visualization.
Data science is a vast area. Besides visualization techniques and
statistical analysis, it uses statistical programming languages such as
R and requires a knowledge of databases (SQL or MySQL) or
another database management system.
One major application of data science is in the area of Machine
Learning (ML) and Artificial Intelligence. The book provides a detailed
overview of data science by defining and outlining the tools and tech-
niques. As mentioned earlier, the book also explains the differences and
similarities between data science and data analytics. The other concepts
related to data science including analytics, business analytics, and busi-
ness intelligence (BI) are discussed in detail. The field of data science is
about processing, cleaning, and analyzing data. These concepts and topics
are important to understand the field of data science and are discussed in
this book. Data science is an emerging field in data analysis and decision
making.
field, champions the broadening of learning scope in the form of data
science.[20] John Chambers urges statisticians to adopt an inclusive concept
of learning from data.[22] Together, these statisticians envision an increasingly
inclusive applied field that grows out of traditional statistics and beyond.
2002 In April 2002, the International Council for Science (ICSU): Committee
on Data for Science and Technology (CODATA)[17] started the Data
Science Journal, a publication focused on issues such as the description
of data systems, their publication on the Internet, applications, and legal
issues.
2012 In the 2012 Harvard Business Review article “Data Scientist: The
Sexiest Job of the 21st Century,”[24] DJ Patil claims to have coined this term
in 2008 with Jeff Hammerbacher to define their jobs at LinkedIn and
Facebook, respectively. He asserts that a data scientist is “a new breed” and
that a “shortage of data scientists is becoming a serious constraint in some
sectors” but describes a much more business-oriented role.
2015 In 2015, the International Journal on Data Science and Analytics was
launched by Springer to publish original work on data science and big data
analytics.
2016 In 2016, the ASA changed its section name to “Statistical Learning and
Data Science.”
Some argue that the two fields—data science and data analytics—
can be considered different sides of the same coin, and their functions
are highly interconnected. Data science lays important foundations and
parses big datasets to create initial observations, future trends, and poten-
tial insights that can be important. This information by itself is useful
for some fields, especially modeling, improving machine learning, and
enhancing AI algorithms as it can improve how information is sorted
and understood. However, data science asks important questions that we
were unaware of before while providing little in the way of answers. By
combining data analytics with data science, we have additional insights,
prediction capabilities, and tools to apply in practical applications.
When thinking of these two disciplines, it’s important to forget about
viewing them as data science versus data analytics. Instead, we should
see them as parts of a whole that are vital to understanding not just the
information we have, but how.
[Figure: diagram with DATA SCIENCE at the center, surrounded by Statistics & Data Analysis (R Statistical Programming); Math & Mathematical Modeling; Predictive Analytics; Machine Learning and Artificial Intelligence (AI); Data Visualization; Data Base Management and Query (SQL); Business & Process Knowledge; Programming (Python).]
Future Outlook
Data science is a growing field. It continues to evolve as one of the most
sought-after areas by companies. An excellent outlook is provided in
reference [24]: Davenport, T. H., and D. J. Patil (October 1, 2012). “Data
Scientist: The Sexiest Job of the 21st Century.” Harvard Business
Review (October 2012). ISSN 0017-8012. Retrieved 3 April 2020.
Data Science and Its Scope 15
A career in data science is ranked as the third best job in America for
2020 by Glassdoor, and was ranked the number one best job from 2016
to 2019.[29] Data scientists have a median salary of $118,370 per year, or
$56.91 per hour.[30] These figures depend on level of education and experience
in the field. Job growth in this field is also above average, with a projected
increase of 16 percent from 2018 to 2028.[30] The largest employer of
data scientists in the United States is the federal government, employing
28 percent of the data science workforce.[30] Other large employers of
data scientists are computer system design services, research and development
laboratories, big technology companies, and colleges and universities.
Typically, data scientists work full time, and some work more than
40 hours a week. See references [17], [26], and [27] for the above paragraphs.
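The salary and growth figures quoted above can be related to each other with
a little arithmetic. The Python sketch below is an illustration only: it
assumes a full-time year of 40 hours for 52 weeks, which the source does not
state, and it assumes growth compounds at a constant yearly rate. The variable
names are hypothetical.

```python
# Figures quoted in the paragraph above.
median_annual_salary = 118_370   # dollars per year
growth_2018_to_2028 = 0.16       # projected 16 percent increase over the decade

# Assumption (not stated in the source): a full-time year is 40 hours x 52 weeks.
HOURS_PER_YEAR = 40 * 52

# Annual salary expressed as an hourly rate.
hourly_rate = median_annual_salary / HOURS_PER_YEAR

# Constant yearly growth rate that compounds to 16 percent over 10 years.
annual_growth = (1 + growth_2018_to_2028) ** (1 / 10) - 1

print(f"hourly rate: ${hourly_rate:.2f}")
print(f"implied annual growth: {annual_growth:.2%}")
```

Under the 2,080-hour assumption, the annual figure works out to the quoted
$56.91 per hour, and a 16 percent increase over ten years corresponds to
roughly 1.5 percent growth per year.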
The outlook for the data science field looks promising. It is estimated that
2 to 2.5 million jobs will be created in this area in the next ten years. The
data science area is vast and requires knowledge and training from
different fields. It is one of the fastest growing areas. Data scientists can
have a major positive impact on a business's success.
Data science continues to evolve as one of the most promising and
in-demand career paths for skilled professionals. Today, successful data
professionals understand that they must advance past the traditional skills
of analyzing large amounts of data, data mining, and programming skills.
In order to uncover useful intelligence for their organizations, data sci-
entists must master the full spectrum of the data science life cycle and
possess a level of flexibility and understanding to maximize returns at
each phase of the process.
Much of the data collected by companies is underutilized. This data,
through meaningful information extraction and discovery, can be used to
make critical business decisions and drive significant business change. It
can also be used to optimize customer success and subsequent acquisition,
retention, and growth.
Business and research treat their data as an asset. Businesses,
processes, and companies are run using their data. The data and variables
collected are highly dynamic and continuously change. Data science
professionals are needed to process, analyze, and model the data, which is
usually in big data form, and to visualize it so that companies can
make timely data-driven decisions. “The data science professionals must
be trained to understand, clean, process, and analyze the data to extract
value from it. It is also important to be able to visualize the data using
conventional and big data software in order to communicate data in a
meaningful way. This will enable applying proper statistical, modeling,
and programming techniques to be able to draw conclusions. All these
require knowledge and skills from different areas and these are hugely
important skills in the next decades,” says Hal Varian, chief economist
at Google and UC Berkeley professor of information sciences, business,
and economics.[3] Demand for data science jobs is expected
to grow by 28 percent by 2020 (https://datascience.berkeley.edu/about/what-is-data-science/).
Summary
Data science is a data-driven decision-making approach that uses several
different areas, methods, algorithms, models, and disciplines with the purpose
of extracting insights and knowledge from structured and unstructured
data. These insights are helpful in applying algorithms and models
to make decisions. The models in data science are used in predictive
analytics to predict future outcomes. Businesses collect massive amounts
of data in different forms and by different means. With the continued
advancement in technology and data science, it is now possible for businesses
to store and process huge amounts of data in their databases. At
the core of data science is data. The field of data science is about using
this data in creative and effective ways to help businesses in making
data-driven business decisions.
Data science uses several disciplines and areas, including statistical
modeling, data mining, big data, machine learning and artificial intelligence
(AI), management science, optimization techniques, and related
methods, in order to “understand and analyze actual phenomena” from
data.[3]
Data science also employs techniques and methods from many other
fields, such as mathematics, statistics, computer science, and information
science. Besides the methods and theories drawn from several fields,
data science applies visualization techniques using specially designed big data
software and statistical programming languages, such as R
and Python. Data science has wide applications in the areas of machine
learning (ML) and artificial intelligence (AI). The chapter provided an overview
of data science by defining and outlining the tools and techniques
and explained the differences and similarities between data science and
data analytics. The other concepts related to data science, including analytics,
business analytics, and business intelligence (BI), were discussed.
Data science continues to evolve as one of the most sought-after areas by
companies. The chapter also outlined the career path and job outlook for
this area, which continues to be one of the strongest of all fields. The field is
promising and is showing tremendous job growth.