The Art of Data Science: Student - Feedback@sti - Edu
There is no exact definition of what data science is. Various studies have tried to define what the
discipline is about, including the following:
• Data science is a scientific discovery and practice that involves the collection, management, processing,
analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a
diverse array of scientific, translational, and interdisciplinary applications (Donoho, 2017).
• Data science lies at the intersection of the statistical and the computational sciences, and domain-
specific scholarly disciplines and application areas. It incorporates the availability and diversity of
quantitative information and the theory and practice of statistics and computer science that make
processing and understanding data possible. (datascience.harvard.edu, 2017)
• Data mining and big data share the same idea: using the most powerful hardware, the most
powerful programming systems, and the most efficient algorithms to solve problems in science,
commerce, healthcare, government, humanities, and many other fields of human endeavor (Leskovec,
Rajaraman, & Ullman, 2014).
• Data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics,
computing, communication, management, and sociology to study data and its environments (including
domains and other contextual aspects, such as organizational and social aspects) in order to transform
data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology
(Cao, 2017).
• Data science is primarily used to make decisions and predictions by using predictive causal analytics,
prescriptive analytics (predictive plus decision science), and machine learning (Sharma, 2019).
The definitions above can be summarized in the following aspects (Leibowitz, 2018):
1. The center of data science is data, especially Big Data.
2. The purpose of data science is to obtain information or knowledge from the data that will help in making
better decisions and understanding the development and change of nature or society better.
3. Data science is a multidisciplinary field that has applied theories and technologies from several
disciplines.
Many academics and journalists see no distinction between data science and statistics. For Karl Broman, a
professor at the University of Wisconsin, data science is statistics: “if you’re analyzing data, you’re doing
statistics. You can call it ‘data science’ or ‘informatics’ or ‘analytics’ or whatever, but it’s still statistics” (Donoho,
2017). For applied statistician Nate Silver, “data scientist” is just an attractive term for a statistician. He added that
statistics is “science” and that “data scientist” is “slightly redundant in some way,” so people shouldn’t berate the
term “statistician” (Ratner, 2017).
However, critics of this view are just as numerous. Some point out that statistics is nearly irrelevant to data
science. For Andrew Gelman, professor of statistics and director of the Applied Statistics Center at Columbia
University, statistics is “not the most important part of data science or even close.” He emphasized that data
science deals with databases and coding, and that statistics is just an option (Gelman, 2013). Moreover, Vasant
Dhar, professor of Information Systems at New York University, believes that data science differs from the
existing practice of data analysis across all disciplines, which focuses only on explaining data sets. Data science
instead seeks actionable and consistent patterns for predictive uses (Dhar, 2013).
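The contrast between explaining a data set and finding patterns for predictive use can be illustrated with a minimal Python sketch. The numbers, variable names, and the choice of a straight-line fit are assumptions made for illustration only; they do not come from Dhar (2013). The first step only summarizes data already on hand, while the second fits a pattern on part of the data and checks it against values the fit has not seen.

```python
from statistics import mean

# Hypothetical data (made up for illustration): advertising spend and sales.
spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Explaining: describe the data that has already been collected.
print("average sales:", mean(sales))

# Predicting: fit on the first four points, then test on the held-out points.
slope, intercept = fit_line(spend[:4], sales[:4])
for x, actual in zip(spend[4:], sales[4:]):
    predicted = slope * x + intercept
    print(f"spend={x}: predicted {predicted:.2f}, actual {actual}")
```

Checking the predictions against held-out values is what makes a pattern "consistent" in this sketch: a fit that only describes the points it was built from says nothing about how it behaves on new data.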
Data Scientist
Data science is considered an interdisciplinary field. It was originally developed within the statistics
and mathematics community (Leibowitz, 2018). Data scientists are one-part mathematician, one-part computer
scientist, and one-part trend-spotter because their duties are to collect large amounts of unruly data and
organize them for various forms of consumption, from spotting trends to predicting outcomes, or even
visualizing information so that it can be easily read (SAS, n.d.). Because of this flexibility, data scientists are
now sought after by businesses: they can turn raw data into information that can
help boost sales or even predict which trend could lead a business toward bankruptcy.
As mentioned earlier, the programming languages required are R, Python, and SAS. Their definitions and
functions are as follows:
• R is a language and environment for statistical computing and graphics. It is a GNU project similar to
the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies).
It provides a wide variety of statistical and graphical techniques. It is also free software and highly
extensible (The R Foundation, n.d.). It can compile and run on a wide variety of operating systems.
• Python is an object-oriented, interpreted, and interactive programming language developed by Guido
van Rossum. It combines remarkable power with very clear syntax and can be extended with, or
interfaced to, code written in other programming languages such as C or C++ (Holden, 2018).
• The SAS language is a programming language developed by Anthony James Barr as a statistical
analysis tool. It is the leading tool in the commercial analytics space, offering a variety of functions and a
user interface that can be easily learned. However, it is the most expensive of the three (Jain, 2017).
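To make the description of Python more concrete, here is a minimal sketch of its object-oriented style and syntax. The class name `Sample` and the scores are hypothetical and are not taken from the cited sources; only Python's standard `statistics` module is used.

```python
from statistics import mean, median, stdev


class Sample:
    """A hypothetical container for a numeric sample (illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def summary(self):
        """Return basic descriptive statistics as a dictionary."""
        return {
            "n": len(self.values),
            "mean": mean(self.values),
            "median": median(self.values),
            "stdev": stdev(self.values),  # sample standard deviation
        }


# Made-up exam scores used only to exercise the class.
scores = Sample([70, 85, 90, 75, 80])
print(scores.summary())
```

Because Python is interpreted and interactive, the same lines can be typed one at a time at the Python prompt to inspect each result as it is produced.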
References:
About. (2017, April). Retrieved from datascience.harvard.edu: https://datascience.harvard.edu/about
Cao, L. (2017). Data Science: A Comprehensive Overview. ACM Computing Surveys, 50(3), Article 43.
doi:10.1145/3076253
Donoho, D. (2017). 50 Years of Data Science. Journal of Computational and Graphical Statistics, 26(4).
doi:10.1080/10618600.2017.1384734
Dhar, V. (2013). Data Science and Prediction. Communications of the ACM, 56, 64-73. doi:10.1145/2500499
The R Foundation. (n.d.). What is R? Retrieved from R-Project: https://www.r-project.org/about.html
Gelman, A. (2013, November 14). Statistics is the least important part of data science [Online forum Comment] Retrieved
from https://statmodeling.stat.columbia.edu/2013/11/14/statistics-least-important-part-data-science/
Holden, S. (2018, September 16). The Python Wiki. Retrieved from Python: https://wiki.python.org/moin/
Jain, K. (2017, September 12). Python vs. R vs. SAS – which tool should I learn for Data Science? Retrieved from Analytics
Vidhya: https://www.analyticsvidhya.com/blog/2017/09/sas-vs-vs-python-tool-learn/
Leibowitz, J. (2018). Analytics and knowledge management (1st ed.) Retrieved from
https://books.google.com.ph/books?id=u3NgDwAAQBAJ&lpg=PP1&hl=fil&pg=PP1#v=onepage&q&f=false
Leskovec, J., Rajaraman, A., & Ullman, J. (2014). Mining of massive datasets (2nd ed.). Retrieved from
https://books.google.com.ph/books?id=16YaBQAAQBAJ&printsec=frontcover&dq=Mining+of+massive+datasets+(&hl=fil&sa=X&ved=0ahUKEwjl2ov_sqXlAhWOMt4KHc6JBtgQ6AEIMzAB#v=onepage&q=Mining%20of%20massive%20datasets%20(&f=false
Ratner, B. (2017). Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of
Big Data (3rd ed.). Retrieved from
https://books.google.com.ph/books?id=ulAsDwAAQBAJ&printsec=frontcover&dq=Statistical+and+Machine-Learning+Data+Mining+Ratner&hl=fil&sa=X&ved=0ahUKEwiM7aefs6XlAhWF3mEKHW7KD84Q6AEINzAB#v=onepage&q=Statistical%20and%20Machine-Learning%20Data%20Mining%20Ratner&f=false
SAS. (n.d.). What is a data scientist? Retrieved from SAS:
https://www.sas.com/en_ph/insights/analytics/what-is-a-data-scientist.html
Sharma, H. (2019, June 20). Retrieved from https://www.edureka.co/: https://www.edureka.co/blog/what-is-data-science/