June 30 – July 4, 2019, Jerusalem, Israel
MADASCA: INSTRUCTION OF DATA SCIENCE
TO MANAGERS
Sahahar Golan*
Jerusalem College of Technology,
Jerusalem, Israel
[email protected]
Dan Bouhnik
Jerusalem College of Technology,
Jerusalem, Israel
[email protected]
* Corresponding author
ABSTRACT
Aim/Purpose
Build a program that teaches prospective managers the skills that are relevant for
leading data science activity.
Background
Data science is becoming ubiquitous in organizations. It is imperative to train students in management departments in the skills that are relevant to this field.
Most courses in data science focus on technical knowledge like model building
methods, and neglect organizational knowledge such as team roles, ethical considerations and project stages. This work suggests a complementary program
that supplies the students with the required knowledge. The authors believe that
this program is most suitable for management-students, and that it can also be
adapted to software engineering students, in order to provide them with a wider
scope.
Contribution
We present the MaDaScA (Managing Data Science Activity) program. The program defines a list of topics that are required for managers’ education in order
to lead data science activity. This work suggests the content and take-away messages of each topic. The paper surveys several existing courses that teach data science to managers.
Findings
All existing courses supply a part of the suggested topics, either focusing on
technical aspects of data-science or on organizational aspects. In particular, only
a small minority of the courses discuss ethical aspects of data science.
Recommendations for Practitioners
We recommend adopting MaDaScA in management departments in order to
prepare managers for the challenges in data-science.
Accepting Editor: Eli Cohen │ Received: December 12, 2018 │ Revised: January 30, 2019 │
Accepted: February 2, 2019.
Cite as: Golan, S., & Bouhnik, D. (2019). MaDaScA: Instruction of data science to managers. Proceedings of the
Informing Science and Information Technology Education Conference, Jerusalem, Israel, pp. 125-140. Santa Rosa, CA: Informing Science Institute. https://doi.org/10.28945/4271
(CC BY-NC 4.0) This article is licensed to you under a Creative Commons Attribution-NonCommercial 4.0 International
License. When you copy and redistribute this paper in full or in part, you need to provide proper attribution to it to ensure
that others can later locate this work (and to ensure that others do not accuse you of plagiarism). You may (and we encourage you to) adapt, remix, transform, and build upon the material for any non-commercial purposes. This license does not
permit you to use this material for commercial purposes.
MaDaScA
Recommendations for Researchers
We recommend adapting the MaDaScA model to the curriculum of the faculty
of engineering, especially for the department of industrial engineering.
Impact on Society
Educating prospective managers on the capabilities of data science and the responsibilities that come with it is key for making sure organizations become much
more data driven, efficient and ethical.
Future Research
It is possible to make this program more effective by adding practical experience.
Keywords
data-science, data-science instruction, management
INTRODUCTION
Management of data science teams requires a combination of technical and soft skills. In many cases
the team manager role is filled by an experienced data scientist who was promoted. Since the data
science team may (and should) have high impact on the organization roadmap, it is important that
the team leader also have an academic background in management. One approach would be to include management material in computer science programs. This work suggests a course plan for
teaching data science principles as an organic part of a management program.
The course aims at giving the manager tools that will enable him to understand the flow of a data
science project, the responsibilities of each team member, the challenges that may arise during each
stage and ways to overcome them. One important presumption is that the participants have no previous experience in programming. This requires finding creative solutions for presenting the main
building blocks of data science activity.
The course is composed of four main topics that supply a holistic view of data science activity. The
topics are elaborated in the following sections.
In the first topic, based on Kross, Peng, Caffo, Gooding, and Leek (2017), the course describes the
data science team: its structure and its relationships with other parts of the organization. This topic
gives the framework in which the other topics take place. The second topic describes common use
cases of data science in organizations. These use cases include recommender systems, fraud detection
and applications of data science methods to health care, human resources and logistics and maintenance. In the third topic the course delves into the different stages of data-science projects. The
fourth topic touches on ethical issues that are involved when practicing data science: privacy concerns, requirements of experiments involving humans, data-driven discrimination and potential misuse of statistics.
This work proposes a method for assessing students based on Project/Problem Based Learning
(PBL) and elaborates the different criteria the project must meet. In particular, it presents the value
students get from hands-on experience and guidance from the course mentors.
The concluding section surveys a list of existing courses that aim at teaching data-science to managers and shows how each focuses on a different subset of the topics presented above. Moreover, the
survey shows that some courses emphasize the concrete, data-science oriented topics, while others
emphasize the management-oriented ones. The survey shows how the proposed course spans both
viewpoints and supplies a wide scope of the relevant areas.
THE DATA SCIENCE TEAM
Data science is a team sport, and it requires multidisciplinary skills. Moreover, data-science team
management involves constructing the team, leading the internal dynamics in the team and positioning the team as a key contributor in the organization. The discussion of the data-science team is divided into two sub-topics:
• Building the team: Defining the different roles in the team and their responsibilities.
• Structuring the team’s interactions within the organization.
Building the Data Science Team
The course describes the responsibilities and skill-set of the team and maps them into three categories: engineering, science and management. Following is the list of responsibilities:
• Infrastructure establishment:
  o Hardware: Servers, Network, Storage.
  o Software: Parallelization
  o Databases: Access control, Data architecture and query optimization
  o Cloud services
• Software development
  o Implementing data analysis algorithms
  o Model learning
  o Scripting languages (e.g., Python), Statistics analysis languages (e.g., R)
  o Parallelization: Map/Reduce, Hadoop, Pig
• Machine learning techniques
  o Supervised learning
  o Unsupervised learning
• Research skills
  o Data validation
  o Hypothesis generation
  o Statistical analysis
  o Experiment design
• Management skills
  o Team building
  o Empowering employees
  o Communication
  o Presenting results: visualization
• Leadership
  o Advocating Data Science
  o Leading the organization’s culture to a data-driven decision-making approach
It is important to note that the mapping of the roles into job descriptions is highly dependent on the
size and type of the organization. In particular, in very large and technology-oriented organizations,
each responsibility might have a dedicated position. However, it is useful to group these responsibilities into roles that have common characteristics.
The infrastructure and software skills are mapped to the data engineer role. The data engineer is
responsible for supplying the team with the hardware and software resources. He should be able to
determine the number of servers, the storage size and hardware connectivity required for parallel
processing of big data volumes. This role also requires database-administration skills, since the data
should be available for large scale analysis and manipulation, and also should be secure against unauthorized access. In terms of software development, the data engineer should be able to develop data
pipelines, learning algorithms, and data modeling solutions. The data engineer role usually comes
from a quantitative background with a strong software orientation.
The machine-learning and research skills are mapped to the data-scientist role. The data scientist is
responsible for experiment planning and analysis. He should define the research hypothesis that
translates business goals into a scientific task and lead the model definition and optimization. The
role requires deep understanding of statistics and machine learning techniques, including what types
of models might underlie the sampled data, how to evaluate the quality of a model and what are the
meta-parameters that may be employed for improving the learning process. In general, the data scientist has a more mathematical/statistical background that enables him to perform data analysis, but
also has software capabilities.
The management and leadership skills are mapped to the data-science manager role. The data-science manager is responsible for the team’s organic functioning and for its ability to collaborate with
external teams and add value to the organization as a whole. He should define the priorities of the
team, guide and support team members in the (many) challenging stages of the project and make
sure that they are not blocked by technical or communication difficulties. An important part of this
role is communicating the team’s capabilities and the project’s progress to management and other team leaders. At the highest level, the role includes the responsibility of leading the organization
towards a data-driven decision-making mentality and positioning data-science as a valuable tool in
strategy definition. The role requires excellent communication skills, the ability to tell a story when
presenting a project, the passion to make a difference in the organization’s culture and to harness
data science for making well-founded decisions. Stories are a valuable tool for modern organizations in
general and for the information management area specifically. Stories allow organizations to efficiently pass on knowledge within the organization. They improve the organization’s abilities to cope with
complex situations, they encourage creativity and help attain the organizational vision. Stories are the
most ancient tool available to mankind for conveyance of knowledge in a clear manner while emphasizing interpersonal connections, which are at the base of knowledge impartation and preservation. Storytelling
is natural for employees and helps recruit them to work toward achieving the organization’s goals
and fulfilling the processes which will bring about success. The data-science manager should be
closely familiar with data science methodology and preferably have hands-on experience in industrial
data-science projects.
Building a data science team requires locating candidates that match the required skill set. Interviews
of candidates should also be adapted to reflect their skills in data science. A suggested format is to
supply the candidate with a toy data-set and ask about what can be deduced from it.
Structuring the Team’s Interactions
After establishing the team’s structure, it is important to define what the team’s role in the organization is and how to structure the team’s interaction, both internally and externally, in order to make it
effective and meaningful. In each stage of the process there are considerations that are data-science
specific and should be taken into account.
The communication within the team should reflect the nature of data science as a field that involves
many experiments with challenging situations and a constant need to overcome obstacles. The team
manager should hold periodic individual meetings where he should communicate personal goals,
expectations, and get updates on progress of the projects and potential obstacles. In addition, there
are team meetings, in which the team can be updated on the big picture and discuss priorities. The
team meeting can be leveraged to raise infrastructure issues that impact the team as a whole. In
addition, this meeting is an opportunity for sharing ideas and milestones that have been achieved.
Team meetings should be a place for secure discussion where one can raise new ideas and suggest
new directions.
The goal of internal communication is to supply the team members with the supportive environment
they need in order to make progress. On the one hand, there should be an open-door policy where
the team member can raise quick questions he needs to resolve in order to remove obstacles. On the
other hand, there shouldn’t be too many meetings that hamper the work continuum. Since in many
situations, data science activity may involve frustrating moments, where experiments fail or lead to
unexpected results, the manager should use meetings for empowering the team, celebrating success
and communicating reasonable expectations.
There are two approaches for structuring the communication between the data science team and
other teams. According to the integrated approach, the team members are dispersed among other
teams, each acting as a part of these teams. The advantage of this approach is that the team members
are perfectly aligned with the organization’s business-units’ goals and can better aim their efforts to
achieve the external team’s goals. When incorporating the alternative Organic approach, the team
members sit in the same area and collaborate closely in order to make progress. The advantage here
is that each team member enjoys the support of his team members, who are familiar with the challenges he is tackling and can contribute to his efforts through brainstorming and advice.
The suggested approach is a hybrid of the two approaches, where on the one hand the team has a
shared working space, where the team spends most of its time, closely communicating and collaborating. On the other hand, each team member is assigned a task in an external team, where he can
contribute his capabilities in the many tasks that require data analysis, modeling and experimenting.
Adopting this approach builds a strong data science team that empowers its members to conduct
meaningful research and base their activity on the highest standards. At the same time it positions the
team as an important stakeholder in the organization’s business strategy.
There are several tools that can enable the data science team to help the organization become more
data driven. The team can hold lectures educating the employees on data-science capabilities and statistical-thinking. There can be Show-and-Tell events in which the team presents its recent results or
case studies describing its activity. New-employee orientation processes should include an introduction to the support that data-science can supply to the different teams. The data science team can
extract and create new data-sources and make them available and accessible to teams that may benefit
from them. The team can also develop tools and technologies that make data much more present in
the working process: Dashboards, analysis tools and machine-learning platforms can help promote
data-science to a central role in any team’s progress.
Adopting the above-mentioned guidelines can help build a data science team that is capable of making a real change in the organization and making data science a leading factor in its core processes.
DATA SCIENCE USE CASES
Following the description of a generic data-science project, the program delves into a description of
several use-cases that are most common and have most potential for future development. For each
use case the course describes the following properties:
• The motivation behind the use case
• Main characteristics
• The data that is used to develop a model
• Main methods that are used
• Challenges
• A Case-study
Table 1 summarizes the discussion content for each use case.
Table 1. Example of a table
Use case | Motivation | Data | Characteristics | Methods | Challenges | Case study
Recommender system: √ √ √ √ √
Fraud detection: √ √ √ √ √
Human resources: √ √ √ √
Health care: √ √ √ √
Logistics: √ √ √ √ √ √
Recommender Systems
The motivation for recommender systems stems from the plethora of options that surround consumers in any step they make. People are flooded by choices of commercial products (online shopping, restaurants), of content (books, movies, papers and even jokes) and social opportunities (collaboration, dating).
The input data for learning a recommender system is composed of users’ ratings of items and
of user and item profiles. Recommender systems vary by the number of users and number of products, the historical information they have on past transactions and the profile richness they maintain
on both products and users. The rating information may be binary (for example purchase history),
numerical (for example star-rating) or relative (comparing items) (Schafer, Konstan, & Riedl, 1999).
The input might be explicit (solicited from the user) or implicit (click data, dwell time, sequence analysis) (Oard & Kim, 1998).
The trivial recommendation method relies on popular trends and rates each item by its current popularity. This approach does not take into account personalization considerations and can be used as a
good baseline for other approaches. Personalized recommender systems are divided into content
based and collaborative filtering. For the content-based approach, the course discusses how to build
user/item profiles, and how to use user-item similarity and item-item similarity for the recommendation. For the collaborative filtering approach, the nearest-neighbors method and the matrix factorization method are defined. The course discusses the advantages and disadvantages for each approach
in terms of the cost of profile creation, cold-start recommendation and computational requirements
(Adomavicius & Tuzhilin, 2005; Koren, Bell, & Volinsky, 2009).
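The popularity baseline and the nearest-neighbors flavor of collaborative filtering described above can be sketched in a few lines of Python. This is only an illustration: the users, items, ratings, the choice of cosine similarity and the value k = 2 are assumptions made for this example and are not taken from the course material.

```python
from math import sqrt

# Hypothetical star ratings (1-5): user -> {item: rating}. Illustrative data only.
ratings = {
    "ann":  {"book_a": 5, "book_b": 4, "book_c": 1},
    "ben":  {"book_a": 4, "book_b": 5, "book_d": 2},
    "cara": {"book_a": 1, "book_c": 5, "book_d": 4},
    "dan":  {"book_b": 4, "book_c": 2},
}

def popularity_baseline(ratings):
    """Non-personalized baseline: rank items by how many users rated them."""
    counts = {}
    for user_ratings in ratings.values():
        for item in user_ratings:
            counts[item] = counts.get(item, 0) + 1
    return sorted(counts, key=lambda item: (-counts[item], item))

def cosine_similarity(u, v):
    """Cosine similarity of two users, computed over the items both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, k=2):
    """User-based nearest neighbors: score the items the user has not rated
    by the similarity-weighted average rating of the k most similar users."""
    target = ratings[user]
    neighbors = sorted(
        ((cosine_similarity(target, other), name)
         for name, other in ratings.items() if name != user),
        reverse=True,
    )[:k]
    scores, weights = {}, {}
    for sim, name in neighbors:
        for item, rating in ratings[name].items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    return sorted(
        ((scores[i] / weights[i], i) for i in scores if weights[i] > 0),
        reverse=True,
    )

print(popularity_baseline(ratings))
print(recommend("dan", ratings))
```

Note that two users who share a single rated item always get a cosine similarity of 1 in this sketch, which hints at why sparse data and cold-start users are a practical difficulty for neighborhood methods.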
The challenges of recommender systems include creating diverse recommendations, identifying individual users from shared devices, user/item cold start and understanding user intent from implicit
feedback (Ricci, Rokach, & Shapira, 2015).
This program includes a case study of a recommender system for mobile apps that compares alternative recommendation approaches and analyses the performance of each approach (Jannach & Hegelich, 2009).
Fraud Detection
The motivation behind fraud-detection systems is presented in several reports that indicate that more
than 30% of organizations fell victim to fraud and that the cost of fraud is about 5% of organizations’ income. Common fraud types involve credit cards, telecommunication, insurance, taxes and
employee fraud in the workplace (Levi & Burrows, 2008; PWC, 2018).
Fraud detection stands out from other data-science use-cases in that it operates in an adversarial setting where the goal of the con-man is to avoid detection. Other characteristics of fraud detection problems are that they require constant adaptation to new behavior and that they involve very
unbalanced data-sets, where most of the activity is legitimate (Laleh & Azgomi, 2009; Phua, Lee,
Smith, & Gayler, 2010).
The goals of fraud detection are diverse, and they include identifying as many frauds as possible,
while avoiding misidentification of legitimate activity. Possible success metrics for fraud detection
systems may include reducing the cost of fraud and misidentification (Stolfo, Fan, Lee, Prodromidis,
& Chan, 2000), or minimizing the need for manual auditing of suspicious transactions.
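A small Python sketch can show why a cost-based metric is more informative than plain accuracy on such unbalanced data. All numbers here are hypothetical: 1000 transactions of which 10 are fraudulent, an assumed loss of 500 per missed fraud and an assumed cost of 10 per manually audited transaction.

```python
FRAUD_LOSS = 500   # assumed loss per undetected fraud (hypothetical)
AUDIT_COST = 10    # assumed cost of manually auditing one flagged transaction (hypothetical)

def evaluate(actual, flagged, fraud_loss, audit_cost):
    """Return (accuracy, total_cost) given true fraud labels and detector flags."""
    missed = sum(1 for a, f in zip(actual, flagged) if a and not f)
    false_alarms = sum(1 for a, f in zip(actual, flagged) if f and not a)
    audited = sum(1 for f in flagged if f)
    accuracy = 1 - (missed + false_alarms) / len(actual)
    total_cost = missed * fraud_loss + audited * audit_cost
    return accuracy, total_cost

# 1000 transactions; the first 10 are fraudulent.
actual = [True] * 10 + [False] * 990

# "Detector" A never flags anything.
flag_nothing = [False] * 1000

# Detector B catches 8 of the 10 frauds and raises 22 false alarms.
detector = [True] * 8 + [False] * 2 + [True] * 22 + [False] * 968

print(evaluate(actual, flag_nothing, FRAUD_LOSS, AUDIT_COST))
print(evaluate(actual, detector, FRAUD_LOSS, AUDIT_COST))
```

Under these assumptions the do-nothing detector has the higher accuracy, yet the real detector has the far lower total cost, which is the reason accuracy alone is a poor success metric when most activity is legitimate.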
Fraud detection methods can be divided into those learning from past fraudulent activity and trying
to identify similar behavior (using supervised learning) and those that identify anomalies as suspicious (using unsupervised learning). The first approach is exemplified by Link Analysis (Bolton &
Hand, 2002), the use of connections between suspicious entities to discover fraud. For the second approach the course presents Break Point Analysis and Peer Group Analysis (Bolton & Hand, 2001).
Human Resources
Data science can assist Human Resources activity in many aspects. The motivation is to make HR
processes more effective, more efficient and fairer. The relevant processes are preliminary candidate
sifting, candidate evaluation, and employee assessment.
For potential candidates, the relevant data includes experience (Organization, tenure, role, technical
skills), education (Major subject, grades), recommendations, publications (papers, open source content) and areas of interest. It is possible to employ NLP and modeling methods in order to filter
multitudes of CVs to find the appropriate candidates. Automatic chatbots can make the communication with the candidates much more transparent and interactive.
The interviewing process is also rich in data. It is possible to weigh scores in many areas and from
many interviewers. Moreover, once an interviewer has enough track record, his evaluation can be
calibrated and become more standardized. The calibration process can remove any biases, conscious
and unconscious, and increase diversity. Once the evaluation process is over, it is possible to determine a salary range that matches the candidate’s parameters.
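One simple way to think about interviewer calibration, offered here only as an assumed illustration and not as the method used in the course, is to express each score relative to the interviewer's own track record. The sketch below standardizes a raw score by the mean and standard deviation of that interviewer's past scores; the two score histories are made up.

```python
from statistics import mean, pstdev

def calibrate(score, history):
    """Express a raw score in standard deviations from the interviewer's own mean."""
    spread = pstdev(history)
    if spread == 0:
        return 0.0  # the interviewer always gives the same score, so it carries no signal
    return (score - mean(history)) / spread

# Hypothetical track records on a 1-10 scale.
lenient_history = [8, 9, 10, 9, 9]   # an interviewer who tends to score high
strict_history = [4, 5, 6, 5, 5]     # an interviewer who tends to score low

print(calibrate(8, lenient_history))  # below this interviewer's usual score
print(calibrate(6, strict_history))   # above this interviewer's usual score
```

In this made-up example a raw 6 from the strict interviewer ends up ranked above a raw 8 from the lenient one, which is the kind of standardization a sufficient track record makes possible.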
For employees, periodic assessment and calibration involve many types of data: self-assessment,
peer-assessment and manager assessment, textual, numerical or relative to other employees. In some
scenarios, it is possible to analyze the content created by the employee. Here, too, it is important to
employ data-driven processes to get an efficient and unbiased assessment (Angrave, Charlwood,
Kirkpatrick, Lawrence, & Stuart, 2016).
Health Care
A recent study states that misdiagnosis is responsible for 10% of patient mortality and 6-17% of
medical complications (Makary & Daniel, 2016). Data science assists in reducing these numbers and
achieving more reliable decision making.
As a scientific field, health care is based on vast amounts of data of multiple types: Hundreds of
years of accumulated knowledge, research, medical protocols and clinical experiment results. For a
specific patient it is possible to use data of medical history, medical tests (imaging, biochemical, genetic and functional), family relations and demographic origin.
One important use case is early diagnosis of medical conditions, which increases treatment effectiveness and chances of recovery. Image analysis applications, such as Ultrasound, MRI, Mammography,
Tomography, Colonography, Angiography and X-rays, are ubiquitous for diagnosis tasks. An interesting use case diagnoses diabetes using Ophthalmoscopy (analyzing images of the retina). Signal processing is used in ECG and in analyzing the input from digital stethoscopes (Arnoldi et al., 2010; Doi,
2006; Esteva et al., 2017; Kononenko, 2001).
A second important use case employs data science for developing new medical treatments, and in
particular personalized treatments. It is possible to run thousands of experiments simultaneously and
efficiently analyze the results (Raghupathi & Raghupathi, 2014).
Another use case is clinical decision-support-systems (Musen, Middleton, & Greenes, 2014). A review of the research in this field has shown a significant improvement in medical treatment when
using a decision support system (Garg et al., 2005). The course presents the differences between
standalone and integrated systems and discusses how standards are important for sharing clinical decision support content (Wright & Sittig, 2008).
A crucial challenge in data-science based health care lies in the strict regulations that governments
enact in order to ensure patient privacy and safety. Another challenge is assimilating new technologies
into medical facilities and the plethora of new tools physicians need to learn.
Logistics and Maintenance
Large systems that are composed of multiple parts pose a challenge for maintenance and management. Examples include aircraft, trucks and civil infrastructure such as water and electricity. In addition, organizations like supermarkets also require efficient systems for managing logistics operations.
Many large systems collect data from sensors that continuously report the state of their parts. Modern
technologies advance IoT as a fundamental source of information and GPS and GIS data allow
elaborate analysis of location. Such systems use data science to predict malfunctions and for defining
preventive care policy. An aircraft, for example, can continually monitor 5000 different parameters
during its flight, which is equivalent to 2PB of data. The data can also be used for better evaluation
of usage patterns and improve cost estimation. Rolls Royce harnessed data science in order to define
a new business model of charging its clients based on the number of hours they operated their engines (Smith, 2013).
Retail stores also manage their supply chain using many sources of data: customer profiles, supplier
data, store arrangement data, and stock data all play a critical role in optimizing the costs and profit
of the store (Waller & Fawcett, 2013).
THE DATA SCIENCE PROJECT
The data science project flow is comprised of the following stages:
• Research question
• Data collection and cleaning
• Learning a model
• Model evaluation
• Results presentation
• Decision
The first stage in a data-science project is defining the research question (Peng & Matsui, 2015).
The course describes the different types of questions that are asked in different stages of the flow
and highlights the importance of differentiating between the types in order to set expectations and
choose the correct methodology. There are six types of questions:
• Descriptive: What are the high-level characteristics of the dataset? This question requires understanding the distribution of the input features and is asked in the initial stages of the project.
• Exploratory: After answering the descriptive questions, the exploratory question is about deeper relationships and patterns that lie within the data. It concerns correlations between input fields and can be the base for the next types of questions.
• Inferential: Asking whether a pattern that was found in a given dataset is also valid for external situations. Answering this question requires information outside the scope of the dataset.
• Predictive: Can one field be predicted given the other fields?
• Causal: Is there a causal effect between the values of the fields? This question is different from the predictive question in that it is sometimes possible to predict one field from other fields even if there is no causal connection between them.
• Mechanistic: If a causal connection exists, it might be relevant to ask why the input fields cause the output. Answering this question may require researching the real-world mechanism behind the causal connection (Psychology/Physics/Biology) and in many cases requires domain expertise rather than data-science methodology.
After the questions are defined, the next step is collecting the data and cleaning it. Data acquisition requires either locating the data or generating it. The former may involve data that exists within
the organization or from external resources. The latter may involve adding logging devices to existing
operation flows or defining new experiments that result in new data. Often, data acquisition requires
a data pipeline that combines information from several sources and converts the data into a standard
format. Once the data is available it requires validation and cleaning: taking care of missing values,
data duplication and anomalies, converting the measurements to standard units of measurement,
normalizing value ranges and grouping values into categorical fields (Rahm & Do, 2000).
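The cleaning steps listed above can be illustrated with a short Python sketch. The records, the field names, the choice to fill missing values with the mean and the 75 kg threshold for the categorical field are all assumptions made for this example.

```python
LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms

# Hypothetical raw records with a duplicate, a missing value and mixed units.
records = [
    {"id": 1, "weight": 70.0, "unit": "kg"},
    {"id": 2, "weight": 154.0, "unit": "lb"},
    {"id": 2, "weight": 154.0, "unit": "lb"},   # exact duplicate
    {"id": 3, "weight": None, "unit": "kg"},    # missing value
    {"id": 4, "weight": 90.0, "unit": "kg"},
]

def clean(records):
    # 1. Remove exact duplicates, keeping the first occurrence (copies protect the input).
    seen, unique = set(), []
    for r in records:
        key = (r["id"], r["weight"], r["unit"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(r))

    # 2. Convert all measurements to a standard unit (kilograms).
    for r in unique:
        if r["weight"] is not None and r["unit"] == "lb":
            r["weight"] = r["weight"] * LB_TO_KG
        r["unit"] = "kg"

    # 3. Fill missing values with the mean of the known values (one possible policy).
    known = [r["weight"] for r in unique if r["weight"] is not None]
    average = sum(known) / len(known)
    for r in unique:
        if r["weight"] is None:
            r["weight"] = average

    # 4. Normalize the value range to [0, 1] and group values into a categorical field.
    lo = min(r["weight"] for r in unique)
    hi = max(r["weight"] for r in unique)
    for r in unique:
        r["weight_norm"] = (r["weight"] - lo) / (hi - lo)
        r["weight_band"] = "light" if r["weight"] < 75 else "heavy"
    return unique

for row in clean(records):
    print(row)
```

The order of the steps matters in this sketch: units are converted before the mean and the range are computed, otherwise pounds and kilograms would be mixed in the same statistics.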
Following the data acquisition, it is possible to start learning a model. Since this program does not
focus on programming skills, the machine learning techniques are presented using a cloud platform
called BigML.com. The platform includes supervised-learning based models such as Decision-trees
(Quinlan, 1986), Tree-Ensemble (Sollich & Krogh, 1996), Logistic Regression (Hosmer, Lemeshow,
& Sturdivant, 2013) and Neural-Networks (Hagan, Demuth, Beale, & De Jesús, 1996), as well as unsupervised learning based models such as K-Means clustering (Forgy, 1965), Anomaly-detection
(Chandola, Banerjee, & Kumar, 2009) and field-association. For each model type, the concept is explained, as well as the major advantages and disadvantages.
Model evaluation methodology is presented, starting with the usage of Train/Validation/Test sets
for creating a model, configuring meta-parameters and evaluating the quality of the model. The
course describes several evaluation metrics of models: Error rate, Precision/Recall and F1, AUC,
RMSE and log-loss. The next step is to describe stratified sampling and cross validation methods.
After discussing the methodology for evaluating a single model, the course discusses live experiments
and the usage of A/B testing.
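Several of the evaluation metrics named above, and the index bookkeeping behind cross validation, can be written out in plain Python as a minimal sketch. The labels and predictions are invented for illustration, and the fold assignment (every k-th example) is just one simple choice; stratified sampling would additionally keep the class proportions equal across folds.

```python
from math import sqrt

def classification_metrics(actual, predicted):
    """Error rate, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "error_rate": (fp + fn) / len(actual),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def rmse(actual, predicted):
    """Root mean squared error for numeric predictions."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold serves once as the validation set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [(sorted(set(range(n)) - set(fold)), fold) for fold in folds]

# Invented example: 3 positives, 5 negatives.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(actual, predicted))
print(rmse([3, 5], [2, 7]))
print(k_fold_indices(6, 3))
```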
Once a model has been trained, configured and evaluated, the project’s technical part is over. The
data-science manager now presents the project results and main insights that were learned during its
execution. Visualization takes a central role in the results presentation, and the course goes over the
visual variables that can be used to clarify the presentation: location, size, shape, value (light/dark),
orientation, color and texture. The course presents a variety of visualization classes that can be employed when presenting data: For distributions there are pie-charts, histograms, stacked bars, and
sunburst charts. 2D plots include colored clusters and bubble charts. For process and flows, Sankey
charts and funnels are introduced. Finally, some non-orthodox visualizations are presented (for example the WorldMapper project), where the main motivation is to inspire the audience.
Introduction to SQL
As part of the Project topic, the course includes a section dedicated to learning basic SQL. SQL is a
tool that is used in many organizations to retrieve data and to analyze patterns. It presents a combination of a simple syntax and a wide functionality range. This introduction includes data types, table schemas, simple/nested queries, filters, aggregation, ordering, joins and grouping. This topic also discusses how to define a table, manipulate its data and metadata and optimize its operation using indexes and views.
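To give a feel for these SQL building blocks, the sketch below runs a few statements against an in-memory SQLite database through Python's standard sqlite3 module. The table names, columns and rows are hypothetical, and SQLite is used here only because it needs no installation; the course itself does not prescribe a specific database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define tables, an index and a view, then load a few hypothetical rows.
conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    city TEXT
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount      REAL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);

INSERT INTO customers VALUES (1, 'Ada', 'Jerusalem'), (2, 'Bo', 'Haifa'), (3, 'Cy', 'Jerusalem');
INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 40.0), (4, 3, 300.0);

-- A view that joins the two tables and aggregates per customer.
CREATE VIEW customer_totals AS
    SELECT c.name AS name, c.city AS city,
           COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name, c.city;
""")

# Filter and order the aggregated view.
big_spenders = conn.execute(
    "SELECT name, n_orders, total FROM customer_totals "
    "WHERE total > ? ORDER BY total DESC",
    (100,),
).fetchall()

# A nested query: customers who placed at least one order above 100.
large_order_customers = conn.execute(
    "SELECT name FROM customers "
    "WHERE id IN (SELECT customer_id FROM orders WHERE amount > 100) "
    "ORDER BY name"
).fetchall()

print(big_spenders)
print(large_order_customers)
```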
ETHICAL DATA SCIENCE
Data science is a powerful tool, and with great power comes great responsibility. It is crucial to include a discussion of ethical issues arising from data science in the program. Moreover, there is an
increasing trend of states adding legal regulations regarding the usage of data.
How (Not) to Lie with Statistics
The book How to lie with statistics (Huff, 1993) is used to demonstrate some of the most common
ways of misusing statistical thinking.
Biased sample: There may be different reasons for getting a biased sample. Some populations may
not be available for sampling. In other cases, the subjects do not supply truthful responses or do not
respond at all. The person conducting the survey may also introduce bias by hinting at the "correct"
opinion.
How to describe a distribution: Definitions of average, mean and mode, and why none of them is informative enough by itself; which statistics are required for describing a distribution.
Data Dredging, Causation vs. Correlation: The common confusion between causation and correlation is discussed. The course also explains how repeating the same stochastic experiment many times may produce improbable patterns that do not represent the real data.
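The effect of repeating an experiment can be quantified with a short sketch. Assuming, for illustration only, that each test is independent and has a 5% chance of producing a "significant" result when no real effect exists, the chance of at least one spurious finding among n tests is 1 - (1 - 0.05)^n. The function names and the parameters alpha, n_tests and n_studies below are assumptions introduced for this example.

```python
import random


def chance_of_spurious_finding(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests of a true
    null hypothesis comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_dredging(n_studies, n_tests, alpha=0.05, seed=0):
    """Fraction of simulated studies that report at least one false positive."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # With no real effect, each test is 'significant' with probability alpha.
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(n, round(chance_of_spurious_finding(n), 3))
    print("simulated, 20 tests:", simulate_dredging(2000, 20))
```

With 20 tests the chance of finding some "pattern" is already close to two in three, even though none of the patterns is real.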
Misleading visualizations: This section presents several ways that graphs can give a misleading impression. One example is that changing the graph baseline can cause trends to look more dramatic
than they really are. Another example is the usage of 2D and 3D pictograms in order to exaggerate
the effect of a comparison.
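The baseline effect can be illustrated with simple arithmetic. The sketch below compares the heights of two bars as drawn on a chart; the function name and the numbers are hypothetical.

```python
def apparent_ratio(small, large, baseline=0.0):
    """Ratio between the drawn heights of two bars when the value axis
    starts at `baseline` instead of zero."""
    if baseline >= small:
        raise ValueError("baseline must be below the smaller value")
    return (large - baseline) / (small - baseline)


if __name__ == "__main__":
    # The values 50 and 52 differ by 4%.
    print(apparent_ratio(50, 52))               # axis starts at 0
    print(apparent_ratio(50, 52, baseline=48))  # axis starts at 48
```

With the axis starting at 48, the second bar is drawn twice as tall as the first, although the underlying values differ by only 4%.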
Percentage misconceptions: Misconceptions on percentages may lead to false notions. When
working with percentages, it is most important to keep track of the whole that the percentage refers
to. When one deals with a sequence of changes (increases or decreases) in percentages, the whole
changes. When the reported change is very dramatic it may indicate that the relevant whole was small
to begin with. In addition, when summing parts of the same whole, it is important to make sure they
do not overlap.
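The point about the changing whole can be checked with a few lines of arithmetic. In the sketch below (function names and numbers are hypothetical), a 50% increase followed by a 50% decrease does not return to the starting value, and a small starting value turns a modest absolute change into a dramatic percentage.

```python
def apply_changes(start, percent_changes):
    """Apply a sequence of percentage changes, each relative to the
    current value (the 'whole' changes after every step)."""
    value = start
    for pct in percent_changes:
        value *= 1 + pct / 100
    return value


def percent_change(old, new):
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100


if __name__ == "__main__":
    final = apply_changes(100.0, [50, -50])
    print(final, percent_change(100.0, final))  # ends at 75, a net 25% decrease
    print(percent_change(2, 6))                 # 2 -> 6 is a 200% increase
```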
ETHICAL CONSIDERATIONS
Experiments with humans: In recent decades, regulations limiting experiments on humans have multiplied. It is imperative that the experiment subjects give their voluntary and informed consent to taking part in an experiment. The experiment should be planned so that it reduces the risks the subjects are exposed to (Kramer, Guillory, & Hancock, 2014).
Privacy and Anonymization: The course defines the difference between sensitive and non-sensitive information and shows that many organizations hold sensitive information about their users/customers (Barbaro, Zeller, & Hansell, 2006). Even after some identifying details are omitted, the remaining data may still contain enough information to identify some of the persons. The course presents the k-anonymity (Sweeney, 2002), l-diversity (Machanavajjhala, Gehrke, Kifer, & Venkitasubramaniam, 2006) and t-closeness (Li, Li, & Venkatasubramanian, 2007) criteria, which measure the possibility of identification, and discusses their limitations.
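To make the first two criteria concrete, the following minimal sketch computes the k of k-anonymity and a simple "distinct values" form of l-diversity for a small table. The records, the column names (zip, age, diagnosis) and the choice of quasi-identifiers are hypothetical, and the code is an illustration of the definitions rather than an anonymization tool.

```python
from collections import Counter, defaultdict

# Hypothetical, already generalized records.
records = [
    {"zip": "902**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "902**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "902**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "902**", "age": "30-39", "diagnosis": "diabetes"},
]


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier
    values; the table is k-anonymous for this k. Expects a non-empty table."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())


def distinct_l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    values = defaultdict(set)
    for r in rows:
        values[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(v) for v in values.values())


if __name__ == "__main__":
    qi = ["zip", "age"]
    print("k =", k_anonymity(records, qi))
    print("l =", distinct_l_diversity(records, qi, "diagnosis"))
```

The example table is 2-anonymous, yet everyone in the first group shares the same diagnosis, so knowing that a person belongs to that group reveals the sensitive value. This is the kind of limitation that l-diversity is meant to address.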
Data-based discrimination: Data science can perpetuate discrimination when models learn from historic patterns that themselves result from discriminating behavior. Data-driven discrimination can be much harder to fight than "classical" discrimination based on unconscious bias. Two definitions of fairness are given and several ways of identifying and preventing discrimination are presented (Calmon, Wei, Vinzamuri, Ramamurthy, & Varshney, 2017; Hardt, Price, & Srebro, 2016; Fish, Kun, & Lelkes, 2016).
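The text does not name the two fairness definitions, so the sketch below is only an assumption about what such definitions can look like. It compares two commonly used notions on made-up predictions: demographic parity (equal rates of positive decisions across groups) and equal opportunity in the sense of Hardt, Price, and Srebro (equal true positive rates across groups). The groups, labels and predictions are hypothetical.

```python
def selection_rate(predictions):
    """Share of individuals who receive a positive decision."""
    return sum(predictions) / len(predictions)


def true_positive_rate(labels, predictions):
    """Share of truly positive individuals who receive a positive decision."""
    decisions_for_positives = [p for y, p in zip(labels, predictions) if y == 1]
    return sum(decisions_for_positives) / len(decisions_for_positives)


# Hypothetical data: 1 = qualified / accepted, 0 = not qualified / rejected.
group_a = {"labels": [1, 1, 0, 0], "predictions": [1, 1, 1, 0]}
group_b = {"labels": [1, 1, 0, 0], "predictions": [1, 0, 0, 0]}


def demographic_parity_gap(a, b):
    """Difference in selection rates between the two groups."""
    return abs(selection_rate(a["predictions"]) - selection_rate(b["predictions"]))


def equal_opportunity_gap(a, b):
    """Difference in true positive rates between the two groups."""
    return abs(
        true_positive_rate(a["labels"], a["predictions"])
        - true_positive_rate(b["labels"], b["predictions"])
    )


if __name__ == "__main__":
    print("demographic parity gap:", demographic_parity_gap(group_a, group_b))
    print("equal opportunity gap:", equal_opportunity_gap(group_a, group_b))
```

A gap of zero would mean the two groups are treated alike under the corresponding definition; the two definitions can disagree on the same data, which is one reason more than one is usually discussed.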
ASSESSMENT
Students in the proposed course will be assessed through the implementation of a project. The assessment itself is part of the learning process and can be categorized as Project (or Problem) Based Learning (PBL). This type of assessment allows the students to acquire knowledge via a continuous structured process surrounding an authentic question and concluding with the design of a product that mirrors the learning process. During this process the students deal with creative challenges, and in order to succeed they must delve deeply into the subjects related to the project. They do all this on their own: they must seek relevant material, research it and find answers to their questions.
The notion of a team is central to this paper. Teamwork should therefore be woven into the evaluation stages as well. Furthermore, many of the problems in the use cases are similar, and the solution in one case may help solve the other. For this purpose, we chose the PBL evaluation method, which allows evaluation of teamwork and helps achieve the following goals: ability to learn new subjects; acquisition of problem-solving capabilities; use of knowledge to solve problems; breaking down knowledge into parts; creative and critical thinking; development of a holistic approach to problems and situations; ability to work independently; ability to work in groups, collaborating and improving communication skills. Figure 1 depicts the reciprocal relations among the PBL integrated action circles.
[Figure 1 diagram; labels: Practice, Discussion, What, PBL, So What, Now What, Reflection]
Figure 1: Reciprocal relations among the action circles in the PBL method
Self-feedback, which also exists in the PBL approach, may reveal changes in self-perception or apparent changes in behavior. At the basis of self-guided learning processes lie reflection processes such as self-reflection, self-judgment and self-reaction (Zimmerman & Schunk, 2001).
It is important to note that this process is not the end of the learning process. On the contrary, it is its basis. While working on the project, the material is studied in depth. The goal is to allocate a reasonable number of questions to the project and to include all necessary elements upon which the knowledge may be built and established.
Evaluation of the project: The final evaluation does not relate only to the final product, but to the
whole learning process. The process is based on chronological steps, including presentation of drafts,
reflection processes and intermediate feedback.
The project should meet the following criteria:
• AL – Applied Learning – Learning that can be applied to future projects. Team work, interpersonal communication, presentations.
• AE – Active Exploration – Learning that demands search and active movement, activity outside the classroom, such as reaching out to the community and specializing in relevant subjects.
• AC – Adult Connections – While working on the project the students will meet an advisor who specializes in the subject that they are researching.
• AR – Academic Rigor – Search for academic material and connection to knowledge acquired outside of the academic setting.
• AP – Assessment Practices – While working on the project the students will be evaluated at each step using various appropriate assessment tools.
This method creates an authentic experience from which students who have not previously experienced complicated situations can learn and draw conclusions.
EXISTING COURSES
The following section surveys several courses and programs that aim at teaching data science to
managers/executives. Table 1 summarizes our findings. The surveyed courses are:
• Data Analytics for Managers (DAfM): Given by edx.org.
• Data Science for Executives (DSfE): Given by edx.org.
• Data Science for Managers (DSfM1): Given by Naya College.
• Executive Data Science Specialization (EDSS): Given by Coursera.
• Data Science for Managers (DSfM2): Given by Monash University.
• Managing Data Science Activity (MaDaScA): The course presented here.
The survey checks for the existence of the following components in the course:
• SQL introduction (SQL)
• Data science use cases (UC)
• Data science project: initial steps (PI)
• Data science project: model creation and evaluation (PM)
• Data science project: presenting results (PP)
• Building the data science team (BT)
• Structuring the data science team in the organization (ST)
• Ethical data science (EDS)
For example, the course ‘Data Science for Managers’ (DSfM1) includes a discussion on the main use
cases of data science and the steps required for creating a model (including the initial steps such as
acquiring the data and pre-processing it). In addition to the content that MaDaScA includes, this
course also touches upon programming skills and the students learn how to implement algorithms.
While the survey shows that the core components of all programs are about creating and evaluating models (only DSfM1 does not discuss result presentation), apart from MaDaScA, only DAfM discusses SQL, only EDSS discusses organizational considerations and only DSfM2 discusses ethical considerations in data science.
The columns are arranged so that the more concrete, technically oriented topics (use cases, SQL, project flow) are on the left and the more high-level, management-oriented topics (team structure, ethical considerations) are on the right. From the table it is evident that the first three courses focus on the data-science elements of the content, while the remaining ones focus on the management components. This is also reflected in the additional subjects that some of the courses include and that the suggested course omits.
MaDaScA balances the two perspectives and gives a wide scope for managers who want to impact
the organization’s roadmap using data-driven decision-making.
Table 1. Courses contents
[Table 1: rows DAfM, DSfE, DSfM1, EDSS, DSfM2, MaDaScA; columns Course name, SQL, UC, PI, PM, PP, BT, ST, EDS, Additional subjects. Additional subjects listed: IoT; Programming, Algorithms implementation; History, business strategies.]
CONCLUSION
The paper presents the MaDaScA program for teaching data science to potential managers. The course aims at balancing professional knowledge of the data science project process with organizational knowledge regarding the structure of the team, its role in the organization, and its duties and responsibilities. It is crucial that more professional managers join the data science field, so that they may complement the technical capabilities that data science teams have. It is even more important that data-science managers have a solid background in data-driven decision making and in the ethical use of the power that lies in big data and ever-developing technologies. The combination of strong data scientists and strong data-science managers may lead to the next level of data science capabilities.
DISCUSSION
There are several ideas that might contribute additional value to this program. It would be interesting to experiment with an interactive session where the student simulates the learning process and tries to derive an intuitive (manual) model from a given dataset, presented point after point. Promoting joint projects with experienced and active data-science managers can add value for the students and help them in their first steps in the field. It might be beneficial to add more case studies comparing different approaches. Moreover, it is important to learn also from "negative" case studies, where methodological errors caused projects to fail. In general, it would be interesting to add more ways to make the program tangible and keep it in close correspondence with the industry.
REFERENCES
Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the
state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering, 6, 734-749.
https://doi.org/10.1109/tkde.2005.99
Angrave, D., Charlwood, A., Kirkpatrick, I., Lawrence, M., & Stuart, M. (2016). HR and analytics: Why HR is
set to fail the big data challenge. Human Resource Management Journal, 26(1), 1-11.
https://doi.org/10.1111/1748-8583.12090
Arnoldi, E., Gebregziabher, M., Schoepf, U. J., Goldenberg, R., Ramos-Duran, L., Zwerner, P. L., ... & Thilo, C.
(2010). Automated computer-aided stenosis detection at coronary CT angiography: initial experience. European radiology, 20(5), 1160-1167. https://doi.org/10.1007/s00330-009-1644-7
Barbaro, M., Zeller, T., & Hansell, S. (2006, August 9). A face is exposed for AOL searcher no. 4417749. New
York Times, p. 8. Retrieved from
http://shawndra.pbworks.com/f/A+Face+Is+Exposed+for+AOL+Searcher+No.+4417749++New+York+T.pdf
Bolton, R. J., & Hand, D. J. (2001). Unsupervised profiling methods for fraud detection. Credit Scoring and Credit
Control VII, 235-255. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.5743&rep=rep1&type=pdf
Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, 235-249. Retrieved
from https://projecteuclid.org/download/pdf_1/euclid.ss/1042727940
Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., & Varshney, K. R. (2017). Optimized pre-processing
for discrimination prevention. In Advances in Neural Information Processing Systems (pp. 3992-4001). Retrieved
from http://papers.nips.cc/paper/6988-optimized-pre-processing-for-discrimination-prevention.pdf
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR),
41(3), 15. Retrieved from http://www.cs.umn.edu/sites/cs.umn.edu/files/tech_reports/07-017.pdf
Doi, K. (2006). Diagnostic imaging over the last 50 years: Research and development in medical imaging science and technology. Physics in Medicine & Biology, 51(13), R5. Retrieved from
https://www.uio.no/studier/emner/matnat/fys/nedlagteemner/FYS4760/h08/undervisningsmateriale/Diagnostics%2050%20year%20pmb6_13_r02.pdf
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. (2017). Dermatologist-level
classification of skin cancer with deep neural networks. Nature, 542(7639), 115. Retrieved from http://ondemand.gputechconf.com/gtc/2017/presentation/s7822-andre-esteva-dermatologiest-level-classificationof-skin-cancer.pdf
Fish, B., Kun, J., & Lelkes, Á. D. (2016, June). A confidence-based approach for balancing fairness and accuracy. In Proceedings of the 2016 SIAM International Conference on Data Mining (pp. 144-152). Society for Industrial
and Applied Mathematics. Retrieved from https://arxiv.org/pdf/1601.05764.pdf
Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications.
Biometrics, 21, 768-769.
Garg, A. X., Adhikari, N. K., McDonald, H., Rosas-Arellano, M. P., Devereaux, P. J., Beyene, J., ... & Haynes, R.
B. (2005). Effects of computerized clinical decision support systems on practitioner performance and patient outcomes: A systematic review. Jama, 293(10), 1223-1238. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.468.3830&rep=rep1&type=pdf
Hagan, M. T., Demuth, H. B., Beale, M. H., & De Jesús, O. (1996). Neural network design (Vol. 20). Boston: Pws
Pub.
Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In Advances in neural
information processing systems (pp. 3315-3323). Retrieved from http://papers.nips.cc/paper/6374-equality-ofopportunity-in-supervised-learning.pdf
Hosmer, D. W., Jr, Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol. 398). John Wiley &
Sons.
Huff, D. (1993). How to lie with statistics. WW Norton & Company.
Jannach, D., & Hegelich, K. (2009, October). A case study on the effectiveness of recommendations in the
mobile internet. In Proceedings of the third ACM conference on Recommender systems (pp. 205-208). ACM. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.452.3453&rep=rep1&type=pdf
Kononenko, I. (2001). Machine learning for medical diagnosis: History, state of the art and perspective. Artificial Intelligence in Medicine, 23(1), 89-109. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.184&rep=rep1&type=pdf
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer,
8, 30-37. Retrieved from https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf
Kramer, A. D., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 201320040. Retrieved from
https://www.pnas.org/content/pnas/early/2014/05/29/1320040111.full.pdf
Kross, S., Peng, R. D., Caffo, B. S., Gooding, I., & Leek, J. T. (2017). The democratization of data science education (No. e3195v1). PeerJ Preprints. Retrieved from https://peerj.com/preprints/3195.pdf
Laleh, N., & Azgomi, M. A. (2009, March). A taxonomy of frauds and fraud detection techniques. In International Conference on Information Systems, Technology and Management (pp. 256-267). Springer, Berlin, Heidelberg.
Retrieved from https://s3.amazonaws.com/academia.edu.documents/46467288/Laleh-2015-Profile1Paper2.pdf ?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1544185945&Signature=4E7c4W
RxByvsMDQx9vUoXKjMv%2FA%3D&response-contentdisposition=inline%3B%20filename%3DA_Taxonomy_of_Frauds_and_Fraud_Detection.pdf
Levi, M., & Burrows, J. (2008). Measuring the impact of fraud in the UK: A conceptual and empirical journey.
The British Journal of Criminology, 48(3), 293-318. doi:10.1093/bjc/azn001
Li, N., Li, T., & Venkatasubramanian, S. (2007, April). t-closeness: Privacy beyond k-anonymity and l-diversity.
In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on (pp. 106-115). IEEE. Retrieved
from http://www.utdallas.edu/~mxk055100/courses/privacy08f_files/tcloseness.pdf
Machanavajjhala, A., Gehrke, J., Kifer, D., & Venkitasubramaniam, M. (2006, April). l-Diversity: Privacy beyond
k-anonymity. In 22nd International Conference on Data Engineering (ICDE'06) (p. 24). IEEE. Retrieved from
https://ptolemy.berkeley.edu/projects/truststc/pubs/465/L%20Diversity%20Privacy.pdf
Makary, M. A., & Daniel, M. (2016). Medical error—The third leading cause of death in the US. Bmj, 353,
i2139. Retrieved from http://healthofamericans.org/files/Medical_error.pdf
Musen, M. A., Middleton, B., & Greenes, R. A. (2014). Clinical decision-support systems. In Biomedical Informatics (pp. 643-674). Springer, London. Retrieved from
https://www.researchgate.net/profile/Mark_Musen/publication/226706299_Clinical_DecisionSupport_Systems/links/0fcfd5082f6e1def38000000.pdf
Oard, D. W., & Kim, J. (1998, July). Implicit feedback for recommender systems. In Proceedings of the AAAI
workshop on recommender systems (Vol. 83). Wollongong. Retrieved from
http://www.aaai.org/Papers/Workshops/1998/WS-98-08/WS98-08-021.pdf
PWC. (2018). Pulling fraud out of the shadows: Global economic crime and fraud survey 2018. Retrieved 18
December 2018 from http://www.pwc.com/gx/en/forensics/global-economic-crime-and-fraud-survey2018.pdf
Peng, R. D., & Matsui, E. (2015). The art of data science. A guide for anyone who works with data. Skybrude
Consulting, 200, 162.
Phua, C., Lee, V., Smith, K., & Gayler, R. (2010). A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119. Retrieved from
https://arxiv.org/ftp/arxiv/papers/1009/1009.6119.pdf
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 3. Retrieved from https://hal.archives-ouvertes.fr/hal-01663474/document
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4),
3-13.
Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender systems: Introduction and challenges. In Recommender
systems handbook (pp. 1-34). Springer, Boston, MA. Retrieved from
http://fumblog.um.ac.ir/gallery/1057/Recommender%20Systems_%20Introduction%20and%20Challenges.pdf
Schafer, J. B., Konstan, J., & Riedl, J. (1999, November). Recommender systems in e-commerce. In Proceedings of
the 1st ACM conference on Electronic commerce (pp. 158-166). ACM. Retrieved from
https://s3.amazonaws.com/academia.edu.documents/31095343/recommender-systems-ecommerce.pdf ?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1544183998&Signature=ufIz%
2BHQMwiGBqApxbGF%2FPKCBGWg%3D&response-contentdisposition=inline%3B%20filename%3DRecommender_systems_in_e-commerce.pdf
Smith, D. J. (2013). Power-by-the-hour: The role of technology in reshaping business strategy at Rolls-Royce.
Technology Analysis & Strategic Management, 25(8), 987-1007. Retrieved from
http://irep.ntu.ac.uk/id/eprint/926/1/214516_Re-shaping%2520Business%2520Strategy_v1.6c.pdf
Sollich, P., & Krogh, A. (1996). Learning with ensembles: How overfitting can be useful. In Advances in neural
information processing systems (pp. 190-196).
Stolfo, S. J., Fan, W., Lee, W., Prodromidis, A., & Chan, P. K. (2000). Cost-based modeling for fraud and intrusion detection: Results from the JAM project. Columbia University New York Department of Computer Science. Retrieved from https://pdfs.semanticscholar.org/7334/806e28edef38aadcc0a52e1b016dfae5fff6.pdf
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and
Knowledge-Based Systems, 10(05), 557-570. Retrieved from http://www.cs.pomona.edu/~sara/classes/cs190fall12/k-anonymity.pdf
Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive analytics, and big data: A revolution that will
transform supply chain design and management. Journal of Business Logistics, 34(2), 77-84. Retrieved from
https://pdfs.semanticscholar.org/9c1b/9598f82f9ed7d75ef1a9e627496759aa2387.pdf
Wright, A., & Sittig, D. F. (2008). A four-phase model of the evolution of clinical decision support architectures. International Journal of Medical Informatics, 77(10), 641-649. Retrieved from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2627782/pdf/nihms68821.pdf
Zimmerman, B. J., & Schunk, D. H. (2001). Self-regulated learning and academic achievement: Theoretical perspective.
Lawrence Erlbaum Associates, New Jersey.
BIOGRAPHIES
Shahar Golan, PhD is a lecturer in Lev Academic Center (JCT), in the
department of software engineering. Before joining JCT, he worked as a
researcher and developer in Google, Yahoo Labs and HP Labs. His main
research topics include Machine Learning (Recommender Systems in particular) and Constraint Satisfaction Problems.
Professor Dan Bouhnik is the head of the Computer Science department in the Jerusalem College of Technology. He is the author of a
number of books used for teaching Advanced Computer Sciences. In his
research he touches upon information security issues from a number of
angles: anonymity, privacy, usability, personalization and the awareness
level of the user to these issues.