A Meta-Summary of Challenges in Building Products With ML Components - Collecting Experiences From 4758+ Practitioners



Nadia Nahar, Carnegie Mellon University, Pittsburgh, PA, USA, [email protected]
Haoran Zhang, Carnegie Mellon University, Pittsburgh, PA, USA
Grace Lewis, Carnegie Mellon Software Engineering Institute, Pittsburgh, PA, USA
Shurui Zhou, University of Toronto, Toronto, Ontario, Canada
Christian Kästner, Carnegie Mellon University, Pittsburgh, PA, USA

arXiv:2304.00078v1 [cs.SE] 31 Mar 2023

CAIN ’23, May 15–16, 2023, Melbourne, Australia

ABSTRACT

Incorporating machine learning (ML) components into software products raises new software-engineering challenges and exacerbates existing challenges. Many researchers have invested significant effort in understanding the challenges of industry practitioners working on building products with ML components, through interviews and surveys with practitioners. With the intention to aggregate and present their collective findings, we conduct a meta-summary study: Using guidelines for systematic literature reviews, we collect 50 relevant papers that together interacted with over 4758 practitioners. We then collected, grouped, and organized the over 500 mentions of challenges within those papers. We highlight the most commonly reported challenges and hope this meta-summary will be a useful resource for the research community to prioritize research and education in this field.

Table 1: Overview of Identified Challenges

Requirements Engineering:
• Lack of AI literacy causes unrealistic expectations from customers, managers, and even other team members
• Vagueness in ML problem specifications makes it difficult to map business goals to performance metrics
• Regulatory constraints specific to data and ML introduce additional requirements that restrict development

Architecture, Design, and Implementation:
• Transitioning from a model-centric to a pipeline-driven or system-wide view is considered important for moving into production, but a difficult paradigm shift for many teams
• ML adds substantial design complexity with many, often implicit, data and tooling dependencies, and entanglements due to a lack of modularity
• Difficulty in scaling model training and deployment on diverse hardware
• While monitorability and planning for change are often considered important, they are mostly considered only late after launching

Model Development:
• Model development benefits from engineering infrastructure and tooling but provided infrastructure and technical support are limited in many teams
• Code quality is not standardized in model development tools, leading to conflicts about code quality

Data Engineering:
• Data quality is considered important, but difficult for practitioners and not well supported by tools
• Internal data security and privacy policies restrict data access and use
• Although training-serving skew is common, many teams lack support for its required detection and monitoring
• Data versioning and provenance tracking are often seen as elusive, with not enough tool support

Quality Assurance:
• Testing and debugging ML models is difficult due to lack of specifications
• Testing of model interactions, pipelines, and the entire system is considered challenging and often neglected
• Testing and monitoring models in production are considered important but difficult, and often not done
• There are no standard processes or guidelines on how to assess system qualities such as fairness, security, and safety in practice

Process:
• Development of products with ML component(s) is often ad-hoc, lacking well-defined processes
• The uncertainty in ML development makes it hard to plan and estimate effort and time

Organization and Teams:
• Building products with ML components requires diverse skill sets, which is often missing in development teams
• Many teams are not well prepared for the extensive interdisciplinary collaboration and communication needed in ML products
• ML development can be costly and resource limits can substantially curb/limit efforts
• Lack of organizational incentives, resources, and education hampers achieving all system-level qualities

1 INTRODUCTION

After decades of effort in machine learning (ML) research to build better models, researchers from industry and academia have recently started to shift their attention to improving how to build software products with such models. Incorporating an ML component into a software product is often argued to be harder than incorporating traditional functional components, because of the specific characteristics of machine learning (e.g., based on data, no specifications in the traditional sense, fairness concerns) and how they impact the entire life cycle of the product [41, 49, 58, 83, 111]. While the traditional software development process has challenges of its own, bringing ML into the picture is argued to break a lot of existing software architecture and engineering assumptions [49, 58]. This leads to initiatives to rethink existing processes and practices and shift priorities in software teams. As a result, we keep hearing from practitioners about how they perceive building, deploying, and incorporating machine learning in software products as a challenge, even when the initial ML research and model prototypes seemed promising.

While some practitioners give talks on challenges or write experience papers (e.g., examples in academic venues surveyed elsewhere [66, 84]), researchers have also been actively studying the challenges faced by practitioners when building software products with ML components across many projects. In recent years, many researchers have interviewed or surveyed practitioners to identify what has really changed for them with the introduction of machine learning, often with the goal of identifying challenges, research opportunities, and best practices in a rapidly changing field. While some studies focus on specific aspects, such as challenges regarding
architecture [109], collaboration [77], or fairness [36, 94], many others explore challenges more broadly. Many of these studies have identified similar challenges. We believe that we have reached a point where practices have settled and research on challenges approaches saturation; we think that now is a good time to step back and survey the collective findings of the research community.

In this paper, we aim to consolidate knowledge about challenges in the practice of building software products with ML components, with a systematic literature survey of existing studies that interviewed or surveyed industry practitioners across multiple projects. We identified 50 studies, of which 30 conducted interviews, 11 conducted surveys, and nine did both, with a total of over 4758 identified participants (seven studies did not report the number of participants; some participants may have participated in multiple studies). Using the meta-summary research method [24, 95, 103], we analyze, organize, and synthesize findings across all these studies (as shown in Figure 1), answering the overall research question:

What are the challenges experienced by industry practitioners in building software products with ML components?

Figure 1: Research Method

We group the challenges found in the meta-summary into categories. In a nutshell, we find practitioners struggle in different product development stages: (1) requirements engineering, (2) architecture, design, and implementation, and (3) quality assurance. We also find several engineering challenges in ML-specific stages, in particular (4) model development, and (5) data engineering. Other issues relate to cross-cutting concerns related to (6) process, and (7) organization and teams. Following the meta-summary method, we present and organize the challenges mentioned by the practitioners in the original papers as they have been reported, without attempting to speculate or pass our own judgments on the findings. Table 1 contains a summary of the findings. We conclude the paper with a brief discussion, reflecting our own views.

Our key contribution is the meta-summary, which we present in narrative form in this paper. We additionally provide details with clear traceability to findings and papers as supplementary documents [76].

2 SCOPING AND RELATED WORK

With the advance of ML techniques, many organizations have invested substantial efforts in building products with ML components. While there is a large amount of research that focuses entirely on the challenges that data scientists face in their model-development work (e.g., development responsibilities [51], data exploration [62, 74], data-science processes [29, 69], development in notebooks [19, 32, 87], AutoML [121]), another body of work focuses on the challenges of building products with those models, often with interdisciplinary teams, and placing substantial attention on qualities like safety and observability. The latter work, which forms the scope of this survey, moves beyond the model-centric view of classic data-science workflows and considers building automated pipelines and entire software systems with many ML and non-ML components, as well as the engineering challenges involved. It emerges in a growing research community often named Software Engineering for Machine Learning (SE4ML) studying the engineering challenges of building both ML components and products that contain ML components.

Understanding practitioner needs. Academic research is often criticized for being far removed from the needs faced by practitioners in industry [8, 94, 120]. If researchers want to achieve rapid impact in industry, they need to understand what problems are important to practitioners; conversely, practitioners may attempt to attract researchers to work on their problems. Attempts to close the gap between academia and practice typically need to navigate a tradeoff between (a) investigating one or few teams in depth with findings that may not generalize or (b) exploring common problems across many teams with more shallow engagements. Focused on individual teams, we see a few ethnographic studies [85, 86], many direct collaborations with an industrial partner [52, 93], and many experience reports published by practitioners in papers [9, 31, 48, 59], talks [26, 30, 70], or blog posts [34, 113, 131]. To understand problems across teams, many researchers conduct interviews across multiple teams and organizations (e.g., [58, 65, 77, 109, 119]) and identify problems that are either addressed by the same researchers or reported as open problems to the community. Other researchers have focused on surveying practitioners at scale across companies and regions, e.g., [44, 57, 122, 130].

In this paper, we go one step further in aggregating and analyzing results from prior interviews and surveys with over 4758 practitioners, which we hope will help guide future research and educational activities toward challenges relevant to practitioners.

Previous literature reviews. There have been several prior literature reviews on topics related to building products with ML components. Most surveys review academic papers proposing solutions in subfields, such as testing ML components [5, 15, 38, 97, 129], safety and security [13, 38, 63], data management [90], and even trying to cover published research on SE4ML broadly [27, 68]. The closest to our work are two literature surveys that analyze practitioner experience reports published at academic conferences (not including gray literature) collecting the self-reported challenges of a few dozen teams [66, 84]. In this work, we specifically perform a meta-summary of academic papers reporting on interviews and surveys with practitioners.

3 RESEARCH METHOD

The goal of this paper is to summarize challenges in building products with ML components, accumulated from industry practitioners in prior research. To achieve this goal and answer our research question, we first define the appropriate search strategy and study selection criteria to find the relevant literature that identifies challenges through communicating with industry practitioners. We follow the guidelines for systematic literature review (SLR) for this paper collection step [50]. Then, we extract the data from the selected papers and analyze the data to complete the synthesis process. Several approaches have been explored for synthesizing qualitative research in software engineering, such as thematic synthesis, meta-ethnography, and meta-summary [39]. As our aim is to discover patterns or themes of challenges in building products with ML components, as well as get a sense of the priority of the challenges based on the frequency of reports by industry practitioners, the meta-summary method is best suited for this research problem [39, 95, 103]. The meta-summary method provides a well-balanced synthesis mechanism, which is deeper than mapping studies, and not as exhaustive as meta-ethnography, which requires significant expertise and experience with the methodology and its philosophical stance [95]. Thus, we apply the meta-summary method [24, 95, 103] to perform the quantitative aggregation of the qualitative evidence that we present as findings. Figure 1 shows the overview of the research process followed in this study.

3.1 Paper Selection

To increase reliability, reproducibility, and objectivity of the process for paper selection, we follow the established procedure of conducting systematic literature reviews [50].

Relevant years. Much of the research on engineering products with ML components was inspired by the seminal 2015 ML technical debt paper by Sculley et al. [104], which outlined various engineering challenges in building and operating ML infrastructure. For completeness, we selected the year range of the papers to be from 2010 to 2022.

Publication venues. To search for papers, we select digital libraries and databases commonly used by software engineering review papers, e.g., [20, 45, 64]. We do not filter by the venue, as we expect to find papers that are published in different communities including software engineering, human-computer interaction, and machine learning. Since we aim to aggregate results from robust empirical studies, we did not include gray literature, such as blog posts, which typically reflect opinions or individual experience only. However, we did include arXiv as a data source, as it contains many relevant academic papers in this field, even if some have not been peer reviewed. Specifically, we use the following seven data sources: IEEE Xplore (ieeexplore.ieee.org), ACM Digital Library (portal.acm.org/dl.cfm), Wiley InterScience (www.interscience.wiley.com), Elsevier Science Direct (www.sciencedirect.com), SpringerLink (www.springerlink.com), EI Compendex (www.engineeringvillage.com), and arXiv (https://arxiv.org).

Search query. Defining the right scope and corresponding search query required some iteration. We started by assembling an initial set of 21 papers as a seed set (a common practice [25, 66]). The seed set was composed of papers that we knew well from our past work in this field. We then analyzed the seed set to define the keywords needed to retrieve those and similar papers.

We realized that our research question has three aspects, and therefore to retrieve the papers that would satisfy our research question, we focused on those three parts to formulate the search query: (A) The paper needs to mention an ML-related keyword, since we focus on challenges introduced by ML components. (B) The paper needs to mention a software engineering or ML deployment-related keyword, since we focus on engineering challenges that go beyond local concerns of data scientists; for example, the paper should discuss concerns related to actual product development where models are deployed and incorporated into larger software systems. Finally, (C) the paper needs to mention surveys or interviews, since we are interested in the challenges mentioned by industry practitioners and these are the most common relevant research methods; we are not interested in a single-team case study or ethnographic study, as the challenges found in such papers may be specific to individual products.

After adding some semantically similar terminologies, we developed the following search query fragments. A: “machine learning” OR “artificial intelligence” OR “deep learning” OR “ML component” OR “data science”; B: “software engineering” OR “software systems” OR “production-ready systems” OR “ML systems” OR “deploying ML” OR “ML deployment”; C: “interview” OR “survey” OR “questionnaire”. The final query was of the following format: “A AND B AND C.”

We searched with this query within the abstract of the papers in all the digital data sources except SpringerLink, as it did not have the option to search within abstracts. For SpringerLink, we retrieved 5612 papers based on a full-text search, and subsequently used a custom script to search within the abstracts of these papers. This provided us with a total of 341 papers from all the sources (see Table 2).

Table 2: Paper Selection

Data Source           Initial Search Result   After Filtering by Title/Abstract and Snowballing   Final Selection
IEEE                  69                      30                                                  19
ACM                   48                      11                                                  10
Wiley                 6                       0                                                   0
ScienceDirect         32                      5                                                   3
Engineering Village   101                     3                                                   0
Springer              6*                      3                                                   2
arXiv                 79                      8                                                   5
Snowballing           -                       26                                                  11
Total                 341                     86                                                  50
*abstract filtering from 5612 papers retrieved with full-text search

This search query retrieved 18 of the 21 seed papers. Two papers were missed because the conducted interviews were not mentioned in the abstract (the abstract framed the research as a case study), and one paper was not listed within the libraries searched (only available on TechRxiv). To account for this difference, we performed one round of snowballing, as explained later in this section.
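The "A AND B AND C" query format described above can be illustrated with a minimal Python sketch. The keyword groups are copied from the search query fragments in the text; everything else is an assumption made for illustration, including the function names (`build_query`, `abstract_matches`) and the choice of case-insensitive substring matching. It is not the authors' custom script, and each digital library has its own query syntax that may differ from the generic string built here.

```python
# Keyword groups as listed in the paper's search query fragments.
GROUP_A = ["machine learning", "artificial intelligence", "deep learning",
           "ML component", "data science"]
GROUP_B = ["software engineering", "software systems",
           "production-ready systems", "ML systems",
           "deploying ML", "ML deployment"]
GROUP_C = ["interview", "survey", "questionnaire"]


def or_clause(terms):
    """Join quoted terms with OR and wrap the result in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query():
    """Build a generic boolean query string of the form A AND B AND C."""
    return " AND ".join(or_clause(g) for g in (GROUP_A, GROUP_B, GROUP_C))


def mentions_any(text, terms):
    """Hypothetical matcher: case-insensitive substring search (an assumption)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def abstract_matches(abstract):
    """An abstract qualifies only if it mentions a keyword from every group."""
    return all(mentions_any(abstract, g) for g in (GROUP_A, GROUP_B, GROUP_C))
```

A filter of this kind could be applied to abstracts retrieved through a full-text search, similar in spirit to the SpringerLink step. Substring matching also accepts inflected forms such as "interviews", which may or may not match how a given library interprets the same query.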

Table 3: Inclusion and Exclusion Criteria

Inclusion Criteria
I1: Paper includes software engineering challenges for ML systems
I2: Paper uses interview or survey with industry practitioners (software engineers, data scientists, etc.) to identify the challenges
I3: Paper appears in a refereed publication (including conference proceedings, journal, etc.) or is uploaded to arXiv in a publication format
I4: Paper is written in English

Exclusion Criteria
E1: Paper has a strict ML model view and does not consider the system or product using the model
E2: Paper interviews/surveys only non-technical people (end-users, domain experts, etc.)
E3: Paper focuses on ML for software engineering instead of software engineering for ML systems
E4: Paper falls in the category of gray literature: blog post, technical report, government report, webinar, poster session, presentation, etc.

Selection criteria. The initial search returned many papers that were not directly relevant to our research question. Next, we selected 86 relevant papers by reading the title and abstract, evaluating them against the inclusion and exclusion criteria (see Table 3), which we incrementally refined. Finally, we read the full paper, and once again evaluated each against the inclusion and exclusion criteria, which narrowed our set down to 39 papers. Multiple researchers participated in this process and discussed papers at the boundary.

Most of the papers that were discarded in this round were either literature surveys in the domain of machine learning for software engineering (i.e., using ML techniques to facilitate software engineering tasks; not relevant to this study) or used interviews or surveys to evaluate tools. We also removed papers that have a narrow focus or are entirely model-centric, e.g., interviewing only data scientists about their modeling work (e.g., [23, 35, 46, 80]) or interviewing only non-technical people (e.g., [12, 33, 100, 117]).

Snowballing. To capture relevant papers that did not match our keywords in their abstract, we performed one iteration of backward snowballing [126], which means that we went through the selected papers’ reference lists to find whether we missed any relevant papers. We analyzed 26 additional papers and considered 11 of them as relevant based on the inclusion and exclusion criteria, which included the three papers from the seed set we previously missed.

Final paper set. Overall, our process resulted in a final set of 50 papers. Most of the papers were published recently, since 2019 (see Figure 2). This sudden explosion of interview and survey studies with practitioners in recent years justifies our motivation for this study to aggregate all the findings of these papers. Most of the papers, 30 out of 50, were published in software engineering venues (including five at WAIN/CAIN), 11 papers in HCI venues, two papers in AI Ethics venues, and the seven remaining ones are scattered over other communities. A total of 947 interviews and 3811 survey responses were reported in 43 papers, and the seven remaining papers did not report specific counts of the interviewed or surveyed practitioners.

Figure 2: Year Distribution of the Selected Papers

Of the 50 papers, 31 papers explicitly list research questions or the aim of their research as identification of challenges (or issues, problems, difficulties) in different aspects of building products with ML components. The other papers do not explicitly set a goal of identifying challenges but more broadly study the process of building products with ML components, yet they also report practitioner challenges in their findings.

3.2 Qualitative Meta-Summary Process

As stated earlier, we used the meta-summary research method [24, 95, 103] to synthesize the findings from the collected papers. This method is used to perform quantitative aggregation of qualitative findings, which are necessarily the thematic summaries of the underlying data from different studies. We conduct the following steps to perform the synthesis, as per the guidelines.

Extracting findings. Along with the standard metadata (title, source, venue, year, etc.), we extracted study-specific data regarding research questions, study method, interview and survey participant counts, and, most importantly, the challenges reported within the papers. To maintain consistency in extracting the findings, we considered only challenges that were derived from the interview and survey answers in the papers, not challenges derived from other literature or personal experience of the authors. We extracted challenges related to building software systems with ML components, but excluded those that relate exclusively to the data- and model-related work performed by a data scientist, such as algorithmic problems, notebook coding, and hyper-parameter tuning. We extracted a total of 520 excerpts relating to challenges from the 50 papers. We stored all extracted information from each paper in a spreadsheet for further analysis.

Grouping topically similar findings. We organize the findings at the level of reported challenges that we extracted from the papers. Different papers grouped findings in different ways and using different terminologies; we aimed to find a consistent organizational principle. For identifying similar topics and grouping those together, we needed to understand and compare those reported challenges in their original context. Card sorting is a common technique for grouping similar findings [40, 115], which we used for this paper. Following the standard card sorting method, we created one (virtual) card per reported challenge, and incrementally and iteratively organized those cards into groups of similar challenges. Multiple researchers went through all the cards in synchronous and asynchronous fashion to grasp the different concepts and identify

relevant themes and clusters around the reported challenges. This being a collaborative effort, we did not aim for inter-rater agreement between independent grouping by individual researchers, but instead worked together as a team to build consensus. There were many rounds of card sorting including moving the cards back and forth between different clusters, splitting the cards to handle different dimensions, merging similar clusters, and splitting clusters when we found there was more than one theme, until all involved researchers were satisfied with the clusters and placement of the cards. We developed three layers of clustering: the reported challenges extracted from the papers as the smallest unit, groups of common themes or patterns in the challenges as the second layer, and finally a third (or top) layer grouping the second-layer clusters by development stages or cross-cutting concerns for the ease of reporting results. We performed this card sorting process in an online platform (miro.com), allowing us to manipulate colors, add different tags to the cards, add comments, emojis, and so on. We share the resulting card-sorting board as supplementary documents [76].

Abstracting and formatting findings. For each of the second-layer clusters, we abstracted out the concrete details of the reported challenges and summarized the clusters based on the identified themes of the groups. For this, we once again looked into the cards of each of the clusters individually and attempted to develop broad statements that capture the content of the cards in that cluster, which provide the headings of our results presented in Section 4. We wanted to be concise, but also comprehensive to properly capture the themes in the cards. At the same time, as Sandelowski and Barroso suggested [15], we were careful to preserve the context in which the findings appeared by going back and forth in the original papers when confusion arose, moving cards to other clusters or themes as needed.

Calculating effect sizes. Methods for meta-summaries recommend reporting the frequency of findings in the original sources [103]. Since many of our analyzed papers ask similar broad research questions, we can carefully interpret findings mentioned more frequently as more common, though some papers clearly specialize in specific subareas such as fairness or software architecture [36, 58]. We do not attempt to count frequencies of mentions within the papers (“intensity effect size”) because they are not consistently reported, but just report the percentage of papers reporting on a challenge theme (“frequency effect size”).

3.3 Limitations and Threats to Validity

All research designs come with limitations that threaten validity and credibility of results. As usual, readers should be careful when generalizing findings beyond what is allowed by the methods. Despite best efforts in our selection methods (SLR process, snowballing), we may have missed some relevant papers. In setting clear rules for scope, we had to make some judgment calls by consensus of all researchers for a number of papers, for example, whether to include [2, 6, 11, 35, 108].

As discussed earlier, the meta-summary synthesis method was chosen deliberately for its fit, but comes with its own limitations: it does not analyze original raw data, but only what is reported by other papers. Organizing and categorizing the data required some interpretation of the papers and some judgment calls. The method encourages quantification of effect sizes, but those may not be entirely reliable as the analyzed papers use different methods and sometimes focus on specific subquestions.

It would have been interesting to analyze findings in additional dimensions, for example, whether team members in different roles or projects, or in different application domains, experience different challenges, or whether different challenges surface depending on the research method in the original study (e.g., survey vs. interview, open question vs. closed question). Unfortunately, data in the original studies is frequently not reported consistently and with enough granularity to enable such analyses.

While the meta-summary method can in principle also identify conflicts within the literature, this was not feasible in our study. The analyzed papers typically reported challenges, not the absence or relative importance of certain challenges. Given that different papers often had a different focus, rather than being replications of each other, we cannot conclude that not mentioning a challenge implies that there was no such challenge. Hence, we limited our analysis to aggregating and grouping reported challenges.

4 RESULTS

We report our findings of the meta-summary in this section using the layers derived from the card sorting. The top layer includes development stages (1) Requirements Engineering, (2) Architecture, Design, and Implementation (with a special focus on (2a) Model Development and (2b) Data Engineering), and (3) Quality Assurance, plus (4) Process challenges and (5) Team challenges as cross-cutting concerns. Also, although MLOps, Fairness, and other more specific categories are often used to organize results in the surveyed papers, we eventually settled on minimizing the number of cross-cutting topics. We decided to include operations challenges in the Architecture and Design group, as we consider them primarily a design-for-change issue; and we separate and group various concerns for specific qualities, such as fairness, in the development stages where the concerns arise, such as requirements and quality assurance. Within these top-layer headings, we have our second-layer clusters, which are the abstracted challenges based on our identified themes, reported as the sub-headings in the following sections.

4.1 Requirements Engineering

Requirements engineering is known as an important and challenging stage of any software project, but as a consistent theme, we find that practitioners argue that the incorporation of ML further complicates requirements engineering.

Lack of AI literacy causes unrealistic expectations from customers, managers, and even other team members [6, 22, 37, 44, 51, 55, 61, 67, 77, 78, 85, 99, 109, 118, 119, 122, 127] (17/50). Across many studies, many practitioners report that customers frequently have unrealistic expectations of ML capabilities in a product, like demanding a complete lack of false positives or expecting very high accuracy that is infeasible with provided resources (e.g., data, funding). Commonly, practitioners similarly attribute customers' unwillingness to pay for the continuous improvement of the model to a lack of AI literacy: customers have a static view of model development [44, 77] and only consider paying for coding, as they do

not understand the need for experimental analysis [61] and even difficulty convincing engineering teams to invest in collecting high-quality data [51]. The issue of unrealistic requirements does not only come from customers, but also from team members within the company itself: Data scientists find it hard to explain the capabilities of ML to managers, requirements engineers, and even designers [22, 37, 77, 78, 85, 122]. According to practitioners, a lack of AI literacy in team members manifests particularly in defining and scoping the project: Stakeholders find it hard to understand the suitability of applying ML itself [55, 127], scoping and deciding the functional and non-functional requirements [61, 119], interpreting the model outcomes [78, 99, 109], and the infrastructure needs (e.g., appropriate data, monitoring infrastructure, retraining requirements) when building products [78, 119, 127]. Many practitioners also report that ML-specific system-level qualities like fairness and explainability are frequently ignored during requirements elicitation, as the stakeholders are not aware of them [10, 77, 94, 119].

Vagueness in ML problem specifications makes it difficult to map business goals to performance metrics [29, 36, 55, 57, 60, 61, 65, 77, 78, 85, 94, 99, 109, 114, 118, 119, 122] (17/50). Practitioners across many studies mention the challenge of formulating the specific software and ML problem in a way that satisfies business goals and objectives. ML practitioners find it difficult to map the high-level business goals to the low-level requirements for a model. While customers are broadly interested in improving the business, practitioners often find it difficult to quantify the contribution of the ML model and its return on investment. Also, Responsible AI initiatives find it difficult to quantify their contributions to the business, for example, measuring the value added by improving fairness and explainability, or to deliberate about tradeoffs between conflicting fairness and business objectives [10, 36, 85, 94]. Even with some notion of the responsible AI requirements in hand, practitioners find the requirements vague and not concrete enough to actually implement (e.g., unclear subpopulations and protected characteristics to balance discrimination) [94, 119]. On the other hand, practitioners also frequently report that many projects are exploratory without clear upfront business goals, thus, starting off the project without clear requirements is pretty common, albeit often problematic [29, 61, 109, 122].

Regulatory constraints specific to data and ML introduce additional requirements that restrict development [10, 29, 37, 81, 108, 109, 119] (7/50). Practitioners in multiple studies expressed how regulatory restrictions constrain ML development and require audits and involvement from legal teams. Privacy laws such as GDPR impose additional requirements on ML practitioners such as ensuring the collection of individual consent [37, 119] and providing the nontrivial ability to remove individuals from training data after they revoke consent. Similarly, practitioners in regulated domains report a need for explainability and transparency that prevents them from using deep learning and post-hoc explainability techniques [10, 29, 108].

4.2 Architecture, Design, and Implementation
We find that many ML practitioners struggle with designing the architecture of products with ML components.

Transitioning from a model-centric to a pipeline-driven or system-wide view is considered important for moving into production, but a difficult paradigm shift for many teams [1, 42, 55, 58, 61, 65, 67, 73, 75, 109, 127] (11/50). Practitioners frequently report challenges in migrating from exploratory model code, often in a notebook, to deployable production-quality code in automated ML pipelines [61, 127]. Building an end-to-end ML pipeline is considered to be a challenge due to the difficulties of integrating various ML and non-ML components in a system operating within an environment [55, 58, 109], the overwhelming complexity of integrating many tools and frameworks [65, 67, 73], the need for engineering skills beyond the comfort zone of some data scientists [73], and so on. While practitioners emphasized the importance of pipeline automation for many projects where frequent re-training and deployment of models are needed, they also consider it time-consuming, labor-intensive, error-prone, and not well supported by current tools [1, 42, 65, 75, 109].

ML adds substantial design complexity with many, often implicit, data and tooling dependencies, and entanglements due to a lack of modularity [1, 4, 22, 58, 60, 65, 109, 110, 114, 122, 127] (11/50). Many practitioners report challenges from additional complexity when designing systems incorporating machine learning, and the traditional software architecture and design practices no longer fit [22, 58, 60, 127]. ML changes the assumptions in traditional software systems such as encapsulation and modularity and causes entanglements of data, source code, and ML models, which can lead to “pipeline jungles” and “change anything changes everything” integrations that are hard to maintain [1, 60, 65, 109, 110, 122]. Unlike traditional systems, ML requires the incorporation of data pipelines that need to handle a high volume of data and often data architectures of distributed nature, and practitioners also need to understand and design for the data flow in the entire system [114, 122, 127]. Practitioners also point out that complexities arise due to a large amount of surrounding “glue code” to support the ML models [4, 109], and complicated dependency and configuration management [4, 122].

Difficulty in scaling model training and deployment on diverse hardware [29, 42, 61, 65, 67, 73, 99, 107, 109, 110] (10/50). Practitioners commonly report difficulty dealing with cloud and computational resources, even with the recent emergence of MLOps. Practitioners find the technologies to be difficult to integrate into the production environment and require substantial time, effort, and money [29, 61, 65, 67, 73, 99]. Among the common problems of such deployments, practitioners brought up the mismatch of development and production environments [61, 110], difficulties in building a scalable pipeline [29, 42, 65, 107], adhering to serving requirements such as latency and throughput [65, 109], as well as undocumented tribal knowledge within the team, hampering future deployments [110]. Despite the emerging MLOps tooling, practitioners still raise many questions about how to utilize those resources and sometimes express being overwhelmed by the sudden flood of tools and frameworks to choose from [2, 51].

While monitorability and planning for change are often considered important, they are mostly considered only late after launching [1, 4, 10, 29, 42, 54, 55, 57, 58, 73, 77, 98, 109, 110,
127] (15/50). Practitioners report struggling with monitoring their deployed models for detecting drift, bias, or even failures. While many highlight monitoring as very important, planning for monitoring is rare [77]. Even for companies that adopt a monitoring infrastructure, practitioners report struggling with ad-hoc monitoring practices of logging, creating alerts, or doing everything manually [58, 110]. Similar concerns were raised about model evolution, where practitioners acknowledge it to be important, but fall behind in planning for change in their architectural design [1, 4, 29, 42, 109]. Practitioners mentioned that ML-centric software goes through more frequent revisions than traditional software (e.g., due to model retraining, or even model replacement for data change, hyperparameter tuning, or change of domain, etc.), and the changes tend to be nontrivial and nonlocal, raising the need for an architecture that supports such changes. As a result, we find practitioners soliciting the need for adapted architectural patterns to design for such post-launch activities for products with ML components with monitorability as a significant quality attribute [58, 127].

4.3 Model Development
Although we explicitly exclude challenges relating only to the work and tools of data scientists when building models, we find reports of engineering challenges during model development, which we report in this section.

Model development benefits from engineering infrastructure and tooling but provided infrastructure and technical support are limited in many teams [2, 4, 7, 11, 28, 29, 42, 51, 54, 55, 57, 61, 75, 78, 81, 92, 118, 122, 130] (19/50). ML practitioners share tooling needs for different tasks including data analysis and visualization, feature engineering, model development, integration, evaluation, deployment, monitoring, reproducibility, and support for specific qualities like privacy, security, and explainability. They report a lack of adequate tools in these areas and find the existing tools and techniques to be (a) unavailable in their environment [7, 29], (b) not automated enough [92], (c) requiring too much expert knowledge to be used [57, 81, 92, 130], (d) limited to specific tasks and types of data sets [7, 92], or (e) not suitable for their own problems [7, 11, 51]. This raises demand for custom tools but many teams lack the resources and engineering support.

Code quality is not standardized in model development tools, leading to conflicts about code quality [77, 110, 122] (3/50). Practitioners report that code quality and review processes are usually not standardized and are inconsistent across development and production environments. The expectations around code quality and versioning also differ widely in teams and create conflicts within teams, especially among team members with different roles and backgrounds. Practitioners commonly complain about low code quality in data science code, especially in notebooks.

4.4 Data Engineering
In developing machine learning models, data plays an important role. While we exclude challenges related exclusively to data-related work within ML pipelines, we report engineering challenges related to handling data within the system.

Data quality is considered important, but difficult for practitioners and not well supported by tools [1, 4, 28, 29, 37, 51, 60, 61, 65, 67, 73, 77, 92, 99, 109, 110, 119] (17/50). ML practitioners commonly report struggling with validating and improving data quality. Even with significant research efforts in building tools for data labeling, cleaning, visualization, and management, data work is still reported as a problematic area for practitioners. Practitioners reported that they need to invest significant effort and time in data pre-processing, cleaning, and assembly [1, 28, 37, 51, 61, 65, 77, 92]. Practitioners also mention their pain points in handling data errors and validating data quality, where better tool support is desired [51, 60, 73, 99, 107, 109, 110, 119]. Although it is common to associate these data issues within the model building pipeline, practitioners feel the need for cooperation from other parts of the organization (e.g., requirements engineers need to identify and specify requirements regarding data collection, formats, and the ranges of data and domain experts need to help to understand the structure and semantics of the data), which they mention is lacking [37, 77, 92, 99, 119].

Internal data security and privacy policies restrict data access and use [4, 29, 47, 51, 55, 61, 65, 67, 77, 99] (10/50). Data access is often restricted due to security and privacy policies within organizations, beyond possible regulatory restrictions, e.g., policies ensuring that customer data is not shared outside the company. Due to restrictions on the flow of data, ML practitioners need to deal with additional complexities in the data pipeline, as only a restricted number of team members can analyze the data and as they have limited access to the right data and no access to data locally for model optimization or model debugging due to data movement constraints [29, 47, 61, 65].

Although training-serving skew is common, many teams lack support for its required detection and monitoring [4, 29, 57, 65, 77, 110, 127] (7/50). The mismatch between training data and production data is a common problem in products with ML components, where models work well on test data but generalize poorly to real-world data in production. Even if training the model with a representative dataset initially, the production environment often encounters drift toward data distributions that are less well supported by the model. Practitioners explain that monitoring models in production for staleness is an important activity that supports detecting the degradation of model performance and retraining it with new data if needed. However, they also find it challenging to set up the monitoring infrastructure and report a lack of tool support.

Data versioning and provenance tracking are often seen as elusive, with not enough tool support [1, 37, 42, 55, 67, 107, 118] (7/50). While software engineers routinely adopt mature version control systems for code, practitioners report challenges in versioning data, typically due to the large volumes of data involved. Practitioners mention that they need to have traceability and transparency to answer questions like “Which data was this model trained on?” or “Which code or data change made our accuracy deteriorate?” [42], but it’s not possible for them to keep track of data and models across the life cycle without technological support [1, 42, 128]. This is a bigger problem for practitioners in small companies as they do
not want to invest in storage capacity to version their models and datasets, though they understand the importance [37].

4.5 Quality Assurance
One of the biggest changes that the incorporation of ML models has brought into traditional software development is challenging the traditional notion of correctness, where models are evaluated for accuracy or fit rather than whether they fully meet a specification. Understandably this impacts the conventional processes and practices associated with testing and quality assurance.

Testing and debugging ML models is difficult due to lack of specifications [1, 4, 28, 29, 44, 57, 60, 61, 65, 75, 77, 92, 99, 107, 109, 110, 118, 122, 130] (19/50). Practitioners find testing and debugging of ML models challenging. In particular, they ubiquitously report difficulty establishing quality assurance criteria and metrics, given that no model is expected to be always correct, but it is difficult to define what amount and what kind of mistakes are acceptable for a model [4, 28, 29, 44, 60, 61, 65, 77, 122, 130]. In particular, practitioners find it difficult to define accuracy thresholds for evaluations. Furthermore, practitioners report finding it difficult to select adequate test data, specifically curating test data of sufficient quality and quantity that is representative of the production environment [57, 75, 92, 118, 122]. Curating test data for ML testing is also considered costly and labor-intensive, and practitioners desire methods and tools from the research community for automated test input generation to reduce this cost [44, 60, 122]. Practitioners consider it a challenge to get labels for test data and evaluate test quality (e.g., in terms of coverage) due to the difficulty of defining the valid input space and the test oracle problem [28, 60, 122]. Practitioners also mention the silent failing of models (i.e., models give wrong answers rather than crashing), the long tail of corner cases, and the “invisible errors” that are handled on an ad-hoc basis without a systematic framework or a standard approach [28, 110, 122]. Additionally, practitioners raise challenges regarding evaluating model robustness, on one hand, suffering from the lack of a concrete methodology [28, 92], and on the other hand, having various metrics but no consensus on which metric to use [60].

Testing of model interactions, pipelines, and the entire system is considered challenging and often neglected [28, 55, 60, 65, 75, 77, 92, 130] (8/50). Testing literature often focuses on ML models and data quality, but less on how models are integrated into the system, and even less on the infrastructure to produce the models. Practitioners find sole unit testing of individual models insufficient and ineffective, due to the entanglement of models and different ML components, as well as the difficulty of explaining why an error occurred due to the low interpretability of individual models [60, 122, 130]. The lack of pipeline and system testing beyond the model is also considered a problematic area [28, 55, 75, 77, 92, 130]: While practitioners tend to focus more on the data- and model-related issues, the error handling around the model is found to be insufficient in previous studies [28, 130], leading to system failures even where the model gives the correct results [75]. Practitioners also report having no systematic evaluation strategy nor automated tools and techniques for pipeline and system-level testing [60, 77].

Testing and monitoring models in production are considered important but difficult, and often not done [60, 77, 92, 109, 110] (5/50). Many practitioners recognize the need to test in production (online testing), since offline test data for models may not be representative, especially as data distributions drift. However, practitioners consider online testing complex as it is not trivial for them to design online metrics that do not only depend on the model but also on the external environment, user interactions after deployment, and the context of the product overall [60, 110]. Practitioners also find online testing very time-consuming, as it requires longer observation periods to determine meaningful results [109, 110]. Practitioners also pointed out that there is no surefire strategy to precisely detect when the model is underperforming in online testing [110].

There are no standard processes or guidelines on how to assess system qualities such as fairness, security, and safety in practice [10, 11, 36, 37, 42, 54, 98, 108, 118] (9/50). Research often discusses how machine learning influences fairness, robustness, security, safety, and other qualities, but practitioners report that they find evaluating these challenging. While practitioners consider these qualities important [42, 54], they often report having no effective methodology or concrete guidelines for evaluating them [11, 36, 37, 54, 98, 108, 118]. Even regarding fairness, which has received a lot of research attention lately, practitioners report finding it hard to apply auditing and de-biasing methods due to not having a proper process in place [36, 37]. Some practitioners report waiting for complaints from customers rather than being proactive when it comes to fairness [36], or even blindly expecting the algorithms to inherently provide qualities like security against attacks [54].

4.6 Process
Building software products with ML components involves many moving parts that need to be planned and integrated. Fitting all of these together in a cohesive process can be challenging.

Development of products with ML component(s) is often ad-hoc, lacking well-defined processes [4, 6, 29, 44, 51, 55, 61, 75, 114, 118, 122] (11/50). Many practitioners report that they struggle with finding a good process for developing ML components and products around them [29, 44, 61, 114, 122], often coming up with ad-hoc strategies and experiencing a lack of good engineering practices [44, 75]. ML practitioners have explored using the traditional software development life cycles and found those to be a poor fit for exploratory development work. Even with a flexible agile methodology, practitioners identified that small iterations of sprints cannot fit the initial feasibility study that ML requires, with the timeline being too fixed and too short [6, 29, 61]. Also, they find it hard to set expectations for each sprint, as the project objectives may remain unclear at the beginning and need to be revisited after the initial investigation [6, 55].

The uncertainty in ML development makes it hard to plan and estimate effort and time [4, 6, 28, 44, 61, 118, 122] (7/50). Machine learning work tends to be iterative and exploratory and as such uncertain, where practitioners cannot estimate upfront how long it may take to reach a model with a certain level of accuracy
or whether that is even possible at all; instead, they commonly progress with many experiments with different algorithms and datasets [4, 44]. Practitioners, therefore, report having difficulties setting expectations and (intermediate) deadlines for a project [4, 6, 28, 61, 122] and providing any upfront estimates about effort and cost [4, 61, 118].

Practitioners find documentation more important than ever in ML, but find it more challenging than traditional software documentation [18, 29, 53, 57, 61, 77, 88, 107, 128] (9/50). Many practitioners point out various process and coordination challenges rooted in poor documentation. Some practitioners emphasize that documentation is even more important when it comes to ML components, as human decisions are inscribed in different stages of ML pipelines and cannot be retrieved from code or data without documentation [29, 128]. The final model code is the outcome of many different explorations and experimentations that include multiple rounds of data processing, feature engineering, hyperparameter tuning, and other activities. Many problem-specific decisions have been made in those stages that cannot be understood from the resulting model or pipeline code. Some argue that not recording these decisions in documentation causes them to slowly become invisible, severely impacting future re-analysis and revisions, or even model integration and deployment [29, 57, 128]. Others emphasize that, along with model documentation, data documentation is also imperative to share hidden information inside the data and create a shared data understanding, yet mostly missing in organizations [77, 107]. Others report that, with the incorporation of ML, the documentation process becomes more complicated as ML practitioners find it difficult to present complex model information in an accessible way to all levels of stakeholders [18, 53, 88]. It is also non-trivial for practitioners to decide on the right amount of details to include in the documentation. They place the blame mostly on the lack of organizational incentives, resources, and unclear and vague guidelines for ML documentation [18, 88].

4.7 Organization and Teams
Along with the challenges faced in different development stages, practitioners also mention challenges they suffer from the organizational and teamwork perspective while building products with ML components.

Building products with ML components requires diverse skill sets, which is often missing in development teams [2, 4, 6, 47, 67, 75, 78, 109, 118, 122, 123, 127] (12/50). Incorporation of ML in a product does not merely mean adding just another component to the system; it requires people from multiple disciplines to get involved to support different aspects of this component. The team requires many diverse skill sets to develop, deploy, and integrate the model into the complete product, including hardware expertise, engineering skills, knowledge of math and statistics, business understanding, UX design ability, operations, and domain expertise. The lack of this varied expertise in the team is commonly mentioned to be a challenge by practitioners [2, 6, 47, 109, 118, 122]. Also, since communication is often hindered by a lack of AI literacy or common terminology [1, 4, 29, 75, 78, 127], as discussed under the next challenge, cross-disciplinary knowledge seems to be important for team members to interact and understand each other’s vocabulary; however, practitioner experiences seem to indicate that such cross-disciplinary education is not broadly available yet [28, 110].

Many teams are not well prepared for the extensive interdisciplinary collaboration and communication needed in ML products [4, 7, 17, 22, 67, 75, 77, 78, 107, 122, 127] (11/50). For building a product with ML components, team members need to collaborate with people from different disciplines as mentioned above, such as business leaders, engineers, designers, and various other departments inside the company, and even outside the organization [22, 78, 107, 122, 127]. Practitioners report that they often struggle to collaborate effectively in such interdisciplinary teams, because team members often do not understand the concerns of other members from other backgrounds, like data scientists lacking knowledge of engineering practices, testing frameworks, continuous integration and delivery, and such [1, 4, 29, 75]; software engineers lacking AI literacy [29, 75]; and data scientists and software engineers not understanding or interacting with members in business roles [78, 127]. Practitioners report struggling with cultural differences, differences in expectations, and conflicting priorities [4, 7, 67, 78], and they often do not agree on assigned responsibilities [61, 77]. These multidisciplinary teams also suffer from miscommunications arising from inconsistency in their technical terminologies [17, 75, 77]. Siloing of teams by specialization and lack of communication across such silos are also observed in many production settings, fostering integration problems even further [7, 77].

ML development can be costly and resource limits can substantially curb/limit efforts [2, 4, 47, 67, 78, 110] (6/50). Practitioners report that organizations involved in the development of products with ML components often suffer from resource and budget limitations. Hardware, infrastructure, cloud storage, GPUs, etc., are expensive, and especially for small companies, it is difficult to justify such expenditures based on the expected return on investment from the model.

Lack of organizational incentives, resources, and education hampers achieving all system-level qualities [11, 47, 54, 78, 81, 94, 98, 108] (8/50). Practitioners mention that organizational incentives also have an impact on achieving certain qualities of products with ML components. A quality that practitioners reported frequently as particularly challenging due to the lack of organizational incentives is fairness [36, 94]. A lack of awareness of potential problems, including potential consequences from biased models, seems to be the main reason for lacking responsible AI practices, along with the lack of organizational incentives and structures, as well as priority conflicts. Safety, security, and privacy also seem to suffer from similar issues of awareness, education, and resource constraints, and are often disregarded due to tradeoffs with development cost [11, 47, 54, 81, 98].

5 DISCUSSION AND CONCLUSIONS
With this meta-summary, we aggregate and summarize the challenges reported by industry practitioners who build software products with ML components. We find that practitioners report challenges in all stages of the development process, from the initial
requirements specification stage to quality assurance of the deployed product. They report a broad range of issues from lacking process, organizational structure, and team collaboration strategies, to lacking tool support for data, model building, deployment, and monitoring.

Old, new, and harder challenges. Arguably, many reported challenges are not new to software engineers, and likely many software engineers may have reported similar challenges in non-ML projects. It seems though that the introduction of machine learning exacerbates some universal challenges and introduces new ones. For example, software engineering literature is well aware that requirements engineering is challenging, with customers having unrealistic expectations and developers directly jumping into coding without understanding requirements first. While our study does not support direct comparisons, it seems that these problems haunt ML practitioners more, given how ML inspires hopes for amazing capabilities, but in a way that may be difficult to understand and specify without substantial ML expertise. Similarly, the software-engineering literature is full of nuanced discussions of development life cycles and competing process models, but ML practitioners struggle to adopt even the most flexible agile-inspired processes for their projects with the uncertainty that ML brings. Also, team collaboration and organizational challenges are well known in traditional software engineering, but those seem to become even more central with the additional complexity and inclusion of more people with different backgrounds, cultures, and priorities. Other challenges seem new, such as the data- and model-related challenges associated with ML components, and several of the reported challenges regarding architecture and quality assurance stemming from the different nature of reasoning in machine learning.

Toward better engineering of ML products. A finding from our study is that there is much more consensus on what the challenges are, than how to overcome them. Some challenges could be addressed with new tooling or new practices; for others it may be possible to simply adopt existing good engineering practices; and yet others may just be intrinsically hard problems. While we cannot provide a rigorous summary or analysis, we close by reflecting on possible directions.

• Requirements Engineering. For the challenges of unrealistic requirements, several studies mentioned that practitioners found it useful to conduct training sessions with clients and other team members on AI literacy, before starting the ML projects [77, 95, 106, 119]. But again, while many practitioners mention suffering from unclear model requirements, we still do not seem to have a good solution to that, and additional research on how to elicit and describe requirements for models may be needed. Another area for future research would be to better understand and prepare for regulatory constraints and provide evidence of compliance.
• Architecture, Design, and Implementation. Machine learning seems to provide significant challenges to architectural design of software systems, but arguably many challenges are similar to other large and complex and distributed software systems. While there are nascent discussions on organizing architecture knowledge as patterns [56, 58, 109, 124], it does not seem like the field has reached saturation. This seems to be a field though, where industry-oriented research (similar to the data architecture of Facebook [31]) has more access to the complicated real-world scenarios where architectural planning becomes important than what academics can typically access. From the challenges raised by practitioners, it is apparent that along with the need for design practices, patterns, and mechanisms to handle system and model-level considerations (e.g., dependency management, scalability, monitorability), we also need to support teams in shifting from model-centric work to system thinking, possibly through tailored education for ML practitioners.
• Model Development and Data Engineering. Consistent across many papers, we find that ML practitioners desire more engineering support, such as better infrastructure and tools for model and data work. Data scientists also indicate a need for more cooperation from other team members in terms of support for data, which necessitates better collaboration strategies and data education for the entire team. On the other hand, a few practitioners highlighted the necessity of standardization of ML code quality, which may be a low-hanging fruit technically, but may require a change to the culture and practices in many projects.
• Quality Assurance. Quality assurance for machine learning, especially for models, is a very active area of research, with proposals for many different testing strategies to validate different model characteristics covered in multiple literature surveys [38, 96, 97, 129]. While we found that a lot of practitioners mentioned concerns about specifying model adequacy goals, few practitioners showed concerns about system testing, monitoring in production, and testing for fairness, security, and safety. We are surprised to not see more concerns about system-level quality beyond the model, which might indicate either that practitioners do not consider these testing areas as challenging, or that most organizations (especially outside of big tech) are not yet mature enough to even start thinking about such testing needs. Monitoring though is recognized as an important challenge, with many available tools but common adoption problems that may be worth investigating further.
• Process. While there is research on the development processes for ML models [69, 116], there seems to be little work on addressing process challenges that arise when integrating ML and non-ML work in production projects that are commonly mentioned by practitioners. We believe that this is an area with plenty of research opportunities to evaluate what processes and practices work well in different contexts.
• Organization and Teams. While there is lots of research on technical issues, practitioners often see organizational and team issues (such as a lack of AI literacy in teams, unclear responsibility boundaries, and a lack of team synchronization) as some of the most difficult challenges to overcome. Education and better collaboration strategies seem to be the factors that might have a positive impact on mitigating many of the challenges that the practitioners mentioned.

Overall, we believe that a lot of progress can be made with better education and better adoption of good software engineering practices. There are plenty of research opportunities to adapt existing
A Meta-Summary of Challenges in Building Products with ML Components – Collecting Experiences from 4758+ Practitioners CAIN ’23, May 15–16, 2023, Melbourne, Australia

practices, support them with tooling, and create new interventions altogether. We hope that the collection of challenges, which can be traced to the original studies where they were raised by practitioners, will be helpful in selecting and prioritizing research and education in our community.
Acknowledgments. Kästner’s, Nahar’s, and Zhang’s work was supported in part by the National Science Foundation (#2131477), Zhou’s work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC, RGPIN2021-03538), and Lewis’ work was funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center (DM23-0228).

REFERENCES
[1] Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T. 2019. Software Engineering for Machine Learning: A Case Study. Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 291–300.
[2] Andrade, H., Lwakatare, L.E., Crnkovic, I. and Bosch, J. 2019. Software Challenges in Heterogeneous Computing: A Multiple Case Study in Industry. Proceedings of the 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 148–155.
[3] Arnold, M., Piorkowski, D., Reimer, D., Richards, J., Tsay, J., Varshney, K.R., Bellamy, R.K.E., Hind, M., Houde, S., Mehta, S., Mojsilovic, A., Nair, R., Ramamurthy, K.N. and Olteanu, A. 2019. FactSheets: Increasing trust in AI services through supplier’s declarations of conformity. IBM journal of research and development. 63, 4/5, 6:1–6:13.
[4] Arpteg, A., Brinne, B., Crnkovic-Friis, L. and Bosch, J. 2018. Software Engineering Challenges of Deep Learning. Proceedings of the 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 50–59.
[5] Ashmore, R., Calinescu, R. and Paterson, C. 2022. Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges. ACM Computing Surveys. 54, 5, 1–39.
[6] Baijens, J., Helms, R. and Iren, D. 2020. Applying Scrum in Data Science Projects. Proceedings of the 22nd Conference on Business Informatics (CBI), 30–38.
[7] Bäuerle, A., Cabrera, Á.A., Hohman, F., Maher, M., Koski, D., Suau, X., Barik, T. and Moritz, D. 2022. Symphony: Composing Interactive Interfaces for Machine Learning. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1–14.
[8] Begel, A. and Zimmermann, T. 2014. Analyze this! 145 questions for data scientists in software engineering. Proceedings of the 36th International Conference on Software Engineering, 12–23.
[9] Bernardi, L., Mavridis, T. and Estevez, P. 2019. 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’19, 1743–1751.
[10] Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J.M.F. and Eckersley, P. 2020. Explainable machine learning in deployment. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 648–657.
[11] Boenisch, F., Battis, V., Buchmann, N. and Poikela, M. 2021. “I Never Thought About Securing My Machine Learning Systems”: A Study of Security and Privacy Awareness of Machine Learning Practitioners. Proceedings of Mensch und Computer 2021, 520–546.
[12] Borch, C. 2022. Machine learning, knowledge risk, and principal-agent problems in automated trading. Technology in society. 68, 101852.
[13] Borg, M., Englund, C., Wnuk, K., Duran, B., Levandowski, C., Gao, S., Tan, Y., Kaijser, H., Lönn, H. and Törnqvist, J. 2020. Safely Entering the Deep: A Review of Verification and Validation for Machine Learning and a Challenge Elicitation in the Automotive Industry. Journal of Automotive Software Engineering. 1, 1, 1–19.
[14] Boyd, K.L. 2021. Datasheets for Datasets help ML Engineers Notice and Understand Ethical Issues in Training Data. Proceedings of the ACM on Human-Computer Interaction. 5, CSCW2, 1–27.
[15] Braiek, H.B. and Khomh, F. 2020. On testing machine learning programs. The Journal of systems and software. 164, 110542.
[16] Breck, E., Cai, S., Nielsen, E., Salib, M. and Sculley, D. 2017. The ML test score: A rubric for ML production readiness and technical debt reduction. Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), 1123–1132.
[17] Brennen, A. 2020. What Do People Really Want When They Say They Want “Explainable AI?” We Asked 60 Stakeholders. Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 1–7.
[18] Chang, J. and Custis, C. 2022. Understanding Implementation Challenges in Machine Learning Documentation. Equity and Access in Algorithms, Mechanisms, and Optimization, 1–8.
[19] Chattopadhyay, S., Prasad, I., Henley, A.Z., Sarma, A. and Barik, T. 2020. What’s wrong with computational notebooks? Pain points, needs, and design opportunities. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–12.
[20] Chotisarn, N., Merino, L., Zheng, X., Lonapalawong, S., Zhang, T., Xu, M. and Chen, W. 2020. A systematic literature review of modern software visualization. Journal of visualization / the Visualization Society of Japan. 23, 4, 539–558.
[21] Dilhara, M., Ketkar, A. and Dig, D. 2021. Understanding Software-2.0: A Study of Machine Learning Library Usage and Evolution. ACM Transactions on Software Engineering and Methodology. 30, 4, 1–42.
[22] Dove, G., Halskov, K., Forlizzi, J. and Zimmerman, J. 2017. UX Design Innovation: Challenges for Working with Machine Learning as a Design Material. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 278–288.
[23] Epperson, W., Wang, A.Y., DeLine, R. and Drucker, S.M. 2022. Strategies for Reuse and Sharing among Data Scientists in Software Teams. Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, 243–252.
[24] Faan, M.S.P. and Aprn, J.B.P. 2006. Handbook for Synthesizing Qualitative Research. Springer Publishing Company.
[25] Felizardo, K.R., Mendes, E., Kalinowski, M., Souza, É.F. and Vijaykumar, N.L. 2016. Using Forward Snowballing to update Systematic Reviews in Software Engineering. Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 1–6.
[26] Follow, R. Bridging the Gap Between Data Science & Engineer: Building High-Performance Teams.
[27] Giray, G. 2021. A Software Engineering Perspective on Engineering Machine Learning Systems: State of the Art and Challenges. Journal of Systems and Software. 180, 111031.
[28] Golendukhina, V., Lenarduzzi, V. and Felderer, M. 2022. What is software quality for AI engineers? Towards a thinning of the fog. Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, 1–9.
[29] Haakman, M., Cruz, L., Huijgens, H. and van Deursen, A. 2021. AI Lifecycle Models Need To Be Revised. An Exploratory Study in Fintech. Empirical Software Engineering. 26, 5, 1–29.
[30] Harris, J. 2020. Beyond the jupyter notebook: how to build data science products. Towards Data Science.
[31] Hazelwood, K. et al. 2018. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) (Feb. 2018), 620–629.
[32] Head, A., Hohman, F., Barik, T., Drucker, S.M. and DeLine, R. 2019. Managing messes in computational notebooks. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI ’19, 1–12.
[33] Henry, K.E., Kornfield, R., Sridharan, A., Linton, R.C., Groh, C., Wang, T., Wu, A., Mutlu, B. and Saria, S. 2022. Human-machine teaming is key to AI adoption: clinicians’ experiences with a deployed machine learning system. NPJ digital medicine. 5, 1, 1–6.
[34] Hermann, J. and Del Balso, M. 2017. Meet Michelangelo: Uber’s machine learning platform.
[35] Hill, C., Bellamy, R., Erickson, T. and Burnett, M. 2016. Trials and tribulations of developers of intelligent systems: A field study. Proceedings of the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), 162–170.
[36] Holstein, K., Wortman Vaughan, J., Daumé, H., Dudik, M. and Wallach, H. 2019. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–16.
[37] Hopkins, A. and Booth, S. 2021. Machine Learning Practices Outside Big Tech: How Resource Constraints Challenge Responsible Development. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 134–145.
[38] Huang, X., Kroening, D., Ruan, W., Sharp, J., Sun, Y., Thamo, E., Wu, M. and Yi, X. 2020. A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Computer Science Review. 37, 100270.
[39] Huang, X., Zhang, H., Zhou, X., Babar, M.A. and Yang, S. 2018. Synthesizing qualitative research in software engineering: a critical review. Proceedings of the 40th International Conference on Software Engineering, 1207–1218.
[40] Hudson, W. 2013. Card Sorting. The Encyclopedia of Human-Computer Interaction, 2nd Ed. The Interaction Design Foundation.
[41] Hulten, G. 2018. Building Intelligent Systems: A Guide to Machine Learning Engineering. Apress.
[42] Hummer, W., Muthusamy, V., Rausch, T., Dube, P., El Maghraoui, K., Murthi, A. and Oum, P. 2019. ModelOps: Cloud-Based Lifecycle Management for Reliable and Trusted AI. Proceedings of the 2019 IEEE International Conference on Cloud Engineering (IC2E), 113–120.
[43] Hynes, N., Sculley, D. and Terry, M. 2017. The data linter: Lightweight, automated sanity checking for ml data sets. NIPS MLSys Workshop. 1, 5.
[44] Ishikawa, F. and Yoshioka, N. 2019. How do engineers perceive difficulties in engineering of machine-learning systems? - questionnaire survey. Proceedings of the 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP), 2–9.
[45] Jain, R. and Suman, U. 2015. A Systematic Literature Review on Global Software Development Life Cycle. SIGSOFT Softw. Eng. Notes. 40, 2, 1–14.
[46] Jentzsch, S. and Hochgeschwender, N. 2021. A qualitative study of Machine Learning practices and engineering challenges in Earth Observation. it - Information Technology. 63, 4, 235–247.
[47] John, M.M., Olsson, H.H. and Bosch, J. 2020. AI Deployment Architecture: Multi-Case Study for Key Factor Identification. Proceedings of the 27th Asia-Pacific Software Engineering Conference (APSEC), 395–404.
[48] Kanagal, B. and Tata, S. 2018. Recommendations for All: Solving Thousands of Recommendation Problems Daily. Proceedings of the 34th International Conference on Data Engineering (ICDE), 1404–1413.
[49] Kästner, C. 2022. Machine Learning in Production: From Models to Products.
[50] Keele, S. 2007. Guidelines for performing systematic literature reviews in software engineering. Technical Rep., Ver. 2.3 EBSE Tech. Report. EBSE.
[51] Kim, M., Zimmermann, T., DeLine, R. and Begel, A. 2018. Data Scientists in Software Teams: State of the Art and Challenges. IEEE Transactions on Software Engineering. 44, 11, 1024–1038.
[52] Kim, M., Zimmermann, T., DeLine, R. and Begel, A. 2016. The emerging role of data scientists on software development teams. Proceedings of the 38th International Conference on Software Engineering, 96–107.
[53] Königstorfer, F. and Thalmann, S. 2022. AI Documentation: A path to accountability. Journal of Responsible Technology. 11, 100043.
[54] Kumar, R.S.S., Nystrom, M., Lambert, J., Marshall, A., Goertzel, M., Comissoneru, A., Swann, M. and Xia, S. 2020. Adversarial Machine Learning - Industry Perspectives. Proceedings of the 2020 IEEE Security and Privacy Workshops (SPW), 69–75.
[55] Laato, S., Birkstedt, T., Mäantymäki, M., Minkkinen, M. and Mikkonen, T. 2022. AI governance in the system development life cycle: insights on responsible machine learning engineering. Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, 113–123.
[56] Lakshmanan, V., Robinson, S. and Munn, M. 2020. Machine Learning Design Patterns. O’Reilly Media, Inc.
[57] Lewis, G.A., Bellomo, S. and Ozkaya, I. 2021. Characterizing and Detecting Mismatch in Machine-Learning-Enabled Systems. Proceedings of the IEEE/ACM 1st Workshop on AI Engineering-Software Engineering for AI (WAIN), 133–140.
[58] Lewis, G.A., Ozkaya, I. and Xu, X. 2021. Software Architecture Challenges for ML Systems. Proceedings of the 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), 634–638.
[59] Lin, J. and Kolcz, A. 2012. Large-scale machine learning at twitter. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 793–804.
[60] Li, S., Guo, J., Lou, J.-G., Fan, M., Liu, T. and Zhang, D. 2022. Testing machine learning systems in industry: an empirical study. Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, 263–272.
[61] Liu, H., Eksmo, S., Risberg, J. and Hebig, R. 2020. Emerging and Changing Tasks in the Development Process for Machine Learning Systems. Proceedings of the International Conference on Software and System Processes, 125–134.
[62] Liu, J., Boukhelifa, N. and Eagan, J.R. 2020. Understanding the Role of Alternatives in Data Analysis Practices. IEEE transactions on visualization and computer graphics. 26, 1, 66–76.
[63] Liu, Q., Li, P., Zhao, W., Cai, W., Yu, S. and Leung, V.C.M. 2018. A Survey on Security Threats and Defensive Techniques of Machine Learning: A Data Driven View. IEEE Access. 6, 12103–12117.
[64] Lopez, G. and Guerrero, L.A. 2017. Awareness Supporting Technologies used in Collaborative Systems: A Systematic Literature Review. Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 808–820.
[65] Lwakatare, L.E., Raj, A., Bosch, J., Olsson, H.H. and Crnkovic, I. 2019. A taxonomy of software engineering challenges for machine learning systems: An empirical investigation. Proceedings of the 2019 International Conference on Agile Software Development, 227–243.
[66] Lwakatare, L.E., Raj, A., Crnkovic, I., Bosch, J. and Olsson, H.H. 2020. Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions. Information and software technology. 127, 106368, 106368.
[67] Mäkinen, S., Skogström, H., Laaksonen, E. and Mikkonen, T. 2021. Who Needs MLOps: What Data Scientists Seek to Accomplish and How Can MLOps Help? Proceedings of the IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI (WAIN), 109–112.
[68] Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendowicz, A., Vollmer, A.M. and Wagner, S. 2022. Software Engineering for AI-Based Systems: A Survey. ACM Transactions on Software Engineering and Methodology. 31, 2, 1–59.
[69] Martinez-Plumed, F., Contreras-Ochando, L., Ferri, C., Hernandez Orallo, J., Kull, M., Lachiche, N., Ramirez Quintana, M.J. and Flach, P.A. 2020. CRISP-DM twenty years later: From data mining processes to data science trajectories. IEEE transactions on knowledge and data engineering. 33, 8, 3048–3061.
[70] McGlohon, M. 2021. Demystifying Machine Learning in Production: Reasoning about a Large-Scale ML Platform.
[71] McGraw, G., Figueroa, H., Shepardson, V. and Bonett, R. 2020. An architectural risk analysis of machine learning systems: Toward more secure machine learning. Berryville Institute of Machine Learning, Clarke County, VA. Accessed on: Mar. 23,.
[72] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D. and Gebru, T. 2019. Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229.
[73] Muiruri, D., Lwakatare, L.E., K Nurminen, J. and Mikkonen, T. 2022. Practices and Infrastructures for ML Systems–An Interview Study in Finnish Organizations. TechRxiv.
[74] Muller, M., Lange, I., Wang, D., Piorkowski, D., Tsay, J., Liao, Q.V., Dugan, C. and Erickson, T. 2019. How Data Science Workers Work with Data: Discovery, Capture, Curation, Design, Creation. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–15.
[75] Myllyaho, L., Raatikainen, M., Männistö, T., Nurminen, J.K. and Mikkonen, T. 2022. On misbehaviour and fault tolerance in machine learning systems. Journal of Systems and Software. 183, 111096.
[76] Nahar, N. 2022. Supplementary documents: A meta-summary of challenges in building products with ML components – collecting experiences from 4758+ practitioners. OSF. https://osf.io/y5edu/
[77] Nahar, N., Zhou, S., Lewis, G. and Kästner, C. 2022. Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process. Proceedings of the 44th International Conference on Software Engineering, 413–425.
[78] Namvar, M., Intezari, A., Akhlaghpour, S. and Brienza, J.P. 2022. Beyond effective use: Integrating wise reasoning in machine learning development. International journal of information management. 102566.
[79] Nascimento, E., Nguyen-Duc, A., Sundbø, I. and Conte, T. 2020. Software engineering for artificial intelligence and machine learning software: A systematic literature review. arXiv [cs.SE].
[80] Nikanjam, A., Morovati, M.M., Khomh, F. and Ben Braiek, H. 2022. Faults in deep reinforcement learning programs: a taxonomy and a detection approach. Automated software engineering. 29, 1.
[81] Nikhil, K., Anandayuvaraj, D., Detti, A., Lee Bland, F., Rahaman, S. and Davis, J.C. 2022. “If security is required”: Engineering and Security Practices for Machine Learning-based IoT Devices. Proceedings of the 4th International Workshop on Software Engineering Research and Practices for the IoT (SERP4IoT), 1–8.
[82] Nushi, B., Kamar, E., Horvitz, E. and Kossmann, D. 2017. On human intellect and machine failures: troubleshooting integrative machine learning systems. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 1017–1025.
[83] Ozkaya, I. 2020. What is really different in engineering AI-enabled systems? IEEE software. 37, 4, 3–6.
[84] Paleyes, A., Urma, R.-G. and Lawrence, N.D. 2022. Challenges in deploying machine learning: A survey of case studies. ACM computing surveys.
[85] Passi, S. and Jackson, S.J. 2018. Trust in Data Science: Collaboration, Translation, and Accountability in Corporate Data Science Projects. Proceedings of the ACM on Human-Computer Interaction. 2, CSCW (Nov. 2018), 1–28.
[86] Passi, S. and Sengers, P. 2020. Making data science systems work. Big data & society. 7, 2, 205395172093960.
[87] Pimentel, J.F., Murta, L., Braganholo, V. and Freire, J. 2019. A large-scale study about quality and reproducibility of jupyter notebooks. Proceedings of the 16th International Conference on Mining Software Repositories (MSR), 507–517.
[88] Piorkowski, D., González, D., Richards, J. and Houde, S. 2020. Towards evaluating and eliciting high-quality documentation for intelligent systems. arXiv [cs.SE].
[89] Piorkowski, D., Park, S., Wang, A.Y., Wang, D., Muller, M. and Portnoy, F. 2021. How AI Developers Overcome Communication Challenges in a Multidisciplinary Team: A Case Study. Proceedings of the ACM on Human-Computer Interaction. 5, CSCW1, 1–25.
[90] Polyzotis, N., Roy, S., Whang, S.E. and Zinkevich, M. 2018. Data Lifecycle Challenges in Production Machine Learning: A Survey. ACM SIGMOD Record. 47, 2, 17–28.
[91] Rahimi, M., Guo, J.L.C., Kokaly, S. and Chechik, M. 2019. Toward Requirements Specification for Machine-Learned Components. Proceedings of the 27th International Requirements Engineering Conference Workshops (REW), 241–244.
[92] Rahman, M.S., Khomh, F., Hamidi, A., Cheng, J., Antoniol, G. and Washizaki, H. 2021. Machine Learning Application Development: Practitioners’ Insights. arXiv [cs.SE].
[93] Rahman, M.S., Rivera, E., Khomh, F., Guéhéneuc, Y.-G. and Lehnert, B. 2019. Machine Learning Software Engineering in Practice: An Industrial Case Study. arXiv [cs.SE].
[94] Rakova, B., Yang, J., Cramer, H. and Chowdhury, R. 2020. Where Responsible AI meets Reality: Practitioner Perspectives on Enablers for shifting Organizational Practices. Proceedings of the ACM on Human-Computer Interaction, 1–23.
[95] Ribeiro, D.M., Cardoso, M., da Silva, F.Q.B. and França, C. 2014. Using qualitative metasummary to synthesize empirical findings in literature reviews. Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 1–4.
[96] Ribeiro, M.T., Wu, T., Guestrin, C. and Singh, S. 2020. Beyond Accuracy: Behavioral Testing of NLP models with CheckList. arXiv [cs.CL].
[97] Riccio, V., Jahangirova, G., Stocco, A., Humbatova, N., Weiss, M. and Tonella, P. 2020. Testing machine learning based systems: a systematic mapping. Empirical Software Engineering. 25, 6, 5193–5254.
[98] Rismani, S., Shelby, R., Smart, A., Jatho, E., Kroll, J., Moon, A. and Rostamzadeh, N. 2022. From plane crashes to algorithmic harm: applicability of safety engineering frameworks for responsible ML. arXiv [cs.HC].
[99] Riungu-Kalliosaari, L., Kauppinen, M. and Männistö, T. 2017. What Can Be Learnt from Experienced Data Scientists? A Case Study. Product-Focused Software Process Improvement, 55–70.
[100] Saha, D., Schumann, C., McElfresh, D.C., Dickerson, J.P., Mazurek, M.L. and Tschantz, M.C. 2020. Human Comprehension of Fairness in Machine Learning. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 152.
[101] Salay, R., Queiroz, R. and Czarnecki, K. 2017. An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software. arXiv [cs.AI].
[102] Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P. and Aroyo, L.M. 2021. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–15.
[103] Sandelowski, M., Barroso, J. and Voils, C.I. 2007. Using qualitative metasummary to synthesize qualitative and quantitative descriptive findings. Research in nursing & health. 30, 1, 99–111.
[104] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F. and Dennison, D. 2015. Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems 28. C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett, eds. Curran Associates, Inc. 2503–2511.
[105] Sculley, D., Otey, M.E., Pohl, M., Spitznagel, B., Hainsworth, J. and Zhou, Y. 2011. Detecting adversarial advertisements in the wild. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 274–282.
[106] Sendak, M.P. et al. 2020. Real-World Integration of a Sepsis Deep Learning Technology Into Routine Clinical Care: Implementation Study. JMIR medical informatics. 8, 7, e15182.
[107] Serban, A., van der Blom, K., Hoos, H. and Visser, J. 2020. Adoption and Effects of Software Engineering Best Practices in Machine Learning. Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 1–12.
[108] Serban, A., van der Blom, K., Hoos, H. and Visser, J. 2021. Practices for Engineering Trustworthy Machine Learning Applications. Proceedings of the 1st Workshop on AI Engineering - Software Engineering for AI (WAIN), 97–100.
[109] Serban, A. and Visser, J. 2022. Adapting Software Architectures to Machine Learning Challenges. Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 152–163.
[110] Shankar, S., Garcia, R., Hellerstein, J.M. and Parameswaran, A.G. 2022. Operationalizing Machine Learning: An Interview Study. arXiv [cs.SE].
[111] Shaw, M. and Zhu, L. 2022. Can Software Engineering Harness the Benefits of Advanced AI? IEEE Software. 39, 6, 99–104.
[112] Siebert, J., Joeckel, L., Heidrich, J., Nakamichi, K., Ohashi, K., Namba, I., Yamamoto, R. and Aoyama, M. 2020. Towards Guidelines for Assessing Qualities of Machine Learning Systems. Proceedings of the International Conference on the Quality of Information and Communications Technology, 17–31.
[113] Smith, D. 2017. Exploring development patterns in data science.
[114] d. S. Nascimento, E., Ahmed, I., Oliveira, E., Palheta, M.P., Steinmacher, I. and Conte, T. 2019. Understanding Development Process of Machine Learning Systems: Challenges and Solutions. Proceedings of the 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 1–6.
[115] Spencer, D. 2009. Card Sorting: Designing Usable Categories. Rosenfeld Media.
[116] Studer, S., Bui, T.B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S. and Mueller, K.-R. 2021. Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. Machine Learning and Knowledge Extraction. 3, 2, 392–413.
[117] Tonekaboni, S., Joshi, S., McCradden, M.D. and Goldenberg, A. 2019. What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use. Proceedings of the 4th Machine Learning for Healthcare Conference, 359–380.
[118] Uchihira, N. 2022. Project FMEA for Recognizing Difficulties in Machine Learning Application System Development. Proceedings of the 2022 Portland International Conference on Management of Engineering and Technology (PICMET), 1–8.
[119] Vogelsang, A. and Borg, M. 2019. Requirements Engineering for Machine Learning: Perspectives from Data Scientists. Proceedings of the 27th International Requirements Engineering Conference Workshops (REW), 245–251.
[120] Wagstaff, K. 2012. Machine Learning that Matters. arXiv [cs.LG].
[121] Wang, D., Weisz, J.D., Muller, M., Ram, P., Geyer, W., Dugan, C., Tausczik, Y., Samulowitz, H. and Gray, A. 2019. Human-AI Collaboration in Data Science: Exploring Data Scientists’ Perceptions of Automated AI. Proceedings of the ACM on Human-Computer Interaction. 3, CSCW, 1–24.
[122] Wan, Z., Xia, X., Lo, D. and Murphy, G.C. 2019. How does Machine Learning Change Software Development Practices? IEEE Transactions on Software Engineering. 47, 9, 1857–1871.
[123] Washizaki, H., Takeuchi, H., Khomh, F., Natori, N., Doi, T. and Okuda, S. 2020. Practitioners’ insights on machine-learning software engineering design patterns: a preliminary study. Proceedings of the 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), 797–799.
[124] Washizaki, H., Uchida, H., Khomh, F. and Guéhéneuc, Y.-G. 2020. Machine learning architecture and design patterns. IEEE Software. 8,.
[125] Washizaki, H., Uchida, H., Khomh, F. and Guéhéneuc, Y.-G. 2019. Studying Software Engineering Patterns for Designing Machine Learning Systems. Proceedings of the 10th International Workshop on Empirical Software Engineering in Practice (IWESEP), 49–495.
[126] Wohlin, C. 2014. Guidelines for snowballing in systematic literature studies and a replication in software engineering. Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, 1–10.
[127] Zdanowska, S. and Taylor, A.S. 2022. A study of UX practitioners roles in designing real-world, enterprise ML systems. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1–15.
[128] Zhang, A.X., Muller, M. and Wang, D. 2020. How do data science workers collaborate? Roles, workflows, and tools. Proceedings of the ACM on human-computer interaction. 4, CSCW1, 1–23.
[129] Zhang, J.M., Harman, M., Ma, L. and Liu, Y. 2022. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering. 48, 1, 1–36.
[130] Zhang, X., Yang, Y., Feng, Y. and Chen, Z. 2019. Software Engineering Practice in the Development of Deep Learning Applications. arXiv [cs.SE].
[131] Zinkevich, M. 2017. Rules of machine learning: Best practices for ML engineering. minegrado.ovh.
APPENDIX
List of Papers
(1) d. S. Nascimento, E., Ahmed, I., Oliveira, E., Palheta, M.P., Steinmacher, I. and Conte, T. 2019. Understanding Development Process of Machine Learning Systems: Challenges and Solutions. Proceedings of the 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 1–6.
(2) Lewis, G.A., Bellomo, S. and Ozkaya, I. 2021. Characterizing and Detecting Mismatch in Machine-Learning-Enabled Systems. Proceedings of the IEEE/ACM 1st Workshop on AI Engineering-Software Engineering for AI (WAIN), 133–140.
(3) Lewis, G.A., Ozkaya, I. and Xu, X. 2021. Software Architecture Challenges for ML Systems. Proceedings of the 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), 634–638.
(4) Serban, A. and Visser, J. 2022. Adapting Software Architectures to Machine Learning Challenges. Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 152–163.
(5) Vogelsang, A. and Borg, M. 2019. Requirements Engineering for Machine Learning: Perspectives from Data Scientists. Proceedings of the 27th International Requirements Engineering Conference Workshops (REW), 245–251.
(14) Haakman, M., Cruz, L., Huijgens, H. and van Deursen, A. 2021. AI Lifecycle Models Need To Be Revised. An Exploratory Study in Fintech. Empirical Software Engineering. 26, 5, 1–29.
(15) Wan, Z., Xia, X., Lo, D. and Murphy, G.C. 2019. How does Machine Learning Change Software Development Practices? IEEE Transactions on Software Engineering. 47, 9, 1857–1871.
(16) Washizaki, H., Takeuchi, H., Khomh, F., Natori, N., Doi, T. and Okuda, S. 2020. Practitioners’ insights on machine-learning software engineering design patterns: a preliminary study. Proceedings of the 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), 797–799.
(17) Li, S., Guo, J., Lou, J.-G., Fan, M., Liu, T. and Zhang, D. 2022. Testing machine learning systems in industry: an empirical study. Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, 263–272.
(18) Nikhil, K., Anandayuvaraj, D., Detti, A., Lee Bland, F., Rahaman, S. and Davis, J.C. 2022. “If security is required”: Engineering and Security Practices for Machine Learning-based IoT Devices. Proceedings of the 4th International Workshop on Software Engineering Research and Practices for the IoT (SERP4IoT), 1–8.
(19) Chang, J. and Custis, C. 2022. Understanding Implementation Challenges in Machine Learning Documentation. Equity and Access in Algorithms, Mechanisms, and Optimization, 1–8.
(20) Bäuerle, A., Cabrera, Á.A., Hohman, F., Maher, M., Koski, D.,
(6) Kim, M., Zimmermann, T., DeLine, R. and Begel, A. 2018. Data Suau, X., Barik, T. and Moritz, D. 2022. Symphony: Composing
Scientists in Software Teams: State of the Art and Challenges. Interactive Interfaces for Machine Learning. Proceedings of the
IEEE Transactions on Software Engineering. 44, 11, 1024–1038. 2022 CHI Conference on Human Factors in Computing Systems,
(7) Ishikawa, F. and Yoshioka, N. 2019. How do engineers perceive 1–14.
difficulties in engineering of machine-learning systems? - ques- (21) Laato, S., Birkstedt, T., Mäantymäki, M., Minkkinen, M. and
tionnaire survey. Proceedings of the 2019 IEEE/ACM Joint 7th Mikkonen, T. 2022. AI governance in the system development
International Workshop on Conducting Empirical Studies in In- life cycle: insights on responsible machine learning engineering.
dustry (CESI) and 6th International Workshop on Software Engi- Proceedings of the 1st International Conference on AI Engineering:
neering Research and Industrial Practice (SER&IP), 2–9. Software Engineering for AI, 113–123.
(8) Nahar, N., Zhou, S., Lewis, G. and Kästner, C. 2022. Collaboration (22) Mäkinen, S., Skogström, H., Laaksonen, E. and Mikkonen, T.
Challenges in Building ML-Enabled Systems: Communication, 2021. Who Needs MLOps: What Data Scientists Seek to Accom-
Documentation, Engineering, and Process. Proceedings of the plish and How Can MLOps Help? Proceedings of the IEEE/ACM
44th International Conference on Software Engineering, 413–425. 1st Workshop on AI Engineering - Software Engineering for AI
(9) Rakova, B., Yang, J., Cramer, H. and Chowdhury, R. 2020. Where (WAIN), 109–112.
Responsible AI meets Reality: Practitioner Perspectives on En- (23) John, M.M., Olsson, H.H. and Bosch, J. 2020. AI Deployment
ablers for shifting Organizational Practices. Proceedings of the Architecture: Multi-Case Study for Key Factor Identification.
ACM on Human-Computer Interaction, 1–23. Proceedings of the 27th Asia-Pacific Software Engineering Confer-
(10) Holstein, K., Wortman Vaughan, J., Daumé, H., Dudik, M. and ence (APSEC), 395–404.
Wallach, H. 2019. Improving Fairness in Machine Learning Sys- (24) Uchihira, N. 2022. Project FMEA for Recognizing Difficulties in
tems: What Do Industry Practitioners Need? Proceedings of the Machine Learning Application System Development. Proceed-
2019 CHI Conference on Human Factors in Computing Systems, ings of the 2022 Portland International Conference on Management
1–16. of Engineering and Technology (PICMET), 1–8.
(11) Hopkins, A. and Booth, S. 2021. Machine Learning Practices Out- (25) Kumar, R.S.S., Nystrom, M., Lambert, J., Marshall, A., Goertzel,
side Big Tech: How Resource Constraints Challenge Responsible M., Comissoneru, A., Swann, M. and Xia, S. 2020. Adversarial
Development. Proceedings of the 2021 AAAI/ACM Conference on Machine Learning - Industry Perspectives. Proceedings of the
AI, Ethics, and Society, 134–145. 2020 IEEE Security and Privacy Workshops (SPW)., 69–75.
(12) Serban, A., van der Blom, K., Hoos, H. and Visser, J. 2020. Adop- (26) Zdanowska, S. and Taylor, A.S. 2022. A study of UX practitioners
tion and Effects of Software Engineering Best Practices in Ma- roles in designing real-world, enterprise ML systems. Proceed-
chine Learning. Proceedings of the 14th ACM/IEEE International ings of the 2022 CHI Conference on Human Factors in Computing
Symposium on Empirical Software Engineering and Measurement Systems, 1–15.
(ESEM), 1–12. (27) Liu, H., Eksmo, S., Risberg, J. and Hebig, R. 2020. Emerging
(13) Rahman, M.S., Khomh, F., Hamidi, A., Cheng, J., Antoniol, G.
and Changing Tasks in the Development Process for Machine
and Washizaki, H. 2021. Machine Learning Application Devel-
opment: Practitioners’ Insights. arXiv [cs.SE].
Learning Systems. Proceedings of the International Conference on Software and System Processes, 125–134.
(28) Zhang, X., Yang, Y., Feng, Y. and Chen, Z. 2019. Software Engineering Practice in the Development of Deep Learning Applications. arXiv [cs.SE].
(29) Rismani, S., Shelby, R., Smart, A., Jatho, E., Kroll, J., Moon, A. and Rostamzadeh, N. 2022. From plane crashes to algorithmic harm: applicability of safety engineering frameworks for responsible ML. arXiv [cs.HC].
(30) Myllyaho, L., Raatikainen, M., Männistö, T., Nurminen, J.K. and Mikkonen, T. 2022. On misbehaviour and fault tolerance in machine learning systems. Journal of Systems and Software. 183, 111096.
(31) Golendukhina, V., Lenarduzzi, V. and Felderer, M. 2022. What is software quality for AI engineers? Towards a thinning of the fog. Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, 1–9.
(32) Königstorfer, F. and Thalmann, S. 2022. AI Documentation: A path to accountability. Journal of Responsible Technology. 11, 100043.
(33) Riungu-Kalliosaari, L., Kauppinen, M. and Männistö, T. 2017. What Can Be Learnt from Experienced Data Scientists? A Case Study. Product-Focused Software Process Improvement, 55–70.
(34) Namvar, M., Intezari, A., Akhlaghpour, S. and Brienza, J.P. 2022. Beyond effective use: Integrating wise reasoning in machine learning development. International Journal of Information Management, 102566.
(35) Baijens, J., Helms, R. and Iren, D. 2020. Applying Scrum in Data Science Projects. Proceedings of the 22nd Conference on Business Informatics (CBI), 30–38.
(36) Lwakatare, L.E., Raj, A., Bosch, J., Olsson, H.H. and Crnkovic, I. 2019. A taxonomy of software engineering challenges for machine learning systems: An empirical investigation. Proceedings of the 2019 International Conference on Agile Software Development, 227–243.
(37) Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T. 2019. Software Engineering for Machine Learning: A Case Study. Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 291–300.
(38) Arpteg, A., Brinne, B., Crnkovic-Friis, L. and Bosch, J. 2018. Software Engineering Challenges of Deep Learning. Proceedings of the 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 50–59.
(39) Muiruri, D., Lwakatare, L.E., Nurminen, J.K. and Mikkonen, T. 2022. Practices and Infrastructures for ML Systems–An Interview Study in Finnish Organizations. TechRxiv.
(40) Serban, A., van der Blom, K., Hoos, H. and Visser, J. 2021. Practices for Engineering Trustworthy Machine Learning Applications. Proceedings of the 1st Workshop on AI Engineering - Software Engineering for AI (WAIN), 97–100.
(41) Shankar, S., Garcia, R., Hellerstein, J.M. and Parameswaran, A.G. 2022. Operationalizing Machine Learning: An Interview Study. arXiv [cs.SE].
(42) Andrade, H., Lwakatare, L.E., Crnkovic, I. and Bosch, J. 2019. Software Challenges in Heterogeneous Computing: A Multiple Case Study in Industry. Proceedings of the 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 148–155.
(43) Boenisch, F., Battis, V., Buchmann, N. and Poikela, M. 2021. “I Never Thought About Securing My Machine Learning Systems”: A Study of Security and Privacy Awareness of Machine Learning Practitioners. Proceedings of Mensch und Computer 2021, 520–546.
(44) Brennen, A. 2020. What Do People Really Want When They Say They Want “Explainable AI?” We Asked 60 Stakeholders. Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 1–7.
(45) Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J.M.F. and Eckersley, P. 2020. Explainable machine learning in deployment. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 648–657.
(46) Hummer, W., Muthusamy, V., Rausch, T., Dube, P., El Maghraoui, K., Murthi, A. and Oum, P. 2019. ModelOps: Cloud-Based Lifecycle Management for Reliable and Trusted AI. Proceedings of the 2019 IEEE International Conference on Cloud Engineering (IC2E), 113–120.
(47) Zhang, A.X., Muller, M. and Wang, D. 2020. How do data science workers collaborate? Roles, workflows, and tools. Proceedings of the ACM on Human-Computer Interaction. 4, CSCW1, 1–23.
(48) Piorkowski, D., González, D., Richards, J. and Houde, S. 2020. Towards evaluating and eliciting high-quality documentation for intelligent systems. arXiv [cs.SE].
(49) Passi, S. and Jackson, S.J. 2018. Trust in Data Science: Collaboration, Translation, and Accountability in Corporate Data Science Projects. Proceedings of the ACM on Human-Computer Interaction. 2, CSCW (Nov. 2018), 1–28.
(50) Dove, G., Halskov, K., Forlizzi, J. and Zimmerman, J. 2017. UX Design Innovation: Challenges for Working with Machine Learning as a Design Material. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 278–288.

Paper Analysis and Card Sorting
The complete paper analysis and card sorting details are available in our online supplementary documents:
Nahar, N. 2022. Supplementary documents: A meta-summary of challenges in building products with ML components – collecting experiences from 4758+ practitioners. OSF. https://osf.io/y5edu/
