Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril
NAM.EDU
NATIONAL ACADEMY OF MEDICINE, 500 Fifth Street, NW, Washington, DC 20001
This publication has undergone peer review according to procedures established by the National
Academy of Medicine (NAM). Publication by the NAM signifies that it is the product of a carefully
considered process and is a contribution worthy of public attention, but does not constitute
endorsement of conclusions and recommendations by the NAM. The views presented in this
publication are those of individual contributors and do not represent formal consensus positions
of the authors’ organizations; the NAM; or the National Academies of Sciences, Engineering, and
Medicine.
Suggested citation: Matheny, M., S. Thadaney Israni, M. Ahmed, and D. Whicher, Editors. 2022.
Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. Washington, DC:
National Academy of Medicine.
“Knowing is not enough; we must apply.
Willing is not enough; we must do.”
—GOETHE
ABOUT THE NATIONAL ACADEMY OF MEDICINE
NAM Staff
REVIEWERS
This Special Publication was reviewed in draft form by individuals chosen for
their diverse perspectives and technical expertise, in accordance with review
procedures established by the National Academy of Medicine (NAM). We wish
to thank the following individuals for their contributions:
existence of substantial and sensitive data assets raises concerns about privacy and
security. Aspiring to the promise of AI requires both continuing innovation and
attention to the potential perils.
In our opinion, this publication presents a sober and balanced celebration of
accomplishments, possibilities, and pitfalls. We commend Drs. Michael McGinnis
and Danielle Whicher for their thoughtful sponsorship of the NAM Consortium
and Digital Health Learning Collaborative, Dr. Matheny and Mrs. Thadaney Israni
for their leadership in producing this volume, and all the contributors who
have produced an exceptional resource with practical relevance to a wide array
of key stakeholders.
Summary, 1
7 Health Care Artificial Intelligence: Law, Regulation, and Policy, 197
Introduction, 197
Overview of Health Care AI Laws and Regulations in the United States, 198
Safety and Efficacy of Clinical Systems, 200
Privacy, Information, and Data, 220
Key Considerations, 225
References, 227
Appendixes
A Additional Key Reference Materials, 251
B Workshop Agenda and Participant List, 253
C Author Biographies, 259
BOXES, FIGURES, AND TABLES
BOXES
7-1 Federal Food, Drug, and Cosmetic Act (21 U.S.C. § 360j) Medical Device
Definition, 201
FIGURES
1-1 Life expectancy gains and increased health spending, selected high-
income countries, 1995–2015, 8
1-2 A summary of the domains of artificial intelligence, 14
1-3 A summary of the most common methods and applications for training
machine learning algorithms, 15
1-4 Growth in facts affecting provider decisions versus human cognitive
capacity, 18
1-5 Framework for implementing artificial intelligence through the lens of
human rights values, 24
1-6 Chapter relationship, 28
TABLES
6-1 Leveraging Artificial Intelligence Tools into a Learning Health System, 173
6-2 Key Considerations for Institutional Infrastructure and Governance, 175
6-3 Key Artificial Intelligence (AI) Tool Implementation Concepts,
Considerations, and Tasks Translated to AI-Specific Considerations, 177
ACRONYMS AND ABBREVIATIONS
QI quality improvement
QMS quality management system
SUMMARY
Transparency is key to building this much-needed trust among users and
stakeholders, but distinct domains have differing transparency needs.
There should be full transparency on the composition, semantics, provenance, and
quality of data used to develop AI tools. There also needs to be full transparency
and adequate assessment of relevant performance components of AI. However,
algorithmic transparency may not be required for all cases. AI developers,
implementers, users, and regulators should collaboratively define guidelines for
clarifying the level of transparency needed across a spectrum. These are key issues
for regulatory agencies and clinical users, and performance requirements differ
based on risk and intended use. Most importantly, we suggest clear
separation of data, algorithmic, and performance reporting in AI dialogue, and the
development of guidance in each of these spaces.
To benefit from, sustain, and nurture AI tools in health care, we need
a thoughtful, sweeping, and comprehensive expansion of relevant training and
educational programs. Given the scale at which health care AI systems could
change the medical domain, the educational expansion must be multidisciplinary
and engage AI developers, implementers, health care system leadership, frontline
clinical teams, ethicists, humanists, and patients and patient caregivers because
each brings a core set of much needed requirements and expertise. Health care
professional training programs should incorporate core curricula focused on
teaching how to appropriately use data science and AI products and services. The
needs of practicing health care professionals can be fulfilled via their required
continuing education, empowering them to be more informed consumers.
Additionally, retraining programs will be needed to address shifts in desired skill
sets, and the resulting skill and knowledge mismatches, as levels of AI deployment
increase. Last, but not least, consumer health educational programs, offered at a
range of educational levels, are vital for helping consumers select and use health
care applications.
safety based on real-world data. Throughout that process, transparency can help
deliver better-vetted solutions. To enable both AI development and oversight,
government agencies should invest in infrastructure that promotes wider, ethical
data collection and access to data resources for building AI solutions within a
priority of ethical use and data protection (see Figure S-2).
CONCLUSION
INTRODUCTION
FIGURE 1-1 | Life expectancy gains and increased health spending, selected high-income
countries, 1995–2015.
SOURCE: Figure redrawn from OECD, 2017, Health at a Glance 2017: OECD Indicators, OECD Publishing, Paris,
https://doi.org/10.1787/health_glance-2017-en.
Given the current national focus on AI and its potential utility for improving
health and health care in the United States, the National Academy of Medicine
(NAM) Leadership Consortium: Collaboration for a Value & Science-Driven
Learning Health System (Leadership Consortium)—through its Digital
Health Learning Collaborative (DHLC)—brought together experts to explore
opportunities, issues, and concerns related to the expanded application of AI in
health and health care settings (NAM, 2019a,b).
informatics, incentives, and culture are aligned for enduring improvement and
innovation; best practices are seamlessly embedded in the care process; patients and
families are active participants in all elements; and new knowledge is captured as
an integral by-product of the care experience. Priorities for achieving this vision
include advancing the development of a fully interoperable digital infrastructure,
the application of new clinical research approaches, and a culture of transparency
on outcomes and cost.
The NAM Leadership Consortium serves as a forum for facilitating
collaborative assessment and action around issues central to achieving the vision
of a continuously learning health system. To address the challenges of improving
both evidence development and evidence application, as well as improving the
capacity to advance progress on each of those dimensions, Leadership Consortium
members (all leaders in their fields) work with their colleagues to identify the
issues not being adequately addressed, the nature of the barriers and possible
solutions, and the priorities for action. They then work to marshal the resources
of the sectors represented in the Leadership Consortium to work for sustained
public–private cooperation for change.
The work of the NAM Leadership Consortium falls into four strategic action
domains—informatics, evidence, financing, and culture—and each domain has
a dedicated innovation collaborative that works to facilitate progress in that
area. This Special Publication was developed under the auspices of the DHLC.
Co-chaired by Jonathan Perlin from the Hospital Corporation of America and
Reed Tuckson from Tuckson Health Connections, the DHLC provides a venue
for joint activities that can accelerate progress in the area of health informatics
and toward the digital infrastructure necessary for continuous improvement and
innovation in health and health care.
PUBLICATION GENESIS
In 2017, the DHLC identified issues around the development, deployment, and
use of AI as being of central importance to facilitating continuous improvement
and innovation in health and health care. To consider the nature, elements,
applications, state of play, key challenges, and implications of AI in health and health
care, as well as ways in which the NAM might enhance collaborative progress, the
DHLC convened a meeting at the National Academy of Sciences (NAS) building
in Washington, DC, on November 30, 2017. Participants included AI experts
from across the United States representing different stakeholder groups within
PUBLICATION WORKFLOW
Authors were organized from among the meeting participants along expertise
and interest, and each chapter was drafted with guidance from the NAM and
the editors, with monthly publication meetings where all authors were invited
to participate and update the group. Author biographies can be found in
Appendix C.
As an initial step, the authors, the NAM staff, and co-chairs developed the
scope and content focus of each of the chapters based on discussion at the initial
in-person meeting. Subsequently, the authors for each chapter drafted chapter
outlines from this guideline. Outlines were shared with the other authors, the
NAM staff, and the working group co-chairs to ensure consistency in the level of
detail and formatting. Differences and potential overlap were discussed before the
authors proceeded with drafting of each chapter. The working group co-chairs
and the NAM staff drafted content for Chapters 1 and 8, and were responsible for
managing the monthly meetings and editing the content of all chapters.
After all chapters were drafted, the resulting publication was discussed at a
meeting that brought together working group members and external experts
at the NAS building in Washington, DC, on January 16, 2019. The goal of the
meeting was to receive feedback on the draft publication to improve its utility to
the field. Following the meeting, the chapter authors refined and added content
to address suggestions from meeting participants. To improve consistency in
voice and style across authors, an external editor was hired to review and edit
the publication in its entirety before the document was sent out for external
review. Finally, 10 external expert reviewers agreed to review the publication and
provide critiques and recommendations for further improvement of the content.
Working group co-chairs and the NAM staff reviewed all feedback and added
recommendations and edits, which were sent to chapter authors for consideration
for incorporation. Final edits following chapter author re-submissions were
resolved by the co-chairs and the NAM staff. The resulting publication represents
the ideas shared at both meetings and the efforts of the working group.
IMPORTANT DEFINITIONS
Artificial Intelligence
The term “artificial intelligence” (AI) has a range of meanings, from specific
forms of AI, such as machine learning, to the hypothetical AI that meets criteria for
consciousness and sentience. This publication does not address the hypothetical,
as the popular press often does, and focuses instead on the current and near-future
uses and applications of AI.
A formal definition of AI starts with the Oxford English Dictionary: “The
capacity of computers or other machines to exhibit or simulate intelligent
behavior; the field of study concerned with this,” or Merriam-Webster online:
“1: a branch of computer science dealing with the simulation of intelligent
behavior in computers, 2: the capability of a machine to imitate intelligent human
behavior.” More nuanced definitions of AI might also consider what type of goal
the AI is attempting to achieve and how it is pursuing that goal. In general, AI
systems range from those that attempt to accurately model human reasoning to
solve a problem, to those that ignore human reasoning and exclusively use large
volumes of data to generate a framework to answer the question(s) of interest,
to those that attempt to incorporate elements of human reasoning but do not
require accurate modeling of human processes. Figure 1-2 includes a hierarchical
representation of AI technologies (Mills, 2015).
Machine learning is a family of statistical and mathematical modeling
techniques that uses a variety of approaches to automatically learn and improve
the prediction of a target state, without explicit programming (e.g., Boolean rules)
FIGURE 1-3 | A summary of the most common methods and applications for training
machine learning algorithms.
SOURCE: Reprinted with permission from Isazi Consulting, 2015. http://www.isaziconsulting.co.za/machinelearning.html.
than other AI domains because context, interpretation, and nuance add needed
information. NLP incorporates rule-based and data-based learning systems,
and many of the internal components of NLP systems are themselves machine
learning algorithms with pre-defined inputs and outputs, sometimes operating
under additional constraints. Examples of NLP applications include assessment of
cancer disease progression and response to therapy among radiology reports (Kehl
et al., 2019), and identification of post-operative complications from routine EHR
documentation (Murff et al., 2011).
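To make the rule-based side of such systems concrete, the sketch below flags a possible post-operative complication mention in a clinical note using invented trigger terms and a crude negation check. It is a toy illustration under those assumptions, not the pipeline described by Murff et al. (2011); production clinical NLP combines much richer linguistic processing with trained statistical components.

```python
import re

# Illustrative trigger terms for one complication (wound infection); real systems
# use curated vocabularies (e.g., UMLS concepts), not short hand-written lists.
TRIGGERS = ["wound infection", "purulent drainage", "surgical site infection"]
NEGATIONS = ["no ", "denies ", "without ", "negative for "]

def flag_note(note):
    """Return True if the note mentions a trigger term that is not negated."""
    text = note.lower()
    for term in TRIGGERS:
        for match in re.finditer(re.escape(term), text):
            # Look in a short window before the mention for a negation cue.
            window = text[max(0, match.start() - 30):match.start()]
            if not any(neg in window for neg in NEGATIONS):
                return True
    return False

print(flag_note("POD#3: afebrile, no purulent drainage, incision clean."))  # False
print(flag_note("POD#5: purulent drainage noted at the surgical site."))    # True
```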
Speech algorithms digitize audio recordings into computable data elements
and convert text into human speech (Chung et al., 2018). This field is closely
connected with NLP, with the added complexity of intonation and syllable
emphasis impacting meaning. This complicates both inbound and outbound
speech interpretation and generation. For examples of how deep learning neural
networks have been applied to this field, see a recent systematic review of this
topic (Nassif et al., 2019).
Expert systems are a set of computer algorithms that seek to emulate the
decision-making capacity of human experts (Feigenbaum, 1992; Jackson, 1998;
Leondes, 2002; Shortliffe and Buchanan, 1975). These systems rely largely on a
complex set of Boolean and deterministic rules. An expert system is divided into
a knowledge base, which encodes the domain logic, and an inference engine,
which applies the knowledge base to data presented to the system to provide
recommendations or deduce new facts. Examples of this are some of the clinical
decision support tools (Hoffman et al., 2016) being developed within the Clinical
Pharmacogenetics Implementation Consortium, which is promoting the use of
knowledge bases such as PharmGKB to provide personalized recommendations
for medication use in patients based on genetic data results (CPIC, 2019;
PharmGKB, 2019).
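A minimal sketch of the knowledge base and inference engine split is shown below. The rule content and structure are simplified stand-ins for illustration; they are not drawn from the actual CPIC or PharmGKB knowledge bases.

```python
# A toy expert system: a declarative knowledge base of if-then rules plus a
# simple forward-chaining inference engine. Rule content is illustrative only.
KNOWLEDGE_BASE = [
    {"if": {"gene": "CYP2C19", "phenotype": "poor metabolizer", "drug": "clopidogrel"},
     "then": "consider an alternative antiplatelet agent"},
    {"if": {"gene": "CYP2C19", "phenotype": "normal metabolizer", "drug": "clopidogrel"},
     "then": "standard dosing"},
]

def infer(facts):
    """Apply every rule whose conditions are all satisfied by the patient facts."""
    recommendations = []
    for rule in KNOWLEDGE_BASE:
        if all(facts.get(key) == value for key, value in rule["if"].items()):
            recommendations.append(rule["then"])
    return recommendations

patient = {"gene": "CYP2C19", "phenotype": "poor metabolizer", "drug": "clopidogrel"}
print(infer(patient))  # ['consider an alternative antiplatelet agent']
```

The separation matters because the inference engine never changes as the knowledge base grows; domain experts maintain the rules while the same engine applies them.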
Automated planning and scheduling systems produce optimized strategies
for action sequences (such as clinic scheduling), which are typically executed
by intelligent agents in a virtual environment or physical robots designed to
automate a task (Ghallab et al., 2004). These systems are defined by complex
parameter spaces that require high dimensional calculations.
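As a deliberately tiny stand-in for this idea, the sketch below greedily assigns visit requests to open clinic slots. Real planning and scheduling systems search far larger parameter spaces with optimization or constraint-solving techniques; the slots and patient names here are invented.

```python
# Greedy assignment of requested visits to the earliest acceptable open slot.
slots = ["09:00", "09:30", "10:00", "10:30"]
requests = [("patient_1", {"09:30", "10:00"}),
            ("patient_2", {"09:00", "09:30"}),
            ("patient_3", {"09:00", "10:30"})]

schedule = {}
for patient, acceptable in requests:
    for slot in slots:
        if slot in acceptable and slot not in schedule:
            schedule[slot] = patient
            break

print(schedule)  # {'09:30': 'patient_1', '09:00': 'patient_2', '10:30': 'patient_3'}
```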
Computer vision focuses on how algorithms interpret, synthesize, and generate
inference from digital images or videos. It seeks to automate or provide human
cognitive support for tasks anchored in the human visual system (Sonka et al.,
2008). This field leverages multiple disciplines, including geometry, physics,
statistics, and learning theory (Forsyth and Ponce, 2003). One example is
deploying a computer vision tool in the intensive care unit to monitor patient
mobility (Yeung et al., 2019), because patient mobility is key for patient recovery
from severe illness and can drive downstream interventions.
Data are critical for delivering evidence-based health care and developing
any AI algorithm. Without data, the underlying characteristics of the process
and outcomes are unknown. This has been a gap in health care for many years,
but key trends (such as commodity wearable technologies) in this domain in
the past decade have transformed health care into a heterogeneous data-rich
environment (Schulte and Fry, 2019). It is now common in health and health care
for massive amounts of data to be generated about an individual from a variety
of sources, such as claims data, genetic information, radiology images, intensive
care unit surveillance, EHR care documentation, and medical device sensing and
surveillance. The reasons for these trends include the scaling of computational
capacity through decreases in cost of technology; widespread adoption of EHRs
promoted by the Health Information Technology for Economic and Clinical
Health (HITECH) Act; precipitous decreases in cost of genetic sample processing
(Wetterstrand, 2019); and increasing integration of medical- and consumer-grade
sensors. U.S. consumers used approximately 3 petabytes of Internet data every
minute of the day in 2018, generating possible health-connected data with each
use (DOMO, 2019). There are more than 300,000 health applications in app
stores, with more than 200 being added each day and an overall doubling of these
applications since 2015 (Aitken et al., 2017).
Data Aggregation
FIGURE 1-4 | Growth in facts affecting provider decisions versus human cognitive capacity.
SOURCES: NRC, 2009; presentation by William Stead at IOM meeting on October 8, 2007, titled “Growth in
Facts Affecting Provider Decisions Versus Human Cognitive Capacity.”
The growth in data generation, and the resulting need for data synthesis beyond
human capacity, has surpassed prior estimates; those estimates most likely understate
the magnitude of the current data milieu.
AI algorithms require large volumes of training data to achieve performance
levels sufficient for “success” (Shrott, 2017; Sun et al., 2017), and there are
multiple frameworks and standards in place to promote data aggregation for AI
use. These include standardized data representations that both manage data at
rest [1] and data in motion [2]. For data at rest, mature common data models (CDMs) [3],
such as Observational Medical Outcomes Partnership (OMOP), Informatics
for Integrating Biology & the Bedside (i2b2), the Patient-Centered Clinical
Research Network (PCORNet), and Sentinel, are increasingly providing a
backbone to format, clean, harmonize, and standardize data that can then be used
for the training of AI algorithms (Rosenbloom et al., 2017). Some of these CDMs
(e.g., OMOP) are also international in focus, which may support compatibility and
portability of some AI algorithms across countries. Some health care systems have
invested in the infrastructure for developing and maintaining at least one CDM
[1] Data at rest: Data stored in a persistent structure, such as a database or a file system, and not in active use.
[2] Data in motion: Data that are being transported from one computer system to another or between applications in the same computer.
[3] A common data model is a standardized, modular, extensible collection of data schemas designed to make it easier to build, use, and analyze data. Data from many sources are transformed into the model, allowing experts to make informed decisions about data representation and allowing users to easily reuse the data.
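To illustrate the kind of data-at-rest harmonization a CDM enables, the sketch below maps two differently shaped source extracts into one simplified, OMOP-like condition table. The column names, concept identifier, and mapping logic are simplified placeholders for illustration, not the actual OMOP CDM specification or its vocabulary services.

```python
import pandas as pd

# Two source systems that record the same clinical fact in different shapes.
site_a = pd.DataFrame({
    "patient_id": ["A-17"], "icd10": ["E11.9"], "dx_date": ["2019-03-02"]})
site_b = pd.DataFrame({
    "mrn": ["000482"], "diagnosis_code": ["E11.9"], "recorded": ["03/05/2019"]})

# Hypothetical mapping from source codes to a standard concept identifier,
# standing in for the vocabulary services a real CDM pipeline would use.
CONCEPT_MAP = {"E11.9": 201826}  # type 2 diabetes mellitus (illustrative ID)

def to_condition_occurrence(df, person_col, code_col, date_col):
    """Harmonize one source extract into a simplified OMOP-like layout."""
    return pd.DataFrame({
        "person_source_value": df[person_col],
        "condition_concept_id": df[code_col].map(CONCEPT_MAP),
        "condition_start_date": pd.to_datetime(df[date_col]).dt.date,
    })

condition_occurrence = pd.concat([
    to_condition_occurrence(site_a, "patient_id", "icd10", "dx_date"),
    to_condition_occurrence(site_b, "mrn", "diagnosis_code", "recorded"),
], ignore_index=True)
print(condition_occurrence)
```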
This could be similar to Henrietta Lacks’s biological tissue story where no consent
was obtained to culture her cells (as was the practice in 1951), nor were she or the
Lacks family compensated for the monetization of those cells (Skloot, 2011). There is a need
to address and clarify current regulations, legislation, and patient expectations
when patient data are used for building profit-motivated products or for research
(refer to Chapter 7).
The United States lacks national unique patient identifiers, the adoption of which
could greatly reduce error rates from record de-duplication during data aggregation.
In the meantime, several probabilistic patient linkage tools are attempting to fill
this gap (Kho et al., 2015; Ong et al., 2014, 2017). While there is evidence that AI
algorithms can overcome noise from erroneous linkage and duplication of patient
records through use of large volumes of data, the extent to which these problems
may impact algorithm accuracy and bias remains an open question.
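A minimal sketch of probabilistic linkage scoring, in the spirit of the Fellegi-Sunter approach such tools build on, is shown below. The fields, weights, and threshold are invented for illustration and are not taken from the cited linkage tools, which estimate their parameters from data.

```python
from dataclasses import dataclass

@dataclass
class Record:
    last_name: str
    birth_year: int
    zip_code: str

# Illustrative agreement/disagreement weights (log-odds-like scores); real
# probabilistic linkage tools estimate these from the data themselves.
WEIGHTS = {"last_name": 4.0, "birth_year": 2.5, "zip_code": 1.5}
DISAGREE = {"last_name": -2.0, "birth_year": -1.0, "zip_code": -0.5}
MATCH_THRESHOLD = 5.0

def link_score(a, b):
    """Sum field-level agreement/disagreement weights for a candidate pair."""
    score = 0.0
    for field in WEIGHTS:
        score += WEIGHTS[field] if getattr(a, field) == getattr(b, field) else DISAGREE[field]
    return score

a = Record("garcia", 1952, "37212")
b = Record("garcia", 1952, "37205")  # likely the same person at a new address
print(link_score(a, b), link_score(a, b) >= MATCH_THRESHOLD)  # 6.0 True
```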
Cloud computing, which places physical computational resources in widespread
locations, sometimes across international boundaries, presents another particularly
challenging issue. It can expose data to disastrous cybersecurity breaches while data
managers attempt to maintain compliance with many local and national laws,
regulations, and legal frameworks (Kommerskollegium, 2012).
Finally, to make AI truly revolutionary, it is critical to consider the power of
linking clinical and claims data with data beyond the narrow, traditional care
setting by capturing the social determinants of health as well as other patient-
generated data. This could include utilizing social media datasets to inform the
medical team of the social determinants that operate in each community. It
could also include developing publicly available datasets of health-related factors
such as neighborhood walkability, food deserts, air quality, aquatic environments,
environmental monitoring, and new areas not yet explored.
Data Bias
Diversity in AI Teams
FIGURE 1-5 | Framework for implementing artificial intelligence through the lens of
human rights values.
SOURCE: Reprinted with permission from Judy Estrin. Based on a slide Estrin shared at The Future of Human-Centered
AI: Governance Innovation and Protection of Human Rights Conference, Stanford University, April 16, 2019.
and more [5] (British Medical Association, 2018). Many of these on-demand workers
also have personal health issues that impact their lives (Bajwa et al., 2018). Thus,
it is judicious to involve political and social scientists to examine and plan for the
societal impacts of AI in health care.
Health care AI tools have the capability to impact trust in the health care system
on a national scale, especially if these tools lead to worse outcomes for some
patients or result in increasing inequities. Ensuring that these tools address, or at
least do not exacerbate, existing inequities will require thoughtful prioritization
of a national agenda that is not driven purely by profit, but instead by an
understanding of the important drivers of health care costs, quality, and access.
As a starting point, system leaders must identify key areas in which there
are known needs where AI tools can be helpful, where they can help address
existing inequities, and where implementation will result in improved outcomes
for all patients. These areas must also have an organizational structure in place
that addresses other ethical issues, such as patient–provider relationships, patient
privacy, transparency, notification, and consent, as well as technical development,
validation, implementation, and maintenance of AI tools within an ever evolving
learning health care system.
The implementation of health care AI tools requires that information
technologists, data scientists, ethicists and lawyers, clinicians, patients, and clinical
teams and organizations collaborate and prioritize governance structures and
processes. These teams will need a macro understanding of the data flows,
transformations, incentives, levers, and frameworks for algorithm development
and validation, as well as knowledge of ongoing changes required post-
implementation (see Chapter 5).
When developing and implementing those tools, it may be tempting to ignore
or delay the considerations of the needed legal and ethical organizational structure
to govern privacy, transparency, and consent. However, there are substantial risks
in disregarding these considerations, as witnessed in data uses and breaches,
inappropriate results derived from training data, and algorithms that reproduce and
scale prejudice via the underlying historically biased data (O’Neil, 2017). There
must also be an understanding of the ethical, legal, and regulatory structures that are
relevant to the approval, use, and deployment of AI tools, without which there will
be liability exposure, unintended consequences, and limitations (see Chapter 7).
[5] A 2018 survey showed that 7.7 percent of UK medical workers who are EU citizens would leave the United Kingdom for other regions if the United Kingdom withdrew from the European Union, as it did in January 2020.
the desirable attributes of humans who choose the path of caring for others
include, in addition to scientific knowledge, the capacity to love, to have
empathy, to care and express caring, to be generous, to be brave in advocating for
others, to do no harm, and to work for the greater good and advocate for justice.
How might AI help clinicians nurture and protect these qualities? This type of
challenge is rarely discussed or considered at conferences on AI and medicine,
perhaps because it is viewed as messy and hard to define. But, if the goal is for AI
to emulate the best qualities of human intelligence, it is precisely the territory
that cannot be avoided. (Israni and Verghese, 2019)
As discussed in Chapter 4, the U.S. health care system can draw important
lessons from the aviation industry, the history of which includes many examples
of automation addressing small challenges, but also occasionally creating
extraordinary disasters. The 2009 plane crash of an Air France flight from Rio to
Paris showed the potential
unintended consequence of designing airplanes that anyone can fly: anyone can take
you up on the offer. Beyond the degradation of basic skills of people who may once
have been competent pilots, the fourth-generation jets have enabled people who
probably never had the skills to begin with and should not have been in the cockpit.
As a result, the mental makeup of airline pilots has changed. (Langewiesche, 2014)
More recently, disasters with Boeing’s 737 Max caused by software issues offer
another caution: Competent pilots’ complaints about next-generation planes
were not given sufficient review (Sharpe and Robison, 2019).
Finally, just because technology makes it possible to deploy a particular solution
does not mean it is appropriate to do so. Recently, a doctor in California used a
robot with a video-link screen in order to tell a patient that he was going to die.
After a social media and public relations disaster, the hospital apologized, stating,
“We don’t support or encourage the use of technology to replace the personal
interactions between our patients and their care teams—we understand how
important this is for all concerned, and regret that we fell short of the family’s
expectations” (BBC News, 2019). Technochauvinism in AI will only further
complicate an already complex and overburdened health care system.
In summary, health care is a complex field that incorporates genetics, physiology,
pharmacology, biology, and other related sciences with the social, human, and
cultural experience of managing health. Health care is both a science and an art,
and challenges the notion that simple and elegant formulas will be able to explain
significant portions of health care delivery and outcomes (Toon, 2012).
PUBLICATION ORGANIZATION
This publication is structured around several distinct topic areas, each covered
in a separate chapter and independently authored by the listed expert team.
Figure 1-6 shows the relationship of the chapters.
Each chapter is intended to stand alone and represents the views of its authors.
In order to allow readers to read each chapter independently, there is some
redundancy in the material, with relevant references to other chapters where
appropriate. Each chapter initially summarizes the key content of the chapter
and concludes with a set of key considerations for improving the development,
adoption, and use of AI in health care.
Chapter 2 examines the history of AI, using examples from other industries,
and summarizes the growth, maturity, and adoption of AI in health care. The
REFERENCES
Abraham, C., and P. Sheeran. 2007. The health belief model. In Cambridge handbook
of psychology, health and medicine, edited by S. Ayers, A. Baum, C. McManus, S.
Newman, K. Wallston, J. Weinman, and R. West, 2nd ed., pp. 97–102. Cambridge,
UK: Cambridge University Press.
Agbo, C. C., Q. H. Mahmoud, and J. M. Eklund. 2019. Blockchain technology
in healthcare: A systematic review. Healthcare 7(2):E56.
Aggarwal, N. K., M. Rowe, and M. A. Sernyak. 2010. Is health care a right or
a commodity? Implementing mental health reform in a recession. Psychiatric
Services 61(11):1144–1145.
Aitken, M., B. Clancy, and D. Nass. 2017. The growing value of digital health: Evidence
and impact on human health and the healthcare system. https://www.iqvia.com/
institute/reports/the-growing-value-of-digital-health (accessed May 12, 2020).
Ashby, W. R. 1964. An introduction to cybernetics. London: Methuen and Co. Ltd.
ASTHO (Association of State and Territorial Health Officials). 2019. Medicaid and
public health partnership learning series. http://www.astho.org/Health-Systems-
Transformation/Medicaid-and-Public-Health-Partnerships/Learning-Series/
Managed-Care (accessed May 12, 2020).
Bajwa, U., D. Gastaldo, E. D. Ruggiero, and L. Knorr. 2018. The health of workers
in the global gig economy. Global Health 14(1):124.
Baras, J. D., and L. C. Baker. 2009. Magnetic resonance imaging and low back pain
care for Medicare patients. Health Affairs (Millwood) 28(6):w1133–w1140.
Basu, S., and R. Narayanaswamy. 2019. A prediction model for uncontrolled
type 2 diabetes mellitus incorporating area-level social determinants of health.
Medical Care 57(8):592–600.
Bauchner, H., and P. B. Fontanarosa. 2019. Waste in the US health care system.
JAMA 322(15):1463–1464.
BBC News. 2019. Man told he’s going to die by doctor on video-link robot. March 9. https://
www.bbc.com/news/world-us-canada-47510038 (accessed May 12, 2020).
Becker’s Healthcare. 2018. AI with an ROI: Why revenue cycle automation may be
the most practical use of AI. https://www.beckershospitalreview.com/artificial-
intelligence/ai-with-an-roi-why-revenue-cycle-automation-may-be-the-
most-practical-use-of-ai.html (accessed May 12, 2020).
British Medical Association. 2018. Almost a fifth of EU doctors have made plans
to leave UK following Brexit vote. December 6. https://psmag.com/series/the-
future-of-work-and-workers (accessed May 12, 2020).
Broussard, M. 2018. Artificial unintelligence: How computers misunderstand the world.
Cambridge, MA: MIT Press.
https://obamawhitehouse.archives.gov/sites/default/files/microsites/
ostp/2016_0504_data_discrimination.pdf (accessed May 12, 2020).
Murff, H. J., F. Fitzhenry, M. E. Matheny, N. Gentry, K. L. Kotter, K. Crimin,
R. S. Dittus, A. K. Rosen, P. L. Elkin, S. H. Brown, and T. Speroff.
2011. Automated identification of postoperative complications within
an electronic medical record using natural language processing. JAMA
306:848–855.
NAM (National Academy of Medicine). 2019a. Leadership Consortium for a Value
& Science-Driven Health System. https://nam.edu/programs/value-science-
driven-health-care (accessed May 12, 2020).
NAM. 2019b. Digital learning. https://nam.edu/programs/value-science-driven-
health-care/digital-learning (accessed May 12, 2020).
Nassif, A. B., I. Shahin, I. Attilli, M. Azzeh, and K. Shaalan. 2019. Speech recognition
using deep neural networks: A systematic review. IEEE Access 7:19143–19165.
https://ieeexplore.ieee.org/document/8632885.
NRC (National Research Council). 2009. Computational technology for effective
health care: Immediate steps and strategic directions. Washington, DC: The National
Academies Press. https://doi.org/10.17226/12572.
OECD (Organisation for Economic Co-operation and Development). 2017.
Health at a glance 2017. https://www.oecd-ilibrary.org/content/publication/
health_glance-2017-en (accessed May 12, 2020).
OHDSI (Observational Health Data Sciences and Informatics). 2019. Home.
https://ohdsi.org (accessed May 12, 2020).
Ohno-Machado, L., Z. Agha, D. S. Bell, L. Dahm, M. E. Day, J. N. Doctor,
D. Gabriel, M. K. Kahlon, K. K. Kim, M. Hogarth, M. E. Matheny, D. Meeker,
J. R. Nebeker, and pSCANNER team. 2014. pSCANNER: Patient-centered
scalable national network for effectiveness research. Journal of the American
Medical Informatics Association 21(4):621–626.
ONC (The Office of the National Coordinator for Health Information
Technology). 2019. 21st Century Cures Act: Interoperability, information
blocking, and the ONC Health IT Certification Program. Final Rule. Federal
Register 84(42):7424.
O’Neil, C. 2017. Weapons of math destruction: How big data increases inequality and
threatens democracy. New York: Broadway Books.
Ong, T. C., M. V. Mannino, L. M. Schilling, and M. G. Kahn. 2014. Improving
record linkage performance in the presence of missing linkage data. Journal of
Biomedical Informatics 52:43–54.
Ong, T., R. Pradhananga, E. Holve, and M. G. Kahn. 2017. A framework for
classification of electronic health data extraction-transformation-loading
Sharpe, A., and P. Robison. 2019. Pilots flagged software problems on Boeing
jets besides Max. Bloomberg. https://www.bloomberg.com/news/articles/
2019-06-27/boeing-pilots-flagged-software-problems-on-jets-besides-the-
max (accessed May 12, 2020).
Shi, M., S. Sacks, Q. Chen, and G. Webster. 2019. Translation: China’s personal
information security specification. New America. https://www.newamerica.
org/cybersecurity-initiative/digichina/blog/translation-chinas-personal-
information-security-specification (accessed May 12, 2020).
Shortliffe, E. H., and B. G. Buchanan. 1975. A model of inexact reasoning
in medicine. Mathematical Biosciences 23(3–4):351–379. doi: 10.1016/0025-
5564(75)90047-4.
Shrott, R. 2017. Deep learning specialization by Andrew Ng—21 lessons learned.
Medium. https://towardsdatascience.com/deep-learning-specialization-by-
andrew-ng-21-lessons-learned-15ffaaef627c (accessed May 12, 2020).
Skloot, R. 2011. The immortal life of Henrietta Lacks. New York: Broadway Books.
Sonka, M., V. Hlavac, and R. Boyle. 2008. Image processing, analysis, and machine
vision, 4th ed. Boston, MA: Cengage Learning.
Stanford Engineering. 2019. Carlos Bustamante: Genomics has a diversity problem.
https://engineering.stanford.edu/magazine/article/carlos-bustamante-
genomics-has-diversity-problem (accessed May 12, 2020).
Sullivan, A. 2010. Western, educated, industrialized, rich, and democratic.
The Daily Dish. October 4. https://www.theatlantic.com/daily-dish/archive/
2010/10/western-educated-industrialized-rich-and-democratic/181667
(accessed May 12, 2020).
Sun, C., A. Shrivastava, S. Singh, and A. Gupta. 2017. Revisiting unreasonable
effectiveness of data in deep learning era. https://arxiv.org/pdf/1707.02968.pdf
(accessed May 12, 2020).
Toon, P. 2012. Health care is both a science and an art. British Journal of General
Practice 62(601):434.
Vayena, E., A. Blasimme, and I. G. Cohen. 2018. Machine learning in medicine:
Addressing ethical challenges. PLoS Medicine 15(11):e1002689.
Verghese, A. 2018. How tech can turn doctors into clerical workers. The New
York Times, May 16. https://www.nytimes.com/interactive/2018/05/16/
magazine/health-issue-what-we-lose-with-data-driven-medicine.html
(accessed May 12, 2020).
Wakabayashi, D. 2019. Google and the University of Chicago are sued over
data sharing. The New York Times. https://www.nytimes.com/2019/06/
26/technology/google-university-chicago-data-sharing-lawsuit.html (accessed
May 12, 2020).
West, S. M., M. Whittaker, and K. Crawford. 2019. Gender, race and power in
AI. AI Now Institute. https://ainowinstitute.org/discriminatingsystems.html
(accessed May 12, 2020).
Wetterstrand, K. A. 2019. DNA sequencing costs: Data from the NHGRI Genome
Sequencing Program (GSP). https://www.genome.gov/sequencingcostsdata
(accessed May 12, 2020).
Whittaker, M., K. Crawford, R. Dobbe, G. Fried, E. Kaziunas, V. Mathur,
S. M. West, R. Richardson, J. Schultz, and O. Schwartz. 2018. AI Now report
2018. AI Now Institute at New York University. https://stanford.app.box.
com/s/xmb2cj3e7gsz5vmus0viadt9p3kreekk (accessed May 12, 2020).
Wiens, J., S. Saria, M. Sendak, M. Ghassemi, V. X. Liu, F. Doshi-Velez, K. Jung,
K. Heller, D. Kale, M. Saeed, P. N. Ossorio, S. Thadaney-Israni, and A. Goldenberg.
2019. Do no harm: A roadmap for responsible machine learning for health care.
Nature Medicine 25(9):1337–1340.
Witten, I. H., E. Frank, M. A. Hall, and C. J. Pal. 2016. Data mining: Practical machine
learning tools and techniques. Burlington, MA: Morgan Kaufmann.
Yeung, S., F. Rinaldo, J. Jopling, B. Liu, R. Mehra, N. L. Downing, M. Guo,
G. M. Bianconi, A. Alahi, J. Lee, B. Campbell, K. Deru, W. Beninati, L. Fei-Fei,
and A. Milstein. 2019. A computer vision system for deep-learning based
detection of patient mobilization activities in the ICU. NPJ Digital Medicine 2:11.
doi: 10.1038/s41746-019-0087-z.
Zarsky, T. 2016. The trouble with algorithmic decisions: An analytic road map to
examine efficiency and fairness in automated and opaque decision making. Science,
Technology, & Human Values 41(1):118–132. doi: 10.1177/0162243915605575.
INTRODUCTION
This chapter first acknowledges the roots of artificial intelligence (AI). We then
briefly touch on areas outside of medicine where AI has had an impact and
highlight where lessons from these other industries might cross into health care.
For decades, the attempt to capture knowledge in the form of a book has
been challenging, as indicated by the adage “any text is out of date by the time
the book is published.” However, in 2019, with what has been determined by
some analyses as exponential growth in the field of computer science and AI in
particular, change is happening at a pace that renders sentences in this chapter out
of date almost immediately. To stay current, we can no longer rely on monthly
updates from a stored PubMed search. Rather, daily news feeds from sources such
as the Association for the Advancement of Artificial Intelligence or arXiv [1] are
necessary. As such, this chapter contains references to both historical publications
as well as websites and web-based articles.
It is surpassingly difficult to define AI, principally because it has always been
loosely spoken of as a set of human-like capabilities that computers seem about
ready to replicate. Yesterday’s AI is today’s commodity computation. Within that
caveat, we aligned this chapter with the definition of AI in Chapter 1. A formal
definition of AI starts with the Oxford English Dictionary: “The capacity of
computers or other machines to exhibit or simulate intelligent behavior; the
field of study concerned with this,” or Merriam-Webster online: “1: a branch
[1] See https://arxiv.org/list/cs.LG/recent.
HISTORICAL PERSPECTIVE
If the term “artificial intelligence” has a birthdate, it is August 31, 1955, when
John McCarthy, Marvin L. Minsky, Nathaniel Rochester, and Claude E. Shannon
submitted “A Proposal for the Dartmouth Summer Research Project on Artificial
Intelligence.” The second sentence of the proposal reads, “The study is to proceed
on the basis of the conjecture that every aspect of learning or any other feature
of intelligence can in principle be so precisely described that a machine can
be made to simulate it” (McCarthy et al., 2006). Naturally, the proposal and
the resulting conference—the 1956 Dartmouth Summer Research Project on
Artificial Intelligence—were the culmination of decades of thought by many
others (Buchanan, 2005; Kline, 2011; Turing, 1950; Wiener, 1948). Although the
conference produced neither formal collaborations nor tangible outputs, it clearly
galvanized the field (Moor, 2006).
Thought leaders in this era saw the future clearly, although optimism was
substantially premature. In 1960, J. C. R. Licklider wrote
The hope is that, in not too many years, human brains and computing machines
will be coupled together very tightly, and that the resulting partnership will think
as no human brain has ever thought and process data in a way not approached by
the information-handling machines we know today. (Licklider, 1960)
[2] See Newquist (1994) for a thorough review of the birth, development, and decline of early AI.
There are many industries outside of health care that are further along in their
adoption of AI into their workflows. The following section highlights a partial
list of those industries and discusses aspects of AI use in these industries to be
emulated and avoided.
Users
Automotive
Of all of the industries making headlines with the use of AI, the self-driving car
has most significantly captured the public’s imagination (Mervis, 2017). In concept,
a self-driving car is a motor vehicle that can navigate and drive its occupants without
their interaction. Whether this should be the aspirational goal (i.e., “without their
interaction”) is a subject of debate. For this discussion, it is more important to
note that the component technologies have been evolving publicly for some
years. Navigation has evolved from humans reading paper maps to satellite-based
global positioning system (GPS)-enabled navigation devices, to wireless mobile
telecommunications networks that evolved from analog to increasingly broadband
digital technologies (2G to 3G to 4G to 5G), and most recently, navigation systems
that supplement mapping and simple navigation with real-time, crowd-sourced
traffic conditions (Mostafa et al., 2018). In research contexts, ad hoc networks
enable motor vehicles to communicate directly with each other about emergent
situations and driving conditions (Zongjian et al., 2016).
The achievement of successful self-driving cars has been and continues to be
evolutionary. In terms of supporting the act of driving itself, automatic transmissions
and anti-lock braking systems were early driver-assistance technologies. More
recently, we have seen the development of driver-assistance AI applications that
rely on sensing mechanisms such as radar, sonar, lidar, and cameras with signal
processing techniques, which enable lane departure warnings, blind spot assistance,
following distance alerts, and emergency braking (Sujit, 2015).
This recent use of cameras begins to distinguish AI techniques from the prior
signal processing techniques. AI processes video data from the cameras at a level
of abstraction comparable to that at which a human comprehends. That is, the
machine extracts objects such as humans, cyclists, road signs, other vehicles, lanes,
and other relevant factors from the video data and has been programmed to
identify and interpret the images in a way that is understandable to a human. This
combination of computer vision with reasoning comprises a specialized AI for
driving.
However, for all the laudable goals, including improving driving safety, errors
remain and sometimes those errors are fatal. This is possibly the single most
important lesson that the health care domain can learn from the increasing use of
AI in other industries. As reported in the press, the woman killed in spring 2018
as she walked her bicycle across the street was sensed by the onboard devices,
but the software incorrectly classified her as an object for which braking was
unnecessary. It was also reported that the “backup” driver of this autonomous
vehicle was distracted, watching video on a cell phone (Laris, 2018).
The point of the above example is not that AI (in this case a self-driving
car) is evil. Rather, we need to understand AI not in isolation but as part of a
human–AI “team.” Certainly, humans without any AI assistance do far worse; on
average in 2017, 16 pedestrians were killed each day in traffic crashes (NHTSA,
2018). Reaching back to the 1960 quote from Licklider, it is important to note
that the title of his article was “Man–Computer Symbiosis.” In this example, the
driver–computer symbiosis failed. Even the conceptualization that the human
was considered a backup was wrong. The human is not just an alternative to
AI; the human is an integral part of the complete system. For clinicians to
effectively manage this symbiosis, they must (1) understand their own weaknesses
(e.g., fatigue, biases), (2) understand the limits of the sensors and analytics, and
(3) be able to assist or assume complete manual control of the controlled process
in time to avoid an unfortunate outcome. AI must be viewed as a team member,
not an “add-on” (Johnson and Vera, 2019).
Other examples of AI outside of medicine are outlined below. The theme of
symbiosis is ubiquitous.
Engineers and architects have long applied technology to enhance their design,
and AI is set to accelerate that trend (Noor, 2017). Unusual AI-generated structural
designs are in application today. A well-known example is the partition in the
Airbus A320, in which AI algorithms utilized biomimetics to design a structure
almost half the weight of, and equally as strong as, the previous design (Micallef, 2016).
Finance has also been an early adopter of machine learning and AI techniques.
The field of quantitative analytics was born in response to the computerization
of the major trading exchanges. This has not been a painless process. One of
the early “automated” trading strategies, portfolio insurance, is widely believed
to have either caused or exacerbated the 1987 stock market crash (Bookstaber,
2007). The failure of Long-Term Capital Management offers another cautionary
example (Lowenstein, 2011). This fund pursued highly leveraged arbitrage trades,
where the pricing and leverage were algorithmically determined. Unexpected
events caused the fund to fail spectacularly, requiring an almost $5 billion bailout
from various financial institutions. Despite these setbacks, today all major banks
and a tremendous number of hedge funds pursue trading strategies that rely on
systematic machine learning or AI techniques. Most visible are the high-frequency
trading desks, which rely on AI technologies to place, cancel, and execute orders
in as little as one-hundredth of a microsecond, far faster than a human
can think, react, and act (Seth, 2019).
Media
minute, and key, among others. Then, it compares this structured representation
of the song to others that have been successful in order to identify similarity, and
hence the new song’s likelihood of also becoming a hit. The general complaint
that all the music on the radio “sounds the same” may be based in part on the
need to conform to the styles “approved” by the algorithms.
An emerging AI capability is generative art. Google initially released software
called Deep Dream, which was able to create art in the style of famous artists,
such as Vincent van Gogh (Mordvintsev et al., 2015). This technique is now used
in many cell phone apps, such as Prisma Photo Editor [3], as “filters” to enhance
personal photography.
Another more disturbing use of AI surfaces in the trend known as “deepfakes,”
technology that enables face and voice swapping in both audio and video
recordings (Chesney and Citron, 2018). The deepfake technique can be used to
create videos of people saying and doing things that they never did, by swapping
their faces, bodies, and other features onto videos of people who did say or do what
is portrayed in the video. This initially emerged as fake celebrity pornography, but
academics have demonstrated that the technique can also be used to create fake
political videos (BBC News, 2017). The potential effect of such technology, when
coupled with the virality of social networks, for the dissemination of false content
is terrifying. Substantial funding is focused on battling deepfakes (Villasenor,
2019). An ethical, societal, and legal response to this technology has yet to emerge.
Security is well suited to the application of AI, because the domain exists to
detect the rare exception, and vigilance in this regard is a key strength of all
computerized algorithms. One current application of AI in security is automated
license plate reading, which relies on basic computer vision (Li et al., 2018).
Because license plates conform to strict standards of size, color, shape, and location,
the problem is well constrained and thus suitable for early AI.
Predictive policing has captured the public imagination, potentially due to
popular representations in science fiction films such as Minority Report (Perry
et al., 2013). State-of-the-art predictive policing technology identifies areas and
times of increased risk of crime rather than identifying the victim or perpetrator
(Kim et al., 2018). Police departments typically utilize such tools as part of larger
strategies. However, implementations of these technologies can propagate racial,
gender, and other kinds of profiling when based on historically biased datasets
(Caliskan et al., 2017; Garcia, 2016) (see Chapter 1).
[3] See https://prisma-ai.com.
Space Exploration
AI BY INDIVIDUAL FUNCTIONS
Thus, an AI system typically receives input from sensors (afferent systems) and
operates in the environment through displays/effectors (efferent systems). These
capture reality, represent and reason over it, and then affect reality, respectively.
The standard representation of this is a keyboard and mouse as inputs, a central
processing unit (CPU) and data storage units for processing, and a monitor for
output to a human user. A robot with AI may contain sonar for inputs, a CPU,
and motorized wheels for its outputs. More sophisticated AIs, such as a personal
assistant, may have to interpret and synthesize data from multiple sensors such
as a microphone, a camera, and other data inputs in order to interact with a
user through a speaker or screen. As each of the effectors, representations, and
AI reasoning systems improves, it becomes more seamless, or human-like, in its
capabilities.
AI Technologies
AI in the guise of “the next impossible thing computers will do,” almost by
definition, occupies the forefront of technology at whatever time it is considered.
As a result, AI is often conflated with its enabling technology. For instance, in the
1980s there were hardware machines, called Lisp machines, specifically created
to execute the AI algorithms of the day (Phillips, 1999). Today, technologies such
as graphics processing units (GPUs), the Internet of Things (IoT), and cloud
computing are closely associated with AI, while not being AI in and of themselves.
In this section, we briefly review and clarify.
GPUs and, recently, tensor processing units (TPUs) are computer hardware
elements specialized to perform mathematical calculations rapidly. They are
much like the widely understood CPUs, but rather than being generalized so
that they are able to perform any operation, GPUs and TPUs are specialized to
perform calculations more useful to machine learning algorithms and hence AI
systems. The operations in question are linear algebra operations such as matrix
multiplication. The GPUs and TPUs enable AI operations.
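A small sketch of that division of labor is below, assuming PyTorch as the framework (an assumption, not something prescribed by the text): the same matrix multiplication runs on the general-purpose CPU and, if one is available, on a GPU.

```python
import torch

# The core operation GPUs and TPUs accelerate: a large matrix multiplication.
a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

c_cpu = a @ b  # runs on the general-purpose CPU

if torch.cuda.is_available():
    # Moving the operands to the GPU lets the same linear algebra run on
    # hardware specialized for it; results agree within floating-point tolerance.
    c_gpu = (a.cuda() @ b.cuda()).cpu()
    print(torch.allclose(c_cpu, c_gpu, atol=1e-2))
else:
    print("No GPU detected; the computation above ran on the CPU only.")
```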
IoT is the movement to collect sensor data from all manner of physical devices
and make them available on the Internet. Examples abound including lightbulbs,
doors, cameras, and cars; theoretically anything that can be manufactured might
be included. IoT is associated with AI because the data that flow from these
devices comprise the afferent arm of AI systems. As IoT devices proliferate,
the range of domains to which AI can be applied expands. In the emergent
Internet of Medical Things, patient-generated physiological measurements
(e.g., pulse oximeters and sphygmomanometers) are added to the data collected
from these “environmental” devices. It will be crucial that we “understand the
limitations of these technologies to avoid inappropriate reliance on them for
diagnostic purposes” (Deep Blue, 2019; Freedson et al., 2012) and appreciate the
social risks potentially created by “intervention-generated inequalities” (Veinot
et al., 2018).
Cloud computing abstracts computation by separating the computer services
from the proximate need for a physical computer. Large technology companies
such as Amazon, Google, Microsoft, and others have assembled vast warehouses
filled with computers. These companies sell access to their computers and, more
specifically, the services they perform over the Internet, such as databases, queues,
and translation. In this new landscape, a user requiring a computer service can
obtain that service without owning the computer. The major advantage of this
is that it relieves the user of the need to obtain and manage costly and complex
infrastructure. Thus, small and relatively technologically unsophisticated users,
including individuals and companies, may benefit from advanced technology.
Most data now are collected in clouds, which means that the complex computing
hardware (such as GPU machines) and services (such as natural language
processing [NLP]) needed for AI are easily obtainable. Cloud computing also
creates challenges in multinational data storage and other international law
complexities, some of which are briefly discussed in Chapter 7.
Computers are recognized as superior machines for their rigorous logic and
expansive calculation capabilities, but the point at which formal logic becomes
“thinking” or “intelligence” has proven difficult to pinpoint (Turing, 1950).
Expert systems that have been successful in military and industrial settings have
captured the imagination of the public with the Deep Blue versus Kasparov chess
matches. More recently, Google DeepMind’s AlphaGo defeated Lee Sedol at the
game Go, using deep learning methods (Wikipedia, 2019a).
Adaptive Learning
A defining feature of a machine learning system is that the programmer does not
instruct the computer to perform a specific task but, rather, instructs the computer
how to learn a desired task from a provided dataset. Programs such as deep learning,
reinforcement learning, gradient boosting, and many others comprise the set of
machine learning algorithms. The programmer also provides a set of data and
describes a task, such as images of cats and dogs and the task to distinguish the two.
The computer then executes the machine learning algorithm upon the provided
data, creating a new, derivative program specific to the task at hand. This is called
training. That program, usually called the model, is then applied to a real-world
problem. In a sense, then, we can say that the computer has created the model.
The machine learning process described above comprises two phases, training
and application. Once learning is complete, it is assumed that the model is
unchanging. However, an unchanging model is not strictly necessary. A machine
learning algorithm can alternatively continue to supplement the original training
data with data and performance encountered in application and then retrain itself
with the augmented set. Such algorithms are called adaptive because the model
adapts over time.
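The sketch below illustrates the two phases, and an adaptive update, on synthetic data. The use of scikit-learn, the synthetic labels, and the retraining trigger are assumptions made for illustration rather than a prescribed design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training phase: learn a model from an initial labeled dataset.
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# Application phase: the fixed model scores new cases.
X_new = rng.normal(size=(50, 3))
predictions = model.predict(X_new)

# Adaptive variant: once outcomes for the new cases become known, fold them
# into the training set and retrain, so the model tracks a shifting environment.
y_new_observed = (X_new[:, 0] + 0.5 * X_new[:, 1] > 0).astype(int)
X_train = np.vstack([X_train, X_new])
y_train = np.concatenate([y_train, y_new_observed])
model = LogisticRegression().fit(X_train, y_train)
print("Retrained on", len(y_train), "examples")
```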
All static models in health care degrade in performance over time as characteristics
of the environment and targets change, and this is one of the fundamental
distinctions between industrial and health care processes (addressed in more detail
in Chapter 6). Adaptive learning algorithms are one family of methods that can
adjust to this constantly changing environment, but they create
special challenges for regulation, because there is no fixed artifact to certify or
approve. To draw a health care analogy, the challenge would be like the U.S. Food
and Drug Administration (FDA) attempting to approve a molecule that evolved
over time. Although it is possible to certify that an adaptive algorithm performs to
specifications at any given moment and that the algorithm by which it learns is
sound, it is an open question as to whether the future states of an adaptive algorithm
Overview of Current Artificial Intelligence | 49
Reinforcement Learning
Understood best in the setting of video games, where the goal is to finish with
the most points, reinforcement learning examines each step and rewards positive
choices that the player makes based on the resulting proximity to a target end
state. Each additional move performed affects the subsequent behavior of the
automated player, known as the agent in reinforcement learning semantics. The
agent may learn to avoid certain locations to prevent falls or crashes, touch tokens,
or dodge arrows to maximize its score. Reinforcement learning with positive
rewards and negative repercussions is how robot vacuum cleaners learn about
walls, stairs, and even furniture that moves from time to time (Jonsson, 2019).
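The toy sketch below illustrates the reinforcement learning loop described above using tabular Q-learning on a one-dimensional corridor; the environment, rewards, and parameters are invented for illustration and have no clinical meaning.

```python
# Minimal tabular Q-learning sketch: an agent on a 1-D corridor learns to reach
# the rightmost cell (reward +1) and avoid the leftmost cell (reward -1).
import random

N_STATES, ACTIONS = 7, (-1, +1)         # positions 0..6; move left or right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    state = 3                                    # start in the middle
    while state not in (0, N_STATES - 1):        # the two ends are terminal
        # Epsilon-greedy: usually exploit current estimates, occasionally explore.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = state + action
        reward = 1.0 if next_state == N_STATES - 1 else (-1.0 if next_state == 0 else 0.0)
        # Positive and negative rewards shape the agent's subsequent behavior.
        best_next = 0.0 if next_state in (0, N_STATES - 1) else max(
            Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# After training, the learned policy moves right from every interior position.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(1, N_STATES - 1)})
```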
Computer Vision
Electronic noses are still a marginal but increasingly useful technology (Athamneh
et al., 2008). They couple chemosensors with classification systems in order to
detect simple and complex smells. Research into an electronic tongue is far less
mature, although early work similar to that on electronic noses exists. Additionally,
there is early research on computer generation of taste
or digital gustation, similar to the computer generation of speech; however, no
applications of this technology are apparent today.
KEY STAKEHOLDERS
United States
The United States has been coordinating strategic research and development
(R&D) investment in AI technologies for a number of years. In reaction to
the Soviet Union’s launch of Sputnik in 1957, the U.S. government founded
DARPA, which poured significant funding into computing, resulting in early
advances in computing and AI research.
China
United Kingdom
The United Kingdom in 2018 unveiled its AI Sector Deal, a broad industry plan to stimulate innovation,
build digital infrastructure, and develop workforce competency in data science,
engineering, and mathematics (UK Government, 2019).
Academic
Commercial Sector
Professional Societies
Nonprofits
Public–Private Partnerships
There are other notable public–private partnerships that have recently formed
to work in the AI sector to bridge collaboration in these spaces. For example,
the Partnership on AI is a large consortium of more than 90 for-profit and nonprofit
institutions in multiple countries that shares best practices, furthers research in
AI, advances public understanding of AI, and promotes socially beneficial
applications of AI (Partnership on AI, 2018). Another example is the AI Now
Institute, which is hosted by New York University but receives funding from a
variety of large for-profit and nonprofit institutions interested in AI development.
KEY CONSIDERATIONS
• As described in more detail in Chapters 3 and 4, history has shown that AI has
gone through multiple cycles of emphasis and disillusionment in use. It is critical
that all stakeholders be aware of and actively seek to educate and address public
expectations and understanding of AI (and associated technologies) in order to
manage hype and establish reasonable expectations, which will enable AI to be
applied in effective ways that have reasonable opportunities for sustained success.
• Integration of reinforcement learning into various elements of the health care
system will be critical in order to develop a robust, continuously improving
health care industry and to show value for the large efforts invested in data
collection.
• Support and emphasis for open source and free tools and technologies for use
and application of AI will be important to reduce cost and maintain wide use
of AI technologies as the domain transitions from exponential growth to a
future plateau stage of use.
• The domain needs strong patient and consumer engagement and empowerment
to ensure that preferences, concerns, and expectations are communicated to and
addressed ethically and appropriately by AI developers and users.
• Large-scale development of AI technologies in industries outside of health
care should be carefully examined for opportunities for incorporation of those
advances within health care. Evaluation of these technologies should include
consideration for whether they could effectively translate to the processes and
workflows in health care.
REFERENCES
AMA (American Medical Association). 2018. AMA passes first policy recommendations
on augmented intelligence. Press release. https://www.ama-assn.org/press-
center/press-releases/ama-passes-first-policy-recommendations-augmented
-intelligence.
Amani, F. A., and A. M. Fadlalla. 2017. Data mining applications in accounting:
A review of the literature and organizing framework. International Journal
of Accounting Information Systems 24:32–58. https://doi.org/10.1016/j.accinf.2016.12.004.
Athamneh, A. I., B. W. Zoecklein, and K. Mallikarjunan. 2008. Electronic nose
evaluation of cabernet sauvignon fruit maturity. Journal of Wine Research 19:
69–80.
BBC News. 2017. Fake Obama created using AI video tool. https://www.youtube.
com/watch?v=AmUC4m6w1wo.
Bookstaber, R. 2007. A demon of our own design: Markets, hedge funds, and the perils
of financial innovation. New York: John Wiley & Sons.
Buchanan, B. G. 2005. A (very) brief history of artificial intelligence. AI Magazine
26(4):53–60.
Caliskan, A., J. J. Bryson, and A. Narayanan. 2017. Semantics derived automatically
from language corpora contain human-like biases. Science 356:183–186.
Case, N. 2018. How to become a centaur. MIT Press. https://doi.org/10.21428/
61b2215c.
CBInsights. 2018. Artificial intelligence trends to watch in 2018. https://www.
cbinsights.com/research/report/artificial-intelligence-trends-2018 (accessed
November 5, 2019).
CBInsights. 2019. AI100: The artificial intelligence startups redefining industries.
https://www.cbinsights.com/research/artificial-intelligence-top-startups.
Chen, J. H., and R. B. Altman. 2015. Data-mining electronic medical records for
clinical order recommendations: Wisdom of the crowd or tyranny of the mob?
AMIA Joint Summits on Translational Science Proceedings. Pp. 435–439. https://
www.ncbi.nlm.nih.gov/pmc/articles/PMC4525236.
Chesney, R., and D. K. Citron. 2018. Deep fakes: A looming challenge for privacy,
democracy, and national security. California Law Review 107. https://dx.doi.
org/10.2139/ssrn.3213954.
Chien, S., and K. L. Wagstaff. 2017. Robotic space exploration agents. Science
Robotics 2(7). https://doi.org/10.1126/scirobotics.aan4831.
Yu, A. 2019. How Netflix uses AI, data science, and machine learning—From a
product perspective. Medium. https://becominghuman.ai/how-netflix-uses-ai-
and-machine-learning-a087614630fe.
Zhang, D. D., G. Peng, and Y. O. Yao. 2019. Artificial intelligence or intelligence
augmentation? Unravelling the debate through an industry-level analysis.
SSRN. https://dx.doi.org/10.2139/ssrn.3315946.
Zongjian, H., C. Jiannong, and L. Xuefeng. 2016. SDVN: Enabling rapid network
innovation for heterogeneous vehicular communication. IEEE Network 30:10–15.
Suggested citation for Chapter 2: Fackler, J., and E. Jackson. 2022. Overview
of current artificial intelligence. In Artificial intelligence in health care: The hope,
the hype, the promise, the peril. Washington, DC: National Academy of Medicine.
3
HOW ARTIFICIAL INTELLIGENCE IS CHANGING
HEALTH AND HEALTH CARE
INTRODUCTION
these stakeholders. These examples are not exhaustive. The following sections
explore the promise of AI in health care in more detail.
Conversational Agents
Conversational agents can engage in two-way dialogue with the user via speech
recognition, natural language processing (NLP), natural language understanding,
and natural language generation (Laranjo et al., 2018). AI is behind many of them.
These interfaces may include text-based dialogue, spoken language, or both. They
are called, variously, virtual agents, chatbots, or chatterbots. Some conversational
agents present a human image (e.g., the image of a nurse or a coach) or nonhuman
image (e.g., a robot or an animal) to provide a richer interactive experience. These
are called embodied conversational agents (ECAs). These visible characters provide a
richer and more convincing interactive experience than non-embodied, voice-only
agents such as Apple’s Siri, Amazon’s Alexa, or Microsoft’s Cortana. These embodied
agents can also communicate nonverbally through hand gestures and facial expressions.
In the “self-management” domain, conversational agents already exist to address
depression, smoking cessation, asthma, and diabetes. Although many chatbots
and ECAs exist, evaluation of these agents has, unfortunately, been limited
(Fitzpatrick et al., 2017).
The future potential for conversational agents in self-management seems high.
While simulating a real-world interaction, the agent may assess symptoms, report
back on outputs from health monitoring, and recommend a course of action
based on these varied inputs. Most adults say they would use an intelligent virtual
coach or an intelligent virtual nurse to monitor health and symptoms at home.
There is somewhat lower enthusiasm for mental health support delivered via this
method (Accenture, 2018).
Social support improves treatment outcomes (Hixson et al., 2015; Wicks
et al., 2012). Conversational agents can make use of humans’ propensity to engage
socially with human-like agents.
Just-in-time adaptive interventions (JITAIs) tailor the type and timing of
treatment to users over time (Nahum-Shani et al., 2015; Spruijt-Metz and Nilsen,
2014). The JITAI makes decisions about when and how to intervene based on
response to prior intervention, as well as on awareness of current context, whether
internal (e.g., mood, anxiety, blood pressure) or external (e.g., location, activity).
JITAI assistance is provided when users are most in need of it or will be most
receptive to it. These systems can also tell a clinician when a problematic pattern
is detected. For example, a JITAI might detect when a user is in a risky situation
for substance abuse relapse—and deliver an intervention against it.
These interventions rely on sensors, rather than a user’s self-report, to detect
states of vulnerability or intervention opportunity. This addresses two key self-
management challenges: the high user burden of self-monitoring and the limitations
of self-awareness. As sensors become more ubiquitous in homes, in smartphones,
and on bodies, the data sources for JITAIs are likely to continue expanding. AI can
be used to allow connected devices to communicate with one another. (Perhaps
a glucometer might receive feedback from refrigerators regarding the frequency
and types of food consumed.) Leveraging data from multiple inputs can uniquely
enhance AI’s ability to provide real-time behavioral management.
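As a purely illustrative sketch of this decision logic, the fragment below encodes a simple context-triggered rule. The sensor fields, thresholds, and message are hypothetical stand-ins for what a real JITAI would learn and personalize over time.

```python
# Hypothetical sketch of a just-in-time adaptive intervention (JITAI) trigger.
# Field names, thresholds, and the message text are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Context:
    heart_rate: int                 # internal state, from a wearable sensor
    reported_stress: float          # 0.0-1.0, from a brief phone prompt
    at_high_risk_location: bool     # external context, e.g., a geofenced location
    hours_since_last_prompt: float

def decide_intervention(ctx: Context) -> Optional[str]:
    """Return a message to deliver now, or None to stay silent."""
    # Avoid over-prompting: user burden is itself a negative outcome.
    if ctx.hours_since_last_prompt < 4:
        return None
    # Combine internal and external context to detect a vulnerable, receptive moment.
    if ctx.at_high_risk_location and (ctx.reported_stress > 0.7 or ctx.heart_rate > 110):
        return "This looks like a tough moment. Want to try a 2-minute breathing exercise?"
    return None

print(decide_intervention(Context(118, 0.8, True, 6.0)))
```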
There are two main areas of opportunity for AI in clinical care: (1) enhancing
and optimizing care delivery, and (2) improving information management, user
experience, and cognitive support in electronic health records (EHRs). Strides
have been made in these areas for decades, largely through rule-based, expert-
designed applications typically focused on specific clinical areas or problems. AI
techniques offer the possibility of improving performance further.
Care Delivery
The amount of relevant data available for patient care is growing and will
continue to grow in volume and variety. Data recorded digitally through EHRs
only scratch the surface of the types of data that (when appropriately consented)
could be leveraged for improving patient care. Clinicians are beginning to have
access to data generated from wearable devices, social media, and public health
records; to data about consumer spending, grocery purchase nutritional value,
and an individual’s exposome; and to the many types of -omic data specific to an
individual. AI will likely have a profound effect on the entire clinical
care process, including prevention, early detection, risk/benefit identification,
diagnosis, prognosis, and personalized treatment.
The area of prediction, early detection, and risk assessment for individuals is
one of the most fruitful AI applications (Sennaar, 2018). In this chapter, we discuss
examples of such use; Chapters 5 and 6 provide thoughts about external evaluation.
Diagnosis
Surgery
Robotic surgery has been shown to improve the safety of interventions in which
clinicians are exposed to high doses of ionizing radiation and to make surgery possible in anatomic locations
not otherwise reachable by human hands (Shen et al., 2018; Zhao et al., 2018). As
autonomous robotic surgery improves, it is likely that surgeons will in some cases
oversee the movements of robots (Shademan et al., 2016).
The following sections describe a few areas that could benefit from AI-supported
tools integrated with EHR systems, including information management (e.g., clinical
documentation, information retrieval), user experience, and cognitive support.
Information Management
Cognitive Support
AI has the potential to not only improve existing clinical decision support
(CDS) modalities but also enable a wide range of innovations with the potential
to disrupt patient care. Improved cognitive support functions include smarter
CDS alerts and reminders as well as better access to peer-reviewed literature.
A core cause for clinicians’ dissatisfaction with EHR systems is the high incidence
of irrelevant pop-up alerts that disrupt the clinical workflow and contribute to
“alert fatigue” (McCoy et al., 2014). This problem is partially caused by the low
specificity of alerts, which are frequently based on simple and deterministic
handcrafted rules that fail to consider the full clinical context. AI can improve the
specificity of alerts and reminders by considering a much larger number of patient
and contextual variables (Joffe et al., 2012). It can provide probability thresholds
that can be used to prioritize alert presentation and determine alert format in the
user interface (Payne et al., 2015). It can also continuously learn from clinicians’
past behavior (e.g., by lowering the priority of alerts they usually ignore).
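The sketch below illustrates, with assumed thresholds and variable names, how a predicted probability and past clinician behavior could be mapped to an alert tier. It is a simplified sketch of the approach described above, not a validated rule set.

```python
# Illustrative only: map a model's predicted event probability to an alert tier.
# Thresholds, tier names, and the dismissal adjustment are assumptions.
def alert_priority(event_probability: float, past_dismissal_rate: float = 0.0) -> str:
    """Return how (or whether) to surface an alert for one patient and context."""
    # Learn from clinician behavior: alerts that are usually ignored get demoted.
    adjusted = event_probability * (1.0 - 0.5 * past_dismissal_rate)
    if adjusted >= 0.30:
        return "interruptive"   # pop-up requiring acknowledgment
    if adjusted >= 0.10:
        return "passive"        # non-blocking indicator in the chart
    return "suppressed"         # below threshold; do not alert

print(alert_priority(0.25, past_dismissal_rate=0.8))   # -> "passive"
```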
Next, we explore AI solutions for population and public health programs. These
include solutions that could be implemented by health systems (e.g., accountable
care organizations), health plans, or city, county, state, and federal public health
departments or agencies. Population health examines the distribution of health
outcomes within a population, the range of factors that influence the distribution
of health outcomes, and the policies and interventions that affect those factors
(Kindig and Stoddart, 2003). Population health programs are often implemented
through nontraditional partnerships among different sectors of the community—
public health, industry, academia, health care, local government entities, etc. On the
other hand, public health is the science of protecting and improving the health of
people and their communities (CDC Foundation, 2019). This work is achieved by
promoting healthy lifestyles, researching disease and injury prevention, and detecting,
preventing, and responding to infectious diseases. Overall, public health is concerned
with protecting the health of entire populations. These populations can be as small as
a local neighborhood or as big as an entire country or region of the world.
Public health professionals are focused on solutions for more efficient and
effective administration of programs, policies, and services; disease outbreak
detection and surveillance; and research. Relevant AI solutions are being
experimented with in a number of areas.
Disease Surveillance
These changes may occur due to business needs (rather than the needs of a flu
outbreak detection application) or due to changes in search behavior of consumers.
Finally, relying on such methods exclusively misses the opportunity to combine
them and co-develop them in conjunction with more traditional methods. As
Lazer et al. (2014) detail, combining traditional and innovative methods (e.g.,
Google Flu Trends) performs better than either method alone.
Researchers and solution developers have experimented with the integration
of case- and event-based surveillance (e.g., news and online media, sensors, digital
traces, mobile devices, social media, microbiological labs, and clinical reporting)
to arrive at dashboards and analysis approaches for threat verification. Such
approaches have been referred to as digital epidemiological surveillance and can
produce timelier data and reduce labor hours of investigation (Kostkova, 2013;
Zhao et al., 2015). Such analyses rely on AI’s capacities in spatial and spatiotemporal
profiling, environmental monitoring, and signal detection (i.e., from wearable
sensors). They have been successfully implemented to build early warning systems
for adverse drug events, fall detection, and air pollution (Mooney and Pejaver,
2018). The ability to rely on unstructured data such as photos, physicians’ notes,
sensor data, and genomic information, when enabled by AI, may lead to additional,
novel approaches in disease surveillance (Figge, 2018).
Moreover, participatory systems such as social media and listservs could be
relied on to solicit information from individuals as well as groups in particular
geographic locations. For example, such approaches may encourage a reduction
in unsafe behaviors that put individuals at risk for human immunodeficiency
virus (HIV) infection (Rubens et al., 2014; Young et al., 2017). In addition, it
has been demonstrated that psychiatric stressors can be detected from Twitter
posts in select populations through keyword-based retrieval and filters and the
use of neural networks (Du et al., 2018). However, how such AI solutions could
improve the health of populations or communities is less clear, due to the lack of
context for some tweets and because tweets may not reflect the true underlying
mental health status of a person who tweeted. Studies that retroactively analyze
the tweeting behavior of individuals with known suicide attempts or ideation, or
other mental health conditions, may allow refinement in such approaches.
Finally, AI and machine learning have been used to develop a dashboard to
provide live insight into opioid usage trends in Indiana (Bostic, 2018). This tool
enabled prediction of drug positivity for small geographic areas (i.e., hot spots),
allowing for interventions by public health officials, law enforcement, or program
managers in targeted ways. A similar dashboarding approach supported by AI
solutions has been used in Colorado to monitor HIV surveillance and outreach
interventions and their impact after implementation (Snyder et al., 2016). This
tool integrated data on regional resources with near-real-time visualization of
surveillance data.
Coordination of and payment for care in the United States are highly complex.
These processes involve the patient, providers, health care facilities, laboratories,
hospitals, pharmacies, benefit administrators, payers, and others. Before, during,
and after a patient encounter, administrative coordination occurs around
scheduling, billing, and payment.
Prior Authorization
Most health plans and pharmacy benefit managers require prior authorization
of devices, durable equipment, labs, and procedures. The process includes the
submission of patient information, the proposed request, and its justification.
Determinations require professional skill, analysis, and judgment. Automating
this process can reduce bias and improve the speed, consistency, and quality of
decisions.
There are a number of different ways that AI is applied today. For example,
AI could simply be used to sort cases to the appropriate level of reviewer (e.g.,
nurse practitioner, physician advisor, medical director). Or, AI could identify and
highlight the specific, relevant information in long documents or narratives to
support the reviewer’s decision.
TABLE 3-2 | Illustrative Examples of Artificial Intelligence Solutions to Aid in Health Care Administration Processes

| Topic | Opportunity | Value | Example Output/Intervention | Data |
| --- | --- | --- | --- | --- |
| Prior authorization (Rowley, 2016; Wince, 2018; Zieger, 2018) | Automate decisions on drugs, labs, or procedures | Reduced cost, efficiency, improved quality, reduced bias | Authorization or rejection | Relevant patient electronic health record (EHR) data |
| Fraud, waste (Bauder and Khoshgoftaar, 2017; da Rosa, 2018; He et al., 1997) | Identify appropriate or fraudulent claims | Reduced cost, improved care | Identification of targets for investigation | Provider claims data |
| Provider directory management | Maintain accurate information on providers | Reduced patient frustration through accurate provider availability, avoid Medicare penalties | Accurate provider directory | Provider data from many sources |
| Adjudication | Determine if a hospital should be paid for an admission versus observation | Improved compliance, accurate payments | Adjudication decision | Relevant patient EHR data |
| Automated coding (Huang et al., 2019; Li et al., 2018; Shi et al., 2017; Xiao et al., 2018) | Automate ICD-10^a coding of patient encounters | Improved compliance, accurate payments | ICD-10 coding | Relevant patient EHR data |
| Chart abstraction (Gehrmann et al., 2017) | Summarize redundant data into a coherent narrative or structured variables | Reduced cost, efficiency, improved quality | Accurate, clean narrative/problem list | Relevant patient EHR data |
| Patient scheduling (Jiang et al., 2018; Nelson et al., 2019; Sharma, 2016) | Identify no-shows and optimize scheduling | Improved patient satisfaction, faster appointments, provider efficiency | Optimized physician schedule | Scheduling history, EHR data |

^a The ICD-10-CM (International Classification of Diseases, 10th Revision, Clinical Modification) is a system used by physicians and other health care providers to classify and code all diagnoses, symptoms, and procedures recorded in conjunction with hospital care in the United States.
Automated Coding
Machine Learning
Deep Learning
Deep learning algorithms rely on large quantities of data and massive computing
resources, both of which have only recently become widely available. Deep learning can
identify underlying patterns in data well beyond the pattern-perceiving capacities
of humans. Deep learning and its associated techniques have become popular in
many data-driven fields of research. The principal difference between deep and
traditional (i.e., shallow) machine learning paradigms lies in the ability of deep
learning algorithms to construct latent data representations from a large number
of raw features, often through deep architectures (i.e., many layers) of artificial
neural networks. This “unsupervised feature extraction” sometimes permits
highly accurate predictions. Recent research on EHR data has shown that deep
learning predictive models can outperform traditional clinically used predictive
models for predicting early detection of heart failure onset (Choi et al., 2017),
various cancers (Miotto et al., 2016), and onset of and weaning from intensive
care unit interventions (Suresh et al., 2017).
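For readers unfamiliar with what a "deep architecture" looks like in code, the sketch below trains a small multi-layer network on synthetic stand-ins for EHR-derived features using scikit-learn; the architecture, data, and metric are illustrative assumptions, not the published models cited above.

```python
# Minimal sketch of a deep (multi-layer) neural network for a prediction task.
# Data are synthetic stand-ins for EHR-derived features; the architecture and
# hyperparameters are illustrative, not those of the studies cited in the text.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 30))                    # e.g., labs, vitals, history flags
signal = 1.5 * X[:, 0] - X[:, 1] * X[:, 2] + 0.5 * X[:, 3]
y = (signal + rng.normal(scale=0.5, size=2000) > 0).astype(int)   # outcome label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stacked hidden layers let the model build intermediate ("latent") representations.
model = MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=500, random_state=0)
model.fit(X_tr, y_tr)

print("held-out AUROC:", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))
```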
Nevertheless, the capacity of deep learning is a double-edged sword. The
downside of deep learning comes from exactly where its superiority to other
learning paradigms originates—that is, its ability to build and learn features.
The resulting complexity means that deep learning models offer little human
interpretability, because it is extremely hard to infer how the model arrives at
its predictions. Deep learning models are “black box” models, where the
internal workings of the algorithms remain unclear or mysterious to users of
these models. As in other black box AI approaches, there is significant resistance
to implementing deep learning models in the health care delivery process.
Detecting abnormal brain structure is much more challenging for humans and
machines than detecting a broken bone. Exciting work is being
done in this area with deep learning. One recent study predicted age from brain
images (Cole et al., 2017). Multimodal image recognition analysis has discovered
novel impairments not visible from a single view of the brain (e.g., structural
MRI versus functional MRI) (Plis et al., 2018). Companies such as Avalon AI1 are
commercializing this type of work.
Effective AI use does not always require new modeling techniques. Some work
at Massachusetts General Hospital in Boston uses a large selection of images
and combines established machine learning techniques with mature brain-image
analysis tools to explore what is normal for a child’s developing brain (NITRC,
2019; Ou et al., 2017). Other recent applications of AI to imaging data include
using machine learning on echocardiogram data to characterize types of heart
failure (Sanchez-Martinez et al., 2018). In addition, AI can aid in reducing noise
in real images (e.g., endoscopy) via “adversarial training.” It can smooth out
erroneous signals in images to enhance prediction accuracy (Mahmood et al.,
2018). AI is also being applied to moving images; gait analysis has long been done
by human observation alone, but it now can be performed with greater accuracy
by AI that uses video and sensor data. These techniques are being used to detect
Parkinson’s disease, to improve geriatric care, for sports rehabilitation, and in
other areas (Prakash et al., 2018). AI can also improve video-assisted surgery, for
example, by detecting colon polyps in real time (Urban et al., 2018).
1
See http://avalonai.mystrikingly.com.
AI can assist in analyzing clinical practice patterns from EHR data to develop
clinical practice models before such research can be distilled into literature or
made widely available in clinical decision support tools. This notion of “learning
from the crowd” stems from Condorcet’s jury theorem, which states that when each
member of a group is more likely than not to be correct, the group’s majority decision
is more likely to be correct than any individual member’s decision. (Think of the
“jellybeans in a jar” challenge—the average of everyone’s guesses is surprisingly
close to the true number.)
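The arithmetic behind the theorem can be checked directly; the short sketch below assumes each independent "expert" is correct 60 percent of the time and shows the majority vote's accuracy rising with group size.

```python
# Condorcet's jury theorem, numerically: if each independent voter is correct with
# probability p > 0.5, the majority vote beats any single voter, and its accuracy
# grows with group size. The value p = 0.6 below is an arbitrary assumption.
from math import comb

def majority_accuracy(n_voters: int, p: float) -> float:
    """Probability that more than half of n independent voters are correct."""
    needed = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p**k * (1 - p)**(n_voters - k)
               for k in range(needed, n_voters + 1))

for n in (1, 5, 15, 51):                  # odd group sizes avoid ties
    print(n, round(majority_accuracy(n, p=0.6), 3))
# Accuracy rises from 0.6 for a single voter toward 1.0 as the group grows.
```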
The most straightforward approach uses association rule mining to find patterns,
but this tends to find many false associations (Wright et al., 2010). Therefore, some
researchers have attempted to use more AI approaches such as Bayesian network
learning and probabilistic topic modeling (Chen et al., 2017; Klann et al., 2014).
Phenotyping
Drug Discovery
Machine learning has the capacity to make drug discovery faster, cheaper, and
more effective. Drug designers frequently apply machine learning techniques
to extract chemical information from large compound databases and to design
drugs with important biological properties. Machine learning can also improve
drug discovery by permitting a more comprehensive assessment of cellular
systems and potential drug effects. With the emergence of large chemical datasets
in recent years, machine and deep learning methods have been used in many
areas (Baskin et al., 2016; Chen et al., 2018; Lima et al., 2016; Zitnik et al., 2018).
These include
• predicting synthesis,
• biological activity of new ligands (illustrated in the sketch following this list),
• drug selectivity,
• pharmacokinetic and toxicological profiles,
• modeling polypharmacy side effects (due to drug–drug interactions), and
• designing de novo molecular structures and structure-activity models.
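The sketch below shows, in broad strokes, how ligand activity prediction is often framed: molecular fingerprints as features with a standard classifier on top. The SMILES strings, activity labels, and library choices (RDKit and scikit-learn) are assumptions made for illustration and are not drawn from the studies cited above.

```python
# Illustrative sketch of ligand activity prediction: compute Morgan fingerprints
# from SMILES strings and fit a random forest. Molecules and activity labels are
# toy placeholders, not real assay data.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

smiles = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1",
          "CCN(CC)CC", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "O=C(O)c1ccccc1O"]
labels = [0, 1, 0, 0, 1, 1]               # hypothetical active/inactive labels

def fingerprint(smi: str, n_bits: int = 1024) -> np.ndarray:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)   # copy the RDKit bit vector into numpy
    return arr

X = np.vstack([fingerprint(s) for s in smiles])
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)

# Score a new candidate molecule (acetaminophen, chosen arbitrarily).
candidate = "CC(=O)Nc1ccc(O)cc1"
print(model.predict_proba(fingerprint(candidate).reshape(1, -1))[0, 1])
```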
Large chemical databases have made drug discovery faster and cheaper. EHR
databases have brought millions of patients’ lives into the universe of statistical
learning. Research initiatives to link structured patient data with biobanks,
radiology images, and notes are creating a rich and robust analytical playground
for discovering new knowledge about human disease. Deep learning and other
new techniques are creating solutions that can operate on the scale required
to digest these multiterabyte datasets. The accelerating pace of discovery will
probably challenge the research pipelines that translate new knowledge back into
practice.
KEY CONSIDERATIONS
Augmented Intelligence
Partnerships
The central focus of health care will continue to expand from health care
delivery systems to a dispersed model that aggregates information about behavior,
traits, and environment in addition to medical symptoms and test results.
Market forces and privacy concerns or regulations may impede data sharing
and analysis (Roski et al., 2014). Stakeholders will need to creatively balance
Interoperability
AI will probably discover associations that have not yet been detected by humans
and make predictions that differ from prevailing knowledge and expertise. As a
result, some currently accepted practices may be abandoned, and best practice
guidelines will be adjusted.
If the output of AI systems is going to influence international guidelines,
developers of the applications will require fuller and more representative datasets
for training and testing.
The dissemination of innovation will occur rapidly, which on the one hand
may advance the adoption of new scientific knowledge but on the other may
encourage the rushed adoption of innovation without sufficient evidence.
REFERENCES
Kostkova, P. 2013. A roadmap to integrated digital public health surveillance: The
vision and the challenges. In Proceedings of the 22nd International Conference on World
Wide Web. New York: ACM. Pp. 687–694. https://www.researchgate.net/
publication/250963354_A_roadmap_to_integrated_digital_public_health_
surveillance_The_vision_and_the_challenges (accessed November 12, 2019).
Laranjo, L., A. G. Dunn, H. L. Tong, A. B. Kocaballi, J. Chen, R. Bashir, D. Surian,
B. Gallego, F. Magrabi, and A. Coiera. 2018. Conversational agents in healthcare:
A systematic review. Journal of the American Medical Informatics Association
25(9):1248–1258.
Lazer, D., R. Kennedy, G. King, and A. Vespignani. 2014. The parable of Google
flu: Traps in big data analysis. Science 343(6176):1203–1205.
Leider, N. 2018. AI could protect public health by monitoring water treatment
systems. AI in Healthcare News. https://www.aiin.healthcare/topics/
artificial-intelligence/ai-public-health-monitoring-water-treatment (accessed
November 12, 2019).
Li, M., Z. Fei, M. Zeng, F. Wu, Y. Li, Y. Pan, and J. Wang. 2018. Automated ICD-9
coding via a deep learning approach. IEEE/ACM Transactions on Computational
Biology and Bioinformatics 16(4):1193–1202. https://doi.org/10.1109/
TCBB.2018.2817488.
Lima, A. N., E. A. Philot, G. H. G. Trossini, L. P. B. Scott, V. G. Maltarollo, and K. M.
Honorio. 2016. Use of machine learning approaches for novel drug discovery.
Expert Opinion on Drug Discovery 11(3):225–239.
Liu, F., Z. Zhou, A. Samsonov, D. Blankenbaker, W. Larison, A. Kanarek, K. Lian, S.
Kambhampati, and R. Kijowski. 2018. Deep learning approach for evaluating
knee MR images: Achieving high diagnostic performance for cartilage lesion
detection. Radiology 289(1):160–169.
Lu, Y. 2018. The association of urban greenness and walking behavior: Using
Google StreetView and deep learning techniques to estimate residents’ exposure
to urban greenness. International Journal of Environmental Research and
Public Health 15:1576.
Maharana, A., and E. O. Nsoesie. 2018. Use of deep learning to examine the
association of the built environment with prevalence of neighborhood
adult obesity. JAMA Network Open 1(4):e181535. https://doi.org/10.1001/
jamanetworkopen.2018.1535.
Mahmood, F., R. Chen, and N. J. Durr. 2018. Unsupervised reverse domain adaptation
for synthetic medical images via adversarial training. IEEE Transactions on Medical
Imaging 37(12):2572–2581. https://doi.org/10.1109/TMI.2018.2842767.
Matheson, R. 2018. Machine-learning system determines the fewest, smallest doses
that could still shrink brain tumors. MIT News. http://news.mit.edu/2018/
artificial-intelligence-model-learns-patient-data-cancer-treatment-less-toxic-0810.
Zick, R. G., and J. Olsen. 2001. Voice recognition software versus a traditional
transcription service for physician charting in the ED. American Journal of
Emergency Medicine 19(4):295–298.
Zieger, A. 2018. Will payers use AI to do prior authorization? And will these
AIs make things better? Healthcare IT Today. https://www.healthcareittoday.
com/2018/12/27/will-payers-use-ai-to-do-prior-authorization-and-will-
these-ais-make-things-better (accessed November 12, 2019).
Zitnik, M., M. Agrawal, and J. Leskovec. 2018. Modeling polypharmacy side
effects with graph convolutional networks. Bioinformatics 34(13):i457–i466.
INTRODUCTION
AI technologies such as deep learning and machine learning are riding atop the
peak of inflated expectations for emerging technologies, as noted by the Gartner
Hype Cycle, which tracks relative maturity stages for emerging technologies (Chen
and Asch, 2017; Panetta, 2017) (see Figure 4-1). Without an appreciation for both
the capabilities and limitations of AI technology in medicine, we will predictably
crash into a “trough of disillusionment.” The greatest risk of all may be a backlash
that impedes real progress toward using AI tools to improve human lives.
Over the past decade, several factors have led to increasing interest in and escalating
hype of AI. There have been legitimate discontinuous leaps in computational
capacity, electronic data availability (e.g., ImageNet [Russakovsky et al., 2015] and
digitization of medical records), and perception capability (e.g., image recognition
[Krizhevsky et al., 2017]). Just as algorithms can now automatically name the
breed of a dog in a photo and generate a caption of a “dog catching a frisbee”
(Vinyals et al., 2017), we are seeing automated recognition of malignant skin
lesions (Esteva et al., 2017) and pathology specimens (Ehteshami et al., 2017).
Such functionality is incredible but can easily lead one to mistakenly assume that
the computer “knows” what skin cancer is and that a surgical excision is being
considered. It is expected that an intelligent human who can recognize an object
in a photo can also naturally understand and explain the context of what they are
seeing, but the narrow, applied AI algorithms atop the current hype cycle have
no such general comprehension. Instead, these algorithms are each designed to
complete specific tasks, such as answering well-formed multiple-choice questions.
With Moore’s law of exponential growth in computing power, the question
arises whether it is reasonable to expect that machines will soon possess greater
computational power than human brains (Saracco, 2018). This comparison may
not even make sense with the fundamentally different architectures of computer
processors and biological brains, because computers already can exceed human
brains by measures of pure storage and speed (Fischetti, 2011). Does this mean that
humans are headed toward a technological singularity (Shanahan, 2015; Vinge,
1993) that will spawn fully autonomous AI systems that continually self-improve
beyond the confines of human control? Roy Amara, co-founder of the Institute for
the Future, reminds us that “we tend to overestimate the effect of a technology in
the short run and underestimate the effect in the long run” (Ridley, 2017). Among
other reasons for caution, intelligence is not simply a function of computing power.
Increasing computing speed and storage makes a better calculator, but not a better
thinker. For the near future at least, this leaves us with fundamental design and
concept issues in (general) AI research that have remained unresolved for decades
(e.g., common sense, framing, abstract reasoning, creativity; Brooks, 2017).
Explicit advertising hyperbole may be one of the most direct triggers for
unintended consequences of hype. While such promotion is important to drive
interest and motivate progress, it can become counterproductive in excess.
Hyperbolic marketing of AI systems that will “outthink cancer” (Brown, 2017) can
ultimately set the field back when confronted by the hard realities in attempting to
deliver changes in actual patient lives (Ross and Swetlitz, 2017). Modern advances
do reflect important progress in AI software and data, but can shortsightedly
discount the “hardware” of a health care delivery system (people, policies, and
processes) needed to actually execute care. Limited AI systems can fail to provide
insights to clinicians beyond what they already knew, undercutting many hopes
for early warning systems and screening asymptomatic patients for rare diseases
(Butterfield, 2018). Ongoing research has a tendency to promote the latest
technology as a cure-all (Marcus, 2018), even if there is a “regression to regression”
where well-worn methods backed by a good data source can be as, or more, useful
than “advanced” AI methods in many applications (Razavian et al., 2015).
A combination of technical and subject domain expertise is needed to recognize
the credible potential of AI systems and avoid the backlash that will come from
overselling them. Yet, there is no need for pessimism if our benchmark is improving
on the current state of human health. Algorithms and AI systems cannot provide
“guarantees of fairness, equitability, or even veracity” (Beam and Kohane, 2018),
but no humans can either. The “Superhuman Human Fallacy” (Kohane, 2017) is
to dismiss computerized systems (or humans) that do not achieve an unrealizable
standard of perfection or improve on the best performing human. For example,
accidents attributed to self-driving cars receive outsized media attention even
though they occur far less frequently than accidents attributed to human-driven
cars (Felton, 2018). Yet, the potential outsized impact of automated technologies
reasonably makes us demand a higher standard of reliability (Stewart, 2019) even if
the necessary degree is unclear and may even cost more lives in opportunity cost
while awaiting perfection (Kalra and Groves, 2017). In health care, it is possible to
determine where even imperfect AI clinical augmentation can improve care and
reduce practice variation. For example, gaps exist now where humans commonly
misjudge the accuracy of screening tests for rare diagnoses (Manrai et al., 2014),
grossly overestimate patient life expectancy (Christakis and Lamont, 2000; Glare
et al., 2003), and deliver care of widely varied intensity in the last 6 months
of life (Barnato et al., 2007; Dartmouth Atlas Project, 2018). There is no need
to overhype the potential of AI in medicine when there is ample opportunity
(as reviewed in Chapter 3) to address existing issues with undesirable variability,
crippling costs, and impaired access to quality care (DOJ and FTC, 2015).
To find opportunities for automated predictive systems, stakeholders should
consider where important decisions hinge upon humans making predictions
with a clear outcome (Bates et al., 2014; Kleinberg et al., 2016). Though human
intuition is powerful, it is inevitably variable without a support system. One could
identify scarce interventions that are known to be valuable and use AI tools to
assist in identifying patients most likely to benefit. For example, an intensive
outpatient care team need not attend to everyone, but can be targeted to only
those patients that AI systems predict are at high risk of morbidity (Zulman
et al., 2017). In addition, there are numerous opportunities to deploy AI workflow
support to assist humans to rapidly answer or complete repetitive information
tasks (e.g., documentation, scheduling, and other back-office administration).
This will mark a fundamental change in the expectations for the next generation
of physicians (Silver et al., 2018). Though there is much upside in the potential
for the use of AI systems to improve health and health care, like all technologies,
implementation does not come without certain risks. This section outlines some
ways in which AI in health care may cause harm in unintended ways.
patterns (Schulam and Saria, 2017). More broadly, both humans and predictive
models can fail to generalize from training to implementation environments
because of many different types of dataset shift—shift in dataset characteristics
over time, in practice pattern, or across populations—posing a threat to model
reliability and the safety of downstream decisions made in practice (Subbaswamy
and Saria, 2018). Recent works have proposed that proactive learning techniques
are less susceptible to dataset shifts (Schulam and Saria, 2017; Subbaswamy et al.,
2019). These algorithms proactively correct for likely shifts in data.
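One common diagnostic for such shift, separate from the proactive methods cited above, is to check whether a simple classifier can distinguish training-era records from deployment-era records; if it can, the input distribution has drifted. The sketch below illustrates the idea on synthetic data.

```python
# "Domain classifier" check for dataset shift, on synthetic data: if a model can
# tell historical records from recent records, the input distribution has drifted
# and the deployed model's reliability deserves scrutiny.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X_historical = rng.normal(loc=0.0, size=(1000, 12))    # training-era data
X_recent = rng.normal(loc=0.4, size=(1000, 12))        # deployment-era data, shifted mean

X = np.vstack([X_historical, X_recent])
era = np.array([0] * 1000 + [1] * 1000)                # 0 = historical, 1 = recent

# AUROC near 0.5 means the two eras look alike; well above 0.5 signals shift.
auc = cross_val_score(LogisticRegression(max_iter=1000), X, era,
                      scoring="roc_auc", cv=5).mean()
print("shift-detection AUROC:", round(auc, 3))
```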
In addition to learning a model once, an alternative approach is to update
models over time so that they continuously adapt to local and recent data. Such
adaptive algorithms offer constant vigilance and monitoring for changing behavior.
However, this may exacerbate disparities if only well-resourced institutions can
deploy the expertise to do so. In addition, regulation and law, as reviewed in
Chapter 7, face significant challenges in addressing approval
and certification for continuously evolving systems.
Rule-based systems are explicitly authored by human knowledge engineers,
encoding their understanding of an application domain into a computing inference
engine. These are generally more explicit and interpretable in their intent, making
them easier to audit for safety and reliability. On the other hand, they take less
advantage of relationships that can be automatically inferred through data-driven
models and therefore are often less accurate. Integrating domain-knowledge
within learning-based frameworks, and combining these with methods for
measuring and proactively eliminating bias, provides a promising path forward
(Subbaswamy and Saria, 2018). Much of the literature on predictive modeling
is based on black box models that memorize associations. Increases in model
complexity can reduce both the interpretability and ability of the user to respond
to predictions in practical ways (Obermeyer and Emanuel, 2016). As a result,
these models are susceptible to unreliability, leading to harmful suggestions.
Evaluating for reliability and actionability is key in developing models that have
the potential to affect health outcomes. These issues are at the core of the tension
between “black box” and “interpretable” model algorithms that afford end users
some explanation for why certain predictions are favored.
Training reliable models depends on the training datasets being representative of
the population where the model will be applied. Learning from real-world data—
where insights can be drawn from patients similar to a given index patient—has
the benefit of leading to inferences that are more relevant, but it is important
to characterize populations where there are inadequate data to support robust
conclusions. For example, a tool may show acceptable performance on average
across individuals captured within a dataset but may perform poorly for specific
subpopulations because the algorithm has not had enough data to learn from.
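A minimal illustration of that subpopulation check is sketched below with synthetic data and an assumed subgroup label: overall performance can look acceptable while an under-represented group fares noticeably worse.

```python
# Sketch: evaluate a model overall and within a small subgroup. The data, the
# subgroup label, and the engineered performance gap are synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 5000
group = (rng.random(n) < 0.05).astype(int)     # 5 percent minority subgroup
X = rng.normal(size=(n, 8))
# Assume the outcome depends on a different feature in the minority subgroup,
# which the model cannot learn well from so few examples.
y = np.where(group == 0, X[:, 0] > 0, X[:, 5] > 0).astype(int)

split = n // 2
model = LogisticRegression().fit(X[:split], y[:split])
scores = model.predict_proba(X[split:])[:, 1]
y_te, g_te = y[split:], group[split:]

print("overall AUROC: ", round(roc_auc_score(y_te, scores), 2))
print("majority AUROC:", round(roc_auc_score(y_te[g_te == 0], scores[g_te == 0]), 2))
print("minority AUROC:", round(roc_auc_score(y_te[g_te == 1], scores[g_te == 1]), 2))
```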
Amplification or Exacerbation?
AI systems will generally make people more efficient at what they are already
doing, whether that is good or bad. Bias is not inherently undesirable, because
the whole point of learning from (clinical) practices is that there is an underlying
assumption that human experts are making nonrandom decisions biased toward
achieving desirable effects. Machine learning relying on observational data will
generally have an amplifying effect on our existing behavior, regardless of whether
that behavior is beneficial or only exacerbates existing societal biases. For instance,
Google Photos, an app that uses machine learning technology to organize images,
incorrectly identified people with darker skin tones as “gorillas,” an animal that
has historically been used as a racial slur (Lee, 2018). Another study found that
machine translation systems were biased against women due to the way in which
women were described in the data used to train the system (Prates et al., 2018).
In another example, Amazon developed a hiring algorithm based on its prior
hiring practices, which recapitulated existing biases against women (Dastin, 2018).
Although some of these algorithms were revised or discontinued, the underlying
issues will continue to be significant problems, requiring constant vigilance, as
well as algorithm surveillance and maintenance to detect and address them (see
Chapter 6). The need for continuous assessment of the ongoing safety of systems
is discussed in Chapter 7, including a call for significant changes in regulatory
compliance. Societal biases reflected in health care data may be amplified as
automated systems drive more decisions, as further addressed in Chapters 5 and 6.
AI Systems Transparency
Vulnerabilities
Most of this chapter focuses on the side effects of nonmalicious actors using
ethically neutral AI technology. Chapter 1 discusses some of the challenges in
the ethical uses of health care AI tools. However, it is also important to consider
how increasing automation opens new risks for bad actors to directly induce
harm, such as through overt fraud. E-mail gave us new ways to communicate and
increased productivity, but it also enabled new forms of fraud through spam and
phishing. Likewise, new health care technology may open up new streams for fraud
and abuse. After the widespread adoption of digital health records, data breaches
resulting in the release of millions of individuals’ private medical information
have become commonplace (Patil and Seshadri, 2014). These breaches will
likely increase in an era when our demand for health data exceeds its supply in
the public sector (Jiang and Bai, 2019; Perakslis, 2014). Health care systems are
increasingly vigilant, but ongoing attacks demonstrate that safeguarding against
a quickly evolving threat landscape remains exceedingly difficult (Ehrenfeld,
2017). The risk to personal data safety will continue to increase as AI becomes
mainstream and commercialized. Engaging the public on how and when their
secondary data are being used will be crucial to preventing public backlash as
we have seen with the Facebook–Cambridge Analytica data scandal (Cadwalladr
and Graham-Harrison, 2018). A recent study also indicates that hospital size
and academic environment could be associated with increased risk for breaches,
calling for better data breach statistics (Fabbri et al., 2017).
Health care data will not be the only target for attackers; the AI systems themselves
will become the subject of assault and manipulation. FDA has already approved
several AI systems for clinical use, some of which can operate without the oversight
of a physician. In parallel, the health care economy in the United States is projected
to represent 20 percent of the gross domestic product by 2025 (Papanicolas et al.,
2018), making automated medical AI systems a natural target for manipulation as
they drive decisions that move billions of dollars through the health care system.
Though recent advances in AI have made impressive progress on clinical tasks,
the fact remains that these systems as currently conceived are exceptionally brittle,
making them easy to mislead and manipulate with seemingly slight variations
in input. Medical images that have small but intentionally crafted modifications
(imperceptible to the human eye) can be used to create error in the diagnoses that
an AI system provides (Finlayson et al., 2018, 2019). Such attacks allow the attacker
to exert arbitrary control over the AI model by modifying the input provided to
the system. Figure 4-2 demonstrates how such an attack may be carried out.
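The fast gradient sign method is one widely studied way to craft such perturbations. The PyTorch sketch below shows the core step against a stand-in model; the network, image tensor, and step size are placeholders, and published attacks on medical imaging systems are considerably more sophisticated.

```python
# Sketch of a fast gradient sign method (FGSM) adversarial perturbation in PyTorch.
# The "model" and "image" are stand-ins; the point is only to show how a small,
# gradient-guided change to the input can alter a model's output.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 2))   # toy two-class classifier
model.eval()

image = torch.rand(1, 1, 64, 64, requires_grad=True)         # stand-in grayscale image
true_label = torch.tensor([1])                                # e.g., "malignant"

# Compute the gradient of the loss with respect to the input pixels.
loss = nn.functional.cross_entropy(model(image), true_label)
loss.backward()

# Nudge every pixel slightly in the direction that increases the loss.
epsilon = 0.02                                                # imperceptibly small step
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

print("original prediction:   ", model(image).argmax(dim=1).item())
print("adversarial prediction:", model(adversarial).argmax(dim=1).item())
```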
Adversarial Defenses
There are roughly two broad classes of possible defenses: infrastructural and
algorithmic (Qiu et al., 2019; Yuan et al., 2018). Infrastructural defenses prevent
image tampering or detect whether it has occurred. For instance, an image hash, also
known as a digital fingerprint, can be computed when an image is first acquired and
checked again before analysis, so that any subsequent modification can be detected.
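A minimal sketch of that infrastructural idea, using a standard cryptographic hash, follows; the file contents and path are placeholders created only so the example runs end to end.

```python
# Sketch of an infrastructural defense: record a cryptographic hash of an image at
# acquisition time and verify it before analysis. Any pixel-level tampering,
# including an adversarial perturbation, changes the hash. Paths and bytes are
# placeholders for illustration.
import hashlib
import os
import tempfile

def image_digest(path: str) -> str:
    """Return the SHA-256 digest of the raw image file bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Simulate acquisition: write stand-in image bytes to a temporary file.
path = os.path.join(tempfile.gettempdir(), "scan_0001_placeholder.dcm")
with open(path, "wb") as f:
    f.write(os.urandom(1024))                   # placeholder for real pixel data
digest_at_acquisition = image_digest(path)      # stored in a trusted registry

# ...later, before the image is handed to an AI system, verify its integrity...
if image_digest(path) != digest_at_acquisition:
    raise RuntimeError("Image modified since acquisition; do not analyze.")
print("Integrity check passed.")
```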
The examples in this chapter largely revolve around clinical cases and
risks, but the implications reach far beyond to all of the application domains
explored in Chapter 3. Public health, consumer health, and population health
and/or risk management applications and risks are all foreseeable. Operational
and administrative cases may be more viable early target areas with much more
forgiving risk profiles for unintended harm, without high-stakes medical decisions
depending on them. Even then, automated AI systems will have far-reaching
implications for patient populations, health systems, and the workforce in terms of
the efficiency and equity of delivering against the unmet and unlimited demands
for health care.
“It’s just completely obvious that in five years deep learning is going to do better
than radiologists. It might be 10 years,” according to Geoffrey Hinton, a pioneer
in artificial neural network research (Mukherjee, 2017). How should health
care systems respond to the statement by Sun Microsystems co-founder Vinod
Khosla that “Machines will replace 80 percent of doctors in a health care future
that will be driven by entrepreneurs, not medical professionals” (Clark, 2012)?
With the advancing capabilities of AI, and a history of prior large-scale workforce
disruptions through technology advances, it seems reasonable to posit that
entire job categories may be replaced by automation (see Figure 4-3), including
some of the most common (e.g., retail clerks and drivers) (Desjardins, 2017;
Frey and Osborne, 2013).
Are job losses in medicine a credible consequence of advancing AI? In 1968,
Warner Slack commented that “any doctor that can be replaced by a machine
should be replaced by a machine” (deBronkart and Sands, 2018). This sentiment is
often misinterpreted as an argument for replacing people with computer systems,
when it is meant to emphasize the value a good human adds that a computer
system does not. If one’s job is restricted to relaying information and answering
well-structured, verifiable multiple-choice questions, then it is likely those tasks
should be automated and the job eliminated. Most clinical jobs and patient needs
require much more cognitive adaptability, problem solving, and communication
skills than a computer can muster. Anxiety over job losses due to AI and automation
is likely exaggerated, but advancing technology will almost certainly change
roles as certain tasks are automated. A conceivable future could eliminate manual
tasks such as checking patient vital signs (especially with self-monitoring devices),
collecting laboratory specimens, preparing medications for pickup, transcribing
clinical documentation, completing prior authorization forms, scheduling
appointments, collecting standard history elements, and making routine diagnoses.
Rather than eliminate jobs, however, industrialization and technology typically
yield net productivity gains to society, with increased labor demands elsewhere
such as in software, technical, support, and related services work. Even within the
same job category, many assumed automated teller machines would eliminate
the need for bank tellers. Instead, the efficiencies gained enabled expansion of
branches and even greater demand for tellers who could focus on higher cognitive
tasks (e.g., interacting with customers, rather than simply counting money)
(Pethokoukis, 2016). Health care is already the fastest growing and now largest
employment sector in the nation (outstripping retail), but most of that growth is
not in clinical professionals such as doctors and nurses, but rather in home care
support and administrative staff (Thompson, 2018).
Even if health-related jobs are not replaced by AI, deskilling (“skill rot”) is a risk
of over-reliance on computer-based systems (Cabitza et al., 2017). While clinicians
may not be totally displaced, the fear is that they may lose “core competencies”
considered vital to medical practice. In light of the rapid advancements of AI
capabilities in reading X-rays (Beam and Kohane, 2016; Gulshan et al., 2016),
will radiologists of the future be able to perform this task without the aid of a
computer?
Anxieties over the potential for automated AI systems to replace jobs rest on a
false dichotomy. Humans and machines can excel in distinct ways that the other
cannot, meaning that the two combined can accomplish what neither could do
alone. In one example of a deep learning algorithm versus an expert pathologist
identifying metastatic breast cancer, the high accuracy of the algorithm was
impressive enough, but more compelling was that combining the algorithm with
the human expert outperformed both (Wang et al., 2016).
Electronic health record systems and related data collection requirements are
intended to improve care delivery, particularly at the population level, but these
benefits may not be felt on the frontlines of care. Instead, such systems can turn clinical
professionals into data entry clerks, feeding data-hungry machines (optimized
for billing incentives rather than clinical care). This may escalate as AI tools
need even more data, amid a policy climate imposing ever more documentation
requirements to evaluate and monitor metrics of health care quality.
The transition to more IT solutions, computerized data collection, and
algorithmic feedback should ultimately improve the consistency of patient
care quality and efficiency. However, will the measurable gains necessarily
outweigh the loss of harder-to-quantify human qualities of medicine? Will it
lead to different types of medical errors when health care relies on technology-
driven test interpretations and care recommendations instead of human clinical
assessment, interpretation, and management? These are provocative questions, but
acknowledging that these are public concerns and addressing them are important
from a societal perspective.
More optimistically, perhaps such advancing AI technologies can instead
enhance human relationships. Multiple companies are exploring remote and
automated approaches that “auto-scribe” clinical encounters (Cashin-Garbutt,
2017), allowing patient interactions to focus on direct care instead of note-
taking and data entry. Though such promise is tantalizing, it is also important
to be aware of the unintended consequences or overt actions of bad actors who
could exploit such passive monitoring, intruding on confidential physician–
patient conversations that could make either party unwilling to discuss important
issues. Health care AI developments may be better suited in the near term to
back-office administrative tasks (e.g., coding, prior authorization, supply chain
management, and scheduling). Rather than developing patches like scribes for
mundane administrative tasks, a holistic system redesign may be needed to reorient
incentives and eliminate the need for low-value tasks altogether. Otherwise, AI
systems may just efficiently automate low-value tasks, further entrenching those
tasks in the culture, rather than facilitating their elimination.
This was published in 1970. Will excitement over the current wave of AI
technology only trigger the next AI Winter? Why should this time be any different?
General AI systems will remain elusive for the foreseeable future, but there are credible
reasons to expect that narrow, applied AI systems will still transform many areas of
medicine and health in the next decade. Although many foundational concepts for
AI systems were developed decades ago, only now is there availability of the key
ingredient: data. Digitization of medical records, aggregated Internet crowdsourcing,
and patient-generated data streams provide the critical fuel to power modern AI
systems. Even in the unlikely event that no further major technological breakthroughs
follow, the coming decades will be busy translating existing technological advances
(e.g., image recognition, machine translation, voice recognition, predictive modeling)
into practical solutions for increasingly complex problems in health.
KEY CONSIDERATIONS
The review in this chapter seeks to soften any crash into a trough of disillusionment
over the unintended consequences of health care AI, so that we may quickly move
on to the slope of enlightenment that follows the hype cycle (Chen and Asch, 2017; see Figure 4-1), where we effectively use all information and data sources to improve our collective health. To that end, the following considerations are offered:
REFERENCES
Acemoglu, D., and P. Restrepo. 2018. Artificial intelligence, automation and work.
Working Paper 24196. National Bureau of Economic Research. https://doi.
org/10.3386/w24196.
Agniel, D., I. S. Kohane, and G. M. Weber. 2018. Biases in electronic health record
data due to processes within the healthcare system: Retrospective observational
study. BMJ 361:k1479.
AlphaStar Team. 2019. AlphaStar: Mastering the real-time strategy game
StarCraft II. DeepMind. https://deepmind.com/blog/alphastar-mastering-real-
time-strategy-game-starcraft-ii (accessed November 12, 2019).
Barnato, A. E., M. B. Herndon, D. L. Anthony, P. M. Gallagher, J. S. Skinner, J.
P. W. Bynum, and E. S. Fisher. 2007. Are regional variations in end-of-life
care intensity explained by patient preferences? A study of the US Medicare
population. Medical Care 45(5):386–393.
Bates, D. W., S. Saria, L. Ohno-Machado, A. Shah, and G. Escobar. 2014. Big data
in health care: Using analytics to identify and manage high-risk and high-cost
patients. Health Affairs 33:1123–1131.
Beam, A. L., and I. S. Kohane. 2016. Translating artificial intelligence into clinical
care. JAMA 316:2368–2369.
Beam, A. L., and I. S. Kohane. 2018. Big data and machine learning in health care.
JAMA 319:1317–1318.
Borzykowski, B. 2016. Truth be told, we’re more honest with robots. BBC
WorkLife. https://www.bbc.com/worklife/article/20160412-truth-be-told-
were-more-honest-with-robots (accessed November 12, 2019).
Brones, A. 2018. Food apartheid: The root of the problem with America’s groceries.
The Guardian. https://www.theguardian.com/society/2018/may/15/food-
apartheid-food-deserts-racism-inequality-america-karen-washington-
interview (accessed November 12, 2019).
Brooks, R. 2017. The seven deadly sins of AI predictions. MIT Technology Review.
https://www.technologyreview.com/s/609048/the-seven-deadly-sins-of-ai-
predictions (accessed November 12, 2019).
Brown, J. 2017. Why everyone is hating on IBM Watson—including the people
who helped make it. Gizmodo. https://gizmodo.com/why-everyone-is-hating-
on-watson-including-the-people-w-1797510888 (accessed November 12, 2019).
Brown, N., and T. Sandholm. 2018. Superhuman AI for heads-up no-limit poker:
Libratus beats top professionals. Science 359:418–424.
Butterfield, S. 2018. Let the computer figure it out. ACP Hospitalist. https://
acphospitalist.org/archives/2018/01/machine-learning-computer-figure-out.
htm (accessed November 12, 2019).
Cabitza, F., R. Rasoini, and G. F. Gensini. 2017. Unintended consequences of
machine learning in medicine. JAMA 318(6):517–518.
Cadwalladr, C., and E. Graham-Harrison. 2018. Revealed: 50 million Facebook profiles
harvested for Cambridge Analytica in major data breach. https://www.theguardian.
com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-
election (accessed December 7, 2019).
Caruana, R., P. Koch, Y. Lou, M. Sturm, J. Gehrke, and N. Elhadad. 2015.
Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day
readmission. In Proceedings of the 21th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. New York: ACM. Pp. 1721–1730.
https://doi.org/10.1145/2783258.2788613.
Cashin-Garbutt, A. 2017. Could smartglass rehumanize the physician patient
relationship? News-Medical.net. https://www.news-medical.net/news/20170307/
Could-smartglass-rehumanize-the-physician-patient-relationship.aspx (accessed
November 12, 2019).
Chen, J. H., and R. B. Altman. 2014. Automated physician order recommendations
and outcome predictions by data-mining electronic medical records. In AMIA
Summits on Translational Science Proceedings. Pp. 206–210.
Chen, J., and S. Asch. 2017. Machine learning and prediction in medicine—
Beyond the peak of inflated expectations. New England Journal of Medicine
376:2507–2509.
Choi, P. J., F. A. Curlin, and C. E. Cox. 2015. “The patient is dying, please call
the chaplain”: The activities of chaplains in one medical center’s intensive care
units. Journal of Pain and Symptom Management 50:501–506.
Verghese, A. 2018. How tech can turn doctors into clerical workers. The New
York Times. https://www.nytimes.com/interactive/2018/05/16/magazine/
health-issue-what-we-lose-with-data-driven-medicine.html (accessed
November 12, 2019).
Victory, J. 2018. What did journalists overlook about the Apple Watch “‘heart
monitor’” feature? HealthNewsReview.org. https://www.healthnewsreview.
org/2018/09/what-did-journalists-overlook-about-the-apple-watch-heart-
monitor-feature (accessed November 13, 2019).
Vinge, V. 1993. The coming technological singularity: How to survive in the
post-human era. National Aeronautics and Space Administration. https://ntrs.nasa.
gov/archive/nasa/casi.ntrs.nasa.gov/19940022856.pdf (accessed November 12,
2019).
Vinyals, O., A. Toshev, S. Bengio, and D. Erhan. 2017. Show and tell:
Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE
Transactions on Pattern Analysis and Machine Intelligence 39:652–663.
Wang, D., A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck. 2016. Deep
learning for identifying metastatic breast cancer. arXiv.org. https://arxiv.org/
abs/1606.05718 (accessed November 13, 2019).
Wilson, H. J., P. R. Daugherty, and N. Morini-Bianzino. 2017. The jobs that
artificial intelligence will create. MIT Sloan Management Review. https://
sloanreview.mit.edu/article/will-ai-create-as-many-jobs-as-it-eliminates
(accessed November 12, 2019).
Woolhandler, S., and D. U. Himmelstein. 2017. The relationship of health
insurance and mortality: Is lack of insurance deadly? Annals of Internal Medicine
167:424–431.
Yuan, X., H. Pan, Q. Zhu, and X. Li. 2018. Adversarial examples: Attacks and
defenses for deep learning. arXiv.org. https://arxiv.org/pdf/1712.07107.pdf
(accessed November 12, 2019).
Zou, J., and L. Schiebinger. 2018. Design AI so that it’s fair. Nature 559:324–326.
https://www.nature.com/magazine-assets/d41586-018-05707-8/d41586-
018-05707-8.pdf (accessed November 12, 2019).
Zulman, D. M., C. P. Chee, S. C. Ezeji-Okoye, J. G. Shaw, T. H. Holmes, J. S. Kahn,
and S. M. Asch. 2017. Effect of an intensive outpatient program to augment
primary care for high-need veterans affairs patients: A randomized clinical trial.
JAMA Internal Medicine 177:166–175.
INTRODUCTION
systems address the needs of health care delivery. Second, it is necessary that such
models be developed and validated through a team effort involving AI experts and
health care providers. Throughout the process, it is important to be mindful of the
fact that the datasets used to train AI are heterogeneous, complex, and nuanced
in ways that are often subtle and institution specific. This affects how AI tools
are monitored for safety and reliability, and how they are adapted for different
locations and over time. Third, before deployment at the point of care, AI systems
should be rigorously evaluated to ensure their competency and safety, in a process
similar to that done for drugs, medical devices, and other interventions.
TABLE 5-1 | Example of Artificial Intelligence Applications by the Primary Task and Main Stakeholder

Payers
  Classify (Diagnose): Identify which patients will not adhere to a treatment plan
  Predict (Prognose): Estimate risk of a “no-show” for a magnetic resonance imaging appointment
  Treat: Select the best second-line agent for managing diabetes after metformin, given a specific clinical history

Patients and caregivers
  Classify (Diagnose): Estimate risk of having an undiagnosed genetic condition (e.g., familial hypercholesterolemia)
  Predict (Prognose): Estimate risk of a postsurgical complication
  Treat: Identify a combination of anticancer drugs that will work for a specific tumor type

Providers
  Classify (Diagnose): Identify patients with unrecognized mental health needs
  Predict (Prognose): Determine risk of acute deterioration needing heightened care
  Treat: Establish how to manage an incidentally found atrial septal defect
of a care team to provide quality care or, at the very least, free up their time from
the busy work of reporting outcomes. Below, additional examples are discussed
regarding applications driving the use of algorithms to classify, predict, and treat,
which can guide users in figuring out for whom to take action and when. These
applications may be driven by needs of providers, payers, or patients and their
caregivers (see Table 5-1).
MODEL DEVELOPMENT
Establishing Utility
constraints on the action triggered by the model’s output (e.g., continuous rhythm
monitoring might be constrained by availability of Holter monitors) often can
have a much larger influence in determining model utility (Moons et al., 2012).
For example, if Vera was suspected of having atrial fibrillation based on a
personalized risk estimate (Kwong et al., 2017), the execution of follow-up action
(such as rhythm monitoring for 24 hours) depends on availability of the right
equipment. In the absence of the ability to follow up, a personalized estimate of the risk of undiagnosed atrial fibrillation does not improve Vera’s care.
Therefore, a framework for assessing the utility of a prediction-action pair
resulting from an AI solution is necessary. During this assessment process, there
are several key conceptual questions that must be answered (see Box 5-1).
Quantitative answers to these questions can drive analyses for optimizing the
desired outcomes, adjusting components of the expected utility formulation and
fixing variables that are difficult to modify (e.g., the cost of an action) to derive
the bounds of optimal utility.
For effective development and validation of AI/machine learning applications
in health care, one needs to carefully formulate the problem to be solved, taking
into consideration the properties of the algorithm (e.g., positive predictive value)
and the properties of the resulting action (e.g., effectiveness), as well as the
constraints on the action (e.g., costs, capacity), given the clinical and psychosocial
environment.
If Vera’s diagnosis was confirmed and subsequently the CHADS2 risk score
indicated a high 1-year risk of ischemic stroke, the utility of treating with anticoagulants has to be determined in light of the positive predictive value of the CHADS2 score, the known (or estimated) effectiveness of anticoagulation in preventing stroke, and the increased risk of bleeding incurred from anticoagulant use in the presence of hypertension.
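To make this expected-utility framing concrete, the following is a minimal sketch of how a prediction–action pair might be weighed, combining the positive predictive value of a risk score with the assumed effectiveness and harm of the triggered action. All numeric values, function names, and utility weights are illustrative assumptions rather than estimates from the CHADS2 literature or from this chapter.

```python
# Hedged sketch: expected utility of acting on a positive risk-score result.
# All numbers below are illustrative assumptions, not published estimates.

def expected_utility_of_treating(ppv: float,
                                 absolute_risk_reduction: float,
                                 harm_probability: float,
                                 benefit_utility: float = 1.0,
                                 harm_utility: float = -0.5) -> float:
    """Expected utility per patient flagged positive by the model.

    ppv: probability that a flagged patient truly has the condition
    absolute_risk_reduction: reduction in outcome risk when a true case is treated
    harm_probability: probability of treatment-related harm (e.g., major bleed)
    benefit_utility / harm_utility: relative values assigned to a prevented
        outcome and to a treatment harm (assumed, context specific)
    """
    expected_benefit = ppv * absolute_risk_reduction * benefit_utility
    expected_harm = harm_probability * harm_utility
    return expected_benefit + expected_harm

# Illustrative comparison of two operating points of the same classifier.
print(expected_utility_of_treating(ppv=0.60, absolute_risk_reduction=0.05,
                                   harm_probability=0.02))
print(expected_utility_of_treating(ppv=0.30, absolute_risk_reduction=0.05,
                                   harm_probability=0.02))
```

In practice, the same calculation can be repeated across plausible ranges of each input, holding fixed the variables that are difficult to modify, to derive the bounds of optimal utility described above.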
BOX 5-1
Key Considerations in Model Development
After the potential utility has been established, there are some key choices
that must be made prior to actual model development. Both model developers
and model users are needed at this stage in order to maximize the chances of
succeeding in model development because many modeling choices are dependent
on the context of use of the model. Although clinical validity is discussed in
Chapter 6, we note here that the need for external validity depends on what one
wishes to do with the model, the degree of agency ascribed to the model, and the
nature of the action triggered by the model.
by rigorous testing and prospective assessment of how often the model’s predictions
are correct and calibrated, and for assessing the impact of the interventions on the
outcome. At the same time, prospective assessment can be costly. Thus, doing all one
can to vet the model in advance (e.g., by inspecting learned relationships) is imperative.
group allows an assignment of higher (or lower) risk of specific outcomes, that
is considered a sign that the learned groups have meaning. For example, Shah et al. (2015) analyzed roughly 450 patients who had heart failure with preserved ejection fraction and identified three subgroups. In data that were not used to learn the groups, application of the grouping scheme sorted patients into high, medium, and low risk of subsequent mortality.
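As a toy illustration of this workflow (not a reproduction of the Shah et al. analysis), the sketch below clusters synthetic patient features into three groups and then compares outcome rates by group in held-out data; the data, group count, and outcome rate are all assumed for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for clinical features of ~450 patients, plus a synthetic outcome.
X = rng.normal(size=(450, 10))
died = rng.binomial(1, 0.2, size=450)

# Learn three subgroups on one half of the data only.
train, held_out = np.arange(225), np.arange(225, 450)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[train])

# Apply the grouping scheme to held-out patients and compare outcome rates by group.
groups = km.predict(X[held_out])
for g in range(3):
    mask = groups == g
    print(f"group {g}: n={mask.sum()}, mortality={died[held_out][mask].mean():.2f}")
```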
Reinforcement learning differs from supervised and unsupervised learning,
because the algorithm learns through interacting with its environment rather than
through observational data alone. Such techniques have had recent successes in game
settings (Hutson, 2017). In games, an agent begins in some initial stage and then takes
actions affecting the environment (i.e., transitioning to a new state) and receiving a
reward. This framework mimics how clinicians may interact with their environment,
adjusting medication or therapy based on observed effects. Reinforcement learning
is most applicable in settings involving sequential decision making where the reward
may be delayed (i.e., not received for several time steps). Although most applications
consider online settings, recent work in health care has applied reinforcement
learning in an offline setting using observational data (Komorowski et al., 2018).
Reinforcement learning holds promise, although its current applications suffer from
issues of confounding and lack of actionability (Saria, 2018).
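To illustrate the sequential, delayed-reward structure described above, here is a minimal tabular Q-learning sketch on an invented two-state “treatment” environment. The states, actions, transition probabilities, and rewards are purely illustrative assumptions and do not correspond to any clinical protocol or to the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy environment: states 0 = "unstable", 1 = "stable"; actions 0 = "hold", 1 = "treat".
# Treating an unstable patient tends to stabilize them; reward arrives only on stability.
def step(state, action):
    if state == 0:
        next_state = 1 if (action == 1 and rng.random() < 0.7) else 0
    else:
        next_state = 1 if rng.random() < 0.9 else 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

Q = np.zeros((2, 2))            # Q[state, action]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(2000):
    state = 0
    for t in range(10):          # finite horizon of 10 decision points
        if rng.random() < epsilon:
            action = int(rng.integers(2))      # explore
        else:
            action = int(Q[state].argmax())    # exploit
        next_state, reward = step(state, action)
        # Q-learning update toward the observed reward plus discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)   # learned action values; the argmax per row is the learned policy
```

Offline variants of this idea instead estimate such values from previously logged observational data, subject to the confounding and actionability caveats noted above.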
LEARNING A MODEL
BOX 5-2
Key Definitions in Model Development
performs are the test data (Wikipedia, 2019). Training data are often further split
into training and validation subsets. Model selection, which is the selection of one
specific model from among the many that are possible given the training data, is
performed using the validation data.
Bad data will result in bad models, recalling the age-old adage “garbage in, garbage
out” (Kilkenny and Robinson, 2018). There is a tendency to hype AI as something
magical that can learn no matter what the inputs are. In practice, the choice of data
always trumps the choice of the specific mathematical formulation of the model.
In choosing the data for any model learning exercise, the outcome of interest
(e.g., inpatient mortality) and the process for extracting it (e.g., identified using
chart review of the discharge summary note) should be described in a reproducible
manner. If the problem involves time-series data, the time at which an outcome is observed and recorded, versus the time at which it needs to be predicted, has to be defined upfront (see Figure 5-2). The window of data used to learn the model (i.e., the observation window) and the amount of lead time required between the prediction and the outcome should also be specified.
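One way to operationalize these definitions is sketched below: features are drawn only from an observation window that ends a fixed lead time before the prediction time. The timestamps, variable names, and window lengths are illustrative assumptions.

```python
import pandas as pd

# Illustrative event stream for one patient (times and values are made up).
events = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-01", "2019-01-05", "2019-01-09", "2019-01-11"]),
    "variable": ["creatinine", "creatinine", "creatinine", "creatinine"],
    "value": [1.0, 1.4, 2.1, 2.6],
})

prediction_time = pd.Timestamp("2019-01-12")
lead_time = pd.Timedelta(days=2)           # prediction must be usable 2 days ahead
observation_window = pd.Timedelta(days=7)  # look back at most 7 days before that

window_end = prediction_time - lead_time
window_start = window_end - observation_window

# Keep only measurements that fall inside the observation window.
in_window = events[(events.time >= window_start) & (events.time < window_end)]
features = in_window.groupby("variable")["value"].agg(["mean", "max", "count"])
print(features)
```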
It is necessary to provide a detailed description of the process of data acquisition,
the criteria for subselecting the training data, and the description and prevalence
of attributes that are likely to affect how the model will perform on a new dataset.
For example, when building a predictive model, subjects in the training data may
The decisions made during the creation and acquisition of datasets will be
reflected in downstream models. In addition to knowing the final features
representing patient data (see Figure 5-1), any preprocessing steps should be clearly
documented and made available with the model. Such data-wrangling steps (e.g.,
how one dealt with missing values or irregularly sampled data) are often overlooked
or not reported. The choices made around data preparation and transformation
into the analytical data representation can contribute significantly to bias that then
gets incorporated into the AI algorithm (Suresh and Guttag, 2019).
Often, the users of the model’s output hold the model itself responsible for such
biases, rather than the underlying data and the model developer’s design decisions
surrounding the data (Char et al., 2018). In nonmedical fields, there are numerous
examples in which model use has reflected biases inherent in the data used to train
them (Angwin et al., 2016; Char et al., 2018; O’Neil, 2017). For example, programs
designed to aid judges in sentencing by predicting an offender’s risk of recidivism
have shown racial discrimination (Angwin et al., 2016). In health care, attempts to
use data from the Framingham Heart Study to predict the risk of cardiovascular
events in minority populations have led to biased risk estimates (Gijsberts et al.,
2015). Subtle discrimination inherent in health care delivery may be harder to
anticipate; as a result, it may be more difficult to prevent an algorithm from learning
and incorporating this type of bias (Shah et al., 2018). Such biases may lead to
self-fulfilling prophecies: If clinicians always withdraw care from patients with
certain findings (e.g., extreme prematurity or a brain injury), machine learning
systems may conclude that such findings are always fatal. (Note that the degree to
which such biases may affect actual patient care depends on the degree of causality
ascribed to the model and to the process of choosing the downstream action.)
Learning Setup
In the machine learning literature, the dataset from which a model is learned is also called the training dataset. Sometimes a portion of this dataset may be set aside for tuning hyperparameters—the settings that govern how a model is fit, such as the strength of a penalty term, as distinct from the weights the model learns for different variables and their combinations. This portion of the training data is referred to as the hyperparameter-validation dataset, or often just the validation dataset. The validation dataset is used to check whether the chosen hyperparameter values are appropriate. Note that the nomenclature is unfortunate, because these validation data have
nothing to do with the notion of clinical validation or external validity.
Given that the model was developed from the training/validation data, it is
necessary to evaluate its performance in classifying or making predictions on a
“holdout” test set (see Figure 5-1). This test set is held out in the sense that it was
not used to select model parameters or hyperparameters. The test set should be as
close as possible to the data that the model would be applied to in routine use. The choice of metrics used to assess a model’s performance is guided by the end goal of the modeling as well as the type of learning being conducted (e.g., unsupervised versus supervised). Here, we focus on metrics for supervised binary classifiers (e.g., patient risk stratification tools). Estimates of these metrics can also be obtained through cross-validation, in which the whole dataset is split randomly into multiple parts, with one part set aside as the test set and the remaining parts used for training the model.
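The sketch below illustrates the nomenclature used here and in Box 5-2: a holdout test set is kept aside, the remaining data are split into training and validation subsets for hyperparameter selection, and the selected model is scored once on the test set. The synthetic data and the hyperparameter grid are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Hold out a test set that plays no role in model or hyperparameter selection.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remainder into training and (hyperparameter-)validation subsets.
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

best_auc, best_model = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:                     # arbitrary illustrative grid
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_auc, best_model = auc, model

# Report performance once, on data never used for any selection step.
print("test AUROC:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))
```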
Recall (sensitivity) and precision (positive predictive value) are two of the most debated performance metrics because their relative importance varies with the use case. Sensitivity quantifies a classifier’s ability to identify the true positive cases. Typically, a highly sensitive classifier can reliably rule out a disease when its result is negative (Davidson, 2002). Precision quantifies the proportion of cases flagged by the classifier that are truly positive—that is, it reflects how often the classifier falsely categorizes a noncase as a case. Specificity quantifies the proportion of actual negatives that are correctly identified as such. There is a trade-off among the recall, precision, and specificity measures, which needs to be resolved based on the clinical question of interest. For situations where we cannot afford to miss a case, high sensitivity is desired. Often, a highly sensitive classifier is followed up with a highly specific test to identify the false positives among those flagged by the sensitive classifier. The trade-off between specificity and sensitivity can be visually explored in the receiver operating characteristic (ROC) curve. The area under the ROC curve (AUROC) is the most popular index for summarizing the information in the ROC curve. When reporting results on the holdout test set, we recommend going beyond the AUROC and instead reporting the entire ROC curve as well as the sensitivity, specificity, positive predictive value, and negative predictive value at a variety of points on the curve that represent reasonable decision-making cutoffs (Bradley, 1997; Hanley and McNeil, 1982).
However, the limitations of ROC curves are well known even though they continue to be widely used (Cook, 2007). Despite the popularity of the AUROC and the ROC curve for evaluating classifier performance, there are other important considerations. First, the utility offered by two ROC curves can be wildly different, and a classifier with a lower overall AUROC may have higher utility, depending on the shape of its ROC curve. Second, in highly imbalanced datasets, where negative and positive labels are not distributed equally, a precision-recall (PR) curve provides a better basis for comparing classifiers (McClish, 1989). Therefore, in order to enable meaningful comparisons, researchers should report both the AUROC and the area under the PR curve, along with the actual curves and error bars around the average classifier performance.
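As a sketch of the reporting practice recommended above, the snippet below computes the ROC and precision–recall curves, their areas, and the sensitivity, specificity, positive predictive value, and negative predictive value at one decision threshold. The synthetic labels, scores, and threshold are placeholders, not recommendations.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.1, size=5000)                  # imbalanced labels (10% positives)
scores = np.clip(0.1 + 0.4 * y_true + rng.normal(0, 0.2, 5000), 0, 1)

fpr, tpr, _ = roc_curve(y_true, scores)
precision, recall, _ = precision_recall_curve(y_true, scores)
print("AUROC:", roc_auc_score(y_true, scores))
print("area under PR curve:", auc(recall, precision))

# Threshold-specific operating characteristics (the threshold is an illustrative choice).
threshold = 0.3
pred = (scores >= threshold).astype(int)
tp = ((pred == 1) & (y_true == 1)).sum()
fp = ((pred == 1) & (y_true == 0)).sum()
tn = ((pred == 0) & (y_true == 0)).sum()
fn = ((pred == 0) & (y_true == 1)).sum()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("PPV:", tp / (tp + fp))
print("NPV:", tn / (tn + fn))
```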
For decision making in the clinic, additional metrics such as calibration, net
reclassification, and a utility assessment are necessary (Lorent et al., 2019; Shah
et al., 2019; Steyerberg et al., 2010). While the ROC curves provide information
about a classifier’s ability to discriminate a true case from a noncase, calibration
metrics quantify how well the predicted probabilities of a true case being a case
agree with observed proportions of cases and noncases. For a well-calibrated classifier, 90 of 100 samples with a predicted probability of 0.9 will in fact be true cases (Cook, 2008).
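A simple reliability check of this kind is sketched below: predicted probabilities are binned and compared with the observed event rate in each bin. The data are simulated to be perfectly calibrated, so the bins fall near the diagonal; a miscalibrated model would not.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
predicted = rng.uniform(0, 1, size=10000)
# Simulate outcomes that occur at exactly the predicted rate (perfect calibration).
observed = rng.binomial(1, predicted)

frac_positives, mean_predicted = calibration_curve(observed, predicted, n_bins=10)
for p, o in zip(mean_predicted, frac_positives):
    print(f"mean predicted {p:.2f} -> observed rate {o:.2f}")
```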
When evaluating the use of machine learning models, it is also important to
develop parallel baselines, such as a penalized regression model applied on the
same data that are supplied to more sophisticated models such as deep learning
or random forests. Given the non-obvious relationship of a model’s positive predictive value, recall, and specificity to its utility, having these parallel models provides another axis of evaluation in terms of cost of implementation, interpretability, and relative performance.
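One way to set up such parallel baselines is sketched below, fitting a penalized logistic regression and a random forest on the same synthetic data and comparing discrimination; the model settings are illustrative assumptions, and in practice interpretability and implementation cost would be weighed alongside performance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))
y = (X[:, 0] - X[:, 1] + rng.normal(size=2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Simple penalized baseline and a more flexible model, trained on the same data.
baseline = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X_tr, y_tr)
complex_model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

for name, m in [("penalized logistic regression", baseline),
                ("random forest", complex_model)]:
    print(name, roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]))
```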
Aside from issues related to quantifying the incremental value of using a
model to improve care delivery, there are methodological issues in continuously
evaluating or testing a model as the underlying data change. For example, a model
for predicting 24-hour mortality could be retrained every week or every day as
new data become available. It is unclear which metrics of the underlying data as
well as of the model performance we should monitor to manage such continuously
evolving models. It is also unclear how to set the retraining schedule, and what
information should guide that decision. The issues of model surveillance and
implementation are more deeply addressed in Chapter 6. Finally, there are unique
regulatory issues that arise if a model might get retrained after it is approved
and then behave differently, and some of the current guidance for these issues is
discussed in Chapter 7.
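Although the chapter leaves the choice of surveillance metrics open, the sketch below shows one simple possibility: tracking AUROC on successive batches of recent cases and flagging retraining when performance falls below a floor or drops sharply. The thresholds, batch structure, and simulated degradation are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def weekly_auroc(batches):
    """Compute AUROC for each batch of (labels, predicted scores)."""
    return [roc_auc_score(y, s) for y, s in batches]

def needs_retraining(aurocs, floor=0.70, drop=0.05):
    """Flag retraining when performance falls below a floor or drops from baseline.

    floor and drop are illustrative thresholds; in practice they would be set
    from the utility analysis for the specific prediction-action pair.
    """
    if not aurocs:
        return False
    return aurocs[-1] < floor or (aurocs[0] - aurocs[-1]) > drop

# Synthetic illustration: three weeks of scored cases with slowly degrading scores.
rng = np.random.default_rng(0)
batches = []
for shift in [0.0, 0.1, 0.25]:
    y = rng.binomial(1, 0.2, size=500)
    s = np.clip(0.3 + 0.4 * y - shift * y + rng.normal(0, 0.15, 500), 0, 1)
    batches.append((y, s))

aurocs = weekly_auroc(batches)
print(aurocs, "retrain:", needs_retraining(aurocs))
```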
DATA QUALITY
A variety of issues affect data integrity in health care. For example, the software
for data retrieval, preprocessing, and cleaning is often lost or not maintained,
making it impossible to re-create the same dataset. In addition, the data from the
source system(s) may have been discarded or may have changed. The problem
is further compounded by fast-changing data sources or changes over time in
institutional data stores or governance procedures. Finally, silos of expertise and
access around data sources create dependence on individual people or teams.
When the collection and provenance of the data that a model is trained on is a black
box, researchers must compensate with reliance on trusted individuals or teams,
which is suboptimal and not sustainable in the long run. Developing AI based
processing the large amount of data, whereas an irregular heart rhythm presenting
on a 12-lead ECG may be interpreted and acted upon within minutes. Therefore,
AI development teams should have information technology (IT) engineers who
are knowledgeable about the details of when and where certain data become
available and whether the mechanics of data availability and access are compatible
with the model being constructed.
Another critical point is that the acquisition of the data elements present in the
training data must be possible without major effort. Models derived using datasets
where data elements are manually abstracted (e.g., Surgical Risk Calculator from
the American College of Surgeons) cannot be deployed without significant
investment by the deploying site to acquire the necessary data elements for the
patient for whom the model needs to be used. While this issue can be overcome
with computational phenotyping methods, such methods struggle with portability
due to EHR system variations resulting in different reporting schemes, as well
as clinical practice and workflow differences. With the rise of interoperability
standards such as the Fast Healthcare Interoperability Resource, the magnitude
of this problem is likely to decrease in the near future. When computationally
defined phenotypes serve as the basis for downstream analytics, it is important
that computational phenotypes themselves be well managed and clearly defined
and adequately reflect the target domain.
As a reasonable starting point for minimizing the data quality issues, data should
adhere to the FAIR (findability, accessibility, interoperability, and reusability)
principles in order to maximize the value of the data (Wilkinson et al., 2016).
Researchers in molecular biology and bioinformatics put forth these principles,
and, admittedly, their applicability in health care is not easy or straightforward.
One of the unique challenges (and opportunities) facing impactful design and
implementation of AI in health care is the disparate data types that comprise today’s
health care data. Today’s EHRs and wearable devices have greatly increased the
volume, variety, and velocity of clinical data. The soon-to-be in-clinic promise of
genomic data further complicates the problems of maintaining data provenance,
timely availability of data, and knowing what data will be available for which
patient at what time.
Always keeping a timeline view of the patient’s medical record is essential
(see Figure 5-2), as is explicitly knowing the times at which the different data
types across different sources come into existence. It stands to reason that any
predictive or classification model operating at a given point in the patient
timeline can only expect to use data that have come into being prior to the time
at which the model is used (Jung et al., 2016; Panesar, 2019). Such a real-life
view of data availability is crucial when building models, because using clean
data gives an overly optimistic view of models’ performance and an unrealistic
impression of their potential value. Finally, we note that the use of synthetic data,
if created to mirror real-life data in its missingness and acquisition delay by data
type, can serve as a useful strategy for a model builder to create realistic training
and testing environments for novel methods (Carnegie Mellon University, 2018;
Franklin et al., 2017; Schuler, 2018).
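The sketch below makes this timeline rule concrete: each record carries both the clinical time and the (often later) time it became available in the source system, and only records available before the prediction time are admitted into the feature set. The field names, delays, and values are assumptions for illustration.

```python
import pandas as pd

# Illustrative records: a result's clinical time vs. when it landed in the data warehouse.
records = pd.DataFrame({
    "variable": ["troponin", "chest_xray_report", "genomic_panel"],
    "clinical_time": pd.to_datetime(["2019-03-01 08:00", "2019-03-01 09:00", "2019-02-20"]),
    "available_time": pd.to_datetime(["2019-03-01 09:30", "2019-03-02 14:00", "2019-03-10"]),
    "value": [0.04, 1.0, 1.0],
})

prediction_time = pd.Timestamp("2019-03-01 12:00")

# Use only data that had actually come into existence before the prediction time.
usable = records[records.available_time <= prediction_time]
print(usable[["variable", "value"]])
```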
EDUCATION
KEY CONSIDERATIONS
The rapid increase in the volume and variety of data in health care has driven
the current interest in the use of AI (Roski et al., 2014). There is active discussion
and interest in addressing the potential ethical issues in using AI (Char et al., 2018),
the need for humanizing AI (Israni and Verghese, 2019), the potential unintended
consequences (Cabitza et al., 2017), and the need to temper the hype (Beam and Kohane, 2018). However, more discovery and work in these areas is essential.
The way that AI is developed, evaluated, and utilized in health care must change.
At present, most of the existing discussion focuses on evaluating the model from
a technical standpoint. A critically underassessed area is the net benefit of the
integration of AI into clinical practice workflow (see Chapter 6).
Establishing Utility
Model Learning
After the potential utility has been established, model developers and model
users need to interact closely during model learning because many modeling
choices are dependent on the context of use of the model (Wiens et al., 2019).
For example, the need for external validity depends on what one wishes to do
with the model, the degree of agency ascribed to the model, and the nature of
the action triggered by the model.
It is well known that biased data will result in biased models; thus, the data that are
selected to learn from matter far more than the choice of the specific mathematical
formulation of the model. Model builders need to pay closer attention to the data
they train on and need to think beyond the technical evaluation of models. Even in
technical evaluation, it is necessary to look beyond the ROC curves and examine
multiple dimensions of performance (see Box 5-2). For decision making in the clinic,
additional metrics such as calibration, net reclassification, and a utility assessment are
necessary. Given the non-obvious relationship between a model’s positive predictive
value, recall, and specificity to its utility, it is important to examine simple and
obvious parallel baselines, such as a penalized regression model applied on the same
data that are supplied to more sophisticated models such as deep learning.
The topic of interpretability deserves special discussion because of ongoing
debates around interpretability, or the lack of it (Licitra et al., 2017; Lipton, 2016;
Voosen, 2017). To the model builder, interpretability often means the ability to
explain which variables and their combinations, in what manner, led to the output
produced by the model (Friedler et al., 2019). To the clinical user, interpretability
could mean one of two things: a sufficient understanding of what is going on, so that they can trust the output and/or be able to get liability insurance for its
Data Quality
Bad data quality adversely affects patient care and outcomes (Jamal et al., 2009).
A recent systematic review shows that AI models could dramatically improve if
four particular adjustments were made: use of multicenter datasets, incorporation of
time-varying data, assessment of missing data as well as informative censoring, and
development of metrics of clinical utility (Goldstein et al., 2017). As a reasonable
starting point for minimizing the data quality issues, data should adhere to the
FAIR principles in order to maximize the value of the data (Wilkinson et al.,
2016). An often overlooked detail is when and where certain data become available
and whether the mechanics of data availability and access are compatible with the
model being constructed. In parallel, we need to educate the different stakeholders,
and the model builders need to understand the datasets they learn from.
The use of AI solutions presents a wide range of legal and ethical challenges,
which are still being worked out (see Chapter 7). For example, when a physician
makes decisions assisted by AI, it is not always clear where to place blame in the
case of failure. This subtlety is not new to recent technological advancements,
and in fact was brought up decades ago (Berg, 2010). However, most of the legal
and ethical issues were never fully addressed in the history of computer-assisted
decision support, and a new wave of more powerful AI-driven methods only adds
to the complexity of ethical questions (e.g., the frequently condemned black box
model) (Char et al., 2018).
The model builders need to better understand the datasets they choose to learn
from. Decision makers need to look beyond technical evaluations and ask for utility assessments. The media needs to do a better job of articulating both the immense potential and the risks of adopting AI in health care. Therefore, it is important to promote a measured approach to adopting AI technology, which would further AI’s role as augmenting rather than replacing human actors. This framework could
allow the AI community to make progress while managing evaluation challenges
(e.g., when and how to employ interpretable models versus black box models) as
well as ethical challenges that are bound to arise as the technology is widely adopted.
REFERENCES
learning over logistic regression for clinical prediction models. Journal of Clinical
Epidemiology 110:12–22.
Cleophas, T. J., A. H. Zwinderman, and H. I. Cleophas-Allers. 2013. Machine
learning in medicine. New York: Springer. https://www.springer.com/gp/
book/9789400758230 (accessed November 13, 2019).
Cook, N. R. 2007. Use and misuse of the receiver operating characteristic curve
in risk prediction. Circulation 115(7):928–935.
Cook, N. R. 2008. Statistical evaluation of prognostic versus diagnostic models:
Beyond the ROC curve. Clinical Chemistry 54(1):17–23.
Cushman, W. C., P. K. Whelton, L. J. Fine, J. T. Wright, Jr., D. M. Reboussin, K. C.
Johnson, and S. Oparil. 2016. SPRINT trial results: Latest news in hypertension
management. Hypertension 67(2):263–265.
Davidson, M. 2002. The interpretation of diagnostic tests: A primer for
physiotherapists. Australian Journal of Physiotherapy 48(3):227–232.
Doshi-Velez, F., Y. Ge, and I. Kohane. 2014. Comorbidity clusters in autism
spectrum disorders: An electronic health record time-series analysis. Pediatrics
133(1):54–63.
Dyagilev, K., and S. Saria. 2016. Learning (predictive) risk scores in the presence
of censoring due to interventions. Machine Learning 102(3):323–348.
Esteva, A., A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo, K. Chou, C.
Cui, G. Corrado, S. Thrun, and J. Dean. 2019. A guide to deep learning in
healthcare. Nature Medicine 25(1):24.
Fox, C. S. 2010. Cardiovascular disease risk factors, type 2 diabetes mellitus, and
the Framingham Heart Study. Trends in Cardiovascular Medicine 20(3):90–95.
Franklin, J. M., W. Eddings, P. C. Austin, E. A. Stuart, and S. Schneeweiss. 2017.
Comparing the performance of propensity score methods in healthcare database
studies with rare outcomes. Statistics in Medicine 36(12):1946–1963.
Friedler, S. A., C. D. Roy, C. Scheidegger, and D. Slack. 2019. Assessing the local
interpretability of machine learning models. arXiv.org. https://ui.adsabs.harvard.
edu/abs/2019arXiv190203501S/abstract (accessed November 13, 2019).
Gianfrancesco, M., S. Tamang, and J. Yazdany. 2018. Potential biases in machine
learning algorithms using electronic health record data. JAMA Internal Medicine
178(11):1544–1547.
Gijsberts, C. M., K. A. Groenewegen, I. E. Hoefer, M. J. Eijkemans, F. W. Asselbergs,
T. J. Anderson, A. R. Britton, J. M. Dekker, G. Engström, G. W. Evans, and J. De
Graaf. 2015. Race/ethnic differences in the associations of the Framingham risk
factors with carotid IMT and cardiovascular events. PLoS One 10(7):e0132321.
Goldstein, B. A., A. M. Navar, M. J. Pencina, and J. Ioannidis. 2017. Opportunities and challenges in developing risk prediction models with electronic health records data: A systematic review. Journal of the American Medical Informatics Association 24(1):198–208.
Jamal, A., K. McKenzie, and M. Clark. 2009. The impact of health information
technology on the quality of medical and health care: A systematic review.
Health Information Management Journal 38(3):26–37.
Jung, K., and N. H. Shah. 2015. Implications of non-stationarity on predictive
modeling using EHRs. Journal of Biomedical Informatics 58:168–174.
Jung, K., S. Covington, C. K. Sen, M. Januszyk, R. S. Kirsner, G. C. Gurtner, and
N. H. Shah. 2016. Rapid identification of slow healing wounds. Wound Repair
and Regeneration 24(1):181–188.
Ke, C., Y. Jin, H. Evans, B. Lober, X. Qian, J. Liu, and S. Huang. 2017. Prognostics of
surgical site infections using dynamic health data. Journal of Biomedical Informatics
65:22–33.
Khare, R., L. Utidjian, B. J. Ruth, M. G. Kahn, E. Burrows, K. Marsolo, N.
Patibandla, H. Razzaghi, R. Colvin, D. Ranade, and M. Kitzmiller. 2017. A
longitudinal analysis of data quality in a large pediatric data research network.
Journal of the American Medical Informatics Association 24(6):1072–1079.
Kilkenny, M. F., and K. M. Robinson. 2018. Data quality: “Garbage in–garbage
out.” Health Information Management 47(3):103–105.
Komorowski, M., L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal. 2018. The
Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in
intensive care. Nature Medicine 24(11):1716–1720.
Kroll, J. A. 2018. Data science data governance [AI Ethics]. IEEE Security & Privacy
16(6):61–70.
Kwong, C., A. Y. Ling, M. H. Crawford, S. X. Zhao, and N. H. Shah. 2017. A
clinical score for predicting atrial fibrillation in patients with cryptogenic stroke
or transient ischemic attack. Cardiology 138(3):133–140.
Licitra, L., A. Trama, and H. Hosni. 2017. Benefits and risks of machine learning
decision support systems. JAMA 318(23):2356.
Lipton, Z. C. 2016. The mythos of model interpretability. arXiv.org. https://arxiv.
org/abs/1606.03490 (accessed November 13, 2019).
Lorent, M., H. Maalmi, P. Tessier, S. Supiot, E. Dantan, and Y. Foucher. 2019.
Meta-analysis of predictive models to assess the clinical validity and utility
for patient-centered medical decision making: Application to the Cancer of
the Prostate Risk Assessment (CAPRA). BMC Medical Informatics and Decision
Making 19(1):Art. 2.
Mason, P. K., D. E. Lake, J. P. DiMarco, J. D. Ferguson, J. M. Mangrum, K. Bilchick,
L. P. Moorman, and J. R. Moorman. 2012. Impact of the CHA2DS2-VASc
score on anticoagulation recommendations for atrial fibrillation. American
Journal of Medicine 125(6):603.e1–603.e6.
McClish, D. K. 1989. Analyzing a portion of the ROC curve. Medical Decision
Making 9(3):190–195.
Meskó, B., G. Hetényi, and Z. Győrffy. 2018. Will artificial intelligence solve the
human resource crisis in healthcare? BMC Health Services Research 18(1):Art. 545.
Molenberghs, G., and M. Kenward. 2007. Missing data in clinical studies, vol. 61.
Hoboken, NJ: John Wiley & Sons. https://www.wiley.com/en-us/Missing+
Data+in+Clinical+Studies-p-9780470849811 (accessed November 13, 2019).
Moons, K. G., A. P. Kengne, D. E. Grobbee, P. Royston, Y. Vergouwe, D. G.
Altman, and M. Woodward. 2012. Risk prediction models: II. External
validation, model updating, and impact assessment. Heart 98(9):691–698.
Nuttall, J., N. Evaniew, P. Thornley, A. Griffin, B. Deheshi, T. O’Shea, J. Wunder,
P. Ferguson, R. L. Randall, R. Turcotte, and P. Schneider. 2016. The inter-rater
reliability of the diagnosis of surgical site infection in the context of a clinical
trial. Bone & Joint Research 5(8):347–352.
O’Neil, C. 2017. Weapons of math destruction: How big data increases inequality and
threatens democracy. New York: Broadway Books.
Panesar, A., 2019. Machine learning and AI for healthcare: Big data for improved health
outcomes. New York: Apress.
Pearl, J., and D. Mackenzie. 2018. The book of why:The new science of cause and effect.
New York: Basic Books.
Poursabzi-Sangdeh, F., D. G. Goldstein, J. M. Hofman, J. W. Vaughan, and H.
Wallach. 2018. Manipulating and measuring model interpretability. arXiv.org.
https://arxiv.org/abs/1802.07810 (accessed November 13, 2019).
Rajkomar, A., E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J.
Marcus, M. Sun, P. Sundberg, H. Yee, K. Zhang, Y. Zhang, G. Flores, G. E. Duggan,
J. Irvine, Q. Le, K. Litsch, A. Mossin, J. Tansuwan, D. Wang, J. Wexler, J. Wilson,
D. Ludwig, S. L. Volchenboum, K. Chou, M. Pearson, S. Madabushi, N. H. Shah,
A. J. Butte, M. D. Howell, C. Cui, G. S. Corrado, and J. Dean. 2018. Scalable and
accurate deep learning with electronic health records. NPJ Digital Medicine 1(1):18.
Rajpurkar, P., J. Irvin, R. L. Ball, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. P. Langlotz, B. N. Patel, K. W. Yeom, K. Shpanskaya, F. G. Blankenberg, J. Seekins, T. J. Amrhein, D. A. Mong, S. S. Halabi, E. J. Zucker, A. Y. Ng, and M.
P. Lungren. 2018. Deep learning for chest radiograph diagnosis: A retrospective
comparison of the CheXNeXt algorithm to practicing radiologists. PLoS
Medicine 15(11):e1002686.
Roski, J., G. Bo-Linn, and T. Andrews. 2014. Creating value in healthcare through
big data: Opportunities and policy implications. Health Affairs 33(7):1115–1122.
Sachs, K., O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan. 2005. Causal
protein-signaling networks derived from multiparameter single-cell data. Science
308(5721):523–529.
Saria, S. 2018. Individualized sepsis treatment using reinforcement learning.
Nature Medicine 24(11):1641–1642.
Saria, S., and A. Goldenberg. 2015. Subtyping: What it is and its role in precision
medicine. IEEE Intelligent Systems 30(4):70–75.
Schulam, P., and S. Saria. 2017. Reliable decision support using counterfactual
models. arXiv.org. https://arxiv.org/abs/1703.10651 (accessed November 13,
2019).
Schulam, P., and S. Saria. 2018. Discretizing logged interaction data biases learning
for decision-making. arXiv.org. https://arxiv.org/abs/1810.03025 (accessed
November 13, 2019).
Schuler, A. 2018. Some methods to compare the real-world performance of causal
estimators. Ph.D. dissertation, Stanford University. http://purl.stanford.edu/
vg743rx0211 (accessed November 13, 2019).
Shah, N. D., E. W. Steyerberg, and D. M. Kent. 2018. Big data and predictive
analytics: Recalibrating expectations. JAMA 320(1):27–28.
Shah, N. H., A. Milstein, and S. C. Bagley. 2019. Making machine learning
models clinically useful. JAMA 322(14):1351–1352. https://doi.org/10.1001/
jama.2019.10306.
Shah, S. J., D. H. Katz, S. Selvaraj, M. A. Burke, C. W. Yancy, M. Gheorghiade,
R. O. Bonow, C. C. Huang, and R. C. Deo. 2015. Phenomapping for novel
classification of heart failure with preserved ejection fraction. Circulation
131(3):269–279.
Shortliffe, E. H. 1974. MYCIN: A rule-based computer program for advising physicians
regarding antimicrobial therapy selection. No. AIM-251. Stanford University. https://
searchworks.stanford.edu/view/12291681 (accessed November 13, 2019).
Sohn, S., D. W. Larson, E. B. Habermann, J. M. Naessens, J. Y. Alabbad, and H. Liu.
2017. Detection of clinically important colorectal surgical site infection using
Bayesian network. Journal of Surgical Research 209:168–173.
Steyerberg, E. W., A. J. Vickers, N. R. Cook, T. Gerds, M. Gonen, N. Obuchowski,
M. J. Pencina, and M. W. Kattan. 2010. Assessing the performance of prediction
models: A framework for some traditional and novel measures. Epidemiology
21(1):128–138.
Subbaswamy, A., P. Schulam, and S. Saria. 2018. Learning predictive models that
transport. arXiv.org. https://arxiv.org/abs/1812.04597 (accessed November 13,
2019).
Suresh, H., and J. V. Guttag. 2019. A framework for understanding unintended
consequences of machine learning. arXiv.org. https://arxiv.org/abs/1901.10002
(accessed November 13, 2019).
Vellido, A. 2019. The importance of interpretability and visualization in
machine learning for applications in medicine and health care. Neural
Computing and Applications. https://doi.org/10.1007/s00521-019-04051-w
(accessed November 13, 2019).
Verghese, A., N. H. Shah, and R. A. Harrington. 2018. What this computer needs
is a physician: Humanism and artificial intelligence. JAMA 319(1):19–20.
Voosen, P. 2017. The AI detectives. Science 357(6346):22–27. https://science.
sciencemag.org/content/357/6346/22.summary (accessed November 13, 2019).
Wang, W., M. Bjarnadottir, and G. G. Gao. 2018. How AI plays its tricks:
Interpreting the superior performance of deep learning-based approach in
predicting healthcare costs. SSRN. https://papers.ssrn.com/sol3/papers.cfm?
abstract_id=3274094 (accessed November 13, 2019).
Wiens, J., S. Saria, M. Sendak, M. Ghassemi, V. X. Liu, F. Doshi-Velez, K. Jung, K. Heller, D. Kale, M. Saeed, P. N. Ossorio, S. Thadaney-Israni, and A. Goldenberg.
2019. Do no harm: A roadmap for responsible machine learning for health care.
Nature Medicine 25(9):1337–1340.
Wikipedia. 2019. Training, validation, and test sets. https://en.wikipedia.org/wiki/
Training,_validation,_and_test_sets (accessed November 13, 2019).
Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak,
N. Blomberg, J. W. Boiten, L. B. da Silva Santos, P. E. Bourne, and J. Bouwman.
2016. The FAIR Guiding Principles for scientific data management and
stewardship. Scientific Data 3:Art. 160018.
Williams, J. B., D. Ghosh, and R. C. Wetzel. 2018. Applying machine learning to
pediatric critical care data. Pediatric Critical Care Medicine 19(7):599–608.
Wu, A. 2019. Solving healthcare’s big epidemic—Physician burnout. Forbes. https://
www.forbes.com/sites/insights-intelai/2019/02/11/solving-healthcares-big-
epidemicphysician-burnout/#64ed37e04483 (accessed November 13, 2019).
Yadlowsky, S., R. A. Hayward, J. B. Sussman, R. L. McClelland, Y. I. Min, and
S. Basu. 2018. Clinical implications of revised pooled cohort equations for
estimating atherosclerotic cardiovascular disease risk. Annals of Internal Medicine
169(1):20–29.
Yu, K. H., A. L. Beam, and I. S. Kohane. 2018. Artificial intelligence in healthcare.
Nature Biomedical Engineering 2(10):719–731.
Zhan, A., S. Mohan, C. Tarolli, R. B. Schneider, J. L. Adams, S. Sharma, M. J. Elson,
K. L. Spear, A. M. Glidden, M. A. Little, and A. Terzis. 2018. Using smartphones
and machine learning to quantify Parkinson disease severity: The mobile
Parkinson disease score. JAMA Neurology 75(7):876–880.
Zhou, Z. H. 2017. A brief introduction to weakly supervised learning. National
Science Review 5(1):44–53.
INTRODUCTION
The effective use of artificial intelligence (AI) in clinical settings currently presents
an opportunity for thoughtful engagement. There is steady and transformative
progress in methods and tools needed to manipulate and transform clinical data,
and increasingly mature data resources have supported novel development of
accurate and sophisticated AI in some health care domains (although the risk of not
being sufficiently representative is real). However, few examples of AI deployment
and use within the health care delivery system exist, and there is sparse evidence
for improved processes or outcomes when AI tools are deployed (He et al., 2019).
For example, within machine learning risk prediction models—a subset of the
larger AI domain—the sizable literature on model development and validation
is in stark contrast to the scant data describing successful clinical deployment of
those models in health care settings (Shortliffe and Sepúlveda, 2018).
This discrepancy between development efforts and successful use of AI reflects
the hurdles in deploying decision support systems and tools more broadly
(Tcheng et al., 2017). While some impediments are technical, more relate to
the complexity of tailoring applications for integration with existing capabilities
in electronic health records (EHRs), poor understanding of users’ needs and
expectations for information, poorly defined clinical processes and objectives, and
even concerns about legal liability (Bates, 2012; Miller et al., 2018; Unertl et al.,
2007, 2009). These impediments may be balanced by the potential for gain, as one
cross-sectional review of closed malpractice claims found that more than one-half
of malpractice claims could have been potentially prevented by well-designed
clinical decision support (CDS) in the form of alerts (e.g., regarding potential
drug–drug interactions and abnormal test results), reminders, or electronic
checklists (Zuccotti et al., 2014). Although, in many instances, the deployment
of AI tools in health care may be conducted on a relatively small scale, it is
important to recognize that an estimated 50 percent of information technology
(IT) projects fail in the commercial sector (Florentine, 2016).
Setting aside the challenges of physician-targeted, point-of-care decision
support, there is great opportunity for AI to improve domains outside of
encounter-based care delivery, such as in the management of patient populations
or in administrative tasks for which data and work standards may be more
readily defined, as is discussed in more detail in Chapter 3. These future priority
areas will likely be accompanied by their own difficulties, related to translating
AI applications into effective tools that improve the quality and efficiency of
health care.
AI tools will also produce challenges that are entirely related to the novelty of
the technology. Even at this early stage of AI implementation in health care, the
use of AI tools has raised questions about the expectations of clinicians and health
systems regarding transparency of the data models, the clinical plausibility of the
underlying data assumptions, whether AI tools are suitable for discovery of new
causal links, and the ethics of how, where, when, and under what circumstances
AI should be deployed (He et al., 2019). At this time in the development cycle,
methods to estimate the requirements, care, and maintenance of these tools and
their underlying data needs remain a rudimentary management science.
There are also proposed regulatory rules that will influence the use of AI in
health care. On July 27, 2018, the Centers for Medicare & Medicaid Services
(CMS) published a proposed rule that aims to increase Medicare beneficiaries’
access to physicians’ services routinely furnished via “communication technology.”
The rule defines such clinician services as those that are defined by and inherently
involve the use of computers and communication technology; these services
will be associated with a set of Virtual Care payment codes (CMS, 2019). These
services would not be subject to the limitations on Medicare telehealth services
and would instead be paid under the Physician Fee Schedule, as other physicians’
services are. CMS’s evidentiary standard of clinical benefit for determining
coverage under the proposed rule does not include minor or incidental benefits.
This proposed rule is relevant to all clinical AI applications, because they all
involve computer and communication technology and aim to deliver substantive
clinical benefit consistent with the examples set forth by CMS. If clinical AI
Kesselheim, 2018). SaMD provided by EHR vendors will not be regulated under
the Precertification Program, but those supported outside those environments
and either added to them or provided to patients via separate routes (e.g., apps
on phones, services provided as part of pharmacy benefit managers) will be. The
nature of regulation is evolving but will need to account for changes in clinical
practice, data systems, populations, etc. However, regulated or not, SaMD that
incorporates AI methods will require careful testing and retesting, recalibration,
and revalidation at the time of implementation as well as periodically afterward.
All point-of-care AI applications will need to adhere to best practices for the
form, function, and workflow placement of CDS and incorporate best practices
in human–computer interaction and human factors design (Phansalkar et al.,
2010). This will need to occur in an environment where widely adopted EHRs
continue to evolve and where there will likely be opportunities for disruptive
technologies.
Face-to-face interactions with patients are, in a sense, only the tip of the iceberg
of health care delivery (Blane et al., 2002; Network for Excellence in Health
Innovation, 2015; Tarlov, 2002). A complex array of people and services is necessary to support direct care, and these services tend to consume and generate massive
amounts of data. Diagnostic services such as laboratory, pathology, and radiology
procedures are prime examples and are distinguished by the generation of clinical
data, including dense imaging, as well as interpretations and care recommendations
that must be faithfully transmitted to the provider (and sometimes the patient) in
a timely manner.
AI will certainly play a major role in tasks such as automated image processing (e.g., radiology, ophthalmology, dermatology, and pathology) and signal processing (e.g., electrocardiogram, audiology, and electroencephalography). In addition to
interpretation of tests and images, AI will be used to integrate and array results
with other clinical data to facilitate clinical workflow (Topol, 2019).
Enterprise Operations
clinical decisions, and usually pose lower risk, they are likely to be more tractable targets for AI systems to support in the immediate future. In addition,
the data necessary to train models in these settings are often more easily available
than in clinical settings. For example, in a hospital, these tasks might include
the management of billing, pharmacy, supply chain, staffing, and patient flow.
In an outpatient setting, AI-driven applications could assume some of the
administrative tasks such as gathering information to assist with decisions about
insurance coverage, scheduling, and obtaining preapprovals. Some of these topics
are discussed in greater detail in Chapter 3.
traditional medical care and public health systems (see Figure 6-2). Although
there is no broadly accepted definition of this function within health care delivery
systems, one goal is to standardize routine aspects of care, typically in an effort to improve clinical performance metrics across large populations and systems of care, thereby improving quality and reducing costs.
Promoting healthy behaviors and self-care are major focuses of population
health management efforts (Kindig and Stoddart, 2003). Much of the work of
FIGURE 6-2 | Relationship of population health to public health and standard clinical care.
SOURCE: Definition of population health from Kindig, 2007.
health care delivery system. In particular, smartphone and mobile applications have
transformed the potential for patient contact, active participation in health care
behavior modification, and reminders. These applications also hold the potential
for health care delivery to access new and important patient data streams to help
stratify risk, provide care recommendations, and help prevent complications of
chronic diseases. This trend is likely to blur the traditional boundaries of tasks
now performed during face-to-face appointments.
Patient- and caregiver-facing tools represent an area of strong potential
growth for AI deployment and are expected to empower users to assume greater
control over their health and health care (Topol, 2015). Moreover, there is
the potential for creating a positive feedback loop where patients’ needs and
preferences, expressed through their use of AI-support applications, can then
be incorporated into other applications throughout the health care delivery
system. As growth of online purchasing continues, the role of AI in direct patient
interaction—providing wellness, treatment, or diagnostic recommendations via
mobile platforms—will grow in parallel to that of brick-and-mortar settings.
Proliferation of these applications will continue to amplify and enhance data
collected through traditional medical activities (e.g., lab results, pharmacy fill
data). Mobile applications are increasingly able to cross-link various sources of
data and potentially enhance health care (e.g., through linking grocery purchase
data to health metrics to physical steps taken or usage statistics from phones).
The collection and presentation of AI recommendations using mobile- or
desktop-based platforms is critical because patients are increasingly engaging
in self-care activities supported by applications available on multiple platforms.
In addition to being used by patients, the technology will likely be heavily used
by their family and caregivers.
Unfortunately, the use of technologies intended to support self-management of health by individuals has been lagging, as has evaluation of their effectiveness (Abdi et al., 2018). Although there are more than 320,000 health apps currently
available, and these apps have been downloaded nearly 4 billion times, little
research has been conducted to determine whether they improve health (Liquid
State, 2018). In a recent overview of systematic reviews of studies evaluating
stand-alone, mobile health apps, only 6 meta-analyses including a total of
23 randomized trials could be identified. In all, 11 of the 23 trials showed a
meaningful effect on health or surrogate outcomes attributable to apps, but
the overall evidence of effectiveness was deemed to be of very low quality
(Byambasuren et al., 2018). In addition, there is a growing concern that many
of these apps share personal health data in ways that are opaque and potentially
worrisome to users (Loria, 2019).
Risk Prediction
For example, EHR data from nearly 250 million patients were
analyzed using machine learning to determine the most effective second-line
hypoglycemic agents (Vashisht et al., 2018).
Image Processing
More than a decade ago, the National Academy of Medicine (NAM) recognized
the necessity for health systems to respond effectively to the host of challenges
posed by rising expectations for quality and safety in an environment of rapidly
evolving technology and accumulating massive amounts of data (IOM, 2011).
The health system was reimagined as a dynamic system that does not merely
deliver health care in the traditional manner based on clinical guidelines and
professional norms, but is continually assessing and improving by harnessing
the power of IT systems—that is, a system with ongoing learning hardwired
into its operating model. In this conception, the wealth of data generated in the
process of providing health care becomes readily available in a secure manner for
incorporation into continuous improvement activities within the system and for
research to advance health care delivery in general (Friedman et al., 2017). In this
manner, the value of the care that is delivered to any individual patient imparts
benefit to the larger population of similar patients. To provide context for why
the learning health system (LHS) is critical to considering AI in health
care, we have referenced 10 recommendations from a prior NAM report in this
domain and aligned each with how AI could be considered within the LHS
(see Table 6-1).
Clearly listed as 1 of the 10 priorities is involvement of patients and families,
which should occur in at least two critically important ways. First, they need
to be informed, both generally and specifically, about how AI applications are
being integrated into the care they receive. Second, AI applications provide
an opportunity to enhance engagement of patients in shared decision making.
Although interventions to enhance shared decision making have not yet shown
consistently beneficial effects, these applications are very early in development
and have great potential (Légaré et al., 2018).
The adoption of AI technology provides a sentinel opportunity to advance
the notion of LHSs. AI requires the digital infrastructure that enables the LHS
to operate and, as described in Table 6-1, AI applications can be fundamentally
designed to facilitate evaluation and assessment. One large health system in the
United States has proclaimed ambitious plans to incorporate AI into “every patient
interaction, workflow challenge and administrative need” to “drive improvements
in quality, cost and access,” and many other health care delivery organizations and
IT companies share this vision, although it is clearly many years off (Monegain,
2017). As AI is increasingly embedded into the infrastructure of health care
delivery, it is mandatory that the data generated be available not only to evaluate
the performance of AI applications themselves, but also to advance understanding
about how our health care systems are functioning and how patients are faring.
TABLE 6-1 | Leveraging Artificial Intelligence Tools into a Learning Health System

Foundational Elements

• Digital infrastructure. LHS recommendation: Improve the capacity to capture
clinical, care delivery process, and financial data for better care, system
improvement, and the generation of new knowledge. Mapping to AI in health care:
Improve the capacity for unbiased, representative data capture with broad coverage
for the data elements needed to train artificial intelligence (AI).

• Data utility. LHS recommendation: Streamline and revise research regulations to
improve care, promote and capture clinical data, and generate knowledge. Mapping
to AI in health care: Leverage continuous quality improvement (QI) and
implementation science methods to help select when AI tools are the most
appropriate choice to optimize clinical operations, and harness AI tools to support
continuous improvement.

Care Improvement Targets

• Clinical decision support. LHS recommendation: Accelerate integration of the best
clinical knowledge into care decisions. Mapping to AI in health care: Accelerate
integration of AI tools into clinical decision support applications.

• Patient-centered care. LHS recommendation: Involve patients and families in
decisions regarding health and health care, tailored to fit their preferences. Mapping
to AI in health care: Involve patients and families in how, when, and where AI tools
are used to support care, in alignment with their preferences.

• Community links. LHS recommendation: Promote community–clinical partnerships
and services aimed at managing and improving health at the community level.
Mapping to AI in health care: Promote use of AI tools in community and consumer
health applications in a responsible, safe manner.

• Care continuity. LHS recommendation: Improve coordination and communication
within and across organizations. Mapping to AI in health care: Improve AI data
inputs and outputs through improved care coordination and data interchange.

• Optimized operations. LHS recommendation: Continuously improve health care
operations to reduce waste, streamline care delivery, and focus on activities that
improve patient health. Mapping to AI in health care: Leverage continuous QI and
implementation science methods to help select when AI tools are the most
appropriate choice to optimize clinical operations.

Policy Environment

• Financial incentives. LHS recommendation: Structure payment to reward continuous
learning and improvement in the provision of best care at lower cost. Mapping to AI
in health care: Use AI tools in business practices to optimize reimbursement, reduce
cost, and (it is hoped) do so at a neutral or positive balance on quality of care.

• Performance transparency. LHS recommendation: Increase transparency on health
care system performance. Mapping to AI in health care: Make robust performance
characteristics for AI tools transparent and assess them in the populations within
which they are deployed.

• Broad leadership. LHS recommendation: Expand commitment to the goals of a
continuously learning health care system. Mapping to AI in health care: Promote
broad stakeholder engagement and ownership in governance of AI systems in
health care.

NOTE: Recommendations from Best Care at Lower Cost (IOM, 2013) for a Learning Health System (LHS)
are paired with how AI tools can be leveraged into the LHS.
SOURCE: Adapted with permission from IOM, 2013.
After the considerations delineated in the previous section have been resolved
and a decision has been made to proceed with the adoption of an AI application,
the organization requires a systematic approach to implementation. Frameworks
for conceptualizing, designing, and evaluating this process are discussed below, but
all implicitly incorporate the most fundamental health care improvement
model, often referred to as a plan-do-study-act (PDSA) cycle. This approach was
introduced more than two decades ago by W. E. Deming, the father of modern
quality improvement (Deming, 2000). The PDSA cycle relies on the intimate
As is the case with any health care improvement activity, the nature of the
effort is cyclical and iterative, as is summarized in Figure 6-3. As discussed earlier,
the process begins with clear identification of the clinical problem or need to be
addressed. Often the problem will be one identified by clinicians or administrators
as a current barrier or frustration or as an opportunity to improve clinical or
operational processes. Even so, it is critical for the governance process to delineate
the extent and magnitude of the issue and ensure that it is not idiosyncratic
and that there are not simpler approaches to addressing the problem, rather than
TABLE 6-3 | Key Artificial Intelligence (AI) Tool Implementation Concepts, Considerations,
and Tasks Translated to AI-Specific Considerations

• Implementation task or concept: Identifying the clinical or administrative problem to
be addressed. AI-relevant aspects: Consideration of the problem to be addressed
should precede and be distinct from the selection and implementation of specific
technologies, such as AI systems.

• Implementation task or concept: Assessing organizational readiness for change, which
may entail surveying individuals who are likely to be affected. An example would be
the Organizational Readiness to Change Assessment tool based on the Promoting
Action on Research Implementation in Health Services framework (Helfrich et al.,
2009). AI-relevant aspects: It is important to include clinicians, information
technology (IT) professionals, data scientists, and health care system leadership.
These stakeholders are essential to effective planning for organizational preparation
for implementing an AI solution.

• Implementation task or concept: Achieving consensus among stakeholders that the
problem is important and relevant and providing persuasive information that the
proposed solution is likely to be effective if adopted. AI-relevant aspects: It is
important to include clinicians, IT professionals, data scientists, and health care
system leadership. These stakeholders are essential to effective planning for
organizational preparation for implementing an AI solution.

• Implementation task or concept: When possible, applying standard organizational
approaches that will be familiar to staff and patients without undue rigidity and
determining what degree of customization will be permitted. AI-relevant aspects: For
AI technologies, this includes developing and adopting standards for how data are
prepared, models are developed, and performance characteristics are reported. In
addition, standard user interfaces and education surrounding these technologies
should be considered.

• Implementation task or concept: When possible, defining how adoption will improve
workflow, patient outcomes, or organizational efficiency. AI-relevant aspects: When
possible, explicitly state and evaluate a value proposition and, as important, assess
the likelihood and magnitude of improvements with and without implementation of
AI technologies.

• Implementation task or concept: Securing strong commitment from senior
organizational leadership. AI-relevant aspects: Typically, this includes organizational,
clinical, IT, and financial leaders for establishing governance and organizational
prioritization strategies and directives.

• Implementation task or concept: Identifying strong local leadership, typically in the
form of clinical champions and thought leaders. AI-relevant aspects: Each AI system
placed into practice needs a clinical owner (or owners) who will be the superusers of
the tools, champion them, and provide early warning when these tools are not
performing as expected.

• Implementation task or concept: Engaging stakeholders in developing a plan for
implementation that is feasible and acceptable to users and working to identify
offsets if the solution is likely to require more work on the part of users. AI-relevant
aspects: It is critical that AI tools be implemented in an environment incorporating
user-centered design principles, with a goal of decreasing user workload, whether
time or cognition. This requires detailed implementation plans that address changes
in workflow, data streams, and the adoption or elimination of equipment if necessary.

• Implementation task or concept: Providing adequate education and technical support
during implementation. AI-relevant aspects: Implementation of AI tools in health
care settings should be done in concert with educational initiatives and with both
clinical champion and informatics/IT support that ideally is available immediately
and capable of evaluating and remedying problems that arise.
from other stakeholders such as patients, end users, and members of the public.
Although AI applications are currently a popular topic and new products are
being touted, it is important to recall that the field is near the peak of the Gartner
Hype Cycle (see Chapter 4), and some solutions are at risk of overpromising the
achievable benefit. Thus, it is likely that early adopters will spend more and realize
less value than organizations that are more strategic and, perhaps, willing to defer
investments until products are more mature and have been proven. Ultimately,
it will be necessary for organizations to assess the utility of an AI application
in terms of the value proposition. For example, in considering adoption of an
AI application for prediction, it is possible that in certain situations, given the
cost, logistical complexity, and efficacy of the action, there may not be feasible
operating zones in which a prediction–action pair, as described below, has clinical
utility. Therefore, assessing the value proposition of deploying AI in clinical
settings has to include the utility of downstream actions triggered by the system
along with the frequency, cost, and logistics of those actions.
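To make this value-proposition framing concrete, the following minimal sketch (Python; the function name, variable names, and all numbers are hypothetical) estimates the expected net value of a prediction–action pair from its alert characteristics and the benefit and cost of the triggered action. A pair for which no threshold yields a positive value has no feasible operating zone.

```python
# Illustrative sketch only: back-of-the-envelope net value of a prediction-action
# pair at a chosen alert threshold, using assumed (hypothetical) inputs.

def net_value_per_1000(prevalence, sensitivity, ppv,
                       benefit_per_true_positive, cost_per_alert):
    """Expected net value per 1,000 screened patients.

    prevalence: fraction of patients with the target condition
    sensitivity: fraction of those patients the model flags
    ppv: fraction of alerts that are true positives
    benefit_per_true_positive: value of acting on a correct alert
    cost_per_alert: cost of the downstream action triggered by any alert
    """
    true_positives = 1000 * prevalence * sensitivity
    total_alerts = true_positives / ppv if ppv > 0 else 0.0
    return true_positives * benefit_per_true_positive - total_alerts * cost_per_alert

# Hypothetical scenario: 5% prevalence, 70% sensitivity, 20% PPV,
# $4,000 benefit per correctly treated case, $150 cost per alert worked up.
print(net_value_per_1000(0.05, 0.70, 0.20, 4000, 150))  # about 113,750 per 1,000 patients
```

In practice, such an estimate would use locally measured performance, action costs, and benefits rather than assumed values, and would be repeated across plausible thresholds.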
The clinical validation of AI tools should be viewed as distinct from the technical
validation described in Chapter 5. For AI, clinical validation has two key axes:
Given the novelty of AI, the limited evidence of its successful use to date, the
limited regulatory frameworks around it, and the fact that most AI tools depend on
nuances of local data—and the clinical workflows that generate these data—ongoing
monitoring of AI as it is deployed in health care is critical for ensuring its safe and
effective use. Ongoing evaluation should be based on prediction–action pairs, as
discussed earlier in this chapter, and should involve assessment of factors such as
the following (a minimal monitoring sketch appears after this list):
• how often the AI tool is accessed and used in the process of care;
• how often recommendations are accepted and the frequency of overrides
(with reasons if available);
• in settings where the data leave the EHR, logs of data access, application
programming interface (API) calls, and privacy changes;
• measures of clinical safety and benefit, optimally in the form of agreed-upon
outcome or process measures;
• organizational metrics relevant to workflow or back-office AI;
• user-reported issues, such as perceived inaccurate recommendations, untimely
or misdirected prompts, or undue distractions;
• records of ongoing maintenance work (e.g., data revision requests); and
• model performance against historical data (e.g., loss of model power due to
changes in documentation).
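As a rough illustration of how several of these factors might be tracked, the sketch below assumes a simple, hypothetical alert log; the field names and log format are illustrative rather than any particular system's schema.

```python
# Minimal monitoring sketch (assumed log format, hypothetical field names):
# each record describes one AI recommendation and what happened to it.
from collections import Counter

alert_log = [
    {"week": 1, "shown": True, "accepted": True,  "override_reason": None},
    {"week": 1, "shown": True, "accepted": False, "override_reason": "not clinically relevant"},
    {"week": 2, "shown": True, "accepted": False, "override_reason": "duplicate alert"},
]

shown = [r for r in alert_log if r["shown"]]
acceptance_rate = sum(r["accepted"] for r in shown) / len(shown)
override_reasons = Counter(r["override_reason"] for r in shown if not r["accepted"])
alerts_per_week = Counter(r["week"] for r in shown)

print(f"acceptance rate: {acceptance_rate:.0%}")
print("top override reasons:", override_reasons.most_common(3))
print("alert volume by week:", dict(alerts_per_week))  # an abrupt change may signal drift
```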
The complexity and extent of local evaluation and monitoring may necessarily
vary depending on the way AI tools are deployed into the clinical workflow, the
clinical situation, and the type of CDS being delivered, as these will in turn define
the clinical risk attributable to the AI tool.
The International Medical Device Regulators Forum (IMDRF) framework
for assessing the risk of SaMD is a potentially useful approach to developing SaMD
evaluation and monitoring strategies tailored to the level of potential risk posed by the
clinical situation in which the SaMD is employed. Although the IMDRF framework
is currently used to identify SaMD that require regulation, its conceptual model is
one that might be helpful in identifying the need for evaluating and monitoring
AI tools, both in terms of local governance and larger studies. The IMDRF
framework focuses on the clinical acuity of the location of care (e.g., intensive
care unit versus general preventive care setting), type of decision being suggested
(immediately life-threatening versus clinical reminder), and type of decision
support being provided (e.g., interruptive alert versus invisible “nudge”). In
general, the potential need for evaluation rises in concert with the clinical setting
and decision acuity, and as the visibility of the CDS falls (and the opportunity for
providers to identify and catch mistakes becomes lower).
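One way a local governance committee might operationalize these dimensions is to score each candidate tool and map the total to an evaluation tier. The categories, weights, and tiers below are hypothetical and are not the IMDRF classification itself; they simply illustrate that evaluation intensity should rise with setting and decision acuity and as CDS visibility falls.

```python
# Hypothetical sketch of an IMDRF-style triage of evaluation intensity.
# Dimension names and scores are illustrative only.

SETTING_ACUITY = {"critical care": 3, "inpatient ward": 2, "preventive care": 1}
DECISION_TYPE  = {"treat/diagnose": 3, "drive management": 2, "inform management": 1}
CDS_VISIBILITY = {"invisible nudge": 3, "passive display": 2, "interruptive alert": 1}

def evaluation_tier(setting, decision, visibility):
    """Higher total scores call for more rigorous local evaluation and monitoring."""
    score = SETTING_ACUITY[setting] + DECISION_TYPE[decision] + CDS_VISIBILITY[visibility]
    if score >= 8:
        return "prospective study plus continuous monitoring"
    if score >= 5:
        return "structured pre/post evaluation plus periodic audit"
    return "routine usage and safety metrics"

print(evaluation_tier("critical care", "drive management", "invisible nudge"))
```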
For higher risk AI tools, a focus on clinical safety and effectiveness—from either
a noninferiority or superiority perspective—is of paramount importance even as
other metrics (e.g., API data calls, user experience information) are considered.
High-risk tools will likely require evidence from rigorous studies for regulatory
purposes and will certainly require substantial monitoring at the time of and
following implementation. For low-risk clinical AI tools used at point of care, or
those that focus on administrative tasks, evaluation may rightly focus on process-
of-care measures and metrics related to the AI’s usage in practice to define its
positive and negative effects. We strongly endorse implementing all AI tools using
experimental methods (e.g., randomized controlled trials or A/B testing) where
possible. Large-scale pragmatic trials at multiple sites will be critical for the field
to grow but may be less necessary for local monitoring and for management of
an AI formulary. In some instances, because of costs, time constraints, or
other limitations, a randomized trial may not be practical or feasible. In these
circumstances, quasi-experimental approaches such as stepped-wedge designs or
even carefully adjusted retrospective cohort studies may provide valuable insights.
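Where patient-level randomization is impractical, a stepped-wedge design can be a pragmatic alternative: clusters such as clinics cross over from usual care to the AI tool at staggered, randomized times, so every cluster eventually receives the intervention while still contributing control periods. A minimal sketch of generating such a schedule (with hypothetical clinic names) follows.

```python
# Minimal stepped-wedge schedule sketch: clusters cross from control to the AI
# tool at randomized, staggered times. Clinic names and period counts are illustrative.
import random

clinics = ["A", "B", "C", "D", "E", "F"]
periods = 4  # evaluation periods after a baseline period
random.seed(0)
random.shuffle(clinics)

schedule = {}
for i, clinic in enumerate(clinics):
    crossover = 1 + i // 2  # two clinics switch to the AI tool at the start of each period
    schedule[clinic] = ["control" if p < crossover else "AI tool"
                        for p in range(periods + 1)]

for clinic, arms in schedule.items():
    print(clinic, arms)
```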
Monitoring outcomes after implementation will permit careful assessment, in
the same manner that systems regularly examine drug usage or order sets, and
may be able to use data innately collected by the AI tool itself as the basis for a
monitoring platform. Recent work has revealed that naive evaluation
of AI system performance may be overly optimistic, underscoring the need for more
thorough evaluation and validation.
In one such study (Zech et al., 2018), researchers evaluated the ability of a
clinical AI application that relied on imaging data to generalize across hospitals.
Specifically, they trained a neural network to diagnose pneumonia from patient
radiographs in one hospital system and evaluated its diagnostic ability on external
radiographs from different hospital systems, with their results showing that
performance on external datasets was significantly degraded. The AI application
was unable to generalize across hospitals due to differences between the training
data and evaluation data, a well-known but often ignored problem termed
dataset shift (Quiñonero-Candela et al., 2009; Saria and Subbaswamy, 2019). In
this instance, Zech and colleagues (2018) showed that large differences in the
prevalence of pneumonia between populations caused performance to suffer.
However, even subtle differences between populations can result in significant
performance changes (Saria and Subbaswamy, 2019). In the case of radiographs,
differences between scanner manufacturers or type of scanner (e.g., portable
versus nonportable) result in systematic differences in radiographs (e.g., inverted
color schemes or inlaid text on the image). Thus, during training, an AI system
can learn to determine very accurately which hospital system (and even which
department within that system) a particular radiograph came from (Zech et al., 2018)
and to use that information in making its prediction, rather than relying on more
generalizable patient-based data.
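The mechanism can be illustrated with a toy simulation (Python, with scikit-learn assumed available; the data are synthetic and the variable names are illustrative): a model trained where a site-specific shortcut feature tracks the outcome looks excellent on held-out data from the same site but degrades on an external site where that shortcut carries no signal.

```python
# Toy external-validation sketch on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_site(n, prevalence, artifact_strength):
    # "signal" is a genuinely predictive clinical feature;
    # "artifact" stands in for a site-specific shortcut (e.g., scanner type).
    y = rng.binomial(1, prevalence, n)
    signal = y + rng.normal(0, 1.0, n)
    artifact = artifact_strength * y + rng.normal(0, 0.5, n)
    return np.column_stack([signal, artifact]), y

X_train, y_train = make_site(5000, 0.30, artifact_strength=1.5)        # training hospital
X_internal, y_internal = make_site(2000, 0.30, artifact_strength=1.5)  # same hospital, held out
X_external, y_external = make_site(2000, 0.05, artifact_strength=0.0)  # different hospital

model = LogisticRegression().fit(X_train, y_train)
print("internal AUC:", round(roc_auc_score(y_internal, model.predict_proba(X_internal)[:, 1]), 2))
print("external AUC:", round(roc_auc_score(y_external, model.predict_proba(X_external)[:, 1]), 2))
```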
Clinical AI performance can also deteriorate within a site when practices,
patterns, or demographics change over time. As an example, consider the policy
by which physicians order blood lactate measurements. Historically, it may have
been the case that, at a particular hospital, lactate measurements were only ordered
to confirm suspicion of sepsis. A clinical AI tool for predicting sepsis that was
trained using historical data from this hospital would be vulnerable to learning
that the act of ordering a lactate measurement is associated with sepsis rather
than the elevated value of the lactate. However, if hospital policies change and
lactate measurements are more commonly ordered, then the association that had
been learned by the clinical AI would no longer be accurate. Alternatively, if the
patient population shifts, for example, to include more drug users, then elevated
lactate might become more common and the value of lactate being measured
would again be diminished. In both the case of changing policy and of patient
population, performance of the clinical AI application is likely to deteriorate,
resulting in an increase of false-positive sepsis alerts.
More broadly, such examples illustrate the importance of careful validation
in evaluating the reliability of clinical AI. A key means for measuring reliability
is through validation on multiple datasets. Classical algorithms that are applied
natively or used for training AI are prone to learning artifacts specific to the site
that produced the training data or specific to the training dataset itself. There
are many subtle ways that site-specific or dataset-specific bias can occur in real-
world datasets. Validation using external datasets will show reduced performance
for models that have learned patterns that do not generalize across sites (Schulam
and Saria, 2017). Other factors that could influence AI prediction might include
insurance coverage, discriminatory practices, or resource constraints. Overall,
when there are varying, imprecise measurements or classifications of outcomes
(i.e., labeling cases and controls), machine learning methods may exhibit what
is known as causality leakage (Bramley, 2017) and label leakage (Ghassemi
et al., 2018). An example of causality leakage in a clinical setting would be
when a clinician suspects a problem and orders a test, and the AI uses the test
itself to generate an alert, which then causes an action. Label leakage is when
information about a targeted task outcome leaks back into the features used to
generate the model.
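A simple screen for this kind of leakage, sketched below on synthetic data, is to check how well individual features predict the outcome on their own: an administrative feature such as "test ordered" that discriminates nearly as well as the outcome label itself is a warning sign that a model would be learning clinician suspicion rather than physiology. The variable names and rates here are hypothetical.

```python
# Leakage screen sketch on synthetic data: compare single-feature discrimination.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 5000
sepsis = rng.binomial(1, 0.10, n)

# Historically, lactate is ordered almost exclusively when sepsis is already suspected.
lactate_ordered = np.where(sepsis == 1, rng.binomial(1, 0.90, n), rng.binomial(1, 0.05, n))
heart_rate = 80 + 10 * sepsis + rng.normal(0, 12, n)  # genuine but noisier physiologic signal

for name, feature in [("lactate_ordered", lactate_ordered), ("heart_rate", heart_rate)]:
    print(name, "single-feature AUC:", round(roc_auc_score(sepsis, feature), 2))
# An ordering/administrative feature that outperforms physiologic features is a
# red flag for causality or label leakage.
```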
While external validation can reveal potential reliability issues related to
clinical AI performance, external validation is reactive in nature because
differences between training and evaluation environments are found after the
AI Model Maintenance
AI AND INTERPRETABILITY
One area of active innovation is how and where AI outputs are synthesized and
displayed to the end user, in many cases to assist with interpretability.
Innovations include new methods such as parallel models, in which one model
is used for core computation and another for interpretation (Hara and Hayashi,
2018; Krause et al., 2016; Turner, 2016). Others use novel graphical displays
and data discovery tools that sit alongside the AI to educate and help users in
health care settings as they become comfortable with the recommendations.
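One widely discussed variant of the parallel-model idea is a global surrogate: a flexible model performs the core computation, and a small, readable model is fit to its predictions to give users an approximate explanation. The sketch below uses synthetic data and generic feature names and is illustrative rather than a prescription for any particular product.

```python
# Global surrogate sketch: a shallow tree approximates a more complex model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 4))
y = ((X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, 3000)) > 0.5).astype(int)

# Core model: flexible but hard to interpret directly.
core = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Interpretation model: a depth-3 tree trained to mimic the core model's outputs.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, core.predict(X))
fidelity = (surrogate.predict(X) == core.predict(X)).mean()

print(f"surrogate fidelity to core model: {fidelity:.0%}")
print(export_text(surrogate, feature_names=["feature_0", "feature_1", "feature_2", "feature_3"]))
```

If such a surrogate is used, its fidelity to the core model should be reported alongside its explanation, because a low-fidelity surrogate can mislead users.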
There remains a paradox, however, because machine learning produces
algorithms based upon features that may not be readily interpretable. In the
absence of absolute transparency, stringent standards for performance must be
monitored and ensured. We may not understand all of the components upon
which an algorithm is based, but if the resulting recommendations are highly
accurate, and if surveillance of the performance of the AI system over time is
maintained, then we might continue to trust it to perform the assigned task. A
burgeoning number of applications for AI in the health care system do not assess
end users’ needs regarding the level of interpretability. Although the most stringent
criteria for transparency are within the point-of-care setting, there are likely
circumstances under which accuracy may be desired over transparency. Regardless
of the level of interpretability of the outputs of the AI algorithms, considerations
for users’ requirements should be addressed during development. With this in
mind, there are a few key factors related to the use and adoption of algorithmically
nontransparent AI tools that are worthy of consideration by clinicians and
health care delivery systems.
Although the need for algorithm transparency at a granular level is probably
overstated, descriptors of how data were collected and aggregated are essential,
It should be evident from this chapter that in this early phase of AI development,
adopting this technology requires substantial resources. Because of this barrier,
only well-resourced institutions may have access to AI tools and systems, while
institutions that serve less affluent and disadvantaged individuals will be forced to
forgo the technology. Early on, when clinical AI remains in rudimentary stages,
this may not be terribly disadvantageous. However, as the technology improves,
the digital divide may widen the significant disparities that already exist between
institutions. This would be ironic because AI tools have the potential to improve
quality and efficiency where the need is greatest. An additional potential risk
is that AI technology may be developed in environments that exclude patients
of different socioeconomic, cultural, and ethnic backgrounds, leading to poorer
performance in some groups. Early in the process of AI development, it is
critical that we ensure that this technology is derived from data gathered from
KEY CONSIDERATIONS
• The clinical and administrative leadership of health care systems, with input
from all relevant stakeholders such as patients, end users, and the general
public, must define the near- and far-term states that would be required to
measurably improve workflow or clinical outcomes. If these target states are
clearly defined, AI is likely to positively affect the health care system through
efficient integration into the EHR, population health programs, and ancillary
and allied health workflows.
• Before deploying AI, health systems should assess, through engagement with
stakeholders and users, especially patients, consumers, and the general public, the degree
to which transparency is required for AI to operate in a particular use case. This
includes identifying cultural resistance and workflow limitations that may dictate
key interpretability and actionability requirements for successful deployment.
• Through IT governance, health systems should establish standard processes for
the surveillance and maintenance of AI applications’ performance and, if at all
possible, automate those processes to enable the scalable addition of AI tools
for a variety of use cases.
• IT governance should engage health care system leadership, end users, and
target patients to establish a value statement for AI applications. This will
include analyses to ascertain the potential cost savings and/or clinical outcome
gains from implementation of AI.
• AI development and implementation should follow established best-practice
frameworks in implementation science and software development.
REFERENCES
Abdi, J., A. Al-Hindawi, T. Ng, and M. P. Vizcaychipi. 2018. Scoping review on the
use of socially assistive robot technology in elderly care. BMJ Open 8(2):e018815.
Bandos, A. I., H. E. Rockette, T. Song, and D. Gur. 2009. Area under the
free-response ROC curve (FROC) and a related summary index. Biometrics
65(1):247–256.
Bates, D. W. 2012. Clinical decision support and the law: The big picture. Saint
Louis University Journal of Health Law & Policy 5(2):319–324.
Blane, D., E. Brunner, and R. Wilkinson. 2002. Health and social organization:
Towards a health policy for the 21st century. New York: Routledge.
Braithwaite, J. 2018. Changing how we think about healthcare improvement.
BMJ 361:k2014.
Bramley, N. R. 2017. Constructing the world: Active causal learning in cognition. Ph.D.
thesis, University College London. http://discovery.ucl.ac.uk/1540252/9/
Bramley_neil_phd_thesis.pdf (accessed November 13, 2019).
Byambasuren, O., S. Sanders, E. Beller, and P. Glasziou. 2018. Prescribable mHealth
apps identified from an overview of systematic reviews. NPJ Digital Medicine
1(1):12.
Christodoulou, E., J. Ma, G. S. Collins, E. W. Steyerberg, J. Y. Verbakel, and
B. A. Van Calster. 2019. Systematic review shows no performance benefit of
machine learning over logistic regression for clinical prediction models. Journal
of Clinical Epidemiology 110:12–22.
CMS (Centers for Medicare & Medicaid Services). 2019. Proposed policy, payment,
and quality provisions changes to the Medicare physician fee schedule for calendar
year 2019. https://www.cms.gov/newsroom/fact-sheets/proposed-policy-
payment-and-quality-provisions-changes-medicare-physician-fee-schedule-
calendar-year-3 (accessed November 13, 2019).
Damschroder, L. J., D. C. Aron, R. E. Keith, S. R. Kirsh, J. A. Alexander, and J. C.
Lowery. 2009. Fostering implementation of health services research findings
Kindig, D., and G. Stoddart. 2003. What is population health? American Journal of
Public Health 93(3):380–383.
Koola, J. D., S. B. Ho, A. Cao, G. Chen, A. M. Perkins, S. E. Davis, and M. E.
Matheny. 2019. Predicting 30-day hospital readmission risk in a national
cohort of patients with cirrhosis. Digestive Diseases and Sciences.
https://doi.org/10.1007/s10620-019-05826-w.
Kraft, S. A., M. K. Cho, K. Gillespie, M. Halley, N. Varsava, K. E. Ormond, H. S.
Luft, B. S. Wilfond, and S. Soo-Jin Lee. 2018. Beyond consent: Building trusting
relationships with diverse populations in precision medicine research. American
Journal of Bioethics 18:3–20.
Krause, J., A. Perer, and E. Bertini. 2016. Using visual analytics to interpret
predictive machine learning models. arXiv.org. https://arxiv.org/abs/1606.05685
(accessed November 13, 2019).
Kull, M., T. M. S. Filho, and P. Flach. 2017. Beyond sigmoids: How to obtain
well-calibrated probabilities from binary classifiers with beta calibration. Electronic
Journal of Statistics 11:5052–5080. https://doi.org/10.1214/17-EJS1338SI.
Lee, K. J. 2018. AI device for detecting diabetic retinopathy earns swift FDA
approval. American Academy of Ophthalmology ONE Network. https://www.aao.
org/headline/first-ai-screen-diabetic-retinopathy-approved-by-f (accessed
November 13, 2019).
Lee, T. T., and A. S. Kesselheim. 2018. U.S. Food and Drug Administration
precertification pilot program for digital health software: Weighing the
benefits and risks. Annals of Internal Medicine 168(10):730–732.
https://doi.org/10.7326/M17-2715.
Légaré, F., R. Adekpedjou, D. Stacey, S. Turcotte, J. Kryworuchko, I. D. Graham, A.
Lyddiatt, M. C. Politi, R. Thomson, G. Elwyn, and N. Donner-Banzhoff. 2018.
Interventions for increasing the use of shared decision making by healthcare
professionals. Cochrane Database of Systematic Reviews.
https://doi.org/10.1002/14651858.CD006732.pub4.
Lenert, M. C., M. E. Matheny, and C. G. Walsh. 2019. Prognostic models will be
victims of their own success, unless. . . . Journal of the American Medical Informatics
Association 26(10). https://doi.org/10.1093/jamia/ocz145.
Li, P., S. N. Yates, J. K. Lovely, and D. W. Larson. 2015. Patient-like-mine: A real-time,
visual analytics tool for clinical decision support. In 2015 IEEE International
Conference on Big Data. Pp. 2865–2867.
Liquid State. 2018. 4 digital health app trends to consider for 2018. https://liquid-
state.com/digital-health-app-trends-2018 (accessed November 13, 2019).
Loria, K. 2019. Are health apps putting your privacy at risk? Consumer Reports.
https://www.consumerreports.org/health-privacy/are-health-apps-putting-
your-privacy-at-risk (accessed November 13, 2019).
Lynch, J. K., J. Glasby, and S. Robinson. 2019. If telecare is the answer, what
was the question? Storylines, tensions and the unintended consequences of
technology-supported care. Critical Social Policy 39(1):44–65.
Marx, V. 2015. Human phenotyping on a population scale. Nature Methods
12:711–714.
McGinnis, J. M., P. Williams-Russo, and J. R. Knickman. 2002. The case for
more active policy attention to health promotion. Health Affairs 21(2):78–93.
Miller, A., J. D. Koola, M. E. Matheny, J. H. Ducom, J. M. Slagle, E. J. Groessl,
F. F. Minter, J. H. Garvin, M. B. Weinger, and S. B. Ho. 2018. Application of
contextual design methods to inform targeted clinical decision support
interventions in sub-specialty care environments. International Journal of Medical
Informatics 117:55–65.
Monegain, B. 2017. Partners HealthCare launches 10-year project to boost
AI use. Healthcare IT News. https://www.healthcareitnews.com/news/partners-
healthcare-launches-10-year-project-boost-ai-use (accessed November 13, 2019).
Moons, K. G., A. P. Kengne, D. E. Grobbee, P. Royston, Y. Vergouwe, D. G. Altman,
and M. Woodward. 2012. Risk prediction models: II. External validation, model
updating, and impact assessment. Heart 98(9):691–698.
Morain, S. R., N. E. Kass, and R. R. Faden. 2018. Learning is not enough: Earning
institutional trustworthiness through knowledge translation. American Journal of
Bioethics 18:31–34.
Muller, J. Z. 2018. The tyranny of metrics. Princeton, NJ: Princeton University Press.
Nelson, E. C., P. B. Batalden, and M. M. Godfrey. 2011. Quality by design: A clinical
microsystems approach. San Francisco, CA: John Wiley & Sons.
Nelson, K. M., E. T. Chang, D. M. Zulman, L. V. Rubenstein, F. D. Kirkland, and S.
D. Fihn. 2019. Using predictive analytics to guide patient care and research in
a national health system. Journal of General Internal Medicine 34(8):1379–1380.
Network for Excellence in Health Innovation. 2015. Healthy People/Healthy
Economy: An initiative to make Massachusetts the national leader in health and wellness.
https://www.nehi.net/publications/65-healthy-people-healthy-economy-
a-five-year-review-and-five-priorities-for-the-future/view (accessed
November 13, 2019).
Object Management Group. Case management model and notation, v1.1. https://
www.omg.org/cmmn/index.htm (accessed November 13, 2019).
Palmer, A. 2018. IBM’s Watson AI suggested “often inaccurate” and “unsafe”
treatment recommendations for cancer patients, internal documents show.
DailyMail.com. https://www.dailymail.co.uk/sciencetech/article-6001141/
IBMs-Watson-suggested-inaccurate-unsafe-treatment-recommendations-
cancer-patients.html?ito=email_share_article-top (accessed November 13, 2019).
INTRODUCTION
Developers and users of health care AI systems may encounter many different
legal regimes, including federal statutes, federal regulations, and state tort law.
Below are a few of the most significant among these laws and regulations:
• Federal Food, Drug, and Cosmetic Act (FDCA): FDA enforces the
FDCA, which regulates the safety and effectiveness of drugs and medical
devices, including certain forms of medical software (21 U.S.C. §§ 301 ff.).
The bulk of this chapter describes the application of the FDCA to health care
clinical AI systems.
• Health Insurance Portability and Accountability Act (HIPAA): In
addition to the Privacy Rule (described in more detail below), HIPAA authorizes
the U.S. Department of Health and Human Services to enforce the Security
Rule (45 C.F.R. Parts 160 and 164). These rules create privacy and security
requirements for certain health information. The HIPAA Breach Notification
Rule also requires certain entities to provide notifications of health
information breaches (45 C.F.R. §§ 164.400–164.414). To the extent that the
development or use of health care AI systems involves health information
covered by HIPAA, those requirements may apply to developers or users of
such systems.
• Common Rule: The Common Rule sets requirements for research on
human subjects that either is federally funded or, in many instances, takes
place at institutions that receive any federal research funding (45 C.F.R.
Part 46). Among other things, most human subjects research must be reviewed
by an institutional review board (45 C.F.R. § 46.109). These requirements
can apply to AI used for research or the research used to create health care
AI. The Common Rule is enforced by the Office for Human Research
Protections.
• Federal Trade Commission Act (FTCA): The FTCA prohibits deceptive
and unfair trade practices affecting interstate commerce (15 U.S.C. §§ 41–58).
These could include acts relating to false and misleading health claims,
representations regarding a piece of software’s performance, or claims affecting
consumer privacy and data security. Health care AI products may raise any
of these types of claims. The Federal Trade Commission (FTC) enforces the
requirements of the FTCA.
• FTC Health Breach Notification Rule: This FTC rule, separate from
HIPAA’s Breach Notification Rule, requires certain businesses to provide
TABLE 7-1 | Typical Applicability of Various Laws and Regulations to U.S. Health Care
Artificial Intelligence Systems
A key set of laws works to ensure the safety and efficacy of medical technology,
including clinical AI systems. The principal requirements are determined by the
FDCA and enforced by FDA. State tort law also plays a role in ensuring quality
by managing liability for injuries, including those that may arise from insufficient
care in developing or using clinical AI.
The raison d’être of clinical AI systems is to be coupled with and to inform
human decision making that bears upon the content and conduct of clinical
care, including preventive care, to promote favorable, equitable, and inclusive
clinical outcomes and/or mitigate risks or interdict adverse events or nonoptimal
outcomes. Regulatory authorities in various countries, including FDA, expect
the pharmaceutical, medical device, and biotechnology industries to conduct
their development of all diagnostics and therapeutics (including companion
and complementary diagnostics and therapeutics) toward the goal of safer, more
efficacious, and personalized medicine. This development should result in care that
is, at a minimum, not inferior to conventional (non-AI-based) standard-of-care
outcomes and safety endpoints. Health services are expected to fund such
AI-coupled diagnostics and therapeutics, and prescribers and patients are, over
time, likely to adopt and accept them. Increased development of “coupled”
products (including clinical AI systems) could result in “safer and improved
clinical and cost-effective use of medicines, more efficient patient selection for
clinical trials, more cost-effective treatment pathways for health services,” and a
less risky, more profitable development process for therapeutics and diagnostics
developers (Singer and Watkins, 2012).
The right level of regulation requires striking a delicate balance. While the
over-regulation or over-legislation of AI-based personalized medical apps may
delay the translation of machine learning findings to meaningful, widespread
deployment, appropriate regulatory oversight is necessary to ensure adoption, trust,
quality, safety, equitable inclusivity, and effectiveness. Regulatory oversight is also
needed to minimize false-negative and false-positive errors and misinterpretation
of clinical AI algorithms’ outputs, actions, and recommendations to clinicians.
Recent examination of the ethics of genome-wide association studies for
multifactorial diseases found three criteria necessary for identification of genes
to be useful: (1) the data in the studies and work products derived from them
must be reproducible and applicable to the target population; (2) the data and the
derived work products should have significant usefulness and potential beneficial
impact to the patients to whom they are applied; and (3) the resulting knowledge
should lead to measurable utility for the patient and outweigh associated risks or
potential harms (Jordan and Tsai, 2010). Thus, regulatory standards for clinical
AI tools should at least extend to accuracy and relevancy of data inputs and model
outputs, marketing of AI systems for specific clinical indications, and transparency
or auditability of clinical AI performance.
Some AI systems, particularly those algorithms that will perform or assist with
clinical tasks related to diagnosis, interpretation, or treatment, may be classified
as medical devices and fall under applicable FDA regulations. Other AI systems
may instead be classified as “services” or as “products,” but not medical devices
(see Box 7-1). FDA’s traditional regulatory processes for medical devices include
establishment registration and listing plus premarket submissions for review and
approval or clearance by FDA’s Center for Devices and Radiological Health Office
of Device Evaluation or Office of In Vitro Diagnostics and Radiological Health.
In the United States, the Medical Device Amendments of 1976 (P.L. 94-295)
to the FDCA (21 U.S.C. § 360c) established a risk-based framework for the
regulation of medical devices. The law established a three-tiered risk classification
system based on the risk posed to patients should the device fail to perform
as intended. The FDCA (21 U.S.C. § 360j) definition of a medical device is
summarized in Box 7-1.
BOX 7-1
Federal Food, Drug, and Cosmetic Act (21 U.S.C. § 360j)
Medical Device Definition
The 21st Century Cures Act (Cures Act, P.L. 114-255) was signed into law on
December 13, 2016. The significant portion with regard to clinical AI systems is
Section 3060 (“Regulation of Medical and Certain Decision Support Software”),
which amends Section 520 of the FDCA so as to provide five important
exclusions from the definition of a regulatable medical device. Under Section
3060 of the Act, clinical decision support (CDS) software is nominally exempted
from regulation by FDA—that is, it is defined as not a medical device—if it is
intended for the purpose of:
This exemption does not apply to software that is “intended to acquire, process,
or analyze a medical image or a signal from an in vitro diagnostic device or a
pattern or signal from a signal acquisition system” (21st Century Cures Act § 3060).
FDA has stated that it would use enforcement discretion to not enforce compliance
with medical device regulatory controls for medical device data systems, medical
image storage devices, and medical image communications devices (FDA, 2017a).
The 21st Century Cures Act codifies some of FDA’s prior posture of restraint
from enforcement.
Under this system, devices that pose greater risks to patients are subject to more
regulatory controls and requirements. Specifically, general controls are sufficient
to provide reasonable assurance of a Class I device’s safety and effectiveness, while
special controls are utilized for Class II devices for which general controls alone
are insufficient to provide reasonable assurance of device safety and effectiveness
(21 C.F.R. § 860.3). FDA classifies Class III devices as ones intended to be used in
supporting or sustaining human life or for a use that is of substantial importance
in preventing the impairment of human health, or that may present a potential
unreasonable risk of illness or injury, and for which insufficient information exists
to determine whether general controls or special controls are sufficient to provide
reasonable assurance of the safety and effectiveness of a device (21 C.F.R. § 860.3).
This highest risk class of devices is subject to premarket approval to demonstrate a
reasonable assurance of safety and effectiveness. Even for this highest risk class of
devices, the evidence FDA requires for premarket approval has long been flexible,
varying according to the characteristics of the device, its conditions of use, the
existence and adequacy of warnings and other restrictions, and other factors.
There is generally more flexibility in the amount of clinical evidence needed for
medical devices than for drugs and biological products, because they are subject
to different statutory criteria and the mechanism of action and modes of failure
are generally more predictable and better characterized for devices than for drugs
and biological products.
Additionally, the design process for a medical device is more often an iterative
process based largely on rational design and non-clinical testing rather than clinical
studies. However, this last aspect is not, in general, true for clinical AI systems. The
machine learning process is itself a kind of observational research study. In some
cases—particularly for medium- and high-risk clinical AIs—the design process
may depend on lessons learned as such tools are deployed or on intermediate
results that inform ways to improve efficacy (FDA, 2019b). The Clinical Decision
Support Coalition and other organizations have recently opined that many types
of clinical AI tools should not be regulated or that the industry should instead
self-regulate in all application areas in which FDA chooses not to enforce requirements
on the basis of its review of risks to the public health. Notably, the principles and risk-
based classification processes have recently been updated to address requirements
for software as a medical device (SaMD) products (see FDA, 2017c § 6.0, p. 11;
IMDRF, N12 § 5.1).
It is worth noting the distinction between CDS software tools, including clinical
AIs, that replace the health professional’s role in making a determination for
the patient (i.e., automation) and those that simply provide information to the
professional, who can then take it into account and independently evaluate it (i.e.,
assistance). The former may be deemed by FDA to be a medical device and subject
to medical device regulations. Under the 21st Century Cures Act, if a CDS product
has multiple functions, where one is excluded from the definition of a medical
device and another is not, FDA can assess the safety and effectiveness to determine
whether the product should be considered a medical device (21st Century Cures
Act § 3060). Also, FDA can still regulate the product as a medical device if it
finds that the software “would be reasonably likely to have serious adverse health
consequences” or meets the criteria for a Class III medical device. Clinical AI
systems that are deemed to be medical devices will generally require either De
Novo or premarket approval submissions (FDA, 2018a). In some instances, where a
valid pre-1976 predicate exists, a traditional 510(k) submission may be appropriate.
Note, too, that the 21st Century Cures Act’s statutory language, while already
in force, is subject to implementing regulations to be developed by FDA over
Not all clinical AI systems will manifest hazards or have risk levels comparable
to those associated with existing IVDMIA products. However, the methodology,
review, and clearance criteria that have been found effective for the regulation of
IVDMIAs may form a useful point of reference for the regulatory practice for
clinical AI systems.
FDA has indicated that it will apply a risk-based assessment framework, where
the risk level of different clinical AI systems will be influenced by the different
types of on-label clinical indications and contexts in which they are intended to
be used, plus the different situations in which their off-label usage might plausibly
be anticipated, adopting the IMDRF framework (FDA, 2019b; IMDRF, 2014).
For example, a clinical AI system’s intended use might be as a screening test to
determine the person’s susceptibility to, or propensity in the future to, develop
a clinical condition or disease that has not yet materialized; this affords time for
longitudinal observation, repeat testing, and vigilance to monitor signs and
symptoms of the emergence of the disease and is accordingly lower risk. Similarly,
an AI system designed to classify a condition’s stage or current severity, or to establish
the prognosis or probable clinical course and rate of progression of a condition,
functions essentially like a biomarker that characterizes risk and does so in a manner
that is amenable to multiple repeat tests and observations over a period of time.
Such situations have low time sensitivity and many opportunities
for experienced clinicians to review, second-guess, and corroborate the
recommendations of the screening clinical AI system. In IMDRF parlance, these
are clinical AI systems that “inform” clinical management but do not “drive”
clinical management. Indeed, the “informing care” function of some present-day
clinical AI tools of this type is to automatically/reflexively order the appropriate
standard-of-care confirmatory diagnostic testing and monitoring. These clinical
AI systems provide additional evidence or advice (e.g., regarding the likelihood of
the condition screened for and/or the cost-effectiveness of pursuing a diagnostic
workup for the condition) and promote consistency, relevancy, and quality in
diagnostic workups. In general, such screening or informational clinical AI systems
will be classified as having low risk. As such, many clinical AI systems are outside
the formal scope of medical device regulation and do not require establishment
registration and listing or other regulatory filings (21st Century Cures Act § 3060).
By contrast, some classification, forecasting, and prognostic biomarker clinical
AI algorithms that instead drive clinical management and/or involve clinical
indications may be associated with a medium or high risk; the AI systems could
contain faults that cause harm via commissive or omissive errors, either directly
or through clinicians’ actions or inaction. Perioperative, anesthesiology, critical
care, obstetrics, neonatology, and oncology use-cases are examples of medium- or
high-risk settings (Therapeutic Monitoring Systems, Inc., 2013). In such situations,
there is great time sensitivity and there may be little or no time or opportunity to
seek additional testing or perform more observations to assess the accuracy of the
AI’s recommendation or action. In some instances, such as oncology and surgery,
the decision making informed by the AI tool may lead to therapeutic actions that
are not reversible and either close other therapeutic avenues or alter the clinical
course of the illness and perhaps its responsiveness to subsequent therapy. Such
AI tools would, by Section 3060 of the 21st Century Cures Act, be formally
within the scope of medical device regulation and would require establishment
registration, listing, and other regulatory filings—De Novo, 510(k), premarket
approval, or precertification—and associated postmarket surveillance, reporting,
and compliance procedures.
AI systems are often criticized for being black boxes (Pasquale, 2016) that are
very complex and difficult to explain (Burrell, 2016). Nevertheless, such systems
can fundamentally be validated and understood in terms of development and
performance (Kroll, 2018; Therapeutic Monitoring Systems, Inc., 2013), even if
not in terms of mechanism—and even if they do not conform to preexisting
clinician intuitions or conventional wisdom (Selbst and Barocas, 2018). Notably,
the degree of “black box” lack of explainability that may be acceptable to
regulators validating performance might differ from the amount of explainability
clinicians demand, although the latter is an open empirical question. This chapter
addresses explainability to clinicians and other nonregulators only to the extent
that it interacts with regulatory requirements. Instead, the focus is largely on
validation by regulators, which may be satisfied by some current development
processes.
While the rest of this section focuses on how explainability and transparency
may or may not be required for regulators to oversee safety and efficacy,
regulators may also require explainability for independent reasons. For instance,
regulators may require clinical AI tools to be explainable to clinicians to whose
decision making they are coupled; to quality assurance officers and IT staff
in a health provider organization who acquire the clinical AI and have risk-
management/legal responsibility for their operation; to developers; to regulators;
or to other humans. The European Union’s General Data Protection Regulation
(GDPR) right to explanation rules, for instance, enacted in 2016 and effective
May 2018, apply to AI systems as well as humans and web services (Kaminski,
2019) and govern European Union citizens worldwide. Similar standards may
be implemented in the United States and other jurisdictions. Such standards and
regulations are important for public safety and for the benefits of clinical AI systems
to be realized through appropriate acceptance and widespread use. However, the
notion of explainability is not well defined. There is a lack of agreement about
both what constitutes an adequate explanation of clinical AI tools, and to whom
the explanation must be provided to conform to applicable right to explanation
rules and thus be suitable for regulatory approval.
Current right to explanation regulations and standards fail to acknowledge that
human data scientists, clinicians, regulators, courts, and the broader public have
limitations in recognizing and interpreting subtle patterns in high-dimensional
data. Certain types of AI systems are capable of learning—and certain AI models
are capable of intelligently and reliably acting upon—patterns that humans are
entirely and forever incapable of noticing or correctly interpreting (Selbst and
Barocas, 2019). Correspondingly, humans, unable to grasp the patterns that AI
recognizes, may be in a poor position to comprehend the explanations of AI
recommendations or actions. As noted, the term “black box” is sometimes pejorative
toward AI, especially neural networks, deep learning, and other fundamentally
opaque models. They are contrasted to logistic regression; decision-tree; and
other older-technology, static, deterministic models—all with low dimensionality
but are able to show the inputs that led to the recommendation or action, with
variables that are generally well known to the clinician and causally related.
If society, lawmakers, and regulatory agencies were to expect every clinical
AI system to provide an explanation of its actions, it could greatly limit clinical
AI developers’ ability to use the best contemporary AI technologies,
which markedly outperform older AI technology but are not able to provide
explanations understandable to humans. Regulators do not currently require
human-comprehensible explanations for AI in other industries that have
potential risks of serious injury or death. For example, autonomous vehicles are
not required to provide a running explanation or commentary on their roadway
actions.
While requiring explainability may not always be compatible with maximizing
capacity and performance, different forms of transparency are available that
might enable oversight (see Figure 7-1). For instance, transparency of the
initial dataset—including provenance and data-processing procedures—helps
to demonstrate replicability. Transparency of algorithm or system architecture is
similarly important for regulatory oversight. When AI systems are transparent not
just to the regulator but more broadly, such transparency can enable independent
validation and oversight by third parties and build trust with users.
advisory output that each clinical AI system performs; (2) the versioning of the
AI system’s elements that performed the transaction, traceable to the data sources;
and (3) the validation and software quality-assurance testing that led to the AI
systems being authorized for production and subsequent use. These logs enable
the examination of the inputs, outputs, and other details in case of anomalies or
harms. The logs are open to the clinician-users, employers, organizations who
acquire/authorize the AI system’s deployment (e.g., health provider organizations,
health plans, or public health agencies), regulators, developers, and the courts. The
individual release-engineered and version-controlled instances of present-day
industrial-strength clinical AI systems are identifiable and rigorously auditable,
based on these SDLC controlled-document artifacts, which are maintained by
the developer organization that owns the intellectual property.
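By way of illustration only, and not as a description of any particular vendor’s system, the sketch below shows in Python the kind of transaction log record such controls imply, capturing the advisory output, the versioned model elements traceable to their data sources, and a pointer to the validation evidence that authorized production use. All names and values are hypothetical.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ClinicalAITransactionLog:
    """Hypothetical audit record for one advisory transaction of a clinical AI system."""
    transaction_id: str            # unique identifier for this advisory output
    timestamp: str                 # when the recommendation was produced
    model_name: str                # released, version-controlled model instance
    model_version: str             # traceable to SDLC controlled-document artifacts
    training_data_sources: list    # provenance of the datasets used to build the model
    input_summary: dict            # de-identified inputs supplied to the model
    output_recommendation: str     # the advisory output shown to the clinician
    validation_report_id: str      # QA/validation evidence supporting production use

# Entirely fictitious example values:
entry = ClinicalAITransactionLog(
    transaction_id="txn-000123",
    timestamp=datetime.now(timezone.utc).isoformat(),
    model_name="sepsis-risk-advisor",
    model_version="2.4.1",
    training_data_sources=["EHR-extract-2018Q4", "labs-feed-v3"],
    input_summary={"age_band": "60-69", "lactate_mmol_per_L": 2.1},
    output_recommendation="elevated sepsis risk; consider early evaluation",
    validation_report_id="val-2019-017",
)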
For this type of clinical AI system, the regulatory agencies’ traditional submissions
and compliance processes for SaMD are feasible and may not need substantial
alteration (e.g., FDA, 2018b). The types of evidence required by plaintiffs,
defendants, counsel, and the courts may not need substantial alteration, although
the manner of distributed storage, retrieval, and other aspects of provisioning
such evidence will change. Moreover, the availability of such evidence will not be
significantly altered by the nature of clinical AI systems, provided that developers
follow QMS and CGMPs and maintain conformity, including controlled-
document artifacts retention.
Some clinical AI tools will be developed using RWD. Because RWD are
messy in ways that affect the quality and accuracy of the resulting inferences,
as described in Chapter 6, more rigorous requirements for auditing clinical AI
systems developed and validated using RWD will need to be established.
clinical AI. Continuous machine learning offers one solution for dataset shift and
drift by updating with new population-specific data (FDA, 2019b). Not all clinical
AI systems aim to do this, and very few clinical AI systems implement continuous
learning today. The goals of learning health systems and personalized medicine
do create an impetus for more continuous machine learning–based AI systems,
as discussed in Chapters 3 and 6 in more detail. However, current regulatory and
jurisprudential methods and infrastructure are not prepared for this.
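As a minimal, hedged sketch of what postmarket monitoring for dataset shift might involve (one common approach, not a regulatory requirement), the distribution of an input feature in newly collected data can be compared against its distribution in the training data, for example with a two-sample Kolmogorov–Smirnov test. The feature names and data below are synthetic.

import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_values, new_values, alpha=0.01):
    """Flag drift when a feature's distribution in new data differs
    significantly from its distribution in the training data."""
    statistic, p_value = ks_2samp(train_values, new_values)
    return {"ks_statistic": statistic, "p_value": p_value, "drift": p_value < alpha}

# Illustrative, synthetic data: the new population runs slightly older than the training cohort.
rng = np.random.default_rng(0)
train_age = rng.normal(loc=55, scale=12, size=5000)
new_age = rng.normal(loc=60, scale=12, size=1000)
print(detect_feature_drift(train_age, new_age))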
Number Needed to Treat (NNT) and Number Needed to Harm (NNH) are
useful measures of the population-level clinical utility of a therapeutic or a diagnostic
(Cook and Sackett, 1995; Laupacis et al., 1988). A product (i.e., a medication or
medical device) or a health service (i.e., a clinical intervention, procedure, or care
process) with a very high NNT value (>100), meaning many patients must be treated
for one to benefit, or a very low NNH value (<10), meaning more than one in every
ten patients treated is harmed, is unlikely to meet clinicians’ or consumers’ expectations
of probable clinical benefit and improbable clinical harm.
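For reference, the standard definitions of these measures, with a purely illustrative calculation, are

\[
\text{NNT} = \frac{1}{\text{ARR}} = \frac{1}{p_{\text{event, control}} - p_{\text{event, treated}}},
\qquad
\text{NNH} = \frac{1}{\text{ARI}} = \frac{1}{p_{\text{harm, treated}} - p_{\text{harm, control}}},
\]

where ARR is the absolute risk reduction and ARI the absolute risk increase. For example, if an adverse outcome occurs in 10 percent of control patients and 8 percent of treated patients, ARR = 0.02 and NNT = 1/0.02 = 50; if a side effect occurs in 6 percent of treated patients and 1 percent of controls, ARI = 0.05 and NNH = 1/0.05 = 20.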
The international CONsolidated Standards of Reporting Trials (CONSORT,
2010) and STARD (Standards for Reporting of Diagnostic Accuracy Studies;
Bossuyt et al., 2003) initiatives pertain to the verification of diagnostic accuracy
conforming to existing good clinical practice rules and guidelines (Steyerberg,
2010). While these initiatives are not focused on studies that aim to demonstrate
diagnostic device equivalence, many of the reporting concepts involved are
nonetheless relevant and applicable to clinical AI. The CONSORT guidelines
aim to improve the reporting of randomized controlled trials, enabling reviewers
to understand their design, conduct, analysis, and interpretation, and to assess the
validity of their results (CONSORT, 2010). However, CONSORT is also applicable
to observational, nonrandomized studies and AI derived from machine learning.
According to a 2007 FDA guidance document,
These steps are essential to eliminate validation leakage and help to estimate
the stability of the model over time.
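The specific steps in that guidance are not reproduced here, but as a general illustration, one common source of validation leakage is fitting preprocessing on the full dataset before splitting. A minimal sketch of avoiding it, assuming scikit-learn-style tooling and synthetic data, follows.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split first, then fit all preprocessing inside a pipeline on the training data only,
# so no information from the held-out set leaks into scaling or model fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))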
Two main biases are important to consider: representational bias and information
bias (Althubaiti, 2016). Representational bias refers to which individuals or data
sources are represented in the data and which are not. Information bias is meant
to represent collectively “all the human biases that distort the data on which a
decision maker [relies] and that account for the validity of data [that is, the extent
these represent what they are supposed to represent accurately]” (Cabitza et al.,
2018). These two biases and the related phenomenon of information variability
together can degrade the accuracy of the data and, consequently, the accuracy of
the clinical AI model derived from them.
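As one illustration of how representational bias might be quantified in practice (a minimal sketch with hypothetical group labels and counts, not a validated method), a simple check compares each group’s share of the training data with its share of the intended patient population.

def representation_gap(training_counts, reference_proportions):
    """Compare each group's share of the training data with its share of the
    reference population; large negative gaps suggest under-representation."""
    total = sum(training_counts.values())
    gaps = {}
    for group, ref_share in reference_proportions.items():
        train_share = training_counts.get(group, 0) / total
        gaps[group] = round(train_share - ref_share, 3)
    return gaps

# Hypothetical example: age groups in a training set versus the intended patient population.
training_counts = {"18-44": 1200, "45-64": 2600, "65+": 700}
reference_proportions = {"18-44": 0.35, "45-64": 0.40, "65+": 0.25}
print(representation_gap(training_counts, reference_proportions))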
within a single health services facility for the care of patients for whom named
clinicians in that facility have responsibility. Diagnostic tests that are not marketed
commercially beyond the therapeutics development process, clinical trials, and
regulatory marketing approval are generally referred to by FDA and others as
development tools. Such tests are established and overseen similarly to other
development tools such as biomarkers.
State tort law also provides a source of risk and of regulatory pressure for the
developers and users of clinical AI systems, as well as other AI systems that could
cause injury but that are not the focus of this section. Briefly, state tort law may make
the developers or users of clinical AI systems liable when patients are injured as a
result of using those systems. Such liability could come in the form of malpractice
liability—that is, potential lawsuits against health providers, hospitals or other health
care systems, and AI system developers for performing below the standard of care
(Froomkin et al., 2019). Developers could also face product liability for defects in
the design or manufacturing of AI systems or for failure to adequately warn users
of the risks of a particular AI system. By imposing liability for injuries caused by AI
systems when those injuries could have reasonably been avoided, whether by more
careful development or more careful use, tort law exerts pressure on developers.
How exactly tort law will deal with clinical AI systems remains uncertain, because
court decisions are retrospective and the technology is nascent. Tort law is principally
grounded in state law, and its contours are shaped by courts on a case-by-case basis.
This area will continue to develop. Three factors influencing tort liability are of
particular note: the interaction of FDA approval and tort liability, liability insurance,
and the impact of transparency on tort liability. To be clear, this area of law is still
very much developing, and this section only sketches some of the ways different
aspects of health care AI systems may interact with the system of tort liability.
premarket approval, state tort lawsuits are generally preempted under the Supreme
Court’s holding in Riegel v. Medtronic, 552 U.S. 312 (2008). Nevertheless, this
preemption will not apply to most AI apps, which are likely to be cleared through
the 510(k) clearance pathway rather than premarket approval. Clearance under the
510(k) pathway will generally not preempt state tort lawsuits under the reasoning
of Medtronic v. Lohr, 518 U.S. 470 (1996), because rather than directly determining
safety and efficacy, FDA finds the new app to be equivalent to an already approved
product. It is unclear what preemptive effect De Novo classification will have on
state tort lawsuits, because the Supreme Court emphasized both the
thoroughness of premarket review and its determination that the device is safe
and effective, rather than equivalent to an approved predicate device.
State tort lawsuits alleging violations of industry-wide requirements, such
as CGMP or other validation requirements, are a contestable source of state
tort liability. Some courts have found that lawsuits alleging violations of state
requirements that parallel industry-wide requirements are preempted by federal
law and that such violations may only be addressed by FDA. Other courts disagree,
and the matter is currently unsettled (Tarloff, 2011). In at least some jurisdictions,
if app developers violate FDA-imposed requirements, courts may find parallel
duties under state law and developers may be held liable. Nevertheless, if app
developers comply with all FDA-imposed industry-wide requirements, states
cannot impose additional requirements.
Liability Insurance
The possibility of liability creates another avenue for regulation through the
intermediary of insurance. Developers, providers, and health systems are all likely
to carry liability insurance to decrease the risk of a catastrophic tort judgment
arising from potential injury. Liability insurers set rules and requirements regarding
what information must be provided or what practices and procedures must be
followed in order to issue a policy. Although insurers are often not considered
regulators, they can exert substantial, if less visible, pressure that may shape the
development and use of clinical AI systems (Ben-Shahar and Logue, 2012).
Transparency and opacity also interact with tort liability. Determining causation
can already be difficult in medical tort litigation, because injuries may result
from a string of different actions and it is not always obvious which action or
combination of actions caused the injury. Opacity in clinical AI systems may
further complicate the ability of injured patients, lawyers, or providers or health
systems to determine precisely what caused the injury. Explainable algorithms may
make it easier to assess tort liability, as could transparency around data provenance,
training and validation methods, and ongoing oversight. Perversely, this could
create incentives for developers to avoid certain forms of transparency as a way
to lessen the likelihood of downstream tort liability. On the other hand, courts—
or legislatures—could mandate that due care, in either the development or use
of clinical AI tools, requires some form of transparency. To take a hypothetical
example, a court might one day hold that when a provider relies on an algorithmic
diagnosis, that provider can only exercise due care by assessing how the algorithm
was validated. Developers or other intermediaries would then need to provide
sufficient information to allow that assessment.
Regulation regarding patient privacy and data sharing is also highly relevant
to AI development, implementation, and use, whether clinical AI or AI used for
other health care purposes (“health care AI”). The United States lacks a general
data privacy regime, but HIPAA includes a Privacy Rule that limits the use and
disclosure of protected health information (PHI)—essentially any individually
identifiable medical information—by covered entities (i.e., almost all providers,
health insurers, and health data clearinghouses) and their business associates
where the business relationship involves PHI (45 C.F.R. § 160.103). Covered
entities and business associates may only use or disclose information with patient
authorization, if the entity receives a waiver from an institutional review board or
privacy board, or for one of several exceptions (45 C.F.R. § 164.502). These listed
exceptions include the use and disclosure of PHI for the purposes of payment,
public health, law enforcement, or health care operations, including quality
improvement efforts but not including research aimed at creating generalizable
knowledge (45 C.F.R. § 164.501). For health systems that intend to use their own
internal data to develop in-house AI tools (e.g., to predict readmission rates or the
likelihood of complications among their own patients), the quality improvement
exception will likely apply. Even when the use or disclosure of information is
permitted under HIPAA, the Privacy Rule requires that covered entities take
reasonable steps to limit the use or disclosure to the minimum necessary to
accomplish the intended purpose. While HIPAA does create protections for
patient data, its reach is limited, and health information can come from many
sources that HIPAA does not regulate (Price and Cohen, 2019).
A complex set of other laws may also create requirements to protect patient
data. HIPAA sets a floor for data privacy, not a ceiling. State laws may be more
restrictive; for instance, some states provide stronger protections for especially
With regard to discrete clinical data, unstructured textual data, imagery data,
waveform and time-series data, and hybrid data used in clinical AI models, the
development and deployment of AI systems have complex interactions with
privacy concerns and privacy law (e.g., Loukides et al., 2010). Adequate oversight
of clinical AI systems must address the nature of potential privacy concerns
wherever they may arise, approaches to address those concerns, and management
of the potential tension between privacy and other governance concerns for
clinical AI.
Initial AI Development
Privacy concerns occur in the first instance because training health care AI
depends on assembling large collections of health data about patients (Horvitz
and Mulligan, 2015). Health data about individuals are typically considered
sensitive. Some forms of data are particularly sensitive, such as substance abuse
data or sexually transmitted disease information (Ford and Price, 2016). Other
forms of data raise privacy concerns that extend beyond the particular individual,
such as genetic data that can reveal information about family members (Ram et al.,
2018). Collecting, using, and sharing patient health data raise concerns about the
privacy of the affected individuals, whether those concerns are consequentialist
(e.g., the possibility of future discrimination based on health status) or not (e.g.,
dignitary concerns about others knowing embarrassing or personal facts) (Price
and Cohen, 2019). The process of collecting and sharing may also make data
more vulnerable to interception or inadvertent access by other parties.
External Validation
Inference Generation
A third form of potential privacy harm that could arise from health care AI is
quite different and involves the generation of inferences about individual patients
based on their health data. Machine learning makes predictions based on data, and
those predictions may themselves be sensitive data, or may at least be viewed that
way by patients. In one highly publicized example of such a case, Target identified
a teenage woman’s pregnancy based on changes in her purchasing habits and then
sent her targeted coupons and advertisements, which led to her father learning
of her pregnancy from Target before his daughter had shared the news (Duhigg,
2012). The epistemic status of this information is debatable; arguments have been
made that inferences cannot themselves be privacy violations, although popular
perception may differ (Skopek, 2018).
Some standard privacy-protecting approaches used by data collectors and users face
difficulties when applied to health care AI. The most privacy-protective approach
limits initial data collection, which necessarily limits the potential for problematic
use or disclosure (Terry, 2017). However, this approach presumes that the data
collector knows which data are necessary and which are not, knowledge that is
often absent for health care AI.
De-Identification
Data Infrastructure
and validation purposes. Once particular AI systems are deployed in the real world,
RWD should be collected to ensure that the AI systems are performing well and,
ideally, to improve that performance. However, numerous hurdles exist to the
collection of sufficient data (Price, 2016). Various privacy laws, as described above,
restrict the collection of identifiable information, and de-identified information can
be difficult to assemble to capture either long-term effects or data across different
data sources. Informed consent laws, such as the Common Rule for federally
funded research or the consent requirements incorporated into the GDPR, create
additional barriers to data collection. Even where privacy or informed consent
rules do not actually prohibit the collection, use, or sharing of data, some health care
actors may limit such actions out of an abundance of caution, creating a penumbra
of data limitations. In addition, for those actors who do find ways around these
requirements, criticism and outrage may arise if patients feel they are inadequately
compensated for their valuable data. On an economic level, holders of data have
strong incentives to keep data in proprietary siloes to derive competitive advantage,
leading to more fragmentation of data from different sources. For data holders who
wish to keep data proprietary for economic reasons, referencing privacy concerns
can provide a publicly acceptable reason for these tactics.
At least four possibilities emerge for collection of data, with some evidence of
each in current practice:
Of the four models, the first three are the straightforward results of current
market dynamics. Each creates challenges, including smaller dataset size, potential
bias in collection, access for other developers or for validators, and, in the case
of failures to collect data, exclusion of some populations from AI development
and validation. Government data infrastructure—that is, data gathered via
government efforts for the purposes of fostering innovation, including clinical
AI—has the greatest possibility of being representative and available for a variety
of downstream AI uses but also faces potential challenges in public will for its
collection. Even when the government itself does not collect data, it can usefully
promulgate standards for data collection and consolidation (Richesson and
Krischer, 2007; Richesson and Nadkarni, 2011); the lack of standards for EHRs,
for instance, has led to persistent problems aggregating data across contexts.
KEY CONSIDERATIONS
and sustainably deployed, implemented in EHR systems, and curated over time
to maintain adequate accuracy and reliability. However, clinical AI systems could
potentially pose risks in terms of inappropriate treatment recommendations,
privacy breaches, or other harms (Evans and Whicher, 2018), and some types of
clinical AI systems will be classified by regulatory agencies as SaMDs, subject to
premarket clearance or approval and other requirements that aim to protect the
public’s health. Other clinical AI tools may be deemed to be LDT-type services,
subject to CLIA and similar regulations. Whatever agency is involved in oversight,
compliance with regulations should be mandatory rather than voluntary, given the
potential for problematic incentives for system developers (Evans and Whicher,
2018). As the law and policy of health care AI systems develop over time, it is
both expected and essential that multiple stakeholders—including payers, patients
and families, policy makers, diagnostic manufacturers and providers, clinicians,
academics, and others—remain involved in helping determine how best to ensure
that such systems advance the quintuple aim and improve the health care system
more generally.
• The black box nature of a clinical AI system should not disqualify it from
regulatory approval or use, but transparency, where possible, can aid oversight
and adoption and should be encouraged or potentially required. AI systems,
including black box systems, should be able to give users the opportunity to
examine quantitative evidence, in the form of de-identified, aggregated data,
that the current recommendation is indeed the best choice in recent historical
experience, or is at least no worse and no more uncertain than the decision
the user would have made independently had the AI not been involved.
• When possible, machine learning–based predictive models should be evaluated
in an independent dataset (i.e., external validation) before they are adopted in
clinical practice. Risk assessment to determine the degree to which dataset-
specific biases affect the model should be undertaken. Regulatory agencies
should recommend specific statistical methods for evaluating and mitigating
bias.
• To the extent that machine learning–based models continuously learn from
new data, regulators should adopt postmarket surveillance mechanisms to
ensure continuing (and ideally improving) high-quality performance.
• Regulators should engage in collaborative governance efforts with other
stakeholders and experts throughout the health system, including data scientists,
clinicians, ethicists, and others, to continuously evaluate deployed clinical AI
for effectiveness and safety on the basis of RWD.
REFERENCES
Chung, P., and D. Edwards. 1999. Hazard identification in batch and continuous
computer-controlled plants. Industrial & Engineering Chemistry Research 38:4359–4371.
Cohen, I., and M. Mello. 2019. Big data, big tech, and protecting patient privacy.
JAMA 322(12):1141–1142. https://doi.org/10.1001/jama.2019.11365.
CONSORT (CONsolidated Standards of Reporting Trials). 2010. CONSORT
2010. http://www.consort-statement.org/consort-2010 (accessed November 14,
2019).
Cook, N. 2008. Statistical evaluation of prognostic versus diagnostic models:
Beyond the ROC curve. Clinical Chemistry 54:17–23.
Cook, R., and D. Sackett. 1995. The number needed to treat: A clinically useful
measure of treatment effect. BMJ 310(6977):452–454.
Duhigg, C. 2012. How companies learn your secrets. The New York Times.
https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html
(accessed November 14, 2019).
El Emam, K. 2013. Guide to the de-identification of personal health information. New
York: CRC Press.
Evans, E., and D. Whicher. 2018. What should oversight of clinical decision
support systems look like? AMA Journal of Ethics 20:E857–E863.
FDA (U.S. Food and Drug Administration). 2007a. In vitro diagnostic multivariate
index assays—Draft guidance. https://www.fda.gov/downloads/MedicalDevices/
DeviceRegulationandGuidance/GuidanceDocuments/ucm071455.pdf
(accessed November 14, 2019).
FDA. 2007b. Statistical guidance on reporting results from studies evaluating diagnostic
tests—Guidance for industry and FDA staff. https://www.fda.gov/downloads/
MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/
ucm071287.pdf (accessed November 14, 2019).
FDA. 2016a. Non-inferiority clinical trials to establish effectiveness—Guidance for
industry. https://www.fda.gov/downloads/Drugs/Guidances/UCM202140.pdf
(accessed November 14, 2019).
FDA. 2016b. Principles for co-development of an in vitro companion diagnostic device
with a therapeutic product—Draft guidance. https://www.fda.gov/downloads/
MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/
UCM510824.pdf (accessed November 14, 2019).
FDA. 2017a. Changes to existing medical software policies resulting from Section 3060 of
the 21st Century Cures Act. https://www.fda.gov/downloads/MedicalDevices/
DeviceRegulationandGuidance/GuidanceDocuments/UCM587820.pdf
(accessed November 14, 2019).
FDA. 2017b. Clinical and patient decision support software—Draft guidance. https://
www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/
GuidanceDocuments/UCM587819.pdf (accessed November 14, 2019).
Pasquale, F. 2016. The black box society: The secret algorithms that control money and
information. Cambridge, MA: Harvard University Press.
Pepe, M. 2003. The statistical evaluation of medical tests for classification and prediction.
Oxford, UK: Oxford University Press.
Price, W. 2016. Big data, patents, and the future of medicine. Cardozo Law Review
37(4):1401–1452.
Price, W. 2017a. Regulating black-box medicine. Michigan Law Review
116(3):421–474.
Price, W. 2017b. Risk and resilience in health data infrastructure. Colorado
Technology Law Journal 16(1):65–85.
Price, W., and I. Cohen. 2019. Privacy in the age of medical big data. Nature
Medicine 25(1):37–43.
Quiñonero-Candela, J., M. Sugiyama, A. Schwaighofer, and N. Lawrence. 2009.
Dataset shift in machine learning. Cambridge, MA: MIT Press.
Ram, N., C. Guerrini, and A. McGuire. 2018. Genealogy databases and the future
of criminal investigations. Science 360(6393):1078–1079.
Reniers, G., and V. Cozzani. 2013. Domino effects in the process industries: Modelling,
prevention and managing. New York: Elsevier.
Richesson, R., and J. Krischer. 2007. Data standards in clinical research: Gaps,
overlaps, challenges and future directions. Journal of the American Medical
Informatics Association 14(6):687–696.
Richesson, R., and P. Nadkarni. 2011. Data standards for clinical research data
collection forms: Current status and challenges. Journal of the American Medical
Informatics Association 18(3):341–346.
Scheerens, H., A. Malong, K. Bassett, Z. Boyd, V. Gupta, J. Harris, C. Mesick,
S. Simnett, H. Stevens, H. Gilbert, and P. Risser. 2017. Current status of
companion and complementary diagnostics: Strategic considerations for
development and launch. Clinical and Translational Science 10(2):84–92.
Selbst, A., and S. Barocas. 2018. The intuitive appeal of explainable machines.
Fordham Law Review 87:1085–1139.
Singer, D., and J. Watkins. 2012. Using companion and coupled diagnostics within
strategy to personalize targeted medicines. Personalized Medicine 9(7):751–761.
Skopek, J. 2018. Big data’s epistemology and its implications for precision medicine and
privacy. Cambridge, UK: Cambridge University Press.
Spector-Bagdady, K. 2016. The Google of healthcare: Enabling the privatization
of genetic bio/databanking. Annals of Epidemiology 26:515–519.
Steyerberg, E. 2010. Clinical prediction models: A practical approach to development,
validation, and updating. New York: Springer.
Subbaswamy, A., P. Schulam, and S. Saria. 2019. Preventing failures due to dataset
shift: Learning predictive models that transport. arXiv.org. https://arxiv.org/
abs/1812.04597 (accessed November 14, 2019).
Tarloff, E. 2011. Medical devices and preemption: A defense of parallel claims
based on violations of non-device specific FDA regulations. NYU Law Review
86:1196.
Terry, N. 2017. Regulatory disruption and arbitrage in health-care data protection.
Yale Journal of Health Policy, Law and Ethics 17:143.
Therapeutic Monitoring Systems, Inc. 2013. CIMVA universal traditional premarket
notification; K123472. https://www.accessdata.fda.gov/cdrh_docs/pdf12/
K123472.pdf (accessed November 14, 2019).
Venkatasubramanian, V., J. Zhao, and S. Viswanathan. 2000. Intelligent systems for
HAZOP analysis of complex process plants. Computers & Chemical Engineering
24(9–10):2291–2302.
Villa, V., N. Paltrinieri, F. Khan, and V. Cozzani. 2016. Towards dynamic risk
analysis: A review of the risk assessment approach and its limitations in the
chemical process industry. Safety Science 89:77–93.
Woloshin, S., L. Schwartz, B. White, and T. Moore. 2017. The fate of FDA
postapproval studies. New England Journal of Medicine 377:1114–1117.
Suggested citation for Chapter 7: McNair, D., and W. N. Price II. 2022. Health
care artificial intelligence: law, regulation, and policy. In Artificial intelligence in
health care: The hope, the hype, the promise, the peril. Washington, DC: National
Academy of Medicine.
8
ARTIFICIAL INTELLIGENCE IN HEALTH CARE:
HOPE NOT HYPE, PROMISE NOT PERIL
INTRODUCTION
Health care delivery in the United States, and globally, continues to face
significant challenges from the increasing breadth and depth of data and
knowledge generation. This publication focuses on artificial intelligence (AI)
designed to improve health and health care, the explosion of electronic health
data, the significant advances in data analytics, and mounting pressures to reduce
health care costs while improving health care equity, access, and outcomes. AI tools
could potentially address known challenges in health care delivery and achieve
the vision of a continuously learning health system, accounting for personalized
needs and preferences. The ongoing challenge is to ensure the appropriate and
equitable development and implementation of health care AI. The term AI is
inclusive of machine learning, natural language processing, expert systems,
optimization, robotics, speech, and vision (see Chapter 1), and the terms AI tools,
AI systems, and AI applications are used interchangeably.
While there have been a number of promising examples of AI applications
in health care (see Chapter 3), it is judicious to proceed with caution to avoid
another AI winter (see Chapter 2) or further exacerbation of health care disparities.
AI tools are only as good as the data used to develop and maintain them, and
there are many limitations with current data sources (see Chapters 1, 3, 4,
and 5). In addition, there is a real risk of increasing current inequities and distrust
(see Chapters 1 and 4) if AI tools are developed and deployed without thoughtful
preemptive planning, self-governance, trust-building, transparency, appropriate
levels of automation and augmentation (see Chapters 3, 4, and 5), and regulatory
oversight (see Chapters 4, 5, 6, and 7).
This publication synthesizes the major literature to date, in both the academic
and general press, to create a reference document for health care AI model
developers, clinical teams, patients, “fRamilies,” and regulators and policy
makers to:
1. identify the current and near-term uses of AI within and outside the traditional
health care systems (see Chapters 2 and 3);
2. highlight the challenges and limitations (see Chapter 4) and the best practices
for development, adoption, and maintenance of AI tools (see Chapters 5
and 6);
3. understand the legal and regulatory landscape (see Chapter 7);
4. ensure equity, inclusion, and a human rights lens for this work; and
5. outline priorities for the field.
The authors of the eight chapters are experts convened by the National
Academy of Medicine’s Digital Health Learning Collaborative to explore the
field of AI and its applications in health and health care, consider approaches for
addressing existing challenges, and identify future directions and opportunities.
This final chapter synthesizes the challenges and priorities of the previous
chapters, highlights current best practices, and identifies key priorities for the field.
This section summarizes the key findings and priorities of the prior chapters
without providing the underlying evidence or more detailed background. Please
refer to the referenced chapters for details.
components of a health care system, between different health care systems, and
with consumer health applications. We cannot disregard the fact that there are
varying data requirements for the training of AI and for the downstream use
of AI. Some initiatives do exist and are driving the health care community in
the direction of interoperability and data standardization, but they have yet to
see widespread use (HL7, 2018; Indiana Health Information Exchange, 2019;
NITRD et al., 2019; OHDSI, 2019).
Methods to assess data validity and reproducibility are often ad hoc. Ultimately,
for AI models to be trusted, the semantics and provenance of the data used
to derive them must be fully transparent, unambiguously communicated, and
available, for validation at least, to an independent vetting agent. This is a distinct
element of transparency, and the conflation of data transparency with algorithmic
transparency complicates the AI ecosystem’s discourse. We suggest a clear
separation of these topics. One example of a principles declaration that promotes
data robustness and quality is the FAIR (findability, accessibility, interoperability,
and reusability) Principles (Wilkinson et al., 2016).
These principles, put forth by molecular biology and bioinformatics researchers,
are not easily formalized or implemented. However, for health care AI to mature,
a similar set of principles should be developed and widely adopted.
The health care community should continue to advocate for policy, regulatory,
and legislative mechanisms that improve the ease of data aggregation. These
would include (but are not limited to) a national patient health care identifier
and mechanisms to responsibly bring together data from multiple sources. The
debate should focus on the thoughtful and responsible ability of large-scale health
care data resources to serve as a public good and the implications of that ability.
Discussions around wider and more representative data access should be carefully
balanced by stronger outreach, education, and consensus building with the public
and patients in order to address where and how their data can be reused for AI
research, data monetization, and other secondary uses; which entities can reuse
their data; and what safeguards need to be in place. In a recent commentary,
Glenn Cohen and Michelle Mello propose that “it is timely to reexamine the
adequacy of the Health Insurance Portability and Accountability Act (HIPAA),
the nation’s most important legal safeguard against unauthorized disclosure
and use of health information. Is HIPAA up to the task of protecting health
information in the 21st century?” (Cohen and Mello, 2018).
When entities bring data sources together, they face ethical, business, legislative,
and technical hurdles. There is a need for novel solutions that allow for robust
data aggregation while promoting transparency and respecting patient privacy
and preferences.
AI to benefit the common good, at the very least its design and deployment
should avoid harms to fundamental human values. International human rights
provide a robust and global formulation of those values” (Latonero, 2018).
For objective governance, a new neutral agency or a committee within an
existing governmental or nongovernmental entity, supported by a range of
stakeholders, could own and manage the review of health care AI products and
services while protecting developers’ intellectual property rights. One example
of this type of solution is the New Model for Industry-Academic Partnerships,
which developed a framework for academic access to industry (Facebook) data
sources: The group with full access to the data is separate from the group doing
the publishing, but both are academic, independent, and trusted. The group with
full access executes the analytics and verifies the data, understands the underlying
policies and issues, and delivers the analysis to a separate group that publishes the
results but does not have open access to the data (Social Science Research Council,
2019). To ensure partisan neutrality, the project is funded by ideologically diverse
supporters, including the Laura and John Arnold Foundation, the Democracy
Fund, the William and Flora Hewlett Foundation, the John S. and James L. Knight
Foundation, the Charles Koch Foundation, the Omidyar Network, and the Alfred
P. Sloan Foundation. Research projects use this framework when researchers use
Facebook social media data for election impact analysis, and Facebook provides
the data required for the research but does not have the right to review or approve
the research findings prior to publication.
Perhaps the best way to ensure that equity and inclusion are foundational
components of a thriving health care system is to add them as a dimension to the
quadruple aim, expanding it to a Quintuple Aim for health and health care: better
health, improved care experience, clinician well-being, lower cost, and health
equity throughout (see Figure 8-2).
FIGURE 8-2 | The Quintuple Aim to ensure equity and inclusion are stated and measured
goals when designing and deploying health care interventions.
FIGURE 8-3 | Summary of relationships between requirements for transparency and the
three axes of patient risk, user trust, and algorithm performance within three key domains:
data transparency, algorithmic transparency, and product/output transparency.
NOTE: While not comprehensive, the figure provides examples of how different users and use cases require
different levels of transparency in each of these three domains.
end user, and is unlikely to impose legal liability on the institution, manufacturer,
or end user, then the need for complete algorithmic transparency is likely to be
lower. See Figure 8-3 for additional details on the relationships of transparency
and these axes within different conceptual domains.
Although some AI applications for health care business operations are likely to
be poised for full automation, most of the near-term dialogue around AI in health
care should focus on promoting, developing, and evaluating tools that support
human cognition rather than replacing it. Popular culture and marketing have
overloaded the term “AI” to the point where it means replacing human labor,
and as a result, other terms have emerged to distinguish AI that is used to support
human cognition. Augmented intelligence refers to the latter and is the term
the authors of this chapter endorse.
Stanford University’s Curt Langlotz offered the following question and answer:
“Will AI ever replace radiologists? I say the answer is no—but radiologists who
use AI will replace radiologists who don’t” (Stanford University, 2017).
In order to sustain and nurture health care AI, we need a sweeping,
comprehensive expansion of relevant professional health education focused on
data science, AI, medicine, humanism, ethics, and health care. This expansion
must be multidisciplinary and engage AI developers, implementers, health care
system leadership, frontline clinical teams, ethicists, humanists, and patients and
“fRamilies,” because each brings essential expertise, and AI progress is contingent
on knowledgeable decision makers balancing the relative ease of implementing
newly developed AI solutions against the need to understand their validity and
influence on care.
To begin addressing challenges, universities such as the Massachusetts Institute
of Technology, Harvard, Stanford, and The University of Texas have added
new courses focused on embedding ethics into the technology development process.
The United States and many other nations prioritize human rights values
and are appropriately measured and thoughtful in supporting data collection, AI
development, and AI deployment. Other nations, with China and Russia being
prime examples, have different priorities. The current AI arms race in all fields,
including and beyond health care, creates a complex and, some argue, untenable
geopolitical state of affairs (Apps, 2019). Others point out that it is not an AI
arms race because interdependencies and interconnections among nations are
needed to support research and innovation. Regardless, Kai-Fu Lee outlines
China’s competitive edge in AI in his 2018 book AI Superpowers: China, Silicon
Valley, and the New World Order (Lee, 2018). Putin has also outlined a national
AI strategy. And in February 2019, the White House issued an Executive Order
on Maintaining American Leadership in Artificial Intelligence (White House,
2019). The downstream implications of this AI arms race in health care raise
questions and conundrums this publication does not cover. We acknowledge they
are countless and should be investigated.
CONCLUSIONS
REFERENCES
Apps, P. 2019. Are China, Russia winning the AI arms race? Reuters. https://
www.reuters.com/article/apps-ai/column-are-china-russia-winning-the-ai-
arms-race-idINKCN1PA08Y (accessed May 13, 2020).
BARNII (Bay Area Regional Health Inequities Initiative). 2015. Framework.
http://barhii.org/framework (accessed May 13, 2020).
Cohen, G., and M. Mello. 2018. HIPAA and protecting health information in the
21st century. JAMA 320(3):231–232. https://doi.org/10.1001/jama.2018.5630.
Commins, J. 2019. UnitedHealthcare, AMA push new ICD-10 codes for social
determinants of health. HealthLeaders. https://www.healthleadersmedia.
com/clinical-care/unitedhealthcare-ama-push-new-icd-10-codes-social-
determinants-health (accessed May 13, 2020).
Evans, E. L., and D. Whicher. 2018. What should oversight of clinical decision
support systems look like? AMA Journal of Ethics 20(9):857–863.
Harari, Y. N. 2018. 21 Lessons for the 21st century. New York: Random House.
HL7 (Health Level Seven). 2018. Fast healthcare interoperability resources. https://
www.hl7.org/fhir (accessed May 13, 2020).
Indiana Health Information Exchange. 2019. https://www.ihie.org (accessed
May 13, 2020).
Latonero, M. 2018. Governing artificial intelligence: Upholding human rights &
dignity. Data & Society, October.
Lee, K. F. 2018. AI superpowers: China, Silicon Valley, and the new world order.
New York: Houghton Mifflin Harcourt.
NIH (National Institutes of Health). 2019. Undiagnosed diseases network. https://
commonfund.nih.gov/diseases (accessed May 13, 2020).
NITRD (Networking and Information Technology Research and Development),
NCO (National Coordination Office), and NSF (National Science
Foundation). 2019. Notice of workshop on artificial intelligence & wireless
spectrum: Opportunities and challenges. Notice of workshop. Federal Register
84(145):36625–36626.
OHDSI (Observational Health Data Sciences and Informatics). 2019. https://
ohdsi.org.
O’Neil, C. 2017. Weapons of math destruction: How big data increases inequality and
threatens democracy. New York: Broadway Books.
Piper, K. 2019. Exclusive: Google cancels AI ethics board in response
to outcry. Vox. https://www.vox.com/future-perfect/2019/4/4/18295933/
google-cancels-ai-ethics-board (accessed May 13, 2020).
Rosenthal, E. 2017. An American sickness: How healthcare became big business and how
you can take it back. London, UK: Penguin Press.
Sample, I. 2019. Scientists call for global moratorium on gene editing of
embryos. The Guardian. https://www.theguardian.com/science/2019/mar/
13/scientists-call-for-global-moratorium-on-crispr-gene-editing (accessed May 13, 2020).
Shrott, R. 2017. Deep learning specialization by Andrew Ng—21 lessons learned.
Medium. https://towardsdatascience.com/deep-learning-specialization-by-
andrew-ng-21-lessons-learned-15ffaaef627c (accessed May 13, 2020).
Singer, N. 2018. Tech’s ethical “dark side”: Harvard, Stanford and others want
to address it. The New York Times. https://www.nytimes.com/2018/02/12/
business/computer-science-ethics-courses.html (accessed May 13, 2020).
Social Science Research Council. 2019. Social data initiative: Overview: SSRC.
https://www.ssrc.org/ programs/view/social-data-initiative/#overview
(accessed May 13, 2020).
Stanford University. 2017. RSNA 2017: Rads who use AI will replace rads who
don’t. Center for Artificial Intelligence in Medicine & Imaging. https://aimi.stanford.
edu/about/news/rsna-2017-rads-who-use-ai-will-replace-rads-who-don-t
(accessed May 13, 2020).
Sun, C., A. Shrivastava, S. Singh, and A. Gupta. 2017. Revisiting unreasonable
effectiveness of data in deep learning era. arXIV. https://arxiv.org/pdf/
1707.02968.pdf (accessed May 13, 2020).
White House. 2019. Executive Order on Maintaining American Leadership in
Artificial Intelligence. Executive orders: Infrastructure & technology. https://www.
whitehouse.gov/presidential-actions/executive-order-maintaining-american-
leadership-artificial-intelligence (accessed May 13, 2020).
Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton,
A. Baak, N. Blomberg, J. W. Boiten, L. B. da Silva Santos, P. E. Bourne, and
J. Bouwman. 2016. The FAIR Guiding Principles for scientific data management and
stewardship. Scientific Data 3.
VA (U.S. Department of Veterans Affairs). 2016. Veteran population projections
2017–2037. https://www.va.gov/vetdata/docs/Demographics/New_
Vetpop_Model/Vetpop_Infographic_Final31.pdf (accessed May 13, 2020).
Meeting Focus: Artificial intelligence and the future of continuous health learning
and improvement.
Meeting Objectives:
1. Aim: Consider the nature, elements, applications, state of play, and implications of
artificial intelligence (AI) and machine learning (ML) in health and health care, and
ways the National Academy of Medicine might enhance collaborative progress.
2. AI/ML opportunities: Identify and discuss areas within health and health care for
which AI and ML have already shown promise. Consider implications for other
applications.
3. Barriers: Identify and discuss the practical challenges to the advancement and application
of AI and ML, including those related to data integration, ethical/regulatory implications,
clinician acceptance, workforce development, and business case considerations.
This session will focus on the role of data integration and sharing in
enhancing the capabilities of machine learning algorithms to improve health and
health care.
Noel Southall, National Institutes of Health
Douglas McNair, Cerner Corporation
Jonathan Perlin and Edmund Jackson, Hospital Corporation of America, Inc.
James Fackler, Johns Hopkins Medicine
Q&A and Open Discussion
The lunch session will focus on the areas of health care where machine
learning has the potential to improve patient outcomes, including opportunities
for better, faster, cheaper diagnosis, treatment, prevention, self and family care,
service linkages, and etiologic insights.
Paul Bleicher, OptumLabs
Steve Fihn, University of Washington
Daniel Fabbri, Vanderbilt University
Tim Estes, Digital Reasoning
Q&A and Open Discussion
The aim of this session is to develop a charge and charter for an ongoing NAM
AI/ML Collaborative Working Group. The charge will outline opportunities for
the Working Group to address barriers and accelerate progress.
Sean Khozin, U.S. Food and Drug Administration
Javier Jimenez, Sanofi
Leonard D. Avolio, Cyft
Wendy Chapman, University of Utah
Q&A and Open Discussion
WORKSHOP ATTENDEES
Ernest Sohn, MS
Chief Data Scientist
Booz Allen Hamilton
Appendix C
AUTHOR BIOGRAPHIES
Award for outstanding research as a fellow and the Western Society for Clinical
Investigation Outstanding Investigator award, and is a member of the American
Society for Clinical Investigation.
Paul Bleicher, MD, PhD, is a strategic advisor to OptumLabs. Dr. Bleicher previously
served as the chief executive officer of OptumLabs from its inception. Prior to
OptumLabs, he was the chief medical officer for Humedica, a next-generation
clinical informatics company. He also co-founded and was a leader at Phase
Forward, which was instrumental in transforming pharmaceutical clinical trials
from paper to the web. Dr. Bleicher has served as a leader in industry organizations
such as the National Academy of Medicine’s Leadership Consortium for Value
& Science-Driven Health Care and the Drug Information Association. He has
received numerous awards for his industry leadership. Dr. Bleicher holds a BS
from Rensselaer, as well as an MD and a PhD from the University of Rochester
School of Medicine and Dentistry. He began his career as a physician/investigator
and an assistant professor at the Massachusetts General Hospital and the Harvard
Medical School.
Wendy Chapman, PhD, earned her bachelor’s degree in linguistics and her PhD
in medical informatics from the University of Utah in 2000. From 2000–2010,
she was a National Library of Medicine (NLM) postdoctoral fellow and then
a faculty member at the University of Pittsburgh. She joined the Division of
Biomedical Informatics at the University of California, San Diego, in 2010. In
2013, Dr. Chapman became the chair of the University of Utah’s Department
of Biomedical Informatics. Dr. Chapman’s research focuses on developing and
disseminating resources for modeling and understanding information described
in narrative clinical reports. She is interested not only in better algorithms for
extracting information out of clinical text through natural language processing
(NLP) but also in generating resources for improving the NLP development
process (such as shareable annotations and open-source toolkits) and in developing
user applications to help non-NLP experts apply NLP in informatics-based tasks
like clinical research and decision support. She has been a principal investigator on
several National Institutes of Health grants from the NLM, National Institute for
Dental and Craniofacial Research, and the National Institute for General Medical
Sciences. In addition, she has collaborated on multi-center grants, including the
ONC SHARP Secondary Use of Clinical Data and the iDASH National Center
for Biomedical Computing. Dr. Chapman is a principal investigator and a co-
investigator on a number of U.S. Department of Veterans Affairs (VA) Health
Services Research and Development grant proposals extending the development
and application of NLP within the VA. A tenured professor at the University of
Utah, Dr. Chapman continues her research in addition to leading the Department
of Biomedical Informatics. Dr. Chapman is an elected fellow of the American
College of Medical Informatics and currently serves as treasurer and was the
previous chair of the American Medical Informatics Association Natural Language
Processing Working Group.
Jonathan Chen, MD, PhD, practices medicine for the concrete rewards of
caring for real people and to inspire research focused on discovering and
distributing the latent knowledge embedded in clinical data. Dr. Chen co-
founded a company to translate his computer science graduate work into an
expert system for organic chemistry, with applications from drug discovery to an
education tool for students around the world. To gain perspective on tackling societal
problems in health care, he completed training in internal medicine and a research
fellowship in medical informatics. He has published influential work in the New
England Journal of Medicine, JAMA, JAMA Internal Medicine, Bioinformatics, Journal of
Chemical Information and Modeling, and the Journal of the American Medical Informatics
Association, with awards and recognition from the National Institutes of Health’s
Big Data 2 Knowledge initiative, the National Library of Medicine, the American
Guilherme Del Fiol, MD, PhD, is currently an associate professor and the
vice-chair of research in the University of Utah’s Department of Biomedical
Informatics. Prior to the University of Utah, Dr. Del Fiol held positions in clinical
knowledge management at Intermountain Healthcare and as faculty at the Duke
Community and Family Medicine Department. Since 2008, he has served as an
elected co-chair of the Clinical Decision Support Work Group at Health Level
Seven International (HL7). He is also an elected fellow of the American College of
Medical Informatics and a member of the Comprehensive Cancer Center at
Huntsman Cancer Institute. Dr. Del Fiol’s research interests are in the design,
development, evaluation, and dissemination of standards-based clinical decision
support interventions. He has been focusing particularly in clinical decision
support for cancer prevention. He is the lead author of the HL7 Infobutton
Standard and the project lead for OpenInfobutton, an open-source suite of
infobutton tools and web services, which is in production use at several health care
organizations throughout the United States, including Intermountain Healthcare,
Duke University, and the Veterans Health Administration. His research has been
funded by various sources including the National Library of Medicine, the
National Cancer Institute, the Agency for Healthcare Research and Quality, the
Centers for Medicare & Medicaid Services, and the Patient-Centered Outcomes
Research Institute. He earned his MD from the University of Sao Paulo, Brazil;
his MS in computer science from the Catholic University of Parana, Brazil; and
his PhD in biomedical informatics from the University of Utah.
to explore data quality in electronic health record (EHR) data. His research with
LCS is focused on applying statistical learning techniques (Deep Learning and
unsupervised clustering) and data science methodologies to design systems that
characterize patients and evaluate EHR data quality. Dr. Estiri holds a PhD in
urban planning and a PhD track in statistics from the University of Washington.
Prior to joining LCS, Dr. Estiri completed a 2-year postdoctoral fellowship with
the University of Washington’s Institute of Translational Health Sciences and
Department of Biomedical Informatics.
Stephan Fihn, MD, MPH, attended St. Louis University School of Medicine
and completed an internship, residency, and chief residency at the University of
Washington (UW). He was a Robert Wood Johnson Foundation Clinical Scholar
and earned a master’s degree in public health at UW where he is professor of
medicine and health services and the head of the Division of General Internal
Medicine. During a 36-year career with the U.S. Department of Veterans Affairs
(VA), Dr. Fihn held a number of clinical, research, and administrative positions.
He directed one of the first primary care clinics in the VA and for 18 years led the
Northwest VA Health Services Research & Development Center of Excellence
at the Seattle VA. He also served several national roles in the Veterans Health
Administration including acting chief research and development officer, chief
quality and performance officer, director of analytics and business, and director
of clinical system development and evaluation. His own research has addressed
strategies for improving the efficiency and quality of primary and specialty
medical care and understanding the epidemiology of common medical problems.
He received the VA Undersecretary’s Award for Outstanding Contributions in
Health Services Research in 2002. He has published more than 300 scientific
articles and book chapters and two editions of a textbook titled Outpatient Medicine.
He is deputy editor of JAMA Network Open. He is active in several academic
organizations including the Society of General Internal Medicine (SGIM) (past-
president), the American College of Physicians (fellow), the American Heart
Association (fellow), and AcademyHealth. He received the Elnora M. Rhodes
Service Award and the Robert J. Glaser Award from SGIM.
This position is partially funded by a $1.75 million donation from Amar Varma
(a Toronto entrepreneur whose newborn son underwent surgery at the Hospital
for Sick Children).
Seth Hain, MS, leads Epic’s analytics and machine learning research and
development. This includes business intelligence tools, data warehousing
software, and a foundational platform for deploying machine learning across Epic
applications. Alongside a team of data scientists and engineers, he focuses on a
variety of use cases ranging from acute care and population health to operations
and improving workflow efficiency.
practitioner for the State of California, serving as the co-chair of the Commission
on Juvenile Delinquency and Prevention for San Mateo County.
Edmund Jackson, PhD, is the HCA Healthcare chief data scientist and the vice
president of data and analytics within the Clinical Services Group. He holds
BScEng and MScEng degrees, both in electronic engineering, and a PhD in
statistical signal processing from Cambridge University. In that work, Dr. Jackson
focused on applications of sequential Markov chain methods in bioinformatics.
He pursued a career as a quantitative analyst in the hedge fund industry for
several years. More recently, Dr. Jackson has sought more meaningful work and
found it at HCA, where his remit is to create algorithms and systems to improve
the quality of clinical care, operational efficiency, and financial performance of
the firm through better utilization of data.
Jeffrey Klann, PhD, focuses his work with the Laboratory of Computer Science
on knowledge discovery for clinical decision support, sharing medical data to
improve population health, revolutionizing user interfaces, and making personal
health records viable. Dr. Klann holds two degrees in computer science from
the Massachusetts Institute of Technology and a PhD from Indiana University
in health informatics. He completed a National Library of Medicine Research
Training Fellowship concurrently with his PhD. He holds faculty appointments
at the Harvard Medical School and Massachusetts General Hospital.
She is also the director of the Health Communication and Informatics Laboratory
at the Department of Biomedical Informatics and certificate lead for Public
Health Informatics at Mailman. Her research interests focus on patient and
community engagement technologies, risk communication, decision science,
and implementation of health promoting and disease prevention technologies
into clinical workflow. Her projects include developing decision aids, portals
for community engagement, requirements and usability evaluation, and mixed-
method approaches to studying implementation and outcomes. Dr. Kukafka is an
elected member of the American College of Medical Informatics and the New
York Academy of Medicine. She has been an active contributor to the American
Medical Informatics Association (AMIA), and is an AMIA board member. She
has chaired the Consumer Health Informatics Working Group for AMIA, and
served on an Institute of Medicine committee that authored the report Who Will
Keep the Public Healthy?: Educating Public Health Professionals for the 21st Century.
Dr. Kukafka has authored more than 100 articles, chapters, and books in the
field of biomedical informatics, including a textbook (Consumer Health Informatics:
Informing Consumers and Improving Health Care, 2005, with D. Lewis, G. Eysenbach,
P. Z. Stavri, H. Jimison, and W. V. Slack. New York: Springer).
Eneida Mendonça, MD, PhD, received her MD from the Federal University
of Pelotas in Brazil and her PhD in biomedical informatics in 2002 from
Columbia University in New York. Dr. Mendonça pioneered the use of natural
language processing in both biomedical literature and electronic medical record narratives to identify knowledge relevant to medical decision
making in the context of patient care. In addition, she has devoted many
years to developing innovative clinical information systems that have been
Joni Pierce, MBA, is a principal at J. Pierce and Associates and adjunct faculty
at the University of Utah’s David Eccles School of Business. Ms. Pierce received
her MBA from the University of Utah and is currently pursuing a master’s degree
in biomedical informatics and clinical decision support.
W. Nicholson Price II, JD, PhD, is an assistant professor of law at the University
of Michigan Law School, where he teaches patents and health law and studies
life science innovation, including big data and artificial intelligence in medicine.
Dr. Price is co-founder of Regulation and Innovation in the Biosciences; co-
chair of the Junior IP Scholars Association; co-lead of the Project on Precision
Medicine, Artificial Intelligence, and Law at the Harvard Law School’s Petrie-
Flom Center for Health Law Policy, Biotechnology, and Bioethics; and a core
partner at the University of Copenhagen’s Center for Advanced Studies in
Biomedical Innovation Law.
Suchi Saria, PhD, MSc, is a professor of machine learning and health care at
Johns Hopkins University, where she uses big data to improve patient outcomes.
Her interests span machine learning, computational statistics, and their applications
to domains where one has to draw inferences from observing a complex, real-
world system evolve over time. The emphasis of her research is on Bayesian and
probabilistic graphical modeling approaches for addressing challenges associated
with modeling and prediction in real-world temporal systems. In the past 7 years,
she has been particularly drawn to computational solutions for problems in health
informatics, as she sees a tremendous opportunity there for high-impact work. Prior
to joining Johns Hopkins, she earned her PhD and master’s degree at Stanford
University in computer science, working with Dr. Daphne Koller. She also spent
1 year at Harvard University collaborating with Dr. Ken Mandl and Dr. Zak
Kohane as a National Science Foundation Computing Innovation Fellow. While
in the Valley, she also spent time as an early employee at Aster Data Systems, a big
data startup acquired by Teradata. She enjoys consulting and advising data-related
startups. She is an investor and an informal advisor to Patient Ping.
Ranak Trivedi, MA, MS, PhD, is a clinical health psychologist and a health
services researcher interested in understanding how families and patients can better
work together to improve health outcomes for both. Dr. Trivedi is also interested
in identifying barriers to and facilitators of chronic illness self-management, and in developing family-centered self-management programs that address the needs of both patients and their family members. In addition, Dr. Trivedi is interested in improving
the assessment and treatment of mental illnesses in primary care settings and
evaluating programs that aim to improve these important activities.