
Mitigating Medical Alarm Fatigue with Cognitive Heuristics

2016, Proceedings of the 10th EAI International Conference on Pervasive Computing Technologies for Healthcare

Automated patient monitoring systems suffer from several design problems. Among them, alarm fatigue is one of the most critical issues, as evidenced by the Sentinel Event Alert that The Joint Commission (the U.S. hospital-accrediting body) recently issued. In this study, we explore fast-and-frugal heuristics that may be used to prioritize patient alarms, while continuing to monitor patient physiological state. By using a combination of human factors methodologies and the theory of Distributed Cognition (DCog), we studied alarm fatigue and its relationship to the underlying hospital systems. We identified three specific factors that we envision to be helpful for clinical personnel: ventilator presence, number of intravenous drips, and number of medications. We discuss their application in daily hospital operation.

Mustafa Hussain, Florida Polytechnic University, 4700 Research Way, Lakeland, FL, USA
James Dewey, Florida Polytechnic University, 4700 Research Way, Lakeland, FL, USA
Nadir Weibel, UC San Diego, 9500 Gilman Dr, La Jolla, CA, USA

Categories and Subject Descriptors
[Applied Computing]: Life and medical sciences (Health informatics); [Information Systems]: Information systems applications (Decision support systems); Human-centered computing [Human computer interaction (HCI)]: HCI design and evaluation methods (Field Studies)

General Terms
Clinical informatics, clinical decision support systems, human factors, Distributed Cognition, decision modeling

Keywords
Cognitive heuristics, fast-and-frugal trees, patient monitoring systems, alarm fatigue.

1. INTRODUCTION

In July 2010, a patient who had suffered a severe blow to the face underwent surgery and was then admitted to the hospital’s Intensive Care Unit (ICU). Agitated, the patient kept removing the pulse oximeter from their finger, triggering an alarm each time.
These were obviously false alarms, and the staff stopped paying attention to them. However, a real problem soon arose: the patient’s heart rate and breathing rate started to increase, while blood oxygen decreased. Alarms sounded, with no response, for an hour. Then, the patient stopped breathing. A critical alarm sounded. Hospital personnel finally responded, but it was too late: the patient had suffered severe brain damage [17].

This is not an isolated incident. Alarm fatigue is a common problem in ICUs. Approximately 80% of ICU monitor alarms are irrelevant [29]. This volume of irrelevant alarms desensitizes nurses [34], leading to inappropriate behavior during real emergencies [32]. The Joint Commission identified alarm fatigue as a threat to patient safety [14].

In this study, we identify cognitive heuristics that nurses may already be using to quickly assess patient acuity, and we propose that automated patient monitors use such heuristics to automatically prioritize physiological alarms. Current monitors feature simple alarm prioritization. However, it appears that cognition is inappropriately distributed: too much of the cognitive burden of determining whether a physiological state requires action falls on nurses or clinicians. This burden exceeds their available cognitive resources, resulting in alarm fatigue. We conjecture that, by redistributing cognition such that automated actors bear more of this burden, monitors will more effectively prioritize the information that they convey to clinical personnel, without increasing the risk of an alarm being missed.

Our research contribution is twofold. First, we propose using a heuristic model to measure patient acuity, which we define in Section 3.2. While it is known that nurses use heuristics to assess patient acuity [30], to our knowledge, building these heuristics into patient monitors is a novel concept.
By exploiting our model, we propose that future physiological monitors prioritize alerts using such a heuristic, and we present heuristics that have a high potential to succeed. Furthermore, we frame alarm fatigue through the lens of Distributed Cognition (DCog). We believe that this novel approach is a necessary step, motivated by the critical observation that situation awareness is distributed among automated monitors and team members in the ICU.

2. BACKGROUND

Multiple disciplines have addressed medical alarm fatigue. In this section, we discuss how nurses and engineers have addressed the problem. Then, we apply concepts from the broader cognitive sciences literature to the medical domain.

In 2010, Graham and Cvach [10] demonstrated that best-practices nurse training could improve patient monitor alarm validity. They showed that this training reduced the number of critical monitor alarms in an ICU by 43%. However, frequent and comprehensive training is costly and time-consuming, and hospital personnel rarely undergo the necessary training to effectively solve this problem.

As an alternative, designers and engineers believe that good product design solves user interface issues more effectively than training [4]. To address alarm fatigue, they have recently made important advancements to increase the relevance of alarms, by integrating measures from multiple monitoring systems, and by leveraging statistical methods and artificial intelligence techniques. While promising, these solutions have largely been implemented in simulation only [29] (see footnote 1), so there is little to no data on their impact in the field. Furthermore, as we discuss in the next section, drastically decreasing the percentage of false alarms will likely result in a new range of issues. This is because it may lead staff to assume perfect accuracy, and then to modify their behaviors to follow this assumption, without understanding the shortcomings of the technical design.
2.1 Human Factors and Ergonomics

Human factors research in different work domains, such as aviation and nuclear power plant operation, has addressed alarm fatigue. Notably, Wickens et al. [33] (p. 25) introduced an important framework and practical guidelines:

1. Use multiple alarm levels. Prioritize alarms based on each event’s level of urgency and certainty.
2. Raise automated beta slightly. This refers to Signal Detection Theory, where false positives may be directly traded off for false negatives by raising the alarm threshold, known as “beta.”
3. Keep the human “in the loop.” Humans should monitor the raw data in parallel with the automated systems.
4. Improve operator understanding of false alarms. The statistical necessity of a high sensitivity and low specificity should be explained to nurses and clinical personnel. This involves encouraging nurses to shift how they think of alarms, from a stimulus intended to indicate an error to a stimulus intended to guide attention.

In this paper, we focus on Wickens’ guidelines 1 and 3. Guideline 4 raises questions of training, which are separate from the question of heuristic modeling that we focus on in this study. In order to apply guideline 2, it is necessary to determine how far beta may be adjusted by weighing benefits and risks; such an analysis is also outside the scope of this study. Furthermore, guideline 1 recommends that alarms also indicate their level of certainty, in addition to the level of urgency that they detect. Although this is clearly important, we leave this to future work, focusing instead on urgency, as measured by acuity.

To contextualize our analysis, we frame alarm fatigue as under-trust in the alarm system. As we hinted above, when an alarm is highly accurate, but not perfect, this results in over-trust [19]. Similarly, Mosier et al. [24] speak of automation bias, a “heuristic replacement for vigilant information seeking and processing.” It manifests as several issues. One issue is Complacency, which is observed when the operator no longer monitors the raw sensor data, instead relying on the system to issue an alarm in the event of a problem [23]. Reliance occurs when the operator does not take precautions because the system does not issue any warning. Compliance occurs when the operator responds to an alarm as if the indicated problem is truly happening, without first checking for a false alarm. Finally, after extended periods of over-trusting automation, operators tend to deskill [7], meaning that they lose the ability to perform tasks manually. This may be remedied with regular drills [26].

Footnote 1: In addition to effectively prioritizing alarms, new medical technologies should sound alarms that nurses can easily identify. The ISO/IEC 60601-1-8 alarm set does not meet this requirement, although an alternative set does [1].

2.2 Distributed Cognition

In healthcare, knowledge, work, and situation awareness are represented and transformed collaboratively, among many actors and artifacts. Plans change dynamically, because future states of the work system are unpredictable. DCog views cognition as distributed among human actors, technological actors, and cognitive artifacts (such as “to-do” lists), as well as through time, within specific work systems [12]. We believe that the environment and characteristics of critical alarms in the ICU are a typical example of a DCog system. Thus, DCog is well-suited to help address the problem of ICU alarm fatigue.

In DCog, responsibilities overlap vertically in the actor hierarchy, creating a shared responsibility to catch errors. Additionally, communication channels are separated, to ensure independent error-checks. In the case of ICU alarms, nurses occupy a higher role in the actor hierarchy, above automated physiological monitors. They share the responsibility of monitoring the raw data to catch abnormalities.

2.3 Cognitive Heuristics

How do nurses monitor patients?
There are accurate models of patient acuity [11], such as APACHE II and NEWS [35]. However, they are computationally intensive and complex, and most use more than ten variables (see footnote 2). Simmons et al. [30] found that nurses use heuristics to assess patient acuity, rather than perform mental computations that resemble these models. Heuristics are not necessarily inferior to computational models [9]. In fact, Kruse et al. [18] found nurse estimation of mortality risk to be as reliable as APACHE II.

Building upon this reasoning, we recommend that patient monitors prioritize alarms by patient acuity, using a heuristic that mimics the reasoning process of clinical staff. This would keep the human in the loop; the algorithm of choice must be usable in rapid decision-making contexts. Gigerenzer and Gaissmaier [9] advocate the use of heuristics in medical domains, because they are intuitive, easily learned and recalled, and rapidly applied. These features are key to their adoption in clinical practice [22]. Indeed, they have been successfully implemented to determine which patients should be sent to a coronary care unit. By contrast, complex statistical models are unintuitive, difficult to learn and recall, and tedious to apply. These considerations provide the basis for criteria 1 and 2, introduced below.

Next, we explore the design of such a heuristic. In order to evaluate alternative heuristic models, we draw on our previous discussion to generate the following criteria:

1. Nurses and other clinical personnel should find its decision structure intuitive.
2. Nurses should be able to rapidly recall and use it.
3. Its parameters should be visually available, reducing noise, which can impede communication during medical emergencies [27].
4. Perhaps counterintuitively, as discussed in Section 2.1, the system should be inaccurate enough to avoid over-trust, so that nurses monitor the raw data.

Footnote 2: NEWS uses only 6 variables. While intentionally more manageable than its predecessors, it preserves a decision structure that necessitates the use of a scoring worksheet.

In order to understand how we may build on human-factors engineering, apply cognitive heuristics, and consider the theory of Distributed Cognition to address the problem of alarm fatigue, we conducted an exploratory study. In the remainder of this paper, we describe our study, and discuss the results and conclusions that we drew from it.

3. METHODS

Data collection took place in a large, non-teaching hospital, located in a mid-sized metropolitan area in the Southeastern United States. After IRB approval, we approached nurses on the ICU floor or in the break-room, informed them of the benefits and risks of participation, and asked them to consider participating in our study. Throughout our study, seventeen nurses were enrolled, and we were able to observe approximately 77% of patient rooms.

Despite the relatively high number of participants and the large amount of data we collected, several potential participants were not able to join our study, mainly due to heavy workload or specific dangerous situations. For example, when a patient required urgent care, interviewing the nurse would have endangered the patient. Nevertheless, out of the 7 situations in which more than 1 nurse identified a patient as highly acute or as having coded in the previous 24 hours (i.e., having entered a rapidly declining state), we were unable to observe and sample only 2. In addition, occasionally nurses were simply not in the unit, because they had taken the patient to radiology. In two cases, nurses declined to participate. We discuss implications of the unobserved cases in Section 6 (Discussion), and recommend ways to overcome these obstacles for future studies in Section 6.4 (Outlook).

3.1 Exploratory Interview Phase

In order to gather enough information, we scheduled six 2-hour observation visits to the ICU. Additionally, we conducted semi-structured interviews to identify potential indicators and informational sources of acuteness, busyness, and patient progression. Below, we list typical questions that we used to guide our semi-structured interviews:

1. How would you rate the acuity of your patient, on a scale of 1 to 5, where 5 represents the greatest risk?
2. Please rate the busyness of your patient, on a 1 to 5 scale, where 5 represents the greatest workload.
3. What indicates to you that their acuity is that high?
4. Where did you get that information?
5. What are you watching that will indicate to you that your patient’s condition is improving or worsening?
6. Who are the most acute patients in this unit right now?
7. How do you know they are the most acute?
8. Where did you get that information?

We coded the transcriptions from semi-structured interviews in order to identify the variables that nurses use to assess patient acuity (we reveal these in the next section). In order to build initial heuristic models, we systematically gathered additional empirical data.

3.2 Questionnaire Design

The exploratory interview phase revealed a number of variables to consider. The answers to our semi-structured interview questions therefore guided the design of a questionnaire that we based on six specific areas.

Nurse Experience. We asked nurses to self-report where they stood on Benner’s [3] novice-to-expert scale. A Novice is one with no experience, an Advanced Beginner has begun to see patterns, a Competent nurse has 2-3 years’ experience in similar situations, a Proficient nurse makes holistic decisions, anticipates outcomes, and adapts plans, and an Expert no longer relies on principles, rules, or guidelines.

Patient Acuity. Kruse et al. [18] found that nurse estimation is as accurate a measure of mortality risk as APACHE II.
We asked nurses, “On a scale of 1 to 5, where ‘1’ means that the patient is ready to transfer, and ‘5’ means they probably won’t make it, how acute is the patient?”

Certainty about Acuity. During the semi-structured interviews, nurses indicated that sometimes they could not assess acuity, because their patient had not arrived. Others mentioned that regularly scheduled measurements, such as lab results and scans, indicated the effectiveness of treatments. We hypothesize that certainty of patient acuity (1) starts low, when a patient initially arrives, (2) increases when fresh results arrive, and (3) reduces when a new treatment is administered. While acuity is the main focus of this study, we gathered nurse certainty perception on a 1-to-5 scale, ‘5’ representing complete certainty.

Patient Busyness. During our interviews, nurses frequently pointed out that, contrary to intuition, some patients at low risk of mortality require more time and attention than patients who face higher risk, and vice-versa. In order to ensure that nurses did not report busyness instead of acuity, we asked them to assess patients on both dimensions. We asked, “On a scale of 1 to 5, where ‘1’ means the patient can take care of themselves, and ‘5’ means you must constantly watch them, how busy is the patient?”

Identified By Others. If nurses who are not assigned to the patient are able to reliably indicate which patients in the unit are most acute, then indicators of patient acuity that are visually observable are better candidates for use in heuristics. We observed that nurses communicate patient details in informal conversations. However, nurses cited visual observations, rather than conversations, when asked how they knew that another nurse’s patient was highly acute. We asked nurses, “Which patients in this unit are most acute?” and tallied their responses.

Has Coded. In hospital vernacular, “to code” means “to enter a rapidly declining physiological state, requiring emergency measures.” During a code, there is a high likelihood of patient mortality. We asked nurses whether their patient had coded in the last 24 hours. They answered “yes” in only 4 of 54 cases. We considered this an insufficient quantity from which to draw conclusions, and discarded this variable prior to analysis.

Medication Questions. Nurses frequently cited their patients’ medications as evidence of acuity. Interviews suggested that medication class and dosage indicate acuity. For example, many patients have one vasopressor line, and nurses do not consider this an indicator of high acuity. If, on the other hand, a patient has six vasopressor lines, a nurse may infer that the patient is highly acute. However, there exist hundreds of medications, and many dosage scales, which are adjusted to account for additional factors, such as weight and age. Such multidimensionality demands more data than can be gathered in this study. Instead, we gathered the three following measures, in order to broadly characterize medication consumption:

1. Relative Medication Quantity. For purposes of keeping the human “in the loop,” it is necessary to consider whether nurses have an accurate mental model of the quantity of medications they are administering to their patients. We asked, “On a scale of 1 to 5, ‘1’ being very few medications and ‘5’ being the most you have ever seen, how would you rate the quantity of the medications this patient is on?”
2. Actual number of medications. After estimating relative medication quantity, we asked nurses to retrieve the exact number of unique medications administered in the last 24 hours from the Electronic Health Record.
3. Number of intravenous medication drips. We asked nurses, “How many drips does this patient have, including saline, but not including food? If they have more than one line for the same medication, this counts as more than one drip.”

Number of Watch Variables. We asked nurses, “What variables are you watching that indicate the progression of this patient?” Nurses indicated a range of variables as evidence of acuity. We identified the following categories: Arterial, Fluids, Labs, Medication Dosages, Neuromotor Status, Oral Intake, Pain, Respiratory Status, Scans, Temperature, Urinary Output, Visuosensory Cues, Vitals, and Wounds.

Invasive Equipment. We took note of whether the following were present in the patient’s room: Ventilator, Chest Tube, Balloon Pump, and CRRT (Continuous Renal Replacement Therapy). Nurses indicated these as evidence of acuity. However, we did not observe any Balloon Pumps or CRRTs during our study, and only observed a chest tube twice, so we discarded these categories prior to analysis.

3.3 Questionnaire Administration Phase

In order to administer the questionnaire, we visited the ICU for five additional 2-hour visits. At that point, we had gathered 54 observations, and we felt that this was enough for an exploratory analysis. We interviewed all nurses who were present and willing to participate during each visit. Visits took place on both weekday and weekend afternoons, to sample a variety of contexts. Each nurse was administered the questionnaire described above.

4. ANALYSIS AND RESULTS

In this section, we report on the analysis of nurse responses in terms of certainty in acuity assessments, how well nurses assess the quantity of their patients’ medications, whether nurses who are not assigned to a patient know which nearby patients are highly acute, and how well each of the viable candidate factors predicts acuity. Ordered logistic regression relies on the parallel regression assumption, so we accompany each regression with a Brant [5] test of this assumption. Low Brant p-values indicate that the assumption is likely violated. In practice, models may still be useful even if this assumption is violated. For an explanation of this assumption, consult [20], page 150.

Nurses expressed complete certainty, ‘5’, in their acuity assessments in 72% of cases, and never reported a certainty below ‘3.’ An ordered logistic regression found no correlation between certainty and busyness (β = 0.07, p = 0.87, S.E. = 0.40), and passed the Brant test (p = 0.63).

Estimated Relative Medication Quantity. We ran an ordered logistic regression between number of medications and nurse-estimated medication quantity to determine whether nurses have a well-developed mental model of medication quantities. We eliminated categories 4 and 5, because they contained only 3 datapoints in total. Figure 1 plots the data. We found a significant positive correlation in support of this hypothesis (β = 0.16, p = 0.001, S.E. = 0.05), and the model passed the Brant test (p = 0.29).

Figure 1: An ordered logistic regression indicated that nurse estimates of relative medication quantity predict actual number of medications. Larger dots indicate overlapping datapoints. We excluded categories 4 and 5 from the analysis due to data paucity.

Predicting Acuity. In order to define the terms acute and most acute, we split acuity into approximate percentiles, as shown in Table 1. We aimed to define the top 50% as acute, and the top 25% as most acute. Categories 3-5 represented the top 57%, and categories 4-5 represented the top 22%.

Table 1: Definitions of acute and most acute. We chose the category ranges that came closest to the top 25% and 50% to define these terms.
Acuity        5         4         3         2         1
Portion       11.11%    11.11%    35.19%    11.11%    31.48%
Cumulative    11.11%    22.22%    57.41%    68.52%    100.00%

Acute: categories 3-5 (top 57%). Most Acute: categories 4-5 (top 22%).

We split number of medications into approximate quartiles, as shown in Table 2. Patients had up to 29 medications, so this independent variable has a fine granularity. Reducing its granularity in this way makes the results easier to interpret, since its odds ratios are more directly comparable with those of other predictor variables.

Table 2: Percentile definitions of number of medications. Ranges are boundary-inclusive.

Number of Medications    2-5       6-10      11-15     16-29
Cumulative               20.37%    50.00%    75.93%    100.00%

Nurses reliably identified the most acute patients in the unit, as evidenced by a logistic regression between most acute and identified by others (Odds ratio = 1.80, p = 0.05, S.E. = 0.55). A Brant test is not applicable here, since the dependent variable is binary. Figure 2 shows the marginal probabilities. In order to obtain these marginal probabilities, we calculated the marginal probabilities of not being most acute, then subtracted them from 1. This was necessary because the most acute patients are defined as uncommon, resulting in a small sample of most acute patients.

[Figure 2 plot: Probability of Most Acute (vertical axis, 0 to 1) against Num. Nurses Identified Patient as Most Acute (horizontal axis, 0 to 4).]

Figure 2: Nearby nurses tend to know who is most acute. This chart shows the probability of a patient’s acuity being a ‘4’ or a ‘5,’ as assessed by their assigned nurse, given that a number of nurses have identified them as the most acute in the unit.

We ran an ordered logistic regression to determine the extent to which each candidate predictor variable indicated acuity (Table 3). Ventilator presence, number of drips, and number of medications quartile are promising predictors.

Table 3: Ordered logistic regression results show the ability to predict if a patient is acute (OR = Odds Ratio, S.E. = Standard Error, C.I. = Confidence Interval)

Predictor            OR       p-value    S.E.     95% C.I.
Ventilator           13.84    0.03       16.61    1.32 - 145.43
# Drips              2.27     0.10       1.13     0.85 - 6.05
# Meds Quartile      1.12     0.16       0.09     0.95 - 1.31
# Watch Variables    1.02     0.96       0.40     0.47 - 2.20
Constant             0.11     0.03       0.10     0.02 - 0.73

5. EXPLORING POTENTIAL HEURISTICS

In this section, we compare the accuracy of ordinal logistic regression models with fast-and-frugal tree models, a common cognitive heuristic [9]. We provided the rationale and explanation for designing heuristics in Section 2.1. We trained our heuristic models to distinguish between patients who were and were not acute, as defined in Table 1.

In order to conduct the comparison, we split the data 9 ways by selecting every 9th datapoint in all 9 possible ways. This resulted in 9 combinations of training and testing sets, each with 48 training datapoints and 6 testing datapoints. In order to generate fast-and-frugal trees, we used Kass’s [15] decision tree algorithm, implemented in Stata by Luchman [21]. Figure 3 shows the resulting trees. We trained and tested the two models on each of the 9 segment pairs using each of 4 sets of independent variables:

1. Ventilator presence only.
2. Ventilator presence and number of drips.
3. Ventilator presence and medication quantity quartile.
4. All of the above.

[Figure 3 diagram: four fast-and-frugal trees, one per set of independent variables, built from the cues “Ventilator?”, “Any Drips?”, and “> 5 Meds?”, each branching on No/Yes to “Acute” or “Not Acute” exits.]

Figure 3: These fast-and-frugal trees heuristically determine whether a patient is acute. As shown in Table 4, they are correct approximately three-fourths of the time, about as often as ordered logistic regression models.

We compared accuracy between the ordinal logistic regression and the fast-and-frugal tree models using the Wilcoxon test of pairwise comparisons. Jaimes et al. [13] also used this method to compare logistic models with neural networks. As Table 4 shows, there is little reason to believe that the models differ in accuracy. We defined 57% of patients as acute (see Table 1), so a naïve classifier would classify all patients as acute, achieving 57% accuracy. As Table 4 shows, both models performed significantly better than this 57% baseline.

6. DISCUSSION

In previous sections, we explored and analyzed nurses’ mental models of patient acuity, proposing heuristic models to mimic the structure of their acuity assessment process. Here, we discuss the results in detail.

6.1 Certainty

Nurses expressed high certainty in their acuity assessments. Nurse assessments of acuity are a reliable predictor of mortality risk [18], so this confidence may have been well-placed. Overconfidence bias [8] may have played a role as well. In our interviews, two nurses reported that they face pressure from family members and physicians to express confidence, even when they feel uncertain, since expressing uncertainty is met with consequences from both parties. Nurse confidence is key to patient-perceived competence [31]. While we, the observers, were not physicians or family members, nurses may present confidence habitually.

Additionally, we hypothesized in Section 3.2 that patients tend to arrive in an uncertain state, and that certainty is repeatedly recovered and reduced as new observations are taken and treatments are attempted. While this is not the focus of this study, it is still worth noting, especially because, in our study, nurses with new patients who had just arrived quickly became too busy to participate. This could explain the clustering of certainty in the higher categories. While there does not appear to be a correlation between certainty and busyness, this may be due to the paucity of low-certainty samples.

6.2 Estimates of Medication Quantity

Overall, nurse perception of relative patient medication quantity coincided well with actual quantity. However, most did not readily report a medication quantity.
They tended to find the measure unintuitive, and most appeared to conduct a mental inventory before reporting an answer. Several nurses carried around a sheet of handwritten paper that listed “to-do” notes and medications to administer; these nurses seemed to report relative medication quantity more quickly, sometimes even without looking at their paper. In contrast to the number of medications, nurses seemed to report the number of drips and the presence of a ventilator quickly. We hypothesize that the mental availability of these variables is affected by observation frequency, perhaps due to the effect of spaced repetition on retention [2].

6.3 Predicting Acuity

Nurses were able to identify the most acute patients in the unit, even though they were not specifically assigned to those patients. Nurses frequently stated that they were only aware of nearby patients. This may be because physiological monitors are configured to display the nearest patients, as shown in Figure 4. Some nurses stated that they were only aware of their own patients; we suspect that nurses with particularly busy patients tended to respond this way. Vitals are a strong predictor of acuity, as evidenced by the APACHE II model [16]. Because of the physical configuration of the unit, vitals, like ventilator presence and the number of drips, are visually available. This explains the assertion that nurses are most aware of the status of nearby patients: they are aware of the information available within their horizon of observation, as identified by Hutchins [12] (p. 268). Further work would determine whether nurses are typically only aware of nearby patients.

[Figure 4 diagram: floor plan of numbered patient rooms arranged around the desks and the breakroom, with each monitor labeled by the range of rooms it displays.]

Figure 4: Mapping of monitors to rooms. There is one patient and monitor per room (not shown). The monitors on the outer desk typically show vitals from the six nearest patients. When an alarm occurs, all monitors sound the alarm, and display the corresponding raw data. Not drawn to scale.

Table 4: Comparison of ordered logistic regression and tree models to identify acute patients. The low baseline p-values indicate that the models are more accurate than chance. The high Wilcoxon p-values indicate that the models are unlikely to differ in accuracy. “Meds” is short for Medication Quantity Quartile.

Independent Variables    Accuracy µ          Accuracy σ          t-test vs. 57% baseline (p)    Wilcoxon (p)
                         Logit     Tree      Logit     Tree      Logit     Tree
Vent                     79.26%    77.78%    13.82%    14.43%    0.0007    0.0014               0.92
Vent and Drips           79.63%    77.16%    16.20%    8.98%     0.0017    0.0001               0.55
Vent and Meds            71.48%    77.16%    17.09%    9.58%     0.0193    0.0001               0.34
Vent, Drips, and Meds    81.48%    77.78%    15.47%    7.97%     0.0008    0.0000               0.39

Surprisingly, the number of variables that nurses were watching was not a significant predictor of patient acuity. While it is possible that the number of variables being watched has no relationship with patient acuity, it is also possible that this is due to the measure. Two expert nurses pointed out that several variables are watched for all patients. Both reported that vitals are watched for all patients; one also reported watching urinary output for all patients. Nevertheless, as shown in Table 5, vitals were the most-reported watched variable. Nurses with more expertise may have only reported the distinctive watch variables. Additionally, if two variables were listed in the same category, this was counted as one variable. However, we saw this as necessary, because sometimes, participants would list the category, such as “vitals,” but other times, they would list items within that category, such as “heart rhythm.” While this reduced the granularity of the measure, we do not believe that it reduced the quality of the data.
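The counting rule just described (items in the same category count once, whether a nurse names the category or an item within it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping `ITEM_TO_CATEGORY` and the function name are hypothetical, and apart from “heart rhythm” belonging to “Vitals,” the example items are invented for illustration rather than taken from the study data.

```python
# Hypothetical lookup from a nurse's free-text answer to one of the
# watch-variable categories of Section 3.2. Only the "heart rhythm" ->
# "Vitals" pairing comes from the text; the other items are assumptions.
ITEM_TO_CATEGORY = {
    "vitals": "Vitals",
    "heart rhythm": "Vitals",
    "labs": "Labs",
    "lactate": "Labs",
    "potassium": "Labs",
    "urinary output": "Urinary Output",
    "pain": "Pain",
}


def count_watch_variables(reported_items):
    """Count distinct categories, so repeated items in one category count once."""
    categories = set()
    for item in reported_items:
        key = item.strip().lower()
        if key not in ITEM_TO_CATEGORY:
            # Fail loudly rather than silently guessing a category.
            raise KeyError(f"no category assigned to {item!r}")
        categories.add(ITEM_TO_CATEGORY[key])
    return len(categories)


# A nurse who lists "vitals" and one who lists "heart rhythm" both score 1.
print(count_watch_variables(["vitals"]))
print(count_watch_variables(["heart rhythm", "lactate", "potassium"]))
```

Under this rule, a nurse who names four lab values contributes the same count as one who simply says “labs,” which is the loss of granularity acknowledged above.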
Presumably, if there were a relationship between the number of variables watched and acuity, the watched variables would be spread out among several categories (e.g. “I am watching vitals and two lab values”), rather than clustered into one (e.g. “I am watching four lab values and ignoring vitals entirely”).

Reporting low acuity in circumstances of certain mortality, however, is consistent with the definition of “acuity” given by an expert nurse as the time and attention that a patient requires, which matches our definition of “busyness.” In future work, we recommend avoiding the term “acuity” altogether, opting instead to refer to “likelihood of mortality” and “busyness,” in order to more closely match nurse vernacular and improve researcher-participant communication.

It is still possible that medication class and dosage, which we did not measure due to feasibility limitations, predict acuity. Further research would determine whether this is the case. However, much like medication quantity, these parameters are largely invisible to emergency-responding staff. In the interest of keeping all actors “in-the-loop,” we recommend only using visually available parameters. We discuss this further in the next section.

6.4 Outlook

The evidence suggests that designers of the next generation of monitors can reduce alarm fatigue while avoiding over-trust by prioritizing alarms in a way that nurses understand. In this paper, we suggest constructing a cognitive heuristic for alarm prioritization, and we identify variables of interest that may be incorporated into such a heuristic. This addresses some of the Joint Commission’s concerns. Based on the research presented in this paper, we presently recommend that automated patient monitors meet the following constraints:

1. They should prioritize alarms using a heuristic that follows the guidelines given in Section 2.3.
2. They should reveal this heuristic to staff, to inform their mental model of its decision mechanism, consistent with Nielsen’s visibility of system status design guideline [25].

In future work, we plan to build a more accurate model by integrating physiological measurements into our heuristics. We also plan to gather a larger number of observations, to accurately identify the extent to which each variable predicts acuity, as well as to gather sufficient data in rare categories, such as balloon pumps, CRRTs, and chest tubes. After collecting the data, we plan to construct a heuristic that balances accuracy with complexity. Further work will be needed to determine how complex this heuristic may be made before nurses no longer find it usable.

Table 5: The number of observations in which each watch variable was listed. Categories not listed in this table had a frequency of zero.

| Watch Variable | Frequency |
|---|---|
| Vitals | 30 |
| Respiratory | 15 |
| Labs | 14 |
| Neuromotor | 12 |
| Urinary | 7 |
| Temperature | 5 |
| Pain | 2 |
| Scans | 1 |
| Wounds | 1 |

We observed that, during critical events, many actors respond. The room quickly becomes noisy, with many people speaking at once. This has been independently observed [27]. Thus, we believe it is important that actors are able to visually gather the information they need to assess the validity of each new alarm. We recommend using variables that are directly observable in the current technological environment in this heuristic.

It is widely understood that stress negatively affects human performance [6]. This presents two special considerations. First, data should be gathered from stressful situations; because these contexts place extensive demand on nurse attention and cognition, we recommend using video recording to gather the data. Other researchers have been able to do this in the past (e.g. Sarcevic [28]).
In our experience, because of the perceived risks posed by patient privacy laws, this requires strong trust between hospital leadership and researchers. Second, the performance of constructed heuristics in high-stress situations, in addition to real-world environments, will need to be studied. Due to the difficulty in sampling cases where a patient requires continuous attention, understanding performance in high-stress contexts will likely require testing in simulated care environments.

7. CONCLUSION

In this paper, we have identified the presence of a ventilator, the number of intravenous drips, and the number of medications as visually available factors that predict patient acuity. We propose that these, as well as other visually available physiological parameters, should be used to construct a cognitive heuristic to prioritize automated patient alarms. We also recommend that these heuristics be considered in the design of future alarm systems in the ICU. If alarms are prioritized by an understandable mechanism, nurses will be better able to identify misprioritized alarms, giving rise to an appropriate level of trust in the automated monitoring system and avoiding over-trust.

8. ACKNOWLEDGMENTS

We would like to thank all participants for their time, as well as the reviewers for their helpful feedback.

9. REFERENCES

[1] J. Atyeo and P. Sanderson. Comparison of the identification and ease of use of two alarm sound sets by critical and acute care nurses with little or no music training: a laboratory study. Anaesthesia, 70(7):818–827, 2015.
[2] D. P. Ausubel and M. Youssef. The effect of spaced repetition on meaningful retention. The Journal of General Psychology, 73(1):147–150, 1965.
[3] P. Benner. From novice to expert. Menlo Park, 1984.
[4] R. G. Bias and D. J. Mayhew. Cost-justifying usability: an update for an Internet age. Morgan Kaufmann, 2005.
[5] R. Brant. Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics, pages 1171–1178, 1990.
[6] M. W. Eysenck, N. Derakshan, R. Santos, and M. G. Calvo. Anxiety and cognitive performance: attentional control theory. Emotion, 7(2):336, 2007.
[7] E. E. Geiselman, C. M. Johnson, D. R. Buck, and T. Patrick. Flight deck automation: A call for context-aware logic to improve safety. Ergonomics in Design: The Quarterly of Human Factors Applications, 21(4):13–18, 2013.
[8] G. Gigerenzer. How to make cognitive illusions disappear: Beyond “heuristics and biases”. European review of social psychology, 2(1):83–115, 1991.
[9] G. Gigerenzer and W. Gaissmaier. Heuristic decision making. Annual review of psychology, 62:451–482, 2011.
[10] K. C. Graham and M. Cvach. Monitor alarm fatigue: standardizing use of physiological monitors and decreasing nuisance alarms. American Journal of Critical Care, 19(1):28–34, 2010.
[11] C. W. Hug and P. Szolovits. ICU acuity: real-time models versus daily models. In AMIA Annual Symposium Proceedings, volume 2009, page 260. American Medical Informatics Association, 2009.
[12] E. Hutchins. Cognition in the Wild. MIT press, 1995.
[13] F. Jaimes, J. Farbiarz, D. Alvarez, and C. Martínez. Comparison between logistic regression and neural networks to predict death in patients with suspected sepsis in the emergency room. Critical care, 9(2):R150, 2005.
[14] Joint Commission on Accreditation of Healthcare Organizations. Sentinel Event Alert: Medical device alarm safety in hospitals. Electronic, April 2013.
[15] G. V. Kass. An exploratory technique for investigating large quantities of categorical data. Applied statistics, pages 119–127, 1980.
[16] W. A. Knaus, E. A. Draper, D. P. Wagner, and J. E. Zimmerman. APACHE II: a severity of disease classification system. Critical care medicine, 13(10):818–829, 1985.
[17] L. Kowalczyk. Alarm fatigue a factor in 2d death, 2011. Boston Globe.
[18] J. A. Kruse, M. C. Thill-Baharozian, and R. W. Carlson. Comparison of clinical assessment with APACHE II for predicting mortality risk in patients admitted to a medical intensive care unit. JAMA, 260(12):1739–1742, 1988.
[19] J. D. Lee and K. A. See. Trust in automation: Designing for appropriate reliance. Human Factors: The Journal of the Human Factors and Ergonomics Society, 46(1):50–80, 2004.
[20] J. S. Long and J. Freese. Regression models for categorical dependent variables using Stata, 2006.
[21] J. N. Luchman. Chaid: Stata module to conduct chi-square automated interaction detection. Statistical Software Components, 2014.
[22] J. N. Marewski and G. Gigerenzer. Heuristic decision making in medicine. Dialogues Clin Neurosci, 14(1):77–89, 2012.
[23] J. Meyer. Conceptual issues in the study of dynamic hazard warnings. Human Factors: The Journal of the Human Factors and Ergonomics Society, 46(2):196–204, 2004.
[24] K. L. Mosier, L. J. Skitka, S. Heers, and M. Burdick. Automation bias: Decision making and performance in high-tech cockpits. The International journal of aviation psychology, 8(1):47–63, 1998.
[25] J. Nielsen. Enhancing the explanatory power of usability heuristics. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 152–158. ACM, 1994.
[26] R. Pandian, M. Mathur, and D. Mathur. Impact of ‘fire drill’ training and dedicated obstetric resuscitation code in improving fetomaternal outcome following cardiac arrest in a tertiary referral hospital setting in Singapore. Archives of gynecology and obstetrics, pages 1–5, 2014.
[27] A. Sarcevic. Who’s scribing?: documenting patient encounter during trauma resuscitation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1899–1908. ACM, 2010.
[28] A. Sarcevic and R. S. Burd. “What’s the story?” Information needs of trauma teams. In AMIA Annual Symposium Proceedings, volume 2008, page 641. American Medical Informatics Association, 2008.
[29] F. Schmid, M. S. Goepfert, and D. A. Reuter. Patient monitoring alarms in the ICU and in the operating room. Crit Care, 17(2):216, 2013.
[30] B. Simmons, D. Lanuza, M. Fonteyn, F. Hicks, and K. Holm. Clinical reasoning in experienced nurses. Western Journal of Nursing Research, 25(6):701–719, 2003.
[31] A. G. Taylor, K. Hudson, and A. Keeling. Quality nursing care: The consumers’ perspective revisited. Journal of Nursing Care Quality, 5(2):23–31, 1991.
[32] J. Welch. An evidence-based approach to reduce nuisance alarms and alarm fatigue. Biomedical instrumentation & technology/Association for the Advancement of Medical Instrumentation, pages 46–52, 2010.
[33] C. D. Wickens, J. G. Hollands, S. Banbury, and R. Parasuraman. Engineering Psychology and Human Performance. Pearson, 4th edition, 2013.
[34] S. Wilcox. Auditory alarm signals. Biomedical Instrumentation & Technology, 45(4):284–289, 2011.
[35] B. Williams, G. Alberti, C. Ball, D. Bell, R. Binks, L. Durham, et al. National early warning score (NEWS): Standardising the assessment of acute-illness severity in the NHS. London: The Royal College of Physicians, 2012.