Sleep_apnea_events_recognition_based_on_polysomnog
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/OJEMB.2024.3508477
Technology
Abstract- Goal: The gold standard for detecting the presence of apneic events is a time- and effort-consuming manual evaluation of type I polysomnographic recordings by experts, which is often not error-free. Such an acquisition protocol requires dedicated facilities, resulting in high costs and long waiting lists. The usage of artificial intelligence models assists the clinician's evaluation, overcoming the aforementioned limitations and increasing healthcare quality. Methods: The present work proposes a machine learning-based approach for automatically recognizing apneic events in subjects affected by sleep apnea-hypopnea syndrome. It embraces a vast and diverse pool of subjects, the Wisconsin Sleep Cohort (WSC) database. Results: An overall accuracy of 87.2±1.8% is reached for the event detection task, significantly higher than other works in the literature performed over the same dataset. The distinction between different types of apnea was also studied, obtaining an overall accuracy of 62.9±4.1%. Conclusions: The proposed approach for sleep apnea events recognition, validated over a wide pool of subjects, enlarges the landscape of possibilities for sleep apnea events recognition, identifying a subset of signals that improves state-of-the-art performance and guarantees simple interpretation.
Index Terms- Sleep-Related Breathing Disorders, Sleep Apnea, Obstructive Sleep Apnea, Machine Learning, polysomnography
Impact Statement- The current work presents a novel machine learning-based approach for sleep apnea events recognition that relies on low-invasive-to-record signals with above state-of-the-art (SoA) performance, guaranteeing simple interpretation.
Submission date: 14 May 2024
1 Faculty of Informatics, Università della Svizzera Italiana (USI), Lugano-Viganello, Switzerland
2 Institute of Information Systems and Networking (ISIN), University of Applied Sciences and Arts of Southern Switzerland (SUPSI), Lugano-Viganello, Switzerland
3 Institute of Digital Technologies for Personalised Healthcare (MeDiTech), University of Applied Sciences and Arts of Southern Switzerland (SUPSI), Lugano-Viganello, Switzerland
4 Department of Clinical Neurosciences, Lausanne University Hospital (CHUV) and University of Lausanne (UNIL), Lausanne, Switzerland
5 Defitech Centre for Interventional Neurotherapies (.NeuroRestore), Lausanne University Hospital (CHUV) and Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
6 Department of Electronics and Telecommunications, Politecnico di Torino, Turin, Italy
I. INTRODUCTION
Sleep has a crucial importance in every physiological condition of the body. It relaxes the muscles, allows cell turnover and tissue regeneration, and especially strengthens the central nervous system, since the maximum of neuronal plasticity is reached during sleep. According to the International Classification of Sleep Disorders (ICSD), sleep disorders were divided into eight categories: insomnias, sleep-related breathing disorders, hypersomnias of central origin not due to a circadian rhythm sleep disorder, circadian rhythm sleep disorders, parasomnias, isolated symptoms (apparent normal variants, and unresolved issues), and other sleep disorders.
A. Sleep Apnea-Hypopnea Syndrome
Sleep Apnea-Hypopnea Syndrome (SAHS) falls within the Sleep-related Breathing Disorder (SBD) spectrum. Patients affected by SAHS experience numerous involuntary respiratory pauses during the night, referred to as "apneic events", which must last between 10 seconds and 2 minutes (typically around 20 to 40 seconds [1]) to be considered clinically significant. The duration of apneic events is influenced by several factors, such as gender, obesity, age, sleep position, pharyngeal collapsibility, and loop gain, with many of these factors interacting with each other [2] [3]. The airflow reduction causes a proportional drop in the arterial blood oxygen saturation level, triggering an autonomic response that commonly evolves into a neurophysiological awakening, disturbing the subject's rest [4]. This condition translates into different symptoms both during sleep and during wakefulness. Typical symptoms during sleep are loud snoring, choking sounds, and sudden body movements, while typical symptoms during wakefulness are daytime sleepiness, fatigue, and memory-related problems. Sleep apnea is categorized into three forms:
• Central Sleep Apnea (CSA), characterized by the absence of respiratory effort due to central nervous system dysfunctions.
• Obstructive Sleep Apnea (OSA), characterized by respiratory effort hampered by the collapse of upper airway soft tissues and tongue.
• Mixed Sleep Apnea (MSA), a combination of OSA and CSA.
These forms account for 0.4%, 84%, and 15% of cases, respectively, in the U.S. and Europe [5]. Hypopnea (HYP), instead, is a less severe condition not pathologically comparable to apnea and continues to be an area of considerable controversy [6]. The gold standard for SAHS recognition is Type I PolySomnoGraphy (PSG) manual
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Open Journal of Engineering in Medicine and Biology. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/OJEMB.2024.3508477
evaluation, which comprises evaluating a whole night on the basis of at least seven of the following physiological parameters (signals): EEG (C4-A1 or C3-A2), EOG, EMG (chin), ECG, Airflow, Respiratory effort, and Oxygen saturation, as well as the tracks of body position and eventual leg movements. Each recording session is carried out in ad-hoc facilities within hospitals or sleep centers and requires the continuous presence of specialized personnel during the whole night, resulting in high costs associated with the diagnosis of SAHS. From a technical standpoint, the evaluation procedure is very time-consuming and requires high effort from clinicians; despite being highly standardized by the American Academy of Sleep Medicine (AASM) guidelines [7], it is not error-free. These limits, along with the saturation of sleep units, result in costly procedures associated with the treatment of the patients [8]. From a clinical standpoint, the impact of SAHS on the quality of life of the patients is non-negligible, and many works investigated this aspect. Subjects who suffer from SAHS have a higher probability of having cardiac and cerebral infarcts or high arterial blood pressure, as well as arrhythmias and other dysfunctions of the cardiorespiratory system [9]. In [10], the association of objectively measured SBD with incident coronary heart disease (CHD) or heart failure (HF) was studied, revealing an increasing trend in estimated hazard ratios with increasing SBD severity, reaching a 2.6 times more likely incidence of CHD or HF in patients with severe SBD compared to those without sleep-disordered breathing. Moreover, both total and cancer mortality show an increasing linear trend with increasing SBD severity as well as with an increasing hypoxemia index [11]. From the treatment-delivery standpoint, a huge bottleneck is represented by unawareness, with a great part of the patients being unaware of their own symptoms. The AASM estimates that about 29 million U.S. adults suffer from moderate to severe OSA, with an estimated 80% living unaware of it and undiagnosed. This explains why SAHS is a public health and economic challenge [4]. Thus, early identification of SAHS events is crucial for a more effective outcome of patients' treatment.
Although various devices have been used to measure physiological signals, detect apneic events, and help treat sleep apnea, significant opportunities remain to improve the quality, efficiency, and affordability of sleep apnea care.
The American Academy of Sleep Medicine (AASM) digital task force identifies five basic tasks that a system used to diagnose and detect breathing-related events must embrace [12]:
1) The system must allow the acquisition and recording of data;
2) The system must allow the visualization of the aforementioned data;
3) The system must allow data manipulation so that clinicians can visually assign a score to events;
4) The system must allow for data reduction. In particular, the final goal is to obtain useful diagnostic summary statistics for reporting starting from epochs;
5) The system must allow the storage of relevant data and results.
There is no uniform standard for the processes listed above. Tasks 3 and 4 are the most interesting ones in the light of the present work. A schematic view of the sleep diagnostic device types according to AASM can be found in Appendix A.
From this perspective, this paper aims to answer two research questions (RQ1 and RQ2). Firstly, whether it is possible to describe the sleep apnea-related SBD condition of the patients by detecting apneic events using only low-invasive-to-record signals from PSG. This research question is of primary importance in the perspective of obtaining simpler and less cumbersome diagnostic devices. In particular, we aim to obtain a system capable of accomplishing the task that uses the minimum number of parameters (signals) needed to be considered a type-3 sleep diagnostic device (see Table V in Appendix A). Secondly, whether it is possible to perform such a description based on machine learning methods that do not extend to ensembles or deep learning solutions, thus excluding completely black-box approaches. This stringent constraint aims to reduce clinicians' doubts about what the model does to return predictions. Several works, such as [14] [15] [16], suggest how fundamental it is to build by design a system with a good tradeoff between accuracy and interpretability, thus increasing the trustworthiness of the system from the clinician's standpoint. Moreover, ML models have the benefit of being lightweight with respect to ensembles or deep learning approaches, resulting in a suitable choice for implementation in portable devices (RQ1).
We aim to address RQ1 and RQ2 by exploiting a large and variegated dataset, surpassing the size of those used in comparable studies. In this way, the system will take great advantage of the dataset heterogeneity, increasing the models' generalizability over new unseen data. The present work's novelty relies on the identification of an apnea detection system's optimal configuration in terms of both features and models, which ensures a good tradeoff between system simplicity, model interpretability, and discriminative power by design. Such a configuration can be employed as an algorithmic backbone for type-3 sleep diagnostic devices (see Appendix A), given that the models examined are lightweight enough to be implemented in portable systems yet powerful enough to correctly accomplish the detection task.
The rest of the paper is organized as follows. Section II presents the proposed approach, describing the data used and their processing, the features extracted, and the choice of the ML models. The results are presented from five different points of view, depending on the clusters considered and the incisiveness of the dimensionality reduction applied, in Section III and are discussed in Section IV. Finally, a conclusion is made in Section V.
II. METHOD
A. Database and signals description
In the present work, we used the Wisconsin Sleep Cohort (WSC) database [17] [18], which comes from an ongoing longitudinal study that started more than 20 years ago. This dataset includes subjects with and without cardiovascular disease, CPAP users, subjects already suffering from SAHS of any severity classified with the apnea-hypopnea index (AHI), and subjects who have never received a diagnosis (see Table I for
TABLE I. Subjects demographic, clinical, and PSG-related characteristics of WSC database.
* TST: Total Sleep Time, measured as the total time spent in sleep stages N1, N2, N3, and REM.
subject details). It comprises 2570 PSGs gathered with two different acquisition systems, 1800 with the Grass Heritage System (GHS) and 770 with the Grass Comet Lab Based system. To be consistent, we used data collected only from GHS since most of the data were collected with it. From the total of 1800 records, we removed the ones presenting missing data or very noisy signals, ending with 1587 records. The PSGs collected with the Grass Heritage System include the following signals: 2 EEG, 2 EOG, 2 EMG, 1 ECG, 1 audio recording, nasal and oral airflow, 1 nasal pressure, 3 RIP-belt volume signals (thoracic, abdominal, and sum), 1 body position, and 1 blood saturation (SpO2). All the signals within the dataset were sampled at 100 Hz with a 16-bit resolution ADC and were already analogically pre-filtered with a band-pass filter to remove the stationary component and frequencies above 30 Hz. The WSC database includes both the raw data and the true labels of apneic events. These labels were manually assigned by experts according to the scoring procedure reported in Appendix C.
To meet RQ1, in our study, we excluded the most invasive-to-record signals, namely nasal and oral airflow and nasal pressure, usually collected through cannulas. Studies like [21] and [22] highlight how the usage of cannulas can cause discomfort in patients. Moreover, we considered only the thoracic signals among the three RIP-belt volume signals.
B. Signal processing and feature extraction
To be consistent with the AASM guidelines [7], each signal was segmented into non-overlapping 30-second epochs, discarding the ones affected by sensor detachments and the epochs of wakefulness. The features extracted can be grouped into three main categories: time-based statistics, describing the time series distribution; complexity, quantifying the presence of long-range correlations in non-stationary time series; and frequency-based. Moreover, according to the nature of the signals, some signal-specific features were extracted, e.g. hypoxic burden features for SpO2, RR intervals-based features for ECG, position encoding, etc.
Finally, a dataset of 973,000 epochs described by 130 variables was obtained.
C. Case studies and dimensionality reduction
To better understand how the available dataset's variance can help discriminate the different classes, firstly a Multivariate ANalysis Of VAriance (MANOVA) [30] [31] was performed. Figure 1 shows a dendrogram plot of the group means obtained from the MANOVA that displays two main clusters: one containing normal and hypopnea epochs and one containing apnea epochs (OSA, CSA, and MSA).
Fig. 1. MANOVA dendrogram considering all 5 classes.
In this work we analyzed the following dichotomous classification tasks:
(i) Apnea vs Normal/Hypopnea
(ii) Apnea vs Normal
(iii) OSA/MSA vs CSA
Cases (i) and (ii) reflect the main objective of the work (RQ1), being apnea detection tasks. However, we decided to investigate also case (iii), in order to further classify the type of apnea. Cases (i) and (ii) aim to differentiate the apnea condition from other conditions, while case (iii) aims to distinguish between different types of apnea. Secondly, a correlation analysis was performed, highlighting a correlation among features derived from the same signals. Therefore, the dataset underwent a feature dimensionality reduction process. In particular, for the apnea detection cases (i and ii), Principal Component Analysis (PCA) [32] was performed over the covariance matrix of the initial dataset after z-score normalization of the features. Then, a parallel coordinate chart was plotted to check the amount of variance explained by a limited number of Principal Components (PCs) and retain only the features that weighed most within those PCs. Two subsets were obtained from this process: (a) a 78-feature subset and (b) a 32-feature subset. The former preserved features from most of the signals and was obtained through a looser selection, while the latter was obtained through a
TABLE II. 32-feature subset.
more stringent approach, discarding EEG, ECG, and position features (see Table II).
The same approach was then applied for the apnea distinction, and a 44-feature-subset was obtained, including 9 SpO2, 18 ECG, 1 position, 4 audio, and 12 thoracic volume features (see Table III). It is noteworthy that 18 of the 33 ECG features were retained, whereas in previous cases, they were discarded completely.
Figure 2 shows insights about PCA and correlation analysis. The upper part refers to case study (iib) while the lower part refers to case study (iii), normal breathing vs apnea and OSA/MSA vs CSA respectively. The leftmost graphs are the Pareto charts or explained variance charts, obtained through PCA analysis. They show how much each Principal Component (PC) contributes to explain the variance (information) contained in the dataset. The horizontal axis refers to the explained variance while the vertical axis refers to the
[Figure 2 panel data. Case iib, cumulative explained variance: PC1 35%, PC3 59%, PC4 65%, PC5 70%; biosignals on the heatmap axis: blood saturation, audio recording, EOG, thorax belt. Case iii (44 extracted features), cumulative explained variance: PC1 29%, PC2 47%, PC3 56%, PC4 63%, PC5 66%; biosignals on the heatmap axis: blood saturation, ECG, audio recording, body position, thorax belt.]
Fig. 2. Principal components and correlation analyses for case studies iib (upper row) and iii (lower row). The leftmost graphs represent the
cumulative variance charts; the central graphs represent heatmaps of the first five PCs stratified per biosignal; the rightmost graphs represent
the correlation matrices.
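The cumulative variance charts and PC heatmaps summarized in Fig. 2 can be illustrated with a minimal sketch. It assumes a feature matrix `X` with one row per epoch and one column per feature, and it uses only NumPy. The function name `pca_feature_ranking` and the scoring rule (absolute loadings weighted by explained variance) are hypothetical choices for illustration; the paper does not state the exact rule used to retain features.

```python
import numpy as np

def pca_feature_ranking(X, n_components=5):
    """Z-score the columns of X, run PCA, and rank features by loading weight."""
    X = np.asarray(X, dtype=float)

    # Z-score normalization; constant features are only centered.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)
    sigma[sigma == 0] = 1.0
    Z = (X - mu) / sigma

    # PCA through eigendecomposition of the covariance matrix of Z.
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained = eigvals / eigvals.sum()         # variance share of each PC
    cumulative = np.cumsum(explained)           # values plotted in a Pareto chart

    # Assumed ranking rule: |loading| on the first PCs, weighted by the
    # variance each PC explains (not necessarily the authors' rule).
    k = min(n_components, X.shape[1])
    scores = np.abs(eigvecs[:, :k]) @ explained[:k]
    ranking = np.argsort(scores)[::-1]          # most influential feature first
    return cumulative, ranking
```

Reading `cumulative[0]`, `cumulative[1]`, and so on gives the same kind of figures quoted for case iib (35% for PC1, 51% for PC1 plus PC2), while `ranking` orders the features by how much they weigh within the leading components.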
cumulative variance, so, for instance, in case iib the first PC (PC1) contributes to explaining 35% of the total variance, while the second PC (PC2) contributes to explaining 16% of the total variance, which summed up with PC1 reaches 51%, and so on. The central graphs represent heatmaps of the principal components stratified according to biosignals. The color code of these heatmaps goes from lighter colors to darker colors based on the weights a certain feature has within a certain principal component: the darker the color, the more weight a certain feature brings to the PC. For instance, in case (iib) the blood-saturation-derived features weigh more than the audio features. Ultimately, the rightmost graphs represent the correlation matrices calculated over the subset of features that characterize each case study. These matrices can be seen as a composition of blocks belonging to different domains, the different biosignals, as it is possible to infer from the icons along the two axes. There is a certain amount of residual correlation among features derived from the same signals, while there is almost no correlation among features derived from different signals.
D. Training, Test, and Model fitting
The datasets obtained after feature selection were divided into a training set (TRS) and a test set (TS). Both datasets were balanced in terms of the number of events (normal, hypopnea, OSA, CSA, and MSA). Since the datasets were composed of a variable number of epochs per subject, to avoid overfitting, they were first divided into TRS (70%) and TS (30%), randomizing on subjects and ensuring that a subject present in the TRS was not also present in the TS. Then different case-wise balancing approaches were applied: in both cases (i) and (ii), TRS and TS were balanced by taking all the apnea epochs and randomly picking the same amount of normal-hypopnea or normal epochs. In case (iii), instead, we kept all the MSA epochs, which are the least represented, and we randomly selected OSA and CSA epochs, maintaining the apnea events proportion with respect to the MSA.
E. Model choice and tuning
In order to meet RQ2, five different supervised learning algorithms have been applied to predict the apnea events:
Fig. 3. Comparison of Diagnostic Accuracy (DA) for the best model of each case study on TS (*** = p-value < 0.001). The table in the top-
right corner contains the prediction metrics under the form mean ± standard deviation. (ia) Norm/Hyp vs Apn (78 features). (ib) Norm/Hyp
vs Apn (32 features). (iia) Norm vs Apn (78 features). (iib) Norm vs Apn (32 features). (iii) OSA/MSA vs CSA (44 features).
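The subject-wise 70/30 split and the balancing used for cases (i) and (ii) in Section II-D can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `subject_wise_split` and `balance_binary` are hypothetical, the 70% is taken over subjects rather than epochs, and the seed handling is arbitrary.

```python
import numpy as np

def subject_wise_split(subject_ids, train_fraction=0.7, seed=0):
    """Return boolean masks (train, test) so that no subject appears in both."""
    subject_ids = np.asarray(subject_ids)
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)                        # randomize on subjects, not epochs
    n_train = int(round(train_fraction * len(subjects)))
    train_mask = np.isin(subject_ids, subjects[:n_train])
    return train_mask, ~train_mask

def balance_binary(labels, positive=1, seed=0):
    """Keep all positive (apnea) epochs plus an equal-sized random sample of the
    remaining epochs, or all of them if fewer are available."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(labels == positive)
    neg_idx = np.flatnonzero(labels != positive)
    n_keep = min(len(pos_idx), len(neg_idx))
    neg_keep = rng.choice(neg_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([pos_idx, neg_keep]))
```

Splitting on subject identifiers before balancing keeps epochs of one subject from leaking between TRS and TS; the balancing step would then be applied to each set separately.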
\[ \text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} \]
\[ \text{PPV} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]
\[ \text{NPV} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Negatives}} \]
III. RESULTS
The classification performance of all the models was calculated over the same TS for a fair comparison. The number of iterations per model was fixed to 200. The ground truth for this work is the manual evaluation of the same PSGs by experts. For each case study, the best result in terms of tradeoff between diagnostic accuracy and balanced sensitivity and specificity is reported, and the resulting prediction metrics are reported in Figure 3 under the form mean ± standard deviation. In section IV the letter a will indicate the 78-feature-dataset, while the letter b will indicate the 32-feature-dataset.
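The specificity, PPV, and NPV definitions given before Section III, together with the mean ± standard deviation reporting over repeated iterations, can be sketched as below. The function names are hypothetical; accuracy and sensitivity use their standard textbook definitions, which is an assumption since their formulas do not appear in the text above; the sketch also assumes that every denominator is non-zero.

```python
import numpy as np

def diagnostic_metrics(y_true, y_pred):
    """Compute binary prediction metrics; 1/True marks an apneic epoch."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))      # true positives
    tn = int(np.sum(~y_true & ~y_pred))    # true negatives
    fp = int(np.sum(~y_true & y_pred))     # false positives
    fn = int(np.sum(y_true & ~y_pred))     # false negatives
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # standard definition (assumed)
        "sensitivity": tp / (tp + fn),                # standard definition (assumed)
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def summarize(runs):
    """Aggregate a list of metric dicts into (mean, sample standard deviation)."""
    return {
        key: (float(np.mean([r[key] for r in runs])),
              float(np.std([r[key] for r in runs], ddof=1)))
        for key in runs[0]
    }
```

Calling `summarize` on the metric dictionaries collected over the repeated iterations yields one (mean, standard deviation) pair per metric, which is the form used in Figure 3.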
IV. DISCUSSION
Focusing on the apnea detection task (case (i) and case (ii)), SVM models proved to be the best tradeoff in terms of performance, and the most informative features resulted from SpO2 and thoracic volume signals. Moreover, after the hyperparameter optimization, all the SVM models were tuned using a linear kernel (see Appendix B for a schematic view of the main hyperparameters obtained during the optimization process). The outcome of the feature selection, especially for case (ii), suggests the feasibility of implementing a simple acquisition system suitable for a home setting that only entails using a pulse oximeter and RIP bands. The performance of the models over the apnea distinction task was suboptimal because the same dataset of the apnea detection task was utilized despite being a separate classification problem. Searching for more specific features for this task is expected to boost the models' performance (a starting point could be investigating the ECG features, which emerged as the most informative in this case). This suboptimal choice of the dataset for this task translated into the non-unique choice of the distance metric after the optimization process (see Appendix B). In general, taking into account HYP in cases (i) and (ii), and MSA in case (iii), deteriorates model performance since these intermediate conditions cause overlapping between data distributions in the respective case studies. Lastly, the more radical feature selection improved performance in terms of both mean and standard deviations of all prediction metrics in both cases (i) and (ii) in favor of a reduced number of collectible signals. The present work positively answers both research questions: the identified subset of signals can be the basis for a cheaper, less cumbersome, and easier-to-use type-3 sleep diagnostic device, which can leverage the proposed apnea detection approach. Such systems will be more compliant for the patient and by design more trustworthy for the clinician.
The novel approach presented in this work also demonstrates how simple ML models can perform well over a variegated dataset containing many subjects. The literature shows a broad spectrum of approaches for both apnea detection and apnea distinction tasks. Reviews such as [34], [35], [36], and [37] can simplify the comparison with our method and help make qualitative considerations by examining other single- or multi-channel ML approaches using open-access databases (neither neural networks nor ensemble classifiers are taken into account). However, there is a limited number of studies carried out over a vast number of patients, and even fewer utilize the same database. In particular, only two out of more than one hundred studies reviewed in the aforementioned works allow a fair comparison. Specifically, [34] reports a study with SENS 93.1% and PPV 97% on a pool of only 10 subjects, very few compared to the 1587 subjects used in the present work. [35], instead, reports another study on 1479 subjects with SENS and PPV of only 68.60% and 66.36% respectively, which are considerably lower than our results.
The current work is limited by several factors. Firstly, the non-negligible inter- and intra-operator variability [38]. Although the scoring procedure is highly standardized, the database has been collected over more than 20 years by different experts. Secondly, the hardware prefiltering of the signals translated into a loss of spectral information for some of them, such as EEG and EOG. Thirdly, the results highlighted that the thoracic belt respitrace is one of the most informative signals. According to the WSC manual of operation [40], these signals were collected through semi-disposable RIP belts, which have been shown to produce less reliable output with respect to disposable cut-to-fit and disposable snap-on RIP belts [39]. From a technical standpoint, ensemble classifiers and neural network-based approaches could be more effective in solving the tasks of the presented work. However, it was decided to investigate standard ML approaches in order to meet RQ2. Finally, despite having analyzed a variegated pool of subjects, the population considered for the current work was still limited, and further inclusion of a different pool of subjects would have helped for a better generalization.
V. CONCLUSIONS
The present study proposed a novel ML-based approach for automated detection and distinction of apneic events starting from conventional PSG data. Different ML models and combinations of features have been examined in order to identify the optimal configuration. Both the research questions have been addressed; it has been demonstrated that the usage of low-invasive-to-record signals is feasible for the detection of apneic events. The performed analyses showed that blood saturation and respitrace signals are the most informative for the detection task, while the ECG is the most informative for the distinction task. Moreover, it has been demonstrated that standard ML approaches are powerful enough to solve the apnea detection task. As discussed in Section IV, further studies could build upon the current work by improving feature selection and hyperparameter optimization processes to explore the potentialities of this dataset. Furthermore, other studies could focus on extracting more informative features for specific classification tasks, retaining only the most discriminant ones identified in the current work (e.g. a more in-depth analysis of the distinction task based on ECG, SpO2, and/or respitrace signals). Finally, more fine-grained detection approaches are currently under our investigation. The intention is to craft a cascade system where the current approach is used to identify apneic epochs while a further approach is used to identify the exact extension of apneic events within these epochs. This would considerably help experts' evaluation because knowing the exact number and duration of the apneic events can give a deeper insight into the pathological condition of the patients in terms of the severity of SAHS measured through AHI.
Acknowledgement. The Wisconsin Sleep Cohort Study was supported by the U.S. National Institutes of Health (NIH), National Heart, Lung, and Blood Institute (NHLBI) (R01HL62252), National Institute on Aging (R01AG036838, R01AG058680), and the National Center for Research Resources (1UL1RR025011). The National Sleep Research Resource was supported by the NIH, NHLBI (R24 HL114473,
75N92019R002).
Author contributions: Conceptualization, N.L.P., F.M. and P.A.; Methodology, N.L.P. and S.S.; Software, N.L.P.; Validation, N.L.P. and S.S.; Formal analysis, N.L.P. and S.S.; Investigation, N.L.P.; Resources, N.L.P. and P.A.; Data curation, N.L.P.; Writing - original draft preparation, N.L.P. and S.S.; Writing - review and editing, N.L.P. and S.S.; Visualization, N.L.P. and S.S.; Supervision, M.P., F.M. and P.A.; Project administration, N.L.P. and P.A.; Funding acquisition, NONE. All authors have read and agreed to the published version of the manuscript.
Conflicts of interest. The authors declare no conflicts of interest in the current study.
Informed consent. Not applicable.
Funding. This research received no external funding.
REFERENCES
[1] Oksenberg, A., & Leppänen, T. (2023). Duration of respiratory events in obstructive sleep apnea: In search of paradoxical results. Sleep Medicine Reviews, 68, 101728.
[2] Oksenberg, A., & Leppänen, T. (2023). Duration of respiratory events in obstructive sleep apnea: Factors influencing the duration of respiratory events. Sleep Medicine Reviews, 68, 101729.
[14] Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell., 1(5), 206–215. https://doi.org/10.1038/s42256-019-0048-x
[15] Nasarian, E., Alizadehsani, R., Acharya, U. R., Tsui, K. L. (2024). Designing interpretable ML system to enhance trust in healthcare: A systematic review to proposed responsible clinician-AI-collaboration framework. Information Fusion, 102412. https://doi.org/10.1016/j.inffus.2024.102412
[16] Arbelaez Ossa, L., Starke, G., Lorenzini, G., Vogt, J. E., Shaw, D. M., Elger, B. S. (2022). Re-focusing explainability in medicine. Digital health, 8, 20552076221074488. https://doi.org/10.1177/2055207622107448
[17] Young, T., Palta, M., Dempsey, J., Peppard, P. E., Nieto, F. J., & Hla, K. M. (2009). Burden of sleep apnea: rationale, design, and major findings of the Wisconsin Sleep Cohort study. WMJ: official publication of the State Medical Society of Wisconsin, 108(5), 246–249.
[18] Zhang, G. Q., Cui, L., Mueller, R., Tao, S., Kim, M., Rueschman, M., Mariani, S., Mobley, D., & Redline, S. (2018). The National Sleep Research Resource: towards a sleep data commons. Journal of the American Medical Informatics Association: JAMIA, 25(10), 1351–1358.
[19] Mallett, S., Halligan, S., Thompson, M., Collins, G. S., & Altman, D. G. (2012). Interpreting diagnostic accuracy studies for patient care. BMJ (Clinical research ed.), 345, e3999.
[20] Trevethan, R. (2017). Sensitivity, Specificity, and Predictive Values: Foundations, Pliabilities, and Pitfalls in Research and Practice. Frontiers in public health, 5, 307.
[21] Thurnheer, R. and Bloch, K. (2004). Monitoring nasal conductance by bilateral nasal cannula pressure transducers. Physiological Measurement, 25(2), 577-584.
[22] Nishimura, M. (2016). High-flow nasal cannula oxygen therapy in adults: physiological benefits, indication, clinical benefits, and adverse effects. Respiratory Care, 61(4), 529-541.
[23] Levy, J., Álvarez, D., Rosenberg, A. A., Alexandrovich, A., Del Campo, F., & Behar, J. A. (2021). Digital oximetry biomarkers for assessing respiratory function: standards of measurement, physiological interpretation, and clinical use. NPJ digital medicine, 4(1), 1-14.
[3] Borker, P. V., Reid, M., Sofer, T., Butler, M. P., Azarbarzin, A., Wang, [24] Pincus, S. M. (2001). Assessing serial irregularity and its implications for
H., ... & Redline, S. (2021). Non-REM apnea and hypopnea duration health. Annals of the New York Academy of Sciences, 954(1), 245-267.
varies across population groups and physiologic traits. American journal [25] Van Steenkiste, T., Groenendaal, W., Ruyssinck, J., Dreesen, P., Klerkx,
of respiratory and critical care medicine, 203(9), 1173-1182. S., Smeets, C., ... & Dhaene, T. (2018, July). Systematic comparison
[4] Shokoueinejad, M., Fernandez, C., Carroll, E., Wang, F., Levin, J., Rusk, of respiratory signals for the automated detection of sleep apnea. In
S., ... & Webster, J. (2017). Sleep apnea: a review of diagnostic sensors, 2018 40th Annual International Conference of the IEEE Engineering
algorithms, and therapies. Physiological measurement, 38(9), R204. in Medicine and Biology Society (EMBC) (pp. 449-452). IEEE
[5] Morgenthaler, T. I., Kagramanov, V., Hanak, V., & Decker, P. A. (2006). [26] Wang, C., Peng, J., Song, L., & Zhang, X. (2017). Automatic snoring
Complex sleep apnea syndrome: is it a unique clinical syndrome?. Sleep, sounds detection from sleep sounds via multi-features analysis. Aus-
29(9), 1203-1209. tralasian physical & engineering sciences in medicine, 40(1), 127-135.
[6] Berry, R. B., Budhiraja, R., Gottlieb, D. J., Gozal, D., Iber, C., Kapur, [27] Alvarez, D., Hornero, R., Marcos, J., Del Campo, F., & Lopez, M. (2009).
V. K., ... & Tangredi, M. M. (2012). Rules for scoring respiratory events Spectral analysis of electroencephalogram and oximetric signals in ob-
in sleep: update of the 2007 AASM manual for the scoring of sleep and structive sleep apnea diagnosis. Annual International Conference of the
associated events: deliberations of the sleep apnea definitions task force IEEE Engineering in Medicine and Biology Society. IEEE Engineering in
of the American Academy of Sleep Medicine. Journal of clinical sleep Medicine and Biology Society. Annual International Conference, 2009,
medicine, 8(5), 597-619. 400–403.
[7] Berry, R. B., Brooks, R., Gamaldo, C. E., Harding, S. M., Marcus, C., [28] Rachim, V. P., Li, G., & Chung, W. Y. (2014). Sleep apnea classification
& Vaughn, B. V. (2012). The AASM manual for the scoring of sleep using ECG-signal wavelet-PCA features. Bio-medical materials and
and associated events. Rules, Terminology and Technical Specifications, engineering, 24(6), 2875-2882.
Darien, Illinois, American Academy of Sleep Medicine, 176, 2012. [29] Isa, S., Fanany, M., Jatmiko, W., & Murini, A. (2010, November). Feature
[8] Roebuck, A., Monasterio, V., Gederi, E., Osipov, M., Behar, J., Malhotra, and model selection on automatic sleep apnea detection using ECG. In
A., Penzel, T., & Clifford, G. D. (2014). A review of signals used in sleep International Conference on ComputerScience and Information Systems
analysis. Physiological measurement, 35(1), R1–R57. (pp. 357-362)
[9] Baillieul, S., Dekkers, M., Brill, A. K., Schmidt, M. H., Detante, O., [30] Morrison, D. F. (1990). Multivariate statistical methods. In Multivariate
Pépin, J. L., ... & Bassetti, C. L. (2022). Sleep apnoea and ischaemic statistical methods (pp. 495-495).
stroke: current knowledge and future directions. The Lancet Neurology, [31] Krzanowski, W. J. (2000). Principles of multivariate analysis. Oxford
21(1), 78-88. University Press.
[10] Hla, K. M., Young, T., Hagen, E. W., Stein, J. H., Finn, L. A., Nieto, [32] Jolliffe, I. T. (2002). Principal component analysis for special types of
F. J., & Peppard, P. E. (2015). Coronary heart disease incidence in sleep data (pp. 338-372). Springer New York.
disordered breathing: the Wisconsin Sleep Cohort Study. Sleep, 38(5), [33] Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical bayesian opti-
677-684. mization of machine learning algorithms. Advances in neural information
[11] Nieto, F. J., Peppard, P. E., Young, T., Finn, L., Hla, K. M., & Farré, processing systems, 25.
R. (2012). Sleep-disordered breathing and cancer mortality: results from [34] Alvarez-Estevez, D., & Moret-Bonillo, V. (2015). Computer-Assisted
the Wisconsin Sleep Cohort Study. American journal of respiratory and Diagnosis of the Sleep Apnea-Hypopnea Syndrome: A Review. Sleep
critical care medicine, 186(2), 190-194. disorders, 2015, 237878.
[12] Penzel, T., & Conradt, R. (2000). Computer based sleep recording and [35] Xu, S., Faust, O., Silvia, S., Chakraborty, S., Barua, P. D., Loh, H. W., ...
analysis. Sleep medicine reviews, 4 (2), 131–148. & Acharya, U. R. (2022). A review of automated sleep disorder detection.
[13] Ferber, R., Millman, R., Coppola, M., Fleetham, J., Friederich Murray, Computers in Biology and Medicine, 106100.
C., Iber, C., McCall, W. V., Nino-Murcia, G., Pressman, M., Sanders, M., [36] Ramachandran, A., & Karuppiah, A. (2021, July). A survey on recent
et al. (1994). Portable recording in the assessment of obstructive sleep advances in machine learning based sleep apnea detection systems. In
apnea. Sleep. Healthcare (Vol. 9, No. 7, p. 914). MDPI.
[37] Mostafa, S. S., Mendonça, F., Ravelo-García, A. G., & Morgado-Dias, F. (2019). A systematic review of detecting sleep apnea using deep learning. Sensors, 19(22), 4934.
[38] Younes, M., Raneri, J., & Hanly, P. (2016). Staging sleep in polysomnograms: analysis of inter-scorer variability. Journal of Clinical Sleep Medicine, 12(6), 885-894.
[39] Montazeri, K., Jonsson, S. A., Agustsson, J. S., Serwatko, M., Gislason, T., & Arnardottir, E. S. (2021). The design of RIP belts impacts the reliability and quality of the measured respiratory signals. Sleep and Breathing, 1-7.
[40] Wisconsin Sleep Cohort (2009). WSC Manual of Operations. Wisconsin Sleep Cohort.
Abbreviations
The following abbreviations are used in this manuscript:
APPENDIX A
This appendix contains a schematic view of the sleep diagnostic device types according to the AASM. For each type, the possible
parameters (physiological signals) are indicated, as well as other important characteristics.
APPENDIX B
This appendix briefly summarizes the main hyperparameters obtained from the optimization process. They are collected in
Table VI, where values between brackets follow the notation (median, 25th percentile, 75th percentile).
It is noteworthy that the optimization process led the SVM models to use linear kernels. From a technical standpoint, this is
an advantage because a linear kernel speeds up the computations with respect to nonlinear kernels.
Further, the kNN models were tuned over a reasonable number of neighbours considering the size of the dataset.
The kNN model for apnea distinction was tuned half of the time over a standardized Euclidean distance and half of
the time over a Mahalanobis distance. The latter is a distance metric that takes into account the data distribution (it can measure
how far a data point lies from the distribution, while the Euclidean distance cannot). This suggests that the distribution of the
data points plays a key role in this case, unlike in the other cases.
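The difference between the two distance metrics mentioned above can be illustrated with a minimal NumPy sketch. The synthetic two-feature data, the helper names `standardized_euclidean` and `mahalanobis`, and the probe points are all assumptions made for illustration; they are not taken from the paper's pipeline or from the WSC data.

```python
import numpy as np

def standardized_euclidean(x, y, variances):
    """Euclidean distance after scaling each feature by its own variance."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2 / variances)))

def mahalanobis(x, y, cov):
    """Distance that also accounts for correlations between features."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Hypothetical data: two strongly correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
data = np.column_stack([base, base + 0.1 * rng.normal(size=500)])

cov = np.cov(data, rowvar=False)   # full covariance matrix
variances = np.diag(cov)           # per-feature variances only
mean = data.mean(axis=0)

along = mean + np.array([1.0, 1.0])     # displaced along the correlation
against = mean + np.array([1.0, -1.0])  # displaced against the correlation

d_se_along = standardized_euclidean(along, mean, variances)
d_se_against = standardized_euclidean(against, mean, variances)
d_m_along = mahalanobis(along, mean, cov)
d_m_against = mahalanobis(against, mean, cov)
```

In this sketch the standardized Euclidean distance rates both probe points as equally far from the mean, whereas the Mahalanobis distance rates the point that contradicts the correlation structure as much farther away. When the covariance matrix is diagonal, the two metrics coincide.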
TABLE VI. Main hyperparameters.
APPENDIX C
This appendix details the definitions and scoring procedures of apneas and hypopneas according to the WSC manual of
operations [40], which clinicians followed to score the breathing-related events in the database used for this work.
A. Apneas
Definitions: Apneas are characterized by no indication of airflow in nasal pressure, no detectable breathing pattern in the
thermistor and a clear amplitude reduction in effort, followed by an associated desaturation. The different types of apnea are
distinguished by observing the thermocouple and the respitrace signals: OSA (no indication of airflow by thermocouple and
an indication of effort in respitrace channels), CSA (no indication of airflow by thermocouple and no indication of effort in
respitrace channels), and MSA (no indication of airflow by thermocouple and areas of no effort followed by effort in respitrace
channels).
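The distinction between apnea types described above can be sketched as a small decision function. This is only an illustrative simplification: the function name `classify_apnea` and the representation of respiratory effort as a chronological list of booleans are assumptions, not part of the WSC manual, where the assessment is visual.

```python
def classify_apnea(airflow_present, effort_by_segment):
    """Return 'OSA', 'CSA', 'MSA', or None, following the definitions above.

    airflow_present: True if the thermocouple indicates airflow.
    effort_by_segment: chronological list of booleans, True where the
        respitrace channels indicate effort (hypothetical representation).
    """
    if airflow_present or not effort_by_segment:
        return None  # airflow present, or nothing to evaluate
    if all(effort_by_segment):
        return "OSA"  # no airflow, effort throughout
    if not any(effort_by_segment):
        return "CSA"  # no airflow, no effort
    if not effort_by_segment[0] and effort_by_segment[-1]:
        return "MSA"  # no effort at first, followed by effort
    return None  # pattern not covered by the three definitions
```

For example, `classify_apnea(False, [False, False, True])` yields `"MSA"`, since an initial area of no effort is followed by effort while airflow is absent.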
Scoring Procedure
1) Determine if there is flow or no flow. Criteria for NO flow:
• Does not follow the previous pattern of flow and/or is < 20% of the amplitude of the largest previous breath (determined by /mm of unclipped airflow sensitivity, if necessary), and has an interruption of airflow that is > 10 seconds in duration.
2) Determine if there is effort or no effort. Criteria for no-effort (from Respitrace):
• Does not follow the previous pattern of breathing and has no discernable amplitude of the signal in the respitrace.
3) Measure duration of event:
• Measure from the beginning of the last expiration on the airflow channels to the beginning of the next inspiration on
the airflow channels to determine the 10-second criterion;
• If 10 seconds, measure the duration of the event from the beginning of the last expiration to the beginning of the next
inspiration on the SUM channel (sum of volumes) of the respitrace that best corresponds to the points of measurement
of duration in the airflow channels;
• If not 10 seconds, determine if the event meets the criterion for a hypopnea: 4% desaturation. If it does not, then the
event is ignored and not scored;
• NOTE: When the event is obviously an apnea and is between 9.5 and 10 seconds, round the duration up to 10 seconds
and score.
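The duration step of the apnea scoring procedure can be summarized as a short decision function. This is a minimal sketch under stated assumptions: the function name, the labels it returns, and the reading of the 10-second criterion as "at least 10 seconds" are illustrative choices, and the re-measurement of the duration on the SUM channel is not modeled.

```python
def score_apnea_candidate(airflow_duration_s, desaturation_pct,
                          obvious_apnea=False):
    """Decide how a no-flow candidate is handled, based on its duration.

    airflow_duration_s: duration measured on the airflow channels (seconds).
    desaturation_pct: associated oxygen desaturation (percentage points).
    obvious_apnea: scorer's judgement that the event is clearly an apnea.
    """
    duration = airflow_duration_s
    # NOTE rule: obvious apneas between 9.5 and 10 s are rounded up to 10 s.
    if obvious_apnea and 9.5 <= duration < 10.0:
        duration = 10.0
    if duration >= 10.0:  # assumed reading of the 10-second criterion
        return "apnea", duration
    if desaturation_pct >= 4.0:
        # Too short for an apnea: evaluated against the hypopnea criterion.
        return "hypopnea_candidate", duration
    return "ignored", duration
```

As a usage example, `score_apnea_candidate(9.7, 0.0, obvious_apnea=True)` returns `("apnea", 10.0)`, reflecting the rounding rule in the note above.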
B. Hypopnea
Definition: A discernable decrease in flow in the nasal pressure channel and/or thermistor with an associated oxygen
desaturation of 4% or greater indicated in the SpO2 channel beginning in sleep.
Scoring Procedure
1) Use a display view of at least a 120-second window;
2) Determine a discernable decrease in the SUM channel (sum of volumes) defined as a > 50% decrease in the mean
amplitude of the three largest breaths preceding the onset of the event, or a clear reduction in amplitude that is < 50%
with an associated oxygen desaturation of ≥ 4%;
3) Measure the duration of the event:
• Measure from the beginning of the last expiration on the SUM to the beginning of the next inspiration on the SUM to
determine the 10-second criterion (from the beginning to the end of the event);
• If not 10 seconds, delete the event mark;
• Mark the desaturation event on the SpO2 channel following the respiratory event, beginning within 30 seconds of the
end of the respiratory event;
• Delete the desaturation event for a hypopnea if the desaturation is < 4%;
• Determine that the desaturation occurs in sleep. Events that begin in sleep and end in wake are always scored. Events that begin and end in wake are never scored;
• Mark the beginning and end of the event in the SpO2 channel corresponding to the desaturation. Duration of desaturation
events should not be greater than 120 sec.
4) Mark the corresponding event in the SUM channel as Hypopnea if associated with a desaturation of ≥ 4%;
5) Without the presence of an adequate signal in the nasal pressure channel, use the nasal/oral thermistor channel for
determination of flow.
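The hypopnea criteria listed above can be combined into a single check, sketched below. The function names, the argument layout, and in particular the `min_clear_reduction` threshold (a numeric stand-in for the scorer's visual judgement of a "clear reduction") are assumptions made for illustration and do not come from the WSC manual.

```python
def amplitude_reduction(pre_event_amplitudes, event_amplitude):
    """Fractional drop in SUM amplitude relative to the mean of the
    three largest breaths preceding the onset of the event."""
    largest = sorted(pre_event_amplitudes, reverse=True)[:3]
    baseline = sum(largest) / len(largest)
    return 1.0 - event_amplitude / baseline

def is_hypopnea(reduction, duration_s, desaturation_pct,
                desat_onset_delay_s, begins_in_sleep,
                min_clear_reduction=0.3):
    """Return True if the event satisfies the criteria sketched here.

    min_clear_reduction is a hypothetical threshold standing in for the
    visual assessment of a clear reduction smaller than 50%.
    """
    desaturated = desaturation_pct >= 4.0  # 4% or greater
    discernible = reduction > 0.5 or (
        reduction >= min_clear_reduction and desaturated)
    return (discernible
            and desaturated
            and duration_s >= 10.0                  # 10-second criterion
            and 0.0 <= desat_onset_delay_s <= 30.0  # within 30 s of event end
            and begins_in_sleep)

# Example: baseline from the three largest of four preceding breaths.
reduction = amplitude_reduction([10.0, 8.0, 9.0, 4.0], 3.0)
scored = is_hypopnea(reduction, duration_s=15.0, desaturation_pct=4.0,
                     desat_onset_delay_s=20.0, begins_in_sleep=True)
```

The limit of 120 seconds on the duration of desaturation events and the fallback to the nasal/oral thermistor channel are left out of this sketch for brevity.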