Individual Project


Task 1: Evaluating AI Models in Research Papers
David Lindahl
s234817
June 21, 2024

What the papers evaluate


Paper 1: Block-regularized 5×2 Cross-validated McNemar’s Test for Comparing Two Classification Algorithms [1]
Paper 1 investigates the importance of performing cross-validation instead of the holdout method when
using McNemar’s test. By training 10 models on 5 different datasets, the authors compare the effectiveness
of the 5×2 BCV McNemar’s test with existing methods and examine the advantages of cross-validation
over the holdout approach when training and evaluating multiple models.

Paper 2: Detecting influenza epidemics using search engine query data [2]
Paper 2 evaluates the effectiveness of using Google search query data to track and predict influenza-
like illness (ILI) trends across the United States. The authors do not perform cross-validation or any
form of model comparison.

Why paper 1 evaluates properly


Paper 1 performs 5×2 block-regularized cross-validation, which previous work has found to outperform
K-fold cross-validation when comparing classifiers with a paired t-test. The authors not only use 5×2
block-regularized cross-validation, but also apply McNemar’s test to evaluate the performance differences
between classifiers. This approach helps ensure that the comparison is robust and reliable by minimizing
variance and leveraging the stability of the 5×2 cross-validation method.
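For illustration, a minimal sketch of how the standard (non-block-regularized) McNemar’s test compares
two classifiers on a single test set is shown below; the 5×2 BCV variant in the paper repeats this idea
over cross-validation replications. The helper name mcnemar_compare and the use of statsmodels are
assumptions for illustration, not taken from the paper.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(y_true, pred_a, pred_b):
    # Build the 2x2 table of agreement/disagreement between the two classifiers
    correct_a = pred_a == y_true
    correct_b = pred_b == y_true
    table = np.array([
        [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ])
    # Chi-squared form of McNemar's test with continuity correction
    result = mcnemar(table, exact=False, correction=True)
    return result.statistic, result.pvalue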

What should have been considered in paper 2


Paper 2 should have considered integrating multiple data sources, such as clinical and epidemiological
data, to provide a more comprehensive and accurate forecast of influenza trends.

Implementing a cross-validation approach and dynamically retraining the model on real-time data
could further improve the robustness and accuracy of the predictions. This would allow the model to be
evaluated continuously and would support more reliable monitoring of influenza-like illness (ILI) trends.
Such adjustments might have prevented the model from systematically overestimating ILI prevalence in
100 out of 108 weeks starting in August 2011 [3].

An important conclusion to draw from this paper is that big data and search queries on their own
may not always provide accurate and reliable predictions. We already know that combining attributes
when training models can sometimes improve accuracy by a wide margin. Perhaps if the researchers had
included more medical data, the model could have predicted with higher accuracy.

Furthermore, this paper teaches us the importance of model regularization to avoid overfitting to
training data and to build models that generalize to new data.

Task 2: Predicting Frustration from Heart Rate Signals

Abstract
This study aims to build and evaluate predictive models to classify levels of frustration based on
heart rate (HR) features. The motivation is to support mental health monitoring and user experience
improvement. The problem addressed is the lack of accurate predictive models for frustration levels.
Our approach involves using Logistic Regression, Decision Tree, ANN, RF, and a baseline model,
trained on HR data. The results indicate no significant differences among the models, as shown by a
repeated measures ANOVA (p-value = 0.5488). The conclusion is that current methods do not achieve high accuracy,
and future work should explore additional features or more complex models.

1 Introduction
Understanding and predicting emotional states such as frustration can be crucial for various applications,
including mental health monitoring and user experience improvement. Heart rate (HR) signals are a
non-invasive measure that can provide insights into these emotional states. This study aims to develop
and evaluate five models to classify frustration levels from HR signals. The five models are a Logistic
Regression model, an ANN, a Random Forest classifier (RF), a Decision Tree, and a baseline model
(majority-class prediction). The frustration attribute is binarized, with levels above the median considered ’Frustrated’.

2 Data Preprocessing
The classification problem we are trying to solve is predicting the frustration level given a heart rate.
To be more specific, the output of the models is a binary value, either 1 or 0, which represents
the prediction of the attribute ’Frustration Binary’. A value of 1 indicates the presence of frustration
(frustration ≥ 2), while a value of 0 indicates its absence (frustration < 2). The only features included
in the models are features describing the heart rate.
Before proceeding with model training, it is essential to examine the relationships between the features
to assess their suitability for inclusion in the predictive models. This involves analyzing the correlation
matrix and the pairwise relationship scatterplots.
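A minimal sketch of this preprocessing and inspection step is given below, assuming the data is loaded
into a pandas DataFrame with a raw ’Frustrated’ column and the six HR feature columns; the file name
and column names are assumptions for illustration.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

hr_features = ["HR_Mean", "HR_Median", "HR_std", "HR_Min", "HR_Max", "HR_AUC"]

df = pd.read_csv("hr_frustration.csv")  # hypothetical file name
# Binarize the target: 1 if frustration >= 2, otherwise 0
df["Frustration_Binary"] = (df["Frustrated"] >= 2).astype(int)

# Correlation matrix (Figure 1)
corr = df[hr_features + ["Frustration_Binary"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Pairwise relationship scatterplots (Figure 2)
sns.pairplot(df[hr_features + ["Frustration_Binary"]], hue="Frustration_Binary")
plt.show()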

Figure 1: Correlation matrix of heart rate features

2.1 Correlation Matrix Analysis
The correlation matrix, shown in Figure 1, indicates the linear relationships between the heart rate
features and the target variable ’Frustration Binary’. The correlation coefficients between the features
and ’Frustration Binary’ range from -0.02 to 0.14. These low correlation values suggest that none of
the features have a strong linear relationship with the target variable, which indicates that linear models
could have a hard time predicting whether or not an individual is frustrated.

Looking at Figure 1, one could argue that since the correlation between the features ’HR Mean’ and
’HR Median’ is 0.95, including both is redundant. Additionally, ’HR Min’ has a correlation of only 0.03
with ’Frustration Binary’, suggesting that it might contribute more noise than useful information. However,
even though these features are not linearly related to ’Frustration Binary’, they can still provide
valuable information for the predictive models through non-linear relationships and interactions with
other features.

Figure 2: Pairwise relationship scatterplot of heart rate features

2.2 Pairwise Relationship Scatterplot


The pairwise relationship scatterplot (Figure 2) provides a visual inspection of the distribution and
relationships between features. The scatterplots reveal that there is no linear relationship between any
of the input features and the ’Frustration Binary’ target. Furthermore, there are no obvious outliers
in the data.

2.3 Input Features
Therefore, despite the potential redundancy and low correlation values, all six features were included in
the dataset. This decision ensures that the models have access to the full range of information captured by
the heart rate signals, which might improve their ability to accurately predict frustration levels through
complex patterns that are not immediately apparent from linear correlations alone. The input features
used for these models are therefore:

• HR Mean: The average heart rate.

• HR Median: The median heart rate.

• HR std: The standard deviation of heart rate, representing variability.

• HR Min: The minimum recorded heart rate.

• HR Max: The maximum recorded heart rate.

• HR AUC: The area under the curve of the heart rate signal over time, providing a cumulative
measure of heart rate.

3 Cross-validation
3.1 Cross-Validation: Grouped K-Fold
It is crucial to ensure that the same individuals are not present in both the training and the test data, to
avoid data leakage and to ensure an unbiased evaluation of the models. Additionally, due to the class
imbalance (94 instances of ’Frustrated’ = true, 74 instances of ’Frustrated’ = false), we have chosen a
stratified Grouped K-Fold method. This guarantees that no individual is present in both the training and
test sets, and ensures that the models are tested on every individual.
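A sketch of this setup is shown below, using scikit-learn’s StratifiedGroupKFold so that folds are grouped
by individual while roughly preserving the class balance. The variables X, y and participant_id, as well
as the model hyperparameters, are assumptions for illustration, not the exact configuration used.

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "ANN": MLPClassifier(max_iter=2000, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "Baseline": DummyClassifier(strategy="most_frequent"),
}

cv = StratifiedGroupKFold(n_splits=14)  # one fold per individual
accuracies = {name: [] for name in models}

for train_idx, test_idx in cv.split(X, y, groups=participant_id):
    for name, model in models.items():
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        accuracies[name].append(accuracy_score(y[test_idx], preds))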

3.2 Confidence Intervals


After performing Stratified Grouped 14-fold cross-validation, each model yielded 14 accuracy values.
With only 14 accuracy values, we cannot rely on the Central Limit Theorem (CLT) to assume that the
mean is normally distributed. Therefore, we applied non-parametric bootstrapping with 10,000 resamples
to calculate the 95% confidence intervals (CI).
The results are presented in Table 1. At first glance, all of the confidence intervals overlap, suggesting
that the models perform equally well. In the next section we examine whether any model is superior to
another.

Classification Accuracy
Model                  Mean Accuracy   Lower Bound   Upper Bound
Decision Tree          0.4996          0.4107        0.5893
ANN                    0.5893          0.4583        0.7202
Baseline               0.5586          0.4226        0.6964
RF                     0.5590          0.4583        0.6429
Logistic Regression    0.5417          0.4345        0.6488

Table 1: Mean accuracies and confidence intervals for each model
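A sketch of the bootstrap procedure behind these intervals is given below: the 14 per-fold accuracies are
resampled with replacement 10,000 times, and the 2.5th and 97.5th percentiles of the resampled means
form the interval. The helper name bootstrap_ci is illustrative, not from the original code.

import numpy as np

def bootstrap_ci(fold_accuracies, n_resamples=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    accs = np.asarray(fold_accuracies)
    # Mean of each bootstrap resample of the per-fold accuracies
    means = np.array([
        rng.choice(accs, size=accs.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return accs.mean(), lower, upper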

4 Model Comparison
4.1 ANOVA: Assumptions
Before conducting the ANOVA, it is essential to verify that the data meets the assumptions required for
the test.

4.1.1 Normality
To test for normality, we used the Shapiro-Wilk Test. The results for each model are summarized in
Table 2.

Model                  Shapiro-Wilk Test p-value
Decision Tree          0.5755
ANN                    0.4556
Baseline               0.3716
RF                     0.7139
Logistic Regression    0.4948

Table 2: Shapiro-Wilk Test p-values for normality assessment of model accuracies
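These p-values can be obtained with SciPy’s Shapiro-Wilk test; a minimal sketch, assuming the
accuracies dictionary from the cross-validation sketch above:

from scipy import stats

# Shapiro-Wilk test of normality for each model's 14 fold accuracies
for name, accs in accuracies.items():
    statistic, p_value = stats.shapiro(accs)
    print(f"{name}: Shapiro-Wilk p-value = {p_value:.4f}")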

Since all p-values are above 0.05, we fail to reject the null hypothesis that the accuracies are normally
distributed. This conclusion is further supported by the Q-Q plots shown in Figure 3.

Figure 3: Q-Q plots for normality assessment of model accuracies

With the normality assumption satisfied, we can proceed to test the other assumptions before conducting the ANOVA.

4.1.2 Homogeneity of Variance


To test for homogeneity of variance, we applied Levene’s Test. The result is summarized in Table 3.

Test p-value
Levene’s Test 0.1383

Table 3: Levene’s Test p-value for homogeneity of variance

Since the p-value is above 0.05, we cannot reject the null hypothesis of homogeneity of variance,
indicating that the variance of accuracies is similar across all models.
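The corresponding check can be sketched with SciPy as follows, again assuming the accuracies dictionary
from the cross-validation sketch above:

from scipy import stats

# Levene's test for equal variances across the five models' fold accuracies
statistic, p_value = stats.levene(*accuracies.values())
print(f"Levene's test p-value = {p_value:.4f}")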

4.1.3 Independence
We are comparing models that, in each fold, are trained on the same data and tested on the same data.
Therefore, one cannot assume independence between the accuracy measurements. We have therefore
chosen to use a Repeated Measures ANOVA (RM ANOVA), which accounts for the correlation between
the models being examined [4]. The RM ANOVA results are presented in the next subsection.
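A sketch of how the RM ANOVA can be computed with statsmodels is shown below, assuming the
per-fold accuracies are arranged in long format with the fold index acting as the repeated-measures
subject; the column names are illustrative.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# One row per (fold, model) pair, with the fold as the repeated-measures subject
long_df = pd.DataFrame([
    {"fold": fold, "model": name, "accuracy": acc}
    for name, accs in accuracies.items()
    for fold, acc in enumerate(accs)
])

rm_anova = AnovaRM(long_df, depvar="accuracy", subject="fold", within=["model"]).fit()
print(rm_anova)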

4.2 ANOVA Results


The ANOVA results are summarized in the table below:

Source   Sum of Squares   df    Mean Square   F Value   Pr > F
Model    0.7713           4     0.1928        0.7713    0.5488
Error    52.0000          260   0.2000
Total    52.7713          264

Table 4: RM ANOVA results

The p-value from the ANOVA test is 0.5488, which is above the significance level of 0.05. Therefore,
we fail to reject the null hypothesis that there are no significant differences in the mean accuracies of the
different models. This implies that none of the models significantly outperforms the others in predicting
frustration levels from heart rate signals.

4.3 Model Robustness & Generalization


Model robustness refers to the ability of the models to maintain performance when trained on different
subsets of data. Based on the cross-validation results and confidence intervals, all models demonstrated
similar robustness. Notably, RF had a slightly narrower CI (0.4583 to 0.6429, see Table 1), indicating
slightly more robustness. However, this difference is not substantial enough to prefer RF over the other
models. With only 14 individuals in the dataset, one cannot conclude that the results from these models
can be generalized and replicated. If, however, the dataset had included more people and one model had
significantly outperformed the baseline model, one could have concluded the opposite.

5 Conclusion
Our study aimed to develop and evaluate predictive models for classifying frustration levels based on
heart rate signals. We utilized Logistic Regression, Decision Tree, ANN, RF, and a baseline model.
The ANOVA results, with a p-value of 0.5488, indicate no significant differences in the mean accuracies
among the models. All models demonstrated similar robustness, generalization, and consistency. Given
these findings, no model significantly outperforms the others, suggesting that simpler models like the
baseline can be as effective as more complex ones for this task. Overall, this study has found that the
task of predicting frustration levels based on heart rate signals has not been accomplished with high
accuracy given the current dataset.

Future work could explore a larger dataset, or include more complex models such as ensemble models
or deep learning models, to achieve better accuracy when predicting frustration levels from heart rate
data.

References
[1] Ruibo Wang and Jihong Li. Block-regularized 5×2 cross-validated McNemar’s test for comparing two
classification algorithms. arXiv, 2015.
[2] Jeremy Ginsberg et al. Detecting influenza epidemics using search engine query data. Nature, 2009.
[3] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The parable of Google Flu: Traps
in big data analysis. dash.harvard.edu, 2018.
[4] Lutfiyya N. Muhammad. Guidelines for repeated measures statistical analysis approaches with basic
science research considerations. National Library of Medicine, 2018.
