A Food Recipe Recommendation System Based On Nutritional Factors in The Finnish Food Community
Master’s Thesis
Degree Programme in Computer Science and Engineering
June 2023
Walpitage A. C. (2023) A Food recipe recommendation system based on
nutritional factors in the Finnish food community. University of Oulu, Degree
Programme in Computer Science and Engineering. Master’s Thesis, 152 p.
ABSTRACT
This thesis presents a comprehensive study on the relationships between user
feedback, recipe content, and additional factors in the context of a recipe
recommendation system. The aim was to investigate the influence of various
factors on user ratings and comments related to nutritional variables, while also
exploring the potential for personalized recipe suggestions. Statistical analysis,
clustering techniques, and sentiment analysis were employed to analyze a dataset
of food recipes and user feedback. We determined that user feedback is a complex
phenomenon influenced by subjective factors beyond recipe content alone.
Cluster analysis identified four distinct clusters within the dataset, highlighting
variations in nutritional values and sentiment among recipes. However, due to an
imbalanced distribution within the clusters, these relationships were not
considered in the recommendation system. To address the absence of user-related
data, a content-based filtering approach was implemented, utilizing nutritional
factors and a health factor calculation. The system provides personalized recipe
recommendations based on nutritional similarity and health considerations. A
maximum limit of 20 recommended recipes was set, allowing users to specify the
desired number of recommendations. The accompanying API also provides a
mean squared error metric to assess recommendation quality. This research
contributes to a better understanding of user preferences, recipe content, and the
challenges in developing effective recommendation systems for food recipes.
ABSTRACT
TABLE OF CONTENTS
FOREWORD
ABBREVIATIONS
1 INTRODUCTION
  1.1 Research Questions
  1.2 Contribution
  1.3 Structure of Thesis
2 RELATED WORK
  2.1 Recommendation Systems
    2.1.1 Content-Based Filtering
    2.1.2 Collaborative Filtering
    2.1.3 Hybrid-Based Filtering
    2.1.4 Challenges and Limitations in Recommendation Systems
  2.2 Healthy Recommendations and food recommender system
3 IMPLEMENTATION
  3.1 Dataset
  3.2 Methodology
  3.3 Concepts
    3.3.1 Attributed Graph
    3.3.2 Health Measurement of the Foods
    3.3.3 WHO Health Factors and FSA Health Factors
    3.3.4 Clustering Methods
    3.3.5 Social Media & Social Network Analysis
4 RESULTS & DISCUSSIONS
  4.1 Step 1 – Graphical Analysis
  4.2 Step 2 – Sentiment Analysis
  4.3 Step 3 – Correlation Analysis
  4.4 Step 4 – Clustering
  4.5 Step 5 – Health-based Recommender
5 SUMMARY
6 REFERENCES
7 APPENDICES
FOREWORD
I am grateful to present this thesis as part of my Master's degree in Computer Science
and Engineering with a specialization in Applied Computing at the University of Oulu,
Finland. The focus of my research lies in the fascinating field of hybrid
recommendation systems, specifically exploring the utilization of multiple information
sources derived from food recipes within the Finnish food community. Furthermore, I
incorporated health factors based on WHO guidelines to enhance the personalized
recommendations provided by the system.
I would also like to extend my heartfelt thanks to my family members for their
unwavering support and encouragement throughout this journey.
It is my hope that this thesis contributes to the ever-evolving field of
recommendation systems, paving the way for enhanced personalized
recommendations in the domain of Finnish food recipes while considering vital health
factors.
Why are health factors important for food recipe recommender systems? Health
aspects are important in food recipe recommender systems, especially when it comes
to encouraging healthy eating habits and meeting individual dietary demands. By
including health criteria in these systems, users are guaranteed to obtain suggestions
that are tailored to their individual health objectives, preferences, and constraints.
These systems can help users' general well-being and support their attempts to
maintain a balanced and healthy diet by taking the healthiness of recommended meals
into account.
Recommendation systems can use many forms of information to propose the healthiest
eating selections. Some significant information sources that can be used are:
1.2 Contribution
Extract sentiment scores of the food recipes using Afinn and Vader:
To gain a deeper understanding of the sentiments associated with the
food recipes in our dataset, we utilized two popular sentiment analysis tools:
Afinn and Vader. These tools provided us with sentiment scores that classified
each recipe's sentiment as positive, neutral, or negative. By extracting
sentiment scores, we were able to uncover valuable insights into the overall
sentiment trends surrounding different recipes. This information played a
crucial role in refining our recommendation algorithm and classification biases
regarding recipe properties.
Clustering the foods in our dataset to find different clusters and their
properties using k-means and spectral clustering:
To discover inherent patterns and structures within the recipe dataset,
we employed clustering techniques such as k-means and spectral clustering.
These algorithms allowed us to group similar recipes together based on their
shared properties, such as energy, protein, carbohydrates, fat, saturated fat, and
other nutritional content. By identifying distinct clusters, we gained a better
understanding of the diversity within the dataset and the unique
characteristics of each cluster. This knowledge served as a foundation for
further analysis and recommendation generation.
2 RELATED WORK
their behavior, can be used to modify the amount of personalization. For example, to
enhance recommendations, the algorithm may assign greater weight to the user's
favored genres or use the user's ratings of related entities.
In recommendation systems, content-based filtering has various advantages.
Because the emphasis is on the item's substance rather than on past user statistics, it
may make suggestions even for new or unpopular products. Because the suggestions
are based on specific item qualities and attributes, it also provides straightforward
explanations of recommendations. Furthermore, content-based filtering can help with
the cold start problem, which occurs when new users have little or no user data.
However, content-based filtering has several limitations. It relies heavily
on the availability and quality of item content and attributes. The suggestions
may not be ideal if the item data is insufficient, erroneous, or lacking key attributes.
Content-based filtering is also susceptible to the "filter bubble" effect, in which
users may obtain suggestions that reinforce their existing preferences while limiting
their exposure to different information.
Various strategies and developments have been presented to improve the
efficiency of content-based filtering. Textual data may be analyzed using Natural
Language Processing (NLP) techniques to extract semantic meaning from item
descriptions or reviews. Sentiment analysis may be used to take into account user
sentiment toward products and deliver more sophisticated suggestions. Decision trees,
support vector machines, and neural networks are examples of machine learning
techniques that may be used to understand complicated patterns and increase
recommendation accuracy.
In summary, content-based filtering is an important approach in
recommendation systems that uses the content and qualities of entities to generate
tailored suggestions. Content-based filtering allows the system to propose products
that fit with the user's interests by assessing intrinsic qualities and comparing them to
the user's preferences. While it has limits, advances in data processing, machine
learning, and natural language processing (NLP) continue to improve the efficacy and
usefulness of content-based filtering in a variety of disciplines [7].
To overcome the limits of individual approaches and give more accurate and
tailored suggestions, hybrid-based filtering algorithms combine the characteristics of
numerous recommendation systems. Hybrid techniques try to harness the
complementary characteristics of each methodology and increase overall suggestion
quality by merging diverse filtering methods.
The impetus for hybrid filtering stems from the observation that no single
recommendation approach is optimal in all cases. Different approaches each have their own
set of advantages and disadvantages, and by combining them, it is possible to improve
recommendation performance, solve data sparsity difficulties, deal with the cold start
problem, and deliver more diversified and tailored suggestions.
There are numerous approaches to developing hybrid recommendation systems,
and the strategy used depends on the features of the data, the available recommendation
algorithms, and the intended recommendation goals. In this section, we will look at several
typical hybrid strategies used in recommendation systems:
Weighted Hybridization:
This strategy combines the recommendations of several methods by allocating a weight to
each method depending on its performance or relevance. Weights can be modified
statically or dynamically based on circumstances or user preferences. If
collaborative and content-based filtering methods are utilized, for example, the
suggestions from each approach can be weighted and combined to generate the
final recommendation list.
Feature Combination:
This strategy entails combining characteristics or attributes from several
recommendation systems into a single model. In a hybrid system that combines
collaborative filtering with content-based filtering, for example, the user and item
features from each technique can be integrated into a single feature representation.
The combined representation may then be used by machine learning algorithms to
Switching Hybridization:
Different recommendation algorithms are used for different contexts or user
segments in switching hybridization. Depending on the conditions or criteria, the
system alternates between techniques. Collaborative filtering, for example, may be
more successful for individuals with a rich engagement history, whereas content-based
filtering may be better suited for new users with minimal data. Based on the
user's profile or behavior, the system chooses the best approach to utilize.
Cascade Hybridization:
Cascade hybridization includes pre-filtering the item space with one
recommendation approach and then refining the suggestions with another method.
For example, as the first phase, a content-based filtering mechanism can be used
to choose a subset of entities related to the user's preferences. The improved item
set may then be utilized to produce more customized suggestions via collaborative
filtering.
Ensemble Methods:
In hybrid filtering, ensemble methods aggregate the outputs of numerous
recommendation algorithms using techniques such as voting, averaging, or
stacking. Each technique provides its own set of suggestions, which are then
aggregated by the ensemble algorithm to make the final recommendation list. The
goal of ensemble techniques is to harness the knowledge of numerous approaches
while improving the resilience and accuracy of suggestions.
Meta-level Hybridization:
In meta-level hybridization, the suggestions from several approaches are fed into
a higher-level model, which mixes and refines the outputs. A machine learning
algorithm, a rule-based system, or even a human expert can be used to create this
meta-level model. Based on historical data or user feedback, the meta-model learns
to weight or combine recommendations from multiple approaches, resulting in an
optimum recommendation.
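
As a rough illustration of two of the strategies listed above, the following Python sketch blends item scores from a collaborative and a content-based recommender by weighting, and alternatively switches between the two depending on how much interaction history a user has. The function names, the score dictionaries, the equal default weights, and the threshold of five interactions are all hypothetical choices made for this example; they are not taken from the cited literature.

```python
from typing import Dict, List


def weighted_hybrid(cf_scores: Dict[str, float],
                    cb_scores: Dict[str, float],
                    w_cf: float = 0.5,
                    w_cb: float = 0.5) -> Dict[str, float]:
    """Weighted hybridization: blend two score tables item by item.

    An item that is missing from one recommender contributes 0 from it.
    """
    items = set(cf_scores) | set(cb_scores)
    return {
        item: w_cf * cf_scores.get(item, 0.0) + w_cb * cb_scores.get(item, 0.0)
        for item in items
    }


def switching_hybrid(n_interactions: int,
                     cf_scores: Dict[str, float],
                     cb_scores: Dict[str, float],
                     min_interactions: int = 5) -> Dict[str, float]:
    """Switching hybridization: use collaborative scores only for users
    with enough history, otherwise fall back to content-based scores."""
    return cf_scores if n_interactions >= min_interactions else cb_scores


def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n best-scoring items (ties broken alphabetically)."""
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]


# Hypothetical scores for three items from the two recommenders.
cf = {"a": 0.9, "b": 0.2}
cb = {"b": 0.8, "c": 0.6}

blended = weighted_hybrid(cf, cb)
print(top_n(blended, 2))                        # weighted blend
print(top_n(switching_hybrid(1, cf, cb), 2))    # new user: content-based only
```

In practice the weights and the switching threshold would be tuned on held-out data rather than fixed by hand.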
substantially influence their success. It is also critical to ensure correct integration and
eliminate redundancy or contradictory advice.
Finally, hybrid-based filtering approaches in recommendation systems offer a
strong option for improving suggestion quality, customization, and user happiness.
Hybrid techniques provide more accurate, diversified, and context-aware suggestions
by combining the capabilities of different methodologies. As recommendation systems
advance, hybridization is anticipated to play an important role in advancing the field
and addressing the rising expectations of users across several domains [10] [5].
Dynamic and Evolving Preferences: User preferences and item qualities can
change over time. To deliver up-to-date and meaningful recommendations,
recommendation systems must adapt and capture these dynamic changes.
Incorporating techniques to deal with concept drift, user input, and contextual
changes might assist in addressing the difficulty of changing preferences.
Cold Start Problem for Items: Similar to the cold start problem for users,
recommendation algorithms confront difficulties when dealing with new
items that have little or no historical data. It is difficult to provide reliable
suggestions for these items since there is inadequate information to evaluate
suggestions that not only meet the user's nutritional needs but also cater to their own
taste preferences. This tailored approach raises the chances of consumers accepting the
recommended items and establishing healthier eating habits in the long term.
A healthy food recommender system can benefit from feedback and user
interactions to improve its performance. The system may continually learn and
enhance its suggestions by collecting users' comments on recommended foods, such
as ratings, reviews, or consumption habits. This feedback loop enables the system to
adapt to changing user preferences, increase the grasp of its dietary objectives, and
enhance the accuracy of future recommendations.
A complete healthy food recommender system might also include information
such as food quality, sourcing procedures, and sustainability issues. This broadens the
scope of the study beyond human health to include larger aspects such as
environmental impact and ethical concerns about food choices. By combining these
aspects, the recommender system may direct consumers toward environmentally and
socially responsible food selections, so encouraging not just personal health but also
environmental and social well-being.
In conclusion, a healthy food recommender system integrates nutritional
knowledge, user preferences, and dietary objectives to deliver individualized
suggestions for nutritious food options. The system intends to enable users to make
educated food decisions, adopt healthy eating habits, and enhance their overall well-
being by utilizing data analysis, machine learning algorithms, and user feedback.
Healthy food recommender systems have the potential to play a key role in
encouraging better lives and solving public health concerns connected to nutrition as
technology progresses and more data becomes accessible [12] [13] [14].
3 IMPLEMENTATION
3.1 Dataset
The dataset used in this study was obtained from the website
https://www.valio.fi/, a highly visited and prominent social media platform focused on
food, with over 25 million annual visits. The dataset comprises 5,472 recipes that were
posted on the website between 2012 and 2022. Each recipe in the dataset includes
various attributes such as the recipe name, published time, ingredients, preparation
time, difficulty level, tags, users' ratings, users' comments, and the nutritional content
per 100 grams, including energy, protein, carbohydrates, fat, saturated fat, dietary
fiber, and salt. In the preprocessing phase, 663 recipes that lacked nutritional
information were excluded, resulting in a final dataset of 4,833 recipes. This dataset
provides a comprehensive collection of recipes from the Valio Oy commercial website,
along with user ratings, comments, and detailed nutritional values. It offers a rich and
diverse source of information for evaluating and analyzing the relationships between
recipe properties and nutritional values, making it suitable for studying hybrid
recommendation systems in the food domain.
Table 1 - Basic statistics of the crawled dataset
Entity                                     Description
Number of all recipes                      5,472
Years of published recipes                 2012-2022
Recipes containing complete information    4,833
3.2 Methodology
In order to identify the key nutrition factors that exhibit significant variance
and will be used for subsequent clustering techniques, a Principal Component Analysis
(PCA) was performed on the dataset. Prior to the analysis, the data were standardized
to ensure that all features were on a consistent scale. PCA is a dimensionality reduction
technique that transforms the original set of features into a new set of uncorrelated
variables known as principal components. These components capture different
amounts of variance in the data, with each component representing a linear
combination of the original features. By analyzing the explained variance ratio for each
principal component, we can determine which nutrition factors contribute the most to
the overall variability in the dataset. The principal components with higher explained
variance ratios are indicative of nutrition factors that exhibit significant variations and
are more likely to have a substantial impact on the subsequent clustering analysis. We
considered all seven nutritional factors when identifying the principal components
that show the highest variance. Table 2 (Properties of Original dataset) shows the basic
properties of the original dataset before preprocessing.
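
The standardization and PCA step described above can be sketched as follows. This is a minimal illustration using scikit-learn, not the code used in the thesis: the feature names mirror the seven nutritional factors of the dataset, but the data matrix is a synthetic stand-in generated at random, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

FEATURES = ["energy", "protein", "carbohydrates", "fat",
            "saturated_fat", "dietary_fiber", "salt"]

# Synthetic stand-in for the recipe table (rows = recipes, columns = FEATURES).
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.0, size=(200, len(FEATURES)))
X[:, 0] *= 100.0  # energy is on a much larger scale than the other columns

# Standardize so every nutrient has zero mean and unit variance;
# otherwise the large-scale column would dominate the components.
X_std = StandardScaler().fit_transform(X)

# Keep all components so the explained variance ratios can be compared.
pca = PCA()
components = pca.fit_transform(X_std)

for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: explained variance ratio = {ratio:.3f}")

# The loadings of the first component indicate which nutrients drive it.
loadings_pc1 = dict(zip(FEATURES, pca.components_[0]))
print(loadings_pc1)
```

Components with a high explained variance ratio, together with the nutrients that load strongly on them, are the natural candidates to pass on to the clustering step.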
To determine the optimal number of clusters for the data, we utilized the
Silhouette coefficient, which is a statistical measure employed to evaluate the quality
of clustering results. This coefficient provided a quantitative assessment of how well
each data point fit within its assigned cluster, aiding in the determination of the
appropriate number of clusters.
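
A minimal sketch of choosing the number of clusters with the Silhouette coefficient is given below. It runs K-means for a range of candidate values of k and keeps the k with the highest mean silhouette score. The data are synthetic blobs with four well-separated centers, chosen only so that the example is reproducible; the candidate range 2 to 7 and all variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four well-separated groups in two dimensions.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=400, centers=centers,
                  cluster_std=0.5, random_state=0)

# Mean silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest mean silhouette is taken as the best k.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("selected k:", best_k)
```

On real recipe data the scores are usually much closer together than on this toy example, so the curve should be inspected rather than relying on the maximum alone.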
In addition to K-means clustering, we leveraged Spectral Clustering to identify
non-linear boundaries within the dataset. Spectral clustering is particularly useful
when dealing with complex data structures and enables the discovery of clusters that
exhibit non-linear relationships. By employing this technique, we aimed to uncover
intricate associations and patterns in our food recipe dataset, with a specific focus on
nutritional factors. Overall, our aim is to reveal hidden patterns using K-means
clustering and Spectral Clustering, detect non-linear relationships, and gain insights
into the nutritional aspects of the recipes.
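
To illustrate what is meant by non-linear boundaries, the sketch below applies spectral clustering and K-means to the same toy dataset of two interleaved half-moons. The dataset, the nearest-neighbour affinity, and the parameter values are assumptions chosen for the illustration and do not reproduce the configuration used on the recipe data.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: a classic case of a non-linear cluster boundary.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# Spectral clustering on a k-nearest-neighbour similarity graph.
spectral_labels = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=0,
).fit_predict(X)

# K-means on the same data, for comparison.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true moons.
print("spectral:", adjusted_rand_score(y_true, spectral_labels))
print("k-means: ", adjusted_rand_score(y_true, kmeans_labels))
```

Because K-means separates clusters with straight boundaries between centroids, it generally cannot follow the curved shape of the moons, whereas a graph-based method such as spectral clustering typically can.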
In the subsequent step, we focused on examining the correlations and
coefficients among different variables within the dataset, specifically analyzing the
relationship between nutritional values such as energy, protein, fat, and user-related
metrics like user rating, user sentiment, and the number of comments.
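
A correlation analysis of this kind can be sketched as below. The example assumes the Pearson coefficient, which is an assumption because the text does not name the coefficient used, and it works on randomly generated columns rather than the recipe data. The column names and the helper function are hypothetical.

```python
import numpy as np


def pearson_matrix(columns):
    """Return the column names and the matrix of pairwise Pearson correlations."""
    names = list(columns)
    data = np.vstack([columns[name] for name in names])  # one row per variable
    return names, np.corrcoef(data)


# Synthetic stand-in columns: fat is built to depend on energy,
# while the user rating is independent noise.
rng = np.random.default_rng(0)
energy = rng.normal(size=500)
columns = {
    "energy": energy,
    "fat": 0.8 * energy + rng.normal(scale=0.3, size=500),
    "user_rating": rng.normal(size=500),
}

names, corr = pearson_matrix(columns)
for i, row_name in enumerate(names):
    for j in range(i + 1, len(names)):
        print(f"{row_name} vs {names[j]}: r = {corr[i, j]:+.3f}")
```

A coefficient near zero, as between the synthetic energy and rating columns here, indicates no linear relationship; it does not rule out a non-linear one.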
As the available dataset did not provide sufficient user data and ratings, we
decided to develop a content-based recommendation system. To implement
this, we began by constructing a cosine similarity matrix based on the nutritional
values of the recipes. Cosine similarity is a commonly used measure to determine the
similarity between two vectors, disregarding their magnitudes. In our study, the cosine
similarity matrix allowed us to identify recipes that were most similar to each other
based on their nutritional content. This information served as the foundation for
providing recommendations to users. Furthermore, we calculated a Food Standards
Agency (FSA) factor for each recipe, considering the nutritional factors. The FSA
factor played a critical role in determining the final recommendations provided to
users. This content-based recommendation approach allowed us to provide users with
personalized recommendations by considering the similarity between nutritional
profiles and incorporating the health factor.
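
The following sketch shows one way such a content-based recommender could be put together. The cosine similarity computation follows the description above. How the FSA factor enters the final ranking is not specified in this section, so the rule used here (shortlist the most similar recipes, then order the shortlist from healthiest to least healthy) is a hypothetical choice for illustration, as are the function names and the tiny example matrix. The limit of 20 recommendations is the one stated in the abstract.

```python
import numpy as np

MAX_RECOMMENDATIONS = 20  # upper limit stated in the abstract


def cosine_similarity_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms
    return unit @ unit.T


def recommend(sim, fsa_scores, query, n=5, pool=MAX_RECOMMENDATIONS):
    """Return up to n recipe indices for the recipe at index `query`.

    Assumed rule: shortlist the `pool` most similar recipes, then sort the
    shortlist by FSA score (lower is healthier), breaking ties by similarity.
    """
    n = min(n, MAX_RECOMMENDATIONS)
    by_similarity = [int(i) for i in np.argsort(-sim[query]) if i != query]
    shortlist = by_similarity[:pool]
    shortlist.sort(key=lambda i: (fsa_scores[i], -sim[query, i]))
    return shortlist[:n]


# Hypothetical nutritional vectors (one row per recipe) and FSA scores.
X = np.array([
    [10.0, 5.0, 1.0],   # recipe 0: the query recipe
    [20.0, 10.0, 2.0],  # recipe 1: same proportions as recipe 0
    [10.0, 5.0, 2.0],   # recipe 2: very similar
    [1.0, 1.0, 10.0],   # recipe 3: very different
    [9.0, 5.0, 1.0],    # recipe 4: very similar
])
fsa = [5, 9, 3, 3, 5]

sim = cosine_similarity_matrix(X)
print(recommend(sim, fsa, query=0, n=2, pool=3))
```

Because cosine similarity ignores vector magnitude, recipe 1 is treated as identical in profile to the query even though its absolute values are twice as large; the health-based reordering is what moves it down the list.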
3.3 Concepts
In our study, we employed the FSA score, also known as the "traffic light"
system, to assess the healthiness of the recipes in our dataset. The scoring
methodology, as outlined in Table 3, takes into account the levels of three
nutrients: fats, saturated fats, and salt. These nutrients are categorized into
three color-coded levels, ranging from green (healthy) to red (unhealthy).
To calculate the FSA score for each recipe, we followed the approach of
previous studies and considered the available nutrient information, excluding sugar
content due to data limitations. Based on the nutrient content of fats, saturated fats,
and salt, we assigned scores of 1, 2, or 3 to represent the green, amber, and red
categories, respectively. The total FSA score for each recipe ranged from 3 to 9, with
a lower score indicating a healthier recipe and a higher score indicating a less healthy
recipe (Table 3).
By incorporating the FSA score into our analysis, we were able to evaluate the
healthiness of recipes and consider this factor when making recommendations to users.
This approach allowed us to prioritize healthier options and provide users with a more
informed understanding of the nutritional content of the recommended recipes.
Table 3
Dietary Factor    Low (1)         Medium (2)      High (3)
Fat               3% or less      3 - 17.5%       17.5% or more
Saturated fats    1.5% or less    1.5 - 5%        5% or more
Salt              0.3% or less    0.3 - 1.5%      1.5% or more
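
The scoring rule in Table 3 can be expressed compactly in code. The sketch below is an illustration under stated assumptions: nutrient values are taken as grams per 100 g (equivalent to the percentages in the table), values exactly on a threshold are assigned to the "or less" and "or more" categories as printed in the table, and the function and dictionary names are hypothetical.

```python
# (low, high) thresholds per nutrient, in grams per 100 g, from Table 3.
FSA_THRESHOLDS = {
    "fat": (3.0, 17.5),
    "saturated_fat": (1.5, 5.0),
    "salt": (0.3, 1.5),
}


def traffic_light_level(value: float, low: float, high: float) -> int:
    """Return 1 (green), 2 (amber) or 3 (red) for a single nutrient."""
    if value <= low:
        return 1
    if value >= high:
        return 3
    return 2


def fsa_score(nutrients: dict) -> int:
    """Sum the three traffic-light levels; the result ranges from 3 to 9."""
    return sum(
        traffic_light_level(nutrients[name], low, high)
        for name, (low, high) in FSA_THRESHOLDS.items()
    )


# Example: amber fat, green saturated fat, amber salt gives 2 + 1 + 2.
print(fsa_score({"fat": 10.0, "saturated_fat": 1.0, "salt": 0.5}))
```

A recipe that is green on all three nutrients therefore scores 3, and one that is red on all three scores 9, matching the range given above.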
metabolic functions, immunity, and general health. The FSA hopes to enhance
individuals' growth, development, and general well-being by encouraging a
balanced diet.
Portion management: The FSA emphasizes portion management as a critical
component of maintaining a healthy eating pattern. It is easy to lose sight of
portion sizes and consume more calories than the body requires in today's food-
centric culture. Overeating can result in weight gain, which raises the risk of a
variety of health problems, including obesity, diabetes, and cardiovascular
disease. The FSA hopes to empower individuals to manage their energy intake
properly, maintain a healthy weight, and lower their risk of acquiring chronic
diseases by encouraging them to be conscious of portion sizes. Paying attention
to serving sizes, listening to internal indications of hunger and fullness, and
avoiding large quantities often given in restaurants and fast-food places are all
part of practicing portion control.
Nutrient Density: Promoting the consumption of nutrient-dense foods is
another significant emphasis area for the FSA. Nutrient-dense foods have a
high concentration of vital nutrients in comparison to their calorie value. Fruits,
vegetables, whole grains, lean proteins, and low-fat dairy products are
examples of such foods. These foods are high in vitamins, minerals, and
antioxidants, all of which are essential for good health and well-being. Fruits
and vegetables are high in vitamins A, C, and E, as well as minerals such as
potassium and folate. Whole grains are high in fiber, B vitamins, and minerals
like magnesium and iron. Lean proteins, such as poultry, fish, lentils, and tofu,
are high in essential amino acids, which are required for tissue growth and
repair. Low-fat dairy products offer calcium, vitamin D, and protein. By
encouraging nutrient-dense options, the FSA hopes to guarantee that people get
the nutrients they need while also controlling their overall calorie intake.
Salt Reduction: The Food Standards Agency (FSA) acknowledges that
excessive salt consumption is closely connected to high blood pressure, a major
risk factor for cardiovascular disease. The FSA advises people to minimize
their salt intake in order to improve their cardiovascular health. This can be
accomplished by selecting low-sodium options, such as reduced-salt bread,
soups, and sauces. Individuals are also encouraged to restrict their usage of salt
when cooking and at the table. Individuals may make educated decisions and
choose goods with reduced salt levels by reviewing food labels for sodium
content. The FSA's emphasis on salt reduction seeks to enhance public health
outcomes by lowering the prevalence of high blood pressure and the burden of
cardiovascular illnesses.
Fat Quality: The FSA emphasizes the necessity of eating healthy fats while
reducing saturated and trans fats. Healthy fats, such as monounsaturated and
polyunsaturated fats, are essential for overall health, particularly heart health.
Nuts, peanuts, avocados, and fatty seafood like salmon and mackerel are good
sources of healthful fats. These fats are advantageous because they aid in the
reduction of dangerous cholesterol levels and the promotion of cardiovascular
health. Saturated fats, which are often found in fatty meats, butter, and full-fat
dairy products, on the other hand, can raise the risk of cardiovascular disease
when ingested in excess. Similarly, trans fats, which are commonly present in
processed and fried meals, are known to be harmful to heart health. The FSA
recommends that people restrict their consumption of saturated and trans fats
and instead choose healthier fat sources. The FSA strives to enhance
cardiovascular health outcomes and lower the burden of heart disease in the
population by boosting the consumption of healthy fats and minimizing the
consumption of saturated and trans fats.
Clustering rules: Specific rules regulate the process of grouping related items in
clustering algorithms. Some fundamental principles are as follows:
a. Similarity or Distance: Clustering is based on determining how similar or
different entities are. To assess the dissimilarity of data points, several distance
metrics such as Euclidean distance, Manhattan distance, or cosine similarity
are utilized.
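
For reference, the three measures named above can be computed as in the following sketch. The function names are illustrative, and cosine similarity is converted into a distance by subtracting it from one, which is a common convention rather than something stated in the text.

```python
import numpy as np


def euclidean(a, b) -> float:
    """Straight-line distance between two points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def manhattan(a, b) -> float:
    """Sum of absolute coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())


def cosine_distance(a, b) -> float:
    """One minus the cosine similarity; ignores vector magnitude."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - similarity)


print(euclidean([0, 0], [3, 4]))        # 5.0
print(manhattan([0, 0], [3, 4]))        # 7.0
print(cosine_distance([1, 0], [0, 1]))  # 1.0 (orthogonal vectors)
```

The choice of measure matters: two recipes with the same nutrient proportions but different absolute amounts are far apart in Euclidean terms yet have a cosine distance close to zero.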
Online platforms and websites that allow users to produce and share
information, communicate with others, and engage in virtual communities are referred
to as social media. It has changed the way people communicate, share information,
and interact with one another. Social media platforms have become a vital element of
modern life, giving numerous possibilities for individuals, businesses, and
organizations to communicate, cooperate, and share information.
Social network analysis (SNA) is the study of the linkages and interactions
between persons or entities in a social network. It examines the structure, patterns, and
dynamics of social interactions in order to gain insight into how information moves,
influence is exercised, and communities emerge and change.
SNA is especially important in social media because it helps scholars and
practitioners to grasp the complex networks of relationships established by online
platforms. It gives useful information on user habits, social impact, knowledge
diffusion, and community dynamics. SNA uncovers hidden structures and identifies
significant players or influential persons inside a social network by evaluating the
relationships between users, their interactions, and the patterns of information
exchange.
To summarize, social media and social network analysis have transformed the
way people and organizations connect, communicate, and exchange information. SNA
gives useful insights into the complex network architecture, social behaviors, and
information dynamics that exist on social media platforms. Researchers and
practitioners may get a deeper knowledge of social phenomena, devise successful
tactics, and make informed judgments in a variety of fields by studying user
connections and interactions [24] [25] [26].
In our analysis, we sought to identify relationships and clusters within the dataset
by generating a graphical representation using the Kamada-Kawai layout algorithm.
However, upon visual inspection of the graph, we were unable to discern any clear
clusters or distinct patterns. The graph generated using the Kamada-Kawai layout
algorithm did not provide a clear visual representation of the underlying relationships
and groupings within the data.
To further investigate and gain deeper insights into the dataset, we utilized the
Gephi tool to generate an alternative graph representation (Figure 1 - Generated Graph
Using Gephi). The graph generated through Gephi allowed for a more comprehensive
exploration of the data structure, enabling us to identify four distinct clusters with
greater clarity. These clusters represented meaningful groupings and relationships
among the data points. We used this information when applying the clustering techniques.
[Figure: two pie charts of recipe sentiment categories (Positive, Neutral, Negative); surviving labels: 29 %, 58 %, 95 %]
a lack of strong positive or negative sentiment. Finally, sentiments with a value greater
than +0.5 are categorized as Positive, representing a generally positive sentiment. By
implementing this categorization scheme, we can effectively differentiate and evaluate
sentiments based on the provided AFINN sentiment values.
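
The categorization scheme can be written as a small function. Only the positive cut-off of +0.5 is stated in the text above, so the lower cut-off is left as a required parameter; the value of -0.5 used in the example calls is an illustrative, symmetric assumption, and the function name is hypothetical.

```python
def categorize_sentiment(score: float,
                         negative_below: float,
                         positive_above: float = 0.5) -> str:
    """Map a numeric sentiment value to Positive, Neutral or Negative.

    positive_above: scores strictly greater than this are Positive (+0.5 in the text).
    negative_below: scores strictly less than this are Negative (caller-supplied).
    """
    if score > positive_above:
        return "Positive"
    if score < negative_below:
        return "Negative"
    return "Neutral"


# Example calls with an assumed lower cut-off of -0.5.
for value in (0.8, 0.5, 0.0, -2.0):
    print(value, categorize_sentiment(value, negative_below=-0.5))
```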
Graph 2 - FSA and number of commented words
Graph 3 - Preparation time and AFINN sentiment
Graph 4 - Energy and AFINN sentiment
Graph 5 - Salt and AFINN sentiment
The attached graphs show the correlation between several factors. Graph 2 shows
the correlation between the FSA factor and the number of words commented by users
for each recipe. Graph 3 shows the correlation between preparation time and the
AFINN sentiment score. Graph 4 shows the correlation between the energy level and
the AFINN sentiment score, and Graph 5 shows the correlation between the salt level
and the AFINN sentiment score. These are only a few examples; Table 6 lists all the
correlation values between each nutritional factor and both the AFINN sentiment and
the number of words commented by users for each recipe. Despite considering various
aspects of the recipes and user feedback, we did not find substantial evidence of a
significant correlation. This suggests that factors beyond the recipe content alone
contribute to user feedback, making it a complex and multifaceted phenomenon.
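A correlation value of this kind can be computed as follows. The sketch assumes Pearson's coefficient, which this passage does not name explicitly, and the salt and sentiment values are hypothetical rather than taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: salt level and AFINN sentiment score per recipe.
salt = [0.2, 0.5, 0.9, 1.4, 2.0]
afinn = [1.0, 0.4, 1.2, 0.3, 0.9]
r = pearson(salt, afinn)
print(round(r, 3))
```

A coefficient close to 0 indicates no linear relationship between the two variables, which is the pattern the text reports for the nutritional factors.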
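$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$
where:
a(i) is the mean distance between point i and the other points in its own cluster, and
b(i) is the smallest mean distance between point i and the points of any other cluster.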
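The silhouette coefficient of a single point can be computed directly from this formula. The following is a minimal sketch that assumes Euclidean distance; the points, labels, and the function name `silhouette` are hypothetical.

```python
import math


def silhouette(points, labels, i):
    """Silhouette coefficient s(i) of point i, using Euclidean distance."""
    distances = {}  # cluster label -> distances from point i to its members
    for j, (point, label) in enumerate(zip(points, labels)):
        if j != i:
            distances.setdefault(label, []).append(math.dist(points[i], point))
    own = labels[i]
    others = [d for label, d in distances.items() if label != own]
    if own not in distances or not others:
        return 0.0  # convention for singleton clusters or a single cluster
    a = sum(distances[own]) / len(distances[own])   # mean intra-cluster distance
    b = min(sum(d) / len(d) for d in others)        # nearest other cluster
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette(points, [0, 0, 1, 1], 0)
bad = silhouette(points, [1, 0, 1, 1], 0)  # point 0 assigned to the far cluster
print(round(good, 3), round(bad, 3))
```

Values near +1 indicate a well-placed point and negative values indicate a likely misassignment, so averaging s(i) over all points for several cluster counts gives a way to compare them.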
Since four clusters can be clearly identified in the attributed graph, we expanded
our analysis by applying spectral clustering, an algorithm that excels at identifying
clusters with non-linear boundaries (Figure 7). We clustered the recipes using the
principal components obtained from the PCA. The spectral clustering process
partitioned the data into four distinct clusters and revealed non-linear associations
within the dataset that had not been apparent before.
Cluster Summary
Cluster Class    Count
Cluster 1        1950
Cluster 2        1859
Cluster 4        810
Cluster 3        214
Grand Total      4833
Among the clusters of recipes analyzed, Cluster 1 stood out with the highest
energy levels, indicating a significant contrast compared to the other clusters. This
suggests that the recipes in Cluster 1 are generally more calorie-dense and potentially
offer more substantial meals. However, when considering the protein content, Cluster
1 and Cluster 2 displayed similar levels, with Cluster 1 at 6.39 and Cluster 2 at 6.63.
In contrast, Cluster 3 showed a notably lower protein level, measuring only 1.89. This
indicates that recipes in Cluster 3 may provide comparatively lower amounts of
protein.
Looking at the carbohydrate levels, Cluster 1 exhibited a substantial contrast
compared to the other clusters. The carbohydrate level in Cluster 1 was 27.61,
significantly higher than the second largest value of 9.33 found in the other clusters.
This difference suggests that recipes in Cluster 1 may contain significantly more
carbohydrates, which could make them a suitable choice for individuals seeking higher
carbohydrate intake.
In terms of fat content, Cluster 3 showed a significant contrast when
considering both fat and saturated fat levels. The recipes in Cluster 3 had the lowest
fat content, with a value of 0.5, and completely lacked saturated fat (0.0). On the other
hand, Cluster 1 displayed the highest levels of both fat and saturated fat. This suggests
that recipes in Cluster 1 may have a higher fat content and can contribute to a higher
intake of saturated fat compared to the other clusters.
Furthermore, when considering the FSA (Food Standards Agency) factor,
Cluster 1 had the highest FSA value among all the clusters, which implies a
potentially higher level of protein, carbohydrate, and fat content. In contrast,
Cluster 3 had the lowest FSA value, indicating a healthier composition in terms of
salt, sugar, and fat content (Table 8 - Average of each nutrition component).
However, a significant concern about the dataset is that Cluster 1 and Cluster 2
together account for about 78.8% of the entire dataset. This uneven distribution within
the clusters led us to decide against considering these relationships when
recommending recipes to users. Even though these findings provided valuable insights
into the dataset, we did not consider them suitable for our recipe recommendation
system due to the imbalanced cluster distribution (RQ2).
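The imbalance can be checked with simple arithmetic on the cluster counts reported in the Cluster Summary. This sketch only reproduces that calculation; the variable names are illustrative.

```python
# Cluster sizes as reported in the Cluster Summary table.
cluster_counts = {
    "Cluster 1": 1950,
    "Cluster 2": 1859,
    "Cluster 4": 810,
    "Cluster 3": 214,
}

total = sum(cluster_counts.values())
# Share of each cluster in percent of the whole dataset.
shares = {name: 100 * count / total for name, count in cluster_counts.items()}
top_two_share = shares["Cluster 1"] + shares["Cluster 2"]

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"Cluster 1 + Cluster 2: {top_two_share:.1f}%")
```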
Example Calculation –
We computed the Mean Squared Error (MSE) and similarities according to the
given formulas. The k most similar foods were identified based on the cosine
similarity, where k was set to 3. The MSE was calculated from the similarity
between food items, and the health factor was determined from the FSA factor.
The MSE and health factor of each recipe were accumulated into SumMSE and
SumHealth for further analysis. Suppose the cosine similarity matrix is the one
given in Table 9.
Cosine Similarity:
F1 F2 F3 F4 F5 F6
FSA factor 4 3 7 9 5 8
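For recipe $f_i$, find the k recipes for which the product of the recipe's health
factor and its similarity to $f_i$ is largest:
$$
[f_1, f_2, f_3, \ldots, f_k] \quad (\text{consider } k = 3)
$$
$$
\mathrm{MSE} = 1 - \frac{\sum_{j=1}^{k} \mathrm{Sim}(f_i, f_j)}{k}
$$
Health Factor:
$$
\mathrm{Health} = \sum_{j=1}^{k} \mathrm{health}(f_j)
$$
$$
\mathrm{SumMSE} = \mathrm{MSE} + \mathrm{SumMSE}
$$
$$
\mathrm{SumHealth} = \mathrm{SumHealth} + \mathrm{Health}
$$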
For example, a section of the cosine similarity matrix and the FSA factor is
provided. We showcase the computation of the content-based
recommendation using the MSE and the average health factor. The matrix below
contains sample values of the similarity between each pair of foods (F1, F2, ...).
                             F1        F2   F3         F4  F5         F6
Health factor                0.833333  1    0.333333   0   0.666667   0.166667

                             F1        F2   F3         F4  F5         F6
Similarity × Health Factor   -         0.5  0.1333332  0   0.4666669  0.1000002
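The selection of the k best recipes for F1 and the resulting MSE and Health values can be sketched as follows. This is an illustration under stated assumptions: the health factors follow the sample table as best it can be read (the F2 and F4 entries, 1 and 0, are an assumption), and the cosine similarities to F1 are hypothetical values, since the similarity matrix itself is not reproduced here.

```python
# Health factors of the candidate foods (F2 and F4 values are assumed).
health = {"F2": 1.0, "F3": 0.333333, "F4": 0.0, "F5": 0.666667, "F6": 0.166667}
# Hypothetical cosine similarities between F1 and each candidate food.
sim_to_f1 = {"F2": 0.5, "F3": 0.4, "F4": 0.2, "F5": 0.7, "F6": 0.6}


def recommend(sim, health, k=3):
    """Pick the k foods with the largest similarity x health product."""
    ranked = sorted(sim, key=lambda f: sim[f] * health[f], reverse=True)
    top_k = ranked[:k]
    # MSE = 1 - (sum of similarities of the chosen foods) / k
    mse = 1 - sum(sim[f] for f in top_k) / k
    # Health = sum of the health factors of the chosen foods
    health_sum = sum(health[f] for f in top_k)
    return top_k, mse, health_sum


top_k, mse, health_sum = recommend(sim_to_f1, health, k=3)
print(top_k, round(mse, 4), round(health_sum, 4))
```

With these inputs the ranking is driven by the product rather than by similarity alone, so a food with a high similarity but a low health factor (F6 here) is not selected.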
The mean squared error plays a crucial role in evaluating the accuracy of the
recommendation system, as it compares the system's predictions to the actual
nutritional values of the recommended recipes. By leveraging content-based filtering
and considering nutritional factors, our recommendation system aims to deliver
health-conscious recipe recommendations. The ability to customize the number of
recommendations lets users receive suggestions that align with their
preferences and dietary requirements.
Our study provides valuable insights into the correlation between nutritional
factors, user feedback, and the design of a hybrid food recommender system. While
the correlation between nutritional factors and popularity in social contribution was
not significant, we identified dominant features influencing nutritional factors and
designed a hybrid recommender system. However, challenges such as capturing user
preferences, lack of user-related data, user feedback, and addressing imbalanced
clusters need further investigation. This research contributes to the understanding of
food datasets and lays the groundwork for future advancements in food
recommendation systems.
5 SUMMARY
In this section, we present a comprehensive summary of our study, highlighting
the steps we followed and the key findings we obtained. We began by analyzing the
relationships between user feedback, recipe content, and additional factors such as
preparation time and steps. We utilized statistical and clustering techniques to explore
the dataset, uncover patterns, and identify distinct clusters. Furthermore, sentiment
analysis was conducted on user comments to gain insights into the overall sentiment.
Finally, we developed a recommendation system based on content-based filtering to
offer personalized recipe suggestions. Let's now delve into the specific details of each
step and the significant findings we derived from our investigation.
In our analysis, we initially attempted to identify relationships and clusters
within the dataset using the Kamada-Kawai layout algorithm but found no clear
patterns or clusters. To gain deeper insights, we turned to the Gephi tool, which
allowed us to generate an alternative graph representation revealing four distinct
clusters with greater clarity. These clusters represented meaningful groupings and
relationships among the data points. Moving on to sentiment analysis, we used two
libraries to analyze user comments, namely AFINN and VADER. AFINN provided
sentiment analysis based on the original language of the comments, while VADER
involved translating the comments from Finnish to English. Comparing the results, we
found significant differences between AFINN and VADER, indicating a discrepancy
in sentiment classification. Despite the differences, we relied on the AFINN sentiment
analysis, considering the original words and context of the comments. AFINN
categorized 58% of the comments as positive, 29% as neutral, and 13% as negative.
On the other hand, VADER classified 95% of the comments as neutral, 5% as positive,
and only 4 recipes as negative. We attribute these differences to the potential loss of
semantic nuances during the translation process.
In our study, we conducted a statistical analysis to explore potential
correlations between user feedback and the content of food recipes. Despite
considering various factors, including preparation time and steps, we did not find
strong associations between these variables. This suggests that user feedback is
influenced by subjective factors beyond the recipe content alone, making it a complex
phenomenon (RQ1).
To unravel the underlying structure of the dataset, we employed graph analysis
and clustering techniques. Silhouette analysis helped us determine the optimal number
of clusters, and we used the K-means and spectral clustering algorithms to partition
the data. The analysis revealed four distinct clusters, with Cluster 3 being the largest,
containing 86.5% of the dataset.
We performed a comprehensive statistical analysis on each cluster, focusing
on nutritional values, sentiment from user feedback, ratings, and comment length.
Cluster 1 exhibited the highest protein levels, Cluster 2 had the highest fat and
saturated fat levels, and most recipes showed higher levels of dietary fiber. However,
due to the imbalanced cluster distribution, we decided not to consider these
relationships in our recipe recommendation system (RQ2).
In our recommendation system, we have implemented a content-based filtering
approach since we lack user-related data. The system utilizes nutritional factors such
as Energy, Protein, Carbohydrate, Fat, Saturated Fat, Dietary Fiber, and Salt from the
recipes to calculate the cosine similarity between each recipe. This enables us to
determine the similarity between recipes based on their nutritional content.
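The similarity step can be sketched as follows. The seven nutritional factors are those named in the text; the recipe names and values are hypothetical, and the helper `most_similar` is an illustration rather than the thesis implementation.

```python
import math

# Order of the features in each vector, as listed in the text.
NUTRIENTS = ("energy", "protein", "carbohydrate", "fat",
             "saturated_fat", "dietary_fiber", "salt")


def cosine_similarity(a, b):
    """Cosine of the angle between two nutrient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_a * norm_b)


# Hypothetical nutrient vectors, one value per entry in NUTRIENTS.
recipes = {
    "oat_porridge": (70.0, 2.5, 12.0, 1.4, 0.3, 1.7, 0.1),
    "rye_bread": (220.0, 7.0, 42.0, 1.8, 0.3, 11.0, 1.0),
    "salmon_soup": (85.0, 6.0, 5.0, 4.5, 2.0, 0.6, 0.7),
}


def most_similar(name, recipes, n=3):
    """Return up to n (recipe, similarity) pairs, most similar first."""
    target = recipes[name]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in recipes.items() if other != name]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


ranking = most_similar("oat_porridge", recipes)
print(ranking)
```

On raw values the energy component is much larger than the others and therefore dominates the cosine; whether the features are scaled before the similarity is computed is not stated in this passage.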
6 REFERENCES
[1] Wikipedia, "Recommender system," [Online]. Available:
https://en.wikipedia.org/wiki/Recommender_system.
[2] Hadash, G., Shalom, O. S., & Osadchy, R., "Rank and rate. In Proceedings of
the 12th ACM Conference on Recommender Systems," in ACM, 2018.
[3] A. Tugend, "Too many choices: A problem that can paralyze," The New York
Times, p. 26, 2010.
[4] Lops, P., de Gemmis, M., & Semeraro, G., "Content-based recommender
systems: State of the art and trends," in Recommender Systems
Handbook, Springer, 2011, pp. 73-105.
[5] Adomavicius, G., & Tuzhilin, A., "Toward the next generation of recommender
systems: A survey of the state-of-the-art and possible extensions,"
IEEE Transactions on Knowledge and Data Engineering, 2005, pp.
734-749.
[6] B.O. Mbah, P.E. Eme and O.F. Ogbusu, "Effect of Cooking Methods (Boiling
and Roasting) on Nutrients and Anti-nutrients Content of Moringa
oleifera Seeds," in Pakistan Journal of Nutrition, Nsukka, 2012.
[7] Donghui Wanga, Yanchun Lianga, Dong Xua, Xiaoyue Fenga, Renchu Guan,
"A content-based recommender system for computer science
publications," in ELSEVIER, 2018.
[8] Christopher R. Aberger, "Recommender: An Analysis of Collaborative
Filtering".
[9] J. Ben Schafer, Dan Frankowski, Jon Herlocker, Shilad Sen , "Collaborative
Filtering Recommender Systems," in Springer.
[10] R. Burke, "Hybrid Recommender Systems: Survey and Experiments. User
Modeling and User-Adapted Interaction," 2002.
[11] Balraj Kumar, Neeraj Sharma, "Approaches, Issues and Challenges in
Recommender Systems: A Systematic Review," 2016.
[12] Katherine Harris‐Lagoudakis, "Online shopping and the healthfulness of grocery
purchases," Ames, Iowa.
[13] RACIEL YERA, AHMAD A. ALZAHRANI, LUIS MARTíNEZ, "A Food
Recommender System Considering Nutritional Information and User
Preferences," in IEEE.
[14] Raciel Yera Toledo, Ahmad A. Alzahrani, Luis Martínez, "A Food
Recommender System Considering Nutritional Information and User
Preferences," in IEEE, 2019.
[15] Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Yu, P. S., "A Comprehensive
Survey on Graph Neural Networks," in IEEE, 2019.
[16] Lizi Liao, Xiangnan He, Hanwang Zhang, Tat-Seng Chua, "Attributed Social
Network Embedding," in TKDE, 2018.
[17] Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K. I., & Jegelka, S.,
"Representation learning on graphs with jumping knowledge
networks," in Proceedings of the 32nd Conference on Neural
Information Processing Systems (NeurIPS), 2018.
https://noduslabs.com/featured/sentiment-analysis-afinn-bert-ai-
twitter-amazon/.
[39] Z. Zhang, "Text Mining for Social and Behavioral Research," 2018.
[40] Shefali Singh, Tureen Chauhan, Priyanka Meel, "Mining Tourists’ Opinions on
Popular Indian Tourism Hotspots using Sentiment Analysis and Topic
Modeling," 2021.
7 APPENDICES
Appendix