Unit 4 - MLMM
Recommender systems are systems designed to recommend items to a user based on various factors. They take as input a user model (ratings, preferences, demographics, etc.) and items with their descriptions, compute a relevance score that is used for ranking, and finally recommend the items most relevant to the user.
Recommendations are commonly seen in e-commerce systems, connection suggestions on LinkedIn, friend recommendations on Facebook, song recommendations at FM, news recommendations on Forbes.com, and so on. Companies such as Netflix and Amazon use recommender systems to help their users identify the right products or movies for them.
A recommender system deals with a large volume of information by filtering out the most important items based on the data provided by the user and other factors that capture the user's preferences and interests. It finds the match between users and items and imputes similarities between users and items for recommendation.
B. Better product search experience: helps to categorize products based on their features, e.g., material, season, etc.
Recommendation Algorithms Design
1. Define the problem
The first step is to define the problem we want to solve with our recommendation algorithm. What is the purpose of our recommendation? Who are our target users? What are their needs and expectations? What kind of data do we have access to? How will we measure the performance and impact of our recommendation? These questions will help us narrow down the scope and objectives of our problem and set the criteria for evaluating our solution.
2. Choose the approach
The next step is to choose the approach that best suits our problem and data. There are three
main types of recommendation algorithms: content-based, collaborative filtering, and hybrid.
Content-based algorithms recommend items that are similar to the ones the user has liked or
interacted with before, based on the features or attributes of the items. Collaborative filtering
algorithms recommend items that are popular or liked by other users who have similar
preferences or behavior to the user, based on the ratings or feedback of the users. Hybrid
algorithms combine both content-based and collaborative filtering methods to leverage the
strengths and overcome the limitations of each one.
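As a small illustration of the content-based idea, the sketch below scores items by how similar their text descriptions are to items the user already liked. The item descriptions, the TF-IDF representation, and the scikit-learn calls are assumptions for illustration, not a prescribed implementation.

# Minimal content-based sketch: recommend items whose descriptions are
# similar to items the user has already liked (illustrative data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "i1": "wireless bluetooth earphones with noise cancellation",
    "i2": "running shoes lightweight breathable sports",
    "i3": "bluetooth over-ear headphones long battery life",
    "i4": "cotton sports socks pack of three",
}
liked = ["i1"]  # items the user interacted with before

ids = list(items)
tfidf = TfidfVectorizer().fit_transform(items.values())   # item feature vectors
sim = cosine_similarity(tfidf)                            # item-item similarity

# Score each unseen item by its maximum similarity to any liked item.
scores = {
    i: max(sim[ids.index(i)][ids.index(l)] for l in liked)
    for i in ids if i not in liked
}
print(sorted(scores.items(), key=lambda kv: -kv[1]))      # ranked recommendations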
3. Implement the algorithm
The third step is to implement the algorithm using the appropriate tools and techniques. Depending on the complexity and scale of our problem, we may need to use different programming languages, libraries, frameworks, or platforms to build and deploy our algorithm. We may also need to apply various methods and algorithms to preprocess, analyze, and model our data, such as data cleaning, feature extraction, dimensionality reduction, clustering, classification, regression, or neural networks. We should also follow the best practices and standards of coding, testing, debugging, and documenting our algorithm.
4. Evaluate the algorithm
The fourth step is to evaluate the algorithm using the metrics and criteria we defined in the first step. We should test our algorithm on different datasets, such as training, validation, and test sets, to measure its accuracy, precision, recall, coverage, diversity, novelty, serendipity, or relevance. We should also compare our algorithm with other existing or baseline algorithms to benchmark its performance and identify its strengths and weaknesses. We should also collect and analyze the feedback and behavior of our users to assess user satisfaction and engagement with our algorithm.
5. Optimize the algorithm
The fifth step is to optimize the algorithm based on the results and insights we obtained from the previous step. We should identify and address the issues and challenges that affect our algorithm's performance and quality, such as data sparsity, cold start, scalability, or privacy. We should also experiment with different parameters, features, methods, or models to improve our algorithm's efficiency, effectiveness, or robustness. We should also update and refine our algorithm based on the changing needs and preferences of our users and the evolving trends and patterns of our data.
6. Deploy and monitor the algorithm
The final step is to deploy and monitor the algorithm in the real-world setting. We should
integrate our algorithm with our online platform and ensure its compatibility and
functionality with the existing systems and components. We should also monitor our
algorithm's performance and impact over time and track its key indicators and metrics. We
should also maintain and troubleshoot our algorithm and fix any errors or bugs that may arise.
We should also keep learning and improving our algorithm based on the new data and
feedback we collect and the new challenges and opportunities we encounter.
A Model for Recommendation System
Collaborative Filtering: This technique filters out items that a user might like on the basis of the ratings or reactions of similar users. It uses community data from peer groups for recommendations, surfacing the things that are popular among a user's peers.
Collaborative filtering systems recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users (community data). E.g., when we shop on Amazon, it recommends new products with messages like "Customers who bought this item also bought ...".
It is based on the notion of users' similarity. E.g., consider a picture with 3 children named A, B, and C, and 4 fruits: grapes, strawberry, watermelon, and orange. Based on the image, let us assume A purchased all 4 fruits, B purchased only strawberry, and C purchased strawberry as well as watermelon. Here A and C are similar kinds of users; because of this, C will be recommended grapes and orange, as shown by the dotted line.
Steps for User-Based Collaborative Filtering:
Step 1: Find the similarity of each user to the target user U. Using Pearson's correlation in peer-based (user-based) collaborative filtering, the similarity of any two users a and b turns out to be:

sim(a, b) = Σp∈P (ra,p − r̄a)(rb,p − r̄b) / ( √Σp∈P (ra,p − r̄a)² · √Σp∈P (rb,p − r̄b)² )

where "a" and "b" are users; ra,p is the rating of user "a" for item "p"; "P" is the set of items rated by both users; and r̄a, r̄b are the average ratings of users "a" and "b".
Example: Consider a matrix that shows four users Alice, U1, U2 and U3 rating on different
news apps. The rating range is from 1 to 5 on the basis of users’ likability of the news app.
The ‘?’ indicates that the user has not rated the app.
Calculating the similarity between Alice and all the other users: first we calculate the average rating of each user, excluding item I5 as it is not rated by Alice. We therefore calculate the averages as:
If we assume a threshold rating of 3.5, we recommend I5 to Alice, as the predicted rating is 3.83. But a problem arises in the cold-start case: how do we recommend new items, and what do we recommend to new users? In the initial phase, the user can be asked to rate a set of items; otherwise, either demographic data or non-personalized data can be used.
The Pearson correlation formula yields a value between -1 and 1, where 1 indicates perfect similarity, 0 indicates no similarity, and -1 indicates perfect dissimilarity.
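A minimal sketch of the computation described above. Since the original ratings table is not reproduced here, the numbers below are hypothetical placeholders: the code computes Pearson similarity between Alice and each other user over co-rated items, then predicts Alice's rating for the unrated item as her average plus the similarity-weighted deviation of her neighbours' ratings.

# User-based collaborative filtering sketch (hypothetical ratings, 0 = not rated).
import numpy as np

users = ["Alice", "U1", "U2", "U3"]
# Rows: users; columns: items I1..I5; Alice has not rated I5.
R = np.array([
    [5, 3, 4, 4, 0],
    [3, 1, 2, 3, 3],
    [4, 3, 4, 3, 5],
    [3, 3, 1, 5, 4],
], dtype=float)

def pearson(a, b):
    # Pearson correlation over items rated by both users.
    both = (a > 0) & (b > 0)
    if both.sum() < 2:
        return 0.0
    da, db = a[both] - a[both].mean(), b[both] - b[both].mean()
    denom = np.sqrt((da**2).sum()) * np.sqrt((db**2).sum())
    return float((da * db).sum() / denom) if denom else 0.0

target, item = 0, 4  # predict Alice's rating for I5
neigh = [u for u in range(len(users)) if u != target]
sims = [pearson(R[target], R[u]) for u in neigh]

alice_mean = R[target][R[target] > 0].mean()
num = sum(s * (R[u, item] - R[u][R[u] > 0].mean()) for s, u in zip(sims, neigh))
den = sum(abs(s) for s in sims)
prediction = alice_mean + num / den if den else alice_mean
print(round(prediction, 2))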
2. Item-based collaborative filtering: executing the recommendation. It uses the items (already rated by the user) that are most similar to the missing item to generate a rating.
E.g., given below is a table that contains some items and the users who have rated those items. The ratings are explicit and on a scale of 1 to 5. Each entry in the table denotes the rating given by the i-th user to the j-th item. We need to find the missing ratings for the respective users.
Sim(Item1, Item2):
In the table, we can see that only User_2 and User_3 have rated both Item 1 and Item 2.
I2 = 2·U2 + 3·U3
Sim(Item2, Item3):
In the table, we can see that only User_3 and User_4 have rated both Item 2 and Item 3.
I3 = 1·U3 + 2·U4
Sim(Item1, Item3):
In the table, we can see that only User_1 and User_3 have rated both Item 1 and Item 3.
I3 = 3·U1 + 1·U3
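A minimal sketch of the item-item similarity computation: for each pair of items, take only the users who rated both, and compute cosine similarity of the two co-rated rating vectors. The table values below are hypothetical placeholders, since the original table is not reproduced here.

# Item-based similarity over co-rated users (hypothetical ratings, 0 = not rated).
import numpy as np

# Rows: User_1..User_4; columns: Item1..Item3.
R = np.array([
    [3, 0, 3],
    [4, 2, 0],
    [1, 3, 1],
    [0, 4, 2],
], dtype=float)

def item_sim(i, j):
    # Cosine similarity between items i and j using only co-rating users.
    both = (R[:, i] > 0) & (R[:, j] > 0)
    a, b = R[both, i], R[both, j]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

print("Sim(Item1, Item2) =", round(item_sim(0, 1), 3))
print("Sim(Item2, Item3) =", round(item_sim(1, 2), 3))
print("Sim(Item1, Item3) =", round(item_sim(0, 2), 3))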
Rule-based System
A rule-based recommender relies on manually defined if-then rules (for example, "if a user buys a phone, recommend a phone case") rather than learning from data. Such systems have several limitations:
Rule-based systems lack the ability to learn from experience, restricting their capacity to adapt and improve over time.
They may struggle with uncertain or ambiguous information, leading to potential inaccuracies in decision-making.
Managing a large number of rules can become complex, posing challenges in organization.
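For contrast with the learned approaches above, here is a minimal sketch of a rule-based recommender: a hand-written mapping from a purchased item to follow-up suggestions. The rules and item names are made up for illustration, and the last line shows the main limitation, that rules cannot generalize to items nobody wrote a rule for.

# Rule-based recommendation sketch: hand-authored if-then rules, no learning.
RULES = {
    "phone": ["phone case", "screen protector"],
    "flight ticket": ["taxi booking", "hotel room"],
    "sports shoes": ["socks"],
}

def recommend(purchased_item):
    # Return the rule's consequents, or nothing if no rule matches.
    return RULES.get(purchased_item, [])

print(recommend("sports shoes"))   # ['socks']
print(recommend("laptop"))         # [] - no rule, so no recommendation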
Sequential Recommendation Systems
Sequential recommendation systems try to understand user input over time and model it in sequential order. The user's interactions are essentially sequence-dependent: if a person books a flight, they also book a taxi to the destination and then book a room, and this information is stored as a sequence. If another person books a flight and a taxi, the system will then recommend a hotel or room booking. Both users' preferences and item popularity are dynamic; they change over time.
For instance, more people are buying wireless earphones, and some people tend to buy a new phone every year. Through such trends, the popularity of phones, earphones, and other products changes. Such dynamic profiling is of great significance for precisely profiling a user or an item for more accurate recommendations, and it can only be captured by sequential recommendation systems.
In a world where everything is going digital, a user's sequential behavior is a wealth of information. A sequential recommendation system can be used in e-commerce to model a person's historical behavior and predict continuous changes in preference.
How Does it Work?
Generally, an SRS takes a sequence of user-item interactions as the input and tries to predict
the subsequent user-item interactions that may happen in the near future through modelling
the complex sequential dependencies embedded in the sequence of user-item interactions.
More specifically, given a sequence of user-item interactions, a recommendation list consisting of top-ranked candidate items is generated by maximizing a utility function value (e.g., the likelihood):

R = arg max f(S)

where f is a utility function that outputs a ranking score for the candidate items, and it could take various forms, such as a conditional probability [Wang et al., 2018] or an interaction score [Huang et al., 2018]. S = {i1, i2, ..., i|S|} is a sequence of user-item interactions where each interaction ij = <u, a, v> is a triple consisting of a user u, the user's action a, and the
corresponding item v. In some cases, users and items are associated with some meta data
(e.g., the demographics or the features), while the actions may have different types (e.g.,
click, add to the cart, purchase) and happen under various contexts (e.g., the time, location,
weather). The output R is a list of items ordered by the ranking score.
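One very simple instantiation of the utility function f is a first-order (Markov-style) model that scores candidate items by how often they follow the user's most recent item in historical sequences. This is an illustrative sketch only: the interaction sequences below are made up, and real sequential recommenders typically use RNNs, attention, or other sequence models.

# Sketch of a first-order sequential recommender: score candidates by
# how often they follow the user's last interacted item in past sequences.
from collections import Counter, defaultdict

histories = [
    ["flight", "taxi", "hotel"],
    ["flight", "taxi", "hotel", "restaurant"],
    ["flight", "hotel"],
]

# Count transitions item -> next item across all sequences.
transitions = defaultdict(Counter)
for seq in histories:
    for prev, nxt in zip(seq, seq[1:]):
        transitions[prev][nxt] += 1

def recommend_next(user_sequence, top_k=2):
    last = user_sequence[-1]
    counts = transitions[last]
    total = sum(counts.values())
    # f outputs a ranking score (here: empirical transition probability).
    ranked = sorted(counts, key=counts.get, reverse=True)
    return [(item, counts[item] / total) for item in ranked[:top_k]]

print(recommend_next(["flight", "taxi"]))  # e.g. [('hotel', 1.0)]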
Data Characteristics and Challenges
As the behaviour of customers, for example, shopping is diverse and complex in a real-world
scenario, different data input characteristics from customers will bring different challenges.
Some of them are listed in the below table.
Sequential recommendation systems add more value to many industries, and the system itself is very user-friendly, as it recommends what the next action should be.
It is used in the e-commerce industry to recommend the next relevant product to customers.
For example, if person A buys sports shoes, then the system will recommend socks. This will
help the company to generate more revenue as the recommended product is relevant. It is
used in the tourism/travel industry as well. While booking a ticket for a holiday the system
will recommend the next step like booking top-rated hotels in the area.
SVD for Recommender Systems
SVD is Singular Vector Decomposition. It decomposes a matrix into constituent arrays of
feature vectors corresponding to each row and each column. The SVD technique was
introduced into the recommendation system domain by Brandyn Webb, much more famously
known as Simon Funk during the Netflix Prize challenge.
Just as a number such as 24 can be decomposed into factors, 24 = 2 × 3 × 4, a matrix can also be expressed as a multiplication of other matrices. Because matrices are arrays of numbers, they have their own rules of multiplication; consequently, they have different ways of factorization, also known as decomposition. One example is singular value decomposition, which places no restriction on the shape or properties of the matrix to be decomposed.
A = U Σ V^T
where, for an m×n matrix A:
U is an m×m orthogonal matrix.
Σ is an m×n diagonal matrix containing the singular values of A.
V is an n×n orthogonal matrix.
Vectors are orthogonal to each other if the dot product of any two of them is zero. A vector is a unit vector if its length is 1. An orthonormal matrix has the property that its transpose is its inverse. In other words, since U is an orthonormal matrix, U^T = U^-1, or U·U^T = U^T·U = I, where I is the identity matrix.
Implementation:
The SVD can be calculated by calling the svd() function. The function takes a matrix and
returns the U, Sigma and V^T elements. The Sigma diagonal matrix is returned as a vector of
singular values. The V matrix is returned in a transposed form.
The example below defines a 3×2 matrix and calculates the Singular-value decomposition.
# Singular-value decomposition
from numpy import array
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# SVD
U, s, VT = svd(A)
print(U)
print(s)
print(VT)
Running the example first prints the defined 3×2 matrix, then the 3×3 U matrix, 2 element
Sigma vector, and 2×2 V^T matrix elements calculated from the decomposition.
[[1 2]
 [3 4]
 [5 6]]
[[-0.2298477   0.88346102  0.40824829]
 [-0.52474482  0.24078249 -0.81649658]
 [-0.81964194 -0.40189603  0.40824829]]
[ 9.52551809  0.51430058]
[[-0.61962948 -0.78489445]
 [-0.78489445  0.61962948]]
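To confirm the factorization, the factors returned by svd() can be recombined to reproduce the original matrix. Since Sigma is returned as a vector, it first has to be placed on the diagonal of an m×n zero matrix. This reconstruction step is a common companion to the example above rather than something the excerpt itself shows.

# Reconstruct the original matrix from U, s, and VT.
from numpy import array, diag, zeros
from scipy.linalg import svd

A = array([[1, 2], [3, 4], [5, 6]])
U, s, VT = svd(A)

# Build the m x n Sigma matrix with the singular values on its diagonal.
Sigma = zeros((A.shape[0], A.shape[1]))
Sigma[:A.shape[1], :A.shape[1]] = diag(s)

B = U.dot(Sigma).dot(VT)
print(B)  # reproduces A (up to floating-point error)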
Incremental SVD
Incremental SVD is a method to compute the SVD of a matrix by iteratively adding columns (or rows) and updating the decomposition as new columns (or rows) are added. This approach can be useful when dealing with large matrices where computing the SVD directly may be computationally expensive or memory-intensive.
The incremental approach is particularly useful in scenarios where the matrix is too large to
fit into memory entirely, or when new data arrives over time, and we need to update the SVD
accordingly without recomputing it from scratch each time.
It's important to note that the incremental approach may introduce some approximation errors
compared to computing the SVD directly, but it offers a trade-off between accuracy and
computational efficiency in certain situations.
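As a practical illustration, scikit-learn ships an incremental decomposition in this spirit: IncrementalPCA fits a truncated decomposition of centered data by repeatedly calling partial_fit on chunks, so the full matrix never has to be in memory at once. This is offered as an assumed, related tool (it performs PCA rather than a raw incremental SVD), and the data below is a random placeholder.

# Incremental decomposition sketch: feed the data in chunks with partial_fit.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))        # pretend this is too large to decompose at once

ipca = IncrementalPCA(n_components=5)
for chunk in np.array_split(X, 10):      # process 10 chunks of rows
    ipca.partial_fit(chunk)              # update the decomposition incrementally

X_reduced = ipca.transform(X[:3])        # project new rows onto the learned components
print(X_reduced.shape)                   # (3, 5)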
Ex: 6 = 2 × 3, where 2 and 3 are treated as factors of 6, and we can generate 6 again by taking the product of these factors.
There are many different matrix decomposition techniques, each finds use among a particular
class of problems.
LU decomposition factorizes a matrix into a lower triangular and an upper triangular matrix.
QR decomposition decomposes a matrix A into a product A = QR of an orthogonal matrix Q and an upper triangular matrix R.
Non-negative matrix factorization (NMF) decomposes a matrix A into two matrices that have no negative elements.
PCA is a transform that uses eigendecomposition to obtain the transform matrix.
Singular Value Decomposition (SVD) factorizes any matrix of any dimension into three parts, U S V'.
Matrix factorization separates a matrix into two other matrices that are typically much easier to work with than the original matrix. This not only makes the problem easier to solve, but it also reduces the amount of time required by a computer to calculate the answer. Matrix decomposition is mostly used to solve linear systems easily and quickly. Matrix factorization also reduces a computer's storage space for matrices: instead of storing the large non-factorized matrix A, we can use less storage for its factors (i.e., B and C).
c) Applications
5. Eigenfaces.
Overview
With the growth of the internet and faster computers, large amounts of data are being stored
by companies all over the world. This has resulted in the rise of a subfield within computer
science called big data, which focuses on the problems of storing and analyzing vast troves of
data. Collaborative filtering is a way of extracting useful information from this data, in a
general process called information filtering. The algorithm compares a user with other similar
users (in terms of preferences) and recommends a specific product or action based on these
similarities.
1. A user expresses their preferences by rating items in the system.
2. The algorithm finds other users with similar tastes to the given user.
3. The system recommends items that the user has not yet rated (thus, likely being new
to the user) and that are highly rated by users similar to the given user.
A caveat in this process includes active participation by the users of the service. In order to be
recommended an item, the user must have liked or disliked items in the past; otherwise, the
system will not provide good recommendations.
Differences in Implementation
There are two main ways of implementing collaborative filtering - a memory-based approach
and a model-based approach.
The memory-based approach calculates the similarity s(i, j) between users i and j in the following manner:

s(i, j) = Simil(p_i, p_j)

Simil is a user-defined similarity function that takes in two users' preferences (p_i and p_j) and determines how similar the preferences are to each other (higher means more similar). For example, a simple but effective similarity function is one that returns 1 if the preferences are the same, and returns 0 if they are not.
The similarity calculation as a whole acts as a variant of the k-nearest neighbors algorithm by
determining which users are the closest neighbors to the given user by preferences. The
benefits of the memory-based approach are that it is easy to implement and very effective.
However, some issues include data sparsity, which occurs when there is not enough data to
make good recommendations, and scalability, since with a large dataset more computation is
required to determine similarity.
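A minimal sketch of the memory-based idea using the simple "agreement" similarity mentioned above: each pair of users is scored by the fraction of co-rated items on which their preferences are identical, and the k most similar users become the neighbours used for recommendation. The preference data is made up for illustration.

# Memory-based similarity sketch: fraction of co-rated items with identical preference.
prefs = {
    "alice": {"item1": "like", "item2": "dislike", "item3": "like"},
    "bob":   {"item1": "like", "item2": "dislike"},
    "carol": {"item1": "dislike", "item3": "like"},
}

def simil(p_i, p_j):
    # Returns 1 for matching preferences, 0 otherwise, averaged over shared items.
    shared = p_i.keys() & p_j.keys()
    if not shared:
        return 0.0
    return sum(p_i[k] == p_j[k] for k in shared) / len(shared)

def nearest_neighbors(user, k=1):
    others = [(o, simil(prefs[user], prefs[o])) for o in prefs if o != user]
    return sorted(others, key=lambda x: -x[1])[:k]

print(nearest_neighbors("alice", k=2))  # [('bob', 1.0), ('carol', 0.5)]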
The model-based approach uses machine learning techniques to find patterns in the dataset.
For example, neural networks can be used to find trends among item preferences. Advantages of this approach include easy scalability, but building the model (e.g., training the neural network) can be expensive. Many collaborative filtering systems use a hybrid approach, which is a
combination of the memory-based and model-based approaches. Though such systems are
expensive and complex to implement, they overcome the shortcomings of each of the above
approaches.
Challenges
There are various known challenges regarding collaborative filtering. The following are some
notable problems.
Lack of data - In order to start recommending, the system must first obtain a sufficient
number of a user's past preferences. This delays the usefulness of the recommendation
system until this number is met. Additionally, new products must be rated by a
sufficient number of people before being recommended. These issues cause a delay in
the usefulness of collaborative filtering to users.
Scalability - With more and more people and more and more preferences, the
collaborative filtering algorithm becomes computationally intensive and can take a lot
of time to return a result. There are many ways to combat this issue, from using
subsets of data to returning a result after a certain similarity threshold is reached.
Synonyms - Collaborative filtering may not be able to distinguish two products that
are the same (for example, different names for the same item on Amazon or a song
and a fan cover of the same song). In this instance, the algorithm may recommend the
same product unknowingly and will unnecessarily perform extra computation in
processing that item rather than avoiding it.
Shilling attacks - Users can rate their own products highly and rate competitors'
products poorly - this can cause the filtering algorithm to prefer one product over its
competitors' - a huge problem for services that guarantee fairness to their users.
Additionally, recommendations within a community will prefer one point of view
over another. Examples of this include Reddit, Facebook, and Buzzfeed - where the
top-rated links will be biased towards the community's preferences.
Preferring old products - Because old products have more data associated with them
over new products, the algorithm might recommend these old products, rather than the
new products users are looking for. This defeats the purpose for some applications,
where the user might use the service to find new content.
Gray sheep - Some users have preferences that do not consistently agree with any
group of people (gray sheep). These users do not find collaborative filtering
particularly helpful when determining their wants.
Implementation
Introduction
Have you ever wondered how YouTube can pick up on your brand-new interest
immediately after you watched one video about it? Or how Amazon can recommend products
based on what you currently have in your shopping cart? The answer is online
recommendation systems (or recommender systems or recsys). These systems can
generate recommendations for users based on real-time contextual information, such as the
latest item catalog, user behaviors, and session context. While online recommendation
systems can vary substantially, they are all generally composed of the same high-level
components. Those are:
Candidate generation
Feature retrieval
Filtering
Model inference
Listwise ranking
Candidate generation
Candidate generation is responsible for quickly and cheaply narrowing the set of possible
candidates down to a small enough set that can be ranked. For example, narrowing a billion
possible YouTube videos down to ~hundreds based on a user’s subscriptions and recent
interests. A good candidate generation system produces a relatively small number of diverse,
high-quality candidates for the rest of the system.
There are many different options that can make effective candidate generators:
Query your existing operational database
o E.g., use Redis sorted sets to track popular items in a given locale, or
o run a daily batch job to generate a list of 500 candidates for every user and
load those candidates into DynamoDB.
Vector/Embedding similarity-based approaches
o Many different ML approaches that can learn “embeddings” for users and
items. These approaches can be combined with vector search technologies
like Faiss, Annoy, Milvus, and Elasticsearch to scale to online serving.
Every approach will have its pros and cons with respect to complexity, freshness, candidate
diversity, candidate quality, and cost. A good practice is to use a combination of approaches
and then “union” the results in the online system. For example, collaborative filtering models
can be used to generate high-quality candidates for learned users and items, but collaborative
filtering models suffer from the cold-start problem. Your system could supplement a
collaborative filtering model with recently added, popular items fetched from an operational
database.
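A minimal sketch of the embedding-similarity option, assuming the faiss library is installed and that user and item embeddings have already been learned by some upstream model. The embeddings here are random placeholders purely to show the indexing and lookup pattern.

# Candidate generation via embedding similarity with Faiss (illustrative only).
import numpy as np
import faiss

d = 64                                                           # embedding dimension
item_embeddings = np.random.rand(100_000, d).astype("float32")   # placeholder item vectors
user_embedding = np.random.rand(1, d).astype("float32")          # placeholder user vector

index = faiss.IndexFlatIP(d)             # exact inner-product search
index.add(item_embeddings)               # index the item catalog

k = 200                                  # narrow the catalog to ~hundreds of candidates
scores, candidate_ids = index.search(user_embedding, k)
print(candidate_ids[0][:10])             # top candidate item ids for this user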
Feature retrieval
At one or more points, the recommendation system will need to look up or compute
data/features for the user and the candidates being considered. This data will fall into three
categories:
Item features
o E.g., attributes of the candidate items, such as a product's category or its average review rating.
User features
o E.g., user_favorite_product_categories=[clothing, sporting_goods] or user_site_spend_past_month=408.10
User-Item cross features
o E.g., user_has_bought_this_product_before=False or product_is_in_user_favorite_categories=True
o Cross features like these can be very helpful with model performance without
requiring the model to spend capacity to learn these relationships.
Your system does not necessarily need to use a purpose-built "feature store" for feature retrieval, but the data service needs to handle the following:
Scale and latency
o Due to the user-to-item fanout, this service may receive 100-1000x the QPS of your recommendations service, and long-tail latency will heavily impact your overall system performance.
o Fortunately, in most recsys cases, query volume and cost can be significantly reduced at the expense of data freshness by caching item features. Item features (as opposed to user features) are usually well suited to caching because of their lower cardinality and looser freshness requirements. (It's probably not an issue if your product_average_review_rating feature is a minute or two stale.) Feature caching may be done at multiple levels; e.g., a first-level, in-process cache and a second-level, out-of-process cache like Redis. A minimal two-level caching sketch appears after this list.
Online-offline parity
o Feature data fetched online must be consistent with the feature data that
ranking models were trained with. For example, if the
feature user_site_spend_past_month is pre-tax during offline training and
post-tax during online inference, you may get inaccurate online model
predictions.
Feature and data engineering capabilities
o Your chosen data service should support serving the kinds of features and data
that your models and heuristics require. For example, time-windowed
aggregates
(e.g., product_average_review_rating_past_24h or user_recent_purchase_ids)
are a popular and powerful class of features, and your data service should have
a scalable approach to building and serving them.
o If your data service does not support these classes of features, then
contributors will look for escape hatches, which may degrade system
complexity, reliability, or performance.
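Below is the minimal sketch of the item-feature caching idea referenced above: a small in-process dictionary acts as the first-level cache, Redis (via the redis-py client, assumed reachable on localhost) as the second level, and the backing feature source is only hit on a miss. The key names, TTL, and fetch function are illustrative assumptions.

# Two-level cache for item features: in-process dict -> Redis -> feature source.
import json
import redis

r = redis.Redis(host="localhost", port=6379)   # second-level, out-of-process cache
local_cache = {}                               # first-level, in-process cache

def fetch_item_features_from_source(item_id):
    # Placeholder for the real feature store / database lookup.
    return {"product_average_review_rating": 4.3}

def get_item_features(item_id, ttl_seconds=120):
    if item_id in local_cache:                 # L1 hit
        return local_cache[item_id]
    cached = r.get(f"item_features:{item_id}") # L2 hit
    if cached is not None:
        features = json.loads(cached)
    else:                                      # miss: hit the source, then populate L2
        features = fetch_item_features_from_source(item_id)
        r.set(f"item_features:{item_id}", json.dumps(features), ex=ttl_seconds)
    local_cache[item_id] = features            # populate L1
    return features

print(get_item_features("sku_123"))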
Filtering
Filtering is the process of removing candidates based on fetched data or model predictions.
Filters primarily act as system guardrails for bad user experience. They fall into the following
categories:
Heuristic-based filters:
o E.g., a recently_view_products_filter that filters out products that the user has
viewed in the past day or in the current session.
Model-based filters:
o E.g., an unhelpful_review_filter that filters out item reviews if a personalized
model predicts that the user would rate the review as unhelpful.
Even though filters act as guardrails and are typically very simple, they themselves may need
some guardrails. Here are some recommended practices:
Filter limits - cap how much any single filter is allowed to remove, so that one overly aggressive filter cannot empty the candidate set.
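A minimal sketch of a heuristic filter like the one described above, combined with a crude "filter limit" guardrail that refuses to drop more than a fixed fraction of candidates. The function name, threshold, and data are illustrative assumptions.

# Heuristic filter sketch: drop recently viewed products, with a filter limit guardrail.
def recently_viewed_products_filter(candidates, recently_viewed, max_drop_fraction=0.5):
    filtered = [c for c in candidates if c not in recently_viewed]
    dropped = len(candidates) - len(filtered)
    # Filter limit: if the filter would remove too many candidates, skip it.
    if candidates and dropped / len(candidates) > max_drop_fraction:
        return candidates
    return filtered

candidates = ["p1", "p2", "p3", "p4"]
print(recently_viewed_products_filter(candidates, recently_viewed={"p2"}))  # ['p1', 'p3', 'p4']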
Model inference
Finally, we get to the “normal” machine learning part of the system: model inference. Like
with other ML systems, online model inference boils down to sending feature vectors to a
model service to get predictions. Typically these models predict easily measurable
downstream events, like Click Through Rate, Probability of a Purchase, Video Watch Time,
etc. The exact ML method, model architecture, and feature engineering for these models are
huge topics, and fully covering them is outside of the scope of this blog post.
Instead, we will focus on a couple of practical topics that are covered less in data science
literature.
“Pointwise” ranking is the process of scoring and ranking items in isolation; i.e., without
considering other items in the output. The item pointwise score may be as simple as a single
model prediction (e.g., the predicted click through rate) or a combination of multiple
predictions and heuristics.
For example, at YouTube, the ranking score is a simple algebraic combination of many
different predictions, like predicted “click through rate” and predicted “watch time”, and
occasionally some heuristics, like a “small creator boost” to slightly bump scores for smaller
channels.
ranking_score = f(pCTR, pWatchTime, isSmallChannel) = pCTR^X * pWatchTime^Y * if(isSmallChannel, Z, 1.0)
Tuning the parameters X, Y, and Z will shift the system’s bias from one objective to another.
If your model scores are calibrated, then this ranking function can be tuned separately from
model launches. The parameters can even be tuned in a personalized way; e.g., new users
may use different ranking function parameters than power users.
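A small sketch of the pointwise ranking function described above. The exponents X and Y and the small-channel boost Z are the tunable parameters; the prediction values for the candidates are placeholders.

# Pointwise ranking score combining calibrated model predictions and a heuristic boost.
def ranking_score(p_ctr, p_watch_time, is_small_channel, X=1.0, Y=0.8, Z=1.1):
    boost = Z if is_small_channel else 1.0
    return (p_ctr ** X) * (p_watch_time ** Y) * boost

candidates = [
    {"id": "v1", "p_ctr": 0.12, "p_watch_time": 6.0, "is_small_channel": False},
    {"id": "v2", "p_ctr": 0.10, "p_watch_time": 8.0, "is_small_channel": True},
]
ranked = sorted(
    candidates,
    key=lambda c: ranking_score(c["p_ctr"], c["p_watch_time"], c["is_small_channel"]),
    reverse=True,
)
print([c["id"] for c in ranked])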
Listwise ranking
Have you ever noticed that your YouTube feed rarely has two similar videos next to each
other? The reason is that item diversity is a major objective of YouTube’s “listwise” ranking.
(The intuition being if a user has scrolled past two “soccer” videos, then it’s probably sub-
optimal to recommend another “soccer” video in the third position.)
Listwise ranking is the process of ordering items in the context of other items in the list. ML-
based and heuristic-based approaches can both be very effective for listwise optimization—
YouTube has successfully used both. A simple heuristic approach that YouTube had success
with is to greedily rank items based on their pointwise rank but apply a penalty to items that
are too similar to the preceding N items.
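A minimal sketch of the greedy heuristic just described: walk down the pointwise-ranked list and penalize each candidate's score by its similarity to the last N items already placed. The similarity function, window size, and penalty weight are illustrative assumptions.

# Greedy listwise re-ranking: penalize items too similar to the preceding N picks.
def rerank(candidates, similarity, n_window=2, penalty_weight=0.5):
    # candidates: list of (item, pointwise_score), highest score first.
    remaining = list(candidates)
    output = []
    while remaining:
        def adjusted(entry):
            item, score = entry
            recent = [it for it, _ in output[-n_window:]]
            penalty = max((similarity(item, r) for r in recent), default=0.0)
            return score - penalty_weight * penalty
        best = max(remaining, key=adjusted)
        output.append(best)
        remaining.remove(best)
    return [item for item, _ in output]

# Toy similarity: 1.0 if the items share a topic, else 0.0.
topics = {"v1": "soccer", "v2": "soccer", "v3": "cooking", "v4": "soccer"}
sim = lambda a, b: 1.0 if topics[a] == topics[b] else 0.0

print(rerank([("v1", 0.9), ("v2", 0.8), ("v3", 0.7), ("v4", 0.6)], sim))
# ['v1', 'v3', 'v2', 'v4'] - the cooking video is pulled above the second soccer video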
If item diversity is an objective of your final output, keep in mind that listwise ranking can
only be impactful if there are high-quality, diverse candidates in its input. This means that
tuning the upstream system, and in particular candidate generation, to source diverse
candidates is critical.