Chatbots & Recommendation Systems Final Review
Chatbots & Recommendation Systems Final Review
Chatbots & Recommendation Systems Final Review
● Personalised recommendation
● Your might also like .. (upsells)
● Frequently bought together (cross-selling)
● Similar alternatives (downselling)
● Visual recommendations
● Location-based recommendations
● Personalised recommendations
● Frequently bought together
● Category-based recommendations
● News recommendations
● Video recommendations
● Music recommendations
● Event-based recommendations
● People you may know
● Location-based recommendations
● Activity recommendations
● Personalised item recommendations
● Trip itinerary recommendations
Recommendation system
Data terminology
In Reco there are always the components: user, item and interaction
User = A customer who interacted with the company and to whom the system
recommends products
Item = A product that can be recommended to a user
Interaction and Feedback = A type of interaction between a user and an item
Explicit feedback
○ Ratings
○ Survey results
○ Reviews
Implicit feedback
○ Click or no click
○ Purchase or no purchase
○ Clicks, purchases, pointing links
○ User engagement
○ CTR, conversion rate, impressions
● Often less graded (e.g. click or no click), less personalised (like CTR
● Hard to define and find negative samples
Cold users
Cold items
● Items that have not been interacted with by many users to make
accurate recommendations
● Example: new releases
One hot encoding
Binarize
Negative Sampling
Data Split
● Random Split: Randomly split the dataset
● Stratified split: randomly split the dataset by group of users and items
○ This makes sure users and items are not cold in both training and
test data
● Chronological split: Split the dataset by group of users w.r.t.
timestamps of user-item interactions
Stratified Split
● A split is stratified when the same set of users or items appear in both
training and testing datasets.
● It can filter by user of item and it can consider a minimum number of
interactions.
● EXAMPLE USE: Movie recommendation.
● To make sure the evaluation is statistically sound, the same set of
users for both model building and testing should be used (to avoid any
coldness of users), and a stratified splitting strategy should be taken
Chronological Split
● If Jeremy loves A and B, and Tao loves A,B, and C, then Jeremy is more
likely to love C
● Discover Patterns in observed preference behaviour (e.g. purchase
history, item ratings, click counts) across community of users.
● Predict new preferences based on those patterns
● It does NOT rely on item or user attributes (e.g. demographic info,
author, genre.)
⸺
● Jake is in zip code A and age group N and bought item X; Logan is in
the same area and age group, so he’s likely to like item X too.
● Trained on user features and item features to predict preferences
based on those patterns.
● Can handle cold starts (new users or items with no interaction data)
Offline Metrics:
Online Metrics:
● Precision
● Recall
● Mean Average Precision (MAP) - weighted on position
● NDCG - closely linked to MAP
Rating Metrics
Regression Metrics
● Root Mean Square Error (RMSE) - it measures the average error in the
predicted ratings. More affected by outliers, but maintains robustness.
● Mean Absolute Error - Similar to RMSE but it uses the absolute value
instead of the square. It’s a better estimator of the true average error.
● R squared - It evaluates how well a model performs, based on the
proportion of total variations of the observed results.
● Explained Variance - Evaluates how much of the variance in the data is
explained by the model.
Classification Metrics
● Area under the curve (AUC) - Integral area under the receiver
operating characteristic curve. It represents the ability to discriminate
between positive and negative classes.
● Logistic Loss (LogLoss) - the negative log-likelihood of the true labels
given the predictions of a classifier. Log loss penalises heavily
classifiers that are confident about incorrect classifications. Also
called cross-entropy loss.
Ranking Metrics
Long-tail items
– Item distribution has the form of a long tail. Non-popular items can be
highly profitable but suffer from the cold-start problem.
Diversity Metrics
– Measures effectiveness and added-value of a recommender
ALS VS RANDOM
● The long-tail less popular items have less chance of getting introduced
to the users, so ALS doesn’t perform that well in diversity metrics.
⸻⸻—
● CTR
● Conversion rate
● AOV - average order value
● MAU - Monthly active users
● LTV - lifetime value -> a good LTV should be 3 times the CAC
A/B Tests
Architectures
– Batch Architecture
– Real-time Architecture
– 2 step recommender
⸻⸺
SIMILARITY METRICS
Affinity Matrix
● The affinity matrix captures the strength of the relationship between
users and items
Affinity formula:
SAR scores:
Matrix factorisation: Model based approach
Latent Factors:
● The simplest way to model latent factors is as user & item vectors that
multiply (as inner products)
● Learn these factors from the data and use as model
● Predict an unseen rating of user-item by multiplying user factor with
item factor
○ The matrix factors P, Q have f columns, rows rest (latent
dimensions)
○ The number of factors f is also called the rank of the model
● FFM uses the same equation as FM, except in the V which varies
● It uses different factored latent factors for different groups of features
(fields)
● FFM solves the issue that the latent factors shared by features that
represent different categories of information, may not generalise well.
Decision Trees
● The intuition: You split the data into regions until all data points
belonging to each class are inside their own region
● Details:
○ DT models are fairly intuitive and easy to explain
○ DT cuts feature space in rectangles
○ DT -> can have as many categorical variables as needed
○ DT overfit easily. You can use Random Forest or Boosted Decision
Random Forest
the data. The we train a decision tree on each bag and compute
the ensemble of these models
● Parallel training
● We build subsamples of data WITH replacement (this means that the
same data can go in the same bag) -> in order to not run out of
datapoints
● For finding the optimal split we use the same measures of impurity
that are used in individual Decision Trees, the Gini Index or entropy.
root
○ Better for smaller datasets
loss change.
○ Better in larger datasets where it is considerably faster
Exact split:
● Conventional techniques find the exact split for each leaf, and require
scanning through all the data in each iteration
● Slower, bur more accurate
Histogram split:
LightGBM
workers
○ Then workers find the local best split on local merged histograms
Pointwise Loss
Pairwise Loss
preferences.
○ The missing values between two non-observed items are exactly
● MLP model
○ For the case of concatenating user + item features
○ To give the model even higher flexibility and non-linearity
● Implemented in tensorflow
● Featured by different types of GMF and MLP layers
● Parameters:
○ N factors - dimensionality of the latent space
○ Layer sizes - sizes of input layers
○ N epochs - number of epochs
● Both wide model and deep model are jointly trained, not ensembled.
● Solves the problem for:
○ Wide part: ‘Memorisation’ - learning the frequent co-occurrence of
items or features.
○ Deep part: ‘Generalisation’ - based on transitivity of correlation
Sequential
Session-based
Sequential data
● Modifications
○ Due to the time irregularity between interactions, we can change
techniques.
Sequential Convolutions
Transformed-based
– High-order connectivity
– Encode the interaction signal in the embeddings
– GNNs are powerful with graph-structured data
Graph-structured data
● GGNs learn not only features but also the structure of the data
● Data sparsity is addressed via the graph connections
Sequence to Sequence
Autoencoders
● Train by adding noise to the text and reconstructing it. This model is
used in downstream tasks.
● E.g. BERT
Autoregressive
Transformers
Attention mechanism
● Instead of producing a single hidden state for the input sequence, the
encoder creates a hidden state at each step that the decoder can
access
● Understand better the context, but the input to the decoder is huge
● Attention gives different weights to the encoder states. It measures
the importance
Self-attention intuition
and the resulting output is used as the input to the next layer
Self-Attention process
Softmax
Advantages
Multi-head self-attention
Positional Encoding
● Unlike in RNNs that we know how sequence are fed into model
○ In transformers we need to encode the position of each element in
Residual Connections
Advantages
Layer Normalisation
Advantages
———
Transfer Learning
ImageNet Dataset
Freezing VS Finetuning
If the target and base domains are similar -> freeze and retrain the lat layer
If the target and base domains are different -> fine-tune all the network
Pretraining NLP
Supervised
Unsupervised
Finetuning in NLP
● Supervised fine-tuning:
○ The traditional supervised learning model
○ The model learns to map the input data to the correct output labels
Bert Overview
Training BERT
● There are two pertaining tasks - masked language model and next
token prediction
● Pretrained on BooksCorpus (800M words) and English Wikipedia
(2500M words)
● The fine-tuning tasks are Q&A
Q&A
Summarisation
Text Classification
Types:
– Sentiment Analysis
– Sentence Classification
– Topic modelling
– Spam filtering
– Language classification
Language Modelling
Types:
– N-gram models
– Bayesian models
– Neural Network
Memory in LLMs
LLMs understand the context you provide in the prompt. This acts like a default
short-term memory. However, it is useful to have a long-term memory that
stores and retrieves information from previous conversations. Memory uses the
stored information to generate more accurate responses in the future.
Roles:
Threads are persistent and store the message history for each conversation.
They automatically truncate the history when it surpasses the context length.
Knowledge retrieval - Augments the assistant with knowledge from outside its
model, such as information or documents provided by the user.
Functions - Allows the assistant to execute callbacks and call external APIs
ChatGPT plugins
Microsoft Copilot
● Based on prometheus
○ Phase 1: When the user asks a question, the data is sent to the
Google Gemini