Chatbots & Recommendation Systems Final Review

Chatbots & Recommendation Systems | Final
● 35% of Amazon’s revenue comes from Recommendations

● 75% of what they watch on Netflix come from product
recommendations
● Recommendations account for around 60% of video clicks from the
homepage
● CNBC reported an online sales growth of 23.7% percent in the second
quarter of 2016 due to a speedier checkout process, better navigation
and more relevant product recommendations
Use Cases in Retail
● Personalised recommendation
● Your might also like .. (upsells)
● Frequently bought together (cross-selling)
● Similar alternatives (downselling)
● Visual recommendations
Use Cases in Gaming
● Personalised game recommendations

● Personalised item recommendation
● Similar items
● Next best action prediction
Use Cases in Food & Restaurants
● Location-based recommendations
● Personalised recommendations
● Frequently bought together
● Category-based recommendations
Use Cases in Media and Entertainment
● News recommendations
● Video recommendations
● Music recommendations
● Event-based recommendations
● People you may know
Use cases in travel and leisure
● Location-based recommendations
● Activity recommendations
● Personalised item recommendations
● Trip itinerary recommendations
Recommendation system
● They are information filters

● They utilise data available to infer customers interests and predict
their preferences
● They are the evolution of search engines
Data terminology
In Reco there are always the components: user, item and interaction
User = A customer who interacted with the company and to whom the system
recommends products
Item = A product that can be recommended to a user
Interaction and Feedback = A type of interaction between a user and an item
How do users interact and how is data collected?
Explicit feedback
○ Ratings
○ Survey results
○ Reviews
Advantages of Explicit Feedback
● Can be more informative and richer

● Allows for active learning
Cons of Explicit Feedback
● Can be costly to obtain

● Inconvenient to users
● Biased towards certain users
● May need calibration (ratings may not be easily comparable)
Implicit feedback
○ Click or no click
○ Purchase or no purchase
○ Clicks, purchases, pointing links
○ User engagement
○ CTR, conversion rate, impressions
Advantages of Implicit Feedback
● Often readily available from transaction logs

● Not disruptive to users, we have more data points
● Less statistical bias, more representative of customer base
Cons of Implicit Feedback
● Often less graded (e.g. click or no click), less personalised (like CTR
● Hard to define and find negative samples
Other data (features) about users and items available
Advantages of user and item features
● May reveal additional predictive factors (e.g. demographics, images of

products)
● Usually provides less granularity than user feedback: less
personalisation
Cons of user and item features
● Limitations due to privacy

● Valuable when data per user or item is sparse
● Allows for stratification by user features
● Also useful when customer targeting is a business objective
Knowledge-based data (as opposed to historical data) e.g. user

requirements, goals, etc.
Advantages of Knowledge-based data
● Useful for items not frequently purchased

● Sometimes features or historical data may not be sufficient, domains
complex
● User can specify what he or she wants
● Can be encoded as rules, cases, constraints, etc.
● Make the recommender explainable
Cons of Knowledge-based data

● Mapping from entities to recommendation items
Cold users
● Users who have not provided sufficient data to make accurate

recommendations
● It happens when a user is new to the system, has not yet interacted
with many items, or when a user is infrequent
● Example: buying house
Cold items
● Items that have not been interacted with by many users to make
accurate recommendations
● Example: new releases
One hot encoding
● A technique to deal with categorical data

● It can make data even more sparse
Binarize
● Numerical values can be converted to ordinal ones
Negative Sampling
● Implicit feedback is only positive. Negative sampling generates

negative interactions.
● PROBLEM: Negative interactions can be negative feedback or no
feedback
Data Split
● Random Split: Randomly split the dataset
● Stratified split: randomly split the dataset by group of users and items
○ This makes sure users and items are not cold in both training and
test data
● Chronological split: Split the dataset by group of users w.r.t.
timestamps of user-item interactions
Stratified Split
● A split is stratified when the same set of users or items appear in both
training and testing datasets.
● It can filter by user of item and it can consider a minimum number of
interactions.
● EXAMPLE USE: Movie recommendation.
● To make sure the evaluation is statistically sound, the same set of
users for both model building and testing should be used (to avoid any
coldness of users), and a stratified splitting strategy should be taken
Chronological Split
● Chronologically splitting method takes in a dataset and splits it on

timestamp
● It can filter by user or item and it can consider a minimum number of
interactions
● EXAMPLE USE: Fashion recommendation
● This recommender considers the time-dependency of customer
purchases, as apparently, the tastes of the customers in fashion items
may be drifting over time. In this case, a chronological split can be
used.
Collaborative VS Content-based filtering
Collaborative Filtering - correlating personal behaviours
● If Jeremy loves A and B, and Tao loves A,B, and C, then Jeremy is more
likely to love C
● Discover Patterns in observed preference behaviour (e.g. purchase
history, item ratings, click counts) across community of users.
● Predict new preferences based on those patterns
● It does NOT rely on item or user attributes (e.g. demographic info,
author, genre.)
⸺
– Uses feedback form multiple users in a collaborative way to predict

missing feedback
– Intuition: Users who give similar ratings to the same items will have
similar preferences, so should produce similar recommendations to
them.
Content-based Filtering - understanding user and item profiles
● Jake is in zip code A and age group N and bought item X; Logan is in
the same area and age group, so he’s likely to like item X too.
● Trained on user features and item features to predict preferences
based on those patterns.
● Can handle cold starts (new users or items with no interaction data)
– Content can be user & item features, review comments, knowledge

graph, multi-domain information, contextual information, etc.
– Mitigate the cold-start issues found in collaborative filtering typed
–
algorithms.
Offline VS Online Metrics
Offline Metrics:
● Metrics computed offline for measuring the performance of the

machine learning model
● These metrics include ranking, rating, and diversity metrics
Online Metrics:
● They are the metrics computed with the recommendation system in

production that reflects how the model is helping the business
● These metrics include CTR, conversion rate, monthly active usage,
average revenue per user, etc.
● Also known as business metrics.
Evaluating Recommendation models
In some cases, we are concerned about rating performance
● RMSE (Root mean squared error)

● MAE (Mean absolute error)
In most practical cases, we are concerned about ranking performance in:
● Precision
● Recall
● Mean Average Precision (MAP) - weighted on position
● NDCG - closely linked to MAP
Rating Metrics
– Measure how accurate a recommender is at predicting ratings
Regression Metrics
● Root Mean Square Error (RMSE) - it measures the average error in the
predicted ratings. More affected by outliers, but maintains robustness.
● Mean Absolute Error - Similar to RMSE but it uses the absolute value
instead of the square. It’s a better estimator of the true average error.
● R squared - It evaluates how well a model performs, based on the
proportion of total variations of the observed results.
● Explained Variance - Evaluates how much of the variance in the data is
explained by the model.
Classification Metrics
● Area under the curve (AUC) - Integral area under the receiver
operating characteristic curve. It represents the ability to discriminate
between positive and negative classes.
● Logistic Loss (LogLoss) - the negative log-likelihood of the true labels
given the predictions of a classifier. Log loss penalises heavily
classifiers that are confident about incorrect classifications. Also
called cross-entropy loss.
Ranking Metrics
– Measure how relevant recommendations are for users
● Precision - it measures the proportion of recommended items that are

relevant. In other words, it is the ability of the model to label a correct
sample as correct.
● Recall - It measures the proportion of relevant items that were
recommended. Ability to capture all correct samples.
● NDCG - Evaluates how well the recommender performs in
recommending ranked items to users.
● Mean Average Precision (MAP) - It evaluates the average precision for
each user in the dataset.
Long-tail items
– Item distribution has the form of a long tail. Non-popular items can be
highly profitable but suffer from the cold-start problem.
Diversity Metrics
– Measures effectiveness and added-value of a recommender
● Novelty - it measures how novel recommendation items are by

calculating their recommendation frequency among users
● Diversity - it measures how different the items in a set are with respect
to each other
● Serendipity - it measures how surprising recommendations are to a
specific user by comparing them to the items that the user has already
interacted with.
● Coverage - It measures the proportion of items recommended by the
system. Coverage can be based on all the item catalog or based on a
distribution.
ALS VS RANDOM
● ALS recommender outperforms the random recommender on ranking

metrics
● The random recommender outperforms ALS on diversity metrics
● ALS is optimised for estimating the item rating as accurately as
possible.
○ The items recommended tend to be popular items
● The long-tail less popular items have less chance of getting introduced
to the users, so ALS doesn’t perform that well in diversity metrics.
⸻⸻—
– In production, the model needs to provide value based on business

metrics
● CTR
● Conversion rate
● AOV - average order value
● MAU - Monthly active users
● LTV - lifetime value -> a good LTV should be 3 times the CAC
A/B Tests
● Used to measure an ML model in real time.

● Works by randomising an environment response into two groups, half
of the traffic foes to the machine learning model output, and the other
half has a random model.
● By comparing these metrics, it is possible to evaluate whether it is
●
beneficial to use the model or not (revenue-wise)

● A test with more than two groups is a “Multivariate test”
Architectures
– Batch Architecture
– Real-time Architecture
– 2 step recommender
⸻⸺
SIMILARITY METRICS
– Co-occurrence is a computation of item similarity
Affinity Matrix
● The affinity matrix captures the strength of the relationship between
users and items
In terms of type of interaction
● SAR can accept explicit data (ratings 1-5)

● It also accepts implicit data (view, click, purchase, etc)
○ Each interaction can be assigned a weight to capture its strength.
In terms of when the interaction occurred
● We can add a discount factor to give more importance to recent

interactions
● Then scaling factor causes the parameter T to serve as half-life:
events T units before t_0 will be given half the weight as those taking
place at t_0.
Affinity formula:
SAR scores:
Matrix factorisation: Model based approach
Latent Factors:
– Explains the ratings by characterizing all users and items in a more

global way
– The item factors weight the items in ways that may reflect e.g. comedy
vs drama, or uninterpretable dimensions.
– The user factors weight the user’s preferences towards the item
factors
Matrix factorisation
● The simplest way to model latent factors is as user & item vectors that
multiply (as inner products)
● Learn these factors from the data and use as model
● Predict an unseen rating of user-item by multiplying user factor with
item factor
○ The matrix factors P, Q have f columns, rows rest (latent
dimensions)
○ The number of factors f is also called the rank of the model
Matrix factorisation problem
● For regularised matrix factorisation, the task is to minimise:

Learning algorithms:
● Stochastic gradient descent

○ Parameters are updated in the opposite direction of gradient
○ Example: SVD
● Alternating least squares
○ Learn one of q or p at a time while keeping the other a constant,
and then alternate.

○ Example: ALS
Factorisation Machines (FM)
● FM extends matrix factorisation to model interactions of order 2

(cross-product between features)
● The input x contains users, items and their features as a single vector
● Trick: Estimate an approximation V of the interaction items
Field-Aware factorisation machines (FFM)
● FFM uses the same equation as FM, except in the V which varies
● It uses different factored latent factors for different groups of features
(fields)
● FFM solves the issue that the latent factors shared by features that
represent different categories of information, may not generalise well.
Decision Trees
● A decision tree is an ML algorithm that can be used for classification or

regression
● The intuition: You split the data into regions until all data points
belonging to each class are inside their own region
● Details:
○ DT models are fairly intuitive and easy to explain
○ DT cuts feature space in rectangles
○ DT -> can have as many categorical variables as needed
○ DT overfit easily. You can use Random Forest or Boosted Decision
Trees to solve that.

○ No need for normalisation
● To train a DT we optimise a loss function that finds the optimal split.

○ The optimal split is the one that minimises the classification error.
○ This is measured in terms of ‘purity’
○ You can measure purity by: GIni Index or Entropy.
● Gini is a measure of variance in the node

● Gini is high if there are many data points belonging to the wrong class
● Gini is zero if all the data points belong to the same class
● Gini is faster to compute than Entropy
● Also used in economics to measure inequality
● Entropy is a measure of uncertainty in the node
● Entropy is high if there are many data points belonging to the wrong
class
● Entropy is zero if all the data points belong to the same class
● Entropy is slower to compute than Gini
● It is used in stats to measure the expected amount of information that
can be drawn from a distribution.
Random Forest
● Can be used for classification and regression

● Improvement of top of DT and helps overcoming overfitting
● Uses BAGGING, short of bootstrap aggregation.
○ Data is randomly divided into bags, and on each bag we put part of
the data. The we train a decision tree on each bag and compute
the ensemble of these models
● Parallel training
● We build subsamples of data WITH replacement (this means that the
same data can go in the same bag) -> in order to not run out of
datapoints
● For finding the optimal split we use the same measures of impurity
that are used in individual Decision Trees, the Gini Index or entropy.
Gradient-Boosting Decision trees
● Gradient boosting is a machine learning technique that produces a

prediction model in the form of an ensemble of weak classifiers
● One of the most popular types of gradient boosting is GBDT, that
internally is made up of an ensemble of weak learners
● The two most popular GBDT frameworks are XGBoost and LightGBM.
● There are two different strategies to compute the trees: level-wise and
leaf-wise
● There are two ways of finding the split, using an approximation
(histogram) or by the exact split.
Tree computation
● The level-wise strategy:

○ Grows the tree level by level
○ Each node splits the data prioritising the nodes closer to the tree
root
○ Better for smaller datasets
● The leaf-wise strategy:

○ Grows the tree by splitting the data at the nodes with the highest
loss change.
○ Better in larger datasets where it is considerably faster
– A key challenge in training GBDT is the computational cost of finding

the best split
Exact split:
● Conventional techniques find the exact split for each leaf, and require
scanning through all the data in each iteration
● Slower, bur more accurate
Histogram split:
● It approximates the split by building histograms of the features

● Algorithm dies not need to evaluate every value of the features to
compute the split, but only the bins of the histogram.
● Faster, typically does not affect accuracy (particularly in large
datasets)
LightGBM
● Gradient boosting decision tree algorithm written in C++

● Has multiple APIs for Python and R
● It supports distributed training and GPU
● Split is approximated by creating a histogram of features
● Has +130 parameters
XGBoost
● XGBoost is a gradient boosting decision tree also written in C++

● Multiple APIs in python, R, java.
● Supports distributed training and GPU
● Allows BOTH exact and histogram split
● Almost same number of parameters than LightGBM
LightGBM Spark: SynapseML
● LightGBM is also a GBDT that uses Spark to accommodate large

datasets
● LightGBM in Spark is 10-30% faster than SparkML on the Higgs
dataset, and achieves a 15% increase in AUC.
● Data parallel distributed process:

○ The dataset is partitioned to each Spark worker
○ Workers use local data to construct local histograms
○ Workers merge histograms of different features for different
workers
○ Then workers find the local best split on local merged histograms
and sync up the global best split
Bayesian Pairwise Ranking (BPR)
● Algorithm focused on ranking

● Addresses the problem of distinguishing between negative feedback
and not interacted items
● It uses a loss function called pairwise loss
● The dataset is reformulated and divided into observed and unobserved
items
● The intuition behind it is that observed entries should be ranked higher
than unobserved ones.
● Pairwise loss can be applied to any ranking algorithm.
Pointwise Loss
● Many recommendation scenarios use implicit feedback (views, clicks,

buys, etc.)
● We only have positive feedback
● In the pointwise loss, the model is trained to predict as 1 the elements
in the observed dataset and 0 for the rest
● The big problem: All elements the model should rank are presented as
negative feedback during training, but in reality, some are unknown
and some are true negatives.
Pairwise Loss
● For each user, a pairwise matrix of items is generated

● The plus indicates that the user prefers item i over item j, and the
minus is that the user prefers item j over item i. The interrogation
indicates that we don’t know.
● Advantages:
○ Training set is defined as positive, negative and unknown
preferences.
○ The missing values between two non-observed items are exactly
the item pairs that have to be ranked by the model

○ The training data is created for the actual objective of ranking
○ Pairwise ranking is not restricted to implicit feedback and can be
used in explicit feedback by setting user preferences as pairs (i.e.

user u prefers item a more than item b)
⸻
Matrix Factorisation Limitations

● In matrix factorisation, we apply an inner product to the features of
users and items. It simply combines the latent features linearly.
● The inner product may not be sufficient to capture the complex
structure of user interaction data.
Neural Collaborative Filtering
● Generalised Matrix Factorisation

○ Neural CF layer is added to the classic MF model
○ Edge weights of the output layer are learned from training data
○ Generalise MF to a non-linear setting as a non-linear activation
function is used - more expressive.
● MLP model
○ For the case of concatenating user + item features
○ To give the model even higher flexibility and non-linearity
Loss function: Binary cross-entropy loss
Fusion of GMF and MLP
● MLP and GMF can be combined for even greater flexibility

● The two learn separate embeddings and are concatenated in the last
hidden layer
Generalised Matrix Factorisation Model
Multi-Layer perceptron Model

NCF Implementation
● Implemented in tensorflow
● Featured by different types of GMF and MLP layers
● Built-in data preparation module:

○ Negative sampling
○ Data loader in batches (with shuffling operation supported)
○ Leave-one-out evaluation protocol
● Parameters:
○ N factors - dimensionality of the latent space
○ Layer sizes - sizes of input layers
○ N epochs - number of epochs
Wide and Deep models
● Both wide model and deep model are jointly trained, not ensembled.
● Solves the problem for:
○ Wide part: ‘Memorisation’ - learning the frequent co-occurrence of
items or features.
○ Deep part: ‘Generalisation’ - based on transitivity of correlation
and tend to improve the diversity of the recommended items

Sequential VS Session-based recommendation systems
Sequential
● In many business scenarios, the recommendations are done in

sequences e.g. e-commerce.
● A sequential recommender combines the user’s historical preferences
with the user’s recent actions
● E.g. markov chains, RNNs, CNNs, Transformers.
Session-based
● The data is also typically sequential

● We don’t have user information - each session of the same user is
handled independently
● E.g. LSTMs, transformers.
Sequential data
● Common to set the data as session-parallel mini-batches

● Each sequence has a different length
● If any of the sessions end, the next available session is put in its place
● Sessions are assumed to be independent.
○ If different sessions are concatenated, some mechanism needs to
be in place to distinguish between them.

GRU-based
● It uses GRU networks

● Network’s input is the actual state of the session while the output is
the item of the next event in the session
● Items are one-hot encoded
● Sequence of items has a weight used as a discount if they have
occurred earlier.
● Output is the predicted preference of the items
LSTM-Based
● Good are modelling sequences - limited by sequence length
● Modifications
○ Due to the time irregularity between interactions, we can change
the gating logic to account for different time intervals

○ Not all items are related, we can use attention mechanisms to filter
out irrelevant items or distinguish different levels of influence.

○ To capture long-term behaviour, we can add matrix factorisation
techniques.
Sequential Convolutions
● The idea is to embed a sequence of recent items into an ‘image’ in the

time and latent spaces
● Horizontal and vertical convolutions are applied to the image
● Vertical filters encode how single items influence the next item
● Horizontal filters group the influence of items
● Using the previous 4 actions (L=4), we predict the items the user will
interact within the next 2 steps (T=2)
Dilated Convolutions
● Dilated convolutions can be used to model long-range dependencies

in comparison to CNNs.
● They allow the network to increase the receptive field of a layer
without adding extra parameters
● A longer receptive field captures features at multiple scales
● The idea of dilation is to apply the convolutional filter over a filed larger
than its original length by dilating it with zeros. It is also referred to as
a holed or sparse filter.
Transformed-based
● Strong for sequential recommendations

● Transformers improve over LSTMs and CNNs for sequences
● They use the ‘attention mechanism’ to focus on items relevant to the
next action
Graph Neural Networks (GNNs)
– They are neural networks designed for graph data
● Great for understanding networks or sequences

● Composed of nodes (entities) and edges (relationships)
● Leverage locality
● Specify different types of edges
● Computationally efficient
● Applicable to inductive problems to generalise to new data
Motivation of GNNs for Reco
● Great mechanism to learn relationships
– High-order connectivity
– Encode the interaction signal in the embeddings
– GNNs are powerful with graph-structured data
High Order Connectivity
– GNNs help achieve high-order connectivity
● Standard reco models capture similarity

○ User-user (user collaborative filtering)
○ Item-item (item collaborative filtering)
○ User-Item (model collaborative filtering)
Interaction in the embeddings
– GNNs can be used to encode the interaction signal in the embeddings

– Items are user features and users are item features
● Interacted items are user features: they provide direct evidence of

user preferences
● Users that consume an item are item’s features: they measure the
collaborative similarity of two items
Graph-structured data
● GGNs learn not only features but also the structure of the data
● Data sparsity is addressed via the graph connections
DKN - Knowledge-graph network
● Combination of a knowledge graph, attention and convolutions for

news recommendations.
● CTR prediction
● Joint Learning of semantic-level and knowledge-level representations
● Use of attention to encode the user’s historical behaviour
NLP approaches
Sequence to Sequence
● An encoder takes an input sequence and transforms it into a vector

representation
● A decoder takes the representation an outputs a different sequence
● E.g. Vanilla transformer
Autoencoders
● Train by adding noise to the text and reconstructing it. This model is
used in downstream tasks.
● E.g. BERT
Autoregressive
● Predict the future sequences based on previous ones.

● E.g. GPT
Transformers
● First architecture based solely on attention mechanisms

● Good for learning global word relationships and working in parallel
● Better behaviour than LSTMs and CNNs
● The encoder generates a hidden state that is fed to the decoder
● Key component: multi-head self-attention
● It’s the base of many successful NLP models
Attention mechanism
● Instead of producing a single hidden state for the input sequence, the
encoder creates a hidden state at each step that the decoder can
access
● Understand better the context, but the input to the decoder is huge
● Attention gives different weights to the encoder states. It measures
the importance
Self-attention intuition
● As it processes each word, self-attention allows it to look at other

positions in the input sequence for clues that can help lead to a better
encoding for this word.
● Self-attention allows the transformer model to capture dependencies
between all positions in the sequence, regardless of their relative
position. e.g. long term dependencies.
● It computes a weighted sum of the values to all positions, where the

weights are determined by the similarity between the query and key
vectors at each position.
● Query and vectors are derived from the input embeddings and are
used to compute an attention score for each position
● The attention score represents the importance of each position in the
sequence relative to the current position
○ Used to compute a wighted sum of the values at each position
◆ This is the output of the self-attention mechanism.
◆ This output is then fed through a feedforward neural network,
and the resulting output is used as the input to the next layer
– The main idea of self-attention is that instead of using fixed

embeddings for each token, we use the whole sequence to compute a
weighted average of each embedding
Self-Attention process
The most common attention implementation is the scaled dot-product:
. We project each token embedding into three matrices, q, k, and v

. Compute a similarity function between q and k, the dot product.
Similar queries and keys will have a high dot product.
. Compute the attention weights, by scaling the dot product and
applying a softmax
. Update the token embeddings by multiplying by the value vector
Softmax
Advantages
– It normalises a vector to a probability distribution
. It can convert any vector into a probability distribution, so it is easier

to compare
. It acts like a soft argmax function
. - Softmax will exaggerate that difference if one value is higher than
the others with the winning value close to one and all the others
close to zero
. - If there are several values close to the top, it will preserve them
all as highly probable, rather than artificially crushing close
second-place results.
. Softmax is differentiable, so it works well with backpropagation
Multi-head self-attention
● In multi-head attention, the attention mechanism is repeated multiple

●
times with linear projections of Q, K, and V.

● This allows the system to learn from different Q, K, and V
representations.
● In other words, this lets the transformer consider several previous
words simultaneously when predicting the next.
● The multi-head attention module that connects the encoder and
decoder will make sure that the encoder input sequence is taken into
account together with the decoder input sequence up to a given
position.
● This operation can be done in parallel.
Positional Encoding
● Unlike in RNNs that we know how sequence are fed into model
○ In transformers we need to encode the position of each element in
the input sequence

● The positional encoding is added to the input embeddings by adding a
fixed sinusoidal function of different frequencies and phases to each
embedding dimension.
● The frequency and phase of the sinusoidal function for each
dimension are determined by the position of the word in the sequence.
Residual Connections
Advantages
. Reduce the vanishing gradient since the gradient value is transferred

through the network
. It allows later layers to learn from features generated in the initial
layers, without the skip connection, that initial info would be lost.
. Skip connections help to maintain the gradient surface smooth and
without too many saddle points. This keeps gradient descent to get
stuck in local minima, is other words, the optimisation process is more
robust and then we can use deeper networks.
Layer Normalisation
● It normalises each input to have Zero mean and variance of one
Advantages
– Improved training stability

– Improved generalization
– Reduced sensitivity to initialisation
– Faster convergence
———
Transfer Learning
● First used in computer vision

● Common approach: train on one task, and then adapt it or fine tune it
on a new task.
● The original weights (body) of the task are used to initialise the
weights of the new task (head)
ImageNet Dataset
● Has become one of the most important benchmarks for computer

vision research.
● Several categories
○ Image classification
○ Object detection
● 1.2 million images
● 1000 classes of animals and objects
Freezing VS Finetuning
If the target and base domains are similar -> freeze and retrain the lat layer
If the target and base domains are different -> fine-tune all the network
Pretraining NLP
Supervised
● An LLM can be pretrained with translation pairs

● This is expensive and time consuming
Unsupervised
● Next token prediction: Autoregressive

○ Goal - Learn the set of parameters that accurately predict the next
token in a sequence given the previous tokens
● Masked token prediction: Denoising

○ Goal - Learn the set of parameters that accurately predict a mask
token given surrounding tokens.
Finetuning in NLP
● Finetuning adjusts the pretrained weights to a new task
● Supervised fine-tuning:
○ The traditional supervised learning model
○ The model learns to map the input data to the correct output labels
for the specific task
● Reward model finetuning:

○ A reward function is defined, and the model is fine-tuned using
reinforcement learning (RL) to maximise the reward
● Reinforcement Learning with Human Feedback (RLHF):

○ Used in ChatGPT
○ The model is fine-tuned using RL to maximise a reward function,
but the reward function is based on human feedback. The model

generates outputs, which are then evaluated by humans, who
provide feedback in the form of a reward signal.
○ The model is then updated to maximise the reward signal.
Bert Overview
● Bi-directional Encoder representations from transformers

● Encoder structure of the transformer
● Pretrained as a denoising autoencoder and autoregressive
● BERT large uses 340M parameters:

○ 24 encoder blocks
○ 1024 - dimensional embedding vectors
○ 16 attention blocks
● Deployed in Google search in 2019

● Many successful iterations: Roberta, Debra, DistilBert, Albert, etc.
Training BERT
● There are two pertaining tasks - masked language model and next
token prediction
● Pretrained on BooksCorpus (800M words) and English Wikipedia
(2500M words)
● The fine-tuning tasks are Q&A
Masked LM and Bidirectionality
– Masking tokens enables a true bidirectional model

Bert Tasks
● Sentence pair classification task

● Single sentence classification task
● Question answering task
● Single sentence tagging task
Zero-shot learning -> Without examples provided the model understands the
task based on the given instruction.
Methods used in Chatbots
Named-Entity Recognition (NER)
– New is the task of classifying words or key phrases into predefined

entities of interest
Steps of the NER process
. Text preprocessing: The text is preprocessed to remove any irrelevant

information, such as step words or punctuation
. Tokenisation: The text is broken down into individual tokens
. Part-of Speech tagging (POS): Each token is tagged with its part of
speech, such as noun, verb, or adjective.
. Named-Entity Recognition: The algorithm analyses the text and
identifies words or phrases that correspond to named entities, such as
people, places, or organisations.
. Post-Processing: The output of the NER algorithm may be post-
processed to improve the accuracy and consistency of the results.
– Some algorithms for NER are Conditional Random Fields, CNNs,

LSTMs, Decision Trees, and LLMs.
Q&A
– Q&A automatically generates answers to questions posed in natural

language.
Steps of the Q&A process:

. Question Analysis: The question is analysed to determine its intent, the
type of answer expected, and any relevant context, such as time or
location
. Information Retrieval: The system searches for relevant information
that can be used to answer the question, using techniques such as
keyword matching, NER, and semantic search
. Answer Generation: The system generates one or more candidate
answers to the question, using a variety of techniques such as text
summarisation, knowledge representation, and inference.
. Answer ranking and selection: The candidate answers are ranked
based on their relevance, accuracy, and other criteria, and the best
answer is selected for presentation to the user.
Two types - Extractive and Abstractive Q&A
● Extractive Q&A involves selecting an answer from a given source of

information. The answer is extracted directly from the text, without any
modification or summarisation.
● Abstractive Q&A involves generating a new answer based on the
information contained in the source text. It requires understanding the
meaning and generating a summary or paraphrase of the relevant
information.
Summarisation
Steps of the Summarisation process:

. Sentence extraction: The model identifies the most important
sentences from the original text. This can be done using various
methods, such as frequency analysis, graph-based algorithms, or
other machine learning models.
. Sentence Ranking: The extracted sentences are then ranked according
to their importance in the original text. This can be done using
different criteria, such as word frequency, sentence length, or
semantic similarity.
. Summary generation: The final step is to generate a summary using
the extracted and ranked sentences
Two types - Extractive and Abstractive Summarisation
● Extractive Summarisation involves selecting the most important

sentences or phrases from the original text and assembling them into
a summary
● In Abstractive Summarisation, the summary is generated by
understanding the meaning of the original text and generating a new
text that conveys the most important information
Text Classification
– Text classification involves assigning labels to text documents based

on their content

. Feature engineering: The next step is to extract meaningful features
from the preprocessed text data. This can be done using various
techniques, such as Bag-of-words, TF-IDF, or word embeddings.
. Model training: The extracted features are then used to train a ML
.
model. Popular algorithms for text classification include Naive Bayes,
SVMs, and NNs.
. Model evaluation: The trained model is then evaluated using a
validation dataset to measure its performance. (Accuracy, precision,
recall, f1-score)
. Model deployment: Once the model has been evaluated and fine-
tuned, it can be deployed to classify new text data in real-world
applications
Types:
– Sentiment Analysis
– Sentence Classification
– Topic modelling
– Spam filtering
– Language classification
Language Modelling
– Language modelling is the task of predicting the likelihood of a

sequence of words

. Feature engineering: The next step is to extract meaningful features
from the preprocessed text data. This can be done using various
techniques, such as Bag-of-words, TF-IDF, or word embeddings.
. Model training: Train the language model on the features. This involves
estimating the probability distribution over sequences of words using
maximum likelihood estimation.
. Model evaluation: Evaluate the performance of the language model on
a held-out test set of text data. This may involve computing perplexity,
a measure of how well the model predicts the test data
. Model deployment: Deploy the trained language model to perform the
desired task, such as text generation, machine translation, or speech
recognition. This may involve integrating the model into a larger
system.
Types:
– N-gram models
– Bayesian models
– Neural Network
RLHF Secret Sauce
● The standard loss of GPT-4 is cross-entropy loss. It learns to match

the probability distribution of the text
● RLHF instead of doing distribution matching, it does mode seeking
● RLHF biases the model toward the most rated preferences provided by
the human
● The effect is that you are losing a lot of the diversity of the base
model, in exchange for more reliable answers
● Regarding generating novelty and disruptive ideas, RLHF is not the
best solution because it constraints the model to output common
knowledge
● In same way, RLHF performs a function similar to Page Rank 20 years
ago at capturing the relevance of websites.
Memory in LLMs
LLMs understand the context you provide in the prompt. This acts like a default
short-term memory. However, it is useful to have a long-term memory that
stores and retrieves information from previous conversations. Memory uses the
stored information to generate more accurate responses in the future.
1 - Buffer Memory: Keep a buffer of all prior messages

2 - Conversation summary memory: Creates a summary of the conversation
over time
3 - Entity memory: Remembers facts about specific entities (names, locations,
product, etc.)
4 - Knowledge graph memory: Stores information about relationships between
entities. It represents knowledge in a structured way.
5 - Vector-stored memory: Stores conversations in a vector database, and
queries the top-K most ‘salient’ conversations every time it is called.
OpenAI API - ChatGPT for devs
Roles:
– System: it helps set the behaviour of the assistant

– User: This is the user’s prompt
– Assistant: This is ChatGPT’s response. It can be used to stored prior
responses
Threads are persistent and store the message history for each conversation.
They automatically truncate the history when it surpasses the context length.
Code interpreter - can run python code in a sandboxed execution environment
Knowledge retrieval - Augments the assistant with knowledge from outside its
model, such as information or documents provided by the user.
Functions - Allows the assistant to execute callbacks and call external APIs
ChatGPT plugins
● With plugins developers can connect applications to ChatGPT.

Examples include a web browser, a code interpreter, zapier, Expedia,
etc.
Microsoft Copilot
● Based on prometheus
○ Phase 1: When the user asks a question, the data is sent to the
Bing orchestrator. It generates multiple related search queries to

feed the system.
○ Phase 2: Query is combined with other pieces of information like
fresh data, news, answers, contextual signals, and location. This is

called grounding.
◆ During grounding, safeguards are applied to prevent offensive
and harmful content

○ Phase 3: The final answer is generated and enriched with relevant
citations
Google Gemini
● Multimodal learning - Gemini can process text, images, sound and

video
● Architecture - Based on the transformer decoder
● Gemini 1.0 Ultra has 56B parameters, Gemini 1.5 has 47B but reaches
parity with 1.0
● Gemini 1.0 Ultra has a context window of 128k tokens, while Gemini 1.5
has 1M tokens

Chatbots & Recommendation Systems Final Review

Uploaded by

Copyright:

Available Formats

Chatbots & Recommendation Systems Final Review

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chatbots & Recommendation Systems Final Review

Uploaded by

Copyright:

Available Formats

Chatbots & Recommendation Systems | Final

● 35% of Amazon’s revenue comes from Recommendations

Use Cases in Retail

Use Cases in Gaming

● Personalised game recommendations

Use Cases in Food & Restaurants

Use Cases in Media and Entertainment

Use cases in travel and leisure

● They are information filters

How do users interact and how is data collected?

Advantages of Explicit Feedback

● Can be more informative and richer

Cons of Explicit Feedback

● Can be costly to obtain

Advantages of Implicit Feedback

● Often readily available from transaction logs

Cons of Implicit Feedback

Other data (features) about users and items available

Advantages of user and item features

● May reveal additional predictive factors (e.g. demographics, images of

Cons of user and item features

● Limitations due to privacy

Knowledge-based data (as opposed to historical data) e.g. user

Advantages of Knowledge-based data

● Useful for items not frequently purchased

Cons of Knowledge-based data

● Users who have not provided sufficient data to make accurate

● A technique to deal with categorical data

● Numerical values can be converted to ordinal ones

● Implicit feedback is only positive. Negative sampling generates

● Chronologically splitting method takes in a dataset and splits it on

Collaborative VS Content-based filtering

Collaborative Filtering - correlating personal behaviours

– Uses feedback form multiple users in a collaborative way to predict

Content-based Filtering - understanding user and item profiles

– Content can be user & item features, review comments, knowledge

Offline VS Online Metrics

● Metrics computed offline for measuring the performance of the

● They are the metrics computed with the recommendation system in

Evaluating Recommendation models

In some cases, we are concerned about rating performance

● RMSE (Root mean squared error)

In most practical cases, we are concerned about ranking performance in:

– Measure how accurate a recommender is at predicting ratings

– Measure how relevant recommendations are for users

● Precision - it measures the proportion of recommended items that are

● Novelty - it measures how novel recommendation items are by

● ALS recommender outperforms the random recommender on ranking

– In production, the model needs to provide value based on business

● Used to measure an ML model in real time.

beneficial to use the model or not (revenue-wise)

– Co-occurrence is a computation of item similarity

In terms of type of interaction

● SAR can accept explicit data (ratings 1-5)

In terms of when the interaction occurred

● We can add a discount factor to give more importance to recent

– Explains the ratings by characterizing all users and items in a more

Matrix factorisation problem

● For regularised matrix factorisation, the task is to minimise: