Chatbots & Recommendation Systems Final Review

Download as pdf or txt
Download as pdf or txt
You are on page 1of 49

Chatbots & Recommendation Systems | Final

● 35% of Amazon’s revenue comes from Recommendations


● 75% of what they watch on Netflix come from product
recommendations
● Recommendations account for around 60% of video clicks from the
homepage
● CNBC reported an online sales growth of 23.7% percent in the second
quarter of 2016 due to a speedier checkout process, better navigation
and more relevant product recommendations

Use Cases in Retail

● Personalised recommendation
● Your might also like .. (upsells)
● Frequently bought together (cross-selling)
● Similar alternatives (downselling)
● Visual recommendations

Use Cases in Gaming

● Personalised game recommendations


● Personalised item recommendation
● Similar items
● Next best action prediction

Use Cases in Food & Restaurants

● Location-based recommendations
● Personalised recommendations
● Frequently bought together
● Category-based recommendations

Use Cases in Media and Entertainment

● News recommendations
● Video recommendations
● Music recommendations
● Event-based recommendations
● People you may know

Use cases in travel and leisure

● Location-based recommendations
● Activity recommendations
● Personalised item recommendations
● Trip itinerary recommendations

Recommendation system

● They are information filters


● They utilise data available to infer customers interests and predict
their preferences
● They are the evolution of search engines

Data terminology

In Reco there are always the components: user, item and interaction

User = A customer who interacted with the company and to whom the system
recommends products
Item = A product that can be recommended to a user
Interaction and Feedback = A type of interaction between a user and an item

How do users interact and how is data collected?

Explicit feedback

○ Ratings
○ Survey results
○ Reviews

Advantages of Explicit Feedback

● Can be more informative and richer


● Allows for active learning

Cons of Explicit Feedback

● Can be costly to obtain


● Inconvenient to users
● Biased towards certain users
● May need calibration (ratings may not be easily comparable)

Implicit feedback

○ Click or no click
○ Purchase or no purchase
○ Clicks, purchases, pointing links
○ User engagement
○ CTR, conversion rate, impressions

Advantages of Implicit Feedback

● Often readily available from transaction logs


● Not disruptive to users, we have more data points
● Less statistical bias, more representative of customer base

Cons of Implicit Feedback

● Often less graded (e.g. click or no click), less personalised (like CTR
● Hard to define and find negative samples

Other data (features) about users and items available

Advantages of user and item features

● May reveal additional predictive factors (e.g. demographics, images of


products)
● Usually provides less granularity than user feedback: less
personalisation

Cons of user and item features

● Limitations due to privacy


● Valuable when data per user or item is sparse
● Allows for stratification by user features
● Also useful when customer targeting is a business objective

Knowledge-based data (as opposed to historical data) e.g. user


requirements, goals, etc.

Advantages of Knowledge-based data

● Useful for items not frequently purchased


● Sometimes features or historical data may not be sufficient, domains
complex
● User can specify what he or she wants
● Can be encoded as rules, cases, constraints, etc.
● Make the recommender explainable

Cons of Knowledge-based data


● Mapping from entities to recommendation items

Cold users

● Users who have not provided sufficient data to make accurate


recommendations
● It happens when a user is new to the system, has not yet interacted
with many items, or when a user is infrequent
● Example: buying house

Cold items

● Items that have not been interacted with by many users to make
accurate recommendations
● Example: new releases
One hot encoding

● A technique to deal with categorical data


● It can make data even more sparse

Binarize

● Numerical values can be converted to ordinal ones

Negative Sampling

● Implicit feedback is only positive. Negative sampling generates


negative interactions.
● PROBLEM: Negative interactions can be negative feedback or no
feedback

Data Split
● Random Split: Randomly split the dataset
● Stratified split: randomly split the dataset by group of users and items
○ This makes sure users and items are not cold in both training and

test data
● Chronological split: Split the dataset by group of users w.r.t.
timestamps of user-item interactions

Stratified Split

● A split is stratified when the same set of users or items appear in both
training and testing datasets.
● It can filter by user of item and it can consider a minimum number of
interactions.
● EXAMPLE USE: Movie recommendation.
● To make sure the evaluation is statistically sound, the same set of
users for both model building and testing should be used (to avoid any
coldness of users), and a stratified splitting strategy should be taken

Chronological Split

● Chronologically splitting method takes in a dataset and splits it on


timestamp
● It can filter by user or item and it can consider a minimum number of
interactions
● EXAMPLE USE: Fashion recommendation
● This recommender considers the time-dependency of customer
purchases, as apparently, the tastes of the customers in fashion items
may be drifting over time. In this case, a chronological split can be
used.

Collaborative VS Content-based filtering

Collaborative Filtering - correlating personal behaviours

● If Jeremy loves A and B, and Tao loves A,B, and C, then Jeremy is more
likely to love C
● Discover Patterns in observed preference behaviour (e.g. purchase
history, item ratings, click counts) across community of users.
● Predict new preferences based on those patterns
● It does NOT rely on item or user attributes (e.g. demographic info,
author, genre.)

– Uses feedback form multiple users in a collaborative way to predict


missing feedback
– Intuition: Users who give similar ratings to the same items will have
similar preferences, so should produce similar recommendations to
them.

Content-based Filtering - understanding user and item profiles

● Jake is in zip code A and age group N and bought item X; Logan is in
the same area and age group, so he’s likely to like item X too.
● Trained on user features and item features to predict preferences
based on those patterns.
● Can handle cold starts (new users or items with no interaction data)

– Content can be user & item features, review comments, knowledge


graph, multi-domain information, contextual information, etc.
– Mitigate the cold-start issues found in collaborative filtering typed

algorithms.

Offline VS Online Metrics

Offline Metrics:

● Metrics computed offline for measuring the performance of the


machine learning model
● These metrics include ranking, rating, and diversity metrics

Online Metrics:

● They are the metrics computed with the recommendation system in


production that reflects how the model is helping the business
● These metrics include CTR, conversion rate, monthly active usage,
average revenue per user, etc.
● Also known as business metrics.

Evaluating Recommendation models

In some cases, we are concerned about rating performance

● RMSE (Root mean squared error)


● MAE (Mean absolute error)

In most practical cases, we are concerned about ranking performance in:

● Precision
● Recall
● Mean Average Precision (MAP) - weighted on position
● NDCG - closely linked to MAP
Rating Metrics

– Measure how accurate a recommender is at predicting ratings

Regression Metrics
● Root Mean Square Error (RMSE) - it measures the average error in the
predicted ratings. More affected by outliers, but maintains robustness.
● Mean Absolute Error - Similar to RMSE but it uses the absolute value
instead of the square. It’s a better estimator of the true average error.
● R squared - It evaluates how well a model performs, based on the
proportion of total variations of the observed results.
● Explained Variance - Evaluates how much of the variance in the data is
explained by the model.

Classification Metrics
● Area under the curve (AUC) - Integral area under the receiver
operating characteristic curve. It represents the ability to discriminate
between positive and negative classes.
● Logistic Loss (LogLoss) - the negative log-likelihood of the true labels
given the predictions of a classifier. Log loss penalises heavily
classifiers that are confident about incorrect classifications. Also
called cross-entropy loss.

Ranking Metrics

– Measure how relevant recommendations are for users

● Precision - it measures the proportion of recommended items that are


relevant. In other words, it is the ability of the model to label a correct
sample as correct.
● Recall - It measures the proportion of relevant items that were
recommended. Ability to capture all correct samples.
● NDCG - Evaluates how well the recommender performs in
recommending ranked items to users.
● Mean Average Precision (MAP) - It evaluates the average precision for
each user in the dataset.

Long-tail items

– Item distribution has the form of a long tail. Non-popular items can be
highly profitable but suffer from the cold-start problem.

Diversity Metrics
– Measures effectiveness and added-value of a recommender

● Novelty - it measures how novel recommendation items are by


calculating their recommendation frequency among users
● Diversity - it measures how different the items in a set are with respect
to each other
● Serendipity - it measures how surprising recommendations are to a
specific user by comparing them to the items that the user has already
interacted with.
● Coverage - It measures the proportion of items recommended by the
system. Coverage can be based on all the item catalog or based on a
distribution.

ALS VS RANDOM

● ALS recommender outperforms the random recommender on ranking


metrics
● The random recommender outperforms ALS on diversity metrics
● ALS is optimised for estimating the item rating as accurately as
possible.
○ The items recommended tend to be popular items

● The long-tail less popular items have less chance of getting introduced
to the users, so ALS doesn’t perform that well in diversity metrics.

⸻⸻—

– In production, the model needs to provide value based on business


metrics

● CTR
● Conversion rate
● AOV - average order value
● MAU - Monthly active users
● LTV - lifetime value -> a good LTV should be 3 times the CAC

A/B Tests

● Used to measure an ML model in real time.


● Works by randomising an environment response into two groups, half
of the traffic foes to the machine learning model output, and the other
half has a random model.
● By comparing these metrics, it is possible to evaluate whether it is

beneficial to use the model or not (revenue-wise)


● A test with more than two groups is a “Multivariate test”

Architectures

– Batch Architecture

– Real-time Architecture

– 2 step recommender

⸻⸺
SIMILARITY METRICS

– Co-occurrence is a computation of item similarity

Affinity Matrix
● The affinity matrix captures the strength of the relationship between
users and items

In terms of type of interaction

● SAR can accept explicit data (ratings 1-5)


● It also accepts implicit data (view, click, purchase, etc)
○ Each interaction can be assigned a weight to capture its strength.

In terms of when the interaction occurred

● We can add a discount factor to give more importance to recent


interactions
● Then scaling factor causes the parameter T to serve as half-life:
events T units before t_0 will be given half the weight as those taking
place at t_0.

Affinity formula:

SAR scores:
Matrix factorisation: Model based approach

Latent Factors:

– Explains the ratings by characterizing all users and items in a more


global way
– The item factors weight the items in ways that may reflect e.g. comedy
vs drama, or uninterpretable dimensions.
– The user factors weight the user’s preferences towards the item
factors
Matrix factorisation

● The simplest way to model latent factors is as user & item vectors that
multiply (as inner products)
● Learn these factors from the data and use as model
● Predict an unseen rating of user-item by multiplying user factor with
item factor
○ The matrix factors P, Q have f columns, rows rest (latent

dimensions)
○ The number of factors f is also called the rank of the model

Matrix factorisation problem

● For regularised matrix factorisation, the task is to minimise:


Learning algorithms:

● Stochastic gradient descent


○ Parameters are updated in the opposite direction of gradient
○ Example: SVD
● Alternating least squares
○ Learn one of q or p at a time while keeping the other a constant,

and then alternate.


○ Example: ALS
Factorisation Machines (FM)

● FM extends matrix factorisation to model interactions of order 2


(cross-product between features)
● The input x contains users, items and their features as a single vector
● Trick: Estimate an approximation V of the interaction items

Field-Aware factorisation machines (FFM)

● FFM uses the same equation as FM, except in the V which varies
● It uses different factored latent factors for different groups of features
(fields)
● FFM solves the issue that the latent factors shared by features that
represent different categories of information, may not generalise well.
Decision Trees

● A decision tree is an ML algorithm that can be used for classification or


regression

● The intuition: You split the data into regions until all data points
belonging to each class are inside their own region

● Details:
○ DT models are fairly intuitive and easy to explain
○ DT cuts feature space in rectangles
○ DT -> can have as many categorical variables as needed
○ DT overfit easily. You can use Random Forest or Boosted Decision

Trees to solve that.


○ No need for normalisation

● To train a DT we optimise a loss function that finds the optimal split.


○ The optimal split is the one that minimises the classification error.
○ This is measured in terms of ‘purity’
○ You can measure purity by: GIni Index or Entropy.

● Gini is a measure of variance in the node


● Gini is high if there are many data points belonging to the wrong class
● Gini is zero if all the data points belong to the same class
● Gini is faster to compute than Entropy
● Also used in economics to measure inequality
● Entropy is a measure of uncertainty in the node
● Entropy is high if there are many data points belonging to the wrong
class
● Entropy is zero if all the data points belong to the same class
● Entropy is slower to compute than Gini
● It is used in stats to measure the expected amount of information that
can be drawn from a distribution.

Random Forest

● Can be used for classification and regression


● Improvement of top of DT and helps overcoming overfitting
● Uses BAGGING, short of bootstrap aggregation.
○ Data is randomly divided into bags, and on each bag we put part of

the data. The we train a decision tree on each bag and compute
the ensemble of these models
● Parallel training
● We build subsamples of data WITH replacement (this means that the
same data can go in the same bag) -> in order to not run out of
datapoints
● For finding the optimal split we use the same measures of impurity
that are used in individual Decision Trees, the Gini Index or entropy.

Gradient-Boosting Decision trees

● Gradient boosting is a machine learning technique that produces a


prediction model in the form of an ensemble of weak classifiers
● One of the most popular types of gradient boosting is GBDT, that
internally is made up of an ensemble of weak learners
● The two most popular GBDT frameworks are XGBoost and LightGBM.
● There are two different strategies to compute the trees: level-wise and
leaf-wise
● There are two ways of finding the split, using an approximation
(histogram) or by the exact split.
Tree computation

● The level-wise strategy:


○ Grows the tree level by level
○ Each node splits the data prioritising the nodes closer to the tree

root
○ Better for smaller datasets

● The leaf-wise strategy:


○ Grows the tree by splitting the data at the nodes with the highest

loss change.
○ Better in larger datasets where it is considerably faster

– A key challenge in training GBDT is the computational cost of finding


the best split

Exact split:

● Conventional techniques find the exact split for each leaf, and require
scanning through all the data in each iteration
● Slower, bur more accurate

Histogram split:

● It approximates the split by building histograms of the features


● Algorithm dies not need to evaluate every value of the features to
compute the split, but only the bins of the histogram.
● Faster, typically does not affect accuracy (particularly in large
datasets)

LightGBM

● Gradient boosting decision tree algorithm written in C++


● Has multiple APIs for Python and R
● It supports distributed training and GPU
● Split is approximated by creating a histogram of features
● Has +130 parameters
XGBoost

● XGBoost is a gradient boosting decision tree also written in C++


● Multiple APIs in python, R, java.
● Supports distributed training and GPU
● Allows BOTH exact and histogram split
● Almost same number of parameters than LightGBM

LightGBM Spark: SynapseML

● LightGBM is also a GBDT that uses Spark to accommodate large


datasets
● LightGBM in Spark is 10-30% faster than SparkML on the Higgs
dataset, and achieves a 15% increase in AUC.

● Data parallel distributed process:


○ The dataset is partitioned to each Spark worker
○ Workers use local data to construct local histograms
○ Workers merge histograms of different features for different

workers
○ Then workers find the local best split on local merged histograms

and sync up the global best split

Bayesian Pairwise Ranking (BPR)

● Algorithm focused on ranking


● Addresses the problem of distinguishing between negative feedback
and not interacted items
● It uses a loss function called pairwise loss
● The dataset is reformulated and divided into observed and unobserved
items
● The intuition behind it is that observed entries should be ranked higher
than unobserved ones.
● Pairwise loss can be applied to any ranking algorithm.

Pointwise Loss

● Many recommendation scenarios use implicit feedback (views, clicks,


buys, etc.)
● We only have positive feedback
● In the pointwise loss, the model is trained to predict as 1 the elements
in the observed dataset and 0 for the rest
● The big problem: All elements the model should rank are presented as
negative feedback during training, but in reality, some are unknown
and some are true negatives.

Pairwise Loss

● For each user, a pairwise matrix of items is generated


● The plus indicates that the user prefers item i over item j, and the
minus is that the user prefers item j over item i. The interrogation
indicates that we don’t know.
● Advantages:
○ Training set is defined as positive, negative and unknown

preferences.
○ The missing values between two non-observed items are exactly

the item pairs that have to be ranked by the model


○ The training data is created for the actual objective of ranking
○ Pairwise ranking is not restricted to implicit feedback and can be

used in explicit feedback by setting user preferences as pairs (i.e.


user u prefers item a more than item b)

Matrix Factorisation Limitations


● In matrix factorisation, we apply an inner product to the features of
users and items. It simply combines the latent features linearly.
● The inner product may not be sufficient to capture the complex
structure of user interaction data.

Neural Collaborative Filtering

● Generalised Matrix Factorisation


○ Neural CF layer is added to the classic MF model
○ Edge weights of the output layer are learned from training data
○ Generalise MF to a non-linear setting as a non-linear activation

function is used - more expressive.

● MLP model
○ For the case of concatenating user + item features
○ To give the model even higher flexibility and non-linearity

Loss function: Binary cross-entropy loss

Fusion of GMF and MLP

● MLP and GMF can be combined for even greater flexibility


● The two learn separate embeddings and are concatenated in the last
hidden layer

Generalised Matrix Factorisation Model

Multi-Layer perceptron Model


NCF Implementation

● Implemented in tensorflow
● Featured by different types of GMF and MLP layers

● Built-in data preparation module:


○ Negative sampling
○ Data loader in batches (with shuffling operation supported)
○ Leave-one-out evaluation protocol

● Parameters:
○ N factors - dimensionality of the latent space
○ Layer sizes - sizes of input layers
○ N epochs - number of epochs

Wide and Deep models

● Both wide model and deep model are jointly trained, not ensembled.
● Solves the problem for:
○ Wide part: ‘Memorisation’ - learning the frequent co-occurrence of

items or features.
○ Deep part: ‘Generalisation’ - based on transitivity of correlation

and tend to improve the diversity of the recommended items


Sequential VS Session-based recommendation systems

Sequential

● In many business scenarios, the recommendations are done in


sequences e.g. e-commerce.
● A sequential recommender combines the user’s historical preferences
with the user’s recent actions
● E.g. markov chains, RNNs, CNNs, Transformers.

Session-based

● The data is also typically sequential


● We don’t have user information - each session of the same user is
handled independently
● E.g. LSTMs, transformers.

Sequential data

● Common to set the data as session-parallel mini-batches


● Each sequence has a different length
● If any of the sessions end, the next available session is put in its place
● Sessions are assumed to be independent.
○ If different sessions are concatenated, some mechanism needs to

be in place to distinguish between them.


GRU-based

● It uses GRU networks


● Network’s input is the actual state of the session while the output is
the item of the next event in the session
● Items are one-hot encoded
● Sequence of items has a weight used as a discount if they have
occurred earlier.
● Output is the predicted preference of the items
LSTM-Based

● Good are modelling sequences - limited by sequence length

● Modifications
○ Due to the time irregularity between interactions, we can change

the gating logic to account for different time intervals


○ Not all items are related, we can use attention mechanisms to filter

out irrelevant items or distinguish different levels of influence.


○ To capture long-term behaviour, we can add matrix factorisation

techniques.
Sequential Convolutions

● The idea is to embed a sequence of recent items into an ‘image’ in the


time and latent spaces
● Horizontal and vertical convolutions are applied to the image
● Vertical filters encode how single items influence the next item
● Horizontal filters group the influence of items
● Using the previous 4 actions (L=4), we predict the items the user will
interact within the next 2 steps (T=2)
Dilated Convolutions

● Dilated convolutions can be used to model long-range dependencies


in comparison to CNNs.
● They allow the network to increase the receptive field of a layer
without adding extra parameters
● A longer receptive field captures features at multiple scales
● The idea of dilation is to apply the convolutional filter over a filed larger
than its original length by dilating it with zeros. It is also referred to as
a holed or sparse filter.

Transformed-based

● Strong for sequential recommendations


● Transformers improve over LSTMs and CNNs for sequences
● They use the ‘attention mechanism’ to focus on items relevant to the
next action
Graph Neural Networks (GNNs)

– They are neural networks designed for graph data

● Great for understanding networks or sequences


● Composed of nodes (entities) and edges (relationships)
● Leverage locality
● Specify different types of edges
● Computationally efficient
● Applicable to inductive problems to generalise to new data
Motivation of GNNs for Reco

● Great mechanism to learn relationships

– High-order connectivity
– Encode the interaction signal in the embeddings
– GNNs are powerful with graph-structured data

High Order Connectivity

– GNNs help achieve high-order connectivity

● Standard reco models capture similarity


○ User-user (user collaborative filtering)
○ Item-item (item collaborative filtering)
○ User-Item (model collaborative filtering)
Interaction in the embeddings

– GNNs can be used to encode the interaction signal in the embeddings


– Items are user features and users are item features

● Interacted items are user features: they provide direct evidence of


user preferences
● Users that consume an item are item’s features: they measure the
collaborative similarity of two items

Graph-structured data

● GGNs learn not only features but also the structure of the data
● Data sparsity is addressed via the graph connections

DKN - Knowledge-graph network

● Combination of a knowledge graph, attention and convolutions for


news recommendations.
● CTR prediction
● Joint Learning of semantic-level and knowledge-level representations
● Use of attention to encode the user’s historical behaviour
NLP approaches

Sequence to Sequence

● An encoder takes an input sequence and transforms it into a vector


representation
● A decoder takes the representation an outputs a different sequence
● E.g. Vanilla transformer

Autoencoders

● Train by adding noise to the text and reconstructing it. This model is
used in downstream tasks.
● E.g. BERT

Autoregressive

● Predict the future sequences based on previous ones.


● E.g. GPT

Transformers

● First architecture based solely on attention mechanisms


● Good for learning global word relationships and working in parallel
● Better behaviour than LSTMs and CNNs
● The encoder generates a hidden state that is fed to the decoder
● Key component: multi-head self-attention
● It’s the base of many successful NLP models

Attention mechanism

● Instead of producing a single hidden state for the input sequence, the
encoder creates a hidden state at each step that the decoder can
access
● Understand better the context, but the input to the decoder is huge
● Attention gives different weights to the encoder states. It measures
the importance

Self-attention intuition

● As it processes each word, self-attention allows it to look at other


positions in the input sequence for clues that can help lead to a better
encoding for this word.
● Self-attention allows the transformer model to capture dependencies
between all positions in the sequence, regardless of their relative
position. e.g. long term dependencies.

● It computes a weighted sum of the values to all positions, where the


weights are determined by the similarity between the query and key
vectors at each position.
● Query and vectors are derived from the input embeddings and are
used to compute an attention score for each position
● The attention score represents the importance of each position in the
sequence relative to the current position
○ Used to compute a wighted sum of the values at each position
◆ This is the output of the self-attention mechanism.
◆ This output is then fed through a feedforward neural network,

and the resulting output is used as the input to the next layer

– The main idea of self-attention is that instead of using fixed


embeddings for each token, we use the whole sequence to compute a
weighted average of each embedding

Self-Attention process

The most common attention implementation is the scaled dot-product:

. We project each token embedding into three matrices, q, k, and v


. Compute a similarity function between q and k, the dot product.
Similar queries and keys will have a high dot product.
. Compute the attention weights, by scaling the dot product and
applying a softmax
. Update the token embeddings by multiplying by the value vector

Softmax

Advantages

– It normalises a vector to a probability distribution

. It can convert any vector into a probability distribution, so it is easier


to compare
. It acts like a soft argmax function
. - Softmax will exaggerate that difference if one value is higher than
the others with the winning value close to one and all the others
close to zero
. - If there are several values close to the top, it will preserve them
all as highly probable, rather than artificially crushing close
second-place results.
. Softmax is differentiable, so it works well with backpropagation

Multi-head self-attention

● In multi-head attention, the attention mechanism is repeated multiple


times with linear projections of Q, K, and V.


● This allows the system to learn from different Q, K, and V
representations.
● In other words, this lets the transformer consider several previous
words simultaneously when predicting the next.
● The multi-head attention module that connects the encoder and
decoder will make sure that the encoder input sequence is taken into
account together with the decoder input sequence up to a given
position.
● This operation can be done in parallel.

Positional Encoding

● Unlike in RNNs that we know how sequence are fed into model
○ In transformers we need to encode the position of each element in

the input sequence


● The positional encoding is added to the input embeddings by adding a
fixed sinusoidal function of different frequencies and phases to each
embedding dimension.
● The frequency and phase of the sinusoidal function for each
dimension are determined by the position of the word in the sequence.

Residual Connections

Advantages

. Reduce the vanishing gradient since the gradient value is transferred


through the network
. It allows later layers to learn from features generated in the initial
layers, without the skip connection, that initial info would be lost.
. Skip connections help to maintain the gradient surface smooth and
without too many saddle points. This keeps gradient descent to get
stuck in local minima, is other words, the optimisation process is more
robust and then we can use deeper networks.

Layer Normalisation

● It normalises each input to have Zero mean and variance of one

Advantages

– Improved training stability


– Improved generalization
– Reduced sensitivity to initialisation
– Faster convergence

———

Transfer Learning

● First used in computer vision


● Common approach: train on one task, and then adapt it or fine tune it
on a new task.
● The original weights (body) of the task are used to initialise the
weights of the new task (head)

ImageNet Dataset

● Has become one of the most important benchmarks for computer


vision research.
● Several categories
○ Image classification
○ Object detection
● 1.2 million images
● 1000 classes of animals and objects

Freezing VS Finetuning

If the target and base domains are similar -> freeze and retrain the lat layer

If the target and base domains are different -> fine-tune all the network
Pretraining NLP

Supervised

● An LLM can be pretrained with translation pairs


● This is expensive and time consuming

Unsupervised

● Next token prediction: Autoregressive


○ Goal - Learn the set of parameters that accurately predict the next

token in a sequence given the previous tokens

● Masked token prediction: Denoising


○ Goal - Learn the set of parameters that accurately predict a mask

token given surrounding tokens.

Finetuning in NLP

● Finetuning adjusts the pretrained weights to a new task

● Supervised fine-tuning:
○ The traditional supervised learning model
○ The model learns to map the input data to the correct output labels

for the specific task

● Reward model finetuning:


○ A reward function is defined, and the model is fine-tuned using

reinforcement learning (RL) to maximise the reward

● Reinforcement Learning with Human Feedback (RLHF):


○ Used in ChatGPT
○ The model is fine-tuned using RL to maximise a reward function,

but the reward function is based on human feedback. The model


generates outputs, which are then evaluated by humans, who
provide feedback in the form of a reward signal.
○ The model is then updated to maximise the reward signal.

Bert Overview

● Bi-directional Encoder representations from transformers


● Encoder structure of the transformer
● Pretrained as a denoising autoencoder and autoregressive

● BERT large uses 340M parameters:


○ 24 encoder blocks
○ 1024 - dimensional embedding vectors
○ 16 attention blocks

● Deployed in Google search in 2019


● Many successful iterations: Roberta, Debra, DistilBert, Albert, etc.

Training BERT

● There are two pertaining tasks - masked language model and next
token prediction
● Pretrained on BooksCorpus (800M words) and English Wikipedia
(2500M words)
● The fine-tuning tasks are Q&A

Masked LM and Bidirectionality

– Masking tokens enables a true bidirectional model


Bert Tasks

● Sentence pair classification task


● Single sentence classification task
● Question answering task
● Single sentence tagging task
Zero-shot learning -> Without examples provided the model understands the
task based on the given instruction.
Methods used in Chatbots

Named-Entity Recognition (NER)

– New is the task of classifying words or key phrases into predefined


entities of interest
Steps of the NER process

. Text preprocessing: The text is preprocessed to remove any irrelevant


information, such as step words or punctuation
. Tokenisation: The text is broken down into individual tokens
. Part-of Speech tagging (POS): Each token is tagged with its part of
speech, such as noun, verb, or adjective.
. Named-Entity Recognition: The algorithm analyses the text and
identifies words or phrases that correspond to named entities, such as
people, places, or organisations.
. Post-Processing: The output of the NER algorithm may be post-
processed to improve the accuracy and consistency of the results.

– Some algorithms for NER are Conditional Random Fields, CNNs,


LSTMs, Decision Trees, and LLMs.

Q&A

– Q&A automatically generates answers to questions posed in natural


language.

Steps of the Q&A process:

. Text preprocessing: The text is preprocessed to remove any irrelevant


information, such as step words or punctuation
. Question Analysis: The question is analysed to determine its intent, the
type of answer expected, and any relevant context, such as time or
location
. Information Retrieval: The system searches for relevant information
that can be used to answer the question, using techniques such as
keyword matching, NER, and semantic search
. Answer Generation: The system generates one or more candidate
answers to the question, using a variety of techniques such as text
summarisation, knowledge representation, and inference.
. Answer ranking and selection: The candidate answers are ranked
based on their relevance, accuracy, and other criteria, and the best
answer is selected for presentation to the user.

Two types - Extractive and Abstractive Q&A

● Extractive Q&A involves selecting an answer from a given source of


information. The answer is extracted directly from the text, without any
modification or summarisation.
● Abstractive Q&A involves generating a new answer based on the
information contained in the source text. It requires understanding the
meaning and generating a summary or paraphrase of the relevant
information.

Summarisation

Steps of the Summarisation process:

. Text preprocessing: The text is preprocessed to remove any irrelevant


information, such as step words or punctuation
. Sentence extraction: The model identifies the most important
sentences from the original text. This can be done using various
methods, such as frequency analysis, graph-based algorithms, or
other machine learning models.
. Sentence Ranking: The extracted sentences are then ranked according
to their importance in the original text. This can be done using
different criteria, such as word frequency, sentence length, or
semantic similarity.
. Summary generation: The final step is to generate a summary using
the extracted and ranked sentences

Two types - Extractive and Abstractive Summarisation

● Extractive Summarisation involves selecting the most important


sentences or phrases from the original text and assembling them into
a summary
● In Abstractive Summarisation, the summary is generated by
understanding the meaning of the original text and generating a new
text that conveys the most important information

Text Classification

– Text classification involves assigning labels to text documents based


on their content

Steps of the Q&A process:

. Text preprocessing: The text is preprocessed to remove any irrelevant


information, such as step words or punctuation
. Feature engineering: The next step is to extract meaningful features
from the preprocessed text data. This can be done using various
techniques, such as Bag-of-words, TF-IDF, or word embeddings.
. Model training: The extracted features are then used to train a ML
.
model. Popular algorithms for text classification include Naive Bayes,
SVMs, and NNs.
. Model evaluation: The trained model is then evaluated using a
validation dataset to measure its performance. (Accuracy, precision,
recall, f1-score)
. Model deployment: Once the model has been evaluated and fine-
tuned, it can be deployed to classify new text data in real-world
applications

Types:

– Sentiment Analysis
– Sentence Classification
– Topic modelling
– Spam filtering
– Language classification

Language Modelling

– Language modelling is the task of predicting the likelihood of a


sequence of words

Steps of the Q&A process:

. Text preprocessing: The text is preprocessed to remove any irrelevant


information, such as step words or punctuation
. Feature engineering: The next step is to extract meaningful features
from the preprocessed text data. This can be done using various
techniques, such as Bag-of-words, TF-IDF, or word embeddings.
. Model training: Train the language model on the features. This involves
estimating the probability distribution over sequences of words using
maximum likelihood estimation.
. Model evaluation: Evaluate the performance of the language model on
a held-out test set of text data. This may involve computing perplexity,
a measure of how well the model predicts the test data
. Model deployment: Deploy the trained language model to perform the
desired task, such as text generation, machine translation, or speech
recognition. This may involve integrating the model into a larger
system.

Types:

– N-gram models
– Bayesian models
– Neural Network

RLHF Secret Sauce

● The standard loss of GPT-4 is cross-entropy loss. It learns to match


the probability distribution of the text
● RLHF instead of doing distribution matching, it does mode seeking
● RLHF biases the model toward the most rated preferences provided by
the human
● The effect is that you are losing a lot of the diversity of the base
model, in exchange for more reliable answers
● Regarding generating novelty and disruptive ideas, RLHF is not the
best solution because it constraints the model to output common
knowledge
● In same way, RLHF performs a function similar to Page Rank 20 years
ago at capturing the relevance of websites.

Memory in LLMs

LLMs understand the context you provide in the prompt. This acts like a default
short-term memory. However, it is useful to have a long-term memory that
stores and retrieves information from previous conversations. Memory uses the
stored information to generate more accurate responses in the future.

1 - Buffer Memory: Keep a buffer of all prior messages


2 - Conversation summary memory: Creates a summary of the conversation
over time
3 - Entity memory: Remembers facts about specific entities (names, locations,
product, etc.)
4 - Knowledge graph memory: Stores information about relationships between
entities. It represents knowledge in a structured way.
5 - Vector-stored memory: Stores conversations in a vector database, and
queries the top-K most ‘salient’ conversations every time it is called.

OpenAI API - ChatGPT for devs

Roles:

– System: it helps set the behaviour of the assistant


– User: This is the user’s prompt
– Assistant: This is ChatGPT’s response. It can be used to stored prior
responses

Threads are persistent and store the message history for each conversation.
They automatically truncate the history when it surpasses the context length.

Code interpreter - can run python code in a sandboxed execution environment

Knowledge retrieval - Augments the assistant with knowledge from outside its
model, such as information or documents provided by the user.

Functions - Allows the assistant to execute callbacks and call external APIs

ChatGPT plugins

● With plugins developers can connect applications to ChatGPT.


Examples include a web browser, a code interpreter, zapier, Expedia,
etc.

Microsoft Copilot

● Based on prometheus
○ Phase 1: When the user asks a question, the data is sent to the

Bing orchestrator. It generates multiple related search queries to


feed the system.
○ Phase 2: Query is combined with other pieces of information like

fresh data, news, answers, contextual signals, and location. This is


called grounding.
◆ During grounding, safeguards are applied to prevent offensive

and harmful content


○ Phase 3: The final answer is generated and enriched with relevant
citations

Google Gemini

● Multimodal learning - Gemini can process text, images, sound and


video
● Architecture - Based on the transformer decoder
● Gemini 1.0 Ultra has 56B parameters, Gemini 1.5 has 47B but reaches
parity with 1.0
● Gemini 1.0 Ultra has a context window of 128k tokens, while Gemini 1.5
has 1M tokens

You might also like