03 ML Essentials


CSE440: Natural Language

Processing II
Farig Sadeque
BRAC University
ML Essentials
Topics
- Classification
- Feature representation
- Probability
- Naive Bayes
- ML evaluation
Classification
- Let’s see an example from “Where’s My Mom?” by Julia Donaldson and Axel
Scheffler (ISBN-13: 978-0803732285)
- Context: a baby monkey is looking for his mom and cannot find her anywhere
- A butterfly comes to help and asks the baby monkey what his mother looks like
- And the story starts
Who is the classifier here? What are the features? What is a feature?
Classifier
- Classifier = butterfly
- Features
- bigger than baby
- not a great gray hunk
- no tusks, trunk
- knees not baggy
- tail coils around trees
- doesn’t slither, hiss
- no nest of eggs
- legs > 0
Classifier
- Objects are described by properties or features
- Classifiers make predictions based on properties
- Good classifiers consider many properties jointly
- Classifiers may overfit, i.e., perform well on training data, but poorly on
unseen data
We will mostly use discriminative classifiers in this course, where the classifier
function tries to draw (or predict) a boundary that separates different types of data.
Classifier
- A classifier needs to be trained
- We train a classifier on a set of data we call training data
- We try to fit the classifier to the training data as well as we can
- Should we do this? We will learn soon
- Then we try to predict the class of new data that was not seen during training
- This is the part where prediction comes in
Formal definitions
Classification in matrix form
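One standard way to write the setup formally (the symbols D, x, y, f, K, and X are notational choices made here for illustration):

% Training data: n labeled examples, each a feature vector with a label
D = \{(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\},
\qquad \mathbf{x}^{(i)} \in \mathbb{R}^{d}, \quad y^{(i)} \in \{1, \dots, K\}

% A classifier is a function from feature vectors to labels
f : \mathbb{R}^{d} \to \{1, \dots, K\}, \qquad \hat{y} = f(\mathbf{x})

% Matrix form: stack the feature vectors as the rows of a design matrix
X \in \mathbb{R}^{n \times d}, \qquad \mathbf{y} \in \{1, \dots, K\}^{n}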
Features are easy for regular ML tasks
- Like you have seen in the butterfly classifier, or other examples
- What features do we use when we are trying to classify natural language
data?
- Let’s see an example: BoW features
Bag-of-words features
- A bag-of-words feature representation means that for each word i in the
vocabulary, there is a feature function f_i that produces the count of word i in
the text.
- Example: f_great(great scenes great film) = 2
Bag-of-words features
How do we do it?
- Very easy
- Create a set w of all unique words in the training data
- For each sentence, create a vector v of size |w| with every element
initialized to 0
- If word i appears in that sentence, replace the 0 in v_i with the count of that
word in that sentence
- That’s it
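A minimal Python sketch of this procedure (the function name and the whitespace tokenization are illustrative choices, not from the slides):

from collections import Counter

def bow_matrix(sentences):
    # Vocabulary: all unique words seen in the (training) sentences
    vocab = sorted({word for s in sentences for word in s.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}

    # One row per sentence, one column per vocabulary word, initialized to 0
    matrix = []
    for s in sentences:
        row = [0] * len(vocab)
        for word, count in Counter(s.lower().split()).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix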
What are the issues with BoW?
- One feature function per word → large, sparse matrices
- What’s wrong with sparsity?
- Completely ignores word order
- But often still useful
Classwork
Construct the feature matrix for these four text messages:
- Sorry I’ll call later
- U can call me now
- U have won call now!!
- Sorry! U can not unsubscribe
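One way to check your matrix, reusing the bow_matrix sketch above (note that this simple tokenizer keeps punctuation attached, so "now" and "now!!" count as different words; you may choose to normalize them):

messages = [
    "Sorry I'll call later",
    "U can call me now",
    "U have won call now!!",
    "Sorry! U can not unsubscribe",
]

vocab, matrix = bow_matrix(messages)
print(vocab)
for message, row in zip(messages, matrix):
    print(row, message)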
Classwork
What now?
- So, we have learned what features are
- Now, how are we going to use these features to classify natural language
data?
- Let’s see one of the easiest types of classifiers: it uses probability
Probability review
Prior Probability
- The unconditional or prior probability of a proposition a is the degree of belief
in that proposition given no other information

- P(DieRoll = 5) = 1/6
- P(CardDrawn = A♠) = 1/52
- P(SkyInTucson = sunny) = 286/365
Joint Probability
- The joint probability of propositions a1, . . . , an is the degree of belief in the
proposition a1 ∧ . . . ∧ an

- P(A,♠) = P(A ∧ ♠) = 1/52


Conditional Probability
- The posterior or conditional probability of a proposition a given a proposition b
is the degree of belief in a, given that we know only b

- P(Card = A♠|CardSuit = ♠) = 1/13


- P(DieRoll2 = 5 | DieRoll1 = 5) = 1/6 (the two rolls are independent)
Relation between Joint and Conditional
Product Rule
- P(a ∧ b) = P(a|b)P(b) or P(a|b) = P(a ∧ b)/P(b)

- P(A ∧ ♠) = 1/52
- P(A|♠) = 1/13
- P(♠) = 1/4
- P(A|♠)P(♠) = 1/13 · 1/4 = 1/52
Bayes’ Rule
P(b|a) = P(a|b)P(b)/P(a)
Let’s derive this equation.
Purpose of Bayes’ rule: Swap the conditioned and conditioning variables
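A short derivation from the product rule on the earlier slide:

% Product rule, applied in both orders:
P(a \wedge b) = P(a \mid b)\,P(b) = P(b \mid a)\,P(a)

% Divide both sides by P(a), assuming P(a) > 0:
P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)}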
Probabilistic classifiers
Naïve Bayes classifiers
Naïve Bayes classifiers
- Naïve Bayes classifier assumes conditional independence
- Two events A and B are conditionally independent given an event C if
P(A ∩ B | C) = P(A | C) P(B | C)
- A more general version:
P(A1, A2, …, An | C) = ∏i P(Ai | C)
Let’s classify
- Or, in other words, train a classifier on training data and predict the class of
new data
- We will need:
- The prior probability of a class y: P(y)
- The conditional probability of each feature f_i: P(f_i | y)
- Where do we get these probabilities from?
- From the training data
- This is the training phase
- Recall the equation from the previous slide:
- We multiply all these probabilities to get the final probability of a test
example (this is the decision rule written out below)
- This is the test phase
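Putting the two ingredients together, this is the usual Naïve Bayes decision rule: the prior times the product of the per-feature conditionals, maximized over classes.

\hat{y} \;=\; \arg\max_{y} \; P(y) \prod_{i} P(f_i \mid y)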
Let’s clear this up with an example
We have faced our first issue!
- Multiplying probabilities always has this issue: if any single probability is 0, the
entire product becomes 0
- What should we do?
- We use a technique called smoothing

m = number of unique words in the training data


Let’s try to do this example again, but now with smoothing
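A minimal sketch of a Naïve Bayes classifier with add-one (Laplace) smoothing, using bag-of-words counts as before. The function names are choices made here, and the smoothed estimate follows the common textbook form P(word | y) = (count(word, y) + 1) / (total word count in class y + m), with m the number of unique words in the training data:

import math
from collections import Counter, defaultdict

def train_nb(sentences, labels):
    # Training phase: class priors come from label counts,
    # conditional probabilities come from per-class word counts
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for sentence, y in zip(sentences, labels):
        word_counts[y].update(sentence.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(sentence, class_counts, word_counts, vocab):
    # Test phase: pick the class maximizing log P(y) + sum of log P(w | y)
    m = len(vocab)  # number of unique words in the training data
    n_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for y in class_counts:
        score = math.log(class_counts[y] / n_docs)  # log prior
        total = sum(word_counts[y].values())
        for w in sentence.lower().split():
            if w not in vocab:
                continue  # one common choice: ignore words never seen in training
            # add-one smoothed conditional probability
            score += math.log((word_counts[y][w] + 1) / (total + m))
        if score > best_score:
            best_class, best_score = y, score
    return best_class

Working in log space here is a standard trick: adding log probabilities avoids the numerical underflow that multiplying many small probabilities would cause.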
After the calculation
Refining features
- Binarization
- Binary multinomial naïve Bayes assumes that what matters is whether or not
a word occurs in a document, not its frequency in the document.
- Equivalent to removing duplicate words in each document.
Refining features
- N-gram features
- Instead of one word at a time, consider multiple words
- Bag-of-words assumption is sometimes too simplistic:
- good vs. not good
- like vs. didn’t like
- A bag-of-n-grams feature representation has a vocabulary that includes phrases up to
n tokens long.
- Example: f_{great film}(great scenes great film) = 1
- This is a bag-of-bigrams feature
- bag-of-words = bag-of-1-grams (a.k.a. unigrams)
- Consequences:
- Vocabulary size grows quickly as n increases
- Small counts; most n-grams never seen for large n
- For word n-grams, typically n ≤ 3
- For character n-grams, typically n ≤ 10
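A small sketch of word n-gram extraction (the function is illustrative; n = 2 gives bigrams, n = 1 gives the bag-of-words tokens):

def ngrams(text, n=2):
    # Return the list of word n-grams in the text
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# ngrams("great scenes great film", 2)
# -> ['great scenes', 'scenes great', 'great film']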
Refining features
- Lexicon features
- Sometimes there isn’t enough training data.
- Example: sentiment training data that doesn’t include wretched, dreadful, genial, etc.
- A lexicon is a list of words (not sentences or documents) that have been
annotated for a task.
- Example: the LIWC lexicon labels words for categories such as Positive emotion, Family, etc.
- Lexicons are typically used to produce count features.
- Example: f_posemo(great scenes beautiful film) = 2 if
- great and beautiful are tagged as posemo in the lexicon
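A sketch of a lexicon count feature (the tiny posemo word set below is a made-up stand-in for a real lexicon category such as LIWC's Positive emotion):

# Hypothetical stand-in for a "positive emotion" lexicon category
POSEMO = {"great", "beautiful", "wonderful", "genial"}

def f_posemo(text):
    # Count how many tokens in the text appear in the lexicon category
    return sum(1 for token in text.lower().split() if token in POSEMO)

# f_posemo("great scenes beautiful film") -> 2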
Rule-based features
- There are no limits on the algorithm that a feature function may apply.
- Examples:
- How many times does [$][\d,]{7} match?
- What proportion of first line’s characters are capitals?
- What is the ratio of text to image area (in HTML)?
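For instance, a feature based on the first regex might look like this sketch (the pattern is the one from the slide; the function name is an assumption):

import re

DOLLAR_AMOUNT = re.compile(r"[$][\d,]{7}")

def f_dollar_amounts(text):
    # Count matches of a dollar sign followed by seven digits or commas
    return len(DOLLAR_AMOUNT.findall(text))

# f_dollar_amounts("Send $1,000,000 now") -> 1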
Probabilistic learners are simple
- In everyday life, the learners we use make mistakes and then learn from those
mistakes
- Let’s say we want to draw a line to separate spam vs. not spam (a linear
classifier)
- The line splits spam and non-spam messages into two regions
- If a spam message falls in the non-spam region, the classifier made a mistake
- This line is called a “decision boundary”
- How do we do that?
- We can penalize the model for making mistakes
- How can we penalize? Should we penalize the same amount for every mistake it makes?
- Enter the idea of Loss
Loss
Cross entropy loss
Cross entropy loss
- This cross-entropy loss, L_CE(w, b; x, y), measures how bad w and b are on
the single example (x, y).
- An ML model runs through each of the training examples and tries to reduce
this cross-entropy loss as it goes along, and thus learns the patterns in the
examples
- Sometimes, a model goes through all training examples multiple times to
reduce the loss further
- Each of these runs is called an epoch
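Assuming the binary (spam vs. not spam) setting from the earlier slide, with a sigmoid output σ and label y ∈ {0, 1}, the per-example cross-entropy loss takes the standard form:

\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b), \qquad
L_{CE}(\mathbf{w}, b; \mathbf{x}, y) = -\bigl[\, y \log \hat{y} + (1 - y)\,\log(1 - \hat{y}) \,\bigr]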
ML best practices
- Data splitting
- Underfitting and overfitting
- Performance metrics
- Statistical significance
Data splitting
● Train classifier on data E_train
● Test classifier on E_test
Data splitting
● Never peek into the test data!
○ Not even accidentally

● What are the accuracies of these models?


● How do you improve them?
Data splitting: improving a model
● Bad idea
○ Train on the train data
○ Evaluate on test data
○ Go back to the training phase and tune parameters
● What will it do?
● Better idea
○ Split data into three parts
■ Train on the major chunk of the data (E_train)
■ Tune on a very small piece of data (E_val)
■ Test on another small chunk of data (E_test)
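One common way to get such a split, sketched here with scikit-learn's train_test_split (the toy data and the roughly 80/10/10 proportions are illustrative choices):

from sklearn.model_selection import train_test_split

# Toy data: 10 text messages with spam (1) / not-spam (0) labels
X = [f"message {i}" for i in range(10)]
y = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

# First carve out a 10% test set, then split the rest into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=1/9, random_state=42)
# 1/9 of the remaining 90% is 10% of the original data, giving roughly 80/10/10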
Underfitting and Overfitting
Learning Curve
Underfitting
● Insufficient training data
● Model is too simple
Addressing underfitting
Overfitting
Addressing overfitting
Performance metrics
● Accuracy is a bad performance metric when:
○ the distribution of categories is unbalanced, or
○ you care about one category more than the others
○ Example: detecting spam in text messages
○ Example: diagnosing depression from tweets
● What should we do?
● Use other metrics: precision, recall, F-1 score
Confusion matrix
Precision, recall and F-1 score
Precision = ratio between true + (or -) and predicted + (or -): TP / (TP + FP)

Recall = ratio between true + (or -) and actual + (or -): TP / (TP + FN)

F-1 score: harmonic mean of precision and recall: 2 · P · R / (P + R)

There are other task-specific metrics as well:


● BLEU: machine translation
● WER: ASR
● Cosine similarity: semantic similarity
● Euclidean distance: spell checking
● F-latency: early detection
More than one category of interest
● Sometimes we care about more than one category in classification
○ In sentiment analysis we care about both positive and negative sentiments
● Use
○ Macro-average: our regular average
○ Micro-average: combine-and-calculate
● Let’s calculate macro- and micro-average precision for this case:

We predicted 51 sentences as positive and 42 as negative, where 37 of the positive
predictions were actually positive and 18 of the negative predictions were actually
negative.

● What is macro-precision?
● What is the micro-precision?
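One way to check the arithmetic, treating each class's predicted and correct counts as given (a sketch; the variable names are choices made here):

# Predicted counts and correct ("true") counts for each class of interest
predicted = {"positive": 51, "negative": 42}
correct = {"positive": 37, "negative": 18}

# Macro-average: compute precision per class, then take the plain average
per_class = {c: correct[c] / predicted[c] for c in predicted}
macro_precision = sum(per_class.values()) / len(per_class)

# Micro-average: pool the counts across classes, then compute one precision
micro_precision = sum(correct.values()) / sum(predicted.values())

print(per_class)        # {'positive': 37/51, 'negative': 18/42}
print(macro_precision)  # about 0.577
print(micro_precision)  # 55 / 93, about 0.591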
Statistical Significance
● On some test set T, classifier A gets .992 F1, B gets .991 F1.
○ Is A actually better than B?
● Competing hypotheses:
○ A is generally better than B
○ A was better than B by chance on T, but would not be better than B in general, i.e., the null
hypothesis (H0)
○ We would like to reject the null hypothesis if we want to establish that A is actually better
than B
● How to do this? One way can be:
○ We can do some sort of statistical significance test (e.g., Student’s t-test) using multiple samples
○ If the resulting p-value is less than a preset significance level (e.g., 0.05), we reject the null
hypothesis
○ How to get the samples? Many ways (e.g. cross-validation)
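A sketch of how such a test might look with per-fold F1 scores and a paired t-test from SciPy (the F1 numbers below are made-up placeholders, not results from any real system):

from scipy import stats

# Hypothetical per-fold F1 scores for classifiers A and B (5-fold cross-validation)
f1_a = [0.991, 0.993, 0.990, 0.994, 0.992]
f1_b = [0.990, 0.992, 0.989, 0.993, 0.991]

# Paired t-test, since both classifiers are evaluated on the same folds
result = stats.ttest_rel(f1_a, f1_b)

alpha = 0.05
if result.pvalue < alpha:
    print(f"Reject H0 (p = {result.pvalue:.4f}): A is likely better than B in general")
else:
    print(f"Cannot reject H0 (p = {result.pvalue:.4f})")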
