03 ML Essentials
Natural Language Processing II
Farig Sadeque
BRAC University
ML Essentials
Topics
- Classification
- Feature representation
- Probability
- Naive Bayes
- ML evaluation
Classification
- Let’s see an example from “Where’s My Mom?” by Julia Donaldson and Axel Scheffler (ISBN-13: 978-0803732285)
- Context: a baby monkey is looking for his mom, and cannot find her anywhere
- A butterfly comes to help and asks the baby monkey what his mother looks like
- And the story starts
Who is the classifier here? What are the features? What is a feature?
Classifier
- Classifier = butterfly
- Features
- bigger than baby
- not a great gray hunk
- no tusks, trunk
- knees not baggy
- tail coils around trees
- doesn’t slither, hiss
- no nest of eggs
- legs > 0
Classifier
- Objects are described by properties or features
- Classifiers make predictions based on properties
- Good classifiers consider many properties jointly
- Classifiers may overfit, i.e., perform well on training data, but poorly on
unseen data
We will mostly use discriminative classifiers in this course, where the classifier function tries to draw (or predict) a boundary that separates the different classes of data.
Classifier
- A classifier needs to be trained
- We train a classifier on a set of data we call training data
- We try to fit the classifier to the training data as well as we can
- Should we do this? We will learn soon
- Then we try to predict the class of new data that was not seen during training
- This is the part where prediction comes in
Formal definitions
Classification in matrix form
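The slides’ own formal definitions are not reproduced in this text; as a rough sketch of the standard setup (all notation below is assumed, not taken from the slides):

% Training data: N labelled examples, each described by d features
\[ \mathcal{D} = \{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}, \qquad \mathbf{x}^{(n)} \in \mathbb{R}^{d}, \quad y^{(n)} \in \{1, \ldots, K\} \]
% A classifier maps a feature vector to a class label
\[ f : \mathbb{R}^{d} \to \{1, \ldots, K\}, \qquad \hat{y} = f(\mathbf{x}) \]
% In matrix form: the rows of X are the feature vectors, y stacks the labels
\[ X \in \mathbb{R}^{N \times d}, \qquad \mathbf{y} \in \{1, \ldots, K\}^{N} \]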
Features are easy for regular ML tasks
- Like you have seen in the butterfly classifier, or other examples
- What features do we use when we are trying to classify natural language data?
- Let’s see an example: BoW features
Bag-of-words features
- A bag-of-words feature representation means that for each word in the vocabulary, there is a feature function f_i that produces the count of word i in the text.
- Example: f_great(“great scenes great film”) = 2
Bag-of-words features
How do we do it?
- Very easy
- Create the set of all unique words in the training data, w
- For each sentence, create a vector v of size |w| with every element initialized to 0
- If a word i appears in that sentence, replace the 0 in v_i with the count of that word in that sentence
- That’s it (a small code sketch follows)
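A minimal sketch of this recipe in Python (function and variable names are illustrative, not from the slides):

from collections import Counter

def build_vocabulary(sentences):
    """Collect the set of unique words in the training data."""
    words = {word for sentence in sentences for word in sentence.split()}
    # Map each word to a fixed column index
    return {word: i for i, word in enumerate(sorted(words))}

def bow_vector(sentence, vocab):
    """Vector of size |vocab| holding the count of each vocabulary word."""
    vector = [0] * len(vocab)
    for word, count in Counter(sentence.split()).items():
        if word in vocab:          # words unseen during training are ignored
            vector[vocab[word]] = count
    return vector

train = ["great scenes great film", "boring film"]
vocab = build_vocabulary(train)
print(bow_vector("great scenes great film", vocab))  # the 'great' column holds 2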
What are the issues with BoW?
- One feature function per word → large, sparse matrices
- What’s wrong with sparsity?
- Completely ignores word order
- But often still useful
Classwork
Construct the feature matrix for these four text messages:
- Sorry I’ll call later
- U can call me now
- U have won call now!!
- Sorry! U can not unsubscribe
Classwork
What now?
- So, we have learned what features are
- Now, how are we going to use these features to classify natural language data?
- Let’s look at one of the simplest types of classifiers, one that uses probability
Probability review
Prior Probability
- The unconditional or prior probability of a proposition a is the degree of belief
in that proposition given no other information
- P(DieRoll = 5) = 1/6
- P(CardDrawn = A♠) = 1/52
- P(SkyInTucson = sunny) = 286/365
Joint Probability
- The joint probability of propositions a1, . . . , an is the degree of belief in the
proposition a1 ∧ . . . ∧ an
- P(A ∧ ♠) = 1/52
- P(A|♠) = 1/13
- P(♠) = 1/4
- P(A|♠)P(♠) = 1/13 · 1/4 = 1/52
Bayes’ Rule
P(b|a) = P(a|b)P(b)/P(a)
Let’s derive this equation (a sketch of the derivation is given below).
Purpose of Bayes’ rule: Swap the conditioned and conditioning variables
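One way to carry out the derivation, using only the product rule for joint probabilities (standard algebra, written out here as a sketch):

% the joint probability factors in either order
\[ P(a \wedge b) = P(a \mid b)\,P(b) = P(b \mid a)\,P(a) \]
% dividing both sides by P(a), assuming P(a) > 0, gives Bayes’ rule
\[ P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)} \]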
Probabilistic classifiers
Naïve Bayes classifiers
Naïve Bayes classifiers
- Naïve Bayes classifier assumes conditional independence
- Two events A and B are conditionally independent given an event C if
P(A∩B|C)=P(A|C)P(B|C)
- A more general version:
P(A_1, A_2, A_3, …, A_n | C) = ∏_i P(A_i | C)
Let’s classify
- Or, in other words, train a classifier on training data and predict the class of new data
- We will need:
- The prior probability of each class y: P(y)
- And the conditional probability of each feature f_i given the class: P(f_i|y)
- Where do we get these probabilities from?
- From the training data
- This is the training phase
- Recall the equation from the previous slide:
- We multiply all these probabilities to get the final score for a test instance
- This is the test phase (sketched in code below)
Let’s clear this up with an example
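The worked example from the original slide is not included in this text dump; as an illustrative substitute, here is a minimal sketch of both phases for a bag-of-words Naïve Bayes classifier (function names and the tiny dataset are made up; smoothing comes on the next slide):

from collections import Counter, defaultdict
from math import prod

def train_nb(docs, labels):
    """Training phase: estimate P(y) and P(word | y) from counts (no smoothing yet)."""
    priors = {y: c / len(docs) for y, c in Counter(labels).items()}     # P(y)
    word_counts = defaultdict(Counter)                                  # count(word, y)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc.split())
    likelihoods = {
        y: {w: c / sum(counts.values()) for w, c in counts.items()}     # P(word | y)
        for y, counts in word_counts.items()
    }
    return priors, likelihoods

def predict_nb(doc, priors, likelihoods):
    """Test phase: score each class as P(y) * prod_i P(word_i | y), return the argmax."""
    scores = {}
    for y in priors:
        # a word never seen with class y contributes probability 0 (the issue raised below)
        scores[y] = priors[y] * prod(likelihoods[y].get(w, 0.0) for w in doc.split())
    return max(scores, key=scores.get), scores

priors, likelihoods = train_nb(
    ["great scenes great film", "boring slow film"], ["pos", "neg"])
print(predict_nb("great film", priors, likelihoods))                    # -> 'pos'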
We have faced our first issue!
- Multiplying probabilities always has this type of issue: if any single probability is 0, the entire product becomes 0
- What should we do?
- We use a technique called smoothing (one common version is sketched below)
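The slides do not say which smoothing variant is used, but the most common choice here is add-one (Laplace) smoothing of the word likelihoods, with V the vocabulary:

% every (feature, class) pair gets a pseudo-count of 1, so no likelihood is ever exactly zero
\[ P(f_i \mid y) = \frac{\mathrm{count}(f_i, y) + 1}{\sum_{f \in V} \mathrm{count}(f, y) + \lvert V \rvert} \]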
ML evaluation
● What is macro-precision?
● What is micro-precision?
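As a reminder of the difference, a small sketch that computes both on a made-up set of predictions (labels and values below are illustrative only):

from collections import Counter

y_true = ["ham", "ham", "ham", "ham", "spam", "spam"]
y_pred = ["ham", "ham", "ham", "spam", "spam", "ham"]

classes = sorted(set(y_true) | set(y_pred))
tp, fp = Counter(), Counter()          # per-class true/false positives
for t, p in zip(y_true, y_pred):
    if t == p:
        tp[p] += 1
    else:
        fp[p] += 1

# Macro-precision: compute precision per class, then average the per-class values
per_class = [tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in classes]
macro_precision = sum(per_class) / len(classes)

# Micro-precision: pool the counts over all classes, then compute a single precision
micro_precision = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

print(macro_precision, micro_precision)   # 0.625 vs ~0.667 on this toy data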
Statistical Significance
● On some test set T, classifier A gets .992 F1, B gets .991 F1.
○ Is A actually better than B?
● Competing hypotheses:
○ A is generally better than B
○ A was better than B by chance on T, but would not be better than B in general, i.e., the null
hypothesis (H0)
○ We would like to reject the null hypothesis if we want to establish that A is actually better
than B
● How to do this? One way:
○ We can run a statistical significance test (e.g., Student’s t-test) over multiple samples of the two classifiers’ scores
○ If the resulting p-value is less than a preset significance threshold (e.g., 0.05), we reject the null hypothesis
○ How to get the samples? Many ways (e.g., cross-validation); a sketch follows below
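A minimal sketch of such a test on per-fold F1 scores (the numbers are made up; scipy’s ttest_rel is one standard paired-test implementation):

from scipy.stats import ttest_rel

# Made-up per-fold F1 scores for classifiers A and B on the same 10 cross-validation folds
f1_a = [0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.95, 0.91, 0.93, 0.92]
f1_b = [0.90, 0.92, 0.92, 0.93, 0.89, 0.91, 0.94, 0.90, 0.92, 0.91]

# Paired t-test: both classifiers are evaluated on the same folds, so the samples are paired
statistic, p_value = ttest_rel(f1_a, f1_b)

alpha = 0.05                          # preset significance threshold
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0, A is likely better than B")
else:
    print(f"p = {p_value:.4f} >= {alpha}: cannot reject H0")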