007 - Z-Score, Text Classification, and TF-IDF


Topic #1 - Z-Score: A Handy Tool for Detecting Outliers in Data

The z-score measures the number of standard deviations that a data point is above or below
the mean of the distribution. It is calculated as:
z = (x − µ) / σ

where:
x is the value of the data point
µ is the mean of the distribution
σ is the standard deviation of the distribution

The z-score can be used to compare values from different normal distributions, as it expresses each
value in terms of its distance from the mean in units of standard deviation. It is also useful in
identifying outliers or extreme values in a data set.

The z-score, or standardized score, is a useful statistical tool in many ways. Some of the benefits of using the z-score include (see the sketch after this list):

Standardization: The z-score converts data onto a common scale. This allows easy comparison between datasets that have different means and standard deviations.

Normal distribution: When the data are approximately normally distributed, z-scores support statistical tests that rely on the normal distribution, such as hypothesis testing and the estimation of confidence intervals.

Outlier detection: The z-score can be used to identify outliers in a dataset. A common rule of thumb treats any data point with a z-score greater than 3 or less than -3 as an outlier.

Probability calculations: The z-score can be used to calculate probabilities and percentiles for a given dataset. This is particularly useful in hypothesis testing, where you can calculate the probability of observing a given result by chance.

Data transformation: The z-score rescales a dataset to mean 0 and standard deviation 1 (a standard normal distribution when the data are normal). This transformation simplifies the calculation of certain statistics and allows for easier interpretation of results.
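As a quick illustration of the points above, here is a minimal Python sketch (assuming NumPy is available; the sample values are made up) that standardizes two datasets onto a common scale:

import numpy as np

def z_scores(values):
    """Convert raw values to z-scores: z = (x - mean) / std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical exam scores from two classes graded on different scales
class_a = [62, 70, 75, 80, 88]
class_b = [540, 610, 650, 700, 760]
print(z_scores(class_a))   # both results are now on a common, unitless scale
print(z_scores(class_b))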

Example 1
Let’s assume you are analyzing the sales performance of a team of salespeople. The mean sales of the
group are $75,000, and the standard deviation is $10,000. You want to know how well a particular
salesperson is performing relative to the rest of the team if their sales are $85,000.

Solution 1
To find out, you can calculate the z-score as:
z = (x − µ) / σ = (85,000 − 75,000) / 10,000 = 1
This means that the salesperson's sales are 1 standard deviation above the mean of the distribution. Since the standard deviation is $10,000, sales of $85,000 are $10,000 above the mean sales of the group.

You can interpret the z-score as follows: assuming a normal distribution, the salesperson's sales are better than those of 84.13% of the sales team. This figure can be found by looking up the z-score in a standard normal distribution table or by using statistical software.
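The 84.13% figure can be reproduced with SciPy's standard normal CDF, as in this short sketch:

from scipy.stats import norm

mean, std = 75_000, 10_000
sales = 85_000
z = (sales - mean) / std                 # = 1.0
print(f"z = {z:.2f}, percentile = {norm.cdf(z):.2%}")
# z = 1.00, percentile = 84.13%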
Example 2
Suppose we have a dataset of daily sales for a retail store over the past 30 days:
100 150 120 125 140 130 110 135 130 150 140 100 95 80 120
125 130 100 140 135 130 145 110 120 130 135 140 125 130 120

We want to identify any days where the sales revenue is significantly different from the other days, which would indicate an anomaly or outlier.

Solution 2

Step 1: Calculate the mean and standard deviation

Mean ≈ 124.7 (≈ 125) and standard deviation σ ≈ 16.7

Step 2: Calculate the z-score for each value

z_i = (x_i − µ) / σ

Step 3: Set a threshold

We set a threshold for what we consider an anomaly. Usually, a z-score of 3 is used as the cut-off value, since the range from -3 to +3 captures 99.7% of the data in a normal distribution. Therefore, any z-score greater than +3 or less than -3 is considered an outlier, which is essentially the same rule as the standard-deviation method.

Step 4: Identify anomalous values

The most extreme value is the day with sales of 80, with z = (80 − 124.7) / 16.7 ≈ -2.7. No value exceeds the ±3 threshold, so there is no outlier in this dataset.
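Steps 1-4 can be verified programmatically with a small NumPy sketch over the 30 daily sales figures (using the population standard deviation, as in the solution above):

import numpy as np

sales = np.array([100, 150, 120, 125, 140, 130, 110, 135, 130, 150,
                  140, 100,  95,  80, 120, 125, 130, 100, 140, 135,
                  130, 145, 110, 120, 130, 135, 140, 125, 130, 120])

mu, sigma = sales.mean(), sales.std()       # ~124.7 and ~16.7
z = (sales - mu) / sigma
print(f"mean = {mu:.1f}, std = {sigma:.1f}")
print("most extreme z:", z[np.abs(z).argmax()].round(2))   # ~ -2.68 (the $80 day)
print("outliers (|z| > 3):", sales[np.abs(z) > 3])         # empty array: none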

Topic #2 - Text Classification using Naïve Bayes Classifier

The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems. Unlike discriminative classifiers such as logistic regression, it does not learn which features are most important for differentiating between classes. It is mainly used in text classification, which involves high-dimensional training datasets. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of each class given the object. Popular applications of the Naïve Bayes algorithm include spam filtering, sentiment analysis, and article classification.
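As an illustration of how such a classifier is typically used in practice, here is a minimal sketch with scikit-learn's MultinomialNB on a made-up spam-filtering corpus (the texts and labels are hypothetical):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = spam, 0 = not spam
texts = ["win free money now", "meeting at noon", "free money offer", "lunch tomorrow"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()                 # bag-of-words term counts
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

print(model.predict(vectorizer.transform(["free offer now"])))   # likely [1]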

Example
Assume that we are given the following dataset with training and testing data. Our goal is to classify the test data into the right class, h or ~h (read as "not h").

Document ID   Keywords in the document      Class h
Training Set:
1             Love Happy Joy Joy Happy      Yes
2             Happy Love Kick Joy Happy     Yes
3             Love Move Joy Good            Yes
4             Love Happy Joy Love Pain      Yes
5             Joy Love Pain Kick Pain       No
6             Pain Pain Love Kick           No
Testing Set:
7             Love Pain Joy Love Kick       ?

Solution
The probability of the document 'd' being in class 'c' is computed as follows:

P(c|d) ∝ P(c) × ∏ P(t_k|c)

where P(t_k|c) is the conditional probability of term t_k occurring in a document of class c.

The prior probabilities of a document's class, estimated from the six training documents, are:
P(h) = 4/6 = 2/3 and P(~h) = 2/6 = 1/3

That is, there is a 2/3 prior probability that a document will be classified as h and a 1/3 probability of ~h.
The conditional probability for each term is the relative frequency of the term occurring in each class of document (the 'h class' and the '~h class').

The testing example is: Love Pain Joy Love Kick = ?

Notice that we have 4 distinct terms in the test document: Love, Pain, Joy, and Kick.
Also, notice that we have 19 term occurrences in 'class h' and 9 term occurrences in 'class ~h'.
So, we have the following values:
Class h                                Class ~h
P(Love | h) = 5/19 ≈ 0.2632            P(Love | ~h) = 2/9 ≈ 0.2222
P(Pain | h) = 1/19 ≈ 0.0526            P(Pain | ~h) = 4/9 ≈ 0.4444
P(Joy | h)  = 5/19 ≈ 0.2632            P(Joy | ~h)  = 1/9 ≈ 0.1111
P(Kick | h) = 1/19 ≈ 0.0526            P(Kick | ~h) = 2/9 ≈ 0.2222

Now, we compute:

P(h|d7) = P(h) × P(Love|h) × P(Love|h) × P(Pain|h) × P(Joy|h) × P(Kick|h)
        = (2/3) × (5/19) × (5/19) × (1/19) × (5/19) × (1/19) ≈ 3.37 × 10⁻⁵

P(~h|d7) = P(~h) × P(Love|~h) × P(Love|~h) × P(Pain|~h) × P(Joy|~h) × P(Kick|~h)
         = (1/3) × (2/9) × (2/9) × (4/9) × (1/9) × (2/9) ≈ 1.81 × 10⁻⁴ ≈ 0.00018

Since P(~h|d7) is higher, the class label for the sentence 'Love Pain Joy Love Kick' is No (~h).
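The same computation can be checked with a short from-scratch Python sketch that mirrors the relative-frequency estimates above (no smoothing, exactly as in the worked example):

from collections import Counter

# Training documents and labels from the table above
train = [
    ("Love Happy Joy Joy Happy",  "h"),
    ("Happy Love Kick Joy Happy", "h"),
    ("Love Move Joy Good",        "h"),
    ("Love Happy Joy Love Pain",  "h"),
    ("Joy Love Pain Kick Pain",   "~h"),
    ("Pain Pain Love Kick",       "~h"),
]
test = "Love Pain Joy Love Kick".split()

term_counts = {"h": Counter(), "~h": Counter()}
doc_counts = Counter()
for text, c in train:
    term_counts[c].update(text.split())
    doc_counts[c] += 1

for c in ("h", "~h"):
    score = doc_counts[c] / len(train)           # prior P(c): 2/3 and 1/3
    total = sum(term_counts[c].values())         # 19 tokens in h, 9 in ~h
    for t in test:
        score *= term_counts[c][t] / total       # P(t|c), relative frequency
    print(f"P({c}|d7) ~ {score:.2e}")
# P(h|d7) ~ 3.37e-05, P(~h|d7) ~ 1.81e-04  ->  classify as ~h (No)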

Topic #3 - TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF is a numerical statistic used to reflect the importance of a word in a document within a collection of documents. It is commonly used in text classification and information retrieval tasks to weigh the importance of individual words in a document, based on how often they appear in that document and in how many documents of the collection they appear.

In text classification, TF-IDF can be used to weight the importance of the words in a document in order to improve the accuracy of the classification model or of the feature extraction. For example, in an emotion classification task, you could use the TF-IDF weights of the words in a document to help the model better understand the emotional content of the document.

To use TF-IDF in classification, you would typically pre-process the text data by calculating the TF-
IDF weights of the words in the documents, and then use these weights as features in your classification
model. This can be done using a variety of techniques, such as machine learning algorithms like support
vector machines (SVMs) or decision trees, or more advanced techniques like deep learning.
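For instance, a minimal scikit-learn pipeline along these lines might look as follows (the documents and labels are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled documents for an emotion-style classification task
docs = ["I am so happy today", "this is terrible news",
        "what a joyful wonderful moment", "I feel awful and sad"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF weights become the feature vectors fed to a linear SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["such happy wonderful news"]))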

It is worth noting that while TF-IDF can be a useful feature in text classification, it does not capture any
information about the meaning or context of the words beyond their frequency and distribution in the
documents. Other techniques, such as word2vec or GloVe, can be used to encode more information
about the meaning of the words, which can potentially improve the performance of the classification
model.

In practice, the TF-IDF algorithm weighs a keyword in any piece of content and assigns importance to that keyword based on the number of times it appears in the document. More importantly, it checks how relevant the keyword is throughout the other documents in the collection, which is referred to as the corpus.

For a term i in document j, the weight w_i,j of term i in document j is:

w_i,j = tf_i,j × log(N / df_i)

where:
tf_i,j = number of occurrences of term i in document j
df_i = number of documents containing term i
N = total number of documents

Important: TF-IDF weights are non-negative. The higher the numerical weight, the rarer and more distinctive the term; the smaller the weight, the more common the term.

Example
Assume that we have 4 documents containing the following terms. Calculate the TF-IDF value for each term, and identify the rarest term.
D1-D4: four short documents composed of repeated phrases over the terms "the", "quick", "brown", "fox", "jumps", and "over"; their exact term counts are given in the Term Frequency matrix below.

Solution
The Term Frequency (TF) matrix given in the table shows the frequency of terms per document.

Document/Term    the   quick   brown   fox   jumps   over
D1                5      9       4      0      5       6
D2                0      8       5      3     10       8
D3                3      5       6      6      5       0
D4                4      6       7      8      4       4

We need to calculate the weights w_i,j using the formula:

w_i,j = tf_i,j × log10(N / df_i)

For example, "the" occurs in 3 of the 4 documents (D1, D3, and D4), so:
TF-IDF("the" in D1) = 5 × log10(4/3) ≈ 0.625

Applying the above formula to all terms in all documents produces the following values:

Document/Term    the     quick   brown   fox     jumps   over
D1               0.625   0.000   0.000   0.000   0.000   0.750
D2               0.000   0.000   0.000   0.375   0.000   1.000
D3               0.375   0.000   0.000   0.750   0.000   0.000
D4               0.500   0.000   0.000   1.000   0.000   0.500
Total            1.499   0.000   0.000   2.124   0.000   2.249

This indicates that, among all terms and documents, "over" is the rarest (most distinctive) term, since it carries the highest total TF-IDF weight.
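The whole table can be reproduced with a few lines of NumPy (using the base-10 logarithm, matching the worked example):

import numpy as np

terms = ["the", "quick", "brown", "fox", "jumps", "over"]
# Term-frequency matrix from the table above (rows: D1..D4)
tf = np.array([[5, 9, 4, 0,  5, 6],
               [0, 8, 5, 3, 10, 8],
               [3, 5, 6, 6,  5, 0],
               [4, 6, 7, 8,  4, 4]])

N = tf.shape[0]                    # 4 documents
df = (tf > 0).sum(axis=0)          # documents containing each term
w = tf * np.log10(N / df)          # w_ij = tf_ij * log10(N / df_i)

totals = w.sum(axis=0).round(3)
print(dict(zip(terms, totals)))
# {'the': 1.499, 'quick': 0.0, 'brown': 0.0, 'fox': 2.124, 'jumps': 0.0, 'over': 2.249}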
