
A Report on

Naive Bayesian Classifier


This report is submitted for the partial evaluation of the
subject “Machine Learning” of B. Tech Degree

In

Department of “Computer Science and Engineering”

At

NATIONAL INSTITUTE OF TECHNOLOGY SIKKIM

Submitted By
Ankit Saurav
Roll No. B170013CS
B.Tech, 3rd year, 6th Semester

Concerned Faculty

Dr. Pratyay Kuila

Assistant Professor
Department of Computer Science and Engineering
National Institute of Technology Sikkim
Introduction
Naive Bayes is a simple technique for constructing classifiers: models that
assign class labels to problem instances, represented as vectors of feature
values, where the class labels are drawn from some finite set. There is not a
single algorithm for training such classifiers, but a family of algorithms based
on a common principle: all naive Bayes classifiers assume that the value of a
particular feature is independent of the value of any other feature, given the
class variable. For example, a fruit may be considered to be an apple if it is
red, round, and about 10 cm in diameter. A naive Bayes classifier considers
each of these features to contribute independently to the probability that this
fruit is an apple, regardless of any possible correlations between the color,
roundness, and diameter features.

For some types of probability models, naive Bayes classifiers can be trained
very efficiently in a supervised learning setting. In many practical applications,
parameter estimation for naive Bayes models uses the method of maximum
likelihood; in other words, one can work with the naive Bayes model without
accepting Bayesian probability or using any Bayesian methods.
Despite their naive design and apparently oversimplified assumptions, naive
Bayes classifiers have worked quite well in many complex real-world
situations. In 2004, an analysis of the Bayesian classification problem showed
that there are sound theoretical reasons for the apparently implausible
efficacy of naive Bayes classifiers.[9] Still, a comprehensive comparison with
other classification algorithms in 2006 showed that Bayes classification is
outperformed by other approaches, such as boosted trees or random forests.

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.

Probabilistic model

Abstractly, naïve Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector X = (x1, ..., xn) representing some n features (independent variables), it assigns to this instance probabilities p(Ck | x1, ..., xn) for each of K possible outcomes or classes Ck.
The problem with the above formulation is that if the number of features n is
large or if a feature can take on a large number of values, then basing such a
model on probability tables is infeasible. We therefore reformulate the model
to make it more tractable. Using Bayes' theorem, the conditional probability
can be decomposed as

p(Ck | x1, ..., xn) = p(Ck) p(X | Ck) / p(X)
In practice, there is interest only in the numerator of that fraction, because the
denominator does not depend on C and the values of the features xi are
given, so that the denominator is effectively constant. The numerator is
equivalent to the joint probability model p(Ck, x1, ..., xn).
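The practical effect of dropping the denominator can be seen in a short numerical sketch. This is only an illustration: the class names and probability values below are made-up assumptions, not anything taken from the report.

# A minimal sketch: the posterior p(Ck | X) is proportional to the prior p(Ck)
# times the class-conditional likelihood p(X | Ck); the evidence p(X) only rescales.
priors = {"spam": 0.4, "ham": 0.6}          # assumed p(Ck)
likelihoods = {"spam": 0.02, "ham": 0.001}  # assumed p(X | Ck) for one fixed instance X

# Numerator of Bayes' theorem for each class, i.e. the joint probability p(Ck, X).
joint = {c: priors[c] * likelihoods[c] for c in priors}

# The denominator p(X) is the same for every class, so it cannot change
# which class ends up with the highest posterior.
evidence = sum(joint.values())
posteriors = {c: joint[c] / evidence for c in joint}

print(posteriors)   # approximately {'spam': 0.93, 'ham': 0.07}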

Constructing a classifier from the probability model

The discussion so far has derived the independent feature model, that is, the
naïve Bayes probability model. The naïve Bayes classifier combines this
model with a decision rule. One common rule is to pick the hypothesis that is
most probable; this is known as the maximum a posteriori or MAP decision
rule. The corresponding classifier, a Bayes classifier, is the function that
assigns a class label ŷ = Ck for some k as follows:

ŷ = argmax_{k ∈ {1, ..., K}} p(Ck) ∏_{i=1}^{n} p(xi | Ck)
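A minimal sketch of this MAP decision rule in Python, assuming the priors p(Ck) and the per-feature conditional probabilities p(xi | Ck) have already been estimated; the data structures and the fruit example values are hypothetical.

import math

def predict_map(x, priors, cond):
    """MAP decision rule: return the class k maximizing
    log p(Ck) + sum_i log p(xi | Ck).
    Working in log space avoids underflow but leaves the argmax unchanged.
    cond[c][i] maps a feature value to p(xi = value | Ck)."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(cond[c][i][value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical two-feature example: colour and shape of a fruit.
priors = {"apple": 0.5, "grape": 0.5}
cond = {
    "apple": [{"red": 0.7, "green": 0.3}, {"round": 0.9, "oval": 0.1}],
    "grape": [{"red": 0.2, "green": 0.8}, {"round": 0.5, "oval": 0.5}],
}
print(predict_map(("red", "round"), priors, cond))   # -> "apple"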

Parameter estimation and event models:

A class's prior may be calculated by assuming equiprobable classes (i.e., priors = 1/<number of classes>), or by calculating an estimate for the class probability from the training set (i.e., <prior for a given class> = <number of samples in the class>/<total number of samples>). To estimate the parameters for a feature's distribution, one must assume a distribution or generate nonparametric models for the features from the training set.
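Both ways of setting the prior can be written in a few lines; the labels used below are hypothetical.

from collections import Counter

def estimate_priors(labels, equiprobable=False):
    """Class priors either as 1/<number of classes> (equiprobable assumption)
    or as <samples in class>/<total samples> (empirical estimate)."""
    classes = sorted(set(labels))
    if equiprobable:
        return {c: 1.0 / len(classes) for c in classes}
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}

labels = ["spam", "ham", "ham", "ham"]
print(estimate_priors(labels))                     # {'ham': 0.75, 'spam': 0.25}
print(estimate_priors(labels, equiprobable=True))  # {'ham': 0.5, 'spam': 0.5}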

The assumptions on distributions of features are called the "event model" of the naïve Bayes classifier. For discrete features like the ones encountered in document classification (including spam filtering), multinomial and Bernoulli distributions are popular. These assumptions lead to two distinct models, which are often confused.
Gaussian naïve Bayes:

When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a normal (or Gaussian) distribution. For example, suppose the training data contains a continuous attribute, x. We first segment the data by the class, and then compute the mean and variance of x in each class. Let μk be the mean of the values in x associated with class Ck, and let σ²k be the Bessel-corrected variance of the values in x associated with class Ck. Suppose we have collected some observation value v. Then, the probability density of v given a class Ck, p(x = v | Ck), can be computed by plugging v into the equation for a normal distribution parameterized by μk and σ²k.
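The following is a small sketch of that computation. The attribute values are made up, and statistics.variance gives exactly the Bessel-corrected (n − 1) variance mentioned above.

import math
from statistics import mean, variance   # variance() uses the Bessel-corrected (n-1) form

def gaussian_density(v, mu, var):
    """Normal density used as p(x = v | Ck)."""
    return math.exp(-((v - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical continuous attribute x (say, fruit diameter in cm), segmented by class.
x_by_class = {"apple": [9.5, 10.2, 10.0, 9.8], "grape": [1.5, 1.8, 1.6]}

# Mean and Bessel-corrected variance of x in each class.
params = {c: (mean(xs), variance(xs)) for c, xs in x_by_class.items()}

v = 9.9   # an observed value
for c, (mu, var) in params.items():
    print(c, gaussian_density(v, mu, var))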

Multinomial naïve Bayes:

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial distribution (p1, ..., pn), where pi is the probability that event i occurs (or K such multinomials in the multiclass case). A feature vector X = (x1, ..., xn) is then a histogram, with xi counting the number of times event i was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document. The likelihood of observing a histogram x is given by

p(x | Ck) = ((∑i xi)! / ∏i xi!) ∏i pki^xi
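A short sketch of this likelihood in log space. The count vector and event probabilities below are hypothetical; the multinomial coefficient is kept only for completeness, since it does not depend on the class and is usually dropped when classifying.

import math
from math import factorial

def multinomial_log_likelihood(x, p):
    """log p(x | Ck) for a count vector x under event probabilities p = (pk1, ..., pkn).
    Assumes every probability is > 0 (e.g. after smoothing)."""
    log_coeff = math.log(factorial(sum(x))) - sum(math.log(factorial(xi)) for xi in x)
    return log_coeff + sum(xi * math.log(pi) for xi, pi in zip(x, p) if xi > 0)

# Hypothetical word counts over a 4-word vocabulary and one class's word probabilities pki.
x = [2, 0, 1, 3]
p = [0.4, 0.1, 0.2, 0.3]
print(multinomial_log_likelihood(x, p))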

Bernoulli naïve Bayes:

In the multivariate Bernoulli event model, features are independent Booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features are used rather than term frequencies. If xi is a Boolean expressing the occurrence or absence of the i-th term from the vocabulary, then the likelihood of a document given a class Ck is given by

p(x | Ck) = ∏i pki^xi (1 − pki)^(1 − xi)

where pki is the probability of class Ck generating the term xi. This event
model is especially popular for classifying short texts. It has the benefit of
explicitly modelling the absence of terms. Note that a naive Bayes classifier
with a Bernoulli event model is not the same as a multinomial NB classifier
with frequency counts truncated to one.
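A sketch of that likelihood, again in log space; the binary occurrence vector and per-term probabilities are hypothetical.

import math

def bernoulli_log_likelihood(x, p):
    """log p(x | Ck) under the multivariate Bernoulli event model:
    term i contributes pki if it occurs (xi = 1) and 1 - pki if it is
    absent (xi = 0), so the absence of terms is modelled explicitly."""
    return sum(math.log(pi if xi else 1.0 - pi) for xi, pi in zip(x, p))

# Hypothetical document over a 4-term vocabulary and one class's term probabilities pki.
x = [1, 0, 0, 1]
p = [0.6, 0.3, 0.1, 0.8]
print(bernoulli_log_likelihood(x, p))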

Semi-supervised parameter estimation:

Given a way to train a naïve Bayes classifier from labeled data, it's possible
to construct a semi-supervised training algorithm that can learn from a
combination of labeled and unlabeled data by running the supervised learning
algorithm in a loop:

• Given a collection D = L ⋃ U of labeled samples L and unlabeled samples U, start by training a naïve Bayes classifier on L.
• Until convergence, do:

1. Predict class probabilities p(C | x) for all examples x in D.

2. Re-train the model based on the probabilities (not the labels) predicted in the previous step.

Convergence is determined based on improvement to the model likelihood p(D | θ), where θ denotes the parameters of the naïve Bayes model.
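A rough sketch of that loop is given below. The helpers train(pairs), log_likelihood(model, data), and model.predict_proba(x) are hypothetical stand-ins for a supervised naive Bayes trainer and its probability outputs, not a real library API.

def semi_supervised_nb(labeled, unlabeled, train, log_likelihood, tol=1e-4, max_iter=100):
    """labeled:   list of (x, class_distribution) pairs, one-hot for known labels
    unlabeled: list of feature vectors x
    train(pairs) -> model fitted from soft class distributions   (hypothetical helper)
    log_likelihood(model, data) -> log p(D | theta)               (hypothetical helper)"""
    model = train(labeled)                            # start from the labeled set L
    data = [x for x, _ in labeled] + list(unlabeled)  # D = L ∪ U
    prev = float("-inf")
    for _ in range(max_iter):
        soft = [(x, model.predict_proba(x)) for x in data]  # p(C | x) for every x in D
        model = train(soft)                                 # re-train on the probabilities
        cur = log_likelihood(model, data)
        if cur - prev < tol:                                # convergence on p(D | theta)
            break
        prev = cur
    return model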

Conclusion:
Despite the fact that the far-reaching independence assumptions are often
inaccurate, the naive Bayes classifier has several properties that make it
surprisingly useful in practice. In particular, the decoupling of the class
conditional feature distributions means that each distribution can be
independently estimated as a one-dimensional distribution. This helps
alleviate problems stemming from the curse of dimensionality, such as the
need for data sets that scale exponentially with the number of features. While
naive Bayes often fails to produce a good estimate for the correct class
probabilities,[19] this may not be a requirement for many applications. For
example, the naive Bayes classifier will make the correct classification under the MAP decision rule so long as the correct class is predicted to be more probable than any other class. This holds regardless of whether the probability estimates are slightly or even grossly inaccurate. In this manner, the overall classifier can be robust
enough to ignore serious deficiencies in its underlying naive probability
model.
