UNIT-2
Logistic regression
This article discusses the basics of logistic regression and its implementation in
Python. Logistic regression is a supervised classification algorithm: in a
classification problem, the target variable (or output), y, can take only discrete values for
a given set of features (or inputs), X.
Contrary to popular belief, logistic regression IS a regression model. It builds a
regression model to predict the probability that a given data entry belongs to the
category numbered "1". Just as linear regression assumes that the data follow a
linear function, logistic regression models the data using the sigmoid function.
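A minimal sketch of that idea in Python (the weights and bias below are made-up values for illustration, not learned ones):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """P(y = 1 | x) for each row of X, given weights w and bias b."""
    return sigmoid(X @ w + b)

# Two data entries with two features each; hypothetical weights and bias.
X = np.array([[0.5, 1.2], [2.0, -0.3]])
w = np.array([1.0, -2.0])
b = 0.1
print(predict_proba(X, w, b))  # probability each entry belongs to category "1"
```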
Perceptron
In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A
binary classifier is a function which can decide whether or not an input, represented by a vector of
numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm
that makes its predictions based on a linear predictor function combining a set of weights with
the feature vector.
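A minimal sketch of the perceptron learning rule, under the usual convention of labels in {-1, +1}; the learning rate and epoch count are arbitrary illustrative choices:

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, epochs=10):
    """Learn weights w and bias b so that sign(X @ w + b) matches y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only when the current linear predictor misclassifies xi.
            if yi * (xi @ w + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy linearly separable data.
X = np.array([[0.0, 1.0], [1.0, 2.0], [1.0, 0.0], [2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))  # reproduces y on this separable toy set
```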
Exponential family
In probability and statistics, an exponential family is a parametric set of probability distributions of a
certain form, specified below. This special form is chosen for mathematical convenience, based on
some useful algebraic properties, as well as for generality, as exponential families are in a sense
very natural sets of distributions to consider. The term exponential class is sometimes used in
place of "exponential family", or the older term Koopman–Darmois family. The terms "distribution"
and "family" are often used loosely: properly, an exponential family is a set of distributions, where the
specific distribution varies with the parameter; however, a parametric family of distributions is often
referred to as "a distribution" (like "the normal distribution", meaning "the family of normal
distributions"), and the set of all exponential families is sometimes loosely referred to as "the"
exponential family.
The concept of exponential families is credited to E. J. G. Pitman, G. Darmois, and B. O.
Koopman in 1935–1936. Exponential families of distributions provide a general framework for
selecting a possible alternative parameterisation of a parametric family of distributions, in terms
of natural parameters, and for defining useful sample statistics, called the natural sufficient
statistics of the family.
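The "certain form" referred to above can be stated explicitly. In one common parameterisation (the symbols b, T, and a are that convention's, not these notes'):

```latex
% Canonical form of an exponential-family density with natural parameter \eta,
% sufficient statistic T(y), log-partition a(\eta), and base measure b(y):
p(y; \eta) = b(y) \, \exp\!\big( \eta^{\top} T(y) - a(\eta) \big)
```

For example, the Bernoulli(φ) distribution fits this form with η = log(φ / (1 − φ)), T(y) = y, a(η) = log(1 + e^η), and b(y) = 1; inverting η to recover φ yields the sigmoid, which is one way logistic regression arises from this family.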
Generative learning algorithms
More information about discriminant functions for the normal density can be found in
textbooks such as Pattern Classification by Duda, Hart, and Stork, or Pattern Recognition
and Machine Learning by Bishop.
To separate the two classes of data points, there are many possible hyperplanes that
could be chosen. Our objective is to find a plane that has the maximum margin, i.e., the
maximum distance between data points of both classes. Maximizing the margin distance
provides some reinforcement so that future data points can be classified with more
confidence.
Hyperplanes are decision boundaries that help classify the data points. Data points
falling on either side of the hyperplane can be attributed to different classes. Also, the
dimension of the hyperplane depends upon the number of features. If the number of
input features is 2, then the hyperplane is just a line. If the number of input features is 3,
then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine
when the number of features exceeds 3.
Support Vectors
Support vectors are the data points that lie closest to the hyperplane and influence the
position and orientation of the hyperplane. Using these support vectors, we maximize the
margin of the classifier. Deleting the support vectors will change the position of the
hyperplane. These are the points that help us build our SVM.
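As a rough illustration (scikit-learn is an assumption here, not something these notes prescribe), a linear SVM can be fit and its support vectors inspected as follows:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two separable clusters, so the hyperplane is just a line.
X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 4], [4, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)       # the points that fix the maximum-margin boundary
print(clf.coef_, clf.intercept_)  # w and b of the decision boundary w.x + b = 0
```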
Combining Classifiers
Vote provides a baseline method for combining classifiers. The default scheme is to
average their probability estimates or numeric predictions, for classification and
regression, respectively. Other combination schemes are available—for example, using
majority voting for classification. MultiScheme selects the best classifier from a set of
candidates using cross-validation of percentage accuracy or mean-squared error for
classification and regression, respectively. The number of folds is a parameter.
Performance on training data can be used instead.
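Vote and MultiScheme are Weka's names for these schemes; a rough scikit-learn analog of probability-averaging ("soft") voting, with illustrative base classifiers, might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft",  # average the base classifiers' probability estimates
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```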
Stacking combines classifiers using stacking (see Section 8.7, page 369) for both
classification and regression problems. You specify the base classifiers, the
metalearner, and the number of cross-validation folds. StackingC implements a more
efficient variant for which the metalearner must be a numeric prediction scheme
(Seewald, 2002). In Grading, the inputs to the metalearner are base-level predictions
that have been marked (i.e., “graded”) as correct or incorrect. For each base classifier,
a metalearner is learned that predicts when the base classifier will err. Just as stacking
may be viewed as a generalization of voting, grading generalizes selection by cross-
validation (Seewald and Fürnkranz, 2001).
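A comparable scikit-learn sketch of stacking (an analog of, not the Weka implementation of, the Stacking scheme above), with base classifiers, a metalearner, and a fold count:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # the metalearner
    cv=5,  # cross-validation folds used to generate the base-level predictions
)
stack.fit(X, y)
print(stack.score(X, y))
```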
Bagging
Bootstrap aggregating, usually shortened to bagging, is a machine
learning ensemble meta-algorithm designed to improve the stability and accuracy of
machine learning algorithms used in statistical classification and regression. It also
reduces variance and helps to avoid overfitting.
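A minimal bagging sketch (scikit-learn again assumed purely for illustration): each tree sees a different bootstrap sample, and their predictions are aggregated:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(
    DecisionTreeClassifier(),  # an unstable, high-variance base learner
    n_estimators=50,           # number of bootstrap replicates (illustrative)
    bootstrap=True,            # sample training rows with replacement
    random_state=0,
)
bag.fit(X, y)
print(bag.score(X, y))
```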
Boosting
The term 'Boosting' refers to a family of algorithms that convert weak learners into
strong learners. Boosting is an ensemble method for improving the model predictions
of any given learning algorithm. The idea of boosting is to train weak learners
sequentially, each trying to correct its predecessor.
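AdaBoost is one member of that family; a minimal sketch in which shallow trees ("stumps") are trained sequentially, each reweighting the examples its predecessors got wrong:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
boost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # the weak learner: a decision stump
    n_estimators=100,                     # illustrative
    random_state=0,
)
boost.fit(X, y)
print(boost.score(X, y))
```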
Evaluating and debugging learning algorithms
Classification errors
There are multiple types of errors associated with machine learning and predictive
analytics. The primary types are in-sample and out-of-sample errors. In-sample error
(aka re-substitution error) is the error rate found on the training data, i.e., the
data used to build predictive models.
Out-of-sample errors (aka generalisation errors) are the error rates found on a new
data set, and are the most important since they represent the potential performance of
a given predictive model on new and unseen data.
In-sample error rates may be very low and seem to be indicative of a high-performing
model, but one must be careful, as this may be due to overfitting as mentioned, which
would result in a model that is unable to generalise well to new data.
Training and validation data are used to build, validate, and tune a model, while test data
are used to evaluate model performance and generalisation capability. One very important
point to note is that prediction performance and error analysis should only be done on
test data when evaluating a model for use on non-training or new data (out-of-sample).
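A minimal sketch contrasting the two error rates (the dataset and model are illustrative choices): an unconstrained decision tree typically shows near-zero in-sample error while its out-of-sample error is noticeably higher:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("in-sample error:    ", 1 - model.score(X_train, y_train))  # ~0: re-substitution
print("out-of-sample error:", 1 - model.score(X_test, y_test))    # the honest figure
```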
For binary classification problems, there are two primary types of errors: Type 1 errors
(false positives) and Type 2 errors (false negatives). Through model selection and tuning,
it's often possible to decrease one at the cost of increasing the other, and one must often
choose which error type is more acceptable. This can be a major tradeoff consideration
depending on the situation.
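Both error types can be read off a confusion matrix; a minimal sketch with made-up labels (1 = positive class):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Type 1 errors (false positives):", fp)  # predicted 1, actually 0
print("Type 2 errors (false negatives):", fn)  # predicted 0, actually 1
```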
A typical example of this tradeoff dilemma involves cancer diagnosis, where the
positive diagnosis of having cancer is based on some test. In this case, a false positive
means that someone is told that they have cancer when they do not. Conversely, the
false negative case is when someone is told that they do not have cancer when they
actually do. Since no model is perfect, which is the more acceptable error type in the
example above? In other words, which one can we tolerate to a greater degree?
Telling someone they have cancer when they don’t can result in tremendous emotional
distress, stress, additional tests and medical costs, and so on. On the other hand, failing
to detect cancer in someone that actually has it can mean the difference between life
and death.
In the spam or ham case, neither error type is nearly as serious as the cancer case, but
typically email vendors err slightly more on the side of letting some spam get into your
inbox as opposed to you missing a very important email because the spam classifier is
too aggressive.