Experiment 1
Aim: Introduction to ML lab with tools (Hands on WEKA on data set (iris.arff)).
(a) Start Weka
Start Weka. This may involve finding it in the program launcher, double-clicking the weka.jar file, or running "java -jar weka.jar" from the command line. This
will start the Weka GUI Chooser. The Weka GUI Chooser lets you choose one of the Explorer, the
Experimenter, the KnowledgeFlow and the Simple CLI (command line interface).
Classifier
Any learning algorithm in WEKA is derived from the abstract weka.classifiers.Classifier class.
Surprisingly little is needed for a basic classifier: a routine which generates a classifier model
from a training dataset (buildClassifier) and another routine which evaluates the generated model on
an unseen test dataset (classifyInstance), or which generates a probability distribution over all classes
(distributionForInstance). A classifier model is an arbitrarily complex mapping from all-but-one of the
dataset attributes to the class attribute.
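As a minimal sketch of these three calls (assuming a standard WEKA installation on the classpath and an iris.arff file at the hypothetical path data/iris.arff):

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BasicClassifierDemo {
    public static void main(String[] args) throws Exception {
        // Load the dataset; the path is an assumption, adjust as needed.
        Instances data = DataSource.read("data/iris.arff");
        // In iris.arff the class attribute is the last one.
        data.setClassIndex(data.numAttributes() - 1);

        // buildClassifier() learns a model from the training data.
        Classifier nb = new NaiveBayes();
        nb.buildClassifier(data);

        // classifyInstance() returns the index of the predicted class value,
        // distributionForInstance() returns a probability for every class.
        Instance first = data.instance(0);
        double predicted = nb.classifyInstance(first);
        double[] dist = nb.distributionForInstance(first);
        System.out.println("Predicted class: " + data.classAttribute().value((int) predicted));
        System.out.println("Class distribution: " + java.util.Arrays.toString(dist));
    }
}

Commonly used base classifiers include the following: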
● Support Vector Machine (SVM) - An SVM discriminates a set of high-dimensional features using one or
more hyperplanes chosen to give the largest minimum distance (margin) separating the data points of the
different classes.
● Multilayer Perceptron Neural Network (MLP) - An MLP is a non-linear feed-forward network model
which maps a set of inputs x onto a set of outputs y through layers of weighted connections.
● Bayesian Network (BN) - A BN is a probabilistic graphical model for reasoning under uncertainty,
where the nodes represent discrete or continuous variables and the links represent the relationships
between them.
● C4.5 Decision Tree (DT) - A DT decides the target class of a new sample based on features selected
from the available data using the concept of information entropy. The internal nodes of the tree are the
attributes, each branch represents a possible decision, and the end nodes or leaves are the classes.
● Random Forest (RF) - An RF works by constructing multiple decision trees on various sub-samples of
the dataset and outputting the class that appears most often among the individual trees (or the mean of
their predictions for regression).
● Naive Bayes (NB) - The NB classifier is a classification algorithm based on Bayes' theorem with
strong independence assumptions between features.
● K-nearest Neighbour (KNN) - KNN is an instance-based learning algorithm that stores all available
data points and classifies new data points based on a similarity measure such as distance.
The machine learning ensemble meta-algorithms, on the other hand, are:
● Boosting algorithms - Boosting works by combining a set of weak classifiers into a single strong
classifier. The weak classifiers or hypotheses are weighted in some way with respect to the training data
points and combined into a final strong classifier, and the different ways of doing this give rise to a
variety of boosting algorithms. Three boosting algorithms are introduced here:
● Adaptive Boosting (AdaBoost) - The weights of incorrectly labelled data points are increased in
AdaBoost so that subsequent classifiers focus more on the incorrectly labelled or difficult cases.
● LogitBoost - LogitBoost is an extension of AdaBoost that applies the cost function of logistic
regression, so it classifies by using a regression scheme as the base learner.
● Real AdaBoost - Unlike most boosting algorithms, which return binary-valued classes (Discrete
AdaBoost), Real AdaBoost outputs a real-valued probability of the class.
● Bagging – Bagging generates several training sets of the same size (by sampling with replacement),
builds a model on each of them with the same machine learning algorithm, and combines the predictions
by voting or averaging. It often improves the accuracy and stability of the classifier.
● Dagging - Dagging generates a number of disjoint, stratified folds of the data and feeds each
chunk of data to a copy of the machine learning classifier. Predictions are made by majority vote, since
all the generated classifiers are put into a Vote meta-classifier. Dagging is useful for base classifiers
that are quadratic or worse in time behaviour with respect to the number of instances in the training
data.
● Rotation Forest - A rotation forest is constructed from a number of copies of the same machine
learning classifier (typically a decision tree), each trained independently on a new set of features
formed by sub-sampling the dataset and applying principal component analysis to each subset (a minimal
WEKA usage sketch of the boosting and bagging meta-classifiers follows this list).
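As a minimal WEKA usage sketch of the boosting and bagging meta-classifiers described above (the iris.arff path and the J48 base learner are illustrative assumptions, not the configuration used elsewhere in this manual):

import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EnsembleDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        // AdaBoostM1 reweights the training data and boosts a weak base learner.
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new J48());
        boost.setNumIterations(10);
        boost.buildClassifier(data);

        // Bagging trains the same base learner on bootstrap samples and combines the votes.
        Bagging bag = new Bagging();
        bag.setClassifier(new J48());
        bag.setNumIterations(10);
        bag.buildClassifier(data);

        System.out.println(boost);
        System.out.println(bag);
    }
}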
Date: _________
Experiment 3
Aim: Understand clustering approaches and implement K means Algorithm using Weka Tool
=== Run information ===
kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 62.1436882815797
Cluster 0: 6.1,2.9,4.7,1.4,Iris-versicolor
Cluster 1: 6.2,2.9,4.3,1.3,Iris-versicolor
The above information shows the result of the k-means clustering method using the WEKA tool. We then
saved the result, which is stored in the ARFF file format, and also opened this file in MS Excel.
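The same run can be reproduced from the WEKA Java API. The sketch below takes k = 2 from the output above and clusters iris.arff with all attributes, as the Explorer does by default (the file path is an assumption):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        // Load iris.arff; no class index is set, so all attributes are clustered.
        Instances data = DataSource.read("data/iris.arff"); // assumed path
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(2);       // two clusters, as in the run above
        kMeans.buildClusterer(data);
        // Prints iterations, within-cluster sum of squared errors and the centroids.
        System.out.println(kMeans);
    }
}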
Date: _________
Experiment 4
There are many data files that store the attribute details of a problem description, and they store the
data in one of several formats.
The ARFF files listed below are evaluated and analysed to obtain results on the basis of the data
provided in the files.
1) Airline
2) Breast-cancer
3) Contact-lenses
4) Cpu
5) Credit-g
1. Airline:
Monthly totals of international airline passengers (in thousands) for 1949-1960.
@relation airline_passengers
@attribute passenger_numbers numeric
@attribute Date date 'yyyy-MM-dd'
@data
112,1949-01-01
118,1949-02-01
132,1949-03-01
129,1949-04-01
121,1949-05-01
135,1949-06-01
148,1949-07-01
148,1949-08-01
432,1960-12-01
2. Breast-cancer
This data set includes 201 instances of one class and 85 instances of another class. The instances are
described by 9 attributes, some of which are linear and some are nominal.
3. Contact-lenses
1. Title: Database for fitting contact lenses
2. Sources:
(a) Cendrowska, J. "PRISM: An algorithm for inducing modular rules",
International Journal of Man-Machine Studies, 1987, 27, 349-370
(b) Donor: Benoit Julien ([email protected])
(c) Date: 1 August 1990
3. Past Usage:
1. See above.
2. Witten, I. H. & MacDonald, B. A. (1988). Using concept
learning for knowledge acquisition. International Journal of
Man-Machine Studies, 27, (pp. 349-370).
Notes: This database is complete (all possible combinations of attribute-value pairs are represented).
Each instance is complete and correct. 9 rules cover the training set.
5. Number of Instances: 24
7. Attribute Information:
-- 3 Classes
1 : the patient should be fitted with hard contact lenses,
2 : the patient should be fitted with soft contact lenses,
3 : the patient should not be fitted with contact lenses.
9. Class Distribution:
1. hard contact lenses: 4
2. soft contact lenses: 5
3. no contact lenses: 15
@relation contact-lenses
[attribute declarations omitted]
@data
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none
4. CPU
Deleted "vendor" attribute to make data consistent with with what we
used in the data mining book.
@relation 'cpu'
@attribute MYCT numeric
@attribute MMIN numeric
@attribute MMAX numeric
@attribute CACH numeric
@attribute CHMIN numeric
@attribute CHMAX numeric
@attribute class numeric
@data
125,256,6000,256,16,128,198
29,8000,32000,32,8,32,269
29,8000,32000,32,8,32,220
29,8000,32000,32,8,32,172
29,8000,16000,32,8,16,132
26,8000,32000,64,8,32,318
23,16000,32000,64,16,32,367
23,16000,32000,64,16,32,489
23,16000,64000,64,16,32,636
23,32000,64000,128,32,64,1144
5. Credit-g
Description of the German credit dataset.
For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-
numeric". This file has been edited and several indicator variables added to make it suitable for
algorithms which cannot cope with categorical variables. Several
attributes that are ordered categorical (such as attribute 17) have been coded as integer. This was the
form used by StatLog.
6. Number of Attributes german: 20 (7 numerical, 13 categorical) Number of Attributes
german.numer: 24 (24 numerical)
7. Attribute description for german
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM /
salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/
all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/
other credits existing (not at this bank)
Attribute 4: (qualitative) Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : .. >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : .. >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/
life insurance
Attribute 13: (numerical)
Age in years
In the cost matrix supplied with this dataset, the rows represent the actual classification and the
columns the predicted classification. It is worse to classify a customer as good when they are bad
(cost 5) than to classify a customer as bad when they are good (cost 1).
@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration numeric
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously',
'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs,
education, vacation, retraining, business, other}
@attribute credit_amount numeric
@attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute installment_commitment numeric
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female
single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since numeric
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute age numeric
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits numeric
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents numeric
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real
estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real
estate',22,none,own,1,skilled,1,none,yes,bad
'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real
estate',49,none,own,1,'unskilled resident',2,none,yes,good
'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life
insurance',45,none,'for free',1,skilled,2,none,yes,good
'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known
property',53,none,'for free',2,skilled,2,none,yes,bad
'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no
known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good
'no checking',24,'existing paid',furniture/equipment,2835,'500<=X<1000','>=7',3,'male single',none,4,'life
insurance',53,none,own,1,skilled,1,none,yes,good
'0<=X<200',36,'existing paid','used car',6948,'<100','1<=X<4',2,'male
single',none,2,car,35,none,rent,1,'high qualif/self emp/mgmt',1,yes,yes,good
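Each of these ARFF files can also be loaded and inspected programmatically before analysis. A minimal sketch (the file path is an assumption):

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ArffInspect {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Relation:   " + data.relationName());
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute att = data.attribute(i);
            System.out.println("  " + att.name() + (att.isNominal() ? " (nominal)" : " (numeric)"));
        }
    }
}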
Date: _________
Experiment 5
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: labor-neg-data
Instances: 57
Attributes: 17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
Test mode: 10-fold cross-validation
Class
Attribute bad good
(0.36) (0.64)
=================================================
duration
mean 2 2.25
std. dev. 0.7071 0.6821
weight sum 20 36
precision 1 1
wage-increase-first-year
mean 2.6563 4.3837
std. dev. 0.8643 1.1773
weight sum 20 36
precision 0.3125 0.3125
wage-increase-second-year
mean 2.9524 4.447
std. dev. 0.8193 0.9805
weight sum 15 31
precision 0.3571 0.3571
wage-increase-third-year
mean 2.0344 4.5795
std. dev. 0.1678 0.7893
weight sum 4 11
precision 0.3875 0.3875
cost-of-living-adjustment
none 10.0 14.0
tcf 2.0 8.0
tc 6.0 3.0
[total] 18.0 25.0
working-hours
mean 39.4887 37.5491
std. dev. 1.8903 2.9266
weight sum 19 32
precision 1.8571 1.8571
pension
none 12.0 1.0
ret_allw 3.0 3.0
empl_contr 6.0 8.0
[total] 21.0 12.0
standby-pay
mean 2.5 11.2
std. dev. 0.866 2.0396
weight sum 4 5
precision 2 2
shift-differential
mean 2.4691 5.6818
std. dev. 1.5738 5.0584
weight sum 9 22
precision 2.7778 2.7778
education-allowance
yes 4.0 8.0
no 10.0 4.0
[total] 14.0 12.0
statutory-holidays
mean 10.2 11.4182
std. dev. 0.805 1.2224
weight sum 20 33
precision 1.2 1.2
vacation
below_average 12.0 8.0
average 8.0 11.0
generous 3.0 15.0
[total] 23.0 34.0
longterm-disability-assistance
yes 6.0 16.0
no 9.0 1.0
[total] 15.0 17.0
contribution-to-dental-plan
none 8.0 3.0
half 8.0 9.0
full 1.0 14.0
[total] 17.0 26.0
bereavement-assistance
yes 10.0 19.0
no 4.0 1.0
[total] 14.0 20.0
contribution-to-health-plan
none 9.0 1.0
half 3.0 8.0
full 7.0 15.0
[total] 19.0 24.0
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.900 0.108 0.818 0.900 0.857 0.776 0.965 0.926 bad
0.892 0.100 0.943 0.892 0.917 0.776 0.965 0.983 good
Weighted Avg. 0.895 0.103 0.899 0.895 0.896 0.776 0.965 0.963
Scheme: weka.classifiers.trees.DecisionStump
Relation: labor-neg-data
Instances: 57
Attributes: 17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
Test mode: 10-fold cross-validation
Decision Stump
Classifications
Class distributions
pension = none
bad good
1.0 0.0
pension != none
bad good
0.4375 0.5625
pension is missing
bad good
0.06666666666666667 0.9333333333333333
Time taken to build model: 0 seconds
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.550 0.054 0.846 0.550 0.667 0.564 0.835 0.815 bad
0.946 0.450 0.795 0.946 0.864 0.564 0.835 0.851 good
Weighted Avg. 0.807 0.311 0.813 0.807 0.795 0.564 0.835 0.838
a b <-- classified as
11 9 | a = bad
2 35 | b = good
LM num: 1
class =
-0.0515 * duration
- 0.1851 * wage-increase-first-year
+ 0.0443 * working-hours
+ 0.236 * pension=none
- 0.0225 * shift-differential
- 0.5762
LM num: 2
class =
-0.1125 * duration
- 0.2172 * wage-increase-first-year
+ 0.0364 * working-hours
+ 0.236 * pension=none
- 0.0261 * shift-differential
+ 0.1224
LM num: 3
class =
-0.1156 * duration
- 0.2331 * wage-increase-first-year
+ 0.0364 * working-hours
+ 0.236 * pension=none
- 0.023 * shift-differential
+ 0.1288
LM num: 4
class =
-0.1068 * duration
- 0.2195 * wage-increase-first-year
+ 0.0364 * working-hours
+ 0.236 * pension=none
- 0.023 * shift-differential
+ 0.0143
LM num: 5
class =
-0.0767 * duration
- 0.1349 * wage-increase-first-year
+ 0.0341 * working-hours
+ 0.3259 * pension=none
- 0.0183 * shift-differential
- 0.0512
LM num: 6
class =
-0.0461 * duration
- 0.0867 * wage-increase-first-year
+ 0.0238 * working-hours
+ 0.2735 * pension=none
- 0.0109 * shift-differential
- 0.2876
Number of Rules : 6
LM num: 1
class =
0.0767 * duration
+ 0.1349 * wage-increase-first-year
- 0.0341 * working-hours
+ 0.3259 * pension=ret_allw,empl_contr
+ 0.0183 * shift-differential
+ 0.7253
LM num: 2
class =
0.0515 * duration
+ 0.1851 * wage-increase-first-year
- 0.0443 * working-hours
+ 0.236 * pension=ret_allw,empl_contr
+ 0.0225 * shift-differential
+ 1.3402
LM num: 3
class =
0.1125 * duration
+ 0.2172 * wage-increase-first-year
- 0.0364 * working-hours
+ 0.236 * pension=ret_allw,empl_contr
+ 0.0261 * shift-differential
+ 0.6416
LM num: 4
class =
0.1156 * duration
+ 0.2331 * wage-increase-first-year
- 0.0364 * working-hours
+ 0.236 * pension=ret_allw,empl_contr
+ 0.023 * shift-differential
+ 0.6352
LM num: 5
class =
0.1068 * duration
+ 0.2195 * wage-increase-first-year
- 0.0364 * working-hours
+ 0.236 * pension=ret_allw,empl_contr
+ 0.023 * shift-differential
+ 0.7497
LM num: 6
class =
0.0461 * duration
+ 0.0867 * wage-increase-first-year
- 0.0238 * working-hours
+ 0.2735 * pension=ret_allw,empl_contr
+ 0.0109 * shift-differential
+ 1.0142
Number of Rules : 6
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.750 0.135 0.750 0.750 0.750 0.615 0.918 0.880 bad
0.865 0.250 0.865 0.865 0.865 0.615 0.918 0.951 good
Weighted Avg. 0.825 0.210 0.825 0.825 0.825 0.615 0.918 0.926
a b <-- classified as
15 5 | a = bad
5 32 | b = good
Odds Ratios...
Class
Variable bad
===========================================================
duration 0.0012
wage-increase-first-year 0
wage-increase-second-year 0
wage-increase-third-year 0
cost-of-living-adjustment=none 0.2182
cost-of-living-adjustment=tcf 0
cost-of-living-adjustment=tc 9714103733.4349
working-hours 48.3653
pension=none 3.3653068354006045E19
pension=ret_allw 5993.0626
pension=empl_contr 0
standby-pay 0.004
shift-differential 0.1307
education-allowance=no 5.1852
statutory-holidays 0.0001
vacation=below_average 4813.2712
vacation=average 9.8529
vacation=generous 0
longterm-disability-assistance=no 2.14532228968581478E18
contribution-to-dental-plan=none 6.563512730450786E14
contribution-to-dental-plan=half 1.2349
contribution-to-dental-plan=full 0
bereavement-assistance=no 4.3065760813857376E16
contribution-to-health-plan=none 2.14532228874995942E18
contribution-to-health-plan=half 0
contribution-to-health-plan=full 0
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.950 0.081 0.864 0.950 0.905 0.852 0.970 0.927 bad
0.919 0.050 0.971 0.919 0.944 0.852 0.981 0.989 good
Weighted Avg. 0.930 0.061 0.934 0.930 0.931 0.852 0.977 0.967
a b <-- classified as
19 1 | a = bad
3 34 | b = good
(F.) SVM (Support Vector Machines)
=== Run information ===
SMO
Kernel used:
Linear Kernel: K(x,y) = <x,y>
BinarySMO
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.800 0.054 0.889 0.800 0.842 0.766 0.873 0.781 bad
0.946 0.200 0.897 0.946 0.921 0.766 0.873 0.884 good
Weighted Avg. 0.895 0.149 0.894 0.895 0.893 0.766 0.873 0.848
a b <-- classified as
16 4 | a = bad
2 35 | b = good
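Each of the classifiers in this experiment was evaluated with 10-fold cross-validation. A minimal sketch of how such an evaluation can be produced with the WEKA API (the labor.arff path and the choice of Naive Bayes are illustrative assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation, matching the test mode reported above.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString()); // TP/FP rate, precision, recall, ...
        System.out.println(eval.toMatrixString());       // confusion matrix
    }
}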
Date: _________
Experiment 6
Algorithms:
1. Naïve Bayes
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Naive Bayes Classifier
Time taken to build model: 0.01 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 2948 63.713 %
Incorrectly Classified Instances 1679 36.287 %
Kappa statistic 0
Mean absolute error 0.4624
Root mean squared error 0.4808
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 4627
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 1.000 0.637 1.000 0.778 0.000 0.499 0.637 low
0.000 0.000 0.000 0.000 0.000 0.000 0.499 0.363 high
Weighted Avg. 0.637 0.637 0.406 0.637 0.496 0.000 0.499 0.537
a b <-- classified as
2948 0 | a = low
1679 0 | b = high
The mean absolute error is:
MAE = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{\theta}_i - \theta_i\right|
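As a small illustration (a hypothetical helper, not part of WEKA), the same quantity can be computed from arrays of predicted and actual values:

// Hypothetical helper: mean absolute error between predicted and actual values.
public static double meanAbsoluteError(double[] predicted, double[] actual) {
    double sum = 0.0;
    for (int i = 0; i < actual.length; i++) {
        sum += Math.abs(predicted[i] - actual[i]);
    }
    return sum / actual.length;
}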
2. Decision Stump
Scheme: weka.classifiers.trees.DecisionStump
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
Test mode: 10-fold cross-validation
Decision Stump
Classifications
Class distributions
tissues-paper prd = t
low high
0.48553627058299953 0.5144637294170005
tissues-paper prd != t
low high
0.7802521008403361 0.21974789915966386
tissues-paper prd is missing
low high
0.7802521008403361 0.21974789915966386
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.627 0.325 0.772 0.627 0.692 0.290 0.642 0.732 low
0.675 0.373 0.507 0.675 0.579 0.290 0.642 0.466 high
Weighted Avg. 0.644 0.343 0.676 0.644 0.651 0.290 0.642 0.635
a b <-- classified as
1847 1101 | a = low
546 1133 | b = high
3. Random Forest
=== Run information ===
RandomForest
a b <-- classified as
2948 0 | a = low
1679 0 | b = high
5. K-Means Clustering
Number of iterations: 2
Within cluster sum of squared errors: 0.0
Clustered Instances
0 1679 ( 36%)
1 2948 ( 64%)
Result:
By performing classification with the above algorithms, we observe that Naïve Bayes takes the least
time to build its model and is also the most accurate, with the least mean absolute error. Therefore, we
conclude that, of the above algorithms, Naïve Bayes performs best.
Date: _________
Experiment 7
Sample Input:
Output:
=== Run information ===
node-caps = yes
| deg-malig = 1: recurrence-events (1.01/0.4)
| deg-malig = 2: no-recurrence-events (26.2/8.0)
| deg-malig = 3: recurrence-events (30.4/7.4)
node-caps = no: no-recurrence-events (228.39/53.4)
Number of Leaves : 4
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.960 0.776 0.745 0.960 0.839 0.287 0.582 0.728 no-recurrence-
events
0.224 0.040 0.704 0.224 0.339 0.287 0.582 0.444 recurrence-events
Weighted Avg. 0.741 0.558 0.733 0.741 0.691 0.287 0.582 0.643
a b <-- classified as
193 8 | a = no-recurrence-events
66 19 | b = recurrence-events
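The tree above has the format produced by WEKA's J48 (C4.5) learner on the breast-cancer data. A minimal sketch of how such a tree and its confusion matrix can be generated (the file path is an assumption):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/breast-cancer.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();      // C4.5 decision tree learner
        tree.buildClassifier(data);
        System.out.println(tree);  // prints the decision tree and number of leaves

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}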
Date: _________
Experiment 8
R is a programming language and software environment for statistical analysis, graphics representation
and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New
Zealand, and is currently developed by the R Development Core Team.
The core of R is an interpreted computer language which allows branching and looping as well as
modular programming using functions. R allows integration with procedures written in C, C++, .NET,
Python or FORTRAN for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary versions are
provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft licence, and is an official part of the GNU
project, called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the
University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.
● A large group of individuals has contributed to R by sending code and bug reports.
● Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source
code archive.
Features of R
As stated earlier, R is a programming language and software environment for statistical analysis,
graphics representation and reporting. The following are the important features of R −
● R is a well-developed, simple and effective programming language which includes conditionals,
loops, user defined recursive functions and input and output facilities.
● R has an effective data handling and storage facility.
● R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
● R provides a large, coherent and integrated collection of tools for data analysis.
● R provides graphical facilities for data analysis and display, either directly on the computer or
for printing on paper.
In conclusion, R is one of the world's most widely used statistical programming languages. It is a
leading choice of data scientists and is supported by a vibrant and talented community of contributors.
R is taught in universities and deployed in mission-critical business applications.
Date: _________
BEYOND THE SYLLABUS
Experiment 1
Aim: Understanding of the RMS Titanic dataset to predict survival by training a model and predicting the
required solution.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912,
during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224
passengers and crew. This sensational tragedy shocked the international community and led to better
safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there
were not enough lifeboats for the passengers and crew. Although there was some element of luck
involved in surviving the sinking, some groups of people were more likely to survive than others, such as
women, children, and the upper-class.
• Survived (Target Variable) - Binary categorical variable where 0 represents not survived and 1
represents survived.
• Pclass - Categorical variable. It is passenger class.
• Sex - Binary variable representing the gender of the passenger.
• Age - Feature engineered variable. It is divided into 4 classes.
• Fare - Feature engineered variable. It is divided into 4 classes.
• Embarked - Categorical Variable. It tells the Port of embarkation.
• Title - New feature created from names. The title of names is classified into 4 different classes.
• isAlone - Binary Variable. It tells whether the passenger is travelling alone or not.
• Age*Class - Feature engineered variable.
1. Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the
relationship between the categorical dependent variable (feature) and one or more independent variables
(features) by estimating probabilities using a logistic function, which is the cumulative logistic
distribution.
Note the confidence score generated by the model based on our training dataset.
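For reference, logistic regression estimates the survival probability by applying the logistic (sigmoid) function to a linear combination of the feature values:

P(\text{Survived} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}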
2. In pattern recognition, the k-Nearest Neighbours algorithm (or k-NN for short) is a non-parametric
method used for classification and regression. A sample is classified by a majority vote of its neighbours,
with the sample being assigned to the class most common among its k nearest neighbours (k is a positive
integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest
neighbour.
The KNN confidence score is better than that of Logistic Regression but worse than that of SVM.
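A commonly used similarity measure for k-NN is the Euclidean distance between two feature vectors x and x':

d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}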
3. Next we model using Support Vector Machines which are supervised learning models with associated
learning algorithms that analyze data used for classification and regression analysis. Given a set of
training samples, each marked as belonging to one or the other of two categories, an SVM training
algorithm builds a model that assigns new test samples to one category or the other, making it a non-
probabilistic binary linear classifier.
Note that the model generates a confidence score which is higher than that of the Logistic Regression model.
4. In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on
applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive
Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables
(features) in a learning problem.
The confidence score generated by this model is the lowest among the models evaluated so far.
5. This model uses a decision tree as a predictive model which maps features (tree branches) to
conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set
of values are called classification trees; in these tree structures, leaves represent class labels and branches
represent conjunctions of features that lead to those class labels. Decision trees where the target variable
can take continuous values (typically real numbers) are called regression trees. The model confidence
score is the highest among models evaluated so far.
Date: _________
Experiment 2
Aim: Understanding of Indian education in rural villages to predict whether a girl child will be sent to
school or not.
The data is focused on rural India. It primarily looks into the fact whether the villagers are willing to send
the girl children to school or not and if they are not sending their daughters to school the reasons have
also been mentioned. The district is Gwalior. Various details of the villagers such as village, gender, age,
education, occupation, category, caste, religion, land etc have also been collected.
The algorithm was run with 10-fold cross-validation: this means it was given an opportunity to make a
prediction for each instance of the dataset (with different training folds) and the presented result is a
summary of those predictions. Firstly, I noted the Classification Accuracy. The model achieved a result of
109/200 correct or 54.5%.
a b c d e f g h i j k l m <-- classified as
0 0 1 1 0 1 0 2 0 0 0 0 0 | a = Govt.
2 1 1 1 8 0 0 0 0 1 0 0 0 | b = Driver
2 0 17 2 9 0 0 2 0 0 0 0 0 | c = Farmer
0 0 4 3 2 0 1 0 0 1 0 0 0 | d = Shopkeeper
1 8 2 3 73 1 0 1 1 3 2 2 0 | e = labour
3 0 0 0 0 0 0 1 0 0 0 0 0 | f = Security Guard
0 1 0 1 0 0 0 2 0 0 0 0 0 | g = Raj Mistri
1 0 0 0 1 1 0 8 0 0 0 0 0 | h = Fishing
0 0 2 0 0 0 0 0 2 0 0 0 0 | i = Labour & Driver
0 0 2 0 1 0 0 0 0 2 0 0 0 | j = Homemaker
0 0 0 0 1 0 0 0 0 2 0 0 0 | k = Govt School Teacher
0 0 0 1 4 0 0 0 0 0 0 0 0 | l = Dhobi
1 0 0 0 3 0 0 0 0 0 0 0 1 | m = goats
The confusion matrix shows the per-class performance of the algorithm: 1, 1, 1 and 2 Government officials
were misclassified as Farmer, Shopkeeper, Security Guard and Fishing respectively; 2, 1, 1, 8 and 1
Drivers were misclassified as Government officials, Farmer, Shopkeeper, Labour and Homemaker; and so on.
This table helps to explain the accuracy achieved by the algorithm.
Now that we have a model, we need to load the test data created earlier. To do this, select Supplied test
set and click the Set button. Click More Options and, in the new window, choose PlainText under Output
predictions. Then right-click the recently created model in the result list and select Re-evaluate model
on current test set.
After re-evaluation:
a b c d e f g h <-- classified as
147 0 1 0 0 0 0 0 | a = NA
4 12 0 0 0 0 0 0 | b = Poverty
5 0 3 0 0 0 0 0 | c = Marriage
0 0 1 3 0 0 0 0 | d = Distance
0 0 0 0 8 0 0 0 |e=X
4 0 0 0 0 0 0 0 | f = Unsafe Public Space
0 0 0 0 0 0 4 0 | g = Transport Facilities
1 0 0 0 0 0 0 4 | h = Household Responsibilities
The confusion matrix shows that the majority of the reasons were not available (NA); of the reasons that
were available, most people did not send their daughters to school because of poverty, and very few
considered distance a major factor for not sending their girl children to school.
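The same re-evaluation can also be scripted with the WEKA API rather than the Explorer. A minimal sketch, with hypothetical train/test file names and J48 standing in for whichever classifier was trained above:

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSetDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical file names; the headers of both files must match.
        Instances train = DataSource.read("data/village_train.arff");
        Instances test  = DataSource.read("data/village_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Classifier model = new J48(); // stand-in for the trained model
        model.buildClassifier(train);

        // Re-evaluate the trained model on the supplied test set.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        System.out.println(eval.toMatrixString());
    }
}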
3. Random Forest
The accuracy of this algorithm is 100%, that is, 200/200 instances have been correctly classified.
a b c d e f g h i j k l m <-- classified as
5 0 0 0 0 0 0 0 0 0 0 0 0 | a = Govt.
0 14 0 0 0 0 0 0 0 0 0 0 0 | b = Driver
0 0 32 0 0 0 0 0 0 0 0 0 0 | c = Farmer
0 0 0 11 0 0 0 0 0 0 0 0 0 | d = Shopkeeper
0 0 0 0 97 0 0 0 0 0 0 0 0 | e = labour
0 0 0 0 0 4 0 0 0 0 0 0 0 | f = Security Guard
0 0 0 0 0 0 4 0 0 0 0 0 0 | g = Raj Mistri
0 0 0 0 0 0 0 11 0 0 0 0 0 | h = Fishing
0 0 0 0 0 0 0 0 4 0 0 0 0 | i = Labour & Driver
0 0 0 0 0 0 0 0 0 5 0 0 0 | j = Homemaker
0 0 0 0 0 0 0 0 0 0 4 0 0 | k = Govt School Teacher
0 0 0 0 0 0 0 0 0 0 0 5 0 | l = Dhobi
1 0 0 0 0 0 0 0 0 0 0 0 4 | m = goats
Hardly any observations have been misclassified. The maximum number of villagers are labourers.
4. Random Tree
The classification accuracy is 76.0204%, that is, 149 of the 196 classified instances are correct.
The false positive rate is 0.352, the highest of the four algorithms applied above: here, 35.2% of the
values that should have been classified negatively have been assigned a positive value.
=== Confusion Matrix ===
a b c d e f g h <-- classified as
126 7 3 1 0 8 3 0 | a = NA
7 8 1 0 0 0 0 0 | b = Poverty
4 1 3 0 0 0 0 0 | c = Marriage
1 0 0 3 0 0 0 0 | d = Distance
2 0 0 0 6 0 0 0 | e=X
4 0 0 0 0 0 0 0 | f = Unsafe Public Space
3 1 0 0 0 0 0 0 | g = Transport Facilities
1 0 0 0 0 0 0 3 | h = Household Responsibilities
Experiment 3
This is a dataset of contact patterns among students, collected during the spring semester of 2006 at
the National University of Singapore.
Scheme: weka.classifiers.functions.SimpleLinearRegression
Relation: MOCK_DATA (1)-weka.filters.unsupervised.instance.RemovePercentage-P50.0
Instances: 500
Attributes: 4
Start Time
Session Id
Student Id
Duration
Test mode: evaluate on training data
Start Time =
0.0274 * Session Id +
10.3846
Correlation coefficient 0
Mean absolute error 5.0003
Root mean squared error 5.8026
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 500
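A minimal sketch of how this run can be reproduced with the WEKA API (the ARFF path is an assumption; the class is set to Start Time, the first attribute of the relation above):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SimpleLinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SimpleLinearRegressionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/MOCK_DATA.arff"); // assumed path
        data.setClassIndex(0); // predict "Start Time"

        SimpleLinearRegression slr = new SimpleLinearRegression();
        slr.buildClassifier(data);
        System.out.println(slr); // prints the fitted equation shown above

        // Evaluate on the training data, matching the test mode above.
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(slr, data);
        System.out.println("Correlation coefficient: " + eval.correlationCoefficient());
        System.out.println("Mean absolute error:     " + eval.meanAbsoluteError());
    }
}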
CONCLUSION:
Six algorithms have been used to determine the best classifier. Depending on the attributes, the
performance of the various algorithms can be measured via the mean absolute error and the correlation
coefficient. Based on the results above, the worst correlation was found with DecisionTable and the best
correlation with Decision Stump.
Decision Table:
Best first.