Lecture 7 - Weka


Introduction to Weka

Sampath Jayarathna
Cal Poly Pomona
Today’s Workshop!

• What is WEKA?
• The Explorer:
• Preprocess data
• Classification
• Clustering (later)
• Attribute Selection
• Data Visualization
• KnowledgeFlow
• Generate multiple ROC curves
What is WEKA?
• Waikato Environment for Knowledge Analysis
• It’s a data mining/machine learning tool developed by the Department of Computer Science, University of Waikato, New Zealand.
• Weka is also a bird found only on the islands of New Zealand.

• Website: http://www.cs.waikato.ac.nz/ml/weka/
• Supports multiple platforms (written in Java):
• Windows, Mac OS X and Linux
Main Features

• 49 data preprocessing tools
• 76 classification/regression algorithms
• 8 clustering algorithms
• 3 algorithms for finding association rules
• 15 attribute/subset evaluators + 10 search algorithms for feature selection
WEKA GUI

• Main components
• “The Explorer” (exploratory data analysis)
• “The Experimenter” (experimental environment)
• “The KnowledgeFlow” (process model inspired interface)
• “Simple CLI” (Command Line interface)
Explorer: pre-processing the data

• Data can be imported from a file in various formats: ARFF, CSV, C4.5,
binary
• Data can also be read from a URL or from an SQL database (using
JDBC)
• Pre-processing tools in WEKA are called “filters”
• WEKA contains filters for:
• Discretization, normalization, resampling, attribute selection, transforming
and combining attributes, …
WEKA only deals with “flat” files called “arff”
@relation heart-disease-simplified

% Numeric attribute
@attribute age numeric
% Nominal attribute
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol real
@attribute exercise_induced_angina { yes, no}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
% Missing value represented by ?
38,female,non_anginal,?,no,not_present
...
Exercise 1

• Let’s create an ARFF file from our weather data
• Create a CSV file from the weather Excel file
• Now add the ARFF header to the file and rename it weather.arff
• Next, load the weather.arff file into Weka Explorer
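The header step in this exercise can also be scripted. The following is a minimal Python sketch, assuming the CSV has a header row and that every column is nominal. The function name csv_to_arff and the sample rows and column names are illustrative assumptions, not the workshop spreadsheet.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row + nominal columns) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        # Collect the distinct values of this column, in order of first appearance.
        values = []
        for row in data:
            if row[col] != "?" and row[col] not in values:
                values.append(row[col])
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Illustrative rows only (assumed column names, not the workshop file).
sample_csv = """outlook,windy,play
sunny,FALSE,no
overcast,FALSE,yes
rainy,TRUE,no
"""
arff_text = csv_to_arff(sample_csv, "weather")
print(arff_text)
```

Numeric columns would need an "@attribute name numeric" line instead of a value list; the sketch leaves that case out to stay short.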
Subsampling

• Keep invertSelection set to false when creating your train sample (for example, 60% train and 40% test)
• Undo the subsampling and then click the filter again
• Change invertSelection to true when creating your test sample.
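The idea behind the two passes is that the same random selection is drawn both times, and invertSelection returns its complement. A minimal Python sketch of that logic follows; it is not Weka's implementation, and the function name resample and the fixed seed are assumptions. In Weka's Resample filter the invertSelection option is documented as applying only when instances are drawn without replacement, so noReplacement most likely needs to be true for the train and test sets to be disjoint.

```python
import random


def resample(instances, percent, invert_selection, seed=1):
    """Return the sampled subset, or its complement when invert_selection is True."""
    indices = list(range(len(instances)))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle on both calls
    cut = round(len(instances) * percent / 100.0)
    chosen = indices[cut:] if invert_selection else indices[:cut]
    return [instances[i] for i in sorted(chosen)]


data = list(range(150))                        # stand-in for the 150 iris instances
train = resample(data, 60, invert_selection=False)
test = resample(data, 60, invert_selection=True)
print(len(train), len(test))
```

Because both calls share the seed, the two subsets never overlap and together cover the whole dataset.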
Exercise 2

• Load the iris.arff dataset from Weka 3.8.1\data
• Create train.arff and test.arff with a 60/40 split using the subsampling filter unsupervised.instance.Resample
Data Imbalance

• Consider random under-sampling or over-sampling to keep the model from overfitting to the majority class.
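Both strategies can be sketched in a few lines of Python. This is an illustration under simple assumptions (instances are (features, label) pairs, and every class is brought to the same size); the function names and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_class(instances):
    """Map each class label to the list of its (features, label) instances."""
    groups = defaultdict(list)
    for features, label in instances:
        groups[label].append((features, label))
    return groups


def undersample(instances, seed=1):
    """Randomly drop instances until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(instances)
    target = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def oversample(instances, seed=1):
    """Randomly duplicate instances until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(instances)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced


# Toy data: 8 majority instances, 2 minority instances.
toy = [((i,), "majority") for i in range(8)] + [((i,), "minority") for i in range(2)]
under_counts = Counter(label for _, label in undersample(toy))
over_counts = Counter(label for _, label in oversample(toy))
print(under_counts, over_counts)
```

Under-sampling discards majority data, while over-sampling repeats minority instances, so each trades information loss against duplicated examples.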
Exercise 3

• Load the unbalanced.arff dataset from Weka 3.8.1\data
• Balance the dataset using the ClassBalancer filter found under supervised/instance
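Unlike resampling, ClassBalancer is documented as reweighting instances so that each class ends up with the same total weight while the overall sum of weights is preserved. A minimal Python sketch of that arithmetic, assuming every instance starts with weight 1 (the function name is hypothetical):

```python
from collections import Counter


def class_balancer_weights(labels):
    """Give each class the same total weight while keeping the overall sum."""
    counts = Counter(labels)
    per_class_total = len(labels) / len(counts)   # equal share of the total weight
    return [per_class_total / counts[label] for label in labels]


labels = ["a"] * 8 + ["b"] * 2
weights = class_balancer_weights(labels)
print(weights[0], weights[-1], sum(weights))
```

With 8 instances of one class and 2 of the other, each class receives a total weight of 5, so rare instances count for more without any rows being added or removed.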
Explorer: building “classifiers”

• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
• Decision trees and lists, instance-based classifiers, support vector machines,
multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
• Bagging, boosting, stacking, error-correcting output codes, locally weighted
learning, …

05/09/2024 University of Waikato
Exercise 4

• Load the iris.arff dataset from Weka 3.8.1\data
• Use the KNN classifier with K=1, K=3, and K=5, and report the accuracy for each
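To see why the value of K matters, here is a minimal Python sketch of k-nearest-neighbour voting with Euclidean distance. It is an illustration on made-up 2-D points, not Weka's KNN classifier (IBk, listed under lazy), and the function names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query (Euclidean)."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Made-up points: class "b" has one outlier sitting next to the "a" cluster.
train = [((1.0, 1.0), "a"), ((1.5, 1.0), "a"), ((1.0, 1.5), "a"),
         ((4.0, 4.0), "b"), ((4.5, 4.0), "b"), ((2.0, 2.0), "b")]
test = [((1.9, 1.9), "a"), ((4.2, 4.1), "b")]

results = {k: accuracy(train, test, k) for k in (1, 3, 5)}
print(results)
```

With K=1 the single outlier decides the vote for the first test point; larger K lets the surrounding cluster outvote it, which is the kind of effect to look for when comparing accuracies in the exercise.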
Feature Selection (filter method)
Exercise 5

• Using the iris.arff dataset from Weka 3.8.1\data
• Remove the last 5 features in the ranker output one by one and report the accuracy each time.
Knowledge Flow Interface

• Visual, drag-and-drop user interface for WEKA that is intuitive to use
• Java-Beans-based
• Can do everything that Explorer does (plus a bit more), but not as comprehensively as Experimenter
• Data sources, classifiers, etc. are beans and can be connected graphically
• Data “flows” through modules: e.g., “data source” -> “filter” -> “classifier” -> “evaluator”
• KF layouts can be saved and re-used later
Knowledge Flow: An Example

• What we want to do:
• Take a dataset
• Do some attribute selection
• Perform some classification on the reduced data using 10-fold CV
• Examine the subsets selected for each CV fold
• Visualize the results in text format and as a ROC curve
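The fold-making step can be sketched in Python as follows. This is a plain, non-stratified split written for illustration; it does not reproduce how Weka's CrossValidationFoldMaker assigns instances, and the function name and seed are assumptions. Attribute selection and training would run on each train part, which is why a different attribute subset can appear for each fold.

```python
import random


def cross_validation_folds(n_instances, n_folds=10, seed=1):
    """Yield (train_indices, test_indices) pairs; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    for fold in range(n_folds):
        test = indices[fold::n_folds]          # every n_folds-th shuffled index
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# 14 is just a small example size.
folds = list(cross_validation_folds(14, n_folds=10))
for train, test in folds:
    print(len(train), len(test))
```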
Getting Started
A few ‘hidden’ steps…
Add the Classifier learner
…and the performance evaluator
TextViewers can be used for visualisation of results as well as examining the processes; more later…
‘Right-clicking’ on each ‘block’ allows you to configure it as well as ‘wire-up’ to others…
Connect: dataSet to CrossValidationFoldMaker
Continue to ‘wire-up’ each ‘block’…
…and so on
To see the results output: ‘dump’ the text to TextViewer…
When you have finished ‘wiring-up’, it’s time to configure each of the components/blocks…
Set the path/filename(s) of the datasets you would like to load…
Once all is configured, you are ready to start…
Once all experiments have finished, we can visualize the results…
Output is similar to that of the console window of Explorer
But there are also ways to save these results if we want to keep them for later…
TextViewer components are also useful for ‘looking-inside’ processes…
For example: attribute selection…
It is also possible to visualize data in a similar way to Explorer, e.g. ROC/threshold curves
Area under the ROC Curve
• True positive rate = tp/(tp+fn) = recall = sensitivity
• False positive rate = fp/(tn+fp)
• An ROC curve demonstrates several things:
• The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
• The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
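The two rates above are enough to trace a ROC curve by hand: sort the instances by predicted score, lower the threshold one step at a time, and record (false positive rate, true positive rate) after each step. A minimal Python sketch follows, with the area computed by the trapezoidal rule. The scores and labels are made up for illustration, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low; return (FPR, TPR) points."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up classifier scores; label 1 = positive class, 0 = negative class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

A curve lying on the 45-degree diagonal gives an area of 0.5, and a curve that hugs the left and top borders approaches 1.0, which matches the two bullets above.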
Q: Let’s generate a ROC curve

• Use the weather data ARFF file to generate a ROC curve. Follow this KnowledgeFlow:
Exercise: How to generate multiple ROC curves?
• Use the weather data ARFF file to generate multiple ROC curves in a single chart (pick 3 classifiers)
• Remember to add the following to your KnowledgeFlow:
• ClassValuePicker between ClassAssigner and CrossValidationFoldMaker
• ModelPerformanceChart from Visualization
• To save your ROC curve:
• Shift + Alt + Left Click
Write your own algorithms…

• WEKA is Open Source!


• Much of the work is already done for you
• Take advantage of the WEKA framework
• Writing code and contributing to the WEKA project is now easier than before; see:
http://weka.wikispaces.com/How+can+I+contribute+to+WEKA%3F
Conclusion

• Explorer and Knowledge Flow:
• Offer useful and flexible ways to perform a range of batches of experiments
• Beware of the way in which results are generated!
• KF is particularly useful for visualization
• This is just a snapshot of WEKA’s capabilities!
