WEKA Lab Manual
Launching WEKA
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI (multiple document interface) appearance, this is provided by an alternative launcher called Main (class weka.gui.Main). The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:

Explorer - An environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).
Experimenter - An environment for performing experiments and conducting statistical tests between learning schemes.
Knowledge Flow - This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
Simple CLI - Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.
Working with Explorer

Weka Data File Format (Input)

The most popular data input format of Weka is ARFF (with .arff being the extension of your input data file).

Experiment 1: WEATHER RELATION

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program:

Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.
CLASSIFICATION:
The user has the option of applying many different algorithms to the data set in order to produce a representation of information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices. Figure 5 shows some of the categories.
Output:
Correctly Classified Instances      9    64.2857 %
Incorrectly Classified Instances    5    35.7143 %
Kappa statistic                     0
Mean absolute error                 0.4762
Root mean squared error             0.4934
Relative absolute error           100      %
Root relative squared error       100      %
Total Number of Instances          14

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
1        1        0.643      1       0.783      0.178     yes
0        0        0          0       0          0.178     no
Weighted Avg.  0.643  0.643  0.413  0.643  0.503  0.178

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no
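The summary statistics above can be recomputed from the confusion matrix alone. The following is a minimal sketch (the matrix values are taken from the output above; variable names are illustrative):

```python
# Recompute accuracy and the kappa statistic from a 2x2 confusion matrix.
# Rows are actual classes, columns are predicted classes.
matrix = [[9, 0],   # actual yes: 9 predicted yes, 0 predicted no
          [5, 0]]   # actual no:  5 predicted yes, 0 predicted no

total = sum(sum(row) for row in matrix)
observed = sum(matrix[i][i] for i in range(2)) / total  # accuracy

# Expected chance agreement: sum over classes of row marginal * column marginal.
expected = sum(
    (sum(matrix[i]) / total) * (sum(r[i] for r in matrix) / total)
    for i in range(2)
)
kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

print(round(observed * 100, 4))  # 64.2857, the "Correctly Classified" percentage
print(kappa)                     # 0.0 - ZeroR never beats chance agreement
```

A kappa of 0 is exactly what the output reports: ZeroR predicts the majority class for every instance, so its agreement with the true labels is no better than chance.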
CLUSTERING:
The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab.
S.K.T.R.M College of Engineering
ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular, Apriori, is shown in the figure below.
SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they can deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options: an attribute evaluator and a search method. Once this is done the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality with the data. Figure 8 shows the options of attribute evaluation.
OUTPUT:
VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
OUTPUT
PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
CLASSIFICATION:
The Classify tab is where the user selects the classifier choices and applies them to the data set in order to produce a representation of the information.

OUTPUT:

=== Run information ===

Scheme: weka.classifiers.rules.ZeroR
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
CLUSTERING:
The Cluster tab opens the process that is used to identify commonalties or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.
OUTPUT:

Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
Test mode: evaluate on training data

=== Model and evaluation on training set ===

EM
==

Number of clusters selected by cross validation: 1

            Cluster
Attribute         0
                (1)
======================
ename
  john            3
  tony            2
  ravi            1
  [total]         6
eid
  mean           85
  std. dev.       0
esal
  mean    8833.3333
  std. dev. 471.4045
edept
  sales           3
  admin           2
  [total]         5

Clustered Instances
0    3 (100%)
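EM reports a mean and standard deviation for each numeric attribute in each cluster; the standard deviation it prints is the population form (dividing by n, not n-1). The sketch below reproduces the esal figures above using hypothetical salary values chosen only to illustrate the calculation:

```python
import math

# Hypothetical esal values - illustrative only, not taken from the source data.
values = [9500.0, 8500.0, 8500.0]

n = len(values)
mean = sum(values) / n
variance = sum((v - mean) ** 2 for v in values) / n  # population variance (divide by n)
std_dev = math.sqrt(variance)

print(round(mean, 4))     # 8833.3333
print(round(std_dev, 4))  # 471.4045
```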
ASSOCIATION:
The associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses start to yield the results. There are few options for this window and one of the most popular, Apriori, is shown in Figure below.
=== Run information ===

Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they can deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options: an attribute evaluator and a search method. Once this is done the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality with the data. Figure 8 shows the options of attribute evaluation.
OUTPUT:

=== Attribute Selection on all input data ===

Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 11
  Merit of best subset found: 0.196

Attribute Subset Evaluator (supervised, Class (nominal): 5 play):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,4 : 2
  outlook
  windy
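The forward search above starts from the empty attribute set and repeatedly adds whichever attribute most improves the subset's merit. A toy sketch of that greedy loop follows; the merit function here is an invented stand-in, not Weka's CFS evaluator, and the "good" attribute indices are chosen only so the result mirrors the "Selected attributes: 1,4" line above:

```python
# Greedy forward attribute selection - a simplified sketch of Weka's
# best-first forward search. merit() is a toy stand-in for CFS.
def merit(subset):
    good = {0, 3}  # hypothetical predictive attributes (0-based)
    # Reward predictive attributes, lightly penalize redundant ones,
    # loosely mimicking CFS's relevance-vs-redundancy trade-off.
    return len(subset & good) - 0.1 * len(subset - good)

def forward_select(n_attributes):
    selected = set()
    best = merit(selected)
    improved = True
    while improved:
        improved = False
        for a in range(n_attributes):
            if a in selected:
                continue
            score = merit(selected | {a})
            if score > best:
                best, selected, improved = score, selected | {a}, True
    return sorted(selected)

print(forward_select(5))  # [0, 3] - i.e. attributes 1 and 4 in 1-based numbering
```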
VISUALIZATION:
PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program:

Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.
CLASSIFICATION:
The Classify tab is where the user selects the classifier choices and applies them to the data set in order to produce a representation of the information.
CLUSTERING:
The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: weather
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Test mode: evaluate on training data

=== Model and evaluation on training set ===

EM
ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular, Apriori, is shown in the figure below.
=== Run information ===

Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: student
Instances: 3
Attributes: 4
  sname
  sid
  sbranch
  sage
SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they can deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation.
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 7
  Merit of best subset found: 1

Attribute Subset Evaluator (supervised, Class (numeric): 4 sage):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,3 : 2
  sname
  sbranch
VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program:

Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.
CLASSIFICATION:
The Classify tab is where the user selects the classifier choices and applies them to the data set in order to produce a representation of the information.
Output:
Scheme: weka.classifiers.rules.ZeroR
Relation: labor
Instances: 3
Attributes: 6
  name
  wage-increase-first-year
  wage-increase-second-year
  working-hours
  pension
  vacation
Test mode: 2-fold cross-validation

=== Classifier model (full training set) ===

ZeroR predicts class value: 15.0

Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient          0
Mean absolute error              0
Root mean squared error          0
Relative absolute error        NaN %
Root relative squared error    NaN %
Total Number of Instances        3
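For a numeric class, ZeroR simply predicts the mean of the class values seen during training, which is why the model above predicts 15.0 for every instance. A minimal sketch (the training values are hypothetical, chosen so their mean is 15.0):

```python
# ZeroR for a numeric class: always predict the training-set mean.
class ZeroR:
    def fit(self, class_values):
        self.prediction = sum(class_values) / len(class_values)
        return self

    def predict(self, instance=None):
        return self.prediction  # same value regardless of the instance

# Hypothetical vacation values whose mean is 15.0, as in the output above.
model = ZeroR().fit([10.0, 15.0, 20.0])
print(model.predict())  # 15.0
```

Because the prediction never varies, ZeroR is useful only as a baseline: any real classifier should beat it.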
CLUSTERING:
The Cluster tab opens the process that is used to identify commonalties or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.
ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular, Apriori, is shown in the figure below.
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they can deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options: an attribute evaluator and a search method. Once this is done the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality with the data. Figure 8 shows the options of attribute evaluation.
=== Attribute Selection on all input data ===

Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 19
  Merit of best subset found: 0

Attribute Subset Evaluator (supervised, Class (numeric): 6 vacation):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1 : 1
  name
VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
EXPERIMENTER:
The Weka Experiment Environment enables the user to create, run, modify, and analyse experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyse the results to determine if one of the schemes is (statistically) better than the other schemes.The Experiment Environment can be run from the command line using the Simple CLI
Defining an Experiment

When the Experimenter is started, the Setup window (actually a pane) is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment.
To define the dataset to be processed by a scheme, first select Use relative paths in the Datasets panel of the Setup window and then click Add New to open a dialog box, as shown below.
The dataset name is now displayed in the Datasets panel of the Setup window. Saving the Results of the Experiment To identify a dataset to which the results are to be sent, click on the CSVResultListener entry in the Destination panel. Note that this window (and other similar windows in Weka) is not initially expanded and some of the information in the window is not visible. Drag the bottom right-hand corner of the window to resize the window until the scroll bars disappear.
The output file parameter is near the bottom of the window, beside the text outputFile. Click on this parameter to display a file selection window.
The dataset name is displayed in the Destination panel of the Setup window.
Saving the Experiment Definition

The experiment definition can be saved at any time. Select Save at the top of the Setup window. Type the dataset name with the extension .exp (or select the dataset name if the experiment definition dataset already exists).
The experiment can be restored by selecting Open in the Setup window and then selecting Experiment1.exp in the dialog window. Running an Experiment To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the Iris dataset, using 66% of the patterns for training and 34% for testing, and using the ZeroR scheme.
If the experiment was defined correctly, the 3 messages shown above will be displayed in the Log panel. The results of the experiment are saved to the dataset Experiment1.txt.
Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number _correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_ unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_re lative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_en tropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_ relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false _positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives ,IR_precision,IR_recall,F_measure,Summary iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0, 29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100 .0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0 .0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,? iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0, 21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,10 0.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0. 0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
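Each CSV row can be checked for internal consistency; for example, Percent_correct is just Number_correct divided by Number_of_instances. A small sketch parsing the leading fields of the first result row (the field positions are taken from the header above):

```python
# Leading fields of the first iris/ZeroR result row from the CSV output.
row = ("iris,1,weka.classifiers.ZeroR,'',6077547173920530258,"
       "2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235").split(",")

instances = float(row[6])        # Number_of_instances
correct = float(row[7])          # Number_correct
percent_correct = float(row[10]) # Percent_correct

# The reported percentage matches correct / instances.
print(round(correct / instances * 100, 8))  # 29.41176471
```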
Experiments can be run on a single machine, or as remote experiments, which are distributed between several hosts. Here the experiment is set up for the employee relation.
Add a new relation using the Add new button on the right panel, give the database connection using JDBC, and click OK.
Choose ZeroR from the Choose menu by clicking the Add new button on the right panel, and click OK.
Experiments can be run on a single machine, or as remote experiments, which are distributed between several hosts. Here the experiment is set up for the labor relation.
Add a new relation using the Add new button on the right panel, give the database connection using JDBC, and click OK.
Experiments can be run on a single machine, or as remote experiments, which are distributed between several hosts. Here the experiment is set up for the student relation.
Choose ZeroR from the Choose menu by clicking the Add new button on the right panel, and click OK.
KNOWLEDGE FLOW
The Knowledge Flow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The Knowledge Flow is a work in progress so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the Knowledge Flow but not in the Explorer.
Components
Components available in the KnowledgeFlow:
DataSources
All of WEKA's loaders are available.

DataSinks
All of WEKA's savers are available.

Filters
All of WEKA's filters are available.

Classifiers
All of WEKA's classifiers are available.

Clusterers
All of WEKA's clusterers are available.
TrainingSetMaker - make a data set into a training set.
TestSetMaker - make a data set into a test set.
CrossValidationFoldMaker - split any data set, training set or test set into folds.
TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set.
ClassAssigner - assign a column to be the class for any data set, training set or test set.
ClassValuePicker - choose a class value to be considered as the positive class. This is useful when generating data for ROC style curves (see ModelPerformanceChart below and example 6.4.2).
ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested classifiers.
IncrementalClassifierEvaluator - evaluate the performance of incrementally trained classifiers.
ClustererPerformanceEvaluator - evaluate the performance of batch trained/tested clusterers.
PredictionAppender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions.
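The behaviour of TrainTestSplitMaker can be sketched in a few lines: shuffle the data with a fixed seed, then cut it at the requested percentage. This is a simplification (the real component operates on Weka's internal Instances format), shown here with a 66% split over 150 instances, as with the Iris dataset:

```python
import random

def train_test_split(data, train_percent=66.0, seed=1):
    """Shuffle with a fixed seed, then split - a sketch of what
    TrainTestSplitMaker does. The seed value here is arbitrary."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100.0)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(range(150), 66.0)  # e.g. 150 Iris instances
print(len(train), len(test))  # 99 51
```

Note the 51 test instances: this is why each run in the Experimenter CSV output earlier reports Number_of_instances as 51.0 for the Iris dataset with a 66% split.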
Visualization
DataVisualizer - component that can pop up a panel for visualizing data in a single large 2D scatter plot.
ScatterPlotMatrix - component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot).
AttributeSummarizer - component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data.
ModelPerformanceChart - component that can pop up a panel for visualizing threshold (i.e. ROC style) curves.
Experiment 9

Aim: Setting up a flow to load an ARFF file (batch mode) and perform a cross validation using J48 (Weka's C4.5 implementation).
First start the KnowledgeFlow. Next click on the DataSources tab and choose "ArffLoader" from the toolbar (the mouse pointer will change to a "cross hairs").
Next place the ArffLoader component on the layout area by clicking somewhere on the layout (A copy of the ArffLoader icon will appear on the layout area). Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu and browse to the location of your arff file.
Next click the "Evaluation" tab at the top of the window and choose the "ClassAssigner" (allows you to choose which column to be the class) component from the toolbar. Place this on the layout.
Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select the "dataSet" entry under "Connections" in the menu, then move the mouse over the ClassAssigner component and left click - a red line labeled "dataSet" will connect the two components.
Next right click over the ClassAssigner and choose "Configure" from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).
Next place a "CrossValidationFoldMaker" component from the "Evaluation" toolbar on the layout. Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over "ClassAssigner" and selecting "dataSet" from under "Connections" in the menu.
Next click on the "Classifiers" tab at the top of the window, place a "J48" component on the layout, and connect the CrossValidationFoldMaker to J48 TWICE by first choosing "trainingSet" and then "testSet" from the pop-up menu for the CrossValidationFoldMaker.
Next go back to the "Evaluation" tab and place a "ClassifierPerformanceEvaluator" component on the layout.
Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.
Next go to the "Visualization" toolbar and place a "TextViewer" component on the layout.
Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the "text" entry from the pop-up menu for ClassifierPerformanceEvaluator.
Now start the flow executing by selecting "Start loading" from the pop-up menu for ArffLoader.
When finished you can view the results by choosing "Show results" from the pop-up menu for the TextViewer component.
Simple CLI
The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters, clusterers, etc., but without the hassle of the CLASSPATH (it uses the CLASSPATH with which Weka was started). It offers a simple Weka shell with separated command line and output.
Commands
The following commands are available in the Simple CLI:

java <classname> [<args>] - invokes a java class with the given arguments (if any)
break - stops the current thread, e.g., a running classifier, in a friendly manner
kill - stops the current thread in an unfriendly fashion
cls - clears the output area
exit - exits the Simple CLI
help [<command>] - provides an overview of the available commands if without a command name as argument, otherwise more help on the specified command
Command redirection
Starting with this version of Weka one can perform a basic redirection: java weka.classifiers.trees.J48 -t test.arff > j48.txt Note: the > must be preceded and followed by a space, otherwise it is not recognized as redirection, but part of another parameter.
Command completion
Commands starting with java support completion for classnames and filenames.
Description of the German credit dataset in ARFF (Attribute Relation File Format) Format:
Structure of ARFF Format:
% comment lines
@relation <relation name>
@attribute <attribute name> <type>
@data
Set of data items separated by commas.

% 1. Title: German Credit data
%
% 2. Source Information
%
% Professor Dr. Hans Hofmann
% Institut für Statistik und Ökonometrie
% Universität Hamburg
% FB Wirtschaftswissenschaften
% Von-Melle-Park 5
% 2000 Hamburg 13
%
% 3. Number of Instances: 1000
%
% Two datasets are provided. The original dataset, in the form provided
% by Prof. Hofmann, contains categorical/symbolic attributes and
% is in the file "german.data".
%
% For algorithms that need numerical attributes, Strathclyde University
% produced the file "german.data-numeric". This file has been edited
% and several indicator variables added to make it suitable for
% algorithms which cannot cope with categorical variables. Several
% attributes that are ordered categorical (such as attribute 17) have
% been coded as integer. This was the form used by StatLog.
%
% 6. Number of Attributes german: 20 (7 numerical, 13 categorical)
%    Number of Attributes german.numer: 24 (24 numerical)
%
% 7. Attribute description for german
%
% Attribute 1: (qualitative)
% Status of existing checking account
% A11 : ... < 0 DM
% A12 : 0 <= ... < 200 DM
% A13 : ... >= 200 DM / salary assignments for at least 1 year
% A14 : no checking account
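Putting the template above together, a minimal ARFF file following the same structure might look like this (the relation and attribute names are illustrative, not the actual german.data schema):

```arff
% Minimal ARFF sketch - illustrative names only
@relation credit_sketch

@attribute checking_status {A11, A12, A13, A14}
@attribute duration numeric

@data
A11, 6
A14, 48
```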
% Attribute 2: (numerical)
% Duration in month
%
% Attribute 3: (qualitative)
% Credit history
% A30 : no credits taken/ all credits paid back duly
% A31 : all credits at this bank paid back duly
% A32 : existing credits paid back duly till now
% A33 : delay in paying off in the past
% A34 : critical account/ other credits existing (not at this bank)
%
% Attribute 4: (qualitative)
% Purpose
% A40 : car (new)
% A41 : car (used)
% A42 : furniture/equipment
% A43 : radio/television
% A44 : domestic appliances
% A45 : repairs
% A46 : education
% A47 : (vacation - does not exist?)
% A48 : retraining
% A49 : business
% A410 : others
%
% Attribute 5: (numerical)
% Credit amount
%
% Attribute 6: (qualitative)
% Savings account/bonds
% A61 : ... < 100 DM
% A62 : 100 <= ... < 500 DM
% A63 : 500 <= ... < 1000 DM
% A64 : .. >= 1000 DM
% A65 : unknown/ no savings account
%
% Attribute 7: (qualitative)
% Present employment since
% A71 : unemployed
% A72 : ... < 1 year
% A73 : 1 <= ... < 4 years
% A74 : 4 <= ... < 7 years
% A75 : .. >= 7 years
%
% Attribute 8: (numerical)
% A172 : unskilled - resident
% A173 : skilled employee / official
% A174 : management/ self-employed/
% From: A103 To: guarantor
%
% Relabeled values in attribute property_magnitude
% From: A121 To: 'real estate'
% From: A122 To: 'life insurance'

S.K.T.R.M College of Engineering
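The ARFF layout described above can be read with a short sketch (plain Python, no Weka required; this is an illustration only, not a full ARFF parser — sparse and quoted values are not handled; the inline sample relation is the weather data shown earlier):

```python
# Minimal ARFF reader sketch: handles comments, @relation, @attribute, @data.
def parse_arff(text):
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):        # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            data.append([v.strip() for v in line.split(",")])
    return relation, attributes, data

sample = """% ARFF file for weather data
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

relation, attrs, rows = parse_arff(sample)
print(relation, attrs, len(rows))
```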
Lab Experiments
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
From the German Credit Assessment Case Study given to us, the following attributes are found to be applicable for Credit-Risk Assessment:
Total Valid Attributes:
1. checking_status
2. duration
3. credit_history
4. purpose
5. credit_amount
6. savings_status
7. employment
8. installment_rate
9. personal_status
10. debtors
11. residence_since
12. property
13. age
14. installment_plans
15. housing
16. existing_credits
17. job
18. num_dependents
19. telephone
20. foreign_worker

Categorical or nominal attributes (which take values such as true/false):
1. checking_status
2. credit_history
3. purpose
4. savings_status
5. employment
6. personal_status
7. debtors
8. property
9. installment_plans
10. housing
11. job
12. telephone
13. foreign_worker

Real-valued attributes:
1. duration
2. credit_amount
3. installment_rate
4. residence_since
5. age
6. existing_credits
7. num_dependents
J48 pruned tree
1. Using the WEKA tool, we can generate a decision tree by selecting the Classify tab.
2. In the Classify tab, click Choose, where a list of different decision tree learners is available; from that list select J48.
3. Under Test options, select the training-set test option.
4. The resulting window in WEKA is as follows:
6. The obtained decision tree for credit risk assessment is too large to fit on the screen.
That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set in order to obtain the first model, which is tested on D1. The second model is trained on subsets D1, D3, ..., Dk and tested on D2, and so on.
1. Select the Classify tab and the J48 decision tree; under Test options select the Cross-validation radio button and set the number of folds to 10.
2. The number of folds indicates the number of partitions of the data set.
3. A Kappa statistic nearing 1 indicates close to 100% accuracy, in which case the errors would be zeroed out; in reality, however, no training set gives 100% accuracy.
Here there are 1000 instances with 100 instances per partition.
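The fold arithmetic above (1000 instances, 10 folds, 100 instances per fold, with the remaining 900 used for training in each iteration) can be sketched as follows (plain Python; index lists only, no learning):

```python
# Sketch of k-fold cross-validation index rotation:
# in iteration i, fold i is the test set and the other k-1 folds form the training set.
def cv_folds(n_instances, k):
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]   # k near-equal partitions
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(cv_folds(1000, 10))
print(len(splits))                            # 10 iterations
print(len(splits[0][0]), len(splits[0][1]))   # 900 train, 100 test
```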
Percentage split does not allow 100%; it allows only up to 99.9%.
Relative absolute error         76.3523 %
Root relative squared error    106.4373 %
Total Number of Instances      500
Percentage Split Result at 99.9%:
Correctly Classified Instances        0          0        %
Incorrectly Classified Instances      1          100      %
Kappa statistic                       0
Mean absolute error                   0.6667
Root mean squared error               0.6667
Relative absolute error               221.7054 %
Root relative squared error           221.7054 %
Total Number of Instances             1
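The Kappa statistic that WEKA reports compares the observed accuracy with the accuracy expected by chance agreement. A quick sketch of the computation from a confusion matrix (the 2x2 matrix below is a made-up example for illustration, not WEKA output):

```python
def kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = actual, cols = predicted)."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    # Chance agreement: product of marginal row and column totals per class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(len(confusion))
    ) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 2-class matrix: 700 of 1000 instances classified correctly.
m = [[600, 100],
     [200, 100]]
print(round(kappa(m), 4))
```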
The accuracy increases because the two attributes foreign_worker and personal_status are not very important for training and analysis. By removing them, the training time is reduced to some extent, and the accuracy increases. The decision tree created earlier is very large compared to the decision tree we have trained now; this is the main difference between the two decision trees.
If we also remove the 9th attribute, the accuracy further increases to 86.6%, which shows that these attributes are not significant for training.
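Removing attributes before training, as done above in WEKA's Preprocess panel, can be mimicked on a plain list-of-dicts dataset. A generic sketch (the instances and values below are illustrative, not taken from german.data):

```python
# Sketch: drop selected attributes from every instance before training.
def remove_attributes(instances, to_remove):
    return [{k: v for k, v in row.items() if k not in to_remove}
            for row in instances]

data = [
    {"checking_status": "A11", "duration": 6,  "foreign_worker": "yes", "class": "good"},
    {"checking_status": "A14", "duration": 48, "foreign_worker": "yes", "class": "bad"},
]
reduced = remove_attributes(data, {"foreign_worker"})
print(sorted(reduced[0]))   # remaining attribute names
```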
Here the accuracy decreases. Select attributes at random and then check the accuracy.
After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the remaining attributes and visualize them.
After removing 14 attributes, the accuracy decreases to 76.4%; hence we can further try random combinations of attributes to increase the accuracy.

Cross validation
              Case 1    Case 2
Total cost    3820      1705
Average cost  3.82      1.705

We don't find this cost factor in problem 6, as there we use equal costs; this is the major difference between the results of problem 6 and problem 9. The cost matrices we used here:

Case 1:
5 1
1 5

Case 2:
2 1
1 2
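The total and average cost reported above come from weighting each cell of the confusion matrix by the corresponding cost-matrix entry. A sketch of the arithmetic (the confusion matrix below is hypothetical, and the cost matrix uses a zero diagonal, i.e. no cost for correct predictions — only the formula is the point):

```python
# Total cost = sum over cells of (count in confusion cell) * (cost of that cell).
def total_cost(confusion, cost):
    return sum(confusion[i][j] * cost[i][j]
               for i in range(len(confusion))
               for j in range(len(confusion[i])))

# Assumed cost matrix: off-diagonal entries penalize the two error types.
cost_matrix = [[0, 5],
               [1, 0]]
# Hypothetical confusion matrix (rows = actual, cols = predicted), 1000 instances.
confusion = [[620, 80],
             [170, 130]]
tc = total_cost(confusion, cost_matrix)
print(tc, tc / 1000)   # total cost, then average cost per instance
```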
4. Set classes as 2.
5. Click on Resize; then we'll get the cost matrix.
6. Change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
7. Then the confusion matrix will be generated, and you can find the difference between the good and bad attributes.
8. Check whether the accuracy changes or not.
4. To generate the decision tree, right-click on the result list and select the Visualize tree option; the decision tree will then be generated.
Visualize tree
12. (Extra Credit): How can you convert a decision tree into "if-then-else" rules? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.
In WEKA, rules.PART is one of the classifiers that converts decision trees into IF-THEN-ELSE rules.

Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier:

PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)

Number of Rules: 4

Yes, sometimes just one attribute can be good enough to make the decision. In this (weather) dataset, the single attribute sufficient for making the decision is outlook:

outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)

With respect to time, the OneR classifier has the highest ranking, J48 is in 2nd place and PART is in 3rd place.
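OneR's attribute choice can be reproduced with a short sketch on the standard 14-instance nominal weather dataset (plain Python, not WEKA's implementation; temperature is omitted for brevity, and humidity is used in its nominal high/normal form). OneR builds one rule per attribute — the majority class for each attribute value — and keeps the attribute with the fewest errors:

```python
from collections import Counter

# Standard 14-instance nominal weather dataset: (outlook, humidity, windy, play).
data = [
    ("sunny", "high", False, "no"),      ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"),  ("rainy", "high", False, "yes"),
    ("rainy", "normal", False, "yes"),   ("rainy", "normal", True, "no"),
    ("overcast", "normal", True, "yes"), ("sunny", "high", False, "no"),
    ("sunny", "normal", False, "yes"),   ("rainy", "normal", False, "yes"),
    ("sunny", "normal", True, "yes"),    ("overcast", "high", True, "yes"),
    ("overcast", "normal", False, "yes"),("rainy", "high", True, "no"),
]
attributes = {"outlook": 0, "humidity": 1, "windy": 2}

def one_r(data, attr_index):
    """Majority-class rule per attribute value; returns (rule, correctly covered count)."""
    rule, correct = {}, 0
    for value in {row[attr_index] for row in data}:
        classes = Counter(row[-1] for row in data if row[attr_index] == value)
        label, count = classes.most_common(1)[0]
        rule[value] = label
        correct += count
    return rule, correct

best = max(attributes, key=lambda a: one_r(data, attributes[a])[1])
rule, correct = one_r(data, attributes[best])
print(best, rule, f"{correct}/14")
```

On this data the winning attribute is outlook with the rule sunny -> no, overcast -> yes, rainy -> yes, covering 10 of 14 instances, matching the OneR output reported above.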
1. Go to Choose, click on Rules, then select PART.
2. Click on Save and Start.
3. Proceed similarly for the OneR algorithm.
If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no
If outlook = sunny and humidity = low then play = yes
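The three rules above translate directly into code. A sketch (the function name is hypothetical, and the nominal humidity values follow the rules as written):

```python
def play(outlook, humidity):
    """If-then-else form of the small decision tree above."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    return None  # the three rules above do not cover outlook = rainy

print(play("overcast", "high"))  # yes
print(play("sunny", "high"))     # no
print(play("sunny", "low"))      # yes
```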