AnswerTree™ 2.0
User’s Guide
SPSS is a registered trademark and the other product names are the trademarks of SPSS Inc. for its proprietary computer
software. No material describing such software may be produced or distributed without the written permission of the owners
of the trademark and license rights in the software and the copyrights in the published materials.
The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the
Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer
Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL
60606-6307.
General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks of
their respective companies.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.
Chapter 1
What Is AnswerTree?
Figure 1-1
AnswerTree window
AnswerTree brings together four of the most popular and current analytic methods
(algorithms) used in science and business for performing classification or
segmentation. It can build a tree for you automatically or let you take control to refine
the tree according to your knowledge of the data. Because AnswerTree automates a
portion of your data analysis, you produce usable and comprehensible results more
quickly than when using traditional exploratory statistical methods. Once you have
generated a model, AnswerTree provides you with all of the validation tools needed for
performing exploratory and confirmatory segmentation and classification analyses.
For customers who are entirely new to AnswerTree, this chapter introduces the
concepts you need to know to get started quickly, including definitions of terms and a
brief description of the algorithms. (If you require a more thorough description of the
statistics and algorithms used, refer to Chapter 14.) For customers who have used
AnswerTree 1.0, the next section of this chapter describes the new features
incorporated into version 2.0.
The remainder of this manual includes two tutorials—Chapter 2 and Chapter 3—to
get you started using all of AnswerTree’s features. In addition, we have included five
extended examples (Chapter 8 through Chapter 12) to help you learn to apply the
features to real data. Other chapters describe the operation of the software in detail. The
final chapter (Chapter 14), new for version 2.0, explains the various algorithms in
mathematical terms, for those who are interested.
Figure 1-4
Decision tree
The four algorithms included in AnswerTree all do basically the same thing—examine
all of the fields of your database to find the one that gives the best classification or
prediction by splitting the data into subgroups. The process is then applied recursively
to subgroups to define sub-subgroups, and so on, until the tree is finished (as defined
by certain stopping criteria). The four methods have different performance
characteristics and features.
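This recursive idea is easy to sketch in code. The following Python outline is purely illustrative and is not AnswerTree's implementation; the find_best_split argument stands in for the method-specific search over all predictor fields (chi-squared tests for CHAID and Exhaustive CHAID, impurity reduction for C&RT, and so on), and median_split is a toy stand-in used only to make the sketch runnable.

def grow_tree(cases, find_best_split, depth=0, max_depth=3, min_cases=5):
    """Recursive partitioning sketch: split cases into subgroups until stopping rules apply."""
    node = {"size": len(cases), "children": []}
    # Stopping rules: maximum depth reached or too few cases to split further.
    if depth >= max_depth or len(cases) < min_cases:
        return node
    split = find_best_split(cases)      # method-specific search over every predictor field
    if split is None:
        return node
    node["split"] = split["rule"]
    for subgroup in split["subgroups"]:
        node["children"].append(
            grow_tree(subgroup, find_best_split, depth + 1, max_depth, min_cases))
    return node

# Toy "best split": cut a single numeric field at its median. Real growing methods
# evaluate every predictor and choose the statistically best split.
def median_split(cases):
    values = sorted(c["x"] for c in cases)
    cut = values[len(values) // 2]
    left = [c for c in cases if c["x"] <= cut]
    right = [c for c in cases if c["x"] > cut]
    if not left or not right:
        return None
    return {"rule": f"x <= {cut}", "subgroups": [left, right]}

data = [{"x": v} for v in (1, 3, 4, 7, 8, 9, 12, 15, 20, 22)]
print(grow_tree(data, median_split))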
For more information on these algorithms, including hints on selecting one for your
analysis, see Chapter 14.
Medical research. Create decision rules that suggest appropriate procedures based on
medical evidence.
Market analysis. Determine which variables, such as geography, price, and customer
characteristics, are associated with sales.
Quality control. Analyze data from product manufacturing and identify variables
determining product defects.
Policy studies. Use survey data to formulate policy using decision rules to select the
most important variables.
Health care. Combine user surveys and clinical data to discover variables that
contribute to health.
A Word of Caution
AnswerTree is like any powerful tool, software or otherwise—it can be powerfully
misused. If your objectives for using AnswerTree are poorly formulated, you are likely
to have poor results. Exploration and discovery in your data should not be completely
question free; data analysis requires alert human participation. The answers that you
get with AnswerTree will depend on the appropriateness of tools that you use, the
condition of your data, and the relevance of the questions that you ask. Here are some
suggestions to help you create a decision tree that truly does what you need it to do:
Always look at the raw data.
Know the characteristics of the variables in your data before you undertake a large
project.
Clean your data or be aware of any irregularities in them.
Validate your AnswerTree results with new data or hold out a test set.
If possible, use traditional statistical models to extend and verify what you learn
with AnswerTree.
Chapter 2
Getting Started with AnswerTree
This brief introduction will help you get started using AnswerTree. You will learn to
load data, create a decision tree, and save the file. To illustrate the use of AnswerTree,
you will analyze the well-known Iris data of Fisher (1936). The objective is to identify
three species of iris flowers based on physical measurements of their petals and sepals.
AnswerTree organizes your work into project files. Each project file is based on a
single data file but can contain multiple tree analyses using different features of
AnswerTree.
Figure 2-1
Startup dialog box
Figure 2-2
New Project dialog box
Figure 2-3
Open dialog box
The New Tree Wizard allows you to select the options for your tree. The first step is to
select a growing method.
Growing Method
The growing methods are CHAID, Exhaustive CHAID, C&RT, and QUEST. For the
first tree created for a project, the default method is CHAID. Choose C&RT for this
example. Click Next.
Figure 2-4
Choosing a growing method
Assigning Variables
Now you need to specify which variables are predictors, which variable is the target,
and so on. Do this by dragging the variable from the source list into the box where it
belongs. For this example, you want to define species as the target variable, so drag
SPECIES to the target variable list. Similarly, assign petal length, petal width, sepal
length, and sepal width as predictor variables. Click Next.
Figure 2-5
Defining the target variable
The next screen allows you to specify a validation method for your tree. For this
example, we do not need to validate the tree. Make sure that Do not validate the tree
is selected and click Next.
(Validation is discussed in detail in Chapter 4.)
Figure 2-6
Validating the tree
Now you need to adjust some of the growing criteria. In the wizard, you can access the
growing criteria through the Advanced Options dialog box. To view it, click the
Advanced Options pushbutton.
In the Advanced Options dialog box, use the first tab, Stopping Rules. Because this
data set contains such a small sample, you need to specify smaller values for the
minimum number of cases—5 for the parent node and 2 for the child node.
Once you have changed the values, click OK to return to the wizard.
Figure 2-7
Advanced Options: Stopping Rules
Grow the root node by clicking Finish on the last screen of the New Tree Wizard. The
target variable is tabulated for the entire sample and is displayed in the Tree window.
You see the minimal tree, which consists of the root node of the decision tree.
Figure 2-8
Growing the root node
Figure 2-9
Minimal tree in the Tree window
Considering the plants with long petals (> 2.450), you see that those with narrow
petals (≤ 1.750) are quite likely to be versicolor, while those with wide petals
(> 1.750) are probably virginica.
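Read as classification rules, these splits amount to a short chain of if/then tests. The following Python sketch uses the cut points quoted above; treating the remaining short-petaled plants as setosa is an assumption based on Fisher's data, since that branch is not described here.

def classify_iris(petal_length, petal_width):
    # Rules implied by the tree above (cut points from the example).
    if petal_length <= 2.450:
        return "setosa"        # short petals (assumed branch)
    elif petal_width <= 1.750:
        return "versicolor"    # long but narrow petals
    else:
        return "virginica"     # long, wide petals

print(classify_iris(4.7, 1.4))   # versicolor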
Figure 2-12
Second-level split
Examining further splits adds little to our understanding of the problem because the
subsequent splits deal with very small numbers of cases.
Figure 2-13
Third-level split
Figure 2-14
Risk summary
Conclusion
You’ve learned how to use AnswerTree to build a tree based on your data and how that
tree helps you make decisions about how to classify cases. You should now understand
the basics of AnswerTree well enough to begin working with your own data.
For more information on the various features of AnswerTree, proceed to the
extended tutorial.
Chapter 3
Extended Tutorial
In this tutorial, you will learn how to create your own trees and classification analyses
by using the various graphical and statistical features of AnswerTree. You will also
learn about the AnswerTree features that enable you to test the performance of your
model. Finally, you will learn how to guide the analysis as it proceeds so that your
model can reflect your preferences and domain knowledge in describing the data.
To keep the processing time low, you must limit the number of levels that AnswerTree
considers. In the Advanced Options dialog box, make sure that the Stopping Rules tab
is on top. In the Maximum Tree Depth group, enter 4 as the value for C&RT, as shown
in Figure 3-3. Because we have a relatively small sample here, change the values for
Parent node and Child node to 25 and 1, respectively. Click OK to accept the new
settings, and then choose Finish on the last screen of the New Tree Wizard.
Figure 3-3
Advanced Options: Stopping Rules
The root node is a tabulation of the target variable. It tells us that CRIT ACCT (category 5)
comprises 293 cases, or 29.3%, of the total of 1000. Note that the critical
loans have been over-sampled to give a more accurate picture of this group.
From here, go ahead and grow the entire tree by choosing Grow Tree from the
Tree menu.
Figure 3-4
Root node of C&RT tree
We now have the C&RT tree that results from automatic growth limited to four levels.
To show the tree map, from the menus choose:
View
Tree Map
Figure 3-5
C&RT tree with tree map
In the view shown here, the Zoom feature on the View menu has been used to reduce
the displayed size of the tree. The Zoom feature is helpful when you need
to adjust the size of the image for printing, viewing, or exporting.
Figure 3-6
Zoomed tree view
Tree Map
The tree map provides you with a bird’s-eye view of your decision tree and a way to
move quickly from one position in the Tree window to another.
The Tree Map Viewer is linked to the Tree window. If you select a node from the
tree map, the main window will update and move to the selected node. All open
viewers and windows update with new information about the node you have selected.
The tree map can be selected by using the Tree Map button on the toolbar or by
choosing Tree Map from the View menu.
Figure 3-7
Tree Map button
Each node in the tree map is labeled with a number that is referenced in all other
information displays, such as the gains and risk charts. Depending on the size and
complexity of the tree, the node numbers may be too small to see; expanding the Tree Map
window may make them visible.
Figure 3-8
Tree Map Viewer
Node Graphs
Similarly, you can display only the graphs if you choose. In our example, selecting
node 22 shows that critical accounts predominate.
Figure 3-9
Tree view with node graphs and statistics
Graph Viewer
The Graph Viewer shows summary statistics for nodes in the tree as graphs. To view a
node graph, select a node, and then choose Graph on the View menu, or use the Graph
button on the toolbar.
Figure 3-10
Graph Viewer button
The Graph window updates to display the target variable distribution of any node that
you select. You can print and export the graph.
Figure 3-11
Graph Viewer
Gains Charts
In the case of a categorical target variable, gains charts provide you with node statistics
that describe the classification tree relative to the target category of the target variable.
If the target variable is continuous, gains charts provide you with node statistics relative
to the mean of the target variable. Alternatively, you can display node performance in
percentiles, with a user-definable increment. To specify options for the gains chart,
click the Gains tab in the Tree window; then from the menus choose:
Format
Gains…
Figure 3-12
Gain Summary dialog box
Here we have disabled cumulative statistics and have requested gains for the
percentage of cases in the target category, critical accounts (CRIT ACCT).
Figure 3-13
Gain summary
The terminal node gain summary for target variable ACCOUNT STATUS has the target
category set to critical accounts. We can read the statistics across the row labeled Node 22.
The columns labeled Node: n and Node: % tell us that this node captured 58 cases, or
5.80% of the total number of cases. The columns labeled Resp: n and Resp: % specify that
out of the 58 cases for the node, 54 are identified as critical accounts and that this
represents 18.43% of the total target class observations.
The Gain (%) column shows the percentage of the cases in the node that have the
target value for the target variable. For node 22, we would divide 54 by 58 to arrive at
93.1034%.
The Index (%) column indicates the makeup of the node (with respect to critical
accounts) compared to the makeup of the entire sample. You arrive at the Index (%)
value by taking the ratio of the Gain (%) value and the proportion of target category
responses in the entire sample. The Index (%) for node 22 is computed by taking
93.1034% (the gain percentage) and dividing it by 29.3% (the percentage of critical
accounts found in the root node), and computing a percentage. The result is
317.7592%, the index score for this node.
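The arithmetic behind the Gain (%) and Index (%) columns can be checked directly. A minimal sketch using the node 22 figures quoted above:

node_n = 58        # cases in node 22
node_resp = 54     # cases in node 22 that are critical accounts
total_n = 1000     # cases in the whole sample
total_resp = 293   # critical accounts in the whole sample

gain_pct = 100.0 * node_resp / node_n     # about 93.1034
root_pct = 100.0 * total_resp / total_n   # 29.3
index_pct = 100.0 * gain_pct / root_pct   # about 317.76

print(round(gain_pct, 4), round(index_pct, 4))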
The percentile gain summary for the target variable, ACCOUNT STATUS, has the
target category set to critical accounts. Percentile gain summaries extend the meaning
of the node-oriented charts by arranging the percentiles according to performance of
the target category. The nodes that make up a particular percentile are listed in the first
column. Nodes that are listed in two or more consecutive rows of a percentile gains
chart span the increment.
Figure 3-14
Percentile gain summary
Risk Charts
Risk charts tabulate misclassification statistics. When risk is calculated ignoring
misclassification costs, it is equivalent to error. You can find the misclassification
matrix on the Risk tab of the Tree window.
Figure 3-15
Risk summary
The misclassification matrix counts up the predicted and actual category values and
displays them in a table. A correct classification is added to the counts in the diagonal
cells of the table. The diagonal elements of the table represent agreement between the
predicted and actual value—this is often called a “hit.” An incorrect classification—
called a “miss”—means that there is disagreement between predicted and actual value.
Misclassifications are counted in the off-diagonal elements of the matrix. In this
example, 11 applicants with no credit or no debt (NCR/NODEB) were misclassified as
having current, up-to-date credit accounts (PD BK). This table is helpful in
determining exactly where your model performs well or poorly.
The risk estimate and standard error of risk estimate are values that indicate how
well your classifier is performing. In this case, the risk estimate for the four-level
C&RT tree is 0.2880, and the standard error for the risk estimate is 0.0143. In other
words, we are missing 28.8% of the time. At this point, we might want to think about
ways to improve our model.
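These figures can be reproduced with simple proportion arithmetic. The sketch below assumes the standard error is the usual binomial formula, sqrt(r(1 - r)/N), which matches the values quoted above; Chapter 14 gives the exact definitions.

import math

n_total = 1000
n_missed = 288                    # misclassified cases (28.8% of 1000)

risk = n_missed / n_total         # 0.2880
se = math.sqrt(risk * (1 - risk) / n_total)   # about 0.0143

print(f"risk estimate = {risk:.4f}, s.e. = {se:.4f}")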
Rules
Rules tell you the characteristics of cases that make up particular nodes. For example,
node 15 is made up of applicants who have other bank or store debts, are foreign
workers, and who intend to use the loan for domestic appliances or job retraining. This
information can be useful for understanding key segments of your data and for
classifying new cases.
Rules can be generated in any of three formats: SPSS syntax, SQL query, or
decision rules. SPSS syntax rules can be used in SPSS or NewView to classify cases
based on values of predictor variables. Likewise, SQL rules can be used to extract and
label cases from your SQL database engine. Decision rules are plain-language
descriptions of the characteristics of nodes, suitable for including in reports or
presentations.
Figure 3-16
Rules summary
Analysis Summary
The Summary tab in the Tree window contains information presented as text. The
following information is included in the summary:
Project information. Project name, name of tree, data file, number of cases, and
weighting.
Partition information.
Cross-validation information.
Tree-growing criteria. Growing method, algorithm specifications, stopping rules,
pruning.
Model. Target variable, predictors, cost by target category, priors by target
category, profits by target category.
Figure 3-17
Analysis summary
The Summary tab gives you a way to audit your analyses. Its text output reports the
dialog box settings for your model, the growing criteria, and other information about the
current tree. You can use the analysis summary as part of your report or use it to tune
your analysis by changing the model or criteria. Note that the contents of the analysis
summary update to reflect any changes you make to the current tree, whether you modify
the existing tree or grow a new one. To keep the summary for a particular tree, save the
project before growing a new tree, or export the summary output as a text file by choosing
File
Export...
with the Summary tab open.
Why would you want to change the configuration of your automatically grown
classifier? Because AnswerTree is an exploratory data analysis tool, you are not
required to adhere to assumptions about the data or model that you produce. The
measurement of a tree classifier’s success or failure is based on the performance
characteristics of the entire tree classifier rather than the choice of a “best,” or most
appropriate, variable at any given point in the model. By applying your special
knowledge of the problem at hand, you can compose a model that looks more like what
you expect to see from the data.
Pruning a Branch
The tree metaphor applies easily here. Pruning is simply the act of eliminating a portion
of a branch from your decision tree. To prune a branch, select the node whose
descendants you want to eliminate, and from the menus choose:
Tree
Remove Branch
All of the nodes that existed as children to the selected node are eliminated from the
model. All of the appropriate views and summary statistics are automatically updated.
Figure 3-18
Pruning a branch from the tree
You have four levels in the current tree. If you want only three levels or want to add a
fifth level, you can accomplish this by choosing from the menus, respectively,
Tree
Remove One Level
or
Tree
Grow Tree One Level
The results of removing a level from our four-level tree are shown below. Again, all of
the appropriate views and summary statistics are automatically updated with new
information to reflect the status of the new tree.
Figure 3-19
Tree with one level removed
Conclusion
You’ve learned how to use AnswerTree to build a tree based on your data and how that
tree helps you make decisions about how to classify cases. You’ve also learned how to
access various features of AnswerTree that help you better understand your model and
to adjust your model according to the needs of your situation.
Chapter 4
Growing the Root Node
Figure 4-1
Project, Tree Map, Viewer, and Tree windows
This chapter focuses on the Project window, which is where you begin the tree-
growing process. In this chapter, you will learn how to get started growing a tree and
how to specify all of the options that accompany the tree. Once you have grown the
root node, it appears in the Tree window. For more information about the Tree window
and its various options, see Chapter 5.
Project Window
The Project window is the application’s main window, showing all contents of the
project. It contains an outline control that organizes the project items hierarchically. If
there is no open project, the window is empty.
Figure 4-2
Project window
The highest level in the outline is the project level. New trees are appended to the
outline as children of the project. Double-clicking an entry in the Project window
activates the associated Tree window. Right-clicking an entry allows you to delete or
to close the corresponding tree.
Names of items in the outline can be edited by selecting the item and then clicking
the name. By default, trees are named for the dependent variable used.
Closing the Project window exits the application. Minimizing the Project window
minimizes all AnswerTree windows.
New Project
When you choose New Project from the File menu, the New Project dialog box
appears. You can choose the type of data file you want to use: an SPSS file, a SYSTAT
file, or a database file.
When you choose the Database Capture Wizard option, the Database Capture
Wizard opens. It assists you in creating, running, or editing a database query from any
database for which you have an ODBC driver. Additional options, including SPSS for
Express (for Oracle Express databases) and Business Query for SPSS (for
BusinessObjects databases), are available if you have these applications installed.
Figure 4-3
New Project dialog box
If your data are not in any format that AnswerTree can use, you may want to install the
SPSS ODBC driver, which is included on the AnswerTree CD-ROM. It allows you to
save data in SPSS format from any application capable of writing to ODBC data
sources. See the Readme file included with this driver for more information.
New Tree
Whenever you create a new tree, AnswerTree launches the New Tree Wizard. The
wizard consists of a series of dialog boxes that help you create your tree. At the very
least, you must fill in the first two dialog boxes, in which you select your growing
method and define your model. After defining your model, you can click Finish at any
time to grow your root node.
However, if you continue past the second dialog box (by clicking Next rather than
Finish), you can specify whether or not to validate the tree. Additional advanced
options allow you to specify stopping rules and other growing criteria for the method
you have chosen. Each dialog box in the New Tree Wizard is explained in the
following sections.
Growing Method
The tree is constructed using a particular statistical method to determine how to define
each split. Select one of the four methods available in AnswerTree:
CHAID. Chi-squared Automatic Interaction Detection. This method uses chi-
squared statistics to identify optimal splits. The target variable can be nominal,
ordinal, or continuous.
Model Definition
The variable list on the left contains the variables available in the current project. As
variables are assigned to roles in the new tree, they are removed from this list. To
assign a variable to a particular role in the model, click the variable and drag it to the
desired location. Note that date variables cannot be used in AnswerTree models.
This dialog box has a convenient pop-up menu, which you can access by right-
clicking your mouse. You can change the measurement level of any variable shown,
and you can choose to sort the variables by measurement level or alphabetically. You
can also specify whether variable labels or variable names are used in the dialog box
and in the tree itself. To change the settings, right-click over any variable and select the
appropriate option from the menu.
Target. Indicates the variable to be predicted (also known as the dependent variable).
Predictors. Indicates the variables used to make predictions about target variable
values. After assigning a target variable, you can assign all remaining variables as
predictors by selecting All others. To assign only some of the available variables as
predictors, drag the appropriate variables to the Predictors list.
Frequency. Indicates the variable that specifies frequencies for cases, if any. Use this if
records in your data set represent more than one unit each—for example, if you are
analyzing aggregated data. Values for a frequency variable should be positive integers.
Cases with negative or zero frequency weights are excluded from the analysis. Non-
integer frequency values are rounded to the nearest integer.
Case Weight. Indicates the variable that specifies case weights, if any. Case weights are
used to account for differences in variance across levels of the target variable
(heteroscedasticity). These weights are used in model estimation but do not affect cell
frequencies. Case weight values should be positive, but they can be fractional. Cases
with negative or zero case weight are excluded from the analysis. With a categorical
dependent variable, cases that belong to the same dependent variable class and the
same predictor variable category are grouped together as a cell. The corresponding
case weights are aggregated to form a cell weight for that cell. A contingency table, in
which classes of the dependent variable are used as columns and categories of the
predictor variable being studied are used as rows, is formed and cell weights are used
in the analysis. Case weights are ignored when using the QUEST growing method.
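As a minimal sketch of the frequency-variable rules just described (the record layout and variable names below are made up for illustration and are not AnswerTree's internals):

records = [
    {"purpose": "appliances", "freq": 2.4},   # non-integer: rounded to 2 units
    {"purpose": "retraining", "freq": 0},     # zero frequency: excluded
    {"purpose": "car",        "freq": 3},     # counts as 3 units
]

cases = []
for rec in records:
    f = round(rec["freq"])        # frequencies are rounded to the nearest integer
    if f <= 0:                    # zero or negative frequencies exclude the record
        continue
    cases.extend([rec] * f)       # each record stands for f cases

print(len(cases))                 # 5 cases enter the analysis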
Figure 4-5
New Tree Wizard: Model definition
Validation Options
In many circumstances, you want to be able to assess how well your tree structure
generalizes from the data at hand to a larger sample. There are three validation options
available.
Figure 4-6
New Tree Wizard: Validation options
Do not validate the tree. This option requests no validation procedure. The tree is both
built and tested on the entire set of data.
Partition my data into subsamples. Partitioning divides your data into two sets—a
training sample, from which the model is generated, and a testing sample, on which the
generated model is tested. If the model generated on one part of the data fits the other
part well, this indicates that your tree structure should generalize adequately to larger
data sets that are similar to the current data. If you select partitioning, use the slider
control to determine what proportion of cases go into the training and testing samples.
Note that the slider proportion is approximate.
After setting up the partitions, make sure that your training sample is selected (in the
View menu) and grow the tree. When you are satisfied with the tree, select the testing
sample in the View menu. The results in the Tree window will change to reflect the
results of applying the tree to the testing sample. By examining the risk estimates, gain
summary, and analysis summary, you can determine the extent to which your tree
generalizes.
Advanced Options
The final step in the New Tree Wizard lets you specify advanced options. Advanced
options include stopping rules, pruning, scores, costs, and priors. In addition, there are
options specific to each growing method. Therefore, the options available in the dialog
box will change, depending upon which growing method you have chosen.
If you elect not to specify any advanced options, the software reverts to default
settings, and your tree is generated normally. However, it is best to check the settings
before growing your tree. For example, some settings may need to be changed,
depending on the number of cases in your data file or the desired number of levels in
your tree. To achieve optimal results, you may have to experiment to discover the best
combination of settings.
To specify advanced options from the New Tree Wizard, click the Advanced Options
button (on the fourth screen) to open the Advanced Options dialog box. When you are finished, click
OK to return to the New Tree Wizard.
Note that after a tree is grown, you can go back and change the advanced options,
but this change will cause the tree to be regrown. To do so, from the Tree window
Analysis menu, choose Advanced Options.
Figure 4-7
New Tree Wizard: Advanced Options
In generating a tree structure, the program must be able to determine when to stop
splitting nodes. The criteria for determining this are called stopping rules.
Figure 4-8
Advanced Options: Stopping Rules
Changing the stopping rules after a tree has been grown will invalidate the tree and
require the tree to be regrown.
This tab allows you to control growing criteria for CHAID and Exhaustive CHAID
models.
Figure 4-9
Advanced Options: CHAID
Convergence. This setting allows you to specify the convergence criteria for a CHAID
analysis.
Epsilon. Specify the smallest change required when calculating estimates of
expected cell frequencies.
Maximum iterations. Specify the maximum number of iterations the program
should perform before stopping.
Allow splitting of merged categories. You can use this control to allow splitting of
merged categories.
Use Bonferroni adjustment. This option allows you to correct alpha levels for multiple
comparisons. This option is activated by default.
This tab allows you to select the impurity measure for C&RT models.
Figure 4-10
Advanced Options: C&RT
Impurity Measure for Categorical Targets. Select the impurity measure that you want to
use in growing the tree. Select one of the following:
Gini. A measure based on squared probabilities of membership for each target
category in the node. It reaches its minimum (zero) when all cases in the node fall
into a single target category.
Twoing. A measure based on grouping target classes into the two best subclasses
and computing the variance of the binary variable indicating the subclass to which
each case belongs.
Ordered Twoing. Similar to the twoing criterion, with the additional constraint that
only contiguous target classes can be grouped together.
For continuous targets. The least squared deviation (LSD) measure of impurity is
automatically applied when the target variable is continuous. This index is computed
as the within-node variance, adjusted for frequency or case weights (if any).
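The Gini and least squared deviation measures are simple to state in code. A rough, unweighted sketch (AnswerTree also adjusts for frequency and case weights, as noted above):

def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions (0 for a pure node)."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def lsd(values):
    """Least squared deviation: within-node variance about the node mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

print(gini([50, 0, 0]))      # 0.0, a pure node
print(gini([25, 25]))        # 0.5, a maximally mixed two-category node
print(lsd([1.0, 2.0, 3.0]))  # 0.666...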
Surrogates. Specify the maximum number of surrogates for which to keep statistics at
each node in the tree. For each node, the Select Surrogate dialog box will show only as
many surrogates as you specify here.
Growing criteria and user-defined costs. The Gini criterion is the only choice that
explicitly includes cost information in growing the tree. The twoing and ordered
twoing criteria do not consider costs in constructing the tree, although costs are still
used in computing risks and in node assignment. If you want to use costs with twoing
or ordered twoing, first disable custom costs, and then select Adjust priors using
misclassification costs in the Priors tab of this dialog box. This will incorporate cost
information into the priors and apply them to the model.
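The exact adjustment is documented in Chapter 14. One common formulation from the CART literature, assumed here purely for illustration, scales each prior by the total cost of misclassifying that category and then renormalizes:

def adjust_priors(priors, costs):
    """Fold misclassification costs into priors (CART-style altered priors; a sketch only).

    priors: {category: prior probability}
    costs:  {(predicted, actual): cost of predicting one category when another is true}
    """
    # Total cost of misclassifying each actual category j: C(j) = sum over i of cost(i, j).
    total_cost = {j: sum(costs.get((i, j), 0.0) for i in priors if i != j) for j in priors}
    unnormalized = {j: priors[j] * total_cost[j] for j in priors}
    z = sum(unnormalized.values())
    return {j: v / z for j, v in unnormalized.items()}

priors = {"good": 0.7, "bad": 0.3}
costs = {("good", "bad"): 5.0, ("bad", "good"): 1.0}   # missing a bad risk costs more
print(adjust_priors(priors, costs))   # the "bad" category receives a larger adjusted prior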
This tab allows you to control the alpha setting for QUEST models.
Figure 4-11
Advanced Options: QUEST
Alpha for. This setting allows you to control the alpha levels for variable selection.
Specify an alpha level by adjusting the slider control or entering a value.
Surrogates. Specify the maximum number of surrogates for which to keep statistics at
each node in the tree. For each node, the Select Surrogate dialog box will show only as
many surrogates as you specify here.
This tab allows you to control the subtree selection criterion for pruning procedures.
Note that changing the settings in this dialog box does not activate pruning. To use
pruning, you must choose Grow Tree and Prune from the Tree menu after you have
grown the root node.
Figure 4-12
Advanced Options: Pruning
Select Subtree Based on. You can specify which criterion you want to use to select
subtrees for pruning.
Standard Error rule. If this option is selected, the application chooses the smallest
subtree whose risk is close to that of the subtree with the minimum risk. The
multiplier indicates the number of standard errors used for the standard error rule.
Available options are 0.5, 1.0, 1.5, 2.0, and 2.5. This criterion is the default, with
the 1.0 multiplier selected.
Minimum risk. If this option is selected, the application chooses the subtree that has
the minimum risk.
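A sketch of the two selection criteria, with each candidate subtree represented by a made-up record of its size, risk estimate, and standard error:

def select_subtree(subtrees, multiplier=1.0):
    """Pick a pruned subtree: minimum risk, or the standard error rule when a multiplier is given."""
    best = min(subtrees, key=lambda t: t["risk"])        # minimum-risk subtree
    if multiplier is None:
        return best                                      # Minimum risk option
    threshold = best["risk"] + multiplier * best["se"]   # Standard Error rule option
    candidates = [t for t in subtrees if t["risk"] <= threshold]
    return min(candidates, key=lambda t: t["n_nodes"])   # smallest subtree within the threshold

subtrees = [
    {"n_nodes": 31, "risk": 0.280, "se": 0.014},
    {"n_nodes": 17, "risk": 0.288, "se": 0.014},
    {"n_nodes": 9,  "risk": 0.310, "se": 0.015},
]
print(select_subtree(subtrees, multiplier=1.0))   # chooses the 17-node subtree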
Scores define the order of and distance between categories of an ordinal (or discretized
continuous) target variable. The ordering of categories affects all analyses. Scores are
available only for CHAID analyses.
Figure 4-13
Advanced Options: Scores
When you first create a tree, the Ordered integers option is selected by default. Default
scores follow the variable’s value order, so that the first category has a score of 1, the
second category has a score of 2, and so on. To enter your own custom scores, select
Custom. (At this point, AnswerTree may have to scan the data before populating the
grid.) For the target variable, each category is shown, along with its score. You can
change the score for any category in the grid by selecting the desired score and editing
the value.
If you change the scores after the tree is grown, your existing tree will be invalid.
You will need to regenerate the tree.
Costs allow you to include information about the relative penalty associated with
incorrect classification by the tree. For example, the cost of denying credit to a
creditworthy customer is likely to be different from the cost of extending credit to a
customer who then defaults on the loan.
Figure 4-14
Advanced Options: Costs
When you first create a tree, the costs are equal for all categories by default. To enter
your own custom costs, select Custom. (At this point, AnswerTree may have to scan
the data before populating the grid.)
Costs are shown in a k-by-k grid, where k is the number of categories of the target
variable. The columns represent the actual categories. The rows represent the predicted
categories, based on the tree. Each cell represents the cost of assigning a case to one
category (defined by the row) when it actually belongs to another category (defined by
the column). Default values are 1 for all off-diagonal cells.
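As an illustration of how such a grid is read (the categories and costs below are hypothetical, not taken from the example data), the expected cost of each possible node assignment can be computed from a node's actual category counts:

# cost[predicted][actual]: penalty for assigning `predicted` when the truth is `actual`.
cost = {
    "reject":  {"reject": 0, "review": 1, "approve": 4},
    "review":  {"reject": 1, "review": 0, "approve": 1},
    "approve": {"reject": 8, "review": 2, "approve": 0},
}
node_counts = {"reject": 10, "review": 30, "approve": 60}   # actual categories in a node
n = sum(node_counts.values())

for predicted in cost:
    expected = sum(cost[predicted][actual] * count
                   for actual, count in node_counts.items()) / n
    print(predicted, round(expected, 2))   # "review" has the lowest expected cost here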
Prior probabilities allow you to specify how the proportion of cases associated with each
target variable category should be estimated. The proportions of
cases in the target variable categories determine how information from the cases is
used to grow the tree. In some cases, the proportions in the training sample are
representative of the proportions in the population under study (that is, the proportions
in real life). In other cases, the proportions in the training sample are not representative
of real life, and some other estimate must be used.
Figure 4-15
Advanced Options: Prior Probabilities
Save Project
When you choose Save Project or Save Project As from the File menu, AnswerTree
saves your current project and data. Save Project uses the current file names. Save
Project As allows you to specify a new name for the project.
Two files are required for an AnswerTree project: the project file and the data file.
The project file has the .atp extension and contains information about models used,
trees grown, parameters, etc. The data file has the .sav extension and includes the
values for all variables for all cases used in the analysis. The data file name is derived
from the name of the original data source, with an underscore prefixed to the name. So,
for example, if you create a new project based on the data file iris.sav, saving the
project as iris.atp creates two files: iris.atp and _iris.sav (where _iris is derived from
the original data filename).
When sending your AnswerTree results to another AnswerTree user, be sure to
include both files (project file and data file), or the other user will be unable to open
the project.
Chapter 5
The Tree Window
After the root node is grown, you perform most operations in the Tree window. It is a
tabbed window that displays five views of the current analysis: a tree diagram, a tabular
gains summary, a risk summary, the rules defining nodes, and a text analysis summary.
You can duplicate a Tree window or create a new Tree window that uses the same
specifications as the active tree. The new or duplicate tree can be grown or edited
independently of the original tree. Properties can be set individually for the new
window.
The Tree view shows a graphical display of the structure of the tree in detail. In most
cases, because of the size of the overall tree, only a portion of the tree will be visible
in the Tree view. You can scroll the window to view other parts of the tree or use the
Tree Map window to select a region of the tree to view.
Figure 5-1
Tree view
Each node in the tree is shown as a table of values, a graph of values, or both. The node
display can be controlled via toolbar buttons or menu items. For a categorical (nominal
or ordinal) target variable, the values show the number of cases in each category. For a
continuous target variable, the values indicate the distribution of values for the node.
You can select one or more nodes in the Tree view. When you change the selection,
the Graph, Table, and Data windows are automatically updated to reflect the currently
selected nodes.
Gains indicate which parts of the tree have the highest (and lowest) response or profit.
The Gains tab shows a summary of the gains for all of the terminal nodes in the tree,
sorted by gain value.
Figure 5-2
Gains view
Gain scores are computed differently for categorical target variables and continuous
target variables.
Categorical target variable. The gain score for a node is computed as either the
proportion of cases in the node belonging to the target class or the average profit for
the node. This is controlled by selecting the appropriate option in the Gain Summary
dialog box.
Continuous target variable. The gain score for a node is computed as the average value
for the node.
The table’s rows can represent statistics for individual nodes or for percentiles of cases.
Percentile tables are cumulative. The following information is displayed for each node
or percentile group:
Node (or Nodes). The nodes associated with the row.
Node: N (or Percentile: N). The number of cases in the node or percentile group.
Node: % (or Percentile: %). The percentage of the total sample cases falling into the
node or percentile group.
Resp: n and Resp: %. The number of cases in the node or percentile group that belong to
the target class, and the percentage of all target-class cases that they represent.
Gain. The gain value for the group.
Index (%). The ratio of this group’s gain score to the gain score for the entire sample.
Cumulative gains can also be displayed. These values are accumulated, so that each
row indicates values for cases in that group plus all previous groups. The display of
cumulative gains can be controlled by selecting the appropriate option from the Gain
Summary dialog box.
This view gives the estimated risk (and its standard error) of misclassification based on
the tree. Risk is calculated in different ways, depending on the nature of the target
variable.
Categorical (nominal or ordinal) target variable. Risk is calculated as the proportion of
cases in the sample incorrectly classified by the tree. A table is also displayed,
indicating the numbers of cases corresponding to specific prediction errors.
Continuous target variable. Risk is calculated as the within-node variance about the
mean of the node.
Figure 5-3
Risk view
If misclassification costs have been specified, risk estimates are adjusted to account for
those costs. If priors have been specified, risk estimates are adjusted to account for
those priors.
For any combination of nodes, you can view the rules that describe the selected nodes.
The rule type and format are controlled by the rules format options. (For more
information on rules format options, see “Rules Format” on p. 88.)
Figure 5-4
Rules view
The analysis summary describes the data used in the tree, the tree-growing process, and
the final tree. The analysis summary updates when the tree structure is
modified.
Figure 5-5
Summary view
To export the contents of the Tree window, from the menus choose:
File
Export...
Items in the Tree window are exported in the following formats:
Tree View. Exports as a Windows bitmap (*.bmp) file or a Windows enhanced
metafile (*.emf).
Gains View. Exports as a tab-delimited text file. This file can easily be read into
spreadsheet or word-processing packages to create a table.
Risk View. Exports as a tab-delimited text file.
Rules View. Exports as a text file.
Summary View. Exports as a text file.
The Set Selection Rule dialog box lets you select the terminal nodes in your tree that
meet a certain criterion. With a large tree, the dialog box lets you locate specific nodes
quickly. You might want to explore the terminal nodes by trying several different
values for this dialog box. After you have made a useful selection, you might want to
export information about the selected nodes. For example, you could export the rules
for the selected nodes in SQL format, so that you could locate similar cases elsewhere
in your database.
Figure 5-6
Set Selection Rule dialog box
You can select nodes based on a statistical criterion, including gains, index, cumulative
gains, or cumulative index. To use this criterion, select a statistic from the drop-down
list and select a relation (>, >=, <, or <=). Finally, enter a value to complete the
inequality. For example, you might choose to select all terminal nodes whose gains are
greater than or equal to 1.5.
Alternatively, you can enter a value to select the top n nodes, the top nodes up to n
cases, or the top nodes up to n percent of your sample.
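Conceptually, a selection rule is just a filter or a greedy cutoff applied to the terminal-node gain table. A rough sketch with made-up node statistics:

nodes = [                                   # hypothetical terminal-node gain table
    {"node": 22, "gain": 93.1, "n": 58},
    {"node": 15, "gain": 71.4, "n": 84},
    {"node": 9,  "gain": 40.2, "n": 210},
]

# Statistical criterion: for example, gain >= 70.
selected = [t["node"] for t in nodes if t["gain"] >= 70]

# Top nodes up to n percent of the sample: take the best nodes until the case budget is used.
budget = 0.10 * 1000                        # 10% of a 1000-case sample
top, used = [], 0
for t in sorted(nodes, key=lambda t: t["gain"], reverse=True):
    if used + t["n"] > budget:
        break
    top.append(t["node"])
    used += t["n"]

print(selected, top)    # [22, 15] and [22]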
The results of this dialog box depend upon the settings in the Gain Summary dialog
box. For example, if the Gain Content is set to Average Profit and you are selecting
nodes based on gain, the selection cutoff point you enter must be an average profit
value, not a gain percentage. In addition, this dialog box is not available if your gain
summary is formatted to display percentiles in the rows (rather than nodes).
Finally, the meaning of top in this dialog box is relative. For most AnswerTree
analyses, the top nodes are those whose gains are greatest. However, you may want to
classify cases according to the smallest gains (for example, you might try to find low-
risk patients). The dialog box handles this as well: choose a less-than relation instead
of a greater-than relation.
Competitors
This dialog box allows you to control the number of predictors displayed for each node.
Selecting smaller values can speed up the processing of large problems.
Figure 5-7
Competitors dialog box
Competitors. Specify the maximum number of predictors for which to keep statistics at
each node in the tree (these retained predictors are known as competitors). For each
node, the Select Predictor dialog box will retain statistics for only as many predictors
as you specify here. You can still manually split a node based on a predictor that is not
a competitor. Changing the number of competitors for a grown tree will affect only new
tree growth.
Define Variable
This feature allows you to define various properties of the variables in your data file.
These properties affect how the variables are used in tree-growing and analysis.
Figure 5-8
Define Variable dialog box (nominal variable)
On the left is a variable list, showing all of the variables in your data set. The
measurement level for each variable is indicated by the icon shown next to the variable
name. Changes are made by selecting a variable in the variable list and then editing its
settings.
You can also specify whether variable labels or variable names are used in the
dialog box and in the tree itself. To change the setting, select a variable and right-click;
then select the appropriate option from the context menu. Changing this setting affects
only the selected item.
Measurement Level. You must indicate the appropriate measurement level in order to
have the variables treated properly in tree-growing and analysis. Available options are
nominal, ordinal, or continuous.
Values. For some variables, there may be values that you want to ignore or mark as
missing.
Valid. (Not available for continuous variables.) This list contains the values that are
considered valid for the variable. For numerical ordinal variables, values are processed
in numerical order. Valid and missing values cannot be edited.
Missing. This list contains values for the variable that are defined as user-missing
for the variable. You can optionally specify that missing values be considered a
separate category in CHAID analyses. Valid and missing values cannot be edited.
Label for missing. You can specify a text label for missing values.
Display value labels. (Not available for nominal or ordinal variables.) You can
specify that value labels, rather than values, are displayed in the tree. The default
is to show value labels (if defined). Value labels are always shown for nominal and
ordinal variables (if they are defined in the data file).
Allow CHAID to merge categories. Select this option to allow categories to be merged by
the CHAID algorithm.
Defining a Variable
Measurement Levels
Each variable can be characterized by the kind of values it can take and what those
values measure. This general characteristic is referred to as the measurement level of
the variable. You can specify a variable as having one of three measurement levels:
Nominal. This measurement level includes categorical variables with discrete values,
where there is no particular ordering of values. Examples include the gender of a
respondent, the brand of a product tested, and the type of a loan.
Ordinal. This measurement level includes variables with discrete values, where there is
a meaningful ordering of values. Ordinal variables generally don’t have equal intervals,
however, so the difference between the first category and the second may not be the
same as the difference between the fourth and fifth categories. Examples include years
of education and number of children.
Continuous. This measurement level includes variables that are not restricted to a list of
values but can essentially take any value (although the values may be bounded above,
below, or both). Examples include annual salary, the amount of a loan, and the weight
of a product.
For a more thorough discussion of variable types and measurement levels, see Chapter 14.
For continuous predictor variables in CHAID analyses, the quantitative scale is divided
into ranges of values, called intervals, which are treated as categories. Intervals can be
determined automatically by the program, or you can specify your own set of intervals.
Automatic intervals are chosen so that each interval contains approximately the same
number of cases.
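A minimal sketch of equal-frequency binning (the cut points AnswerTree actually chooses may differ; this only illustrates putting roughly the same number of cases in each interval):

def auto_intervals(values, n_intervals):
    """Return start values for intervals holding roughly equal numbers of cases."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_intervals] for i in range(1, n_intervals)]

ages = [19, 22, 23, 25, 28, 31, 34, 35, 41, 47, 52, 60]
print(auto_intervals(ages, 4))   # start values of the 2nd, 3rd, and 4th intervals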
Figure 5-9
Intervals dialog box
The name of the variable to be updated and its range are shown at the top of the dialog
box. The current intervals are shown in the list view, with four columns indicating the
start value, end value, number of cases, and percentage of cases for each interval. Note
that if the selected variable has fewer than 10 distinct values, you will not be able to
define intervals. Instead, the variable will be treated as an ordinal categorical variable.
Method. If you select Automatic, intervals will be calculated to distribute the cases
evenly among the number of intervals specified. To change the number of intervals, use
the spinner control or type the desired number of intervals into the field, and click the
Update List button.
If you select Custom, you will need to enter start values for intervals. If you want to
define all of the intervals, begin by clicking Clear List. To add a new interval, enter a
start value for New Cut Point and click Add. The new interval contains all cases with
values between the start value and the end value (or the next higher start value).
Continue adding new intervals until all intervals are specified as desired.
E To define variable intervals for a CHAID analysis, from the Project window or Tree
window menus choose:
Analysis
Define Variable...
E Click Intervals.
E For custom intervals, click Clear List to delete the default intervals, and add new
intervals by specifying each interval’s start point and clicking Add.
Profits
Profits allow you to include information about the relative value of each category of
the target variable.
Figure 5-10
Define Profits dialog box
For the target variable in the current tree, each category is shown, along with its profit.
Default profits for ordinal variables (including discretized continuous variables) are
inherited from the variable’s scores. Default profits for nominal variables assign a
value of 1 to the highest category and a value of 0 to all other categories (where highest
category means the largest value for numerically coded variables and the last value
alphabetically for string-coded variables). You can change the profit for any category
in the grid by selecting the desired category’s profit and editing the value.
Setting Profits
E For each category of the target variable, change the value as necessary.
Grow Tree and Prune. Grows the entire tree, then automatically prunes it. Not
available if the growing method is CHAID or Exhaustive CHAID.
Grow Branch. Grows the tree below the current node to its terminal nodes.
Grow Branch One Level. Adds one level under the currently selected node.
Select Predictor. Allows you to specify which predictor to use in splitting the
current node and how values of the predictor are grouped to form the split.
Select Surrogate. Allows you to specify a surrogate variable to use in splitting the
current node.
Define Split. Redefines the split of the current node. This option can be used to
merge or separate nodes.
Remove Branch. Removes the branch below the current node.
Remove One Level. Removes one level from the whole tree.
Select Predictor
The Select Predictor dialog box displays a list of predictors available for splitting (or
resplitting) a selected node. When you select a variable in the list and click Grow, the
node is split using the selected predictor.
Figure 5-11
Select Predictor dialog box
The dialog box is not available if more than one node is selected in the tree. Predictors
with constant values for the selected node (where all cases in the node have the same
value for the predictor) are not shown.
The table displays information for each variable, depending on the growing method
used. (Not all items listed appear for all growing methods.)
Predictor. The name of the predictor variable.
Nodes. The number of nodes that will be created by splitting on the predictor.
Split Type. The type of split: default for a computer-generated split, custom for a
user-specified split, or arbitrary for noncompetitor predictors.
Chi-Square (categorical target) or F (continuous target). The value of the test statistic
used to evaluate the predictors.
D.F. The degrees of freedom associated with the test statistic. For chi-square
statistics, there is only one df value. For F statistics, the two df values are given as
the numerator df and the denominator df.
Adj. Prob. The probability (p value) associated with the test statistic, adjusted for
the multiple tests on the list of competitors (using the Bonferroni method). The usual
rule of thumb is that probabilities of less than 0.05 indicate statistically significant
associations.
You can limit the list display to a subset of the predictors (called competitors) by
changing the settings in the Competitors dialog box. If you limit the number of
competitors, other predictors (noncompetitors) will still be shown in the list with no
statistics and with the split type shown as Arbitrary. You can specify that only
competitors be shown by right-clicking the predictor list and selecting
Show Competitors.
You can override a category grouping or redefine a cut point for the selected
predictor by clicking Define Split.
The following formatting options are also available from the context menu:
Display Variable Labels. Shows variable labels in the dialog box and the Tree view.
Display Variable Names. Shows variable names in the dialog box and the Tree view.
Selecting a Predictor
E To select a predictor for a custom split, from the Tree window menus choose:
Tree
Select Predictor...
E Select a predictor from the list and click Grow to split the node using the selected
predictor.
Optionally, you can manually specify how the split is defined by clicking Define Split.
The Define Split feature allows you to control how the split is defined on the predictor
variable.
Figure 5-12
Define Split dialog box for CHAID
The dialog box shows the current grouping of categories. Changes are made by
selecting a node or nodes and right-clicking. From the context menu, the following
options are available:
Merge Nodes. The selected nodes are merged into a single node. The new node is
shown on a single line, with multiple categories separated by commas.
Separate Node. The selected node is separated into multiple nodes, one for each
category in the original node.
To rearrange the assignment of categories to nodes, you may first need to separate all
nodes so that each category defines a node, and then merge the individual categories to
create the desired split.
The following formatting options are also available from the context menu:
Display Value Labels. Shows category labels in the dialog box.
Display Values. Shows category values in the dialog box.
The Define Split feature allows you to control how the split is defined on the predictor
variable.
Figure 5-13
Define Split dialog box for C&RT or QUEST
If the selected node is already split, the dialog box shows the current grouping of
categories. Changes are made by dragging categories from one list to the other until the
categories are grouped as desired. For ordinal variables, dragging a category
from the middle of the list will also move all categories below it (dragging from left to
right) or above it (dragging from right to left).
The following formatting options are also available from the context menu:
Display Value Labels. Shows category labels in the dialog box.
Display Values. Shows category values in the dialog box.
The Define Split feature allows you to control how the split is defined on the predictor
variable.
Figure 5-14
Define Split dialog box for a continuous predictor
If the selected node is already split, the dialog box shows the current cut point defining
the split. Changes are made by dragging the slider control or entering a cut point value
in the text box.
Select Surrogate
The Select Surrogate dialog box displays a list of surrogates available for splitting (or
resplitting) a selected node. When you select a variable in the list and click Grow, the
node is split using the selected surrogate. Surrogates are available only for models
grown using the C&RT or QUEST methods.
Figure 5-15
Select Surrogate dialog box
If the selected node is already split, the best surrogate for the split variable is
highlighted in the list.
Selecting a Surrogate
E To select a surrogate for a custom split, from the Tree window menus choose:
Tree
Select Surrogate...
E Select a surrogate from the list and click Grow to split the node.
You can control various aspects of how the gain summary is displayed.
Figure 5-16
Gain Summary dialog box
E To set the gain summary format, from the Tree window menus choose:
Format
Gains...
Rules Format
SPSS. Rules are given as SPSS SELECT or COMPUTE statements. For derived
variables, names are assigned by the application. SPSS rules can be copied and
pasted into SPSS syntax files.
Decision. Rules are specified as a set of logical “if...then” statements, suitable for
inclusion in written reports.
Generate Syntax for. You can select the content of the rules.
Selecting cases. Rules describe how to identify cases belonging to any of the
selected nodes. Note: The rules for selecting cases will be simplified wherever
possible. For example, if you select all of the terminal nodes in the tree and then
view the rules for selecting cases, you will see only one simple rule that selects all
cases. This is because selecting all of the terminal nodes corresponds to selecting
every possible combination of predictor values, and therefore all cases belong in
the “subset” defined by the selected nodes. To view a separate rule for each node
selected, choose Assigning values.
Assigning values. Rules give values for node assignment, the predicted value for the
target variable, and the probability value for the prediction, based on the
combination of predictor values that defines the node. One rule is generated for
each selected node.
Use Labels for. For decision rules, you can request that rules be generated using labels
(instead of values) for variables, values, or both.
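For illustration, SPSS-format rules for a single hypothetical terminal node might look similar to the following. The predictor names, node number, and values here are invented for the example; the derived variable names (nod_001, pre_001, prb_001) follow the naming pattern described in the market segmentation example in Chapter 12.

* Selecting cases (hypothetical predictor names and values).
SELECT IF (PAY = 1 AND AGECAT = 3).

* Assigning values for the same node.
COMPUTE nod_001 = 4.
COMPUTE pre_001 = 1.
COMPUTE prb_001 = 0.95.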
E To set the Rules format, from the Tree window menus choose:
Format
Rules...
Chapter
6
Viewer Windows
Viewer windows display information about individual nodes in your tree. They give
more details about the selected nodes. You cannot modify anything in a Viewer
window; you can only examine the current state of the nodes. The following Viewer
windows are available to provide insight into your tree model:
Tree Map window. Displays the entire active tree. The tree can be navigated by
selecting a node or region in the tree map.
Graph window. Displays a graphic representation of the selected nodes.
Table window. Displays a tabular representation of the selected nodes.
Data window. Displays case-level data for the selected nodes.
The tree map shows the full tree, including all nodes and connecting lines, in its current
orientation. The selected node or nodes are highlighted in the map. You can select a
node by clicking it in the Tree Map window. You can select additional nodes by Ctrl-
clicking.
The Tree Map window also has a selectable red rectangle that represents the
perimeter of the visible portion of the Tree window. The location or size of the rectangle
updates when you scroll or resize the Tree window. If the Tree window is resized, its
contents adjust to fit in the available space, and the dimensions of the rectangle in the
tree map change accordingly. The tree can also be navigated by selecting a node in the
tree map. When you click on a node in the map, the node is selected in the tree map and
the Tree window, and the node becomes visible in the Tree window.
The rectangle can be moved by right-clicking and dragging it. If it is moved, a
different region becomes visible in the Tree window. Moving the rectangle does not
affect node selection in the Tree window.
Data Viewer
The Data Viewer shows case-by-case data for one or more selected nodes in the active
Tree window.
Figure 6-2
Data Viewer
Data shown in the viewer cannot be edited. In the data grid, rows represent cases, and
columns represent variables. All variables in the active data set are shown.
If data are partitioned, the data list includes only cases from the currently selected
partition. If there are frequency weights with partitioned data, a new variable is shown
to indicate the number of cases in each row assigned to the currently selected partition.
Exporting data. Data shown in the viewer can be exported to an external data file. Only
cases shown in the Data Viewer at the time of export are exported. To export all of the
data in the currently selected partition, select the root node before exporting. Data can
be exported in the following formats:
SPSS. Data exported in SPSS format retain all information regarding value and
variable labels, measurement levels, and missing values.
SYSTAT. Data exported in SYSTAT format retain information regarding system-
missing values, but information about value and variable labels and measurement
levels is lost. User-missing values are exported as the literal values, with no
indication that they represent another type of missing data.
Tab-delimited ASCII text. Data exported as tab-delimited text retain information
regarding system-missing values, but information about value and variable labels
and measurement levels is lost. System-missing values are exported as blanks.
User-missing values are exported as the literal values, with no indication that they
represent another type of missing data. The first line in the exported data file
contains the variable names.
Graph Viewer
The Graph Viewer shows summary statistics in graphical format for one or more
selected nodes in the active tree. The graph shows summary statistics based on values
of the target variable.
The view depends on the level of measurement of the target variable and the
selection in the Tree window.
Continuous target variable. The viewer shows a histogram of the target variable for
cases in the selected node.
Figure 6-3
Graph Viewer histogram
Categorical target variable. The default graph is a bar chart of percentages for a selected node. The categories of the target variable are shown in the chart in the same order as they appear in the node graphs.
Figure 6-4
Graph Viewer bar chart
You can change the color of bars in the Graph Viewer by right-clicking a bar and
selecting a color from the Color dialog box.
Table Viewer
The Table Viewer shows statistics for one or more nodes in tabular form. A Table view
is analogous to a Graph view of a node.
The Table view depends on the level of measurement of the target variable and the
selection in the Tree window:
Categorical target variable. The default table shows percentages and counts for each
category and the total n for the node(s). The percent given for the total is the percentage
of cases in the sample assigned to that node. If you have specified user-defined priors,
the total percent is adjusted to account for those priors. For categorical variables, you
can change the color of a row in the Table Viewer by right-clicking the row and
selecting a color from the Color dialog box.
Figure 6-5
Table Viewer for categorical target
Continuous target variable. The table shows the mean, standard deviation, number of
cases, and predicted value of the dependent variable for the selected node(s).
Figure 6-6
Table Viewer for continuous target
7
Capturing Data with ODBC
When you define a new project, you must specify a data source. One of your choices
for reading data is the Database Capture Wizard. The Database Capture Wizard
allows you to read data from any database format for which you have an ODBC driver.
You can also read Excel 5 files using the Excel ODBC driver.
E In the New Project dialog box, select Database Capture Wizard. Then select the desired option: create a new query, or run or edit an existing one.
E Select the data source. This can be a database format, an Excel file, or a text file.
E Depending on the database file, you may need to enter a login name and password.
E Select the table(s) and fields you want to read into the software.
Use the first dialog box to select the type of data source to read into the software. After
you have chosen the file type, the Database Capture Wizard prompts you for the path
to your data file.
If you do not have any ODBC data sources configured or if you want to add a new
ODBC data source, click Add Data Source.
Figure 7-1
Database Capture Wizard dialog box
Example. Suppose that you have a Microsoft Access 7.0 database that contains data
about your employees and about the regions in which they work, and you want to import
that data. Select the MS Access 7.0 Database icon, and click Next to proceed. You will
see the Select Database dialog box. Specify the path to your database and click OK.
Database Login
If your database requires a password, the Database Capture Wizard prompts you for
one before it opens the data source.
Figure 7-2
Login dialog box
This dialog box controls which tables and fields are read into the software. Database
fields (columns) are read as variables.
If a table has any field(s) selected, all of its fields will be visible in the following
Database Capture Wizard windows, but only those fields selected in this dialog box
will be imported as variables. This enables you to create table joins and to specify
criteria using fields that you are not importing.
Figure 7-3
Select Data dialog box
Displaying field names. To list the fields in a table, click the plus sign (+) to the left of
a table name. To hide the fields, click the minus sign (–) to the left of a table name.
To add a field. Double-click any field in the Available Tables list, or drag it to the
Retrieve Fields in This Order list. Fields can be reordered by dragging and dropping
them within the selected fields list.
To remove a field. Double-click any field in the Retrieve Fields in This Order list, or
drag it to the Available Tables list.
Sort field names. If selected, the Database Capture Wizard will display your available
fields in alphabetical order.
Example. Assume that you want to import from a database with two tables, Employees
and Regions. The Employees table contains information about your company’s
employees, including the region they work in, their job category, and their annual sales.
Employees are each assigned a region code (REGION), while those who do not have a
home region get the special code of 0. The Regions table holds a large amount of data
about the areas in which your company operates and prospective markets. It uses a
region code (REGION) to identify the area and provides the average per capita income
for the area, among other things. To relate each employee’s sales to the average income
in the region, you would select the following fields from the Employees table: ID, REGION, and SALES95. Then, select the following fields from the Regions table:
REGION and AVGINC. Click Next to proceed.
This dialog box allows you to define the relationships between the tables. If fields from
more than one table are selected, you must define at least one join.
Figure 7-4
Specify Relationships dialog box
Establishing relationships. To create a relationship, drag a field from any table onto the
field to which you want to join it. The Database Capture Wizard draws a join line
between the two fields, indicating their relationship. These fields must be of the same
data type.
Auto Join Tables. If this is selected and if any two fields from the tables you have chosen
have the same name, have the same data type, and are part of their table’s primary key,
a join is automatically generated between these fields.
Specifying join types. If outer joins are supported by your driver, you can specify either
inner joins, left outer joins, or right outer joins. To select the type of join, click the join
line between the fields, and the software displays the Relationship Properties dialog
box.
You can also use the icons in the upper right corner of the dialog box to choose the type
of join.
Relationship Properties
This dialog box allows you to specify which type of relationship joins your tables.
Figure 7-5
Relationship Properties dialog box
Inner joins. An inner join includes only rows where the related fields are equal.
Example. Continuing with our data, suppose that you want to import data for only those
employees who work in a fixed region and for only those regions in which your
company operates. In this case, you would use an inner join, which would exclude
traveling employees and would filter out information about prospective regions in
which you do not currently have a presence.
Completing this would give you a data set that contains the variables ID, REGION,
SALES95, and AVGINC for each employee who worked in a fixed region.
Figure 7-6
Creating an inner join
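In SQL terms, this inner join corresponds to a statement along the following lines, using the table and field names from the example (a sketch only; the SQL that the wizard actually generates is shown later in the Results dialog box and may differ in detail):

SELECT Employees.ID, Employees.REGION, Employees.SALES95, Regions.AVGINC
FROM Employees INNER JOIN Regions
ON Employees.REGION = Regions.REGION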
Outer joins. A left outer join includes all records from the table on the left and only
those records from the table on the right where the related fields are equal. In a right
outer join, this relationship is switched, so that the software imports all records from
the table on the right and only those records from the table on the left where the related
fields are equal.
Example. If you wanted to import data only for those employees that worked in fixed
regions (a subset of the Employees table) but needed information about all of the
regions, a right outer join would be appropriate. This results in a data set that contains
the variables ID, REGION, SALES95, and AVGINC for each employee who worked in
a fixed region, plus data on the remaining regions in which your company does not
currently operate.
Figure 7-7
Creating a right outer join
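Expressed in SQL, the only change from the inner join sketch above is the join clause, for example:

FROM Employees RIGHT OUTER JOIN Regions
ON Employees.REGION = Regions.REGION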
The Limit Retrieved Cases dialog box allows you to specify the criteria for selecting
subsets of cases (rows). Limiting cases generally consists of filling the criteria grid
with one or more criteria. Criteria consist of two expressions and some relation
between them. They return a value of true, false, or missing for each case.
If the result is true, the case is selected.
If the result is false or missing, the case is not selected.
Most criteria use one or more of the six relational operators (<, >, <=, >=, =, and <>).
Expressions can include field names, constants, arithmetic operators, numeric and
other functions, and logical variables. You can use fields that you do not plan to
import as variables.
Figure 7-8
Limit Retrieved Cases dialog box
To build your criteria, you need at least two expressions and a relation to connect them.
E To build an expression, put your cursor in an Expression cell. You can type field
names, constants, arithmetic operators, numeric and other functions, and logical
variables. Other methods of putting a field into a criteria cell include double-clicking
on the field in the Fields list, dragging the field from the Fields list, or selecting a field
from the drop-down menu that is available in any active expression cell.
E The two expressions are usually connected by a relational operator, such as = or >. To
choose the relation, put your cursor in the Relation cell and either type in the operator
or select it from the drop-down menu.
To modify our earlier example to retrieve only data about employees who fit into job categories 1 or 3, create two criteria in the criteria grid and prefix the second criterion with the connector OR.
Criteria 1: 'EmployeeSales'.'JOBCAT' = 1
Criteria 2: 'EmployeeSales'.'JOBCAT' = 3
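In the SQL that the wizard generates (visible later in the Results dialog box), criteria such as these typically appear as a WHERE clause, roughly as follows (a sketch; the exact table qualification and quoting may differ):

WHERE (EmployeeSales.JOBCAT = 1) OR (EmployeeSales.JOBCAT = 3)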
Functions. A selection of built-in arithmetic, logical, string, date, and time SQL functions is provided. You can select a function from the list and drag it into the
expression, or you can enter any valid SQL function. See your database documentation
for valid SQL functions.
Prompt for Value. You can embed a prompt in your query to create a parameter query.
When users run the query, they will be asked to enter information specified here. You
might want to do this if you need to see different views of the same data. For example,
you may want to run the same query to see sales figures for different fiscal quarters.
Place your cursor in any Expression cell, and click this button to create a prompt.
Defining Variables
Variable names and labels. The complete database field (column) name is used as the
variable label. Unless you modify the variable name, the Database Capture Wizard
assigns variable names to each column from the database in one of two ways:
If the name of the database field (or the first eight characters) forms a valid, unique
variable name, it is used as the variable name.
If the name of the database field does not form a valid, unique variable name, the
software creates a unique name.
Click on any cell to edit the variable name.
Figure 7-9
Define Variables dialog box
Results
The Results dialog box displays the SQL syntax for your query. You can copy the SQL
syntax onto the clipboard or simply retrieve the data. You can also customize your
query by editing the SQL statement. In either case, you can save the query for use with
other applications by providing a name and path in the Save Query to File panel or by
clicking the Browse button, which lets you specify a name and location using a Save
As dialog box.
Details on the ODBC query syntax (the GET CAPTURE command) can be found
in the following section of this chapter.
Figure 7-10
Results dialog box
Syntax Diagram
GET CAPTURE {ODBC }*
[/CONNECT='connection string']
[/LOGIN=login] [/PASSWORD=password]
[/SERVER=host] [/DATABASE=database name]†
* You can import data from any database for which you have an ODBC driver installed.
† Optional subcommands are database specific. See “Syntax Rules” on p. 112 for the subcommand(s) required by
a database type.
Example
GET CAPTURE ODBC
/CONNECT='DSN=Sample DBASE files;CollatingSequence=ASCII;'
'DBQ=C:\CRW; DefaultDir=C:\CRW; Deleted=1;'
'Driverid=21;Fil=dBaseIII;PageTimeout=600;'
'Statistics=0;UID=admin;'
/SELECT EMPLOYEE.LASTNAME,EMPLOYEE.FIRSTNAME,EMPLOYEE.ADDRESS,
EMPDATA.DATA FROM {oj EMPLOYEE LEFT OUTER JOIN EMPDATA ON
'EMPLOYEE'.'LASTNAME'='EMPDATA'.'LASTNAME'}.
Overview
GET CAPTURE retrieves data from a database and converts them to a format that can
be used by program procedures. It builds a working data file for the current session.
Basic Specification
The basic specification is one of the subcommands specifying the database type
followed by the SELECT subcommand and any SQL select statement.
Subcommand Order
The subcommand specifying the type of database must be the first specification. The
SELECT subcommand must be the last.
Syntax Rules
Only one subcommand specifying the database type can be used.
The CONNECT subcommand must be specified if you use the Microsoft ODBC
(Open Database Connectivity) driver.
Operations
GET CAPTURE retrieves the data specified on SELECT.
The variables are in the same order in which they are specified on the SELECT
subcommand.
The data definition information captured from the database is stored in the working
data file dictionary.
Limitations
A maximum of approximately 3800 characters can be specified on the SELECT subcommand. This translates to 76 lines of 50 characters each. Characters beyond the limit are ignored.
CONNECT Subcommand
CONNECT is required to access any database that has an installed Microsoft ODBC
driver.
You cannot specify the connection string directly in the syntax window, but you
can paste it with the rest of the command from the Results dialog box, which is the
last of the series of dialog boxes opened with the Database Capture command on
the File menu.
SELECT Subcommand
SELECT specifies any SQL select statement accepted by the database you access. With
ODBC, you can now select columns from more than one related table in an ODBC data
source using either the inner join or the outer join.
Example
GET CAPTURE ODBC
/CONNECT='DSN=Sample DBASE files;CollatingSequence=ASCII;'
'DBQ=C:\CRW; DefaultDir=C:\CRW; Deleted=1;'
'Driverid=21;Fil=dBaseIII;PageTimeout=600;'
'Statistics=0;UID=admin;'
/SELECT EMPLOYEE.LASTNAME,EMPLOYEE.FIRSTNAME,EMPLOYEE.ADDRESS,
EMPDATA.DATA FROM {oj EMPLOYEE LEFT OUTER JOIN EMPDATA ON
'EMPLOYEE'.'LASTNAME'='EMPDATA'.'LASTNAME'}.
This example retrieves data from two related tables in a dBASE III database.
The SQL select statement retrieves employees' last names, first names, and addresses from the EMPLOYEE table; if a last name also appears in the EMPDATA table, the DATA column from that table is retrieved as well.
GET CAPTURE converts the data to a format used by program procedures and
builds a working data file.
Data Conversion
GET CAPTURE converts variable names, labels, missing values, and data types,
wherever necessary, to a format that conforms to SPSS-format conventions.
Missing Values
Null values in the database are transformed into the system-missing value in numeric
variables or into blanks in string variables.
Example
The indentation of the select statement illustrates how GET CAPTURE considers
everything after the word SELECT to be part of the database statement. These lines
are passed directly to the database, including all spaces and punctuation, except for
the command terminator (.).
Chapter
8
Iris Flower Classification Example
The Iris data set (Fisher, 1936) is perhaps the best-known data set found in the
classification literature. Although the classification task Fisher addresses is relatively
simple, which is visually apparent when you look at a scatterplot of the data, his paper
is a classic in the field and is referenced frequently. Using the Iris data set, this
example demonstrates the performance of AnswerTree and provides a way to easily
compare classification performance with other methods and competitive products.
Analysis Goal
We want to predict the species of iris based on four physical measurements. The
algorithms used are C&RT and QUEST.
The Data
The data file for this example is IRIS.SAV. The file contains four continuous
measurement variables on each observation and a classification variable, species.
Because we are using a small data set, we need to adjust the stopping rules. For each growing method,
in the wizard’s final panel, open the Advanced Options dialog box. For the minimum
number of cases, specify 25 for the parent node and 1 for the child node.
The root node for both C&RT and QUEST models looks the same. It shows the distribution of the target variable, species.
Figure 8-1
Root node for Iris problem
After creating the two separate root nodes, we will grow the two trees automatically,
using the Tree menu’s Grow Tree option. The tree maps show the overall structure of
the trees.
Figure 8-2
C&RT tree
Figure 8-3
QUEST tree
As is frequently the case, the trees generated by different growing methods are similar but not identical. The basic classification story is fairly simple—one species can be differentiated on the basis of a single measurement (node 1 in both trees), while separating the other two species requires additional measurements.
Grow a third tree in the project file using the same model criteria as the first two. Because the following procedures are carried out identically for C&RT and QUEST, we will demonstrate using C&RT.
Instead of using Grow Tree from the Tree menu, choose Grow Tree One Level. The
results of growing the C&RT root node to a tree with one level are shown below.
Figure 8-4
C&RT tree grown one level
Petal length is chosen to split the root node, with the split made at the value 2.450. All cases with petal length less than or equal to 2.450 are sent to node 1, and all observations with petal length greater than 2.450 are sent to node 2.
The C&RT algorithm reports the relative importance of a node split by using the
decrease in impurity, or improvement, as an evaluation criterion. In this example, we
use the default Gini impurity measure. In the first split of the Iris tree, the improvement
is reported as 0.3333. This means that the impurity of the two child nodes that result
from the split was 0.3333 less than the impurity of the root node. Node 1 is composed
entirely of one species (setosa) and contains all of the cases of that species. Node 2
contains the remaining 100 observations, which include all of the versicolor and
virginica irises.
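For readers who want to verify the reported improvement, it can be reproduced from the node counts (50 cases of each species at the root) using the Gini impurity, which is 1 minus the sum of the squared class proportions:

Gini(root)   = 1 – 3 × (1/3)²  = 0.6667
Gini(node 1) = 1 – 1²          = 0.0000
Gini(node 2) = 1 – 2 × (1/2)²  = 0.5000
Improvement  = 0.6667 – (50/150 × 0.0000 + 100/150 × 0.5000) = 0.3333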
Figure 8-5
C&RT tree grown two levels
After growing the Iris decision tree to two levels, we can see that node 1 has been
defined as a terminal node. It is not possible to split this node and improve the
performance of the tree.
Node 2 is split using the petal width variable, and the improvement is reported as
0.2598. The two child nodes of node 2 roughly describe the two remaining species of
iris. Node 3 subsumes most of the versicolor irises, while node 4 contains most of the
virginica irises.
Priors
Suppose that our sample misrepresents the frequency with which the iris species occur in the population of interest. In cases like this, we can adjust the prior distribution by explicitly specifying the prior probabilities, which tell AnswerTree to expect each class to occur with the assigned probability. Explicit priors can be set using the Advanced Options dialog box. From the menus choose:
Analysis
Advanced Options...
Set the priors at 0.2 for setosa, 0.3 for versicolor, and 0.5 for virginica. Note that the
tree will be regrown.
Figure 8-6
Setting explicit priors
Compare the tree structure using explicit priors with the tree for which priors were set
according to the empirical distribution of the species in the data shown above.
Figure 8-7
C&RT tree with explicit priors
There are obvious substantive differences in the tree grown with adjusted priors. In this example, the model tries especially hard not to misclassify virginica because, according to the explicit priors, the sample contains fewer cases of this species than it should. The opposite can be said for setosa and versicolor.
The resubstitution risk estimate for the tree with custom priors is 0.036, and its
standard error is 0.013787. Relative to the tree with equal priors, we have reduced the
misclassification of virginica. The trade-off is that we have misclassified more
versicolor as virginica.
Figure 8-8
Risk summary using explicit priors
Discussion of Results
The Iris data set was used to show two methods for growing classification trees. C&RT
and QUEST produced similar but not identical decision rules and similar risk
performance in this example. In both cases, petal length perfectly identifies one species,
and petal width does a good job of distinguishing between the other two species. Using
explicit priors gives a somewhat different tree, which minimizes misclassification for
categories with large prior probabilities.
Chapter
9
Credit Scoring Example
Background
One of the most important applications of classification methods is in credit scoring,
or making decisions about who is likely to repay a loan and who is not. A tree-based
approach to the problem of credit scoring has some attractive features:
It allows you to identify homogeneous groups with high or low risk.
It makes it easy to construct rules for making predictions about individual cases.
Analysis Goal
We want to be able to categorize credit applicants according to whether or not they
represent a reasonable credit risk, based on the information available.
The Data
The data file for this example is CREDIT.SAV. The file contains a target variable,
Credit ranking (good/bad), and four predictor variables: Age Categorical (young,
middle, old), Has AMEX card (yes/no), Paid Weekly/Monthly (weekly pay/monthly
salary), and Social Class (management, professional, clerical, skilled, unskilled).
Data were collected for 323 cases. Because all variables are categorical, we will begin
with the CHAID method of growing the tree.
Click OK to return to the wizard, and then click Finish to grow the root node.
Figure 9-2
Root node for credit problem
The root node simply shows the breakdown of cases for the entire sample. In this data
set, the cases are nearly equally distributed between good credit risks (47.99%) and bad
credit risks (52.01%). The risk estimate also reflects this, showing that assigning all
cases to the majority class (bad credit risk) results in a 47.99% error rate.
Figure 9-3
Risk summary for root node
Notice that the majority of cases in the weekly pay group have low credit rankings,
while those in the monthly salary group are more likely to have high credit rankings.
Based on this variable alone, we can greatly improve our ability to distinguish good
credit risks from bad ones.
Examining the risk estimate reinforces this conclusion. The risk estimate with one
split is 0.1455, indicating that if we use the decision rule based on the current tree, we
will classify 100% – 14.55% = 85.45% of the cases correctly. So, a little bit of
information goes a long way in this case.
Figure 9-5
Risk estimate for first split
The results are encouraging. Let’s see if we can do even better with a little more
information. Right-click on the root node and select Grow Tree One Level again. Now
we see that each wage group is split by age.
Figure 9-6
Second split for credit problem
Examining the left branch first, we see that of those who are paid weekly, old
applicants (> 35 years) tend to have good credit—all of the cases in this example fit
this profile. On the other hand, young (< 25) and middle (25–35) applicants are much
more likely to have poor credit.
The right branch looks a bit different. It remains evident that older applicants are
more creditworthy than younger ones. However, for applicants who are paid monthly,
middle-aged people are grouped with the more creditworthy old group rather than with
the less creditworthy young group. Again, we see that old applicants within this group
almost all have high credit rankings. The younger group is more heterogeneous than
we saw in the other branch, however. Cases are evenly split in the young group—about
half of the applicants in this group have good credit and half have bad credit.
Perhaps by adding one more piece of information, we can identify the difference
between good and bad credit risks within this subgroup. Right-click that node and
choose Grow Branch One Level.
Figure 9-7
Split of node 5
The next piece of useful information is Social Class. The categories of this variable are
grouped into two subsets: managerial and clerical in one branch and professional in the
other. (There were no skilled or unskilled workers in this node.) The managerial and
clerical node cases all have good credit, whereas the professional node has both good
and bad credit risks (41.46% and 58.54%, respectively).
The risk summary shows that the current tree classifies almost 90% of the cases
accurately. Also, the misclassification matrix shows exactly what types of errors are
being made. The diagonal elements (upper left and lower right) of the table represent
the correct classifications. The off-diagonal elements (lower left and upper right)
represent the misclassifications. For this tree, notice that we almost never classify a
person with bad credit as having good credit. However, in 32 cases where people have
good credit, we classify them as having bad credit.
The first column gives the node number, which corresponds to the numbers found in
the tree map. For example, node 4 corresponds to applicants who are paid weekly and
are over 35 years old. The next two columns show the number of cases in the node and
the percentage of all cases that are in the node. The following columns present the
number of cases with the target response and the percentage of all of the target
responses that are in this node. For this example, that represents the number of people
in the node with good credit and the percentage of all of the people with good credit
who fall in this node. The Gain column indicates the proportion of cases in the node that have the target response (good credit), and the Index column shows how the proportion of target responses in this node compares to the proportion for the entire sample.
For the credit problem, both node 4 (paid weekly/over 35) and node 7 (paid
monthly/under 25/managerial or clerical) have no cases with bad credit, so they both
show a gain value of 100%. Since this is just over twice the percentage of good credit
cases in the entire sample (47.99%), the gain index is 208.3871%. Clearly, these are the
cases we would want to seek out. Conversely, node 3 (paid weekly/under 35) has the
lowest proportion of good credit risks. As a lender, you would have sound reason to
avoid lending to applicants who fit this profile.
Chapter
10
Housing Value Example
Background
In this example, we will use the C&RT algorithm to build a regression tree for
predicting the median housing value for census tracts in and around Boston. This will
allow us to evaluate the effects of various characteristics on property values in the
region. The data were originally reported in Harrison and Rubinfeld (1978) in a study
of the effects of air pollution on property values.
Analysis Goal
We want to be able to evaluate the effects of various environmental, economic, and
social factors on housing values in this urban area.
The Data
The data file for this example is HOUSING.SAV. The file contains a target variable,
Median value of owner-occ homes (defined as continuous), and 13 predictor
variables:
Per capita crime rate (continuous)
Proportion of residential land zoned 25K+ (continuous)
Proportion of non-retail bus acres per town (continuous)
Figure 10-1
Stopping rules for C&RT tree
After specifying the advanced options, click Finish to see the root node in the Tree
window.
Figure 10-2
Root node for housing data
The root node simply shows the mean value of the target variable for the entire sample.
In this data set, the average for the target variable is 22.53, which means that the
average of the median housing values for the sample of census tracts is approximately
$22,530.
The risk estimate here is simply the within-node variance. Remember that the total
variance equals the within-node (error) variance plus the between-node (explained)
variance. The within-node variance here is 12.5322, while the total variance is 84.4196
(the risk estimate for the tree with only one node). The proportion of variance due to
error is 12.5322 ⁄ 84.4196 = 0.1485. Thus, the proportion of variance explained by
the model is 100% – 14.85% = 85.15% . There is still some residual variance, but the
amount we can account for using the model is enough to convince us that we have
captured the most important variables, and that we can probably trust our conclusions.
In this example, we are interested in identifying variables that play important roles in
explaining housing values. In particular, we are interested in discovering whether, or
for which subgroups, pollution levels affect property values. We can find this
information in the tree itself.
Figure 10-5
Tree indicating splits based on average number of rooms
The first split is made on the average number of rooms. This indicates that the average
number of rooms is the most important determining factor (of the factors we measured)
for the median housing value of a census tract. This split gives an improvement of
38.2205, reducing the within-node variance by almost half. Areas with an average
number of rooms greater than 6.941 are subsequently split by the number of rooms
again, with tracts having an average number of rooms between 6.941 and 7.437 in one
node and an average number of rooms greater than 7.437 in the other.
Figure 10-6
Second-level split based on % lower status
For tracts where the average number of rooms is less than 6.941, the next split is based on % lower status of the population. Areas with less than 14.4% lower-status residents had higher values than areas where the percentage of lower-status residents exceeded 14.4%. These nodes are then split further, based on the distance
to Boston employment centers on the left branch and on the per capita crime rate on
the right branch.
Since we are particularly interested in the effect of pollution on housing values, let’s
examine the portion of the tree where pollution (nitric oxides concentration) becomes
a predictor.
Figure 10-7
Effect of pollution on housing values
It appears at first glance that tracts with an average number of rooms between 6.9 and 7.4 show an effect of nitric oxides concentration on housing values. This result conflicts
with a previously published tree-based analysis of these data (Breiman et al., 1984),
which splits the node based on Per capita crime rate. To investigate this discrepancy, we
can examine the surrogates for the node to see how Per capita crime rate compares to
Nitric oxides concentration as a predictor at this particular node in the model. To show
statistics for surrogates, select the node and from the menus choose:
Tree
Select Surrogate...
Figure 10-8
Surrogates for nitric oxides concentration
Notice that the first surrogate is Per capita crime rate and that its improvement statistic,
1.9900, is equal to the improvement for the split based on Nitric oxides concentration.
Furthermore, the association statistic for Per capita crime rate is 1.0, indicating that in
the context of this node, it is essentially equivalent to the selected predictor. The choice
between these two variables to split this node is arbitrary. The fact that the two analyses
selected different predictors was probably an accident of the order in which variables
were specified in the analysis.
While this redundancy essentially foils our attempt to determine how pollution
influences housing values, it does raise some interesting new questions. What is the
nature of the relationship between the crime rate and pollution? Which of the two
variables (if either) has a causal effect on housing values? How are these variables
related in other portions of the sample (that is, for other nodes)? Further investigation
using new data and other statistical techniques would be necessary to answer these
questions.
Chapter
11
Wage Prediction Example
Background
In this example, we will use the C&RT algorithm to build a regression tree for
predicting wages for a random selection of workers in the United States in two
separate years: 1978 and 1985. The two time periods will allow us to determine
whether the effects change over time. Various worker characteristics were recorded
along with wage levels. Data were extracted from the Current Population Survey
(CPS) of May 1978 and May 1985, published by the U.S. Department of Commerce.
These data were also analyzed by Berndt (1991).
Analysis Goal
We want to be able to evaluate the effects of various social, economic, and
demographic variables on wage levels at two separate points in time.
The Data
The data file for this example is WAGES.SAV. The file contains the target variable,
Log of avg hourly earnings (defined as continuous), and 20 predictor variables:
Education (years) (continuous)
Lives in south (nominal, yes or no)
Nonwhite (nominal, yes or no)
Figure 11-1
Stopping rules for the C&RT tree
After specifying the advanced options, click OK, and then in the wizard click Finish to
see the root node in the Tree window.
Figure 11-2
Root node for wage data
The root node simply shows the mean value of the target variable for the entire sample.
In this data set, the average for the target variable is 1.87.
Figure 11-4
Risk summary for tree
The risk estimate here is simply the within-node variance. Remember that the total
variance equals the within-node (error) variance plus the between-node (explained)
variance. The within-node variance here is 0.1860, while the total variance is 0.2944
(the risk estimate for the tree with only one node). The proportion of variance due to
error is 0.1860 ⁄ 0.2944 = 0.6318. Thus, the proportion of variance explained by the
model is 100% – 63.18% = 36.82% . Clearly, the ability to capture variation in wages
using the current tree model is less than optimal. However, the conclusions we draw
from the tree may be of some use in constructing a more detailed parametric model for
these data.
In this example, we are interested in identifying variables that play important roles in
explaining wage levels. We are also interested in whether or not different factors are
important at the two different time points. We can find this information in the tree itself.
In some cases, it is instructive to build the tree one split at a time. To revert to the
root node, select it in the tree, and from the menus choose:
Tree
Remove Branch
To make the first split, with the root node selected choose:
Tree
Grow Tree One Level
Figure 11-5
First split of tree
The first split is made on the time sampled, with workers polled in 1985 reporting
higher wages than those polled in 1978. This is not at all surprising—the average wage
values are given in unadjusted dollars, so the difference is probably due to inflation over
the seven-year period. The interesting question will be whether the subtrees look similar
for the two time points. Let’s examine the next level of the tree. From the menus choose:
Tree
Grow Tree One Level
Figure 11-6
Tree with second-level splits
Notice that the two nodes representing the two time points are split on different
predictors. For the 1978 cases, splitting on gender gives the best reduction of error,
whereas for the 1985 data, the best split is based on years of education. It appears that
there were indeed some changes in the determinants of wage levels (or at least their
relative importance) during that seven-year time interval.
Let’s go ahead and finish growing the tree and then consider the subtree for each
time point separately. From the menus choose:
Tree
Grow Tree
Figure 11-7
Subtree for workers polled in 1978
In the first round of workers polled, gender provided the best second-level split, as we
saw previously. We again see differences in the two subtrees: for males, age proves the
best predictor, while union representation is most prominent for women. For older
males, education level also adds useful information to the model.
Figure 11-8
Subtree for workers polled in 1985
In this branch, the first split for the 1985 data is based on years of education, and is in
the expected direction (more education predicts higher wages). For highly educated
workers (> 13.0 years), age becomes the next significant factor. For workers with less
education, the next significant factor is union representation, with union members
earning more than non-union workers. Finally, for those non-union workers, age is the
next predictor of wage levels, again in the expected direction (older workers earn more
than younger ones).
Chapter
12
Market Segmentation Example
Background
In designing a marketing program, one of the key points is to identify likely buyers
and focus your sales efforts on them. Targeting the most profitable segments of your
market can help you get the best return on investment for your sales resources.
Identifying more or less profitable subgroups of your market is called market
segmentation. A tree-based approach to the problem of market segmentation gives
you clear decision rules for identifying prospects as more or less likely to buy, as well
as estimates of how much profit or loss you can expect by marketing to particular
subgroups.
Analysis Goal
We want to be able to categorize direct mail targets according to whether or not they
are likely to purchase products.
The Data
The data file for this example is SUBSCRIB.SAV. These data were used in the original
SPSS CHAID manual (Magidson, 1993). The file contains two target variables,
Dichotomous response (respondent or nonrespondent) and Response to sweepstakes
promotion (paid respondent, unpaid respondent, nonrespondent), and seven predictor
variables:
Age of household head (18–24, 25–34, 35–44, 45–54, 55–64, 65+, unknown)
Sex of household head (male or female)
Children in household? (yes or no)
Household income (<$8,000, $8,000–$9,999, $10,000–$14,999,
$15,000–$19,999, $20,000–$24,999, $25,000–$34,999, $35,000–$49,999,
$50,000+)
Bankcard in household? (yes or no)
Number of persons in household (1, 2, 3, 4, 5 or more, unknown)
Occupation of household head (white collar, blue collar, other, unknown)
In addition, one frequency variable (FREQ) is specified. For each combination of the
values of the other variables, the frequency variable indicates how many individual
cases in the sample fit that profile.
Data were collected for 81,040 cases. Because all variables are categorical, we will
use the CHAID method for growing our tree. Three of the variables, the two target
variables (Dichotomous response and Response to sweepstakes promotion) and
Occupation of household head, should be defined as nominal. All other variables
should be defined as ordinal. (To change the definition of a variable type, select the
variable in the New Tree dialog box, and then right-click and select the appropriate
type from the context menu.)
Figure 12-1
Advanced Options: CHAID
After specifying the advanced options, click Finish to see the root node in the Tree
window.
Figure 12-2
Root node for market segmentation problem
The root node simply shows the breakdown of cases for the entire sample. In this data
set, most cases (98.85%) are nonrespondents.
The first split is based on size of household. It seems that the more people there are in
a house, the more likely that house is to respond to the promotion. Cases with missing
values for household size were least likely to have responded to the mailing.
Notice that certain categories are grouped together. For example, households with
two persons are grouped with households having three persons. AnswerTree
automatically combines categories when there is no statistical distinction between
them. That is, the merged categories are basically equivalent from a statistical
perspective.
Now look at the node with number of persons in household equal to two or three
(node 2 in the tree map). This node is further broken down by age, where older persons
(65+) are less likely to respond than others. Likewise, households of unknown size
(node 4 in the tree map) are broken down by sex of household head, with women being
more likely to have responded than men.
At the second level, medium-sized households where the head of the household is
less than 65 years of age (node 5 in the tree map) can be divided by the presence of
bankcards in the household. Households with bankcards are more likely to have
responded than other households.
Figure 12-4
Risk summary for tree with three levels
Notice how the first row of the misclassification table shows all zeros. That is because
the tree never predicts any case to be a responder. Of course, the classifier is correct
98.85% of the time, but its predictions are not very useful for distinguishing good
prospects from bad. To identify segments of interest (that is, segments with a relatively
high probability of response), we need to examine the gains chart.
The gains chart shows the nodes sorted by the proportion of cases in the target category for each node. The default is to make the target category the last category in the list. In
this case, however, we want the first category (responders) to be the target category. To
specify this change, select the Gains tab in the Tree window, and then from the menus
choose:
Format
Gains
Select the desired category—in this case, Respondent—under Percentage of cases in
target category.
Figure 12-5
Gain Summary dialog box
Now the gain summary will reflect the nodes that have the highest probability of
response to the promotion. There are two parts to the gains chart: node-by-node
statistics and cumulative statistics. Let’s look at the node-by-node statistics first.
Figure 12-6
Gains chart (node-by-node statistics)
Nodes are sorted by gain score from highest to lowest. The first node in the table,
node 9, contains 46 responders out of 1979 cases, or a 2.3244% response rate. For this
type of gains chart, with a categorical target variable, the gain score equals the
percentage of cases with the target category—in this case, respondents—for the node.
The index score shows how the proportion of respondents for this particular node
compares to the overall proportion of respondents. For node 9, the index score is about
202%, meaning that the proportion of respondents for this node is over twice the
response rate for the overall sample.
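Both figures can be reproduced from the counts in the chart and the overall response rate of 1.15% (100% – 98.85%):

Gain score for node 9  = 46 ⁄ 1979 = 2.3244%
Index score for node 9 = 2.3244% ⁄ 1.15% ≈ 202%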
The second row shows statistics for node 3. This node has a response rate of
1.9200%, with an index score of about 167%. This group, or segment, also has an
appreciably higher proportion of respondents than the overall sample, although the
advantage is not as great as with the first node in the list. The pattern continues down
the table, with each subsequent node having a lower proportion of respondents. Look
at the fourth row, which describes node 1. The index score for this node is only about
95%. Whenever the index score is less than 100%, it means that the corresponding
node has a lower response rate than the overall sample. Between the third and fourth
rows (node 10 and node 1) is the crossover point, where we go from “winning nodes”
to “losing nodes.”
Figure 12-7
Gains chart (cumulative statistics)
Now let’s take a look at the cumulative statistics. The cumulative statistics can show us
how well we do at finding responders by taking the best segments of the market. If we
take only the best node (node 9), we reach 4.94% of responders by targeting only
2.44% of the market. If we include the next best node (node 3) as well, then we get
17.72% of the responders from only 10.09% of the market. Including the next node
(node 10) increases those values to 36.31% of responders from 23.87% of the sample.
At this stage, we are at the crossover point described above, where we start to see
diminishing returns. Notice what happens if we include the next node (node 1)—we
get 65.95% of responders, but we must contact 55.19% of the sample to get them.
The gains chart can give you valuable information about which segments to target
and which to avoid. Of course, you will need to make some decisions about how many
segments to target. You might base the decision on the number of prospects you want,
the desired response rate for the target market, or the desired proportion of all potential respondents that you want to reach.
Now that we have identified the nodes that define our target segments, what criteria do
we use to determine whether a new case fits into one of these segments? The answer
can be found in the Rules view. To see the characteristics that define a node, select the
node and then select the Rules tab in the Tree window. For example, to see what
defines the node with the highest response rate (node 9), select that node in either the
tree map, the Tree view of the Tree window, or the gains chart, and then select the
Rules tab.
Figure 12-8
Rules for node 9 (selecting cases)
In addition to showing rules for selecting cases that belong to certain segments, you
can display rules that assign values to those cases. To see the values assigned to node 9
consumers, select the Rules tab, and then from the menus choose:
Format
Rules
Under Generate Syntax For, select Assigning values. The new rule for node 9
indicates that the cases are assigned a node label (nod_001 = 9), a predicted outcome
(pre_001 = 2, corresponding to Nonrespondent), and the estimated probability that
the predicted outcome is correct (prb_001 = 0.977).
Figure 12-9
Rules for node 9 (assigning values)
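In SPSS rule format, the value-assignment portion of this rule corresponds to COMPUTE statements along the following lines (the selection condition that identifies node 9 cases is omitted here):

COMPUTE nod_001 = 9.
COMPUTE pre_001 = 2.
COMPUTE prb_001 = 0.977.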
The rules from the Rules window can be exported for use with other programs in
selecting targeted segments or adding tree-derived information to the records. Rules
can be generated in three formats: SQL (shown above), SPSS, or decision. SQL rules
can be used with any database engine that understands SQL commands. SPSS rules can
be used with SPSS or NewView to update your data files or select cases. Decision rules
are simply structured descriptions of the node(s), which are suitable for use in reports
or presentations. With decision-formatted rules, you have the option of using variable
labels and/or value labels in the rules, which can make the rules easier to understand,
especially for others not familiar with the structure of your data set.
For this example, we will build a new model using the other target variable, Response to sweepstakes promotion (paid respondent, unpaid respondent, nonrespondent), defined as nominal. To build a new tree, from the menus choose:
File
New Tree
In the model dialog box, set Response to sweepstakes promotion as the target variable
and the other variables (except for Dichotomous response, of course) as predictors. A
new root node will be created, this time with three categories instead of two. We will
again use the likelihood-ratio chi-square statistic, so specify this on the CHAID tab of
the Advanced Options dialog box, if necessary.
Figure 12-10
Tree with Response to Sweepstakes Promotion as target
Figure 12-11
Define Profits dialog box
The real story, however, is told by the gains chart. The gains chart can be configured to
display average profit values for nodes. To set this option, select the Gains tab in the
Tree window, and then from the menus choose:
Format
Gains
and under Gain Column Contents, select Average profit.
Figure 12-13
Gains chart with profit values
The node with the greatest profit is node 17, with an average profit of $0.69. If we
consider the cases that make up this node—households with two people, head of
household 55–64 years of age, and having a high income—we see a pattern, which we
might call “empty nesters.”
Notice that several nodes have negative gain scores. A negative score indicates that,
on the average, you lose money by targeting that group. Of course, you will want to
divert resources away from those segments to the more profitable ones at the top of the
gains chart. If you simply want to ensure profitability, target the nodes with positive
gain scores. If you have a specific profit margin in mind that you want to exceed, the
cumulative statistics in the gains chart can tell you how many nodes to keep. For
example, if you want to ensure at least $0.20 average profit per household, you should
target nodes 17, 18, 14, 10, and 15, which together give an average profit of $0.21 per
household.
Chapter
13
Running Automated Jobs in Production Mode
AnswerTree provides a scripting language that lets you run the application in
production mode, where the actions of the application are determined by a script
prepared in advance. The file containing the script should have the filename extension
.ats. Scripts are executed by starting AnswerTree with the script filename as a
parameter on the command line or by double-clicking the script file from a file viewer.
AnswerTree then executes the script, producing the output described in the script.
Running AnswerTree from a batch file. AnswerTree was not designed to run multiple
concurrent sessions. Therefore, if you want to run AnswerTree scripts from a batch
file, you should use the START /W DOS command to ensure that the job finishes
before the next process begins. This is especially important if you want to run more
than one AnswerTree script from the same batch file—every line in the batch file
should use the START /W command. For example, if you have a set of three analyses you want to run every month, you might write a batch file similar to the following (the executable name and script filenames shown are placeholders):
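REM The executable and script names below are placeholders; substitute the
REM AnswerTree program and the .ats files for your own installation and analyses.
START /W atree.exe analysis1.ats
START /W atree.exe analysis2.ats
START /W atree.exe analysis3.ats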
This batch file runs the three analyses sequentially, so that they do not interfere with
each other.
General Rules
AnswerTree syntax resembles Microsoft’s Visual Basic. Specifically, an .ats file
consists of a sequence of statements. Statements can be continued across multiple lines
by ending continued statements with a blank space and an underscore ( _).
Comments must be introduced with an apostrophe. Any text following the apostrophe up to the end of the same physical line is skipped. Lines containing comments cannot be continued: ending a comment line with a blank and an underscore does not continue it, and a line is not continued if a comment appears after the blank and underscore.
String constants can be of any length. The concatenation operator & (ampersand)
can be used to combine strings, so that very long strings can be broken into pieces
convenient for a text editor.
AnswerTree syntax uses keywords to permit unambiguous parsing. User input is
either numeric or enclosed in quotation marks. String constants are written in .ats files
with surrounding quotation marks. If a quotation mark is to be an actual character in
the string, it must be written as two adjacent quotation marks.
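Putting these rules together, a statement in an .ats file might look like the following sketch. Only the comment, continuation, concatenation, and quoting conventions are the point here; the statement name itself is invented.

' Some_Setting is a made-up statement name, used only to illustrate the syntax rules.
Some_Setting "A value containing a ""quoted"" word, built from " & _
    "two string constants with the concatenation operator"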
Production Runs
An .ats file contains one production run. A production run deals with a single
AnswerTree saved project or a single data source. If a data source is used, it must be
an SPSS-format data file. A saved project incorporates the data as part of the project.
Two statements delimit a production run: the Production_Run statement, which begins the run, and the End Production_Run statement, which ends it.
End Production_Run
The end of a production run is signaled by the End Production_Run statement. This
statement closes any open files. If there is no End Production_Run statement in the
script file, an error is reported in the log file.
The first statement after the Production_Run statement must be an Open statement.
There are two types of Open statements.
Starting a production run directly with a data source creates a new project with the
given project name and initializes AnswerTree to work with the specified data file.
Both the filename and project name are string constants.
An error may occur if the file is not a valid SPSS-format data file. If so, the production run terminates.
Starting a production run with an existing project file continues work on a saved
project. The filename is a string constant. An error may occur at this point if the file is
not a valid saved project. If so, the production run terminates.
Saving a Project
At any time, the current state of a project may be saved to a project file with the Save
Project command.
This command saves the project with the specified filename. The filename is a string
constant. If an error occurs while attempting to save the project, the production run
terminates.
Building a Tree
Since modification of the training sample does not modify other aspects of the tree,
such as the target variable, frequency variable, and weight variable, both training and
testing data may be used in a tree block.
The basic structure of the tree block is as follows (items in brackets are optional):
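In outline, a tree block looks roughly like the sketch below. The bracketed lines are not literal syntax; they stand for groups of statements that are described in the remainder of this chapter.

Begin Tree [<optional name>]
   [model settings: target, predictor, frequency, and weight variables; Method]
   [validation settings: Partition Data and its subcommands]
   [analysis settings: stopping rules, scores, costs, priors, profits]
   Build Tree
   Grow Tree  (or Grow_And_Prune Tree)
   [output formatting and Print commands]
End Tree  (or Delete Tree)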
Note that in version 2.0 of AnswerTree, analysis settings can be specified before the Build Tree command. This ordering is encouraged because it makes the script execute faster. The reverse ordering, together with the Rebuild Tree command required for version 1.0, is still supported for backward compatibility, but building the tree first and then specifying the analysis settings causes the script to take longer to run.
Each occurrence of the Begin Tree statement starts a new tree. Statements
occurring after the Begin Tree statement specify values for parameters that affect tree
growing or display. Second or subsequent tree blocks do not retain settings from
preceding tree blocks. Each tree block must be terminated with either End Tree or
Delete Tree.
If <optional name> is specified, it will be used as the name of the new tree.
End Tree
An End Tree statement resets all of the parameters that affect tree growth or display to
their default values. Any subsequent Begin Tree statement will result in a new tree
being built.
Delete Tree
The Delete Tree statement is similar to the End Tree statement but has the additional
effect of removing the tree from the project. If a tree is not to be preserved in the
project, deleting the tree is recommended. Depending upon the tree, potentially large
amounts of system resources are freed by deleting a tree.
Model Settings
The following parameters are used to define the tree-growing process and must be
specified before the root node is defined.
Four different growing algorithms are supported: chaid, exhaustive_chaid, cart, and
quest.
Variable attributes
There are three basic variable types: nominal, ordinal, and continuous. Each of these
types may have missing values and a missing value label.
Defines the specified variable as ordinal. The variable specification is a string constant.
Specifies the label used to indicate missing values for the previously defined variable.
Allows you to specify whether the CHAID algorithm should merge categories for the
previously defined categorical variable.
The following commands affect how variables are used in the model.
Specifies the variable that gives frequency weights. If this command is omitted, no
variable is used for frequency weights.
Specifies the variable that gives the case weights. If this command is omitted, no
variable is used for case weights.
Specifies the variable to be used as the target variable. The variable specification must
be appropriate for the data source. When the data source is an SPSS .sav file, for
example, the variable specification is a variable name in the data set. The variable
specification is a string constant. This statement is required and must come before the
Method statement.
Specifies the variable(s) to be used as predictors in the model. To specify all variables
not assigned to other roles as predictors, use Predictors All.
Validation Settings
If you want to validate your tree to see how well it generalizes to new data, you can
partition your sample into training and test sets. If no validation command is specified,
the resubstitution estimate of risk based on the entire sample is used.
Partition Data
This command partitions the data into training and test sets and builds the top node of
the resulting tree. The two subcommands that govern the partition are described below.
The Random Seed and Training Percent subcommands may be written after Partition
Data on the same (extended) line or as separate commands preceding the Partition Data
statement.
The positive integer value is used to start the random number generator used in
selecting the training set. By using the same random seed value, it is possible to
replicate a particular set of random partitions.
The integer value represents a percentage of the data file cases to be used in defining a
tree (the training set). The remainder of the cases are considered cases to be tested (the
testing set).
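For example, assuming each subcommand keyword is followed directly by its value, a 50% training sample drawn with a fixed seed could be requested on one extended line as

Partition Data Random Seed 20000 Training Percent 50

or, equivalently, by writing Random Seed and Training Percent as separate commands before the Partition Data statement.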
Analysis Settings
Analysis settings control various aspects of how the tree is grown. In addition to
general settings, you can specify settings for stopping rules, CHAID, C&RT, and
QUEST, pruning, scores, costs, priors, and profits.
General Settings
Stopping Rules
Specifies the minimum number of cases in a parent node required to split that node.
Specifies the minimum number of cases in a child node required to create the node.
Specifies the minimum change in impurity required to split a node. Applies only to C&RT models.
CHAID
Chi_Square Pearson
Specifies the Pearson chi-squared statistic for CHAID models.
Chi_Square Likelihood_Ratio
Specifies the likelihood-ratio chi-squared statistic for CHAID models.
C&RT
Impurity_Measure Gini
Specifies the Gini impurity measure for C&RT models using categorical target
variables.
Impurity_Measure Twoing
Specifies the twoing impurity measure for C&RT models using categorical target
variables.
Impurity_Measure Ordered_Twoing
Specifies the ordered twoing impurity measure for C&RT models using ordinal
categorical target variables.
Note that C&RT models with continuous target variables always use the least
squared deviation (LSD) impurity measure.
QUEST
Pruning
Select_Subtree Minimum_Risk
With automatic pruning, selects the subtree with the minimum risk.
With automatic pruning, selects the smallest subtree with risk not more than
<positive number> standard deviations greater than the minimum risk.
Scores
The Scores command involves setting numeric values to be associated with target
variable categories. These categories may be string or numeric, depending upon the
variable chosen for the target. In the AnswerTree dialog boxes, the categories are
presented in a grid, and it is necessary to enter only an associated value. In AnswerTree
syntax, both the category and the value need to be listed in the syntax. For this purpose,
a pairing function, CVPair, is provided. AnswerTree syntax uses the notation
CVPair(category, value) to specify one category-value pair. A sequence of these notations
may be used to specify one or more category-value pairs in the following commands.
Only the categories that are to change from default values need to be specified.
Defines score values for categories of the target variable. For example:
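A sketch of such a statement, assuming the Scores keyword is followed directly by a list of CVPair terms (the categories shown are hypothetical):

Scores CVPair("current", 1) CVPair("30 days", 2) _
   CVPair("60 days", 3) CVPair("90+ days", 4)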
Costs
The Costs command involves specifying an entire misclassification matrix. There is a
special AnswerTree syntax function for building rows of category-value pairs. The
function CVRow has two or more arguments. The first is a category value for the target
variable that determines which row of the cost matrix is to be changed. The remainder
of the parameters are category-value pairs used to describe which values of the matrix
are to be changed.
For example, consider the following misclassification cost matrix for a target variable with categories 0, 1, and 2:

                 Actual Category
                 0      1      2
Predicted    0   —      1      2
Category     1   1      —      1
             2   2      1      —
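A Costs statement corresponding to this matrix might look like the following sketch (this matrix is symmetric, so the row/column convention does not affect the values; the exact form of the statement may differ):

Costs CVRow(0, CVPair(1, 1), CVPair(2, 2)) _
   CVRow(1, CVPair(0, 1), CVPair(2, 1)) _
   CVRow(2, CVPair(0, 2), CVPair(1, 1))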
The rows must be on the same extended line, so note the use of the line continuation
marker ( _ ). Also, the diagonal elements of the cost matrix are assumed to be 0 and
may not occur in this statement.
Priors
Prior probabilities let you specify the proportion of cases expected in each category of the target variable, independent of the proportions observed in the data. These commands affect only C&RT and QUEST trees.
Equal Priors
Specifies the use of equal prior probabilities for all categories of the target variable.
Normalize Priors
Causes the priors to be rescaled to sum to 1.0 (before computing adjusted priors, if
necessary).
Prior probabilities are associated with the categories of the target variable using the
pairing function, CVPair. AnswerTree syntax uses the notation CVPair(category, value)
to specify one category-value pair. Each of the categories should be assigned a positive
value. Normally, these are probabilities.
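A sketch of such a statement, assuming a Priors keyword followed by CVPair terms (the command form and the values shown are illustrative only):

Priors CVPair(0, 0.85) CVPair(1, 0.15)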
Profits
Profits indicate the relative values of different categories of the target variable.
Defines profit values for categories of the target variable. Profit values are associated
with the categories of the target variable using the pairing function, CVPair. AnswerTree
syntax uses the notation CVPair(category, value) to specify one category-value pair.
For example:
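A sketch using the magazine-campaign values discussed in Chapter 14 (the category labels are hypothetical):

Profits CVPair("paid responder", 35) CVPair("unpaid responder", -7) _
   CVPair("nonresponder", -0.15)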
After specifying the model settings and any desired analysis settings, the tree can be
built.
Build Tree
This command results in a single node tree, the root of the tree whose parameters have
already been given. This statement causes AnswerTree to process the data in
preparation for growing the tree.
Rebuild Tree
The Rebuild Tree command is used to accommodate changes of scores, costs, profits,
and priors after the last Build Tree command. This command should not be required for
AnswerTree 2.0 scripts, since you can specify analysis settings before issuing the Build
Tree command. It is included for backward compatibility with AnswerTree 1.0 scripts.
Grow Tree
The Grow Tree command grows the tree based on the specified settings.
Grow_And_Prune Tree
The Grow_And_Prune Tree command grows the tree based on the specified settings and
then prunes it according to the subtree parameter specified in the Select_Subtree
command.
Formatting Output
You can control various aspects of AnswerTree output from a production mode script.
Controls the orientation of the tree. Valid orientation specifications are Top (top-down),
Left (left-to-right), and Right (right-to-left). This command must come after the
Begin Tree command but before the Build Tree command.
Specifies the contents of nodes in the tree window. Table requests tabular statistics in
nodes of the tree, Graph requests graphs in nodes of the tree, and Both requests both
tables and graphs in nodes of the tree.
Show Training_Set
This command is used with partitioned data to show results based on the training set of
data. If the training set is already displayed, this command does nothing.
Show Test_Set
This command is used with partitioned data to show results based on the held-out test
set of data. If the test set is already displayed, this command does nothing.
Format Gain
This command allows you to control aspects of the gain summary. All of the gain
summary formats appropriate for the chosen method can be obtained with the proper
choice of Format Gain parameters. Subcommands can be written after Format Gain on
the same (extended) line or as separate commands prior to Format Gain. The
subcommands include:
Sort Ascending
Sorts the nodes in the gain summary in ascending order of the gain value.
Sort Descending
Sorts the nodes in the gain summary in descending order of the gain value.
Specifies the value of the target variable that is used to calculate gains and risks.
Average Profit
Requests that average profits be reported instead of the usual gain statistics.
Specifies whether the gain summary contains additional columns for cumulative
statistics. Not valid if Percentile Increment is used.
Format Rules
This command determines what is printed by the Print Rules command. There are three
different formats for the rules: SQL, SPSS, and decision rules. Each of these formats
may be used to present rules for either selecting cases or for assigning values to cases.
The decision rule format provides options for using variable labels and/or value labels.
An important point about selection rules: Printing automatically selects all terminal
nodes. As a result, the selection rule selects all cases and is uninformative.
SQL_Rules For_Selecting_Cases
Prints SQL rules for selecting cases. This is the default if no other Format Rules
options are specified.
SQL_Rules For_Assigning_Values
Prints SQL rules for assigning values to cases.
SPSS_Rules For_Selecting_Cases
Prints SPSS rules for selecting cases.
SPSS_Rules For_Assigning_Values
Prints SPSS rules for assigning values to cases.
Decision_Rules For_Selecting_Cases
Prints decision rules for selecting cases.
Decision_Rules For_Assigning_Values
Prints decision rules for assigning values to cases.
The five major views in AnswerTree all have a Print option (accessed from the File menu). These options are available in AnswerTree syntax through five separate Print commands. To carry out a Print command, AnswerTree in effect switches to the corresponding view (as if its tab had been clicked) and then prints it. One side effect is that the last type of Print command executed determines which view is current when the application is continued after a production run.
The gain summary will be printed using the settings defined. An optional path can be
included to print to a specific printer or to a file.
The risk summary will be printed. An optional path can be included to print to a
specific printer or to a file.
All terminal nodes are selected and the rules view is printed. An optional path can be
included to print to a specific printer or to a file.
The analysis summary is printed. An optional path can be included to print to a specific
printer or to a file.
This command will print the tree. An optional path can be included to print to a specific
printer or to a file.
The AnswerTree production log file is created during the production run. It records the
actions taken during the run and gives the times at which major events affecting the tree
take place. The log file is itself a valid .ats file that could also be run in production
mode. Indeed, if the original file inputs to the production run are the same, the original
.ats file and a log .ats file should produce the same production run and log except for
the time stamps.
Example Scripts
The Scripts directory (found on the AnswerTree CD-ROM) contains example scripts
that demonstrate various aspects of using AnswerTree production mode. See the
comments in the scripts for details on what each specific script does and how it works.
Chapter
14
Statistics and Algorithms
This chapter discusses how AnswerTree generates trees and their associated statistics.
We have attempted to explain the concepts in general (nonmathematical) terms, but
the discussions of algorithms do assume some knowledge of statistics. If you would
like to learn more about how AnswerTree works, this chapter is for you. The following
topics are covered:
Variables. Measurement levels of variables, case weights, and frequency variables.
Growing methods. The advantages and disadvantages of each method, along with
algorithms for each.
Stopping rules. How AnswerTree stops growing a tree.
Tree parameters. Costs, prior probabilities, scores, and profits.
Gain summary. How the gain summary helps you interpret your results.
Accuracy of the tree. How to ensure that your tree is accurate.
Cost-complexity pruning. Explanations and algorithms for cost-complexity
pruning.
Variables
This section defines the term categorical variable and describes the different
measurement levels of variables that may be used in an AnswerTree analysis. In
addition to the analysis variables, AnswerTree also allows you to use case weights and
frequency variables. You can use these variables to reduce the size of your data file,
which may help speed up the analysis.
Categorical Variables
Categorical variables differ from continuous variables in that they are not measured
in a continuous fashion but are classified into distinct groups. You can convert
continuous variables to categorical variables by grouping ranges of values. For
example, you could convert a continuous variable age into a categorical variable by
forming categories: 18 through 24, 25 through 34, 35 through 44, and so on.
Categorical variables may be nominal or ordinal. Categories of a nominal variable
differ in kind rather than in degree, so they have no natural ordering. For example,
occupational categories such as white collar, blue collar, other, and unknown do not
follow a particular, meaningful order. Ordinal variables have known or unknown
numeric scores associated with their categories. The categories of age described above
constitute an ordinal variable. It may sometimes make sense to compute the mean
(average score) for an ordinal variable but never for a nominal variable. For ordinal
variables, it never makes sense to merge noncontiguous values, whereas for nominal
variables, any pair of values could be merged.
In AnswerTree, all of the growing methods accept all types of variables, with one
exception: QUEST requires that the target variable be nominal.
AnswerTree frequently uses the terms target variable and predictor variable. The
target variable is the variable whose outcome you want to predict using other
variables. It is also known as the dependent variable. For example, in the analysis of
the Iris data set (Fisher, 1936), the target variable is the species of iris.
Predictor variables are those that predict the pattern of the target variable. They
are also known as independent variables. In the Iris data set, the predictor variables
are the petal length, petal width, sepal length, and sepal width, which meaningfully
predict the species of flower.
Case weight and frequency variables are useful for reducing the size of your data set.
Each has a distinct function, though. If a weight variable is mistakenly specified to be
a frequency variable, or vice versa, the resulting analysis will be incorrect.
Case Weights
The use of a case weight variable gives unequal treatment to the cases in a data set.
When a case weight variable is used, the contribution of a case in the analysis is
weighted in proportion to the population units that the case represents in the sample.
For example, suppose that in a direct marketing promotion, 10,000 households respond
and 1,000,000 households do not respond. To reduce the size of the data file, you might
include all of the responders but only a 1% sample (10,000) of the nonresponders. You
can do this if you define a case weight equal to 1 for responders and 100 for
nonresponders.
Note that in an AnswerTree analysis, the QUEST method does not accept case
weights.
Frequency Variables
A frequency variable lets a single record in the data file stand for more than one case. The value of the frequency variable for a record gives the number of identical cases that the record represents, so the data file can be much smaller than the data set it describes.
Tree-Growing Methods
AnswerTree includes four growing methods: CHAID, Exhaustive CHAID, C&RT, and
QUEST. Each one works a bit differently, and each one has its own best use. This
section provides an overview of each algorithm, along with a discussion of the
advantages and disadvantages of each, including a mention of how each handles
missing values. At the end of the section, the mathematical algorithms are given for
each method.
Overview of CHAID

Overview of Exhaustive CHAID
For each predictor, Exhaustive CHAID examines the series of merges for that predictor, finds the set of categories that gives the strongest association with the target variable, and computes an adjusted p value for that association. Thus, Exhaustive CHAID can find the best split for each predictor and then choose which predictor to split on by comparing the adjusted p values.
Exhaustive CHAID is identical to CHAID in the statistical tests it uses and in the
way it treats missing values. Because its method of combining categories of variables
is more thorough than that of CHAID, it takes longer to compute. However, if you have
the time to spare, Exhaustive CHAID is generally safer to use than CHAID. It
sometimes finds more useful splits. Note, though, that depending on your data, you
may find no difference between Exhaustive CHAID and CHAID results.
Overview of C&RT
Overview of QUEST
QUEST stands for Quick, Unbiased, Efficient Statistical Tree. It is a relatively new
binary tree-growing algorithm developed by Loh and Shih (1997). It deals with
variable selection and split-point selection separately. The univariate split in QUEST
performs approximately unbiased variable selection. That is, if all predictor variables
are equally informative with respect to the target variable, QUEST selects any of the
predictor variables with equal probability.
QUEST was created for computational efficiency. It affords many of the advantages of C&RT but, as with C&RT, the resulting trees can become unwieldy. You can apply automatic
cost-complexity pruning (see “Cost-Complexity Pruning” on p. 195) to a QUEST tree
to cut down its size. Note that QUEST uses surrogate splitting to handle missing values
(see “Surrogate Splitting” on p. 191).
Tree-Growing Algorithms
This section describes the computational processes behind each of AnswerTree’s four
growing methods.
CHAID Algorithm
1. For each predictor variable X, find the pair of categories of X that is least significantly different (that is, has the largest p value) with respect to the target variable Y. The method used to calculate the p value depends on the measurement level of Y, as described below for Exhaustive CHAID.
2. For the pair of categories of X with the largest p value, compare that p value to a prespecified alpha level, α merge .
If the p value is greater than α merge , merge this pair into a single compound category. As a result, a new set of categories of X is formed, and you start the process over at step 1.
If the p value is less than α merge , go on to step 3.
3. Compute the adjusted p value for the set of categories of X and the categories of Y
by using a proper Bonferroni adjustment.
4. Select the predictor variable X that has the smallest adjusted p value (the one that
is most significant). Compare its p value to a prespecified alpha level, α split .
If the p value is less than or equal to α split , split the node based on the set of
categories of X.
If the p value is greater than α split , do not split the node. The node is a terminal
node.
5. Continue the tree-growing process until the stopping rules are met.
Exhaustive CHAID Algorithm
Exhaustive CHAID works much the same as CHAID. You can set some of the options mentioned below using the Advanced Options for CHAID. These include the choice of the Pearson chi-squared or likelihood-ratio test and the level of α split .
1. For each predictor variable X, find the pair of categories of X that is least
significantly different (that is, has the largest p value) with respect to the target
variable Y. The method used to calculate the p value depends on the measurement
level of Y.
If Y is continuous, use an F test.
If Y is nominal, form a two-way crosstabulation with categories of X as rows
and categories of Y as columns. Use the Pearson chi-squared test or the
likelihood-ratio test.
If Y is ordinal, fit a Y association model (Clogg and Eliasin, 1987; Goodman,
1979; and Magidson, 1992). Use the likelihood-ratio test.
2. Merge into a compound category the pair that gives the largest p value.
3. Calculate the p value based on the new set of categories of X. Remember the p
value and its corresponding set of categories of X.
4. Repeat steps 1, 2, and 3 until only two categories remain. Then, among all sets of
categories of X, find the one for which the p value in step 3 is the smallest.
5. Compute the Bonferroni adjusted p value for the set of categories of X and the
categories of Y.
6. Select the predictor variable X that has the smallest adjusted p value (the one that
is most significant). Compare its p value to a prespecified alpha level, α split .
If the p value is less than or equal to α split , split the node based on the set of
categories of X.
If the p value is greater than α split , do not split the node. The node is a terminal
node.
7. Continue the tree-growing process until the stopping rules are met.
C&RT Algorithm
C&RT works by choosing a split at each node such that each child node is more pure
than its parent node. Here purity refers to the values of the target variable. In a
completely pure node, all of the cases have the same value for the target variable.
C&RT measures the impurity of a split at a node by defining an impurity measure.
Impurity Measures
There are four different impurity measures used to find splits for C&RT models,
depending on the type of the target variable. For categorical target variables, you can
choose Gini, twoing, or (for ordinal targets) ordered twoing. For continuous targets,
AnswerTree automatically uses the least-squared deviation (LSD) method of finding a
split. These measures are further explained in the following sections.
Gini. The Gini index at node t is defined as

g(t) = \sum_{j \ne i} p(j \mid t)\, p(i \mid t)

where i and j are categories of the target variable. This can also be written as
g(t) = 1 - \sum_j p(j \mid t)^2

Thus, when the cases in a node are evenly distributed across the categories, the Gini index takes its maximum value of 1 - 1/k, where k is the number of categories for the
target variable. When all cases in the node belong to the same category, the Gini index
equals 0.
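For example, a node whose cases are divided among three categories in the proportions 0.5, 0.3, and 0.2 has g(t) = 1 − (0.25 + 0.09 + 0.04) = 0.62, close to the maximum possible value of 1 − 1/3 ≈ 0.67 for a three-category target.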
If costs are specified, the Gini index is computed as

g(t) = \sum_{j \ne i} C(i \mid j)\, p(j \mid t)\, p(i \mid t)

The Gini criterion function for split s at node t is defined as

\Phi(s, t) = g(t) - p_L\, g(t_L) - p_R\, g(t_R)

where p_L is the proportion of cases in node t sent to the left child node, and p_R is the proportion sent to the right child node. The split s is chosen to maximize the value of Φ(s,t). This value, weighted by the proportion of all cases in node t, is the value reported as “improvement” in the tree.
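As an illustration, suppose the root node contains 100 cases divided evenly between two categories, so g(t) = 0.5. A split that sends 50 cases to each child, with class proportions 0.8/0.2 on the left and 0.2/0.8 on the right, gives g(t_L) = g(t_R) = 1 − (0.64 + 0.04) = 0.32, so Φ(s,t) = 0.5 − 0.5(0.32) − 0.5(0.32) = 0.18. Because the root contains all of the cases, 0.18 is also the improvement reported for this split.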
Twoing. The twoing index is based on splitting the target categories into two
superclasses, and then finding the best split on the predictor variable based on those
two superclasses. The twoing criterion function for split s at node t is defined as
\Phi(s, t) = \frac{p_L\, p_R}{4} \left[ \sum_j \bigl| p(j \mid t_L) - p(j \mid t_R) \bigr| \right]^2
where t L and t R are the nodes created by the split s. The split s is chosen as the split
that maximizes this criterion. This value, weighted by the proportion of all cases in
node t, is the value reported as “improvement” in the tree. The superclasses C1 and C2
are defined as
C_1 = \{\, j : p(j \mid t_L) \ge p(j \mid t_R) \,\}

and

C_2 = C - C_1
Ordered twoing. The ordered twoing index is a modification of the twoing index for
ordinal target variables. The difference is that with the ordered twoing criterion, only
contiguous categories can be combined to form superclasses. For example, consider a
target variable such as account status, with categories 1 = current, 2 = 30 days overdue,
3 = 60 days overdue, and 4 = 90 or more days overdue. The twoing criterion might in
some situations put categories 1 and 4 together to form a superclass, with categories 2
and 3 forming the other superclass. However, if we consider these categories to be
ordered, we don’t want categories 1 and 4 to be combined (without also including the
intervening categories) because they are not contiguous. The ordered twoing index
takes this ordering into account and will not combine noncontiguous categories such
as 1 and 4.
Least-squared deviation (LSD). For continuous target variables, the LSD impurity
measure is used. The LSD measure R(t) is simply the (weighted) within-node variance
for node t, and it is equal to the resubstitution estimate of risk for the node. It is defined as
R(t) = \frac{1}{N_W(t)} \sum_{i \in t} w_i\, f_i \bigl( y_i - \bar{y}(t) \bigr)^2

where N_W(t) is the weighted number of cases in node t, w_i is the value of the case weight for case i, f_i is the value of the frequency variable, y_i is the value of the target variable, and \bar{y}(t) is the (weighted) mean for node t. The LSD criterion function for split s at node t is defined as

\Phi(s, t) = R(t) - p_L\, R(t_L) - p_R\, R(t_R)
The split s is chosen to maximize the value of Φ (s,t) . This value, weighted by the
proportion of all cases in node t, is the value reported as “improvement” in the tree.
1. Starting with the root node, t = 1, search among the set S of all possible splits for the split s* that gives the largest decrease in impurity:

\Phi(s^*, 1) = \max_{s \in S} \Phi(s, 1)

Then split node 1 ( t = 1 ) into two nodes, t = 2 and t = 3 , using split s*.
2. Repeat the split-searching process in each of t = 2 and t = 3 , and so on.
3. Continue the tree-growing process until at least one of the stopping rules is met.
QUEST Algorithm
QUEST deals with variable selection (steps 1 and 2) and split-point selection
separately (steps 3, 4, and 5). Note that you can specify the alpha level to be used in
the Advanced Options for QUEST—the default value is 0.05.
1. For each predictor variable X, if X is a nominal categorical variable, compute the
p value of a Pearson chi-squared test of independence between X and the
categorical dependent variable. If X is continuous or ordinal, use the F test to
compute the p value.
2. Compare the smallest p value to a prespecified, Bonferroni-adjusted alpha level.
If the p value is less than α , then select the corresponding predictor variable to
split the node. Go on to step 3.
If the p value is greater than α , for each X that is continuous or ordinal, use
Levene’s test for unequal variances to compute a p value. (In other words, try
to find out whether X has unequal variances at different levels of the target
variable.)
Compare the smallest p value from Levene’s test to a new Bonferroni-adjusted
alpha level.
If the p value is less than α , select the corresponding predictor variable with
the smallest p value from Levene’s test to split the node. Go on to step 3.
If the p value is greater than α , select the predictor variable from step 1 that
has the smallest p value (from either a Pearson chi-squared test or an F test) to
split the node. Go on to step 3.
3. Suppose that X is the predictor variable from step 2. If X is continuous or ordinal,
go on to step 4. If X is nominal, transform X into a dummy variable Z and compute
the largest discriminant coordinate of Z. Roughly speaking, you transform X to
maximize the differences among the target variable categories. (For more
information, see Gnanadesikan, 1977.)
4. If Y has only two categories, go on to step 5. Otherwise, compute a mean of X for
each category of Y and apply a two-mean clustering algorithm to those means to
obtain two superclasses of Y.
5. Apply quadratic discriminant analysis (QDA) to determine the split point. Notice
that QDA usually produces two cut-off points—choose the one that is closer to the
sample mean of each class.
Surrogate Splitting
Surrogate splitting is used to handle missing values for predictor variables in C&RT
and QUEST. If the best predictor variable to be used for a split has a missing value at
a particular node, AnswerTree substitutes the best replacement, or surrogate, predictor
variable it can find.
For example, suppose that X* is the predictor variable that defines the best split s*
at node t. The surrogate-splitting process finds another split s, the surrogate, which uses
another predictor variable X such that this split is most similar to s* at node t. If a new
case is to be predicted and it has a missing value on X* at node t, AnswerTree makes
the prediction on the surrogate split s instead. (Unless, of course, this case also has a missing value on X. In such a situation, the next best surrogate is used, and so on.)
Stopping Rules
AnswerTree stops the tree-growing process when one of a number of stopping rules
has been met. A node will not be split if any of the following conditions is met:
All cases in a node have identical values for all predictors.
The node becomes pure; that is, all cases in the node have the same value of the
target variable.
The depth of the tree has reached its prespecified maximum value.
The number of cases constituting the node is less than a prespecified minimum parent
node size.
The split at the node results in producing a child node whose number of cases is
less than a prespecified minimum child node size.
For C&RT only, the maximum decrease in impurity is less than a prespecified
value β .
Note that you can set all but the first two of these rules in the Advanced Options for
stopping rules.
Tree Parameters
Four types of parameters can be applied to trees: scores, profits, costs, and priors. Each
one is defined in the following sections.
Scores
Scores are available in CHAID and Exhaustive CHAID. They define the order and
distance between categories of an ordinal categorical target variable. In other words,
the scores define the variable’s scale. Values of scores are involved in tree growing.
Profits
Profits are numeric values associated with categories of a target variable (ordinal or
nominal) that can be used to estimate the gain or loss associated with a segment. They
define the relative value of each value of the target variable. For example, in a
magazine marketing campaign you might have three target categories: paid
responders, unpaid responders, and non-responders, where paid responders generate
$35 of profit (because they subscribe to the magazine), unpaid responders generate –$7
of profit (because they accept the free trial issue but don't generate any revenue), and
non-responders generate –$0.15 (because of postage to send the mailing to them).
Although profits typically represent monetary worth, they can also be used for other
kinds of cost-benefit analysis. Values are used in computing the gains chart, but not in
tree growing. You can specify profits for any of AnswerTree’s four growing methods.
Costs
Misclassification costs are numeric penalties for classifying an item into one category
when it really belongs in another—for example, when a statistical procedure
misclassifies cancerous cells as benign. Only the C&RT and QUEST methods take
costs into account when growing the tree, but all four of the methods can use the costs
in calculating risk estimates.
The cost of misclassifying a category j case as a category i class is
C(i \mid j) = \begin{cases} c_{ij} & i \ne j \\ 0 & i = j \end{cases}
In a case where C(i \mid j) = 1 for all i, j such that i \ne j, we simply say that no costs are involved. A cost matrix can be symmetric or asymmetric. If it is asymmetric, you can make it symmetric by averaging the values of two opposite entries—that is,

C'(i \mid j) = \{ C(i \mid j) + C(j \mid i) \} / 2
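For example, if C(1 | 2) = 4 and C(2 | 1) = 2, averaging the two entries gives C′(1 | 2) = C′(2 | 1) = 3.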
Priors
Prior probabilities are numeric values that influence the misclassification rates for
categories of the target variable. They specify the proportion of cases already
associated with each category of the target variable prior to the analysis. In
AnswerTree, you can set priors for QUEST or C&RT as long as you have a categorical
target variable. The values are involved both in tree growing and risk estimation.
Misclassification costs can be incorporated into priors to form a set of adjusted
priors so that you do not have to specify costs separately. The adjusted priors are
defined as
\pi'(j) = \frac{C(j)\, \pi(j)}{\sum_{j'} C(j')\, \pi(j')}

where C(j) = \sum_i C(i \mid j) is the total cost of misclassifying a category j case and \pi(j) is the prior probability for category j.
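As a purely illustrative example, suppose a two-category target has equal priors π(1) = π(2) = 0.5 and misclassification costs such that C(1) = 1 and C(2) = 3. The adjusted priors are then π′(1) = (1 × 0.5)/(1 × 0.5 + 3 × 0.5) = 0.25 and π′(2) = 1.5/2.0 = 0.75, so the category that is more costly to misclassify receives the greater weight.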
The following table summarizes the variables and parameters available for use with
each of the four growing methods.
Table 14-2
Summary of the methods
Gain Summary
The gain summary provides descriptive statistics for the terminal nodes of a tree. It
allows you to identify desired terminal nodes based on the gain values. For example,
you can identify your best sales prospects, your most creditworthy loan applicants, or
your patients most likely to develop a certain disease.
If your target variable is continuous, the gain summary shows the average of the
target value for each terminal node. You can choose to sort the nodes in descending or
ascending order of gain, depending on whether the highest or lowest gain value is of
interest in your study.
If your target variable is categorical (nominal or ordinal), the gain summary shows
the percentage of cases in a selected target category or, if profits are defined for the
tree, the average profit value for each terminal node.
The index value shows the ratio of the gain value for each terminal node to the gain
value for the entire sample. It tells you how a node compares to the average.
Cost-Complexity Pruning
Cost-complexity pruning is a way to generate a tree of an appropriate size. If pruning
is not used, the tree may end up too large to be useful. In such a case, the terminal nodes may not provide useful information (the final splits may be superfluous). Perhaps the
most important reason to use pruning is to avoid overfitting, where your tree fits not
only the real patterns present in your data but also some of the unique “noise,” or error,
in your sample. An overfitted tree often does not generalize well to other data sets.
AnswerTree uses pruning to deal with this problem. In pruning the tree, the software
tries to create the smallest tree whose misclassification risk is not too much greater than
that of the largest tree possible. It removes a tree branch if the cost associated with
having a more complex tree exceeds the gain associated with having another level of
nodes (branch).
It uses an index that measures both the misclassification risk and the complexity of
the tree, since we want to minimize both of these things. This cost-complexity measure
is defined as follows:
R_\alpha(T) = R(T) + \alpha\, \tilde{T}

where R(T) is the misclassification risk of tree T and \tilde{T} is the number of terminal nodes for tree T. This measure is a linear combination of the risk of tree T and its complexity. If α is the complexity cost per terminal node, then R_α(T) is the sum of the risk of tree T and its cost penalty for complexity. (Note that the value of α is calculated by the algorithm during pruning.)
Any tree you might generate has a maximum size (T_max), in which each terminal node contains only one case. With no complexity cost (α = 0), the maximum tree has the lowest risk, since every case is perfectly predicted. Thus, the larger the value of α, the fewer the number of terminal nodes in T(α), where T(α) is the tree with the lowest complexity cost for the given α. As α increases from 0, it produces a finite sequence of subtrees (T_1, T_2, T_3, …), each with progressively fewer terminal nodes. Cost-complexity pruning works by removing the weakest split.
The following equations give the cost complexity for {t}, which is any single node, and for T_t, the subbranch rooted at {t}:

R_\alpha(\{t\}) = R(t) + \alpha

R_\alpha(T_t) = R(T_t) + \alpha\, \tilde{T}_t

If R_α(T_t) is less than R_α({t}), then the branch T_t has a smaller cost complexity than the single node {t}.
The tree-growing process ensures that R_α({t}) ≥ R_α(T_t) for α = 0. As α increases from 0, both R_α({t}) and R_α(T_t) grow linearly, with the latter growing at a faster rate. Eventually, you will reach a threshold α′, such that R_α({t}) < R_α(T_t) for all α > α′. This means that when α grows larger than α′, the cost complexity of the tree can be reduced by cutting the subbranch T_t under {t}. Determining the threshold is a simple computation: solve the inequality R_α({t}) ≥ R_α(T_t) for the largest value of α at which it still holds; this value is denoted g(t). You end up with

\alpha \le g(t) = \frac{R(t) - R(T_t)}{\tilde{T}_t - 1}
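For example, if a node has risk R(t) = 0.12 and its subbranch T_t has risk R(T_t) = 0.04 with three terminal nodes, then g(t) = (0.12 − 0.04)/(3 − 1) = 0.04. Once α exceeds 0.04, pruning T_t back to the single node {t} lowers the cost complexity of the tree.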
You can define the weakest link t̄ in tree T as the node that has the smallest value of g(t):

g(\bar{t}) = \min_{t \in T} g(t)
Note that the one-standard-error rule is somewhat subjective. For that reason, you can
set the standard error multiplier in the Advanced Options for pruning. You can also
simply choose the tree with the smallest risk.
Bibliography
Berndt, E. 1991. The practice of economics: Classic and contemporary. Reading, Mass.:
Addison-Wesley.
Biggs, D., B. de Ville, and E. Suen. 1991. A method of choosing multiway partitions for
classification and decision trees. Journal of Applied Statistics, 18: 49–62.
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and regression
trees. Belmont, Calif.: Wadsworth.
Clogg, C. C., and S. R. Eliasin. 1987. Some problems in log-linear analysis. Sociological
Methods and Research, 16:1, 8–44.
Fisher, R. A. 1936. The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7: 179–188.
Gnanadesikan, R. 1977. Methods for statistical data analysis of multivariate observations.
New York: John Wiley & Sons, Inc.
Goodman, L. A. 1979. Simple models for the analysis of association in cross-classifications
having ordered categories. Journal of the American Statistical Association, 74: 537–552.
Harrison, D., and D. L. Rubinfeld. 1978. Hedonic prices and the demand for clean air. Journal of
Environmental Economics & Management, 5: 81–102.
Kass, G. 1980. An exploratory technique for investigating large quantities of categorical data.
Applied Statistics, 29:2, 119–127.
Loh, W. Y., and Y. S. Shih. 1997. Split selection methods for classification trees. Statistica
Sinica, 7: 815–840.
Magidson, J. 1992. Chi-squared analysis of a scalable dependent variable. In Proceedings of the
1992 Annual Meeting of the American Statistical Association, Educational Statistics Section.
Magidson, J., and SPSS Inc. 1993. SPSS for Windows CHAID Release 6.0. Chicago: SPSS Inc.
Merz, C. J., and P. M. Murphy. 1996. UCI repository of machine learning databases
(http://www.ics.uci.edu/~mlearn/mlrepository.html). Department of Information and
Computer Science, University of California, Irvine.