
AnswerTree 2.0™

User’s Guide

SPSS is a registered trademark and the other product names are the trademarks of SPSS Inc. for its proprietary computer
software. No material describing such software may be produced or distributed without the written permission of the owners
of the trademark and license rights in the software and the copyrights in the published materials.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the
Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer
Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL
60606-6307.

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks of
their respective companies.

AnswerTree is a trademark of SPSS Inc.

Windows is a registered trademark of Microsoft Corporation.

AnswerTree™ 2.0 User’s Guide


Copyright © 1998 by SPSS Inc.
All rights reserved.
Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

Chapter 1
What Is AnswerTree?

AnswerTree is a computer learning system that creates classification systems
displayed as decision trees. If you have data divided into classes that interest you
(high- versus low-risk loans, subscribers versus nonsubscribers, voters versus
nonvoters, or types of bacteria), AnswerTree can use your data to build rules that you
can use to classify old or new cases with maximum accuracy. For example, Figure 1-1
shows a tree that classifies credit risks based on the client’s payday frequency and age
category.


Figure 1-1
AnswerTree window

AnswerTree brings together four of the most popular and current analytic methods
(algorithms) used in science and business for performing classification or
segmentation. It can build a tree for you automatically or let you take control to refine
the tree according to your knowledge of the data. Because AnswerTree automates a
portion of your data analysis, you produce usable and comprehensible results more
quickly than when using traditional exploratory statistical methods. Once you have
generated a model, AnswerTree provides you with all of the validation tools needed for
performing exploratory and confirmatory segmentation and classification analyses.
For customers who are entirely new to AnswerTree, this chapter introduces the
concepts you need to know to get started quickly, including definitions of terms and a
brief description of the algorithms. (If you require a more thorough description of the
statistics and algorithms used, refer to Chapter 14.) For customers who have used
AnswerTree 1.0, the next section of this chapter describes the new features
incorporated into version 2.0.

The remainder of this manual includes two tutorials—Chapter 2 and Chapter 3—to
get you started using all of AnswerTree’s features. In addition, we have included five
extended examples (Chapter 8 through Chapter 12) to help you learn to apply the
features to real data. Other chapters describe the operation of the software in detail. The
final chapter (Chapter 14), new for version 2.0, explains the various algorithms in
mathematical terms, for those who are interested.

What’s New in AnswerTree 2.0?


If you have previously used AnswerTree 1.0, you will find several changes
in the new version—some visible and some behind the scenes. The most
visible change is in the way you define a new tree. The New Tree Wizard takes you
through the tree-definition process step by step, and it allows you to specify complete
growing criteria for your method before you grow the root node. The wizard includes
a single, compact dialog box called Advanced Options, in which you can now specify
method-specific growing criteria, stopping rules, pruning criteria, scores, costs, and
profits. You can start the New Tree Wizard by choosing New Tree from the File menu.
A new dialog box lets you specify criteria to locate the best terminal nodes in your
tree automatically. The criteria include a specified gains or index value, a specified
number of cases, or a specified percentage of your sample. You can access this new
dialog box in the Tree window. From the menus, choose Edit, then Select Terminal
Nodes, and then Rule-Based.
AnswerTree now accepts a variety of data formats. Besides reading SPSS data files,
the software also imports SYSTAT files and database files. The new Database Capture
Wizard makes importing database files easy. In addition, you can now import data
using SPSS for Express (for Oracle Express databases) and BusinessQuery for SPSS
(for BusinessObjects databases).
Production mode users will find more complete documentation and a more
streamlined process for writing scripts.
Finally, you should notice a definite improvement in processing time. We
implemented a variety of behind-the-scenes performance enhancements, so
AnswerTree 2.0 generates root nodes and grows trees much faster than the previous
version.

What Is a Learning System?


A learning system is a computer program that derives decision rules from existing data.
Figure 1-2
Learning system

What Is a Classification System?


A classification system is a collection of decision rules that predict or classify future
observations. A computer learning system, such as AnswerTree, is used to generate the
classification rules from existing data.
Figure 1-3
Classification system

What Is a Decision Tree?


Decision trees are charts that illustrate decision rules. They begin with one root node
that contains all of the observations in the sample. As you drop down the tree, the data
branch into mutually exclusive subsets of the data. Figure 1-4 shows a decision tree that
describes a college admissions policy based on grade point averages and standardized
test (MCAT) scores.

Figure 1-4
Decision tree

[Tree diagram: 727 applicants at the root node split on grade point average at 3.47; each branch splits further on MCAT verbal score, and one branch splits again on MCAT quantitative score, with terminal nodes labeled REJECT or INTERVIEW.]
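
Each path from the root of such a tree to a terminal node is equivalent to one decision rule, and the whole tree can be read as a set of nested conditions. The following sketch, written in Python for illustration, shows rules of the kind encoded by the tree in Figure 1-4; the thresholds follow the figure, but the branch layout and outcomes are illustrative rather than a verbatim transcription of it.

def admissions_decision(gpa, mcat_verbal, mcat_quant):
    # Illustrative rules in the style of Figure 1-4; the exact branch
    # structure is hypothetical.
    if gpa <= 3.47:
        if mcat_verbal <= 555:
            # Low GPA and low verbal score: the quantitative score decides.
            return "INTERVIEW" if mcat_quant > 655 else "REJECT"
        return "REJECT"
    # High GPA: the verbal score alone decides.
    return "INTERVIEW" if mcat_verbal > 535 else "REJECT"

print(admissions_decision(gpa=3.6, mcat_verbal=560, mcat_quant=600))  # INTERVIEW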

Choice of Tree-Growing Algorithms


AnswerTree offers four algorithms for performing classification and segmentation
analysis:
CHAID. Chi-squared Automatic Interaction Detector, a method that uses chi-squared
statistics to identify optimal splits (Kass, 1980).
Exhaustive CHAID. A modification of CHAID that does a more thorough job of
examining all possible splits for each predictor but takes longer to compute
(Biggs et al., 1991).
C&RT. Classification and Regression Trees, methods that are based on minimization of
impurity measures (Breiman et al., 1984).
QUEST. Quick, Unbiased, Efficient Statistical Tree, a method that is quick to compute
and avoids other methods’ biases in favor of predictors with many categories (Loh and
Shih, 1997).

The four algorithms included in AnswerTree all do basically the same thing—examine
all of the fields of your database to find the one that gives the best classification or
prediction by splitting the data into subgroups. The process is then applied recursively
to subgroups to define sub-subgroups, and so on, until the tree is finished (as defined
by certain stopping criteria). The four methods have different performance
characteristics and features.
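
To take CHAID as a concrete case, evaluating a candidate split amounts to testing a contingency table of split groups against target classes with a chi-squared statistic. The Python sketch below, with hypothetical counts, shows the kind of computation involved; it illustrates the idea rather than AnswerTree's exact procedure.

from scipy.stats import chi2_contingency

# Rows are the candidate child nodes of a split; columns are target classes.
# The counts are hypothetical.
table = [[80, 20],   # predictor group A: 80 class-1 cases, 20 class-2 cases
         [35, 65]]   # predictor group B: 35 class-1 cases, 65 class-2 cases

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
# A smaller p-value indicates a stronger split; CHAID compares
# (Bonferroni-adjusted) p-values across candidate splits and predictors.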
For more information on these algorithms, including hints on selecting one for your
analysis, see Chapter 14.

General Uses of Tree-Based Analysis


Segmentation. Identify persons who are likely to be members of a particular class.
Stratification. Assign cases into one of several categories, such as high-, medium-, and
low-risk groups.
Prediction. Create rules and use them to predict future events. Prediction can also mean
attempts to relate predictive attributes to values of a continuous variable.
Data reduction and variable screening. Select a useful subset of predictors from a large
set of variables for use in building a formal parametric model.
Interaction identification. Identify relationships that pertain only to specific subgroups
and specify these in a formal parametric model.
Category merging and discretizing continuous variables. Group predictor categories
and discretize continuous variables with minimal loss of information.

Typical Applications of Tree-Based Analyses


Direct mail. Determine which demographic groups have the highest response rate. Use
this information to maximize the response to future mailings.
Credit scoring. Use an individual’s credit history to make credit decisions.
Human resources. Understand past hiring practices and create decision rules to
streamline the hiring process.

Medical research. Create decision rules that suggest appropriate procedures based on
medical evidence.
Market analysis. Determine which variables, such as geography, price, and customer
characteristics, are associated with sales.
Quality control. Analyze data from product manufacturing and identify variables
determining product defects.
Policy studies. Use survey data to formulate policy using decision rules to select the
most important variables.
Health care. Use surveys and clinical data to discover variables that
contribute to health.

A Word of Caution
AnswerTree is like any powerful tool, software or otherwise—it can be powerfully
misused. If your objectives for using AnswerTree are poorly formulated, you are likely
to have poor results. Exploration and discovery in your data should not be completely
question free; data analysis requires alert human participation. The answers that you
get with AnswerTree will depend on the appropriateness of tools that you use, the
condition of your data, and the relevance of the questions that you ask. Here are some
suggestions to help you create a decision tree that truly does what you need it to do:
 Always look at the raw data.
 Know the characteristics of the variables in your data before you undertake a large
project.
 Clean your data or be aware of any irregularities in them.
 Validate your AnswerTree results with new data or hold out a test set.
 If possible, use traditional statistical models to extend and verify what you learn
with AnswerTree.
Chapter 2
Getting Started with AnswerTree

This brief introduction will help you get started using AnswerTree. You will learn to
load data, create a decision tree, and save the file. To illustrate the use of AnswerTree,
you will analyze the well-known Iris data of Fisher (1936). The object is to identify
three types of Iris flowers based on physical measurements.
AnswerTree organizes your work into project files. Each project file is based on a
single data file but can contain multiple tree analyses using different features of
AnswerTree.

Startup Dialog Box


The first object displayed when you launch AnswerTree is the startup dialog box. This
offers you the choice of running the introductory overview, creating a new project
from a data file, or opening an existing AnswerTree project. You want to create a new
project, so select Start a new project and click OK. The New Project dialog box
appears, from which you can select the type of data file you will use.


Figure 2-1
Startup dialog box

New Project Dialog Box


The New Project dialog box asks you to identify the source of your data. You can
choose from among an SPSS data file, a SYSTAT data file, or data obtained from a
database. (If you want to use data from a database, select the Database Capture
Wizard, which will enable you to choose one of the database options.)
For this tutorial, you will use the data file iris.sav, which is an SPSS data file. When
you select the SPSS data file (.sav) option, the Open dialog box will appear. Select
iris.sav and click Open.

Figure 2-2
New Project dialog box

Figure 2-3
Open dialog box

Growing a New Tree


After specifying a data source, you must define your first tree. From the menus choose:
File
New Tree...

The New Tree Wizard allows you to select the options for your tree. The first step is to
select a growing method.

Growing Method

The growing methods are CHAID, Exhaustive CHAID, C&RT, and QUEST. For the
first tree created for a project, the default method is CHAID. Choose C&RT for this
example. Click Next.
Figure 2-4
Choosing a growing method

Assigning Variables
Now you need to specify which variables are predictors, which variable is the target,
and so on. Do this by dragging the variable from the source list into the box where it
belongs. For this example, you want to define species as the target variable, so drag
SPECIES to the target variable list. Similarly, assign petal length, petal width, sepal
length, and sepal width as predictor variables. Click Next.
Figure 2-5
Defining the target variable

Validating the Tree

The next screen allows you to specify a validation method for your tree. For this
example, we do not need to validate the tree. Make sure that Do not validate the tree
is selected and click Next.
(Validation is discussed in detail in Chapter 4.)
Figure 2-6
Validating the tree

Advanced Options: Setting Growing Criteria

Now you need to adjust some of the growing criteria. In the wizard, you can access the
growing criteria through the Advanced Options dialog box. To view it, click the
Advanced Options pushbutton.
In the Advanced Options dialog box, use the first tab, Stopping Rules. Because this
data set contains such a small sample, you need to specify smaller values for the
minimum number of cases—5 for the parent node and 2 for the child node.
Once you have changed the values, click OK to return to the wizard.
Figure 2-7
Advanced Options: Stopping Rules

Growing the Root Node

Grow the root node by clicking Finish on the last screen of the New Tree Wizard. The
target variable is tabulated for the entire sample and is displayed in the Tree window.
You see the minimal tree, which consists of the root node of the decision tree.
Figure 2-8
Growing the root node

Figure 2-9
Minimal tree in the Tree window

Growing the Tree Automatically


Now you are ready to grow the rest of the tree. To grow an entire tree automatically,
from the Tree window menus choose:
Tree
Grow Tree
The data are analyzed (this may take a moment), and the decision tree is shown in the
Tree window. To see an overview of the tree, activate the tree map from the View menu.
Figure 2-10
Tree map overview of the Iris tree

Interpreting the Resulting Tree


Decision trees are appealing because they convey a concrete story about your analysis.
Part of developing your understanding of tree-based analysis will come from
creating explanations of the results. In this particular example, the tree indicates the
following:
 Looking at the top-level split, you can tell that if an Iris has short petals ( ≤ 2.450 ),
it is probably setosa.
Figure 2-11
Top-level split

 Considering the plants with long petals ( > 2.450 ), you see that those with narrow
petals ( ≤ 1.750 ) are quite likely to be versicolor, while those with wide petals
( > 1.750 ) are probably virginica.
Figure 2-12
Second-level split

 Examining further splits adds little to our understanding of the problem because the
subsequent splits deal with very small numbers of cases.
Figure 2-13
Third-level split

Evaluating the Model: Misclassification Risk


To see how well the model does at predicting the type of Iris, you can examine the risk
summary. The risk summary compares the tree’s assignment of class with the class
actually recorded. The risk estimate gives the proportion of cases classified incorrectly.
You can see that the misclassification rate of the model is quite low—all but three cases
are classified correctly. This gives a risk estimate of 0.02, since 2% of the cases are
misclassified.
To see the risk summary, click the Risk tab at the bottom of the Tree window.

Figure 2-14
Risk summary

Saving the Project


Now that you have generated a project, you should save it for future reference and
modification. To save the project, from the Project window menus choose:
File
Save Project
The Save As dialog box appears and prompts you for a filename.
Figure 2-15
Saving the project

Conclusion
You’ve learned how to use AnswerTree to build a tree based on your data and how that
tree helps you make decisions about how to classify cases. You should now understand
the basics of AnswerTree well enough to begin working with your own data.
For more information on the various features of AnswerTree, proceed to the
extended tutorial.
Chapter 3
Extended Tutorial

In this tutorial, you will learn how to create your own trees and classification analyses
by using the various graphical and statistical features of AnswerTree. You will also
learn about the AnswerTree features that enable you to test the performance of your
model. Finally, you will learn how to guide the analysis as it proceeds so that your
model can reflect your preferences and domain knowledge in describing the data.

Goal of the Credit Scoring Example


The second tutorial uses a credit scoring data file (Merz and Murphy, 1996). The goal of
credit scoring is to determine whether the historical data that have been collected during
the course of servicing a loan can provide information about who is likely to default on
loans in the future. The results of such an analysis are used to efficiently screen future
loan applicants. The target variable in this example is ACCOUNT STATUS, with five
categories. The target category is critical accounts (CRIT ACCT). The predictor
variables include personal information collected during the application and account
performance data.
Figure 3-1
Data Viewer containing credit scoring data


Defining a New Tree


The credit scoring data file has been saved as a project file. Open the file german.atp,
located in the AnswerTree directory, and define a new tree with the following options:
 Use the C&RT method.
 Select ACCOUNT STATUS as the target variable and all other variables as
predictors.
 Do not validate the tree.
 Open the Advanced Options dialog box when prompted to do so.
Figure 3-2
Defining the tree

Limiting the Size of the Tree

To keep the processing time low, you must limit the number of levels that AnswerTree
considers. In the Advanced Options dialog box, make sure that the Stopping Rules tab
is on top. In the Maximum Tree Depth group, enter 4 as the value for C&RT, as shown
in Figure 3-3. Because we have a relatively small sample here, change the values for
Parent node and Child node to 25 and 1, respectively. Click OK to accept the new
settings, and then choose Finish on the last screen of the New Tree Wizard.
Figure 3-3
Advanced Options: Stopping Rules

Examining the Root Node

The root node is a tabulation of the target variable. It tells us that CRIT ACCT (category 5)
comprises 293 cases, or 29.3%, of the total of 1000. It should be noted that the critical
loans have been over-sampled so that a more accurate picture can be obtained.

From here, go ahead and grow the entire tree by choosing Grow Tree from the
Tree menu.
Figure 3-4
Root node of C&RT tree

Viewing and Sizing the Automatically Grown Tree

We now have the C&RT tree that results from automatic growth limited to four levels.
To show the tree map, from the menus choose:
View
Tree Map

Figure 3-5
C&RT tree with tree map

To customize the view of the tree shown here, the Zoom feature on the View menu was
used to reduce the displayed size of the tree. The Zoom feature is helpful when you need
to adjust the size of the image for printing, viewing, or exporting.

Figure 3-6
Zoomed tree view

Tree Map
The tree map provides you with a bird’s-eye view of your decision tree and a way to
move quickly from one position in the Tree window to another.
The Tree Map Viewer is linked to the Tree window. If you select a node from the
tree map, the main window will update and move to the selected node. All open
viewers and windows update with new information about the node you have selected.
The tree map can be displayed by using the Tree Map button on the toolbar or by
choosing Tree Map from the View menu.

Figure 3-7
Tree Map button

Each node in the tree map is labeled with a number that is referenced in all other
information displays, such as the gains and risk charts. Depending on the size and
complexity of the tree, you may not be able to see the node numbers; expanding the
Tree Map window may make them visible.
Figure 3-8
Tree Map Viewer

Node Graphs

A very helpful feature of AnswerTree allows you to display graphs of the distribution
of the target variable together with the tabulated data. To use this option, from the
menus choose:
View
Node
Both

Similarly, you can display only the graphs if you choose. In our example, if you select
node 22, you can see that critical accounts predominate.

Figure 3-9
Tree view with node graphs and statistics

Graph Viewer
The Graph Viewer shows summary statistics for nodes in the tree as graphs. To view a
node graph, select a node, and then choose Graph on the View menu, or use the Graph
button on the toolbar.
Figure 3-10
Graph Viewer button

The Graph window updates to display the target variable distribution of any node that
you select. You can print and export the graph.
Figure 3-11
Graph Viewer

Gains Charts
In the case of a categorical target variable, gains charts provide you with node statistics
that describe the classification tree relative to the target category of the target variable.
If the target variable is continuous, gains charts provide you with node statistics relative
to the mean of the target variable. Alternatively, you can display node performance in
percentiles, with a user-definable increment. To specify options for the gains chart,
click the Gains tab in the Tree window; then from the menus choose:
Format
Gains…

Figure 3-12
Gain Summary dialog box

Here we have disabled cumulative statistics and have requested gains for the
percentage of cases in the target category, critical accounts (CRIT ACCT).
Figure 3-13
Gain summary

Interpreting the Terminal Node Gains Chart

The terminal node gain summary for target variable ACCOUNT STATUS has the target
category set to critical accounts. We can read the statistics across the row labeled Node 22.
The columns labeled Node: n and Node: % tell us that this node captured 58 cases, or
5.80% of the total number of cases. The columns labeled Resp: n and Resp: % specify that
out of the 58 cases for the node, 54 are identified as critical accounts and that this
represents 18.43% of the total target class observations.
The Gain (%) column shows the percentage of the cases in the node that have the
target value for the target variable. For node 22, we would divide 54 by 58 to arrive at
93.1034%.
The Index (%) column indicates the makeup of the node (with respect to critical
accounts) compared to the makeup of the entire sample. You arrive at the Index (%)
value by taking the ratio of the Gain (%) value and the proportion of target category
responses in the entire sample. The Index (%) for node 22 is computed by taking
93.1034% (the gain percentage) and dividing it by 29.3% (the percentage of critical
accounts found in the root node), and computing a percentage. The result is
317.7592%, the index score for this node.
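
The same arithmetic can be written out directly. The short Python sketch below simply restates the calculation using the node 22 figures quoted above:

node_cases = 58        # cases captured by node 22
node_responses = 54    # of those, cases in the target category (CRIT ACCT)
total_cases = 1000     # cases in the root node
total_responses = 293  # target-category cases in the root node

gain_pct = 100.0 * node_responses / node_cases      # 93.1034...
sample_pct = 100.0 * total_responses / total_cases  # 29.3
index_pct = 100.0 * gain_pct / sample_pct           # 317.7592...
print(f"Gain (%) = {gain_pct:.4f}, Index (%) = {index_pct:.4f}")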

Interpreting the Percentile Gains Chart

The percentile gain summary for the target variable, ACCOUNT STATUS, has the
target category set to critical accounts. Percentile gain summaries extend the meaning
of the node-oriented charts by arranging the percentiles according to performance of
the target category. The nodes that make up a particular percentile are listed in the first
column. Nodes that are listed in two or more consecutive rows of a percentile gains
chart span the increment.

Figure 3-14
Percentile gain summary

Risk Charts
Risk charts tabulate misclassification statistics. When risk is calculated ignoring
misclassification costs, it is equivalent to error. You can find the misclassification
matrix on the Risk tab of the Tree window.

Figure 3-15
Risk summary

Interpreting the Risk Chart

The misclassification matrix counts up the predicted and actual category values and
displays them in a table. A correct classification is added to the counts in the diagonal
cells of the table. The diagonal elements of the table represent agreement between the
predicted and actual value—this is often called a “hit.” An incorrect classification—
called a “miss”—means that there is disagreement between predicted and actual value.
Misclassifications are counted in the off-diagonal elements of the matrix. In this
example, 11 applicants with no credit or no debt (NCR/NODEB) were misclassified as
having current, up-to-date credit accounts (PD BK). This table is helpful in
determining exactly where your model performs well or poorly.
The risk estimate and standard error of risk estimate are values that indicate how
well your classifier is performing. In this case, the risk estimate for the four-level
C&RT tree is 0.2880, and the standard error for the risk estimate is 0.0143. In other
words, we are missing 28.8% of the time. At this point, we might want to think about
ways to improve our model.
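
With equal misclassification costs, the risk estimate is just the proportion of misclassified cases, and the reported standard error is consistent with the usual binomial formula. A minimal Python sketch, assuming that 288 of the 1,000 training cases fall in off-diagonal cells of the misclassification matrix (consistent with the 0.2880 estimate):

import math

n_cases = 1000          # total cases used to grow the tree
n_misclassified = 288   # sum of the off-diagonal cells (assumed, consistent with 0.2880)

risk = n_misclassified / n_cases
se = math.sqrt(risk * (1 - risk) / n_cases)
print(f"risk estimate = {risk:.4f}, standard error = {se:.4f}")
# risk estimate = 0.2880, standard error = 0.0143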

Rules
Rules tell you the characteristics of cases that make up particular nodes. For example,
node 15 is made up of applicants who have other bank or store debts, are foreign
workers, and who intend to use the loan for domestic appliances or job retraining. This
information can be useful for understanding key segments of your data and for
classifying new cases.
Rules can be generated in any of three formats: SPSS syntax, SQL query, or
decision rules. SPSS syntax rules can be used in SPSS or NewView to classify cases
based on values of predictor variables. Likewise, SQL rules can be used to extract and
label cases from your SQL database engine. Decision rules are plain-language
descriptions of the characteristics of nodes, suitable for including in reports or
presentations.
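
For example, the node 15 rule described above could be turned into a simple predicate for scoring new records. The sketch below is written in Python; the field names and category values are hypothetical, whereas the rules AnswerTree generates use the actual variable names and values in your data.

def in_node_15(record):
    # Hypothetical rendering of the node 15 description above.
    return (record["other_debtors"] in ("bank", "stores")
            and record["foreign_worker"] == "yes"
            and record["purpose"] in ("domestic appliances", "retraining"))

applicant = {"other_debtors": "bank", "foreign_worker": "yes", "purpose": "retraining"}
print(in_node_15(applicant))  # True -- this case would fall into node 15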
Figure 3-16
Rules summary

Analysis Summary
The Summary tab in the Tree window contains information presented as text. The
following information is included in the summary:

 Project information. Project name, name of tree, data file, number of cases, and
weighting.
 Partition information.
 Cross-validation information.
 Tree-growing criteria. Growing method, algorithm specifications, stopping rules,
pruning.
 Model. Target variable, predictors, cost by target category, priors by target
category, profits by target category.
Figure 3-17
Analysis summary

Interpreting the Analysis Summary

The Summary tab gives you a way to audit your analyses. Its text output reports the
dialog box settings for your model, the growing criteria, and other information about
the current tree. You can use the analysis summary as part of your report or use it to
tune your analysis by changing the model or criteria. Notice that the contents of the
analysis summary update to reflect any changes you make to the current tree, whether
you modify the tree or grow a new one. If you want to preserve the summary for a
particular tree, save that tree before making changes, or export the summary output
as a text file by choosing
File
Export...
with the Summary tab open.

Modifying an Existing Decision Tree


Once you have generated a decision tree, you can modify the structure to more closely
fit your expectations. You can:
 Prune a branch.
 Reduce the number of levels in the tree.
 Increase the number of levels in the tree.

Why would you want to change the configuration of your automatically grown
classifier? Because AnswerTree is an exploratory data analysis tool, you are not
required to adhere to assumptions about the data or model that you produce. The
measurement of a tree classifier’s success or failure is based on the performance
characteristics of the entire tree classifier rather than the choice of a “best,” or most
appropriate, variable at any given point in the model. By applying your special
knowledge of the problem at hand, you can compose a model that looks more like what
you expect to see from the data.

Pruning a Branch

The tree metaphor applies easily here. Pruning is simply the act of eliminating a portion
of a branch from your decision tree. To prune a branch, select the node beneath which
lie the nodes that you want to eliminate, and from the menus choose:
Tree
Remove Branch

All of the nodes that existed as children to the selected node are eliminated from the
model. All of the appropriate views and summary statistics are automatically updated.
Figure 3-18
Pruning a branch from the tree

Reducing and Increasing the Number of Levels in the Tree

You have four levels in the current tree. If you want only three levels or want to add a
fifth level, you can accomplish this by choosing from the menus, respectively,
Tree
Remove One Level
or
Tree
Grow Tree One Level
The results of removing a level from our four-level tree are shown below. Again, all of
the appropriate views and summary statistics are automatically updated with new
information to reflect the status of the new tree.
Figure 3-19
Tree with one level removed

Conclusion
You’ve learned how to use AnswerTree to build a tree based on your data and how that
tree helps you make decisions about how to classify cases. You’ve also learned how to
access various features of AnswerTree that help you better understand your model and
to adjust your model according to the needs of your situation.
Chapter 4
Growing the Root Node

AnswerTree is designed to make tree-growing easy. The multiple windows of the
software allow you to choose how you want to view your work. Details about a tree
are shown in the Tree window and in the various Viewer windows. The Tree window
shows the whole tree, while the Viewer windows provide details about specific nodes
within the tree. The windows are linked, so that changing the selection in the Tree
window causes the Viewer windows to be updated automatically.
Your work in AnswerTree is organized into projects, each of which is based on a
particular set of data. Each project can have multiple trees associated with it. The
Project window summarizes the contents of the current project. Trees associated with
the current project are shown as subordinate entries in the hierarchical view of the
Project window.


Figure 4-1
Project, Tree Map, Viewer, and Tree windows

This chapter focuses on the Project window, which is where you begin the tree-
growing process. In this chapter, you will learn how to get started growing a tree and
how to specify all of the options that accompany the tree. Once you have grown the
root node, it appears in the Tree window. For more information about the Tree window
and its various options, see Chapter 5.

Project Window
The Project window is the application’s main window, showing all contents of the
project. It contains an outline control that organizes the project items hierarchically. If
there is no open project, the window is empty.

Figure 4-2
Project window

The highest level in the outline is the project level. New trees are appended to the
outline as children of the project. Double-clicking an entry in the Project window
activates the associated Tree window. Right-clicking an entry allows you to delete or
to close the corresponding tree.
Names of items in the outline can be edited by selecting the item and then clicking
the name. By default, trees are named for the dependent variable used.
Closing the Project window exits the application. Minimizing the Project window
minimizes all AnswerTree windows.

Project Window File Menu


The Project window File menu contains the following options:
 New Project. Clears the current project and creates a new empty project.
 Open Project. Opens an existing project file.
 New Tree. Opens the New Tree Wizard, which initially contains the system default
settings for tree creation.
 Save Project. Saves the current project to a file.
 Save Project As. Saves the current project with a new name.
 Exit AnswerTree. Exits the program.

New Project
When you choose New Project from the File menu, the New Project dialog box
appears. You can choose the type of data file you want to use: an SPSS file, a SYSTAT
file, or a database file.
When you choose the Database Capture Wizard option, the Database Capture
Wizard opens. It assists you in creating, running, or editing a database query from any
database for which you have an ODBC driver. Additional options, including SPSS for
Express (for Oracle Express databases) and Business Query for SPSS (for
BusinessObjects databases), are available if you have these applications installed.
Figure 4-3
New Project dialog box

If your data are not in any format that AnswerTree can use, you may want to install the
SPSS ODBC driver, which is included on the AnswerTree CD-ROM. It allows you to
save data in SPSS format from any application capable of writing to ODBC data
sources. See the Readme file included with this driver for more information.

New Tree
Whenever you create a new tree, AnswerTree launches the New Tree Wizard. The
wizard consists of a series of dialog boxes that help you create your tree. At the very
least, you must fill in the first two dialog boxes, in which you select your growing
method and define your model. After defining your model, you can click Finish at any
time to grow your root node.
However, if you continue past the second dialog box (by clicking Next rather than
Finish), you can specify whether or not to validate the tree. Additional advanced
options allow you to specify stopping rules and other growing criteria for the method
you have chosen. Each dialog box in the New Tree Wizard is explained in the
following sections.

To Grow a New Tree


► To grow a new tree, from the Project window or Tree window menus choose:
File
New Tree...

► Select a growing method.

► Specify a target variable and predictor variables.


Optionally, you can:
 Specify variables with information on frequencies or case weights.
 Choose to validate your tree.
 Specify stopping rules and growing criteria.
 Specify options for pruning, scores, priors, and costs.

► Click Finish when you are ready to grow the tree.

Growing Method

The tree is constructed using a particular statistical method to determine how to define
each split. Select one of the four methods available in AnswerTree:
 CHAID. Chi-squared Automatic Interaction Detection. This method uses chi-
squared statistics to identify optimal splits. The target variable can be nominal,
ordinal, or continuous.

 Exhaustive CHAID. This method is a modification of CHAID that does a more
thorough job of examining all possible splits for each predictor but takes longer to
compute. The target variable can be nominal, ordinal, or continuous.
 C&RT. Classification and Regression Trees. These methods are based on
minimization of impurity measures. The target variable can be nominal, ordinal, or
continuous.
 QUEST. Quick, Unbiased, Efficient Statistical Tree. This method is quick to
compute and avoids the other methods’ biases in favor of predictors with many
categories. The target variable must be nominal.
C&RT and QUEST generate binary trees (every split results in exactly two child
nodes), whereas CHAID and Exhaustive CHAID can generate nonbinary trees (some
splits result in more than two child nodes).
Figure 4-4
New Tree Wizard: Growing method

Model Definition

The variable list on the left contains the variables available in the current project. As
variables are assigned to roles in the new tree, they are removed from this list. To
assign a variable to a particular role in the model, click the variable and drag it to the
desired location. Note that date variables cannot be used in AnswerTree models.
This dialog box has a convenient pop-up menu, which you can access by right-
clicking your mouse. You can change the measurement level of any variable shown,
and you can choose to sort the variables by measurement level or alphabetically. You
can also specify whether variable labels or variable names are used in the dialog box
and in the tree itself. To change the settings, right-click over any variable and select the
appropriate option from the menu.
Target. Indicates the variable to be predicted (also known as the dependent variable).
Predictors. Indicates the variables used to make predictions about target variable
values. After assigning a target variable, you can assign all remaining variables as
predictors by selecting All others. To assign only some of the available variables as
predictors, drag the appropriate variables to the Predictors list.
Frequency. Indicates the variable that specifies frequencies for cases, if any. Use this if
records in your data set represent more than one unit each—for example, if you are
analyzing aggregated data. Values for a frequency variable should be positive integers.
Cases with negative or zero frequency weights are excluded from the analysis. Non-
integer frequency values are rounded to the nearest integer.
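
A minimal sketch of that handling in Python, using pandas with illustrative column names (AnswerTree performs this internally; you do not need to preprocess the file yourself):

import pandas as pd

data = pd.DataFrame({
    "risk_class": ["good", "bad", "good", "bad"],
    "freq":       [12.4,    0.0,   3.6,   -2.0],   # frequency values as read from the file
})

data["freq"] = data["freq"].round().astype(int)  # non-integer frequencies rounded to the nearest integer
data = data[data["freq"] > 0]                    # cases with zero or negative frequency excluded
print(data)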
Case Weight. Indicates the variable that specifies case weights, if any. Case weights are
used to account for differences in variance across levels of the target variable
(heteroscedasticity). These weights are used in model estimation but do not affect cell
frequencies. Case weight values should be positive, but they can be fractional. Cases
with negative or zero case weight are excluded from the analysis. With a categorical
dependent variable, cases that belong to the same dependent variable class and the
same predictor variable category are grouped together as a cell. The corresponding
case weights are aggregated to form a cell weight for that cell. A contingency table, in
which classes of the dependent variable are used as columns and categories of the
predictor variable being studied are used as rows, is formed and cell weights are used
in the analysis. Case weights are ignored when using the QUEST growing method.
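
The cell-weight aggregation described here can be sketched in a few lines of Python (column names are illustrative):

import pandas as pd

data = pd.DataFrame({
    "target":    ["good", "good", "bad", "bad", "good"],
    "predictor": ["A",    "A",    "A",   "B",   "B"],
    "weight":    [0.5,    1.5,    2.0,   1.0,   0.8],
})

# Contingency table with predictor categories as rows and target classes as columns;
# the case weights in each cell are summed to form a single cell weight.
cell_weights = data.pivot_table(index="predictor", columns="target",
                                values="weight", aggfunc="sum", fill_value=0)
print(cell_weights)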

Figure 4-5
New Tree Wizard: Model definition

Validation Options

In many circumstances, you want to be able to assess how well your tree structure
generalizes from the data at hand to a larger sample. There are three validation options
available.

Figure 4-6
New Tree Wizard: Validation options

Do not validate the tree. This option requests no validation procedure. The tree is both
built and tested on the entire set of data.
Partition my data into subsamples. Partitioning divides your data into two sets—a
training sample, from which the model is generated, and a testing sample, on which the
generated model is tested. If the model generated on one part of the data fits the other
part well, this indicates that your tree structure should generalize adequately to larger
data sets that are similar to the current data. If you select partitioning, use the slider
control to determine what proportion of cases go into the training and testing samples.
Note that the slider proportion is approximate.
After setting up the partitions, make sure that your training sample is selected (in the
View menu) and grow the tree. When you are satisfied with the tree, select the testing
sample in the View menu. The results in the Tree window will change to reflect the
results of applying the tree to the testing sample. By examining the risk estimates, gain
summary, and analysis summary, you can determine the extent to which your tree
generalizes.
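
Conceptually, the partitioning works like the following Python sketch: cases are assigned at random to the training or testing sample in roughly the chosen proportions, using a fixed seed so that the split can be reproduced. The 70/30 proportion is illustrative; 2,000,000 is the default seed mentioned under Random Seed below.

import random

random.seed(2000000)               # fixed seed so the same cases land in the same samples
case_ids = list(range(1000))       # illustrative case identifiers

training = set(random.sample(case_ids, 700))           # roughly 70% for growing the tree
testing = [i for i in case_ids if i not in training]   # the rest for testing the tree
print(len(training), len(testing))                      # 700 300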

Cross-validation. Cross-validation involves splitting the sample into a number of
smaller samples. Trees are then generated, excluding the data from each subsample in
turn. For example, with tenfold cross-validation, the data are split into 10 subsamples
(called sample folds) and then 10 trees are generated. The first tree is based on all of
the cases except those in the first sample fold, the second tree is based on all of the
cases except those in the second sample fold, and so on. For each tree, misclassification
risk is estimated by applying the tree to the subsample excluded in generating it. The
cross-validated risk estimate for the overall tree is calculated as the average of the risks
for all of these trees. If you select the cross-validated risk estimate, specify the number
of sample folds in the text box. Note that the cross-validated risk estimate is available
only when the tree is grown automatically.
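
The logic of the cross-validated estimate can be sketched in Python as follows; grow_tree and misclassification_risk stand in for AnswerTree's tree-growing and risk-evaluation steps and are not real functions of the product.

import random

def cross_validated_risk(cases, grow_tree, misclassification_risk, n_folds=10):
    # Assign cases to sample folds at random.
    indices = list(range(len(cases)))
    random.shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]

    risks = []
    for fold in folds:
        held_out = [cases[i] for i in fold]
        training = [cases[i] for i in indices if i not in set(fold)]
        tree = grow_tree(training)                            # tree grown without this fold
        risks.append(misclassification_risk(tree, held_out))  # risk measured on the held-out fold
    return sum(risks) / n_folds                               # average over all folds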
Random Seed. When using validation, cases are randomly assigned to partitions or
sample folds. The seed setting allows you to specify the starting value used by the
random number generator to assign cases. This feature is useful if you want to be able
to duplicate exactly the partitioning in another session; sets defined with the same
random number seed will always assign the same cases to the same partitions.
Therefore, if you want to duplicate the case partitioning later, set the seed to a specific
value. (The default setting is 2,000,000.)

Advanced Options
The final step in the New Tree Wizard lets you specify advanced options. Advanced
options include stopping rules, pruning, scores, costs, and priors. In addition, there are
options specific to each growing method. Therefore, the options available in the dialog
box will change, depending upon which growing method you have chosen.
If you elect not to specify any advanced options, the software reverts to default
settings, and your tree is generated normally. However, it is best to check the settings
before growing your tree. For example, some settings may need to be changed,
depending on the number of cases in your data file or the desired number of levels in
your tree. To achieve optimal results, you may have to experiment to discover the best
combination of settings.

To specify advanced options from the New Tree Wizard, click the Advanced Options
button (on the fourth screen) to open the dialog box. When you are finished, click
OK to return to the New Tree Wizard.
Note that after a tree is grown, you can go back and change the advanced options,
but this change will cause the tree to be regrown. To do so, from the Tree window
Analysis menu, choose Advanced Options.
Figure 4-7
New Tree Wizard: Advanced Options

Advanced Options: Stopping Rules

In generating a tree structure, the program must be able to determine when to stop
splitting nodes. The criteria for determining this are called stopping rules.

Figure 4-8
Advanced Options: Stopping Rules

You can control the following stopping rules settings:


Maximum Tree Depth. This setting allows you to control the depth (number of levels
below the root node) of the generated tree.
Minimum Number of Cases. This setting allows you to specify minimum numbers of
cases for nodes. Nodes that do not satisfy these criteria will not be split.
 Parent node. The minimum number of cases in a parent node. Nodes with fewer
cases will not be split.
 Child node. The minimum number of cases in child nodes. If splitting a node would
result in a child node with number of cases less than this value, the node will not
be split.
C&RT. The stopping rule for C&RT depends on the minimum change in impurity. If
splitting a node results in a change in impurity less than the minimum, the node is not
split.
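
Taken together, the stopping rules act like the following check, sketched here in Python (the threshold values shown are illustrative, not the product defaults):

def allow_split(depth, parent_n, smallest_child_n, impurity_change,
                max_depth=5, min_parent_n=25, min_child_n=1,
                min_impurity_change=0.0001):
    # A node is split only if every stopping rule permits it.
    if depth >= max_depth:                      # maximum tree depth reached
        return False
    if parent_n < min_parent_n:                 # too few cases in the node to split
        return False
    if smallest_child_n < min_child_n:          # a resulting child node would be too small
        return False
    if impurity_change < min_impurity_change:   # C&RT only: improvement in purity too small
        return False
    return True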

Changing the stopping rules after a tree has been grown will invalidate the tree and
require the tree to be regrown.

Advanced Options: CHAID

This tab allows you to control growing criteria for CHAID and Exhaustive CHAID
models.
Figure 4-9
Advanced Options: CHAID

You can control the following CHAID settings:


Alpha for. This setting allows you to control the alpha levels for splitting nodes and
merging categories. Specify an alpha level for each operation by adjusting the slider
control or entering a value.
Chi-Square for Nominal Target. This setting allows you to specify the chi-square type to
use for evaluating splits. Select either Pearson or Likelihood Ratio. For ordinal and
discretized continuous variables, the likelihood-ratio statistic is used.

Convergence. This setting allows you to specify the convergence criteria for a CHAID
analysis.
 Epsilon. Specify the smallest change required when calculating estimates of
expected cell frequencies.
 Maximum iterations. Specify the maximum number of iterations the program
should perform before stopping.
Allow splitting of merged categories. You can use this control to allow splitting of
merged categories.
Use Bonferroni adjustment. This option allows you to correct alpha levels for multiple
comparisons. This option is activated by default.

Advanced Options: C&RT

This tab allows you to select the impurity measure for C&RT models.
Figure 4-10
Advanced Options: C&RT

Impurity Measure for Categorical Targets. Select the impurity measure that you want to
use in growing the tree. Select one of the following:
 Gini. A measure based on squared probabilities of membership for each target
category in the node. It reaches its minimum (zero) when all cases in the node fall
into a single target category.
 Twoing. A measure based on grouping target classes into the two best subclasses
and computing the variance of the binary variable indicating the subclass to which
each case belongs.
 Ordered Twoing. Similar to the twoing criterion, with the additional constraint that
only contiguous target classes can be grouped together.
For continuous targets. The least squared deviation (LSD) measure of impurity is
automatically applied when the target variable is continuous. This index is computed
as the within-node variance, adjusted for frequency or case weights (if any).
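
As a point of reference, the Gini measure above and the LSD measure for continuous targets correspond to the following small Python sketch (standard, unweighted definitions; AnswerTree additionally adjusts for frequency and case weights):

from collections import Counter

def gini_impurity(labels):
    # 1 minus the sum of squared class proportions; zero when the node is pure.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def lsd_impurity(values):
    # Least squared deviation: the within-node variance.
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

print(gini_impurity(["a"] * 50))                # 0.0  (pure node)
print(gini_impurity(["a"] * 25 + ["b"] * 25))   # 0.5  (evenly mixed node)
print(round(lsd_impurity([1.0, 2.0, 3.0]), 3))  # 0.667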
Surrogates. Specify the maximum number of surrogates for which to keep statistics at
each node in the tree. For each node, the Select Surrogate dialog box will show only as
many surrogates as you specify here.
Growing criteria and user-defined costs. The Gini criterion is the only choice that
explicitly includes cost information in growing the tree. The twoing and ordered
twoing criteria do not consider costs in constructing the tree, although costs are still
used in computing risks and in node assignment. If you want to use costs with twoing
or ordered twoing, first disable custom costs, and then select Adjust priors using
misclassification costs in the Priors tab of this dialog box. This will incorporate cost
information into the priors and apply them to the model.

Advanced Options: QUEST

This tab allows you to control the alpha setting for QUEST models.
Figure 4-11
Advanced Options: QUEST

Alpha for. This setting allows you to control the alpha levels for variable selection.
Specify an alpha level by adjusting the slider control or entering a value.
Surrogates. Specify the maximum number of surrogates for which to keep statistics at
each node in the tree. For each node, the Select Surrogate dialog box will show only as
many surrogates as you specify here.

Advanced Options: Pruning

This tab allows you to control the subtree selection criterion for pruning procedures.
Note that changing the settings in this dialog box does not activate pruning. To use
pruning, you must choose Grow Tree and Prune from the Tree menu after you have
grown the root node.

Figure 4-12
Advanced Options: Pruning

Select Subtree Based on. You can specify which criterion you want to use to select
subtrees for pruning.
 Standard Error rule. If this option is selected, the application chooses the smallest
subtree whose risk is close to that of the subtree with the minimum risk. The
multiplier indicates the number of standard errors used for the standard error rule.
Available options are 0.5, 1.0, 1.5, 2.0, and 2.5. This criterion is the default, with
the 1.0 multiplier selected.
 Minimum risk. If this option is selected, the application chooses the subtree that has
the minimum risk.

Advanced Options: Scores

Scores define the order of and distance between categories of an ordinal (or discretized
continuous) target variable. The ordering of categories affects all analyses. Scores are
available only for CHAID analyses.
Figure 4-13
Advanced Options: Scores

When you first create a tree, the Ordered integers option is selected by default. Default
scores follow the variable’s value order, so that the first category has a score of 1, the
second category has a score of 2, and so on. To enter your own custom scores, select
Custom. (At this point, AnswerTree may have to scan the data before populating the
grid.) For the target variable, each category is shown, along with its score. You can
change the score for any category in the grid by selecting the desired score and editing
the value.
If you change the scores after the tree is grown, your existing tree will be invalid.
You will need to regenerate the tree.

Advanced Options: Costs

Costs allow you to include information about the relative penalty associated with
incorrect classification by the tree. For example, the cost of denying credit to a
creditworthy customer is likely to be different from the cost of extending credit to a
customer who then defaults on the loan.
Figure 4-14
Advanced Options: Costs

When you first create a tree, the costs are equal for all categories by default. To enter
your own custom costs, select Custom. (At this point, AnswerTree may have to scan
the data before populating the grid.)
Costs are shown in a k-by-k grid, where k is the number of categories of the target
variable. The columns represent the actual categories. The rows represent the predicted
categories, based on the tree. Each cell represents the cost of assigning a case to one
category (defined by the row) when it actually belongs to another category (defined by
the column). Default values are 1 for all off-diagonal cells.

The diagonal elements represent correct classifications, where the predicted
category and the actual category match. No cost is associated with correct predictions.
All other costs can be edited by selecting the appropriate cell of the matrix and
editing the value. Changing the cost matrix invalidates trees grown with the C&RT
method. Such trees will need to be regrown to reflect the new costs. Trees grown using
other methods are not directly affected by changes to the cost matrix, but risk statistics
are updated to reflect the new costs.
Costs do not affect the tree-growing process for QUEST trees or C&RT trees using
the twoing or ordered twoing impurity measures. However, costs are used in node
assignment and risk estimation for these trees. To include cost information in the tree-
growing process for such trees, use adjusted priors.
Make Matrix Symmetric. In many instances, you will want costs to be symmetric—that
is, the cost of misclassifying A as B is the same as the cost of misclassifying B as A.
The following controls can make it easier to specify a symmetric cost matrix:
 Copy Lower Half. Copies values in the lower triangle of the matrix (below the
diagonal) into the corresponding upper-triangular cells.
 Copy Upper Half. Copies values in the upper triangle of the matrix (above the
diagonal) into the corresponding lower-triangular cells.
 Use Cell Averages. For each cell in each half of the matrix, the two values (upper-
and lower-triangular) are averaged and the average replaces both values. For
example, if the cost of misclassifying A as B is 1, and the cost of misclassifying B
as A is 3, then this control will replace both of those values with the average
(1 + 3) / 2 = 2.
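
For reference, the Use Cell Averages operation corresponds to the following small sketch in Python with NumPy, using a hypothetical 3-by-3 cost matrix:

import numpy as np

# Rows are predicted categories, columns are actual categories; the diagonal stays 0.
costs = np.array([[0.0, 1.0, 4.0],
                  [3.0, 0.0, 1.0],
                  [2.0, 2.0, 0.0]])

symmetric = (costs + costs.T) / 2.0   # average each pair of mirrored off-diagonal cells
np.fill_diagonal(symmetric, 0.0)      # correct classifications keep zero cost
print(symmetric)
# The pair of costs 1 and 3 from the example above becomes 2 in both positions.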

Advanced Options: Prior Probabilities

Prior probabilities allow you to specify the correct method for estimating the
proportion of cases associated with each target variable category. The proportions of
cases in the target variable categories determine how information from the cases is
used to grow the tree. In some cases, the proportions in the training sample are
representative of the proportions in the population under study (that is, the proportions
in real life). In other cases, the proportions in the training sample are not representative
of real life, and some other estimate must be used.

Figure 4-15
Advanced Options: Prior Probabilities

Method. There are three methods for specifying priors:


 Based on training data. The training data are assumed to be representative of the
probabilities in the population.
 Equal for all classes. Cases in the population are assumed to be distributed evenly
among all of the target variable categories.
 Custom. When neither of the other two methods is suitable, you can specify custom
prior probabilities. Values for priors are entered in the grid control. To edit a prior
probability value, select the desired probability and enter the new value.
Adjust priors using misclassification costs. Selecting this option allows you to
incorporate information about costs into the prior probabilities. This provides a means
for including cost information in node splits for QUEST trees (which normally ignore
the cost matrix).
If you change the priors after the tree is grown, your existing tree will be invalid. You
will need to regenerate the tree.
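
The difference between the first two methods can be sketched in a few lines of Python; the class counts are illustrative (loosely based on the credit example in Chapter 3):

from collections import Counter

training_target = ["other"] * 707 + ["critical"] * 293   # illustrative training data

counts = Counter(training_target)
n_classes = len(counts)
n_cases = len(training_target)

priors_from_training = {cls: count / n_cases for cls, count in counts.items()}
priors_equal = {cls: 1.0 / n_classes for cls in counts}

print(priors_from_training)   # {'other': 0.707, 'critical': 0.293}
print(priors_equal)           # {'other': 0.5, 'critical': 0.5}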

Save Project
When you choose Save Project or Save Project As from the File menu, AnswerTree
saves your current project and data. Save Project uses the current file names. Save
Project As allows you to specify a new name for the project.
Two files are required for an AnswerTree project: the project file and the data file.
The project file has the .atp extension and contains information about models used,
trees grown, parameters, etc. The data file has the .sav extension and includes the
values for all variables for all cases used in the analysis. The data file name is derived
from the name of the original data source, with an underscore prefixed to the name. So,
for example, if you create a new project based on the data file iris.sav, saving the
project as iris.atp creates two files: iris.atp and _iris.sav (where _iris is derived from
the original data filename).
When sending your AnswerTree results to another AnswerTree user, be sure to
include both files (project file and data file), or the other user will be unable to open
the project.

Project Window View Menu


The Project window View menu contains the following options:
 Tree Map. Toggles display of the Tree Map window.
 Graph. Toggles display of the Graph window.
 Table. Toggles display of the Table window.
 Data. Toggles display of the Data window.
 Startup Dialog. Displays or hides the startup dialog box the next time the application
is invoked.
 Toolbar. Toggles display of the toolbar at the top of the Project window.
 Status Bar. Toggles display of the status bar at the bottom of the Project window.

Project Window Help Menu


The Project window Help menu contains the following options:
 AnswerTree Help Topics. Displays the contents of the online Help system.
 AnswerTree Tutorial. Launches the online AnswerTree tutorial.
 Help for This Window. Displays Help for the Project window.
 About AnswerTree. Displays information about AnswerTree, including the version
number.
Chapter 5
The Tree Window

After the root node is grown, you perform most operations in the Tree window. It is a
tabbed window that displays five views of the current analysis: a tree diagram, a tabular
gains summary, a risk summary, the rules defining nodes, and a text analysis summary.

Tree Window Views


Tree. Displays the details for a portion of the tree. The region of the tree shown here
can be controlled by scrolling or by using the Tree Map window to select a region.
The nodes selected here determine the information displayed in the Graph, Table, and
Data windows. Nodes can be displayed as tables of values, graphs of values, or both
tables and graphs.
Gains. Displays the statistics associated with nodes and the information gained from
each split. Gains can be displayed by nodes or by percentiles, based on cases or
average profits. The nodes selected here determine the information displayed in the
Graph, Table, and Data windows.
Risk. Displays the estimated risk and misclassification table.
Rules. Displays the rules used to define the selected nodes in the tree.
Summary. Displays a summary report for the tree.

You can duplicate a Tree window or create a new Tree window that uses the same
specifications as the active tree. The new or duplicate tree can be grown or edited
independently of the original tree. Properties can be set individually for the new
window.

Tree Window: Tree View

The Tree view shows a graphical display of the structure of the tree in detail. In most
cases, because of the size of the overall tree, only a portion of the tree will be visible
in the Tree view. You can scroll the window to view other parts of the tree or use the
Tree Map window to select a region of the tree to view.
Figure 5-1
Tree view

Each node in the tree is shown as a table of values, a graph of values, or both. The node
display can be controlled via toolbar buttons or menu items. For a categorical (nominal
or ordinal) target variable, the values show the number of cases in each category. For a
continuous target variable, the values indicate the distribution of values for the node.
You can select one or more nodes in the Tree view. When you change the selection,
the Graph, Table, and Data windows are automatically updated to reflect the currently
selected nodes.

Tree Window: Gain Summary

Gains indicate which parts of the tree have the highest (and lowest) response or profit.
The Gains tab shows a summary of the gains for all of the terminal nodes in the tree,
sorted by gain value.
Figure 5-2
Gains view

Gain scores are computed differently for categorical target variables and continuous
target variables.
Categorical target variable. The gain score for a node is computed as either the
proportion of cases in the node belonging to the target class or the average profit for
the node. This is controlled by selecting the appropriate option in the Gain Summary
dialog box.
Continuous target variable. The gain score for a node is computed as the average value
for the node.

The table’s rows can represent statistics for individual nodes or for percentiles of cases.
Percentile tables are cumulative. The following information is displayed for each node
or percentile group:
 Node (or Nodes). The nodes associated with the row.
 Node: N (or Percentile: N). The number of cases in the target class.
 Node: % (or Percentile: %). The percentage of the total sample cases falling into the
target class.
 Gain. The gain value for the group.
 Index (%). The ratio of this group’s gain score to the gain score for the entire sample.
Cumulative gains can also be displayed. These values are accumulated, so that each
row indicates values for cases in that group plus all previous groups. The display of
cumulative gains can be controlled by selecting the appropriate option from the Gain
Summary dialog box.
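As a hypothetical illustration of how the gain and index values relate (all numbers are invented), suppose the target category is response = yes:

   Whole sample: 1,000 cases, 100 in the target category   ->  sample gain = 10%
   Node 4:         120 cases,  48 in the target category   ->  node gain   = 40%
   Index (%) for node 4 = (40% / 10%) x 100 = 400%

A node with an index well above 100% is much richer in the target category than the sample as a whole.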

Tree Window: Risk Summary

This view gives the estimated risk (and its standard error) of misclassification based on
the tree. Risk is calculated in different ways, depending on the nature of the target
variable.
Categorical (nominal or ordinal) target variable. Risk is calculated as the proportion of
cases in the sample incorrectly classified by the tree. A table is also displayed,
indicating the numbers of cases corresponding to specific prediction errors.
Continuous target variable. Risk is calculated as the within-node variance about the
mean of the node.
Figure 5-3
Risk view
If misclassification costs have been specified, risk estimates are adjusted to account for
those costs. If priors have been specified, risk estimates are adjusted to account for
those priors.
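For a rough sense of scale (hypothetical numbers, ignoring any cost or prior adjustments): if a tree misclassifies 120 of 1,000 cases of a categorical target,

   risk estimate = misclassified cases / total cases = 120 / 1,000 = 0.12

meaning that about 12% of similar cases would be assigned to the wrong category.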

Tree Window: Rules View

For any combination of nodes, you can view the rules that describe the selected nodes.
The rule type and format are controlled by the rules format options. (For more
information on rules format options, see “Rules Format” on p. 88.)

Two types of rules are available:


Selecting cases. Rules describe how to identify cases belonging to any of the selected
nodes. Note: The rules for selecting cases will be simplified wherever possible. For
example, if you select all of the terminal nodes in the tree and then view the rules for
selecting cases, you will see only one simple rule that selects all of the cases. This is
because selecting all of the terminal nodes corresponds to selecting every possible
combination of predictor values, and therefore all cases belong in the “subset” defined
by the selected nodes. To view a separate rule for each node selected, choose Assigning
values.
Assigning values. Rules give values for the node assignment, the predicted value for the
target variable, and the probability value for the prediction, based on the combination
of predictor values that defines the node. One rule is generated for each selected node.

Rules can also be displayed in one of three formats:


SQL. SQL rules are used to extract records (from a database) that meet criteria for
membership in a node or to assign values for classification. SQL rules can be copied
and pasted into your SQL application or exported to a file and imported into the SQL
application. Note that after copying rules into your SQL application, you must provide
the table names.
SPSS. Rules are given as SPSS SELECT or COMPUTE statements. For derived
variables, names are assigned by the application. SPSS rules can be copied and pasted
into SPSS syntax files.
Decision. Specified as a set of logical “if...then” statements, suitable for inclusion in
written reports.
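As a rough illustration of the three formats, consider a hypothetical terminal node (node 3) defined by REGION = 1 and SALES > 100 (the variable names are invented, and the text AnswerTree generates may differ in detail):

   SQL:      SELECT * FROM <table> WHERE (REGION = 1 AND SALES > 100)
   SPSS:     SELECT IF (REGION = 1 AND SALES > 100).
   Decision: IF REGION = 1 AND SALES > 100 THEN node = 3.

As noted above, the table name in the SQL version must be supplied before the rule can be run.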
Figure 5-4
Rules view

Tree Window: Analysis Summary

The analysis summary describes the data used in the tree, the tree-growing process, and
the final tree. The analysis summary is updated when the tree structure is
modified.
Figure 5-5
Summary view

Tree Window Menus

Tree Window File Menu

The Tree window’s File menu contains the following options:


 New Tree. Opens the New Tree Wizard. The initial settings for the wizard, when it
is opened from the Tree window, match the settings of the current tree. (This is
different from choosing New Tree from the Project window’s File menu. In the
Project window, the initial wizard settings are the system defaults.)
 Close. Closes the active Tree window.


 Export. Exports the current view.
 Print Setup. Allows you to change print settings, such as the selected printer, paper
size, and orientation.
 Print Preview. Displays a preview of the printed output.
 Print. Prints the view or the current selection.
 Exit AnswerTree. Exits the program.

Exporting from the Tree Window

To export the contents of the Tree window, from the menus choose:
File
Export...
Items in the Tree window are exported in the following formats:
 Tree View. Exports as a Windows bitmap (*.bmp) file or a Windows enhanced
metafile (*.emf).
 Gains View. Exports as a tab-delimited text file. This file can easily be read into
spreadsheet or word-processing packages to create a table.
 Risk View. Exports as a tab-delimited text file.
 Rules View. Exports as a text file.
 Summary View. Exports as a text file.

Tree Window Edit Menu

The Tree window’s Edit menu contains the following options:


 Copy. Writes the selection or the visible portion of the active view to the clipboard.
 Select Terminal Nodes. Selects terminal nodes of the tree, depending on which
suboption you choose. The suboption All selects all terminal nodes. The suboption
Rule-Based opens the Set Selection Rule dialog box, in which you can specify
rules for selecting certain terminal nodes of interest.
Set Selection Rule Dialog Box

The Set Selection Rule dialog box lets you select the terminal nodes in your tree that
meet a certain criterion. With a large tree, the dialog box lets you locate specific nodes
quickly. You might want to explore the terminal nodes by trying several different
values for this dialog box. After you have made a useful selection, you might want to
export information about the selected nodes. For example, you could export the rules
for the selected nodes in SQL format, so that you could locate similar cases elsewhere
in your database.
Figure 5-6
Set Selection Rule dialog box

You can select nodes based on a statistical criterion, including gains, index, cumulative
gains, or cumulative index. To use this criterion, select a statistic from the drop-down
list and select a relation (>, >=, <, or <=). Finally, enter a value to complete the
inequality. For example, you might choose to select all terminal nodes whose gains are
greater than or equal to 1.5.
Alternatively, you can enter a value to select the top n nodes, the top nodes up to n
cases, or the top nodes up to n percent of your sample.
The results of this dialog box depend upon the settings in the Gain Summary dialog
box. For example, if the Gain Content is set to Average Profit and you are selecting
nodes based on gain, the selection cutoff point you enter must be an average profit
value, not a gain percentage. In addition, this dialog box is not available if your gain
summary is formatted to display percentiles in the rows (rather than nodes).
Finally, the meaning of top in this dialog box is relative. For most AnswerTree
analyses, the top nodes are those whose gains are greatest. However, you may want to
classify cases according to the smallest gains (for example, you might try to find low-
risk patients). To accomplish that, choose a less-than relation instead
of a greater-than relation.

Tree Window View Menu

The Tree window’s View menu contains the following options:


 Node Statistics. Displays statistics in each node in the Tree tab.
 Node Graph. Displays a graph in each node in the Tree tab.
 Tree Map. Displays the Tree Map window.
 Graph. Displays the Graph window.
 Table. Displays the Table window.
 Data. Displays the Data window.
 Sample. Allows you to select the sample used for the tree. The available options are
Training or Testing.
 Orientation. Selects the orientation of the tree in the Tree view. The available
options are Top Down, Left-to-Right, or Right-to-Left.
 Tool Bar. Toggles the toolbar at the top of the Tree window.
 Status Bar. Toggles the status bar at the bottom of the Tree window.
 Zoom. Zooms the Tree view in or out. This can be used to fit more of the tree in the
window (at the cost of reducing the size of each node) or to increase the size of the
type and graphs in the nodes (at the cost of reducing the area of the tree shown).

Tree Window Analysis Menu

The Tree window’s Analysis menu contains the following options:


 Define Variable. Defines how a variable is used in the analysis.
 Advanced Options. Allows you to modify settings for stopping rules, CHAID,
C&RT, and QUEST models, pruning, scores, costs, and priors.
 Competitors. Sets limits on the number of predictors (competitors) considered for
splitting at each node.
 Profits. Allows you to specify the relative value (profit) of each category of the target variable.

Competitors

This dialog box allows you to control the number of predictors displayed for each node.
Selecting smaller values can speed up the processing of large problems.
Figure 5-7
Competitors dialog box

Competitors. Specify the maximum number of predictors for which to keep statistics at
each node in the tree (these retained predictors are known as competitors). For each
node, the Select Predictor dialog box will retain statistics for only as many predictors
as you specify here. You can still manually split a node based on a predictor that is not
a competitor. Changing the number of competitors for a grown tree will affect only new
tree growth.

Setting Maximum Numbers for Competitors

E To set maximum numbers, from the Tree window menus choose:


Analysis
Competitors...

E Change the settings as needed.

Define Variable

This feature allows you to define various properties of the variables in your data file.
These properties affect how the variables are used in tree-growing and analysis.
Figure 5-8
Define Variable dialog box (nominal variable)

On the left is a variable list, showing all of the variables in your data set. The
measurement level for each variable is indicated by the icon shown next to the variable
name. Changes are made by selecting a variable in the variable list and then editing its
settings.
You can also specify whether variable labels or variable names are used in the
dialog box and in the tree itself. To change the setting, select a variable and right-click;
then select the appropriate option from the context menu. Changing this setting affects
only the selected item.
Measurement Level. You must indicate the appropriate measurement level in order to
have the variables treated properly in tree-growing and analysis. Available options are
nominal, ordinal, or continuous.
Values. For some variables, there may be values that you want to ignore or mark as
missing.
 Valid. (Not available for continuous variables.) This list contains the values that are
considered valid for the variable. For numerical ordinal variables, values are
processed in numerical order. Valid and missing values cannot be edited.
 Missing. This list contains the values that are defined as user-missing for the
variable. You can optionally specify that missing values be considered a
separate category in CHAID analyses. Valid and missing values cannot be edited.
 Label for missing. You can specify a text label for missing values.
 Display value labels. (Not available for nominal or ordinal variables.) You can
specify that value labels, rather than values, are displayed in the tree. The default
is to show value labels (if defined). Value labels are always shown for nominal and
ordinal variables (if they are defined in the data file).
Allow CHAID to merge categories. Select this option to allow categories to be merged by
the CHAID algorithm.

Defining a Variable

E To define variables, from the Tree window menus choose:


Analysis
Define Variable...

E Select a variable to modify.

E Change the variable’s settings as needed.

Measurement Levels

Each variable can be characterized by the kind of values it can take and what those
values measure. This general characteristic is referred to as the measurement level of
the variable. You can specify a variable as having one of three measurement levels:
Nominal. This measurement level includes categorical variables with discrete values,
where there is no particular ordering of values. Examples include the gender of a
respondent, the brand of a product tested, and the type of a loan.
Ordinal. This measurement level includes variables with discrete values, where there is
a meaningful ordering of values. Ordinal variables generally don’t have equal intervals,
however, so the difference between the first category and the second may not be the
same as the difference between the fourth and fifth categories. Examples include years
of education and number of children.
Continuous. This measurement level includes variables that are not restricted to a list of
values but can essentially take any value (although the values may be bounded above,
below, or both). Examples include annual salary, the amount of a loan, and the weight
of a product.
For a more thorough discussion of variable types and measurement levels, see Chapter 14.
Define Variable: Intervals

For continuous predictor variables in CHAID analyses, the quantitative scale is divided
into ranges of values, called intervals, which are treated as categories. Intervals can be
determined automatically by the program, or you can specify your own set of intervals.
Automatic intervals are chosen so that each interval contains approximately the same
number of cases.
Figure 5-9
Intervals dialog box

The name of the variable to be updated and its range are shown at the top of the dialog
box. The current intervals are shown in the list view, with four columns indicating the
start value, end value, number of cases, and percentage of cases for each interval. Note
that if the selected variable has fewer than 10 distinct values, you will not be able to
define intervals. Instead, the variable will be treated as an ordinal categorical variable.
Method. If you select Automatic, intervals will be calculated to distribute the cases
evenly among the number of intervals specified. To change the number of intervals, use
the spinner control or type the desired number of intervals into the field, and click the
Update List button.
If you select Custom, you will need to enter start values for intervals. If you want to
define all of the intervals, begin by clicking Clear List. To add a new interval, enter a
start value for New Cut Point and click Add. The new interval contains all cases with
values between the start value and the end value (or the next higher start value).
Continue adding new intervals until all intervals are specified as desired.
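As a hypothetical illustration (the variable and cut points are invented): for a continuous predictor income with values from 0 through 90,000, clearing the list and then adding start values of 0, 30,000, and 60,000 produces three intervals:

   0      through 29,999
   30,000 through 59,999
   60,000 through 90,000   (the last interval extends to the variable's maximum)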

The following buttons can be used to modify your interval list:


 Update List. While editing custom intervals, the number and percentage of cases are
not updated automatically. The Update List button updates these values for the
current set of intervals.
 Clear List. This button clears all custom interval definitions and resets the interval
list to a single interval containing all of the cases.
 Delete Item. This button removes the currently selected interval from the list.

Defining Variable Intervals

E To define variable intervals for a CHAID analysis, from the Project window or Tree
window menus choose:
Analysis
Define Variable...

E Set the measurement level to Continuous.

E Click Intervals.

E Select Automatic or Custom intervals.

E For custom intervals, click Clear List to delete the default intervals, and add new
intervals by specifying each interval's start point and clicking Add.

Profits

Profits allow you to include information about the relative value of each category of
the target variable.
Figure 5-10
Define Profits dialog box

For the target variable in the current tree, each category is shown, along with its profit.
Default profits for ordinal variables (including discretized continuous variables) are
inherited from the variable’s scores. Default profits for nominal variables assign a
value of 1 to the highest category and a value of 0 to all other categories (where highest
category means the largest value for numerically coded variables and the last value
alphabetically for string-coded variables). You can change the profit for any category
in the grid by selecting the desired category’s profit and editing the value.
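To see how profits feed into the gain summary (the values are hypothetical): if the target is response, with a profit of 25 assigned to respondents and -5 to nonrespondents, a node containing 30% respondents has an average profit of

   0.30 x 25 + 0.70 x (-5) = 4.0

so nodes with higher response rates rank higher when the Gain column is set to Average Profit.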

Setting Profits

E To set profits, from the Tree window menus choose:


Analysis
Profits...

E For each category of the target variable, change the value as necessary.

Tree Window Tree Menu

The Tree window’s Tree menu contains the following options:


 Grow Tree. Grows the entire tree. If the tree has already been partially grown, the
tree will be grown starting from the existing tree.
 Grow Tree One Level. Adds one level to the tree structure.
 Grow Tree and Prune. Grows the entire tree, then automatically prunes it. Not
available if the growing method is CHAID or Exhaustive CHAID.
 Grow Branch. Grows the tree below the current node to its terminal nodes.
 Grow Branch One Level. Adds one level under the currently selected node.
 Select Predictor. Allows you to specify which predictor to use in splitting the
current node and how values of the predictor are grouped to form the split.
 Select Surrogate. Allows you to specify a surrogate variable to use in splitting the
current node.
 Define Split. Redefines the split of the current node. This option can be used to
merge or separate nodes.
 Remove Branch. Removes the branch below the current node.
 Remove One Level. Removes one level from the whole tree.

Select Predictor

The Select Predictor dialog box displays a list of predictors available for splitting (or
resplitting) a selected node. When you select a variable in the list and click Grow, the
node is split using the selected predictor.
Figure 5-11
Select Predictor dialog box
The dialog box is not available if more than one node is selected in the tree. Predictors
with constant values for the selected node (where all cases in the node have the same
value for the predictor) are not shown.
The table displays information for each variable, depending on the growing method
used. (Not all items listed appear for all growing methods.)
 Predictor. The name of the predictor variable.
 Nodes. The number of nodes that will be created by splitting on the predictor.
 Split Type. The type of split: default for a computer-generated split, custom for a
user-specified split, or arbitrary for noncompetitor predictors.
 Chi-Square (categorical target) or F (continuous target). The value of the test statistic
used to evaluate the predictors.
 D.F. The degrees of freedom associated with the test statistic. For chi-square
statistics, there is only one df value. For F statistics, the two df values are given as
the numerator df and the denominator df.
 Adj. Prob. The probability (p value) associated with the test statistic, adjusted for
the multiple tests across the list of competitors (using the Bonferroni method). The usual
rule of thumb is that probabilities of less than 0.05 indicate statistically significant
associations.
You can limit the list display to a subset of the predictors (called competitors) by
changing the settings in the Competitors dialog box. If you limit the number of
competitors, other predictors (noncompetitors) will still be shown in the list with no
statistics and with the split type shown as Arbitrary. You can specify that only
competitors be shown by right-clicking the predictor list and selecting
Show Competitors.
You can override a category grouping or redefine a cut point for the selected
predictor by clicking Define Split.

The following formatting options are also available from the context menu:
Display Variable Labels. Shows variable labels in the dialog box and the Tree view.
Display Variable Names. Shows variable names in the dialog box and the Tree view.

Selecting a Predictor

E To select a predictor for a custom split, from the Tree window menus choose:
Tree
Select Predictor...
E Select a predictor from the list and click Grow to split the node using the selected
predictor.
Optionally, you can manually specify how the split is defined by clicking Define Split.

Define Split (Nominal Predictor/CHAID)

The Define Split feature allows you to control how the split is defined on the predictor
variable.
Figure 5-12
Define Split dialog box for CHAID

The dialog box shows the current grouping of categories. Changes are made by
selecting a node or nodes and right-clicking. From the context menu, the following
options are available:
 Merge Nodes. The selected nodes are merged into a single node. The new node is
shown on a single line, with multiple categories separated by commas.
 Separate Node. The selected node is separated into multiple nodes, one for each
category in the original node.
To rearrange the assignment of categories to nodes, you may first need to separate all
nodes so that each category defines a node, and then merge the individual categories to
create the desired split.

The following formatting options are also available from the context menu:
Display Value Labels. Shows category labels in the dialog box.
Display Values. Shows category values in the dialog box.
Define Split (C&RT or QUEST)

The Define Split feature allows you to control how the split is defined on the predictor
variable.
Figure 5-13
Define Split dialog box for C&RT or QUEST

If the selected node is already split, the dialog box shows the current grouping of
categories. Changes are made by dragging categories from one list to the other until the
ordinal categories are grouped as desired. For ordinal variables, dragging a category
from the middle of the list will also move all categories below it (dragging from left to
right) or above it (dragging from right to left).

The following formatting options are also available from the context menu:
Display Value Labels. Shows category labels in the dialog box.
Display Values. Shows category values in the dialog box.

Define Split (Continuous Predictor)

The Define Split feature allows you to control how the split is defined on the predictor
variable.
Figure 5-14
Define Split dialog box for a continuous predictor

If the selected node is already split, the dialog box shows the current cut point defining
the split. Changes are made by dragging the slider control or entering a cut point value
in the text box.
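For example (the values are hypothetical, echoing the Iris example in Chapter 8), entering a cut point of 2.45 for a continuous predictor such as petal length defines the split as:

   one child node:    cases with petal length <= 2.45
   other child node:  cases with petal length >  2.45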

Select Surrogate
The Select Surrogate dialog box displays a list of surrogates available for splitting (or
resplitting) a selected node. When you select a variable in the list and click Grow, the
node is split using the selected surrogate. Surrogates are available only for models
grown using the C&RT or QUEST methods.
Figure 5-15
Select Surrogate dialog box
If the selected node is already split, the best surrogate for the split variable is
highlighted in the list.

The table displays the following information for each variable:


 Surrogate. The name of the predictor variable.
 Improvement. The change in impurity if split using the surrogate. Shown for C&RT
models only.
 Association. The degree to which predictions based on the surrogate match those
based on the predictor. High values of association indicate surrogates that are good
substitutes for the original predictor.

Selecting a Surrogate

E To select a surrogate for a custom split, from the Tree window menus choose:
Tree
Select Surrogate...

E Select a surrogate from the list and click Grow to split the node.

Tree Window Format Menu

The Tree window’s Format menu contains the following options:


 Gains. Allows you to define the format of the gain summary.
 Rules. Allows you to specify the type of rules reported in the Rules tab.

Gain Summary Format

You can control various aspects of how the gain summary is displayed.
Figure 5-16
Gain Summary dialog box

The following options are available:


Rows Represent. You can select options for the units represented by rows in the table.
 Nodes. The table contains one row for each node in the tree.
 Percentiles. Each row of the table represents a certain proportion of the cases. You
can specify the size of the proportion in each row by selecting the appropriate value
from the Increment drop-down list.
 Display cumulative statistics. Selecting this option displays cumulative statistics for
the nodes in the table.
Gain Column. You can select the values displayed in the Gain column of the table and
control the sort order of rows.
 Contents. If you select Percentage of cases in target category, the Gain column
displays the percentage of cases that fall into the target category you have specified.
If you select Average Profit, the Gain column displays the average profit (or loss)
for each node, as you have specified in the Profits dialog box. Choose this option
if you have specified profits associated with each category.
 Sort order. You can select either Ascending or Descending order of rows, based on
the Gain value.
Setting the Gain Summary Format

E To set the gain summary format, from the Tree window menus choose:
Format
Gains...

E Select the desired options.

Rules Format

You can control various aspects of how rules are displayed.


Figure 5-17
Rules Format dialog box

The following options are available:


Type. You can select the type of rules generated.
 SQL. SQL rules are used to extract records (from a database) that meet criteria for
membership in a node or to assign values for classification. SQL rules can be
copied and pasted into your SQL application or exported to a file and imported into
the SQL application. Note that after copying rule(s) into your SQL application, you
must provide the table name(s).
 SPSS. Rules are given as SPSS SELECT or COMPUTE statements. For derived
variables, names are assigned by the application. SPSS rules can be copied and
pasted into SPSS syntax files.
 Decision. Rules are specified as a set of logical “if...then” statements, suitable for
inclusion in written reports.
Generate Syntax for. You can select the content of the rules.
 Selecting cases. Rules describe how to identify cases belonging to any of the
selected nodes. Note: The rules for selecting cases will be simplified wherever
possible. For example, if you select all of the terminal nodes in the tree and then
view the rules for selecting cases, you will see only one simple rule that selects all
cases. This is because selecting all of the terminal nodes corresponds to selecting
every possible combination of predictor values, and therefore all cases belong in
the “subset” defined by the selected nodes. To view a separate rule for each node
selected, choose Assigning values.
 Assigning values. Rules give values for node assignment, the predicted value for the
target variable, and the probability value for the prediction, based on the
combination of predictor values that defines the node. One rule is generated for
each selected node.
Use Labels for. For decision rules, you can request that rules be generated using labels
(rather than names or values) for variables, for values, or for both.

Setting the Rules Format

E To set the Rules format, from the Tree window menus choose:
Format
Rules...

E Select the desired options.

Tree Window Window Menu

The Tree window’s Window menu contains the following option:


 Minimize AnswerTree. Minimizes all AnswerTree windows.
Tree Window Help Menu

The Tree window’s Help menu contains the following options:


 AnswerTree Help Topics. Displays the contents of the online Help system.
 Help for This Window. Displays help for the tab currently selected in the Tree
window.
 About AnswerTree. Displays information about AnswerTree, including the version
number.
Chapter 6

Viewer Windows

Viewer windows display information about individual nodes in your tree. They give
more details about the selected nodes. You cannot modify anything in a Viewer
window; you can only examine the current state of the nodes. The following Viewer
windows are available to provide insight into your tree model:
 Tree Map window. Displays the entire active tree. The tree can be navigated by
selecting a node or region in the tree map.
 Graph window. Displays a graphic representation of the selected nodes.
 Table window. Displays a tabular representation of the selected nodes.
 Data window. Displays case-level data for the selected nodes.

Tree Map Window


The Tree Map window displays a macroscopic view of the tree. Changes to the
appearance and structure of the tree are reflected in the tree map.
Figure 6-1
Tree Map window

The tree map shows the full tree, including all nodes and connecting lines, in its current
orientation. The selected node or nodes are highlighted in the map. You can select a
node by clicking it in the Tree Map window. You can select additional nodes by Ctrl-
clicking.
The Tree Map window also has a selectable red rectangle that represents the
perimeter of the visible portion of the Tree window. The location and size of the rectangle
are updated when you scroll or resize the Tree window. If the Tree window is resized, its
contents adjust to fit in the available space, and the dimensions of the rectangle in the
tree map change accordingly. The tree can also be navigated by selecting a node in the
tree map. When you click on a node in the map, the node is selected in the tree map and
the Tree window, and the node becomes visible in the Tree window.
The rectangle can be moved by right-clicking and dragging it. If it is moved, a
different region becomes visible in the Tree window. Moving the rectangle does not
affect node selection in the Tree window.

Data Viewer
The Data Viewer shows case-by-case data for one or more selected nodes in the active
Tree window.
Figure 6-2
Data Viewer

Data shown in the viewer cannot be edited. In the data grid, rows represent cases, and
columns represent variables. All variables in the active data set are shown.
If data are partitioned, the data list includes only cases from the currently selected
partition. If there are frequency weights with partitioned data, a new variable is shown
to indicate the number of cases in each row assigned to the currently selected partition.
Exporting data. Data shown in the viewer can be exported to an external data file. Only
cases shown in the Data Viewer at the time of export are exported. To export all of the
data in the currently selected partition, select the root node before exporting. Data can
be exported in the following formats:
 SPSS. Data exported in SPSS format retain all information regarding value and
variable labels, measurement levels, and missing values.
 SYSTAT. Data exported in SYSTAT format retain information regarding system-
missing values, but information about value and variable labels and measurement
levels is lost. User-missing values are exported as the literal values, with no
indication that they represent another type of missing data.
 Tab-delimited ASCII text. Data exported as tab-delimited text retain information
regarding system-missing values, but information about value and variable labels
and measurement levels is lost. System-missing values are exported as blanks.
User-missing values are exported as the literal values, with no indication that they
represent another type of missing data. The first line in the exported data file
contains the variable names.

Data Viewer File Menu

The Data Viewer File menu contains the following options:


 Export. Exports the data shown in the Data Viewer. Data can be exported in SPSS,
SYSTAT, or ASCII format.
 Print Setup. Allows you to change print settings, such as the selected printer, paper
size, and orientation.
 Print Preview. Displays a preview of the printed output.
 Print. Prints the data shown in the Data Viewer.

Data Viewer Edit Menu

The Data Viewer Edit menu contains the following option:


 Copy. Copies the selected object(s) to the clipboard for pasting.
Data Viewer View Menu

The Data Viewer View menu contains the following options:


 Tool Bar. Toggles the toolbar at the top of the window.
 Status Bar. Toggles the status bar at the bottom of the window.

Data Viewer Help Menu

The Data Viewer Help menu contains the following options:


 AnswerTree Help Topics. Displays the contents of the online Help system.
 Help for this Window. Displays Help for the Data Viewer.
 About AnswerTree. Displays information about AnswerTree, including the version
number.

Graph Viewer
The Graph Viewer shows summary statistics in graphical format for one or more
selected nodes in the active tree. The graph shows summary statistics based on values
of the target variable.
The view depends on the level of measurement of the target variable and the
selection in the Tree window.
Continuous target variable. The viewer shows a histogram of the target variable for
cases in the selected node.
Figure 6-3
Graph Viewer histogram

Categorical target variable. The default graph is a bar chart of percentages for a selected
node. The categories of the target variable are shown in the chart in the same order as
they appear in the node graphs.
Figure 6-4
Graph Viewer bar chart

You can change the color of bars in the Graph Viewer by right-clicking a bar and
selecting a color from the Color dialog box.
Graph Viewer File Menu

The Graph Viewer File menu contains the following option:


 Export. Exports the graph shown in the Graph Viewer as a Windows bitmap
(*.bmp) or enhanced metafile (*.emf).

Graph Viewer Edit Menu

The Graph Viewer Edit menu contains the following option:


 Copy Graph. Copies the current graph to the clipboard for pasting.

Graph Viewer View Menu

The Graph Viewer View menu contains the following options:


 Tool Bar. Toggles the toolbar at the top of the window.
 Status Bar. Toggles the status bar at the bottom of the window.

Graph Viewer Help Menu

The Graph Viewer Help menu contains the following options:


 AnswerTree Help Topics. Displays the contents of the online Help system.
 Help for this Window. Displays Help for the Graph Viewer.
 About AnswerTree. Displays information about AnswerTree, including the version
number.

Table Viewer
The Table Viewer shows statistics for one or more nodes in tabular form. A Table view
is analogous to a Graph view of a node.
The Table view depends on the level of measurement of the dependent variable and the
selection in the Tree window:
Categorical target variable. The default table shows percentages and counts for each
category and the total n for the node(s). The percent given for the total is the percentage
of cases in the sample assigned to that node. If you have specified user-defined priors,
the total percent is adjusted to account for those priors. For categorical variables, you
can change the color of a row in the Table Viewer by right-clicking the row and
selecting a color from the Color dialog box.
Figure 6-5
Table Viewer for categorical target

Continuous target variable. The table shows the mean, standard deviation, number of
cases, and predicted value of the dependent variable for the selected node(s).
Figure 6-6
Table Viewer for continuous target
Table Viewer File Menu

The Table Viewer File menu contains the following option:


 Export. Exports the table shown in the Table Viewer as a Windows bitmap (*.bmp)
or enhanced metafile (*.emf).

Table Viewer Edit Menu

The Table Viewer Edit menu contains the following option:


 Copy Table. Copies the current table to the clipboard for pasting.

Table Viewer View Menu

The Table Viewer View menu contains the following options:


 Tool Bar. Toggles the toolbar at the top of the window.
 Status Bar. Toggles the status bar at the bottom of the window.

Table Viewer Help Menu

The Table Viewer Help menu contains the following options:


 AnswerTree Help Topics. Displays the contents of the online Help system.
 Help for this Window. Displays Help for the Table Viewer.
 About AnswerTree. Displays information about AnswerTree, including the version
number.
Chapter 7

Capturing Data with ODBC

When you define a new project, you must specify a data source. One of your choices
for reading data is the Database Capture Wizard. The Database Capture Wizard
allows you to read data from any database format for which you have an ODBC driver.
You can also read Excel 5 files using the Excel ODBC driver.

To Read Database Files with ODBC


E From the AnswerTree menus choose:
File
New Project...

E In the New Project dialog box, select Database Capture Wizard. Then choose whether to
create a new query or to run or edit an existing one.

E Select the data source. This can be a database format, an Excel file, or a text file.

E Select the database file.

E Depending on the database file, you may need to enter a login name and password.

E Select the table(s) and fields you want to read into the software.

E Specify any relationships between your tables.


Optionally, you can:
 Specify any selection criteria for your data.

 Add a prompt for user input to create a parameter query.


 Define any variable attributes.
 Save the query you have constructed before running it.

Selecting a Data Source

Use the first dialog box to select the type of data source to read into the software. After
you have chosen the file type, the Database Capture Wizard prompts you for the path
to your data file.
If you do not have any ODBC data sources configured or if you want to add a new
ODBC data source, click Add Data Source.
Figure 7-1
Database Capture Wizard dialog box
Example. Suppose that you have a Microsoft Access 7.0 database that contains data
about your employees and about the regions in which they work, and you want to import
that data. Select the MS Access 7.0 Database icon, and click Next to proceed. You will
see the Select Database dialog box. Specify the path to your database and click OK.

Database Login

If your database requires a password, the Database Capture Wizard prompts you for
one before it opens the data source.
Figure 7-2
Login dialog box

Selecting Data Fields

This dialog box controls which tables and fields are read into the software. Database
fields (columns) are read as variables.
If a table has any field(s) selected, all of its fields will be visible in the following
Database Capture Wizard windows, but only those fields selected in this dialog box
will be imported as variables. This enables you to create table joins and to specify
criteria using fields that you are not importing.
Figure 7-3
Select Data dialog box

Displaying field names. To list the fields in a table, click the plus sign (+) to the left of
a table name. To hide the fields, click the minus sign (–) to the left of a table name.
To add a field. Double-click any field in the Available Tables list, or drag it to the
Retrieve Fields in This Order list. Fields can be reordered by dragging and dropping
them within the selected fields list.
To remove a field. Double-click any field in the Retrieve Fields in This Order list, or
drag it to the Available Tables list.
Sort field names. If selected, the Database Capture Wizard will display your available
fields in alphabetical order.
Example. Assume that you want to import from a database with two tables, Employees
and Regions. The Employees table contains information about your company’s
employees, including the region they work in, their job category, and their annual sales.
Employees are each assigned a region code (REGION), while those who do not have a
home region get the special code of 0. The Regions table holds a large amount of data
about the areas in which your company operates and prospective markets. It uses a
region code (REGION) to identify the area and provides the average per capita income
for the area, among other things. To relate each employee’s sales to the average income
in the region, you would select the following fields from the Employees table: ID,
REGION, and SALES95. Then, select the following fields from the Regions table:
REGION and AVGINC. Click Next to proceed.

Creating a Relationship between Tables

This dialog box allows you to define the relationships between the tables. If fields from
more than one table are selected, you must define at least one join.
Figure 7-4
Specify Relationships dialog box
Establishing relationships. To create a relationship, drag a field from any table onto the
field to which you want to join it. The Database Capture Wizard draws a join line
between the two fields, indicating their relationship. These fields must be of the same
data type.
Auto Join Tables. If this is selected and if any two fields from the tables you have chosen
have the same name, have the same data type, and are part of their table’s primary key,
a join is automatically generated between these fields.
Specifying join types. If outer joins are supported by your driver, you can specify either
inner joins, left outer joins, or right outer joins. To select the type of join, click the join
line between the fields, and the software displays the Relationship Properties dialog
box.
You can also use the icons in the upper right corner of the dialog box to choose the type
of join.

Relationship Properties

This dialog box allows you to specify which type of relationship joins your tables.
Figure 7-5
Relationship Properties dialog box

Inner joins. An inner join includes only rows where the related fields are equal.
Example. Continuing with our data, suppose that you want to import data for only those
employees who work in a fixed region and for only those regions in which your
company operates. In this case, you would use an inner join, which would exclude
traveling employees and would filter out information about prospective regions in
which you do not currently have a presence.
Completing this would give you a data set that contains the variables ID, REGION,
SALES95, and AVGINC for each employee who worked in a fixed region.
Figure 7-6
Creating an inner join

Outer joins. A left outer join includes all records from the table on the left and only
those records from the table on the right where the related fields are equal. In a right
outer join, this relationship is switched, so that the software imports all records from
the table on the right and only those records from the table on the left where the related
fields are equal.
Example. If you wanted to import data only for those employees who worked in fixed
regions (a subset of the Employees table) but needed information about all of the
regions, a right outer join would be appropriate. This results in a data set that contains
the variables ID, REGION, SALES95, and AVGINC for each employee who worked in
a fixed region, plus data on the remaining regions in which your company does not
currently operate.
Figure 7-7
Creating a right outer join

Limit Retrieved Cases

The Limit Retrieved Cases dialog box allows you to specify the criteria for selecting
subsets of cases (rows). Limiting cases generally consists of filling the criteria grid
with one or more criteria. Criteria consist of two expressions and some relation
between them. They return a value of true, false, or missing for each case.
 If the result is true, the case is selected.
 If the result is false or missing, the case is not selected.
 Most criteria use one or more of the six relational operators (<, >, <=, >=, =, and <>).
 Expressions can include field names, constants, arithmetic operators, numeric and
other functions, and logical variables. You can use fields that you do not plan to
import as variables.
Figure 7-8
Limit Retrieved Cases dialog box

To build your criteria, you need at least two expressions and a relation to connect them.
E To build an expression, put your cursor in an Expression cell. You can type field
names, constants, arithmetic operators, numeric and other functions, and logical
variables. Other methods of putting a field into a criteria cell include double-clicking
on the field in the Fields list, dragging the field from the Fields list, or selecting a field
from the drop-down menu that is available in any active expression cell.

E The two expressions are usually connected by a relational operator, such as = or >. To
choose the relation, put your cursor in the Relation cell and either type in the operator
or select it from the drop-down menu.
To modify our earlier example to retrieve only data about employees who fit into job
categories 1 or 3, create two criteria in the criteria grid and prefix the second criterion
with the connector OR.
Criteria 1: 'EmployeeSales'.'JOBCAT' = 1
Criteria 2: 'EmployeeSales'.'JOBCAT' = 3
Functions. A selection of built-in arithmetic, logical, string, date, and time SQL
functions are provided. You can select a function from the list and drag it into the
expression, or you can enter any valid SQL function. See your database documentation
for valid SQL functions.
Prompt for Value. You can embed a prompt in your query to create a parameter query.
When users run the query, they will be asked to enter information specified here. You
might want to do this if you need to see different views of the same data. For example,
you may want to run the same query to see sales figures for different fiscal quarters.
Place your cursor in any Expression cell, and click this button to create a prompt.

Defining Variables

Variable names and labels. The complete database field (column) name is used as the
variable label. Unless you modify the variable name, the Database Capture Wizard
assigns variable names to each column from the database in one of two ways:
 If the name of the database field (or the first eight characters) forms a valid, unique
variable name, it is used as the variable name.
 If the name of the database field does not form a valid, unique variable name, the
software creates a unique name.
Click on any cell to edit the variable name.
Figure 7-9
Define Variables dialog box

Results

The Results dialog box displays the SQL syntax for your query. You can copy the SQL
syntax onto the clipboard or simply retrieve the data. You can also customize your
query by editing the SQL statement. In either case, you can save the query for use with
other applications by providing a name and path in the Save Query to File panel or by
clicking the Browse button, which lets you specify a name and location using a Save
As dialog box.
Details on the ODBC query syntax (the GET CAPTURE command) can be found
in the following section of this chapter.
Figure 7-10
Results dialog box

ODBC Query Syntax


ODBC queries are executed using AnswerTree’s GET CAPTURE query engine, and
they use the command syntax for that engine. This section contains a summary of the
command syntax for those who want to edit their query commands. Throughout the
section, we refer to SPSS-format conventions. Note that SPSS-format refers to the file
format used by many SPSS products, including AnswerTree—that is, the *.sav file
format.
Syntax Diagram
GET CAPTURE {ODBC }*

[/CONNECT=’connection string’]
[/LOGIN=login] [/PASSWORD=password]
[/SERVER=host] [/DATABASE=database name]†

/SELECT any select statement

* You can import data from any database for which you have an ODBC driver installed.
† Optional subcommands are database specific. See “Syntax Rules” on p. 112 for the subcommand(s) required by
a database type.

Example
GET CAPTURE ODBC
/CONNECT=‘DSN=Sample DBASE files;CollatingSequence=ASCII;’
‘DBQ=C:\CRW; DefaultDir=C:\CRW; Deleted=1;’
‘Driverid=21;Fil=dBaseIII;PageTimeout=600;’
‘Statistics=0;UID=admin;’
/SELECT EMPLOYEE.LASTNAME,EMPLOYEE.FIRSTNAME,EMPLOYEE.ADDRESS,
EMPDATA.DATA FROM {oj EMPLOYEE LEFT OUTER JOIN EMPDATA ON
‘EMPLOYEE’.‘LASTNAME’=‘EMPDATA’.‘LASTNAME’}.

Overview

GET CAPTURE retrieves data from a database and converts them to a format that can
be used by program procedures. It builds a working data file for the current session.

Basic Specification

The basic specification is one of the subcommands specifying the database type
followed by the SELECT subcommand and any SQL select statement.

Subcommand Order

The subcommand specifying the type of database must be the first specification. The
SELECT subcommand must be the last.
Syntax Rules
 Only one subcommand specifying the database type can be used.
 The CONNECT subcommand must be specified if you use the Microsoft ODBC
(Open Database Connectivity) driver.

Operations
 GET CAPTURE retrieves the data specified on SELECT.
 The variables are in the same order in which they are specified on the SELECT
subcommand.
 The data definition information captured from the database is stored in the working
data file dictionary.

Limitations
 A maximum of approximately 3800 characters can be specified on the SELECT
subcommand. This translates to 76 lines of 50 characters each. Characters beyond the
limit are ignored.

CONNECT Subcommand

CONNECT is required to access any database that has an installed Microsoft ODBC
driver.
 You cannot specify the connection string directly in the syntax window, but you
can paste it with the rest of the command from the Results dialog box, which is the
last of the series of dialog boxes opened with the Database Capture command on
the File menu.

SELECT Subcommand

SELECT specifies any SQL select statement accepted by the database you access. With
ODBC, you can select columns from more than one related table in an ODBC data
source using either an inner join or an outer join.
Example
GET CAPTURE ODBC
/CONNECT=‘DSN=Sample DBASE files;CollatingSequence=ASCII;’
‘DBQ=C:\CRW; DefaultDir=C:\CRW; Deleted=1;’
‘Driverid=21;Fil=dBaseIII;PageTimeout=600;’
‘Statistics=0;UID=admin;’
/SELECT EMPLOYEE.LASTNAME,EMPLOYEE.FIRSTNAME,EMPLOYEE.ADDRESS,
EMPDATA.DATA FROM {oj EMPLOYEE LEFT OUTER JOIN EMPDATA ON
‘EMPLOYEE’.‘LASTNAME’=‘EMPDATA’.‘LASTNAME’}.
 This example retrieves data from two related tables in a dBASE III database.
 The SQL select statement retrieves employees’ first names, last names, and
addresses from the EMPLOYEE table; if an employee’s last name also appears in the
EMPDATA table, the DATA column from that table is retrieved as well.
 GET CAPTURE converts the data to a format used by program procedures and
builds a working data file.

Data Conversion

GET CAPTURE converts variable names, labels, missing values, and data types,
wherever necessary, to a format that conforms to SPSS-format conventions.

Variable Names and Labels

Database columns are read as variables.


 A column name is converted to a variable name if it conforms to SPSS-format
naming conventions and is different from all other names created for the working
data file. If not, GET CAPTURE gives the column a name formed from the first few
letters of the column and its column number. If this is not possible, the letters COL
followed by the column number are used. For example, the seventh column
specified in the select statement could be COL7.
 GET CAPTURE labels each variable with its full column name specified in the
original database.

Missing Values

Null values in the database are transformed into the system-missing value in numeric
variables or into blanks in string variables.
Example

GET CAPTURE ORACLE


/LOGIN=SCOTT /PASSWORD=TIGER
/SELECT EMP.ENAME, EMP.JOB,
EMP.HIREDATE, EMP.SAL,
DEPT.DNAME
FROM EMP, DEPT, DEPT_SAL
WHERE EMP.DEPTNO = DEPT.DEPTNO
AND EMP.DEPTNO = DEPT_SAL.DEPTNO
AND EMP.SAL = DEPT_SAL.HISAL.

 The indentation of the select statement illustrates how GET CAPTURE considers
everything after the word SELECT to be part of the database statement. These lines
are passed directly to the database, including all spaces and punctuation, except for
the command terminator (.).
Chapter 8

Iris Flower Classification Example

The Iris data set (Fisher, 1936) is perhaps the best-known data set found in the
classification literature. Although the classification task Fisher addresses is relatively
simple, which is visually apparent when you look at a scatterplot of the data, his paper
is a classic in the field and is referenced frequently. Using the Iris data set, this
example demonstrates the performance of AnswerTree and provides a way to easily
compare classification performance with other methods and competitive products.

Analysis Goal
We want to predict the species of iris based on four physical measurements. The
algorithms used are C&RT and QUEST.

The Data
The data file for this example is IRIS.SAV. The file contains four continuous
measurement variables on each observation and a classification variable, species.

Creating the Tree


After opening a new project file, we need to create two separate trees—one using
C&RT and the other using QUEST. For each, select species as the target variable and
petal length, petal width, sepal length, and sepal width as predictors. Since we are
using a small data set, we need to adjust the stopping rules. For each growing method,
in the wizard’s final panel, open the Advanced Options dialog box. For the minimum
number of cases, specify 25 for the parent node and 1 for the child node.
The root node for both C&RT and QUEST models looks the same. It represents an
enumeration of the target variable, species.
Figure 8-1
Root node for Iris problem

After creating the two separate root nodes, we will grow the two trees automatically,
using the Tree menu’s Grow Tree option. The tree maps show the overall structure of
the trees.

Figure 8-2
C&RT tree

Figure 8-3
QUEST tree

As is frequently the case, the trees generated by different growing methods are similar
but not identical. The basic classification story is fairly simple—one species can be
differentiated on the basis of a single measurement (node 1 in both trees), and the other
two species use additional measurement information.

C&RT User-Generated Tree

Grow the third tree in the project file using the same model criteria as the first two.
Because the following procedures are executed identically in C&RT and QUEST, we
will demonstrate using C&RT.
Instead of using Grow Tree from the Tree menu, choose Grow Tree One Level. The
results of growing the C&RT root node to a tree with one level are shown below.
Figure 8-4
C&RT tree grown one level

Petal length is chosen to split the root node, and the split is made at a petal length value
of 2.450. All cases with petal length values less than or equal to 2.450 are sent to node 1;
all observations with petal lengths greater than 2.450 are sent to node 2.

The C&RT algorithm reports the relative importance of a node split by using the
decrease in impurity, or improvement, as an evaluation criterion. In this example, we
use the default Gini impurity measure. In the first split of the Iris tree, the improvement
is reported as 0.3333. This means that the impurity of the two child nodes that result
from the split was 0.3333 less than the impurity of the root node. Node 1 is composed
entirely of one species (setosa) and contains all of the cases of that species. Node 2
contains the remaining 100 observations, which include all of the versicolor and
virginica irises.
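To see where the 0.3333 figure comes from, recall that the Gini impurity of a node is 1
minus the sum of the squared class proportions in the node, and that the improvement is
the parent node's impurity minus the case-weighted average of the child nodes'
impurities. For this split (the Iris sample contains 150 cases, 50 of each species):
Gini(root) = 1 – (1/3)² – (1/3)² – (1/3)² = 0.6667
Gini(node 1) = 1 – 1² = 0 (pure setosa)
Gini(node 2) = 1 – (0.5)² – (0.5)² = 0.5
Improvement = 0.6667 – [(50 ⁄ 150) × 0 + (100 ⁄ 150) × 0.5] = 0.3333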
Figure 8-5
C&RT tree grown two levels

After growing the Iris decision tree to two levels, we can see that node 1 has been
defined as a terminal node. It is not possible to split this node and improve the
performance of the tree.

Node 2 is split using the petal width variable, and the improvement is reported as
0.2598. The two child nodes of node 2 roughly describe the two remaining species of
iris. Node 3 subsumes most of the versicolor irises, while node 4 contains most of the
virginica irises.

Priors
Suppose that our sample misrepresents the frequencies with which the iris species occur
in the population of interest. In cases like this, we can adjust the prior distribution of the
data by explicitly specifying the prior probabilities. This set of prior probabilities tells
AnswerTree to expect each class with the assigned probability. Explicit priors can
be set using the Advanced Options dialog box. From the menus choose:
Analysis
Advanced Options...
Set the priors at 0.2 for setosa, 0.3 for versicolor, and 0.5 for virginica. Note that the
tree will be regrown.
Figure 8-6
Setting explicit priors

Compare the tree structure using explicit priors with the tree for which priors were set
according to the empirical distribution of the species in the data shown above.
Figure 8-7
C&RT tree with explicit priors

There are obvious substantive differences in the tree grown with adjusted priors. In this
example, we are trying very hard not to misclassify virginica, because according to the
explicit priors, we are observing less of this species than we should in our data. The
opposite can be said for setosa and versicolor.
The resubstitution risk estimate for the tree with custom priors is 0.036, and its
standard error is 0.013787. Relative to the tree with equal priors, we have reduced the
misclassification of virginica. The trade-off is that we have misclassified more
versicolor as virginica.

Figure 8-8
Risk summary using explicit priors

Discussion of Results
The Iris data set was used to show two methods for growing classification trees. C&RT
and QUEST produced similar but not identical decision rules and similar risk
performance in this example. In both cases, petal length perfectly identifies one species,
and petal width does a good job of distinguishing between the other two species. Using
explicit priors gives a somewhat different tree, which minimizes misclassification for
categories with large prior probabilities.
Chapter 9
Credit Scoring Example

Background
One of the most important applications of classification methods is in credit scoring,
or making decisions about who is likely to repay a loan and who is not. A tree-based
approach to the problem of credit scoring has some attractive features:
 It allows you to identify homogeneous groups with high or low risk.
 It makes it easy to construct rules for making predictions about individual cases.

Analysis Goal
We want to be able to categorize credit applicants according to whether or not they
represent a reasonable credit risk, based on the information available.

The Data
The data file for this example is CREDIT.SAV. The file contains a target variable,
Credit ranking (good/bad), and four predictor variables: Age Categorical (young,
middle, old), Has AMEX card (yes/no), Paid Weekly/Monthly (weekly pay/monthly
salary), and Social Class (management, professional, clerical, skilled, unskilled).
Data were collected for 323 cases. Because all variables are categorical, we will begin
with the CHAID method of growing the tree.


Creating the CHAID Tree


After defining the target and predictor variables, we must adjust the stopping rules. In
Advanced Options, set the minimum number of cases to 25 for the parent node and to
1 for the child node.
Figure 9-1
Stopping rules for credit problem

Click OK to return to the wizard, and then click Finish to grow the root node.

Figure 9-2
Root node for credit problem

The root node simply shows the breakdown of cases for the entire sample. In this data
set, the cases are nearly equally distributed between good credit risks (47.99%) and bad
credit risks (52.01%). The risk estimate also reflects this, showing that assigning all
cases to the majority class (bad credit risk) results in a 47.99% error rate.
Figure 9-3
Risk summary for root node

Growing the CHAID Tree


Now that we have defined a root node, we can begin splitting the data to create
subgroups with desirable properties. Right-click the root node and choose Grow Tree
One Level. This will split the root node into two child nodes based on the variable
Paid Weekly/Monthly.
Figure 9-4
First split for credit problem

Notice that the majority of cases in the weekly pay group have low credit rankings,
while those in the monthly salary group are more likely to have high credit rankings.
Based on this variable alone, we can greatly improve our ability to distinguish good
credit risks from bad ones.
Examining the risk estimate reinforces this conclusion. The risk estimate with one
split is 0.1455, indicating that if we use the decision rule based on the current tree, we
will classify 100% – 14.55% = 85.45% of the cases correctly. So, a little bit of
information goes a long way in this case.

Figure 9-5
Risk estimate for first split

The results are encouraging. Let’s see if we can do even better with a little more
information. Right-click on the root node and select Grow Tree One Level again. Now
we see that each wage group is split by age.
Figure 9-6
Second split for credit problem

Examining the left branch first, we see that of those who are paid weekly, old
applicants (> 35 years) tend to have good credit—all of the cases in this example fit
this profile. On the other hand, young (< 25) and middle (25–35) applicants are much
more likely to have poor credit.
The right branch looks a bit different. It remains evident that older applicants are
more creditworthy than younger ones. However, for applicants who are paid monthly,
middle-aged people are grouped with the more creditworthy old group rather than with
the less creditworthy young group. Again, we see that old applicants within this group
almost all have high credit rankings. The younger group is more heterogeneous than
we saw in the other branch, however. Cases are evenly split in the young group—about
half of the applicants in this group have good credit and half have bad credit.
Perhaps by adding one more piece of information, we can identify the difference
between good and bad credit risks within this subgroup. Right-click that node and
choose Grow Branch One Level.
Figure 9-7
Split of node 5

The next piece of useful information is Social Class. The categories of this variable are
grouped into two subsets: managerial and clerical in one branch and professional in the
other. (There were no skilled or unskilled workers in this node.) The managerial and
clerical node cases all have good credit, whereas the professional node has both good
and bad credit risks (41.46% and 58.54%, respectively).

Evaluating the Tree


Let’s see how our tree does in classifying cases. Again, we look at the risk summary to
find out what proportion of cases are incorrectly classified.
Figure 9-8
Risk summary for tree with three levels

The risk summary shows that the current tree classifies almost 90% of the cases
accurately. Also, the misclassification matrix shows exactly what types of errors are
being made. The diagonal elements (upper left and lower right) of the table represent
the correct classifications. The off-diagonal elements (lower left and upper right)
represent the misclassifications. For this tree, notice that we almost never classify a
person with bad credit as having good credit. However, in 32 cases where people have
good credit, we classify them as having bad credit.

Interpreting the Results


The gain summary can also provide useful insight into the tree. The gain summary
shows which nodes have the highest and lowest proportions of a target category within
the node. In this case, we want to know which subsets of applicants (nodes) are most
likely to be good credit risks. We will look at the gain summary for nodes, with
percentages for the good credit target category.
Figure 9-9
Gain summary for credit problem

The first column gives the node number, which corresponds to the numbers found in
the tree map. For example, node 4 corresponds to applicants who are paid weekly and
are over 35 years old. The next two columns show the number of cases in the node and
the percentage of all cases that are in the node. The following columns present the
number of cases with the target response and the percentage of all of the target
responses that are in this node. For this example, that represents the number of people
in the node with good credit and the percentage of all of the people with good credit
who fall in this node. The Gain column indicates the proportion of cases in the node
that have the target response (good credit), and the Index column gives a measure of
how the proportion of target responses in this node compares to that for the entire sample.
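In other words, the index is simply the node's gain expressed as a percentage of the
corresponding figure for the whole sample:
Index = (gain for node ⁄ gain for entire sample) × 100%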
For the credit problem, both node 4 (paid weekly/over 35) and node 7 (paid
monthly/under 25/managerial or clerical) have no cases with bad credit, so they both
show a gain value of 100%. Since this is just over twice the percentage of good credit
cases in the entire sample (47.99%), the gain index is 208.3871%. Clearly, these are the
cases we would want to seek out. Conversely, node 3 (paid weekly/under 35) has the
lowest proportion of good credit risks. As a lender, you would have sound reason to
avoid lending to applicants who fit this profile.
Chapter 10
Housing Value Example

Background
In this example, we will use the C&RT algorithm to build a regression tree for
predicting the median housing value for census tracts in and around Boston. This will
allow us to evaluate the effects of various characteristics on property values in the
region. The data were originally reported in Harrison and Rubinfeld (1978) in a study
of the effects of air pollution on property values.

Analysis Goal
We want to be able to evaluate the effects of various environmental, economic, and
social factors on housing values in this urban area.

The Data
The data file for this example is HOUSING.SAV. The file contains a target variable,
Median value of owner-occ homes (defined as continuous), and 13 predictor
variables:
 Per capita crime rate (continuous)
 Proportion of residential land zoned 25K+ (continuous)
 Proportion of non-retail bus acres per town (continuous)

 Charles River connection dummy (nominal, 0 = adjacent to river, 1 = not adjacent to river)
 Nitric oxides concentration pp 10M (continuous)
 Average # of rooms per dwelling (continuous)
 Proportion of owner-occ dwellings before 1940 (continuous)
 Wtd dist to five Boston employment ctrs (continuous)
 Access index to radial hwys (continuous)
 Full-value prop-tax rate per $10K (continuous)
 Pupil-teacher ratio by town (continuous)
 Proportion of blacks per town—transformed (continuous)
 % lower status of the population (continuous)
Data were collected for 506 census tracts. We will build a C&RT model for the
continuous target variable.

Creating the C&RT Tree


For the first tree, use the C&RT growing method. Select Median value of owner-occ
homes as the target variable and all others as predictor variables. Before we grow the
tree, we must specify some stopping rules. When prompted, open the Advanced
Options dialog box. Under Minimum Number of Cases on the Stopping Rules tab,
specify 25 for the parent node and 1 for the child node. Specify 1.0 for the minimum
change in impurity.

Figure 10-1
Stopping rules for C&RT tree

After specifying the advanced options, click Finish to see the root node in the Tree
window.
Figure 10-2
Root node for housing data

The root node simply shows the mean value of the target variable for the entire sample.
In this data set, the average for the target variable is 22.53, which means that the
average of the median housing values for the sample of census tracts is approximately
$22,530.

Growing the C&RT Tree


Now that we have defined a root node, we can begin splitting the data to identify the
most salient variables. From the menus choose:
Tree
Grow Tree
This produces a tree with nine terminal nodes, as we can see in the tree map.
Figure 10-3
Tree map of C&RT tree for housing problem

Evaluating the Tree


We can see how well the automatically grown tree does at predicting the median
property value by examining the risk summary for the tree.
Figure 10-4
Risk summary for tree

The risk estimate here is simply the within-node variance. Remember that the total
variance equals the within-node (error) variance plus the between-node (explained)
variance. The within-node variance here is 12.5322, while the total variance is 84.4196
(the risk estimate for the tree with only one node). The proportion of variance due to
error is 12.5322 ⁄ 84.4196 = 0.1485. Thus, the proportion of variance explained by
the model is 100% – 14.85% = 85.15%. There is still some residual variance, but the
amount we can account for using the model is enough to convince us that we have
captured the most important variables, and that we can probably trust our conclusions.

Examining Variable Splits

In this example, we are interested in identifying variables that play important roles in
explaining housing values. In particular, we are interested in discovering whether, or
for which subgroups, pollution levels affect property values. We can find this
information in the tree itself.

Figure 10-5
Tree indicating splits based on average number of rooms

The first split is made on the average number of rooms. This indicates that the average
number of rooms is the most important determining factor (of the factors we measured)
for the median housing value of a census tract. This split gives an improvement of
38.2205, reducing the within-node variance by almost half. Areas with an average
number of rooms greater than 6.941 are subsequently split by the number of rooms
again, with tracts having an average number of rooms between 6.941 and 7.437 in one
node and an average number of rooms greater than 7.437 in the other.

Figure 10-6
Second-level split based on % lower status

For tracts where the average number of rooms is less than 6.941, the next split is
based on % lower status of the population. Areas with less than 14.4% lower-status
residents had higher values than other areas where the percentage of lower-status
residents exceeded 14.4%. These nodes are then split further, based on the distance
to Boston employment centers on the left branch and on the per capita crime rate on
the right branch.
Since we are particularly interested in the effect of pollution on housing values, let’s
examine the portion of the tree where pollution (nitric oxides concentration) becomes
a predictor.

Figure 10-7
Effect of pollution on housing values

It appears at first glance that for tracts with an average number of rooms between 6.9
and 7.4, nitric oxides concentration has an effect on housing values. This result conflicts
with a previously published tree-based analysis of these data (Breiman et al., 1984),
which splits the node based on Per capita crime rate. To investigate this discrepancy, we
can examine the surrogates for the node to see how Per capita crime rate compares to
Nitric oxides concentration as a predictor at this particular node in the model. To show
statistics for surrogates, select the node and from the menus choose:
Tree
Select Surrogate...

Figure 10-8
Surrogates for nitric oxides concentration

Notice that the first surrogate is Per capita crime rate and that its improvement statistic,
1.9900, is equal to the improvement for the split based on Nitric oxides concentration.
Furthermore, the association statistic for Per capita crime rate is 1.0, indicating that in
the context of this node, it is essentially equivalent to the selected predictor. The choice
between these two variables to split this node is arbitrary. The fact that the two analyses
selected different predictors was probably an accident of the order in which variables
were specified in the analysis.
While this redundancy essentially foils our attempt to determine how pollution
influences housing values, it does raise some interesting new questions. What is the
nature of the relationship between the crime rate and pollution? Which of the two
variables (if either) has a causal effect on housing values? How are these variables
related in other portions of the sample (that is, for other nodes)? Further investigation
using new data and other statistical techniques would be necessary to answer these
questions.
Chapter 11
Wage Prediction Example

Background
In this example, we will use the C&RT algorithm to build a regression tree for
predicting wages for a random selection of workers in the United States in two
separate years: 1978 and 1985. The two time periods will allow us to determine
whether the effects change over time. Various worker characteristics were recorded
along with wage levels. Data were extracted from the Current Population Survey
(CPS) of May 1978 and May 1985, published by the U.S. Department of Commerce.
These data were also analyzed by Berndt (1991).

Analysis Goal
We want to be able to evaluate the effects of various social, economic, and
demographic variables on wage levels at two separate points in time.

The Data
The data file for this example is WAGES.SAV. The file contains the target variable,
Log of avg hourly earnings (defined as continuous), and 20 predictor variables:
 Education (years) (continuous)
 Lives in south (nominal, yes or no)
 Nonwhite (nominal, yes or no)


 Hispanic (nominal, yes or no)


 Gender (nominal, male or female)
 Married with spouse present (nominal, yes or no)
 Married female with spouse present (nominal, yes or no)
 Years of labor market experience (AGE–ED–6) (continuous)
 Squared years of labor market exper (continuous)
 Union job (nominal, yes or no)
 Age in years (continuous)
 Manufacturing worker (nominal, yes or no)
 Construction worker (nominal, yes or no)
 Managerial or administrative worker (nominal, yes or no)
 Sales worker (nominal, yes or no)
 Clerical worker (nominal, yes or no)
 Service worker (nominal, yes or no)
 Professional worker (nominal, yes or no)
 Time sampled (nominal, 1978 or 1985)
Data were collected for 550 workers in 1978 and 535 workers in 1985. We will build
a C&RT model for the Log of avg hourly earnings variable. (The logarithm of average
hourly wages is used because the distribution of wages tends to be positively skewed.
The natural logarithm transformation compensates for this, yielding an approximately
normally distributed variable.)

Creating the C&RT Tree


For the first tree, select C&RT for the growing method. Select Log of avg hourly
earnings as the target variable and all others as predictor variables. Next, we must
specify some stopping rules. When prompted, open the Advanced Options dialog box
and make sure the Stopping Rules tab is on top. Since we have a moderately sized
sample, under Minimum Number of Cases, specify 25 for the parent node and 1 for the
child node. Each split should account for at least 1% of the variance in the target
variable (the standard deviation of the target is about 0.54, so its variance is about
0.54² = 0.2916, and 1% of that is roughly 0.003), so specify 0.003 for the minimum
change in impurity.

Figure 11-1
Stopping rules for the C&RT tree

After specifying the advanced options, click OK, and then in the wizard click Finish to
see the root node in the Tree window.
Figure 11-2
Root node for wage data

The root node simply shows the mean value of the target variable for the entire sample.
In this data set, the average for the target variable is 1.87.
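Because the target variable is the natural logarithm of hourly earnings, this value
corresponds to hourly earnings of roughly exp(1.87) ≈ $6.49 (strictly speaking, this
back-transformed figure is the geometric mean of the hourly wages rather than their
arithmetic mean).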

Growing the C&RT Tree


Now that we’ve defined a root node, we can begin splitting the data to identify the most
salient variables. From the menus choose:
Tree
Grow Tree
This produces a tree with 10 terminal nodes, as we can see in the tree map.
Figure 11-3
Tree map of C&RT tree for wage problem

Evaluating the Tree


We can see how well the automatically grown tree does at predicting wages by
examining the risk summary for the tree.

Figure 11-4
Risk summary for tree

The risk estimate here is simply the within-node variance. Remember that the total
variance equals the within-node (error) variance plus the between-node (explained)
variance. The within-node variance here is 0.1860, while the total variance is 0.2944
(the risk estimate for the tree with only one node). The proportion of variance due to
error is 0.1860 ⁄ 0.2944 = 0.6318. Thus, the proportion of variance explained by the
model is 100% – 63.18% = 36.82%. Clearly, the ability to capture variation in wages
using the current tree model is less than optimal. However, the conclusions we draw
from the tree may be of some use in constructing a more detailed parametric model for
these data.

Examining Variable Splits

In this example, we are interested in identifying variables that play important roles in
explaining wage levels. We are also interested in whether or not different factors are
important at the two different time points. We can find this information in the tree itself.
In some cases, it is instructive to build the tree one split at a time. To revert to the
root node, select it in the tree, and from the menus choose:
Tree
Remove Branch

To make the first split, with the root node selected choose:
Tree
Grow Tree One Level
Figure 11-5
First split of tree

The first split is made on the time sampled, with workers polled in 1985 reporting
higher wages than those polled in 1978. This is not at all surprising—the average wage
values are given in unadjusted dollars, so the difference is probably due to inflation over
the seven-year period. The interesting question will be whether the subtrees look similar
for the two time points. Let’s examine the next level of the tree. From the menus choose:
Tree
Grow Tree One Level

Figure 11-6
Tree with second-level splits

Notice that the two nodes representing the two time points are split on different
predictors. For the 1978 cases, splitting on gender gives the best reduction of error,
whereas for the 1985 data, the best split is based on years of education. It appears that
there were indeed some changes in the determinants of wage levels (or at least their
relative importance) during that seven-year time interval.

Let’s go ahead and finish growing the tree and then consider the subtree for each
time point separately. From the menus choose:
Tree
Grow Tree
Figure 11-7
Subtree for workers polled in 1978

In the first round of workers polled, gender provided the best second-level split, as we
saw previously. We again see differences in the two subtrees: for males, age proves the
best predictor, while union representation is most prominent for women. For older
males, education level also adds useful information to the model.

Figure 11-8
Subtree for workers polled in 1985

In this branch, the first split for the 1985 data is based on years of education, and is in
the expected direction (more education predicts higher wages). For highly educated
workers (> 13.0 years), age becomes the next significant factor. For workers with less
education, the next significant factor is union representation, with union members
earning more than non-union workers. Finally, for those non-union workers, age is the
next predictor of wage levels, again in the expected direction (older workers earn more
than younger ones).

Interpreting the Results


In this example, we show that two groups of data collected at two time points yield
different models for predicting the same target variable from the same predictors. The
natural conclusion is that there are differences in the determinants of average wages
from one time point to the other. We must be careful not to overinterpret these results,
however, since the model presented here accounts for a relatively small portion of the
variance observed in average wages. It may be the case that some crucial factor has
been overlooked, or that the loss of information due to binary splitting of continuous
predictor variables reduces the accuracy of the model. In any case, further investigation
is in order.
One possible route would be to use the results of this analysis to construct a
parametric (linear) regression model. The tree structure can be used to determine the
factors and interactions that should be considered for inclusion in such a model. For
example, because the subtrees for 1978 and 1985 are so different, it would be
reasonable to begin by building separate parametric models for the two data sets. For
the 1978 data, we would want to include gender (as a dummy-coded variable), age,
union job (dummy-coded), education, and also a term for the interaction between age
and education, since the effect of education seems to depend on age. Similarly, for the
1985 data, we would want to include education, union job (dummy-coded), age, and a
term for the age by union interaction. Such models might give better accuracy of
prediction because they are able to use all of the information contained in the
continuous variables for age and education.
Chapter 12
Market Segmentation Example

Background
In designing a marketing program, one of the key points is to identify likely buyers
and focus your sales efforts on them. Targeting the most profitable segments of your
market can help you get the best return on investment for your sales resources.
Identifying more or less profitable subgroups of your market is called market
segmentation. A tree-based approach to the problem of market segmentation gives
you clear decision rules for identifying prospects as more or less likely to buy, as well
as estimates of how much profit or loss you can expect by marketing to particular
subgroups.

Analysis Goal
We want to be able to categorize direct mail targets according to whether or not they
are likely to purchase products.

The Data
The data file for this example is SUBSCRIB.SAV. These data were used in the original
SPSS CHAID manual (Magidson, 1993). The file contains two target variables,
Dichotomous response (respondent or nonrespondent) and Response to sweepstakes
promotion (paid respondent, unpaid respondent, nonrespondent), and seven predictor
variables:


 Age of household head (18–24, 25–34, 35–44, 45–54, 55–64, 65+, unknown)
 Sex of household head (male or female)
 Children in household? (yes or no)
 Household income (<$8,000, $8,000–$9,999, $10,000–$14,999,
$15,000–$19,999, $20,000–$24,999, $25,000–$34,999, $35,000–$49,999,
$50,000+)
 Bankcard in household? (yes or no)
 Number of persons in household (1, 2, 3, 4, 5 or more, unknown)
 Occupation of household head (white collar, blue collar, other, unknown)
In addition, one frequency variable (FREQ) is specified. For each combination of the
values of the other variables, the frequency variable indicates how many individual
cases in the sample fit that profile.
Data were collected for 81,040 cases. Because all variables are categorical, we will
use the CHAID method for growing our tree. Three of the variables, the two target
variables (Dichotomous response and Response to sweepstakes promotion) and
Occupation of household head, should be defined as nominal. All other variables
should be defined as ordinal. (To change the definition of a variable type, select the
variable in the New Tree dialog box, and then right-click and select the appropriate
type from the context menu.)

Creating the CHAID Tree


For the first tree, choose CHAID as the growing method. Select Dichotomous response
as the target variable, FREQ as the frequency variable, and all others except Response
to sweepstakes promotion as predictor variables. Now we must redefine some of the
growing criteria. We want to use the likelihood-ratio chi-square statistic for building
the tree. Change this setting on the CHAID tab of the Advanced Options dialog box.

Figure 12-1
Advanced Options: CHAID

After specifying the advanced options, click Finish to see the root node in the Tree
window.
Figure 12-2
Root node for market segmentation problem

The root node simply shows the breakdown of cases for the entire sample. In this data
set, most cases (98.85%) are nonrespondents.

Growing the CHAID Tree


Now that we have defined a root node, we can begin splitting the data to create
subgroups with desirable properties. From the menus choose:
Tree
Grow Tree
The resulting tree is shown below.
Figure 12-3
Resulting tree with tree map

The first split is based on size of household. It seems that the more people there are in
a house, the more likely that house is to respond to the promotion. Cases with missing
values for household size were least likely to have responded to the mailing.
Notice that certain categories are grouped together. For example, households with
two persons are grouped with households having three persons. AnswerTree
automatically combines categories when there is no statistical distinction between
them. That is, the merged categories are basically equivalent from a statistical
perspective.
Now look at the node with number of persons in household equal to two or three
(node 2 in the tree map). This node is further broken down by age, where older persons
(65+) are less likely to respond than others. Likewise, households of unknown size
(node 4 in the tree map) are broken down by sex of household head, with women being
more likely to have responded than men.
At the second level, medium-sized households where the head of the household is
less than 65 years of age (node 5 in the tree map) can be divided by the presence of
bankcards in the household. Households with bankcards are more likely to have
responded than other households.

Evaluating the Tree


There are no nodes where responders outnumber nonresponders. Because our sample
is so dominated by nonresponders, the computer predicts nonresponse for every node.
This limits the usefulness of the risk summary for evaluating the tree.

Figure 12-4
Risk summary for tree with three levels

Notice how the first row of the misclassification table shows all zeros. That is because
the tree never predicts any case to be a responder. Of course, the classifier is correct
98.85% of the time, but its predictions are not very useful for distinguishing good
prospects from bad. To identify segments of interest (that is, segments with a relatively
high probability of response), we need to examine the gains chart.

Using the Gains Chart

The gains chart shows the nodes sorted by the proportion of cases in the target category
for each node. The default is to make the target category the last category in the list. In
this case, however, we want the first category (responders) to be the target category. To
specify this change, select the Gains tab in the Tree window, and then from the menus
choose:
Format
Gains
Select the desired category—in this case, Respondent—under Percentage of cases in
target category.

Figure 12-5
Gain Summary dialog box

Now the gain summary will reflect the nodes that have the highest probability of
response to the promotion. There are two parts to the gains chart: node-by-node
statistics and cumulative statistics. Let’s look at the node-by-node statistics first.
Figure 12-6
Gains chart (node-by-node statistics)

Nodes are sorted by gain score from highest to lowest. The first node in the table,
node 9, contains 46 responders out of 1979 cases, or a 2.3244% response rate. For this
type of gains chart, with a categorical target variable, the gain score equals the
percentage of cases with the target category—in this case, respondents—for the node.
The index score shows how the proportion of respondents for this particular node
compares to the overall proportion of respondents. For node 9, the index score is about
202%, meaning that the proportion of respondents for this node is over twice the
response rate for the overall sample.
The second row shows statistics for node 3. This node has a response rate of
1.9200%, with an index score of about 167%. This group, or segment, also has an
appreciably higher proportion of respondents than the overall sample, although the
advantage is not as great as with the first node in the list. The pattern continues down
the table, with each subsequent node having a lower proportion of respondents. Look
at the fourth row, which describes node 1. The index score for this node is only about
95%. Whenever the index score is less than 100%, it means that the corresponding
node has a lower response rate than the overall sample. Between the third and fourth
rows (node 10 and node 1) is the crossover point, where we go from “winning nodes”
to “losing nodes.”
Figure 12-7
Gains chart (cumulative statistics)

Now let’s take a look at the cumulative statistics. The cumulative statistics can show us
how well we do at finding responders by taking the best segments of the market. If we
take only the best node (node 9), we reach 4.94% of responders by targeting only
2.44% of the market. If we include the next best node (node 3) as well, then we get
17.72% of the responders from only 10.09% of the market. Including the next node
(node 10) increases those values to 36.31% of responders from 23.87% of the sample.
At this stage, we are at the crossover point described above, where we start to see
diminishing returns. Notice what happens if we include the next node (node 1)—we
get 65.95% of responders, but we must contact 55.19% of the sample to get them.
The gains chart can give you valuable information about which segments to target
and which to avoid. Of course, you will need to make some decisions about how many
segments to target. You might base the decision on the number of prospects you want,
the desired response rate for the target market, or the desired proportion of all potential
responders you want to contact. In this example, suppose we want an estimated
response rate of at least 2%. To achieve this, we would target the first two nodes, nodes
9 and 3.

Identifying Segment Characteristics

Now that we have identified the nodes that define our target segments, what criteria do
we use to determine whether a new case fits into one of these segments? The answer
can be found in the Rules view. To see the characteristics that define a node, select the
node and then select the Rules tab in the Tree window. For example, to see what
defines the node with the highest response rate (node 9), select that node in either the
tree map, the Tree view of the Tree window, or the gains chart, and then select the
Rules tab.
Figure 12-8
Rules for node 9 (selecting cases)

In addition to showing rules for selecting cases that belong to certain segments, you
can display rules that assign values to those cases. To see the values assigned to node 9
consumers, select the Rules tab, and then from the menus choose:
Format
Rules
Under Generate Syntax For, select Assigning values. The new rule for node 9
indicates that the cases are assigned a node label (nod_001 = 9), a predicted outcome
(pre_001 = 2, corresponding to Nonrespondent), and the estimated probability that
the predicted outcome is correct (prb_001 = 0.977).

Figure 12-9
Rules for node 9 (assigning values)

The rules from the Rules window can be exported for use with other programs in
selecting targeted segments or adding tree-derived information to the records. Rules
can be generated in three formats: SQL (shown above), SPSS, or decision. SQL rules
can be used with any database engine that understands SQL commands. SPSS rules can
be used with SPSS or NewView to update your data files or select cases. Decision rules
are simply structured descriptions of the node(s), which are suitable for use in reports
or presentations. With decision-formatted rules, you have the option of using variable
labels and/or value labels in the rules, which can make the rules easier to understand,
especially for others not familiar with the structure of your data set.

Applying Profits to the Model


So far we have been building our model to identify consumers most likely to respond
to our promotional mailing. Of course, the goal of this is to maximize our revenue by
getting a high return on investment. We can make this ultimate goal explicit by
specifying profits for each of the outcomes. Gain scores can then be expressed as
profits, rather than probabilities of response.

For this example, we will use the other target variable, Response to sweepstakes
promotion (paid respondent, unpaid respondent, nonrespondent), defined as nominal.
To build a new tree, from the menus choose:
File
New Tree
In the model dialog box, set Response to sweepstakes promotion as the target variable
and the other variables (except for Dichotomous response, of course) as predictors. A
new root node will be created, this time with three categories instead of two. We will
again use the likelihood-ratio chi-square statistic, so specify this on the CHAID tab of
the Advanced Options dialog box, if necessary.
Figure 12-10
Tree with Response to Sweepstakes Promotion as target

To specify profit values, from the menus choose:


Analysis
Profits
In this example, the profits are specified as follows:
 Paid respondent: $35 profit from subscription
 Unpaid respondent: –$7 profit (or $7 loss) due to cost of introductory issue and
follow-up
 Nonrespondent: –$0.15 profit (or $0.15 loss) for cost of mailing
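With these values, the gain score reported for a node is, roughly speaking, the expected
profit per case in that node. For example, a hypothetical node containing 3% paid
respondents, 5% unpaid respondents, and 92% nonrespondents would show an average
profit of about 0.03 × $35 + 0.05 × (–$7) + 0.92 × (–$0.15) = $0.56 per case.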

Figure 12-11
Define Profits dialog box

Now, let’s grow the tree. From the menus choose:


Tree
Grow Tree
The resulting tree contains three levels with 14 terminal nodes. The entire tree is too
large to view on the screen, but the basic structure of the tree can be seen in the tree
map. The first split is based on the number of persons in the household, as with the
previous model. In this model, however, the values for this variable are grouped
differently, reflecting the different set of categories for the target variable. The other
splits can be examined in the Tree view as well.
Figure 12-12
Tree map of grown tree using profit values

The real story, however, is told by the gains chart. The gains chart can be configured to
display average profit values for nodes. To set this option, select the Gains tab in the
Tree window, and then from the menus choose:
Format
Gains
and under Gain Column Contents, select Average profit.
Figure 12-13
Gains chart with profit values

The node with the greatest profit is node 17, with an average profit of $0.69. If we
consider the cases that make up this node—households with two people, head of
household 55–64 years of age, and having a high income—we see a pattern, which we
might call “empty nesters.”
Notice that several nodes have negative gain scores. A negative score indicates that,
on the average, you lose money by targeting that group. Of course, you will want to
divert resources away from those segments to the more profitable ones at the top of the
gains chart. If you simply want to ensure profitability, target the nodes with positive
gain scores. If you have a specific profit margin in mind that you want to exceed, the
cumulative statistics in the gains chart can tell you how many nodes to keep. For
example, if you want to ensure at least $0.20 average profit per household, you should
target nodes 17, 18, 14, 10, and 15, which together give an average profit of $0.21 per
household.
Chapter 13
Running Automated Jobs in Production Mode

AnswerTree provides a scripting language that lets you run the application in
production mode, where the actions of the application are determined by a script
prepared in advance. The file containing the script should have the filename extension
.ats. Scripts are executed by starting AnswerTree with the script filename as a
parameter on the command line or by double-clicking the script file from a file viewer.
AnswerTree then executes the script, producing the output described in the script.
Running AnswerTree from a batch file. AnswerTree was not designed to run multiple
concurrent sessions. Therefore, if you want to run AnswerTree scripts from a batch
file, you should use the START /W DOS command to ensure that the job finishes
before the next process begins. This is especially important if you want to run more
than one AnswerTree script from the same batch file—every line in the batch file
should use the START /W command. For example, if you have a set of three analyses
you want to run every month, you might write a batch file similar to this:

START /W "C:\Program Files\AnswerTree\AnswerTree.exe" "C:\Reports\tree1.ats"
START /W "C:\Program Files\AnswerTree\AnswerTree.exe" "C:\Reports\tree2.ats"
START /W "C:\Program Files\AnswerTree\AnswerTree.exe" "C:\Reports\tree3.ats"

This batch file runs the three analyses sequentially, so that they do not interfere with
each other.


General Rules
AnswerTree syntax resembles Microsoft's Visual Basic. Specifically, an .ats file
consists of a sequence of statements. A statement can be continued across multiple
lines by ending each continued line with a blank space and an underscore ( _ ).
Comments are introduced with an apostrophe. Any text following the
apostrophe up to the end of the same physical line is skipped. Comment lines cannot be
continued by ending the line with a blank and an underscore, and a line cannot be
continued if a comment occurs after the blank and underscore.
String constants can be of any length. The concatenation operator & (ampersand)
can be used to combine strings, so that very long strings can be broken into pieces
convenient for a text editor.
AnswerTree syntax uses keywords to permit unambiguous parsing. User input is
either numeric or enclosed in quotation marks. String constants are written in .ats files
with surrounding quotation marks. If a quotation mark is to be an actual character in
the string, it must be written as two adjacent quotation marks.
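
For example, the following fragment shows a comment line and a statement continued
across two lines (the Define Priors command is described later in this chapter; the
category names and values shown here are purely illustrative):

' Use equal priors for the two credit classes
Define Priors CVPair("good",0.5), _
   CVPair("bad",0.5)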

Production Runs
An .ats file contains one production run. A production run deals with a single
AnswerTree saved project or a single data source. If a data source is used, it must be
an SPSS-format data file. A saved project incorporates the data as part of the project.
Two statements delimit a production run.

Begin Production_Run <production log filename> <text>

A production run is begun with a Production_Run statement. The statement specifies a


log filename and a user-defined text annotation (enclosed in quotation marks), which
is included in the log. The production log filename should be a single token, enclosed
in quotation marks.

End Production_Run

The end of a production run is signaled by the End Production_Run statement. This
statement closes any open files. If there is no End Production_Run statement in the
script file, an error is reported in the log file.

Opening a Data Source or Saved Project

The first statement after the Production_Run statement must be an Open statement.
There are two types of Open statements.

Open Data_Source <filename> <project name>

Starting a production run directly with a data source creates a new project with the
given project name and initializes AnswerTree to work with the specified data file.
Both the filename and project name are string constants.
An error may occur if the file is not a valid AnswerTree-format data file. If so, the
production run terminates.

Open Project <filename>

Starting a production run with an existing project file continues work on a saved
project. The filename is a string constant. An error may occur at this point if the file is
not a valid saved project. If so, the production run terminates.

Saving a Project

At any time, the current state of a project may be saved to a project file with the Save
Project command.

Save Project <filename>

This command saves the project with the specified filename. The filename is a string
constant. If an error occurs while attempting to save the project, the production run
terminates.

Building a Tree

The main purpose of AnswerTree syntax is to permit the automated execution of


AnswerTree. Building and producing output for a single tree is accomplished with a
tree block. All of the specifications for building and producing output for a single tree
occur between a Begin Tree statement and an End Tree statement.

Since modification of the training sample does not modify other aspects of the tree,
such as the target variable, frequency variable, and weight variable, both training and
testing data may be used in a tree block.
The basic structure of the tree block is as follows (items in brackets are optional):

Begin Tree <optional name>
Model settings
<optional analysis settings>
Build Tree
<optional analysis settings + Rebuild Tree>
Grow Tree (or Grow_and_Prune Tree)
End Tree (or Delete Tree)

Note that in version 2.0 of AnswerTree, analysis settings can be specified before the
Build Tree command. In fact, this is encouraged, as it will lead to faster execution of the
script. The reverse ordering and Rebuild Tree command required for version 1.0 are still
supported for backward compatibility, but building the tree first and then specifying the
analysis settings will cause the script to take longer to run.

Begin Tree <optional name>

Begin Tree starts a new tree with each occurrence of this statement. Statements
occurring after the Begin Tree statement specify values for parameters that affect tree
growing or display. Second or subsequent tree blocks do not retain settings from
preceding tree blocks. Each tree block must be terminated with either End Tree or
Delete Tree.
If <optional name> is specified, it will be used as the name of the new tree.

End Tree

An End Tree statement resets all of the parameters that affect tree growth or display to
their default values. Any subsequent Begin Tree statement will result in a new tree
being built.

Delete Tree

The Delete Tree statement is similar to the End Tree statement but has the additional
effect of removing the tree from the project. If a tree is not to be preserved in the
project, deleting the tree is recommended. Depending upon the tree, potentially large
amounts of system resources are freed by deleting a tree.

Model Settings

The following parameters are used to define the tree-growing process and must be
specified before the root node is defined.

Method <tree-growing method>

Four different growing algorithms are supported: chaid, exhaustive_chaid, cart, and
quest.

Variable attributes

There are three basic variable types: nominal, ordinal, and continuous. Each of these
types may have missing values and a missing value label.

Nominal Variable <variable specification>

Defines the specified variable as nominal. The variable specification is a string


constant.

Ordinal Variable <variable specification>

Defines the specified variable as ordinal. The variable specification is a string constant.

Continuous Variable <variable specification>

Defines the specified variable as continuous. The variable specification is a string


constant.

Missing Label <quoted string to be used to display missing values>

Specifies the label used to indicate missing values for the previously defined variable.

CHAID Merge <Yes, No>

Allows you to specify whether the CHAID algorithm should merge categories for the
previously defined categorical variable.

Roles for Variables

The following commands affect how variables are used in the model.

Frequency <variable specification>

Specifies the variable that gives frequency weights. If this command is omitted, no
variable is used for frequency weights.

Weight <variable specification>

Specifies the variable that gives the case weights. If this command is omitted, no
variable is used for case weights.

Target <variable specification>

Specifies the variable to be used as the target variable. The variable specification must
be appropriate for the data source. When the data source is an SPSS .sav file, for
example, the variable specification is a variable name in the data set. The variable
specification is a string constant. This statement is required and must come before the
Method statement.

Predictors <list of variable specifications>

Specifies the variable(s) to be used as predictors in the model. To specify all variables
not assigned to other roles as predictors, use Predictors All.

Validation Settings

If you want to validate your tree to see how well it generalizes to new data, you can
partition your sample into training and test sets. If no validation command is specified,
the resubstitution estimate of risk based on the entire sample is used.

Partition Data

This command partitions the data into training and test sets and builds the top node of
the resulting tree. The two subcommands that govern the partition are described below.
The Random Seed and Training Percent subcommands may be written after Partition
Data on the same (extended) line or as separate commands preceding the Partition Data
statement.

Random Seed <positive integer value>

The positive integer value is used to start the random number generator used in
selecting the training set. By using the same random seed value, it is possible to
replicate a particular set of random partitions.

Training Percent <integer value>

The integer value represents a percentage of the data file cases to be used in defining a
tree (the training set). The remainder of the cases are considered cases to be tested (the
testing set).
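
For example, to train the tree on a random 70% of the cases and hold out the remaining
30% for testing (the seed value shown is arbitrary), you could write:

Random Seed 2000
Training Percent 70
Partition Data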

Analysis Settings

Analysis settings control various aspects of how the tree is grown. In addition to
general settings, you can specify settings for stopping rules, CHAID, C&RT, and
QUEST, pruning, scores, costs, priors, and profits.

General Settings

Maximum Competitors <positive integer>

Specifies the maximum number of competitors to be considered at each node split.

Maximum Surrogates <positive integer>

Specifies the maximum number of surrogates to be considered at each node split.

Stopping Rules

Maximum Depth <positive integer>

Specifies a limit on the number of levels in the tree.

Minimum_Cases Parent <positive integer>

Specifies the minimum number of cases in a parent node required to split that node.

Minimum_Cases Child <positive integer>

Specifies the minimum number of cases in a child node required to create the node.

Minimum Impurity_Change <positive number>

Specifies the minimum change in impurity required to split a node.
Applies only to C&RT models.

CHAID/Exhaustive CHAID Settings

Alpha Merge <positive number>

Specifies the alpha level for merging categories.

Alpha Split <positive number>

Specifies the alpha level for splitting nodes.

Chi_Square Pearson

Specifies the use of Pearson chi-square statistics.

Chi_Square Likelihood_Ratio

Specifies the use of likelihood-ratio chi-square statistics.

Convergence Epsilon <positive number>

Specifies the minimum difference between iteratively calculated expected frequencies


and their theoretical maximum likelihood values in CHAID or Exhaustive CHAID
models with ordinal target variables. If the difference between calculated and
theoretical values is less than epsilon, convergence is achieved and no more iterations
are executed.

Convergence Maximum_Iterations <positive integer>

Specifies the maximum number of iterations for computing expected frequencies in


CHAID or Exhaustive CHAID models with ordinal or continuous target variables.

Allow Splitting_Of_Merged_Categories <Yes, No>

Specifies whether to allow merged categories to be split.

Use Bonferroni_Adjustment <Yes, No>

Specifies whether to adjust significance levels for multiple comparisons.

C&RT

Impurity_Measure Gini

Specifies the Gini impurity measure for C&RT models using categorical target
variables.

Impurity_Measure Twoing

Specifies the twoing impurity measure for C&RT models using categorical target
variables.

Impurity_Measure Ordered_Twoing

Specifies the ordered twoing impurity measure for C&RT models using ordinal
categorical target variables.
Note that C&RT models with continuous target variables always use the least
squared deviation (LSD) impurity measure.

QUEST

Alpha Variable_Selection <positive number>

Specifies the alpha level for variable selection in QUEST models.

Pruning

Select_Subtree Minimum_Risk

With automatic pruning, selects the subtree with the minimum risk.

Select_Subtree Standard_Error_multiplier <positive number>

With automatic pruning, selects the smallest subtree with risk not more than
<positive number> standard errors greater than the minimum risk.

Scores

The Scores command involves setting numeric values to be associated with target
variable categories. These categories may be string or numeric, depending upon the
variable chosen for the target. In the AnswerTree dialog boxes, the categories are
presented in a grid, and it is necessary to enter only an associated value. In AnswerTree
syntax, both the category and the value need to be listed in the syntax. For this purpose,
a pairing function, CVPair, is provided. AnswerTree syntax uses the notation
CVPair(category, value) to specify one category-value pair. A sequence of these notations
may be used to specify one or more category-value pairs in the following commands.
Only the categories that are to change from default values need to be specified.

Define Scores <list of category-value pairs>

Defines score values for categories of the target variable. For example:

Define Scores CVPair(0,10),CVPair(1,20)


Define Scores CVPair("male",1),CVPair("female",2)

Costs
The Costs command involves specifying an entire misclassification matrix. There is a
special AnswerTree syntax function for building rows of category-value pairs. The
function CVRow has two or more arguments. The first is a category value for the target
variable that determines which row of the cost matrix is to be changed. The remainder
of the parameters are category-value pairs used to describe which values of the matrix
are to be changed.

Define Costs <list of rows in the misclassification matrix>

Defines costs for various kinds of misclassification. For example:

Define Costs CVRow(0,CVPair(1,1),CVPair(2,2)), _


CVRow(1,CVPair(0,1),CVPair(2,1)), _
CVRow(2,CVPair(0,2),CVPair(1,1))

defines this cost matrix:

                       Actual Category
                       0     1     2
Predicted        0     —     1     2
Category         1     1     —     1
                 2     2     1     —

The rows must be on the same extended line, so note the use of the line continuation
marker ( _ ). Also, the diagonal elements of the cost matrix are assumed to be 0 and
may not occur in this statement.

Priors

Prior probabilities allow you to specify the proportion of cases assumed to be
associated with each target variable category before the analysis. These commands
affect only C&RT and QUEST trees.

Equal Priors

Specifies the use of equal prior probabilities for all categories of the target variable.

Normalize Priors

Causes the priors to be rescaled to sum to 1.0 (before computing adjusted priors, if
necessary).

Define Priors <list of category-value pairs>

Prior probabilities are associated with the categories of the target variable using the
pairing function, CVPair. AnswerTree syntax uses the notation CVPair(category, value)
to specify one category-value pair. Each of the categories should be assigned a positive
value. Normally, these are probabilities.
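
For example, a hypothetical two-category target with codes 0 and 1 might be given
priors of 0.25 and 0.75 (the codes and values are illustrative):

Define Priors CVPair(0,0.25),CVPair(1,0.75)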

Adjust Priors <Yes, No>

Determines whether priors are adjusted to include misclassification cost information.


The default value is No.

Profits

Profits indicate the relative values of different categories of the target variable.

Define Profits <list of category-value pairs>

Defines profit values for categories of the target variable. Profit values are associated
with the categories of the target variable using the pairing function, CVPair. AnswerTree
syntax uses the notation CVPair(category, value) to specify one category-value pair.
For example:

Define Profits CVPair(0,10),CVPair(1,20)


Define Profits CVPair("male",1.5),CVPair("female",3.75)

Creating the Tree

After specifying the model settings and any desired analysis settings, the tree can be
built.

Build Tree

This command results in a single node tree, the root of the tree whose parameters have
already been given. This statement causes AnswerTree to process the data in
preparation for growing the tree.

Rebuild Tree

The Rebuild Tree command is used to accommodate changes of scores, costs, profits,
and priors after the last Build Tree command. This command should not be required for
AnswerTree 2.0 scripts, since you can specify analysis settings before issuing the Build
Tree command. It is included for backward compatibility with AnswerTree 1.0 scripts.

Grow Tree

The Grow Tree command grows the tree based on the specified settings.

Grow_And_Prune Tree

The Grow_And_Prune Tree command grows the tree based on the specified settings and
then prunes it according to the subtree parameter specified in the Select_Subtree
command.
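
Putting these commands together, the growing portion of a script might look like the
following sketch. The subtree multiplier is illustrative, and the data, variable, and model
definition commands described earlier in this chapter are assumed to precede this fragment.

Select_Subtree Standard_Error_Multiplier 1.0
Build Tree
Grow_And_Prune Tree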

Formatting Output

You can control various aspects of AnswerTree output from a production mode script.

Orientation <orientation specification>

Controls the orientation of the tree. Valid orientation specifications are Top (top-down),
Left (left-to-right), and Right (right-to-left). This command must come after the
Begin Tree command but before the Build Tree command.

View <Table, Graph, Both>

Specifies the contents of nodes in the tree window. Table requests tabular statistics in
nodes of the tree, Graph requests graphs in nodes of the tree, and Both requests both
tables and graphs in nodes of the tree.

Show Training_Set

This command is used with partitioned data to show results based on the training set of
data. If the training set is already displayed, this command does nothing.

Show Test_Set

This command is used with partitioned data to show results based on the held-out test
set of data. If the test set is already displayed, this command does nothing.

Format Gain

This command allows you to control aspects of the gain summary. All of the gain
summary formats appropriate for the chosen method can be obtained with the proper
choice of Format Gain parameters. Subcommands can be written after Format Gain on
the same (extended) line or as separate commands prior to Format Gain. The
subcommands include:

Sort Ascending

Sorts the values in ascending order of gain scores.

Sort Descending

Sorts the values in descending order of gain scores.

Percentile Increment <integer value>

The gain summary is organized by percentiles, with each percentile group containing
<integer value> percent of the cases.

Target Category <category name>

Specifies the value of the target variable that is used to calculate gains and risks.

Average Profit

Requests that average profits be reported instead of the usual gain statistics.

Cumulative Statistics <Yes, No>

Specifies whether the gain summary contains additional columns for cumulative
statistics. Not valid if Percentile Increment is used.
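
For example, a gain summary sorted in descending order and grouped into deciles could
be requested by writing the subcommands before Format Gain, as in this hypothetical
fragment (the increment of 10 is illustrative):

Sort Descending
Percentile Increment 10
Format Gain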

Format Rules

This command determines what is printed by the Print Rules command. There are three
different formats for the rules: SQL, SPSS, and decision rules. Each of these formats
may be used to present rules for either selecting cases or for assigning values to cases.
The decision rule format provides options for using variable labels and/or value labels.
An important point about selection rules: Printing automatically selects all terminal
nodes. As a result, the selection rule selects all cases and is uninformative.

SQL_Rules For_Selecting_Cases

Prints SQL rules for selecting cases. This is the default if no other Format Rules
options are specified.

SQL_Rules For_Assigning_Values

Prints SQL rules for assigning values to cases.

SPSS_Rules For_Selecting_Cases

Prints SPSS rules for selecting cases.

SPSS_Rules For_Assigning_Values

Prints SPSS rules for assigning values to cases.

Decision_Rules For_Selecting_Cases

Prints decision rules for selecting cases.

Decision_Rules For_Assigning_Values

Prints decision rules for assigning values to cases.

Decision_Rules Use_Value_Labels <Yes, No>

Specifies whether to use value labels in decision rules.

Decision_Rules Use_Variable_Labels <Yes, No>

Specifies whether to use variable labels in decision rules.
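
Assuming that, as with Format Gain, these options can be written as separate commands
before Format Rules, a script might request labeled decision rules for assigning values
and then print them. This is a hypothetical fragment:

Decision_Rules For_Assigning_Values
Decision_Rules Use_Value_Labels Yes
Format Rules
Print Rules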

Printing from Production Mode

The five major views in AnswerTree all have a Print option (accessed from the File
menu). These options are available in AnswerTree syntax using five separate Print
commands. To perform a Print command, AnswerTree in effect switches to the
corresponding view tab and then executes the print operation. One side effect of this
is that the last type of Print command executed determines which view will be in effect
when the application is continued after a production run.

Print Gain <path>

The gain summary will be printed using the settings defined. An optional path can be
included to print to a specific printer or to a file.

Print Risk <path>



The risk summary will be printed. An optional path can be included to print to a
specific printer or to a file.

Print Rules <path>

All terminal nodes are selected and the rules view is printed. An optional path can be
included to print to a specific printer or to a file.

Print Summary <path>

The analysis summary is printed. An optional path can be included to print to a specific
printer or to a file.

Print Tree <path>

This command will print the tree. An optional path can be included to print to a specific
printer or to a file.

The Log File

The AnswerTree production log file is created during the production run. It records the
actions taken during the run and gives the times at which major events affecting the tree
take place. The log file is itself a valid .ats file that could also be run in production
mode. Indeed, if the original file inputs to the production run are the same, the original
.ats file and a log .ats file should produce the same production run and log except for
the time stamps.

Example Scripts

The Scripts directory (found on the AnswerTree CD-ROM) contains example scripts
that demonstrate various aspects of using AnswerTree production mode. See the
comments in the scripts for details on what each specific script does and how it works.

Alphabetical List of Keywords


Adjust, All, All_Categories, Allow, Alpha, Ascending, Average, Begin, Bonferroni_Adjustment, Both,
Build, Cart, Cases, Category, Chaid, Chi_Square, Child, Close, Column, Comment, Competitors,
Content, Continuous, Convergence, Costs, Cumulative, CVPair, CVRow, Data, Data_Source,
Decision_Rules, Define, Delete, Depth, Descending, End, Epsilon, Equal, Exhaustive_Chaid,
For_Assigning_Values, For_Selecting_Cases, For_Values, Format, Frequency, Gain, Gini, Graph,
Grow, Grow_And_Prune, Impurity_Change, Impurity_Measure, Include, Increment, Label, Left,
Likelihood_Ratio, Maximum, Maximum_Iterations, Merge, Method, Minimum, Minimum_Cases,
Minimum_Risk, Missing, No, Nodes, Nominal, Off, On, Open, Ordered_Twoing, Ordinal, Orientation,
Parent, Parent_Node, Partition, Pearson, Percent, Percentile, Predictors, Print, Priors,
Production_Run, Profit, Profits, Project, Pruning, QUEST, Random, Rebuild, Right, Risk, Row,
Rules, Save, Scores, Seed, Select_Subtree, Show, Sort, Split, Splitting_Of_Merged_Categories,
SPSS_Rules, SQL_Rules, Standard_Error_Multiplier, Statistics, Summary, Surrogates, Table,
Target, Test_Set, Top, Training, Training_Set, Tree, Twoing, Use, Use_Value_Labels,
Use_Variable_Labels, Variable, Variable_Selection, View, Weight, Yes
Chapter

14
Statistics and Algorithms

This chapter discusses how AnswerTree generates trees and their associated statistics.
We have attempted to explain the concepts in general (nonmathematical) terms, but
the discussions of algorithms do assume some knowledge of statistics. If you would
like to learn more about how AnswerTree works, this chapter is for you. The following
topics are covered:
 Variables. Measurement levels of variables, case weights, and frequency variables.
 Growing methods. The advantages and disadvantages of each method, along with
algorithms for each.
 Stopping rules. How AnswerTree stops growing a tree.
 Tree parameters. Costs, prior probabilities, scores, and profits.
 Gain summary. How the gain summary helps you interpret your results.
 Accuracy of the tree. How to ensure that your tree is accurate.
 Cost-complexity pruning. Explanations and algorithms for cost-complexity
pruning.

Variables
This section defines the term categorical variable and describes the different
measurement levels of variables that may be used in an AnswerTree analysis. In
addition to the analysis variables, AnswerTree also allows you to use case weights and
frequency variables. You can use these variables to reduce the size of your data file,
which may help speed up the analysis.


Categorical Variables

Categorical variables differ from continuous variables in that they are not measured
in a continuous fashion but are classified into distinct groups. You can convert
continuous variables to categorical variables by grouping ranges of values. For
example, you could convert a continuous variable age into a categorical variable by
forming categories: 18 through 24, 25 through 34, 35 through 44, and so on.
Categorical variables may be nominal or ordinal. Categories of a nominal variable
differ in kind rather than in degree, so they have no natural ordering. For example,
occupational categories such as white collar, blue collar, other, and unknown do not
follow a particular, meaningful order. Ordinal variables have known or unknown
numeric scores associated with their categories. The categories of age described above
constitute an ordinal variable. It may sometimes make sense to compute the mean
(average score) for an ordinal variable but never for a nominal variable. For ordinal
variables, it never makes sense to merge noncontiguous values, whereas for nominal
variables, any pair of values could be merged.
In AnswerTree, all of the growing methods accept all types of variables, with one
exception: QUEST requires that the target variable be nominal.

Target and Predictor Variables

AnswerTree frequently uses the terms target variable and predictor variable. The
target variable is the variable whose outcome you want to predict using other
variables. It is also known as the dependent variable. For example, in the analysis of
the Iris data set (Fisher, 1936), the target variable is the species of iris.
Predictor variables are those that predict the pattern of the target variable. They
are also known as independent variables. In the Iris data set, the predictor variables
are the petal length, petal width, sepal length, and sepal width, which meaningfully
predict the species of flower.

Case Weight and Frequency Variables

Case weight and frequency variables are useful for reducing the size of your data set.
Each has a distinct function, though. If a weight variable is mistakenly specified to be
a frequency variable, or vice versa, the resulting analysis will be incorrect.

Case Weights

The use of a case weight variable gives unequal treatment to the cases in a data set.
When a case weight variable is used, the contribution of a case in the analysis is
weighted in proportion to the population units that the case represents in the sample.
For example, suppose that in a direct marketing promotion, 10,000 households respond
and 1,000,000 households do not respond. To reduce the size of the data file, you might
include all of the responders but only a 1% sample (10,000) of the nonresponders. You
can do this if you define a case weight equal to 1 for responders and 100 for
nonresponders.
Note that in an AnswerTree analysis, the QUEST method does not accept case
weights.

Frequency Variables

A frequency variable represents the total number of observations that fall in a
particular cell. It is useful for creating aggregate data, in which a record represents
more than one individual. The sum of the values for a frequency variable should
always be equal to the total number of observations in the sample. Note that output and
statistics are the same whether you use a frequency variable or case-by-case data.
Table 14-1 shows a hypothetical example, with the predictor variables sex and
employment and the target variable response. The frequency variable tells us, for
example, that 10 employed men responded yes to the target question, and 19
unemployed women responded no.
Table 14-1
Data set with frequency variable

Sex   Employment   Response   Frequency
M     Y            Y          10
M     Y            N          17
M     N            Y          12
M     N            N          21
F     Y            Y          11
F     Y            N          15
F     N            Y          15
F     N            N          19
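
As a quick check on aggregated data such as this, the frequencies should sum to the total
sample size. Here the eight records represent

$$10 + 17 + 12 + 21 + 11 + 15 + 15 + 19 = 120$$

observations, so the analysis treats this small file exactly as it would a case-by-case file
of 120 records.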

Tree-Growing Methods
AnswerTree includes four growing methods: CHAID, Exhaustive CHAID, C&RT, and
QUEST. Each one works a bit differently, and each one has its own best use. This
section provides an overview of each algorithm, along with a discussion of the
advantages and disadvantages of each, including a mention of how each handles
missing values. At the end of the section, the mathematical algorithms are given for
each method.

Overview of CHAID

CHAID stands for Chi-squared Automatic Interaction Detector. It is a highly efficient
statistical technique for segmentation, or tree growing, developed by Kass (1980).
Using as a criterion the significance of a statistical test, CHAID evaluates all of the
values of a potential predictor variable. It merges values that are judged to be
statistically homogeneous (similar) with respect to the target variable and maintains all
other values that are heterogeneous (dissimilar).
It then selects the best predictor variable to form the first branch in the decision tree,
such that each node is made of a group of homogeneous values of the selected variable.
This process continues recursively until the tree is fully grown. The statistical test used
depends upon the measurement level of the target variable. If the target variable is
continuous, an F test is used. If the target variable is categorical, a chi-squared test is used.
CHAID is probably the most popular of AnswerTree’s methods. It is not binary; that
is, it can produce more than two categories at any particular level in the tree. Therefore,
it tends to create a wider tree than do the binary growing methods. It works for all types
of variables, and it accepts both case weights and frequency variables. It handles
missing values by treating them all as a single valid category.

Overview of Exhaustive CHAID

Exhaustive CHAID is a modification of CHAID developed by Biggs, de Ville, and


Suen (1991). It was developed to address some of the weaknesses of the CHAID
method. In particular, sometimes CHAID may not find the optimal split for a variable,
since it stops merging categories as soon as it finds that all remaining categories are
statistically different. Exhaustive CHAID remedies this by continuing to merge
categories of the predictor variable until only two supercategories are left. It then

examines the series of merges for the predictor and finds the set of categories that gives
the strongest association with the target variable, and computes an adjusted p value for
that association. Thus, exhaustive CHAID can find the best split for each predictor, and
then choose which predictor to split on by comparing the adjusted p values.
Exhaustive CHAID is identical to CHAID in the statistical tests it uses and in the
way it treats missing values. Because its method of combining categories of variables
is more thorough than that of CHAID, it takes longer to compute. However, if you have
the time to spare, Exhaustive CHAID is generally safer to use than CHAID. It
sometimes finds more useful splits. Note, though, that depending on your data, you
may find no difference between Exhaustive CHAID and CHAID results.

Overview of C&RT

C&RT stands for Classification and Regression Trees. It is a binary tree-growing
algorithm developed by Breiman, Friedman, Olshen, and Stone (1984). C&RT
partitions the data into two subsets so that the cases within each subset are more
homogeneous than in the previous subset. It is a recursive process—it repeats itself
until the homogeneity criterion is reached or until some other stopping criterion is
satisfied (as do all of the tree-growing methods). Note that the same predictor variable
may be used several times at different levels in the tree.
C&RT is quite flexible. It allows misclassification costs to be considered in the tree
growing process. It also allows for prior probability distribution in a classification
problem. You can apply automatic cost-complexity pruning to a C&RT tree to obtain
a more generalizable tree (see “Cost-Complexity Pruning” on p. 195).
C&RT has some caveats, though. As a binary algorithm, it tends to grow trees with
many levels. Therefore, the resulting tree may not present results efficiently, especially
if the same variable was used to split a number of successive levels. It also has a tendency
to select variables that can afford more splits in the tree-growing process. Therefore,
conclusions drawn from the tree structures may prove to be unreliable. Finally, C&RT is
complex. Its computation can take a long time with large data sets. Note that it uses
surrogate splitting to handle missing values (see “Surrogate Splitting” on p. 191).

Overview of QUEST

QUEST stands for Quick, Unbiased, Efficient Statistical Tree. It is a relatively new
binary tree-growing algorithm developed by Loh and Shih (1997). It deals with
variable selection and split-point selection separately. The univariate split in QUEST

performs approximately unbiased variable selection. That is, if all predictor variables
are equally informative with respect to the target variable, QUEST selects any of the
predictor variables with equal probability.
QUEST was created for computational efficiency. It affords many of the advantages
of C&RT, but, like C&RT, your trees can become unwieldy. You can apply automatic
cost-complexity pruning (see “Cost-Complexity Pruning” on p. 195) to a QUEST tree
to cut down its size. Note that QUEST uses surrogate splitting to handle missing values
(see “Surrogate Splitting” on p. 191).

Tree-Growing Algorithms

This section describes the computational processes behind each of AnswerTree’s four
growing methods.

CHAID Algorithm

CHAID works with all types of continuous or categorical variables. However,
continuous predictor variables are automatically categorized for the purpose of the
analysis. Note that you can set some of the options mentioned below using the
Advanced Options for CHAID. These include the choice of the Pearson chi-squared or
likelihood-ratio test, the level of α_merge, and the level of α_split.
1. For each predictor variable X, find the pair of categories of X that is least
significantly different (that is, has the largest p value) with respect to the target
variable Y. The method used to calculate the p value depends on the measurement
level of Y.
 If Y is continuous, use an F test.
 If Y is nominal, form a two-way crosstabulation with categories of X as rows
and categories of Y as columns. Use the Pearson chi-squared test or the
likelihood-ratio test.
 If Y is ordinal, fit a Y association model (Clogg and Eliason, 1987; Goodman,
1979; and Magidson, 1992). Use the likelihood-ratio test.
2. For the pair of categories of X with the largest p value, compare the p value to a
prespecified alpha level, α_merge.

 If the p value is greater than α_merge, merge this pair into a single compound
category. As a result, a new set of categories of X is formed, and you start the
process over at step 1.
 If the p value is less than α_merge, go on to step 3.
3. Compute the adjusted p value for the set of categories of X and the categories of Y
by using a proper Bonferroni adjustment.
4. Select the predictor variable X that has the smallest adjusted p value (the one that
is most significant). Compare its p value to a prespecified alpha level, α_split.
 If the p value is less than or equal to α_split, split the node based on the set of
categories of X.
 If the p value is greater than α_split, do not split the node. The node is a terminal
node.
5. Continue the tree-growing process until the stopping rules are met.
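
As a hypothetical illustration of steps 1 and 2, suppose α_merge = 0.05 and the least
significantly different pair of categories of some predictor has p = 0.32. Because
0.32 > 0.05, that pair is merged and step 1 is repeated on the reduced set of categories;
once the least significant remaining pair has, say, p = 0.01 < 0.05, merging stops and the
procedure moves on to the Bonferroni adjustment in step 3.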

Exhaustive CHAID Algorithm

Exhaustive CHAID works much the same as CHAID. You can set some of the options
mentioned below using the Advanced Options for CHAID. These include the choice
of the Pearson chi-squared or likelihood-ratio test and the level of α_split.
1. For each predictor variable X, find the pair of categories of X that is least
significantly different (that is, has the largest p value) with respect to the target
variable Y. The method used to calculate the p value depends on the measurement
level of Y.
 If Y is continuous, use an F test.
 If Y is nominal, form a two-way crosstabulation with categories of X as rows
and categories of Y as columns. Use the Pearson chi-squared test or the
likelihood-ratio test.
 If Y is ordinal, fit a Y association model (Clogg and Eliason, 1987; Goodman,
1979; and Magidson, 1992). Use the likelihood-ratio test.
2. Merge into a compound category the pair that gives the largest p value.
3. Calculate the p value based on the new set of categories of X. Remember the p
value and its corresponding set of categories of X.

4. Repeat steps 1, 2, and 3 until only two categories remain. Then, among all sets of
categories of X, find the one for which the p value in step 3 is the smallest.
5. Compute the Bonferroni adjusted p value for the set of categories of X and the
categories of Y.
6. Select the predictor variable X that has the smallest adjusted p value (the one that
is most significant). Compare its p value to a prespecified alpha level, α_split.
 If the p value is less than or equal to α_split, split the node based on the set of
categories of X.
 If the p value is greater than α_split, do not split the node. The node is a terminal
node.
7. Continue the tree-growing process until the stopping rules are met.

C&RT Algorithm

C&RT works by choosing a split at each node such that each child node is more pure
than its parent node. Here purity refers to the values of the target variable. In a
completely pure node, all of the cases have the same value for the target variable.
C&RT measures the impurity of a split at a node by defining an impurity measure.

Impurity Measures

There are four different impurity measures used to find splits for C&RT models,
depending on the type of the target variable. For categorical target variables, you can
choose Gini, twoing, or (for ordinal targets) ordered twoing. For continuous targets,
AnswerTree automatically uses the least-squared deviation (LSD) method of finding a
split. These measures are further explained in the following sections.

Gini. The Gini index at node t, g(t), is defined as

$$g(t) = \sum_{j \neq i} p(j|t)\,p(i|t)$$

where i and j are categories of the target variable. This can also be written as

$$g(t) = 1 - \sum_j p(j|t)^2$$

Thus, when the cases in a node are evenly distributed across the categories, the Gini
index takes its maximum value of $1 - 1/k$, where k is the number of categories for the
target variable. When all cases in the node belong to the same category, the Gini index
equals 0.
If costs are specified, the Gini index is computed as

$$g(t) = \sum_{j \neq i} C(i|j)\,p(j|t)\,p(i|t)$$

where C(i|j) specifies the cost of misclassifying a category j case as category i.
The Gini criterion function $\Phi(s,t)$ for split s at node t is defined as

$$\Phi(s,t) = g(t) - p_L\,g(t_L) - p_R\,g(t_R)$$

where $p_L$ is the proportion of cases in t sent to the left child node, and $p_R$ is the
proportion sent to the right child node. The split s is chosen to maximize the value of
$\Phi(s,t)$. This value, weighted by the proportion of all cases in node t, is the value
reported as “improvement” in the tree.
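
As a hypothetical illustration, for a node whose three target categories occur with
proportions 0.5, 0.3, and 0.2, the Gini index is

$$g(t) = 1 - (0.5^2 + 0.3^2 + 0.2^2) = 1 - 0.38 = 0.62,$$

close to the maximum of $1 - 1/3 \approx 0.67$ attainable with three categories,
indicating a fairly impure node.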
Twoing. The twoing index is based on splitting the target categories into two
superclasses, and then finding the best split on the predictor variable based on those
two superclasses. The twoing criterion function for split s at node t is defined as

$$\Phi(s,t) = \frac{p_L\,p_R}{4}\left[\,\sum_j \bigl|\,p(j|t_L) - p(j|t_R)\,\bigr|\,\right]^2$$

where $t_L$ and $t_R$ are the nodes created by the split s. The split s is chosen as the split
that maximizes this criterion. This value, weighted by the proportion of all cases in
node t, is the value reported as “improvement” in the tree. The superclasses C1 and C2
are defined as

$$C_1 = \{\,j : p(j|t_L) \geq p(j|t_R)\,\}$$

and

$$C_2 = C - C_1$$

where C is the set of categories of the target variable.


Costs, if specified, are not taken into account in splitting nodes using the twoing
criterion. However, costs will be incorporated into node assignment and risk
estimation.
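
As a hypothetical illustration of the twoing criterion, suppose a split sends half the cases
to each child node ($p_L = p_R = 0.5$) and the target category proportions in the children
are $p(j|t_L) = (0.6, 0.3, 0.1)$ and $p(j|t_R) = (0.2, 0.4, 0.4)$. Then

$$\Phi(s,t) = \frac{0.5 \times 0.5}{4}\bigl(|0.6-0.2| + |0.3-0.4| + |0.1-0.4|\bigr)^2 = 0.0625 \times 0.8^2 = 0.04,$$

and the superclasses are $C_1 = \{1\}$ (the only category more prevalent in the left child)
and $C_2 = \{2, 3\}$.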

Ordered twoing. The ordered twoing index is a modification of the twoing index for
ordinal target variables. The difference is that with the ordered twoing criterion, only
contiguous categories can be combined to form superclasses. For example, consider a
target variable such as account status, with categories 1 = current, 2 = 30 days overdue,
3 = 60 days overdue, and 4 = 90 or more days overdue. The twoing criterion might in
some situations put categories 1 and 4 together to form a superclass, with categories 2
and 3 forming the other superclass. However, if we consider these categories to be
ordered, we don’t want categories 1 and 4 to be combined (without also including the
intervening categories) because they are not contiguous. The ordered twoing index
takes this ordering into account and will not combine noncontiguous categories such
as 1 and 4.

Least-squared deviation (LSD). For continuous target variables, the LSD impurity
measure is used. The LSD measure R(t) is simply the (weighted) within-node variance
for node t, and it is equal to the resubstitution estimate of risk for the node. It is defined as

$$R(t) = \frac{1}{N_W(t)} \sum_{i \in t} w_i\,f_i\,\bigl(y_i - \bar{y}(t)\bigr)^2$$

where $N_W(t)$ is the weighted number of cases in node t, $w_i$ is the value of the
weighting variable for case i (if any), $f_i$ is the value of the frequency variable (if any),
$y_i$ is the value of the target variable, and $\bar{y}(t)$ is the (weighted) mean for node t.
The LSD criterion function for split s at node t is defined as

$$\Phi(s,t) = R(t) - p_L\,R(t_L) - p_R\,R(t_R)$$

The split s is chosen to maximize the value of $\Phi(s,t)$. This value, weighted by the
proportion of all cases in node t, is the value reported as “improvement” in the tree.
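
As a hypothetical illustration, for an unweighted node ($w_i = f_i = 1$) containing three
cases with target values 2, 4, and 6, the node mean is 4 and

$$R(t) = \frac{1}{3}\bigl((2-4)^2 + (4-4)^2 + (6-4)^2\bigr) = \frac{8}{3} \approx 2.67.$$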

Steps in the C&RT Analysis


1. To conduct a C&RT analysis, starting from the root node $t = 1$, search for a split
$s^*$ among the set of all possible candidates S that gives the largest decrease in
impurity:

$$\Phi(s^*, 1) = \max_{s \in S} \Phi(s, 1)$$

Then split node 1 ($t = 1$) into two nodes, $t = 2$ and $t = 3$, using split $s^*$.
2. Repeat the split-searching process in each of $t = 2$ and $t = 3$, and so on.
3. Continue the tree-growing process until at least one of the stopping rules is met.

QUEST Algorithm

QUEST deals with variable selection (steps 1 and 2) and split-point selection
separately (steps 3, 4, and 5). Note that you can specify the alpha level to be used in
the Advanced Options for QUEST—the default value is 0.05.
1. For each predictor variable X, if X is a nominal categorical variable, compute the
p value of a Pearson chi-squared test of independence between X and the
categorical dependent variable. If X is continuous or ordinal, use the F test to
compute the p value.
2. Compare the smallest p value to a prespecified, Bonferroni-adjusted alpha level.
 If the p value is less than α , then select the corresponding predictor variable to
split the node. Go on to step 3.
 If the p value is greater than α , for each X that is continuous or ordinal, use
Levene’s test for unequal variances to compute a p value. (In other words, try
to find out whether X has unequal variances at different levels of the target
variable.)
 Compare the smallest p value from Levene’s test to a new Bonferroni-adjusted
alpha level.
 If the p value is less than α , select the corresponding predictor variable with
the smallest p value from Levene’s test to split the node. Go on to step 3.

 If the p value is greater than α , select the predictor variable from step 1 that
has the smallest p value (from either a Pearson chi-squared test or an F test) to
split the node. Go on to step 3.
3. Suppose that X is the predictor variable from step 2. If X is continuous or ordinal,
go on to step 4. If X is nominal, transform X into a dummy variable Z and compute
the largest discriminant coordinate of Z. Roughly speaking, you transform X to
maximize the differences among the target variable categories. (For more
information, see Gnanadesikan, 1977.)
4. If Y has only two categories, go on to step 5. Otherwise, compute a mean of X for
each category of Y and apply a two-mean clustering algorithm to those means to
obtain two superclasses of Y.
5. Apply quadratic discriminant analysis (QDA) to determine the split point. Notice
that QDA usually produces two cut-off points—choose the one that is closer to the
sample mean of each class.

Surrogate Splitting
Surrogate splitting is used to handle missing values for predictor variables in C&RT
and QUEST. If the best predictor variable to be used for a split has a missing value at
a particular node, AnswerTree substitutes the best replacement, or surrogate, predictor
variable it can find.
For example, suppose that X* is the predictor variable that defines the best split s*
at node t. The surrogate-splitting process finds another split s, the surrogate, which uses
another predictor variable X such that this split is most similar to s* at node t. If a new
case is to be predicted and it has a missing value on X* at node t, AnswerTree makes
the prediction on the surrogate split s instead. (Unless, of course, this case also has a
missing value on X. In such a situation, the next best surrogate is used, and so on.)

Stopping Rules
AnswerTree stops the tree-growing process when one of a number of stopping rules
has been met. A node will not be split if any of the following conditions is met:
 All cases in a node have identical values for all predictors.

 The node becomes pure; that is, all cases in the node have the same value of the
target variable.
 The depth of the tree has reached its prespecified maximum value.
 The number of cases constituting the node is less than a prespecified minimum parent
node size.
 The split at the node results in producing a child node whose number of cases is
less than a prespecified minimum child node size.
 For C&RT only, the maximum decrease in impurity is less than a prespecified
value β .
Note that you can set all but the first two of these rules in the Advanced Options for
stopping rules.
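
As a hypothetical illustration of the size-based rules, suppose the minimum parent node
size is 100 and the minimum child node size is 50. A node containing 80 cases is not
split at all, and a node containing 120 cases whose best available split would produce
children of 90 and 30 cases is likewise left as a terminal node, because the smaller child
would fall below the minimum.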

Tree Parameters
Four types of parameters can be applied to trees: scores, profits, costs, and priors. Each
one is defined in the following sections.

Scores

Scores are available in CHAID and Exhaustive CHAID. They define the order and
distance between categories of an ordinal categorical target variable. In other words,
the scores define the variable’s scale. Values of scores are involved in tree growing.

Profits

Profits are numeric values associated with categories of a target variable (ordinal or
nominal) that can be used to estimate the gain or loss associated with a segment. They
define the relative value of each value of the target variable. For example, in a
magazine marketing campaign you might have three target categories: paid
responders, unpaid responders, and non-responders, where paid responders generate
$35 of profit (because they subscribe to the magazine), unpaid responders generate –$7
of profit (because they accept the free trial issue but don't generate any revenue), and
non-responders generate –$0.15 (because of postage to send the mailing to them).
Although profits typically represent monetary worth, they can also be used for other

kinds of cost-benefit analysis. Values are used in computing the gains chart, but not in
tree growing. You can specify profits for any of AnswerTree’s four growing methods.

Costs

Misclassification costs are numeric penalties for classifying an item into one category
when it really belongs in another—for example, when a statistical procedure
misclassifies cancerous cells as benign. Only the C&RT and QUEST methods take
costs into account when growing the tree, but all four of the methods can use the costs
in calculating risk estimates.
The cost of misclassifying a category j case as a category i class is

$$C(i|j) = \begin{cases} c_{ij} & i \neq j \\ 0 & i = j \end{cases}$$

In a case where $C(i|j) = 1$ for all i, j such that $i \neq j$, we simply say that no costs are
involved. A cost matrix can be symmetric or asymmetric. If it is asymmetric, you can
make it symmetric by averaging the values of two opposite entries—that is,

$$C'(i|j) = \bigl\{C(i|j) + C(j|i)\bigr\}/2$$
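
For example (with hypothetical values), if $C(1|2) = 4$ and $C(2|1) = 2$, the symmetric
version assigns both entries the average $(4 + 2)/2 = 3$.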

Priors

Prior probabilities are numeric values that influence the misclassification rates for
categories of the target variable. They specify the proportion of cases already
associated with each category of the target variable prior to the analysis. In
AnswerTree, you can set priors for QUEST or C&RT as long as you have a categorical
target variable. The values are involved both in tree growing and risk estimation.
Misclassification costs can be incorporated into priors to form a set of adjusted
priors so that you do not have to specify costs separately. The adjusted priors are
defined as

$$\pi'(j) = \frac{C(j)\,\pi(j)}{\sum_{j'} C(j')\,\pi(j')}$$

where the categorical target variable $Y = j$, $j = 1, \ldots, J$, $C(j) = \sum_i C(i|j)$, and $\pi(j)$
are the original (unadjusted) priors.
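
As a hypothetical illustration, suppose a two-category target has equal priors
$\pi(1) = \pi(2) = 0.5$, but the summed misclassification costs are $C(1) = 1$ and $C(2) = 3$,
so that misclassifying a category 2 case is three times as costly. The adjusted priors become

$$\pi'(1) = \frac{1 \times 0.5}{1 \times 0.5 + 3 \times 0.5} = 0.25, \qquad \pi'(2) = \frac{3 \times 0.5}{2.0} = 0.75,$$

so the costly-to-misclassify category carries more weight in growing the tree.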

Summary of the Methods

The following table summarizes the variables and parameters available for use with
each of the four growing methods.
Table 14-2
Summary of the methods

Method             Target Variable      Case     Frequency   Scores      Profits        Costs          Priors
                   Type                 Weights  Variables   (Ordinal)   (Ordinal or    (Ordinal or    (Ordinal or
                                                                         Nominal)       Nominal)       Nominal)
CHAID              Continuous,            ✔         ✔           ✔            ✔              ✔
                   Nominal, Ordinal
Exhaustive CHAID   Continuous,            ✔         ✔           ✔            ✔              ✔
                   Nominal, Ordinal
C&RT               Continuous,            ✔         ✔                        ✔              ✔              ✔
                   Nominal, Ordinal
QUEST              Nominal                          ✔                        ✔              ✔              ✔

Gain Summary
The gain summary provides descriptive statistics for the terminal nodes of a tree. It
allows you to identify desired terminal nodes based on the gain values. For example,
you can identify your best sales prospects, your most creditworthy loan applicants, or
your patients most likely to develop a certain disease.
If your target variable is continuous, the gain summary shows the average of the
target value for each terminal node. You can choose to sort the nodes in descending or
ascending order of gain, depending on whether the highest or lowest gain value is of
interest in your study.
If your target variable is categorical (nominal or ordinal), the gain summary shows
the percentage of cases in a selected target category or, if profits are defined for the
tree, the average profit value for each terminal node.

The index value shows the ratio of the gain value for each terminal node to the gain
value for the entire sample. It tells you how a node compares to the average.

Accuracy of the Tree


Once you have generated a tree, it is always important to consider the accuracy of your
tree. Accuracy refers to how well the tree predicts outcomes or classifies individuals.
(Accuracy is sometimes referred to as predictive validity.) Conversely, the inaccuracy
of the tree is called the risk. You can estimate the risk of your tree using one of three
methods: resubstitution of the full sample, partitioning to create a testing sample, or
cross-validation. In AnswerTree, risk estimates and misclassification tables report the
accuracy of the final tree.
Risk estimation using resubstitution is the easiest method, but it usually
underestimates the true risk. Partitioning the data into two subsets, one for training
and one for testing, is a good method when the data set is large enough. You use the
training data to grow the tree, and then you “drop” the testing data into the tree to see
how well they fit. The risk is computed based on the testing sample. Cross-validation
is useful when the data set is too small for partitioning. In AnswerTree, cross-
validation is available only when the data are not partitioned and the tree is grown
automatically. See Chapter 5 for more information about cross-validation.

Cost-Complexity Pruning
Cost-complexity pruning is a way to generate a tree of an appropriate size. If pruning
is not used, the tree may end up too large to be useful. In any case, the terminal nodes
may not provide useful information (the final splits may be superfluous). Perhaps the
most important reason to use pruning is to avoid overfitting, where your tree fits not
only the real patterns present in your data but also some of the unique “noise,” or error,
in your sample. An overfitted tree often does not generalize well to other data sets.
AnswerTree uses pruning to deal with this problem. In pruning the tree, the software
tries to create the smallest tree whose misclassification risk is not too much greater than
that of the largest tree possible. It removes a tree branch if the cost associated with
having a more complex tree exceeds the gain associated with having another level of
nodes (branch).

It uses an index that measures both the misclassification risk and the complexity of
the tree, since we want to minimize both of these things. This cost-complexity measure
is defined as follows:

$$R_\alpha(T) = R(T) + \alpha\tilde{T}$$

$R(T)$ is the misclassification risk of tree T, and $\tilde{T}$ is the number of terminal nodes for
tree T. This measure is a linear combination of the risk of tree T and its complexity. If
$\alpha$ is the complexity cost per terminal node, then $R_\alpha(T)$ is the sum of the risk of tree T
and its cost penalty for complexity. (Note that the value of $\alpha$ is calculated by the
algorithm during pruning.)
Any tree you might generate has a maximum size ($T_{max}$), in which each terminal
node contains only one case. With no complexity cost ($\alpha = 0$), the maximum tree has
the lowest risk, since every case is perfectly predicted. Thus, the larger the value of $\alpha$,
the fewer the number of terminal nodes in $T(\alpha)$, where $T(\alpha)$ is the tree with the
lowest complexity cost for the given $\alpha$. As $\alpha$ increases from 0, it produces a finite
sequence of subtrees ($T_1, T_2, T_3, \ldots$), each with progressively fewer terminal nodes.
Cost-complexity pruning works by removing the weakest split.
The following equations represent the cost complexity for $\{t\}$, which is any single
node, and for $T_t$, the subbranch of $\{t\}$:

$$R_\alpha(\{t\}) = R(t) + \alpha$$

$$R_\alpha(T_t) = R(T_t) + \alpha\tilde{T}_t$$

If $R_\alpha(T_t)$ is less than $R_\alpha(\{t\})$, then the branch $T_t$ has a smaller cost complexity than
the single node $\{t\}$.
The tree-growing process ensures that $R_\alpha(\{t\}) \geq R_\alpha(T_t)$ for $\alpha = 0$. As $\alpha$
increases from 0, both $R_\alpha(\{t\})$ and $R_\alpha(T_t)$ grow linearly, with the latter growing at a
faster rate. Eventually, you will reach a threshold $\alpha'$, such that $R_\alpha(\{t\}) < R_\alpha(T_t)$ for
all $\alpha > \alpha'$. This means that when $\alpha$ grows larger than $\alpha'$, the cost complexity of the
tree can be reduced if we cut the subbranch $T_t$ under $\{t\}$. Determining the threshold
is a simple computation. You can solve this first inequality, $R_\alpha(\{t\}) \geq R_\alpha(T_t)$, to find
the largest value of $\alpha$ for which the inequality holds, which is also represented by $g(t)$.
You end up with

$$\alpha \leq g(t) = \frac{R(t) - R(T_t)}{\tilde{T}_t - 1}$$
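
As a hypothetical illustration, consider a subbranch $T_t$ with three terminal nodes
($\tilde{T}_t = 3$) and risk $R(T_t) = 0.12$, hanging from a node whose own risk is $R(t) = 0.20$.
Then

$$g(t) = \frac{0.20 - 0.12}{3 - 1} = 0.04,$$

so this branch survives pruning only while the complexity cost $\alpha$ remains below 0.04.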

You can define the weakest link $\bar{t}$ in tree T as the node that has the smallest value of
$g(t)$:

$$g(\bar{t}) = \min_{t \in T} g(t)$$

Therefore, as $\alpha$ increases, $\bar{t}$ is the first node for which $R_\alpha(\{t\}) = R_\alpha(T_t)$. At that
point, $\{\bar{t}\}$ becomes preferable to $T_{\bar{t}}$ (and $\alpha' = g(\bar{t})$ is the value of $\alpha$ at which equality
occurs). In other words, the node becomes preferable to its subbranch, and the
subbranch is pruned.
Now we are ready to perform the pruning algorithm. Set $\alpha_1$ equal to 0 and start with
the tree $T_1 = T(0)$. Assume that the sequence of subtrees ($T_1, T_2, \ldots, T_k$) has been built
and that the corresponding values of $\alpha$ have been found for each subtree. By applying
the above algorithm for $T_k$, you find $\alpha_{k+1}$ and the weakest link, $\bar{t}_k$. Cut the branch
under this node to get $T_{k+1}$, the next-higher subtree in the sequence. Repeat the
process until you reach the root node of $T_{max}$.
The pruning process results in a decreasing sequence of subtrees
($T_1 > T_2 > \ldots > T_{root}$), where $T_k = T(\alpha_k)$ and $\alpha_1 = 0$. To obtain the optimally sized
tree, AnswerTree selects a subtree from the sequence based on its corresponding risk
estimate.
You can compute the risk estimate in one of two ways. You can use a separate
sample of data, called the test sample, to calculate the risk of a tree that is grown by
another sample, called the learning sample. Or you can use resubstitution, in which the
learning sample also serves as the test sample. Results based on a separate test sample
are usually more reliable. No matter which method of calculation you choose, though,
the one-standard-error rule applies: choose the simplest tree whose risk is less than
one standard error greater than the tree with the smallest risk. That is, if $T_{k_0}$ is such that

$$\hat{R}(T_{k_0}) = \min_k \hat{R}(T_k),$$

choose $T_{k_1}$, where $k_1$ is the largest k satisfying

$$\hat{R}(T_{k_1}) \leq \hat{R}(T_{k_0}) + SE\bigl(\hat{R}(T_{k_0})\bigr)$$

Note that the one-standard-error rule is somewhat subjective. For that reason, you can
set the standard error multiplier in the Advanced Options for pruning. You can also
simply choose the tree with the smallest risk.
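
As a hypothetical illustration, if the lowest-risk subtree in the sequence has an estimated
risk of 0.150 with a standard error of 0.020, the one-standard-error rule selects the
smallest subtree whose estimated risk does not exceed $0.150 + 0.020 = 0.170$.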
Bibliography

Berndt, E. 1991. The practice of economics: Classic and contemporary. Reading, Mass.:
Addison-Wesley.
Biggs, D., B. de Ville, and E. Suen. 1991. A method of choosing multiway partitions for
classification and decision trees. Journal of Applied Statistics, 18: 49–62.
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and regression
trees. Belmont, Calif.: Wadsworth.
Clogg, C. C., and S. R. Eliason. 1987. Some common problems in log-linear analysis. Sociological
Methods and Research, 16:1, 8–44.
Fisher, R. A. 1936. The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7: 179–188.
Gnanadesikan, R. 1977. Methods for statistical data analysis of multivariate observations.
New York: John Wiley & Sons, Inc.
Goodman, L. A. 1979. Simple models for the analysis of association in cross-classifications
having ordered categories. Journal of the American Statistical Association, 74: 537–552.
Harrison, D., and D. L. Rubinfeld. 1978. Hedonic prices and the demand for clean air. Journal of
Environmental Economics & Management, 5: 81–102.
Kass, G. 1980. An exploratory technique for investigating large quantities of categorical data.
Applied Statistics, 29:2, 119–127.
Loh, W. Y., and Y. S. Shih. 1997. Split selection methods for classification trees. Statistica
Sinica, 7: 815–840.
Magidson, J. 1992. Chi-squared analysis of a scalable dependent variable. In Proceedings of the
1992 Annual Meeting of the American Statistical Association, Educational Statistics Section.
Magidson, J., and SPSS Inc. 1993. SPSS for Windows CHAID Release 6.0. Chicago: SPSS Inc.
Merz, C. J., and P. M. Murphy. 1996. UCI repository of machine learning databases
(http://www.ics.uci.edu/~mlearn/mlrepository.html). Department of Information and
Computer Science, University of California, Irvine.