Rapid Miner - Data Preparation
NORMALIZATION
Normalization is used to scale values so they fit in a specific range. Adjusting the value range is
very important when dealing with Attributes of different units and scales. For example, when
using the Euclidean distance all Attributes should have the same scale for a fair comparison.
Normalization is useful to compare Attributes that vary in size. This Operator performs
normalization of the selected Attributes. Four normalization methods are provided. These
methods are explained in the parameters.
Differentiation
Scale by Weights
This Operator can be used to scale Attributes by pre-calculated weights. Instead of adjusting the
value range to a common scale, this Operator can be used to give important Attributes even more
weight.
De-Normalize
This Operator can be used to revert a previously applied normalization. It requires the pre-
processing model returned by a Normalization Operator.
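The relationship between a normalization and its reversal can be sketched in a few lines of Python. This is only an illustration, not RapidMiner's implementation: the function names are hypothetical, the "model" is assumed to be just the mean and standard deviation of a z-transformation, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def fit_z_model(values):
    """Return the (mean, standard deviation) pair that plays the role of the model."""
    return mean(values), pstdev(values)  # population std dev is an assumption

def normalize(values, model):
    """Apply the z-transformation described by the model."""
    mu, sigma = model
    return [(v - mu) / sigma for v in values]

def denormalize(values, model):
    """Revert the z-transformation using the same model."""
    mu, sigma = model
    return [v * sigma + mu for v in values]

ages = [22.0, 38.0, 26.0, 35.0]  # made-up values for illustration
model = fit_z_model(ages)
restored = denormalize(normalize(ages, model), model)
```

Without the stored model the original scale cannot be recovered, which is why the De-Normalize Operator needs the pre-processing model as an input.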
Parameters
create view
Create a View instead of changing the underlying data. If this option is checked, the
normalization is delayed until the transformations are needed. This parameter can be
considered a legacy option.
Range:
attribute_filter_type
This parameter allows you to select the Attribute selection filter; the method you want to
use for selecting Attributes. It has the following options:
o all: This option selects all the Attributes of the Example Set, so that no Attributes
are removed. This is the default option.
o single: This option allows the selection of a single Attribute. The required
Attribute is selected by the attribute parameter.
o subset: This option allows the selection of multiple Attributes through a list (see
parameter attributes). If the meta data of the Example Set is known, all Attributes
are present in the list and the required ones can easily be selected.
o regular_expression: This option allows you to specify a regular expression for
the Attribute selection. The regular expression filter is configured by the
parameters regular expression, use except expression and except expression.
o value_type: This option allows selection of all the Attributes of a particular type.
It should be noted that types are hierarchical. For example, both real and integer
types belong to the numeric type. The value type filter is configured by the
parameters value type, use value type exception, except value type.
o block_type: This option allows the selection of all the Attributes of a particular
block type. It should be noted that block types may be hierarchical. For example,
value_series_start and value_series_end block types both belong to the
value_series block type. The block type filter is configured by the parameters
block type, use block type exception, except block type.
o no_missing_values: This option selects all Attributes of the Example Set that
do not contain a missing value in any Example. Attributes that have even a single
missing value are removed.
o numeric_value_filter: All numeric Attributes whose Examples all match a given
numeric condition are selected. The condition is specified by the numeric
condition parameter. Please note that all nominal Attributes are also selected
irrespective of the given numerical condition.
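As a rough illustration of how the regular_expression option combines an expression with an except expression, consider the following Python sketch. The function name and the attribute names are hypothetical, and it assumes that an expression must match the whole Attribute name.

```python
import re

def select_by_regex(attribute_names, expression, except_expression=None):
    """Keep names that match `expression` but not `except_expression`."""
    selected = []
    for name in attribute_names:
        if not re.fullmatch(expression, name):
            continue  # does not match the main expression
        if except_expression is not None and re.fullmatch(except_expression, name):
            continue  # matched, but explicitly excluded
        selected.append(name)
    return selected

# Illustrative attribute names, not taken from a real Example Set
names = ["Age", "Passenger Fare", "Passenger Class", "Sex"]
picked = select_by_regex(names, r"Passenger.*", except_expression=r".*Class")
```

Here only 'Passenger Fare' remains: both 'Passenger' Attributes match the main expression, but 'Passenger Class' is dropped by the except expression.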
Method
Four methods are provided here for normalizing data.
z_transformation: This is also called statistical normalization. This normalization subtracts the mean of
the data from all values and then divides them by the standard deviation. Afterwards, the distribution of
the data has a mean of zero and a variance of one. This is a common and very useful normalization
technique. It preserves the shape of the original distribution and is less influenced by outliers than
range transformation.
range_transformation: Range transformation normalizes all Attribute values to a specified value range.
When this method is selected, two other parameters (min, max) appear in the Parameters panel. So the
largest value is set to 'max' and the smallest value is set to 'min'. All other values are scaled, so they fit
into the given range. This method can be influenced by outliers, because the bounds move towards them.
On the other hand, this method keeps the original distribution of the data points, so it can also be used for
data anonymization, for example to obfuscate the true range of observations.
proportion_transformation: This normalization expresses each value as its share of the Attribute's total.
This means each value is divided by the sum of all values of that Attribute. The
sum is only formed from finite values, ignoring NaN/missing values as well as positive and negative
infinity. When this method is selected, another parameter (allow negative values) appears in the
Parameters panel. If checked, negative values are treated as their absolute values; otherwise they
produce an error when the process is executed.
interquartile_range: Normalization is performed using the interquartile range. The interquartile range is
the distance between the 25th and 75th percentile, which are also called lower and upper quartile, or Q1
and Q3. They are calculated by first sorting the data and then taking the data value that separates the first
(or the last) 25% of the Examples from the rest. The median is the 50th percentile, so it is the value that
separates the sorted values in half. The interquartile range (IQR) is the difference between Q3 and Q1.
The final formula for the interquartile range normalization is then: (value - median) / IQR. The IQR is the
range covered by the middle 50% of the data, so this normalization method is less influenced by outliers.
NaN/missing values, as well as infinite values, are ignored for this method. Also, if no finite values
can be found, the corresponding Attribute is ignored.
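The four methods above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not RapidMiner's code: the function names simply mirror the method names, the population standard deviation and the "inclusive" quartile rule are assumptions, the handling of negative values in the proportion sum is a guess, and missing or infinite values are not handled.

```python
from statistics import mean, median, pstdev, quantiles

def z_transformation(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def range_transformation(values, new_min=0.0, new_max=1.0):
    """Map the smallest value to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def proportion_transformation(values, allow_negative_values=False):
    """Divide each value by the Attribute total (absolute values if allowed)."""
    if not allow_negative_values and any(v < 0 for v in values):
        raise ValueError("negative value encountered")
    magnitudes = [abs(v) for v in values]  # assumption: the sum also uses absolute values
    total = sum(magnitudes)
    return [m / total for m in magnitudes]

def interquartile_range(values):
    """Compute (value - median) / IQR with IQR = Q3 - Q1."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

values = [2.0, 4.0, 6.0, 8.0, 10.0]  # made-up values for illustration
ranged = range_transformation(values)
iqr_scaled = interquartile_range(values)
```

For these evenly spaced values the range transformation yields 0, 0.25, 0.5, 0.75 and 1, while the interquartile range method yields -1, -0.5, 0, 0.5 and 1, because the median is 6 and the IQR is 4.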
min
This parameter is available only when the method parameter is set to 'range
transformation'. It is used to specify the minimum point of the range.
max
This parameter is available only when the method parameter is set to 'range
transformation'. It is used to specify the maximum point of the range.
allow_negative_values
This parameter is available only when the method parameter is set to 'proportion
transformation'. It is used to allow or disallow negative values in the processed
Attributes. If allowed, negative values are counted as their absolute values.
EXAMPLE SET
NORMALIZED
TASK:
Normalizing Age and Passenger Fare for the Titanic data
This process takes the Age and the Passenger Fare Attributes from the Titanic data and performs a
normalization on them. The Attributes have a very different range of values (the highest Age is 80 and the
highest fare is around 500). Also, the Passenger Fare has one value that is much higher than all the other
fares, so it can be considered an outlier. When applying the Z-Transformation, both Attributes are centred around 0.
When changing the method to Interquartile Range, the values of the Passenger Fare are spread out a bit
more evenly, as the one outlier does not have so much influence. For visualization, it is best to use the
Histogram charts view.
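The effect of a single outlier on the two methods can be reproduced with a few numbers. The fares below are made up for illustration and are not the Titanic values; the population standard deviation and the "inclusive" quartile rule are assumptions.

```python
from statistics import mean, median, pstdev, quantiles

# Made-up fares with one extreme value at the end (not the real Titanic data)
fares = [7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 52.0, 80.0, 512.0]

# Z-transformation: the outlier inflates the standard deviation
mu, sigma = mean(fares), pstdev(fares)
z_scores = [(f - mu) / sigma for f in fares]

# Interquartile range normalization: quartiles barely notice the outlier
q1, _, q3 = quantiles(fares, n=4, method="inclusive")
med = median(fares)
iqr_scores = [(f - med) / (q3 - q1) for f in fares]

# Spread of the nine ordinary fares under each method
z_spread = max(z_scores[:-1]) - min(z_scores[:-1])
iqr_spread = max(iqr_scores[:-1]) - min(iqr_scores[:-1])
```

In this sketch the ordinary fares occupy a wider interval after the interquartile range normalization than after the z-transformation, which matches the observation that they are spread out more evenly.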
REPLACE MISSING VALUES
This Operator replaces missing values in Examples of selected Attributes by a specified replacement.
Missing values can be replaced by the minimum, maximum or average value of that Attribute.
Zero can also be used to replace missing values. Alternatively, any user-specified replenishment
value can be used as the replacement.
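A minimal Python sketch of these replacement strategies is shown below. It is an illustration only: the function and parameter names are hypothetical, and missing values are represented by None.

```python
from statistics import mean

def replace_missing(values, default="average", replenishment_value=None):
    """Replace None entries using the chosen strategy."""
    present = [v for v in values if v is not None]
    strategies = {
        "minimum": lambda: min(present),
        "maximum": lambda: max(present),
        "average": lambda: mean(present),
        "zero": lambda: 0,
        "value": lambda: replenishment_value,  # user-specified replacement
    }
    fill = strategies[default]()
    return [fill if v is None else v for v in values]

ages = [22, None, 26, 35, None]  # made-up values for illustration
with_minimum = replace_missing(ages, "minimum")
with_value = replace_missing(ages, "value", replenishment_value=30)
```

Note that the statistic is computed from the present values only, so the missing entries do not influence the replacement.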
Differentiation
Impute Missing Values
This Operator estimates values for the missing values by applying a model learned for missing
values.
Declare Missing Value
In contrast to the Replace Missing Values Operator, this Operator sets specific values of selected
Attributes to missing values.
RESULT WITH MISSING VALUES
REPLACING MISSING VALUES
REMOVE DUPLICATES
The Remove Duplicates operator removes duplicate examples from an Example Set by
comparing all examples with each other on the basis of the specified attributes. Two
examples are considered duplicates if the selected attributes have the same values in both.
Of each group of duplicate examples, only one is kept.
Attributes can be selected from the attribute filter type parameter and other associated
parameters. Suppose two attributes 'att1' and 'att2' are selected and 'att1' and 'att2' have three and
two possible values respectively. Then there are a total of 6 (i.e. 3 x 2) unique combinations of these
two attributes, so the resultant Example Set can have 6 examples at most. This operator works
on all attribute types.
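The idea can be sketched in Python as follows. This is an illustration rather than the operator's implementation: the function name is hypothetical, examples are modelled as dictionaries, and keeping the first occurrence of each combination is an assumption.

```python
def remove_duplicates(examples, attributes):
    """Keep one example per unique combination of the selected attributes."""
    seen = set()
    kept = []
    for example in examples:
        key = tuple(example[a] for a in attributes)  # values of the selected attributes
        if key not in seen:
            seen.add(key)
            kept.append(example)  # assumption: the first occurrence is the one kept
    return kept

# Made-up examples; att3 is not selected and therefore ignored in the comparison
examples = [
    {"att1": "a", "att2": "x", "att3": 1},
    {"att1": "a", "att2": "x", "att3": 2},
    {"att1": "b", "att2": "y", "att3": 3},
    {"att1": "a", "att2": "y", "att3": 4},
    {"att1": "b", "att2": "y", "att3": 5},
]
unique = remove_duplicates(examples, ["att1", "att2"])
```

Three examples remain here, one for each combination of 'att1' and 'att2' that actually occurs in the data.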
TASK
Removing duplicate values from the Golf data set on the basis of the Outlook and Wind
attributes
The 'Golf' data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you
can have a look at the Example Set. You can see that the Outlook attribute has three possible
values, i.e. 'sunny', 'rain' and 'overcast'. The Wind attribute has two possible values, i.e. 'true' and
'false'. The Remove Duplicates operator is applied on this Example Set to remove duplicate
examples on the basis of the Outlook and Wind attributes. The attribute filter type parameter is
set to 'value type' and the value type parameter is set to 'nominal', thus two examples that have
the same values in their Outlook and Wind attributes are considered duplicates. Note that the Play
attribute is not selected although its value type is nominal, because it is a special attribute
(it has the label role). To select attributes with special roles, the include special attributes
parameter should be set to true. The Outlook and Wind attributes have 3 and 2 possible values
respectively. Thus, the resultant Example Set will have 6 examples at most, i.e. one example for
each possible combination of attribute values. You can see the resultant Example Set in the
Results Workspace. You can see that it has 6 examples and all examples have a different
combination of the Outlook and Wind attribute values.
DETECT OUTLIERS
OUTLIER DETECTION
This operator identifies n outliers in the given Example Set based on the distance to
their k nearest neighbors. The variables n and k can be specified through parameters.
Parameters
number_of_neighbors
This parameter specifies the k value for the k-th nearest neighbors to be analyzed.
The minimum and maximum values for this parameter are 1 and 1 million respectively.
Range: integer
number_of_outliers
This parameter specifies the number of top-n outliers to be looked for. The resultant
Example Set will have n examples that are considered outliers. The minimum and
maximum values for this parameter are 2 and 1 million respectively.
Range: integer
distance_function
This parameter specifies the distance function that will be used for calculating the
distance between two examples.
Range: selection
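A distance-based top-n outlier search of this kind can be sketched in Python as follows. This is a brute-force illustration under assumptions, not the operator's implementation: the function name is hypothetical, the outlier score is assumed to be the Euclidean distance to the k-th nearest neighbor, and ties are broken arbitrarily.

```python
import math

def detect_outliers(points, number_of_neighbors, number_of_outliers):
    """Flag the n points with the largest distance to their k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[number_of_neighbors - 1])
    # Indices ordered by descending score; the first n are the outliers
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:number_of_outliers])
    return [i in flagged for i in range(len(points))]

# Made-up 2D points: a small cluster plus two distant points
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (-8, 9)]
flags = detect_outliers(points, number_of_neighbors=2, number_of_outliers=2)
```

The returned list plays the role of the boolean 'outlier' attribute: exactly n entries are true, here the two points far away from the cluster.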
TASK
The Generate Data operator is used for generating an Example Set. The target function parameter
is set to 'gaussian mixture clusters'. The number of examples and number of attributes parameters
are set to 200 and 2 respectively. A breakpoint is inserted here so that you can view the Example
Set in the Results Workspace. A good plot of the Example Set can be seen by switching to the
'Plot View' tab. Set Plotter to 'Scatter', x-Axis to 'att1' and y-Axis to 'att2' to view the scatter plot
of the Example Set.
The Detect Outlier (Distances) operator is applied on this Example Set. The number of neighbors
and number of outliers parameters are set to 4 and 12 respectively. Thus 12 examples of the
resultant Example Set will have the value true in the 'outlier' attribute. This can be verified by
viewing the Example Set in the Results Workspace. For better understanding, switch to the 'Plot
View' tab. Set Plotter to 'Scatter', x-Axis to 'att1', y-Axis to 'att2' and Color Column to 'outlier' to
view the scatter plot of the Example Set (the outliers are marked red).