Algorithm and module reference
This reference content provides the technical background on each of the machine learning algorithms and
modules available in Azure Machine Learning designer.
Each module represents a set of code that can run independently and perform a machine learning task, given
the required inputs. A module might contain a particular algorithm, or perform a task that is important in
machine learning, such as missing value replacement, or statistical analysis.
For help with choosing algorithms, see
How to select algorithms
Azure Machine Learning Algorithm Cheat Sheet
TIP
In any pipeline in the designer, you can get information about a specific module. Select the Learn more link in the module card when you hover over the module in the module list, or in the right pane of the module.
| Functionality | Description | Modules |
| --- | --- | --- |
| Data Input and Output | Move data from cloud sources into your pipeline. Write your results or intermediate data to Azure Storage, SQL Database, or Hive, while running a pipeline, or use cloud storage to exchange data between pipelines. | Enter Data Manually, Export Data, Import Data |
| Feature Selection | Select a subset of relevant, useful features to use in building an analytical model. | Filter Based Feature Selection, Permutation Feature Importance |
| Classification | Predict a class. Choose from binary (two-class) or multiclass algorithms. | Multiclass Boosted Decision Tree, Multiclass Decision Forest, Multiclass Logistic Regression, Multiclass Neural Network, One vs. All Multiclass, One vs. One Multiclass, Two-Class Averaged Perceptron, Two-Class Boosted Decision Tree, Two-Class Decision Forest, Two-Class Logistic Regression, Two-Class Neural Network, Two-Class Support Vector Machine |
| Model Training | Run data through the algorithm. | Train Clustering Model, Train Model, Train PyTorch Model, Tune Model Hyperparameters |
| Model Scoring and Evaluation | Measure the accuracy of the trained model. | Apply Transformation, Assign Data to Clusters, Cross Validate Model, Evaluate Model, Score Image Model, Score Model |
| Python Language | Write code and embed it in a module to integrate Python with your pipeline. | Create Python Model, Execute Python Script |
| Computer Vision | Image data preprocessing and image recognition related modules. | Apply Image Transformation, Convert to Image Directory, Init Image Transformation, Split Image Directory, DenseNet, ResNet |
Web service
Learn about the web service modules which are necessary for real-time inference in Azure Machine Learning
designer.
Error messages
Learn about the error messages and exception codes you might encounter using modules in Azure Machine
Learning designer.
Next steps
Tutorial: Build a model in designer to predict auto prices
Import Data module
NOTE
All functionality provided by this module can be done by datastores and datasets in the workspace landing page. We recommend that you use datastores and datasets, which include additional features like data monitoring. To learn more, see the How to Access Data and How to Register Datasets articles. After you register a dataset, you can find it in the Datasets -> My Datasets category in the designer interface. This module is reserved for Studio (classic) users for a familiar experience.
The Import Data module supports reading data from the following sources:
URL via HTTP
Azure cloud storage through Datastores:
Azure Blob Container
Azure File Share
Azure Data Lake
Azure Data Lake Gen2
Azure SQL Database
Azure PostgreSQL
Before you use cloud storage, you need to register a datastore in your Azure Machine Learning workspace. For more information, see How to Access Data.
After you define the data you want and connect to the source, Import Data infers the data type of each column based on the values it contains, and loads the data into your designer pipeline. The output of Import Data is a dataset that can be used with any designer pipeline.
If your source data changes, you can refresh the dataset and add new data by rerunning Import Data.
WARNING
If your workspace is in a virtual network, you must configure your datastores to use the designer's data visualization
features. For more information on how to use datastores and datasets in a virtual network, see Use Azure Machine
Learning studio in an Azure virtual network.
NOTE
The Import Data module is for tabular data only. To import multiple tabular data files at one time, the following conditions must be met; otherwise, errors will occur:
1. To include all data files in the folder, you need to enter folder_name/** for Path .
2. All data files must be encoded in UTF-8.
3. All data files must have the same number of columns and the same column names.
4. The result of importing multiple data files is the concatenation of all rows from the files, in order.
4. Select the preview schema to filter the columns you want to include. You can also define advanced
settings like Delimiter in Parsing options.
5. The Regenerate output checkbox determines whether to run the module again to regenerate its output at run time.
It's unselected by default, which means that if the module has previously been executed with the same parameters, the system reuses the output from the last run to reduce run time.
If it is selected, the system executes the module again to regenerate the output. Select this option when the underlying data in storage is updated, so that the pipeline picks up the latest data.
6. Submit the pipeline.
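If you prefer the recommended datastore and dataset route over the Import Data module, the following sketch shows roughly how the same import could be done with the Azure Machine Learning Python SDK (v1). The datastore name, path, and dataset name below are placeholders, not values from this article:

```python
# A minimal sketch (not the designer module itself) of the recommended
# datastore/dataset route with the Azure ML SDK v1.
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()                       # reads config.json for the workspace
datastore = Datastore.get(ws, "workspaceblobstore")

# The /** wildcard pulls in every file under the folder, mirroring the
# folder_name/** Path pattern described above.
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "data/testinput/**")
)
dataset = dataset.register(ws, name="my-imported-data", create_new_version=True)
df = dataset.to_pandas_dataframe()                 # materialize for inspection
```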
When Import Data loads the data into the designer, it infers the data type of each column based on the
values it contains, either numerical or categorical.
If a header is present, the header is used to name the columns of the output dataset.
If there are no existing column headers in the data, new column names are generated using the format col1, col2, …, coln.
Results
When import completes, right-click the output dataset and select Visualize to see if the data was imported
successfully.
If you want to save the data for reuse, rather than importing a new set of data each time the pipeline is run,
select the Register dataset icon under the Outputs+logs tab in the right panel of the module. Choose a name
for the dataset. The saved dataset preserves the data at the time of saving; it is not updated when the pipeline is rerun, even if the dataset in the pipeline changes. This can be useful for taking snapshots of data.
After importing the data, it might need some additional preparations for modeling and analysis:
Use Edit Metadata to change column names, to handle a column as a different data type, or to indicate
that some columns are labels or features.
Use Select Columns in Dataset to select a subset of columns to transform or use in modeling. The
transformed or removed columns can easily be rejoined to the original dataset by using the Add
Columns module.
Use Partition and Sample to divide the dataset, perform sampling, or get the top n rows.
Limitations
Due to datastore access limitations, if your inference pipeline contains an Import Data module, it will be automatically removed when the pipeline is deployed to a real-time endpoint.
Next steps
See the set of modules available to Azure Machine Learning.
Enter Data Manually module
Create a dataset
1. Add the Enter Data Manually module to your pipeline. You can find this module in the Data Input and
Output category in Azure Machine Learning.
2. For DataFormat , select one of the following options. These options determine how the data that you
provide should be parsed. The requirements for each format differ greatly, so be sure to read the related
topics.
ARFF : Attribute-relation file format used by Weka.
CSV : Comma-separated values format. For more information, see Convert to CSV.
SVMLight : Format used by Vowpal Wabbit and other machine learning frameworks.
TSV : Tab-separated values format.
If you choose a format and do not provide data that meets the format specifications, a runtime error
occurs.
3. Click inside the Data text box to start entering data. The following formats require special attention:
CSV : To create multiple columns, paste in comma-separated text, or type multiple columns by
using commas between fields.
If you select the HasHeader option, you can use the first row of values as the column heading.
If you deselect this option, the column names (Col1, Col2, and so forth) are used. You can add or change column names later by using Edit Metadata.
TSV : To create multiple columns, paste in tab-separated text, or type multiple columns by using
tabs between fields.
If you select the HasHeader option, you can use the first row of values as the column heading.
If you deselect this option, the column names (Col1, Col2, and so forth) are used. You can add or change column names later by using Edit Metadata.
ARFF : Paste in an existing ARFF format file. If you're typing values directly, be sure to add the
optional header and required attribute fields at the beginning of the data.
For example, the following header and attribute rows can be added to a simple list. The column
heading would be SampleText . Note that the String type is not supported.
% Title: SampleText.ARFF
% Source: Enter Data module
@ATTRIBUTE SampleText NUMERIC
@DATA
<type first data row here>
When you run the Enter Data Manually module, these lines are converted to a dataset of columns
and index values as follows:
4. Select the Enter key after each row, to start a new line.
If you select Enter multiple times to add multiple empty trailing rows, the empty rows will be removed or
trimmed.
If you create rows with missing values, you can always filter them out later.
5. Connect the output port to other modules, and run the pipeline.
To view the dataset, right-click the module and select Visualize .
Next steps
See the set of modules available to Azure Machine Learning.
Export Data module
NOTE
Exporting data of a certain data type to a SQL database column specified as another data type is not supported.
The target table does not need to exist first.
5. The Regenerate output checkbox determines whether to run the module again to regenerate its output at run time.
It's unselected by default, which means that if the module has previously been executed with the same parameters, the system reuses the output from the last run to reduce run time.
If it is selected, the system executes the module again to regenerate the output.
6. Define the path in the datastore where the data will be written. The path is a relative path. For example, if you enter data/testoutput , the input data of Export Data will be exported to data/testoutput in the datastore that you set in the Output settings of the module.
NOTE
Empty paths and URL paths are not allowed.
7. For File format , select the format in which data should be stored.
8. Submit the pipeline.
Limitations
Due to datastore access limitations, if your inference pipeline contains an Export Data module, it will be automatically removed when the pipeline is deployed to a real-time endpoint.
Next steps
See the set of modules available to Azure Machine Learning.
Add Columns module
Next steps
See the set of modules available to Azure Machine Learning.
Add Rows module
Next steps
See the set of modules available to Azure Machine Learning.
Apply Math Operation
Results
If you generate the results using the Append or ResultOnly options, the column headings of the returned
dataset indicate the operation and the columns that were used. For example, if you compare two columns using
the Equals operator, the results would look like this:
Equals(Col2_Col1) , indicating that you tested Col2 against Col1.
Equals(Col2_$10) , indicating that you compared column 2 to the constant 10.
Even if you use the In place option, the source data is not deleted or changed; the column in the original dataset
is still available in the designer. To view the original data, you can connect the Add Columns module and join it to
the output of Apply Math Operation .
Comparison operations
Use the comparison functions in Azure Machine Learning designer anytime that you need to test two sets of
values against each other. For example, in a pipeline you might need to do these comparison operations:
Evaluate a column of probability scores against a threshold value.
Determine if two sets of results are the same. For each row that is different, add a FALSE flag that can be used
for further processing or filtering.
EqualTo
Returns True if the values are the same.
GreaterThan
Returns True if the values in Column set are greater than the specified constant, or greater than the
corresponding values in the comparison column.
GreaterThanOrEqualTo
Returns True if the values in Column set are greater than or equal to the specified constant, or greater than or
equal to the corresponding values in the comparison column.
LessThan
Returns True if the values in Column set are less than the specified constant, or less than the corresponding
values in the comparison column.
LessThanOrEqualTo
Returns True if the values in Column set are less than or equal to the specified constant, or less than or equal to
the corresponding values in the comparison column.
NotEqualTo
Returns True if the values in Column set are not equal to the constant or comparison column, and returns False
if they are equal.
PairMax
Returns the value that is greater—the value in Column set or the value in the constant or comparison column.
PairMin
Returns the value that is lesser—the value in Column set or the value in the constant or comparison column
Arithmetic operations
Includes the basic arithmetic operations: addition and subtraction, division, and multiplication. Because most
operations are binary, requiring two numbers, you first choose the operation, and then choose the column or
numbers to use in the first and second arguments.
The order of arguments for subtraction and division is as follows:
Subtract(Arg1_Arg2) = Arg1 - Arg2
Divide(Arg1_Arg2) = Arg1 / Arg2
The following table shows some examples:

| Operation | Num2 | Num1 | Expression | Result |
| --- | --- | --- | --- | --- |
| Addition | 1 | 5 | Add(Num2_Num1) | 6 |
| Multiplication | 1 | 5 | Multiple(Num2_Num1) | 5 |
| Subtraction | 5 | 1 | Subtract(Num2_Num1) | 4 |
| Subtraction | 0 | 1 | Subtract(Num2_Num1) | -1 |
| Division | 5 | 1 | Divide(Num2_Num1) | 5 |
Add
Specify the source columns by using Column set , and then add to those values a number specified in Second
argument .
To add the values in two columns, choose a column or columns using Column set , and then choose a second
column using Second argument .
Divide
Divides the values in Column set by a constant or by the column values defined in Second argument . In
other words, you pick the dividend first, and then the divisor. The output value is the quotient.
Multiply
Multiplies the values in Column set by the specified constant or column values.
Subtract
Specify the column of values to operate on (the minuend), by choosing a different column, using the Column
set option. Then, specify the number to subtract (the subtrahend) by using the Second argument dropdown
list. You can choose either a constant or column of values.
Rounding operations
Azure Machine Learning designer supports a variety of rounding operations. For many operations, you must
specify the amount of precision to use when rounding. You can use either a static precision level, specified as a
constant, or you can apply a dynamic precision value obtained from a column of values.
If you use a constant, set Precision Type to Constant and then type the number of digits as an integer
in the Constant Precision text box. If you type a non-integer, the module does not raise an error, but
results can be unexpected.
To use a different precision value for each row in your dataset, set Precision Type to ColumnSet , and
then choose the column that contains appropriate precision values.
Ceiling
Returns the ceiling for the values in Column set .
CeilingPower2
Returns the values in Column set rounded up to the nearest power of two.
Floor
Returns the floor for the values in Column set , to the specified precision.
Mod
Returns the fractional part of the values in Column set , to the specified precision.
Quotient
Returns the fractional part of the values in Column set , to the specified precision.
Remainder
Returns the remainder for the values in Column set .
RoundDigits
Returns the values in Column set , rounded by the 4/5 rule to the specified number of digits.
RoundDown
Returns the values in Column set , rounded down to the specified number of digits.
RoundUp
Returns the values in Column set , rounded up to the specified number of digits.
ToEven
Returns the values in Column set , rounded to the nearest whole, even number.
ToOdd
Returns the values in Column set , rounded to the nearest whole, odd number.
Truncate
Truncates the values in Column set by removing all digits not allowed by the specified precision.
Trigonometric functions
This category includes most of the important trigonometric and inverse trigonometric functions. All
trigonometric functions are unary and require no additional arguments.
Acos
Calculates the arccosine for the column values.
AcosDegree
Calculates the arccosine of the column values, in degrees.
Acosh
Calculates the hyperbolic arccosine of the column values.
Acot
Calculates the arccotangent of the column values.
AcotDegrees
Calculates the arccotangent of the column values, in degrees.
Acoth
Calculates the hyperbolic arccotangent of the column values.
Acsc
Calculates the arccosecant of the column values.
AcscDegrees
Calculates the arccosecant of the column values, in degrees.
Asec
Calculates the arcsecant of the column values.
AsecDegrees
Calculates the arcsecant of the column values, in degrees.
Asech
Calculates the hyperbolic arcsecant of the column values.
Asin
Calculates the arcsine of the column values.
AsinDegrees
Calculates the arcsine of the column values, in degrees.
Asinh
Calculates the hyperbolic arcsine for the column values.
Atan
Calculates the arctangent of the column values.
AtanDegrees
Calculates the arctangent of the column values, in degrees.
Atanh
Calculates the hyperbolic arctangent of the column values.
Cos
Calculates the cosine of the column values.
CosDegrees
Calculates the cosine for the column values, in degrees.
Cosh
Calculates the hyperbolic cosine for the column values.
Cot
Calculates the cotangent for the column values.
CotDegrees
Calculates the cotangent for the column values, in degrees.
Coth
Calculates the hyperbolic cotangent for the column values.
Csc
Calculates the cosecant for the column values.
CscDegrees
Calculates the cosecant for the column values, in degrees.
Csch
Calculates the hyperbolic cosecant for the column values.
DegreesToRadians
Converts degrees to radians.
Sec
Calculates the secant of the column values.
SecDegrees
Calculates the secant for the column values, in degrees.
Sech
Calculates the hyperbolic secant of the column values.
Sign
Returns the sign of the column values.
Sin
Calculates the sine of the column values.
Sinc
Calculates the sinc function, sin(x)/x, of the column values.
SinDegrees
Calculates the sine for the column values, in degrees.
Sinh
Calculates the hyperbolic sine of the column values.
Tan
Calculates the tangent of the column values.
TanDegrees
Calculates the tangent for the column values, in degrees.
Tanh
Calculates the hyperbolic tangent of the column values.
Technical notes
Be careful when you select more than one column as the second operator. The results are easy to understand if
the operation is simple, such as adding a constant to all columns.
Assume your dataset has multiple columns, and you add the dataset to itself. In the results, each column is
added to itself, as follows:
| Col1 | Col2 | Col3 | Add(Col1_Col1) | Add(Col2_Col2) | Add(Col3_Col3) |
| --- | --- | --- | --- | --- | --- |
| 1 | 5 | 2 | 2 | 10 | 4 |
| 2 | 3 | -1 | 4 | 6 | -2 |
| 0 | 1 | -1 | 0 | 2 | -2 |
If you need to perform more complex calculations, you can chain multiple instances of Apply Math Operation .
For example, you might add two columns by using one instance of Apply Math Operation , and then use
another instance of Apply Math Operation to divide the sum by a constant to obtain the mean.
Alternatively, use one of the following modules to do all the calculations at once, using SQL, R, or Python script:
Execute R Script
Execute Python Script
Apply SQL Transformation
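For reference, here is a minimal sketch of what such a combined calculation might look like inside the Execute Python Script module. The column names Num1 and Num2 are hypothetical, and azureml_main is the entry point that the module expects:

```python
# Sketch of an Execute Python Script entry point that chains two operations in one
# step: add two (hypothetical) columns, then divide the sum by a constant.
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    df = dataframe1.copy()
    df["Sum"] = df["Num1"] + df["Num2"]   # equivalent of Add(Num1_Num2)
    df["Mean"] = df["Sum"] / 2            # equivalent of Divide(Sum_$2)
    return df,
```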
Next steps
See the set of modules available to Azure Machine Learning.
Apply SQL Transformation
IMPORTANT
The SQL engine used in this module is SQLite . For more information about SQLite syntax, see SQL as Understood by
SQLite. This module loads the data into SQLite, which is an in-memory database, so module execution requires much more memory and may hit an Out of memory error. Make sure your computer has enough RAM.
SELECT t1.*
, t3.Average_Rating
FROM t1 join
(SELECT placeID
, AVG(rating) AS Average_Rating
FROM t2
GROUP BY placeID
) as t3
on t1.placeID = t3.placeID
The remaining parameter is a SQL query, which uses the SQLite syntax. When typing multiple lines in the SQL
Script text box, use a semi-colon to terminate each statement. Otherwise, line breaks are converted to spaces.
This module supports all standard statements of the SQLite syntax. For a list of unsupported statements, see the
Technical Notes section.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
An input is always required on port 1.
For column identifiers that contain a space or other special characters, always enclose the column
identifier in square brackets or double quotation marks when referring to the column in the SELECT or
WHERE clauses.
If you have used Edit Metadata to specify the column metadata (categorical or fields) before Apply
SQL Transformation , the outputs of Apply SQL Transformation will not contain these attributes. You
need to use Edit Metadata to edit the column after Apply SQL Transformation .
Unsupported statements
Although SQLite supports much of the ANSI SQL standard, it does not include many features supported by
commercial relational database systems. For more information, see SQL as Understood by SQLite. Also, be
aware of the following restrictions when creating SQL statements:
SQLite uses dynamic typing for values, rather than assigning a type to a column as in most relational
database systems. It is weakly typed, and allows implicit type conversion.
LEFT OUTER JOIN is implemented, but not RIGHT OUTER JOIN or FULL OUTER JOIN .
You can use RENAME TABLE and ADD COLUMN statements with the ALTER TABLE command, but other
clauses are not supported, including DROP COLUMN , ALTER COLUMN , and ADD CONSTRAINT .
You can create a VIEW within SQLite, but thereafter views are read-only. You cannot execute a DELETE ,
INSERT , or UPDATE statement on a view. However, you can create a trigger that fires on an attempt to
DELETE , INSERT , or UPDATE on a view and perform other operations in the body of the trigger.
In addition to the list of non-supported functions provided on the official SQLite site, the following wiki provides
a list of other unsupported features: SQLite - Unsupported SQL
Next steps
See the set of modules available to Azure Machine Learning.
Clean Missing Data module
IMPORTANT
The cleaning method that you use for handling missing values can dramatically affect your results. We recommend that
you experiment with different methods. Consider both the justification for use of a particular method, and the quality of
the results.
WARNING
This condition must be met by each column individually in order for the specified operation to apply to that column. For example, assume you selected three columns and then set the minimum ratio of missing values to .2 (20%), but only one column actually has 20% missing values. In this case, the cleanup operation would apply only to the column with 20% missing values, and the other columns would be unchanged.
If you have any doubt about whether missing values were changed, select the option, Generate missing value
indicator column . A column is appended to the dataset to indicate whether or not each column met the
specified criteria for the minimum and maximum ranges.
4. For Maximum missing value ratio , specify the maximum ratio of missing values that can be present for the operation to be performed.
For example, you might want to perform missing value substitution only if 30% or fewer of the rows
contain missing values, but leave the values as-is if more than 30% of rows have missing values.
You define the number as the ratio of missing values to all values in the column. By default, the
Maximum missing value ratio is set to 1. This means that missing values are cleaned even if 100% of
the values in the column are missing.
5. For Cleaning Mode , select one of the following options for replacing or removing missing values:
Custom substitution value : Use this option to specify a placeholder value (such as a 0 or NA)
that applies to all missing values. The value that you specify as a replacement must be compatible
with the data type of the column.
Replace with mean : Calculates the column mean and uses the mean as the replacement value
for each missing value in the column.
Applies only to columns that have Integer, Double, or Boolean data types.
Replace with median : Calculates the column median value, and uses the median value as the
replacement for any missing value in the column.
Applies only to columns that have Integer or Double data types.
Replace with mode : Calculates the mode for the column, and uses the mode as the replacement
value for every missing value in the column.
Applies to columns that have Integer, Double, Boolean, or Categorical data types.
Remove entire row : Completely removes any row in the dataset that has one or more missing
values. This is useful if the missing value can be considered randomly missing.
Remove entire column : Completely removes any column in the dataset that has one or more
missing values.
6. The option Replacement value is available if you have selected the option, Custom substitution
value . Type a new value to use as the replacement value for all missing values in the column.
Note that you can use this option only in columns that have the Integer, Double, Boolean, or String data types.
7. Generate missing value indicator column : Select this option if you want to output some indication
of whether the values in the column met the criteria for missing value cleaning. This option is particularly
useful when you are setting up a new cleaning operation and want to make sure it works as designed.
8. Submit the pipeline.
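For intuition only, the cleaning modes above correspond roughly to the following pandas operations. The DataFrame and column name are hypothetical; the module itself is configured entirely in the designer UI:

```python
# Rough pandas analogues of the Clean Missing Data cleaning modes.
import pandas as pd

df = pd.DataFrame({"Col1": [1.0, None, 3.0, None, 5.0]})

custom    = df["Col1"].fillna(0)                     # Custom substitution value (0)
mean      = df["Col1"].fillna(df["Col1"].mean())     # Replace with mean
median    = df["Col1"].fillna(df["Col1"].median())   # Replace with median
mode      = df["Col1"].fillna(df["Col1"].mode()[0])  # Replace with mode
drop_row  = df.dropna(axis=0)                        # Remove entire row
drop_col  = df.dropna(axis=1)                        # Remove entire column
indicator = df["Col1"].isna()                        # Generate missing value indicator column
```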
Results
The module returns two outputs:
Cleaned dataset : A dataset comprising the selected columns, with missing values handled as specified, along with an indicator column, if you selected that option.
Columns not selected for cleaning are also "passed through".
Cleaning transformation : A data transformation used for cleaning, that can be saved in your
workspace and applied to new data later.
Apply a saved cleaning operation to new data
If you need to repeat cleaning operations often, we recommend that you save your recipe for data cleansing as a
transform, to reuse with the same dataset. Saving a cleaning transformation is particularly useful if you must
frequently re-import and then clean data that has the same schema.
1. Add the Apply Transformation module to your pipeline.
2. Add the dataset you want to clean, and connect the dataset to the right-hand input port.
3. Expand the Transforms group in the left-hand pane of the designer. Locate the saved transformation and
drag it into the pipeline.
4. Connect the saved transformation to the left input port of Apply Transformation.
When you apply a saved transformation, you cannot select the columns to which the transformation is applied. That is because the transformation has already been defined and applies automatically to the columns specified in the original operation.
However, suppose you created a transformation on a subset of numeric columns. You can apply this
transformation to a dataset of mixed column types without raising an error, because the missing values
are changed only in the matching numeric columns.
5. Submit the pipeline.
Next steps
See the set of modules available to Azure Machine Learning.
Clip Values
1 TRUE 4, TRUE
2 TRUE 4, TRUE
3 3, FALSE 4, TRUE
4 4, FALSE 4, TRUE
5 5, FALSE 5, FALSE
6 6, FALSE 6, FALSE
7 7, FALSE 7, TRUE
8 8, FALSE 7, TRUE
9 9, FALSE 7, TRUE
10 TRUE 7, TRUE
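As a rough analogue outside the designer, clipping with constant lower and upper thresholds can be sketched in pandas as follows. The values and thresholds here are illustrative only:

```python
# Clip values to a [lower, upper] range and flag which values were replaced.
import pandas as pd

s = pd.Series(range(1, 11))            # 1..10
clipped = s.clip(lower=4, upper=7)     # values below 4 become 4, values above 7 become 7
was_clipped = s.ne(clipped)            # indicator: True where the original value was replaced
```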
Next steps
See the set of modules available to Azure Machine Learning.
Convert to CSV module
Next steps
See the set of modules available to Azure Machine Learning.
Convert to Dataset
This article describes how to use the Convert to Dataset module in Azure Machine Learning designer to convert
any data for a pipeline to the designer's internal format.
Conversion is not required in most cases. Azure Machine Learning implicitly converts data to its native dataset
format when any operation is performed on the data.
We recommend saving data to the dataset format if you've performed some kind of normalization or cleaning
on a set of data, and you want to ensure that the changes are used in other pipelines.
NOTE
Convert to Dataset changes only the format of the data. It does not save a new copy of the data in the workspace. To
save the dataset, double-click the output port, select Save as dataset , and enter a new name.
Technical notes
Any module that takes a dataset as input can also take data in CSV or TSV format. Before any module code is run, the inputs are preprocessed. Preprocessing is equivalent to running the Convert to Dataset module on the input.
You can't convert from the SVMLight format to a dataset.
When you're specifying a custom replace operation, the search-and-replace operation applies to
complete values. Partial matches are not allowed. For example, you can replace a 3 with a -1 or with 33,
but you can't replace a 3 in a two-digit number such as 35.
For custom replace operations, the replacement will silently fail if you use as a replacement any character
that does not conform to the current data type of the column.
Next steps
See the set of modules available to Azure Machine Learning.
Convert to Indicator Values
NOTE
You can use the Edit Metadata module before the Convert to Indicator Values module to mark the target column(s) as categorical.
2. Connect the Convert to Indicator Values module to the dataset containing the columns you want to convert.
3. Select Edit column to choose one or more categorical columns.
4. Select the Overwrite categorical columns option if you want to output only the new Boolean columns. By default, this option is off.
TIP
If you choose the option to overwrite, the source column is not actually deleted or modified. Instead, the new
columns are generated and presented in the output dataset, and the source column remains available in the
workspace. If you need to see the original data, you can use the Add Columns module at any time to add the
source column back in.
Results
Suppose you have a column with scores that indicate whether a server has a high, medium, or low probability of
failure.
| Server ID | Failure Score |
| --- | --- |
| 10301 | Low |
| 10302 | Medium |
| 10303 | High |
When you apply Convert to Indicator Values , the designer converts a single column of labels into multiple columns containing Boolean values:
| Server ID | Failure Score - Low | Failure Score - Medium | Failure Score - High |
| --- | --- | --- | --- |
| 10301 | 1 | 0 | 0 |
| 10302 | 0 | 1 | 0 |
| 10303 | 0 | 0 | 1 |
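Conceptually, this is one-hot encoding. For comparison, a similar result can be sketched in pandas with get_dummies; the DataFrame below simply mirrors the example above:

```python
# One-hot encoding sketch, analogous to Convert to Indicator Values.
import pandas as pd

df = pd.DataFrame({"ServerID": [10301, 10302, 10303],
                   "FailureScore": ["Low", "Medium", "High"]})

indicators = pd.get_dummies(df, columns=["FailureScore"], dummy_na=False)
# dummy_na=True would add a separate indicator column for missing values,
# similar to the "<source column>-Missing" column noted in the usage tips below.
```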
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Usage tips
Only columns that are marked as categorical can be converted to indicator columns. If you see the
following error, it is likely that one of the columns you selected is not categorical:
Error 0056: Column with name <column name> is not in an allowed category.
By default, most string columns are handled as string features, so you must explicitly mark them as
categorical using Edit Metadata.
There is no limit on the number of columns that you can convert to indicator columns. However, because
each column of values can yield multiple indicator columns, you may want to convert and review just a
few columns at a time.
If the column contains missing values, a separate indicator column is created for the missing category,
with this name: <source column>- Missing
If the column that you convert to indicator values contains numbers, they must be marked as categorical
like any other feature column. After you have done so, the numbers are treated as discrete values. For
example, if you have a numeric column with MPG values ranging from 25 to 30, a new indicator column
would be created for each discrete value:
| Make | Highway MPG - 25 | Highway MPG - 26 | Highway MPG - 27 | Highway MPG - 28 | Highway MPG - 29 | Highway MPG - 30 |
| --- | --- | --- | --- | --- | --- | --- |
| Contoso Cars | 0 | 0 | 0 | 0 | 0 | 1 |
To avoid adding too many dimensions to your dataset, we recommend that you first check the number of values in the column, and bin or quantize the data appropriately.
Next steps
See the set of modules available to Azure Machine Learning.
Edit Metadata module
4. Select the Categorical option to specify that the values in the selected columns should be treated as
categories.
For example, you might have a column that contains the numbers 0, 1, and 2, but know that the numbers
actually mean "Smoker," "Non-smoker," and "Unknown." In that case, by flagging the column as
categorical you ensure that the values are used only to group data and not in numeric calculations.
5. Use the Fields option if you want to change the way that Azure Machine Learning uses the data in a
model.
Feature : Use this option to flag a column as a feature in modules that operate only on feature
columns. By default, all columns are initially treated as features.
Label : Use this option to mark the label, which is also known as the predictable attribute or target
variable. Many modules require that exactly one label column is present in the dataset.
In many cases, Azure Machine Learning can infer that a column contains a class label. By setting
this metadata, you can ensure that the column is identified correctly. Setting this option does not
change data values. It changes only the way that some machine-learning algorithms handle the
data.
TIP
Do you have data that doesn't fit into these categories? For example, your dataset might contain values such as
unique identifiers that aren't useful as variables. Sometimes such IDs can cause problems when used in a model.
Fortunately, Azure Machine Learning keeps all of your data, so that you don't have to delete such columns from
the dataset. When you need to perform operations on some special set of columns, just remove all other columns
temporarily by using the Select Columns in Dataset module. Later you can merge the columns back into the
dataset by using the Add Columns module.
6. Use the following options to clear previous selections and restore metadata to the default values.
Clear feature : Use this option to remove the feature flag.
All columns are initially treated as features. For modules that perform mathematical operations,
you might need to use this option in order to prevent numeric columns from being treated as
variables.
Clear label : Use this option to remove the label metadata from the specified column.
Clear score : Use this option to remove the score metadata from the specified column.
You currently can't explicitly mark a column as a score in Azure Machine Learning. However, some
operations result in a column being flagged as a score internally. Also, a custom R module might
output score values.
7. For New column names , enter the new name of the selected column or columns.
Column names can use only characters that are supported by UTF-8 encoding. Empty strings,
nulls, or names that consist entirely of spaces aren't allowed.
To rename multiple columns, enter the names as a comma-separated list in order of the column
indexes.
All selected columns must be renamed. You can't omit or skip columns.
8. Submit the pipeline.
Next steps
See the set of modules available to Azure Machine Learning.
Group Data into Bins module
This article describes how to use the Group Data into Bins module in Azure Machine Learning designer, to group
numbers or change the distribution of continuous data.
The Group Data into Bins module supports multiple options for binning data. You can customize how the bin
edges are set and how values are apportioned into the bins. For example, you can:
Manually type a series of values to serve as the bin boundaries.
Assign values to bins by using quantiles, or percentile ranks.
Force an even distribution of values into the bins.
4. If you're using the Quantiles and Equal Width binning modes, use the Number of bins option to
specify how many bins, or quantiles, you want to create.
5. For Columns to bin , use the column selector to choose the columns that have the values you want to
bin. Columns must be a numeric data type.
The same binning rule is applied to all applicable columns that you choose. If you need to bin some
columns by using a different method, use a separate instance of the Group Data into Bins module for
each set of columns.
WARNING
If you choose a column that's not an allowed type, a runtime error is generated. The module returns an error as
soon as it finds any column of a disallowed type. If you get an error, review all selected columns. The error does
not list all invalid columns.
6. For Output mode , indicate how you want to output the quantized values:
Append : Creates a new column with the binned values, and appends that to the input table.
Inplace : Replaces the original values with the new values in the dataset.
ResultOnly : Returns just the result columns.
7. If you select the Quantiles binning mode, use the Quantile normalization option to determine how
values are normalized before sorting into quantiles. Note that normalizing values transforms the values
but doesn't affect the final number of bins.
The following normalization types are supported:
Percent : Values are normalized within the range [0,100].
PQuantile : Values are normalized within the range [0,1].
QuantileIndex : Values are normalized within the range [1,number of bins].
8. If you choose the Custom Edges option, enter a comma-separated list of numbers to use as bin edges in
the Comma-separated list of bin edges text box.
The values mark the point that divides bins. For example, if you enter one bin edge value, two bins will be
generated. If you enter two bin edge values, three bins will be generated.
The values must be sorted in the order that the bins are created, from lowest to highest.
9. Select the Tag columns as categorical option to indicate that the quantized columns should be
handled as categorical variables.
10. Submit the pipeline.
Results
The Group Data into Bins module returns a dataset in which each element has been binned according to the
specified mode.
It also returns a binning transformation. That function can be passed to the Apply Transformation module to bin
new samples of data by using the same binning mode and parameters.
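As a rough point of comparison, custom-edge and quantile binning can be sketched in pandas as follows. The sample values and edges are hypothetical:

```python
# Binning sketch: pd.cut for custom edges, pd.qcut for quantile (equal-frequency) bins.
import pandas as pd

s = pd.Series([3, 7, 12, 18, 25, 40])

by_edges     = pd.cut(s, bins=[0, 10, 20, 50])  # explicit edges produce three labeled intervals
by_quantiles = pd.qcut(s, q=4, labels=False)    # quartile bin indices, one per value
```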
TIP
If you use binning on your training data, you must use the same binning method on data that you use for testing and
prediction. You must also use the same bin locations and bin widths.
To ensure that data is always transformed by using the same binning method, we recommend that you save useful data
transformations. Then apply them to other datasets by using the Apply Transformation module.
Next steps
See the set of modules available to Azure Machine Learning.
Join Data
This article describes how to use the Join Data module in Azure Machine Learning designer to merge two
datasets using a database-style join operation.
4. Select the Match case option if you want to preserve case sensitivity on a text column join.
5. Use the Join type dropdown list to specify how the datasets should be combined.
Inner Join : An inner join is the most common join operation. It returns the combined rows only
when the values of the key columns match.
Left Outer Join : A left outer join returns joined rows for all rows from the left table. When a row
in the left table has no matching rows in the right table, the returned row contains missing values
for all columns that come from the right table. You can also specify a replacement value for
missing values.
Full Outer Join : A full outer join returns all rows from the left table (table1 ) and from the right
table (table2 ).
For each of the rows in either table that have no matching rows in the other, the result includes a
row containing missing values.
Left Semi-Join : A left semi-join returns only the values from the left table when the values of the
key columns match.
6. For the option Keep right key columns in joined table :
Select this option to view the keys from both input tables.
Deselect to only return the key columns from the left input.
7. Submit the pipeline.
8. To view the results, right-click the Join Data module and select Visualize .
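For comparison, the join types map roughly onto the following pandas operations. The tables and key column are hypothetical:

```python
# Pandas analogues of the Join Data join types.
import pandas as pd

left  = pd.DataFrame({"placeID": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"placeID": [2, 3, 4], "rating": [5, 4, 3]})

inner  = left.merge(right, on="placeID", how="inner")   # Inner Join
louter = left.merge(right, on="placeID", how="left")    # Left Outer Join
fouter = left.merge(right, on="placeID", how="outer")   # Full Outer Join
semi   = left[left["placeID"].isin(right["placeID"])]   # Left Semi-Join (left columns only)
```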
Next steps
See the set of modules available to Azure Machine Learning.
Normalize Data module
WARNING
Some algorithms require that data be normalized before training a model. Other algorithms perform their own data
scaling or normalization. Therefore, when you choose a machine learning algorithm to use in building a predictive model,
be sure to review the data requirements of the algorithm before applying normalization to the training data.
4. Use 0 for constant columns when checked : Select this option when any numeric column contains a
single unchanging value. This ensures that such columns are not used in normalization operations.
5. From the Transformation method dropdown list, choose a single mathematical function to apply to all
selected columns.
Zscore : Converts all values to a z-score.
The values in the column are transformed using the following formula: z = (x - μ) / σ
Mean (μ) and standard deviation (σ) are computed for each column separately. Population standard deviation is used.
MinMax : The min-max normalizer linearly rescales every feature to the [0,1] interval.
Rescaling to the [0,1] interval is done by shifting the values of each feature so that the minimal value is 0, and then dividing by the new maximal value (which is the difference between the original maximal and minimal values).
The values in the column are transformed using the following formula: x' = (x - min) / (max - min)
Logistic : The values in the column are transformed using the following formula:
Here μ and σ are the parameters of the distribution, computed empirically from the data as
maximum likelihood estimates, for each column separately.
TanH : All values are converted to a hyperbolic tangent.
The values in the column are transformed using the following formula: x' = tanh(x)
6. Submit the pipeline, or double-click the Normalize Data module and select Run Selected .
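For intuition, the transformation methods correspond roughly to the following scikit-learn and NumPy operations. The array is hypothetical, and the logistic line shows only the basic squashing function rather than any fitted parameterization:

```python
# Sketch of the Normalize Data transformation methods, applied per column.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

zscore   = StandardScaler().fit_transform(X)   # (x - mean) / population std
minmax   = MinMaxScaler().fit_transform(X)     # (x - min) / (max - min)
logistic = 1.0 / (1.0 + np.exp(-X))            # logistic squashing to (0, 1)
tanh     = np.tanh(X)                          # hyperbolic tangent to (-1, 1)
```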
Results
The Normalize Data module generates two outputs:
To view the transformed values, right-click the module, and select Visualize .
By default, values are transformed in place. If you want to compare the transformed values to the original
values, use the Add Columns module to recombine the datasets and view the columns side by side.
To save the transformation so that you can apply the same normalization method to another dataset,
select the module, and select Register dataset under the Outputs tab in the right panel.
You can then load the saved transformations from the Transforms group of the left navigation pane and
apply it to a dataset with the same schema by using Apply Transformation.
Next steps
See the set of modules available to Azure Machine Learning.
Partition and Sample module
NOTE
You can't view the fold designations directly. They're present only in the metadata.
Next steps
See the set of modules available to Azure Machine Learning.
Remove Duplicate Rows module
| Patient ID | Initials | Gender | Age | Admission Month |
| --- | --- | --- | --- | --- |
| 1 | F.M. | M | 53 | Jan |
| 2 | F.A.M. | M | 53 | Jan |
| 3 | F.A.M. | M | 24 | Jan |
| 3 | F.M. | M | 24 | Feb |
| 4 | F.M. | M | 23 | Feb |
|  | F.M. | M | 23 |  |
| 5 | F.A.M. | M | 53 |  |
| 6 | F.A.M. | M | NaN |  |
| 7 | F.A.M. | M | NaN |  |
Clearly, this example has multiple columns with potentially duplicate data. Whether they are actually duplicates
depends on your knowledge of the data.
For example, you might know that many patients have the same name. You wouldn't eliminate duplicates
using any name columns, only the ID column. That way, only the rows with duplicate ID values are
filtered out, regardless of whether the patients have the same name or not.
Alternatively, you might decide to allow duplicates in the ID field, and use some other combination of fields to find unique records, such as first name, last name, age, and gender.
To set the criteria for whether a row is a duplicate or not, you specify a single column or a set of columns to use as keys . Two rows are considered duplicates only when the values in all key columns are equal. If a row has missing values in the key columns, it is not considered a duplicate row. For example, if Gender and Age are set as keys in the table above, rows 6 and 7 are not duplicate rows, given that they have missing values in Age.
When you run the module, it creates a candidate dataset, and returns a set of rows that have no duplicates
across the set of columns you specified.
IMPORTANT
The source dataset is not altered; this module creates a new dataset that is filtered to exclude duplicates, based on the
criteria you specify.
How to use Remove Duplicate Rows
1. Add the module to your pipeline. You can find the Remove Duplicate Rows module under Data
Transformation , Manipulation .
2. Connect the dataset that you want to check for duplicate rows.
3. In the Properties pane, under Key column selection filter expression , click Launch column selector to choose columns to use in identifying duplicates.
In this context, Key does not mean a unique identifier. All columns that you select using the Column
Selector are designated as key columns . All unselected columns are considered non-key columns. The
combination of columns that you select as keys determines the uniqueness of the records. (Think of it as
a SQL statement that uses multiple equalities joins.)
Examples:
"I want to ensure that IDs are unique": Choose only the ID column.
"I want to ensure that the combination of first name, last name, and ID is unique": Select all three
columns.
4. Use the Retain first duplicate row checkbox to indicate which row to return when duplicates are
found:
If selected, the first row is returned and others discarded.
If you uncheck this option, the last duplicate row is kept in the results, and others are discarded.
5. Submit the pipeline.
6. To review the results, right-click the module, and select Visualize .
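For comparison, the same de-duplication logic can be sketched in pandas; the DataFrame and key column are hypothetical:

```python
# Pandas analogue of Remove Duplicate Rows.
import pandas as pd

df = pd.DataFrame({"PatientID": [1, 2, 2, 3],
                   "Initials":  ["F.M.", "F.A.M.", "F.A.M.", "F.M."]})

# subset = the "key" columns; keep="first" matches the Retain first duplicate row option,
# keep="last" matches unchecking it.
deduped = df.drop_duplicates(subset=["PatientID"], keep="first")
```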
TIP
If the results are difficult to understand, or if you want to exclude some columns from consideration, you can remove
columns by using the Select Columns in Dataset module.
Next steps
See the set of modules available to Azure Machine Learning.
SMOTE
This article describes how to use the SMOTE module in Azure Machine Learning designer to increase the
number of underrepresented cases in a dataset that's used for machine learning. SMOTE is a better way of
increasing the number of rare cases than simply duplicating existing cases.
You connect the SMOTE module to a dataset that's imbalanced. There are many reasons why a dataset might be
imbalanced. For example, the category you're targeting might be rare in the population, or the data might be
difficult to collect. Typically, you use SMOTE when the class that you want to analyze is underrepresented.
The module returns a dataset that contains the original samples. It also returns a number of synthetic minority
samples, depending on the percentage that you specify.
Examples
We recommend that you try using SMOTE with a small dataset to see how it works. The following example uses
the Blood Donation dataset available in Azure Machine Learning designer.
If you add the dataset to a pipeline and select Visualize on the dataset's output, you can see that of the 748
rows or cases in the dataset, 570 cases (76 percent) are of Class 0, and 178 cases (24 percent) are of Class 1.
Although this result isn't terribly imbalanced, Class 1 represents the people who donated blood, so these rows
contain the feature space that you want to model.
To increase the number of cases, you can set the value of SMOTE percentage , by using multiples of 100, as
follows:
WARNING
Increasing the number of cases by using SMOTE is not guaranteed to produce more accurate models. Try pipelining with
different percentages, different feature sets, and different numbers of nearest neighbors to see how adding cases
influences your model.
6. Use the Number of nearest neighbors option to determine the size of the feature space that the
SMOTE algorithm uses in building new cases. A nearest neighbor is a row of data (a case) that's similar to
a target case. The distance between any two cases is measured by combining the weighted vectors of all
features.
By increasing the number of nearest neighbors, you get features from more cases.
By keeping the number of nearest neighbors low, you use features that are more like those in the
original sample.
7. Enter a value in the Random seed box if you want to ensure the same results over runs of the same
pipeline, with the same data. Otherwise, the module generates a random seed based on processor clock
values when the pipeline is deployed. The generation of a random seed can cause slightly different results
over runs.
8. Submit the pipeline.
The output of the module is a dataset that contains the original rows plus a number of added rows with
minority cases.
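Outside the designer, the same oversampling technique is available in the third-party imbalanced-learn package. The following sketch uses a synthetic dataset that simply mirrors the 76/24 class split described above:

```python
# SMOTE oversampling sketch with imbalanced-learn (not the designer module itself).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for an imbalanced dataset: 748 rows, roughly 76% / 24% classes.
X, y = make_classification(n_samples=748, weights=[0.76, 0.24], random_state=0)

# k_neighbors plays the role of the "Number of nearest neighbors" option;
# sampling_strategy=1.0 asks for a balanced minority/majority ratio after resampling.
smote = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print(Counter(y), Counter(y_res))   # class counts before and after oversampling
```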
Technical notes
When you're publishing a model that uses the SMOTE module, remove SMOTE from the predictive
pipeline before it's published as a web service. The reason is that SMOTE is intended for improving a
model during training, not for scoring. You might get an error if a published predictive pipeline contains
the SMOTE module.
You can often get better results if you clean missing values or apply other transformations to fix data
before you apply SMOTE.
Some researchers have investigated whether SMOTE is effective on high-dimensional or sparse data,
such as data used in text classification or genomics datasets. This paper has a good summary of the
effects and of the theoretical validity of applying SMOTE in such cases: Blagus and Lusa: SMOTE for high-
dimensional class-imbalanced data.
If SMOTE is not effective in your dataset, other approaches that you might consider include:
Methods for oversampling the minority cases or undersampling the majority cases.
Ensemble techniques that help the learner directly by using clustering, bagging, or adaptive boosting.
Next steps
See the set of modules available to Azure Machine Learning.
Select Columns Transform
This article describes how to use the Select Columns Transform module in Azure Machine Learning designer. The
purpose of the Select Columns Transform module is to ensure that a predictable, consistent set of columns is
used in downstream machine learning operations.
This module is helpful for tasks such as scoring, which require specific columns. Changes in the available
columns might break the pipeline or change the results.
You use Select Columns Transform to create and save a set of columns. Then, use the Apply Transformation
module to apply those selections to new data.
IMPORTANT
Because feature importance is based on the values in the column, you can't know in advance which columns
might be available for input to Train Model.
Next steps
See the set of modules available to Azure Machine Learning.
Select Columns in Dataset module
How to use
This module has no parameters. You use the column selector to choose the columns to include or exclude.
Choose columns by name
There are multiple options in the module for choosing columns by name:
Filter and search
Click the BY NAME option.
If you have connected a dataset that is already populated, a list of available columns should appear. If no
columns appear, you might need to run upstream modules to view the column list.
To filter the list, type in the search box. For example, if you type the letter w in the search box, the list is
filtered to show the column names that contain the letter w .
Select columns and click the right arrow button to move the selected columns to the list in the right-hand
pane.
To select a continuous range of column names, press Shift + Click .
To add individual columns to the selection, press Ctrl + Click .
Click the checkmark button to save and close.
Use names in combination with other rules
Click the WITH RULES option.
Choose a rule, such as showing columns of a specific data type.
Then, click individual columns of that type by name, to add them to the selection list.
Type or paste a comma-separated list of column names
If your dataset is wide, it might be easier to use indexes or generated lists of names, rather than selecting
columns individually. Assuming you have prepared the list in advance:
1. Click the WITH RULES option.
2. Select No columns , select Include , and then click inside the text box with the red exclamation mark.
3. Paste in or type a comma-separated list of previously validated column names. You cannot save the
module if any column has an invalid name, so be sure to check the names beforehand.
You can also use this method to specify a list of columns using their index values.
Choose by type
If you use the WITH RULES option, you can apply multiple conditions on the column selections. For example,
you might need to get only feature columns of a numeric data type.
The BEGIN WITH option determines your starting point and is important for understanding the results.
If you select the ALL COLUMNS option, all columns are added to the list. Then, you must use the
Exclude option to remove columns that meet certain conditions.
For example, you might start with all columns and then remove columns by name, or by type.
If you select the NO COLUMNS option, the list of columns starts out empty. You then specify conditions
to add columns to the list.
If you apply multiple rules, each condition is additive . For example, say you start with no columns, and
then add a rule to get all numeric columns. In the Automobile price dataset, that results in 16 columns.
Then, you click the + sign to add a new condition, and select Include all features . The resulting dataset
includes all the numeric columns, plus all the feature columns, including some string feature columns.
Choose by column index
The column index refers to the order of the column within the original dataset.
Columns are numbered sequentially starting at 1.
To get a range of columns, use a hyphen.
Open-ended specifications such as 1- or -3 are not allowed.
Duplicate index values (or column names) are not allowed, and might result in an error.
For example, assuming your dataset has at least eight columns, you could paste in any of the following
examples to return multiple non-contiguous columns:
8,1-4,6
1,3-8
1,3-6,4
The final example does not result in an error; however, it returns a single instance of column 4 .
Change order of columns
The option Allow duplicates and preserve column order in selection starts with an empty list, and adds columns that you specify by name or by index. Unlike other options, which always return columns in their "natural order", this option outputs the columns in the order that you name or list them.
For example, in a dataset with the columns Col1, Col2, Col3, and Col4, you could reverse the order of the
columns and leave out column 2, by specifying either of the following lists:
Col4, Col3, Col1
4,3,1
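For comparison, the same reordering can be sketched in pandas. Note that pandas indexes columns from 0, whereas the column selector starts at 1:

```python
# Pandas analogue of reordering/subsetting columns, using the Col1..Col4 example above.
import pandas as pd

df = pd.DataFrame(columns=["Col1", "Col2", "Col3", "Col4"])

by_name  = df[["Col4", "Col3", "Col1"]]   # reversed order, Col2 dropped
by_index = df.iloc[:, [3, 2, 0]]          # same selection by zero-based index
```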
Next steps
See the set of modules available to Azure Machine Learning.
Split Data module
1. Add the Split Data module to your pipeline in the designer. You can find this module under Data
Transformation , in the Sample and Split category.
2. Splitting mode : Choose one of the following modes, depending on the type of data you have and how
you want to divide it. Each splitting mode has different options.
Split Rows : Use this option if you just want to divide the data into two parts. You can specify the
percentage of data to put in each split. By default, the data is divided 50/50.
You can also randomize the selection of rows in each group, and use stratified sampling. In
stratified sampling, you must select a single column of data for which you want values to be
apportioned equally among the two result datasets.
Regular Expression Split : Choose this option when you want to divide your dataset by testing a
single column for a value.
For example, if you're analyzing sentiment, you can check for the presence of a particular product
name in a text field. You can then divide the dataset into rows with the target product name and
rows without the target product name.
Relative Expression Split : Use this option whenever you want to apply a condition to a number
column. The number can be a date/time field, a column that contains age or dollar amounts, or
even a percentage. For example, you might want to divide your dataset based on the cost of the
items, group people by age ranges, or separate data by a calendar date.
Split rows
1. Add the Split Data module to your pipeline in the designer, and connect the dataset that you want to split.
2. For Splitting mode , select Split Rows .
3. Fraction of rows in the first output dataset : Use this option to determine how many rows will go
into the first (left side) output. All other rows will go into the second (right side) output.
The ratio represents the percentage of rows sent to the first output dataset, so you must enter a decimal
number between 0 and 1.
For example, if you enter 0.75 as the value, the dataset will be split 75/25. In this split, 75 percent of the
rows will be sent to the first output dataset. The remaining 25 percent will be sent to the second output
dataset.
4. Select the Randomized split option if you want to randomize selection of data into the two groups. This
is the preferred option when you're creating training and test datasets.
5. Random Seed : Enter a non-negative integer value to start the pseudorandom sequence of instances to
be used. This default seed is used in all modules that generate random numbers.
Specifying a seed makes the results reproducible. If you need to repeat the results of a split operation,
you should specify a seed for the random number generator. Otherwise the random seed is set by default
to 0 , which means the initial seed value is obtained from the system clock. As a result, the distribution of
data might be slightly different each time you perform a split.
6. Stratified split : Set this option to True to ensure that the two output datasets contain a representative
sample of the values in the strata column or stratification key column.
With stratified sampling, the data is divided such that each output dataset gets roughly the same
percentage of each target value. For example, you might want to ensure that your training and testing
sets are roughly balanced with regard to the outcome or to some other column (such as gender).
7. Submit the pipeline.
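The randomized, stratified 75/25 split described in the steps above can be sketched outside the designer with scikit-learn. This is only a rough analogue, not the module's implementation, and the dataset and label column name are placeholders:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # placeholder input dataset

first_output, second_output = train_test_split(
    df,
    train_size=0.75,         # ~ Fraction of rows in the first output dataset
    shuffle=True,            # ~ Randomized split
    random_state=0,          # ~ Random Seed, for reproducible splits
    stratify=df["label"],    # ~ Stratified split on a single key column
)
```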
\"Text" Gryphon
Substring
This example looks for the specified string in any position within the second column of the dataset. The position
is denoted here by the index value of 1. The match is case-sensitive.
(\1) ^[a-f]
The first result dataset contains all rows where the index column begins with one of these characters: a , b , c ,
d , e , f . All other rows are directed to the second output.
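For intuition, an equivalent split can be done outside the designer with a pandas string match; the tiny DataFrame below is hypothetical and mirrors the two examples above:

```python
import pandas as pd

df = pd.DataFrame({"Text": ["Gryphon rampant", "Mock Turtle"], "Code": ["alpha", "zulu"]})

# Rows whose Text column contains "Gryphon" go to the first output, the rest to the second.
mask = df["Text"].str.contains("Gryphon", regex=True)
first_output, second_output = df[mask], df[~mask]

# Rows whose second column (index 1) begins with a letter a-f; the match is case-sensitive.
mask2 = df.iloc[:, 1].str.contains(r"^[a-f]", regex=True)
first_output2, second_output2 = df[mask2], df[~mask2]
```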
Select a relative expression
1. Add the Split Data module to your pipeline, and connect it as input to the dataset that you want to split.
2. For Splitting mode , select Relative Expression .
3. In the Relational expression box, enter an expression that performs a comparison operation on a single
column.
For Numeric column :
The column contains numbers of any numeric data type, including date and time data types.
The expression can reference a maximum of one column name.
Use the ampersand character, & , for the AND operation. Use the pipe character, | , for the OR
operation.
The following operators are supported: < , > , <= , >= , == , != .
You can't group operations by using ( and ) .
For String column :
The following operators are supported: == , != .
4. Submit the pipeline.
The expression divides the dataset into two sets of rows: rows with values that meet the condition, and all
remaining rows.
The following examples demonstrate how to divide a dataset by using the Relative Expression option in the
Split Data module.
Calendar year
A common scenario is to divide a dataset by years. The following expression selects all rows where the values in
the column Year are greater than 2010:
\"Year" > 2010
The date expression must account for all date parts that are included in the data column, and the format of dates
in the data column must be consistent. For example, in a date column that uses the format mmddyyyy, the expression
should compare against a complete date rather than just the year.
Column index
The following expression demonstrates how you can use the column index to select all rows in the first column
of the dataset that contain values less than or equal to 30, but not equal to 20:
(\0) <= 30 & != 20
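As a rough illustration of the same logic outside the designer, the following pandas sketch reproduces the two splits described above; the dataset and the Year column are hypothetical:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder dataset with a numeric Year column

# Calendar year: rows with Year > 2010 go to the first output, the rest to the second.
year_mask = df["Year"] > 2010
first_output, second_output = df[year_mask], df[~year_mask]

# Column index: values in the first column (position 0) that are <= 30 and != 20.
col0 = df.iloc[:, 0]
index_mask = (col0 <= 30) & (col0 != 20)
first_output2, second_output2 = df[index_mask], df[~index_mask]
```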
Next steps
See the set of modules available to Azure Machine Learning.
Filter Based Feature Selection
11/2/2020 • 6 minutes to read • Edit Online
This article describes how to use the Filter Based Feature Selection module in Azure Machine Learning designer.
This module helps you identify the columns in your input dataset that have the greatest predictive power.
In general, feature selection refers to the process of applying statistical tests to inputs, given a specified output.
The goal is to determine which columns are more predictive of the output. The Filter Based Feature Selection
module provides multiple feature selection algorithms to choose from. The module includes correlation
methods such as Pearson correlation and chi-squared values.
When you use the Filter Based Feature Selection module, you provide a dataset and identify the column that
contains the label or dependent variable. You then specify a single method to use in measuring feature
importance.
The module outputs a dataset that contains the best feature columns, as ranked by predictive power. It also
outputs the names of the features and their scores from the selected metric.
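For intuition, the following scikit-learn sketch applies one of the same filter methods (chi-squared) to rank columns and keep the top features. It approximates what the module does but is not its implementation; the dataset and the label column name are placeholders:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

df = pd.read_csv("data.csv")        # placeholder dataset
X = df.drop(columns=["label"])      # candidate features (chi2 requires non-negative values)
y = df["label"]                     # label / dependent variable

selector = SelectKBest(score_func=chi2, k=5)   # k ~ Number of desired features
selector.fit(X, y)

# Scores ranked in descending order, similar to how the module reports them.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
best_columns = X.columns[selector.get_support()]
print(scores)
print(list(best_columns))
```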
IMPORTANT
Ensure that the columns that you're providing as input are potential features. For example, a column that contains
a single value has no information value.
If you know that some columns would make bad features, you can remove them from the column selection. You
can also use the Edit Metadata module to flag them as Categorical.
3. For Feature scoring method , choose one of the following established statistical methods to use in
calculating scores.
METHOD             REQUIREMENTS
Chi squared        Labels and features can be text or numeric. Use this method for computing feature
                   importance for two categorical columns.
TIP
If you change the selected metric, all other selections will be reset. So be sure to set this option first.
4. Select the Operate on feature columns only option to generate a score only for columns that were
previously marked as features.
If you clear this option, the module will create a score for any column that otherwise meets the criteria,
up to the number of columns specified in Number of desired features .
5. For Target column , select Launch column selector to choose the label column either by name or by
its index. (Indexes are one-based.)
A label column is required for all methods that involve statistical correlation. The module returns a
design-time error if you choose no label column or multiple label columns.
6. For Number of desired features , enter the number of feature columns that you want returned as a
result:
The minimum number of features that you can specify is one, but we recommend that you
increase this value.
If the specified number of desired features is greater than the number of columns in the dataset,
then all features are returned. Even features with zero scores are returned.
If you specify fewer result columns than there are feature columns, the features are ranked by
descending score. Only the top features are returned.
7. Submit the pipeline.
IMPORTANT
If you are going to use Filter Based Feature Selection in inference, you need to use Select Columns Transform to store
the feature-selection result, and Apply Transformation to apply that stored transformation to the scoring dataset.
Build the inference pipeline this way to ensure that column selections are the same for the scoring process.
Results
After processing is complete:
To see a complete list of the analyzed feature columns and their scores, right-click the module and select
Visualize .
To view the dataset based on your feature selection criteria, right-click the module and select Visualize .
If the dataset contains fewer columns than you expected, check the module settings. Also check the data types of
the columns provided as input. For example, if you set Number of desired features to 1, the output dataset
contains just two columns: the label column, and the most highly ranked feature column.
Technical notes
Implementation details
If you use Pearson correlation on a numeric feature and a categorical label, the feature score is calculated as
follows:
1. For each level in the categorical column, compute the conditional mean of numeric column.
2. Correlate the column of conditional means with the numeric column.
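A minimal pandas/NumPy sketch of that two-step calculation, using hypothetical column names (not the module's exact code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],   # numeric feature
    "label":   ["a", "a", "b", "b", "c", "c"],   # categorical label
})

# 1. For each level of the categorical column, compute the conditional mean of the numeric column.
conditional_means = df.groupby("label")["feature"].transform("mean")

# 2. Correlate the column of conditional means with the numeric column.
score = np.corrcoef(conditional_means, df["feature"])[0, 1]
print(score)
```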
Requirements
A feature selection score can't be generated for any column that's designated as a Label or Score
column.
If you try to use a scoring method with a column of a data type that the method doesn't support, the
module either raises an error or assigns a zero score to the column.
If a column contains logical (true/false) values, they're processed as True = 1 and False = 0 .
A column can't be a feature if it has been designated as a Label or a Score .
How missing values are handled
You can't specify as a target (label) column any column that has all missing values.
If a column contains missing values, the module ignores them when it's computing the score for the
column.
If a column designated as a feature column has all missing values, the module assigns a zero score.
Next steps
See the set of modules available to Azure Machine Learning.
Permutation Feature Importance
3/5/2021 • 2 minutes to read • Edit Online
This article describes how to use the Permutation Feature Importance module in Azure Machine Learning
designer, to compute a set of feature importance scores for your dataset. You use these scores to help you
determine the best features to use in a model.
In this module, feature values are randomly shuffled, one column at a time. The performance of the model is
measured before and after. You can choose one of the standard metrics to measure performance.
The scores that the module returns represent the change in the performance of a trained model, after
permutation. Important features are usually more sensitive to the shuffling process, so they'll result in higher
importance scores.
For an overview of permutation feature importance, its theoretical basis, and its applications in machine
learning, see this article: Permutation Feature Importance.
Technical notes
Permutation Feature Importance works by randomly changing the values of each feature column, one column at
a time. It then evaluates the model.
The rankings that the module provides are often different from the ones you get from Filter Based Feature
Selection. Filter Based Feature Selection calculates scores before a model is created.
The reason for the difference is that Permutation Feature Importance doesn't measure the association between a
feature and a target value. Instead, it captures how much influence each feature has on predictions from the
model.
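The shuffle-and-remeasure idea can be sketched with scikit-learn's permutation_importance utility. This is an analogue of the module, not its implementation; the dataset and model below are placeholders:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)                      # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature column in turn and measure the change in the chosen metric.
result = permutation_importance(
    model, X_test, y_test, scoring="r2", n_repeats=5, random_state=0
)
print(result.importances_mean)   # larger drop in performance = more important feature
```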
Next steps
See the set of modules available to Azure Machine Learning.
Summarize Data
11/2/2020 • 2 minutes to read • Edit Online
Results
The report from the module can include the following statistics.
COLUMN NAME        DESCRIPTION
P1                 The 1st percentile
P5                 The 5th percentile
Technical notes
For non-numeric columns, only the values for Count, Unique value count, and Missing value count are
computed. For other statistics, a null value is returned.
Columns that contain Boolean values are processed using these rules:
When calculating Min, a logical AND is applied.
When calculating Max, a logical OR is applied.
When computing Range, the module first checks whether the number of unique values in the
column equals 2.
When computing any statistic that requires floating-point calculations, values of True are treated as
1.0, and values of False are treated as 0.0.
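A small pandas sketch of those Boolean rules, using a hypothetical Boolean column:

```python
import pandas as pd

flags = pd.Series([True, False, True], dtype=bool)   # hypothetical Boolean column

minimum = flags.all()              # Min via logical AND
maximum = flags.any()              # Max via logical OR
has_range = flags.nunique() == 2   # Range is checked only if both values occur
mean = flags.astype(float).mean()  # floating-point statistics treat True as 1.0, False as 0.0
```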
Next steps
See the set of modules available to Azure Machine Learning.
Boosted Decision Tree Regression module
3/5/2021 • 4 minutes to read • Edit Online
NOTE
Use this module only with datasets that use numerical variables.
After you have defined the model, train it by using the Train Model module.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
Results
After training is complete:
To use the model for scoring, connect Train Model to Score Model, to predict values for new input
examples.
To save a snapshot of the trained model, select Outputs tab in the right panel of Trained model and
click Register dataset icon. The copy of the trained model will be saved as a module in the module tree
and will not be updated on successive runs of the pipeline.
Next steps
See the set of modules available to Azure Machine Learning.
Decision Forest Regression module
3/5/2021 • 4 minutes to read • Edit Online
How it works
Decision trees are non-parametric models that perform a sequence of simple tests for each instance, traversing
a binary tree data structure until a leaf node (decision) is reached.
Decision trees have these advantages:
They are efficient in both computation and memory usage during training and prediction.
They can represent non-linear decision boundaries.
They perform integrated feature selection and classification and are resilient in the presence of noisy
features.
This regression model consists of an ensemble of decision trees. Each tree in a regression decision forest
outputs a Gaussian distribution as a prediction. An aggregation is performed over the ensemble of trees to find
a Gaussian distribution closest to the combined distribution for all trees in the model.
For more information about the theoretical framework for this algorithm and its implementation, see this article:
Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and
Semi-Supervised Learning
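As a loose illustration of aggregating per-tree predictions, the following scikit-learn sketch collects each tree's output from a standard random forest and summarizes the ensemble with a mean and spread. It only approximates the Gaussian aggregation described above and is not the module's implementation:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)   # placeholder regression data
forest = RandomForestRegressor(n_estimators=8, max_depth=32, random_state=0).fit(X, y)

# Collect every tree's prediction, then describe the ensemble as a mean and a spread.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
ensemble_mean = per_tree.mean(axis=0)
ensemble_std = per_tree.std(axis=0)
```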
TIP
If you set the value to 1, only one tree will be produced (the tree with the initial set of parameters) and no further
iterations will be performed.
5. For Maximum depth of the decision trees , type a number to limit the maximum depth of any
decision tree. Increasing the depth of the tree might increase precision, at the risk of some overfitting and
increased training time.
6. For Number of random splits per node , type the number of splits to use when building each node of
the tree. A split means that features in each level of the tree (node) are randomly divided.
7. For Minimum number of samples per leaf node , indicate the minimum number of cases that are
required to create any terminal node (leaf) in a tree.
By increasing this value, you increase the threshold for creating new rules. For example, with the default
value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the
training data would have to contain at least five cases that meet the same conditions.
8. Train the model:
If you set Create trainer mode to Single Parameter , connect a tagged dataset and the Train
Model module.
If you set Create trainer mode to Parameter Range , connect a tagged dataset and train the
model by using Tune Model Hyperparameters.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
Results
After training is complete:
To save a snapshot of the trained model, select the training module, then switch to Outputs+logs tab in the
right panel. Click on the icon Register dataset . You can find the saved model as a module in the module
tree.
Next steps
See the set of modules available to Azure Machine Learning.
Linear Regression module
3/5/2021 • 6 minutes to read • Edit Online
NOTE
Remember to apply the same normalization method to new data used for scoring.
7. In L2 regularization weight , type the value to use as the weight for L2 regularization. We recommend
that you use a non-zero value to avoid overfitting.
To learn more about how regularization affects model fitting, see this article: L1 and L2 Regularization for
Machine Learning
8. Select the option, Decrease learning rate , if you want the learning rate to decrease as iterations
progress.
9. For Random number seed , you can optionally type a value to seed the random number generator used
by the model. Using a seed value is useful if you want to maintain the same results across different runs
of the same pipeline.
10. Train the model:
If you set Create trainer mode to Single Parameter , connect a tagged dataset and the Train
Model module.
If you set Create trainer mode to Parameter Range , connect a tagged dataset and train the
model by using Tune Model Hyperparameters.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
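To relate the gradient-descent settings above to familiar tooling, here is a rough scikit-learn sketch with L2 regularization, a decreasing learning rate, and a fixed seed. It is an analogue, not the module's implementation, and the training data names are placeholders:

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(
    StandardScaler(),                   # apply the same normalization to scoring data later
    SGDRegressor(
        penalty="l2", alpha=0.001,      # ~ L2 regularization weight (non-zero to avoid overfitting)
        learning_rate="invscaling",     # ~ Decrease learning rate as iterations progress
        random_state=42,                # ~ Random number seed for repeatable runs
    ),
)
# model.fit(X_train, y_train); model.predict(X_new)   # X_train, y_train, X_new are placeholders
```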
Neural Network Regression module
Module overview
This article describes a module in Azure Machine Learning designer.
Use this module to create a regression model using a customizable neural network algorithm.
Although neural networks are widely known for use in deep learning and modeling complex problems such as
image recognition, they are easily adapted to regression problems. Any class of statistical models can be termed
a neural network if they use adaptive weights and can approximate non-linear functions of their inputs. Thus
neural network regression is suited to problems where a more traditional regression model cannot fit a solution.
Neural network regression is a supervised learning method, and therefore requires a tagged dataset, which
includes a label column. Because a regression model predicts a numerical value, the label column must be a
numerical data type.
You can train the model by providing the model and the tagged dataset as an input to Train Model. The trained
model can then be used to predict values for the new input examples.
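As a loose analogue, a small fully connected regressor can be defined with scikit-learn's MLPRegressor; the data below is a placeholder, and this is not the module's implementation:

```python
from sklearn.datasets import load_diabetes
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)   # the label must be a numeric data type

# One hidden layer with 100 nodes; inputs are standardized before training.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(100,), max_iter=2000, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)
```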
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
Results
After training is complete:
To save a snapshot of the trained model, select the Outputs tab in the right panel of the Train model
module. Select the Register dataset icon to save the model as a reusable module.
Next steps
See the set of modules available to Azure Machine Learning.
Poisson Regression
3/5/2021 • 4 minutes to read • Edit Online
TIP
If your target isn’t a count, Poisson regression is probably not an appropriate method. Try other regression modules in
the designer.
After you have set up the regression method, you must train the model using a dataset containing examples of
the value you want to predict. The trained model can then be used to make predictions.
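A minimal scikit-learn sketch of fitting a Poisson model to count data; the data is synthetic and this is not the module's implementation:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                          # hypothetical predictors
y = rng.poisson(lam=np.exp(X @ [0.3, -0.2, 0.1]))      # hypothetical count target

model = PoissonRegressor(alpha=1.0)    # alpha is the regularization strength
model.fit(X, y)
expected_counts = model.predict(X)     # predictions are non-negative expected counts
```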
WARNING
If you pass a parameter range to Train Model, it uses only the first value in the parameter range list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a
range of settings for each parameter, it ignores the values and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value
you specified is used throughout the sweep, even if other parameters change across a range of values.
Results
After training is complete:
To save a snapshot of the trained model, select the training module, then switch to Outputs+logs tab in the
right panel. Click on the icon Register dataset . You can find the saved model as a module in the module
tree.
Next steps
See the set of modules available to Azure Machine Learning.
Module: K-Means Clustering
3/5/2021 • 6 minutes to read • Edit Online
This article describes how to use the K-Means Clustering module in Azure Machine Learning designer to create
an untrained K-means clustering model.
K-means is one of the simplest and the best known unsupervised learning algorithms. You can use the
algorithm for a variety of machine learning tasks, such as:
Detecting abnormal data.
Clustering text documents.
Analyzing datasets before you use other classification or regression methods.
To create a clustering model, you:
Add this module to your pipeline.
Connect a dataset.
Set parameters, such as the number of clusters you expect, the distance metric to use in creating the clusters,
and so forth.
After you've configured the module hyperparameters, you connect the untrained model to the Train Clustering
Model. Because the K-means algorithm is an unsupervised learning method, a label column is optional.
If your data includes a label, you can use the label values to guide selection of the clusters and optimize
the model.
If your data has no label, the algorithm creates clusters representing possible categories, based solely on
the data.
Results
After you've finished configuring and training the model, you have a model that you can use to generate scores.
However, there are multiple ways to train the model, and multiple ways to view and use the results:
Capture a snapshot of the model in your workspace
If you used the Train Clustering Model module:
1. Select the Train Clustering Model module and open the right panel.
2. Select Outputs tab. Select the Register dataset icon to save a copy of the trained model.
The saved model represents the training data at the time you saved the model. If you later update the training
data used in the pipeline, it doesn't update the saved model.
See the clustering result dataset
If you used the Train Clustering Model module:
1. Right-click the Train Clustering Model module.
2. Select Visualize .
Tips for generating the best clustering model
It is known that the seeding process that's used during clustering can significantly affect the model. Seeding
means the initial placement of points into potential centroids.
For example, if the dataset contains many outliers, and an outlier is chosen to seed the clusters, no other data
points would fit well with that cluster, and the cluster could be a singleton. That is, it might have only one point.
You can avoid this problem in a couple of ways:
Change the number of centroids and try multiple seed values.
Create multiple models, varying the metric or iterating more.
In general, with clustering models, it's possible that any given configuration will result in a locally optimized set
of clusters. In other words, the set of clusters that's returned by the model suits only the current data points and
isn't generalizable to other data. If you use a different initial configuration, the K-means method might find a
different, superior, configuration.
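A short scikit-learn sketch of reducing that sensitivity to seeding by restarting K-means from several random initializations; it is an analogue, not the module itself, and the data is a placeholder:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # placeholder data

# n_init restarts K-means from several random seedings and keeps the lowest-inertia result,
# which reduces the chance that one unlucky seed produces a singleton cluster.
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
cluster_labels = model.labels_
print(model.inertia_)
```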
Next steps
See the set of modules available to Azure Machine Learning.
Multiclass Boosted Decision Tree
11/2/2020 • 3 minutes to read • Edit Online
How to configure
This module creates an untrained classification model. Because classification is a supervised learning method,
you need a labeled dataset that includes a label column with a value for all rows.
You can train this type of model by using the Train Model module.
1. Add the Multiclass Boosted Decision Tree module to your pipeline.
2. Specify how you want the model to be trained by setting the Create trainer mode option.
Single Parameter : If you know how you want to configure the model, you can provide a specific
set of values as arguments.
Parameter Range : Select this option if you are not sure of the best parameters, and want to run a
parameter sweep. Select a range of values to iterate over, and the Tune Model Hyperparameters
iterates over all possible combinations of the settings you provided to determine the
hyperparameters that produce the optimal results.
3. Maximum number of leaves per tree limits the maximum number of terminal nodes (leaves) that can
be created in any tree.
By increasing this value, you potentially increase the size of the tree and achieve higher precision, at the
risk of overfitting and longer training time.
4. Minimum number of samples per leaf node indicates the number of cases required to create any
terminal node (leaf) in a tree.
By increasing this value, you increase the threshold for creating new rules. For example, with the default
value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the
training data would have to contain at least five cases that meet the same conditions.
5. Learning rate defines the step size while learning. Enter a number between 0 and 1.
The learning rate determines how fast or slow the learner converges on an optimal solution. If the step
size is too large, you might overshoot the optimal solution. If the step size is too small, training takes
longer to converge on the best solution.
6. Number of trees constructed indicates the total number of decision trees to create in the ensemble.
By creating more decision trees, you can potentially get better coverage, but training time will increase.
7. Random number seed optionally sets a non-negative integer to use as the random seed value.
Specifying a seed ensures reproducibility across runs that have the same data and parameters.
The random seed is set by default to 42. Successive runs using different random seeds can have different
results.
8. Train the model:
If you set Create trainer mode to Single Parameter , connect a tagged dataset and the Train
Model module.
If you set Create trainer mode to Parameter Range , connect a tagged dataset and train the
model by using Tune Model Hyperparameters.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
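For orientation, the hyperparameters described in this procedure map roughly onto a standard gradient-boosting implementation. The scikit-learn sketch below is only an approximation of the module's learner, with placeholder data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)   # placeholder multiclass dataset

model = GradientBoostingClassifier(
    max_leaf_nodes=20,      # ~ Maximum number of leaves per tree
    min_samples_leaf=10,    # ~ Minimum number of samples per leaf node
    learning_rate=0.2,      # ~ Learning rate (between 0 and 1)
    n_estimators=100,       # ~ Number of trees constructed
    random_state=42,        # ~ Random number seed (the module's default is 42)
)
model.fit(X, y)
```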
Next steps
See the set of modules available to Azure Machine Learning.
Multiclass Decision Forest module
3/5/2021 • 4 minutes to read • Edit Online
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
Next steps
See the set of modules available to Azure Machine Learning.
Multiclass Logistic Regression module
3/5/2021 • 3 minutes to read • Edit Online
Different linear combinations of L1 and L2 terms have been devised for logistic regression models, such
as elastic net regularization.
5. Random number seed : Type an integer value to use as the seed for the algorithm if you want the
results to be repeatable over runs. Otherwise, a system clock value is used as the seed, which can
produce slightly different results in runs of the same pipeline.
6. Connect a labeled dataset, and train the model:
If you set Create trainer mode to Single Parameter , connect a tagged dataset and the Train
Model module.
If you set Create trainer mode to Parameter Range , connect a tagged dataset and train the
model by using Tune Model Hyperparameters.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
Next steps
See the set of modules available to Azure Machine Learning.
Multiclass Neural Network module
3/5/2021 • 4 minutes to read • Edit Online
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
Results
After training is complete:
To save a snapshot of the trained model, select the Outputs tab in the right panel of the Train model
module. Select the Register dataset icon to save the model as a reusable module.
Next steps
See the set of modules available to Azure Machine Learning.
One-vs-All Multiclass
3/5/2021 • 3 minutes to read • Edit Online
This article describes how to use the One-vs-All Multiclass module in Azure Machine Learning designer. The goal
is to create a classification model that can predict multiple classes, by using the one-versus-all approach.
This module is useful for creating models that predict three or more possible outcomes, when the outcome
depends on continuous or categorical predictor variables. This method also lets you use binary classification
methods for issues that require multiple output classes.
More about one -versus-all models
Some classification algorithms permit the use of more than two classes by design. Others restrict the possible
outcomes to one of two values (a binary, or two-class model). But even binary classification algorithms can be
adapted for multi-class classification tasks through a variety of strategies.
This module implements the one-versus-all method, in which a binary model is created for each of the multiple
output classes. The module assesses each of these binary models for the individual classes against its
complement (all other classes in the model) as though it's a binary classification issue. In addition to its
computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its
interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge
about the class by inspecting its corresponding classifier. This is the most commonly used strategy for multiclass
classification and is a fair default choice. The module then performs prediction by running these binary
classifiers and choosing the prediction with the highest confidence score.
In essence, the module creates an ensemble of individual models and then merges the results, to create a single
model that predicts all classes. Any binary classifier can be used as the basis for a one-versus-all model.
For example, let’s say you configure a Two-Class Support Vector Machine model and provide that as input to the
One-vs-All Multiclass module. The module would create two-class support vector machine models for all
members of the output class. It would then apply the one-versus-all method to combine the results for all
classes.
The module uses OneVsRestClassifier of sklearn, and you can learn more details here.
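Because the module is built on scikit-learn's OneVsRestClassifier, a minimal sketch of the same pattern looks like the following; the base learner and dataset are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # placeholder three-class dataset

# One binary classifier per class, each trained against the complement of its class.
ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)
print(len(ovr.estimators_))    # n_classes binary models
predictions = ovr.predict(X)   # the class with the highest confidence score wins
```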
Results
After training is complete, you can use the model to make multiclass predictions.
Alternatively, you can pass the untrained classifier to Cross-Validate Model for cross-validation against a labeled
validation dataset.
Next steps
See the set of modules available to Azure Machine Learning.
One-vs-One Multiclass
3/5/2021 • 2 minutes to read • Edit Online
This article describes how to use the One-vs-One Multiclass module in Azure Machine Learning designer. The
goal is to create a classification model that can predict multiple classes, by using the one-versus-one approach.
This module is useful for creating models that predict three or more possible outcomes, when the outcome
depends on continuous or categorical predictor variables. This method also lets you use binary classification
methods for issues that require multiple output classes.
More about one -versus-one models
Some classification algorithms permit the use of more than two classes by design. Others restrict the possible
outcomes to one of two values (a binary, or two-class model). But even binary classification algorithms can be
adapted for multi-class classification tasks through a variety of strategies.
This module implements the one-versus-one method, in which a binary model is created per class pair. At
prediction time, the class which received the most votes is selected. Since it requires fitting
n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than one-versus-all, due to its
O(n_classes^2) complexity. However, this method may be advantageous for algorithms such as kernel
algorithms which don’t scale well with n_samples . This is because each individual learning problem only
involves a small subset of the data whereas, with one-versus-all, the complete dataset is used n_classes times.
In essence, the module creates an ensemble of individual models and then merges the results, to create a single
model that predicts all classes. Any binary classifier can be used as the basis for a one-versus-one model.
For example, let’s say you configure a Two-Class Support Vector Machine model and provide that as input to the
One-vs-One Multiclass module. The module would create two-class support vector machine models for all
members of the output class. It would then apply the one-versus-one method to combine the results for all
classes.
The module uses OneVsOneClassifier of sklearn, and you can learn more details here.
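Similarly, a minimal sketch with scikit-learn's OneVsOneClassifier, using a kernel-based base learner and placeholder data:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # placeholder three-class dataset

# One binary classifier per pair of classes: n_classes * (n_classes - 1) / 2 models.
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)
print(len(ovo.estimators_))    # 3 classes -> 3 pairwise models
predictions = ovo.predict(X)   # the class that receives the most votes wins
```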
Results
After training is complete, you can use the model to make multiclass predictions.
Alternatively, you can pass the untrained classifier to Cross-Validate Model for cross-validation against a labeled
validation dataset.
Next steps
See the set of modules available to Azure Machine Learning.
Two-Class Averaged Perceptron module
3/5/2021 • 2 minutes to read • Edit Online
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
Next steps
See the set of modules available to Azure Machine Learning.
Two-Class Boosted Decision Tree module
3/5/2021 • 4 minutes to read • Edit Online
How to configure
This module creates an untrained classification model. Because classification is a supervised learning method, to
train the model, you need a tagged dataset that includes a label column with a value for all rows.
You can train this type of model using Train Model.
1. In Azure Machine Learning, add the Boosted Decision Tree module to your pipeline.
2. Specify how you want the model to be trained, by setting the Create trainer mode option.
Single Parameter : If you know how you want to configure the model, you can provide a specific
set of values as arguments.
Parameter Range : If you are not sure of the best parameters, you can find the optimal
parameters by using the Tune Model Hyperparameters module. You provide some range of values,
and the trainer iterates over multiple combinations of the settings to determine the combination of
values that produces the best result.
3. For Maximum number of leaves per tree , indicate the maximum number of terminal nodes (leaves)
that can be created in any tree.
By increasing this value, you potentially increase the size of the tree and get better precision, at the risk of
overfitting and longer training time.
4. For Minimum number of samples per leaf node , indicate the number of cases required to create any
terminal node (leaf) in a tree.
By increasing this value, you increase the threshold for creating new rules. For example, with the default
value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the
training data would have to contain at least five cases that meet the same conditions.
5. For Learning rate , type a number between 0 and 1 that defines the step size while learning.
The learning rate determines how fast or slow the learner converges on the optimal solution. If the step
size is too large, you might overshoot the optimal solution. If the step size is too small, training takes
longer to converge on the best solution.
6. For Number of trees constructed , indicate the total number of decision trees to create in the
ensemble. By creating more decision trees, you can potentially get better coverage, but training time will
increase.
If you set the value to 1, only one tree is produced (the tree with the initial set of parameters) and no
further iterations are performed.
7. For Random number seed , optionally type a non-negative integer to use as the random seed value.
Specifying a seed ensures reproducibility across runs that have the same data and parameters.
The random seed is set by default to 0, which means the initial seed value is obtained from the system
clock. Successive runs using a random seed can have different results.
8. Train the model:
If you set Create trainer mode to Single Parameter , connect a tagged dataset and the Train
Model module.
If you set Create trainer mode to Parameter Range , connect a tagged dataset and train the
model by using Tune Model Hyperparameters.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
Results
After training is complete:
To save a snapshot of the trained model, select the Outputs tab in the right panel of the Train model
module. Select the Register dataset icon to save the model as a reusable module.
To use the model for scoring, add the Score Model module to a pipeline.
Next steps
See the set of modules available to Azure Machine Learning.
Two-Class Decision Forest module
3/5/2021 • 5 minutes to read • Edit Online
How to configure
1. Add the Two-Class Decision Forest module to your pipeline in Azure Machine Learning, and open the
Proper ties pane of the module.
You can find the module under Machine Learning . Expand Initialize , and then Classification .
2. For Resampling method , choose the method used to create the individual trees. You can choose from
Bagging or Replicate .
Bagging : Bagging is also called bootstrap aggregating. In this method, each tree is grown on a
new sample, created by randomly sampling the original dataset with replacement until you have a
dataset the size of the original.
The outputs of the models are combined by voting, which is a form of aggregation. Each tree in a
classification decision forest outputs an unnormalized frequency histogram of labels. The
aggregation is to sum these histograms and normalize to get the "probabilities" for each label. In
this manner, the trees that have high prediction confidence will have a greater weight in the final
decision of the ensemble.
For more information, see the Wikipedia entry for Bootstrap aggregating.
Replicate : In replication, each tree is trained on exactly the same input data. The determination of
which split predicate is used for each tree node remains random and the trees will be diverse.
3. Specify how you want the model to be trained, by setting the Create trainer mode option.
Single Parameter : If you know how you want to configure the model, you can provide a specific
set of values as arguments.
Parameter Range : If you are not sure of the best parameters, you can find the optimal
parameters by using the Tune Model Hyperparameters module. You provide some range of values,
and the trainer iterates over multiple combinations of the settings to determine the combination of
values that produces the best result.
4. For Number of decision trees , type the maximum number of decision trees that can be created in the
ensemble. By creating more decision trees, you can potentially get better coverage, but training time
increases.
NOTE
If you set the value to 1, only one tree is produced (the tree with the initial set of parameters) and no further
iterations are performed.
5. For Maximum depth of the decision trees , type a number to limit the maximum depth of any
decision tree. Increasing the depth of the tree might increase precision, at the risk of some overfitting and
increased training time.
6. For Minimum number of samples per leaf node , indicate the minimum number of cases that are
required to create any terminal node (leaf) in a tree.
By increasing this value, you increase the threshold for creating new rules. For example, with the default
value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the
training data would have to contain at least five cases that meet the same conditions.
7. Select the Allow unknown values for categorical features option to create a group for unknown
values in the training or validation sets. The model might be less precise for known values, but it can
provide better predictions for new (unknown) values.
If you deselect this option, the model can accept only the values that are contained in the training data.
8. Attach a labeled dataset, and train the model:
If you set Create trainer mode to Single Parameter , connect a tagged dataset and the Train
Model module.
If you set Create trainer mode to Parameter Range , connect a tagged dataset and train the
model by using Tune Model Hyperparameters.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
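For intuition about the voting described under Bagging, the following scikit-learn sketch averages the per-tree class histograms of a standard random forest. It is only an analogue of the module's aggregation, with placeholder data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)   # placeholder two-class dataset
forest = RandomForestClassifier(n_estimators=8, random_state=0).fit(X, y)

# Each tree outputs a normalized class-frequency histogram; averaging across trees
# yields the ensemble "probabilities", so confident trees carry more weight.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
ensemble_proba = per_tree.mean(axis=0)
predicted_class = ensemble_proba.argmax(axis=1)
```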
Results
After training is complete:
To save a snapshot of the trained model, select the Outputs tab in the right panel of the Train model
module. Select the Register dataset icon to save the model as a reusable module.
To use the model for scoring, add the Score Model module to a pipeline.
Next steps
See the set of modules available to Azure Machine Learning.
Two-Class Logistic Regression module
3/5/2021 • 4 minutes to read • Edit Online
How to configure
To train this model, you must provide a dataset that contains a label or class column. Because this module is
intended for two-class problems, the label or class column must contain exactly two values.
For example, the label column might be [Voted] with possible values of "Yes" or "No". Or, it might be [Credit
Risk], with possible values of "High" or "Low".
1. Add the Two-Class Logistic Regression module to your pipeline.
2. Specify how you want the model to be trained, by setting the Create trainer mode option.
Single Parameter : If you know how you want to configure the model, you can provide a specific
set of values as arguments.
Parameter Range : If you are not sure of the best parameters, you can find the optimal
parameters by using the Tune Model Hyperparameters module. You provide some range of values,
and the trainer iterates over multiple combinations of the settings to determine the combination of
values that produces the best result.
3. For Optimization tolerance , specify a threshold value to use when optimizing the model. If the
improvement between iterations falls below the specified threshold, the algorithm is considered to have
converged on a solution, and training stops.
4. For L1 regularization weight and L2 regularization weight , type a value to use for the
regularization parameters L1 and L2. A non-zero value is recommended for both.
Regularization is a method for preventing overfitting by penalizing models with extreme coefficient
values. Regularization works by adding the penalty that is associated with coefficient values to the error
of the hypothesis. Thus, an accurate model with extreme coefficient values would be penalized more, but
a less accurate model with more conservative values would be penalized less.
L1 and L2 regularization have different effects and uses.
L1 can be applied to sparse models, which is useful when working with high-dimensional data.
In contrast, L2 regularization is preferable for data that is not sparse.
This algorithm supports a linear combination of L1 and L2 regularization values: that is, if x = L1 and
y = L2 , then ax + by = c defines the linear span of the regularization terms.
NOTE
Want to learn more about L1 and L2 regularization? The following article provides a discussion of how L1 and L2
regularization are different and how they affect model fitting, with code samples for logistic regression and neural
network models: L1 and L2 Regularization for Machine Learning
Different linear combinations of L1 and L2 terms have been devised for logistic regression models: for example,
elastic net regularization. We suggest that you reference these combinations to define a linear combination that is
effective in your model.
5. For Memory size for L-BFGS , specify the amount of memory to use for L-BFGS optimization.
L-BFGS stands for "limited memory Broyden-Fletcher-Goldfarb-Shanno". It is an optimization algorithm
that is popular for parameter estimation. This parameter indicates the number of past positions and
gradients to store for the computation of the next step.
This optimization parameter limits the amount of memory that is used to compute the next step and
direction. When you specify less memory, training is faster but less accurate.
6. For Random number seed , type an integer value. Defining a seed value is important if you want the
results to be reproducible over multiple runs of the same pipeline.
7. Add a labeled dataset to the pipeline, and train the model:
If you set Create trainer mode to Single Parameter , connect a tagged dataset and the Train
Model module.
If you set Create trainer mode to Parameter Range , connect a tagged dataset and train the
model by using Tune Model Hyperparameters.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
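The L1/L2 combination described in step 4 is closely related to elastic-net regularization. A rough scikit-learn sketch, not the module's implementation, with placeholder data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # placeholder two-class dataset

# l1_ratio mixes the L1 and L2 penalties; C is the inverse of the overall regularization weight.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0,
        tol=1e-4,            # ~ Optimization tolerance
        max_iter=5000, random_state=0,
    ),
)
model.fit(X, y)
```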
Results
After training is complete:
To make predictions on new data, use the trained model and new data as input to the Score Model module.
Next steps
See the set of modules available to Azure Machine Learning.
Two-Class Neural Network module
3/5/2021 • 4 minutes to read • Edit Online
How to configure
1. Add the Two-Class Neural Network module to your pipeline. You can find this module under
Machine Learning , Initialize , in the Classification category.
2. Specify how you want the model to be trained, by setting the Create trainer mode option.
Single Parameter : Choose this option if you already know how you want to configure the model.
Parameter Range : If you are not sure of the best parameters, you can find the optimal
parameters by using the Tune Model Hyperparameters module. You provide some range of values,
and the trainer iterates over multiple combinations of the settings to determine the combination of
values that produces the best result.
3. For Hidden layer specification , select the type of network architecture to create.
Fully connected case : Uses the default neural network architecture, defined for two-class neural
networks as follows:
Has one hidden layer.
The output layer is fully connected to the hidden layer, and the hidden layer is fully
connected to the input layer.
The number of nodes in the input layer equals the number of features in the training data.
The number of nodes in the hidden layer is set by the user. The default value is 100.
The number of nodes in the output layer equals the number of classes. For a two-class neural network, this
means that all inputs must map to one of two nodes in the output layer.
4. For Learning rate , define the size of the step taken at each iteration, before correction. A larger value for
learning rate can cause the model to converge faster, but it can overshoot local minima.
5. For Number of learning iterations , specify the maximum number of times the algorithm should
process the training cases.
6. For The initial learning weights diameter , specify the node weights at the start of the learning
process.
7. For The momentum , specify a weight to apply during learning to nodes from previous iterations.
8. Select the Shuffle examples option to shuffle cases between iterations. If you deselect this option, cases
are processed in exactly the same order each time you run the pipeline.
9. For Random number seed , type a value to use as the seed.
Specifying a seed value is useful when you want to ensure repeatability across runs of the same pipeline.
Otherwise, a system clock value is used as the seed, which can cause slightly different results each time
you run the pipeline.
10. Add a labeled dataset to the pipeline, and train the model:
If you set Create trainer mode to Single Parameter , connect a tagged dataset and the Train
Model module.
If you set Create trainer mode to Parameter Range , connect a tagged dataset and train the
model by using Tune Model Hyperparameters.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
Results
After training is complete:
To save a snapshot of the trained model, select the Outputs tab in the right panel of the Train model
module. Select the Register dataset icon to save the model as a reusable module.
To use the model for scoring, add the Score Model module to a pipeline.
Next steps
See the set of modules available to Azure Machine Learning.
Two-Class Support Vector Machine module
3/5/2021 • 3 minutes to read • Edit Online
How to configure
For this model type, it is recommended that you normalize the dataset before using it to train the classifier.
1. Add the Two-Class Suppor t Vector Machine module to your pipeline.
2. Specify how you want the model to be trained, by setting the Create trainer mode option.
Single Parameter : If you know how you want to configure the model, you can provide a specific
set of values as arguments.
Parameter Range : If you are not sure of the best parameters, you can find the optimal
parameters by using the Tune Model Hyperparameters module. You provide some range of values,
and the trainer iterates over multiple combinations of the settings to determine the combination of
values that produces the best result.
3. For Number of iterations , type a number that denotes the number of iterations used when building the
model.
This parameter can be used to control the trade-off between training speed and accuracy.
4. For Lambda , type a value to use as the weight for L1 regularization.
This regularization coefficient can be used to tune the model. Larger values penalize more complex
models.
5. Select the option, Normalize features , if you want to normalize features before training.
If you apply normalization, before training, data points are centered at the mean and scaled to have one
unit of standard deviation.
6. Select the option, Project to the unit sphere , to normalize coefficients.
Projecting values to unit space means that before training, data points are centered at 0 and scaled to
have one unit of standard deviation.
7. In Random number seed , type an integer value to use as a seed if you want to ensure reproducibility
across runs. Otherwise, a system clock value is used as a seed, which can result in slightly different results
across runs.
8. Connect a labeled dataset, and train the model:
If you set Create trainer mode to Single Parameter , connect a tagged dataset and the Train
Model module.
If you set Create trainer mode to Parameter Range , connect a tagged dataset and train the
model by using Tune Model Hyperparameters.
NOTE
If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.
If you pass a single set of parameter values to the Tune Model Hyperparameters module, when it expects a range
of settings for each parameter, it ignores the values, and uses the default values for the learner.
If you select the Parameter Range option and enter a single value for any parameter, that single value you
specified is used throughout the sweep, even if other parameters change across a range of values.
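A rough scikit-learn analogue of a regularized linear SVM with feature normalization; it is not the module's implementation, and the data is a placeholder:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)   # placeholder two-class dataset

# StandardScaler centers features at the mean and scales them to unit standard deviation,
# mirroring the Normalize features option; stronger regularization (smaller C) penalizes complexity.
model = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=10000, random_state=0),
)
model.fit(X, y)
```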
Results
After training is complete:
To save a snapshot of the trained model, select the Outputs tab in the right panel of the Train model
module. Select the Register dataset icon to save the model as a reusable module.
To use the model for scoring, add the Score Model module to a pipeline.
Next steps
See the set of modules available to Azure Machine Learning.
Train Clustering Model
3/19/2021 • 2 minutes to read • Edit Online
NOTE
A clustering model cannot be trained using the Train Model module, which is the generic module for training machine
learning models. That is because Train Model works only with supervised learning algorithms. K-means and other
clustering algorithms allow unsupervised learning, meaning that the algorithm can learn from unlabeled data.
NOTE
If you need to deploy the trained model in the designer, make sure that Assign Data to Clusters instead of Score Model
is connected to the input of Web Service Output module in the inference pipeline.
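For orientation, the following scikit-learn sketch (an analogue on synthetic data, not the designer's code) shows the same unsupervised workflow: K-means is fit on unlabeled data, and new points are then assigned to the learned clusters, which is the role Assign Data to Clusters plays in the designer.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # unlabeled training data

# Train the clustering model; no labels are required.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Assign new data points to the learned clusters.
X_new = rng.normal(size=(5, 3))
print(kmeans.predict(X_new))           # cluster index for each new row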
Next steps
See the set of modules available to Azure Machine Learning.
Train Model module
5/25/2021 • 4 minutes to read • Edit Online
TIP
If you have trouble using the Column Selector, see the article Select Columns in Dataset for tips. It describes some
common scenarios and tips for using the WITH RULES and BY NAME options.
4. Submit the pipeline. If you have a lot of data, it can take a while.
IMPORTANT
If you have an ID column that identifies each row, or a text column that contains too many unique values,
Train Model may report an error like "Number of unique values in column: "{column_name}" is greater than allowed."
This happens because the column exceeds the unique-value threshold and may cause an out-of-memory condition. You can use Edit
Metadata to mark that column as Clear feature so it is not used in training, or use the Extract N-Gram Features
from Text module to preprocess the text column. See Designer error code for more error details.
Model Interpretability
Model interpretability makes it possible to comprehend the ML model and to present the underlying basis for
decision-making in a way that is understandable to humans.
Currently the Train Model module uses an interpretability package to explain ML models. The following built-in
algorithms are supported:
Linear Regression
Neural Network Regression
Boosted Decision Tree Regression
Decision Forest Regression
Poisson Regression
Two-Class Logistic Regression
Two-Class Support Vector Machine
Two-Class Boosted Decision Tree
Two-Class Decision Forest
Multi-class Decision Forest
Multi-class Logistic Regression
Multi-class Neural Network
To generate model explanations, select True in the Model Explanation drop-down list of the Train Model
module. It is set to False by default, because generating explanations adds extra compute cost.
After the pipeline run completes, you can visit the Explanations tab in the right pane of the Train Model module and
explore the model performance, dataset, and feature importance.
To learn more about using model explanations in Azure Machine Learning, refer to the how-to article about
Interpret ML models.
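The designer produces these explanations with its built-in interpretability package. As a loose, hedged analogue of the feature-importance idea, the sketch below uses scikit-learn's permutation importance on a synthetic dataset; it is not the module's implementation.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # label driven mostly by features 0 and 2

model = GradientBoostingClassifier().fit(X, y)

# Shuffle each feature and measure how much the score drops; a larger drop means the
# feature matters more to the trained model.
result = permutation_importance(model, X, y, n_repeats=5, random_state=1)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: importance {result.importances_mean[i]:.3f}")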
Results
After the model is trained:
To use the model in other pipelines, select the module and select the Register dataset icon under the
Outputs tab in right panel. You can access saved models in the module palette under Datasets .
To use the model in predicting new values, connect it to the Score Model module, together with new input
data.
Next steps
See the set of modules available to Azure Machine Learning.
Train PyTorch Model
6/10/2021 • 5 minutes to read • Edit Online
This article describes how to use the Train PyTorch Model module in Azure Machine Learning designer to
train PyTorch models like DenseNet. Training takes place after you define a model and set its parameters, and
requires labeled data.
Currently, Train PyTorch Model module supports both single node and distributed training.
NOTE
For large datasets, run the Train PyTorch Model module on GPU compute; otherwise, your pipeline will
fail. You can select the compute for a specific module in the right pane of the module by setting Use other compute
target.
3. On the left input, attach an untrained model. Attach the training dataset and validation dataset to the
middle and right-hand input of Train PyTorch Model .
The untrained model must be a PyTorch model like DenseNet; otherwise, an
'InvalidModelDirectoryError' will be thrown.
The training dataset must be a labeled image directory. Refer to Convert to Image
Directory for how to get a labeled image directory. If the dataset is not labeled, a 'NotLabeledDatasetError' will be
thrown.
The training dataset and validation dataset must have the same label categories; otherwise, an
'InvalidDatasetError' will be thrown.
4. For Epochs, specify how many epochs you'd like to train. The whole dataset is iterated in every
epoch. The default is 5.
5. For Batch size, specify how many instances to train in a batch. The default is 16.
6. For Warmup step number, specify how many epochs you'd like to warm up the training, in case the initial
learning rate is slightly too large to start converging. The default is 0.
7. For Learning rate, specify a value for the learning rate. The default is 0.001. The learning rate
controls the size of the step used in an optimizer like SGD each time the model is tested and corrected.
By making the rate smaller, you test the model more often, with the risk of getting stuck in a local
plateau. By making the rate larger, you can converge faster, with the risk of overshooting the true minimum.
NOTE
If the training loss becomes NaN during training, the learning rate may be too large; decreasing it may help.
In distributed training, to keep gradient descent stable, the actual learning rate is calculated as
lr * torch.distributed.get_world_size() , because the batch size of the process group is the world size times that of
a single process. Polynomial learning rate decay is applied and can help produce a better performing model.
(A minimal code sketch of these settings follows the numbered steps below.)
8. For Random seed , optionally type an integer value to use as the seed. Using a seed is recommended if
you want to ensure reproducibility of the experiment across runs.
9. For Patience, specify how many epochs to wait before stopping training early if the validation loss does not
decrease consecutively. The default is 3.
10. For Print frequency, specify how often (in iterations within each epoch) the training log is printed. The default is 10.
11. Submit the pipeline. If your dataset is large, training will take a while, and GPU compute is
recommended.
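For orientation only, the following PyTorch sketch shows how the settings above (epochs, batch size, learning rate) typically appear in a plain training loop, and how the distributed learning-rate scaling mentioned in the earlier note would be computed. It uses a toy model and random data under those assumptions; it is not the module's training code.

import torch
import torch.distributed as dist
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model standing in for an image dataset and DenseNet.
data = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
model = nn.Linear(16, 2)

epochs, batch_size, base_lr = 5, 16, 0.001      # the module defaults described above

# In distributed training, the effective rate is scaled by the world size
# (only when a process group has been initialized; shown here for illustration).
lr = base_lr * dist.get_world_size() if dist.is_available() and dist.is_initialized() else base_lr

optimizer = torch.optim.SGD(model.parameters(), lr=lr)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(epochs):                     # the whole dataset is iterated in every epoch
    for features, labels in DataLoader(data, batch_size=batch_size, shuffle=True):
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()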
Distributed training
In distributed training the workload to train a model is split up and shared among multiple mini processors,
called worker nodes. These worker nodes work in parallel to speed up model training. Currently the designer
supports distributed training for the Train PyTorch Model module.
Training time
Distributed training makes it possible to train on a large dataset like ImageNet (1,000 classes, 1.2 million images)
in just several hours with Train PyTorch Model. The following table shows training time and performance for
training ResNet50 for 50 epochs on ImageNet from scratch on different devices.
(Table columns: Devices, Training time, Training throughput, Top-1 validation accuracy, Top-5 validation accuracy.)
Select the module's Metrics tab to see training metric graphs, such as 'Train images per second' and 'Top 1
accuracy'.
How to enable distributed training
To enable distributed training for the Train PyTorch Model module, set the options in Run settings in the right pane
of the module. Only an AML Compute cluster is supported for distributed training.
NOTE
Multiple GPUs are required to activate distributed training, because the NCCL backend that the Train PyTorch Model module
uses requires CUDA.
1. Select the module and open the right panel. Expand the Run settings section.
2. Make sure you have selected AML compute as the compute target.
3. In Resource layout section, you need to set the following values:
Node count: Number of nodes in the compute target used for training. It should be less than or
equal to the Maximum number of nodes of your compute cluster. The default is 1, which means a
single-node job.
Process count per node: Number of processes triggered per node. It should be less than or
equal to the Processing Unit of your compute. The default is 1, which means a single-process
job.
You can check the Maximum number of nodes and Processing Unit of your compute by clicking the
compute name to open the compute detail page.
You can learn more about distributed training in Azure Machine Learning here.
Troubleshooting for distributed training
If you enable distributed training for this module, there will be driver logs for each process. 70_driver_log_0 is
for the master process. You can check the driver logs for error details of each process under the Outputs+logs tab in the
right pane.
If the module with distributed training enabled fails without any 70_driver logs, you can check 70_mpi_log for error
details.
The following example shows a common error: Process count per node is larger than the Processing
Unit of the compute.
You can refer to this article for more details about module troubleshooting.
Results
After the pipeline run is completed, to use the model for scoring, connect Train PyTorch Model to Score Image
Model to predict values for new input examples.
Technical notes
Expected inputs
NAME TYPE DESCRIPTION
Module parameters
NAME RANGE TYPE DEFAULT DESCRIPTION
Outputs
NAME TYPE DESCRIPTION
Next steps
See the set of modules available to Azure Machine Learning.
Tune Model Hyperparameters
3/5/2021 • 7 minutes to read • Edit Online
This article describes how to use the Tune Model Hyperparameters module in Azure Machine Learning designer.
The goal is to determine the optimum hyperparameters for a machine learning model. The module builds and
tests multiple models by using different combinations of settings. It compares metrics over all models to find the
combination of settings that produces the best results.
The terms parameter and hyperparameter can be confusing. The model's parameters are what you set in the
right pane of the module. Basically, this module performs a parameter sweep over the specified parameter
settings. It learns an optimal set of hyperparameters, which might be different for each specific decision tree,
dataset, or regression method. The process of finding the optimal configuration is sometimes called tuning.
The module supports the following method for finding the optimum settings for a model: integrated train and
tune. In this method, you configure a set of parameters to use. You then let the module iterate over multiple
combinations. The module measures accuracy until it finds a "best" model. With most learner modules, you can
choose which parameters should be changed during the training process, and which should remain fixed.
Depending on how long you want the tuning process to run, you might decide to exhaustively test all
combinations. Or you might shorten the process by establishing a grid of parameter combinations and testing a
randomized subset of the parameter grid.
This method generates a trained model that you can save for reuse.
TIP
You can do a related task. Before you start tuning, apply feature selection to determine the columns or variables that
have the highest information value.
NOTE
Tune Model Hyperparameters can only be connected to built-in machine learning algorithm modules, and
cannot support a customized model built in Create Python Model.
3. Add the dataset that you want to use for training, and connect it to the middle input of Tune Model
Hyperparameters.
Optionally, if you have a tagged dataset, you can connect it to the rightmost input port (Optional
validation dataset ). This lets you measure accuracy while training and tuning.
4. In the right panel of Tune Model Hyperparameters, choose a value for Parameter sweeping mode . This
option controls how the parameters are selected.
Entire grid : When you select this option, the module loops over a grid predefined by the system,
to try different combinations and identify the best learner. This option is useful when you don't
know what the best parameter settings might be and want to try all possible combinations of
values.
Random sweep : When you select this option, the module will randomly select parameter values
over a system-defined range. You must specify the maximum number of runs that you want the
module to execute. This option is useful when you want to increase model performance by using
the metrics of your choice but still conserve computing resources.
5. For Label column , open the column selector to choose a single label column.
6. Choose the number of runs:
Maximum number of runs on random sweep : If you choose a random sweep, you can specify
how many times the model should be trained, by using a random combination of parameter values.
7. For Ranking , choose a single metric to use for ranking the models.
When you run a parameter sweep, the module calculates all applicable metrics for the model type and
returns them in the Sweep results report. The module uses separate metrics for regression and
classification models.
However, the metric that you choose determines how the models are ranked. Only the top model, as
ranked by the chosen metric, is output as a trained model to use for scoring.
8. For Random seed , enter a number to use for starting the parameter sweep.
9. Submit the pipeline.
Technical notes
This section contains implementation details and tips.
How a parameter sweep works
When you set up a parameter sweep, you define the scope of your search. The search might use a finite number
of parameters selected randomly. Or it might be an exhaustive search over a parameter space that you define.
Random sweep : This option trains a model by using a set number of iterations.
You specify a range of values to iterate over, and the module uses a randomly chosen subset of those
values. Values are chosen with replacement, meaning that numbers previously chosen at random are not
removed from the pool of available numbers. So the chance of any value being selected stays the same
across all passes.
Entire grid : The option to use the entire grid means that every combination is tested. This option is the
most thorough, but it requires the most time.
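As an analogue outside the designer, scikit-learn exposes the same two strategies: GridSearchCV tests every combination in the grid (entire grid), while RandomizedSearchCV samples a fixed number of combinations (random sweep). The parameter grid below is hypothetical and chosen only for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Entire grid: every combination is tested (most thorough, most time).
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, scoring="roc_auc", cv=3)
grid.fit(X, y)

# Random sweep: only n_iter randomly chosen combinations are tested.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=4, scoring="roc_auc", cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)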
Controlling the length and complexity of training
Iterating over many combinations of settings can be time-consuming, so the module provides several ways to
constrain the process:
Limit the number of iterations used to test a model.
Limit the parameter space.
Limit both the number of iterations and the parameter space.
We recommend that you experiment with the settings to determine the most efficient method of training on a
particular dataset and model.
Choosing an evaluation metric
At the end of testing, the model presents a report that contains the accuracy for each model so that you can
review the metric results:
A uniform set of metrics is used for all binary classification models.
Accuracy is used for all multi-class classification models.
A different set of metrics is used for regression models.
However, during training, you must choose a single metric to use in ranking the models that are generated
during the tuning process. You might find that the best metric varies, depending on your business problem and
the cost of false positives and false negatives.
Metrics used for binary classification
Accuracy is the proportion of true results to total cases.
Precision is the proportion of true results to positive results.
Recall is the fraction of relevant instances that are retrieved: TP / (TP + FN).
F-score is a measure that balances precision and recall.
AUC is a value that represents the area under the curve when false positives are plotted on the x-axis and
true positives are plotted on the y-axis.
Average Log Loss is the difference between two probability distributions: the true one, and the one in
the model.
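These definitions can be checked outside the designer with scikit-learn, as in the short sketch below; the labels and scores are made up for illustration.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                  # predicted labels
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probability of the positive class

print("Accuracy ", accuracy_score(y_true, y_pred))
print("Precision", precision_score(y_true, y_pred))
print("Recall   ", recall_score(y_true, y_pred))
print("F-score  ", f1_score(y_true, y_pred))
print("AUC      ", roc_auc_score(y_true, y_prob))
print("Log loss ", log_loss(y_true, y_prob))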
Metrics used for regression
Mean absolute error averages all the errors in the model, where error means the distance of the
predicted value from the true value. It's often abbreviated as MAE.
Root of mean squared error measures the average of the squares of the errors, and then takes the
root of that value. It's often abbreviated as RMSE.
Relative absolute error represents the error as a percentage of the true value.
Relative squared error normalizes the total squared error by dividing by the total squared error of the
predicted values.
Coefficient of determination is a single number that indicates how well data fits a model. A value of
one means that the model exactly matches the data. A value of zero means that the data is random or
otherwise can't be fit to the model. It's often called r2, R 2, or r-squared.
Modules that don't support a parameter sweep
Almost all learners in Azure Machine Learning support cross-validation with an integrated parameter sweep,
which lets you choose the parameters to experiment with. If the learner doesn't support setting a range of values,
you can still use it in cross-validation. In this case, a range of allowed values is selected for the sweep.
Next steps
See the set of modules available to Azure Machine Learning.
Apply Transformation module
11/2/2020 • 2 minutes to read • Edit Online
2. In the inference pipeline, remove the TD- module, and replace it with the dataset that you registered in the previous step.
Next steps
See the set of modules available to Azure Machine Learning.
Module: Assign Data to Clusters
11/2/2020 • 2 minutes to read • Edit Online
This article describes how to use the Assign Data to Clusters module in Azure Machine Learning designer. The
module generates predictions through a clustering model that was trained with the K-means clustering
algorithm.
The Assign Data to Clusters module returns a dataset that contains the probable assignments for each new data
point.
TIP
To reduce the number of columns that are written to the designer from the cluster predictions, use Select Columns
in Dataset, and select a subset of the columns.
4. Leave the Check for append or uncheck for result only check box selected if you want the results to
contain the full input dataset, including a column that displays the results (cluster assignments).
If you clear this check box, only the results are returned. This option might be useful when you create
predictions as part of a web service.
5. Submit the pipeline.
Results
To view the values in the dataset, right-click the module, and then select Visualize. Or select the module,
switch to the Outputs tab in the right panel, and click the histogram icon in Port outputs to visualize the
result.
Cross Validate Model
3/5/2021 • 6 minutes to read • Edit Online
This article describes how to use the Cross Validate Model module in Azure Machine Learning designer. Cross-
validation is a technique often used in machine learning to assess both the variability of a dataset and the
reliability of any model trained through that data.
The Cross Validate Model module takes as input a labeled dataset, together with an untrained classification or
regression model. It divides the dataset into some number of subsets (folds), builds a model on each fold, and
then returns a set of accuracy statistics for each fold. By comparing the accuracy statistics for all the folds, you
can interpret the quality of the data set. You can then understand whether the model is susceptible to variations
in the data.
Cross Validate Model also returns predicted results and probabilities for the dataset, so that you can assess the
reliability of the predictions.
How cross-validation works
1. Cross-validation randomly divides training data into folds.
The algorithm defaults to 10 folds if you have not previously partitioned the dataset. To divide the dataset
into a different number of folds, you can use the Partition and Sample module and indicate how many
folds to use.
2. The module sets aside the data in fold 1 to use for validation. (This is sometimes called the holdout fold.)
The module uses the remaining folds to train a model.
For example, if you create five folds, the module generates five models during cross-validation. The
module trains each model by using four-fifths of the data. It tests each model on the remaining one-fifth.
3. During testing of the model for each fold, the module evaluates multiple accuracy statistics. Which
statistics the module uses depends on the type of model that you're evaluating. Different statistics are
used to evaluate classification models versus regression models.
4. When the building and evaluation process is complete for all folds, Cross Validate Model generates a set
of performance metrics and scored results for all the data. Review these metrics to see whether any single
fold has high or low accuracy.
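The fold-by-fold procedure can be sketched with scikit-learn as an analogue (not the module's code): the data is split into folds, a model is trained on the remaining folds, and an accuracy statistic is reported for each held-out fold.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 10 folds by default in the module; each score below comes from one held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="accuracy")
for fold, score in enumerate(scores):
    print(f"fold {fold}: accuracy {score:.3f}")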
Advantages of cross-validation
A different and common way of evaluating a model is to divide the data into a training and test set by using
Split Data, and then validate the model on the training data. But cross-validation offers some advantages:
Cross-validation uses more test data.
Cross-validation measures the performance of the model with the specified parameters in a bigger data
space. That is, cross-validation uses the entire training dataset for both training and evaluation, instead of
a portion. In contrast, if you validate a model by using data generated from a random split, typically you
evaluate the model on only 30 percent or less of the available data.
However, because cross-validation trains and validates the model multiple times over a larger dataset, it's
much more computationally intensive. It takes much longer than validating on a random split.
Cross-validation evaluates both the dataset and the model.
Cross-validation doesn't simply measure the accuracy of a model. It also gives you some idea of how
representative the dataset is and how sensitive the model might be to variations in the data.
TIP
You don't have to train the model, because Cross-Validate Model automatically trains the model as part of
evaluation.
3. On the Dataset port of Cross Validate Model, connect any labeled training dataset.
4. In the right panel of Cross Validate Model, click Edit column . Select the single column that contains the
class label, or the predictable value.
5. Set a value for the Random seed parameter if you want to repeat the results of cross-validation across
successive runs on the same data.
6. Submit the pipeline.
7. See the Results section for a description of the reports.
Results
After all iterations are complete, Cross Validate Model creates scores for the entire dataset. It also creates
performance metrics that you can use to assess the quality of the model.
Scored results
The first output of the module provides the source data for each row, together with some predicted values and
related probabilities.
To view the results, in the pipeline, right-click the Cross Validate Model module. Select Visualize Scored
results .
The following new columns are added:
Scored Labels : This column is added at the end of the dataset. It contains the predicted value for each row.
Scored Probabilities : This column is added at the end of the dataset. It indicates the estimated probability of the value in Scored Labels .
Fold Number : Indicates the zero-based index of the fold that each row of data was assigned to during cross-validation.
Evaluation results
The second report is grouped by folds. Remember that during execution, Cross Validate Model randomly splits
the training data into n folds (by default, 10). In each iteration over the dataset, Cross Validate Model uses one
fold as a validation dataset. It uses the remaining n-1 folds to train a model. Each of the n models is tested
against the data in all the other folds.
In this report, the folds are listed by index value, in ascending order. To order on any other column, you can save
the results as a dataset.
To view the results, in the pipeline, right-click the Cross Validate Model module. Select Visualize Evaluation
results by fold .
The report includes the following columns:
Fold number : An identifier for each fold. If you created five folds, there would be five subsets of data, numbered 0 to 4.
Number of examples in fold : The number of rows assigned to each fold. They should be roughly equal.
The module also includes the following metrics for each fold, depending on the type of model that you're
evaluating:
Classification models : Precision, recall, F-score, AUC, accuracy
Regression models : Mean absolute error, root mean squared error, relative absolute error, relative
squared error, and coefficient of determination
Technical notes
It's a best practice to normalize datasets before you use them for cross-validation.
Cross Validate Model is much more computationally intensive and takes longer to complete than if you
validated the model by using a randomly divided dataset. The reason is that Cross Validate Model trains
and validates the model multiple times.
There's no need to split the dataset into training and testing sets when you use cross-validation to
measure the accuracy of the model.
Next steps
See the set of modules available to Azure Machine Learning.
Evaluate Model module
3/5/2021 • 6 minutes to read • Edit Online
TIP
If you are new to model evaluation, we recommend the video series by Dr. Stephen Elston, as part of the machine learning
course from EdX.
NOTE
If you use modules like "Select Columns in Dataset" to select part of the input dataset, please ensure the required
columns exist. The actual label column (used in training), the 'Scored Probabilities' column, and the 'Scored Labels'
column are needed to calculate metrics like AUC and accuracy for binary classification/anomaly detection. The actual
label column and the 'Scored Labels' column are needed to calculate metrics for multi-class classification/regression.
The 'Assignments' column and the 'DistancesToClusterCenter no.X' columns (where X is the centroid index, ranging
from 0 to the number of centroids minus 1) are needed to calculate metrics for clustering.
IMPORTANT
To evaluate the results, the output dataset should contain the specific score column names that the Evaluate
Model module requires.
The Labels column is treated as the actual labels.
For a regression task, the dataset to evaluate must have one column, named Regression Scored Labels , which
represents the scored labels.
For a binary classification task, the dataset to evaluate must have two columns, named
Binary Class Scored Labels and Binary Class Scored Probabilities , which represent the scored labels and
probabilities, respectively.
For a multi-class classification task, the dataset to evaluate must have one column, named
Multi Class Scored Labels , which represents the scored labels. If the outputs of the upstream module do
not have these columns, you need to modify them according to the requirements above.
2. [Optional] Connect the Scored dataset output of the Score Model or Result dataset output of the Assign
Data to Clusters for the second model to the right input port of Evaluate Model . You can easily
compare results from two different models on the same data. The two input algorithms should be the
same algorithm type. Or, you might compare scores from two different runs over the same data with
different parameters.
NOTE
Algorithm type refers to 'Two-class Classification', 'Multi-class Classification', 'Regression', 'Clustering' under
'Machine Learning Algorithms'.
Results
After you run Evaluate Model , select the module to open up the Evaluate Model navigation panel on the
right. Then, choose the Outputs + Logs tab, and on that tab the Data Outputs section has several icons. The
Visualize icon has a bar graph icon, and is a first way to see the results.
For binary classification, after you click the Visualize icon, you can visualize the binary confusion matrix. For multi-class
classification, you can find the confusion matrix plot file under the Outputs + Logs tab, like the following:
If you connect datasets to both inputs of Evaluate Model , the results will contain metrics for both sets of data,
or both models. The model or data attached to the left port is presented first in the report, followed by the
metrics for the dataset or model attached to the right port.
For example, the following image represents a comparison of results from two clustering models that were built
on the same data, but with different parameters.
Because this is a clustering model, the evaluation results are different than if you compared scores from two
regression models, or compared two classification models. However, the overall presentation is the same.
Metrics
This section describes the metrics returned for the specific types of models supported for use with Evaluate
Model :
classification models
regression models
clustering models
Metrics for classification models
The following metrics are reported when evaluating binary classification models.
Accuracy measures the goodness of a classification model as the proportion of true results to total
cases.
Precision is the proportion of true results over all positive results. Precision = TP/(TP+FP)
Recall is the fraction of the total amount of relevant instances that were actually retrieved. Recall =
TP/(TP+FN)
F1 score is computed as the weighted average of precision and recall between 0 and 1, where the ideal
F1 score value is 1.
AUC measures the area under the curve plotted with true positives on the y axis and false positives on
the x axis. This metric is useful because it provides a single number that lets you compare models of
different types. AUC is classification-threshold-invariant. It measures the quality of the model's
predictions irrespective of what classification threshold is chosen.
Metrics for regression models
The metrics returned for regression models are designed to estimate the amount of error. A model is considered
to fit the data well if the difference between observed and predicted values is small. However, looking at the
pattern of the residuals (the difference between any one predicted point and its corresponding actual value) can
tell you a lot about potential bias in the model.
The following metrics are reported for evaluating regression models.
Mean absolute error (MAE) measures how close the predictions are to the actual outcomes; thus, a
lower score is better.
Root mean squared error (RMSE) creates a single value that summarizes the error in the model. By
squaring the difference, the metric disregards the difference between over-prediction and under-
prediction.
Relative absolute error (RAE) is the relative absolute difference between expected and actual values;
relative because the mean difference is divided by the arithmetic mean.
Relative squared error (RSE) similarly normalizes the total squared error of the predicted values by
dividing by the total squared error of the actual values.
Coefficient of determination , often referred to as R², represents the predictive power of the model as
a value between 0 and 1. Zero means the model is random (explains nothing); 1 means there is a perfect
fit. However, caution should be used in interpreting R² values, as low values can be entirely normal and
high values can be suspect.
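These regression metrics can be reproduced with numpy and scikit-learn, as in the hedged sketch below; the observed and predicted values are made up, and the RAE/RSE formulas shown are one common formulation of the definitions above.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])   # hypothetical observed values
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.7])   # hypothetical predicted values

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
rae = np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()
rse = ((y_true - y_pred) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
r2 = r2_score(y_true, y_pred)

print(f"MAE {mae:.3f}  RMSE {rmse:.3f}  RAE {rae:.3f}  RSE {rse:.3f}  R2 {r2:.3f}")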
Metrics for clustering models
Because clustering models differ significantly from classification and regression models in many respects,
Evaluate Model also returns a different set of statistics for clustering models.
The statistics returned for a clustering model describe how many data points were assigned to each cluster, the
amount of separation between clusters, and how tightly the data points are bunched within each cluster.
The statistics for the clustering model are averaged over the entire dataset, with additional rows containing the
statistics per cluster.
The following metrics are reported for evaluating clustering models.
The scores in the column, Average Distance to Other Center , represent how close, on average, each
point in the cluster is to the centroids of all other clusters.
The scores in the column, Average Distance to Cluster Center , represent the closeness of all points in
a cluster to the centroid of that cluster.
The Number of Points column shows how many data points were assigned to each cluster, along with
the total overall number of data points in any cluster.
If the number of data points assigned to clusters is less than the total number of data points available, it
means that the data points could not be assigned to a cluster.
The scores in the column, Maximal Distance to Cluster Center , represent the max of the distances
between each point and the centroid of that point's cluster.
If this number is high, it can mean that the cluster is widely dispersed. You should review this statistic
together with the Average Distance to Cluster Center to determine the cluster's spread.
The Combined Evaluation score at the bottom of each section of results lists the averaged scores
for the clusters created in that particular model.
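An approximate way to compute the distance-based statistics outside the designer is sketched below with scikit-learn's K-means, whose transform method returns each point's distance to every centroid. The data is synthetic and the formulas approximate what the module reports.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X)
dists = kmeans.transform(X)                      # distance from each point to every centroid
assigned = dists[np.arange(len(X)), kmeans.labels_]

for c in range(3):
    in_c = kmeans.labels_ == c
    print(f"cluster {c}: points {in_c.sum()}, "
          f"avg distance to center {assigned[in_c].mean():.3f}, "
          f"max distance to center {assigned[in_c].max():.3f}")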
Next steps
See the set of modules available to Azure Machine Learning.
Score Image Model
3/5/2021 • 2 minutes to read • Edit Online
Results
After you have generated a set of scores using Score Image Model, you can connect this module and the scored
dataset to Evaluate Model to generate a set of metrics used for evaluating the model's accuracy (performance).
Publish scores as a web service
A common use of scoring is to return the output as part of a predictive web service. For more information, see
this tutorial on how to deploy a real-time endpoint based on a pipeline in Azure Machine Learning designer.
Next steps
See the set of modules available to Azure Machine Learning.
Score Model
3/5/2021 • 2 minutes to read • Edit Online
How to use
1. Add the Score Model module to your pipeline.
2. Attach a trained model and a dataset containing new input data.
The data should be in a format compatible with the type of trained model you are using. The schema of
the input dataset should also generally match the schema of the data used to train the model.
3. Submit the pipeline.
Results
After you have generated a set of scores using Score Model:
To generate a set of metrics used for evaluating the model's accuracy (performance), you can connect the
scored dataset to Evaluate Model.
Right-click the module and select Visualize to see a sample of the results.
The score, or predicted value, can be in many different formats, depending on the model and your input data:
For classification models, Score Model outputs a predicted value for the class, as well as the probability of the
predicted value.
For regression models, Score Model generates just the predicted numeric value.
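As a rough analogue, the sketch below shows how a trained scikit-learn classifier returns both a predicted class and a class probability, while a regressor returns only a numeric prediction; the data is synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(3)
X, y_class = rng.normal(size=(100, 4)), rng.integers(0, 2, size=100)
y_reg = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

clf = LogisticRegression().fit(X, y_class)
print(clf.predict(X[:3]), clf.predict_proba(X[:3])[:, 1])   # predicted class and probability

reg = LinearRegression().fit(X, y_reg)
print(reg.predict(X[:3]))                                   # predicted numeric value only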
Next steps
See the set of modules available to Azure Machine Learning.
Create Python Model module
8/25/2021 • 3 minutes to read • Edit Online
WARNING
Currently, it's not possible to connect this module to Tune Model Hyperparameters module or pass the scored results
of a Python model to Evaluate Model. If you need to tune the hyperparameters or evaluate a model, you can write a
custom Python script by using Execute Python Script module.
NOTE
Please be very careful when writing your script and make sure there are no syntax errors, such as using an undeclared
object or an unimported module.
NOTE
Also pay extra attention to the pre-installed modules list in Execute Python Script, and only import pre-installed modules.
Please do not install extra packages (such as "pip install xgboost") in this script; otherwise, errors will be raised when reading
models in downstream modules.
This article shows how to use Create Python Model with a simple pipeline. Here's a diagram of the pipeline:
1. Select Create Python Model , and edit the script to implement your modeling or data management process.
You can base the model on any learner that's included in a Python package in the Azure Machine Learning
environment.
NOTE
Please pay extra attention to the comments in the sample code of the script and make sure your script strictly follows the
requirements, including the class name, methods, and method signatures. Violations will lead to exceptions. Create
Python Model only supports creating an sklearn-based model that is trained using Train Model.
The following sample code of the two-class Naive Bayes classifier uses the popular sklearn package:
# The script MUST define a class named AzureMLModel.
# This class MUST at least define the following three methods:
#     __init__: in which self.model must be assigned,
#     train: which trains self.model, the two input arguments must be pandas DataFrame,
#     predict: which generates prediction results, the input argument and the prediction result MUST be pandas DataFrame.
# The signatures (method names and argument names) of all these methods MUST be exactly the same as the following example.
# Please do not install extra packages such as "pip install xgboost" in this script,
# otherwise errors will be raised when reading models in down-stream modules.
import pandas as pd
from sklearn.naive_bayes import GaussianNB

class AzureMLModel:
    def __init__(self):
        self.model = GaussianNB()
        self.feature_column_names = list()

    def train(self, df_train, df_label):
        # Remember the feature columns so that predict can select the same columns.
        self.feature_column_names = df_train.columns.tolist()
        self.model.fit(df_train, df_label)

    def predict(self, df):
        return pd.DataFrame(
            {'Scored Labels': self.model.predict(df[self.feature_column_names]),
             'Scored Probabilities': self.model.predict_proba(df[self.feature_column_names])[:, 1]}
        )
2. Connect the Create Python Model module that you just created to Train Model and Score Model .
3. If you need to evaluate the model, add an Execute Python Script module and edit the Python script.
The following script is sample evaluation code:
# The script MUST contain a function named azureml_main, which is the entry point for this module.
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def azureml_main(dataframe1=None, dataframe2=None):
    # Column names are illustrative; match them to the scored dataset from Score Model.
    y_true, y_pred, y_prob = dataframe1["label"], dataframe1["Scored Labels"], dataframe1["Scored Probabilities"]
    metrics = pd.DataFrame()
    metrics["Metric"] = ["Accuracy", "Precision", "Recall", "AUC"]
    metrics["Value"] = [accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
                        recall_score(y_true, y_pred), roc_auc_score(y_true, y_prob)]
    return metrics,
Next steps
See the set of modules available to Azure Machine Learning.
Execute Python Script module
8/25/2021 • 8 minutes to read • Edit Online
This article describes the Execute Python Script module in Azure Machine Learning designer.
Use this module to run Python code. For more information about the architecture and design principles of
Python, see how run Python code in Azure Machine Learning designer.
With Python, you can perform tasks that existing modules don't support, such as:
Visualizing data by using matplotlib .
Using Python libraries to enumerate datasets and models in your workspace.
Reading, loading, and manipulating data from sources that the Import Data module doesn't support.
Running your own deep learning code.
import os
os.system(f"pip install scikit-misc")
Use the following code to install packages for better performance, especially for inference:
import importlib.util
package_name = 'scikit-misc'
spec = importlib.util.find_spec(package_name)
if spec is None:
import os
os.system(f"pip install scikit-misc")
NOTE
If your pipeline contains multiple Execute Python Script modules that need packages that aren't in the preinstalled list,
install the packages in each module.
WARNING
The Execute Python Script module does not support installing packages that depend on extra native libraries installed with
commands like "apt-get", such as Java, PyODBC, and so on. This is because the module is executed in a simple environment
with only Python pre-installed and without admin permission.
Access to current workspace and registered datasets
You can refer to the following sample code to access the registered datasets in your workspace:
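A minimal sketch is shown below. It assumes the azureml-core SDK is available in the module environment, and the dataset name is hypothetical.

from azureml.core import Run, Dataset

def azureml_main(dataframe1=None, dataframe2=None):
    # Get the workspace of the current run, then load a registered dataset by name.
    ws = Run.get_context().experiment.workspace
    dataset = Dataset.get_by_name(ws, name="my-dataset")   # hypothetical dataset name
    df = dataset.to_pandas_dataframe()
    return df,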
Upload files
The Execute Python Script module supports uploading files by using the Azure Machine Learning Python SDK.
The following example shows how to upload an image file to the run record in the Execute Python Script
module:
# The script MUST contain a function named azureml_main,
# which is the entry point for this module.
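# (The body below is a minimal sketch of such a script, not the original sample;
#  it assumes matplotlib and the azureml-core SDK are available in the module environment.)
import matplotlib
matplotlib.use("agg")                       # non-interactive backend for pipeline runs
import matplotlib.pyplot as plt
from azureml.core import Run

def azureml_main(dataframe1=None, dataframe2=None):
    fig = plt.figure()
    plt.plot([1, 2, 3, 4])
    Run.get_context().log_image("sample-plot", plot=fig)   # attach the image to the run record
    return dataframe1,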
After the pipeline run is finished, you can preview the image in the right panel of the module.
You can also upload file to any datastore using following code. You can only preview the file in your storage
account.
import os
import pandas as pd
import matplotlib
matplotlib.use("agg")                     # non-interactive backend for pipeline runs
import matplotlib.pyplot as plt
from azureml.core import Datastore, Run

def azureml_main(dataframe1=None, dataframe2=None):
    plt.plot([1, 2, 3, 4])
    plt.ylabel('some numbers')
    img_file = "line.png"
    # Set path
    path = "./img_folder"
    os.mkdir(path)
    plt.savefig(os.path.join(path, img_file))
    # Get a named datastore from the current workspace and upload to the specified path
    ws = Run.get_context().experiment.workspace
    datastore = Datastore.get(ws, datastore_name='workspacefilestore')
    datastore.upload(path)
    return dataframe1,
IMPORTANT
Please use unique and meaningful names for files in the script bundle, because some common words (like test , app ,
and so on) are reserved for built-in services.
Following is a script bundle example, which contains a python script file and a txt file:
def my_func(dataframe1):
return dataframe1
Following is sample code showing how to consume the files in the script bundle:
import pandas as pd
from my_script import my_func
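# (Continuation sketch: the entry point simply calls the bundled function; the file name
#  "my_sample.txt" is hypothetical and stands in for the txt file in the bundle.)
def azureml_main(dataframe1=None, dataframe2=None):
    # Apply the function defined in ./Script Bundle/my_script.py
    result = my_func(dataframe1)
    # Other files in the bundle are available under ./Script Bundle
    with open("./Script Bundle/my_sample.txt") as text_file:
        print(text_file.read())
    return result,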
4. In the Python script text box, type or paste valid Python script.
NOTE
Be careful when writing your script. Make sure there are no syntax errors, such as using undeclared variables or
unimported modules or functions. Pay extra attention to the preinstalled module list. To import modules that
aren't listed, install the corresponding packages in your script, such as:
import os
os.system(f"pip install scikit-misc")
The Python script text box is prepopulated with some instructions in comments, and sample code for
data access and output. You must edit or replace this code. Follow Python conventions for indentation
and casing:
The script must contain a function named azureml_main as the entry point for this module.
The entry point function must have two input arguments, Param<dataframe1> and Param<dataframe2> ,
even when these arguments aren't used in your script.
Zipped files connected to the third input port are unzipped and stored in the directory
.\Script Bundle , which is also added to the Python sys.path .
If your .zip file contains mymodule.py , import it by using import mymodule .
Two datasets can be returned to the designer, which must be a sequence of type pandas.DataFrame . You
can create other outputs in your Python code and write them directly to Azure storage.
WARNING
It's not recommended to connect to a database or other external storage in the Execute Python Script module.
You can use the Import Data module and Export Data module instead.
Next steps
See the set of modules available to Azure Machine Learning.
Execute R Script module
8/6/2021 • 11 minutes to read • Edit Online
This article describes how to use the Execute R Script module to run R code in your Azure Machine Learning
designer pipeline.
With R, you can do tasks that aren't supported by existing modules, such as:
Create custom data transformations
Use your own metrics to evaluate predictions
Build models using algorithms that aren't implemented as standalone modules in the designer
R version support
Azure Machine Learning designer uses the CRAN (Comprehensive R Archive Network) distribution of R. The
currently used version is CRAN 3.5.1.
Supported R packages
The R environment is preinstalled with more than 100 packages. For a complete list, see the section Preinstalled
R packages.
You can also add the following code to any Execute R Script module, to see the installed packages.
NOTE
If your pipeline contains multiple Execute R Script modules that need packages that aren't in the preinstalled list, install the
packages in each module.
Installing R packages
To install additional R packages, use the install.packages() method. Packages are installed for each Execute R
Script module. They aren't shared across other Execute R Script modules.
NOTE
It's NOT recommended to install R packages from the script bundle. It's recommended to install packages directly in the
script editor. Specify the CRAN repository when you're installing packages, such as
install.packages("zoo",repos = "https://cloud.r-project.org") .
WARNING
The Execute R Script module does not support installing packages that require native compilation, like the qdap package
(which requires Java) or the drc package (which requires C++). This is because the module is executed in a pre-installed
environment without admin permission. Do not install packages that are pre-built on or for Windows, because the designer
modules run on Ubuntu. To check whether a package is pre-built on Windows, go to CRAN, search for your package,
download a binary file for your OS, and check the Built: part in the DESCRIPTION file. Following
is an example:
This sample shows how to install Zoo:
# R version: 3.5.1
# The script MUST contain a function named azureml_main,
# which is the entry point for this module.
NOTE
Before you install a package, check if it already exists so you don't repeat an installation. Repeat installations might cause
web service requests to time out.
NOTE
Be careful when writing your script. Make sure there are no syntax errors, such as using undeclared variables or
unimported modules or functions. Pay extra attention to the preinstalled package list at the end of this article. To
use packages that aren't listed, install them in your script. An example is
install.packages("zoo",repos = "https://cloud.r-project.org") .
To help you get started, the R Script text box is prepopulated with sample code, which you can edit or
replace.
# R version: 3.5.1
# The script MUST contain a function named azureml_main,
# which is the entry point for this module.
The entry point function must have the input arguments Param<dataframe1> and Param<dataframe2> , even
when these arguments aren't used in the function.
NOTE
The data passed to the Execute R Script module is referenced as dataframe1 and dataframe2 , which is different
from the rest of Azure Machine Learning designer (elsewhere the designer refers to dataset1 and dataset2 ). Make sure that input
data is referenced correctly in your script.
NOTE
Existing R code might need minor changes to run in a designer pipeline. For example, input data that you provide
in CSV format should be explicitly converted to a dataset before you can use it in your code. Data and column
types used in the R language also differ in some ways from the data and column types used in the designer.
4. If your script is larger than 16 KB, use the Script Bundle port to avoid errors like CommandLine exceeds
the limit of 16597 characters.
a. Bundle the script and other custom resources to a zip file.
b. Upload the zip file as a File Dataset to the studio.
c. Drag the dataset module from the Datasets list in the left module pane in the designer authoring page.
d. Connect the dataset module to the Script Bundle port of Execute R Script module.
Following is the sample code to consume the script in the script bundle:
azureml_main <- function(dataframe1, dataframe2){
    # Source the custom R script: my_script.R
    source("./Script Bundle/my_script.R")
    # ... call functions defined in my_script.R here ...
    return(list(dataset1=dataframe1, dataset2=dataframe2))
}
5. For Random Seed , enter a value to use inside the R environment as the random seed value. This
parameter is equivalent to calling set.seed(value) in R code.
6. Submit the pipeline.
Results
Execute R Script modules can return multiple outputs, but they must be provided as R data frames. The designer
automatically converts data frames to datasets for compatibility with other modules.
Standard messages and errors from R are returned to the module's log.
If you need to print results in the R script, you can find the printed results in 70_driver_log under the
Outputs+logs tab in the right panel of the module.
Sample scripts
There are many ways to extend your pipeline by using custom R scripts. This section provides sample code for
common tasks.
Add an R script as an input
The Execute R Script module supports arbitrary R script files as inputs. To use them, you must upload them to
your workspace as part of the .zip file.
1. To upload a .zip file that contains R code to your workspace, go to the Datasets asset page. Select Create
dataset , and then select From local file and the File dataset type option.
2. Verify that the zipped file appears in My Datasets under the Datasets category in the left module tree.
3. Connect the dataset to the Script Bundle input port.
4. All files in the .zip file are available during pipeline run time.
If the script bundle file contained a directory structure, the structure is preserved. But you must alter your
code to prepend the directory ./Script Bundle to the path.
Process data
The following sample shows how to scale and normalize input data:
# R version: 3.5.1
# The script MUST contain a function named azureml_main,
# which is the entry point for this module.
Replicate rows
This sample shows how to replicate positive records in a dataset to balance the sample:
azureml_main <- function(dataframe1, dataframe2){
data.set <- dataframe1[dataframe1[,1]==-1,]
# positions of the positive samples
pos <- dataframe1[dataframe1[,1]==1,]
# replicate the positive samples to balance the sample
for (i in 1:20) data.set <- rbind(data.set,pos)
row.names(data.set) <- NULL
# Return datasets as a Named List
return(list(dataset1=data.set, dataset2=dataframe2))
}
The explicit conversion to integer type is done because the serialization function outputs data in the R
Raw format, which the designer doesn't support.
2. Add a second instance of the Execute R Script module, and connect it to the output port of the previous
module.
3. Type the following code in the R Script text box to extract object A from the input data table.
Preinstalled R packages
The following preinstalled R packages are currently available:
PACKAGE VERSION
askpass 1.1
assertthat 0.2.1
backports 1.1.4
base 3.5.1
base64enc 0.1-3
BH 1.69.0-1
bindr 0.1.1
bindrcpp 0.2.2
bitops 1.0-6
boot 1.3-22
broom 0.5.2
callr 3.2.0
caret 6.0-84
caTools 1.17.1.2
cellranger 1.1.0
class 7.3-15
cli 1.1.0
clipr 0.6.0
cluster 2.0.7-1
codetools 0.2-16
colorspace 1.4-1
compiler 3.5.1
crayon 1.3.4
curl 3.3
data.table 1.12.2
datasets 3.5.1
DBI 1.0.0
dbplyr 1.4.1
digest 0.6.19
dplyr 0.7.6
e1071 1.7-2
evaluate 0.14
fansi 0.4.0
forcats 0.3.0
foreach 1.4.4
foreign 0.8-71
fs 1.3.1
gdata 2.18.0
generics 0.0.2
ggplot2 3.2.0
glmnet 2.0-18
glue 1.3.1
gower 0.2.1
gplots 3.0.1.1
graphics 3.5.1
grDevices 3.5.1
grid 3.5.1
gtable 0.3.0
gtools 3.8.1
haven 2.1.0
highr 0.8
hms 0.4.2
htmltools 0.3.6
httr 1.4.0
ipred 0.9-9
iterators 1.0.10
jsonlite 1.6
KernSmooth 2.23-15
knitr 1.23
labeling 0.3
lattice 0.20-38
lava 1.6.5
lazyeval 0.2.2
lubridate 1.7.4
magrittr 1.5
markdown 1
MASS 7.3-51.4
Matrix 1.2-17
methods 3.5.1
mgcv 1.8-28
mime 0.7
ModelMetrics 1.2.2
modelr 0.1.4
munsell 0.5.0
nlme 3.1-140
nnet 7.3-12
numDeriv 2016.8-1.1
openssl 1.4
parallel 3.5.1
pillar 1.4.1
pkgconfig 2.0.2
plogr 0.2.0
plyr 1.8.4
prettyunits 1.0.2
processx 3.3.1
prodlim 2018.04.18
progress 1.2.2
ps 1.3.0
purrr 0.3.2
quadprog 1.5-7
quantmod 0.4-15
R6 2.4.0
randomForest 4.6-14
RColorBrewer 1.1-2
Rcpp 1.0.1
RcppRoll 0.3.0
readr 1.3.1
readxl 1.3.1
recipes 0.1.5
rematch 1.0.1
reprex 0.3.0
reshape2 1.4.3
reticulate 1.12
rlang 0.4.0
rmarkdown 1.13
ROCR 1.0-7
rpart 4.1-15
rstudioapi 0.1
rvest 0.3.4
scales 1.0.0
selectr 0.4-1
spatial 7.3-11
splines 3.5.1
SQUAREM 2017.10-1
stats 3.5.1
stats4 3.5.1
stringi 1.4.3
stringr 1.3.1
survival 2.44-1.1
sys 3.2
tcltk 3.5.1
tibble 2.1.3
tidyr 0.8.3
tidyselect 0.2.5
tidyverse 1.2.1
timeDate 3043.102
tinytex 0.13
tools 3.5.1
tseries 0.10-47
TTR 0.23-4
utf8 1.1.4
utils 3.5.1
vctrs 0.1.0
viridisLite 0.3.0
whisker 0.3-2
withr 2.1.2
xfun 0.8
xml2 1.2.0
xts 0.11-2
yaml 2.2.0
zeallot 0.1.0
zoo 1.8-6
Next steps
See the set of modules available to Azure Machine Learning.
Convert Word to Vector module
11/2/2020 • 6 minutes to read • Edit Online
This article describes how to use the Convert Word to Vector module in Azure Machine Learning designer to do
these tasks:
Apply various Word2Vec models (Word2Vec, FastText, GloVe pretrained model) on the corpus of text that you
specified as input.
Generate a vocabulary with word embeddings.
This module uses the Gensim library. For more information about Gensim, see its official website, which
includes tutorials and an explanation of algorithms.
More about converting words to vectors
Converting words to vectors, or word vectorization, is a natural language processing (NLP) process. The process
uses language models to map words into a vector space. A vector space represents each word by a vector of real
numbers, and allows words with similar meanings to have similar representations.
Word embeddings can be used as the initial input for downstream NLP tasks such as text classification and sentiment
analysis.
Among various word embedding technologies, in this module, we implemented three widely used methods.
Two, Word2Vec and FastText, are online-training models. The other is a pretrained model, glove-wiki-gigaword-
100.
Online-training models are trained on your input data. Pretrained models are trained offline on a larger text
corpus (for example, Wikipedia, Google News) that usually contains about 100 billion words. Word embedding
then stays constant during word vectorization. Pretrained word models provide benefits such as reduced
training time, better word vectors encoded, and improved overall performance.
Here's some information about the methods:
Word2Vec is one of the most popular techniques to learn word embeddings by using a shallow neural
network. The theory is discussed in this paper, available as a PDF download: Efficient Estimation of Word
Representations in Vector Space. The implementation in this module is based on the Gensim library for
Word2Vec.
The FastText theory is explained in this paper, available as a PDF download: Enriching Word Vectors with
Subword Information. The implementation in this module is based on the Gensim library for FastText.
The GloVe pretrained model is glove-wiki-gigaword-100. It's a collection of pretrained vectors based on a
Wikipedia text corpus, which contains 5.6 billion tokens and 400,000 uncased vocabulary words. A PDF
download is available: GloVe: Global Vectors for Word Representation.
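A small Gensim sketch of the online-training case is shown below; the corpus is a toy example, and the parameter names follow Gensim 4.x (older releases use size instead of vector_size).

from gensim.models import Word2Vec

# Toy tokenized corpus standing in for the preprocessed input text column.
corpus = [
    ["nasdaq", "100", "component", "apple", "cupertino", "california"],
    ["nasdaq", "100", "component", "adobe", "systems", "software"],
]

# Skip-gram (sg=1) with 100-dimensional embeddings, mirroring the defaults used in the example below.
model = Word2Vec(sentences=corpus, vector_size=100, sg=1, window=5, min_count=1, seed=1)

print(model.wv["apple"][:5])                    # first five embedding dimensions for "apple"
print(model.wv.most_similar("apple", topn=2))   # nearest words in the learned vector space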
Examples
The module has one output:
Vocabulary with embeddings : Contains the generated vocabulary, together with each word's embedding.
One dimension occupies one column.
The following example shows how the Convert Word to Vector module works. It uses Convert Word to Vector
with default settings to the preprocessed Wikipedia SP 500 Dataset.
Source dataset
The dataset contains a category column, along with the full text fetched from Wikipedia. The following table
shows a few representative examples.
TEXT
nasdaq 100 component s p 500 component foundation founder location city apple campus 1 infinite loop street infinite loop
cupertino california cupertino california location country united states...
br nasdaq 100 nasdaq 100 component br s p 500 s p 500 component industry computer software foundation br founder
charles geschke br john warnock location adobe systems...
s p 500 s p 500 component industry automotive industry automotive predecessor general motors corporation 1908 2009
successor...
s p 500 s p 500 component industry conglomerate company conglomerate foundation founder location city fairfield
connecticut fairfield connecticut location country usa area...
br s p 500 s p 500 component foundation 1903 founder william s harley br arthur davidson harley davidson founder arthur
davidson br walter davidson br william a davidson location...
(Table columns: Vocabulary, Embedding dim 0 through Embedding dim 99.)
In this example, we used the default Gensim Word2Vec for the Word2Vec strategy, and the Training Algorithm is
Skip-gram. The length of the word embedding is 100, so there are 100 embedding columns.
Technical notes
This section contains tips and answers to frequently asked questions.
Difference between online-training and pretrained model:
In this Convert Word to Vector module, we provided three different strategies: two online-training
models and one pretrained model. The online-training models use your input dataset as training data,
and generate vocabulary and word vectors during training. The pretrained model is already trained by a
much larger text corpus, such as Wikipedia or Twitter text. The pretrained model is actually a collection of
word/embedding pairs.
The GloVe pre-trained model summarizes a vocabulary from the input dataset and generates an
embedding vector for each word from the pretrained model. Without online training, the use of a
pretrained model can save training time. It has better performance, especially when the input dataset size
is relatively small.
Embedding size:
In general, the length of word embedding is set to a few hundred. For example, 100, 200, 300. A small
embedding size means a small vector space, which could cause word embedding collisions.
The length of word embeddings is fixed for pretrained models. In this example, the embedding size of
glove-wiki-gigaword-100 is 100.
Next steps
See the set of modules available to Azure Machine Learning.
For a list of errors specific to the designer modules, see Machine Learning error codes.
Extract N-Gram Features from Text module
reference
3/5/2021 • 6 minutes to read • Edit Online
This article describes a module in Azure Machine Learning designer. Use the Extract N-Gram Features from Text
module to featurize unstructured text data.
TF-IDF Weight : Assigns a term frequency/inverse document frequency (TF/IDF) score to the
extracted n-grams. The value for each n-gram is its TF score multiplied by its IDF score.
6. Set Minimum word length to the minimum number of letters that can be used in any single word in an
n-gram.
7. Use Maximum word length to set the maximum number of letters that can be used in any single word
in an n-gram.
By default, up to 25 characters per word or token are allowed.
8. Use Minimum n-gram document absolute frequency to set the minimum occurrences required for
any n-gram to be included in the n-gram dictionary.
For example, if you use the default value of 5, any n-gram must appear at least five times in the corpus to
be included in the n-gram dictionary.
9. Set Maximum n-gram document ratio to the maximum ratio of the number of rows that contain a
particular n-gram, over the number of rows in the overall corpus.
For example, a ratio of 1 would indicate that, even if a specific n-gram is present in every row, the n-gram
can be added to the n-gram dictionary. More typically, a word that occurs in every row would be
considered a noise word and would be removed. To filter out domain-dependent noise words, try
reducing this ratio.
IMPORTANT
The rate of occurrence of particular words is not uniform. It varies from document to document. For example, if
you're analyzing customer comments about a specific product, the product name might be very high frequency
and close to a noise word, but be a significant term in other contexts.
10. Select the option Normalize n-gram feature vectors to normalize the feature vectors. If this option is
enabled, each n-gram feature vector is divided by its L2 norm. (A scikit-learn sketch of comparable settings follows this procedure.)
11. Submit the pipeline.
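These dictionary and weighting options correspond loosely to scikit-learn's TfidfVectorizer. The following sketch is only an analogy under that assumption, not the module's implementation; min_df, max_df, and norm play the roles of the minimum absolute frequency, maximum document ratio, and L2 normalization options described above:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=2,             # an n-gram must appear in at least 2 documents
    max_df=1.0,           # keep n-grams even if they appear in every row
    norm="l2",            # divide each feature vector by its L2 norm
)

features = vectorizer.fit_transform(corpus)   # sparse matrix: rows x n-grams
print(vectorizer.get_feature_names_out())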
Use an existing n-gram dictionary
1. Add the Extract N-Gram Features from Text module to your pipeline, and connect the dataset that has the
text you want to process to the Dataset port.
2. Use Text column to select the text column that contains the text you want to featurize. By default, the
module selects all columns of type string . For best results, process a single column at a time.
3. Add the saved dataset that contains a previously generated n-gram dictionary, and connect it to the
Input vocabular y port. You can also connect the Result vocabular y output of an upstream instance of
the Extract N-Gram Features from Text module.
4. For Vocabulary mode, select the ReadOnly option from the drop-down list.
The ReadOnly option featurizes the input corpus against the input vocabulary. Rather than computing term
frequencies from the new text dataset (on the left input), the n-gram weights from the input vocabulary
are applied as is.
TIP
Use this option when you're scoring a text classifier.
5. For all other options, see the property descriptions in the previous section.
6. Submit the pipeline.
Build an inference pipeline that uses n-grams to deploy a real-time endpoint
A training pipeline that contains Extract N-Gram Features from Text and Score Model, used to make
predictions on a test dataset, is built with the following structure:
Vocabulary mode of the circled Extract N-Gram Features from Text module is Create, and Vocabulary
mode of the module that connects to the Score Model module is ReadOnly.
After the training pipeline above is submitted successfully, you can register the output of the circled module as
a dataset.
Then you can create the real-time inference pipeline. After creating the inference pipeline, you need to adjust it
manually as follows:
NOTE
Don't connect the data output to the Train Model module directly. You should remove free text columns before they're fed
into the Train Model. Otherwise, the free text columns will be treated as categorical features.
Next steps
See the set of modules available to Azure Machine Learning.
Feature Hashing module reference
3/5/2021 • 4 minutes to read • Edit Online
Consider an example dataset in which each row contains a short text, such as "I love books", and a numeric sentiment rating, such as 2.
Internally, the Feature Hashing module creates a dictionary of n-grams. For example, the list of bigrams for this
dataset would be something like this:
TERM (BIGRAMS)    FREQUENCY
This book         3
I loved           1
I hated           1
I love            1
You can control the size of the n-grams by using the N-grams property. If you choose bigrams, unigrams are
also computed. The dictionary would also include single terms like these:
TERM (UNIGRAMS)    FREQUENCY
book               3
I                  3
books              1
was                1
After the dictionary is built, the Feature Hashing module converts the dictionary terms into hash values. It then
computes whether a feature was used in each case. For each row of text data, the module outputs a set of
columns, one column for each hashed feature.
For example, after hashing, the feature columns might look something like this:
RATING    HASHING FEATURE 1    HASHING FEATURE 2    HASHING FEATURE 3
4         1                    1                    0
5         0                    0                    0
If the value in the column is 0, the row didn't contain the hashed feature.
If the value is 1, the row did contain the feature.
Feature hashing lets you represent text documents of variable length as numeric feature vectors of equal length
to reduce dimensionality. If you tried to use the text column for training as is, it would be treated as a categorical
feature column with many distinct values.
Numeric outputs also make it possible to use common machine learning methods, including classification,
clustering, and information retrieval. Because lookup operations can use integer hashes rather than string
comparisons, getting the feature weights is also much faster.
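A comparable idea is available in scikit-learn's HashingVectorizer. The sketch below is only an analogy (the designer module's hash function and seed are not exposed); it uses 2**10 = 1,024 features, matching the 10-bit default described in the steps that follow:

from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    "I loved this book",
    "I hated this book",
    "I love books",
]

# 10-bit hash -> 2**10 = 1,024 feature columns; unigrams and bigrams.
vectorizer = HashingVectorizer(
    n_features=2**10,
    ngram_range=(1, 2),
    binary=True,     # 1 if the hashed n-gram occurs in the row, else 0
    norm=None,
)

hashed = vectorizer.transform(corpus)   # sparse matrix, shape (3, 1024)
print(hashed.shape)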
TIP
Because feature hashing does not perform lexical operations such as stemming or truncation, you can sometimes
get better results by preprocessing text before you apply feature hashing.
3. Set Target columns to the text columns that you want to convert to hashed features. Keep in mind that:
The columns must be the string data type.
Choosing multiple text columns can have a significant impact on feature dimensionality. For
example, the number of columns for a 10-bit hash goes from 1,024 for a single column to 2,048
for two columns.
4. Use Hashing bitsize to specify the number of bits to use when you're creating the hash table.
The default bit size is 10. For many problems, this value is adequate. You might need more space to avoid
collisions, depending on the size of the n-grams vocabulary in the training text.
5. For N-grams , enter a number that defines the maximum length of the n-grams to add to the training
dictionary. An n-gram is a sequence of n words, treated as a unique unit.
For example, if you enter 3, unigrams, bigrams, and trigrams will be created.
6. Submit the pipeline.
Results
After processing is complete, the module outputs a transformed dataset in which the original text column has
been converted to multiple columns. Each column represents a feature in the text. Depending on how large
the dictionary is, the resulting dataset can be large. Each output column is listed by name and column type.
After you create the transformed dataset, you can use it as the input to the Train Model module.
Best practices
The following best practices can help you get the most out of the Feature Hashing module:
Add a Preprocess Text module before using Feature Hashing to preprocess the input text.
Add a Select Columns module after the Feature Hashing module to remove the text columns from the
output dataset. You don't need the text columns after the hashing features have been generated.
Consider using these text preprocessing options, to simplify results and improve accuracy:
Word breaking
Stop word removal
Case normalization
Removal of punctuation and special characters
Stemming
The optimal set of preprocessing methods to apply in any solution depends on domain, vocabulary, and
business need. Run a pipeline with your data to see which text processing methods are most effective.
Next steps
See the set of modules available to Azure Machine Learning.
Preprocess Text
3/5/2021 • 3 minutes to read • Edit Online
Remove URLs : Select this option to remove any sequence that includes the following URL
prefixes: http , https , ftp , www
11. Expand verb contractions : This option applies only to languages that use verb contractions; currently,
English only.
For example, by selecting this option, you could replace the phrase "wouldn't stay there" with "would not
stay there".
12. Normalize backslashes to slashes : Select this option to map all instances of \\ to / .
13. Split tokens on special characters : Select this option if you want to break words on characters such
as & and - . This option also collapses a special character that is repeated more than twice into a single
instance.
For example, the string MS---WORD would be separated into three tokens: MS , - , and WORD .
14. Submit the pipeline.
Technical notes
The Preprocess Text module in Studio (classic) and in the designer use different language models. The designer uses
a multi-task CNN model trained by spaCy. Different models provide different tokenizers and part-of-speech
taggers, which leads to different results.
Following are some examples:
Each example table lists a configuration and its output result.
Next steps
See the set of modules available to Azure Machine Learning.
Latent Dirichlet Allocation module
11/2/2020 • 11 minutes to read • Edit Online
This article describes how to use the Latent Dirichlet Allocation module in Azure Machine Learning designer, to
group otherwise unclassified text into categories.
Latent Dirichlet Allocation (LDA) is often used in natural language processing to find texts that are similar.
Another common term is topic modeling.
This module takes a column of text and generates these outputs:
The source text, together with a score for each category
A feature matrix that contains extracted terms and coefficients for each category
A transformation, which you can save and reapply to new text used as input
This module uses the scikit-learn library. For more information about scikit-learn, see the GitHub repository,
which includes tutorials and an explanation of the algorithm.
NOTE
In Azure Machine Learning designer, the scikit-learn library no longer supports unnormalized doc_topic_distr
output from version 0.19. In this module, the Normalize parameter can only be applied to feature Topic matrix
output. Transformed dataset output is always normalized.
7. Select the option Show all options , and then set it to TRUE if you want to set the following advanced
parameters.
These parameters are specific to the scikit-learn implementation of LDA. There are some good tutorials
about LDA in scikit-learn, as well as the official scikit-learn documentation. A minimal sketch that maps
these options to scikit-learn parameters follows this procedure.
Rho parameter . Provide a prior probability for the sparsity of topic distributions. This parameter
corresponds to sklearn's topic_word_prior parameter. Use the value 1 if you expect that the
distribution of words is flat; that is, all words are assumed equiprobable. If you think most words
appear sparsely, you might set it to a lower value.
Alpha parameter . Specify a prior probability for the sparsity of per-document topic weights. This
parameter corresponds to sklearn's doc_topic_prior parameter.
Estimated number of documents . Enter a number that represents your best estimate of the
number of documents (rows) that will be processed. This parameter lets the module allocate a
hash table of sufficient size. It corresponds to the total_samples parameter in scikit-learn.
Size of the batch . Enter a number that indicates how many rows to include in each batch of text
sent to the LDA model. This parameter corresponds to the batch_size parameter in scikit-learn.
Initial value of iteration used in learning update schedule . Specify the starting value that
downweights the learning rate for early iterations in online learning. This parameter corresponds
to the learning_offset parameter in scikit-learn.
Power applied to the iteration during updates . Indicate the level of power applied to the
iteration count in order to control learning rate during online updates. This parameter corresponds
to the learning_decay parameter in scikit-learn.
Number of passes over the data . Specify the maximum number of times the algorithm will
cycle over the data. This parameter corresponds to the max_iter parameter in scikit-learn.
8. Select the option Build dictionar y of ngrams or Build dictionar y of ngrams prior to LDA , if you
want to create the n-gram list in an initial pass before classifying text.
If you create the initial dictionary beforehand, you can later use the dictionary when reviewing the model.
Being able to map results to text rather than numerical indices is generally easier for interpretation.
However, saving the dictionary will take longer and use additional storage.
9. For Maximum size of ngram dictionar y , enter the total number of rows that can be created in the n-
gram dictionary.
This option is useful for controlling the size of the dictionary. But if the number of ngrams in the input
exceeds this size, collisions may occur.
10. Submit the pipeline. The LDA module uses Bayes theorem to determine what topics might be associated
with individual words. Words are not exclusively associated with any topics or groups. Instead, each n-
gram has a learned probability of being associated with any of the discovered classes.
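The following minimal sketch shows how these options map onto scikit-learn's LatentDirichletAllocation. It is an interpretation of the parameter correspondences listed above, not the module's own code; the corpus and some values (for example, learning_offset and learning_decay) are illustrative:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "dogs and cats are pets", "stock prices rose today"]
counts = CountVectorizer().fit_transform(corpus)

lda = LatentDirichletAllocation(
    n_components=3,          # number of topics to model
    topic_word_prior=0.01,   # Rho parameter
    doc_topic_prior=0.01,    # Alpha parameter
    total_samples=1e6,       # Estimated number of documents
    batch_size=32,           # Size of the batch
    learning_offset=10.0,    # Initial value of iteration used in learning update schedule
    learning_decay=0.7,      # Power applied to the iteration during updates
    max_iter=10,             # Number of passes over the data
    learning_method="online",
)

doc_topics = lda.fit_transform(counts)   # each row is a normalized topic distribution (sums to 1)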
Results
The module has two outputs:
Transformed dataset : This output contains the input text, a specified number of discovered categories,
and the scores for each text example for each category.
Feature topic matrix : The leftmost column contains the extracted text feature. A column for each
category contains the score for that feature in that category.
LDA transformation
This module also outputs the LDA transformation that applies LDA to the dataset.
You can save this transformation and reuse it for other datasets. This technique might be useful if you've trained
on a large corpus and want to reuse the coefficients or categories.
To reuse this transformation, select the Register dataset icon in the right panel of the Latent Dirichlet
Allocation module to keep the saved transformation under the Datasets category in the module list. Then you can
connect it to the Apply Transformation module to reuse this transformation.
Refining an LDA model or results
Typically, you can't create a single LDA model that will meet all needs. Even a model designed for one task might
require many iterations to improve accuracy. We recommend that you try all these methods to improve your
model:
Changing the model parameters
Using visualization to understand the results
Getting the feedback of subject matter experts to determine whether the generated topics are useful
Qualitative measures can also be useful for assessing the results. To evaluate topic modeling results, consider:
Accuracy. Are similar items really similar?
Diversity. Can the model discriminate between similar items when required for the business problem?
Scalability. Does it work on a wide range of text categories or only on a narrow target domain?
You can often improve the accuracy of models based on LDA by using natural language processing to clean,
summarize and simplify, or categorize text. For example, the following techniques, all supported in Azure
Machine Learning, can improve classification accuracy:
Stop word removal
Case normalization
Lemmatization or stemming
Named entity recognition
For more information, see Preprocess Text.
In the designer, you can also use R or Python libraries for text processing: Execute R Script, Execute Python
Script.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Implementation details
By default, the distributions of outputs for a transformed dataset and feature-topic matrix are normalized as
probabilities:
The transformed dataset is normalized as the conditional probability of topics given a document. In this
case, the sum of each row equals 1.
The feature-topic matrix is normalized as the conditional probability of words given a topic. In this case,
the sum of each column equals 1.
TIP
Occasionally the module might return an empty topic. Most often, the cause is pseudo-random initialization of the
algorithm. If this happens, you can try changing related parameters. For example, change the maximum size of the N-
gram dictionary or the number of bits to use for feature hashing.
Module parameters
Rho parameter: Float, range [0.00001;1.0], default 0.01. Applies when the Show all options check box is selected. Topic word prior distribution.
Alpha parameter: Float, range [0.00001;1.0], default 0.01. Applies when the Show all options check box is selected. Document topic prior distribution.
Size of the batch: Integer, range [1;1024], default 32. Applies when the Show all options check box is selected. Size of the batch.
Initial value of iteration used in learning rate update schedule: Integer, range [0;int.MaxValue], default 0. Applies when the Show all options check box is selected. Initial value that downweights the learning rate for early iterations. Corresponds to the learning_offset parameter.
Power applied to the iteration during updates: Float, range [0.0;1.0], default 0.5. Applies when the Show all options check box is selected. Power applied to the iteration count in order to control the learning rate. Corresponds to the learning_decay parameter.
Build dictionary of ngrams: Boolean, True or False, default True. Applies when the Show all options check box is not selected. Builds a dictionary of ngrams prior to computing LDA. Useful for model inspection and interpretation.
Maximum size of ngram dictionary: Integer, range [1;int.MaxValue], default 20000. Applies when the option Build dictionary of ngrams is True. Maximum size of the ngrams dictionary. If the number of tokens in the input exceeds this size, collisions might occur.
Build dictionary of ngrams prior to LDA: Boolean, True or False, default True. Applies when the Show all options check box is selected. Builds a dictionary of ngrams prior to LDA. Useful for model inspection and interpretation.
Next steps
See the set of modules available to Azure Machine Learning.
For a list of errors specific to the modules, see Exceptions and error codes for the designer.
Score Vowpal Wabbit Model
11/2/2020 • 3 minutes to read • Edit Online
This article describes how to use the Score Vowpal Wabbit Model module in Azure Machine Learning
designer, to generate scores for a set of input data, using an existing trained Vowpal Wabbit model.
This module provides the latest version of the Vowpal Wabbit framework, version 8.8.1. Use this module to
score data using a trained model saved in the VW version 8 format.
NOTE
Only Vowpal Wabbit 8.8.1 models are supported; you cannot connect saved models that were trained by using
other algorithms.
3. Add the test dataset and connect it to the right-hand input port. If the test dataset is a directory that contains
the test data file, specify the file name with Name of the test data file . If the test dataset is a
single file, leave Name of the test data file empty.
4. In the VW arguments text box, type a set of valid command-line arguments to the Vowpal Wabbit
executable.
For information about which Vowpal Wabbit arguments are supported and unsupported in Azure
Machine Learning, see the Technical Notes section.
5. Name of the test data file : Type the name of the file that contains the input data. This argument is only
used when the test dataset is a directory.
6. Specify file type : Indicate which format your training data uses. Vowpal Wabbit supports these two
input file formats:
VW represents the internal format used by Vowpal Wabbit . See the Vowpal Wabbit wiki page for
details.
SVMLight is a format used by some other machine learning tools.
7. Select the option, Include an extra column containing labels , if you want to output labels together
with the scores.
Typically, when handling text data, Vowpal Wabbit does not require labels, and will return only the scores
for each row of data.
8. Select the option, Include an extra column containing raw scores , if you want to output raw scores
together with the results.
9. Submit the pipeline.
Results
After training is complete:
To visualize the results, right-click the output of the Score Vowpal Wabbit Model module. The output
indicates a prediction score normalized from 0 to 1.
To evaluate the results, the output dataset should contain specific score column names, which meet
Evaluate Model module requirements.
For a regression task, the dataset to evaluate must have one column, named Regression Scored Labels ,
which represents scored labels.
For a binary classification task, the dataset to evaluate must have two columns, named
Binary Class Scored Labels and Binary Class Scored Probabilities , which represent scored labels and
probabilities, respectively.
For a multiclass classification task, the dataset to evaluate must have one column, named
Multi Class Scored Labels , which represents scored labels.
Note that the results of the Score Vowpal Wabbit Model module cannot be evaluated directly. Before
evaluating, modify the dataset according to the requirements above, as shown in the sketch that follows.
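For example, you could rename the score column in an Execute Python Script module before evaluation. The input column name 'Scored Labels' below is a placeholder assumption; substitute the actual score column name produced in your scored dataset:

import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Rename the score column to the name that Evaluate Model expects
    # for a regression task. 'Scored Labels' is a placeholder; use the
    # column name that actually appears in the Score Vowpal Wabbit Model output.
    dataframe1 = dataframe1.rename(
        columns={"Scored Labels": "Regression Scored Labels"})
    return dataframe1,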
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Parameters
Vowpal Wabbit has many command-line options for choosing and tuning algorithms. A full discussion of these
options is not possible here; we recommend that you view the Vowpal Wabbit wiki page.
The following parameters are not supported in Azure Machine Learning Studio (classic).
The input/output options specified in https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-
line-arguments
These properties are already configured automatically by the module.
Additionally, any option that generates multiple outputs or takes multiple inputs is disallowed. These
include --cbt , --lda , and --wap .
Only supervised learning algorithms are supported. This disallows these options: --active , --rank ,
--search , and so on.
Next steps
See the set of modules available to Azure Machine Learning.
Train Vowpal Wabbit Model
3/19/2021 • 6 minutes to read • Edit Online
This article describes how to use the Train Vowpal Wabbit Model module in Azure Machine Learning
designer, to create a machine learning model by using Vowpal Wabbit.
To use Vowpal Wabbit for machine learning, format your input according to Vowpal Wabbit requirements, and
prepare the data in the required format. Use this module to specify Vowpal Wabbit command-line arguments.
When the pipeline is run, an instance of Vowpal Wabbit is loaded into the experiment run-time, together with
the specified data. When training is complete, the model is serialized back to the workspace. You can use the
model immediately to score data.
To incrementally train an existing model on new data, connect a saved model to the Pre-trained Vowpal
Wabbit model input port of Train Vowpal Wabbit Model , and add the new data to the other input port.
Results
To generate scores from the model, use Score Vowpal Wabbit Model.
NOTE
If you need to deploy the trained model in the designer, make sure that Score Vowpal Wabbit Model instead of Score
Model is connected to the input of Web Service Output module in the inference pipeline.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Advantages of Vowpal Wabbit
Vowpal Wabbit provides extremely fast learning over non-linear features like n-grams.
Vowpal Wabbit uses online learning techniques such as stochastic gradient descent (SGD) to fit a model one
record at a time. Thus it iterates very quickly over raw data and can develop a good predictor faster than most
other models. This approach also avoids having to read all training data into memory.
Vowpal Wabbit converts all data to hashes, not just text data but other categorical variables. Using hashes makes
lookup of regression weights more efficient, which is critical for effective stochastic gradient descent.
Supported and unsupported parameters
This section describes support for Vowpal Wabbit command line parameters in Azure Machine Learning
designer.
Generally, all but a limited set of arguments are supported. For a complete list of arguments, see the Vowpal
Wabbit wiki page.
The following parameters are not supported:
The input/output options specified in https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-
line-arguments
These properties are already configured automatically by the module.
Additionally, any option that generates multiple outputs or takes multiple inputs is disallowed. These
include --cbt , --lda , and --wap .
Only supervised learning algorithms are supported. Therefore, these options are not supported: --active ,
--rank , --search , and so on.
Restrictions
Because the goal of the service is to support experienced users of Vowpal Wabbit, input data must be prepared
ahead of time using the Vowpal Wabbit native text format, rather than the dataset format used by other
modules.
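For illustration only, Vowpal Wabbit's native text format puts the label first, followed by namespaces and feature:value pairs. A minimal Python sketch that writes such lines (the file name and features are made up for this example):

# Each line: <label> |<namespace> <feature>[:<value>] ...
rows = [
    "1 |features price:0.23 sqft:0.25 age:0.05 2006",
    "-1 |features price:0.18 sqft:0.15 age:0.35 1976",
]

with open("train.vw", "w") as f:   # hypothetical output file
    f.write("\n".join(rows) + "\n")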
Next steps
See the set of modules available to Azure Machine Learning.
Apply Image Transformation
11/2/2020 • 2 minutes to read • Edit Online
This article describes how to use the Apply Image Transformation module in Azure Machine Learning designer,
to modify an input image directory based on a previously specified image transformation.
You need to connect an Init Image Transformation module to specify the transformation, and then you can apply
that transformation to the input image directory of the Apply Image Transformation module.
NOTE
Only image transformations generated by the Init Image Transformation module are accepted by this module. For other
kinds of transformations, connect them to Apply Transformation instead; otherwise, an 'InvalidTransformationDirectoryError'
will be thrown.
NOTE
Transformations that are excluded in the For inference mode are: Random resized crop, Random crop, Random
horizontal flip, Random vertical flip, Random rotation, Random affine, Random grayscale, Random perspective,
Random erasing.
Expected inputs
NAME    TYPE    DESCRIPTION
Outputs
NAME    TYPE    DESCRIPTION
Next steps
See the set of modules available to Azure Machine Learning.
Convert to Image Directory
4/22/2021 • 2 minutes to read • Edit Online
This article describes how to use the Convert to Image Directory module to convert an image dataset to the
Image Directory data type, which is the standardized data format for image-related tasks, such as image
classification, in Azure Machine Learning designer.
Your_image_folder_name/Category_1/xxx.png
Your_image_folder_name/Category_1/xxy.jpg
Your_image_folder_name/Category_1/xxz.jpeg
Your_image_folder_name/Category_2/123.png
Your_image_folder_name/Category_2/nsdf3.png
Your_image_folder_name/Category_2/asd932_.png
In the image dataset folder, there are multiple subfolders. Each subfolder contains images of one category.
The names of the subfolders are treated as the labels for tasks like image classification. Refer
to torchvision datasets for more information.
WARNING
Currently labeled datasets exported from Data Labeling are not supported in the designer.
Images with these extensions (in lowercase) are supported: '.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif',
'.tiff', '.webp'. You can also have multiple types of images in one folder. Each category folder does not need to
contain the same number of images.
You can use either a folder or a compressed file with the extension '.zip', '.tar', '.gz', or '.bz2'. Compressed
files are recommended for better performance.
NOTE
For inference, the image dataset folder only needs to contain unclassified images.
2. Register the image dataset as a file dataset in your workspace, because the input of the Convert to Image
Directory module must be a File dataset . (A minimal SDK sketch for this step follows the procedure.)
3. Add the registered image dataset to the canvas. You can find your registered dataset in the Datasets
category in the module list on the left of the canvas. Currently, the designer does not support visualizing
image datasets.
WARNING
You cannot use the Import Data module to import an image dataset, because the output type of the Import Data
module is DataFrameDirectory, which contains only file path strings.
4. Add the Convert to Image Directory module to the canvas. You can find this module in the 'Computer
Vision/Image Data Transformation' category in the module list. Connect it to the image dataset.
5. Submit the pipeline. This module can run on either GPU or CPU.
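For step 2, a file dataset can be registered with the Azure Machine Learning SDK (v1). The workspace configuration, datastore name, dataset name, and path below are placeholders:

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()                      # reads your local config.json
datastore = Datastore.get(ws, "workspaceblobstore")

# 'image-folder/' is a placeholder path to the uploaded image folder or archive.
dataset = Dataset.File.from_files(path=(datastore, "image-folder/"))
dataset.register(workspace=ws, name="my_image_dataset")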
Results
The output of the Convert to Image Directory module is in Image Directory format and can be connected to
other image-related modules whose input port format is also Image Directory.
Technical notes
Expected inputs
NAME    TYPE    DESCRIPTION
Output
NAME    TYPE    DESCRIPTION
Next steps
See the set of modules available to Azure Machine Learning.
Init Image Transformation
4/22/2021 • 4 minutes to read • Edit Online
This article describes how to use the Init Image Transformation module in Azure Machine Learning designer,
to initialize an image transformation that specifies how you want images to be transformed.
Results
After transformation is completed, you can find transformed images in the output of Apply Image
Transformation module.
Technical notes
Refer to https://pytorch.org/vision/stable/transforms.html for more information about image transformations.
Module parameters
NAME    RANGE    TYPE    DEFAULT    DESCRIPTION
Random resized crop    Any    Boolean    False    Crop the given PIL Image to a random size and aspect ratio
Output
NAME    TYPE    DESCRIPTION
Next steps
See the set of modules available to Azure Machine Learning.
Split Image Directory
3/5/2021 • 2 minutes to read • Edit Online
This topic describes how to use the Split Image Directory module in Azure Machine Learning designer, to divide
the images of an image directory into two distinct sets.
This module is particularly useful when you need to separate image data into training and testing sets.
Technical notes
Expected inputs
NAME    TYPE    DESCRIPTION
Module parameters
NAME    TYPE    RANGE    OPTIONAL    DESCRIPTION    DEFAULT
Outputs
NAME    TYPE    DESCRIPTION
Output image directory2    ImageDirectory    Image directory that contains all other images
Next steps
See the set of modules available to Azure Machine Learning.
DenseNet
3/5/2021 • 2 minutes to read • Edit Online
This article describes how to use the DenseNet module in Azure Machine Learning designer, to create an image
classification model using the DenseNet algorithm.
This classification algorithm is a supervised learning method, and requires a labeled image directory.
NOTE
This module does not support labeled datasets generated from Data Labeling in the studio. It supports only the
labeled image directory generated from the Convert to Image Directory module.
You can train the model by providing the model and the labeled image directory as inputs to Train Pytorch
Model. The trained model can then be used to predict values for the new input examples using Score Image
Model.
More about DenseNet
For more information on DenseNet, see the research paper, Densely Connected Convolutional Networks.
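For background only, DenseNet architectures are also available in torchvision. The designer module handles model creation and training for you, so the following sketch is just an assumed illustration of the underlying model family, not the module's code:

import torch
from torchvision import models

# A DenseNet classifier for a hypothetical 3-class image task.
model = models.densenet121(num_classes=3)

# One forward pass on a dummy batch of four 224x224 RGB images.
images = torch.randn(4, 3, 224, 224)
logits = model(images)   # shape: (4, 3)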
Results
After the pipeline run is completed, to use the model for scoring, connect the Train Pytorch Model to Score Image
Model, to predict values for new input examples.
Technical notes
Module parameters
NAME    RANGE    TYPE    DEFAULT    DESCRIPTION
Output
NAME    TYPE    DESCRIPTION
Next steps
See the set of modules available to Azure Machine Learning.
ResNet
4/26/2021 • 2 minutes to read • Edit Online
This article describes how to use the ResNet module in Azure Machine Learning designer, to create an image
classification model using the ResNet algorithm.
This classification algorithm is a supervised learning method, and requires a labeled dataset.
NOTE
This module does not support labeled datasets generated from Data Labeling in the studio. It supports only the
labeled image directory generated from the Convert to Image Directory module.
You can train the model by providing a model and a labeled image directory as inputs to Train Pytorch Model.
The trained model can then be used to predict values for the new input examples using Score Image Model.
More about ResNet
Refer to this paper for more details about ResNet.
Results
After the pipeline run is completed, to use the model for scoring, connect the Train Pytorch Model to Score Image
Model, to predict values for new input examples.
Technical notes
Module parameters
NAME    RANGE    TYPE    DEFAULT    DESCRIPTION
Output
NAME    TYPE    DESCRIPTION
Next steps
See the set of modules available to Azure Machine Learning.
Evaluate Recommender
11/2/2020 • 2 minutes to read • Edit Online
This article describes how to use the Evaluate Recommender module in Azure Machine Learning designer. The
goal is to measure the accuracy of predictions that a recommendation model has made. By using this module,
you can evaluate different kinds of recommendations:
Ratings predicted for a user and an item
Items recommended for a user
When you create predictions by using a recommendation model, slightly different results are returned for each
of these supported prediction types. The Evaluate Recommender module deduces the kind of prediction from
the column format of the scored dataset. For example, the scored dataset might contain:
User-item-rating triples
Users and their recommended items
The module also applies the appropriate performance metrics, based on the type of prediction being made.
Evaluate Recommender compares the ratings in the "ground truth" dataset to the predicted ratings of the scored
dataset. It then computes the mean absolute error (MAE) and the root mean squared error (RMSE).
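For reference, both metrics are easy to reproduce outside the designer; the rating arrays below are illustrative:

import numpy as np

actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 2.5])

mae = np.mean(np.abs(actual - predicted))            # mean absolute error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))   # root mean squared error
print(mae, rmse)                                     # 0.5, ~0.6124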
IMPORTANT
For Evaluate Recommender to work, the column names must be User , Item 1 , Item 2 , Item 3 and so forth.
Evaluate Recommender computes the average normalized discounted cumulative gain (NDCG) and returns it in
the output dataset.
Because it's impossible to know the actual "ground truth" for the recommended items, Evaluate Recommender
uses the user-item ratings in the test dataset as gains in the computation of the NDCG. To evaluate, the
recommender scoring module must only produce recommendations for items with "ground truth" ratings (in
the test dataset).
Next steps
See the set of modules available to Azure Machine Learning.
Score SVD Recommender
11/2/2020 • 5 minutes to read • Edit Online
This article describes how to use the Score SVD Recommender module in Azure Machine Learning designer. Use
this module to create predictions by using a trained recommendation model based on the Singular Value
Decomposition (SVD) algorithm.
The SVD recommender can generate two different kinds of predictions:
Predict ratings for a given user and item
Recommend items to a user
When you're creating the second type of predictions, you can operate in one of these modes:
Production mode considers all users or items. It's typically used in a web service.
You can create scores for new users, not just users seen during training. For more information, see the
technical notes.
Evaluation mode operates on a reduced set of users or items that can be evaluated. It's typically used
during pipeline operations.
For more information on the SVD recommender algorithm, see the research paper Matrix factorization
techniques for recommender systems.
Technical notes
If you have a pipeline with the SVD recommender, and you move the model to production, be aware that there
are key differences between using the recommender in evaluation mode and using it in production mode.
Evaluation, by definition, requires predictions that can be verified against the ground truth in a test set. When
you evaluate the recommender, it must predict only items that have been rated in the test set. This restricts the
possible values that are predicted.
When you operationalize the model, you typically change the prediction mode to make recommendations based
on all possible items, in order to get the best predictions. For many of these predictions, there's no
corresponding ground truth. So the accuracy of the recommendation can't be verified in the same way as during
pipeline operations.
Next steps
See the set of modules available to Azure Machine Learning.
Score Wide and Deep Recommender
11/2/2020 • 9 minutes to read • Edit Online
This article describes how to use the Score Wide and Deep Recommender module in Azure Machine
Learning designer, to create predictions from a trained recommendation model based on the Wide & Deep
learning approach from Google.
The Wide and Deep recommender can generate two different kinds of predictions:
Predict ratings for a given user and item
Recommend items to a given user
When creating the latter kind of predictions, you can operate in either production mode or evaluation mode.
Production mode considers all users or items, and is typically used in a web service. You can create
scores for new users, not just users seen during training.
Evaluation mode operates on a reduced set of users or items that can be evaluated, and is typically
used during experimentation.
More details on the Wide and Deep recommender and its underlying theory can be found in the relevant
research paper: Wide & Deep Learning for Recommender Systems.
WARNING
If the model was trained without using user features, you cannot introduce user features during scoring.
5. If you have a dataset of item features, you can connect it to Item features .
The item features dataset must contain an item identifier in the first column. The remaining columns
should contain values that characterize the items.
Features of rated items in the training dataset are ignored by Score Wide and Deep Recommender as
they have already been learned during training. Therefore, restrict your scoring dataset to cold-start
items, or items that have not been rated by any users.
WARNING
If the model was trained without using item features, you cannot introduce item features during scoring.
WARNING
If the model was trained without using user features, you cannot apply user features during scoring.
6. (Optional) If you have a dataset of item features , you can connect it to Item features .
The first column in the item features dataset must contain the item identifier. The remaining columns
should contain values that characterize the items.
Features of rated items are ignored by Score Wide and Deep Recommender , because these features
have already been learned during training. Therefore, you can restrict your scoring dataset to cold-start
items, or items that have not been rated by any users.
WARNING
If the model was trained without using item features, do not use item features when scoring.
7. Maximum number of items to recommend to a user : Type the number of items to return for each
user. By default, 5 items are recommended.
8. Minimum size of the recommendation pool per user : Type a value that indicates how many prior
recommendations are required. By default, this parameter is set to 2, meaning the item must have been
recommended by at least two other users.
This option should be used only if you are scoring in evaluation mode. The option is not available if you
select From All Items or From Unrated Items (to suggest new items to users) .
9. For From Unrated Items (to suggest new items to users) , use the third input port, named Training
Data , to remove items that have already been rated from the prediction results.
To apply this filter, connect the original training dataset to the input port.
10. Run the experiment.
Results of item recommendation
The scored dataset returned by Score Wide and Deep Recommender lists the recommended items for each
user.
The first column contains the user identifiers.
A number of additional columns are generated, depending on the value you set for Maximum number of
items to recommend to a user . Each column contains a recommended item (by identifier). The
recommendations are ordered by user-item affinity, with the item of highest affinity placed in the column Item 1 .
Technical notes
This section contains answers to some common questions about using the Wide & Deep recommender to
create predictions.
Cold-start users and recommendations
Typically, to create recommendations, the Score Wide and Deep Recommender module requires the same
inputs that you used when training the model, including a user ID. That is because the algorithm needs to know
if it has learned something about this user during training.
However, for new users, you might not have a user ID, only some user features such as age, gender, and so forth.
You can still create recommendations for users who are new to your system, by handling them as cold-start
users. For such users, the recommendation algorithm does not use past history or previous ratings, only user
features.
For purposes of prediction, a cold-start user is defined as a user with an ID that has not been used for training.
To ensure that IDs do not match IDs used in training, you can create new identifiers. For example, you might
generate random IDs within a specified range, or allocate a series of IDs in advance for cold-start users.
However, if you do not have any collaborative filtering data, such as a vector of user features, you are better off
using a classification or regression learner.
Production use of the Wide and Deep recommender
If you have experimented with the Wide and Deep recommender and then move the model to production, be
aware of these key differences when using the recommender in evaluation mode and in production mode:
Evaluation, by definition, requires predictions that can be verified against the ground truth in a test set.
Therefore, when you evaluate the recommender, it must predict only items that have been rated in the
test set. This necessarily restricts the possible values that are predicted.
However, when you operationalize the model, you typically change the prediction mode to make
recommendations based on all possible items, in order to get the best predictions. For many of these
predictions, there is no corresponding ground truth, so the accuracy of the recommendation cannot be
verified in the same way as during experimentation.
If you do not provide a user ID in production, and provide only a feature vector, you might get as a
response a list of recommendations for all possible users. Be sure to provide a user ID.
To limit the number of recommendations that are returned, you can also specify the maximum number of
items returned per user.
Next steps
See the set of modules available to Azure Machine Learning.
Train SVD Recommender
3/19/2021 • 2 minutes to read • Edit Online
This article describes how to use the Train SVD Recommender module in Azure Machine Learning designer. Use
this module to train a recommendation model based on the Singular Value Decomposition (SVD) algorithm.
The Train SVD Recommender module reads a dataset of user-item-rating triples. It returns a trained SVD
recommender. You can then use the trained model to predict ratings or generate recommendations, by
connecting the Score SVD Recommender module.
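To convey the underlying idea (not the module's implementation, which also handles missing ratings and regularization), a truncated SVD of a small rating matrix yields low-rank factors whose product reconstructs, and thereby predicts, ratings:

import numpy as np

# Rows are users, columns are items; 0 marks an unobserved rating.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 4.0, 4.0],
])

# Rank-2 truncated SVD: keep the two largest singular values.
u, s, vt = np.linalg.svd(ratings, full_matrices=False)
approx = u[:, :2] @ np.diag(s[:2]) @ vt[:2, :]

# The reconstructed entry serves as the predicted rating for user 0, item 2.
print(approx[0, 2])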
Results
After the pipeline run is completed, to use the model for scoring, connect the Train SVD Recommender to Score SVD
Recommender, to predict values for new input examples.
Next steps
See the set of modules available to Azure Machine Learning.
Train Wide & Deep Recommender
6/10/2021 • 7 minutes to read • Edit Online
This article describes how to use the Train Wide & Deep Recommender module in Azure Machine Learning
designer, to train a recommendation model. This module is based on Wide & Deep learning, which was proposed
by Google.
The Train Wide & Deep Recommender module reads a dataset of user-item-rating triples and, optionally,
some user and item features. It returns a trained Wide & Deep recommender. You can then use the trained
model to generate rating predictions or recommendations by using the Score Wide and Deep Recommender
module.
USERID    MOVIEID    RATING
1         68646      10
223       31381      10
The item features dataset contains columns such as MOVIEID, TITLE, ORIGINAL LANGUAGE, GENRES, and YEAR.
Next steps
See the set of modules available to Azure Machine Learning.
PCA-Based Anomaly Detection module
11/2/2020 • 5 minutes to read • Edit Online
This article describes how to use the PCA-Based Anomaly Detection module in Azure Machine Learning
designer, to create an anomaly detection model based on principal component analysis (PCA).
This module helps you build a model in scenarios where it's easy to get training data from one class, such as
valid transactions, but difficult to get sufficient samples of the targeted anomalies.
For example, to detect fraudulent transactions, you often don't have enough examples of fraud to train on. But
you might have many examples of good transactions. The PCA-Based Anomaly Detection module solves the
problem by analyzing available features to determine what constitutes a "normal" class. The module then
applies distance metrics to identify cases that represent anomalies. This approach lets you train a model by
using existing imbalanced data.
NOTE
You can't view the oversampled data set. For more information on how oversampling is used with PCA, see
Technical notes.
5. Select the Enable input feature mean normalization option to normalize all input features to a mean
of zero. Normalization or scaling to zero is generally recommended for PCA, because the goal of PCA is
to maximize variance among variables.
This option is selected by default. Deselect it if values have already been normalized through a different
method or scale.
6. Connect a tagged training dataset and one of the training modules.
If you set the Create trainer mode option to Single Parameter , use the Train Anomaly Detection
Model module.
7. Submit the pipeline.
Results
When training is complete, you can save the trained model. Or you can connect it to the Score Model module to
predict anomaly scores.
To evaluate the results of an anomaly detection model:
1. Ensure that a score column is available in both datasets.
If you try to evaluate an anomaly detection model and get the error "There is no score column in scored
dataset to compare," you're using a typical evaluation dataset that contains a label column but no
probability scores. Choose a dataset that matches the schema output for anomaly detection models,
which includes Scored Labels and Scored Probabilities columns.
2. Ensure that label columns are marked.
Sometimes the metadata associated with the label column is removed in the pipeline graph. If this
happens, when you use the Evaluate Model module to compare the results of two anomaly detection
models, you might get the error "There is no label column in scored dataset." Or you might get the error
"There is no label column in scored dataset to compare."
You can avoid these errors by adding the Edit Metadata module before the Evaluate Model module. Use
the column selector to choose the class column, and in the Fields list, select Label .
3. Use the Execute Python Script module to adjust the label column categories to 1 (positive, normal) and
0 (negative, abnormal):
label_column_name = 'XXX'      # replace with the name of your label column
anomaly_label_category = YY    # replace with the category value that marks an anomaly
dataframe1[label_column_name] = dataframe1[label_column_name].apply(
    lambda x: 0 if x == anomaly_label_category else 1)
Technical notes
This algorithm uses PCA to approximate the subspace that contains the normal class. The subspace is spanned
by eigenvectors associated with the top eigenvalues of the data covariance matrix.
For each new input, the anomaly detector first computes its projection on the eigenvectors, and then computes
the normalized reconstruction error. This error is the anomaly score. The higher the error, the more anomalous
the instance. For details on how the normal space is computed, see Wikipedia: Principal component analysis.
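A minimal sketch of this reconstruction-error idea, using scikit-learn PCA on made-up data; the module's scoring additionally normalizes the error, so this is an illustration of the technique rather than its exact code:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 5))        # training data: the "normal" class only
pca = PCA(n_components=2).fit(normal)

def anomaly_score(x):
    # Project onto the top eigenvectors, reconstruct, and measure the error.
    projected = pca.transform(x)
    reconstructed = pca.inverse_transform(projected)
    return np.linalg.norm(x - reconstructed, axis=1)

print(anomaly_score(normal[:3]))                        # small errors
print(anomaly_score(rng.normal(loc=8, size=(3, 5))))    # larger errors (anomalous)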
Next steps
See the set of modules available to Azure Machine Learning.
See Exceptions and error codes for the designer for a list of errors specific to the designer modules.
Train Anomaly Detection Model module
11/2/2020 • 2 minutes to read • Edit Online
This article describes how to use the Train Anomaly Detection Model module in Azure Machine Learning
designer to create a trained anomaly detection model.
The module takes as input a set of parameters for an anomaly detection model and an unlabeled dataset. It
returns a trained anomaly detection model, together with a set of labels for the training data.
For more information about the anomaly detection algorithms provided in the designer, see PCA-Based
Anomaly Detection.
Results
After training is complete:
To view the model's parameters, right-click the module and select Visualize .
To create predictions, use the Score Model module with new input data.
To save a snapshot of the trained model, select the module. Then select the Register dataset icon under
the Outputs+logs tab in the right panel.
Next steps
See the set of modules available to Azure Machine Learning.
See Exceptions and error codes for the designer for a list of errors specific to the designer modules.
Web Service Input and Web Service Output
modules
3/5/2021 • 2 minutes to read • Edit Online
This article describes the Web Service Input and Web Service Output modules in Azure Machine Learning
designer.
The Web Service Input module can only connect to an input port with the type DataFrameDirectory . The Web
Service Output module can only be connected from an output port with the type DataFrameDirectory . You
can find the two modules in the module tree, under the Web Service category.
The Web Service Input module indicates where user data enters the pipeline. The Web Service Output module
indicates where user data is returned in a real-time inference pipeline.
NOTE
Automatically generating a real-time inference pipeline is a rule-based, best-effort process. There's no guarantee of
correctness.
You can manually add or remove the Web Service Input and Web Service Output modules to satisfy your
requirements. Make sure that your real-time inference pipeline has at least one Web Service Input module and
one Web Service Output module. If you have multiple Web Service Input or Web Service Output modules, make
sure they have unique names. You can enter the name in the right panel of the module.
You can also manually create a real-time inference pipeline by adding Web Service Input and Web Service
Output modules to your unsubmitted pipeline.
NOTE
The pipeline type will be determined the first time you submit it. Be sure to add Web Service Input and Web Service
Output modules before you submit for the first time.
The following example shows how to manually create a real-time inference pipeline from the Execute Python
Script module.
After you submit the pipeline and the run finishes successfully, you can deploy the real-time endpoint.
NOTE
In the preceding example, Enter Data Manually provides the data schema for web service input and is necessary for
deploying the real-time endpoint. Generally, you should always connect a module or dataset to the port where Web
Service Input is connected to provide the data schema.
Next steps
Learn more about deploying the real-time endpoint.
See the set of modules available to Azure Machine Learning.
Exceptions and error codes for the designer
6/24/2021 • 59 minutes to read • Edit Online
This article describes the error messages and exception codes in Azure Machine Learning designer to help you
troubleshoot your machine learning pipelines.
You can find the error message in the designer following these steps:
Select the failed module and go to the Outputs+logs tab. You can find the detailed log in the
70_driver_log.txt file under the azureml-logs category.
For a detailed module error, check error_info.json under the module_statistics category.
The following are the error codes of modules in the designer.
Error 0001
Exception occurs if one or more specified columns of data set couldn't be found.
You will receive this error if a column selection is made for a module, but the selected column(s) do not exist in
the input data set. This error may occur if you have manually typed in a column name or if the column selector
has provided a suggested column that did not exist in your dataset when you ran the pipeline.
Resolution: Revisit the module throwing this exception and validate that the column name or names are
correct and that all referenced columns do exist.
Column with name or index "{column_id}" does not exist in "{arg_name_missing_column}", but exists in "
{arg_name_has_column}".
Columns with name or index "{column_names}" does not exist in "{arg_name_missing_column}", but exists in "
{arg_name_has_column}".
Error 0002
Exception occurs if one or more parameters could not be parsed or converted from specified type into required
by target method type.
This error occurs in Azure Machine Learning when you specify a parameter as input and the value type is
different from the type that is expected, and implicit conversion cannot be performed.
Resolution: Check the module requirements and determine which value type is required (string, integer,
double, and so on).
Failed to convert value "{arg_value}" in column "{arg_name_or_column}" from "{from_type}" to "{to_type}" with usage of the
format "{fmt}" provided.
Error 0003
Exception occurs if one or more of inputs are null or empty.
You will receive this error in Azure Machine Learning if any inputs or parameters to a module are null or empty.
This error might occur, for example, when you did not type in any value for a parameter. It can also happen if you
chose a dataset that has missing values, or an empty dataset.
Resolution:
Open the module that produced the exception and verify that all inputs have been specified. Ensure that all
required inputs are specified.
Make sure that data that is loaded from Azure storage is accessible, and that the account name or key has not
changed.
Check the input data for missing values, or nulls.
If using a query on a data source, verify that data is being returned in the format you expect.
Check for typos or other changes in the specification of data.
Error 0004
Exception occurs if parameter is less than or equal to specific value.
You will receive this error in Azure Machine Learning if the parameter in the message is at or below a boundary value
required for the module to process the data.
Resolution: Revisit the module throwing the exception and modify the parameter to be greater than the
specified value.
Parameter "{arg_name}" has value "{actual_value}" which should be greater than {lower_boundary}.
Error 0005
Exception occurs if parameter is less than a specific value.
You will receive this error in Azure Machine Learning if the parameter in the message is below a
boundary value required for the module to process the data.
Resolution: Revisit the module throwing the exception and modify the parameter to be greater than or equal to
the specified value.
Parameter "{arg_name}" has value "{value}" which should be greater than or equal to {lower_boundary}.
Error 0006
Exception occurs if parameter is greater than or equal to the specified value.
You will receive this error in Azure Machine Learning if the parameter in the message is greater than or equal to
a boundary value required for the module to process the data.
Resolution: Revisit the module throwing the exception and modify the parameter to be less than the specified
value.
Parameter "{arg_name}" has value "{value}" which should be less than {upper_boundary_parameter_name}.
Error 0007
Exception occurs if parameter is greater than a specific value.
You will receive this error in Azure Machine Learning if, in the properties for the module, you specified a value
that is greater than is allowed. For example, you might specify a date that is outside the range of supported
dates, or you might indicate that five columns be used when only three columns are available.
You might also see this error if you are specifying two sets of data that need to match in some way. For example,
if you are renaming columns, and specify the columns by index, the number of names you supply must match
the number of column indices. Another example might be a math operation that uses two columns, where the
columns must have the same number of rows.
Resolution:
Open the module in question and review any numeric property settings.
Ensure that any parameter values fall within the supported range of values for that property.
If the module takes multiple inputs, ensure that inputs are of the same size.
Check whether the dataset or data source has changed. Sometimes a value that worked with a previous
version of the data will fail after the number of columns, the column data types, or the size of the data has
changed.
Parameters mismatch. One of the parameters should be less than or equal to another.
Parameter "{arg_name}" value should be less than or equal to parameter "{upper_boundary_parameter_name}" value.
Parameter "{arg_name}" has value "{actual_value}" which should be less than or equal to {upper_boundary}.
Parameter "{arg_name}" value {actual_value} should be less than or equal to parameter "{upper_boundary_parameter_name}"
value {upper_boundary}.
Parameter "{arg_name}" value {actual_value} should be less than or equal to {upper_boundary_meaning} value
{upper_boundary}.
Error 0008
Exception occurs if parameter is not in range.
You will receive this error in Azure Machine Learning if the parameter in the message is outside the bounds
required for the module to process the data.
For example, this error is displayed if you try to use Add Rows to combine two datasets that have a different
number of columns.
Resolution: Revisit the module throwing the exception and modify the parameter to be within the specified
range.
Error 0009
Exception occurs when the Azure storage account name or container name is specified incorrectly.
This error occurs in Azure Machine Learning designer when you specify parameters for an Azure storage
account, but the name or password cannot be resolved. Errors on passwords or account names can happen for
many reasons:
The account is the wrong type. Some new account types are not supported for use with Machine Learning
designer. See Import Data for details.
You entered the incorrect account name
The account no longer exists
The password for the storage account is wrong or has changed
You didn't specify the container name, or the container does not exist
You didn't fully specify the file path (path to the blob)
Resolution:
Such problems often occur when you try to manually enter the account name, password, or container path. We
recommend that you use the new wizard for the Import Data module, which helps you look up and check
names.
Also check whether the account, container, or blob has been deleted. Use another Azure storage utility to verify
that the account name and password have been entered correctly, and that the container exists.
Some newer account types are not supported by Azure Machine Learning. For example, the new "hot" or "cold"
storage types cannot be used for machine learning. Both classic storage accounts and storage accounts created
as "General purpose" work fine.
If the complete path to a blob was specified, verify that the path is specified as container/blobname , and that
both the container and the blob exist in the account.
The path should not contain a leading slash. For example /container/blob is incorrect and should be entered
as container/blob .
The Azure storage account name "{account_name}" or container name "{container_name}" is incorrect; a container name of the
format container/blob was expected.
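As a quick check outside the designer, a short script such as the following can confirm that a path uses the expected container/blob format with no leading slash. This is a minimal sketch; the path value is a placeholder, not one taken from this article.

# Validate that a blob path follows the container/blob format expected by Import Data.
path = "/mycontainer/data/input.csv"        # placeholder path to check

normalized = path.lstrip("/")               # the path must not start with a slash
parts = normalized.split("/", 1)            # expect a container name and a blob name

if len(parts) != 2 or not parts[0] or not parts[1]:
    raise ValueError(f"Expected container/blob format, got: {path!r}")

container_name, blob_name = parts
print(f"Container: {container_name}, Blob: {blob_name}")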
Error 0010
Exception occurs if input datasets have column names that should match but do not.
You will receive this error in Azure Machine Learning if the column index in the message has different column
names in the two input datasets.
Resolution: Use Edit Metadata or modify the original dataset to have the same column name for the specified
column index.
Column names are not the same for column {col_index} (zero-based) of input datasets ({dataset1} and {dataset2} respectively).
Error 0011
Exception occurs if passed column set argument does not apply to any of dataset columns.
You will receive this error in Azure Machine Learning if the specified column selection does not match any of the
columns in the given dataset.
You can also get this error if you haven't selected a column and at least one column is required for the module
to work.
Resolution: Modify the column selection in the module so that it will apply to the columns in the dataset.
If the module requires that you select a specific column, such as a label column, verify that the right column is
selected.
If inappropriate columns are selected, remove them and rerun the pipeline.
Specified column set "{column_set}" does not apply to any of dataset columns.
Error 0012
Exception occurs if instance of class could not be created with passed set of arguments.
Resolution: This error is not actionable by the user and will be deprecated in a future release.
Error 0013
Exception occurs if the learner passed to the module is an invalid type.
This error occurs whenever a trained model is incompatible with the connected scoring module.
Resolution:
Determine the type of learner that is produced by the training module, and determine the scoring module that is
appropriate for the learner.
If the model was trained using any of the specialized training modules, connect the trained model only to the
corresponding specialized scoring module.
Error 0014
Exception occurs if the count of column unique values is greater than allowed.
This error occurs when a column contains too many unique values, like an ID column or text column. You might
see this error if you specify that a column be handled as categorical data, but there are too many unique values
in the column to allow processing to complete. You might also see this error if there is a mismatch between the
number of unique values in two inputs.
This error occurs when both of the following conditions are met:
More than 97% of the instances in a column are unique values, which means nearly all categories are different from each other.
A column has more than 1,000 unique values.
Resolution:
Open the module that generated the error, and identify the columns used as inputs. For some modules, you can
right-click the dataset input and select Visualize to get statistics on individual columns, including the number of
unique values and their distribution.
For columns that you intend to use for grouping or categorization, take steps to reduce the number of unique values. You can do this in different ways, depending on the data type of the column.
For ID columns that are not meaningful features when training a model, use Edit Metadata to mark the column as Clear feature so that it is not used during training.
For text columns, use the Feature Hashing or Extract N-Gram Features from Text module to preprocess them.
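To find the offending columns before rerunning the pipeline, a short pandas check along these lines can flag high-cardinality columns. The 97% and 1,000 thresholds mirror the conditions described above; the file name is only a placeholder.

import pandas as pd

# Placeholder input; replace with the dataset feeding the failing module.
df = pd.read_csv("training_data.csv")

for col in df.columns:
    n_unique = df[col].nunique()
    unique_ratio = n_unique / len(df)
    # Mirror the documented conditions: more than 97% unique and more than 1,000 unique values.
    if unique_ratio > 0.97 and n_unique > 1000:
        print(f"Column '{col}' has {n_unique} unique values ({unique_ratio:.1%}); "
              "treat it as an ID or text column rather than a categorical feature.")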
TIP
Unable to find a resolution that matches your scenario? You can provide feedback on this topic that includes the name of
the module that generated the error, and the data type and cardinality of the column. We will use the information to
provide more targeted troubleshooting steps for common scenarios.
Error 0015
Exception occurs if database connection has failed.
You will receive this error if you enter an incorrect SQL account name, password, database server, or database
name, or if a connection with the database cannot be established due to problems with the database or server.
Resolution: Verify that the account name, password, database server, and database have been entered correctly,
and that the specified account has the correct level of permissions. Verify that the database is currently
accessible.
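One way to verify the connection details independently of the designer is a small pyodbc test such as the following sketch. The server, database, and credential values are placeholders, and the ODBC driver name depends on what is installed on your machine.

import pyodbc

# Placeholder connection details; substitute your own server, database, and credentials.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydatabase;"
    "UID=myuser;"
    "PWD=mypassword;"
)

try:
    with pyodbc.connect(conn_str, timeout=30) as conn:
        cursor = conn.cursor()
        cursor.execute("SELECT 1")   # trivial query to confirm that the connection works
        print("Connection succeeded.")
except pyodbc.Error as err:
    print(f"Connection failed: {err}")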
Error 0016
Exception occurs if input datasets passed to the module should have compatible column types but do not.
You will receive this error in Azure Machine Learning if the types of the columns passed in two or more datasets
are not compatible with each other.
Resolution: Use Edit Metadata or modify the original input dataset to ensure that the types of the columns are
compatible.
Column element types are not compatible for column '{first_col_names}' (zero-based) of input datasets ({first_dataset_names}
and {second_dataset_names} respectively).
Error 0017
Exception occurs if a selected column uses a data type that is not supported by the current module.
For example, you might receive this error in Azure Machine Learning if your column selection includes a column
with a data type that cannot be processed by the module, such as a string column for a math operation, or a
score column where a categorical feature column is required.
Resolution:
1. Identify the column that is the problem.
2. Review the requirements of the module.
3. Modify the column to make it conform to requirements. You might need to use several of the following
modules to make changes, depending on the column and the conversion you are attempting:
Use Edit Metadata to change the data type of columns, or to change the column usage from feature to
numeric, categorical to non-categorical, and so forth.
4. As a last resort, you might need to modify the original input dataset.
TIP
Unable to find a resolution that matches your scenario? You can provide feedback on this topic that includes the name of
the module that generated the error, and the data type and cardinality of the column. We will use the information to
provide more targeted troubleshooting steps for common scenarios.
Cannot process column of current type. The type is not supported by the module.
Cannot process column of type {col_type}. The type is not supported by the module.
Cannot process column "{col_name}" of type {col_type}. The type is not supported by the module.
Cannot process column "{col_name}" of type {col_type}. The type is not supported by the module. Parameter name:
{arg_name}.
Error 0018
Exception occurs if input dataset is not valid.
Resolution: This error in Azure Machine Learning can appear in many contexts, so there is not a single
resolution. In general, the error indicates that the data provided as input to a module has the wrong number of
columns, or that the data type does not match requirements of the module. For example:
The module requires a label column, but no column is marked as a label, or you have not selected a label
column yet.
The module requires that data be categorical but your data is numeric.
The data is in the wrong format.
Imported data contains invalid characters, bad values, or out of range values.
The column is empty or contains too many missing values.
To determine the requirements and how your data might differ from them, review the help topic for the module that will consume the dataset as input.
Error 0020
Exception occurs if number of columns in some of the datasets passed to the module is too small.
You will receive this error in Azure Machine Learning if not enough columns have been selected for a module.
Resolution: Revisit the module and ensure that column selector has correct number of columns selected.
Number of columns in input dataset is less than allowed minimum of {required_columns_count} column(s).
Number of columns in input dataset "{arg_name}" is less than allowed minimum of {required_columns_count} column(s).
Error 0021
Exception occurs if number of rows in some of the datasets passed to the module is too small.
This error is seen in Azure Machine Learning when there are not enough rows in the dataset to perform the
specified operation. For example, you might see this error if the input dataset is empty, or if you are trying to
perform an operation that requires some minimum number of rows to be valid. Such operations can include
(but are not limited to) grouping or classification based on statistical methods, certain types of binning, and
learning with counts.
Resolution:
Open the module that returned the error, and check the input dataset and module properties.
Verify that the input dataset is not empty and that there are enough rows of data to meet the requirements
described in module help.
If your data is loaded from an external source, make sure that the data source is available and that there is no
error or change in the data definition that would cause the import process to get fewer rows.
If you are performing an operation on the data upstream of the module that might affect the type of data or
the number of values, such as cleaning, splitting, or join operations, check the outputs of those operations to
determine the number of rows returned.
Number of rows in input dataset is less than allowed minimum of {required_rows_count} row(s).
Number of rows in input dataset is less than allowed minimum of {required_rows_count} row(s). {reason}
Number of rows in input dataset "{arg_name}" is less than allowed minimum of {required_rows_count} row(s).
Number of rows in input dataset "{arg_name}" is {actual_rows_count}, less than allowed minimum of {required_rows_count}
row(s).
Number of "{row_type}" rows in input dataset "{arg_name}" is {actual_rows_count}, less than allowed minimum of
{required_rows_count} row(s).
Error 0022
Exception occurs if number of selected columns in input dataset does not equal to the expected number.
This error in Azure Machine Learning can occur when the downstream module or operation requires a specific
number of columns or inputs, and you have provided too few or too many columns or inputs. For example:
You specified a single label column or key column but accidentally selected multiple columns.
You are renaming columns but provided more or fewer names than there are columns.
The number of columns in the source or destination has changed or doesn't match the number of
columns used by the module.
You have provided a comma-separated list of values for inputs, but the number of values does not match,
or multiple inputs are not supported.
Resolution: Revisit the module and check the column selection to ensure that the correct number of columns is
selected. Verify the outputs of upstream modules, and the requirements of downstream operations.
If you used one of the column selection options that can select multiple columns (column indices, all features, all
numeric, etc.), validate the exact number of columns returned by the selection.
Verify that the number or type of upstream columns has not changed.
If you are using a recommendation dataset to train a model, remember that the recommender expects a limited
number of columns, corresponding to user-item pairs or user-item-rankings. Remove additional columns before
training the model or splitting recommendation datasets. For more information, see Split Data.
Number of selected columns in input dataset does not equal to the expected number.
Column selection pattern "{selection_pattern_friendly_name}" provides number of selected columns in input dataset not equal
to {expected_col_count}.
Error 0023
Exception occurs if target column of input dataset is not valid for the current trainer module.
This error in Azure Machine Learning occurs if the target column (as selected in the module parameters) is not of a valid data type, contains all missing values, or is not categorical as expected.
Resolution: Revisit the module input to inspect the content of the label/target column. Make sure it does not have all missing values. If the module expects the target column to be categorical, make sure that the target column contains more than one distinct value.
Input dataset has unsupported target column "{column_index}" for learner of type {learner_type}.
Error 0024
Exception occurs if dataset does not contain a label column.
This error in Azure Machine Learning occurs when the module requires a label column and the dataset does not
have a label column. For example, evaluation of a scored dataset usually requires that a label column is present
to compute accuracy metrics.
It can also happen that a label column is present in the dataset, but not detected correctly by Azure Machine
Learning.
Resolution:
Open the module that generated the error, and determine if a label column is present. The name or data type
of the column doesn't matter, as long as the column contains a single outcome (or dependent variable) that
you are trying to predict. If you are not sure which column has the label, look for a generic name such as
Class or Target.
If the dataset does not include a label column, it is possible that the label column was explicitly or accidentally
removed upstream. It could also be that the dataset is not the output of an upstream scoring module.
To explicitly mark the column as the label column, add the Edit Metadata module and connect the dataset.
Select only the label column, and select Label from the Fields dropdown list.
If the wrong column is chosen as the label, you can select Clear label from the Fields dropdown list to fix the metadata on the column.
There is no score column in "{dataset_name}" that is produced by a "{learner_type}". Score the dataset using the correct type of
learner.
Error 0026
Exception occurs if columns with the same name are not allowed.
This error in Azure Machine Learning occurs if multiple columns have the same name. One way you may receive
this error is if the dataset does not have a header row and column names are automatically assigned: Col0, Col1,
etc.
Resolution: If columns have the same name, insert an Edit Metadata module between the input dataset and the module. Use the column selector in Edit Metadata to select the columns to rename, typing the new names into the New column names textbox.
Equal column names are specified in arguments. Equal column names are not allowed by module.
Equal column names in arguments "{arg_name_1}" and "{arg_name_2}" are not allowed. Please specify different names.
Error 0027
Exception occurs in case when two objects have to be of the same size but are not.
This is a common error in Azure Machine Learning and can be caused by many conditions.
Resolution: There is no specific resolution. However, you can check for conditions such as the following:
If you are renaming columns, make sure that each list (the input columns and the list of new names) has
the same number of items.
If you are joining or concatenating two datasets, make sure they have the same schema.
If you are joining two datasets that have multiple columns, make sure that the key columns have the same data type, and select the option Allow duplicates and preserve column order in selection.
Error 0028
Exception occurs in the case when column set contains duplicated column names and it is not allowed.
This error in Azure Machine Learning occurs when column names are duplicated; that is, not unique.
Resolution: If any columns have the same name, add an instance of Edit Metadata between the input dataset and the module raising the error. Use the Column Selector in Edit Metadata to select the columns to rename, and type the new column names into the New column names textbox. If you are renaming multiple columns, ensure that the values you type in New column names are unique.
Error 0029
Exception occurs in case when invalid URI is passed.
This error in Azure Machine Learning occurs when an invalid URI is passed. You will receive this error if any of the following conditions are true:
The Public or SAS URI provided for Azure Blob Storage for read or write contains an error.
The time window for the SAS has expired.
The Web URL via HTTP source represents a file or a loopback URI.
The Web URL via HTTP contains an incorrectly formatted URL.
The URL cannot be resolved by the remote source.
Resolution: Revisit the module and verify the format of the URI. If the data source is a Web URL via HTTP, verify
that the intended source is not a file or a loopback URI (localhost).
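A quick way to check a URL before pasting it into the module is a small script like this sketch, which flags unsupported schemes and loopback addresses. The URL value is a placeholder.

from urllib.parse import urlparse

# Placeholder URL; replace with the source you plan to use in Import Data.
url = "http://localhost:8080/data/train.csv"

parsed = urlparse(url)

if parsed.scheme not in ("http", "https"):
    print(f"Unsupported scheme: {parsed.scheme!r} (file and other schemes are rejected).")
elif parsed.hostname in ("localhost", "127.0.0.1", "::1"):
    print("Loopback addresses cannot be used as a Web URL via HTTP source.")
else:
    print("URL format looks acceptable; also confirm that any SAS token has not expired.")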
Error 0030
Exception occurs in the case when it is not possible to download a file.
This exception in Azure Machine Learning occurs when it is not possible to download a file. You will receive this
exception when an attempted read from an HTTP source has failed after three (3) retry attempts.
Resolution: Verify that the URI to the HTTP source is correct and that the site is currently accessible via the
Internet.
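To confirm that the source is reachable, you can mimic the module's behavior with a short retry loop such as the following sketch. The URL is a placeholder; the loop uses three attempts, matching the number of retries described above.

import time
import requests

url = "https://example.com/data/train.csv"    # placeholder source URL

for attempt in range(1, 4):                    # three attempts, like the module
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        print(f"Download succeeded on attempt {attempt} ({len(response.content)} bytes).")
        break
    except requests.RequestException as err:
        print(f"Attempt {attempt} failed: {err}")
        time.sleep(5)                          # brief pause before retrying
else:
    print("All attempts failed; check the URL and the site's availability.")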
Error 0031
Exception occurs if number of columns in column set is less than needed.
This error in Azure Machine Learning occurs if the number of columns selected is less than needed. You will
receive this error if the minimum required number of columns are not selected.
Resolution: Add additional columns to the column selection by using the Column Selector .
At least {required_columns_count} column(s) should be specified for input argument "{arg_name}". The actual number of
specified columns is {input_columns_count}.
Error 0032
Exception occurs if argument is not a number.
You will receive this error in Azure Machine Learning if the argument is double.NaN (not a number).
Resolution: Modify the specified argument to use a valid value.
Error 0033
Exception occurs if argument is Infinity.
This error in Azure Machine Learning occurs if the argument is infinite. You will receive this error if the argument
is either double.NegativeInfinity or double.PositiveInfinity .
Resolution: Modify the specified argument to be a valid value.
Error 0034
Exception occurs if more than one rating exists for a given user-item pair.
This error in Azure Machine Learning occurs in recommendation if a user-item pair has more than one rating
value.
Resolution: Ensure that the user-item pair possesses one rating value only.
More than one rating for user {user} and item {item} in rating prediction data table.
More than one rating for user {user} and item {item} in {dataset}.
Error 0035
Exception occurs if no features were provided for a given user or item.
This error in Azure Machine Learning occurs when you are trying to use a recommendation model for scoring but a feature vector cannot be found.
Resolution:
The Matchbox recommender has certain requirements that must be met when using either item features or user
features. This error indicates that a feature vector is missing for a user or item that you provided as input. Ensure
that a vector of features is available in the data for each user or item.
For example, if you trained a recommendation model using features such as the user's age, location, or income,
but now want to create scores for new users who were not seen during training, you must provide some
equivalent set of features (namely, age, location, and income values) for the new users in order to make
appropriate predictions for them.
If you do not have any features for these users, consider feature engineering to generate appropriate features.
For example, if you do not have individual user age or income values, you might generate approximate values to
use for a group of users.
TIP
Resolution not applicable to your case? You are welcome to send feedback on this article and provide information about
the scenario, including the module and the number of rows in the column. We will use this information to provide more
detailed troubleshooting steps in the future.
Error 0036
Exception occurs if multiple feature vectors were provided for a given user or item.
This error in Azure Machine Learning occurs if a feature vector is defined more than once.
Resolution: Ensure that the feature vector is not defined more than once.
Error 0037
Exception occurs if multiple label columns are specified and just one is allowed.
This error in Azure Machine Learning occurs if more than one column is selected to be the new label column.
Most supervised learning algorithms require a single column to be marked as the target or label.
Resolution: Make sure to select a single column as the new label column.
Error 0039
Exception occurs if an operation has failed.
This error in Azure Machine Learning occurs when an internal operation cannot be completed.
Resolution: This error is caused by many conditions and there is no specific remedy.
The following generic messages are reported for this error; when available, they are followed by a more specific description of the condition.
If no details are available, send feedback through the Microsoft Q&A question page and provide information about the modules that generated the error and related conditions.
Operation failed.
TIP
Resolution unclear, or not applicable to your case? You are welcome to send feedback on this article and provide
information about the scenario, including the module and the data type of the column. We will use this information to
provide more detailed troubleshooting steps in the future.
Could not convert column "{col_name1}" of type {type1} to column of type {type2}.
Could not convert column "{col_name1}" of type {type1} to column "{col_name2}" of type {type2}.
Error 0044
Exception occurs when it is not possible to derive element type of column from the existing values.
This error in Azure Machine Learning occurs when it is not possible to infer the type of a column or columns in a
dataset. This typically happens when concatenating two or more datasets with different element types. If Azure
Machine Learning is unable to determine a common type that is able to represent all the values in a column or
columns without loss of information, it will generate this error.
Resolution: Ensure that all values in a given column in both datasets being combined are either of the same
type (numeric, Boolean, categorical, string, date, etc.) or can be coerced to the same type.
Cannot derive element type for column "{column_name}" -- all the elements are null references.
Cannot derive element type for column "{column_name}" of dataset "{dataset_name}" -- all the elements are null references.
Error 0045
Exception occurs when it is not possible to create a column because of mixed element types in the source.
This error in Azure Machine Learning is produced when the element types of two datasets being combined are
different.
Resolution: Ensure that all values in a given column in both datasets being combined are of the same type
(numeric, Boolean, categorical, string, date, etc.).
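Before concatenating datasets in the designer, a pandas comparison like this sketch can reveal columns whose inferred types differ between the two inputs. The file names are placeholders.

import pandas as pd

# Placeholder file names; use the two datasets you intend to combine.
df1 = pd.read_csv("dataset_a.csv")
df2 = pd.read_csv("dataset_b.csv")

shared_columns = df1.columns.intersection(df2.columns)
for col in shared_columns:
    if df1[col].dtype != df2[col].dtype:
        print(f"Column '{col}': {df1[col].dtype} in dataset_a vs {df2[col].dtype} in dataset_b")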
Error 0046
Exception occurs when it is not possible to create directory on specified path.
This error in Azure Machine Learning occurs when it is not possible to create a directory on the specified path.
You will receive this error if any part of the path to the output directory for a Hive Query is incorrect or
inaccessible.
Resolution: Revisit the module and verify that the directory path is correctly formatted and that it is accessible
with the current credentials.
Error 0047
Exception occurs if number of feature columns in some of the datasets passed to the module is too small.
This error in Azure Machine Learning occurs if the input dataset to training does not contain the minimum
number of columns required by the algorithm. Typically either the dataset is empty or only contains training
columns.
Resolution: Revisit the input dataset to make sure there are one or more additional columns apart from the
label column.
Number of feature columns in input dataset is less than allowed minimum of {required_columns_count} column(s).
Number of feature columns in input dataset "{arg_name}" is less than allowed minimum of {required_columns_count}
column(s).
Error 0048
Exception occurs in the case when it is not possible to open a file.
This error in Azure Machine Learning occurs when it is not possible to open a file for read or write. You might
receive this error for these reasons:
The container or the file (blob) does not exist
The access level of the file or container does not allow you to access the file
The file is too large to read, or is in the wrong format
Resolution: Revisit the module and the file you are trying to read.
Verify that the names of the container and file are correct.
Use the Azure classic portal or an Azure storage tool to verify that you have permission to access the file.
Error while opening the file: {file_name}. Storage exception message: {exception}.
Error 0049
Exception occurs in the case when it is not possible to parse a file.
This error in Azure Machine Learning occurs when it is not possible to parse a file. You will receive this error if
the file format selected in the Import Data module does not match the actual format of the file, or if the file
contains an unrecognizable character.
Resolution: Revisit the module and correct the file format selection if it does not match the format of the file. If
possible, inspect the file to confirm that it does not contain any illegal characters.
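If you want to inspect the file locally first, reading it with explicit format options, as in this sketch, often reveals the mismatch. The file name, delimiter, and encoding values are illustrative.

import pandas as pd

# Placeholder values; match these options to the file's actual format.
df = pd.read_csv(
    "imported_file.csv",
    sep=",",                  # the delimiter the file actually uses
    encoding="utf-8",         # change if the file is not Unicode
    on_bad_lines="warn",      # report rows that cannot be parsed (pandas >= 1.3)
)
print(df.head())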
Error 0052
Exception occurs if Azure storage account key is specified incorrectly.
This error in Azure Machine Learning occurs if the key used to access the Azure storage account is incorrect. For
example, you might see this error if the Azure storage key was truncated when copied and pasted, or if the
wrong key was used.
For more information about how to get the key for an Azure storage account, see View, copy, and regenerate
storage access keys.
Resolution: Revisit the module and verify that the Azure storage key is correct for the account; copy the key
again from the Azure classic portal if necessary.
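A short test with the azure-storage-blob Python package, as in the following sketch, can confirm whether the account name and key pair is valid. The account name and key are placeholders.

from azure.storage.blob import BlobServiceClient

# Placeholder account name and key; paste the full key without truncation.
account_name = "mystorageaccount"
account_key = "<storage-account-key>"

client = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    credential=account_key,
)

try:
    containers = [c.name for c in client.list_containers()]
    print(f"Key accepted; containers: {containers}")
except Exception as err:
    print(f"Authentication failed: {err}")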
Error 0053
Exception occurs in the case when there are no user features or items for matchbox recommendations.
This error in Azure Machine Learning is produced when a feature vector cannot be found.
Resolution: Ensure that a feature vector is present in the input dataset.
Error 0056
Exception occurs if the columns you selected for an operation violates requirements.
This error in Azure Machine Learning occurs when you are choosing columns for an operation that requires the
column be of a particular data type.
This error can also happen if the column is the correct data type, but the module you are using requires that the
column also be marked as a feature, label, or categorical column.
Resolution:
1. Review the data type of the columns that are currently selected.
2. Ascertain whether the selected columns are categorical, label, or feature columns.
3. Review the help topic for the module in which you made the column selection, to determine if there are
specific requirements for data type or column usage.
4. Use Edit Metadata to change the column type for the duration of this operation. Be sure to change the
column type back to its original value, using another instance of Edit Metadata, if you need it for
downstream operations.
Error 0057
Exception occurs when attempting to create a file or blob that already exists.
This exception occurs when you are using the Export Data module or another module to save results of a pipeline in Azure Machine Learning to Azure blob storage, but you attempt to create a file or blob that already exists.
Resolution:
You will receive this error only if you previously set the property Azure blob storage write mode to Error . By
design, this module raises an error if you attempt to write a dataset to a blob that already exists.
Open the module properties and change the property Azure blob storage write mode to Overwrite.
Alternatively, you can type the name of a different destination blob or file and be sure to specify a blob that
does not already exist.
Error 0058
This error in Azure Machine Learning occurs if the dataset does not contain the expected label column.
This exception can also occur when the label column provided does not match the data or datatype expected by
the learner, or has the wrong values. For example, this exception is produced when using a real-valued label
column when training a binary classifier.
Resolution: The resolution depends on the learner or trainer that you are using, and the data types of the
columns in your dataset. First, verify the requirements of the machine learning algorithm or training module.
Revisit the input dataset. Verify that the column you expect to be treated as the label has the right data type for
the model you are creating.
Check inputs for missing values and eliminate or replace them if necessary.
If necessary, add the Edit Metadata module and ensure that the label column is marked as a label.
The label column values and scored label column values are not comparable.
Error 0059
Exception occurs if a column index specified in a column picker cannot be parsed.
This error in Azure Machine Learning occurs if a column index specified when using the Column Selector cannot
be parsed. You will receive this error when the column index is in an invalid format that cannot be parsed.
Resolution: Modify the column index to use a valid index value.
One or more specified column indexes or index ranges could not be parsed.
Error 0060
Exception occurs when an out of range column range is specified in a column picker.
This error in Azure Machine Learning occurs when an out-of-range column range is specified in the Column
Selector. You will receive this error if the column range in the column picker does not correspond to the columns
in the dataset.
Resolution: Modify the column range in the column picker to correspond to the columns in the dataset.
Error 0061
Exception occurs when attempting to add a row to a DataTable that has a different number of columns than the
table.
This error in Azure Machine Learning occurs when you attempt to add a row to a dataset that has a different
number of columns than the dataset. You will receive this error if the row that is being added to the dataset has
a different number of columns from the input dataset. The row cannot be appended to the dataset if the number
of columns is different.
Resolution: Modify the input dataset to have the same number of columns as the row added, or modify the
row added to have the same number of columns as the dataset.
Columns in chunk "{chunk_id_1}" is different with chunk "{chunk_id_2}" with chunk size: {chunk_size}.
Column count in file "{filename_1}" (count={column_count_1}) is different with file "{filename_2}" (count={column_count_2}).
Error 0062
Exception occurs when attempting to compare two models with different learner types.
This error in Azure Machine Learning is produced when evaluation metrics for two different scored datasets
cannot be compared. In this case, it is not possible to compare the effectiveness of the models used to produce
the two scored datasets.
Resolution: Verify that the scored results are produced by the same kind of machine learning model (binary
classification, regression, multi-class classification, recommendation, clustering, anomaly detection, etc.) All
models that you compare must have the same learner type.
Got incompatible learner type: "{actual_learner_type}". Expected learner types are: "{expected_learner_type_list}".
Error 0064
Exception occurs if Azure storage account name or storage key is specified incorrectly.
This error in Azure Machine Learning occurs if the Azure storage account name or storage key is specified
incorrectly. You will receive this error if you enter an incorrect account name or password for the storage
account. This may occur if you manually enter the account name or password. It may also occur if the account
has been deleted.
Resolution: Verify that the account name and password have been entered correctly, and that the account
exists.
The Azure storage account name "{account_name}" or storage key for the account name is incorrect.
Error 0065
Exception occurs if Azure blob name is specified incorrectly.
This error in Azure Machine Learning occurs if the Azure blob name is specified incorrectly. You will receive the
error if:
The blob cannot be found in the specified container.
Only the container was specified as the source in an Import Data request when the format was Excel or CSV with encoding; concatenation of the contents of all blobs within a container is not allowed with these formats.
A SAS URI does not contain the name of a valid blob.
Resolution: Revisit the module throwing the exception. Verify that the specified blob does exist in the container
in the storage account and that permissions allow you to see the blob. Verify that the input is of the form
containername/filename if you have Excel or CSV with encoding formats. Verify that a SAS URI contains the
name of a valid blob.
The Azure storage blob name with prefix "{blob_name_prefix}" does not exist.
Failed to find any Azure storage blobs with wildcard path "{blob_wildcard_path}".
Error 0066
Exception occurs if a resource could not be uploaded to an Azure Blob.
This error in Azure Machine Learning occurs if a resource could not be uploaded to an Azure Blob. Both are
saved to the same Azure storage account as the account containing the input file.
Resolution: Revisit the module. Verify that the Azure account name, storage key, and container are correct and
that the account has permission to write to the container.
Error 0067
Exception occurs if a dataset has a different number of columns than expected.
This error in Azure Machine Learning occurs if a dataset has a different number of columns than expected. You will receive this error when the number of columns in the dataset is different from the number of columns that the module expects during execution.
Resolution: Modify the input dataset or the parameters.
Error 0068
Exception occurs if the specified Hive script is not correct.
This error in Azure Machine Learning occurs if there are syntax errors in a Hive QL script, or if the Hive
interpreter encounters an error while executing the query or script.
Resolution:
The error message from Hive is normally reported back in the Error Log so that you can take action based on
the specific error.
Open the module and inspect the query for mistakes.
Verify that the query works correctly outside of Azure Machine Learning by logging in to the Hive console of
your Hadoop cluster and running the query.
Try placing comments in your Hive script on a separate line rather than mixing executable statements and comments on the same line.
Resources
See the following articles for help with Hive queries for machine learning:
Create Hive tables and load data from Azure Blob Storage
Explore data in tables with Hive queries
Create features for data in an Hadoop cluster using Hive queries
Hive for SQL Users Cheat Sheet (PDF)
Error 0069
Exception occurs if the specified SQL script is not correct.
This error in Azure Machine Learning occurs if the specified SQL script has syntax problems, or if the columns or
table specified in the script is not valid.
You will receive this error if the SQL engine encounters any error while executing the query or script. The SQL
error message is normally reported back in the Error Log so that you can take action based on the specific error.
Resolution: Revisit the module and inspect the SQL query for mistakes.
Verify that the query works correctly outside of Azure ML by logging in to the database server directly and
running the query.
If there is a SQL generated message reported by the module exception, take action based on the reported error.
For example, the error messages sometimes include specific guidance on the likely error:
No such column or missing database, indicating that you might have typed a column name wrong. If you are
sure the column name is correct, try using brackets or quotation marks to enclose the column identifier.
SQL logic error near <SQL keyword>, indicating that you might have a syntax error before the specified keyword.
Error 0070
Exception occurs when attempting to access non-existent Azure table.
This error in Azure Machine Learning occurs when you attempt to access a non-existent Azure table. You will
receive this error if you specify a table in Azure storage, which does not exist when reading from or writing to
Azure Table Storage. This can happen if you mistype the name of the desired table, or you have a mismatch
between the target name and the storage type. For example, you intended to read from a table but entered the
name of a blob instead.
Resolution: Revisit the module to verify that the name of the table is correct.
Error 0072
Exception occurs in the case of connection timeout.
This error in Azure Machine Learning occurs when a connection times out. You will receive this error if there are
currently connectivity issues with the data source or destination, such as slow internet connectivity, or if the
dataset is large and/or the SQL query to read in the data performs complicated processing.
Resolution: Determine whether there are currently issues with slow connections to Azure storage or the
internet.
Error 0073
Exception occurs if an error occurs while converting a column to another type.
This error in Azure Machine Learning occurs when it is not possible to convert a column to another type. You will receive this error if a module requires a particular type and it is not possible to convert the column to the new type.
Resolution: Modify the input dataset so that the column can be converted based on the inner exception.
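To see which values block the conversion, a pandas probe like this sketch can list them before you fix the source data. The file name and column name are placeholders.

import pandas as pd

df = pd.read_csv("input_data.csv")          # placeholder file name
col = "amount"                              # placeholder column that fails to convert

converted = pd.to_numeric(df[col], errors="coerce")
bad_values = df.loc[converted.isna() & df[col].notna(), col]
print(f"{len(bad_values)} value(s) in '{col}' cannot be converted to numeric:")
print(bad_values.unique()[:20])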
Error 0075
Exception occurs when an invalid binning function is used when quantizing a dataset.
This error in Azure Machine Learning occurs when you are trying to bin data using an unsupported method, or
when the parameter combinations are invalid.
Resolution:
Error handling for this event was introduced in an earlier version of Azure Machine Learning that allowed more
customization of binning methods. Currently all binning methods are based on a selection from a dropdown list,
so technically it should no longer be possible to get this error.
Error 0077
Exception occurs when unknown blob file writes mode passed.
This error in Azure Machine Learning occurs if an invalid argument is passed in the specifications for a blob file
destination or source.
Resolution: In almost all modules that import or export data to and from Azure blob storage, parameter values
controlling the write mode are assigned by using a dropdown list; therefore, it is not possible to pass an invalid
value, and this error should not appear. This error will be deprecated in a later release.
Error 0078
Exception occurs when the HTTP option for Import Data receives a 3xx status code indicating redirection.
This error in Azure Machine Learning occurs when the HTTP option for Import Data receives a 3xx (301, 302,
304, etc.) status code indicating redirection. You will receive this error if you attempt to connect to an HTTP
source that redirects the browser to another page. For security reasons, redirecting websites are not allowed as
data sources for Azure Machine Learning.
Resolution: If the website is a trusted website, enter the redirected URL directly.
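You can find the final, non-redirecting URL with a request that does not follow redirects, as in this sketch. The URL is a placeholder.

import requests

url = "http://example.com/dataset.csv"      # placeholder source URL

response = requests.head(url, allow_redirects=False, timeout=30)
if 300 <= response.status_code < 400:
    print(f"Redirects to: {response.headers.get('Location')} -- use that URL directly.")
else:
    print(f"No redirect; status code {response.status_code}.")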
Error 0079
Exception occurs if Azure storage container name is specified incorrectly.
This error in Azure Machine Learning occurs if the Azure storage container name is specified incorrectly. You will
receive this error if you have not specified both the container and the blob (file) name using the Path to blob
beginning with container option when writing to Azure Blob Storage.
Resolution: Revisit the Export Data module and verify that the specified path to the blob contains both the
container and the file name, in the format container/filename .
The Azure storage container name "{container_name}" is incorrect; a container name of the format container/blob was
expected.
Error 0080
Exception occurs when column with all values missing is not allowed by module.
This error in Azure Machine Learning is produced when one or more of the columns consumed by the module
contains all missing values. For example, if a module is computing aggregate statistics for each column, it cannot
operate on a column containing no data. In such cases, module execution is halted with this exception.
Resolution: Revisit the input dataset and remove any columns that contain all missing values.
Error 0081
Exception occurs in PCA module if number of dimensions to reduce to is equal to number of feature columns in
input dataset, containing at least one sparse feature column.
This error in Azure Machine Learning is produced if the following conditions are met: (a) the input dataset has at
least one sparse column and (b) the final number of dimensions requested is the same as the number of input
dimensions.
Resolution: Consider reducing the number of dimensions in the output to be fewer than the number of dimensions in the input. This is typical in applications of PCA.
For dataset containing sparse feature columns number of dimensions to reduce to should be less than number of feature
columns.
Error 0082
Exception occurs when a model cannot be successfully deserialized.
This error in Azure Machine Learning occurs when a saved machine learning model or transform cannot be
loaded by a newer version of the Azure Machine Learning runtime as a result of a breaking change.
Resolution: The training pipeline that produced the model or transform must be rerun and the model or
transform must be resaved.
Model could not be deserialized because it is likely serialized with an older serialization format. Retrain and resave the model.
Error 0083
Exception occurs if dataset used for training cannot be used for concrete type of learner.
This error in Azure Machine Learning is produced when the dataset is incompatible with the learner being
trained. For example, the dataset might contain at least one missing value in each row, and as a result, the entire
dataset would be skipped during training. In other cases, some machine learning algorithms such as anomaly
detection do not expect labels to be present and can throw this exception if labels are present in the dataset.
Resolution: Consult the documentation of the learner being used to check requirements for the input dataset. Examine the columns to see whether all required columns are present.
{data_name} contains invalid data for training. Learner type: {learner_type}. Reason: {reason}.
Error 0084
Exception occurs when scores produced from an R Script are evaluated. This is currently unsupported.
This error in Azure Machine Learning occurs if you try to use one of the modules for evaluating a model with
output from an R script that contains scores.
Resolution:
Error 0085
Exception occurs when script evaluation fails with an error.
This error in Azure Machine Learning occurs when you are running a custom script that contains syntax errors.
Resolution: Review your code in an external editor and check for errors.
The following error occurred during script evaluation, please view the output log for more information:
---------- Start of error message from {script_language} interpreter ----------
{message}
---------- End of error message from {script_language} interpreter ----------
Error 0090
Exception occurs when Hive table creation fails.
This error in Azure Machine Learning occurs when you are using Export Data or another option to save data to
an HDInsight cluster and the specified Hive table cannot be created.
Resolution: Check the Azure storage account name associated with the cluster and verify that you are using the
same account in the module properties.
The Hive table could not be created. For a HDInsight cluster, please ensure the Azure storage account name associated with
cluster is the same as what is passed in through the module parameter.
The Hive table "{table_name}" could not be created. For a HDInsight cluster, please ensure the Azure storage account name
associated with cluster is the same as what is passed in through the module parameter.
The Hive table "{table_name}" could not be created. For a HDInsight cluster, ensure the Azure storage account name
associated with cluster is "{cluster_name}".
Error 0102
Thrown when a ZIP file cannot be extracted.
This error in Azure Machine Learning occurs when you are importing a zipped package with the .zip extension,
but the package is either not a zip file, or the file does not use a supported zip format.
Resolution: Make sure the selected file is a valid .zip file, and that it was compressed by using one of the
supported compression algorithms.
If you get this error when importing datasets in compressed format, verify that all contained files use one of the
supported file formats, and are in Unicode format.
Try adding the desired files to a new compressed (zipped) folder, and then try to add the custom module again.
Error 0105
This error is displayed when a module definition file contains an unsupported parameter type
This error in Azure Machine Learning is produced when you create a custom module xml definition and the type of a parameter or argument in the definition does not match a supported type.
Resolution: Make sure that the type property of any Arg element in the custom module xml definition file is a
supported type.
Error 0107
Thrown when a module definition file defines an unsupported output type
This error in Azure Machine Learning is produced when the type of an output port in a custom module xml
definition does not match a supported type.
Resolution: Make sure that the type property of an Output element in the custom module xml definition file is
a supported type.
Error 0125
Thrown when schema for multiple datasets does not match.
Resolution:
Error 0127
Image pixel size exceeds allowed limit
This error occurs if you are reading images from an image dataset for classification and the images are larger
than the model can handle.
Image pixel size in the file '{file_path}' exceeds allowed limit: '{size_limit}'.
Error 0128
Number of conditional probabilities for categorical columns exceeds limit.
Resolution:
Number of conditional probabilities for categorical columns exceeds limit. Columns '{column_name_or_index_1}' and
'{column_name_or_index_2}' are the problematic pair.
Error 0129
Number of columns in the dataset exceeds allowed limit.
Resolution:
Number of columns in the dataset in '{dataset_name}' exceeds allowed '{limit_columns_count}' limit of '{component_name}'.
Error 0134
Exception occurs when label column is missing or has insufficient number of labeled rows.
This error occurs when the module requires a label column, but you did not include one in the column selection,
or the label column is missing too many values.
This error can also occur when a previous operation changes the dataset such that insufficient rows are available
to a downstream operation. For example, suppose you use an expression in the Partition and Sample module
to divide a dataset by values. If no matches are found for your expression, one of the datasets resulting from the
partition would be empty.
Resolution:
If you include a label column in the column selection but it isn't recognized, use the Edit Metadata module to
mark it as a label column.
Then, you can use the Clean Missing Data module to remove rows with missing values in the label column.
Check your input datasets to make sure that they contain valid data, and enough rows to satisfy the
requirements of the operation. Many algorithms will generate an error message if they require some minimum
number of rows of data, but the data contains only a few rows, or only a header.
Exception occurs when label column is missing or has insufficient number of labeled rows.
Exception occurs when label column is missing or has less than {required_rows_count} labeled rows.
Exception occurs when label column in dataset {dataset_name} is missing or has less than {required_rows_count} labeled rows.
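If you prefer to clean the data before importing it, the equivalent of the Clean Missing Data step can be sketched in pandas as follows. The file name and label column name are placeholders.

import pandas as pd

df = pd.read_csv("training_data.csv")       # placeholder file name
label_column = "label"                      # placeholder label column name

labeled = df.dropna(subset=[label_column])
print(f"Kept {len(labeled)} of {len(df)} rows with a non-missing label.")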
Error 0138
Memory has been exhausted, unable to complete running of module. Downsampling the dataset may help to
alleviate the problem.
This error occurs when the module that is running requires more memory than is available in the Azure
container. This can happen if you are working with a large dataset and the current operation cannot fit into
memory.
Resolution: If you are trying to read a large dataset and the operation cannot be completed, downsampling the
dataset might help.
Memory has been exhausted, unable to complete running of module. Details: {details}
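A simple way to downsample before importing is a random sample, for example as in the following sketch. The sampling fraction and file names are illustrative.

import pandas as pd

df = pd.read_csv("large_dataset.csv")       # placeholder file name

# Keep a 10% random sample; adjust the fraction until the module fits into memory.
sample = df.sample(frac=0.10, random_state=42)
sample.to_csv("large_dataset_sampled.csv", index=False)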
Error 0141
Exception occurs if the number of the selected numerical columns and unique values in the categorical and
string columns is too small.
This error in Azure Machine Learning occurs when there are not enough unique values in the selected column to
perform the operation.
Resolution: Some operations perform statistical operations on feature and categorical columns, and if there are
not enough values, the operation might fail or return an invalid result. Check your dataset to see how many
values there are in the feature and label columns, and determine whether the operation you are trying to
perform is statistically valid.
If the source dataset is valid, you might also check whether some upstream data manipulation or metadata
operation has changed the data and removed some values.
If upstream operations include splitting, sampling, or resampling, verify that the outputs contain the expected
number of rows and values.
The number of the selected numerical columns and unique values in the categorical and string columns is too small.
The total number of the selected numerical columns and unique values in the categorical and string columns (currently
{actual_num}) should be at least {lower_boundary}.
Error 0154
Exception occurs when user tries to join data on key columns with incompatible column type.
Key column element types are not compatible.(left: {keys_left}; right: {keys_right})
Error 0155
Exception occurs when column names of dataset are not string.
The dataframe column name must be string type. Column names are not string.
The dataframe column name must be string type. Column names {column_names} are not string.
Error 0156
Exception occurs when failed to read data from Azure SQL Database.
Failed to read data from Azure SQL Database: {detailed_message} DB: {database_server_name}:{database_name} Query:
{sql_statement}
Error 0157
Datastore not found.
Datastore information is invalid. Failed to get AzureML datastore '{datastore_name}' in workspace '{workspace_name}'.
Error 0158
Thrown when a transformation directory is invalid.
TransformationDirectory "{arg_name}" is invalid. Reason: {reason}. Rerun training experiment, which generates the Transform
file. If training experiment was deleted, please recreate and save the Transform file.
Error 0159
Exception occurs if module model directory is invalid.
Error 1000
Internal library exception.
This error is provided to capture otherwise unhandled internal engine errors. Therefore, the cause for this error
might be different depending on the module that generated the error.
To get more help, we recommend that you post the detailed message that accompanies the error to the Azure
Machine Learning forum, together with a description of the scenario, including the data used as inputs. This
feedback will help us to prioritize errors and identify the most important issues for further work.
Library exception.
Distributed training
Currently, the designer supports distributed training for the Train PyTorch Model module.
If a module that has distributed training enabled fails without any 70_driver logs, you can check 70_mpi_log for error details.
Common causes include a Node count in the run settings that is larger than the available node count of the compute cluster, or a Process count per node that is larger than the Processing Unit of the compute.
Otherwise, you can check 70_driver_log for each process; 70_driver_log_0 is for the master process.
Graph search query syntax
4/9/2021 • 2 minutes to read • Edit Online
In this article, you learn about the graph search functionality in Azure Machine Learning.
Graph search lets you quickly navigate to a node when you are debugging or building a pipeline. You can type a keyword or query in the input box on the toolbar, or under the Search tab in the left panel, to trigger a search. All matched results are highlighted in yellow on the canvas, and if you select a result in the left panel, the corresponding node on the canvas is highlighted in red.
Graph search supports full-text keyword search on node names and comments. You can also filter on node properties such as runStatus, duration, and computeTarget. The keyword search is based on Lucene query syntax. A complete search query looks like this:
[lucene query] | [filter query]
You can use either a Lucene query or a filter query. To use both, use the | separator. The filter query syntax is stricter than the Lucene query syntax, so if the input can be parsed as both, the filter query is applied.
For example, data OR model | compute in {cpucluster} searches for nodes whose name or comment contains data or model, and whose compute is cpucluster.
Lucene query
Graph search uses Lucene simple query as full-text search syntax on node "name" and "comment". The
following Lucene operators are supported:
AND/OR
Wildcard matching with ? and * operators.
Examples
Simple search: JSON Data
NOTE
You cannot start a Lucene query with a "*" character.
Filter query
Filter queries use the following pattern:
[key1] [operator1] [value1]; [key2] [operator2] [value2];
Technical notes
The relationship between multiple filters is AND.
If >=, >, <, or <= is chosen, values are automatically converted to the number type. Otherwise, they are compared as strings.
String comparisons are case-insensitive.
The In operator expects a collection as its value; the collection syntax is {name1, name2, name3}.
Spaces between keywords are ignored.
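For example, a combined query that follows these rules might look like the one below. The property values here are only illustrative; use the run status and duration values that apply to your pipeline.

data OR model | runStatus in {Failed, Completed}; duration >= 60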