Oracle® Machine Learning for Python

User's Guide

Release 1.0
E97014-35
July 2024
Oracle Machine Learning for Python User's Guide, Release 1.0

E97014-35

Copyright © 2019, 2024, Oracle and/or its affiliates.

Primary Author: Dhanish Kumar

Contributors: Andi Wang, Boriana Milenova, David McDermid, Feng Li, Mandeep Kaur, Mark Hornick, Qin Wang, Sherry Lamonica, Venkatanathan Varadarajan, Yu Xiang

This software and related documentation are provided under a license agreement containing restrictions on use and
disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or
allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit,
perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation
of this software, unless required by law for interoperability, is prohibited.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find
any errors, please report them to us in writing.

If this is software, software documentation, data (as defined in the Federal Acquisition Regulation), or related
documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then
the following notice is applicable:

U.S. GOVERNMENT END USERS: Oracle programs (including any operating system, integrated software, any
programs embedded, installed, or activated on delivered hardware, and modifications of such programs) and Oracle
computer documentation or other Oracle data delivered to or accessed by U.S. Government end users are "commercial
computer software," "commercial computer software documentation," or "limited rights data" pursuant to the applicable
Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, reproduction,
duplication, release, display, disclosure, modification, preparation of derivative works, and/or adaptation of i) Oracle
programs (including any operating system, integrated software, any programs embedded, installed, or activated on
delivered hardware, and modifications of such programs), ii) Oracle computer documentation and/or iii) other Oracle
data, is subject to the rights and limitations specified in the license contained in the applicable contract. The terms
governing the U.S. Government's use of Oracle cloud services are defined by the applicable contract for such services.
No other rights are granted to the U.S. Government.

This software or hardware is developed for general use in a variety of information management applications. It is not
developed or intended for use in any inherently dangerous applications, including applications that may create a risk of
personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all
appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its
affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.

Oracle®, Java, MySQL, and NetSuite are registered trademarks of Oracle and/or its affiliates. Other names may be
trademarks of their respective owners.

Intel and Intel Inside are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used
under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Epyc, and the AMD logo
are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open
Group.

This software or hardware and documentation may provide access to or information about content, products, and
services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all
warranties of any kind with respect to third-party content, products, and services unless otherwise set forth in an
applicable agreement between you and Oracle. Oracle Corporation and its affiliates will not be responsible for any loss,
costs, or damages incurred due to your access to or use of third-party content, products, or services, except as set forth
in an applicable agreement between you and Oracle.
Contents
Preface
Audience viii
Documentation Accessibility viii
Related Resources viii
Conventions ix

1 About Oracle Machine Learning for Python


1.1 What Is Oracle Machine Learning for Python? 1-1
1.2 Advantages of Oracle Machine Learning for Python 1-2
1.3 Transparently Convert Python to SQL 1-4
1.4 About the Python Components and Libraries in OML4Py 1-5

2 Install OML4Py Client for Linux for Use With Autonomous Database on
Serverless Exadata Infrastructure

3 Install OML4Py for On-Premises Databases


3.1 OML4Py On Premises System Requirements 3-1
3.2 Build and Install Python for Linux for On-Premises Databases 3-1
3.3 Install the Required Supporting Packages for Linux for On-Premises Databases 3-4
3.4 Install OML4Py Server for On-Premises Oracle Database 3-6
3.4.1 Install OML4Py Server for Linux for On-Premises Oracle Database 19c 3-6
3.4.2 Install OML4Py Server for Linux for On-Premises Oracle Database 21c 3-11
3.4.3 Verify OML4Py Server Installation for On-Premises Database 3-13
3.4.4 Grant Users the Required Privileges for On-Premises Database 3-13
3.4.5 Create New Users for On-Premises Oracle Database 3-14
3.4.6 Uninstall the OML4Py Server from an On-Premises Database 19c 3-16
3.5 Install OML4Py Client for On-Premises Databases 3-16
3.5.1 Install Oracle Instant Client and the OML4Py Client for Linux 3-17
3.5.1.1 Install Oracle Instant Client for Linux for On-Premises Databases 3-17
3.5.1.2 Install OML4Py Client for Linux for On-Premises Databases 3-18
3.5.2 Verify OML4Py Client Installation for On-Premises Databases 3-22

3.5.3 Uninstall the OML4Py Client for On-Premises Databases 3-22

4 Install OML4Py on Exadata


4.1 About Oracle Machine Learning for Python on Exadata 4-1
4.2 Configure DCLI to install Python across Exadata compute nodes. 4-2
4.2.1 Install Python across Exadata compute nodes using DCLI 4-4
4.2.2 Install OML4Py across Exadata compute nodes using DCLI 4-5

5 Install Third-Party Packages


5.1 Conda Commands 5-1
5.2 Administrative Tasks for Creating and Saving a Conda Environment 5-9
5.3 OML User Tasks for Downloading an Available Conda Environment 5-13
5.4 Using Conda Environments with Embedded Python Execution 5-19

6 Get Started with Oracle Machine Learning for Python


6.1 Use OML4Py with Oracle Autonomous Database 6-1
6.2 Use OML4Py with an On-Premises Oracle Database 6-1
6.2.1 About Connecting to an On-Premises Oracle Database 6-2
6.2.2 About Oracle Wallets 6-3
6.2.3 Connect to an Oracle Database 6-4
6.3 Move Data Between the Database and a Python Session 6-8
6.3.1 About Moving Data Between the Database and a Python Session 6-9
6.3.2 Push Local Python Data to the Database 6-9
6.3.3 Pull Data from the Database to a Local Python Session 6-11
6.3.4 Create a Python Proxy Object for a Database Object 6-13
6.3.5 Create a Persistent Database Table from a Python Data Set 6-16
6.4 Save Python Objects in the Database 6-20
6.4.1 About OML4Py Datastores 6-20
6.4.2 Save Objects to a Datastore 6-21
6.4.3 Load Saved Objects From a Datastore 6-24
6.4.4 Get Information About Datastores 6-25
6.4.5 Get Information About Datastore Objects 6-27
6.4.6 Delete Datastore Objects 6-28
6.4.7 Manage Access to Stored Objects 6-30

7 Prepare and Explore Data


7.1 Prepare Data 7-1
7.1.1 About Preparing Data in the Database 7-1

7.1.2 Select Data 7-3
7.1.3 Combine Data 7-8
7.1.4 Clean Data 7-13
7.1.5 Split Data 7-15
7.2 Explore Data 7-17
7.2.1 About the Exploratory Data Analysis Methods 7-17
7.2.2 Correlate Data 7-19
7.2.3 Cross-Tabulate Data 7-20
7.2.4 Mutate Data 7-23
7.2.5 Sort Data 7-26
7.2.6 Summarize Data 7-28
7.3 Render Graphics 7-31

8 OML4Py Classes That Provide Access to In-Database Machine Learning


Algorithms
8.1 About Machine Learning Classes and Algorithms 8-2
8.2 About Model Settings 8-4
8.3 Shared Settings 8-4
8.4 Export Oracle Machine Learning for Python Models 8-7
8.5 Automatic Data Preparation 8-11
8.6 Model Explainability 8-12
8.7 Attribute Importance 8-18
8.8 Association Rules 8-21
8.9 Decision Tree 8-27
8.10 Expectation Maximization 8-34
8.11 Explicit Semantic Analysis 8-48
8.12 Generalized Linear Model 8-53
8.13 k-Means 8-63
8.14 Naive Bayes 8-69
8.15 Neural Network 8-77
8.16 Random Forest 8-86
8.17 Singular Value Decomposition 8-94
8.18 Support Vector Machine 8-100

9 Automated Machine Learning


9.1 About Automated Machine Learning 9-1
9.2 Algorithm Selection 9-6
9.3 Feature Selection 9-8
9.4 Model Tuning 9-11

9.5 Model Selection 9-15

10 Embedded Python Execution


10.1 About Embedded Python Execution 10-1
10.1.1 Comparison of the Embedded Python Execution APIs 10-2
10.2 Parallelism with OML4Py Embedded Python Execution 10-5
10.3 Embedded Python Execution Views 10-6
10.3.1 ALL_PYQ_DATASTORE_CONTENTS View 10-7
10.3.2 ALL_PYQ_DATASTORES View 10-8
10.3.3 ALL_PYQ_SCRIPTS View 10-9
10.3.4 USER_PYQ_DATASTORES View 10-10
10.3.5 USER_PYQ_SCRIPTS View 10-11
10.4 Python API for Embedded Python Execution 10-12
10.4.1 About Embedded Python Execution 10-12
10.4.2 Run a User-Defined Python Function 10-14
10.4.3 Run a User-Defined Python Function on the Specified Data 10-15
10.4.4 Run a Python Function on Data Grouped By Column Values 10-18
10.4.5 Run a User-Defined Python Function on Sets of Rows 10-22
10.4.6 Run a User-Defined Python Function Multiple Times 10-26
10.4.7 Save and Manage User-Defined Python Functions in the Script Repository 10-27
10.4.7.1 About the Script Repository 10-28
10.4.7.2 Create and Store a User-Defined Python Function 10-28
10.4.7.3 List Available User-Defined Python Functions 10-32
10.4.7.4 Load a User-Defined Python Function 10-33
10.4.7.5 Drop a User-Defined Python Function from the Repository 10-34
10.5 SQL API for Embedded Python Execution with On-premises Database 10-36
10.5.1 About the SQL API for Embedded Python Execution with On-Premises
Database 10-37
10.5.2 pyqEval Function (On-Premises Database) 10-37
10.5.3 pyqTableEval Function (On-Premises Database) 10-40
10.5.4 pyqRowEval Function (On-Premises Database) 10-43
10.5.5 pyqGroupEval Function (On-Premises Database) 10-47
10.5.6 pyqGrant Function (On-Premises Database) 10-50
10.5.7 pyqRevoke Function (On-Premises Database) 10-51
10.5.8 pyqScriptCreate Procedure (On-Premises Database) 10-52
10.5.9 pyqScriptDrop Procedure (On-Premises Database) 10-54
10.6 SQL API for Embedded Python Execution with Autonomous Database 10-55
10.6.1 Access and Authorization Procedures and Functions 10-55
10.6.1.1 pyqAppendHostACE Procedure 10-58
10.6.1.2 pyqGetHostACE Function 10-58
10.6.1.3 pyqRemoveHostACE Procedure 10-59

10.6.1.4 pyqSetAuthToken Procedure 10-59
10.6.1.5 pyqIsTokenSet Function 10-59
10.6.2 Embedded Python Execution Functions (Autonomous Database) 10-60
10.6.2.1 pyqListEnvs Function (Autonomous Database) 10-61
10.6.2.2 pyqEval Function (Autonomous Database) 10-61
10.6.2.3 pyqTableEval Function (Autonomous Database) 10-64
10.6.2.4 pyqRowEval Function (Autonomous Database) 10-67
10.6.2.5 pyqGroupEval Function (Autonomous Database) 10-71
10.6.2.6 pyqIndexEval Function (Autonomous Database) 10-75
10.6.2.7 pyqGrant Function (Autonomous Database) 10-95
10.6.2.8 pyqRevoke Function (Autonomous Database) 10-96
10.6.2.9 pyqScriptCreate Procedure (Autonomous Database) 10-97
10.6.2.10 pyqScriptDrop Procedure (Autonomous Database) 10-100
10.6.3 Asynchronous Jobs (Autonomous Database) 10-100
10.6.3.1 oml_async_flag Argument 10-101
10.6.3.2 pyqJobStatus Function 10-102
10.6.3.3 pyqJobResult Function 10-103
10.6.3.4 Asynchronous Job Example 10-104
10.6.4 Special Control Arguments (Autonomous Database) 10-109
10.6.5 Output Formats (Autonomous Database) 10-110

11 Administrative Tasks for Oracle Machine Learning for Python

Index


Preface
This publication describes Oracle Machine Learning for Python (OML4Py) and how to use it.
• Audience
• Documentation Accessibility
• Related Resources
• Conventions

Audience
This document is intended for those who want to run Python commands for statistical, machine
learning, and graphical analysis on data stored in or accessible through Oracle Autonomous
Database or Oracle Database on premises using a Python API. Use of Oracle Machine
Learning for Python requires knowledge of Python and of Oracle Autonomous Database or
Oracle Database on premises.

Documentation Accessibility
For information about Oracle's commitment to accessibility, visit the Oracle Accessibility
Program website at http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.

Access to Oracle Support


Oracle customers that have purchased support have access to electronic support through My
Oracle Support. For information, visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=info
or visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=trs if you are hearing impaired.

Related Resources
Related documentation is in the following publications:
• Oracle Machine Learning for Python API Reference
• Oracle Machine Learning for Python Known Issues
• Oracle Machine Learning for Python Licensing Information User Manual
• REST API for Embedded Python Execution
• Get Started with Notebooks for Data Analysis and Data Visualization in Using Oracle
Machine Learning Notebooks
• Oracle Machine Learning AutoML User Interface
• REST API for Oracle Machine Learning Services


For more information, see these Oracle resources:


• Oracle Machine Learning Technologies
• Oracle Autonomous Database

Conventions
The following text conventions are used in this document:

Convention Meaning
boldface Boldface type indicates graphical user interface elements associated with an
action, or terms defined in text or the glossary.
italic Italic type indicates book titles, emphasis, or placeholder variables for which
you supply particular values.
monospace Monospace type indicates commands within a paragraph, URLs, code in
examples, text that appears on the screen, or text that you enter.

1
About Oracle Machine Learning for Python
The following topics describe Oracle Machine Learning for Python (OML4Py) and its
advantages for the Python user.

Topics:
• What Is Oracle Machine Learning for Python?
Oracle Machine Learning for Python (OML4Py) enables you to run Python commands for
data transformations and for statistical, machine learning, and graphical analysis on data
stored in or accessible through an Oracle database using a Python API. OML4Py supports
running user-defined Python functions in Python engines spawned and controlled by the
database, with optional built-in data parallelism and task parallelism. This embedded
execution functionality enables invoking user-defined functions from SQL and, on
Autonomous Database, through REST. OML4Py also supports Automated Machine Learning
(AutoML) for algorithm and feature selection and for model tuning and selection. You can
augment the included Python functionality with third-party packages from the Python
ecosystem.
• Advantages of Oracle Machine Learning for Python
Using OML4Py to prepare and analyze data in or accessible to an Oracle database has
many advantages for a Python user.
• Transparently Convert Python to SQL
With the transparency layer classes, you can convert select Python objects to Oracle
database objects and also invoke a range of familiar Python functions that are overloaded
to invoke the corresponding SQL on tables in the database.
• About the Python Components and Libraries in OML4Py
OML4Py requires an installation of Python, a number of Python libraries, as well as the
OML4Py components.

1.1 What Is Oracle Machine Learning for Python?


Oracle Machine Learning for Python (OML4Py) enables you to run Python commands for data
transformations and for statistical, machine learning, and graphical analysis on data stored in
or accessible through an Oracle database using a Python API. OML4Py supports running
user-defined Python functions in Python engines spawned and controlled by the database,
with optional built-in data parallelism and task parallelism. This embedded execution
functionality enables invoking user-defined functions from SQL and, on Autonomous Database,
through REST. OML4Py also supports Automated Machine Learning (AutoML) for algorithm
and feature selection and for model tuning and selection. You can augment the included
Python functionality with third-party packages from the Python ecosystem.
OML4Py is a Python module that enables Python users to manipulate data in database tables
and views using Python syntax. OML4Py functions and methods transparently translate a
select set of Python functions into SQL for in-database execution.
OML4Py is available in the following Oracle database environments:
• OML4Py is available in the Python interpreter in Oracle Machine Learning Notebooks in
your Oracle Autonomous Database. For more information, see Get Started with Notebooks
for Data Analysis and Data Visualization in Using Oracle Machine Learning Notebooks.


• An OML4Py client connection to OML4Py in an on-premises Oracle Database instance.


For this environment, you must install Python, the required Python libraries, and the
OML4Py server components in the database, and you must install the OML4Py client. See
Install OML4Py for On-Premises Databases.
Designed for problems involving both large and small volumes of data, OML4Py integrates
Python with the database. With OML4Py, you can do the following:
• Run overloaded Python functions and use native Python syntax to manipulate in-database
data, without having to learn SQL.
• Use Automated Machine Learning (AutoML) to enhance user productivity and machine
learning results through automated algorithm and feature selection, as well as model
tuning and selection.
• Use Embedded Python Execution to run user-defined Python functions in Python engines
spawned and managed by the database environment. The user-defined functions and the
data they operate on are automatically loaded into the engines as required, and data-parallel
and task-parallel execution can be enabled. Develop, refine, and deploy user-defined Python
functions and machine learning models that leverage the parallelism and scalability of the
database to automate data preparation and machine learning.
• Use a natural Python interface to build in-database machine learning models.
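For example, the following is a minimal sketch of that workflow, assuming an active OML4Py
connection; the IRIS table name, the split ratio, and the column names used here are
illustrative rather than prescriptive.

import oml
import pandas as pd
from sklearn import datasets

# Build an IRIS table in the database from a local pandas DataFrame
# (table and column names are illustrative).
iris = datasets.load_iris()
df = pd.DataFrame(iris.data,
                  columns=['SEPAL_LENGTH', 'SEPAL_WIDTH', 'PETAL_LENGTH', 'PETAL_WIDTH'])
df['SPECIES'] = [str(iris.target_names[t]) for t in iris.target]
IRIS = oml.create(df, table='IRIS')

# Split the proxy object and fit an in-database Decision Tree model.
train, test = IRIS.split(ratio=(0.8, 0.2))
dt_mod = oml.dt()
dt_mod.fit(train.drop('SPECIES'), train['SPECIES'])

# Score the held-out rows in the database and pull a few results locally.
pred = dt_mod.predict(test.drop('SPECIES'),
                      supplemental_cols=test[:, ['SPECIES']])
print(pred.head())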

1.2 Advantages of Oracle Machine Learning for Python


Using OML4Py to prepare and analyze data in or accessible to an Oracle database has many
advantages for a Python user.
With OML4Py, you can do the following:
• Operate on database data without using SQL
OML4Py transparently translates many standard Python functions into SQL. With
OML4Py, you can create Python proxy objects that access, analyze, and manipulate data
that resides in the database. OML4Py can automatically optimize the SQL by taking
advantage of column indexes, query optimization, table partitioning, and database
parallelism.
OML4Py overloaded functions are available for many commonly used Python functions,
including those on Pandas data frames for in-database execution.
See Also: Transparently Convert Python to SQL
• Automate common machine learning tasks
By using Oracle’s advanced Automated Machine Learning (AutoML) technology, both data
scientists and beginner machine learning users can automate common machine learning
modeling tasks such as algorithm selection and feature selection, and model tuning and
selection, all of which leverage the parallel processing and scalability of the database.
See Also: About Automated Machine Learning
• Minimize data movement
By keeping data in the database whenever possible, you eliminate the time involved in
transferring the data to your client Python engine and the need to store the data locally.
You also eliminate the need to manage the locally stored data, which includes tasks such
as distributing the data files to the appropriate locations, synchronizing the data with
changes that are made in the production database, and so on.
See Also: About Moving Data Between the Database and a Python Session


• Keep data secure


By keeping the data in the database, you have the security, scalability, reliability, and
backup features of the database for managing the data.
• Use the power of the database
By operating directly on data in the database, you can use the memory and processing
power of the database and avoid the memory constraints of your client Python engine.
• Use current data
As data is refreshed in the database, you have immediate access to current data.
• Save Python objects to a datastore in the database
You can save Python objects to an OML4Py datastore for future use and for use by others.
See Also: About OML4Py Datastores
• Build and store native Python models in the database
Using Embedded Python Execution, you can build native Python models and store and
manage them in an OML4Py datastore.
You can also build in-database models, with, for example, an oml class such as the
Decision Tree class oml.dt. These in-database models have proxy objects that reference
the actual models. Keeping with normal Python behavior, when the Python engine
terminates, all in-memory objects, including models, are lost. To prevent an in-database
model created using OML4Py from being deleted when the database connection is
terminated, you must store its proxy object in a datastore.
See Also: About Machine Learning Classes and Algorithms
• Score data
For most of the OML4Py machine learning classes, you can use the predict and
predict_proba methods of the model object to score new data.
For these OML4Py in-database models, you can also use the SQL PREDICTION function on
the model proxy objects, which scores directly in the database. You can use in-database
models directly from SQL if you prepare the data properly. For open source models, you
can use Embedded Python Execution and enable data-parallel execution for performance
and scalability.
• Run user-defined Python functions in embedded Python engines
Using OML4Py Embedded Python Execution, you can store user-defined Python functions
in the OML4Py script repository, and run those functions in Python engines spawned by
the database environment. When a user-defined Python function runs, the database starts,
controls, and manages one or more Python engines that can run in parallel. With the
Embedded Python Execution functionality, you can do the following:
– Use a select set of Python packages in user-defined functions that run in embedded
Python engines
– Use other Python packages and third-party packages in user-defined Python functions
that run in embedded Python engines
– Operationalize user-defined Python functions for use in production applications by
invoking them from SQL and, on Autonomous Database, REST, eliminating the need to port
Python code and models into SQL and to reinvent code to integrate Python results into
existing applications
– Seamlessly leverage your Oracle database as a high-performance computing
environment for user-defined Python functions, providing data parallelism and
resource management


– Perform parallel simulations, for example, Monte Carlo analysis, using the
oml.index_apply function
– Generate JSON, PNG, and XML representations of both structured and image data,
which can be used by Python clients and SQL-based applications. PNG images and
structured data can also be used by Python clients and applications that use REST APIs.
See Also: About Embedded Python Execution
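As a simple illustration of embedded execution, the following is a minimal sketch of a Monte
Carlo-style run with oml.index_apply, assuming an active OML4Py connection; the function
body and the number of invocations are illustrative.

import oml

# A user-defined function that each database-spawned engine runs independently.
def simulate(index):
    import random
    random.seed(index)                      # vary the random seed per invocation
    trials = [random.gauss(0, 1) for _ in range(10000)]
    return sum(trials) / len(trials)

# Run the function 10 times in embedded Python engines and print the
# results returned to the client session.
res = oml.index_apply(times=10, func=simulate)
print(res)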

1.3 Transparently Convert Python to SQL


With the transparency layer classes, you can convert select Python objects to Oracle database
objects and also invoke a range of familiar Python functions that are overloaded to invoke the
corresponding SQL on tables in the database.
The OML4Py transparency layer does the following:
• Contains functions that convert Python pandas.DataFrame objects to database tables
• Overloads Python functions, translating their functionality into SQL
• Leverages proxy objects for database data
• Uses familiar Python syntax to manipulate database data
The following table lists the transparency layer functions.

Table 1-1 Transparency Layer Functions

Function Description
oml.create Creates a table in the database schema from a Python data set.
oml_object.pull Creates a local Python object that contains a copy of data referenced by the
oml object.
oml.push Pushes data from a Python session into an object in a database schema.
oml.sync Creates a DataFrame proxy object in Python that represents a database
table or view.
oml.dir Returns the names of oml objects in the Python session workspace.
oml.drop Drops a persistent database table or view.
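
The following is a brief sketch of these functions in use, assuming an active OML4Py
connection; the PETS table name and the sample data are illustrative.

import oml
import pandas as pd

# Create a small pandas DataFrame locally (illustrative data).
df = pd.DataFrame({'NAME': ['Rex', 'Milo', 'Luna'], 'AGE': [3, 5, 2]})

pets_tmp = oml.push(df)               # temporary database table plus proxy object
PETS = oml.create(df, table='PETS')   # persistent database table plus proxy object
PETS_syn = oml.sync(table='PETS')     # proxy for an existing table or view

print(oml.dir())                      # names of oml objects in the workspace
local_df = PETS_syn.pull()            # copy the data into a local pandas DataFrame
oml.drop(table='PETS')                # drop the persistent table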

Transparency layer proxy classes map SQL data types or objects to corresponding Python
types. The classes provide Python functions and operators that are the same as those on the
mapped Python types. The following table lists the transparency layer data type classes.

Table 1-2 Transparency Layer Data Type Classes

Class Description
oml.Boolean A boolean series data class that represents a single column of 0, 1, and NULL
values in database data.
oml.Bytes A binary series data class that represents a single column of RAW or BLOB
database data types.
oml.Float A numeric series data class that represents a single column of NUMBER,
BINARY_DOUBLE, or BINARY_FLOAT database data types.


oml.String A character series data class that represents a single column of VARCHAR2, CHAR,
or CLOB database data types.
oml.DataFrame A tabular DataFrame class that represents multiple columns of oml.Boolean,
oml.Bytes, oml.Float, and oml.String data.

The following table lists the mappings of OML4Py data types for both the reading and writing of
data between Python and the database.

Table 1-3 Python and SQL Data Type Equivalencies

Database Read                           Python Data Types    Database Write
N/A                                     bool                 If oranumber == True, then NUMBER (the default), else BINARY_DOUBLE
BLOB, RAW                               bytes                BLOB, RAW
BINARY_DOUBLE, BINARY_FLOAT, NUMBER     float                If oranumber == True, then NUMBER (the default), else BINARY_DOUBLE
CHAR, CLOB, VARCHAR2                    str                  CHAR, CLOB, VARCHAR2
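
The following is a small sketch of how the oranumber argument influences the column types
written for float data, assuming an active OML4Py connection and that oml.create accepts the
oranumber argument as described above; the table names NUM_TAB and BD_TAB are
illustrative.

import oml
import pandas as pd

# Two copies of the same float data; oranumber controls whether float columns
# are written as NUMBER (the default) or as BINARY_DOUBLE.
df = pd.DataFrame({'ID': [1.0, 2.0, 3.0], 'LABEL': ['a', 'b', 'c']})

NUM_TAB = oml.create(df, table='NUM_TAB')                 # ID stored as NUMBER
BD_TAB = oml.create(df, table='BD_TAB', oranumber=False)  # ID stored as BINARY_DOUBLE

print(type(NUM_TAB['ID']))   # both read back as oml.Float series proxies
print(type(BD_TAB['ID']))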

1.4 About the Python Components and Libraries in OML4Py


OML4Py requires an installation of Python, a number of Python libraries, as well as the
OML4Py components.
• In Oracle Autonomous Database, OML4Py is already installed. The OML4Py installation
includes Python, additional required Python libraries, and the OML4Py server components.
A Python interpreter is included with Oracle Machine Learning Notebooks in Autonomous
Database.
• You can install OML4Py in an on-premises Oracle Database. In this case, you must install
Python, the additional required Python libraries, the OML4Py server components, and an
OML4Py client. See Install OML4Py for On-Premises Databases.

Python Version in Current Release of OML4Py


The current release of OML4Py is based on Python 3.9.5. This version is included in the
current release of Oracle Autonomous Database. You must install it manually when installing
OML4Py on an on-premises Oracle Database.
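
Before installing the client on premises, a quick way to confirm that your local Python matches
this version is shown below; the check is a convenience, not part of the installation procedure.

import sys

# Confirm that the local Python used with OML4Py matches the supported version.
print(sys.version)
assert sys.version_info[:3] == (3, 9, 5), 'OML4Py 1.0 expects Python 3.9.5'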

Required Python Libraries


The following Python libraries must be included.
• cx_Oracle 8.1.0


• cycler 0.10.0
• joblib 1.1.0
• kiwisolver 1.1.0
• matplotlib 3.3.3
• numpy 1.21.5
• pandas 1.3.4
• Pillow 8.2.0
• pyparsing 2.4.0
• python-dateutil 2.8.1
• pytz 2019.3
• scikit-learn 1.0.1
• scipy 1.7.3
• six 1.13.0
• threadpoolctl 2.1.0
All the above libraries are included with Python in the current release of Oracle Autonomous
Database.
For an installation of OML4Py in an on-premises Oracle Database, you must install Python and
additionally the libraries listed here. See Install OML4Py for On-Premises Databases.
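
The following is a small helper sketch that checks a client environment against the versions
listed above using importlib.metadata (available in Python 3.8 and later); the exact set of
packages to check is up to you.

from importlib import metadata

# Expected versions taken from the list above; adjust if your installation differs.
expected = {
    'cx_Oracle': '8.1.0', 'joblib': '1.1.0', 'matplotlib': '3.3.3',
    'numpy': '1.21.5', 'pandas': '1.3.4', 'scikit-learn': '1.0.1',
    'scipy': '1.7.3', 'six': '1.13.0', 'threadpoolctl': '2.1.0',
}
for pkg, want in expected.items():
    try:
        have = metadata.version(pkg)
        print(f"{pkg:15s} {have:10s} {'OK' if have == want else 'expected ' + want}")
    except metadata.PackageNotFoundError:
        print(f"{pkg:15s} MISSING")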

2
Install OML4Py Client for Linux for Use With
Autonomous Database on Serverless Exadata
Infrastructure
You can install and use the OML4Py client for Linux to work with OML4Py in an Oracle
Autonomous Database on Serverless Exadata infrastructure.
OML4Py on premises runs on 64-bit platforms only. For supported platforms see OML4Py On
Premises System Requirements.
The following instructions tell you how to download and install Python, configure your
environment, download and manage your client credentials, install Oracle Instant Client, and
install the OML4Py client:
1. Download the Python 3.9.5 source and untar it:

wget https://www.python.org/ftp/python/3.9.5/Python-3.9.5.tar.xz
tar xvf Python-3.9.5.tar.xz

2. OML4Py requires the presence of the perl-Env, libffi-devel, openssl, openssl-devel,
tk-devel, xz-devel, zlib-devel, bzip2-devel, readline-devel, libuuid-devel, and
ncurses-devel RPM libraries. Install these packages as the sudo or root user:

Note:
RPMs must be installed under sudo, or root.

sudo yum install perl-Env libffi-devel openssl openssl-devel tk-devel xz-devel zlib-devel bzip2-devel readline-devel libuuid-devel ncurses-devel

3. To build Python, enter the following commands, where PREFIX is the directory in which you
installed Python-3.9.5. Use make altinstall to avoid overriding the system default's
Python installation.

export PREFIX=`pwd`/Python-3.9.5
cd $PREFIX
./configure --prefix=$PREFIX --enable-shared
make clean; make
make altinstall


4. Set environment variable PYTHONHOME and add it to your PATH, and set environment variable
LD_LIBRARY_PATH:

export PYTHONHOME=$PREFIX
export PATH=$PYTHONHOME/bin:$PATH
export LD_LIBRARY_PATH=$PYTHONHOME/lib:$LD_LIBRARY_PATH

Create a symbolic link in your $PYTHONHOME/bin directory. You need to link it to your
python3.9 executable, which you can do with the following commands:

cd $PYTHONHOME/bin
ln -s python3.9 python3

You can now start Python with the python3 script:

python3

pip will return warnings during package installation if the latest version is not installed. You
can upgrade the version of pip to avoid these warnings:

python3 -m pip install --upgrade pip

5. Install the Oracle Instant Client for Autonomous Database, as follows:

Download the Oracle Instant Client for your system. Go to the Oracle Instant Client
Downloads page and select Instant Client for Linux x86-64. For more instructions, see
Install Oracle Instant Client for Linux for On-Premises Databases. For instructions on
installing the Oracle Instant Client for on-premises databases, see Install OML4Py Client
for On-Premises Databases.
If you have root access on the client system, install the Instant Client RPM as shown below.
Alternatively, you can download the zip file installer, unzip the file, and add the location of
the unzipped files to LD_LIBRARY_PATH, as shown after the RPM commands.

wget https://download.oracle.com/otn_software/linux/instantclient/1914000/oracle-instantclient19.14-basic-19.14.0.0.0-1.x86_64.rpm
rpm -ivh oracle-instantclient19.14-basic-19.14.0.0.0-1.x86_64.rpm
export LD_LIBRARY_PATH=/usr/lib/oracle/19.14/client64/lib:$LD_LIBRARY_PATH

If you do not have root access to install an RPM on the client system, download and unzip the zip file installer instead:

wget https://download.oracle.com/otn_software/linux/instantclient/1914000/instantclient-basic-linux.x64-19.14.0.0.0dbru.zip
unzip instantclient-basic-linux.x64-19.14.0.0.0dbru.zip
export LD_LIBRARY_PATH=/path/to/instantclient_19_14:$LD_LIBRARY_PATH

6. Download the client credentials (wallet) from your Autonomous database. Create a
directory for the Wallet contents. Unzip the wallet zip file to the newly created directory:


Note:
An mTLS connection using the client Wallet is required. TLS connections are not
currently supported.

mkdir -p mywalletdir
unzip Wallet.name.zip -d mywalletdir
cd mywalletdir/
ls

README       ewallet.p12   ojdbc.properties  tnsnames.ora
cwallet.sso  keystore.jks  sqlnet.ora        truststore.jks

7. Update sqlnet.ora with the wallet location. If you are working behind a proxy firewall, also
set the SQLNET.USE_HTTPS_PROXY parameter to on:

WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="mywalletdir")))
SSL_SERVER_DN_MATCH=yes
SQLNET.USE_HTTPS_PROXY=on

8. If you are behind a firewall, add the proxy address and port number to all service levels in
tnsnames.ora. You must also add three new entries for the AutoML connection pools, as
shown below.

Note:
If the proxy server contains a firewall to terminate connections within a set time
period, the database connection will also be terminated.

For example, myadb_medium_pool is another alias for the connection string with
SERVER=POOLED added to the corresponding one for myadb_medium.
myadb_low = (description= (retry_count=20)(retry_delay=3)
  (address=(https_proxy=your proxy address here)(https_proxy_port=80)(protocol=tcps)(port=1522)(host=adb.us-sanjose-1.oraclecloud.com))
  (connect_data=(service_name=qtraya2braestch_myadb_low.adb.oraclecloud.com))
  (security=(ssl_server_cert_dn="CN=adb.us-sanjose-1.oraclecloud.com,OU=Oracle ADB SANJOSE,O=Oracle Corporation,L=Redwood City,ST=California,C=US")))

myadb_medium = (description= (retry_count=20)(retry_delay=3)
  (address=(https_proxy=your proxy address here)(https_proxy_port=80)(protocol=tcps)(port=1522)(host=adb.us-sanjose-1.oraclecloud.com))
  (connect_data=(service_name=qtraya2braestch_myadb_medium.adb.oraclecloud.com))
  (security=(ssl_server_cert_dn="CN=adb.us-sanjose-1.oraclecloud.com,OU=Oracle ADB SANJOSE,O=Oracle Corporation,L=Redwood City,ST=California,C=US")))

myadb_high = (description= (retry_count=20)(retry_delay=3)
  (address=(https_proxy=your proxy address here)(https_proxy_port=80)(protocol=tcps)(port=1522)(host=adb.us-sanjose-1.oraclecloud.com))
  (connect_data=(service_name=qtraya2braestch_myadb_high.adb.oraclecloud.com))
  (security=(ssl_server_cert_dn="CN=adb.us-sanjose-1.oraclecloud.com,OU=Oracle ADB SANJOSE,O=Oracle Corporation,L=Redwood City,ST=California,C=US")))

myadb_low_pool = (description= (retry_count=20)(retry_delay=3)
  (address=(https_proxy=your proxy address here)(https_proxy_port=80)(protocol=tcps)(port=1522)(host=adb.us-sanjose-1.oraclecloud.com))
  (connect_data=(service_name=qtraya2braestch_myadb_low.adb.oraclecloud.com)(SERVER=POOLED))
  (security=(ssl_server_cert_dn="CN=adb.us-sanjose-1.oraclecloud.com,OU=Oracle ADB SANJOSE,O=Oracle Corporation,L=Redwood City,ST=California,C=US")))

myadb_medium_pool = (description= (retry_count=20)(retry_delay=3)
  (address=(https_proxy=your proxy address here)(https_proxy_port=80)(protocol=tcps)(port=1522)(host=adb.us-sanjose-1.oraclecloud.com))
  (connect_data=(service_name=qtraya2braestch_myadb_medium.adb.oraclecloud.com)(SERVER=POOLED))
  (security=(ssl_server_cert_dn="CN=adb.us-sanjose-1.oraclecloud.com,OU=Oracle ADB SANJOSE,O=Oracle Corporation,L=Redwood City,ST=California,C=US")))

myadb_high_pool = (description= (retry_count=20)(retry_delay=3)
  (address=(https_proxy=your proxy address here)(https_proxy_port=80)(protocol=tcps)(port=1522)(host=adb.us-sanjose-1.oraclecloud.com))
  (connect_data=(service_name=qtraya2braestch_myadb_high.adb.oraclecloud.com)(SERVER=POOLED))
  (security=(ssl_server_cert_dn="CN=adb.us-sanjose-1.oraclecloud.com,OU=Oracle ADB SANJOSE,O=Oracle Corporation,L=Redwood City,ST=California,C=US")))
9. Set TNS_ADMIN environment variable to the wallet directory:

export TNS_ADMIN=mywalletdir

10. Install OML4Py library dependencies. The versions listed here are the versions Oracle has
tested and supports:


• Install the supporting packages:

pip3.9 install pandas==1.3.4
pip3.9 install scipy==1.7.3
pip3.9 install matplotlib==3.3.3
pip3.9 install cx_Oracle==8.1.0
pip3.9 install threadpoolctl==2.1.0
pip3.9 install joblib==0.14.0
pip3.9 install scikit-learn==1.0.1 --no-deps
pip3.9 uninstall numpy
pip3.9 install numpy==1.21.5

• Install the OML4Py client:

To download the OML4Py client installation zip file, go to the Oracle Machine Learning for
Python Downloads page on the Oracle Technology Network. For more instructions, see
Install OML4Py Client for Linux for On-Premises Databases.

unzip oml4py-client-linux-x86_64-1.0.zip
perl -Iclient client/client.pl

Oracle Machine Learning for Python 1.0 Client.

Copyright (c) 2018, 2022 Oracle and/or its affiliates. All rights
reserved.
Checking platform .................. Pass
Checking Python .................... Pass
Checking dependencies .............. Pass
Checking OML4P version ............. Pass
Current configuration
Python Version ................... 3.9.5
PYTHONHOME ....................... /opt/Python-3.9.5
Existing OML4P module version .... None

Operation ........................ Install/Upgrade

Proceed? [yes]
Processing ./client/oml-1.0-cp39-cp39-linux_x86_64.whl
Installing collected packages: oml
Successfully installed oml-1.0


Done

• Start Python and load the oml library:

python3
import oml

• Create a database connection. The OML4Py client connects using the wallet. Set the dsn
and automl arguments to tnsnames aliases from the wallet:

oml.connect(user="oml_user", password="oml_user_password", dsn="myadb_medium", automl="myadb_medium_pool")

You can also provide empty strings for the user and password parameters to connect
without exposing your Oracle Machine Learning user credentials in clear text:

oml.connect(user="", password="", dsn="myadb_medium", automl="myadb_medium_pool")
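
After oml.connect succeeds, you can run a few quick checks such as the following sketch,
which assumes the connection created above; the sample data is illustrative.

import oml
import pandas as pd

print(oml.isconnected())            # True if the connection is active

# Round-trip a tiny DataFrame through the database as a quick sanity check.
tmp = oml.push(pd.DataFrame({'X': [1, 2, 3]}))
print(tmp.pull())

oml.disconnect()                    # close the connection when finished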

3
Install OML4Py for On-Premises Databases
The following topics tell how to install and uninstall the server and client components required
for using OML4Py with an on-premises Oracle Database.

Topics:
• OML4Py On Premises System Requirements
OML4Py on premises runs on 64-bit platforms only.
• Build and Install Python for Linux for On-Premises Databases
Instructions for installing Python for Linux for an on-premises Oracle database.
• Install the Required Supporting Packages for Linux for On-Premises Databases
Both the OML4Py server and client installations for an on-premises Oracle database
require that you also install a set of supporting Python packages, as described below.
• Install OML4Py Server for On-Premises Oracle Database
The following instructions tell how to install and uninstall the OML4Py server components
for an on-premises Oracle Database.
• Install OML4Py Client for On-Premises Databases
Instructions for installing and uninstalling the on-premises OML4Py client.

3.1 OML4Py On Premises System Requirements


OML4Py on premises runs on 64-bit platforms only.
Both client and server on-premises components are supported on the Linux platforms listed in
the table below.

Table 3-1 On-Premises OML4Py Platform Requirements

Operating System           Hardware    Platform Description
Oracle Linux x86-64 7.x    Intel       64-bit Oracle Linux Release 7
Oracle Linux x86-64 8.x    Intel       64-bit Oracle Linux Release 8

Table 3-2 On-Premises OML4Py Configuration Requirements and Server Support Matrix

Oracle Machine Learning for Python Version    Python Version    On-Premises Oracle Database Release
1.0                                           3.9.5             19c, 21c

3.2 Build and Install Python for Linux for On-Premises Databases
Instructions for installing Python for Linux for an on-premises Oracle database.
The Python installation on the database server must be executed by the Oracle user and not
sudo, root, or any other user. This is not a requirement on the OML4Py client.


Python 3.9.5 is required to install and use OML4Py.


These steps describe building and installing Python 3.9.5 for Linux.
1. Go to the Python website and download the Gzipped source tarball. The downloaded file
name is Python-3.9.5.tgz

wget https://www.python.org/ftp/python/3.9.5/Python-3.9.5.tgz

2. Create a directory $ORACLE_HOME/python and extract the contents to this directory:

mkdir -p $ORACLE_HOME/python
tar -xvzf Python-3.9.5.tgz --strip-components=1 -C $ORACLE_HOME/python

The contents of the Gzipped source tarball will be copied directly to $ORACLE_HOME/python
3. Go to the new directory:

cd $ORACLE_HOME/python

4. OML4Py requires the presence of the perl-Env, libffi-devel, openssl, openssl-devel,
tk-devel, xz-devel, zlib-devel, bzip2-devel, readline-devel, libuuid-devel, and
ncurses-devel RPM libraries. Install these packages as the sudo or root user:

Note:
RPMs must be installed under sudo, or root.

sudo yum install perl-Env libffi-devel openssl openssl-devel tk-devel xz-devel zlib-devel bzip2-devel readline-devel libuuid-devel ncurses-devel

5. To build Python 3.9.5, enter the following commands, where PREFIX is the directory in
which you installed Python-3.9.5. On the Oracle Machine Learning for Python server, PREFIX
is $ORACLE_HOME/python, so the commands are:

cd $ORACLE_HOME/python
./configure --enable-shared --prefix=$ORACLE_HOME/python
make clean; make
make altinstall

Note:
Be sure to use the --enable-shared flag if you are going to use Embedded
Python Execution; otherwise, using an Embedded Python Execution function
results in an extproc error.
Be sure to invoke make altinstall instead of make install to avoid overwriting
the system Python.


6. Set environment variable PYTHONHOME and add it to your PATH, and set environment variable
LD_LIBRARY_PATH:

export PYTHONHOME=$ORACLE_HOME/python
export PATH=$PYTHONHOME/bin:$PATH
export LD_LIBRARY_PATH=$PYTHONHOME/lib:$LD_LIBRARY_PATH

Note:
In order to use Python for OML4Py, the variables must be set, and these
variables must appear before system Python in PATH and LD_LIBRARY_PATH.

pip will return warnings during package installation if the latest version is not installed. You
can upgrade the version of pip to avoid these warnings:

python3 -m pip install --upgrade pip

7. Create a symbolic link in your $ORACLE_HOME/python/bin directory to link to your python3.9


executable, which you can do with the following commands:

cd $ORACLE_HOME/python/bin
ln -s python3.9 python3

You can now start Python by running the command python3. To verify the directory where
Python is installed, check the sys.executable attribute of the sys module. For example:

python3

Python 3.9.5 (default, Feb 22 2022, 15:13:36)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44.0.3)] on linux
Type "help", "copyright", "credits" or "license" for more information.

import sys
print(sys.executable)
/u01/app/oracle/product/19.3/dbhome_1/python/bin/python3

This returns the absolute path of the Python executable binary.


If you run the command python3 and you get the error command not found, then the system
cannot find an executable named python3 in $PYTHONHOME/bin. A symlink is required for the
OML4Py server installation components, so in that case create a symbolic link in your
PREFIX/bin directory to your python3.9 executable, as described in the previous step.


3.3 Install the Required Supporting Packages for Linux for On-
Premises Databases
Both the OML4Py server and client installations for an on-premises Oracle database require
that you also install a set of supporting Python packages, as described below.

Installing required packages on OML4Py client machine


The on-premises OML4Py client requires the following Python packages:
• numpy 1.21.5
• pandas 1.3.4
• scipy 1.7.3
• cx_Oracle 8.1.0
• scikit-learn 1.0.1
• matplotlib 3.3.3
Use pip3.9 to install the supporting packages. For the OML4Py client installation, run the
following command for each package, specifying the package name:

pip3.9 install packagename

These commands install the required packages:


pip3.9 install pandas==1.3.4
pip3.9 install scipy==1.7.3
pip3.9 install matplotlib==3.3.3
pip3.9 install cx_Oracle==8.1.0
pip3.9 install threadpoolctl==2.1.0
pip3.9 install joblib==0.14.0
pip3.9 install scikit-learn==1.0.1 --no-deps
pip3.9 uninstall numpy
pip3.9 install numpy==1.21.5

This command installs the cx_Oracle package using an example proxy server:

pip3.9 install cx_Oracle==8.1.0 --proxy="http://www-proxy.example.com:80"

Note:
The proxy server is only necessary if the user is behind a firewall.

Installing required packages on OML4Py server machine


On the OML4Py server machine, all these packages must be installed into
$ORACLE_HOME/oml4py/modules so they can be detected by the Embedded Python Execution
process. Run the following command, specifying the package and target directory,
$ORACLE_HOME/oml4py/modules:

pip3.9 install packagename --target=$ORACLE_HOME/oml4py/modules

These commands install the required packages:

pip3.9 install pandas==1.3.4 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install scipy==1.7.3 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install matplotlib==3.3.3 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install cx_Oracle==8.1.0 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install threadpoolctl==2.1.0 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install joblib==0.14.0 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install scikit-learn==1.0.1 --no-deps --target=$ORACLE_HOME/oml4py/modules
pip3.9 uninstall numpy
pip3.9 install numpy==1.21.5 --target=$ORACLE_HOME/oml4py/modules

This command installs the cx_Oracle package using an example proxy server:

pip3.9 install cx_Oracle==8.1.0 --proxy="http://www-proxy.example.com:80" --target=$ORACLE_HOME/oml4py/modules

Verify the Package Installation


Load the packages below to ensure they have been installed successfully. Start Python and
run the following commands:

$ python3

Python 3.9.5 (default, Feb 22 2022, 15:13:36)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44.0.3)] on linux
Type "help", "copyright", "credits" or "license" for more information.

import numpy
import pandas
import scipy
import matplotlib
import cx_Oracle
import sklearn

If all the packages are installed successfully, then no errors are returned.


3.4 Install OML4Py Server for On-Premises Oracle Database


The following instructions tell how to install and uninstall the OML4Py server components for
an on-premises Oracle Database.

Topics:
• Install OML4Py Server for Linux for On-Premises Oracle Database 19c
Instructions for installing the OML4Py server for Linux for an on-premises Oracle Database
19c.
• Install OML4Py Server for Linux for On-Premises Oracle Database 21c
Instructions for installing the OML4Py server for Linux for an on-premises Oracle Database
21c.
• Verify OML4Py Server Installation for On-Premises Database
Verify the installation of the OML4Py server and client components for an on-premises
database.
• Grant Users the Required Privileges for On-Premises Database
Instructions for granting the privileges required for using OML4Py with an on-premises
database.
• Create New Users for On-Premises Oracle Database
The pyquser.sql script is a convenient way to create a new OML4Py user for an on-premises
database.
• Uninstall the OML4Py Server from an On-Premises Database 19c
Instructions for uninstalling the on-premises OML4Py server components from an on-
premises Oracle Database 19c.

3.4.1 Install OML4Py Server for Linux for On-Premises Oracle Database
19c
Instructions for installing the OML4Py server for Linux for an on-premises Oracle Database
19c.
To install the OML4Py server for Linux for an on-premises Oracle database, run the server
installation Perl script.

Prerequisites
To install the on-premises OML4Py server, the following are required:
• A connection to the internet.
• Python 3.9.5. For instructions on installing Python 3.9.5 see Build and Install Python for
Linux for On-Premises Databases.
• OML4Py supporting packages. For instructions on installing the required supporting
packages see Install the Required Supporting Packages for Linux for On-Premises
Databases.
• Perl 5.8 or higher installed on your system.


Note:
Perl requires the presence of the perl-Env package.

• To verify whether the perl-Env package exists on the system, type the command:

rpm -qa perl-Env

If it is installed, the return value will contain the version of the perl-Env RPM installed on
your system:

rpm -qa perl-Env
perl-Env-1.04-2.el7.noarch

If perl-Env is not installed on the system, there will be no return value, and you can install
the package as root or sudo using the command:

yum install perl-Env

• Write permission on the directories to which you download and install the server
components.

Download and Extract the Server Installation File


Download the on-premises OML4Py server installation file and extract its contents.
1. If the directory oml4py does not exist in the $ORACLE_HOME directory, then create it.

mkdir $ORACLE_HOME/oml4py

2. Download the installation file for your system.


a. Go to the Oracle Machine Learning for Python Downloads page on the Oracle
Technology Network.
b. Accept the license agreement and select Oracle Machine Learning for Python
Downloads (v1.0).
c. Select Oracle Machine Learning for Python Server Install for Oracle Database on
Linux 64 bit.
d. Save the file to the $ORACLE_HOME/oml4py directory.
3. To extract the installation file to $ORACLE_HOME/oml4py directory, use the command:

unzip oml4py-server-linux-x86_64-1.0.zip -d $ORACLE_HOME/oml4py

The files are extracted to the $ORACLE_HOME/oml4py/server subdirectory.

View the Optional Arguments to the Server Installation Perl Script


To view the optional arguments to the server installation script, change directories to
the $ORACLE_HOME/oml4py directory.


Display the available installation options with the following command:

perl -Iserver server/server.pl --help

The command displays the following:

Oracle Machine Learning for Python 1.0 Server.

Copyright (c) 2018, 2022 Oracle and/or its affiliates. All rights reserved.

Usage: server.pl [OPTION]...
Install, upgrade, or uninstall OML4P Server.

  -i, --install    install or upgrade (default)
  -u, --uninstall  uninstall
  -y               never prompt
  --ask            interactive mode (default)
  --pdb NAME       PDB name
  --perm PERM      permanent tablespace to use for PYQSYS
  --temp TEMP      temporary tablespace to use for PYQSYS
  --no-db          do not configure the database; only install the oml module
                   and libraries associated with Embedded Python Execution
  --no-embed       do not install the Embedded Python Execution component
  --no-automl      do not install the AutoML metamodels

By default, the installation script installs both the Embedded Python Execution and AutoML
components. If you do not want to install these components, then you can use the --no-embed
and/or the --no-automl flag.

If you do not specify a permanent tablespace or a temporary tablespace in the Perl command,
then the installation script prompts you for them.
If you only want to install the oml modules and Embedded Python Execution libraries with no
database configuration, use the --no-db flag. The --no-db flag is used when OML4Py is
installed in a database with multiple nodes, such as Oracle RAC. The OML4Py server requires
a complete database configuration on the first node, but the oml module and Embedded
Python Execution libraries must be installed on each node.

Run the Server Installation Perl Script


The installation Perl script creates the PYQSYS schema and user. It uses the permanent and
temporary tablespaces that you specify to store OML4Py database objects and tables and
other server elements. The PYQSYS user is locked to protect the system objects stored in the
PYQSYS schema.
By default, the installation Perl script runs in interactive mode and installs the Embedded
Python Execution components.
1. You need to set the PYTHONPATH environment variable prior to running the server
installation script so that Python can find the installed oml modules:

export PYTHONPATH=$ORACLE_HOME/oml4py/modules


2. From the $ORACLE_HOME/oml4py directory, run the server installation script. The following
command runs the script in interactive mode:

perl -Iserver server/server.pl

Enter temporary and permanent tablespaces for the PYQSYS user when the script
prompts you for them.
3. When the installation script displays Proceed?, enter y or yes. The output of a successful
installation is as follows:

perl -Iserver server/server.pl

Oracle Machine Learning for Python 1.0 Server.

Copyright (c) 2018, 2022 Oracle and/or its affiliates. All rights reserved.

Checking platform .................. Pass


Checking ORACLE_HOME ............... Pass
Checking ORACLE_SID ................ Pass
Checking sqlplus ................... Pass
Checking ORACLE instance ........... Pass
Checking CDB/PDB ................... Fail
ERROR: cannot install OML4P in a root container
PDB to use for OML4P installation [list]:
ORCLPDB
PDB to use for OML4P installation [list]: ORCLPDB
Checking CDB/PDB ................... Pass
Checking OML4P Server .............. Pass
Checking Python .................... Pass
Checking module dependencies ....... Pass
Checking Python libraries .......... Pass
Checking OML4P version ............. Pass

Choosing PYQSYS tablespaces


PERMANENT tablespace to use for PYQSYS [list]:
SYSTEM
USERS
PERMANENT tablespace to use for PYQSYS [list]: SYSTEM
TEMPORARY tablespace to use for PYQSYS [list]: TEMP

Current configuration
ORACLE_HOME ...................... /u01/app/oracle/product/19.3/dbhome_1
ORACLE_SID ....................... orcl
PDB .............................. ORCLPDB
Python Version ................... 3.9.5
PYTHONHOME ....................... /u01/app/oracle/product/19.3/dbhome_1/python

Existing OML4P data and code ..... None


Existing OML4P AutoML component .. None
Existing OML4P embed component ... None
Existing OML4P module version .... None

PYQSYS PERMANENT tablespace ...... SYSTEM


PYQSYS TEMPORARY tablespace ...... TEMP

Operation ........................ Install/Upgrade

Proceed? [yes]yes


Copying embedded python libraries ... Pass


Processing ./server/oml-1.0-cp39-cp39-linux_x86_64.whl
Installing collected packages: oml
Successfully installed oml-1.0
Configuring the database ............ Pass

Done

An OML4Py user is a database user account that has privileges for performing machine
learning activities. To learn how to create a user for Oracle Machine Learning for Python,
see Create New Users for On-Premises Oracle Database.

Verify the Server Installation


You can verify the database configuration of OML4Py as the oracle user by doing the following:
1. On the OML4Py server database instance, start SQL*Plus as the OML user, logging in to
the PDB, in this example, PDB1.

$ sqlplus oml_user/oml_user_password@PDB1

2. Run the following command:

SELECT * FROM sys.pyq_config;

The expected output is as follows:


sqlplus / as sysdba;

SQL*Plus: Release 19.0.0.0.0 - Production on Mon Jan 31 12:49:34 2022


Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle. All rights reserved.

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0

SQL> alter session set container=PDB1;

Session altered.

select * from sys.pyq_config;

NAME
--------------------------------------------------------------------------------
VALUE
--------------------------------------------------------------------------------
PYTHONHOME
/u01/app/oracle/product/19.3/dbhome_1/python

PYTHONPATH
/u01/app/oracle/product/19.3/dbhome_1/oml4py/modules

VERSION
1.0

NAME


--------------------------------------------------------------------------------
VALUE
--------------------------------------------------------------------------------
PLATFORM
ODB

DSWLIST
oml.*;pandas.*;numpy.*;matplotlib.*;sklearn.*

3. To verify the installation of the OML4Py server for an on-premises database see Verify
OML4Py Server Installation for On-Premises Database.

3.4.2 Install OML4Py Server for Linux for On-Premises Oracle Database
21c
Instructions for installing the OML4Py server for Linux for an on-premises Oracle Database
21c.
You can install OML4Py by running the pyqcfg.sql installation script included with your 21c
database or by using the Database Configuration Assistant (DBCA).

Install OML4Py By Using the pyqcfg.sql Script


To install the on-premises OML4Py server, the following are required:
• A connection to the internet.
• Python 3.9.5. For instructions on installing Python 3.9.5 see Build and Install Python for
Linux for On-Premises Databases.
• OML4Py supporting packages. For instructions on installing the required supporting
packages see Install the Required Supporting Packages for Linux for On-Premises
Databases.
• Perl 5.8 or higher installed on your system.

Note:
Perl requires the perl-Env package. You can install the package as root with the
command yum install perl-Env .

To check for the existence of perl-Env, run the following command. The version will vary
depending on your Operating System and version:

rpm -qa perl-Env


perl-Env-1.04-395.el8.noarch

• Write permission on the directories to which you download and install the server
components.

Note:
The following environment variables must be set up.


• Set PYTHONHOME and add it to your PATH
• Set ORACLE_HOME and add it to your PATH
• Set LD_LIBRARY_PATH

export PYTHONHOME=PREFIX
export ORACLE_HOME=ORACLE_HOME_HERE
export PATH=$PYTHONHOME/bin:$ORACLE_HOME/bin:$PATH
export LD_LIBRARY_PATH=$PYTHONHOME/lib:$ORACLE_HOME/lib:$LD_LIBRARY_PATH

To install the OML4Py server for Linux for an on-premises Oracle Database 21c, run the server
installation SQL script pyqcfg.sql.

1. At your operating system prompt, start SQL*Plus and log in to your Oracle pluggable
database (PDB) directly.
2. Run the pyqcfg.sql script. The script is under $ORACLE_HOME/oml4py/server.
To capture the log, spool the installation steps to an external file. The following example
uses the PDB PDB1 and gives example values for the script arguments.

sqlplus / as sysdba
spool install.txt
alter session set container=PDB1;
ALTER PROFILE DEFAULT LIMIT PASSWORD_VERIFY_FUNCTION NULL;
@$ORACLE_HOME/oml4py/server/pyqcfg.sql

define permtbl_value = SYSAUX   --> Specify a permanent tablespace for the PYQSYS schema
define temptbl_value = TEMP     --> Specify a temporary tablespace
define orahome_value = /u01/app/oracle/product/21.3.0.0/dbhome_1   --> Specify the ORACLE_HOME directory
define pythonhome = /opt/Python-3.9.5   --> Specify the PYTHON_HOME directory

3. Open the install.txt file to see if any errors occurred.

Install OML4Py With the Database Configuration Assistant (DBCA)


You can install OML4Py by using DBCA. For complete instructions on using DBCA, see
Database Configuration Assistant Command Reference for Silent Mode.
The basic syntax to install OML4Py is:

dbca -configureOML4PY

You can include the following parameters:


• -oml4pyConfigTablespace to configure the tablespace of the PYQSYS schema for OML4Py.
The default tablespace is SYSAUX.
• -enableOml4pyEmbeddedExecution to enable the embedded Python component of Oracle
Machine Learning for Python. The default value is TRUE.
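For example, a possible silent-mode invocation that combines these parameters might look like the following sketch; check the DBCA command reference for the exact syntax required by your release, and adjust the tablespace value for your system.

dbca -silent -configureOML4PY -oml4pyConfigTablespace SYSAUX -enableOml4pyEmbeddedExecution TRUE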


3.4.3 Verify OML4Py Server Installation for On-Premises Database


Verify the installation of the OML4Py server and client components for an on-premises
database.

1. In your local Python session, connect to the OML4Py server. In the following example,
replace the values for the parameters with those for your database.

import oml
oml.connect(user='oml_user', password='oml_user_password', host='myhost',
port=1521, sid='mysid')

2. Create a user-defined Python function and store it in the OML4Py script repository.

oml.script.create("TEST", func='def func():return 1 + 1', overwrite=True)

3. Call the user-defined function, using the oml.do_eval function.

res = oml.do_eval(func='TEST')
res

4. When you are finished testing, you can drop the test function from the script repository.

oml.script.drop("TEST")

3.4.4 Grant Users the Required Privileges for On-Premises Database


Instructions for granting the privileges required for using OML4Py with an on-premises
database.
To use OML4Py, a user must have certain database privileges. To store and
manage user-defined Python functions in the OML4Py script repository, a user must also have
the PYQADMIN database role.

User Privileges
After installing the OML4Py server on an on-premises Oracle database server, grant the
following privileges to any OML4Py user.
• CREATE SESSION
• CREATE TABLE
• CREATE VIEW
• CREATE PROCEDURE
• CREATE MINING MODEL
• EXECUTE ON CTXSYS.CTX_DDL (required for using the Oracle Text processing capability
in the algorithm classes in the oml.algo package)


To grant all of these privileges, on the on-premises Oracle database server start SQL as a
database administrator and run the following SQL statement, where oml_user is the OML4Py
user:

GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW, CREATE PROCEDURE,
  CREATE MINING MODEL, EXECUTE ON CTXSYS.CTX_DDL to oml_user;

Script Repository and Datastore Management


The OML4Py script repository stores user-defined Python functions that a user can invoke in
an Embedded Python Execution function. An OML4Py datastore stores Python objects that
can be used in subsequent Python sessions. A user-defined Python function in the script
repository or a datastore can be available to any user or can be restricted for use by the owner
only or by those granted access to it.
The OML4Py server installation script creates the PYQADMIN role in the database. A user
must have that role to do the following:
• Store user-defined Python functions in the script repository.
• Drop user-defined Python functions from the repository.
• Grant or revoke permission to use a user-defined Python function in the script repository.
• Grant or revoke permission to use the objects in a datastore.
To grant this role to a user, on the on-premises Oracle database server start SQL as a
database administrator and run the following SQL statement, where oml_user is your OML4Py
user:

GRANT PYQADMIN to oml_user;

3.4.5 Create New Users for On-Premises Oracle Database


The pyquser.sql script is a convenient way to create a new OML4Py user for an on-premises
database.

About the pyquser.sql Script


The pyquser.sql script is a component of the on-premises OML4Py server installation. The
script is in the server directory of the installation. The sysdba privilege is required to run the
script.
The pyquser.sql script grants the new user the required on-premises Oracle database
privileges and, optionally, grants the PYQADMIN database role. The PYQADMIN role is
required for creating and managing scripts in the OML4Py script repository for use in
Embedded Python Execution.
The pyquser.sql script takes the following five positional arguments:

• Username
• User's permanent tablespace
• User's temporary tablespace
• Permanent tablespace quota
• PYQADMIN role
When you run the script, it prompts you for a password for the user.


Create a New User


To use the pyquser.sql script, go to the server subdirectory of the directory that contains the
extracted OML4Py server installation files. Run the script as a database administrator.
The following examples use SQL*Plus and the sysdba user to run the pyquser.sql script.

Example 3-1 Creating New Users


This example creates the user oml_user with the permanent tablespace USERS with an
unlimited quota, the temporary tablespace TEMP, and grants the PYQADMIN role to the
oml_user.

sqlplus / as sysdba
@pyquser.sql oml_user USERS TEMP unlimited pyqadmin

Enter value for password: <type your password>

For a pluggable database:

sqlplus / as sysdba
alter session set container=<PDBNAME>
@pyquser.sql oml_user USERS TEMP unlimited pyqadmin

The output is similar to the following:

SQL> @pyquser.sql oml_user USERS TEMP unlimited pyqadmin


Enter value for password: welcome1
old 1: create user &&1 identified by &password
new 1: create user oml_user identified by welcome1
old 2: default tablespace &&2
new 2: default tablespace USERS
old 3: temporary tablespace &&3
new 3: temporary tablespace TEMP
old 4: quota &&4 on &&2
new 4: quota unlimited on USERS

User created.

old 4: 'create procedure, create mining model to &&1';


new 4: 'create procedure, create mining model to pyquser';
old 6: IF lower('&&5') = 'pyqadmin' THEN
new 6: IF lower('pyqadmin') = 'pyqadmin' THEN
old 7: execute immediate 'grant PYQADMIN to &&1';
new 7: execute immediate 'grant PYQADMIN to pyquser';

PL/SQL procedure successfully completed.


This example creates the user oml_user2 with a 20 MB quota on the USERS tablespace,
the temporary tablespace TEMP, and without the PYQADMIN role.

sqlplus / as sysdba
@pyquser.sql oml_user2 USERS TEMP 20M FALSE

Enter value for password: <type your password>

3.4.6 Uninstall the OML4Py Server from an On-Premises Database 19c


Instructions for uninstalling the on-premises OML4Py server components from an on-premises
Oracle Database 19c.

Uninstall the On-Premises OML4Py Server for Linux


To uninstall the on-premises OML4Py server for Linux, do the following:
1. Verify that the PYTHONHOME environment variable is set to the Python 3.9 installation directory.

echo $PYTHONHOME

2. Verify that the PYTHONPATH environment variable is set to the directory in which the oml
modules are installed.

echo $PYTHONPATH

If it is not set to the proper directory, set it.

export PYTHONPATH=$ORACLE_HOME/oml4py/modules

3. Change directories to the directory containing the server installation zip file.

cd $ORACLE_HOME/oml4py

4. Run the server installation Perl script with the -u argument.

perl -Iserver server/server.pl -u

When the script displays Proceed?, enter y or yes.

3.5 Install OML4Py Client for On-Premises Databases


Instructions for installing and uninstalling the on-premises OML4Py client.
For instructions on installing the OML4Py client on Autonomous Database, see Install OML4Py
Client for Linux for Use With Autonomous Database on Serverless Exadata Infrastructure.

Topics:
• Install Oracle Instant Client and the OML4Py Client for Linux
Instructions for installing Oracle Instant Client and the OML4Py client for Linux for an on-
premises Oracle database.


• Verify OML4Py Client Installation for On-Premises Databases


Verify the installation of the OML4Py client components for an on-premises Oracle
database.
• Uninstall the OML4Py Client for On-Premises Databases
Instructions for uninstalling the OML4Py client.

3.5.1 Install Oracle Instant Client and the OML4Py Client for Linux
Instructions for installing Oracle Instant Client and the OML4Py client for Linux for an on-
premises Oracle database.
To connect the OML4Py client for Linux to an on-premises Oracle database, you must have
Oracle Instant Client installed on your local system.
• Install Oracle Instant Client for Linux for On-Premises Databases
Instructions for installing Oracle Instant Client for Linux for use with an on-premises Oracle
database.
• Install OML4Py Client for Linux for On-Premises Databases
Instructions for installing the OML4Py client for Linux for use with an on-premises Oracle
database.

3.5.1.1 Install Oracle Instant Client for Linux for On-Premises Databases
Instructions for installing Oracle Instant Client for Linux for use with an on-premises Oracle
database.
The OML4Py client requires Oracle Instant Client to connect to an Oracle database. See the
Oracle Support Note "Client / Server Interoperability Support Matrix for Different Oracle
Versions (Doc ID 207303.1)".
To install Oracle Instant Client, the following are required:
• A connection to the internet.
• Write permission on the directory in which you are installing the client.
To install Oracle Instant Client, do the following:
1. Download the Oracle Instant Client for your system. Go to the Oracle Instant Client
Downloads page and select Instant Client for Linux x86-64.
2. Locate the section for your version of Oracle Database. These instructions use the
19.14.0.0.0 version.
3. In the Base section, in the Download column, click the zip file for the Basic Package or
Basic Light Package and save the file in an accessible directory on your system. These
instructions use the directory /opt/oracle.
4. Go to the folder that you selected and unzip the package. For example:

cd /opt/oracle
unzip instantclient-basic-linux.x64-19.14.0.0.0dbru.zip

Extracting the package creates the subdirectory instantclient_19_14, which contains the
Oracle Instant Client files.


5. The libaio package is also required. To see if libaio resides on the system, run the
following command.

$ rpm -qa libaio


libaio-0.3.112-1.el8.i686
libaio-0.3.112-1.el8.x86_64

The version will vary based on the Linux version. If nothing is returned from this command,
then the libaio RPM is not installed on the target system.
To install the libaio package with sudo or as the root user, run the following command:

sudo yum install libaio

Note:
In some Linux distributions, this package is called libaio1.

6. Add the directory that contains the Oracle Instant Client files to the beginning of your
LD_LIBRARY_PATH environment variable:

export LD_LIBRARY_PATH=/opt/oracle/instantclient_19_14:$LD_LIBRARY_PATH
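The export command above applies only to the current shell session. If you want the setting to persist across logins, you could append it to a shell startup file, for example (a sketch assuming a bash shell and the installation path used above):

echo 'export LD_LIBRARY_PATH=/opt/oracle/instantclient_19_14:$LD_LIBRARY_PATH' >> ~/.bashrc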

3.5.1.2 Install OML4Py Client for Linux for On-Premises Databases


Instructions for installing the OML4Py client for Linux for use with an on-premises Oracle
database.

Prerequisites
To download and install the on-premises OML4Py client, the following are required:
• A connection to the internet.
• Write permission on the directory in which you are installing the client.
• Perl 5.8 or higher installed on your system.
• Python 3.9.5. To know more about downloading and installing Python 3.9.5, see Build and
Install Python for Linux for On-Premises Databases
To use the OML4Py client to connect to an on-premises Oracle database, the following are
required:
• Oracle Instant Client must be installed on the client machine.
• The OML4Py server must be installed on the on-premises database server.

Download and Extract the OML4Py Client Installation File

To download and extract the OML4Py client installation file, do the following:
1. Download the client installation zip file.
a. Go to the Oracle Machine Learning for Python Downloads page on the Oracle
Technology Network.


b. Accept the license agreement and select Oracle Machine Learning for Python
Downloads (v1.0).
c. Select Oracle Machine Learning for Python Client Install for Oracle Database on
Linux 64 bit.
d. Save the zip file to an accessible directory. These instructions use a directory named
oml4py, but you can download the zip file to any location accessible to the user
installing the oml4py client.
2. Go to the directory to which you downloaded the zip file and unzip the file.

cd oml4py
unzip oml4py-client-linux-x86_64-1.0.zip

The contents are extracted to a subdirectory named client, which contains these four
files:
• OML4PInstallShared.pm
• oml-1.0-cp39-cp39-linux_x86_64.whl
• client.pl
• oml4py.ver

View the Optional Arguments to the Client Installation Perl Script


In the directory that contains the downloaded installation zip file (oml4py in these
instructions), run the client installation Perl script with the --help option to display the
available installation arguments.
The following command displays the available installation options:
perl -Iclient client/client.pl --help

Oracle Machine Learning for Python 1.0 Client.

Copyright (c) 2018, 2022 Oracle and/or its affiliates. All rights reserved.
Usage: client.pl [OPTION]...
Install, upgrade, or uninstall OML4P Client.

-i, --install install or upgrade (default)


-u, --uninstall uninstall
-y never prompt
--ask interactive mode (default)
--no-embed do not install embedded python functionality
--no-automl do not install automl module
--no-deps turn off dependencies checking
--target <dir> install client into <dir>

By default, the installation script installs the Embedded Python Execution and AutoML
modules. If you don't want to install these modules, then you can use the --no-embed and --
no-automl flags, respectively.

Also by default, the installation script checks for the existence and version of each of the
supporting packages that the OML4Py client requires. If a required package is missing or does
not meet the version requirement, the installation script displays an error message and exits.
You can skip the dependency checking in the client installation by using the --no-deps flag.


However, to use the oml module, you need to have installed acceptable versions of all of the
supporting packages.
For a list of the required dependencies, see Install the Required Supporting Packages for Linux
for On-Premises Databases.
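For example, a hypothetical invocation that skips the AutoML module, turns off dependency checking, and installs into a custom directory might look like the following; the target path is a placeholder:

perl -Iclient client/client.pl --no-automl --no-deps --target /opt/oml4py_client
export PYTHONPATH=/opt/oml4py_client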

Run the OML4Py Client Installation Script


To install the OML4Py client, do the following:
1. In the directory that contains the extracted client installation Perl script, run the script. The
following command runs the Perl script in the current directory:

perl -Iclient client/client.pl

Alternatively, the following command runs the Perl script with the target directory specified:

perl -Iclient client/client.pl --target path_to_target_dir

The --target flag is optional; use it if you don't want to install the module into the default location.
When the script displays Proceed?, enter y or yes.
If you use the --target <dir> argument to install the oml module to the specified
directory, then add that location to environment variable PYTHONPATH so that Python can
find the module:

export PYTHONPATH=path_to_target_dir

The command displays the following:


perl -Iclient client/client.pl

Oracle Machine Learning for Python 1.0 Client.

Copyright (c) 2018, 2022 Oracle and/or its affiliates. All rights reserved.
Checking platform .................. Pass
Checking Python .................... Pass
Checking dependencies .............. Pass
Checking OML4P version ............. Pass
Current configuration
Python Version ................... 3.9.5
PYTHONHOME ....................... /opt/Python-3.9.5
Existing OML4P module version .... None
Operation ........................ Install/Upgrade

Proceed? [yes]

Processing ./client/oml-1.0-cp39-cp39-linux_x86_64.whl
Installing collected packages: oml
Successfully installed oml-1.0

2. To verify that oml modules are successfully installed and are ready to use, start Python and
import oml. At the Linux prompt, enter python3.

python3


At the Python prompt, enter import oml

import oml

The output is similar to the following:

python3

Python 3.9.5 (default, Feb 23 2022, 17:12:33)


[GCC 4.8.5 20150623 (Red Hat 4.8.5-44.0.3)] on linux
Type "help", "copyright", "credits" or "license" for more information.

import oml

3. Display the location of the installation directory.


If you didn't use the --target <dir> argument, then the installed oml modules are stored
under $PYTHONHOME/lib/python3.9/site-packages/. Again, you must have write
permission for the target directory.
In Python, after importing the oml module, you can display the directory in which the client
is installed. At the Python prompt, enter:

oml.__path__

Connect to the OML4Py Server


Start Python, import oml, and create a connection to your OML4Py server using an appropriate
password, hostname, and system identifier. The following example uses oml_user as the user
and has example argument values. Replace the username and other argument values with the
values for your user and database.

import oml
oml.connect(user='oml_user', password='oml_user_password', host='myhost',
            port=1521, service_name='myservice')

After connecting, you can run any of the examples in this publication. For example, you could
run Example 6-8.
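As a quick check that the session is usable before running those examples, you can verify the connection status and disconnect when you are done; a minimal sketch:

import oml
oml.isconnected()   # returns True when the connection to the OML4Py server is active
# ... run your examples ...
oml.disconnect()    # close the connection when finished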

Note:
To use the Embedded Python Execution examples, you must have installed the
OML4Py client with the Embedded Python Execution option enabled.
To use the Automatic Machine Learning (AutoML) examples, you must specify a
running connection pool on the server in the automl argument in an oml.connect
invocation.
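For example, a connection that enables AutoML might look like the following sketch, where 'pool_name' is a placeholder for the name of a connection pool that is already running on the database server:

import oml
# 'pool_name' is a placeholder; replace it with your running server-side connection pool.
oml.connect(user='oml_user', password='oml_user_password', host='myhost',
            port=1521, service_name='myservice', automl='pool_name')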


3.5.2 Verify OML4Py Client Installation for On-Premises Databases


Verify the installation of the OML4Py client components for an on-premises Oracle database.
1. In your local Python session, connect to the OML4Py server. In the following example,
replace the values for the parameters with those for your database.

import oml
oml.connect(user='oml_user', password='oml_user_password', host='myhost',
port=1521, sid='mysid')

2. Create a user-defined Python function and store it in the OML4Py script repository.

oml.script.create("TEST", func='def func():return 1 + 1', overwrite=True)

3. Call the user-defined function, using the oml.do_eval function.

res = oml.do_eval(func='TEST')
res

4. When you are finished testing, you can drop the test function from the script repository.

oml.script.drop("TEST")

3.5.3 Uninstall the OML4Py Client for On-Premises Databases


Instructions for uninstalling the OML4Py client.

Uninstall the On-Premises OML4Py Client for Linux


To uninstall the on-premises OML4Py client for Linux, from the directory containing the client
installation zip file, run the client installation Perl script with the -u argument:

perl -Iclient client/client.pl -u

When the script displays Proceed?, enter y or yes.

If the client is successfully uninstalled, you'll see the following message:

Uninstalling oml-1.0:
Successfully uninstalled oml-1.0

4 Install OML4Py on Exadata
The following topics describe OML4Py on Exadata and how to configure DCLI and install
Python and OML4Py across Exadata compute nodes.

Topics:
• About Oracle Machine Learning for Python on Exadata
Exadata is an ideal platform for OML4Py. The parallel resources of Python computations in
OML4Py take advantage of the massively parallel grid infrastructure of Exadata.
• Configure DCLI to install Python across Exadata compute nodes.
Using Distributed Command Line Interface (DCLI) can simplify the installation of OML4Py
on Exadata.

4.1 About Oracle Machine Learning for Python on Exadata


Exadata is an ideal platform for OML4Py. The parallel resources of Python computations in
OML4Py take advantage of the massively parallel grid infrastructure of Exadata.

Note:
The version of OML4Py must be the same on the server and on each client
computer. Also, the version of Python must be the same on the server and on each
client computer. See Table 3-2, OML4Py On-Premises System Requirements, for supported
configurations.
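For example, one minimal way to confirm the Python version and locate the installed oml module on a node is the following (assuming the oml module is already on PYTHONPATH):

python3 --version
python3 -c "import oml; print(oml.__path__)"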

To install OML4Py on Exadata:


1. On all compute nodes:
• Install Python
• Verify and configure the environment
• Install the OML4Py supporting packages
• Install the OML4Py server components
2. On the first node only:
• Install the OML4Py Server components including the database configuration.
• Create an OML4Py user, if desired. Alternatively, configure an existing database user
to use OML4Py. See Create New Users for On-Premises Oracle Database.
You can simplify the Python installation on Exadata by using the Distributed Command Line
Interface (DCLI).


4.2 Configure DCLI to install Python across Exadata compute nodes.
Using Distributed Command Line Interface (DCLI) can simplify the installation of OML4Py on
Exadata.
With DCLI, you can use a single command to install Python across multiple Exadata compute
nodes. The following example shows the output of the DCLI help option, which explains the
basic syntax of the utility.
Example 4-1 DCLI Help Option Output

dcli -h

Distributed Shell for Oracle Storage

This script executes commands on multiple cells in parallel threads.


The cells are referenced by their domain name or ip address.
Local files can be copied to cells and executed on cells.
This tool does not support interactive sessions with host applications.
Use of this tool assumes ssh is running on local host and cells.
The -k option should be used initially to perform key exchange with
cells. User may be prompted to acknowledge cell authenticity, and
may be prompted for the remote user password. This -k step is serialized
to prevent overlayed prompts. After -k option is used once, then
subsequent commands to the same cells do not require -k and will not require
passwords for that user from the host.
Command output (stdout and stderr) is collected and displayed after the
copy and command execution has finished on all cells.
Options allow this command output to be abbreviated.

Return values:
0 -- file or command was copied and executed successfully on all cells
1 -- one or more cells could not be reached or remote execution
returned non-zero status.
2 -- An error prevented any command execution

Examples:
dcli -g mycells -k
dcli -c stsd2s2,stsd2s3 vmstat
dcli -g mycells cellcli -e alter iormplan active
dcli -g mycells -x reConfig.scl

Usage: dcli [options] [command]

Options:
--version show program's version number and exit
--batchsize=MAXTHDS limit the number of target cells on which to run the
command or file copy in parallel
-c CELLS comma-separated list of cells
--ctimeout=CTIMEOUT Maximum time in seconds for initial cell connect
-d DESTFILE destination directory or file
-f FILE files to be copied


-g GROUPFILE file containing list of cells


-h, --help show help message and exit
--hidestderr hide stderr for remotely executed commands in ssh
-k push ssh key to cell's authorized_keys file
--key-with-one-password
apply one credential for pushing ssh key to
authorized_keys files
-l USERID user to login as on remote cells (default: celladmin)
--root-exadatatmp root user login using directory /var/log/exadatatmp/
--maxlines=MAXLINES limit output lines from a cell when in parallel
execution over multiple cells (default: 100000)
-n abbreviate non-error output
-r REGEXP abbreviate output lines matching a regular expression
-s SSHOPTIONS string of options passed through to ssh
--scp=SCPOPTIONS string of options passed through to scp if different
from sshoptions
--serial serialize execution over the cells
--showbanner show banner of the remote node in ssh
-t list target cells
--unkey drop keys from target cells' authorized_keys file
-v print extra messages to stdout
--vmstat=VMSTATOPS vmstat command options
-x EXECFILE file to be copied and executed

Configure the Exadata environment to enable automatic authentication for DCLI on each
compute node.
1. Generate an SSH public-private key for the root user. Execute the following command as
root on any node:

ssh-keygen -N '' -f ~/.ssh/id_dsa -t dsa

This command generates public and private key files in the .ssh subdirectory of the home
directory of the root user.
2. In a text editor, create a file that contains the names of all the compute nodes in the rack.
Specify each node name on a separate line. For example, the nodes file for a 2-node
cluster could contain entries like the following:

cat nodes

exadb01
exadb02

3. Run the DCLI command with the -k option to establish SSH trust across all the nodes. The
-k option causes DCLI to contact each node sequentially (not in parallel) and prompts you
to enter the password for each node.

dcli -t -g nodes -l root -k -s "\-o StrictHostkeyChecking=no"

DCLI with -k establishes SSH Trust and User Equivalence. Subsequent DCLI commands
will not prompt for passwords.
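For example, after the key exchange completes, you can confirm passwordless execution across all nodes with a simple command such as:

dcli -t -g nodes -l root "hostname"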


Instructions for installing Python and OML4Py across Exadata compute nodes using DCLI are
described in the following topics.

Topics:
• Install Python across Exadata compute nodes using DCLI
Instructions for installing Python across Exadata compute nodes using DCLI.
• Install OML4Py across Exadata compute nodes using DCLI
Instructions for installing OML4Py across Exadata compute nodes using DCLI.

4.2.1 Install Python across Exadata compute nodes using DCLI


Instructions for installing Python across Exadata compute nodes using DCLI.
These steps describe building and installing Python for Exadata.
1. Go to the Python website and download the Python 3.9.5 XZ compressed source tarball
and untar it. The downloaded file name is Python-3.9.5.tar.xz

wget https://www.python.org/ftp/python/3.9.5/Python-3.9.5.tar.xz
tar xvf Python-3.9.5.tar.xz

2. OML4Py requires the perl-Env, libffi-devel, openssl, openssl-devel, tk-devel, xz-devel,
zlib-devel, bzip2-devel, readline-devel, and libuuid-devel libraries. Install these libraries
using the command:

dcli -t -g nodes -l root "yum -y install perl-Env libffi-devel openssl openssl-devel tk-devel xz-devel zlib-devel bzip2-devel readline-devel libuuid-devel"

3. Set the environment variables, then extract and build Python on each node:

dcli -t -g nodes -l oracle "export PYTHONHOME=$ORACLE_HOME/python; export PATH=$ORACLE_HOME/python/bin:$PATH; export LD_LIBRARY_PATH=$ORACLE_HOME/python/lib:$LD_LIBRARY_PATH; export PIP_REQUIRE_VIRTUALENV=false"
dcli -t -g nodes -l oracle "tar xvf $ORACLE_HOME/Python-3.9.5.tar.xz -C $ORACLE_HOME/python"
dcli -t -g nodes -l oracle "cd $ORACLE_HOME/python; ./configure --enable-shared --prefix=$ORACLE_HOME/python"
dcli -t -g nodes -l oracle "cd $ORACLE_HOME/python; make clean; make"
dcli -t -g nodes -l oracle "cd $ORACLE_HOME/python; make altinstall"

4. Create a symbolic link in your $PYTHONHOME/bin directory that links python3 to your
python3.9 executable, which you can do with the following command:

dcli -t -g nodes -l oracle "cd $PYTHONHOME/bin; ln -s python3.9 python3"

5. Set environment variable PYTHONHOME and add it to your PATH, and set environment variable
LD_LIBRARY_PATH:

dcli -t -g nodes -l oracle "export PYTHONHOME=$ORACLE_HOME/python"
dcli -t -g nodes -l oracle "export PATH=$PYTHONHOME/bin:$PATH"
dcli -t -g nodes -l oracle "export LD_LIBRARY_PATH=$PYTHONHOME/lib:$LD_LIBRARY_PATH"
dcli -t -g nodes -l oracle "export PIP_REQUIRE_VIRTUALENV=false"

6. You can now start Python by running the command python3. For example:

dcli -t -g nodes -l oracle "python3"

exadb01: Python 3.9.5 (default, Feb 10 2022, 14:38:12)


[GCC 4.8.5 20150623 (Red Hat 4.8.5-44.0.3)] on linux
Type "help", "copyright", "credits" or "license" for more information.

exadb02: Python 3.9.5 (default, Feb 10 2022, 14:38:12)


[GCC 4.8.5 20150623 (Red Hat 4.8.5-44.0.3)] on linux
Type "help", "copyright", "credits" or "license" for more information.

4.2.2 Install OML4Py across Exadata compute nodes using DCLI


Instructions for installing OML4Py across Exadata compute nodes using DCLI.

To install OML4Py on Exadata using DCLI, follow these steps:

1. First install the OML4Py supporting packages to $ORACLE_HOME/oml4py/modules on each
node. The OML4Py supporting packages must be installed individually on each compute
node. DCLI cannot be used for this step because it uses the system default Python, which
causes conflicts with the Python installed for use with OML4Py.

pip3.9 install pandas==1.3.4 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install scipy==1.7.3 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install matplotlib==3.3.3 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install cx_Oracle==8.1.0 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install threadpoolctl==2.1.0 --target=$ORACLE_HOME/oml4py/modules
pip3.9 install scikit-learn==1.0.1 --no-deps --target=$ORACLE_HOME/oml4py/modules
pip3.9 install joblib==0.14.0 --target=$ORACLE_HOME/oml4py/modules
pip3.9 uninstall numpy
pip3.9 install numpy==1.21.5 --target=$ORACLE_HOME/oml4py/modules

2. Set the PYTHONPATH environment variable to the location of the OML4Py modules:

export PYTHONPATH=$ORACLE_HOME/oml4py/modules

3. Download the installation file for your system.


a. Go to the Oracle Machine Learning for Python Downloads page on the Oracle
Technology Network.
b. Accept the license agreement and select Oracle Machine Learning for Python
Downloads (v1.0).
c. Select Oracle Machine Learning for Python Server Install for Oracle Database on
Linux 64 bit.
d. Save the file to the $ORACLE_HOME/oml4py directory.


To extract the installation file to the $ORACLE_HOME/oml4py directory, use the command:

unzip oml4py-server-linux-x86_64-1.0.zip -d $ORACLE_HOME/oml4py

The files are extracted to the $ORACLE_HOME/oml4py/server subdirectory.


4. On the first node, from the $ORACLE_HOME/oml4py directory, run the server installation
script. The following command runs the script in interactive mode:

perl -Iserver server/server.pl

To run the server script in non-interactive mode, pass the parameters for the pluggable
database and the permanent and temporary tablespaces to the script:

perl -Iserver server/server.pl -y --pdb PDB11 --perm SYSTEM --temp TEMP

Run the server script with the --no-db flag on all remaining compute nodes. This sets up
the OML4Py server configuration and skips the database configuration steps already
performed on the first node:

perl -Iserver server/server.pl --no-db

5 Install Third-Party Packages
Oracle Machine Learning Notebooks in the Autonomous Database provides a conda
interpreter to install third-party Python libraries in a conda environment for use within OML
Notebooks sessions and OML4Py embedded execution invocations. Conda is an open-source
package and environment management system that enables the use of environments
containing third-party Python libraries.
Administrators create conda environments and install packages that can then be accessed by
non-administrator users and loaded into their OML Notebooks session. The conda
environments can be used by OML4Py Python, SQL, and REST APIs.

Note:

• None of the OML features that come with ADB require the customer to install any
additional third-party software via the conda feature.
• When installing third-party software using the conda feature, vulnerability
management and license compliance of that software is the sole responsibility of
the customer who installed it, not Oracle.

Topics:
• Conda Commands
This topic contains common commands used by ADMIN while creating and testing conda
environments in Autonomous Databases. Conda is an open-source package and
environment management system that enables the use of environments containing third-
party Python libraries.
• Administrative Tasks for Creating and Saving a Conda Environment
In OML Notebooks, user ADMIN can manage the lifecycle of the OML user’s conda
environments, including creating and deleting environments and installing and deleting
packages.
• OML User Tasks for Downloading an Available Conda Environment
Once user ADMIN installs the environment in Object Storage in the Autonomous
Database, as an OML user, you can download, activate, and use it in Python paragraphs in
notebooks and with embedded execution.
• Using Conda Environments with Embedded Python Execution
This topic explains the usage of conda environments by running user-defined functions
(UDFs) in SQL and REST APIs for embedded Python execution.

5.1 Conda Commands


This topic contains common commands used by ADMIN while creating and testing conda
environments in Autonomous Databases. Conda is an open-source package and environment

management system that enables the use of environments containing third-party Python
libraries.
Refer to Conda Interpreter Commands for a table of supported conda commands.

Conda Help
To get help for conda commands, run the command name followed by the --help flag.

Note:
The conda command is not run explicitly because the %conda interpreter provides the
conda context.

• Get help for all conda commands

%conda

--help

• Get help for a specific conda command. Run the following command to get help with the
install command:

%conda

install --help

Conda Info
The info command displays information about the conda installation, including the conda
version and available channels.

%conda

info

Conda Search
The search command allows the user to search for packages and display associated
information, including the package version and the channel where it resides.
• Search for a specific package. Run the following command to search for the package
scikit-learn.

%conda

search scikit-learn

• Search for packages containing 'scikit' in the package name.

%conda

search '*scikit*'


• Search for a specific version of a package.

%conda

search 'numpy==1.12'

%conda

search 'numpy>=1.12'

• Search for a specific version on a specific channel.

%conda

search conda-forge::numpy

Enhanced Conda Commands


A set of enhanced conda commands in the conda environment lifecycle management package
env-lcm supports the management of environments saved to Object Storage, including
uploading, downloading, listing, and deleting available environments.
Help for conda lifecycle environment commands.

%conda

env-lcm --help

Usage: conda-env-lcm [OPTIONS] COMMAND [ARGS]...

ADB-S Command Line Interface (CLI) to manage persistence of conda


environments

Options:
-v, --version Show the version and exit.
--help Show this message and exit.

Commands:
delete Delete a saved conda environment
download Download a saved conda environment
import Create or update a conda environment from saved metadata
list-local-envs List locally available environments for use
list-saved-envs List saved conda environments
upload Save conda environment for later use

Creating Conda Environments


This section demonstrates creating a conda environment, installing packages into it, and then
removing the environment, illustrating commonly used options for environment creation and
testing. The environment exists for the duration of the notebook session and does not persist
between sessions unless it is saved to Object Storage. For instructions that include both
creating and persisting an environment for OML users, refer to Administrative Tasks for
Creating and Saving a Conda Environment. As an ADMIN user:


1. Use the create command to create an environment myenv and install the Python keras
package.
2. Verify that the new environment is created, and activate the environment.
3. Install, then uninstall an additional Python package, pytorch, in the environment.
4. Deactivate and remove the environment.

Note:
The ADMIN user can access the conda environment from Python and R, but does
not have the capability to run embedded Python and R execution commands.

For help with the conda create command, enter create --help in a %conda paragraph.

List Environments
Start by listing the environments available by default. Conda contains default environments
with some core system libraries and conda dependencies. The active environment is marked
with an asterisk (*).

%conda

env list

# conda environments:
#
base * /usr
conda-pack-env /usr/envs/conda-pack-env

Create Conda Environment


Create a conda environment called myenv with Python 3.10 for OML4Py compatibility and install
the keras package.

%conda

create -n myenv python=3.10 keras

Verify Environment Creation


Verify that the myenv environment is in the list of environments. The asterisk (*) indicates the
active environment. The new environment is created but not yet activated.

%conda

env list

# conda environments:
#
myenv /u01/.conda/envs/myenv


base * /usr
conda-pack-env /usr/envs/conda-pack-env

Activate the Environment


Activate the myenv environment and list the environments to verify the activation. The asterisk
(*) next to the environment name confirms the activation.

%conda

activate myenv

Conda environment 'myenv' activated

List the environments again to confirm that myenv is now active.

%conda

env list

# conda environments:
#
myenv * /u01/.conda/envs/myenv
base /usr
conda-pack-env /usr/envs/conda-pack-env

Installing and Uninstalling Libraries


The ADMIN user can install and uninstall libraries into an environment using the install and
uninstall commands. For help with the conda install and uninstall commands, type
install --help and uninstall --help in a %conda paragraph.

Note:
When conda installs a package into an environment, it also installs any required
dependencies. As shown here, it’s possible to install packages to an existing
environment. As a best practice, to avoid dependency conflicts, simultaneously install
all the packages you need in a specific environment.
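For example, rather than adding pytorch to myenv in a separate step as shown below, you could have created the environment with both packages in a single command (a sketch using the packages from this section):

%conda

create -n myenv python=3.10 keras pytorch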

Install Additional Packages


Install the pytorch package into the activated myenv environment.

%conda

install pytorch

List Packages in the Current Environment


List the packages installed in the current environment, and confirm that keras and pytorch are
installed.

%conda

list

The output is similar to the following:

# packages in environment at /u01/.conda/envs/myenv:


#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
blas 1.0 mkl
.
.
.
fftw 3.3.9 h27cfd23_1
future 0.18.2 py310h06a4308_1
intel-openmp 2021.4.0 h06a4308_3561
keras 2.10.0 py310h06a4308_0
keras-preprocessing 1.1.2 pyhd3eb1b0_0
ld_impl_linux-64 2.38 h1181459_1
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_1
.
.
.
numpy-base 1.23.3 py310h8e6c178_1
openssl 1.1.1s h7f8727e_0
pip 22.2.2 py310h06a4308_0
pycparser 2.21 pypi_0 pypi
python 3.10.6 haa1d7c7_1
pytorch 1.10.2 cpu_py310h6894f24_0
readline 8.2 h5eee18b_0
.
.
.
xz 5.2.6 h5eee18b_0
zlib 1.2.13 h5eee18b_0

The output above has been truncated and does not show the complete list of packages.
Uninstall Package
Libraries can be uninstalled from an environment using the uninstall command. Let’s uninstall
the pytorch package from the current environment.

%conda

uninstall pytorch

Verify Package was Uninstalled


List packages in current environment and verify that the pytorch package was uninstalled.

%conda

list

The output shown below does not contain the pytorch package.

# packages in environment at /u01/.conda/envs/myenv:


#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
blas 1.0 mkl
bzip2 1.0.8 h7b6447c_0
ca-certificates 2022.10.11 h06a4308_0
certifi 2022.9.24 py310h06a4308_0
cffi 1.15.1 py310h74dc2b5_0
fftw 3.3.9 h27cfd23_1
future 0.18.2 py310h06a4308_1
intel-openmp 2021.4.0 h06a4308_3561
keras 2.10.0 py310h06a4308_0
keras-preprocessing 1.1.2 pyhd3eb1b0_0
ld_impl_linux-64 2.38 h1181459_1
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_1
libgfortran-ng 11.2.0 h00389a5_1
libgfortran5 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.0.3 h7f8727e_2
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py310h7f8727e_0
mkl_fft 1.3.1 py310hd6ae3a3_0
mkl_random 1.2.2 py310h00e6091_0
ncurses 6.3 h5eee18b_3
ninja 1.10.2 h06a4308_5
ninja-base 1.10.2 hd09550d_5
numpy 1.23.3 py310hd5efca6_1
numpy-base 1.23.3 py310h8e6c178_1
openssl 1.1.1s h7f8727e_0
pip 22.2.2 py310h06a4308_0
pycparser 2.21 pypi_0 pypi
python 3.10.6 haa1d7c7_1
readline 8.2 h5eee18b_0
scipy 1.9.3 py310hd5efca6_0
setuptools 65.5.0 py310h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.39.3 h5082296_0
tk 8.6.12 h1ccaba5_0
typing-extensions 4.3.0 py310h06a4308_0
typing_extensions 4.3.0 py310h06a4308_0
tzdata 2022f h04d1e81_0
wheel 0.37.1 pyhd3eb1b0_0


xz 5.2.6 h5eee18b_0
zlib 1.2.13 h5eee18b_0

Removing Environments
If you don’t intend to upload the environment to Object Storage for the OML users in the
database, you can simply exit the notebook session and it will go out of scope. Alternatively, it
can be explicitly removed using the env remove command. Remove the myenv environment
and verify it was removed. A best practice is to deactivate the environment prior to removal.
For help on the env remove command, type env remove --help in the %conda interpreter.

• Deactivate the environment.

%conda

deactivate

Conda environment deactivated

• Remove the environment.

%conda

env remove -n myenv

Remove all packages in environment /u01/.conda/envs/myenv.

List the environments to verify that myenv was removed.

%conda

env list

# conda environments:
#
myrenv /u01/.conda/envs/myrenv
base * /usr
conda-pack-env /usr/envs/conda-pack-env

Specify Packages for Installation


Install Packages from the conda-forge Channel
Conda channels are the locations where packages are stored. They serve as the base for
hosting and managing packages. Conda packages are downloaded from remote channels,
which are URLs to directories containing conda packages. The conda command searches a
set of channels. By default, packages are automatically downloaded and updated from the
default channel. The conda-forge channel is free for all to use. You can modify what remote
channels are automatically searched. You might want to do this to maintain a private or internal
channel. We use the conda-forge channel, a community channel made up of thousands of
contributors, in the following examples.
• Install a specific version of a package.
To install a specific version of a package, use <package_name>=<version>.


• Create an environment using conda-forge.

%conda

create -n mychannelenv -c conda-forge python=3.10

activate mychannelenv

• Install a package from conda-forge by specifying the channel.

%conda

install scipy --channel conda-forge

• Install a specific version of a package.

%conda

install scipy=0.15.0

5.2 Administrative Tasks for Creating and Saving a Conda Environment
In OML Notebooks, user ADMIN can manage the lifecycle of the OML user’s conda
environments, including creating and deleting environments and installing and deleting
packages.
The conda environments created by user ADMIN are stored in an Object Storage bucket folder
associated with the Autonomous Database instance. OML users can download these conda
environments using enhanced conda commands. Conda environments are available after they
are downloaded and activated using the download and activate functions in a %conda
paragraph. An activated environment is available until it is deactivated.

Create a Conda Environment


As an ADMIN user in an OML notebook, specify a conda interpreter in a paragraph using
%conda, then use the create command to create a conda environment named sbenv to install
the seaborn package. Specify the Python version using the python parameter. Here, Python
3.12 is used for compatibility with OML4Py.

Note:
When conda installs a package into an environment, it also installs any required
dependencies. As a best practice, to avoid dependency conflicts, simultaneously
install all the packages you need in a specific environment.


Note:
Specify python=3.12 when creating a conda environment for a 3rd-party package to
avoid inconsistencies.

%conda
create -n sbenv -c conda-forge --strict-channel-priority python=3.12.1 seaborn

Upload the environment to Object Storage


Upload the environment to the Object Storage associated with the Autonomous Database
instance. Here you provide an environment description and a tag corresponding to an
application name, OML4Py.

%conda

upload sbenv --description 'Conda environment with seaborn' -t application "OML4PY"

Uploading conda environment sbenv

Upload successful for conda environment sbenv

The environment is now available for an OML user to download. The uploaded environment
will persist in Object Storage until it is deleted. The application tag is required for use with
embedded execution. For example, OML4Py embedded Python execution works with conda
environments containing the OML4Py tag, and OML4R embedded R execution works with
conda environments containing the OML4R tag.
There is one Object Storage bucket for each data center region. The conda environments are
saved to a folder in Object Storage corresponding to the tenancy and database. The folder is
managed by Autonomous Database and only available to users through OML notebooks.
There is an 8G maximum size for a single conda environment, and no size limit on Object
Storage.
Logged in as a non-administrator user, specify the conda interpreter in a notebook paragraph
using %conda. Get the list of environments saved in Object Storage using the list-saved-envs
command.

%conda

list-saved-envs

Provide the environment name as an argument to the -e parameter and request a list of
packages installed in the environment.

%conda

list-saved-envs -e sbenv --installed-packages


The output is similar to the following:

{
"name": "sbenv",
"size": "1.7 GiB",
"description": "Conda environment with seaborn",
"tags": {
"application": "OML4PY"
},
"number_of_installed_packages": 78,
"installed_packages": [
"blas-1.0-mkl",
"bottleneck-1.3.5-py39h7deecbd_0",
"brotli-1.0.9-h5eee18b_7",
"brotli-bin-1.0.9-h5eee18b_7",
"ca-certificates-2022.07.19-h06a4308_0",
"certifi-2022.9.14-py39h06a4308_0",
"cycler-0.11.0-pyhd3eb1b0_0",
"dbus-1.13.18-hb2f20db_0",
"expat-2.4.4-h295c915_0",
"fftw-3.3.9-h27cfd23_1",
"fontconfig-2.13.1-h6c09931_0",
"fonttools-4.25.0-pyhd3eb1b0_0",
"freetype-2.11.0-h70c0345_0",
"giflib-5.2.1-h7b6447c_0",
"glib-2.69.1-h4ff587b_1",
"gst-plugins-base-1.14.0-h8213a91_2",
"gstreamer-1.14.0-h28cd5cc_2",
"icu-58.2-he6710b0_3",
"intel-openmp-2021.4.0-h06a4308_3561",
"jpeg-9e-h7f8727e_0",
"kiwisolver-1.4.2-py39h295c915_0",
"lcms2-2.12-h3be6417_0",
"ld_impl_linux-64-2.38-h1181459_1",
"lerc-3.0-h295c915_0",
"libbrotlicommon-1.0.9-h5eee18b_7",
"libbrotlidec-1.0.9-h5eee18b_7",
"libbrotlienc-1.0.9-h5eee18b_7",
"libdeflate-1.8-h7f8727e_5",
"libffi-3.3-he6710b0_2",
"libgcc-ng-11.2.0-h1234567_1",
"libgfortran-ng-11.2.0-h00389a5_1",
"libgfortran5-11.2.0-h1234567_1",
"libpng-1.6.37-hbc83047_0",
"libstdcxx-ng-11.2.0-h1234567_1",
"libtiff-4.4.0-hecacb30_0",
"libuuid-1.0.3-h7f8727e_2",
"libwebp-1.2.2-h55f646e_0",
"libwebp-base-1.2.2-h7f8727e_0",
"libxcb-1.15-h7f8727e_0",
"libxml2-2.9.14-h74e7548_0",
"lz4-c-1.9.3-h295c915_1",
"matplotlib-3.5.2-py39h06a4308_0",
"matplotlib-base-3.5.2-py39hf590b9c_0",
"mkl-2021.4.0-h06a4308_640",
"mkl-service-2.4.0-py39h7f8727e_0",


"mkl_fft-1.3.1-py39hd3c417c_0",
"mkl_random-1.2.2-py39h51133e4_0",
"munkres-1.1.4-py_0",
"ncurses-6.3-h5eee18b_3",
"numexpr-2.8.3-py39h807cd23_0",
"numpy-1.22.3-py39he7a7128_0",
"numpy-base-1.22.3-py39hf524024_0",
"openssl-1.1.1q-h7f8727e_0",
"packaging-21.3-pyhd3eb1b0_0",
"pandas-1.4.4-py39h6a678d5_0",
"pcre-8.45-h295c915_0",
"pillow-9.2.0-py39hace64e9_1",
"pip-22.1.2-py39h06a4308_0",
"pyparsing-3.0.9-py39h06a4308_0",
"pyqt-5.9.2-py39h2531618_6",
"python-3.9.0-hdb3f193_2",
"python-dateutil-2.8.2-pyhd3eb1b0_0",
"pytz-2022.1-py39h06a4308_0",
"qt-5.9.7-h5867ecd_1",
"readline-8.1.2-h7f8727e_1",
"scipy-1.7.3-py39h6c91a56_2",
"seaborn-0.11.2-pyhd3eb1b0_0",
"setuptools-63.4.1-py39h06a4308_0",
"sip-4.19.13-py39h295c915_0",
"six-1.16.0-pyhd3eb1b0_1",
"sqlite-3.39.2-h5082296_0",
"tk-8.6.12-h1ccaba5_0",
"tornado-6.2-py39h5eee18b_0",
"tzdata-2022c-h04d1e81_0",
"wheel-0.37.1-pyhd3eb1b0_0",
"xz-5.2.5-h7f8727e_1",
"zlib-1.2.12-h5eee18b_3",
"zstd-1.5.2-ha4553b6_0"
]
}

Delete an environment saved in an Object Storage


Use the delete command to delete an environment saved in an Object Storage.

Note:
Only user ADMIN can delete an environment saved in an Object Storage.

%conda

delete sbenv

Deleting conda environment sbenv


Deletion successful for conda environment sbenv


5.3 OML User Tasks for Downloading an Available Conda Environment
Once user ADMIN installs the environment in Object Storage in the Autonomous Database, as
an OML user, you can download, activate, and use it in Python paragraphs in notebooks and
with embedded execution.

List all environments persisted in Object Storage


Get the list of environments saved in Object Storage using the list-saved-envs command.

%conda

list-saved-envs

Get information on a named environment persisted in Object Storage


Provide the environment name as an argument to the -e parameter and request information on
the environment.

%conda

list-saved-envs -e sbenv

The output is similar to the following:

{
"name": "sbenv",
"size": "1.2 GiB",
"description": "Conda environment with seaborn",
"tags": {
"application": "OML4PY"
},
"number_of_installed_packages": 60
}

Download and activate the environment


Use the download command to download an environment from Object Storage. To activate the
downloaded environment, use the activate command.

Note:
The paragraph that contains the download command must be the first paragraph in
the notebook.

%conda


download sbenv
activate sbenv

Downloading conda environment sbenv


Download successful for conda environment sbenv

List the packages available in the environment


Get the list of all the packages in an active environment using the list command.

%conda

list

The output is similar to the following:

# packages in environment at /u01/.conda/envs/sbenv:


#
# Name Version Build Channel
blas 1.0 mkl
bottleneck 1.3.5 py39h7deecbd_0
brotli 1.0.9 h5eee18b_7
brotli-bin 1.0.9 h5eee18b_7
ca-certificates 2022.07.19 h06a4308_0
certifi 2022.9.14 py39h06a4308_0
cycler 0.11.0 pyhd3eb1b0_0
dbus 1.13.18 hb2f20db_0
expat 2.4.4 h295c915_0
fftw 3.3.9 h27cfd23_1
fontconfig 2.13.1 h6c09931_0
fonttools 4.25.0 pyhd3eb1b0_0
freetype 2.11.0 h70c0345_0
giflib 5.2.1 h7b6447c_0
glib 2.69.1 h4ff587b_1
gst-plugins-base 1.14.0 h8213a91_2
gstreamer 1.14.0 h28cd5cc_2
icu 58.2 he6710b0_3
intel-openmp 2021.4.0 h06a4308_3561
jpeg 9e h7f8727e_0
kiwisolver 1.4.2 py39h295c915_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libbrotlicommon 1.0.9 h5eee18b_7
libbrotlidec 1.0.9 h5eee18b_7
libbrotlienc 1.0.9 h5eee18b_7
libdeflate 1.8 h7f8727e_5
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_1
libgfortran-ng 11.2.0 h00389a5_1
libgfortran5 11.2.0 h1234567_1
libpng 1.6.37 hbc83047_0
libstdcxx-ng 11.2.0 h1234567_1
libtiff 4.4.0 hecacb30_0


libuuid 1.0.3 h7f8727e_2


libwebp 1.2.2 h55f646e_0
libwebp-base 1.2.2 h7f8727e_0
libxcb 1.15 h7f8727e_0
libxml2 2.9.14 h74e7548_0
lz4-c 1.9.3 h295c915_1
matplotlib 3.5.2 py39h06a4308_0
matplotlib-base 3.5.2 py39hf590b9c_0
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py39h7f8727e_0
mkl_fft 1.3.1 py39hd3c417c_0
mkl_random 1.2.2 py39h51133e4_0
munkres 1.1.4 py_0
ncurses 6.3 h5eee18b_3
numexpr 2.8.3 py39h807cd23_0
numpy 1.22.3 py39he7a7128_0
numpy-base 1.22.3 py39hf524024_0
openssl 1.1.1q h7f8727e_0
packaging 21.3 pyhd3eb1b0_0
pandas 1.4.4 py39h6a678d5_0
pcre 8.45 h295c915_0
pillow 9.2.0 py39hace64e9_1
pip 22.1.2 py39h06a4308_0
pyparsing 3.0.9 py39h06a4308_0
pyqt 5.9.2 py39h2531618_6
python 3.9.0 hdb3f193_2
python-dateutil 2.8.2 pyhd3eb1b0_0
pytz 2022.1 py39h06a4308_0
qt 5.9.7 h5867ecd_1
readline 8.1.2 h7f8727e_1
scipy 1.7.3 py39h6c91a56_2
seaborn 0.11.2 pyhd3eb1b0_0
setuptools 63.4.1 py39h06a4308_0
sip 4.19.13 py39h295c915_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.39.2 h5082296_0
tk 8.6.12 h1ccaba5_0
tornado 6.2 py39h5eee18b_0
tzdata 2022c h04d1e81_0
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.5 h7f8727e_1
zlib 1.2.12 h5eee18b_3
zstd 1.5.2 ha4553b6_0

Example 5-1 Create a visualization using seaborn


The following example shows the use of the available packages in the installed and activated
environment. It imports pandas, seaborn, and matplotlib packages and loads the iris
dataset from the seaborn library as a pandas dataframe. The pairplot seaborn function plots
the pair-wise relationship between all the variables of the dataset.

%python

import pandas as pd
import seaborn as sb
from matplotlib import pyplot as plt


df = sb.load_dataset('iris')
sb.set_style("ticks")
sb.pairplot(df,hue = 'species',diag_kind = "kde",kind = "scatter",palette =
"husl")
plt.show()

The output of the example is the following.

Figure 5-1 Iris pair plot

Example 5-2 Create a string representation of the function and save it to the OML4Py
script repository
With OML4Py, functions are saved to the script repository using their string definition
representation so they can be run in embedded Python execution. Create a function sb_plot and,
after verifying that the function behaves as expected, provide it as a string (within triple quotes
for formatting) and save it to the OML4Py script repository. Use the oml.script.create function
to store a single user-defined Python function in the OML4Py script repository. The parameter
"sb_plot" is a string that specifies the name of the user-defined function. The parameter
func=sb_plot is the Python function to run.

%python

sb_plot = """def sb_plot():


import pandas as pd
import seaborn as sb
from matplotlib import pyplot as plt
df = sb.load_dataset('iris')
sb.set_style("ticks")
sb.pairplot(df,hue = 'species',diag_kind = "kde",kind = "scatter",palette
= "husl")
plt.show()"""

oml.script.create("sb_plot", func=sb_plot)

Use the Python API for embedded Python execution to run the user-defined Python function
you saved in the script repository.

%python

oml.do_eval(func="sb_plot", graphics=True)

The output of the example is the following:


Figure 5-2 Iris pair plot

Deactivate the current environment


Use the deactivate command to deactivate an environment.

Note:
Only one environment can be active at a time, so activating a new environment replaces the
currently active one. As a best practice, deactivate the environment before logging off.

%conda

deactivate

Conda environment deactivated


5.4 Using Conda Environments with Embedded Python Execution

This topic explains how to use conda environments when running user-defined functions (UDFs)
through the SQL and REST APIs for embedded Python execution.

Running UDFs in the SQL and REST APIs for embedded Python execution
Conda environments can be used by the OML4Py Python, SQL, and REST APIs. To use the
SQL and REST APIs for embedded Python execution, the following information is needed.
1. The token URL from the OML service console in Autonomous Database. For more
information on how to obtain the token URL and set the access token, see Access and
Authorization Procedures and Functions (Autonomous Database).
2. A script containing a user-defined Python function in the Oracle Machine Learning for
Python (OML4Py) script repository. For information on creating a script and saving it to the
script repository, see About Embedded Python Execution and the Script Repository.

Note:
To use a conda environment when calling OML4Py script execution endpoints,
specify the conda environment in the env_name field when using SQL, and the
envName field when using REST.

Run the Python UDF using the SQL API for embedded Python execution -
Asynchronous mode
Run a SELECT statement that calls the pyqEval function. The PAR_LST argument sets the
special control argument oml_graphics_flag to true so that the web server can capture
images rendered in the invoked script, and sets oml_async_flag to true to submit the job
asynchronously. In the OUT_FMT argument, the string 'PNG' specifies that the function returns
the response in a table with fixed columns (including an image bytes column). The SCR_NAME
parameter specifies the function sb_plot stored in the script repository. The ENV_NAME parameter
specifies the environment name mysbenv in which the script is run.

%script

set long 2000

SELECT * FROM table(pyqEval(


par_lst => '{"oml_graphics_flag":true, "oml_async_flag":true}',
out_fmt => 'PNG',
scr_name => 'sb_plot',
scr_owner=> NULL,
env_name => 'mysbenv'));

The output is similar to the following:

NAME VALUE
---------------------------


https://gcc59e2cf7a6f5f-oml4.adb-compdev1.us-
phoenix-1.oraclecloudapps.com/oml/api/py-scripts/v1/jobs/b82947a7-ec3a-4ca6-
bf86-54b3f2b3a4b0

Get the job status


Poll the job status using the pyqJobStatus function. If the job is still running, the return value
indicates that the job is still pending. When the job completes, the function returns the
location of the job result.

%script

set long 1000


SELECT VALUE from pyqJobStatus(job_id => 'b82947a7-ec3a-4ca6-
bf86-54b3f2b3a4b0');

The output returns the location of the job result:

NAME VALUE
---------------------------
https://gcc59e2cf7a6f5f-oml4.adb-compdev1.us-
phoenix-1.oraclecloudapps.com/oml/api/py-scripts/v1/jobs/b82947a7-ec3a-4ca6-
bf86-54b3f2b3a4b0/result

Retrieve the result

%script

set long 500


SELECT NAME, ID, VALUE, dbms_lob.substr(image,100,1) image FROM
pyqJobResult(job_id => 'b82947a7-ec3a-4ca6-bf86-54b3f2b3a4b0',
out_fmt=>'PNG');

The output is similar to the following:

NAME ID VALUE IMAGE


---------------------------
1
[{"0":0.0,"1":0.0,"2":0.2333333333,"accuracy":0.2333333333,"macro
avg":0.0777777778,"weighted avg":0.0544444444},
{"0":0.0,"1":0.0,"2":1.0,"accuracy":0.2333333333,"macro
avg":0.3333333333,"weighted avg":0.2333333333},
{"0":0.0,"1":0.0,"2":0.3783783784,"accuracy":0.2333333333,"macro
avg":0.1261261261,"weighted avg":0.0882882883},
{"0":11.0,"1":12.0,"2":7.0,"accuracy":0.2333333333,"macro avg":30.0,"weighted
avg":30.0}]
89504E470D0A1A0A0000000D494844520000046A000003E808060000008668185B000000397445
5874536F667477617265004D6174706C6F746C69622076657273696F6E332E362E322C20687474
70733A2F2F6D6174706C6F746C69622E6F72672F28E8


Run the Python UDF using the REST API for embedded Python execution
The following example runs the script named sb_plot using the OML4Py REST API for embedded
Python execution. The environment name parameter envName is set to mysbenv. The
graphicsFlag parameter is set to true so that the PNG image and the data from the function
are returned in JSON format.

$ curl -i -X POST --header "Authorization: Bearer ${token}" \


--header 'Content-Type: application/json' --header 'Accept: application/json'
\
-d '{"envName":"mysbenv", "graphicsFlag":true, "service":"LOW"}' \
"${omlserver}/oml/api/py-scripts/v1/do-eval/sb_plot"

The output is similar to the following:

NAME ID VALUE IMAGE


---------------------------
1 [{"0":0.0,"1":0.0,"2":0.2333333333,"accuracy":0.2333333333,"macro
avg":0.0777777778,"weighted avg":0.0544444444},
{"0":0.0,"1":0.0,"2":1.0,"accuracy":0.2333333333,"macro
avg":0.3333333333,"weighted avg":0.2333333333},
{"0":0.0,"1":0.0,"2":0.3783783784,"accuracy":0.2333333333,"macro
avg":0.1261261261,"weighted avg":0.0882882883},
{"0":11.0,"1":12.0,"2":7.0,"accuracy":0.2333333333,"macro avg":30.0,"weighted
avg":30.0}]
89504E470D0A1A0A0000000D494844520000046A000003E808060000008668185B000000397445
5874536F667477617265004D6174706C6F746C69622076657273696F6E332E362E322C20687474
70733A2F2F6D6174706C6F746C69622E6F72672F28E8

6 Get Started with Oracle Machine Learning for Python
Learn how to use OML4Py in Oracle Machine Learning Notebooks and how to move data
between the local Python session and the database.
These actions are described in the following topics.

Topics:
• Use OML4Py with Oracle Autonomous Database
OML4Py is available through the Python interpreter in Oracle Machine Learning Notebooks
in Oracle Autonomous Database.
• Use OML4Py with an On-Premises Oracle Database
After the OML4Py server and client components have been installed on your on-premises
Oracle database server and you have installed the OML4Py client on your local system,
you can connect your client Python session to the OML4Py server.
• Move Data Between the Database and a Python Session
With OML4Py functions, you can interact with data structures in a database schema.
• Save Python Objects in the Database
You can save Python objects in OML4Py datastores, which persist in the database.

6.1 Use OML4Py with Oracle Autonomous Database


OML4Py is available through the Python interpreter in Oracle Machine Learning Notebooks in
Oracle Autonomous Database.
For more information, see Get Started with Notebooks for Data Analysis and Data Visualization
in Using Oracle Machine Learning Notebooks.

6.2 Use OML4Py with an On-Premises Oracle Database


After the OML4Py server and client components have been installed on your on-premises
Oracle database server and you have installed the OML4Py client on your local system, you
can connect your client Python session to the OML4Py server.
To connect an OML4Py client to an on-premises Oracle database, you first import the oml
module and then connect as described in the following topics.
• About Connecting to an On-Premises Oracle Database
OML4Py client components connect a Python session to the OML4Py server components
on an on-premises Oracle database server.
• About Oracle Wallets
An Oracle wallet is a secure software container that stores authentication and signing
credentials for an Oracle Database.
• Connect to an Oracle Database
Establish an OML4Py connection to an on-premises Oracle database with oml.connect.


6.2.1 About Connecting to an On-Premises Oracle Database


OML4Py client components connect a Python session to the OML4Py server components on
an on-premises Oracle database server.
The connection makes the data in an on-premises Oracle database schema available to the
Python user. It also makes the processing power, memory, and storage capacities of the
database server available to the Python session through the OML4Py client interface. To use
that data and those capabilities, you must create a connection to the Oracle database server.
To use the Automatic Machine Learning (AutoML) capabilities of OML4Py, the following must
be true:
• A connection pool must be running on the server.
• You must explicitly use the automl argument in an oml.connect invocation to specify the
running connection pool on the server.

Note:
Before you can create an AutoML connection, a database administrator must first
activate the database-resident connection pool in your on-premises Oracle database
by issuing the following SQL statement:
EXECUTE DBMS_CONNECTION_POOL.START_POOL();

Once started, the connection pool remains in this state until a database administrator
explicitly stops it by issuing the following command:
EXECUTE DBMS_CONNECTION_POOL.STOP_POOL();

Note:
Because an AutoML connection requires more database resources than an
oml.connect connection without AutoML does, you should create an AutoML
connection only if you are going to use the AutoML classes.
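
For instance, assuming the connection pool has been started as shown above, an AutoML-enabled
connection might look like the following minimal sketch (the user, password, host, port, and
service name are placeholder values):

import oml

# Connect using the database-resident connection pool; automl=True requests
# an AutoML-enabled connection for the specified host, port, and service name.
oml.connect(user='oml_user', password='oml_user_password', host='myhost',
            port=1521, service_name='myservice', automl=True)

# Confirm that the connection is AutoML-enabled.
oml.isconnected(check_automl=True)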


Note:

• Only one type of connection can be active during a Python session: either a
connection with AutoML enabled or one without it enabled. You can, however,
terminate one type of connection and initiate the other type during the same
Python session. Terminating either type of connection results in the automatic
clean up of any temporary objects created in the session during that connection.
If you want to save any objects that you created in one type of connection before
changing to the other type, then save the objects in an OML4Py datastore before
invoking oml.connect again. You can then reload the objects after reconnecting.
• The oml.connect function uses the cx_Oracle Python package for database
connectivity. In some cases, you might want to use the cx_Oracle.connect
function of that package to connect to a database. That function has advantages
such as the following:
– Allows multiple connections to multiple databases, which might be useful when
running Embedded Python Execution functions
– Permits some SQL data manipulation language (DML) operations that are
not available in an oml.connect connection
For information on the cx_Oracle.connect function, see Connecting to Oracle
Database in the cx_Oracle documentation.
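
For instance, a minimal sketch of opening an independent connection with cx_Oracle (the host,
port, service name, and credentials are placeholder values; the cx_Oracle package must be
available in the client environment):

import cx_Oracle

# Build a connect descriptor and open a connection that is independent of
# any oml.connect session.
dsn = cx_Oracle.makedsn('myhost', 1521, service_name='myservice.example.com')
con = cx_Oracle.connect(user='oml_user', password='oml_user_password', dsn=dsn)

# Run a query through a cursor on this separate connection.
cur = con.cursor()
cur.execute('SELECT table_name FROM user_tables')
print(cur.fetchall())

cur.close()
con.close()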

OML4Py Connection Functions


The OML4Py functions related to database connections are the following.

Table 6-1 Connection Functions for OML4Py

Function Description
oml.connect Establishes an OML4Py connection to an Oracle database.
oml.disconnect Terminates the Oracle database connection.
oml.isconnected Indicates whether an active Oracle database connection exists.
oml.check_embed Indicates whether Embedded Python Execution is enabled in the
connected Oracle database.

6.2.2 About Oracle Wallets


An Oracle wallet is a secure software container that stores authentication and signing
credentials for an Oracle Database.
You can create an OML4Py connection to an Oracle Database instance by specifying an
Oracle wallet. For instructions on creating an Oracle wallet, see Managing the Secure External
Password Store for Password Credentials in Oracle Database Security Guide.
The Oracle wallet must contain a credential that specifies a tnsnames.ora entry such as the
following:

waltcon = (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=myhost)(PORT=1521))
(CONNECT_DATA=(SERVICE_NAME=myserv.example.com)))
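
On the OML4Py client, the wallet is typically located through the Oracle client network
configuration. The following is a minimal sketch, assuming the wallet and its tnsnames.ora and
sqlnet.ora files are in a hypothetical directory /path/to/wallet and that the waltcon alias shown
above is defined there:

import os
import oml

# Point the Oracle client at the directory containing tnsnames.ora and
# sqlnet.ora for the wallet (the path is a placeholder).
os.environ['TNS_ADMIN'] = '/path/to/wallet'

# Connect using the wallet alias; the username and password are empty because
# the credentials are read from the wallet.
oml.connect(user='', password='', dsn='waltcon')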


To be able to use an Oracle wallet to create an OML4Py connection in which you can use
Automatic Machine Learning (AutoML), the wallet must also have a credential that has a
tnsnames.ora entry for a server connection pool such as the following:

waltcon_pool = (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=myhost)


(PORT=1521))(CONNECT_DATA=(SID=mysid)(SERVER=pooled)))

Note:
Before you can create an AutoML connection, a database administrator must first
activate the database-resident connection pool in your on-premises Oracle database
by issuing the following SQL statement:
EXECUTE DBMS_CONNECTION_POOL.START_POOL();

Once started, the connection pool remains in this state until a database administrator
explicitly stops it by issuing the following command:
EXECUTE DBMS_CONNECTION_POOL.STOP_POOL();

For examples of creating a connection using an Oracle wallet, see Example 6-6 and
Example 6-7.

6.2.3 Connect to an Oracle Database


Establish an OML4Py connection to an on-premises Oracle database with oml.connect.

The oml.connect function establishes a connection to the user's schema in an on-premises
Oracle database.
The syntax of the oml.connect function is the following.

oml.connect(user=None, password=None, host=None, port=None, sid=None,
service_name=None, dsn=None, encoding='UTF-8', nencoding='UTF-8', automl=None)

To create a basic connection to the database, you can specify arguments to the oml.connect
function in the following mutually exclusive combinations:
• user, password, dsn
• user, password, host, port, sid
• user, password, host, port, service_name
The arguments specify the following values.

Table 6-2 Parameters to oml.connect

Parameter Description
user A string specifying a username.
password A string specifying the password for the user.
host A string specifying the name of the host machine on which the OML4Py server
is installed.

port An int or a string specifying the Oracle database port number on the host
machine.
sid A string specifying the system identifier (SID) of the Oracle database.
service_name A string specifying the service name of the Oracle database.
dsn A string specifying a data source name, which can be a TNS entry for the
database or a TNS alias in an Oracle Wallet.
encoding A string specifying the encoding to use for regular database strings.
nencoding A string specifying the encoding to use for national character set database
strings.
automl A string or a boolean specifying whether to enable an Automatic Machine
Learning (AutoML) connection, which uses the database-resident connection
pool.
If there is a connection pool running for a host, port, SID (or service name),
then you can specify that host, port, SID (or service name) and automl=True.
If the dsn argument is a data source name, then the automl argument must
be a data source name for a running connection pool.
If the dsn argument is a TNS alias, then the automl argument must be a TNS
alias for a connection pool specified in an Oracle Wallet.

To use the AutoML capabilities of OML4Py, the following must be true:


• A connection pool must be running on the server.
• You must explicitly use the automl argument in an oml.connect invocation to specify the
running connection pool on the server.

Note:
Before you can create an AutoML connection, a database administrator must first
activate the database-resident connection pool in your on-premises Oracle database
by issuing the following SQL statement:
EXECUTE DBMS_CONNECTION_POOL.START_POOL();

Once started, the connection pool remains in this state until a database administrator
explicitly stops it by issuing the following command:
EXECUTE DBMS_CONNECTION_POOL.STOP_POOL();

Only one active OML4Py connection can exist at a time during a Python session. If you call
oml.connect when an active connection already exists, then the oml.disconnect function is
implicitly invoked, any temporary objects that you created during the previous connection are
discarded, and the new connection is established. Before attempting to connect, you can
discover whether an active connection exists by using the oml.isconnected function.

You explicitly end a connection with the oml.disconnect function. If you do not invoke
oml.disconnect, then the connection is automatically terminated when the Python session
ends.


Examples
In the following examples, the values of some of the arguments to the oml.connect
function are string variables that are not declared in the example. To use any of the following
examples, replace the username, password, port, and variable argument values with the
values for your user and database.
Example 6-1 Connecting with a Host, Port, and SID
This example uses the host, port, and sid arguments. It also shows the use of the
oml.isconnected, oml.check_embed, and oml.disconnect functions.

import oml

oml.connect(user='oml_user', password='oml_user_password', host='myhost',


port=1521, sid='mysid')

# Verify that the connection exists.


oml.isconnected()

# Find out whether Embedded Python Execution is enabled in the


# database instance.
oml.check_embed()

# Disconnect from the database.


oml.disconnect()

# Verify that the connection has been terminated.


oml.isconnected()

Listing for This Example

>>> import oml


>>>
>>> oml.connect(user='oml_user', password='oml_user_password', host='myhost',
... port=1521, sid='mysid')
>>>
>>> # Verify that the connection exists.
... oml.isconnected()
True
>>>
>>> # Find out whether Embedded Python Execution is enabled in the
... # database instance.
... oml.check_embed()
True
>>>
>>> # Disconnect from the database.
... oml.disconnect()
>>>
>>> # Verify that the connection has been terminated.
... oml.isconnected()
False


Example 6-2 Connecting with Host, Port, and Service Name


This example uses the host, port, and service_name arguments.

import oml

oml.connect(user='oml_user', password='oml_user_password', host='myhost',


port=1521, service_name='myservice')

Example 6-3 Connecting with a DSN Containing a SID


This example uses the dsn argument to specify a SID.

import oml

mydsn = "(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=myhost)(PORT=1521))\
(CONNECT_DATA=(SID=mysid)))"
oml.connect(user='oml_user', password='oml_user_password', dsn=mydsn)

Example 6-4 Connecting with a DSN Containing a Service Name


This example uses the dsn argument to specify a service name.

import oml

myinst = "(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=myhost)\
(PORT=1521))\
(CONNECT_DATA=(SERVICE_NAME=myservice.example.com)))"
oml.connect(user='oml_user', password='oml_user_password', dsn=myinst)

Example 6-5 Creating a Connection with a DSN and with AutoML Enabled
This example creates an OML4Py connection with AutoML enabled. The example connects to
a local database.

import oml

mydsn = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=myhost)\
(PORT=1521))(CONNECT_DATA=(SID=mysid)))"

dsn_pool = "(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=myhost)\
(PORT=1521))\
(CONNECT_DATA=(SERVICE_NAME=myservice.example.com)\
(SERVER=POOLED)))"

oml.connect(user='oml_user', password='oml_user_password',
dsn=mydsn, automl=dsn_pool)

# Verify that the connection exists and that AutoML is enabled.


oml.isconnected(check_automl=True)


Example 6-6 Connecting with an Oracle Wallet


This example creates a connection using the dsn argument to specify an Oracle wallet. The
dsn value, waltcon in the example, must refer to the alias in the database tnsnames.ora file
that was used to create the appropriate credential in the wallet.

import oml

oml.connect(user='', password='', dsn='waltcon')

See Also:
About Oracle Wallets

Example 6-7 Connecting with an Oracle Wallet with AutoML Enabled


This example connects using an Oracle wallet to establish a connection with AutoML enabled
by using the dsn and automl arguments. The example then verifies that the connection has
AutoML enabled. The dsn and automl values, waltcon and waltcon_pool in the example, must
refer to aliases in the database tnsnames.ora file that were used to create the appropriate
credentials in the wallet.

import oml

oml.connect(user='', password='', dsn='waltcon', automl='waltcon_pool')


oml.isconnected(check_automl=True)

6.3 Move Data Between the Database and a Python Session


With OML4Py functions, you can interact with data structures in a database schema.
In your Python session, you can move data to and from the database and create temporary or
persistent database tables. The OML4Py functions that perform these actions are described in
the following topics.

Topics:
• About Moving Data Between the Database and a Python Session
Using the functions described in this topic, you can move data between your local
Python session and an Oracle database schema.
• Push Local Python Data to the Database
Use the oml.push function to push data from your local Python session to a temporary
table in your Oracle database schema.
• Pull Data from the Database to a Local Python Session
Use the pull method of an oml proxy object to create a Python object in your local Python
session.
• Create a Python Proxy Object for a Database Object
Use the oml.sync function to create a Python object as a proxy for a database table, view,
or SQL statement.


• Create a Persistent Database Table from a Python Data Set
Use the oml.create function to create a persistent table in your database schema from
data in your Python session.

6.3.1 About Moving Data Between the Database and a Python Session
Using the functions described in this topic, you can move data between your local
session and an Oracle database schema.
The following functions create proxy oml Python objects from database objects, create
database tables from Python objects, list the objects in the workspace, and drop tables and
views.

Function Definition
oml.create Creates a persistent database table from a Python data set.
oml.cursor Returns a cx_Oracle cursor object for the current OML4Py database
connection.
oml.dir Returns the names of the oml objects in the workspace.
oml.drop Drops a persistent database table or view.
oml_object.pull Creates a local Python object that contains a copy of the database data
referenced by the oml object.
oml.push Pushes data from the OML Notebooks Python session memory into a
temporary table in the database.
oml.sync Creates an oml.DataFrame proxy object in Python that represents a
database table, view, or query.

With the oml.push function, you can create a temporary database table, and its corresponding
proxy oml.DataFrame object, from a Python object in your local Python session. The temporary
table is automatically deleted when the OML Notebook or OML4Py client connection to the
database ends unless you have saved its proxy object to a datastore before disconnecting.
With the pull method of an oml object, you can create a local Python object that contains a
copy of the database data represented by an oml proxy object.

The oml.push function implicitly coerces Python data types to oml data types and the pull
method on oml objects coerces oml data types to Python data types.

With the oml.create function, you can create a persistent database table and a corresponding
oml.DataFrame proxy object from a Python data set.

With the oml.sync function, you can synchronize the metadata of a database table or view with
the oml object representing the database object.

With the oml.cursor function, you can create a cx_Oracle cursor object for the current
database connection. You can use the cursor to run queries against the database, as shown
in Example 6-13.
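
For instance, a minimal sketch of running an ad hoc query through the cursor, assuming an
active oml.connect session and the COFFEE table used in the examples later in this chapter:

# Create a cursor on the current OML4Py connection and run a simple query.
cr = oml.cursor()
cr.execute("SELECT COUNT(*) FROM COFFEE")
print(cr.fetchone())

# Close the cursor when finished.
cr.close()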

6.3.2 Push Local Python Data to the Database


Use the oml.push function to push data from your local Python session to a temporary table in
your Oracle database schema.
The oml.push function creates a temporary table in the user’s database schema and inserts
data into the table. It also creates and returns a corresponding proxy oml.DataFrame object


that references the table in the Python session. The table exists as long as an oml object exists
that references it, either in the Python session memory or in an OML4Py datastore.
The syntax of the oml.push function is the following:

oml.push(x, oranumber=True, dbtypes=None)

The x argument may be a pandas.DataFrame or a list of tuples of equal size that contain the
data for the table. For a list of tuples, each tuple represents a row in the table and the column
names are set to COL1, COL2, and so on.
The SQL data types of the columns are determined by the following:
• OML4Py determines default column types by looking at 20 random rows sampled from the
table. For tables with fewer than 20 rows, it uses all rows to determine the column type.
If the values in a column are all None, or if a column has inconsistent data types that are
not None in the sampled rows, then a default column type cannot be determined and a
ValueError is raised unless a SQL type for the column is specified by the dbtypes
argument.
• For numeric columns, the oranumber argument, which is a bool, determines the SQL data
type. If True (the default), then the SQL data type is NUMBER. If False, then the data type is
BINARY_DOUBLE.
If the data in x contains NaN values, then you should set oranumber to False.
• For string columns, the default type is VARCHAR2(4000).
• For bytes columns, the default type is BLOB.
With the dbtypes argument, you can specify the SQL data types for the table columns. The
values of dbtypes may be either a dict that maps str to str values or a list of str values. For
a dict, the keys are the names of the columns.

Example 6-8 Pushing Data to a Database Table


This example creates pd_df, a pandas.core.frame.DataFrame object with columns of various
data types. It pushes pd_df to a temporary database table, which creates the oml_df object,
which references the table. It then pulls the data from the oml_df object to the df object in local
memory.

import oml
import pandas as pd

pd_df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, None],


'string' : [None, None, 'a', 'a', 'a', 'b'],
'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})

# Push the data set to a database table with the specified dbtypes
# for each column.
oml_df = oml.push(pd_df, dbtypes = {'numeric': 'BINARY_DOUBLE',
'string':'CHAR(1)',
'bytes':'RAW(1)'})

# Display the data type of oml_df.


type(oml_df)

# Pull the data from oml_df into local memory.


df = oml_df.pull()

# Display the data type of df.


type(df)

# Create a list of tuples.


lst = [(1, None, b'a'), (1.4, None, b'b'), (-4, 'a', b'c'),
(3.145, 'a', b'c'), (5, 'a', b'd'), (None, 'b', b'e')]

# Create an oml.DataFrame using the list.


oml_df2 = oml.push(lst, dbtypes = ['BINARY_DOUBLE','CHAR(1)','RAW(1)'])

type(oml_df2)

Listing for This Example

>>> import oml


>>> import pandas as pd
>>>
>>> pd_df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, None],
... 'string' : [None, None, 'a', 'a', 'a', 'b'],
... 'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})
>>>
>>> # Push the data set to a database table with the specified dbtypes
... # for each column.
... oml_df = oml.push(pd_df, dbtypes = {'numeric': 'BINARY_DOUBLE',
... 'string':'CHAR(1)',
... 'bytes':'RAW(1)'})
>>>
>>> # Display the data type of oml_df.
... type(oml_df)
<class 'oml.core.frame.DataFrame'>
>>>
>>> # Pull the data from oml_df into local memory.
... df = oml_df.pull()
>>>
>>> # Display the data type of df.
... type(df)
<class 'pandas.core.frame.DataFrame'>
>>>
>>> # Create a list of tuples.
... lst = [(1, None, b'a'), (1.4, None, b'b'), (-4, 'a', b'c'),
... (3.145, 'a', b'c'), (5, 'a', b'd'), (None, 'b', b'e')]
>>>
>>> # Create an oml.DataFrame using the list.
... oml_df2 = oml.push(lst, dbtypes = ['BINARY_DOUBLE','CHAR(1)','RAW(1)'])
>>>
>>> type(oml_df2)
<class 'oml.core.frame.DataFrame'>

6.3.3 Pull Data from the Database to a Local Python Session


Use the pull method of an oml proxy object to create a Python object in your local Python
session.


The pull method of an oml object returns a Python object of the same type. The object
contains a copy of the database data referenced by the oml object. The Python object exists in
memory in the Python session in OML Notebooks or in your OML4Py client Python session.

Note:
You can pull data to a local pandas.DataFrame only if the data can fit into the local
Python session memory. Also, even if the data fits in memory, a very large object may
prevent you from running many, or any, Python functions in the local Python session.

Example 6-9 Pulling Data into Local Memory


This example loads the iris data set and creates the IRIS database table and the oml_iris
proxy object that references that table. It displays the type of the oml_iris object, then pulls
the data from it to the iris object in local memory and displays its type.

import oml
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
x = pd.DataFrame(iris.data, columns = ['SEPAL_LENGTH','SEPAL_WIDTH',
                                       'PETAL_LENGTH','PETAL_WIDTH'])
y = pd.DataFrame(list(map(lambda x: {0: 'setosa', 1: 'versicolor', 2:
                          'virginica'}[x], iris.target)), columns = ['SPECIES'])
iris_df = pd.concat([x, y], axis=1)

oml_iris = oml.create(iris_df, table = 'IRIS')

# Display the data type of oml_iris.


type(oml_iris)

# Pull the data from oml_iris into local memory.


iris = oml_iris.pull()

# Display the data type of iris.


type(iris)

# Drop the IRIS database table.


oml.drop('IRIS')

Listing for This Example

>>> import oml


>>> from sklearn.datasets import load_iris
>>> import pandas as pd
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = load_iris()
>>> x = pd.DataFrame(iris.data, columns = ['SEPAL_LENGTH','SEPAL_WIDTH',
...                                        'PETAL_LENGTH','PETAL_WIDTH'])
>>> y = pd.DataFrame(list(map(lambda x: {0: 'setosa', 1: 'versicolor', 2:
...                           'virginica'}[x], iris.target)), columns = ['SPECIES'])
>>> iris_df = pd.concat([x, y], axis=1)
>>>
>>> oml_iris = oml.create(iris_df, table = 'IRIS')

>>>
>>> # Display the data type of oml_iris.
... type(oml_iris)
<class 'oml.core.frame.DataFrame'>
>>>
>>> # Pull the data from oml_iris into local memory.
... iris = oml_iris.pull()
>>>
>>> # Display the data type of iris.
... type(iris)
<class 'pandas.core.frame.DataFrame'>
>>>
>>> # Drop the IRIS database table.
... oml.drop('IRIS')

6.3.4 Create a Python Proxy Object for a Database Object


Use the oml.sync function to create a Python object as a proxy for a database table, view, or
SQL statement.
The oml.sync function returns an oml.DataFrame object or a dictionary of oml.DataFrame
objects. The oml.DataFrame object returned by oml.sync is a proxy for the database object.

You can use the proxy oml.DataFrame object to select data from the table. When you run a
Python function that selects data from the table, the function returns the current data from the
database object. However, if some application has added a column to the table, or has
otherwise changed the metadata of the database object, the oml.DataFrame proxy object does
not reflect such a change until you again invoke oml.sync for the database object.

Tip:
To conserve memory resources and save time, you should only create proxies for the
tables that you want to use in your Python session.

You can use the oml.dir function to list the oml.DataFrame proxy objects in the environment
for a schema.
The syntax of the oml.sync function is the following:

oml.sync(schema=None, regex_match=False, table=None, view=None, query=None)

The schema argument in oml.sync specifies the name of the schema where the database
object exists. If schema=None, which is the default, then the current schema is used.


To create an oml.DataFrame object for a table, use the table parameter. To create one for a
view, use the view parameter. To create one for a SQL SELECT statement, use the query
parameter. You can only specify one of these parameters in an oml.sync invocation: the
argument for one of the parameters must be a string and the argument for each of the other
two parameters must be None.

Creating a proxy object for a query enables you to create an oml.DataFrame object without
creating a view in the database. This can be useful when you do not have the CREATE VIEW
system privilege for the current schema. You cannot use the schema parameter and the query
parameter in the same oml.sync invocation.

With the regex_match argument, you can specify whether the value of the table or view
argument is a regular expression. If regex_match=True, then oml.sync creates oml.DataFrame
objects for each database object that matches the pattern. The matched tables or views are
returned in a dict with the table or view names as keys.

Example 6-10 Creating a Python Object for a Database Table


This example creates an oml.DataFrame Python object as a proxy for a database table. For
this example, the table COFFEE exists in the user's schema.

import oml

# Create the Python object oml_coffee as a proxy for the


# database table COFFEE.
oml_coffee = oml.sync(table = 'COFFEE')
type(oml_coffee)

# List the proxy objects in the schema.


oml.dir()

oml_coffee.head()

Listing for This Example

>>> import oml


>>>
>>> # Create the Python object oml_coffee as a proxy for the
... # database table COFFEE.
... oml_coffee = oml.sync(table = 'COFFEE')
>>> type(oml_coffee)
<class 'oml.core.frame.DataFrame'>
>>>
>>> # List the proxy objects in the schema.
... oml.dir()
['oml_coffee']
>>>
>>> oml_coffee.head()
ID COFFEE WINDOW
0 1 esp w
1 2 cap d
2 3 cap w
3 4 kon w
4 5 ice w


Example 6-11 Using the regex_match Argument


This example uses the regex_match argument in creating a dict object that contains
oml.DataFrame proxy objects for tables whose names start with C. For this example, the
COFFEE and COLOR tables exist in the user's schema and are the only tables whose names
start with C.

# Create a dict of oml.DataFrame proxy objects for tables


# whose names start with 'C'.
oml_cdat = oml.sync(table="^C", regex_match=True)

oml_cdat.keys()
oml_cdat['COFFEE'].columns
oml_cdat['COLOR'].columns

Listing for This Example

>>> # Create a dict of oml.DataFrame proxy objects for tables


... # whose names start with 'C'.
... oml_cdat = oml.sync(table="^C", regex_match=True)
>>>
>>> oml_cdat.keys()
dict_keys(['COFFEE', 'COLOR'])
>>> oml_cdat['COFFEE'].columns
['ID', 'COFFEE', 'WINDOW']
>>> oml_cdat['COLOR'].columns
['REGION', 'EYES', 'HAIR', 'COUNT']

Example 6-12 Synchronizing an Updated Table


This example uses oml.sync to create an oml.DataFrame for the database table COFFEE. For
the example, the new column BREW has been added to the database table by some other
database process after the first invocation of oml.sync. Invoking oml.sync again synchronizes
the metadata of the oml.DataFrame with those of the table.

oml_coffee = oml.sync(table = "COFFEE")


oml_coffee.columns

# After a new column has been inserted into the table.


oml_coffee = oml.sync(table = "COFFEE")
oml_coffee.columns

Listing for This Example

>>> oml_coffee = oml.sync(table = "COFFEE")


>>> oml_coffee.columns
['ID', 'COFFEE', 'WINDOW']
>>>
>>> # After a new column has been inserted into the table.
... oml_coffee = oml.sync(table = "COFFEE")
>>> oml_coffee.columns
['ID', 'COFFEE', 'WINDOW', 'BREW']


6.3.5 Create a Persistent Database Table from a Python Data Set


Use the oml.create function to create a persistent table in your database schema from data in
your Python session.
The oml.create function creates a table in the database schema and returns an
oml.DataFrame object that is a proxy for the table. The proxy oml.DataFrame object has the
same name as the table.

Note:
When creating a table in Oracle Machine Learning for Python, if you use lowercase
or mixed case for the name of the table, then you must use the same lowercase or
mixed case name in double quotation marks when using the table in a SQL query or
function. If, instead, you use an all uppercase name when creating the table, then the
table name is case-insensitive: you can use uppercase, lowercase, or mixed case
when using the table without using double quotation marks. The same is true for
naming columns in a table.

You can delete the persistent table in a database schema with the oml.drop function.

Caution:
Use the oml.drop function to delete a persistent database table. Use the del
statement to remove an oml.DataFrame proxy object and its associated temporary
table; del does not delete a persistent table.
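
The following minimal sketch illustrates both points; the table name sales_tmp is hypothetical:

import oml
import pandas as pd

df = pd.DataFrame({'COL1': [1, 2, 3]})

# A lowercase table name is case-sensitive, so it must be double-quoted in SQL.
oml_sales = oml.create(df, table = 'sales_tmp')
cr = oml.cursor()
cr.execute('SELECT COUNT(*) FROM "sales_tmp"')
print(cr.fetchone())
cr.close()

# del removes only the proxy object; the persistent table remains.
del oml_sales

# oml.drop deletes the persistent table itself.
oml.drop('sales_tmp')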

The syntax of the oml.create function is the following:

oml.create(x, table, oranumber=True, dbtypes=None, append=False)

The x argument is a pandas.DataFrame or a list of tuples of equal size that contain the data for
the table. For a list of tuples, each tuple represents a row in the table and the column names
are set to COL1, COL2, and so on. The table argument is a string that specifies a name for
the table.
The SQL data types of the columns are determined by the following:
• OML4Py determines default column types by looking at 20 random rows sampled from the
table. For tables with fewer than 20 rows, it uses all rows to determine the column type.
If the values in a column are all None, or if a column has inconsistent data types that are
not None in the sampled rows, then a default column type cannot be determined and a
ValueError is raised unless a SQL type for the column is specified by the dbtypes
argument.
• For numeric columns, the oranumber argument, which is a bool, determines the SQL data
type. If True (the default), then the SQL data type is NUMBER. If False, then the data type
is BINARY_DOUBLE.
If the data in x contains NaN values, then you should set oranumber to False.


• For string columns, the default type is VARCHAR2(4000).


• For bytes columns, the default type is BLOB.
With the dbtypes parameter, you can specify the SQL data types for the table columns. The
values of dbtypes may be either a dict that maps str to str values or a list of str values. For
a dict, the keys are the names of the columns. The dbtypes parameter is ignored if the
append argument is True.

The append argument is a bool that specifies whether to append the x data to an existing table.

Example 6-13 Creating Database Tables from a Python Data Set


This example creates a cursor object for the database connection, creates a
pandas.core.frame.DataFrame with columns of various data types, then creates a series of
tables using different oml.create parameters and shows the SQL data types of the table
columns.

import oml

# Create a cursor object for the current OML4Py database


# connection to run queries and get information from the database.
cr = oml.cursor()

import pandas as pd

df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, 2],


'string' : [None, None, 'a', 'a', 'a', 'b'],
'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})

# Get the order of the columns


df.columns

# Create a table with the default parameters.


oml_df1 = oml.create(df, table = 'tbl1')

# Show the default SQL data types of the columns.


_ = cr.execute("select data_type from all_tab_columns where table_name =
'tbl1'")
cr.fetchall()

# Create a table with oranumber set to False.


oml_df2 = oml.create(df, table = 'tbl2', oranumber = False)

# Show the SQL data types of the columns.


_ = cr.execute("select data_type from all_tab_columns where table_name =
'tbl2'")
cr.fetchall()

# Create a table with dbtypes specified as a dict mapping column names


# to SQL data types.
oml_df3 = oml.create(df, table = 'tbl3',
dbtypes = {'numeric': 'BINARY_DOUBLE',
'bytes':'RAW(1)'})

# Show the SQL data types of the columns.


_ = cr.execute("select data_type from all_tab_columns where table_name =

'tbl3'")
cr.fetchall()

# Create a table with dbtypes specified as a list of SQL data types


# matching the order of the columns.
oml_df4 = oml.create(df, table = 'tbl4',
dbtypes = ['BINARY_DOUBLE','VARCHAR2','RAW(1)'])

# Show the SQL data type of the columns.


_ = cr.execute("select data_type from all_tab_columns where table_name =
'tbl4'")
cr.fetchall()

# Create a table from a list of tuples.


lst = [(1, None, b'a'), (1.4, None, b'b'), (-4, 'a', b'c'),
(3.145, 'a', b'c'), (5, 'a', b'd'), (None, 'b', b'e')]
oml_df5 = oml.create(lst, table = 'tbl5',
dbtypes = ['BINARY_DOUBLE','CHAR(1)','RAW(1)'])

# Close the cursor


cr.close()

# Drop the tables.


oml.drop('tbl1')
oml.drop('tbl2')
oml.drop('tbl3')
oml.drop('tbl4')
oml.drop('tbl5')

Listing for This Example

>>> import oml


>>>
>>> # Create a cursor object for the current OML4Py database
... # connection to run queries and get information from the database.
... cr = oml.cursor()
>>>
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, 2],
... 'string' : [None, None, 'a', 'a', 'a', 'b'],
... 'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})
>>>
>>> # Get the order of the columns.
... df.columns
Index(['numeric', 'string', 'bytes'], dtype='object')
>>>
>>> # Create a table with the default parameters.
... oml_df1 = oml.create(df, table = 'tbl1')
>>>
>>> # Show the default SQL data types of the columns.
... _ = cr.execute("select data_type from all_tab_columns where table_name =
'tbl1'")
>>> cr.fetchall()
[('NUMBER',), ('VARCHAR2',), ('BLOB',)]


>>>
>>> # Create a table with oranumber set to False.
... oml_df2 = oml.create(df, table = 'tbl2', oranumber = False)
>>>
>>> # Show the SQL data types of the columns.
... _ = cr.execute("select data_type from all_tab_columns where table_name =
'tbl2'")
>>> cr.fetchall()
[('BINARY_DOUBLE',), ('VARCHAR2',), ('BLOB',)]
>>>
>>> # Create a table with dbtypes specified as a dict mapping column names
... # to SQL data types.
... oml_df3 = oml.create(df, table = 'tbl3',
... dbtypes = {'numeric': 'BINARY_DOUBLE',
... 'bytes':'RAW(1)'})
>>>
>>> # Show the SQL data type of the columns.
... _ = cr.execute("select data_type from all_tab_columns where table_name =
'tbl3'")
>>> cr.fetchall()
[('BINARY_DOUBLE',), ('VARCHAR2',), ('RAW',)]
>>>
>>> # Create a table with dbtypes specified as a list of SQL data types
... # matching the order of the columns.
... oml_df4 = oml.create(df, table = 'tbl4',
... dbtypes = ['BINARY_DOUBLE','CHAR(1)', 'RAW(1)'])
>>>
>>> # Show the SQL data type of the columns
... _ = cr.execute("select data_type from all_tab_columns where table_name =
'tbl4'")
>>> cr.fetchall()
[('BINARY_DOUBLE',), ('CHAR',), ('RAW',)]
>>>
>>> # Create a table from a list of tuples.
... lst = [(1, None, b'a'), (1.4, None, b'b'), (-4, 'a', b'c'),
... (3.145, 'a', b'c'), (5, 'a', b'd'), (None, 'b', b'e')]
>>> oml_df5 = oml.create(lst, table ='tbl5',
... dbtypes = ['BINARY_DOUBLE','CHAR(1)','RAW(1)'])
>>>
>>> # Show the SQL data type of the columns.
... _ = cr.execute("select data_type from all_tab_columns where table_name =
'tbl5'")
>>> cr.fetchall()
[('BINARY_DOUBLE',), ('CHAR',), ('RAW',)]
>>>
>>> # Close the cursor.
... cr.close()
>>>
>>> # Drop the tables
... oml.drop('tbl1')
>>> oml.drop('tbl2')
>>> oml.drop('tbl3')
>>> oml.drop('tbl4')
>>> oml.drop('tbl5')


6.4 Save Python Objects in the Database


You can save Python objects in OML4Py datastores, which persist in the database.
You can grant or revoke read privilege access to a datastore or its objects to one or more
users. You can restore the saved objects in another Python session.
The following topics describe the OML4Py functions for creating and managing datastores:

Topics:
• About OML4Py Datastores
In an OML4Py datastore, you can store Python objects, which you can then use in
subsequent Python sessions; you can also make them available to other users or
programs.
• Save Objects to a Datastore
The oml.ds.save function saves one or more Python objects to a datastore.
• Load Saved Objects From a Datastore
The oml.ds.load function loads one or more Python objects from a datastore into a
Python session.
• Get Information About Datastores
The oml.ds.dir function provides information about datastores.
• Get Information About Datastore Objects
The oml.ds.describe function provides information about the objects in a datastore.
• Delete Datastore Objects
The oml.ds.delete function deletes datastores or objects in a datastore.
• Manage Access to Stored Objects
The oml.grant and oml.revoke functions grant or revoke the read privilege to datastores
or to user-defined Python functions in the script repository.

6.4.1 About OML4Py Datastores


In an OML4Py datastore, you can store Python objects, which you can then use in subsequent
Python sessions; you can also make them available to other users or programs.
Python objects, including OML4Py proxy objects, exist only for the duration of the current
Python session unless you explicitly save them. You can save a Python object, including oml
proxy objects, to a named datastore and then load that object in a later Python session,
including an Embedded Python Execution session. OML4Py creates the datastore in the user’s
database schema. A datastore, and the objects it contains, persist in the database until you
delete them.
You can grant or revoke read privilege permission to another user to a datastore that you
created or to objects in a datastore.
OML4Py has Python functions for managing objects in a datastore. It also has PL/SQL
procedures for granting or revoking the read privilege and database views for listing available
datastores and their contents.
Using a datastore, you can do the following:
• Save OML4Py and other Python objects that you create in one Python session and load
them in another Python session.


• Pass arguments to Python functions for use in Embedded Python Execution.


• Pass objects for use in Embedded Python Execution. You could, for example, use the
oml.glm class to build an Oracle Machine Learning model and save it in a datastore. You
could then use that model to score data in the database through Embedded Python
Execution.
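
A minimal sketch of that last pattern, assuming the ds_pymodel datastore created later in
Example 6-14, which stores the native scikit-learn model regr1:

def score_with_saved_model():
    import oml
    import numpy as np
    # Load the saved scikit-learn model from the datastore into this embedded
    # Python execution session as a dict of name/object pairs.
    objs = oml.ds.load(name="ds_pymodel", objs=["regr1"], to_globals=False)
    regr1 = objs["regr1"]
    # Score one hypothetical all-zero row with the same number of features (13)
    # that the model was trained on.
    row = np.zeros((1, 13))
    return float(regr1.predict(row)[0])

# Run the user-defined function through embedded Python execution.
res = oml.do_eval(func=score_with_saved_model)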

Python Interface for Datastores


The following table lists the Python functions for saving and managing objects in a datastore.

Function Description
oml.ds.delete Deletes one or more datastores or Python objects from a datastore.
oml.ds.dir Lists the datastores available to the current user.
oml.ds.load Loads Python objects from a datastore into the user’s session.
oml.ds.save Saves Python objects to a named datastore in the user’s database schema.

The following table lists the Python functions for managing access to datastores and datastore
objects.

Function Description
oml.grant Grants read privilege permission to another user to a datastore or a user-
defined Python function in the script repository owned by the current user.
oml.revoke Revokes the read privilege permission that was granted to another user to a
datastore or a user-defined Python function in the script repository owned by
the current user.

6.4.2 Save Objects to a Datastore


The oml.ds.save function saves one or more Python objects to a datastore.

OML4Py creates the datastore in the current user’s schema.


The syntax of oml.ds.save is the following:

oml.ds.save(objs, name, description=' ', grantable=None,


overwrite=False, append=False, compression=False)

The objs argument is a dict that contains the name and object pairs to save to the datastore
specified by the name argument.

With the description argument, you can provide some descriptive text that appears when you
get information about the datastore. The description parameter has no effect when used with
the append parameter.

With the grantable argument, you can specify whether the read privilege to the datastore may
be granted to other users.
If you set the overwrite argument to True, then you can replace an existing datastore with
another datastore of the same name.
If you set the append argument to True, then you can add objects to an existing datastore. The
overwrite and append arguments are mutually exclusive.


If you set compression to True, then the serialized Python objects are compressed in the
datastore.
Example 6-14 Saving Python Objects to a Datastore
This example demonstrates creating datastores.

import oml
from sklearn import datasets
from sklearn import linear_model
import pandas as pd

# Load three data sets and create oml.DataFrame objects for them.
wine = datasets.load_wine()
x = pd.DataFrame(wine.data, columns = wine.feature_names)
y = pd.DataFrame(wine.target, columns = ['Class'])

# Create the database table WINE.


oml_wine = oml.create(pd.concat([x, y], axis=1), table = 'WINE')
oml_wine.columns

diabetes = datasets.load_diabetes()
x = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.DataFrame(diabetes.target, columns=['disease_progression'])
oml_diabetes = oml.create(pd.concat([x, y], axis=1),
table = "DIABETES")
oml_diabetes.columns

boston = datasets.load_boston()
x = pd.DataFrame(boston.data, columns = boston.feature_names.tolist())
y = pd.DataFrame(boston.target, columns = ['Value'])
oml_boston = oml.create(pd.concat([x, y], axis=1), table = "BOSTON")
oml_boston.columns

# Save the wine Bunch object to the datastore directly,


# along with the oml.DataFrame proxy object for the BOSTON table.
oml.ds.save(objs={'wine':wine, 'oml_boston':oml_boston},
name="ds_pydata", description = "python datasets")

# Save the oml_diabetes proxy object to an existing datastore.


oml.ds.save(objs={'oml_diabetes':oml_diabetes},
name="ds_pydata", append=True)

# Save the oml_wine proxy object to another datastore.


oml.ds.save(objs={'oml_wine':oml_wine},
name="ds_wine_data", description = "wine dataset")

# Create regression models using sklearn and oml.


# The regr1 linear model is a native Python object.
regr1 = linear_model.LinearRegression()
regr1.fit(boston.data, boston.target)
# The regr2 GLM model is an oml object.
regr2 = oml.glm("regression")
X = oml_boston.drop('Value')
y = oml_boston['Value']
regr2 = regr2.fit(X, y)


# Save the native Python object and the oml proxy object to a datastore
# and allow the read privilege to be granted to them.
oml.ds.save(objs={'regr1':regr1, 'regr2':regr2},
name="ds_pymodel", grantable=True)

# Grant the read privilege to the datastore to every user.


oml.grant(name="ds_pymodel", typ="datastore", user=None)

# List the datastores to which the read privilege has been granted.
oml.ds.dir(dstype="grant")

Listing for This Example

>>> import oml


>>> from sklearn import datasets
>>> from sklearn import linear_model
>>> import pandas as pd
>>>
>>> # Load three data sets and create oml.DataFrame objects for them.
>>> wine = datasets.load_wine()
>>> x = pd.DataFrame(wine.data, columns = wine.feature_names)
>>> y = pd.DataFrame(wine.target, columns = ['Class'])
>>>
>>> # Create the database table WINE.
... oml_wine = oml.create(pd.concat([x, y], axis=1), table = 'WINE')
>>> oml_wine.columns
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline', 'Class']
>>>
>>> diabetes = datasets.load_diabetes()
>>> x = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
>>> y = pd.DataFrame(diabetes.target, columns=['disease_progression'])
>>> oml_diabetes = oml.create(pd.concat([x, y], axis=1),
... table = "DIABETES")
>>> oml_diabetes.columns
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6',
'disease_progression']
>>>
>>> boston = datasets.load_boston()
>>> x = pd.DataFrame(boston.data, columns = boston.feature_names.tolist())
>>> y = pd.DataFrame(boston.target, columns = ['Value'])
>>> oml_boston = oml.create(pd.concat([x, y], axis=1), table = "BOSTON")
>>> oml_boston.columns
['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'Value']
>>>
>>> # Save the wine Bunch object to the datastore directly,
... # along with the oml.DataFrame proxy object for the BOSTON table.
... oml.ds.save(objs={'wine':wine, 'oml_boston':oml_boston},
... name="ds_pydata", description = "python datasets")
>>>
>>> # Save the oml_diabetes proxy object to an existing datastore.
... oml.ds.save(objs={'oml_diabetes':oml_diabetes},
... name="ds_pydata", append=True)
>>>
>>> # Save the oml_wine proxy object to another datastore.
... oml.ds.save(objs={'oml_wine':oml_wine},
... name="ds_wine_data", description = "wine dataset")
>>>
>>> # Create regression models using sklearn and oml.
... # The regr1 linear model is a native Python object.
... regr1 = linear_model.LinearRegression()
>>> regr1.fit(boston.data, boston.target)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> # The regr2 GLM model is an oml proxy object.
... regr2 = oml.glm("regression")
>>> X = oml_boston.drop('Value')
>>> y = oml_boston['Value']
>>> regr2 = regr2.fit(X, y)
>>>
>>> # Save the native Python object and the oml proxy object to a datastore
... # and allow the read privilege to be granted to them.
... oml.ds.save(objs={'regr1':regr1, 'regr2':regr2},
... name="ds_pymodel", grantable=True)
>>>
>>> # Grant the read privilege to the ds_pymodel datastore to every user.
... oml.grant(name="ds_pymodel", typ="datastore", user=None)
>>>
>>> # List the datastores to which the read privilege has been granted.
... oml.ds.dir(dstype="grant")
datastore_name grantee
0 ds_pymodel PUBLIC

6.4.3 Load Saved Objects From a Datastore


The oml.ds.load function loads one or more Python objects from a datastore into a Python
session.
The syntax of oml.ds.load is the following:

oml.ds.load(name, objs=None, owner=None, to_globals=True)

The name argument specifies the datastore that contains the objects to load.

With the objs argument, you identify a specific object or a list of objects to load.

With the boolean to_globals parameter, you can specify whether the objects are loaded to a
global workspace or to a dictionary object. If the argument to to_globals is True, then
oml.ds.load function loads the objects into the global workspace. If the argument is False,
then the function returns a dict object that contains pairs of object names and values.

The oml.ds.load function raises a ValueError if the name argument is an empty string or if the
owner of the datastore is not the current user and the read privilege for the datastore has not
been granted to the current user.


Example 6-15 Loading Objects from Datastores


This example loads objects from datastores. For the creation of the datastores used in this
example, see Example 6-14.

import oml

# Load all Python objects from a datastore to the global workspace.


sorted(oml.ds.load(name="ds_pydata"))

# Load the named Python object from the datastore to the global workspace.
oml.ds.load(name="ds_pymodel", objs=["regr2"])

# Load the named Python object from the datastore to the user's workspace.
oml.ds.load(name="ds_pymodel", objs=["regr1"], to_globals=False)

Listing for This Example

>>> import oml


>>>
>>> # Load all Python objects from a datastore to the current workspace.
... sorted(oml.ds.load(name="ds_pydata"))
['oml_boston', 'oml_diabetes', 'wine']
>>>
>>> # Load the named Python object from the datastore to the global workspace.
... oml.ds.load(name="ds_pymodel", objs=["regr2"])
['regr2']
>>>
>>> # Load the named Python object from the datastore to the user's workspace.
... oml.ds.load(name="ds_pymodel", objs=["regr1"], to_globals=False)
{'regr1': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,
normalize=False)}

6.4.4 Get Information About Datastores


The oml.ds.dir function provides information about datastores.

The syntax of oml.ds.dir is the following:

oml.ds.dir(name=None, regex_match=False, dstype='user')

Use the name parameter to get information about a specific datastore.

Optionally, you can use the regex_match and dstype parameters to get information about
datastores with certain characteristics. The valid arguments for dstype are the following:

Argument    Description
all         Lists all of the datastores to which the current user has the read privilege.
grant       Lists the datastores for which the current user has granted the read privilege to other users.
granted     Lists the datastores for which other users have granted the read privilege to the current user.
grantable   Lists the datastores that the current user can grant the read privilege to.
user        Lists the datastores created by the current user.
private     Lists the datastores that the current user cannot grant the read privilege to.

The oml.ds.dir function returns a pandas.DataFrame object that contains different columns
depending on which dstype argument you use. The following table lists the arguments and the
columns returned for the values supplied.

dstype Argument            Columns in the DataFrame Returned

user, private, grantable   DSNAME, which contains the datastore name
                           NOBJ, which contains the number of objects in the datastore
                           DSIZE, which contains the size in bytes of each object in the datastore
                           CDATE, which contains the creation date of the datastore
                           DESCRIPTION, which contains the optional description of the datastore
all, granted               All of the columns returned by the user, private, and grantable
                           values, plus this additional column:
                           DSOWNER, which contains the owner of the datastore
grant                      DSNAME, which contains the datastore name
                           GRANTEE, which contains the name of the user to which the read
                           privilege to the datastore has been granted by the current session user
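
For example, to list the datastores that other users have shared with you, pass
dstype="granted". This is a minimal sketch; the returned pandas.DataFrame is empty if no other
user has granted you the read privilege.

import oml

# List datastores for which other users have granted the read privilege
# to the current user. The result is a pandas.DataFrame.
oml.ds.dir(dstype="granted")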

Example 6-16 Getting Information About Datastores


This example demonstrates using different combinations of arguments to the oml.ds.dir
function. It demonstrates using oml.ds.dir to list some or all of the datastores. For the creation of
the datastores used in this example, see Example 6-14.

import oml

# Show all saved datastores.


oml.ds.dir(dstype="all")[['owner', 'datastore_name', 'object_count']]

# Show datastores to which other users have been granted the read
# privilege.
oml.ds.dir(dstype="grant")

# Show datastores whose names match a pattern.


oml.ds.dir(name='pydata', regex_match=True)\
[['datastore_name', 'object_count']]

Listing for This Example

>>> import oml


>>>
>>> # Show all saved datastores.


... oml.ds.dir(dstype="all")[['owner', 'datastore_name', 'object_count']]


owner datastore_name object_count
0 OML_USER ds_pydata 3
1 OML_USER ds_pymodel 2
2 OML_USER ds_wine_data 1
>>>
>>> # Show datastores to which other users have been granted the read
>>> # privilege.
... oml.ds.dir(dstype="grant")
datastore_name grantee
0 ds_pymodel PUBLIC
>>>
>>> oml.ds.dir(name='pydata', regex_match=True)\
... [['datastore_name', 'object_count']]
datastore_name object_count
0 ds_pydata 3

6.4.5 Get Information About Datastore Objects


The oml.ds.describe function provides information about the objects in a datastore.

The syntax of oml.ds.describe is the following:

oml.ds.describe(name, owner=None)

The name argument is a string that specifies the name of a datastore.

The owner argument is a string that specifies the owner of the datastore or None (the default). If
you do not specify the owner, then the function returns information about the datastore if it is
owned by the current user.
The oml.ds.describe function returns a pandas.DataFrame object, each row of which
represents an object in the datastore. The columns of the DataFrame are the following:

• object_name, which specifies the name of the object


• class, which specifies the class of the object
• size, which specifies the size of the object in bytes
• length, which specifies the length of the object
• row_count, which specifies the rows of the object
• col_count, which specifies the columns of the object
This function raises a ValueError if the following occur:

• The current user is not the owner of the datastore and has not been granted read privilege
for the datastore.
• The datastore does not exist.
Example 6-17 Getting Information About Datastore Objects
This example demonstrates using the oml.ds.describe function. For the creation of the
datastore used in this example, see Example 6-14.

import oml


# Describe the contents of the ds_pydata datastore.


oml.ds.describe(name='ds_pydata')
oml.ds.describe(name="ds_pydata")[['object_name', 'class']]

Listing for This Example

>>> import oml


>>>
>>> # Describe the contents of the ds_pydata datastore.
... oml.ds.describe(name='ds_pydata')
object_name class size length row_count col_count
0 oml_boston oml.DataFrame 1073 506 506 14
1 oml_diabetes oml.DataFrame 964 442 442 11
2 wine Bunch 24177 5 1 5
>>> oml.ds.describe(name="ds_pydata")[['object_name', 'class']]
object_name class
0 oml_boston oml.DataFrame
1 oml_diabetes oml.DataFrame
2 wine Bunch

6.4.6 Delete Datastore Objects


The oml.ds.delete function deletes datastores or objects in a datastore.

Use the oml.ds.delete function to delete one or more datastores in your database schema or
to delete objects in a datastore.
The syntax of oml.ds.delete is the following:

oml.ds.delete(name, objs=None, regex_match=False)

The argument to the name parameter may be one of the following:

• A string that specifies the name of the datastore to modify or delete, or a regular
expression that matches the datastores to delete.
• A list of str objects that name the datastores from which to delete objects.
The objs parameter specifies the objects to delete from a datastore. The argument to the objs
parameter may be one of the following:
• A string that specifies the object to delete from one or more datastores, or a regular
expression that matches the objects to delete.
• None (the default), which deletes the entire datastore or datastores.
The regex_match parameter is a bool that indicates whether the name or objs arguments are
regular expressions. The default value is False. The regex_match parameter operates as
follows:
• If regex_match=False and if name is not None, and:
– If objs=None, then oml.ds.delete deletes the datastore or datastores specified in the
name argument.
– If you specify one or more datastores with the name argument and one or more
datastore objects with the objs argument, then oml.ds.delete deletes the specified
Python objects from the datastores.


• If regex_match=True and:
– If objs=None, then oml.ds.delete deletes the datastores you specified in the name
argument.
– If the name argument is a string and you specify one or more datastore objects with the
objs argument, then oml.ds.delete deletes from the datastore the objects whose
names match the regular expression specified in the objs argument.
– If the name argument is a list of str objects, then the objs argument must be a list of
str objects of the same length as name, and oml.ds.delete deletes from the
datastores the objects whose names match the regular expressions specified in objs
(see the sketch after this list).
This function raises an error if the following occur:
• A specified datastore does not exist.
• Argument regex_match is False and argument name is a list of str objects larger than 1
and argument objs is not None.
• Argument regex_match is True and arguments name and objs are lists that are not the
same length.
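
The following minimal sketch illustrates the list form of the name and objs arguments with
regex_match=True. The datastore names ds_a and ds_b and the object name patterns are
hypothetical; substitute datastores and patterns that exist in your own schema.

import oml

# With regex_match=True and name given as a list, objs must be a list of the
# same length; the i-th pattern in objs is applied to the objects of the i-th
# datastore in name.
oml.ds.delete(name=["ds_a", "ds_b"],
              objs=["tmp_.*", "old_.*"],
              regex_match=True)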
Example 6-18 Deleting Datastore Objects
This example demonstrates using the oml.ds.delete function. For the creation of the
datastores used in this example, see Example 6-14.

import oml

# Show the existing datastores.


oml.ds.dir()

# Show the Python objects in the ds_pydata datastore.


oml.ds.describe(name='ds_pydata')

# Delete some objects from the datastore.


oml.ds.delete(name="ds_pydata", objs=["wine", "oml_boston"])

# Delete a datastore.
oml.ds.delete(name="ds_pydata")

# Delete all datastores whose names match a pattern.


oml.ds.delete(name="_pymodel", regex_match=True)

# Show the existing datastores again.


oml.ds.dir()

Listing for This Example

>>> import oml


>>>
>>> # Show the existing datastores.
... oml.ds.dir()
datastore_name object_count size date description
0 ds_pydata 3 26214 2019-05-18 21:04:06 python datasets
1 ds_pymodel 2 6370 2019-05-18 21:08:18 None
2 ds_wine_data 1 1410 2019-05-18 21:06:53 wine dataset
>>>


>>> # Show the Python objects in the ds_pydata datastore.


... oml.ds.describe(name='ds_pydata')
object_name class size length row_count col_count
0 oml_boston oml.DataFrame 1073 506 506 14
1 oml_diabetes oml.DataFrame 964 442 442 11
2 wine Bunch 24177 5 1 5
>>>
>>> # Delete some objects from a datastore.
... oml.ds.delete(name="ds_pydata", objs=["wine", "oml_boston"])
{'wine', 'oml_boston'}
>>>
>>> # Delete a datastore.
... oml.ds.delete(name="ds_pydata")
'ds_pydata'
>>>
>>> # Delete all datastores whose names match a pattern.
... oml.ds.delete(name="_pymodel", regex_match=True)
{'ds_pymodel'}
>>>
>>> # Show the existing datastores again.
... oml.ds.dir()
datastore_name object_count size date description
0 ds_wine_data 1 1410 2019-05-18 21:06:53 wine dataset

6.4.7 Manage Access to Stored Objects


The oml.grant and oml.revoke functions grant or revoke the read privilege to datastores or to
user-defined Python functions in the script repository.
The oml.grant function grants the read privilege to another user to a datastore or to a user-
defined Python function in the OML4Py script repository. The oml.revoke function revokes that
privilege.
The syntax of these functions is the following:

oml.grant(name, typ='datastore', user=None)


oml.revoke(name, typ='datastore', user=None)

The name argument is a string that specifies the name of the user-defined Python function in
the script repository or the name of a datastore.
The typ parameter must be specified. The argument is a string that is either 'datastore' or
'pyqscript'.

The user argument is a string that specifies the user to whom read privilege to the named
datastore or user-defined Python function is granted or from whom it is revoked, or None (the
default). If you specify None, then the read privilege is granted to or revoked from all users.

Example 6-19 Granting and Revoking Access to Datastores


This example displays the datastores to which the read privilege has been granted to all users.
It revokes read privilege from the ds_pymodel datastore and displays the datastores with public
read privilege again. It next grants the read privilege to the user SH and finally displays once
more the datastores to which read privilege has been granted. For the creation of the
datastores used in this example, see Example 6-14.

import oml

# Show datastores to which other users have been granted read privilege.
oml.ds.dir(dstype="grant")

# Revoke the read privilege from every user.


oml.revoke(name="ds_pymodel", typ="datastore", user=None)

# Again show datastores to which read privilege has been granted.


oml.ds.dir(dstype="grant")

# Grant the read privilege to the user SH.


oml.grant(name="ds_pymodel", typ="datastore", user="SH")

oml.ds.dir(dstype="grant")

Listing for This Example

>>> import oml


>>>
>>> # Show datastores to which other users have been granted read privilege.
... oml.ds.dir(dstype="grant")
datastore_name grantee
0 ds_pymodel PUBLIC
>>>
>>> # Revoke the read privilege from every user.
... oml.revoke(name="ds_pymodel", typ="datastore", user=None)
>>>
>>> # Again show datastores to which read privilege has been granted to other
users.
... oml.ds.dir(dstype="grant")
Empty DataFrame
Columns: [datastore_name, grantee]
Index: []
>>>
>>> # Grant the read privilege to the user SH.
... oml.grant(name="ds_pymodel", typ="datastore", user="SH")
>>>
>>> oml.ds.dir(dstype="grant")
datastore_name grantee
0 ds_pymodel SH

Example 6-20 Granting and Revoking Access to User-Defined Python Functions


This example grants the read privilege to the MYLM user-defined Python function to the user
SH and then revokes that privilege. For the creation of the user-defined Python functions used
in this example, see Example 10-11.

# List the user-defined Python functions available only to the current user.
oml.script.dir(sctype='user')

# Grant the read privilege to the MYLM user-defined Python function to the
# user SH.
oml.grant(name="MYLM", typ="pyqscript", user="SH")

# List the user-defined Python functions to which read privilege has been
# granted.
oml.script.dir(sctype="grant")

# Revoke the read privilege to the MYLM user-defined Python function from the
# user SH.
oml.revoke(name="MYLM", typ="pyqscript", user="SH")

# List the granted user-defined Python functions again to see if the
# revocation was successful.
oml.script.dir(sctype="grant")

Listing for This Example

>>> # List the user-defined Python functions available only to the current
user.
... oml.script.dir(sctype='user')
name script
0 MYLM def build_lm1(dat):\n from sklearn import lin...
>>>
>>> # Grant the read privilege to the MYLM user-defined Python function to the
user SH.
... oml.grant(name="MYLM", typ="pyqscript", user="SH")
>>>
>>> # List the user-defined Python functions to which read privilege has been
granted.
... oml.script.dir(sctype="grant")
name grantee
0 MYLM SH
>>>
>>> # Revoke the read privilege to the MYLM user-defined Python function from
the user SH.
... oml.revoke(name="MYLM", typ="pyqscript", user="SH")
>>>
>>> # List the granted user-defined Python functions again to see if the
revocation was successful.
... oml.script.dir(sctype="grant")
Empty DataFrame
Columns: [name, grantee]
Index: []

7
Prepare and Explore Data
Use OML4Py methods to prepare data for analysis and to perform exploratory analysis of the
data.
Methods of the OML4Py data type classes make it easier for you to prepare very large
enterprise database-resident data for modeling. These methods are described in the following
topics.

Topics:
• Prepare Data
Using methods of OML4Py data type classes, you can prepare data for analysis in the
database, as described in the following topics.
• Explore Data
OML4Py provides methods that enable you to perform exploratory data analysis and
common statistical operations.
• Render Graphics
OML4Py provides functions for rendering graphical displays of data.

7.1 Prepare Data


Using methods of OML4Py data type classes, you can prepare data for analysis in the
database, as described in the following topics.
• About Preparing Data in the Database
OML4Py data type classes have methods that enable you to use Python to prepare
database data for analysis.
• Select Data
A typical step in preparing data for analysis is selecting or filtering values of interest from a
larger data set.
• Combine Data
You can join data from oml.DataFrame objects that represent database tables by using the
append, concat, and merge methods.
• Clean Data
In preparing data for analysis, a typical step is to transform data by dropping some values.
• Split Data
Sample and randomly partition data with the split and KFold methods.

7.1.1 About Preparing Data in the Database


OML4Py data type classes have methods that enable you to use Python to prepare database
data for analysis.
You can perform data preparation operations on large quantities of data in the database and
then continue operating on that data in-database or pull a subset of the results to your local
Python session where, for example, you can use third-party Python packages to perform other
operations.
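
As a minimal sketch of that workflow, the following filters and projects data in the database
through an oml.DataFrame proxy object and pulls only the small result into a local
pandas.DataFrame. It assumes a connection to the database has already been established.

import oml
import pandas as pd

# Push a pandas.DataFrame to a temporary database table and get a proxy object.
df = pd.DataFrame({'id': range(10), 'val': [i * 1.5 for i in range(10)]})
oml_df = oml.push(df)

# Filter and project in the database; no rows are moved to the client yet.
subset = oml_df[oml_df['val'] > 6, ['id', 'val']]

# Pull only the filtered subset into a local pandas.DataFrame, where you can
# continue with third-party Python packages.
local_df = subset.pull()
print(type(local_df))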
The following table lists methods with which you can perform common data preparation tasks
and indicates whether the OML4Py data type class supports the method.

Table 7-1 Methods Supported by Data Types

Method Description oml.Boolean oml.Bytes oml.Float oml.String oml.DataFrame

append            Appends another oml data object of the same class to an oml object.
ceil              Computes the ceiling of each element in an oml.Float series data object.
concat            Combines an oml data object column-wise with one or more other data objects.
count_pattern     Counts the number of occurrences of a pattern in each string.
create_view       Creates an Oracle Database view for the data represented by the OML4Py data
                  object.
dot               Calculates the inner product of the current oml.Float object with another
                  oml.Float, or does matrix multiplication with an oml.DataFrame.
drop              Drops specified columns in an oml.DataFrame.
drop_duplicates   Removes duplicated elements from an oml series data object or duplicated rows
                  from an oml.DataFrame.
dropna            Removes missing elements from an oml series data object, or rows containing
                  missing values from an oml.DataFrame.
exp               Computes element-wise e to the power of values in an oml.Float series data
                  object.
find              Finds the lowest index in each string in which a substring is found that is
                  greater than or equal to a start index.
floor             Computes the floor of each element in an oml.Float series data object.
head              Returns the first n elements of an oml series data object or the first n rows
                  of an oml.DataFrame.
KFold             Splits the oml data object randomly into k consecutive folds.
len               Computes the length of each string in an oml.Bytes or oml.String series data
                  object.
log               Calculates an element-wise logarithm, to the given base, of values in the
                  oml.Float series data object.
materialize       Pushes the contents represented by an OML4Py proxy object (a view, a table,
                  and so on) into a table in Oracle Database.
merge             Joins another oml.DataFrame to an oml.DataFrame.
replace           Replaces an existing value with another value.
rename            Renames columns of an oml.DataFrame.
round             Rounds oml.Float values to the specified decimal place.
select_types      Returns the subset of columns that are included or excluded based on their
                  oml data type.
split             Splits an oml data object randomly into multiple sets.
sqrt              Computes the square root of each element in an oml.Float series data object.
tail              Returns the last n elements of an oml series data object or the last n rows
                  of an oml.DataFrame.

7.1.2 Select Data


A typical step in preparing data for analysis is selecting or filtering values of interest from a
larger data set.


The examples in this section demonstrate selecting data from an oml.DataFrame object by
rows, by columns, and by value.
The examples use the oml_iris object created by the following code, which imports the
sklearn.datasets package and loads the iris data set. It creates the x and y variables, and
then creates the persistent database table IRIS and the oml.DataFrame object oml_iris as a
proxy for the table.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x: {0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

The examples are in the following topics:


• Select the First or Last Number of Rows
• Select Data by Column
• Select Data by Value

Select the First or Last Number of Rows


The head and tail methods return the first or last number of elements.

The default number of rows selected is 5.


Example 7-1 Selecting the First and Last Number of Rows
This example selects rows from the oml.DataFrame object oml_iris. It displays the first five
rows and ten rows of oml_iris and then the last five and ten rows.

# Display the first 5 rows.


oml_iris.head()

# Display the first 10 rows.


oml_iris.head(10)

# Display the last 5 rows.


oml_iris.tail()

# Display the last 10 rows.


oml_iris.tail(10)


Listing for This Example

>>> # Display the first 5 rows.


... oml_iris.head()
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
>>>
>>> # Display the first 10 rows.
... oml_iris.head(10)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
>>>
>>> # Display the last 5 rows.
... oml_iris.tail()
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 6.7 3.0 5.2 2.3 virginica
1 6.3 2.5 5.0 1.9 virginica
2 6.5 3.0 5.2 2.0 virginica
3 6.2 3.4 5.4 2.3 virginica
4 5.9 3.0 5.1 1.8 virginica

>>>
>>> # Display the last 10 rows.
... oml_iris.tail(10)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 6.7 3.1 5.6 2.4 virginica
1 6.9 3.1 5.1 2.3 virginica
2 5.8 2.7 5.1 1.9 virginica
3 6.8 3.2 5.9 2.3 virginica
4 6.7 3.3 5.7 2.5 virginica
5 6.7 3.0 5.2 2.3 virginica
6 6.3 2.5 5.0 1.9 virginica
7 6.5 3.0 5.2 2.0 virginica
8 6.2 3.4 5.4 2.3 virginica
9 5.9 3.0 5.1 1.8 virginica

Select Data by Column


Example 7-2 Selecting Data by Columns
The example selects two columns from oml_iris and creates the oml.DataFrame object
iris_projected1 with them. It then displays the first three rows of iris_projected1. The
example also selects a range of columns from oml_iris, creates iris_projected2, and
displays its first three rows. Finally, the example selects columns from oml_iris by data types,
creates iris_projected3, and displays its first three rows.

# Select all rows with the specified column names.


iris_projected1 = oml_iris[:, ["Sepal_Length", "Petal_Length"]]
iris_projected1.head(3)

# Select all rows with columns whose indices are in the range [1, 4).
iris_projected2 = oml_iris[:, 1:4]
iris_projected2.head(3)

# Select all rows with columns of oml.String data type.


iris_projected3 = oml_iris.select_types(include=[oml.String])
iris_projected3.head(3)

Listing for This Example

>>> # Select all rows with specified column names.


... iris_projected1 = oml_iris[:, ["Sepal_Length", "Petal_Length"]]
>>> iris_projected1.head(3)
Sepal_Length Petal_Length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
>>>
>>> # Select all rows with columns whose indices are in range [1, 4).
... iris_projected2 = oml_iris[:, 1:4]
>>> iris_projected2.head(3)
Sepal_Width Petal_Length Petal_Width
0 3.5 1.4 0.2
1 3.0 1.4 0.2
2 3.2 1.3 0.2
>>>
>>> # Select all rows with columns of oml.String data type.
... iris_projected3 = oml_iris.select_types(include=[oml.String])
>>> iris_projected3.head(3)
Species
0 setosa
1 setosa
2 setosa

Select Data by Value


Example 7-3 Selecting Data by Value
This example filters oml_iris to produce oml_iris_filtered1, which contains the values from
the rows of oml_iris that have a petal length of less than 1.5 and that are in the Sepal_Length
and Petal_Length columns. The example also filters the data using conditions, so that
oml_iris_filtered2 contains the values from oml_iris that have a petal length of less than
1.5 or a sepal length equal to 5.0 and oml_iris_filtered3 contains the values from oml_iris
that have a petal length of less than 1.5 and a sepal length larger than 5.0.

# Select sepal length and petal length where petal length


# is less than 1.5.
oml_iris_filtered1 = oml_iris[oml_iris["Petal_Length"] < 1.5,
["Sepal_Length", "Petal_Length"]]
len(oml_iris_filtered1)
oml_iris_filtered1.head(3)

### Using the AND and OR conditions in filtering.


# Select all rows in which petal length is less than 1.5 or
# sepal length is 5.0.
oml_iris_filtered2 = oml_iris[(oml_iris["Petal_Length"] < 1.5) |
(oml_iris["Sepal_Length"] == 5.0), :]
len(oml_iris_filtered2)
oml_iris_filtered2.head(3)

# Select all rows in which petal length is less than 1.5 and
# sepal length is larger than 5.0.
oml_iris_filtered3 = oml_iris[(oml_iris["Petal_Length"] < 1.5) &
(oml_iris["Sepal_Length"] > 5.0), :]
len(oml_iris_filtered3)
oml_iris_filtered3.head()

Listing for This Example

>>> # Select sepal length and petal length where petal length
... # is less than 1.5.
... oml_iris_filtered1 = oml_iris[oml_iris["Petal_Length"] < 1.5,
... ["Sepal_Length", "Petal_Length"]]
>>> len(oml_iris_filtered1)
24
>>> oml_iris_filtered1.head(3)
Sepal_Length Petal_Length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
>>>
>>> ### Using the AND and OR conditions in filtering.
... # Select all rows in which petal length is less than 1.5 or
... # sepal length is 5.0.
... oml_iris_filtered2 = oml_iris[(oml_iris["Petal_Length"] < 1.5) |
... (oml_iris["Sepal_Length"] == 5.0), :]
>>> len(oml_iris_filtered2)
30
>>> oml_iris_filtered2.head(3)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
>>>
>>> # Select all rows in which petal length is less than 1.5
... # and sepal length is larger than 5.0.
... oml_iris_filtered3 = oml_iris[(oml_iris["Petal_Length"] < 1.5) &
... (oml_iris["Sepal_Length"] > 5.0), :]
>>> len(oml_iris_filtered3)
7
>>> oml_iris_filtered3.head()
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.8 4.0 1.2 0.2 setosa
2 5.4 3.9 1.3 0.4 setosa
3 5.1 3.5 1.4 0.3 setosa
4 5.2 3.4 1.4 0.2 setosa

7.1.3 Combine Data


You can join data from oml.DataFrame objects that represent database tables by using the
append, concat, and merge methods.

Examples of using these methods are in the following topics.


• Append Data from One Object to Another Object
• Combine Two Objects
• Join Data From Two Objects

Append Data from One Object to Another Object


Use the append method to join two objects of the same data type.

Example 7-4 Appending Data from Two Tables


This example first appends the oml.Float series object num1 to another oml.Float series
object, num2. It then appends an oml.DataFrame object to another oml.DataFrame object, which
has the same column types.

import oml
import pandas as pd

df = pd.DataFrame({"id" : [1, 2, 3, 4, 5],


"val" : ["a", "b", "c", "d", "e"],
"ch" : ["p", "q", "r", "a", "b"],
"num" : [4, 3, 6.7, 7.2, 5]})
oml_df = oml.push(df)

# Append an oml.Float series object to another.


num1 = oml_df['id']
num2 = oml_df['num']
num1.append(num2)

# Append an oml.DataFrame object to another.


x = oml_df[['id', 'val']] # 1st column oml.Float, 2nd column oml.String
y = oml_df[['num', 'ch']] # 1st column oml.Float, 2nd column oml.String
x.append(y)

Listing for This Example

>>> import oml


>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"id" : [1, 2, 3, 4, 5],
... "val" : ["a", "b", "c", "d", "e"],
... "ch" : ["p", "q", "r", "a", "b"],
... "num" : [4, 3, 6.7, 7.2, 5]})
>>> oml_df = oml.push(df)


>>>
>>> # Append an oml.Float series object to another.
... num1 = oml_df['id']
>>> num2 = oml_df['num']
>>> num1.append(num2)
[1, 2, 3, 4, 5, 4, 3, 6.7, 7.2, 5]
>>>
>>> # Append an oml.DataFrame object to another.
... x = oml_df[['id', 'val']] # 1st column oml.Float, 2nd column oml.String
>>> y = oml_df[['num', 'ch']] # 1st column oml.Float, 2nd column oml.String
>>> x.append(y)
id val
0 1.0 a
1 2.0 b
2 3.0 c
3 4.0 d
4 5.0 e
5 4.0 p
6 3.0 q
7 6.7 r
8 7.2 a
9 5.0 b

Combine Two Objects


Use the concat method to combine columns from one object with those of another object. The
auto_name argument of the concat method controls whether to invoke automatic name conflict
resolution. You can also perform customized renaming by passing in a dictionary mapping
strings to objects.
To combine two objects with the concat method, both objects must represent data from the
same underlying database table, view, or query.
Example 7-5 Combining Data Column-Wise
This example first combines the two oml.DataFrame objects x and y column-wise. It then
concatenates object y with the oml.Float series object w.

import oml
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame({"id" : [1, 2, 3, 4, 5],


"val" : ["a", "b", "c", "d", "e"],
"ch" : ["p", "q", "r", "a", "b"],
"num" : [4, 3, 6.7, 7.2, 5]})
oml_df = oml.push(df)

# Create two oml.DataFrame objects and combine the objects column-wise.


x = oml_df[['id', 'val']]
y = oml_df[['num', 'ch']]
x.concat(y)

# Create an oml.Float object with the rounded exponential of two times


# the values in the num column of the oml_df object, then
# concatenate it with the oml.DataFrame object y using a new column name.


w = (oml_df['num']*2).exp().round(decimals=2)
y.concat({'round(exp(2*num))':w})

# Concatenate object x with multiple objects and turn on automatic


# name conflict resolution.
z = oml_df[:,'id']
x.concat([z, w, y], auto_name=True)

# Concatenate multiple oml data objects and perform customized renaming.


x.concat(OrderedDict([('ID',z), ('round(exp(2*num))',w), ('New_',y)]))

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from collections import OrderedDict
>>>
>>> df = pd.DataFrame({"id" : [1, 2, 3, 4, 5],
... "val" : ["a", "b", "c", "d", "e"],
... "ch" : ["p", "q", "r", "a", "b"],
... "num" : [4, 3, 6.7, 7.2, 5]})
>>> oml_df = oml.push(df)

>>> # Create two oml.DataFrame objects and combine the objects column-wise.
... x = oml_df[['id', 'val']]
>>> y = oml_df[['num', 'ch']]
>>> x.concat(y)
id val num ch
0 1 a 4.0 p
1 2 b 3.0 q
2 3 c 6.7 r
3 4 d 7.2 a
4 5 e 5.0 b
>>>
>>> # Create an oml.Float object with the rounded exponential of two times
... # the values in the num column of the oml_df object, then
... # concatenate it with the oml.DataFrame object y using a new column name.
... w = (oml_df['num']*2).exp().round(decimals=2)
>>> y.concat({'round(exp(2*num))':w})
num ch round(exp(2*num))
0 4.0 p 2980.96
1 3.0 q 403.43
2 6.7 r 660003.22
3 7.2 a 1794074.77
4 5.0 b 22026.47
>>>
>>> # Concatenate object x with multiple objects and turn on automatic
... # name conflict resolution.
... z = oml_df[:,'id']
>>> x.concat([z, w, y], auto_name=True)
id val id3 num num5 ch
0 1 a 1 2980.96 4.0 p
1 2 b 2 403.43 3.0 q
2 3 c 3 660003.22 6.7 r
3 4 d 4 1794074.77 7.2 a
4 5 e 5 22026.47 5.0 b
>>>
>>> # Concatenate multiple oml data objects and perform customized renaming.
... x.concat(OrderedDict([('ID',z), ('round(exp(2*num))',w), ('New_',y)]))
id val ID round(exp(2*num)) New_num New_ch
0 1 a 1 2980.96 4.0 p
1 2 b 2 403.43 3.0 q
2 3 c 3 660003.22 6.7 r
3 4 d 4 1794074.77 7.2 a
4 5 e 5 22026.47 5.0 b

Join Data From Two Objects


Use the merge method to join data from two objects.

Example 7-6 Joining Data from Two Tables


This example first performs a cross join on the oml.DataFrame objects x and y, which creates
the oml.DataFrame object xy. The example performs a left outer join on the first four rows of x
with the oml.DataFrame object other on the shared column id and applies the suffixes .l
and .r to column names on the left and right side, respectively. The example then performs a
right outer join on the id column on the left side object x and the num column on the right side
object y.

import oml
import pandas as pd

df = pd.DataFrame({"id" : [1, 2, 3, 4, 5],


"val" : ["a", "b", "c", "d", "e"],
"ch" : ["p", "q", "r", "a", "b"],
"num" : [4, 3, 6.7, 7.2, 5]})
oml_df = oml.push(df)

x = oml_df[['id', 'val']]
y = oml_df[['num', 'ch']]

# Perform a cross join.


xy = x.merge(y)
xy

# Perform a left outer join.


x.head(4).merge(other=oml_df[['id', 'num']], on="id",
suffixes=['.l','.r'])

# Perform a right outer join.


x.merge(other=y, left_on="id", right_on="num", how="right")

Listing for This Example

>>> import oml


>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"id" : [1, 2, 3, 4, 5],
... "val" : ["a", "b", "c", "d", "e"],
... "ch" : ["p", "q", "r", "a", "b"],


... "num" : [4, 3, 6.7, 7.2, 5]})


>>> oml_df = oml.push(df)
>>>
>>> x = oml_df[['id', 'val']]
>>> y = oml_df[['num', 'ch']]
>>>
>>> # Perform a cross join.
... xy = x.merge(y)
>>> xy
id_l val_l num_r ch_r
0 1 a 4.0 p
1 1 a 3.0 q
2 1 a 6.7 r
3 1 a 7.2 a
4 1 a 5.0 b
5 2 b 4.0 p
6 2 b 3.0 q
7 2 b 6.7 r
8 2 b 7.2 a
9 2 b 5.0 b
10 3 c 4.0 p
11 3 c 3.0 q
12 3 c 6.7 r
13 3 c 7.2 a
14 3 c 5.0 b
15 4 d 4.0 p
16 4 d 3.0 q
17 4 d 6.7 r
18 4 d 7.2 a
19 4 d 5.0 b
20 5 e 4.0 p
21 5 e 3.0 q
22 5 e 6.7 r
23 5 e 7.2 a
24 5 e 5.0 b
>>>
>>> # Perform a left outer join.
... x.head(4).merge(other=oml_df[['id', 'num']], on="id",
... suffixes=['.l','.r'])
id val.l num.r
0 1 a 4.0
1 2 b 3.0
2 3 c 6.7
3 4 d 7.2
>>>
>>> # Perform a right outer join.
... x.merge(other=y, left_on="id", right_on="num", how="right")
id_l val_l num_r ch_r
0 3.0 c 3.0 q
1 4.0 d 4.0 p
2 5.0 e 5.0 b
3 NaN None 6.7 r
4 NaN None 7.2 a


7.1.4 Clean Data


In preparing data for analysis, a typical step is to transform data by dropping some values.
You can filter out unneeded data by using the drop, drop_duplicates, and dropna methods.

Example 7-7 Filtering Data


This example demonstrates ways of dropping columns with the drop method, dropping missing
values with the dropna method, and dropping duplicate values with the drop_duplicates
method.

import pandas as pd
import oml

df = pd.DataFrame({'numeric': [1, 1.4, -4, -4, 5.432, None, None],


'string1' : [None, None, 'a', 'a', 'a', 'b', None],
'string2': ['x', None, 'z', 'z', 'z', 'x', None]})
oml_df = oml.push(df, dbtypes = {'numeric': 'BINARY_DOUBLE',
'string1':'CHAR(1)',
'string2':'CHAR(1)'})

# Drop rows with any missing values.


oml_df.dropna(how='any')

# Drop rows in which all column values are missing.


oml_df.dropna(how='all')

# Drop rows in which any numeric column values are missing.


oml_df.dropna(how='any', subset=['numeric'])

# Drop duplicate rows.


oml_df.drop_duplicates()

# Drop rows that have the same value in columns 'string1' and 'string2'.
oml_df.drop_duplicates(subset=['string1', 'string2'])

# Drop column 'string2'


oml_df.drop('string2')

Listing for This Example

>>> import pandas as pd


>>> import oml
>>>
>>> df = pd.DataFrame({'numeric': [1, 1.4, -4, -4, 5.432, None, None],
... 'string1' : [None, None, 'a', 'a', 'a', 'b', None],
... 'string2': ['x', None, 'z', 'z', 'z', 'x', None]})
>>> oml_df = oml.push(df, dbtypes = {'numeric': 'BINARY_DOUBLE',
... 'string1':'CHAR(1)',
... 'string2':'CHAR(1)'})
>>>
>>> # Drop rows with any missing values.
... oml_df.dropna(how='any')
numeric string1 string2
0 -4.000 a z
1 -4.000 a z
2 5.432 a z
>>>
>>> # Drop rows in which all column values are missing.
... oml_df.dropna(how='all')
numeric string1 string2
0 1.000 None x
1 1.400 None None
2 -4.000 a z
3 -4.000 a z
4 5.432 a z
5 NaN b x
>>>
>>> # Drop rows in which any numeric column values are missing.
... oml_df.dropna(how='any', subset=['numeric'])
numeric string1 string2
0 1.000 None x
1 1.400 None None
2 -4.000 a z
3 -4.000 a z
4 5.432 a z
>>>
>>> # Drop duplicate rows.
... oml_df.drop_duplicates()
numeric string1 string2
0 5.432 a z
1 1.000 None x
2 -4.000 a z
3 NaN b x
4 1.400 None None
5 NaN None None
>>>
>>> # Drop rows that have the same value in columns 'string1' and 'string2'.
... oml_df.drop_duplicates(subset=['string1', 'string2'])
numeric string1 string2
0 -4.0 a z
1 1.4 None None
2 1.0 None x
3 NaN b x
>>>
>>> # Drop the column 'string2'.
... oml_df.drop('string2')
numeric string1
0 1.000 None
1 1.400 None
2 -4.000 a
3 -4.000 a
4 5.432 a
5 NaN b
6 NaN None


7.1.5 Split Data


Sample and randomly partition data with the split and KFold methods.

In analyzing large data sets, a typical operation is to randomly partition the data set into
subsets for training and testing purposes, which you can do with these methods. You can also
sample data with the split method.

Example 7-8 Splitting Data into Multiple Sets


This example demonstrates splitting data into multiple sets and into k consecutive folds, which
can be used for k-fold cross-validation.

import oml
import pandas as pd
from sklearn import datasets

digits = datasets.load_digits()
pd_digits = pd.DataFrame(digits.data,
columns=['IMG'+str(i) for i in
range(digits['data'].shape[1])])
pd_digits = pd.concat([pd_digits,
pd.Series(digits.target,
name = 'target')],
axis = 1)
oml_digits = oml.push(pd_digits)

# Sample 20% and 80% of the data.


splits = oml_digits.split(ratio=(.2, .8), use_hash = False)
[len(split) for split in splits]

# Split the data into four sets.


splits = oml_digits.split(ratio = (.25, .25, .25, .25),
use_hash = False)
[len(split) for split in splits]

# Perform stratification on the target column.


splits = oml_digits.split(strata_cols=['target'])
[split.shape for split in splits]

# Verify that the stratified sampling generates splits in which


# all of the different categories of digits (digits 0~9)
# are present in each split.
[split['target'].drop_duplicates().sort_values().pull()
for split in splits]

# Hash on the target column.


splits = oml_digits.split(hash_cols=['target'])
[split.shape for split in splits]

# Verify that the different categories of digits (digits 0~9) are present
# in only one of the splits generated by hashing on the category column.
[split['target'].drop_duplicates().sort_values().pull()
for split in splits]


# Split the data randomly into 4 consecutive folds.


folds = oml_digits.KFold(n_splits=4)
[(len(fold[0]), len(fold[1])) for fold in folds]

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> digits = datasets.load_digits()
>>> pd_digits = pd.DataFrame(digits.data,
... columns=['IMG'+str(i) for i in
... range(digits['data'].shape[1])])
>>> pd_digits = pd.concat([pd_digits,
... pd.Series(digits.target,
... name = 'target')],
... axis = 1)
>>> oml_digits = oml.push(pd_digits)
>>>
>>> # Sample 20% and 80% of the data.
... splits = oml_digits.split(ratio=(.2, .8), use_hash = False)
>>> [len(split) for split in splits]
[351, 1446]
>>>
>>> # Split the data into four sets.
... splits = oml_digits.split(ratio = (.25, .25, .25, .25),
... use_hash = False)
>>> [len(split) for split in splits]
[432, 460, 451, 454]
>>>
>>> # Perform stratification on the target column.
... splits = oml_digits.split(strata_cols=['target'])
>>> [split.shape for split in splits]
[(1285, 65), (512, 65)]
>>>
>>> # Verify that the stratified sampling generates splits in which
... # all of the different categories of digits (digits 0~9)
... # are present in each split.
... [split['target'].drop_duplicates().sort_values().pull()
... for split in splits]
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
>>>
>>> # Hash on the target column
... splits = oml_digits.split(hash_cols=['target'])
>>> [split.shape for split in splits]
[(899, 65), (898, 65)]
>>>
>>> # Verify that the different categories of digits (digits 0~9) are present
... # in only one of the splits generated by hashing on the category column.
... [split['target'].drop_duplicates().sort_values().pull()
... for split in splits]
[[0, 1, 3, 5, 8], [2, 4, 6, 7, 9]]
>>>
>>> # Split the data randomly into 4 consecutive folds.


... folds = oml_digits.KFold(n_splits=4)


>>> [(len(fold[0]), len(fold[1])) for fold in folds]
[(1352, 445), (1336, 461), (1379, 418), (1325, 472)]

7.2 Explore Data


OML4Py provides methods that enable you to perform exploratory data analysis and common
statistical operations.
These methods are described in the following topics.
• About the Exploratory Data Analysis Methods
OML4Py provides methods that enable you to perform exploratory data analysis.
• Correlate Data
Use the corr method to perform Pearson, Spearman, or Kendall correlation analysis
across columns where possible in an oml.DataFrame object.
• Cross-Tabulate Data
Use the crosstab method to perform cross-column analysis of an oml.DataFrame object
and the pivot_table method to convert an oml.DataFrame to a spreadsheet-style pivot
table.
• Mutate Data
In preparing data for analysis, a typical operation is to mutate data by reformatting it or
deriving new columns and adding them to the data set.
• Sort Data
The sort_values function enables flexible sorting of an oml.DataFrame along one or more
columns specified by the by argument, and returns an oml.DataFrame.
• Summarize Data
The describe method calculates descriptive statistics that summarize the central tendency,
dispersion, and shape of the data in each column.

7.2.1 About the Exploratory Data Analysis Methods


OML4Py provides methods that enable you to perform exploratory data analysis.
The following table lists methods of OML4Py data type classes with which you can perform
common statistical operations and indicates whether the class supports the method.

Table 7-2 Data Exploration Methods Supported by Data Type Classes

Method Description oml.Boolean oml.Bytes oml.Float oml.String oml.DataFrame

corr          Computes pairwise correlation between all columns in an oml.DataFrame where
              possible, given the type of coefficient.
count         Computes the number of elements that are not NULL in the series data object or
              in each column of an oml.DataFrame.
crosstab      Computes a cross-tabulation of two or more columns in an oml.DataFrame.
cumsum        Computes the cumulative sum after an oml.Float series data object is sorted, or
              of each float or Boolean column after an oml.DataFrame object is sorted.
describe      Computes descriptive statistics that summarize the central tendency, dispersion,
              and shape of an oml series data distribution, or of each column in an
              oml.DataFrame.
kurtosis      Computes the kurtosis of the values in an oml.Float series data object, or for
              each float column in an oml.DataFrame.
max           Returns the maximum value in a series data object or in each column in an
              oml.DataFrame.
mean          Computes the mean of the values in an oml.Float series object, or for each
              float or Boolean column in an oml.DataFrame.
median        Computes the median of the values in an oml.Float series object, or for each
              float column in an oml.DataFrame.
min           Returns the minimum value in a series data object or of each column in an
              oml.DataFrame.
nunique       Computes the number of unique values in a series data object or in each column
              of an oml.DataFrame.
pivot_table   Converts an oml.DataFrame to a spreadsheet-style pivot table.
sort_values   Sorts the values in a series data object or sorts the rows in an oml.DataFrame.
skew          Computes the skewness of the values in an oml.Float data series object or of
              each float column in an oml.DataFrame.
std           Computes the standard deviation of the values in an oml.Float data series
              object or in each float or Boolean column in an oml.DataFrame.
sum           Computes the sum of the values in an oml.Float data series object or of each
              float or Boolean column in an oml.DataFrame.

7.2.2 Correlate Data


Use the corr method to perform Pearson, Spearman, or Kendall correlation analysis across
columns where possible in an oml.DataFrame object.

For details about the function arguments, invoke help(oml.DataFrame.corr) or see Oracle
Machine Learning for Python API Reference.
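
Example 7-9 below uses the default Pearson coefficient. The corr method also accepts an
argument that selects the coefficient type; the sketch below assumes, mirroring pandas, that the
argument is named method and takes the values 'pearson', 'spearman', or 'kendall'. Confirm the
exact argument name and values with help(oml.DataFrame.corr).

import oml
import pandas as pd

df = pd.DataFrame({'A': range(6), 'B': [1, 2, 2, 4, 8, 16]})
oml_df = oml.push(df)

# Pearson correlation (the default).
oml_df.corr()

# Rank-based correlation; the 'method' keyword shown here is an assumption
# based on the pandas API -- verify it with help(oml.DataFrame.corr).
oml_df.corr(method='spearman')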
Example 7-9 Performing Basic Correlation Calculations
This example first creates a temporary database table, with its corresponding proxy
oml.DataFrame object oml_df1, from the pandas.DataFrame object df. It then verifies the
correlation computed between columns A and B, which gives 1, as expected. The values in B
are twice the values in A element-wise. The example also changes a value field in df and
creates a NaN entry. It then creates a temporary database table, with the corresponding proxy
oml.DataFrame object oml_df2. Finally, it invokes the corr method on oml_df2 with skipna set
to True (the default) and then False to compare the results.

import oml
import pandas as pd

df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})


oml_df1 = oml.push(df)

# Verify that the correlation between column A and B is 1.


oml_df1.corr()

# Change a value to test the change in the computed correlation result.


df.loc[2, 'A'] = 1.5

# Change an entry to NaN (not a number) to test the 'skipna'


# parameter in the corr method.


df.loc[1, 'B'] = None

# Push df to the database using the floating point column type


# because NaNs cannot be used in Oracle numbers.
oml_df2 = oml.push(df, oranumber=False)

# By default, 'skipna' is True.


oml_df2.corr()
oml_df2.corr(skipna=False)

Listing for This Example

>>> import oml


>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})
>>> oml_df1 = oml.push(df)
>>>
>>> # Verify that the correlation between column A and B is 1.
... oml_df1.corr()
A B
A 1 1
B 1 1
>>>
>>> # Change a value to test the change in the computed correlation result.
... df.loc[2, 'A'] = 1.5
>>>
>>> # Change an entry to NaN (not a number) to test the 'skipna'
... # parameter in the corr method.
... df.loc[1, 'B'] = None
>>>
>>> # Push df to the database using the floating point column type
... # because NaNs cannot be used in Oracle numbers.
... oml_df2 = oml.push(df, oranumber=False)
>>>
>>> # By default, 'skipna' is True.
... oml_df2.corr()
A B
A 1.000000 0.981981
B 0.981981 1.000000
>>> oml_df2.corr(skipna=False)
A B
A 1.0 NaN
B NaN 1.0

7.2.3 Cross-Tabulate Data


Use the crosstab method to perform cross-column analysis of an oml.DataFrame object and
the pivot_table method to convert an oml.DataFrame to a spreadsheet-style pivot table.

Cross-tabulation is a statistical technique that finds an interdependent relationship between
two columns of values. The crosstab method computes a cross-tabulation of two or more
columns. By default, it computes a frequency table for the columns unless a column and an
aggregation function have been passed to it.


The pivot_table method converts a data set into a pivot table. Due to the database limit of
1000 columns, pivot tables with more than 1000 columns are automatically truncated to display
the categories with the most entries for each column value.
For details about the method arguments, invoke help(oml.DataFrame.crosstab) or
help(oml.DataFrame.pivot_table), or see Oracle Machine Learning for Python API
Reference.
Example 7-10 Producing Cross-Tabulation and Pivot Tables
This example demonstrates the use of the crosstab and pivot_table methods.

import pandas as pd
import oml

x = pd.DataFrame({
'GENDER': ['M', 'M', 'F', 'M', 'F', 'M', 'F', 'F',
None, 'F', 'M', 'F'],
'HAND': ['L', 'R', 'R', 'L', 'R', None, 'L', 'R',
'R', 'R', 'R', 'R'],
'SPEED': [40.5, 30.4, 60.8, 51.2, 54, 29.3, 34.1,
39.6, 46.4, 12, 25.3, 37.5],
'ACCURACY': [.92, .94, .87, .9, .85, .97, .96, .93,
.89, .84, .91, .95]
})
x = oml.push(x)

# Find the categories that the most entries belonged to.


x.crosstab('GENDER', 'HAND').sort_values('count', ascending=False)

# For each gender value and across all entries, find the ratio of entries
# with different hand values.
x.crosstab('GENDER', 'HAND', pivot = True, margins = True, normalize = 0)

# Find the mean speed across all gender and hand combinations.
x.pivot_table('GENDER', 'HAND', 'SPEED')

# Find the median accuracy and speed for every gender and hand combination.
x.pivot_table('GENDER', 'HAND', aggfunc = oml.DataFrame.median)

# Find the max and min speeds for every gender and hand combination and
# across all combinations.
x.pivot_table('GENDER', 'HAND', 'SPEED',
aggfunc = [oml.DataFrame.max, oml.DataFrame.min],
margins = True)

Listing for This Example

>>> import pandas as pd


>>> import oml
>>>
>>> x = pd.DataFrame({
... 'GENDER': ['M', 'M', 'F', 'M', 'F', 'M', 'F', 'F',
... None, 'F', 'M', 'F'],
... 'HAND': ['L', 'R', 'R', 'L', 'R', None, 'L', 'R',
... 'R', 'R', 'R', 'R'],


... 'SPEED': [40.5, 30.4, 60.8, 51.2, 54, 29.3, 34.1,


... 39.6, 46.4, 12, 25.3, 37.5],
... 'ACCURACY': [.92, .94, .87, .9, .85, .97, .96, .93,
... .89, .84, .91, .95]
... })
>>> x = oml.push(x)
>>>
>>> # Find the categories that the most entries belonged to.
... x.crosstab('GENDER', 'HAND').sort_values('count', ascending=False)
GENDER HAND count
0 F R 5
1 M L 2
2 M R 2
3 M None 1
4 F L 1
5 None R 1
>>>
>>> # For each gender value and across all entries, find the ratio of entries
... # with different hand values.
... x.crosstab('GENDER', 'HAND', pivot = True, margins = True, normalize = 0)
GENDER count_(L) count_(R) count_(None)
0 None 0.000000 1.000000 0.000000
1 F 0.166667 0.833333 0.000000
2 M 0.400000 0.400000 0.200000
3 All 0.250000 0.666667 0.083333
>>>
>>> # Find the mean speed across all gender and hand combinations.
... x.pivot_table('GENDER', 'HAND', 'SPEED')
GENDER mean(SPEED)_(L) mean(SPEED)_(R) mean(SPEED)_(None)
0 None NaN 46.40 NaN
1 F 34.10 40.78 NaN
2 M 45.85 27.85 29.3
>>>
>>> # Find the median accuracy and speed for every gender and hand
combination.
... x.pivot_table('GENDER', 'HAND', aggfunc = oml.DataFrame.median)
GENDER median(ACCURACY)_(L) median(ACCURACY)_(R)
median(ACCURACY)_(None) \
0 None NaN 0.890
NaN
1 F 0.96 0.870
NaN
2 M 0.91 0.925
0.97

median(SPEED)_(L) median(SPEED)_(R) median(SPEED)_(None)


0 NaN 46.40 NaN
1 34.10 39.60 NaN
2 45.85 27.85 29.3
>>>
>>> # Find the max and min speeds for every gender and hand combination and
... # across all combinations.
... x.pivot_table('GENDER', 'HAND', 'SPEED',
... aggfunc = [oml.DataFrame.max, oml.DataFrame.min],
... margins = True)
GENDER max(SPEED)_(L) max(SPEED)_(R) max(SPEED)_(None)
max(SPEED)_(All) \
0 None NaN 46.4 NaN
46.4
1 F 34.1 60.8 NaN
60.8
2 M 51.2 30.4 29.3
51.2
3 All 51.2 60.8 29.3
60.8

min(SPEED)_(L) min(SPEED)_(R) min(SPEED)_(None) min(SPEED)_(All)


0 NaN 46.4 NaN 46.4
1 34.1 12.0 NaN 12.0
2 40.5 25.3 29.3 25.3
3 34.1 12.0 29.3 12.0

7.2.4 Mutate Data


In preparing data for analysis, a typical operation is to mutate data by reformatting it or deriving
new columns and adding them to the data set.
These examples demonstrate methods of formatting data and deriving columns.

import pandas as pd
import oml

# Create a shopping cart data set.


shopping_cart = pd.DataFrame({
'Item_name': ['paper_towel', 'ground_pork', 'tofu', 'eggs',
'pork_loin', 'whole_milk', 'egg_custard'],
'Item_type': ['grocery', 'meat', 'grocery', 'dairy', 'meat',
'dairy', 'bakery'],
'Quantity': [1, 2.6, 4, 1, 1.9, 1, 1],
'Unit_price': [1.19, 2.79, 0.99, 2.49, 3.19, 2.5, 3.99]
})
oml_cart = oml.push(shopping_cart)
oml_cart

# Add a column 'Price' multiplying 'Quantity' with 'Unit_price',


# rounded to 2 decimal places.
price = oml_cart['Quantity']*(oml_cart['Unit_price'])
type(price)
price
oml_cart = oml_cart.concat({'Price': price.round(2)})

# Count the pattern 'egg' in the 'Item_name' column.


egg_pattern = oml_cart['Item_name'].count_pattern('egg')
type(egg_pattern)
oml_cart.concat({'Egg_pattern': egg_pattern})

# Find the start index of substring 'pork' in the 'Item_name' column.


pork_startInd = oml_cart['Item_name'].find('pork')
type(pork_startInd)
oml_cart.concat({'Pork_startInd': pork_startInd})

# Check whether items are of grocery category.


is_grocery=oml_cart['Item_type']=='grocery'
type(is_grocery)
oml_cart.concat({'Is_grocery': is_grocery})

# Calculate the length of item names.


name_length=oml_cart['Item_name'].len()
type(name_length)
oml_cart.concat({'Name_length': name_length})

# Get the ceiling, floor, exponential, logarithm and square root


# of the 'Price' column.
oml_cart['Price'].ceil()
oml_cart['Price'].floor()
oml_cart['Price'].exp()
oml_cart['Price'].log()
oml_cart['Price'].sqrt()

Listing for This Example

>>> import pandas as pd


>>> import oml
>>>
>>> # Create a shopping cart data set.
... shopping_cart = pd.DataFrame({
... 'Item_name': ['paper_towel', 'ground_pork', 'tofu', 'eggs',
... 'pork_loin', 'whole_milk', 'egg_custard'],
... 'Item_type': ['grocery', 'meat', 'grocery', 'dairy', 'meat',
... 'dairy', 'bakery'],
... 'Quantity': [1, 2.6, 4, 1, 1.9, 1, 1],
... 'Unit_price': [1.19, 2.79, 0.99, 2.49, 3.19, 2.5, 3.99]
... })
>>> oml_cart = oml.push(shopping_cart)
>>> oml_cart
Item_name Item_type Quantity Unit_price
0 paper_towel grocery 1.0 1.19
1 ground_pork meat 2.6 2.79
2 tofu grocery 4.0 0.99
3 eggs dairy 1.0 2.49
4 pork_loin meat 1.9 3.19
5 whole_milk dairy 1.0 2.50
6 egg_custard bakery 1.0 3.99
>>>
>>> # Add a column 'Price' multiplying 'Quantity' with 'Unit_price',
... # rounded to 2 decimal places.
... price = oml_cart['Quantity']*(oml_cart['Unit_price'])
>>> type(price)
<class 'oml.core.float.Float'>
>>> price
[1.19, 7.254, 3.96, 2.49, 6.061, 2.5, 3.99]
>>> oml_cart = oml_cart.concat({'Price': price.round(2)})
>>>
>>> # Count the pattern 'egg' in the 'Item_name' column.
... egg_pattern = oml_cart['Item_name'].count_pattern('egg')
>>> type(egg_pattern)
<class 'oml.core.float.Float'>
>>> oml_cart.concat({'Egg_pattern': egg_pattern})
Item_name Item_type Quantity Unit_price Price Egg_pattern
0 paper_towel grocery 1.0 1.19 1.19 0
1 ground_pork meat 2.6 2.79 7.25 0
2 tofu grocery 4.0 0.99 3.96 0
3 eggs dairy 1.0 2.49 2.49 1
4 pork_loin meat 1.9 3.19 6.06 0
5 whole_milk dairy 1.0 2.50 2.50 0
6 egg_custard bakery 1.0 3.99 3.99 1
>>>
>>> # Find the start index of substring 'pork' in the 'Item_name' column.
... pork_startInd = oml_cart['Item_name'].find('pork')
>>> type(pork_startInd)
<class 'oml.core.float.Float'>
>>> oml_cart.concat({'Pork_startInd': pork_startInd})
Item_name Item_type Quantity Unit_price Price Pork_startInd
0 paper_towel grocery 1.0 1.19 1.19 -1
1 ground_pork meat 2.6 2.79 7.25 7
2 tofu grocery 4.0 0.99 3.96 -1
3 eggs dairy 1.0 2.49 2.49 -1
4 pork_loin meat 1.9 3.19 6.06 0
5 whole_milk dairy 1.0 2.50 2.50 -1
6 egg_custard bakery 1.0 3.99 3.99 -1
>>>
>>> # Check whether items are of grocery category.
... is_grocery=oml_cart['Item_type']=='grocery'
>>> type(is_grocery)
<class 'oml.core.boolean.Boolean'>
>>> oml_cart.concat({'Is_grocery': is_grocery})
Item_name Item_type Quantity Unit_price Price Is_grocery
0 paper_towel grocery 1.0 1.19 1.19 True
1 ground_pork meat 2.6 2.79 7.25 False
2 tofu grocery 4.0 0.99 3.96 True
3 eggs dairy 1.0 2.49 2.49 False
4 pork_loin meat 1.9 3.19 6.06 False
5 whole_milk dairy 1.0 2.50 2.50 False
6 egg_custard bakery 1.0 3.99 3.99 False
>>>
>>> # Calculate the length of item names.
... name_length=oml_cart['Item_name'].len()
>>> type(name_length)
<class 'oml.core.float.Float'>
>>> oml_cart.concat({'Name_length': name_length})
Item_name Item_type Quantity Unit_price Price Name_length
0 paper_towel grocery 1.0 1.19 1.19 11
1 ground_pork meat 2.6 2.79 7.25 11
2 tofu grocery 4.0 0.99 3.96 4
3 eggs dairy 1.0 2.49 2.49 4
4 pork_loin meat 1.9 3.19 6.06 9
5 whole_milk dairy 1.0 2.50 2.50 10
6 egg_custard bakery 1.0 3.99 3.99 11
>>>
>>> # Get the ceiling, floor, exponential, logarithm and square root
... # of the 'Price' column.
... oml_cart['Price'].ceil()
[2, 8, 4, 3, 7, 3, 4]
>>> oml_cart['Price'].floor()
[1, 7, 3, 2, 6, 2, 3]
>>> oml_cart['Price'].exp()
[3.2870812073831184, 1408.1048482046956, 52.45732594909905,
12.061276120444719, 428.37543685928694, 12.182493960703473, 54.05488936332659]
>>> oml_cart['Price'].log()
[0.173953307123438, 1.9810014688665833, 1.3762440252663892,
0.9122827104766162, 1.801709800081223, 0.9162907318741551, 1.3837912309017721]
>>> oml_cart['Price'].sqrt()
[1.0908712114635715, 2.692582403567252, 1.98997487421324, 1.57797338380595,
2.4617067250182343, 1.5811388300841898, 1.997498435543818]

7.2.5 Sort Data


The sort_values function enables flexible sorting of an oml.DataFrame along one or more
columns specified by the by argument, and returns an oml.DataFrame.

Example 7-11 Sorting Data


The following example demonstrates these operations.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Modify the data set by replacing a few entries with NaNs to test
# how the na_position parameter works in the sort_values method.
Iris = oml_iris.pull()
Iris['Sepal_Width'].replace({3.5: None}, inplace=True)
Iris['Petal_Length'].replace({1.5: None}, inplace=True)
Iris['Petal_Width'].replace({2.3: None}, inplace=True)

# Create another table using the changed data.


oml_iris2 = oml.create(Iris, table = 'IRIS2')

# Sort the data set first by Sepal_Length then by Sepal_Width


# in descending order and display the first 5 rows of the
# sorted result.
oml_iris2.sort_values(by = ['Sepal_Length', 'Sepal_Width'],
ascending=False).head()


# Display the last 5 rows of the data set.


oml_iris2.tail()

# Sort the last 5 rows of the iris data set first by Petal_Length
# then by Petal_Width. By default, rows with NaNs are placed
# after the other rows when the sort keys are the same.
oml_iris2.tail().sort_values(by = ['Petal_Length', 'Petal_Width'])

# Sort the last 5 rows of the iris data set first by Petal_Length
# and then by Petal_Width. When the values in these two columns
# are the same, place the row with a NaN before the other row.
oml_iris2.tail().sort_values(by = ['Petal_Length', 'Petal_Width'],
na_position = 'first')

oml.drop('IRIS')
oml.drop('IRIS2')

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Modify the data set by replacing a few entries with NaNs to test
... # how the na_position parameter works in the sort_values method.
... Iris = oml_iris.pull()
>>> Iris['Sepal_Width'].replace({3.5: None}, inplace=True)
>>> Iris['Petal_Length'].replace({1.5: None}, inplace=True)
>>> Iris['Petal_Width'].replace({2.3: None}, inplace=True)
>>>
>>> # Create another table using the changed data.
... oml_iris2 = oml.create(Iris, table = 'IRIS2')
>>>
>>> # Sort the data set first by 'Sepal_Length' then by 'Sepal_Width'
... # in descending order and display the first 5 rows of the
... # sorted result.
... oml_iris2.sort_values(by = ['Sepal_Length', 'Sepal_Width'],
... ascending=False).head()
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 7.9 3.8 6.4 2.0 virginica
1 7.7 3.8 6.7 2.2 virginica
2 7.7 3.0 6.1 NaN virginica

3 7.7 2.8 6.7 2.0 virginica
4 7.7 2.6 6.9 NaN virginica
>>>
>>> # Display the last 5 rows of the data set.
... oml_iris2.tail()
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 6.7 3.0 5.2 NaN virginica
1 6.3 2.5 5.0 1.9 virginica
2 6.5 3.0 5.2 2.0 virginica
3 6.2 3.4 5.4 NaN virginica
4 5.9 3.0 5.1 1.8 virginica
>>>
>>> # Sort the last 5 rows of the iris data set first by 'Petal_Length'
... # then by 'Petal_Width'. By default, rows with NaNs are placed
... # after the other rows when the sort keys are the same.
... oml_iris2.tail().sort_values(by = ['Petal_Length', 'Petal_Width'])
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 6.3 2.5 5.0 1.9 virginica
1 5.9 3.0 5.1 1.8 virginica
2 6.5 3.0 5.2 2.0 virginica
3 6.7 3.0 5.2 NaN virginica
4 6.2 3.4 5.4 NaN virginica
>>>
>>> # Sort the last 5 rows of the iris data set first by 'Petal_Length'
... # and then by 'Petal_Width'. When the values in these two columns
... # are the same, place the row with a NaN before the other row.
... oml_iris2.tail().sort_values(by = ['Petal_Length', 'Petal_Width'],
... na_position = 'first')
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 6.3 2.5 5.0 1.9 virginica
1 5.9 3.0 5.1 1.8 virginica
2 6.7 3.0 5.2 NaN virginica
3 6.5 3.0 5.2 2.0 virginica
4 6.2 3.4 5.4 NaN virginica
>>>
>>> oml.drop('IRIS')
>>> oml.drop('IRIS2')

7.2.6 Summarize Data


The describe method calculates descriptive statistics that summarize the central tendency,
dispersion, and shape of the data in each column.
You can also specify the types of columns to include or exclude from the results.
With the sum and cumsum methods, you can compute the sum and cumulative sum of each
Float or Boolean column of an oml.DataFrame.

The describe method supports finding the following statistics:

• Mean, minimum, maximum, median, top character, standard deviation


• Number of not-Null values, unique values, top characters
• Percentiles between 0 and 1


Example 7-12 Calculating Descriptive Statistics


The following example demonstrates these operations.

import pandas as pd
import oml

df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, None],


'string' : [None, None, 'a', 'a', 'a', 'b'],
'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})

oml_df = oml.push(df, dbtypes = {'numeric': 'BINARY_DOUBLE',


'string':'CHAR(1)',
'bytes':'RAW(1)'})

# Combine a Boolean column with oml_df.


oml_bool = oml_df['numeric'] > 3
oml_df = oml_df.concat(oml_bool)
oml_df.rename({'COL4':'boolean'})

# Describe all of the columns.


oml_df.describe(include='all')

# Exclude Float columns.


oml_df.describe(exclude=[oml.Float])

# Get the sum of values in each Float or Boolean column.


oml_df.sum()

# Find the cumulative sum of values in each Float or Boolean column


# after oml_df is sorted by the bytes column in descending order.
oml_df.cumsum(by = 'bytes', ascending = False)

# Compute the skewness of values in the Float columns.


oml_df.skew()

# Find the median value of Float columns.


oml_df.median()

# Calculate the kurtosis of Float columns.


oml_df.kurtosis()

Listing for This Example

>>> import pandas as pd


>>> import oml
>>>
>>> df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, None],
... 'string' : [None, None, 'a', 'a', 'a', 'b'],
... 'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})
>>>
>>> oml_df = oml.push(df, dbtypes = {'numeric': 'BINARY_DOUBLE',
... 'string':'CHAR(1)',
... 'bytes':'RAW(1)'})
>>>
>>> # Combine a Boolean column with oml_df.


... oml_bool = oml_df['numeric'] > 3


>>> oml_df = oml_df.concat(oml_bool)
>>> oml_df.rename({'COL4':'boolean'})
bytes numeric string boolean
0 b'a' 1.000 None False
1 b'b' 1.400 None False
2 b'c' -4.000 a False
3 b'c' 3.145 a True
4 b'd' 5.000 a True
5 b'e' NaN b True
>>>
>>> # Describe all of the columns.
... oml_df.describe(include='all')
bytes numeric string boolean
count 6 5.000000 4 6
unique 5 NaN 2 2
top b'c' NaN a False
freq 2 NaN 3 3
mean NaN 1.309000 NaN NaN
std NaN 3.364655 NaN NaN
min NaN -4.000000 NaN NaN
25% NaN 1.000000 NaN NaN
50% NaN 1.400000 NaN NaN
75% NaN 3.145000 NaN NaN
max NaN 5.000000 NaN NaN
>>>
>>> # Exclude Float columns.
... oml_df.describe(exclude=[oml.Float])
bytes string boolean
count 6 4 6
unique 5 2 2
top b'c' a False
freq 2 3 3
>>>
>>> # Get the sum of values in each Float or Boolean column.
... oml_df.sum()
numeric 6.545
boolean 3.000
dtype: float64
>>>
>>> # Find the cumulative sum of values in each Float or Boolean column
... # after oml_df is sorted by the bytes column in descending order.
... oml_df.cumsum(by = 'bytes', ascending = False)
numeric boolean
0 NaN 1
1 5.000 2
2 1.000 2
3 4.145 3
4 5.545 3
5 6.545 3
>>>
>>> # Compute the skewness of values in the Float columns.
... oml_df.skew()
numeric -0.683838
dtype: float64
>>>


>>> # Find the median value of Float columns.


... oml_df.median()
numeric 1.4
dtype: float64
>>>
>>> # Calculate the kurtosis of Float columns.
... oml_df.kurtosis()
numeric -0.582684
dtype: float64

7.3 Render Graphics


OML4Py provides functions for rendering graphical displays of data.
The oml.boxplot and oml.hist functions compute the statistics necessary to generate box
and whisker plots or histograms in-database for scalability and performance.
OML4Py uses the matplotlib library to render the output. You can use methods of
matplotlib.pyplot to customize the created images and matplotlib.pyplot.show to show
the images. By default, rendered graphics have the same properties as those stored in
matplotlib.rcParams.
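
For example, you might adjust a default rendering property through matplotlib before calling an
OML4Py graphics function. The following sketch is an illustration only; the oml.DataFrame proxy
object oml_df with numeric columns is an assumption, not part of the shipped examples.

import oml
import matplotlib
import matplotlib.pyplot as plt

# Change a default rendering property; it applies to figures created afterward.
matplotlib.rcParams['figure.figsize'] = (8, 4)

# Compute the box plot statistics in the database and render the result,
# then customize the figure with standard pyplot calls.
oml.graphics.boxplot(oml_df[:, 0:2])
plt.title('Example box plot')
plt.show()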

For the parameters of the oml.boxplot and oml.hist functions, invoke help(oml.boxplot) or
help(oml.hist), or see Oracle Machine Learning for Python API Reference.

Generate a Box Plot


Use the oml.boxplot function to generate a box and whisker plot for every column of x or for
every column object in x.
Example 7-13 Using the oml.boxplot Function
This example first loads the wine data set from sklearn and creates the pandas.DataFrame
object wine_data. It then creates a temporary database table, with its corresponding proxy
oml.DataFrame object oml_wine, from wine_data. It draws a box and whisker plot on every
column with the index ranging from 8 to 12 (not including 12) in oml_wine. The arguments
showmeans and meanline are set to True to show the arithmetic means and to render the mean
as a line spanning the full width of the box. The argument patch_artist is set to True to have
the boxes drawn with Patch artists.

import oml
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

wine = datasets.load_wine()
wine_data = pd.DataFrame(wine.data, columns = wine.feature_names)
oml_wine = oml.push(wine_data)
oml.graphics.boxplot(oml_wine[:,8:12], showmeans=True,
meanline=True, patch_artist=True,
labels=oml_wine.columns[8:12])
plt.title('Distribution of Wine Attributes')
plt.show()

The output of the example is the following.


The image shows a box and whisker plot for each of the four columns of the wine data set:
Proanthocyanins, Color intensity, Hue, and OD280/OD315 of diluted wines. The boxes extend
from the lower to upper quartile values of the data, with a solid orange line at the median. The
whiskers that extend from the box show the range of the data. The caps are the horizontal
lines at the ends of the whiskers. Flier or outlier points are those past the ends of the whiskers.
The mean is shown as a green dotted line spanning the width of the each box.

Generate a Histogram
Use the oml.hist function to compute and draw a histogram for every data set column
contained in x.
Example 7-14 Using the oml.hist Function
This example first loads the wine data set from sklearn and creates the pandas.DataFrame
object wine_data. It then creates a temporary database table, with its corresponding proxy
oml.DataFrame object oml_wine, from wine_data. Next it draws a histogram on the proline
column of oml_wine. The argument bins specifies generating ten equal-width bins. The argument
color specifies filling the bars with the color red. The arguments linestyle and edgecolor are
set to draw the bar edges as solid lines in white.

import oml
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

wine = load_wine()
wine_data = pd.DataFrame(wine.data, columns = wine.feature_names)
oml_wine = oml.push(wine_data)
oml.graphics.hist(oml_wine['proline'], bins=10, color='red',
linestyle='solid', edgecolor='white')
plt.title('Proline content in Wine')
plt.xlabel('proline content')
plt.ylabel('# of wine instances')
plt.show()

The output of the example is the following.


The image shows a traditional bar-type histogram for the Proline column of the wine data set.
The range of proline values is divided into 10 bins of equal size. The height of the rectangular
bar for each bin indicates the number of wine instances in each bin. The bars are red with solid
white edges.

8 OML4Py Classes That Provide Access to In-Database Machine Learning Algorithms
OML4Py has classes that provide access to in-database Oracle Machine Learning algorithms.
These classes are described in the following topics.
• About Machine Learning Classes and Algorithms
These classes provide access to in-database machine learning algorithms.
• About Model Settings
You can specify settings that affect the characteristics of a model.
• Shared Settings
These settings are common to all of the OML4Py machine learning classes.
• Export Oracle Machine Learning for Python Models
You can export an oml model from Python and then score it in SQL.
• Automatic Data Preparation
Oracle Machine Learning for Python supports Automatic Data Preparation (ADP) and user-
directed general data preparation.
• Model Explainability
Use the OML4Py Explainability module to identify the important features that impact a
trained model’s predictions.
• Attribute Importance
The oml.ai class computes the relative attribute importance, which ranks attributes
according to their significance in predicting a classification or regression target.
• Association Rules
The oml.ar class implements the Apriori algorithm to find frequent itemsets and
association rules, all as part of an association model object.
• Decision Tree
The oml.dt class uses the Decision Tree algorithm for classification.
• Expectation Maximization
The oml.em class uses the Expectation Maximization (EM) algorithm to create a clustering
model.
• Explicit Semantic Analysis
The oml.esa class extracts text-based features from a corpus of documents and performs
document similarity comparisons.
• Generalized Linear Model
The oml.glm class builds a Generalized Linear Model (GLM) model.
• k-Means
The oml.km class uses the k-Means (KM) algorithm, which is a hierarchical, distance-
based clustering algorithm that partitions data into a specified number of clusters.
• Naive Bayes
The oml.nb class creates a Naive Bayes (NB) model for classification.


• Neural Network
The oml.nn class creates a Neural Network (NN) model for classification and regression.
• Random Forest
The oml.rf class creates a Random Forest (RF) model that provides an ensemble
learning technique for classification.
• Singular Value Decomposition
Use the oml.svd class to build a model for feature extraction.
• Support Vector Machine
The oml.svm class creates a Support Vector Machine (SVM) model for classification,
regression, or anomaly detection.

8.1 About Machine Learning Classes and Algorithms


These classes provide access to in-database machine learning algorithms.

Algorithm Classes

Class: oml.ai
Algorithm: Minimum Description Length
Function of Algorithm: Attribute importance for classification or regression
Description: Ranks attributes according to their importance in predicting a target.

Class: oml.ar
Algorithm: Apriori
Function of Algorithm: Association rules
Description: Performs market basket analysis by identifying co-occurring items (frequent
itemsets) within a set.

Class: oml.dt
Algorithm: Decision Tree
Function of Algorithm: Classification
Description: Extracts predictive information in the form of human-understandable rules. The
rules are if-then-else expressions; they explain the decisions that lead to the prediction.

Class: oml.em
Algorithm: Expectation Maximization
Function of Algorithm: Clustering
Description: Performs probabilistic clustering based on a density estimation algorithm.

Class: oml.esa
Algorithm: Explicit Semantic Analysis
Function of Algorithm: Feature extraction
Description: Extracts text-based features from a corpus of documents. Performs document
similarity comparisons.

Class: oml.glm
Algorithm: Generalized Linear Model
Function of Algorithm: Classification, Regression
Description: Implements logistic regression for classification of binary targets and linear
regression for continuous targets.

Class: oml.km
Algorithm: k-Means
Function of Algorithm: Clustering
Description: Uses unsupervised learning to group data based on similarity into a predetermined
number of clusters.

Class: oml.nb
Algorithm: Naive Bayes
Function of Algorithm: Classification
Description: Makes predictions by deriving the probability of a prediction from the underlying
evidence, as observed in the data.

Class: oml.nn
Algorithm: Neural Network
Function of Algorithm: Classification, Regression
Description: Learns from examples and tunes the weights of the connections among the neurons
during the learning process.

Class: oml.rf
Algorithm: Random Forest
Function of Algorithm: Classification
Description: Provides an ensemble learning technique for classification of data.

Class: oml.svd
Algorithm: Singular Value Decomposition
Function of Algorithm: Feature extraction
Description: Performs orthogonal linear transformations that capture the underlying variance of
the data by decomposing a rectangular matrix into three matrices.

Class: oml.svm
Algorithm: Support Vector Machine
Function of Algorithm: Anomaly detection, Classification, Regression
Description: Builds a model that is a profile of a class, which, when the model is applied,
identifies cases that are somehow different from that profile.

Repeatable Results
You can use the case_id parameter in the fit method of the OML4Py machine learning
algorithm classes to achieve repeatable sampling, data splits (train and held aside), and
random data shuffling.
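
For illustration, the following sketch passes a case identifier column when fitting a model so
that sampling and data splits are repeatable. The oml.dt model and the train_x and train_y proxy
objects, with train_x containing an 'ID' column, are assumptions for this sketch.

# Fit a Decision Tree model using the 'ID' column as the case identifier
# so that internal sampling and shuffling are repeatable across runs.
dt_mod = oml.dt()
dt_mod = dt_mod.fit(train_x, train_y, case_id = 'ID')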

Persisting Models
In-database models created through the OML4Py API exist as temporary objects that are
dropped when the database connection ends unless you take one of the following actions:
• Save a default-named model object in a datastore, as in the following example:

regr2 = oml.glm("regression")
oml.ds.save(regr2, 'regression2')

• Use the model_name parameter in the fit function when building the model, as in the
following example:

regr2 = regr2.fit(X, y, model_name = 'regression2')

• Change the name of an existing model using the model_name function of the model, as in
the following example:

regr2(model_name = 'myRegression2')

To drop a persistent named model, use the oml.drop function.

Creating a Model from an Existing In-Database Model


You can create an OML4Py model as a proxy object for an existing in-database machine
learning model. The in-database model could have been created through OML4Py, OML4SQL,
or OML4R. To do so, when creating the OML4Py, specify the name of the existing model and,
optionally, the name of the owner of the model, as in the following example.

ar_mod = oml.ar(model_name = 'existing_ar_model', model_owner = 'SH',


**setting)

An OML4Py model created this way persists until you drop it with the oml.drop function.

Scoring New Data with a Model


For most of the OML4Py machine learning classes, you can use the predict and
predict_proba methods of the model object to score new data.

For in-database models, you can use the SQL PREDICTION function on model proxy objects,
which scores directly in the database. You can use in-database models directly from SQL if you
prepare the data properly. For open source models, you can use Embedded Python Execution
and enable data-parallel execution for performance and scalability.
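
As an illustration, the following sketch scores new data with a fitted classification model. The
fitted model object svm_mod and the proxy object test_dat, which contains the predictor columns
used to build the model plus the target column 'Species', are assumptions for this sketch.

# Return predictions, keeping the actual target values alongside them.
pred = svm_mod.predict(test_dat.drop('Species'),
                       supplemental_cols = test_dat[:, ['Species']])

# Return the per-class probabilities for the same rows.
prob = svm_mod.predict_proba(test_dat.drop('Species'),
                             supplemental_cols = test_dat[:, ['Species']])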

Deploying Models Through a REST API


The REST API for Oracle Machine Learning Services provides REST endpoints hosted on an
Oracle Autonomous Database instance. These endpoints allow you to store OML models
along with their metadata, and to create scoring endpoints for the models.

8.2 About Model Settings


You can specify settings that affect the characteristics of a model.
Some settings are general, some are specific to an Oracle Machine Learning function, and
some are specific to an algorithm.
All settings have default values. If you want to override one or more of the settings for a model,
then you must specify the settings with the **params parameter when instantiating the model
or later by using the set_params method of the model.

For the __init__ method, the argument can be key-value pairs or a dict. Each element's
name and value refer to a machine learning algorithm parameter setting name and value,
respectively. The setting value must be numeric or a string.
The argument for the **params parameter of the set_params method is a dict object mapping
a str to a str. The key should be the name of the setting, and the value should be the new
setting.
Example 8-1 Specifying Model Settings
This example shows the creation of an Expectation Maximization (EM) model and the
changing of a setting. For the complete code of the EM model example, see Example 8-10.

# Specify settings.
setting = {'emcs_num_iterations': 100}
# Create an EM model object
em_mod = oml.em(n_clusters = 2, **setting)

# Intervening code not shown.

# Change the random seed and refit the model.


em_mod.set_params(EMCS_RANDOM_SEED = '5').fit(train_dat)

8.3 Shared Settings


These settings are common to all of the OML4Py machine learning classes.
The following table lists the settings that are shared by all OML4Py models.


Table 8-1 Shared Model Settings

Setting Name Setting Value Description


ODMS_DETAILS ODMS_ENABLE Helps to control model size in the database. Model details
ODMS_DISABLE can consume significant disk space, especially for
partitioned models. The default value is ODMS_ENABLE.
If the setting value is ODMS_ENABLE, then model detail
tables and views are created along with the model. You
can query the model details using SQL.
If the value is ODMS_DISABLE, then model detail tables are
not created and tables relevant to model details are also
not created.
The reduction in the space depends on the algorithm.
Model size reduction can be on the order of 10x .
ODMS_MAX_PARTITIONS 1 < value <= 1000000 Controls the maximum number of partitions allowed for a
partitioned model. The default is 1000.
ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO, ODMS_MISSING_VALUE_MEAN_MODE, or
ODMS_MISSING_VALUE_DELETE_ROW Indicates how to treat missing values in the training data.
This setting does not affect the scoring data. The default value is ODMS_MISSING_VALUE_AUTO.
ODMS_MISSING_VALUE_MEAN_MODE replaces missing values with the mean (numeric attributes) or
the mode (categorical attributes) both at build time and apply time where appropriate.
ODMS_MISSING_VALUE_AUTO performs different strategies for different algorithms.
When ODMS_MISSING_VALUE_TREATMENT is set to ODMS_MISSING_VALUE_DELETE_ROW, the rows in the
training data that contain missing values are deleted. However, if you want to replicate this
missing value treatment in the scoring data, then you must perform the transformation
explicitly. The value ODMS_MISSING_VALUE_DELETE_ROW is applicable to all algorithms.
ODMS_PARTITION_BUILD_TYPE ODMS_PARTITION_BUILD_INTRA, ODMS_PARTITION_BUILD_INTER, or
ODMS_PARTITION_BUILD_HYBRID Controls the parallel building of partitioned models.
ODMS_PARTITION_BUILD_INTRA builds each partition in parallel using all slaves.
ODMS_PARTITION_BUILD_INTER builds each partition entirely in a single slave, but multiple
partitions may be built at the same time because multiple slaves are active.
ODMS_PARTITION_BUILD_HYBRID combines the other two types and is recommended for most
situations to adapt to dynamic environments. This is the default value.
ODMS_PARTITION_COLUMNS Comma separated list of Requests the building of a partitioned model. The setting
machine learning attributes value is a comma-separated list of the machine learning
attributes to be used to determine the in-list partition key
values. These attributes are taken from the input columns,
unless an XFORM_LIST parameter is passed to the model.
If XFORM_LIST parameter is passed to the model, then the
attributes are taken from the attributes produced by these
transformations.

ODMS_TABLESPACE_NAME tablespace_name Specifies the tablespace in which to store the model.
If you explicitly set this to the name of a tablespace (for which you have sufficient quota),
then the resulting model content is created in the specified tablespace. If you do not provide
this setting, then the resulting model content is created in your default tablespace.
ODMS_SAMPLE_SIZE 0 < value Determines how many rows to sample (approximately).
You can use this setting only if ODMS_SAMPLING is
enabled. The default value is system determined.
ODMS_SAMPLING ODMS_SAMPLING_ENABLE Allows the user to request sampling of the build data. The
ODMS_SAMPLING_DISABLE default is ODMS_SAMPLING_DISABLE.

ODMS_TEXT_MAX_FEATURES 1 <= value The maximum number of distinct features, across all text
attributes, to use from a document set passed to the
model. The default is 3000. An oml.esa model has the
default value of 300000.
ODMS_TEXT_MIN_DOCUMENTS Non-negative value This text processing setting controls how many documents
a token needs to appear in to be used as a feature.
The default is 1. An oml.esa model has the default value
of 3.
ODMS_TEXT_POLICY_NAME The name of an Oracle Text Affects how individual tokens are extracted from
POLICY created using unstructured text.
CTX_DDL.CREATE_POLICY. For details about CTX_DDL.CREATE_POLICY, see Oracle
Text Reference.
PREP_AUTO PREP_AUTO_ON This data preparation setting enables fully automated data
PREP_AUTO_OFF preparation.
The default is PREP_AUTO_ON.
PREP_SCALE_2DNUM PREP_SCALE_STDDEV This data preparation setting enables scaling data
PREP_SCALE_RANGE preparation for two-dimensional numeric columns.
PREP_AUTO must be OFF for this setting to take effect. The
following are the possible values:
PREP_SCALE_STDDEV: A request to divide the column
values by the standard deviation of the column and is often
provided together with PREP_SHIFT_MEAN to yield z-score
normalization.
PREP_SCALE_RANGE: A request to divide the column
values by the range of values and is often provided
together with PREP_SHIFT_MIN to yield a range of [0,1].
PREP_SCALE_NNUM PREP_SCALE_MAXABS This data preparation setting enables scaling data
preparation for nested numeric columns. PREP_AUTO must
be OFF for this setting to take effect. If specified, then the
valid value for this setting is PREP_SCALE_MAXABS, which
yields data in the range of [-1,1].


PREP_SHIFT_2DNUM PREP_SHIFT_MEAN This data preparation setting enables centering data
PREP_SHIFT_MIN preparation for two-dimensional numeric columns.
PREP_AUTO must be OFF for this setting to take effect. The
following are the possible values:
PREP_SHIFT_MEAN: Results in subtracting the average of
the column from each value.
PREP_SHIFT_MIN: Results in subtracting the minimum of
the column from each value.
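
As an illustration, the following sketch passes two of the shared settings from Table 8-1 when
creating a model. The oml.glm model and the train_x and train_y proxy objects are assumptions
for this sketch.

# Collect shared settings in a dict and pass them with **params.
setting = {'ODMS_DETAILS': 'ODMS_DISABLE',
           'ODMS_SAMPLING': 'ODMS_SAMPLING_ENABLE'}

# Create and fit the model; the settings take effect when the model is built.
glm_mod = oml.glm('regression', **setting)
glm_mod = glm_mod.fit(train_x, train_y)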

8.4 Export Oracle Machine Learning for Python Models


You can export an oml model from Python and then score it in SQL.

Export a Model
With the export_sermodel function of an OML4Py algorithm model, you can export the model
in a serialized format. You can then score that model in SQL. To save a model to a permanent
table, you must pass in a name for the new table. If the model is partitioned, then you can
optionally select an individual partition to export; otherwise all partitions are exported.

Note:
Any data transformations you apply to the data for model building you must also
apply to the data for scoring with the imported model.

Example 8-2 Export a Trained oml.svm Model to a Database Table


This example creates the x and y variables using the iris data set. It then creates the persistent
database table IRIS and the oml.DataFrame object oml_iris as a proxy for the table.

This example preprocesses the iris data set and splits the data set into training data and test
data. It then fits an oml.svm model according to the training data of the data set, and saves the
fitted model in a serialized format to a new table named svm_sermod in the database.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:


oml.drop('IRIS')
oml.drop('IRIS_TEST_DATA')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

df = oml.sync(table = "IRIS").pull()

# Add a case identifier column.


df.insert(0, 'ID', range(0,len(df)))

# Create training data and test data.


IRIS_TMP = oml.push(df).split()
train_x = IRIS_TMP[0].drop('Species')
train_y = IRIS_TMP[0]['Species']
test_dat = IRIS_TMP[1]

# Create the iris_test_data database table.


oml_test_dat = oml.create(test_dat.pull(), table = "IRIS_TEST_DATA")

# Create an oml SVM model object.


svm_mod = oml.svm('classification',
svms_kernel_function =
'dbms_data_mining.svms_linear')

# Fit the SVM model with the training data.


svm_mod = svm_mod.fit(train_x, train_y, case_id = 'ID')

# Export the oml.svm model to a new table named 'svm_sermod'


# in the database.
svm_export = svm_mod.export_sermodel(table='svm_sermod')
type(svm_export)

# Show the first 10 characters of the BLOB content from the


# model export.
svm_export.pull()[0][1:10]

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>


>>> try:
... oml.drop('IRIS')
... oml.drop('IRIS_TEST_DATA')
... except:
... pass
>>> # Create the IRIS database table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> df = oml.sync(table = "IRIS").pull()
>>>
>>> # Add a case identifier column.
... df.insert(0, 'ID', range(0,len(df)))
>>>
>>> # Create training data and test data.
... IRIS_TMP = oml.push(df).split()
>>> train_x = IRIS_TMP[0].drop('Species')
>>> train_y = IRIS_TMP[0]['Species']
>>> test_dat = IRIS_TMP[1]
>>>
>>> # Create the iris_test_data database table.
... oml_test_dat = oml.create(test_dat.pull(), table = "IRIS_TEST_DATA")
>>>
>>> # Create an oml SVM model object.
... svm_mod = oml.svm('classification',
... svms_kernel_function =
'dbms_data_mining.svms_linear')
>>>
>>> # Fit the SVM model with the training data.
... svm_mod = svm_mod.fit(train_x, train_y, case_id='ID')
>>>
>>> # Export the oml.svm model to a new table named 'svm_sermod'
... # in the database.
... svm_export = svm_mod.export_sermodel(table='svm_sermod')
>>> type(svm_export)
<class 'oml.core.bytes.Bytes'>
>>>
>>> # Show the first 10 characters of the BLOB content from the
... # model export.
... svm_export.pull()[0][1:10]
b'\xff\xfc|\x00\x00\x02\x9c\x00\x00'

Import a Model
In SQL, you can import the serialized format of an OML4Py model into an Oracle Machine
Learning for SQL model with the DBMS_DATA_MINING.IMPORT_SERMODEL procedure. To that
procedure, you pass the BLOB content from the table to which the model was exported and
the name of the model to be created. The import procedure provides the ability to score the
model. It does not create model views or tables that are needed for querying model details.
You can use the SQL function PREDICTION to apply the imported model to the test data and get
the prediction results.
Example 8-3 Import a Serialized SVM Model as an OML4SQL Model in SQL
This example retrieves the serialized content of the SVM classification model from the
svm_sermod table. It uses the IMPORT_SERMODEL procedure to create a model named
my_iris_svm_classifier with the content from the table. It also predicts test data saved in the
iris_test_data table with the newly imported model my_iris_svm_classifier, and compares the
prediction results with the target classes.

-- After starting SQL*Plus as the OML4Py user.


-- Import the model from the serialized content.

DECLARE
v_blob blob;

BEGIN
SELECT SERVAL INTO v_blob FROM "svm_sermod";
dbms_data_mining.import_sermodel(v_blob, 'my_iris_svm_classifier');
END;
/

-- Set the output column format.


column TARGET_SPECIES format a15
column PREDICT_SPECIES format a15

-- Make predictions and display cases where mod(ID,3) equals 0.


SELECT ID, "Species" AS TARGET_SPECIES,
PREDICTION(my_iris_svm_classifier USING "Sepal_Length", "Sepal_Width",
"Petal_Length", "Petal_Width")
AS PREDICT_SPECIES
FROM "IRIS_TEST_DATA" WHERE MOD(ID,3) = 0;

-- Drop the imported model


BEGIN
DBMS_DATA_MINING.DROP_MODEL(model_name => 'my_iris_svm_classifier');
END;
/

The prediction produces the following results.

ID TARGET_SPECIES PREDICT_SPECIES
-- -------------- ---------------
0 setosa setosa
24 setosa setosa
27 setosa setosa
33 setosa setosa
36 setosa setosa
39 setosa setosa
48 setosa setosa
54 versicolor versicolor
57 versicolor versicolor
93 versicolor versicolor
114 virginica virginica
120 virginica virginica
132 virginica virginica
13 rows selected.


8.5 Automatic Data Preparation


Oracle Machine Learning for Python supports Automatic Data Preparation (ADP) and user-
directed general data preparation.
The PREP_* settings enable you to request fully automated (ADP) or manual data preparation.
By default, ADP is enabled (PREP_AUTO_ON). When you perform data preparation manually, you
must address the data preparation requirements of each algorithm.
When you enable ADP, the model uses heuristics to transform the build data according to the
requirements of the algorithm. Instead of ADP, you can request that the data be shifted and/or
scaled with the PREP_SCALE_* and PREP_SHIFT_* settings. The transformation instructions are
stored with the model and reused whenever the model is applied. The model settings can be
viewed in USER_MINING_MODEL_SETTINGS.
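
As an illustration, the following sketch turns off ADP and requests z-score normalization for
two-dimensional numeric columns. The oml.svm model and the train_x and train_y proxy objects
are assumptions for this sketch.

# Disable ADP and request the shift and scale transformations explicitly.
setting = {'PREP_AUTO': 'PREP_AUTO_OFF',
           'PREP_SHIFT_2DNUM': 'PREP_SHIFT_MEAN',
           'PREP_SCALE_2DNUM': 'PREP_SCALE_STDDEV'}

# The transformation instructions are stored with the model and reused at apply time.
svm_mod = oml.svm('classification', **setting)
svm_mod = svm_mod.fit(train_x, train_y)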

PREP_* Settings
The values for the PREP_* settings are described in the following table.

Table 8-2 PREP_* Settings

Setting Name Setting Value Description


PREP_AUTO PREP_AUTO_ON This setting enables fully automated data
PREP_AUTO_OFF preparation.
The default is PREP_AUTO_ON.
PREP_SCALE_2DNUM PREP_SCALE_STDDEV This setting enables scaling data
PREP_SCALE_RANGE preparation for two-dimensional numeric
columns. PREP_AUTO must be OFF for this
setting to take effect. The following are the
possible values.
PREP_SCALE_STDDEV: A request to divide
the column values by the standard
deviation of the column and is often
provided together with PREP_SHIFT_MEAN
to yield z-score normalization.
PREP_SCALE_RANGE: A request to divide
the column values by the range of values
and is often provided together with
PREP_SHIFT_MIN to yield a range of [0,1].
PREP_SCALE_NNUM PREP_SCALE_MAXABS This setting enables scaling data
preparation for nested numeric columns.
PREP_AUTO must be OFF for this setting to
take effect. If specified, then the valid value
for this setting is PREP_SCALE_MAXABS,
which yields data in the range of [-1,1].


PREP_SHIFT_2DNUM PREP_SHIFT_MEAN This setting enables centering data
PREP_SHIFT_MIN preparation for two-dimensional numeric
columns. PREP_AUTO must be OFF for this
setting to take effect. The following are the
possible values:
PREP_SHIFT_MEAN: Results in subtracting
the average of the column from each value.
PREP_SHIFT_MIN: Results in subtracting
the minimum of the column from each
value.

See Also:

• About Model Settings


• Shared Settings

8.6 Model Explainability


Use the OML4Py Explainability module to identify the important features that impact a trained
model’s predictions.
Machine Learning Explainability (MLX) is the process of explaining and interpreting machine
learning models. The OML MLX Python module supports the ability to help better understand a
model's behavior and why it makes its predictions. MLX currently provides model-agnostic
explanations for classification and regression tasks where explanations treat the ML model as
a black-box, instead of using properties from the model to guide the explanation.
The global feature importance explainer object is the interface to the MLX permutation
importance explainer. The global feature importance explainer identifies the most important
features for a given model and data set. The explainer is model-agnostic and currently
supports tabular classification and regression data sets with both numerical and categorical
features.
The algorithm estimates feature importance by evaluating the model's sensitivity to changes in
a specific feature. Higher sensitivity suggests that the model places higher importance on that
feature when making its predictions than on another feature with lower sensitivity.
For information on the oml.GlobalFeatureImportance class attributes and methods, call
help(oml.mlx.GlobalFeatureImportance) or see Oracle Machine Learning for Python API
Reference.
Example 8-4 Binary Classification
This example uses the Breast Cancer binary classification data set. Load the data set into the
database and a unique case id column.

import oml
from oml.mlx import GlobalFeatureImportance


import pandas as pd
import numpy as np
from sklearn import datasets

bc_ds = datasets.load_breast_cancer()
bc_data = bc_ds.data.astype(float)
X = pd.DataFrame(bc_data, columns=bc_ds.feature_names)
y = pd.DataFrame(bc_ds.target, columns=['TARGET'])
row_id = pd.DataFrame(np.arange(bc_data.shape[0]),
columns=['CASE_ID'])
df = oml.create(pd.concat([X, y, row_id], axis=1),
table='BreastCancer')

Split the data set into train and test variables.

train, test = df.split(ratio=(0.8, 0.2), hash_cols='CASE_ID',


seed=32)
X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']

Train a Random Forest model.

model = oml.algo.rf(ODMS_RANDOM_SEED=32).fit(X, y, case_id='CASE_ID')


"RF accuracy score = {:.2f}".format(model.score(X_test, y_test))

Create the MLX Global Feature Importance explainer, using the binary f1 metric.

gfi = GlobalFeatureImportance(mining_function='classification',
score_metric='f1', random_state=32,
parallel=4)

Run the explainer to generate the global feature importance. Here we construct an explanation
using the train data set and then display the explanation.

explanation = gfi.explain(model, X, y, case_id='CASE_ID', n_iter=10)


explanation

Drop the BreastCancer table.

oml.drop('BreastCancer')

Listing for This Example

>>> import oml


>>> from oml.mlx import GlobalFeatureImportance
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn import datasets
>>>
>>> bc_ds = datasets.load_breast_cancer()
>>> bc_data = bc_ds.data.astype(float)
>>> X = pd.DataFrame(bc_data, columns=bc_ds.feature_names)


>>> y = pd.DataFrame(bc_ds.target, columns=['TARGET'])


>>> row_id = pd.DataFrame(np.arange(bc_data.shape[0]),
... columns=['CASE_ID'])
>>> df = oml.create(pd.concat([X, y, row_id], axis=1),
... table='BreastCancer')
>>>
>>> train, test = df.split(ratio=(0.8, 0.2), hash_cols='CASE_ID',
... seed=32)
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']
>>>
>>> model = oml.algo.rf(ODMS_RANDOM_SEED=32).fit(X, y, case_id='CASE_ID')
... "RF accuracy score = {:.2f}".format(model.score(X_test, y_test))
'RF accuracy score = 0.95'
>>>
>>> gfi = GlobalFeatureImportance(mining_function='classification',
... score_metric='f1', random_state=32,
... parallel=4)
>>>
>>> explanation = gfi.explain(model, X, y, case_id='CASE_ID', n_iter=10)
>>> explanation
Global Feature Importance:
[0] worst concave points: Value: 0.0263, Error: 0.0069
[1] worst perimeter: Value: 0.0077, Error: 0.0027
[2] worst radius: Value: 0.0076, Error: 0.0031
[3] worst area: Value: 0.0045, Error: 0.0037
[4] mean concave points: Value: 0.0034, Error: 0.0033
[5] worst texture: Value: 0.0017, Error: 0.0015
[6] area error: Value: 0.0012, Error: 0.0014
[7] worst concavity: Value: 0.0008, Error: 0.0008
[8] worst symmetry: Value: 0.0004, Error: 0.0007
[9] mean texture: Value: 0.0003, Error: 0.0007
[10] mean perimeter: Value: 0.0003, Error: 0.0015
[11] mean radius: Value: 0.0000, Error: 0.0000
[12] mean smoothness: Value: 0.0000, Error: 0.0000
[13] mean compactness: Value: 0.0000, Error: 0.0000
[14] mean concavity: Value: 0.0000, Error: 0.0000
[15] mean symmetry: Value: 0.0000, Error: 0.0000
[16] mean fractal dimension: Value: 0.0000, Error: 0.0000
[17] radius error: Value: 0.0000, Error: 0.0000
[18] texture error: Value: 0.0000, Error: 0.0000
[19] smoothness error: Value: 0.0000, Error: 0.0000
[20] compactness error: Value: 0.0000, Error: 0.0000
[21] concavity error: Value: 0.0000, Error: 0.0000
[22] concave points error: Value: 0.0000, Error: 0.0000
[23] symmetry error: Value: 0.0000, Error: 0.0000
[24] fractal dimension error: Value: 0.0000, Error: 0.0000
[25] worst compactness: Value: 0.0000, Error: 0.0000
[26] worst fractal dimension: Value: 0.0000, Error: 0.0000
[27] mean area: Value: -0.0001, Error: 0.0011
[28] worst smoothness: Value: -0.0003, Error: 0.0013

>>> oml.drop('BreastCancer')


Example 8-5 Multi-Class Classification


This example uses the Iris multi-class classification data set. Load the data set into the
database, adding a unique case id column.

import oml
from oml.mlx import GlobalFeatureImportance
import pandas as pd
import numpy as np
from sklearn import datasets

iris_ds = datasets.load_iris()
iris_data = iris_ds.data.astype(float)
X = pd.DataFrame(iris_data, columns=iris_ds.feature_names)
y = pd.DataFrame(iris_ds.target, columns=['TARGET'])
row_id = pd.DataFrame(np.arange(iris_data.shape[0]),
columns=['CASE_ID'])
df = oml.create(pd.concat([X, y, row_id], axis=1), table='Iris')

Split the data set into train and test variables.

train, test = df.split(ratio=(0.8, 0.2), hash_cols='CASE_ID',


seed=32)
X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']

Train an SVM model.

model = oml.algo.svm(ODMS_RANDOM_SEED=32).fit(X, y, case_id='CASE_ID')


"SVM accuracy score = {:.2f}".format(model.score(X_test, y_test))

Create the MLX Global Feature Importance explainer, using the f1_weighted metric.

gfi = GlobalFeatureImportance(mining_function='classification',
score_metric='f1_weighted',
random_state=32, parallel=4)

Run the explainer to generate the global feature importance. Here, we use the test data set.
Display the explanation.

explanation = gfi.explain(model, X_test, y_test,


case_id='CASE_ID', n_iter=10)
explanation

Drop the Iris table.

oml.drop('Iris')

Listing for This Example

>>> import oml


>>> from oml.mlx import GlobalFeatureImportance


>>> import pandas as pd


>>> import numpy as np
>>> from sklearn import datasets
>>>
>>> iris_ds = datasets.load_iris()
>>> iris_data = iris_ds.data.astype(float)
>>> X = pd.DataFrame(iris_data, columns=iris_ds.feature_names)
>>> y = pd.DataFrame(iris_ds.target, columns=['TARGET'])
>>> row_id = pd.DataFrame(np.arange(iris_data.shape[0]),
... columns=['CASE_ID'])
>>> df = oml.create(pd.concat([X, y, row_id], axis=1), table='Iris')
>>>
>>> train, test = df.split(ratio=(0.8, 0.2), hash_cols='CASE_ID',
... seed=32)
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']
>>>
>>> model = oml.algo.svm(ODMS_RANDOM_SEED=32).fit(X, y, case_id='CASE_ID')
>>> "SVM accuracy score = {:.2f}".format(model.score(X_test, y_test))
'SVM accuracy score = 0.94'
>>>
>>> gfi = GlobalFeatureImportance(mining_function='classification',
... score_metric='f1_weighted',
... random_state=32, parallel=4)
>>>
>>> explanation = gfi.explain(model, X_test, y_test,
... case_id='CASE_ID', n_iter=10)
>>> explanation
Global Feature Importance:
[0] petal length (cm): Value: 0.3462, Error: 0.0824
[1] petal width (cm): Value: 0.2417, Error: 0.0687
[2] sepal width (cm): Value: 0.0926, Error: 0.0452
[3] sepal length (cm): Value: 0.0253, Error: 0.0152

>>> oml.drop('Iris')

Example 8-6 Regression


This example uses the Boston regression data set. Load the data set into the database, adding
a unique case id column.

import oml
from oml.mlx import GlobalFeatureImportance
import pandas as pd
import numpy as np
from sklearn import datasets

boston_ds = datasets.load_boston()
boston_data = boston_ds.data
X = pd.DataFrame(boston_data, columns=boston_ds.feature_names)
y = pd.DataFrame(boston_ds.target, columns=['TARGET'])
row_id = pd.DataFrame(np.arange(boston_data.shape[0]),
columns=['CASE_ID'])
df = oml.create(pd.concat([X, y, row_id], axis=1), table='Boston')


Split the data set into train and test variables.

train, test = df.split(ratio=(0.8, 0.2), hash_cols='CASE_ID', seed=32)


X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']

Train a Neural Network regression model.

model = oml.algo.nn(mining_function='regression',
ODMS_RANDOM_SEED=32).fit(X, y, case_id='CASE_ID')
"NN R^2 score = {:.2f}".format(model.score(X_test, y_test))

Create the MLX Global Feature Importance explainer, using the r2 metric.

gfi = GlobalFeatureImportance(mining_function='regression',
score_metric='r2', random_state=32,
parallel=4)

Run the explainer to generate the global feature importance. Here, we use the test data set.
Display the explanation.

explanation = gfi.explain(model, df, 'TARGET',


case_id='CASE_ID', n_iter=10)
explanation

Drop the Boston table.

oml.drop('Boston')

Listing for This Example

>>> import oml


>>> from oml.mlx import GlobalFeatureImportance
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn import datasets
>>>
>>> boston_ds = datasets.load_boston()
>>> boston_data = boston_ds.data
>>> X = pd.DataFrame(boston_data, columns=boston_ds.feature_names)
>>> y = pd.DataFrame(boston_ds.target, columns=['TARGET'])
>>> row_id = pd.DataFrame(np.arange(boston_data.shape[0]),
... columns=['CASE_ID'])
>>> df = oml.create(pd.concat([X, y, row_id], axis=1), table='Boston')
>>>
>>> train, test = df.split(ratio=(0.8, 0.2), hash_cols='CASE_ID',
... seed=32)
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']
>>>
>>> model = oml.algo.nn(mining_function='regression',
... ODMS_RANDOM_SEED=32).fit(X, y, case_id='CASE_ID')


>>> "NN R^2 score = {:.2f}".format(model.score(X_test, y_test))


'NN R^2 score = 0.85'
>>>
>>> gfi = GlobalFeatureImportance(mining_function='regression',
... score_metric='r2', random_state=32,
... parallel=4)
>>>
>>> explanation = gfi.explain(model, df, 'TARGET',
... case_id='CASE_ID', n_iter=10)
>>> explanation
Global Feature Importance:
[0] LSTAT: Value: 0.7686, Error: 0.0513
[1] RM: Value: 0.5734, Error: 0.0475
[2] CRIM: Value: 0.5131, Error: 0.0345
[3] DIS: Value: 0.4170, Error: 0.0632
[4] NOX: Value: 0.2592, Error: 0.0206
[5] AGE: Value: 0.2083, Error: 0.0212
[6] RAD: Value: 0.1956, Error: 0.0188
[7] INDUS: Value: 0.1792, Error: 0.0199
[8] B: Value: 0.0982, Error: 0.0146
[9] PTRATIO: Value: 0.0822, Error: 0.0069
[10] TAX: Value: 0.0566, Error: 0.0139
[11] ZN: Value: 0.0397, Error: 0.0081
[12] CHAS: Value: 0.0125, Error: 0.0045

>>> oml.drop('Boston')

8.7 Attribute Importance


The oml.ai class computes the relative attribute importance, which ranks attributes according
to their significance in predicting a classification or regression target.
The oml.ai class uses the Minimum Description Length (MDL) algorithm to calculate attribute
importance. MDL assumes that the simplest, most compact representation of the data is the
best and most probable explanation of the data.
You can use methods of the oml.ai class to compute the relative importance of predictor
variables when predicting a response variable.

Note:
Oracle Machine Learning does not support the scoring operation for oml.ai.

The results of oml.ai are the attributes of the build data ranked according to their predictive
influence on a specified target attribute. You can use the ranking and the measure of
importance for selecting attributes.
For information on the oml.ai class attributes and methods, invoke help(oml.ai) or see
Oracle Machine Learning for Python API Reference.


See Also:

• About Model Settings


• Shared Settings

Example 8-7 Ranking Attribute Significance with oml.ai


This example creates the x and y variables using the iris data set. It then creates the persistent
database table IRIS and the oml.DataFrame object oml_iris as a proxy for the table.

This example demonstrates the use of various methods of the oml.ai class.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop('IRIS')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# Specify settings.
setting = {'ODMS_SAMPLING':'ODMS_SAMPLING_DISABLE'}

# Create an AI model object.


ai_mod = oml.ai(**setting)

# Fit the AI model according to the training data and parameter


# settings.
ai_mod = ai_mod.fit(train_x, train_y)

# Show the model details.


ai_mod


Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # Specify settings.
... setting = {'ODMS_SAMPLING':'ODMS_SAMPLING_DISABLE'}
>>>
>>> # Create an AI model object.
... ai_mod = oml.ai(**setting)
>>>
>>> # Fit the AI model according to the training data and parameter
... # settings.
>>> ai_mod = ai_mod.fit(train_x, train_y)
>>>
>>> # Show the model details.
... ai_mod

Algorithm Name: Attribute Importance

Mining Function: ATTRIBUTE_IMPORTANCE

Settings:
setting name setting value
0 ALGO_NAME ALGO_AI_MDL
1 ODMS_DETAILS ODMS_ENABLE
2 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
3 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
4 PREP_AUTO ON

Global Statistics:


attribute name attribute value


0 NUM_ROWS 104

Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

Importance:

variable importance rank


0 Petal_Width 0.615851 1
1 Petal_Length 0.362519 2
2 Sepal_Length 0.042751 3
3 Sepal_Width -0.155867 4

8.8 Association Rules


The oml.ar class implements the Apriori algorithm to find frequent itemsets and association
rules, all as part of an association model object.
The Apriori algorithm is efficient and scales well with respect to the number of transactions,
number of items, and number of itemsets and rules produced.
Use the oml.ar class to identify frequent itemsets within large volumes of transactional data,
such as in market basket analysis. The results of an association model are the rules that
identify patterns of association within the data.
An association rule identifies a pattern in the data in which the appearance of a set of items in
a transactional record implies another set of items. The groups of items used to form rules
must pass a minimum threshold according to how often they occur (the support of the rule) and
how often the consequent follows the antecedent (the confidence of the rule). Association
models generate all rules that have support and confidence greater than user-specified
thresholds.
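For example, if an itemset such as {bread, milk} appears in 20 of 100 transactions, the rule
bread implies milk has a support of 0.2; if bread appears in 40 of those transactions, the
rule's confidence is 20/40 = 0.5. (These item names and numbers are illustrative only.)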
Oracle Machine Learning does not support the scoring operation for association modeling.
For information on the oml.ar class attributes and methods, invoke help(oml.ar) or see
Oracle Machine Learning for Python API Reference.

Settings for an Association Rules Model


The following table lists the settings applicable to association rules models.


Table 8-3 Association Rules Models Settings

Setting Name Setting Value Description


ASSO_ABS_ERROR 0 < ASSO_ABS_ERROR <= MAX(ASSO_MIN_SUPPORT, ASSO_MIN_CONFIDENCE)
Specifies the absolute error for the association rules sampling.
A smaller value of ASSO_ABS_ERROR obtains a larger sample size that gives accurate
results but takes longer to compute. Set a reasonable value for ASSO_ABS_ERROR, such as
the default value, to avoid too large a sample size.
The default value is 0.5 * MAX(ASSO_MIN_SUPPORT, ASSO_MIN_CONFIDENCE).
ASSO_AGGREGATES NULL Specifies the columns to aggregate. It is a
comma separated list of strings containing
the names of the columns for aggregation.
The number of columns in the list must be
<= 10.
You can set ASSO_AGGREGATES if you
have specified a column name with
ODMS_ITEM_ID_COLUMN_NAME. The data
table must have valid column names such
as ITEM_ID and CASE_ID which are
derived from
ODMS_ITEM_ID_COLUMN_NAME.
An item value is not mandatory.
The default value is NULL.
For each item, you may supply several
columns to aggregate. However, doing so
requires more memory to buffer the extra
data and also affects performance
because of the larger input data set and
increased operations.
ASSO_ANT_IN_RULES NULL Sets Including Rules for the antecedent: it
is a comma separated list of strings, at
least one of which must appear in the
antecedent part of each reported
association rule.
The default value is NULL.
ASSO_ANT_EX_RULES NULL Sets Excluding Rules for the antecedent: it
is a comma separated list of strings, none
of which can appear in the antecedent part
of each reported association rule.
The default value is NULL.
ASSO_CONF_LEVEL 0 <= ASSO_CONF_LEVEL <= 1 Specifies the confidence level for an
association rules sample.
A larger value of ASSO_CONF_LEVEL obtains a larger sample size. Any value between 0.9
and 1 is suitable. The default value is 0.95.

ASSO_CONS_IN_RULES NULL Sets Including Rules for the consequent: it
is a comma separated list of strings, at
least one of which must appear in the
consequent part of each reported
association rule.
The default value is NULL.
ASSO_CONS_EX_RULES NULL Sets Excluding Rules for the consequent:
it is a comma separated list of strings,
none of which can appear in the
consequent part of a reported association
rule.
You can use the excluding rule to reduce
the data that must be stored, but you may
be required to build extra models for
executing different Including or Excluding
Rules.
The default value is NULL.
ASSO_EX_RULES NULL Sets Excluding Rules applied for each
association rule: it is a comma separated
list of strings that cannot appear in an
association rule. No rule can contain any
item in the list.
The default value is NULL.
ASSO_IN_RULES NULL Sets Including Rules applied for each
association rule: it is a comma separated
list of strings, at least one of which must
appear in each reported association rule,
either as antecedent or as consequent.
The default value is NULL, which specifies
that filtering is not applied.
ASSO_MAX_RULE_LENGTH TO_CHAR( 2<= numeric_expr Maximum rule length for association rules.
<=20) The default value is 4.
ASSO_MIN_CONFIDENCE TO_CHAR( 0<= numeric_expr Minimum confidence for association rules.
<=1) The default value is 0.1.
ASSO_MIN_REV_CONFIDENCE TO_CHAR( 0<= numeric_expr Sets the Minimum Reverse Confidence
<=1) that each rule should satisfy.
The Reverse Confidence of a rule is
defined as the number of transactions in
which the rule occurs divided by the
number of transactions in which the
consequent occurs.
The value is real number between 0 and 1.
The default value is 0.
ASSO_MIN_SUPPORT TO_CHAR( 0<= numeric_expr Minimum support for association rules.
<=1) The default value is 0.1.
ASSO_MIN_SUPPORT_INT TO_CHAR( 0<= numeric_expr Minimum absolute support that each rule
<=1) must satisfy. The value must be an integer.
The default value is 1.



ODMS_ITEM_ID_COLUMN_NAME column_name The name of a column that contains the
items in a transaction. When you specify
this setting, the algorithm expects the data
to be presented in native transactional
format, consisting of two columns:
• Case ID, either categorical or numeric
• Item ID, either categorical or numeric
ODMS_ITEM_VALUE_COLUMN_NAME column_name The name of a column that contains a
value associated with each item in a
transaction. Use this setting only when you
have specified a value for
ODMS_ITEM_ID_COLUMN_NAME, indicating
that the data is presented in native
transactional format.
If you also use ASSO_AGGREGATES, then
the build data must include the following
three columns and the columns specified
in the AGGREGATES setting.
• Case ID, either categorical or numeric
• Item ID, either categorical or numeric,
specified by
ODMS_ITEM_ID_COLUMN_NAME
• Item value, either categorical or
numeric, specified by
ODMS_ITEM_VALUE_COLUMN_NAME
If ASSO_AGGREGATES, Case ID, and Item
ID columns are present, then the Item
Value column may or may not appear.
The Item Value column may specify
information such as the number of items
(for example, three apples) or the type of
the item (for example, macintosh apples).

See Also:

• About Model Settings


• Shared Settings
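All of the settings in Table 8-3 are passed to the oml.ar constructor as a dictionary of strings, as Example 8-8 shows for ASSO_MIN_SUPPORT and ASSO_MIN_CONFIDENCE. The sketch below combines the transactional-format and filtering settings described above; the table name BASKETS, the column name ITEM_ID, and the item value 'bread' are hypothetical and only illustrate the pattern:

# A sketch, not a complete runnable session: market-basket data in native
# transactional format is assumed to exist in a table named BASKETS.
setting = {'ASSO_MIN_SUPPORT': '0.05',
           'ASSO_MIN_CONFIDENCE': '0.1',
           'ASSO_MAX_RULE_LENGTH': '3',
           'ODMS_ITEM_ID_COLUMN_NAME': 'ITEM_ID',   # items column (hypothetical)
           'ASSO_ANT_IN_RULES': 'bread'}            # antecedent filter (hypothetical)
ar_mod = oml.ar(**setting)
# basket_dat = oml.sync(table = 'BASKETS')
# ar_mod = ar_mod.fit(basket_dat)  # consult the API reference for the case
#                                  # identifier argument, if one is required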

Example 8-8 Using the oml.ar Class


This example uses methods of the oml.ar class.

import pandas as pd
from sklearn import datasets
import oml

# Load the iris data set and create a pandas.DataFrame for it.


iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop('IRIS')
except:
pass

# Create the IRIS database table.


oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training data.


train_dat = oml.sync(table = 'IRIS')

# Specify settings.
setting = {'asso_min_support':'0.1', 'asso_min_confidence':'0.1'}

# Create an AR model object.


ar_mod = oml.ar(**setting)

# Fit the model according to the training data and parameter


# settings.
ar_mod = ar_mod.fit(train_dat)

# Show details of the model.


ar_mod

Listing for This Example

>>> import pandas as pd


>>> from sklearn import datasets
>>> import oml
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table.


... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')


>>>
>>> # Create training data.
... train_dat = oml.sync(table = 'IRIS')
>>>
>>> # Specify settings.
... setting = {'asso_min_support':'0.1', 'asso_min_confidence':'0.1'}
>>>
>>> # Create an AR model object.
... ar_mod = oml.ar(**setting)
>>>
>>> # Fit the model according to the training data and parameter
... # settings.
>>> ar_mod = ar_mod.fit(train_dat)
>>>
>>> # Show details of the model.
... ar_mod

Algorithm Name: Association Rules

Mining Function: ASSOCIATION

Settings:
setting name setting value
0 ALGO_NAME ALGO_APRIORI_ASSOCIATION_RULES
1 ASSO_MAX_RULE_LENGTH 4
2 ASSO_MIN_CONFIDENCE 0.1
3 ASSO_MIN_REV_CONFIDENCE 0
4 ASSO_MIN_SUPPORT 0.1
5 ASSO_MIN_SUPPORT_INT 1
6 ODMS_DETAILS ODMS_ENABLE
7 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
8 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
9 PREP_AUTO ON

Global Statistics:
attribute name attribute value
0 ITEMSET_COUNT 6.000000
1 MAX_SUPPORT 0.333333
2 NUM_ROWS 150.000000
3 RULE_COUNT 2.000000
4 TRANSACTION_COUNT 150.000000

Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Species

Partition: NO

Itemsets:

ITEMSET_ID SUPPORT NUMBER_OF_ITEMS ITEM_NAME ITEM_VALUE


0 1 0.193333 1 Petal_Width .20000000000000001


1 2 0.173333 1 Sepal_Width 3
2 3 0.333333 1 Species setosa
3 4 0.333333 1 Species versicolor
4 5 0.333333 1 Species virginica
5 6 0.193333 2 Petal_Width .20000000000000001
6 6 0.193333 2 Species setosa

Rules:

RULE_ID NUMBER_OF_ITEMS LHS_NAME LHS_VALUE RHS_NAME \


0 1 2 Species setosa Petal_Width
1 2 2 Petal_Width .20000000000000001 Species

RHS_VALUE SUPPORT CONFIDENCE REVCONFIDENCE LIFT


0 None 0.186667 0.58 1.00 3
1 None 0.186667 1.00 0.58 3

8.9 Decision Tree


The oml.dt class uses the Decision Tree algorithm for classification.

Decision Tree models are classification models that contain axis-parallel rules. A rule is a
conditional statement that can be understood by humans and may be used within a database
to identify a set of records.
A decision tree predicts a target value by asking a sequence of questions. At a given stage in
the sequence, the question that is asked depends upon the answers to the previous questions.
The goal is to ask questions that, taken together, uniquely identify specific target values.
Graphically, this process forms a tree structure.
During the training process, the Decision Tree algorithm must repeatedly find the most efficient
way to split a set of cases (records) into two child nodes. The oml.dt class offers two
homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.
For information on the oml.dt class attributes and methods, invoke help(oml.dt) or see
Oracle Machine Learning for Python API Reference.
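For instance, to request the entropy metric instead of the default gini metric, limit the tree depth, and balance the target distribution, the corresponding settings (all listed in Table 8-4 below) can be supplied when the model object is created. This is only a sketch of the settings dictionary; the full workflow appears in Example 8-9:

# A sketch: choose the entropy homogeneity metric and balance the classes.
setting = {'TREE_IMPURITY_METRIC': 'TREE_IMPURITY_ENTROPY',
           'TREE_TERM_MAX_DEPTH': '5',
           'CLAS_WEIGHTS_BALANCED': 'ON'}
dt_mod = oml.dt(**setting)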

Settings for a Decision Tree Model


The following table lists settings that apply to Decision Tree models.


Table 8-4 Decision Tree Model Settings

Setting Name Setting Value Description


CLAS_COST_TABLE_NAME table_name The name of a table that stores a cost
matrix for the algorithm to use in building
and applying the model. The cost matrix
specifies the costs associated with
misclassifications.
The cost matrix table is user-created. The
following are the column requirements for
the table.
• Column Name:
ACTUAL_TARGET_VALUE
Data Type: Valid target data type
• Column Name:
PREDICTED_TARGET_VALUE
Data Type: Valid target data type
• Column Name: COST
Data Type: NUMBER
CLAS_MAX_SUP_BINS 2 <= a number <= 2147483647 Specifies the maximum number of bins for
each attribute.
The default value is 32.
CLAS_WEIGHTS_BALANCED ON Indicates whether the algorithm must
OFF create a model that balances the target
distribution. This setting is most relevant in
the presence of rare targets, as balancing
the distribution may enable better average
accuracy (average of per-class accuracy)
instead of overall accuracy (which favors
the dominant class). The default value is
OFF.
TREE_IMPURITY_METRIC TREE_IMPURITY_ENTROPY Tree impurity metric for a Decision Tree
TREE_IMPURITY_GINI model.
Tree algorithms seek the best test question
for splitting data at each node. The best
splitter and split value are those that result
in the largest increase in target value
homogeneity (purity) for the entities in the
node. Purity is measured in accordance
with a metric. Decision trees can use either
gini (TREE_IMPURITY_GINI) or entropy
(TREE_IMPURITY_ENTROPY) as the purity
metric. By default, the algorithm uses
TREE_IMPURITY_GINI.
TREE_TERM_MAX_DEPTH 2 <= a number <= 100 Criteria for splits: maximum tree depth (the
maximum number of nodes between the
root and any leaf node, including the leaf
node).
The default is 7.
TREE_TERM_MINPCT_NODE 0 <= a number <= 10 The minimum number of training rows in a
node expressed as a percentage of the
rows in the training data.
The default value is 0.05, indicating
0.05%.



TREE_TERM_MINPCT_SPLIT 0 < a number <= 20 Minimum number of rows required to
consider splitting a node expressed as a
percentage of the training rows.
The default value is 0.1, indicating 0.1%.
TREE_TERM_MINREC_NODE A number >= 0 Minimum number of rows in a node.
The default value is 10.
TREE_TERM_MINREC_SPLIT A number > 1 Criteria for splits: minimum number of
records in a parent node expressed as a
value. No split is attempted if the number
of records is below this value.
The default value is 20.

See Also:

• About Model Settings


• Shared Settings

Example 8-9 Using the oml.dt Class


This example demonstrates the use of various methods of the oml.dt class. In the listing for
this example, some of the output is not shown as indicated by ellipses.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop('COST_MATRIX')
oml.drop('IRIS')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()


train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# Create a cost matrix table in the database.


cost_matrix = [['setosa', 'setosa', 0],
['setosa', 'virginica', 0.2],
['setosa', 'versicolor', 0.8],
['virginica', 'virginica', 0],
['virginica', 'setosa', 0.5],
['virginica', 'versicolor', 0.5],
['versicolor', 'versicolor', 0],
['versicolor', 'setosa', 0.4],
['versicolor', 'virginica', 0.6]]
cost_matrix = oml.create(
pd.DataFrame(cost_matrix,
columns = ['ACTUAL_TARGET_VALUE',
'PREDICTED_TARGET_VALUE', 'COST']),
table = 'COST_MATRIX')

# Specify settings.
setting = {'TREE_TERM_MAX_DEPTH':'2'}

# Create a DT model object.


dt_mod = oml.dt(**setting)

# Fit the DT model according to the training data and parameter


# settings.
dt_mod.fit(train_x, train_y, cost_matrix = cost_matrix)

# Use the model to make predictions on the test data.


dt_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']])

# Return the prediction probability.


dt_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Species']],
proba = True)

# Make predictions and return the probability for each class


# on new data.
dt_mod.predict_proba(test_dat.drop('Species'),
supplemental_cols = test_dat[:,
['Sepal_Length',
'Species']]).sort_values(by = ['Sepal_Length',
'Species'])

dt_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])


Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('COST_MATRIX')
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # Create a cost matrix table in the database.
... cost_matrix = [['setosa', 'setosa', 0],
... ['setosa', 'virginica', 0.2],
... ['setosa', 'versicolor', 0.8],
... ['virginica', 'virginica', 0],
... ['virginica', 'setosa', 0.5],
... ['virginica', 'versicolor', 0.5],
... ['versicolor', 'versicolor', 0],
... ['versicolor', 'setosa', 0.4],
... ['versicolor', 'virginica', 0.6]]
>>> cost_matrix = oml.create(
... pd.DataFrame(cost_matrix,
... columns = ['ACTUAL_TARGET_VALUE',
... 'PREDICTED_TARGET_VALUE',
... 'COST']),
... table = 'COST_MATRIX')
>>>
>>> # Specify settings.
... setting = {'TREE_TERM_MAX_DEPTH':'2'}
>>>
>>> # Create a DT model object.
... dt_mod = oml.dt(**setting)
>>>
>>> # Fit the DT model according to the training data and parameter
... # settings.


>>> dt_mod.fit(train_x, train_y, cost_matrix = cost_matrix)

Algorithm Name: Decision Tree

Mining Function: CLASSIFICATION

Target: Species

Settings:
setting name setting value
0 ALGO_NAME ALGO_DECISION_TREE
1 CLAS_COST_TABLE_NAME "OML_USER"."COST_MATRIX"
2 CLAS_MAX_SUP_BINS 32
3 CLAS_WEIGHTS_BALANCED OFF
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
7 PREP_AUTO ON
8 TREE_IMPURITY_METRIC TREE_IMPURITY_GINI
9 TREE_TERM_MAX_DEPTH 2
10 TREE_TERM_MINPCT_NODE .05
11 TREE_TERM_MINPCT_SPLIT .1
12 TREE_TERM_MINREC_NODE 10
13 TREE_TERM_MINREC_SPLIT 20

Global Statistics:
attribute name attribute value
0 NUM_ROWS 104

Attributes:
Petal_Length
Petal_Width

Partition: NO

Distributions:

NODE_ID TARGET_VALUE TARGET_COUNT


0 0 setosa 36
1 0 versicolor 35
2 0 virginica 33
3 1 setosa 36
4 2 versicolor 35
5 2 virginica 33

Nodes:

parent node.id row.count prediction \


0 0.0 1 36 setosa
1 0.0 2 68 versicolor
2 NaN 0 104 setosa

split \
0 (Petal_Length <=(2.4500000000000002E+000))
1 (Petal_Length >(2.4500000000000002E+000))
2 None


surrogate \
0 Petal_Width <=(8.0000000000000004E-001))
1 Petal_Width >(8.0000000000000004E-001))
2 None

full.splits
0 (Petal_Length <=(2.4500000000000002E+000))
1 (Petal_Length >(2.4500000000000002E+000))
2 (
>>>
>>> # Use the model to make predictions on the test data.
... dt_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
44 6.7 3.3 5.7 virginica versicolor
45 6.7 3.0 5.2 virginica versicolor
46 6.5 3.0 5.2 virginica versicolor
47 5.9 3.0 5.1 virginica versicolor
>>>
>>> # Return the prediction probability.
... dt_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... proba = True)
Sepal_Length Sepal_Width Species PREDICTION PROBABILITY
0 4.9 3.0 setosa setosa 1.000000
1 4.9 3.1 setosa setosa 1.000000
2 4.8 3.4 setosa setosa 1.000000
3 5.8 4.0 setosa setosa 1.000000
... ... ... ... ... ...
44 6.7 3.3 virginica versicolor 0.514706
45 6.7 3.0 virginica versicolor 0.514706
46 6.5 3.0 virginica versicolor 0.514706
47 5.9 3.0 virginica versicolor 0.514706

>>> # Make predictions and return the probability for each class
>>> # on new data.
>>> dt_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:,
... ['Sepal_Length',
... 'Species']]).sort_values(by = ['Sepal_Length',
... 'Species'])
Sepal_Length Species PROBABILITY_OF_SETOSA \
0 4.4 setosa 1.0
1 4.4 setosa 1.0
2 4.5 setosa 1.0


3 4.8 setosa 1.0


... ... ... ...
42 6.7 virginica 0.0
43 6.9 versicolor 0.0
44 6.9 virginica 0.0
45 7.0 versicolor 0.0

PROBABILITY_OF_VERSICOLOR PROBABILITY_OF_VIRGINICA
0 0.000000 0.000000
1 0.000000 0.000000
2 0.000000 0.000000
3 0.000000 0.000000
... ... ...
42 0.514706 0.485294
43 0.514706 0.485294
44 0.514706 0.485294
45 0.514706 0.485294
>>>
>>> dt_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.645833

8.10 Expectation Maximization


The oml.em class uses the Expectation Maximization (EM) algorithm to create a clustering
model.
EM is a density estimation algorithm that performs probabilistic clustering. In density
estimation, the goal is to construct a density function that captures how a given population is
distributed. The density estimate is based on observed data that represents a sample of the
population.
For information on the oml.em class methods, invoke help(oml.em) or see Oracle Machine
Learning for Python API Reference.

Settings for an Expectation Maximization Model


The following table lists settings for data preparation and analysis for EM models.


Table 8-5 Expectation Maximization Settings for Data Preparation and Analysis

Setting Name Setting Value Description


EMCS_ATTRIBUTE_FILTER EMCS_ATTR_FILTER_ENABLE Whether or not to include uncorrelated
EMCS_ATTR_FILTER_DISABLE attributes in the model. When
EMCS_ATTRIBUTE_FILTER is enabled,
uncorrelated attributes are not included.

Note: This setting applies only to
attributes that are not nested.

The default value is system-determined.


EMCS_MAX_NUM_ATTR_2D TO_CHAR(numeric_expr >= 1) Maximum number of correlated
attributes to include in the model.

Note: This setting applies only to
attributes that are not nested (2D).

The default value is 50.


EMCS_NUM_DISTRIBUTION EMCS_NUM_DISTR_BERNOULLI The distribution for modeling numeric
EMCS_NUM_DISTR_GAUSSIAN attributes. Applies to the input table or
view as a whole and does not allow per-
EMCS_NUM_DISTR_SYSTEM attribute specifications.
The options include Bernoulli,
Gaussian, or system-determined
distribution. When Bernoulli or Gaussian
distribution is chosen, all numeric
attributes are modeled using the same
type of distribution. When the
distribution is system-determined,
individual attributes may use different
distributions (either Bernoulli or
Gaussian), depending on the data.
The default value is
EMCS_NUM_DISTR_SYSTEM.



EMCS_NUM_EQUIWIDTH_BINS TO_CHAR( 1 <numeric_expr <= 255) Number of equi-width bins that will be
used for gathering cluster statistics for
numeric columns.
The default value is 11.
EMCS_NUM_PROJECTIONS TO_CHAR( numeric_expr >= 1) Specifies the number of projections to
use for each nested column. If a column
has fewer distinct attributes than the
specified number of projections, then
the data is not projected. The setting
applies to all nested columns.
The default value is 50.
EMCS_NUM_QUANTILE_BINS TO_CHAR( 1 < numeric_expr <= 255) Specifies the number of quantile bins to
use for modeling numeric columns with
multivalued Bernoulli distributions.
The default value is system-determined.
EMCS_NUM_TOPN_BINS TO_CHAR( 1 < numeric_expr <= 255) Specifies the number of top-N bins to
use for modeling categorical columns
with multivalued Bernoulli distributions.
The default value is system-determined.

The following table lists settings for learning for EM models.

Table 8-6 Expectation Maximization Settings for Learning

Setting Name Setting Value Description


EMCS_CONVERGENCE_CRITERION EMCS_CONV_CRIT_HELDASIDE The convergence criterion for EM. The
EMCS_CONV_CRIT_BIC convergence criterion may be based on
a held-aside data set or it may be
Bayesian Information Criterion.
The default value is system determined.
EMCS_LOGLIKE_IMPROVEMENT TO_CHAR( 0 < numeric_expr < 1) When the convergence criterion is
based on a held-aside data set
(EMCS_CONVERGENCE_CRITERION =
EMCS_CONV_CRIT_HELDASIDE), this
setting specifies the percentage
improvement in the value of the log
likelihood function that is required for
adding a new component to the model.
EMCS_MODEL_SEARCH EMCS_MODEL_SEARCH_ENABLE Enables model search in EM where
EMCS_MODEL_SEARCH_DISABLE different model sizes are explored and
the best size is selected.
The default value is
EMCS_MODEL_SEARCH_DISABLE.



EMCS_NUM_COMPONENTS TO_CHAR( numeric_expr >= 1) Maximum number of components in the
model. If model search is enabled, the
algorithm automatically determines the
number of components based on
improvements in the likelihood function
or based on regularization, up to the
specified maximum.
The number of components must be
greater than or equal to the number of
clusters.
The default value is 20.
EMCS_NUM_ITERATIONS TO_CHAR( numeric_expr >= 1) Specifies the maximum number of
iterations in the EM algorithm.
The default value is 100.
EMCS_RANDOM_SEED Non-negative integer Controls the seed of the random
generator used in EM. The default value
is 0.
EMCS_REMOVE_COMPONENTS EMCS_REMOVE_COMPS_ENABLE Allows the EM algorithm to remove a
EMCS_REMOVE_COMPS_DISABLE small component from the solution.
The default value is
EMCS_REMOVE_COMPS_ENABLE.

The following table lists the settings for component clustering for EM models.

Table 8-7 Expectation Maximization Settings for Component Clustering

Setting Name Setting Value Description


CLUS_NUM_CLUSTERS TO_CHAR(numeric_expr >= 1) The maximum number of leaf clusters
generated by the algorithm. The
algorithm may return fewer clusters than
the specified number, depending on the
data, but it cannot return more clusters
than the number of components, which
is governed by algorithm-specific
settings. (See Table 8-6.) Depending on
these settings, there may be fewer
clusters than components. If component
clustering is disabled, then the number
of clusters equals the number of
components.
The default value is system-determined.



EMCS_CLUSTER_COMPONENTS EMCS_CLUSTER_COMP_ENABLE Enables or disables the grouping of EM
EMCS_CLUSTER_COMP_DISABLE components into high-level clusters.
When disabled, the components
themselves are treated as clusters.
When component clustering is enabled,
model scoring through the SQL
CLUSTER function produces
assignments to the higher level clusters.
When clustering is disabled, the
CLUSTER function produces
assignments to the original components.
The default value is
EMCS_CLUSTER_COMP_ENABLE.
EMCS_CLUSTER_THRESH TO_CHAR(numeric_expr >= 1) Dissimilarity threshold that controls the
clustering of EM components. When the
dissimilarity measure is less than the
threshold, the components are
combined into a single cluster.
A lower threshold may produce more
clusters that are more compact. A
higher threshold may produce fewer
clusters that are more spread out.
The default value is 2.
EMCS_LINKAGE_FUNCTION EMCS_LINKAGE_SINGLE Allows the specification of a linkage
EMCS_LINKAGE_AVERAGE function for the agglomerative clustering
step.
EMCS_LINKAGE_COMPLETE
EMCS_LINKAGE_SINGLE uses the
nearest distance within the branch. The
clusters tend to be larger and have
arbitrary shapes.
EMCS_LINKAGE_AVERAGE uses the
average distance within the branch.
There is less chaining effect and the
clusters are more compact.
EMCS_LINKAGE_COMPLETE uses the
maximum distance within the branch.
The clusters are smaller and require
strong component overlap.
The default value is
EMCS_LINKAGE_SINGLE.

The following table lists the settings for cluster statistics for EM models.


Table 8-8 Expectation Maximization Settings for Cluster Statistics

Setting Name Setting Value Description


EMCS_CLUSTER_STATISTICS EMCS_CLUS_STATS_ENABLE Enables or disables the gathering of
EMCS_CLUS_STATS_DISABLE descriptive statistics for clusters
(centroids, histograms, and rules).
When statistics are disabled, model size
is reduced.
The default value is
EMCS_CLUS_STATS_ENABLE.
EMCS_MIN_PCT_ATTR_SUPPORT TO_CHAR( 0 < numeric_expr < 1) Minimum support required for including
an attribute in the cluster rule. The
support is the percentage of the data
rows assigned to a cluster that must
have non-null values for the attribute.
The default value is 0.1.

See Also:

• About Model Settings


• Shared Settings
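The settings in these tables are passed to the oml.em constructor in the same way as in Example 8-10. The sketch below, for instance, enables model search and caps the number of components; the specific values are illustrative only:

# A sketch: let EM search over model sizes up to 10 components while
# producing at most 3 leaf clusters.
setting = {'EMCS_MODEL_SEARCH': 'EMCS_MODEL_SEARCH_ENABLE',
           'EMCS_NUM_COMPONENTS': '10',
           'EMCS_NUM_ITERATIONS': '200'}
em_mod = oml.em(n_clusters = 3, **setting)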

Example 8-10 Using the oml.em Class


This example creates an EM model and uses some of the methods of the oml.em class.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop('IRIS')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()


train_dat = dat[0]
test_dat = dat[1]

# Specify settings.
setting = {'emcs_num_iterations': 100}

# Create an EM model object


em_mod = oml.em(n_clusters = 2, **setting)

# Fit the EM model according to the training data and parameter


# settings.
em_mod = em_mod.fit(train_dat)

# Show details of the model.


em_mod

# Use the model to make predictions on the test data.


em_mod.predict(test_dat)

# Make predictions and return the probability for each class


# on new data.
em_mod.predict_proba(test_dat,
supplemental_cols = test_dat[:,
['Sepal_Length', 'Sepal_Width',
'Petal_Length']]).sort_values(by = ['Sepal_Length',
'Sepal_Width', 'Petal_Length',
'PROBABILITY_OF_2', 'PROBABILITY_OF_3'])

# Change the random seed and refit the model.


em_mod.set_params(EMCS_RANDOM_SEED = '5').fit(train_dat)

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>


>>> # Create training and test data.


... dat = oml.sync(table = 'IRIS').split()
>>> train_dat = dat[0]
>>> test_dat = dat[1]
>>>
>>> # Specify settings.
... setting = {'emcs_num_iterations': 100}
>>>
>>> # Create an EM model object.
... em_mod = oml.em(n_clusters = 2, **setting)
>>>
>>> # Fit the EM model according to the training data and parameter
... # settings.
>>> em_mod = em_mod.fit(train_dat)
>>>
>>> # Show details of the model.
... em_mod

Algorithm Name: Expectation Maximization

Mining Function: CLUSTERING

Settings:
setting name setting value
0 ALGO_NAME ALGO_EXPECTATION_MAXIMIZATION
1 CLUS_NUM_CLUSTERS 2
2 EMCS_CLUSTER_COMPONENTS EMCS_CLUSTER_COMP_ENABLE
3 EMCS_CLUSTER_STATISTICS EMCS_CLUS_STATS_ENABLE
4 EMCS_CLUSTER_THRESH 2
5 EMCS_LINKAGE_FUNCTION EMCS_LINKAGE_SINGLE
6 EMCS_LOGLIKE_IMPROVEMENT .001
7 EMCS_MAX_NUM_ATTR_2D 50
8 EMCS_MIN_PCT_ATTR_SUPPORT .1
9 EMCS_MODEL_SEARCH EMCS_MODEL_SEARCH_DISABLE
10 EMCS_NUM_COMPONENTS 20
11 EMCS_NUM_DISTRIBUTION EMCS_NUM_DISTR_SYSTEM
12 EMCS_NUM_EQUIWIDTH_BINS 11
13 EMCS_NUM_ITERATIONS 100
14 EMCS_NUM_PROJECTIONS 50
15 EMCS_RANDOM_SEED 0
16 EMCS_REMOVE_COMPONENTS EMCS_REMOVE_COMPS_ENABLE
17 ODMS_DETAILS ODMS_ENABLE
18 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
19 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
20 PREP_AUTO ON

Computed Settings:
setting name setting value
0 EMCS_ATTRIBUTE_FILTER EMCS_ATTR_FILTER_DISABLE
1 EMCS_CONVERGENCE_CRITERION EMCS_CONV_CRIT_BIC
2 EMCS_NUM_QUANTILE_BINS 3
3 EMCS_NUM_TOPN_BINS 3

Global Statistics:
attribute name attribute value
0 CONVERGED YES


1 LOGLIKELIHOOD -2.10044
2 NUM_CLUSTERS 2
3 NUM_COMPONENTS 8
4 NUM_ROWS 104
5 RANDOM_SEED 0
6 REMOVED_COMPONENTS 12

Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Species

Partition: NO

Clusters:

CLUSTER_ID CLUSTER_NAME RECORD_COUNT PARENT TREE_LEVEL \


0 1 1 104 NaN 1
1 2 2 68 1.0 2
2 3 3 36 1.0 2
LEFT_CHILD_ID RIGHT_CHILD_ID
0 2.0 3.0
1 NaN NaN
2 NaN NaN

Taxonomy:

PARENT_CLUSTER_ID CHILD_CLUSTER_ID
0 1 2.0
1 1 3.0
2 2 NaN
3 3 NaN

Centroids:

CLUSTER_ID ATTRIBUTE_NAME MEAN MODE_VALUE VARIANCE


0 1 Petal_Length 3.721154 None 3.234694
1 1 Petal_Width 1.155769 None 0.567539
2 1 Sepal_Length 5.831731 None 0.753255
3 1 Sepal_Width 3.074038 None 0.221358
4 1 Species NaN setosa NaN
5 2 Petal_Length 4.902941 None 0.860588
6 2 Petal_Width 1.635294 None 0.191572
7 2 Sepal_Length 6.266176 None 0.545555
8 2 Sepal_Width 2.854412 None 0.128786
9 2 Species NaN versicolor NaN
10 3 Petal_Length 1.488889 None 0.033016
11 3 Petal_Width 0.250000 None 0.012857
12 3 Sepal_Length 5.011111 None 0.113016
13 3 Sepal_Width 3.488889 None 0.134159
14 3 Species NaN setosa NaN

Leaf Cluster Counts:


CLUSTER_ID CNT
0 2 68
1 3 36

Attribute Importance:

ATTRIBUTE_NAME ATTRIBUTE_IMPORTANCE_VALUE ATTRIBUTE_RANK


0 Petal_Length 0.558311 2
1 Petal_Width 0.556300 3
2 Sepal_Length 0.469978 4
3 Sepal_Width 0.196211 5
4 Species 0.612463 1

Components:

COMPONENT_ID CLUSTER_ID PRIOR_PROBABILITY


0 1 2 0.115366
1 2 2 0.079158
2 3 3 0.113448
3 4 2 0.148059
4 5 3 0.126979
5 6 2 0.134402
6 7 3 0.105727
7 8 2 0.176860

Cluster Hists:

cluster.id variable bin.id lower.bound upper.bound \


0 1 Petal_Length 1 1.00 1.59
1 1 Petal_Length 2 1.59 2.18
2 1 Petal_Length 3 2.18 2.77
3 1 Petal_Length 4 2.77 3.36
... ... ... ... ... ...
137 3 Sepal_Width 11 NaN NaN
138 3 Species:'Other' 1 NaN NaN
139 3 Species:setosa 2 NaN NaN
140 3 Species:versicolor 3 NaN NaN

label count
0 1:1.59 25
1 1.59:2.18 11
2 2.18:2.77 0
3 2.77:3.36 3
... ... ...
137 : 0
138 : 0
139 : 36
140 : 0

[141 rows x 7 columns]

Rules:

cluster.id rhs.support rhs.conf lhr.support lhs.conf lhs.var \


0 1 104 1.000000 93 0.892157 Sepal_Width
1 1 104 1.000000 93 0.892157 Sepal_Width


2 1 104 1.000000 99 0.892157 Petal_Length


3 1 104 1.000000 99 0.892157 Petal_Length
... ... ... ... ... ... ...
26 3 36 0.346154 36 0.972222 Petal_Length
27 3 36 0.346154 36 0.972222 Sepal_Length
28 3 36 0.346154 36 0.972222 Sepal_Length
29 3 36 0.346154 36 0.972222 Species

lhs.var.support lhs.var.conf predicate


0 93 0.400000 Sepal_Width <= 3.92
1 93 0.400000 Sepal_Width > 2.48
2 93 0.222222 Petal_Length <= 6.31
3 93 0.222222 Petal_Length >= 1
... ... ... ...
26 35 0.134398 Petal_Length >= 1
27 35 0.094194 Sepal_Length <= 5.74
28 35 0.094194 Sepal_Length >= 4.3
29 35 0.281684 Species = setosa

[30 rows x 9 columns]

>>> # Use the model to make predictions on the test data.


... em_mod.predict(test_dat)
CLUSTER_ID
0 3
1 3
2 3
3 3
... ...
42 2
43 2
44 2
45 2

>>> # Make predictions and return the probability for each class
... # on new data.
>>> em_mod.predict_proba(test_dat,
... supplemental_cols = test_dat[:,
... ['Sepal_Length', 'Sepal_Width',
... 'Petal_Length']]).sort_values(by = ['Sepal_Length',
... 'Sepal_Width', 'Petal_Length',
... 'PROBABILITY_OF_2', 'PROBABILITY_OF_3'])
Sepal_Length Sepal_Width Petal_Length PROBABILITY_OF_2 \
0 4.4 3.0 1.3 4.680788e-20
1 4.4 3.2 1.3 1.052071e-20
2 4.5 2.3 1.3 7.751240e-06
3 4.8 3.4 1.6 5.363418e-19
... ... ... ... ...
43 6.9 3.1 4.9 1.000000e+00
44 6.9 3.1 5.4 1.000000e+00
45 7.0 3.2 4.7 1.000000e+00

PROBABILITY_OF_3
0 1.000000e+00
1 1.000000e+00
2 9.999922e-01


3 1.000000e+00
... ...
43 3.295578e-97
44 6.438740e-137
45 3.853925e-89

>>>
>>> # Change the random seed and refit the model.
... em_mod.set_params(EMCS_RANDOM_SEED = '5').fit(train_dat)

Algorithm Name: Expectation Maximization

Mining Function: CLUSTERING

Settings:
setting name setting value
0 ALGO_NAME ALGO_EXPECTATION_MAXIMIZATION
1 CLUS_NUM_CLUSTERS 2
2 EMCS_CLUSTER_COMPONENTS EMCS_CLUSTER_COMP_ENABLE
3 EMCS_CLUSTER_STATISTICS EMCS_CLUS_STATS_ENABLE
4 EMCS_CLUSTER_THRESH 2
5 EMCS_LINKAGE_FUNCTION EMCS_LINKAGE_SINGLE
6 EMCS_LOGLIKE_IMPROVEMENT .001
7 EMCS_MAX_NUM_ATTR_2D 50
8 EMCS_MIN_PCT_ATTR_SUPPORT .1
9 EMCS_MODEL_SEARCH EMCS_MODEL_SEARCH_DISABLE
10 EMCS_NUM_COMPONENTS 20
11 EMCS_NUM_DISTRIBUTION EMCS_NUM_DISTR_SYSTEM
12 EMCS_NUM_EQUIWIDTH_BINS 11
13 EMCS_NUM_ITERATIONS 100
14 EMCS_NUM_PROJECTIONS 50
15 EMCS_RANDOM_SEED 5
16 EMCS_REMOVE_COMPONENTS EMCS_REMOVE_COMPS_ENABLE
17 ODMS_DETAILS ODMS_ENABLE
18 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
19 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
20 PREP_AUTO ON

Computed Settings:
setting name setting value
0 EMCS_ATTRIBUTE_FILTER EMCS_ATTR_FILTER_DISABLE
1 EMCS_CONVERGENCE_CRITERION EMCS_CONV_CRIT_BIC
2 EMCS_NUM_QUANTILE_BINS 3
3 EMCS_NUM_TOPN_BINS 3

Global Statistics:
attribute name attribute value
0 CONVERGED YES
1 LOGLIKELIHOOD -1.75777
2 NUM_CLUSTERS 2
3 NUM_COMPONENTS 9
4 NUM_ROWS 104
5 RANDOM_SEED 5
6 REMOVED_COMPONENTS 11

Attributes:


Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Species

Partition: NO

Clusters:

CLUSTER_ID CLUSTER_NAME RECORD_COUNT PARENT TREE_LEVEL LEFT_CHILD_ID


\
0 1 1 104 NaN 1
2.0
1 2 2 36 1.0 2
NaN
2 3 3 68 1.0 2
NaN

RIGHT_CHILD_ID
0 3.0
1 NaN
2 NaN

Taxonomy:

PARENT_CLUSTER_ID CHILD_CLUSTER_ID
0 1 2.0
1 1 3.0
2 2 NaN
3 3 NaN

Centroids:

CLUSTER_ID ATTRIBUTE_NAME MEAN MODE_VALUE VARIANCE


0 1 Petal_Length 3.721154 None 3.234694
1 1 Petal_Width 1.155769 None 0.567539
2 1 Sepal_Length 5.831731 None 0.753255
3 1 Sepal_Width 3.074038 None 0.221358
4 1 Species NaN setosa NaN
5 2 Petal_Length 1.488889 None 0.033016
6 2 Petal_Width 0.250000 None 0.012857
7 2 Sepal_Length 5.011111 None 0.113016
8 2 Sepal_Width 3.488889 None 0.134159
9 2 Species NaN setosa NaN
10 3 Petal_Length 4.902941 None 0.860588
11 3 Petal_Width 1.635294 None 0.191572
12 3 Sepal_Length 6.266176 None 0.545555
13 3 Sepal_Width 2.854412 None 0.128786
14 3 Species NaN versicolor NaN

Leaf Cluster Counts:

CLUSTER_ID CNT
0 2 36
1 3 68


Attribute Importance:

ATTRIBUTE_NAME ATTRIBUTE_IMPORTANCE_VALUE ATTRIBUTE_RANK


0 Petal_Length 0.558311 2
1 Petal_Width 0.556300 3
2 Sepal_Length 0.469978 4
3 Sepal_Width 0.196211 5
4 Species 0.612463 1

Components:

COMPONENT_ID CLUSTER_ID PRIOR_PROBABILITY


0 1 2 0.113452
1 2 2 0.105727
2 3 3 0.114202
3 4 3 0.086285
4 5 3 0.067294
5 6 2 0.124365
6 7 3 0.126975
7 8 3 0.105761
8 9 3 0.155939

Cluster Hists:

cluster.id variable bin.id lower.bound upper.bound \


0 1 Petal_Length 1 1.00 1.59
1 1 Petal_Length 2 1.59 2.18
2 1 Petal_Length 3 2.18 2.77
3 1 Petal_Length 4 2.77 3.36
... ... ... ... ... ...
137 3 Sepal_Width 11 NaN NaN
138 3 Species:'Other' 1 NaN NaN
139 3 Species:setosa 3 NaN NaN
140 3 Species:versicolor 2 NaN NaN

label count
0 1:1.59 25
1 1.59:2.18 11
2 2.18:2.77 0
3 2.77:3.36 3
... ... ...
137 : 0
138 : 33
139 : 0
140 : 35

[141 rows x 7 columns]

Rules:

cluster.id rhs.support rhs.conf lhr.support lhs.conf lhs.var \


0 1 104 1.000000 93 0.894231 Sepal_Width
1 1 104 1.000000 93 0.894231 Sepal_Width
2 1 104 1.000000 99 0.894231 Petal_Length
3 1 104 1.000000 99 0.894231 Petal_Length


... ... ... ... ... ... ...


26 3 68 0.653846 68 0.955882 Sepal_Length
27 3 68 0.653846 68 0.955882 Sepal_Length
28 3 68 0.653846 68 0.955882 Species
29 3 68 0.653846 68 0.955882 Species

lhs.var.support lhs.var.conf predicate


0 93 0.400000 Sepal_Width <= 3.92
1 93 0.400000 Sepal_Width > 2.48
2 93 0.222222 Petal_Length <= 6.31
3 93 0.222222 Petal_Length >= 1
... ... ... ...
26 65 0.026013 Sepal_Length <= 7.9
27 65 0.026013 Sepal_Length > 4.66
28 65 0.125809 Species IN 'Other'
29 65 0.125809 Species IN versicolor

8.11 Explicit Semantic Analysis


The oml.esa class extracts text-based features from a corpus of documents and performs
document similarity comparisons.
Explicit Semantic Analysis (ESA) is an unsupervised algorithm for feature extraction. ESA does
not discover latent features but instead uses explicit features based on an existing knowledge
base.
Explicit knowledge often exists in text form. Multiple knowledge bases are available as
collections of text documents. These knowledge bases can be generic, such as Wikipedia, or
domain-specific. Data preparation transforms the text into vectors that capture attribute-
concept associations.
ESA uses concepts of an existing knowledge base as features rather than latent features
derived by latent semantic analysis methods such as Singular Value Decomposition and Latent
Dirichlet Allocation. Each row in the training data, for example a document, maps to a
feature, that is, a concept. ESA has multiple applications in the area of text processing, most
notably semantic relatedness (similarity) and explicit topic modeling. Text similarity use cases
might involve, for example, resume matching, searching for similar blog postings, and so on.
For information on the oml.esa class attributes and methods, invoke help(oml.esa) or see
Oracle Machine Learning for Python API Reference.
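Because ESA normally operates on text columns, an Oracle Text policy must exist and be referenced through the ODMS_TEXT_POLICY_NAME setting before the model is built. The sketch below shows only that preparation step; the policy name MY_ESA_POLICY is arbitrary, and the complete workflow, including the ctx_settings argument to fit, appears in Example 8-11:

# A sketch of the text-policy preparation assumed by an ESA model build.
from oml import cursor
cur = cursor()
cur.execute("Begin ctx_ddl.create_policy('MY_ESA_POLICY'); End;")
cur.close()
odm_settings = {'odms_text_policy_name': 'MY_ESA_POLICY'}
esa_mod = oml.esa(**odm_settings)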

Settings for an Explicit Semantic Analysis Model


The following table lists settings for ESA models.

Table 8-9 Explicit Semantic Analysis Settings

Setting Name Setting Value Description


ESAS_MIN_ITEMS A non-negative number Determines the minimum number of non-zero
entries required in an input row. The default
value is 100 for text input and 0 for non-text
input.
ESAS_TOPN_FEATURES A positive integer Controls the maximum number of features per
attribute. The default value is 1000.



ESAS_VALUE_THRESHOLD A non-negative number Specifies the minimum threshold for attribute
weights in the transformed build data. The
default value is 1e-8.
FEAT_NUM_FEATURES TO_CHAR(numeric_expr >=1) The number of features to extract.
The default value is estimated by the algorithm.
If the matrix rank is smaller than this number,
then fewer features are returned.

See Also:

• About Model Settings


• Shared Settings

Example 8-11 Using the oml.esa Class


This example creates an ESA model and uses some of the methods of the oml.esa class.

import oml
from oml import cursor
import pandas as pd

# Create training data and test data.


dat = oml.push(pd.DataFrame(
{'COMMENTS':['Aids in Africa: Planning for a long war',
'Mars rover maneuvers for rim shot',
'Mars express confirms presence of water at Mars south pole',
'NASA announces major Mars rover finding',
'Drug access, Asia threat in focus at AIDS summit',
'NASA Mars Odyssey THEMIS image: typical crater',
'Road blocks for Aids'],
'YEAR':['2017', '2018', '2017', '2017', '2018', '2018', '2018'],
'ID':[1,2,3,4,5,6,7]})).split(ratio=(0.7,0.3), seed = 1234)
train_dat = dat[0]
test_dat = dat[1]

# Specify settings.
cur = cursor()
cur.execute("Begin ctx_ddl.create_policy('DMDEMO_ESA_POLICY'); End;")
cur.close()

odm_settings = {'odms_text_policy_name': 'DMDEMO_ESA_POLICY',


'"ODMS_TEXT_MIN_DOCUMENTS"': 1,
'"ESAS_MIN_ITEMS"': 1}

ctx_settings = {'COMMENTS':
'TEXT(POLICY_NAME:DMDEMO_ESA_POLICY)(TOKEN_TYPE:STEM)'}


# Create an oml ESA model object.


esa_mod = oml.esa(**odm_settings)

# Fit the ESA model according to the training data and parameter settings.
esa_mod = esa_mod.fit(train_dat, case_id = 'ID',
ctx_settings = ctx_settings)

# Show model details.


esa_mod

# Use the model to make predictions on test data.


esa_mod.predict(test_dat,
supplemental_cols = test_dat[:, ['ID', 'COMMENTS']])

esa_mod.transform(test_dat,
supplemental_cols = test_dat[:, ['ID', 'COMMENTS']],
topN = 2).sort_values(by = ['ID'])

esa_mod.feature_compare(test_dat,
compare_cols = 'COMMENTS',
supplemental_cols = ['ID'])

esa_mod.feature_compare(test_dat,
compare_cols = ['COMMENTS', 'YEAR'],
supplemental_cols = ['ID'])

# Change the setting parameter and refit the model.


new_setting = {'ESAS_VALUE_THRESHOLD': '0.01',
'ODMS_TEXT_MAX_FEATURES': '2',
'ESAS_TOPN_FEATURES': '2'}
esa_mod.set_params(**new_setting).fit(train_dat, case_id = 'ID',
ctx_settings = ctx_settings)

cur = cursor()
cur.execute("Begin ctx_ddl.drop_policy('DMDEMO_ESA_POLICY'); End;")
cur.close()

Listing for This Example

>>> import oml


>>> from oml import cursor
>>> import pandas as pd
>>>
>>> # Create training data and test data.
... dat = oml.push(pd.DataFrame(
... {'COMMENTS':['Aids in Africa: Planning for a long war',
... 'Mars rover maneuvers for rim shot',
... 'Mars express confirms presence of water at Mars south pole',
... 'NASA announces major Mars rover finding',
... 'Drug access, Asia threat in focus at AIDS summit',
... 'NASA Mars Odyssey THEMIS image: typical crater',
... 'Road blocks for Aids'],
... 'YEAR':['2017', '2018', '2017', '2017', '2018', '2018', '2018'],
... 'ID':[1,2,3,4,5,6,7]})).split(ratio=(0.7,0.3), seed = 1234)
>>> train_dat = dat[0]


>>> test_dat = dat[1]


>>>
>>> # Specify settings.
... cur = cursor()
>>> cur.execute("Begin ctx_ddl.create_policy('DMDEMO_ESA_POLICY'); End;")
>>> cur.close()
>>>
>>> odm_settings = {'odms_text_policy_name': 'DMDEMO_ESA_POLICY',
... '"ODMS_TEXT_MIN_DOCUMENTS"': 1,
... '"ESAS_MIN_ITEMS"': 1}
>>>
>>> ctx_settings = {'COMMENTS':
... 'TEXT(POLICY_NAME:DMDEMO_ESA_POLICY)(TOKEN_TYPE:STEM)'}
>>>
>>> # Create an oml ESA model object.
... esa_mod = oml.esa(**odm_settings)
>>>
>>> # Fit the ESA model according to the training data and parameter settings.
... esa_mod = esa_mod.fit(train_dat, case_id = 'ID',
... ctx_settings = ctx_settings)
>>>
>>> # Show model details.
... esa_mod

Algorithm Name: Explicit Semantic Analysis

Mining Function: FEATURE_EXTRACTION

Settings:
setting name setting value
0 ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS
1 ESAS_MIN_ITEMS 1
2 ESAS_TOPN_FEATURES 1000
3 ESAS_VALUE_THRESHOLD .00000001
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
7 ODMS_TEXT_MAX_FEATURES 300000
8 ODMS_TEXT_MIN_DOCUMENTS 1
9 ODMS_TEXT_POLICY_NAME DMDEMO_ESA_POLICY
10 PREP_AUTO ON

Global Statistics:
attribute name attribute value
0 NUM_ROWS 4

Attributes:
COMMENTS
YEAR

Partition: NO

Features:

FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT


0 1 COMMENTS.AFRICA None 0.342997


1 1 COMMENTS.AIDS None 0.171499


2 1 COMMENTS.LONG None 0.342997
3 1 COMMENTS.PLANNING None 0.342997
... ... ... ... ...
24 6 COMMENTS.ODYSSEY None 0.282843
25 6 COMMENTS.THEMIS None 0.282843
26 6 COMMENTS.TYPICAL None 0.282843
27 6 YEAR 2018 0.707107

>>> # Use the model to make predictions on test data.


... esa_mod.predict(test_dat,
... supplemental_cols = test_dat[:, ['ID', 'COMMENTS']])
ID COMMENTS FEATURE_ID
0 4 NASA announces major Mars rover finding 3
1 6 NASA Mars Odyssey THEMIS image: typical crater 2
2 7 Road blocks for Aids 5
>>>
>>> esa_mod.transform(test_dat,
... supplemental_cols = test_dat[:, ['ID', 'COMMENTS']],
... topN = 2).sort_values(by = ['ID'])
COMMENTS TOP_1 TOP_1_VAL \
0 4 NASA announces major Mars rover finding 3 0.647065
1 6 NASA Mars Odyssey THEMIS image: typical crater 2 0.766237
2 7 Road blocks for Aids 5 0.759125

TOP_2 TOP_2_VAL
0 1 0.590565
1 2 0.616672
2 2 0.632604
>>>
>>> esa_mod.feature_compare(test_dat,
... compare_cols = 'COMMENTS',
... supplemental_cols = ['ID'])
ID_A ID_B SIMILARITY
0 4 6 0.946469
1 4 7 0.871994
2 6 7 0.954565

>>> esa_mod.feature_compare(test_dat,
... compare_cols = ['COMMENTS', 'YEAR'],
... supplemental_cols = ['ID'])
ID_A ID_B SIMILARITY
0 4 6 0.467644
1 4 7 0.377144
2 6 7 0.952857

>>> # Change the setting parameter and refit the model.


... new_setting = {'ESAS_VALUE_THRESHOLD': '0.01',
... 'ODMS_TEXT_MAX_FEATURES': '2',
... 'ESAS_TOPN_FEATURES': '2'}
>>> esa_mod.set_params(**new_setting).fit(train_dat, case_id = 'ID',
... ctx_settings = ctx_settings)

Algorithm Name: Explicit Semantic Analysis


Mining Function: FEATURE_EXTRACTION

Settings:
setting name setting value
0 ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS
1 ESAS_MIN_ITEMS 1
2 ESAS_TOPN_FEATURES 2
3 ESAS_VALUE_THRESHOLD 0.01
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
7 ODMS_TEXT_MAX_FEATURES 2
8 ODMS_TEXT_MIN_DOCUMENTS 1
9 ODMS_TEXT_POLICY_NAME DMDEMO_ESA_POLICY
10 PREP_AUTO ON

Global Statistics:
attribute name attribute value
0 NUM_ROWS 4

Attributes:
COMMENTS
YEAR

Partition: NO

Features:

FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT


0 1 COMMENTS.AIDS None 0.707107
1 1 YEAR 2017 0.707107
2 2 COMMENTS.MARS None 0.707107
3 2 YEAR 2018 0.707107
4 3 COMMENTS.MARS None 0.707107
5 3 YEAR 2017 0.707107
6 5 COMMENTS.AIDS None 0.707107
7 5 YEAR 2018 0.707107

>>>
>>> cur = cursor()
>>> cur.execute("Begin ctx_ddl.drop_policy('DMDEMO_ESA_POLICY'); End;")
>>> cur.close()

8.12 Generalized Linear Model


The oml.glm class builds a Generalized Linear Model (GLM) model.

GLM models include and extend the class of linear models. They relax the restrictions on linear
models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do
not have the same variance across classes.
GLM is a parametric modeling technique. Parametric models make assumptions about the
distribution of the data. When the assumptions are met, parametric models can be more
efficient than non-parametric models.


The challenge in developing models of this type involves assessing the extent to which the
assumptions are met. For this reason, quality diagnostics are key to developing quality
parametric models.
In addition to the classical weighted least squares estimation for linear regression and
iteratively re-weighted least squares estimation for logistic regression, both solved through
Cholesky decomposition and matrix inversion, Oracle Machine Learning GLM provides a
conjugate gradient-based optimization algorithm that does not require matrix inversion and is
very well suited to high-dimensional data. The choice of algorithm is handled internally and is
transparent to the user.
GLM can be used to build classification or regression models as follows:
• Classification: Binary logistic regression is the GLM classification algorithm. The
algorithm uses the logit link function and the binomial variance function.
• Regression: Linear regression is the GLM regression algorithm. The algorithm assumes
no target transformation and constant variance over the range of target values.
The oml.glm class allows you to build two different types of models. Some arguments apply to
classification models only and some to regression models only.
For information on the oml.glm class attributes and methods, invoke help(oml.glm) or see
Oracle Machine Learning for Python API Reference.
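For example, whereas Example 8-12 later in this section builds a regression model, a binary logistic regression model is obtained by requesting the classification machine learning function instead. The sketch below assumes proxy objects train_x (predictors) and train_y (a binary target); the "classification" string mirrors the "regression" string used in Example 8-12:

# A sketch: build a GLM classification (binary logistic regression) model.
setting = {'GLMS_RIDGE_REGRESSION': 'GLMS_RIDGE_REG_ENABLE'}
glm_class_mod = oml.glm("classification", **setting)
# glm_class_mod = glm_class_mod.fit(train_x, train_y)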

Settings for a Generalized Linear Model


The following table lists the settings that apply to GLM models.

Table 8-10 Generalized Linear Model Settings

Setting Name Setting Value Description


CLAS_COST_TABLE_NAME table_name The name of a table that stores a cost matrix for the
algorithm to use in scoring the model. The cost matrix
specifies the costs associated with misclassifications.
The cost matrix table is user-created. The following are the
column requirements for the table.
• Column Name: ACTUAL_TARGET_VALUE
Data Type: Valid target data type
• Column Name: PREDICTED_TARGET_VALUE
Data Type: Valid target data type
• Column Name: COST
Data Type: NUMBER
CLAS_WEIGHTS_BALANCED ON Indicates whether the algorithm must create a model that
OFF balances the target distribution. This setting is most relevant
in the presence of rare targets, as balancing the distribution
may enable better average accuracy (average of per-class
accuracy) instead of overall accuracy (which favors the
dominant class). The default value is OFF.



CLAS_WEIGHTS_TABLE_NAME table_name The name of a table that stores weighting information for
individual target values in GLM logistic regression models.
The weights are used by the algorithm to bias the model in
favor of higher weighted classes.
The class weights table is user-created. The following are the
column requirements for the table.
• Column Name: TARGET_VALUE
Data Type: Valid target data type
• Column Name: CLASS_WEIGHT
Data Type: NUMBER
GLMS_BATCH_ROWS 0 or a positive integer. Number of rows in a batch used by the SGD solver. The
value of this parameter sets the size of the batch for the SGD
solver. An input of 0 triggers a data-driven batch size
estimate.
The default value is 2000.
GLMS_CONF_LEVEL TO_CHAR(0< The confidence level for coefficient confidence intervals.
numeric_expr <1) The default confidence level is 0.95.
GLMS_CONV_TOLERANCE The range is (0, 1) non- Convergence tolerance setting of the GLM algorithm.
inclusive. The default value is system-determined.
GLMS_FTR_GEN_METHOD GLMS_FTR_GEN_CUBIC Whether feature generation is cubic or quadratic.
GLMS_FTR_GEN_QUADRATIC When you enable feature generation, the algorithm
automatically chooses the most appropriate feature
generation method based on the data.
GLMS_FTR_GENERATION GLMS_FTR_GENERATION_ENABLE Whether or not feature generation is enabled for GLM. By
GLMS_FTR_GENERATION_DISABLE default, feature generation is not enabled.
Note: Feature generation can only be enabled when
feature selection is also enabled.
GLMS_FTR_SEL_CRIT GLMS_FTR_SEL_AIC Feature selection penalty criterion for adding a feature to the
GLMS_FTR_SEL_ALPHA_INV model.
GLMS_FTR_SEL_RIC When feature selection is enabled, the algorithm
GLMS_FTR_SEL_SBIC automatically chooses the penalty criterion based on the
data.
GLMS_FTR_SELECTION GLMS_FTR_SELECTION_ENABLE Enable or disable feature selection for GLM.
GLMS_FTR_SELECTION_DISABLE By default, feature selection is not enabled.
GLMS_MAX_FEATURES TO_CHAR(0 < When feature selection is enabled, this setting specifies the
numeric_expr <= 2000) maximum number of features that can be selected for the
final model.
By default, the algorithm limits the number of features to
ensure sufficient memory.
GLMS_NUM_ITERATIONS A positive integer. Maximum number of iterations for the GLM algorithm. The
default value is system-determined.



GLMS_PRUNE_MODEL GLMS_PRUNE_MODEL_ENABLE When feature selection is enabled, the algorithm
GLMS_PRUNE_MODEL_DISABLE automatically performs pruning based on the data.
GLMS_REFERENCE_CLASS_NAME target_value The target value used as the reference class in a binary
logistic regression model. Probabilities are produced for the
other class.
By default, the algorithm chooses the value with the highest
prevalence (the most cases) for the reference class.
GLMS_RIDGE_REGRESSION GLMS_RIDGE_REG_ENABLE Enable or disable ridge regression. Ridge applies to both
GLMS_RIDGE_REG_DISABLE regression and classification machine learning functions.
When ridge is enabled, prediction bounds are not produced
by the PREDICTION_BOUNDS SQL function.
GLMS_RIDGE_VALUE TO_CHAR(numeric_expr > 0) The value of the ridge parameter. Use this setting only when
you have configured the algorithm to use ridge regression.
If ridge regression is enabled internally by the algorithm, then
the ridge parameter is determined by the algorithm.
GLMS_ROW_DIAGNOSTICS GLMS_ROW_DIAG_ENABLE Enable or disable row diagnostics.
GLMS_ROW_DIAG_DISABLE By default, row diagnostics are disabled.
GLMS_SOLVER GLMS_SOLVER_CHOL Specifies the GLM solver. You cannot select the solver if the
GLMS_SOLVER_LBFGS_ADMM GLMS_FTR_SELECTION setting is enabled. The default value
GLMS_SOLVER_QR is system-determined.
GLMS_SOLVER_SGD The GLMS_SOLVER_CHOL solver uses Cholesky
decomposition.
The GLMS_SOLVER_SGD solver uses stochastic gradient
descent.
GLMS_SPARSE_SOLVER GLMS_SPARSE_SOLVER_ENABLE Enable or disable the use of a sparse solver if it is available.
GLMS_SPARSE_SOLVER_DISABLE The default value is GLMS_SPARSE_SOLVER_DISABLE.
ODMS_ROW_WEIGHT_COLUMN_NAME column_name The name of a column in the training data that contains a
weighting factor for the rows. The column datatype must be
NUMBER.
You can use row weights as a compact representation of
repeated rows, as in the design of experiments where a
specific configuration is repeated several times. You can also
use row weights to emphasize certain rows during model
construction, for example, to bias the model toward rows
that are more recent and away from potentially obsolete data.

See Also:

• About Model Settings


• Shared Settings
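As with the Decision Tree cost matrix in Example 8-9, the class weights table referenced by CLAS_WEIGHTS_TABLE_NAME is an ordinary user table and can be created through oml.create. The sketch below assumes a binary target with the hypothetical values '0' and '1'; the table name and the weights themselves are illustrative only:

# A sketch: create a class weights table for a GLM logistic regression model.
import pandas as pd
weights = pd.DataFrame([['0', 1.0], ['1', 4.0]],
                       columns = ['TARGET_VALUE', 'CLASS_WEIGHT'])
class_weights = oml.create(weights, table = 'GLM_CLASS_WEIGHTS')
setting = {'CLAS_WEIGHTS_TABLE_NAME': 'GLM_CLASS_WEIGHTS'}
# glm_mod = oml.glm("classification", **setting)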


Example 8-12 Using the oml.glm Class


This example demonstrates the use of various methods of the oml.glm class. In the listing for
this example, some of the output is not shown as indicated by ellipses.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop('IRIS')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Petal_Width')
train_y = dat[0]['Petal_Width']
test_dat = dat[1]

# Specify settings.
setting = {'GLMS_SOLVER': 'dbms_data_mining.GLMS_SOLVER_QR'}

# Create a GLM model object.


glm_mod = oml.glm("regression", **setting)

# Fit the GLM model according to the training data and parameter
# settings.
glm_mod = glm_mod.fit(train_x, train_y)

# Show the model details.


glm_mod

# Use the model to make predictions on the test data.


glm_mod.predict(test_dat.drop('Petal_Width'),
supplemental_cols = test_dat[:,
['Sepal_Length', 'Sepal_Width',
'Petal_Length', 'Species']])

# Return the prediction probability.


glm_mod.predict(test_dat.drop('Petal_Width'),
supplemental_cols = test_dat[:,
['Sepal_Length', 'Sepal_Width',


'Petal_Length', 'Species']],
proba = True)

glm_mod.score(test_dat.drop('Petal_Width'),
test_dat[:, ['Petal_Width']])

# Change the parameter setting and refit the model.


new_setting = {'GLMS_SOLVER': 'GLMS_SOLVER_SGD'}
glm_mod.set_params(**new_setting).fit(train_x, train_y)

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Petal_Width')
>>> train_y = dat[0]['Petal_Width']
>>> test_dat = dat[1]
>>>
>>> # Specify settings.
... setting = {'GLMS_SOLVER': 'dbms_data_mining.GLMS_SOLVER_QR'}
>>>
>>> # Create a GLM model object.
... glm_mod = oml.glm("regression", **setting)
>>>
>>> # Fit the GLM model according to the training data and parameter
... # settings.
>>> glm_mod = glm_mod.fit(train_x, train_y)
>>>
>>> # Show the model details.
... glm_mod

Algorithm Name: Generalized Linear Model


Mining Function: REGRESSION

Target: Petal_Width

Settings:
setting name setting value
0 ALGO_NAME ALGO_GENERALIZED_LINEAR_MODEL
1 GLMS_CONF_LEVEL .95
2 GLMS_FTR_GENERATION GLMS_FTR_GENERATION_DISABLE
3 GLMS_FTR_SELECTION GLMS_FTR_SELECTION_DISABLE
4 GLMS_SOLVER GLMS_SOLVER_QR
5 ODMS_DETAILS ODMS_ENABLE
6 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON

Computed Settings:
setting name setting value
0 GLMS_CONV_TOLERANCE .0000050000000000000004
1 GLMS_NUM_ITERATIONS 30
2 GLMS_RIDGE_REGRESSION GLMS_RIDGE_REG_ENABLE

Global Statistics:
attribute name attribute value
0 ADJUSTED_R_SQUARE 0.949634
1 AIC -363.888
2 COEFF_VAR 14.6284
3 CONVERGED YES
4 CORRECTED_TOTAL_DF 103
5 CORRECTED_TOT_SS 58.4565
6 DEPENDENT_MEAN 1.15577
7 ERROR_DF 98
8 ERROR_MEAN_SQUARE 0.028585
9 ERROR_SUM_SQUARES 2.80131
10 F_VALUE 389.405
11 GMSEP 0.030347
12 HOCKING_SP 0.000295
13 J_P 0.030234
14 MODEL_DF 5
15 MODEL_F_P_VALUE 0
16 MODEL_MEAN_SQUARE 11.131
17 MODEL_SUM_SQUARES 55.6552
18 NUM_PARAMS 6
19 NUM_ROWS 104
20 RANK_DEFICIENCY 0
21 ROOT_MEAN_SQ 0.16907
22 R_SQ 0.952079
23 SBIC -348.021
24 VALID_COVARIANCE_MATRIX YES
[1 rows x 25 columns]

Attributes:
Petal_Length
Sepal_Length
Sepal_Width
Species


Partition: NO

Coefficients:

name level estimate


0 (Intercept) None -0.600603
1 Petal_Length None 0.239775
2 Sepal_Length None -0.078338
3 Sepal_Width None 0.253996
4 Species versicolor 0.652420
5 Species virginica 1.010438

Fit Details:
name value
0 ADJUSTED_R_SQUARE 9.496338e-01
1 AIC -3.638876e+02
2 COEFF_VAR 1.462838e+01
3 CORRECTED_TOTAL_DF 1.030000e+02
...
21 ROOT_MEAN_SQ 1.690704e-01
22 R_SQ 9.520788e-01
23 SBIC -3.480213e+02
24 VALID_COVARIANCE_MATRIX 1.000000e+00

Rank:

Deviance:

2.801309

AIC:

-364

Null Deviance:

58.456538

DF Residual:

98.0

DF Null:

103.0

Converged:

True

>>>
>>> # Use the model to make predictions on the test data.
... glm_mod.predict(test_dat.drop('Petal_Width'),
... supplemental_cols = test_dat[:,
... ['Sepal_Length', 'Sepal_Width',
... 'Petal_Length', 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa 0.113215
1 4.9 3.1 1.5 setosa 0.162592
2 4.8 3.4 1.6 setosa 0.270602
3 5.8 4.0 1.2 setosa 0.248752
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica 2.089876
43 6.7 3.0 5.2 virginica 1.893790
44 6.5 3.0 5.2 virginica 1.909457
45 5.9 3.0 5.1 virginica 1.932483

>>> # Return the prediction probability.


... glm_mod.predict(test_dat.drop('Petal_Width'),
... supplemental_cols = test_dat[:,
... ['Sepal_Length', 'Sepal_Width',
... 'Petal_Length', 'Species']],
... proba = True)
Sepal_Length Sepal_Width Species PREDICTION
0 4.9 3.0 setosa 0.113215
1 4.9 3.1 setosa 0.162592
2 4.8 3.4 setosa 0.270602
3 5.8 4.0 setosa 0.248752
... ... ... ... ...
42 6.7 3.3 virginica 2.089876
43 6.7 3.0 virginica 1.893790
44 6.5 3.0 virginica 1.909457
45 5.9 3.0 virginica 1.932483
>>>
>>> glm_mod.score(test_dat.drop('Petal_Width'),
... test_dat[:, ['Petal_Width']])
0.951252
>>>
>>> # Change the parameter setting and refit the model.
... new_setting = {'GLMS_SOLVER': 'GLMS_SOLVER_SGD'}
>>> glm_mod.set_params(**new_setting).fit(train_x, train_y)

Algorithm Name: Generalized Linear Model

Mining Function: REGRESSION

Target: Petal_Width

Settings:
setting name setting value
0 ALGO_NAME ALGO_GENERALIZED_LINEAR_MODEL
1 GLMS_CONF_LEVEL .95
2 GLMS_FTR_GENERATION GLMS_FTR_GENERATION_DISABLE
3 GLMS_FTR_SELECTION GLMS_FTR_SELECTION_DISABLE
4 GLMS_SOLVER GLMS_SOLVER_SGD
5 ODMS_DETAILS ODMS_ENABLE
6 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON


Computed Settings:
setting name setting value
0 GLMS_BATCH_ROWS 2000
1 GLMS_CONV_TOLERANCE .0001
2 GLMS_NUM_ITERATIONS 500
3 GLMS_RIDGE_REGRESSION GLMS_RIDGE_REG_ENABLE
4 GLMS_RIDGE_VALUE .01

Global Statistics:
attribute name attribute value
0 ADJUSTED_R_SQUARE 0.94175
1 AIC -348.764
2 COEFF_VAR 15.7316
3 CONVERGED NO
4 CORRECTED_TOTAL_DF 103
5 CORRECTED_TOT_SS 58.4565
6 DEPENDENT_MEAN 1.15577
7 ERROR_DF 98
8 ERROR_MEAN_SQUARE 0.033059
9 ERROR_SUM_SQUARES 3.23979
10 F_VALUE 324.347
11 GMSEP 0.035097
12 HOCKING_SP 0.000341
13 J_P 0.034966
14 MODEL_DF 5
15 MODEL_F_P_VALUE 0
16 MODEL_MEAN_SQUARE 10.7226
17 MODEL_SUM_SQUARES 53.613
18 NUM_PARAMS 6
19 NUM_ROWS 104
20 RANK_DEFICIENCY 0
21 ROOT_MEAN_SQ 0.181821
22 R_SQ 0.944578
23 SBIC -332.898
24 VALID_COVARIANCE_MATRIX NO

[1 rows x 25 columns]

Attributes:
Petal_Length
Sepal_Length
Sepal_Width
Species

Partition: NO

Coefficients:

name level estimate


0 (Intercept) None -0.338046
1 Petal_Length None 0.378658
2 Sepal_Length None -0.084440
3 Sepal_Width None 0.137150
4 Species versicolor 0.151916
5 Species virginica 0.337535


Fit Details:

name value
0 ADJUSTED_R_SQUARE 9.417502e-01
1 AIC -3.487639e+02
2 COEFF_VAR 1.573164e+01
3 CORRECTED_TOTAL_DF 1.030000e+02
... ... ...
21 ROOT_MEAN_SQ 1.818215e-01
22 R_SQ 9.445778e-01
23 SBIC -3.328975e+02
24 VALID_COVARIANCE_MATRIX 0.000000e+00

Rank:

Deviance:

3.239787

AIC:

-349

Null Deviance:

58.456538

Prior Weights:

DF Residual:

98.0

DF Null:

103.0

Converged:

False

8.13 k-Means
The oml.km class uses the k-Means (KM) algorithm, which is a hierarchical, distance-based
clustering algorithm that partitions data into a specified number of clusters.
The algorithm has the following features:
• Several distance functions: Euclidean, Cosine, and Fast Cosine. The default is Euclidean.

• For each cluster, the algorithm returns the centroid, a histogram for each attribute, and a
rule describing the hyperbox that encloses the majority of the data assigned to the cluster.
The centroid reports the mode for categorical attributes and the mean and variance for
numeric attributes.
For information on the oml.km class attributes and methods, invoke help(oml.km) or see
Oracle Machine Learning for Python API Reference.
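
For example, a minimal sketch of requesting five clusters with the Cosine distance function might look like the following. It assumes that dat is an existing oml.DataFrame proxy object for the training data; the settings used are described in the table below, and the values chosen here are only illustrative.

# A minimal sketch, assuming dat is an existing oml.DataFrame proxy
# object that holds the training data.
setting = {'KMNS_DISTANCE': 'KMNS_COSINE',   # use the Cosine distance function
           'KMNS_ITERATIONS': 30}            # allow up to 30 iterations
km_cos = oml.km(n_clusters = 5, **setting).fit(dat)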

Settings for a k-Means Model


The following table lists the settings that apply to KM models.

Table 8-11 k-Means Model Settings

Setting Name Setting Value Description


CLUS_NUM_CLUSTERS TO_CHAR(numeric_expr >= 1) The maximum number of leaf clusters generated by the
algorithm. The algorithm produces the specified
number of clusters unless there are fewer distinct data
points.
The default value is 10.
KMNS_CONV_TOLERANCE TO_CHAR(0< numeric_expr <1) Minimum Convergence Tolerance for k-Means. The
algorithm iterates until the minimum Convergence
Tolerance is satisfied or until the maximum number of
iterations, specified in KMNS_ITERATIONS, is reached.
Decreasing the Convergence Tolerance produces a
more accurate solution but may result in longer run
times.
The default Convergence Tolerance is 0.001.
KMNS_DETAILS KMNS_DETAILS_ALL Determines the level of cluster detail that is computed
KMNS_DETAILS_HIERARCHY during the build.

KMNS_DETAILS_NONE KMNS_DETAILS_ALL: Cluster hierarchy, record counts,
descriptive statistics (means, variances, modes,
histograms, and rules) are computed.
KMNS_DETAILS_HIERARCHY: Cluster hierarchy and
cluster record counts are computed. This is the default
value.
KMNS_DETAILS_NONE: No cluster details are
computed. Only the scoring information is persisted.
KMNS_DISTANCE KMNS_COSINE Distance function for k-Means.
KMNS_EUCLIDEAN The default distance function is KMNS_EUCLIDEAN.
KMNS_ITERATIONS TO_CHAR(positive_numeric_expr) Maximum number of iterations for k-Means. The
algorithm iterates until either the maximum number of
iterations is reached or the minimum Convergence
Tolerance, specified in KMNS_CONV_TOLERANCE, is
satisfied.
The default number of iterations is 20.
KMNS_MIN_PCT_ATTR_SUPPORT TO_CHAR(0 <= numeric_expr <= 1) Minimum percentage of attribute values that must be
non-null in order for the attribute to be included in the
rule description for the cluster.
If the data is sparse or includes many missing values, a
minimum support that is too high can cause very short
rules or even empty rules.
The default minimum support is 0.1.

KMNS_NUM_BINS TO_CHAR(numeric_expr > 0) Number of bins in the attribute histogram produced by
k-Means. The bin boundaries for each attribute are
computed globally on the entire training data set. The
binning method is equi-width. All attributes have the
same number of bins with the exception of attributes
with a single value, which have only one bin.
The default number of histogram bins is 11.
KMNS_RANDOM_SEED Non-negative integer Controls the seed of the random generator used during
the k-Means initialization. It must be a non-negative
integer value.
The default value is 0.
KMNS_SPLIT_CRITERION KMNS_SIZE Split criterion for k-Means. The split criterion controls
KMNS_VARIANCE the initialization of new k-Means clusters. The algorithm
builds a binary tree and adds one new cluster at a time.
When the split criterion is based on size, the new
cluster is placed in the area where the largest current
cluster is located. When the split criterion is based on
the variance, the new cluster is placed in the area of
the most spread-out cluster.
The default split criterion is the KMNS_VARIANCE.

See Also:

• About Model Settings


• Shared Settings

Example 8-13 Using the oml.km Class


This example creates a KM model and uses some of its methods. In the listing for this example,
some of the output is not shown as indicated by ellipses.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop('IRIS')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()
train_dat = dat[0]
test_dat = dat[1]

# Specify settings.
setting = {'kmns_iterations': 20}

# Create a KM model object and fit it.


km_mod = oml.km(n_clusters = 3, **setting).fit(train_dat)

# Show model details.


km_mod

# Use the model to make predictions on the test data.


km_mod.predict(test_dat,
supplemental_cols =
test_dat[:, ['Sepal_Length', 'Sepal_Width',
'Petal_Length', 'Species']])
km_mod.predict_proba(test_dat,
supplemental_cols =
test_dat[:, ['Species']]).sort_values(by =
['Species', 'PROBABILITY_OF_3'])

km_mod.transform(test_dat)

km_mod.score(test_dat)

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass


>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_dat = dat[0]
>>> test_dat = dat[1]
>>>
>>> # Specify settings.
... setting = {'kmns_iterations': 20}
>>>
>>> # Create a KM model object and fit it.
... km_mod = oml.km(n_clusters = 3, **setting).fit(train_dat)
>>>
>>> # Show model details.
... km_mod

Algorithm Name: K-Means

Mining Function: CLUSTERING

Settings:
setting name setting value
0 ALGO_NAME ALGO_KMEANS
1 CLUS_NUM_CLUSTERS 3
2 KMNS_CONV_TOLERANCE .001
3 KMNS_DETAILS KMNS_DETAILS_HIERARCHY
4 KMNS_DISTANCE KMNS_EUCLIDEAN
5 KMNS_ITERATIONS 20
6 KMNS_MIN_PCT_ATTR_SUPPORT .1
7 KMNS_NUM_BINS 11
8 KMNS_RANDOM_SEED 0
9 KMNS_SPLIT_CRITERION KMNS_VARIANCE
10 ODMS_DETAILS ODMS_ENABLE
11 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
12 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
13 PREP_AUTO ON

Global Statistics:
attribute name attribute value
0 CONVERGED YES
1 NUM_ROWS 104.0

Attributes: Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Species

Partition: NO

Clusters:

CLUSTER_ID ROW_CNT PARENT_CLUSTER_ID TREE_LEVEL DISPERSION

0 1 104 NaN 1 0.986153
1 2 68 1.0 2 1.102147
2 3 36 1.0 2 0.767052
3 4 37 2.0 3 1.015669
4 5 31 2.0 3 1.205363

Taxonomy:

PARENT_CLUSTER_ID CHILD_CLUSTER_ID
0 1 2.0
1 1 3.0
2 2 4.0
3 2 5.0
4 3 NaN
5 4 NaN
6 5 NaN

Leaf Cluster Counts:

CLUSTER_ID CNT
0 3 50
1 4 53
2 5 47
>>>
>>> # Use the model to make predictions on the test data.
... km_mod.predict(test_dat,
...                supplemental_cols =
...                test_dat[:, ['Sepal_Length', 'Sepal_Width',
...                'Petal_Length', 'Species']])
Sepal_Length Sepal_Width Petal_Length Species CLUSTER_ID
0 4.9 3.0 1.4 setosa 3
1 4.9 3.1 1.5 setosa 3
2 4.8 3.4 1.6 setosa 3
3 5.8 4.0 1.2 setosa 3
... ... ... ... ... ...
38 6.4 2.8 5.6 virginica 5
39 6.9 3.1 5.4 virginica 5
40 6.7 3.1 5.6 virginica 5
41 5.8 2.7 5.1 virginica 5
>>>
>>> km_mod.predict_proba(test_dat,
... supplemental_cols =
... test_dat[:, ['Species']]).sort_values(by =
... ['Species', 'PROBABILITY_OF_3'])
Species PROBABILITY_OF_3 PROBABILITY_OF_4 PROBABILITY_OF_5
0 setosa 0.791267 0.208494 0.000240
1 setosa 0.971498 0.028350 0.000152
2 setosa 0.981020 0.018499 0.000481
3 setosa 0.981907 0.017989 0.000104
... ... ... ... ...
42 virginica 0.000655 0.316671 0.682674
43 virginica 0.001036 0.413744 0.585220
44 virginica 0.001036 0.413744 0.585220
45 virginica 0.002452 0.305021 0.692527
>>>
>>> km_mod.transform(test_dat)
CLUSTER_DISTANCE
0 1.050234
1 0.859817
2 0.321065
3 1.427080
... ...
42 0.837757
43 0.479313
44 0.448562
45 1.123587
>>>
>>> km_mod.score(test_dat)
-47.487712

8.14 Naive Bayes


The oml.nb class creates a Naive Bayes (NB) model for classification.

The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the
historical data and calculates conditional probabilities for the target values by observing the
frequency of attribute values and of combinations of attribute values.
Naive Bayes assumes that each predictor is conditionally independent of the others. (Bayes'
Theorem requires that the predictors be independent.)
For information on the oml.nb class attributes and methods, invoke help(oml.nb) or see
Oracle Machine Learning for Python API Reference.
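
The following standalone sketch (plain pandas, not the oml API) illustrates the idea: a class score is formed from the class prior multiplied by the per-attribute conditional frequencies, and the scores are then normalized. The tiny data set and the helper function are hypothetical.

import pandas as pd

# Hypothetical training data.
data = pd.DataFrame({'Outlook': ['sunny', 'sunny', 'rain', 'rain', 'rain'],
                     'Windy':   ['yes',   'no',    'yes',  'no',   'no'],
                     'Play':    ['no',    'yes',   'no',   'yes',  'yes']})

def naive_bayes_score(df, target, evidence):
    scores = {}
    for cls, subset in df.groupby(target):
        prob = len(subset) / len(df)                # prior probability of the class
        for col, val in evidence.items():
            prob *= (subset[col] == val).mean()     # conditional frequency of the value
        scores[cls] = prob
    total = sum(scores.values())
    return {cls: p / total for cls, p in scores.items()}   # normalize to probabilities

print(naive_bayes_score(data, 'Play', {'Outlook': 'sunny', 'Windy': 'no'}))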

Settings for a Naive Bayes Model


The following table lists the settings that apply to NB models.

Table 8-12 Naive Bayes Model Settings

Setting Name Setting Value Description


CLAS_COST_TABLE_NAME table_name The name of a table that stores a cost matrix for the
algorithm to use in building the model. The cost matrix
specifies the costs associated with misclassifications.
The cost matrix table is user-created. The following are the
column requirements for the table.
• Column Name: ACTUAL_TARGET_VALUE
Data Type: Valid target data type
• Column Name: PREDICTED_TARGET_VALUE
Data Type: Valid target data type
• Column Name: COST
Data Type: NUMBER
CLAS_MAX_SUP_BINS 2 <= a number <= 2147483647 Specifies the maximum number of bins for each attribute.
The default value is 32.
CLAS_PRIORS_TABLE_NAME table_name The name of a table that stores prior probabilities to offset
differences in distribution between the build data and the
scoring data.
The priors table is user-created. The following are the column
requirements for the table.
• Column Name: TARGET_VALUE
Data Type: Valid target data type
• Column Name: PRIOR_PROBABILITY
Data Type: NUMBER

CLAS_WEIGHTS_BALANCED ON Indicates whether the algorithm must create a model that
OFF balances the target distribution. This setting is most relevant
in the presence of rare targets, as balancing the distribution
may enable better average accuracy (average of per-class
accuracy) instead of overall accuracy (which favors the
dominant class). The default value is OFF.
NABS_PAIRWISE_THRESHOLD TO_CHAR(0 <= numeric_expr <= 1) Value of the pairwise threshold for the NB algorithm.
The default value is 0.
NABS_SINGLETON_THRESHOLD TO_CHAR(0 <= numeric_expr <= 1) Value of the singleton threshold for the NB algorithm.
The default value is 0.

See Also:

• About Model Settings


• Shared Settings

Example 8-14 Using the oml.nb Class


This example creates an NB model and uses some of the methods of the oml.nb class.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
oml.drop('IRIS')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()

train_x = dat[0].drop('Species')

train_y = dat[0]['Species']
test_dat = dat[1]

# User specified settings.


setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}

# Create an oml NB model object.


nb_mod = oml.nb(**setting)

# Fit the NB model according to the training data and parameter


# settings.
nb_mod = nb_mod.fit(train_x, train_y)

# Show details of the model.


nb_mod

# Create a priors table in the database.


priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
priors = oml.create(pd.DataFrame(list(priors.items()),
columns = ['TARGET_VALUE',
'PRIOR_PROBABILITY']),
table = 'NB_PRIOR_PROBABILITY_DEMO')

# Change the setting parameter and refit the model


# with a user-defined prior table.
new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
nb_mod = nb_mod.set_params(**new_setting).fit(train_x,
train_y,
priors = priors)
nb_mod

# Use the model to make predictions on test data.


nb_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']])
# Return the prediction probability.
nb_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Species']],
proba = True)

# Return the top two most influential attributes of the highest


# probability class.
nb_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']],
topN_attrs = 2)

# Make predictions and return the probability for each class


# on new data.

nb_mod.predict_proba(test_dat.drop('Species'),
supplemental_cols = test_dat[:,
['Sepal_Length',
'Species']]).sort_values(by =
['Sepal_Length',
'Species',
'PROBABILITY_OF_setosa',
'PROBABILITY_OF_versicolor'])

# Make predictions on new data and return the mean accuracy.


nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
>>> dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # User specified settings.
... setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}
>>>
>>> # Create an oml NB model object.
... nb_mod = oml.nb(**setting)
>>>
>>> # Fit the NB model according to the training data and parameter
... # settings.
>>> nb_mod = nb_mod.fit(train_x, train_y)
>>>
>>> # Show details of the model.
... nb_mod


Algorithm Name: Naive Bayes

Mining Function: CLASSIFICATION

Target: Species

Settings:
setting name setting value
0 ALGO_NAME ALGO_NAIVE_BAYES
1 CLAS_WEIGHTS_BALANCED ON
2 NABS_PAIRWISE_THRESHOLD 0
3 NABS_SINGLETON_THRESHOLD 0
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
7 PREP_AUTO ON

Global Statistics:
attribute name attribute value
0 NUM_ROWS 104

Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

Priors:

TARGET_NAME TARGET_VALUE PRIOR_PROBABILITY COUNT


0 Species setosa 0.333333 36
1 Species versicolor 0.333333 35
2 Species virginica 0.333333 33

Conditionals:

TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME


ATTRIBUTE_VALUE \
0 Species setosa Petal_Length None ( ;
1.05]
1 Species setosa Petal_Length None (1.05; 1.2]
2 Species setosa Petal_Length None (1.2; 1.35]
3 Species setosa Petal_Length None (1.35; 1.45]
... ... ... ... ... ...

152 Species virginica Sepal_Width None (3.25; 3.35]


153 Species virginica Sepal_Width None (3.35; 3.45]
154 Species virginica Sepal_Width None (3.55; 3.65]
155 Species virginica Sepal_Width None (3.75; 3.85]

CONDITIONAL_PROBABILITY COUNT
0 0.027778 1
1 0.027778 1

2 0.083333 3
3 0.277778 10
... ... ...
152 0.030303 1
153 0.060606 2
154 0.030303 1
155 0.060606 2

[156 rows x 7 columns]

>>> # Create a priors table in the database.


... priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
>>> priors = oml.create(pd.DataFrame(list(priors.items()),
... columns = ['TARGET_VALUE',
... 'PRIOR_PROBABILITY']),
... table = 'NB_PRIOR_PROBABILITY_DEMO')
>>>
>>> # Change the setting parameter and refit the model
... # with a user-defined prior table.
... new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
>>> nb_mod = nb_mod.set_params(**new_setting).fit(train_x,
... train_y,
... priors = priors)
>>> nb_mod

Algorithm Name: Naive Bayes

Mining Function: CLASSIFICATION

Target: Species

Settings:
setting name setting value
0 ALGO_NAME ALGO_NAIVE_BAYES
1 CLAS_PRIORS_TABLE_NAME "OML_USER"."NB_PRIOR_PROBABILITY_DEMO"
2 CLAS_WEIGHTS_BALANCED OFF
3 NABS_PAIRWISE_THRESHOLD 0
4 NABS_SINGLETON_THRESHOLD 0
5 ODMS_DETAILS ODMS_ENABLE
6 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON

Global Statistics:
attribute name attribute value
0 NUM_ROWS 104

Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

Priors:


TARGET_NAME TARGET_VALUE PRIOR_PROBABILITY COUNT


0 Species setosa 0.2 36
1 Species versicolor 0.3 35
2 Species virginica 0.5 33

Conditionals:

TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME


ATTRIBUTE_VALUE \
0 Species setosa Petal_Length None ( ; 1.05]
1 Species setosa Petal_Length None (1.05; 1.2]
2 Species setosa Petal_Length None (1.2; 1.35]
3 Species setosa Petal_Length None (1.35; 1.45]
... ... ... ... ... ...
152 Species virginica Sepal_Width None (3.25; 3.35]
153 Species virginica Sepal_Width None (3.35; 3.45]
154 Species virginica Sepal_Width None (3.55; 3.65]
155 Species virginica Sepal_Width None (3.75; 3.85]

CONDITIONAL_PROBABILITY COUNT
0 0.027778 1
1 0.027778 1
2 0.083333 3
3 0.277778 10
... ... ...
152 0.030303 1
153 0.060606 2
154 0.030303 1
155 0.060606 2

[156 rows x 7 columns]

>>> # Use the model to make predictions on test data.


... nb_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica virginica
43 6.7 3.0 5.2 virginica virginica
44 6.5 3.0 5.2 virginica virginica
45 5.9 3.0 5.1 virginica virginica

>>> # Return the prediction probability.


>>> nb_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... proba = True)


Sepal_Length Sepal_Width Species PREDICTION PROBABILITY


0 4.9 3.0 setosa setosa 1.000000
1 4.9 3.1 setosa setosa 1.000000
2 4.8 3.4 setosa setosa 1.000000
3 5.8 4.0 setosa setosa 1.000000
... ... ... ... ... ...
42 6.7 3.3 virginica virginica 1.000000
43 6.7 3.0 virginica virginica 0.953848
44 6.5 3.0 virginica virginica 1.000000
45 5.9 3.0 virginica virginica 0.932334

>>> # Return the top two most influential attributes of the highest
... # probability class.
>>> nb_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']],
... topN_attrs = 2)
Sepal_Length Sepal_Width Petal_Length Species PREDICTION \
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica virginica
43 6.7 3.0 5.2 virginica virginica
44 6.5 3.0 5.2 virginica virginica
45 5.9 3.0 5.1 virginica virginica
TOP_N_ATTRIBUTES
0 <Details algorithm="Naive Bayes" class="setosa...
1 <Details algorithm="Naive Bayes" class="setosa...
2 <Details algorithm="Naive Bayes" class="setosa...
3 <Details algorithm="Naive Bayes" class="setosa...
...
42 <Details algorithm="Naive Bayes" class="virgin...
43 <Details algorithm="Naive Bayes" class="virgin...
44 <Details algorithm="Naive Bayes" class="virgin...
45 <Details algorithm="Naive Bayes" class="virgin...

>>> # Make predictions and return the probability for each class
... # on new data.
>>> nb_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:,
... ['Sepal_Length',
... 'Species']]).sort_values(by =
... ['Sepal_Length',
... 'Species',
... 'PROBABILITY_OF_setosa',
... 'PROBABILITY_OF_versicolor'])
Sepal_Length Species PROBABILITY_OF_SETOSA \
0 4.4 setosa 1.000000e+00
1 4.4 setosa 1.000000e+00
2 4.5 setosa 1.000000e+00
3 4.8 setosa 1.000000e+00
... ... ... ...

42 6.7 virginica 1.412132e-13
43 6.9 versicolor 5.295492e-20
44 6.9 virginica 5.295492e-20
45 7.0 versicolor 6.189014e-14

PROBABILITY_OF_VERSICOLOR PROBABILITY_OF_VIRGINICA
0 9.327306e-21 7.868301e-20
1 3.497737e-20 1.032715e-19
2 2.238553e-13 2.360490e-19
3 6.995487e-22 2.950617e-21
... ... ...
42 4.741700e-13 1.000000e+00
43 1.778141e-07 9.999998e-01
44 2.963565e-20 1.000000e+00
45 4.156340e-01 5.843660e-01

>>> # Make predictions on new data and return the mean accuracy.
... nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.934783

8.15 Neural Network


The oml.nn class creates a Neural Network (NN) model for classification and regression.

Neural Network models can be used to capture intricate nonlinear relationships between inputs
and outputs or to find patterns in data.
The oml.nn class methods build a feed-forward neural network for regression on
oml.DataFrame data. It supports multiple hidden layers with a specifiable number of nodes.
Each layer can have one of several activation functions.
The output layer is a single numeric or binary categorical target. The output layer can have any
of the activation functions. It has the linear activation function by default.
Modeling with the oml.nn class is well-suited for noisy and complex data such as sensor data.
Problems that such data might have are the following:
• Potentially many (numeric) predictors, for example, pixel values
• The target may be discrete-valued, real-valued, or a vector of such values
• Training data may contain errors – robust to noise
• Fast scoring
• Model transparency is not required; models difficult to interpret
Typical steps in Neural Network modeling are the following:
1. Specifying the architecture
2. Preparing the data
3. Building the model
4. Specifying the stopping criteria: iterations, error on a validation set within tolerance
5. Viewing statistical results from the model
6. Improving the model


For information on the oml.nn class attributes and methods, invoke help(oml.nn) or see
Oracle Machine Learning for Python API Reference.
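
For example, the following minimal sketch covers steps 1, 3, and 5 of that list: it specifies a two-hidden-layer architecture, builds the model, and displays the model statistics. It assumes that train_x and train_y are existing oml.DataFrame proxy objects; the exact format of the two-activation setting string is an assumption based on the NNET_ACTIVATIONS description in the table below, and the layer sizes are only illustrative.

# A minimal sketch, assuming train_x and train_y are existing
# oml.DataFrame proxy objects with the predictors and the target.
arch_setting = {'NNET_HIDDEN_LAYERS': 2,
                'NNET_NODES_PER_LAYER': '20, 10',
                'NNET_ACTIVATIONS': "'NNET_ACTIVATIONS_TANH', 'NNET_ACTIVATIONS_LOG_SIG'",
                'NNET_ITERATIONS': 300}
nn_sketch = oml.nn(**arch_setting).fit(train_x, train_y)

# Display the model details, including the topology and the weights.
nn_sketch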

Settings for a Neural Network Model


The following table lists settings for NN models.

Table 8-13 Neural Network Models Settings

Setting Name Setting Value Description


CLAS_COST_TABLE_NAME table_name The name of a table that stores a cost matrix for the
algorithm to use in scoring the model. The cost matrix
specifies the costs associated with misclassifications.
The cost matrix table is user-created. The following are the
column requirements for the table.
• Column Name: ACTUAL_TARGET_VALUE
Data Type: Valid target data type
• Column Name: PREDICTED_TARGET_VALUE
Data Type: Valid target data type
• Column Name: COST
Data Type: NUMBER
CLAS_WEIGHTS_BALANCED ON Indicates whether the algorithm must create a model that
OFF balances the target distribution. This setting is most
relevant in the presence of rare targets, as balancing the
distribution may enable better average accuracy (average
of per-class accuracy) instead of overall accuracy (which
favors the dominant class). The default value is OFF.
NNET_ACTIVATIONS A list of the following strings: Defines the activation function for the hidden layers. For
• ''NNET_ACTIVATIONS_ARCTAN'' example, '''NNET_ACTIVATIONS_BIPOLAR_SIG'',
• ''NNET_ACTIVATIONS_BIPOLAR_SIG'' ''NNET_ACTIVATIONS_TANH'''.
• ''NNET_ACTIVATIONS_LINEAR'' Different layers can have different activation functions.
• ''NNET_ACTIVATIONS_LOG_SIG'' The default value is ''NNET_ACTIVATIONS_LOG_SIG''.
• ''NNET_ACTIVATIONS_TANH'' The number of activation functions must be consistent with
NNET_HIDDEN_LAYERS and NNET_NODES_PER_LAYER.
Note: All quotes are single and two single quotes are used to escape
a single quote in SQL statements.

NNET_HELDASIDE_MAX_FAIL A positive integer With NNET_REGULARIZER_HELDASIDE, the training


process is stopped early if the network performance on the
validation data fails to improve or remains the same for
NNET_HELDASIDE_MAX_FAIL epochs in a row.
The default value is 6.
NNET_HELDASIDE_RATIO 0 <= numeric_expr <= 1 Defines the held ratio for the held-aside method.
The default value is 0.25.
NNET_HIDDEN_LAYERS A non-negative integer Defines the topology by number of hidden layers.
The default value is 1.

NNET_ITERATIONS A positive integer Specifies the maximum number of iterations in the Neural
Network algorithm.
The default value is 200.
NNET_NODES_PER_LAYER A list of positive integers Defines the topology by number of nodes per layer.
Different layers can have different numbers of nodes.
The value should be a comma-separated list of non-negative
integers. For example, '10, 20, 5'. The setting values must
be consistent with NNET_HIDDEN_LAYERS. The default
number of nodes per layer is the number of attributes or 50
(if the number of attributes > 50).
NNET_REG_LAMBDA TO_CHAR(numeric_expr >= 0) Defines the L2 regularization parameter lambda. This cannot
be set together with NNET_REGULARIZER_HELDASIDE.
The default value is 1.
NNET_REGULARIZER NNET_REGULARIZER_HELDASIDE Regularization setting for the Neural Network algorithm. If
NNET_REGULARIZER_L2 the total number of training rows is greater than 50000,
NNET_REGULARIZER_NONE then the default is NNET_REGULARIZER_HELDASIDE. If the
total number of training rows is less than or equal to 50000,
then the default is NNET_REGULARIZER_NONE.
NNET_SOLVER NNET_SOLVER_ADAM Specifies the method of optimization.
NNET_SOLVER_LBFGS The default value is NNET_SOLVER_LBFGS.
NNET_TOLERANCE TO_CHAR(0 < Defines the convergence tolerance setting of the Neural
numeric_expr < 1) Network algorithm.
The default value is 0.000001.
NNET_WEIGHT_LOWER_BOUND A real number Specifies the lower bound of the region where weights are
randomly initialized. NNET_WEIGHT_LOWER_BOUND and
NNET_WEIGHT_UPPER_BOUND must be set together. Setting
one and not setting the other raises an error.
NNET_WEIGHT_LOWER_BOUND must not be greater than
NNET_WEIGHT_UPPER_BOUND. The default value is
-sqrt(6/(l_nodes+r_nodes)). The value of l_nodes for:
• input layer dense attributes is (1+number of dense
attributes)
• input layer sparse attributes is number of sparse
attributes
• each hidden layer is (1+number of nodes in that
hidden layer)
The value of r_nodes is the number of nodes in the layer
that the weight is connecting to.
NNET_WEIGHT_UPPER_BOUND A real number Specifies the upper bound of the region where weights are
initialized. It should be set in pairs with
NNET_WEIGHT_LOWER_BOUND and its value must not be
smaller than the value of NNET_WEIGHT_LOWER_BOUND. If
not specified, the values of NNET_WEIGHT_LOWER_BOUND
and NNET_WEIGHT_UPPER_BOUND are system determined.
The default value is sqrt(6/(l_nodes+r_nodes)). See
NNET_WEIGHT_LOWER_BOUND.

ODMS_RANDOM_SEED A non-negative integer Controls the random number seed used by the hash
function to generate a random number with uniform
distribution. The default value is 0.
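
The following small worked example (plain Python, not part of the oml API) shows how the default weight-initialization bound described for NNET_WEIGHT_LOWER_BOUND and NNET_WEIGHT_UPPER_BOUND could be computed for a hypothetical network whose dense input layer has 4 attributes feeding a hidden layer with 30 nodes.

import math

l_nodes = 1 + 4                               # input layer: 1 + number of dense attributes
r_nodes = 30                                  # nodes in the layer the weights connect to
upper_bound = math.sqrt(6 / (l_nodes + r_nodes))
lower_bound = -upper_bound
print(round(lower_bound, 4), round(upper_bound, 4))   # -0.414 0.414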

See Also:

• About Model Settings


• Shared Settings

Example 8-15 Building a Neural Network Model


This example creates an NN model and uses some of the methods of the oml.nn class.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop('IRIS')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# Create a Neural Network model object.


nn_mod = oml.nn(nnet_hidden_layers = 1,
nnet_activations= "'NNET_ACTIVATIONS_LOG_SIG'",
NNET_NODES_PER_LAYER= '30')

# Fit the NN model according to the training data and parameter


# settings.

nn_mod = nn_mod.fit(train_x, train_y)

# Show details of the model.


nn_mod

# Use the model to make predictions on test data.


nn_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width',
'Petal_Length', 'Species']])

nn_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width',
'Species']], proba = True)

nn_mod.predict_proba(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Species']]).sort_values(by = ['Sepal_Length', 'Species',
'PROBABILITY_OF_setosa', 'PROBABILITY_OF_versicolor'])

nn_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])

# Change the setting parameter and refit the model.


new_setting = {'NNET_NODES_PER_LAYER': '50'}
nn_mod.set_params(**new_setting).fit(train_x, train_y)

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>


>>> # Create a Neural Network model object.


... nn_mod = oml.nn(nnet_hidden_layers = 1,
... nnet_activations= "'NNET_ACTIVATIONS_LOG_SIG'",
... NNET_NODES_PER_LAYER= '30')
>>>
>>> # Fit the NN model according to the training data and parameter
... # settings.
... nn_mod = nn_mod.fit(train_x, train_y)
>>>
>>> # Show details of the model.
... nn_mod

Algorithm Name: Neural Network

Mining Function: CLASSIFICATION

Target: Species

Settings:
setting name setting value
0 ALGO_NAME ALGO_NEURAL_NETWORK
1 CLAS_WEIGHTS_BALANCED OFF
2 LBFGS_GRADIENT_TOLERANCE .000000001
3 LBFGS_HISTORY_DEPTH 20
4 LBFGS_SCALE_HESSIAN LBFGS_SCALE_HESSIAN_ENABLE
5 NNET_ACTIVATIONS 'NNET_ACTIVATIONS_LOG_SIG'
6 NNET_HELDASIDE_MAX_FAIL 6
7 NNET_HELDASIDE_RATIO .25
8 NNET_HIDDEN_LAYERS 1
9 NNET_ITERATIONS 200
10 NNET_NODES_PER_LAYER 30
11 NNET_TOLERANCE .000001
12 ODMS_DETAILS ODMS_ENABLE
13 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
14 ODMS_RANDOM_SEED 0
15 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
16 PREP_AUTO ON

Computed Settings:
setting name setting value
0 NNET_REGULARIZER NNET_REGULARIZER_NONE

Global Statistics:
attribute name attribute value
0 CONVERGED YES
1 ITERATIONS 60.0
2 LOSS_VALUE 0.0
3 NUM_ROWS 102.0

Attributes:
Sepal_Length
Sepal_Width
Petal_Length
Petal_Width


Partition: NO

Topology:

HIDDEN_LAYER_ID NUM_NODE ACTIVATION_FUNCTION


0 0 30 NNET_ACTIVATIONS_LOG_SIG

Weights:

LAYER IDX_FROM IDX_TO ATTRIBUTE_NAME ATTRIBUTE_SUBNAME


ATTRIBUTE_VALUE \
0 0 0.0 0 Petal_Length None
None
1 0 0.0 1 Petal_Length None
None
2 0 0.0 2 Petal_Length None
None
3 0 0.0 3 Petal_Length None
None
... ... ... ... ... ... ...

239 1 29.0 2 None None


None
240 1 NaN 0 None None
None
241 1 NaN 1 None None
None
242 1 NaN 2 None None
None

TARGET_VALUE WEIGHT
0 None -39.836487
1 None 32.604824
2 None 0.953903
3 None 0.714064
... ... ...
239 virginica -22.650606
240 setosa 2.402457
241 versicolor 7.647615
242 virginica -9.493982

[243 rows x 8 columns]

>>> # Use the model to make predictions on test data.


... nn_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width',
... 'Petal_Length', 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
44 6.7 3.3 5.7 virginica virginica

45 6.7 3.0 5.2 virginica virginica
46 6.5 3.0 5.2 virginica virginica
47 5.9 3.0 5.1 virginica virginica

>>> nn_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width',
... 'Species']], proba = True)
Sepal_Length Sepal_Width Species PREDICTION PROBABILITY
0 4.9 3.0 setosa setosa 1.000000
1 4.9 3.1 setosa setosa 1.000000
2 4.8 3.4 setosa setosa 1.000000
3 5.8 4.0 setosa setosa 1.000000
... ... ... ... ... ...
44 6.7 3.3 virginica virginica 1.000000
45 6.7 3.0 virginica virginica 1.000000
46 6.5 3.0 virginica virginica 1.000000
47 5.9 3.0 virginica virginica 1.000000

>>> nn_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Species']]).sort_values(by = ['Sepal_Length', 'Species',
... 'PROBABILITY_OF_setosa', 'PROBABILITY_OF_versicolor'])
Sepal_Length Species PROBABILITY_OF_SETOSA \
0 4.4 setosa 1.000000e+00
1 4.4 setosa 1.000000e+00
2 4.5 setosa 1.000000e+00
3 4.8 setosa 1.000000e+00
... ... ... ...
44 6.7 virginica 4.567318e-218
45 6.9 versicolor 3.028266e-177
46 6.9 virginica 1.203417e-215
47 7.0 versicolor 3.382837e-148

PROBABILITY_OF_VERSICOLOR PROBABILITY_OF_VIRGINICA
0 3.491272e-67 3.459448e-283
1 8.038930e-58 2.883999e-288
2 5.273544e-64 2.243282e-293
3 1.332150e-78 2.040723e-283
... ... ...
44 1.328042e-36 1.000000e+00
45 1.000000e+00 5.063405e-55
46 4.000953e-31 1.000000e+00
47 1.000000e+00 2.593761e-121

>>> nn_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])


0.9375

>>> # Change the setting parameter and refit the model.


... new_setting = {'NNET_NODES_PER_LAYER': '50'}
>>> nn_mod.set_params(**new_setting).fit(train_x, train_y)

Algorithm Name: Neural Network

Mining Function: CLASSIFICATION

Target: Species


Settings:
setting name setting value
0 ALGO_NAME ALGO_NEURAL_NETWORK
1 CLAS_WEIGHTS_BALANCED OFF
2 LBFGS_GRADIENT_TOLERANCE .000000001
3 LBFGS_HISTORY_DEPTH 20
4 LBFGS_SCALE_HESSIAN LBFGS_SCALE_HESSIAN_ENABLE
5 NNET_ACTIVATIONS 'NNET_ACTIVATIONS_LOG_SIG'
6 NNET_HELDASIDE_MAX_FAIL 6
7 NNET_HELDASIDE_RATIO .25
8 NNET_HIDDEN_LAYERS 1
9 NNET_ITERATIONS 200
10 NNET_NODES_PER_LAYER 50
11 NNET_TOLERANCE .000001
12 ODMS_DETAILS ODMS_ENABLE
13 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
14 ODMS_RANDOM_SEED 0
15 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
16 PREP_AUTO ON

Computed Settings:
setting name setting value
0 NNET_REGULARIZER NNET_REGULARIZER_NONE

Global Statistics:
attribute name attribute value
0 CONVERGED YES
1 ITERATIONS 68.0
2 LOSS_VALUE 0.0
3 NUM_ROWS 102.0

Attributes:
Sepal_Length
Sepal_Width
Petal_Length
Petal_Width

Partition: NO

Topology:

HIDDEN_LAYER_ID NUM_NODE ACTIVATION_FUNCTION


0 0 50 NNET_ACTIVATIONS_LOG_SIG

Weights:

LAYER IDX_FROM IDX_TO ATTRIBUTE_NAME ATTRIBUTE_SUBNAME


ATTRIBUTE_VALUE \
0 0 0.0 0 Petal_Length None
None
1 0 0.0 1 Petal_Length None
None
2 0 0.0 2 Petal_Length None
None
3 0 0.0 3 Petal_Length None

None
... ... ... ... ... ... ...

399 1 49.0 2 None None


None
400 1 NaN 0 None None
None
401 1 NaN 1 None None
None
402 1 NaN 2 None None
None

TARGET_VALUE WEIGHT
0 None 10.606389
1 None -37.256485
2 None -14.263772
3 None -17.945173
... ... ...
399 virginica -22.179815
400 setosa -6.452953
401 versicolor 13.186332
402 virginica -6.973605

[403 rows x 8 columns]

8.16 Random Forest


The oml.rf class creates a Random Forest (RF) model that provides an ensemble learning
technique for classification.
By combining the ideas of bagging and random selection of variables, the Random Forest
algorithm produces a collection of decision trees with controlled variance while avoiding
overfitting, which is a common problem for decision trees.
For information on the oml.rf class attributes and methods, invoke help(oml.rf) or see
Oracle Machine Learning for Python API Reference.
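
For example, a minimal sketch that adjusts the main ensemble settings might look like the following. It assumes that train_x and train_y are existing oml.DataFrame proxy objects; the settings are described in the table below, and the specific values chosen here are only illustrative.

# A minimal sketch, assuming train_x and train_y are existing
# oml.DataFrame proxy objects with the predictors and the target.
rf_setting = {'RFOR_NUM_TREES': 50,         # grow 50 trees
              'RFOR_SAMPLING_RATIO': 0.6,   # each tree samples 60% of the training rows
              'RFOR_MTRY': 2}               # consider 2 random columns at each split
rf_sketch = oml.rf(**rf_setting).fit(train_x, train_y)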

Settings for a Random Forest Model


The following table lists settings for RF models.


Table 8-14 Random Forest Model Settings

Setting Name Setting Value Description


CLAS_COST_TABLE_NAME table_name The name of a table that stores a cost
matrix for the algorithm to use in scoring the
model. The cost matrix specifies the costs
associated with misclassifications.
The cost matrix table is user-created. The
following are the column requirements for
the table.
• Column Name:
ACTUAL_TARGET_VALUE
Data Type: Valid target data type
• Column Name:
PREDICTED_TARGET_VALUE
Data Type: Valid target data type
• Column Name: COST
Data Type: NUMBER
CLAS_MAX_SUP_BINS 2 <= a number <= 254 Specifies the maximum number of bins for
each attribute.
The default value is 32.
CLAS_WEIGHTS_BALANCED ON Indicates whether the algorithm must create
OFF a model that balances the target distribution.
This setting is most relevant in the presence
of rare targets, as balancing the distribution
may enable better average accuracy
(average of per-class accuracy) instead of
overall accuracy (which favors the dominant
class). The default value is OFF.
ODMS_RANDOM_SEED A non-negative integer Controls the random number seed used by
the hash function to generate a random
number with uniform distribution. The default
value is 0.
RFOR_MTRY A number >= 0 Size of the random subset of columns to
consider when choosing a split at a node.
For each node, the size of the pool remains
the same but the specific candidate columns
change. The default is half of the columns in
the model signature. The special value 0
indicates that the candidate pool includes all
columns.
RFOR_NUM_TREES 1 <= a number <= 65535 Number of trees in the forest
The default value is 20.
RFOR_SAMPLING_RATIO 0 < a fraction <= 1 Fraction of the training data to be randomly
sampled for use in the construction of an
individual tree. The default is half of the
number of rows in the training data.

TREE_IMPURITY_METRIC TREE_IMPURITY_ENTROPY Tree impurity metric for a decision tree
TREE_IMPURITY_GINI model.
Tree algorithms seek the best test question
for splitting data at each node. The best
splitter and split value are those that result
in the largest increase in target value
homogeneity (purity) for the entities in the
node. Purity is measured in accordance with
a metric. Decision trees can use either gini
(TREE_IMPURITY_GINI) or entropy
(TREE_IMPURITY_ENTROPY) as the purity
metric. By default, the algorithm uses
TREE_IMPURITY_GINI.
TREE_TERM_MAX_DEPTH 2 <= a number <= 100 Criteria for splits: maximum tree depth (the
maximum number of nodes between the
root and any leaf node, including the leaf
node).
The default is 16.
TREE_TERM_MINPCT_NODE 0 <= a number <= 10 The minimum number of training rows in a
node expressed as a percentage of the rows
in the training data.
The default value is 0.05, indicating 0.05%.
TREE_TERM_MINPCT_SPLIT 0 < a number <= 20 Minimum number of rows required to
consider splitting a node expressed as a
percentage of the training rows.
The default value is 0.1, indicating 0.1%.
TREE_TERM_MINREC_NODE A number >= 0 Minimum number of rows in a node.
The default value is 10.
TREE_TERM_MINREC_SPLIT A number > 1 Criteria for splits: minimum number of
records in a parent node expressed as a
value. No split is attempted if the number of
records is below this value.
The default value is 20.

See Also:

• About Model Settings


• Shared Settings

Example 8-16 Using the oml.rf Class


This example creates an RF model and uses some of the methods of the oml.rf class.

import oml
import pandas as pd
from sklearn import datasets


# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop('IRIS')
oml.drop(table = 'RF_COST')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# Create a cost matrix table in the database.


cost_matrix = [['setosa', 'setosa', 0],
['setosa', 'virginica', 0.2],
['setosa', 'versicolor', 0.8],
['virginica', 'virginica', 0],
['virginica', 'setosa', 0.5],
['virginica', 'versicolor', 0.5],
['versicolor', 'versicolor', 0],
['versicolor', 'setosa', 0.4],
['versicolor', 'virginica', 0.6]]
cost_matrix = \
oml.create(pd.DataFrame(cost_matrix,
columns = ['ACTUAL_TARGET_VALUE',
'PREDICTED_TARGET_VALUE',
'COST']),
table = 'RF_COST')

# Create an RF model object.


rf_mod = oml.rf(tree_term_max_depth = '2')

# Fit the RF model according to the training data and parameter


# settings.
rf_mod = rf_mod.fit(train_x, train_y, cost_matrix = cost_matrix)

# Show details of the model.


rf_mod

# Use the model to make predictions on the test data.


rf_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',

'Sepal_Width',
'Petal_Length',
'Species']])

# Return the prediction probability.


rf_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Species']],
proba = True)

# Return the top two most probable classes and their
# prediction probabilities.
rf_mod.predict_proba(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Species']],
topN = 2).sort_values(by = ['Sepal_Length', 'Species'])

rf_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])

# Reset TREE_TERM_MAX_DEPTH and refit the model.


rf_mod.set_params(tree_term_max_depth = '3').fit(train_x, train_y,
cost_matrix)

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... oml.drop(table = 'RF_COST')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]


>>>
>>> # Create a cost matrix table in the database.
... cost_matrix = [['setosa', 'setosa', 0],
... ['setosa', 'virginica', 0.2],
... ['setosa', 'versicolor', 0.8],
... ['virginica', 'virginica', 0],
... ['virginica', 'setosa', 0.5],
... ['virginica', 'versicolor', 0.5],
... ['versicolor', 'versicolor', 0],
... ['versicolor', 'setosa', 0.4],
... ['versicolor', 'virginica', 0.6]]
>>> cost_matrix = \
... oml.create(pd.DataFrame(cost_matrix,
... columns = ['ACTUAL_TARGET_VALUE',
... 'PREDICTED_TARGET_VALUE',
... 'COST']),
... table = 'RF_COST')
>>>
>>> # Create an RF model object.
... rf_mod = oml.rf(tree_term_max_depth = '2')
>>>
>>> # Fit the RF model according to the training data and parameter
... # settings.
>>> rf_mod = rf_mod.fit(train_x, train_y, cost_matrix = cost_matrix)
>>>
>>> # Show details of the model.
... rf_mod

Algorithm Name: Random Forest

Mining Function: CLASSIFICATION

Target: Species

Settings:
setting name setting value
0 ALGO_NAME ALGO_RANDOM_FOREST
1 CLAS_COST_TABLE_NAME "OML_USER"."RF_COST"
2 CLAS_MAX_SUP_BINS 32
3 CLAS_WEIGHTS_BALANCED OFF
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_RANDOM_SEED 0
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON
9 RFOR_NUM_TREES 20
10 RFOR_SAMPLING_RATIO .5
11 TREE_IMPURITY_METRIC TREE_IMPURITY_GINI
12 TREE_TERM_MAX_DEPTH 2
13 TREE_TERM_MINPCT_NODE .05
14 TREE_TERM_MINPCT_SPLIT .1
15 TREE_TERM_MINREC_NODE 10
16 TREE_TERM_MINREC_SPLIT 20

Computed Settings:
setting name setting value


0 RFOR_MTRY 2

Global Statistics:
attribute name attribute value
0 AVG_DEPTH 2
1 AVG_NODECOUNT 3
2 MAX_DEPTH 2
3 MAX_NODECOUNT 2
4 MIN_DEPTH 2
5 MIN_NODECOUNT 2
6 NUM_ROWS 104

Attributes:
Petal_Length
Petal_Width
Sepal_Length

Partition: NO

Importance:

ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE


0 Petal_Length None 0.329971
1 Petal_Width None 0.296799
2 Sepal_Length None 0.037309
3 Sepal_Width None 0.000000

>>> # Use the model to make predictions on the test data.


... rf_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica virginica
43 6.7 3.0 5.2 virginica virginica
44 6.5 3.0 5.2 virginica virginica
45 5.9 3.0 5.1 virginica virginica

>>> # Return the prediction probability.


... rf_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... proba = True)
Sepal_Length Sepal_Width Species PREDICTION PROBABILITY
0 4.9 3.0 setosa setosa 0.989130
1 4.9 3.1 setosa setosa 0.989130
2 4.8 3.4 setosa setosa 0.989130
3 5.8 4.0 setosa setosa 0.950000
... ... ... ... ... ...

42 6.7 3.3 virginica virginica 0.501016
43 6.7 3.0 virginica virginica 0.501016
44 6.5 3.0 virginica virginica 0.501016
45 5.9 3.0 virginica virginica 0.501016

>>> # Return the top two most probable classes and their
... # prediction probabilities.
>>> rf_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Species']],
... topN = 2).sort_values(by = ['Sepal_Length', 'Species'])
Sepal_Length Species TOP_1 TOP_1_VAL TOP_2 TOP_2_VAL
0 4.4 setosa setosa 0.989130 versicolor 0.010870
1 4.4 setosa setosa 0.989130 versicolor 0.010870
2 4.5 setosa setosa 0.989130 versicolor 0.010870
3 4.8 setosa setosa 0.989130 versicolor 0.010870
... ... ... ... ... ... ...
42 6.7 virginica virginica 0.501016 versicolor 0.498984
43 6.9 versicolor virginica 0.501016 versicolor 0.498984
44 6.9 virginica virginica 0.501016 versicolor 0.498984
45 7.0 versicolor virginica 0.501016 versicolor 0.498984

>>> rf_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])


0.76087

>>> # Reset TREE_TERM_MAX_DEPTH and refit the model.


... rf_mod.set_params(tree_term_max_depth = '3').fit(train_x, train_y,
cost_matrix)

Algorithm Name: Random Forest

Mining Function: CLASSIFICATION

Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_RANDOM_FOREST
1 CLAS_COST_TABLE_NAME "OML_USER"."RF_COST"
2 CLAS_MAX_SUP_BINS 32
3 CLAS_WEIGHTS_BALANCED OFF
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_RANDOM_SEED 0
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON
9 RFOR_NUM_TREES 20
10 RFOR_SAMPLING_RATIO .5
11 TREE_IMPURITY_METRIC TREE_IMPURITY_GINI
12 TREE_TERM_MAX_DEPTH 3
13 TREE_TERM_MINPCT_NODE .05
14 TREE_TERM_MINPCT_SPLIT .1
15 TREE_TERM_MINREC_NODE 10
16 TREE_TERM_MINREC_SPLIT 20

Computed Settings:
setting name setting value


0 RFOR_MTRY 2

Global Statistics:
attribute name attribute value
0 AVG_DEPTH 3
1 AVG_NODECOUNT 5
2 MAX_DEPTH 3
3 MAX_NODECOUNT 6
4 MIN_DEPTH 3
5 MIN_NODECOUNT 4
6 NUM_ROWS 104

Attributes:
Petal_Length
Petal_Width
Sepal_Length

Partition: NO

Importance:

ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE


0 Petal_Length None 0.501022
1 Petal_Width None 0.568170
2 Sepal_Length None 0.091617
3 Sepal_Width None 0.000000

8.17 Singular Value Decomposition


Use the oml.svd class to build a model for feature extraction.

The oml.svd class creates a model that uses the Singular Value Decomposition (SVD)
algorithm for feature extraction. SVD performs orthogonal linear transformations that capture
the underlying variance of the data by decomposing a rectangular matrix into three matrices:
U, V, and D. Columns of matrix V contain the right singular vectors and columns of matrix U
contain the left singular vectors. Matrix D is a diagonal matrix and its singular values reflect the
amount of data variance captured by the bases.
The SVDS_MAX_NUM_FEATURES constant specifies the maximum number of features supported
by SVD. The value of the constant is 2500.
For information on the oml.svd class attributes and methods, invoke help(oml.svd) or see
Oracle Machine Learning for Python API Reference.
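
The following standalone numpy sketch (not the oml API) illustrates the decomposition itself and the difference between SVD and PCA scoring described later in this section; the small random matrix is hypothetical.

import numpy as np

X = np.random.rand(6, 4)                 # 6 rows, 4 attributes
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# U holds the left singular vectors, the rows of Vt hold the right
# singular vectors, and d holds the singular values (the diagonal of D).
svd_projections = U                      # SVD scoring: projections are the U matrix
pca_projections = U * d                  # PCA scoring: projections are U times D
print(np.allclose(X, (U * d) @ Vt))      # True: X is recovered from U, D, and V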

Settings for a Singular Value Decomposition Model

Table 8-15 Singular Value Decomposition Model Settings

Setting Name Setting Value Description


FEAT_NUM_FEATURES TO_CHAR(numeric_expr >= 1) The number of features to extract. The default
value is estimated by the algorithm. If the matrix rank is smaller than this number, fewer
features are returned.

8-94
Chapter 8
Singular Value Decomposition

Table 8-15 (Cont.) Singular Value Decomposition Model Settings

Setting Name Setting Value Description


SVDS_OVER_SAMPLING Range [1, 5000]. Configures the number of columns in the sampling matrix
used by the Stochastic SVD solver. The number of columns in this matrix is equal to the
requested number of features plus the oversampling setting. SVDS_SOLVER must be set to
SVDS_SOLVER_SSVD or SVDS_SOLVER_STEIGEN.
SVDS_POWER_ITERATIONS Range [0, 20]. Improves the accuracy of the SSVD solver. The default value is
2. SVDS_SOLVER must be set to SVDS_SOLVER_SSVD or
SVDS_SOLVER_STEIGEN.
SVDS_RANDOM_SEED Range [0 - 4,294,967,296] The random seed value for initializing the sampling
matrix used by the Stochastic SVD solver. The default value is 0. SVDS_SOLVER must be set to
SVDS_SOLVER_SSVD or SVDS_SOLVER_STEIGEN.
SVDS_SCORING_MODE SVDS_SCORING_SVD or SVDS_SCORING_PCA Whether to use SVD or PCA scoring for
the model. When the build data is scored with SVD, the projections are the same as the U
matrix. When the build data is scored with PCA, the projections are the product of the U and D
matrices. The default value is SVDS_SCORING_SVD.
SVDS_SOLVER SVDS_SOLVER_TSSVD, SVDS_SOLVER_TSEIGEN, SVDS_SOLVER_SSVD, or SVDS_SOLVER_STEIGEN
Specifies the solver to be used for computing the SVD of the data. For PCA, the solver setting
indicates the type of SVD solver used to compute the PCA for the data. When this setting is
not specified, the solver type selection is data driven: if the number of attributes is
greater than 3240, then the default wide solver is used; otherwise, the default narrow solver
is selected.
The solvers fall into two groups:
• Narrow data solvers: for matrices with up to 11500 attributes (TSEIGEN) or up to 8100
attributes (TSSVD).
• Wide data solvers: for matrices with up to 1 million attributes.
The narrow data solvers are:
• Tall-Skinny SVD using QR computation, TSSVD (SVDS_SOLVER_TSSVD).
• Tall-Skinny SVD using eigenvalue computation, TSEIGEN (SVDS_SOLVER_TSEIGEN), which is the
default solver for narrow data.
The wide data solvers are:
• Stochastic SVD using QR computation, SSVD (SVDS_SOLVER_SSVD), which is the default solver
for wide data.
• Stochastic SVD using eigenvalue computation, STEIGEN (SVDS_SOLVER_STEIGEN).
SVDS_TOLERANCE Range [0, 1] Defines the minimum value of a feature's eigenvalue, as a share of
the first eigenvalue, below which the feature is pruned. Use this setting to prune features.
The default value is data driven.

8-95
Chapter 8
Singular Value Decomposition

Table 8-15 (Cont.) Singular Value Decomposition Model Settings

Setting Name Setting Value Description


SVDS_U_MATRIX_OUTPUT SVDS_U_MATRIX_ENABLE Specifies whether to persist the U matrix produced by SVD.
SVDS_U_MATRIX_DISABLE The U matrix in SVD has as many rows as the number of rows
in the build data. To avoid creating a large model, the U matrix is
persisted only when SVDS_U_MATRIX_OUTPUT is enabled.
When SVDS_U_MATRIX_OUTPUT is enabled, the build data must
include a case ID. If no case ID is present and the U matrix is
requested, then an exception is raised.
The default value is SVDS_U_MATRIX_DISABLE.

See Also:

• About Model Settings


• Shared Settings

Example 8-17 Using the oml.svd Class


This example uses some of the methods of the oml.svd class. In the listing for this example,
some of the output is not shown as indicated by ellipses.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:
oml.drop('IRIS')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()
train_dat = dat[0]
test_dat = dat[1]

# Create an SVD model object.

8-96
Chapter 8
Singular Value Decomposition

svd_mod = oml.svd(ODMS_DETAILS = 'ODMS_ENABLE')

# Fit the model according to the training data and parameter


# settings.
svd_mod = svd_mod.fit(train_dat)

# Show the model details.


svd_mod

# Use the model to make predictions on the test data.


svd_mod.predict(test_dat,
supplemental_cols = test_dat[:,
['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']])

# Perform dimensionality reduction and return values for the two


# features that have the highest topN values.
svd_mod.transform(test_dat,
supplemental_cols = test_dat[:, ['Sepal_Length']],
topN = 2).sort_values(by = ['Sepal_Length',
'TOP_1',
'TOP_1_VAL'])

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_dat = dat[0]
>>> test_dat = dat[1]
>>>
>>> # Create an SVD model object.

8-97
Chapter 8
Singular Value Decomposition

... svd_mod = oml.svd(ODMS_DETAILS = 'ODMS_ENABLE')


>>>
>>> # Fit the model according to the training data and parameter
... # settings.
>>> svd_mod = svd_mod.fit(train_dat)
>>>
>>> # Show the model details.
... svd_mod

Algorithm Name: Singular Value Decomposition

Mining Function: FEATURE_EXTRACTION

Settings:
setting name setting value
0 ALGO_NAME ALGO_SINGULAR_VALUE_DECOMP
1 ODMS_DETAILS ODMS_ENABLE
2 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
3 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
4 PREP_AUTO ON
5 SVDS_SCORING_MODE SVDS_SCORING_SVD
6 SVDS_U_MATRIX_OUTPUT SVDS_U_MATRIX_DISABLE

Computed Settings:
setting name setting value
0 FEAT_NUM_FEATURES 8
1 SVDS_SOLVER SVDS_SOLVER_TSEIGEN
2 SVDS_TOLERANCE .000000000000024646951146678475

Global Statistics:
attribute name attribute value
0 NUM_COMPONENTS 8
1 NUM_ROWS 111
2 SUGGESTED_CUTOFF 1

Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Species

Partition: NO

Features:

FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE VALUE


0 1 ID None 0.996297
1 1 Petal_Length None 0.046646
2 1 Petal_Width None 0.015917
3 1 Sepal_Length None 0.063312
... ... ... ... ...
60 8 Sepal_Width None -0.030620
61 8 Species setosa 0.431543
62 8 Species versicolor 0.566418
63 8 Species virginica 0.699261

8-98
Chapter 8
Singular Value Decomposition

[64 rows x 4 columns]

D:

FEATURE_ID VALUE
0 1 886.737809
1 2 32.736792
2 3 10.043389
3 4 5.270496
4 5 2.708602
5 6 1.652340
6 7 0.938640
7 8 0.452170

V:

'1' '2' '3' '4' '5' '6' '7' \


0 0.001332 0.156581 -0.317375 0.113462 -0.154414 -0.113058 0.799390
1 0.003692 0.052289 0.316295 0.733040 0.190746 0.022285 -0.046406
2 0.005267 -0.051498 -0.052111 0.527881 -0.066995 0.046461 -0.469396
3 0.015917 0.008741 0.263614 0.244811 0.460445 0.767503 0.262966
4 0.030208 0.550384 -0.358277 0.041807 0.689962 -0.261815 -0.143258
5 0.046646 0.189325 0.766663 0.326363 0.079611 -0.479070 0.177661
6 0.063312 0.790864 0.097964 -0.051230 -0.490804 0.312159 -0.131337
7 0.996297 -0.076079 -0.035940 -0.017429 -0.000960 -0.001908 0.001755
'8'
0 0.431543
1 0.566418
2 0.699261
3 0.005000
4 -0.030620
5 -0.016932
6 -0.052185
7 -0.001415

>>> # Use the model to make predictions on the test data.


>>> svd_mod.predict(test_dat,
supplemental_cols = test_dat[:,
... ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species FEATURE_ID
0 5.0 3.6 1.4 setosa 2
1 5.0 3.4 1.5 setosa 2
2 4.4 2.9 1.4 setosa 8
3 4.9 3.1 1.5 setosa 2
... ... ... ... ... ...
35 6.9 3.1 5.4 virginica 1
36 5.8 2.7 5.1 virginica 1
37 6.2 3.4 5.4 virginica 5
38 5.9 3.0 5.1 virginica 1

>>> # Perform dimensionality reduction and return values for the two
... # features that have the highest topN values.

8-99
Chapter 8
Support Vector Machine

>>> svd_mod.transform(test_dat,
... supplemental_cols = test_dat[:, ['Sepal_Length']],
... topN = 2).sort_values(by = ['Sepal_Length',
... 'TOP_1',
... 'TOP_1_VAL'])
Sepal_Length TOP_1 TOP_1_VAL TOP_2 TOP_2_VAL
0 4.4 7 0.153125 3 -0.130778
1 4.4 8 0.171819 2 0.147070
2 4.8 2 0.159324 6 -0.085194
3 4.8 7 0.157187 3 -0.141668
... ... ... ... ... ...
35 7.2 6 -0.167688 1 0.142545
36 7.2 7 -0.176290 6 -0.175527
37 7.6 4 0.205779 3 0.141533
38 7.9 8 -0.253194 7 -0.166967

8.18 Support Vector Machine


The oml.svm class creates a Support Vector Machine (SVM) model for classification,
regression, or anomaly detection.
SVM is a powerful, state-of-the-art algorithm with strong theoretical foundations based on the
Vapnik-Chervonenkis theory. SVM has strong regularization properties. Regularization refers to
the generalization of the model to new data.
SVM models have a functional form similar to neural networks and radial basis functions,
which are both popular machine learning techniques.
SVM can be used to solve the following problems:
• Classification: SVM classification is based on decision planes that define decision
boundaries. A decision plane is one that separates a set of objects having different class
memberships. SVM finds the vectors (“support vectors") that define the separators that
give the widest separation of classes.
SVM classification supports both binary and multiclass targets.
• Regression: SVM uses an epsilon-insensitive loss function to solve regression problems.
SVM regression tries to find a continuous function such that the maximum number of data
points lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon
distance of the true target value are not interpreted as errors.
• Anomaly Detection: Anomaly detection identifies unusual cases in data that is seemingly
homogeneous. Anomaly detection is an important tool for detecting fraud, network
intrusion, and other rare events that may have great significance but are hard to find.
Anomaly detection is implemented as one-class SVM classification. An anomaly detection
model predicts whether a data point is typical for a given distribution or not.
The oml.svm class builds each of these three different types of models. Some arguments apply
to classification models only, some to regression models only, and some to anomaly detection
models only.
For information on the oml.svm class attributes and methods, invoke help(oml.svm) or see
Oracle Machine Learning for Python API Reference.
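Example 8-18 below builds a classification model. As a complementary sketch, and not part of
that example, the following shows one way an anomaly detection model might be built with
oml.svm, assuming that the one-class case is requested with the 'anomaly_detection' mining
function and that no target (None) is passed at fit time. It reuses the IRIS proxy object
created elsewhere in this chapter, and the svms_outlier_rate value is an illustrative
assumption.

# Hedged sketch: one-class SVM for anomaly detection on the IRIS data.
import oml

dat = oml.sync(table = 'IRIS').split()
train_dat = dat[0]
test_dat = dat[1]

# For the one-class formulation there is no target column, so only the
# predictor columns are passed when fitting.
ad_mod = oml.svm('anomaly_detection',
                 svms_outlier_rate = 0.05)
ad_mod = ad_mod.fit(train_dat.drop('Species'), None)

# Score the test rows; the PREDICTION column flags each row as typical
# for the distribution or as an anomaly.
ad_mod.predict(test_dat.drop('Species'),
               supplemental_cols = test_dat[:, ['Sepal_Length', 'Species']])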

Support Vector Machine Model Settings


The following table lists settings for SVM models.

8-100
Chapter 8
Support Vector Machine

Table 8-16 Support Vector Machine Settings

Setting Name Setting Value Description


CLAS_COST_TABLE_NAME table_name The name of a table that stores a cost matrix for the algorithm to
use in scoring the model. The cost matrix specifies the costs
associated with misclassifications.
The cost matrix table is user-created. The following are the
column requirements for the table.
• Column Name: ACTUAL_TARGET_VALUE
Data Type: Valid target data type
• Column Name: PREDICTED_TARGET_VALUE
Data Type: Valid target data type
• Column Name: COST
Data Type: NUMBER
CLAS_WEIGHTS_BALANCED ON Indicates whether the algorithm must create a model that
OFF balances the target distribution. This setting is most relevant in the
presence of rare targets, as balancing the distribution may enable
better average accuracy (average of per-class accuracy) instead
of overall accuracy (which favors the dominant class). The default
value is OFF.
CLAS_WEIGHTS_TABLE_NAME table_name The name of a table that stores weighting information for individual
target values in SVM classification models. The weights are
used by the algorithm to bias the model in favor of higher weighted
classes.
The class weights table is user-created. The following are the
column requirements for the table.
• Column Name: TARGET_VALUE
Data Type: Valid target data type
• Column Name: CLASS_WEIGHT
Data Type: NUMBER
SVMS_BATCH_ROWS Positive integer Sets the size of the batch for the SGD solver. This setting applies
to SVM models with linear kernel. An input of 0 triggers a data
driven batch size estimate. The default value is 20000.
SVMS_COMPLEXITY_FACTOR TO_CHAR(numeric_expr > 0) Regularization setting that balances the complexity of the model
against model robustness to achieve good generalization on new
data. Applies to both SVM classification and regression. SVM uses
a data-driven approach to find the complexity factor; the default
value is estimated from the data by the algorithm.
SVMS_CONV_TOLERANCE TO_CHAR(numeric_expr > 0) Convergence tolerance for the SVM algorithm.
The default is 0.0001.
SVMS_EPSILON TO_CHAR(numeric_expr > 0) Regularization setting for SVM regression, similar to the complexity
factor. Epsilon specifies the allowable residuals, or noise, in the
data. The default is 0.1.
SVMS_KERNEL_FUNCTION SVMS_GAUSSIAN Kernel for Support Vector Machine. Linear or Gaussian.
SVMS_LINEAR The default value is SVMS_LINEAR.
SVMS_NUM_ITERATIONS Positive integer Sets an upper limit on the number of SVM iterations. The default is
system determined because it depends on the SVM solver.

8-101
Chapter 8
Support Vector Machine

Table 8-16 (Cont.) Support Vector Machine Settings

Setting Name Setting Value Description


SVMS_NUM_PIVOTS Range [1; 10000] Sets an upper limit on the number of pivots used in the Incomplete
Cholesky decomposition. It can be set only for non-linear kernels.
The default value is 200.
SVMS_OUTLIER_RATE TO_CHAR(0 < numeric_expr < 1) The desired rate of outliers in the training data. Valid for
one-class SVM models only (anomaly detection).
The default value is 0.01.
SVMS_REGULARIZER SVMS_REGULARIZER_L1 Controls the type of regularization that the SGD SVM solver uses.
SVMS_REGULARIZER_L2 The setting applies only to linear SVM models. The default value
is system determined because it depends on the potential model
size.
SVMS_SOLVER SVMS_SOLVER_SGD (Sub-Gradient Descent) or SVMS_SOLVER_IPM (Interior Point Method)
Allows the user to choose the SVM solver. The SGD solver cannot
be selected if the kernel is non-linear. The default value is system
determined.
SVMS_STD_DEV TO_CHAR(numeric_expr > 0) Controls the spread of the Gaussian kernel function. This setting
is applicable only for the Gaussian kernel. SVM uses a data-driven
approach to find a standard deviation value that is on the same
scale as distances between typical cases; the default value is
estimated from the data by the algorithm.

See Also:

• About Model Settings


• Shared Settings

Example 8-18 Using the oml.svm Class


This example demonstrates the use of various methods of the oml.svm class. In the listing for
this example, some of the output is not shown as indicated by ellipses.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])

try:

8-102
Chapter 8
Support Vector Machine

oml.drop('IRIS')
except:
pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.


dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# Create an SVM model object.


svm_mod = oml.svm('classification',
svms_kernel_function =
'dbms_data_mining.svms_linear')

# Fit the SVM Model according to the training data and parameter
# settings.
svm_mod.fit(train_x, train_y)

# Use the model to make predictions on test data.


svm_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']])

# Return the prediction probability.


svm_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Species']],
proba = True)
svm_mod.predict_proba(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Species']],
topN = 1).sort_values(by = ['Sepal_Length', 'Sepal_Width'])

svm_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:

8-103
Chapter 8
Support Vector Machine

... {0: 'setosa', 1: 'versicolor',


... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # Create an SVM model object.
... svm_mod = oml.svm('classification',
... svms_kernel_function =
... 'dbms_data_mining.svms_linear')
>>>
>>> # Fit the SVM model according to the training data and parameter
... # settings.
>>> svm_mod.fit(train_x, train_y)

Algorithm Name: Support Vector Machine

Mining Function: CLASSIFICATION

Target: Species

Settings:
setting name setting value
0 ALGO_NAME ALGO_SUPPORT_VECTOR_MACHINES
1 CLAS_WEIGHTS_BALANCED OFF
2 ODMS_DETAILS ODMS_ENABLE
3 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
4 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
5 PREP_AUTO ON
6 SVMS_CONV_TOLERANCE .0001
7 SVMS_KERNEL_FUNCTION SVMS_LINEAR

Computed Settings:
setting name setting value
0 SVMS_COMPLEXITY_FACTOR 10
1 SVMS_NUM_ITERATIONS 30
2 SVMS_SOLVER SVMS_SOLVER_IPM

Global Statistics:
attribute name attribute value
0 CONVERGED YES
1 ITERATIONS 14
2 NUM_ROWS 104

8-104
Chapter 8
Support Vector Machine

Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

COEFFICIENTS:

TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE COEF


0 setosa Petal_Length None None -0.5809
1 setosa Petal_Width None None -0.7736
2 setosa Sepal_Length None None -0.1653
3 setosa Sepal_Width None None 0.5689
4 setosa None None None -0.7355
5 versicolor Petal_Length None None 1.1304
6 versicolor Petal_Width None None -0.3323
7 versicolor Sepal_Length None None -0.8877
8 versicolor Sepal_Width None None -1.2582
9 versicolor None None None -0.9091
10 virginica Petal_Length None None 4.6042
11 virginica Petal_Width None None 4.0681
12 virginica Sepal_Length None None -0.7985
13 virginica Sepal_Width None None -0.4328
14 virginica None None None -5.3180

>>> # Use the model to make predictions on test data.


... svm_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
44 6.7 3.3 5.7 virginica virginica
45 6.7 3.0 5.2 virginica virginica
46 6.5 3.0 5.2 virginica virginica
47 5.9 3.0 5.1 virginica virginica

>>> # Return the prediction probability.


... svm_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... proba = True)
Sepal_Length Sepal_Width Species PREDICTION PROBABILITY
0 4.9 3.0 setosa setosa 0.761886
1 4.9 3.1 setosa setosa 0.805510
2 4.8 3.4 setosa setosa 0.920317
3 5.8 4.0 setosa setosa 0.998398
... ... ... ... ... ...

8-105
Chapter 8
Support Vector Machine

44 6.7 3.3 virginica virginica 0.927706


45 6.7 3.0 virginica virginica 0.855353
46 6.5 3.0 virginica virginica 0.799556
47 5.9 3.0 virginica virginica 0.688024

>>> # Make predictions and return the probability for each class
... # on new data.
>>> svm_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... topN = 1).sort_values(by = ['Sepal_Length', 'Sepal_Width'])
Sepal_Length Sepal_Width Species TOP_1 TOP_1_VAL
0 4.4 3.0 setosa setosa 0.698067
1 4.4 3.2 setosa setosa 0.815643
2 4.5 2.3 setosa versicolor 0.605105
3 4.8 3.4 setosa setosa 0.920317
... ... ... ... ... ...
44 6.7 3.3 virginica virginica 0.927706
45 6.9 3.1 versicolor versicolor 0.378391
46 6.9 3.1 virginica virginica 0.881118
47 7.0 3.2 versicolor setosa 0.586393

>>> svm_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])


0.895833

8-106
9
Automated Machine Learning
Use the automated algorithm selection, feature selection, and hyperparameter tuning of
Automated Machine Learning to accelerate the machine learning modeling process.
Automated Machine Learning in OML4Py is described in the following topics:
• About Automated Machine Learning
Automated Machine Learning (AutoML) provides built-in data science expertise about data
analytics and modeling that you can employ to build machine learning models.
• Algorithm Selection
The oml.automl.AlgorithmSelection class uses the characteristics of the data set and
the task to rank algorithms from the set of supported Oracle Machine Learning algorithms.
• Feature Selection
The oml.automl.FeatureSelection class identifies the most relevant feature subsets for a
training data set and an Oracle Machine Learning algorithm.
• Model Tuning
The oml.automl.ModelTuning class tunes the hyperparameters for the specified
classification or regression algorithm and training data.
• Model Selection
The oml.automl.ModelSelection class automatically selects an Oracle Machine Learning
algorithm according to the selected score metric and then tunes that algorithm.

9.1 About Automated Machine Learning


Automated Machine Learning (AutoML) provides built-in data science expertise about data
analytics and modeling that you can employ to build machine learning models.
Any modeling problem for a specified data set and prediction task involves a sequence of data
cleansing and preprocessing, algorithm selection, and model tuning tasks. Each of these steps
requires data science expertise to help guide the process to an efficient final model. Automated
Machine Learning (AutoML) automates this process with its built-in data science expertise.
OML4Py has the following AutoML capabilities:
• Automated algorithm selection that selects the appropriate algorithm from the supported
machine learning algorithms
• Automated feature selection that reduces the size of the original feature set to speed up
model training and tuning, while possibly also increasing model quality
• Automated tuning of model hyperparameters, which selects the model with the highest
score metric from among several metrics as selected by the user
AutoML performs those common modeling tasks automatically, with less effort and potentially
better results. It also leverages in-database algorithm parallel processing and scalability to
minimize runtime and produce high-quality results.
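The classes that implement these capabilities are described in the topics that follow. As an
orientation, the sketch below chains the three steps on the breast cancer data set used in the
later examples; the class and method calls match Examples 9-1 through 9-3, while the chaining
itself and the table name BC_AUTOML are illustrative assumptions, not a required workflow.

# Orientation sketch: algorithm selection, then feature selection, then model
# tuning on one training set. See Examples 9-1 through 9-3 for the full,
# self-contained versions of each step.
import oml
from oml import automl
import pandas as pd
from sklearn import datasets

bc = datasets.load_breast_cancer()
X = pd.DataFrame(bc.data.astype(float), columns = bc.feature_names)
y = pd.DataFrame(bc.target, columns = ['TARGET'])
oml_df = oml.create(pd.concat([X, y], axis=1), table = 'BC_AUTOML')

train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
X_train, y_train = train.drop('TARGET'), train['TARGET']

# 1. Rank candidate algorithms for this data set and task.
asel = automl.AlgorithmSelection(mining_function='classification',
                                 score_metric='f1_macro', parallel=4)
best_alg = asel.select(X_train, y_train, k=1)[0][0]

# 2. Reduce the feature set for the selected algorithm.
fs = automl.FeatureSelection(mining_function='classification',
                             score_metric='f1_macro', parallel=4)
subset = fs.reduce(best_alg, X_train, y_train)
X_train_new = X_train[:, subset]

# 3. Tune the hyperparameters of the selected algorithm on the reduced
#    feature set and retrieve the tuned model.
at = automl.ModelTuning(mining_function='classification', parallel=4)
results = at.tune(best_alg, X_train_new, y_train, score_metric='f1_macro')
tuned_model = results['best_model']

oml.drop('BC_AUTOML')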

9-1
Chapter 9
About Automated Machine Learning

Note:
As the fit method of the machine learning classes does, the AutoML functions
reduce, select, and tune provide a case_id parameter that you can use to achieve
repeatable data sampling and data shuffling during model building.

The AutoML functionality is also available in a no-code user interface alongside OML
Notebooks on Oracle Autonomous Database. For more information, see Oracle Machine
Learning AutoML User Interface .

Automated Machine Learning Classes and Algorithms


The Automated Machine Learning classes are the following.

Class Description
oml.automl.AlgorithmSelection Using only the characteristics of the data set and the task, automatically
selects the best algorithms from the set of supported Oracle Machine Learning
algorithms.
Supports classification and regression functions.
oml.automl.FeatureSelection Uses meta-learning to quickly identify the most relevant feature subsets given
a training data set and an Oracle Machine Learning algorithm.
Supports classification and regression functions.
oml.automl.ModelTuning Uses a highly parallel, asynchronous gradient-based hyperparameter
optimization algorithm to tune the algorithm hyperparameters.
Supports classification and regression functions.
oml.automl.ModelSelection Selects the best Oracle Machine Learning algorithm and then tunes that
algorithm.
Supports classification and regression functions.

The Oracle Machine Learning algorithms supported by AutoML are the following:

Table 9-1 Machine Learning Algorithms Supported by AutoML

Algorithm Abbreviation Algorithm Name


dt Decision Tree
glm Generalized Linear Model
glm_ridge Generalized Linear Model with ridge regression
nb Naive Bayes
nn Neural Network
rf Random Forest
svm_gaussian Support Vector Machine with Gaussian kernel
svm_linear Support Vector Machine with linear kernel

Classification and Regression Metrics


The following tables list the scoring metrics supported by AutoML.

9-2
Chapter 9
About Automated Machine Learning

Table 9-2 Binary and Multiclass Classification Metrics

Metric Description, Scikit-learn Equivalent, and Formula


accuracy Calculates the rate of correct classification of the target.

sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True,


sample_weight=None)

Formula: (tp + tn)/samples


f1_macro Calculates the f-score or f-measure, which is a weighted average of the precision and recall.
The f1_macro takes the unweighted average of per-class scores.

sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1,


average=’macro’, sample_weight=None)

Formula: 2 * (precision * recall) / (precision + recall)


f1_micro Calculates the f-score or f-measure with micro-averaging in which true positives, false
positives, and false negatives are counted globally.

sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1,


average=’micro’, sample_weight=None)

Formula: 2 * (precision * recall) / (precision + recall)


f1_weighted Calculates the f-score or f-measure with weighted averaging of per-class scores based on
support (the fraction of true samples per class). Accounts for imbalanced classes.

sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1,


average=’weighted’, sample_weight=None)

Formula: 2 * (precision * recall) / (precision + recall)


precision_macro Calculates the ability of the classifier to not label a sample incorrectly. The precision_macro
takes the unweighted average of per-class scores.

sklearn.metrics.precision_score(y_true, y_pred, labels=None,


pos_label=1, average=’macro’, sample_weight=None)

Formula: tp / (tp + fp)


precision_micro Calculates the ability of the classifier to not label a sample incorrectly. Uses micro-averaging
in which true positives, false positives, and false negatives are counted globally.

sklearn.metrics.precision_score(y_true, y_pred, labels=None,


pos_label=1, average=’micro’, sample_weight=None)

Formula: tp / (tp + fp)

9-3
Chapter 9
About Automated Machine Learning

Table 9-2 (Cont.) Binary and Multiclass Classification Metrics

Metric Description, Scikit-learn Equivalent, and Formula


precision_weighted Calculates the ability of the classifier to not label a sample incorrectly. Uses weighted
averaging of per-class scores based on support (the fraction of true samples per class).
Accounts for imbalanced classes.

sklearn.metrics.precision_score(y_true, y_pred, labels=None,


pos_label=1, average=’weighted’, sample_weight=None)

Formula: tp / (tp + fp)


recall_macro Calculates the ability of the classifier to correctly label each class. The recall_macro takes the
unweighted average of per-class scores.

sklearn.metrics.recall_score(y_true, y_pred, labels=None,


pos_label=1, average=’macro’, sample_weight=None)

Formula: tp / (tp + fn)


recall_micro Calculates the ability of the classifier to correctly label each class with micro-averaging in
which the true positives, false positives, and false negatives are counted globally.

sklearn.metrics.recall_score(y_true, y_pred, labels=None,


pos_label=1, average=’micro’, sample_weight=None)

Formula: tp / (tp + fn)


recall_weighted Calculates the ability of the classifier to correctly label each class with weighted averaging of
per-class scores based on support (the fraction of true samples per class). Accounts for
imbalanced classes.

sklearn.metrics.recall_score(y_true, y_pred, labels=None,


pos_label=1, average=’weighted’, sample_weight=None)

Formula: tp / (tp + fn)

See Also: Scikit-learn classification metrics
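The difference between the macro, micro, and weighted averages can be seen directly with the
scikit-learn equivalents listed above. The sketch below uses a small made-up label set purely
for illustration; it is not tied to any OML4Py example.

# Illustration of the averaging variants on a toy multiclass label set.
from sklearn.metrics import accuracy_score, f1_score

y_true = ['a', 'a', 'a', 'b', 'b', 'c']
y_pred = ['a', 'a', 'b', 'b', 'c', 'c']

print(accuracy_score(y_true, y_pred))               # (tp + tn) / samples
print(f1_score(y_true, y_pred, average='macro'))    # unweighted mean of per-class f1
print(f1_score(y_true, y_pred, average='micro'))    # counts pooled across all classes
print(f1_score(y_true, y_pred, average='weighted')) # per-class f1 weighted by support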

Table 9-3 Binary Classification Metrics Only

Metric Description, Scikit-learn Equivalent, and Formula


f1 Calculates the f-score or f-measure, which is a weighted average of the precision and recall.
This metric by default requires a positive target to be encoded as 1 to function as expected.

sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1,


average=’binary’, sample_weight=None)

Formula: 2 * (precision * recall) / (precision + recall)

9-4
Chapter 9
About Automated Machine Learning

Table 9-3 (Cont.) Binary Classification Metrics Only

Metric Description, Scikit-learn Equivalent, and Formula


precision Calculates the ability of the classifier to not label a sample positive (1) that is actually
negative (0).

sklearn.metrics.precision_score(y_true, y_pred, labels=None,


pos_label=1, average=’binary’, sample_weight=None)

Formula: tp / (tp + fp)


recall Calculates the ability of the classifier to label all positive (1) samples correctly.

sklearn.metrics.recall_score(y_true, y_pred, labels=None,


pos_label=1, average=’binary’, sample_weight=None)

Formula: tp / (tp + fn)


roc_auc Calculates the Area Under the Receiver Operating Characteristic Curve (roc_auc) from
prediction scores.

sklearn.metrics.roc_auc_score(y_true, y_score, average='macro',
sample_weight=None)

See also the definition of receiver operating characteristic.

Table 9-4 Regression Metrics

Metric Description, Scikit-learn Equivalent, and Formula


r2 Calculates the coefficient of determination (R squared).

sklearn.metrics.r2_score(y_true, y_pred, sample_weight=None,


multioutput=’uniform_average’)

See also the definition of coefficient of determination.


neg_mean_absolute_error Calculates the mean of the absolute difference of predicted and true targets
(MAE).

sklearn.metrics.mean_absolute_error(y_true, y_pred,
sample_weight=None, multioutput='uniform_average')

Formula: sum(|y_true - y_pred|) / n_samples
9-5
Chapter 9
Algorithm Selection

Table 9-4 (Cont.) Regression Metrics

Metric Description, Scikit-learn Equivalent, and Formula


neg_mean_squared_error Calculates the mean of the squared difference of predicted and true targets.

-1.0 * sklearn.metrics.mean_squared_error(y_true, y_pred,
sample_weight=None, multioutput='uniform_average')

Formula: sum((y_true - y_pred)^2) / n_samples

neg_mean_squared_log_error Calculates the mean of the squared difference of the natural log of predicted
and true targets.

sklearn.metrics.mean_squared_log_error(y_true, y_pred,
sample_weight=None, multioutput='uniform_average')

Formula: sum((log(1 + y_true) - log(1 + y_pred))^2) / n_samples

neg_median_absolute_error Calculates the median of the absolute difference between predicted and true
targets.

sklearn.metrics.median_absolute_error(y_true, y_pred)

Formula: median(|y_true - y_pred|)

See Also: Scikit-learn regression metrics

9.2 Algorithm Selection


The oml.automl.AlgorithmSelection class uses the characteristics of the data set and the
task to rank algorithms from the set of supported Oracle Machine Learning algorithms.
Selecting the best Oracle Machine Learning algorithm for a data set and a prediction task is
non-trivial. No single algorithm works best for all modeling problems. The
oml.automl.AlgorithmSelection class ranks the candidate algorithms according to how likely
each is to produce a quality model. This is achieved by using Oracle advanced meta-learning
intelligence learned from a repertoire of data sets with the goal of avoiding exhaustive
searches, thereby reducing overall compute time and costs.
The oml.automl.AlgorithmSelection class supports classification and regression algorithms.
To use the class, you specify a data set and the number of algorithms you want to evaluate.
The select method of the class returns a sorted list of the top algorithms and their predicted
rank (from best to worst).

9-6
Chapter 9
Algorithm Selection

For information on the parameters and methods of the class, invoke


help(oml.automl.AlgorithmSelection) or see Oracle Machine Learning for Python API
Reference.
Example 9-1 Using the oml.automl.AlgorithmSelection Class

This example creates an oml.automl.AlgorithmSelection object and then displays the


algorithm rankings with their corresponding score metric. You may select the top entry or
choose a different model depending on the needs of your particular business problem.

import oml
from oml import automl
import pandas as pd
from sklearn import datasets

# Load the breast cancer data set.


bc = datasets.load_breast_cancer()
bc_data = bc.data.astype(float)
X = pd.DataFrame(bc_data, columns = bc.feature_names)
y = pd.DataFrame(bc.target, columns = ['TARGET'])

# Create the database table BreastCancer.


oml_df = oml.create(pd.concat([X, y], axis=1),
table = 'BreastCancer')

# Split the data set into training and test data.


train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']

# Create an automated algorithm selection object with f1_macro as


# the score_metric argument.
asel = automl.AlgorithmSelection(mining_function='classification',
score_metric='f1_macro', parallel=4)

# Run algorithm selection to get the top k predicted algorithms and


# their ranking without tuning.
algo_ranking = asel.select(X, y, k=3)

# Show the selected and tuned model.


[(m, "{:.2f}".format(s)) for m,s in algo_ranking]

# Drop the database table.


oml.drop('BreastCancer')

Listing for This Example

>>> import oml


>>> from oml import automl
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the breast cancer data set.
... bc = datasets.load_breast_cancer()
>>> bc_data = bc.data.astype(float)
>>> X = pd.DataFrame(bc_data, columns = bc.feature_names)

9-7
Chapter 9
Feature Selection

>>> y = pd.DataFrame(bc.target, columns = ['TARGET'])


>>>
>>> # Create the database table BreastCancer.
>>> oml_df = oml.create(pd.concat([X, y], axis=1),
... table = 'BreastCancer')
>>>
>>> # Split the data set into training and test data.
... train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']
>>>
>>> # Create an automated algorithm selection object with f1_macro as
... # the score_metric argument.
... asel = automl.AlgorithmSelection(mining_function='classification',
... score_metric='f1_macro', parallel=4)
>>>
>>> # Run algorithm selection to get the top k predicted algorithms and
... # their ranking without tuning.
... algo_ranking = asel.select(X, y, k=3)
>>>
>>> # Show the selected and tuned model.
>>> [(m, "{:.2f}".format(s)) for m,s in algo_ranking]
[('svm_gaussian', '0.97'), ('glm_ridge', '0.96'), ('nn', '0.96')]
>>>
>>> # Drop the database table.
... oml.drop('BreastCancer')

9.3 Feature Selection


The oml.automl.FeatureSelection class identifies the most relevant feature subsets for a
training data set and an Oracle Machine Learning algorithm.
In a data analytics application, feature selection is a critical data preprocessing step that has a
high impact on both runtime and model performance. The oml.automl.FeatureSelection
class automatically selects the most relevant features for a data set and model. It internally
uses several feature-ranking algorithms to identify the best feature subset that reduces model
training time without compromising model performance. Oracle advanced meta-learning
techniques quickly prune the search space of this feature selection optimization.
The oml.automl.FeatureSelection class supports classification and regression algorithms. To
use the oml.automl.FeatureSelection class, you specify a data set and the Oracle Machine
Learning algorithm on which to perform the feature reduction.
For information on the parameters and methods of the class, invoke
help(oml.automl.FeatureSelection) or see Oracle Machine Learning for Python API
Reference.
Example 9-2 Using the oml.automl.FeatureSelection Class

This example uses the oml.automl.FeatureSelection class. The example builds a model on
the full data set and computes predictive accuracy. It performs automated feature selection,
filters the columns according to the determined set, and rebuilds the model. It then recomputes
predictive accuracy.

import oml
from oml import automl

9-8
Chapter 9
Feature Selection

import pandas as pd
import numpy as np
from sklearn import datasets

# Load the digits data set into the database.


digits = datasets.load_digits()
X = pd.DataFrame(digits.data,
columns = ['pixel{}'.format(i) for i
in range(digits.data.shape[1])])
y = pd.DataFrame(digits.target, columns = ['digit'])
oml_df = oml.create(pd.concat([X, y], axis=1), table = 'DIGITS')

# Split the data set into train and test.


train, test = oml_df.split(ratio=(0.8, 0.2),
seed = 1234, strata_cols='digit')
X_train, y_train = train.drop('digit'), train['digit']
X_test, y_test = test.drop('digit'), test['digit']

# Default model performance before feature selection.


mod = oml.svm(mining_function='classification').fit(X_train,
y_train)
"{:.2}".format(mod.score(X_test, y_test))

# Create an automated feature selection object with accuracy


# as the score_metric.
fs = automl.FeatureSelection(mining_function='classification',
score_metric='accuracy', parallel=4)

# Get the reduced feature subset on the train data set.


subset = fs.reduce('svm_linear', X_train, y_train)
"{} features reduced to {}".format(len(X_train.columns),
len(subset))

# Use the subset to select the features and create a model on the
# new reduced data set.
X_new = X_train[:,subset]
X_test_new = X_test[:,subset]
mod = oml.svm(mining_function='classification').fit(X_new, y_train)
"{:.2} with {:.1f}x feature reduction".format(
mod.score(X_test_new, y_test),
len(X_train.columns)/len(X_new.columns))

# Drop the DIGITS table.


oml.drop('DIGITS')

# For reproducible results, add a case_id column with unique row


# identifiers.
row_id = pd.DataFrame(np.arange(digits.data.shape[0]),
columns = ['CASE_ID'])
oml_df_cid = oml.create(pd.concat([row_id, X, y], axis=1),
table = 'DIGITS_CID')

train, test = oml_df_cid.split(ratio=(0.8, 0.2), seed = 1234,


hash_cols='CASE_ID',
strata_cols='digit')
X_train, y_train = train.drop('digit'), train['digit']

9-9
Chapter 9
Feature Selection

X_test, y_test = test.drop('digit'), test['digit']

# Provide the case_id column name to the feature selection


# reduce function.
subset = fs.reduce('svm_linear', X_train,
y_train, case_id='CASE_ID')
"{} features reduced to {} with case_id".format(
len(X_train.columns)-1,
len(subset))

# Drop the tables created in the example.


oml.drop('DIGITS')
oml.drop('DIGITS_CID')

Listing for This Example

>>> import oml


>>> from oml import automl
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn import datasets
>>>
>>> # Load the digits data set into the database.
... digits = datasets.load_digits()
>>> X = pd.DataFrame(digits.data,
... columns = ['pixel{}'.format(i) for i
... in range(digits.data.shape[1])])
>>> y = pd.DataFrame(digits.target, columns = ['digit'])
>>> oml_df = oml.create(pd.concat([X, y], axis=1), table = 'DIGITS')
>>>
>>> # Split the data set into train and test.
... train, test = oml_df.split(ratio=(0.8, 0.2),
... seed = 1234, strata_cols='digit')
>>> X_train, y_train = train.drop('digit'), train['digit']
>>> X_test, y_test = test.drop('digit'), test['digit']
>>>
>>> # Default model performance before feature selection.
... mod = oml.svm(mining_function='classification').fit(X_train,
... y_train)
>>> "{:.2}".format(mod.score(X_test, y_test))
'0.92'
>>>
>>> # Create an automated feature selection object with accuracy
... # as the score_metric.
... fs = automl.FeatureSelection(mining_function='classification',
... score_metric='accuracy', parallel=4)
>>> # Get the reduced feature subset on the train data set.
... subset = fs.reduce('svm_linear', X_train, y_train)
>>> "{} features reduced to {}".format(len(X_train.columns),
... len(subset))
'64 features reduced to 41'
>>>
>>> # Use the subset to select the features and create a model on the
... # new reduced data set.
... X_new = X_train[:,subset]

9-10
Chapter 9
Model Tuning

>>> X_test_new = X_test[:,subset]


>>> mod = oml.svm(mining_function='classification').fit(X_new, y_train)
>>> "{:.2} with {:.1f}x feature reduction".format(
... mod.score(X_test_new, y_test),
... len(X_train.columns)/len(X_new.columns))
'0.92 with 1.6x feature reduction'
>>>
>>> # Drop the DIGITS table.
... oml.drop('DIGITS')
>>>
>>> # For reproducible results, add a case_id column with unique row
... # identifiers.
>>> row_id = pd.DataFrame(np.arange(digits.data.shape[0]),
... columns = ['CASE_ID'])
>>> oml_df_cid = oml.create(pd.concat([row_id, X, y], axis=1),
... table = 'DIGITS_CID')

>>> train, test = oml_df_cid.split(ratio=(0.8, 0.2), seed = 1234,


... hash_cols='CASE_ID',
... strata_cols='digit')
>>> X_train, y_train = train.drop('digit'), train['digit']
>>> X_test, y_test = test.drop('digit'), test['digit']
>>>
>>> # Provide the case_id column name to the feature selection
... # reduce function.
>>> subset = fs.reduce('svm_linear', X_train,
... y_train, case_id='CASE_ID')
... "{} features reduced to {} with case_id".format(
... len(X_train.columns)-1,
... len(subset))
'64 features reduced to 45 with case_id'
>>>
>>> # Drop the tables created in the example.
... oml.drop('DIGITS')
>>> oml.drop('DIGITS_CID')

9.4 Model Tuning


The oml.automl.ModelTuning class tunes the hyperparameters for the specified classification
or regression algorithm and training data.
Model tuning is a laborious machine learning task that relies heavily on data scientist expertise.
With limited user input, the oml.automl.ModelTuning class automates this process using a
highly-parallel, asynchronous gradient-based hyperparameter optimization algorithm to tune
the hyperparameters of an Oracle Machine Learning algorithm.
The oml.automl.ModelTuning class supports classification and regression algorithms. To use
the oml.automl.ModelTuning class, you specify a data set and an algorithm to obtain a tuned
model and its corresponding hyperparameters. An advanced user can provide a customized
hyperparameter search space and a non-default scoring metric to this black box optimizer.
For a partitioned model, if you pass in the column to partition on in the param_space argument
of the tune method, oml.automl.ModelTuning tunes the partitioned model’s hyperparameters.

For information on the parameters and methods of the class, invoke


help(oml.automl.ModelTuning) or see Oracle Machine Learning for Python API Reference.

9-11
Chapter 9
Model Tuning

Example 9-3 Using the oml.automl.ModelTuning Class

This example creates an oml.automl.ModelTuning object.

import oml
from oml import automl
import pandas as pd
from sklearn import datasets

# Load the breast cancer data set.


bc = datasets.load_breast_cancer()
bc_data = bc.data.astype(float)
X = pd.DataFrame(bc_data, columns = bc.feature_names)
y = pd.DataFrame(bc.target, columns = ['TARGET'])

# Create the database table BreastCancer.


oml_df = oml.create(pd.concat([X, y], axis=1),
table = 'BreastCancer')

# Split the data set into training and test data.


train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']

# Start an automated model tuning run with a Decision Tree model.


at = automl.ModelTuning(mining_function='classification',
parallel=4)
results = at.tune('dt', X, y, score_metric='accuracy')

# Show the tuned model details.


tuned_model = results['best_model']
tuned_model

# Show the best tuned model train score and the


# corresponding hyperparameters.
score, params = results['all_evals'][0]
"{:.2}".format(score), ["{}:{}".format(k, params[k])
for k in sorted(params)]

# Use the tuned model to get the score on the test set.
"{:.2}".format(tuned_model.score(X_test, y_test))

# An example invocation of model tuning with user-defined


# search ranges for selected hyperparameters on a new tuning
# metric (f1_macro).
search_space = {
'RFOR_SAMPLING_RATIO': {'type': 'continuous',
'range': [0.01, 0.5]},
'RFOR_NUM_TREES': {'type': 'discrete',
'range': [50, 100]},
'TREE_IMPURITY_METRIC': {'type': 'categorical',
'range': ['TREE_IMPURITY_ENTROPY',
'TREE_IMPURITY_GINI']},}
results = at.tune('rf', X, y, score_metric='f1_macro',
param_space=search_space)
score, params = results['all_evals'][0]

9-12
Chapter 9
Model Tuning

("{:.2}".format(score), ["{}:{}".format(k, params[k])


for k in sorted(params)])

# Some hyperparameter search ranges need to be defined based on the


# training data set sizes (for example, the number of samples and
# features). You can use placeholders specific to the data set,
# such as $nr_features and $nr_samples, as the search ranges.
search_space = {'RFOR_MTRY': {'type': 'discrete',
'range': [1, '$nr_features/2']}}
results = at.tune('rf', X, y,
score_metric='f1_macro', param_space=search_space)
score, params = results['all_evals'][0]
("{:.2}".format(score), ["{}:{}".format(k, params[k])
for k in sorted(params)])

# Drop the database table.


oml.drop('BreastCancer')

Listing for This Example

>>> import oml


>>> from oml import automl
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the breast cancer data set.
... bc = datasets.load_breast_cancer()
>>> bc_data = bc.data.astype(float)
>>> X = pd.DataFrame(bc_data, columns = bc.feature_names)
>>> y = pd.DataFrame(bc.target, columns = ['TARGET'])
>>>
>>> # Create the database table BreastCancer.
>>> oml_df = oml.create(pd.concat([X, y], axis=1),
... table = 'BreastCancer')
>>>
>>> # Split the data set into training and test data.
... train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']
>>>
>>> # Start an automated model tuning run with a Decision Tree model.
... at = automl.ModelTuning(mining_function='classification',
... parallel=4)
>>> results = at.tune('dt', X, y, score_metric='accuracy')
>>>
>>> # Show the tuned model details.
... tuned_model = results['best_model']
>>> tuned_model

Algorithm Name: Decision Tree

Mining Function: CLASSIFICATION

Target: TARGET

9-13
Chapter 9
Model Tuning

Settings:
setting name setting value
0 ALGO_NAME ALGO_DECISION_TREE
1 CLAS_MAX_SUP_BINS 32
2 CLAS_WEIGHTS_BALANCED OFF
3 ODMS_DETAILS ODMS_DISABLE
4 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
5 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
6 PREP_AUTO ON
7 TREE_IMPURITY_METRIC TREE_IMPURITY_GINI
8 TREE_TERM_MAX_DEPTH 8
9 TREE_TERM_MINPCT_NODE 3.34
10 TREE_TERM_MINPCT_SPLIT 0.1
11 TREE_TERM_MINREC_NODE 10
12 TREE_TERM_MINREC_SPLIT 20

Attributes:
mean radius
mean texture
mean perimeter
mean area
mean smoothness
mean compactness
mean concavity
mean concave points
mean symmetry
mean fractal dimension
radius error
texture error
perimeter error
area error
smoothness error
compactness error
concavity error
concave points error
symmetry error
fractal dimension error
worst radius
worst texture
worst perimeter
worst area
worst smoothness
worst compactness
worst concavity
worst concave points
worst symmetry
worst fractal dimension

Partition: NO

>>>
>>> # Show the best tuned model train score and the
... # corresponding hyperparameters.
... score, params = results['all_evals'][0]
>>> "{:.2}".format(score), ["{}:{}".format(k, params[k])
... for k in sorted(params)]

9-14
Chapter 9
Model Selection

('0.92', ['CLAS_MAX_SUP_BINS:32', 'TREE_IMPURITY_METRIC:TREE_IMPURITY_GINI',


'TREE_TERM_MAX_DEPTH:7', 'TREE_TERM_MINPCT_NODE:0.05',
'TREE_TERM_MINPCT_SPLIT:0.1'])
>>>
>>> # Use the tuned model to get the score on the test set.
... "{:.2}".format(tuned_model.score(X_test, y_test))
'0.92'
>>>
>>> # An example invocation of model tuning with user-defined
... # search ranges for selected hyperparameters on a new tuning
... # metric (f1_macro).
... search_space = {
... 'RFOR_SAMPLING_RATIO': {'type': 'continuous',
... 'range': [0.01, 0.5]},
... 'RFOR_NUM_TREES': {'type': 'discrete',
... 'range': [50, 100]},
... 'TREE_IMPURITY_METRIC': {'type': 'categorical',
... 'range': ['TREE_IMPURITY_ENTROPY',
... 'TREE_IMPURITY_GINI']},}
>>> results = at.tune('rf', X, y, score_metric='f1_macro',
...                   param_space=search_space)
>>> score, params = results['all_evals'][0]
>>> ("{:.2}".format(score), ["{}:{}".format(k, params[k])
... for k in sorted(params)])
('0.92', ['RFOR_NUM_TREES:53', 'RFOR_SAMPLING_RATIO:0.4999951',
'TREE_IMPURITY_METRIC:TREE_IMPURITY_ENTROPY'])
>>>
>>> # Some hyperparameter search ranges need to be defined based on the
... # training data set sizes (for example, the number of samples and
... # features). You can use placeholders specific to the data set,
... # such as $nr_features and $nr_samples, as the search ranges.
... search_space = {'RFOR_MTRY': {'type': 'discrete',
... 'range': [1, '$nr_features/2']}}
>>> results = at.tune('rf', X, y,
... score_metric='f1_macro', param_space=search_space)
>>> score, params = results['all_evals'][0]
>>> ("{:.2}".format(score), ["{}:{}".format(k, params[k])
... for k in sorted(params)])
('0.93', ['RFOR_MTRY:10'])
>>>
>>> # Drop the database table.
... oml.drop('BreastCancer')

9.5 Model Selection


The oml.automl.ModelSelection class automatically selects an Oracle Machine Learning
algorithm according to the selected score metric and then tunes that algorithm.
The oml.automl.ModelSelection class supports classification and regression algorithms. To
use the oml.automl.ModelSelection class, you specify a data set and the number of
algorithms you want to tune.
The select method of the class returns the best model out of the models considered.

9-15
Chapter 9
Model Selection

For information on the parameters and methods of the class, invoke


help(oml.automl.ModelSelection) or see Oracle Machine Learning for Python API
Reference.
Example 9-4 Using the oml.automl.ModelSelection Class

This example creates an oml.automl.ModelSelection object and then uses the object to
select and tune the best model.

import oml
from oml import automl
import pandas as pd
from sklearn import datasets

# Load the breast cancer data set.


bc = datasets.load_breast_cancer()
bc_data = bc.data.astype(float)
X = pd.DataFrame(bc_data, columns = bc.feature_names)
y = pd.DataFrame(bc.target, columns = ['TARGET'])

# Create the database table BreastCancer.


oml_df = oml.create(pd.concat([X, y], axis=1),
table = 'BreastCancer')

# Split the data set into training and test data.


train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']

# Create an automated model selection object with f1_macro as the


# score_metric argument.
ms = automl.ModelSelection(mining_function='classification',
score_metric='f1_macro', parallel=4)

# Run model selection to get the top (k=1) predicted algorithm


# (defaults to the tuned model).
select_model = ms.select(X, y, k=1)

# Show the selected and tuned model.


select_model

# Score on the selected and tuned model.


"{:.2}".format(select_model.score(X_test, y_test))

# Drop the database table.


oml.drop('BreastCancer')

Listing for This Example

>>> import oml


>>> from oml import automl
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the breast cancer data set.
... bc = datasets.load_breast_cancer()

9-16
Chapter 9
Model Selection

>>> bc_data = bc.data.astype(float)


>>> X = pd.DataFrame(bc_data, columns = bc.feature_names)
>>> y = pd.DataFrame(bc.target, columns = ['TARGET'])
>>>
>>> # Create the database table BreastCancer.
>>> oml_df = oml.create(pd.concat([X, y], axis=1),
... table = 'BreastCancer')
>>>
>>> # Split the data set into training and test data.
... train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']
>>>
>>> # Create an automated model selection object with f1_macro as the
... # score_metric argument.
... ms = automl.ModelSelection(mining_function='classification',
... score_metric='f1_macro', parallel=4)
>>>
>>> # Run the model selection to get the top (k=1) predicted algorithm
... # (defaults to the tuned model).
... select_model = ms.select(X, y, k=1)
>>>
>>> # Show the selected and tuned model.
... select_model

Algorithm Name: Support Vector Machine

Mining Function: CLASSIFICATION

Target: TARGET

Settings:
setting name setting value
0 ALGO_NAME ALGO_SUPPORT_VECTOR_MACHINES
1 CLAS_WEIGHTS_BALANCED OFF
2 ODMS_DETAILS ODMS_DISABLE
3 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
4 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
5 PREP_AUTO ON
6 SVMS_COMPLEXITY_FACTOR 10
7 SVMS_CONV_TOLERANCE .0001
8 SVMS_KERNEL_FUNCTION SVMS_GAUSSIAN
9 SVMS_NUM_PIVOTS ...
10 SVMS_STD_DEV 5.3999999999999995

Attributes:
area error
compactness error
concave points error
concavity error
fractal dimension error
mean area
mean compactness
mean concave points
mean concavity
mean fractal dimension

9-17
Chapter 9
Model Selection

mean perimeter
mean radius
mean smoothness
mean symmetry
mean texture
perimeter error
radius error
smoothness error
symmetry error
texture error
worst area
worst compactness
worst concave points
worst concavity
worst fractal dimension
worst perimeter
worst radius
worst smoothness
worst symmetry
worst texture
Partition: NO

>>>
>>> # Score on the selected and tuned model.
... "{:.2}".format(select_model.score(X_test, y_test))
'0.99'
>>>
>>> # Drop the database table.
... oml.drop('BreastCancer')

9-18
10
Embedded Python Execution
Embedded Python Execution is a feature of Oracle Machine Learning for Python that allows
you to invoke user-defined Python functions directly in an Oracle database instance.
Embedded Python Execution is available on:
• Oracle Autonomous Database, where pre-installed Python packages can be used, via
Python, REST and SQL APIs.
• Oracle Database on premises, ExaCS, ExaC@C, DBCS, and Oracle Database deployed
in a compute instance, where the user can install custom third-party packages to use with
Embedded Python Execution, via Python and SQL APIs.
Embedded Python Execution is described in the following topics:

Topics:
• About Embedded Python Execution
With Embedded Python Execution, you can invoke user-defined Python functions in
Python engines spawned and managed by the Oracle database instance.
• Parallelism with OML4Py Embedded Python Execution
OML4Py embedded Python execution allows users to invoke user-defined functions from
Python, SQL, and REST interfaces using Python engines spawned and controlled by the
Oracle Autonomous Database environment.
• Embedded Python Execution Views
OML4Py includes a number of database views that contain information about datastores
and about the scripts and user-defined functions in the datastores. You can use these
views with the Embedded Python Execution APIs to work with the datastores and their
contents.
• Python API for Embedded Python Execution
You can invoke user-defined Python functions directly in an Oracle database instance by
using Embedded Python Execution functions.
• SQL API for Embedded Python Execution with On-premises Database
SQL API for Embedded Python Execution with On-premises Database has SQL interfaces
for Embedded Python Execution and for datastore and script repository management.
• SQL API for Embedded Python Execution with Autonomous Database
The SQL API for Embedded Python Execution with Autonomous Database provides SQL
interfaces for setting authorization tokens, managing access control list (ACL) privileges,
executing Python scripts, and synchronously and asynchronously running jobs.

10.1 About Embedded Python Execution


With Embedded Python Execution, you can invoke user-defined Python functions in Python
engines spawned and managed by the Oracle database instance.
Embedded Python Execution is available in Oracle Autonomous Database and in on-premises Oracle Database.
In Oracle Autonomous Database, you can use:


• An OML Notebooks Python interpreter session (see Use the Python Interpreter in a
Notebook Paragraph)
• REST API for Embedded Python Execution
• SQL API for Embedded Python Execution with Autonomous Database
In an on-premises Oracle Database, you can use:
• Python API for Embedded Python Execution
• SQL API for Embedded Python Execution with On-premises Database
The following topic compares the four Embedded Python Execution APIs.

Topics:
• Comparison of the Embedded Python Execution APIs
The table below compares the four Embedded Python Execution APIs.

10.1.1 Comparison of the Embedded Python Execution APIs


The table below compares the four Embedded Python Execution APIs.
The APIs are:
• Python API for Embedded Python Execution
• REST API for Embedded Python Execution (for use with Oracle Autonomous Database)
• SQL API for Embedded Python Execution with Autonomous Database
• SQL API for Embedded Python Execution with On-Premises Oracle Database
The APIs share many functions, but they differ in some ways because of the different
environments. For example, the APIs available for Autonomous Database provide an API for
operating in a web environment.
The procedures and functions are part of the PYQSYS and SYS schemas.

Embedded Python Execution function
• Python API: oml.do_eval function. See Run a User-Defined Python Function.
• REST API: POST /api/py-scripts/v1/do-eval/{scriptName} (see Run a Python Function) and POST /api/py-scripts/v1/do-eval/{scriptName}/{ownerName} (see Run a Python Function with Script Owner Specified).
• SQL APIs: pyqEval Function (Autonomous Database); pyqEval Function (On-Premises Database).

Embedded Python Execution function
• Python API: oml.table_apply function. See Run a User-Defined Python Function on the Specified Data.
• REST API: POST /api/py-scripts/v1/table-apply/{scriptName} (see Run a Python Function on Specified Data) and POST /api/py-scripts/v1/table-apply/{scriptName}/{ownerName} (see Run a Python Function on Specified Data with Script Owner Specified).
• SQL APIs: pyqTableEval Function (Autonomous Database); pyqTableEval Function (On-Premises Database).

Embedded Python Execution function
• Python API: oml.group_apply function. See Run a Python Function on Data Grouped By Column Values.
• REST API: POST /api/py-scripts/v1/group-apply/{scriptName} (see Run a Python Function on Grouped Data) and POST /api/py-scripts/v1/group-apply/{scriptName}/{ownerName} (see Run a Python Function on Grouped Data with Script Owner Specified).
• SQL APIs: pyqGroupEval Function (Autonomous Database); pyqGroupEval Function (On-Premises Database).

Embedded Python Execution function
• Python API: oml.row_apply function. See Run a User-Defined Python Function on Sets of Rows.
• REST API: POST /api/py-scripts/v1/row-apply/{scriptName} (see Run a Python Function on Chunks of Rows) and POST /api/py-scripts/v1/row-apply/{scriptName}/{ownerName} (see Run a Python Function on Chunks of Rows with Script Owner Specified).
• SQL APIs: pyqRowEval Function (Autonomous Database); pyqRowEval Function (On-Premises Database).

Embedded Python Execution function
• Python API: oml.index_apply function. See Run a User-Defined Python Function Multiple Times.
• REST API: POST /api/py-scripts/v1/index-apply/{scriptName} (see Run a Python Function Multiple Times) and POST /api/py-scripts/v1/index-apply/{scriptName}/{ownerName} (see Run a Python Function Multiple Times with Script Owner Specified).
• SQL APIs: pyqIndexEval Function (Autonomous Database); the API for on-premises Oracle Database has no pyqIndexEval function, so use pyqGroupEval Function (On-Premises Database) instead.

Job status API
• Python API: NA.
• REST API: GET /api/py-scripts/v1/jobs/{jobId}. See Retrieve Asynchronous Job Status.
• SQL APIs: pyqJobStatus Function (Autonomous Database); NA (on-premises database).

Job result API
• Python API: NA.
• REST API: GET /api/py-scripts/v1/jobs/{jobId}/result. See Retrieve Asynchronous Job Result.
• SQL APIs: pyqJobResult Function (Autonomous Database); NA (on-premises database).

Script repository
• Python API: oml.script.dir function. See List Available User-Defined Python Functions.
• REST API: GET /api/py-scripts/v1/scripts. See List Scripts.
• SQL APIs: list the scripts by querying the ALL_PYQ_SCRIPTS View and the USER_PYQ_SCRIPTS View.

Script repository
• Python API: oml.script.create function. See Create and Store a User-Defined Python Function.
• REST API: NA.
• SQL APIs: pyqScriptCreate Procedure (Autonomous Database); pyqScriptCreate Procedure (On-Premises Database).

Script repository
• Python API: oml.script.drop function. See Drop a User-Defined Python Function from the Repository.
• REST API: NA.
• SQL APIs: pyqScriptDrop Procedure (Autonomous Database); pyqScriptDrop Procedure (On-Premises Database).

Script repository
• Python API: oml.script.load function. See Load a User-Defined Python Function.
• REST API: NA.
• SQL APIs: NA (scripts are loaded in the SQL APIs when the function is called).

Script repository
• Python API: NA.
• REST API: NA.
• SQL APIs: ALL_PYQ_SCRIPTS View and USER_PYQ_SCRIPTS View.

Script repository and datastore
• Python API: oml.grant function. See About the Script Repository.
• REST API: NA.
• SQL APIs: pyqGrant procedure (Autonomous Database); pyqGrant procedure (on-premises database). See About the SQL API for Embedded Python Execution with On-Premises Database.

Script repository and datastore
• Python API: oml.revoke function. See About the Script Repository.
• REST API: NA.
• SQL APIs: pyqRevoke procedure (Autonomous Database); pyqRevoke procedure (on-premises database). See About the SQL API for Embedded Python Execution with On-Premises Database.

Datastore
• Python API: NA.
• REST API: NA.
• SQL APIs: ALL_PYQ_DATASTORES View, ALL_PYQ_DATASTORE_CONTENTS View, and USER_PYQ_DATASTORES View.

Authorization - Access Control Lists
• Python API: NA.
• REST API: NA.
• SQL APIs: pyqAppendHostACE Procedure, pyqRemoveHostACE Procedure, and pyqGetHostACE Function (Autonomous Database); NA for the on-premises database, where authorization is handled by logging into the user schema.

Authorization - Tokens
• Python API: NA.
• REST API: See Authenticate.
• SQL APIs: pyqSetAuthToken Procedure and pyqIsTokenSet Function (Autonomous Database); NA (on-premises database).

Note:
An output limit exists on the length of the result returned by the REST and SQL APIs for
embedded Python execution. If the len() of the returned Python object exceeds 5000, the call
fails with error code 1024 and the error message "Output exceeds maximum length 5000".
The limit applies to the len() of the returned Python object: for example, len() of a
pandas.DataFrame is its number of rows, and len() of a list is the number of its items. A
returned pandas.DataFrame therefore cannot have more than 5000 rows, and a returned list
cannot contain more than 5000 items. You can extend this limit by updating the
OML_OUTPUT_SZLIMIT setting in a %script
paragraph:

%script

EXEC sys.pyqconfigset('OML_OUTPUT_SZLIMIT', '8000')

10.2 Parallelism with OML4Py Embedded Python Execution


OML4Py embedded Python execution allows users to invoke user-defined functions from
Python, SQL, and REST interfaces using Python engines spawned and controlled by the
Oracle Autonomous Database environment.
The user-defined functions can be invoked in a data-parallel and task-parallel manner with
multiple Python engines, with output formats including structured data, XML, JSON, and PNG
images.
Oracle Autonomous Database provides different service levels to manage the load on the
system by controlling the degree of parallelism that jobs can use:
• LOW - the default, with a maximum degree of parallelism of 2
• MEDIUM - a maximum degree of parallelism of 4, allowing greater concurrency for job
processing
• HIGH - a maximum degree of parallelism of 8, but with a significantly limited number of
concurrent jobs
Parallelism applies to:
• oml.row_apply, oml.group_apply, and oml.index_apply using the Python API for
embedded Python execution
• pyqRowEval, pyqGroupEval, and pyqIndexEval using the SQL API for embedded Python
execution
• row-apply, group-apply, index-apply using the REST API for embedded Python
execution


Note:
pyqIndexEval is available on Oracle Autonomous Database only.

Setting Parallelism Using Embedded Python Execution


For the ADB Python API for Embedded Python Execution:
The parallel parameter specifies the preferred degree of parallelism to use in the embedded
Python execution job. The value may be one of the following:
• A positive integer greater than or equal to 1 for a specific degree of parallelism
• False, None, or 0 for no parallelism
• True for the default data parallelism
Setting the argument parallel=True corresponds to the service level defined in the notebook
interpreter. The value of parallel=x is limited by the service level. For instance, the
maximum number of parallel engines allowed by the MEDIUM service level is 4; therefore,
specifying parallel=6 effectively results in parallel=4.
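
The following minimal sketch (assuming a connected OML4Py session; the user-defined
function itself is purely illustrative) shows how the parallel argument is passed to an
Embedded Python Execution function such as oml.index_apply:

import oml

def compute_batch_mean(index):
    # Task-parallel work unit: seed with the invocation index and
    # return the mean of a batch of random numbers.
    import numpy as np
    np.random.seed(index)
    return float(np.random.random(100).mean())

# parallel=True uses the default data parallelism implied by the notebook
# service level; parallel=4 requests four engines but is capped by the
# service level (for example, LOW caps the request at 2).
res_default = oml.index_apply(times=8, func=compute_batch_mean, parallel=True)
res_requested = oml.index_apply(times=8, func=compute_batch_mean, parallel=4)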

For the ADB SQL API for Embedded Python Execution:


The arguments oml_parallel_flag and oml_service_level are used together to enable data
parallelism and task parallelism. For more information, see Special Control Arguments
(Autonomous Database).
For the ADB REST API for Embedded Python Execution:
When executing a REST API Embedded Python Execution function, the service argument
allows you to select the Autonomous Database service level to be used. For example, the
parallelFlag is set to true in order to use database parallelism along with the MEDIUM
service.

-d '{"parallelFlag":true,"service":"MEDIUM"}'

For more information see Specify a Service Level.
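
As an illustrative sketch only (not a complete, verified request): the endpoint path follows the
patterns listed in the API comparison table earlier in this chapter, and the server URL, access
token, and script name below are placeholders that you must replace. The same flags are
supplied in the JSON body of the request:

import requests

# Placeholder values for this sketch -- substitute your OML server URL,
# a valid access token, and the name of a script in the script repository.
omlserver = "https://<oml-server>"
token = "<access-token>"

resp = requests.post(
    omlserver + "/api/py-scripts/v1/do-eval/myScript",
    headers={"Authorization": "Bearer " + token,
             "Content-Type": "application/json"},
    # parallelFlag enables database parallelism; service selects the
    # Autonomous Database service level that caps the degree of parallelism.
    json={"parallelFlag": True, "service": "MEDIUM"},
)
print(resp.status_code)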

10.3 Embedded Python Execution Views


OML4Py includes a number of database views that contain information about datastores and
about the scripts and user-defined functions in the datastores. You can use these views with
the Embedded Python Execution APIs to work with the datastores and their contents.

View                                Description
ALL_PYQ_DATASTORES View             Contains information about the datastores available to the current user.
ALL_PYQ_DATASTORE_CONTENTS View     Contains information about the objects in the datastores available to the current user.
USER_PYQ_DATASTORES View            Contains information about the datastores owned by the current user.
ALL_PYQ_SCRIPTS View                Describes the scripts that are available to the current user.
USER_PYQ_SCRIPTS View               Describes the user-defined Python functions in the script repository that are owned by the current user.

Embedded Python Execution views are described in the following topics:

Topics:
• ALL_PYQ_DATASTORE_CONTENTS View
The ALL_PYQ_DATASTORE_CONTENTS view contains information about the contents of
datastores that are available to the current user.
• ALL_PYQ_DATASTORES View
The ALL_PYQ_DATASTORES view contains information about the datastores that are available
to the current user.
• ALL_PYQ_SCRIPTS View
The ALL_PYQ_SCRIPTS view contains information about the user-defined Python functions in
the OML4Py script repository that are available to the current user.
• USER_PYQ_DATASTORES View
The USER_PYQ_DATASTORES view contains information about the datastores that are owned
by the current user.
• USER_PYQ_SCRIPTS View
This view contains information about the user-defined Python functions in the OML4Py
script repository that are owned by the current user.

10.3.1 ALL_PYQ_DATASTORE_CONTENTS View


The ALL_PYQ_DATASTORE_CONTENTS view contains information about the contents of datastores
that are available to the current user.

Column    Datatype        Null            Description
DSOWNER   VARCHAR2(128)   NULL permitted  The owner of the datastore.
DSNAME    VARCHAR2(128)   NULL permitted  The name of the datastore.
OBJNAME   VARCHAR2(128)   NULL permitted  The name of an object in the datastore.
CLASS     VARCHAR2(128)   NULL permitted  The class of a Python object in the datastore.
OBJSIZE   NUMBER          NULL permitted  The size of an object in the datastore.
LENGTH    NUMBER          NULL permitted  The length of an object in the datastore. The length is 1 for all objects unless the object is a list, dict, pandas.DataFrame, or oml.DataFrame, in which case it is equal to len(obj).
NROW      NUMBER          NULL permitted  The number of rows of an object in the datastore. The number is 1 for all objects except for pandas.DataFrame and oml.DataFrame objects, in which case it is equal to len(df).
NCOL      NUMBER          NULL permitted  The number of columns of an object in the datastore. The number is len(obj) if the object is a list or dict, len(obj.columns) if the object is a pandas.DataFrame or oml.DataFrame, and 1 otherwise.

Example 10-1 Selecting from the ALL_PYQ_DATASTORE_CONTENTS View


This example selects all columns from the ALL_PYQ_DATASTORE_CONTENTS view. For the creation
of the datastores in this example, see Example 6-14.

SELECT * FROM ALL_PYQ_DATASTORE_CONTENTS

DSOWNER  DSNAME       OBJNAME      CLASS            OBJSIZE LENGTH NROW NCOL
-------- ------------ ------------ ---------------- ------- ------ ---- ----
OML_USER ds_pydata    oml_boston   oml.DataFrame       1073    506  506   14
OML_USER ds_pydata    oml_diabetes oml.DataFrame        964    442  442   11
OML_USER ds_pydata    wine         Bunch              24177      5    1    5
OML_USER ds_pymodel   regr1        LinearRegression     706      1    1    1
OML_USER ds_pymodel   regr2        oml.glm             5664      1    1    1
OML_USER ds_wine_data oml_wine     oml.DataFrame       1410    178  178   14

10.3.2 ALL_PYQ_DATASTORES View


The ALL_PYQ_DATASTORES view contains information about the datastores that are available to
the current user.

Column       Datatype        Null            Description
DSOWNER      VARCHAR2(256)   NULL permitted  The owner of the datastore.
DSNAME       VARCHAR2(128)   NULL permitted  The name of the datastore.
NOBJ         NUMBER          NULL permitted  The number of objects in the datastore.
DSSIZE       NUMBER          NULL permitted  The size of the datastore.
CDATE        DATE            NULL permitted  The date on which the datastore was created.
DESCRIPTION  VARCHAR2(2000)  NULL permitted  A description of the datastore.
GRANTABLE    VARCHAR2(1)     NULL permitted  Whether or not the read privilege to the datastore may be granted. The value in this column is either T for True or F for False.

Example 10-2 Selecting from the ALL_PYQ_DATASTORES View


This example selects all columns from the ALL_PYQ_DATASTORES view. It then selects only the
DSNAME and GRANTABLE columns from the view. For the creation of the datastores in these
examples, see Example 6-14.

SELECT * FROM ALL_PYQ_DATASTORES;

DSOWNER  DSNAME       NOBJ DSSIZE CDATE     DESCRIPTION     G
-------- ------------ ---- ------ --------- --------------- -
OML_USER ds_pydata       3  26214 18-MAY-19 python datasets F
OML_USER ds_pymodel      2   6370 18-MAY-19                 T
OML_USER ds_wine_data    1   1410 18-MAY-19 wine dataset    F

This example selects only the DSNAME and GRANTABLE columns from the view.

SELECT DSNAME, GRANTABLE FROM ALL_PYQ_DATASTORES;

DSNAME G
---------- -
ds_pydata F
ds_pymodel T
ds_wine_data F

10.3.3 ALL_PYQ_SCRIPTS View


The ALL_PYQ_SCRIPTS view contains information about the user-defined Python functions in the
OML4Py script repository that are available to the current user.

Column  Datatype        Null            Description
OWNER   VARCHAR2(256)   NULL permitted  The owner of the user-defined Python function.
NAME    VARCHAR2(128)   NULL permitted  The name of the user-defined Python function.
SCRIPT  CLOB            NULL permitted  The user-defined Python function.


Example 10-3 Selecting from the ALL_PYQ_SCRIPTS View


This example selects the owner and the name of the user-defined Python function from the
ALL_PYQ_SCRIPTS view.

SELECT owner, name FROM ALL_PYQ_SCRIPTS;

OWNER NAME
-------- -----------------
OML_USER create_iris_table
OML_USER tmpqfun2
PYQSYS tmpqfun2

This example selects the name of the user-defined Python function and the function definition
from the view.

SELECT name, script FROM ALL_PYQ_SCRIPTS WHERE name = 'create_iris_table';

NAME SCRIPT
-----------------
---------------------------------------------------------------------
create_iris_table "def create_iris_table(): from sklearn.datasets import
load_iris ...

10.3.4 USER_PYQ_DATASTORES View


The USER_PYQ_DATASTORES view contains information about the datastores that are owned by
the current user.

Column       Datatype        Null            Description
DSNAME       VARCHAR2(128)   NULL permitted  The name of the datastore.
NOBJ         NUMBER          NULL permitted  The number of objects in the datastore.
DSSIZE       NUMBER          NULL permitted  The size of the datastore.
CDATE        DATE            NULL permitted  The date on which the datastore was created.
DESCRIPTION  VARCHAR2(2000)  NULL permitted  A description of the datastore.
GRANTABLE    VARCHAR2(1)     NULL permitted  Whether or not the read privilege to the datastore may be granted. The value in this column is either T for True or F for False.


Example 10-4 Selecting from the USER_PYQ_DATASTORES View


This example selects all columns from the USER_PYQ_DATASTORES view. For the creation of the
datastores in these examples, see Example 6-14.

SELECT * FROM USER_PYQ_DATASTORES;

DSNAME       NOBJ DSSIZE CDATE     DESCRIPTION     G
------------ ---- ------ --------- --------------- -
ds_wine_data    1   1410 18-MAY-19 wine dataset    F
ds_pydata       3  26214 18-MAY-19 python datasets F
ds_pymodel      2   6370 18-MAY-19                 T

This example selects only the DSNAME and GRANTABLE columns from the view.

SELECT DSNAME, GRANTABLE FROM USER_PYQ_DATASTORES;

DSNAME G
---------- -
ds_wine_data F
ds_pydata F
ds_pymodel T

10.3.5 USER_PYQ_SCRIPTS View


This view contains information about the user-defined Python functions in the OML4Py script
repository that are owned by the current user.

Column  Datatype        Null            Description
NAME    VARCHAR2(128)   NOT NULL        The name of the user-defined Python function.
SCRIPT  CLOB            NULL permitted  The user-defined Python function.

Example 10-5 Selecting from the USER_PYQ_SCRIPTS View


This example selects all columns from USER_PYQ_SCRIPTS.

SELECT * FROM USER_PYQ_SCRIPTS;

NAME SCRIPT
-----------------
-------------------------------------------------------------------
create_iris_table "def create_iris_table(): from sklearn.datasets import
load_iris ...
tmpqfun2 "def return_frame(): import numpy as np import
pickle ...


10.4 Python API for Embedded Python Execution


You can invoke user-defined Python functions directly in an Oracle database instance by using
Embedded Python Execution functions.
Python API for Embedded Python Execution is described in the following topics:

Topics:
• About Embedded Python Execution
• Run a User-Defined Python Function
Use the oml.do_eval function to run a user-defined input function that explicitly retrieves
data or for which external data is not required.
• Run a User-Defined Python Function on the Specified Data
Use the oml.table_apply function to run a Python function on data that you specify with
the data parameter.
• Run a Python Function on Data Grouped By Column Values
Use the oml.group_apply function to group the values in a database table by one or more
columns and then run a user-defined Python function on each group.
• Run a User-Defined Python Function on Sets of Rows
Use the oml.row_apply function to chunk data into sets of rows and then run a user-
defined Python function on each chunk.
• Run a User-Defined Python Function Multiple Times
Use the oml.index_apply function to run a Python function multiple times in Python
engines spawned by the database environment.
• Save and Manage User-Defined Python Functions in the Script Repository
The OML4Py script repository stores user-defined Python functions for use with Embedded
Python Execution functions.

10.4.1 About Embedded Python Execution


With Embedded Python Execution, user-defined Python functions run in Python engines
spawned and managed by the database environment. You may choose to run your functions in a
data-parallel or task-parallel manner in one or more of these Python engines. In data-parallel
processing, the data is partitioned and the same user-defined Python function is invoked on
each data subset using one or more Python engines. In task-parallel processing, a user-defined
function is invoked multiple times in one or more Python engines with a unique index passed in
as an argument; for example, you may use task parallelism for Monte Carlo simulations in which
you use the index to set a random seed.
The following table lists the Python functions for Embedded Python Execution.

Function Description
oml.do_eval Runs a user-defined Python function in a Python engine spawned and
managed by the database environment.
oml.group_apply Partitions a database table by the values in one or more columns and runs
the provided user-defined Python function on each partition.
oml.index_apply Runs a Python function multiple times, passing in a unique index of the
invocation to the user-defined function.
oml.row_apply Partitions a database table into sets of rows and runs the provided user-
defined Python function on the data in each set.

oml.table_apply Runs a Python function on data in the database as a single
pandas.DataFrame in a single Python engine.

About Special Control Arguments


Special control arguments control what happens before or after the running of the function that
you pass to an Embedded Python Execution function. You specify a special control argument
with the **kwargs parameter of a function such as oml.do_eval. The control arguments are not
passed to the function specified by the func argument of that function.

Table 10-1 Special Control Arguments

Argument Description
oml_input_type Identifies the type of input data object that you are supplying
to the func argument.
The input types are the following:
• pandas.DataFrame
• numpy.recarray
• 'default' (the default value)
If all columns are numeric, then the default type is a
2-dimensional numpy.ndarray of type numpy.float64.
Otherwise, the default type is a pandas.DataFrame.
oml_na_omit Controls the handling of missing values in the input data. If
you specify oml_na_omit = True, then rows that contain
missing values are removed from the input data. If all of the
rows contain missing values, then the input data is an empty
oml.DataFrame. The default value is False.
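
For example, the following minimal sketch (assuming an oml.DataFrame proxy object named
oml_iris, such as the one created in the examples later in this chapter) passes both special
control arguments to oml.table_apply; they are consumed by OML4Py and are not forwarded to
the user-defined function:

import oml

def count_rows(dat):
    # dat arrives as a pandas.DataFrame because of oml_input_type below,
    # with rows containing missing values already removed by oml_na_omit.
    return dat.shape[0]

n = oml.table_apply(data=oml_iris, func=count_rows,
                    oml_input_type="pandas.DataFrame",
                    oml_na_omit=True)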

About Output
When a user-defined Python function runs in OML4Py, by default it returns the Python objects
returned by the function. Also, OML4Py captures all matplotlib.figure.Figure objects
created by the user-defined Python function and converts them into PNG format.
If graphics = True, the Embedded Python Execution functions return
oml.embed.data_image._DataImage objects. The oml.embed.data_image._DataImage class
contains Python objects and PNG images. Calling the method __repr__() displays the PNG
images and prints out the Python object. By default, .dat returns the Python object that the
user-defined Python function returned; .img returns a list containing PNG image data for each
figure.
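
The following sketch (assuming a connected session; the function and its figure are purely
illustrative) shows how the .dat and .img attributes are typically accessed when
graphics=True:

import oml

def hist_and_mean(n):
    # Create a matplotlib figure (captured by OML4Py and converted to PNG)
    # and return a Python object (the mean of the generated values).
    import numpy as np
    import matplotlib.pyplot as plt
    values = np.random.random(int(n))
    fig, ax = plt.subplots()
    ax.hist(values)
    return float(values.mean())

res = oml.do_eval(func=hist_and_mean, n=100, graphics=True)
# With graphics=True, res is an oml.embed.data_image._DataImage object:
returned_object = res.dat   # the value returned by hist_and_mean
png_images = res.img        # a list of PNG image data, one entry per figure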

About the Script Repository


Embedded Python Execution includes the ability to create and store user-defined Python
functions in the OML4Py script repository, grant or revoke the read privilege to a user-defined
Python function, list the available user-defined Python functions, load user-defined Python
functions into the Python environment, or drop a user-defined Python function from the script
repository.
Along with whatever other actions a user-defined Python function performs, it can also create,
retrieve, and modify Python objects that are stored in OML4Py datastores.


In Embedded Python Execution, a user-defined Python function runs in one or more Python
engines spawned and managed by the database environment. The engines are dynamically
started and managed by the database. From the same user-defined Python function you can
get structured data and PNG images.
You can make the user-defined Python function either private or global. A global function is
available to any user. A private function is available only to the owner or to users to whom the
owner of the function has granted the read privilege.

10.4.2 Run a User-Defined Python Function


Use the oml.do_eval function to run a user-defined input function that explicitly retrieves data
or for which external data is not required.
The oml.do_eval function runs a user-defined Python function in a Python engine spawned
and managed by the database environment.
The syntax of the oml.do_eval function is the following:

oml.do_eval(func, func_owner=None, graphics=False, **kwargs)

The func argument is the function to run. It may be one of the following:

• A Python function
• A string that is the name of a user-defined Python function in the OML4Py script repository
• A string that defines a Python function
• An oml.script.script.Callable object returned by the oml.script.load function
The optional func_owner argument is a string or None (the default) that specifies the owner of
the registered user-defined Python function when argument func is a registered user-defined
Python function name.
The graphics argument is a boolean that specifies whether to look for images. The default
value is False.

With the **kwargs parameter, you can pass additional arguments to the func function. Special
control arguments, which start with oml_, are not passed to the function specified by func, but
instead control what happens before or after the running of the function.
See Also: About Special Control Arguments
The oml.do_eval function returns a Python object or an oml.embed.data_image._DataImage. If
no image is rendered in the user-defined Python function, oml.do_eval returns whatever
Python object is returned by the function. Otherwise, it returns an
oml.embed.data_image._DataImage object.

See Also: About Output


Example 10-6 Using the oml.do_eval Function
This example defines a Python function that returns a Pandas DataFrame with the columns ID
and RES. It then passes that function to the oml.do_eval function.

import pandas as pd
import oml

def return_df(num, scale):
    import pandas as pd
    id = list(range(0, int(num)))
    res = [i/scale for i in id]
    return pd.DataFrame({"ID":id, "RES":res})

res = oml.do_eval(func=return_df, scale = 100, num = 10)


type(res)

res

Listing for This Example

>>> import pandas as pd


>>> import oml
>>>
>>> def return_df(num, scale):
... import pandas as pd
... id = list(range(0, int(num)))
... res = [i/scale for i in id]
... return pd.DataFrame({"ID":id, "RES":res})
...
>>>
>>> res = oml.do_eval(func=return_df, scale = 100, num = 10)
>>> type(res)
<class 'pandas.core.frame.DataFrame'>
>>>
>>> res
ID RES
0 0.0 0.00
1 1.0 0.01
2 2.0 0.02
3 3.0 0.03
4 4.0 0.04
5 5.0 0.05
6 6.0 0.06
7 7.0 0.07
8 8.0 0.08
9 9.0 0.09

10.4.3 Run a User-Defined Python Function on the Specified Data


Use the oml.table_apply function to run a Python function on data that you specify with the
data parameter.

The oml.table_apply function runs a user-defined Python function in a Python engine
spawned and managed by the database environment. With the func parameter, you can
supply a Python function or you can specify the name of a user-defined Python function in the
OML4Py script repository.
The syntax of the function is the following:

oml.table_apply(data, func, func_owner=None, graphics=False, **kwargs)


The data argument is an oml.DataFrame that contains the data that the func function operates
on.
The func argument is the function to run. It may be one of the following:

• A Python function
• A string that is the name of a user-defined Python function in the OML4Py script repository
• A string that defines a Python function
• An oml.script.script.Callable object returned by the oml.script.load function
The optional func_owner argument is a string or None (the default) that specifies the owner of
the registered user-defined Python function when argument func is a registered user-defined
Python function name.
The graphics argument is a boolean that specifies whether to look for images. The default
value is False.

With the **kwargs parameter, you can pass additional arguments to the func function. Special
control arguments, which start with oml_, are not passed to the function specified by func, but
instead control what happens before or after the execution of the function.
See Also: About Special Control Arguments
The oml.table_apply function returns a Python object or an
oml.embed.data_image._DataImage. If no image is rendered in the user-defined Python
function, oml.table_apply returns whatever Python object is returned by the function.
Otherwise, it returns an oml.embed.data_image._DataImage object.

See Also: About Output


Example 10-7 Using the oml.table_apply Function
This example builds a regression model using in-memory data, and then uses the
oml.table_apply function to predict using the model on the first 10 rows of the IRIS table.

import oml
import pandas as pd
from sklearn import datasets
from sklearn import linear_model

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()

x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                          {0: 'setosa', 1: 'versicolor',
                           2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

# Drop the IRIS database table if it exists.
try:
    oml.drop('IRIS')
except:
    pass

# Create the IRIS database table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Build a regression model using in-memory data.
iris = oml_iris.pull()
regr = linear_model.LinearRegression()
regr.fit(iris[['Sepal_Width', 'Petal_Length', 'Petal_Width']],
         iris[['Sepal_Length']])
regr.coef_

# Use oml.table_apply to predict using the model on the first 10
# rows of the IRIS table.
def predict(dat, regr):
    import pandas as pd
    pred = regr.predict(dat[['Sepal_Width', 'Petal_Length',
                             'Petal_Width']])
    return pd.concat([dat, pd.DataFrame(pred)], axis=1)

res = oml.table_apply(data=oml_iris.head(n=10),
                      func=predict, regr=regr)
res

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn import linear_model
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>>
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> # Drop the IRIS database table if it exists.
... try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Build a regression model using in-memory data.
... iris = oml_iris.pull()
>>> regr = linear_model.LinearRegression()
>>> regr.fit(iris[['Sepal_Width', 'Petal_Length', 'Petal_Width']],
... iris[['Sepal_Length']])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
>>> regr.coef_
array([[ 0.65083716, 0.70913196, -0.55648266]])
>>>
>>> # Use oml.table_apply to predict using the model on the first 10
... # rows of the IRIS table.
... def predict(dat, regr):
... import pandas as pd
... pred = regr.predict(dat[['Sepal_Width', 'Petal_Length',
... 'Petal_Width']])
... return pd.concat([dat,pd.DataFrame(pred)], axis=1)
...
>>> res = oml.table_apply(data=oml_iris.head(n=10),
... func=predict, regr=regr)
>>> res
  Sepal_Length Sepal_Width Petal_Length Petal_Width
0 4.6 3.6 1 0.2
1 5.1 2.5 3 1.1
2 6.0 2.2 4 1.0
3 5.8 2.6 4 1.2
4 5.5 2.3 4 1.3
5 5.5 2.5 4 1.3
6 6.1 2.8 4 1.3
7 5.7 2.5 5 2.0
8 6.0 2.2 5 1.5
9 6.3 2.5 5 1.9

Species 0
0 setosa 4.796847
1 versicolor 4.998355
2 versicolor 5.567884
3 versicolor 5.716923
4 versicolor 5.466023
5 versicolor 5.596191
6 virginica 5.791442
7 virginica 5.915785
8 virginica 5.998775
9 virginica 5.971433

10.4.4 Run a Python Function on Data Grouped By Column Values


Use the oml.group_apply function to group the values in a database table by one or more
columns and then run a user-defined Python function on each group.
The oml.group_apply function runs a user-defined Python function in a Python engine
spawned and managed by the database environment. The oml.group_apply function passes
the oml.DataFrame specified by the data argument to the user-defined func function as its first
argument. The index argument to oml.group_apply specifies the columns of the
oml.DataFrame by which the database groups the data for processing by the user-defined
Python function. The oml.group_apply function can use data-parallel execution, in which one
or more Python engines perform the same Python function on different groups of data.
The syntax of the function is the following.

oml.group_apply(data, index, func, func_owner=None, parallel=None,
                orderby=None, graphics=False, **kwargs)


The data argument is an oml.DataFrame that contains the in-database data that the func
function operates on.
The index argument is an oml.DataFrame object, the columns of which are used to group the
data before sending it to the func function.

The func argument is the function to run. It may be one of the following:

• A Python function
• A string that is the name of a user-defined Python function in the OML4Py script repository
• A string that defines a Python function
• An oml.script.script.Callable object returned by the oml.script.load function
The optional func_owner argument is a string or None (the default) that specifies the owner of
the registered user-defined Python function when argument func is a registered user-defined
Python function name.
The parallel argument is a boolean, an int, or None (the default) that specifies the preferred
degree of parallelism to use in the Embedded Python Execution job. The value may be one of
the following:
• A positive integer greater than or equal to 1 for a specific degree of parallelism
• False, None, or 0 for no parallelism
• True for the default data parallelism
The optional orderby argument is an oml.DataFrame, oml.Float, or oml.String that specifies
the ordering of the group partitions.
The graphics argument is a boolean that specifies whether to look for images. The default
value is False.

With the **kwargs parameter, you can pass additional arguments to the func function. Special
control arguments, which start with oml_, are not passed to the function specified by func, but
instead control what happens before or after the running of the function.
See Also: About Special Control Arguments
The oml.group_apply function returns a dict of Python objects or a dict of
oml.embed.data_image._DataImage objects. If no image is rendered in the user-defined
Python function, oml.group_apply returns a dict of the Python objects returned by the
function. Otherwise, it returns a dict of oml.embed.data_image._DataImage objects.

See Also: About Output


Example 10-8 Using the oml.group_apply Function
This example defines some functions and calls oml.group_apply for each function.

import pandas as pd
from sklearn import datasets
import oml

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()

x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                          {0: 'setosa', 1: 'versicolor',
                           2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

# Drop the IRIS database table if it exists.
try:
    oml.drop('IRIS')
except:
    pass

# Create the IRIS database table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Define a function that counts the number of rows and returns a
# dataframe with the species and the count.
def group_count(dat):
    import pandas as pd
    return pd.DataFrame([(dat["Species"][0], dat.shape[0])],
                        columns = ["Species", "COUNT"])

# Select the Species column to use as the index argument.
index = oml.DataFrame(oml_iris['Species'])

# Group the data by the Species column and run the user-defined
# function for each species.
res = oml.group_apply(oml_iris, index, func=group_count,
                      oml_input_type="pandas.DataFrame")
res

# Define a function that builds a linear regression model, with
# Petal_Width as the feature and Petal_Length as the target value,
# and that returns the model after fitting the values.
def build_lm(dat):
    from sklearn import linear_model
    lm = linear_model.LinearRegression()
    X = dat[["Petal_Width"]]
    y = dat[["Petal_Length"]]
    lm.fit(X, y)
    return lm

# Run the model for each species and return an objectList in
# dict format with a model for each species.
mod = oml.group_apply(oml_iris[:,["Petal_Length", "Petal_Width",
                                  "Species"]], index, func=build_lm)

# The output is a dict of key-value pairs for each species and model.
type(mod)

# Sort dict by the key species.


{k: mod[k] for k in sorted(mod.keys())}


Listing for This Example

>>> import pandas as pd


>>> from sklearn import datasets
>>> import oml
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>>
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> # Drop the IRIS database table if it exists.
... try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Define a function that counts the number of rows and returns a
... # dataframe with the species and the count.
... def group_count(dat):
... import pandas as pd
... return pd.DataFrame([(dat["Species"][0], dat.shape[0])],\
... columns = ["Species", "COUNT"])
...
>>> # Select the Species column to use as the index argument.
... index = oml.DataFrame(oml_iris['Species'])
>>>
>>> # Group the data by the Species column and run the user-defined
... # function for each species.
... res = oml.group_apply(oml_iris, index, func=group_count,
... oml_input_type="pandas.DataFrame")
>>> res
{'setosa': Species COUNT
0 setosa 50, 'versicolor': Species COUNT
0 versicolor 50, 'virginica': Species COUNT
0 virginica 50}
>>>
>>> # Define a function that builds a linear regression model, with
... # Petal_Width as the feature and Petal_Length as the target value,
... # and that returns the model after fitting the values.
... def build_lm(dat):
... from sklearn import linear_model
... lm = linear_model.LinearRegression()
... X = dat[["Petal_Width"]]
... y = dat[["Petal_Length"]]
... lm.fit(X, y)
... return lm


...
>>> # Run the model for each species and return an objectList in
... # dict format with a model for each species.
... mod = oml.group_apply(oml_iris[:,["Petal_Length", "Petal_Width",
... "Species"]], index, func=build_lm)
>>>
>>> # The output is a dict of key-value pairs for each species and model.
... type(mod)
<class 'dict'>
>>>
>>> # Sort dict by the key species.
... {k: mod[k] for k in sorted(mod.keys())}
{'setosa': LinearRegression(copy_X=True, fit_intercept=True,
n_jobs=None,normalize=False), 'versicolor': LinearRegression(copy_X=True,
fit_intercept=True, n_jobs=None, normalize=False), 'virginica':
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)}

10.4.5 Run a User-Defined Python Function on Sets of Rows


Use the oml.row_apply function to chunk data into sets of rows and then run a user-defined
Python function on each chunk.
The oml.row_apply function passes the oml.DataFrame specified by the data argument as the
first argument to the user-defined func Python function. The rows argument specifies the
maximum number of rows of the oml.DataFrame to assign to each chunk. The last chunk of
rows may have fewer rows than the number specified.
The oml.row_apply function runs the Python function in a database-spawned Python engine.
The function can use data-parallel execution, in which one or more Python engines perform the
same Python function on different chunks of the data.
The syntax of the function is the following.

oml.row_apply(data, func, func_owner=None, rows=1, parallel=None,
              graphics=False, **kwargs)

The data argument is an oml.DataFrame that contains the data that the func function operates
on.
The func argument is the function to run. It may be one of the following:

• A Python function
• A string that is the name of a user-defined Python function in the OML4Py script repository
• A string that defines a Python function
• An oml.script.script.Callable object returned by the oml.script.load function
The optional func_owner argument is a string or None (the default) that specifies the owner of
the registered user-defined Python function when argument func is a registered user-defined
Python function name.
The rows argument is an int that specifies the maximum number of rows to include in each
chunk.


The parallel argument is a boolean, an int, or None (the default) that specifies the preferred
degree of parallelism to use in the Embedded Python Execution job. The value may be one of
the following:
• A positive integer greater than or equal to 1 for a specific degree of parallelism
• False, None, or 0 for no parallelism
• True for the default data parallelism
The graphics argument is a boolean that specifies whether to look for images. The default
value is False.

With the **kwargs parameter, you can pass additional arguments to the func function. Special
control arguments, which start with oml_, are not passed to the function specified by func, but
instead control what happens before or after the running of the function.
See Also: About Special Control Arguments
The oml.row_apply function returns a pandas.DataFrame or a list of
oml.embed.data_image._DataImage objects. If no image is rendered in the user-defined
Python function, oml.row_apply returns a pandas.DataFrame. Otherwise, it returns a list of
oml.embed.data_image._DataImage objects.

See Also: About Output


Example 10-9 Using the oml.row_apply Function
This example creates the x and y variables using the iris data set. It then creates the persistent
database table IRIS and the oml.DataFrame object oml_iris as a proxy for the table.

The example builds a regression model based on iris data. It defines a function that predicts
the Petal_Width values based on the Sepal_Length, Sepal_Width, and Petal_Length columns
of the input data. It then concatenates the Species column, the Petal_Width column, and the
predicted Petal_Width as the object to return. Finally, the example calls the oml.row_apply
function to apply the make_pred() function on each 4-row chunk of the input data.

import oml
import pandas as pd
from sklearn import datasets
from sklearn import linear_model

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                          {0: 'setosa', 1: 'versicolor',
                           2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

# Drop the IRIS database table if it exists.
try:
    oml.drop('IRIS')
except:
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Build a regression model to predict Petal_Width using in-memory
# data.
iris = oml_iris.pull()
regr = linear_model.LinearRegression()
regr.fit(iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length']],
         iris[['Petal_Width']])
regr.coef_

# Define a Python function.
def make_pred(dat, regr):
    import pandas as pd
    import numpy as np
    pred = regr.predict(dat[['Sepal_Length',
                             'Sepal_Width',
                             'Petal_Length']])
    return pd.concat([dat[['Species', 'Petal_Width']],
                      pd.DataFrame(pred,
                                   columns=['Pred_Petal_Width'])],
                     axis=1)

input_data = oml_iris.split(ratio=(0.9, 0.1), strata_cols='Species')[1]
input_data.crosstab(index = 'Species').sort_values('Species')

res = oml.row_apply(input_data, rows=4, func=make_pred,
                    regr=regr, parallel=2)
type(res)
res

Listing for This Example

>>> import oml


>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn import linear_model
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> # Drop the IRIS database table if it exists.
... try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
>>> oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')


>>>
>>> # Build a regression model to predict Petal_Width using in-memory
... # data.
... iris = oml_iris.pull()
>>> regr = linear_model.LinearRegression()
>>> regr.fit(iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length']],
... iris[['Petal_Width']])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
>>> regr.coef_
array([[-0.20726607, 0.22282854, 0.52408311]])
>>>
>>> # Define a Python function.
... def make_pred(dat, regr):
... import pandas as pd
... import numpy as np
... pred = regr.predict(dat[['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length']])
... return pd.concat([dat[['Species', 'Petal_Width']],
... pd.DataFrame(pred,
... columns=['Pred_Petal_Width'])],
... axis=1)
>>>
>>> input_data = oml_iris.split(ratio=(0.9, 0.1), strata_cols='Species')[1]
>>> input_data.crosstab(index = 'Species').sort_values('Species')
SPECIES count
0 setosa 7
1 versicolor 8
2 virginica 4
>>> res = oml.row_apply(input_data, rows=4, func=make_pred,
...                     regr=regr, parallel=2)
>>> type(res)
<class 'pandas.core.frame.DataFrame'>
>>> res
Species Petal_Width Pred_Petal_Width
0 setosa 0.4 0.344846
1 setosa 0.3 0.335509
2 setosa 0.2 0.294117
3 setosa 0.2 0.220982
4 setosa 0.2 0.080937
5 versicolor 1.5 1.504615
6 versicolor 1.3 1.560570
7 versicolor 1.0 1.008352
8 versicolor 1.3 1.131905
9 versicolor 1.3 1.215622
10 versicolor 1.3 1.272388
11 virginica 1.8 1.623561
12 virginica 1.8 1.878132


10.4.6 Run a User-Defined Python Function Multiple Times


Use the oml.index_apply function to run a Python function multiple times in Python engines
spawned by the database environment.
The syntax of the function is the following:

oml.index_apply(times, func, func_owner=None, parallel=None, graphics=False,
                **kwargs)

The times argument is an int that specifies the number of times to run the func function.

The func argument is the function to run. It may be one of the following:

• A Python function
• A string that is the name of a user-defined Python function in the OML4Py script repository
• A string that defines a Python function
• An oml.script.script.Callable object returned by the oml.script.load function
The optional func_owner argument is a string or None (the default) that specifies the owner of
the registered user-defined Python function when argument func is a registered user-defined
Python function name.
The parallel argument is a boolean, an int, or None (the default) that specifies the preferred
degree of parallelism to use in the Embedded Python Execution job. The value may be one of
the following:
• A positive integer greater than or equal to 1 for a specific degree of parallelism
• False, None, or 0 for no parallelism
• True for the default data parallelism
The graphics argument is a boolean that specifies whether to look for images. The default
value is False.

With the **kwargs parameter, you can pass additional arguments to the func function. Special
control arguments, which start with oml_, are not passed to the function specified by func, but
instead control what happens before or after the running of the function.
See Also: About Special Control Arguments
The oml.index_apply function returns a list of Python objects or a list of
oml.embed.data_image._DataImage objects. If no image is rendered in the user-defined
Python function, oml.index_apply returns a list of the Python objects returned by the user-
defined Python function. Otherwise, it returns a list of oml.embed.data_image._DataImage
objects.
See Also: About Output
Example 10-10 Using the oml.index_apply Function
This example defines a function that computes the mean of a set of random numbers and uses
oml.index_apply to run it the specified number of times.

import oml
import pandas as pd


def compute_random_mean(index):
    import numpy as np
    import scipy
    from statistics import mean
    np.random.seed(index)
    res = np.random.random((100,1))*10
    return mean(res[1])

res = oml.index_apply(times=10, func=compute_random_mean)
type(res)
res

Listing for This Example

>>> import oml


>>> import pandas as pd
>>>
>>> def compute_random_mean(index):
... import numpy as np
... import scipy
... from statistics import mean
... np.random.seed(index)
... res = np.random.random((100,1))*10
... return mean(res[1])
...
>>> res = oml.index_apply(times=10, func=compute_random_mean)
>>> type(res)
<class 'list'>
>>> res
[7.203244934421581, 0.25926231827891333, 7.081478226181048,
5.4723224917572235, 8.707323061773764, 3.3197980530117723,
7.7991879224011464, 9.68540662820932, 5.018745921487388,
0.207519493594015]

10.4.7 Save and Manage User-Defined Python Functions in the Script Repository
The OML4Py script repository stores user-defined Python functions for use with Embedded
Python Execution functions.

Note:
The user-defined Python functions can be used outside of Embedded Python
Execution. You can store functions and reload them back into notebooks or other
user-defined functions.

The script repository is a component of the Embedded Python Execution functionality.


The following topics describe the script repository and the Python functions for managing user-
defined Python functions:


Topics:
• About the Script Repository
Use these functions to store, manage, and use user-defined Python functions in the script
repository.
• Create and Store a User-Defined Python Function
Use the oml.script.create function to add a user-defined Python function to the script
repository.
• List Available User-Defined Python Functions
Use the oml.script.dir function to list the user-defined Python functions in the OML4Py
script repository.
• Load a User-Defined Python Function
Use the oml.script.load function to load a user-defined Python function from the script
repository into a Python session.
• Drop a User-Defined Python Function from the Repository
Use the oml.script.drop function to remove a user-defined Python function from the
script repository.

10.4.7.1 About the Script Repository


Use these functions to store, manage, and use user-defined Python functions in the script
repository.
The following table lists the Python functions for the script repository.

Function Description
oml.script.create Registers a single user-defined Python function in the script repository.
oml.script.dir Lists the user-defined Python functions present in the script repository.
oml.script.drop Drops a user-defined Python function from the script repository.
oml.script.load Loads a user-defined Python function from the script repository into a
Python session.

The following table lists the Python functions for managing access to user-defined Python
functions in the script repository, and to datastores and datastore objects.

Function Description
oml.grant Grants read privilege permission to another user to a datastore or user-
defined Python function owned by the current user.
oml.revoke Revokes the read privilege permission that was granted to another user to a
datastore or user-defined Python function owned by the current user.
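
As a brief sketch only: the calls below assume that oml.grant and oml.revoke take the object
name, an object type, and the grantee user. The parameter names, the script name build_lm1
(created in Example 10-11 later in this section), and the user OMLUSER2 are illustrative
assumptions; see the oml.grant and oml.revoke reference for the exact signatures.

import oml

# Grant another user read access to a user-defined Python function stored
# in the script repository, then revoke that access again.
# The typ value is assumed to distinguish script repository objects from
# datastores; the names used here are placeholders.
oml.grant(name="build_lm1", typ="pyqscript", user="OMLUSER2")
oml.revoke(name="build_lm1", typ="pyqscript", user="OMLUSER2")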

10.4.7.2 Create and Store a User-Defined Python Function


Use the oml.script.create function to add a user-defined Python function to the script
repository.
With the oml.script.create function, you can store a single user-defined Python function in
the OML4Py script repository. You can then specify the user-defined Python function as the
func argument to the Embedded Python Execution functions oml.do_eval, oml.group_apply,
oml.index_apply, oml.row_apply, and oml.table_apply.


You can make the user-defined Python function either private or global. A private user-defined
Python function is available only to the owner, unless the owner grants the read privilege to
other users. A global user-defined Python function is available to any user.
The syntax of oml.script.create is the following:

oml.script.create(name, func, is_global=False, overwrite=False)

The name argument is a string that specifies a name for the user-defined Python function in the
Python script repository.
The func argument is the Python function to run. The argument can be a Python function or a
string that contains the definition of a Python function. You must specify a string in an
interactive session if readline cannot get the command history.

The is_global argument is a boolean that specifies whether to create a global user-defined
Python function. The default value is False, which indicates that the user-defined Python
function is a private function available only to the current session user. When is_global is
True, it specifies that the function is global and every user has the read privilege and the
execute privilege to it.
The overwrite argument is a boolean that specifies whether to overwrite the user-defined
Python function if it already exists. The default value is False.

Example 10-11 Using the oml.script.create Function


This example stores two user-defined Python functions in the script repository. It then lists the
contents of the script repository using different arguments to the oml.script.dir function.

Load the iris data set from sklearn.datasets and create a pandas.DataFrame for it. Use the
oml.create function to create the IRIS database table and the proxy object for the table.

%python

from sklearn import datasets


import pandas as pd
import oml

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()

# Create objects containing data for the user-defined functions to use.


x = pd.DataFrame(iris.data,
                 columns =
                 ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                          {0: 'setosa', 1: 'versicolor', 2:'virginica'}[x],
                          iris.target)),
                 columns = ['Species'])

# Create the IRIS database table and the proxy object for the table.
try:
    oml.drop(table="IRIS")
except:
    pass

oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

Create a user-defined function build_lm1 and use the oml.script.create function to store it in
the OML4Py script repository. The parameter "build_lm1" is a string that specifies the name
of the user-defined function. The parameter func=build_lm1 specifies the Python function to run.
Then run the user-defined Python function in embedded Python execution.

%python

# Define a function.

build_lm1 = '''def build_lm1(dat):


from sklearn import linear_model
regr = linear_model.LinearRegression()
import pandas as pd
dat = pd.get_dummies(dat, drop_first=True)
X = dat[["Sepal_Width", "Petal_Length", "Petal_Width",
"Species_versicolor", "Species_virginica"]]
y = dat[["Sepal_Length"]]
regr.fit(X, y)
return regr'''

# Create a private user-defined Python function.


oml.script.create("build_lm1", func=build_lm1, overwrite=True)

# Run the user-defined Python function in embedded Python execution


res = oml.table_apply(oml_iris, func="build_lm1",
oml_input_type="pandas.DataFrame")

res
res.coef_

The output is the following:

array([[ 0.49588894, 0.82924391, -0.31515517, -0.72356196, -1.02349781]])

Define another user-defined function, build_lm2, and store it as a global script in the
OML4Py script repository. Then run the user-defined Python function in embedded Python
execution.

%python

# Define another function

build_lm2 = '''def build_lm2(dat):


from sklearn import linear_model
regr = linear_model.LinearRegression()
X = dat[["Petal_Width"]]
y = dat[["Petal_Length"]]
regr.fit(X, y)
return regr'''

# Save the function as a global script to the script repository, overwriting
# any existing function with the same name.


oml.script.create("build_lm2", func=build_lm2, is_global=True,
overwrite=True)

res = oml.table_apply(oml_iris, func="build_lm2",


oml_input_type="pandas.DataFrame")
res

The output is the following:

LinearRegression()

List the user-defined Python functions in the script repository available to the current user only.

%python

oml.script.dir()

The output is similar to the following:

name ... date


0 build_lm1 ... 2022-12-15 19:02:44
1 build_mod ... 2022-12-12 23:02:31
2 myFitMultiple ... 2022-12-14 22:30:43
3 sample_iris_table ... 2022-12-14 22:21:24

[4 rows x 4 columns]

List all of the user-defined Python functions available to the current user.

%python

oml.script.dir(sctype='all')

The output is similar to the following:

owner ... date


0 PYQSYS ... 2022-02-11 06:06:44
1 PYQSYS ... 2022-10-19 16:59:50
2 PYQSYS ... 2022-10-19 16:59:52
3 PYQSYS ... 2022-10-19 16:59:53

List the user-defined Python functions available to all users.

%python

oml.script.dir(sctype='global')

The output is similar to the following:

name ... date


0 GLBLM ... 2022-02-11 06:06:44
1 RandomRedDots ... 2022-10-19 16:59:50
2 RandomRedDots2 ... 2022-10-19 16:59:52
3 RandomRedDots3 ... 2022-10-19 16:59:53
4 TEST ... 2021-08-13 17:37:02
5 TEST4 ... 2021-08-13 17:42:49
6 TEST_FUN ... 2021-08-13 22:38:54

10.4.7.3 List Available User-Defined Python Functions


Use the oml.script.dir function to list the user-defined Python functions in the OML4Py
script repository.
The syntax of the oml.script.dir function is the following:

oml.script.dir(name=None, regex_match=False, sctype='user')

The name argument is a string that specifies the name of a user-defined Python function or a
regular expression to match to the names of user-defined Python functions in the script
repository. When name is None, this function returns the type of user-defined Python functions
specified by argument sctype.

The regex_match argument is a boolean that indicates whether argument name is a regular
expression to match. The default value is False.

The sctype argument is a string that specifies the type of user-defined Python function to list.
The value may be one of the following.
• user, to specify the user-defined Python functions available to the current user only.
• grant, to specify the user-defined Python functions to which the read and execute privilege
have been granted by the current user to other users.
• granted, to specify the user-defined Python functions to which the read and execute
privilege have been granted by other users to the current user.
• global, to specify all of the global user-defined Python functions created by the current
user.
• all, to specify all of the user-defined Python functions available to the current user.
The oml.script.dir function returns a pandas.DataFrame that contains the columns NAME
and SCRIPT and, optionally, the columns OWNER and GRANTEE.
Example 10-12 Using the oml.script.dir Function
This example lists the contents of the script repository using different arguments to the
oml.script.dir function. For the creation of the user-defined Python functions, see
Example 10-11.

import oml

# List the user-defined Python functions in the script


# repository available to the current user only.

oml.script.dir()

# List all of the user-defined Python functions available


# to the current user.
oml.script.dir(sctype='all')

# List the user-defined Python functions available to all users.


oml.script.dir(sctype='global')

# List the user-defined Python functions that contain the letters


# BL and that are available to all users.
oml.script.dir(name="BL", regex_match=True, sctype='all')

Listing for This Example

>>> import oml


>>>
>>> # List the user-defined Python functions in the script
... # repository available to the current user only.
... oml.script.dir()
NAME SCRIPT
0 MYLM def build_lm1(dat):\n from sklearn import l...
>>>
>>> # List all of the user-defined Python functions available
... # to the current user.
... oml.script.dir(sctype='all')
OWNER NAME SCRIPT
0 PYQSYS GLBLM def build_lm2(dat):\n from sklearn import l...
1 OML_USER MYLM def build_lm1(dat):\n from sklearn import l...
>>>
>>> # List the user-defined Python functions available to all users.
>>> oml.script.dir(sctype='global')
NAME SCRIPT
0 GLBLM def build_lm2(dat):\n from sklearn import l...
>>>
>>> # List the user-defined Python functions that contain the letters
... # BL and that are available to all users.
... oml.script.dir(name="BL", regex_match=True, sctype='all')
OWNER NAME SCRIPT
0 PYQSYS GLBLM def build_lm2(dat):\n from sklearn import l...

10.4.7.4 Load a User-Defined Python Function


Use the oml.script.load function to load a user-defined Python function from the script
repository into a Python session.
The syntax of the function is the following:

oml.script.load(name, owner=None)

The name argument is a string that specifies the name of the user-defined Python function to
load from the OML4Py script repository.

The optional owner argument is a string that specifies the owner of the user-defined Python
function or None (the default). If owner=None, then this function finds and loads the user-defined
Python function that matches name in the following order:

1. A user-defined Python function that the current user created.


2. A global user-defined Python function that was created by another user.
The oml.script.load function returns an oml.script.script.Callable object that references
the named user-defined Python function.
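
For example, the following minimal sketch shows loading with and without the owner argument; the
owner name OML_USER is illustrative.

import oml

# With owner=None (the default), a function named MYLM created by the
# current user is found first; otherwise a matching global function is used.
MYLM = oml.script.load(name="MYLM")

# Specify owner explicitly to load a particular user's function.
MYLM2 = oml.script.load(name="MYLM", owner="OML_USER")
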
Example 10-13 Using the oml.script.load Function
This example loads user-defined Python functions from the script repository and pulls them to
the local Python session. For the creation of the user-defined Python functions, see
Example 10-11.

import oml

# Load the MYLM and GLBLM user-defined Python functions.


MYLM = oml.script.load(name="MYLM")
GMYLM = oml.script.load(name="GLBLM")

# Pull the models to the local Python session.


MYLM(oml_iris.pull()).coef_
GMYLM(oml_iris.pull())

Listing for This Example

>>> import oml


>>>
>>> # Load the MYLM and GLBLM user-defined Python functions.
>>> MYLM = oml.script.load(name="MYLM")
>>> GMYLM = oml.script.load(name="GLBLM")
>>>
>>> # Pull the models to the local Python session.
... MYLM(oml_iris.pull()).coef_
array([[ 0.49588894, 0.82924391, -0.31515517, -0.72356196, -1.02349781]])
>>> GMYLM(oml_iris.pull())
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,
normalize=False)

10.4.7.5 Drop a User-Defined Python Function from the Repository


Use the oml.script.drop function to remove a user-defined Python function from the script
repository.
The oml.script.drop function drops a user-defined Python function from the OML4Py script
repository.
The syntax of the function is the following:

oml.script.drop(name, is_global=False, silent=False)

The name argument is a string that specifies the name of the user-defined Python function in
the script repository.

The is_global argument is a boolean that specifies whether the user-defined Python function
to drop is a global or a private user-defined Python function. The default value is False, which
indicates a private user-defined Python function.
The silent argument is a boolean that specifies whether to display an error message when
oml.script.drop encounters an error in dropping the specified user-defined Python function.
The default value is False.

Example 10-14 Using the oml.script.drop Function


This example drops two user-defined Python functions from the script repository: the private
function MYLM and the global function GLBLM. For the creation of the user-defined Python
functions, see Example 10-11.

import oml

# List the available user-defined Python functions.


oml.script.dir(sctype="all")

# Drop the private user-defined Python function.


oml.script.drop("MYLM")

# Drop the global user-defined Python function.


oml.script.drop("GLBLM", is_global=True)

# List the available user-defined Python functions again.


oml.script.dir(sctype="all")

Listing for This Example

>>> import oml


>>>
>>> # List the available user-defined Python functions.
... oml.script.dir(sctype="all")
OWNER NAME SCRIPT
0 PYQSYS GLBLM def build_lm2(dat):\n from sklearn import lin...
1 OML_USER MYLM def build_lm1(dat):\n from sklearn import lin...
>>>
>>> # Drop the private user-defined Python function.
... oml.script.drop("MYLM")
>>>
>>> # Drop the global user-defined Python function.
... oml.script.drop("GLBLM", is_global=True)
>>>
>>> # List the available user-defined Python functions again.
... oml.script.dir(sctype="all")
Empty DataFrame
Columns: [OWNER, NAME, SCRIPT]
Index: []

10.5 SQL API for Embedded Python Execution with On-premises Database

The SQL API for Embedded Python Execution with On-premises Database provides SQL interfaces for
Embedded Python Execution and for datastore and script repository management.
The following topics describe the OML4Py SQL interfaces for Embedded Python Execution.

Topics:
• About the SQL API for Embedded Python Execution with On-Premises Database
With the SQL API, you can run user-defined Python functions in one or more separate
Python engines in an Oracle database environment, manage user-defined Python
functions in the OML4Py script repository, and control access to and get information about
datastores and about user-defined Python functions in the script repository.
• pyqEval Function (On-Premises Database)
This topic describes the pyqEval function when used in an on-premises Oracle Database.
The pyqEval function runs a user-defined Python function that explicitly retrieves data or
for which external data is to be automatically loaded for the function.
• pyqTableEval Function (On-Premises Database)
This topic describes the pyqTableEval function when used in an on-premises Oracle
Database. The pyqTableEval function runs a user-defined Python function on data from an
Oracle Database table.
• pyqRowEval Function (On-Premises Database)
This topic describes the pyqRowEval function when used in an on-premises Oracle
Database. The pyqRowEval function chunks data into sets of rows and then runs a user-
defined Python function on each chunk.
• pyqGroupEval Function (On-Premises Database)
This topic describes the pyqGroupEval function when used in an on-premises Oracle
Database. The pyqGroupEval function groups data by one or more columns and runs a
user-defined Python function on each group.
• pyqGrant Function (On-Premises Database)
This topic describes the pyqGrant function when used in an on-premises Oracle Database.
• pyqRevoke Function (On-Premises Database)
This topic describes the pyqRevoke function when used in an on-premises Oracle
Database.
• pyqScriptCreate Procedure (On-Premises Database)
This topic describes the pyqScriptCreate procedure in an on-premises Oracle Database.
The pyqScriptCreate procedure creates a user-defined Python function and adds it to the
OML4Py script repository.
• pyqScriptDrop Procedure (On-Premises Database)
This topic describes the pyqScriptDrop procedure in an on-premises Oracle Database.
The pyqScriptDrop procedure removes a user-defined Python function from the OML4Py
script repository.

10.5.1 About the SQL API for Embedded Python Execution with On-Premises Database
With the SQL API, you can run user-defined Python functions in one or more separate Python
engines in an Oracle database environment, manage user-defined Python functions in the
OML4Py script repository, and control access to and get information about datastores and
about user-defined Python functions in the script repository.
You can use the SQL interface for Embedded Python Execution with an on-premises Oracle
Database instance.
OML4Py provides the following types of SQL functions and procedures.
• SQL table functions for running user-defined Python functions in one or more database-
spawned and managed Python engines; the user-defined Python functions may reference
Python objects in OML4Py datastores and use third-party packages installed in the Python
engines on the database server machine.
• PL/SQL procedures for creating and dropping user-defined Python functions in the
OML4Py script repository.
• PL/SQL procedures for granting and revoking the read privilege to datastores and the
datastore objects in them, and to user-defined Python functions in the OML4Py script
repository.
The following table lists the SQL functions for Embedded Python Execution and the PL/SQL
procedures for managing datastores and user-defined Python functions.

Function or Procedure Description


pyqEval function Runs a user-defined Python function on the data
passed in.
pyqGroupEval function Groups data by one or more columns and runs a
user-defined Python function on each group.
pyqTableEval function Runs a user-defined Python function on data in the
database.
pyqRowEval function Runs the specified number of rows in each
invocation of the user-defined Python function in
parallel processes.
pyqGrant procedure Grants the read privilege to another user to a user-
defined Python function owned by the current user.
pyqRevoke procedure Revokes the read privilege that was granted to
another user to a user-defined Python function
owned by the current user.
pyqScriptCreate procedure Creates a user-defined Python function in the script
repository.
pyqScriptDrop procedure Drops a user-defined Python function from the
script repository.

10.5.2 pyqEval Function (On-Premises Database)


This topic describes the pyqEval function when used in an on-premises Oracle Database. The
pyqEval function runs a user-defined Python function that explicitly retrieves data or for which
external data is to be automatically loaded for the function.

You can pass arguments to the Python function with the PAR_QRY parameter.

The pyqEval function does not automatically receive any data from the database. The Python
function generates the data that it uses or it explicitly retrieves it from a data source such as
Oracle Database, other databases, or flat files.
The Python function can return a boolean, a dict, a float, an int, a list, a str, a tuple or a
pandas.DataFrame object. You define the form of the returned value with the OUT_QRY
parameter.

Syntax

pyqEval (
PAR_QRY VARCHAR2 IN
OUT_QRY VARCHAR2 IN
EXP_NAM VARCHAR2 IN)

Parameters

Parameter Description
PAR_QRY A JSON string that contains additional parameters to pass to the user-defined
Python function specified by the EXP_NAM parameter. Special control
arguments, which start with oml_, are not passed to the function specified by
EXP_NAM, but instead control what happens before or after the invocation of the
function.
For example, to specify the input data type as pandas.DataFrame, use:
'{"oml_input_type":"pandas.DataFrame"}'
OUT_QRY The format of the output returned by the function. It can be one of the
following:
• A JSON string that specifies the column names and data types of the table
returned by the function. Any image data is discarded.
• The name of a table or view to use as a prototype. If using a table or view
owned by another user, use the format <owner name>.<table/view
name>. You must have read access to the specified table or view.
• The string 'XML', which specifies that the table returned contains a CLOB
that is an XML string. The XML can contain both structured data and
images, with structured or semi-structured Python objects first, followed by
the image or images generated by the Python function.
• The string 'PNG', which specifies that the table returned contains a BLOB
that has the image or images generated by the Python function. Images
are returned as a base 64 encoding of the PNG representation.
EXP_NAM The name of a user-defined Python function in the OML4Py script repository.

Returns
Function pyqEval returns a table that has the structure specified by the OUT_QRY parameter
value.
Example 10-15 Using the pyqEval Function
This example defines Python functions and stores them in the OML4Py script repository. It
invokes the pyqEval function on the user-defined Python functions.

In a PL/SQL block, create an unnamed Python function that is stored in the script repository with
the name pyqFun1.

BEGIN
sys.pyqScriptCreate('pyqFun1', 'func = lambda: "Hello World from a
lambda!"',
FALSE, TRUE); -- V_GLOBAL, V_OVERWRITE
END;
/

Invoke the pyqEval function, which runs the user-defined Python function and returns the
results as XML.

SELECT name, value


FROM table(pyqEval(
NULL,
'XML',
'pyqFun1'));

The output is the following.

NAME VALUE
---- --------------------------------------------------
<root><str>Hello World from a lambda!</str></root>

Drop the user-defined Python function.

BEGIN
sys.pyqScriptDrop('pyqFun1');
END;
/

Define a Python function that returns a numpy.ndarray and store it in the script repository with
the name pyqFun2.

BEGIN
sys.pyqScriptCreate('pyqFun2',
'def return_frame():
import numpy as np
import pickle
z = np.array([y for y in zip([str(x)+"demo" for x in range(10)],
[float(x)/10 for x in range(10)],
[x for x in range(10)],
[bool(x%2) for x in range(10)],
[pickle.dumps(x) for x in range(10)],
["test"+str(x**2) for x in range(10)])],
dtype=[("a", "U10"), ("b", "f8"), ("c", "i4"),
("d", "?"), ("e", "S20"), ("f", "O")])
return z');
END;
/

Invoke the pyqEval function, which runs the pyqFun2 user-defined Python function.

SELECT *
FROM table(pyqEval(
NULL,
'{"A":"varchar2(10)", "B":"number",
"C":"number", "D":"number",
"E":"raw(10)", "F": "varchar2(10)" }',
'pyqFun2'));

The output is the following.

A B C D E F
---------- ---------- ---------- ---------- -------------------- ----------
0demo 0 0 0 80034B002E test0
1demo 1.0E-001 1 1 80034B012E test1
2demo 2.0E-001 2 0 80034B022E test4
3demo 3.0E-001 3 1 80034B032E test9
4demo 4.0E-001 4 0 80034B042E test16
5demo 5.0E-001 5 1 80034B052E test25
6demo 6.0E-001 6 0 80034B062E test36
7demo 7.0E-001 7 1 80034B072E test49
8demo 8.0E-001 8 0 80034B082E test64
9demo 9.0E-001 9 1 80034B092E test81

10 rows selected.

Drop the user-defined Python function.

BEGIN
sys.pyqScriptDrop('pyqFun2');
END;
/

10.5.3 pyqTableEval Function (On-Premises Database)


This topic describes the pyqTableEval function when used in an on-premises Oracle
Database. The pyqTableEval function runs a user-defined Python function on data from an
Oracle Database table.
You pass data to the Python function with the INP_NAM parameter. You can pass arguments to
the Python function with the PAR_QRY parameter.

The Python function can return a boolean, a dict, a float, an int, a list, a str, a tuple or a
pandas.DataFrame object. You define the form of the returned value with the OUT_QRY
parameter.

Syntax

pyqTableEval (
INP_NAM VARCHAR2 IN
PAR_QRY VARCHAR2 IN
OUT_QRY VARCHAR2 IN
EXP_NAM VARCHAR2 IN)

Parameters

Parameter Description
INP_NAM The name of a table or view that specifies the data to pass to the Python
function specified by the EXP_NAM parameter. If using a table or view owned
by another user, use the format <owner name>.<table/view name>. You
must have read access to the specified table or view.
PAR_QRY A JSON string that contains additional parameters to pass to the user-defined
Python function specified by the EXP_NAM parameter. Special control
arguments, which start with oml_, are not passed to the function specified by
EXP_NAM, but instead control what happens before or after the invocation of
the function.
For example, to specify the input data type as pandas.DataFrame, use:
'{"oml_input_type":"pandas.DataFrame"}'
OUT_QRY The format of the output returned by the function. It can be one of the
following:
• A JSON string that specifies the column names and data types of the
table returned by the function. Any image data is discarded.
• The name of a table or view to use as a prototype. If using a table or view
owned by another user, use the format <owner name>.<table/view
name>. You must have read access to the specified table or view.
• The string 'XML', which specifies that the table returned contains a
CLOB that is an XML string. The XML can contain both structured data
and images, with structured or semi-structured Python objects first,
followed by the image or images generated by the Python function.
• The string 'PNG', which specifies that the table returned contains a
BLOB that has the image or images generated by the Python function.
Images are returned as a base 64 encoding of the PNG representation.
EXP_NAM The name of a user-defined Python function in the OML4Py script repository.

Returns
Function pyqTableEval returns a table that has the structure specified by the OUT_QRY
parameter value.
Example 10-16 Using the pyqTableEval Function
This example stores a user-defined Python function in the OML4Py script repository with the
name create_iris_table. It uses the function to create a database table as the result of a
pyqEval function invocation. It creates another user-defined Python function that fits a linear
regression model to the input data and saves the model in the OML4Py datastore. The
example runs a SQL SELECT statement that invokes the pyqTableEval function, which invokes
the function stored in the script repository with the name myLinearRegressionModel.
In a PL/SQL block, define the Python function create_iris_table and store it in the script
repository with the name create_iris_table, overwriting any existing user-defined Python
function stored in the script repository with the same name.

The create_iris_table function imports and loads the iris data set, creates two
pandas.DataFrame objects, and then returns the concatenation of those objects.

BEGIN
sys.pyqScriptCreate('create_iris_table',
'def create_iris_table():
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
x = pd.DataFrame(iris.data, columns = ["Sepal_Length",\
"Sepal_Width", "Petal_Length", "Petal_Width"])
y = pd.DataFrame(list(map(lambda x: {0:"setosa", 1: "versicolor",\
2: "virginica"}[x], iris.target)),\
columns = ["Species"])
return pd.concat([y, x], axis=1)',
FALSE, TRUE); -- V_GLOBAL, V_OVERWRITE
END;
/
CREATE TABLE IRIS AS
(SELECT * FROM pyqEval(
NULL,
'{"Species":"VARCHAR2(10)","Sepal_Length":"number",
"Sepal_Width":"number","Petal_Length":"number",
"Petal_Width":"number"}',
'create_iris_table'
));

Define the Python function fit_model and store it with the name myLinearRegressionModel as
a private function in the script repository, overwriting any existing user-defined Python function
stored with that name.
The fit_model function fits a regression model to the input data dat and then saves the fitted
model as an object specified by the modelName argument to the datastore specified by the
datastoreName argument. The fit_model function returns the fitted model in a string format.

By default, Python objects are saved to a new datastore with the specified datastoreName. To
save an object to an existing datastore, either set the overwrite or append argument to True in
the oml.ds.save invocation.

BEGIN
sys.pyqScriptCreate('myLinearRegressionModel',
'def fit_model(dat, modelName, datastoreName):
import oml
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(dat.loc[:, ["Sepal_Length", "Sepal_Width", \
"Petal_Length"]], dat.loc[:,["Petal_Width"]])
oml.ds.save(objs={modelName:regr}, name=datastoreName,
overwrite=True)
return str(regr)',
FALSE, TRUE);
END;
/

Run a SELECT statement that invokes the pyqTableEval function. The INP_NAM parameter of
the pyqTableEval function specifies the IRIS table as the data to pass to the Python function.
The PAR_QRY parameter specifies the names of the model and datastore to pass to the Python
function, and specifies the oml_connect control argument to establish an OML4Py connection
to the database during the invocation of the user-defined Python function. The OUT_QRY
parameter specifies returning the value in XML format and the EXP_NAM parameter specifies
the myLinearRegressionModel function in the script repository as the Python function to
invoke. The XML output is a CLOB; you can call set long [length] to get more output.

SELECT *
FROM table(pyqTableEval(
'IRIS',
'{"modelName":"linregr",
"datastoreName":"pymodel",
"oml_connect":1}',
'XML',
'myLinearRegressionModel'));

The output is the following:

NAME VALUE
----- ------------------------------------------------------------
<root><str>LinearRegression()</str></root>

10.5.4 pyqRowEval Function (On-Premises Database)


This topic describes the pyqRowEval function when used in an on-premises Oracle Database.
The pyqRowEval function chunks data into sets of rows and then runs a user-defined Python
function on each chunk.
The pyqRowEval function passes the data specified by the INP_NAM parameter to the Python
function specified by the EXP_NAM parameter. You can pass arguments to the Python function
with the PAR_QRY parameter.

The ROW_NUM parameter specifies the maximum number of rows to pass to each invocation of
the Python function. The last set of rows may have fewer rows than the number specified.
The Python function can return a boolean, a dict, a float, an int, a list, a str, a tuple or a
pandas.DataFrame object. You may define the form of the returned value with the OUT_QRY
parameter.

Syntax

pyqRowEval (
INP_NAM VARCHAR2 IN
PAR_QRY VARCHAR2 IN
OUT_QRY VARCHAR2 IN
ROW_NUM NUMBER IN
EXP_NAM VARCHAR2 IN)

Parameters

Parameter Description
INP_NAM The name of a table or view that specifies the data to pass to the Python
function specified by the EXP_NAM parameter. If using a table or view owned
by another user, use the format <owner name>.<table/view name>. You
must have read access to the specified table or view.
PAR_QRY A JSON string that contains additional parameters to pass to the user-defined
Python function specified by the EXP_NAM parameter. Special control
arguments, which start with oml_, are not passed to the function specified by
EXP_NAM, but instead control what happens before or after the invocation of
the function.
For example, to specify the input data type as pandas.DataFrame, use:
'{"oml_input_type":"pandas.DataFrame"}'
OUT_QRY The format of the output returned by the function. It can be one of the
following:
• A JSON string that specifies the column names and data types of the
table returned by the function. Any image data is discarded.
• The name of a table or view to use as a prototype. If using a table or view
owned by another user, use the format <owner name>.<table/view
name>. You must have read access to the specified table or view.
• The string 'XML', which specifies that the table returned contains a
CLOB that is an XML string. The XML can contain both structured data
and images, with structured or semi-structured Python objects first,
followed by the image or images generated by the Python function.
• The string 'PNG', which specifies that the table returned contains a
BLOB that has the image or images generated by the Python function.
Images are returned as a base 64 encoding of the PNG representation.
ROW_NUM The number of rows to include in each invocation of the Python function.
EXP_NAM The name of a user-defined Python function in the OML4Py script repository.

Returns
Function pyqRowEval returns a table that has the structure specified by the OUT_QRY parameter
value.
Example 10-17 Using the pyqRowEval Function
This example loads the Python model linregr to predict row chunks of sample iris data. The
model is created and saved in the datastore pymodel in Example 10-16.
The example defines a Python function and stores it in the OML4Py script repository. It uses
the user-defined Python function to create a database table as the result of the pyqEval
function. It defines a Python function that runs a prediction function on a model loaded from the
OML4Py datastore. It then invokes the pyqRowEval function to run the prediction function on chunks
of rows from the database table.
In a PL/SQL block, define the function sample_iris_table and store it in the script repository.
The function loads the iris data set, creates two pandas.DataFrame objects, and then returns a
sample of the concatenation of those objects.

BEGIN
sys.pyqScriptCreate('sample_iris_table',
'def sample_iris_table(size):

from sklearn.datasets import load_iris


import pandas as pd
iris = load_iris()
x = pd.DataFrame(iris.data, columns = ["Sepal_Length",\
"Sepal_Width","Petal_Length","Petal_Width"])
y = pd.DataFrame(list(map(lambda x: {0:"setosa", 1: "versicolor",\
2: "virginica"}[x], iris.target)),\
columns = ["Species"])
return pd.concat([y, x], axis=1).sample(int(size))',
FALSE, TRUE); -- V_GLOBAL, V_OVERWRITE
END;
/

Create the SAMPLE_IRIS table in the database as the result of a SELECT statement, which
invokes the pyqEval function on the sample_iris_table user-defined Python function saved in
the script repository with the same name. The sample_iris_table function returns an iris data
sample of size size.

CREATE TABLE sample_iris AS


SELECT *
FROM TABLE(pyqEval(
'{"size":20}',
'{"Species":"varchar2(10)","Sepal_Length":"number",
"Sepal_Width":"number","Petal_Length":"number",
"Petal_Width":"number"}',
'sample_iris_table'));

Define the Python function predict_model and store it with the name linregrPredict in the
script repository. The function predicts the data in dat with the Python model specified by the
modelName argument, which is loaded from the datastore specified by the datastoreName
argument. The predictions are concatenated with dat and returned as the result of the function.

BEGIN
sys.pyqScriptCreate('linregrPredict',
'def predict_model(dat, modelName, datastoreName):
import oml
import pandas as pd
objs = oml.ds.load(name=datastoreName, to_globals=False)
pred = objs[modelName].predict(dat[["Sepal_Length","Sepal_Width",\
"Petal_Length"]])
return pd.concat([dat, pd.DataFrame(pred, \
columns=["Pred_Petal_Width"])], axis=1)',
FALSE, TRUE);
END;
/

Run a SELECT statement that invokes the pyqRowEval function, which runs the specified Python
function on each chunk of rows in the specified data set.
The INP_NAM argument specifies the data in the SAMPLE_IRIS table to pass to the Python
function.

The PAR_QRY argument specifies connecting to the OML4Py server with the special control
argument oml_connect, passing the input data as a pandas.DataFrame with the special control
argument oml_input_type, along with values for the function arguments modelName and
datastoreName.

In the OUT_QRY argument, the JSON string specifies the column names and data types of the
table returned by pyqRowEval.

The ROW_NUM argument specifies that five rows are included in each invocation of the function
specified by EXP_NAM.

The EXP_NAM parameter specifies linregrPredict, which is the name in the script repository
of the user-defined Python function to invoke.

SELECT *
FROM table(pyqRowEval(
'SAMPLE_IRIS',
'{"oml_connect":1,"oml_input_type":"pandas.DataFrame",
"modelName":"linregr", "datastoreName":"pymodel"}',
'{"Species":"varchar2(10)", "Sepal_Length":"number",
"Sepal_Width":"number", "Petal_Length":"number",
"Petal_Width":"number","Pred_Petal_Width":"number"}',
5,
'linregrPredict'));

The output is the following:

Species    Sepal_Length Sepal_Width Petal_Length Petal_Width Pred_Petal_Width
---------- ------------ ----------- ------------ ----------- ------------------
versicolor          5.4           3          4.5         1.5   1.66731546068336
versicolor            6         3.4          4.5         1.6   1.63208723397328
setosa              5.5         4.2          1.4         0.2  0.289325450127603
virginica           6.4         3.1          5.5         1.8   2.00641535609046
versicolor          6.1         2.8          4.7         1.2   1.58248012323666
setosa              5.4         3.7          1.5         0.2  0.251046097050724
virginica           7.2           3          5.8         1.6   1.97554457713195
versicolor          6.2         2.2          4.5         1.5   1.32323976658868
setosa              4.8         3.1          1.6         0.2  0.294116926466465
virginica           6.7         3.3          5.7         2.5    2.0936178656911
virginica           7.2         3.6          6.1         2.5   2.26646663788204
setosa                5         3.6          1.4         0.2  0.259261360689759
virginica           6.3         3.4          5.6         2.4   2.14639883810232
virginica           6.1           3          4.9         1.8   1.73186245496453
versicolor          6.1         2.9          4.7         1.4   1.60476297762276
versicolor          5.7         2.8          4.5         1.3   1.56056992978395
virginica           6.4         2.7          5.3         1.9    1.8124673155904
setosa                5         3.5          1.3         0.3  0.184570194825823
versicolor          5.6         2.7          4.2         1.3   1.40178874834007
setosa              4.5         2.3          1.3         0.3 0.0208089790714202

10.5.5 pyqGroupEval Function (On-Premises Database)


This topic describes the pyqGroupEval function when used in an on-premises Oracle
Database. The pyqGroupEval function groups data by one or more columns and runs a user-
defined Python function on each group.
The pyqGroupEval function runs the user-defined Python function specified by the EXP_NAM
parameter. Pass data to the Python function with the INP_NAM parameter. Pass arguments to
the Python function with the PAR_QRY parameter. Specify one or more grouping columns with
the GRP_COL parameter.

The Python function can return a boolean, a dict, a float, an int, a list, a str, a tuple or a
pandas.DataFrame object. Define the form of the returned value with the OUT_QRY parameter.

Syntax

pyqGroupEval (
INP_NAM VARCHAR2 IN
PAR_QRY VARCHAR2 IN
OUT_QRY VARCHAR2 IN
GRP_COL VARCHAR2 IN
EXP_NAM VARCHAR2 IN)

Parameters

Parameter Description
INP_NAM The name of a table or view that specifies the data to pass to the Python
function specified by the EXP_NAM parameter. If using a table or view owned
by another user, use the format <owner name>.<table/view name>. You
must have read access to the specified table or view.
PAR_QRY A JSON string that contains additional parameters to pass to the user-defined
Python function specified by the EXP_NAM parameter. Special control
arguments, which start with oml_, are not passed to the function specified by
EXP_NAM, but instead control what happens before or after the invocation of
the function.
For example, to specify the input data type as pandas.DataFrame, use:
'{"oml_input_type":"pandas.DataFrame"}'

OUT_QRY The format of the output returned by the function. It can be one of the
following:
• A JSON string that specifies the column names and data types of the
table returned by the function. Any image data is discarded.
• The name of a table or view to use as a prototype. If using a table or view
owned by another user, use the format <owner name>.<table/view
name>. You must have read access to the specified table or view.
• The string 'XML', which specifies that the table returned contains a
CLOB that is an XML string. The XML can contain both structured data
and images, with structured or semi-structured Python objects first,
followed by the image or images generated by the Python function.
• The string 'PNG', which specifies that the table returned contains a
BLOB that has the image or images generated by the Python function.
Images are returned as a base 64 encoding of the PNG representation.
GRP_COL The names of the grouping columns by which to partition the data. Use
commas to separate multiple columns. For example, to group by GENDER and
YEAR:
"GENDER,YEAR"
EXP_NAM The name of a user-defined Python function in the OML4Py script repository.

Returns
Function pyqGroupEval returns a table that has the structure specified by the OUT_QRY
parameter value.
Example 10-18 Using the pyqGroupEval Function
This example defines the Python function create_iris_table and stores it with the name
create_iris_table in the OML4Py script repository. It then invokes pyqEval, which invokes the
user-defined Python function and creates the IRIS database table. The example defines another
Python function, group_count, and stores it in the script repository with the name
mygroupcount. It then invokes the pyqGroupEval function, specifying the IRIS table as the input
data, Species as the grouping column, and the Python function saved with the name mygroupcount.
In a PL/SQL block, define the Python function create_iris_table and store it in the script
repository with the name create_iris_table.

BEGIN
sys.pyqScriptCreate('create_iris_table',
'def create_iris_table():
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
x = pd.DataFrame(iris.data, columns = ["Sepal_Length",\
"Sepal_Width","Petal_Length","Petal_Width"])
y = pd.DataFrame(list(map(lambda x: {0:"setosa", 1: "versicolor",\
2: "virginica"}[x], iris.target)),\
columns = ["Species"])
return pd.concat([y, x], axis=1)');
END;
/

Invoke the pyqEval function to create the database table IRIS, using the Python function
stored with the name create_iris_table in the script repository.

CREATE TABLE IRIS AS


(SELECT * FROM pyqEval(
NULL,
'{"Species":"VARCHAR2(10)","Sepal_Length":"number",
"Sepal_Width":"number","Petal_Length":"number",
"Petal_Width":"number"}',
'create_iris_table'
));

Define the Python function group_count and store it with the name mygroupcount in the script
repository. The function returns a pandas.DataFrame generated on each group of data dat.

BEGIN
sys.pyqScriptCreate('mygroupcount',
'def group_count(dat):
import pandas as pd
return pd.DataFrame([(dat["Species"][0], dat.shape[0])],\
columns = ["Species", "CNT"]) ');
END;
/

Issue a query that invokes the pyqGroupEval function. In the function, the INP_NAM argument
specifies the data in the IRIS table to pass to the function.
The PAR_QRY argument specifies the special control argument oml_input_type.

The OUT_QRY argument specifies a JSON string that contains the column names and data types
of the table returned by pyqGroupEval.

The GRP_COL parameter specifies the column to group by.

The EXP_NAM parameter specifies the user-defined Python function stored with the name
mygroupcount in the script repository.

SELECT *
FROM table(
pyqGroupEval(
'IRIS',
'{"oml_input_type":"pandas.DataFrame"}',
'{"Species":"varchar2(10)", "CNT":"number"}',
'Species',
'mygroupcount'));

The output is the following.

Species CNT
---------- ----------
setosa 50
versicolor 50
virginica 50

10.5.6 pyqGrant Function (On-Premises Database)


This topic describes the pyqGrant function when used in an on-premises Oracle Database.

The pyqGrant function grants read privilege access to an OML4Py datastore or to a script in
the OML4Py script repository.

Syntax

pyqGrant (
V_NAME VARCHAR2 IN
V_TYPE VARCHAR2 IN
V_USER VARCHAR2 IN DEFAULT)

Parameters

Parameter Description
V_NAME The name of an OML4Py datastore or a script in the OML4Py script repository.
V_TYPE For a datastore, the type is datastore; for a script, the type is pyqScript.
V_USER The name of the user to whom to grant access.

Example 10-19 Granting Read Access to a script

-- Grant read privilege access to Scott.


BEGIN
pyqGrant('pyqFun1', 'pyqscript', 'SCOTT');
END;
/

Example 10-20 Granting Read Access to a datastore

-- Grant read privilege access to datastore ds1 to SCOTT.


BEGIN
pyqGrant('ds1', 'datastore', 'SCOTT');
END;
/

Example 10-21 Granting Read Access to a Script to all Users

-- Grant read privilege access to script pyqFun1 to all users.


BEGIN
pyqGrant('pyqFun1', 'pyqscript', NULL);
END;
/

Example 10-22 Granting Read Access to a datastore to all Users

-- Grant read privilege access to datastore ds1 to all users.


BEGIN
pyqGrant('ds1', 'datastore', NULL);
END;
/

10.5.7 pyqRevoke Function (On-Premises Database)


This topic describes the pyqRevoke function when used in an on-premises Oracle Database.

The pyqRevoke function revokes read privilege access to an OML4Py datastore or to a script in
the OML4Py script repository.

Syntax

pyqRevoke (
V_NAME VARCHAR2 IN
V_TYPE VARCHAR2 IN
V_USER VARCHAR2 IN DEFAULT)

Parameters

Parameter Description
V_NAME The name of an OML4Py datastore or a script in the OML4Py script repository.
V_TYPE For a datastore, the type is datastore; for a script, the type is pyqScript.
V_USER The name of the user from whom to revoke access.

Example 10-23 Revoking Read Access to a script

-- Revoke read privilege access to script pyqFun1 from SCOTT.


BEGIN
pyqRevoke('pyqFun1', 'pyqscript', 'SCOTT');
END;
/

Example 10-24 Revoking Read Access to a datastore

-- Revoke read privilege access to datastore ds1 from SCOTT.


BEGIN
pyqRevoke('ds1', 'datastore', 'SCOTT');
END;
/

Example 10-25 Revoking Read Access to a script from all Users

-- Revoke read privilege access to script pyqFun1 from all users.


BEGIN
pyqRevoke('pyqFun1', 'pyqscript', NULL);
END;
/

Example 10-26 Revoking Read Access to a datastore from all Users

-- Revoke read privilege access to datastore ds1 from all users.


BEGIN
pyqRevoke('ds1', 'datastore', NULL);
END;
/

10.5.8 pyqScriptCreate Procedure (On-Premises Database)


This topic describes the pyqScriptCreate procedure in an on-premises Oracle Database. The
pyqScriptCreate procedure creates a user-defined Python function and adds it to the OML4Py
script repository.
To create a user-defined Python function, you must have the PYQADMIN database role.

Syntax

sys.pyqScriptCreate (
V_NAME VARCHAR2 IN
V_SCRIPT CLOB IN
V_GLOBAL BOOLEAN IN DEFAULT
V_OVERWRITE BOOLEAN IN DEFAULT)

Parameter Description
V_NAME A name for the user-defined Python function in the OML4Py script repository.
V_SCRIPT The definition of the Python function.
V_GLOBAL TRUE specifies that the user-defined Python function is public; FALSE specifies that
the user-defined Python function is private.
V_OVERWRITE If the script repository already has a user-defined Python function with the same
name as V_NAME, then TRUE replaces the content of that user-defined Python
function with V_SCRIPT and FALSE does not replace it.

Example 10-27 Using the pyqScriptCreate Procedure


This example creates a private user-defined Python function named pyqFun2 in the OML4Py
script repository.

BEGIN
sys.pyqScriptCreate('pyqFun2',
'def return_frame():
import numpy as np
import pickle
z = np.array([y for y in zip([str(x)+"demo" for x in range(10)],
[float(x)/10 for x in range(10)],
[x for x in range(10)],
[bool(x%2) for x in range(10)],
[pickle.dumps(x) for x in range(10)],
["test"+str(x**2) for x in range(10)])],
dtype=[("a", "U10"), ("b", "f8"), ("c", "i4"), ("d", "?"),
("e", "S20"), ("f", "O")])
return z');
END;
/

This example creates a global user-defined Python function named pyqFun2 in the script
repository and overwrites any existing user-defined Python function of the same name.

BEGIN
sys.pyqScriptCreate('pyqFun2',
'def return_frame():
import numpy as np
import pickle
z = np.array([y for y in zip([str(x)+"demo" for x in range(10)],
[float(x)/10 for x in range(10)],
[x for x in range(10)],
[bool(x%2) for x in range(10)],
[pickle.dumps(x) for x in range(10)],
["test"+str(x**2) for x in range(10)])],
dtype=[("a", "U10"), ("b", "f8"), ("c", "i4"), ("d", "?"),
("e", "S20"), ("f", "O")])
return z',
TRUE, -- Make the user-defined Python function global.
TRUE); -- Overwrite any global user-defined Python function
-- with the same name.
END;
/

This example creates a private user-defined Python function named create_iris_table in the
script repository.

BEGIN
sys.pyqScriptCreate('create_iris_table',
'def create_iris_table():
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
x = pd.DataFrame(iris.data, columns = ["Sepal_Length",\
"Sepal_Width","Petal_Length","Petal_Width"])
y = pd.DataFrame(list(map(lambda x: {0:"setosa", 1: "versicolor",\
2: "virginica"}[x], iris.target)),\
columns = ["Species"])
return pd.concat([y, x], axis=1)');
END;
/

Display the user-defined Python functions owned by the current user.

SELECT * from USER_PYQ_SCRIPTS;

NAME              SCRIPT
----------------- ---------------------------------------------------------------------
create_iris_table def create_iris_table(): from sklearn.datasets import load_iris ...
pyqFun2           def return_frame(): import numpy as np import pickle ...

Display the user-defined Python functions available to the current user.

SELECT * from ALL_PYQ_SCRIPTS;

OWNER    NAME              SCRIPT
-------- ----------------- --------------------------------------------------------------------
OML_USER create_iris_table "def create_iris_table(): from sklearn.datasets import load_iris ...
OML_USER pyqFun2           "def return_frame(): import numpy as np import pickle ...
PYQSYS   pyqFun2           "def return_frame(): import numpy as np import pickle ...

10.5.9 pyqScriptDrop Procedure (On-Premises Database)


This topic describes the pyqScriptDrop procedure in an on-premises Oracle Database. The
pyqScriptDrop procedure removes a user-defined Python function from the OML4Py script
repository.
To drop a user-defined Python function, you must have the PYQADMIN database role.

Syntax

sys.pyqScriptDrop (
V_NAME VARCHAR2 IN
V_GLOBAL BOOLEAN IN DEFAULT
V_SILENT BOOLEAN IN DEFAULT)

Parameter Description
V_NAME A name for the user-defined Python function in the OML4Py script repository.
V_GLOBAL A BOOLEAN that specifies whether the user-defined Python function to drop is a
global or a private user-defined Python function. The default value is FALSE, which
indicates a private user-defined Python function. TRUE specifies that the user-
defined Python function is public.
V_SILENT A BOOLEAN that specifies whether to display an error message when
sys.pyqScriptDrop encounters an error in dropping the specified user-defined
Python function. The default value is FALSE.

Example 10-28 Using the sys.pyqScriptDrop Procedure


For the creation of the user-defined Python functions dropped in these examples, see
Example 10-27.

This example drops the private user-defined Python function pyqFun2 from the script repository.

BEGIN
sys.pyqScriptDrop('pyqFun2');
END;
/

This example drops the global user-defined Python function pyqFun2 from the script repository.

BEGIN
sys.pyqScriptDrop('pyqFun2', TRUE);
END;
/

10.6 SQL API for Embedded Python Execution with Autonomous Database

The SQL API for Embedded Python Execution with Autonomous Database provides SQL
interfaces for setting authorization tokens, managing access control list (ACL) privileges,
executing Python scripts, and synchronously and asynchronously running jobs.
The following topics describe the SQL API interfaces for Embedded Python Execution.

Topics:
• Access and Authorization Procedures and Functions
Use the network access control lists (ACL) API to control access by users to external
network services and resources from the database. Use the token store API to persist the
authorization token issued by a cloud host so it can be used with subsequent SQL calls.
• Embedded Python Execution Functions (Autonomous Database)
The SQL API for Embedded Python Execution with Autonomous Database functions are
described in the following topics.
• Asynchronous Jobs (Autonomous Database)
When a function is run asynchronously, it's run as a job which can be tracked by using the
pyqJobStatus and pyqJobResult functions.
• Special Control Arguments (Autonomous Database)
Use the PAR_LST parameter to specify special control arguments and additional arguments
to be passed into the Python script.
• Output Formats (Autonomous Database)
The OUT_FMT parameter controls the format of output returned by the table functions
pyqEval, pyqGroupEval, pyqIndexEval, pyqRowEval, pyqTableEval, and pyqJobResult.

10.6.1 Access and Authorization Procedures and Functions


Use the network access control lists (ACL) API to control access by users to external network
services and resources from the database. Use the token store API to persist the authorization
token issued by a cloud host so it can be used with subsequent SQL calls.
Use the following to manage ACL privileges. An ADMIN user is required.

• pyqAppendHostACE Procedure

• pyqGetHostACE Function
• pyqRemoveHostACE Procedure
Use the following to manage authorization tokens:
• pyqSetAuthToken Procedure
• pyqIsTokenSet Function

Workflow
The typical workflow for using the SQL API for Embedded Python Execution with Autonomous
Database is:
1. Connect to PDB as the ADMIN user, and add a normal user OMLUSER to the ACL list of the
cloud host of which the root domain is adb.us-region-1.oraclecloudapps.com:

exec pyqAppendHostAce('OMLUSER','adb.us-region-1.oraclecloudapps.com');

2. The OML REST URLs can be obtained from the Oracle Autonomous Database instance that is
provisioned.
a. Sign into your Oracle Cloud Infrastructure account. You will need your OCI user name
and password.
b. Click the hamburger menu and select the Autonomous Database instance that is
provisioned. For more information on provisioning an Autonomous Database, see
Provision an Oracle Autonomous Database.
c. Click Service Console and then click Development.
d. Scroll down to the Oracle Machine Learning RESTful Services tile and click Copy to
obtain the following URLs for:
• Obtaining the REST authentication token for REST APIs provided by OML:

<oml-cloud-service-location-url>/omlusers/

The URL <oml-cloud-service-location-url> includes the tenancy ID, location, and


database name. For example, https://qtraya2braestch-omldb.adb.us-
sanjose-1.oraclecloudapps.com.
In this example,
• qtraya2braestch is the tenancy ID
• omldb is the database name
• us-sanjose-1 is the datacenter region
• oraclecloudapps.com is the root domain
3. The Oracle Machine Learning REST API uses tokens to authenticate an Oracle Machine
Learning user. To authenticate and obtain an access token, send a POST request to the
Oracle Machine Learning User Management Cloud Service REST endpoint /oauth2/v1/
token with your OML username and password.

curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json'
-d '{"grant_type":"password", "username":"'${username}'", "password":"'${password}'"}'
"<oml-cloud-service-location-url>/omlusers/api/oauth2/v1/token"

The example uses the following values:


• username is the OML username.
• password is the OML user password.
• oml-cloud-service-location-url is a variable containing the REST server portion of
the Oracle Machine Learning User Management Cloud Service instance URL that
includes the tenancy ID, database name, and the location name. You can obtain the
omlserver URL from the Development tab in the Service Console of your Oracle
Autonomous Database instance.

Note:
When a token expires, all calls to the OML Services REST endpoints will return
a message stating that the token has expired, along with the HTTP error:
HTTP/1.1 401 Unauthorized

4. Connect to PDB as OMLUSER, set the access token, and run pyqIndexEval:

exec pyqSetAuthToken('<access token>');


select *
from table(pyqIndexEval(
par_qry => NULL,
out_fmt => '{"ID":"number", "RES":"varchar2(3)"}',
times_num => 3,
scr_name => 'idx_ret_df'));

ID RES
---------- ---
1 a
2 b
3 c

3 rows selected.

The following topics describe the steps to manage ACL privileges and authorization tokens.

Topics:
• pyqAppendHostACE Procedure
The pyqAppendHostACE procedure appends an access control entry (ACE) to the access
control list (ACL) of the cloud host. The ACL controls access to the cloud host from the
database, and the ACE specifies the connect privilege granted to the specified user name.
• pyqGetHostACE Function
The pyqGetHostACE function gets the existing host access control entry (ACE) for the
specified user. An exception is raised if the host ACE doesn't exist for the specified user.
• pyqRemoveHostACE Procedure
• pyqSetAuthToken Procedure
The pyqSetAuthToken procedure sets the access token in the token store.

• pyqIsTokenSet Function
The pyqIsTokenSet function returns whether the authorization token is set or not.

10.6.1.1 pyqAppendHostACE Procedure


The pyqAppendHostACE procedure appends an access control entry (ACE) to the access
control list (ACL) of the cloud host. The ACL controls access to the cloud host from the
database, and the ACE specifies the connect privilege granted to the specified user name.
Syntax

PROCEDURE SYS.pyqAppendHostACE(
username IN VARCHAR2,
host_root_domain IN VARCHAR2
)

Parameter
username - Database user to whom the connect privilege to the cloud host is granted.

host_root_domain - Root domain of the cloud host. For example, if the URL is https://
qtraya2braestch-omldb.adb.us-sanjose-1.oraclecloudapps.com, the root domain of the
cloud host is: adb.us-sanjose-1.oraclecloudapps.com.

Example

exec pyqAppendHostAce('OMLUSER','adb.us-region-1.oraclecloudapps.com');

Note:
OML username is case sensitive

10.6.1.2 pyqGetHostACE Function


The pyqGetHostACE function gets the existing host access control entry (ACE) for the specified
user. An exception is raised if the host ACE doesn't exist for the specified user.

Syntax

FUNCTION sys.pyqGetHostACE(
p_username IN VARCHAR2
)

Parameter
p_username - Database user to look for the host ACE.

Example
If the user OMLUSER has access to the cloud host, for example, ibuwlq4mjqkeils-omlrgpy1.adb.us-
region-1.oraclecloudapps.com, the ADMIN user can run the following to check the user's
privileges:

SQL> set serveroutput on


DECLARE
hostname VARCHAR2(4000);
BEGIN
hostname := pyqGetHostACE('OMLUSER');
DBMS_OUTPUT.put_line ('hostname: ' || hostname);
END;
/
SQL> hostname: ibuwlq4mjqkeils-omlrgpy1.adb.us-region-1.oraclecloudapps.com
PL/SQL procedure successfully completed.

10.6.1.3 pyqRemoveHostACE Procedure


The pyqRemoveHostACE procedure removes the existing host access control entry (ACE) from
the specified username. If an access token was set for the cloud host, the token is also
removed. An exception is raised if the host ACE does not exist.

Syntax

PROCEDURE SYS.pyqRemoveHostACE(
username IN VARCHAR2
)

Parameter
username - Database user from whom the connect privilege to the cloud host is revoked.
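
Example
The following call is a minimal sketch that the ADMIN user could run to revoke the connect
privilege granted in the earlier pyqAppendHostACE example; the OMLUSER account name is
carried over from that example.

exec pyqRemoveHostACE('OMLUSER');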

10.6.1.4 pyqSetAuthToken Procedure


The pyqSetAuthToken procedure sets the access token in the token store.

Syntax

PROCEDURE SYS.pyqSetAuthToken(
access_token IN VARCHAR2
)
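
Parameter
access_token - The authentication token to store, obtained from the Oracle Machine Learning
User Management Cloud Service token endpoint.

Example
The following call is a minimal sketch; replace <access token> with the token string returned by
the token endpoint, as shown in the workflow at the start of this section.

exec pyqSetAuthToken('<access token>');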

10.6.1.5 pyqIsTokenSet Function


The pyqIsTokenSet function returns whether the authorization token is set or not.

Syntax

FUNCTION SYS.pyqIsTokenSet() RETURN BOOLEAN


Example
The following example shows how to use the pyqSetAuthToken procedure and the
pyqIsTokenSet function.

DECLARE
is_set BOOLEAN;
BEGIN
pyqSetAuthToken('<access token>');
is_set := pyqIsTokenSet();
IF (is_set) THEN
DBMS_OUTPUT.put_line ('token is set');
END IF;
END;
/

10.6.2 Embedded Python Execution Functions (Autonomous Database)


The SQL API for Embedded Python Execution with Autonomous Database functions are
described in the following topics.

Topics
• pyqListEnvs Function (Autonomous Database)
The function pyqListEnvs, when used in Oracle Autonomous Database, lists the
environments saved in Object Storage.
• pyqEval Function (Autonomous Database)
The function pyqEval, when used in Oracle Autonomous Database, calls a user-defined
Python function. Users can pass arguments to the user-defined Python function.
• pyqTableEval Function (Autonomous Database)
The function pyqTableEval, when used in Oracle Autonomous Database, runs a
user-defined Python function on data from an Oracle Database table.
• pyqRowEval Function (Autonomous Database)
The function pyqRowEval, when used in Oracle Autonomous Database, chunks data into
sets of rows and then runs a user-defined Python function on each chunk.
• pyqGroupEval Function (Autonomous Database)
The function pyqGroupEval, when used in Oracle Autonomous Database, groups data by
one or more columns and runs a user-defined Python function on each group.
• pyqIndexEval Function (Autonomous Database)
The function pyqIndexEval, when used in Oracle Autonomous Database, runs a user-
defined Python function multiple times as required in the Python engines spawned by the
database environment.
• pyqGrant Function (Autonomous Database)
This topic describes the pyqGrant function when used in Oracle Autonomous Database.
• pyqRevoke Function (Autonomous Database)
This topic describes the pyqRevoke function when used in Oracle Autonomous Database.
• pyqScriptCreate Procedure (Autonomous Database)
This topic describes the pyqScriptCreate procedure in Oracle Autonomous Database.
Use the pyqScriptCreate procedure to create a user-defined Python function and add it to
the OML4Py script repository.


• pyqScriptDrop Procedure (Autonomous Database)
This topic describes the pyqScriptDrop procedure in Oracle Autonomous Database. Use
the pyqScriptDrop procedure to remove a user-defined Python function from the OML4Py
script repository.

10.6.2.1 pyqListEnvs Function (Autonomous Database)


The function pyqListEnvs, when used in Oracle Autonomous Database, lists the environments
saved in Object Storage.

Syntax

FUNCTION PYQSYS.pyqListEnvs
RETURN SYS.AnyDataSet

Example
Issue a query that calls the pyqListEnvs function and lists the environments present.

select * from table(pyqListEnvs());

The output is similar to the following:

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--
{"envs":[{"size":" 1.7 GiB","name":"sbenv","description":"Conda environment
with seaborn","number_of_installed_packages":78,"tags":"appli
cation":"OML4PY"}]}

10.6.2.2 pyqEval Function (Autonomous Database)


The function pyqEval, when used in Oracle Autonomous Database, calls a user-defined
Python function. Users can pass arguments to the user-defined Python function.
The function pyqEval does not automatically load the data. Within the user-defined Python
function, the user may explicitly access or retrieve data using the transparency layer or a
database connection.

Syntax

FUNCTION PYQSYS.pyqEval(
PAR_LST VARCHAR2,
OUT_FMT VARCHAR2,
SCR_NAME VARCHAR2,
SCR_OWNER VARCHAR2 DEFAULT NULL,
ENV_NAME VARCHAR2 DEFAULT NULL
)
RETURN SYS.AnyDataSet


Parameters

Parameter Description
PAR_LST A JSON string that contains additional parameters to pass to the user-defined
Python function specified by the SCR_NAME parameter. Special control arguments,
which start with oml_, are not passed to the function specified by SCR_NAME, but
instead control what happens before or after the invocation of the function.
For example, to specify the input data type as pandas.DataFrame, use:
'{"oml_input_type":"pandas.DataFrame"}'
See also: Special Control Arguments (Autonomous Database).
OUT_FMT The format of the output returned by the function. It can be one of the following:
• A JSON string that specifies the column names and data types of the table
returned by the function. Any image data is discarded. The Python function
must return a pandas.DataFrame, a numpy.ndarray, a tuple, or a list of
tuples.
• The string 'JSON', which specifies that the table returned contains a CLOB
that is a JSON string.
• The string 'XML', which specifies that the table returned contains a CLOB that
is an XML string. The XML can contain both structured data and images, with
structured or semi-structured Python objects first, followed by the image or
images generated by the Python function.
• The string 'PNG', which specifies that the table returned contains a BLOB that
has the image or images generated by the Python function. Images are
returned as a base 64 encoding of the PNG representation.
See also: Output Formats (Autonomous Database).
SCR_NAME The name of a user-defined Python function in the OML4Py script repository.
SCR_OWNER The owner of the registered Python script. The default value is NULL. If NULL, will
search for the Python script in the user’s script repository.
ENV_NAME The name of the conda environment that should be used when running the named
user-defined Python function.

Example
This example defines a Python function and stores it in the OML4Py script repository. It calls
the pyqEval function on the user-defined Python functions.

In a PL/SQL block, create a Python function that is stored in script repository with the name
pyqFun1.

begin
sys.pyqScriptCreate('pyqFun1',
'def fun_tab():
import pandas as pd
names = ["demo_"+str(i) for i in range(10)]
ids = [x for x in range(10)]
floats = [float(x)/10 for x in range(10)]
d = {''ID'': ids, ''NAME'': names, ''FLOAT'': floats}
scores_table = pd.DataFrame(d)
return scores_table
',FALSE,TRUE); -- V_GLOBAL, V_OVERWRITE
end;
/


Next, call the pyqEval function, which runs the user-defined Python function.

The PAR_LST argument specifies using LOW service level with the special control argument
oml_service_level.

In the OUT_FMT argument, the string 'JSON' specifies that the table returned contains a CLOB
that is a JSON string.
The SCR_NAME parameter specifies the pyqFun1 function in the script repository as the Python
function to call.
The JSON output is a CLOB. You can call set long [length] to get more output.

set long 500


select *
from table(pyqEval(
par_lst => '{"oml_service_level":"LOW"}',
out_fmt => 'JSON',
scr_name => 'pyqFun1'));

The output is the following.

NAME
----------------------------------------------------------------------
VALUE
----------------------------------------------------------------------
[{"FLOAT":0,"ID":0,"NAME":"demo_0"},{"FLOAT":0.1,"ID":1,"NAME":"demo_1
"},{"FLOAT":0.2,"ID":2,"NAME":"demo_2"},{"FLOAT":0.3,"ID":3,"NAME":"de
mo_3"},{"FLOAT":0.4,"ID":4,"NAME":"demo_4"},{"FLOAT":0.5,"ID":5,"NAME"
:"demo_5"},{"FLOAT":0.6,"ID":6,"NAME":"demo_6"},{"FLOAT":0.7,"ID":7,"N
AME":"demo_7"},{"FLOAT":0.8,"ID":8,"NAME":"demo_8"},{"FLOAT":0.9,"ID":
9,"NAME":"demo_9"}]

1 row selected.

Issue another query that invokes the same pyqFun1 script. The OUT_FMT argument specifies a
JSON string that contains the column names and data types of the structured table output.

select *
from table(pyqEval(
par_lst => '{"oml_service_level":"LOW"}',
out_fmt => '{"ID":"number", "NAME":"VARCHAR2(8)",
"FLOAT":"binary_double"}',
scr_name => 'pyqFun1'));

The output is the following:

ID NAME FLOAT
0 demo_0 0.0
1 demo_1 0.1
2 demo_2 0.2
3 demo_3 0.3
4 demo_4 0.4
5 demo_5 0.5
6 demo_6 0.6


7 demo_7 0.7
8 demo_8 0.8
9 demo_9 0.9

10 rows selected.

In a PL/SQL block, define the Python function create_iris_table and store it in the script
repository with the name create_iris_table, overwriting any existing user-defined Python
function stored in the script repository with the same name.
The create_iris_table function imports and loads the iris data set, creates two
pandas.DataFrame objects, and then returns the concatenation of those objects.

BEGIN
sys.pyqScriptCreate('create_iris_table',
'def create_iris_table():
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
x = pd.DataFrame(iris.data, columns = ["Sepal_Length",\
"Sepal_Width", "Petal_Length", "Petal_Width"])
y = pd.DataFrame(list(map(lambda x: {0:"setosa", 1: "versicolor",\
2: "virginica"}[x], iris.target)),\
columns = ["Species"])
return pd.concat([y, x], axis=1)',
FALSE, TRUE); -- V_GLOBAL, V_OVERWRITE
END;
/
CREATE TABLE IRIS AS
(SELECT * FROM pyqEval(
NULL,
'{"Species":"VARCHAR2(10)","Sepal_Length":"number",
"Sepal_Width":"number","Petal_Length":"number",
"Petal_Width":"number"}',
'create_iris_table'
));

10.6.2.3 pyqTableEval Function (Autonomous Database)


The function pyqTableEval, when used in Oracle Autonomous Database, runs a user-defined
Python function on data from an Oracle Database table.
Pass data to the user-defined Python function from the table name specified in the INP_NAM
parameter. Pass arguments to the user-defined Python function with the PAR_LST parameter.

The user-defined Python function can return a boolean, a dict, a float, an int, a list, a str,
a tuple or a pandas.DataFrame object. You define the form of the returned value with the
OUT_FMT parameter.

Syntax

FUNCTION PYQSYS.pyqTableEval(
INP_NAM VARCHAR2,
PAR_LST VARCHAR2,


OUT_FMT VARCHAR2,
SCR_NAME VARCHAR2,
SCR_OWNER VARCHAR2 DEFAULT NULL,
ENV_NAME VARCHAR2 DEFAULT NULL
)
RETURN SYS.AnyDataSet

Parameters

Parameter Description
INP_NAM The name of a table or view that specifies the data to pass to the Python
function specified by the SCR_NAME parameter. If using a table or view owned
by another user, use the format <owner name>.<table/view name>. You
must have read access to the specified table or view.
PAR_LST A JSON string that contains additional parameters to pass to the user-defined
Python function specified by the SCR_NAME parameter. Special control
arguments, which start with oml_, are not passed to the function specified by
SCR_NAME, but instead control what happens before or after the invocation of
the function.
For example, to specify the input data type as pandas.DataFrame, use:
'{"oml_input_type":"pandas.DataFrame"}'
See also: Special Control Arguments (Autonomous Database).
OUT_FMT The format of the output returned by the function. It can be one of the
following:
• A JSON string that specifies the column names and data types of the table
returned by the function. Any image data is discarded. The Python
function must return a pandas.DataFrame, a numpy.ndarray, a tuple,
or a list of tuples.
• The string 'JSON', which specifies that the table returned contains a
CLOB that is a JSON string.
• The string 'XML', which specifies that the table returned contains a CLOB
that is an XML string. The XML can contain both structured data and
images, with structured or semi-structured Python objects first, followed by
the image or images generated by the Python function.
• The string 'PNG', which specifies that the table returned contains a BLOB
that has the image or images generated by the Python function. Images
are returned as a base 64 encoding of the PNG representation.
See also: Output Formats (Autonomous Database).
SCR_NAME The name of a user-defined Python function in the OML4Py script repository.
SCR_OWNER The owner of the registered Python script. The default value is NULL. If NULL,
will search for the Python script in the user’s script repository.
ENV_NAME The name of the conda environment that should be used when running the
named user-defined Python function.

Example
Define the Python function fit_model and store it with the name myLinearRegressionModel as
a private function in the script repository, overwriting any existing user-defined Python function
stored with that name.
The fit_model function fits a regression model to the input data dat and then saves the fitted
model as an object specified by the modelName argument to the datastore specified by the
datastoreName argument. The fit_model function returns the fitted model in a string format.


By default, Python objects are saved to a new datastore with the specified datastoreName. To
save an object to an existing datastore, either set the overwrite or append argument to True in
the oml.ds.save invocation.

BEGIN
sys.pyqScriptCreate('myLinearRegressionModel',
'def fit_model(dat, modelName, datastoreName):
import oml
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(dat.loc[:, ["Sepal_Length", "Sepal_Width", \
"Petal_Length"]],
dat.loc[:,["Petal_Width"]])
oml.ds.save(objs={modelName:regr}, name=datastoreName,
overwrite=True)
return str(regr)',
FALSE, TRUE); -- V_GLOBAL, V_OVERWRITE
END;
/

This example uses the IRIS table created in the example shown in pyqEval Function
(Autonomous Database). Run a SELECT statement that invokes the pyqTableEval function. The
INP_NAM parameter of the pyqTableEval function specifies the IRIS table as the data to pass to
the Python function. The PAR_LST parameter specifies the names of the model and datastore to
pass to the Python function. The OUT_FMT parameter specifies returning the value in XML
format and the SCR_NAME parameter specifies the myLinearRegressionModel function in the
script repository as the Python function to invoke. The XML output is a CLOB; you can call set
long [length] to get more output.

SELECT *
FROM table(pyqTableEval(
inp_nam => 'IRIS',
par_lst => '{"modelName":"linregr",
"datastoreName":"pymodel"}',
out_fmt => 'XML',
scr_name => 'myLinearRegressionModel'));

The output is the following:

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--
<root><str>LinearRegression()</str></root>
1 row selected.


10.6.2.4 pyqRowEval Function (Autonomous Database)


The function pyqRowEval, when used in Oracle Autonomous Database, chunks data into sets of
rows and then runs a user-defined Python function on each chunk.
The function pyqRowEval passes the data specified by the INP_NAM parameter to the user-
defined Python function specified by the SCR_NAME parameter. The PAR_LST parameter specifies
the special control argument oml_graphics_flag to capture images rendered in the script, and
the oml_parallel_flag and oml_service_level flags enable parallelism using the MEDIUM
service level. See also: Special Control Arguments (Autonomous Database).
The ROW_NUM parameter specifies the maximum number of rows to pass to each invocation of
the user-defined Python function. The last set of rows may have fewer rows than the number
specified.
The user-defined Python function can return a boolean, a dict, a float, an int, a list, a str,
a tuple or a pandas.DataFrame object. You can define the form of the returned value with the
OUT_FMT parameter.

Syntax

FUNCTION PYQSYS.pyqRowEval(
INP_NAM VARCHAR2,
PAR_LST VARCHAR2,
OUT_FMT VARCHAR2,
ROW_NUM NUMBER,
SCR_NAME VARCHAR2,
SCR_OWNER VARCHAR2 DEFAULT NULL,
ENV_NAME VARCHAR2 DEFAULT NULL
)
RETURN SYS.AnyDataSet

Parameters

Parameter Description
INP_NAM The name of a table or view that specifies the data to pass to the Python
function specified by the SCR_NAME parameter. If using a table or view owned
by another user, use the format <owner name>.<table/view name>. You
must have read access to the specified table or view.
PAR_LST A JSON string that contains additional parameters to pass to the user-defined
Python function specified by the SCR_NAME parameter. Special control
arguments, which start with oml_, are not passed to the function specified by
SCR_NAME, but instead control what happens before or after the invocation of
the function.
For example, to specify the input data type as pandas.DataFrame, use:
'{"oml_input_type":"pandas.DataFrame"}'
See also: Special Control Arguments (Autonomous Database).


Parameter Description
OUT_FMT The format of the output returned by the function. It can be one of the
following:
• A JSON string that specifies the column names and data types of the table
returned by the function. Any image data is discarded. The Python
function must return a pandas.DataFrame, a numpy.ndarray, a tuple,
or a list of tuples.
• The string 'JSON', which specifies that the table returned contains a
CLOB that is a JSON string.
• The string 'XML', which specifies that the table returned contains a CLOB
that is an XML string. The XML can contain both structured data and
images, with structured or semi-structured Python objects first, followed by
the image or images generated by the Python function.
• The string 'PNG', which specifies that the table returned contains a BLOB
that has the image or images generated by the Python function. Images
are returned as a base 64 encoding of the PNG representation.
See also: Output Formats (Autonomous Database).
ROW_NUM The number of rows in each chunk. The user-defined Python function runs on each chunk.
SCR_NAME The name of a user-defined Python function in the OML4Py script repository.
SCR_OWNER The owner of the registered Python script. The default value is NULL. If NULL,
will search for the Python script in the user’s script repository.
ENV_NAME The name of the conda environment that should be used when running the
named user-defined Python function.

Example
This example loads the Python model linregr to predict row chunks of sample iris data. The
model is created and saved in the datastore pymodel, which is shown in the example for
pyqTableEval Function (Autonomous Database).
The example defines a Python function and stores it in the OML4Py script repository. It uses
that user-defined Python function with the pyqEval function to create a database table. It then
defines a Python function that runs a prediction function on a model loaded from the OML4Py
datastore, and invokes the pyqRowEval function to run that function on chunks of rows from the
database table.
In a PL/SQL block, define the function sample_iris_table and store it in the script repository.
The function loads the iris data set, creates two pandas.DataFrame objects, and then returns a
sample of the concatenation of those objects.

BEGIN
sys.pyqScriptCreate('sample_iris_table',
'def sample_iris_table(size):
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
x = pd.DataFrame(iris.data, columns = ["Sepal_Length",\
"Sepal_Width","Petal_Length","Petal_Width"])
y = pd.DataFrame(list(map(lambda x: {0:"setosa", 1: "versicolor",\
2: "virginica"}[x], iris.target)),\
columns = ["Species"])
return pd.concat([y, x], axis=1).sample(int(size))',
FALSE, TRUE); -- V_GLOBAL, V_OVERWRITE


END;
/

Create the SAMPLE_IRIS table in the database as the result of a SELECT statement, which
invokes the pyqEval function on the sample_iris_table user-defined Python function saved in
the script repository with the same name. The sample_iris_table function returns an iris data
sample of size size.

CREATE TABLE sample_iris AS


SELECT *
FROM TABLE(pyqEval(
'{"size":20}',
'{"Species":"varchar2(10)","Sepal_Length":"number",
"Sepal_Width":"number","Petal_Length":"number",
"Petal_Width":"number"}',
'sample_iris_table'));

Define the Python function predict_model and store it with the name linregrPredict in the
script repository. The function predicts the data in dat with the Python model specified by the
modelName argument, which is loaded from the datastore specified by the datastoreName
argument. The function also plots the actual petal width values with the predicted values. The
predictions are finally concatenated and returned with dat as the object that the function
returns.

BEGIN
sys.pyqScriptCreate('linregrPredict',
'def predict_model(dat, modelName, datastoreName):
import oml
import pandas as pd
objs = oml.ds.load(name=datastoreName, to_globals=False)
pred = objs[modelName].predict(dat[["Sepal_Length",\
"Sepal_Width","Petal_Length"]])
return pd.concat([dat, pd.DataFrame(pred, \
columns=["Pred_Petal_Width"])], axis=1)',
FALSE, TRUE); -- V_GLOBAL, V_OVERWRITE
END;
/

Run a SELECT statement that invokes the pyqRowEval function, which runs the specified Python
function on each chunk of rows in the specified data set.
The INP_NAM argument specifies the data in the SAMPLE_IRIS table to pass to the Python
function.
The PAR_LST argument specifies passing the input data as a pandas.DataFrame with the
special control argument oml_input_type, along with values for the function arguments
modelName and datastoreName.

In the OUT_FMT argument, the JSON string specifies the column names and data types of the
structured table output.
The ROW_NUM argument specifies that five rows are included in each invocation of the function
specified by SCR_NAME.


The SCR_NAME parameter specifies linregrPredict, which is the name in the script repository
of the user-defined Python function to invoke.

SELECT *
FROM table(pyqRowEval(
inp_nam => 'SAMPLE_IRIS',
par_lst => '{"oml_input_type":"pandas.DataFrame",
"modelName":"linregr", "datastoreName":"pymodel"}',
out_fmt => '{"Species":"varchar2(12)", "Petal_Length":"number",
"Pred_Petal_Width":"number"}',
row_num => 5,
scr_name => 'linregrPredict'));

The output is the following.

Species Petal_Length Pred_Petal_Width


setosa 1.2 0.0653133202
versicolor 4.5 1.632087234
setosa 1.3 0.2420812759
setosa 1.9 0.5181904241
setosa 1.4 0.2162518989
setosa 1.4 0.1732424372
setosa 1.5 0.2510460971
setosa 1.3 0.1907951829
versicolor 3.9 1.1999981051
versicolor 4.2 1.4017887483
versicolor 4 1.2332360562
versicolor 4.8 1.765473067
virginica 5.6 2.0095892178
versicolor 4.7 1.5824801232

Species Petal_Length Pred_Petal_Width


virginica 5.4 2.0623088225
versicolor 4.7 1.6524411804
virginica 5.6 1.9919751044
virginica 5.8 2.1206308288
virginica 5.1 1.7983383572
versicolor 4.4 1.3677441077

20 rows selected.

Run a SELECT statement that invokes the pyqRowEval function and return the XML output. Each
invocation of script linregrPredict is applied to 10 rows of data in the SAMPLE_IRIS table. The
XML output is a CLOB; you can call set long [length] to get more output.

set long 300


SELECT *
FROM table(pyqRowEval(
    inp_nam => 'SAMPLE_IRIS',
    par_lst => '{"oml_input_type":"pandas.DataFrame",
                 "modelName":"linregr", "datastoreName":"pymodel",
                 "oml_parallel_flag":true, "oml_service_level":"MEDIUM"}',
    out_fmt => 'XML',
    row_num => 10,
    scr_name => 'linregrPredict'));

The output is the following:

NAME VALUE
<root><pandas_dataFrame><ROW-pandas_dataFrame><Species>setosa</
Species><Sepal_Length>5</Sepal_Length><Sepal_Width>3.2</
Sepal_Width><Petal_Length>1.2</Petal_Length><Petal_Width>0.2</
Petal_Width><Pred_Petal_Width>0.0653133201897007</Pred_Petal_Width></ROW-
pandas_dataFrame><ROW-pandas_dataFrame><Species>

10.6.2.5 pyqGroupEval Function (Autonomous Database)


The function pyqGroupEval, when used in Oracle Autonomous Database, groups data by one
or more columns and runs a user-defined Python function on each group.
The function pyqGroupEval runs the user-defined Python function specified by the SCR_NAME
parameter. Pass data to the user-defined Python function with the INP_NAM parameter. The
PAR_LST parameter specifies the special control argument oml_graphics_flag to capture
images rendered in the script, and the oml_parallel_flag and oml_service_level flags
enable parallelism using the MEDIUM service level. See also: Special Control Arguments
(Autonomous Database). Specify one or more grouping columns with the GRP_COL parameter.

The user-defined Python function can return a boolean, a dict, a float, an int, a list, a str,
a tuple or a pandas.DataFrame object. Define the form of the returned value with the OUT_FMT
parameter.

Syntax

FUNCTION PYQSYS.pyqGroupEval(
INP_NAM VARCHAR2,
PAR_LST VARCHAR2,
OUT_FMT VARCHAR2,
GRP_COL VARCHAR2,
ORD_COL VARCHAR2,
SCR_NAME VARCHAR2,
SCR_OWNER VARCHAR2 DEFAULT NULL,
ENV_NAME VARCHAR2 DEFAULT NULL
)
RETURN SYS.AnyDataSet

Parameters

Parameter Description
INP_NAM The name of a table or view that specifies the data to pass to the Python function
specified by the SCR_NAME parameter. If using a table or view owned by another user,
use the format <owner name>.<table/view name>. You must have read access to
the specified table or view.


Parameter Description
PAR_LST A JSON string that contains additional parameters to pass to the user-defined Python
function specified by the SCR_NAME parameter. Special control arguments, which start
with oml_, are not passed to the function specified by SCR_NAME, but instead control
what happens before or after the invocation of the function.
For example, to specify the input data type as pandas.DataFrame, use:
'{"oml_input_type":"pandas.DataFrame"}'
See also: Special Control Arguments (Autonomous Database).
OUT_FMT The format of the output returned by the function. It can be one of the following:
• A JSON string that specifies the column names and data types of the table returned
by the function. Any image data is discarded. The Python function must return a
pandas.DataFrame, a numpy.ndarray, a tuple, or a list of tuples.
• The string 'JSON', which specifies that the table returned contains a CLOB that is a
JSON string.
• The string 'XML', which specifies that the table returned contains a CLOB that is an
XML string. The XML can contain both structured data and images, with structured
or semi-structured Python objects first, followed by the image or images generated
by the Python function.
• The string 'PNG', which specifies that the table returned contains a BLOB that has
the image or images generated by the Python function. Images are returned as a
base 64 encoding of the PNG representation.
See also: Output Formats (Autonomous Database).
GRP_COL The names of the grouping columns by which to partition the data. Use commas to
separate multiple columns. For example, to group by GENDER and YEAR:
"GENDER,YEAR"
ORD_COL Comma-separated column names to order the input data. For example to order by
GENDER:
"GENDER"
If specified, the input data will first be ordered by the ORD_COL columns and then
grouped by the GRP_COL columns.
SCR_NAME The name of a user-defined Python function in the OML4Py script repository.
SCR_OWNER The owner of the registered Python script. The default value is NULL. If NULL, will search
for the Python script in the user’s script repository.
ENV_NAME The name of the conda environment that should be used when running the named user-
defined Python function.

Example
This example uses the IRIS table created in the example shown in pyqEval Function
(Autonomous Database).
Define the Python function group_count and store it with the name mygroupcount in the script
repository. The function returns a pandas.DataFrame generated on each group of data dat. The
function also plots the sepal length with the petal length values on each group.

BEGIN
sys.pyqScriptCreate('mygroupcount',
'def group_count(dat):
import pandas as pd
import matplotlib.pyplot as plt
plt.plot(dat[["Sepal_Length"]], dat[["Petal_Length"]], ".")
plt.xlabel("Sepal Length")


plt.ylabel("Petal Length")
plt.title("{}".format(dat["Species"][0]))
return pd.DataFrame([(dat["Species"][0], dat.shape[0])],\
columns = ["Species", "CNT"]) ',
FALSE, TRUE); -- V_GLOBAL, V_OVERWRITE
END;
/

Issue a query that invokes the pyqGroupEval function. In the function, the INP_NAM argument
specifies the data in the IRIS table to pass to the function.
The PAR_LST argument specifies the special control argument oml_input_type.

The OUT_FMT argument specifies a JSON string that contains the column names and data types
of the table returned by pyqGroupEval.

The GRP_COL parameter specifies the column to group by.

The SCR_NAME parameter specifies the user-defined Python function stored with the name
mygroupcount in the script repository.

SELECT *
FROM table(
pyqGroupEval(
inp_nam => 'IRIS',
par_lst => '{"oml_input_type":"pandas.DataFrame"}',
out_fmt => '{"Species":"varchar2(10)", "CNT":"number"}',
grp_col => 'Species',
ord_col => NULL,
scr_name => 'mygroupcount'));

The output is the following:

Species CNT
---------- ----------
virginica 50
setosa 50
versicolor 50
3 rows selected.
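
To order rows within each group before they are passed to the function, supply the ORD_COL
argument. The following query is a minimal sketch using the same IRIS table and mygroupcount
script; it orders each Species group by the Sepal_Length column before invoking the function
(the counts returned are unchanged).

SELECT *
FROM table(
    pyqGroupEval(
        inp_nam => 'IRIS',
        par_lst => '{"oml_input_type":"pandas.DataFrame"}',
        out_fmt => '{"Species":"varchar2(10)", "CNT":"number"}',
        grp_col => 'Species',
        ord_col => 'Sepal_Length',
        scr_name => 'mygroupcount'));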

Run the same script with IRIS data and return the XML output. The PAR_LST argument
specifies the special control argument oml_graphics_flag to capture images rendered in the
script. Both structured data and images are included in the XML output. The XML output is a
CLOB; you can call set long [length] to get more output.

set long 300


SELECT *
FROM table(
pyqGroupEval(
inp_nam => 'IRIS',
par_lst => '{"oml_input_type":"pandas.DataFrame",
"oml_graphics_flag":true, "oml_parallel_flag":true",
"oml_service_level":"MEDIUM"}',
out_fmt => 'XML',
    grp_col => 'Species',
    ord_col => NULL,
    scr_name => 'mygroupcount'));

The output is the following.

NAME VALUE
virginica <root><Py-data><pandas_dataFrame><ROW-
pandas_dataFrame><Species>virginica</Species><CNT>50</CNT></ROW-
pandas_dataFrame></pandas_dataFrame></Py-data><images><image><img
src="data:image/pngbase64"><!
[CDATA[iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXR
wbG90bGliIHZlcnNpb24zLjMu
setosa <root><Py-data><pandas_dataFrame><ROW-
pandas_dataFrame><Species>setosa</Species><CNT>50</CNT></ROW-
pandas_dataFrame></pandas_dataFrame></Py-data><images><image><img
src="data:image/pngbase64"><!
[CDATA[iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXR
wbG90bGliIHZlcnNpb24zLjMuMyw
versicolor <root><Py-data><pandas_dataFrame><ROW-
pandas_dataFrame><Species>versicolor</Species><CNT>50</CNT></ROW-
pandas_dataFrame></pandas_dataFrame></Py-data><images><image><img
src="data:image/pngbase64"><!
[CDATA[iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXR
wbG90bGliIHZlcnNpb24zLjM

Run the same script with IRIS data and get the PNG output. The PAR_LST argument specifies
the special control argument oml_graphics_flag to capture images.

column name format a7


column value format a15
column title format a16
column image format a15
SELECT *
FROM table(
pyqGroupEval(
inp_nam => 'IRIS',
par_lst => '{"oml_input_type":"pandas.DataFrame",
"oml_graphics_flag":true}',
out_fmt => 'PNG',
grp_col => 'Species',
ord_col => NULL,
scr_name => 'mygroupcount'));

The output is the following:

NAME ID VALUE TITLE IMAGE


------- ---------- --------------- ---------------- ---------------
GROUP_s 1 [{"Species":"se setosa 89504E470D0A1A0
etosa tosa","CNT":50} A0000000D494844
] 520000028000000
1E0080600000035
D1DCE4000000397
4455874536F6674
77617265004D617


4706C6F746C6962
2076657273696F6
E332E332E332C20
6874747073

NAME ID VALUE TITLE IMAGE


------- ---------- --------------- ---------------- ---------------

GROUP_v 1 [{"Species":"ve versicolor 89504E470D0A1A0


ersicol rsicolor","CNT" A0000000D494844
or :50}] 520000028000000
1E0080600000035
D1DCE4000000397
4455874536F6674
77617265004D617
4706C6F746C6962
2076657273696F6
E332E332E332C20

NAME ID VALUE TITLE IMAGE


------- ---------- --------------- ---------------- ---------------
6874747073

GROUP_v 1 [{"Species":"vi virginica 89504E470D0A1A0


irginic rginica","CNT": A0000000D494844
a 50}] 520000028000000
1E0080600000035
D1DCE4000000397
4455874536F6674
77617265004D617
4706C6F746C6962
2076657273696F6

10.6.2.6 pyqIndexEval Function (Autonomous Database)


The function pyqIndexEval, when used in Oracle Autonomous Database, runs a user-defined
Python function multiple times as required in the Python engines spawned by the database
environment.
The function pyqIndexEval runs the user-defined Python function specified by the SCR_NAME
parameter. The PAR_LST parameter specifies the special control argument oml_graphics_flag
to capture images rendered in the script, and the oml_parallel_flag and oml_service_level
flags enable parallelism using the MEDIUM service level. See also: Special Control Arguments
(Autonomous Database).

Syntax

FUNCTION PYQSYS.pyqIndexEval(
PAR_LST VARCHAR2,
OUT_FMT VARCHAR2,
TIMES_NUM NUMBER,
SCR_NAME VARCHAR2,
SCR_OWNER VARCHAR2 DEFAULT NULL,
ENV_NAME VARCHAR2 DEFAULT NULL


)
RETURN SYS.AnyDataSet


Parameters

Parameter Description
PAR_LST A JSON string that contains additional parameters to pass to the user-defined Python
function specified by the SCR_NAME parameter. Special control arguments, which start
with oml_, are not passed to the function specified by SCR_NAME, but instead control
what happens before or after the invocation of the function.
For example, to specify the input data type as pandas.DataFrame, use:
'{"oml_input_type":"pandas.DataFrame"}'
See also: Special Control Arguments (Autonomous Database).
OUT_FMT The format of the output returned by the function. It can be one of the following:
• A JSON string that specifies the column names and data types of the table returned
by the function. Any image data is discarded. The Python function must return a
pandas.DataFrame, a numpy.ndarray, a tuple, or a list of tuples.
• The string 'JSON', which specifies that the table returned contains a CLOB that is a
JSON string.
• The string 'XML', which specifies that the table returned contains a CLOB that is an
XML string. The XML can contain both structured data and images, with structured
or semi-structured Python objects first, followed by the image or images generated
by the Python function.
• The string 'PNG', which specifies that the table returned contains a BLOB that has
the image or images generated by the Python function. Images are returned as a
base 64 encoding of the PNG representation.
See also: Output Formats (Autonomous Database).
TIMES_NUM The number of times to execute the Python script.
SCR_NAME The name of a user-defined Python function in the OML4Py script repository.
SCR_OWNER The owner of the registered Python script. The default value is NULL. If NULL, will search
for the Python script in the user’s script repository.
ENV_NAME The name of the conda environment that should be used when running the named user-
defined Python function.

Example
Define the Python function fit_lm and store it with the name myFitMultiple in the script
repository. The function returns a pandas.DataFrame containing the index and prediction score
of the fitted model on the data sampled from scikit-learn’s IRIS dataset.

begin
sys.pyqScriptCreate('myFitMultiple',
'def fit_lm(i, sample_size):
from sklearn import linear_model
from sklearn.datasets import load_iris
import pandas as pd

import random
random.seed(10)

iris = load_iris()
x = pd.DataFrame(iris.data, columns = ["Sepal_Length",\
"Sepal_Width","Petal_Length","Petal_Width"])
y = pd.DataFrame(list(map(lambda x: {0:"setosa", 1: "versicolor",\
2: "virginica"}[x], iris.target)),\
columns = ["Species"])
dat = pd.concat([y, x], axis=1).sample(sample_size)
regr = linear_model.LinearRegression()
regr.fit(x.loc[:, ["Sepal_Length", "Sepal_Width", \
"Petal_Length"]],


x.loc[:,["Petal_Width"]])
sc = regr.score(dat.loc[:, ["Sepal_Length", "Sepal_Width", \
"Petal_Length"]],
dat.loc[:,["Petal_Width"]])
return pd.DataFrame([[i,sc]],columns=["id","score"])
',FALSE,TRUE); -- V_GLOBAL, V_OVERWRITE
end;
/

Issue a query that invokes the pyqIndexEval function. In the function, the PAR_LST argument
specifies the function argument sample_size. The OUT_FMT argument specifies a JSON string
that contains the column names and data types of the table returned by pyqIndexEval. The
TIMES_NUM parameter specifies the number of times to execute the script. The SCR_NAME
parameter specifies the user-defined Python function stored with the name myFitMultiple in
the script repository.

select *
from table(pyqIndexEval(
par_lst => '{"sample_size":80,
"oml_parallel_flag":true",
"oml_service_level":"MEDIUM"}',
out_fmt => '{"id":"number","score":"number"}',
times_num => 3,
scr_name => 'myFitMultiple'));

The output is the following:

id score
---------- ----------
1 .943550631
2 .927836941
3 .937196049
3 rows selected.

10.6.2.7 pyqGrant Function (Autonomous Database)


This topic describes the pyqGrant function when used in Oracle Autonomous Database.

The pyqGrant function grants read privilege access to an OML4Py datastore or to a script in
the OML4Py script repository.

Syntax

pyqGrant (
V_NAME VARCHAR2 IN
V_TYPE VARCHAR2 IN
V_USER VARCHAR2 IN DEFAULT)

Parameters

Parameter Description
V_NAME The name of an OML4Py datastore or a script in the OML4Py script repository.


Parameter Description
V_TYPE For a datastore, the type is datastore; for a script, the type is pyqScript.
V_USER The name of the user to whom to grant access.

Example 10-29 Granting Read Access to a script

-- Grant read privilege access to Scott.


BEGIN
pyqGrant('pyqFun1', 'pyqscript', 'SCOTT');
END;
/

Example 10-30 Granting Read Access to a datastore

-- Grant read privilege access to datastore ds1 to SCOTT.


BEGIN
pyqGrant('ds1', 'datastore', 'SCOTT');
END;
/

Example 10-31 Granting Read Access to a Script to all Users

-- Grant read privilege access to script pyqFun1 to all users.


BEGIN
pyqGrant('pyqFun1', 'pyqscript', NULL);
END;
/

Example 10-32 Granting Read Access to a datastore to all Users

-- Grant read privilege access to datastore ds1 to all users.


BEGIN
pyqGrant('ds1', 'datastore', NULL);
END;
/

10.6.2.8 pyqRevoke Function (Autonomous Database)


This topic describes the pyqRevoke function when used in Oracle Autonomous Database.

The pyqRevoke function revokes read privilege access to an OML4Py datastore or to a script in
the OML4Py script repository.

Syntax

pyqRevoke (
V_NAME VARCHAR2 IN
V_TYPE VARCHAR2 IN
V_USER VARCHAR2 IN DEFAULT)


Parameters

Parameter Description
V_NAME The name of an OML4Py datastore or a script in the OML4Py script repository.
V_TYPE For a datastore, the type is datastore; for a script, the type is pyqScript.
V_USER The name of the user from whom to revoke access.

Example 10-33 Revoking Read Access to a script

-- Revoke read privilege access to script pyqFun1 from SCOTT.


BEGIN
pyqRevoke('pyqFun1', 'pyqscript', 'SCOTT');
END;
/

Example 10-34 Revoking Read Access to a datastore

-- Revoke read privilege access to datastore ds1 from SCOTT.


BEGIN
pyqRevoke('ds1', 'datastore', 'SCOTT');
END;
/

Example 10-35 Revoking Read Access to a script from all Users

-- Revoke read privilege access to script pyqFun1 from all users.


BEGIN
pyqRevoke('pyqFun1', 'pyqscript', NULL);
END;
/

Example 10-36 Revoking Read Access to a datastore from all Users

-- Revoke read privilege access to datastore ds1 from all users.


BEGIN
pyqRevoke('ds1', 'datastore', NULL);
END;
/

10.6.2.9 pyqScriptCreate Procedure (Autonomous Database)


This topic describes the pyqScriptCreate procedure in Oracle Autonomous Database. Use the
pyqScriptCreate procedure to create a user-defined Python function and add it to the
OML4Py script repository.

Syntax

sys.pyqScriptCreate (
V_NAME VARCHAR2 IN
V_SCRIPT CLOB IN


V_GLOBAL BOOLEAN IN DEFAULT
V_OVERWRITE BOOLEAN IN DEFAULT)

Parameter Description
V_NAME A name for the user-defined Python function in the OML4Py script repository.
V_SCRIPT The definition of the Python function.
V_GLOBAL TRUE specifies that the user-defined Python function is public; FALSE specifies that
the user-defined Python function is private.
V_OVERWRITE If the script repository already has a user-defined Python function with the same
name as V_NAME, then TRUE replaces the content of that user-defined Python
function with V_SCRIPT and FALSE does not replace it.

Example 10-37 Using the pyqScriptCreate Procedure


This example creates a private user-defined Python function named pyqFun2 in the OML4Py
script repository.

BEGIN
sys.pyqScriptCreate('pyqFun2',
'def return_frame():
import numpy as np
import pickle
z = np.array([y for y in zip([str(x)+"demo" for x in range(10)],
[float(x)/10 for x in range(10)],
[x for x in range(10)],
[bool(x%2) for x in range(10)],
[pickle.dumps(x) for x in range(10)],
["test"+str(x**2) for x in range(10)])],
dtype=[("a", "U10"), ("b", "f8"), ("c", "i4"), ("d", "?"),
("e", "S20"), ("f", "O")])
return z');
END;
/

This example creates a global user-defined Python function named pyqFun2 in the script
repository and overwrites any existing user-defined Python function of the same name.

BEGIN
sys.pyqScriptCreate('pyqFun2',
'def return_frame():
import numpy as np
import pickle
z = np.array([y for y in zip([str(x)+"demo" for x in range(10)],
[float(x)/10 for x in range(10)],
[x for x in range(10)],
[bool(x%2) for x in range(10)],
[pickle.dumps(x) for x in range(10)],
["test"+str(x**2) for x in range(10)])],
dtype=[("a", "U10"), ("b", "f8"), ("c", "i4"), ("d", "?"),
("e", "S20"), ("f", "O")])
return z',
TRUE, -- Make the user-defined Python function global.
TRUE); -- Overwrite any global user-defined Python function
-- with the same name.


END;
/

This example creates a private user-defined Python function named create_iris_table in the
script repository.

BEGIN
sys.pyqScriptCreate('create_iris_table',
'def create_iris_table():
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
x = pd.DataFrame(iris.data, columns = ["Sepal_Length",\
"Sepal_Width","Petal_Length","Petal_Width"])
y = pd.DataFrame(list(map(lambda x: {0:"setosa", 1: "versicolor",\
2: "virginica"}[x], iris.target)),\
columns = ["Species"])
return pd.concat([y, x], axis=1)');
END;
/

Display the user-defined Python functions owned by the current user.

SELECT * from USER_PYQ_SCRIPTS;

NAME SCRIPT
-----------------
---------------------------------------------------------------------
create_iris_table def create_iris_table(): from sklearn.datasets
import load_iris ...
pyqFun2 def return_frame(): import numpy as np import
pickle ...

Display the user-defined Python functions available to the current user.

SELECT * from ALL_PYQ_SCRIPTS;

OWNER NAME SCRIPT


-------- -----------------
--------------------------------------------------------------------
OML_USER create_iris_table "def create_iris_table(): from
sklearn.datasets import load_iris ...
OML_USER pyqFun2 "def return_frame(): import numpy as np
import pickle ...
PYQSYS pyqFun2 "def return_frame(): import numpy as np
import pickle ...


10.6.2.10 pyqScriptDrop Procedure (Autonomous Database)


This topic describes the pyqScriptDrop procedure in Oracle Autonomous Database. Use the
pyqScriptDrop procedure to remove a user-defined Python function from the OML4Py script
repository.

Syntax

sys.pyqScriptDrop (
V_NAME VARCHAR2 IN
V_GLOBAL BOOLEAN IN DEFAULT
V_SILENT BOOLEAN IN DEFAULT)

Parameter Description
V_NAME A name for the user-defined Python function in the OML4Py script repository.
V_GLOBAL A BOOLEAN that specifies whether the user-defined Python function to drop is a
global or a private user-defined Python function. The default value is FALSE, which
indicates a private user-defined Python function. TRUE specifies that the user-
defined Python function is public.
V_SILENT A BOOLEAN that specifies whether to display an error message when
sys.pyqScriptDrop encounters an error in dropping the specified user-defined
Python function. The default value is FALSE.

Example 10-38 Using the sys.pyqScriptDrop Procedure


For the creation of the user-defined Python functions dropped in these examples, see
Example 10-27.
This example drops the private user-defined Python function pyqFun2 from the script repository.

BEGIN
sys.pyqScriptDrop('pyqFun2');
END;
/

This example drops the global user-defined Python function pyqFun2 from the script repository.

BEGIN
sys.pyqScriptDrop('pyqFun2', TRUE);
END;
/

10.6.3 Asynchronous Jobs (Autonomous Database)


When a function is run asynchronously, it runs as a job that can be tracked by using the
pyqJobStatus and pyqJobResult functions.
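
The end-to-end flow is: submit the function call with the special control argument
oml_async_flag set to true, poll the returned job ID with pyqJobStatus, and then fetch the
output with pyqJobResult. The following condensed sketch reuses the GRADE table and the
computeGradeDiff script created later in this section; the <job id> placeholder stands for the
job ID embedded in the URL returned by the submission call.

-- Submit the job asynchronously; the VALUE column holds a URL that ends in the job ID.
select *
from table(pyqTableEval(
    inp_nam  => 'GRADE',
    par_lst  => '{"oml_async_flag":true}',
    out_fmt  => NULL,
    scr_name => 'computeGradeDiff'));

-- Poll until the status call returns a .../result URL instead of "job is still running".
select * from pyqJobStatus(job_id => '<job id>');

-- Fetch the result, supplying OUT_FMT here rather than at submission time.
select * from pyqJobResult(
    job_id  => '<job id>',
    out_fmt => '{"NAME":"varchar2(7)","SCORE":"number","FINALGRADE":"number","DIFF":"number"}');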

Asynchronous jobs are described in the following topics:


Topics:
• oml_async_flag Argument
The special control argument oml_async_flag determines if a job is run synchronously or
asynchronously. The default value is false.
• pyqJobStatus Function
Use the pyqJobStatus function to look up the status of an asynchronous job. If the job is
pending, it returns job is still running . If the job is completed, the function returns a
URL.
• pyqJobResult Function
Use the pyqJobResult function to return the job result.
• Asynchronous Job Example
The following examples show how to submit asynchronous jobs with non-XML output and
with XML output.

10.6.3.1 oml_async_flag Argument


The special control argument oml_async_flag determines if a job is run synchronously or
asynchronously. The default value is false.

Set the oml_async_flag Argument

• To run a function in synchronous mode, set oml_async_flag to false.


In synchronous mode, the SQL API waits for the HTTP call to finish and returns when the
HTTP response is ready.
By default, pyq*Eval functions are executed synchronously. The default connection
timeout limit is 60 seconds. Synchronous mode is used if oml_async_flag is not set or if
it's set to false.
• To run a function in asynchronous mode, set oml_async_flag to true.
In asynchronous mode, the SQL API returns a URL directly after the asynchronous job is
submitted to the web server. The URL contains a job ID, which can be used to fetch the job
status and result in subsequent SQL calls.

Submit Asynchronous Job Example


This example uses the table GRADE, created as follows:

CREATE TABLE GRADE (


NAME VARCHAR2(30),
GENDER VARCHAR2(1),
STATUS NUMBER(10),
YEAR NUMBER(10),
SECTION VARCHAR2(1),
SCORE NUMBER(10),
FINALGRADE NUMBER(10)
);

insert into GRADE values('Abbott', 'F', 2, 97, 'A', 90, 87);


insert into GRADE values('Branford', 'M', 1, 98, 'A', 92, 97);
insert into GRADE values('Crandell', 'M', 2, 98, 'B', 81, 71);
insert into GRADE values('Dennison', 'M', 1, 97, 'A', 85, 72);


insert into GRADE values('Edgar', 'F', 1, 98, 'B', 89, 80);


insert into GRADE values('Faust', 'M', 1, 97, 'B', 78, 73);
insert into GRADE values('Greeley', 'F', 2, 97, 'A', 82, 91);
insert into GRADE values('Hart', 'F', 1, 98, 'B', 84, 80);
insert into GRADE values('Isley', 'M', 2, 97, 'A', 88, 86);
insert into GRADE values('Jasper', 'M', 1, 97, 'B', 91, 83);

In the following code, the Python function score_diff is defined and stored with the name
computeGradeDiff as a private function in the script repository. The function returns a
pandas.DataFrame after assigning a new DIFF column by computing the difference between
the SCORE and FINALGRADE column of the input data.

begin
sys.pyqScriptCreate('computeGradeDiff','def score_diff(dat):
import numpy as np
import pandas as pd
df = dat.assign(DIFF=dat.SCORE-dat.FINALGRADE)
return df
');
end;
/

Run the saved computeGradeDiff script as follows:

select *
from table(pyqTableEval(
inp_nam => 'GRADE',
par_lst => '{"oml_async_flag":true}',
out_fmt => NULL,
scr_name => 'computeGradeDiff',
scr_owner => NULL
));

The VALUE column of the result contains a URL containing the job ID of the asynchronous job:

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--
https://<host_name>/oml/tenants/<tenant_name>/databases/
<database_name>/api/py-scripts/v1/jobs/<job_id>
1 row selected.

10.6.3.2 pyqJobStatus Function


Use the pyqJobStatus function to look up the status of an asynchronous job. If the job is
pending, it returns job is still running . If the job is completed, the function returns a URL.


Syntax

FUNCTION PYQSYS.pyqJobStatus(
job_id VARCHAR2
)
RETURN PYQSYS.pyqClobSet

Parameters

Parameter Description
job_id The ID of the asynchronous job.

Example
The following example shows a pyqJobStatus call and its output.

SQL> select * from pyqJobStatus(


job_id => '<job id>'
);

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--

https://<host name>/oml/tenants/<tenant name>/databases/<database


name>/api/py-scripts/v1/jobs/<job id>/result

1 row selected.

10.6.3.3 pyqJobResult Function


Use the pyqJobResult function to return the job result.

Syntax

FUNCTION PYQSYS.pyqJobResult(
job_id VARCHAR2,
out_fmt VARCHAR2 DEFAULT 'JSON'
)
RETURN SYS.AnyDataSet

Parameters

Parameter Description
job_id The ID of the asynchronous job.


Example
The following example shows a pyqJobResult call and its output.

SQL> select * from pyqJobResult(


job_id => '<job id>',
out_fmt =>
'{"NAME":"varchar2(7)","SCORE":"number","FINALGRADE":"number","DIFF":"number"}
'
);

NAME SCORE FINALGRADE DIFF


---------- ---------- ---------- ----------
Abbott 90 87 3
Branford 92 97 -5
Crandell 81 71 10
Dennison 85 72 13
Edgar 89 80 9
Faust 78 73 5
Greeley 82 91 -9
Hart 84 80 4
Isley 88 86 2
Jasper 91 83 8

10 rows selected.

10.6.3.4 Asynchronous Job Example


The following examples show how to submit asynchronous jobs with non-XML output and
with XML output.

Non-XML Output
When submitting asynchronous jobs, for JSON, PNG and relational outputs, set the OUT_FMT
argument to NULL when submitting the job. When fetching the job result, specify OUT_FMT in the
pyqJobResult call.

This example uses the IRIS table created in the example shown in the pyqTableEval Function
(Autonomous Database) topic and the linregrPredict script created in the example shown in
the pyqRowEval Function (Autonomous Database) topic.
Issue a pyqGroupEval function call to submit an asynchronous job. In the function, the INP_NAM
argument specifies the data in the IRIS table to pass to the function.
The PAR_LST argument specifies submitting the job asynchronously with the special control
argument oml_async_flag, capturing the images rendered in the script with the special control
argument oml_graphics_flag, passing the input data as a pandas.DataFrame with the special
control argument oml_input_type, along with values for the function arguments modelName and
datastoreName.

The OUT_FMT argument is NULL.

The GRP_COL parameter specifies the column to group by.


The SCR_NAME parameter specifies the user-defined Python function stored with the name
linregrPredict in the script repository.

The asynchronous call returns a job status URL in a CLOB; you can call set long [length] to
get the full URL.

set long 150


select *
from table(pyqGroupEval(
inp_nam => 'IRIS',
par_lst => '{"oml_input_type":"pandas.DataFrame",
"oml_async_flag":true, "oml_graphics_flag":true,
"modelName":"linregr", "datastoreName":"pymodel"}',
out_fmt => NULL,
grp_col => 'Species',
ord_col => NULL,
scr_name => 'linregrPredict',
scr_owner => NULL
));

The output is the following:

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--
https://<host name>/oml/tenants/<tenant name>/databases/<database
name>/api/py-scripts/v1/jobs/<job id>

1 row selected.

Run a SELECT statement that invokes the pyqJobStatus function, which returns a resource URL
containing the job ID when the job result is ready.

select * from pyqJobStatus(


job_id => '<job id>');

The output is the following when the job is still pending.

NAME
----------------------------------------------------------------------
VALUE
----------------------------------------------------------------------
job is still running
1 row selected.

The output is the following when the job finishes.

NAME
------------------------------------------------------------------------------
--

VALUE
------------------------------------------------------------------------------
--
https://<host name>/oml/tenants/<tenant name>/databases/<database
name>/api/py-scripts/v1/jobs/<job id>/result

1 row selected.

Run a SELECT statement that invokes the pyqJobResult function.

In the OUT_FMT argument, the string 'PNG' specifies that both the return value and the images
(titles and image bytes) are included in the result.

column name format a7


column value format a15
column title format a16
column image format a15
select * from pyqJobResult(
job_id => '<job id>',
out_fmt => 'PNG'
);

The output is the following.

NAME ID VALUE TITLE IMAGE


------- ---------- --------------- ---------------- ---------------
GROUP_s 1 [{"Species":"se Prediction of Pe 6956424F5277304
etosa tosa","Sepal_Le tal Width B47676F41414141
ngth":4.6,"Sepa 4E5355684555674
l_Width":3.6,"P 141416F41414141
etal_Length":1. 486743415941414
0,"Petal_Width" 1413130647A6B41
:0.2,"Pred_Peta 41414142484E435
l_Width":0.1325 356514943416749
345443},{"Speci 6641686B6941414
es":"setosa","S 141416C7753466C
7A4141415059514
141443245427144
2B6E61514141414
468305256683055
32396D644864686
36D554162574630
634778766447787
0596942325A584A
7A615739754D793
4784C6A49734947

GROUP_v 1 [{"Species":"ve Prediction of Pe 6956424F5277304


ersicol rsicolor","Sepa tal Width B47676F41414141
or l_Length":5.1," 4E5355684555674
Sepal_Width":2. 141416F41414141
5,"Petal_Length 486743415941414
":3.0,"Petal_Wi 1413130647A6B41
dth":1.1,"Pred_ 41414142484E435
Petal_Width":0. 356514943416749

8319563387},{"S 6641686B6941414
pecies":"versic 141416C7753466C
7A4141415059514
141443245427144
2B6E61514141414
468305256683055
32396D644864686
36D554162574630
634778766447787
0596942325A584A
7A615739754D793
4784C6A49734947

GROUP_v 1 [{"Species":"vi Prediction of Pe 6956424F5277304


irginic rginica","Sepal tal Width B47676F41414141
a _Length":5.7,"S 4E5355684555674
epal_Width":2.5 141416F41414141
,"Petal_Length" 486743415941414
:5.0,"Petal_Wid 1413130647A6B41
th":2.0,"Pred_P 41414142484E435
etal_Width":1.7 356514943416749
55762924},{"Spe 6641686B6941414
cies":"virginic 141416C7753466C
7A4141415059514
141443245427144
2B6E61514141414
468305256683055
32396D644864686
36D554162574630
634778766447787
0596942325A584A
7A615739754D793
4784C6A49734947

3 rows selected.

XML Output
If XML output is expected from the asynchronous job, set the OUT_FMT argument to 'XML' when
submitting the job and fetching the job result.
This example uses the script myFitMultiple created in the example shown in the
pyqIndexEval Function (Autonomous Database) topic.
Issue a pyqIndexEval function call to submit an asynchronous job. In the function, the PAR_LST
argument specifies submitting the job asynchronously with the special control argument
oml_async_flag, along with a value for the function argument sample_size.
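
The myFitMultiple script itself is not repeated here; it is created in the pyqIndexEval Function
(Autonomous Database) topic. As a rough, hypothetical sketch of its shape (the use of scikit-learn and
the synthetic data below are assumptions), it accepts the run index and the sample_size argument and
returns a one-row pandas.DataFrame holding an id and a model score, which matches the result fetched
later in this example:

def my_fit_multiple(index, sample_size):
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    # Build a small synthetic regression sample of the requested size.
    rng = np.random.RandomState(index)
    x = rng.rand(sample_size, 1)
    y = 2.0 * x[:, 0] + rng.normal(scale=0.2, size=sample_size)
    # Fit a model for this index and report its R-squared score.
    regr = LinearRegression().fit(x, y)
    return pd.DataFrame({"id": [index], "score": [regr.score(x, y)]})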

The asynchronous call returns a job status URL in a CLOB. You can run set long [length] to
display the full URL.

set long 150

select *
from table(pyqIndexEval(
par_lst => '{"sample_size":80,"oml_async_flag":true}',
out_fmt => 'XML',
times_num => 3,
scr_name => 'myFitMultiple',
scr_owner => NULL
));

The output is the following.

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--
https://<host name>/oml/tenants/<tenant name>/databases/<database
name>/api/py-scripts/v1/jobs/<job id>

1 row selected.

Run a SELECT statement that invokes the pyqJobStatus function, which returns a resource URL
containing the job id when the job result is ready.

select * from pyqJobStatus(


job_id => '<job id>'
);

The output is the following when the job is still pending.

NAME
----------------------------------------------------------------------
VALUE
----------------------------------------------------------------------
job is still running
1 row selected.

The output is the following when the job result is ready.

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--
https://<host name>/oml/tenants/<tenant name>/databases/<database
name>/api/py-scripts/v1/jobs/<job id>/result

1 row selected.

Run a SELECT statement that invokes the pyqJobResult function.

In the OUT_FMT argument, the string 'XML' specifies that the table returned contains a CLOB
that is an XML string.

select * from pyqJobResult(


job_id => '<job id>',
out_fmt => 'XML'
);

The output is the following.

NAME
----------------------------------------------------------------------
VALUE
----------------------------------------------------------------------
1
<root><pandas_dataFrame><ROW-pandas_dataFrame><id>1</id><score>0.94355
0631313753</score></ROW-pandas_dataFrame></pandas_dataFrame></root>

2
<root><pandas_dataFrame><ROW-pandas_dataFrame><id>2</id><score>0.92783
6941437123</score></ROW-pandas_dataFrame></pandas_dataFrame></root>

3
<root><pandas_dataFrame><ROW-pandas_dataFrame><id>3</id><score>0.93719
6049031545</score></ROW-pandas_dataFrame></pandas_dataFrame></root>

3 rows selected.

10.6.4 Special Control Arguments (Autonomous Database)


Use the PAR_LST parameter to specify special control arguments and additional arguments to
be passed into the Python script.

Argument Syntax and Description


oml_input_type       Syntax
                     oml_input_type : 'pandas.DataFrame', 'numpy.recarray', or 'default' (default)
                     Description
                     Specifies the type of object to construct from data in the Autonomous Database. By
                     default, a two-dimensional numpy.ndarray of type numpy.float64 is constructed
                     if all columns are numeric. Otherwise, a pandas.DataFrame is constructed.

oml_na_omit          Syntax
                     oml_na_omit : bool, false (default)
                     Description
                     Determines if rows with any missing values will be omitted from the table to be
                     evaluated.
                     If true, omit all rows with missing values from the table.
                     If false, do not omit rows with missing values from the table.

oml_async_flag       Syntax
                     oml_async_flag : bool, false (default)
                     Description
                     If true, the job will be submitted asynchronously.
                     If false, the job will be executed in synchronous mode.

oml_graphics_flag    Syntax
                     oml_graphics_flag : bool, false (default)
                     Description
                     If true, the server will capture images rendered in the Python script.
                     If false, the server will not capture images rendered in the Python script.

oml_parallel_flag    Syntax
                     oml_parallel_flag : bool, false (default)
                     Description
                     If true, the Python script will be run with data parallelism. Data parallelism is only
                     applicable to pyqRowEval, pyqGroupEval, and pyqIndexEval.
                     If false, the Python script will not be run with data parallelism.

oml_service_level    Syntax
                     oml_service_level : string, allowed values: 'LOW' (default), 'MEDIUM', 'HIGH'
                     Description
                     Controls the different levels of performance and concurrency in Autonomous
                     Database.

Examples
• Input data is pandas.DataFrame:

par_lst => '{"oml_input_type":"pandas.DataFrame"}'

• Drop rows with missing values from input data:

par_lst => '{"oml_na_omit":true}'

• Submit a job in asynchronous mode:

par_lst => '{"oml_async_flag":true}'

• Use MEDIUM service level:

par_lst => '{"oml_service_level":"MEDIUM"}'

10.6.5 Output Formats (Autonomous Database)


The OUT_FMT parameter controls the format of output returned by the table functions pyqEval,
pyqGroupEval, pyqIndexEval, pyqRowEval, pyqTableEval, and pyqJobResult.

The output formats are:


• JSON

• Relational
• XML
• PNG
• Asynchronous Mode Output

JSON
When OUT_FMT is set to JSON, the table functions return a table containing a CLOB that is a
JSON string.
The following example invokes the pyqEval function on the script 'pyqFun1' created in the pyqEval
function section.
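
That script is not repeated in this section; based on the output shown below, it is a function of roughly
the following shape (a hypothetical sketch, not the stored definition):

def pyq_fun_1():
    import pandas as pd
    # Build a ten-row demonstration data set; the SQL API serializes the
    # returned DataFrame as the JSON string shown in the output below.
    return pd.DataFrame({"ID": list(range(10)),
                         "NAME": ["demo_" + str(i) for i in range(10)],
                         "FLOAT": [i / 10 for i in range(10)]})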

SQL> select *
from table(pyqEval(
par_lst => '{"oml_service_level":"MEDIUM"}',
out_fmt => 'JSON',
scr_name => 'pyqFun1'));

NAME
----------------------------------------------------------------------
VALUE
----------------------------------------------------------------------
[{"FLOAT":0,"ID":0,"NAME":"demo_0"},{"FLOAT":0.1,"ID":1,"NAME":"demo_1
"},{"FLOAT":0.2,"ID":2,"NAME":"demo_2"},{"FLOAT":0.3,"ID":3,"NAME":"de
mo_3"},{"FLOAT":0.4,"ID":4,"NAME":"demo_4"},{"FLOAT":0.5,"ID":5,"NAME"
:"demo_5"},{"FLOAT":0.6,"ID":6,"NAME":"demo_6"},{"FLOAT":0.7,"ID":7,"N
AME":"demo_7"},{"FLOAT":0.8,"ID":8,"NAME":"demo_8"},{"FLOAT":0.9,"ID":
9,"NAME":"demo_9"}]

1 row selected.

Relational
When OUT_FMT is specified with a JSON string where column names are mapped to column
types, the table functions return the response by reshaping it into table columns.
For example, if OUT_FMT is specified with {"NAME":"varchar2(7)", "DIFF":"number"}, the
output should contain a NAME column of type VARCHAR2(7) and a DIFF column of type NUMBER.
The following example uses the table GRADE and the script 'computeGradeDiff' (created in
Asynchronous Jobs (Autonomous Database)) and invokes the computeGradeDiff function:

SQL> select *
from table(pyqTableEval(
inp_nam => 'GRADE',
par_lst => '{"oml_input_type":"pandas.DataFrame"}',
out_fmt => '{"NAME":"varchar2(7)","DIFF":"number"}',
scr_name => 'computeGradeDiff'));

NAME DIFF
------- ----------
Abbott 3
Branfor -5

Crandel 10
Denniso 13
Edgar 9
Faust 5
Greeley -9
Hart 4
Isley 2
Jasper 8

10 rows selected.
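
For reference, a user-defined function that produces this shape of result can be as simple as the
following sketch (hypothetical; the actual computeGradeDiff script is created in the Asynchronous Jobs
(Autonomous Database) topic):

def compute_grade_diff(dat):
    import pandas as pd
    # dat is the GRADE table as a pandas.DataFrame (oml_input_type is
    # "pandas.DataFrame") with NAME, SCORE, and FINALGRADE columns.
    return pd.DataFrame({"NAME": dat["NAME"],
                         "DIFF": dat["SCORE"] - dat["FINALGRADE"]})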

XML
When OUT_FMT is specified with XML, the table functions return the response in a table with fixed
columns. The output consists of two columns. The NAME column contains the name of the row.
The NAME column value is NULL for pyqEval, pyqTableEval, and pyqRowEval function returns. For
pyqGroupEval and pyqIndexEval, the NAME column value is the group/index name. The VALUE
column contains the XML string.
The XML can contain both structured data and images, with structured or semi-structured
Python objects first, followed by the image or images generated by the Python function.
Images are returned as a base 64 encoding of the PNG representation. To include images in
the XML string, the special control argument oml_graphics_flag must be set to true.

In the following code, the Python function gen_two_images is defined and stored with the name
plotTwoImages in the script repository. The function renders two plots with red and blue dots
and returns the number of columns of the input data.

begin
sys.pyqScriptCreate('plotTwoImages','def gen_two_images (dat):
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(22)
fig = plt.figure(1);
fig2 = plt.figure(2);
ax = fig.add_subplot(111);
ax.set_title("Random red dots")
ax2 = fig2.add_subplot(111);
ax2.set_title("Random blue dots")
ax.plot(range(100), np.random.normal(size=100), marker = "o",
color = "red", markersize = 2)
ax2.plot(range(100,0,-1), marker = "o", color = "blue",
markersize = 2)
return dat.shape[1]
',FALSE,TRUE);
end;
/

The following example shows the XML output of a pyqRowEval function call where both
structured data and images are included in the result:

SQL> select *
from table(pyqRowEval(
inp_nam => 'GRADE',
par_lst => '{"oml_graphics_flag":true}',
out_fmt => 'XML',
row_num => 5,
scr_name => 'plotTwoImages'
));

NAME
------------------------------------------------------------------------------
--
VALUE
----------------------------------------------------------------------
1
<root><Py-data><int>7</int></Py-data><images><image><img src="data:ima
ge/pngbase64"><![CDATA[iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAA
ABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAADh0RVh0U29mdHdhcmUAb
WF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAA
gAElEQVR4nOydeZwcVb32n549k0xCSMhGEohhEZFNUAEBE0UUIYOACG4gFxWvgGzqldf3s
lz1xYuKLBe3i7LcNyhctoxsviCJoAQFNAKCCLITQyCQbZJMZqb

2
<root><Py-data><int>7</int></Py-data><images><image><img src="data:ima
ge/pngbase64"><![CDATA[iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAA
ABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAADh0RVh0U29mdHdhcmUAb
WF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAA
gAElEQVR4nOydeZwcVb32n549k0xCSMhGEohhEZFNUAEBE0UUIYOACG4gFxWvgGzqldf3s
lz1xYuKLBe3i7LcNyhctoxsviCJoAQFNAKCCLITQyCQbZJMZqb
2 rows selected

PNG
When OUT_FMT is specified with PNG, the table functions return the response in a table with fixed
columns (including an image bytes column). When calling the SQL API, you must set the
special control argument oml_graphics_flag to true so that the web server can capture
images rendered in the executed script.
The PNG output consists of five columns. The NAME column contains the name of the row. The
NAME column value is NULL for pyqEval and pyqTableEval function returns. For pyqRowEval,
pyqGroupEval, and pyqIndexEval, the NAME column value is the chunk/group/index name. The ID
column indicates the ID of the image. The VALUE column contains the return value of the
executed script. The TITLE column contains the titles of the rendered PNG images. The IMAGE
column is a BLOB column containing the bytes of the PNG images rendered by the executed
script.
The following example shows the PNG output of a pyqRowEval function call.

SQL> column name format a7
column value format a5
column title format a16
column image format a15
select *
from table(pyqRowEval(
inp_nam => 'GRADE',
par_lst => '{"oml_graphics_flag":true}',
out_fmt => 'PNG',
row_num => 5,
scr_name => 'plotTwoImages',
scr_owner => NULL
));

NAME ID VALUE TITLE IMAGE


-------- --------- ----- ---------------- ---------------
CHUNK_1 1 7 Random red dots 6956424F5277304
B47676F41414141
4E5355684555674
141416F41414141
486743415941414
1413130647A6B41
41414142484E435
356514943416749
6641686B6941414
141416C7753466C
7A41414150

CHUNK_1 2 7 Random blue dots 6956424F5277304


B47676F41414141
4E5355684555674
141416F41414141
486743415941414
1413130647A6B41
41414142484E435
356514943416749
6641686B6941414
141416C7753466C
7A41414150

CHUNK_2 1 7 Random red dots 6956424F5277304


B47676F41414141
4E5355684555674
141416F41414141
486743415941414
1413130647A6B41
41414142484E435
356514943416749
6641686B6941414
141416C7753466C
7A41414150

CHUNK_2 2 7 Random blue dots 6956424F5277304


B47676F41414141
4E5355684555674
141416F41414141
486743415941414
1413130647A6B41
41414142484E435
356514943416749
6641686B6941414
141416C7753466C
7A41414150

4 rows selected.

Asynchronous Mode Output


When you set oml_async_flag to true to run an asynchronous job, set OUT_FMT to NULL for
jobs that return non-XML results, or set it to XML for jobs that return XML results, as described
below.
See also oml_async_flag Argument.
Asynchronous Mode: Non-XML Output
When submitting asynchronous jobs, for JSON, PNG, and relational outputs, set OUT_FMT to
NULL when submitting the job. When fetching the job result, specify OUT_FMT in the
pyqJobResult call.
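
The example below invokes a script named inp_twoimgs, which is not created in this guide. Judging
from the result it returns, it behaves much like the plotTwoImages script shown earlier; a hypothetical
sketch of such a function is:

def inp_two_imgs(dat):
    import numpy as np
    import matplotlib.pyplot as plt
    # Render two figures so that both images are captured when
    # oml_graphics_flag is true.
    np.random.seed(22)
    for color, title in (("red", "Random red dots"),
                         ("blue", "Random blue dots")):
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.set_title(title)
        ax.plot(range(100), np.random.normal(size=100), marker="o",
                color=color, markersize=2)
    # Return the number of columns of the input data.
    return dat.shape[1]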

The following example shows how to get the PNG output from an asynchronous
pyqGroupEval function call:

SQL> select *
from table(pyqGroupEval(
inp_nam => 'GRADE',
par_lst => '{"oml_async_flag":true, "oml_graphics_flag":true}',
out_fmt => NULL,
grp_col => 'GENDER',
ord_col => NULL,
scr_name => 'inp_twoimgs',
scr_owner => NULL
));

NAME
--------------------------------------------------------------------
VALUE
--------------------------------------------------------------------

https://<host name>/oml/tenants/<tenant name>/databases/<database


name>/api/py-scripts/v1/jobs/<job id>

1 row selected.

SQL> select * from pyqJobStatus(


job_id => '<job id>');

NAME
--------------------------------------------------------------------
VALUE
--------------------------------------------------------------------

https://<host name>/oml/tenants/<tenant name>/databases/<database


name>/api/py-scripts/v1/jobs/<job id>/result

1 row selected.

SQL> column name format a7


column value format a5
column title format a16
column image format a15
select * from pyqJobResult(
job_id => '<job id>',
out_fmt => 'PNG'
);

NAME ID VALUE TITLE IMAGE


------- ---------- ----- ---------------- ---------------
GROUP_F 1 7 Random red dots 6956424F5277304
B47676F41414141
4E5355684555674
141416F41414141
486743415941414
1413130647A6B41
41414142484E435
356514943416749
6641686B6941414
141416C7753466C
7A4141415059514
141443245427144
2B6E61514141414
468305256683055
32396D644864686
36D554162574630
634778766447787
0596942325A584A
7A615739754D793
4784C6A49734947

GROUP_F 2 7 Random blue dots 6956424F5277304


B47676F41414141
4E5355684555674
141416F41414141
486743415941414
1413130647A6B41
41414142484E435
356514943416749
6641686B6941414
141416C7753466C
7A4141415059514
141443245427144
2B6E61514141414
468305256683055
32396D644864686
36D554162574630
634778766447787
0596942325A584A
7A615739754D793

4784C6A49734947

GROUP_M 1 7 Random red dots 6956424F5277304


B47676F41414141
4E5355684555674
141416F41414141
486743415941414
1413130647A6B41
41414142484E435
356514943416749
6641686B6941414
141416C7753466C
7A4141415059514
141443245427144
2B6E61514141414
468305256683855
32396D644864686
36D554162574630
634778766447787
0596942325A584A
7A615739754D793
4784C6A49734947

GROUP_M 2 7 Random blue dots 6956424F5277304


B47676F41414141
4E5355684555674
141416F41414141
486743415941414
1413130647A6B41
41414142484E435
356514943416749
6641686B6941414
141416C7753466C
7A4141415059514
141443245427144
2B6E61514141414
468305256683055
32396D644864686
36D554162574630
634778766447787
0596942325A584A
7A615739754D793
4784C6A49734947

4 rows selected

Asynchronous Mode: XML Output


If XML output is expected from the asynchronous job, you must set OUT_FMT to XML when
submitting the job and fetching the job result.
The following example shows how to get the XML output from an asynchronous pyqIndexEval
function call.
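
The call invokes a script named idx_ret_df, which is not created in this guide. A hypothetical sketch
consistent with the XML result shown below, returning a one-row pandas.DataFrame with an ID and a
letter keyed to the run index, is:

def idx_ret_df(index):
    import string
    import pandas as pd
    # Map run index 1, 2, 3, ... to the letters a, b, c, ...
    return pd.DataFrame({"ID": [index],
                         "RES": [string.ascii_lowercase[(index - 1) % 26]]})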

SQL> select *
from table(pyqIndexEval(
par_lst => '{"oml_async_flag":true}',
out_fmt => 'XML',
times_num => 3,
scr_name => 'idx_ret_df',
scr_owner => NULL
));

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--
https://<host name>/oml/tenants/<tenant name>/databases/<database
name>/api/py-scripts/v1/jobs/<job id>

1 row selected.

SQL> select * from pyqJobStatus(


job_id => '<job id>'
);

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--

https://<host name>/oml/tenants/<tenant name>/databases/<database


name>/api/py-scripts/v1/jobs/<job id>/result

1 row selected.

SQL> select * from pyqJobResult(


job_id => '<job id>',
out_fmt => 'XML'
);

NAME
------------------------------------------------------------------------------
--
VALUE
------------------------------------------------------------------------------
--
1
<root><pandas_dataFrame><ROW-
pandas_dataFrame><ID>1</ID><RES>a</RES></ROW-pandas
_dataFrame></pandas_dataFrame></root>

2
<root><pandas_dataFrame><ROW-
pandas_dataFrame><ID>2</ID><RES>b</RES></ROW-pandas
_dataFrame></pandas_dataFrame></root>

3
<root><pandas_dataFrame><ROW-
pandas_dataFrame><ID>3</ID><RES>c</RES></ROW-pandas
_dataFrame></pandas_dataFrame></root>

3 rows selected

11
Administrative Tasks for Oracle Machine
Learning for Python
If you find that your Python process is consuming too many of your machine's resources, or
causing your machine to crash, you can get information about, or set limits for, the resources
Python is using.
The Python system and process utilities library psutil is a cross-platform library for retrieving
information on running processes and system utilization, such as CPU, memory, disks,
network, and sensors, in Python. It is useful for system monitoring, profiling, limiting process
resources, and the management of running processes.
The function psutil.Process.rlimit gets or sets process resource limits. In psutil, process
resource limits are constants with names beginning with psutil.RLIMIT_. Each resource is
controlled by a soft limit and hard limit tuple.
For example, psutil.RLIMIT_AS represents the maximum size (in bytes) of the virtual memory
(address space) used by the process. The default limit of psutil.RLIMIT_AS can be -1
(psutil.RLIM_INFINITY). You can lower the resource limit of psutil.RLIMIT_AS to prevent
your Python program from loading too much data into memory, as shown in the following
example.
Example 11-1 Resource Control with psutil.RLIMIT_AS

import psutil
import numpy as np

# Get the current OS process.


p = psutil.Process()

# Get a list of available resources.


[attr for attr in dir(psutil) if attr[:7] == 'RLIMIT_']

# Display the Virtual Memory Size of the current process.


p.memory_info().vms

# Get the process resource limit RLIMIT_AS.


soft, hard = p.rlimit(psutil.RLIMIT_AS)
print('Original resource limits of RLIMIT_AS (soft/hard): {}/{}'.format(soft,
hard))

# Check the constant used to represent the limit for an unlimited resource.
psutil.RLIM_INFINITY

# Set resource RLIMIT_AS (soft, hard) limit to (1GB, 2GB).


p.rlimit(psutil.RLIMIT_AS, (pow(1024,3)*1, pow(1024,3)*2))

# Get the current resource limit of RLIMIT_AS.


cur_soft, cur_hard = p.rlimit(psutil.RLIMIT_AS)
print('Current resource limits of RLIMIT_AS (soft/hard): {}/{}'.format(cur_soft, cur_hard))

# Define a list of sizes to be allocated in MB (megabytes).


sz = [100, 200, 500, 1000]

# Define a megabyte variable in bytes.


MB = 1024*1024

# Allocate an increasing amount of data.


for val in sz:
stmt = "Allocate %s MB " % val
try:
print("virtual memory: %d MB" % int(p.memory_info().vms/MB))
m = np.arange(val*MB/8, dtype="u8")
print(stmt + " Success.")
except:
print(stmt + " Fail.")
raise

# Delete the allocated variable.


del m

# Raise the soft limit of RLIMIT_AS to 2GB.


p.rlimit(psutil.RLIMIT_AS, (pow(1024,3)*2, pow(1024,3)*2))

# Get the current resource limit of RLIMIT_AS.


cur_soft, cur_hard = p.rlimit(psutil.RLIMIT_AS)
print('Current resource limits of RLIMIT_AS (soft/hard): {}/{}'.format(cur_soft, cur_hard))

# Retry: allocate an increasing amount of data.


for val in sz:
stmt = "Allocate %s MB " % val
try:
print("virtual memory: %d MB" % int(p.memory_info().vms/MB))
m = np.arange(val*MB/8, dtype="u8")
print(stmt + " Success.")
except:
print(stmt + " Fail.")
raise

Listing for This Example

>>> import psutil


>>> import numpy as np
>>>
>>> # Get the current OS process.
... p = psutil.Process()
>>>
>>> # Get a list of available resources.
... [attr for attr in dir(psutil) if attr[:7] == 'RLIMIT_']
['RLIMIT_AS', 'RLIMIT_CORE', 'RLIMIT_CPU', 'RLIMIT_DATA',
'RLIMIT_FSIZE', 'RLIMIT_LOCKS', 'RLIMIT_MEMLOCK', 'RLIMIT_MSGQUEUE',
'RLIMIT_NICE', 'RLIMIT_NOFILE', 'RLIMIT_NPROC', 'RLIMIT_RSS',
'RLIMIT_RTPRIO', 'RLIMIT_RTTIME', 'RLIMIT_SIGPENDING', 'RLIMIT_STACK']

>>>
>>> # Display the Virtual Memory Size of the current process.
... p.memory_info().vms
413175808
>>>
>>> # Get the process resource limit RLIMIT_AS.
... soft, hard = p.rlimit(psutil.RLIMIT_AS)
>>> print('Original resource limits of RLIMIT_AS (soft/hard): {}/
{}'.format(soft, hard))
Original resource limits of RLIMIT_AS (soft/hard): -1/-1
>>>
>>> # Check the constant used to represent the limit for an unlimited
resource.
... psutil.RLIM_INFINITY
-1
>>>
>>> # Set the resource RLIMIT_AS (soft, hard) limit to (1GB, 2GB).
... p.rlimit(psutil.RLIMIT_AS, (pow(1024,3)*1, pow(1024,3)*2))
>>>
>>> # Get the current resource limit of RLIMIT_AS.
... cur_soft, cur_hard = p.rlimit(psutil.RLIMIT_AS)
>>> print('Current resource limits of RLIMIT_AS (soft/hard): {}/
{}'.format(cur_soft, cur_hard))
Current resource limits of RLIMIT_AS (soft/hard): 1073741824/2147483648
>>>
>>> # Define a list of sizes to be allocated in MB (megabytes).
... sz = [100, 200, 500, 1000]
>>>
>>> # Define a megabyte variable in bytes.
... MB = 1024*1024
>>>
>>> # Allocate an increasing amount of data.
... for val in sz:
... stmt = "Allocate %s MB " % val
... try:
... print("virtual memory: %d MB" % int(p.memory_info().vms/MB))
... m = np.arange(val*MB/8, dtype="u8")
... print(stmt + " Success.")
... except:
... print(stmt + " Fail.")
... raise
...
virtual memory: 394 MB
Allocate 100 MB Success.
virtual memory: 494 MB
Allocate 200 MB Success.
virtual memory: 594 MB
Allocate 500 MB Fail.
Traceback (most recent call last):
File "<stdin>", line 6, in <module>
MemoryError
>>>
>>> # Delete the allocated variable.
... del m
>>>
>>> # Raise the soft limit of RLIMIT_AS to 2GB.

... p.rlimit(psutil.RLIMIT_AS, (pow(1024,3)*2, pow(1024,3)*2))


>>>
>>> # Get the current resource limit of RLIMIT_AS.
... cur_soft, cur_hard = p.rlimit(psutil.RLIMIT_AS)
>>> print('Current resource limits of RLIMIT_AS (soft/hard): {}/
{}'.format(cur_soft, cur_hard))
Current resource limits of RLIMIT_AS (soft/hard): 2147483648/2147483648
>>>
>>> # Retry: allocate an increasing amount of data.
... for val in sz:
... stmt = "Allocate %s MB " % val
... try:
... print("virtual memory: %d MB" % int(p.memory_info().vms/MB))
... m = np.arange(val*MB/8, dtype="u8")
... print(stmt + " Success.")
... except:
... print(stmt + " Fail.")
... raise
...
virtual memory: 458 MB
Allocate 100 MB Success.
virtual memory: 558 MB
Allocate 200 MB Success.
virtual memory: 658 MB
Allocate 500 MB Success.
virtual memory: 958 MB
Allocate 1000 MB Success.

Index
Numerics classes (continued)
machine learning, 8-2
3rd party package, 5-13 oml.ai, 8-18
3rd party packages, 5-9 oml.ar, 8-21
oml.automl.AlgorithmSelection, 9-6
oml.automl.FeatureSelection, 9-8
A oml.automl.ModelSelection, 9-15
ADMIN, 5-9 oml.automl.ModelTuning, 9-11
algorithm selection class, 9-6 oml.dt, 8-11, 8-27
algorithms oml.em, 8-34
Apriori, 8-21 oml.esa, 8-48
attribute importance, 8-18 oml.glm, 8-53
Automated Machine Learning, 9-1 oml.graphics, 7-31
Automatic Data Preparation, 8-11 oml.km, 8-63
automatically selecting, 9-15 oml.nb, 8-69
Decision Tree, 8-27 oml.nn, 8-77
Expectation Maximization, 8-34 oml.rf, 8-86
Explicit Semantic Analysis, 8-48 oml.svd, 8-94
Generalized Linear Model, 8-53 oml.svm, 8-100
k-Means, 8-63 classification algorithm, 8-86
machine learning, 8-2 classification models, 8-11, 8-27, 8-53, 8-69, 8-77,
Minimum Description Length, 8-18 8-86, 8-100
Naive Bayes, 8-69 client
Neural Network, 8-77 installing for Linux for Autonomous Database,
Random Forest, 8-86 2-1
settings common to all, 8-4 installing for Linux on-premises, 3-18
Singular Value Decomposition, 8-94 clustering models, 8-34, 8-48, 8-63
Support Vector Machine, 8-100 conda environment, 5-9
ALL_PYQ_DATASTORE_CONTENTS view, 10-7 connection
ALL_PYQ_DATASTORES view, 10-8 creating a on-premises database, 6-4
ALL_PYQ_SCRIPTS view, 10-9 functions, 6-2
anomaly detection models, 8-100 control arguments, 10-12
Apriori algorithm, 8-21 convert Python to SQL, 1-4
attribute importance, 8-18 creating
Automated Machine Learning proxy objects, 6-13, 6-16
about, 9-1 cx_Oracle package, 6-2
Automatic Data Preparation algorithm, 8-11 cx_Oracle.connect function, 6-2
Automatic Machine Learning
connection parameter, 6-2 D
Autonomous Database, 6-1
data
about moving, 6-9
C exploring, 7-17
classes filtering, 7-13
Automated Machine Learning, 9-1 preparing, 7-1
GlobalFeatureImportance, 8-12 selecting, 7-3


data parallel processing, 10-12 functions (continued)


database oml.disconnect, 6-2, 6-4
connecting to an on-premises, 6-4 oml.do_eval, 10-14
datastores oml.drop, 6-16
about, 6-20 oml.ds.delete, 6-28
database views for, 10-7, 10-8, 10-10 oml.ds.describe, 6-27
deleting objects, 6-28 oml.ds.dir, 6-25
describing objects in, 6-27 oml.ds.load, 6-24
getting information about, 6-25 oml.ds.save, 6-21
granting or revoking access to, 6-30 oml.grant, 6-30
loading objects from, 6-24 oml.group_apply, 10-18
saving objects in, 6-21 oml.hist, 7-31
DCLI oml.index_apply, 10-26
Exadata, 4-5 oml.isconnected, 6-2, 6-4
python, 4-2 oml.row_apply, 10-22
Decision Tree algorithm, 8-27 oml.script.create, 10-28
Distributed Command Line Interface, 4-2 oml.script.dir, 10-32
Download environment from object storage, 5-13 oml.script.drop, 10-34
dropping oml.script.load, 10-33
tables, 6-16 oml.set_connection, 6-2
oml.sync, 6-13
oml.table_apply, 10-15
E pyqEval, 10-37
EM model, 8-34 pyqGroupEval, 10-47
Embedded Python Execution pyqRowEval, 10-43
about, 10-12 pyqTableEval, 10-40
about the SQL interface for, 10-37
SQL interface for, 10-36, 10-55 G
ESA model, 8-48
Exadata, 4-1 GLM models, 8-53
compute nodes, 4-2 granting
DCLI, 4-5 access to scripts and datastores, 6-30
Expectation Maximization algorithm, 8-34 user privileges, 3-13
explainability, 8-12 graphics
Explicit Semantic Analysis algorithm, 8-48 rendering, 7-31
exporting models, 8-7
I
F
importing models, 8-7
feature extraction algorithm, 8-48 installing
feature extraction class, 8-94 client for Linux for Autonomous Database, 2-1
feature selection class, 9-8 client for Linux on-premises, 3-18
function server for Linux on-premises, 3-6, 3-11
pyqGrant, 10-50, 10-95 Instant Client
functions installing for Linux on-premises, 3-17
cx_Oracle.connect, 6-2
Embedded Python Execution, 10-12
for graphics, 7-31
K
for managing user-defined Python functions, KM model, 8-63
10-28
oml.boxplot, 7-31
oml.check_embed, 6-2, 6-4 L
oml.connect, 6-2, 6-4 libraries in OML4Py, 1-5
oml.create, 6-16 Linux
oml.cursor, 6-9, 6-16 installing Python for, 3-1
oml.dir, 6-9, 6-13


Linux (continued) N
requirements, 3-1
uninstalling on-premises client for, 3-22 Naive Bayes model, 8-69
uninstalling on-premises server for, 3-16 Neural Network model, 8-77
Linux for Autonomous Database
installing client for, 2-1
Linux on-premises
O
installing client for, 3-18 oml_input_type argument, 10-12
installing Oracle Instant Client for, 3-17 oml_na_omit argument, 10-12
installing server for, 3-6, 3-11 oml.ai class, 8-18
supporting packages for, 3-4 oml.ar class, 8-21
oml.automl.AlgorithmSelection class, 9-6
M oml.automl.FeatureSelection class, 9-8
oml.automl.ModelSelection class, 9-15
machine learning oml.automl.ModelTuning class, 9-11
classes, 8-2 oml.boxplot function, 7-31
methods oml.check_embed function, 6-2, 6-4
drop, 7-13 oml.connect function, 6-2, 6-4
drop_duplicates, 7-13 oml.create function, 6-16
dropna, 7-13 oml.cursor function, 6-9, 6-16
for exploring data, 7-17 oml.dir function, 6-9, 6-13
for preparing data, 7-1 oml.disconnect function, 6-2, 6-4
pull, 6-11 oml.do_eval function, 10-14
Minimum Description Length algorithm, 8-18 oml.drop function, 6-16
model selection, 9-15 oml.ds.delete function, 6-28
model tuning, 9-11 oml.ds.describe function, 6-27
models oml.ds.dir function, 6-25
association rules, 8-21 oml.ds.load function, 6-24
attribute importance, 8-18 oml.ds.save function, 6-21
Decision Tree, 8-11, 8-27 oml.dt class, 8-11, 8-27
Expectation Maximization, 8-34 oml.em class, 8-34
explainability, 8-12 oml.esa class, 8-48
Explicit Semantic Analysis, 8-48 oml.glm class, 8-53
exporting and importing, 8-7 oml.grant function, 6-30
for anomaly detection, 8-100 oml.graphics class, 7-31
for classification, 8-11, 8-27, 8-53, 8-69, 8-77, oml.group_apply function, 10-18
8-86, 8-100 oml.hist function, 7-31
for clustering, 8-34, 8-63 oml.index_apply function, 10-26
for feature extraction, 8-48, 8-94 oml.isconnected function, 6-2, 6-4
for regression, 8-53, 8-77, 8-100 oml.km class, 8-63
Generalized Linear Model, 8-53 oml.nb class, 8-69
k-Means, 8-63 oml.nn class, 8-77
Naive Bayes, 8-69 oml.push function, 6-9
Neural Network, 8-77 oml.revoke function, 6-30
parametric, 8-53 oml.rf class, 8-86
persisting, 8-2 oml.row_apply function, 10-22
Random Forest, 8-86 oml.script.create function, 10-28
Singular Value Decomposition, 8-94 oml.script.dir function, 10-32
Support Vector Machine, 8-100 oml.script.drop function, 10-34
moving data oml.script.load function, 10-33
about, 6-9 oml.set_connection function, 6-2, 6-4
to a local Python session, 6-11 oml.svd class, 8-94
to the database, 6-9 oml.svm class, 8-100
oml.sync function, 6-13
oml.table_apply function, 10-15


OML4Py, 1-1, 4-1 regression models, 8-53, 8-77


Exadata, 4-5 requirements
on-premises client on-premises system, 3-1
installing, 3-16 resources
uninstalling, 3-22 managing, 11-1
on-premises server revoking
installing, 3-6 access to scripts and datastores, 6-30
uninstalling, 3-16 roles
on-premises system requirements, 3-1 PYQADMIN, 3-13
Oracle Machine Learning Notebooks, 6-1
Oracle Machine Learning Python interpreter, 6-1
Oracle wallets
S
about, 6-3 scoring new data, 1-2, 8-2
script repository
P granting or revoking access to, 6-30
managing user-defined Python functions in,
packages 10-28
supporting for Linux on-premises, 3-4 registering a user-defined function, 10-28
parallel processing, 10-12 scripts
parametric models, 8-53 pyquser, 3-14
PL/SQL procedures server
sys.pyqScriptCreate, 10-52 installing for Linux on-premises, 3-6, 3-11
sys.pyqScriptDrop, 10-54 settings
predict method, 8-69 about model, 8-4
predict.proba method, 8-69 Apriori algorithm, 8-21
privileges association rules, 8-21
required, 3-13 Automatic data preparation algorithm, 8-11
proxy objects, 1-4 Decision Tree algorithm, 8-27
for database tables, 6-13, 6-16 Expectation Maximization model, 8-34
storing, 6-20 Explicit Semantic Analysis algorithm, 8-48
pull method, 6-11 Generalized Linear Model algorithm, 8-53
PYQADMIN role, 3-13 k-Means algorithm, 8-63
pyqEval function, 10-37 Minimum Description Length algorithm, 8-18
pyqGrant function, 10-50, 10-95 Naive Bayes algorithm, 8-69
pyqGroupEval function, 10-47 Neural Network algorithm, 8-77
pyqRowEval function, 10-43 Random Forest algorithm, 8-86
pyqTableEval function, 10-40 shared algorithm, 8-4
pyquser.sql script, 3-14 Singular Value Decomposition algorithm, 8-94
Python, 4-1 sttribute importance, 8-18
installing for Linux, 3-1 Support Vector Machine algorithm, 8-100
libraries in OML4Py, 1-5 special control arguments, 10-12
requirements, 3-1 SQL APIs
version used, 1-5 pyqEval function, 10-37
Python interpreter, 6-1 pyqGrant function, 10-50, 10-95
Python objects pyqGroupEval function, 10-47
storing, 6-20 pyqRowEval function, 10-43
python packages, 5-9 pyqTableEval function, 10-40
Python to SQL conversion, 1-4 SQL to Python conversion, 1-4
supporting packages
for Linux on-premises, 3-4
R SVD model, 8-94
Random Forest algorithm, 8-86 SVM models, 8-100
ranking synchronizing database tables, 6-13
attribute importance, 8-18 sys.pyqScriptCreate procedure, 10-52
read privilege sys.pyqScriptDrop procedure, 10-54
granting or revoking, 6-30


T user-defined Python functions (continued)


users
tables creating new, 3-14
creating, 6-16
dropping, 6-13, 6-16
proxy objects for, 6-13, 6-16
V
task parallel processing, 10-12 views
transparency layer, 1-4 ALL_PYQ_DATASTORE_CONTENTS, 10-7
ALL_PYQ_DATASTORES, 10-8
U ALL_PYQ_SCRIPTS, 10-9
USER_PYQ_DATASTORES, 10-10
uninstalling USER_PYQ_SCRIPTS, 10-11
on-premises client, 3-22
on-premises server, 3-16
USER_PYQ_DATASTORES view, 10-10
W
USER_PYQ_SCRIPTS view, 10-11 wallets
user-defined Python functions about Oracle, 6-3
Embedded Python Execution of, 10-12
