Oracle Big Data Discovery: Installation Guide
Version 1.3.2 Revision A October 2016
Preface
Oracle Big Data Discovery is a set of end-to-end visual analytic capabilities that leverage the power of Apache
Spark to turn raw data into business insight in minutes, without the need to learn specialist big data tools or
rely only on highly skilled resources. The visual user interface empowers business analysts to find, explore,
transform, blend and analyze big data, and then easily share results.
Audience
This guide addresses administrators and engineers who need to install and deploy Big Data Discovery within
their existing Hadoop environment.
Conventions
The following conventions are used in this document.
Typographic conventions

This document distinguishes three typefaces: Code Sample (code fragments, commands, and other literal text), Variable (values you supply), and File Path (file and directory locations).
Path variable conventions

This document uses the following path variables.

$ORACLE_HOME
Indicates the absolute path to your Oracle home directory, the directory in which BDD is installed.

$BDD_HOME
Indicates the absolute path to your Oracle Big Data Discovery home directory, $ORACLE_HOME/BDD-<version>.

$DOMAIN_HOME
Indicates the absolute path to your WebLogic domain home directory. For example, if your domain is named bdd-<version>_domain, then $DOMAIN_HOME is $ORACLE_HOME/user_projects/domains/bdd-<version>_domain.

$DGRAPH_HOME
Indicates the absolute path to your Dgraph home directory.
Part I
Before You Install
Chapter 1
Introduction
The following sections describe Oracle Big Data Discovery and how it integrates with other software products.
They also describe some of the different cluster configurations Big Data Discovery supports.
The Big Data Discovery software package
Integration with Hadoop
Integration with WebLogic
Integration with Jetty
Cluster configurations and diagrams
A note about component names
Studio
Studio is Big Data Discovery's front-end web application. It provides tools that you can use to create and
manage data sets and projects, as well as administrator tools for managing end user access and other
settings. Studio stores its project data and the majority of its configuration in a relational database.
Studio is a Java-based application. It runs inside the WebLogic Server, along with the Dgraph Gateway.
Dgraph Gateway
The Dgraph Gateway is a Java-based interface that routes requests to the Dgraph instances and provides
caching and business logic. It also leverages Hadoop ZooKeeper to handle cluster services for the Dgraph
instances.
The Dgraph Gateway runs inside WebLogic Server, along with Studio.
Transform Service
The Transform Service processes end user-defined changes to data sets (called transformations) on behalf of
Studio. It enables you to preview the effects your transformations will have on your data before saving them.
The Transform Service is a web application that runs inside a Jetty container. It is separate from Studio and
the Dgraph Gateway.
Data Processing
Data Processing collectively refers to a set of processes and jobs that discover, sample, profile, and enrich
source data. Many of these processes run within Hadoop, so Data Processing must be installed on Hadoop
nodes.
Dgraph
The Dgraph indexes the data sets produced by Data Processing and stores them in databases on either
HDFS or a shared NFS. It also responds to end user queries for data routed to it by the Dgraph Gateway. It is
designed to be stateless, so each Dgraph instance can respond to queries independently of the others.
Which nodes can host the Dgraph instances depends on whether the databases are stored on HDFS or on an NFS. These nodes form a Dgraph cluster inside the BDD cluster.
You must have a supported Hadoop distribution installed on your cluster before installing BDD, as the configuration of your Hadoop cluster determines where some of the BDD components will be installed. However, Hadoop doesn't need to be on every node that will host BDD, as some BDD components don't require Hadoop to function. For more information, see Hadoop requirements on page 24.
Note: You can't connect BDD to more than one Hadoop cluster.
In a single-node deployment, all BDD and Hadoop components are hosted on the same node, and the Dgraph
databases are stored on the local filesystem.
Nodes 4 and 5 are running WebLogic Server, Studio, and the Dgraph Gateway. Having two of these
nodes provides a minimal level of redundancy for the Studio instances.
Remember that you aren't restricted to the above configuration: your cluster can contain as many Data
Processing, WebLogic Server, and Dgraph nodes as necessary. You can also co-locate WebLogic Server and
Hadoop on the same nodes, or host your databases on a shared NFS and run the Dgraph on its own node.
Be aware that these decisions may impact your cluster's overall performance and depend on your site's
resources and requirements.
Dgraph nodes: Your deployment must include at least one Dgraph instance. If there is more than one,
they run as a cluster within the BDD cluster. Having a cluster of Dgraphs is desirable because it
enhances high availability of query processing. Note that if your Dgraph databases are on HDFS, the
Dgraph must be installed on HDFS DataNodes.
Note: You can add and remove nodes from your Hadoop cluster without reinstalling BDD.
Chapter 2
Prerequisites
The following sections describe the hardware and software requirements your environment must meet before
you can install BDD.
Supported platforms
Hardware requirements
Memory requirements
Disk space requirements
Network requirements
Supported operating systems
Required Linux utilities
OS user requirements
Hadoop requirements
JDK requirements
Security options
Dgraph database requirements
Studio database requirements
Supported Web browsers
Screen resolution requirements
Studio support for iPad
Supported platforms
The following tables list the platforms and versions supported in each BDD release.
Note that this is not an exhaustive list of BDD's requirements. Be sure to read through the rest of this chapter
before installing for more information about the components and configuration changes BDD requires.
Hadoop distribution

For the Hadoop distributions and versions supported by release 1.3.x, see Hadoop requirements on page 24: CDH 5.5.x (min. 5.5.2), 5.6, 5.7.x (min. 5.7.1), and 5.8; HDP 2.3.4.17-5 and 2.4.x (min. 2.4.2); and MapR 5.1.

Oracle Big Data Appliance

Big Data Discovery version    Supported version(s)
1.0                           N/A
1.1.x                         4.3, 4.4
1.2.0                         4.4
1.2.2                         4.4, 4.5
1.3.x                         4.5, 4.6
Operating system

Big Data Discovery version    Supported version(s)
1.0                           OEL 6.4+; RHEL 6.4+
1.1.x                         OEL 6.4+, 7.1; RHEL 6.4+, 7.1
1.2.0                         OEL 6.4+, 7.1; RHEL 6.4+, 7.1
1.2.2                         OEL 6.4+, 7.1; RHEL 6.4+, 7.1
1.3.x                         OEL 6.4+, 7.1; RHEL 6.4+, 7.1
Application server

Big Data Discovery version    Supported version(s)
1.0                           WebLogic Server 12c (12.1.3)
1.1.x                         WebLogic Server 12c (12.1.3)
1.2.0                         WebLogic Server 12c (12.1.3)
1.2.2                         WebLogic Server 12c (12.1.3)
1.3.x                         WebLogic Server 12c (12.1.3)
Database server

Big Data Discovery version    Supported version(s)
1.0                           Oracle; MySQL 5.5.3+; Hypersonic (demo only)
1.1.x                         Oracle; MySQL 5.5.3+; Hypersonic (demo only)
1.2.0                         Oracle; MySQL 5.5.3+; Hypersonic (demo only)
1.2.2                         Oracle; MySQL 5.5.3+; Hypersonic (demo only)
1.3.x                         Oracle; MySQL 5.5.3+; Hypersonic (demo only)
Supported browsers

Big Data Discovery version    Supported browsers
1.2.0                         Internet Explorer 11, Firefox ESR, Chrome for Business, Safari Mobile 9.x
1.2.2                         Internet Explorer 11, Firefox ESR, Chrome for Business, Safari Mobile 9.x
1.3.x                         Internet Explorer 11, Firefox ESR, Chrome for Business, Safari Mobile 9.x
Hardware requirements
The hardware requirements for your BDD installation depend on the amount of data you will process. Oracle
recommends the following minimum requirements:
Note: In this guide, the term "x64" refers to any processor compatible with the AMD64/EM64T
architecture. You might need to upgrade your hardware, depending on the data you are processing.
All run-time code must fit entirely in RAM. Likewise, hard disk capacity must be sufficient based on the
size of your data set. Please contact your Oracle representative if you need more information on
sizing your hardware.
x86_64 dual-core CPU for Dgraph nodes
x86_64 quad-core CPU for WebLogic Managed Servers, which will run Studio and the Dgraph Gateway
Note: Oracle recommends turning off hyper-threading for Dgraph nodes. Because of the way the
Dgraph works, hyper-threading is actually detrimental to cache performance.
Memory requirements
The amount of RAM your system requires depends on the amount of data you plan on processing.
The following table lists the minimum amounts of RAM required to install BDD on each type of node.
Important: Be aware that these are the amounts required by the product itself and don't account for
storing or processing data; full-scale installations will require more. You should work with your Oracle
representative to determine an appropriate amount for your processing needs before installing.
Type of node    Requirements

WebLogic        16GB
                This breaks down into 5GB for WebLogic Server and 11GB for the Transform Service.
                Note that installing the Transform Service on WebLogic nodes is recommended, but not required. If you decide to host it on a different type of node, verify that it has enough RAM.

Dgraph          5GB
                If you're planning on storing your databases on HDFS, your Dgraph nodes should have 5GB of RAM plus the amount required by HDFS and any other Hadoop components running on them. For more information, see Dgraph database requirements on page 35.

YARN (Data Processing)    16GB
                Note that this is for the entire YARN cluster combined, not per node.
Network requirements
The hostname of each BDD machine must be externally-resolvable and accessible using the machine's IP
address. Oracle recommends using only Fully Qualified Domain Names (FQDNs).
Required Linux utilities

BDD requires the following utilities and settings on its nodes:

The default umask set to 022 on all BDD nodes, including Hadoop nodes.
curl 7.19.7+, with support for the --tlsv1.2 and --negotiate options. This must be installed on all
nodes that will host Studio.
Network Security Services (NSS) 3.16.1+ on all nodes that will host Studio.
nss-devel on all nodes that will host Studio. This contains the nss-config command, which must be
installed in /usr/bin.
nss-devel is included in Linux 6.7 and higher, but needs to be installed manually on older versions. To
see if it's installed, run:
sudo rpm -q nss-devel
If nss-devel is installed, the above command should return its version number. You should also verify
that nss-config is available in /usr/bin.
If you don't have nss-devel, install it by running:
sudo yum install nss-devel
1. Install Mail::Address:

(a) Download Mail::Address from http://pkgs.fedoraproject.org/repo/pkgs/perl-MailTools/MailTools-2.14.tar.gz/813ae849683367bb75e6be89e4e8cc46/MailTools-2.14.tar.gz.

(b) Extract MailTools-2.14.tar.gz:

tar -xvf MailTools-2.14.tar.gz

(c) Go to the extracted MailTools-2.14 directory and build and install the module:

perl Makefile.PL
make
make test
sudo make install
2. Install XML::Parser:

(a) Download XML::Parser from http://search.cpan.org/CPAN/authors/id/T/TO/TODDR/XML-Parser-2.44.tar.gz.

(b) Extract XML-Parser-2.44.tar.gz:

tar -xvf XML-Parser-2.44.tar.gz

(c) Build and install the module from the extracted XML-Parser-2.44 directory, as in step 1.
3. Install JSON-2.90:

(a) Download JSON-2.90 from http://search.cpan.org/CPAN/authors/id/M/MA/MAKAMAKA/JSON-2.90.tar.gz.

(b) Extract JSON-2.90.tar.gz:

tar -xvf JSON-2.90.tar.gz

(c) Build and install the module from the extracted JSON-2.90 directory, as in step 1.
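If you prefer to script these installs, the following sketch performs all three in one pass. It assumes curl is available, that the URLs listed above are still reachable, and that each module builds cleanly with the standard Makefile.PL workflow; adjust it as needed for your environment.

for url in \
  "http://pkgs.fedoraproject.org/repo/pkgs/perl-MailTools/MailTools-2.14.tar.gz/813ae849683367bb75e6be89e4e8cc46/MailTools-2.14.tar.gz" \
  "http://search.cpan.org/CPAN/authors/id/T/TO/TODDR/XML-Parser-2.44.tar.gz" \
  "http://search.cpan.org/CPAN/authors/id/M/MA/MAKAMAKA/JSON-2.90.tar.gz"
do
  # Download and unpack the module.
  curl -L -O "$url"
  tarball=$(basename "$url")
  tar -xvf "$tarball"
  # Build, test, and install it.
  ( cd "${tarball%.tar.gz}" && perl Makefile.PL && make && make test && sudo make install )
done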
OS user requirements
The entire installation must be performed by a single OS user, called the bdd user. After installing, this user
will run all BDD processes.
You must create this user or select an existing one to fill this role before installing. Although this document
refers to it as the bdd user, its name is arbitrary.
The user you choose must meet the following requirements:
It can't be the root user.
Its UID must be the same on all nodes in the cluster, including Hadoop nodes.
It must have passwordless sudo enabled on all nodes in the cluster, including Hadoop nodes.
It must have passwordless SSH enabled on all nodes in the cluster, including Hadoop nodes, so that it
can log into each node from the install machine. For instructions on enabling this, see Enabling
passwordless SSH on page 24.
It must have bash set as its default shell on all nodes in the cluster, including Hadoop nodes.
It must have permission to create the directory BDD will be installed in on all nodes in the cluster,
including Hadoop nodes. This directory is defined by the ORACLE_HOME property in the BDD configuration
file.
If your databases are located on HDFS, the bdd user has additional requirements. These are described in
Dgraph database requirements on page 35.
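As an illustration only, the following commands create a bdd user that satisfies these requirements on one node. The UID (1100), group name, and sudoers policy shown are placeholders to adapt to your site's standards, and the commands must be repeated with the same UID on every node, including Hadoop nodes.

# Run as root on each node.
groupadd -g 1100 bdd
useradd -u 1100 -g bdd -m -s /bin/bash bdd
# Grant passwordless sudo.
echo 'bdd ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/bdd
chmod 440 /etc/sudoers.d/bdd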
Enabling passwordless SSH

1.
Generate SSH keys on all nodes in the cluster, including Hadoop nodes.
2.
Copy the keys to the install machine to create known_hosts and authorized_keys files.
3.
Copy the known_hosts and authorized_keys files to all servers in the cluster.
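The exact commands depend on your environment; the sketch below illustrates the three steps above for a hypothetical three-node cluster, run as the bdd user from the install machine (all hostnames are placeholders).

NODES="bdd-node1 bdd-node2 bdd-node3"

# Step 1: generate a key pair for the bdd user on each node (no passphrase).
for h in $NODES; do
  ssh bdd@$h "ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa"
done

# Step 2: gather the public keys and host keys into authorized_keys and known_hosts.
for h in $NODES; do
  ssh bdd@$h "cat ~/.ssh/id_rsa.pub" >> authorized_keys
done
ssh-keyscan $NODES >> known_hosts

# Step 3: distribute both files to every node.
for h in $NODES; do
  scp authorized_keys known_hosts bdd@$h:~/.ssh/
  ssh bdd@$h "chmod 600 ~/.ssh/authorized_keys"
done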
Hadoop requirements
One of the following Hadoop distributions must be running on your cluster before you install BDD:
Cloudera Distribution for Hadoop (CDH) 5.5.x (min. 5.5.2), 5.6, 5.7.x (min. 5.7.1), 5.8. Enterprise edition is
recommended.
Hortonworks Data Platform (HDP) 2.3.4.17-5, 2.4.x (min. 2.4.2)
MapR Converged Data Platform (MapR) 5.1
Note: You can switch to a different version of your Hadoop distribution after installing BDD, if
necessary. See the Administrator's Guide for more information.
BDD doesn't require all of the components each distribution provides, and the components it does require
don't need to be installed on all nodes. The following table lists the required Hadoop components and the
node(s) they must be installed on.
Note: If you are installing on a single machine, that machine must be running all required Hadoop
components.
Component
Description
ZooKeeper
BDD uses ZooKeeper to manage the Dgraph instances and ensure high availability of
Dgraph query processing. ZooKeeper must be installed on at least one node in your
Hadoop cluster, although it doesn't have to be on any that will host BDD. For more
information on ZooKeeper and how it affects BDD's high availability, see the
Administrator's Guide.
All Managed Servers must be able to connect to a node running ZooKeeper.
HDFS/MapR-FS
The Hive tables that contain your source data are stored in HDFS. HDFS must be
installed on at least one node in your cluster.
You can also store your Dgraph databases on HDFS. If you choose to do this, the HDFS DataNode
service must be installed on all nodes that will run the Dgraph.
Note: MapR uses the MapR File System (MapR-FS) instead of standard HDFS,
although this document typically refers to HDFS only for simplicity. Any
requirements specific to MapR-FS will be called out explicitly.
HCatalog
The Data Processing Hive Table Detector monitors HCatalog for new and deleted tables
that require processing. HCatalog must be installed on at least one node in your Hadoop
cluster, although it doesn't have to be one that will host BDD.
Hive
All of your data is stored as Hive tables on HDFS. When BDD discovers a new or
modified Hive table, it launches a Data Processing workflow for that table.
Spark on YARN
BDD uses Spark on YARN to run all Data Processing jobs. Spark on YARN must be
installed on all nodes that will run Data Processing.
Hue
You can use Hue to load your source data into Hive and to view data exported from
Studio.
Note: HDP doesn't include Hue. If you have HDP, you must install Hue
separately and set the HUE_URI property in BDD's configuration file. You can
also use the bdd-admin script to update this property after installation, if
necessary. For more information, see the Administrator's Guide.
YARN
YARN worker nodes run all Data Processing jobs. YARN must be installed on all nodes
that will run Data Processing.
Note: Data Processing will automatically be installed on nodes running the following Hadoop
components:
Spark on YARN
YARN
HDFS
If you want to store your Dgraph databases on HDFS, the Dgraph must be installed on HDFS
DataNodes. For more information, see Dgraph database requirements on page 35.
You must also make a few changes within your Hadoop cluster to ensure that BDD can communicate with
your Hadoop nodes. These changes are described below.
YARN setting changes
Required Hadoop client libraries
Required HDP JARs
MapR-specific requirements
YARN setting changes

Property                                   Description

yarn.nodemanager.resource.memory-mb        The total amount of memory available to your entire YARN cluster. This should be at least 16GB, although you might need to set it higher depending on the amount of data you plan on processing.

yarn.scheduler.maximum-allocation-vcores   The maximum number of virtual cores that can be allocated to a single YARN container.

yarn.scheduler.maximum-allocation-mb       The maximum amount of memory (in MB) that can be allocated to a single YARN container.

Required Hadoop client libraries
/opt/mapr/zookeeper/zookeeper-3.4.5/lib
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/lib
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/lib
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/tools/lib
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib
Required HDP JARs

/usr/hdp/<version>/hive/lib/hive-metastore.jar
/usr/hdp/<version>/spark/lib/spark-assembly-1.2.1.2.3.X-hadoop2.6.0.2.3.X.jar

If any are missing, copy them over from one of your Hive or Spark nodes.
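For example, a missing JAR can be copied from a Hive node into the same path on the local node; the hostname below is a placeholder.

scp hive-node1:/usr/hdp/<version>/hive/lib/hive-metastore.jar /usr/hdp/<version>/hive/lib/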
MapR-specific requirements
If you have MapR, your system must meet a few additional requirements.
The MapR Client must be installed and added to the $PATH on all non-MapR nodes that will host the
Dgraph, Studio, and the Transform Service (if different from Studio nodes). Note that the Client isn't
required on these nodes if they host any MapR processes.
For instructions on installing the MapR Client, see Installing the MapR Client in MapR's documentation.
Pluggable authentication modules (PAMs) must be disabled for the installation.
The yarn.resourcemanager.hostname property in yarn-site.xml must be set to the fully-qualified
domain name (FQDN) of your YARN ResourceManager. For instructions on updating this property, see
Updating the YARN ResourceManager configuration on page 29.
The directories /user/HDFS_DP_USER_DIR/<bdd> and /user/HDFS_DP_USER_DIR/edp/data must
be either nonexistent or mounted with a volume. HDFS_DP_USER_DIR is defined in BDD's configuration
file, and <bdd> is the name of the bdd user.
The /opt/mapr/zkdata and /opt/mapr/zookeeper/zookeeper-3.4.5/logs directories must
have their permissions set to 755.
If you want to store your Dgraph databases on MapR-FS, the directory defined by DGRAPH_INDEX_DIR in
BDD's configuration file must be either nonexistent or mounted with a volume. Additionally, the MapR NFS
service must be installed on all nodes that will host the Dgraph. For more information, see HDFS on page
36.
The required Spark, ZooKeeper, and Hive patches must be installed as described in Applying the MapR
patches on page 29.
Updating the YARN ResourceManager configuration

The yarn.resourcemanager.hostname property is set to 0.0.0.0 by default. To update it, run the following command on the machine hosting
MCS:
/opt/mapr/server/configure.sh -C <cldb_host>[:<cldb_port>][,<cldb_host>[:<cldb_port>]...]
-Z <zk_host>[:<zk_port>][,<zk_host>[:<zk_port>]...] [-RM <rm_host>] [-HS <hs_host>] [-L <logfile>]
[-N <cluster_name>]
Where:
<cldb_host> and <cldb_port> are the FQDNs and ports of your container location database (CLDB)
nodes
<zk_host> and <zk_port> are the FQDNs and ports of your ZooKeeper nodes
<rm_host> is the FQDN of your ResourceManager
<hs_host> is the FQDN of your HistoryServer
<logfile> is the log file configure.sh will write to
<cluster_name> is the name of your MapR cluster
For more information on updating node configuration, see configure.sh in MapR's documentation.
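As an illustration, a cluster with one CLDB node, three ZooKeeper nodes, and a dedicated ResourceManager might be updated as follows. All hostnames, ports, and the cluster name are placeholders (7222 and 5181 are common MapR defaults, but use the ports your cluster actually runs on).

/opt/mapr/server/configure.sh \
  -C cldb1.example.com:7222 \
  -Z zk1.example.com:5181,zk2.example.com:5181,zk3.example.com:5181 \
  -RM yarn-rm1.example.com \
  -N my.cluster.com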
Applying the MapR patches

The patches are required to upgrade the versions of Spark, ZooKeeper, and Hive you have installed.
Otherwise, BDD won't be able to work with them.
To apply the patches:
1.
(b) Go to the directory you put the patches in and install each by running:
rpm -ivh <patch>
If the patches succeeded, your Spark nodes should contain the directory
/opt/mapr/spark/spark-1.6.1/.
2.
3.
(c) Go to MCS and restart the HiveServer 2, Hivemeta, and WebHcat services.
4.
JDK requirements
BDD requires one of the following JDK versions:
Note: BDD requires a JDK that includes the HotSpot JVM, which must support the MD5 algorithm.
These requirements will be met by any version you download using the following links, as long as you
don't select a version from the JRockit Family.
JDK 7u67+ x64
JDK 8u45+ x64
Also, be sure to set the $JAVA_HOME environment variable on all nodes. If you have multiple versions of the
JDK installed, be sure that this points to the correct one. If the path is set to or contains a symlink, the symlink
must be identical on all other nodes.
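A quick sanity check you can run on each node is shown below; it simply reports where $JAVA_HOME points and which JDK the java binary resolves to.

echo "$JAVA_HOME"
readlink -f "$(which java)"      # shows where the java binary (and any symlink) actually points
"$JAVA_HOME/bin/java" -version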
Security options
The following sections describe methods for securing your BDD cluster.
Additional information on BDD security is available in the Security Guide.
Kerberos
Sentry
TLS/SSL
HDFS data at rest encryption
Other security options
Kerberos
The Kerberos network authentication protocol enables client/server applications to identify one another in a
secure manner, even when communicating over an unsecured network.
In Kerberos terminology, individual applications are called principals. Each principal has a keytab file, which
contains its key, or password. When one principal wants to communicate with another, it presents its keytab
file for authentication and is only granted access to the other principal if its name and key are recognized.
Because keytab files are protected using strong encryption, this process still works over unsecured networks.
You can configure BDD to use Kerberos authentication for its communications with Hadoop. This is required if
Kerberos is already enabled in your Hadoop cluster, and strongly recommended for production environments
in general. BDD supports integration with Kerberos 5+.
Note: This procedure assumes you already have Kerberos enabled in your Hadoop cluster.
To enable Kerberos:
1.
2.
3.
Add the bdd user to the hdfs group on all BDD nodes.
4.
Create a principal for the bdd user. The primary component must be the name of the bdd user. The realm must be your default realm.
5.
Generate a keytab file for the BDD principal and copy it to the install machine.
The name and location of this file are arbitrary. The installer will rename it bdd.keytab and copy it to
all BDD nodes.
6.
Copy the krb5.conf file from one of your Hadoop nodes to the install machine.
The location you put it in is arbitrary. The installer will copy it to /etc on all BDD nodes.
7.
8.
You also need to manually configure Kerberos for the Transform Service after installing BDD. For instructions,
see Enabling Kerberos for the Transform Service on page 81.
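A sketch of steps 3 through 6 follows. It assumes an MIT Kerberos KDC (kadmin.local run on the KDC host), a default realm of EXAMPLE.COM, a bdd user named bdd, and a staging directory of /localdisk/bdd on the install machine; all of these names and the hostnames shown are placeholders.

# Step 3: add the bdd user to the hdfs group (repeat on every BDD node).
sudo usermod -aG hdfs bdd

# Steps 4 and 5: create the BDD principal, export its keytab, and copy the
# keytab to the install machine.
sudo kadmin.local -q "addprinc -randkey bdd@EXAMPLE.COM"
sudo kadmin.local -q "xst -k /tmp/bdd.keytab bdd@EXAMPLE.COM"
scp /tmp/bdd.keytab install-machine:/localdisk/bdd/

# Step 6: copy krb5.conf from a Hadoop node to the install machine.
scp hadoop-node1:/etc/krb5.conf install-machine:/localdisk/bdd/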
Sentry
Sentry provides role-based authorization in Hadoop clusters. Among other things, it can be used to restrict
access to Hive data at a granular level.
Oracle strongly recommends using Sentry to protect your data from outside users. If you already have it set up
in your Hadoop cluster, you must do a few things to enable BDD to work with it.
Note: The first two steps in this procedure are also required to enable Kerberos. If you've already
done them, you can skip them.
To enable Sentry:
1.
2.
If you haven't already, add the bdd user to the hive group.
3.
TLS/SSL
BDD can be installed on Hadoop clusters secured with TLS/SSL.
TLS/SSL can be configured for specific Hadoop services to encrypt communication between them. If you have
it enabled in Hadoop, you can enable it for BDD to encrypt its communications with your Hadoop cluster.
If your Hadoop cluster has TLS/SSL enabled, verify that your system meets the following requirements:
Kerberos is enabled for both Hadoop and BDD. Note that this isn't required, but is strongly recommended.
For more information, see Kerberos on page 31.
TLS/SSL is enabled in your Hadoop cluster for the HDFS, YARN, Hive, and/or Key Management Server
(KMS) services.
The KMS service is installed in your Hadoop cluster. You should have already done this as part of
enabling TLS/SSL.
To enable BDD to run on a Hadoop cluster secured with TLS/SSL:
1.
Export the public key certificates for all nodes running TLS/SSL-enabled HDFS, YARN, Hive, and/or
KMS.
You can do this with the following command:
keytool -exportcert -alias <alias> -keystore <keystore_filename> -file <export_filename>
Where:
<alias> is the certificate's alias.
<keystore_filename> is the absolute path to your keystore file. You can find this in Cloudera
Manager, Ambari, or MCS.
<export_filename> is the name of the file you want to export the keystore to.
2.
3.
When the installer runs, it imports the certificates to the custom truststore file, then copies the truststore to
$BDD_HOME/common/security/cacerts on all BDD nodes.
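For example, to export the certificate for an HDFS node and stage it on the install machine, you might run something like the following. The alias, keystore path, hostnames, and target directory are all placeholders; the target directory must match the HADOOP_CERTIFICATES_PATH value you set in bdd.conf.

keytool -exportcert -alias hdfs-node1 \
  -keystore /opt/cloudera/security/jks/node.jks \
  -file hdfs-node1.cer
scp hdfs-node1.cer install-machine:/localdisk/bdd/certs/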
HDFS data at rest encryption

2.
Grant the bdd user the GENERATE_EEK and DECRYPT_EEK privileges for the encryption and
decryption keys.
You can do this in Cloudera Manager, Ambari, or MCS by adding the following properties to the KMS
service's kms-acls.xml file. If you need help locating them, refer to your distribution's
documentation.
<property>
<name>key.acl.bdd_key.DECRYPT_EEK</name>
<value>bdd,hdfs supergroup</value>
<description>
ACL for DECRYPT_EEK operations on key 'bdd_key'.
</description>
</property>
<property>
<name>key.acl.bdd_key.GENERATE_EEK</name>
<value>bdd supergroup</value>
<description>
ACL for GENERATE_EEK operations on key 'bdd_key'.
</description>
</property>
Be sure to replace bdd in the above code with the name of the bdd user.
Also note that the hdfs user is included in the value of the DECRYPT_EEK property. This is required if
you're storing your Dgraph databases on HDFS, but can be omitted otherwise. For more information,
see Installing the HDFS NFS Gateway service on page 38.
Firewalls
Oracle recommends using a firewall to protect your network and BDD cluster from external entities. A firewall
limits traffic into and out of your network, creating a secure barrier around it. It can consist of a combination of
software and hardware, including routers and dedicated gateway machines.
There are multiple types of firewalls, so be sure to choose one that suits your resources and specific needs.
One option is to use a reverse proxy server as part of your firewall, which you can configure after installing
BDD. For instructions, see Using Studio with a Reverse Proxy on page 87.
TLS/SSL in Studio
You can enable TLS/SSL on Studio's outward-facing ports in one or both of the following ways:
Enable encryption through WebLogic Server. You can do this by setting WLS_SECURE_MODE to TRUE in
BDD's configuration file.
This method activates WebLogic's default demo keystores, which you should replace with your own
certificates after deployment. For more information, see Replacing certificates on page 84.
Set up a reverse-proxy server. For instructions on how to do this, see About reverse proxies on page 88.
Note: These methods don't enable encryption on the inward-facing port on which the Dgraph Gateway
listens for requests from Studio.
HDFS
Storing your databases on HDFS provides increased high availability for the Dgraph: the contents of the
databases are distributed across multiple nodes, so the Dgraph can continue to process queries if a node
goes down. It also increases the amount of data your databases can contain.
Note: This information also applies to MapR-FS.
To store your databases on HDFS, your system must meet the following requirements:
The HDFS DataNode service must be running on all nodes that will host the Dgraph. For best
performance, this should be the only Hadoop service running on your Dgraph nodes. In particular, the
Dgraph shouldn't be co-located with Spark, as both services require a lot of resources.
If you have to co-locate the Dgraph with Spark or any other Hadoop services, you should use cgroups to
isolate resources for it. For more information, see Setting up cgroups on page 36.
For best performance, configure short-circuit reads in HDFS. This enables the Dgraph to access the local
database files directly, rather than using the DataNode's network sockets to transfer the data. For
instructions, refer to the documentation for your Hadoop distribution.
The bdd user must have read and write permissions for the HDFS directory where the databases will be
stored. Be sure to set this on all Dgraph nodes.
If you have HDFS data at rest encryption enabled in Hadoop, you must store your databases in an
encryption zone. For more information, see HDFS data at rest encryption on page 34.
If you decide to not use the default HDFS mount point (the local directory where the Dgraph mounts the
HDFS root directory), make sure the one you use is empty and has read, write, and execute permissions
for the bdd user. This must be set on all Dgraph nodes.
Be sure to set the DGRAPH_HDFS_USE_MOUNT property in BDD's configuration file to TRUE.
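A minimal sketch of the permission setup described above, assuming the databases will live under /user/bdd/dgraph_databases (a placeholder that must match DGRAPH_INDEX_DIR) and that the bdd user is named bdd:

sudo -u hdfs hdfs dfs -mkdir -p /user/bdd/dgraph_databases
sudo -u hdfs hdfs dfs -chown -R bdd:bdd /user/bdd/dgraph_databases
# Adjust the mode to your security policy; the bdd user needs read and write access.
sudo -u hdfs hdfs dfs -chmod -R 755 /user/bdd/dgraph_databases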
Additionally, to enable the Dgraph to access its databases in HDFS, you must install either the HDFS NFS
Gateway (called MapR NFS in MapR) service or FUSE. The option you use depends on your Hadoop cluster:
You must use the NFS Gateway if have any of the following:
MapR
CDH 5.7.x or higher
HDFS data at rest encryption enabled
For more information, see Installing the HDFS NFS Gateway service on page 38.
In all other cases, you can use either FUSE or the NFS Gateway. For more information on FUSE, see
Installing FUSE on page 38.
Setting up cgroups
Control groups, or cgroups, are a Linux kernel feature that enable you to allocate resources like CPU time and
system memory to specific processes or groups of processes. If you need to host the Dgraph on nodes
running Spark, you should use cgroups to ensure sufficient resources are available to it.
Note: Installing the Dgraph on Spark nodes is not recommended and should only be done if
absolutely necessary.
To do this, you enable cgroups in Hadoop and create one for YARN that limits the amounts of CPU and
memory it can consume. You then create a separate cgroup for the Dgraph.
To set up cgroups:
1.
If your system doesn't currently have the libcgroup package, install it as root.
This creates /etc/cgconfig.conf, which is used to configure cgroups.
2.
3.
Create a cgroup for YARN. You must do this within Hadoop. For instructions, refer to the
documentation for your Hadoop distribution.
The YARN cgroup should limit the amounts of CPU and memory allocated to all YARN containers.
The appropriate limits to set depend on your system and the amount of data you will process. At a
minimum, you should reserve the following for the Dgraph:
5GB of RAM
2 CPU cores
The number of CPU cores YARN is allowed to use must be specified as a percentage. For example,
on a quad-core machine, YARN should only get two cores, or 50%. On an eight-core machine, YARN
could get up to six of them, or 75%. When setting this amount, remember that allocating more cores to
the Dgraph will boost its performance.
4.
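To illustrate the separate Dgraph cgroup, the fragment below appends a group named dgraph to /etc/cgconfig.conf. The group name and the CPU and memory values are examples only and should be sized for your nodes; the cgconfig service applies to OEL/RHEL 6, and systems that use systemd manage cgroups differently.

sudo tee -a /etc/cgconfig.conf <<'EOF'
group dgraph {
    cpu {
        # Relative CPU weight for the Dgraph group when the node is under contention.
        cpu.shares = 1024;
    }
    memory {
        # Upper bound on the group's memory; size it to what you allocate to the Dgraph (5GB here).
        memory.limit_in_bytes = 5368709120;
    }
}
EOF
sudo service cgconfig restart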
Installing the HDFS NFS Gateway service

After installing, the Dgraph will mount HDFS via the NFS Gateway when it starts.
Installing FUSE
Filesystem in Userspace (FUSE) enables unprivileged users to access filesystems without having to make
changes to the kernel. In the context of BDD, it enables the Dgraph to read and write data to HDFS by making
HDFS behave like a mountable local disk. The Dgraph supports FUSE 2.8+.
Note: FUSE isn't supported for Hadoop clusters that have MapR, CDH 5.7.x or higher, or HDFS data
at rest encryption.
If you're not using the HDFS NFS Gateway service, FUSE must be installed on all HDFS DataNodes that will
host the Dgraph. Additionally, the bdd user requires extra permissions to enable the Dgraph process to
integrate with FUSE, and socket timeouts in HDFS must be increased to prevent FUSE and the Dgraph from
crashing during parallel ingests.
To install FUSE:
1.
2.
Extract fuse-<version>.tar.gz:
tar xvf fuse-<version>.tar.gz
4.
5.
(c) Give the bdd user read and write permissions for /dev/fuse.
6.
NFS
If you don't want to store your databases on HDFS, you can keep them on a shared NFS.
Before installing, be sure that your NFS is properly set up and that all Dgraph nodes have read/write access to
it.
Additionally, the number of open file descriptors must be set to 65536 on all Dgraph nodes; for example, in /etc/security/limits.conf:

soft    nofile    65536
hard    nofile    65536
Studio database requirements

Studio requires a relational database to store its project data and configuration; the supported MySQL and Oracle versions are listed in Supported platforms.
If you want to use a Hypersonic database, the installer will create it for you. You can enable this in BDD's
configuration file.
Important: If you install in a demo environment with a Hypersonic database and later decide to scale
up to a production environment, you must reinstall BDD with one of the supported MySQL or Oracle
databases listed above.
Sample commands for production databases
Oracle database
You can use the following commands to create a user and schema for an Oracle 11g or 12c database.
CREATE USER <username> PROFILE "DEFAULT" IDENTIFIED BY <password> DEFAULT TABLESPACE "USERS"
TEMPORARY TABLESPACE "TEMP" ACCOUNT UNLOCK;
GRANT CREATE PROCEDURE TO <username>;
GRANT CREATE SESSION TO <username>;
GRANT CREATE SYNONYM TO <username>;
GRANT CREATE TABLE TO <username>;
GRANT CREATE VIEW TO <username>;
GRANT UNLIMITED TABLESPACE TO <username>;
GRANT CONNECT TO <username>;
GRANT RESOURCE TO <username>;
MySQL database
You can use the following commands to create a user and schema for a MySQL database.
Note: MySQL databases must use UTF-8 as the default character encoding.
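A minimal sketch of the MySQL setup follows, using placeholder database, user, and password names and the UTF-8 defaults required above; run it from any host that can reach the database server.

mysql -u root -p <<'EOF'
CREATE DATABASE studio DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
CREATE USER 'studio'@'%' IDENTIFIED BY 'studio_password';
GRANT ALL PRIVILEGES ON studio.* TO 'studio'@'%';
FLUSH PRIVILEGES;
EOF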
Part II
Installing Big Data Discovery
Chapter 3
Prerequisite checklist
Before installing, run through the following checklist to verify you've satisfied all prerequisites.
For more information on each prerequisite, refer to the relevant section in Prerequisites on page 14.
Prerequisite
Description
Hardware
Minimum requirements:
WebLogic nodes: quad-core CPU
Dgraph nodes: dual-core CPU
Note that these are the minimum amounts required to install BDD. A full-scale
installation will require more.
Memory
Minimum requirements:
Managed Servers: 16GB (5GB for WebLogic Server and 11GB for the Transform
Service)
Dgraph nodes: 5GB (excluding requirements for HDFS, if applicable)
YARN cluster: 16GB (combined)
Note that these are the minimum amounts required to install BDD. A full-scale
installation will require more.
Disk space
Minimum requirements:
30GB in ORACLE_HOME on all BDD nodes
20GB in TEMP_FOLDER_PATH on all BDD nodes
10GB in INSTALLER_PATH on the install machine
512MB swap space on the install machine and all Managed Servers
39GB virtual memory on all Transform Service nodes
Note that these are the minimum amounts required to install BDD. A full-scale
installation will require more.
Network

The hostname of each BDD machine can be externally resolved and accessed using the machine's IP address.

Operating system

OEL 6.4+, 7.1
RHEL 6.4+, 7.1
Linux utilities
In /bin: basename, cat, chgrp, chown, date, dd, df, mkdir, more, rm, sed, tar, true

In /usr/bin: awk, cksum, cut, dirname, expr, gzip, head, id, netcat, perl, printf, sudo, tail, tr, unzip, wc, which
Hadoop
Distributions:
CDH 5.5.x (min. 5.5.2), 5.6, 5.7.x (min. 5.7.1), 5.8
HDP 2.3.4.17-5, 2.4.x (min. 2.4.2)
MapR 5.1
Components:
Cluster manager: Cloudera Manager, Ambari, or MCS
ZooKeeper
HDFS
HCatalog
Hive
Spark on YARN
Hue
YARN
Spark on YARN, YARN, and HDFS are on all Data Processing nodes
YARN configuration has been updated
HDP-specific requirements

The required HDP JARs are present (see Required HDP JARs)

MapR-specific requirements
The MapR Client is installed on all non-MapR nodes that will host the Dgraph,
Studio, and the Transform Service
PAMs are disabled
The YARN Resource Manager IP is configured correctly on the machine hosting
MCS
The directories /user/HDFS_DP_USER_DIR/<bdd> and
/user/HDFS_DP_USER_DIR/edp/data are either nonexistent or mounted with a
volume
The permissions for the /opt/mapr/zkdata and
/opt/mapr/zookeeper/zookeeper-3.4.5/logs directories are set to 755
The required Spark, ZooKeeper, and Hive patches have been applied
JDK
JDK 7u67+
JDK 8u45+
The installed JDK contains the HotSpot JVM, which supports MD5
$JAVA_HOME set on all nodes
Kerberos
/user/<bdd_user> and /user/<HDFS_DP_USER_DIR> created in HDFS
bdd user is a member of the hive and hdfs groups
bdd principal and keytab file have been generated
bdd keytab file and krb5.conf are on the install machine
kinit and kdestroy are installed on BDD nodes
core-site.xml has been updated (HDP only)
Sentry

The bdd user is a member of the hive group

TLS/SSL

The public key certificates for the TLS/SSL-enabled Hadoop services have been exported and copied to the install machine
Dgraph databases
If stored on HDFS:
The HDFS DataNode service is on all Dgraph nodes
cgroups are set up, if necessary
(Optional) Short-circuit reads are enabled in HDFS
The bdd user has read and write permissions to the databases directory in
HDFS
If using a non-default mount point, it's empty and the bdd user has read, write,
and execute permissions for it
You installed either the HDFS NFS Gateway service or FUSE
If stored on an NFS:
The NFS is set up
All Dgraph nodes can write to it
The number of open file descriptors is set to 65536 on all Dgraph nodes
Studio database

A supported MySQL or Oracle database (or Hypersonic, for demo environments) has been created for Studio
Web browser
Firefox ESR
Internet Explorer 11 (compatibility mode not supported)
Chrome for Business
Safari 9+ (for mobile)
Chapter 4
QuickStart Installation
The BDD installer includes a quickstart option, which installs the software on a single machine with default
configuration suitable for a demo environment. You can use quickstart to install BDD quickly and easily,
without having to worry about setting it up yourself.
Important: Single-node installations can only be used for demo purposes; you can't host a production
environment on a single machine. If you want to install BDD in a production environment, see Cluster
Installation on page 56.
Before you can install BDD with quickstart, you must satisfy all of the prerequisites described in
Prerequisites on page 14, with a few exceptions:
You must use CDH. HDP and MapR aren't supported.
You must have a MySQL database.
You can't have Kerberos installed.
You can't use any existing Dgraph databases.
Note: If you want to install BDD on a single machine but need more control and flexibility than
quickstart offers, see Single-Node Installation on page 50.
Installing BDD with quickstart
Installing BDD with quickstart

The machine must also be running all of the required Hadoop components, including YARN and Hue.
To install BDD with quickstart:
1.
On your machine, create a new directory or choose an existing one to be the installation source
directory.
This directory must contain at least 10GB of free space.
2.
Within the installation source directory, create a new directory named packages.
3.
Download the BDD media pack from the Oracle Software Delivery Cloud.
Be sure to download all packages in the media pack. Make a note of each file's part number, as you
will need this to identify it later.
4.
Move the BDD installer, BDD binary, and WebLogic Server packages from the download location to
the packages directory.
5.
Rename the first BDD binary package bdd1.zip and the second bdd2.zip.
This ensures that the installer will recognize them.
6.
7.
Navigate back to the installation source directory and extract the BDD installer package:
unzip packages/<BDD_installer_package>.zip
This creates a new directory called installer, which contains the install script and other files it
requires.
8.
9.
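For orientation, steps 1 through 7 might look like the following on a hypothetical system; the directory, package file names, and part numbers are placeholders for the actual names in the media pack you downloaded.

mkdir -p /localdisk/bdd_install/packages
cd /localdisk/bdd_install

# Step 4: move the downloaded packages into the packages directory.
mv ~/Downloads/V<installer_part>.zip ~/Downloads/V<binary_part1>.zip \
   ~/Downloads/V<binary_part2>.zip ~/Downloads/V<wls_part>.zip packages/

# Step 5: rename the two BDD binary packages.
mv packages/V<binary_part1>.zip packages/bdd1.zip
mv packages/V<binary_part2>.zip packages/bdd2.zip

# Step 7: extract the installer into the installation source directory.
unzip packages/V<installer_part>.zip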
If the script succeeded, BDD is now installed under the current directory and ready for you to begin working
with it. See Post-Installation Tasks on page 75 to learn more about your installation and how to verify it.
If the script failed, see Troubleshooting a Failed Installation on page 71.
Chapter 5
Single-Node Installation
If you want to demo BDD before committing to a full-cluster installation, you can install it on a single node.
This gives you the chance to learn more about the software and see how it performs on a smaller scale. The
following sections describe how to get BDD running on your machine quickly and easily.
Important: Single-node installations can only be used for demo purposes; you can't host a production
environment on a single machine. If you want to install BDD in a production environment, see Cluster
Installation on page 56.
Installing BDD on a single node
Configuring a single-node installation
On your machine, create a new directory or choose an existing one to be the installation source
directory.
This directory must contain at least 10GB of free space.
2.
Within the installation source directory, create a new directory named packages.
3.
Download the BDD media pack from the Oracle Software Delivery Cloud.
Be sure to download all packages in the media pack. Make a note of each file's part number, as you
will need this to identify it later.
4.
Move the BDD installer, BDD binary, and WebLogic Server packages from the download location to
the packages directory.
5.
Rename the first BDD binary package bdd1.zip and the second bdd2.zip.
This ensures that the installer will recognize them.
6.
7.
Navigate back to the installation source directory and extract the BDD installer package:
unzip packages/<BDD_installer_package>.zip
This creates a new directory called installer, which contains the install script and other files it
requires.
8.
Open BDD's configuration file, bdd.conf, in a text editor and update the Required Settings section.
See Configuring a single-node installation on page 52 for instructions on how to do this.
9.
10.
If the script succeeded, BDD is now installed on your machine and ready for you to begin working with it. See
Post-Installation Tasks on page 75 to learn more about your installation and how to verify it.
If the script failed, see Troubleshooting a Failed Installation on page 71.
Configuring a single-node installation
Some of the directories defined in bdd.conf have location requirements. These are specified below.
Configuration property
Description
ORACLE_HOME
The path to the directory BDD will be installed in. This must not exist
and the system must contain at least 30GB of free space to create
this directory. Additionally, its parent directories' permissions must be
set to either 755 or 775.
Note that this setting is different from the ORACLE_HOME environment
variable required by the Studio database.
ORACLE_INV_PTR
The absolute path to the Oracle inventory pointer file, which the
installer will create when it runs. This can't be located in the
ORACLE_HOME directory.
If you have any other Oracle software products installed, this file will
already exist. Update this property to point to it.
INSTALLER_PATH
DGRAPH_INDEX_DIR
HADOOP_UI_HOST
STUDIO_JDBC_URL
The JDBC URL for your Studio database, which Studio requires to
connect to it.
There are three templates for this property. Copy the template that
corresponds to your database type to STUDIO_JDBC_URL and
update the URL to point to your database.
If you have a MySQL database, copy the MySQL template and update
the URL as follows:
jdbc:mysql://<database hostname>:<port number>/<database name>?useUnicode=true&characterEncoding=UTF-8&useFastDateParsing=false
If you have an Oracle database, copy the Oracle template and update
the URL as follows:
jdbc:oracle:thin:@<database hostname>:<port number>:<database SID>
Configuration property
Description
INSTALL_TYPE
JAVA_HOME
The absolute path to the JDK install directory. This should have the
same value as the $JAVA_HOME environment variable.
If you have multiple versions of the JDK installed, be sure that this
points to the correct one.
TEMP_FOLDER_PATH
The temporary directory used by the installer. This must exist and
contain at least 20GB of free space.
HADOOP_UI_PORT
HADOOP_UI_CLUSTER_NAME
HUE_URI
HADOOP_CLIENT_LIB_PATHS
HADOOP_CERTIFICATES_PATH
Configuration property
Description
ENABLE_KERBEROS
KERBEROS_PRINCIPAL
The name of the BDD principal. This should include the name of your
domain; for example, bdd@EXAMPLE.COM.
This property is only required if ENABLE_KERBEROS is set to TRUE.
KERBEROS_KEYTAB_PATH
The absolute path to the BDD keytab file. This property is only
required if ENABLE_KERBEROS is set to TRUE.
KRB5_CONF_PATH
ADMIN_SERVER
MANAGED_SERVERS
DGRAPH_SERVERS
DGRAPH_THREADS
The number of threads the Dgraph starts with. This will default to the
number of cores your machine has minus 2, so you don't need to set
it.
DGRAPH_CACHE
The size of the Dgraph cache, in MB. This will default to either 50%
of your RAM or the total amount of free memory minus 2GB
(whichever is larger), so you don't need to set it.
ZOOKEEPER_INDEX
HDFS_DP_USER_DIR
The location within the HDFS /user directory that stores the Avro
files created when Studio users export data. The installer will create
this directory if it doesn't already exist. The name of this directory
can't include spaces or slashes (/).
YARN_QUEUE
HIVE_DATABASE_NAME
The name of the Hive database that stores the source data for Studio
data sets.
Configuration property
Description
SPARK_ON_YARN_JAR
For MapR, use the third template. This should be the absolute
path to spark-assembly-1.5.2-mapr-1602-hadoop2.7.0-mapr-1602.jar.
TRANSFORM_SERVICE_SERVERS
TRANSFORM_SERVICE_PORT
The port the Transform Service listens on for requests from Studio.
ENABLE_CLUSTERING_SERVICE
For use by Oracle Support only. Leave this property set to FALSE.
CLUSTERING_SERVICE_SERVERS
CLUSTERING_SERVICE_PORT
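For orientation, a hypothetical Required Settings fragment for a single-node demo might look like the following. Every value shown is a placeholder (the hostname, paths, port, and JDBC URL must all reflect your own machine), and the full set of properties is described above.

ORACLE_HOME=/localdisk/Oracle/Middleware
ORACLE_INV_PTR=/localdisk/Oracle/oraInst.loc
INSTALLER_PATH=/localdisk/bdd_install
DGRAPH_INDEX_DIR=/localdisk/dgraph_databases
HADOOP_UI_HOST=bdd-demo.example.com
HADOOP_UI_PORT=7180
STUDIO_JDBC_URL=jdbc:mysql://bdd-demo.example.com:3306/studio?useUnicode=true&characterEncoding=UTF-8&useFastDateParsing=false
JAVA_HOME=/usr/java/jdk1.8.0_45
TEMP_FOLDER_PATH=/tmp/bdd_tmp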
Chapter 6
Cluster Installation
The following sections describe how to install BDD on multiple nodes, and provide tips on troubleshooting a
failed installation.
The BDD installer
Setting up the install machine
Downloading the BDD media pack
Downloading a WebLogic Server patch
Configuring BDD
Running the prerequisite checker
Installing BDD on a cluster
Silent installation
You can optionally run the installer in silent mode. This means that instead of prompting you for information it
requires at runtime, it obtains that information from environment variables you set beforehand.
Normally, the script prompts you to enter the following:
The username and password for your cluster manager (Cloudera Manager, Ambari, or MCS), which the
script uses to query your cluster manager for information related to your Hadoop cluster.
The username and password for the WebLogic Server admin. The script will create this user when it
deploys WebLogic.
The JDBC username and password for the Studio database, which it requires to connect Studio to the
database.
The username and password for the Studio admin.
The absolute path to the location of the installation packages.
You can avoid these steps by setting the following environment variables before running the script.
Environment variable
Value
BDD_HADOOP_UI_USERNAME
BDD_HADOOP_UI_PASSWORD
BDD_WLS_USERNAME
BDD_WLS_PASSWORD
BDD_STUDIO_JDBC_USERNAME
BDD_STUDIO_JDBC_PASSWORD
BDD_STUDIO_ADMIN_USERNAME
The email address of the Studio admin, which will be their username.
This must be a full email address and can't begin with root@ or
postmaster@.
Note: The installer will automatically populate this value to
the STUDIO_ADMIN_EMAIL_ADDRESS property in
bdd.conf, overwriting any existing value. If you set
STUDIO_ADMIN_EMAIL_ADDRESS instead of this
environment variable, the installer will still execute silently.
BDD_STUDIO_ADMIN_PASSWORD
The password for the Studio admin. This must contain at least 6
characters, one of which must be a non-alphanumeric character.
Note that the Studio admin will be asked to reset their password the
first time they log in if you set the
STUDIO_ADMIN_PASSWORD_RESET_REQUIRED property to TRUE.
INSTALLER_PATH
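For example, a silent run might be prepared as follows; every value is a placeholder, and the passwords shown merely satisfy the length and character rules described in this table.

export BDD_HADOOP_UI_USERNAME=admin
export BDD_HADOOP_UI_PASSWORD='cm_admin_password'
export BDD_WLS_USERNAME=weblogic
export BDD_WLS_PASSWORD='Welcome1#'
export BDD_STUDIO_JDBC_USERNAME=studio
export BDD_STUDIO_JDBC_PASSWORD='studio_db_password'
export BDD_STUDIO_ADMIN_USERNAME=bdd.admin@example.com
export BDD_STUDIO_ADMIN_PASSWORD='Admin#123'
export INSTALLER_PATH=/localdisk/bdd_install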
Installer behavior
The diagram below illustrates the behavior of the installer.
Note: This diagram shows how the installer distributes the BDD components to the different nodes in
your cluster. This diagram is not intended to illustrate the number of nodes you can have. For various
installation configurations, including options for co-locating different BDD components on the same
node, see Cluster configurations and diagrams on page 11.
Setting up the install machine
Choose an existing directory or create a new one to be the installation source directory.
You'll perform the entire installation process from this directory. Its name and location are arbitrary and
it must contain at least 10GB of free space.
3.
Within the installation source directory, create a new directory named packages.
Downloading the BDD media pack

2.
3.
4.
5.
6.
Click Continue.
7.
Verify that Available Release and Oracle Big Data Discovery 1.3.x.x.x for Linux x86-64 are both
checked, then click Continue.
8.
Accept the Oracle Standard Terms and Restrictions and click Continue.
9.
You should also make a note of each file's part number, as you will need this information to identify it.
10.
Move the BDD installer, BDD binary, and WebLogic Server packages from the download location to
the packages directory.
11.
Rename the first BDD binary package bdd1.zip and the second bdd2.zip.
This ensures that the installer will recognize them.
12.
13.
Navigate back to the installation source directory and extract the installer package:
unzip packages/<installer_package>.zip
This creates a new directory within the installation source directory called installer, which contains
the installer, bdd.conf, and other files required by the installer.
Next, you can download a WebLogic Server patch for the installer to apply. If you don't want to patch
WebLogic Server, you should configure your BDD installation.
Downloading a WebLogic Server patch

Within the installation source directory, create a new directory called WLSPatches.
Don't change the name of this directory or the installer won't recognize it.
2.
3.
On the Patches & Updates tab, find and download the patch you want to apply.
4.
Configuring BDD
After you download the required Hadoop client libraries, you must configure your installation by updating the
bdd.conf file, which is located in the /<installation_src_dir>/installer directory.
Important: bdd.conf defines the configuration of your BDD cluster and provides the installer with
parameters it requires to run. Updating this file is the most important step of the installation process. If
you don't modify the file, or if you modify it incorrectly, the installer could fail or your cluster could be
configured differently than you intended.
You can edit the file in any text editor. Be sure to save your changes before closing.
The installer validates bdd.conf at runtime and fails if it contains any invalid values. To avoid this, keep the
following in mind when updating the file:
The accepted values for some properties are case-sensitive and must be entered exactly as they appear
in this document.
Required settings
The first part of bdd.conf contains required settings. You must update these with information specific to your
system, or the installer could fail.
Must Set
This section contains blank settings that you must provide values for. If you don't set these, the installation will
fail.
Configuration property
Description
ORACLE_HOME
The path to the BDD root directory, where BDD will be installed on each
node in the cluster. This directory must not exist and its parent directories'
permissions must be set to either 755 or 775. There must be at least 30GB
of space available on each BDD node to create this directory.
Note that this is different from the ORACLE_HOME environment variable
required by the Studio database.
Important: You must ensure that the installer can create this
directory on all nodes that will host BDD components, including
Hadoop nodes that will host Data Processing.
ORACLE_INV_PTR
The absolute path to the Oracle inventory pointer file, which the installer
will create. This file can't be located in the ORACLE_HOME directory.
If you have any other Oracle software products installed, this file will
already exist. Update this property to point to it.
Configuration property
Description
INSTALLER_PATH
Optional. The absolute path to the installation source directory. This must
contain at least 10GB of free space.
If you don't set this property, you can either set the INSTALLER_PATH
environment variable or specify the path at runtime. For more information,
see The BDD installer on page 57.
DGRAPH_INDEX_DIR
HADOOP_UI_HOST
The name of the server hosting your Hadoop manager (Cloudera Manager,
Ambari, or MCS).
STUDIO_JDBC_URL
For Oracle databases, copy the Oracle template and update the URL as
follows:
jdbc:oracle:thin:@<database hostname>:<port number>:<database SID>
General
This section configures settings relevant to all components and the installation process itself.
Configuration property
Description
INSTALL_TYPE
JAVA_HOME
The absolute path to the JDK install directory. This must be the same on all
BDD servers and should have the same value as the $JAVA_HOME
environment variable.
If you have multiple versions of the JDK installed, be sure that this points to
the correct one.
TEMP_FOLDER_PATH
The temporary directory used on each node during the installation. This
directory must exist on all BDD nodes and must contain at least 20GB of
free space.
CDH/HDP
This section contains properties related to Hadoop. The installer uses these properties to query the Hadoop
cluster manager (Cloudera Manager, Ambari, or MCS) for information about the Hadoop components, such as
the URIs and names of their host servers.
Configuration property
HADOOP_UI_PORT
The port number of the server running the Hadoop cluster manager.
HADOOP_UI_CLUSTER_NAME
HUE_URI
HDP only. The hostname and port of the node running Hue, in the
format <hostname>:<port>.
Configuration property
HADOOP_CLIENT_LIB_PATHS
HADOOP_CERTIFICATES_PATH
Only required for Hadoop clusters with TLS/SSL enabled. The absolute
path to the directory on the install machine where you put the
certificates for HDFS, YARN, Hive, and the KMS.
Don't remove this directory after installing, as you will use it if you have
to update the certificates.
Kerberos
This section configures Kerberos for BDD.
Note: You only need to modify these properties if you want to enable Kerberos.
Configuration property
ENABLE_KERBEROS
KERBEROS_PRINCIPAL
The name of the BDD principal. This should include the name of your
domain; for example, bdd@EXAMPLE.COM.
This property is only required if ENABLE_KERBEROS is set to TRUE.
KERBEROS_KEYTAB_PATH
The absolute path to the BDD keytab file on the install machine.
The installer will rename this to bdd.keytab and copy it to
$BDD_HOME/common/kerberos/ on all BDD nodes.
This property is only required if ENABLE_KERBEROS is set to TRUE.
Configuration property
KRB5_CONF_PATH
The absolute path to the krb5.conf file on the install machine. The
installer will copy this to /etc on all BDD nodes.
This property is only required if ENABLE_KERBEROS is set to TRUE.
ADMIN_SERVER
The hostname of the install machine, which will become the Admin
Server.
If you leave this blank, it will default to the hostname of the machine
you're on.
MANAGED_SERVERS
DGRAPH_SERVERS
A comma-separated list of the hostnames of the nodes that will run the
Dgraph and the Dgraph HDFS Agent.
This list can't contain duplicate values. If you plan on storing your
databases on HDFS, these must be HDFS DataNodes. For best
performance, there shouldn't be any other Hadoop services running on
these nodes, especially Spark.
Configuration property
DGRAPH_THREADS
The number of threads the Dgraph starts with. This should be at least 2.
The exact number depends on the other services running on the machine:
For machines running only the Dgraph, the number of threads should
be equal to the number of cores on the machine.
For machines running the Dgraph and other BDD components, the
number of threads should be the number of cores minus 2. For
example, a quad-core machine should have 2 threads.
For HDFS nodes running the Dgraph, the number of threads should be
the number of CPU cores minus the number required for the Hadoop
services. For example, a quad-core machine running Hadoop services
that require 2 cores should have 2 threads.
If you leave this property blank, it will default to the number of CPU cores
minus 2.
Be sure that the number you use is in compliance with the licensing
agreement.
DGRAPH_CACHE
The size of the Dgraph cache, in MB. Only specify the number; don't
include MB.
If you leave this property blank, it will default to either 50% of the node's
available RAM or the total amount of free memory minus 2GB (whichever is
larger).
Oracle recommends allocating at least 50% of the node's available RAM to
the Dgraph cache. If you later find that queries are getting cancelled
because there isn't enough available memory to process them, experiment
with gradually decreasing this amount.
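As a worked illustration of DGRAPH_THREADS and DGRAPH_CACHE together (assumed hardware, not a recommendation from this guide), a dedicated 16-core Dgraph node with 64GB of RAM might be configured as follows:

# Assumed 16-core, 64GB Dgraph-only node.
DGRAPH_THREADS=16     # Dgraph-only node: one thread per core
DGRAPH_CACHE=32768    # 50% of 64GB, expressed in MB (no "MB" suffix)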
ZOOKEEPER_INDEX
Data Processing
This section configures Data Processing and the Hive Table Detector.
Configuration property
HDFS_DP_USER_DIR
The location within the HDFS /user directory that stores the sample
files created when Studio users export data. The name of this
directory must not include spaces or slashes (/). The installer will
create it if it doesn't already exist.
If you have MapR and want to use an existing directory, it must be
mounted with a volume.
YARN_QUEUE
HIVE_DATABASE_NAME
The name of the Hive database that stores the source data for Studio
data sets.
The default value is default. This is the same as the default value
of DETECTOR_HIVE_DATABASE, which is used by the Hive Table
Detector. It is possible to use different databases for these properties,
but it is recommended that you use a single database for a first-time
installation.
SPARK_ON_YARN_JAR
If you have MapR, use the third template. This should be the
absolute path to spark-assembly-1.5.2-mapr-1602-hadoop2.7.0-mapr-1602.jar.
This JAR must be located in the same location on all Hadoop nodes.
Micro Service
This section configures the Transform Service.
Configuration property
TRANSFORM_SERVICE_SERVERS
TRANSFORM_SERVICE_PORT
The port the Transform Service listens on for requests from Studio.
ENABLE_CLUSTERING_SERVICE
For use by Oracle Support only. Leave this property set to FALSE.
CLUSTERING_SERVICE_SERVERS
CLUSTERING_SERVICE_PORT
2. Enter the username and password for your Hadoop manager when prompted.
4. When the script completes, go to the timestamped output directory and open test_report.html in a browser.
The report lists all BDD requirements and whether each passed, failed, or was ignored. Ignored requirements
aren't applicable to your system.
If everything passed, you're ready to install BDD. If any requirement failed, update your system or bdd.conf
accordingly and rerun the prerequisite checker.
1. On the install machine, open a new terminal window and go to the /installer directory.
2.
3. If you are not running the script in silent mode, enter the following information when prompted:
The username and password for the cluster manager.
A username and password for the WebLogic Server admin. The password must contain at least 8
characters, including at least 1 number, and can't begin with a number.
The username and password for the Studio database.
The password for the Studio admin. This must contain at least 6 characters, including at least 1
non-alphanumeric character.
The absolute path to the installation source directory, if you didn't set INSTALLER_PATH in
bdd.conf.
If the script succeeds, BDD will be fully installed and running. See Post-Installation Tasks on page 75 to learn
more about your installation and how to verify it.
If the script fails, see Troubleshooting a Failed Installation on page 71.
Chapter 7
You can then check the log files on those servers for more information about the failure. The installer's log
files are located on each server in the directory defined by TEMP_FOLDER_PATH.
Once you determine what caused the failure, you can fix it and rerun the installer.
Failed ZooKeeper check
Failure to download the Hadoop client libraries
Failure to generate the Hadoop fat JAR
Rerunning the installer
To fix this problem, try rerunning the installer according to the instructions in Rerunning the installer on page
73. If it continues to fail, check if ZooKeeper is completely down and restart it if it is.
1. On the install machine, download the following packages from http://archive-primary.cloudera.com/cdh5/cdh/5/ and extract them:
Note: It is recommended that you use a browser other than Chrome for this.
spark-<spark_version>.cdh.<cdh_version>.tar.gz
hive-<hive_version>.cdh.<cdh_version>.tar.gz
hadoop-<hadoop_version>.cdh.<cdh_version>.tar.gz
avro-<avro_version>.cdh.<cdh_version>.tar.gz
3. Copy and paste the value of the first template to HADOOP_CLIENT_LIB_PATHS and replace each instance of $UNZIPPED_<COMPONENT>_BASE with the absolute path to that library's location on the install machine.
4.
For instructions on rerunning the installer, see Rerunning the installer on page 73.
2.
3.
For instructions on rerunning the installer, see Rerunning the installer on page 73.
This removes many of the files created the last time you ran the installer and cleans up your
environment.
2. If the installer was previously run by a different Linux user, delete the TEMP_FOLDER_PATH directory from all nodes.
3. Go to the installation source directory and open bdd.conf in any text editor.
4.
5.
The installer removes any files created the last time it ran and runs again on the clean system.
Part III
After You Install
Chapter 8
Post-Installation Tasks
The following sections describe tasks you can perform after you install BDD, such as verifying your installation
and increasing Linux file descriptors.
Verifying your installation
Navigating the BDD directory structure
Enabling Kerberos for the Transform Service
Configuring load balancing
Updating the DP CLI whitelist and blacklist
Signing in to Studio as an administrator
Backing up your cluster
Replacing certificates
Increasing Linux file descriptors
Customizing the WebLogic JVM heap size
Configuring Studio database caching
1. On the Admin Server, open a new terminal window and navigate to the $BDD_HOME/BDD_manager/bin directory.
2.
If your cluster is healthy, the script's output should be similar to the following:
[2015/06/19 04:18:55 -0700] [Admin Server] Checking health of BDD cluster...
[2015/06/19 04:20:39 -0700] [web009.us.example.com] Check BDD functionality......Pass!
[2015/06/19 04:20:39 -0700] [web009.us.example.com] Check Hive Data Detector health......Hive Data Detector has previously run
[2015/06/19 04:20:39 -0700] [Admin Server] Successfully checked statuses.
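That output comes from the cluster health check. A hedged sketch of one possible invocation from $BDD_HOME/BDD_manager/bin, assuming the bdd-admin script and its status --health-check option behave as described in the BDD administration documentation:

# Assumption: bdd-admin.sh supports "status --health-check" in this release.
./bdd-admin.sh status --health-check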
$BDD_HOME
$BDD_HOME is the root directory of your BDD installation. Its default path is:
$ORACLE_HOME/BDD-<version>
Description
/BDD_manager
/bdd-shell
Files related to the optional BDD Shell component. For more information,
see the BDD Shell Guide.
/clusteringservice
For use by Oracle Support only. Files and directories related to the
Cluster Analysis service.
/common
/dataprocessing/edp_cli
/dgraph
/jetty
/logs
/microservices
/server
/studio
Contains the EAR file for the Studio application and a version file for
Studio.
/transformservice
/uninstall
version.txt
$DOMAIN_HOME
$DOMAIN_HOME is the root directory of Studio, the Dgraph Gateway, and your WebLogic domain. Its default
path is:
$ORACLE_HOME/user_projects/domains/bdd-<version>_domain
Description
/autodeploy
/bin
/config
Data sources and configuration files for Studio and the Dgraph
Gateway.
/console-ext
edit.lok
Ensures that only one user can edit the domain's configuration at a time.
Don't edit this file.
fileRealm.properties
/init-info
/lib
/nodemanager
/pending
/security
/servers
Log files and security information for each server in the cluster.
startWebLogic.sh
/tmp
Temporary directory.
2. On each Transform Service node, start k5start by running the following command from $BDD_HOME/transformservice/:
./k5start -f $KERBEROS_KEYTAB_PATH -K <ticket_refresh>
-l <ticket_lifetime> $KERBEROS_PRINCIPAL -b > <logfile> 2>&1
Where:
$KERBEROS_KEYTAB_PATH and $KERBEROS_PRINCIPAL are the values of those properties
defined in bdd.conf.
<ticket_refresh> is the rate at which the Transform Service's Kerberos ticket is refreshed, in
minutes. For example, a value of 60 would set its ticket to be refreshed every 60 minutes, or every
hour. You can optionally use the value for KERBEROS_TICKET_REFRESH_INTERVAL in
bdd.conf.
<ticket_lifetime> is the amount of time the Transform Service's Kerberos ticket is valid for.
This should be given as a number followed by a supported unit of time: s, m, h, or d. For example,
10h (10 hours) or 10m (10 minutes). You can optionally use the value for
KERBEROS_TICKET_LIFETIME in bdd.conf.
<logfile> is the absolute path to the log file you want k5start to write to.
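For illustration, the same command with sample values filled in; the refresh interval, ticket lifetime, principal, and log file are arbitrary placeholders, and the keytab path assumes the default location described earlier in this chapter:

# Placeholder values: refresh every 60 minutes, 10-hour ticket lifetime,
# example principal, example log file.
./k5start -f $BDD_HOME/common/kerberos/bdd.keytab -K 60 \
  -l 10h bdd-service@EXAMPLE.COM -b > /tmp/k5start-transform.log 2>&1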
3.
There are many load balancing options available. Oracle recommends an external HTTP load balancer, but
you can use whatever option is best suited to your needs and available resources. Just be sure the option you
choose uses session affinity (also called sticky sessions).
Session affinity forces all requests from a given session to be routed to the same node, resulting in one
session token. Without this, requests from a single session could be handled by multiple nodes, which would
create multiple session tokens.
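If you choose an Apache HTTP Server front end, sticky sessions can be configured with mod_proxy_balancer. The sketch below is illustrative only; the hostnames, port 7003, and the /bdd context path are assumptions, and any load balancer that supports session affinity works equally well.

# Illustrative Apache HTTP Server configuration (requires mod_proxy,
# mod_proxy_http, and mod_proxy_balancer). Hostnames, port, and path are
# placeholder values, not defaults from this guide.
<Proxy "balancer://studio-cluster">
    BalancerMember "http://studio1.example.com:7003" route=studio1
    BalancerMember "http://studio2.example.com:7003" route=studio2
    # Sticky sessions: route each request based on the Java session cookie.
    ProxySet stickysession=JSESSIONID|jsessionid
</Proxy>
ProxyPass        "/bdd" "balancer://studio-cluster/bdd"
ProxyPassReverse "/bdd" "balancer://studio-cluster/bdd"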
There are many load balancing options available. Be sure to choose one that:
Uses session affinity, or "sticky sessions". For more information, see Configuring load balancing for Studio
on page 82.
Can assign a virtual IP address to the Transform Service cluster. This is required for Studio to
communicate with the cluster; without it, Studio will only send requests to the first Transform Service
instance.
To configure load balancing for the Transform Service:
1. Set up the load balancer and configure a virtual IP address for the Transform Service cluster.
2.
2.
3. Specify the admin username and password set during the installation and click Sign In. If the admin username and password weren't set, log in with the default values.
Table 8.1: Sign in Values
Field           Value
Login
Password        Welcome123
4.
Now you can add additional Studio users. There are several ways to add new Studio users:
Integrate Studio with an Oracle Single Sign On (SSO) system. For details, see the Administrator's Guide.
Integrate Studio with an LDAP system. For details, see the Administrator's Guide.
While you are signed in as an administrator, create users manually in Studio from the Control Panel > Users page.
Replacing certificates
Enabling SSL for Studio activates WebLogic Server's default Demo Identity and Demo Trust Keystores. As
their names suggest, these keystores are untrusted and meant for demo purposes only. After deployment, you
should replace them with your own certificates.
More information on WebLogic's demo keystores is available in the Configure keystores section of WebLogic's Administration Console Online Help.
2. Modify the nofile limit so that soft is 4096 and hard is 8192. Either edit existing lines or add these two lines to the file:
*       soft    nofile    4096
*       hard    nofile    8192
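After saving the file and starting a new session as the bdd user, you can verify that the limits took effect:

# Current soft and hard open-file limits for this shell session.
ulimit -Sn    # expected: 4096
ulimit -Hn    # expected: 8192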
2.
3.
4.
5.
Note that any changes you make must be made on all Studio nodes.
2.
3.
4.
2.
3.
2.
3.
Chapter 9
If the reverse proxy does not retain the Host: header, the result is:
Host: http://studioserver1:8080
In the latter case, where the header uses the actual target server hostname, the client may not have access to
studioserver1, or may not be able to resolve the hostname. It will also bypass the reverse proxy on the
next request, which may cause security issues.
If the Host: header cannot be relied on as correct for the client, then it must be configured specifically for the
web or application server, so that it can render correct absolute URLs.
Most reverse proxy solutions should have a configuration option to allow the Host: header to be preserved.
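As one example, Apache HTTP Server preserves the client's Host: header with the ProxyPreserveHost directive; the hostname, port, and context path below are placeholders, not values from this guide:

# Keep the original Host: header so the proxied application builds absolute
# URLs for the reverse proxy rather than for studioserver1 (placeholder values).
ProxyPreserveHost On
ProxyPass        "/bdd" "http://studioserver1:8080/bdd"
ProxyPassReverse "/bdd" "http://studioserver1:8080/bdd"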
Where:
reverseProxyHostName is the host name of the reverse proxy server.
reverseProxyPort is the port number for the reverse proxy server.
Part IV
Uninstalling Big Data Discovery
Chapter 10
Uninstallation
This section describes how to uninstall BDD.
The uninstallation script
Running the uninstallation script
2.
The optional [--silent] option runs the script in silent mode, which enables you to skip the
following confirmation step.
3. Enter yes or y when asked if you're sure you want to uninstall BDD.
Appendix A
Optional settings
The second part of bdd.conf contains optional properties. You can update these if you want, but the default
values will work for most installations.
General
This section configures settings relevant to all components and the installation process itself.
Configuration property
Description
FORCE
Determines whether the installer removes files and directories left over
from previous installations.
Use FALSE if this is your first time installing BDD. Use TRUE if you're
reinstalling after either a failed installation or an uninstallation.
Note that this property only accepts UPPERCASE values.
ENABLE_AUTOSTART
BACKUP_LOCAL_TEMP_FOLDER_PATH
The absolute path to the default temporary folder on the Admin Server used during backup and restore operations. This can be overridden on a case-by-case basis by the bdd-admin script.
BACKUP_HDFS_TEMP_FOLDER_PATH
The absolute path to the default temporary folder on HDFS used during backup and restore operations. This can be overridden on a case-by-case basis by the bdd-admin script.
WLS_START_MODE
WLS_NO_SWAP
WEBLOGIC_DOMAIN_NAME
The name of the WebLogic domain, which Studio and the Dgraph
Gateway run in. This is automatically created by the installer.
ADMIN_SERVER_PORT
MANAGED_SERVER_PORT
The port used by the Managed Server (i.e., Studio). This number
must be unique.
This property is still required if you're installing on a single server.
WLS_SECURE_MODE
ADMIN_SERVER_SECURE_PORT
The secure port on the Admin Server that Studio listens on when
WLS_SECURE_MODE is set to TRUE.
Note that when SSL is enabled, Studio still listens on the un-secure
ADMIN_SERVER_PORT for requests from the Dgraph Gateway.
MANAGED_SERVER_SECURE_PORT
The secure port on the Managed Server that Studio listens on when
WLS_SECURE_MODE is set to TRUE.
Note that when SSL is enabled, Studio still listens on the un-secure
MANAGED_SERVER_PORT for requests from the Dgraph Gateway.
ENDECA_SERVER_LOG_LEVEL
SERVER_TIMEOUT
SERVER_INGEST_TIMEOUT
SERVER_HEALTHCHECK_TIMEOUT
The timeout value (in milliseconds) used when checking data source
availability when connections are initialized. A value of 0 means
there is no timeout.
STUDIO_JDBC_CACHE
STUDIO_ADMIN_SCREEN_NAME
STUDIO_ADMIN_EMAIL_ADDRESS
STUDIO_ADMIN_FIRST_NAME
STUDIO_ADMIN_MIDDLE_NAME
STUDIO_ADMIN_LAST_NAME
DGRAPH_WS_PORT
DGRAPH_BULKLOAD_PORT
The port that the Dgraph listens on for bulk load ingest requests.
DGRAPH_OUT_FILE
DGRAPH_LOG_LEVEL
Defines the log levels for the Dgraph's out log subsystems. This must be
formatted as:
"subsystem1 level1|subsystem2,subsystem3
level2|subsystemN levelN"
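For example, a bdd.conf line following this pattern; the subsystem and level names here are placeholders, not a vetted list for this release:

# Placeholder subsystems and levels -- shows the quoting, pipe, and comma syntax only.
DGRAPH_LOG_LEVEL="subsystem1 WARNING|subsystem2,subsystem3 ERROR"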
DGRAPH_USE_MOUNT_HDFS
Specifies whether the Dgraph databases are stored on HDFS. When set
to TRUE, the Dgraph runs on Hadoop DataNodes and mounts HDFS
when it starts, through either the NFS Gateway or FUSE.
DGRAPH_HDFS_MOUNT_DIR
The absolute path to the local directory where the Dgraph mounts the
HDFS root directory.
Use a nonexistent directory when installing. If this location changes after
installing, the new location must be empty and have read, write, and
execute permissions for the bdd user.
This setting is only required if DGRAPH_USE_MOUNT_HDFS is set to
TRUE.
DGRAPH_ENABLE_MPP
DGRAPH_MPP_PORT
KERBEROS_TICKET_REFRESH_
INTERVAL
KERBEROS_TICKET_LIFETIME
The amount of time that the Dgraph's Kerberos ticket is valid. This
should be given as a number followed by a supported unit of time: s, m,
h, or d. For example, 10h (10 hours), or 10m (10 minutes).
This setting is only required if DGRAPH_USE_MOUNT_HDFS and
ENABLE_KERBEROS are set to TRUE.
DGRAPH_ENABLE_CGROUP
Enables cgroups for the Dgraph. This must be set to TRUE if you created
a cgroup for the Dgraph.
If set to TRUE, DGRAPH_CGROUP_NAME must also be set.
DGRAPH_CGROUP_NAME
The name of the cgroup that controls the Dgraph. This is required if
DGRAPH_ENABLE_CGROUP is set to TRUE. You must create this before
installing; for more information, see Setting up cgroups on page 36.
AGENT_PORT
The port that the HDFS Agent listens on for HTTP requests.
AGENT_EXPORT_PORT
The port that the HDFS Agent listens on for requests from the Dgraph.
AGENT_OUT_FILE
Data Processing
This section configures Data Processing and the Hive Table Detector.
Configuration property
ENABLE_HIVE_TABLE_DETECTOR
DETECTOR_SERVER
The hostname of the server the Hive Table Detector runs on. This
must be one of the WebLogic Managed Servers.
DETECTOR_HIVE_DATABASE
The name of the Hive database that the Hive Table Detector
monitors.
The default value is default. This is the same as the default value
of HIVE_DATABASE_NAME, which is used by Studio and the CLI. You
can use a different database for each of these properties, but Oracle
recommends using a single database for a first-time installation.
This value can't contain semicolons (;).
DETECTOR_MAXIMUM_WAIT_TIME
The maximum amount of time (in seconds) that the Hive Table
Detector waits before submitting update jobs.
DETECTOR_SCHEDULE
The cron schedule that specifies how often the Hive Table Detector
runs. This must be enclosed in quotes. The default value is "0 0 * *
*", which sets the Hive Table Detector to run at midnight every day
of every month.
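For instance, to run the Detector every six hours instead of once a day, assuming it accepts standard five-field cron expressions with step values:

# Minute 0 of hours 0, 6, 12, and 18 -- every six hours.
DETECTOR_SCHEDULE="0 */6 * * *"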
ENABLE_ENRICHMENTS
MAX_RECORDS
SANDBOX_PATH
The path to the HDFS directory where the Avro files created when
Studio users export data are stored.
LANGUAGE
DP_ADDITIONAL_JARS
Internal settings
The third part of bdd.conf contains internal settings either required by the installer or intended for use by
Oracle Support. Note that the installer will automatically add properties to this section when it runs.
Note: Don't modify any properties in this part unless instructed to by Oracle Support.
Configuration property
Description
DP_POOL_SIZE
DP_TASK_QUEUE_SIZE
MAX_INPUT_SPLIT_SIZE
The maximum partition size used for Spark inputs, in MB. This
controls the size of the blocks of data handled by Data Processing
jobs.
Partition size directly affects Data Processing performance. When
partitions are smaller, more jobs run in parallel and cluster
resources are used more efficiently. This improves both speed and
stability.
The default value is 32. This amount should be sufficient for most
clusters, with a few exceptions:
If your Hadoop cluster has a very large processing capacity
and most of your data sets are small (around 1GB), you can
decrease this value.
In rare cases, when data enrichments are enabled, the
enriched data set in a partition can become too large for its
YARN container to handle. If this occurs, you can decrease this
value to reduce the amount of memory each partition requires.
Note that this property overrides the HDFS block size used in
Hadoop.
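A rough worked example with assumed numbers, not guidance from this guide: halving the split size roughly doubles the number of partitions a given data set is divided into.

# Assumed 1GB (1024MB) data set:
#   MAX_INPUT_SPLIT_SIZE=32  ->  ~1024/32 = 32 partitions (default)
#   MAX_INPUT_SPLIT_SIZE=16  ->  ~1024/16 = 64 partitions (more parallelism per job)
MAX_INPUT_SPLIT_SIZE=16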
SPARK_DYNAMIC_ALLOCATION
SPARK_DRIVER_CORES
SPARK_DRIVER_MEMORY
The maximum memory heap size for the Spark job driver. This
must be in the same format as JVM memory settings; for example,
512m or 2g.
SPARK_EXECUTORS
SPARK_EXECUTOR_CORES
SPARK_EXECUTOR_MEMORY
The maximum memory heap size for each Spark executor. This
must be in the same format as JVM memory settings; for example,
512M or 2g.
RECORD_SEARCH_THRESHOLD
VALUE_SEARCH_THRESHOLD
BDD_VERSION
BDD_RELEASE_VERSION
The BDD hotfix or patch version. This property is intended for use
by Oracle Support and shouldn't be changed.
Index
B
backups 84
bdd.conf
  internal settings 102
  optional settings 96
  overview 62
  required settings 63
BDD installer
  about 57
  behavior 59
  rerunning 73
  silent mode 57
  troubleshooting 72
Big Data Discovery
  about 9
  configuration options 11
  integration with Hadoop 10
  integration with WebLogic 11
  uninstalling 94
C
cgroups 36
cluster installation
  configuration 62
  downloading a WebLogic Server patch 62
  downloading the media pack 61
  installing 71
  selecting the install machine 60
Command Line Interface, about 10
configuration
  internal settings 102
  optional settings 96
  required settings 63
D
Data Processing, about 10
Data Processing CLI, about 10
Dgraph, about 10
Dgraph Gateway, about 9
Dgraph HDFS Agent, about 10
Dgraph requirements
  about 35
  file descriptors 39
  FUSE 38
  HDFS 36
  NFS Gateway 38
directory structure
  $BDD_HOME 77
  $DOMAIN_HOME 80
DP CLI whitelist and blacklist, updating 83
E
Endeca Server 14
F
file descriptors, increasing 84
H
Hadoop, about 10
Hadoop requirements
  client libraries 27
  distributions and components 24
  HDP JARs 28
  YARN setting changes 26
HDP-specific requirements
  required JARs 28
Hive Table Detector, about 10
I
installation and deployment
  about 57
  rerunning the installer 73
  troubleshooting 72
install machine, selecting 60
iPad, using to view projects 42
J
Jetty, about 11
JVM heap size, setting 85
K
Kerberos 31
L
load balancing
  overview 82
  Studio 82
  Transform Service 82
M
MapR
  configuration 29
  patches 29
  special requirements 28
P
prerequisite checker, running 70
prerequisite checklist 44
prerequisites
  authentication 31
  authorization 32
  bdd user 23
  bdd user, enabling passwordless SSH 24
  Dgraph databases 35
  encryption 33
  Hadoop client libraries 27
  Hadoop requirements 24
  hardware 19
  HDFS encryption 34
  JDK 30
  memory 19
  network 21
  operating system 21
  Perl modules, installing 22
  physical memory and disk space 20
  screen resolution 42
  Studio database 40
  Studio database commands 41
  supported browsers 41
  supported platforms 15
  YARN setting changes 26
Q
quickstart
  about 49
  installing BDD 50
R
reverse proxy, using with Studio 88
S
security
  firewalls 35
  Hadoop encryption 33
  HDFS encryption 34
  Kerberos 31
  replacing certificates 84
  reverse proxy 88
  Sentry 32
  Studio encryption 35
Sentry 32
single-node installation
  configuring 52
  installing 51
Studio
  about 9
  database, creating 41
  disabling 86
  projects, viewing on iPad 42
  signing in 83
Studio database caching
  clearing cache 87
  customizing 85
  overview 85
supported platforms 15
system requirements
  authentication 31
  authorization 32
  bdd user 23
  bdd user, enabling passwordless SSH 24
  Dgraph databases 35
  encryption 33
  Hadoop client libraries 27
  Hadoop requirements 24
  hardware 19
  HDFS encryption 34
  JDK 30
  Linux utilities 21
  memory 19
  operating system 21
  Perl modules, installing 22
  physical memory and disk space 20
  screen resolution 42
  Studio database 40
  Studio database commands 41
  supported browsers 41
  supported platforms 15
  YARN setting changes 26
T
Transform Service
  about 9
  Kerberos configuration 81
troubleshooting
  about 72
  failed ZooKeeper check 72
  failure to download Hadoop client libraries 72
  failure to generate Hadoop fat JAR 73
U
uninstallation
  about 94
  running the uninstallation script 95
V
verification
  Data Processing 77
  deployed components 76
W
WebLogic Server
  about 11
  patches, downloading 62
  setting JVM heap size 85