Lab1 InstallationOfBigInsight


Lab 1 Overview

In this hands-on lab, you'll learn how to work with Big Data using Apache Hadoop and InfoSphere
BigInsights, IBM's Hadoop-based platform. In particular, you'll learn the basics of working with the
Hadoop Distributed File System (HDFS) and see how to administer your Hadoop-based environment
using the BigInsights Web console. After launching a sample MapReduce application, you'll explore a
more sophisticated scenario involving social media data. In doing so, you'll learn how to use a
spreadsheet-style interface to discover insights about the global coverage of a popular brand without
writing any code. Finally, you'll learn how to apply industry standard SQL to data managed by
BigInsights through IBM's Big SQL technology. Indeed, you'll have a chance to create tables and
execute complex queries over data in HDFS, including data derived from a relational data warehouse.

Ready to get started?

After completing this hands-on lab, you’ll be able to:

• Work directly with Apache Hadoop through file system commands

• Inspect and administer your cluster through the BigInsights Web Console

• Explore big data using a spreadsheet-style tool

• Use Big SQL to create tables and issue complex queries

Allow 2 ½ - 3 hours to complete this lab.

This lab was developed by Cynthia M. Saracco, IBM Silicon Valley Lab. Please post questions or
comments about this lab or the technologies it describes to the forum on Hadoop Dev at
https://developer.ibm.com/hadoop/.

1.1. About your environment


This lab was developed for the InfoSphere BigInsights 3.0 Quick Start Edition VMware image. If
necessary, download and install the single-node cluster VMware image from this site:
http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/downloads.html

The VMware image is set up in the following manner:

Account                       User      Password
VM Image root account         root      password
VM Image lab user account     biadmin   biadmin
BigInsights Administrator     biadmin   biadmin
Big SQL Administrator         bigsql    bigsql
Lab user                      biadmin   biadmin

Hands On Lab Page 3


IBM Software

Property                      Value
Host name                     bivm.ibm.com
BigInsights Web Console URL   http://bivm.ibm.com:8080
Big SQL database name         bigsql
Big SQL port number           51000

About the screen captures, sample code, and environment configuration

Screen captures in this lab depict examples and results that may vary from
what you see when you complete the exercises. In addition, some code
examples may need to be customized to match your environment. For
example, you may need to alter directory path information or user ID
information.

1.2. Getting started


To get started with the lab exercises, you need to install and launch the VMware image as well as start
the required services.

__1. If necessary, obtain a copy of the BigInsights 3.0 Quick Start Edition VMware image from IBM's
external download site
(http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/downloads.html).
Use the image for the single-node cluster.

__2. Follow the instructions provided to decompress (unzip) the file and install the image on your
laptop. Note that there is a README file with additional information.

__3. If necessary, install VMware player or other required software to run VMware images. Details
are in the README file provided with the BigInsights VMware image.

__4. Launch the VMware image. When logging in for the first time, use the root ID (with a password
of password). Follow the instructions to configure your environment, accept the licensing
agreement, and enter the passwords for the root and biadmin IDs (root/password and
biadmin/biadmin) when prompted. This is a one-time only requirement.

Page 4 Explore Hadoop and BigInsights


__5. When the one-time configuration process is completed, you will be presented with a SUSE Linux
log in screen. Log in as biadmin with a password of biadmin.


__6. Verify that your screen appears similar to this:

__7. Click Start BigInsights to start all required services. (Alternatively, you can open a terminal
window and issue this command: $BIGINSIGHTS_HOME/bin/start-all.sh)

Wait until the operation completes. This may take several minutes, depending on your
machine's resources.

__8. Verify that all required BigInsights services are up and running. From a terminal window, issue
this command: $BIGINSIGHTS_HOME/bin/status.sh

__9. Inspect the results, a subset of which are shown below. Verify that, at a minimum, the following
components started successfully: hdm, zookeeper, hadoop, catalog, hive, bigsql, oozie,
console, and httpfs.

Now you're ready to start working with big data!


If you have any questions or need help getting your environment up and running, visit Hadoop
Dev (https://developer.ibm.com/hadoop/) and review the product documentation or post
a message to the forum.

You cannot proceed with subsequent lab exercises until you've logged into the VMware
image and launched the necessary BigInsights services.


Lab 2 Issuing basic Hadoop commands


In this exercise, you’ll work directly with Apache Hadoop to perform some basic tasks involving the
Hadoop Distributed File System (HDFS) and launching a sample application. All the work you’ll perform
here involves commands and interfaces provided with Hadoop from http://hadoop.apache.org. As
mentioned earlier, Hadoop is part of IBM’s InfoSphere BigInsights platform.

Allow 15 minutes to complete this lab module.

2.1. Creating a directory in your distributed file system


__1. Click the BigInsights Shell icon.

__2. Select the Terminal icon to open a terminal window.

__3. Execute the following Hadoop file system command to create a directory in HDFS for your work:

hadoop fs -mkdir /user/biadmin/test

Note that HDFS is distinct from your Unix/Linux local file system directory, and working with
HDFS requires using hadoop fs commands.

2.2. Copying data into HDFS


__1. Using standard Unix/Linux file system commands, list the contents of the /home/biadmin/licenses
directory.

ls /home/biadmin/licenses

Note the BIlicense_en.txt file. It contains license information in English, and it will serve as a
sample data file for a future exercise.

__2. Copy the BIlicense_en.txt file into the /user/biadmin/test directory you just created in HDFS.

hadoop fs -put /home/biadmin/licenses/BIlicense_en.txt /user/biadmin/test

__3. List the contents of your target HDFS directory to verify that the file was successfully copied.

hadoop fs -ls /user/biadmin/test
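
The three file system steps above can also be scripted. The following is a minimal sketch in Python, not part of the original lab: the helper names hdfs_steps and run_steps are hypothetical, and the sketch assumes the hadoop command is on your PATH when dry_run is set to False. By default it only prints the commands.

```python
import subprocess

def hdfs_steps(local_file, hdfs_dir):
    """Return the three `hadoop fs` invocations from this exercise as argv lists."""
    return [
        ["hadoop", "fs", "-mkdir", hdfs_dir],            # create the HDFS directory
        ["hadoop", "fs", "-put", local_file, hdfs_dir],  # copy the local file into HDFS
        ["hadoop", "fs", "-ls", hdfs_dir],               # list the directory to verify the copy
    ]

def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # raises if the command fails

steps = hdfs_steps("/home/biadmin/licenses/BIlicense_en.txt", "/user/biadmin/test")
run_steps(steps)  # dry run: prints the commands without contacting HDFS
```

Keeping the commands as argument lists (rather than one shell string) avoids quoting problems if a path ever contains spaces.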

2.3. Running a sample MapReduce application


WordCount is one of several sample MapReduce applications provided for Apache Hadoop. Written in
Java, it simply scans through input document(s) and, for each word, returns the total number of
occurrences found. You can read more about WordCount on the Apache wiki
(http://wiki.apache.org/hadoop/WordCount).

Since launching MapReduce applications (or jobs) is a common practice in Hadoop, you'll explore how to
do that with WordCount.

__1. Execute the following command to launch the sample WordCount application provided with your
Hadoop distribution.

hadoop jar /opt/ibm/biginsights/IHC/hadoop-example.jar wordcount /user/biadmin/test WordCount_output

This command specifies that the wordcount application contained in the specified .jar file is to be
launched. The input for this application is in the /user/biadmin/test directory of HDFS. The output of
this job will be stored in HDFS in the WordCount_output subdirectory of the user executing this
command (biadmin). Thus, the output directory will be /user/biadmin/WordCount_output. This
directory will be created automatically as a result of executing this application.

NOTE: If the output folder already exists or if you try to rerun a successful
MapReduce job with the same parameters, you will receive an error message. This
is the default behavior of the sample WordCount application.


__2. Inspect the output of your job.

hadoop fs -ls WordCount_output

In this case, the output was small and was written to a single file. If you had run WordCount
against a larger volume of data, its output would have been split into multiple files (e.g., part-r-00000,
part-r-00001, and so on).

__3. To view the contents of the part-r-00000 file, issue this command:

hadoop fs -cat WordCount_output/*00

Partial output is shown here:

__4. Optionally, inspect details about your job. Open a Web browser, or click on the web console
icon on your desktop and open a new tab. Access the URL for Hadoop's Job Tracker
(http://bivm.ibm.com:50030/jobtracker.jsp). Scroll to the Completed Jobs section to
locate the Job ID associated with the Word Count application. Click on the Job ID link to review
details, such as the number of Map and Reduce tasks launched for your application, the number
of bytes read and written, etc. Partial output is shown in the second image that follows.
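
To make the logic of WordCount more concrete, here is a minimal single-process sketch in Python. It is not the Java sample shipped with Hadoop; the function names are hypothetical, and splitting words on whitespace is an assumption about how the input is tokenized. It only illustrates the idea of a map step that emits (word, 1) pairs and a reduce step that sums them per word.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token in a line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(pairs)

sample = ["big data big insights", "big sql"]
counts = word_count(sample)
for word in sorted(counts):
    print(f"{word}\t{counts[word]}")
```

In a real cluster the map and reduce steps run as separate tasks on different nodes, which is why the Job Tracker page reports counts of Map and Reduce tasks for your job.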


Lab 3 Exploring and administering your cluster with the BigInsights Web console

As you saw in the previous lab, Apache Hadoop users typically work through a command line interface to
perform many common tasks. This lab introduces you to the BigInsights Web console, which enables
you to administer your cluster, work with HDFS, launch jobs, and perform many other tasks using a
graphical interface.

After completing this hands-on lab, you’ll be able to:


• Launch the Web console.
• Work with popular resources accessible through the Welcome page.
• Administer BigInsights by inspecting the status of your cluster and accessing tools for open
source components provided with BigInsights.
• Work with the distributed file system. In particular, you'll explore the HDFS directory structure,
create subdirectories, and upload files to HDFS.
• Manage and launch pre-built applications from a Web catalog.
• Inspect the status of previously launched applications (jobs) and review their output.

Allow 30 minutes to complete this section of the lab.

This lab is an introduction to a subset of console functions. Real-time monitoring, dashboards, alerts,
and application linking are among the more advanced console functions that are beyond this lab's scope.

3.1. Getting started with the Web Console


In this exercise, you will launch the console and inspect its Welcome page.

__1. Launch the BigInsights Web console. Direct your browser to http://bivm.ibm.com:8080 or click
the Web Console icon on your desktop.

__2. Log in with your user name and password (biadmin / biadmin).


__3. Verify that your Web console appears similar to this:

__4. Briefly skim through the links provided in these sections to become familiar with resources
available to you:

• Tasks: Quick access to popular BigInsights tasks
• Quick Links: Internal and external links and downloads to enhance your environment
• Learn More: Online resources available to learn more about BigInsights

3.2. Administering BigInsights


The Web console allows administrators to inspect the overall health of the system as well as perform
basic functions, such as starting and stopping specific servers or components, adding nodes to the
cluster, and so on. You’ll explore a subset of these capabilities here.

__5. Click on the Cluster Status tab at the top of the page.

__6. Inspect the overall status of your cluster. The figure below was taken on a single-node cluster
that had several services running. One service (Monitoring) was unavailable. Your display
may differ somewhat. It’s not necessary for all BigInsights services to be running to complete
the exercises in this lab.

__7. Click on the Hive service and note the detailed information provided for this service in the pane
at right. For example, you can see the URL for Hive's Web interface and its process ID. In
addition, note that you can start and stop services (such as the Hive service) from the Cluster
Status page of the console.


__8. Optionally, cut-and-paste the URL for Hive’s Web interface into a new tab of your browser.
You'll see an open source tool provided with Hive for administration purposes, as shown below.

Other open source tools provided with Apache Hadoop are also available through IBM's
packaged distribution (BigInsights), as you'll see shortly. Close this browser tab.

__9. Click on the Welcome page of your Web console.

__10. Click on the Access secure cluster servers button in the Quick Links section at right.

If nothing appears, verify that the pop-up blocker of your browser is disabled; a prompt should
appear at the top of the page if pop-ups are blocked.

__11. Inspect the list of server components for which there are additional Web-based tools. The
BigInsights console displays the URLs you can use to access each of these Web sites directly.
(This information appears only if the pop-up blocker is disabled in your browser.)

__12. Click on the jobtracker alias. The display should be familiar to you -- it's the same one you saw
in the previous lab that introduced you to some basic Hadoop facilities.

3.3. Working with the distributed file system (HDFS)
In this section, you'll learn how to use the Web console to create directories in HDFS, navigate the file
system, and upload small files -- tasks you performed earlier through a command-line interface. In
addition, you'll perform a few other file-related tasks as well. Many people find the console's graphical
interface to be easier to use than the command-line interface.

__1. Click on the Files tab at the top of the page.

__2. Expand the DFS directory tree in the left pane to display the contents of /user/biadmin. Note
the presence of the /WordCount_output and /test subdirectories, which you created in an
earlier lab. If desired, expand each directory and inspect its contents.


__3. Become familiar with the functions provided through the icons at the top of this pane, as we'll
refer to some of these in subsequent sections of this module. Simply position your cursor on
each icon to learn its function. From left to right, the icons enable you to copy a file or directory,
move a file, create a directory, rename a file or directory, upload a file to HDFS, download a file
from HDFS to your local file system, remove a file or directory from HDFS, set permissions, open
a command window to launch HDFS shell commands, and refresh the Web console page.

__4. Delete the /user/biadmin/test directory and its contents. Position your cursor on this
directory, click the red X icon, and click Yes when prompted.

__5. Create a new subdirectory in /user/biadmin. With your cursor positioned on /user/biadmin,
click the create directory icon.

__6. When a pop-up window appears, specify test2 as the new directory's name and click OK.


__7. Expand the directory hierarchy to verify that your new subdirectory was created.

__8. Upload a file into this directory from your local file system. Click the upload icon.

__9. When a pop-up window appears, click the Browse button to navigate through your local file
system to /home/biadmin/licenses. Select the BIlicense_en.txt file and click Open.

__10. Expand the /user/biadmin/test2 directory and verify that the BIlicense_en.txt file was
successfully copied into HDFS. Note that the right pane of the Web console previews the file's
contents.


3.4. Managing and launching pre-built applications from the Web catalog
The Web console includes a catalog of ready-made applications that users can launch through a
graphical interface. Each application's status, execution history, and output are easy to monitor from this
page as well. In this exercise, you'll first manage the catalog’s contents, selecting one of more than 20
pre-built applications provided with BigInsights to deploy on your cluster. Once deployed, the application
will be visible to all authorized users. You'll then launch the application, monitor its execution status, and
inspect its output.

As you might have guessed, the sample application used in this lab is Word Count -- the same
application you ran from a command line earlier.

__1. Click the Applications tab of the Web console. No applications are deployed on a new cluster,
so there won't be much to see yet.

__2. In the upper left corner, click Manage. A list of applications available for deployment is
displayed.

__3. Expand the Test category and click on the Word Count application.

__4. Click Deploy.

__5. When a pop-up window appears, accept the defaults for all settings and click Deploy.


__6. After the application has been deployed, you're ready to run it. Click Run in the upper left pane.

__7. Verify that the Word Count application appears in the catalog. (Any other applications that were
previously deployed to the Web catalog will also appear.)

__8. Click on the Word Count icon. The pane at right prompts you to enter appropriate information.
For this application, you need to specify an execution name for your application's run, the HDFS
directory containing the input document(s) for the Word Count application, and an output
directory in HDFS.

__9. For the Execution name, enter My Test Run 1.

__10. For the Input path, click Browse and navigate to /user/biadmin/test2. Click OK.

__11. For the Output path, type /user/biadmin/WordCount_console_output. (Recall that the Word
Count application creates this output directory at run time. If you specify an existing HDFS
directory for the output, the application will fail.)

__12. Verify that your display appears similar to this and click Run.


__13. As your application executes, monitor its status through the Application History pane at lower
right.

__14. When the application completes successfully, click the link provided in the Output column to see
the application's output.

__15. Optionally, return to the Applications page of the console and click on the link provided in the
Details column for your application's run.

__16. Note that the console displays the Application Status page, which contains information about the
Oozie workflow for your application as well as the application itself. If desired, click on one or
more available links to explore details available for your review.


Lab 4 Analyzing social media data with BigSheets


To help business analysts and those without a programming background analyze big data, IBM provides
a spreadsheet-style tool called BigSheets. In this lab, you'll learn how you can explore big data through
this tool without writing any scripts or MapReduce applications. The sample data for this lab consists of
social media posts about a popular brand (IBM Watson) that were collected using a sample application
provided with BigInsights. For background information, you may want to read the article on Analyzing
social media and structured data with InfoSphere BigInsights at
http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html

After completing this hands-on lab, you’ll be able to:


• Create a BigSheets workbook
• Analyze and customize a workbook
• Visualize your workbook's data in a chart
• Create a Big SQL table based on your workbook
• Export your workbook's data into one of several popular formats

Allow 45 – 60 minutes to complete this lab.

Where did this data come from?


For time efficiency, social media data about "IBM Watson" was already collected using the
Boardreader sample application, which collects social media data from various global sites
and writes the output in JSON array format to files. This lab focuses on blog data collected
about IBM Watson for a six-month interval.

Boardreader is an IBM business partner that offers a social media content aggregation and
provisioning service based on multilingual data dating back to 2001. The service searches
message boards / forums, social networks, blogs/comments, microblogs, reviews,
videos/comments and online news. Customers who want to use the Boardreader service
should contact the firm directly to obtain a license key.

4.1. Creating a workbook


To get started, copy the sample blogs-data.txt file to HDFS and create a master workbook for it.

__1. Obtain the blogs-data.txt file. You’ll find this in the sampleData.zip file provided with the article
mentioned earlier.

__2. Use Hadoop file system commands or the BigInsights Web console to create subdirectories in
HDFS for your sample data. Under /user/biadmin, create a /sampleData directory. Beneath
/user/biadmin/sampleData, create the /IBMWatson subdirectory.

If you forgot how to create a subdirectory in HDFS, consult the earlier labs on Issuing Basic
Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web
Console.

__3. Upload the blogs-data.txt file to the /user/biadmin/sampleData/IBMWatson directory. You
can use Hadoop file system commands or the BigInsights Web console to do this. (If you forgot
how to copy a file to HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or
Exploring and Administering Your Cluster with the BigInsights Web Console.)

__4. From the Files page of the Web console, position your cursor on the
/user/biadmin/sampleData/IBMWatson/blogs-data.txt file, as shown in the previous
image.

__5. Click the Sheet radio button to preview this data in a spreadsheet-style format.

__6. Because the sample blog data for this lab uses a JSON Array structure, you must click on the
pencil icon to select an appropriate reader (data format translator) for this data. Select the JSON
Array reader and click the green check.


__7. Save this as a Master Workbook named Watson Blogs. Optionally, provide a description. Click
Save.

__8. Note that the BigSheets page of the Web console will open and your new workbook will be
displayed.

Now you're ready to begin exploring this data using BigSheets.
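
If you are curious what a JSON Array reader has to do, the following Python sketch shows one way to turn a JSON array of records into column names and rows. It is an illustration only, not the BigSheets implementation: the function name read_json_array is hypothetical, and the sample records are invented (only the field names Country, Language, and Url are taken from the columns used later in this lab).

```python
import json

# Invented sample records; real blog records contain many more fields.
sample = """[
  {"Country": "US", "Language": "English", "Url": "http://example.com/blog/1"},
  {"Country": "DE", "Language": "German", "Url": "http://example.org/post/2"}
]"""

def read_json_array(text):
    """Parse a top-level JSON array of objects into (columns, rows)."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Use the union of all keys so records with missing fields still line up.
    columns = sorted({key for record in records for key in record})
    rows = [[record.get(col, "") for col in columns] for record in records]
    return columns, rows

columns, rows = read_json_array(sample)
print(columns)
for row in rows:
    print(row)
```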

4.2. Analyzing and customizing your workbook


BigSheets offers analysts a variety of macros, functions, and built-in analytical features. You'll learn
about a few here.

__1. To make it easier to search and manage your workbooks, add a few tags to the Watson Blogs
master workbook you just created. In the upper right corner, click the icon to toggle the
workbook display to show additional fields.

Depending on the size of your browser, an additional scroll bar may appear at right.

__2. Scroll down to the Workbook Details section. Locate the Tags field, select the green plus sign
(+) , enter a tag for Watson, and click the green check mark. Repeat the process to add
separate tags for IBM and blogs.

__3. Click on the Workbooks link in the upper left corner of your open workbook.

__4. From the list of available workbooks, you can quickly search for a specific tag. Use the drop-
down Tags menu to select the blogs tag or type tag: blogs into the box.


__5. Open the Watson Blogs master workbook again. (Double click on it.)

__6. Create a new workbook based on this master workbook. In BigSheets, a master workbook is a
“base” workbook and has a limited set of things you can edit. So, to manipulate the data
contained within a workbook, you want to create a new workbook derived from the master.

__a. Click the Build new Workbook button.

__b. When the new Workbook appears, change its default name. Click the pencil icon next to
the name, enter Watson Blogs Revised as the new name, and click the green check
mark.

__c. Click the Fit column(s) button to more easily see columns A through H on your screen.

__7. Remove the column IsAdult from your workbook. This is currently column E. Click on the
triangle next to the column name of IsAdult and select Remove.

Did I lose data?
Deleting a column does not remove data. Deleting a column in a workbook just
removes the mapping to this column.

__8. In this case, you want to keep only a few columns. To easily remove several columns, click the
triangle again (from any column) and select Organize Columns.

__a. Click the red X button next to each column you want to remove.

In this case, KEEP the following columns:


__i. Country
__ii. FeedInfo
__iii. Language
__iv. Published
__v. SubjectHtml
__vi. Tags
__vii. Type
__viii. Url

__b. Click the green check mark button when you are ready to remove the columns you
selected.


__9. Click on the Fit column(s) button again to show columns A through H. Verify that your screen
appears similar to this:

__10. From the Save menu at upper left, select Save. Provide a description for your workbook if you’d
like.

__11. Apply a built-in function to further investigate the contents of this workbook. Click the Add
Sheets button in the lower left corner.

__12. From the pop-up menu, select Function. You're going to apply a built-in function that extracts
the URL Host information from the full URL links associated with the blog data that was
captured. Doing so will enable you to identify and chart sites with greatest blog coverage of IBM
Watson.

__13. From the Function menu, click Categories and Url.

__14. Select the URLHOST function.

__15. In the new menu that appears, enter Get Host URL as the sheet name and select the Url
column as the source of input to the URLHOST function.


__16. At the bottom of the menu, click the Carry Over tab to specify which columns from the workbook
you'd like to retain. Select Add All and click the green check mark.

__17. Verify that your workbook contains a new URLHOST column and all previously existing columns.
(Whenever you create a new Sheet or edit your workbook in some way, BigSheets will preview
the results of your work against a small sample of the data represented by your workbook.) If
desired, click the Fit Column button to show more columns on your screen.

__18. Click Save > Save & Exit.

__19. When prompted to Run or Close the workbook, click Run. "Running" a workbook instructs
BigSheets to apply the logic you specified graphically against all data associated with your
workbook. You can monitor the progress of your request by watching the status bar indicator in
the upper right-hand side of the page.

__20. When the operation completes, verify that your workbook appears similar to this:


__21. If desired, use the Next button in the lower right corner to page through the content a few
times, noting the various URLHOST values. If desired, you could use built-in BigSheets features
to sort the data based on URLHOST (or other) values, filter records (such as blogs written in the
English language), etc. But perhaps the quickest way to see which sites published the most
blogs about IBM Watson during this time period is to chart the results. You'll do that next.
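
The idea behind the URLHOST function used in this exercise can be approximated outside BigSheets. The sketch below uses Python's standard urllib.parse module; the helper name url_host and the sample URLs are hypothetical, and the exact output of the BigSheets function (for example, how it treats port numbers or malformed URLs) may differ.

```python
from urllib.parse import urlparse

def url_host(url):
    """Return the host portion of a URL, or an empty string if none is found."""
    return urlparse(url).hostname or ""

urls = [
    "http://www.example.com/blogs/watson-post",
    "https://blogs.example.org:8443/2012/05/watson",
    "not a url",
]
for url in urls:
    print(repr(url_host(url)))
```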

4.3. Creating charts


Now that you've customized your workbook to eliminate some unwanted columns and generate a new
column containing URL host information, it's time to visualize the results. In this short exercise, you'll
create two simple charts that identify the top 10 global sites with the most blog posts about IBM Watson.

__1. If necessary, open the Watson Blogs Revised workbook.

__2. Click on the Add chart link in the lower left.

__3. Select chart > Bar as the chart type.

__4. Specify appropriate properties for the bar chart, paying close attention to these fields:

__a. Title: Top 10 Blog Sites for IBM Watson

__b. X Axis: URLHOST

__c. Sort By: Y Axis

__d. Occurrence Order: Descending

__e. Limit: 10


__5. Click the green check mark.

__6. When prompted, Run the chart. This causes BigSheets to apply your instructions to the entire
data set.

__7. Inspect the results. Are you surprised that ibm.com wasn’t the top site for blog posts about IBM
Watson?

__8. If desired, hover over each bar to see the URL host name and the number of blogs posted at that
site.

__9. Next, create a new chart of a different type to visualize the information in a different format.
Select Add Chart > Categories > cloud > Bubble Cloud.

__10. Provide appropriate values for the following fields:

__a. Title: Top 10 Blog Sites for IBM Watson

__b. Tags: URLHOST

__c. Occurrence Order: Descending

__d. Sort By: Count

__e. Limit: 10


__11. Click the green check mark.

__12. When prompted, Run the chart.

__13. Inspect the results. If desired, hover over a bubble to see the number of blog postings for that
site.
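
Both charts in this section rest on the same computation: count the rows for each URLHOST value, sort the counts in descending order, and keep the first 10. A minimal Python sketch of that computation follows. The function name top_hosts and the sample host names are hypothetical; BigSheets performs the equivalent work for you when you run the chart.

```python
from collections import Counter

def top_hosts(hosts, limit=10):
    """Return (host, count) pairs sorted by descending count, limited to `limit` entries."""
    return Counter(hosts).most_common(limit)

# Invented sample of URLHOST values, one per blog post.
hosts = ["a.example.com"] * 3 + ["b.example.com"] * 2 + ["c.example.com"]
for host, count in top_hosts(hosts, limit=2):
    print(f"{host}\t{count}")
```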

4.4. Creating a Big SQL table based on your workbook


BigSheets offers a wide range of built-in features, including the ability to create a Big SQL table from
your workbook. This is quite handy if you have SQL-based tools or applications that you'd like to use
with data you've customized in BigSheets.

__1. If necessary, open your Watson Blogs Revised workbook.

__2. Click the Create Table button just above the columns of your workbook. When prompted, accept
sheets as the target schema name and type mywatsonblogs as the target table name.

__3. Click Confirm.

__4. From the Files page of the Web console, click the Catalog Tables tab in the navigation window
and expand the sheets folder.

__5. Click the mywatsonblogs file. Note that a preview of the table appears in the pane at right.

__6. Click the Welcome tab of the Web console. In the Quick Links section, click the Run Big SQL
queries link.


__7. A new tab will appear in your Web browser.

__8. In the box where you're prompted to enter your Big SQL query, type this statement:

select urlhost, language, subjecthtml from sheets.mywatsonblogs
fetch first 10 rows only;

__9. Verify that the Big SQL radio button is checked (not the Big SQL V1 radio button).

__10. If necessary, use the scroll bar at right to expose the Run button just below the radio buttons.
Click Run.

__11. Inspect the results.

Querying tables with Big SQL
While the Web console's Big SQL query interface is handy for executing test
queries that return a small amount of data, it's best to use other facilities provided
by IBM or third parties to execute Big SQL queries that return larger volumes of
data to avoid memory constraints imposed by your browser. In a subsequent lab,
you'll learn how to execute Big SQL queries from Eclipse.

__12. Close the Big SQL browser tab.

4.5. Optional: Exporting your workbook data


In this optional exercise, you'll see how easy it is to export data in your workbook to one of several
popular formats so that other applications can easily access the data.

__1. If necessary, open your Watson Blogs Revised workbook.

__2. Click Export data. From the drop-down menu, select TSV (tab separated value) as the format
type.

__3. Click the File radio button to export the data to a file in your distributed file system.


__4. Use the Browse button to navigate to the directory in HDFS where you would like to export this
workbook. In this case, select /user/biadmin/sampleData/IBMWatson. In the box below the
directory tree, enter myworkbook as the file name. Do not add a file extension such as .tsv.
Click OK.

__5. Click OK again to initiate the data export operation.

__6. When a message appears indicating that the operation has finished, click OK.

__7. On the Files page of the Web console, navigate to the directory you specified for the export
(/user/biadmin/sampleData/IBMWatson) and locate your new myworkbook.tsv file.

__8. Optionally, click the download icon to copy the file from HDFS to a directory of your choice in
your local file system.

Lab 5 Querying data with Big SQL


Now that you know how to work with HDFS and analyze your data with a spreadsheet-style tool, it’s a
good time to explore how you can query your data with Big SQL. Big SQL provides broad SQL support
based on the ISO SQL standard. You can issue queries using JDBC or ODBC drivers to access data
that is stored in InfoSphere BigInsights in the same way that you access relational databases from your
enterprise applications. The SQL query engine supports joins, unions, grouping, common table
expressions, windowing functions, and other familiar SQL expressions.

This tutorial uses sales data from a fictional company that sells and distributes outdoor products to third-
party retailer stores as well as directly to consumers through its online store. It maintains its data in a
series of FACT and DIMENSION tables, as is common in relational data warehouse environments. In
this lab, you will explore how to create, populate, and query a subset of the star schema database to
investigate the company’s performance and offerings. Note that BigInsights provides scripts to create
and populate the more than 60 tables that comprise the sample GOSALESDW database. You will use
fewer than 10 of these tables in this lab.

To execute the queries in this lab, you will use the open source Eclipse environment provided with the
BigInsights Quick Start Edition VMware image. Of course, you can use other tools or interfaces to
invoke Big SQL, such as the Java SQL Shell (JSqsh), a command-line facility provided with
BigInsights. However, Eclipse is a good choice for this lab, as it formats query results in a manner that’s
easy to read and encourages you to collect your SQL statements into scripts for editing and testing.

After you complete the lessons in this module, you will understand how to:
• Connect to the Big SQL server from Eclipse
• Execute individual or multiple Big SQL statements
• Create Big SQL tables in Hadoop
• Populate Big SQL tables with data from local files
• Query Big SQL tables using projections, restrictions, joins, aggregations, and other popular
expressions
• Create and query a view based on multiple Big SQL tables
• Create and run a JDBC client application for Big SQL using Eclipse

Allow 45 – 60 minutes to complete this lab.

5.1. Creating a project and executing Big SQL statements


To begin, create a BigInsights project and Big SQL script.

__1. Launch Eclipse using the icon on your desktop. Accept the default workspace when prompted.

__2. Create a BigInsights project for your work. From the Eclipse menu bar, click File > New > Other.
Expand the BigInsights folder, select BigInsights Project, and then click Next.

__3. Type myBigSQL in the Project name field, and then click Finish.

__4. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective
window opens. Click Yes to switch to the BigInsights perspective.

__5. Create a new SQL script file. From the Eclipse menu bar, click File > New > Other. Expand the
BigInsights folder, select SQL script, and then click Next.

__6. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL.
Your new SQL file is stored in this project folder.

__7. In the File name field, type aFirstFile. The .sql extension is added automatically. Click Finish.

In the Select Connection Profile window, locate the Big SQL JDBC connection, which is the
pre-defined connection to Big SQL 3.0 provided with the VMware image. Inspect the properties
displayed in the Properties field. Verify that the connection uses the JDBC driver and database
name shown in the Properties pane here.

About the driver selection

You may be wondering why you are using a connection that employs the
com.ibm.db2.jcc.DB2Driver driver class. In 2014, IBM released a common SQL
query engine as part of its DB2 and BigInsights offerings. Doing so provides
for greater SQL commonality across its relational DBMS and Hadoop-based
offerings. It also brings a greater breadth of SQL function to Hadoop
(BigInsights) users. This common query engine is accessible through the DB2
driver. The Big SQL driver remains operational and offers connectivity to an
earlier, BigInsights-specific SQL query engine. This lab focuses on using the
common SQL query engine.

__8. Click Edit to edit this connection's login information.

__9. Change the user name and password properties to match your user ID and password (e.g.,
biadmin / biadmin). Leave the remaining property values intact.

__10. Click Test Connection to verify that you can successfully connect to the server.

__11. Check the Save password box and click OK.

__12. Click Finish to close the connection window. Your empty SQL script will be displayed.

__13. Copy the following statement into your SQL script:

create hadoop table test1 (col1 int, col2 varchar(5));

Because you didn't specify a schema name for the table, it will be created in your default schema,
which is your user name (biadmin). Thus, the previous statement is equivalent to

create hadoop table biadmin.test1 (col1 int, col2 varchar(5));

In some cases, the Eclipse SQL editor may flag certain Big SQL statements as
containing syntax errors. Ignore these false warnings and continue with your lab
exercises.

__14. Save your file (press Ctrl + S or click File > Save).

__15. Right-click anywhere in the script to display a menu of options.

__16. Select Run SQL or press F5. This causes all statements in your script to be executed.

__17. Inspect the SQL Results pane that appears towards the bottom of your display. (If desired,
double click on the SQL Results tab to enlarge this pane. Then double click on the tab again to
return the pane to its normal size.) Verify that the statement executed successfully. Your Big
SQL database now contains a new table named BIADMIN.TEST1. Note that your schema and
table name were folded into upper case.

For the remainder of this lab, you should execute each SQL statement
individually. To do so, highlight the statement with your cursor and press F5.

When you’re developing a SQL script with multiple statements, it’s generally a
good idea to test each statement one at a time to verify that each is working as
expected.

__18. From your Eclipse project, query the system for metadata about your test1 table:

select tabschema, colname, colno, typename, length
from syscat.columns where tabschema = USER and tabname= 'TEST1';

In case you're wondering, syscat.columns is one of a number of views supplied over system
catalog data automatically maintained for you by the Big SQL service.
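
The catalog query above searches for 'TEST1' in upper case because, as noted earlier, the unquoted table name test1 was folded into upper case when the table was created. The following is a minimal Java sketch of that rule for names written without quotes. The class and method names (CatalogNames, fold, columnQuery) are hypothetical helpers introduced only for illustration; they are not part of BigInsights or the JDBC driver.

```java
import java.util.Locale;

final class CatalogNames {
    // Mimics the upper-case folding described above for identifiers written without quotes.
    static String fold(String unquotedIdentifier) {
        return unquotedIdentifier.trim().toUpperCase(Locale.ROOT);
    }

    // Builds the syscat.columns lookup from this step for a given unquoted table name.
    // Illustration only: pass trusted names, since the value is concatenated into the SQL text.
    static String columnQuery(String unquotedTable) {
        return "select tabschema, colname, colno, typename, length "
             + "from syscat.columns where tabschema = USER and tabname = '"
             + fold(unquotedTable) + "'";
    }
}
```

For example, CatalogNames.columnQuery("test1") produces a statement whose last predicate is tabname = 'TEST1', matching the literal used in the step above.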

__19. Inspect the SQL Results to verify that the query executed successfully, and click on the Result1
tab to view its output.

__20. Finally, clean up the object you created in the database.

drop table test1;

__21. Save your file. If desired, leave it open to execute statements for subsequent exercises.

Now that you’ve set up your Eclipse environment and know how to create SQL scripts and execute
queries, you’re ready to develop more sophisticated scenarios using Big SQL. In the next lesson, you will
create a number of tables in your schema and use Eclipse to query them.

5.2. Creating sample tables and loading sample data


In this lesson, you will create several sample tables and load data into these tables from local files.

__1. Determine the location of the sample data in your local file system and make a note of it. You
will need to use this path specification when issuing LOAD commands later in this lab.

Subsequent examples in this section presume your sample data is in the
/opt/ibm/biginsights/bigsql/samples/data directory. This is the location
of the data on the BigInsights VMware image, and it is the default location in
typical BigInsights installations.

Furthermore, the /opt/ibm/biginsights/bigsql/samples/queries
directory contains SQL scripts that include the CREATE TABLE, LOAD, and
SELECT statements used in this lab, as well as other statements.

__2. Create several tables to track information about sales. Issue each of the following CREATE
TABLE statements one at a time, and verify that each completed successfully:
-- dimension table for region info

CREATE HADOOP TABLE IF NOT EXISTS go_region_dim
( country_key INT NOT NULL
, country_code INT NOT NULL
, flag_image VARCHAR(45)
, iso_three_letter_code VARCHAR(9) NOT NULL
, iso_two_letter_code VARCHAR(6) NOT NULL
, iso_three_digit_code VARCHAR(9) NOT NULL
, region_key INT NOT NULL
, region_code INT NOT NULL
, region_en VARCHAR(90) NOT NULL
, country_en VARCHAR(90) NOT NULL
, region_de VARCHAR(90), country_de VARCHAR(90), region_fr VARCHAR(90)
, country_fr VARCHAR(90), region_ja VARCHAR(90), country_ja VARCHAR(90)
, region_cs VARCHAR(90), country_cs VARCHAR(90), region_da VARCHAR(90)
, country_da VARCHAR(90), region_el VARCHAR(90), country_el VARCHAR(90)
, region_es VARCHAR(90), country_es VARCHAR(90), region_fi VARCHAR(90)
, country_fi VARCHAR(90), region_hu VARCHAR(90), country_hu VARCHAR(90)
, region_id VARCHAR(90), country_id VARCHAR(90), region_it VARCHAR(90)
, country_it VARCHAR(90), region_ko VARCHAR(90), country_ko VARCHAR(90)
, region_ms VARCHAR(90), country_ms VARCHAR(90), region_nl VARCHAR(90)
, country_nl VARCHAR(90), region_no VARCHAR(90), country_no VARCHAR(90)
, region_pl VARCHAR(90), country_pl VARCHAR(90), region_pt VARCHAR(90)
, country_pt VARCHAR(90), region_ru VARCHAR(90), country_ru VARCHAR(90)
, region_sc VARCHAR(90), country_sc VARCHAR(90), region_sv VARCHAR(90)
, country_sv VARCHAR(90), region_tc VARCHAR(90), country_tc VARCHAR(90)
, region_th VARCHAR(90), country_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- dimension table tracking method of order for the sale (e.g., Web, fax)

CREATE HADOOP TABLE IF NOT EXISTS sls_order_method_dim
( order_method_key INT NOT NULL
, order_method_code INT NOT NULL
, order_method_en VARCHAR(90) NOT NULL
, order_method_de VARCHAR(90), order_method_fr VARCHAR(90)
, order_method_ja VARCHAR(90), order_method_cs VARCHAR(90)
, order_method_da VARCHAR(90), order_method_el VARCHAR(90)
, order_method_es VARCHAR(90), order_method_fi VARCHAR(90)
, order_method_hu VARCHAR(90), order_method_id VARCHAR(90)
, order_method_it VARCHAR(90), order_method_ko VARCHAR(90)
, order_method_ms VARCHAR(90), order_method_nl VARCHAR(90)
, order_method_no VARCHAR(90), order_method_pl VARCHAR(90)
, order_method_pt VARCHAR(90), order_method_ru VARCHAR(90)
, order_method_sc VARCHAR(90), order_method_sv VARCHAR(90)
, order_method_tc VARCHAR(90), order_method_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- look up table with product brand info in various languages

CREATE HADOOP TABLE IF NOT EXISTS sls_product_brand_lookup
( product_brand_code INT NOT NULL
, product_brand_en VARCHAR(90) NOT NULL
, product_brand_de VARCHAR(90), product_brand_fr VARCHAR(90)
, product_brand_ja VARCHAR(90), product_brand_cs VARCHAR(90)
, product_brand_da VARCHAR(90), product_brand_el VARCHAR(90)
, product_brand_es VARCHAR(90), product_brand_fi VARCHAR(90)
, product_brand_hu VARCHAR(90), product_brand_id VARCHAR(90)
, product_brand_it VARCHAR(90), product_brand_ko VARCHAR(90)
, product_brand_ms VARCHAR(90), product_brand_nl VARCHAR(90)
, product_brand_no VARCHAR(90), product_brand_pl VARCHAR(90)
, product_brand_pt VARCHAR(90), product_brand_ru VARCHAR(90)
, product_brand_sc VARCHAR(90), product_brand_sv VARCHAR(90)
, product_brand_tc VARCHAR(90), product_brand_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- product dimension table

CREATE HADOOP TABLE IF NOT EXISTS sls_product_dim
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- look up table with product line info in various languages

CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
( product_line_code INT NOT NULL
, product_line_en VARCHAR(90) NOT NULL
, product_line_de VARCHAR(90), product_line_fr VARCHAR(90)
, product_line_ja VARCHAR(90), product_line_cs VARCHAR(90)
, product_line_da VARCHAR(90), product_line_el VARCHAR(90)
, product_line_es VARCHAR(90), product_line_fi VARCHAR(90)
, product_line_hu VARCHAR(90), product_line_id VARCHAR(90)
, product_line_it VARCHAR(90), product_line_ko VARCHAR(90)
, product_line_ms VARCHAR(90), product_line_nl VARCHAR(90)
, product_line_no VARCHAR(90), product_line_pl VARCHAR(90)
, product_line_pt VARCHAR(90), product_line_ru VARCHAR(90)
, product_line_sc VARCHAR(90), product_line_sv VARCHAR(90)
, product_line_tc VARCHAR(90), product_line_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- look up table for products


CREATE HADOOP TABLE IF NOT EXISTS sls_product_lookup
( product_number INT NOT NULL
, product_language VARCHAR(30) NOT NULL
, product_name VARCHAR(150) NOT NULL
, product_description VARCHAR(765)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- fact table for sales


CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
( order_day_key INT NOT NULL
, organization_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, retailer_site_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, order_method_key INT NOT NULL
, sales_order_key INT NOT NULL
, ship_day_key INT NOT NULL
, close_day_key INT NOT NULL
, quantity INT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- fact table for marketing promotions


CREATE HADOOP TABLE IF NOT EXISTS mrk_promotion_fact
( organization_key INT NOT NULL
, order_day_key INT NOT NULL
, rtl_country_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, sales_order_key INT NOT NULL
, quantity SMALLINT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Let’s briefly explore some aspects of the CREATE TABLE statements shown here. If
you have a SQL background, the majority of these statements should be familiar to
you. However, after the column specification, there are some additional clauses
unique to Big SQL – clauses that enable it to exploit Hadoop storage mechanisms (in
this case, Hive). The ROW FORMAT clause specifies that fields are to be terminated
by tabs (“\t”) and lines are to be terminated by new line characters (“\n”). The table
will be stored in a TEXTFILE format, making it easy for a wide range of applications to
work with. For details on these clauses, refer to the Apache Hive documentation.
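
To make the ROW FORMAT clause more concrete, here is a small Java sketch of how one tab-delimited line in a TEXTFILE lines up with the columns of a table. The DelimitedRow class and the sample line are made up for illustration, and the sketch only shows the splitting rule; it does not reproduce how Hive or Big SQL read files internally.

```java
import java.util.Arrays;
import java.util.List;

final class DelimitedRow {
    // Splits one line on the '\t' field delimiter named in the ROW FORMAT clause.
    // The limit of -1 keeps trailing empty fields, so the column count stays stable.
    static List<String> fields(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        // Hypothetical row shaped like sls_product_lookup:
        // product_number, product_language, product_name, product_description
        String line = "1110\tEN\tSample Tent\t";
        List<String> cols = fields(line);
        System.out.println(cols.size() + " fields, name = " + cols.get(2));
    }
}
```

Because the format is plain delimited text, any tool that can split a line on tabs can read the same files, which is the portability benefit the paragraph above describes.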

__3. Load data into each of these tables using sample data provided in files. One at a time, issue
each of the following LOAD statements and verify that each completed successfully. Remember
to change the file path shown (if needed) to the appropriate path for your environment. The
statements will return a warning message providing details on the number of rows loaded, etc.

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE GO_REGION_DIM overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_ORDER_METHOD_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_ORDER_METHOD_DIM overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_BRAND_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_BRAND_LOOKUP overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_DIM overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_LOOKUP overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_SALES_FACT overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.MRK_PROMOTION_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MRK_PROMOTION_FACT overwrite;

Let’s explore the LOAD syntax shown in these examples briefly. The first line of each
example loads data into your table using a file URL specification and then specifies the
full path to the data source file on your local file system. Note that the path is local to the
Big SQL server (not your Eclipse client). The WITH SOURCE PROPERTIES clause
specifies that fields in the source data are delimited by tabs (“\t”). The INTO TABLE
clause identifies the target table for the LOAD operation. The OVERWRITE keyword
indicates that any existing data in the table will be replaced by data contained in the
source file. (If you wanted to simply add rows to the table’s content, you could specify
APPEND instead.)
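
Since the LOAD statements in this step differ only in the table name, the pattern can be sketched as a small Java helper that assembles the statement text from its parts. This is only an illustration under stated assumptions: the LoadStatements class and its method are hypothetical, and the helper presumes the GOSALESDW.<TABLE>.txt file naming used by the sample data in this lab.

```java
import java.util.Locale;

final class LoadStatements {
    // Assembles a LOAD statement like the ones in this step.
    // dataDir is an absolute path on the Big SQL server, e.g. the samples/data directory.
    // overwrite = true replaces existing rows; false appends to them.
    static String loadFromLocalFile(String dataDir, String table, boolean overwrite) {
        String name = table.toUpperCase(Locale.ROOT);
        return "load hadoop using file url 'file://" + dataDir + "/GOSALESDW." + name + ".txt' "
             + "with SOURCE PROPERTIES ('field.delimiter'='\\t') "
             + "INTO TABLE " + name + (overwrite ? " overwrite" : " append");
    }

    public static void main(String[] args) {
        System.out.println(loadFromLocalFile(
            "/opt/ibm/biginsights/bigsql/samples/data", "go_region_dim", true) + ";");
    }
}
```

Printing the statement and pasting it into your SQL script lets you review the path and target table before you run the load.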

Note that loading data from a local file is only one of several available options. You can
also load data using FTP or SFTP. This is particularly handy for loading data from
remote file systems, although you can practice using it against your local file system, too.
For example, the following statement for loading data into the GO_REGION_DIM table
using SFTP is equivalent to the syntax shown earlier for loading data into this table
from a local file:

load hadoop using file url
'sftp://myID:[email protected]:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE GO_REGION_DIM overwrite;

Big SQL supports other LOAD options, including loading data directly from a remote
relational DBMS via a JDBC connection. See the product documentation for details.

__4. Query the tables to verify that the expected number of rows was loaded into each table. Execute
each query that follows individually and compare the results with the number of rows specified in
the comment line preceding each query.

-- total rows in GO_REGION_DIM = 21
select count(*) from GO_REGION_DIM;

-- total rows in SLS_ORDER_METHOD_DIM = 7
select count(*) from SLS_ORDER_METHOD_DIM;

-- total rows in SLS_PRODUCT_BRAND_LOOKUP = 28
select count(*) from SLS_PRODUCT_BRAND_LOOKUP;

-- total rows in SLS_PRODUCT_DIM = 274
select count(*) from SLS_PRODUCT_DIM;

-- total rows in SLS_PRODUCT_LINE_LOOKUP = 5
select count(*) from SLS_PRODUCT_LINE_LOOKUP;

-- total rows in SLS_PRODUCT_LOOKUP = 6302
select count(*) from SLS_PRODUCT_LOOKUP;

-- total rows in SLS_SALES_FACT = 446023
select count(*) from SLS_SALES_FACT;

-- total rows in MRK_PROMOTION_FACT = 11034
select count(*) from MRK_PROMOTION_FACT;

5.3. Querying tables with joins, aggregations and more


Now you're ready to query your tables. Based on earlier exercises, you've already seen that you can
perform basic SQL operations, including projections (to extract specific columns from your tables) and
restrictions (to extract specific rows meeting certain conditions you specified). Let's explore a few
examples that are a bit more sophisticated.

In this lesson, you will create and run Big SQL queries that join data from multiple tables as well as
perform aggregations and other SQL operations. Note that the queries included in this section are based
on queries shipped with BigInsights as samples. Some of these queries return hundreds of thousands of
rows; however, the Eclipse SQL Results page limits output to only 500 rows. Although you can change
that value in the Data Management preferences section, retain the default setting for this lab.
__1. Join data from multiple tables to return the product name, quantity and order method of goods
that have been sold. To do so, execute the following query.

-- Fetch the product name, quantity, and order method
-- of products sold.
-- Query 1
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
sls_sales_fact sales,
sls_product_dim prod,
sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key;

Let’s review a few aspects of this query briefly:

• Data from four tables drives the results of this query (see the tables referenced in the FROM
clause). Relationships between these tables are resolved through three equi-join predicates
specified as part of the WHERE clause. (A predicate such as
prod.product_number=pnumb.product_number limits the results to rows whose product numbers
match in both tables.)
• For improved readability, this query uses aliases in the SELECT and FROM clauses when
referencing tables. For example, pnumb.product_name refers to “pnumb,” which is the alias for
the sls_product_lookup table. Once defined in the FROM clause, an alias can be used
in the SELECT list and WHERE clause so that you do not need to repeat the complete table name.
• The predicate pnumb.product_language='EN' further narrows the result to English output only.
This database contains thousands of rows of data in various languages, so restricting the
language reduces the number of rows returned.

__2. Modify the query to restrict the order method to one type – those involving a Sales visit. To
do so, add the following query predicate just before the semi-colon:

AND order_method_en='Sales visit'

__3. Inspect the results.

__4. To find out which of all the order methods has the greatest quantity of orders, execute the
following query. It adds a GROUP BY clause (group by pll.product_line_en, md.order_method_en)
and invokes the SUM aggregate function (sum(sf.quantity)) to total the quantity ordered by
product line and order method. Finally, this query cleans up the output a bit by using aliases
(e.g., as Product) to substitute more readable column headers.
-- Query 3
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
sum(sf.QUANTITY) AS total
FROM
sls_order_method_dim AS md,
sls_product_dim AS pd,
sls_product_line_lookup AS pll,
sls_product_brand_lookup AS pbl,
sls_sales_fact AS sf
WHERE
pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en;
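
If the GROUP BY and SUM combination is new to you, the following Java sketch performs the same kind of aggregation in memory: rows that share a product line and order method collapse into one output row carrying the summed quantity. The GroupBySketch class and any rows you pass to it are hypothetical illustrations; Big SQL performs this work inside the query engine, not in client code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class GroupBySketch {
    // One already-joined row: product line, order method, quantity.
    static final class Sale {
        final String productLine;
        final String orderMethod;
        final int quantity;

        Sale(String productLine, String orderMethod, int quantity) {
            this.productLine = productLine;
            this.orderMethod = orderMethod;
            this.quantity = quantity;
        }
    }

    // Equivalent in spirit to: GROUP BY product_line_en, order_method_en with SUM(quantity).
    // The map key combines both grouping columns; the value is the summed quantity.
    static Map<String, Integer> totals(List<Sale> sales) {
        return sales.stream().collect(Collectors.groupingBy(
            s -> s.productLine + " / " + s.orderMethod,
            TreeMap::new,
            Collectors.summingInt(s -> s.quantity)));
    }
}
```

Each distinct combination of the two grouping columns yields exactly one entry in the map, just as it yields exactly one row in the query result.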

__5. Inspect the results, which should contain 35 rows.

5.4. Optional: Using SerDes for non-traditional data


While data structured in CSV and TSV formats is often stored in BigInsights and loaded into Big SQL
tables, you may also need to work with other types of data – data that might require the use of a
serializer / deserializer (SerDe). SerDes are common in the Hadoop environment. You’ll find a number
of SerDes publicly available, or you can write your own following typical Hadoop practices.

Using a SerDe with Big SQL is pretty straightforward. Once you develop or locate the SerDe you need,
just add its JAR file to the appropriate BigInsights subdirectories. Then stop and restart the Big SQL
service, and specify the SerDe class name when you create your table.

In this lab exercise, you will use a SerDe to define a table for JSON-based blog data. The sample blog
file for this exercise is the same blog file you used as input to BigSheets in a prior lab.

__1. Download the hive-json-serde-0.2.jar into a directory of your choice on your local file system,
such as /home/biadmin/sampleData. (As of this writing, the full URL for this SerDe is
https://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar)

__2. Register the SerDe with BigInsights.

__a. Stop the Big SQL server. From a terminal window, issue this command:
$BIGINSIGHTS_HOME/bin/stop.sh bigsql

__b. Copy the SerDe .jar file to the $BIGSQL_HOME/userlib and $HIVE_HOME/lib
directories.

__c. Restart the Big SQL server. From a terminal window, issue this command:
$BIGINSIGHTS_HOME/bin/start.sh bigsql

Now that you’ve registered your SerDe, you’re ready to use it. In this section, you will create a table that
relies on the SerDe you just registered. For simplicity, this will be an externally managed table – i.e., a
table created over a user directory that resides outside of the Hive warehouse. This user directory will
contain the table's data in files. As part of this exercise, you will upload the sample blogs-data.txt file into
the target DFS directory.

Creating a Big SQL table over an existing DFS directory has the effect of populating this table with all the
data in the directory. To satisfy queries, Big SQL will look in the user directory specified when you
created the table and consider all files in that directory to be the table’s contents. This is consistent with
the Hive concept of an externally managed table.

Once the table is created, you'll query that table. In doing so, you'll note that the presence of a SerDe is
transparent to your queries.

__3. If necessary, download the .zip file containing the sample data from the bottom half of the article
referenced in the introduction. Unzip the file into a directory on your local file system, such as
/home/biadmin. You will be working with the blogs-data.txt file.

From the Files tab of the Web console, navigate to the /user/biadmin/sampleData directory
of your distributed file system. Use the create directory button to create a subdirectory named
SerDe-Test.

__4. Upload the blogs-data.txt file into /user/biadmin/sampleData/SerDe-Test.

__5. Return to the Big SQL execution environment of your choice (JSqsh or Eclipse).

__6. Execute the following statement, which creates a TESTBLOGS table whose LOCATION
clause specifies the DFS directory containing your sample blogs-data.txt file:
create hadoop table if not exists testblogs (
Country String,
Crawled String,
FeedInfo String,
Inserted String,
IsAdult int,
Language String,
Postsize int,
Published String,
SubjectHtml String,
Tags String,
Type String,
Url String)
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '/user/biadmin/sampleData/SerDe-Test';

5.5. Optional: Developing a JDBC client application with Big SQL


You can write a JDBC client application that uses Big SQL to open a database connection, execute
queries, and process the results. In this optional exercise, you'll see how writing a client JDBC
application for Big SQL is like writing a client application for any relational DBMS that supports JDBC
access.

__1. In the IBM InfoSphere BigInsights Eclipse environment, create a Java project by clicking File >
New >Project. From the New Project window, select Java Project. Click Next.

__2. Type a name for the project in the Project Name field, such as MyJavaProject. Click Next.

__3. Open the Libraries tab and click Add External Jars. Add the DB2 JDBC driver for BigInsights,
located at /opt/ibm/biginsights/database/db2/java/db2jcc4.jar.

__4. Click Finish. Click Yes when you are asked if you want to open the Java perspective.

__5. Right-click the MyJavaProject project, and click New > Package. In the Name field, in the New
Java Package window, type a name for the package, such as aJavaPackage4me. Click Finish.

__6. Right-click the aJavaPackage4me package, and click New > Class.

__7. In the New Java Class window, in the Name field, type SampApp. Select the public static void
main(String[] args) check box. Click Finish.

__8. Replace the default code for this class and copy or type the following code into the
SampApp.java file (you'll find the file in
/opt/ibm/biginsights/bigsql/samples/data/SampApp.java):
package aJavaPackage4me;

//a. Import required package(s)


import java.sql.*;

public class SampApp {

Hands On Lab Page 67


IBM Software

/**
* @param args
*/

//b. set JDBC & database info


//change these as needed for your environment
static final String db = "jdbc:db2://YOUR_HOST_NAME:51000/bigsql";
static final String user = "YOUR_USER_ID";
static final String pwd = "YOUR_PASSWORD";

public static void main(String[] args) {


Connection conn = null;
Statement stmt = null;
System.out.println("Started sample JDBC application.");

try{
//c. Register JDBC driver -- not needed for DB2 JDBC type 4 connection
// Class.forName("com.ibm.db2.jcc.DB2Driver");

//d. Get a connection


conn = DriverManager.getConnection(db, user, pwd);
System.out.println("Connected to the database.");

//e. Execute a query


stmt = conn.createStatement();
System.out.println("Created a statement.");
String sql;
sql = "select product_color_code, product_number from sls_product_dim " +
"where product_key=30001";
ResultSet rs = stmt.executeQuery(sql);
System.out.println("Executed a query.");

//f. Obtain results


System.out.println("Result set: ");
while(rs.next()){
//Retrieve by column name
int product_color = rs.getInt("PRODUCT_COLOR_CODE");
int product_number = rs.getInt("PRODUCT_NUMBER");
//Display values
System.out.print("* Product Color: " + product_color + "\n");
System.out.print("* Product Number: " + product_number + "\n");
}

//g. Close open resources


rs.close();
stmt.close();
conn.close();

}catch(SQLException sqlE){
// Process SQL errors
sqlE.printStackTrace();
}catch(Exception e){
// Process other errors
e.printStackTrace();
}

finally{

// Ensure resources are closed before exiting
try{
if(stmt!=null)
stmt.close();
}catch(SQLException sqle2){
} // nothing we can do

try{
if(conn!=null)
conn.close();
}
catch(SQLException sqlE){
sqlE.printStackTrace();
}// end inner try/catch
}// end finally block
System.out.println("Application complete");
}}

__a. After the package declaration, ensure that you include the packages that contain
the JDBC classes that are needed for database programming (import java.sql.*;).
__b. Set up the database information so that you can refer to it. Be sure to change
the user ID, password, and connection information as needed for your environment.
__c. Optionally, register the JDBC driver. The class name is provided here for your
reference. When using the DB2 Type 4.0 JDBC driver, it’s not necessary to specify the
class name.
__d. Open the connection.
__e. Run a query by submitting an SQL statement to the database.
__f. Extract data from the result set.
__g. Clean up the environment by closing all of the database resources.
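
Steps d through g can also be written more compactly. The following is a minimal sketch, assuming Java 7 or later, that uses try-with-resources so the connection, statement, and result set are closed automatically, even when a query fails. It is not part of the lab's sample file: the class name SampAppSketch and the formatRow helper are hypothetical, the package declaration is omitted so the sketch stands alone, and a PreparedStatement with a parameter marker is swapped in for the plain Statement on the assumption that the driver accepts parameter markers, as standard JDBC drivers do.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical variant of SampApp; the class name and formatRow helper are illustrative only.
class SampAppSketch {

    // Same placeholders as the lab code; change these as needed for your environment.
    static final String DB = "jdbc:db2://YOUR_HOST_NAME:51000/bigsql";
    static final String USER = "YOUR_USER_ID";
    static final String PWD = "YOUR_PASSWORD";

    // Formats one row the same way the lab code prints it.
    static String formatRow(int productColor, int productNumber) {
        return "* Product Color: " + productColor + "\n"
             + "* Product Number: " + productNumber + "\n";
    }

    // Runs the lab query; each resource declared in a try header is closed
    // automatically, in reverse order of declaration, when its block exits.
    static void runQuery(int productKey) throws SQLException {
        String sql = "select product_color_code, product_number "
                   + "from sls_product_dim where product_key = ?";
        try (Connection conn = DriverManager.getConnection(DB, USER, PWD);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setInt(1, productKey); // bind the product key to the parameter marker
            try (ResultSet rs = stmt.executeQuery()) {
                System.out.println("Result set: ");
                while (rs.next()) {
                    System.out.print(formatRow(rs.getInt("PRODUCT_COLOR_CODE"),
                                               rs.getInt("PRODUCT_NUMBER")));
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Started sample JDBC application.");
        try {
            runQuery(30001);
        } catch (SQLException sqlE) {
            // Process SQL errors
            sqlE.printStackTrace();
        }
        System.out.println("Application complete");
    }
}
```

With this structure, the explicit close() calls and the finally block from step g are no longer needed, because the try-with-resources statement handles the cleanup.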

__9. Save the file, then right-click the Java file and click Run > Run as > Java Application.

__10. The results appear in the Console view of Eclipse:


Started sample JDBC application.
Connected to the database.
Created a statement.
Executed a query.
Result set:
* Product Color: 908
* Product Number: 1110
Application complete
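
If the run fails instead of producing this output, the stack trace printed by the catch block can be hard to scan. As a hedged sketch, the helper below summarizes a SQLException using only the standard java.sql.SQLException methods getSQLState, getErrorCode, getMessage, and getNextException. The class name SqlErrorReport and the describe method are hypothetical, and the SQLState and error code values in main are invented for the demonstration; they are not actual Big SQL output.

```java
import java.sql.SQLException;

// Hypothetical helper class; not part of the lab's sample code.
class SqlErrorReport {

    // Builds one line per exception in the chain, following getNextException().
    static String describe(SQLException e) {
        StringBuilder sb = new StringBuilder();
        for (SQLException cur = e; cur != null; cur = cur.getNextException()) {
            sb.append("SQLState=").append(cur.getSQLState())
              .append(" ErrorCode=").append(cur.getErrorCode())
              .append(" Message=").append(cur.getMessage())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative values only, constructed locally without a database.
        SQLException first = new SQLException("Could not connect", "08001", 42);
        first.setNextException(new SQLException("Connection does not exist", "08003", 7));
        System.out.print(describe(first));
    }
}
```

In SampApp, a call such as System.err.print(SqlErrorReport.describe(sqlE)) could be placed in the SQLException catch block alongside, or in place of, printStackTrace().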

Lab 6 Summary
In this lab, you gained hands-on experience using many popular capabilities of InfoSphere BigInsights,
IBM's Hadoop-based platform for analyzing big data. You explored your BigInsights cluster using a
Web-based console and manipulated social media data using a spreadsheet-style interface. You also
created Big SQL tables for your data and executed several complex queries over this data.

To expand your skills even further, visit the HadoopDev web site (https://developer.ibm.com/hadoop/),
which contains links to free online courses, tutorials, and more.

Now you’re ready to get started using BigInsights for your own projects. What will you do with big
data?

© Copyright IBM Corporation 2014.

The information contained in these materials is provided for
informational purposes only, and is provided AS IS without warranty
of any kind, express or implied. IBM shall not be responsible for any
damages arising out of the use of, or otherwise related to, these
materials. Nothing contained in these materials is intended to, nor
shall have the effect of, creating any warranties or representations
from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of
IBM software. References in these materials to IBM products,
programs, or services do not imply that they will be available in all
countries in which IBM operates. This information is based on
current IBM product plans and strategy, which are subject to change
by IBM without notice. Product release dates and/or capabilities
referenced in these materials may change at any time at IBM’s sole
discretion based on market opportunities or other factors, and are not
intended to be a commitment to future product or feature availability
in any way.

IBM, the IBM logo and ibm.com are trademarks of International
Business Machines Corp., registered in many jurisdictions
worldwide. Other product and service names might be trademarks of
IBM or other companies. A current list of IBM trademarks is
available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml.
