Lab 1: Installation of BigInsights
In this hands-on lab, you'll learn how to work with Big Data using Apache Hadoop and InfoSphere
BigInsights, IBM's Hadoop-based platform. In particular, you'll learn the basics of working with the
Hadoop Distributed File System (HDFS) and see how to administer your Hadoop-based environment
using the BigInsights Web console. After launching a sample MapReduce application, you'll explore a
more sophisticated scenario involving social media data. In doing so, you'll learn how to use a
spreadsheet-style interface to discover insights about the global coverage of a popular brand without
writing any code. Finally, you'll learn how to apply industry standard SQL to data managed by
BigInsights through IBM's Big SQL technology. Indeed, you'll have a chance to create tables and
execute complex queries over data in HDFS, including data derived from a relational data warehouse.
• Inspect and administer your cluster through the BigInsights Web Console
This lab was developed by Cynthia M. Saracco, IBM Silicon Valley Lab. Please post questions or
comments about this lab or the technologies it describes to the forum on Hadoop Dev at
https://developer.ibm.com/hadoop/.
Account                       User       Password
VM image root account         root       password
VM image lab user account     biadmin    biadmin
BigInsights administrator     biadmin    biadmin
Big SQL administrator         bigsql     bigsql
Lab user                      biadmin    biadmin
Property                         Value
Host name                        bivm.ibm.com
BigInsights Web Console URL      http://bivm.ibm.com:8080
Big SQL database name            bigsql
Big SQL port number              51000
Screen captures in this lab depict examples and results that may vary from
what you see when you complete the exercises. In addition, some code
examples may need to be customized to match your environment. For
example, you may need to alter directory path information or user ID
information.
__1. If necessary, obtain a copy of the BigInsights 3.0 Quick Start Edition VMware image from IBM's
external download site (http://www-01.ibm.com/software/data/infosphere/biginsights/quick-
start/downloads.html). Use the image for the single-node cluster.
__2. Follow the instructions provided to decompress (unzip) the file and install the image on your
laptop. Note that there is a README file with additional information.
__3. If necessary, install VMware player or other required software to run VMware images. Details
are in the README file provided with the BigInsights VMware image.
__4. Launch the VMware image. When logging in for the first time, use the root ID (with a password
of password). Follow the instructions to configure your environment, accept the licensing
agreement, and enter the passwords for the root and biadmin IDs (root/password and
biadmin/biadmin) when prompted. This is a one-time only requirement.
__7. Click Start BigInsights to start all required services. (Alternatively, you can open a terminal
window and issue this command: $BIGINSIGHTS_HOME/bin/start-all.sh)
Wait until the operation completes. This may take several minutes, depending on your
machine's resources.
__8. Verify that all required BigInsights services are up and running. From a terminal window, issue
this command: $BIGINSIGHTS_HOME/bin/status.sh.
__9. Inspect the results, a subset of which are shown below. Verify that, at a minimum, the following
components started successfully: hdm, zookeeper, hadoop, catalog, hive, bigsql, oozie,
console, and httpfs.
__3. Execute the following Hadoop file system command to create a directory in HDFS for your work:
Note that HDFS is distinct from your Unix/Linux local file system directory, and working with
HDFS requires using hadoop fs commands.
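The exact command isn't reproduced in this extract; a typical invocation, assuming the target directory is /user/biadmin/test (the directory referenced in the following steps), is:

hadoop fs -mkdir /user/biadmin/test
hadoop fs -ls /user/biadmin        # confirm the new directory exists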
ls /home/biadmin/licenses
Note the BIlicense_en.txt file. It contains license information in English, and it will serve as a
sample data file for a future exercise.
__2. Copy the BIlicense_en.txt file into the /user/biadmin/test directory you just created in HDFS.
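A typical command for this step, assuming the file resides in /home/biadmin/licenses as listed above, is:

hadoop fs -put /home/biadmin/licenses/BIlicense_en.txt /user/biadmin/test
hadoop fs -ls /user/biadmin/test   # verify the copy succeeded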
Since launching MapReduce applications (or jobs) is a common practice in Hadoop, you'll explore how to
do that with WordCount.
__1. Execute the following command to launch the sample WordCount application provided with your
Hadoop distribution.
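The command isn't shown in this extract. A representative invocation appears below; the name and path of the examples .jar file vary by release, so adjust it to match the jar shipped with your distribution.

hadoop jar $BIGINSIGHTS_HOME/IHC/hadoop-example.jar wordcount \
    /user/biadmin/test WordCount_output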
This command specifies that the wordcount application contained in the specified .jar file is to be
launched. The input for this application is in the /user/biadmin/test directory of HDFS. The output of
this job will be stored in HDFS in the WordCount_output subdirectory of the user executing this
command (biadmin). Thus, the output directory will be /user/biadmin/WordCount_output. This
directory will be created automatically as a result of executing this application.
NOTE: If the output folder already exists or if you try to rerun a successful
MapReduce job with the same parameters, you will receive an error message. This
is the default behavior of the sample WordCount application.
In this case, the output was small and was written to a single file. If you had run WordCount
against a larger volume of data, its output would have been split into multiple files (e.g., part-r-00001,
part-r-00002, and so on).
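If you want to inspect the results from the command line, list the output directory and display the part file (the file name below assumes the default single-reducer output):

hadoop fs -ls /user/biadmin/WordCount_output
hadoop fs -cat /user/biadmin/WordCount_output/part-r-00000 | head -20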
This lab is an introduction to a subset of console functions. Real-time monitoring, dashboards, alerts,
and application linking are among the more advanced console functions that are beyond this lab's scope.
__1. Launch the BigInsights Web console. Direct your browser to http://bivm.ibm.com:8080 or click
the Web Console icon on your desktop.
__2. Log in with your user name and password (biadmin / biadmin).
__4. Briefly skim through the links provided in these sections to become familiar with resources
available to you:
__6. Inspect the overall status of your cluster. The figure below was taken on a single-node cluster
that had several services running. One service (Monitoring) was unavailable. Your display
may differ somewhat. It’s not necessary for all BigInsights services to be running to complete
the exercises in this lab.
__7. Click on the Hive service and note the detailed information provided for this service in the pane
at right. For example, you can see the URL for Hive's Web interface and its process ID. In
addition, note that you can start and stop services (such as the Hive service) from the Cluster
Status page of the console.
__8. Optionally, cut-and-paste the URL for Hive’s Web interface into a new tab of your browser.
You'll see an open source tool provided with Hive for administration purposes, as shown below.
Other open source tools provided with Apache Hadoop are also available through IBM's
packaged distribution (BigInsights), as you'll see shortly. Close this browser tab.
__10. Click on the Access secure cluster servers button in the Quick Links section at right.
If nothing appears, verify that the pop-up blocker of your browser is disabled; a prompt should
appear at the top of the page if pop-ups are blocked.
__11. Inspect the list of server components for which there are additional Web-based tools. The
BigInsights console displays the URLs you can use to access each of these Web sites directly.
(This information will only appear if the pop-up blocker is disabled in your browser.)
__12. Click on the jobtracker alias. The display should be familiar to you -- it's the same one you saw
in the previous lab that introduced you to some basic Hadoop facilities.
__2. Expand the DFS directory tree in the left pane to display the contents of /user/biadmin. Note
the presence of the /WordCount_output and /test subdirectories, which you created in an
earlier lab. If desired, expand each directory and inspect its contents.
__3. Become familiar with the functions provided through the icons at the top of this pane, as we'll
refer to some of these in subsequent sections of this module. Simply position your cursor on
each icon to learn its function. From left to right, the icons enable you to copy a file or directory,
move a file, create a directory, rename a file or directory, upload a file to HDFS, download a file
from HDFS to your local file system, remove a file or directory from HDFS, set permissions, open
a command window to launch HDFS shell commands, and refresh the Web console page.
__4. Delete the /user/biadmin/test directory and its contents. Position your cursor on this
directory, click the red X icon, and click Yes when prompted.
__6. When a pop-up window appears, specify test2 as the new directory's name and click OK.
__7. Expand the directory hierarchy to verify that your new subdirectory was created.
__8. Upload a file into this directory from your local file system. Click the upload icon.
__10. Expand the /user/biadmin/test2 directory and verify that the BIlicense_en.txt file was
successfully copied into HDFS. Note that the right pane of the Web console previews the file's
contents.
3.4. Managing and launching pre-built applications from the Web catalog
The Web console includes a catalog of ready-made applications that users can launch through a
graphical interface. Each application's status, execution history, and output are easy to monitor from this
page as well. In this exercise, you'll first manage the catalog’s contents, selecting one of more than 20
pre-built applications provided with BigInsights to deploy on your cluster. Once deployed, the application
will be visible to all authorized users. You'll then launch the application, monitor its execution status, and
inspect its output.
As you might have guessed, the sample application used in this lab is Word Count -- the same
application you ran from a command line earlier.
__1. Click the Applications tab of the Web console. No applications are deployed on a new cluster,
so there won't be much to see yet.
__2. In the upper left corner, click Manage. A list of applications available for deployment is
displayed.
__5. When a pop-up window appears, accept the defaults for all settings and click Deploy.
__6. After the application has been deployed, you're ready to run it. Click Run in the upper left pane.
__7. Verify that the Word Count application appears in the catalog. (Any other applications that were
previously deployed to the Web catalog will also appear.)
__10. For the Input path, click Browse and navigate to /user/biadmin/test2. Click OK.
__11. For the Output path, type /user/biadmin/WordCount_console_output. (Recall that the Word
Count application creates this output directory at run time. If you specify an existing HDFS
directory for the output, the application will fail.)
__12. Verify that your display appears similar to this and click Run.
__13. As your application executes, monitor its status through the Application History pane at lower
right.
__14. When the application completes successfully, click the link provided in the Output column to see
the application's output.
__15. Optionally, return to the Applications page of the console and click on the link provided in the
Details column for your application's run.
Boardreader is an IBM business partner that offers a social media content aggregation and
provisioning service based on multilingual data dating back to 2001. The service searches
message boards / forums, social networks, blogs/comments, microblogs, reviews,
videos/comments and online news. Customers who want to use the Boardreader service
should contact the firm directly to obtain a license key.
__1. Obtain the blogs-data.txt file. You’ll find this in the sampleData.zip file provided with the article
mentioned earlier.
__2. Use Hadoop file system commands or the BigInsights Web console to create subdirectories in
HDFS for your sample data. Under /user/biadmin, create a /sampleData directory. Beneath
/user/biadmin/sampleData, create the /IBMWatson subdirectory.
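If you prefer the command line to the Web console, a sketch of these steps follows; the local path of the unzipped blogs-data.txt file is an assumption about your environment, so adjust it as needed.

hadoop fs -mkdir /user/biadmin/sampleData
hadoop fs -mkdir /user/biadmin/sampleData/IBMWatson
# copy the sample blog data into HDFS (adjust the local path as needed)
hadoop fs -put /home/biadmin/sampleData/blogs-data.txt /user/biadmin/sampleData/IBMWatson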
__4. From the Files page of the Web console, position your cursor on the
/user/biadmin/sampleData/IBMWatson/blogs-data.txt file, as shown in the previous
image.
__5. Click the Sheet radio button to preview this data in a spreadsheet-style format.
__6. Because the sample blog data for this lab uses a JSON Array structure, you must click on the
pencil icon to select an appropriate reader (data format translator) for this data. Select the JSON
Array reader and click the green check.
__7. Save this as a Master Workbook named Watson Blogs. Optionally, provide a description. Click
Save.
__8. Note that the BigSheets page of the Web console will open and your new workbook will be
displayed.
Depending on the size of your browser window, an additional scroll bar may appear at right.
__2. Scroll down to the Workbook Details section. Locate the Tags field, select the green plus sign
(+), enter a tag for Watson, and click the green check mark. Repeat the process to add
separate tags for IBM and blogs.
__3. Click on the Workbooks link in the upper left corner of your open workbook.
__4. From the list of available workbooks, you can quickly search for a specific tag. Use the drop-
down Tags menu to select the blogs tag or type tag: blogs into the box.
__5. Open the Watson Blogs master workbook again. (Double click on it.)
__6. Create a new workbook based on this master workbook. In BigSheets, a master workbook is a
“base” workbook and has a limited set of things you can edit. So, to manipulate the data
contained within a workbook, you want to create a new workbook derived from the master.
__b. When the new Workbook appears, change its default name. Click the pencil icon next to
the name, enter Watson Blogs Revised as the new name, and click the green check
mark.
__c. Click the Fit column(s) button to more easily see columns A through H on your screen.
__7. Remove the IsAdult column from your workbook. This is currently column E. Click on the
triangle next to the IsAdult column name and select Remove.
__8. In this case, you want to keep only a few columns. To easily remove several columns, click the
triangle again (from any column) and select Organize Columns.
__a. Click the red X button next to each column you want to remove.
__b. Click the green check mark button when you are ready to remove the columns you
selected.
__9. Click on the Fit column(s) button again to show columns A through H. Verify that your screen
appears similar to this:
__10. From the Save menu at upper left, select Save. Provide a description for your workbook if you’d
like.
__11. Apply a built-in function to further investigate the contents of this workbook. Click the Add
Sheets button in the lower left corner.
__15. In the new menu that appears, enter Get Host URL as the sheet name and select the Url
column as the source of input to the URLHOST function.
__16. At the bottom of the menu, click the Carry Over tab to specify which columns from the workbook
you'd like to retain. Select Add All and click the green check mark.
__17. Verify that your workbook contains a new URLHOST column and all previously existing columns.
(Whenever you create a new Sheet or edit your workbook in some way, BigSheets will preview
the results of your work against a small sample of the data represented by your workbook.) If
desired, click the Fit Column button to show more columns on your screen.
__19. When prompted to Run or Close the workbook, click Run. "Running" a workbook instructs
BigSheets to apply the logic you specified graphically against all data associated with your
workbook. You can monitor the progress of your request by watching the status bar indicator in
the upper right-hand side of the page.
__20. When the operation completes, verify that your workbook appears similar to this:
__21. If desired, use the Next button in the lower right corner to page through the content a few
times, noting the various URLHOST values. If desired, you could use built-in BigSheets features
to sort the data based on URLHOST (or other) values, filter records (such as blogs written in the
English language), etc. But perhaps the quickest way to see which sites published the most
blogs about IBM Watson during this time period is to chart the results. You'll do that next.
__4. Specify appropriate properties for the bar chart, paying close attention to these fields:
__e. Limit: 10
__6. When prompted, Run the chart. This causes BigSheets to apply your instructions to the entire
data set.
__7. Inspect the results. Are you surprised that ibm.com wasn’t the top site for blog posts about IBM
Watson?
__9. Next, create a new chart of a different type to visualize the information in a different format.
Select Add Chart > Categories > cloud > Bubble Cloud.
__e. Limit: 10
__13. Inspect the results. If desired, hover over a bubble to see the number of blog postings for that
site.
__2. Click the Create Table button just above the columns of your workbook. When prompted, accept
sheets as the target schema name and type mywatsonblogs as the target table name.
__4. From the Files page of the Web console, click the Catalog Tables tab in the navigation window
and expand the sheets folder.
__5. Click the mywatsonblogs file. Note that a preview of the table appears in the pane at right.
__6. Click the Welcome tab of the Web console. In the Quick Links section, click the Run Big SQL
queries link.
__8. In the box where you're prompted to enter your Big SQL query, type this statement:
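The statement itself isn't reproduced in this extract; any valid Big SQL query against the table you just created will do. A minimal example:

-- simple sanity check against the table created from the workbook
SELECT COUNT(*) FROM sheets.mywatsonblogs;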
__9. Verify that the Big SQL radio button is checked (not the Big SQL V1 radio button).
__10. If necessary, use the scroll bar at right to expose the Run button just below the radio buttons.
Click Run.
__2. Click Export data. From the drop-down menu, select TSV (tab separated value) as the format
type.
__3. Click the File radio button to export the data to a file in your distributed file system.
__4. Use the Browse button to navigate to the directory in HDFS where you would like to export this
workbook. In this case, select /user/biadmin/sampleData/IBMWatson. In the box below the
directory tree, enter myworkbook as the file name. Do not add a file extension such as .tsv.
Click OK.
__6. When a message appears indicating that the operation has finished, click OK.
__7. On the Files page of the Web console, navigate to the directory you specified for the export
(/user/biadmin/sampleData/IBMWatson) and locate your new myworkbook.tsv file.
This tutorial uses sales data from a fictional company that sells and distributes outdoor products to third-
party retailer stores as well as directly to consumers through its online store. It maintains its data in a
series of FACT and DIMENSION tables, as is common in relational data warehouse environments. In
this lab, you will explore how to create, populate, and query a subset of the star schema database to
investigate the company’s performance and offerings. Note that BigInsights provides scripts to create
and populate the more than 60 tables that comprise the sample GOSALESDW database. You will use
fewer than 10 of these tables in this lab.
To execute the queries in this lab, you will use the open source Eclipse environment provided with the
BigInsights Quick Start Edition VMware image. Of course, you can use other tools or interfaces to
invoke Big SQL, such as the Java SQL Shell (JSqsh), a command-line facility provided with
BigInsights. However, Eclipse is a good choice for this lab, as it formats query results in a manner that’s
easy to read and encourages you to collect your SQL statements into scripts for editing and testing.
After you complete the lessons in this module, you will understand how to:
• Connect to the Big SQL server from Eclipse
• Execute individual or multiple Big SQL statements
• Create Big SQL tables in Hadoop
• Populate Big SQL tables with data from local files
• Query Big SQL tables using projections, restrictions, joins, aggregations, and other popular
expressions.
• Create and query a view based on multiple Big SQL tables.
• Create and run a JDBC client application for Big SQL using Eclipse.
__1. Launch Eclipse using the icon on your desktop. Accept the default workspace when prompted.
__2. Create a BigInsights project for your work. From the Eclipse menu bar, click File > New > Other.
Expand the BigInsights folder, and select BigInsights Project, and then click Next.
__4. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective
window opens. Click Yes to switch to the BigInsights perspective.
__5. Create a new SQL script file. From the Eclipse menu bar, click File > New > Other. Expand the
BigInsights folder, and select SQL script, and then click Next.
__6. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL.
Your new SQL file is stored in this project folder.
__7. In the File name field, type aFirstFile. The .sql extension is added automatically. Click Finish.
In the Select Connection Profile window, locate the Big SQL JDBC connection, which is the
pre-defined connection to Big SQL 3.0 provided with the VMware image. Inspect the properties
displayed in the Properties field. Verify that the connection uses the JDBC driver and database
name shown in the Properties pane here.
You may be wondering why you are using a connection that employs the
com.ibm.db2.jcc.DB2Driver class. In 2014, IBM released a common SQL
query engine as part of its DB2 and BigInsights offerings. Doing so provides
for greater SQL commonality across its relational DBMS and Hadoop-based
offerings. It also brings a greater breadth of SQL function to Hadoop
(BigInsights) users. This common query engine is accessible through the DB2
driver. The Big SQL driver remains operational and offers connectivity to an
earlier, BigInsights-specific SQL query engine. This lab focuses on using the
common SQL query engine.
__10. Click Test Connection to verify that you can successfully connect to the server.
__12. Click Finish to close the connection window. Your empty SQL script will be displayed.
Because you didn't specify a schema name for the table, it will be created in your default schema,
which is your user name (biadmin). Thus, the previous statement is equivalent to one that qualifies
the table name with the BIADMIN schema, as sketched below.
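Neither statement is reproduced in this extract. The sketch below shows the two equivalent forms; the column definitions are illustrative placeholders, not the lab's originals.

-- created in the default schema (biadmin)
CREATE HADOOP TABLE IF NOT EXISTS test1 (col1 INT, col2 VARCHAR(5));

-- equivalent statement with the schema named explicitly
CREATE HADOOP TABLE IF NOT EXISTS biadmin.test1 (col1 INT, col2 VARCHAR(5));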
In some cases, the Eclipse SQL editor may flag certain Big SQL statements as
containing syntax errors. Ignore these false warnings and continue with your lab
exercises.
__14. Save your file (press Ctrl + S or click File > Save).
__15. Right mouse click anywhere in the script to display a menu of options.
__16. Select Run SQL or press F5. This causes all statements in your script to be executed.
__17. Inspect the SQL Results pane that appears towards the bottom of your display. (If desired,
double click on the SQL Results tab to enlarge this pane. Then double click on the tab again to
return the pane to its normal size.) Verify that the statement executed successfully. Your Big
SQL database now contains a new table named BIADMIN.TEST1. Note that your schema and
table name were folded into upper case.
When you’re developing a SQL script with multiple statements, it’s generally a
good idea to test each statement one at a time to verify that each is working as
expected.
__18. From your Eclipse project, query the system for metadata about your test1 table:
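The lab's exact query isn't shown in this extract; a typical catalog query (the column list here is illustrative) looks like this:

-- list the columns of the TEST1 table from the system catalog
SELECT tabschema, colname, colno, typename, length
FROM syscat.columns
WHERE tabschema = USER AND tabname = 'TEST1';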
In case you're wondering, syscat.columns is one of a number of views supplied over system
catalog data automatically maintained for you by the Big SQL service.
__19. Inspect the SQL Results to verify that the query executed successfully, and click on the Result1
tab to view its output.
__21. Save your file. If desired, leave it open to execute statements for subsequent exercises.
Now that you’ve set up your Eclipse environment and know how to create SQL scripts and execute
queries, you’re ready to develop more sophisticated scenarios using Big SQL. In the next lab, you will
create a number of tables in your schema and use Eclipse to query them.
__1. Determine the location of the sample data in your local file system and make a note of it. You
will need to use this path specification when issuing LOAD commands later in this lab.
__2. Create several tables to track information about sales. Issue each of the following CREATE
TABLE statements one at a time, and verify that each completed successfully:
-- dimension table for region info
-- dimension table tracking method of order for the sale (e.g., Web, fax)
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
Let’s briefly explore some aspects of the CREATE TABLE statements shown here. If
you have a SQL background, the majority of these statements should be familiar to
you. However, after the column specification, there are some additional clauses
unique to Big SQL – clauses that enable it to exploit Hadoop storage mechanisms (in
this case, Hive). The ROW FORMAT clause specifies that fields are to be terminated
by tabs (“\t”) and lines are to be terminated by new line characters (“\n”). The table
will be stored in a TEXTFILE format, making it easy for a wide range of applications to
work with. For details on these clauses, refer to the Apache Hive documentation.
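As a compact illustration of these clauses, here is a minimal CREATE HADOOP TABLE statement in the same style; the table and column names are hypothetical and are not part of the GOSALESDW schema used in this lab.

CREATE HADOOP TABLE IF NOT EXISTS demo_dim
( demo_key    INT NOT NULL
, demo_name   VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;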
__3. Load data into each of these tables using sample data provided in files. One at a time, issue
each of the following LOAD statements and verify that each completed successfully. Remember
to change the file path shown (if needed) to the appropriate path for your environment. The
statements will return a warning message providing details on the number of rows loaded, etc.
load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt' with SOURCE
PROPERTIES ('field.delimiter'='\t') INTO TABLE GO_REGION_DIM overwrite;
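The remaining LOAD statements (not reproduced here) follow the same pattern, with only the file name and target table changing. For example, assuming the sample files follow the GOSALESDW.<TABLE>.txt naming convention shown above:

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt' with SOURCE
PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_DIM overwrite;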
Note that loading data from a local file is only one of several available options. You can
also load data using FTP or SFTP. This is particularly handy for loading data from
remote file systems, although you can practice using it against your local file system, too.
For example, the following statement for loading data into the
GOSALESDW.GO_REGION_DIM table using SFTP is equivalent to the syntax shown
earlier for loading data into this table from a local file:
load hadoop using file url
'sftp://myID:[email protected]:22/opt/ibm/biginsights/bigsql/
samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE
gosalesdw.GO_REGION_DIM overwrite;
Big SQL supports other LOAD options, including loading data directly from a remote
relational DBMS via a JDBC connection. See the product documentation for details.
__4. Query the tables to verify that the expected number of rows was loaded into each table. Execute
each query that follows individually and compare the results with the number of rows specified in
the comment line preceding each query.
-- total rows in GO_REGION_DIM = 21
select count(*) from GO_REGION_DIM;
In this lesson, you will create and run Big SQL queries that join data from multiple tables as well as
perform aggregations and other SQL operations. Note that the queries included in this section are based
on queries shipped with BigInsights as samples. Some of these queries return hundreds of thousands of
rows; however, the Eclipse SQL Results page limits output to only 500 rows. Although you can change
that value in the Data Management preferences section, retain the default setting for this lab.
__1. Join data from multiple tables to return the product name, quantity and order method of goods
that have been sold. To do so, execute the following query.
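The query isn't reproduced in this extract. The sketch below is consistent with the description in the bullets that follow (four tables, three equi-join predicates, and a language filter); the query shipped with the lab may differ in its details, and you may need to schema-qualify the table names (e.g., gosalesdw.) depending on how you created them.

-- Query 1 (sketch): product name, quantity, and order method of goods sold
SELECT pnumb.product_name, sf.quantity, md.order_method_en
FROM
    sls_sales_fact       AS sf,
    sls_product_dim      AS prod,
    sls_product_lookup   AS pnumb,
    sls_order_method_dim AS md
WHERE
    prod.product_key = sf.product_key
    AND md.order_method_key = sf.order_method_key
    AND prod.product_number = pnumb.product_number
    AND pnumb.product_language = 'EN';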
• Data from four tables will be used to drive the results of this query (see the tables referenced in
the FROM clause). Relationships between these tables are resolved through 3 join predicates
specified as part of the WHERE clause. The query relies on 3 equi-joins to filter data from the
referenced tables. (Predicates such as prod.product_number=pnumb.product_number help to
narrow the results to product numbers that match in two tables.)
• For improved readability, this query uses aliases in the SELECT and FROM clauses when
referencing tables. For example, pnumb.product_name refers to “pnumb,” which is the alias for
the gosalesdw.sls_product_lookup table. Once defined in the FROM clause, an alias can be used
in the WHERE clause so that you do not need to repeat the complete table name.
• The predicate pnumb.product_language = 'EN' helps to further narrow the result to English-only
output. This database contains thousands of rows of data in various languages, so
restricting the language provides some optimization.
__4. To find out which sales method of all the methods has the greatest quantity of orders, add a
GROUP BY clause (group by pll.product_line_en, md.order_method_en). In addition,
invoke the SUM aggregate function (sum(sf.quantity)) to total the orders by product and
method. Finally, this query cleans up the output a bit by using aliases (e.g., as Product) to
substitute a more readable column header.
-- Query 3
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
sum(sf.QUANTITY) AS total
FROM
sls_order_method_dim AS md,
sls_product_dim AS pd,
sls_product_line_lookup AS pll,
sls_product_brand_lookup AS pbl,
sls_sales_fact AS sf
WHERE
pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en;
Using a SerDe with Big SQL is pretty straightforward. Once you develop or locate the SerDe you need,
just add its JAR file to the appropriate BigInsights subdirectories. Then stop and restart the Big SQL
service, and specify the SerDe class name when you create your table.
In this lab exercise, you will use a SerDe to define a table for JSON-based blog data. The sample blog
file for this exercise is the same blog file you used as input to BigSheets in a prior lab.
__1. Download the hive-json-serde-0.2.jar into a directory of your choice on your local file system,
such as /home/biadmin/sampleData. (As of this writing, the full URL for this SerDe is
https://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar)
__a. Stop the Big SQL server. From a terminal window, issue this command:
$BIGINSIGHTS_HOME/bin/stop.sh bigsql
__b. Copy the SerDe .jar file to the $BIGSQL_HOME/userlib and $HIVE_HOME/lib
directories.
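For example, assuming the .jar file was downloaded to /home/biadmin/sampleData:

cp /home/biadmin/sampleData/hive-json-serde-0.2.jar $BIGSQL_HOME/userlib
cp /home/biadmin/sampleData/hive-json-serde-0.2.jar $HIVE_HOME/lib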
__c. Restart the Big SQL server. From a terminal window, issue this command:
$BIGINSIGHTS_HOME/bin/start.sh bigsql
Now that you’ve registered your SerDe, you’re ready to use it. In this section, you will create a table that
relies on the SerDe you just registered. For simplicity, this will be an externally managed table – i.e., a
table created over a user directory that resides outside of the Hive warehouse. This user directory will
contain the table's data in files. As part of this exercise, you will upload the sample blogs-data.txt file into
the target DFS directory.
Creating a Big SQL table over an existing DFS directory has the effect of populating this table with all the
data in the directory. To satisfy queries, Big SQL will look in the user directory specified when you
created the table and consider all files in that directory to be the table’s contents. This is consistent with
the Hive concept of an externally managed table.
Once the table is created, you'll query that table. In doing so, you'll note that the presence of a SerDe is
transparent to your queries.
__3. If necessary, download the .zip file containing the sample data from the bottom half of the article
referenced in the introduction. Unzip the file into a directory on your local file system, such as
/home/biadmin. You will be working with the blogs-data.txt file.
From the Files tab of the Web console, navigate to the /user/biadmin/sampleData directory
of your distributed file system. Use the create directory button to create a subdirectory named
SerDe-Test.
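Alternatively, you can perform this step (and the upload of the sample file) from a terminal window; the local path below assumes you unzipped the sample data into /home/biadmin.

hadoop fs -mkdir /user/biadmin/sampleData/SerDe-Test
hadoop fs -put /home/biadmin/blogs-data.txt /user/biadmin/sampleData/SerDe-Test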
__6. Execute the following statement, which creates a TESTBLOGS table that includes a LOCATION
clause that specifies the DFS directory containing your sample blogs-data.txt file:
create hadoop table if not exists testblogs (
Country String,
Crawled String,
FeedInfo String,
Inserted String,
IsAdult int,
Language String,
Postsize int,
Published String,
SubjectHtml String,
Tags String,
Type String,
Url String)
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '/user/biadmin/sampleData/SerDe-Test';
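The follow-up query from the lab isn't reproduced in this extract; once the table exists, any ordinary SELECT works, and the SerDe parses the JSON records transparently. For example:

-- the SerDe interprets each JSON record behind the scenes
SELECT country, language, url FROM testblogs FETCH FIRST 5 ROWS ONLY;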
__1. In the IBM InfoSphere BigInsights Eclipse environment, create a Java project by clicking File >
New > Project. From the New Project window, select Java Project. Click Next.
__2. Type a name for the project in the Project Name field, such as MyJavaProject. Click Next.
__3. Open the Libraries tab and click Add External Jars. Add the DB2 JDBC driver for BigInsights,
located at /opt/ibm/biginsights/database/db2/java/db2jcc4.jar.
__4. Click Finish. Click Yes when you are asked if you want to open the Java perspective.
__5. Right-click the MyJavaProject project, and click New > Package. In the Name field, in the New
Java Package window, type a name for the package, such as aJavaPackage4me. Click Finish.
__7. In the New Java Class window, in the Name field, type SampApp. Select the public static void
main(String[] args) check box. Click Finish.
__8. Replace the default code for this class and copy or type the following code into the
SampApp.java file (you'll find the file in
/opt/ibm/biginsights/bigsql/samples/data/SampApp.java):
package aJavaPackage4me;
// a. Import the JDBC classes needed for database programming
import java.sql.*;
public class SampApp {
    // b. Database info -- change the URL, user ID, and password as needed for your environment
    static final String db = "jdbc:db2://bivm.ibm.com:51000/bigsql";
    static final String user = "biadmin";
    static final String password = "biadmin";
    /**
     * @param args
     */
    public static void main(String[] args) {
        Connection conn = null;
        try{
            //c. Register JDBC driver -- not needed for DB2 JDBC type 4 connection
            // Class.forName("com.ibm.db2.jcc.DB2Driver");
            //d. Open the connection
            conn = DriverManager.getConnection(db, user, password);
            //e. Run a query by submitting an SQL statement to the database
            //   (illustrative statement; the sample file shipped with BigInsights may use a different query)
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT tabschema, tabname FROM syscat.tables FETCH FIRST 5 ROWS ONLY");
            //f. Extract data from the result set
            while(rs.next()){
                System.out.println(rs.getString(1) + "." + rs.getString(2));
            }
            //g. Clean up the environment by closing all of the database resources
            rs.close();
            stmt.close();
        }catch(SQLException sqlE){
            // Process SQL errors
            sqlE.printStackTrace();
        }catch(Exception e){
            // Process other errors
            e.printStackTrace();
        }
        finally{
            try{
                if(conn!=null)
                    conn.close();
            }
            catch(SQLException sqlE){
                sqlE.printStackTrace();
            }
        }// end finally block
        System.out.println("Application complete");
    }
}
__a. After the package declaration, ensure that you include the packages that contain
the JDBC classes that are needed for database programming (import java.sql.*;).
__b. Set up the database information so that you can refer to it. Be sure to change
the user ID, password, and connection information as needed for your environment.
__c. Optionally, register the JDBC driver. The class name is provided here for your
reference. When using the DB2 Type 4.0 JDBC driver, it’s not necessary to specify the
class name.
__d. Open the connection.
__e. Run a query by submitting an SQL statement to the database.
__f. Extract data from result set.
__g. Clean up the environment by closing all of the database resources.
__9. Save the file, then right-click the Java file and click Run > Run as > Java Application.
Lab 6 Summary
In this lab, you gained hands-on experience using many popular capabilities of InfoSphere BigInsights,
IBM's Hadoop-based platform for analyzing big data. You explored your BigInsights cluster using a
Web-based console and manipulated social media data using a spreadsheet-style interface. You also
created Big SQL tables for your data and executed several complex queries over this data.
To expand your skills even further, visit the HadoopDev web site (https://developer.ibm.com/hadoop/),
which contains links to free online courses, tutorials, and more.
Now you’re ready to get started using BigInsights for your own projects. What will you do with big
data?