Hadoop Report

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 110

BIGDATA HADOOP

USING

IBM INFOSPHERE BIGINSIGHTS

Submitted to Submitted by
BIG DATA
Big data is a term that describes the large volume of data – both structured
and unstructured – that inundates a business on a day-to-day basis. But it’s not
the amount of data that’s important. Its what organizations do with the data
that matters. Big data can be analyzed for insights that lead to better decisions
and strategic business moves.
REQUIREMENTS OF ANAYSIS
While the term “big data” is relatively new, the act of gathering and storing
large amounts of information for eventual analysis is ages old. The concept
gained momentum in the early 2000s when industry analyst Doug Laney
articulated the now-mainstream definition of big data as the three Vs:
Volume: Organizations collect data from a variety of sources, including business
transactions, social media and information from sensor or machine-to-machine
data. In the past, storing it would’ve been a problem – but new technologies
(such as Hadoop) have eased the burden.
Velocity: Data streams in at an unprecedented speed and must be dealt with in
a timely manner. RFID tags, sensors and smart metering are driving the need to
deal with torrents of data in near-real time.
Variety: Data comes in all types of formats – from structured, numeric data in
traditional databases to unstructured text documents, email, video, audio,
stock ticker data and financial transactions.At SAS, we consider two
additional dimensions when it comes to big data:
Variability: In addition to the increasing velocities and varieties of data, data
flows can be highly inconsistent with periodic peaks. Is something trending
in social media? Daily, seasonal and event-triggered peak data loads can be
challenging to manage. Even more so with unstructured data.
Complexity: Today's data comes from multiple sources, which makes it
difficult to link, match, cleanse and transform data across systems. However,
it’s
necessary to connect and correlate relationships, hierarchies and multiple
data linkages or your data can quickly spiral out of control.
OBJECTIVES OF BIG DATA
In this section, we will explain the objectives that we strive to achieve
while dealing with Big Data:
Cost reduction- MIPS (Million Instructions Per Second) and above terabyte
storage for structured data are now cheaply delivered through big data
technologies like Hadoop clusters. This is mainly because of the ability of
Big Data technologies to utilize commodity scale hardware for processing by
employing techniques such as data sharing, distributed computing etc.

Faster processing speeds- Big data technologies have also helped reducing
large scale-analytics processing from hours to minutes. These technologies
have also been instrumental in real time analytics reducing processing times to
seconds.
Big Data based offerings- Big data technologies have enabled organizations to
leverage big data in developing new product and service offerings. The best
example may be LinkedIn, which has used big data and data scientists to develop
a broad array of product offerings and features, including People You May
Know, Groups You May Like, Jobs You May Be Interested In, Who has Viewed
My Profile, and several others. These offerings have brought millions of new
customers to LinkedIn.
Supporting internal business decisions- Just like traditional data analytics, big
data analytics can be employed to support business decisions when there are
new and less structured data sources.
For example, any data that can shed light on customer satisfaction is helpful,
and much data from customer interactions is unstructured such as website
clicks, transaction records, and voice recordings from call centres.
WHY IS BIG DATA IMPORTANT?
The importance of big data doesn’t revolve around how much data you have,
but what you do with it. You can take data from any source and analyze it to
find answers that enable 1) cost reductions, 2) time reductions, 3) new product
development and optimized offerings, and 4) smart decision making. When you
combine big data with high-powered analytics, you can accomplish business-
related tasks such as:
Determining root causes of failures, issues and defects in near-real time.
Generating coupons at the point of sale based on the customer’s buying
habits.
Recalculating entire risk portfolios in minutes.

Detecting fraudulent behavior before it affects your organization


BIG DATA COMES FROM
Streaming Data:
This category includes data that reaches your IT systems from a web of connected
devices. You can analyze this data as it arrives and make decisions on what data to
keep, what not to keep and what requires further analysis.
Social Media Data:
The data on social interactions is an increasingly attractive set of information,
particularly for marketing, sales and support functions. It's often in unstructured or
semi structured forms, so it poses a unique challenge when it comes to
consumption and analysis.
Publicly Available Sources:
Massive amounts of data are available through open data sources like the US
government’s data.gov, the CIA World Facebook or the European Union
Open Data Portal.
After identifying all the potential sources for data, consider the decisions
you’ll need to make once you begin harnessing information.

These include:
Storage and Management:
storage would have been a problem several years ago, there are now low-cost
options for storing data if that’s the best strategy for your business.

How much of it to analyze:


Some organizations don't exclude any data from their analyses, which is
possible with today’s high-performance technologies such as grid computing
or in-memory analytics. Another approach is to determine upfront which data
is relevant before analyzing it.
How to use any insights you uncover:

The more knowledge you have, the more confident you’ll be in making
business decisions. It’s smart to have a strategy in place once you have an
abundance of information at hand.

Cloud computing and other flexible resource allocation arrangements.


Faster processing
Cheap abundant storage
Distributed file system such as hadoop

REAL FACTS:
New York Stock Exchange generates 1 TB/day.
Google processes 700 PB/month.
Facebook hosts 10 billion photos taking 1 PB
storage
BIG DATA CHALLENGES:

The major challenges associated with big data are as follows:


Capturing
data
Transferring
Sharing
Storage
Processing

BIG DATA SOLUTIONS


Traditional Enterprise Approach

In this approach, an enterprise will have a computer to store and process big data.
For storage purpose, the programmers will take the help of their choice of
database vendors such as Oracle, IBM, etc. In this approach, the user interacts
with the application, which in turn handles the part of data storage and analysis.

Limitation
This approach works fine with those applications that process less voluminous
data that can be accommodated by standard database servers, or up to the limit of
the processor that is processing the data. But when it comes to dealing with huge
amounts of scalable data, it is a hectic task to process such data through a single
database bottleneck.

Google’s Solution
Google solved this problem using an algorithm called MapReduce. This
algorithm divides the task into small parts and assigns them to many computers,
and collects the results from them which when integrated, form the result dataset.
Hadoop
Using the solution provided by Google, Doug Cutting and his team developed an
Open Source Project called HADOOP.

Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel with others. In short, Hadoop is used to develop
applications that could perform complete statistical analysis on huge amounts of
data.

Introduction to Hadoop
An open-source software framework that supports data-intensive distributed
applications, licensed under the Apache v2license.

Data Lake

One of the strong use case of Big Data technologies is to analyze the data,
and find out the hidden patterns and information out of it. For this to be
effective, all the data from sources must be saved without any loss or
tailoring. However traditional RDBMS databases and most of NoSQL
storage systems require data to be transformed to a specific format in order
to be utilized - adding, updating, searching - effectively.

Data Lake concept is introduced to fill this gap and talks about storing the
data in raw state (same state as data exist in source systems) without any
data loss and transformation. For the same reason, Data Lake is also
referred as Data Landing Area.

Data Lake is rather a concept and can be implemented using any suitable
technology/software that can hold the data in any form along with
ensuring that no data loss is occurred using distributed storage providing
failover.
Example of such a technology would be Apache Hadoop where its
MapReduce component could be used to load data into its distributed
file system known as Hadoop Distributed File System (HDFS).

As we can see that there are two differences in the process. Firstly, we have
ETL component instead of data loader which emphasizes that in case of
data warehouse.

Input data is transformed and tailored to a pre-defined schema in order to be


saved to data warehouse.

This process of ETL, in most cases, results into data loss due to fixed
schemas. Second difference is that in case of data warehouse, there are no
data processors as data is already in a pre-defined schema ready to be
consumed by data analysts.
Benef There are following benefits that companies can reap by implementing
Data Lake
Data Consolidation - Data Lake enales enterprises to consolidate its data
available in various forms such as videos, customer care recordings, web
logs, documents etc. in one place which was not possible with traditional
approach of using data warehouse.
Schema-less and Format-free Storage - Data Lake talks about the storage of
data in its raw form i.e. same format as it is sent from source systems. This
eliminates need of source systems having to emit data in a pre-defined
schema.
No Data Loss - Since Data Lake doesn't require source systems to emit the
data in a pre-defined schema, source systems do not need to tailor the data.
This enables the access of all the data to data analysts and scientists resulting
into more accurate analysis.
Cost Effectiveness - Data Lake talks about distributed storage wherein
commodity hardware can be utilized to store the huge volumes of data.
procuring and levearing the commodity hardware for storage is much cost
effective than using the high configuration hardware.

Challenges with Data Lake


Ability to accommodate any format and type of data sometimes can convert
Data Lake into a data mess referred as Data Swarm. Hence, it is important
that enterprise take caution while implementing Data Lake to ensure that
data is property maintained and accessible in Data Lake.

Another challenge with Data Lake is that since it stores the raw data, each
type of analysis requires the data transformation and tailoring from scratch
requiring additional processing infrastructure every time you want to do
some analysis on data stored in Data Lake.

its of Data Lak


There are following benefits that companies can reap by implementing Data Lake
Data Consolidation - Data Lake enales enterprises to consolidate its data
available in various forms such as videos, customer care recordings,
web logs, documents etc. in one place which was not possible with
traditional approach of using data warehouse.

Schema-less and Format-free Storage - Data Lake talks about the storage
of data in its raw form i.e. same format as it is sent from source systems.
This eliminates need of source systems having to emit data in a pre-defined
schema.
No Data Loss - Since Data Lake doesn't require source systems to emit the
data in a pre-defined schema, source systems do not need to tailor the data.
This enables the access of all the data to data analysts and scientists
resulting into more accurate analysis.
Cost Effectiveness - Data Lake talks about distributed storage wherein
commodity hardware can be utilized to store the huge volumes of data.
procuring and levearing the commodity hardware for storage is much
cost effective than using the high configuration hardware.

Challenges with Data Lake

Ability to accommodate any format and type of data sometimes can convert
Data Lake into a data mess referred as Data Swarm. Hence, it is important
that enterprise take caution while implementing Data Lake to ensure that
data is property maintained and accessible in Data Lake.

Another challenge with Data Lake is that since it stores the raw data, each
type of analysis requires the data transformation and tailoring from
scratch requiring additional processing infrastructure every time you want
to do some analysis on data stored in Data Lake.
Data
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision making process - Bill
Inmon.
William H. Inmon (born 1945) is an American computer scientist, recognized
by many as the father of the data warehouse
In computing, a data warehouse (DW or DWH), also known as an enterprise
data warehouse (EDW), is a system used for reporting and data analysis.
Subject-Oriented: A data warehouse can be used to analyze a
particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources.


For example, source A and source B may have different ways of identifying
a product, but in a data warehouse, there will be only a single way of
identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example,


one can retrieve data from 3 months, 6 months, 12 months, or even older
data from a data warehouse. This contrasts with a transactions system,
where often only the most recent data is kept. For example, a transaction
system may hold the most recent address of a customer, where a data
warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So,
historical data in a data warehouse should never be altered.

Data Mart
A data mart is the access layer of the data warehouse environment that is
used to get data out to the users.

A database, or collection of databases, designed to help managers make


strategic decisions about their business. Whereas a data warehouse combines
databases across an entire enterprise, data marts are usually smaller and
focus on a particular subject or department. Some data marts, called
dependent data marts, are subsets of larger data warehouses.
Wa A data warehouse is a subject-oriented, integrated, time-variant and non-
volatile collection of data in support of management's decision making process
- Bill Inmon.William H. Inmon (born 1945) is an American computer scientist,
recognized by many as the father of the data warehouse
In computing, a data warehouse (DW or DWH), also known as an enterprise
data warehouse (EDW), is a system used for reporting and data analysis.
Subject-Oriented: A data warehouse can be used to analyze a
particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources.


For example, source A and source B may have different ways of identifying
a product, but in a data warehouse, there will be only a single way of
identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example,


one can retrieve data from 3 months, 6 months, 12 months, or even older
data from a data warehouse. This contrasts with a transactions system,
where often only the most recent data is kept. For example, a transaction
system may hold the most recent address of a customer, where a data
warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So,
historical data in a data warehouse should never be altered.

Data Mart
A data mart is the access layer of the data warehouse environment that is
used to get data out to the users.

A database, or collection of databases, designed to help managers make strategic


decisions about their business. Whereas a data warehouse combines databases
across an entire enterprise, data marts are usually smaller and focus on a particular
subject or department. Some data marts, called dependent data marts, are subsets
of larger data warehouses.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision making process - Bill
Inmon.William H. Inmon (born 1945) is an American computer scientist,
recognized by many as the father of the data warehouse
In computing, a data warehouse (DW or DWH), also known as an enterprise
data warehouse (EDW), is a system used for reporting and data analysis.
Subject-Oriented: A data warehouse can be used to analyze a
particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources.


For example, source A and source B may have different ways of identifying
a product, but in a data warehouse, there will be only a single way of
identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example,


one can retrieve data from 3 months, 6 months, 12 months, or even older
data from a data warehouse. This contrasts with a transactions system,
where often only the most recent data is kept. For example, a transaction
system may hold the most recent address of a customer, where a data
warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So,
historical data in a data warehouse should never be altered.

Data Mart
A data mart is the access layer of the data warehouse environment that is
used to get data out to the users.

A database, or collection of databases, designed to help managers make strategic


decisions about their business. Whereas a data warehouse combines databases
across an entire enterprise, data marts are usually smaller and focus on a particular
subject or department. Some data marts, called dependent data marts, are subsets
of larger data warehouses.
Introduction to Hadoop
An open-source software framework that supports data-intensive
distributed applications, licensed under the Apache v2license.Goals /
Requirements:
Abstract and facilitate the storage and processing of large and/or rapidly

growing data sets

Structured and non-structured data


Simple programming models

High scalability and availability

Use commodity (cheap!) hardware with little redundancy


Fault-tolerance

Move computation rather than data


Hadoop Framework Tools

Hadoop Architecture (HDFS)


Hadoop File System was developed using distributed file system design. It is run
on commodity hardware. Unlike other distributed systems, HDFS is highly fault
tolerant and designed using lowcost hardware.
HDFS holds very large amount of data and provides easier access. To store such
huge data, the files are stored across multiple machines. These files are stored in
redundant fashion to rescue the system from possible data losses in case of failure.
HDFS also makes applications available to parallel processing.

Features of HDFS
It is suitable for the distributed storage and processing Hadoop provides a
command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the
status of cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture

Given below is the architecture of a Hadoop File System.


HDFS follows the master-slave architecture and it has the following elements.

Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software. It is a software that can be run on commodity
hardware. The system having the namenode acts as the master server and it does
the following tasks:
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files
and directories.

Datanode
The datanode is a commodity hardware having the GNU/Linux operating system
and datanode software. For every node Commodityhardware/System in a cluster,
there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.

Block
Generally the user data is stored in the files of HDFS. The file in a file system will
be divided into one or more segments and/or stored in individual data nodes. These
file segments are called as blocks. In other words, the minimum amount of data
that HDFS can read or write is called a Block.
The default block size is 64MB, but it can be increased as per the need to change
in HDFS configuration.

Goals of HDFS

Fault detection and recovery : Since HDFS includes a large number of


commodity hardware, failure of components is frequent. Therefore HDFS should
have mechanisms for quick and automatic fault detection and recovery.

Huge datasets : HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.

Hardware at data : A requested task can be done efficiently, when the


computation takes place near the data. Especially where huge datasets are
involved, it reduces the network traffic and increases the throughput.
MapReduce
MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed algorithm on
a cluster. Conceptually similar approaches have been very well known since 1995
with the Message Passing Interface standard having reduce and scatter operations.
Advantages of Hadoop

Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatic distributes the data and work
across the machines and in turn, utilizes the underlying parallelism of the
CPU cores.

Hadoop does not rely on hardware to provide fault-tolerance and


high availability (FTHA), rather Hadoop library itself has been
designed to detect and handle failures at the application layer.

Servers can be added or removed from the cluster dynamically and


Hadoop continues to operate without interruption.

Another big advantage of Hadoop is that apart from being open source, it
is compatible on all the platforms since it is Java based.
Hadoop vs. Traditional Database:

Today’s ultra-connected world is generating massive volumes of data at ever-


accelerating rates. As a result, big data analytics has become a powerful tool for
businesses looking to leverage mountains of valuable data for profit and
competitive advantage. In the midst of this big data rush, Hadoop, as an on-
premise or cloud-based platform has been heavily promoted as the one-size fits
all solution for the business world’s big data problems. While Hadoop has lived
up to much of the hype, there are certain situations where running workloads on
a traditional database may be the better solution.
For companies wondering which functionality will better serve their big data
use case needs, here are some key questions that need to be asked when
choosing between Hadoop databases – including cloud-based Hadoop services
such as Qubole – and a traditional database.

Is Hadoop a Database?
Hadoop is not a database, but rather an open source software framework
specifically built to handle large volumes of structure and semi-structured data.
Organizations considering Hadoop adoption should evaluate whether their
current or future data needs require the type of capabilities Hadoop offers.
Is the data being analyzed structured or unstructured?

Structured Data: Data that resides within the fixed confines of a record or file is
known as structured data. Owing to the fact that structured data – even in large
volumes – can be entered, stored, queried, and analyzed in a simple and
straightforward manner, this type of data is best served by a traditional database.

Unstructured Data: Data that comes from a variety of sources, such as emails,
text documents, videos, photos, audio files, and social media posts, is referred to
as unstructured data. Being both complex and voluminous, unstructured data
cannot be handled or efficiently queried by a traditional database. Hadoop’s
ability to join, aggregate, and analyze vast stores of multi-source data without
having to structure it first allows organizations to gain deeper insights quickly.
Thus Hadoop is a perfect fit for companies looking to store, manage, and analyze
large volumes of unstructured data.
Is a scalable analytics infrastructure needed?
Companies whose data workloads are constant and predictable will be better
served by a traditional database.
Companies challenged by increasing data demands will want to take advantage
of Hadoop’s scalable infrastructure. Scalability allows servers to be added on
demand to accommodate growing workloads. As a cloud-based Hadoop service,
Qubole offers more flexible scalability by spinning virtual servers up or down
within minutes to better accommodate fluctuating workloads.

Will a Hadoop Database implementation be cost-effective? Cost-


effectiveness is always a concern for companies looking to adopt new

technologies. When considering a Hadoop implementation, companies need to do


their homework to make sure that the realized benefits of a Hadoop deployment
outweigh the costs. Otherwise it would be best to stick with a traditional database
to meet data storage and analytics needs.

All things considered, Hadoop has a number of things going for it that make
implementation more cost-effective than companies may realize. For one thing,
Hadoop saves money by combining open source software with commodity
servers. Cloud-based Hadoop platforms such as Qubole reduce costs further by
eliminating the expense of physical servers and warehouse space.
MAPREDUCE PROGRAM ANALYSIS
ANALYSIS-1

AIM: - Develops a MapReduce application that finds the highest


average monthly temperature using the Java perspective in Eclipse.

This analysis perform using the InfoSphere BigInsights 2.1 Quick Start
Edition, but has been tested on the 3.0 image.

1.1 Start the BigInsights components

1. Log into your BigInsights image with a userid of biadmin and a password of
biadmin.

2. Start your BigInsights components. Use the icon on the desktop.

3. From a command line (right-click the desktop and select Open in Terminal)
execute
hadoop fs -mkdir TempData

4. Upload some temperature data from the local file system.


hadoop fs -copyFromLocal/home/labfiles/SumnerCountyTemp.dat
/user/biadmin/TempData

5. You can view this data from the Files tab in the Web Console or by
executing the following command. The values in the 95th column (354, 353,
353,353, 352...) are the average daily temperatures. They are the result of
multiplying the actual average temperature value times 10. (That way you
don’t have to worry about working with decimal points.)
hadoop fs -cat TempData/SumnerCountyTemp.dat
1.2 Define a Java project in Eclipse

1. Start Eclipse using the icon on the desktop. Go with the default workspace. _
2. Make sure that you are using the Java perspective. Click Window->Open
Perspective->Other. Then select Java. Click OK.
3. Create a Java project. Select File->New->Java Project.
4. Specify a Project name of MaxTemp. Click Finish.
5. Right-click the MaxTemp project, scroll down and select Properties.
6. Select Java Build Path.

7. In the Properties dialog, select the Libraries tab. _


_ 8. Click the Add Library pushbutton.
9. Select BigInsights Libraries and click Next. Then
click Finish. Then click OK.
1.3 Create a Java package and the mapper class

1. In the Package Explorer expand MaxTemp and right-click src. Select New-
>Package.

2. Type a Name of com.some.company. Click Finish.

3. Right-click com.some.company and select New->Class.

4. Type in a Name of MaxTempMapper. It will be a public class. Click Finish


The data type for the input key to the mapper will be LongWritable. The data
itself will be of type Text. The output key from the mapper will be of type
Text. And the data from the mapper (the temperature) will be of type
IntWritable.
5. Your class:
a. You will need to import java.io.IOException.
b. Exend Mapper<LongWritable, Text, Text, IntWritable>
c. Define a public class called map.
d. Your code should look like the following:
package com.some.company;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable; import
org.apache.hadoop.io.LongWritable; import
org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;
public class MaxTempMapper extends
Mapper<LongWritable, Text, Text, IntWritable>
{ @Override

public void map(LongWritable key, Text value, Context


context) throws IOException, InterruptedException {
}
}

1.4 Complete the mapper


You are reading in a line of data. You will want to convert it to a string so that
you can do some string manipulation. You will want to extract the month and
average temperature for each record.

The month begins at the 22th character of the record (zero offset) and the average
temperature begins at the 95th character. (Remember that the average
temperature value is three digits.)

1. In the map method, add the following code (or whatever code you think is
required):
String line = value.toString();

String month =
line.substring(22,24); int avgTemp;

avgTemp = Integer.parseInt(line.substring(95,98));
context.write(new Text(month), new
IntWritable(avgTemp));
2. Save your work
1.5 Create the reducer class

1. In the Package Explorer right-click com.some.company and select New-


>Class.
2. Type in a Name of MaxTempReducer. It will be a public class. Click Finish.
The data type for the input key to the reducer will be Text. The data itself will be
of type IntWritable. The output key from the reducer will be of type Text. And
the data from thereducer will be of type IntWritable_
3. Your class:
a. You will need to import java.io.IOException.
b. Extend Reducer<Text, LongWritable, Text, IntWritable>
c. Define a public class called reduce.
d. Your code should look like the following:
package com.some.company;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;
public class MaxTempReducer extends
Reducer<Text, IntWritable, Text, IntWritable>
{ @Override

public void reduce(Text key, Iterable<IntWritable> values,


Context context)

throws IOException, InterruptedException {


}

}
1.6 Complete the reducer

For the reducer, you want to iterate through all values for a given key. For
each value, check
to see if it is higher than any of the other values.

1. Add the following code (or your variation) to the reduce method.
int maxTemp = Integer.MIN_VALUE;

for (IntWritable value: values) {

maxTemp = Math.max(maxTemp, value.get());


}

context.write(key, new IntWritable(maxTemp));


2. Save your work.

1.7 Create the mainclass

1. In the Package Explorer right-click com.some.company and select New-


>Class.

2. Type in a Name of MaxMonthTemp. It will be a public class. Click Finish.


IBM Software

The GenericOptionsParser() will extract any input parameters that are not
system parameters and place them in an array. In your case, two parameters will
be passed to your application. The first parameter is the input file. The second
parameter is the output directory. (This directory must not exist or your
MapReduce application will fail.) Your code should look like this:

package com.some.company;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import com.some.company.MaxTempReducer;
import com.some.company.MaxTempMapper;
public class MaxMonthTemp {

public static void main(String[] args) throws Exception


{ Configuration conf = new Configuration();

String[] programArgs =new GenericOptionsParser(conf,


args).getRemainingArgs(); if (programArgs.length != 2) {

System.err.println("Usage: MaxTemp <in>


<out>"); System.exit(2);

}
Job job = new Job(conf, "Monthly Max Temp");
job.setJarByClass(MaxMonthTemp.class);
job.setMapperClass(MaxTempMapper.class);
job.setCombinerClass(MaxTempReducer.class);
job.setReducerClass(MaxTempReducer.class);
1);
}
}

__ 3. Save your work


1.8 Create a JAR file

1. In the Package Explorer, expand the MaxTemp project. Right-click src and
select Export.2. Expand Java and select JAR file. Click Next.

3. Click the JAR file browse pushbutton. Type in a name of MyMaxTemp.jar.


Keep biadmin as the folder. Click OK. _
4. Click Finish.To save time, you are not going to run your application now.
You will add a combiner class and then run the application.
1.9 Run your application
1. You need a command line. You may have to open a new one. Change to
biadmin’s home directory.
cd ~
2. Execute your program. At the command line type:

hadoop jar MyMaxTemp.jar com.some.company.MaxMonthTemp


/user/biadmin/TempData/SumnerCountyTemp.dat
/user/biadmin/TempDataOut __ 3. Close the open edit windows in Eclipse.
1.10 Output
.

ANALYSIS 2
AIM :- Develops a MapReduce application that finds out how many
persons make recharge of Rs.400 or more than Rs.400 to their number
using the Java perspective in Eclipse.

This analysis perform using the InfoSphere BigInsights 2.1 Quick Start Edition,
but has been tested on the 3.0 image.

2.1 Start the BigInsights components


1. Log into your BigInsights image with a userid of biadmin and a password of
biadmin.
2. Start your BigInsights components. Use the icon on the desktop. __
3. From a command line (right-click the desktop and select Open in
Terminal) execute:

hadoop fs -mkdir mob


4. Upload some temperature data from the local file system.hadoop fs -
copyFromLocal /home/labfiles/SampleData.txt /user/biadmin/mob
5. You can view this data from the Files tab in the Web Console or by
executing the following command.
hadoop fs -cat mob/ SampleData.txt
input data
2.2 Define a Java project in Eclipse

1. Start Eclipse using the icon on the desktop. Go with the default workspace.
2. Make sure that you are using the Java perspective. Click Window-
>OpenPerspective->Other. Then select Java. Click OK.

_ 3. Create a Java project. Select File->New->Java Project.


4. Specify a Project name of mobile Click Finish.
5. Right-click the mobile project, scroll down and select Properties.

6. Select Java Build Path.

7. In the Properties dialog, select the Libraries tab.


8. Click the Add Library pushbutton.

9. Select BigInsights Libraries and click Next. Then click Finish. Then click

OK.

2.3 Create a Java package and the program


1. In the Package Explorer expand Mobile and right-click src. Select New-

>Package.
2. Type a Name of hadoop Click Finish.
3. Right-click hadoop and select New->Class.

4.Now creating MAPREDUCE program using class within class concept of


java perspective

package hadoop;
import
java.util.*;

import java.io.IOException;
import
org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;

import
org.apache.hadoop.mapred.*; public
class Mobile
{

public static class E_EMapper extends MapReduceBase


implements Mapper<LongWritable,Text,Text,IntWritable>

public void map(LongWritable key, Text value,


OutputCollector<Text,IntWritable> output,Reporter reporter) throws IOException
{

String line =
value.toString(); String
lasttoken = null;

StringTokenizer s = new StringTokenizer(line,"


"); String mob = s.nextToken();
while(s.hasMoreTokens())
{

lasttoken=s.nextToken();
}

int amout = Integer.parseInt(lasttoken);


output.collect(new Text(mob),new IntWritable(amout));
}

public static class E_EReduce extends MapReduceBase implements


Reducer< Text, IntWritable, Text, IntWritable >
{

public void reduce( Text key, Iterator <IntWritable> values,


OutputCollector<Text, IntWritable> output,Reporter reporter)
throws IOException
{

int maxamt=400; while (values.hasNext())


{

if((val=values.next().get())>maxamt)
{

output.collect(key, new IntWritable(val));


}

}
}
}
//Main function

public static void main(String args[])throws Exception


{

JobConf conf = new JobConf(WordCount.class);


conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new
Path(args[1])); JobClient.runJob(conf);

}
}

3. Save your work.

2.4 Create a JAR file

1. In the Package Explorer, expand the Mobile project. Right-click src and
select Export.
2. Expand Java and select JAR file. Click Next.

3. Click the JAR file browse pushbutton. Type in a name of mobile.jar. Keep
biadmin
as the folder. Click OK.
4. Click Finish.
Then run your application.
2.5 Run your application
1. You need a command line. You may have to open a new one. Change to
biadmin’s home directory.
cd ~
2. Execute your program. At the command line type:
hadoop jar mobile.jar hadoop.Mobile
/user/biadmin/mob/SampleData.txt /user/biadmin/mobOut
3. Close the open edit windows in Eclipse.

2.6 Output
JAQL

Jaql is primarily a query language for JavaScript Object Notation (JSON), but it
supports more than just JSON. It allows you to process both structured and non
traditional data and was donated by IBM to the open source community (just one
of many contributions IBM has made to open source).

Specifically, Jaql allows you to select, join, group, and filter data that is stored
in HDFS, much like a blend of Pig and Hive. Jaql’s query language was
inspired by many programming and query languages, including Lisp, SQL,
XQuery, and Pig. Jaql is a functional, declarative query language that is
designed to process large data sets. For parallelism, Jaql rewrites high-level
queries, when appropriate, into “low-level” queries consisting of MapReduce
jobs.

JSON
Before we get into the Jaql language, let’s first look at the popular data
interchange format known as JSON, so that we can build our Jaql examples on
top of it. Application developers are moving in large numbers towards JSON as
their choice for a data interchange format, because it’s easy for humans to read,
and because of its structure, it’s easy for applications to parse or generate.

JSON is built on top of two types of structures. The first is a collection of


name/value pairs

(which, as you learned earlier in the “The Basics of MapReduce” section, makes
it ideal for data manipulation in Hadoop, which works on key/value pairs). These
name/value pairs can represent anything since they are simply text strings (and
subsequently fit well into existing models) that could represent a record in a
database, an object, an associative array, and more. The second JSON structure is
the ability to create an ordered list of values much like an array, list, or sequence
you might have in your existing applications. An object in JSON is represented
as {string:value}, where an array can be simply represented by [ value, value, …
], where value can be a string, number, another JSON object, or another JSON
array.

Both Jaql and JSON are record-oriented models, and thus fit together perfectly.
Note that JSON is not the only format that Jaql supports—in fact, Jaql is
extremely flexible and can support many semi-structured data sources such as
XML, CSV, flat files, and more. However, in consideration of the space we
have, we’ll use the JSON example above in the following Jaql queries. As you
will see from this section, Jaql looks very similar to Pig but also has some
similarity to SQL.

JAQL QUERY
Much like a MapReduce job is a flow of data, Jaql can be thought of as a
pipeline of data flowing from a source, through a set of various operators, and
out into a sink (a destination). The operand used to signify flow from one
operand to another is an arrow: ->. Unlike SQL, where the output comes first (for
example, the SELECT list), in Jaql, the operations listed are in natural order,
where you specify the source, followed by the various operators you want to use
to manipulate the data, and finally the sink.
INPUT DATA
TWITTER DATA ANALYSIS USING JAQL
1.open terminal and run

>cd $BIGINSIGHTS_HOME/jaql/bin

>./jaqlshell

2. or you can directly open jaql shell

3. How to read data from JSON file:

jaql>tweets =read(file("file:///home/biadmin/Twitter%20Search.json"));

tweets[0];
tweets[0].text;

tweets[*].text;

tweets;
4 Retrieve a single field, from_user, using the transform command.

jaql>tweets -> transform $.from_user;

jaql>tweets-> transform {$.created_at, $.from_user, $.iso_language_code, $.text};


jaql>tweetrecs;

b3=tweets[0]{.text,.iso_language_code};
:Using filters:

Use the filter operator to see all non-English language records from tweetrecs

jaql>tweetrecs -> filter $.language != 'en';

Sort tweetrecs

jaql>tweetrecs -> sort by [$.created_at asc];


sort in created sequence but also in descending language sequence

jaql>tweetrecs -> sort by [$.language desc, $.created_at asc ];

count languages and group by key

tweetrecs ->group by key = $.language into {language: key, num: count($)};


write your results to a file in Hadoop but first you must create the target directory in
HDFS

jaql>hdfsShell('-mkdir /user/biadmin/jaql');

jaql>tweetrecs->group by key = $.langauage into {language: key, num: count($)}-


>write(seq("hdfs:/user/biadmin/jaql/twittercount.seq"));

jaql> hdfsShell('-ls /user/biadmin/jaql');


INPUT DATA OF BOOK
BOOKS DATA ANALYSIS USING JAQL

1.open terminal and run

>cd $BIGINSIGHTS_HOME/jaql/bin

>./jaqlshell

2. or you can directly open jaql shell

3. How to read data from JSON file:

jaql> book=read(file("/home/biadmin/bookreviews.json"));

jaql> book;
how to read sevrel JSON FILE:

readjson=fn(filename) read(file("/home/biadmin/SampleData/"+filename));

readjson("bookreviews.json");

jaql> readjson=fn(filename) read(file("/home/biadmin/"+filename));

jaql> readjson("bookreviews.json");
11:find the books having author 'David Baldacci'

books=read(file("/home/biadmin/bookreviews.json"));

books -> filter $.author == 'David Baldacci';

List all books that were published before 200

books -> filter $.pu blished < 20011;


Total the number of books by each author and display the name of the author and the
book total.

books-> group by author = $.author into {author, num_books: count($)};

Next, for those books published in 2000 and 2005 , write a record that only contains
the author and the title.

books->filter $.published == 2000 ->transform {$.auth books->filter $.published ==


2005 ->transform {$.author, $.title}

;books->filter $.published == 2005 ->transform {$.author, $.title};


And if you wanted to sort your data on published?

books -> sort by [$.published asc];

How to read csv File:

booksonly=read(del("file:///home/biadmin/books.csv"));

booksonly;
Because all Jaql had to read was that delimited data, there were no fields assigned and
if you notice, the years are read as strings. Reread the file but this time specify a
schema to assign fields and to make sure that the data is read correctly.

booksonly=read(del("file:///home/biadmin/books.csv",schema=schema{booknum:lon
g,author:string,title:string,published:long}));

booksonly;
reviews=read(del("file:///home/biadmin/reviews.csv",schema=schema{booknum:long
,name:string,stars:long}));

reviews;
PIG

Pig was initially developed at Yahoo! to allow people using Apache Hadoop® to
focus more on analyzing large data sets and spend less time having to write mapper
and reducer programs. Like actual pigs, who eat almost anything, the Pig
programming language is designed to handle any kind of data—hence the name!

Pig is made up of two components: the first is the language itself, which is called
PigLatin (yes, people naming various Hadoop projects do tend to have a sense of
humor associated with their naming conventions), and the second is a runtime
environment where PigLatin programs are executed. Think of the relationship
between a Java Virtual Machine (JVM) and a Java application. In this section, we’ll
just refer to the whole entity as Pig.

The programming language


Let’s first look at the programming language itself so you can see how it’s
significantly easier than having to write mapper and reducer programs.

1. The first step in a Pig program is to LOAD the data you want to manipulate from
HDFS.

2. Then you run the data through a set of transformations (which, under the
covers, are translated into a set of mapper and reducer tasks).

3. Finally, you DUMP the data to the screen or you STORE the results in a file
somewhere.
LOAD

As is the case with all the Hadoop features, the objects that are being worked on by
Hadoop are stored in HDFS. In order for a Pig program to access this data, the
program must first tell Pig what file (or files) it will use, and that’s done through the
LOAD 'data_file' command (where 'data_file' specifies either an HDFS file or
directory). If a directory is specified, all the files in that directory will be loaded into
the program. If the data is stored in a file format that is not natively accessible to Pig,
you can optionally add the USING function to the LOAD statement to specify a user-
defined function that can read in and interpret the data.

TRANSFORM

The transformation logic is where all the data manipulation happens. Here you can
FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to
build aggregations, ORDER results, and much more.

DUMP and STORE


If you don’t specify the DUMP or STORE command, the results of a Pig pro¬gram
are not generated. You would typically use the DUMP command, which sends the
output to the screen, when you are debugging your Pig programs. When you go into
production, you simply change the DUMP call to a STORE call so that any results
from running your programs are stored in a file for further processing or analysis.
Note that you can use the DUMP command anywhere in your program to dump
intermediate result sets to the screen, which is very useful for debugging purposes.

Advantages Of PIG
Decrease in development time. This is the biggest advantage especially
considering vanilla map-reduce jobs' complexity, time-spent and maintenance
of the programs.

Learning curve is not steep, anyone who does not know how to write vanilla
map-reduce or SQL for that matter could pick up and can write map-reduce jobs;
not easy to master, though.

Procedural, not declarative unlike SQL, so easier to follow the commands and
provides better expressiveness in the transformation of data every step.
Comparing to vanilla map-reduce, it is much more like an english language. It is
concise and unlike Java but more like Python.

I really liked the idea of dataflow where everything is about data even though we
sacrifice control structures like for loop or if structures. This enforces the
developer to think about the data but nothing else. In Python or Java, you create
the control structures(for loop and ifs) and get the data transformation as a side
effect. In here, data cannot create for loops, you need to always transform and
manipulate data. But if you

are not transforming data, what are you doing in the very first place?

Since it is procedural, you could control of the execution of every step. If you
want to write your own UDF(User Defined Function) and inject in one specific
part in the pipeline, it is straightforward.

Disadvantages Of PIG

Especially the errors that Pig produces due to UDFS(Python) are not helpful at all.
When something goes wrong, it just gives exec error in udf even if problem is
related to syntax or type error, let alone a logical one. This is a big one. At least, as
a user, I should get different error messages when I have a syntax error, type error
or a runtime error. and because of data, data transformation is a first class citizen.
Without data, cannot create for loops, you need to always transform and
manipulate data. But if youare not transforming data, what are you doing in the
very first place?

Since it is procedural, you could control of the execution of every step. If you
want to write your own UDF(User Defined Function) and inject in one specific
part in the pipeline, it is straightforward.
Not mature. Even if it has been around for quite some time, it is still in the
development. (only recently they introduced a native datetime structure which is
quite fundamental for a language like Pig especially considering how an important
component of datetime for time-series data.

Support: Stackoverflow and Google generally does not lead good solutions for
the problems.

Data Schema is not enforced explicitly but implicitly. I think this is big one, too.
The debugging of pig scripts in my experience is %90 of time schema and since it
does not enforce an explicit schema, sometimes one data structure goes bytearray,
which is a “raw” data type and unless you coerce the fields even the strings, they
turn bytearray without notice. This may propagate for other steps of the data
processing.

Minor one: There is not a good ide or plugin for Vim which provides
more functionality than syntax completion to write the pig scripts.
The commands
final result. Thisare not executed
increases the unless either you dump or store an intermediate or
Hive and Pig are not the same thing and the things that Pig does quite well Hive
may not and vice versa. However, someone who knows SQL could write Hive
queries(most of SQL queries do already work in Hive) where she cannot do that in
Pig. She needs to learn Pig syntax. iteration between debug and resolving the
issue.
INPUT DATA
PIG

1-cd $PIG_HOME/bin

2-./pig -x local

2-Read the data from a comma separated values file,

/home/labfiles/SampleData/books.csv into a relation called data using the


default

PigStorage() loader.

Try to put data on hdfs /user/biadmin/PRASHANT/books.csv'

data = load '/home/biadmin/SampleData/books.csv';

dump data;
3.-

Next access the first field in each tuple and then write the results out to the
console. You will have to use the foreach operator to accomplish this. I
understand that we have not covered that operator yet. Just bare with me.

b = foreach data generate $0;

dump b;

Fetching first colum but it will display all columns..

go for the next step;

First of all, let me explain what you just did. For each tuple in the data bag (or
relation), you accessed the first field ($0 remember, positions begin with zero)
and projected it to a field called f1 that is in the relation called b.

The data listed shows a tuple on each line and each tuple contains a single
character stringthat contains all of the data. This may not be what you expected,
since each line contained several fields separated by commas.

data = load '/home/biadmin/SampleData/books.csv' using PigStorage(',');

b = foreach data generate $0;


dump b;

c = foreach data generate $1;

dump c;

d = foreach data generate $2;


dump d;

e= foreach data generate $3;

du mp e;
5.- What if you wanted to be able to access the fields in the tuple by name
instead of only by position? You have to using the LOAD operator and specify
a schema

data = load '/home/biadmin/SampleData/books.csv' using PigStorage(',') as


(booknum:int, author:chararray, title:chararray, published:int);

b = for each data generate author;

dump b;
b = for each data generate author,title;

dump b;

b = for each data generate author,title,published year;

dump b;
Read the /home/labfiles/SampleData/books.csv file. Filter the resulting relation
so that you only have those books published prior to 2002. In the LOAD
operator below, it is referencing a parameter for the directory structure. If you
are running the command from the Grunt shell, then you will replace $dir with
the qualified directory path.

a = load '/home/biadmin/SampleData/books.csv' using PigStorage(',') as


(bknum: int,author:chararray, title:chararray, pubyear:int);

b = filter a by pubyear < 2002;

dump b;
b = filter a by pubyear > 2002;

dump b;

b = filter a by pubyear <= 2002;

dump b;
b = filter a by pubyear <= 2002;

dump b;

c = order b by pubyear desc;

dump c;
Calculate the number of books published in each year.

booksPerYear = foreach booksInYear generate group, COUNT($1);

dump booksPerYear;
STAFF AND FACULTY DATA

Pig basics

1. If Hadoop is not running, start it and its components using the icon on the
desktop.

2. Open a command line. Right-click the desktop and select Open in Terminal.

3. Next start the Grunt shell. Change to the Pig bin directory and start the shell
running in local mode.

commands:

1-cd $PIG_HOME/bin

2-./pig -x local

2 Read the data from a comma separated values file,

/home/labfiles/SampleData/books.csv into a relation called data using the


default PigStorage() loader.

data = load '/home/biadmin/Sheet.csv

dump data;
Next access the first field in each tuple and then write the results out to the
console. You

will have to use the for each operator to accomplish this. I understand that we
have not

covered that operator yet. Just bare with me.

b = foreach data generate $0;

dump b;

First of all, let me explain what you just did. For each tuple in the data bag (or
relation), you

accessed the first field ($0 remember, positions begin with zero) and projected it
to a field

called f1 that is in the relation called b.

The data listed shows a tuple on each line and each tuple contains a single
character string

that contains all of the data. This may not be what you expected, since each line
contained

several fields separated by commas.


data = load '/home/biadmin/Sheet.csv' using PigStorage(',') as (
campus:chararray, name:chararray, title:chararray, dept:chararray, ftr:int,
basis:chararray, fte:int, gen_fund:int);

b = foreach data generate camus,name,title;

dump b;
data1 = load '/home/biadmin/Sheet.csv' using PigStorage(',') as (
campus:chararray, name:chararray, title:chararray, dept:chararray, ftr:int,
basis:chararray, fte:int, gen_fund:int);

b = filter data1 by gen_fund <= 0 ;

dump b;
B= filter data1 by fte<=1;

Dump b;
Hive

Hive allows SQL developers to write Hive Query Language (HQL) statements
that are similar to standard SQL statements; now you should be aware that
HQL is limited in the commands it understands, but it is still pretty useful.
HQL statements are broken down by the Hive service into MapReducejobs
and executed across a Hadoop cluster.

For anyone with a SQL or relational database background, this section will look
very familiar to you. As with any database management system (DBMS), you
can run your Hive queries in many ways. You can run them from a command
line interface (known as the Hive shell), from a Java Database Connectivity
(JDBC) or Open Database Connectivity (ODBC) application leveraging the
Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The
Hive Thrift Client is much like any database client that gets installed on a user’s
client machine (or in a middle tier of a three-tier architecture): it communicates
with the Hive services running on the server. You can use the Hive Thrift Client
within applications written in C++, Java, PHP, Python, or Ruby (much like you
can use these client-side languages with embedded SQL to access a database
such as DB2 or Informix).

Hive looks very much like traditional database code with SQL access. However,
because Hive is based on Hadoop and MapReduce operations, there are several
key differences. The first is that Hadoop is intended for long sequential scans,
and because Hive is based on Hadoop, you can expect queries to have a very
high latency (many minutes). This means that Hive would not be appropriate for
applications that need very fast response times, as you would expect with a
database such as DB2. Finally, Hive is read-based and therefore not appropriate
for transaction processing that typically involves a high percentage of write
operations.

If you're interested in SQL on Hadoop, in addition to Hive, IBM offers Big


SQL which makes accessing Hive datasets faster and more secure. Checkout
our videos, below, for a quick overview of Hive and Big SQL.
we can start working with Hive and the Hadoop Distributed File system, we
must first start all the
BigInsights components. There are two ways of doing this, through terminal
and through simply doubleclicking an icon. Both of these methods will be
shown in the following steps.
_Now open the terminal by double clicking the BigInsights Shell icon.

Once the terminal has been opened change to the $BIGINSIGHTS_HOME/bin


directory (which
by default is /opt/ibm/biginsights)
cd $BIGINSIGHTS_HOME/bin

How to enter inside


beline*********************************************[inside terminal]

1- ./beeline
2- !connect jdbc:hive2://bivm.ibm.com:10000/default

username:biadmin

password:biadmi

3-show schemas;

show databases;
4 -how to create database using hive shell:

hive >create database testdb;

hive >DESCRIBE DATABASE testdb;

hive >show databases;


5 Check HDFS to confirm our new database directory was created. run on
terminal:

hadoop fs -ls /biginsights/hive/warehouse

hive >show databases;


6:-Add some information to the DBPROPERTIES metadata for the testdb
database. We do this by using the ALTER DATABASE syntax. ---hive shell

ALTER DATABASE computersalesdb SET DBPROPERTIES ('creator' =


'amitabh');

ALTER DATABASE prashant SET DBPROPERTIES ('creator' = 'amitabh');

ALTER DATABASE yukti SET DBPROPERTIES ('creator' = 'aman');

ALTER DATABASE salesdb SET DBPROPERTIES ('creator' = 'prashant');


7:Let’s view the extended details of our testdb database.

hive> DESCRIBE DATABASE EXTENDED parvi;

hive> DESCRIBE DATABASE EXTENDED hivedb;

8:Go ahead and delete the testdb database.

hive > DROP DATABASE testdb CASCADE;

9:Confirm that testdb is no longer in the Hive metastore catalog.

hive > SHOW DATABASES;


10:create a new databse computersalesdb;

hive >CREATE DATABASE computersalesdb;

hive> DESCRIBE DATABASE computersalesdb;

:Exploring Our Sample Dataset Before we begin creating tables in our new
database it is important to understand what data is in our sample files and how
that data is structure

.
13:-create a new table product for lab files.

CREATE TABLE product

(prod_name STRING,

description STRING,

category STRING,

qty_on_hand INT,

prod_num STRING,

packaged_with ARRAY<STRING>

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

COLLECTION ITEMS TERMINATED BY ':'

STORED AS TEXTFILE;
14:-Show tables:

hive > SHOW TABLES IN computersalesdb;

15:-Add a note to the TBLPROPERTIES for our new products table. hive shell

hive> ALTER TABLE products SET TBLPROPERTIES ('details' = 'This table


holds products');

hive> DESCRIBE EXTENDED products;


16:Let’s verify that the products directory was created on HDFS in the location
listed above.

in Hadoop termianl:

hadoop fs -ls /biginsights/hive/warehouse/computersalesdb.db;


17: Imagine that our fictitious computer company adds sales data to a
“sales_staging” table at the end of each month.From this sales_staging table
they then move the data they want to analyze into a partitioned “sales” table.
The partitioned sales table is the one they actual use for their analysis

1-use computersalesdb;

2 CREATE TABLE sales_staging2

(cust_id STRING,

prod_num STRING,

qty INT,

sale_date STRING,

sales_id STRING)

COMMENT 'Staging table for sales data'

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE;
3-show tables in computersalesdb;

4-hadoop fs -ls /biginsights/hive/warehouse/computersalesdb.db; --


terminal

[Inside terminal]

s;
1-CREATE TABLE sales_sttaging2

cust_id STRING,

prod_num STRING,

qty INT,

sales_id STRING

COMMENT 'Table for analysis of sales data'

PARTITIONED BY (sale_date STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE;

show tables in computersalesdb;


2-DESCRIBE EXTENDED sales;
hadoop fs -mkdir /user/biadmin/shared_hive_data;

2-hadoop fs -ls /user/biadmin;


hadoop fs-
put/home/biadmin/Computer_Business/WithoutHeaders/Customer.csv
/user/biadmin/shared_hive_data/Customer.csv;

4-hadoop fs -ls /user/biadmin/shared_hive_data/;

5-hadoop fs -cat /user/biadmin/shared_hive_data/Customer.csv;


21-Now we just need to define our external customer table.

CREATE EXTERNAL TABLE customer

fname STRING,

lname STRING,

status STRING,

telno STRING,

customer_id STRING,

city_zip STRUCT<city:STRING, zip:STRING>

COMMENT 'External table for customer data'

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

COLLECTION ITEMS TERMINATED BY '|'

LOCATION '/user/biadmin/shared_hive_data/';

2-DESCRIBE EXTENDED customer;


3-!set outputformat table

4-SELECT * FROM customer LIMIT 25;


SCOOP
Sqoop is a tool designed to transfer data between Hadoop and relational
database servers. It is used to import data from relational databases such as
MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system to
relational databases. This is a brief tutorial that explains how to make use of
Sqoop in Hadoop ecosystem.

The traditional application management system, that is, the interaction of


applications with relational database using RDBMS, is one of the sources that
generate Big Data. Such Big Data, generated by RDBMS, is stored in
Relational Database Servers in the relational database structure.

When Big Data storages and analyzers such as MapReduce, Hive, HBase,
Cassandra, Pig, etc. of the Hadoop ecosystem came into picture, they required a
tool to interact with the relational database servers for importing and exporting
the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop
ecosystem to provide feasible interaction between relational database server and
Hadoop’s HDFS.

Sqoop: “SQL to Hadoop and Hadoop to SQL”

Sqoop is a tool designed to transfer data between Hadoop and relational


database servers. It is used to import data from relational databases such as
MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system to
relational databases. It is provided by the Apache Software Foundation.
How Sqoop Works?

The following image describes the workflow of Sqoop.

Sqoop Import

The import tool imports individual tables from RDBMS to HDFS. Each row in
a table is treated as a record in HDFS. All records are stored as text data in text
files or as binary data in Avro and Sequence files.

Sqoop Export

The export tool exports a set of files from HDFS back to an RDBMS. The files
given as input to Sqoop contain records, which are called as rows in table.
Those are read and parsed into a set of records and delimited with user-specified
delimiter.
SCOOP
1-Start db2.

2-su db2inst1

3-passowrd db2inst1

4-db2 list database directory


make database connection first.... db2 connect to MYDB

db2 list tablespaces


check all instances(SCHEMA).............

db2ilist

check file system information...............

5-db2inst1@bivm:/home/biadmin>df
db2 "insert into db2inst1.logic values(1,'priya')";

db2 "insert into db2inst1.logic values(2,'riya')";

db2 "insert into db2inst1.logic values(3,'yukti')";

db2 "insert into db2inst1.logic values(4,'sristhi')";

db2 "insert into db2inst1.logic values(5,'meenakshi')";

db2 "insert into db2inst1.logic values(6,'meena')";


db2 "select * from db2inst1.logic";
8-Grant access to your table to public.................

db2 grant all on db2inst1.student to public

9- disconnect from connection...

db2 connect reset

run exit command to exit from db2inst1.


FLUME

Flume is a standard, simple, robust, flexible, and extensible tool for data
ingestion from various data producers (webservers) into Hadoop

Apache Flume is a tool/service/data ingestion mechanism for collecting


aggregating and transporting large amounts of streaming data such as log files,
events (etc...) from various sources to a centralized data store.

Flume is a highly reliable, distributed, and configurable tool. It is principally


designed to copy streaming data (log data) from various web servers to HDFS.

Applications of Flume

Assume an e-commerce web application wants to analyze the customer


behavior from a particular region. To do so, they would need to move the
available log data in to Hadoop for analysis. Here, Apache Flume comes to our
rescue.

Flume is used to move the log data generated by application servers into HDFS
at a higher speed.
Advantages of Flume

Here are the advantages of using Flume −

 Using Apache Flume we can store the data in to any of the centralized
stores (HBase, HDFS).
 When the rate of incoming data exceeds the rate at which data can be
written to the destination, Flume acts as a mediator between data
producers and the centralized stores and provides a steady flow of data
between them.
 Flume provides the feature of contextual routing.
 The transactions in Flume are channel-based where two transactions (one
sender and one receiver) are maintained for each message. It guarantees
reliable message delivery.
 Flume is reliable, fault tolerant, scalable, manageable, and customizable.

Features of Flume

Some of the notable features of Flume are as follows −

 Flume ingests log data from multiple web servers into a centralized store
(HDFS, HBase) efficiently.
 Using Flume, we can get the data from multiple servers immediately into
Hadoop.
 Along with the log files, Flume is also used to import huge volumes of
event data produced by social networking sites like Facebook and
Twitter, and e-commerce websites like Amazon and Flipkart.
 Flume supports a large set of sources and destinations types.
 Flume supports multi-hop flows, fan-in fan-out flows, contextual routing,
etc.
 Flume can be scaled horizontally.
Big Data, as we know, is a collection of large datasets that cannot be processed
using traditional computing techniques. Big Data, when analyzed, gives
valuable results. Hadoop is an open-source framework that allows to store and
process Big Data in a distributed environment across clusters of computers
using simple programming models.

Streaming / Log Data

Generally, most of the data that is to be analyzed will be produced by various


data sources like applications servers, social networking sites, cloud servers,
and enterprise servers. This data will be in the form of log files and events.

Log file − In general, a log file is a file that lists events/actions that occur in an
operating system. For example, web servers list every request made to the
server in the log files.

On harvesting such log data, we can get information about −

 the application performance and locate various software and hardware


failures.
 the user behavior and derive better business insights.

The traditional method of transferring data into the HDFS system is to use the
put command. Let us see how to use the put command.
FLI

Flume
FLUME PROPERTIES:File System->opt->ibm->biginsights-
>flume->conf
Make this file
flume_agent1.properties
/opt/ibm/biginsights/flume/conf/
agent1.sources = seqGenSource
agent1.sinks = loggerSink
agent1.channels = memChannel
agent1.sources.seqGenSource.type = seq
agent1.sinks.loggerSink.type = logger
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 100
agent1.sources.seqGenSource.channels = memChannel
agent1.sinks.loggerSink.channel = memChannel
cd $BIGINSIGHTS_HOME/flume

bin/flume-ng agent --name agent1 --conf conf --conf-file


conf/flume_agent1.properties - Dflume.root.logger=INFO,console

Run this command to process logs.

bin/flume-ng agent -n agent1 --conf conf -f conf/flume_agent1.properties -


Dflume.root.logger=INFO,console

You might also like