Hadoop Report
BIG DATA
Big data is a term that describes the large volume of data – both structured
and unstructured – that inundates a business on a day-to-day basis. But it’s not
the amount of data that’s important. It’s what organizations do with the data
that matters. Big data can be analyzed for insights that lead to better decisions
and strategic business moves.
REQUIREMENTS OF ANALYSIS
While the term “big data” is relatively new, the act of gathering and storing
large amounts of information for eventual analysis is ages old. The concept
gained momentum in the early 2000s when industry analyst Doug Laney
articulated the now-mainstream definition of big data as the three Vs:
Volume: Organizations collect data from a variety of sources, including business
transactions, social media and information from sensor or machine-to-machine
data. In the past, storing it would’ve been a problem – but new technologies
(such as Hadoop) have eased the burden.
Velocity: Data streams in at an unprecedented speed and must be dealt with in
a timely manner. RFID tags, sensors and smart metering are driving the need to
deal with torrents of data in near-real time.
Variety: Data comes in all types of formats – from structured, numeric data in
traditional databases to unstructured text documents, email, video, audio,
stock ticker data and financial transactions. At SAS, we consider two
additional dimensions when it comes to big data:
Variability: In addition to the increasing velocities and varieties of data, data
flows can be highly inconsistent with periodic peaks. Is something trending
in social media? Daily, seasonal and event-triggered peak data loads can be
challenging to manage. Even more so with unstructured data.
Complexity: Today's data comes from multiple sources, which makes it
difficult to link, match, cleanse and transform data across systems. However, it’s
necessary to connect and correlate relationships, hierarchies and multiple
data linkages or your data can quickly spiral out of control.
OBJECTIVES OF BIG DATA
In this section, we will explain the objectives that we strive to achieve
while dealing with Big Data:
Cost reduction - MIPS (million instructions per second) of processing power and
terabytes of storage for structured data are now cheaply delivered through big data
technologies like Hadoop clusters. This is mainly because of the ability of
Big Data technologies to utilize commodity-scale hardware for processing by
employing techniques such as data sharding, distributed computing, etc.
Faster processing speeds - Big data technologies have also helped reduce
large-scale analytics processing from hours to minutes. These technologies
have also been instrumental in real-time analytics, reducing processing times to
seconds.
Big Data based offerings- Big data technologies have enabled organizations to
leverage big data in developing new product and service offerings. The best
example may be LinkedIn, which has used big data and data scientists to develop
a broad array of product offerings and features, including People You May
Know, Groups You May Like, Jobs You May Be Interested In, Who has Viewed
My Profile, and several others. These offerings have brought millions of new
customers to LinkedIn.
Supporting internal business decisions- Just like traditional data analytics, big
data analytics can be employed to support business decisions when there are
new and less structured data sources.
For example, any data that can shed light on customer satisfaction is helpful,
and much data from customer interactions is unstructured, such as website
clicks, transaction records, and voice recordings from call centres.
WHY IS BIG DATA IMPORTANT?
The importance of big data doesn’t revolve around how much data you have,
but what you do with it. You can take data from any source and analyze it to
find answers that enable 1) cost reductions, 2) time reductions, 3) new product
development and optimized offerings, and 4) smart decision making. When you
combine big data with high-powered analytics, you can accomplish business-
related tasks such as:
Determining root causes of failures, issues and defects in near-real time.
Generating coupons at the point of sale based on the customer’s buying
habits.
Recalculating entire risk portfolios in minutes.
Other considerations include:
Storage and Management: While storage would have been a problem several
years ago, there are now low-cost options for storing data if that’s the best
strategy for your business.
The more knowledge you have, the more confident you’ll be in making
business decisions. It’s smart to have a strategy in place once you have an
abundance of information at hand.
REAL FACTS:
New York Stock Exchange generates 1 TB/day.
Google processes 700 PB/month.
Facebook hosts 10 billion photos, taking 1 PB of storage.
BIG DATA CHALLENGES:
Traditional Approach
In this approach, an enterprise will have a computer to store and process big data.
For storage purpose, the programmers will take the help of their choice of
database vendors such as Oracle, IBM, etc. In this approach, the user interacts
with the application, which in turn handles the part of data storage and analysis.
Limitation
This approach works fine with those applications that process less voluminous
data that can be accommodated by standard database servers, or up to the limit of
the processor that is processing the data. But when it comes to dealing with huge
amounts of scalable data, it is a hectic task to process such data through a single
database bottleneck.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This
algorithm divides the task into small parts and assigns them to many computers,
and collects the results from them which when integrated, form the result dataset.
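For a concrete flavor of how such a program looks in Hadoop's Java API, here is a minimal
word-count sketch (the canonical MapReduce example; the class names are illustrative and
this is not one of the labs described later in this report). The map step runs on many
machines in parallel, and the reduce step collects and combines their results:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: each node turns its slice of the input into (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
   private static final IntWritable ONE = new IntWritable(1);
   @Override
   public void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
         context.write(new Text(tokens.nextToken()), ONE);
      }
   }
}

// Reduce step: the framework groups the pairs by word; each reducer adds up the counts.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
   @Override
   public void reduce(Text key, Iterable<IntWritable> values, Context context)
         throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
         sum += value.get();
      }
      context.write(key, new IntWritable(sum));
   }
}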
Hadoop
Using the solution provided by Google, Doug Cutting and his team developed an
Open Source Project called HADOOP.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel with others. In short, Hadoop is used to develop
applications that could perform complete statistical analysis on huge amounts of
data.
Introduction to Hadoop
An open-source software framework that supports data-intensive distributed
applications, licensed under the Apache v2 license.
Data Lake
One of the strong use cases of Big Data technologies is to analyze the data
and find the hidden patterns and information in it. For this to be
effective, all the data from source systems must be saved without any loss or
tailoring. However, traditional RDBMS databases and most NoSQL
storage systems require data to be transformed into a specific format in order
to be utilized - for adding, updating and searching - effectively.
The Data Lake concept was introduced to fill this gap; it talks about storing the
data in its raw state (the same state in which the data exists in the source systems)
without any data loss or transformation. For the same reason, a Data Lake is also
referred to as a Data Landing Area.
A Data Lake is rather a concept and can be implemented using any suitable
technology/software that can hold the data in any form while
ensuring that no data loss occurs, using distributed storage that provides
failover.
An example of such a technology would be Apache Hadoop, where its
MapReduce component could be used to load data into its distributed
file system, known as the Hadoop Distributed File System (HDFS).
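Landing raw files in HDFS does not necessarily require MapReduce; as a small illustration,
the following sketch uses Hadoop's Java FileSystem API to copy a source file into a landing
directory unchanged (the local path and HDFS directory are made up for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawDataLoader {
   public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Copy the source file into the Data Lake as-is, with no transformation.
      fs.copyFromLocalFile(new Path("/tmp/source_system_export.log"),   // hypothetical local file
                           new Path("/user/biadmin/datalake/raw/"));    // hypothetical landing area
      fs.close();
   }
}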
There are two differences in the process. First, in the case of a data
warehouse there is an ETL component instead of a data loader. This process
of ETL, in most cases, results in data loss due to fixed schemas. The second
difference is that in the case of a data warehouse there are no data
processors, as the data is already in a pre-defined schema ready to be
consumed by data analysts.
Benefits
Companies can reap the following benefits by implementing a Data Lake:
Data Consolidation - A Data Lake enables enterprises to consolidate their data
available in various forms, such as videos, customer care recordings, web
logs, documents etc., in one place, which was not possible with the traditional
approach of using a data warehouse.
Schema-less and Format-free Storage - A Data Lake stores data in its raw form,
i.e. the same format in which it is sent from the source systems. This
eliminates the need for source systems to emit data in a pre-defined
schema.
No Data Loss - Since a Data Lake doesn't require source systems to emit the
data in a pre-defined schema, source systems do not need to tailor the data.
This gives data analysts and scientists access to all the data, resulting
in more accurate analysis.
Cost Effectiveness - A Data Lake uses distributed storage, wherein
commodity hardware can be utilized to store huge volumes of data.
Procuring and leveraging commodity hardware for storage is much more
cost-effective than using high-configuration hardware.
Challenges
The ability to accommodate any format and type of data can sometimes turn a
Data Lake into a data mess, referred to as a Data Swamp. Hence, it is important
that enterprises take caution while implementing a Data Lake to ensure that
data is properly maintained and accessible in the Data Lake.
Another challenge with a Data Lake is that, since it stores raw data, each
type of analysis requires data transformation and tailoring from
scratch, requiring additional processing infrastructure every time you want
to do some analysis on the data stored in the Data Lake.
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision making process - Bill
Inmon.
William H. Inmon (born 1945) is an American computer scientist, recognized
by many as the father of the data warehouse
In computing, a data warehouse (DW or DWH), also known as an enterprise
data warehouse (EDW), is a system used for reporting and data analysis.
Subject-Oriented: A data warehouse can be used to analyze a
particular subject area. For example, "sales" can be a particular subject.
Non-volatile: Once data is in the data warehouse, it will not change. So,
historical data in a data warehouse should never be altered.
Data Mart
A data mart is the access layer of the data warehouse environment that is
used to get data out to the users.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the
status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software. It is a software that can be run on commodity
hardware. The system having the namenode acts as the master server and it does
the following tasks:
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files
and directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system
and datanode software. For every node (commodity hardware/system) in a cluster,
there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system will
be divided into one or more segments and/or stored in individual data nodes. These
file segments are called blocks. In other words, the minimum amount of data
that HDFS can read or write is called a Block.
The default block size is 64 MB, but it can be changed as per the need in the
HDFS configuration.
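For example, a client program can ask for a different block size when it writes a file by
setting the block-size property before creating it; the property name is dfs.block.size in
older Hadoop releases (dfs.blocksize in newer ones), and the 128 MB value and file path
below are only an illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
   public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Ask for 128 MB blocks instead of the 64 MB default (older property name used here).
      conf.setLong("dfs.block.size", 128L * 1024 * 1024);
      FileSystem fs = FileSystem.get(conf);
      FSDataOutputStream out = fs.create(new Path("/user/biadmin/bigfile.dat")); // hypothetical path
      out.writeBytes("data written with the larger block size\n");
      out.close();
      fs.close();
   }
}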
Goals of HDFS
Huge datasets : HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work
across the machines and, in turn, utilizes the underlying parallelism of the
CPU cores.
Another big advantage of Hadoop is that, apart from being open source, it
is compatible with all platforms since it is Java based.
Hadoop vs. Traditional Database:
Is Hadoop a Database?
Hadoop is not a database, but rather an open source software framework
specifically built to handle large volumes of structured and semi-structured data.
Organizations considering Hadoop adoption should evaluate whether their
current or future data needs require the type of capabilities Hadoop offers.
Is the data being analyzed structured or unstructured?
Structured Data: Data that resides within the fixed confines of a record or file is
known as structured data. Owing to the fact that structured data – even in large
volumes – can be entered, stored, queried, and analyzed in a simple and
straightforward manner, this type of data is best served by a traditional database.
Unstructured Data: Data that comes from a variety of sources, such as emails,
text documents, videos, photos, audio files, and social media posts, is referred to
as unstructured data. Being both complex and voluminous, unstructured data
cannot be handled or efficiently queried by a traditional database. Hadoop’s
ability to join, aggregate, and analyze vast stores of multi-source data without
having to structure it first allows organizations to gain deeper insights quickly.
Thus Hadoop is a perfect fit for companies looking to store, manage, and analyze
large volumes of unstructured data.
Is a scalable analytics infrastructure needed?
Companies whose data workloads are constant and predictable will be better
served by a traditional database.
Companies challenged by increasing data demands will want to take advantage
of Hadoop’s scalable infrastructure. Scalability allows servers to be added on
demand to accommodate growing workloads. As a cloud-based Hadoop service,
Qubole offers more flexible scalability by spinning virtual servers up or down
within minutes to better accommodate fluctuating workloads.
All things considered, Hadoop has a number of things going for it that make
implementation more cost-effective than companies may realize. For one thing,
Hadoop saves money by combining open source software with commodity
servers. Cloud-based Hadoop platforms such as Qubole reduce costs further by
eliminating the expense of physical servers and warehouse space.
MAPREDUCE PROGRAM ANALYSIS
ANALYSIS-1
AIM:- Develop a MapReduce application that finds the maximum average daily
temperature for each month, using the Java perspective in Eclipse.
This analysis was performed using the InfoSphere BigInsights 2.1 Quick Start
Edition, but has also been tested on the 3.0 image.
1. Log into your BigInsights image with a userid of biadmin and a password of
biadmin.
3. From a command line (right-click the desktop and select Open in Terminal)
execute
hadoop fs -mkdir TempData
5. You can view this data from the Files tab in the Web Console or by
executing the following command. The values in the 95th column (354, 353,
353,353, 352...) are the average daily temperatures. They are the result of
multiplying the actual average temperature value times 10. (That way you
don’t have to worry about working with decimal points.)
hadoop fs -cat TempData/SumnerCountyTemp.dat
1.2 Define a Java project in Eclipse
1. Start Eclipse using the icon on the desktop. Go with the default workspace.
2. Make sure that you are using the Java perspective. Click Window->Open
Perspective->Other. Then select Java. Click OK.
3. Create a Java project. Select File->New->Java Project.
4. Specify a Project name of MaxTemp. Click Finish.
5. Right-click the MaxTemp project, scroll down and select Properties.
6. Select Java Build Path.
1. In the Package Explorer expand MaxTemp and right-click src. Select New-
>Package.
package com.some.company;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
   @Override
   public void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
   }
}
The month begins at the 22nd character of the record (zero offset) and the average
temperature begins at the 95th character. (Remember that the average
temperature value is three digits.)
1. In the map method, add the following code (or whatever code you think is
required):
String line = value.toString();
String month = line.substring(22, 24);
int avgTemp = Integer.parseInt(line.substring(95, 98));
context.write(new Text(month), new IntWritable(avgTemp));
2. Save your work
1.5 Create the reducer class
package com.some.company;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
   @Override
   public void reduce(Text key, Iterable<IntWritable> values, Context context)
         throws IOException, InterruptedException {
   }
}
1.6 Complete the reducer
For the reducer, you want to iterate through all values for a given key. For
each value, check
to see if it is higher than any of the other values.
1. Add the following code (or your variation) to the reduce method.
int maxTemp = Integer.MIN_VALUE;
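// One reasonable way to finish the reduce method (an assumption, not the lab's
// official solution): find the largest value for this key and write it out.
for (IntWritable value : values) {
   if (value.get() > maxTemp) {
      maxTemp = value.get();
   }
}
context.write(key, new IntWritable(maxTemp));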
The GenericOptionsParser() will extract any input parameters that are not
system parameters and place them in an array. In your case, two parameters will
be passed to your application. The first parameter is the input file. The second
parameter is the output directory. (This directory must not exist or your
MapReduce application will fail.) Your code should look like this:
package com.some.company;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import com.some.company.MaxTempReducer;
import com.some.company.MaxTempMapper;
public class MaxMonthTemp {
   public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String[] programArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
      Job job = new Job(conf, "Monthly Max Temp");
      job.setJarByClass(MaxMonthTemp.class);
      job.setMapperClass(MaxTempMapper.class);
      job.setCombinerClass(MaxTempReducer.class);
      job.setReducerClass(MaxTempReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(programArgs[0]));
      FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}
1. In the Package Explorer, expand the MaxTemp project. Right-click src and
select Export.
2. Expand Java and select JAR file. Click Next.
ANALYSIS 2
AIM:- Develop a MapReduce application that finds out how many
persons recharge their number with Rs.400 or more,
using the Java perspective in Eclipse.
This analysis was performed using the InfoSphere BigInsights 2.1 Quick Start Edition,
but has also been tested on the 3.0 image.
1. Start Eclipse using the icon on the desktop. Go with the default workspace.
2. Make sure that you are using the Java perspective. Click Window->Open
Perspective->Other. Then select Java. Click OK.
9. Select BigInsights Libraries and click Next. Then click Finish. Then click
OK.
1. In the Package Explorer, expand the Mobile project and right-click src. Select
New->Package.
2. Type a Name of hadoop. Click Finish.
3. Right-click hadoop and select New->Class.
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class Mobile {
   // Mapper: emits (phone number, recharge amount) for each input record.
   // The record layout (number first, amount as the last comma-separated field) is assumed.
   public static class MobileMapper extends MapReduceBase
         implements Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, ",");
         String number = s.nextToken();
         while (s.hasMoreTokens()) {
            lasttoken = s.nextToken();
         }
         int amount = Integer.parseInt(lasttoken);
         output.collect(new Text(number), new IntWritable(amount));
      }
   }
   // Reducer: writes out only those recharges of Rs.400 or more.
   public static class MobileReducer extends MapReduceBase
         implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
         int maxamt = 400;
         int val;
         while (values.hasNext()) {
            if ((val = values.next().get()) >= maxamt) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }
   // Main function
   public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(Mobile.class);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(MobileMapper.class);
      conf.setReducerClass(MobileReducer.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
   }
}
1. In the Package Explorer, expand the Mobile project. Right-click src and
select Export.
2. Expand Java and select JAR file. Click Next.
3. Click the JAR file browse pushbutton. Type in a name of mobile.jar. Keep
biadmin
as the folder. Click OK.
4. Click Finish.
Then run your application.
2.5 Run your application
1. You need a command line. You may have to open a new one. Change to
biadmin’s home directory.
cd ~
2. Execute your program. At the command line type:
hadoop jar mobile.jar hadoop.Mobile
/user/biadmin/mob/SampleData.txt /user/biadmin/mobOut
3. Close the open edit windows in Eclipse.
2.6 Output
JAQL
Jaql is primarily a query language for JavaScript Object Notation (JSON), but it
supports more than just JSON. It allows you to process both structured and
nontraditional data, and it was donated by IBM to the open source community (just one
of many contributions IBM has made to open source).
Specifically, Jaql allows you to select, join, group, and filter data that is stored
in HDFS, much like a blend of Pig and Hive. Jaql’s query language was
inspired by many programming and query languages, including Lisp, SQL,
XQuery, and Pig. Jaql is a functional, declarative query language that is
designed to process large data sets. For parallelism, Jaql rewrites high-level
queries, when appropriate, into “low-level” queries consisting of MapReduce
jobs.
JSON
Before we get into the Jaql language, let’s first look at the popular data
interchange format known as JSON, so that we can build our Jaql examples on
top of it. Application developers are moving in large numbers towards JSON as
their choice for a data interchange format, because it’s easy for humans to read,
and because of its structure, it’s easy for applications to parse or generate.
JSON is built on two structures. The first is a collection of name/value pairs
(which, as you learned earlier in the “The Basics of MapReduce” section, makes
it ideal for data manipulation in Hadoop, which works on key/value pairs). These
name/value pairs can represent anything since they are simply text strings (and
subsequently fit well into existing models) that could represent a record in a
database, an object, an associative array, and more. The second JSON structure is
the ability to create an ordered list of values much like an array, list, or sequence
you might have in your existing applications. An object in JSON is represented
as {string:value}, where an array can be simply represented by [ value, value, …
], where value can be a string, number, another JSON object, or another JSON
array.
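The original example is not reproduced here, but a small JSON document of the kind the
later Jaql queries expect might look like the following; the records and values are invented
for illustration, and only the fields text, from_user and iso_language_code are actually
referenced by the queries in this report:

[
  { "from_user": "user_one", "iso_language_code": "en",
    "text": "Trying out Hadoop and Jaql today" },
  { "from_user": "user_two", "iso_language_code": "es",
    "text": "Probando Jaql con datos de Twitter" }
]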
Both Jaql and JSON are record-oriented models, and thus fit together perfectly.
Note that JSON is not the only format that Jaql supports—in fact, Jaql is
extremely flexible and can support many semi-structured data sources such as
XML, CSV, flat files, and more. However, in consideration of the space we
have, we’ll use the JSON example above in the following Jaql queries. As you
will see from this section, Jaql looks very similar to Pig but also has some
similarity to SQL.
JAQL QUERY
Much like a MapReduce job is a flow of data, Jaql can be thought of as a
pipeline of data flowing from a source, through a set of various operators, and
out into a sink (a destination). The operator used to signify flow from one
operand to another is an arrow: ->. Unlike SQL, where the output comes first (for
example, the SELECT list), in Jaql, the operations listed are in natural order,
where you specify the source, followed by the various operators you want to use
to manipulate the data, and finally the sink.
INPUT DATA
TWITTER DATA ANALYSIS USING JAQL
1. Open a terminal and run:
>cd $BIGINSIGHTS_HOME/jaql/bin
>./jaqlshell
jaql>tweets =read(file("file:///home/biadmin/Twitter%20Search.json"));
tweets[0];
tweets[0].text;
tweets[*].text;
tweets;
4. Retrieve a single field, from_user, using the transform command.
b3=tweets[0]{.text,.iso_language_code};
Using filters:
Use the filter operator to see all non-English language records from tweetrecs
Sort tweetrecs
jaql>hdfsShell('-mkdir /user/biadmin/jaql');
>cd $BIGINSIGHTS_HOME/jaql/bin
>./jaqlshell
jaql> book=read(file("/home/biadmin/bookreviews.json"));
jaql> book;
How to read several JSON files:
readjson=fn(filename) read(file("/home/biadmin/SampleData/"+filename));
readjson("bookreviews.json");
jaql> readjson("bookreviews.json");
11: Find the books having author 'David Baldacci'.
books=read(file("/home/biadmin/bookreviews.json"));
Next, for those books published in 2000 and 2005 , write a record that only contains
the author and the title.
booksonly=read(del("file:///home/biadmin/books.csv"));
booksonly;
Because all Jaql had to read was that delimited data, there were no fields assigned and
if you notice, the years are read as strings. Reread the file but this time specify a
schema to assign fields and to make sure that the data is read correctly.
booksonly=read(del("file:///home/biadmin/books.csv",
schema=schema{booknum:long, author:string, title:string, published:long}));
booksonly;
reviews=read(del("file:///home/biadmin/reviews.csv",
schema=schema{booknum:long, name:string, stars:long}));
reviews;
PIG
Pig was initially developed at Yahoo! to allow people using Apache Hadoop® to
focus more on analyzing large data sets and spend less time having to write mapper
and reducer programs. Like actual pigs, who eat almost anything, the Pig
programming language is designed to handle any kind of data—hence the name!
Pig is made up of two components: the first is the language itself, which is called
PigLatin (yes, people naming various Hadoop projects do tend to have a sense of
humor associated with their naming conventions), and the second is a runtime
environment where PigLatin programs are executed. Think of the relationship
between a Java Virtual Machine (JVM) and a Java application. In this section, we’ll
just refer to the whole entity as Pig.
1. The first step in a Pig program is to LOAD the data you want to manipulate from
HDFS.
2. Then you run the data through a set of transformations (which, under the
covers, are translated into a set of mapper and reducer tasks).
3. Finally, you DUMP the data to the screen or you STORE the results in a file
somewhere.
LOAD
As is the case with all the Hadoop features, the objects that are being worked on by
Hadoop are stored in HDFS. In order for a Pig program to access this data, the
program must first tell Pig what file (or files) it will use, and that’s done through the
LOAD 'data_file' command (where 'data_file' specifies either an HDFS file or
directory). If a directory is specified, all the files in that directory will be loaded into
the program. If the data is stored in a file format that is not natively accessible to Pig,
you can optionally add the USING function to the LOAD statement to specify a user-
defined function that can read in and interpret the data.
TRANSFORM
The transformation logic is where all the data manipulation happens. Here you can
FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to
build aggregations, ORDER results, and much more.
Advantages Of PIG
Decrease in development time. This is the biggest advantage especially
considering vanilla map-reduce jobs' complexity, time-spent and maintenance
of the programs.
The learning curve is not steep; anyone who does not know how to write vanilla
map-reduce, or SQL for that matter, could pick it up and write map-reduce jobs
(not easy to master, though).
Procedural, not declarative unlike SQL, so it is easier to follow the commands, and
it provides better expressiveness in the transformation of data at every step.
Compared to vanilla map-reduce, it is much more like an English language. It is
concise and, unlike Java, more like Python.
I really liked the idea of dataflow, where everything is about data even though we
sacrifice control structures like for loops or if structures. This forces the
developer to think about the data and nothing else. In Python or Java, you create
the control structures (for loops and ifs) and get the data transformation as a side
effect. Here, you cannot create for loops; you need to always transform and
manipulate data. But if you
are not transforming data, what are you doing in the very first place?
Since it is procedural, you can control the execution of every step. If you
want to write your own UDF (User Defined Function) and inject it in one specific
part of the pipeline, it is straightforward.
Disadvantages Of PIG
Especially the errors that Pig produces due to UDFs (Python) are not helpful at all.
When something goes wrong, it just gives an exec error in the UDF even if the problem
is related to a syntax or type error, let alone a logical one. This is a big one. At least,
as a user, I should get different error messages when I have a syntax error, type error
or a runtime error.
Not mature. Even though it has been around for quite some time, it is still in
development. (Only recently did they introduce a native datetime structure, which is
quite fundamental for a language like Pig, especially considering how important a
component datetime is for time-series data.)
Support: Stack Overflow and Google generally do not lead to good solutions for
the problems.
Data schema is not enforced explicitly but implicitly. I think this is a big one, too.
The debugging of Pig scripts is, in my experience, 90% of the time about schema, and
since Pig does not enforce an explicit schema, sometimes one data structure goes
bytearray, which is a “raw” data type; unless you coerce the fields, even the strings
turn to bytearray without notice. This may propagate to other steps of the data
processing.
Minor one: there is not a good IDE or Vim plugin that provides
more functionality than syntax completion for writing Pig scripts.
The commands are not executed unless you either dump or store an intermediate or
final result. This increases the iteration between debugging and resolving the issue.
Hive and Pig are not the same thing, and the things that Pig does quite well Hive
may not, and vice versa. However, someone who knows SQL could write Hive
queries (most SQL queries do already work in Hive) where she cannot do that in
Pig. She needs to learn Pig syntax.
INPUT DATA
PIG
1-cd $PIG_HOME/bin
2-./pig -x local
3- Load the file using the default PigStorage() loader and write it to the console:
dump data;
Next access the first field in each tuple and then write the results out to the
console. You will have to use the foreach operator to accomplish this. I
understand that we have not covered that operator yet. Just bear with me.
dump b;
First of all, let me explain what you just did. For each tuple in the data bag (or
relation), you accessed the first field ($0 remember, positions begin with zero)
and projected it to a field called f1 that is in the relation called b.
The data listed shows a tuple on each line and each tuple contains a single
character string that contains all of the data. This may not be what you expected,
since each line contained several fields separated by commas.
dump c;
dump e;
5.- What if you wanted to be able to access the fields in the tuple by name
instead of only by position? You have to use the LOAD operator and specify
a schema.
dump b;
b = foreach data generate author,title;
dump b;
dump b;
Read the /home/labfiles/SampleData/books.csv file. Filter the resulting relation
so that you only have those books published prior to 2002. In the LOAD
operator below, it is referencing a parameter for the directory structure. If you
are running the command from the Grunt shell, then you will replace $dir with
the qualified directory path.
dump b;
b = filter a by pubyear > 2002;
dump b;
dump b;
b = filter a by pubyear <= 2002;
dump b;
dump c;
Calculate the number of books published in each year.
dump booksPerYear;
STAFF AND FACULTY DATA
Pig basics
1. If Hadoop is not running, start it and its components using the icon on the
desktop.
2. Open a command line. Right-click the desktop and select Open in Terminal.
3. Next start the Grunt shell. Change to the Pig bin directory and start the shell
running in local mode.
commands:
1-cd $PIG_HOME/bin
2-./pig -x local
dump data;
data1 = load '/home/biadmin/Sheet.csv' using PigStorage(',') as (
campus:chararray, name:chararray, title:chararray, dept:chararray, ftr:int,
basis:chararray, fte:int, gen_fund:int);
b = filter data1 by fte <= 1;
dump b;
Hive
Hive allows SQL developers to write Hive Query Language (HQL) statements
that are similar to standard SQL statements; now you should be aware that
HQL is limited in the commands it understands, but it is still pretty useful.
HQL statements are broken down by the Hive service into MapReduce jobs
and executed across a Hadoop cluster.
For anyone with a SQL or relational database background, this section will look
very familiar to you. As with any database management system (DBMS), you
can run your Hive queries in many ways. You can run them from a command
line interface (known as the Hive shell), from a Java Database Connectivity
(JDBC) or Open Database Connectivity (ODBC) application leveraging the
Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The
Hive Thrift Client is much like any database client that gets installed on a user’s
client machine (or in a middle tier of a three-tier architecture): it communicates
with the Hive services running on the server. You can use the Hive Thrift Client
within applications written in C++, Java, PHP, Python, or Ruby (much like you
can use these client-side languages with embedded SQL to access a database
such as DB2 or Informix).
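As a sketch of the JDBC route, a small Java client could connect to the same HiveServer2
endpoint that the lab below uses (jdbc:hive2://bivm.ibm.com:10000/default with the biadmin
account); the query shown is only an illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
   public static void main(String[] args) throws Exception {
      // Hive JDBC driver used by HiveServer2 clients
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection(
            "jdbc:hive2://bivm.ibm.com:10000/default", "biadmin", "biadmin");
      Statement stmt = con.createStatement();
      ResultSet rs = stmt.executeQuery("SHOW TABLES");   // any HQL statement can go here
      while (rs.next()) {
         System.out.println(rs.getString(1));
      }
      rs.close();
      stmt.close();
      con.close();
   }
}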
Hive looks very much like traditional database code with SQL access. However,
because Hive is based on Hadoop and MapReduce operations, there are several
key differences. The first is that Hadoop is intended for long sequential scans,
and because Hive is based on Hadoop, you can expect queries to have a very
high latency (many minutes). This means that Hive would not be appropriate for
applications that need very fast response times, as you would expect with a
database such as DB2. Finally, Hive is read-based and therefore not appropriate
for transaction processing that typically involves a high percentage of write
operations.
1- ./beeline
2- !connect jdbc:hive2://bivm.ibm.com:10000/default
username:biadmin
password:biadmin
3-show schemas;
show databases;
4-Create a database using the Hive shell:
CREATE DATABASE computersalesdb;
Exploring Our Sample Dataset
Before we begin creating tables in our new database, it is important to understand
what data is in our sample files and how that data is structured.
13:-Create a new table named products for the lab files:
CREATE TABLE products (
prod_name STRING,
description STRING,
category STRING,
qty_on_hand INT,
prod_num STRING,
packaged_with ARRAY<STRING>)
STORED AS TEXTFILE;
14:-Show tables:
15:-Add a note to the TBLPROPERTIES for our new products table. Hive shell
in the Hadoop terminal:
1-use computersalesdb;
2-CREATE TABLE sales_staging (
cust_id STRING,
prod_num STRING,
qty INT,
sale_date STRING,
sales_id STRING)
STORED AS TEXTFILE;
3-show tables in computersalesdb;
[Inside terminal]
1-CREATE TABLE sales_staging2 (
cust_id STRING,
prod_num STRING,
qty INT,
sales_id STRING)
STORED AS TEXTFILE;
2-CREATE EXTERNAL TABLE customers (
fname STRING,
lname STRING,
status STRING,
telno STRING,
customer_id STRING)
LOCATION '/user/biadmin/shared_hive_data/';
SQOOP
When Big Data storages and analyzers such as MapReduce, Hive, HBase,
Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a
tool to interact with the relational database servers for importing and exporting
the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop
ecosystem to provide feasible interaction between relational database servers and
Hadoop’s HDFS.
Sqoop Import
The import tool imports individual tables from RDBMS to HDFS. Each row in
a table is treated as a record in HDFS. All records are stored as text data in text
files or as binary data in Avro and Sequence files.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files
given as input to Sqoop contain records, which are called rows in the table.
These are read and parsed into a set of records and delimited with a
user-specified delimiter.
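As an illustration, typical import and export invocations look roughly like the following;
the JDBC connection string, credentials, table name and HDFS paths are placeholders
patterned on the DB2 instance used in the lab below, not commands taken from it:

sqoop import --connect jdbc:db2://bivm.ibm.com:50000/SAMPLE \
  --username db2inst1 --password db2inst1 \
  --table LOGIC --target-dir /user/biadmin/sqoop/logic -m 1

sqoop export --connect jdbc:db2://bivm.ibm.com:50000/SAMPLE \
  --username db2inst1 --password db2inst1 \
  --table LOGIC --export-dir /user/biadmin/sqoop/logic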
SQOOP
1-Start db2.
2-su db2inst1
3-password: db2inst1
4-db2ilist
5-db2inst1@bivm:/home/biadmin> df
6-db2 "insert into db2inst1.logic values(1,'priya')";
FLUME
Flume is a standard, simple, robust, flexible, and extensible tool for data
ingestion from various data producers (webservers) into Hadoop.
Applications of Flume
Flume is used to move the log data generated by application servers into HDFS
at a higher speed.
Advantages of Flume
Using Apache Flume we can store the data in to any of the centralized
stores (HBase, HDFS).
When the rate of incoming data exceeds the rate at which data can be
written to the destination, Flume acts as a mediator between data
producers and the centralized stores and provides a steady flow of data
between them.
Flume provides the feature of contextual routing.
The transactions in Flume are channel-based where two transactions (one
sender and one receiver) are maintained for each message. It guarantees
reliable message delivery.
Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of Flume
Flume ingests log data from multiple web servers into a centralized store
(HDFS, HBase) efficiently.
Using Flume, we can get the data from multiple servers immediately into
Hadoop.
Along with the log files, Flume is also used to import huge volumes of
event data produced by social networking sites like Facebook and
Twitter, and e-commerce websites like Amazon and Flipkart.
Flume supports a large set of source and destination types.
Flume supports multi-hop flows, fan-in fan-out flows, contextual routing,
etc.
Flume can be scaled horizontally.
Big Data, as we know, is a collection of large datasets that cannot be processed
using traditional computing techniques. Big Data, when analyzed, gives
valuable results. Hadoop is an open-source framework that allows us to store and
process Big Data in a distributed environment across clusters of computers
using simple programming models.
Log file − In general, a log file is a file that lists events/actions that occur in an
operating system. For example, web servers list every request made to the
server in the log files.
The traditional method of transferring data into the HDFS system is to use the
put command. Let us see how to use the put command.
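For example, a log file sitting on the local file system can be copied into HDFS with
commands like these (the paths are illustrative):

hadoop fs -mkdir /user/biadmin/logs
hadoop fs -put /var/log/httpd/access_log /user/biadmin/logs/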
FLUME PROPERTIES: File System -> opt -> ibm -> biginsights -> flume -> conf
Create the file flume_agent1.properties in /opt/ibm/biginsights/flume/conf/ with the
following contents:
agent1.sources = seqGenSource
agent1.sinks = loggerSink
agent1.channels = memChannel
agent1.sources.seqGenSource.type = seq
agent1.sinks.loggerSink.type = logger
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 100
agent1.sources.seqGenSource.channels = memChannel
agent1.sinks.loggerSink.channel = memChannel
cd $BIGINSIGHTS_HOME/flume
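From this directory the agent defined above can then be started with the flume-ng
launcher, with a command along these lines (the console logger option is just a common
choice for watching the sequence-generator output):

bin/flume-ng agent --conf conf --conf-file conf/flume_agent1.properties \
  --name agent1 -Dflume.root.logger=INFO,console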