Unit 5 - Introduction To Hadoop
Apache Pig was developed by researchers at Yahoo, targeted mainly at non-programmers. It was designed to analyze and process large datasets without writing complex Java code. It provides a high-level data processing language that can perform numerous operations without getting bogged down in low-level technical detail.
It consists of:
1. Pig Latin - the language used for scripting
2. Pig Latin Compiler - converts Pig Latin code into executable code
Pig also supports Extract, Transform, and Load (ETL) workloads and provides a platform for building data flows. Roughly ten lines of Pig Latin script can replace approximately 200 lines of MapReduce code, so Pig offers a simple, time-efficient way to analyze datasets.
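To illustrate how compact Pig Latin is, here is a sketch of a word count, a task that would take far more code as a hand-written MapReduce job. The input path and relation names are illustrative assumptions, not part of the original material.

```pig
-- Load a text file from HDFS (input.txt is a hypothetical path)
lines = LOAD 'input.txt' AS (line:chararray);
-- Split each line into individual words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
-- Print the result to the console
DUMP counts;
```

Five statements here do the work of an entire MapReduce program with mapper, reducer, and driver classes.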
• Pig’s architecture
• Programmers write scripts in Pig Latin to analyze data using Pig. Grunt
Shell is Pig’s interactive shell, used to execute all Pig scripts.
• If the Pig script is written in a script file, the Pig Server executes it.
• The parser checks the syntax of the Pig script and produces a DAG
(Directed Acyclic Graph) of the operations. This DAG (the logical
plan) is passed to the logical optimizer.
• The compiler converts the optimized plan into MapReduce jobs.
• The MapReduce jobs are then run by the Execution Engine. The
results are displayed using the “DUMP” statement and stored in HDFS
using the “STORE” statement.
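The flow above can be seen end to end by running a script file or the Grunt shell. Assuming a working Pig installation (the script name wordcount.pig and the paths below are hypothetical), DUMP prints results to the console while STORE writes them back to HDFS:

```shell
# Run a Pig Latin script file; the Pig Server parses and optimizes it,
# compiles it to MapReduce jobs, and hands them to the execution engine
pig wordcount.pig

# Or start the interactive Grunt shell and run statements one at a time
pig
grunt> counts = LOAD '/geeks/counts' AS (word:chararray, cnt:long);
grunt> DUMP counts;                      -- display results on the console
grunt> STORE counts INTO '/geeks/out';   -- persist results in HDFS
```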
HDFS
HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing large datasets of structured or unstructured data across various nodes, and it maintains the metadata in the form of log files.
• To check that the Hadoop services are up and running, use the
following command:
• jps
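On a typical single-node setup the jps output looks roughly like the following; the process IDs and the exact set of daemons will vary with the installation, so this transcript is illustrative only:

```shell
$ jps
4821 NameNode
4963 DataNode
5210 SecondaryNameNode
5388 ResourceManager
5541 NodeManager
5823 Jps
```

If any of these daemons is missing, the corresponding Hadoop service is not running.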
ls: This command lists all the files in a directory.
Use lsr for a recursive listing; it is useful when we
want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin
directory contains the executables, so bin/hdfs means we
want the hdfs executable, specifically its dfs (Distributed
File System) commands.
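A sample session, assuming a running HDFS installation (the directory names and timestamps in the output are illustrative):

```shell
# List the HDFS root directory
$ bin/hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2023-01-10 10:15 /tmp
drwxr-xr-x   - hadoop supergroup          0 2023-01-10 10:15 /user

# Recursive listing; note that lsr is deprecated in newer
# Hadoop releases in favor of the -ls -R form
$ bin/hdfs dfs -ls -R /user
```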
mkdir: To create a directory. In Hadoop DFS there is no home directory by default, so let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> use the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means an absolute path
bin/hdfs dfs -mkdir geeks2 => a relative path; the folder will be created relative to the home directory
touchz: To create an empty file in HDFS.
Syntax:
bin/hdfs dfs -touchz <file path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
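Assuming the /geeks directory from the mkdir example exists, the newly created empty file can be verified with ls (the output below is illustrative):

```shell
# Create a zero-length file in HDFS
$ bin/hdfs dfs -touchz /geeks/myfile.txt

# Confirm it was created; the size column shows 0 bytes
$ bin/hdfs dfs -ls /geeks
Found 1 items
-rw-r--r--   1 hadoop supergroup          0 2023-01-10 10:20 /geeks/myfile.txt
```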
rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory, then the directory itself.
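A sample run, assuming a directory /geeks_copied exists. Note that rmr is deprecated in recent Hadoop releases in favor of -rm -r, which does the same thing (the transcript is illustrative):

```shell
# Recursively delete a non-empty directory (older syntax)
$ bin/hdfs dfs -rmr /geeks_copied

# Equivalent modern form
$ bin/hdfs dfs -rm -r /geeks_copied
Deleted /geeks_copied
```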