Big Data Manual Ai
REGULATION 2021
LAB MANUAL
CCS334 BIG DATA ANALYTICS  L T P C  2 0 2 3
LIST OF EXPERIMENTS:
1. Downloading and installing Hadoop; Understanding different Hadoop modes. Startup scripts,
Configuration files.
2. Hadoop implementation of file management tasks, such as adding files and directories, retrieving files
and deleting files.
3. Implementation of Matrix Multiplication with Hadoop MapReduce.
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
5. Installation of Hive along with practice examples.
6. Installation of HBase, Installing thrift along with Practice examples
7. Practice importing and exporting data from various databases.
Software Requirements: Cassandra, Hadoop, Java, Pig, Hive and HBase
DOWNLOADING AND INSTALLING HADOOP; UNDERSTANDING DIFFERENT HADOOP MODES, STARTUP SCRIPTS AND CONFIGURATION FILES
AIM:
To download and install Hadoop, and to understand the different Hadoop modes, startup scripts and
configuration files.
PROCEDURE:
1. Install Java JDK 1.8.0 under "C:\JAVA".
2. Install Hadoop: download and extract the Hadoop archive.
3. Set the HADOOP_HOME environment variable on Windows 10.
4. Set the JAVA_HOME environment variable on Windows 10.
5. Add the Hadoop bin directory and the Java bin directory to the PATH.
6. Configure Hadoop (a sample configuration file is sketched after this list).
7. Start Hadoop Cluster
8. Access Hadoop Namenode and Resource Manager.
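For step 6, Hadoop is configured by editing the XML files under %HADOOP_HOME%\etc\hadoop (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml). A minimal sketch of core-site.xml for a single-node setup; the hdfs://localhost:9000 address is a common default and an assumption here, so adjust it for your environment:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>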
INSTALLATION PROCEDURE:
1. Hardware Requirement
* RAM: minimum 8 GB (4 GB can also work if your system has an SSD).
Now we can organise our Hadoop installation: create a folder and move the final extracted files
into it.
Please note: while creating folders, DO NOT ADD SPACES IN THE FOLDER NAME (for example, use C:\Hadoop rather than C:\Big Data).
Another important step in setting up a work environment is to set your system's environment variables.
To edit environment variables,
go to Control Panel > System > click on the "Advanced system settings" link.
Alternatively, you can right-click the This PC icon, click Properties and then click the "Advanced
system settings" link.
Or, simplest of all, search for "Environment Variables" in the Windows search bar.
Hadoop Installation:
Note: This change sets the default replication count for blocks used by HDFS.
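The property referred to here is dfs.replication in hdfs-site.xml. A minimal sketch for a single-node cluster; the value 1 is an assumption suitable only when all data lives on one machine:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>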
3. We need to set up password-less login so that the master will be able to do a password-less ssh to
start the daemons on all the slaves. Check whether an ssh server is running on your host:
a. ssh localhost (enter your password; if you are able to log in, then the ssh server is running)
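If the login above prompts for a password every time, a key pair can be generated and authorized with the standard OpenSSH commands:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost   # should now log in without asking for a password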
You can check whether the NameNode has started successfully by using the following web
interface: http://0.0.0.0:50070 .
If you are unable to see it, check the logs in the /home/hadoop_dev/hadoop2/logs folder.
7. You can check whether the daemons are running by issuing the jps command.
10. Stop the HDFS daemons when you are done executing the jobs, with the command below:
sbin/stop-dfs.sh
To start the YARN daemons, run:
sbin/start-yarn.sh
This starts the ResourceManager and NodeManager daemons.
Once this command is run, you can check whether the ResourceManager is running by visiting the
following URL in a browser: http://0.0.0.0:8088 . If you are unable to see it, check the logs in the
directory /home/hadoop_dev/hadoop2/logs.
5. To check whether the services are running, issue a jps command. The following shows all the services
necessary to run YARN on a single server:
$ jps
15933 Jps
15567 ResourceManager
15785 NodeManager
7. Stop the daemons when you are done executing the jobs, with the below command:
sbin/stop-yarn.sh
Result:
Thus the installation of Hadoop in its three operating modes has been completed successfully.
HADOOP IMPLEMENTATION OF FILE MANAGEMENT TASKS, SUCH AS ADDING FILES
AND DIRECTORIES, RETRIEVING FILES AND DELETING FILES
AIM:
To implement file management tasks in Hadoop
PROCEDURE:
1. Add files and directories to HDFS
2. Retrieve files from HDFS
3. Delete files from HDFS
4. Copy data from the local file system (NFS) to HDFS
5. Verify the files
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into
HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login username. This directory isn't automatically created for you,
though, so let's create it with the mkdir command. Log in with your hadoop user and start the daemons:
start-all.sh
For the purpose of illustration, we use chuck. You should substitute your user name in the example
commands.
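For example, assuming the user chuck and a local file named example.txt (placeholder names; substitute your own):
hadoop fs -mkdir -p /user/chuck
hadoop fs -put example.txt /user/chuck/
hadoop fs -ls /user/chuck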
Step-2 :
The Hadoop command get copies files from HDFS back to the local filesystem. To retrieve example.txt, we
can run the following command.
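For example (again using the placeholder user chuck):
hadoop fs -get /user/chuck/example.txt .
A file can likewise be deleted from HDFS with:
hadoop fs -rm /user/chuck/example.txt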
● View the file by using the command "hdfs dfs -cat /sanjay_english/glossary"
OUTPUT:
RESULT:
Thus the file management tasks (adding, retrieving and deleting files and directories) were implemented in Hadoop successfully.
IMPLEMENTATION OF MATRIX MULTIPLICATION WITH HADOOP MAPREDUCE
AIM:
To implement Matrix Multiplication with Hadoop MapReduce.
ALGORITHM:
Algorithm for Map Function
Map Function – It takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (Key-Value pair).
a. For each element mij of M, produce the (key, value) pairs ((i,k), (M, j, mij)) for k = 1, 2, 3, ... up to the
number of columns of N.
b. For each element njk of N, produce the (key, value) pairs ((i,k), (N, j, njk)) for i = 1, 2, 3, ... up to the
number of rows of M.
c. Return the set of (key, value) pairs so that each key (i,k) has a list with the values (M, j, mij) and (N, j, njk) for
all possible values of j.
Algorithm for Reduce Function
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller
set of tuples.
a. For each key (i,k), do:
b. Sort the values that begin with M by j into one list and the values that begin with N by j into another list;
multiply mij and njk for the j-th value of each list.
c. Sum up the products and return ((i,k), Σj mij x njk).
PROCEDURE:
1. Download the Hadoop jar files.
2. Create the Mapper, Reducer and driver files for Matrix Multiplication.
3. Compile the program in the project folder.
4. Run the program from the project folder.
In mathematics, matrix multiplication or the matrix product is a binary operation that produces a
matrix from two matrices. The definition is motivated by linear equations and linear transformations on
vectors, which have numerous applications in applied mathematics, physics, and engineering. In more
detail, if A is an n × m matrix and B is an m × p matrix, their matrix product AB is an n × p matrix, in which
the m entries across a row of A are multiplied with the m entries down a column of B and summed to
produce an entry of AB. When two linear transformations are represented by matrices, then the matrix
product represents the composition of the two transformations.
Program:
Download Hadoop Common Jar files :
wget https://goo.gl/G4MyHp -O hadoop-common-3.1.2.jar
Download Hadoop Mapreduce Jar File :
wget https://goo.gl/KT8yfB -O hadoop-mapreduce-client-core-3.1.2.jar
Map.java
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
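The manual does not reproduce the class body. A minimal sketch of the mapper, assuming each input line has the form matrixName,i,j,value (for example M,0,1,5.0) and that the dimensions m (rows of M) and p (columns of N) have been stored in the job configuration by the driver:
public class Map extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));   // rows of M
        int p = Integer.parseInt(conf.get("p"));   // columns of N
        String[] indicesAndValue = value.toString().split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M")) {
            // element mij of M is needed by every column k of the product
            for (int k = 0; k < p; k++) {
                outputKey.set(indicesAndValue[1] + "," + k);
                outputValue.set("M," + indicesAndValue[2] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        } else {
            // element njk of N is needed by every row i of the product
            for (int i = 0; i < m; i++) {
                outputKey.set(i + "," + indicesAndValue[2]);
                outputValue.set("N," + indicesAndValue[1] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
    }
}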
Reduce.java
package com.lendap.hadoop;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;
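Again only the imports are shown in the manual. A sketch of the reducer, assuming the mapper output above and the shared dimension n stored in the configuration:
public class Reduce extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // collect the M and N entries contributing to this (i,k) cell, indexed by j
        HashMap<Integer, Float> hashM = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashN = new HashMap<Integer, Float>();
        for (Text val : values) {
            String[] parts = val.toString().split(",");
            if (parts[0].equals("M")) {
                hashM.put(Integer.parseInt(parts[1]), Float.parseFloat(parts[2]));
            } else {
                hashN.put(Integer.parseInt(parts[1]), Float.parseFloat(parts[2]));
            }
        }
        int n = Integer.parseInt(context.getConfiguration().get("n"));
        float result = 0.0f;
        for (int j = 0; j < n; j++) {
            float a = hashM.containsKey(j) ? hashM.get(j) : 0.0f;
            float b = hashN.containsKey(j) ? hashN.get(j) : 0.0f;
            result += a * b;   // accumulate mij * njk
        }
        // emit the cell "i,k" with its computed value
        context.write(key, new Text(Float.toString(result)));
    }
}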
MatrixMultiplication.java
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
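// The manual omits the class declaration and the start of main(); a minimal sketch follows.
// The 2x2 dimensions stored in the configuration are placeholder values used by the
// Map and Reduce sketches above; set them to match your input matrices.
public class MatrixMultiplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("m", "2");  // rows of M
        conf.set("n", "2");  // columns of M = rows of N
        conf.set("p", "2");  // columns of N
        Job job = Job.getInstance(conf, "MatrixMultiplication");
        job.setJarByClass(MatrixMultiplication.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);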
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // input and output HDFS paths are taken from the command line (see the note below the program)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Upload the M and N files, which contain the matrix data, to HDFS.
hadoop fs -mkdir Matrix/
hadoop fs -copyFromLocal M Matrix/
hadoop fs -copyFromLocal N Matrix/
Execute the jar file using the hadoop command; the job reads its records from HDFS and stores the
output back in HDFS.
hadoop jar MatrixMultiply.jar
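With the driver sketched above, the input and output HDFS paths are passed as command-line arguments; if the main class is not recorded in the jar manifest it must be named explicitly. The class name and output path below follow the sketch and are assumptions:
hadoop jar MatrixMultiply.jar com.lendap.hadoop.MatrixMultiplication Matrix/ MatrixOutput/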
Output:
Result:
Thus the MapReduce program to implement Matrix Multiplication has been successfully completed.
WORD COUNT MAP REDUCE PROGRAM USING MAP REDUCE PARADIGM
AIM:
To implement a Word Count MapReduce program and understand the MapReduce paradigm.
ALGORITHM:
1. First Open Eclipse -> then select File -> New -> Java Project ->Name it WordCount -> then Finish.
2. Create Three Java Classes into the project. Name them WCDriver(having the main
function), WCMapper, WCReducer.
3. You have to include two Reference Libraries for that:
Right Click on Project -> then select Build Path-> Click on Configure Build Path
4. You can see the Add External JARs option on the right-hand side. Click on it and add the below
mentioned files.
You can find these files in /usr/lib/
1. /usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.6.0-mr1-cdh5.13.0.jar
2. /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.13.0.jar
PROCEDURE:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
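The manual lists only the imports. The class bodies below follow the standard Apache Hadoop WordCount tutorial, which matches these imports and the run command that follows; the three-class layout (WCDriver, WCMapper, WCReducer) mentioned in the algorithm can be used equivalently:
public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1)
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper and reducer together; input/output paths come from the command line
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}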
Run the application:
$ bin/hadoop jar wc.jar WordCount /user/joe/wordcount/input
/user/joe/wordcount/output
OUTPUT
$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
RESULT:
Thus a basic Word Count Map Reduce program is executed successfully to understand Map Reduce
Paradigm.
INSTALLATION OF HIVE ALONG WITH PRACTICE EXAMPLES
AIM:
Installation of Hive along with practice examples.
Prerequisites
There are some prerequisites to install hive on any machine:
1. Java Installation
2. Hadoop Installation
ALGORITHM:
Step 1:
Verify Java is installed.
Open the Terminal and Type the Command.
java -version
If java is installed on the system, it will give you the version or else an error. In my case, Java is
already installed and below is the output of the command.
In case Java is not installed on your system, you can visit the link below to download and install it:
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
Java Installation
Step 2:
hadoop version
If Hadoop is already installed, this command will give you the version or else an error.
In my case, Hadoop is already installed, hence the output below.
Hadoop Installation
1. Setup Hadoop
2. Configure Hadoop
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml
3. Start the NameNode and the other daemons using the commands:
start-dfs.sh
start-yarn.sh
Install Hive:
The first thing we need to do is download the hive release which can be performed by clicking the
link below: https://apachemirror.wuchna.com/hive/
The above link gives a directory listing from which you have to choose stable-2 (highlighted below in yellow).
After opening stable-2, choose the bin file (highlighted in yellow in the screenshot), right-click it and
select "copy link address".
http://apachemirror.wuchna.com/hive/stable-2/apache-hive-2.3.6-bin.tar.gz
Step 4: Set up the Hive environment by appending the following lines to ~/.bashrc file
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
$ source ~/.bashrc
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit hive-env.sh and append the following line:
export HADOOP_HOME=/usr/local/Hadoop
OUTPUT :
Set them in HDFS before verifying Hive. Use the following commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
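To verify the installation, the Hive shell can now be started and a simple statement issued:
$ hive
hive> show databases;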
RESULT:
Thus Hive was installed successfully along with practice examples.
AIM:
To execute Hive Commands in Cloudera
PROCEDURE:
1. To enter hive shell
hive
2. To display databases/ list the databases
Show databases;
3. Create a new database
Create database employeedb;
4. Use a database
Use employeedb;
5. Create a table
Create table employee (id bigint, name string, address string);
6. To list the tables in a database
Show tables;
7. Display the schema of a hive table
Describe employee;
8. Describe the table with extra information
Describe formatted employee;
9. Insert a record into the hive table
Insert into employee values (11,"Abi","Banglore");
10. Insert multiple records into the hive table:
Insert into employee values (22,"Ane","Chennai"),(33,"vino","salem"),(44,"Diana","theni");
[Insert will trigger a mapreduce job in background]
11. To check the hive warehouse directory from the terminal
hadoop fs -ls /user/hive/warehouse
hadoop fs -ls /user/hive/warehouse/employeedb.db
12. To check the folder structure of the employee table in the
employeedb database from the terminal:
hadoop fs -ls /user/hive/warehouse/employeedb.db/employee
13. Read the contents of the employee table from the terminal
hadoop fs -cat /user/hive/warehouse/employeedb.db/employee/*
14. Display table data with a where condition:
Select * from employee where address = "Chennai";
Select name, address from employee where address = "chennai";
Select name, address from employee where address = "Chennai" and id > 22;
15. To display distinct values
Select DISTINCT address from employee;
16. To display records with order by clause
Select name, address from employee order by address;
17. To display no. of records in a table
Select count(*) from employee;
18. To display records with group by clause
Select address, count(*) from employee group by address;
Select address, count(*) as employee_count from employee group by address;
19. Display records using limit clause
Select * from employee limit 1;
20. To exit from the hive shell
exit;
21. Create a new hive table with an <if not exists> statement
Create table if not exists menu_orders (
Id bigint, product_id string, customer_id bigint, quantity int, amount double);
22. Insert records into the menu_orders table
Insert into menu_orders values (11,"pho",11,3,120);
Insert into menu_orders values (12,"phon",12,4,130),(13,"phone",13,5,140);
OUTPUT:
1. Create database
2. Creating a table
3. Display Database
4. Describe Database
RESULT:
Thus the Hive commands are executed successfully in Cloudera.
INSTALLATION OF HBASE ALONG WITH PRACTICE EXAMPLES
AIM:
Installation of HBase along with Practice examples.
ALGORITHM:
1. Connect to hbase
hbase shell
2. To list the tables in hbase
list
3. To check whether the hbase master and region server are running or stopped
status
4. To exit the hbase shell
exit
5. create 'students', 'personal_details', 'contact_details', 'marks'
list
put 'students','student1','personal_details:name','kiruba'
put 'students','student1','personal_details:email','[email protected]'
6. To see all records, use scan; to read a single row or column family, use get
scan 'students'
get 'students','student1'
get 'students','student1', {COLUMN => 'personal_details'}
7. Delete email id column for student1
delete 'students','student1','personal_details:email'
scan 'students'
describe 'students'
exists 'students'
8. Drop a table
Drop is used to delete a hbase table. But this operator can’t be applied directly to the table. Instead,
the table is first disabled and then dropped.
disable 'students'
drop 'students'
OUTPUT:
Create
List
Describe
Drop
Alter
RESULT:
Thus HBase was installed and the practice examples were executed successfully.
IMPORT DATA BETWEEN HDFS AND RDBMS USING APACHE SQOOP
AIM:
To import data between HDFS and an RDBMS using Apache Sqoop.
ALGORITHM:
1. Sqoop IMPORT Command
The import command is used to import a table from a relational database into HDFS. In our case, we are going to
import tables from a MySQL database into HDFS.
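A typical import command looks like the following; the database name employees, the table emp, the credentials and the target directory are placeholder values to be replaced with your own:
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username root \
  --password cloudera \
  --table emp \
  --target-dir /user/cloudera/emp \
  -m 1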
OUTPUT:
After the code is executed, you can check the Web UI of HDFS i.e. localhost:50070 where the data is
imported.
RESULT:
Thus data was imported between HDFS and an RDBMS using Apache Sqoop successfully.