How To Set Up A Multi-Node Hadoop Cluster On Amazon EC2
We are going to set up a 4-node Hadoop cluster, as below.
• NameNode (Master)
• SecondaryNameNode
• DataNode (Slave1)
• DataNode (Slave2)
If you do not already have an AWS account, please create one. I already have an AWS account and am going to skip the sign-up process. Amazon EC2 comes with free-tier-eligible instances.
1.2 Launch Instance
Once you have signed up for an Amazon account, log in to Amazon Web Services, click on My Account, and navigate to the Amazon EC2 Console.
Create a new security group; later on we are going to modify it with security rules.
Amazon EC2 uses public-key cryptography to encrypt and decrypt login information. Public-key cryptography uses a public key to encrypt a piece of data, such as a password; the recipient then uses the private key to decrypt the data. The public and private keys are known as a key pair.
1.10 Launching Instances
Once you click "Launch Instance", 4 instances should be launched in the "pending" state.
Once they are in the "running" state, we are going to rename the instances as below.
1. HadoopNameNode (Master)
2. HadoopSecondaryNameNode
3. HadoopSlave1 (data node will reside here)
4. HadoopSlave2 (data node will reside here)
2. Setting up client access to Amazon Instances
Now, let's make sure we can connect to all 4 instances. For that we are going to use the PuTTY client.
We are going to set up password-less SSH access among the servers to set up the cluster. This allows remote access from the master server to the slave servers, so the master can remotely start the DataNode and NodeManager services on the slave machines.
We are going to use the downloaded hadoopec2cluster.pem file to generate the private key (.ppk). To generate the private key we need the PuTTYgen client.
Let's launch the PuTTYgen client and import the key pair we created during the launch-instance step: "hadoopec2cluster.pem".
Any access to a server using a passphrase-protected private key will require the user to enter the passphrase before the key can be used to access the AWS EC2 server.
Now save the private key by clicking "Save private key", and click "Yes" as we are going to leave the passphrase empty.
Save the .ppk file and give it a meaningful name.
Let's connect to HadoopNameNode first. Launch the PuTTY client, grab the instance's public DNS name, and import the .ppk private key we just created for password-less SSH access. As per Amazon documentation, the username for Ubuntu machines is "ubuntu".
2.3.1 Provide private key for authentication
PuTTY will prompt you for the username; enter ubuntu. If everything goes well, you will be presented with a welcome message and a Unix shell prompt at the end.
Similarly, connect to the remaining 3 machines (HadoopSecondaryNameNode, HadoopSlave1, HadoopSlave2) to make sure you can connect successfully.
2.4 Enable Public Access
Issue the ifconfig command and note down the IP address. Next, we are going to update the hostname with the EC2 public DNS name, and finally we are going to update the /etc/hosts file to map the EC2 public DNS name to the IP address. This will let us configure the master and slave nodes with hostnames instead of IP addresses.
Open /etc/hosts in vi; the very first line will show 127.0.0.1 localhost. We need to replace that with the Amazon EC2 hostname and the IP address we just collected.
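A sketch of these steps on the master, reusing the document's sample DNS name and a hypothetical private IP (172.31.10.20); substitute your own values:
$ ifconfig | grep 'inet addr'    # note the private IP
$ sudo hostname ec2-52-66-240-250.ap-south-1.compute.amazonaws.com
$ sudo vi /etc/hosts             # replace the 127.0.0.1 line with:
172.31.10.20 ec2-52-66-240-250.ap-south-1.compute.amazonaws.com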
Modify the file and save your changes. Repeat sections 2.3 and 2.4 for the remaining 3 machines.
WinSCP is a handy utility for securely transferring files from your Windows machine to Amazon EC2. Provide the hostname, username, and private key file, save your configuration, and log in.
Upon successful login you will see the Unix file system of the logged-in user (/home/ubuntu) on your Amazon EC2 Ubuntu machine.
Upload the .pem file to the master machine (HadoopNameNode) in the /home/ubuntu/.ssh folder. It will be used when connecting to the slave nodes while starting the Hadoop daemons.
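SSH refuses private keys with permissive file modes, so after uploading, tighten the permissions:
$ chmod 400 /home/ubuntu/.ssh/hadoopec2cluster.pem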
Let's update the packages. I will start with the master; repeat this for all slaves.
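The original shows this as a screenshot; a standard Ubuntu sequence is:
$ sudo apt-get update
$ sudo apt-get upgrade -y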
Add the following PPA and install the latest Oracle Java 7 (JDK) on Ubuntu.
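The PPA itself is not named in the original; the one commonly used for Oracle Java 7 at the time was webupd8team/java (an assumption here, and note that this PPA has since been discontinued):
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer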
We are going to use the Hadoop 2.7.1 stable release from the Apache archive and install it under /apache.
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz
$ sudo mkdir -p /apache && sudo chown ubuntu:ubuntu /apache    # create the install root (assumes the ubuntu user owns it)
$ tar -xzf hadoop-2.7.1.tar.gz -C /apache
$ cd /apache
$ ln -s hadoop-2.7.1 hadoop
$ vi .bashrc
export HADOOP_HOME=/apache/hadoop
export HADOOP_CONF_DIR=/apache/hadoop/etc/hadoop
export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
To check whether it has been updated correctly, reload the bash profile with the following command:
source ~/.bashrc
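To verify, check that the hadoop binary is now on the PATH:
$ hadoop version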
Create a config file in the /home/ubuntu/.ssh folder on all the nodes and enter the following in it.
Host master
    HostName ec2-13-233-154-170.ap-south-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws_key.pem
Host slave1
    HostName ec2-35-154-115-217.ap-south-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws_key.pem
Host slave2
    HostName ec2-13-232-200-61.ap-south-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws_key.pem
Generate an RSA key pair, accepting the defaults:
$ ssh-keygen
Check that the public and private RSA keys are now in the .ssh folder, then append id_rsa.pub to the authorized_keys file:
$ cat id_rsa.pub >> authorized_keys
Finally, verify password-less access to a slave:
$ ssh slave1
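If you also want the slaves to accept the master's own RSA key (so access works even without the .pem), a sketch using the aliases from the config file above:
$ cat ~/.ssh/id_rsa.pub | ssh slave1 'cat >> ~/.ssh/authorized_keys'
$ cat ~/.ssh/id_rsa.pub | ssh slave2 'cat >> ~/.ssh/authorized_keys'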
This section covers the Hadoop cluster configuration. We will have to modify:
• hadoop-env.sh – This file contains environment variable settings used by Hadoop. You can use these to affect some aspects of Hadoop daemon behavior, such as where log files are stored, the maximum amount of heap used, etc. The only variable you should need to change at this point is JAVA_HOME, which specifies the path to the Java 1.7.x installation used by Hadoop.
• core-site.xml – key property fs.default.name – the NameNode URI, e.g. hdfs://namenode/ (deprecated in Hadoop 2.x in favor of fs.defaultFS, but still honored).
• hdfs-site.xml – key property dfs.replication – 3 by default.
• mapred-site.xml – set the processing framework as YARN
• yarn-site.xml – set up the yarn related properties.
• masters – defines on which machines Hadoop will start secondary NameNodes in our
multi-node cluster.
• slaves – defines the list of hosts, one per line, where the Hadoop slave daemons (DataNodes and NodeManagers) will run.
We will first start with the master (NameNode) and then copy the above XML changes to the remaining 3 nodes (all slaves).
1. hadoop-env.sh
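The original shows this change as a screenshot; the only edit typically needed is to set JAVA_HOME explicitly, matching the value used in .bashrc above (adjust if your JDK lives elsewhere, e.g. /usr/lib/jvm/java-7-oracle):
export JAVA_HOME=/usr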
2. core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/apache/hadoop/tmp</value>
</property>
3. hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/apache/hadoop/hdfs/nn/</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/apache/hadoop/hdfs/dn/</value>
</property>
Create the NameNode and DataNode folders on all the nodes (the nn folder is needed only on the master node):
mkdir -p /apache/hadoop/tmp
mkdir -p /apache/hadoop/hdfs/nn/
mkdir -p /apache/hadoop/hdfs/dn/
4. mapred-site.xml
Create the mapred-site.xml file from the mapred-site.xml.template file:
cp mapred-site.xml.template mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
5. yarn-site.xml (change the nodemanager localizer property to the respective node's DNS name on each node)
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ec2-52-66-240-250.ap-south-1.compute.amazonaws.com</value>
</property>
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:8040</value>
</property>
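Note: the original stops here, but to run MapReduce jobs on YARN the NodeManagers also need the shuffle auxiliary service; a minimal addition to yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>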
6. masters
Put the hostname of the machine where the secondary NameNode will run:
slave1
7. slaves
Put the hostnames of the slave machines (where the DataNodes will run), one per line:
slave1
slave2
Update the yarn-site.xml file on each slave machine (slave1 and slave2) so that yarn.nodemanager.localizer.address reflects that node's own public DNS name.
4.7 Hadoop Daemon Startup
The first step in starting up your Hadoop installation is formatting the Hadoop filesystem (HDFS), which is implemented on top of the local filesystems of your cluster. You need to do this only the first time you set up a Hadoop installation. Do not format a running Hadoop filesystem; doing so will erase all your data.
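The format command itself is not shown in the original; the standard command, run once on the master, is:
$ $HADOOP_HOME/bin/hdfs namenode -format
Then start all the daemons from the master: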
$HADOOP_HOME/sbin/start-all.sh
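Note that start-all.sh is deprecated in Hadoop 2.x; the equivalent explicit sequence is:
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh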
To protect the EC2 instances from unauthorized access from the internet, set the security group rules accordingly (for example, restrict inbound access to your own IP address and to the other cluster nodes).
Check the processes running on all the nodes using the jps command.
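A rough idea of what to expect (process IDs omitted; exact placement depends on your masters and slaves files):
$ jps
# NameNode machine: NameNode, ResourceManager
# DataNode machines: DataNode, NodeManager
# secondary machine (per the masters file): SecondaryNameNode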
Check the ResourceManager web UI on port 8088 of the master machine, e.g. http://ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:8088/.