How To Set Up A Multi-Node Hadoop Cluster On Amazon EC2

We are going to set up a 4-node Hadoop cluster as follows:

• NameNode (Master)
• SecondaryNameNode
• DataNode (Slave1)
• DataNode (Slave2)

Here’s what you will need:

1. Amazon AWS account
2. PuTTY Windows client (to connect to the Amazon EC2 instances)
3. PuTTYgen (to generate a private key – this will be used in PuTTY to connect to the EC2 instances)
4. WinSCP (for secure copy)

1. Setting up Amazon EC2 Instances:


1.1 Get an Amazon AWS account by visiting this URL:
https://aws.amazon.com/free/?sc_ichannel=ha&sc_icampaign=free-tier&sc_icontent=2234&all-free-tier.sort-by=item.additionalFields.SortRank&all-free-tier.sort-order=asc

If you do not already have an account, please create one. I already have an AWS account and am
going to skip the sign-up process. Amazon EC2 comes with free-tier-eligible instances.
1.2 Launch Instance

Once you have signed up for an Amazon account, log in to Amazon Web Services, click on My
Account and navigate to the Amazon EC2 Console.

1.3 Select AMI

1.4 Select Instance Type

Select the micro instance


1.5 Configure Number of Instances

Enter 4 as the number of instances, one for each node of the cluster.

1.6 Add Storage

The minimum volume size is 8 GB.

1.7 Instance Description

Give your instance a name and description.


1.8 Define a Security Group

Create a new security group; later on we are going to modify it with additional security
rules.

1.9 Launch Instance and Create Key Pair

Review and Launch Instance.

Amazon EC2 uses public–key cryptography to encrypt and decrypt login information. Public–
key cryptography uses a public key to encrypt a piece of data, such as a password, then the
recipient uses the private key to decrypt the data. The public and private keys are known as a key
pair.
1.10 Launching Instances

Once you click “Launch Instance”, 4 instances should be launched in the “pending” state.

Once they are in the “running” state, we are going to rename the instances as below.

1. HadoopNameNode (Master)
2. HadoopSecondaryNameNode
3. HadoopSlave1 (a DataNode will reside here)
4. HadoopSlave2 (a DataNode will reside here)
2. Setting up client access to Amazon Instances

Now, let’s make sure we can connect to all 4 instances. For that we are going to use the PuTTY client.
We are also going to set up password-less SSH access among the servers to set up the cluster. This allows
remote access from the master server to the slave servers, so the master can remotely start the DataNode
and NodeManager daemons on the slaves.

We are going to use the downloaded hadoopec2cluster.pem file to generate the private key (.ppk). In
order to generate the private key we need the PuTTYgen client.

2.1 Generating Private Key

Let’s launch the PuTTYgen client and import the key pair we created during the launch-instance step
– “hadoopec2cluster.pem”.

Navigate to Conversions and choose “Import Key”.


Once you import the key, you can enter a passphrase to protect your private key, or leave the
passphrase fields blank to use the private key without one. A passphrase protects the private key from
being used by anyone else on your machine to access the servers.

Any access to a server using a passphrase-protected private key will require the user to enter the
passphrase to enable private-key access to the AWS EC2 server.

2.2 Save Private Key

Now save the private key by clicking “Save private key”, and click “Yes” as we are going to
leave the passphrase empty.
Save the .ppk file and give it a meaningful name.

2.3 Connect to Amazon Instance

Let’s connect to HadoopNameNode first. Launch the PuTTY client, grab the instance’s public DNS name, and import the
.ppk private key that we just created for password-less SSH access. As per Amazon
documentation, for Ubuntu machines the username is “ubuntu”.
2.3.1 Provide private key for authentication

2.3.2 Hostname and Port and Connection Type

Click “Open” to launch the PuTTY session.

PuTTY will prompt you for the username; enter ubuntu. If everything goes well you will be presented with a
welcome message and a Unix shell prompt.
Similarly, connect to the remaining 3 machines (HadoopSecondaryNameNode, HadoopSlave1 and HadoopSlave2)
to make sure you can connect successfully.
2.4 Enable Public Access

Issue the ifconfig command and note down the IP address. Next, we are going to update the
hostname with the EC2 public DNS name, and finally we are going to update the /etc/hosts file to map the EC2
public DNS name to that IP address. This will let us configure the master and slave nodes with
hostnames instead of IP addresses.

Following is the ifconfig output on HadoopNameNode.
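One way to set the hostname to the public DNS name (a sketch using the master’s DNS name as it appears in the configuration files later on; substitute each instance’s own name, and note that the hostname command alone does not persist across reboots):

$ sudo hostname ec2-52-66-240-250.ap-south-1.compute.amazonaws.com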

2.5 Modify /etc/hosts

Let’s change the host entry to the EC2 public DNS name and IP address.

Open /etc/hosts in vi; the very first line will show 127.0.0.1 localhost. We need to replace
that with the Amazon EC2 hostname and the IP address we just collected.
Modify the file and save your changes. Repeat sections 2.4 and 2.5 for the remaining 3 machines.
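As a sketch (the private IP address below is a placeholder for illustration; use the address reported by ifconfig and your instance’s own public DNS name), the first line of /etc/hosts on HadoopNameNode would then look like:

172.31.0.10 ec2-52-66-240-250.ap-south-1.compute.amazonaws.com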

3. Setup WinSCP access to EC2 instances

In order to securely transfer files from your Windows machine to Amazon EC2, WinSCP is a
handy utility. Provide the hostname, username and private key file, save your configuration and
log in.
Upon successful login you will see the Unix file system of the logged-in user (/home/ubuntu) on your Amazon EC2
Ubuntu machine.

Upload the .pem file to the master machine (HadoopNameNode) into the /home/ubuntu/.ssh folder. It will be
used when connecting to the slave nodes while starting the Hadoop daemons.

4. Apache Hadoop Installation and Cluster Setup


4.1 Update the packages and dependencies.

Let’s update the packages. I will start with the master; repeat this for all slaves.

$ sudo apt-get update

Once it’s complete, let’s install Java.

4.2 Install Java

Install OpenJDK 7 in Ubuntu (Hadoop 2.7.x requires Java 7 or later):

$ sudo apt-get install openjdk-7-jdk

Check that Ubuntu now uses JDK 7.


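A quick way to verify the installed version (the exact build string will vary):

$ java -version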

Repeat this for all slaves.

4.3 Download Hadoop

We are going to use the Hadoop 2.7.1 stable release from the Apache archive site.

Issue the wget command from the shell on all the machines:

$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz

Create a folder called /apache on all machines

sudo mkdir /apache


Change the ownership of the /apache folder to the ubuntu user, then extract the tarball into it and
review the package contents and configuration files.

sudo chown ubuntu /apache

$ tar -xzvf hadoop-2.7.1.tar.gz -C /apache

In the /apache folder, create a symlink to the hadoop-2.7.1 folder:

cd /apache

ln -s hadoop-2.7.1 hadoop
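As a quick optional check that the extraction and symlink succeeded, list the folder:

ls -l /apache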

4.4 Setup Environment Variable

Navigate to home directory


$ cd

Open the .bashrc file in the vi editor:

$ vi .bashrc

Add the following at the end of the file:

export HADOOP_HOME=/apache/hadoop

export HADOOP_CONF_DIR=/apache/hadoop/etc/hadoop

export JAVA_HOME=/usr

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

To check whether it has been updated correctly, reload the bash profile with the following command:

source ~/.bashrc
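To verify (assuming the export lines above were added as shown), echo one of the variables and ask Hadoop for its version:

echo $HADOOP_HOME
hadoop version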

4.5 Setup Password-less SSH on Servers

Create a config file in the /home/ubuntu/.ssh folder on all the nodes and enter the following in it:

Host master
    HostName ec2-13-233-154-170.ap-south-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws_key.pem
Host slave1
    HostName ec2-35-154-115-217.ap-south-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws_key.pem
Host slave2
    HostName ec2-13-232-200-61.ap-south-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws_key.pem

Change the permissions of both the config and .pem files to 600:

sudo chmod 600 config

sudo chmod 600 aws_key.pem


Run the ssh-keygen command on the master node:

ssh-keygen

Check that the public and private RSA keys are in the .ssh folder.
Append the id_rsa.pub file to the authorized_keys file:

cat id_rsa.pub >> authorized_keys

Copy the config, .pem and authorized_keys files to all the nodes:

scp config aws_key.pem authorized_keys slave1:~/.ssh
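The same files need to reach every node; for example, for slave2 (and likewise for any other host defined in your SSH config):

scp config aws_key.pem authorized_keys slave2:~/.ssh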

Check the password-less login as follows:

ssh slave1

4.6 Hadoop Cluster Setup

This section will cover the hadoop cluster configuration. We will have to modify

• hadoop-env.sh – This file contains environment variable settings used by Hadoop.
You can use these to affect some aspects of Hadoop daemon behavior, such as where log
files are stored, the maximum amount of heap used, etc. The only variable you should
need to change at this point is JAVA_HOME, which specifies the path to the
Java 1.7.x installation used by Hadoop.
• core-site.xml – key property fs.default.name (a deprecated alias of fs.defaultFS in Hadoop 2.x) – the
NameNode URI, e.g. hdfs://namenode/
• hdfs-site.xml – key property dfs.replication – 3 by default (note that with only two DataNodes in
this cluster, a replication factor above 2 cannot actually be satisfied)
• mapred-site.xml – set the processing framework to YARN
• yarn-site.xml – set up the YARN-related properties.
• masters – defines on which machine Hadoop will start the SecondaryNameNode in our
multi-node cluster.
• slaves – defines the list of hosts, one per line, where the Hadoop slave daemons
(DataNodes and NodeManagers) will run.

We will make these changes first on the master (NameNode) and then copy them to the remaining 3
nodes.

1. hadoop-env.sh
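The only change required here is JAVA_HOME; a minimal edit, consistent with the .bashrc setting in section 4.4, would be:

export JAVA_HOME=/usr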

2. core-site.xml

<property>
<name>fs.default.name</name>
<value>hdfs://ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:9000</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/apache/hadoop/tmp</value>
</property>

Create the tmp folder

3. hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/apache/hadoop/hdfs/nn/</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/apache/hadoop/hdfs/dn/</value>
</property>

Create the tmp, NameNode and DataNode folders on all the nodes (the NameNode folder is needed only on the master node):

mkdir -p /apache/hadoop/tmp

mkdir -p /apache/hadoop/hdfs/nn/

mkdir -p /apache/hadoop/hdfs/dn/

4. mapred-site.xml
Create the mapred-site.xml file from the mapred-site.xml.template file:

cp mapred-site.xml.template mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

5. yarn-site.xml (change the nodemanager localizer property to the respective node’s DNS name on
each node)

<property>
<name>yarn.resourcemanager.hostname</name>
<value>ec2-52-66-240-250.ap-south-1.compute.amazonaws.com</value>
</property>
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:8040</value>
</property>
6. masters
Put the hostname of the node that should run the SecondaryNameNode:
slave1

7. slaves
Put the hostnames of the slave machines, one per line:

slave1
slave2

Copy all the configuration files to all the nodes, for example:

scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml masters slaves slave1:$HADOOP_CONF_DIR

Update the yarn-site.xml file on each slave machine so that yarn.nodemanager.localizer.address uses that
slave’s own public DNS name (do this for slave1 and for slave2).
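For example, on slave1 (using slave1’s public DNS name from the SSH config in section 4.5; slave2 is configured the same way with its own name):

<property>
<name>yarn.nodemanager.localizer.address</name>
<value>ec2-35-154-115-217.ap-south-1.compute.amazonaws.com:8040</value>
</property>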
4.7 Hadoop Daemon Startup

The first step in starting up your Hadoop installation is formatting the Hadoop filesystem (HDFS), which
is implemented on top of the local filesystems of your cluster. You need
to do this only the first time you set up a Hadoop installation. Do not format a running Hadoop
filesystem; this will cause all your data to be erased.

To format the NameNode:

$ hadoop namenode -format

Let’s start all Hadoop daemons from HadoopNameNode:

$HADOOP_HOME/sbin/start-all.sh
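Note that start-all.sh is deprecated in Hadoop 2.x; it simply delegates to the HDFS and YARN start scripts, which can also be run separately from the master:

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh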

To protect the EC2 instances from unauthorized access from the internet, set the security group rules as
follows.

Check the processes running on all the nodes using the jps command.
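With the configuration above you would typically expect NameNode and ResourceManager on the master, and DataNode and NodeManager on each slave (plus SecondaryNameNode on the host listed in the masters file). For example, on the master (the process IDs shown are illustrative):

$ jps
2481 NameNode
2735 ResourceManager
3056 Jps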

Check the NameNode web UI on port 50070 on the master machine.

Check the ResourceManager web UI on port 8088 on the master machine.
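For example, using the master’s public DNS name from core-site.xml (substitute your own, and make sure these ports are opened in the security group):

http://ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:50070
http://ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:8088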
