Monday, December 21, 2015

Using Apache Hadoop Client Library from Mac local system to connect to remote hadoop cluster

In the process of writing MapReduce jobs, one of the biggest frustrations is having to upload the jar to the cluster before starting a job. There are several ways to overcome this problem
  • Automate the build system to put the jar on the cluster
  • Write a separate script that picks the jar from the build target folder and puts it on the cluster
  • Run MapReduce jobs directly from the local system

To me the third approach looked the best, since there is no hassle of doing an ssh into the cluster to run the job. I work with the HDP platform on a Mac, hence the following steps. If you are on Windows you may want to get Cygwin first and try this approach. [I do not guarantee that this approach works on Windows]

I use a HW (Hortonworks) sandbox for my development purposes. However, I could not find the packaged Hadoop core distributions for download there, so I switched to the Apache site. The following page has links to download the binaries for all versions

https://archive.apache.org/dist/hadoop/core/

I used 2.6.2 and a direct link for download is hadoop-2.6.2.tar.gz

I created a directory under my user folder for the files.

mkdir ~/Apache

Assuming the file was downloaded to your Downloads folder

mv ~/Downloads/hadoop-2.6.2.* ~/Apache/
cd ~/Apache
tar -zxvf hadoop-2.6.2.*

Now the files are extracted and available in /Users/<your username>/Apache
Let's change the configurations to make the client talk to the cluster when a command is run.

cd ~/Apache/hadoop-2.6.2/etc/hadoop

There are a bunch of files here. The 3 files that need changes are
  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml

The best approach is to copy these files from your cluster and put them (replacing the existing ones) in your ~/Apache/hadoop-2.6.2/etc/hadoop folder. Unless the default locations were changed, you can find these files on your cluster under the /etc/hadoop/conf/ folder
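For example, the copy can be scripted with scp. Here is a minimal sketch, assuming the cluster is reachable as sandbox.hortonworks.com and you can ssh in as root (both are assumptions; substitute your own host and user). The scp is prefixed with echo as a dry run, so drop the echo to actually copy:

```shell
# Pull the three config files from the cluster in one loop.
# CLUSTER is an assumption - point it at your own cluster or sandbox.
# echo makes this a dry run; remove it to perform the copy.
CLUSTER=root@sandbox.hortonworks.com
for f in core-site.xml hdfs-site.xml yarn-site.xml; do
  echo scp "$CLUSTER:/etc/hadoop/conf/$f" "$HOME/Apache/hadoop-2.6.2/etc/hadoop/"
done
```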

Edit your bash profile to include the environment variables required by the Hadoop client libraries and to add the bin path.

vi ~/.bash_profile

Add the following lines to the end of the file. Replace the username with your own. If you are using a sandbox, you can set HADOOP_USER_NAME to root

export PATH=$PATH:/Users/<your local username>/Apache/hadoop-2.6.2/bin:/Users/<your local username>/Apache/hadoop-2.6.2/sbin
export HADOOP_USER_NAME=<your cluster username or root>
export HADOOP_HOME=/Users/<your username>/Apache/hadoop-2.6.2/
export HADOOP_CONF_DIR=/Users/<your username>/Apache/hadoop-2.6.2/etc/hadoop/
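The same exports can also be written against $HOME, so the path works regardless of your local username (a minor variation on the lines above; HADOOP_USER_NAME=root is a sandbox assumption):

```shell
# Equivalent exports using $HOME instead of a hard-coded /Users/<name> path.
# HADOOP_USER_NAME=root assumes a sandbox; use your cluster username otherwise.
export HADOOP_HOME="$HOME/Apache/hadoop-2.6.2"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export HADOOP_USER_NAME=root
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```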

After saving the file, source .bash_profile to apply the changes in the current shell

. ~/.bash_profile

With these changes you are ready to go from your local machine.
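A quick sanity check at this point: the first line below only confirms the client scripts are on PATH; the commented commands need a reachable cluster, so run those manually.

```shell
# Check that the hadoop client is on PATH. Cluster-facing checks are
# left commented, since they need the cluster to be reachable:
#   hadoop version   -> should report the 2.6.2 client
#   hdfs dfs -ls /   -> should list the remote cluster's HDFS root
command -v hadoop >/dev/null && echo "hadoop client found on PATH" \
  || echo "hadoop not on PATH yet - re-check ~/.bash_profile"
```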

Note: if you are using a HW sandbox on your local system with host-only networking, edit your hosts file /private/etc/hosts (you will need administrator privileges to save it) and add a line like the one below. Change the IP address to your sandbox's host-only adapter IP address. (Run ifconfig in your sandbox to find the IP address)

192.168.56.102  sandbox.hortonworks.com 

This is just to ensure that if you are using sandbox.hortonworks.com in your core-site.xml, hdfs-site.xml and yarn-site.xml, the hostname resolves to the sandbox.
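A quick way to check whether the mapping is already in place (the hostname is the sandbox default; adjust if yours differs):

```shell
# On macOS, /etc/hosts is a symlink to /private/etc/hosts, so either path works.
grep -q 'sandbox.hortonworks.com' /etc/hosts \
  && echo "sandbox hostname already mapped" \
  || echo "add the line above to /etc/hosts (needs sudo)"
```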

