Saturday, December 19, 2015

Hive on Spark With HDP 2.3.2

It took me a while to figure this out, so I hope this guide helps others. I am using HDP 2.3.2 on Windows Azure but not using any Windows-specific commands, so this should work with HDP 2.3.2 on any OS.

If you are using a sandbox, please make sure that the VirtualBox appliance has both a host-only network and a NAT adapter attached. Host-only is required to connect to the sandbox from outside, and NAT is required for the sandbox to reach the internet. Look for a previous post on this blog about how to do it.

Set up Maven for the build


This may not be required for everybody, but in my case the build kept warning that my Maven was older than 3.3.3, so I decided to get a later version.

sudo su
export PATH=./:$PATH
wget https://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar -zxvf apache-maven-3.3.9-bin.tar.gz
export M2_HOME=./apache-maven-3.3.9
export PATH=$M2_HOME/bin:$PATH

Test whether Maven is set up correctly

which mvn

This should show mvn located in ./apache-maven-3.3.9/bin
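You can also double-check which version actually gets picked up; anything 3.3.3 or newer should stop the warning:

# prints the Maven version along with the Java it will use
mvn -version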

Note: If you still get build errors, you may want to modify make-distribution.sh

vi make-distribution.sh

Look for

MVN="$SPARK_HOME/build/mvn"

and replace it with the following, adjusting the Maven path to wherever you installed it.

export M2_HOME=/usr/hdp/build/sparkbuild/apache-maven-3.3.9/

MVN="/usr/hdp/build/sparkbuild/apache-maven-3.3.9/bin/mvn"


Get the Spark source files and extract them

wget https://archive.apache.org/dist/spark/spark-1.4.1/spark-1.4.1.tgz
tar -zxvf spark-1.4.1.tgz

If you get an error from the tar command, the downloaded file is likely corrupt; I had to switch links. Make sure the Spark version downloaded is 1.4.1.

Start the build with instructions not to bundle the Hive dependency. You may omit ",parquet-provided" from the command below if you don't plan to use Parquet file storage.

cd spark-1.4.1
./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

Note: Watch the screen for 30 seconds for any confirmation message, else go for a walk as this is going to run for a while.

The build process creates a file named spark-1.4.1-bin-hadoop2-without-hive.tgz
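If the build finished cleanly, the tarball should be sitting in the root of the Spark source tree; a quick check before moving on:

# confirm the distribution tarball was created
ls -lh spark-1.4.1-bin-hadoop2-without-hive.tgz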


Configure the environment

Extract the binaries and move them to the HDP root

tar -zxvf spark-1.4.1-bin-hadoop2-without-hive.tgz
mv spark-1.4.1-bin-hadoop2-without-hive/ /usr/hdp/spark-1.4.1
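If the move worked, the usual Spark layout (bin, conf, lib and so on) should now be under the new location:

# sanity-check the relocated distribution
ls /usr/hdp/spark-1.4.1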

Add the required Hadoop libraries to the Spark classpath configuration

cd /usr/hdp/spark-1.4.1/conf
cp spark-env.sh.template spark-env.sh

vi spark-env.sh

Add the following line to the end of spark-env.sh. Do not replace $(hadoop classpath) with anything; the line needs to be added to spark-env.sh exactly as it is.

export SPARK_DIST_CLASSPATH=$(hadoop classpath)
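To see what that variable will expand to at runtime, you can run the same command by hand; it should print a long, colon-separated list of HDP jar and conf directories:

# preview the classpath that spark-env.sh will pick up
hadoop classpath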

There is an issue in mapred-site.xml which leads to a startup failure with a "bad substitution" message. Remove the offending path from the mapred config file.

vi /etc/hadoop/conf/mapred-site.xml 

Search for mapreduce.application.classpath in the editor and remove the text

:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar

from the value.
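If you prefer to script this edit instead of using vi, a sed one-liner along these lines should do it. This is only a sketch; back up the file first, and note that the single quotes keep ${hdp.version} from being expanded by the shell so the literal text in the file is matched:

# keep a copy of the original config
cp /etc/hadoop/conf/mapred-site.xml /etc/hadoop/conf/mapred-site.xml.bak
# strip the hadoop-lzo entry from mapreduce.application.classpath
sed -i 's#:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar##' /etc/hadoop/conf/mapred-site.xml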

Create the log directory for Spark
mkdir /tmp/spark-events
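The queries later in this post run as the hive user, so it may be worth making sure that user can write event logs there. This is an assumption about your setup; adjust the ownership or mode to taste:

# allow any user (including hive) to write event logs, /tmp-style sticky bit
chmod 1777 /tmp/spark-events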

Change the YARN scheduler class to the Fair Scheduler. You can do this through Ambari or by making changes to yarn-site.xml.

vi /etc/hadoop/conf/yarn-site.xml

Search for name "yarn.resourcemanager.scheduler.class" and change the value from
"org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler"
to
"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"

Alternatively, you can change the scheduler value through Ambari (screenshot below).


Restart the YARN, Hive, and MapReduce services. Oozie may need a restart as well.


Test if the setup was successful

Start the hive CLI as the hive user

sudo -u hive hive

At the hive prompt, issue the following commands

set spark.home=/usr/hdp/spark-1.4.1;
set hive.execution.engine=spark;
set spark.master=yarn-client;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=/tmp/spark-events;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;

Assuming the default database has the tables sample_07 and sample_08, issue the following SQL at the hive prompt. It's a sample cross-join query to make sure Hive is using Spark.

create table sparkytable as select a.* from sample_07 a inner join sample_08 b;

While the query is running, you can follow the hive CLI log in another terminal by using

tail -f /tmp/hive/hive.log

You will notice in the logs that Hive brings up a Spark client to run the queries.

Note: The first time, the query may appear to run very slowly, as Spark requests YARN to start a container. Subsequent queries should run faster.

Make the changes permanent

Fire up a browser and bring up Ambari. Log in using admin/admin. Go to Configs -> Advanced for Hive and add the parameters below to hive-site.xml.

spark.master=yarn-client
spark.eventLog.enabled=true
spark.eventLog.dir=/tmp/spark-events
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer

Screenshot from my Ambari



Or you can add them directly to hive-site.xml.

Notice that the above parameters do not include the hive.execution.engine parameter. Restart Hive.
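Once Hive is back up, you can confirm the values are being picked up from hive-site.xml; issuing set with a property name and no value just prints the current setting:

# should echo back yarn-client and the Kryo serializer
sudo -u hive hive -e "set spark.master; set spark.serializer;"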

Now you can choose between MR, Tez, or Spark as the execution engine at runtime by using the following statements in the hive CLI or your favorite SQL editor connected to Hive.

set hive.execution.engine=spark;
set hive.execution.engine=tez;
set hive.execution.engine=mr;

You can verify that these engines are actually used by starting a query and then checking the YARN Resource Manager UI.
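If you prefer the command line to the UI, the yarn client can list running applications; while a Hive-on-Spark query is executing you should see an entry for the Spark client that Hive launched:

# list applications currently known to the ResourceManager
yarn application -list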




1 comment:

  1. Great post, but I get the error below, please help.

    hive> create table sparkytable as select a.* from sample_07 a inner join sample_08 b;
    Warning: Map Join MAPJOIN[7][bigTable=b] in task 'Stage-1:MAPRED' is a cross product
    Query ID = hive_20170109053846_a670281d-d0f5-4e14-a8a9-c8068e0b28c2
    Total jobs = 2
    Launching Job 1 out of 2
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask



    HDP 2.3.2 comes with Spark by itself. What is the need to build it once again?
