It took me a while to figure this out, so I hope this guide helps others. I am using HDP 2.3.2 on Windows Azure, but I am not using any Windows-specific commands, so this should work with HDP 2.3.2 on any OS.
If you are using a sandbox, please make sure the VirtualBox appliance has both a host-only network adapter and a NAT adapter attached. Host-only is required to connect to the sandbox from outside, and NAT is required for the sandbox to reach the internet. Look for a previous post on this blog about how to do it.
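For reference, here is a rough sketch of how to attach the adapters from the command line, assuming your appliance is named "Hortonworks Sandbox" and a host-only interface vboxnet0 already exists (adjust the names to your setup, and run these while the VM is powered off; the VirtualBox GUI works just as well):
VBoxManage modifyvm "Hortonworks Sandbox" --nic1 hostonly --hostonlyadapter1 vboxnet0
VBoxManage modifyvm "Hortonworks Sandbox" --nic2 nat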
Set up Maven for the build
This may not be required for everybody, but the build kept warning me that my Maven was older than 3.3.3, so I decided to get a later version.
sudo su
export PATH=./:$PATH
wget https://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar -zxvf apache-maven-3.3.9-bin.tar.gz
export M2_HOME=./apache-maven-3.3.9
export PATH=$M2_HOME/bin:$PATH
Test whether Maven is set up correctly
which mvn
This should show mvn resolving to ./apache-maven-3.3.9/bin
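You can also confirm the version directly (the build complains about anything older than 3.3.3):
mvn -version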
Note: If you still get build errors, you may want to modify make-distribution.sh in the Spark source directory (downloaded in the next step) so that it uses this Maven instead of the bundled one
vi make-distribution.sh
look for
MVN="$SPARK_HOME/build/mvn"
and replace it with the following. Put your own path to Maven appropriately.
export M2_HOME=/usr/hdp/build/sparkbuild/apache-maven-3.3.9/
MVN="/usr/hdp/build/sparkbuild/apache-maven-3.3.9/bin/mvn"
Get the source files for Spark and extract them
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.4.1/spark-1.4.1.tgz
tar -zxvf spark-1.4.1.tgz
If you get an error from the tar command, the downloaded file is likely corrupt; I had to switch mirrors. Make sure the Spark version downloaded is 1.4.1.
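As a quick sanity check (just a convenience, not part of the original steps), you can inspect the download; a genuine tarball should be reported as gzip compressed data, while a mirror-listing page would show up as HTML text:
file spark-1.4.1.tgz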
Start the build with instructions not to build with the Hive dependency. You may leave out ",parquet-provided" from the command below if you don't plan to use Parquet file storage.
cd spark-1.4.1
make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
Note: Watch the screen for the first 30 seconds or so for any confirmation messages, then go for a walk, as the build is going to run for a while.
The build process creates a file named spark-1.4.1-bin-hadoop2-without-hive.tgz
Configure the environment
Extract the binaries and move them to the HDP root
tar -zxvf spark-1.4.1-bin-hadoop2-without-hive.tgz
mv spark-1.4.1-bin-hadoop2-without-hive/ /usr/hdp/spark-1.4.1
Add the required Hadoop libraries to the Spark classpath configuration
cd /usr/hdp/spark-1.4.1/conf
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
Add the following line to the end of "spark-env.sh". Do not replace $(hadoop classpath) with anything; the line needs to be added to spark-env.sh exactly as it is.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
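If you are curious what this expands to, you can run the command on its own; it prints the colon-separated list of Hadoop configuration and jar directories that Spark will add to its classpath:
hadoop classpath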
There is an issue in mapred-site.xml which leads to a startup failure with a "bad substitution" message. Remove the offending path from the mapred config file.
vi /etc/hadoop/conf/mapred-site.xml
Search for mapreduce.application.classpath in the editor
Remove the text
:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar
from the value
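If you prefer to do this from the shell rather than vi, a one-liner along these lines should work (a convenience sketch, assuming the entry appears in the value exactly as shown above; the -i.bak option keeps a backup copy of the file):
sed -i.bak 's#:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar##' /etc/hadoop/conf/mapred-site.xml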
Create the log directory for spark
mkdir /tmp/spark-events
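Since the hive user (not root) will be writing event logs here, you may also need to open up permissions on the directory; this is my assumption, adjust it to your own security requirements:
chmod 1777 /tmp/spark-events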
Change the YARN scheduler class to the Fair Scheduler. You can do this through Ambari or by editing yarn-site.xml
vi /etc/hadoop/conf/yarn-site.xml
Search for name "yarn.resourcemanager.scheduler.class" and change the value from
"org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler"
to
"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
Alternatively, you can change the scheduler value through Ambari. Screenshot below.
Restart the YARN, Hive, and MapReduce services. Oozie may need a restart as well.
Test if the setup was successful
Start the Hive CLI as the hive user
sudo -u hive hive
At the Hive prompt, run the following commands
set spark.home=/usr/hdp/spark-1.4.1;
set hive.execution.engine=spark;
set spark.master=yarn-client;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=/tmp/spark-events;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
Assuming the default database has the tables sample_07 and sample_08, issue the following SQL at the Hive prompt. It's a sample cross-join query to make sure Hive is using Spark.
create table sparkytable as select a.* from sample_07 a inner join sample_08 b;
While the query is running, you can follow the Hive CLI log from another terminal using
tail -f /tmp/hive/hive.log
You will notice in the logs that Hive brings up a Spark client to run the queries.
Note: The first query may appear to run very slowly, because Spark has to request containers from YARN. Subsequent queries should run faster.
Make the changes permanent
Fire up a browser and bring up Ambari. Log in using admin/admin. Go to Configs -> Advanced for Hive and add the following parameters to hive-site.xml
spark.master=yarn-client
spark.eventLog.enabled=true
spark.eventLog.dir=/tmp/spark-events
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
Screenshot from my Ambari
Or you can add them directly to hive-site.xml
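If you do edit hive-site.xml by hand, each parameter becomes a standard property entry; for example (the same pattern applies to the other parameters):
<property>
  <name>spark.master</name>
  <value>yarn-client</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>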
Notice that the above parameters do not include the hive.execution.engine parameter. Restart Hive.
Now you can choose at run time between MR, Tez, or Spark as the execution engine by using the following statements in the Hive CLI or in your favorite SQL editor connected to Hive.
set hive.execution.engine=spark;
set hive.execution.engine=tez;
set hive.execution.engine=mr;
You can verify that these engines are actually used by starting a query and then checking the YARN Resource Manager UI.
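From the shell, you can also list the running YARN applications (a convenience check, not part of the original steps); the Application-Type column should show MAPREDUCE, TEZ, or SPARK depending on the engine you picked:
yarn application -list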