With a few mild hacks, Aginity Workbench for Hadoop can be used as a client to run queries against Spark SQL.
How do I get Spark SQL?
If you are looking to deploy a herd of elephants to churn the forest, then you already know how. If you are looking for an infant elephant to snuggle and get a feel for, the Hortonworks Sandbox 2.3 will do it. Download a 2.3 sandbox from Hortonworks. I have another post about getting a host-only network set up with VirtualBox and the Hortonworks sandbox: Link
Please note: without a host-only network, external clients cannot connect to the sandbox. Make sure the sandbox is configured correctly. Link
Assuming: we now have a working environment.
Some Concepts (Feel free to correct me through comments if I am wrong)
- For Hive or Spark clients to talk to and issue commands against those services, a Thrift service is required. E.g. HiveServer / HiveServer2 are the Thrift services that external Hive clients connect to.
- For Spark SQL to be accessible to external clients, the Spark Thrift service has to be running.
- For Spark SQL to work, Spark should know about the Hive metastore, so we have to point Spark at the Hive metastore configuration (see the quick check after this list).
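On the HDP sandbox this wiring is typically already done for you. If you want to verify it, here is a quick check, assuming the standard HDP layout where Spark's conf directory carries its own hive-site.xml:

grep -A1 hive.metastore.uris /usr/hdp/current/spark-client/conf/hive-site.xml
# expect a thrift://...:9083 style URI pointing at the Hive metastore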
Confirm Hive is working
Start a Hive shell and fire a query:
- On the Hortonworks sandbox, press ALT + F5 and then log in using root/hadoop
- Issue the command "hive" to fire up a shell
- "show tables;" should list the tables in the default schema
- "select count(*) from sample_07;" against one of the existing tables should confirm Hive is working (a rough transcript follows below)
Confirm Spark is working
Before bringing up a Spark shell, it is advisable to change the logging level from INFO to WARN, as otherwise Spark spits out a lot of text to the console.
Open the file /usr/hdp/current/spark-client/conf/log4j.properties and on the first line change the log level from INFO to WARN.
After the change, my file looks like below.
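In text form, the stock Spark log4j.properties opens with the root category setting, so after the edit the first line should read something like this:

# /usr/hdp/current/spark-client/conf/log4j.properties, first line after the edit
log4j.rootCategory=WARN, console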
Fire up a spark-sql shell and run a few commands to confirm that Spark is working fine. The following screenshot from my environment confirms that it is working.
Note: if you have not changed your log level as mentioned above, the screen below will look very different.
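For a quick sanity check along the lines of what the screenshot shows, count the same table we counted in Hive, this time through Spark:

$ spark-sql
spark-sql> show tables;
spark-sql> select count(*) from sample_07;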
Start Spark thrift service
To keep the Thrift servers for Hive and Spark from colliding on the same port, we need to start the Spark Thrift service on a different port (HiveServer2 already holds the default, 10000).
Run the following command (on the sandbox, start-thriftserver.sh should be under /usr/hdp/current/spark-client/sbin):
start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=11000
Notice the output log file mentioned in the screenshot above.
Run the following command to tail the log and confirm that Spark SQL is actually doing the work:
tail -f /var/log/spark/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-sandbox.hortonworks.com.out
So now the Spark SQL Thrift service is running. Time to connect the client.
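Before pointing a GUI at it, you can optionally sanity-check the Thrift service from the command line with beeline, which ships with Hive (hostname and port as set up above; adjust to your environment):

beeline -u jdbc:hive2://sandbox.hortonworks.com:11000
# once connected, "show tables;" should return the same tables we saw in Hive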
Use Aginity Workbench to connect to Spark
Get Aginity Workbench for Hadoop if you do not have it already. It's free.
Notice the port number below. By default it is 10000, but since we started the Thrift service on 11000, we connect on 11000.
After connecting, I could run a query and see the results. The screenshot below confirms that.
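Any Hive-flavoured SQL should work here. For example, assuming the stock sample_07 schema (code, description, total_emp, salary):

select code, description, total_emp, salary
from sample_07
order by salary desc
limit 10;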
How do we know that Spark SQL is actually the engine running the query and giving us the results?
Look at the log tail that we started earlier... here is a screenshot.
So enjoy Spark SQL from a nice client interface.