Thursday, October 15, 2015

"Spark SQL Client" : Do Spark SQL Using a Query Editor

With a couple of mild hacks, Aginity Workbench for Hadoop can be used as a Spark SQL client to run queries.


How do I get Spark SQL?
If you are looking to deploy a herd of elephants to churn through the forest, then you already know the answer. If you are looking for an infant elephant to snuggle and get a feel for, the Hortonworks sandbox 2.3 will do. Download a 2.3 sandbox from Hortonworks. I have another post about setting up a host-only network with VirtualBox and the Hortonworks sandbox: Link

Please note: Without a host-only network, external clients cannot connect to the sandbox. Make sure the sandbox is configured correctly. Link


Assuming: We have a working environment now.


Some Concepts (Feel free to correct me through comments if I am wrong)



  • For Hive or Spark clients to talk to and issue commands against those services, a thrift service is required. E.g., HiveServer/HiveServer2 are the thrift services that expose Hive to external clients
  • For Spark SQL to be accessible to external clients, the Spark thrift service has to be running
  • For Spark SQL to work, Spark should know about the Hive metastore, so we have to point Spark at the Hive metastore configuration
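A quick way to see these concepts in action is to check which thrift ports are already listening on the sandbox. Port 10000 is HiveServer2's default; whether anything is listening there depends on which services your sandbox has started, so treat this as a sanity check rather than a guarantee:

```shell
# Run inside the sandbox: list listening TCP ports and look for
# HiveServer2's default thrift port (10000)
netstat -tlnp | grep ':10000'
```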

Confirm Hive is working
Start a Hive shell and fire a query.




  • On the Hortonworks sandbox use ALT + F5 and then log in using root/hadoop
  • Issue the command "hive" to fire up a shell.
  • "show tables;" should show the tables in the default schema.
  • "select count(*) from sample_07;" against one of the existing tables should confirm Hive is working
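If you prefer one-liners over the interactive shell, the same checks can be run non-interactively with hive -e (sample_07 is one of the sample tables that ships with the sandbox):

```shell
# Run the same Hive sanity checks from the sandbox shell without
# entering the interactive hive prompt
hive -e "show tables;"
hive -e "select count(*) from sample_07;"
```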
Confirm Spark is working

Before bringing up a Spark shell, it is advisable to change the logging level from INFO to WARN, as otherwise Spark spits out a lot of text to the console.


Open the file /usr/hdp/current/spark-client/conf/log4j.properties and on the first line change the log level from INFO to WARN.
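If you'd rather not open an editor, a sed one-liner does the same thing. This assumes the first line follows Spark's usual log4j.properties template (log4j.rootCategory=INFO, console); check your file first if it has been customized:

```shell
# Flip the root log level from INFO to WARN in place
sed -i 's/^log4j.rootCategory=INFO/log4j.rootCategory=WARN/' \
    /usr/hdp/current/spark-client/conf/log4j.properties
```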


After the change my file looks like below





Fire up a spark-sql shell and run some commands to confirm that Spark is working fine. The following screenshot from my environment confirms that it's working.
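Like hive, the spark-sql shell also accepts -e, so a non-interactive smoke test looks like this (again using the bundled sample_07 table):

```shell
# Confirm Spark SQL works against the Hive metastore without
# entering the interactive spark-sql prompt
spark-sql -e "show tables;"
spark-sql -e "select count(*) from sample_07;"
```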


Note: If you have not changed your log level as mentioned above, the screen below will look very different





Start Spark thrift service

To keep the thrift servers for Hive and Spark from colliding on the same port, we need to start the Spark thrift server on a different port.


Run the following command.
start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=11000
Notice the output log file mentioned in the screenshot above.
Run the following command to tail the log and confirm that spark-sql is actually working.
tail -f /var/log/spark/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-sandbox.hortonworks.com.out
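Before reaching for a GUI client, beeline (the JDBC client shipped with Hive) can confirm that the thrift server accepts connections. The URL below assumes the port 11000 that we passed when starting the server, and the sandbox's default root credentials:

```shell
# Connect to the Spark thrift server over JDBC and run a test query
beeline -u "jdbc:hive2://sandbox.hortonworks.com:11000" -n root -e "show tables;"
```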


So now the Spark SQL thrift service is running. Time to connect the client.

Use Aginity Workbench to connect to Spark

Get Aginity Workbench for Hadoop if you do not have it already. It's free.
Notice below the port number 11000. By default it's 10000. Since we changed it to 11000 while starting the thrift service, we connect on 11000.














After connecting, I could run a query and see the results. The screenshot below confirms that

How do we know that spark-sql is actually running and giving us the results?
Look at the log tail that we ran before... here is a screenshot.

So enjoy spark SQL on a nice client interface.

Tuesday, October 13, 2015

Connecting To Hortonworks Sandbox from External Apps

Hortonworks sandbox is a great platform to play with big data on a small box.

The first thing I tried after downloading the sandbox was to connect the Aginity Hadoop Workbench to the Hive metastore, and it wouldn't work.


The sandbox ships with network configurations that make it easy to access the toolsets. The host name is sandbox.hortonworks.com. Although this approach makes the resource locations easy to remember, it makes it difficult to connect to the sandbox with external tools. Even a Java application written to connect to the sandbox may not work without changing the network configuration.


The key is to have a host only network for the sandbox so that it can be treated as a server even though it may be running on your local system.


I used VirtualBox to run the sandbox. Following are the steps to get it working


1. Download VirtualBox and install it

2. Go to preferences and add a host-only network interface. (Click the + sign on the right)



Click the screwdriver icon to the right and you should see something like the following




Looking at the configuration above, the IP address assigned to your box would be 192.168.56.101

We would confirm this later
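If you prefer the command line to the preferences dialog, the same host-only interface can be created with VBoxManage, VirtualBox's CLI. The interface name vboxnet0 is an assumption; it is what VirtualBox typically assigns to the first host-only interface, so verify the name in the list output before configuring it:

```shell
# Create a host-only interface and put it on the 192.168.56.x range
VBoxManage hostonlyif create
VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.56.1
# Verify the interface and its address
VBoxManage list hostonlyifs
```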


3. Download the Hortonworks sandbox and import the appliance into VirtualBox

4. Open the properties of the sandbox and make it use the host-only network interface that was created. Following is a screenshot from my sandbox configuration


5. Start the sandbox. Once the box is ready, log in with root/hadoop
6. To confirm the IP address, issue the command ifconfig
Look at the inet address below; it's 192.168.56.101


So the sandbox is up and running with IP address 192.168.56.101, ready to be connected to by any client/application/program
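From the host machine, a quick reachability check confirms this without any Hadoop client at all. The nc (netcat) probe of port 10000 assumes HiveServer2 is running on its default port; if it isn't up yet, only the ping will succeed:

```shell
# Confirm the sandbox answers on the host-only network
ping -c 3 192.168.56.101
# Probe HiveServer2's default thrift port without sending any data
nc -z -v 192.168.56.101 10000
```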


If the sandbox is already set up on the system, you can still go to VirtualBox preferences, add the host-only network, then go to the sandbox preferences, change the network to the newly created host-only network, and restart the sandbox to get the same results.