If Hive is used as the interface for accessing TB- and PB-scale data, it is quite important to optimize the queries to get them to run faster. Query optimization is directly related to the infrastructure, the size of the data, the organization of the data, the storage formats, and the data readers/processors. This opens up thousands of possibilities in terms of how queries can be optimized.
Following are some suggestions that can improve query performance without worrying too much about the data organization.
- Set the Hive execution engine to Spark
- Set the Hive execution engine to Tez
- Set the table storage format to ORC
- Use cost-based query optimization
- Use vectorization for queries
- Partition the data when possible
- Force situational map joins. (More on this link)
Apart from these options, there are dozens of other scenario-driven ways to improve performance, such as map joins, bucketing and bucket-sorting the data, changing the VM and YARN memory configurations, etc.
Following are the details for some of the above options.
EXECUTION ENGINE: SPARK
Use Spark as the query engine for Hive: currently, the Hadoop releases do not support a direct approach to setting the query engine to Spark (some level of effort is required to set it up). Refer to this article if you are interested. Remember that this integration is fairly new and some features may not work as expected, so do a POC based on your needs before using it in production.
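Once a Hive-on-Spark setup is in place, switching engines looks just like it does for Tez; a minimal sketch, assuming the Spark integration from the linked article has already been configured:
set hive.execution.engine=spark;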
EXECUTION ENGINE: TEZ
If setting up Spark is a difficult route, use Tez as the query engine for Hive. Setting the query engine to Tez is as easy as issuing the command
set hive.execution.engine=tez;
Or you can make the change permanent by editing your hive-site.xml, usually located in [/usr/hdp/current/hive-server2/conf/], or by setting the "hive.execution.engine" property to "tez" through Ambari.
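For reference, the corresponding hive-site.xml entry would look like the following; a minimal sketch (the property name is standard, though the file location varies by distribution):
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>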
STORAGE FORMAT: ORC
ORC (Optimized Row Columnar) is an evolution of the RC (Row Columnar) format. Data is stored in stripes and is efficient to read. You can learn more about ORC here.
A summary of ORC: a collection of rows is stored in a single block of data, and inside the block the columns are stored together. That is, depending on the row size, a block may hold 20,000 rows, and within the block, column 1 data for all 20,000 rows is stored together, column 2 data for all 20,000 rows is stored together, and so on. The advantage of this format is that, based on the columns selected in a query, only the required columns in a block are read rather than the entire row data. In my experience, ORC storage does not pose any challenges compared to text-file storage in terms of SQL features; ORC just improves the performance.
Setting the storage to ORC can be done in 3 ways.
>>While creating the table
create table myorctable (
name string,
age int
) stored as orc tblproperties ("orc.compress"="NONE");
>>By altering the table
ALTER TABLE myorctable SET FILEFORMAT ORC;
>>By setting the default file format to ORC and then creating the tables.
SET hive.default.fileformat=Orc;
Setting the default file format can be done from the client application or applied in the Hive config file. When creating a table stored as ORC, there are table properties that get applied, such as the compression method. If you are planning to set the default storage to ORC, you may want to take a look at this link for the other default table properties and the config names used to set their values in the Hive config file.
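As an illustration of such a table property, the compression codec can be chosen per table; a minimal sketch with an illustrative table name (common values for "orc.compress" include ZLIB, SNAPPY, and NONE):
create table myorctable_zlib (
name string,
age int
) stored as orc tblproperties ("orc.compress"="ZLIB");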
AUTOMATIC QUERY PLAN OPTIMIZATION
This feature is borrowed from RDBMS or MPP (massively parallel processing) systems. It is predominantly helpful when queries have joins or multi-step operations. It is also commonly referred to as COST-BASED QUERY OPTIMIZATION (CBQO).
Using this feature has 2 parts:
- Letting Hive know that this feature should be used.
- Making the table statistics available to Hive.
>> Letting Hive know that the feature should be used
Set the following parameters through the client or in the Hive configuration through Ambari.
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
>> Making the table statistics available to Hive
For the tables involved in queries where you wish to use the CBQO feature, run the following SQL:
analyze table myorctable compute statistics for columns name, age;
--OR if using version 0.14 and beyond
analyze table myorctable compute statistics for columns;
Do this for all the tables involved in the query.
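In addition to column statistics, basic table-level statistics (row counts, file counts, sizes) also help the optimizer; these can be gathered with the plain form of the command:
analyze table myorctable compute statistics;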
USE VECTORIZATION
This option helps only in certain scenarios. Vectorization basically collects rows and performs operations on them in batches of 1024 rows instead of processing 1 row at a time. Note that in these Hive versions, vectorized execution requires the data to be stored in ORC format.
Vectorization can be enabled through the following settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
You can either give the above set commands through your client tool or enable them permanently in hive settings through Ambari.
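To check whether a query actually runs vectorized, inspect the query plan; a minimal sketch (with Tez as the engine, vectorized operators typically show up in the plan output as "Execution mode: vectorized"):
explain select name, count(*) from myorctable group by name;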
USE DATA PARTITIONS
Hive partitions basically store the data in different folders based on the defined partition rules, and only the required data is accessed when query filters are used. The most commonly used partition column is a date.
Unlike in a relational database, the partition columns, being common to a dataset, are not stored as part of the data itself, which also helps reduce the data volume.
2 important tips:
- Do not over-partition the data just because it reduces the data volume.
- Keep your most frequently used queries in mind while partitioning the data.
Following is a sample DDL showing how partitions are used.
CREATE TABLE MY_WEBLOG (
SESSION_ID BIGINT,
URL STRING,
LOG_TIMESTAMP STRING
)
PARTITIONED BY (LOG_DATE STRING, LOG_HOUR STRING) ;
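On disk, each partition becomes a folder under the table's directory; a minimal sketch of the resulting layout, assuming HDP's default warehouse location:
/apps/hive/warehouse/my_weblog/log_date=2016-01-01/log_hour=23/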
To insert into a partitioned table, the insert statement goes like this:
INSERT INTO MY_WEBLOG
PARTITION (LOG_DATE = '2016-01-01', LOG_HOUR = '23')
SELECT SESSION_ID, URL, LOG_TIMESTAMP
FROM LOG_SOURCE
WHERE LOG_DATE = '2016-01-01' AND LOG_HOUR = '23';
It is quirky to specify partition values every time there is an insert. Hive provides dynamic partition options so that inserts automatically partition the data.
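Dynamic partition inserts usually need to be enabled first; the standard settings are (nonstrict mode allows all partition columns to be dynamic):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;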
INSERT INTO MY_WEBLOG
PARTITION (LOG_DATE, LOG_HOUR)
SELECT SESSION_ID, URL, LOG_TIMESTAMP , LOG_DATE, LOG_HOUR
FROM LOG_SOURCE;
Note: the partition columns are required to be at the end of the selected columns. Remember, partition columns are not physical columns; they are grouping folders.