If Hive is used as the interface for accessing TB- and PB-scale data, it is quite important to optimize the queries to get them to run faster. Query optimization is directly related to the infrastructure, the size of the data, the organization of the data, the storage formats, and the data readers/processors. This opens up thousands of possibilities in terms of how queries can be optimized.
Following are some suggestions that can improve query performance without worrying too much about the data organization.
- Set the Hive execution engine to Spark
- Set the Hive execution engine to Tez
- Set the table storage format to ORC
- Use cost-based query optimization
- Use vectorization for queries
- Partition the data when possible
- Force situational map joins. (More on this link)
Apart from these options, there are dozens of other scenario-driven ways to improve performance, such as map joins, bucketing and bucket-sorting the data, changing the VM and YARN memory configurations, etc.
Following are the details for some of the above options.
EXECUTION ENGINE: SPARK
Use Spark as the query engine for Hive: currently, the Hadoop releases do not support a direct approach to setting the query engine to Spark (some level of effort is required to set it up). Refer to this article if you are interested. Remember that this integration is fairly new and some features may not work as expected, so do a POC based on your needs before using it in production.
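Once a Hive-on-Spark setup is in place, switching engines looks just like it does for Tez; a minimal sketch, assuming the Spark integration from the linked article has already been configured:
set hive.execution.engine=spark;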
EXECUTION ENGINE: TEZ
If setting up Spark is a difficult route, use Tez as the query engine for Hive. Setting the query engine to Tez is as easy as issuing the command
set hive.execution.engine=tez;
Or you can make the change permanent by editing your hive-site.xml, usually located in [/usr/hdp/current/hive-server2/conf/], or by setting the "hive.execution.engine" property to "tez" through Ambari.
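For reference, the corresponding hive-site.xml entry would look like the following; a minimal sketch (the property name is standard, though the file location varies by distribution):
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>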
STORAGE FORMAT: ORC
ORC (Optimized Row Columnar) is an evolution of the RC (Row Columnar) format. Data is stored in stripes and is efficient to read. You can learn more about ORC here.
A summary of ORC: a collection of rows is stored in a single block of data, and inside the block the columns are stored together. That is, depending on the row size, a block may hold 20,000 rows, and within the block, column 1 data for all 20,000 rows is stored together, column 2 data for all 20,000 rows is stored together, and so on. The advantage of this format is that, based on the columns selected in a query, only the required columns in a block are read rather than the entire row data. In my experience, ORC storage does not pose any challenges compared to text-file storage in terms of SQL features; ORC just improves the performance.
Setting the storage to ORC can be done in 3 ways.
>>While creating the table
create table myorctable (
name string,
age int
) stored as orc tblproperties ("orc.compress"="NONE");
>>By altering the table
ALTER TABLE myorctable SET FILEFORMAT ORC;
>>By setting the default file format to ORC and then creating the tables.
SET hive.default.fileformat=Orc;
Setting the default file format can be done from the client application or applied in the Hive config file. When creating a table stored as ORC, there are table properties that get applied, such as the compression method. If you are planning to set the default storage to ORC, you may want to take a look at this link for the other default table properties and the config names used to set their values in the Hive config file.
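As an illustration of such a table property, the compression codec can be chosen per table; a minimal sketch with an illustrative table name (common values for "orc.compress" include ZLIB, SNAPPY, and NONE):
create table myorctable_zlib (
name string,
age int
) stored as orc tblproperties ("orc.compress"="ZLIB");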
AUTOMATIC QUERY PLAN OPTIMIZATION
This feature is borrowed from RDBMS or MPP (massively parallel processing) systems. It is predominantly helpful when queries have joins or multi-step operations. It is also commonly referred to as COST-BASED QUERY OPTIMIZATION (CBQO).
Using this feature has 2 parts:
- Letting Hive know that this feature should be used.
- Making the table statistics available to Hive.
>> Letting Hive know that the feature should be used
Set the following parameters through the client or in the Hive configuration through Ambari.
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
>> Making the table statistics available to Hive
For the tables involved in queries where you wish to use the CBQO feature, run the following SQL:
analyze table myorctable compute statistics for columns name, age;
--OR if using version 0.14 and beyond
analyze table myorctable compute statistics for columns;
Do this for all the tables involved in the query.
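In addition to column statistics, basic table-level statistics (row counts, file counts, sizes) also help the optimizer; these can be gathered with the plain form of the command:
analyze table myorctable compute statistics;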
USE VECTORIZATION
This option helps only in certain scenarios. Vectorization basically collects rows and performs operations on them in batches of 1024 rows instead of processing 1 row at a time. Note that in these Hive versions, vectorized execution requires the data to be stored in ORC format.
Vectorization can be enabled through the following settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
You can either give the above set commands through your client tool or enable them permanently in hive settings through Ambari.
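To check whether a query actually runs vectorized, inspect the query plan; a minimal sketch (with Tez as the engine, vectorized operators typically show up in the plan output as "Execution mode: vectorized"):
explain select name, count(*) from myorctable group by name;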
USE DATA PARTITIONS
Hive partitions basically store the data in different folders based on the defined partition rules, and only the required data is accessed when query filters are used. The most commonly used partition column is a date.
Unlike in a relational database, the partition columns, being common to a dataset, are not stored as part of the data itself, which also helps reduce the data volume.
2 important tips:
- Do not over-partition the data just because it reduces the data volume.
- Keep your most frequently used queries in mind while partitioning the data.
Following is a sample DDL showing how partitions are used.
CREATE TABLE MY_WEBLOG (
SESSION_ID BIGINT,
URL STRING,
LOG_TIMESTAMP STRING
)
PARTITIONED BY (LOG_DATE STRING, LOG_HOUR STRING) ;
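On disk, each partition becomes a folder under the table's directory; a minimal sketch of the resulting layout, assuming HDP's default warehouse location:
/apps/hive/warehouse/my_weblog/log_date=2016-01-01/log_hour=23/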
To insert into a partitioned table, the insert statement goes like this:
INSERT INTO MY_WEBLOG
PARTITION (LOG_DATE = '2016-01-01', LOG_HOUR = '23')
SELECT SESSION_ID, URL, LOG_TIMESTAMP
FROM LOG_SOURCE
WHERE LOG_DATE = '2016-01-01' AND LOG_HOUR = '23';
It is quirky to specify partition values every time there is an insert. Hive provides dynamic partition options so that inserts automatically partition the data.
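Dynamic partition inserts usually need to be enabled first; the standard settings are (nonstrict mode allows all partition columns to be dynamic):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;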
INSERT INTO MY_WEBLOG
PARTITION (LOG_DATE, LOG_HOUR)
SELECT SESSION_ID, URL, LOG_TIMESTAMP , LOG_DATE, LOG_HOUR
FROM LOG_SOURCE;
Note: the partition columns are required to be at the end of the selected columns. Remember, partition columns are not physical columns; they are grouping folders.