Big data notepad: Hive External Tables with Multiple Character Delimiter

Hive external tables allow creating external reference tables without moving the data to a new location. I created a sample dataset with delimiter "|^|". However I soon realized that hive external tables only allow a single character as delimiter. i.e the following code does NOT work.

CREATE EXTERNAL TABLE TwitterFlatExternal

(

 Column_1 STRING,

 Column_2 STRING,

 Column_3 STRING,

 Column_4 STRING,

 Column_5 STRING,

 Column_6 STRING

)

ROW FORMAT DELIMITED

 FIELDS TERMINATED BY '|^|'

 LINES TERMINATED BY '\n'

LOCATION '/user/dsarangi/twitterdataflat';

However it is possible to achieve the same results using RegexSerde. SerDe stands for serializer and deserializer. Read more about it on Hive Serde

The regular expression in regex SerDe basically would have to grab each column on a regex grouping. The following one is an example for 6 column table with delimiter = "|^|"

CREATE EXTERNAL TABLE TwitterFlatExternalReg

(

 Column_1 STRING,

 Column_2 STRING,

 Column_3 STRING,

 Column_4 STRING,

 Column_5 STRING,

 Column_6 STRING

)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

 WITH SERDEPROPERTIES (

 "input.regex" = "^(.*)\\|\\^\\|(.*)\\|\\^\\|(.*)\\|\\^\\|(.*)\\|\\^\\|(.*)\\|\\^\\|(.*)$",

 "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s"

)

STORED AS TEXTFILE

LOCATION '/user/dsarangi/twitterdataflat';

Using regex is cool but has its coolest pain areas also. Its a non-definitive approach and the regex has to be really solid to make sure every row renders the required number of columns. You have to create the table with the regex and then do a select on the table to know that the regex is actually working to render the columns. I use regex101 site to test my regex

Most important: When using the regex in hive use \\ (double back slash) for escaping. Single does not work.

So when testing the regex on a online site, convert the escape character from single backslash to double backslash before using in hive query.

5 comments:

Sanjiv SinghNovember 12, 2016 at 11:41 AM
Try this :https://github.com/sanjivsingh/FixedLengthAndDelimitedSerde
– Supported both fixed length and delimited data serialization/deserialization
– It support both case in single definition.
UnknownDecember 8, 2016 at 2:17 AM
I want delimter as space and comma
UnknownJuly 5, 2017 at 11:45 PM
Thanks for this post!
UnknownDecember 3, 2017 at 10:42 PM
Thanks a lot very much for the high quality and results-oriented help. I won’t think twice to endorse your blog post to anybody who wants and needs support about this area.
hadoop training in bangalore
UnknownAugust 15, 2018 at 12:42 AM
Very informative
http://www.metaforumtechnologies.com/qtp-testing-training-in-chennai

Friday, December 25, 2015

Hive External Tables with Multiple Character Delimiter

5 comments: