Hive external tables allow creating external reference tables without moving the data to a new location. I created a sample dataset with delimiter "|^|". However I soon realized that hive external tables only allow a single character as delimiter. i.e the following code does NOT work.
However it is possible to achieve the same results using RegexSerde. SerDe stands for serializer and deserializer. Read more about it on Hive Serde
The regular expression in regex SerDe basically would have to grab each column on a regex grouping. The following one is an example for 6 column table with delimiter = "|^|"
Using regex is cool but has its coolest pain areas also. Its a non-definitive approach and the regex has to be really solid to make sure every row renders the required number of columns. You have to create the table with the regex and then do a select on the table to know that the regex is actually working to render the columns. I use regex101 site to test my regex
Most important: When using the regex in hive use \\ (double back slash) for escaping. Single does not work.
So when testing the regex on a online site, convert the escape character from single backslash to double backslash before using in hive query.
CREATE EXTERNAL TABLE TwitterFlatExternal
(
Column_1 STRING,
Column_2 STRING,
Column_3 STRING,
Column_4 STRING,
Column_5 STRING,
Column_6 STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|^|'
LINES TERMINATED BY '\n'
LOCATION '/user/dsarangi/twitterdataflat';
However it is possible to achieve the same results using RegexSerde. SerDe stands for serializer and deserializer. Read more about it on Hive Serde
The regular expression in regex SerDe basically would have to grab each column on a regex grouping. The following one is an example for 6 column table with delimiter = "|^|"
CREATE EXTERNAL TABLE TwitterFlatExternalReg
(
Column_1 STRING,
Column_2 STRING,
Column_3 STRING,
Column_4 STRING,
Column_5 STRING,
Column_6 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(.*)\\|\\^\\|(.*)\\|\\^\\|(.*)\\|\\^\\|(.*)\\|\\^\\|(.*)\\|\\^\\|(.*)$",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s"
)
STORED AS TEXTFILE
LOCATION '/user/dsarangi/twitterdataflat';
Using regex is cool but has its coolest pain areas also. Its a non-definitive approach and the regex has to be really solid to make sure every row renders the required number of columns. You have to create the table with the regex and then do a select on the table to know that the regex is actually working to render the columns. I use regex101 site to test my regex
Most important: When using the regex in hive use \\ (double back slash) for escaping. Single does not work.
So when testing the regex on a online site, convert the escape character from single backslash to double backslash before using in hive query.
Try this :https://github.com/sanjivsingh/FixedLengthAndDelimitedSerde
ReplyDelete– Supported both fixed length and delimited data serialization/deserialization
– It support both case in single definition.
I want delimter as space and comma
ReplyDeleteThanks for this post!
ReplyDeleteThanks a lot very much for the high quality and results-oriented help. I won’t think twice to endorse your blog post to anybody who wants and needs support about this area.
ReplyDeletehadoop training in bangalore
Very informative
ReplyDeletehttp://www.metaforumtechnologies.com/qtp-testing-training-in-chennai