In the RDBMS world, we can generate unique numbers using sequences and other SQL approaches. In Hive, a generated number has to be unique across a distributed environment.
Generating a unique number by having the nodes synchronize with each other can be really slow. The alternative is to generate numbers on individual nodes in such a way that each generated number is guaranteed to be unique across nodes.
One approach that I have used is a Hive UDF that generates unique numbers on individual nodes by using a combination of:
- Node IP Address
- Job Start Timestamp
- Sequence number generated on the node
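The three components above can be combined as in the following minimal sketch, which runs without Hive. The class and method names (IdLayoutSketch, ipSuffix, buildId), the fixed timestamp, and the IP addresses are made up for illustration. Note that a 13-digit timestamp plus 6 IP digits plus a sequence number is longer than the 19 digits a 64-bit long can hold, so the sketch keeps the ID as a string of digits.

```java
import java.math.BigInteger;

class IdLayoutSketch {
    // Hypothetical helper: last two IPv4 octets in reverse order,
    // each zero-padded to 3 digits (6 digits in total).
    public static String ipSuffix(String ipv4) {
        String[] parts = ipv4.split("\\.");
        return String.format("%03d%03d",
                Integer.parseInt(parts[3]), Integer.parseInt(parts[2]));
    }

    // Hypothetical helper: 13-digit timestamp + 6-digit IP suffix + sequence.
    public static String buildId(long startMillis, String ipv4, long sequence) {
        return String.format("%013d", startMillis) + ipSuffix(ipv4) + sequence;
    }

    public static void main(String[] args) {
        long start = 1700000000000L; // fixed job start time for a repeatable demo
        String a = buildId(start, "10.0.12.7", 1);
        String b = buildId(start, "10.0.12.8", 1);
        System.out.println(a); // same timestamp and sequence, different node
        System.out.println(b);
        // Check whether the ID still fits into a signed 64-bit long
        boolean tooBig = new BigInteger(a)
                .compareTo(BigInteger.valueOf(Long.MAX_VALUE)) > 0;
        System.out.println("exceeds long range: " + tooBig);
    }
}
```

Because the timestamp and IP parts have a fixed width, two nodes with different IP suffixes can never produce the same string, even when their timestamps and sequence numbers match.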
Following is sample code. It has not been tested and may contain typos; the intent is to illustrate the approach.
UniqueNumberGenerator.java
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.net.InetAddress;
import java.net.UnknownHostException;

@UDFType(deterministic = false, stateful = true)
public class UniqueNumberGenerator extends GenericUDF {

    private transient long sequence;
    private transient String prefix;

    // Called once per UDF instance
    @Override
    public ObjectInspector initialize(ObjectInspector[] objectInspectors) throws UDFArgumentException {
        // Per-instance field initialization
        prefix = getPrefix();
        sequence = 0L;
        // The ID is returned as a string of digits: 13 timestamp digits + 6 IP digits
        // + the sequence is more than the 19 digits a 64-bit long can hold,
        // so Long.parseLong(prefix + sequence) would throw NumberFormatException.
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    // Called on each call to the function
    @Override
    public Object evaluate(DeferredObject[] deferredObjects) throws HiveException {
        sequence = sequence + 1;
        return prefix + sequence;
    }

    @Override
    public String getDisplayString(String[] children) {
        return getStandardDisplayString("unique_number", children);
    }

    // Following are support functions

    // Create the number prefix
    public String getPrefix() {
        return getStartingTimestamp() + getLocalIPFormatted();
    }

    // Returns the last 2 parts of the IP address, each left-padded to 3 digits, 6 digits in total.
    // The last 2 parts are chosen to introduce maximum variability.
    public String getLocalIPFormatted() {
        String ip;
        try {
            ip = InetAddress.getLocalHost().getHostAddress();
        } catch (UnknownHostException e) {
            ip = "111.111.111.111";
        }
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            // Not a dotted IPv4 address (for example IPv6): use the same fallback
            parts = "111.111.111.111".split("\\.");
        }
        // Create the number by reverse concatenation of the IP parts
        return String.format("%03d", Integer.parseInt(parts[3]))
                + String.format("%03d", Integer.parseInt(parts[2]));
    }

    // Get the system time in millis, left-padded with zeros to a fixed 13 digits
    public String getStartingTimestamp() {
        return String.format("%013d", System.currentTimeMillis());
    }
}
The key things to note are the parameters of UDFType (deterministic = false and stateful = true).
Deterministic: A deterministic function may be evaluated once and its result reused across rows. Setting this to false instructs Hive to call the function for each row.
Stateful: Instructs Hive to maintain the state of the UDF's variables between calls for each row. This is how the sequence number is incremented.
Why are the variables transient? We don't want any variable value to persist if Hive serializes the UDF object and ships it across nodes.
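The effect of the transient keyword can be seen in a small sketch that runs without Hive. It uses standard Java serialization as a stand-in; the serializer Hive actually uses may differ in detail, and the class names (TransientSketch, Counter) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientSketch {
    // Hypothetical stand-in for a UDF object that holds per-node state
    static class Counter implements Serializable {
        private static final long serialVersionUID = 1L;
        String name = "unique_number";  // regular field: travels with the object
        transient long sequence = 0L;   // transient field: skipped on serialization
    }

    // Serialize to bytes and read back, as if shipping the object to another node
    static Counter roundTrip(Counter in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream input = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Counter) input.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Counter original = new Counter();
        original.sequence = 42L;
        Counter copy = roundTrip(original);
        // The name survives the round trip; the sequence is reset to 0
        System.out.println(copy.name + " " + copy.sequence);
    }
}
```

The copy starts with a sequence of 0 no matter what the original held, which is why the UDF sets its fields in initialize rather than relying on values carried over from another node.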