Monday, May 9, 2016

Hello Hive UDF: Unique Identifier Generator using Hive UDF

In the RDBMS world, we can generate unique numbers using sequences and other SQL approaches. In Hive, a generated number has to be unique across a distributed environment.

Generating unique numbers by having nodes synchronize with each other can be really slow. The alternative is to generate numbers on individual nodes in such a way that each generated number is guaranteed to be unique across nodes.

One approach that I have used is a Hive UDF that generates unique numbers on individual nodes by combining:

  1. Node IP Address
  2. Job Start Timestamp
  3. Sequence number generated on the node

The following is sample code. It has not been tested and may contain typos; the intent is to show the approach.

UniqueNumberGenerator.java

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.net.InetAddress;
import java.net.UnknownHostException;



@UDFType(deterministic = false, stateful = true)
public class UniqueNumberGenerator extends GenericUDF {

    private transient Long sequence;
    private transient String prefix;


    // Called once per UDF instance, before any rows are processed
    @Override
    public ObjectInspector initialize(ObjectInspector[] objectInspectors) throws UDFArgumentException {
        // Per-instance field initialization
        prefix = getPrefix();
        sequence = 0L;
        // The ID is returned as a string: the 20-digit prefix alone exceeds the 19-digit range of a long
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }


    // Called once for each row
    @Override
    public Object evaluate(DeferredObject[] deferredObjects) throws HiveException {
        sequence = sequence + 1;
        return prefix + sequence;
    }

    @Override
    public String getDisplayString(String[] children) {
        return getStandardDisplayString("unique_number", children);
    }



    // Supporting functions follow

    // Creates the ID prefix
    public String getPrefix() {
        return getStartingTimestamp() + getLocalIPFormatted();
    }


    // Returns the last 2 parts of the IP address, each left-padded with zeros to 3 digits, for a total of 6 digits
    // The last 2 parts are chosen to introduce maximum variability
    public String getLocalIPFormatted() {
        InetAddress addr = null;
        String ip = null;

        try {
            addr = InetAddress.getLocalHost();
            ip = addr.getHostAddress();
        } catch (UnknownHostException e) {
            ip = "111.111.111.111";
        }

        String[] parts = ip.split("\\.");
        // Build the number by concatenating the IP parts in reverse order
        return String.format("%3s", parts[3]).replace(' ', '0') + String.format("%3s", parts[2]).replace(' ', '0');
    }


    // Gets the system time in millis, left-padded with zeros to make the number future proof
    public String getStartingTimestamp() {
        // %014d left-pads with zeros to a fixed width of 14 digits
        return String.format("%014d", System.currentTimeMillis());
    }

}
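
The ID layout used by the class above can be tried outside Hive. The following is a minimal standalone sketch, not part of the UDF: the class name UniqueIdSketch, the helper names, the sample IP addresses and the start time are all hypothetical, and it assumes a 14-digit zero-padded timestamp followed by the 6-digit IP suffix. IDs are kept as strings because the 20-digit prefix does not fit in a long.

```java
import java.util.HashSet;
import java.util.Set;

public class UniqueIdSketch {

    // Zero-pads the last two octets of an IPv4 address to 3 digits each, last octet first.
    static String ipSuffix(String ipv4) {
        String[] parts = ipv4.split("\\.");
        return String.format("%03d%03d", Integer.parseInt(parts[3]), Integer.parseInt(parts[2]));
    }

    // Fixed-width prefix: 14-digit zero-padded millis followed by the 6-digit IP suffix.
    static String prefix(long startMillis, String ipv4) {
        return String.format("%014d", startMillis) + ipSuffix(ipv4);
    }

    public static void main(String[] args) {
        long start = 1462752000000L; // hypothetical job start time
        String nodeA = prefix(start, "10.0.1.7"); // hypothetical node addresses
        String nodeB = prefix(start, "10.0.1.8");

        // Each node counts from 1 on its own; the prefix keeps the IDs apart.
        Set<String> ids = new HashSet<>();
        for (long seq = 1; seq <= 3; seq++) {
            ids.add(nodeA + seq);
            ids.add(nodeB + seq);
        }
        System.out.println(nodeA + 1);  // prints 014627520000000070011
        System.out.println(ids.size()); // prints 6
    }
}
```

Because the prefix has a fixed width, appending the sequence number cannot make two different (prefix, sequence) pairs collide.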





The key things to note are the parameters for UDFType (deterministic = false and stateful = true):

Deterministic: When true, Hive may evaluate the function once and reuse the result across rows. Setting it to false instructs Hive to call the function for each row.
Stateful: Tells Hive that the UDF maintains state in its variables between calls for each row. This is how the sequence number is incremented.
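
The difference between the two settings can be pictured with a toy model. This is only an illustration and not how Hive's optimizer is implemented: the class DeterministicSketch and its evaluate method are hypothetical. A function flagged as deterministic is called once and its result reused; otherwise it is called for every row, which is what a counter needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class DeterministicSketch {

    // Toy evaluator: calls fn once and reuses the result when it is flagged
    // deterministic, otherwise calls fn again for every row.
    static List<Long> evaluate(Supplier<Long> fn, boolean deterministic, int rows) {
        List<Long> out = new ArrayList<>();
        Long cached = deterministic ? fn.get() : null;
        for (int i = 0; i < rows; i++) {
            out.add(deterministic ? cached : fn.get());
        }
        return out;
    }

    public static void main(String[] args) {
        long[] counterA = {0};
        System.out.println(evaluate(() -> ++counterA[0], true, 3));  // prints [1, 1, 1]

        long[] counterB = {0};
        System.out.println(evaluate(() -> ++counterB[0], false, 3)); // prints [1, 2, 3]
    }
}
```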


Why are the variables transient? We don't want any variable value to persist if Hive serializes the UDF object and ships it across nodes.
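
The effect of the transient keyword can be seen with plain Java serialization. Hive uses its own mechanism to serialize query plans, so this is only a sketch of the keyword's behavior; the Counter class and roundTrip helper are hypothetical stand-ins and not Hive classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientSketch {

    // Hypothetical stand-in for the UDF's fields.
    static class Counter implements Serializable {
        private static final long serialVersionUID = 1L;
        transient Long sequence = 0L; // skipped by serialization
        String label = "kept";        // serialized as usual
    }

    // Serializes to bytes and reads back, mimicking an object shipped to another node.
    static Counter roundTrip(Counter in) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(in);
        }
        try (ObjectInputStream input =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Counter) input.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Counter original = new Counter();
        original.sequence = 42L;

        Counter copy = roundTrip(original);
        System.out.println(copy.sequence); // prints null: the transient value was not carried over
        System.out.println(copy.label);    // prints kept
    }
}
```

The copy arrives without the sequence value, which is why the UDF sets its fields in initialize rather than relying on values carried over from wherever the object was created.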
