Goal:
Understanding the differences between NullableVarCharHolder and VarCharHolder in Drill UDF code.Env:
Drill 1.4Root Cause:
When writing custom Drill UDF code, here are 2 input types for varchar:org.apache.drill.exec.expr.holders.VarCharHolder org.apache.drill.exec.expr.holders.NullableVarCharHolder
Take function to_int(varchar,bigint) for example, and let's focus on the 1st input parameter in varchar type.
If the function is applied on a column of json file, the column is nullable(NullableVarCharHolder), eg:
select to_int(col1,999) from dfs.drill.`a.json`;If it is applied on a constant value, then it is for sure not-nullable(VarCharHolder),eg:
select to_int('111', 999) from sys.version;
One major difference between NullableVarCharHolder and VarCharHolder is:
NullableVarCharHolder has an attribute "isSet". "isSet=0" means this object is null, vice versa.
Since VarCharHolder does not have that attribute, so if the function's null handling is "nulls = NullHandling.INTERNAL" instead of "nulls = NullHandling.NULL_IF_NULL", you need to implement the same function twice -- one for "NullableVarCharHolder" and one for "VarCharHolder".
Sample Code:
Here are the sample code for to_int(varchar,bigint) function:1. For input type = NullableVarCharHolder
https://github.com/viadea/DrillUDF/blob/master/src/main/java/openkb/drill/udf/GetInt.java
2. For input type = VarCharHolder
https://github.com/viadea/DrillUDF/blob/master/src/main/java/openkb/drill/udf/GetInt2.java
Test
1. For input type = NullableVarCharHolder> select col1 from dfs.drill.`a.json`; +-------+ | col1 | +-------+ | 123 | | null | +-------+ 2 rows selected (0.246 seconds) > select to_int(col1,999) from dfs.drill.`a.json`; +---------+ | EXPR$0 | +---------+ | 123 | | 999 | +---------+ 2 rows selected (0.23 seconds)
2. For input type = VarCharHolder
> select to_int('111', 999) from sys.version; +---------+ | EXPR$0 | +---------+ | 111 | +---------+ 1 row selected (0.537 seconds) > select to_int('wrongnumber', 999) from sys.version; +---------+ | EXPR$0 | +---------+ | 999 | +---------+ 1 row selected (0.374 seconds)
I learned that the hard way. I find this impossible to live with if I have just 2 more arguments for my UDF to handle ! Why can't the UDF version receiving 'NullableVarCharHolder' also accept constants with the 'isSet' flag always set to '1' ?
ReplyDeleteThis is a design issue. I would suggest you open a Apache Drill JIRA for this.
DeleteGrreat share
ReplyDelete