Goal:
This article does some code analysis on which Spark SQL data type is Order-able or Sort-able.
We will look into the source code logic for method "isOrderable" of object org.apache.spark.sql.catalyst.expressions.RowOrdering.
Env:
Spark 2.4 source code
Solution:
The source code for method "isOrderable" is:
/**
* Returns true iff the data type can be ordered (i.e. can be sorted).
*/
def isOrderable(dataType: DataType): Boolean = dataType match {
case NullType => true
case dt: AtomicType => true
case struct: StructType => struct.fields.forall(f => isOrderable(f.dataType))
case array: ArrayType => isOrderable(array.elementType)
case udt: UserDefinedType[_] => isOrderable(udt.sqlType)
case _ => false
}
So basically any NullType or AtomicType should be Order-able. For other complex types, it depends on their element types.
Let's now take a look at ALL of the Spark SQL data types in org.apache.spark.sql.types
1. NullType
import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.types._
scala> RowOrdering.isOrderable(NullType)
res0: Boolean = true
2. class which extends AtomicType
scala> RowOrdering.isOrderable(BinaryType)
res29: Boolean = true
scala> RowOrdering.isOrderable(BooleanType)
res6: Boolean = true
scala> RowOrdering.isOrderable(DateType)
res31: Boolean = true
RowOrdering.isOrderable(HiveStringType)
scala> RowOrdering.isOrderable(StringType)
res1: Boolean = true
scala> RowOrdering.isOrderable(TimestampType)
res19: Boolean = true
Here is another abstract class HiveStringType which also extends AtomicType.
But as per the comment below, it should be replaced by a StringType. And it is even removed in Spark 3.1.
/**
* A hive string type for compatibility. These datatypes should only used for parsing,
* and should NOT be used anywhere else. Any instance of these data types should be
* replaced by a [[StringType]] before analysis.
*/
sealed abstract class HiveStringType extends AtomicType
3. class which extends IntegralType
Basically "abstract class IntegralType extends NumericType" and "abstract class NumericType extends AtomicType" inside AbstractDataType.scala.
So any class which extends IntegralType should also be order-able:
scala> RowOrdering.isOrderable(ByteType)
res11: Boolean = true
scala> RowOrdering.isOrderable(IntegerType)
res3: Boolean = true
scala> RowOrdering.isOrderable(LongType)
res13: Boolean = true
scala> RowOrdering.isOrderable(ShortType)
res14: Boolean = true
4. class which extends FractionalType
Basically "abstract class FractionalType extends NumericType" and "abstract class NumericType extends AtomicType" inside AbstractDataType.scala.
So any class which extends FractionalType should also be order-able:
scala> RowOrdering.isOrderable(DecimalType(10,5))
res17: Boolean = true
scala> RowOrdering.isOrderable(DoubleType)
res2: Boolean = true
scala> RowOrdering.isOrderable(FloatType)
res13: Boolean = true
5. Spark SQL data types which are not order-able
scala> RowOrdering.isOrderable(CalendarIntervalType)
res26: Boolean = false
scala> RowOrdering.isOrderable(DataTypes.createMapType(StringType,StringType))
res9: Boolean = false
scala> RowOrdering.isOrderable(ObjectType(classOf[java.lang.Integer]))
res23: Boolean = false
6. Complex Spark SQL data types
If the ArrayType's element type is order-able, then ArrayType is order-able. Vice Versa.
scala> RowOrdering.isOrderable(ArrayType(IntegerType))
res22: Boolean = true
scala> RowOrdering.isOrderable(ArrayType(CalendarIntervalType))
res27: Boolean = false
If all of the field types of StructType is order-able, then StructType is order-able. Vice Versa.
scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", StringType))
res6: Boolean = true
scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", CalendarIntervalType))
res7: Boolean = false
7. UserDefinedType
As per below comment, it should not be used and becomes private after Spark 2.x.
* Note: This was previously a developer API in Spark 1.x. We are making this private in Spark 2.0
* because we will very likely create a new version of this that works better with Datasets.
No comments:
Post a Comment