How to improve the performance of Drill query with many in-lists

Tuesday, August 25, 2015

How to improve the performance of Drill query with many in-lists

Goal:

How to improve the performance of Drill query with many in-lists.

Env:

Drill 1.1

Root Cause:

For Drill query like: select * from table where col in (1,2,3,...);
The in-list evaluation is done in sequential. It means if you increase the number of in-list, the performance of that query could degrade linearly.
However if the number of in-list elements reaches 20(including 20), Drill can optimize this query to use a in-memory hash table to store the in-list elements, and then do a table join instead. This optimization can increase the performance a lot.

See the differences in below 2 plans.
a. No optimization

explain plan for
select count(1) from h1_passwords where cast(col2 as int) in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19);
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      StreamAgg(group=[{}], EXPR$0=[COUNT()])
00-02        Project($f0=[1])
00-03          SelectionVectorRemover
00-04            Filter(condition=[OR(=(CAST($0):INTEGER, 1), =(CAST($0):INTEGER, 2), =(CAST($0):INTEGER, 3), =(CAST($0):INTEGER, 4), =(CAST($0):INTEGER, 5), =(CAST($0):INTEGER, 6), =(CAST($0):INTEGER, 7), =(CAST($0):INTEGER, 8), =(CAST($0):INTEGER, 9), =(CAST($0):INTEGER, 10), =(CAST($0):INTEGER, 11), =(CAST($0):INTEGER, 12), =(CAST($0):INTEGER, 13), =(CAST($0):INTEGER, 14), =(CAST($0):INTEGER, 15), =(CAST($0):INTEGER, 16), =(CAST($0):INTEGER, 17), =(CAST($0):INTEGER, 18), =(CAST($0):INTEGER, 19))])
00-05              Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:h1_passwords), inputSplits=[maprfs:///user/hive/warehouse/h1_passwords/passwd:0+1680], columns=[`col2`], partitions= null]])

b. With optimization

explain plan for
select count(1) from h1_passwords where cast(col2 as int) in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      StreamAgg(group=[{}], EXPR$0=[COUNT()])
00-02        Project($f0=[1])
00-03          Project(f6=[$1], ROW_VALUE=[$0])
00-04            MergeJoin(condition=[=($1, $0)], joinType=[inner])
00-06              SelectionVectorRemover
00-08                Sort(sort0=[$0], dir0=[ASC])
00-10                  HashAgg(group=[{0}])
00-12                    Values
00-05              SelectionVectorRemover
00-07                Sort(sort0=[$0], dir0=[ASC])
00-09                  Project(f6=[CAST($0):INTEGER])
00-11                    Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:h1_passwords), inputSplits=[maprfs:///user/hive/warehouse/h1_passwords/passwd:0+1680], columns=[`col2`], partitions= null]])

Solution:

Add duplicate in-list columns to make the number of in-list reach 20 to take advantage of this optimization.
For example, change the where condition from:
"where col1 in (1,2,3)"
To:
"where col1 in (1,2,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)"

Improvement JIRA DRILL-3710 is filed to make number "20" configurable.

Tuesday, August 25, 2015