Goal:
How to improve the performance of Drill query with many in-lists.Env:
Drill 1.1Root Cause:
For Drill query like: select * from table where col in (1,2,3,...);The in-list evaluation is done in sequential. It means if you increase the number of in-list, the performance of that query could degrade linearly.
However if the number of in-list elements reaches 20(including 20), Drill can optimize this query to use a in-memory hash table to store the in-list elements, and then do a table join instead. This optimization can increase the performance a lot.
See the differences in below 2 plans.
a. No optimization
explain plan for select count(1) from h1_passwords where cast(col2 as int) in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19); +------+------+ | text | json | +------+------+ | 00-00 Screen 00-01 StreamAgg(group=[{}], EXPR$0=[COUNT()]) 00-02 Project($f0=[1]) 00-03 SelectionVectorRemover 00-04 Filter(condition=[OR(=(CAST($0):INTEGER, 1), =(CAST($0):INTEGER, 2), =(CAST($0):INTEGER, 3), =(CAST($0):INTEGER, 4), =(CAST($0):INTEGER, 5), =(CAST($0):INTEGER, 6), =(CAST($0):INTEGER, 7), =(CAST($0):INTEGER, 8), =(CAST($0):INTEGER, 9), =(CAST($0):INTEGER, 10), =(CAST($0):INTEGER, 11), =(CAST($0):INTEGER, 12), =(CAST($0):INTEGER, 13), =(CAST($0):INTEGER, 14), =(CAST($0):INTEGER, 15), =(CAST($0):INTEGER, 16), =(CAST($0):INTEGER, 17), =(CAST($0):INTEGER, 18), =(CAST($0):INTEGER, 19))]) 00-05 Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:h1_passwords), inputSplits=[maprfs:///user/hive/warehouse/h1_passwords/passwd:0+1680], columns=[`col2`], partitions= null]])b. With optimization
explain plan for select count(1) from h1_passwords where cast(col2 as int) in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20); +------+------+ | text | json | +------+------+ | 00-00 Screen 00-01 StreamAgg(group=[{}], EXPR$0=[COUNT()]) 00-02 Project($f0=[1]) 00-03 Project(f6=[$1], ROW_VALUE=[$0]) 00-04 MergeJoin(condition=[=($1, $0)], joinType=[inner]) 00-06 SelectionVectorRemover 00-08 Sort(sort0=[$0], dir0=[ASC]) 00-10 HashAgg(group=[{0}]) 00-12 Values 00-05 SelectionVectorRemover 00-07 Sort(sort0=[$0], dir0=[ASC]) 00-09 Project(f6=[CAST($0):INTEGER]) 00-11 Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:h1_passwords), inputSplits=[maprfs:///user/hive/warehouse/h1_passwords/passwd:0+1680], columns=[`col2`], partitions= null]])
Solution:
Add duplicate in-list columns to make the number of in-list reach 20 to take advantage of this optimization.For example, change the where condition from:
"where col1 in (1,2,3)"
To:
"where col1 in (1,2,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)"
Improvement JIRA DRILL-3710 is filed to make number "20" configurable.
No comments:
Post a Comment