Goal:
This article introduces the behavior of Hive outer join.Solution:
From Hive Outer Join Behavior, here are the definitions of Preserved Row table, Null Supplying table, During Join predicate and After Join predicate(Where predicate).Take a left outer join for example:
select t1.*,t2.name as name_t2 from t1 left outer join t2 on (t1.id=t2.id and t2.name='aa' ) where t1.name='a';Then
- Preserved Row table: t1.
- Null Supplying table: t2.
- During Join predicate: t2.name='aa'.
- After Join predicate(Where predicate): t1.name='a'.
- During Join predicates cannot be pushed past Preserved Row tables.
- After Join predicates cannot be pushed past Null Supplying tables.
To explorer the behaviors, let's create below 2 Hive tables with sample data:
create table t1 (id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ","; create table t2 (id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";Data for t1:
1,a 1,b 2,a 2,b 3,a 3,b 4,bData for t2:
1,aa 4,ddTo show a consistent plan using distributed join instead of map join, we disables hive.auto.convert.join in below tests.
set hive.auto.convert.join=false;
Case 1: Both tables have After Join predicates.
SQL:select t1.*,t2.name as name_t2
from t1 left outer join t2 on (t1.id=t2.id)
where t1.name='a' and t2.name='aa';
Result:1 a aaSQL plan:
STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: t1 Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: (name = 'a') (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE value expressions: id (type: int), name (type: string) TableScan alias: t2 Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE value expressions: name (type: string) Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col1} outputColumnNames: _col0, _col1, _col5 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Filter Operator predicate: (_col5 = 'aa') (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1Only After Join Predicate of Preserved Row table "t1" is pushed before the join.
After Join Predicate of Null Supplying table "t2" is executed as a filter on the output of the join.
The logic is:
( (t1 where t1.name='a') left outer join t2 ) where t2.name='aa'
Case 2: Null Supplying table has During Join predicate, Preserved Row table has After Join predicate
SQL:select t1.*,t2.name as name_t2 from t1 left outer join t2 on (t1.id=t2.id and t2.name='aa' ) where t1.name='a';Result:
1 a aa 2 a NULL 3 a NULLSQL Plan:
STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: t1 Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: (name = 'a') (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE value expressions: id (type: int), name (type: string) TableScan alias: t2 Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: (name = 'aa') (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE value expressions: name (type: string) Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col1} outputColumnNames: _col0, _col1, _col5 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1Both predicates are pushed before the join.
The logic is:
(t1 where t1.name='a') left outer join (t2 where t2.name='aa')
Case 3: Null Supplying table has After Join predicate, Preserved Row table has During Join predicate
SQL:select t1.*,t2.name as name_t2 from t1 left outer join t2 on (t1.id=t2.id and t1.name='a' ) where t2.name='aa';Result:
1 a aaSQL Plan:
STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: t1 Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE value expressions: id (type: int), name (type: string) TableScan alias: t2 Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE value expressions: name (type: string) Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col1} filter predicates: 0 {(VALUE._col1 = 'a')} 1 outputColumnNames: _col0, _col1, _col5 Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: (_col5 = 'aa') (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1Neither of the two predicates are pushed before the join.
The logic is:
(t1 left outer join t2 ) where t1.name='a' and t2.name='aa'
Case 4: Both tables have During Join predicates.
SQL:select t1.*,t2.name as name_t2
from t1 left outer join t2 on
(t1.id=t2.id and t1.name='a' and t2.name='aa');
Result:1 a aa 1 b NULL 2 a NULL 2 b NULL 3 a NULL 3 b NULL 4 b NULLSQL Plan:
STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: t1 Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE value expressions: id (type: int), name (type: string) TableScan alias: t2 Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: (name = 'aa') (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE value expressions: name (type: string) Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col1} filter predicates: 0 {(VALUE._col1 = 'a')} 1 outputColumnNames: _col0, _col1, _col5 Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1Only During Join Predicate of Null Supplying table "t2" is pushed before the join.
During Join Predicate of Preserved Row table "t1" is ignored.
The logic is:
( t1 left outer join (t2 where t2.name='aa') )
Key takeaways:
- The SQL result depends on how you write the SQL.
- During Join predicates of Null Supplying tables can be pushed.
- After Join predicates of Preserved Row tables can be pushed.
- During Join Predicate of Preserved Row table "t1" is ignored.
- Sometimes if the predicates of both sides need to be pushed, please refer to case 2.
For example, imagine both "t1" and "t2" are partition tables based on partition key "name", you may want partition pruning to happen before join.
No comments:
Post a Comment