Env:
Drill 1.0Theory:
Partition pruning is a performance optimization that limits the number of files and partitions that Drill reads when querying file systems and Hive tables.Goal:
Know how to use the SQL plan to verify if SQL is optimized using partition pruning.Know how to change the prefix of directory column.
Workshop:
1. Querying data in directories
Assume the CSV data is organized in below directory structure:[root@h1 part1]# hadoop fs -lsr /drill/part1 drwxr-xr-x - root root 2 2015-05-17 18:35 /drill/part1/2014 drwxr-xr-x - root root 1 2015-05-17 18:35 /drill/part1/2014/11 -rw-r--r-- 3 root root 9 2015-05-17 18:35 /drill/part1/2014/11/11.csv drwxr-xr-x - root root 1 2015-05-17 18:36 /drill/part1/2014/12 -rw-r--r-- 3 root root 9 2015-05-17 18:36 /drill/part1/2014/12/12.csv drwxr-xr-x - root root 3 2015-05-17 18:35 /drill/part1/2015 drwxr-xr-x - root root 1 2015-05-17 18:37 /drill/part1/2015/01 -rw-r--r-- 3 root root 8 2015-05-17 18:37 /drill/part1/2015/01/01.csv drwxr-xr-x - root root 1 2015-05-17 18:37 /drill/part1/2015/02 -rw-r--r-- 3 root root 8 2015-05-17 18:37 /drill/part1/2015/02/02.csv drwxr-xr-x - root root 1 2015-05-17 18:37 /drill/part1/2015/03 -rw-r--r-- 3 root root 8 2015-05-17 18:37 /drill/part1/2015/03/03.csvQuerying the directory "part1" can show all data:
> select * from dfs.drill.`part1`; +-----------------+-------+-------+ | columns | dir0 | dir1 | +-----------------+-------+-------+ | ["1","'111'"] | 2015 | 01 | | ["2","'234'"] | 2015 | 02 | | ["3","'3rd'"] | 2015 | 03 | | ["11","'abc'"] | 2014 | 11 | | ["12","'def'"] | 2014 | 12 | +-----------------+-------+-------+ 5 rows selected (0.116 seconds)If we only need to get data of 2015 02~03:
> select * from dfs.drill.`part1` where dir0='2015' and (dir1 >= '02' and dir1 <= '03'); +----------------+-------+-------+ | columns | dir0 | dir1 | +----------------+-------+-------+ | ["2","'234'"] | 2015 | 02 | | ["3","'3rd'"] | 2015 | 03 | +----------------+-------+-------+ 2 rows selected (0.128 seconds)Now check the physical plan of above SQL to confirm it only scans the 2 files instead of full table scan.
> explain plan for select * from dfs.drill.`part1` where dir0='2015' and (dir1 >= '02' and dir1 <= '03'); +------+------+ | text | json | +------+------+ | 00-00 Screen 00-01 Project(*=[$0]) 00-02 Project(*=[$0]) 00-03 Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/part1, numFiles=2, columns=[`*`], files=[maprfs:/drill/part1/2015/02/02.csv, maprfs:/drill/part1/2015/03/03.csv]]])
2. Querying data in Hive partition table
Assume we create below hive partition table:CREATE TABLE partition_table(id INT, username string) PARTITIONED BY(year STRING, month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ","; insert into table partition_table PARTITION(year='2014',month='11') select 1,'u' from passwords limit 1; insert into table partition_table PARTITION(year='2014',month='12') select 2,'s' from passwords limit 1; insert into table partition_table PARTITION(year='2015',month='01') select 3,'e' from passwords limit 1; insert into table partition_table PARTITION(year='2015',month='02') select 4,'r' from passwords limit 1; insert into table partition_table PARTITION(year='2015',month='03') select 5,'n' from passwords limit 1;Querying the Hive table from Drill to show all data:
> select * from hive.partition_table; +-----+-----------+-------+--------+ | id | username | year | month | +-----+-----------+-------+--------+ | 1 | u | 2014 | 11 | | 2 | s | 2014 | 12 | | 3 | e | 2015 | 01 | | 4 | r | 2015 | 02 | | 5 | n | 2015 | 03 | +-----+-----------+-------+--------+ 5 rows selected (0.797 seconds)If we only need to get data of 2015 02~03:
select * from hive.partition_table where `year`='2015' and (`month` >= '02' and `month` <= '03'); +-----+-----------+-------+--------+ | id | username | year | month | +-----+-----------+-------+--------+ | 4 | r | 2015 | 02 | | 5 | n | 2015 | 03 | +-----+-----------+-------+--------+ 2 rows selected (0.374 seconds)Now check the physical plan of above SQL and found it actually scans 3 partitions instead of 2.
(Filed DRILL-3121 for this.)
> explain plan for select * from hive.partition_table where `year`='2015' and (`month` >= '02' and `month` <= '03'); +------+------+ | text | json | +------+------+ | 00-00 Screen 00-01 Project(id=[$0], username=[$1], year=[$2], month=[$3]) 00-02 SelectionVectorRemover 00-03 Filter(condition=[AND(=($2, '2015'), >=($3, '02'), <=($3, '03'))]) 00-04 Scan(groupscan=[HiveScan [table=Table(dbName:default, tableName:partition_table), inputSplits=[maprfs:/user/hive/warehouse/partition_table/year=2015/month=01/000000_0:0+4, maprfs:/user/hive/warehouse/partition_table/year=2015/month=02/000000_0:0+4, maprfs:/user/hive/warehouse/partition_table/year=2015/month=03/000000_0:0+4], columns=[`*`], partitions= [Partition(values:[2015, 01]), Partition(values:[2015, 02]), Partition(values:[2015, 03])]]])
3. How to change the prefix of directory column.
Imagine what will happen if the data itself has such columns "dir0","dir1",etc?Query will fail with "java.lang.IndexOutOfBoundsException". See DRILL-3118.
This is because, by default drill.exec.storage.file.partition.column.label is set to 'dir':
> select * from sys.options where name = 'drill.exec.storage.file.partition.column.label'; +-------------------------------------------------+---------+---------+----------+----------+-------------+-----------+------------+ | name | kind | type | status | num_val | string_val | bool_val | float_val | +-------------------------------------------------+---------+---------+----------+----------+-------------+-----------+------------+ | drill.exec.storage.file.partition.column.label | STRING | SYSTEM | DEFAULT | null | dir | null | null | +-------------------------------------------------+---------+---------+----------+----------+-------------+-----------+------------+To workaround the issue, we can change drill.exec.storage.file.partition.column.label to a prefix which is not in the data column names.
> alter session set `drill.exec.storage.file.partition.column.label`='directory'; +-------+----------------------------------------------------------+ | ok | summary | +-------+----------------------------------------------------------+ | true | drill.exec.storage.file.partition.column.label updated. | +-------+----------------------------------------------------------+ 1 row selected (0.074 seconds) > select `dir999` from dfs.root.`user/hive/warehouse/testdir999/3d49fc1fd0bc7e81-e6c5bb9affac8684_358897896_data.parquet`; +--------------+ | dir999 | +--------------+ | [B@5e5af20c | +--------------+ 1 row selected (0.098 seconds)
Reference:
Querying DirectoriesPartition Pruning(Drill Wiki)
Configuration Options Introduction
No comments:
Post a Comment