Goal:
When running count(*) queries on parquet files, the physical plan may show below:Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@331f5e71...This article is to explain what does "org.apache.drill.exec.store.pojo.PojoRecordReader" mean.
Root Cause:
For select count(*) or count( not-nullable-expr) queries on parquet files, Drill may do an optimization to read from Parquet metadata instead of reading the whole parquet file.This logic is in code exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/ConvertCountToDirectScan.java:
/** * This rule will convert * " select count(*) as mycount from table " * or " select count( not-nullable-expr) as mycount from table " * into * * Project(mycount) * \ * DirectGroupScan ( PojoRecordReader ( rowCount )) * * or * " select count(column) as mycount from table " * into * Project(mycount) * \ * DirectGroupScan (PojoRecordReader (columnValueCount)) * * Currently, only parquet group scan has the exact row count and column value count, * obtained from parquet row group info. This will save the cost to * scan the whole parquet files. */For example:
create table dfs.tmp.`prune2` as select 1 as id from hive.bigtable; refresh table metadata dfs.tmp.`prune2`; select count(*) from dfs.tmp.`prune2` ;The physical plan contains below line:
Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@4b7000eb[columns = null, isStarQuery = false, isSkipQuery = false]])
No comments:
Post a Comment