1. Join order is listed in FROM clause.
Compare below 2 plans:Query: explain select count(*) from passwords a,passwords2 b,passwords3 c where a.col0=b.col0 and b.col0=c.col0 +------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=4.06GB VCores=2 | | WARNING: The following tables are missing relevant table and/or column statistics. | | default.passwords, default.passwords2, default.passwords3 | | | | 09:AGGREGATE [MERGE FINALIZE] | | | output: sum(count(*)) | | | | | 08:EXCHANGE [PARTITION=UNPARTITIONED] | | | | | 05:AGGREGATE | | | output: count(*) | | | | | 04:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: b.col0 = c.col0 | | | | | |--07:EXCHANGE [BROADCAST] | | | | | | | 02:SCAN HDFS [default.passwords3 c] | | | partitions=1/1 size=10.42KB | | | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: a.col0 = b.col0 | | | | | |--06:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [default.passwords2 b] | | | partitions=1/1 size=2.60KB | | | | | 00:SCAN HDFS [default.passwords a] | | partitions=1/1 size=10.42KB | +------------------------------------------------------------------------------------+VS
Query: explain select count(*) from passwords2 b,passwords a,passwords3 c where a.col0=b.col0 and b.col0=c.col0 +------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=2.06GB VCores=2 | | WARNING: The following tables are missing relevant table and/or column statistics. | | default.passwords, default.passwords2, default.passwords3 | | | | 09:AGGREGATE [MERGE FINALIZE] | | | output: sum(count(*)) | | | | | 08:EXCHANGE [PARTITION=UNPARTITIONED] | | | | | 05:AGGREGATE | | | output: count(*) | | | | | 04:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: b.col0 = c.col0 | | | | | |--07:EXCHANGE [BROADCAST] | | | | | | | 02:SCAN HDFS [default.passwords3 c] | | | partitions=1/1 size=10.42KB | | | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: b.col0 = a.col0 | | | | | |--06:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [default.passwords a] | | | partitions=1/1 size=10.42KB | | | | | 00:SCAN HDFS [default.passwords2 b] | | partitions=1/1 size=2.60KB | +------------------------------------------------------------------------------------+So largest table(or fact table) should be put on most left side.
2. Hash Join types -- Broadcast VS Shuffle.
Broadcast
Best for large table join small table. (For example, fact table join dimension table).
Shuffle
Best for large table join large table.
We can use hint "[shuffle]" to force a shuffle join.
Compare below 2 plans:
Query: explain select count(*) from passwords a join passwords2 b where a.col0=b.col0 +------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=2.04GB VCores=2 | | WARNING: The following tables are missing relevant table and/or column statistics. | | default.passwords, default.passwords2 | | | | 06:AGGREGATE [MERGE FINALIZE] | | | output: sum(count(*)) | | | | | 05:EXCHANGE [PARTITION=UNPARTITIONED] | | | | | 03:AGGREGATE | | | output: count(*) | | | | | 02:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: a.col0 = b.col0 | | | | | |--04:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [default.passwords2 b] | | | partitions=1/1 size=2.60KB | | | | | 00:SCAN HDFS [default.passwords a] | | partitions=1/1 size=10.42KB | +------------------------------------------------------------------------------------+VS
Query: explain select count(*) from passwords a join [shuffle] passwords2 b where a.col0=b.col0 +------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=2.04GB VCores=2 | | WARNING: The following tables are missing relevant table and/or column statistics. | | default.passwords, default.passwords2 | | | | 07:AGGREGATE [MERGE FINALIZE] | | | output: sum(count(*)) | | | | | 06:EXCHANGE [PARTITION=UNPARTITIONED] | | | | | 03:AGGREGATE | | | output: count(*) | | | | | 02:HASH JOIN [INNER JOIN, PARTITIONED] | | | hash predicates: a.col0 = b.col0 | | | | | |--05:EXCHANGE [PARTITION=HASH(b.col0)] | | | | | | | 01:SCAN HDFS [default.passwords2 b] | | | partitions=1/1 size=2.60KB | | | | | 04:EXCHANGE [PARTITION=HASH(a.col0)] | | | | | 00:SCAN HDFS [default.passwords a] | | partitions=1/1 size=10.42KB | +------------------------------------------------------------------------------------+Note: Similar to scenarios in Greenplum: 2 tables have the same distribution key, so that local join is done.
3. Partition Pruning
For example:Query: explain select count(*) from logs where year='2013' and month='07' and day='28'
+------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
| |
| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED] |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | hosts=4 per-host-mem=unavailable |
| | tuple-ids=1 row-size=8B cardinality=1 |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| hosts=4 per-host-mem=unavailable |
| tuple-ids=1 row-size=8B cardinality=1 |
| |
| F00:PLAN FRAGMENT [PARTITION=RANDOM] |
| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=02, PARTITION=UNPARTITIONED] |
| 01:AGGREGATE |
| | output: count(*) |
| | hosts=4 per-host-mem=10.00MB |
| | tuple-ids=1 row-size=8B cardinality=1 |
| | |
| 00:SCAN HDFS [external_partitions.logs, PARTITION=RANDOM] |
| partitions=2/4 size=30B |
| table stats: 4 rows total |
| column stats: all |
| hosts=4 per-host-mem=32.00MB |
| tuple-ids=0 row-size=45B cardinality=2 |
+------------------------------------------------------------------------+
Note: Greenplum supports dynamic partition pruning.4. Impala does not spill to disks.
Impala's pipelines are in memory. Minimum of 128G RAM is suggested for each Impala node.Greenplum will spill, so recently it introduces workfile limit to control the workfile size.
5. Compression Types
Type: | Snappy | Gzip/Zlib |
---|---|---|
Speed: | Faster | Slower |
CPU: | Less | More |
Compression Ratio: | Lower | Higher |
No comments:
Post a Comment