Goal:
Hive users may set different values for "mapred.output.compression.codec" (the older name for "mapreduce.output.fileoutputformat.compress.codec") in the Hive CLI, Beeline, hive-site.xml, or even in Hive script files, and there could be thousands of such script files. This article explains how to override "mapred.output.compression.codec" globally without modifying each script file one by one.
Env:
Hive 1.2
Hadoop 2.7
Solution:
Say all Hive scripts currently use the LZO compression codec:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
Suppose the Hadoop admin wants to override the compression codec with one of the following:
1. org.apache.hadoop.io.compress.SnappyCodec
Add the property with a <final> tag to mapred-site.xml on all nodes:
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <final>true</final>
</property>
After that, the YARN job container log should show the following:
INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/mapr/tmp/hive/mapr/593a64b1-16a4-4e96-9e32-a0066e10309d/hive_2016-09-02_16-56-10_136_8088252876833149660-1/-mr-10000/.hive-staging_hive_2016-09-02_16-56-10_136_8088252876833149660-1/_tmp.-ext-10001/000000_1.snappy
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
2. org.apache.hadoop.io.compress.DefaultCodec
Leave the value empty and add the <final> tag in mapred-site.xml on all nodes:
<property>
  <name>mapred.output.compression.codec</name>
  <value></value>
  <final>true</final>
</property>
After that, the YARN job container log should show the following:
INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/mapr/tmp/hive/mapr/94d62385-73ae-4770-b050-6f2f6b4f77ba/hive_2016-09-02_19-19-36_670_8404792977770566381-1/-mr-10000/.hive-staging_hive_2016-09-02_19-19-36_670_8404792977770566381-1/_tmp.-ext-10001/000000_0.deflate
INFO [main] org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
3. No compression at all
The "SET hive.exec.compress.output=true;" statement has to be removed from all Hive scripts.
After that, the YARN job container log should show the following:
INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/mapr/tmp/hive/mapr/94d62385-73ae-4770-b050-6f2f6b4f77ba/hive_2016-09-02_19-23-58_054_204837380219085966-1/-mr-10000/.hive-staging_hive_2016-09-02_19-23-58_054_204837380219085966-1/_tmp.-ext-10001/000000_0
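Because the third situation requires touching every script, a batch edit may help when there are thousands of them. The following is only a minimal sketch, not a tested migration tool. It assumes the scripts are plain-text files matching a hypothetical "*.hql" pattern under one directory, and that the statement is written as shown in this article (letter case and spacing may vary). The function names strip_compress_output and rewrite_scripts are made up for illustration.

```python
import re
from pathlib import Path

# Matches "SET hive.exec.compress.output=true;" in any letter case, with
# optional whitespace between tokens, plus trailing spaces or tabs.
_COMPRESS_SET = re.compile(
    r"\bSET\s+hive\.exec\.compress\.output\s*=\s*true\s*;[ \t]*",
    re.IGNORECASE,
)


def strip_compress_output(script_text: str) -> str:
    """Return the script text with the compress-output SET statement removed."""
    cleaned_lines = []
    for line in script_text.splitlines(keepends=True):
        new_line = _COMPRESS_SET.sub("", line)
        # Drop lines that held nothing but the removed statement.
        if new_line != line and new_line.strip() == "":
            continue
        cleaned_lines.append(new_line)
    return "".join(cleaned_lines)


def rewrite_scripts(root: str, pattern: str = "*.hql") -> int:
    """Rewrite matching files under `root` in place; return how many changed.

    The "*.hql" default is an assumption; adjust it to the real file naming.
    """
    changed = 0
    for path in Path(root).rglob(pattern):
        original = path.read_text()
        cleaned = strip_compress_output(original)
        if cleaned != original:
            path.write_text(cleaned)
            changed += 1
    return changed
```

Running strip_compress_output on a copy of a few scripts first, and comparing the result with the originals, is a sensible check before calling rewrite_scripts on the whole tree. Statements that set the property to anything other than true are left untouched.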
Note: Even after mapred.output.compression.codec is overridden in mapred-site.xml, Hive CLI or Beeline still shows "Lzo". That is fine and can be ignored:
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
The value that takes effect is the one marked final in mapred-site.xml.
The YARN job container log confirms this:
WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.output.fileoutputformat.compress.codec; Ignoring.
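The warning above reflects how a final parameter behaves: once a resource marks a property as final, a value from a resource loaded later is ignored. The sketch below is a simplified Python model of that rule, not Hadoop's actual Configuration code. It assumes the same property name is used in both resources (it does not model the mapping between deprecated and new property names), and the sample XML strings and function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

CODEC_KEY = "mapred.output.compression.codec"

# Hypothetical stand-ins for mapred-site.xml and the job.xml shipped with a job.
SITE_XML = f"""<configuration><property>
  <name>{CODEC_KEY}</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <final>true</final>
</property></configuration>"""

JOB_XML = f"""<configuration><property>
  <name>{CODEC_KEY}</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property></configuration>"""


def parse_properties(xml_text):
    """Return (name, value, is_final) tuples from a Hadoop-style config."""
    props = []
    for prop in ET.fromstring(xml_text).iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        is_final = (prop.findtext("final") or "").strip().lower() == "true"
        props.append((name, value, is_final))
    return props


def merge_resources(resources):
    """Apply (label, xml_text) resources in order, honouring final parameters.

    Returns the effective settings and a list of warnings for ignored
    overrides. Warning only when the value differs is a choice of this model.
    """
    effective, final_names, warnings = {}, set(), []
    for label, xml_text in resources:
        for name, value, is_final in parse_properties(xml_text):
            if name in final_names:
                if value != effective[name]:
                    warnings.append(
                        f"{label}:an attempt to override final parameter: "
                        f"{name}; Ignoring."
                    )
                continue
            effective[name] = value
            if is_final:
                final_names.add(name)
    return effective, warnings


if __name__ == "__main__":
    settings, notes = merge_resources(
        [("mapred-site.xml", SITE_XML), ("job.xml", JOB_XML)]
    )
    print(settings[CODEC_KEY])
    for note in notes:
        print(note)
```

In this model the Snappy value from the site file survives and the LZO value from the job file produces one warning, which mirrors why the Hive session can still display the LZO setting while the job output is written with the codec from mapred-site.xml.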