Env:
Pig 0.13 on YARN
Symptom:
- A Pig job that reads and writes many Parquet files fails with an ApplicationMaster OutOfMemoryError in the final commitJob phase.
- All mappers and reducers finish successfully.
- The ApplicationMaster container log shows the stack trace below:
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to Job commit failed: java.io.IOException: java.lang.reflect.InvocationTargetException
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(PigOutputCommitter.java:281)
    at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:253)
    at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:216)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(PigOutputCommitter.java:279)
    ... 5 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
    at java.lang.StringCoding.encode(StringCoding.java:344)
    at java.lang.String.getBytes(String.java:916)
    at parquet.org.apache.thrift.protocol.TCompactProtocol.writeString(TCompactProtocol.java:298)
    at parquet.format.ColumnChunk.write(ColumnChunk.java:512)
    at parquet.format.RowGroup.write(RowGroup.java:521)
    at parquet.format.FileMetaData.write(FileMetaData.java:923)
    at parquet.format.Util.write(Util.java:56)
    at parquet.format.Util.writeFileMetaData(Util.java:30)
    at parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:322)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:342)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    ... 10 more
Root Cause:
Pig uses a Parquet jar to read and write Parquet files; the source code of that jar comes from the parquet-mr GitHub project. The relevant logic is in ParquetOutputCommitter.java.
In this case, during the commitJob() phase the ApplicationMaster calls ParquetOutputCommitter.commitJob().
It first reads all the footers of the output Parquet files:
List<Footer> footers = ParquetFileReader.readAllFootersInParallel(configuration, outputStatus);
It then writes the combined metadata into a file named "_metadata" in the output directory:
ParquetFileWriter.writeMetadataFile(configuration, outputPath, footers);
If the output Parquet files have a large schema and the number of Parquet files is huge, the ApplicationMaster needs a lot of memory during the commitJob phase.
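Putting those two calls together, here is a simplified sketch of the summary-metadata step performed inside ParquetOutputCommitter.commitJob(). The class and method names in the sketch are placeholders, not the verbatim parquet-mr source; only the two parquet-mr calls quoted above are real.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import parquet.hadoop.Footer;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.ParquetFileWriter;

// Simplified sketch of the summary-metadata step inside parquet-mr's
// ParquetOutputCommitter.commitJob(); illustrative only, not the real source.
public class SummaryMetadataSketch {
    static void writeSummaryMetadata(Configuration configuration, Path outputPath) throws IOException {
        FileSystem fileSystem = outputPath.getFileSystem(configuration);
        FileStatus outputStatus = fileSystem.getFileStatus(outputPath);
        // 1. Read the footer (schema + row-group/column-chunk metadata) of every
        //    output Parquet file into the ApplicationMaster's heap.
        List<Footer> footers = ParquetFileReader.readAllFootersInParallel(configuration, outputStatus);
        // 2. Serialize all of those footers into a single _metadata file in the
        //    output directory -- the point where the OOM in the stack trace above occurs.
        ParquetFileWriter.writeMetadataFile(configuration, outputPath, footers);
    }
}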
By default, the ApplicationMaster's memory configurations are:
yarn.app.mapreduce.am.resource.mb    1536
yarn.app.mapreduce.am.command-opts   -Xmx1024m
If this memory is not enough, the ApplicationMaster fails with an OOM error.
Solution:
If the _metadata file is needed, increase yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.command-opts until the ApplicationMaster has enough memory; see the example below.
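For example, both properties can be set from the Pig script itself (or passed with -D on the pig command line). The 3072 MB / -Xmx2560m values below are only illustrative, not tuned recommendations; size them for your own job:
set yarn.app.mapreduce.am.resource.mb 3072;
set yarn.app.mapreduce.am.command-opts '-Xmx2560m';
Keep the -Xmx value comfortably below yarn.app.mapreduce.am.resource.mb, mirroring the default 1024m/1536 ratio.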
Otherwise, starting from Parquet 1.6.0 (per PARQUET-107), the configuration "parquet.enable.summary-metadata" was introduced to enable or disable metadata generation in the commitJob phase, so run the command below to disable metadata generation:
set parquet.enable.summary-metadata false;
Note: please make sure parquet-pig-bundle-<version>.jar is compiled from Parquet 1.6.0 source code or later. For example, Twitter's compiled parquet-pig-bundle jars are available here:
http://repo1.maven.org/maven2/com/twitter/parquet-pig-bundle/