Thursday, April 16, 2015

Pig job fails with ApplicationMaster OutOfMemoryError when writing parquet files.

Env:

Pig 0.13 on Yarn

Symptom:

  • A Pig job that reads and writes many parquet files fails with an ApplicationMaster OutOfMemoryError in the final commitJob phase.
  • All mappers and reducers finish successfully.
  • The ApplicationMaster container log shows the stack trace below:
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to Job commit failed: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(PigOutputCommitter.java:281)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:253)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:216)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(PigOutputCommitter.java:279)
        ... 5 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
        at java.lang.StringCoding.encode(StringCoding.java:344)
        at java.lang.String.getBytes(String.java:916)
        at parquet.org.apache.thrift.protocol.TCompactProtocol.writeString(TCompactProtocol.java:298)
        at parquet.format.ColumnChunk.write(ColumnChunk.java:512)
        at parquet.format.RowGroup.write(RowGroup.java:521)
        at parquet.format.FileMetaData.write(FileMetaData.java:923)
        at parquet.format.Util.write(Util.java:56)
        at parquet.format.Util.writeFileMetaData(Util.java:30)
        at parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:322)
        at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:342)
        at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
        ... 10 more

Root Cause:

Pig uses a parquet jar to read and write parquet files. The source code of that jar comes from the parquet-mr project on GitHub.
The relevant logic is in ParquetOutputCommitter.java:
During the commitJob phase, the ApplicationMaster calls ParquetOutputCommitter.commitJob().
It first reads the footers of all the output parquet files:
List<Footer> footers = ParquetFileReader.readAllFootersInParallel(configuration, outputStatus);
It then writes the metadata into a file named "_metadata" in the output directory:
ParquetFileWriter.writeMetadataFile(configuration, outputPath, footers);
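Both steps run inside the ApplicationMaster, which holds every footer in memory before serializing them into one file. If the output parquet files have a large schema and there are many of them, the ApplicationMaster needs a lot of memory during the commitJob phase.
By default, the ApplicationMaster's memory settings are: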
yarn.app.mapreduce.am.resource.mb 1536
yarn.app.mapreduce.am.command-opts -Xmx1024m
If that memory is not enough, the ApplicationMaster fails with an OutOfMemoryError.

Solution:

If the _metadata file is needed, increase yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.command-opts until the ApplicationMaster has enough memory to complete the commitJob phase.
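
As a rough sketch, both properties can be raised from the Pig script itself with "set". The values below are illustrative assumptions, not recommendations from the original post; the right size depends on how many output files there are and how large their schema is. The heap (-Xmx) is kept below the container size, as it is in the defaults.

```pig
-- Illustrative values only (assumption): tune them for your job.
-- Container size requested for the ApplicationMaster, in MB.
set yarn.app.mapreduce.am.resource.mb 4096;
-- JVM heap for the ApplicationMaster; keep it below the container size.
set yarn.app.mapreduce.am.command-opts '-Xmx3072m';
```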

Otherwise, starting from parquet 1.6.0 (per PARQUET-107), the configuration "parquet.enable.summary-metadata" was introduced to enable or disable metadata generation in the commitJob phase. Run the command below to disable metadata generation:
set parquet.enable.summary-metadata false;
Note: make sure parquet-pig-bundle-<version>.jar is built from parquet 1.6.0 source code or later. For example, Twitter publishes compiled parquet-pig-bundle jars here:
http://repo1.maven.org/maven2/com/twitter/parquet-pig-bundle/
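
For context, here is a minimal sketch of where the setting sits in a script. The jar location and the input and output paths are hypothetical placeholders, and the script assumes the pre-Apache package names (parquet.pig.ParquetLoader and parquet.pig.ParquetStorer) that ship in the com.twitter bundle.

```pig
-- Hypothetical jar location and HDFS paths, for illustration only.
REGISTER /path/to/parquet-pig-bundle-1.6.0.jar;

-- Skip writing the _metadata summary file during commitJob.
set parquet.enable.summary-metadata false;

data = LOAD '/data/input' USING parquet.pig.ParquetLoader();
STORE data INTO '/data/output' USING parquet.pig.ParquetStorer();
```

With the setting disabled, the output directory should contain the parquet part files but no _metadata file.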
