Symptom:
A huge Pig job causes the local /tmp directory to run out of disk space.
Env:
Pig 0.13
Root cause:
Per PIG-1838, Pig keeps the jar file for each MapReduce job until the whole Pig script finishes. This means that if a single Pig script contains many MapReduce jobs, Pig creates many jar files in the /tmp directory on the node where the Pig job is submitted, and it only cleans up those temp jars after the whole script completes.
Tests:
For example, the Pig script below keeps 2 jars in the /tmp directory until the whole job finishes, because it compiles into 2 MapReduce jobs:
a = LOAD '/dir' USING ParquetLoader();
b = ORDER a BY price;
STORE b INTO '/output' USING parquet.pig.ParquetStorer;
The temp jars in /tmp during execution:
Job4571716915535666311.jar
Job3312616966593773080.jar
If we put two of the above Pig jobs into one Pig script, Pig keeps 4 temp jars in /tmp:
Job7482213044249144977.jar
Job4615931692370853067.jar
Job182685348991417556.jar
Job4601767432482914524.jar
Source Code analysis:
The logic is in the Pig source code, JobControlCompiler.java, which calls the createTempFile() method of java.io.File:
import java.io.File;
File submitJarFile = File.createTempFile("Job", ".jar");
log.info("creating jar file "+submitJarFile.getName());
Per the Java source code, File.java, the temporary directory location is controlled by the java.io.tmpdir system property:
File tmpdir = (directory != null) ? directory : TempDirectory.location();

private static class TempDirectory {
    private TempDirectory() { }

    // temporary directory location
    private static final File tmpdir = new File(fs.normalize(AccessController
            .doPrivileged(new GetPropertyAction("java.io.tmpdir"))));

    static File location() {
        return tmpdir;
    }
}
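To see this behavior in isolation, below is a minimal, self-contained sketch (the class name TempJarDemo is made up for illustration and is not part of Pig) that creates temp jars the same way Pig does and prints where they land:
import java.io.File;
import java.io.IOException;

// Hypothetical demo class, not from the Pig code base: it mimics Pig's
// File.createTempFile("Job", ".jar") calls to show that the files land in
// whatever directory the java.io.tmpdir system property points to.
public class TempJarDemo {
    public static void main(String[] args) throws IOException {
        System.out.println("java.io.tmpdir = " + System.getProperty("java.io.tmpdir"));
        for (int i = 0; i < 2; i++) {
            File jar = File.createTempFile("Job", ".jar");
            System.out.println("created " + jar.getAbsolutePath());
            // Pig only deletes its temp jars when the whole script finishes;
            // this demo cleans up on JVM exit instead.
            jar.deleteOnExit();
        }
    }
}
Running it with -Djava.io.tmpdir=/some/dir makes the Job*.jar files appear under /some/dir instead of /tmp.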
Solution:
To avoid the /tmp directory running out of disk space, the available solutions are:
1. Split a huge Pig script into small pieces and run each piece separately.
Or
2. Set java.io.tmpdir to a directory with enough disk space in HADOOP_OPTS or PIG_OPTS before submitting the Pig job.
For example:
export PIG_OPTS="-Djava.io.tmpdir=/dir_with_enough_disk_space"
pig test.pig
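Before launching a long-running job, it is worth confirming the override actually took effect. One way is a tiny probe class (TmpDirCheck is a made-up name for illustration) run with the same property:
import java.io.File;
import java.io.IOException;

// Hypothetical probe class: creates one temp file the same way Pig does and
// prints its parent directory, which is where the temp jars will accumulate.
public class TmpDirCheck {
    public static void main(String[] args) throws IOException {
        File probe = File.createTempFile("Job", ".jar");
        System.out.println("Temp jars will be created under: " + probe.getParent());
        probe.deleteOnExit(); // remove the probe file when the JVM exits
    }
}
For example, java -Djava.io.tmpdir=/dir_with_enough_disk_space TmpDirCheck should print the configured directory.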