Monday, February 23, 2015

How to build and use parquet-tools to read parquet files

Goal:

How to build and use parquet-tools to read parquet files.

Solution:

1. Download and install Maven.

Follow the link below:
http://maven.apache.org/download.cgi

2. Download the parquet source code

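git clone https://github.com/Parquet/parquet-mr.git
If the clone fails with "repository not found", note that the project has since moved to https://github.com/apache/parquet-mr/ .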

3. Build parquet-tools.

cd parquet-mr/parquet-tools/
mvn clean package -Plocal
The resulting jar is placed in target/. In the steps below it is named parquet-tools-1.6.1-SNAPSHOT.jar, after the version set in pom.xml.

Note: you may hit an error such as the following:
Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository
This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, but that version does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
<version>1.6.1-SNAPSHOT</version>
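The version change can also be scripted instead of edited by hand. Below is a minimal sketch, assuming the old version appears literally as <version>1.6.0rc3-SNAPSHOT</version> in both pom files. The function name set_pom_version is made up for illustration and is not part of Maven or parquet-mr.

```shell
# set_pom_version OLD NEW FILE...
# Replace <version>OLD</version> with <version>NEW</version> in each FILE.
# Hypothetical helper for illustration only.
# Dots in OLD are treated as regex wildcards by sed, which is harmless here.
set_pom_version() {
  old=$1
  new=$2
  shift 2
  for pom in "$@"; do
    tmp="$pom.tmp"
    # Write to a temporary file first so a failed sed leaves the pom untouched.
    sed "s|<version>$old</version>|<version>$new</version>|g" "$pom" > "$tmp" &&
      mv "$tmp" "$pom"
  done
}

# Example (run from the directory that contains parquet-mr):
# set_pom_version 1.6.0rc3-SNAPSHOT 1.6.1-SNAPSHOT \
#   parquet-mr/pom.xml parquet-mr/parquet-tools/pom.xml
```

Other <version> tags, such as those of unrelated dependencies, are left alone because only an exact match on the old version string is replaced.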

4. Show the help manual

cd target
java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help

5. Dump the schema

Take the sample file nation.parquet as an example.
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet
message root {
  required int64 N_NATIONKEY;
  required binary N_NAME (UTF8);
  required int64 N_REGIONKEY;
  required binary N_COMMENT (UTF8);
}
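If only the column names and physical types are needed, the schema output can be filtered with awk. This is a rough sketch that assumes a flat schema like the one above (no nested groups). The function name schema_columns is hypothetical.

```shell
# schema_columns: read "schema" output on stdin and print "name<TAB>type"
# for each field line. Hypothetical helper; assumes a flat schema where
# every field line looks like "required int64 N_NATIONKEY;".
schema_columns() {
  awk '$1 == "required" || $1 == "optional" || $1 == "repeated" {
    name = $3
    sub(/;$/, "", name)   # drop the trailing ";" when no annotation follows
    print name "\t" $2
  }'
}

# Example:
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet | schema_columns
```

Annotations such as (UTF8) are ignored, so N_NAME is reported with its physical type, binary.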

6. Read the data


# java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f

(... ...)

7. Read first n records

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly

N_NATIONKEY = 2
N_NAME = BRAZIL
N_REGIONKEY = 1
N_COMMENT = y alongside of the p 
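The "NAME = value" records printed by cat and head are easy to read but awkward to feed into other tools. The following sketch turns them into one tab-separated line per record. It assumes records are separated by blank lines and that every field line contains " = ", as in the output above. The function name records_to_tsv is made up for illustration.

```shell
# records_to_tsv: read "NAME = value" records on stdin (blank line between
# records) and print one tab-separated line of values per record.
# Hypothetical helper; values that themselves contain tabs are not handled.
records_to_tsv() {
  awk '
    # A blank line ends the current record.
    /^[[:space:]]*$/ { if (n > 0) print row; n = 0; row = ""; next }
    {
      pos = index($0, " = ")          # split on the first " = " only
      if (pos == 0) next              # skip lines that are not fields
      value = substr($0, pos + 3)
      row = (n == 0 ? value : row "\t" value)
      n++
    }
    END { if (n > 0) print row }      # flush the last record
  '
}

# Example:
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet | records_to_tsv
```

Leading spaces inside a value, as in the first N_COMMENT, are kept as they are.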

8. Show meta info


# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet
file:        file:/tmp/nation.parquet
creator:     parquet-mr

file schema: root
--------------------------------------------------------------------------------
N_NATIONKEY: REQUIRED INT64 R:0 D:0
N_NAME:      REQUIRED BINARY O:UTF8 R:0 D:0
N_REGIONKEY: REQUIRED INT64 R:0 D:0
N_COMMENT:   REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:25 TS:1352 OFFSET:4
--------------------------------------------------------------------------------
N_NATIONKEY:  INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME:       BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY:  INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT:    BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
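The numbers above are consistent with reading SZ as compressed bytes / uncompressed bytes / ratio: for N_NATIONKEY, 219 / 130 is about 1.68, and the four uncompressed sizes (219 + 296 + 218 + 619) add up to the TS value of 1352. That reading is an assumption checked only against this sample. A small sketch to total the sizes per file, with the hypothetical function name sum_column_sizes:

```shell
# sum_column_sizes: read "meta" output on stdin and total the SZ fields.
# Hypothetical helper; assumes SZ:<compressed>/<uncompressed>/<ratio>.
sum_column_sizes() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^SZ:/) {
          split(substr($i, 4), sz, "/")   # strip "SZ:" and split on "/"
          compressed += sz[1]
          uncompressed += sz[2]
        }
      }
    }
    END {
      # Print nothing when no SZ field was seen, to avoid dividing by zero.
      if (compressed > 0)
        printf "compressed=%d uncompressed=%d ratio=%.2f\n",
               compressed, uncompressed, uncompressed / compressed
    }
  '
}

# Example:
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet | sum_column_sizes
```

For the four column lines shown above this gives compressed=944 uncompressed=1352 ratio=1.43.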

9. Dump all data

Note: Values are printed column by column rather than row by row.
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet
INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:2
(...)

BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:ALGERIA
value 2:  R:0 D:0 V:ARGENTINA
value 3:  R:0 D:0 V:BRAZIL
(...)

INT64 N_REGIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:1
(...)

BINARY N_COMMENT
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V: haggle. carefully f
value 2:  R:0 D:0 V:al foxes promise sly
value 3:  R:0 D:0 V:y alongside of the p
(...)
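Because dump prints one section per column, pulling out the values of a single column is a matter of tracking the section header. A minimal sketch, assuming headers of the form "TYPE NAME" and value lines of the form "value N:  R:... D:... V:..." as shown above. The function name dump_column_values is hypothetical.

```shell
# dump_column_values COLUMN: read "dump" output on stdin and print the
# values (the text after "V:") of the named column, one per line.
# Hypothetical helper based only on the output layout shown above.
dump_column_values() {
  awk -v want="$1" '
    # A header such as "BINARY N_NAME" starts a new column section.
    NF == 2 && $1 ~ /^[A-Z0-9_]+$/ { current = $2; next }
    # Value lines look like "value 1:  R:0 D:0 V:ALGERIA".
    $1 == "value" && current == want {
      pos = index($0, " V:")
      if (pos > 0) print substr($0, pos + 3)
    }
  '
}

# Example:
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet \
#   | dump_column_values N_NAME
```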


7 comments:

  1. Thanks for the compilation fix! Too bad that the project on GitHub does not include issues where this could be mentioned, because it is quite a useful fix.

  2. Can not clone this project.
    error: fatal: repository 'https://github.com/Parquet/parquet-mr.git/' not found
    Any idea how to clone it?

    Replies
    1. It is moved to :
      https://github.com/apache/parquet-mr/

    2. Thank you. OpenKB.

  3. [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project parquet-tools: Compilation failure: Compilation failure:
    [ERROR] /C:/bigdata/parquet-mr/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol
    [ERROR] symbol: method getCrc()
    [ERROR] location: variable pageV1 of type org.apache.parquet.column.page.DataPageV1

    Getting the above error while trying to run mvn clean package -Plocal

    Replies
    1. I am also getting the same error:
      parquet-mr-master/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol

      Any pointers please?
