Goal:
How to build and use parquet-tools to read Parquet files.
Solution:
1. Download and install Maven.
Follow the link below:
http://maven.apache.org/download.cgi
2. Download the parquet source code
git clone https://github.com/Parquet/parquet-mr.git
3. Build the parquet-tools.
cd parquet-mr/parquet-tools/
mvn clean package -Plocal
The resulting jar is target/parquet-tools.jar.
Note: you may hit an error such as the following:
Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository
This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, but that version does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
<version>1.6.1-SNAPSHOT</version>
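The version change can also be scripted. The following is only a sketch: it runs against a throwaway pom.xml fragment created in a temporary directory (an assumption made for illustration, not the real parquet-mr pom), and it assumes the old version string appears literally in the file. Review the result before applying the same substitution to a real checkout.

```shell
# Sketch: replace the missing snapshot version with a valid one.
# The pom.xml below is a stand-in fragment, not the real parquet-mr file.
old_version="1.6.0rc3-SNAPSHOT"
new_version="1.6.1-SNAPSHOT"

workdir=$(mktemp -d)
cat > "$workdir/pom.xml" <<EOF
<project>
  <version>${old_version}</version>
</project>
EOF

# Escape the dots so sed matches them literally rather than as wildcards.
old_pattern=$(printf '%s' "$old_version" | sed 's/\./\\./g')

# Write to a new file and move it into place (avoids non-portable sed -i).
sed "s/${old_pattern}/${new_version}/g" "$workdir/pom.xml" > "$workdir/pom.xml.new"
mv "$workdir/pom.xml.new" "$workdir/pom.xml"

fixed_line=$(grep '<version>' "$workdir/pom.xml")
echo "$fixed_line"
```

For a real checkout, the same sed command would be pointed at parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml in turn.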
4. Show help manual
cd target
java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help
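The jar file name includes the version chosen in pom.xml, so it can differ from build to build. A small helper can pick up whichever parquet-tools jar is present. This is a sketch only: the function name find_tools_jar is hypothetical, and the demo uses an empty placeholder file in a scratch directory rather than a real build.

```shell
# Hypothetical helper: print the first parquet-tools jar found in a directory.
find_tools_jar() {
  for candidate in "$1"/parquet-tools*.jar; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Demo on a scratch directory standing in for parquet-tools/target.
demo_target=$(mktemp -d)
touch "$demo_target/parquet-tools-1.6.1-SNAPSHOT.jar"

jar_path=$(find_tools_jar "$demo_target")
echo "$jar_path"
```

With a real build, the call would look like: java -jar "$(find_tools_jar target)" --help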
5. Dump the schema
Take the sample nation.parquet file as an example.
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet
message root {
  required int64 N_NATIONKEY;
  required binary N_NAME (UTF8);
  required int64 N_REGIONKEY;
  required binary N_COMMENT (UTF8);
}
6. Read the data
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT = haggle. carefully f
(... ...)
7. Read first n records
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT = haggle. carefully f

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly

N_NATIONKEY = 2
N_NAME = BRAZIL
N_REGIONKEY = 1
N_COMMENT = y alongside of the p
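The "name = value" records printed by cat and head can be turned into one tab-separated row per record for further processing. The sketch below works on two records copied from the output above and makes two assumptions: records are separated by a blank line, and the first " = " on a line separates the column name from the value.

```shell
# Sample records copied from the head output; in practice this text would
# come from piping the parquet-tools command into awk.
sample='N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT = haggle. carefully f

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly'

rows=$(printf '%s\n' "$sample" | awk '
# A blank line ends the current record (assumed separator).
/^$/ { if (n > 0) print row; row = ""; n = 0; next }
{
  pos = index($0, " = ")          # first " = " splits name from value
  if (pos == 0) next              # skip lines that are not name = value
  value = substr($0, pos + 3)
  row = (n == 0 ? value : row "\t" value)
  n++
}
# Flush the last record, which has no trailing blank line.
END { if (n > 0) print row }')

printf '%s\n' "$rows"
```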
8. Show meta info
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet
file: file:/tmp/nation.parquet
creator: parquet-mr
file schema: root
--------------------------------------------------------------------------------
N_NATIONKEY: REQUIRED INT64 R:0 D:0
N_NAME: REQUIRED BINARY O:UTF8 R:0 D:0
N_REGIONKEY: REQUIRED INT64 R:0 D:0
N_COMMENT: REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:25 TS:1352 OFFSET:4
--------------------------------------------------------------------------------
N_NATIONKEY: INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME: BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY: INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT: BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
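The numbers in the meta output can be cross-checked with a little arithmetic. The sketch below reads each SZ field as compressed size / uncompressed size / ratio, which is an interpretation rather than something stated in the output itself, but the figures are consistent with it: the second number divided by the first reproduces the printed ratio, and the uncompressed sizes add up to the row group's TS value of 1352.

```shell
# Column-chunk lines copied from the meta output above.
meta_lines='N_NATIONKEY: INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME: BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY: INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT: BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED'

summary=$(printf '%s\n' "$meta_lines" | LC_ALL=C awk '
{
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^SZ:/) {
      sub(/^SZ:/, "", $i)          # leaves "130/219/1.68"
      split($i, sz, "/")           # sz[1]=compressed, sz[2]=uncompressed (assumed)
      compressed += sz[1]
      uncompressed += sz[2]
      printf "%s ratio=%.2f\n", $1, sz[2] / sz[1]
    }
  }
}
END { printf "compressed=%d uncompressed=%d\n", compressed, uncompressed }')

echo "$summary"
```

The compressed sizes also explain the FPO offsets: each column starts where the previous one ends (4 + 130 = 134, 134 + 267 = 401, 401 + 79 = 480).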
9. Dump all data
Note: Values are in column format.
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet
INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:0
value 2: R:0 D:0 V:1
value 3: R:0 D:0 V:2
(...)

BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:ALGERIA
value 2: R:0 D:0 V:ARGENTINA
value 3: R:0 D:0 V:BRAZIL
(...)

INT64 N_REGIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:0
value 2: R:0 D:0 V:1
value 3: R:0 D:0 V:1
(...)

BINARY N_COMMENT
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V: haggle. carefully f
value 2: R:0 D:0 V:al foxes promise sly
value 3: R:0 D:0 V:y alongside of the p
(...)
Thanks for the compilation fix! Too bad that the project on GitHub does not include issues where this could be mentioned, because it is quite a useful fix.
Cannot clone this project.
error: fatal: repository 'https://github.com/Parquet/parquet-mr.git/' not found
Any idea how to clone it?
It has moved to:
https://github.com/apache/parquet-mr/
Thank you. OpenKB.
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project parquet-tools: Compilation failure: Compilation failure:
[ERROR] /C:/bigdata/parquet-mr/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol
[ERROR] symbol: method getCrc()
[ERROR] location: variable pageV1 of type org.apache.parquet.column.page.DataPageV1
I get the above error when running mvn clean package -Plocal
I am also getting the same error:
parquet-mr-master/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol
Any pointers please?
This is great.