Goal:
This article explains how to use PyArrow to view the metadata stored in a Parquet file.
Env:
CentOS 7
Solution:
1. Create a Python 3 virtual environment
This step is needed because the default Python version on CentOS/Red Hat 7 is 2.x, which is too old to install the latest version of PyArrow.
Use Python 3 and its pip3 instead.
However, if we simply use "alternatives" to point python at python3, it may break other tools such as "yum", which depend on Python 2.
A virtual environment is the easiest way to keep both Python 2 and Python 3 on CentOS 7.
python3 -m venv .venv
. .venv/bin/activate
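To confirm that the activation worked, a quick check from inside Python can help. The following is a minimal sketch, not part of the original steps; the helper names in_virtualenv and is_python3 are made up for illustration. It relies on the fact that inside a venv sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the system installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


def is_python3():
    """Return True when the interpreter is Python 3 or newer."""
    return sys.version_info[0] >= 3


print("Python version:", sys.version.split()[0])
print("Python 3:", is_python3())
print("Inside a virtual environment:", in_virtualenv())
```

If the last line prints False, the shell is most likely still using the system interpreter and the activate script should be sourced again.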
2. Install PyArrow and its dependencies
pip install --upgrade pip setuptools
pip install Cython
pip install pyarrow
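Before moving on, it may be worth checking that the package is importable from the active environment. This is a small sketch under the assumption that the install above succeeded; pyarrow_version is a hypothetical helper name, and it returns None instead of raising when PyArrow is missing.

```python
def pyarrow_version():
    """Return the installed PyArrow version string, or None if it is missing."""
    try:
        import pyarrow
    except ImportError:
        return None
    return pyarrow.__version__


version = pyarrow_version()
if version is None:
    print("PyArrow is not installed in this environment")
else:
    print("PyArrow version:", version)
```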
3. Read the metadata inside a Parquet file
>>> import pyarrow.parquet as pq
>>> parquet_file = pq.ParquetFile('/.../part-00000-67861019-20bb-4396-96f8-146141351ff2-c000.snappy.parquet')
>>> parquet_file.metadata
<pyarrow._parquet.FileMetaData object at 0x7f8014250bf8>
created_by: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
num_columns: 10
num_rows: 546097
num_row_groups: 1
format_version: 1.0
serialized_size: 1886
>>> parquet_file.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x7f808aaf4f98>
num_columns: 10
num_rows: 546097
total_byte_size: 17515040
>>> parquet_file.metadata.row_group(0).column(3)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f801356cf48>
file_offset: 6588315
file_path:
physical_type: INT64
num_values: 546097
path_in_schema: Index
is_stats_set: True
statistics:
  <pyarrow._parquet.Statistics object at 0x7f8013fd2ea8>
    has_min_max: True
    min: 0
    max: 396316
    null_count: 0
    distinct_count: 0
    num_values: 546097
    physical_type: INT64
    logical_type: None
    converted_type (legacy): NONE
compression: SNAPPY
encodings: ('BIT_PACKED', 'RLE', 'PLAIN')
has_dictionary_page: False
dictionary_page_offset: None
data_page_offset: 6588315
total_compressed_size: 2277936
total_uncompressed_size: 4369155
>>> parquet_file.metadata.row_group(0).column(3).statistics
<pyarrow._parquet.Statistics object at 0x7f801356cef8>
has_min_max: True
min: 0
max: 396316
null_count: 0
distinct_count: 0
num_values: 546097
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
From the above information, we can tell that:
- The file was written by parquet-mr version 1.10.1 (created_by), and its Parquet format version is 1.0 (format_version).
- It has only 1 row group.
- It has 10 columns and 546097 rows.
- The 4th column (.column(3)), named "Index", is an INT64 column with min=0 and max=396316.