Friday, November 1, 2019

Cannot drop a partition or a Hive partition table due to UTF8 characters in the partition value

Symptom:

Cannot drop a partition, or even drop the whole Hive partition table, when a partition value contains UTF8 characters.
The partition key value with UTF8 characters shows as "???" both in Hive and in MySQL (the Hive Metastore backend database).
For example, a user may create a Hive partition table named "hao3" and do a dynamic partition insert into it from a table named "t3" whose partition key contains some Chinese characters.
hive> desc t3;
OK
name                 string
partition_col        string
age                  int

hive> select * from t3;
OK
abc 小明 20
def part2 15
ghi part3 36
ijk part4 50

hive> CREATE TABLE hao3(
  name string,
  age int)
PARTITIONED BY (
  partition_col1 string);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table hao3 partition(partition_col1) select name,age,partition_col from t3;
After the dynamic partition insert completes, the problematic partition shows up as "??" in Hive:
hive> show partitions hao3;
OK
partition_col1=??
partition_col1=part2
partition_col1=part3
partition_col1=part4
Time taken: 0.092 seconds, Fetched: 4 row(s)
hive> select * from hao3 ;
OK
def 15 part2
ghi 36 part3
ijk 50 part4
Time taken: 0.229 seconds, Fetched: 3 row(s)
The same "??" also appears in the backend MySQL database:
select p.PART_NAME,s.LOCATION,pkv.PART_KEY_VAL
from 
TBLS t, DBS d, PARTITIONS p, SDS s, PARTITION_KEY_VALS pkv
where t.DB_ID=d.DB_ID 
and t.TBL_NAME='hao3'
and d.NAME='default'
and p.TBL_ID=t.TBL_ID
and p.SD_ID=s.SD_ID
and p.PART_ID=pkv.PART_ID
;

+----------------------+-------------------------------------------------------+--------------+
| PART_NAME            | LOCATION                                              | PART_KEY_VAL |
+----------------------+-------------------------------------------------------+--------------+
| partition_col1=part3 | maprfs:/user/hive/warehouse/hao3/partition_col1=part3 | part3        |
| partition_col1=part2 | maprfs:/user/hive/warehouse/hao3/partition_col1=part2 | part2        |
| partition_col1=part4 | maprfs:/user/hive/warehouse/hao3/partition_col1=part4 | part4        |
| partition_col1=??    | maprfs:/user/hive/warehouse/hao3/partition_col1=??    | ??           |
+----------------------+-------------------------------------------------------+--------------+
4 rows in set (0.00 sec)
Furthermore, you cannot drop the partition, or even rename or drop the table:
hive> alter table hao3 drop partition(partition_col1='小明');
Dropped the partition partition_col1=%3F%3F
OK
Time taken: 0.256 seconds
hive> show partitions hao3;
OK
partition_col1=??
partition_col1=part2
partition_col1=part3
partition_col1=part4
hive> drop table hao3;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
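Note the "%3F%3F" in the drop output above: Hive URL-encodes special characters in partition names, and %3F is simply the escaped form of "?". A quick Python check of that encoding (illustrative only, not Hive's actual code path):

```python
from urllib.parse import quote

# '?' is a special character in partition paths, so Hive percent-encodes it.
# Two question marks become %3F%3F, which matches the message Hive printed.
print(quote("??"))  # %3F%3F
```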

Root Cause:

The character set encoding in the backend MySQL database does not fully support UTF8, for two reasons:
1. The "character_set_server" is set to latin1 instead of utf8:
> show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)
2. The Hive metastore tables are created with the latin1 character set by default.
This is the current behavior in Hive (at least through version 2.3), as described in these Hive JIRAs:
https://issues.apache.org/jira/browse/HIVE-14156
https://issues.apache.org/jira/browse/HIVE-18083
The schema DDL shipped with Hive confirms it:
[mapr@v4 mysql]$ pwd
/opt/mapr/hive/hive-2.3/scripts/metastore/upgrade/mysql
[mapr@v4 mysql]$ grep CHARSET hive-schema-2.3.0.mysql.sql
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
...
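The "??" mangling can be reproduced outside MySQL. A minimal Python sketch (illustrative only, not MySQL's actual code path) of what happens when multi-byte UTF8 characters are pushed through a latin1 channel:

```python
# latin1 is a single-byte charset and cannot represent Chinese characters,
# so each unrepresentable character is replaced with '?', which is exactly
# what ends up stored in the metastore tables.
s = "小明"
mangled = s.encode("latin-1", errors="replace").decode("latin-1")
print(mangled)  # ??
```

Once the value is stored as literal "??", the original characters are lost, which is why a later "drop partition(partition_col1='小明')" can no longer match anything.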

Solution:

1. Make sure the MySQL character set variables are all set to UTF8.

In this case, we need to change "character_set_server" to utf8 by adding the line below to my.cnf under the [mysqld] section:
character-set-server=utf8
Then restart mysql.

2. Convert all Hive metastore tables in MySQL to the UTF8 character set.

For example, for each table run the two SQL statements below:
alter table SDS default character set utf8;
alter table SDS convert to character set utf8; 

For this case, the three most important metadata columns are:
SDS.LOCATION
PARTITIONS.PART_NAME
PARTITION_KEY_VALS.PART_KEY_VAL
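The same two-statement pattern shown above for SDS applies to each table. A small Python sketch (a hypothetical helper that just generates the SQL text, under the assumption that the SDS pattern generalizes):

```python
# Hypothetical helper: emit the same pair of ALTER statements shown above
# for each of the three key metastore tables.
TABLES = ["SDS", "PARTITIONS", "PARTITION_KEY_VALS"]

def charset_statements(tables):
    stmts = []
    for t in tables:
        stmts.append(f"alter table {t} default character set utf8;")
        stmts.append(f"alter table {t} convert to character set utf8;")
    return stmts

for stmt in charset_statements(TABLES):
    print(stmt)
```

Remember that PARTITIONS needs its PART_NAME column shortened first, as noted below.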

Note that for the PARTITIONS table, you need to reduce the length of the PART_NAME column from varchar(767) to varchar(255) before running the "convert to" command:
alter table PARTITIONS modify column PART_NAME varchar(255);
Otherwise the "convert to" command fails with the error below:
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
Losing some column length is the trade-off for getting UTF8 support.
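The 255 figure is not arbitrary. A quick arithmetic check, under the assumption that MySQL's utf8 charset (utf8mb3) uses at most 3 bytes per character:

```python
# An index key on this table is limited to 767 bytes, and each utf8mb3
# character can occupy up to 3 bytes, so the widest indexable varchar is
# floor(767 / 3) = 255 characters (255 * 3 = 765 <= 767).
MAX_KEY_BYTES = 767
BYTES_PER_UTF8_CHAR = 3
max_chars = MAX_KEY_BYTES // BYTES_PER_UTF8_CHAR
print(max_chars)  # 255
```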

3. Retry with the same UTF8 Chinese characters; everything now works fine.

hive> show partitions hao2;
OK
partition_col1=part2
partition_col1=part3
partition_col1=part4
partition_col1=小明
Time taken: 0.092 seconds, Fetched: 4 row(s)
hive> select * from hao2 ;
OK
def 15 part2
ghi 36 part3
ijk 50 part4
abc 20 小明
Time taken: 0.211 seconds, Fetched: 4 row(s)
hive> alter table hao2 drop partition(partition_col1='小明');
Dropped the partition partition_col1=小明
OK
Time taken: 0.531 seconds
hive> show partitions hao2;
OK
partition_col1=part2
partition_col1=part3
partition_col1=part4
Time taken: 0.091 seconds, Fetched: 3 row(s)
hive> select * from hao2;
OK
def 15 part2
ghi 36 part3
ijk 50 part4
Time taken: 0.23 seconds, Fetched: 3 row(s)
Note: Set the encoding to UTF8 when you first create the Hive Metastore; modifying existing metadata afterwards is risky.
