Symptom:
Cannot drop a partition, or drop a Hive partitioned table, due to UTF8 characters in the partition value. The partition key value with UTF8 characters shows as "???" both in Hive and in MySQL (the Hive Metastore backend database).
For example,
A user may create a Hive partitioned table named "hao3" and do a dynamic partition insert into it from a table named "t3" that has some Chinese characters in the partition key.
hive> desc t3;
OK
name	string
partition_col	string
age	int

hive> select * from t3;
OK
abc	小明	20
def	part2	15
ghi	part3	36
ijk	part4	50

hive> CREATE TABLE hao3( name string, age int) PARTITIONED BY ( partition_col1 string);
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table hao3 partition(partition_col1) select name,age,partition_col from t3;

After the dynamic partition insert completes, the problematic partition will show as "??" in Hive:
hive> show partitions hao3;
OK
partition_col1=??
partition_col1=part2
partition_col1=part3
partition_col1=part4
Time taken: 0.092 seconds, Fetched: 4 row(s)

hive> select * from hao3;
OK
def	15	part2
ghi	36	part3
ijk	50	part4
Time taken: 0.229 seconds, Fetched: 3 row(s)

It also shows as "??" in the backend MySQL database:
select p.PART_NAME, s.LOCATION, pkv.PART_KEY_VAL
from TBLS t, DBS d, PARTITIONS p, SDS s, PARTITION_KEY_VALS pkv
where t.DB_ID=d.DB_ID and t.TBL_NAME='hao3' and d.NAME='default'
  and p.TBL_ID=t.TBL_ID and p.SD_ID=s.SD_ID and p.PART_ID=pkv.PART_ID;

+----------------------+-------------------------------------------------------+--------------+
| PART_NAME            | LOCATION                                              | PART_KEY_VAL |
+----------------------+-------------------------------------------------------+--------------+
| partition_col1=part3 | maprfs:/user/hive/warehouse/hao3/partition_col1=part3 | part3        |
| partition_col1=part2 | maprfs:/user/hive/warehouse/hao3/partition_col1=part2 | part2        |
| partition_col1=part4 | maprfs:/user/hive/warehouse/hao3/partition_col1=part4 | part4        |
| partition_col1=??    | maprfs:/user/hive/warehouse/hao3/partition_col1=??    | ??           |
+----------------------+-------------------------------------------------------+--------------+
4 rows in set (0.00 sec)

Furthermore, you cannot drop the partition, or even rename or drop the table:
hive> alter table hao3 drop partition(partition_col1='小明');
Dropped the partition partition_col1=%3F%3F
OK
Time taken: 0.256 seconds

hive> show partitions hao3;
OK
partition_col1=??
partition_col1=part2
partition_col1=part3
partition_col1=part4

hive> drop table hao3;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
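To confirm which partitions are affected at the metastore level, you can look for the mangled values directly in the PARTITIONS table. This is just a diagnostic sketch; note that '?' is a literal character in a SQL LIKE pattern (not a wildcard), so it matches the corrupted values themselves:

-- Diagnostic sketch: list metastore partitions whose names were mangled to "?".
SELECT PART_ID, PART_NAME
FROM PARTITIONS
WHERE PART_NAME LIKE '%?%';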
Root Cause:
The character set encoding in the backend MySQL may not support UTF8, for the reasons below:
1. The "character_set_server" is set to latin1 instead of utf8:
> show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)

2. The Hive metadata tables are created with the latin1 character set by default.
This is the current behavior in Hive (including at least version 2.3).
It is described in the following Hive JIRAs:
https://issues.apache.org/jira/browse/HIVE-14156
https://issues.apache.org/jira/browse/HIVE-18083
The reason is:
[mapr@v4 mysql]$ pwd
/opt/mapr/hive/hive-2.3/scripts/metastore/upgrade/mysql
[mapr@v4 mysql]$ grep CHARSET hive-schema-2.3.0.mysql.sql
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
...
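You can confirm this on an existing metastore by querying information_schema. This is a minimal sketch; it assumes the metastore database is named 'hive', so adjust TABLE_SCHEMA to match your deployment:

-- Sketch: show the default collation (and hence character set) of every
-- metastore table; expect latin1_* values before the conversion.
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'hive';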
Solution:
1. Make sure the MySQL character sets are all set to UTF8.
In this case, we need to change "character_set_server" to utf8 by adding the line below under the [mysqld] section of my.cnf, and then restarting MySQL:

character-set-server=utf8
2. Change all Hive metadata tables in MySQL to the UTF8 character set.
For example, for each table run the 2 SQLs below:

alter table SDS default character set utf8;
alter table SDS convert to character set utf8;
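Because the metastore contains dozens of tables, one way to avoid writing the two statements by hand for each table is to generate them from information_schema. This is only a sketch (it again assumes the metastore database is named 'hive'); review the generated statements and back up the metastore before running them:

-- Sketch: emit the pair of ALTER statements for every metastore table.
SELECT CONCAT('alter table ', TABLE_NAME, ' default character set utf8; ',
              'alter table ', TABLE_NAME, ' convert to character set utf8;')
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'hive'
  AND TABLE_TYPE = 'BASE TABLE';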
For this case, the 3 most important metadata table columns are:
SDS.LOCATION
PARTITIONS.PART_NAME
PARTITION_KEY_VALS.PART_KEY_VAL
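To verify that the conversion took effect on those three columns, a quick check against information_schema (the schema name 'hive' is again an assumption):

-- Sketch: confirm the three critical columns are now utf8.
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'hive'
  AND ((TABLE_NAME = 'SDS' AND COLUMN_NAME = 'LOCATION')
    OR (TABLE_NAME = 'PARTITIONS' AND COLUMN_NAME = 'PART_NAME')
    OR (TABLE_NAME = 'PARTITION_KEY_VALS' AND COLUMN_NAME = 'PART_KEY_VAL'));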
Note that for the PARTITIONS table, you need to reduce the length of the PART_NAME column from varchar(767) to varchar(255) before running the "convert to" command:

alter table PARTITIONS modify column PART_NAME varchar(255);

Otherwise the "convert to" command will fail with the error below, because PART_NAME is indexed and InnoDB limits index keys to 767 bytes; in utf8 each character can take up to 3 bytes, so varchar(255) (at most 765 bytes) is the longest indexed varchar that fits:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

This is a trade-off if you want to get UTF8 support.
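Putting the steps above together, the complete sequence for the PARTITIONS table looks like this (the other tables only need the last two statements):

alter table PARTITIONS modify column PART_NAME varchar(255);
alter table PARTITIONS default character set utf8;
alter table PARTITIONS convert to character set utf8;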
3. Retry with the same UTF8 Chinese characters; everything now works fine:

hive> show partitions hao2;
OK
partition_col1=part2
partition_col1=part3
partition_col1=part4
partition_col1=小明
Time taken: 0.092 seconds, Fetched: 4 row(s)

hive> select * from hao2;
OK
def	15	part2
ghi	36	part3
ijk	50	part4
abc	20	小明
Time taken: 0.211 seconds, Fetched: 4 row(s)

hive> alter table hao2 drop partition(partition_col1='小明');
Dropped the partition partition_col1=小明
OK
Time taken: 0.531 seconds

hive> show partitions hao2;
OK
partition_col1=part2
partition_col1=part3
partition_col1=part4
Time taken: 0.091 seconds, Fetched: 3 row(s)

hive> select * from hao2;
OK
def	15	part2
ghi	36	part3
ijk	50	part4
Time taken: 0.23 seconds, Fetched: 3 row(s)

Note: Please set the encoding to UTF8 when you create the Hive Metastore in the first place.
Otherwise it is risky to modify the existing metadata.
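For a new deployment, one way to do that is to create the metastore database with a UTF8 default before running the Hive schema scripts (the database name 'hive' is only an example):

-- Sketch: create the metastore database as UTF8 from the start.
CREATE DATABASE hive DEFAULT CHARACTER SET utf8;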