How to use snapshot feature of MapR-FS to isolate Hive reads and writes

Monday, March 11, 2019

How to use snapshot feature of MapR-FS to isolate Hive reads and writes

Goal:

As we know, Hive locks can be used to isolate reads and writes.
Please refer to my previous blogs for details:
http://www.openkb.info/2014/11/hive-locks-tablepartition-level.html
http://www.openkb.info/2018/07/hive-different-lock-behaviors-between.html

That lock mechanism is maintained by Hive and it could have performance overhead or unexpected lock behavior based on your application logic.
This article shows another way to use snapshot feature of MapR-FS to isolate Hive reads and writes.

Env:

MapR 6.1
Hiv 2.3

Solution:

The idea of this solution is straightforward and simple:

When Hive writes such as data loading finishes, take the MapR volume snapshot.
Create an external Hive table based on that snapshot Hive data as a read-only copy for Hive reads.
In the meantime, the Hive writes can still happen on the original Hive table.

The Hive application logic should handle the life cycle of the volume snapshot and Hive external table.
Below is a simple example:
1. Create a MapR Volume named "testhive" mounted at /testhive

maprcli volume create -name testhive -path /testhive

2. Create a Hive table in that volume

CREATE TABLE `testsnapshot`(
  `id` int
)
LOCATION
  'maprfs:/testhive/testsnapshot';

And Load 500 rows into this table:

insert into testsnapshot select 1 from src;

select count(*) from testsnapshot;
500

3. Create a snapshot on the volume when no writes are happening

maprcli volume snapshot create -snapshotname mysnapshot -volume testhive

The snapshot data for that Hive table named "testsnapshot" is:

$ hadoop fs -ls /testhive/.snapshot/mysnapshot/testsnapshot
Found 1 items
-rwxr-xr-x   3 mapr mapr       1000 2019-03-11 10:34 /testhive/.snapshot/mysnapshot/testsnapshot/000000_0

4. Create an external table on that snapshot data

CREATE EXTERNAL TABLE `testsnapshot_ext`(
  `id` int
)
LOCATION
  'maprfs:/testhive/.snapshot/mysnapshot/testsnapshot';

5. In the meantime, write into the original Hive table

insert into testsnapshot select 1 from src;
select count(*) from testsnapshot;
1000

select count(*) from testsnapshot_ext;
500

As you can see, the Hive external table "testsnapshot_ext" based on the snapshot data can be a read-only copy for Hive reads.
The Hive writes can still happen in the meantime on the original Hive table "testsnapshot".

This use case works well if you plan to periodically load data into the Hive table, and you do not want to cause inconsistency for Hive reads instead of using Hive locks feature.

Monday, March 11, 2019