This is a quick explanation of the HBase region split policy.
Regions are the basic element of availability and distribution for tables, and consist of a Store per Column Family. The hierarchy of objects is as follows:
Table
    Region
        Store (one per Column Family)
            MemStore
            StoreFile
                Block
Pre-split
There are two predefined split algorithms: HexStringSplit and UniformSplit.
1. HexStringSplit
The format of a HexStringSplit region boundary is the ASCII representation of an MD5 checksum, or any other uniformly distributed hexadecimal value. Rows are hex-encoded long values in the range "00000000" => "FFFFFFFF" and are left-padded with zeros to keep the same lexicographic order as if they were binary.
Sample:
The command below creates a table with 10 regions using the HexStringSplit algorithm:
hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
DEBUG util.RegionSplitter: Creating table test_table with 1 column families. Presplitting to 10 regions

10 regions created:
[root]# hadoop fs -ls /apps/hbase/data/test_table
Found 12 items
-rw-r--r-- 3 hbase hadoop 673 2014-05-21 09:54 /apps/hbase/data/test_table/.tableinfo.0000000001
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/.tmp
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/339d0eb61160df679c6ea628ee80b0d6
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/86da408d174d83aae3fb0bcdb68145c8
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/b0129aac1ec9f20a6a4ffe27b125cd27
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/b94f184ee55374ed5d5db71b88a7bc05
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/c003cd8b2ff3b4a9c6c653ce1a3c0fce
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/ca8cc09027606d6c51f189d61fe6eb4f
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/d41006f677c222b62695035364c528d6
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/e8e2f820883ccd5771d1470f3a36b88f
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/ef2cb46e051fdcccb070b1e637bb5fd5
drwxr-xr-x - hbase hadoop 0 2014-05-21 09:54 /apps/hbase/data/test_table/f7d3a744d584f44a890e398618c85c4f

You can find out the range for each region by:
hadoop fs -cat /apps/hbase/data/test_table/339d0eb61160df679c6ea628ee80b0d6/.regioninfo
STARTKEY => '99999996', ENDKEY => 'b333332f'
hadoop fs -cat /apps/hbase/data/test_table/86da408d174d83aae3fb0bcdb68145c8/.regioninfo
STARTKEY => '', ENDKEY => '19999999'
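The boundaries above can be reproduced with a little arithmetic. The following is a minimal sketch, assuming the splitter divides the range 00000000..FFFFFFFF into equally sized chunks using integer division; the function name `hex_string_split_points` is hypothetical and is not part of HBase. Under that assumption it yields the same keys seen in the .regioninfo files ('19999999', '99999996', 'b333332f').

```python
def hex_string_split_points(num_regions, first=0x00000000, last=0xFFFFFFFF):
    """Return the num_regions - 1 split keys as zero-padded 8-digit hex strings.

    Assumption: every region covers floor(range / num_regions) keys, so the
    i-th split key is first + i * chunk.
    """
    key_range = last - first + 1
    chunk = key_range // num_regions
    return [format(first + chunk * i, "08x") for i in range(1, num_regions)]


if __name__ == "__main__":
    # Print the start and end key of each region for a 10-region table.
    points = hex_string_split_points(10)
    bounds = [""] + points + [""]
    for start, end in zip(bounds, bounds[1:]):
        print("STARTKEY => %r, ENDKEY => %r" % (start, end))
```

The empty string stands for the open start of the first region and the open end of the last one, as in the .regioninfo output.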
2. UniformSplit
A SplitAlgorithm that divides the space of possible keys evenly. Useful when the keys are approximately uniform random bytes (e.g. hashes). Rows are raw byte values in the range 00 => FF and are right-padded with zeros to keep the same memcmp() order. This is the natural algorithm to use for a byte[] environment and saves space, but is not necessarily the easiest for readability.
Sample:
hbase org.apache.hadoop.hbase.util.RegionSplitter test_table3 UniformSplit -c 3 -f f1
DEBUG util.RegionSplitter: Creating table test_table3 with 1 column families. Presplitting to 3 regions

3 regions created:
[root@hdm ~]# hadoop fs -ls /apps/hbase/data/test_table3
Found 5 items
-rw-r--r-- 3 hbase hadoop 675 2014-05-21 14:09 /apps/hbase/data/test_table3/.tableinfo.0000000001
drwxr-xr-x - hbase hadoop 0 2014-05-21 14:09 /apps/hbase/data/test_table3/.tmp
drwxr-xr-x - hbase hadoop 0 2014-05-21 14:09 /apps/hbase/data/test_table3/02bcd58dc337bc28fac74ee0e36a11a2
drwxr-xr-x - hbase hadoop 0 2014-05-21 14:09 /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb
drwxr-xr-x - hbase hadoop 0 2014-05-21 14:09 /apps/hbase/data/test_table3/e9f130fc2ebbafb20e5ebc45ea3bc7bd

Range of region:
hadoop fs -cat /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/.regioninfo
STARTKEY => '', ENDKEY => 'UUUUUUUU'
hadoop fs -cat /apps/hbase/data/test_table3/02bcd58dc337bc28fac74ee0e36a11a2/.regioninfo
STARTKEY => 'UUUUUUUU', ENDKEY => '\xAA\xAA\xAA\xAA\xAA\xAA\xAA\xAA'
hadoop fs -cat /apps/hbase/data/test_table3/e9f130fc2ebbafb20e5ebc45ea3bc7bd/.regioninfo
STARTKEY => '\xAA\xAA\xAA\xAA\xAA\xAA\xAA\xAA', ENDKEY => ''

'U' is the byte 0x55, so the keys "1" (0x31) and "2" (0x32) sort before 'UUUUUUUU' and land in the first region, while "zzz" (0x7A bytes) sorts between 'UUUUUUUU' and '\xAA\xAA\xAA\xAA\xAA\xAA\xAA\xAA' and lands in the second region:
put 'test_table3','1','f1:col1','data_1_col1'
# hadoop fs -ls /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1
Found 1 items
-rw-r--r-- 3 hbase hadoop 697 2014-05-21 14:12 /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1/585ad8fa038c434880848a260160eed2

put 'test_table3','2','f1:col1','data_2_col1'
# hadoop fs -ls /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1
Found 2 items
-rw-r--r-- 3 hbase hadoop 697 2014-05-21 14:13 /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1/1d26504ba52444309dc03c0a4ef92283
-rw-r--r-- 3 hbase hadoop 697 2014-05-21 14:12 /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1/585ad8fa038c434880848a260160eed2

put 'test_table3','zzz','f1:col1','data_zzz_col1'
# hadoop fs -ls /apps/hbase/data/test_table3/02bcd58dc337bc28fac74ee0e36a11a2/f1
Found 1 items
-rw-r--r-- 3 hbase hadoop 705 2014-05-21 15:05 /apps/hbase/data/test_table3/02bcd58dc337bc28fac74ee0e36a11a2/f1/a10e61a74a394e5d93a20eb61372d674
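To see where 'UUUUUUUU' and '\xAA\xAA\xAA\xAA\xAA\xAA\xAA\xAA' come from, here is a rough sketch that treats an 8-byte key as one big unsigned integer and cuts the range 00..FF into equal parts. This is only an approximation of the idea; HBase's own byte-splitting code may round differently for other region counts, and the function name `uniform_split_points` is made up for this example.

```python
def uniform_split_points(num_regions, width=8):
    """Return num_regions - 1 split keys, each `width` raw bytes long.

    Assumption: the key space runs from width * 0x00 to width * 0xFF and is
    divided evenly when read as a big-endian unsigned integer.
    """
    max_key = (1 << (8 * width)) - 1  # 0xFF repeated `width` times
    return [
        (max_key * i // num_regions).to_bytes(width, "big")
        for i in range(1, num_regions)
    ]


if __name__ == "__main__":
    low, high = uniform_split_points(3)
    print(low, high)
    # Python compares bytes as unsigned values, left to right, like memcmp().
    print(b"1" < low, b"2" < low)   # both keys fall below the first split key
    print(low <= b"zzz" < high)     # "zzz" falls between the two split keys
```

For three regions the two split keys are the byte 0x55 repeated eight times (printed as 'UUUUUUUU') and the byte 0xAA repeated eight times, which matches the ranges listed above.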
3. Desired split points
If you already have split points at hand, you can also use the HBase shell to create the table with the desired split points.
Sample:
create 'test_table2', 'f1', {SPLITS => ['a', 'b', 'c']}

4 regions created (3 split points give 4 regions):
# hadoop fs -ls /apps/hbase/data/test_table2/
Found 6 items
-rw-r--r-- 3 hbase hadoop 675 2014-05-21 13:06 /apps/hbase/data/test_table2/.tableinfo.0000000001
drwxr-xr-x - hbase hadoop 0 2014-05-21 13:06 /apps/hbase/data/test_table2/.tmp
drwxr-xr-x - hbase hadoop 0 2014-05-21 13:06 /apps/hbase/data/test_table2/17eb744fc9788cab51f92d4e9ed740d7
drwxr-xr-x - hbase hadoop 0 2014-05-21 13:06 /apps/hbase/data/test_table2/b8ef19896ac8e43ab5c050c01f129329
drwxr-xr-x - hbase hadoop 0 2014-05-21 13:06 /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd
drwxr-xr-x - hbase hadoop 0 2014-05-21 13:06 /apps/hbase/data/test_table2/c9baa85d4d5302d8fa53e807741d323d

Range of region:
hadoop fs -cat /apps/hbase/data/test_table2/c9baa85d4d5302d8fa53e807741d323d/.regioninfo
STARTKEY => '', ENDKEY => 'a'
hadoop fs -cat /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/.regioninfo
STARTKEY => 'a', ENDKEY => 'b'
hadoop fs -cat /apps/hbase/data/test_table2/17eb744fc9788cab51f92d4e9ed740d7/.regioninfo
STARTKEY => 'b', ENDKEY => 'c'
hadoop fs -cat /apps/hbase/data/test_table2/b8ef19896ac8e43ab5c050c01f129329/.regioninfo
STARTKEY => 'c', ENDKEY => ''

The keys "a", "b", "123" and "abcd" fall into the following regions:
put 'test_table2','a','f1:col1','data_a_col1'
# hadoop fs -ls /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1
Found 1 items
-rw-r--r-- 3 hbase hadoop 697 2014-05-21 13:11 /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1/4c54189cb64f452a98e722a6bfef23b7

put 'test_table2','b','f1:col1','data_b_col1'
# hadoop fs -ls /apps/hbase/data/test_table2/17eb744fc9788cab51f92d4e9ed740d7/f1
Found 1 items
-rw-r--r-- 3 hbase hadoop 697 2014-05-21 13:37 /apps/hbase/data/test_table2/17eb744fc9788cab51f92d4e9ed740d7/f1/4159bd0e73dd4dcaad49efbead735851

put 'test_table2','123','f1:col1','data_123_col1'
# hadoop fs -ls /apps/hbase/data/test_table2/c9baa85d4d5302d8fa53e807741d323d/f1
Found 1 items
-rw-r--r-- 3 hbase hadoop 705 2014-05-21 13:38 /apps/hbase/data/test_table2/c9baa85d4d5302d8fa53e807741d323d/f1/37c2532ae31a4904ad593887ce9dd70c

put 'test_table2','abcd','f1:col1','data_abcd_col1'
# hadoop fs -ls /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1
Found 2 items
-rw-r--r-- 3 hbase hadoop 697 2014-05-21 13:11 /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1/4c54189cb64f452a98e722a6bfef23b7
-rw-r--r-- 3 hbase hadoop 709 2014-05-21 13:39 /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1/a27ae4c6a09644f7b7c9c23344a878fc
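The placement of the four keys follows from the region ranges: a region holds every row key that is greater than or equal to its start key and less than its end key. The sketch below looks up the range for a key with a binary search over the split points; `region_range` is an illustrative helper, not an HBase API.

```python
import bisect

# Split points used when creating test_table2.
SPLITS = [b"a", b"b", b"c"]


def region_range(row_key, splits):
    """Return the (start_key, end_key) of the region that holds row_key.

    The start key is inclusive and the end key is exclusive; b"" marks the
    open start of the first region and the open end of the last region.
    """
    index = bisect.bisect_right(splits, row_key)
    start = splits[index - 1] if index > 0 else b""
    end = splits[index] if index < len(splits) else b""
    return start, end


if __name__ == "__main__":
    for key in (b"a", b"b", b"123", b"abcd"):
        print(key, "->", region_range(key, SPLITS))
```

"123" sorts before "a" and so lands in the first region, while "abcd" starts with "a" but is still smaller than "b", so it shares a region with "a".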
Auto Split
Once a region reaches a certain size limit, it is automatically split into two regions.
Three predefined auto split policies are covered here: ConstantSizeRegionSplitPolicy, IncreasingToUpperBoundRegionSplitPolicy, and KeyPrefixRegionSplitPolicy.
hbase.regionserver.region.split.policy
A split policy determines when a region should be split. The various other split policies that are available currently are: ConstantSizeRegionSplitPolicy, DisabledRegionSplitPolicy, DelimitedKeyPrefixRegionSplitPolicy, KeyPrefixRegionSplitPolicy etc.
Default: org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy
1. ConstantSizeRegionSplitPolicy
A RegionSplitPolicy implementation which splits a region as soon as any of its store files exceeds a maximum configurable size ("hbase.hregion.max.filesize", default = 10GB).
This was the default split policy before 0.94.0. From 0.94.0 on, the default split policy is IncreasingToUpperBoundRegionSplitPolicy.
hbase.hregion.max.filesize
Maximum HStoreFile size. If any one of a column families' HStoreFiles has grown to exceed this value, the hosting HRegion is split in two.
Default: 10737418240
2. IncreasingToUpperBoundRegionSplitPolicy
For 0.94:
The split size is the square of the number of regions of the same table hosted on this region server, times the region flush size, or the maximum region split size, whichever is smaller.
Min (R^2 * "hbase.hregion.memstore.flush.size", "hbase.hregion.max.filesize"), where R is the number of regions of the same table hosted on the same region server.
By default, "hbase.hregion.memstore.flush.size" = 128MB and "hbase.hregion.max.filesize" = 10GB.

hbase.hregion.memstore.flush.size
Memstore will be flushed to disk if size of the memstore exceeds this number of bytes. Value is checked by a thread that runs every hbase.server.thread.wakefrequency.
Default: 134217728

So the split sizes for R = 1, 2, 3, ... are: 128MB, 512MB, 1152MB, 2048MB (2GB), 3200MB, 4608MB, 6272MB, 8192MB (8GB), 10GB, 10GB, ...
From R = 9 on, R^2 * 128MB exceeds 10GB, so the 10GB maximum is used.
For 0.98:
The split size is the cube of the number of regions of the same table hosted on this region server, times 2x the region flush size, or the maximum region split size, whichever is smaller.
Min (R^3 * 2 * "hbase.hregion.memstore.flush.size", "hbase.hregion.max.filesize"), where R is the number of regions of the same table hosted on the same region server.
So the split sizes are: 256MB, 2GB, 6.75GB, 10GB, 10GB, ...
In short, different versions may use different formulas.
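Both formulas are easy to check by hand or with a few lines of code. The sketch below simply evaluates the two Min(...) expressions quoted above with the default values; the function names are made up for this example, and it does not read any HBase configuration.

```python
MB = 1024 * 1024
FLUSH_SIZE = 128 * MB          # default hbase.hregion.memstore.flush.size
MAX_FILESIZE = 10 * 1024 * MB  # default hbase.hregion.max.filesize


def split_size_094(regions):
    """Split size under the 0.94 formula: min(R^2 * flush size, max file size)."""
    return min(regions ** 2 * FLUSH_SIZE, MAX_FILESIZE)


def split_size_098(regions):
    """Split size under the 0.98 formula: min(R^3 * 2 * flush size, max file size)."""
    return min(regions ** 3 * 2 * FLUSH_SIZE, MAX_FILESIZE)


if __name__ == "__main__":
    print("R  0.94 (MB)  0.98 (MB)")
    for r in range(1, 11):
        print("%-2d %-10d %d" % (r, split_size_094(r) // MB, split_size_098(r) // MB))
```

With the defaults, the 0.94 formula reaches the 10GB cap at R = 9, while the 0.98 formula reaches it already at R = 4.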
3. KeyPrefixRegionSplitPolicy
A custom RegionSplitPolicy that groups rows by a prefix of the row key. This ensures that a region is not split "inside" a prefix of a row key, i.e. rows can be co-located in a region by their prefix.
The "prefix_split_key_policy.prefix_length" attribute of the table defines the prefix length.
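One way to picture this: whatever split point would normally be chosen, only its first prefix_length bytes are kept, so the boundary always falls at the start of a prefix group. The following is a small sketch of that idea as I understand it, not the HBase source; the helper name and the sample row keys are hypothetical.

```python
def prefix_split_point(candidate, prefix_length):
    """Cut a candidate split key down to its first prefix_length bytes.

    If prefix_length is not positive or the candidate is empty, the
    candidate is returned unchanged.
    """
    if prefix_length > 0 and candidate:
        return candidate[:prefix_length]
    return candidate


if __name__ == "__main__":
    # Hypothetical row keys that share the 6-byte prefix "user42".
    rows = [b"user42|a", b"user42|m", b"user42|z"]
    split_key = prefix_split_point(b"user42|m", 6)
    print(split_key)
    # Every row with that prefix sorts at or after the truncated split key,
    # so the whole group ends up on the same side of the split.
    print(all(row >= split_key for row in rows))
```

Without the truncation, a split at "user42|m" would separate "user42|a" from the other two rows.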
Force Split
hbase(main):004:0> help 'split'
Split entire table or pass a region to split individual region. With the second parameter, you can specify an explicit split key for the region.
Examples:
split 'tableName'
split 'regionName' # format: 'tableName,startKey,id'
split 'tableName', 'splitKey'
split 'regionName', 'splitKey'

Sample:
create 'testforce','f1'
put 'testforce','row1','f1:col1','data1'
put 'testforce','row2','f1:col1','data2'
put 'testforce','row3','f1:col1','data3'
put 'testforce','row4','f1:col1','data4'
flush 'testforce'

# hadoop fs -ls /apps/hbase/data/testforce/1a146d535be7662bb1102e44961ddb7e/f1
Found 1 items
-rw-r--r-- 3 hbase hadoop 808 2014-05-21 15:56 /apps/hbase/data/testforce/1a146d535be7662bb1102e44961ddb7e/f1/dbe88fd159324a9499405a8536c66c4b

[root@hdm ~]# hadoop fs -ls /apps/hbase/data/testforce
Found 3 items
-rw-r--r-- 3 hbase hadoop 671 2014-05-21 15:55 /apps/hbase/data/testforce/.tableinfo.0000000001
drwxr-xr-x - hbase hadoop 0 2014-05-21 15:55 /apps/hbase/data/testforce/.tmp
drwxr-xr-x - hbase hadoop 0 2014-05-21 15:56 /apps/hbase/data/testforce/1a146d535be7662bb1102e44961ddb7e

hbase(main):034:0> split '1a146d535be7662bb1102e44961ddb7e','row2'
0 row(s) in 0.0420 seconds

# hadoop fs -ls /apps/hbase/data/testforce
Found 5 items
-rw-r--r-- 3 hbase hadoop 671 2014-05-21 15:55 /apps/hbase/data/testforce/.tableinfo.0000000001
drwxr-xr-x - hbase hadoop 0 2014-05-21 15:55 /apps/hbase/data/testforce/.tmp
drwxr-xr-x - hbase hadoop 0 2014-05-21 15:58 /apps/hbase/data/testforce/1a146d535be7662bb1102e44961ddb7e
drwxr-xr-x - hbase hadoop 0 2014-05-21 15:58 /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60
drwxr-xr-x - hbase hadoop 0 2014-05-21 15:58 /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03

# hadoop fs -ls /apps/hbase/data/testforce
Found 4 items
-rw-r--r-- 3 hbase hadoop 671 2014-05-21 15:55 /apps/hbase/data/testforce/.tableinfo.0000000001
drwxr-xr-x - hbase hadoop 0 2014-05-21 15:55 /apps/hbase/data/testforce/.tmp
drwxr-xr-x - hbase hadoop 0 2014-05-21 15:59 /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60
drwxr-xr-x - hbase hadoop 0 2014-05-21 15:59 /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03

# hadoop fs -ls /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03/f1
Found 1 items
-rw-r--r-- 3 hbase hadoop 715 2014-05-21 15:58 /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03/f1/3ba626091f7948eb9b19a328fe108716

# hadoop fs -ls /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60/f1
Found 1 items
-rw-r--r-- 3 hbase hadoop 645 2014-05-21 15:58 /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60/f1/eb093dbeb36a44c29d049135f0fcbfe8

# hadoop fs -cat /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03/f1/3ba626091f7948eb9b19a328fe108716
row2f1col1F data2 row3f1col1F data3 row4f1col1F data4

# hadoop fs -cat /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60/f1/eb093dbeb36a44c29d049135f0fcbfe8
row1f1col1F data1
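The contents of the two new store files follow directly from the split key: rows that sort before 'row2' go to one daughter region and rows at or after 'row2' go to the other. A tiny sketch of that partitioning, with a made-up helper name, assuming plain byte-wise ordering of row keys:

```python
def split_rows(rows, split_key):
    """Partition row keys into (lower daughter, upper daughter).

    The lower daughter takes keys < split_key and the upper daughter takes
    keys >= split_key, so the split key itself lands in the upper daughter.
    """
    lower = sorted(row for row in rows if row < split_key)
    upper = sorted(row for row in rows if row >= split_key)
    return lower, upper


if __name__ == "__main__":
    lower, upper = split_rows([b"row1", b"row2", b"row3", b"row4"], b"row2")
    print("lower daughter:", lower)
    print("upper daughter:", upper)
```

This matches the sample: one daughter's store file holds only row1, and the other holds row2, row3 and row4.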
Thanks for sharing this valuable post. I have some confusion.
I don't understand why we need to split a column as daughterA and daughterB.