Goal:
How to use MapR local volume as the spill directory for Apache Drill
Env:
Apache Drill on MapR
Solution:
By default, each drillbit uses "/tmp" on its local disk as the spill directory. However, if the local disk is not large enough for a huge query that needs a lot of spill space, another option is to use a MapR local volume as the spill directory.
Of course, the disks backing the MapR local volume should be large enough.
1. Create a MapR local volume for each node.
A MapR local volume is just a volume limited by its topology to reside only on its own node. Here is one example of a local volume created by MapR for MapReduce jobs.
Here I have 3 nodes whose "hostname -f" outputs are shown below:
v1.poc.com
v2.poc.com
v3.poc.com
Assume that on each node I have created a local volume at the corresponding path below:
/tmp/v1.poc.com
/tmp/v2.poc.com
/tmp/v3.poc.com
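As a rough sketch of step 1, the script below only prints one "maprcli volume create" command per node instead of running it. The volume name "drill.spill.<host>" is a hypothetical naming scheme, and the flags used (-replication 1, -localvolumehost) are assumed to be available in your MapR version, so verify them against the maprcli documentation before executing the printed commands.

```shell
#!/bin/sh
# Dry-run sketch: build (but do not execute) a volume-create command for one node.
# Assumptions: the volume name "drill.spill.<host>" is made up for illustration,
# and the mount path follows the /tmp/<host> layout used in this article.
volume_create_cmd() {
  host="$1"
  printf 'maprcli volume create -name drill.spill.%s -path /tmp/%s -replication 1 -localvolumehost %s\n' \
    "$host" "$host" "$host"
}

# Print the command for each of the three sample nodes.
for h in v1.poc.com v2.poc.com v3.poc.com; do
  volume_create_cmd "$h"
done
```

Review the printed commands, then run them on the cluster as a user allowed to create volumes.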
2. Add an environment variable to drill-env.sh on all nodes.
export DRILL_LOCALHOST=`hostname -f`
3. Add the configurations for spill directory in drill-override.conf on all nodes.
Sample:
drill.exec: {
  cluster-id: "my_cluster_com-drillbits",
  zk.connect: "v1.poc.com:5181,v2.poc.com:5181,v3.poc.com:5181",
  sort.external.spill.directories: ["/tmp/"${DRILL_LOCALHOST}],
  sort.external.spill.fs: "maprfs:///"
}
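Note: the environment variable ${DRILL_LOCALHOST} must be outside the double quotes, because substitutions inside a quoted string are not expanded.
So if there is a subdirectory after it, close the quotes before the variable and reopen them afterwards. A sample configuration is:
sort.external.spill.directories: ["/var/mapr/local/"${DRILL_LOCALHOST}"/drillspill/"],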
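The quoting rule in the note above is easy to get wrong, so here is a minimal sketch of a check for it. The helper name has_quoted_substitution is hypothetical. It splits each line on double quotes and reports success if ${DRILL_LOCALHOST} appears inside a quoted segment. It assumes each string sits on a single line and contains no escaped quotes.

```shell
#!/bin/sh
# Succeeds (exit 0) if any input line has ${DRILL_LOCALHOST} INSIDE double quotes,
# which is the mistake to avoid. Splitting on '"' makes the even-numbered fields
# the quoted segments. Assumes no escaped quotes and no multi-line strings.
has_quoted_substitution() {
  awk -F'"' '
    {
      for (i = 2; i <= NF; i += 2)
        if (index($i, "${DRILL_LOCALHOST}") > 0) found = 1
    }
    END { exit !found }
  '
}

# Usage (the config path depends on your installation):
#   if has_quoted_substitution < drill-override.conf; then
#     echo "DRILL_LOCALHOST is inside quotes and will not be expanded"
#   fi
```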
4. Restart all drillbits
maprcli node services -name drill-bits -action restart -filter csvc=="drill-bits"
5. Check the configurations
> select * from sys.drillbits where `current`=true;
+-------------+------------+---------------+------------+----------+
| hostname    | user_port  | control_port  | data_port  | current  |
+-------------+------------+---------------+------------+----------+
| v1.poc.com  | 31010      | 31011         | 31012      | true     |
+-------------+------------+---------------+------------+----------+
1 row selected (1.132 seconds)

> select name,string_val from sys.boot where name in ('drill.exec.sort.external.spill.fs','drill.exec.sort.external.spill.directories');
+---------------------------------------------+-------------------------------------------------------------------------------------------+
| name                                        | string_val                                                                                |
+---------------------------------------------+-------------------------------------------------------------------------------------------+
| drill.exec.sort.external.spill.directories  | [ # merge of drill-override.conf: 27,env var DRILL_LOCALHOST "/tmp/v1.poc.com" ]          |
| drill.exec.sort.external.spill.fs           | "maprfs:///"                                                                              |
+---------------------------------------------+-------------------------------------------------------------------------------------------+
2 rows selected (0.991 seconds)
6. Test a sample sort-heavy query with the minimum memory to trigger spilling.
alter session set `planner.memory.max_query_memory_per_node`=1048576;
select col1,col2,col3,count(*) from hive.a_large_table group by col1,col2,col3 order by count(*) limit 10;
Then keep monitoring the spill directory while the query runs and check whether a spill file like the one below is generated:
227fec0a-5781-f276-74db-ebd23342292c_HashAgg_2-2-0
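One way to do the monitoring is to list the spill directory in a loop and keep only the entries that look like spill files. The sketch below assumes the name pattern seen in the sample above (a query id, an operator name, then three dash-separated numbers). That pattern is inferred from this single example, not taken from Drill documentation. The helper name filter_spill_files is hypothetical.

```shell
#!/bin/sh
# Keep only listing lines that contain something shaped like the sample spill
# file name: <8-4-4-4-12 hex id>_<Operator>_<n>-<n>-<n>.
# The pattern is an assumption inferred from one example.
filter_spill_files() {
  grep -E '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}_[A-Za-z]+_[0-9]+-[0-9]+-[0-9]+'
}

# Usage on a cluster node while the query is running (needs the hadoop client):
#   while true; do
#     hadoop fs -ls -R "/tmp/$(hostname -f)" | filter_spill_files
#     sleep 5
#   done
```

If nothing shows up, the query may not be spilling at all, or the drillbit may still be using its previous spill settings.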