Impala uses a producer-consumer model when queries need to read data.
1. Disk IO Manager
This is the producer, which schedules read and write requests to the disks and populates the IO buffers.
Name
It is shown as "disk-io-mgr" on the Impala web GUI "threadz" page, and as "impala::DiskIoMgr::WorkLoop" in the output of "pstack <impalad>", e.g.:
# grep "impala::DiskIoMgr::WorkLoop" "2015_pstack.log"
#3 0x00000000008ad06a in impala::DiskIoMgr::WorkLoop(impala::DiskIoMgr::DiskQueue*) ()
#3 0x00000000008ad06a in impala::DiskIoMgr::WorkLoop(impala::DiskIoMgr::DiskQueue*) ()
#3 0x00000000008ad06a in impala::DiskIoMgr::WorkLoop(impala::DiskIoMgr::DiskQueue*) ()
#3 0x00000000008ad06a in impala::DiskIoMgr::WorkLoop(impala::DiskIoMgr::DiskQueue*) ()
#3 0x00000000008ad06a in impala::DiskIoMgr::WorkLoop(impala::DiskIoMgr::DiskQueue*) ()
#3 0x00000000008ad06a in impala::DiskIoMgr::WorkLoop(impala::DiskIoMgr::DiskQueue*) ()
#3 0x00000000008ad06a in impala::DiskIoMgr::WorkLoop(impala::DiskIoMgr::DiskQueue*) ()
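A quick way to check how many Disk IO Manager threads are running is to count these WorkLoop frames in a pstack capture. A minimal sketch (the helper name and log path are illustrative, not part of Impala):

```shell
# Count Disk IO Manager threads in a saved pstack capture by their
# distinctive WorkLoop stack frame. Helper name is illustrative.
count_io_mgr_threads() {
  grep -c "impala::DiskIoMgr::WorkLoop" "$1"
}
```

For the capture shown above, `count_io_mgr_threads 2015_pstack.log` would print 7, one per Disk IO Manager thread.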
Default value
By default, the number of Disk IO Manager threads is determined by the number of disks on the server: 1 thread per rotational disk and 8 threads per flash disk.
Currently, Impala (version 2.0) does not take into account whether all the disks in a server are used for HDFS or not.
For example, even though only 1 disk is used by MapR-FS and the other 6 disks are not used, the number of Disk IO Manager threads is 7.
# maprcli disk list -host `hostname -f` | grep running
/dev/sda1 running 0 XXXGATE xxx xx91000640SS 421 ext4 1 AS03 500 79
/dev/sda2 running 0 XXXGATE xxx xx91000640SS 0 AS03 953368
/dev/sdb running 0 XXXGATE xxx xx91000640SS 0 AS03 953869
/dev/sdc running 0 XXXGATE xxx xx91000640SS 0 AS03 953869
/dev/sdd running 0 XXXGATE xxx xx91000640SS 0 AS03 953869
/dev/sde running 0 XXXGATE xxx xx91000640SS 0 AS03 953869
/dev/sdf running 0 XXXGATE xxx xx91000640SS 618603 1 MapR-FS 0 AS03 953869 335266
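The sizing rules can be sketched as a small calculation (the function below is illustrative, not Impala code): by default each rotational disk contributes 1 thread and each flash disk 8, and if "-num_threads_per_disk" is set (see the Configuration value section), every disk contributes that many threads instead.

```shell
# Sketch of the Disk IO Manager thread-count rules.
# Args: <rotational_disks> <flash_disks> [num_threads_per_disk]
io_mgr_thread_count() {
  local rotational=$1 flash=$2 per_disk=${3:-}
  if [ -n "$per_disk" ]; then
    # -num_threads_per_disk overrides the per-disk defaults.
    echo $(( (rotational + flash) * per_disk ))
  else
    # Default: 1 thread per rotational disk, 8 per flash disk.
    echo $(( rotational + flash * 8 ))
  fi
}
```

For example, `io_mgr_thread_count 7 0` prints 7, matching the MapR-FS example above.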
Configuration value
If "-num_threads_per_disk" is set in env.sh and impalad is restarted, the total number of Disk IO Manager threads becomes "-num_threads_per_disk" * "number of disks".

2. HDFS Scan Node
This is the consumer, which keeps issuing calls to the Disk IO Manager to get subsequent scan ranges.
Name
It is shown as "hdfs-scan-node" on the Impala web GUI "threadz" page, and as "impala::HdfsScanNode::ScannerThread" in the output of "pstack <impalad>", e.g.:
# grep "impala::HdfsScanNode::ScannerThread" "2015_pstack.log"
#6 0x0000000000ab63af in impala::HdfsScanNode::ScannerThread() ()
#6 0x0000000000ab63af in impala::HdfsScanNode::ScannerThread() ()
#6 0x0000000000ab63af in impala::HdfsScanNode::ScannerThread() ()
#6 0x0000000000ab63af in impala::HdfsScanNode::ScannerThread() ()
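As with the Disk IO Manager threads, the number of active scanner threads can be counted from the same pstack capture (the helper name is illustrative):

```shell
# Count HDFS scan node scanner threads in a saved pstack capture.
count_scanner_threads() {
  grep -c "impala::HdfsScanNode::ScannerThread" "$1"
}
```

For the capture shown above, `count_scanner_threads 2015_pstack.log` would print 4.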
Default Value
By default, Impala uses as many scanner threads as there are available cores (one thread per core).
Configuration Value
NUM_SCANNER_THREADS controls the maximum number of scanner threads (on each node) used for each query. For example, we can set it at the session level:
SET NUM_SCANNER_THREADS=2;
Statistics
On the Impala web GUI "queries" page, we can check the Profile of the SQL statement to see the performance statistics for "HDFS_SCAN_NODE". For example:
- AverageScannerThreadConcurrency: 1.00
- NumScannerThreadsStarted: 1
- PerReadThreadRawHdfsThroughput: 60.41 MB/sec
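These counters can also be pulled out of a saved query profile (downloaded as text from the "queries" page) with a simple filter. A sketch, with an illustrative helper name and file path:

```shell
# Pull the scan-related counters out of a saved query profile text file.
hdfs_scan_stats() {
  grep -E "AverageScannerThreadConcurrency|NumScannerThreadsStarted|PerReadThreadRawHdfsThroughput" "$1"
}
```

For example, `hdfs_scan_stats query_profile.txt` prints just the three counters above from a full profile.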