Env:
Hadoop 2.5.1
Apache Hadoop ResourceManager HA enabled.
Symptom:
ResourceManager fails to transition to Active mode with an "InvalidResourceRequestException".
The following stack trace appears first in the RM log:
Caused by: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=9216, maxMemory=8192
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:228)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateResourceRequest(RMAppManager.java:385)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:345)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:309)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1104)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:508)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    ... 13 more
The following stack trace then repeats in the RM log:
WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:122)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:301)
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:120)
    ... 4 more
Caused by: org.apache.hadoop.service.ServiceStateException: RMActiveServices cannot enter state STARTED from state STOPPED
    at org.apache.hadoop.service.ServiceStateModel.checkStateTransition(ServiceStateModel.java:129)
    at org.apache.hadoop.service.ServiceStateModel.enterState(ServiceStateModel.java:111)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:190)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:911)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:951)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:948)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:948)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:292)
    ... 5 more
2015-09-03 13:59:23,581 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
Root Cause:
This is caused by YARN-3493, which is fixed in Hadoop 2.6.1, 2.7.1 and 2.8.0.
The issue can occur when the value of yarn.scheduler.maximum-allocation-mb is lowered and ResourceManager is then restarted.
During recovery, ResourceManager fails on any application left in the RMStateStore that requested more memory than the new yarn.scheduler.maximum-allocation-mb, even if that application failed long ago. In the log above, a stored application requested 9216 MB while the maximum is now 8192 MB.
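The failing check can be sketched from the exception message alone. The snippet below is a minimal illustration, not the Hadoop source: request_is_invalid is a hypothetical helper that applies the condition quoted in the log (requested memory < 0, or requested memory > max configured) to the values reported there.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the condition quoted in the exception message.
# Returns success (0) when the request would be rejected.
request_is_invalid() {
  local requested_mb=$1 max_mb=$2
  [ "$requested_mb" -lt 0 ] || [ "$requested_mb" -gt "$max_mb" ]
}

# Values taken from the RM log in this post.
if request_is_invalid 9216 8192; then
  echo "rejected: requestedMemory=9216 > maxMemory=8192"
fi
```

A request equal to the maximum passes this condition. Only applications stored with a request above the lowered limit would be rejected during recovery.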
Solution:
1. Identify the RMStateStore class.
MapR uses FileSystemRMStateStore by default, which means the RMStateStore is on MFS. Users may also choose ZKRMStateStore.
$ hadoop2 conf | grep yarn.resourcemanager.store.class
<property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value><source>yarn-default.xml</source></property>
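Because grep treats the dots in the property name as wildcards and prints the whole XML line, a small filter can make the lookup stricter and return only the value. This is a sketch that assumes the configuration dump prints one <property> element per line, as in the output above. conf_value is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# conf_value NAME
# Reads `hadoop2 conf`-style XML on stdin (assumed: one <property> per line)
# and prints the <value> of the property whose <name> is exactly NAME.
conf_value() {
  local name=$1
  grep -F "<name>${name}</name>" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Intended use on a cluster node:
#   hadoop2 conf | conf_value yarn.resourcemanager.store.class
```

Matching on the full <name>...</name> element avoids picking up other properties that merely share the same prefix.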
2. Find the location of RMStateStore.
If RMStateStore is using FileSystemRMStateStore, the parent location is defined by yarn.resourcemanager.fs.state-store.uri.
$ hadoop2 conf | grep yarn.resourcemanager.fs.state-store.uri
<property><name>yarn.resourcemanager.fs.state-store.uri</name><value>/var/mapr/cluster/yarn/rm/system</value><source>yarn-default.xml</source></property>
Then the location of all application directories is:
/var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot
If RMStateStore is using ZKRMStateStore, the parent znode is defined by yarn.resourcemanager.zk-state-store.parent-path.
$ hadoop2 conf | grep yarn.resourcemanager.zk-state-store.parent-path
<property><name>yarn.resourcemanager.zk-state-store.parent-path</name><value>/rmstore</value><source>yarn-default.xml</source></property>
Then the znode of all application directories is:
/rmstore/ZKRMStateRoot/RMAppRoot/
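The two cases in this step follow the same pattern: the configured parent plus a store-specific root and RMAppRoot. A minimal sketch of that mapping is shown here. app_root is a hypothetical helper, and the FSRMStateRoot and ZKRMStateRoot names are taken from the paths shown in this step rather than looked up from the cluster.

```shell
#!/usr/bin/env bash
# app_root STORE_CLASS PARENT
# Prints the application root for the two store classes covered in this post.
app_root() {
  local store_class=$1
  local parent=${2%/}   # drop one trailing slash, if any
  case "$store_class" in
    *FileSystemRMStateStore) echo "${parent}/FSRMStateRoot/RMAppRoot" ;;
    *ZKRMStateStore)         echo "${parent}/ZKRMStateRoot/RMAppRoot" ;;
    *) echo "unsupported store class: ${store_class}" >&2; return 1 ;;
  esac
}

# Values from the configuration output in this post.
app_root org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore /var/mapr/cluster/yarn/rm/system
app_root org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore /rmstore
```

If the cluster overrides either property, the configured value should be passed in place of the defaults used here.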
3. Move or remove all the application directories in RMStateStore.
The impact of this step is that the RM UI will no longer list these applications, although their information can still be viewed from the HistoryServer UI. RM will also not recover any failed or running applications, so users need to re-submit them.
For example:
If FileSystemRMStateStore,
hadoop fs -mv /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot/* /backup_statestore/
If ZKRMStateStore,
The application znodes need to be removed one by one from the ZooKeeper CLI, as below:
rmr /rmstore/ZKRMStateRoot/RMAppRoot/application_#############_####
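Since the znodes have to be removed one at a time, the rmr commands can be generated from a child listing instead of typed by hand. The sketch below assumes the listing has the form "[child1, child2, ...]", which is how the ZooKeeper CLI ls command typically prints children. rmr_commands is a hypothetical helper and the application IDs are made up for illustration.

```shell
#!/usr/bin/env bash
# rmr_commands APP_ROOT
# Reads a child listing of the form "[child1, child2, ...]" on stdin
# (assumed ZooKeeper CLI `ls` output) and prints one rmr command per
# application znode. Entries that do not look like application IDs are skipped.
rmr_commands() {
  local root=${1%/}
  sed 's/[][ ]//g' | tr ',' '\n' \
    | grep -E '^application_[0-9]+_[0-9]+$' \
    | sed "s:^:rmr ${root}/:"
}

# Hypothetical application IDs, for illustration only.
echo '[application_1441288000000_0001, application_1441288000000_0002]' \
  | rmr_commands /rmstore/ZKRMStateRoot/RMAppRoot/
```

The function only prints commands. Reviewing the list before pasting it into the ZooKeeper CLI keeps the removal step deliberate.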