Monday 16 February 2015

Cluster service starts and stops again and again within a few seconds, continuously trying to start but failing. Resolved


Today I faced an issue after restarting a server: its Cluster service would not start properly. The server runs Windows Server 2012 with the Exchange Server 2013 Mailbox and CAS roles installed, and it is a member of a four-node Exchange Server 2013 DAG.

Errors in Failover Cluster Manager:
(1) Event ID 1135
Cluster node 'Node02' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
(2) Event ID 1127
Cluster network interface 'Node1 -VL3' for cluster node 'Node1' on network 'Cluster Network 1' failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
(3) Event ID 1069
Cluster resource 'IPv4 Static Address 2 (Cluster Group)' of type 'IP Address' in clustered role 'Cluster Group' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

The following errors also appeared in Event Viewer, in the System log:

(1) Event ID 7031
The Cluster Service service terminated unexpectedly.  It has done this 231 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.
(2) Event ID 7024
The Cluster Service service terminated with the following service-specific error:
The handle is invalid.
(3) Event ID 1070
The node failed to join failover cluster 'DAG15' due to error code '6'.

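Solution:
The registry value below had been changed to 2; it should be 0 or 1. According to Microsoft's documentation, Windows itself sets CrashOnAuditFail to 2 after the system has halted because it could not write a security audit event, and in that state only administrators can log on. After changing the value to 1 and starting the Cluster service, the service started normally and the issue was resolved.
Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa
CrashOnAuditFail Value Data: 2
Change to
CrashOnAuditFail Value Data: 1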
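
The registry check and change described in the solution could also be scripted. Below is a minimal sketch in Python using the standard winreg module. It is only an illustration under assumptions: the function names and the --fix switch are made up for this example, the state descriptions paraphrase Microsoft's documentation of CrashOnAuditFail, and the write has to run from an elevated (administrator) prompt on the affected node.

```python
import sys

# Meanings of CrashOnAuditFail, paraphrased from Microsoft's documentation.
CRASH_ON_AUDIT_FAIL_STATES = {
    0: "disabled",
    1: "enabled: halt the system if a security audit cannot be logged",
    2: "triggered: the system already halted; only administrators can log on",
}

LSA_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa"
VALUE_NAME = "CrashOnAuditFail"


def needs_reset(value):
    """Return True when the value is in the 'triggered' state (2)."""
    return value == 2


def read_crash_on_audit_fail():
    """Read the current value from the registry (Windows only)."""
    import winreg  # imported here so the module still loads on other platforms
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return value


def write_crash_on_audit_fail(new_value):
    """Write a new DWORD value; requires an elevated process."""
    import winreg
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, new_value)


if __name__ == "__main__" and sys.platform == "win32":
    current = read_crash_on_audit_fail()
    state = CRASH_ON_AUDIT_FAIL_STATES.get(current, "unknown")
    print(VALUE_NAME, "=", current, "(" + state + ")")
    # Only change the registry when explicitly asked to with --fix.
    if needs_reset(current) and "--fix" in sys.argv:
        write_crash_on_audit_fail(1)
        print("Value reset to 1. Now start the Cluster service, "
              "for example with: net start clussvc")
```

Run without arguments, the script only reports the current value; with --fix it resets a value of 2 back to 1, after which the Cluster service (service name ClusSvc) can be started again.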