Josh-D. S. Davis

Xaminmo / Omnimax / Max Omni / Mad Scientist / Midnight Shadow / Radiation Master

cl_rsh fails
PROBLEM: On some migrations, we found the rpdomain would not stay running on one node.
The cluster was up and seemed to operate normally, but errpt logged CONFIGRM stop/start messages every minute.

lsrpdomain would show Offline, or "Pending online".

lsrpnode would show:
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.


On the other node, lsrpnode only showed itself, and lsrpdomain showed Online.

"cl_rsh node1 date" worked from both nodes
"cl_rsh node2 date" worked only from node2.
/etc/hosts, cllsif, hostname, /etc/cluster/rhosts... everything was spotless.
clcomd was running, even after refresh.
Same subnet, and ports were not filtered.
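
The cl_rsh checks above can be repeated for every node in one pass. This is only a rough sketch: the function name check_cl_rsh and the CL_RSH override variable are made up for illustration, and the default path assumes cl_rsh sits with the other cluster utilities. Since cl_rsh only tests outbound from the node you are on, it has to be run on each node in turn.

```shell
# Hypothetical helper: try "cl_rsh <node> date" for each node given and
# report which ones answer. CL_RSH can be overridden; the default path is
# an assumption based on where the other cluster utilities live.
CL_RSH=${CL_RSH:-/usr/es/sbin/cluster/utilities/cl_rsh}

check_cl_rsh() {
    failed=0
    for node in "$@"; do
        if "$CL_RSH" "$node" date >/dev/null 2>&1; then
            echo "OK   $node"
        else
            echo "FAIL $node"
            failed=$((failed + 1))
        fi
    done
    # Return non-zero if any node failed.
    [ "$failed" -eq 0 ]
}

# Usage, run once on each node:
#   check_cl_rsh node1 node2
```

In the failure described here, node1 would presumably have reported FAIL for node2 while node2 reported OK for both.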

Importing a snapshot said:
Warning: unable to verify inbound clcomd communication from
node "node1" to the local node, "node2".


I applied PowerHA 7.1.3 SP4, which did not fix it. I suspect this is a problem with clmigcheck or mkcluster in AIX.

SOLUTION
I saved a snapshot, blew away the cluster, and imported the snapshot.
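# Save the current configuration as a snapshot.
/usr/es/sbin/cluster/utilities/clsnapshot -c -i -nmysnapshot -d "Snapshot before clrmcluster"
# Stop cluster services and the cluster subsystem group.
clstop -g -N
stopsrc -g cluster
# Remove the PowerHA cluster definition, then clear the CAA cluster from the repository disk.
clrmclstr
rmcluster -r hdisk10
# one node's SSHd died here.
# Remove and rediscover the cluster0 device.
rmdev -dl cluster0
cfgmgr
# cl_rsh works all the way around now.
# Restore the configuration from the snapshot.
/usr/es/sbin/cluster/utilities/clsnapshot -a -n'mysnapshot' -f'false'
# Verify from each node.
cllsclstr ; lscluster -m ; lsrpdomain ; lsrpnode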

All four commands work fine on both nodes, before and after reboot.
The cluster starts normally.
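
Since lsrpdomain printed Offline or "Pending online" during the failure, its OpState is a reasonable quick check after the rebuild. A minimal sketch, assuming the usual lsrpdomain layout of one header line followed by rows with the domain name in column 1 and OpState in column 2; the function name rpdomain_all_online is hypothetical.

```shell
# Hypothetical helper: read lsrpdomain output on stdin and succeed only if
# at least one domain is listed and every one has OpState "Online".
# Assumes one header line, then Name in column 1 and OpState in column 2.
rpdomain_all_online() {
    awk 'NR > 1 && NF > 0 {
             seen++
             # "Offline" and "Pending online" both fail this comparison.
             if ($2 != "Online") notonline++
         }
         END {
             if (seen > 0 && notonline == 0) exit 0
             exit 1
         }'
}

# Usage on each node:
#   lsrpdomain | rpdomain_all_online && echo "peer domain online"
```

Checking the text rather than the exit status matters here, because the broken node still answered lsrpdomain, just with the wrong state.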


Error Reference
---------------------------------------------------------------------------
LABEL: CONFIGRM_STOPPED_ST
IDENTIFIER: 447D3237

Date/Time: Tue Nov 24 04:18:36 EST 2015
Sequence Number: 42614
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Description
IBM.ConfigRM daemon has been stopped.

Probable Causes
The RSCT Configuration Manager daemon(IBM.ConfigRMd) has been stopped.

User Causes
The stopsrc -s IBM.ConfigRM command has been executed.

Recommended Actions
Confirm that the daemon should be stopped. Normally, this daemon should
not be stopped explicitly by the user.

Detail Data
DETECTING MODULE
RSCT,ConfigRMDaemon.C,1.25.1.1,219
ERROR ID

REFERENCE CODE

---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42613
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42612
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42611
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(10.0.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42610
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42609
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42608
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(10.0.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_PENDINGQUO
IDENTIFIER: A098BF90

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42607
Class: S
Type: PERM
WPAR: Global
Resource Name: ConfigRM

Description
The operational quorum state of the active peer domain has changed to PENDING_QUORUM.
This state usually indicates that exactly half of the nodes that are defined in the
peer domain are online. In this state cluster resources cannot be recovered although
none will be stopped explicitly.

Failure Causes
One or more nodes in the active peer domain have failed.
One or more nodes in the active peer domain have been taken offline by the user.
A network failure is disrupted communication between the cluster nodes.

Recommended Actions
Ensure that more than half of the nodes of the domain are online.
Ensure that the network that is used for communication between the nodes is functioning correctly.
Ensure that the active tie breaker device is operational and if it set to
'Operator' then resolve the tie situation by granting ownership to one of
the active sub-domains.

Detail Data
DETECTING MODULE
RSCT,PeerDomain.C,1.99.30.8,19713

---------------------------------------------------------------------------
LABEL: STORAGERM_STARTED_S
IDENTIFIER: EDFF8E9B

Date/Time: Tue Nov 24 04:17:53 EST 2015
Sequence Number: 42606
Node Id: node1
Class: O
Type: INFO
WPAR: Global
Resource Name: StorageRM

Detail Data
DETECTING MODULE
RSCT,IBM.StorageRMd.C,1.49,147

---------------------------------------------------------------------------
LABEL: CONFIGRM_ONLINE_ST
IDENTIFIER: 3B16518D

Date/Time: Tue Nov 24 04:17:52 EST 2015
Sequence Number: 42605
Node Id: node1
Class: S
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,PeerDomain.C,1.99.30.8,24950

Peer Domain Name
mycluster
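
The repeated get_adapter_info_by_addr entries are easier to read as a count per address. A small sketch, assuming the DIAGNOSTIC EXPLANATION lines look like the ones above; the function name adapter_failures is made up for illustration.

```shell
# Hypothetical helper: read "errpt -a" text on stdin and count the
# get_adapter_info_by_addr failures per address and return code.
adapter_failures() {
    sed -n 's/^[[:space:]]*get_adapter_info_by_addr(\([^)]*\)) FAILED rc=\([0-9]*\).*/\1 rc=\2/p' |
        sort | uniq -c
}

# Usage:
#   errpt -a -N ConfigRM | adapter_failures
```

For the entries listed above, that gives four addresses: 192.168.0.11 and 192.168.0.12 twice each, and 10.0.0.11 and 10.0.0.12 once each, all with rc=28.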



http://omnitech.net/reference/2015/11/24/cl_rsh-fails/
