Issue type: Pacemaker NFS/DRBD cluster.

I found 6 posts about this all around the web: 2 deviated from my configuration, the rest got no answer. Seems legit.
## My Pacemaker setup

Architecture:

- 2 nodes
- Pacemaker active/passive
```
# crm configure show
node ha-node01 \
        attributes standby="off"
node ha-node02 \
        attributes standby="on"
primitive drbd_nfs ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="15s"
primitive fs_nfs ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/mnt/data/magento" fstype="ext3" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="120"
primitive nfs lsb:nfs-kernel-server \
        op monitor interval="5s" \
        meta target-role="Started"
primitive vip1 ocf:heartbeat:IPaddr2 \
        params ip="172.18.34.63" nic="bond1.1034" \
        op monitor interval="5s"
group HAServices vip1 fs_nfs nfs \
        meta target-role="Started"
ms ms_drbd_nfs drbd_nfs \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation ms-drbd-nfs-with-haservices inf: ms_drbd_nfs:Master HAServices
order fs-nfs-before-nfs inf: fs_nfs:start nfs:start
order ip-before-ms-drbd-nfs inf: vip1:start ms_drbd_nfs:promote
order ms-drbd-nfs-before-fs-nfs inf: ms_drbd_nfs:promote fs_nfs:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"
```
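The drbd_nfs primitive references a DRBD resource named r0. For context, a minimal r0 definition could look like the sketch below; only the resource name and /dev/drbd0 come from my setup, while the backing disk and replication addresses are assumptions for illustration:

```
# /etc/drbd.d/r0.res -- illustrative sketch, not my actual file
resource r0 {
    on ha-node01 {
        device    /dev/drbd0;     # matches the fs_nfs primitive above
        disk      /dev/sdb1;      # assumption: backing block device
        address   10.0.0.1:7788;  # assumption: replication link IP
        meta-disk internal;
    }
    on ha-node02 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7788;  # assumption: replication link IP
        meta-disk internal;
    }
}
```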
## General cluster status

The resource is running on the second node:
```
============
Last updated: Wed Apr 25 11:40:11 2012
Stack: openais
Current DC: ha-node02 - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Node ha-node02: standby
Online: [ ha-node01 ]

 Resource Group: HAServices
     vip1       (ocf::heartbeat:IPaddr2):       Started ha-node02
     fs_nfs     (ocf::heartbeat:Filesystem):    Started ha-node02
     nfs        (lsb:nfs-kernel-server):        Started ha-node02
 Master/Slave Set: ms_drbd_nfs
     Masters: [ ha-node02 ]
     Slaves: [ ha-node01 ]
```
When I tried to put the node holding the resources into standby mode, things went wrong:
```
# crm node standby ha-node02
============
Last updated: Wed Apr 25 11:40:12 2012
Stack: openais
Current DC: ha-node02 - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Node ha-node02: standby
Online: [ ha-node01 ]

 Resource Group: HAServices
     vip1       (ocf::heartbeat:IPaddr2):       Started ha-node01
     fs_nfs     (ocf::heartbeat:Filesystem):    Stopped
     nfs        (lsb:nfs-kernel-server):        Stopped
 Master/Slave Set: ms_drbd_nfs
     Slaves: [ ha-node01 ]
     Stopped: [ drbd_nfs:0 ]

Failed actions:
    fs_nfs_start_0 (node=ha-node01, call=14, rc=1, status=complete): unknown error
```
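When fs_nfs fails like this, it is worth asking DRBD itself what state the device is in: the Filesystem resource can only mount /dev/drbd0 on a node where r0 has been promoted to Primary. Standard DRBD 8.x commands to check:

```
# Role of r0 as seen from this node (e.g. "Secondary/Primary")
drbdadm role r0

# Full replication status: connection state, roles, disk states
cat /proc/drbd
```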
## What did the logs say?
Corosync log:
```
ha-node02 lrmd: [23026]: WARN: For LSB init script, no additional parameters are needed
```
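To pull everything related to the failed resource out of the log, a grep along these lines works; the log path is an assumption and depends on how corosync logging is configured on your system:

```
# Show recent log lines mentioning the failed resource or warnings/errors
grep -E 'fs_nfs|Filesystem|WARN|ERROR' /var/log/corosync/corosync.log | tail -n 50
```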
## Solution: clean up your resource!
Simply run this command and your cluster will come back to life. In my setup the incriminated resource is fs_nfs, so I ran:

```
crm resource cleanup fs_nfs
```
Extract from the Cluster Labs documentation:

> Cleanup resource status. Typically done after the resource has temporarily failed. If a node is omitted, cleanup on all nodes. If there are many nodes, the command may take a while.
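As the documentation says, the cleanup can also be restricted to a single node. For example, to reset the failed state of fs_nfs only where it actually failed:

```
crm resource cleanup fs_nfs ha-node01
```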
Never give up!