8. Resource Agents

8.1. Action Completion

If one resource depends on another resource via constraints, the cluster will interpret an expected result as sufficient to continue with dependent actions. This may cause timing issues if the resource agent's start action returns when the service has merely been launched but is not yet fully ready to perform its function, or if the stop action returns before the service has fully released all its claims on system resources. At a minimum, start or stop should not return before a status command would return the expected (started or stopped) result.
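
The rule above can be sketched as a start action that polls a readiness check before it reports success. This is a minimal illustration, not code from any shipped agent: `service_launch`, `service_ready`, `READY_FILE` and `START_TRIES` are hypothetical names, and a marker file stands in for a real service that becomes ready about a second after launch.

```sh
# Hypothetical stand-in for a real service: it becomes "ready" about one
# second after launch by creating a marker file.
READY_FILE=${READY_FILE:-/tmp/demo-service.ready}

service_launch() {
    rm -f "$READY_FILE"
    ( sleep 1; : > "$READY_FILE" ) >/dev/null 2>&1 &
}

service_ready() {
    [ -e "$READY_FILE" ]
}

wait_until() {
    # wait_until TRIES CMD...: run CMD once per second until it succeeds
    # or TRIES attempts have been used up.
    tries=$1; shift
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

agent_start() {
    service_launch || return 1                  # generic error
    # Do not report success until the readiness check passes.
    wait_until "${START_TRIES:-30}" service_ready || return 1
    return 0                                    # success
}
```

A stop action can follow the same pattern, polling until the status check reports the service as stopped.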

8.2. OCF Resource Agents

8.2.1. Location of Custom Scripts

OCF Resource Agents are found in /usr/lib/ocf/resource.d/$PROVIDER.

When creating your own agents, you are encouraged to create a new directory under /usr/lib/ocf/resource.d/ so that they are not confused with (or overwritten by) the agents shipped by existing providers.

So, for example, if you choose the provider name big-corp and want a new resource type named big-app, you would create a resource agent called /usr/lib/ocf/resource.d/big-corp/big-app and define a resource of class ocf with provider big-corp and type big-app.

8.2.2. Actions

All OCF resource agents are required to implement the following actions.

**Required Actions for OCF Agents**
Action	Description	Instructions
start	Start the resource	Return 0 on success and an appropriate error code otherwise. Must not report success until the resource is fully active.
stop	Stop the resource	Return 0 on success and an appropriate error code otherwise. Must not report success until the resource is fully stopped.
monitor	Check the resource’s state	Exit 0 if the resource is running, 7 if it is stopped, and any other OCF exit code if it is failed. NOTE: The monitor script should test the state of the resource on the local machine only.
meta-data	Describe the resource	Provide information about this resource in the XML format defined by the OCF standard. Exit with 0. NOTE: This is not required to be performed as root.
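
As a rough sketch of how the four required actions fit together, the following skeleton dispatches on the action name and returns the exit codes described in the table. It is illustrative only: a state file stands in for a real service, `STATE_FILE` and the `agent_*` function names are hypothetical, and the meta-data XML is abbreviated (consult the OCF standard for the full format).

```sh
# Sketch only: a state file stands in for a real service.
STATE_FILE=${STATE_FILE:-/tmp/big-app.state}    # hypothetical path

agent_monitor() {
    # 0 = running, 7 = cleanly stopped; checks local state only.
    [ -e "$STATE_FILE" ] && return 0
    return 7
}

agent_start() {
    agent_monitor && return 0           # already active
    : > "$STATE_FILE" || return 1
    agent_monitor || return 1           # succeed only once monitor agrees
}

agent_stop() {
    rm -f "$STATE_FILE" || return 1
    agent_monitor && return 1           # still running: stop failed
    return 0
}

agent_meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="big-app" version="1.0">
  <version>1.0</version>
  <longdesc lang="en">Illustrative agent backed by a state file.</longdesc>
  <shortdesc lang="en">Example agent</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

agent_main() {
    case "$1" in
        start)     agent_start ;;
        stop)      agent_stop ;;
        monitor)   agent_monitor ;;
        meta-data) agent_meta_data ;;
        *)         return 3 ;;          # action not implemented
    esac
}
```

In an actual agent script, the last lines would typically be `agent_main "$1"` followed by `exit $?`, so that the function's return value becomes the script's exit code.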

OCF resource agents may optionally implement additional actions. Some are used only with advanced resource types such as clones.

**Optional Actions for OCF Resource Agents**
Action	Description	Instructions
validate-all	Validate the instance parameters provided.	Return 0 if parameters are valid, 2 if not valid, and 6 if the resource is not configured.
promote	Bring the local instance of a promotable clone resource to the promoted role.	Return 0 on success.
demote	Bring the local instance of a promotable clone resource to the unpromoted role.	Return 0 on success.
notify	Used by the cluster to send the agent pre- and post-notification events telling the resource what has happened and will happen.	Must not fail. Must exit with 0.
reload	Reload the service’s own config.	Not used by Pacemaker
reload-agent	Make effective any changes in instance parameters marked as reloadable in the agent’s meta-data.	This is used when the agent can handle a change in some of its parameters more efficiently than stopping and starting the resource.
recover	Restart the service.	Not used by Pacemaker
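
As an illustration of the validate-all return codes, here is a minimal sketch. It assumes a single hypothetical instance parameter named `config`, which the agent sees as the environment variable `OCF_RESKEY_config`; the function name is likewise made up for the example.

```sh
# OCF_RESKEY_config is a hypothetical instance parameter ("config")
# holding the path of a configuration file.
agent_validate_all() {
    # Required parameter missing: the resource is not configured.
    [ -n "${OCF_RESKEY_config:-}" ] || return 6
    # Parameter given, but the file it names is not readable on this host.
    [ -r "$OCF_RESKEY_config" ] || return 2
    return 0
}
```

The distinction matters for recovery: a value that is unusable only on this machine should not prevent the resource from running elsewhere, while a missing required parameter is wrong on every node.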

Important

If you create a new OCF resource agent, use ocf-tester to verify that the agent properly complies with the OCF standard.

8.2.3. How are OCF Return Codes Interpreted?

The first thing the cluster does is to check the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed, and recovery action is initiated.
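
The comparison can be pictured with a tiny sketch. The function name `op_failed` is hypothetical and this is not Pacemaker's actual implementation; it only shows that failure is decided relative to the expected code rather than by whether the code is zero.

```sh
op_failed() {
    # op_failed EXPECTED ACTUAL
    # Succeeds (returns 0) when the operation counts as failed,
    # that is, when the actual code differs from the expected one.
    [ "$2" -ne "$1" ]
}

# A monitor that is expected to report "stopped" (7) but reports
# "running" (0) counts as a failure, even though 0 normally means success.
if op_failed 7 0; then
    echo "unexpected result: recovery would be initiated"
fi
```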

There are three types of failure recovery:

**Types of recovery performed by the cluster**
Type	Description	Action Taken by the Cluster
soft	A transient error occurred	Restart the resource or move it to a new location
hard	A non-transient error that may be specific to the current node	Move the resource elsewhere and prevent it from being retried on the current node
fatal	A non-transient error that will be common to all cluster nodes (e.g. a bad configuration was specified)	Stop the resource and prevent it from being started on any cluster node

8.2.4. OCF Return Codes

The following table outlines the different OCF return codes and the type of recovery the cluster will initiate when a failure code is received. Although counterintuitive, even actions that return 0 (also known as OCF_SUCCESS) can be considered to have failed if 0 was not the expected return value.

**OCF Exit Codes and their Recovery Types**
Exit Code	OCF Alias	Description	Recovery
0	OCF_SUCCESS	Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands.	soft
1	OCF_ERR_GENERIC	Generic “there was a problem” error code.	soft
2	OCF_ERR_ARGS	The resource’s parameter values are not valid on this machine (for example, a value refers to a file not found on the local host).	hard
3	OCF_ERR_UNIMPLEMENTED	The requested action is not implemented.	hard
4	OCF_ERR_PERM	The resource agent does not have sufficient privileges to complete the task.	hard
5	OCF_ERR_INSTALLED	The tools required by the resource are not installed on this machine.	hard
6	OCF_ERR_CONFIGURED	The resource’s parameter values are inherently invalid (for example, a required parameter was not given).	fatal
7	OCF_NOT_RUNNING	The resource is safely stopped. This should only be returned by monitor actions, not stop actions.	N/A
8	OCF_RUNNING_PROMOTED	The resource is running in the promoted role.	soft
9	OCF_FAILED_PROMOTED	The resource is (or might be) in the promoted role but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again.	soft
190	OCF_DEGRADED	The resource is properly active, but in such a condition that future failures are more likely.	none
191	OCF_DEGRADED_PROMOTED	The resource is properly active in the promoted role, but in such a condition that future failures are more likely.	none
other	none	Custom error code.	soft
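
For quick reference, the Recovery column of the table can be transcribed into a small lookup function. The name `recovery_type` is made up for this sketch. The result applies only when the code was not the expected one, and the sketch does not model the exceptions the cluster makes to this default handling.

```sh
recovery_type() {
    # recovery_type CODE: print the recovery type from the table above.
    case "$1" in
        0|1|8|9)  echo soft ;;
        2|3|4|5)  echo hard ;;
        6)        echo fatal ;;
        7)        echo "N/A" ;;
        190|191)  echo none ;;
        *)        echo soft ;;      # custom error codes
    esac
}

recovery_type 5
recovery_type 6
```

The two calls at the end print `hard` and `fatal`, matching the rows for OCF_ERR_INSTALLED and OCF_ERR_CONFIGURED.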

Exceptions to the recovery handling described above:

- Probes (non-recurring monitor actions) that find a resource active (or in the promoted role) will not result in recovery action unless it is also found active elsewhere.
- The recovery action taken when a resource is found active more than once is determined by the resource’s multiple-active property.
- Recurring actions that return OCF_ERR_UNIMPLEMENTED do not cause any type of recovery.
- Actions that return one of the “degraded” codes will be treated the same as if they had returned success, but status output will indicate that the resource is degraded.

8.3. LSB Resource Agents (Init Scripts)

8.3.1. LSB Compliance

The relevant part of the LSB specifications includes a description of all the return codes listed here.

Assuming some_service is configured correctly and currently inactive, the following sequence will help you determine if it is LSB-compatible:

Start (stopped):
```
# /etc/init.d/some_service start ; echo "result: $?"
```
- Did the service start?
- Did the echo command print result: 0 (in addition to the init script’s usual output)?
Status (running):
```
# /etc/init.d/some_service status ; echo "result: $?"
```
- Did the script accept the command?
- Did the script indicate the service was running?
- Did the echo command print result: 0 (in addition to the init script’s usual output)?
Start (running):
```
# /etc/init.d/some_service start ; echo "result: $?"
```
- Is the service still running?
- Did the echo command print result: 0 (in addition to the init script’s usual output)?
Stop (running):
```
# /etc/init.d/some_service stop ; echo "result: $?"
```
- Was the service stopped?
- Did the echo command print result: 0 (in addition to the init script’s usual output)?
Status (stopped):
```
# /etc/init.d/some_service status ; echo "result: $?"
```
- Did the script accept the command?
- Did the script indicate the service was not running?
- Did the echo command print result: 3 (in addition to the init script’s usual output)?
Stop (stopped):
```
# /etc/init.d/some_service stop ; echo "result: $?"
```
- Is the service still stopped?
- Did the echo command print result: 0 (in addition to the init script’s usual output)?
Status (failed):

This step is not readily testable and relies on manual inspection of the script.

The script can use one of the error codes (other than 3) listed in the LSB spec to indicate that it is active but failed. This tells the cluster that before moving the resource to another node, it needs to stop it on the existing one first.

If the answer to any of the above questions is no, then the script is not LSB-compliant. Your options are then to either fix the script or write an OCF agent based on the existing script.
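
The exit-code part of the sequence above can be automated with a short helper. This is only a sketch under stated assumptions: `lsb_check` is a hypothetical function name, the service must be configured and stopped beforehand, and the helper checks exit codes only. Whether the service actually started or stopped, and the "Status (failed)" case, still need manual inspection.

```sh
lsb_check() {
    # lsb_check SCRIPT: run start/status/start/stop/status/stop in order
    # and compare each exit code with the value expected above.
    script=$1
    failures=0
    for step in start:0 status:0 start:0 stop:0 status:3 stop:0; do
        action=${step%%:*}
        want=${step##*:}
        got=0
        "$script" "$action" >/dev/null 2>&1 || got=$?
        if [ "$got" -ne "$want" ]; then
            echo "FAIL: $action returned $got, expected $want"
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}
```

It would be called as, for example, `lsb_check /etc/init.d/some_service`, and returns non-zero if any step produced an unexpected exit code.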
