If one resource depends on another via constraints, the cluster treats an expected result as sufficient to proceed with the dependent actions. This can cause timing issues if the resource agent's start action returns before the service is fully ready to perform its function (not merely launched), or if its stop action returns before the service has fully released all its claims on system resources. At a minimum, start or stop should not return before a status command would return the expected (started or stopped) result.
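The requirement above can be sketched as a start function that polls its own status check before returning. This is a minimal illustration, not a real agent: app_launch, app_monitor, app_start and STATE_FILE are hypothetical names, and the "service" is simulated by a file that appears about one second after launch.

```shell
#!/bin/sh
# Sketch only: app_launch, app_monitor, app_start and STATE_FILE are
# hypothetical stand-ins for a real service and its health check.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/big-app.$$.ready}"

# Simulates a service that becomes ready about one second after launch.
app_launch() {
    ( sleep 1; : > "$STATE_FILE" ) &
}

# Reports the service state using the same codes as the monitor action.
app_monitor() {
    if [ -e "$STATE_FILE" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}

# Launches the service, then blocks until app_monitor reports success.
app_start() {
    app_monitor && return "$OCF_SUCCESS"    # already running
    app_launch || return "$OCF_ERR_GENERIC"
    tries=0
    until app_monitor; do
        tries=$((tries + 1))
        # Give up after ten polls (an arbitrary limit for this sketch).
        [ "$tries" -ge 10 ] && return "$OCF_ERR_GENERIC"
        sleep 1
    done
    return "$OCF_SUCCESS"
}
```

The point of the design is that app_start reports success only once the same check used for monitoring agrees, so a dependent resource is never started against a service that is still coming up.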
OCF resource agents are found in /usr/lib/ocf/resource.d/$PROVIDER.
When creating your own agents, you are encouraged to create a new directory under /usr/lib/ocf/resource.d/ so that they are not confused with (or overwritten by) the agents shipped by existing providers.
So, for example, if you choose the provider name big-corp and want a new resource agent named big-app, you would create a resource agent called /usr/lib/ocf/resource.d/big-corp/big-app and define a resource of class ocf, provider big-corp, and type big-app.
All OCF resource agents are required to implement the following actions.
Action | Description | Instructions |
---|---|---|
start | Start the resource | Return 0 on success and an appropriate error code otherwise. Must not report success until the resource is fully active. |
stop | Stop the resource | Return 0 on success and an appropriate error code otherwise. Must not report success until the resource is fully stopped. |
monitor | Check the resource’s state | Exit 0 if the resource is running, 7 if it is stopped, and any other OCF exit code if it has failed. NOTE: The monitor script should test the state of the resource on the local machine only. |
meta-data | Describe the resource | Provide information about this resource in the XML format defined by the OCF standard. Exit with 0. NOTE: This is not required to be performed as root. |
validate-all | Verify the supplied parameters | Return 0 if parameters are valid, 2 if not valid, and 6 if resource is not configured. |
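The required actions in the table above can be sketched as a single dispatch function. This is a skeleton under stated assumptions, not a complete agent: the agent name big-app, the parameter named state, and all agent_* function names are hypothetical, the "resource" is just a state file, and the meta-data XML is abbreviated. The cluster passes resource parameters to the agent as OCF_RESKEY_<name> environment variables.

```shell
#!/bin/sh
# Skeleton only: "big-app", its "state" parameter and the agent_* names
# are hypothetical. Exit codes follow the OCF values used in this chapter.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_ARGS=2
OCF_ERR_UNIMPLEMENTED=3
OCF_ERR_CONFIGURED=6
OCF_NOT_RUNNING=7

# Prints an abbreviated description of the agent in OCF meta-data XML.
agent_meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="big-app" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Placeholder agent that tracks a state file.</longdesc>
  <shortdesc lang="en">Placeholder agent</shortdesc>
  <parameters>
    <parameter name="state" unique="1" required="1">
      <longdesc lang="en">File that marks the resource as running.</longdesc>
      <shortdesc lang="en">State file</shortdesc>
      <content type="string"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
    <action name="validate-all" timeout="20s"/>
  </actions>
</resource-agent>
EOF
}

# 6 if the required parameter is missing, 2 if it is unusable on this node.
agent_validate() {
    [ -n "${OCF_RESKEY_state:-}" ] || return "$OCF_ERR_CONFIGURED"
    [ -d "$(dirname "$OCF_RESKEY_state")" ] || return "$OCF_ERR_ARGS"
    return "$OCF_SUCCESS"
}

# 0 if the state file exists on the local machine, 7 otherwise.
agent_monitor() {
    [ -e "$OCF_RESKEY_state" ] && return "$OCF_SUCCESS"
    return "$OCF_NOT_RUNNING"
}

# Reports success only once monitor agrees the resource is active.
agent_start() {
    : > "$OCF_RESKEY_state" || return "$OCF_ERR_GENERIC"
    agent_monitor
}

# Reports success only once monitor agrees the resource is stopped.
agent_stop() {
    rm -f "$OCF_RESKEY_state"
    agent_monitor && return "$OCF_ERR_GENERIC"
    return "$OCF_SUCCESS"
}

# Maps the action name given as the first argument to its handler.
agent_main() {
    case "${1:-}" in
        meta-data)    agent_meta_data; return "$OCF_SUCCESS" ;;
        validate-all) agent_validate ;;
        start)        agent_validate && agent_start ;;
        stop)         agent_validate && agent_stop ;;
        monitor)      agent_validate && agent_monitor ;;
        *)            return "$OCF_ERR_UNIMPLEMENTED" ;;
    esac
}
```

A real agent file would end by calling `agent_main "$@"` and exiting with its status. Handling meta-data before any validation keeps that action usable without parameters and without root privileges.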
Additional requirements (not part of the OCF specification) are placed on agents that will be used for advanced concepts such as clone resources.
Action | Description | Instructions |
---|---|---|
promote | Bring the local instance of a promotable clone resource to the promoted role. | Return 0 on success |
demote | Bring the local instance of a promotable clone resource to the unpromoted role. | Return 0 on success |
notify | Used by the cluster to send the agent pre- and post- notification events telling the resource what has happened and will happen. | Must not fail. Must exit with 0 |
One action specified in the OCF specs, recover, is not currently used by the cluster. It is intended to be a variant of the start action that tries to recover a resource locally.
Important
If you create a new OCF resource agent, use ocf-tester to verify that the agent complies with the OCF standard.
The first thing the cluster does is to check the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed, and recovery action is initiated.
There are three types of failure recovery:
Type | Description | Action Taken by the Cluster |
---|---|---|
soft | A transient error occurred | Restart the resource or move it to a new location |
hard | A non-transient error that may be specific to the current node | Move the resource elsewhere and prevent it from being retried on the current node |
fatal | A non-transient error that will be common to all cluster nodes (e.g. a bad configuration was specified) | Stop the resource and prevent it from being started on any cluster node |
The following table outlines the different OCF return codes and the type of recovery the cluster will initiate when a failure code is received. Although counterintuitive, even actions that return 0 (OCF_SUCCESS) can be considered to have failed, if 0 was not the expected return value.
Exit Code | OCF Alias | Description | Recovery |
---|---|---|---|
0 | OCF_SUCCESS | Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands. | soft |
1 | OCF_ERR_GENERIC | Generic “there was a problem” error code. | soft |
2 | OCF_ERR_ARGS | The resource’s configuration is not valid on this machine. E.g. it refers to a location not found on the node. | hard |
3 | OCF_ERR_UNIMPLEMENTED | The requested action is not implemented. | hard |
4 | OCF_ERR_PERM | The resource agent does not have sufficient privileges to complete the task. | hard |
5 | OCF_ERR_INSTALLED | The tools required by the resource are not installed on this machine. | hard |
6 | OCF_ERR_CONFIGURED | The resource’s configuration is invalid. E.g. required parameters are missing. | fatal |
7 | OCF_NOT_RUNNING | The resource is safely stopped. The cluster will not attempt to stop a resource that returns this for any action. | N/A |
8 | OCF_RUNNING_PROMOTED | The resource is running in the promoted role. | soft |
9 | OCF_FAILED_PROMOTED | The resource is (or might be) in the promoted role but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again. | soft |
other | none | Custom error code. | soft |
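As an illustration of how the table is read, the lookup can be sketched as two small shell functions. This is not the cluster's implementation, and the function names recovery_type and check_result are hypothetical; the sketch covers only the table itself and does not model any exceptions to it.

```shell
#!/bin/sh
# Illustration only: recovery_type and check_result are hypothetical
# helpers that mirror the exit code table, not cluster code.

# Prints the recovery type the table assigns to an exit code.
recovery_type() {
    case "$1" in
        2|3|4|5) echo "hard" ;;
        6)       echo "fatal" ;;
        7)       echo "N/A" ;;
        *)       echo "soft" ;;    # 0, 1, 8, 9 and custom codes
    esac
}

# check_result <expected> <actual>
# An action has failed whenever the actual code differs from the
# expected one, even if the actual code is 0.
check_result() {
    if [ "$2" -eq "$1" ]; then
        echo "ok"
    else
        echo "failed: $(recovery_type "$2")"
    fi
}
```

For example, `check_result 7 0` prints `failed: soft`: a monitor that was expected to find the resource stopped (7) but found it running (0) counts as a failure, even though 0 is OCF_SUCCESS.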
Exceptions to the recovery handling described above:
The relevant part of the LSB specifications [http://refspecs.linuxfoundation.org/lsb.shtml] includes a description of all the return codes listed here.
Assuming some_service is configured correctly and currently inactive, the following sequence will help you determine if it is LSB-compatible:
Start (stopped):
# /etc/init.d/some_service start ; echo "result: $?"
Status (running):
# /etc/init.d/some_service status ; echo "result: $?"
Start (running):
# /etc/init.d/some_service start ; echo "result: $?"
Is the service still running?
Did the echo command print result: 0 (in addition to the init script’s usual output)?
Stop (running):
# /etc/init.d/some_service stop ; echo "result: $?"
Status (stopped):
# /etc/init.d/some_service status ; echo "result: $?"
Stop (stopped):
# /etc/init.d/some_service stop ; echo "result: $?"
Status (failed):
This step is not readily testable and relies on manual inspection of the script.
The script can use one of the error codes (other than 3) listed in the LSB spec to indicate that it is active but failed. This tells the cluster that before moving the resource to another node, it needs to stop it on the existing one first.
If the answer to any of the above questions is no, then the script is not LSB-compliant. Your options are then to either fix the script or write an OCF agent based on the existing script.
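The exit code part of the sequence above can be automated with a small script. This is a hedged sketch: run_step and lsb_check are hypothetical names, the expected codes (0 for every start and stop, 0 for status while running, 3 for status while stopped) follow the LSB init script conventions, and the script checks exit codes only. Whether the service really started or stopped still needs manual inspection, as does the "Status (failed)" case.

```shell
#!/bin/sh
# Sketch only: run_step and lsb_check are hypothetical helper names.

# run_step <script> <action> <expected exit code>
run_step() {
    rc=0
    "$1" "$2" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq "$3" ]; then
        echo "ok   $2: result $rc"
    else
        echo "FAIL $2: result $rc, expected $3"
        return 1
    fi
}

# lsb_check <script>
# Runs the six testable steps in order, assuming the service is
# configured correctly and inactive when the check begins.
lsb_check() {
    script=$1
    failures=0
    run_step "$script" start  0 || failures=$((failures + 1))  # Start (stopped)
    run_step "$script" status 0 || failures=$((failures + 1))  # Status (running)
    run_step "$script" start  0 || failures=$((failures + 1))  # Start (running)
    run_step "$script" stop   0 || failures=$((failures + 1))  # Stop (running)
    run_step "$script" status 3 || failures=$((failures + 1))  # Status (stopped)
    run_step "$script" stop   0 || failures=$((failures + 1))  # Stop (stopped)
    [ "$failures" -eq 0 ]
}
```

It could be invoked as `lsb_check /etc/init.d/some_service`, typically as root; a non-zero return means at least one step produced an unexpected exit code.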