$ oc adm diagnostics
The oc adm diagnostics command runs a series of checks for error conditions in the host or cluster. Specifically, it:
- Verifies that the default registry and router are running and correctly configured.
- Checks ClusterRoleBindings and ClusterRoles for consistency with base policy.
- Checks that all of the client configuration contexts are valid and can be connected to.
- Checks that SkyDNS is working properly and the pods have SDN connectivity.
- Validates master and node configuration on the host.
- Checks that nodes are running and available.
- Analyzes host logs for known errors.
- Checks that systemd units are configured as expected for the host.
OpenShift Origin can be deployed in many ways: built from source, included in a VM image, in a container image, or as enterprise RPMs. Each method implies a different configuration and environment. To minimize environment assumptions, the diagnostics were added to the openshift binary so that wherever there is an OpenShift Origin server or client, the diagnostics can run in the exact same environment.
To use the diagnostics tool, preferably on a master host and as cluster administrator, run:
$ oc adm diagnostics
This runs all available diagnostics, skipping any that do not apply. For example, the NodeConfigCheck does not run unless a node configuration is available. You can also run specific diagnostics by name as you work to address issues. For example:
$ oc adm diagnostics NodeConfigCheck UnitStatus
Diagnostics look for configuration files in standard locations:
- Client:
  - As indicated by the $KUBECONFIG environment variable
  - ~/.kube/config file
- Master:
  - /etc/origin/master/master-config.yaml
- Node:
  - /etc/origin/node/node-config.yaml

Non-standard locations can be specified with flags (respectively, --config, --master-config, and --node-config). If a configuration file is not found or specified, related diagnostics are skipped.
Consult the --help output for all available options.
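For example, to point the client diagnostics at a kubeconfig file outside the standard locations, you might run something like the following (ConfigContexts is one of the client-side diagnostics; the path is illustrative):
$ oc adm diagnostics ConfigContexts --config=/path/to/kubeconfig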
Master and node diagnostics are most useful in a specific target environment, which is a deployment of RPMs with Ansible deployment logic. This provides some diagnostic benefits:
- Master and node configuration is based on a configuration file in a standard location.
- Systemd units are configured to manage the server(s).
- All components log to journald.
Having configuration files where Ansible places them means that you will generally not need to specify where to find them. Running oc adm diagnostics without flags will look for master and node configurations in the standard locations and use them if found; this should make the Ansible-installed use case as simple as possible. It is also easy to specify configuration files that are not in the expected locations:
$ oc adm diagnostics --master-config=<file_path> --node-config=<file_path>
Systemd units and log entries in journald are necessary for the current log diagnostic logic. For other deployment types, logs may be going into files, to stdout, or may combine node and master. At this time, log diagnostics are not able to work properly in these situations and will be skipped.
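As a quick sanity check before relying on the log diagnostics, you can confirm that the server units exist and are logging to journald. The unit names below assume a standard Origin RPM installation and may differ in your deployment:
$ systemctl status origin-master origin-node
$ journalctl -u origin-master --since "1 hour ago" | tail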
You may have access as an ordinary user, as a cluster-admin user, or both, and may be running on a host where OpenShift Origin master or node servers are operating. The diagnostics attempt to use as much access as the user has available.
A client with ordinary access should be able to diagnose its connection to the master and run a diagnostic pod. If multiple users or masters are configured, connections will be tested for all, but the diagnostic pod only runs against the current user, server, or project.
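For example, an ordinary user can run just the pod-based diagnostic, which deploys a short-lived pod to test DNS and networking from within the cluster (DiagnosticPod is one of the standard diagnostic names):
$ oc adm diagnostics DiagnosticPod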
A client with cluster-admin access available (for any user, but only the current master) should be able to diagnose the status of infrastructure such as nodes, registry, and router. In each case, running oc adm diagnostics looks for the client configuration in its standard location and uses it if available.
Some additional diagnostic checks are available through the openshift-ansible container image. See the image’s source repository for usage information.
The following health checks belong to a diagnostic task meant to be run against the Ansible inventory file for a deployed OpenShift Origin cluster. They can report common problems for the current OpenShift Origin installation.
Check Name | Purpose |
---|---|
`ovs_version` | This check ensures that a host has the correct version of Open vSwitch installed for the currently deployed version of OpenShift Origin. |
`curator`, `elasticsearch`, `fluentd` | This set of checks verifies that Elasticsearch, Fluentd, and Curator pods have been deployed and are in a running state. |
`etcd_imagedata_size` | This check measures the total size of OpenShift Origin image data in an etcd cluster. The check fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check fails if the size of image data amounts to 50% or more of the currently used space in the etcd cluster. A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift Origin image data, which can eventually result in your etcd cluster crashing. A user-defined limit may be set by passing the `etcd_max_image_data_size_bytes` variable. |
`etcd_volume` | This check ensures that the volume usage for an etcd cluster is below a maximum user-specified threshold. If no maximum threshold value is specified, it defaults to 90% of the total volume size. A user-defined limit may be set by passing the `etcd_device_usage_threshold_percent` variable. |
`docker_storage` | Only runs on hosts that depend on the docker daemon (nodes and containerized installations). Checks that docker's total usage does not exceed a user-defined limit. If no user-defined limit is set, docker's maximum usage threshold defaults to 90% of the total size available. The threshold limit for total percent usage can be set with a variable in your inventory file, for example `max_thinpool_data_usage_percent=90`. |
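For example, the thresholds described above can be set in your inventory file. The values below are illustrative, and [OSEv3:vars] is the group commonly used for such cluster-wide variables:
[OSEv3:vars]
etcd_max_image_data_size_bytes=40000000000
etcd_device_usage_threshold_percent=40
max_thinpool_data_usage_percent=90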
To disable specific checks, include the variable openshift_disable_check with a comma-delimited list of check names in your inventory file. For example:
openshift_disable_check=ovs_version,etcd_volume
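Assuming a standard openshift-ansible checkout, these health checks are run as a playbook against your inventory; the playbook path below reflects the openshift-ansible layout at the time of writing and may vary between releases:
$ ansible-playbook -i /path/to/inventory playbooks/byo/openshift-checks/health.yml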
A similar set of checks meant to run as part of the installation process can be found in Configuring Cluster Pre-install Checks. Another set of checks, for certificate expiration, can be found in Redeploying Certificates.