$ oc adm diagnostics
The oc adm diagnostics command runs a series of checks for error conditions in the host or cluster. Specifically, it:
- Verifies that the default registry and router are running and correctly configured.
- Checks ClusterRoleBindings and ClusterRoles for consistency with base policy.
- Checks that all of the client configuration contexts are valid and can be connected to.
- Checks that SkyDNS is working properly and the pods have SDN connectivity.
- Validates master and node configuration on the host.
- Checks that nodes are running and available.
- Analyzes host logs for known errors.
- Checks that systemd units are configured as expected for the host.
OpenShift Origin can be deployed in many ways: built from source, included in a VM image, in a container image, or as enterprise RPMs. Each method implies a different configuration and environment. To minimize environment assumptions, the diagnostics were added to the openshift binary so that wherever there is an OpenShift Origin server or client, the diagnostics can run in the exact same environment.
To use the diagnostics tool, preferably on a master host and as cluster administrator, run:
$ oc adm diagnostics
This runs all available diagnostics, skipping any that do not apply. For example, the NodeConfigCheck does not run unless a node configuration is available. You can also run specific diagnostics by name as you work to address issues. For example:
$ oc adm diagnostics NodeConfigCheck UnitStatus
Diagnostics look for configuration files in standard locations:
- Client:
  - As indicated by the $KUBECONFIG environment variable
  - ~/.kube/config file
- Master:
  - /etc/origin/master/master-config.yaml
- Node:
  - /etc/origin/node/node-config.yaml

Non-standard locations can be specified with flags (respectively, --config, --master-config, and --node-config). If a configuration file is not found or specified, related diagnostics are skipped.
Consult the --help output for all available options.
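For example, to point the client diagnostics at a kubeconfig file outside the standard locations, you might run something like the following (ConfigContexts is one of the client-side diagnostics; the path is illustrative):
$ oc adm diagnostics ConfigContexts --config=/path/to/kubeconfig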
Master and node diagnostics are most useful in a specific target environment, which is a deployment of RPMs with Ansible deployment logic. This provides some diagnostic benefits:
- Master and node configuration is based on a configuration file in a standard location.
- Systemd units are configured to manage the server(s).
- All components log to journald.
Having configuration files where Ansible places them means that you will generally not need to specify where to find them. Running oc adm diagnostics without flags will look for master and node configurations in the standard locations and use them if found; this should make the Ansible-installed use case as simple as possible. It is also easy to specify configuration files that are not in the expected locations:
$ oc adm diagnostics --master-config=<file_path> --node-config=<file_path>
Systemd units and log entries in journald are necessary for the current log diagnostic logic. For other deployment types, logs may be going into files, to stdout, or may combine node and master. At this time, log diagnostics are not able to work properly in these situations and will be skipped.
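As a quick sanity check before relying on the log diagnostics, you can confirm that the server units exist and are logging to journald. The unit names below assume a standard Origin RPM installation and may differ in your deployment:
$ systemctl status origin-master origin-node
$ journalctl -u origin-master --since "1 hour ago" | tail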
You may have access as an ordinary user, as a cluster-admin user, or both, and may be running on a host where OpenShift Origin master or node servers are operating. The diagnostics attempt to use as much access as the user has available.
A client with ordinary access should be able to diagnose its connection to the master and run a diagnostic pod. If multiple users or masters are configured, connections will be tested for all, but the diagnostic pod only runs against the current user, server, or project.
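For example, an ordinary user can run just the pod-based diagnostic, which deploys a short-lived pod to test DNS and networking from within the cluster (DiagnosticPod is one of the standard diagnostic names):
$ oc adm diagnostics DiagnosticPod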
A client with cluster-admin access available (for any user, but only the current master) should be able to diagnose the status of infrastructure such as nodes, registry, and router. In each case, running oc adm diagnostics looks for the client configuration in its standard location and uses it if available.
Some additional diagnostic checks are available through the openshift-ansible container image. See the image’s source repository for usage information.
The following health checks belong to a diagnostic task meant to be run against the Ansible inventory file for a deployed OpenShift Origin cluster. They can report common problems for the current OpenShift Origin installation.
Check Name | Purpose |
---|---|
`ovs_version` | This check ensures that a host has the correct version of Open vSwitch installed for the currently deployed version of OpenShift Origin. |
`curator`, `elasticsearch`, `fluentd` | This set of checks verifies that Elasticsearch, Fluentd, and Curator pods have been deployed and are in a running state. |
`etcd_imagedata_size` | This check measures the total size of OpenShift Origin image data in an etcd cluster. The check fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check fails if the size of image data amounts to 50% or more of the currently used space in the etcd cluster. A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift Origin image data, which can eventually result in your etcd cluster crashing. A user-defined limit may be set by passing the `etcd_max_image_data_size_bytes` variable. |
`etcd_volume` | This check ensures that the volume usage for an etcd cluster is below a maximum user-specified threshold. If no maximum threshold value is specified, it defaults to 90% of the total volume size. A user-defined limit may be set by passing the `etcd_device_usage_threshold_percent` variable. |
`docker_storage` | Only runs on hosts that depend on the docker daemon (nodes and containerized installations). Checks that docker's total usage does not exceed a user-defined limit. If no user-defined limit is set, docker's maximum usage threshold defaults to 90% of the total size available. The threshold limit for total percent usage can be set with a variable in your inventory file, for example `max_thinpool_data_usage_percent=90`. |
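For example, the thresholds described above can be set in your inventory file. The values below are illustrative, and [OSEv3:vars] is the group commonly used for such cluster-wide variables:
[OSEv3:vars]
etcd_max_image_data_size_bytes=40000000000
etcd_device_usage_threshold_percent=40
max_thinpool_data_usage_percent=90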
To disable specific checks, include the variable openshift_disable_check with a comma-delimited list of check names in your inventory file. For example:
openshift_disable_check=ovs_version,etcd_volume
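Assuming a standard openshift-ansible checkout, these health checks are run as a playbook against your inventory; the playbook path below reflects the openshift-ansible layout at the time of writing and may vary between releases:
$ ansible-playbook -i /path/to/inventory playbooks/byo/openshift-checks/health.yml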
A similar set of checks meant to run as part of the installation process can be found in Configuring Cluster Pre-install Checks. Another set of checks, for certificate expiration, can be found in Redeploying Certificates.