Diskprediction Module

The diskprediction module supports two modes: cloud mode and local mode. In cloud mode, disk and Ceph operating status information is collected from the Ceph cluster and sent to a cloud-based DiskPrediction server over the Internet. The DiskPrediction server analyzes the data and provides analytics and prediction results for the performance and disk health state of Ceph clusters.

Local mode does not require an external server for data analysis. In local mode, the diskprediction module uses an internal predictor module for the disk prediction service and returns the results to the Ceph system.

Local predictor: 70% accuracy
Cloud predictor (free): 95% accuracy

Enabling

Run one of the following commands to enable the diskprediction module you want to use:

ceph mgr module enable diskprediction_cloud
ceph mgr module enable diskprediction_local
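
To confirm that a module is active, you can list the manager modules and check for the diskprediction entries; a quick check, assuming a standard Ceph CLI setup:

ceph mgr module ls | grep diskprediction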

Select the prediction mode:

ceph config set global device_failure_prediction_mode local

or:

ceph config set global device_failure_prediction_mode cloud

To disable prediction:

ceph config set global device_failure_prediction_mode none
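
To confirm which mode is currently configured, one option is to dump the cluster configuration and filter for the setting (ceph config dump lists every option set with ceph config set):

ceph config dump | grep device_failure_prediction_mode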

Connection settings

Connection settings are used for the connection between Ceph and the DiskPrediction server.

Local Mode

The diskprediction module leverages the Ceph device health check to collect disk health metrics and uses an internal predictor module to produce disk failure predictions, which it returns to Ceph. No connection settings are therefore required in local mode. The local predictor module requires at least six datasets of device health metrics to make a prediction.

Run the following command to use the local predictor to predict device life expectancy:

ceph device predict-life-expectancy <device id>
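
The <device id> argument is the cluster-wide device identifier, not a local path such as /dev/sdb. A short sketch of how to find it, assuming the devicehealth module is collecting metrics:

# List devices known to the cluster together with their ids
ceph device ls

# Inspect the health metrics collected for one device
ceph device get-health-metrics <device id>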

Cloud Mode

User registration is required in cloud mode. Users must sign up for an account at https://www.diskprophet.com/#/ to receive the following DiskPrediction server information for the connection settings.

Certificate file path: After user registration is confirmed, the system sends a confirmation email that includes a certificate file download link. Download the certificate file and save it to the Ceph system. The certificate file is verified as part of the connection setup command below; without certificate file verification, the connection settings cannot be completed.

DiskPrediction server: The DiskPrediction server name; an IP address can be used if required.

Connection account: An account name used to set up the connection between Ceph and the DiskPrediction server.

Connection password: The password used to set up the connection between Ceph and the DiskPrediction server.

Run the following command to complete the connection setup:

ceph device set-cloud-prediction-config <diskprediction_server> <connection_account> <connection_password> <certificate_file_path>
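
For example, a hypothetical invocation might look as follows; the server name, account, password, and certificate path below are illustrative placeholders, not real values, so use the ones from your registration email:

ceph device set-cloud-prediction-config api.diskprophet.com my-account my-password /etc/ceph/diskprophet-cert.pem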

You can use the following command to display the connection settings:

ceph device show-prediction-config

The following additional configuration settings are optional:

diskprediction_upload_metrics_interval

The interval at which Ceph performance metrics are sent to the DiskPrediction server. Default is 10 minutes.

diskprediction_upload_smart_interval

The interval at which Ceph physical device information is sent to the DiskPrediction server. Default is 12 hours.

diskprediction_retrieve_prediction_interval

The interval at which Ceph retrieves physical device prediction data from the DiskPrediction server. Default is 12 hours.
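
Manager module options such as these are normally adjusted through the ceph config mechanism. A sketch, assuming the options follow the standard mgr/<module>/<option> naming used by manager modules and that the value is given in the unit of the documented default:

# Send performance metrics every 20 minutes instead of the default 10
ceph config set mgr mgr/diskprediction_cloud/diskprediction_upload_metrics_interval 20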

Diskprediction Data

The diskprediction module actively sends/retrieves the following data to/from the DiskPrediction server.

Metrics Data

  • Ceph cluster status

key                     Description
cluster_health          Ceph health check status
num_mon                 Number of monitor nodes
num_mon_quorum          Number of monitors in quorum
num_osd                 Total number of OSDs
num_osd_up              Number of OSDs that are up
num_osd_in              Number of OSDs that are in the cluster
osd_epoch               Current epoch of the OSD map
osd_bytes               Total capacity of the cluster in bytes
osd_bytes_used          Number of used bytes in the cluster
osd_bytes_avail         Number of available bytes in the cluster
num_pool                Number of pools
num_pg                  Total number of placement groups
num_pg_active_clean     Number of placement groups in active+clean state
num_pg_active           Number of placement groups in active state
num_pg_peering          Number of placement groups in peering state
num_object              Total number of objects in the cluster
num_object_degraded     Number of degraded (missing replicas) objects
num_object_misplaced    Number of misplaced (wrong location in the cluster) objects
num_object_unfound      Number of unfound objects
num_bytes               Total number of bytes of all objects
num_mds_up              Number of MDSs that are up
num_mds_in              Number of MDSs that are in the cluster
num_mds_failed          Number of failed MDSs
mds_epoch               Current epoch of the MDS map

  • Ceph mon/osd performance counters

Mon:

key              Description
num_sessions     Current number of open monitor sessions
session_add      Number of created monitor sessions
session_rm       Number of remove_session calls in the monitor
session_trim     Number of trimmed monitor sessions
num_elections    Number of elections the monitor took part in
election_call    Number of elections started by the monitor
election_win     Number of elections won by the monitor
election_lose    Number of elections lost by the monitor

Osd:

key                      Description
op_wip                   Replication operations currently being processed (primary)
op_in_bytes              Client operations total write size
op_r                     Client read operations
op_out_bytes             Client operations total read size
op_w                     Client write operations
op_latency               Latency of client operations (including queue time)
op_process_latency       Latency of client operations (excluding queue time)
op_r_latency             Latency of read operations (including queue time)
op_r_process_latency     Latency of read operations (excluding queue time)
op_w_in_bytes            Client data written
op_w_latency             Latency of write operations (including queue time)
op_w_process_latency     Latency of write operations (excluding queue time)
op_rw                    Client read-modify-write operations
op_rw_in_bytes           Client read-modify-write operations write in
op_rw_out_bytes          Client read-modify-write operations read out
op_rw_latency            Latency of read-modify-write operations (including queue time)
op_rw_process_latency    Latency of read-modify-write operations (excluding queue time)

  • Ceph pool statistics

key           Description
bytes_used    Per-pool bytes used
max_avail     Maximum available number of bytes in the pool
objects       Number of objects in the pool
wr_bytes      Number of bytes written in the pool
dirty         Number of dirty bytes in the pool
rd_bytes      Number of bytes read in the pool
stored_raw    Bytes used in the pool, including copies

  • Ceph physical device metadata

key               Description
disk_domain_id    Physical device identification ID
disk_name         Device attachment name
disk_wwn          Device WWN
model             Device model name
serial_number     Device serial number
size              Device size
vendor            Device vendor name

  • Correlation information for each Ceph object

  • The module agent information

  • The module agent cluster information

  • The module agent host information

SMART Data

  • Ceph physical device SMART data (provided by Ceph devicehealth module)

Prediction Data

  • Ceph physical device prediction data

Receiving predicted health status from a Ceph OSD disk drive

You can receive the predicted health status of a Ceph OSD disk drive by using the following command:

ceph device get-predicted-status <device id>

The get-predicted-status command returns:

{
    "near_failure": "Good",
    "disk_wwn": "5000011111111111",
    "serial_number": "111111111",
    "predicted": "2018-05-30 18:33:12",
    "attachment": "sdb"
}

Attribute        Description
near_failure     The disk failure prediction state: Good/Warning/Bad/Unknown
disk_wwn         Disk WWN number
serial_number    Disk serial number
predicted        Predicted date
attachment       Device name on the local system

The near_failure attribute indicates the predicted disk life expectancy, as shown in the following table.

near_failure    Life expectancy (weeks)
Good            > 6 weeks
Warning         2 weeks ~ 6 weeks
Bad             < 2 weeks
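
To check the prediction for every device in the cluster at once, you can loop over the device list; a minimal shell sketch, assuming ceph device ls prints the device id in the first column after a header row:

for dev in $(ceph device ls | awk 'NR>1 {print $1}'); do
    ceph device get-predicted-status "$dev"
done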

Debugging

If you want to debug the diskprediction module, raise the manager's Ceph logging level with the following setting in the configuration file:

[mgr]

    debug mgr = 20

With manager logging set to debug, the module prints log messages with the prefix mgr[diskprediction] for easy filtering.
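
The debug level can also be raised at runtime without editing the configuration file; one way, using the standard ceph config commands:

# Raise the manager debug level at runtime
ceph config set mgr debug_mgr 20

# Remove the override to return to the default level
ceph config rm mgr debug_mgr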