sanlock messages
====================================================

sanlock messages/logging can be configured in various ways.
The following is based on sanlock 3.5.0.
2017-06-02


default message prefixes
------------------------

"2017-04-28 10:34:52 1491929 [23442]:"

. 2017-04-28 10:34:52
  date from gettimeofday(2), printed with localtime_r(3), or gmtime_r(3).
  (gmtime with config setting logfile_use_utc)

. 1491929
  monotonic time in seconds from clock_gettime(2) with CLOCK_MONOTONIC.

. [23442]
  thread id, see gettid(2).

In /var/log/sanlock.log, the prefix above is all that appears.
In /var/log/messages, syslog typically adds its own prefix
ahead of the prefix above (making for very lengthy prefixes
in /var/log/messages.)


default message destinations
----------------------------

1. /var/log/sanlock.log

By default, LOG_ERR and LOG_WARNING (log_error/log_warn) messages are
saved here.  To record all messages in sanlock.log, (including debug), set
sanlock.conf logfile_priority = 7.

2. /var/log/messages

By default, LOG_ERR (log_error) messages are set to syslog.  To record
both error and warning messages in /var/log/messages, set sanlock.conf
syslog_priority = 4.  To record all messages (including debug), set
syslog_priority = 7.

3. sanlock client log_dump

The daemon's 1MB internal circular message buffer includes all messages.
To print all messages since the daemon was started (or since the circular
buffer rolled over), run: "sanlock client log_dump" (and usually redirect
output to a file.)

As soon as possible after a problem occurs, it's recommended to save a
copy of this debug data to a file for later inspection.  If too much time
passes, the daemon will be restarted or the important data will be pushed
out of the buffer.  If it's not possible to get debug data from the daemon
after a failure (e.g. the host is always reset or sanlock is restarted),
then saving all the debug data to sanlock.log (logfile_priority = 7) is an
alternative.


Standard informational messages
=============================================================================

The following messages for normal events are recorded by default in
sanlock.log.


daemon started
--------------

"sanlock daemon started 3.5.0 host example"

or

"sanlock daemon started 3.5.0 host 638bd439-538c-47d5-826c-3b094075c598.example"

Every time the sanlock daemon is started, this is saved in sanlock.log.
The monotonic time in the prefix can be a useful reference, and the name
of the host recorded here is a key piece of information when debugging.

If the name of the host is not statically defined in sanlock.conf
(our_host_name), then sanlock generates a new host name every time the
daemon is started.  The new host name must be unique among all hosts, so
sanlock uses a random UUID (for uniqueness) and appends several characters
from the uname nodename (for help debugging).


lockspace add
-------------

"s1 lockspace lockspacename:2:/path/to/file_or_device:0"

When an application joins a lockspace, the lockspace definition is saved
in sanlock.log.

<lockspace_name>:<host_id>:<path>:<offset>
	<lockspace_name>      name of lockspace
	<host_id>             local host identifier in lockspace
	<path>                path to storage reserved for leases
	<offset>              offset on path (bytes)

The "s1" prefix is an abbreviation that sanlock uses to refer to this
lockspace in subsequent messages (because a UUID as the lockspacename may
be hard to read.)  The numbers in s<NUM> are simply incremented while the
daemon continues running, and restart at 1 when the daemon is restarted.


resource lease acquire
----------------------

"s1:r1 resource lockspacename:resourcename:/path/to/file_or_device:68157440 for 2,9,6073"

When an application acquires a resource lease, the resource definition is saved
in sanlock.log.

<lockspace_name>:<resource_name>:<path>:<offset>[:<lver>]
	<lockspace_name>      name of lockspace
	<resource_name>       name of resource
	<path>                path to storage reserved for leases
	<offset>              offset on path (bytes)
	<lver>                optional leader version or SH for shared lease

The "r1" prefix is an abbreviation for the resource name, like the "s1"
prefix above for lockspace names.

"for NUM1,NUM2,NUM3"
	Identifies the source of the lease request:
	NUM1 is the internal client id,
	NUM2 is the file descriptor used for the request,
	NUM3 is the client pid.


hosts discovered
----------------

"s1 host 1 11 172064 host-01"
"s1 host 2 23 172038 host-02"

When new hosts in a lockspace are first detected, their information is
recorded in sanlock.log.  Upon joining a lockspace, all hosts are new to
the joining host, and the joining host will record the details of all
current members (including itself).

The "s1" prefix above is the lockspace name abbreviation (see above).  

"host NUM1 NUM2 NUM3 NAME"
	Uniquely identifies a host at the current point in time.
	NUM1 is the host_id
	NUM2 is the host_id generation
	(incremented each time the host_id is used to join a lockspace)
	NUM3 is the current renewal timestamp recorded in the host_id's
	delta lease
	NAME is the unique host name (see daemon started message above)


Unusual events
=============================================================================

Messages for these unusual events generally appear in both sanlock.log
and /var/log/messages.


open error
----------

"open error -1 /path/to/file_or_device"

open(2) on the file or LV failed.  The error number may be some help in
finding the problem.  It could be things like permissions, selinux, broken
storage setup.  Try reading directly to see if the storage works, e.g.
"dd if=/path/to/file_or_device of=/dev/null bs=512 count=2048 iflag=direct"


io timeout
----------

"15aadd15 aio timeout RD 0x7f70140008c0:0x7f70140008d0:0x7f702e8a1000 ioto 10 to_count 1"

An async i/o operation timed out.  The read may complete later, in which
case a corresponding "aio collect" message will appear with the result.

"15aadd15"
	Thread id.  For the most common case of a lockspace thread,
	these are the first characters of the lockspace name that
	the thread is dedicated to.  For a worker thread, this will
	be something like "worker0".

"aio timeout WR|RD"
	An aio read or write operation did not complete in the specified
	timeout period (default 10 seconds).
	(In previous versions, 1|0 were used in place of WR|RD.)

"0x<NUM1>:0x<NUM2>:0x<NUM3>"
	Internal buffer pointers used for the aio op
	(aicb, data buffer, iocb).

"ioto 10"
	The io timeout in seconds.

"to_count 1"
	A counter of the number of io timeouts recorded by this thread.


io results
----------

"443aa0b3 aio collect WR 0x7f7db80008c0:0x7f7db80008d0:0x7f7daf6fe000 result -28:0 match res"

or

"742162cd aio collect RD 0x7f266c0008c0:0x7f266c0008d0:0x7f266c101000 result 0:0 match len 512"

An async i/o operation completed (in a non-standard way).

"443aa0b3"
	Thread id.  (See above)

"aio collect WR|RD"
	This is the result of an aio operation (write|read).
	(In previous versions, 1|0 were used in place of WR|RD.)

"0x<NUM1>:0x<NUM2>:0x<NUM3>"
	Internal buffer pointers used for the aio op
	(aicb, data buffer, iocb).

"result -28:0 match res ..."
	The aio result from the kernel io_event.res and io_event.res2.
	If io_event.res is a negative error number, the aio op failed.
	sanlock does not try to interpret specific negative numbers
	(several different error numbers can appear here.)

Other possible results following aio collect:

"result <A>:<B> match len <L> ..."
	The aio result from the kernel io_event.res and io_event.res2.
	If io_event.res is not negative, but does not match the requested
	length <L>, the aio op failed.

"result <A>:<B> old free"
	The aio result from the kernel is for an old i/o that sanlock
	has given up on, and this result is just discarded.

"result <A>:<B> other free ..."
	The aio result from the kernel is for another i/o that sanlock
	is not waiting for, and this result is just discarded.

"result <A>:<B> other free ..."
	The aio result from the kernel is for another i/o that sanlock
	is not waiting for, and this result is just discarded.

"result <A>:<B> match reap"
	The aio result is late (arrived after the standard timeout),
	but was successful, and can be used.


close_task_aio
--------------

"60c7bc7a close_task_aio 0 0x7fa7780008c0 busy"

"60c7bc7a close_task_aio destroyed <N> incomplete ops"

A thread is exiting (typically a lockspace thread), and freeing memory.
Before freeing memory associated with incomplete aio operations, it
attempts to collect the results.  If that fails, it logs the "busy"
message.  Eventually, it will give up waiting and print the "destroyed"
message.


renewal problems
----------------

Every 20 seconds (assuming the default 10 sec io timeout) a host renews
the delta lease for it's host_id in each lockspace it has joined.  This
renewal consists of reading the single sector for the delta lease,
verifying the contents, updating the timestamp in that sector to the
current time on the host, and writing that sector back to storage.
Error/warning messages are printed when the i/o involved in the renewal
fails, and when the host has not been able to complete a successful
renewal for an extended period.

These errors are not necessarily critical on their own.  They can happen
periodically without any ill effects, but they are more important when
they persist long enough to trigger more serious failure.

----

"s2 delta_renew read timeout 10 sec offset 0 /path/to/file_or_device"
"s2 renewal error -202 delta_length 10 last_success 1483"

This shows that a renewal read timed out after not completing in 10 sec.

"offset 0 /path/to/file_or_device"
	The location of the delta lease on disk.

"error -202"
	The i/o timeout error number.

"delta_length 10"
	Seconds the delta lease renewal operation took before returning.

"last_success 1483"
	The timestamp that was used in the last successful
	delta lease renewal.

----

"s15 delta_renew read rv -2 offset 0 /path/to/file_or_device"
"s15 renewal error -2 delta_length 0 last_success 24559"

This shows that a renewal read failed because an error was returned.

"offset 0 /path/to/file_or_device"
	The location of the delta lease on disk.

"rv -2"
	The error number returned from the read.  Different kinds of i/o
	errors from different kinds of storage will result in different
	error numbers here.

"delta_length 0"
	The time the renewal operation took before returning.
	In the case of 0, the read failed immediately.

"last_success 24559"
	The timestamp that was used in the last successful
	delta lease renewal.

----

"s1 delta_renew write time <T> error -<N>"

This is printed when the renewal write operation fails.  It is less common
than a read error above.

"time <T>"
	The number of seconds the write operation took before failing.

"error -<N>"
	The specific error returned from the kernel.

----

"s1 delta_renew long write time <T> sec"

This is printed when a successful renewal write operation takes longer
than the default i/o timeout (i/o timeouts are not applied to renewal
writes, but the value is used as the threshold for emitting a warning.)

"time <T> sec"
	The time taken by the write.

----

"s17 renewed 1232493 delta_length 22 too long"

This is printed when a successful lockspace renewal operation (one read
and one write) takes longer than 20 seconds (twice the 10 second default
i/o timeout).  An i/o timeout is not applied to the renewal write, so 20
seconds is used as the threshold for emitting the warning.

"delta_length 22"
	The number of seconds taken by the renewal (both i/o's).

"renewed 1232493"
	The local timestamp that was used in the renewal.


renewal failure
---------------

When a host does not renew its delta lease for a long enough period of
time, warnings start to be emitted in anticipation of a pending
expiration.  If the host manages to renew the lease at some point in the
following sequence, then the escalation and warnings stop.

As soon as any of these warnings appear, it would be prudent to collect
the output of "sanlock client renewal -s LOCKSPACE" (and probably "sanlock
client log_debug"), so that a final expiration can be more thoroughly
understood later.

s20 check_our_lease warning 60 last_success 1319137
...
s20 check_our_lease warning 79 last_success 1319137
s20 check_our_lease failed 80

The "check_our_lease warning" messages begin when sanlock has not been
able to renew a delta lease for 60 seconds (it is supposed to be renewed
once every 20 seconds).  The message is printed once a second, indicating
how many seconds old the last successful renewal is, and the timestamp
used in the last successful renewal.  This is a critical warning.  Unless
the renewal i/o succeeds within 20 seconds of the first warning, the
"check_our_lease failed 80" message is printed, at which point sanlock
enters recovery mode for the lockspace.

s20 kill 42444 sig 15 count 1
...
s20 kill 42444 sig 15 count 10

After "check_our_lease failed", kill messages appear, printed once a
second, indicating that sanlock has entered recovery mode for the
lockspace.  In this phase, it attempts to shut down any processes using
the lockspace.  "kill <PID>" is the process being told to quit.  "sig <N>"
is the signal being sent to the process.  sig 100 means that the process
has a graceful shutdown mechanism that sanlock is using.  sig 15 is
SIGTERM, which is used when the process has no graceful shutdown mechanism
specified.  sig 9 is SIGKILL, which is used if the others do not.  count
shows the number of attempts that have been made to send the signal.

dead 42444 ci 3 count 10

The specified pid that was killed (per messages above) has exited.
This is a successful step in lockspace recovery phase.

s20 all pids clear

All processes that were using the failing lockspace have exited, and the
lockspace recovery phase as completed successfully.  At this point, the
watchdog is disarmed for the lockspace, and the host should avoid being
forcibly reset.  If this message does not appear, the wdmd warnings below
will continue counting down until the host is reset.


wdmd warnings
-------------

While the check_our_lease warnings appear from sanlock, wdmd also begins
printing warnings indicating that the watchdog timeout for the expiring
lease is getting close.  These messages appear only in /var/log/messages.

wdmd[3627]: test warning now 5909660 ping 5909650 close 5210877 renewal 5909586 expire 5909666 client 8476 sanlock_lockspacename
wdmd[3627]: /dev/watchdog closed unclean
wdmd[3627]: test warning now 5909661 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock_lockspacename
...
wdmd[3627]: test failed rem 54 now 5909666 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock_lockspacename
wdmd[3627]: test failed rem 53 now 5909667 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock_lockspacename
wdmd[3627]: /dev/watchdog reopen


"client 8476 sanlock_lockspacename"
	This is the pid, of sanlock in this case, and a name provided by
	the process for the wdmd connection.

"/dev/watchdog closed unclean"
	The watchdog device has been closed to generate a potentially
	final ping in case it needs to fire for a pending expiration.

"/dev/watchdog reopen"
	The watchdog device has been reopened because an expiring
	connection has been refreshed before the watchdog fired.

"test warning"
	A connection is close to expiring.

"test failed"
	A connection has expired, and the watchdog will no longer
	be pinged, forcing it to reset the host.  If the connection
	is refreshed within 60 seconds, the reset can still be averted.

"rem N"
	N seconds remain until the watchdog fires.

"now N1 ping N2 close N3 renewal N4 expire N5"
	All monotonic host times in seconds.
	now is the current time.
	ping is the time /dev/watchdog was last pinged.
	close is the time /dev/watchdog was last closed (generates ping).
	renewal is the last refresh time from the connection owner.
	expire is the expiration time set by the connection owner.


resource lease contention
-------------------------

When hosts are simultaneously attempting to acquire a single resource
lease, the disk paxos algorithm in sanlock resolves the owner.  In cases
like these, the hosts log the paxos algorithm details in sanlock.log and
syslog.  These events are expected when hosts contend for leases.  They
are logged because they are infrequent and if debugging around one of
these events is needed, it's important to have all the details.

There are various ways that the algorithm can resolve the owner, and the
details may be logged differently, e.g.

r1128 ballot 437 choose bk_max[2] lver 437 mbal 3 bal 3 inp 3 1 8692
r1128 ballot 437 commit other owner 3 1 8692

Another host (host_id 3) acquired the lease.

...

r1014 ballot 372 abort2 mbal 1 mbal[2] 3
r1014 paxos_acquire 372 retry delay 631839 us
r1014 paxos_acquire 372 owner is our inp 1 1 6757188 commited by 3

The local host (host_id 1) acquired the lease.

...

r691 paxos_acquire owner 3 1 8359 writer 1 owner dblock free
r691 ballot 167 abort2 mbal 1 mbal[2] 3
r691 paxos_acquire 167 retry delay 572312 us
r691 paxos_acquire 167 owner is our inp 1 1 6756929 commited by 3

The local host (host_id 1) acquired the lease.

...

r654 ballot 145 abort1 mbal 1 mbal[2] 3
r654 paxos_acquire 145 retry delay 13757 us
r654 paxos_acquire 145 owner is 3 1 0

Another host (host_id 3) acquired the lease.