LMDB dead process detection
There's been a long-running discussion about the need to have APIs in liblmdb
for displaying the reader table and clearing out stale slots. Quite a few open
questions on the topic:
1) What should the API look like for examining the table?
My initial instinct is to provide an iterator function that returns info
about the next slot each time it's called. I'm not sure that's necessary, or
the most convenient interface, though.
Another possibility is just a one-shot function that walks the table itself
and dumps the output as a formatted string to stdout, stderr, or a custom
output callback.
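For concreteness, one possible shape for the callback-style walker; the names
and signatures here are just a sketch, not a committed API:

/* Caller-supplied output sink: could write to stdout, stderr, a log, etc. */
typedef int (MDB_msg_func)(const char *msg, void *ctx);

/* Walk the reader lock table, format one line of text per slot, and
 * pass each line to func along with the caller's ctx pointer.
 * Returns the number of slots reported, or <0 on error. */
int mdb_reader_list(MDB_env *env, MDB_msg_func *func, void *ctx);

A callback keeps the formatting logic in one place while still letting
callers redirect the output wherever they like.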
2) What should APIs look like for clearing out a stale slot?
Should it just be implicit inside the library, with no externally visible
API? I.e., should the library periodically check on its own, with no outside
intervention? Or should there be an API that lets a user explicitly request a
particular slot to be freed? The latter sounds pretty dangerous, since
freeing a slot that's actually still in use would allow a reader's view of the
DB to be corrupted.
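For the record, the two shapes might look roughly like this (names are
illustrative; the second is the dangerous variant just described):

/* Check every reader slot and clear only those whose owning process
 * can be shown to be dead. *dead receives the number of slots cleared. */
int mdb_reader_check(MDB_env *env, int *dead);

/* Unconditionally free a caller-chosen slot. Dangerous: clearing a
 * slot that is still in use can corrupt a live reader's view. */
int mdb_reader_clear(MDB_env *env, unsigned int slot);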
3) What approach should be used for automatic detection of stale slots?
Currently we record the process ID and thread ID of a reader in the table.
It's not clear to me that the thread ID has anything more than informational
value. Since we register a per-thread destructor for slots, exiting threads
should never be leaving stale slots in the first place. I'm also not sure that
there are good APIs for an outside caller to determine the liveness of a given
thread ID.
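For reference, the per-thread destructor is just the standard pthreads
mechanism; a minimal sketch, with the reader record reduced to a stand-in
struct:

#include <pthread.h>

/* Stand-in for the real reader-table record. */
typedef struct MDB_reader_sketch {
	volatile int mr_pid;	/* 0 == slot is free */
} MDB_reader_sketch;

static pthread_key_t mdb_rkey;

/* Called automatically when a thread that set the key exits, so a
 * cleanly exiting thread always releases its own slot. */
static void mdb_reader_dtor(void *ptr)
{
	((MDB_reader_sketch *)ptr)->mr_pid = 0;
}

static void mdb_rkey_init(void)
{
	pthread_key_create(&mdb_rkey, mdb_reader_dtor);
}

A thread that acquires a slot calls pthread_setspecific(mdb_rkey, slot); the
destructor then fires on thread exit, which is why the thread ID really only
has display value.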
The process ID is also prone to wraparound; it's still very common for
Linux systems to use 15-bit process IDs. So just checking that a pid is still
alive doesn't guarantee that it's the same process that originally registered
the reader slot. We have two main approaches to working around this:
A) set a byte range lock for every process attached to the environment.
This is what slapd's alock.c already does, which is used with BDB- and LDBM-
based backends. This is fairly portable code, and has the desirable property
that file locks automatically go away when a process exits. But:
a) On Windows, the OS can take several minutes to clean up the locks of
an exited process. So just checking for presence of a lock could erroneously
consider a process to be alive long after it had actually died.
b) file lock syscalls are fairly slow to execute. If we are checking
liveness frequently, there will be a noticeable performance hit. Their
performance also degrades sharply as the number of processes locking
concurrently grows, and degrades further still if networked filesystems are
involved.
c) This approach won't tell us if a process is in Zombie state.
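A minimal sketch of (A) using POSIX fcntl() locks, assuming one lock byte per
pid in the lock file; the offset scheme is an illustrative choice, not a
settled design:

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Each process holds an exclusive lock on the byte at offset == pid,
 * for the life of the process; the kernel drops it on exit. */
static int pid_lock_set(int lfd, pid_t pid)
{
	struct flock lk = {0};
	lk.l_type   = F_WRLCK;
	lk.l_whence = SEEK_SET;
	lk.l_start  = pid;
	lk.l_len    = 1;
	return fcntl(lfd, F_SETLK, &lk);
}

/* Probe another process's byte with F_GETLK: if no lock is held
 * there, the registered process is gone. Note that a process's own
 * locks never conflict with itself, so it must not probe its own pid. */
static int pid_is_alive(int lfd, pid_t pid)
{
	struct flock lk = {0};
	lk.l_type   = F_WRLCK;
	lk.l_whence = SEEK_SET;
	lk.l_start  = pid;
	lk.l_len    = 1;
	if (fcntl(lfd, F_GETLK, &lk) < 0)
		return -1;
	return lk.l_type != F_UNLCK;
}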
B) check process ID and process start time.
This appears to be a fairly reliable approach, and reasonably fast, but there
is no POSIX standard API for obtaining this process information. Methods for
obtaining the info are fairly well documented across a variety of platforms
(AIX, HPUX, multiple BSDs, Linux, Solaris, etc.) but they are all different.
It appears that we can implement this compactly for each of the systems, but
it means carrying around a dozen or so different implementations.
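For illustration, here is roughly what the Linux variant would look like
(field 22 of /proc/<pid>/stat is the start time in clock ticks since boot);
every other platform needs a similarly small but entirely different snippet:

#include <stdio.h>
#include <sys/types.h>

/* Return the process start time in clock ticks since boot, or -1 if
 * the pid doesn't exist. Linux-only sketch. */
static long long proc_start_time(pid_t pid)
{
	char path[64];
	unsigned long long start = 0;
	FILE *f;
	int c, i;

	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	if ((f = fopen(path, "r")) == NULL)
		return -1;
	/* Field 2 (comm) may contain spaces; skip past its closing ')'. */
	while ((c = fgetc(f)) != EOF && c != ')')
		;
	/* Skip fields 3..21; field 22 is starttime. */
	for (i = 3; i < 22; i++)
		if (fscanf(f, "%*s") == EOF)
			break;
	if (fscanf(f, "%llu", &start) != 1)
		start = 0;
	fclose(f);
	return start ? (long long)start : -1;
}

A slot's owner is then considered alive only if both the pid and the start
time match, which reduces the wraparound problem to the negligible chance of
a recycled pid starting on the same clock tick.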
Also, assuming we want to support shared LMDB access across NFS (as discussed
in an earlier thread), it seems we're going to have to use a lock-based
solution anyway, since process IDs won't be meaningful across host boundaries.
We can implement approach (A) fairly easily, with no major repercussions. For
(B) we would need to add a field to the reader table records to store the
process start time. (Thus a lockfile format change.)
(Note: the performance of fcntl locks vs. checking process start time was
measured with some simple code on my laptop running Linux. These functions are
all highly OS-dependent, so the perf ratios may vary quite a lot from system
to system.)
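The measurement loop can be as simple as the following, using the
pid_is_alive() and proc_start_time() sketches above; only the ratio between
the two timings is meaningful:

#include <stdio.h>
#include <sys/types.h>
#include <time.h>

static double since(const struct timespec *t0)
{
	struct timespec t1;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0->tv_sec) + (t1.tv_nsec - t0->tv_nsec) * 1e-9;
}

/* Time n liveness probes through each mechanism. */
void bench_probes(int lfd, pid_t pid, int n)
{
	struct timespec t0;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++)
		pid_is_alive(lfd, pid);
	printf("fcntl F_GETLK:   %.3fs\n", since(&t0));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++)
		proc_start_time(pid);
	printf("/proc starttime: %.3fs\n", since(&t0));
}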
The relative performance may not even be an issue in general, since we would
only need to trigger a scan if a writer actually finds that some reader txn is
preventing it from using free pages from the freeDB. Most of the time this
wouldn't be happening. But if there were a legitimate long-running read txn
(e.g., for mdb_env_copy) we might find ourselves checking fairly often.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/