Hello,
in my program using LMDB, I've experienced rare deadlocks in highly
concurrent mixed (read/write/cursor iteration) workloads. The end result
is that hundreds of threads are hanging waiting on LOCK_MUTEX_W().
Unfortunately I'm not quite sure why this happens.
If my understanding is correct, this mutex is held from the beginning of
a write transaction until its commit/abort, effectively serialising writers.
So I assume that a writer somehow dies or is violently killed before it
can run its atexit() cleanups, and this shared mutex remains
locked forever.
What would you suggest for such a situation? I'm thinking of patching LMDB
to lock with pthread_mutex_timedlock() and periodically check whether the
PID that took the mutex is still alive. Is the writer PID stored somewhere,
or would a change of format be needed? Any other ideas are welcome!
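
To make the idea concrete, here is a rough sketch of the timed-lock plus
liveness-check loop I have in mind. It is not LMDB code: the owned_lock
struct, its owner_pid field and the function names are all hypothetical,
and I am assuming a POSIX system that provides pthread_mutex_timedlock().
For real cross-process use the mutex would have to live in shared memory
and be initialised as process-shared.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical lock record; this is NOT LMDB's actual lock file layout. */
typedef struct {
    pthread_mutex_t mutex;
    pid_t owner_pid;            /* 0 means "no owner recorded" */
} owned_lock;

/* Returns 1 if a process with this PID exists, 0 otherwise.
 * kill() with signal 0 only performs the existence/permission check.
 * Caveat: a recycled PID would be reported as alive. */
static int pid_alive(pid_t pid)
{
    if (pid <= 0)
        return 0;
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;      /* exists, but we may not signal it */
}

/* Tries to take the lock, waking up every poll_ms milliseconds to check
 * whether the recorded owner is still alive.
 * Returns 0 when the lock is acquired, EOWNERDEAD (errno value borrowed
 * for illustration) when the recorded owner no longer exists, or another
 * error code from pthread_mutex_timedlock(). */
static int owned_lock_acquire(owned_lock *l, long poll_ms)
{
    assert(l != NULL && poll_ms > 0);
    for (;;) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += poll_ms / 1000;
        deadline.tv_nsec += (poll_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        int rc = pthread_mutex_timedlock(&l->mutex, &deadline);
        if (rc == 0) {
            l->owner_pid = getpid();    /* record ourselves as owner */
            return 0;
        }
        if (rc != ETIMEDOUT)
            return rc;

        /* Unsynchronised read: acceptable for a sketch, but a real patch
         * would need an atomic load here. */
        pid_t owner = l->owner_pid;
        if (owner != 0 && !pid_alive(owner))
            return EOWNERDEAD;          /* caller decides how to recover */
    }
}

/* Clears the owner record before unlocking. */
static void owned_lock_release(owned_lock *l)
{
    l->owner_pid = 0;
    pthread_mutex_unlock(&l->mutex);
}
```

The sketch only detects a dead owner; it does not say how to recover the
mutex or the state it was protecting, which is the part I am least sure
about.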