Re: (ITS#5171) hdb txn_checkpoint failures
> Have you got backups from just before these occurrences? Can you see what the
> last valid transaction log files were before this? Or perhaps you can get
> some db_stat's off any other slaves that are still running OK? The idea is to
> see whether the current valid CSNs on an equivalent slave are anywhere near
> the numbers being logged here, e.g. 1/188113 or 1/8730339.
>
> Have you actually run out of disk space on the partitions holding the logs?
> It's rather suspicious that two machines would act up at the same time unless
> some admin specifically disturbed the log files on those two systems at
> around that time.
I don't have backups of the slave bdb logs. The master slapcat output is
considered sacred data; the slave bdb log files are considered derivable
from it and don't get backed up (we'd sooner just replace the entire slave
if it acts up). The odds of the partitions filling are minimal; Solaris
logs that condition at kern.notice (which on our configuration is serious
enough to mean a write to NVRAM), and logs that extend back to before
September 24 don't show any such messages.
That said, regarding "some admin specifically disturbed the log files on
those two systems at around that time": logs show that I was the only
person in a position to do so (unless somebody broke in and covered their
tracks; we'll ignore that theoretical possibility). On September 24, I
reconfigured the slaves to reach the master via a different IP address
than the existing connection (a sketch of the sort of change involved
follows the list below). The times are too coincidental to be unrelated:
(slave4) reconfigured Sep 24 09:41 (first syslog complaint 09:43)
(slave6) reconfigured Sep 24 09:39 (first syslog complaint 09:44)
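For concreteness, the change on each slave was roughly of this shape. The
snippet below assumes a syncrepl consumer stanza in slapd.conf; the rid
and the addresses are illustrative, not the real ones:

    # before (illustrative address for the master)
    syncrepl rid=001
             provider=ldap://192.0.2.10
             ...
    # after: same consumer, master reached at a different address
    syncrepl rid=001
             provider=ldap://198.51.100.10
             ...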
So...is there something that's cued off the (reverse?) name service
entries for the master? Does the master IP hash into a CSN somehow? And
if that is indeed the case/root cause...well, quite honestly, I think
assuming a name service database will remain constant throughout the life
of a slapd instance is a fallacy. Furthermore, if this is indeed the
case, it should be absolutely trivial for me to reproduce: I can perform
a DR rebuild of slave4/6 and reconfigure their network again.
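One way I could check that during the reproduction is to read the
contextCSN off each slave's suffix entry before and after the network
change; something like the following, with an illustrative suffix:

    ldapsearch -x -H ldap://slave4 -s base -b "dc=example,dc=com" contextCSN

If the value moves (or goes missing) when nothing but the master's
address has changed, that would point at the name-service dependency.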
With that in mind, I'll likely test this reproduction early next week. I
can still get db_stat output from all slaves (working and not) at this
point, if that's of interest.
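For reference, the invocation I have in mind is roughly the following,
run against each slave's BDB environment (the environment path is
illustrative):

    # transaction-region stats, including the file/offset of the last checkpoint
    db_stat -h /var/openldap-data -t
    # log-region stats, including the current log file number and offset
    db_stat -h /var/openldap-data -l

That should make it easy to compare against the 1/188113-style numbers
in the syslog messages. Comments?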