Re: (ITS#5171) hdb txn_checkpoint failures
Aaron Richton wrote:
>> It's still rather suspicious that slave4 and slave6 both had identical log
>> status for base1 (1/188113) but different requested locations (1/8730339 vs
>> 1/8730401). If they're identically configured slaves then they ought to be in
>> lock-step. Then again, obviously they're not identical since slave6 doesn't
>> show base4 in your log.
>
> Identical is relative. They've got the same OpenLDAP and supporting
> binaries running on the same patches of Solaris 9 running identical
> turn-up scripts with identical configuration files. But this is
> production, so we've got data changes over time. For instance, the slaves
> bootstrap with a slapadd -q, and the underlying slapcat could easily be
> different from slave4 vs. slave6 (the most recent one is automatically
> used). I'd imagine this would look different at the db layer, even once
> syncrepl eventually converged the logical data?
>
>> Do you have the db_stat output from an uncorrupted slave? What about the
>> master?
>
> Sure... https://www.nbcs.rutgers.edu/~richton/its5171_dbstatl2
Judging from the LSNs in use on these other servers, it sure looks like
somebody went in and zeroed out your logs on slave4 and slave6. I don't think
the environment spontaneously corrupted itself and reset the log offsets...
One more thing to check: use "ls -l" to see whether the actual size of the
log files corresponds with the db_stat offsets. E.g. if slave6 base1's
log.0000001 is really 8MB but the LSN is only 233KB, then we have to look for
a weird in-memory corruption. If the file size does match the LSN offset, then
somebody reset your logs.
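
If you'd rather script the size comparison than eyeball "ls -l", something
like the following would do. It is only a rough sketch: it assumes you copy
the LSN offset out of the db_stat output by hand, and the function name, the
64KB slack (to allow for unflushed log buffers) and the example path are all
made up for illustration, not anything BDB provides.

```python
import os


def classify_log(log_path, lsn_offset, slack=64 * 1024):
    """Compare a log file's on-disk size with the LSN offset from db_stat.

    Returns (size_in_bytes, verdict), where verdict is:
      "mismatch"   - the file is much larger than the LSN offset, which
                     points toward in-memory corruption of the offset
      "consistent" - the file is about as large as the LSN says, which
                     points toward the logs having been reset
    The slack value is an arbitrary assumption, not a BDB constant.
    """
    size = os.path.getsize(log_path)
    if size - lsn_offset > slack:
        return size, "mismatch"
    return size, "consistent"
```

For example, classify_log("/path/to/base1/log.0000001", 188113) with the
offset taken from the 1/188113 status above; substitute your real directory
and log file name.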
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/