
RE: Corruption of Index files running readonly slapd (ITS#2582)



Given the information you've provided, this still sounds like either the BDB
cache is inadequate or there are stale locks in the way. Since the lock
information is recorded in the __db.00* environment files, deleting them all
will also remove the locks. However, there's not enough information here to
tell that for certain.
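
Before deleting anything, you can take a nondestructive look at the lock
region with db_stat; for example (the /var/lib/ldap path is only an
illustration, point -h at your back-bdb database directory):

	db_stat -h /var/lib/ldap -c

Some lockers are normal while slapd is running; the interesting case is
lockers that remain after a clean shutdown, as described below.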

The next time you see this slowdown occur, shut down slapd and record all
of the information you can get out of db_stat:
	db_stat -c (lock statistics)
	db_stat -l (log statistics)
	db_stat -m (cache/memory pool statistics)
	db_stat -t (transaction statistics)
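
For example, with slapd stopped, something like the following captures all
four reports (again, /var/lib/ldap is only an example path):

	db_stat -h /var/lib/ldap -c > stat-locks.txt
	db_stat -h /var/lib/ldap -l > stat-log.txt
	db_stat -h /var/lib/ldap -m > stat-mpool.txt
	db_stat -h /var/lib/ldap -t > stat-txn.txt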

In particular, with slapd cleanly shut down, the output of "db_stat -c"
should show zero current locks, lockers, and lock objects. If any of those
are non-zero, either we have a locking bug or there is one in the BDB
library itself. In the output of "db_stat -m", look at the number of clean
and dirty pages forced from the cache; these numbers should be small,
preferably zero. If they are non-zero, your cache is probably too small. In
the output of "db_stat -l", look at the number of region locks granted
after waiting; it should be zero or very small. In the output of
"db_stat -t", the number of active transactions should be zero; if not,
there is a bug somewhere. The number of aborted transactions should be zero
or very small, assuming that your usage is primarily read-oriented. The
maximum number of transactions active at any one time should be well below
the configured maximum; if it is not, you need to reconfigure the
transaction region.
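
If the statistics do point at an undersized cache or transaction region,
those sizes are set in a DB_CONFIG file in the database environment
directory. A rough sketch with placeholder numbers only -- size them from
your own db_stat output and the Sleepycat reference:

	# DB_CONFIG in the back-bdb database directory
	set_cachesize      0 16777216 1    # 0GB + 16MB cache, one segment
	set_tx_max         200             # ceiling on concurrently active transactions
	set_lk_max_locks   2000            # raise these if db_stat -c shows the
	set_lk_max_objects 2000            #   lock region running out of resources

Region sizes only take effect when the environment is re-created, so run
db_recover after editing DB_CONFIG (see below).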

It's better to use the db_recover command than to manually delete the
__db.00* files. Usually, if slapd has shut down cleanly, the effect will be
the same, but if slapd was shut down uncleanly, the db_recover command will
flush the cache and make sure that the last committed transactions actually
make it into the database.
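
For example, with slapd stopped:

	db_recover -h /var/lib/ldap -v

Plain db_recover performs normal recovery; the -c (catastrophic) form is
only needed when restoring from a backup plus archived log files.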

Unless you see non-zero values for currently active lockers or transactions,
it's unlikely that this is an OpenLDAP bug. Also, a lock management bug in
OpenLDAP would most likely cause slapd to hang and stop answering queries,
not just make it run slowly. If there is no indication of this type of bug,
then you have a badly configured database, and you need to read the Sleepycat
documentation to resolve the problem. Finally, even if there's an errant
locker hanging around out there, it may just be a leftover from an unclean
system shutdown, and not actually a misplaced lock. We've been discussing
approaches to prevent this problem on the -devel list; the issue was first
mentioned in ITS#2502 and any action taken will be reported there.

  -- Howard Chu
  Chief Architect, Symas Corp.       Director, Highland Sun
  http://www.symas.com               http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support

> -----Original Message-----
> From: owner-openldap-bugs@OpenLDAP.org
> [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of Christoph Neerfeld

> We have much the same problem. In our setup we have only 500 entries
> and at most 200 client machines. The database is mostly read-only,
> apart from changes to user passwords.
>
> After the import of the data via LDIF the server runs very fast, but
> after three weeks the performance degrades dramatically. slapd starts
> eating up CPU cycles for each request. Restarting slapd does not
> change anything.
>
> I read the FAQ and most parts of the BDB documentation. AFAIR most
> tips for performance tuning are related to write access to the
> database, which is of no concern to us.
> The only hint I found is to increase the BDB cache, but
> 'db_stat -m' already reports a cache hit rate of 98%.
>
> So I tried another thing. I stopped slapd, removed those __db.00? files
> and all log.00* files which db_archive reported were no longer in use,
> and started slapd again. I don't know whether this can corrupt my
> database, but it fixes the problem. slapd runs again at the same speed
> as after a fresh import of the data.
>
> If this is a configuration problem and not a bug, I would appreciate
> any hints as to what I have to change.
>
> Here are some details of our setup:
>
> - Linux SMP kernel 2.4.20 running on i386 with two processors
> - debian woody
> - ext2 filesystem
> - openldap 2.1.21
> - bdb 4.1.25 compiled with --disable-largefiles
>
> Regards
>
> Christoph Neerfeld
>
> > There are other sites with larger installations running under heavy
> > load that
> > have not experienced this problem. As such, this sounds like a cache
> > configuration problem on your end. Have you read the FAQ?
> > http://www.openldap.org/faq/data/cache/893.html
>
> >  -- Howard Chu
> >  Chief Architect, Symas Corp.       Director, Highland Sun
> >  http://www.symas.com               http://highlandsun.com/hyc
> >  Symas: Premier OpenSource Development and Support
>
> > > -----Original Message-----
> > > From: owner-openldap-bugs@OpenLDAP.org
> > > [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of ldap@uic.edu
>
> > > Full_Name: Andrew J. Herbert
> > > Version: 2.1.21
> > > OS: Linux
> > > URL: ftp://ftp.openldap.org/incoming/
> > > Submission from: (NULL) (128.248.172.135)
> > >
> > >
> > > Master and slave pair running OpenLDAP 2.1.21 and Berkeley DB
> > > 4.1.25 on Linux 2.4.18 systems (RH7.3 with updates); filesystems
> > > are ext3.
> > >
> > > We have an issue using the PADL Software pam_ldap module on a
> > > Solaris V880 with approx 40,000 users against OpenLDAP. pam_ldap
> > > is not configured with the root DN, and the ACLs are set up to
> > > allow no modification by anyone bar the root DN. As such, the
> > > LDAP database can be considered read-only.
> > >
> > > After running for a few hours, the server starts taking an
> > > inordinately long time (>1 min) to do a simple lookup. If we stop
> > > the server and compare the database files with a 'known good' set,
> > > we find that the files have changed. Performing a slapcat on the
> > > database takes in excess of 30 mins to run, but produces a correct
> > > LDIF which can then be reloaded (around an hour for this), and the
> > > server then continues to run normally for another few hours.
> > >
> > > We can reproduce this; we have tried the following:
> > >
> > > Originally this system came online running 2.1.17 on a pair of
> > > IDE-based servers. We moved it to newer, faster SCSI-based servers
> > > (Sun LX50s) and still had the same problems. We upgraded the system
> > > to 2.1.21 and the problem was still present. If we leave the master
> > > and slave running long enough, eventually they both enter this slow
> > > mode of operation.
>
> --
> Christoph Neerfeld
>
> FH Bonn-Rhein-Sieg       | e-mail: Christoph.Neerfeld@FH-BRS.DE
> FB Angewandte Informatik |
> Grantham Allee 20        | phone : +49 2241/865-241
> 53757 Sankt Augustin     |
> Germany - Deutschland    | fax   : +49 2241/865-8241
>