Re: PANIC: bdb fatal region
----- ldap@mm.st wrote:
> I am rebuilding our aging pre 2.2 openldap servers that ran ldbm
> backend and slurpd. We ran this setup without any issues for many
> years.
>
> The new setup is:
> RH5
> openldap 2.3.43 (Stock RH)
> bdb backend 4.4.20 (Stock RH)
> Entries in db: about 1820
> LDIF file is about 1.2M
> Memory: Master 4GB, Slave 2GB (will add two more slaves)
>
> Database section of slapd.conf:
> database bdb
> suffix "o=example.com"
> rootdn "cn=root,o=example.com"
> rootpw {SSHA} .....
> cachesize 1900
> checkpoint 512 30
> directory /var/lib/ldap
> index objectClass,uid,uidNumber,gidNumber,memberUid,uniqueMember eq
> index cn,mail,surname,givenname eq,subinitial
> DB_CONFIG:
> set_cachesize 0 4153344 1
> set_lk_max_objects 1500
> set_lk_max_locks 1500
> set_lk_max_lockers 1500
> set_lg_regionmax 1048576
> set_lg_bsize 32768
> set_lg_max 131072
> set_lg_dir /var/lib/ldap
> set_flags DB_LOG_AUTOREMOVE
>
> This new setup appeared to work great for the last 10 days or so. I
> was able to authenticate clients, add records, etc. Running
> slapd_db_stat -m and slapd_db_stat -c seemed to indicate everything
> was ok. Before I put this setup into production, I got slurpd to
> function, then decided to disable slurpd and use syncrepl in
> refreshonly mode. This also seemed to work fine. I'm not sure if the
> replication started this or not, but wanted to include all the events
> that led up to this.
Replication should not be related at all.
> I have started to get:
>
> bdb(o=example.com): PANIC: fatal region error detected; run recovery
>
> on both servers at different times. During this time slapd continues
> to run, which seems to confuse clients that try to use it, and they
> will not try the other server that is listed in ldap.conf. To
> recover I did: service ldap stop, slapd_db_recover -h /var/lib/ldap,
> service ldap start.
>
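For reference, that stop/recover/start sequence can be sketched as a small script. The paths and service name are the ones given in the post (RHEL5, /var/lib/ldap); treat it as a sketch, not a definitive procedure:

```shell
#!/bin/sh
# Recovery sequence as described in the post: stop slapd, run BDB
# recovery against the database environment, then restart.
set -e

service ldap stop
slapd_db_recover -h /var/lib/ldap
service ldap start
```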
> I then commented all the replication stuff out in slapd.conf and
> restarted ldap. It will run for a while (varies, 5 minutes - ?),
> then I get the same errors and clients are unable to authenticate.
> On one of the servers I deleted all the files (except DB_CONFIG) and
> did a slapadd of an ldif file that I generated every night (without
> stopping slapd).
You imported while slapd was running? This is a recipe for failure. You can import to a different directory, stop slapd, switch directories, and then start; but importing into the live directory while slapd is running is a bad idea.
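A sketch of that import-then-swap procedure, assuming a stock RHEL layout; the names slapd.import.conf, /var/lib/ldap.new, /var/lib/ldap.old, and backup.ldif are illustrative, not from the post:

```shell
#!/bin/sh
# Build the database in a scratch directory while slapd still serves
# clients from the old one, then swap directories with slapd stopped.
set -e

# 1. Prepare a scratch database directory with the same DB_CONFIG.
mkdir /var/lib/ldap.new
cp /var/lib/ldap/DB_CONFIG /var/lib/ldap.new/

# 2. slapadd writes to the "directory" named in the config it is
#    given, so point a copy of slapd.conf at the scratch directory.
sed 's|^directory .*|directory /var/lib/ldap.new|' \
    /etc/openldap/slapd.conf > /etc/openldap/slapd.import.conf
slapadd -f /etc/openldap/slapd.import.conf -l backup.ldif

# 3. Stop slapd, swap directories, fix ownership, start again.
service ldap stop
mv /var/lib/ldap /var/lib/ldap.old
mv /var/lib/ldap.new /var/lib/ldap
chown -R ldap:ldap /var/lib/ldap
service ldap start
```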
> Same results once I started slapd again. I have enabled debug for
> slapd and have not seen anything different; I attached gdb to the
> running slapd and no errors are noted. I even copied in a backup
> copy of slapd.conf from prior to the replication settings (even
> though they are commented out), thinking that maybe something in
> there was causing it.
>
> Then, after several recoveries as described above, the systems seem
> to be working again. One has not generated the error for over 5.5
> hours, the other has not had any problems for 2 hours. For some
> reason, after that period when the errors showed up for a while,
> things seem to be working again, at least for now.
>
> I'm nervous about putting this into production until I can get it to
> function properly without these issues. During the 10-day period
> with everything working well, the slave would occasionally (rarely)
> get the error and I would do a recovery, but we thought this was due
> to possible hardware problems. Now I'm not so sure.
>
> I have a monitor script that runs slapd_db_stat -m and -c every 5
> minutes and nothing seems wrong there, as far as I can tell. I'm
> hoping someone can help me determine possible causes or things to
> look at.
I would recommend that any server that hasn't had a clean import while slapd is *NOT* running, get one. Run it for a few days, and see if you see any problems.
I have been running 2.3.43 for years (my own packages on RHEL4, then my own packages on RHEL5, now some boxes run the RHEL packages), and never seen *this* issue, with a *much* larger directory (with about 8 replicas, though not all replicas have all databases).
Usually database corruption is due to hardware failure, unclean shutdown, or finger trouble.
Regards,
Buchan