openldap-2.1.29/db-4.2.52.2 and db environment issues
I am implementing a highly-available LDAP system for an ISP, clustering
openldap-2.1.x/db-4.2.52.2 with back-bdb using Red Hat Cluster Manager
(for the master, which must be highly available) and stand-alone slaves.
I have been testing both hot backups (via the steps recommended by the
Berkeley DB documentation) and hot restores (by restoring a backup on a
slave).
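The hot backup itself boils down to something like this (a simplified
sketch, assuming the renamed slapd_db_* tool names from the Mandrake
packages and my paths; the actual script, linked further down, adds
locking and error handling):

DBDIR=/var/lib/ldap/mail
BACKUP=/var/lib/ldap/backup/mail
# Copy the database files first, then the transaction logs, in that
# order, as the Berkeley DB documentation requires for hot backups.
cp $DBDIR/*.bdb $BACKUP/
cp $DBDIR/log.* $BACKUP/
# Run catastrophic recovery on the copy to make it consistent.
slapd_db_recover -c -h $BACKUP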
Until today, I had been using 2.1.25 with relative success, except that
checkpointing did not seem to take place correctly: with large writes,
the log files would grow in number until at some point (typically 202
log files, with log files limited to 10MB) db_recover would no longer
work. Restarting the LDAP master would resolve this issue (checkpoint,
and reduce the number of active transaction log files), but would be
rather undesirable.
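(For the record, what the restart effectively does can be done by hand
with the Berkeley DB utilities; roughly, again assuming the renamed
slapd_db_* tools:

slapd_db_checkpoint -1 -h /var/lib/ldap/mail  # force a checkpoint now
slapd_db_archive -d -h /var/lib/ldap/mail     # remove logs no longer needed

but slapd should of course be taking care of this itself.)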
Now, it seems that at certain points in time, the database environment
is not consistent. For example, I ran these commands, separated by
about 5 seconds, while doing large writes on the LDAP master:
[root@ldap2 root]# slapd_db_stat -d /var/lib/ldap/mail/dn2id.bdb
db_stat: DB->stat: DB_PAGE_NOTFOUND: Requested page not found
[root@ldap2 root]# slapd_db_stat -d /var/lib/ldap/mail/dn2id.bdb
db_stat: DB->stat: DB_PAGE_NOTFOUND: Requested page not found
[root@ldap2 root]# slapd_db_stat -d /var/lib/ldap/mail/dn2id.bdb
53162 Btree magic number.
9 Btree version number.
Flags: duplicates, little-endian
2 Minimum keys per-page.
4096 Underlying database page size.
3 Number of levels in the tree.
272193 Number of unique keys in the tree.
350377 Number of data items in the tree.
109 Number of tree internal pages.
175052 Number of bytes free in tree internal pages (61% ff).
7998 Number of tree leaf pages.
10M Number of bytes free in tree leaf pages (69% ff).
198 Number of tree duplicate pages.
25960 Number of bytes free in tree duplicate pages (97% ff).
0 Number of tree overflow pages.
0 Number of bytes free in tree overflow pages (0% ff).
0 Number of pages on the free list.
To give you more information on what I am doing here: since I wanted to
test these features under load, I am importing an existing database
which consists of approximately 150000 entries used for qmail-ldap. I
am running a hot backup script from cron (incidentally, I added it to
the Mandrake openldap packages, so you can view it here:
http://cvs.mandrakesoft.com/cgi-bin/cvsweb.cgi/SPECS/openldap/ldap-hot-db-backup
)
So, I have a bdb database in /var/lib/ldap/mail, the script places a
backup in /var/lib/ldap/backup/mail every 15 minutes.
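The cron entry driving it is nothing special; roughly, in /etc/cron.d
format, and assuming the same install path as the restore script below:

*/15 * * * * root /usr/share/openldap/scripts/ldap-hot-db-backup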
ldapmaster:/var/lib/ldap/ is mounted on /var/lib/ldap/master on the slave
(thus the hot backup appears in /var/lib/ldap/master/backup/mail), and I
use the following script to do hot restores:
http://cvs.mandrakesoft.com/cgi-bin/cvsweb.cgi/SPECS/openldap/ldap-reinitialise-slave
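In essence, the restore does something like this (again simplified; the
real script above adds sanity checks and locking):

# stop slapd before touching the environment
service ldap stop
# replace the environment with the latest hot backup from the master
rm -f /var/lib/ldap/mail/*
cp -a /var/lib/ldap/master/backup/mail/* /var/lib/ldap/mail/
# catastrophic recovery makes the restored copy usable
slapd_db_recover -c -h /var/lib/ldap/mail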
So, while the import is running, I run the following on the slave:
while true; do
  date
  /usr/share/openldap/scripts/ldap-reinitialise-slave -v3
  date
  service ldap restart
  sleep 30
  ldapsearch -x -b "cn=mail,ou=isp" -h localhost -LLL dn -z10 \
    2>/dev/null | grep ^dn | wc -l
  ldapsearch -x -b "ou=radius,o=intekom,c=za" -h localhost -LLL dn -z10 \
    2>/dev/null | grep ^dn | wc -l
  sleep 30
done
When db_stat does not return an answer for dn2id.bdb, my hot backups
fail, and restores fail even more miserably (yes, I should do a bit
more error checking ... but it wasn't necessary on 2.1.25).
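Something as simple as this at the start of the backup run would
probably be enough (a sketch, using the same paths as above):

# refuse to back up if the environment is not currently readable
if ! slapd_db_stat -d /var/lib/ldap/mail/dn2id.bdb >/dev/null 2>&1; then
  echo "db environment not consistent, skipping this run" >&2
  exit 1
fi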
Now, this isn't a huge problem in itself; it seems the next hot backup
succeeds (if I remove the lock file manually for now), but it would
seem to indicate other problems. Also, under 2.1.29 I don't see the
problem with checkpointing: the backup usually contains only 1
transaction log, rather than the 202 I saw under 2.1.25.
Unfortunately the systems I am testing on now are due to be used for other
applications quite soon, so I will not have much time to debug this.
Additionally, I need to decide on 2.1.25 vs 2.1.29 within about 2 days for
a system that will most likely be running with minimal changes, and
hopefully minimal maintenance for 2 years or more.
Well, I will leave the tests running overnight again, and see what I find
in the morning ...
Regards,
Buchan