[Date Prev][Date Next]
[Chronological]
[Thread]
[Top]
Slaves taking up 100% cpu
Hi,
We seem to have a problem in which occasionally our slaves are taking up
100% CPU. We can stop the slaves without any resulting corruption, but once
we start them up again they'd shoot right back up to 100% cpu. If we run
db_recover, they will be fine again, but we'd like to try to avoid the
initial load peaking.
Not sure what is happening here, since I don't have debugging access to the
machines on which our stuff is deploy (stupid corporate policy). However, I
was able to run some db_stats on them (output at bottom).
Our setup is as follows:
OpenLDAP 2.1.22 + BerkeleyDB 4.1.25
Redhat 9 (2.4.20-20.8smp)
4 2.8GHz Xeon processors
2 GB RAM
ext2 filesystem w/ RAID mirroring
Config files contain the following (everything else is pretty standard):
database monitor
database bdb
idletimeout 600
timeout 30
sizelimit unlimited
cachesize 2000
checkpoint 8192 1
We also run with no indices (this is on purpose).
DB_CONFIG looks like:
set_cachesize 0 536870912 0
set_lg_max 209715200
set_lg_bsize 52428800
set_flags DB_TXN_NOSYNC
I ran db_stat with various options to see if I could see some differences
between an instance that is behaving properly and one that isn't.
Here is some db_stat output for an instance that is NOT acting up:
stats.20040119-15:28:23.lockinfo
95 Last allocated locker ID.
2147M Current maximum unused locker ID.
9 Number of lock modes.
1000 Maximum number of locks possible.
1000 Maximum number of lockers possible.
1000 Maximum number of lock objects possible.
2 Number of current locks.
25 Maximum number of locks at any one time.
9 Number of current lockers.
12 Maximum number of lockers at any one time.
2 Number of current lock objects.
24 Maximum number of lock objects at any one time.
18M Total number of locks requested.
18M Total number of locks released.
0 Total number of lock requests failing because DB_LOCK_NOWAIT was set
0 Total number of locks not immediately available due to conflicts.
0 Number of deadlocks.
0 Lock timeout value.
0 Number of locks that have timed out.
0 Transaction timeout value.
0 Number of transactions that have timed out.
360KB The size of the lock region..
0 The number of region locks granted after waiting.
20M The number of region locks granted without waiting.
stats.20040119-15:28:23.loginfo
40988 Log magic number.
7 Log version number.
50MB Log record cache size.
0666 Log file mode.
200Mb Current log file size.
2GB 427MB 322KB 773B Log bytes written.
778KB 448B Log bytes written since last checkpoint.
29768 Total log file writes.
0 Total log file write due to overflow.
29768 Total log file flushes.
13 Current log file number.
79002569 Current log file offset.
13 On-disk log file number.
79002569 On-disk log file offset.
8 Max commits in a log flush.
1 Min commits in a log flush.
50MB 64KB Log region size.
7876 The number of region locks granted after waiting.
9701439 The number of region locks granted without waiting.
stats.20040119-15:28:23.meminfo
512MB Total cache size.
1 Number of caches.
512MB Pool individual cache size.
0 Requested pages mapped into the process' address space.
27M Requested pages found in the cache (100%).
10684 Requested pages not found in the cache.
173851 Pages created in the cache.
10684 Pages read into the cache.
884647 Pages written from the cache to the backing file.
136012 Clean pages forced from the cache.
0 Dirty pages forced from the cache.
0 Dirty pages written by trickle-sync thread.
48523 Current total page count.
48256 Current clean page count.
267 Current dirty page count.
65537 Number of hash buckets used for page location.
28M Total number of times hash chains searched for a page.
4 The longest hash chain searched for a page.
44M Total number of hash buckets examined for page location.
803M The number of hash bucket locks granted without waiting.
59610 The number of hash bucket locks granted after waiting.
156 The maximum number of times any hash bucket lock was waited for.
710500 The number of region locks granted without waiting.
9437 The number of region locks granted after waiting.
184545 The number of page allocations.
325170 The number of hash buckets examined during allocations
3449 The max number of hash buckets examined for an allocation
136012 The number of pages examined during allocations
102 The max number of pages examined for an allocation
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Pool File: dn2id.bdb
4096 Page size.
0 Requested pages mapped into the process' address space.
25M Requested pages found in the cache (100%).
9141 Requested pages not found in the cache.
27158 Pages created in the cache.
9141 Pages read into the cache.
700581 Pages written from the cache to the backing file.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Pool File: id2entry.bdb
16384 Page size.
0 Requested pages mapped into the process' address space.
2652413 Requested pages found in the cache (100%).
1543 Requested pages not found in the cache.
146693 Pages created in the cache.
1543 Pages read into the cache.
184066 Pages written from the cache to the backing file.
stats.20040119-15:28:23.txinfo
13/78205401 File/offset for last checkpoint LSN.
Mon Jan 19 15:27:29 2004 Checkpoint timestamp.
800835e8 Last transaction ID allocated.
20 Maximum number of active transactions possible.
0 Active transactions.
2 Maximum active transactions.
538088 Number of transactions begun.
2595 Number of transactions aborted.
535493 Number of transactions committed.
0 Number of transactions restored.
16KB Transaction region size.
2 The number of region locks granted after waiting.
1678280 The number of region locks granted without waiting.
In the interest of convenience, I'm listing the "significant" differences
(or at least what I THINK is significant) between the output above and an
instance that is taking up 100% cpu here:
lockinfo:
1. Number of current locks is 14 vs. 2.
2. Number of current lockers is 61 vs. 9.
3. Max number of lockers at any one time is 61 vs. 12.
4. Total number of locks requested is 9,159,483 vs. 18M.
5. Total number of locks requested is 9,159,469 vs. 18M.
6. Number of region locks granted without waiting is 10M vs. 20M.
loginfo:
1. Log bytes written is
1GB 236M 963K 586B vs. 2GB 427M 322K 773B
2. Total log file writes is 12582 vs. 29768.
3. Total log file flushes is 12582 vs. 29768.
4. Number of region locks granted after waiting is 5647 vs. 7876.
5. Num/region lks granted w/o waiting is 4,734,568 vs. 9,701,439.
meminfo:
1. Requested pages found in the cache (100%) is 12M vs. 27M (the
absolute difference is mainly for reference for the following...)
2. Requested pages not found in the cache is 571 vs. 10684.
3. Number of hash bucket locks granted after waiting is 638379 vs. 59610.
txinfo:
1. Checkpoint timestamp was Jan 17 22:07 vs Jan 19 15:27 (i.e., it stopped
checkpointing around when the load shot up).
The db_stat output for the misbehaving instance in its entirety:
stats.20040119-15:28:22.lockinfo
92 Last allocated locker ID.
2147M Current maximum unused locker ID.
9 Number of lock modes.
1000 Maximum number of locks possible.
1000 Maximum number of lockers possible.
1000 Maximum number of lock objects possible.
14 Number of current locks.
28 Maximum number of locks at any one time.
61 Number of current lockers.
61 Maximum number of lockers at any one time.
2 Number of current lock objects.
24 Maximum number of lock objects at any one time.
9159483 Total number of locks requested.
9159469 Total number of locks released.
0 Total number of lock reqs failing because DB_LOCK_NOWAIT was set.
0 Total number of locks not immediately available due to conflicts.
0 Number of deadlocks.
0 Lock timeout value.
0 Number of locks that have timed out.
0 Transaction timeout value.
0 Number of transactions that have timed out.
360KB The size of the lock region..
0 The number of region locks granted after waiting.
10M The number of region locks granted without waiting.
stats.20040119-15:28:22.loginfo
40988 Log magic number.
7 Log version number.
50MB Log record cache size.
0666 Log file mode.
200Mb Current log file size.
1GB 236MB 963KB 586B Log bytes written.
47KB 139B Log bytes written since last checkpoint.
12582 Total log file writes.
0 Total log file write due to overflow.
12582 Total log file flushes.
7 Current log file number.
63926248 Current log file offset.
7 On-disk log file number.
63926248 On-disk log file offset.
8 Max commits in a log flush.
0 Min commits in a log flush.
50MB 64KB Log region size.
5647 The number of region locks granted after waiting.
4734568 The number of region locks granted without waiting.
stats.20040119-15:28:22.meminfo
512MB Total cache size.
1 Number of caches.
512MB Pool individual cache size.
0 Requested pages mapped into the process' address space.
12M Requested pages found in the cache (100%).
571 Requested pages not found in the cache.
90612 Pages created in the cache.
571 Pages read into the cache.
446516 Pages written from the cache to the backing file.
49032 Clean pages forced from the cache.
0 Dirty pages forced from the cache.
0 Dirty pages written by trickle-sync thread.
42151 Current total page count.
42059 Current clean page count.
92 Current dirty page count.
65537 Number of hash buckets used for page location.
12M Total number of times hash chains searched for a page.
3 The longest hash chain searched for a page.
21M Total number of hash buckets examined for page location.
324M The number of hash bucket locks granted without waiting.
638379 The number of hash bucket locks granted after waiting.
157 The maximum number of times any hash bucket lock was waited for.
329844 The number of region locks granted without waiting.
6652 The number of region locks granted after waiting.
91193 The number of page allocations.
114834 The number of hash buckets examined during allocations
3425 The max number of hash buckets examined for an allocation
49032 The number of pages examined during allocations
5 The max number of pages examined for an allocation
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Pool File: dn2id.bdb
4096 Page size.
0 Requested pages mapped into the process' address space.
11M Requested pages found in the cache (100%).
522 Requested pages not found in the cache.
14455 Pages created in the cache.
522 Pages read into the cache.
355123 Pages written from the cache to the backing file.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Pool File: id2entry.bdb
16384 Page size.
0 Requested pages mapped into the process' address space.
1346932 Requested pages found in the cache (100%).
49 Requested pages not found in the cache.
76157 Pages created in the cache.
49 Pages read into the cache.
91393 Pages written from the cache to the backing file.
stats.20040119-15:28:22.txinfo
7/63669220 File/offset for last checkpoint LSN.
Sat Jan 17 22:07:59 2004 Checkpoint timestamp.
800431f0 Last transaction ID allocated.
20 Maximum number of active transactions possible.
0 Active transactions.
2 Maximum active transactions.
274928 Number of transactions begun.
1285 Number of transactions aborted.
273643 Number of transactions committed.
0 Number of transactions restored.
16KB Transaction region size.
7 The number of region locks granted after waiting.
855495 The number of region locks granted without waiting.
Any help or insight whatsoever would be greatly appreciated.
--
Jeff
// Diplomacy is the art of saying 'nice doggy'
// until you can find a rock.