Re: ch_malloc of 8388608 bytes failed (ITS#2270)
Howard,
I lowered the threads back down to a sane level (64) and got the same
ch_malloc failure the following morning. When I did the math (for 64
threads), I should have had free address space left, even with the db cache
(which I also set back to the default by commenting out the DB_CONFIG). I
don't think the thread count is what was causing the problem directly.
Raising it was a workaround of mine because I felt that the daemon stayed
operational longer with max threads set way up there. I saw it crash with as
few as 10 active threads. The behavior was erratic.
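For reference, here is a rough sketch of that math in Python. The 2GB of usable address space and the 4MB thread stack are the figures from Howard's earlier message; the function name is made up for illustration, and treating the default cache as negligible is an assumption. It ignores the heap, shared libraries, and everything else slapd maps, so the real headroom would be smaller.

```python
MB = 1024 * 1024

# Figures quoted from Howard's message, not measured on this machine.
USABLE_ADDRESS_SPACE = 2048 * MB   # ~2GB usable by a 32-bit process
STACK_PER_THREAD = 4 * MB          # thread stack size in OpenLDAP 2.1


def remaining_address_space(threads, db_cache_bytes):
    """Bytes left once thread stacks and the BDB cache are accounted for."""
    return USABLE_ADDRESS_SPACE - threads * STACK_PER_THREAD - db_cache_bytes


# 232 threads with a 1GB cache, as in the original report.
print(remaining_address_space(232, 1024 * MB) // MB)   # 96
# 64 threads, default cache treated as negligible (assumption).
print(remaining_address_space(64, 0) // MB)            # 1792
```

By this estimate the 232-thread configuration had under 100MB to spare, while 64 threads should leave well over 1GB, which is why the failure at 64 threads looked wrong to me.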
Subsequently, I've patched the system with the following two Sun patches:
109147-21
108827-40
Then I recompiled everything (with gcc 3.2.2 instead of 2.95.3) and linked
it using the new (patched) Sun ccs binaries instead of the old GNU binutils
that I was using. This appears to have done the trick. The daemon hasn't
crashed in over 4 days now. It's breaking the corporate record every minute.
Max threads is still set at 64 and, to follow through on my testing, I plan
on slowly increasing it to see ch_malloc fail properly, as you describe, so
I can see if it's the same number of bytes. I think this problem had
something to do with the binutils that I was using and/or the compiler on
this particular architecture.
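As a rough guide for that test, the sketch below estimates the thread count beyond which an 8388608-byte request could no longer fit. It assumes Howard's figures (about 2GB usable, 4MB of stack per thread) and ignores everything else in the process, so the real ceiling would be lower. `max_threads` is a hypothetical helper, not anything in OpenLDAP.

```python
MB = 1024 * 1024


def max_threads(usable_bytes, db_cache_bytes, request_bytes, stack_bytes=4 * MB):
    """Largest thread count that still leaves room for one request_bytes allocation."""
    free_for_stacks = usable_bytes - db_cache_bytes - request_bytes
    return max(free_for_stacks // stack_bytes, 0)


# With the 1GB cache from the original report.
print(max_threads(2048 * MB, 1024 * MB, 8388608))   # 254
# With the cache treated as negligible (assumption).
print(max_threads(2048 * MB, 0, 8388608))           # 510
```

If ch_malloc starts failing far below these counts, address-space exhaustion from thread stacks alone would not explain it.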
Thanks a lot for the help and encouragement. It is very much appreciated.
Since you mentioned it, I've been playing around with a 64-bit compile of
everything (just db 4.1.25 and openldap HEAD) and have successfully built
the binaries, but I'm having a problem that I'll open another case for. It
fails the concurrency test...
This case is closed.
Thanks Again,
Joseph
----- Original Message -----
From: "Howard Chu" <hyc@highlandsun.com>
To: "'Joseph Tingiris'" <joseph.tingiris@cox.net>;
<openldap-its@OpenLDAP.org>
Sent: Monday, February 17, 2003 3:56 PM
Subject: RE: ch_malloc of 8388608 bytes failed (ITS#2270)
> Ok. Please drop your max threads parameter back down to a sane level before
> pursuing this further, because it is a fact that with the numbers you show
> your application has definitely run out of free memory. Even though your
> machine has 8GB of RAM, your process only has 2GB of usable address space;
> the other 6GB aren't helping it at all. Let's eliminate that issue so we can
> focus on the real problem.
>
> You need GCC 3.x to build usable 64-bit Solaris binaries. (I have tested
> successfully with GCC 3.1, after tweaking the GCC specs file.) But again,
> going there will only further obscure the issue. Stick with the current
> configuration. Any further changes you make will only make it harder to
> decipher what is really going on.
>
> libc_psr is the processor-specific runtime library, there is a different
> version for each type of Sparc architecture to handle any quirks in the
> different CPU implementations. That's why you only see that specific libc_psr
> being used on that machine. Do not mess with it.
>
> Leave max threads at the default of 32. Run slapd under gdb. When it aborts,
> get a full back trace of all threads.
>
> -- Howard Chu
> Chief Architect, Symas Corp. Director, Highland Sun
> http://www.symas.com http://highlandsun.com/hyc
> Symas: Premier OpenSource Development and Support
>
> > -----Original Message-----
> > From: Joseph Tingiris [mailto:joseph.tingiris@cox.net]
> > Sent: Monday, February 17, 2003 7:07 AM
> > To: hyc@highlandsun.com; openldap-its@OpenLDAP.org
> > Subject: Re: ch_malloc of 8388608 bytes failed (ITS#2270)
> >
> >
> > Update. I applied the patch Kurt recommended to no avail. Once again, I
> > came to work this morning to my very familiar ch_malloc error. I've
> > suspected all along this may have something to do with the fact that I had
> > built the binary using the 32 bit libraries. But, I kind of ruled that out
> > because I don't (ever) get the ch_malloc errors on other 64bit Suns (280R,
> > for example). It's just this one 3800 that's giving me grief. I've played
> > around with the number of threads, DB_CONFIG parameters, and most blatantly
> > configurable options. The reason there are so many threads is because I've
> > found that the more I allow, the longer it runs without aborting. This
> > machine is configured with 8G real and 14G swap. It has plenty of RAM to
> > spare. This problem has persisted (on this machine) since its inception.
> > I've stayed current with the HEAD, here, and I'm only using BDB 4.1.25
> > compiled in to reduce the dependencies while I'm troubleshooting. Bleh ...
> >
> > I've tried compiling HEAD and linking with Solaris' 64bit libraries but I'm
> > having issues getting it to produce a binary with gcc 2.95.3 ... I think I
> > need to upgrade my compiler. I'm really trying to avoid doing anything
> > radical like that until I'm sure what is causing the problem. I haven't
> > completely ruled out Openldap on very large machines like this (12CPU and
> > +20G available memory) and I'm wondering if the OS is returning (what it
> > considers) a valid pointer but it is somehow being considered out of range
> > in the code. On the other hand, it could be the compiler or a bug in
> > Solaris on this architecture. I've forwarded this issue (and others
> > directly related to *only* 3800s) to Sun and they assure me I am at the
> > latest revision of patches and these are a "3rd party application" issue
> > ...
> >
> > I've compiled slapd a variety of ways. With and without mtmalloc, openssl,
> > sasl, kerberos, zlib, etc still produces the ch_malloc abort message. I
> > keep wondering about this one library it seems to only get linked with on
> > the 3800. That is /usr/platform/sun4u-us3/lib/libc_psr.so.1 and I'm not
> > really sure what that does. I've read some stuff on sunsolve about other
> > architectures having problems with their counterpart
> > (/usr/platform/Ultra-80/lib/libc_psr.so.1, for example) and some people have
> > suggested just renaming this file so it doesn't get loaded on startup. I
> > may try that, too, just to see what happens, if nothing else.
> >
> > Today, I plan on getting a more detailed bt full on the process and possibly
> > stepping through a caught failure (it happens about every hour during peak
> > usage) to see if I can determine what function is aborting. Maybe that'll
> > shed some light ....
> >
> > Still determined,
> >
> > Joseph
> >
> >
> > ----- Original Message -----
> > From: <hyc@highlandsun.com>
> > To: <openldap-its@OpenLDAP.org>
> > Sent: Saturday, February 15, 2003 8:14 PM
> > Subject: RE: ch_malloc of 8388608 bytes failed (ITS#2270)
> >
> >
> > > When ch_malloc fails it calls abort() to kill the process. In your stack
> > > back trace, there are 232 threads but none of them is in the abort()
> > > routine, which I find very odd. Regardless, your problem is not due to any
> > > bug in OpenLDAP. The fact is, even though you have a 64 bit machine, you
> > > have built a 32 bit binary. So, it is limited to a 32 bit address space,
> > > and in Solaris, not all of that 32 bit space is available for user memory,
> > > only about half of it (31 bits, 2GB) is available. The default size of a
> > > thread stack has grown in OpenLDAP 2.1, but even in OpenLDAP 2.0 it was 2MB
> > > per thread. With the current 4MB per thread, times 232 threads, you have
> > > used 928MB of RAM. You are also using 1GB for your BDB cache. This alone
> > > (1.9GB) leaves practically nothing left for slapd to run with.
> > >
> > > You should decrease the maximum number of threads; creating more beyond a
> > > certain limit does not enhance concurrency anyway. You can increase your
> > > available address space by building as a pure 64 bit executable but that
> > > doesn't change the fact that having too many threads will slow you down.
> > >
> > > -- Howard Chu
> > > Chief Architect, Symas Corp. Director, Highland Sun
> > > http://www.symas.com http://highlandsun.com/hyc
> > > Symas: Premier OpenSource Development and Support
> > >
> > > > -----Original Message-----
> > > > From: owner-openldap-bugs@OpenLDAP.org
> > > > [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of
> > > > joseph.tingiris@cox.net
> > > > Sent: Wednesday, January 15, 2003 9:27 AM
> > > > To: openldap-its@OpenLDAP.org
> > > > Subject: ch_malloc of 8388608 bytes failed (ITS#2270)
> > > >
> > > >
> > > > Full_Name: Joseph Tingiris
> > > > Version: 2.1.12
> > > > OS: Solaris 8
> > > > URL: ftp://ftp.openldap.org/incoming/
> > > > Submission from: (NULL) (206.157.224.254)
> > > >
> > > >
> > > > I've read about other folks using Solaris having similar problems, and
> > > > I've tried almost everything I could find short of actually modifying
> > > > ch_malloc.c myself. It appears to be specific to multiprocessor (3+) Sun
> > > > installations. The binaries have been compiled with -lmtmalloc and the
> > > > latest versions of all Openldap dependent packages are used. The primary
> > > > authentication mechanism is cleartext.
> > > >
> > > > Some key points:
> > > >
> > > > * This server is a replica.
> > > > * BDB-4.1 with 3.4 million DNs, 6 indexes (eq,sub)
> > > > * process stack 32k (plimit -s), DB cache 1G (via DB_CONFIG)
> > > > * this problem has persisted, on the same hardware, since openldap 2.0.12
> > > > * slapd fails at least once a day with the same error every time,
> > > >   "ch_malloc of 8388608 bytes failed"; it's always the same amount of bytes
> > > > * it appears to happen during a wildcard search, although it may be during
> > > >   some type of replication event
> > > >
> > > > Here is some info on the build environment:
> > > >
> > > > Application - OpenLdap and Dependencies:
> > > >
> > > > openldap-2.1.12
> > > > openssl-0.9.7
> > > > krb5-1.2.7
> > > > cyrus-sasl-2.1.10
> > > > db-4.1.25
> > > >
> > > > Compiler/Dev Tools:
> > > >
> > > > autoconf-2.57
> > > > automake-1.7.2
> > > > binutils-2.11.2
> > > > bison-1.75
> > > > fileutils-4.1
> > > > gawk-3.1.0
> > > > gcc-2.95.3
> > > > gdb-5.0
> > > > gdbm-1.8.0
> > > > gettext-0.10.37
> > > > glib-1.2.10
> > > > gtk+-1.2.10
> > > > libgcc-3.2
> > > > libiconv-1.6.1
> > > > libnet-1.0.2a
> > > > libpcap-0.7.1
> > > > libtool-1.4
> > > > m4-1.4
> > > > make-3.80
> > > > ncurses-5.2
> > > > slang-1.4.4
> > > > tcl-8.4.1
> > > > termcap-1.3
> > > > textutils-2.0
> > > > tk-8.4.1
> > > > zlib-1.1.4
> > > >
> > > > Here's the system info:
> > > >
> > > > System Configuration: Sun Microsystems sun4u Sun Fire 3800
> > > > System clock frequency: 150 MHz
> > > > Memory size: 8192 Megabytes
> > > >
> > > > ========================= CPUs ===============================================
> > > >
> > > >             Port  Run   E$    CPU      CPU
> > > > FRU Name    ID    MHz   MB    Impl.    Mask
> > > > ----------  ----  ----  ----  -------  ----
> > > > /N0/SB0/P0  0     750   8.0   US-III   3.4
> > > > /N0/SB0/P1  1     750   8.0   US-III   3.4
> > > > /N0/SB0/P2  2     750   8.0   US-III   3.4
> > > > /N0/SB0/P3  3     750   8.0   US-III   3.4
> > > > /N0/SB2/P0  8     750   8.0   US-III   3.4
> > > > /N0/SB2/P1  9     750   8.0   US-III   3.4
> > > > /N0/SB2/P2  10    750   8.0   US-III   3.4
> > > > /N0/SB2/P3  11    750   8.0   US-III   3.4
> > > >
> > > > ========================= Memory Configuration ===============================
> > > >
> > > >                      Logical  Logical  Logical
> > > >                Port  Bank     Bank     Bank         DIMM    Interleave  Interleave
> > > > FRU Name       ID    Num      Size     Status       Size    Factor      Segment
> > > > -------------  ----  -------  -------  -----------  ------  ----------  ----------
> > > > /N0/SB0/P0/B0  0     0        512MB    pass         256MB   8-way       0
> > > > /N0/SB0/P0/B0  0     2        512MB    pass         256MB   8-way       0
> > > > /N0/SB0/P1/B0  1     0        512MB    pass         256MB   8-way       0
> > > > /N0/SB0/P1/B0  1     2        512MB    pass         256MB   8-way       0
> > > > /N0/SB0/P2/B0  2     0        512MB    pass         256MB   8-way       0
> > > > /N0/SB0/P2/B0  2     2        512MB    pass         256MB   8-way       0
> > > > /N0/SB0/P3/B0  3     0        512MB    pass         256MB   8-way       0
> > > > /N0/SB0/P3/B0  3     2        512MB    pass         256MB   8-way       0
> > > > /N0/SB2/P0/B0  8     0        512MB    pass         256MB   8-way       1
> > > > /N0/SB2/P0/B0  8     2        512MB    pass         256MB   8-way       1
> > > > /N0/SB2/P1/B0  9     0        512MB    pass         256MB   8-way       1
> > > > /N0/SB2/P1/B0  9     2        512MB    pass         256MB   8-way       1
> > > > /N0/SB2/P2/B0  10    0        512MB    pass         256MB   8-way       1
> > > > /N0/SB2/P2/B0  10    2        512MB    pass         256MB   8-way       1
> > > > /N0/SB2/P3/B0  11    0        512MB    pass         256MB   8-way       1
> > > > /N0/SB2/P3/B0  11    2        512MB    pass         256MB   8-way       1
> > > >
> > > > ========================= IO Cards =========================
> > > >
> > > >                                     Bus   Max
> > > >             IO    Port  Bus         Freq  Bus   Dev,
> > > > FRU Name    Type  ID    Side  Slot  MHz   Freq  Func  State  Name                               Model
> > > > ----------  ----  ----  ----  ----  ----  ----  ----  -----  --------------------------------   ----------------------
> > > > /N0/IB6/P0  cPCI  24    B     2     33    33    1,0   ok     pci-pci1011,46.1/pci108e,1000      pci-bridge
> > > > /N0/IB6/P0  cPCI  24    B     2     33    33    0,0   ok     pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P0  cPCI  24    B     2     33    33    0,1   ok     SUNW,hme-pci108e,1001              SUNW,cheerio
> > > > /N0/IB6/P0  cPCI  24    B     2     33    33    4,0   ok     SUNW,isptwo-pci1077,1020/sd (blo+  QLGC,ISP1040B
> > > > /N0/IB6/P0  cPCI  24    B     3     33    33    2,0   ok     network-pci108e,abba.11            SUNW,cpci-ce
> > > > /N0/IB6/P1  cPCI  25    B     4     33    33    1,0   ok     pci-pci1011,46.1/pci108e,1000      pci-bridge
> > > > /N0/IB6/P1  cPCI  25    B     4     33    33    0,0   ok     pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P1  cPCI  25    B     4     33    33    0,1   ok     SUNW,qfe-pci108e,1001              SUNW,cpci-qfe
> > > > /N0/IB6/P1  cPCI  25    B     4     33    33    1,0   ok     pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P1  cPCI  25    B     4     33    33    1,1   ok     SUNW,qfe-pci108e,1001              SUNW,cpci-qfe
> > > > /N0/IB6/P1  cPCI  25    B     4     33    33    2,0   ok     pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P1  cPCI  25    B     4     33    33    2,1   ok     SUNW,qfe-pci108e,1001              SUNW,cpci-qfe
> > > > /N0/IB6/P1  cPCI  25    B     4     33    33    3,0   ok     pci108e,1000-pci108e,1000.1
> > > > /N0/IB6/P1  cPCI  25    B     4     33    33    3,1   ok     SUNW,qfe-pci108e,1001              SUNW,cpci-qfe
> > > > /N0/IB6/P1  cPCI  25    A     1     66    66    1,0   ok     fibre-channel-pci10df,f900.10df.+
> > > > /N0/IB8/P0  cPCI  28    B     2     33    33    1,0   ok     network-pci108e,abba.11            SUNW,cpci-ce
> > > > /N0/IB8/P1  cPCI  29    B     4     33    33    1,0   ok     pci-pci1011,46.1/pci108e,1000      pci-bridge
> > > > /N0/IB8/P1  cPCI  29    B     4     33    33    0,0   ok     pci108e,1000-pci108e,1000.1
> > > > /N0/IB8/P1  cPCI  29    B     4     33    33    0,1   ok     SUNW,qfe-pci108e,1001              SUNW,cpci-qfe
> > > > /N0/IB8/P1  cPCI  29    B     4     33    33    1,0   ok     pci108e,1000-pci108e,1000.1
> > > > /N0/IB8/P1  cPCI  29    B     4     33    33    1,1   ok     SUNW,qfe-pci108e,1001              SUNW,cpci-qfe
> > > > /N0/IB8/P1  cPCI  29    B     4     33    33    2,0   ok     pci108e,1000-pci108e,1000.1
> > > > /N0/IB8/P1  cPCI  29    B     4     33    33    2,1   ok     SUNW,qfe-pci108e,1001              SUNW,cpci-qfe
> > > > /N0/IB8/P1  cPCI  29    B     4     33    33    3,0   ok     pci108e,1000-pci108e,1000.1
> > > > /N0/IB8/P1  cPCI  29    B     4     33    33    3,1   ok     SUNW,qfe-pci108e,1001              SUNW,cpci-qfe
> > > > /N0/IB8/P1  cPCI  29    A     1     66    66    1,0   ok     fibre-channel-pci10df,f900.10df.+
> > > >
> > > > ========================= Active Boards for Domain ===========================
> > > >
> > > >           Power  Fault  HotPlug  Board
> > > > FRU Name  LED    LED    LED      Cond.
> > > > --------  -----  -----  -------  -------
> > > > /N0/SB0   on     off    off      ok
> > > > /N0/SB2   on     off    off      ok
> > > > /N0/IB6   on     off    off      ok
> > > > /N0/IB8   on     off    off      ok
> > > >
> > > > ========================= Available Boards/Slots for Domain ==================
> > > >
> > > >           Power  Fault  HotPlug  Board/Slot  Board/Slot
> > > > FRU Name  LED    LED    LED      Condition   Assigned
> > > > --------  -----  -----  -------  ----------  ----------
> > > > There are currently no Boards/Slots available to this Domain
> > > >
> > > > ========================= Hardware Failures ==================================
> > > > No Hardware failures found in System
> > > >
> > > > Need any more info? I still have pmap, lsof, truss, cores, and additional
> > > > debug data. Anyone have any ideas?
> > > >
> > > > Any help would be greatly appreciated.
> > > >
> > > > Thanks!