RE: ch_malloc of 8388608 bytes failed (ITS#2270)
Ok. Please drop your max threads parameter back down to a sane level before
pursuing this further. With the numbers you show, your application has
definitely run out of free memory. Even though your machine has 8GB of RAM,
your process only has 2GB of usable address space; the other 6GB aren't
helping it at all. Let's eliminate that issue so we can focus on the real
problem.
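
To make the address-space arithmetic concrete, here is a minimal sketch using
the figures from the earlier message quoted further down in this thread (4MB
of stack per thread, 232 threads, a 1GB BDB cache, roughly 2GB usable by a
32-bit process). The function name and the 32-thread comparison are
illustrative assumptions, and the model ignores program text, shared libraries
and heap fragmentation, so real headroom is smaller than shown.

```python
MB = 1024 * 1024
GB = 1024 * MB


def address_space_budget(threads, stack_bytes, cache_bytes, usable_bytes=2 * GB):
    """Return (committed, remaining) bytes under a simple additive model.

    Assumes every thread reserves its full stack and the BDB cache is
    mapped in one piece; nothing else is counted.
    """
    committed = threads * stack_bytes + cache_bytes
    return committed, usable_bytes - committed


# Compare the reported thread count with the default of 32.
for threads in (232, 32):
    committed, remaining = address_space_budget(threads, 4 * MB, 1 * GB)
    print(f"{threads:3d} threads: {committed // MB} MB committed, "
          f"{remaining // MB} MB left")
```

Under these assumptions, 232 threads leave about 96MB of address space for
everything else, while 32 threads leave about 896MB. The failing request in
this report is 8388608 bytes (8MB).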
You need GCC 3.x to build usable 64-bit Solaris binaries. (I have tested
successfully with GCC 3.1, after tweaking the GCC specs file.) But again,
going there will only further obscure the issue. Stick with the current
configuration. Any further changes you make will only make it harder to
decipher what is really going on.
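
Without changing anything, it is still possible to confirm which kind of
binary is currently installed. A minimal sketch, assuming slapd is an ELF
executable (as Solaris binaries are): the fifth byte of the ELF header
(EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects. The function name
is hypothetical.

```python
ELF_MAGIC = b"\x7fELF"


def elf_class(path):
    """Return 32 or 64 for an ELF file, or None if the class byte is unknown."""
    with open(path, "rb") as f:
        ident = f.read(5)  # 4 magic bytes followed by the EI_CLASS byte
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        raise ValueError(f"{path} is not an ELF file")
    # EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64
    return {1: 32, 2: 64}.get(ident[4])
```

For example, elf_class("/usr/local/libexec/slapd") (an assumed install path)
should return 32 if the binary is the 32-bit build described in this thread.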
libc_psr is the processor-specific runtime library; there is a different
version for each type of Sparc architecture to handle any quirks in the
different CPU implementations. That's why you only see that specific libc_psr
being used on that machine. Do not mess with it.
Leave max threads at the default of 32. Run slapd under gdb. When it aborts,
get a full back trace of all threads.
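
The gdb steps above can be sketched as a small script generator. This is only
an illustration: the slapd path and the script filename are hypothetical, and
passing "-d 0" to keep slapd in the foreground is an assumption about how it
is started here. The gdb commands themselves (set args, run, thread apply all
bt full) can equally be typed at an interactive (gdb) prompt.

```python
import os
import tempfile


def write_gdb_script(script_path, slapd_args):
    """Write gdb commands that run slapd and dump every thread's stack."""
    commands = [
        "set height 0",                      # disable output paging
        "set args " + " ".join(slapd_args),  # arguments passed to slapd
        "run",                               # returns when slapd stops, e.g. on SIGABRT
        "thread apply all bt full",          # full back trace of all threads
    ]
    with open(script_path, "w") as f:
        f.write("\n".join(commands) + "\n")
    return commands


def gdb_argv(slapd_path, script_path):
    """Build the command line for a non-interactive gdb run."""
    return ["gdb", "-batch", "-x", script_path, slapd_path]


if __name__ == "__main__":
    # Hypothetical locations; adjust to the local installation.
    script = os.path.join(tempfile.gettempdir(), "slapd-bt.gdb")
    write_gdb_script(script, ["-d", "0"])
    print(" ".join(gdb_argv("/usr/local/libexec/slapd", script)))
```

The script only prints the gdb command line; it does not start gdb itself, so
the output can be redirected to a log file when the command is run by hand.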
-- Howard Chu
Chief Architect, Symas Corp. Director, Highland Sun
http://www.symas.com http://highlandsun.com/hyc
Symas: Premier OpenSource Development and Support
> -----Original Message-----
> From: Joseph Tingiris [mailto:joseph.tingiris@cox.net]
> Sent: Monday, February 17, 2003 7:07 AM
> To: hyc@highlandsun.com; openldap-its@OpenLDAP.org
> Subject: Re: ch_malloc of 8388608 bytes failed (ITS#2270)
>
>
> Update. I applied the patch Kurt recommended to no avail. Once again, I
> came to work this morning to my very familiar ch_malloc error. I've
> suspected all along this may have something to do with the fact that I had
> built the binary using the 32 bit libraries. But, I kind of ruled that out
> because I don't (ever) get the ch_malloc errors on other 64bit Suns (280R,
> for example). It's just this one 3800 that's giving me grief. I've played
> around with the number of threads, DB_CONFIG parameters, and most blatantly
> configurable options. The reason there are so many is because I've found
> that the more I allow, the longer it runs without aborting. This machine is
> configured with 8G real and 14G swap. It has plenty of RAM to spare. This
> problem has persisted (on this machine) since its inception. I've stayed
> current with the HEAD, here, and I'm only using BDB 4.1.25 compiled in to
> reduce the dependencies while I'm troubleshooting. Bleh ...
>
> I've tried compiling HEAD and linking with Solaris' 64bit libraries but I'm
> having issues getting it to produce a binary with gcc 2.95.3 ... I think I
> need to upgrade my compiler. I'm really trying to avoid doing anything
> radical like that until I'm sure what is causing the problem. I haven't
> completely ruled out a problem with OpenLDAP on very large machines like
> this (12CPU and +20G available memory) and I'm wondering if the OS is
> returning (what it considers) a valid pointer but it is somehow being
> considered out of range in the code. On the other hand, it could be the
> compiler or a bug in Solaris on this architecture. I've forwarded this issue
> (and others directly related to *only* 3800s) to Sun and they assure me I am
> at the latest revision of patches and these are a "3rd party application"
> issue ...
>
> I've compiled slapd a variety of ways. Building with and without mtmalloc,
> openssl, sasl, kerberos, zlib, etc. still produces the ch_malloc abort
> message. I keep wondering about this one library it seems to only get linked
> with on the 3800. That is /usr/platform/sun4u-us3/lib/libc_psr.so.1 and I'm
> not really sure what that does. I've read some stuff on sunsolve about other
> architectures having problems with their counterpart
> (/usr/platform/Ultra-80/lib/libc_psr.so.1, for example) and some people have
> suggested just renaming this file so it doesn't get loaded on startup. I
> may try that, too, just to see what happens, if nothing else.
>
> Today, I plan on getting a more detailed bt full on the process and possibly
> stepping through a caught failure (it happens about every hour during peak
> usage) to see if I can determine what function is aborting. Maybe that'll
> shed some light ....
>
> Still determined,
>
> Joseph
>
>
> ----- Original Message -----
> From: <hyc@highlandsun.com>
> To: <openldap-its@OpenLDAP.org>
> Sent: Saturday, February 15, 2003 8:14 PM
> Subject: RE: ch_malloc of 8388608 bytes failed (ITS#2270)
>
>
> > When ch_malloc fails it calls abort() to kill the process. In your stack
> > back trace, there are 232 threads but none of them is in the abort()
> > routine, which I find very odd. Regardless, your problem is not due to any
> > bug in OpenLDAP. The fact is, even though you have a 64 bit machine, you
> > have built a 32 bit binary. So, it is limited to a 32 bit address space,
> > and in Solaris, not all of that 32 bit space is available for user memory,
> > only about half of it (31 bits, 2GB) is available. The default size of a
> > thread stack has grown in OpenLDAP 2.1, but even in OpenLDAP 2.0 it was 2MB
> > per thread. With the current 4MB per thread, times 232 threads, you have
> > used 928MB of RAM. You are also using 1GB for your BDB cache. This alone
> > (1.9GB) leaves practically nothing left for slapd to run with.
> >
> > You should decrease the maximum number of threads; creating more beyond a
> > certain limit does not enhance concurrency anyway. You can increase your
> > available address space by building as a pure 64 bit executable but that
> > doesn't change the fact that having too many threads will slow you down.
> >
> > -- Howard Chu
> >
> > > -----Original Message-----
> > > From: owner-openldap-bugs@OpenLDAP.org
> > > [mailto:owner-openldap-bugs@OpenLDAP.org]On Behalf Of
> > > joseph.tingiris@cox.net
> > > Sent: Wednesday, January 15, 2003 9:27 AM
> > > To: openldap-its@OpenLDAP.org
> > > Subject: ch_malloc of 8388608 bytes failed (ITS#2270)
> > >
> > >
> > > Full_Name: Joseph Tingiris
> > > Version: 2.1.12
> > > OS: Solaris 8
> > > URL: ftp://ftp.openldap.org/incoming/
> > > Submission from: (NULL) (206.157.224.254)
> > >
> > >
> > > I've read reports from other folks using Solaris who have similar
> > > problems, and I've tried almost everything I could find short of actually
> > > modifying ch_malloc.c myself. It appears to be specific to multiprocessor
> > > (3+) Sun installations. The binaries have been compiled with -lmtmalloc
> > > and the latest versions of all OpenLDAP dependent packages are used. The
> > > primary authentication mechanism is cleartext.
> > >
> > > Some key points:
> > >
> > > * This server is a replica.
> > > * BDB-4.1 with 3.4 million DNs, 6 indexes (eq,sub)
> > > * process stack 32k (plimit -s), DB cache 1G (via DB_CONFIG)
> > > * this problem has persisted, on the same hardware, since openldap 2.0.12
> > > * slapd fails at least once a day with the same error every time,
> > >   "ch_malloc of 8388608 bytes failed"; it's always the same amount of bytes
> > > * it appears to happen during a wildcard search, although it may be during
> > >   some type of replication event
> > >
> > > Here is some info on the build environment:
> > >
> > > Application - OpenLdap and Dependencies:
> > >
> > > openldap-2.1.12
> > > openssl-0.9.7
> > > krb5-1.2.7
> > > cyrus-sasl-2.1.10
> > > db-4.1.25
> > >
> > > Compiler/Dev Tools:
> > >
> > > autoconf-2.57
> > > automake-1.7.2
> > > binutils-2.11.2
> > > bison-1.75
> > > fileutils-4.1
> > > gawk-3.1.0
> > > gcc-2.95.3
> > > gdb-5.0
> > > gdbm-1.8.0
> > > gettext-0.10.37
> > > glib-1.2.10
> > > gtk+-1.2.10
> > > libgcc-3.2
> > > libiconv-1.6.1
> > > libnet-1.0.2a
> > > libpcap-0.7.1
> > > libtool-1.4
> > > m4-1.4
> > > make-3.80
> > > ncurses-5.2
> > > slang-1.4.4
> > > tcl-8.4.1
> > > termcap-1.3
> > > textutils-2.0
> > > tk-8.4.1
> > > zlib-1.1.4
> > >
> > > Here's the system info:
> > >
> > > System Configuration: Sun Microsystems sun4u Sun Fire 3800
> > > System clock frequency: 150 MHz
> > > Memory size: 8192 Megabytes
> > >
> > > ========================= CPUs ===============================================
> > >
> > >             Port  Run   E$    CPU      CPU
> > > FRU Name    ID    MHz   MB    Impl.    Mask
> > > ----------  ----  ----  ----  -------  ----
> > > /N0/SB0/P0  0     750   8.0   US-III   3.4
> > > /N0/SB0/P1  1     750   8.0   US-III   3.4
> > > /N0/SB0/P2  2     750   8.0   US-III   3.4
> > > /N0/SB0/P3  3     750   8.0   US-III   3.4
> > > /N0/SB2/P0  8     750   8.0   US-III   3.4
> > > /N0/SB2/P1  9     750   8.0   US-III   3.4
> > > /N0/SB2/P2  10    750   8.0   US-III   3.4
> > > /N0/SB2/P3  11    750   8.0   US-III   3.4
> > >
> > > ========================= Memory Configuration ===============================
> > >
> > >                      Logical  Logical  Logical
> > >                Port  Bank     Bank     Bank     DIMM    Interleave  Interleave
> > > FRU Name       ID    Num      Size     Status   Size    Factor      Segment
> > > -------------  ----  -------  -------  -------  ------  ----------  ----------
> > > /N0/SB0/P0/B0  0     0        512MB    pass     256MB   8-way       0
> > > /N0/SB0/P0/B0  0     2        512MB    pass     256MB   8-way       0
> > > /N0/SB0/P1/B0  1     0        512MB    pass     256MB   8-way       0
> > > /N0/SB0/P1/B0  1     2        512MB    pass     256MB   8-way       0
> > > /N0/SB0/P2/B0  2     0        512MB    pass     256MB   8-way       0
> > > /N0/SB0/P2/B0  2     2        512MB    pass     256MB   8-way       0
> > > /N0/SB0/P3/B0  3     0        512MB    pass     256MB   8-way       0
> > > /N0/SB0/P3/B0  3     2        512MB    pass     256MB   8-way       0
> > > /N0/SB2/P0/B0  8     0        512MB    pass     256MB   8-way       1
> > > /N0/SB2/P0/B0  8     2        512MB    pass     256MB   8-way       1
> > > /N0/SB2/P1/B0  9     0        512MB    pass     256MB   8-way       1
> > > /N0/SB2/P1/B0  9     2        512MB    pass     256MB   8-way       1
> > > /N0/SB2/P2/B0  10    0        512MB    pass     256MB   8-way       1
> > > /N0/SB2/P2/B0  10    2        512MB    pass     256MB   8-way       1
> > > /N0/SB2/P3/B0  11    0        512MB    pass     256MB   8-way       1
> > > /N0/SB2/P3/B0  11    2        512MB    pass     256MB   8-way       1
> > >
> > > ========================= IO Cards =========================
> > >
> > >                                Bus  Max
> > >            IO   Port Bus       Freq Bus  Dev,
> > > FRU Name   Type ID   Side Slot MHz  Freq Func State Name                              Model
> > > ---------- ---- ---- ---- ---- ---- ---- ---- ----- --------------------------------- ----------------------
> > > /N0/IB6/P0 cPCI 24   B    2    33   33   1,0  ok    pci-pci1011,46.1/pci108e,1000     pci-bridge
> > > /N0/IB6/P0 cPCI 24   B    2    33   33   0,0  ok    pci108e,1000-pci108e,1000.1
> > > /N0/IB6/P0 cPCI 24   B    2    33   33   0,1  ok    SUNW,hme-pci108e,1001             SUNW,cheerio
> > > /N0/IB6/P0 cPCI 24   B    2    33   33   4,0  ok    SUNW,isptwo-pci1077,1020/sd (blo+ QLGC,ISP1040B
> > > /N0/IB6/P0 cPCI 24   B    3    33   33   2,0  ok    network-pci108e,abba.11           SUNW,cpci-ce
> > > /N0/IB6/P1 cPCI 25   B    4    33   33   1,0  ok    pci-pci1011,46.1/pci108e,1000     pci-bridge
> > > /N0/IB6/P1 cPCI 25   B    4    33   33   0,0  ok    pci108e,1000-pci108e,1000.1
> > > /N0/IB6/P1 cPCI 25   B    4    33   33   0,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > /N0/IB6/P1 cPCI 25   B    4    33   33   1,0  ok    pci108e,1000-pci108e,1000.1
> > > /N0/IB6/P1 cPCI 25   B    4    33   33   1,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > /N0/IB6/P1 cPCI 25   B    4    33   33   2,0  ok    pci108e,1000-pci108e,1000.1
> > > /N0/IB6/P1 cPCI 25   B    4    33   33   2,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > /N0/IB6/P1 cPCI 25   B    4    33   33   3,0  ok    pci108e,1000-pci108e,1000.1
> > > /N0/IB6/P1 cPCI 25   B    4    33   33   3,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > /N0/IB6/P1 cPCI 25   A    1    66   66   1,0  ok    fibre-channel-pci10df,f900.10df.+
> > > /N0/IB8/P0 cPCI 28   B    2    33   33   1,0  ok    network-pci108e,abba.11           SUNW,cpci-ce
> > > /N0/IB8/P1 cPCI 29   B    4    33   33   1,0  ok    pci-pci1011,46.1/pci108e,1000     pci-bridge
> > > /N0/IB8/P1 cPCI 29   B    4    33   33   0,0  ok    pci108e,1000-pci108e,1000.1
> > > /N0/IB8/P1 cPCI 29   B    4    33   33   0,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > /N0/IB8/P1 cPCI 29   B    4    33   33   1,0  ok    pci108e,1000-pci108e,1000.1
> > > /N0/IB8/P1 cPCI 29   B    4    33   33   1,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > /N0/IB8/P1 cPCI 29   B    4    33   33   2,0  ok    pci108e,1000-pci108e,1000.1
> > > /N0/IB8/P1 cPCI 29   B    4    33   33   2,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > /N0/IB8/P1 cPCI 29   B    4    33   33   3,0  ok    pci108e,1000-pci108e,1000.1
> > > /N0/IB8/P1 cPCI 29   B    4    33   33   3,1  ok    SUNW,qfe-pci108e,1001             SUNW,cpci-qfe
> > > /N0/IB8/P1 cPCI 29   A    1    66   66   1,0  ok    fibre-channel-pci10df,f900.10df.+
> > >
> > > ========================= Active Boards for Domain ===========================
> > >
> > >           Power  Fault  HotPlug  Board
> > > FRU Name  LED    LED    LED      Cond.
> > > --------  -----  -----  -------  -------
> > > /N0/SB0   on     off    off      ok
> > > /N0/SB2   on     off    off      ok
> > > /N0/IB6   on     off    off      ok
> > > /N0/IB8   on     off    off      ok
> > >
> > > ========================= Available Boards/Slots for Domain ==================
> > >
> > >           Power  Fault  HotPlug  Board/Slot  Board/Slot
> > > FRU Name  LED    LED    LED      Condition   Assigned
> > > --------  -----  -----  -------  ----------  ----------
> > > There are currently no Boards/Slots available to this Domain
> > >
> > > ========================= Hardware Failures ==================================
> > > No Hardware failures found in System
> > >
> > > Need any more info? I still have pmap, lsof, truss, cores, and additional
> > > debug data. Anyone have any ideas?
> > >
> > > Any help would be greatly appreciated.
> > >
> > > Thanks!
> > >
> > >
> > >
> >
> >
>