
Re: (ITS#6310) Slapd with pcache crashes under load



luben karavelov wrote:
> masarati@aero.polimi.it wrote:
>>
>> Thanks for collecting this info.  The valgrind output could be of some
>> use, but unfortunately I don't have time right now to set up a working
>> RDBMS and extensively debug things.  I'll keep this on my todo list.
>>
>> Could you please re-run valgrind with --num-callers=30 or more?  In some
>> cases the errors occur in functions nested too deeply to get a clear idea
>> of whether the issue is caused by garbage fed by slapd/back-sql or by
>> errors inside the RDBMS/ODBC layers.  The fact that valgrind
>> systematically complains about internals of the RDBMS/ODBC reading past
>> the end of memory chunks malloc'ed by slapd could be related to passing
>> some non-NUL-terminated bervals that are dealt with as strings.  Having a
>> longer call stack could help track down those occurrences.  However,
>> those issues should not be critical, since there are no invalid writes.
>>
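
To make the berval point concrete, here is a minimal sketch of copying a
length-counted value into a NUL-terminated buffer before handing it to any
API that expects a C string.  The struct bv_sketch type and bv_to_cstr()
are hypothetical stand-ins; only the two field names mirror struct berval,
and this is not what back-sql actually does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a length-counted value; only the two
 * field names mirror OpenLDAP's struct berval. */
struct bv_sketch {
    size_t bv_len;   /* number of valid bytes in bv_val */
    char  *bv_val;   /* data, NOT guaranteed to be NUL-terminated */
};

/* Return a freshly malloc'ed, NUL-terminated copy of bv, or NULL on
 * NULL input or allocation failure.  The caller frees the result. */
static char *bv_to_cstr(const struct bv_sketch *bv)
{
    char *s;

    if (bv == NULL || bv->bv_val == NULL)
        return NULL;

    s = malloc(bv->bv_len + 1);        /* one extra byte for the NUL */
    if (s == NULL)
        return NULL;

    memcpy(s, bv->bv_val, bv->bv_len); /* copy exactly bv_len bytes */
    s[bv->bv_len] = '\0';              /* terminate explicitly */
    return s;
}
```

Passing bv_val directly to something like strlen() would instead read
until it happens to find a zero byte, which is the kind of read past the
end of a malloc'ed chunk that valgrind reports.
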
>> Also, you should walk through the list of attributes being returned, to
>> provide a hint about whether back-sql is computing a screwed-up attrlist
>> or the like.  Along the lines of your current gdb session, you should go
>> to frame #5, refresh_merge() in pcache.c, and print *e->e_attrs,
>> *e->e_attrs->a_desc, and *e->e_attrs->a_vals[0]; then move to
>> e->e_attrs->a_next and repeat the prints to the end of the list.  The
>> fact that you get a value of "a" equal to 0x500000000 definitely looks
>> odd to me, as that attr list should result from be_entry_get_rw(), which
>> in turn should collect it from the local database.  Unless valgrind
>> reveals some oddity in back-sql, the behavior you notice should not
>> depend on the specific remote database you're using, but rather on the
>> local one.
>>
>> p.
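
For reference, the list walk described above amounts to something like the
following sketch.  The struct types and walk_attrs() are hypothetical and
heavily simplified: only the member names a_desc, a_vals and a_next follow
the ones used in the gdb prints, and a_desc is reduced to a plain string
here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified length-counted value. */
struct val_sketch {
    size_t      bv_len;
    const char *bv_val;
};

/* Hypothetical, simplified attribute node. */
struct attr_sketch {
    const char              *a_desc;  /* simplified: attribute name only */
    const struct val_sketch *a_vals;  /* array of values, may be NULL */
    struct attr_sketch      *a_next;  /* next attribute, NULL at the end */
};

/* Print each node's address, name and first value, following a_next
 * until NULL.  Returns the number of nodes visited. */
static size_t walk_attrs(const struct attr_sketch *a, FILE *out)
{
    size_t n = 0;

    for (; a != NULL; a = a->a_next) {
        fprintf(out, "attr %p desc=%s", (const void *)a,
                a->a_desc ? a->a_desc : "(null)");
        /* Bound the print by bv_len; the value may lack a NUL. */
        if (a->a_vals != NULL && a->a_vals[0].bv_val != NULL)
            fprintf(out, " vals[0]=%.*s",
                    (int)a->a_vals[0].bv_len, a->a_vals[0].bv_val);
        fputc('\n', out);
        n++;
    }
    return n;
}
```

A node address such as 0x500000000 showing up in that output would
indicate that the pointer was garbage before the walk even started.
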
> 
> Hello,
> Tomorrow I will set up a pure SQL process and a pure pcache daemon
> that reads from the first over a Unix domain socket. This way it will
> be clear whether the crash is related to back-sql and the database
> drivers/ODBC manager or not.
> 
> Meanwhile, you can find the requested debugging session here:
> http://purgatory.spnet.net/~karavelov/attr_list/gdb-1
> 
> It seems that the "e" pointer is corrupted.

Good catch.

> Tomorrow I will run it under valgrind with more stack frames, as
> requested.

Another check you could probably do relatively quickly is to zero out
that "e" pointer before calling be_entry_get_rw() within refresh_merge().
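
The idea, as a minimal sketch: if a lookup writes its output pointer only
on success, a caller that does not initialize the pointer may end up
dereferencing stack garbage after a failure.  Everything below is
hypothetical (entry_sketch, lookup_sketch() and safe_lookup_id() are
made-up names); it only illustrates the pattern and is not the actual
pcache code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry type and a single stored entry. */
struct entry_sketch {
    int id;
};

static struct entry_sketch stored = { 42 };

/* Hypothetical lookup: writes *e only on success and leaves it
 * untouched on failure. */
static int lookup_sketch(int key, struct entry_sketch **e)
{
    if (key != stored.id)
        return -1;          /* failure: *e is NOT modified */
    *e = &stored;
    return 0;
}

/* Returns the entry id, or -1 if the lookup failed. */
static int safe_lookup_id(int key)
{
    struct entry_sketch *e = NULL;  /* zeroed before the call */
    int rc = lookup_sketch(key, &e);

    /* With e pre-set to NULL, a failed lookup can never leave a
     * stale pointer behind to be dereferenced. */
    if (rc != 0 || e == NULL)
        return -1;
    return e->id;
}
```

If the crash goes away with the pointer zeroed, that would suggest the
return code is being ignored somewhere and an uninitialized "e" is used.
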

Thanks, p.