Re: thread pools, performance
Rick Jones wrote:
The ethernet controllers are on a hub attached by HyperTransport to a
single processor; I don't think you can usefully distribute the
interrupts to anything beyond that socket.
Well, it wouldn't necessarily be goodness from the standpoint of the time to do
any PIO reads, but from the standpoint of getting packet processing spread out
and matched with the core on which the application thread runs, it might be.
It all depends, I guess, on how busy that HT link gets and whether the NIC(s)
can do the spreading in the first place. I wonder if there are any pre-packaged
Linux binaries out there to measure HT link utilization?
Haven't looked. Would be interesting to see.
Although, if there is still 10% idle, that probably needs to go next :)
Heh heh. The 80/20 rule hits this with a vengeance. That's "10%" of
"800%" total, which means really only about 1.2% of a CPU, which is
almost totally indistinguishable from measurement error in the oprofile
results. This is all intuition (guesswork) now, no more obvious hot
spots left to attack. Maybe if I'm really bored over the holidays I'll
spend some time on it. (Not likely.)
My degree is in Applied Math, which means I cannot do math to save my life, but
if it was 10 percentage points out of 800 percentage points total, that means
10% of 8 cores, or 80% of a core, doesn't it?
No, there are 800 percentage points total for all 8 cores, and 10 out of 800 are
idle. That's a total of 10% of a single core, but it's actually distributed
across all 8 cores. So 1.25% of each core.
If it were 10% of one core, then it would be 1.25% of 800% IIRC.
Yes.
And besides, the nature of single-threading bottlenecks seems to be that when
you go to 16 cores it will be rather more than 2X what it is today :)
Yes, well, I haven't got a 16-core test system handy at the moment...
The changes I just checked in hit the easy stuff, reducing mutex contention by
about 25% on a read-only workload with a nice corresponding boost in
throughput. I haven't gotten to the multiple pools yet.
The most obvious one was using per-thread lists for the Operation free lists
instead of a single global free list. This assumes that threads will get work
allocated relatively uniformly; I guess we can add some counts and push free
Ops onto a global list if a per-thread list gets too long. At the moment
there's no checking, so once a few operations have been performed Operation
allocation occurs with pretty much zero mutex activity. (This is probably
redundant effort when using something like Google tcmalloc, but not everyone
is using that...)
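In case the shape of that isn't obvious, here's a bare-bones sketch of the idea;
Operation's fields, op_alloc()/op_free(), and the gcc-specific __thread storage
class are all just illustrative stand-ins, not the actual slapd code:

#include <stdlib.h>

typedef struct Operation {
	struct Operation *o_next;	/* free-list link */
	/* ... real fields ... */
} Operation;

/* each thread gets its own free list, so the fast path needs no mutex;
 * pthread_getspecific() would be the portable alternative to __thread
 */
static __thread Operation *op_freelist;

static Operation *op_alloc( void )
{
	Operation *op = op_freelist;
	if ( op )
		op_freelist = op->o_next;	/* reuse, zero locking */
	else
		op = calloc( 1, sizeof( Operation ) );	/* cold path */
	return op;
}

static void op_free( Operation *op )
{
	/* push back onto this thread's list; the length check / spill to a
	 * global list mentioned above is omitted here
	 */
	op->o_next = op_freelist;
	op_freelist = op;
}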
The other change was using per-thread slap_counters for statistics instead of
just a global statistics block. In this case, the global slap_counters
structure still exists as the head of the list, and its mutex is used to
protect list traversals/manipulations. But actual live data is accumulated in
per-thread structures chained off the global.
The head mutex is locked when allocating a new per-thread structure,
deallocating a structure for an exiting thread, and whenever back-monitor
wants to tally up data to present to a client. In normal operation, once the
configured number of threads have been created, there will be no accesses to
the global at all. If cn=config is used to decrease the number of threads,
then some number of threads will exit and so they'll lock the list at that
time and accumulate their stats onto the head structure before destroying
themselves. So again, in normal operation, if nobody is querying cn=monitor,
there will be zero mutex contention due to statistics counters.
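To make that concrete, here's roughly the shape of it; the field and function
names below are made up for illustration and don't match the real slap_counters
definition:

#include <pthread.h>

typedef struct slap_counters_t {
	struct slap_counters_t *sc_next;	/* chain of per-thread blocks, head first */
	pthread_mutex_t sc_mutex;		/* only the head's mutex is ever used here */
	unsigned long sc_ops;			/* stand-in for the real statistics */
} slap_counters_t;

/* the global head; live data accumulates in the per-thread blocks chained off it */
static slap_counters_t slap_counters = { NULL, PTHREAD_MUTEX_INITIALIZER, 0 };

/* thread startup: link a fresh per-thread block in under the head mutex */
static void counters_thread_init( slap_counters_t *sc )
{
	sc->sc_ops = 0;
	pthread_mutex_lock( &slap_counters.sc_mutex );
	sc->sc_next = slap_counters.sc_next;
	slap_counters.sc_next = sc;
	pthread_mutex_unlock( &slap_counters.sc_mutex );
}

/* normal operation: each thread bumps only its own block, no locking */
static void counters_bump( slap_counters_t *sc )
{
	sc->sc_ops++;
}

/* what back-monitor does: walk the chain under the head mutex and tally,
 * including whatever exiting threads already folded into the head
 */
static unsigned long counters_total( void )
{
	unsigned long total = 0;
	slap_counters_t *sc;

	pthread_mutex_lock( &slap_counters.sc_mutex );
	for ( sc = &slap_counters; sc != NULL; sc = sc->sc_next )
		total += sc->sc_ops;
	pthread_mutex_unlock( &slap_counters.sc_mutex );
	return total;
}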
Something else I've been considering is padding certain structures to sizes
that are multiples of the cache line size. Not sure I want to write autoconf
tests for that; I may just arbitrarily choose 64 bytes and let folks override it
at compile time. The Connection array is the most likely candidate here.
Something like
#define ALIGN 64

typedef struct {
	int whatever;	/* the real members go here */
} real_struct;

/* A union, so the whole thing is sizeof(real_struct) rounded up to the
 * next multiple of ALIGN; a struct containing both members would add
 * the entire rounded size on top of the real one.
 */
typedef union {
	real_struct foo;
	char pad[(sizeof(real_struct)+ALIGN-1) & ~(ALIGN-1)];
} padded_struct;
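Just to illustrate the intent (not a claim about how the Connection array is
actually declared; MAX_CONN is a stand-in name):

padded_struct connections[MAX_CONN];

With that, connections[i] and connections[i+1] can never share a cache line, so
two threads working on adjacent slots won't false-share.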
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/