[Raw Msg Headers][Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Code profiling is good for the performance..
Hi,
By running somewhat random sized workloads (20 messages every time,
apparently not even roughly same amounts of recipients), I attacked
things at the top of the flat function profile.
( CFLAGS=-O -g3 -pg -non_shared )
Like profiling always, the initial top-of-list contained *strange*
things -- why strncpy() is there ???
% cumulative self self total
time seconds seconds calls ms/call ms/call name
12.5 3.75 3.75 19296 0.19 1.01 interpret <cycle 1> [5]
10.5 6.89 3.14 12396 0.25 0.25 strncpy [15]
7.8 9.23 2.34 689 3.40 6.22 cons_garbage_collect [10]
3.5 10.30 1.06 1263513 0.00 0.00 proc_at_0x120082da0 <cycle 4> [25]
3.5 11.35 1.05 193348 0.01 0.04 execute <cycle 1> [6]
3.1 12.28 0.93 31128 0.03 0.03 _cons_DSW [35]
2.9 13.14 0.86 1271703 0.00 0.00 proc_at_0x1200836d0 [31]
2.7 13.96 0.82 138012 0.01 0.01 v_find [30]
2.6 14.73 0.77 1580605 0.00 0.00 newcell [7]
2.3 15.44 0.70 549548 0.00 0.00 sp_lookup [28]
2.1 16.08 0.64 366072 0.00 0.01 _s_copy_tree [12]
2.1 16.72 0.64 269216 0.00 0.02 expand [8]
2.1 17.35 0.63 1314169 0.00 0.00 memcpy [47]
2.0 17.96 0.60 1837675 0.00 0.00 memset [50]
(total 29.99 cumulative seconds)
It turned out that in router/db.c there is strncpy() from the
input value into a key buffer, and it's TARGET size is BUF_SIZE.
Also, I didn't like how much the cons_garbage_collect() was called.
(It was called every 8000 newcell() calls, no matter if the free list
has something in it, or not..)
That fourth thing in the listing is likely some libc malloc internal
function.
The following sample contains mere 17% of the newcell() calls as
compared to the previous run, but taking that into account, the
scaled number of gc calls would be 247 -- 36% of the original.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
13.4 0.64 0.64 3237 0.20 0.88 interpret <cycle 1> [5]
4.3 0.84 0.20 216918 0.00 0.00 proc_at_0x120082dc0 <cycle 4> [18]
4.2 1.04 0.20 optimize [24]
3.6 1.21 0.17 24 7.12 8.76 SslWalker <cycle 1> [21]
3.4 1.37 0.16 32926 0.00 0.02 execute <cycle 1> [6]
3.1 1.52 0.15 42 3.51 7.58 cons_garbage_collect [16]
3.0 1.66 0.14 213449 0.00 0.00 proc_at_0x1200836f0 [31]
2.6 1.78 0.12 268387 0.00 0.00 newcell [11]
2.5 1.90 0.12 46425 0.00 0.02 expand [7]
2.4 2.02 0.12 227861 0.00 0.00 memcpy [40]
2.3 2.13 0.11 98314 0.00 0.00 AcceptInputToken <cycle 1> [41]
2.3 2.24 0.11 99187 0.00 0.00 sp_lookup [30]
2.1 2.34 0.10 311816 0.00 0.00 memset [47]
(total 4.74 cumulative seconds)
So, now attacking the performance issues will be more difficult, I should
imagine... (Memory allocation questions, very likely..)
Some strange things with the internal script IO facilities..
Apparently I forgot to send this earlier, so I had opportunity to 1) detect
one bug introduced at that first 'strncpy()' change/elimination, and 2) to
run longer test:
granularity: each sample hit covers 8 byte(s) for 0.00% of 1324.22 seconds
(flat profile:)
% cumulative self self total
time seconds seconds calls ms/call ms/call name
14.8 196.08 196.08 1062061 0.18 0.88 interpret <cycle 1> [7]
4.7 257.93 61.85 69283960 0.00 0.00 proc_at_0x120082e30 <cycle 4> [24]
4.0 310.67 52.74 13197 4.00 8.93 cons_garbage_collect [17]
3.9 362.64 51.97 70277104 0.00 0.00 proc_at_0x120083760 [25]
3.8 413.52 50.88 10642359 0.00 0.02 execute <cycle 1> [10]
3.6 460.58 47.06 101165793 0.00 0.00 memset [33]
3.2 502.90 42.31 86990890 0.00 0.00 newcell [12]
2.6 537.02 34.13 30034419 0.00 0.00 sp_lookup [29]
2.6 571.00 33.98 73023536 0.00 0.00 memcpy [48]
2.5 604.25 33.25 14775896 0.00 0.02 expand [8]
2.5 637.17 32.92 20110384 0.00 0.01 _s_copy_tree [14]
2.4 669.22 32.05 12476208 0.00 0.00 re_match_2_internal [49]
2.3 699.55 30.34 52885243 0.00 0.00 dupnstr [16]
2.0 725.71 26.16 2389686 0.01 0.02 regmatch [39]
2.0 751.54 25.83 69268306 0.00 0.00 malloc [18]
1.9 777.22 25.68 7604995 0.00 0.00 v_find [50]
1.8 800.59 23.37 84171403 0.00 0.00 NLstrdlen [51]
1.7 823.12 22.52 23884176 0.00 0.00 __fflush_unlocked [53]
1.6 843.98 20.86 15344881 0.00 0.00 squish [52]
1.5 863.83 19.85 52882404 0.00 0.00 freestr [20]
1.4 882.96 19.13 20348364 0.00 0.00 _splay [56]
1.4 901.75 18.79 69262738 0.00 0.00 free [21]
(total 1324.22 cumulative seconds)
This gives no clear indication of anything being really wrong -- well, the
garbage collect routine itself runs "long" time (4.0% of total), which propably
isn't all that bad. Hmm... but memset() is being run quite much too, why ?
Looking at another variant of gprof(1) report:
........................
Cumulative (self+descendants) profile:
granularity: each sample hit covers 8 byte(s) for 0.00% of 1324.22 seconds
self+descendants called/total parents
index %time self descendents called+self name index
secs secs called/total children
........................
[11] 16.2 14.03 201.13 43537090 copycell [11]
6.32 94.39 34697128/51756998 dupstr [15]
21.18 79.24 43537090/86990890 newcell [12]
Ok, copycell() is fifty-fifty for dynamic string duplication,
and conscell management.
[12] 15.2 42.31 158.33 86990890 newcell [12]
52.74 65.12 13197/13197 cons_garbage_collect [17]
40.47 0.00 86990890/101165793 memset [33]
Hmm... 20% of the newcell() is spent at memset(), I wonder if
that is really necessary ? That could cut (perhaps) 3% of total
runtime. (80% of newcell() is within the GC run.)
[13] 14.4 6.55 184.50 20110384 s_copy_tree [13]
[14] 13.9 32.92 151.58 20110384+33627375 _s_copy_tree [14]
9.88 141.70 30673066/43537090 copycell [11]
Well, 80% of the s_copy_tree() is done in copycell() calls,
thus propably no point in tuning _s_copy_tree() itself.
[15] 11.3 9.42 140.80 51756998 dupstr [15]
29.69 96.74 51756998/52885243 dupnstr [16]
14.37 0.00 51756998/84171403 NLstrdlen [51]
Hmm.. 10% of 11.3% runtime is spent in strlen().
Total save of doing massive change in concell
associated string length storage might not help
much -- e.g. perhaps mere 1.0-1.5 % of total runtime,
HOWEVER, it might allow NUL characters in strings...
[16] 9.8 30.34 98.85 52885243 dupnstr [16]
19.72 54.53 52885243/69268306 malloc [18]
24.61 0.00 52885243/73023536 memcpy [48]
The memcpy() presents mere 20% of the runtime, managing the
malloc() things is way more costly... (7.8% of total)
[17] 8.9 52.74 65.12 13197 cons_garbage_collect [17]
15.93 46.49 42418274/52882404 freestr [20]
0.06 2.44 550782/770382 cons_DSW [114]
0.01 0.18 13196/13196 cache_gc_markup_iterator [246]
Hmm.. Nearly 9% of total runtime is spent in GC, of which
53% is spent in freeing dynamically allocated strings,
e.g. 4.7% of total runtime is spent at the conscell
associated string storage freeing..
Apparently dupnstr() and freestr() contained malloc() and
free() spend total of about 12.5 % of the runtime in them.
Taking then about 1.8% of total runtime of strlen()s, of which
over 90% is inside dupstr()s, puts the notch to some 14%.
Add 3 % of propably unnecessary memset()s, we get some 17 %.
Ok, that is a fair size target to hit :^)
Changeing conscells a bit to store also the string size
might enlarge the cell at 32-bit systems, although at
64-bitters it would be invisible. Having 16-bit flags
and length fields at 32-bit machine might work, though.
[18] 7.3 25.83 71.42 69268306 malloc [18]
61.89 9.53 69268306/69268347 proc_at_0x120082e30 <cycle 4> [24]
........................
[20] 5.9 19.85 57.96 52882404 freestr [20]
14.35 43.62 52882404/69262738 free [21]
........................
Tracking these things was simple, tracking which
s_copy_tree()s are unnecessary, is not so simple..
/Matti Aarnio <mea@nic.funet.fi>