[Raw Msg Headers][Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Code profiling is good for the performance..




Hi,

  By running somewhat random sized workloads (20 messages every time,
apparently not even roughly same amounts of recipients), I attacked
things at the top of the flat function profile.
( CFLAGS=-O -g3 -pg -non_shared )

Like profiling always, the initial top-of-list contained *strange*
things -- why strncpy() is there ???

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 12.5       3.75     3.75    19296     0.19     1.01  interpret <cycle 1> [5]
 10.5       6.89     3.14    12396     0.25     0.25  strncpy [15]
  7.8       9.23     2.34      689     3.40     6.22  cons_garbage_collect [10]
  3.5      10.30     1.06  1263513     0.00     0.00  proc_at_0x120082da0 <cycle 4> [25]
  3.5      11.35     1.05   193348     0.01     0.04  execute <cycle 1> [6]
  3.1      12.28     0.93    31128     0.03     0.03  _cons_DSW [35]
  2.9      13.14     0.86  1271703     0.00     0.00  proc_at_0x1200836d0 [31]
  2.7      13.96     0.82   138012     0.01     0.01  v_find [30]
  2.6      14.73     0.77  1580605     0.00     0.00  newcell [7]
  2.3      15.44     0.70   549548     0.00     0.00  sp_lookup [28]
  2.1      16.08     0.64   366072     0.00     0.01  _s_copy_tree [12]
  2.1      16.72     0.64   269216     0.00     0.02  expand [8]
  2.1      17.35     0.63  1314169     0.00     0.00  memcpy [47]
  2.0      17.96     0.60  1837675     0.00     0.00  memset [50]
   (total  29.99  cumulative seconds)

It turned out that in   router/db.c  there is   strncpy()  from the
input value into a key buffer, and it's TARGET size is BUF_SIZE.

Also, I didn't like how much the  cons_garbage_collect()  was called.
(It was called every 8000 newcell() calls, no matter if the free list
 has something in it, or not..)

That fourth thing in the listing is likely some libc malloc internal
function.


The following sample contains mere 17% of the newcell() calls as
compared to the previous run, but taking that into account, the
scaled number of gc calls would be 247 -- 36% of the original.

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 13.4       0.64     0.64     3237     0.20     0.88  interpret <cycle 1> [5]
  4.3       0.84     0.20   216918     0.00     0.00  proc_at_0x120082dc0 <cycle 4> [18]
  4.2       1.04     0.20                             optimize [24]
  3.6       1.21     0.17       24     7.12     8.76  SslWalker <cycle 1> [21]
  3.4       1.37     0.16    32926     0.00     0.02  execute <cycle 1> [6]
  3.1       1.52     0.15       42     3.51     7.58  cons_garbage_collect [16]
  3.0       1.66     0.14   213449     0.00     0.00  proc_at_0x1200836f0 [31]
  2.6       1.78     0.12   268387     0.00     0.00  newcell [11]
  2.5       1.90     0.12    46425     0.00     0.02  expand [7]
  2.4       2.02     0.12   227861     0.00     0.00  memcpy [40]
  2.3       2.13     0.11    98314     0.00     0.00  AcceptInputToken <cycle 1> [41]
  2.3       2.24     0.11    99187     0.00     0.00  sp_lookup [30]
  2.1       2.34     0.10   311816     0.00     0.00  memset [47]
    (total  4.74  cumulative seconds)

So, now attacking the performance issues will be more difficult, I should
imagine...  (Memory allocation questions, very likely..)
Some strange things with the internal script IO facilities..


Apparently I forgot to send this earlier, so I had opportunity to 1) detect
one bug introduced at that first 'strncpy()' change/elimination, and 2) to
run longer test:

granularity: each sample hit covers 8 byte(s) for 0.00% of 1324.22 seconds

(flat profile:)

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 14.8     196.08   196.08  1062061     0.18     0.88  interpret <cycle 1> [7]
  4.7     257.93    61.85 69283960     0.00     0.00  proc_at_0x120082e30 <cycle 4> [24]
  4.0     310.67    52.74    13197     4.00     8.93  cons_garbage_collect [17]
  3.9     362.64    51.97 70277104     0.00     0.00  proc_at_0x120083760 [25]
  3.8     413.52    50.88 10642359     0.00     0.02  execute <cycle 1> [10]
  3.6     460.58    47.06 101165793     0.00     0.00  memset [33]
  3.2     502.90    42.31 86990890     0.00     0.00  newcell [12]
  2.6     537.02    34.13 30034419     0.00     0.00  sp_lookup [29]
  2.6     571.00    33.98 73023536     0.00     0.00  memcpy [48]
  2.5     604.25    33.25 14775896     0.00     0.02  expand [8]
  2.5     637.17    32.92 20110384     0.00     0.01  _s_copy_tree [14]
  2.4     669.22    32.05 12476208     0.00     0.00  re_match_2_internal [49]
  2.3     699.55    30.34 52885243     0.00     0.00  dupnstr [16]
  2.0     725.71    26.16  2389686     0.01     0.02  regmatch [39]
  2.0     751.54    25.83 69268306     0.00     0.00  malloc [18]
  1.9     777.22    25.68  7604995     0.00     0.00  v_find [50]
  1.8     800.59    23.37 84171403     0.00     0.00  NLstrdlen [51]
  1.7     823.12    22.52 23884176     0.00     0.00  __fflush_unlocked [53]
  1.6     843.98    20.86 15344881     0.00     0.00  squish [52]
  1.5     863.83    19.85 52882404     0.00     0.00  freestr [20]
  1.4     882.96    19.13 20348364     0.00     0.00  _splay [56]
  1.4     901.75    18.79 69262738     0.00     0.00  free [21]
 (total  1324.22  cumulative seconds)

This gives no clear indication of anything being really wrong -- well, the
garbage collect routine itself runs "long" time (4.0% of total), which propably
isn't all that bad.  Hmm... but memset() is being run quite much too, why ?

Looking at another variant of  gprof(1)  report:

	........................
Cumulative (self+descendants) profile:

granularity: each sample hit covers 8 byte(s) for 0.00% of 1324.22 seconds

   self+descendants               called/total       parents 
index  %time    self descendents  called+self    name    	index
                secs      secs    called/total       children
	........................
[11]    16.2   14.03      201.13 43537090         copycell [11]
                6.32       94.39 34697128/51756998     dupstr [15]
               21.18       79.24 43537090/86990890     newcell [12]

		Ok, copycell() is fifty-fifty for dynamic string duplication,
		and conscell management.

[12]    15.2   42.31      158.33 86990890         newcell [12]
               52.74       65.12   13197/13197       cons_garbage_collect [17]
               40.47        0.00 86990890/101165793     memset [33]

		Hmm... 20% of the  newcell()  is spent at memset(), I wonder if
		that is really necessary ?  That could cut (perhaps) 3% of total
		runtime.  (80% of newcell() is within the GC run.)

[13]    14.4    6.55      184.50 20110384         s_copy_tree [13]
[14]    13.9   32.92      151.58 20110384+33627375 _s_copy_tree [14]
                9.88      141.70 30673066/43537090     copycell [11]

		Well, 80% of the s_copy_tree() is done in copycell() calls,
		thus propably no point in tuning _s_copy_tree() itself.

[15]    11.3    9.42      140.80 51756998         dupstr [15]
               29.69       96.74 51756998/52885243     dupnstr [16]
               14.37        0.00 51756998/84171403     NLstrdlen [51]

		Hmm.. 10% of 11.3% runtime is spent in strlen().
		Total save of doing massive change in concell
		associated string length storage might not help
		much -- e.g. perhaps mere 1.0-1.5 % of total runtime,
		HOWEVER, it might allow NUL characters in strings...

[16]     9.8   30.34       98.85 52885243         dupnstr [16]
               19.72       54.53 52885243/69268306     malloc [18]
               24.61        0.00 52885243/73023536     memcpy [48]

		The memcpy() presents mere 20% of the runtime, managing the
		malloc() things is way more costly... (7.8% of total)

[17]     8.9   52.74       65.12   13197         cons_garbage_collect [17]
               15.93       46.49 42418274/52882404     freestr [20]
                0.06        2.44  550782/770382      cons_DSW [114]
                0.01        0.18   13196/13196       cache_gc_markup_iterator [246]

		Hmm.. Nearly 9% of total runtime is spent in GC, of which
		53% is spent in freeing dynamically allocated strings,
		e.g. 4.7% of total runtime is spent at the conscell
		associated string storage freeing..

		Apparently dupnstr() and freestr() contained malloc() and
		free() spend total of about 12.5 % of the runtime in them.
		Taking then about 1.8% of total runtime of strlen()s, of which
		over 90% is inside dupstr()s, puts the notch to some 14%.
		Add 3 % of propably unnecessary  memset()s, we get some 17 %.

		Ok, that is a fair size target to hit :^)

		Changeing conscells a bit to store also the string size
		might enlarge the cell at 32-bit systems, although at
		64-bitters it would be invisible.  Having 16-bit flags
		and length fields at 32-bit machine might work, though.

[18]     7.3   25.83       71.42 69268306         malloc [18]
               61.89        9.53 69268306/69268347     proc_at_0x120082e30 <cycle 4> [24]
	........................

[20]     5.9   19.85       57.96 52882404         freestr [20]
               14.35       43.62 52882404/69262738     free [21]
	........................


		Tracking these things was simple, tracking which
		s_copy_tree()s are unnecessary, is not so simple..

/Matti Aarnio <mea@nic.funet.fi>