A performance analysis chronicle

Richard Purdie

One thing that bothers me at the moment is how long we spend in
do_configure. I have some new build hardware but sadly, whilst it has
quite a large parallel capacity, it spends a lot of time at idle and
builds don't seem to do so well on it.

In particular, gettext has caught my attention. Specifically, gettext on
an idle system takes 128s to run do_configure, yet more like 285s when
run as part of the main build. Why? I've been trying to figure out
what contributes to this. Why is it important? If we could understand
this, we might be able to make "critical path" tasks faster, and if we
can do that, overall build times will come down.

My first attempt to look at this was from a scheduling priority level:

BB_TASK_NICE_LEVEL_pn-gettext = "0"
BB_TASK_IONICE_LEVEL_pn-gettext = "2.0"

This did offer some reduction, to around 275s, but scheduling priority
certainly doesn't seem to be the main issue.

Next, I tried:

WORKDIR_pn-gettext = "/media/build1/poky/build/tmpram/work/i586-poky-linux/gettext/0.19.4-r0"

with a tmpfs at tmpram. Sadly again, whilst this changed things like
do_unpack from 15s to 3s, it made no real difference to the overall
configure time. Since I don't believe the system to be IO bound (the
first 1700 tasks of a 6000 task core-image-sato build execute within 2
minutes) this perhaps isn't surprising.

I also wondered about contention over shared paths like the metadata and
sysroot being mixed in with WORKDIR. I tried a build with tmp/work and
tmp/work-shared on a different physical drive, again with no real
difference in performance.

I decided to get some information from perf, hacking buildstats.bbclass
to do:

pn = d.getVar("PN", True)
if pn == "gettext" and e.task == "do_configure":
    parentpid = os.getpid()
    pid = os.fork()
    if pid == 0:
        pid2 = os.fork()
        if pid2 > 0:
            output = subprocess.check_output("perf record -g -t %s --inherit -o /tmp/configureperf.%s" % (parentpid, parentpid), shell=True)

and with this obtained perf records for the build under load and at
idle. I've appended various output from perf report at the end of the
mail. Other than seeing that fork overhead is a significant problem, it's
hard to see where the extra time is being spent from that profile data.
If anyone knows perf well and can get better information, I'm open to ideas.

I also started looking at what the gettext configure test is doing under
strace using simplistic measurements like:

strace -f -s 2048 bitbake gettext-native -c configure 2> gtn.strace

This generates a file around 1GB in size; there are some useful
statistics you can pull from it though:

cat gtn.strace | cut -d ' ' -f 3 | grep '^[a-z]*(' | cut -d '(' -f 1 | sort | uniq -c > gtn.funcs
cat gtn.strace | grep execve | cut -d ' ' -f 3 | grep '^[a-z]*(' | sort | uniq -c > gtn.strace.execs

which show an approximate list of syscalls the task is making along with
counts of the number of calls made. It also shows the commands that are
being execve()'d.
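For anyone wanting to replay this kind of analysis without the shell
pipelines, the same counting can be sketched in Python. This is a
simplistic parser of my own (the function names are hypothetical, and it
assumes plain strace -f lines on stderr, where child processes get a
"[pid NNNN]" prefix), not a robust strace parser:

```python
import re
from collections import Counter

# Lines look either like 'execve("/bin/sh", ...) = 0' (the parent) or
# '[pid  1234] execve(...) = 0' (children, due to -f without -o).
SYSCALL_RE = re.compile(r'^(?:\[pid\s+\d+\]\s+)?([a-z_0-9]+)\(')

def count_syscalls(lines):
    """Return a Counter of syscall names seen in strace -f output."""
    counts = Counter()
    for line in lines:
        m = SYSCALL_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def count_execs(lines):
    """Return a Counter of the binaries passed to execve()."""
    counts = Counter()
    for line in lines:
        m = re.search(r'execve\("([^"]+)"', line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Feeding the 1GB trace through this line by line gives the same kind of
per-syscall and per-command totals as the cut/sort/uniq pipelines above.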

One thing which puzzled me was large numbers of calls to
"/usr/bin/print", "/usr/bin/which" and "/usr/bin/file", the latter with
some very odd looking arguments. The former is a perl script, part of
mailcap.

This turns out to be some rogue autoconf macro code which tries to see
if "print X" works in the shell if you're not using bash (my /bin/sh was
dash). It does this at the start of every configure script, and the
configure script attempts to reexec itself a few times. This in turn
loads the perl interpreter; the script then calls "which file", ignores
anything other than the return value, and then searches PATH for
file. This is just a complete and utter waste of time. There is a patch
below which simply removes the macro and assumes the shell's "printf" works.

With the above analysis approach, you can show that this is worth a couple
of thousand execve() calls (22,000 -> 20,000) and around 13,000 stat calls
(333,000 -> 320,000), and knocks around 10-15s off the loaded system
do_configure time, the unloaded system time was worth more like around

Of the 20,000 execs, 2,300 are "cat", 7,100 are "sed", 1,000 are "expr"
and the others look more problematic as being the linker/compiler.

I did try experimenting to see if the cat calls could be replaced with
something else. In autoconf, they are mostly from heredoc usage, e.g.:

[m4_if(m4_bregexp([$1], [\\\|`\|\$(\|\${\|@]), [-1],
[AS_ECHO(["AS_ESCAPE([$1], [""])"]) >>confdefs.h],
[cat >>confdefs.h <<_ACEOF

which I found you could replace with:

[m4_if(m4_bregexp([$1], [\\\|`\|\$(\|\${\|@]), [-1],
[AS_ECHO(["AS_ESCAPE([$1], [""])"]) >>confdefs.h],
[while IFS= read -r line || [[ -n "$line" ]]; do
printf '%s\n' "$line"
done >>confdefs.h <<_ACEOF

which has the advantage of using shell builtins. I did try other
constructs but couldn't find anything simpler, and even this case was
complicated, for example needing the || [ -n "$line" ] to ensure that
trailing data without a final newline is handled correctly, and needing
[[ so that m4 quoting emits a single [ (it's not a bashism).

Unfortunately, this is line buffered IO, so whilst I could cut 2,000
execve() calls, I added around 300,000 read calls instead. I did try
using read -d '' to avoid the EOL delimiter, however that internally turns
into single byte reads and 900,000 read calls!
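The buffering problem can be illustrated outside the shell too. The
sketch below is my own Python analogy (not what bash actually does
internally): a shell "read line" must not consume past the newline, so it
pays roughly one small read per line, whereas cat slurps large blocks.
Counting read-style calls on an in-memory file shows the difference:

```python
import io

class CountingFile(io.BytesIO):
    """A BytesIO that counts how many read-style calls are made on it."""
    def __init__(self, data):
        super().__init__(data)
        self.reads = 0
    def readline(self, *args):
        self.reads += 1
        return super().readline(*args)
    def read(self, *args):
        self.reads += 1
        return super().read(*args)

data = (b"x" * 10 + b"\n") * 1000  # 1000 short lines, 11000 bytes

line_mode = CountingFile(data)
while line_mode.readline():
    pass  # one "read" per line, like the shell's read builtin loop

bulk_mode = CountingFile(data)
while bulk_mode.read(65536):
    pass  # a handful of large block reads, like cat
```

line_mode ends up with one call per line (plus one at EOF) while
bulk_mode needs only a couple of calls for the whole file, which is the
same shape as the 2,000 execs traded for 300,000 reads above.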

So basically, this optimisation didn't really work. If we could make cat
a shell builtin it would help, but I haven't looked at what that would
involve.

The other thing which did jump out at me from the strace was the number
of PATH searches going on for common utilities. I did wonder about
whether our diverse PATH spanning metadata (the scripts dir), our
sysroot and the main system was somehow causing IO bottleneck issues. I
tested that above with the separate build drive for workdir, however I
also tried a different trick: adding symlinks in the scripts directory
for the most common utils (as, cat, chmod, cp, gcc, grep, rm and sed) to
the host utils in /bin or /usr/bin (the scripts directory is the first
one listed in PATH). I did also briefly experiment with changing "cat"
to "/bin/cat" but preferred the symlink approach.
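To make the cost model concrete, here's a small sketch of my own
(using a hypothetical five-entry PATH, not Yocto's real one) of why the
symlink trick helps: an execvp()-style lookup probes each PATH directory
in turn, so every miss is another access()/stat() paid on every single
exec, and moving the hit into the first entry removes all of them:

```python
import os
import tempfile

def probes_to_find(name, path_dirs):
    """Count directories probed before `name` is found, execvp-style."""
    for i, d in enumerate(path_dirs, start=1):
        if os.access(os.path.join(d, name), os.X_OK):
            return i
    return None

# Hypothetical layout: "cat" lives only in the last of five directories.
dirs = [tempfile.mkdtemp() for _ in range(5)]
target = os.path.join(dirs[-1], "cat")
with open(target, "w") as f:
    f.write("#!/bin/sh\n")
os.chmod(target, 0o755)

late = probes_to_find("cat", dirs)   # probes every directory

# Symlink it into the first PATH entry, as with the scripts directory:
os.symlink(target, os.path.join(dirs[0], "cat"))
early = probes_to_find("cat", dirs)  # found on the first probe
```

Multiply the saved probes by 20,000 execs and the 80,000 fewer stat
calls measured below are roughly what you'd expect.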

The profile shows this saves around 1,200 execve() calls and around
80,000 stat calls (320,000 -> 240,000) and took the do_configure time
under load to 240s, so there was a measurable effect; sadly the overall
build time didn't seem much changed however. The non-load case is still
around the 125s mark, so this perhaps does give a strong hint that cross
directory IO is contributing to the inflation of the build time under
load.

There are obviously further things which can be looked into from here,
however I thought people may find this story so far interesting and it
might inspire others to start looking more closely at this too!



Output from "perf report A"

Children Self Command Shared Object Symbol
- 5.41% 0.00% aclocal [unknown] [.] 0x43d6258d4c544155
- 5.08% 0.01% bash [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath
- entry_SYSCALL_64_fastpath
+ 84.05% __libc_fork
+ 3.77% __waitpid
+ 3.05% __GI___libc_write
+ 1.93% __GI___libc_read
+ 1.45% __xstat64
+ 0.91% __open_nocancel
+ 0.68% __brk
+ 0.66% __sigprocmask
+ 0.53% __pipe
+ 3.79% 0.02% bash bash [.] make_child
+ 3.61% 0.01% bash libc-2.21.so [.] __libc_fork
+ 3.49% 3.49% aclocal perl [.] Perl_pp_match
+ 3.49% 0.00% bash [kernel.kallsyms] [k] sys_clone
+ 3.49% 0.00% bash [kernel.kallsyms] [k] _do_fork
+ 3.30% 0.08% bash [kernel.kallsyms] [k] copy_process.part.29
+ 2.85% 0.52% bash [kernel.kallsyms] [k] page_fault
+ 2.80% 0.00% cc1 [unknown] [.] 0000000000000000
+ 2.41% 2.41% aclocal perl [.] Perl_fbm_instr
+ 2.39% 2.38% aclocal perl [.] Perl_re_intuit_start
+ 2.33% 0.01% bash [kernel.kallsyms] [k] do_page_fault
+ 2.31% 0.04% bash [kernel.kallsyms] [k] __do_page_fault
+ 2.23% 0.00% autom4te [unknown] [.] 0000000000000000
+ 2.15% 0.10% bash [kernel.kallsyms] [k] handle_mm_fault
+ 2.06% 0.01% bash [kernel.kallsyms] [k] perf_event_init_task
+ 2.04% 0.01% bash [kernel.kallsyms] [k] inherit_task_group.isra.95.part.96
+ 1.94% 0.01% cc1 [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath
+ 1.93% 0.14% bash [kernel.kallsyms] [k] inherit_event.isra.93
+ 1.71% 0.35% cc1 [kernel.kallsyms] [k] page_fault

Output from "perf diff A B -c ratio"

# Baseline Ratio Shared Object Symbol
# ........ .............. ....................... ...................................................................................
16.62% 1.275408 m4 [.] 0x0000000000002090
3.58% 0.934606 perl [.] Perl_pp_match
3.46% 1.039378 perl [.] 0x000000000001ac0e
3.12% 0.740720 cc1 [.] 0x000000000016d946
2.77% 0.949347 perl [.] Perl_fbm_instr
2.74% 0.908970 bash [.] 0x000000000001ed44
2.53% 0.889904 perl [.] Perl_re_intuit_start
1.91% 0.706060 i586-poky-linux-ld [.] 0x0000000000002430
1.89% 0.868389 perl [.] Perl_pp_nextstate
1.82% 1.053153 libc-2.21.so [.] _int_malloc

To see the list sorted by ratio:

"perf diff A B -c ratio -t : | sort -t: -k2,2 -nr | cut -c -100"

# Baseline Ratio Shared Object Symbol
# ........ .............. ....................... ...................................................................................
0.00: 107.335521:libpcre.so.3.13.1 :[.] 0x00000000000017e0
0.00: 58.518921:sed :[.] open@plt
0.00: 39.706804:[kernel.kallsyms] :[k] set_next_buddy
0.00: 32.437345:[kernel.kallsyms] :[k] find_busiest_group
0.00: 26.460722:[kernel.kallsyms] :[k] task_numa_fault
0.00: 25.791301:file :[.] 0x0000000000001ed0
0.00: 22.245486:libselinux.so.1 :[.] __fsetlocking@plt
0.00: 20.580930:cc1 :[.] _Z15make_pass_loop2PN3gcc7contextE
0.00: 17.729036:[kernel.kallsyms] :[k] update_sd_lb_stats
0.00: 16.301473:[kernel.kallsyms] :[k] rwsem_down_write_failed
0.00: 16.067699:[kernel.kallsyms] :[k] kernfs_refresh_inode
0.00: 15.090138:[kernel.kallsyms] :[k] seq_printf
0.00: 12.787537:ls :[.] 0x00000000000081d8
0.00: 11.657115:[kernel.kallsyms] :[k] ext4_da_invalidatepage
0.00: 11.656298:bash :[.] unary_test
0.00: 10.775472:[kernel.kallsyms] :[k] security_bprm_secureexec
0.00: 10.573946:[kernel.kallsyms] :[k] __queue_work
0.00: 9.638718:ld-2.21.so :[.] _xstat
0.00: 9.178533:cc1 :[.] _Z10add_paramsPK10param_infom
0.00: 8.937500:[kernel.kallsyms] :[k] cdev_get
0.00: 8.628314:sed :[.] getenv@plt
0.00: 8.618151:perl :[.] Perl_do_trans
0.00: 8.535535:[kernel.kallsyms] :[k] mangle_path
0.00: 8.356976:[kernel.kallsyms] :[k] __delete_from_page_cache
0.00: 8.124628:ld-2.21.so :[.] malloc@plt
0.00: 8.117871:[kernel.kallsyms] :[k] osq_lock
0.00: 7.991028:cc1 :[.] _Z20output_quoted_stringP8_IO_FILEPKc
0.00: 7.814020:[kernel.kallsyms] :[k] ext4_ext_remove_space
0.00: 7.549076:[kernel.kallsyms] :[k] load_balance
0.00: 7.441893:sed :[.] _init

or to see the biggest differences in period:

"perf diff A B -p -c wdiff:1,1 -t : | sort -t: -k3,3 -nr | cut -c -100"

# Baseline Base period Weighted diff Period Shared Object Symbol
# ........ .............. .............. .............. ....................... ...................................................
16.62:30953898486: 8524961890:39478860376:m4 :[.] 0x0000000000002090
1.73:3220644640: 538932468:3759577108:libc-2.21.so :[.] __strcmp_sse2_unaligned
1.49:2768327998: 436920702:3205248700:libc-2.21.so :[.] __memcpy_avx_unaligned
0.80:1499099032: 408223464:1907322496:[kernel.kallsyms] :[k] clear_page_c_e
0.97:1803966092: 323339402:2127305494:libc-2.21.so :[.] strlen
0.54:1006908653: 288332008:1295240661:libc-2.21.so :[.] __strchr_sse2
0.66:1228217319: 266221456:1494438775:[kernel.kallsyms] :[k] copy_page
3.46:6446114974: 253835378:6699950352:perl :[.] 0x000000000001ac0e
1.82:3388135954: 180089205:3568225159:libc-2.21.so :[.] _int_malloc
0.17:325395321: 166763174:492158495:libc-2.21.so :[.] __ctype_b_loc
0.57:1066020098: 162627432:1228647530:[kernel.kallsyms] :[k] filemap_map_pages
0.06:120013915: 140088632:260102547:[kernel.kallsyms] :[k] vma_interval_tree_insert
0.39:730051202: 132370215:862421417:[kernel.kallsyms] :[k] perf_event_alloc
0.55:1015490988: 131035917:1146526905:[kernel.kallsyms] :[k] perf_event_aux_ctx
0.43:793617868: 125666408:919284276:[kernel.kallsyms] :[k] memset_erms
0.74:1383682896: 122935416:1506618312:perl :[.] Perl_regexec_flags
0.83:1546248993: 120309599:1666558592:libc-2.21.so :[.] _IO_getc
0.25:469980070: 118797566:588777636:libc-2.21.so :[.] _dl_addr
0.26:490314628: 110317867:600632495:ld-2.21.so :[.] do_lookup_x
0.45:836227328: 106471933:942699261:perl :[.] Perl_yylex
0.85:1580066629: 103820430:1683887059:libc-2.21.so :[.] _int_free
0.62:1160141971: 101707344:1261849315:[kernel.kallsyms] :[k] unmap_page_range
0.01:19809999: 89831404:109641403:[kernel.kallsyms] :[k] native_queued_spin_lock_slowpath
0.60:1113390363: 86975498:1200365861:perl :[.] Perl_yyparse

Patch for autoconf to avoid /usr/bin/print

The check for solaris 'print' causes significant problems on a linux machine
with dash as /bin/sh since it triggers the execution of "print", which on some
linux systems is a perl script that is part of mailcap. Worse, this perl
script calls "which file" and, if successful, ignores the path file was found
in and just runs "file" without a path. Each execution causes PATH to be searched.

Simply assuming the shell's printf function works cuts out all the fork overhead
and when parallel tasks are running, this overhead appears to be significant.


Index: autoconf-2.69/lib/m4sugar/m4sh.m4
--- autoconf-2.69.orig/lib/m4sugar/m4sh.m4
+++ autoconf-2.69/lib/m4sugar/m4sh.m4
@@ -1045,40 +1045,8 @@ m4_defun([_AS_ECHO_PREPARE],
 export as_nl
-# Printing a long string crashes Solaris 7 /usr/bin/printf.
-# Prefer a ksh shell builtin over an external printf program on Solaris,
-# but without wasting forks for bash or zsh.
-if test -z "$BASH_VERSION$ZSH_VERSION" \
-   && (test "X`print -r -- $as_echo`" = "X$as_echo") 2>/dev/null; then
-  as_echo='print -r --'
-  as_echo_n='print -rn --'
-elif (test "X`printf %s $as_echo`" = "X$as_echo") 2>/dev/null; then
-  as_echo='printf %s\n'
-  as_echo_n='printf %s'
-else
-  if test "X`(/usr/ucb/echo -n -n $as_echo) 2>/dev/null`" = "X-n $as_echo"; then
-    as_echo_body='eval /usr/ucb/echo -n "$][1$as_nl"'
-    as_echo_n='/usr/ucb/echo -n'
-  else
-    as_echo_body='eval expr "X$][1" : "X\\(.*\\)"'
-    as_echo_n_body='eval
-      arg=$][1;
-      case $arg in @%:@(
-      *"$as_nl"*)
-        expr "X$arg" : "X\\(.*\\)$as_nl";
-        arg=`expr "X$arg" : ".*$as_nl\\(.*\\)"`;;
-      esac;
-      expr "X$arg" : "X\\(.*\\)" | tr -d "$as_nl"
-    '
-    export as_echo_n_body
-    as_echo_n='sh -c $as_echo_n_body as_echo'
-  fi
-  export as_echo_body
-  as_echo='sh -c $as_echo_body as_echo'
-fi
+as_echo='printf %s\n'
+as_echo_n='printf %s'
