Though I can intuitively get most of the results, I'm having hard time fully understanding the output of the perf report
command especially for what concerns the call graph, so I wrote a stupid test to solve this issue of mine once for all.
I compiled what follows with:
gcc -Wall -pedantic -lm perf-test.c -o perf-test
No aggressive optimizations to avoid inlining and such.
#include <math.h>
#define N 10000000UL
#define USELESSNESS(n) \
do { \
unsigned long i; \
double x = 42; \
for (i = 0; i < (n); i++) x = sin(x); \
} while (0)
void baz()
{
USELESSNESS(N);
}
void bar()
{
USELESSNESS(2 * N);
baz();
}
void foo()
{
USELESSNESS(3 * N);
bar();
baz();
}
int main()
{
foo();
return 0;
}
perf record ./perf-test
perf report
With these I get:
94,44% perf-test libm-2.19.so [.] __sin_sse2
2,09% perf-test perf-test [.] sin@plt
1,24% perf-test perf-test [.] foo
0,85% perf-test perf-test [.] baz
0,83% perf-test perf-test [.] bar
Which sounds reasonable since the heavy work is actually performed by __sin_sse2
and sin@plt
is probably just a wrapper, while the overhead of my functions take into account just the loop, overall: 3*N
iterations for foo
, 2*N
for the other two.
perf record -g ./perf-test
perf report -G
perf report
Now the overhead columns that I get are two: Children
(the output is sorted by this one by default) and Self
(the same overhead of the flat profile).
Here is where I start feeling I miss something: regardless of the fact that I use -G
or not I'm unable to explain the hierarchy in terms of "x calls y" or "y is called by x", for example:
without -G
("y is called by x"):
- 94,34% 94,06% perf-test libm-2.19.so [.] __sin_sse2
- __sin_sse2
+ 43,67% foo
+ 41,45% main
+ 14,88% bar
- 37,73% 0,00% perf-test perf-test [.] main
main
__libc_start_main
- 23,41% 1,35% perf-test perf-test [.] foo
foo
main
__libc_start_main
- 6,43% 0,83% perf-test perf-test [.] bar
bar
foo
main
__libc_start_main
- 0,98% 0,98% perf-test perf-test [.] baz
- baz
+ 54,71% foo
+ 45,29% bar
__sin_sse2
is called by main
(indirectly?), foo
and bar
but not by baz
?baz
) and sometimes not (e.g., the last instance of bar
)?with -G
("x calls y"):
- 94,34% 94,06% perf-test libm-2.19.so [.] __sin_sse2
+ __sin_sse2
+ __libc_start_main
+ main
- 37,73% 0,00% perf-test perf-test [.] main
- main
+ 62,05% foo
+ 35,73% __sin_sse2
2,23% sin@plt
- 23,41% 1,35% perf-test perf-test [.] foo
- foo
+ 64,40% __sin_sse2
+ 29,18% bar
+ 3,98% sin@plt
2,44% baz
__libc_start_main
main
foo
__sin_sse2
?main
calls foo
and that's ok, but why if it calls __sin_sse2
and sin@plt
(indirectly?) it does not also call bar
and baz
?__libc_start_main
and main
appear under foo
? And why foo
appears twice?Suspect is that there are two levels of this hierarchy, in which the second actually represents the "x calls y"/"y is called by x" semantics, but I'm tired to guess so I'm asking here. And the documentation doesn't seem to help.
Sorry for the long post but I hope that all this context may help or act as a reference for someone else too.
Alright, well, let's ignore temporarily the difference between caller and callee call-graphs, mostly because when I compare the results between these two options on my machine, I only see effects inside the kernel.kallsyms
DSO for reasons I don't understand -- relatively new to this myself.
I found that for your example, it's a little easier to read the whole tree. So, using --stdio
, let's look at the whole tree for __sin_sse2
:
# Overhead Command Shared Object Symbol
# ........ ......... ................. ......................
#
94.72% perf-test libm-2.19.so [.] __sin_sse2
|
--- __sin_sse2
|
|--44.20%-- foo
| |
| --100.00%-- main
| __libc_start_main
| _start
| 0x0
|
|--27.95%-- baz
| |
| |--51.78%-- bar
| | foo
| | main
| | __libc_start_main
| | _start
| | 0x0
| |
| --48.22%-- foo
| main
| __libc_start_main
| _start
| 0x0
|
--27.84%-- bar
|
--100.00%-- foo
main
__libc_start_main
_start
0x0
So, the way I read this is: 44% of the time, sin
is called from foo
; 27% of the time it's called from baz
, and 27% from bar.
The documentation for -g is instructive:
-g [type,min[,limit],order[,key]], --call-graph
Display call chains using type, min percent threshold, optional print limit and order. type can be either:
· flat: single column, linear exposure of call chains.
· graph: use a graph tree, displaying absolute overhead rates.
· fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object.
order can be either:
- callee: callee based call graph.
- caller: inverted caller based call graph.
key can be:
- function: compare on functions
- address: compare on individual code addresses
Default: fractal,0.5,callee,function.
The important piece here is that the default is fractal, and in fractal mode, each branch is a new object.
So, you can see that 50% of the time that baz
is called, it's called from bar
, and the other 50% it's called from foo
.
This isn't always the most useful measure, so it's instructive to look at the results using -g graph
:
94.72% perf-test libm-2.19.so [.] __sin_sse2
|
--- __sin_sse2
|
|--41.87%-- foo
| |
| --41.48%-- main
| __libc_start_main
| _start
| 0x0
|
|--26.48%-- baz
| |
| |--13.50%-- bar
| | foo
| | main
| | __libc_start_main
| | _start
| | 0x0
| |
| --12.57%-- foo
| main
| __libc_start_main
| _start
| 0x0
|
--26.38%-- bar
|
--26.17%-- foo
main
__libc_start_main
_start
0x0
This changes to using absolute percentages, where each percentage of time is reported for that call chain: So foo->bar
is 26% of the total ticks (which in turn calls baz
), and foo->baz
(direct) is 12% of the total ticks.
I still have no idea why I don't see any differences between callee and caller graphs though, from the perspective of __sin_sse2
.
One thing I did change from your command line is how the callgraphs were gathered. Linux perf by default uses the frame pointer method of reconstructing callstacks. This can be a problem when the compiler uses -fomit-frame-pointer
as a default. So I used
perf record --call-graph dwarf ./perf-test