Understanding Linux perf report output

cYrus picture cYrus · Jan 2, 2015 · Viewed 16k times · Source

Though I can intuitively get most of the results, I'm having hard time fully understanding the output of the perf report command especially for what concerns the call graph, so I wrote a stupid test to solve this issue of mine once for all.

The stupid test

I compiled what follows with:

gcc -Wall -pedantic -lm perf-test.c -o perf-test

No aggressive optimizations to avoid inlining and such.

#include <math.h>

#define N 10000000UL

#define USELESSNESS(n)                          \
    do {                                        \
        unsigned long i;                        \
        double x = 42;                          \
        for (i = 0; i < (n); i++) x = sin(x);   \
    } while (0)

void baz()
{
    USELESSNESS(N);
}

void bar()
{
    USELESSNESS(2 * N);
    baz();
}

void foo()
{
    USELESSNESS(3 * N);
    bar();
    baz();
}

int main()
{
    foo();
    return 0;
}

Flat profiling

perf record ./perf-test
perf report

With these I get:

  94,44%  perf-test  libm-2.19.so       [.] __sin_sse2
   2,09%  perf-test  perf-test          [.] sin@plt
   1,24%  perf-test  perf-test          [.] foo
   0,85%  perf-test  perf-test          [.] baz
   0,83%  perf-test  perf-test          [.] bar

Which sounds reasonable since the heavy work is actually performed by __sin_sse2 and sin@plt is probably just a wrapper, while the overhead of my functions take into account just the loop, overall: 3*N iterations for foo, 2*N for the other two.

Hierarchical profiling

perf record -g ./perf-test
perf report -G
perf report

Now the overhead columns that I get are two: Children (the output is sorted by this one by default) and Self (the same overhead of the flat profile).

Here is where I start feeling I miss something: regardless of the fact that I use -G or not I'm unable to explain the hierarchy in terms of "x calls y" or "y is called by x", for example:

  • without -G ("y is called by x"):

    -   94,34%    94,06%  perf-test  libm-2.19.so       [.] __sin_sse2
       - __sin_sse2
          + 43,67% foo
          + 41,45% main
          + 14,88% bar
    -   37,73%     0,00%  perf-test  perf-test          [.] main
         main
         __libc_start_main
    -   23,41%     1,35%  perf-test  perf-test          [.] foo
         foo
         main
         __libc_start_main
    -    6,43%     0,83%  perf-test  perf-test          [.] bar
         bar
         foo
         main
         __libc_start_main
    -    0,98%     0,98%  perf-test  perf-test          [.] baz
       - baz
          + 54,71% foo
          + 45,29% bar
    
    1. Why __sin_sse2 is called by main (indirectly?), foo and bar but not by baz?
    2. Why functions sometimes have a percent and a hierarchy attached (e.g., the last instance of baz) and sometimes not (e.g., the last instance of bar)?
  • with -G ("x calls y"):

    -   94,34%    94,06%  perf-test  libm-2.19.so       [.] __sin_sse2
       + __sin_sse2
       + __libc_start_main
       + main
    -   37,73%     0,00%  perf-test  perf-test          [.] main
       - main
          + 62,05% foo
          + 35,73% __sin_sse2
            2,23% sin@plt
    -   23,41%     1,35%  perf-test  perf-test          [.] foo
       - foo
          + 64,40% __sin_sse2
          + 29,18% bar
          + 3,98% sin@plt
            2,44% baz
         __libc_start_main
         main
         foo
    
    1. How should I interpret the first three entries under __sin_sse2?
    2. main calls foo and that's ok, but why if it calls __sin_sse2 and sin@plt (indirectly?) it does not also call bar and baz?
    3. Why do __libc_start_main and main appear under foo? And why foo appears twice?

Suspect is that there are two levels of this hierarchy, in which the second actually represents the "x calls y"/"y is called by x" semantics, but I'm tired to guess so I'm asking here. And the documentation doesn't seem to help.


Sorry for the long post but I hope that all this context may help or act as a reference for someone else too.

Answer

Matthew G. picture Matthew G. · Feb 4, 2015

Alright, well, let's ignore temporarily the difference between caller and callee call-graphs, mostly because when I compare the results between these two options on my machine, I only see effects inside the kernel.kallsyms DSO for reasons I don't understand -- relatively new to this myself.

I found that for your example, it's a little easier to read the whole tree. So, using --stdio, let's look at the whole tree for __sin_sse2:

# Overhead    Command      Shared Object                  Symbol
# ........  .........  .................  ......................
#
    94.72%  perf-test  libm-2.19.so       [.] __sin_sse2
            |
            --- __sin_sse2
               |
               |--44.20%-- foo
               |          |
               |           --100.00%-- main
               |                     __libc_start_main
               |                     _start
               |                     0x0
               |
               |--27.95%-- baz
               |          |
               |          |--51.78%-- bar
               |          |          foo
               |          |          main
               |          |          __libc_start_main
               |          |          _start
               |          |          0x0
               |          |
               |           --48.22%-- foo
               |                     main
               |                     __libc_start_main
               |                     _start
               |                     0x0
               |
                --27.84%-- bar
                          |
                           --100.00%-- foo
                                     main
                                     __libc_start_main
                                     _start
                                     0x0

So, the way I read this is: 44% of the time, sin is called from foo; 27% of the time it's called from baz, and 27% from bar.

The documentation for -g is instructive:

 -g [type,min[,limit],order[,key]], --call-graph
       Display call chains using type, min percent threshold, optional print limit and order. type can be either:

       ·   flat: single column, linear exposure of call chains.

       ·   graph: use a graph tree, displaying absolute overhead rates.

       ·   fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object.

               order can be either:
               - callee: callee based call graph.
               - caller: inverted caller based call graph.

               key can be:
               - function: compare on functions
               - address: compare on individual code addresses

               Default: fractal,0.5,callee,function.

The important piece here is that the default is fractal, and in fractal mode, each branch is a new object.

So, you can see that 50% of the time that baz is called, it's called from bar, and the other 50% it's called from foo.

This isn't always the most useful measure, so it's instructive to look at the results using -g graph:

94.72%  perf-test  libm-2.19.so       [.] __sin_sse2
        |
        --- __sin_sse2
           |
           |--41.87%-- foo
           |          |
           |           --41.48%-- main
           |                     __libc_start_main
           |                     _start
           |                     0x0
           |
           |--26.48%-- baz
           |          |
           |          |--13.50%-- bar
           |          |          foo
           |          |          main
           |          |          __libc_start_main
           |          |          _start
           |          |          0x0
           |          |
           |           --12.57%-- foo
           |                     main
           |                     __libc_start_main
           |                     _start
           |                     0x0
           |
            --26.38%-- bar
                      |
                       --26.17%-- foo
                                 main
                                 __libc_start_main
                                 _start
                                 0x0

This changes to using absolute percentages, where each percentage of time is reported for that call chain: So foo->bar is 26% of the total ticks (which in turn calls baz), and foo->baz (direct) is 12% of the total ticks.

I still have no idea why I don't see any differences between callee and caller graphs though, from the perspective of __sin_sse2.

Update

One thing I did change from your command line is how the callgraphs were gathered. Linux perf by default uses the frame pointer method of reconstructing callstacks. This can be a problem when the compiler uses -fomit-frame-pointer as a default. So I used

perf record --call-graph dwarf ./perf-test