Clang vs GCC - which produces faster binaries?

SF. picture SF. · Jul 6, 2010 · Viewed 120.3k times · Source

I'm currently using GCC, but I discovered Clang recently and I'm pondering switching. There is one deciding factor though - quality (speed, memory footprint, reliability) of binaries it produces - if gcc -O3can produce a binary that runs 1% faster, or Clang binaries take up more memory or just fail due to compiler bugs, it's a deal-breaker.

Clang boasts better compile speeds and lower compile-time memory footprint than GCC, but I'm really interested in benchmarks/comparisons of resulting compiled software - could you point me to some or describe your experiences?

Answer

Mike Kinghan picture Mike Kinghan · Feb 23, 2013

Here are some up-to-date albeit narrow findings of mine with GCC 4.7.2 and Clang 3.2 for C++.

UPDATE: GCC 4.8.1 v clang 3.3 comparison appended below.

UPDATE: GCC 4.8.2 v clang 3.4 comparison is appended to that.

I maintain an OSS tool that is built for Linux with both GCC and Clang, and with Microsoft's compiler for Windows. The tool, coan, is a preprocessor and analyser of C/C++ source files and codelines of such: its computational profile majors on recursive-descent parsing and file-handling. The development branch (to which these results pertain) comprises at present around 11K LOC in about 90 files. It is coded, now, in C++ that is rich in polymorphism and templates and but is still mired in many patches by its not-so-distant past in hacked-together C. Move semantics are not expressly exploited. It is single-threaded. I have devoted no serious effort to optimizing it, while the "architecture" remains so largely ToDo.

I employed Clang prior to 3.2 only as an experimental compiler because, despite its superior compilation speed and diagnostics, its C++11 standard support lagged the contemporary GCC version in the respects exercised by coan. With 3.2, this gap has been closed.

My Linux test harness for current coan development processes roughly 70K sources files in a mixture of one-file parser test-cases, stress tests consuming 1000s of files and scenario tests consuming < 1K files. As well as reporting the test results, the harness accumulates and displays the totals of files consumed and the run time consumed in coan (it just passes each coan command line to the Linux time command and captures and adds up the reported numbers). The timings are flattered by the fact that any number of tests which take 0 measurable time will all add up to 0, but the contribution of such tests is negligible. The timing stats are displayed at the end of make check like this:

coan_test_timer: info: coan processed 70844 input_files.
coan_test_timer: info: run time in coan: 16.4 secs.
coan_test_timer: info: Average processing time per input file: 0.000231 secs.

I compared the test harness performance as between GCC 4.7.2 and Clang 3.2, all things being equal except the compilers. As of Clang 3.2, I no longer require any preprocessor differentiation between code tracts that GCC will compile and Clang alternatives. I built to the same C++ library (GCC's) in each case and ran all the comparisons consecutively in the same terminal session.

The default optimization level for my release build is -O2. I also successfully tested builds at -O3. I tested each configuration 3 times back-to-back and averaged the 3 outcomes, with the following results. The number in a data-cell is the average number of microseconds consumed by the coan executable to process each of the ~70K input files (read, parse and write output and diagnostics).

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.7.2 | 231 | 237 |0.97 |
----------|-----|-----|-----|
Clang-3.2 | 234 | 186 |1.25 |
----------|-----|-----|------
GCC/Clang |0.99 | 1.27|

Any particular application is very likely to have traits that play unfairly to a compiler's strengths or weaknesses. Rigorous benchmarking employs diverse applications. With that well in mind, the noteworthy features of these data are:

  1. -O3 optimization was marginally detrimental to GCC
  2. -O3 optimization was importantly beneficial to Clang
  3. At -O2 optimization, GCC was faster than Clang by just a whisker
  4. At -O3 optimization, Clang was importantly faster than GCC.

A further interesting comparison of the two compilers emerged by accident shortly after those findings. Coan liberally employs smart pointers and one such is heavily exercised in the file handling. This particular smart-pointer type had been typedef'd in prior releases for the sake of compiler-differentiation, to be an std::unique_ptr<X> if the configured compiler had sufficiently mature support for its usage as that, and otherwise an std::shared_ptr<X>. The bias to std::unique_ptr was foolish, since these pointers were in fact transferred around, but std::unique_ptr looked like the fitter option for replacing std::auto_ptr at a point when the C++11 variants were novel to me.

In the course of experimental builds to gauge Clang 3.2's continued need for this and similar differentiation, I inadvertently built std::shared_ptr<X> when I had intended to build std::unique_ptr<X>, and was surprised to observe that the resulting executable, with default -O2 optimization, was the fastest I had seen, sometimes achieving 184 msecs. per input file. With this one change to the source code, the corresponding results were these;

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.7.2 | 234 | 234 |1.00 |
----------|-----|-----|-----|
Clang-3.2 | 188 | 187 |1.00 |
----------|-----|-----|------
GCC/Clang |1.24 |1.25 |

The points of note here are:

  1. Neither compiler now benefits at all from -O3 optimization.
  2. Clang beats GCC just as importantly at each level of optimization.
  3. GCC's performance is only marginally affected by the smart-pointer type change.
  4. Clang's -O2 performance is importantly affected by the smart-pointer type change.

Before and after the smart-pointer type change, Clang is able to build a substantially faster coan executable at -O3 optimisation, and it can build an equally faster executable at -O2 and -O3 when that pointer-type is the best one - std::shared_ptr<X> - for the job.

An obvious question that I am not competent to comment upon is why Clang should be able to find a 25% -O2 speed-up in my application when a heavily used smart-pointer-type is changed from unique to shared, while GCC is indifferent to the same change. Nor do I know whether I should cheer or boo the discovery that Clang's -O2 optimization harbours such huge sensitivity to the wisdom of my smart-pointer choices.

UPDATE: GCC 4.8.1 v clang 3.3

The corresponding results now are:

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.8.1 | 442 | 443 |1.00 |
----------|-----|-----|-----|
Clang-3.3 | 374 | 370 |1.01 |
----------|-----|-----|------
GCC/Clang |1.18 |1.20 |

The fact that all four executables now take a much greater average time than previously to process 1 file does not reflect on the latest compilers' performance. It is due to the fact that the later development branch of the test application has taken on lot of parsing sophistication in the meantime and pays for it in speed. Only the ratios are significant.

The points of note now are not arrestingly novel:

  • GCC is indifferent to -O3 optimization
  • clang benefits very marginally from -O3 optimization
  • clang beats GCC by a similarly important margin at each level of optimization.

Comparing these results with those for GCC 4.7.2 and clang 3.2, it stands out that GCC has clawed back about a quarter of clang's lead at each optimization level. But since the test application has been heavily developed in the meantime one cannot confidently attribute this to a catch-up in GCC's code-generation. (This time, I have noted the application snapshot from which the timings were obtained and can use it again.)

UPDATE: GCC 4.8.2 v clang 3.4

I finished the update for GCC 4.8.1 v Clang 3.3 saying that I would stick to the same coan snaphot for further updates. But I decided instead to test on that snapshot (rev. 301) and on the latest development snapshot I have that passes its test suite (rev. 619). This gives the results a bit of longitude, and I had another motive:

My original posting noted that I had devoted no effort to optimizing coan for speed. This was still the case as of rev. 301. However, after I had built the timing apparatus into the coan test harness, every time I ran the test suite the performance impact of the latest changes stared me in the face. I saw that it was often surprisingly big and that the trend was more steeply negative than I felt to be merited by gains in functionality.

By rev. 308 the average processing time per input file in the test suite had well more than doubled since the first posting here. At that point I made a U-turn on my 10 year policy of not bothering about performance. In the intensive spate of revisions up to 619 performance was always a consideration and a large number of them went purely to rewriting key load-bearers on fundamentally faster lines (though without using any non-standard compiler features to do so). It would be interesting to see each compiler's reaction to this U-turn,

Here is the now familiar timings matrix for the latest two compilers' builds of rev.301:

coan - rev.301 results

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.8.2 | 428 | 428 |1.00 |
----------|-----|-----|-----|
Clang-3.4 | 390 | 365 |1.07 |
----------|-----|-----|------
GCC/Clang | 1.1 | 1.17|

The story here is only marginally changed from GCC-4.8.1 and Clang-3.3. GCC's showing is a trifle better. Clang's is a trifle worse. Noise could well account for this. Clang still comes out ahead by -O2 and -O3 margins that wouldn't matter in most applications but would matter to quite a few.

And here is the matrix for rev. 619.

coan - rev.619 results

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.8.2 | 210 | 208 |1.01 |
----------|-----|-----|-----|
Clang-3.4 | 252 | 250 |1.01 |
----------|-----|-----|------
GCC/Clang |0.83 | 0.83|

Taking the 301 and the 619 figures side by side, several points speak out.

  • I was aiming to write faster code, and both compilers emphatically vindicate my efforts. But:

  • GCC repays those efforts far more generously than Clang. At -O2 optimization Clang's 619 build is 46% faster than its 301 build: at -O3 Clang's improvement is 31%. Good, but at each optimization level GCC's 619 build is more than twice as fast as its 301.

  • GCC more than reverses Clang's former superiority. And at each optimization level GCC now beats Clang by 17%.

  • Clang's ability in the 301 build to get more leverage than GCC from -O3 optimization is gone in the 619 build. Neither compiler gains meaningfully from -O3.

I was sufficiently surprised by this reversal of fortunes that I suspected I might have accidentally made a sluggish build of clang 3.4 itself (since I built it from source). So I re-ran the 619 test with my distro's stock Clang 3.3. The results were practically the same as for 3.4.

So as regards reaction to the U-turn: On the numbers here, Clang has done much better than GCC at at wringing speed out of my C++ code when I was giving it no help. When I put my mind to helping, GCC did a much better job than Clang.

I don't elevate that observation into a principle, but I take the lesson that "Which compiler produces the better binaries?" is a question that, even if you specify the test suite to which the answer shall be relative, still is not a clear-cut matter of just timing the binaries.

Is your better binary the fastest binary, or is it the one that best compensates for cheaply crafted code? Or best compensates for expensively crafted code that prioritizes maintainability and reuse over speed? It depends on the nature and relative weights of your motives for producing the binary, and of the constraints under which you do so.

And in any case, if you deeply care about building "the best" binaries then you had better keep checking how successive iterations of compilers deliver on your idea of "the best" over successive iterations of your code.