I want to use the hardware performance counters on Intel and AMD x86_64 multicore processors to count the stores retired by a program, with each thread counting its own retired stores separately. Can this be done, and if so, how in C/C++?
You can use Perfctr or PAPI to count hardware events for a specific part of the program from inside the program itself (without launching any third-party tool).
Perfctr quickstart: http://www.ale.csce.kyushu-u.ac.jp/~satoshi/how_to_use_perfctr.htm
PAPI homepage: http://icl.cs.utk.edu/papi/
PerfSuite good doc: http://perfsuite.ncsa.illinois.edu/publications/LJ135/x27.html
If measuring from outside the program is acceptable, modern Linux also provides the perf command.
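The perf command is built on the Linux perf_event_open system call, which a program can also call directly. The following is only a sketch under stated assumptions: it uses the generic "L1 data cache write access" event as a stand-in for retired stores, which may map to a different hardware event, or to none at all, depending on the CPU and kernel. The helper names open_store_counter, count_stores and store_workload are hypothetical.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef void (*work_fn)(void *);

/* Open a counter for the calling thread only (pid = 0, cpu = -1).
 * Returns a file descriptor, or -1 if the event cannot be opened. */
static int open_store_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    /* Assumption: L1D write accesses approximate retired stores. */
    attr.config = PERF_COUNT_HW_CACHE_L1D |
                  (PERF_COUNT_HW_CACHE_OP_WRITE << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16);
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* count user-space stores only */
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Run fn(arg) on the calling thread and store the event count in *out.
 * Returns 0 on success, -1 if the counter is unavailable. */
static int count_stores(work_fn fn, void *arg, long long *out)
{
    uint64_t value = 0;
    ssize_t n;
    int fd = open_store_counter();

    if (fd < 0)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    n = read(fd, &value, sizeof(value));
    close(fd);
    if (n != (ssize_t)sizeof(value))
        return -1;
    *out = (long long)value;
    return 0;
}

/* Hypothetical workload: 1000 stores into a 16-element int buffer. */
static void store_workload(void *arg)
{
    volatile int *buf = arg;
    for (int i = 0; i < 1000; i++)
        buf[i % 16] = i;
}
```

Because the counter is opened for the calling thread and is not inherited by children, each thread that calls count_stores(store_workload, buf, &n) from its own start routine gets its own count. The call can fail on systems where access to performance counters is restricted (for example by the perf_event_paranoid setting) or where the event is not supported, so the return value should always be checked.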