I would like to learn more about low level code optimization, and how to take advantage of the underlying machine architecture. I am looking for good pointers on where to read about this topic.
More details:
I am interested in optimization in the context of scientific computing (which is a lot of number crunching but not only) in low level languages such as C/C++. I am in particular interested in optimization methods that are not obvious unless one has a good understanding of how the machine works (which I don't---yet).
For example, it's clear that a better algorithm is faster, without knowing anything about the machine it's run on. It's not at all obvious that it matters if one loops through the columns or the rows of a matrix first. (It's better to loop through the matrix so that elements that are stored at adjacent locations are read successively.)
Basic advice on the topic or pointers to articles are most welcome.
Answers
Got answers with lots of great pointers, a lot more than I'll ever have time to read. Here's a list of all of them:
I'll need a bit of skim time to decide which one to use (not having time for all).
Drepper's What Every Programmer Should Know About Memory [pdf] is a good reference to one aspect of low-level optimisation.