I understand that the processor brings data into the cache via cache lines, which, for instance on my Atom processor, bring in 64 bytes at a time, whatever the size of the actual data being read.
My question is:
Imagine that you need to read one byte from memory; which 64 bytes will be brought into the cache?
The two possibilities I can see are that either the 64 bytes start at the closest 64-byte boundary below the byte of interest, or the 64 bytes are spread around the byte in some predetermined way (for instance, half below, half above, or all above).
Which is it?
If the cache line containing the byte or word you're loading is not already present in the cache, your CPU will request the 64 bytes that begin at the cache line boundary (the largest address at or below the one you need that is a multiple of 64).
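As a rough illustration (assuming a 64-byte line size, which is not guaranteed on every CPU), the line's base address is just the requested address with the low 6 bits cleared:

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64  /* assumed line size; query the CPU if you need the real value */

int main(void) {
    int data[100];
    uintptr_t addr = (uintptr_t)&data[37];                 /* address of the byte we want   */
    uintptr_t line = addr & ~(uintptr_t)(CACHE_LINE - 1);  /* round down to a multiple of 64 */
    printf("byte at %p -> cache line starting at %p\n",
           (void *)addr, (void *)line);
    return 0;
}
```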
Modern PC memory modules transfer 64 bits (8 bytes) at a time, in a burst of eight transfers, so one command triggers a read or write of a full cache line from memory. (DDR1/2/3/4 SDRAM burst transfer size is configurable up to 64B; CPUs select a burst size to match their cache line size, but 64B is common.)
As a rule of thumb, if the processor can't predict a memory access (and prefetch it), the retrieval process can take ~90 nanoseconds, or ~250 clock cycles (from the CPU knowing the address to the CPU receiving data).
By contrast, a hit in L1 cache has a load-use latency of 3 or 4 cycles, and a store-reload has a store-forwarding latency of 4 or 5 cycles on modern x86 CPUs. Things are similar on other architectures.
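A minimal sketch of how one might observe that gap (assuming x86-like timings; a real benchmark would need to pin the thread, control CPU frequency, and average many runs). It chases a chain of dependent pointers through a random single cycle, so neither out-of-order execution nor the hardware prefetcher can hide the DRAM latency:

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 24)   /* 16M pointers (128 MB), far larger than any cache */

/* Each load depends on the previous one, so latencies add up serially. */
static double chase_ns(const size_t *next, size_t steps) {
    struct timespec t0, t1;
    size_t i = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        i = next[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (i == (size_t)-1) puts("");   /* keep i live so the loop isn't optimized away */
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    /* Sattolo's algorithm: builds one random cycle visiting every slot exactly once. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;               /* note: j < i, never j == i */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    double ns = chase_ns(next, N);
    printf("~%.0f ns per dependent load (expect something on the order of DRAM latency)\n",
           ns / (double)N);
    free(next);
    return 0;
}
```

Replacing the random cycle with a simple sequential walk over the same array drops the per-load cost dramatically, because the hardware prefetcher can forecast the next line and fetch it ahead of time.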
Further reading: Ulrich Drepper's What Every Programmer Should Know About Memory. The software-prefetch advice is a bit outdated: modern HW prefetchers are smarter, and hyperthreading is way better than in P4 days (so a prefetch thread is typically a waste). Also, the x86 tag wiki has lots of performance links for that architecture.