So from my understanding of delay slots, they occur when a branch instruction is called and the next instruction following the branch also gets loaded from memory. What is the point of this? Wouldn't you expect the code after a branch not to run in case the branch is taken? Is it to save time in case the branch isnt taken?
I am looking at a pipeline diagram and it seems the instruction after branch is getting carried out anyway..
Most processors these days use pipelines. The ideas and problems from the H&P book(s) are used everywhere. At the time of those original writings, I would assume the actual hardware matched that particular notion of a pipeline. fetch, decode, execute, write back.
Basically a pipeline is an assembly line, with four main stages in the line, so you have at most four instructions be worked on at once. Which confuses the notion of how many clocks does it take to execute an instruction, well it takes more than one clock, but if you have some/many executing in parallel then the "average" can approach or exceed one per clock.
When you take a branch though the assembly line fails. The instructions in the fetch and decode stage have to be tossed, and you have to start filling again, so you take a hit of a few clocks to fetch, decode, then back to executing. The idea of the branch shadow or delay slot is to recover one of those clocks. If you declare that the instruction after a branch is always executed then when a branch is taken the instruction in the decode slot also gets executed, the instruction in the fetch slot is discarded and you have one hole of time not two. So instead of execute, empty, empty, execute, execute you now have execute, execute, empty, execute, execute... in the execute stage of the pipeline. The branch is 50% less painful, your overall average execution speed improves, etc.
ARM does not have a delay slot, but it gives the illusion of a pipeline as well, by declaring that the program counter is two instructions ahead. Any operation that relies on the program counter (pc-relative addressing) must compute the offset using a pc that is two instructions ahead, for ARM instructions this is 8 bytes for original thumb 4 bytes and when you add in thumb2 instructions it gets messy.
These are illusions at this point outside academics, the pipelines are deeper, have lots of tricks, etc, in order for legacy code to keep working, and/or not having to re-define how instructions work for each architecture change (imagine mips rev x, 1 delay slot, rev y 2 delay slots, rev z 3 slots if condition a and 2 slots if condition b and 1 slot if condition c) the processor goes ahead and executes the first instruction after a branch, and discards the other handful or dozen after as it re-fills the pipe. How deep the pipes really are is often not shared with the public.
I saw a comment about this being a RISC thing, it may have started there but CISC processors use the same exact tricks, just giving the illusion of the legacy instruction set, at times the CISC processor is no more than a RISC or VLIW core with a wrapper to emulate the legacy CISC instruction set (microcoded).
Watch the how its made show. Visualize an assembly line, each step in the line has a task. What if one step in the line ran out of blue whatsits, and to make the blue and yellow product you need the blue whatsits. And you cant get new blue whatsits for another week because someone screwed up. So you have to stop the line, change the supplies to each stage, and make the red and green product for a while, which normally could have been properly phased in without dumping the line. That is like what happens with a branch, somewhere deep in the assembly line, something causes the line to have to change, dump the line. the delay slot is a way to recover one product from having to be discarded in the line. Instead of N products coming out before the line stopped, N+1 products came out per production run. Execution of code is like bursts of production runs, you often get short, sometimes long, linear execution paths before hitting a branch to go to another short execution path, branch another short execution path...