So I am learning MSIL right now to learn to debug my C# .NET applications.
I've always wondered: what is the purpose of the stack?
Just to put my question in context:
Why is there a transfer from memory to stack or "loading?"
On the other hand, why is there a transfer from stack to memory or "storing"?
Why not just have them all placed in the memory?
UPDATE: I liked this question so much I made it the subject of my blog on November 18th 2011. Thanks for the great question!
I've always wondered: what is the purpose of the stack?
I assume you mean the evaluation stack of the MSIL language, and not the actual per-thread stack at runtime.
Why is there a transfer from memory to stack or "loading?" On the other hand, why is there a transfer from stack to memory or "storing"? Why not just have them all placed in the memory?
MSIL is a "virtual machine" language. Compilers like the C# compiler generate CIL, and then at runtime another compiler called the JIT (Just In Time) compiler turns the IL into actual machine code that can execute.
So first let's answer the question "why have MSIL at all?" Why not just have the C# compiler write out machine code?
Because it is cheaper to do it this way. Suppose we didn't do it that way; suppose each language has to have its own machine code generator. You have twenty different languages: C#, JScript .NET, Visual Basic, IronPython, F#... And suppose you have ten different processors. How many code generators do you have to write? 20 x 10 = 200 code generators. That's a lot of work. Now suppose you want to add a new processor. You have to write the code generator for it twenty times, one for each language.
Furthermore, it is difficult and dangerous work. Writing efficient code generators for chips that you are not an expert on is a hard job! Compiler designers are experts on the semantic analysis of their language, not on efficient register allocation of new chip sets.
Now suppose we do it the CIL way. How many CIL generators do you have to write? One per language. How many JIT compilers do you have to write? One per processor. Total: 20 + 10 = 30 code generators. Moreover, the language-to-CIL generator is easy to write because CIL is a simple language, and the CIL-to-machine-code generator is also easy to write because CIL is a simple language. We get rid of all of the intricacies of C# and VB and whatnot and "lower" everything to a simple language that is easy to write a jitter for.
Having an intermediate language lowers the cost of producing a new language compiler dramatically. It also lowers the cost of supporting a new chip dramatically. You want to support a new chip, you find some experts on that chip and have them write an CIL jitter and you're done; you then support all those languages on your chip.
OK, so we've established why we have MSIL; because having an intermediate language lowers costs. Why then is the language a "stack machine"?
Because stack machines are conceptually very simple for language compiler writers to deal with. Stacks are a simple, easily understood mechanism for describing computations. Stack machines are also conceptually very easy for JIT compiler writers to deal with. Using a stack is a simplifying abstraction, and therefore again, it lowers our costs.
You ask "why have a stack at all?" Why not just do everything directly out of memory? Well, let's think about that. Suppose you want to generate CIL code for:
int x = A() + B() + C() + 10;
Suppose we have the convention that "add", "call", "store" and so on always take their arguments off the stack and put their result (if there is one) on the stack. To generate CIL code for this C# we just say something like:
load the address of x // The stack now contains address of x
call A() // The stack contains address of x and result of A()
call B() // Address of x, result of A(), result of B()
add // Address of x, result of A() + B()
call C() // Address of x, result of A() + B(), result of C()
add // Address of x, result of A() + B() + C()
load 10 // Address of x, result of A() + B() + C(), 10
add // Address of x, result of A() + B() + C() + 10
store in address // The result is now stored in x, and the stack is empty.
Now suppose we did it without a stack. We'll do it your way, where every opcode takes the addresses of its operands and the address to which it stores its result:
Allocate temporary store T1 for result of A()
Call A() with the address of T1
Allocate temporary store T2 for result of B()
Call B() with the address of T2
Allocate temporary store T3 for the result of the first addition
Add contents of T1 to T2, then store the result into the address of T3
Allocate temporary store T4 for the result of C()
Call C() with the address of T4
Allocate temporary store T5 for result of the second addition
...
You see how this goes? Our code is getting huge because we have to explicitly allocate all the temporary storage that would normally by convention just go on the stack. Worse, our opcodes themselves are all getting enormous because they all now have to take as an argument the address that they're going to write their result into, and the address of each operand. An "add" instruction that knows that it is going to take two things off the stack and put one thing on can be a single byte. An add instruction that takes two operand addresses and a result address is going to be enormous.
We use stack-based opcodes because stacks solve the common problem. Namely: I want to allocate some temporary storage, use it very soon and then get rid of it quickly when I'm done. By making the assumption that we have a stack at our disposal we can make the opcodes very small and the code very terse.
UPDATE: Some additional thoughts
Incidentally, this idea of drastically lowering costs by (1) specifing a virtual machine, (2) writing compilers that target the VM language, and (3) writing implementations of the VM on a variety of hardware, is not a new idea at all. It did not originate with MSIL, LLVM, Java bytecode, or any other modern infrastructures. The earliest implementation of this strategy I'm aware of is the pcode machine from 1966.
The first I personally heard of this concept was when I learned how the Infocom implementors managed to get Zork running on so many different machines so well. They specified a virtual machine called the Z-machine and then made Z-machine emulators for all the hardware they wanted to run their games on. This had the added enormous benefit that they could implement virtual memory management on primitive 8-bit systems; a game could be larger than would fit into memory because they could just page the code in from disk when they needed it and discard it when they needed to load new code.