madhadron

What are the stack and the heap?

Status: Done
Confidence: Very likely

This evening someone asked, “what are the stack and the heap?” This isn’t the first time I’ve taught someone this, and I think I’ve found a decent way to teach it, so it’s probably time to write it down.

Let’s go all the way back to the begining. We’re going to do this in something vaguely like really old BASIC. Consider the following program:

10 PRINT "HELLO"
20 GOTO 10

If you haven’t seen old BASIC before, this is read and run one line at a time. The number at the beginning of the line is a way of labelling the lines (called, imaginatively, the “line number”). The exact numbers don’t matter so long as the number of each line is greater than the number of the line before it. The GOTO N statement jumps so that the program runs line N next and continues from line N. This program prints HELLO over and over until we stop it.

In modern languages we can write functions that we can call over and over again. Let’s try to put together something like that in this minimal substrate. First, let’s write a hunk of code that we can jump to, run the lines in it, and jump back to where we came from. We can put it at a fixed line number, but we need to know where to jump back, so we will define a variable RETURN_ADDRESS where we will store the line number to jump back to.

10 DIM RETURN_ADDRESS

100 LET RETURN_ADDRESS = 120
110 GOTO 1000
120 PRINT " ALICE"
130 NEWLINE
140 LET RETURN_ADDRESS = 160
150 GOTO 1000
160 PRINT " BOB"
170 NEWLINE
180 EXIT

1000 PRINT "HELLO"
1010 GOTO RETURN_ADDRESS

The statment DIM just declares a variable with the given name. This program run lines 10, 100, 110, then jumps to 1000. At 1000, it prints HELLO, then at 1010 it looks up the number in RETURN_ADDRESS (set at line 100 to 120) and jumps to that line. At 120 it prints ALICE and then a newline. Then does the same thing, but this time having line 1010 jump back to line 160 to print HELLO BOB then exits.

We can reuse the hunk of code at line 1000, but what happens if we want to call another hunk from that hunk? If both hunks rely on GOTO RETURN_ADDRESS, then we will have something like

10 DIM RETURN_ADDRESS

100 LET RETURN_ADDRESS = 120
110 GOTO 1000
120 EXIT

1000 RETURN_ADDRESS = 1010
1010 GOTO RETURN_ADDRESS

2000 PRINT "HELLO"
2010 GOTO RETURN_ADDRESS

Which ends up infinitely looping at line 1010. We could have separate variable for the return address for each hunk, but we would like to be able to write recursive hunks eventually, and we can’t use separate variables for that. Instead, we’re going to introduce a stack.

Note. If you don’t know what a stack is, think of it like a stack of cards. You can write a number on a card and put it on the top of the stack, or take a card off the top of the stack and read the number on it. Putting a card on top is calling pushing onto the stack and taking a card off is called popping it from the stack.

With a stack for return addresses, our previous program will work.

10 DIM RETURN_STACK

100 LET RETURN_STACK = STACK()
110 RETURN_STACK.PUSH(130)
120 GOTO 1000
130 EXIT

1000 PRINT "HELLO "
1010 RETURN_STACK.PUSH(1030)
1020 GOTO 2000
1030 GOTO RETURN_STACK.POP()

2000 PRINT "WORLD"
2010 NEWLINE
2020 GOTO RETURN_STACK.POP()

Again, we run 10, then 100, then at 120 we jump to 1000 and print HELLO. We push another address onto the RETURN_STACK, and jump to 2000. At 2000 we print WORLD, and then pop the top address off RETURN_STACK and jump to it (in this case 1030). At 1030 we do the same thing, but this time we get 130. At 130 we exit.

Next we’re going to pass behavior to hunks.

10 DIM RETURN_STACK
20 DIM FIRST_NAME
30 DIM LAST_NAME
40 DIM FULL_NAME

100 LET RETURN_STACK = STACK()
110 LET FIRST_NAME = "ALICE"
120 LET LAST_NAME = "SMITH"
130 LET FULL_NAME = "EDDIE THE BABOON"
140 RETURN_STACK.PUSH(160)
150 GOTO 1000
160 PRINT "HELLO " + FULL_NAME
170 NEWLINE
180 EXIT

1000 LET FULL_NAME = FIRST_NAME + " " + LAST_NAME
1010 PRINT FULL_NAME
1020 NEWLINE
1030 GOTO RETURN_STACK.POP()

We would like this program to print

HELLO ALICE SMITH
HELLO EDDIE THE BABOON

But FULL_NAME gets clobbered on line 1000. This isn’t exactly wrong, but it is certainly very inconvenient, especially when programs get large. We need a way to create a scratch pad for each hunk we invoke so we don’t get this kind of collision. We could explore possibilities, but having already gone through the exercise of arriving at a stack for return values, let’s just do the same thing here. This time we will make a stack of dictionaries (or maps or objects)…a set of key-value pairs.

10 DIM RETURN_STACK
20 DIM DATA_STACK
30 DIM THIS_STATE
40 DIM FIRST_NAME
50 DIM LAST_NAME
60 DIM FULL_NAME

100 LET RETURN_STACK = STACK()
110 LET DATA_STACK = STACK()
120 LET FIRST_NAME = "ALICE"
130 LET LAST_NAME = "SMITH"
140 LET FULL_NAME = "EDDIE THE BABOON"

200 PRINT "HELLO "
210 RETURN_STACK.PUSH(230)
220 GOTO 1000
230 PRINT "HELLO " + FULL_NAME
240 NEWLINE
250 EXIT

1000 DATA_STACK.PUSH(THIS_STATE)
1010 LET THIS_STATE = {}
1020 LET DATA_STACK["FULL_NAME"] = FIRST_NAME + " " + LAST_NAME
1030 PRINT FULL_NAME
1040 NEWLINE
1050 LET THIS_STATE = DATA_STACK.POP()
1060 GOTO RETURN_STACK.POP()

Now we get the output we expect. If we make every hunk of code push whatever THIS_STATE was to DATA_STACK and then set it to an empty dictionary, only set values in that dictionary, then restore whatever it was from the stack at the end, we will not clobber anything outside. This stack of states is what people mean when they speak of allocating memory on “the stack.”

If that is the stack, then what is the heap? The heap appears when we need to work with data that is not lost when we return from a hunk of code. For example, say we want to append messages to various logs as our program runs. We need a place to write those values so we can find them. Here is a program that has three hunks of code to create a log under a given name, append to it, and finally print and delete it.

10 DIM RETURN_STACK
20 DIM DATA_STACK
30 DIM HEAP
40 DIM LOG_NAME
50 DIM LOG_LINE

60 LET RETURN_STACK = STACK()
70 LET DATA_STACK = STACK()
80 LET HEAP = {}
100 LOG_NAME = "DEBUG"
110 RETURN_STACK.PUSH(130)
120 GOTO 1000
130 LOG_NAME = "AUDIT"
140 RETURN_STACK.PUSH(160)
150 GOTO 1000
160 LOG_NAME = "DEBUG"
170 LOG_LINE = "DOING SOMETHING 1"
180 RETURN_STACK.PUSH(200)
190 GOTO 2000
200 LOG_LINE = "DOING SOMETHING 2"
210 RETURN_STACK.PUSH(230)
220 GOTO 2000
230 LOG_NAME = "AUDIT"
240 LOG_LINE = "ACCESS RECORDED"
250 RETURN_STACK.PUSH(270)
260 GOTO 2000
270 RETURN_STACK.PUSH(290)
280 GOTO 3000
290 LOG_NAME = "DEBUG"
300 RETURN_STACK.PUSH(320)
310 GOTO 3000
320 EXIT

# Allocate a structure
1000 HEAP[LOG_NAME] = []
1010 GOTO RETURN_STACK.POP()

# Append to log
2000 DATA_STACK.PUSH(THIS_STATE)
2010 LET THIS_STATE = {}
2020 LET THIS_STATE["MESSAGE"] = TIMESTAMP() + " " + LOG_LINE
2030 HEAPS[LOG_NAME].APPEND(THIS_STATE["MESSAGE"])
2040 THIS_STATE = DATA_STACK.POP()
2050 GOTO RETURN_STACK.POP()

# Print and deallocate
3000 PRINT LOG_NAME
3010 NEWLINE
3020 PRINT HEAPS[LOG_NAME]
3030 DELETE HEAP[LOG_NAME]
3040 GOTO RETURN_STACK.POP()

This program first allocates two logs on the heap (the jumps to 1000), then appends several log lines (the jumps to 2000), and finally prints and deletes them (jumps to 3000).

It’s clear why we would refer to “the stack” by that name (it’s a stack of scratch pads). Why is “the heap” so called? It comes from how this variable we have called HEAP is actually implemented. In our program we have assumed that we have dictionaries, but that is already a very sophisticated data structure. In practice we actually begin with a big block of memory. When we allocate a variable on the heap, we carve a piece of the right size off the beginning of that block. As we allocate more, we pull more off the block. When we delete variables, we want to be able to reuse the space, so the free space is often tracked in a data structure called a heap, where the first block that is big enough for a variable will get reused. Modern allocators are often far more sophisticated, but we persist in calling the memory that is being managed this way “the heap.”

So: “the stack” refers to memory used by a hunk of code’s local variables, discarded as soon as we return. “The heap” refers to memory managed separately from jumping from and returning to code.