Climbing the stack allocator

A good use of the memory is often key of performance. This is even more true with C or C++ where you have to do your own memory management1. When you work on a game, it becomes a major concern. Massively allocating memory inside the game loop will cause your game to slow down just as your frame rate. That’s why this article proposes an overview of the memory allocation universe, describing a simple way to allocate temporary memory in real time applications without almost any disadvantage. We are speaking about the stack allocator.

The main memory

Before speaking about the stack allocator, here are some reminders. The memory can be represented as a large array where each element has its own address and is 1 byte long. By using contiguous memory, where all blocks are next to each other, you can store data longer than 1 byte. If you have 4 contiguous bytes, you can store 1 int on a 32 bit system. This is one of the older allocation scheme.

Memory representation

If you’re allocating an array of 10 ints with the C++ operator new[],

you’re asking the system (assuming a 32 bits system) to find in the memory at least2 40 contiguous bytes (10 * sizeof(int)) and to return the address of it. This kind of allocation is called dynamic allocation. A dynamically allocated storage space is located on the heap and if you’re not using smart pointers, you will have to release the memory, manually. The heap is wide, you can easily allocate several megabytes depending on your system and your hardware. At the opposite, you have the stack allocation.

It’s the “regular” way to declare variables and instantiate objects. There’s no need to worry about releasing the memory because this is done by the compiler using the scopes. When the code exits the scope of a variable, the variable is destroyed. The stack is much smaller than the heap, too many allocations on the stack will result into the well-known stack overflow (often with recursive functions).

Caches

One of the major advantage of contiguous memory allocation is the cache optimization. A cache is a physical location close to the processing unit where data can be temporary stored. Caches are not only closer but also a lot faster than the main memory. They are many different caches in a CPU but we will only talk about the concept behind caches. Actually the CPU doesn’t deal directly with the values from the main memory but with temporary copies stored into caches.

  • If the CPU fetches a data from the main memory, it’s a read
  • If the CPU write something back into the main memory, it’s a write.

Let’s take the following example :

The code is simple, the value of c is the sum of a and b. The values of a and b are stored on the stack. Assuming your program is called Cache.exe and compiled in debug with g++, you can easily disassemble the program with the command :

If you look at the main function, you will find something like this :

And if you only keep the essential :

As you can see, the CPU needs to move the values from the main memory to the registers, that are the closest location to the processing unit. Once the computation done, the CPU write the values back into the memory. We’re speaking about memory and registers but where is the cache we were previously talking about … ? Even if the CPU is really fast, it would be a waste to access the memory each time a value is required because the CPU would spend all its time to wait… CPU scale speaking (nanometer), the main memory is very far. To avoid this overhead, when the CPU needs a value that is in the memory and not in the cache, it takes this opportunity to fetch not only the value it needs but also a bit of the contiguous memory right after or before the value. The following image explains this.

  • In the first scenario (no caching):
    • The CPU needs a and a is not in the cache, this is a cache miss. The CPU fetches a.
    • The CPU needs b and b is not in the cache, another cache miss. The CPU fetches b.
  • In the second scenario (with caching) :
    • The CPU needs a and a is not in the cache, this is a cache miss.
    • The CPU fetches a and some of the memory after a (including b).
    • The CPU needs b and b is in the cache. This is a cache hit. No extra fetch required.
Memory fetching

When a cache miss occurs, the CPU must fetch the missing value from the main memory and then copy it in the cache. When that happens, the CPU is waiting for the value and your process is no longer executed. A CPU stall may happen (CPU cycles are wasted). When a cache hit occurs, the CPU directly copies the cached value into registers. Fast and efficient. If you’re processing fragmented data across the memory (a lot of small heap allocations), it will pollute the cache with unnecessary data and result into many cache misses.

The stack allocator

A stack allocator is a stacked based allocator. You allocate a big chunk of contiguous memory and you distribute addresses or identifiers3 on specific part of this memory. The only rule is that data are added or removed in a last-in-first-out (LIFO) manner. It is important to realize that memory cannot be freed in a arbitrary order. You have to roll back all previous allocations. Not really a problem for a frame allocator which is one of the possible way to use a stack allocator.

In video games, a frame allocator is a stack allocator that we reset each frame. It’s a temporary allocation that will be lost at the end of the frame (or at the beginning of the next one). To implement this kind of allocator, we need a least a class that will allocate a fixed amount of memory, a method to allocate in that memory and a method to clear the allocator (reset the current top of the stack). The header should looks like the following code :

The class manages 3 member variables :

  • The size, this is the amount of allocated memory in bytes
  • The head, this is the current top of the stack. It is represented with an offset to the base
  • The data, this is the pointer on the internal raw memory

Note that we don’t need a constructor that’s why the default constructor is used. The first method to implement is the destructor. The destructor is pretty straight forward. It calls the Release() method to ensure that memory is properly freed. This is the RAII (Resource Acquisition is Initialization) idiom.

To initialize the allocator :

First we check if the size parameter is valid. If not, we throw a bad_alloc exception. (For simplicity, this is the only exception used by this allocator). Because we can call Initialize several times, we have to call Release() before any memory allocation to avoid memory leak. The head is initialized to 0 because there is no previous allocation. We allocate the memory as an array of unsigned bytes and we check if the pointer is valid. In the release method, we simply release the memory :

To allocate memory with the allocator, we have to return the address of the current stack top. After having computed the pointer address, we can increment the current top with the allocated byte amount.

Stack Allocator memory

At the beginning, the head (the lowest address) is equal to 0 and all the memory, starting from mp_data to mp_data + m_size, is free. When an allocation of N bytes occurs, we consider the chunk of memory starting from mp_data + m_head to mp_data + m_head + N as allocated. That’s why we return the address mp_data + m_head that correspond to the allocated block and we increment the head of N bytes to make this block “allocated” and to avoid that the next allocation overlaps the previous one. m_head is actually the variable that controls the allocations. If m_head is equal to 0, all the memory is free, if m_head is equal to m_size, all the memory is allocated.

Finally, to clear the allocator, we reset the head to 0 :

The allocator is now ready to use :

If you need an undefined amount of memory for one frame, the is definitely the allocator to use. With a frame allocator, the allocation is almost instantaneous but you have to remember to not hold the pointer longer than a frame since the allocator is cleared each time. The frame allocator and more generally the stack allocator is only one of all possible allocators. The most famous of them is the pool allocator. I hope I will have time to make a tutorial about object pooling to optimize games and introduce the pool allocator.

This stack allocator can be improved in many way starting with a better handling of exceptions, a way to set up the max size and introduce identifiers to avoid the use of raw pointers. It could be also interesting to care about the memory alignment.

You can find the complete source code here on my GitHub in the repository Articles.

Thanks for reading !


1 : In C++ you should use smart pointers but this is what this article is about
2 : The true amount of allocated bytes depend of the system allocation algorithm
3 : Identifiers could be used if with reallocation

 

Leave a Reply

Your email address will not be published. Required fields are marked *

Copyright © 2018 Vincent STEHLY--CALISTO. All Rights Reserved.

Up ↑