Does memory alignment matter today?

The concept of "memory alignment" is important in programming and is frequently said to have a substantial impact on program performance. While diving into the famous CSAPP (Computer Systems: A Programmer's Perspective), I found that it provides a comprehensive introduction to memory alignment, including the method for calculating the size in bytes of a struct.

Thanks to the compiler, I have never had to take "memory alignment" into consideration in everyday programming; in other words, the concept is transparent to most programmers. Honestly, all my understanding of memory alignment is hearsay (I suspect many people are in the same boat :). But hearsay alone never adds up to thorough knowledge.

Inspired by Data alignment: Straighten up and fly right, Data alignment for speed: myth or reality?, Aligned vs. unaligned memory access and many other posts, I conducted some experiments to investigate the impact of memory alignment on program performance.

Problem definition

Simply put, memory alignment means storing an object at an address that is an integer multiple of its size or of the CPU's word size (the real rules are more involved, for example the alignment of struct types).

In computing, a word is the natural unit of data used by a particular processor design. When the processor reads from the memory subsystem into a register or writes a register's value to memory, the amount of data transferred is often a word. In most circumstances, the word size of a 32-bit CPU is 32 bits, and the word size of a 64-bit CPU is 64 bits.
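
As a concrete illustration of these rules, Go's unsafe package reports the alignment, offset, and size the compiler actually uses. The struct below is a made-up example, and the commented values assume a 64-bit (amd64) platform:

```go
package main

import (
	"fmt"
	"unsafe"
)

// example is a made-up struct, used only to show how the compiler inserts
// padding so that every field sits at an offset aligned to its own size.
type example struct {
	a int8  // offset 0
	b int64 // offset 8 on amd64: 7 padding bytes are inserted after a
	c int16 // offset 16
	// 6 trailing padding bytes round the struct size up to a multiple of 8
}

func main() {
	var e example
	fmt.Println(unsafe.Alignof(int32(0))) // 4: an int32 is stored at a multiple of 4
	fmt.Println(unsafe.Alignof(int64(0))) // 8 on amd64
	fmt.Println(unsafe.Offsetof(e.b))     // 8, not 1, because of the padding
	fmt.Println(unsafe.Sizeof(e))         // 24, not 1+8+2 = 11
}
```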

Accessing an object at a misaligned address often forces the CPU to read multiple words to fetch the data we need. Figures 1 and 2 demonstrate the bad case: when we read an eight-byte value (perhaps a 64-bit integer, like the long long type in C++ or the int64 type in Go) at address 0, the CPU only needs to read one word, but it has to fetch an extra word when the same eight-byte value is stored at address 4 (assuming the word size of this CPU is 8 bytes). Worse still, the CPU has to shift out the unwanted bytes and merge the remaining bytes of the two words. Figures 3 and 4 present the scenarios of reading a four-byte value at an aligned and a misaligned address.

Figure 1. reading an eight-byte value at the aligned address 0

Figure 2. reading an eight-byte value at the misaligned address 4


Figure 3. reading a four-byte value at the aligned address 4

Figure 4. reading a four-byte value at the misaligned address 7
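
The pattern in Figures 1 to 4 boils down to a one-line check: an access of size bytes at addr fits in a single word exactly when the offset within the word plus the size does not exceed the word size. A minimal Go sketch, assuming the 8-byte word of the figures:

```go
package main

import "fmt"

// crossesWord reports whether reading size bytes at addr spans two
// wordSize-byte words, i.e. whether the CPU in Figures 1-4 would need
// a second memory read.
func crossesWord(addr, size, wordSize uintptr) bool {
	return addr%wordSize+size > wordSize
}

func main() {
	const wordSize = 8 // assumed 8-byte (64-bit) words, as in the figures
	fmt.Println(crossesWord(0, 8, wordSize)) // false: Figure 1
	fmt.Println(crossesWord(4, 8, wordSize)) // true:  Figure 2
	fmt.Println(crossesWord(4, 4, wordSize)) // false: Figure 3
	fmt.Println(crossesWord(7, 4, wordSize)) // true:  Figure 4
}
```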

We can boldly speculate that accessing data at misaligned addresses will degrade program performance. But everything above has been inferred from our knowledge and experience! Experiments will tell us the truth.

Environment

  • OS: Windows 11 64-bit
  • CPU: Intel(R) Core(TM) i5-10600KF CPU @ 4.10GHz
  • Programming languages: Go 1.18 and Python 3.8.11 (for visualizing experimental results)

Implementation details

The experiments read sizeof(T) bytes at addr, addr+256, addr+256+256, and so on in a large enough buffer, until a total of N bytes has been read (a Go sketch of the measurement loop follows this list):

  • the first address addr is aligned to 0, 1, 2, 3, ..., 255, and so are the following addresses addr+256, addr+256+256, etc., by a simple application of congruence: $addr \equiv (addr + 256) \pmod{256}$.

  • T refers to int8, int16, int32 and int64.

  • N is set to 16MB.

  • the reason for skipping 256 bytes between reads is to ensure, as far as possible, that the data we read is not yet in the CPU cache (the cache line size of the Intel(R) Core(TM) i5-10600KF CPU is 64 bytes). For more information, see How do cache lines work?.

  • each experiment is repeated 10 times; the maximum and minimum running times are discarded and the average of the rest is taken as the final result.
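
The original benchmark code is not listed in this post, so here is a minimal Go sketch of how such a measurement could look for int32, under the assumptions above (256-byte stride, 16MB read in total, raw pointer reads via unsafe). The function readInt32Aligned and the buffer handling are illustrative, not the original implementation:

```go
package main

import (
	"fmt"
	"time"
	"unsafe"
)

const (
	stride = 256              // skip 256 bytes between reads to stay out of the cache
	total  = 16 * 1024 * 1024 // N: read 16MB worth of int32 values in total
)

// readInt32Aligned reads total bytes worth of int32 values, one every stride
// bytes, starting at the first address in buf that is congruent to align
// modulo stride. It returns the elapsed time plus a checksum so that the
// compiler cannot optimize the loads away.
func readInt32Aligned(buf []byte, align uintptr) (time.Duration, int32) {
	base := uintptr(unsafe.Pointer(&buf[0]))
	// first offset whose absolute address satisfies addr % stride == align
	off := (stride + align - base%stride) % stride

	var sum int32
	start := time.Now()
	for read := 0; read < total; read += 4 { // 4 = sizeof(int32)
		sum += *(*int32)(unsafe.Pointer(&buf[off])) // possibly misaligned load
		off += stride
	}
	return time.Since(start), sum
}

func main() {
	// total/4 reads spaced stride bytes apart need roughly a 1 GiB buffer
	buf := make([]byte, total/4*stride+stride)
	for align := uintptr(0); align < stride; align++ {
		elapsed, _ := readInt32Aligned(buf, align)
		fmt.Printf("alignment %3d: %v\n", align, elapsed)
	}
}
```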

Figure 5 shows an example of this experiment. Type T is int32, which occupies 4 bytes in Go. First, addr is aligned to 2, that is, addr % 256 == 2. To fetch an int32 at addr (and addr+256, addr+256+256, etc.), the CPU only needs to read one word. We denote the time it takes to read 16MB in this case as t1. But when addr is aligned to 5, something bad happens: the CPU has to do more work because each int32 crosses two words. The time it takes to read 16MB of data in this case is denoted as t2. According to our analysis t2 should be greater than t1, but is that really so? Let's test the conjecture against the experimental results.

Figure 5. a demonstration of the experiment

Experiment results

Figure 6 depicts the time taken by the program to read all 16MB of int32 data when addr is aligned to 0 through 255. The vertical dashed lines mark the alignments that force the CPU to read two words to fetch an int32. Figure 7 is a close-up of part of Figure 6. These two figures clearly show that when addr is aligned to the addresses marked by the vertical dashed lines, the program takes more time to read all the data than when addr is aligned elsewhere.

It is also worth noting that when the alignment is 61, 62, 63 (near 64), 125, 126, 127 (near 128), 189, 190, 191 (near 192), 253, 254, 255 (near 256) and so on, the program takes the most time; we call these cross-cache-line alignments. The same phenomenon appears in the int16 and int64 experiments as well. In short, a combination of memory misalignment and the CPU cache mechanism causes this behavior:

If the cache line containing the byte or word you're loading is not already present in the cache, your CPU will request the 64 bytes that begin at the cache line boundary (the largest address below the one you need that is multiple of 64).

A cross-cache-line alignment forces the CPU to request an extra cache line, and cache replacement may also occur! Aligned vs. unaligned memory access explains this behavior as well.
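
To see why exactly those alignments stand out, we can enumerate, for every residue 0 to 255, whether an int32 read crosses an 8-byte word boundary and whether it crosses a 64-byte cache line; the two sizes below are the ones assumed throughout this post:

```go
package main

import "fmt"

func main() {
	const (
		size      = 4  // sizeof(int32)
		wordSize  = 8  // assumed word size, as in the figures above
		cacheLine = 64 // cache line size of the i5-10600KF
	)
	for a := 0; a < 256; a++ {
		crossWord := a%wordSize+size > wordSize
		crossLine := a%cacheLine+size > cacheLine
		if crossLine {
			// 61, 62, 63, 125, 126, 127, 189, 190, 191, 253, 254, 255:
			// these reads touch two cache lines and are the slowest.
			fmt.Printf("%3d crosses a cache line (and a word)\n", a)
		} else if crossWord {
			// 5, 6, 7, 13, 14, 15, ...: two words, but only one cache line.
			fmt.Printf("%3d crosses a word boundary\n", a)
		}
	}
}
```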

Figure 6. experiment results of int32

Figure 7. a close-up of Figure 6

As can be seen from Figures 8 and 9, which show the results for int16, the program takes more time to read the data when addr satisfies addr % 8 == 7, that is, when an int16 at such an addr crosses two words.

Figure 8. experiment results of int16

Figure 9. a close-up of Figure 8

The int64 alignments marked by the vertical dashed lines in Figure 10 are the ones the CPU is happy with. When int64 data sits at these alignments, the CPU can work as in Figure 1 of the Problem definition section. The remaining alignments lead to the situation shown in Figure 2.

Figure 10. experiment results of int64

Not just performance

The original 68000 was a processor with two-byte granularity and lacked the circuitry to cope with unaligned addresses. When presented with such an address, the processor would throw an exception. The original Mac OS didn't take very kindly to this exception, and would usually demand the user restart the machine. Ouch.

Later processors in the 680x0 series, such as the 68020, lifted this restriction and performed the necessary work for you. This explains why some old software that works on the 68020 crashes on the 68000. It also explains why, way back when, some old Mac coders initialized pointers with odd addresses. On the original Mac, if the pointer was accessed without being reassigned to a valid address, the Mac would immediately drop into the debugger. Often they could then examine the calling chain stack and figure out where the mistake was.

Catching wild pointers by exploiting the CPU's inability to handle accesses at misaligned addresses is so cool :)

Conclusion

Memory alignment matters today!

Yeah! I do and I understand.