Software Engineering
memory caching cpu unicode x86
Updated Mon, 13 Jun 2022 15:06:49 GMT

How to align on both word size and cache lines in x86

From what it sounds like, a 64-bit processor means aligning to 64 bits, which would mean that if you stored UTF-8 text, each 8-bit code unit would take up 64 bits of space. That doesn't really make sense, so I think I need to do a bit more work to understand exactly how cache alignment works.

But thanks to this great answer to Purpose of memory alignment I see how alignment is beneficial, so I would like to learn what it would mean in practice to implement alignment to a word and to a cache line, together.

For example, take unicode encoded in utf-8. If you were to store that in memory and access it most efficiently in terms of word alignment and cache line alignment, I'm wondering what that would look like / mean.

Some examples are:

  1. Accessing individual characters.
  2. Accessing large blocks of text.

From what it sounds like, on a 64-bit machine you would do something like this (I'm still a bit confused about how to apply it, which is the reason for the question), where the letters are ASCII characters (ASCII being the simple case of UTF-8):

a b a b a b ...

That would look like:

01100001 01100010 01100001 01100010 01100001 01100010 ...

Now add more characters to the mix: say the whole alphabet a through z, laid out the same way, one byte per letter.
That's 26 × n bytes; say n is 100, so 2600 8-bit characters (2600 bytes) packed next to each other. From my understanding, you could say these are "8-bit aligned", or "1-byte aligned".

But now there are two problems:

  1. Word alignment (and how to figure out what your machine's word alignment is).
  2. Cache line alignment (assuming all 64-bit machines use 64 bytes as cache line alignment, which is what I've seen from the web).

Since we have 2600 bytes, they would span 41 cache lines (2600 / 64 = 40.625, so the last line is only partly filled), and if the word size were 2 bytes, then 2600 / 2 = 1300 word-aligned chunks.

Now I'm lost: I don't see how we are supposed to access, or alternatively organize, the data so that it takes advantage of these two alignment conditions (word and cache-line alignment). I already feel like I'll create more confusion than necessary if I try to explain further.

So my question is: how do you (a) organize this UTF-8 string in memory so that (b) you can take advantage of the two alignment conditions, while (c) accessing the data (either individual characters, chunks between word size and cache-line size, or chunks larger than cache-line size)? I don't really know what "accessing more than 64 bits at a time" would mean, since registers are limited to that size. So again, I don't really understand how cache alignment works, and I'm interested in how to satisfy both alignment conditions in this practical example.

Sidenote: I don't need to know exactly how to do it on x86, like which instructions to use (unless it is easy/straightforward to describe). I am just looking for how it works at a high level, using x86 as a starting point.

Another way I try to look at this comes after reading:

  • Align 8-bit data at any address [don't understand this]
  • Align 16-bit data to be contained within an aligned four-byte word
  • Align 32-bit data so that its base address is a multiple of four
  • Align 64-bit data so that its base address is a multiple of eight
  • Align 80-bit data so that its base address is a multiple of sixteen
  • Align 128-bit data so that its base address is a multiple of sixteen

I don't quite follow this exactly, but say for 8-bit data we aligned each byte to a 4-byte word. That would mean it looks like this:

<letter> <empty> <empty> <empty> <letter> <empty> <empty> <empty> ...



That seems like a lot of wasted space. That would mean that plain text requires 4x the actual size of the text to store in memory, which doesn't seem right.

Finally, when they talk about how, if you don't align to the boundaries, the processor fetches extra data, I keep thinking "won't it do a fetch for the empty space anyway?". I don't see how it matters for the fetch whether the memory at a specific point is full or not. That is to say, say the data was like this:

a b c d
And it was 4-byte aligned. That means that to access a we would actually be fetching abcd; to fetch b it would also fetch abcd, etc. But I don't see how that is any different from fetching a--- in this layout:

a <empty> <empty> <empty> b <empty> <empty> <empty> ...

To understand how alignment affects things, let's look at a larger context.

First, as you note, 2600 bytes of UTF-8 (or any kind of data) will indeed take 2600 bytes.

If you allocate 2600 bytes from the heap using malloc(2600), e.g. in C, then since malloc does not accept alignment information, it will not know that your intent is to store only individual bytes; it assumes the worst case, which is that you're using the memory for the largest native type the processor supports.  In the case of a 64-bit processor that is going to be 16 bytes, which is rather large.

So, the memory allocator locates free memory that matches the 16-byte alignment (and is at least 2600 bytes long).  A later allocation via malloc will also be rounded up to 16-byte alignment, so there will be a small gap between the 2600-byte chunk and the next memory block returned by malloc, because 2600 is an exact multiple of 8 but not of 16.  (There are also potentially other overheads associated with each malloc block.)

Both Linux & Windows offer an aligned malloc; however, Linux explicitly states that the minimum alignment is the pointer size.  Even on Windows, which doesn't say, it is clear from the documentation that the authors expect larger alignments to be requested, not smaller ones.

C will create structures with proper field alignment for the target platform, meaning that it will insert unused pad bytes within a struct if the preceding fields do not lay out such that the proper alignment can be had.  For example:

struct S {
    char c;
    int  i;
};
Struct S declares c as a 1-byte item, and it will be at offset 0 in the struct.  The field i is, let's say, a 4-byte item.  After c the next available offset is 1, but that is not suitably aligned for a 4-byte value, so the compiler will insert 3 padding bytes and use offset 4 for i, making sizeof(struct S) equal to 8, even though it only stores 5 bytes of information.

Let's also talk about endianness.  If you have a string of bytes, then each successive byte is stored at the next higher byte address (just add 1 to the address to get to the next one).  However, big-endian machines store the 4 bytes that make up a 4-byte word in the reverse order from little-endian machines.  So, if you used word-sized access on your string "abcd", you would see a difference between a big- and little-endian machine: the big-endian machine would give you 'abcd' whereas the little-endian machine would give you 'dcba'.

In summary, it is generally best not to use the same memory both as bytes and as words (at the same time): if it holds bytes, then use byte-sized access, and if it holds words, then use word-sized access.  Note this will naturally happen unless you do "bad things" like cast pointers to a type other than what they originally pointed to.  (There are times when you might need to, and authors of routines like memcpy and memmove play some tricks for performance.)  We can also note that it is not even possible to mix byte-sized and word-sized accesses (for the same data/object/array) in a language like Java (without resorting to serialization), since it doesn't offer the low-level feature of casting pointers.

The compiler and runtime (e.g. malloc) cooperate to make sure your data is properly aligned (perhaps even over-aligned).  For example, the stack, before main, should be initially aligned (by the runtime) to at least a 16-byte boundary, and then the compiler can create stack frames that are rounded up to a 16-byte size so the stack and all local variables remain aligned during function calls.  Global variables get similar treatment, and we have already discussed heap allocations.

Comments (2)

  • +0 – Dear Erik, thanks for sharing your knowledge. I have one question, how do you come to 16 bytes as the maximum type size for a 64-bit system? I would have assumed it was 8 bytes. Appreciate any "pointers" you can share on this. — Apr 24, 2022 at 00:57  
  • Some ABIs use 8, some 16, so I guess it depends on the ABI. Vector instructions also tend to like the larger alignments. — Apr 24, 2022 at 01:07