Intel x86 Assembly Language & Microarchitecture
Paging - Virtual Addressing and Memory

Introduction


Example

History

The first computers

Early computers had a single block of memory into which the programmer put code and data, and the CPU executed within this environment. Given how expensive computers were at the time, it was unfortunate that a machine would do one job, stop and wait for the next job to be loaded, and then process that one.

Multi-user, multi-processing

So computers quickly became more sophisticated and supported multiple users and/or programs simultaneously - but that's when problems started to arise with the simple "one block of memory" idea. If a computer was running two programs simultaneously, or running the same program for multiple users - which of course required separate data for each user - then the management of that memory became critical.

Example

For example: if a program was written to work at memory address 1000, but another program was already loaded there, then the new program couldn't be loaded. One way of solving this would be to make programs work with "relative addressing": it wouldn't matter where the program was loaded, since it did everything relative to the memory address it was loaded at. But that required hardware support.

Sophistication

As computer hardware became more sophisticated, it was able to support larger blocks of memory, allowing for more simultaneous programs, and it became trickier to write programs that didn't interfere with what was already loaded. One stray memory reference could bring down not only the current program, but any other program in memory - including the Operating System itself!

Solutions

What was needed was a mechanism that allowed blocks of memory to have dynamic addresses. That way a program could be written to work with its blocks of memory at addresses that it recognised - and not be able to access other programs' blocks (unless some cooperation allowed it to).

Segmentation

One mechanism that implemented this was Segmentation. It allowed blocks of memory of all different sizes to be defined, and the program had to specify which Segment it wanted to access every time.

Problems

This technique was powerful - but its very flexibility was a problem. Since Segments essentially subdivided the available memory into different-sized chunks, the memory management for those Segments was an issue: allocation, deallocation, growing, shrinking, fragmentation - all required sophisticated routines and sometimes mass copying to implement.

Paging

A different technique divided all of the memory into equal-sized blocks, called "Pages", which made the allocation and deallocation routines very simple, and did away with growing, shrinking and fragmentation (except for internal fragmentation, which is merely a problem of wastage).

Virtual addressing

By dividing the memory into these blocks, they could be allocated to different programs as needed, at whatever address each program needed them. This "mapping" between the memory's physical address and the program's desired address is very powerful, and is the basis of memory management in every major processor today (Intel, ARM, MIPS, Power et al.).

Hardware and OS support

The hardware performed the remapping automatically and continually, but required tables in memory to define what to do. Of course, the housekeeping associated with this remapping had to be controlled by something: the Operating System would have to dole out the memory as required, and manage the tables of data needed by the hardware to support what the programs required.

Paging features

Once the hardware could do this remapping, what did it allow? The main driver was multiprocessing - the ability to run multiple programs, each with their "own" memory, protected from each other. But two other capabilities it enabled were "sparse data" and "virtual memory".

Multiprocessing

Each program was given its own, virtual "Address Space" - a range of addresses into which physical memory could be mapped, at whatever addresses were desired. As long as there was enough physical memory to go around (though see "Virtual Memory" below), numerous programs could be supported simultaneously.

What's more, those programs couldn't access memory that wasn't mapped into their virtual address space - protection between programs was automatic. If programs needed to communicate, they could ask the OS to arrange for a shared block of memory - a block of physical memory that was mapped into two different programs' address spaces simultaneously.
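
As a concrete illustration, here is a minimal sketch of one side of such a shared mapping, assuming a POSIX system (shm_open and mmap); the name "/demo_shared" is invented for the example. A second program opening the same name would see the same physical Page, possibly at a completely different virtual address:

/* Minimal sketch: one program's side of a shared block, assuming a
   POSIX system. The name "/demo_shared" is arbitrary. A second
   program would shm_open the same name and mmap it to read. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/demo_shared", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, 4096);                      /* one 4 KB Page */

    /* The OS maps one physical Page into this program's address
       space; the other program's mapping points at the same Page. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from program A");
    munmap(p, 4096);
    close(fd);
    return 0;
}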

Sparse Data

Allowing a huge virtual address space (4 GB is typical, matching the 32-bit registers these processors typically had) does not in and of itself waste memory, since large areas of that address space can simply go unmapped. This allows for the creation of huge data structures where only certain parts are mapped at any one time. Imagine a 3-dimensional array of 1,000 bytes in each direction: that would normally take a billion bytes! A program could instead reserve a block of its virtual address space to "hold" this data, but only map small sections as they were populated. This makes for efficient programming, while not wasting memory for data that isn't needed yet.
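
One way to sketch this, assuming a POSIX system: reserve the whole range with mmap and PROT_NONE (addresses only, no physical memory), then commit individual Pages with mprotect as they are populated. The 4 KB Page size and the index arithmetic are assumptions for the example:

/* Sketch: reserve ~1 GB of address space for a 1000x1000x1000 byte
   array without committing physical memory, assuming POSIX mmap.
   PROT_NONE reserves addresses only; mprotect commits a region
   when we actually need it. */
#include <stdio.h>
#include <sys/mman.h>

#define DIM   1000UL
#define TOTAL (DIM * DIM * DIM)   /* a billion bytes of address space */

int main(void)
{
    unsigned char *cube = mmap(NULL, TOTAL, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (cube == MAP_FAILED) { perror("mmap"); return 1; }

    /* Populate just one element: make its surrounding Page usable.
       Only now does the OS supply a physical Page at that spot. */
    size_t index = (500 * DIM + 500) * DIM + 500;   /* [500][500][500] */
    size_t page  = index & ~4095UL;                 /* round down to a 4 KB Page */
    if (mprotect(cube + page, 4096, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect"); return 1;
    }
    cube[index] = 42;

    printf("cube[500][500][500] = %d\n", cube[index]);
    munmap(cube, TOTAL);
    return 0;
}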

Virtual Memory

Above I used the term "Virtual Addressing" to describe the virtual-to-physical address translation performed by the hardware. This is often called "Virtual Memory" - but that term more correctly refers to the technique of using Virtual Addressing to provide an illusion of more memory than is actually available.

It works like this:

  • As programs are loaded and request more memory, the OS provides the memory from what it has available. As well as keeping track of what memory has been mapped, the OS also keeps track of when the memory is actually used - the hardware supports marking used pages.
  • When the OS runs out of physical memory, it looks at all the memory that it has already handed out for whichever Page was used the least, or hadn't been used the longest. It saves that particular Page's contents to the hard disk, remembers where that was, marks it as "Not Present" to the hardware for the original owner, and then zeroes the Page and gives it to the new owner.
  • If the original owner attempts to access that Page again, the hardware notifies the OS. The OS then allocates a new Page (perhaps having to do the previous step again!), loads up the old Page's contents, then hands the new Page to the original program.

    The important point to notice is that since any Page can be mapped to any address, and each Page is the same size, then one Page is as good as any other - as long as the contents remain the same!

  • If a program accesses an unmapped memory location, the hardware notifies the OS as before. This time, the OS notes that it wasn't a Page that had been saved away, so recognises it as a bug in the program, and terminates it!

    This is actually what happens when your app mysteriously vanishes on you - perhaps with a MessageBox from the OS. It's also what (often) happens to cause an infamous Blue Screen or Sad Mac - the buggy program was in fact an OS driver that accessed memory that it shouldn't!
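
A hypothetical sketch of that decision logic, in C: every type and helper name here (vpage, alloc_frame, read_from_swap) is invented for illustration, and a real kernel is far more involved:

/* Hypothetical sketch of the OS's page-fault decision, following the
   steps above. All names are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int   present;     /* currently backed by a physical Page?  */
    int   swapped;     /* contents saved to disk earlier?       */
    long  swap_slot;   /* where on disk, if swapped             */
    void *frame;       /* the physical Page, if present         */
} vpage;

/* Stand-ins for the real machinery. */
static void *alloc_frame(void)           /* may evict another Page! */
{
    return calloc(1, 4096);
}

static void read_from_swap(long slot, void *frame)
{
    /* ...would read the saved contents back in from disk... */
    (void)slot; (void)frame;
}

void handle_page_fault(vpage *p)
{
    if (p->present)
        return;                          /* spurious: nothing to do */

    if (p->swapped) {
        /* The Page was saved to disk earlier. Any physical Page is
           as good as any other - as long as the contents return. */
        p->frame = alloc_frame();
        read_from_swap(p->swap_slot, p->frame);
        p->present = 1;
        p->swapped = 0;
        return;
    }

    /* Never mapped and never swapped out: this access is a bug. */
    fprintf(stderr, "segmentation fault: terminating program\n");
    exit(1);
}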

Paging decisions

The hardware architects needed to make some big decisions about Paging, since the design would directly affect the design of the CPU! A very flexible system would have a high overhead, requiring large amounts of memory just to manage the Paging infrastructure itself.

How big should a Page be?

In hardware, the easiest implementation of Paging would be to take an Address and divide it into two parts. The upper part would be an indicator of which Page to access, while the lower part would be the index into the Page for the required byte:

+-----------------+------------+
| Page index      | Byte index |
+-----------------+------------+
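
In code, that split is just a shift and a mask. Assuming 4 KB Pages (a 12-bit byte index), a sketch looks like this:

#include <stdint.h>
#include <stdio.h>

/* Assuming 4 KB Pages: the low 12 bits index the byte within the
   Page, and the remaining upper bits select the Page. */
#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)   /* 0xFFF */

int main(void)
{
    uint32_t address = 0x12345678;
    uint32_t page    = address >> PAGE_SHIFT;  /* Page index */
    uint32_t byte    = address & PAGE_MASK;    /* Byte index */
    printf("page=0x%X byte=0x%X\n", page, byte);
    return 0;
}

For the address 0x12345678 this prints page=0x12345 byte=0x678.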

It quickly became obvious though that small pages would require vast indexes for each program: even memory that wasn't mapped would need an entry in the table indicating this.

So instead a multi-tiered index is used. The address is broken into multiple parts (three are shown in the example below), and the top part (commonly called a "Directory") indexes into the next part, and so on until the final byte index into the final Page is decoded:

+-----------+------------+------------+
| Dir index | Page index | Byte index |
+-----------+------------+------------+

That means that a Directory index can indicate "not mapped" for a vast chunk of the address space, without requiring numerous Page indexes.
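
A simplified software model of such a walk, using the classic x86 32-bit split of 10 bits of Directory index, 10 bits of Page index and 12 bits of byte index (real hardware adds Present, Accessed and other flags to each entry):

#include <stdint.h>

/* Simplified model: 10-bit Directory index, 10-bit Page index,
   12-bit byte index. NULL entries mean "not mapped". */
typedef struct page       { uint8_t byte[4096]; }        page;
typedef struct page_table { page *entry[1024]; }         page_table;
typedef struct directory  { page_table *entry[1024]; }   directory;

/* Walk the tables; returns NULL for an unmapped address (which the
   real hardware would report to the OS as a page fault). */
uint8_t *translate(directory *dir, uint32_t vaddr)
{
    uint32_t di = (vaddr >> 22) & 0x3FF;   /* Dir index  */
    uint32_t pi = (vaddr >> 12) & 0x3FF;   /* Page index */
    uint32_t bi =  vaddr        & 0xFFF;   /* Byte index */

    page_table *pt = dir->entry[di];
    if (pt == NULL) return NULL;           /* whole 4 MB chunk unmapped */

    page *pg = pt->entry[pi];
    if (pg == NULL) return NULL;

    return &pg->byte[bi];
}

Note how a single NULL Directory entry covers 4 MB of address space with one word - exactly the saving described above.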

How to optimise the usage of the Page Tables?

Every memory access that the CPU makes has to be mapped - the virtual-to-physical process must therefore be as efficient as possible. If the three-tier system described above were implemented naively, every memory access would actually be three accesses: one into the Directory; one into the Page Table; and then finally the desired data itself. And if the CPU needed to perform housekeeping as well, such as indicating that a Page had now been accessed or written to, that would require yet more accesses to update the fields.

Memory may be fast, but this would impose a triple slowdown on all memory accesses while Paging is enabled! Luckily, most programs exhibit "locality of reference": if they access one location in memory, then future accesses will probably be nearby. And since Pages aren't too small, the mapping conversion would only need to be performed when a new Page was accessed: not for absolutely every access.

But even better would be to implement a cache of recently-accessed Pages, not just the most current one. The problem would be finding a Page's entry quickly: the hardware can't afford to scan through the cache sequentially on every access. So the cache is implemented as a content-addressable cache: instead of being accessed by address, it is accessed by content - if the mapping for the requested Page is present, it is offered up; otherwise an empty location is flagged for filling in instead. The cache manages all of that.

This content-addressable cache is often called a Translation Lookaside Buffer (TLB), and is required to be managed by the OS as part of the Virtual Addressing subsystem. When the Directories or Page Tables are modified by the OS, it needs to notify the TLB to update its entries - or to simply invalidate them.
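
A toy software model of such a cache (a tiny fully-associative TLB with naive round-robin replacement; the sizes and policy are illustrative assumptions, and real hardware compares all entries in parallel in a single step):

#include <stdbool.h>
#include <stdint.h>

/* Toy model of a fully-associative TLB: every entry is compared
   against the requested virtual Page number ("by content"). */
#define TLB_ENTRIES 8

typedef struct {
    bool     valid;
    uint32_t vpage;   /* virtual Page number (the "content" key) */
    uint32_t pframe;  /* physical Page it maps to                */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];
static unsigned  next_victim;            /* naive round-robin choice */

bool tlb_lookup(uint32_t vpage, uint32_t *pframe)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpage == vpage) {
            *pframe = tlb[i].pframe;     /* hit: skip the table walk */
            return true;
        }
    }
    return false;                        /* miss: walk the Page Tables */
}

void tlb_fill(uint32_t vpage, uint32_t pframe)
{
    tlb[next_victim] = (tlb_entry){ true, vpage, pframe };
    next_victim = (next_victim + 1) % TLB_ENTRIES;
}

/* When the OS edits the Page Tables, stale entries must go. */
void tlb_invalidate(uint32_t vpage)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].vpage == vpage)
            tlb[i].valid = false;
}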