Early computers had a block of memory that the programmer put code and data into, and the CPU executed within this environment. Given how expensive computers were at the time, it was unfortunate that a machine would do one job, stop and wait for the next job to be loaded, and then process that one.
So computers quickly became more sophisticated and supported multiple users and/or programs simultaneously - but that's when problems started to arise with the simple "one block of memory" idea. If a computer was running two programs simultaneously, or running the same program for multiple users - which of course would have required separate data for each user - then the management of that memory became critical.
For example: if a program was written to work at memory address 1000, but another program was already loaded there, then the new program couldn't be loaded. One way of solving this would be to make programs work with "relative addressing" - it didn't matter where the program was loaded, it just did everything relative to the memory address that it was loaded at. But that required hardware support.
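For the curious, here's a minimal sketch in C of what that hardware support boils down to (the names and numbers are purely illustrative): a "base register" holds the load address, and the hardware adds it to every address the program uses.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative sketch: with a base register, every address the program
     * uses is really an offset that the hardware adds to the load address. */
    static uint32_t base_register;              /* where the program was loaded */

    static uint32_t to_physical(uint32_t relative)
    {
        return base_register + relative;        /* done on every access */
    }

    int main(void)
    {
        base_register = 5000;                   /* loaded at 5000 instead of 1000 */
        /* The program still "thinks" its data lives at offset 1000... */
        printf("offset 1000 -> physical %u\n", (unsigned)to_physical(1000));
        return 0;
    }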
As computer hardware became more sophisticated, it was able to support larger blocks of memory, allowing for more simultaneous programs, and it became trickier to write programs that didn't interfere with what was already loaded. One stray memory reference could bring down not only the current program, but any other program in memory - including the Operating System itself!
What was needed was a mechanism that allowed blocks of memory to have dynamic addresses. That way a program could be written to work with its blocks of memory at addresses that it recognised - and not be able to access other blocks belonging to other programs (unless some cooperation allowed it to).
One mechanism that implemented this was Segmentation. It allowed blocks of memory of all different sizes to be defined, and the program had to specify which Segment it wanted to access every time.
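As a rough sketch (the segment table contents below are invented), segmented translation amounts to a base-plus-offset calculation with a limit check on every access:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative model of Segmentation: each Segment has its own base and
     * size, and every access names a Segment plus an offset within it. */
    struct segment {
        uint32_t base;      /* where the Segment starts in physical memory */
        uint32_t limit;     /* how big the Segment is                      */
    };

    static struct segment segment_table[4] = {
        { .base = 0x1000, .limit = 0x0400 },    /* e.g. code  */
        { .base = 0x8000, .limit = 0x2000 },    /* e.g. data  */
        { .base = 0xF000, .limit = 0x0800 },    /* e.g. stack */
    };

    static uint32_t translate(unsigned seg, uint32_t offset)
    {
        if (offset >= segment_table[seg].limit) {
            fprintf(stderr, "segment violation\n");
            exit(EXIT_FAILURE);
        }
        return segment_table[seg].base + offset;
    }

    int main(void)
    {
        printf("segment 1, offset 0x10 -> physical 0x%X\n",
               (unsigned)translate(1, 0x10));
        return 0;
    }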
This technique was powerful - but its very flexibility was a problem. Since Segments essentially subdivided the available memory into different-sized chunks, the memory management for those Segments became an issue: allocation, deallocation, growing, shrinking, fragmentation - all required sophisticated routines, and sometimes mass copying, to implement.
A different technique divided all of the memory into equal-sized blocks, called "Pages", which made the allocation and deallocation routines very simple, and did away with growing, shrinking and fragmentation (except for internal fragmentation, which is merely a problem of wastage).
By dividing the memory into these blocks, they could be allocated to different programs as needed, at whatever address each program needed them. This "mapping" between the memory's physical address and the program's desired address is very powerful, and is the basis for every major processor's (Intel, ARM, MIPS, Power et al.) memory management today.
The hardware performed the remapping automatically and continually, but needed memory to hold the tables defining what to do. Of course, the housekeeping associated with this remapping had to be controlled by something: the Operating System would have to dole out the memory as required, and manage the tables of data that the hardware needed to support what the programs required.
Once the hardware could do this remapping, what did it allow? The main driver was multiprocessing - the ability to run multiple programs, each with their "own" memory, protected from each other. But it also enabled two other options: "sparse data" and "virtual memory".
Each program was given its own virtual "Address Space" - a range of addresses that it could have physical memory mapped into, at whatever addresses were desired. As long as there was enough physical memory to go around (although see "Virtual Memory" below), numerous programs could be supported simultaneously.
What's more, those programs couldn't access memory that wasn't mapped into their virtual address space - protection between programs was automatic. If programs needed to communicate, they could ask the OS to arrange for a shared block of memory - a block of physical memory that was mapped into two different programs' address spaces simultaneously.
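On a modern POSIX system, one way this might look from a program's point of view is the shm_open/mmap route sketched below; the object name "/demo_shared" and the 4096-byte size are just illustrative. A second program doing the same shm_open and mmap would see exactly the same bytes, because the OS maps the same physical Page into both address spaces.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Ask the OS for a named, shareable block of memory. */
        int fd = shm_open("/demo_shared", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

        /* Map the shared block into this program's address space. */
        char *block = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (block == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(block, "hello from another address space");

        munmap(block, 4096);
        close(fd);
        shm_unlink("/demo_shared");     /* clean up the name when done */
        return 0;
    }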
Allowing a huge virtual address space (4 GB is typical, corresponding to the 32-bit registers these processors usually had) does not in and of itself waste memory, if large areas of that address space go unmapped. This allows for the creation of huge data structures where only certain parts are mapped at any one time. Imagine a 3-dimensional array of 1,000 bytes in each direction: that would normally take a billion bytes! But a program could reserve a block of its virtual address space to "hold" this data, and only map small sections as they were populated. This makes for efficient programming, while not wasting memory for data that isn't needed yet.
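Here is a hedged sketch of how that might be done on a POSIX system (assuming 4 KiB Pages and Linux-style demand paging): reserve the whole block of address space with no access rights, then grant access to individual Pages only as they're actually needed.

    #include <stdio.h>
    #include <sys/mman.h>

    #define RESERVED (1000UL * 1000 * 1000)     /* ~1 GB of address space */
    #define PAGE     4096UL

    int main(void)
    {
        /* Reservation only: no physical memory backs these Pages yet. */
        char *cube = mmap(NULL, RESERVED, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (cube == MAP_FAILED) { perror("mmap"); return 1; }

        /* Later, when one small region is populated, map just that Page in. */
        size_t offset = 123456789;                      /* somewhere in the cube */
        char *page = cube + (offset & ~(PAGE - 1));     /* round down to a Page  */
        if (mprotect(page, PAGE, PROT_READ | PROT_WRITE) < 0) {
            perror("mprotect");
            return 1;
        }
        page[offset & (PAGE - 1)] = 42;                 /* now safe to touch */

        munmap(cube, RESERVED);
        return 0;
    }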
Above I used the term "Virtual Addressing" to describe the virtual-to-physical address translation performed by the hardware. This is often called "Virtual Memory" - but that term more correctly refers to the technique of using Virtual Addressing to provide an illusion of more memory than is actually available.
It works like this:

1. The OS hands out physical Pages to programs, mapping them into each program's virtual address space as requested.

2. When physical memory starts to run out, the OS picks a Page that hasn't been used for a while, copies its contents out to disk, and unmaps it - freeing that physical Page to be used elsewhere.

3. When the original program next touches that address, the hardware raises a Page Fault, since the Page is no longer mapped.

4. The OS finds a free physical Page (evicting another if it has to), copies the saved contents back in from disk, maps it at the original virtual address, and lets the program carry on - none the wiser.
The important point to notice is that since any Page can be mapped to any address, and each Page is the same size, then one Page is as good as any other - as long as the contents remain the same!
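To make that concrete, here's a toy model in C of the bookkeeping involved (everything here - the sizes, the names, the round-robin victim choice - is invented purely for illustration): contents get shuffled between physical frames and a "swap disk", and a Page ends up in whichever frame happens to be free.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NPAGES   8          /* virtual Pages in this toy address space */
    #define NFRAMES  2          /* physical frames actually available      */
    #define PAGESIZE 16         /* tiny Pages keep the demo readable       */

    struct pte { bool present; int frame; };

    static struct pte page_table[NPAGES];
    static char physical[NFRAMES][PAGESIZE];
    static char swap_disk[NPAGES][PAGESIZE];    /* one disk slot per Page   */
    static int  frame_owner[NFRAMES] = { -1, -1 };
    static int  next_victim;                    /* trivial round-robin policy */

    /* Make `page` present, evicting another Page to "disk" if necessary. */
    static char *ensure_present(int page)
    {
        if (!page_table[page].present) {                    /* "Page Fault"  */
            int frame = next_victim;
            next_victim = (next_victim + 1) % NFRAMES;

            int victim = frame_owner[frame];
            if (victim >= 0) {                              /* save victim   */
                memcpy(swap_disk[victim], physical[frame], PAGESIZE);
                page_table[victim].present = false;
            }
            memcpy(physical[frame], swap_disk[page], PAGESIZE);  /* restore  */
            page_table[page].present = true;
            page_table[page].frame = frame;
            frame_owner[frame] = page;
        }
        return physical[page_table[page].frame];
    }

    int main(void)
    {
        strcpy(ensure_present(3), "page three");    /* touches more Pages than */
        strcpy(ensure_present(5), "page five");     /* there are frames...     */
        strcpy(ensure_present(7), "page seven");
        printf("%s\n", ensure_present(3));          /* ...yet nothing is lost  */
        return 0;
    }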
Of course, if a program touches an address that was never mapped in the first place, there's nothing on disk to bring back: the fault is simply an error, and the OS terminates the program. This is actually what happens when your app mysteriously vanishes on you - perhaps with a MessageBox from the OS. It's also what (often) happens to cause an infamous Blue Screen or Sad Mac: the buggy program was in fact an OS driver that accessed memory that it shouldn't have!
The hardware architects needed to make some big decisions about Paging, since those decisions would directly affect the design of the CPU! A very flexible system would have a high overhead, requiring large amounts of memory just to manage the Paging infrastructure itself.
In hardware, the easiest implementation of Paging would be to take an Address and divide it into two parts. The upper part would be an indicator of which Page to access, while the lower part would be the index into the Page for the required byte:
+-----------------+------------+
| Page index | Byte index |
+-----------------+------------+
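In C-like terms (assuming 32-bit addresses and 4 KiB Pages, giving a 20/12 split), the division is nothing more than a shift and a mask:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch assuming 4 KiB Pages: the low 12 bits index the byte within
     * the Page, and the remaining high bits select the Page. */
    #define OFFSET_BITS 12
    #define PAGE_SIZE   (1u << OFFSET_BITS)             /* 4096 */

    int main(void)
    {
        uint32_t address    = 0x12345678;
        uint32_t page_index = address >> OFFSET_BITS;       /* 0x12345 */
        uint32_t byte_index = address & (PAGE_SIZE - 1);     /* 0x678   */
        printf("page 0x%X, byte 0x%X\n",
               (unsigned)page_index, (unsigned)byte_index);
        return 0;
    }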
It quickly became obvious though that small pages would require vast indexes for each program: even memory that wasn't mapped would need an entry in the table indicating this.
So instead a multi-tiered index is used. The address is broken into multiple parts (three are shown in the example below), and the top part (commonly called a "Directory") indexes into the next part, and so on, until the final byte index into the final Page is decoded:
+-----------+------------+------------+
| Dir index | Page index | Byte index |
+-----------+------------+------------+
That means that a Directory index can indicate "not mapped" for a vast chunk of the address space, without requiring numerous Page indexes.
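Here's a sketch of that two-level walk, assuming the classic 10/10/12 split of a 32-bit address (as on 32-bit x86); the table contents are invented purely for illustration, and a NULL Directory entry stands in for "this whole 4 MB chunk is not mapped":

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ENTRIES    1024
    #define NOT_MAPPED 0        /* toy convention: 0 means "no such Page" */

    static uint32_t *directory[ENTRIES];    /* NULL = whole chunk unmapped */

    static uint32_t translate(uint32_t vaddr)
    {
        uint32_t dir_index  = (vaddr >> 22) & 0x3FF;    /* top 10 bits    */
        uint32_t page_index = (vaddr >> 12) & 0x3FF;    /* next 10 bits   */
        uint32_t byte_index =  vaddr        & 0xFFF;    /* bottom 12 bits */

        uint32_t *page_table = directory[dir_index];
        if (page_table == NULL || page_table[page_index] == NOT_MAPPED) {
            fprintf(stderr, "fault: 0x%X is not mapped\n", (unsigned)vaddr);
            exit(EXIT_FAILURE);
        }
        return page_table[page_index] + byte_index;     /* physical address */
    }

    int main(void)
    {
        /* Map one Page: virtual 0x00400000..0x00400FFF -> physical 0x9000. */
        static uint32_t one_table[ENTRIES];
        one_table[0] = 0x9000;
        directory[0x00400000 >> 22] = one_table;

        printf("0x00400123 -> 0x%X\n", (unsigned)translate(0x00400123));
        return 0;
    }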
Every address access that the CPU makes has to be mapped - the virtual-to-physical process must therefore be as efficient as possible. If the three-tier system described above were implemented exactly as described, every memory access would actually be three accesses: one into the Directory; one into the Page Table; and then finally the desired data itself. And if the CPU needed to perform housekeeping as well, such as indicating that a Page had now been accessed or written to, then that would require yet more accesses to update the fields.
Memory may be fast, but this would impose a triple slowdown on all memory accesses while Paging was in use! Luckily, most programs exhibit "locality of reference" - that is, if they access one location in memory, then future accesses will probably be nearby. And since Pages aren't too small, that mapping conversion would only need to be performed when a new Page was accessed: not for absolutely every access.
But even better would be to implement a cache of recently-accessed Pages, not just the most recent one. The problem would be keeping track of which Pages had been accessed and which hadn't - the hardware would have to scan through the cache on every access to find the cached value. So the cache is implemented as a content-addressable cache: instead of being accessed by address, it is accessed by content - if the translation requested is present, it is offered up; otherwise an empty location is flagged to be filled in instead. The cache manages all of that.
This content-addressable cache is usually called a Translation Lookaside Buffer (TLB), and it must be managed by the OS as part of the Virtual Addressing subsystem. When the OS modifies the Directories or Page Tables, it needs to notify the TLB to update its entries - or to simply invalidate them.
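To round things off, here's a toy software model of such a content-addressable cache (the sizes, names and replacement policy are invented; a real TLB searches all of its entries simultaneously in hardware rather than in a loop):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 8

    struct tlb_entry { bool valid; uint32_t vpage; uint32_t pframe; };

    static struct tlb_entry tlb[TLB_ENTRIES];
    static unsigned next_slot;          /* trivial replacement policy */

    /* Look up by content: which entry (if any) holds this virtual Page? */
    static bool tlb_lookup(uint32_t vpage, uint32_t *pframe)
    {
        for (unsigned i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpage == vpage) {
                *pframe = tlb[i].pframe;
                return true;                    /* hit: no table walk needed */
            }
        return false;                           /* miss: walk the Page Tables */
    }

    /* Remember a translation found by walking the Page Tables. */
    static void tlb_fill(uint32_t vpage, uint32_t pframe)
    {
        tlb[next_slot] = (struct tlb_entry){ true, vpage, pframe };
        next_slot = (next_slot + 1) % TLB_ENTRIES;
    }

    /* The OS must do something like this whenever it changes a mapping. */
    static void tlb_invalidate(uint32_t vpage)
    {
        for (unsigned i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpage == vpage)
                tlb[i].valid = false;
    }

    int main(void)
    {
        uint32_t frame;
        tlb_fill(0x12345, 0x00042);
        printf("hit: %d\n", tlb_lookup(0x12345, &frame));   /* 1 */
        tlb_invalidate(0x12345);
        printf("hit: %d\n", tlb_lookup(0x12345, &frame));   /* 0 */
        return 0;
    }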