#### **Virtual Memory**

Instructor: Dmitri A. Gusev

#### Fall 2007

CS 502: Computers and Communications Technology

Lecture 11, October 10, 2007

# **Virtual Memory**

- Main memory can act as a cache for secondary storage (disk)



Advantages:

- illusion of having more physical memory
- program relocation
- protection

# Pages: virtual memory blocks

- Page faults: the data is not in memory, so it must be retrieved from disk
  - huge miss penalty, so pages should be fairly large (e.g., 4 KB)
  - reducing page faults is important (LRU replacement is worth the price)
  - faults can be handled in software instead of hardware
  - write-through is too expensive, so write-back is used
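The points about LRU and write-back can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_lru`, the access trace, and the frame count are hypothetical and not from the lecture. It counts page faults under LRU replacement and counts a disk write only when a dirty page is evicted.

```python
from collections import OrderedDict

def simulate_lru(accesses, num_frames):
    """Count page faults and write-backs for a list of (page, is_write) accesses."""
    frames = OrderedDict()  # page number -> dirty flag, least recently used first
    faults = 0
    writebacks = 0
    for page, is_write in accesses:
        if page in frames:
            frames.move_to_end(page)  # hit: mark page as most recently used
        else:
            faults += 1  # page fault: the page must be fetched from disk
            if len(frames) >= num_frames:
                _, dirty = frames.popitem(last=False)  # evict the LRU page
                if dirty:
                    writebacks += 1  # write-back: disk is updated only on eviction
            frames[page] = False  # newly loaded page starts clean
        if is_write:
            frames[page] = True  # a write only sets the dirty bit in memory
    return faults, writebacks

# Hypothetical trace of (page number, is_write) pairs
trace = [(0, False), (1, True), (2, False), (0, False),
         (3, False), (1, False), (4, True)]
faults, writebacks = simulate_lru(trace, num_frames=3)
print(faults, writebacks)
```

With write-through, every write in the trace would go to disk immediately; here a dirty page costs one disk write at eviction time, however many times it was written while resident.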




## Page Tables
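A page table maps virtual page numbers to physical page numbers, while the page offset passes through unchanged. The following is a minimal sketch, assuming 4 KB pages and a dictionary standing in for the page table; the names `translate`, `PageFault`, and the table contents are hypothetical.

```python
PAGE_SIZE = 4096   # assumed 4 KB pages
OFFSET_BITS = 12   # log2(PAGE_SIZE)

class PageFault(Exception):
    """Raised when the requested page is not resident in main memory."""

def translate(vaddr, page_table):
    """Translate a virtual address to a physical address using a page table."""
    vpn = vaddr >> OFFSET_BITS          # virtual page number (upper bits)
    offset = vaddr & (PAGE_SIZE - 1)    # page offset (lower 12 bits), unchanged
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        raise PageFault(vpn)            # software would fetch the page from disk
    return (entry["ppn"] << OFFSET_BITS) | offset

# Hypothetical page table: page 0 is resident in physical page 5, page 1 is on disk
page_table = {
    0: {"valid": True, "ppn": 5},
    1: {"valid": False, "ppn": None},
}
paddr = translate(0x0ABC, page_table)
print(hex(paddr))
```

An access to virtual page 1 raises `PageFault`, which corresponds to the case where the operating system must bring the page in from disk.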




#### Making Address Translation Fast

- A cache for address translations: the translation lookaside buffer (TLB)



Typical values: 16-512 entries; miss rate: 0.01%-1%; miss penalty: 10-100 cycles
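To get a feel for these numbers, the average translation overhead per memory reference can be estimated as miss rate times miss penalty. This is a rough sketch that ignores TLB hit time and any overlap with other work; the function name `average_tlb_overhead` is hypothetical.

```python
def average_tlb_overhead(miss_rate, miss_penalty_cycles):
    """Average extra cycles per memory reference caused by TLB misses."""
    return miss_rate * miss_penalty_cycles

# Two ends of the typical ranges quoted above
best = average_tlb_overhead(0.0001, 10)   # 0.01% miss rate, 10-cycle penalty
worst = average_tlb_overhead(0.01, 100)   # 1% miss rate, 100-cycle penalty
print(best, worst)
```

At the favorable end the overhead is about a thousandth of a cycle per reference; at the unfavorable end it is about one cycle per reference.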

#### **TLBs and Caches**






## Modern Systems

| Characteristic | Intel Pentium P4 | AMD Opteron |
|---|---|---|
| Virtual address | 32 bits | 48 bits |
| Physical address | 36 bits | 40 bits |
| Page size | 4 KB, 2/4 MB | 4 KB, 2/4 MB |
| TLB organization | 1 TLB for instructions and 1 TLB for data<br>Both are four-way set associative<br>Both use pseudo-LRU replacement<br>Both have 128 entries<br>TLB misses handled in hardware | 2 TLBs for instructions and 2 TLBs for data<br>Both L1 TLBs fully associative, LRU replacement<br>Both L2 TLBs are four-way set associative, round-robin LRU<br>Both L1 TLBs have 40 entries<br>Both L2 TLBs have 512 entries<br>TLB misses handled in hardware |

FIGURE 7.34 Address translation and TLB hardware for the Intel Pentium P4 and AMD Opteron. The word size sets the maximum size of the virtual address, but a processor need not use all bits. The physical address size is independent of word size. The P4 has one TLB for instructions and a separate identical TLB for data, while the Opteron has both an L1 TLB and an L2 TLB for instructions and identical L1 and L2 TLBs for data. Both processors provide support for large pages, which are used for things like the operating system or mapping a frame buffer. The large-page scheme avoids using a large number of entries to map a single object that is always present.
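The benefit of large pages can be made concrete by computing how much memory a TLB can map at once (entries times page size). This is a sketch using the entry counts and page sizes from Figure 7.34; the helper `tlb_reach` is a hypothetical name, and it assumes every entry can hold a page of the given size, which real hardware may not allow.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach(entries, page_size_bytes):
    """Bytes of memory that a TLB can map when every entry is in use."""
    return entries * page_size_bytes

# Pentium P4 data TLB: 128 entries
p4_small = tlb_reach(128, 4 * KB)     # with 4 KB pages
p4_large = tlb_reach(128, 4 * MB)     # assuming all entries map 4 MB pages

# Opteron: 40-entry L1 TLB and 512-entry L2 TLB, 4 KB pages
opteron_l1 = tlb_reach(40, 4 * KB)
opteron_l2 = tlb_reach(512, 4 * KB)

print(p4_small // KB, "KB")
print(p4_large // MB, "MB")
print(opteron_l1 // KB, "KB")
print(opteron_l2 // MB, "MB")
```

With 4 KB pages, a 128-entry TLB covers only 512 KB, so an object such as a frame buffer of several megabytes would occupy many entries; a single large-page entry can cover it instead.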

| Characteristic | Intel Pentium P4 | AMD Opteron |
|---|---|---|
| L1 cache organization | Split instruction and data caches | Split instruction and data caches |
| L1 cache size | 8 KB for data, 96 KB trace cache for RISC instructions (12K RISC operations) | 64 KB each for instructions/data |
| L1 cache associativity | 4-way set associative | 2-way set associative |
| L1 replacement | Approximated LRU replacement | LRU replacement |
| L1 block size | 64 bytes | 64 bytes |
| L1 write policy | Write-through | Write-back |
| L2 cache organization | Unified (instruction and data) | Unified (instruction and data) |
| L2 cache size | 512 KB | 1024 KB (1 MB) |
| L2 cache associativity | 8-way set associative | 16-way set associative |
| L2 replacement | Approximated LRU replacement | Approximated LRU replacement |
| L2 block size | 128 bytes | 64 bytes |
| L2 write policy | Write-back | Write-back |

FIGURE 7.35 First-level and second-level caches in the Intel Pentium P4 and AMD Opteron. The primary caches in the P4 are physically indexed and tagged; for a discussion of the alternatives, see the Elaboration on page 527.
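The size, block size, and associativity in Figure 7.35 determine how many sets each cache has and how an address splits into offset and index bits. The sketch below works this out for the data caches; `cache_geometry` is a hypothetical helper and assumes all sizes are powers of two.

```python
KB = 1024

def cache_geometry(size_bytes, block_bytes, ways):
    """Return (number of sets, block offset bits, index bits) for a cache."""
    blocks = size_bytes // block_bytes
    sets = blocks // ways
    offset_bits = block_bytes.bit_length() - 1  # log2 for a power of two
    index_bits = sets.bit_length() - 1          # log2 for a power of two
    return sets, offset_bits, index_bits

# Parameters taken from Figure 7.35
p4_l1d = cache_geometry(8 * KB, 64, 4)
opteron_l1d = cache_geometry(64 * KB, 64, 2)
p4_l2 = cache_geometry(512 * KB, 128, 8)
opteron_l2 = cache_geometry(1024 * KB, 64, 16)

for name, geom in [("P4 L1 data", p4_l1d), ("Opteron L1 data", opteron_l1d),
                   ("P4 L2", p4_l2), ("Opteron L2", opteron_l2)]:
    print(name, geom)
```

The remaining upper address bits form the tag; for example, the P4 L1 data cache uses 6 offset bits and 5 index bits.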

# Modern Systems

- Things are getting complicated!

| MPU                                | AMD<br>Opteron                        | Intrinsity<br>FastMATH               | Intel Pentium 4                       | Intel PXA250                        | Sun<br>UltraSPARC IV               |
|------------------------------------|---------------------------------------|--------------------------------------|---------------------------------------|-------------------------------------|------------------------------------|
| Instruction set architecture       | IA-32, AMD64                          | MIPS32                               | IA-32                                 | ARM                                 | SPARC V9                           |
| Intended application               | server                                | embedded                             | desktop                               | low-power embedded                  | server                             |
| Die size (mm<sup>2</sup>) (2004)   | 193                                   | 122                                  | 217                                   |                                     | 356                                |
| Instructions Issued/clock          | 3                                     | 2                                    | 3 RISC ops                            | 1                                   | 4 × 2                              |
| Clock rate (2004)                  | 2.0 GHz                               | 2.0 GHz                              | 3.2 GHz                               | 0.4 GHz                             | 1.2 GHz                            |
| Instruction cache                  | 64 KB,<br>2-way set<br>associative    | 16 KB,<br>direct mapped              | 12000 RISC op trace<br>cache (~96 KB) | 32 KB,<br>32-way set<br>associative | 32 KB,<br>4-way set<br>associative |
| Latency (clocks)                   | 3?                                    | 4                                    | 4                                     | 1                                   | 2                                  |
| Data cache                         | 64 KB,<br>2-way set<br>associative    | 16 KB,<br>1-way<br>set associative   | 8 KB,<br>4-way<br>set associative     | 32 KB,<br>32-way set<br>associative | 64 KB,<br>4-way set<br>associative |
| Latency (clocks)                   | 3                                     | 3                                    | 2                                     | 1                                   | 2                                  |
| TLB entries (I/D/L2 TLB)           | 40/40/512/<br>512                     | 16                                   | 128/128                               | 32/32                               | 128/512                            |
| Minimum page size                  | 4 KB                                  | 4 KB                                 | 4 KB                                  | 1 KB                                | 8 KB                               |
| On-chip L2 cache                   | 1024 KB,<br>16-way set<br>associative | 1024 KB,<br>4-way set<br>associative | 512 KB,<br>8-way set<br>associative   | -                                   | -                                  |
| Off-chip L2 cache                  | -                                     | -                                    | -                                     | -                                   | 16 MB, 2-way<br>set associative    |
| Block size (L1/L2, bytes)          | 64                                    | 64                                   | 64/128                                | 32                                  | 32                                 |

FIGURE 7.36 Desktop, embedded, and server microprocessors in 2004. From a memory hierarchy perspective, the primary difference between categories is the L2 cache. There is no L2 cache for the low-power embedded, a large on-chip L2 for the embedded and desktop, and 16 MB off chip for the server. The processor clock rates also vary: 0.4 GHz for low-power embedded, 1 GHz or higher for the rest. Note that UltraSPARC IV has two processors on the chip.

#### Some Issues

- Processor speeds continue to increase very fast, much faster than either DRAM or disk access times improve



- Design challenge: dealing with this growing disparity
  - Prefetching? 3rd level caches and more? Memory design?