UCB CS162 Operating Systems Notes (4)
P15: Lecture 15: Memory 3: Caching and TLBs (con't), Paging - RubatoTheEmber - BV1L541117gr
Okay, let's get started. So today we're going to do our third lecture on memory: we're going to look at caching and translation lookaside buffers (TLBs), and then we'll get into demand paging. Okay.
So remember we have the two-level page table, right? Here we have this tree of page tables. We use the magic 10, 10, 12 organization for the address, and that gives us this nice breakdown where each of our page tables is 1,024 entries, two to the 10. Each entry is 4 bytes, so a table fits on a 4K page, which matches our offset of two to the 12. By doing this, the tables are a fixed size. We have the root page table pointer, the page table base pointer, which is a register (the CR3 register on x86); it points to our top-level table, and the top 10 bits index into that table. That gives us a pointer to the second-level page table in physical memory. At the second level, we index into that with our P2 index, the next 10 bits, and that gives us the page table entry for the actual physical page that we're looking for. Combining that physical frame number with our offset, we get the page and we get the byte on the page. There's lots of bookkeeping information in the page table entry that we'll be going through later on today: things like valid bits, dirty bits, modified bits, and so on.
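As a rough illustration, here's a minimal sketch in C of the 10/10/12 walk the hardware performs; the PTE bit layout and the `phys_read32` toy memory are assumptions made just for this sketch, not the real x86 format:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define PTE_VALID 0x1u                     /* assumed bit layout, for illustration only */

static uint8_t phys_mem[1 << 20];          /* toy "physical memory" so the sketch is self-contained */

static uint32_t phys_read32(uint32_t paddr) {
    uint32_t v;
    memcpy(&v, &phys_mem[paddr], sizeof v);   /* read one 32-bit page-table entry */
    return v;
}

/* Walk a two-level (10/10/12) page table rooted at `cr3` (physical address of
 * the top-level table). Returns true and fills *paddr on success; an invalid
 * entry at either level is where the hardware would raise a page fault. */
bool translate(uint32_t cr3, uint32_t vaddr, uint32_t *paddr) {
    uint32_t p1     = (vaddr >> 22) & 0x3FFu;  /* top 10 bits: index into the root table  */
    uint32_t p2     = (vaddr >> 12) & 0x3FFu;  /* next 10 bits: index into the 2nd level   */
    uint32_t offset =  vaddr        & 0xFFFu;  /* low 12 bits: offset within the 4 KB page */

    uint32_t pde = phys_read32(cr3 + p1 * 4);  /* 4-byte entries, 1,024 per table */
    if (!(pde & PTE_VALID)) return false;

    uint32_t pte = phys_read32((pde & ~0xFFFu) + p2 * 4);
    if (!(pte & PTE_VALID)) return false;

    *paddr = (pte & ~0xFFFu) | offset;         /* frame number + offset = physical byte */
    return true;
}
```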
Now, one of the nice things about the two-level scheme is that the amount of data structure we need is proportional to the fraction of the virtual address space that's actually in use; it can be much, much smaller than the full virtual address space.
Okay, now how do we do translation? Again, remember, translation is done in the memory management unit. The memory management unit has to translate every instruction fetch and every load and store from the virtual addresses used by the program, the addresses emitted by the CPU, to the actual physical addresses that are used by memory. Now, how do we do this translation? Well, for a single-level page table, we just index into that table, read the page table entry, check that it's valid, and that gives us the physical page frame. For two-level tables, it's exactly what I showed you on the last slide, where we walk through two levels of those page tables. For n-level page tables, we walk through n levels of page tables. So what we're watching the memory management unit do is just a form of tree traversal: it's traversing through that tree of page table entries to find the page table entry for the actual physical page that we want. All right, now where do we put the memory management unit?
Well, it sits on the processor die, right next to the processor. The processor generates virtual addresses, and the memory management unit uses the page table base register to traverse the page tables that are stored in memory. And notice here, our cache in this case is physically addressed, so the addresses we pass to the cache are addresses that have already been translated. Okay. So when we give the MMU a virtual address, it does this traversal. Now, since we have a cache here, it might be the case that portions of those page tables are in the cache. If that's true, then the memory management unit will get back those page table entries right away, and it'll be able to provide the physical address to the processor right away. If not, then we might have to go all the way out to physical memory in order to perform the translation. And if those page tables are actually out on disk, we might even have to do I/O in order to do that conversion from a virtual address to a physical address. So if we hit in the cache, it's going to be super fast; if we miss in the cache, it could be really, really slow. Okay. So what's the memory management unit doing? Well, on every instruction fetch, load, and store, it's doing this tree traversal to convert a virtual address to a physical address. And if it finds an invalid page table entry for whatever reason, the permission, the access, the type of operation, whatever, it's going to raise a fault and we'll drop into the operating system. Okay. So one thing to consider here is that the memory management unit sits before the cache. The cache is how we make things fast, right? This cache is inside the processor, so it operates on the order of a nanosecond. But going out to memory, that's like 100 nanoseconds; going out to disk, that's like millions of nanoseconds. So in order to actually access something in the cache, we first have to go through this translation process, which could potentially take hundreds or even millions of nanoseconds. That seems like it really defeats the whole purpose, right?
So let's look at what caching offers and how we might apply caching to the translation process. We'll do a quick primer, in case you've forgotten it from 61C, on what a cache is. A cache is just a repository where we keep copies that we can access much more quickly than the original copy. Right? So rather than having to go out to memory, we have a cache here; in this case it's a data cache that lets us access copies of what's in physical memory much faster, like a nanosecond rather than hundreds of nanoseconds. Okay. What makes caches work is that they make the frequent case fast and they make the infrequent cases less dominant. If you look at a modern computer, you're going to find caching everywhere, right? We use it to cache memory locations, address translations, domain name resolutions, network locations, web pages, file blocks. Almost everything in computers is accelerated by using caches, as long as a miss isn't too expensive. We want a very high hit rate, and we want to minimize the miss cost as much as possible. And that gets us to how we evaluate caches.
We evaluate a cache with this measure, the average access time. The average access time is just the hit rate times the hit time (when we find it in the cache) plus the miss rate times the miss time (when we don't find it in the cache and have to go out to a lower level of the memory hierarchy). Caching is how we get system performance. So here's an example, right? If the processor goes directly to memory, it takes 100 nanoseconds, while the processor itself operates on the order of a nanosecond. So accessing something on the processor, like an on-chip cache, is about a nanosecond; going all the way out to memory is 100 nanoseconds. It's 100 times slower if we're operating out of main memory than if we're able to stay on the processor and operate at processor speeds. You just went out and spent all that money on that crazy fast Intel i9-12900KF processor, and now you're going to run it at the speed of the memory. So this is where caching can help us get speeds that look closer to what that processor can actually do.
Okay, so how do we apply the average memory access time here? Again, it's the hit rate times the hit time plus the miss rate times the miss time, where the hit rate plus the miss rate is one. Okay, so what if we have a 90% hit rate? So 90% of the time the processor is going to find the data in the cache. That sounds really good: 9 out of 10 times we're finding it in the cache. But what does that really translate to? Well, that's going to be 0.9 times one nanosecond for the 90% of the time, and then the other 10% of the time, 0.1 times 101 nanoseconds. Now, why is it 101 and not 100? Because I have to look in the cache first before I discover it's not there, so there's one nanosecond to look in the cache and then 100 nanoseconds to go out and get it from the next level, which is memory. That's about 11 nanoseconds. Is that good? Is that bad? Well, it certainly seems better than 100 nanoseconds; it's almost 10 times faster. But it's still about 10 times slower than this processor that I spent all this money to buy. So it actually is pretty bad: a 10x slowdown really isn't acceptable. So what if I increase the hit rate? What if I push the hit rate to 99%? Well, now it's 0.99 times one plus 0.01 times 101, so about 2 nanoseconds. That's much better, not perfect, but now we've only got a factor of two slowdown instead of a factor of 100.
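A quick sketch of that arithmetic, using the 1 ns hit time and 100 ns memory latency from the example:

```c
#include <stdio.h>

/* Average memory access time: AMAT = hit_rate * hit_time + miss_rate * miss_time,
 * where miss_time = hit_time + lower_level_latency (we check the cache first). */
static double amat(double hit_rate, double hit_time_ns, double lower_level_ns) {
    double miss_rate = 1.0 - hit_rate;
    double miss_time = hit_time_ns + lower_level_ns;
    return hit_rate * hit_time_ns + miss_rate * miss_time;
}

int main(void) {
    printf("90%% hit rate: %.2f ns\n", amat(0.90, 1.0, 100.0)); /* 0.9*1 + 0.1*101 = 11.00 ns */
    printf("99%% hit rate: %.2f ns\n", amat(0.99, 1.0, 100.0)); /* 0.99*1 + 0.01*101 = 2.00 ns */
    return 0;
}
```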
So that's where the power of caching comes in. But it also says that we have to be able to deliver very high hit rates in order to achieve good performance. Now, when you really think about it, what we're going to see later in the class is that here we're using the cache to cache the contents of memory, but we also use memory to cache the contents of the disk. So really, the average memory access time is the level-one hit time, because we find it in the level-one cache, plus the level-one miss rate times the miss penalty if we don't find it in level one, and that miss penalty is basically the average memory access time of level two. Because you might go to get it out of memory and find it's not in memory, it's actually out on disk, and so you then have to go out to disk as well. Yeah? [INAUDIBLE] Oh, that's a good point. [INAUDIBLE] Let's see if we can lower that a bit. [INAUDIBLE] OK. All right. So another reason why we want caching-- oh, yeah? [INAUDIBLE] Ah, right, that should be 10.1 plus 0.9 there. Yeah, I'll fix that on the slide.
Thank you. OK. So another major reason why we want to deal with caching is that, remember, we have to walk through these page tables. So here's an example of a segment-based system with a single level of paging at the lowest level. For every memory reference we do, we first look in the segment map. That's in the processor, so that's fast. And then we have to actually go out to memory to walk through the page table. In the example I showed on the first slide, we had to walk through two levels of page tables out in memory. So every reference we do is going to turn into somewhere between two and three memory references. So for every instruction fetch, load, and store, we've multiplied the cost by a factor of two or three, or more if we have additional levels, four or five and so on. OK. Now, if you think about it, again, like I said on the MMU slide before: where is the MMU? The MMU is before the cache that we're using to make memory faster. But before we can look up anything in the cache, we first need to get the physical address. And so if it takes us multiple memory references just to be able to check in the cache, what's the point of having a memory cache anyway? The time to access that memory cache is going to be dominated by those two or three memory accesses we have to do for translation. OK. So here's the solution: we're doing all this work to do the translation;
we might as well save that result and cache it. OK. So this cache is called a translation lookaside buffer. We'll get to why it has that name in just a moment. But first, the question is: will this work? Can we get a decent hit rate out of a cache of translations? So when does caching help? Caching helps when we have locality. No locality, no benefit to caching: if every access is completely random, you won't see any benefit from caching. But that's not how computer programs typically operate, and not how computers operate in general. Right? So if we look at program behavior, we typically see temporal locality. Here on the x axis I have the address space, and on the y axis I have the probability that we reference a given memory location, and you can see that there are peaks. So if I have accessed something, I'm likely to access that thing again. Right? If I have some global variable and I access it, I'm likely to come back and access that global variable again. If I have some variables within a subroutine, I'm likely to access those variables multiple times. So keeping recently accessed data closer to the processor is going to give us better performance. And there's also spatial locality: if I access something, and you can see the peaks are sort of broad here, I'm likely to access things around it too. Right? Think about a string length or string copy operation: if I access one character, I'm likely to access the other characters nearby. So when we think about moving things around, spatial locality means I want to move a whole block from a lower level of the memory hierarchy up to the upper level, because if I access one thing in that block, I'm likely to access other things in that block. Okay, that's what this picture is showing: if I access something down here, I should move the whole block up, not just the individual item that I accessed.
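As a small illustration of the two kinds of locality (not from the lecture, just a toy example):

```c
#include <stddef.h>

/* Spatial locality: a strlen-style scan touches consecutive bytes, so they
 * almost all land on the same page and in already-fetched cache blocks. */
size_t my_strlen(const char *s) {
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Temporal locality: `total` and the loop index are touched over and over,
 * so after the first access they are served straight out of the cache. */
long sum(const long *values, size_t count) {
    long total = 0;                    /* reused on every iteration */
    for (size_t i = 0; i < count; i++)
        total += values[i];            /* sequential, block-friendly accesses */
    return total;
}
```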
All right. So our goal here with caching is all about illusions. We want to present the illusion that we have memory at the scale of terabytes that operates at the speed of registers, with the cost of my slowest, cheapest storage rather than my most expensive storage, and the capacity of my largest storage, not my fastest storage. Like here, with registers we're talking about hundreds of bytes, which is tiny, but I can access those registers in a fraction of a nanosecond, whereas if I have to go out to disk it's about ten million nanoseconds, but at a terabyte it looks essentially infinite to a program. So I want that illusion of infinite, fast storage. So we'll do caching at every level of this hierarchy: here we have an L1 and an L2 cache for the core, and then we have memory acting as a cache for our SSD, and our SSD as a cache for our hard drive, which could be a cache for going to the cloud, and so forth. Okay. Now our challenge is that address translation needs to occur here, during my instruction fetches and during loads and stores into registers, but the page tables I need in order to do that address translation are living out in memory, which is operating at 100 nanoseconds versus the fraction of a nanosecond for my registers.
All right. So how do we make address translation fast? We cache translations. All right. And this is a little different from how you think about a traditional cache. In a traditional cache, the key is something like a memory location, the physical memory location, and the cache returns the contents of that memory location. Here we're caching the result of an algorithm: the algorithm being the page table walk we run for a given virtual address that returns the page table entry. And it's indexed not by a physical address but by a virtual page number. So we have our processor generating virtual addresses, our memory management unit going through our data cache, walking the page tables, generating the translation, and we're going to cache that result. So here we have our translation lookaside buffer, indexed by virtual page numbers, returning physical frame numbers along with a whole bunch of bits from the page table entry. All right. So here's a page table, and we've loaded entries into our TLB. Okay, so this is recording, again, virtual page number to physical frame number translations. If we find something in the TLB, we're done. We don't need to go out to memory, we don't need to look at the page tables, we don't need to do any of that; we have the answer, we have the physical frame number. So, where does that name come from? It was invented by Sir Maurice Wilkes, before caches existed, because he realized on the early computers that there was this problem that they were running really slowly, since they were spending all their time going out to memory just to do address translation. So he created this address translation cache and decided to call it the translation lookaside buffer. But then people realized: wow, if we can cache the results of translations, why don't we also use some storage to cache the results of going out to memory? And that's where caches came from; they just picked a better name than "memory lookaside buffer" or something like that. Okay. Now, when we don't find something in the TLB, we just revert back to what we were doing before, which is the memory management unit traversing the tree to find the result, and when we find that result,
we'll put that result in the TLB. Okay, so pulling it all together: the processor generates a virtual address. We look in the TLB; if we find it, we have our physical address and can go directly to physical memory. If we don't find it, we ask the MMU to translate this virtual address to a physical address; it walks the page tables, does all the things it does, generates the physical frame number, and we cache that translation. Okay.
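A minimal sketch of that flow, with a tiny direct-indexed TLB in front of the `translate()` page walk sketched earlier; the entry layout and sizes are made up for illustration (real TLBs are typically highly associative, as discussed later):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12

struct tlb_entry {
    bool     valid;
    uint32_t vpn;   /* virtual page number   */
    uint32_t pfn;   /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Page-walk fallback, e.g. the two-level walk sketched earlier. */
extern bool translate(uint32_t cr3, uint32_t vaddr, uint32_t *paddr);

bool mmu_lookup(uint32_t cr3, uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t off = vaddr & ((1u << PAGE_SHIFT) - 1);
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn) {          /* TLB hit: no memory references at all */
        *paddr = (e->pfn << PAGE_SHIFT) | off;
        return true;
    }
    if (!translate(cr3, vaddr, paddr))        /* TLB miss: walk the page tables */
        return false;                         /* invalid PTE -> page fault path  */

    e->valid = true;                          /* cache the translation for next time */
    e->vpn   = vpn;
    e->pfn   = *paddr >> PAGE_SHIFT;
    return true;
}
```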
And of course we also have untranslated addresses, which we'll come back to. So there are a couple of questions.
This is all premised on the idea that there's real locality here. If I access a page as a program, am I likely to access that page again? If the answer is no, then there's no benefit to caching the translation. If programs just ping-ponged all over their memory, never touching the same page twice, there would be zero benefit. But that's not true; it's not how programs typically behave. If we look at instruction accesses, they spend a lot of time on the same page: your code mostly runs sequentially. You have branches and things like that, but in general we stay on the same page; periodically you call some library or some subroutine and change pages. But many of your instruction fetches will be on the same page, so we'll gain some benefit from caching that translation. Stack accesses: definitely locality of reference. We grow our stack incrementally and shrink it incrementally, so you're going to spend most of your time on the same stack page, and periodically you'll go up or down a page. Data accesses are a little more complicated. Many times you will have locality, but not always, right? The example I gave earlier, where I'm going through the characters in a string: those will all be on the same page, and empirically a lot of heap accesses behave that way. But if I have a lookup table, I might be doing random accesses, or with a database there might be a lot of random accesses to that data structure. So it's going to vary. But definitely for the stack, and definitely for instruction fetches, we'll see a lot of page-level locality. Okay. Can we have a hierarchy of TLBs? Yes, we can have multiple levels of TLBs operating with different organizations. Yeah, question?
Yeah, so the question is: when the processor goes through the TLB, does it have to go through the MMU, or can it go directly? So again, the idea is that you ask the TLB first; it's the cache. If it has the translation from the virtual page number to the physical page number, you're done. If it doesn't, then you have a TLB miss, you go to the memory management unit, and it walks the tree, generates the translation from that virtual page number to the physical page frame, and then it tells the TLB: here is that translation and all of the bits associated with it, cache this for the future. And then you go on to the cache. So this all happens before we go to the cache, and that's why the hit rate has to be very high. If we don't have a good hit rate in the TLB, then it doesn't matter if we have a very high hit rate in the cache; we're going to spend a lot of memory accesses just to get a translated address. Yeah?
So this is actually very important. There's this extra path here where the CPU can generate physical addresses directly. If you think about it, the operating system needs to be able to go in and manipulate things, like talk directly to devices or memory-mapped I/O, or actually be able to set up page tables. And so there are a lot of cases where the operating system needs to be able to directly read and write physical memory. In that case there's no translation that happens: we're operating in the physical address space of the machine, as opposed to in some process's virtual address space. Otherwise there'd be no way to set up page tables in the first place. Yeah? Correct. The question is whether we should prevent user processes from being able to generate these untranslated references. Absolutely. If we didn't, then a user process could read any physical memory. So dual-mode operation requires that if you're going to generate physical addresses, you have to be running in a protected mode, in kernel mode. Any other questions? Yeah. Yeah. So the question is whether there's something that indicates whether the CPU is accessing translated or untranslated addresses. So typically there might be something like reserved, privileged instructions, and those instructions can only execute when the privilege level is at the right level; they let you do loads and stores directly without going through translation. But that's going to be architecture-dependent. Yeah. Yeah. So the question is, can I give some examples of why the kernel would want to use untranslated addresses. Two easy examples: the graphics frame buffer is typically at a very specific location in physical address space, and for performance I might want to be able to read and write from the kernel directly into the graphics frame buffer without having to go through any kind of translation. I could also map windows from the frame buffer into various address spaces, but those would then go through address translation. The other key reason is: how do you set up the page tables in the first place? The page tables are located in physical memory, and so to set those up I need to be able to read and write directly to physical memory. Other questions? Okay.
So the question is, are we assuming that we update the TLB after every access? So we only need to update the TLB when we make a change to something that the TLB is caching. For example, when we do a translation in the MMU on a TLB miss, that's the point where we put it into the TLB. The TLB also tracks whether something's been recently accessed, so the TLB entry would be updated when we do a read or a write. And similarly, when we do a write, we might have a dirty bit that tracks that the underlying page has been modified, and that would cause a TLB update. But we're going to go into this in more detail. So on the previous slide, it says if the TLB misses, then we check the cache? Yeah, so if the TLB misses, then we're just going to do translation using the memory management unit, and that's going to walk the tree. And remember, the memory cache is caching the things that are in memory, and that can include page table entries from the page tables. We always do all of our reads and writes through the cache, so potentially you might find that the page table entries relevant to the translation are actually in the cache, in which case we get to stay on the processor; we never actually have to go out to memory. So is the TLB before the L1 cache? Yes, the TLB is before the L1 cache. We're getting a little ahead; we're going to have a picture that shows what a modern processor architecture looks like, and you'll see that there are actually multiple TLBs, and some are shared.
So, good question. What organization, what kind of cache, should the TLB be? Well, when you think about caches, there are a lot of different parameters we can use in picking an organization. There's the total size. There's the associativity, the set size and the number of sets. And then there's the line (block) size. Now remember, this is something located on the processor die, so every transistor is incredibly expensive and there's a limited amount of area to implement it in, so making this design trade-off properly is going to be really important for performance. There's also a write policy: when we store something and make a change, is it write-through or write-back? So how might the organization of the TLB differ from the organization of a traditional data cache? To answer that question, let's take a look and make sure everybody remembers all the different types of cache organizations from 61C.
Okay, before we do that, let's look at why things might not be in a cache, why we might have a cache miss. There are three, plus one, sources of cache misses, as I call them. The first is compulsory misses: the cold, hard fact of life that the first time you go to access something, it's not going to be in the cache, and so you have to go out to memory, retrieve it, and put it into the cache. Now, if you've got a long-running program, you've got billions of instructions and these cold-start effects might be minimal. But if every time you context switch you have to flush the cache, then you're always going to be faced with compulsory misses. Now, it's also, I think, a little misleading to call them compulsory, because we actually can get around them. If I know that I'm going to reference a particular block (that first reference of the block is what causes the compulsory miss), and I know that I'm doing sequential access, then I know I'm going to access the next block, and I could prefetch that block so that by the time I go to request it, the cache has it already; I avoid that compulsory miss. There's a trade-off with doing prefetching: the benefit is you can avoid some of these cold-start misses; the downside is you might be kicking something valuable out of the cache in order to prefetch something that we don't actually end up using.
Okay. The next type of miss is a capacity miss. This is where the cache just doesn't have enough space; it can't contain all of the blocks the program is touching. The solution: go out and buy a bigger cache. And when you look at processor specs, you'll see they have an instruction cache, a data cache, shared caches, L1, L2, L3, and the difference when you compare the lowest-cost i3 versus an i9 will be in the size of the caches. Cache is one of the most expensive parts of a processor because it takes up a lot of area on the die. So a cheaper processor has a smaller die, less space for cache, and then more capacity misses. And that's part of the reason why, even with the same architecture, one processor might be much slower on some workloads than another processor. Okay, another source is what we call conflict or collision misses. This is where two or more data items collide, mapping to the same location in the cache, and so we end up having to replace one to store the other, but then we go and access the thing we just replaced. So you can end up in a form of thrashing, with multiple data items that map to the same location in the cache. Solutions: again, spend more money and get a bigger cache, or give it more associativity: make it set associative instead of direct mapped, make it fully associative, or add more sets. All of that reduces the conflict misses. And finally, there's one that you don't see talked about as much, which is called coherence misses. These are a form of invalidation. For example, if I'm caching some location in memory, and there's a direct memory access (DMA) controller, an I/O controller, reading something from the disk, and it reads a block from the disk and writes it into memory at the location I have cached, I've now invalidated the contents of the cache; that DMA operation invalidated it. Or, for example, if I have a multiprocessor and another processor writes to the same memory location that I have cached, now my copy of that memory location is invalidated. A lot of work goes into the cache invalidation protocols for multiprocessors.
Okay. Now, how do we find something in a cache? We take our address and we break it up into components. Our block is our minimum quantum, or unit, of transfer: when we think about transferring something from a lower level to an upper level, we move it in blocks. When we think about moving blocks from memory into a processor cache, that block size might be something like 16 bytes or so; when we think about going out to disk, it might be something like four kilobytes that we move in at once. So at each level of the hierarchy, we think about what block granularity we're moving. Now, the data select, the lower-order bits, are used to select a byte within that block, so the number of bits we allocate to the offset controls the size of our blocks. Well, not all caches will have that: in the case of a TLB, we don't have a byte select, because we're just returning the translation. Then we have the index, which is used to identify a potential set in the cache that would have the data. And then we use the tag to actually identify whether what's stored in the cache is the thing that we're looking for. All right, so let's go through the three different types of cache organizations using this model for how we find things in the cache.
The first one we're going to review is a direct-mapped cache. For a direct-mapped cache storing two to the N bytes, the uppermost 32 minus N bits of the address are the cache tag, and the lowest M bits are the byte select, the offset into the block, which says our blocks are two to the M bytes in size. So here's an example of a cache, where each one of these lines is a block: a one-kilobyte direct-mapped cache with 32-byte blocks. If a block is 32 bytes, how many bits do we need for the byte select? It's powers of two, so five. All right, so we take our lowest five bits as the byte select; that's how we pick a particular byte within the block. We take the next five bits and use those as our cache index; again, you can see here we have 32 blocks, two to the five, so that selects the line. Then we use our tag to check whether the block that's cached there is actually the block I'm looking for. There's a comparator there; if it matches, we know we found the block we want. Now we just need the byte, and that's where we use the byte select to retrieve the actual data. That's a direct-mapped cache.
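To make the bit arithmetic concrete, here's a small sketch of the byte-select/index/tag split for that 1 KB direct-mapped cache with 32-byte blocks (the field widths follow the example; everything else is illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5            /* 32-byte blocks        -> 5 byte-select bits */
#define INDEX_BITS 5            /* 1 KB / 32 B = 32 lines -> 5 index bits       */

int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t offset =  addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    =  addr >> (BLOCK_BITS + INDEX_BITS);   /* remaining 22 bits */

    /* A hit means line[index] is valid and its stored tag equals `tag`;
     * the byte returned is then block[offset]. */
    printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
    return 0;
}
```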
For a set-associative cache, the way to think about it is that an N-way set-associative cache is basically N direct-mapped caches. All right. Now, an important consideration here is that whenever we compare caches, we assume the total number of entries is always the same. So we have not doubled the number of entries by going to a two-way set-associative cache; it's the same number of entries, just organized into these banks of direct-mapped caches that operate in parallel. Okay, so here our cache index is used again to select one of these sets, and then we take our cache tag and compare it, in parallel, against the tag of each block in the selected set.
The results of those comparators have to go through a bunch of gate logic, and that feeds into a multiplexer. The multiplexer selects the matching block and provides it, and then we use our byte select to pick the actual byte from that block. So there's a trade-off. We now have more places, two in this case, where we could store a potential item in the cache, and that's going to help us reduce our conflict misses. But it's a lot more logic: we have comparators and multiplexers, and this was just two-way. If I go to eight-way, you can't really build a single multiplexer with eight inputs; you typically have a tree of multiplexers, and each of those has a little bit of gate delay associated with it. So when we start to add up all these delays, a set-associative cache is going to run slower than a direct-mapped one. So we've made the access time a little bit higher, but we get the benefit of reducing the conflict misses. It also takes up more area, because all the multiplexers and gate logic count against the transistor budget we have on the die; it's more expensive to implement a set-associative cache. But if we're going to do set associative, what happens if we take it all the way and compare against every single entry, rather than breaking things up into sets? That's what we get with a fully associative cache. Now there's no index; our tag is everything other than our byte select, and attached to each of these blocks is a tag and a comparator. This takes the most additional logic to implement, and that's going to make it slower, but we have no conflict misses: any block can go anywhere in the cache, so I'm not going to have the problem of two blocks contending for the same place.
Now, where do we put a block in a cache? How do we decide? Let's look at an example. Here is a 32-block address space, and we've got a cache that holds eight blocks. So where would block 12 go in our eight-block cache? Well, it's easy in a direct-mapped cache: there's only one place, 12 mod 8, which is block 4. That's the only place it can go. Now we can have a problem, right? Any other block from our 32-block address space whose number mod 8 is also 4: if our stride pattern, or whatever access pattern, touches one of those, we're going to replace this block with that other request, then make a request for this block and replace that other block, and we just ping-pong back and forth. In a two-way set-associative cache, block 12 can go anywhere in set zero, 12 mod 4, which gives it two options. Again, recognize that the caches here have the same size, an eight-block cache in both cases, but now we have two different places we can put that block. That's going to reduce the likelihood that we're just thrashing back and forth like we might in the direct-mapped case. And then if we look at a fully associative cache, it can go anywhere; we reduce the conflict misses all the way down to zero.
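Here's a tiny sketch of where block 12 can land under each organization in this 8-block example (the arithmetic follows the slide; the code itself is just illustrative):

```c
#include <stdio.h>

int main(void) {
    int block        = 12;
    int cache_blocks = 8;

    /* Direct mapped: exactly one candidate line. */
    printf("direct mapped -> line %d\n", block % cache_blocks);      /* 12 mod 8 = 4 */

    /* 2-way set associative: 8 blocks / 2 ways = 4 sets, two candidate ways. */
    int sets = cache_blocks / 2;
    printf("2-way         -> set %d (either way)\n", block % sets);  /* 12 mod 4 = 0 */

    /* Fully associative: any of the 8 lines is a candidate. */
    printf("fully assoc.  -> any of %d lines\n", cache_blocks);
    return 0;
}
```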
The downside is that it takes more hardware and more comparators. Any questions? So, which block do we replace when we have a miss? Well, it's really easy with a direct-mapped cache: we only have one choice, right? The block only maps to one location. With a set-associative or fully associative cache, now we have a choice, and so we need some block replacement algorithm that picks between the potential locations we could use. Some of the algorithms we might think about: random. Why random? Well, it's really easy to implement a pseudo-random number generator in hardware. It's actually quite complicated to make it truly random, but you can do something relatively good at very low hardware cost. Another alternative: why might we use something like least recently used (LRU)? Because we have locality, temporal locality in particular: the block that was used furthest in the past is the one we're going to evict and replace with the block we want to bring in.
Okay, so here's an example of a workload, just a benchmark workload, and there are several different axes to look at for the cache organization. We have the degree of associativity: two-way, four-way, eight-way. We have the choice of replacement algorithm, LRU versus random. And then here on the rows we have the cache size. What do we see? Well, for small caches, and somewhat independent of the associativity, there's anywhere from about a 0.5 to 0.6 percentage-point difference in miss rate between using LRU and using random, which is actually pretty significant given what we saw earlier about how much the hit rate matters. On the other hand, stop and think about it for a moment: implementing LRU requires extra state. You have to have something like timestamps, and a lot of logic to actually go through and compare each entry in the set to see which has the oldest timestamp. So is it worth spending that for a 0.6% gain? Maybe, but we'd have to look a lot deeper into it, because it's going to be a lot of software, or if we try to do it in hardware, a lot of hardware overhead to implement. On the other hand, when we look at the larger caches, there's really no difference between LRU and random, and given that random is so easy to implement in hardware, we would just pick random. Now, a big caveat here, a giant asterisk I have to add: this is just one workload running across these organizations and replacement policies. Before I made a decision like that, I'd want to look at a lot of different representative workloads to determine what might be the best algorithm to use.
you know, what might be the best algorithm to use。 Okay。 Now, how do we handle writes?
If you remember, we have two choices for writes。 One approach is write for it。
So when I make a write to the cache, I also write the next level of the memory viral。
So the cache only ever contains the same data that's in the backing memory or pages。 All right。 Now。
the other alternative is write back。 So here I write things to the cache and it's only when I'm going to evict it from the cache or replace it。
But then I go and I write that block to the next level of the memory horror。
So now I got to keep track of is the data in the particular block, queen。
as in it perfectly matches what's in the backing store, or is it dirty, in which case。
I have to write it back。 I'm going to evict it from the closer。
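To make the bookkeeping concrete, here's a minimal sketch of the two policies with a single made-up cache line; `memory_write()` stands in for the slow next level, and lookup/eviction-on-miss is left out for brevity:

```c
#include <stdint.h>
#include <stdbool.h>

extern void memory_write(uint32_t addr, uint32_t value);   /* assumed: the slow next level */

struct line {
    bool     valid, dirty;
    uint32_t addr, data;
};

/* Write-through: every store also goes to the next level immediately,
 * so the cache and memory always agree and blocks are never dirty. */
void store_write_through(struct line *l, uint32_t addr, uint32_t value) {
    l->valid = true; l->addr = addr; l->data = value;
    memory_write(addr, value);
}

/* Write-back: stores stay in the cache; memory is updated only on eviction. */
void store_write_back(struct line *l, uint32_t addr, uint32_t value) {
    l->valid = true; l->dirty = true;   /* remember the block no longer matches memory */
    l->addr = addr; l->data = value;
}

void evict(struct line *l) {
    if (l->valid && l->dirty)
        memory_write(l->addr, l->data); /* the deferred write-back */
    l->valid = l->dirty = false;
}
```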
So let's think about some of the pros and cons. With write-through, I get the advantage that a read miss can't result in a write. Okay, that's a little cryptic; what does that mean? Think about it: I go to read something and I check the cache, and it's not in the cache. What am I going to do? I'm going to go to the backing memory, read that item, and store it in the cache. All right, but what if the cache is full? I have to pick a block in the cache to replace. What happens if I pick a block that's dirty to replace? Well, if it's write-through, none of the blocks will ever be dirty. If it's write-back, that block might be dirty, and I'd have to write it back to memory first. So that's a little weird, right? The processor is doing a read from memory and ends up having to do a write. That's what a read miss resulting in a write means. The other downside of write-through, though: imagine I'm doing a ton of writes. The processor operates in nanoseconds, memory in hundreds of nanoseconds, so every write I do is going to take on the order of 100 nanoseconds to go out to memory, and my processor runs 100 times slower. So the solution is to put in a write buffer. Now when I do writes, the writes get buffered and eventually make it out to memory, and I can return as soon as I put something into the write buffer. But that limits how far ahead I can get: if I fill the write buffer, I'm back to running at the speed of memory. Alternatively,
if I do write-back, then think about it: if I've got a counter that I'm just sitting there updating, all those updates are going to get absorbed by the cache with write-back. I'm writing at cache speeds, and then periodically, when I end up having to evict that block, I write it back at memory speed. But most of my writes operate at the speed of the cache, and so write-back is really good at absorbing writes and not holding up the processor on writes. The downside is complexity: I've got to keep track of which data in the cache is clean and which is dirty, and when I'm doing replacement, I've got to figure out whether to replace something clean, because that's free, but it might be something I'm using a lot, or replace something dirty and schedule the write-back. So it gets a lot more complicated,
as we'll see later in the lecture. So that's the trade-off between the two. But again, with write-back there's the issue where a read miss can cause us to replace a block that's dirty, which means we then have to do a write-back on that block. Questions? Okay.
So, administrative stuff. I have office hours on Tuesdays and Thursdays, and I've been seeing people in office hours; more people can come by. The Project 2 design doc is due tomorrow, Friday the 11th. And, hard to believe, a week from today we have our second midterm. It's going to cover everything up through lecture 16, so Monday's lecture. The TAs are going to do a review session on the 16th, the day before the exam; details will be posted. Questions? Yeah. So the question is whether the first midterm's material is in scope. The answer is sort of. Are we going to specifically ask questions about things that were on the first midterm? Not directly, but we'll assume you have not flushed it out of your cache. So we won't re-test all of the content from midterm one, but we may rely on concepts that come from it. When you're studying, remember things like your scheduling policies; please don't purge the material from the beginning of the semester. Okay. So, we've looked at all these different cache trade-offs.
What kind of organization would make the most sense for the TLB, given what we need to do here, given that it's on the critical path to getting to the cache, and so is on the critical path for every single memory operation? Whatever organization and policies we pick, they have to let us run at processor speed. (There was also a question about whether write-back is faster because you go to memory less. The answer is yes: with write-back, writes go to memory less often; with write-through, every single write has to make it to memory at some point, modulo the fact that we can buffer.) Okay, so: the TLB has to sit before the cache, right? And so it's critical that the TLB lookups, the translations, occur at CPU speeds. If they run at memory speeds, we've lost the benefit of having a cache at all, for every access to this physically addressed cache. So that argues for something like direct-mapped, or something of low associativity, rather than a crazily associative design, because we'd need a lot more comparator logic and muxes and all of the stuff that would make the lookup slow. But the counterpoint is that conflict misses in the TLB are really expensive, and that kind of argues for a fully associative design. Again, there's this potential issue of thrashing if it's direct-mapped or of low associativity. Think about it: if we use the low bits of the virtual page number as an index, that means the first page of code, the first page of the stack, and the first page of the heap are all going to collide in the same slot. My instruction fetch gets cached; then that instruction reads something from memory and stores it on the stack, and the stack access replaces the entry for the instruction; then I do a data fetch from the heap, and as soon as I write to the stack again I flush the entry for the data. And I just rinse and repeat, and again I'm running at the speed of memory, not the cache. [INAUDIBLE]
So if we look at a TLB entry, there's the translation itself, the virtual page number and the physical frame number, and there's a valid bit. Then there are some access bits: read, write, read-only, execute-only, and so on. And then there's this column, the application-specific ID. Remember that?
I'm going to come back to that in just a moment. Now, when we look at how the processor is organized (I'm going to do a little bit of CS152 architecture for a moment), you actually overlap the TLB lookup with the rest of the pipeline. So there are TLB stages in the pipeline. Here, during instruction fetch, we have the instruction's virtual address; we do a TLB lookup, which turns it into a physical address, and then we look in the instruction cache, retrieve the instruction, and begin decoding it. Then we start to fetch the registers associated with that instruction, then we do our ALU or effective-address operations, like if it's a load or store. If we're doing a load or store, we've got another address translation, so there's another TLB lookup stage, and then we're loading or storing data, so the load looks in the data cache. And then we might do a write-back into the register file if we're doing a load, or the write itself if we're doing a store.
Now, the organization: it's a 64-entry on-chip TLB. A really counterintuitive decision that they made at the time was to have a software-managed TLB. What does that mean? It means that when we have a TLB miss, it traps into the operating system, and the operating system runs software to decide which entry to replace and to install the new translation. Think about that for a moment: we're talking about operating at nanosecond scale, and all of a sudden I'm telling you that if you don't find it in the TLB, we're going to call out to the operating system and spend maybe thousands of nanoseconds to figure out what to do. How can that work? Well, it works because, if you think about it, missing the TLB on a regular basis is really expensive anyway, and we want to make the frequent case as frequent as possible. By being a little more intelligent in software about how we replace translations in the TLB, we can drive up the hit rate, and that makes the frequent case more frequent and keeps our effective access time low. That was the argument the operating system developers made to the chip architects, but it was really counterintuitive at the time; people didn't believe you could make the software perform fast enough.
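A hedged sketch of what the software side of a TLB miss could look like, reusing the hypothetical `translate()` page walk from earlier; the helper functions and the 64-entry array are assumptions for illustration, and a real handler (e.g. on the MIPS R3000) would be a short, carefully tuned assembly routine:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

struct tlb_entry { bool valid; uint32_t vpn, pfn; };
extern struct tlb_entry tlb[64];

/* Hypothetical helpers: read the faulting address and page-table root,
 * write one hardware TLB slot, and hand off real page faults. */
extern uint32_t faulting_vaddr(void);
extern uint32_t current_cr3(void);
extern void     tlb_write_slot(int slot, const struct tlb_entry *e);
extern bool     translate(uint32_t cr3, uint32_t vaddr, uint32_t *paddr);
extern void     raise_page_fault(uint32_t vaddr);

/* Invoked by the trap on a TLB miss: walk the page table in software,
 * pick a victim slot (random is cheap and works well), and refill. */
void tlb_miss_handler(void) {
    uint32_t vaddr = faulting_vaddr();
    uint32_t paddr;

    if (!translate(current_cr3(), vaddr, &paddr)) {
        raise_page_fault(vaddr);        /* no valid mapping: a real page fault */
        return;
    }
    struct tlb_entry e = {
        .valid = true,
        .vpn   = vaddr >> 12,
        .pfn   = paddr >> 12,
    };
    tlb_write_slot(rand() % 64, &e);    /* software chooses the replacement policy */
}
```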
Okay. So it also has this ASID. Remember I mentioned we'd come back to that: the application-specific ID. What it is, is a context identifier. When the operating system loads an entry into the TLB, it's saying: hey, this TLB entry is associated with this process. Why is that important? Why do I need to know which process owns a particular TLB entry? Exactly. Imagine I didn't have application-specific IDs and I don't know who owns a particular TLB entry. What happens on a context switch? We've changed the address space. That means the virtual-to-physical translations all change. That means I have to flush my TLB. That means compulsory cold-start misses. That means really bad performance. So the trade-off here is that the ASID takes up some more precious space on our processor die, but it means we don't necessarily have to flush the TLB every time we context switch to another program. Every time we switch processes and switch address spaces, we can still have valid translations stored in the TLB, because any lookup in the TLB relies on that application-specific ID: it doesn't just have to match the virtual page number, it also has to match the ASID in order to be a valid translation.
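For instance, the hit test from the earlier TLB sketch would grow an ASID comparison, roughly like this (field names are made up for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

struct tlb_entry {
    bool     valid;
    uint8_t  asid;    /* which process installed this translation */
    uint32_t vpn, pfn;
};

/* A hit now requires the ASID of the currently running process to match,
 * so entries from other address spaces can stay resident across switches. */
static bool tlb_hit(const struct tlb_entry *e, uint8_t cur_asid, uint32_t vpn) {
    return e->valid && e->asid == cur_asid && e->vpn == vpn;
}
```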
All right, so a little bit of a CS152 digression. The way I showed you the TLB organization, we have the TLB, then we have the cache, and even if we get a TLB hit, we first have to finish that translation before we can do anything with the memory cache. There's actually a way we can overlap looking up in the cache with doing the translation lookup in the TLB, and it works like this. Okay, think about our virtual address: we do the TLB lookup, and that gives us the physical page number; the offset we already have, it's the low bits of our address, 12 bits here since we're using four-kilobyte pages. So machines with TLBs and caches do a trick that allows them to simultaneously start looking in the cache while doing the address translation, and it works when we set things up so that there's perfect overlap between the page offset and the cache index plus byte select. Why does this work? Because we have the offset right away; we don't change the offset when we do translation, since the offset is within a page. We have those bits available immediately, and we're just waiting for the bits that come from the TLB, the physical page number. All right, so here's how it works. We take our cache: it's a four-kilobyte cache, and each block is four bytes in size. How many bits do we need for the byte select? Two. And for the index? Ten, to pick one of the 1,024 lines. So together that's the lower 12 bits covering our index and our byte select, and we can simultaneously select the particular line we want, the block, and select the byte within that block. Then, as soon as the TLB gives us the physical page number, we can compare it against the tag: if it matches, here's the byte; if it doesn't, it's a miss in the cache and we go off and do the usual miss handling. But it allows us to overlap the lookup in the cache with the actual translation step.
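The condition being exploited is just bit arithmetic: the cache's index plus byte-select bits have to fit inside the page offset, i.e. (cache size / associativity) must not exceed the page size. A tiny check with the example's numbers:

```c
#include <stdio.h>

int main(void) {
    unsigned page_bytes  = 4096;   /* 4 KB pages -> 12 untranslated offset bits */
    unsigned cache_bytes = 4096;   /* 4 KB direct-mapped cache                  */
    unsigned block_bytes = 4;      /* 4-byte blocks -> 2 byte-select bits        */
    unsigned ways        = 1;      /* direct mapped                              */

    /* Address range covered by the index + byte-select bits of one way. */
    unsigned bytes_indexed_per_way = cache_bytes / ways;
    (void)block_bytes;             /* block size only splits index vs. offset bits */

    if (bytes_indexed_per_way <= page_bytes)
        printf("overlap works: cache can be indexed from the page offset alone\n");
    else
        printf("no full overlap: indexing needs translated bits (e.g. an 8 KB cache)\n");
    return 0;
}
```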
Okay, now, if we make our cache eight kilobytes, we no longer have perfect overlap; take CS152 and you'll learn tricks for dealing with that. Okay, another option would be to make our caches virtually addressed. If our caches were virtually addressed, then we could look up directly in the cache with the virtual address. What's the disadvantage of a virtually addressed cache? Exactly. Every time we did a process switch, we'd have to flush the cache, or we'd have to add application-specific identifiers to the cache as well, which would increase complexity. And there's a third reason: caches are shared. Imagine we want to share the same physical memory across multiple processes. If they have different virtual addresses for that same physical address, then even with application-specific identifiers we could end up with multiple copies of the same data in the cache, an aliasing problem. So for all of those reasons, we typically make our caches physically addressed. Okay. (There was also a question about how we can read more than one byte at a time: we can read up to a word, as long as our block size is at least the word size of the machine, which it is on most machines, so in this case we could read a word at a time. Again, if you take CS152 or CS252 you'll see all sorts of different organizations with larger block sizes that can handle larger transfers.)
Okay, so here's an actual modern example: modern x86 processors, Skylake, Cascade Lake, and so on. It's a little hard to see on the slide, but there are three TLBs. There's a TLB down here, which is a data TLB; there's another TLB up here, which is an instruction TLB; and then there's another TLB here that's a shared second-level TLB. There was a question about whether each different cache has its own TLB: no, the TLBs are separate from the caches. So in this case we can have, for example, a bunch of different caches: an L1 instruction cache, an L1 data cache, an L2 combined instruction-and-data cache, and an L3 that's shared across multiple cores, so the L1 and L2 caches are per-core and the L3 over here is shared across multiple cores. Similarly, we can have multiple TLBs: an L1 instruction TLB, an L1 data TLB, and then a shared L2 TLB. And again, there are all different sizes for these, and all kinds of associativities, some of them odd, but essentially N-way set-associative designs. This is all the result of the designers looking at lots of different workloads and then deciding, for the cost of making the chip, the limit on the transistor budget, and the price-performance point they're trying to hit, what the right architecture is. The more server-class the processor, the larger all of the caches are going to be; the more consumer-oriented it is, if it's going into a Chromebook or some inexpensive platform, the smaller the caches will be. It's all a trade-off in terms of capacity versus price versus performance.
Okay. So what's going to happen on a context switch? Well, again, if we don't have application-specific IDs, we've just changed the virtual-to-physical mapping, so those TLB entries are no longer valid and we need to flush them all. And that's going to be expensive, because we context switch all the time between processes. Alternatively, we put an application-specific ID or process ID in the TLB, and then we don't have to flush. So most modern architectures include some form of process or application-specific ID that gets stored in a field of the TLB entry, and that way we don't have to flush. What happens if we change the translation tables, say we move a page from memory out to disk, or vice versa, we bring a page in from disk into memory? Well, now there's going to be a difference between what the page table entry says in the page table and what's in the TLB. So then we have to invalidate that TLB entry. That's called TLB consistency.
With virtual index cash。 You need to flush the caption。
So that's again a reason why we do not want to have a work to get passed because every time we come to switch。
we have to flush the path。 So let's pull all of this together。
So what we saw at the beginning of this lecture was, in general, we have a virtual address。
We take the offset and copy it over to the physical address。 The memory management unit uses the page table base register to find the root table。
uses our P1 index to index into that table to find the physical page frame for our second level。
uses our P2 index to index into that second level table, and gets the physical page frame。
we're looking for in memory, and then the offset gives us the data that we want to retrieve。
We can replace this translation process with a TLB。 All right。
so we now just provide the virtual page number to the TLB and it gives us back the physical page number。
Similarly, when we actually go to read the data or write the data in that location, we can use a cache。
All right, and so we're going to take our physical address, divide it up into a tag, an index, and a byte select。
use the index to select one of these sets。 the tag to match, and then the byte select to actually pick out the data that we're going to retrieve。
Okay。 So this combination of TLB and caching again allows the operation to run closer to processor speeds instead of operating at memory speeds。
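To make that flow concrete, here is a minimal Python sketch of the lookup order just described, TLB first and then a two-level table walk。 It is an illustration only: the structures and names (tlb, page_dir, PAGE_SIZE) are assumptions for the example, not the real hardware interface。

```python
# Minimal sketch (not real hardware): translate a virtual address using a TLB,
# falling back to a two-level page-table walk on a TLB miss.
PAGE_SIZE = 4096  # 4 KiB pages, so 12 offset bits

def translate(vaddr, tlb, page_dir):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                        # TLB hit: no table walk needed
        return tlb[vpn] * PAGE_SIZE + offset
    p1, p2 = vpn >> 10, vpn & 0x3FF       # 10-bit P1 and P2 indexes
    second_level = page_dir[p1]           # walk level 1
    pte = second_level[p2]                # walk level 2
    if not pte["valid"]:
        raise RuntimeError("page fault")  # handled by the OS, as discussed next
    tlb[vpn] = pte["ppn"]                 # cache the translation for next time
    return pte["ppn"] * PAGE_SIZE + offset
```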
All right, so what happens when that virtual-to-physical translation fails?
The MMU walks the tree and it can't find the page, or we just don't have permission, say, we tried to write a read-only data area。
This is going to cause a fault or trap。 It's synchronous。
It interrupts the current instruction, causes it to be terminated, and traps into the operating system。 Now。
in the kernel we run the page fault handler and we realize, oh。
this is actually something we can handle。 So it might be that, oh, we ran off the end of the stack。
We'll just allocate another stack page。 Or we ran off the end of the heap。
We'll allocate additional data pages。 Or it could be we did a fork, and then a write。
to a page marked copy-on-write, so we actually now need to copy it and update the page tables in both the parent and child。
Or it could be that that page is out on disk。 We can use memory as a cache for the disk。
And if you think about it, this kind of flips things upside down。 You normally think of, you know。
the software running on top of the hardware。 Now it's kind of the other way around: the hardware is going to fault and then it's going to ask the software。
what do I do next? The software is going to tell the hardware what to do next, instead of the other way around。
But this notion of demand paging again is really powerful。 Think about modern programs。
I'm always amazed, because every time I buy a laptop, every time I buy a computer。
I just double the amount of physical RAM I put in it。 And yet I always run out of memory。
Because the programs get bigger。 Take PowerPoint。 It just gets bigger every year。
Does it do anything different than, like, show slides? I don't know。
I think now they actually have this whole mode where it does some machine learning and it will automatically。
coach you on your presentation。 I don't really need that。 But there it is。 Right。
And so it's a challenge。 Our programs are getting bigger and bigger。
But even as we add more memory and grow the amount of memory per machine。
we want to run more things。 We want to do more things with our machines。 Right。
But if we actually look at those programs, we spend most of our time in like 10% of the code。
I'm not using all of those optional features right now。
even though the whole thing is loaded and running on my machine。
So we really don't need to keep all of that in memory。 The PowerPoint binary is probably, like。
you know, a couple of gigabytes; I don't need to use up a couple of gigabytes of my physical memory for it。
If we think about what is actually running in a program at any given moment, it's much smaller。
And that's what we want to make sure we keep in memory。
So we're going to use memory as a cache for the disk。 Right。 Because memory is expensive and small。
And it's the same thing we were doing with our on-chip caches。
We only keep the things in the cache that are being actively used; anything else stays on disk。 Yeah。
So the question is about the difference between software and hardware page fault handling。
and why we don't implement everything in hardware。
So the hardware fault occurs because the memory management unit encounters some error。 Like。
there are all these little checks that it does: did we run off the end of the table? You know。
is the segment valid, is the entry valid, does the access match the permissions, does。
the requestor's permission level match the permission level on the PTE?
That's what causes the fault。 Now you could try, as a hardware developer or architect, to figure out。
what are all the possible faults that can happen and how should I handle all of those?
But then I greatly constrain what my operating system can do。
So what we see is when the hardware encounters a situation like that, it throws up its hands。
and says, I have no idea what to do。 Operating system, you tell me what to do。
That's actually really good, because now, as an operating system developer。
I can decide what's the best way for me to handle this, and how I handle it might be different。
If I'm building a server, I'm going, to handle it differently than if I'm。
building a laptop for interactive use。 If mostly what I'm running is lots of big batch jobs。
that's different。 If I'm building a database engine。
I might implement it differently from how I implement。
page fault handling on a network file appliance。 So by doing things in software。
we get a lot more control than we get in hardware。 OK, so we are just about out of time。
What I want to leave you with is one more slide。 So think about how page faults flow。
In the normal case, we generate an instruction, go to the memory management unit, find the page。
in the page table, and the operation will conclude as intended: read, write, instruction fetch。
But we can also have the case where the memory management unit。
walks through the page table, or looks in its cache, the TLB。
and we find it's not in the page table。 We generate the page fault。 And so in that case。
the operating system now takes over。 Once the page fault handler figures out where the page is。
on disk, it pulls it into memory, updates the page table entry, and reschedules the thread。
so it puts it back on the ready queue。 Now we can retry the instruction, and the operation will succeed。
And so what we're going to talk about next time is exactly how that page fault handler works。 OK。
thank you。
P16:Lecture 16: Memory 4 Demand Paging Policies - RubatoTheEmber - BV1L541117gr
Okay, let's get started。
So this is our fourth lecture on memory and we're going to look at demand paging and paging。
policies。 Okay, so remember from 61C that we can compute the average memory access time。
So if we look at what the time is going to be to access, say we have a processor with。
an L1 cache and DRAM, we want to figure out what's the average that it's going to take。
us to access any given memory location。 Probabilistically。
that will be the hit rate for finding it in that L1。
cache times the hit time plus the miss rate times the miss time。
Now since the hit rate is 1 minus the miss rate, we can just simply reduce that down to。
the average memory access time being our hit time in the L1 plus the miss rate for our。
L1 times the miss penalty for the L1。 Now if you think about it。
we're not just looking at the time to go from one cache to, memory。
but there could potentially be multiple caches in our hierarchy。 So for example。
if we don't find it in memory but it's out on disk or we don't find it in。
our L1 but it's in our L2 cache, then that miss penalty for the L1 becomes our average。
memory access time for the L2。 So that looks just like this。
So it just becomes our miss penalty for the L2, our average memory access time rather for, the L2。
So we can just take each one of our layers of the hierarchy and just compute the average。
memory access time as the time if we hit at that layer or the average memory access time。
if we have to go to the layer below, which could include the layer below that, which could。
include the layer below that and so on and so on。
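As a concrete illustration of that recursion, here is a small Python sketch。 The latencies and miss rates are made-up numbers, chosen only to show the calculation, not measurements of any real machine。

```python
# Average memory access time (AMAT), computed recursively down the hierarchy:
# AMAT(level) = hit_time(level) + miss_rate(level) * AMAT(next level)
levels = [                      # (name, hit time in ns, miss rate) -- illustrative values
    ("L1", 1, 0.05),
    ("L2", 10, 0.02),
    ("DRAM", 100, 0.0001),      # misses here go all the way to disk
]
DISK_NS = 10_000_000            # ~10 ms expressed in nanoseconds

def amat(levels):
    if not levels:
        return DISK_NS
    _, hit_time, miss_rate = levels[0]
    return hit_time + miss_rate * amat(levels[1:])

print(f"AMAT = {amat(levels):.2f} ns")   # dominated by how often we reach the disk
```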
So ultimately, this is all about trying to make everything look like we're accessing it, on chip。
on die at those speeds, even though that's incredibly expensive memory and we。
have an incredibly limited amount of that memory, we want to make it look as if we have。
as much memory as we have at our slowest technology and our least expensive per byte。
But notice again here the difference in access time, right?
We can access stuff in the chip in the speed of sub nanosecond to nanoseconds。
Going to main memory jumps to 100 nanoseconds, and going all the way out to hard drives is 10。
million nanoseconds。 So that's the challenge for us today is we want to look at access times。
make it look, like main memory access times, 100 nanosecond。
but we're going to implement it using storage, that has speeds that are 10 million nanoseconds in latency。
Okay, so remember in the ideal case, we have an instruction fetch or an instruction that。
loads or stores from memory that generates a virtual address, provide that to the MMU。
it either finds it in the TLB or if it's not in the TLB, it'll walk through the memory structures。
the page table entries and it'll find the page table entry for the physical page that, we need。
return that, combine that with the offset and we get the data we're looking for。
That's the ideal case, but today we're going to talk about the other case which is we go。
to make the reference, we look in the page table, and we find the entry is invalid。
That causes a page fault, which causes an exception or trap into the operating system, terminates。
the running instruction, and runs the page fault handler。
The page fault handler looks at the operating system's bookkeeping, maybe at the page table entries。
and figures out that that page is actually out on disk, it schedules that page to be loaded。
into memory, once it's loaded into memory it updates the page table entry, invalidates, the TLB。
puts the thread back on the ready queue and eventually gets scheduled, we retry。
the instruction and it'll run to completion。 Okay。
so we can think of demand paging as treating main memory as a cache for our slower。
SSDs or hard drives。 So if we say it's a cache then we have to ask all of the questions that we'd ask of any。
cache which is what's the block size, so what's the unit of transfer from the cache to the。
backing thing behind the cache and vice versa。 In this case it's going to be a page。
so you know four kilobytes。 We can also ask what's the organization?
Last time we looked at direct mapped, set associative, and fully associative。
Well, in this case it would make most sense to have this be fully associative, right, because。
every page in main memory is equivalent。 You can put any page in any place。
The virtual to physical page mapping allows any placement。 So it's fully associative。
Now how do we locate a page? Well first we look in the TLB。
if we miss in the TLB the memory management unit traverses。
the page tables and segment map and whatever to find that mapping from virtual page number。
to physical page frame。 Now what's our page replacement policy?
So when we have to evict the page from memory which page do we pick to evict?
Well there's lots of policies we could implement。 We could implement LRU。
We could implement random or something else, right。 It's going to require a lot more explanation。
We're going to actually spend a lot of time thinking about the policy。 Why? Because if we miss。
if we replace the wrong page and then we access that virtual page right。
after that then you know 10 million nanoseconds to go and do that access instead of 100 nanoseconds。
Yeah, question。 Yeah, so why do we want a fully associative organization?
So in this case right every physical page is identical。 So we can place a。
because of the page table allows us to map any virtual page number to。
any physical page number we can place a page anywhere in physical memory。
Right now we'll see later on that we'll sometimes treat physical memory differently。
Different chunks in different regions differently。 But in general it's fully associative。 We have no。
there's no limitation in the hardware that says you can't put this virtual page at。
a given physical page as long as it's not used for IO or other purposes。
We'll impose some policies on how we use physical memory but there's no strictly you know limitations。
in the hardware saying you can't map some virtual address to some physical address。
So that gives us perfect flexibility because again we're going to get into a moment in。
a moment we'll talk about like misses and if we limited where we could put a page then。
potentially we could have conflict misses。 Right? Okay now。 What happens on a miss right?
We're going to go to the lower level whatever is backing memory to fill that page and retrieve。
that page。 So a disk in this case。 Now what happens on a write?
Do we want it to be write-through or do we want it to be write-back?
So if we think about it, if we made it write-through, then if we're writing lots。
of stuff on the stack, or we're, you know, writing characters in a string, we're going to be doing。
lots of writes directly to the disk。 That's going to be really slow, so of course we're going to want to do write-back。
But if we do write-back, now we need to keep track of the fact that a page in memory could。
have dirty data, that is, what's in memory is different from what's on the backing store。
for that page。 So we can't just simply reallocate that page to someone else when we do an eviction。
Okay。 So all of this is because we want to provide this illusion of infinite memory。
So we say to a process you have a 32 bit virtual address space for gigabytes worth of virtual。
memory and then we use the hardware the TLB and the page table to map that to what might。
be a much smaller physical memory。 So we say you have four gigabytes but we might actually only have 512 megabytes on our machine。
And so we use the disk which let's say is 500 gigabytes as backing store。
So some of that virtual address space lives in memory and some of that virtual address。
space lives out on disk。 Right。 The disk is much much larger than physical memory。
So we can use it as backing store for our virtual for our virtual addresses。
The other reason why we want to think about this is because we don't just have one program。
running we have many programs running。 Each of which we're saying here's four gigabytes of virtual address space。
So the fraction of each program that's in memory could be much much smaller than its。
virtual address space and the rest of it will live out here on disk。
Now we want multiple programs running for concurrency。
If one program is waiting for its pages to come into memory, or it's waiting on IO, or。
it's waiting on the user。 We want to run other stuff on our processor and we can do that。
So the principle here is one of transparent level of indirection。 Right from the point of view。
semantically of the program。 It's virtual address space。
It can touch or access any of the bytes in it。 Right。 That are defined that are that are mapped。
But it doesn't know physically where that data lives。
That data could physically live in physical memory or it could live out on the disk or。
it could live on another machine。 Rather than paging to a local disk。
we could actually page across the network to a disk, or a server on the other side of the planet。
That's from a semantic and a correctness standpoint。 Now from performance standpoint。
you could actually time memory references and see when, you're page faulting。
when something's not in memory and then you'd see a difference。 But from a correctness standpoint。
from a semantic standpoint, it doesn't know the difference。
between a page that's in memory and a page that's not in memory。
So that's valuable。 Okay。 We've seen this picture a bunch of times an Intel page table entry。
The things we want to take away for today are three sets of bits。
The present bit which is like the valid bit tells us this entry refers to a page in physical。
memory。 The access bit which tells us whether we've recently accessed that page, I'll give a formal。
definition of that in just a moment。 And then the dirty bit which tells us whether the page reflects what's out on disk。
So is it clean or is it dirty? Have we written to it and so changed its contents relative to what might be stored out on disk?
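As a rough illustration, here is a Python sketch that decodes those three bits from an x86-style page table entry。 The bit positions used (present = bit 0, accessed = bit 5, dirty = bit 6) follow the standard 32-bit x86 layout, and the example value is made up。

```python
# Decode the three PTE bits discussed above from an x86-style page table entry.
# Bit 0 = present, bit 5 = accessed, bit 6 = dirty (standard x86 positions).
def decode_pte(pte: int) -> dict:
    return {
        "present":  bool(pte & (1 << 0)),   # page is in physical memory
        "accessed": bool(pte & (1 << 5)),   # hardware set this on a recent reference
        "dirty":    bool(pte & (1 << 6)),   # page has been written since it was loaded
        "frame":    pte >> 12,              # physical page frame number (4 KiB pages)
    }

print(decode_pte(0x0001B061))  # made-up PTE value: present, accessed, dirty, frame 0x1B
```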
Okay。 So some of the mechanisms。 So the page table entry is what lets us do demand page because it gives us this transparent。
layer of indirection。 We look at the bits in the page table entry and they tell us where to find the page。
So if it's valid, it tells us the pages in memory and the page table entry has the physical。
page frame number for that virtual page number。 If it's not valid。
it means the page is not in memory。 Okay。 So then the OS uses the page table entry or other operating system。
bookkeeping structures to find that page。 Now if you reference an invalid page table entry。
then that's going to cause the memory, management unit to trap to the operating system。
It's called the page fault。 And what will the operating system do?
Well assuming the page is valid and it's sitting out on disk, it's going to want to pull it。
into memory。 But we might have to find an old page to replace。 Right?
Because if our memory is all in use, physical memory is all in use, we want to bring a new, page in。
somebody has to get evicted。 And we're going to pick an old page。 Right?
Because if we haven't referenced that page in a long time or use that page in a long time。
then you know, we're not likely to use it again。 Well。
we're going to go through a long detail what we mean by old and all of that。
But for now we're just going to pick some page to replace。 All right。 If that old page is dirty。
we're doing write-back。 So we have to write it back to the disk so that now the disk is consistent with that page。
We're then going to change the page table entry and any cached TLB entries for that page。
to be invalid。 Because it's not in memory anymore。 It's out on the disk。
Then we get to load the new page into memory from the disk。
We need to update its page table entry and invalidate any TLB entries, which previously。
said it was invalid。 We're going to flush those TLB entries。
And now we can put the thread back on the running queue and then the ready queue rather。
And then eventually it will run。 Right? So we'll continue that thread from the faulting instruction。
It'll restart the instruction。 And the instruction now will succeed。 All of those steps。
Yeah, question? For the new or the old? For the new entry。
So the question is, why do we invalidate the TLB entry for the new entry?
Because that TLB entry, when we ran through the memory management unit, it said, oh, this。
page is invalid。 Here's the TLB entry I found。 We put that in the TLB。
So now we have to invalidate that, because we'll reload the TLB with the correct, valid page。
table entry when we restart the instruction。 That's an important little subtle step。
So that was a really good question because it's easy to miss that。
Everything is getting cached out of the memory management unit even when we find that a particular。
entry is invalid。 Yeah。 Question? Same question。 Okay。 All right。 So now, as I just said。
the TLB for that new page, it's going to get loaded as soon as。
that faulting instruction restarts, because it's not going to be in the TLB。
The memory management unit will walk the tables, find the page table entry, and we'll put that。
into the TLB。 Yes。 So that's another good question。
We're about to dive into that in gory detail in just a moment。
But the question is, how does the operating system figure out where this page is out on, disk?
All right。 So, all that thought, we'll come back to that in just a moment。
But that's going to be one of the challenges is we're going to need some bookkeeping information。
and to store it somewhere like maybe in the page table entry of where to find that particular, page。
Yeah。 Question? Yeah。 So the question is, how do we invalidate any cache TLB?
So we just go through the TLB and tell the TLB, you know, this mapping for this virtual page。
number and this application specific ID is invalid。 And then the TLB will flush it。
So we're just flushing that specific entry if it's cached in the TLB。 Okay。 Now, this takes time。
This could take a lot of time。 Again, tens of millions of nanoseconds。
So while that process is occurring, the process is sitting out, the thread is sitting on a。
weight queue for that page and the operating system just schedules other threads to run。
So this is again a reason why I want to have a lot of threads available is because there。
might be multiple threads that are waiting for pages to come in from the disk。 Okay。
So lots of things that we can do once we have page faults and demand paging。
And this is a lot of this is operating specific。 Some operating systems will implement things different ways or support functions or not。
But sort of the broadest set is we can use it for things like extending the stack, run。
off the end of the stack, and automatically operating system just extends your virtual。
address space with an additional page for the stack。 Same thing with the heap。
run off of the end of that。 It just gives you another page。 We went over last time。
copy on write when we do forking。 We just simply copy the address spaces。
the page tables for the parent and child so they're, identical。
And then we mark everything read only。 When we page fault on the right we check the books to see actually this is a writeable page。
And so then we copy the page and update the page table entries in both the parent and the, child。
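Here is a toy Python sketch of that copy-on-write fault handling, just to make the bookkeeping concrete。 The data structures are invented for illustration; a real kernel also tracks how many processes share each frame and may hand write permission back to the last remaining sharer。

```python
# Toy copy-on-write model: parent and child share frames marked read-only,
# and a write fault gives the writer its own private copy.
frames = {0: bytearray(b"shared page contents")}                 # frame id -> data
parent = {0x1000: {"frame": 0, "writable": False, "cow": True}}  # vpn -> PTE
child  = {0x1000: {"frame": 0, "writable": False, "cow": True}}

def write_fault(page_table, vpn):
    pte = page_table[vpn]
    if not pte["cow"]:
        raise PermissionError("genuine protection violation")
    new_frame = max(frames) + 1
    frames[new_frame] = bytearray(frames[pte["frame"]])          # copy the page
    page_table[vpn] = {"frame": new_frame, "writable": True, "cow": False}

write_fault(child, 0x1000)            # the child writes and gets its own copy
print(parent[0x1000], child[0x1000])  # parent still shares the original frame
```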
Exec。 So when we call the syscall for executing a new executable, exec。
we create a new virtual address space。 We don't actually have to load everything in up front。
We can just simply load in the parts of the binary that are actually being referenced, on demand。
So this is really nice。 If I've got PowerPoint and I don't know。
it's like a 2 gigabyte program nowadays, I don't, tie up 2 gigabytes of my physical RAM just for the code。
The only code that's actually going to get loaded in is that code that I'm actually using。
as I use it。 So this means it starts right away and it's memory, physical memory footprint is going。
to be much smaller than its total virtual address space footprint。 Another example would be。
we'll see this later on in the semester, I want to memory map。
a file into memory and now I can just do reads and writes and I can access that file。
So that's a technique we use for, for example, the code segment of a program。
We just memory map it into memory and then run off of that。 Okay。
So loading and executable into memory。 So this is a classic thing that we're going to do。
We have some XC or executable out on disk in the file system and it contains our code。
and our static initialized data and then relocation tables, a bunch of symbols for debugging and。
other sorts of stuff。 So the operating system is going to load this into memory。
initialize the registers, initialize the stack pointer and then it's going to call the C runtime in it。
CRT0, the, start procedure or function。 So if we think about it, right。
any pages that we're utilizing in our virtual address。
space are going to be backed by pages on disk, blocks on disk。 Right?
So you can think of on the disk, we've got our, there we go, we've got our stack, our, heap。
our data, our code pages。 So all of the virtual address space that's in use for every page in use in memory。
we, have a blocker page on disk that's backing it。 Right? Now。
we can swap these pages back and forth on demand as needed。 We do this for every single process。
So here we have our page table and it's mapping our kernel, our stack, our heap, our data。
to various physical pages。 Right? Now for all of, so for all the other pages。
the ones that aren't mapped, we have to keep, track in the operating system of where those pages are。
They're going to be out on disk。 Right? So here they are。
here are the pointers out to all of the pages that are not in memory。 Right?
So there was the question, how do we do that mapping from a given block or a given page。
a virtual page number to a location on the disk? Well。
typically we have some kind of backing store out on the disk。 Right? And in older operating systems。
we had a special partition, which was our paging partition。
And it was managed by the operating system。 And so it was one contiguous region。
And so it went from like zero to n。 And so there would be a function that would given a process ID and a page number would。
return you the block within that partition。 Now, modern operating systems。
it's just simply a paging file。 It reuses the existing file system mechanisms。
And there are many reasons why we can do that。 That way it can grow to scale with the size of processes。
It doesn't have to be predefined when you create the disk partitions。 And so instead。
it'll be some kind of logical index into that file that'll tell you where, define a given block。
Now, depending on your page table entries, their sizes and structures, some operating。
systems just simply store that logical offset in the page table entries。
You just look in the page table entry and it gives you that actual pointer to this location。
In others, there'll be some bookkeeping data structure, which when you call this find block。
function, we'll return back the associated disk block。 Right? Okay。 Let's see, what else?
Something like the code segment。 Code segment, as I mentioned。
we can just memory map that directly into the virtual address, space。 Right?
So what that would look like is here, when we need to, instead of mapping stuff for the。
code to some location on the disk, like here in our swap file, instead, we could just map。
it back to our executable。 This is, again, what most modern operating systems do。
And when you're doing the project or you're doing the homework, you've probably run into, this。
You're debugging your homework and you're running your program and you realize, "Ah, I have a bug。"。
So you go into your IDE, make a change to the source and you compile it and the compilation。
fails because it says the executable is not writeable。 The executable is locked。
And when you launch a program, the executable is locked because we're using it as backing。
store for virtual memory。 Right? And so it's actually got reference counting on it because you could have multiple people。
who are running off of the same executable and you can't change that executable until。
all of those running copies stop running and you release all those locks。 And so that, yeah。
that's the second part。 You could share it with multiple instances that are running。 Right? Okay。
So, again, we have our page table, which is mapping in both directions。
It's mapping to pages that are stored in memory。 It's also mapping to pages that are stored out on disk。
But we can have multiple programs running the same executable。
So here we've got a second one that's running the same executable。 And so if we look。
it'll have separate stack heap and data, but it'll actually have the。
same code that it's paging off of。 And again, this could be the actual executable that we're paging off of to save disk space。
Right? So when, and of course it's got pages that are in memory also。
So now when we go to make a memory reference, right, we may find that that memory reference。
is not present。 So we're going to page fault。 And when we page fault。
we'll switch from what's the active process and each page table。
to the other process while we schedule the read of that page to occur。 All right。
So we're getting concurrency running now。 Eventually that page gets pulled into memory。
Page table entries get updated。 And now we can switch back to that being the active process。
Eventually it'll get scheduled and it will run。 All right。 Questions? Yes。 Yeah。 So the question is。
does each page in memory have a copy on disk? The answer again is sort of。
So for kernel stack heap and data, yes。 For things like code。
it's going to be a many processes sharing the same physical page on, disk。
And the same physical page in memory also。 Yeah。
So the question is inclusive versus exclusive for caches。
and do we have memory regions that we don't cache? Absolutely。
There are memory regions that we don't cache。 Like for example。
if you have a GPU and it's memory mapped into the virtual address space。
that would be uncacheable memory。 Because all the writes are going directly to, say, the frame buffer。
Similarly, you can have that with network adapters or disk adapters。 Yeah。
If you can kind of think of it, that's the cool thing。 Again, with the page table entries。
If you go back and look at the page table entry, in fact, you'll see that there's a mark for。
whether something's cacheable or not cacheable。 Oh。 OK。
So there was a question I missed about sbrk。 Yeah。 So if you look at brk and sbrk。
you'll see they change the size of the data segment。 And for mmap。
you'll see how you can use mmap to map a file into an address space。
And then you can map that file into multiple address spaces and actually do sharing through it。 OK。
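For reference, here is a small user-level example with Python's standard mmap module。 The file name is a made-up placeholder, and this shows application-level mapping, not the kernel's internal use of the executable as backing store。

```python
# Map a file into the address space and read/write it through memory operations.
import mmap

with open("demo.bin", "r+b") as f:             # hypothetical existing, non-empty file
    with mmap.mmap(f.fileno(), 0) as mapping:  # length 0 = map the whole file
        first_byte = mapping[0]                # reads come through the page cache
        mapping[0:5] = b"HELLO"                # writes are reflected back to the file
        mapping.flush()                        # force dirty pages out to the file
```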
So kind of summarizing everything that we put together in terms of how we handle main。
memory access, we have an instruction here like load some location in memory。
When we go to walk through the page tables, we find that the page table entry is marked, as invalid。
That causes us to trap into the operating system, run our page fault handler。
It locates the page on backing store, schedules the page to be loaded into memory in a free, frame。
Creating a free frame might require evicting someone else。 If that page is dirty。
then we have to write that page back to the backing store。 Eventually it gets loaded。
We update the page table and validate the TLB and then we restart the instruction when。
that thread gets scheduled again。 All right。 That's page fault all on one slide。 All right。
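Pulling those steps into one place, here is a schematic Python sketch of a page fault handler。 Helper names like choose_victim, find_on_backing_store, and read_page_from_disk are placeholders for whatever a real OS provides; this is a sketch of the steps above, not actual kernel code。

```python
# Schematic page-fault handler following the steps above (helper functions are
# placeholders, not a real kernel API).
def handle_page_fault(vpn, page_table, tlb, free_frames):
    if not free_frames:                          # need to evict someone first
        victim_vpn, victim_frame = choose_victim(page_table)   # replacement policy
        if page_table[victim_vpn]["dirty"]:
            write_page_to_disk(victim_frame, victim_vpn)       # write-back
        page_table[victim_vpn]["valid"] = False
        tlb.pop(victim_vpn, None)                # keep the TLB consistent
        free_frames.append(victim_frame)

    frame = free_frames.pop()
    disk_block = find_on_backing_store(vpn)      # OS bookkeeping, e.g. swap file offset
    read_page_from_disk(disk_block, frame)       # thread sleeps while the I/O happens
    page_table[vpn] = {"frame": frame, "valid": True, "dirty": False}
    tlb.pop(vpn, None)                           # drop any stale "invalid" entry
    # put the thread back on the ready queue; the faulting instruction is retried
```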
Lots of questions。 So the question is, so the code section isn't actually loaded into memory。 Yes。
code segment is loaded into memory。 It's just backed by the executable on disk and we load from that executable on demand。
rather than loading all of the code into memory when you start running a program。
Although we'll see, there's a little caveats and there are optimizations that we can do。
that we'll talk about at the very end of the lecture。 Okay。
So a bunch of questions that we have to answer。 Like I just said, we need a free frame。
When a page fault occurs, if memory is all full。 So where do we get the free frame from? Well。
operating systems typically will have a free list。
So they'll have some process like a reaper process or other process that looks for frames。
to evict and evicts those frames。 If they're dirty, then it schedules them to be written out。
It zeros them and then puts them on the free list。 That way the operating system when it goes to。
you know, pull a frame in always has some, free location that I can put it。 And again。
we'll talk about some ways that we can manage that free list at the very end, of the lecture。
Now how do we organize the choice of what frame we're going to evict? Well。
that really gets to our replacement policy。 Another question we can ask is around how many page frames do we give each process?
Is it a one over N? Is it proportional to the virtual address space size? Is it priority based?
You know, we can also ask questions about utilization and fairness。
And we can even extend that all the way down to thinking about things like the disk bandwidth。
Right? How much disk bandwidth does a given process get to use?
So these are all questions that operating systems will look at in terms of how they allocate。
physical memory to individual processes。 The question is。
so we would never perform a read and a write to the disk at the same, time。
So you're either reading or you're writing the disk。
The disk executes commands on behalf of the processor。
So the processor can give a queue of commands that are reads and writes to the disk。
And the disk will do those reads and writes in the order it's given them。
Or as we'll see later on in the semester, there's all sorts of optimizations we can do。
in scheduling the disk。 Every resource in a computer can be scheduled and managed。
And we're going to want to do that for efficiency and performance。 Okay。
So if we think about what a program is doing when it's using its virtual address space。
So it creates, we create a process。 It's got a virtual address space。
Is it using all of that virtual address space at the same time? Well。
if we look over a window of time, what we're going to find is the following。 Right?
So here we have time on the x-axis and we have our addresses on the y-axis。
And what we'll see is that the program moves through a set of working sets。
So at any given delta of time, we're going to see a set of accesses coming from that, process。
Instruction fetches, loads and stores。 You can see it's not accessing all of its address space at any given time。
So what that tells us when we're thinking about it is that working set represents the。
pages of the application in its virtual address space that we want to make sure are。
physically resident in the processor, in the physical memory。
The other way to think about it is if these pages, for example, at this given time, if。
these two sets of pages are not in physical memory, we're going to take page faults to。
bring them into physical memory。 That's one thing to consider。
Another thing to consider is that the working set changes over time。 It's not a static thing。 Right?
In different phases of time, we've got different regions that are active。 Question is。
what's the difference between a frame and a page? They're the same thing。
We'll use frame and page interchangeably。 Usually。
we reference like a virtual page and a physical page frame of memory, but page。
and frame are the same concept。 Okay。 Now, if we look at the size of our cache, in this case。
the size of our physical memory, we look at our hit rate。 As we increase our cache size。
the cache behavior we'd like is to see that more memory。
means more of our working set fits into memory。 Right? And so the key thing is also, remember。
we're transitioning from one working set to another。
So the working set sizes are also going to be changing in size。
Now what happens again if our working set doesn't fit in memory? If it doesn't fit in memory。
we end up page faulting。 And it's the difference between having memory access times of 100 nanoseconds and memory access。
times that are tens of millions of nanoseconds。 So when we think about physical memory being allocated to our processes。
we want to think, about that in the context of what's the working set size of our processes。
Now why might things not fit or why might we have misses that occur? Well。
some of them are going to be capacity related。 So if we don't allocate enough physical memory for the working set size。
we're going to, encounter capacity misses。 But there's also going to be conflict misses and compulsory misses。
Conflict because we replace the wrong pages and that also kind of gets into replacement, policies。
Conflict misses should really be zero if we're fully associative and there's no biases in。
replacing pages。 And then compulsory misses are when we first start running or when we just swap something。
back in。 So this is applicable thinking about this kind of model to memory caches and pages。
But it's really applicable to any kind of caching strategy。
We have a limited size cache and we have to look at the access patterns of that cache。
relative to the size of that cache。 And we'll come back to that in our next slide。 Okay。
so another model to think about for locality is a Ziffian distribution。
So in a Ziffian distribution, the likelihood of accessing some item of rank R, so here's。
our ranks here, is going to be one over R to the A or alpha。
And so the thing is, it might be very rare to access any individual item far down the ranking。
But there are so many of those low-ranked items that this really makes it a heavy-tailed distribution。
So one of the places where people see Ziffian distributions is in web accesses。
So if we were to go to the Berkeley border route and look at what's the top website that。
people are going to? It's probably, I don't know, Cal Central。 If we look at the next one。
maybe it's www。berkeley。edu。 And so there's going to be tons of accesses every day, every hour。
every minute to those。 So caching those websites would give us huge value。
Then maybe we get down to number 20 and it's, I don't know, www。sfgate。com。 But it keeps going。
You might be the only person on campus that accesses www。xyz。com。 And so if we think about it。
there's a large value to, substantial value to caching the, top view。 All right。
because we're going to absorb a tremendous number of references to that top, view。
But at the same time, there are going to be hundreds of thousands or millions of references。
to this tail。 So even if we put an enormous cache on the border, we're still going to have a high。
miss rate。 And that was one of the challenges people ran into when they started trying to develop。
web caches is that, yeah, there are those popular sites and we can absorb all the traffic。
to those popular sites in our caches。 But because people have diverse things that they want to look at on the web。
that long, tail is going to mean we're not going to have as high a hit rate as we'd like to have。
Yeah。 Yeah。 So the thing, the question is why the lower rank has lower hit rate。 So think about it。
right? If you're the only person on campus who's accessing www。xyz。com out of our campus population。
daytime, population of like 50,000, that's going to be a small number of references。
Yet if every student is going on Cal Central to check their classes, that's going to represent。
a large overall number of references。 And so the value of caching those will be high。 Oh。
the question is low rank, less popular or more popular? It's less popular。 Right。
So when we get all the way out to rank 49, there's not a lot of people that are accessing, 49。
but they are accessing it。 So we can't, the thing is we can't cache everything。 Right。
And in ideal world, we would just cache the entire internet at the border, but we can't。 Right。
So we have to pick what's our cache size going to be, and what we'll find is we'll capture。
a lot of these highly ranked items, the rank one items, rank two, three, four, but we won't。
be able to cache rank 5,000。 Okay。 Any other questions? Yeah。 Ah。
so if we were able to cache those pages, the more。
pages we can cache, these, you know, highly ranked ones, the more we would。
capture what people are trying to browse, but we'd need a, you know, petabyte。
sized cache to try and capture all of that。 Yeah。 Exactly。 The hit rate here is what we capture if we。
cache everything of that rank and below。 So we just have to keep going further and further。
you know, and you can see we're still not at one。 To get to one, you know。
we'd be caching, you know, rank one million and beyond in order to capture it。
And that's just going to be expensive。 Right。 If memory or, you know, disk space was no issue。
then we could have an arbitrarily large cache, but because we have to set a limit。
you know, we have to look at a trade-off。 And so if we end up over here, right。
we're getting an 80% hit rate, but that means 20% of our stuff is still missing。
Any other questions? Okay。 So let's, it says, are there operating systems that have access to multiple caches。
but use, different methods for each one? Absolutely。 So every cache we have, you know。
we ask those questions, like for example, replacement, policy and, you know, for TLB。
we say random is often good enough because we can employ。
what we implemented in hardware for demand paging。 We're going to do it in software。
And so we're probably going to implement something more sophisticated and predictable, than。
than random。 And the question is if we have too many caches, would it not be worth it? Yeah。
So there's always a trade off, right? Every time we look in a cache。
there's a cost associated with that。 And so that's why we don't have 20 level caches on processors。
We typically have like three levels of caching。 The L3 is typically shared across multiple cores。
There's always trade offs in terms of, you know, the value of chip real estate versus。
the performance benefit that you get versus the performance hit that you, the latency。
hit that you get of having to go through yet another cache。 Okay。
So we can compute our average access time for main memory。 So let's look at that。 So if our。
so our effective access time is going to be our hit time plus our miss rate, times our miss penalty。
So as an example, let's say our memory access time is 200 nanoseconds and our average page。
fault time is eight milliseconds。 We'll say our probability of a miss is P。
So one minus P is our probability of a hit and, we can compute our effective access time as follows。
All right。 So it'll be 200 nanoseconds plus P times eight million nanoseconds。
So if one out of a thousand accesses causes a page fault, that seems like a relatively。
small number。 A thousand accesses run at full speed and one access runs expensive。 Well。
it actually yields an 8。2 microsecond average memory access time。
So we've now slowed our machine down by a factor of 40。 So it looks like our machine just stopped。
Right。 So a different way to think about it is what if I wanted to bound the slowdown?
I wanted to say I'm willing to slow down my machine by 10% for demand paging。 So what would I need?
Well, I'd need to have a page fault rate of about one in 400,000。 All right。
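The numbers in this example are easy to check。 Here is the arithmetic in Python, using the same 200 nanosecond memory access time and 8 millisecond page fault service time from the slide。

```python
# Effective access time (EAT) with demand paging:
# EAT = memory_access + p_fault * fault_service_time
MEM_NS = 200
FAULT_NS = 8_000_000            # 8 ms in nanoseconds

def eat(p_fault):
    return MEM_NS + p_fault * FAULT_NS

print(eat(1 / 1000))            # 8200 ns, about a 40x slowdown over 200 ns
# For at most a 10% slowdown we need EAT <= 220 ns:
p_max = (1.1 * MEM_NS - MEM_NS) / FAULT_NS
print(p_max, 1 / p_max)         # 2.5e-06, i.e. roughly one fault per 400,000 accesses
```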
So this is why it's going to be really important to come up with a replacement algorithm。
That's really good because if we picked the wrong page and evicted and then go to reference。
that page, well, now we've made our machine run at the speed of our disk。 Okay。 So question is。
why do we miss? Well, either we don't have enough space。
or we haven't referenced it before, compulsory。 We'll get to capacity in a second; first, compulsory。
We haven't referenced the page before。 So that's going to cause us to miss。 We could do prefetching。
but that's super controversial。 So prefetching says, hey, if I've referenced this page, well。
I know I do sequential access, so I'm likely to reference the other pages after it。
So I'm going to pull those pages in when I reference that first page。
The risk is what if I don't reference those other pages? Well, to pull those pages in。
I evicted other pages from memory and I use disk bandwidth。
to pull those pages in and then I don't reference them and I've kicked other pages out so I。
could actually drive my miss rate up。 So that's why prefetching is always controversial。
And we have to predict which pages to actually prefetch。 So capacity。
That was the one I meant to say。 Not enough memory。
So we have to kick something out of the cache and that's going to cause us to have to pull。
it back in later when we reference it。 So somehow we have to increase the available memory。
So this is one of the things。 My friends, they're like, oh, you're in computers。
can you help me with my windows? It runs slow。 I'm like, I don't do that kind of computer stuff。
But I say, how much memory do you have? And then people say, oh, I have eight gigabytes。
I go out and buy 32 gigabytes。 Not very expensive。 It'll make your machine run much faster。 Right?
Because again, think about it。 If we're taking a fault every thousand references。
it's a slowdown of 40, factor of 40。 So extra memory。
you avoid capacity misses and you run it closer to that 10% overhead。 All right。
So increase the amount of DRAM or adjust the amount of memory that we allocate to a process。
If it's not getting enough memory for its working set, then it's just going to be page, faulting。
not making for progress。 And so we either give it more memory or we're just going to say, all right。
you run slow, and it's not very useful。 And you're taking up memory that can be used by other processes。
So then there's conflict misses。 Technically, we shouldn't have conflict misses unless we had something broken in how we select。
empty pages or something like that。 And that kind of really gets into policy misses。
So where we replace the wrong page and that causes us to then fault and have to pull that。
page right back into memory。 So solution invests more time and software cycles in a better replacement policy。
Okay。 So some administrative stuff。 We have a midterm coming up on Thursday。
It's going to cover everything including this lecture。
And there was a review session yesterday and the slides and video for that are available。
on the course website。 Questions? Yes。 [inaudible], Absolutely。 So the question is whether。
you know, the idea here is investing more in the software。
associated with the replacement policy to get a better replacement policy is worth it。
And the answer is absolutely。 You can improve the hit rate。
You're going to make your programs run dramatically faster。 Right。
And so a better replacement policy or like I said, you know, going out and adding more。
physical memory。 So you reduce capacity misses。 Those are all going to yield lower mis rates and yield much better performance。
And this all is true because there's that huge difference。 Right。 If we think about like, you know。
the TLB, we can access the TLB in fractions of a nanosecond。
versus going out to main memory is a hundred times slower or more。 Right。 A hundred nanoseconds。
The difference between RAM at a hundred to 200 nanoseconds versus 10 million nanoseconds。
is so large that it really makes sense for us to invest in a good replacement policy。
Because with the TLB, we thought, well, we'll just do it in hardware。
Because a miss is a significant cost, but it's not as significant a cost differential。 Okay。
Page replacement policies。 Again, the reason why we care about replacement policy is because it's important for any cache。
I just talked about, you know, how we made that decision with TLBs, that TLBs are typically。
hardware managed。 MIPS was an exception where they do a software-managed TLB。 But with pages。
it's really important because this huge difference between the access times。
to main memory and the access time of disk。 So we don't want to throw out something important。 Okay。
So first policy, FIFO, first in, first out, sounds pretty simple。 It is pretty simple。
So we throw out the oldest page。 The page we first pulled in from disk is the first page we're going to kick back out to。
disk。 All right。 You could argue it's fair。 Every page gets to live in memory for, you know。
sort of roughly the same amount of time。 But the disadvantage is if we have an old page that's frequently used。
like it's some, critical code set, you know, piece of code that's used across multiple processes and。
we kick it out of memory, next thing we do is going to be to pull it right back into memory。
So that's a downside with FIFO。 Random。 Random is easy to implement。
You run a random number generator, pick a page, and evict it, right?
But if you think about it, that worked well in the TLB, where a miss was costly but only by a factor。
of 100 or so, not a trip out to the disk。 But here again, 100 nanoseconds。
200 nanoseconds versus 10 million nanoseconds。 So if we pick the wrong page and evict it。
It's going to be really bad for performance if we're looking at, you know, soft real time。
or interactive systems。 So unpredictability is not something we want when it comes to paging。
So looking at it the other way, what's the best we could do? If we had to pick some page to evict。
which page would do sort of the least damage to, evict the page? Well, it's going to be the page。
not the page that we're going to reference next, not the, page we're going to reference after that。
not the page we're going to reference after, that。
but the page we're going to reference farthest in the future。 Right?
That's provably optimal going to be the best that you can do when you're limited in the。
amount of memory that you have。 And so that's going to be our min algorithm。
We can't implement it because we don't know the future。 If we knew the future, you know。
we'd all be in Dogecoin or something like that。 And so we're going to instead think about programs as being consistent and if they haven't。
accessed something for a long time, they're not likely to suddenly turn around and access, it。
So we're going to use the past as a predictor of the future。 And that gets us to LRU。 So with LRU。
we look backwards to the thing that was least recently used。
And that's the page that we're going to evict。 And hopefully that's going to reflect something like min。
It's going to be a good approximation of min。 Now how would we implement LRU? Well。
one way would be to just use a list。 So when we reference a page。
we place it at the head and the tail is the least recently, used page。 So when we need a free page。
we just pull it off the tail and give it to a process。 But think about this for a moment。 Right?
What does "use" mean? Use means instruction fetch。 Use means memory load。 Use means memory store。
So on every instruction fetch, every load, every store, we would have to update this list。
That's not really very efficient。 All right。 That'd be an incredible number of software cycles for every single memory reference。
And yes, you know, it would allow us to pick the oldest page to replace, but it would come。
at a severe penalty from a performance standpoint。 So we're going to approximate LRU, which again。
remember, is going to be how we're going, to approximate min。
So we start with the gold standard of min。 We approximate it with LRU。
And then we're going to approximate LRU with a bunch of different approaches。 Okay。
So let's look at what this would look like。 So we'll use FIFO as our strawman to start。
We're going to assume we have three physical page frames and a virtual address space that。
has four pages。 And we're going to have the following reference stream, A, B, C, A, B, D, A, D, B。
C, B。 OK, so here's what this is going to look like。 Three physical frames here。
And let's look at what our references do。 So first A, what's going to happen? Reference A。
We're going to page fault, right? It's a compulsory page fault。 It's not in physical memory。
so we have to page fault and bring it in。 So we'll bring in A。 Where's it going to go? Well。
we're just going to pick one at random。 We'll pick one。 OK。 Next reference for B。
Where's it going to go? Well, again, it's a compulsory miss。
So we'll put it in the next available free frame, which is two。 Next is going to be C。 Again。
compulsory miss。 We'll replace or bring it in to free frame three。 Right? Now we reference A。
That's in memory。 That's a hit。 Runs it full speed。 Reference B。 That's in memory。 That's a hit。
Full speed。 And oh, now we reference D。 So that's going to fault。
What was our first page-- we don't have any pages on our free list。 What was our first page in? A。
the page one。 So we're going to replace A with D。 And then our next reference-- oh。 It's for A。
All right。 So what's our oldest first in frame now? Going to be frame two B。 So we'll replace B。 OK。
And our next reference is for D。 That's a hit。 That's great。 Our next reference is for B。
It's not in memory。 That's a fault。 So who are we going to replace? We'll replace three, C。
And then we're going to reference C, which we just kicked out。 So now our oldest is going to be D。
We'll replace D。 And now we reference B, which is a hit。 So we ended up with seven faults。
And one of our faults was really bad。 Because when we referenced D, we replaced A。
And then immediately referenced A。 And similarly, we replaced B and then referenced B again shortly after。 OK。
Let's look at what the best choice would be at every given, moment, given our trace。
So that'll be min。 So A comes in。 Nothing we can do about it。 Compulsory, same with B。 Same with C。
Two hits。 OK。 Now D comes in。 Which should we replace? So now we're going to look forward from D。
And we'll see, well, we're next going to reference A。 So we don't want to replace A。
Then we're going to reference B。 So we don't want to replace B。 And C is the furthest out。 So yes。
it will be C that we replace。 Now A is a hit。 B is a hit。 B is a hit。 Oh。 Now C。
So now who do we replace? We're going to replace A or D。 So we'll replace A。
And then B will be a hit。 So min ends up being five faults。 When we go to replace a frame for D。
we look as far as we can into the future, to see what's the furthest out reference。
Now it just happens that LRU does the same thing。 But that's not guaranteed。 In fact。
if you think about it, we, could come up with a pathological example of where LRU is。
going to do poorly。 So here, consider the reference stream ABCD, ABCD, ABCD。
Same setup: four pages of virtual addresses, three physical page frames。
And LRU and FIFO are actually going to do the same thing。 OK, so A is going to be a miss。
B is going to be a miss。 C is going to be a miss。 Now D comes in with LRU。
which frame is least recently used? A, right? Because frame one, so we replace A。
And then we reference A。 Now, which frame was least recently used? B, frame two, so we replace B。
And now we reference B。 Which frame was least recently used? C。 And so you can see we're basically。
going to fault on every single reference。 So this is a little bit of a contrived example。
of N plus 1 as our working set size。 And we only have N frames available。
But it demonstrates that if you have fewer physical frames, available than your working set size。
you're going to be taking a lot of page faults, and running at disk speed instead of running at memory speed。
So let's look at instead what would be a better--, and again, this gets to a like。
why replacing policies。
are important, what would min do? So now the reference for D comes in。 Which frame do we replace?
Looking into the future, we want to replace C。 Now the references to A and B are served out of memory。
And now the reference for C comes in。 Which frame do we want to replace? It's going to be B。
Because we're going to do D, then A, then B。 So we replace B。 Now D and A are served out of memory。
And now we get the reference for B。 And so if we look into the future, we can replace A。
So we can see much fewer faults。 That's why replacing policy is important。
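To check these fault counts, here is a small Python simulator for FIFO, LRU, and MIN under the assumptions of the examples above: three physical frames, demand fill, and eviction only when memory is full。 It is a sketch of the policies as described, not real kernel code。

```python
# Count page faults for FIFO, LRU, and MIN on a reference string with `frames` slots.
def simulate(policy, refs, frames=3):
    mem, faults = [], 0
    for i, page in enumerate(refs):
        if page in mem:
            if policy == "LRU":                       # refresh recency on a hit
                mem.remove(page); mem.append(page)
            continue
        faults += 1
        if len(mem) == frames:                        # must evict someone
            if policy in ("FIFO", "LRU"):
                victim = mem[0]                       # oldest arrival / least recent use
            else:                                     # MIN: page used farthest in the future
                future = refs[i + 1:]
                victim = max(mem, key=lambda p: future.index(p) if p in future
                             else len(future) + 1)
            mem.remove(victim)
        mem.append(page)
    return faults

for trace in ("ABCABDADBCB", "ABCDABCDABCD"):
    print(trace, {p: simulate(p, trace) for p in ("FIFO", "LRU", "MIN")})
# Expected: ABCABDADBCB -> FIFO 7, LRU 5, MIN 5; ABCDABCDABCD -> FIFO 12, LRU 12, MIN 6.
```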
OK, now a desirable property we want to have, is you add more memory, you spend more money。
and you get better performance。 In that you have--。
as you increase the number of physical frames that, are available。
you watch your number of page faults drop。 All right, now there's always going。
to be some floor because there are compulsory misses and so on。
But does every replacement policy have that property? Throw more memory at it。
and it performs better。 Intuitively, it seems like the answer should be, yes。 I have more memory。
so I should be reducing capacity misses。 But you can have policy misses。 And so for some algorithms。
we have what's called Bélády's anomaly, where actually adding memory。
doesn't decrease the fault rate。 So for LRU and MIN, it's guaranteed。 You give them more memory。
you will get better hit rates。 But for FIFO, it's not always the case。 And this is the anomaly。
So consider the following page reference stream: three frames, and a reference string over five pages (A through E) in our virtual address space, mapped into three frames。 So how many faults do we have here? Nine。
If I now add an additional frame, how many faults do I have? Ten。 So I added memory,
and my number of faults went up。 So here's the thing。
Look at the contents of memory with three frames, the contents of memory with four frames。
Completely different。 And that's the anomaly here。 With FIFO, the contents of memory。
can be completely different, which, means the hit rate might actually be lower。 With LRU and MIN。
the contents of memory with X frames is going to be a subset of the contents of memory with X plus 1 frames。
This is a little confusing。 It's a little counterintuitive。
But it's a side effect of how FIFO treats memory。
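As a quick sanity check, here is a small sketch (not from the slides) that reproduces Belady's anomaly with FIFO for three versus four frames。 The reference string is an assumption -- the classic Belady string A B C D A B E A B C D E -- chosen because it gives the nine-versus-ten fault counts quoted above。

```python
from collections import deque

def fifo_faults(trace, frames):
    """Count FIFO page faults for a reference string with `frames` physical frames."""
    resident = deque()                 # front = oldest arrival
    faults = 0
    for page in trace:
        if page in resident:
            continue                   # hit: FIFO order is unaffected by hits
        faults += 1
        if len(resident) == frames:
            resident.popleft()         # evict the page that arrived first
        resident.append(page)
    return faults

belady = list("ABCDABEABCDE")          # assumed classic Belady reference string
print(fifo_faults(belady, frames=3))   # 9 faults
print(fifo_faults(belady, frames=4))   # 10 faults: more memory, more faults
```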
OK。 Approximations。 So yeah。
[INAUDIBLE], Yes。 The question is: is MIN guaranteed to be the optimal policy? Yes,
because MIN can look in the future。 So MIN knows exactly what's the least impactful frame。
to replace。 The most costly choice is to replace the frame
that you're going to reference immediately following。 So MIN looks as far into the future。
and tries to delay that page fault to be as far in the future, as possible。
And it's doing that on every page replacement。
OK。 So we can't implement MIN, because it requires knowing the future。 And exact LRU is too expensive to implement。
So we're going to implement an approximation to LRU。
We're going to look at a couple of different ways, of doing that approximation。
So the first is an algorithm called the clock algorithm。
So we're going to take all our physical pages in memory, and conceptually。
we're going to arrange them in a circle。 And then we're going to have a clock arm that's。
going to go around all those pages and do the following。
Our goal here is we need to find a page to replace。
We're not going to try and replace the absolute oldest, referenced page。
We're just going to try and replace an old page, a page, we haven't referenced in a while。
And that's the trade-off。 So rather than trying to implement an exact LRU。
we're going to implement something that just picks an old, referenced page and kicks it out。
So some of the details。 We're going to have a hardware use bit or access bit。
And that bit gets set every time we reference a page。
So if we do an instruction fetch, a load or a store to a page。
the hardware will automatically set that use bit for us, that access bit for us。
Now that use bit could be set in the TLB。 If it's set in the TLB, then when we evict an entry from the TLB,
we need to copy that use bit back to the page table entry that the TLB entry is caching。
On a page fault, what happens? We advance the clock arm。 If the use bit is set, what does it mean?
It means the page was recently accessed。 We clear the use bit and advance the hand。 If it's not set。
if it's 0, it means that page hasn't been, accessed in a while, and that could be a candidate for。
replacement。 So we could pick that page to replace。 OK, a little bit more detail。
As this hand is going around, are we eventually going to find, a replacement candidate? Yes。
Even if every page in memory is recently accessed, that hand, will just set it to 0, set it to 0。
set it to 0, and advance, all the way around until it comes back around and finds a。
page that's marked to 0。 And that becomes our replacement candidate。
So that effectively becomes FIFO at that point。 Now。
what does it mean if the hand is moving very slowly? And is that kind of like good or bad? Well。
it means that there are not many page faults。 So that's probably good。
It could also mean that we're finding pages very quickly when, we do have a page fault。
That's also good。 Now, what if the hand's moving really quickly? Well。
it means that either there are lots of page faults or。
it's taking us a long time to find a free page, like all the, pages are in use。
So we have to go all the way around on every single, access to find a valid page to replace。
So the way you can think of the clock algorithm is we're。
partitioning our pages into young pages that are, referenced a lot and old pages that aren't referenced as much。
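Here is a minimal sketch of the basic clock sweep described above, with plain Python lists standing in for the frame table and the hardware use bits; the class and method names are made up for illustration。

```python
class Clock:
    """Sketch of the basic clock replacement sweep."""

    def __init__(self, nframes):
        self.pages = [None] * nframes    # which virtual page occupies each frame
        self.use = [0] * nframes         # stand-in for the hardware use/access bit
        self.hand = 0

    def reference(self, frame):
        # In real hardware this bit is set automatically on every access to the page.
        self.use[frame] = 1

    def pick_victim(self):
        # Advance the hand, clearing use bits, until a frame with use == 0 is found.
        while self.use[self.hand] == 1:
            self.use[self.hand] = 0      # give the page another chance
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.hand
        self.hand = (self.hand + 1) % len(self.pages)
        return victim                    # frame whose page we evict and reuse
```

On a page fault the OS would call pick_victim(), write the old page out if it is dirty, and install the new mapping; if every use bit is set, the hand clears them all and wraps around, which is the degenerate FIFO case mentioned above。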
But if we're just looking at two sets, why not have more, than two sets?
Then we could start to think about finer granularities of how old a page is。 That's the Nth chance version of the clock algorithm: in addition to the use bit, we keep a counter for each page。 Now, as the clock hand sweeps on a page fault, the
operating system checks the use bit。 If the use bit is one, we clear the bit, we clear the, counter。
If the use bit is zero, we're going to increment the, counter。 And if the counter equals n。
it's a candidate for, replacement。 Question? OK。 So what this means is for any given page。
a hand has to go, by n times before we're going to select it for, replacement。
So it gives it n chances to get accessed。 If it doesn't get accessed in those n chances, we pick。
that page to replace。 So the big question now with Nth chance is: what do we set N to be? Well, if we set N to be really large, then it's going to be a
then it's going to be a, good approximation for LRU。 If we set it to 1,024。
that's saying we're dividing these, granularities of how old that page is into really fine。
granularities。 It's not perfect LRU, but it's going to be a really close, approximation of LRU。
But if you think about it, though, that's not going to be, very efficient, right?
The hand might have to go around 1,024 times before we, actually find a page to replace。
So if we pick a smaller n, it's going to be much more, efficient。
but it's not going to be as faithful an approximation of LRU。
It's going to be more of a coarse-grain approximation。
So we find something that balances finding pages quickly and efficiently with picking the right pages to evict。
OK。 So yes。 Yes, so the question is: if we're looking at Nth chance, does
it mean it's going to take n sweeps every time we want to, find a page? Not necessarily, right?
Because we could have already have gone by a number of, times in the past。
And so as we advance the hand, we might find a page where, advancing the counter hits n。
So the first page fault--, yeah, potentially could require n to go around and find it。
if all pages have been recently accessed。 OK。 So all of this。
we didn't consider what happens if the, page we picked to replace has been modified。 So remember。
we're doing writeback caching, so that means, we may have done a bunch of writes。
What's on the backing store doesn't match what's in memory。 So we actually have to--。
it's going to take extra overhead because now we have, to schedule that page to be written out。
That takes tens of millions of nanoseconds to write that page。
out before we can spend the tens of millions of nanoseconds, to bring the page in。
So maybe what we do is we give dirty pages an extra chance。
So we don't consider them for replacement when we first, realize, hey。
maybe we should replace this page that hasn't, been used in a long time。 So a common approach is。
say, for clean pages, we use n, equal to 1。 And then for dirty pages, we use n equal to 2。
And when the counter hits 1 and we realize the page is dirty, we schedule it to be written out。
So then hopefully by the time the hand comes all the way back, around, that's now a clean page。
And we can just evict it and replace it with the page that we want to pull in。 All right?
OK。 So if we come back again-- yeah, question?
[INAUDIBLE], Yes, the question is, what do I mean by n equal to 1 and n, equal to 2?
So it's a question of when we consider it for replacement。 So if you remember here。
if we hit the page and the use is 0, we increment the counter。 If it equals n。
then we evict that page and replace it。 So what we're going to do is that when we're going around。
if we hit a page that's 0, it's use bit of 0, we increment the counter。
If the counter now equals 1, we replace it-- if it's clean。 If it's dirty, then n is effectively 2: we schedule the page to be written out。
By the time the hand comes all the way back around, if the page still hasn't been used, the counter gets incremented to 2, and hopefully by then the page will also have been written out, so now we can replace it。 OK。
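A sketch of the Nth chance sweep with the clean/dirty split just described (n equal to 1 for clean pages, n equal to 2 for dirty pages)。 All of the state here is a software stand-in for the hardware bits, the function name is hypothetical, and the writeback is only represented by a print。

```python
def nth_chance_victim(pages, use, dirty, counter, hand, n_clean=1, n_dirty=2):
    """Sweep until some frame's counter reaches its threshold; return (victim, hand).

    pages/use/dirty/counter are parallel lists indexed by frame number;
    use and dirty stand in for the hardware access and modified bits.
    """
    nframes = len(pages)
    while True:
        if use[hand]:
            use[hand] = 0                # recently used: clear use bit and counter
            counter[hand] = 0
        else:
            counter[hand] += 1
            limit = n_dirty if dirty[hand] else n_clean
            if dirty[hand] and counter[hand] == 1:
                print(f"schedule writeback of frame {hand}")   # placeholder for I/O
            if counter[hand] >= limit:
                victim = hand
                hand = (hand + 1) % nframes
                return victim, hand
        hand = (hand + 1) % nframes

# Example: frame 2 is unused but dirty, everything else recently used.
# The dirty frame gets its writeback scheduled, and the sweep continues until a
# clean frame's counter reaches 1 -- here frame 0 on the second pass.
pages   = ["A", "B", "C", "D"]
use     = [1, 1, 0, 1]
dirty   = [False, False, True, False]
counter = [0, 0, 0, 0]
victim, hand = nth_chance_victim(pages, use, dirty, counter, hand=0)
print("victim frame:", victim)
```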
Coming back to our page table entries, what bits are we going to use here?
We're going to use the present bit。 We're going to use the writable bit, the access bit。
and the dirty bit。 So those are the bits that we're going to use。
for all of the bookkeeping that we need to do to figure out, whether a page is valid。
whether a page has been written, whether a page is writable, and whether the page has been used。
All right。 So some of the variations of clock algorithm。 Do we really need a modified bit that's。
implemented in hardware and tells us, whether things are dirty? The answer is no。
We could emulate that in software。 So we're going to need some extra books that keep track。
of which pages are allowed to be written and which aren't。
but we could simply mark all pages as read-only。 And then when they get accessed to write。
check the books to see: are we allowed to write it? If we are, we mark now in software that the page has been modified, and we mark the page table entry as read-write so subsequent writes to that page run at full speed。
Then whenever we write the page back to disk, we clear our software modified bit。 Similarly, we could ask, do we need a use bit in hardware?
That's what gets set on every single memory reference。 Now, again, we can emulate it in software。
So now we're going to keep a use bit in software, and a modified bit in software for each page。
And we're going to use the page table infrastructure, to do both of those。
So we'll mark all pages as invalid, even those that are actually in memory。 Now。
when you reference one of those pages, what will happen? We trap to the operating system。 Again,
the operating system is going to look in its double, books and see, oh, this page is actually valid。
So I'll mark it as used。 If that was a write that occurred,
I'll also mark it as dirty and set it as read write。
If it was a read or instruction fetch that occurred, I'll mark it as read only。 All right? Now。
when the clock hand passes, I'll, set that software use bit back to zero。
and mark the pages invalid in the page table entry, and flush the TLB entry for it。 The modified bit I don't touch, because the page is still dirty and needs to be written back to disk。
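Here is a rough sketch of that software emulation, with a simplified page table entry and the OS's "double books" as plain Python objects。 The structures and handler names are hypothetical; the point is just the flow: references to pages marked invalid trap, the handler sets the software use (and possibly modified) bit and re-enables the mapping, and the clock pass re-arms the trap。

```python
from dataclasses import dataclass

@dataclass
class PTE:                       # hypothetical, simplified page table entry
    valid: bool = False
    writable: bool = False

@dataclass
class PageInfo:                  # the OS's "double books" for one resident page
    resident: bool = True
    used: bool = False           # software-emulated use bit
    modified: bool = False       # software-emulated modified bit

def on_protection_fault(pte, info, is_write):
    """Fault handler: the PTE was marked invalid only to force this trap."""
    assert info.resident         # otherwise this would be a real page-in from disk
    info.used = True
    if is_write:
        info.modified = True
        pte.valid, pte.writable = True, True    # full-speed writes from now on
    else:
        pte.valid, pte.writable = True, False   # the first write will trap again

def on_clock_pass(pte, info):
    """When the clock hand passes: reset the software use bit and re-arm the trap."""
    info.used = False
    pte.valid = False            # next reference traps again (and flush its TLB entry)
    # info.modified stays set until the page is actually written back to disk

pte, info = PTE(), PageInfo()
on_protection_fault(pte, info, is_write=True)   # first write: trap, mark used and modified
on_clock_pass(pte, info)                        # hand passes: use bit cleared, trap re-armed
print(info, pte)
```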
Now, clock itself is just an approximation of LRU。 It's just one way that we could implement that approximation。
What are other ways that we could do it? All we're trying to do here is identify old pages,
not necessarily the oldest page, but just an old page。
So another approach that became very popular for a while。
is an approach called the second chance list。 And we still have variants of that today。
So this was used in VAX/VMS。 And the idea was we split memory into two regions:
an active region of pages that are actively, being used, which we manage FIFO。
and a second chance list, which we manage LRU, for pages that are considered inactive。
So pages in the active list are marked read/write。 Pages in the second chance list are marked invalid。
So active pages, we can reference those at full speed。 Accesses to the second chance list。
those are going to trap into the kernel。 OK。 Now, when we have overflow from our active pages。
we move that page onto the head of the second chance list。
If we access a page that's on the second chance list, we're going to trap into the kernel。
that tells us that this page we thought was inactive, actually is active。
So we're going to move it to the head of the FIFO, for the active pages。 All right。 Now。
when we page in a new page from disk, it's active, so it's going to go to the head of the FIFO queue for our active list。
And that's going to force us to evict a page out of memory,
and we're just going to pick our LRU page。 So what does this give us? Well, if we think about it。
if we put all of our pages, onto the active list, it's just FIFO。
If we put all of our pages onto our second chance list, it would be pure LRU。
but it would be really expensive, because we trap on every memory reference。
So we picked some intermediate, and that, gives us the advantage of fewer disk accesses。
because we let pages live in memory longer, even if they're unused, and we get a higher fidelity approximation of LRU。
But the trade-off here is we're going
to increase the number of traps that we have, into the operating system。
So this is where we're trading off traps for hopefully。
having a higher fidelity representation of LRU, as an approximation to MIN, and reducing our overall miss rate。
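And here is a rough sketch of the two-list scheme: an active list managed FIFO at full speed, and a second chance list managed LRU where a reference costs a trap (just a counter bump in this simulation)。 The list sizes, class name, and bookkeeping are assumptions for illustration, not the actual VAX/VMS implementation。

```python
from collections import deque, OrderedDict

class SecondChanceLists:
    """Sketch of the split: an active FIFO list plus an LRU second-chance list."""

    def __init__(self, active_size, second_size):
        self.active = deque()            # pages mapped valid; accessed at full speed
        self.second = OrderedDict()      # pages marked invalid; any access traps
        self.active_size = active_size
        self.second_size = second_size
        self.faults = 0                  # real page faults (page must come from disk)
        self.traps = 0                   # cheap traps on second-chance references

    def reference(self, page):
        if page in self.active:
            return                       # memory-speed hit, no OS involvement
        if page in self.second:
            self.traps += 1              # page was only "logically" evicted
            del self.second[page]
        else:
            self.faults += 1             # bring the page in from disk
        self._push_active(page)

    def _push_active(self, page):
        self.active.appendleft(page)     # head of the active FIFO
        if len(self.active) > self.active_size:
            overflow = self.active.pop()         # FIFO overflow goes to second chance
            self.second[overflow] = True         # most recently demoted at the end
            if len(self.second) > self.second_size:
                self.second.popitem(last=False)  # evict the LRU second-chance page

trace = list("ABCABDADBCB")              # same assumed trace as earlier
lists = SecondChanceLists(active_size=2, second_size=2)
for p in trace:
    lists.reference(p)
print("faults:", lists.faults, "traps:", lists.traps)
```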
So the cool thing here is with page translation, it's really up to the operating system developer。
to figure out what replacement policy do I want to implement。
The hardware gives us the ability to implement anything we want。
And if the hardware doesn't have that capability, we can emulate that capability in software。
So a little side note of history: why did VAX/VMS have a software-implemented use bit and a software-implemented modified bit? Because when the architects asked the operating system
developers, do you need a hardware supported use bit, and a hardware supported modified bit。
they said no。 Who knows why。 But it turns out it wasn't the end of the world,
because they were actually able to emulate them in software。
And the VAX/VMS line was a very successful line of computers。 OK。
Another thing we have to think about is our free list。 Yeah。 Well, again, the question is。
is it really expensive, to have to trap into the operating system, to implement the use bit?
Absolutely。 If you had any choice otherwise, you would not want to have to do a trap on every use。
But they were able to amortize it, because it's only the first reference to a page that's。
going to cause the trap。 But still, it would have just been a lot easier。
if you had a hardware implemented use bit, and hardware implemented modified bit。 OK。 So yeah。 Yeah。
So the reason why we have to trap, is because the second chance list is marked as invalid。
So if we try to access anything on the second chance list, the page table entry is invalid, which。
is going to cause us to trap into the operating system, so that we can then move it onto the active。
And mark it as used and, if necessary, mark it as modified。 [INAUDIBLE], Yeah。 So the question is,
if it's marked as invalid, how do we know that it's on the second chance list?
So this is where the double books come into place。
The operating system has to have a side set of books that, tell it, oh, these are the pages that。
are on the second chance list and not just pages that, are maybe out on disk。 And again。
we might be able to do some of that, in the page table entry, depending on what bits we have。 OK。
So I'm going to skip the free list, and we will go through our summary。 OK。 So replacement policy。
So we looked at FIFO, which places pages on a queue and replaces the page that has been in memory the longest。
We looked at MIN, which replaces the page that we're going to reference farthest in the future。
It's optimal, the best we can do。 We looked at LRU because we don't have the future。
And we're going to use the past as a predictor of the future, assuming programs behave rationally。
We looked at the clock algorithm as an approximation, to LRU。
arranging the pages in a circular list, sweeping through with a clock hand。
And if a page is not in use, picking that page, for potential replacement。
We looked at finer granularities for treating young versus, old pages。
The nth chance algorithm was an example of that。 Another example is having a second chance list。
where we actually can implement LRU on a smaller set of pages。
So we have pages that are truly treated as LRU, and then we have pages that are treated in a FIFO manner。
We have our working set, which is the set of pages, that have been accessed recently。
And we didn't get into thrashing。 We'll get into that after spring break。
But thrashing is when we have too many pages in active use--
when the sum of the working set sizes across all of our processes is greater than our physical memory。
So that's the case where we are continually taking capacity, misses and always going out to disk。
All right, so good luck on the midterm。 Again, it's on Thursday in the evening。