UCB CS162 操作系统笔记(八)
P22:Lecture 22: Transactions (Con't), End-to-End Arguments, Distributed D - RubatoTheEmber - BV1L541117gr
Okay, let's get started。
So this is lecture 22 and we're going to talk about reliability。
Then we're going to dive into transactions, get into the end-to-end argument and then。
talk about distributed decision making if we have enough time。
So remember that we have the file system buffer cache and the role of the file system buffer。
cache is in caching disk blocks, right? Because the time to access memory, 100 nanoseconds。
The time to go out to disk, millions, tens of millions of nanoseconds。
So we use memory as a cache for the disk in the form of the buffer cache。
And so we cache all sorts of things, right? We cache data blocks. We cache i-nodes.
We cache directory blocks. And we cache important metadata structures like our free bitmap. Okay?
Now there's some important abilities that we want to talk about today。
The first ability is availability. It's the probability that a system can accept and process requests.
We typically measure this in terms of nines of probability.
So a 99.9% probability of being available is called three nines of availability.
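As a quick sanity check on what those nines mean in practice, here's a sketch (simple arithmetic, not from the lecture) of the downtime budget that k nines allow per year:

```python
# Downtime budget implied by "k nines" of availability over one year.
def downtime_hours_per_year(nines: int) -> float:
    unavailability = 10 ** (-nines)   # e.g. three nines -> 0.001
    return unavailability * 365 * 24  # hours in a (non-leap) year

print(round(downtime_hours_per_year(3), 2))  # three nines: about 8.76 hours
```

So three nines still permits almost nine hours of downtime a year, while five nines permits only about five minutes.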
So at times you'll see someone offer a service level agreement,
and they'll say the system will be available with three nines of availability. That means
99.9% of the time the system will be available to accept and process requests. Now,
how do we get availability? The key idea here is independence of faults。
The one component failing doesn't cause the entire system to go down。
We'll dive into this more in just a moment. Now, durability is the ability of the system to recover data even in the presence of faults
and failures. So this is really fault tolerance applied to our data.
Now one of the things that can be confusing here: what durability means is that the data is preserved.
It does not necessarily mean the data is accessible or available。
So for thousands of years there have been hieroglyphics carved into the sides of the pyramids.
But it was only fairly recently, with the discovery of the Rosetta Stone, that we were actually
able to read what those hieroglyphics said, because the Rosetta Stone basically gave
us a translation table. It showed examples of text written in hieroglyphics and the same text written,
I think it was, in Greek. So the data was there; it was durable for thousands of years,
but it wasn't available. So that's why you separate the notions of durability and availability.
Now a lot of times, you know, like a cloud provider or, you know, some service provider。
will say we provide high availability。 So the system is up and processing your requests。
But what they don't tell you is whether it's actually processing those requests correctly. Really,
the question you should ask is: what is the reliability of the system?
Reliability is the ability of a system or component to perform its required function
under stated conditions for a specified period of time. Now that sounds super formal,
that is because that's the formal IEEE definition。 And so this is much stronger than availability。
Because it means that the system isn't just up and accepting and processing requests, but。
also that it's doing so correctly. Right. And so there might be some bug in a service such that the service is accepting requests
and processing those requests but corrupting data and giving garbage back as results.
Technically the system is available。 It's processing the request。
But you wouldn't consider that system to be reliable。 So like for example。
when we think about a space mission, like the Apollo missions to the moon,
they looked at not just the availability of the computers on that mission, but the
reliability: that those computers could function correctly for a two-week period with a
high degree of probability of that being true. Okay,
so this means we have to take a really holistic view of the system。
We have to make sure that the data survives crashes of the system, failures of media
like disk crashes, and any other potential problems that there could be.
The same kind of reliability applies to airplanes。 Right。
We want to guarantee that the plane is not just simply available。
The flight computers are not just simply available, but they're correctly processing the data.
Okay, so let's start talking about some of these -ilities. We're going to talk about durability.
So how do we make something like a file system more durable? Let me make sure and pull up my chat。
Okay. All right. So we do this at a number of different levels. At the lowest level,
the disk blocks contain Reed-Solomon error correcting codes, or ECC.
And these are used to deal with defects on the drive. Now, we could try to make a hard drive
with the ferrous oxide coating on it absolutely perfect,
and manufacturers strive to make it as perfect and as uniform as possible. But that's difficult.
That's expensive. And so there's a trade-off in how well I do in terms of making it perfect versus tolerating errors.
And so the error correcting codes are a way of dealing with some of those manufacturing defects.
Also, as Kubi talked about last time, you know, maybe I jostle my laptop or other device that
has a hard drive in it. And those heads that are floating
just microns above the surface can impact the surface and damage the media.
And that can lose data. Error correcting codes may be able to recover the data in those cases.
So the second thing that we need to do for file systems is to make sure that writes survive。
in the short term。 So once we get writes onto the hard drive。
the argument is they'll survive long term。 We'll use things like ECC and other techniques to make the data durable。
But we still have to get from the application out to the drive。
And so we have to think about, when an application does a write, instead of treating
the buffer cache as write-back, treating it as write-through.
So when we return back to the application, the data is durable, persistent on the hard drive.
Now why might we not want to do that? Well, at the very beginning,
I said exactly what the role of the buffer cache is.
It's dealing with the difference between taking 100 nanoseconds to access main memory versus
10 million nanoseconds to access the disk. So if we use write-through,
you know, we've given up a lot of the benefit of having a buffer cache.
So the alternative is to do something like use some kind of battery backed up RAM or。
nonvolatile RAM and store our writes there。 And so that's what a lot of servers will do。
They'll have a special area of nonvolatile or battery backed-up RAM where
writes get queued. And so that way they can immediately return back, and if the system crashes,
when it comes back up, we just continue writing those blocks out to the disk. Okay.
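The write-back versus write-through distinction can be sketched with a toy cache. This is an illustrative sketch only; the `BufferCache` class and the dict standing in for the disk are made up for the example:

```python
# Toy buffer cache: write-through makes data durable before the write
# returns; write-back defers durability until a later flush.
class BufferCache:
    def __init__(self, disk, write_through):
        self.disk = disk            # dict standing in for persistent storage
        self.cache = {}
        self.dirty = set()
        self.write_through = write_through

    def write(self, block, data):
        self.cache[block] = data
        if self.write_through:
            self.disk[block] = data  # on disk before we return
        else:
            self.dirty.add(block)    # durable only after flush()

    def flush(self):
        for block in self.dirty:
            self.disk[block] = self.cache[block]
        self.dirty.clear()

wt_disk, wb_disk = {}, {}
BufferCache(wt_disk, True).write(0, b"hello")
BufferCache(wb_disk, False).write(0, b"hello")  # crash before flush loses it
print(0 in wt_disk, 0 in wb_disk)  # True False
```

The battery backed-up RAM trick is essentially a way to keep write-back performance while making the dirty set survive a power loss.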
so now we need to make sure that the data survives in the long term。
And simply writing it to the disk is not really going to be enough. Why? Well,
disks have a mean time between failures of around 50,000 hours of operation.
That's a pretty long time, but that's actually not that many years of continuous operation.
So eventually the drives are going to fail。 And I'm sure all of you have encountered the situation where our drive failed。
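As a back-of-the-envelope check (the 50,000-hour figure is from above; the rest is simple arithmetic), that MTBF is under six years of continuous operation:

```python
# How many years of continuous operation is a 50,000-hour MTBF?
mtbf_hours = 50_000
hours_per_year = 24 * 365
print(round(mtbf_hours / hours_per_year, 1))  # about 5.7 years
```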
Right? And so what can we do? Well, we can do replication。 We have more than one copy of the data。
That way if a drive fails and we lose a copy of the data, we have another copy that we can, go to。
But the important thing here to think about, again in terms of durability, is independence
of failure. All right, so I could put two hard drives into my computer. But if my computer fails,
right, then I lose both of those drives, potentially.
Or I could put a drive in a separate computer in my machine room.
But if I have a fire, I could lose both of those servers.
Or the building gets hit by lightning or hit by a tornado.
So maybe I put them on different continents. Right? And so then, you know,
it takes something like an asteroid striking the planet for me to lose both copies.
And at that point, I probably had bigger things that I'm worried about。 All right。 The key thing。
and this is what a lot of the cloud providers think about, is how do I。
ensure independence of failure? So when you use Berkeley Mail, powered by G Suite,
when you press Send, Google, before it acknowledges that Send,
guarantees that it has copied your message to servers in three different data centers,
which may not even all be in the same continent or country. Yeah.
So that even if there's a failure of one or more of those data centers, your email message
will still be delivered. There will still be a copy that they can access. OK.
So let's look at some different approaches. Yes? So the question is, what is battery backed-up RAM?
It's just what it sounds like. It's a special memory module with a battery attached to it.
So even if power is lost to the machine, like the server, the contents of that memory will。
be preserved。 Another example would be an SSD or the flash memory that's in an SSD。
I can turn off the power and it will still preserve the charge。
There it's done by maintaining that charge within the actual flash memory cell。
But before those were actually cost effective and popular, what people would do is just。
attach a battery to some memory。 Now, cautionary tale happened here in the department。
We used to have a big file server where all the projects stored their research and where all
our home directories were located. And it was on a server that had a battery backed-up memory region.
Well, batteries fail after some number of years。 And after some number of years, the battery failed。
Now these servers are really smart。 They have lots of monitoring。
And so it detected that the battery had failed。 And it dutifully logged in the log that the battery had failed。
No one ever read the log. The server was on a UPS, but the UPS failed;
so much for independence of failure. And power went out.
But power only went out for a few milliseconds. It was literally just a flicker of the lights, and everything was good,
except it wasn't. Because in that small amount of time, the server lost power,
and that battery backed-up memory, where all those important writes were being stored, was lost.
Now think about file systems。 What data do you think is going to be in that battery backed up memory in our buffer cache?
Any ideas? I'll go back to my picture. It's all of the important stuff: the free bitmap,
the directory blocks, the i-nodes, the i-node table,
all of the information that tells you where to find stuff in the file system.
And in a few milliseconds, that data was lost。 We had backups。
But it turns out people added more disks to the server and had forgotten to add those。
disks to the backup groups。 There were groups that lost all of their research and individuals who lost all of their。
research。 It's a cautionary tale, it's actually really hard to have full independence of faults。
Okay。 So first type of replication I want to talk about is a technology developed here at Berkeley。
and now a $1 billion a year industry, which is RAID。
RAID stands for Redundant Array of Inexpensive Disks.
The argument was I can either try to make a disk that is super ultra reliable, but then。
it's also going to be super ultra expensive. I could use the best of the best titanium,
gold-plated, and beryllium components and get some insane mean time between failures out of them.
But it's going to be priced out of everybody's price range。
So instead, what I'm going to do is just take commodity drives with their 50,000-hour
mean time between failures. And I'm going to put them together in an array.
By putting them together in an array, I'm now going to put them in these recovery groups
and duplicate the data. So instead of going out and buying one 10 terabyte drive,
I'm going to go out and buy two 10 terabyte drives.
And each disk now is fully duplicated onto its shadow.
So we'll say the green disk is the primary, and then the pink-slash-purple disk is our shadow.
So this is really great for high IO environments and high availability environments because to。
lose the data, I have to lose both of the drives。 And now we're looking at the probability of a double drive failure instead of just the。
probability of a single drive failure at any given time。 But it's really expensive, right?
Instead of having 10 terabytes, I now have to go out and buy two of these drives。 But they're cheap。
And they're cheaper than making a 10 terabyte drive that would be super ultra reliable。
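To see why the double-failure probability matters, here's a tiny sketch assuming independent failures and an illustrative (made-up) per-drive failure probability:

```python
# Why mirroring helps (illustrative numbers): with independent failures,
# losing the data requires BOTH drives in the RAID 1 pair to fail.
p_drive = 0.05             # assumed chance one drive fails in some window
p_single = p_drive         # data loss with one unmirrored drive
p_mirror = p_drive ** 2    # data loss with a mirrored pair
print(p_single, round(p_mirror, 4))  # 0.05 0.0025
```

The independence assumption is doing all the work here, which is exactly the point of the cautionary tale above.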
So the trade-off is the bandwidth that I get for writes. Right? Because when I do one logical write,
I actually have to do two physical writes. I have to write to both drives.
And so when these first came out, there were these wires that went between the drives,
servo wires, that would perfectly synchronize the rotation of the disks
so that the same sectors were under the same heads on the two drives.
And so when you're doing a write, you can simultaneously do the write through the controller.
Now, modern versions of this don't do that, so the drives are going to be slightly out of phase with one another,
and there's going to be a higher cost for doing those writes.
Reads on the other hand are optimized。 If I have a high read workload, then RAID 1 is great。
Because I can read from either drive。 I can read from both drives in parallel。
So I can send half my reads to one drive, half my reads to the other。
I've effectively doubled my read capacity between the two drives。
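The read-doubling idea can be sketched as a round-robin dispatcher over the two mirrored copies (an illustrative toy; the `Mirror` class and block contents are made up):

```python
# Toy read scheduler for a RAID 1 mirror: alternate reads between the two
# drives so the read workload is split across both copies.
from itertools import cycle

class Mirror:
    def __init__(self, drive_a, drive_b):
        self.drives = cycle([drive_a, drive_b])  # round-robin dispatch

    def read(self, block):
        # Either copy is valid, so any drive can serve the read.
        return next(self.drives)[block]

primary = {0: b"data0", 1: b"data1"}
shadow = dict(primary)  # RAID 1 keeps the shadow identical
m = Mirror(primary, shadow)
print(m.read(0), m.read(1))  # b'data0' b'data1', served by alternating drives
```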
Now what happens when a disk fails? I just simply unplug that disk and replace it with a new 10 terabyte disk。
And now I copy all 10 terabytes from the disk that's still working over to the disk that's。
brand new。 Now what happens if a failure happens in the middle of the night?
Well now there's this vulnerability window between when that failure happens and when。
I complete copying all of the data。 What happens if that other drive fails?
Well then I lose all my data。 So to close that window down, I can have a hot spare。
So now instead of going out and buying a 10 terabyte disk to store 10 terabytes of data。
I'm going to go out and buy three 10 terabyte disks。 One is a primary, one is a shadow。
and one is a hot spare。 So that gets really expensive to store the data。
But it still is, again, less expensive than if I tried to over-engineer a drive to get a
very high mean time between failures. Okay, so let's look at the other end of the RAID spectrum.
There are lots of different ways we can do RAID. But one of the more common ways of doing RAID is what's called RAID 5.
And I say RAID 5 plus because this is the technique that's used for the higher versions of RAID.
So the idea here, and I'm going to go through this slowly, we're going to take the data。
and rather than saying all the data is going to go on one drive and then all the backup。
the redundant data is going to go on the other drive, I'm going to have an array of drives。
So here I have five disks。 And I'm going to stripe the data across those five disks。
Now I could have more than five disks, I could have 10 disks, I could have 12 disks, any number。
but the idea is I'm striping the data across the drives such that block zero。
goes on the first drive, block one goes on the second drive, block two goes on the third。
block three goes on the fourth。 And then the next block for this group is a parity block。
And I compute the parity block by XORing the data blocks。 And then I repeat。
So now here I have my next data block, my next, my next, and here we actually stripe
the parity. There are earlier versions, like RAID 4, where you just have one drive that's your parity drive.
But in RAID 5 you stripe the parity across the drives, so no one disk is the parity disk
and a bottleneck. So now I can read in parallel for block one, block two, block three,
and so on. Now when a drive fails, what happens? Let's say, I don't know,
we'll say that disk three fails. All right, well if disk three fails,
we can reconstruct the contents of disk three by XORing the blocks on the other drives.
So block D2 is simply D0, D1, D3, and P0 XORed together. Block D6 is D4, D5, P1, and D7
XORed together. So I can very quickly recover all of the data that was on that particular
drive. All right, now in this case, you know, typically you'll find these in something like a network-attached
appliance that sits in a box in your server room. But you could imagine that I could actually spread these disks around the globe.
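The XOR-based reconstruction just described can be sketched in a few lines (an illustrative toy with made-up block contents, not a real RAID controller):

```python
# Sketch of RAID 5-style parity: the parity block is the XOR of the data
# blocks in the stripe, so any single missing block can be rebuilt.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
p0 = xor_blocks([d0, d1, d2, d3])   # parity written at stripe time

# The disk holding D2 fails: rebuild it from the survivors plus parity.
rebuilt = xor_blocks([d0, d1, d3, p0])
print(rebuilt == d2)  # True
```

The same identity works for any one block in the stripe, which is why the scheme tolerates exactly one erasure per stripe.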
Of course there's going to be a trade-off, right? Because speed of light is finite and so I'm going to add latency to the time it takes。
to do reads。 But I could get more independence of failure if the data isn't stored all in one data center。
that could be hit by a tornado or hit by lightning or an earthquake. Okay. So now, in general,
the RAID schemes are some form of erasure code, right? We're assuming we have some way to tell that a disk is bad, and we do, because we can look at the
error correcting codes that are used and see whether we're unable to recover from the errors
in a block. So we can tell when a drive has failed, and that's considered an erasure.
And so erasure codes correct for erasures; they're erasure correcting codes.
So it turns out today our disks are really large。 All right。 So I said 10 terabyte。
I think you can go out and buy like a 12 or 16 terabyte drive。
It takes a really long time when you have one of these failures to read all of the other
drives and reconstruct that 10 terabytes of data that was on disk three.
And so now we have to actually look at: what's the probability that I'm going to suffer another
drive failure while I'm rebuilding disk three? So let's say, I don't know, disk four fails.
What happens if disk four fails while I'm doing my recovery? Oof, there goes my 10 terabytes.
I have no way to recover it, because the XOR is only good for one drive having failed.
And so while this was really good when drives were small, you know, when drives were hundreds
of gigabytes, or a terabyte or something like that, I could recover from a failure in
a few hours. Now it could take many hours, or over a day, to recover from a failure. During that time,
with something like RAID 5, I'm vulnerable to another drive failure.
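A crude estimate of that vulnerability window (illustrative assumptions: a 24-hour rebuild, independent failures, and the 50,000-hour MTBF from earlier):

```python
# Crude estimate of a second drive failing during a RAID 5 rebuild window,
# treating failures as independent and uniform over the MTBF.
mtbf_hours = 50_000    # per-drive mean time between failures
rebuild_hours = 24     # assumed time to rebuild onto the replacement
surviving = 4          # other drives in the stripe

p_one = rebuild_hours / mtbf_hours    # one drive failing in the window
p_any = 1 - (1 - p_one) ** surviving  # any of the survivors failing
print(round(p_any * 100, 3))          # roughly a 0.19% chance per rebuild
```

That sounds small, but it ignores the correlated "sibling" failures described next, which make the real risk considerably worse.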
Now I don't have time to dig into it, but if you look at, you know, the failure curves for drives,
drives tend to fail immediately or they tend to fail around their mean time between failures.
And if you buy drives from a manufacturer or a distributor, and you look
at the serial numbers, you'll see oftentimes they're sequential.
They all came from the same production batch, which means they all have roughly the same characteristics,
and if there are any manufacturing issues or anything like that, roughly the same mean time between failures.
So it turns out it's actually very likely, if one drive fails, that you're then going to
suffer a second drive failure. So when we go out and build a network of storage arrays in my group,
we actually go to multiple distributors so that we get drives from different batches.
We reduce this kind of sibling behavior that you see in failures. But it's still not perfect,
and so that's why we use things like RAID 6, which is a more
complex code that allows two of the drives in the stripe to fail.
So then you combine that with hot spares, and so if a failure occurs, you can immediately。
start rebuilding the array and close that window of vulnerability to multiple drive failures。 Yeah。
question. Yeah, so the question is: what about RAID 0? So RAID 0 is a recipe for disaster;
that's the polite way of saying it. RAID 0 is just: I'm going to take my two drives,
and rather than using one as a backup for the other, I'm going to treat them as what they call JBOD,
just a bunch of disks, one giant pool of disk blocks, one giant disk. So if either drive fails,
I lose my data. RAID 0 was only used because people wanted, for mistaken reasons, to create larger disk
pools where they could have a really big file or something like that.
But the reliability was really bad, because if either drive fails, I'm going to lose that really
large file. So nobody uses RAID 0 unless you really want to play with fire.
RAID 1 is probably one of the most common ones. A lot of the low-end
two-bay NAS boxes that you can buy for your desktop are basically doing RAID 1.
You just mirror from one drive to the other. And then if you have a number of drives, like five drives,
or even four drives, then you can do something more complicated. Okay,
so you have these more complicated erasure codes to be able to tolerate multiple
disk failures. And you can see that in the reading. So in general,
erasure codes are things like Reed-Solomon codes. The things that we were applying at the bit level,
we can also apply at the block level, and have a way of encoding it so that we're encoding some number of fragments.
So we can take our data, chunk it up, and then encode it into
a set of fragments. And then we only need m of those chunks in order to be able to reconstruct the data.
So we're replicating the data, but this is increasing our durability because now we have。
to erase more chunks before we would actually lose the data. So in the example here,
if we split the data into m = 4 chunks and generate 16 fragments,
then any four of those fragments can be used to reconstruct the original data.
So that makes it incredibly durable, but it comes at a cost: there's storage overhead associated with that.
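The storage overhead and durability of that m-of-n scheme can be checked with a little arithmetic (the fragment-loss probability p here is a made-up illustrative number):

```python
# m-of-n erasure coding arithmetic (illustrative): data split into m chunks,
# encoded into n fragments; any m fragments reconstruct the data.
from math import comb

m, n = 4, 16
overhead = n / m  # 4x the storage of a single copy

# If each fragment is independently lost with probability p, the data is
# lost only when FEWER than m fragments survive.
p = 0.1
p_loss = sum(comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(m))
print(overhead, p_loss < 1e-9)  # 4.0 True
```

So the price of that near-impossibility of data loss is storing four times the data, which is the overhead the lecture mentions.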
Now we can take this even further and make it really hard to destroy all the copies, or
a large number of the copies, by replicating the data across geographic regions.
So this gives us great performance for reads, because in the simple case
I only need to read one copy, if I just replicated the data at the full file level.
Or I just need a fraction of those fragments in order to be able to reconstruct the original data.
But it doesn't work as well for writes. Because when I do a write,
I need to be able to update all of the replicas. If I want to change some value in a file,
I have to update all of those fragments across all of those replicas.
If any of those replicas are down, then I'm faced with two choices. One is the system stops.
That impacts availability. The second choice is I use some kind of relaxed consistency model.
So for example, I can put version numbers on the fragments and then write them out,
and then when I'm reading back, make sure that I'm reading the latest version of the file.
There are all sorts of protocols that I can use to try and ensure that I'm reading the
latest version of the file. And then eventually, if the replicas come back up,
I can make them consistent by updating them with the writes that they've missed.
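The version-number idea can be sketched as follows (an illustrative toy; the tuple encoding and `read_latest` helper are made up for the example):

```python
# Sketch of versioned replicas: each replica stores (version, value);
# a read consults the reachable replicas and trusts the highest version.
def read_latest(replicas):
    live = [r for r in replicas if r is not None]  # None = replica is down
    return max(live, key=lambda vv: vv[0])

replicas = [(3, "new data"), None, (2, "stale data")]  # one replica is down
version, value = read_latest(replicas)
print(version, value)  # 3 new data
```

When the down replica comes back, it can compare version numbers and pull the writes it missed, which is the catch-up step described above.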
But it adds a lot more complexity。 But this is what a lot of the cloud providers actually use。
Again, they take your data and they replicate it out in these fragments。
And then they support the notion that one or more of those replicas might be unavailable。 Okay。 Now。
to get true reliability in this context, we have to think about all of the components.
We have to think about availability: are the replicas available and able to process requests?
We have to think about security: if a worm or virus comes in,
we don't want it corrupting our data by encrypting all of
our data and all of our fragments. Then our data is not available, not durable,
not accessible. And then of course you have to think about all of the possible faults.
So it actually turns out that a lot of effort has gone into designing cloud infrastructures。
And it's really interesting when there's an outage, like you can go to like Google's。
status page or you can go to Amazon or Microsoft。 A lot of times they'll have after action reports。
after a failure。 And it's really interesting to kind of dive into those and see what went wrong。
What was the critical thing that caused Google to go offline for three hours one day?
And it's oftentimes innocuous things like someone made a configuration change and made, a mistake。
It broke things and there was no way to roll it back。
Or an authentication server that no one realized was core to the entire worldwide operation。
went down and brought everything down。 Those are just some of the examples of things that have caused worldwide outages。
Okay, so how do we make a file system more reliable? If we think about file system reliability,
there's kind of a difference from what we have at the block level.
So we talked just now about how do we make our blocks durable。
Now we want to make our file system durable and reliable, so we want correctness as a component.
So what happens if a disk loses power, or your operating system blue-screens or crashes? Well,
there's some operations in progress that might actually complete。
So if I'd send stuff to be written to the disk and the disk controller still has power。
but the operating system crashed, well, maybe those writes still make it out to disk because。
they're in some disk buffer on the physical drive and not just in some operating system, buffer。
But some of those operations that are in progress might get lost.
Or the power is interrupted, and writes that were occurring just don't happen.
So I might have a block that I was writing to the disk that's only partially written; it doesn't
get fully written. And the thing is, having something like RAID isn't necessarily going to protect us against
these failures。 Right? There's no protection against an application or an operating system writing bad state。
RAID will dutifully protect that state that you've written, but it's bad state, garbage。 Right?
There also can be errors and issues with RAID controllers themselves, where maybe one of。
the disks in the RAID group doesn't get correctly written because there's a bug。
So the file system needs to have durability, kind of as a minimum, right?
If the file system's not durable, it doesn't matter if the blocks on the disk are durable,
because we won't know what those blocks mean。 And that's exactly what happened when we had the power failure with the storage。
The data was all there on the disks, on terabytes worth of disks。
But we had no idea what block belonged to what file。 And so for all intents and purposes。
it was lost. It was gone. It was unrecoverable. Okay. So we want durability, where, you know,
maybe we have to do some recovery actions after a failure occurs:
after an operating system crash, after a power system failure.
We might have to do some kind of disk-checking, file-system-checking operation in
order to reconstruct the state of the file system. But durability alone is not always quite enough.
Okay. So now if we think about it, right, when we do a single logical file operation, like a write,
we're touching many different physical disk blocks. All right,
we touch the i-node, because maybe we need to allocate it. We touch indirect blocks.
We touch direct blocks. We touch the data block. We have to touch the bitmap that tells us what blocks are in use and not in use, and so on.
And then there's sector-level remapping that might be being done by the underlying drive.
And if blocks contain multiple sectors, when I actually do a write, I might be writing over here,
writing over there, writing over there, to write the same one block. Right,
because it's been remapped due to defects on the drive. So at a physical level,
operations are completing one at a time. I do a sector write, I do another sector write, and so on.
We want concurrent operations for performance, right?
Concurrency allows us to get performance by being able to overlap I/O and computation,
and so on. So how can we guarantee consistency for our file system even if some crash occurs?
Well, to think about that, we need to think about what might be affecting our reliability.
What are the threats to the reliability of our file system?
Some of the threats that we're going to face are interrupted operations:
a crash occurs or a power failure occurs in the middle of updates to our file system.
And that could leave the data that we've stored in some inconsistent state.
Some of the data was written, some of the data wasn't; maybe some i-node was updated,
but the directory wasn't, or the free block bitmap wasn't, and so on. An example, you know,
sort of at the logical level, is: I want to transfer $100 from one bank
account to another bank account, and a power failure occurs in the middle.
What happens if I'd already taken the $100 out, but I hadn't deposited it in the other account?
$100 just disappears. The bank's books now don't reconcile for the branch,
don't reconcile for the account, and so on. There's also the loss of stored data.
If we have a disk that fails, we saw how we can address some of that with things like
RAID and erasure codes. So there are two approaches that we see in our file systems to try and get reliability.
The first is to have a careful ordering and a recovery process。
And the second is to do some kind of versioning and copy-on-write. The first case
is what's used by the FAT file system. It's also used by the Unix Fast File System.
In the FAT file system we have CHKDSK, and in the FFS file system we have the file
system checker, fsck. And what happens is that at each step we build the structure of the file system:
we write a data block, then we write the i-node, then we write the free list, then
we write the directory entry. And the last step, adding it to the directory, is what links it into the file system.
And if a failure occurs, we just simply rescan, looking for these incomplete actions.
Like maybe we only wrote the data block and the i-node, but we didn't update the free
list or we didn't add it to the directory. In the versioning case,
it's used by ZFS and OpenZFS and some other file systems。
We version the files at some granularity. And so we're going to create a new structure that links back to the unchanged parts of
a file, like if we're just appending to a file, plus the new parts of the file,
like that append that was the write we just did to the file.
And the last step is going to declare that the new version of the file is ready for use.
Let's look at that first approach where we use a careful ordering。
So here we think a lot about the sequence of operations, and we always execute those。
operations in a specific order。 And we assume failures will occur and could interrupt that process。
And then after a crash, as part of the operating system booting up, we're going to run a recovery。
process. And so this is why, like, you know, if you have a Mac or something like that, you'll
notice sometimes, if your operating system just crashes, when it comes back up,
it takes a long time to come back up. But booting doesn't actually take that long;
booting takes seconds. So what is it doing? It's running this recovery process, where it's checking the disk to see whether there are operations
in progress that have left the file system in an inconsistent state.
And the time that takes is going to be proportional to the fraction of the disk that you're using.
If you're like me and you have a large hard drive in here or a large SSD in here and it's
very full, those operations can take a long time. Okay. So again,
this is the approach taken by the FAT file system and by Berkeley FFS to protect
the file system and its metadata. Now notice I said file system and metadata. I did not say data.
So this approach does not protect the data that's stored by applications.
And so it's up to application developers to come up with their own recovery mechanisms.
And we'll look at those more in just a moment。 But there are applications like Word and Emacs that autosave to a different file。
And so that way if there's an interruption, you can always, you know, come back up and, it'll say。
oh, you know, there's an autosave version。 You want to use that instead of the version that you tried to open。
Okay。 So let's look at how it works in the Berkeley FFS file system。 So in the normal case。
we allocate a disk block, write the data block, allocate an i-node, write the i-node block。
and then we update the bitmap of free blocks and free i-nodes。 And then as a last step。
we update the directory with the file-name-to-i-node-number mapping。
And then we go and we update the modified time for the directory since we just added a new。
file to it。 Okay。 Now on recovery, what do we do? We scan the i-node table。
If we find any unlinked files, the files in which there's an i-node allocated, but there's。
no directory entry, what do we do? We have two choices。 We could delete it。
but maybe that file was something important。 That was somebody's thesis。
That was somebody's CS162 project。 So rather than delete that file。
what the FFS file system and many other operating systems。
do is they put it in a lost and found directory。 If you go and look in your Mac。
you'll see that there's a lost and found directory。 That's where, like。
files that are unlinked get put。 On the FFS file system, it would be files that have an extension。
like CHK, something like file000。chk。
So then you could manually go look at the contents of the file and realize, "Oh, this。
is something really important," and then manually link it back into the file system structure。
Or if it was garbage or a temporary file, you could just simply delete it。
Then we compare the free block bitmap against the i-node trees to see if, again, there。
were blocks that were written, but they weren't actually allocated to i-nodes。
And then we scan directories for any missing updates, and then update the modified and access times。
Again, the time to do all of this is going to be proportional to the size of the file system。
which is proportional to the size of the disk, assuming the file system fills the disk。
So it can be a very expensive operation if you have a very large drive。 So, yeah。 Yeah。
so the question is, "What happens if we crash after we allocate the i-node but。
before we write the data?" Or rather, after we write the data but before we allocate the i-node。
So if we crash after we write the data but before we've gone and allocated the i-node, then。
the data will be lost。 Because we haven't marked in the bitmap that the block is in use。
We haven't attached it to any i-node。 And so in that case, it's just simply lost。 So that's why。
like, at the application level, you might do things like Word might make two, copies。
So it has the autosave copy and the original copy。 And I'm doing a save。
I'll save to a temporary file。 And then after I complete the temporary file, right。
I'll then go and take the original file and rename it to something else。
And note that a rename is a single file system operation, so it's atomic。
And then I'll rename the temporary file that I just wrote to the original file。 So that way。
if I have a crash anywhere in there, there's either the autosave file or, there's the old file。
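That temporary-file-plus-atomic-rename pattern can be sketched in Python。 This is a minimal sketch, not the lecture's code; the function name and error handling are mine, and it relies only on the fact that a rename within one file system is a single atomic operation。

```python
import os
import tempfile

def atomic_save(path, data):
    """Write data to a temporary file in the same directory, then
    atomically rename it over the original.  A crash at any point
    leaves either the complete old file or the complete new file,
    never a half-written one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the bytes to disk first
        os.replace(tmp, path)       # the rename is the atomic commit point
    except BaseException:
        os.unlink(tmp)
        raise
```

Note the ordering: the data is forced to disk before the rename, so the rename can never commit a half-written file。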
And so I have something I can recover to。 But sometimes you get corruption during that autosave process, and now you end up。
with nothing but the original file。
which, if you're not saving frequently, could be very old。 There's a lot of questions。
I'll start here。 Yeah, so the question is why do we update the bitmaps last?
So we want to fully fill out the data structure before we then say we're allocating it。
If we updated the bitmap first, then we might have a data structure that was incomplete。
And so we'd have to throw it away anyway。 Yes, the question is what if we're allocating a file and we're writing to the same block。
So again, we're not protecting the data blocks。 We're only trying to protect the file system integrity to make sure that the file system。
at the end of the day contains valid pointers to data on the disk。
and files on the disk, and that there's nothing incomplete or inconsistent in the file system。 That doesn't, again, protect
us against, you know, writes to data blocks that don't make it out of the buffer cache。
It doesn't protect us against incomplete writes to a data block because I was in the middle。
of writing a data block。 We'll look at other approaches that will give us those kinds of protection。
But they come at a cost。 There's always a trade-off。 Okay。 Any other questions? Okay。
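The recovery scan described above can be sketched over a toy on-disk model。 The data model here, plain dicts and sets, is my simplification, not the real FFS structures:

```python
def recover(inodes, dir_entries, block_bitmap):
    """FFS-style recovery sketch.  inodes: i-number -> list of block
    numbers it owns; dir_entries: set of i-numbers reachable from some
    directory; block_bitmap: block number -> allocated flag."""
    # 1. An allocated i-node with no directory entry is unlinked.  Rather
    #    than delete it (it might be somebody's thesis), collect it so it
    #    can be relinked under lost+found.
    lost_and_found = [ino for ino in inodes if ino not in dir_entries]
    # 2. A block marked allocated but owned by no i-node is simply freed.
    owned = {b for blocks in inodes.values() for b in blocks}
    for b in block_bitmap:
        if block_bitmap[b] and b not in owned:
            block_bitmap[b] = False
    return lost_and_found
```

As in the lecture, the scan touches every i-node and every bitmap entry, which is why its running time is proportional to the size of the file system。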
So let's look at the second approach, which is a copy on write。 So if you think about it, right。
the multi-level index structure that we have, let's just figure。
out where all the data blocks are for a file。 We have the direct pointers。
We have the indirect pointer blocks。 We have the doubly indirect blocks, the triply indirect block。
And instead of overwriting the existing data blocks, why don't we just simply update the。
index structure? So we'll reuse all of the index blocks that are unchanged。
We'll reuse all of the data blocks that haven't changed and only update parts of the structure。
that have changed。 That's why it's called copy on write。 So that seems kind of expensive。
Like it seems like we're doing a lot of duplication, but we're really not。
Because one of the things we'll do is we'll batch updates。
Rather than trying to do every single update, we'll do this kind of copy on write, we'll。
batch together a bunch of updates, and then make the change。 All right。
So this is the approach that's taken in a lot of network file server appliances, the。
WAFL file system from NetApp; ZFS and OpenZFS also use this approach。
So here's what copy on write looks like。 And in this case。
I simplified things by using small write blocks in our tree。
But our file is just represented as a tree of blocks。 So here's our old version pointer。
The points to this index block, which points to another index block here and another index。
block here。 This points to an index block here that points to one of the extents。
Here's another extent, another extent, and so on。 And the writes。
let's say we're appending to the file。 So we really just need to update this leading fringe of the file。
So we can do that right。 Right? So we'll write to it。
And now we're going to "copy" the rest of that file。 We're not going to actually copy it。
We're just going to copy the pointers and as much of the existing data structure as, we can。
So if you think about it, what needs to change? Well, certainly this needs to change。
This index block would have to point to here and point to here。
So we're going to have to duplicate that。 And then we're going to need a new index block for this。
And we're going to need a new version at the top。 So we're going to do that。
And now we're just going to add pointers to point to the existing data structures。 All right?
Now we add our write。 So the actual overhead for this is relatively small。
Now the other nice thing that you might have noticed here is this also now allows us to。
have both the old version and the new version in a relatively space efficient manner。
And so a lot of these network appliances offer this as what they call a snapshot feature that。
will save some number of these old versions。 It'll keep one old version per hour for the last day。
And after that, one per day going back a week, and then after that, maybe one per month, and, so on。
And it can do this storage in a very efficient manner。 The relative overhead is relatively small。
assuming I'm not just completely rewriting, the files every time。 Okay。
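The path-copy idea can be sketched with a toy block tree。 This is my simplified model, not ZFS's actual on-disk structures: only the index blocks along the path to the changed data block get duplicated, and everything else is shared by pointer with the old version。

```python
class Node:
    """A block in the file tree: either an index block (children) or a
    data block (data)."""
    def __init__(self, children=None, data=None):
        self.children = children
        self.data = data

def cow_write(node, path, new_data):
    """Copy-on-write update: 'path' lists the child indices from this
    node down to the data block being rewritten.  Only the blocks along
    that path are duplicated; every unchanged subtree is shared, by
    pointer, with the old version."""
    if not path:
        return Node(data=new_data)             # fresh data block
    copy = Node(children=list(node.children))  # copy just this index block
    copy.children[path[0]] = cow_write(node.children[path[0]], path[1:], new_data)
    return copy
```

After the call, the old root still describes the old version of the file, which is how the snapshot feature falls out almost for free。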
so file systems that implement this include ZFS and the open-source version of, it。
which is called OpenZFS。 They use variable-size blocks, so you can have different-size extents。
It's a symmetric tree, so we know whether the tree is going to be large or small when we have to make a copy。
We store version numbers with the pointers, and so we can just create a new version by adding the blocks and the pointers as I showed。
on the last slide。 And we buffer a bunch of writes together before we write it out to disk。
So again, we're trying to protect the file system integrity。
We're not necessarily trying to protect your data integrity or your data。
So your data is going to live potentially in buffers longer before we make it go out, to the disk。
And then free space is a set of extents。 And that way we can just simply grab an extent when we need some large blocks of data to。
allocate to a file。 Okay。 So I want to take a moment and talk again about our collaboration policy。
I know this is a super busy part of the semester when everybody's working on the last project。
and working on homeworks and exams and everything else。
So some of the things that are good to do in this class, explaining a concept to someone。
else in a group, doing that at a high level。 Discussing algorithms or testing strategies with other groups。
So how you might do glass box or black box testing, how you might try to design the space。
of test cases that you're going to use。 Discussing how you do debugging or different approaches to doing debugging with another。
group。 And then if you're going online looking for generic algorithms like what's a good hash。
algorithm or what's a fast memory allocation algorithm, things like that。
Where it's a generic algorithm。 So all of these are good things to do。
Things that are not good to do are sharing code or test cases with another group。
So that's even if it's just a fragment of the code or it's a fragment of the test case。
Copying or reading another group's code or test cases。
Unfortunately we've had situations where we had to have those hard conversations because。
someone simply looked at someone else's code。 And then many hours later went and implemented their code。
It turns out the way we each write code, the way we each design our code is kind of like。
a fingerprint。 It's different。 It's unique。 And so just looking at how someone thought about structuring their code can be enough to。
influence how you structure your code。 And then that triggers our cheating detection algorithms。
And then we have those hard conversations。 Copying or reading online code or test cases from prior years。
So disk space is cheap。 We've kept a lot of years worth of submissions。
And we compare against all of those submissions。 The TAs are also really good at Google。
And they find all the instances of 162 code and repos that people have made public and。
things like that on the internet。 And again, if you look, even just looking at one of those。
even if you don't try to copy, from it, it's going to influence the way you structure your code。
It's going to influence the way you write your code。 And again。
it's going to lead to one of those difficult conversations。
So helping someone in another group debug their code is not acceptable。 And again。
what we've seen happen before is someone's hit a wall, they can't get through, it。
and a friend comes along, helps them debug their code and says, oh, you know, I just。
did it this way。 And they change it。 And then we detect that because it turns out that was a unique way of solving that particular。
issue。 So those are all things you should not do。 And again。
we compare all of the prior-year and current-year submissions, and all the online solutions, against。
everybody's submission。 So what do you do if you hit the wall? You know, the deadline is here。
code is due, and it's not working。 Well, that's where you reach out to your core staff。
We're here to help you get through this class。 And we have a lot of office hours。
We have a lot of TA support, a lot of reader support。
And it's all designed to help you with any problems that you run into。
So if you find yourself in a situation where you think you have to turn to Google as your, friend。
please don't。 Instead, turn to your TA, turn to your instructor, turn to your reader for assistance。
Also, please don't put your friends in the situation where you ask them for help, you。
ask them for code。 We've had situations where a roommate left their computer unlocked and another student。
looked at their computer and saw their code, saw how they did things, and they influenced。
their design。 Now we have a really hard conversation between two friends who are now not friends。
So with all of that in mind, again, if you run into a situation where you think you。
need to turn to doing something like that, please come see us。 That's why we're here。
Any questions? Yeah。 The question is, is it okay to read OS source code? Yeah。
it is perfectly fine to read the source code for a public operating system like BSD。
or Linux or any other operating system。 That's not going to necessarily change or influence directly your solution in 162。
but, you'll see an example of how an operating system implemented something。 The generic cases。
that's perfectly fine。 But if you go and you find somebody else's 162 code repository that they left up on GitHub。
that crosses the line unfortunately。 You're welcome。 Any other questions? Okay。
so let's get back to technical stuff and talk about some more general types of reliability, solutions。
All right, so we're going to use transactions to implement atomic updates。
And the goal here is we want to make sure that multiple related updates either completely。
happen or don't happen at all。 So it's an all or nothing。
So if there's a crash that occurs in the middle of a transaction, at the end of the day。
after we've completed recovery, our state is either everything is complete or nothing, happened。
We can't end up in some kind of inconsistent weird state where $100 got taken out of a bank, account。
but didn't get deposited into the target account。 Now, as I said, most modern file systems。
their focus is on protecting the file system itself。
and the metadata associated with the file system。 They don't try to do anything about the data that's stored in the file system。
Those were my examples of where they do try to protect some of the data in the, file system。
But a lot of times, applications have to implement their own types of transactions or。
other techniques to preserve their data。 So we also want to provide redundancy, of course。
for media failures。 And again, we saw how we can do that with things like error correcting codes on individual。
drive and then by replicating the storage media itself through something like RAID or by using。
erasure codes to replicate the data blocks across a geographic area。 Okay。
So transaction is kind of really closely related to something we've already looked at, which。
is critical sections。 Right? We use critical sections when we wanted to manipulate a shared data structure。
A transaction takes this notion of doing an atomic update to a shared data structure and。
takes it from memory to persistent storage。 Right? If you think about it。
like with a critical section, the idea was everybody outside the。
critical section either sees the data before you entered the critical section or they see。
the data after you left the critical section。 They can't see the intermediate state of the data。
That's what a transaction is going to do for us, but with stable storage。
It's going to do it with persistent storage。 Now there are lots of ad hoc approaches, right?
So what we saw in the Fast File System is a really careful ordering of the updates。
to the file system data structures, such that we could either leave it the case that the。
file never got created or never got updated, or it's the case that the file got created, or updated。
But we don't end up in an inconsistent state for the file system in between。 Like I said。
applications still do techniques using temporary files or other things to try。
and make things atomic。 So the key concept here is transaction。
A transaction is an atomic sequence of reads and writes that takes our system from one。
consistent state to another consistent state。 And again, just like with a critical section。
transactions appear atomic to all other threads。 So a thread can't see what's going on with the state in between while the transaction。
is active。 So again, transactions are extending our concept of atomic updates from memory in a critical。
section to stable storage, persistent storage。 So the typical structure。
first thing we do is to begin。 That gives us a transaction ID。 Then we do a bunch of updates。
Now let's say we run into trouble。 Our program throws an exception。 That's fine。
We abort the transaction and roll back like it never happened。
What happens if we run into a conflict with some other transaction?
We're trying to read and write the same files that that other transaction is trying to。
read and write, or the same data structures that that transaction is trying to read and write。
So we have a conflict。 That's fine。 We abort both transactions or one and just roll back。 All right。
So that's the idea: when we do the begin, we're going to kind of snapshot what's going, on。
Then you can do whatever you want with the playground。 If anything goes wrong。
we just make it all go away。 We roll back to that snapshot。 Then finally。
when we're done making our changes, we commit the transaction。 So let's say we have a classic。
this is sort of the classic example of a transaction, say, from 186。
I want to transfer money from Alice to Bob in a banking environment。
And so I first start by decrementing the balance of Alice's account。 And this bank。
as many databases work, has two sets of ledgers。 It has individual account ledgers and then it has ledgers for the bank branches。
So I need to decrement $100 from the bank branch where Alice has her account。
That's where she went and opened her account。 Then I'm going to increment Bob's bank balance。
I'm going to increment the balance for the branch where Bob's account is stored。 And then I'm done。
All right。 So I've got four separate operations doing reads and writes, but I want them to appear。
as one logical, indivisible operation。 The way I do that is super complicated with a transaction。
I put a begin at the beginning and I put a commit at the end。 That's it。
It was actually not very complicated。 And so what this does is it guarantees that even if I've got other transactions that are。
incrementing and decrementing accounts and branch balances, this is one indivisible unit。
Other transactions can't see the intermediate state。
This transaction won't see the intermediate state of other transactions。 And when the commit occurs。
the system guarantees that the results are durable。 All right。 Questions? Yes。 Yeah。
So the question is that we can't see the intermediate state。
Could there be a double-spending situation? So what the system does is it guarantees that if there are multiple。
threads that are simultaneously trying to decrement Alice's account, then we would either make。
one wait or abort that transaction and the other transaction would be allowed to proceed。
So both transactions can't actually be manipulating the data at the same time。
So if you want to think about the ordering, even though all the transactions look like。
they're running at the same time, there's an ordering and one transaction occurs logically。
before another transaction occurs before another transaction occurs and so on, if they're working。
on the same data。 If they're working on independent data, independent accounts。
independent bank balances, then all, of those could be simultaneously running at the same time。
And so part of what the system has to do is keep track of what's being accessed by what。
and what's part of what transaction。 So this is really simplistic。
It's actually a lot more complicated。 There's locks。
There's all sorts of other things that monitor what's going on。 Okay。
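The begin/update/commit-or-abort structure above can be sketched over an in-memory table of balances。 This is a toy, hypothetical API: the snapshot-based rollback stands in for the locks and logging a real system uses, and the overdraft check is just an excuse to show an abort。

```python
class Transaction:
    """All-or-nothing updates to a dict of balances via snapshot/rollback."""
    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store)   # "begin": remember the consistent state

    def update(self, key, delta):
        self.store[key] += delta

    def abort(self):
        self.store.clear()            # roll back as if nothing happened
        self.store.update(self.snapshot)

    def commit(self):
        self.snapshot = None          # changes are now permanent

def transfer(store, src, dst, amount):
    txn = Transaction(store)
    try:
        txn.update(src, -amount)
        if store[src] < 0:
            raise ValueError("insufficient funds")
        txn.update(dst, +amount)
        txn.commit()
    except ValueError:
        txn.abort()                   # either everything happened or nothing did
```

Either both balance updates take effect or neither does; an observer never sees the money removed from one account but not yet added to the other after the transaction finishes。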
Another concept that we need is the concept of a log。
So there's a simple action that's atomic in a log which is writing or appending a basic, item。
And we're going to use that to seal the commitment of a whole series of actions。
So here is a whole set of actions that we'll say is associated with a particular transaction。
Now there may be other operations that are going on that we've written in the log that。
are not part of this transaction。 These actions will all have that ID that was part of when we did the begin。
When we do the begin, we get back a transaction ID。
A transaction ID is assigned to everything that we're writing in the log。
So when we start a transaction, we get that ID N that's going to be attached to all of, these items。
When we're done, we're going to commit that in the log。
We're going to write a commit record to tell us that this transaction is now complete。
So when we look at a log, if we just see a start and we see a bunch of actions for that。
transaction, but we don't see a commit, then it did not finish。
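That scan for matching start and commit records can be sketched directly。 The record format here, pairs of transaction ID and record type, is my invention for illustration:

```python
def committed_transactions(log):
    """Scan a log of (txn_id, record_type) entries.  A transaction only
    counts if its commit record made it into the log; a start with no
    matching commit means the transaction never finished."""
    started, committed = set(), set()
    for txn, kind in log:
        if kind == "start":
            started.add(txn)
        elif kind == "commit" and txn in started:
            committed.add(txn)
    return committed
```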
So how do we turn this into transactional file systems? Well。
the idea is we want to get better reliability by using a log。
So any changes we're going to make to the file system, we're going to treat as transaction。
So previously, like in the file system, we went in and we manipulated the i-node, we。
manipulated the bitmap, we manipulated the directory entry。
We went and did those as we were doing the operation。
Now we're going to put all of that in the log instead and do it through transactions。
So a transaction, again, it's committed once it's written to the log and we see a commit, record。
we could take one of two approaches。 Because the log is going to be in the buffer cache。
we could force the log to disk。 So we're going to block, stop。
wait until the log reaches disk when we do a commit。
Or we could put the log in some sort of nonvolatile memory。 All right。
and that way it'll survive system crashes and power outages。
So we're doing everything in the log first。 This is write-ahead logging。
And then we're going to go and do things in the file system。
So even though the file system gets updated at a later time, we're going to ensure that。
eventually it's updated if it should get updated。 Now there are two different ways we can use logs and file systems。
One is what's called a log structured file system。
The other is what's called the journal file system。
Everybody in this room is using journaling file systems because all of the modern file。
systems use journaling。 The difference is in a log structured file system, we put the data。
The data blocks also go in the log。 In a journaled file system。
we only use the log really for recovery purposes。 So the data actually goes out to the disk just as it would in a normal file system。
You can think of journaling as kind of an add-on to an existing file system。
where a log structured file system is kind of a ground up reimplementation of a file system。 Okay。
So in a journaling file system, we don't directly modify the data structures on disk。 Instead。
we write each update as a transaction in the log。 So this is typically called a journal or it's called an intention list。
And it's maintained on the disk。 So we allocate blocks for the log。
We can again keep it cached in memory and either force the log to disk when we do a commit。
or we can choose to put it into non-volatile RAM。 Now once a change is in the log。
and it's been committed, then we can go and apply it to, the file system。
So I'll walk through an example because it's probably a lot of words and sounds a little。
complicated。 The idea is that there might be an instruction like "modify the i-node pointers in this particular。
way"。 We log that in the log, and then later on we're actually going to go and modify the file system。
Again, this is in contrast to before, where we just went and modified the i-node directly。
Now we're going to instead record, hey, go modify the i-node。 That's what we're going to do。
And then we can clean up the logs。 The log doesn't grow unbound once we've actually performed those operations on the file system。
So like I said, it's kind of a bolt-on that you can do。 For example。
one of the original file systems for Linux, EXT2, was a Fast File System-like file system。
And they added journaling to it and created EXT3。 So it's a relatively straightforward change。 Okay。
Other examples of journaling file systems。 This is why I say everybody is using it。
If you're using NTFS, if you're using HFS plus or APFS on Mac products, if you're using。
Linux XFS or JFS or EXT4, all of those are journaling file systems。
But now let's peek under the covers and see exactly how those file systems work。 All right。
So first, let's look at an example of what happens when we don't have a journal。
So I want to create a file。 So what do I do? First thing I do is I find some free data blocks。
All right。 Okay。 I found some free data blocks。 Then I find a free i-node entry。 Okay。
I found a free i-node entry。 And I find a directory insertion point。 And I found that。
And now I'm actually going to go and do everything。
So I'm going to write the bitmap to say that I'm using the data block。
I'm going to write the i-node entry to point to those data blocks。
Then I'm going to update the directory entry to point to the i-node, the i-number。 And I'm done。
Okay。 So I made all of these changes on the disk in the file system。
And I did it in this very precise order so that if I had a failure, I could try and go。
back and kind of figure out how to reconstruct the disk using my FSCK, program。
So how does it work with a journal? It's a little different。 So we have our log。 Again。
we're going to store this preferably in non-volatile storage。
Or we're going to have to force it to disk and wait for the disk。
So now we're going to find the free data blocks。 Okay。 I found a free data block。
We're going to find the free i-node entry。 Then we're going to find。
the directory insertion point。 Found that。 All right。
So now this is very similar to what I did before。 Now I'm going to record my intentions in the log。
Instead of going and modifying the file system itself, I'm going to write the instructions。
for what I want to do to the file system。 So I want to write the map and mark it as used。
I want to write the i-node entry to point to the block I just allocated。
I want to write the directory entry to point to the i-node that I just allocated。 Okay。
And after I've done all of that, now I'm done。 I've created my file。
So now I'm going to commit my transaction。 All right。 So there's a couple of things。
Here we divide our log up into regions。 So "done" is the region where the state of the disk reflects what we had in the log。
We've done the operations that we said we wanted to do in the log。
We have our tail of our log and then we have this region called pending。
The pending region is all the operations we haven't done yet。 So now in this case。
we have a bunch of pending region of pending operations that we just, created。
So now that we finished our transaction, we're going to move our head to the very end。
And all of this is in our pending region。 So these are instructions that say what we want to do to the file system in order to。
create that file。 We haven't modified the file system。
The file system still on disk reflects the old state of the file system。 So after we commit。
eventually what we're going to do is we're going to replay that transaction。
We're going to actually do the intentions, the intended actions that we had in the log。
So all accesses in the meantime to the file system have to first look in the log。
So it means if I want to look for that file that I just created, I'm going to have to scan。
The operating system will scan through the pending operations to see whether that file。
exists。 And then it'll look out onto the disk to see what's the state of the disk。
All right。 So for that file that I just created, the directory entry only exists here in the log。
What's on the disk is stale。 It reflects the before time, not the current state of the file system。
So we've added some complexity here。 Before, we could just go out to the disk。
and look at the disk, which holds the state of the, world。 Now we have to look in the log。
read through the actions in the log to determine whether。
or not something exists or not or what the state of it is。
Now eventually what we're going to do is we're going to copy the changes that are in the。
log to disk。 So we'll start, so assuming we've already done everything that was pending, we'll copy。
our changes out。 So here we'll mark the block as allocated。
then we'll mark the i-node as allocated, pointing, to the data block that we just marked as allocated。
and then we'll add the actual directory entry, pointing to the i-node that we allocated。 All right。
Now at this point, our tail is at the commit, and when our tail reaches our head, we can。
now discard everything that's in the log。 We can do garbage collection。 All right。
Now what happens if after a crash, I'm recovering, I'm reading through the log, all right, as。
the system is booting up, and I happen to see this, and I'm scanning through the log。
and I see a start, and I see, oh, it was allocating a block, right。
But what I don't see is any directory entry。 What I don't see is any commit。 So in this case。
I've got a transaction that has a start, but it has no commit。 So it was interrupted。
and so therefore that transaction did not complete。
And so all I'll do is just simply discard the log entries。 The disk remains unchanged。 All right。
Now, if instead I see a commit, that means the transaction completed。
So when I'm scanning through the log, I find a start, I find a matching commit。
So there are two choices。 I could actually redo it right now and make it happen。
Or I could just keep booting up the system, and eventually the log processing code will。
just simply replay everything from the head until it gets to the last commit, and we're, done。
And then we'll discard those changes。 But those changes will have been applied to the actual file system on disk。
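The recovery-time behavior just described can be sketched in a few lines。 The record formats are my invention: redo every write belonging to a transaction with a commit record, and simply discard the records of uncommitted transactions。

```python
def replay_journal(log, disk):
    """log: list of records, either ("start", txn), ("commit", txn), or
    ("write", txn, block, value).  disk: dict of block -> contents.
    Writes of committed transactions are (re)applied; records of
    uncommitted transactions are simply discarded."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, block, value = rec
            disk[block] = value     # carry out the logged intention
    return disk
```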
So why do we go through all this trouble? Because it gives us updates that are atomic。
even if we crash。 So even if the operating system crashes, even if we suffer a power outage。
we're able to, ensure that our file system remains in a consistent, coherent, correct state。
Updates either get fully applied or they get discarded。 So all of the operations we're doing。
like creating files, deleting files, updating files, occur as atomic operations。
They either complete or they never happen。 So you could say, well, is it expensive? Yeah。 Right?
We're writing everything twice when it comes to our metadata。 We're writing it to the log。
and then we're going and we're actually writing it to the。
disk and modifying the data structures in the disk。
And that's one of the reasons why in modern file systems we only do journaling for the。
metadata updates。 Because if we did journaling for the actual data blocks。
our logs would just blow up in size。 And we'd be doing a lot of copying if we had to copy the data blocks out of the log and。
into the rest of the disk。 So it's kind of a trade-off, right?
We're able to protect the file system structure integrity, but we're not protecting the contents。
of files。 And so that's why, again, applications will choose to do their own thing at the application。
level。 And again, there are lots of different approaches replicating the files and temporary files and。
things like that that we can use。 Questions about journaling? Yeah。 Sure。 [INAUDIBLE], Yeah。
So the question is, when I'm recovering here, right? I see that I have a start。
I see I have a commit。 So I know that this completed。 So I'm going to redo this。
I'm going to redo allocating this, redo linking in the inode, redo linking in the directory entry to the inode。
What happens if the crash occurred right after I'd already done all of that once?
Do I end up with two copies of the file? The answer is no,
because all these operations are what we'd call idempotent。 Like,
I can repeat the operations multiple times and I'll get the same outcome。
Because when I'm doing this, I'm not saying, find a free block。 I'm saying。
I'm allocating this very specific free block in the bitmap。 Similarly, I'm saying。
I'm allocating this specific entry in the inode and this is what I'm going to write to it。
And similarly, I'm saying, I'm allocating this specific directory entry and this is what。
I'm going to write to it。 So I can redo, I can actually redo these operations multiple times and I'll still get the same。
physical outcome on the disk。 And that's important。 These operations all have to be idempotent。
If any of them were not, then making changes would result in having, like you said, multiple
files being created。 Okay, so that finishes our journaling summary。
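The recovery logic just described can be sketched in a few lines。 This is a toy model with a hypothetical record format, not any real file system's journal layout: each record names an absolute location and the exact value written there, which is what makes replay idempotent。

```python
# Minimal sketch of journal recovery (hypothetical record format).
# Each "write" names an absolute location and the exact value stored there,
# so replaying it any number of times yields the same disk state.
disk = {}

log = [
    ("start", 1),
    ("write", 1, "bitmap[17]", 1),        # allocate this specific block
    ("write", 1, "inode[4].block0", 17),  # this specific inode entry
    ("write", 1, "dir[2].entry3", 4),     # this specific directory entry
    ("commit", 1),
    ("start", 2),
    ("write", 2, "bitmap[18]", 1),        # crash before commit: discarded
]

def recover(log, disk):
    # Redo only transactions that reached their commit record.
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, loc, val = rec
            disk[loc] = val  # absolute write: safe to repeat

recover(log, disk)
recover(log, disk)  # a second replay changes nothing: the redo is idempotent
```

If the writes were relative (say, "increment the free-block count"), replaying them twice would corrupt the state, which is exactly why the records have to be absolute。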
Next time we'll talk about LFS and then we'll get into networking。 Okay。
P23:Lecture 23: Distributed Decision Making (Con't), Networking and TCP I - RubatoTheEmber - BV1L541117gr
Okay, let's get started。
We have a lot of material for today。 Okay, so we're going to talk about distributed systems。
Then we're going to look at distributed decision making and, have distributed commit protocols。
We'll look at networking and then, time permitting, we'll get into TCP/IP。 Okay, so if we,
if you recall, we have societal scale information systems。
So these span the space from these massive clusters of tens of thousands of machines。
all the way down to MEMS devices like airbag accelerometers。
We're surrounded continuously by dozens to hundreds of microprocessors。
So if we were to just guess in this room how many microprocessors there are, we
would probably be off by a factor of three or four。 Right, each person has three or four devices。
A phone, a tablet, a laptop, a watch, all with microprocessors。
But the room itself is filled with microprocessors。
So everything from each of the wireless access points around the room, the lighting controller。
the AV controller, the wireless mic, the projector, the thing that raises and lowers the screen。
All of these have microprocessors and they're all interconnected and, running various applications。
So that's kind of what we're going to dive into today is these distributed systems。
These distributed applications。 Now it wasn't always like this。
If we look at sort of the dawn of time, it was large mainframes。
Might spend $5 million on a mainframe。 And then you'd have dumb terminals connected to it。
They were called dumb because they literally had no intelligence at all。
All they could do is receive commands to display things on a screen。
And when you typed a keystroke, that keystroke got sent all the way to the mainframe,
where it was processed by the application。 And then the application basically drew the screen。
The closest analogy today would be is if you use protocols like remote desktop or, remote display。
where all you're really using your computer as is a frame buffer。
And a way of inputting keystrokes and mouse inputs。
So that kind of model in the 80s when we had personal computers became a client server model。
And now we have some of the compute done on desktops。
And those desktops are clients making requests of that mainframe or that server。 So single computer。
Today, what we have is more of a peer to peer model, where we have distributed systems。
So that functionality that used to live on that one single giant $5 million mainframe。
now lives on servers。 Originally those servers were all in the same room, then in the same building。
Now they can be globally distributed。 So this is again called a peer to peer or widespread kind of collaboration model。
Okay, so why do we want distributed systems? Well, that mainframe that costs $5 million。
let's say it could support 10,000 of those dumb terminals。 What happens when I buy my 10,001st terminal? I have to go out and buy another $5 million mainframe to support one additional user。
All right, and so the sort of unit of growth is very large and doesn't really scale very well。
The idea with distributed computers is that these smaller computers, these smaller servers。
or units are much cheaper to produce, maybe a few thousand dollars instead of a few million。
dollars。 And so I can scale my compute incrementally as I scale my users。 Or if it's a web server,
as my demand increases, I scale incrementally。 And so there's great economic reasons for distributed computing。
When we look at client server, users have control over some of the components, right?
The client could be your phone running an application that's talking to a server somewhere, else。
But you control your phone。 You can choose not to run that application or to run a competing application。
And then there's collaboration, right? A simple example is Google Docs, right?
We can all work on a shared network file system on a document together。
So the promise of distributed systems is they give us higher availability。 One machine goes down。
I just simply use another one。 They give us better durability。
I store my data in many different locations。 They give us more security。 Right?
Each piece becomes much easier to make secure。 However, the reality has been really disappointing。
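Both that promise and the disappointment we're about to see can be quantified with a simple independence calculation: replication helps when any one copy suffices, and hurts when you depend on every machine being up。 This is an idealized sketch; the per-machine availability and replica count are made-up numbers, and real failures are rarely independent。

```python
# Toy availability calculation under an independence assumption.
a = 0.99   # assumed availability of a single machine
n = 5      # assumed number of machines

# Promise: the system works if ANY one replica is up.
any_up = 1 - (1 - a) ** n   # roughly ten nines for these numbers

# Reality Lamport warns about: the system works only if EVERY machine is up.
all_up = a ** n             # drops to about 0.95
```

The same five machines give you either far more nines or noticeably fewer than one machine alone, depending entirely on whether the design needs any of them or all of them。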
So this is Leslie Lamport。 He's a Turing Award winner and he has made some very fundamental contributions in terms。
of distributed systems, the notion of distributed time and distributed clocks。
And we'll come back and see some of his work later on in this lecture。
But the reality has really been disappointing, right? So we get worse availability。
We depend on every machine being up。 So in addition to all the other stuff he's famous for,
he's famous for this quote, where he said, "A distributed system is one in which the failure of a computer you didn't
even know existed can render your own computer unusable。", And I had this all day today on campus。
I kept getting kicked off of eduroam。 And I think it's because some authentication server is overloaded and so I was getting
timeouts to it。 And I don't know where that server is located,
but I wanted to go there and do not-nice things to that server because it was making it really difficult to get my work done。
But that's what I think。 I don't know。 It's some computer I don't know about that's keeping me from getting on the network and
keeping me from getting my work done。 And that's the problem with distributed computing。
We've all run into that situation, not just here on campus where something breaks and it's。
because something that we don't know about or have any control over is not working。
And so we can't get our work done。 It's worse reliability。 So potentially any machine crashes。
now the system goes down and we can't get work done。
And it's worse security because I break into any component and then I may be able to, from, there。
leapfrog into other components and so on and so it's kind of my weakest link becomes。
my greatest vulnerability in my system。 And it's also much more difficult。
We spent the beginning part of the semester talking about coordination on a single machine。
where we had multiple threads that needed to coordinate on some data。
And we could just use test and set。 But now that data is scattered all over the globe。
So how do we synchronize that? What was easy in a centralized system becomes difficult,
extremely difficult, in a decentralized system, as we'll see。
We're going to see we need very complicated algorithms in order to deal with it。
So there's all these other issues, right? There's trust issues。
I have to trust third parties that might be storing my data or running my computation。
There's security issues。 There's privacy issues。 Right? When you store your photos in Facebook。
what does Facebook do with your photos? Right? Or any other cloud service that you store private data in。
There's denial of service potential risks。 Right? And there are lots of variants of problems that we had in a single computer environment。
that now get amplified when we're in this distributed computing environment。 Right?
How can we build a distributed application from a bunch of third party components?
How can we trust that those third party components are going to behave correctly and
perform correctly? Right? So sort of a corollary of Lamport's quote is: a distributed system is one in which you
can't do work because some computer you didn't even know existed is successfully coordinating
an attack on your system。 Right? Like when someone's launching a distributed denial of service attack against me。
And we're starting to now see distributed denial of service attacks that are just phenomenal。
A terabit of traffic directed at a server。 Right? There's no server on the planet that can absorb that kind of attack。
All right。 So maybe we need to step back and think about what are some of our goals?
What do we want out of a distributed system? What are some of the requirements?
So the first one I'd say is well, transparency。 And this is really the ability of the system to mask complexity behind some simple interface。
So there's lots of different kinds of transparencies。 Right? So one transparency might be location。
Right? So I can't tell where resources are located。 Right? When I go and I read my bMail,
where is my bMail stored? I don't know。 Right? Do I care? No。 As long as I can read my mail,
I don't care where it's stored。 There's migration。 Right?
So the servers that manage the data for bMail are constantly moving that data around
for various load balancing and other reasons。 Do I see that? Is that visible to me? Well sometimes。
right? Occasionally you might go to log into some system and get told an account migration
is in progress。 Like banks seem to love to do this sort of stuff。
You go to like Go Pay a bill and they're like, "Sorry, go away。 You know。
we're moving your money somewhere else。", Right? But in many cases, it's completely transparent。
I have no idea where my data is or that migrations are taking place。
And that is good for the providers of the service。 Right?
So I can store my data where it's most efficient from a location transparency standpoint。
And from a migration standpoint, it means if I need to do maintenance, I just simply。
move those accounts off the server。 I can do some hardware maintenance。
And then I move those accounts and that data back。 Replication。 Right? How many copies。
when I save a document to Google Drive, how many copies are created? Right?
It turns out it can be in some cases as many as a dozen copies that get created。 Around the globe。
Again, to provide durability。 But I don't care how many copies。
All I care is that when I go to get my document, when I go to load my slides, my slides are, there。
There's concurrency。 Right? How many users are simultaneously using Bing Search when you're using it?
Well, maybe that wasn't a good example。 Maybe I should have used Google Search。 Right?
You don't know。 Right? Performance appears to be independent of the number of users。 Parallelism。
So when I do a search query, right, I could implement that by going to one machine and。
that machine reading through a petabyte of index data to find the terms and return the, pages。
That would take a very long time。 Instead, that query gets routed to tens of thousands of machines。
each of which has a, small portion of that global index。 Fault tolerance。 Right?
So machines fail all the time in the cloud。 And you don't see it happening because transparency hides that from you。
So transparency and collaboration rely on different processes on different machines
being able to communicate with each other。 But that requires communication。
So you can immediately ask the question, how do these processes communicate? Well。
they communicate using a protocol。 Right? So a protocol is an agreement on how to communicate。
It includes two components。 A syntax, which is how you specify the messages, how you structure them。
It's the format。 The order that messages are delivered。 And then it's the semantics。 Right?
What a particular communication means。 It's the actions that you take when you get that particular message。
Right? Now we can-- or what happens when something like a timer expires。 Okay。
so we can formally describe a protocol with a state machine。
And typically we'll use some message transaction diagram to represent it。 Right?
And so you can think about a protocol as really this kind of partitioned state machine that we're keeping in sync between
two entities。 We can also add stable storage。 And we'll see that when we get into distributed transactions and distributed commit, as a way
of ensuring that even if failures occur, these distributed partitioned state machines。
remain in sync。 All right, so let's-- that's kind of the abstract, the formal definition。
Let's look at an example drawn from the real world。 So human interaction。
And the interaction we're going to use is the telephone。 Okay, so I want to call my friend。
So what do I do? I pick up my phone or open my phone or pick up the handset。 Right?
And then I listen for dial tone or I check, do I actually have service? Right? In this room?
And I have service。 Great。 Okay。 So then I dial。 What happens?
I start hearing some ringing and then the other person is going to hear ringing。
They're going to pick up the phone and they're going to answer。 They're going to say hello。
And I'm going to say hi。 It's Anthony or hi it's me or whatever。 And then I'm going to say。
"Hey blah blah blah blah blah。" Pause。 Then they're going to respond, blah blah blah blah blah。
Pause。 And then say bye。 And then they're going to respond back bye and then hang up。
That's a protocol。 So how do we initiate it? So the ringing initiated the protocol to the other side。
And that side responded with the introduction, the hello。
And then I responded with a hello and along with some requests。 And then they respond back with。
"Okay, they'll do that request。", Now we want to close the connection。
And so we have this joint sort of bye and bye, and then we can hang up。
So that's an example of a protocol。 You're seeing the syntax here。
You're not seeing the semantics because you're not really seeing the content。
Although you can see the initial message, the hello, and the closing messages, the bye。
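The phone-call exchange can be sketched as the kind of partitioned state machine described above。 The states and events here are illustrative, not drawn from any real telephony standard。

```python
# The phone-call protocol as a toy state machine.
# (state, event) -> next state; any other pair is a protocol error.
TRANSITIONS = {
    ("idle", "dial"): "ringing",
    ("ringing", "answer"): "connected",   # the "hello"
    ("connected", "request"): "connected",
    ("connected", "response"): "connected",
    ("connected", "bye"): "closing",
    ("closing", "bye"): "idle",           # both sides say bye, then hang up
}

def run(events, state="idle"):
    for ev in events:
        state = TRANSITIONS[(state, ev)]  # KeyError models a protocol error
    return state

final = run(["dial", "answer", "request", "response", "bye", "bye"])
```

Each side keeps its own copy of the state, and the messages are what keep the two copies in sync。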
But maybe there's a request and a response that happens in the middle。 So we go back to computers。
So we have all these applications that we want to create。 Applications that we can't even imagine。
So maybe it's communication applications like Skype。 It could be terminal applications like SSH。
It could be networking applications like the network file system。
And at the same time that all the computer scientists are innovating and application developers are innovating。
We also have hardware people, electrical engineers who are innovating on new transport media。
transmission media。 So we have coaxial cable, the thing that Comcast runs, or fiber optic,
the thing that AT&T and Sonic run, that goes to your house。
And then you have applications that are talking on top of it。 So many different applications。
many different ways of interconnecting computers, networking styles and technologies。
Wi-Fi and wireless, cellular and so on。 And really it's a question of how do we organize this mess?
Because if this is what it looks like, then every time we add a new application like, "Oh。
the web comes along。" We then have to implement support in the web for operating on cable modems and operating on fiber modems and operating on Wi-Fi and operating on wired Ethernet and so on。
And so we'd end up re-implementing every technology or every application for every new technology that comes along。
Because it could also be the case that we have a new packet radio。
a new wireless technology for a wide area, maybe based on ham radio。 So no。
this really wouldn't work。 This is not a scalable approach if every application has to support every potential network technology。
If every network technology requires re-implementation of existing applications。
So how does the Internet avoid this? Well, the goal here is to have reliable communication channels that we can build applications on。
And we're going to reach that goal through a level of indirection。
So we're going to add this intermediary layer that provides a set of abstractions for various network functionalities and components and technologies。
So this allows us now to implement applications once。
So we implement an application against this intermediate layer。 And similarly。
when we add a new technology, we're going to implement support for that intermediary layer。
So it's kind of, again, it's like a variant of just add another layer of indirection。
which you heard through your entire academic time here at Berkeley。 This intermediary layer。
this narrow waist here, is the Internet protocol。 And it's the fundamental abstraction that has allowed us to have innovation because when a new application comes along。
like the web, all it has to do is implement support for IP, the Internet protocol。 And similarly。
when someone comes up with 802.11ax or ac or whatever,
all they have to do is implement support for IP。 And then all of the applications work。
The applications that were written, Skype was written 20 years ago。
and it runs on the latest of networking technologies。 Similarly。
I could pull out a microwave link from the 1980s that supports IP and run modern applications over it。
And so that's what the power of having this narrow waist of having this layer of indirection or level of indirection gives us。
So this is called the Internet Hourglass。 And having just one network layer protocol is what gives us this interoperability。
Underneath, we can have lots of different networking technologies that use lots of different physical implementations。
So Ethernet over copper, like the connection I have in this room; this room might be connected to the campus
gateway via fiber。 I could also connect my laptop over 802.11 to the access points around the room, although I get very upset because authentication is not working today。
And then on top of that, we have various kinds of transport protocols。
like the unreliable datagram protocol and the transmission control protocol。 And on top of that,
we have lots of applications。 Questions? Okay。
So what are some of the implications here? Right? Having this single Internet layer of IP allows arbitrary networks to interoperate any network technology that supports IP can exchange traffic。
So I can be on that ancient microwave link and talk to someone who's on a modern fiber to the home link。
Right? As long as both are supporting IP, I can exchange packets between them。 Okay。
It also allows applications to function on any network that supports IP。 And this is something that。
you know, is pretty amazing, right? When I think about an application from 20 or
30 years ago that supported IP, I can run it on the latest networks today without having to change anything。
So this has led to phenomenal innovation。 Right? Look at the millions of apps that are in the various mobile app stores。
They all run on top of IP。 Right? And simultaneously look at the phenomenal innovation that we've had below IP development of new wide area technologies like LTE and satellite and Starlink and so on。
Right? All of this enabled by having just one network layer protocol that everyone implements against。
The downside? What happens when you want to change that protocol?
So when we want to go from IP version 4 (IPv4), which has a limited address space,
we'll get into that later on, to IPv6, which has a pretty much unlimited address space? Well,
it turns out it takes decades。 IPv6 has been floating around for decades,
and we still are just incrementally getting there。 Right? So when I connect to the wired network,
I get an IPv4 address。 When I connect to the Wi-Fi,
I get an IPv6 address, because just over the past few years we migrated our wireless networks to IPv6。
We did it because we were running out of addresses, right? Our daytime population is something like。
I don't know, 70,000 people。 And each of those people has a bunch of devices and they all are on Wi-Fi and we just didn't have enough addresses。
And so by moving to IPv6, we have basically unlimited addresses and that doesn't become a problem。
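The address-space arithmetic behind that migration is simple to state; the devices-per-person figure below is an illustrative assumption。

```python
# Why IPv4 ran out and IPv6 effectively cannot: address space sizes.
ipv4_addresses = 2 ** 32    # about 4.3 billion addresses total
ipv6_addresses = 2 ** 128   # about 3.4e38 addresses

# A campus with ~70,000 people, each with a few devices (assumed: 4),
# already strains a single IPv4 allocation, but is negligible against IPv6.
campus_devices = 70_000 * 4
```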
But it took a lot of work and effort to make that happen。 And globally。
most traffic is still going over IPv4。 Okay,
so that's one example of the drawback of layering: once you have a layer that everyone's dependent on,
making changes to that layer is incredibly difficult。
So other disadvantages are that at a given layer you might end up duplicating functionality of the layer below it。
Right, so things like error recovery and check sums and things like that。
Layers might need the same information。 Timestamps are very useful in networks。
And so I may have timestamps that appear at multiple levels of my networking stack。
I need to know what's the largest transmission unit that the underlying network supports。
I can also have performance implications from layering。 Right,
because lower layers hide details about what's going on, higher layers make assumptions, and that can interfere with performance。
We'll see some examples of that later on。 Sometimes the layers are not very cleanly separated。
Sometimes there's overlap in headers, check sums or other sorts of things between layers。
And so the layers kind of bleed through and I'm doing processing it at multiple layers kind of simultaneously。
And let's see。 And then sometimes my headers are just big。 Right。
because each layer adds a little bit of header information, some metadata that it needs。
And when we talk about data that's small, an example would be packet voice, you know,
like what's being used by Zoom。 Comparing the amount of header to the amount of voice sample, it can actually be the case that the headers are much,
much larger than the small voice sample that's being encoded and transmitted。
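We can put rough numbers on that。 The header sizes below are the standard minimum sizes for IPv4, UDP, and RTP; the 30-byte voice sample is an assumed figure for illustration。

```python
# Header-vs-payload arithmetic for packet voice.
ipv4_header = 20  # minimum IPv4 header, bytes
udp_header = 8    # UDP header, bytes
rtp_header = 12   # minimum RTP header, bytes
headers = ipv4_header + udp_header + rtp_header  # 40 bytes of metadata

voice_sample = 30  # assumed bytes of actual voice data per packet

overhead_fraction = headers / (headers + voice_sample)
# The headers alone exceed the payload: 40 bytes of header per 30 bytes of voice.
```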
So this really begs the question then of when I've got layering, where do I put functionality?
Do I put it at layer n, layer n minus one or my highest layer in my system?
So there was a paper that was written way back in 1984 and it's one of the most influential papers。
in computer networking。 It's sort of the seminal paper that everybody turns to。
And it's called "End-to-End Arguments in System Design"。
There's endless disputes about what this paper means and what it says。
And oftentimes you'll have people on polar opposite sides of an argument claiming that。
the paper supports their position。 Okay, well, in summary, read the paper, it's a really good。
paper to read。 The simple message here, some types of network functionality can only be correctly。
implemented end-to-end。 Reliability, security, and other types of functionality。 So because of this。
it means that end hosts can satisfy that requirement, say reliability, without the network's help。
And so they have to do so, because they can't rely on the network doing it for them。
And so what's the point of doing it in the network if I'm just going to duplicate。
that functionality end-to-end? So don't go out of your way to do something in the network。
since it's redundant with what I'm going to have to do at a higher level。 That seems simple,
but maybe kind of confusing。 What do I mean? Let's look at an example and that'll make it a
little bit clearer。 So let's say I want to do reliable file transfer。 I have a file on the。
disk of host A。 I want to transfer that file to host B。 And I want to make sure that the。
contents of that file on host B is equivalent and identical to the contents of the file at host A。
Very simple。 You do this all the time。 Right。 FTP or SCP and so on。 Basic file transfer。
So what do I have to do? Well, I have to ask the operating system to read the file in
from the disk, into the application。
Then the application is going to provide those file bytes back to the operating system and say, send them to the application on host B。
So it's going to send it across the network to the operating system on host B, which is then going。
to buffer and send those bytes to application on host B, which is then going to turn around。
open a file for writing and ask the operating system to write those bytes to the disk。 Simple。
All right。 So how do we make this reliable? That's just file transfer。 Well, two solutions。
First solution, we're going to make each step reliable and then concatenate。 Second solution。
is that we'll just have an end-to-end check, and then retry if we encounter any errors。
So it's sort of like, you know, we'll just simply read the file at the receiving end, compute some checksum,
and then send that checksum back to the original application, which will compute a checksum from
its own file。 And if they match, we're done。 If not, you know,
we have to retransmit it。 What do we do in the first case? Well, in the first case, when we read it。
from the disk, we'll check and make sure we read the data correctly from the disk。
Then we'll send it to the other machine and make sure we reliably deliver it across this link。
And then we'll copy it, and we can checksum that, you know, checksum each of those packets,
checksum what we transmitted。 And then we're going to give it to the application on host B。
And the application on host B is going to write it out to disk。
And then it could checksum to make sure that that was reliably written。
So those are the two different approaches。 Solution one, make each component reliable。
And then solution two, which is just make the process reliable from an end-to-end standpoint。
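Solution two can be sketched as follows。 The corruption model and function names here are made up for illustration; the point is that one end-to-end checksum over the final copy, with retry on mismatch, is complete on its own, no matter which step corrupted the data。

```python
import hashlib
import random

def unreliable_copy(data: bytes) -> bytes:
    # Stand-in for the whole pipeline (disk read, network, disk write):
    # occasionally flips a byte, like a cosmic ray or a flaky link.
    out = bytearray(data)
    if random.random() < 0.3:
        out[random.randrange(len(out))] ^= 0xFF
    return bytes(out)

def transfer_with_e2e_check(data: bytes) -> bytes:
    # Solution two: one end-to-end checksum over the final copy,
    # retrying the whole transfer on a mismatch.
    want = hashlib.sha256(data).digest()
    while True:
        copy = unreliable_copy(data)
        if hashlib.sha256(copy).digest() == want:
            return copy

received = transfer_with_e2e_check(b"contents of the file on host A" * 64)
```

Notice that nothing inside `unreliable_copy` needs to be trustworthy; the end-to-end check catches corruption wherever it happens。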
So some discussion。 Solution one is incomplete。 So let's say I read it in at host A, all right,
read it in from the disk。 And now a cosmic ray strikes the memory chip holding the buffer in memory
and flips some bits。 And now I reliably send that over to the other machine,
and then reliably write it out to disk。 And I've written a corrupted copy。
The same thing could happen at the destination machine: a cosmic
ray could strike a bit and flip it。 So this solution won't work, and the receiver is going to have
to check anyway。 Even though every step was reliable, the data got corrupted in
transit or in one of those buffers, in temporary storage。 I'm going to have to do an end-to-end
check to make sure nothing went wrong。 Right。 What about solution two? It's complete。
I get the functionality of reliably delivering my file from A and the drive on A to the drive on B。
Because I do that end-to-end checksum test。 So you can ask the question, okay,
then why would I ever want to implement reliability at a lower layer? What's the point?
If I'm still, going to have to at the end of the day, do an end-to-end check。 And if it's messed up。
then I have to, retransmit again。 That wasn't a rhetorical question。 That was an actual question。
Yeah。 Yeah, exactly。 So if I don't implement reliability at the lower layers。
then it might be the case, if I've got a really flaky link and I'm moving a terabyte-sized file,
I might have to transmit it a lot of times before I get a successful clean transmission。
If I've got a really lossy link, then providing reliability could actually help me quite a bit in terms of
efficiency。 It could make it much, much more efficient。 Because rather than retransmitting a。
terabyte, I might just have to retransmit that 1,500 byte or 8 kilobyte packet that contained。
the data that was corrupt。 So if we think about the end-to-end principle。
it's kind of saying that implementing complex functionality in the network。
isn't going to make the host implementation's complexity any less。 I still have to do those
end-to-end checks for security or for integrity or for reliable delivery。 But it is going to。
increase the complexity of my network, of the routers, components that have processors in them。
It's going to increase the complexity of the networking code in my operating system,
If I've now got to have reliability in all of these different components。
and that's going to impose a delay。 That's going to impose overhead on all applications。
even if they don't need the functionality。 So let's imagine I do implement reliable network links。
And I'm going to use that as part of building my reliable file transfer application。 Well。
take an example, like for those people who are at home watching me on Zoom。
Zoom uses unreliable communication。 Why? Because latency matters when you're dealing with an。
interactive communication。 And so we've all had those Zoom calls where the audio kept dropping。
and the video was dropping。 But you could still mostly understand what the person would say。
That's better than if we said, well, we're going to perfectly deliver every voice sample。
and so we're just going to keep retransmitting。 And so that person is like a second or two behind。
And you're a second or two behind that。 When you have delayed communication like that,
it's really difficult to have a conversation。 But we need to find some happy medium。 I just had a
FaceTime call with one of my colleagues who's in Idaho。 And she's on a satellite link that's。
super lossy。 And I could barely understand what she was saying because like every packet's
getting dropped。 And I was kind of like thinking about this lecture。 I was like, well, that might。
be a case where I want a little bit of reliability。 So maybe retry a few times。
And if it doesn't get through, then you drop the packet。
But even that will introduce latency and make it harder to have a conversation that's interactive。
So when you have a very lossy link, introducing some degree of reliability
could help and improve overall performance, especially if your
retransmission unit at the application level is really large。 If I'm dealing with a 30-byte voice
sample, retransmitting that isn't hard for the application to do if it really cares about。
reliability。 But if I'm dealing with a terabyte-sized file。
retransmitting that's going to be very painful。 Okay。
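A rough expected-cost calculation shows why link-level retransmission matters for huge transfers over lossy links。 The loss rate and sizes here are illustrative assumptions。

```python
import math

# Illustrative assumptions: per-packet loss rate, file size, packet size.
p = 1e-6                    # per-packet loss probability on the link
file_bytes = 1e12           # a terabyte-sized file
pkt_bytes = 1500
n = file_bytes / pkt_bytes  # ~6.7e8 packets

# With link-level (per-packet) retransmission: expected sends per packet
# is 1/(1-p), so the total is only slightly more than n packets.
per_packet_total = n / (1 - p)

# With only a whole-file end-to-end retry: one attempt succeeds only if
# every packet gets through, probability (1-p)^n, astronomically small here.
p_whole_file_ok = math.exp(n * math.log1p(-p))
```

Even a one-in-a-million loss rate makes a clean terabyte transfer essentially impossible without retransmitting individual packets, which is the efficiency argument for putting some reliability in the lower layer。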
So we can think about a really conservative interpretation of the end-to-end argument。
So the super conservative view is: don't
implement a function at the lower levels of the system unless it can be completely implemented。
at this level。 Or unless you somehow or other reduce the burden on the end-host。 If that's not。
going to be the case, then don't bother implementing the functionality。 So with this super kind of。
conservative extreme view, then, you know, we're never going to implement anything at the lower。
levels。 Because anything that we do at the lower levels, we're always going to have to implement
at the higher level anyway。 But some people will take this viewpoint and say that the network should be as
clean and simple as possible and basically just simply use best effort to deliver packets。
do nothing more in the way of processing。 So that's one viewpoint, and there are cases where。
that makes a lot of sense。 But I kind of argue for a more moderate interpretation, which is, well。
let's think twice before we implement some functionality in the network。
So if a host can implement that functionality correctly, then only implement it
at the lower layer if there's some performance benefit or improvement that we would get out
of implementing it at that lower layer。 But we want to make sure we only do so if it's not going to
impose a performance burden on applications that don't require that functionality。
So this is the reason why we have TCP and we have UDP。 If you want reliable delivery, you use TCP。
Don't want reliable delivery or you want to implement it yourself, use UDP。
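To make that choice concrete, here's a minimal sketch using Python's standard socket API; no addresses or data transfer, just the two socket types an application gets to pick between:

```python
import socket

# TCP (SOCK_STREAM): the transport layer gives you reliable, in-order
# delivery -- acknowledgments and retransmission happen below the app.
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# UDP (SOCK_DGRAM): best-effort datagrams -- if the application wants
# reliability, it has to build it itself on top.
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

print(tcp_sock.type == socket.SOCK_STREAM)  # True
print(udp_sock.type == socket.SOCK_DGRAM)   # True
tcp_sock.close()
udp_sock.close()
```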
So you get to choose as the application at the end host, which are those protocols you're going。
to choose to use。 So this is an interpretation that we use。 This is the interpretation I use in my。
research group。 Many people kind of take this sort of more middle of the road approach to。
how you interpret the end-to-end principle and argument。 But you could ask the question。
is this still valid in cases where we have things like。
distributed denial of service attacks or protection against intrusion and so on。
Those are cases where we actually are pushing that functionality into the node。
So I mentioned earlier a denial-of-service attack that directs a。
terabit per second of traffic at a host。 So you can argue, well, end-to-end says。
I should deal with that attack at the host。 Well, there's not a host machine on the planet。
which could absorb a terabit of data and filter out the attack from valid requests。 However。
if I implement that functionality in the network, then I can decentralize it。
And now I have individual routers that are maybe dealing with gigabit flows, tens or hundreds of。
gigabit flows。 And there I have a chance of being able to filter out and block those attacks。
So you're seeing more and more of this with services like CloudFlare and others。
where they're pushing functionality into the network that。
traditionally would have been implemented at the end host。 If we, again。
if we follow a sort of strict interpretation of the end-to-end argument。
So this is going to the other end and saying we should push lots of functionality like this。
down into the network because we can do a much better job of implementing it at the network level。
Still have to implement denial of service protection on the servers, but much easier if I'm。
dealing with a megabit or a gigabit flow than if I'm dealing with a terabit flow。 Okay。
questions about end-to-end。 Okay。 All right, so let's switch gears and talk about distributed applications。
So how do we actually write a distributed application? What does it take to do? Well。
think about it, okay? A distributed application, right? There's two components to it。 There's going。
to be code, right, that's implementing the functionality, and then there's going to be state。
Right now, when we were writing our applications on a single machine, that state and synchronizing。
around that state was really easy to do。 But now we have multiple threads that are running on。
different machines。 And so there isn't shared state。 We can't just simply use test and set。
to create critical sections。 So we're going to need a different primitive that allows us to have。
synchronization on shared state in the wide area。 So one abstraction we could use is sending。
and receiving messages。 Why? Because if you think about it, right, it's atomic。
I either get a message or I don't receive a message。
Two receivers can't both receive the same message。 So I can build synchronization on top of sending and receiving messages。
So I have an atomic primitive, if you want to think about it that way。 So what's the interface?
We're going to use a really simple interface to start: the notion of a mailbox。 So that's a。
temporary holding area for messages that has a queue associated with it and a destination。
And then there's sending a message to a mailbox。 And so this will send a message to。
the location identified by the mailbox。 And there's receiving a message from a mailbox into a buffer。
And so we'll wait。 The thread will sleep on that receive on that mailbox until a message comes in。
And then one of those threads that's waiting。
will get woken up and will copy the message into its buffer。 Okay。
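A minimal sketch of this mailbox interface in Python; queue.Queue stands in for the holding area, and the names Mailbox, send, and receive are just illustrative:

```python
import queue
import threading

class Mailbox:
    """A temporary holding area for messages: a queue plus a destination."""
    def __init__(self):
        self._q = queue.Queue()  # unbounded buffer of messages

    def send(self, message):
        # Returns as soon as the message is buffered (the "immediately"
        # semantics discussed here); other semantics would block longer.
        self._q.put(message)

    def receive(self):
        # The calling thread sleeps until a message arrives; exactly one
        # waiting thread is woken up and gets the message.
        return self._q.get()

mbox = Mailbox()
results = []

def worker(out):
    out.append(mbox.receive())  # sleeps until send() delivers something

t = threading.Thread(target=worker, args=(results,))
t.start()
mbox.send("hello")
t.join()
# results == ["hello"]
```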
So we can ask some questions about like the behavior of a send, for example。 When does send return?
So it could return when the receiving application's thread actually gets the message。
When we get an acknowledgment back saying it got the message。 It could be when the。
message gets buffered by the operating system on the destination。 Well, maybe it's immediately。
As soon as we copy it into the operating system buffer, send returns immediately。
The semantics that we want will depend on the application。
And so it really is a question of like when can the sender know that the receiver has。
actually received the message。 By only really in that first case, in the other cases we're kind。
of relying on the systems to ultimately deliver the message。
But what happens if it's the second case, and the destination host crashes before that message got delivered to the receiver。
Or it's buffered locally and the local machine crashes before it can send it on to the receiver。
There's also a question of when can the sender reuse the buffer that contains that message。
We'll come back to this when we talk about TCP。 But what a mailbox provides us with is a one-way。
communication channel between thread one and thread two with a buffer in between。 Well。
that looks really familiar。 Right。 We saw an instance of thread one communicating to thread two。
with a buffer in the middle at the very beginning of the class。
Remember the beginning of the semester? Hopefully you haven't forgotten it。
We have a midterm coming up soon。 This is producer consumer。 Right。
Thread one is producing data that's getting consumed by thread。
two and we put a buffer in between so we can decouple the execution of the two threads。
So our send becomes like our V with a semaphore and our receive becomes like the P operation with our。
semaphore。 But the key difference here now is that where at the beginning of the class。
thread one and thread two were on the same machine。
Now thread one and thread two don't even need to be on the same planet。
And yet they can communicate and synchronize。 All right。 So if we think about。
like implementing something producer consumer style, the producer takes a message buffer。
prepares the message and then sends that message to a target mailbox。 The consumer has a buffer。
They wait on that buffer with a receive。 When a message arrives, it gets copied into the buffer。
and then they get to process the message。 So the great thing here is that the producer and the。
consumer, they don't care how much buffer space there is in the mailbox。 They don't have to track it;
that's handled by send and receive in the operating system。 This is one of the roles of the window。
in TCP。 It's a function of the buffer space that's available, the bandwidth-delay product, and。
a couple other things。 But it tells us how much space the sender has to send to the receiver。
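The producer-consumer pattern just described can be sketched with a bounded queue standing in for the mailbox; put plays the role of send (like V on the semaphore) and get the role of receive (like P). This is an illustrative sketch, not the kernel's implementation:

```python
import queue
import threading

mailbox = queue.Queue(maxsize=8)  # bounded buffer decoupling the two threads
SENTINEL = None                   # marks the end of the message stream

def producer():
    for i in range(5):
        mailbox.put(i)            # "send": like V -- may block if buffer full
    mailbox.put(SENTINEL)

consumed = []

def consumer():
    while True:
        msg = mailbox.get()       # "receive": like P -- sleeps until a message arrives
        if msg is SENTINEL:
            break
        consumed.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
# consumed == [0, 1, 2, 3, 4]
```

Neither thread tracks the buffer space itself; put and get handle the blocking, just as send and receive do in the operating system.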
Now what if we want to do two-way communication? We don't just want to, like, throw it over the wall。
from thread one to thread two。 But we want to make a request, say of a server。
like a client requesting something from a server and getting a response。
I want to read a file on a file server and get back a response。
I want to request a webpage from a web server and get back a response。 So this is。
the client server application。 The client is our requester, the server is our responder。
The server is providing some service to the client that's making a request。 So what would this look。
like for a file service? Well, the client says, I have a response buffer。 I'm going to send the。
request to the server: read this file。 And here's the mailbox for that server。
Where to find that application。 And then I'm going to receive a response back from that server。
into my client mailbox。 Of course, there are a lot of missing details here, but。
I have to tell the server how to find the client mailbox and all that kind of stuff。 Okay。
so the server is going to receive that request from the server mailbox into a command buffer。
It'll decode that command, load the file into memory into that。
answer buffer here。 And then it's going to send that answer buffer to the client mailbox。 So return。
it to the client that made the request。 All right。 So again。
you could think about how we would have, done this if we were on the same machine using inter-process communication or。
you know, pipes and other sorts of things。 Here now we're doing this across the internet。
Two separate machines。 Okay, questions about request-response? Yes。
Is this the same concept as the actor model? It's kind of similar。 I mean, these are all kinds of。
protocols where you're making some request and then there's some action that occurs at a remote。
machine and some response that comes back。 So there's some overlap。 Other questions? All right。
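A toy version of that request/response exchange, with two Python queues standing in for the server and client mailboxes; the file path and contents here are made up for illustration:

```python
import queue
import threading

server_mbox = queue.Queue()  # requests arrive here
client_mbox = queue.Queue()  # responses come back here

FILES = {"/root/a/big": b"contents of the file"}  # hypothetical file store

def file_server():
    # Receive a request from the server mailbox into a command buffer,
    # decode it, load the answer, and send it to the client's mailbox.
    op, path, reply_to = server_mbox.get()
    if op == "read":
        reply_to.put(FILES.get(path, b""))

t = threading.Thread(target=file_server)
t.start()

# Client: the request carries the reply mailbox so the server can find us.
server_mbox.put(("read", "/root/a/big", client_mbox))
response = client_mbox.get()   # sleeps until the server replies
t.join()
# response == b"contents of the file"
```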
so we're going to change gears and now talk about distributed decision making。
So the consensus problem is: you have a set of nodes。 They propose a value。
It could be true or false。 It could be A or B。 It could be A, B, or C。 But I have to make some decision。
This is the real world。 So some of those nodes might crash or they might stop responding or。
they might be malicious。 I want it to be the case that eventually all of the nodes are able to reach。
consensus and decide on the same value from that set of proposed values。 So that, in a nutshell。
is the distributed consensus problem。 It's a form of distributed decision making。
Examples are choose between true and false, commit versus abort, and so on。 Now。
one important component of distributed consensus or distributed decision making is durability。
Once you make a decision, it has to persist。 And so it's very important that you。
include stable storage so that you have a durable way of recording that decision。 So like in a。
transaction, this is the D in ACID, the durability component。 Now, in a global-scale system。
there are lots of different ways we can do it。 Everything from distributed ledgers like blockchain to erasure。
codes or multiple replicas。 But the key thing is, however we do it。
we want to make sure our decisions persist。 Okay。
So let's look at how hard it is to reach distributed consensus by looking at what。
seems like a deceptively simple problem。 It's called the Generals' Paradox。 So here are the。
constraints。 We have two generals, the general on the mountain over here, general on the mountain。
over here。 They can only communicate via messengers。 All right, so they're going to send messengers。
through the battlefield from one mountain to the other mountain。 And there's, you know。
I don't know, there's bobcats and bears and, you know, other things that can eat the messengers。
And that's the only way they can communicate。 The messengers can be captured, you know, all sorts。
of things can happen。 The problem is that they each have an army, but the armies aren't that big。
in isolation。 And they're going to attack a target with a large army。 That army is larger than any。
individual general's army, but the combined armies of the two generals。
are larger than the target army。 So if they attack at the same time, they will succeed。
If they attack at different times, they will fail and die。 All right。
So this is named after Custer, who died at Little Bighorn because he arrived early。 All right。
So here's the question that I'm going to ask。 Can messages over an unreliable network be used to guarantee that two entities do something。
simultaneously? So, you know, we can make an unreliable network reliable, right? Because we can。
send acknowledgments, right? We can send a flood of messages。 But kind of counterintuitive。
the answer is no。 Even if all the messages get through。
that's the part that's really confusing oftentimes to students is even if I say every single message。
will get through, can you make this work? The answer is no。 So let's look at a mockup and。
what the message exchanges might look like。 And that'll help us see why we have this as a paradox。
Here are two generals。 Okay。 So the general on the left is going to propose a time。
Let's attack at 11 o'clock。 Okay。 All right。 Are we done? Can we go attack at 11? Yes? No, maybe。
Exactly。 So the answer is no, right? Because what if the general on the right doesn't get the。
message because the mountain lion eats this messenger? Okay。 How does the general on the left。
know that the general on the right got its message? Simple。
We send a message in the other direction。 Yes, 11 a。m。 works。 Let's attack at 11。
Can we attack now at 11 and live? I see some people nodding and some people shaking their heads。 Yeah。
So the frustration here is that the general on the right doesn't know if their。
messenger got to the general on the left。 And what happens if it didn't? Right?
Then maybe it did get through or maybe it didn't。 If it didn't get through。
then this general goes and attacks at 11。 Well, the general on the left never got the message。
saying that it was okay to attack at 11。 So they're not going to attack at 11。 And they, you know。
doesn't end well for the general on the right。 We can solve this。 So it's 11。
We're going to attack at 11。 Okay。 Now the general on the left has told the general, on the right。
I got your last message。 And so we're good to go。 Right? Yeah? No? Yeah。 This is the same problem。
Right? Because now the general on the left told the general on, the right。 Yeah。 We can go at 11。
but they don't know if their message got through。 If their message, didn't get through。
then this general on the right might not know that their message got through。
And so they're not going to attack at 11。 We can solve this; we're computer scientists。
We just send another message。 Right? So this is the problem。 We don't have any way of knowing。
that that last message got through。 So we can never confirm that it's okay to attack at 11。
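One way to see the induction is with a toy model; this is my own illustrative encoding of the argument, not a formal proof:

```python
# Generals alternate messengers. A general "knows the plan is confirmed"
# only if it RECEIVED the most recent delivered message -- the sender of
# the last message can never tell whether it arrived.

def final_states(n_messages, last_lost):
    """Return (left_knows, right_knows) after n alternating messages
    (left sends messages 1, 3, 5, ...; right sends 2, 4, ...), where
    the very last messenger may be eaten on the way."""
    delivered = n_messages - 1 if last_lost else n_messages
    left_knows = delivered >= 1 and delivered % 2 == 0   # last arrival was right -> left
    right_knows = delivered >= 1 and delivered % 2 == 1  # last arrival was left -> right
    return left_knows, right_knows

# However many messages are exchanged, the world where the last messenger
# arrived and the world where it was lost differ for exactly one general --
# and the sender of that last message cannot distinguish the two worlds.
for n in range(1, 10):
    assert final_states(n, last_lost=False) != final_states(n, last_lost=True)
```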
No matter how many messages we send, even if we guarantee that every message gets through。
the sender of the last message can never know that it got through。 So, you know。
this doesn't happen in real life because we have out of band communication。
But the two generals just, you know, use their walkie-talkies or their satellite phones and they're。
able to in real time confirm that the other one knows that the attack plan is going to happen at 11。
Right? But if we look at two computers that are trying to agree to do something at a specific time。
and can only communicate via these messages over unreliable networks, we cannot reach consensus。
on a very specific time。 So we need to do something other than simultaneously。 Right?
We need to make a decision, but we have to, unfortunately, remove time from that equation。
because we don't know about this last message。 Okay。 So since we can't solve the generals' problem。
we're going to solve a related problem instead。 And that's the problem of distributed transactions。
So this is two or more machines agreeing to do something or not to do something and making。
that decision atomic。 So either everyone agrees to do it or everyone agrees we're not going to do it。
The key thing that we've done here was we removed the constraint of time and changed it into eventually。
things will happen。 Rather than saying it's going to happen at 11, we say eventually it will happen。
Or we all agree it's not going to happen。 And this is codified in a protocol called。
two phase commit developed by another Turing Award winner, Jim Gray, who was also the first。
Berkeley computer science PhD back in 1969。 And he made many, many contributions。
to the foundations of databases。 He was really hugely impactful in the world of databases。 Okay。
So how does this, what are some of the components? So we have a persistent stable log。
on each machine。 And this is where we're going to keep track of whether or not the commit happened。
If a machine crashes, when it recovers, it's going to look at its log to see what was the state of the。
world at the time that it crashed。 We're going to have two phases。
The first phase is a prepare phase。 So we're going to have a global coordinator and they're going to request that all of the participants。
promise to commit or roll back the transaction。 And so the participants are going to record that promise in the log if they agree to commit, and。
then acknowledge。 If anyone votes to abort, then the coordinator writes abort in its log。
and it's going to tell all of the participants abort and each one is going to record abort in its。
log。 Now, if everyone says, let's go ahead and commit, then we have the commit phase。 So if。
everyone responds that they're prepared, the coordinator writes commit to its log。
then it asks all the nodes to commit。 They respond with acknowledge。 After it receives all the。
acknowledgments, it writes got commit to its log。 Okay, so the decision point here is when the。
coordinator writes commit to its log。 At that point。
we are committed。 Two-phase commit will ensure that everyone eventually commits。
And that's what we use the log for, guaranteeing, you know, once we put it into stable storage。
we guarantee that either we're going to proceed with the transaction and commit。
or we've decided to abort the transaction and everyone will roll back the results。 All right。
so the algorithm here: one coordinator and workers, the replicas。 At a high level。
the coordinator asks all the workers, okay, everybody ready to commit? If all the workers say, hey。
we vote to commit, then the coordinator says, global commit, everyone will commit, right? Otherwise。
if anybody said, no, I vote abort, then the coordinator broadcasts to everybody global abort。
And everybody rolls back the transaction, releases all the locks; it's like it never happened。
Workers obey whatever global message they receive。 So, if the global message is to commit。
they will commit; if the global message is to abort, then they will abort。 And again。
we're using a persistent stable log on each machine to track what's happening at that machine。
so at each of the replicas and at the global coordinator。 So when a machine crashes, again。
as I said before, it's going to wake up, check its log to recover any data and understand。
what the state of the world was at the time that it crashed。 All right, so one machine。
the coordinator initiates the protocol。 It asks every machine, vote up or down on the transaction。
Two possible votes commit or abort。 Only going to commit if we receive unanimous commit votes。
from all of the machines。 So anybody basically has veto power and can say abort that transaction。
Now, if a worker agrees to commit, then it has to guarantee that it's going to accept the transaction。
So it's going to record that commit in the log before it notifies the coordinator。 Once it records。
that commit in the log, if it receives a go ahead and global commit。
it has to complete that transaction。 It can't change its mind and say, oh, now I want to abort。
Once it writes that commit in the log, it is guaranteeing that it will see that transaction to completion if told to do so。
Okay, if the worker decides to abort, then it's guaranteed it will never accept that transaction。
It records the abort in the log and can roll back that transaction and informs the coordinator。
I decided to vote abort。 At the end, when we decide to commit the transaction, the commit phase。
the coordinator hears that everyone says, vote commit, it's then going to。
record the decision to commit in the log。 This is the point again, where now the transaction is。
committed。 No matter what happens to the coordinator, or whatever。
what happens to any of the machines, we guarantee that we will eventually be in the state where all the machines will have committed。
We'll go over some of the failure cases in just a moment and that'll make it clear。
Then we apply the transaction and form everybody else。 If we decide to abort, because anybody。
vetoes and says abort the transaction, then we're going to record in the log that we have aborted。
and we're going to inform all of the machines with global abort。 So if we think about it, right。
there's two actions。 Either everyone has agreed to commit。
and we ensure that we go down that commit path atomically, or one or more has decided to abort。
and we go down the abort path and notify everybody that it will abort。
Only one of these outcomes can occur。 We either agree to commit the transaction, or we disagree。
and the decision is to abort the transaction。 All right。 Questions?
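Here's a minimal sketch of the protocol just described, with in-memory lists standing in for the persistent logs and direct method calls standing in for messages; no failures are modeled, and all the names are illustrative:

```python
class Worker:
    def __init__(self, will_commit=True):
        self.log = []                  # stands in for the persistent stable log
        self.will_commit = will_commit

    def vote(self):
        # Phase 1: record the promise in the log BEFORE replying.
        decision = "VOTE-COMMIT" if self.will_commit else "VOTE-ABORT"
        self.log.append(decision)
        return decision

    def finish(self, global_decision):
        # Phase 2: obey whatever global message arrives.
        self.log.append(global_decision)

class Coordinator:
    def __init__(self, workers):
        self.workers = workers
        self.log = []

    def run(self):
        votes = [w.vote() for w in self.workers]          # prepare phase
        decision = ("GLOBAL-COMMIT"
                    if all(v == "VOTE-COMMIT" for v in votes)
                    else "GLOBAL-ABORT")
        self.log.append(decision)      # the commit point: once logged, final
        for w in self.workers:         # commit phase: broadcast the outcome
            w.finish(decision)
        return decision

# Unanimous commit votes -> global commit; a single veto -> global abort.
print(Coordinator([Worker(), Worker()]).run())                        # GLOBAL-COMMIT
print(Coordinator([Worker(), Worker(will_commit=False)]).run())       # GLOBAL-ABORT
```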
We'll go through a bunch of examples which should help a lot。 Yes? Yeah, that's a good point。
The question is: you write to the log as a worker before you tell the coordinator your vote。
That's really important because that log is persistent。 So I record that I decided to commit。
or I record that I decided to abort; then if I immediately crash, eventually I'll come back up。
I'll scan through my log, and I'll see I decided to commit or abort。
and I'll contact the coordinator。 And we'll see an example of that in just a moment。 Okay。
so some administrative stuff。 We have a midterm coming up on the 28th。
It's going to be from 7 to 9 p。m。 All course material is fair game。
although again the focus is going to be on the material since the last midterm。
but that doesn't mean you can forget what happened 12 weeks ago。
And there will be a review session on the 25th time and place to be announced。 Okay。
so let's go through the algorithm in some more detail。
So the coordinator sends out a vote request to all of the workers。
The workers are sitting there and they're waiting for a vote request from the coordinator。
Now if they're ready to commit, then they'll send a vote commit to the coordinator。
If they're not ready to commit, then they'll send a vote abort to the coordinator。
and then they can immediately abort。 Now why can they immediately abort?
Because they know the outcome。 If I vote to commit, I can't immediately commit and release。
locks and do all that stuff, because I don't know if somebody else might cause the transaction to。
be aborted。 But if I myself say I'm not ready, I'm going to vote for abort, then I can go ahead。
and abort, because I know that the coordinator will listen to me and will ultimately abort the。
transaction。 So I can just stop that transaction right now, release all the locks。
roll everything back。 Okay, it would be equally correct for me to wait until I get the global abort from。
the coordinator, but as an optimization, I can just clean up things right away。 Okay。
so at the coordinator, if everyone votes to commit, it sends the global commit to all the workers。
If anyone disagrees, it'll send a global abort to all the workers。 And then again at the workers。
if they receive a global commit, then they'll go ahead and commit。 If they receive a global abort。
then they'll go ahead and abort。 All right。 So now we're still kind of in。
the algorithm phase。 So let's look at some execution examples of how this works。
So we'll start first。
with a failure free case。 All right。 So here, the coordinator sends out the vote request。
In the failure free case, all the workers are ready and they vote to commit。 All right。
Now the coordinator has all those votes, records in the log, commit。 That's our commit point。
And then broadcast out global commit。 All right。 And everyone will commit。
So if we peer under the covers and we look at the state machine of the coordinator。
it's a very simple state machine。 All right。 We start in the init state。
We receive a start: the program gets told to start the two-phase commit。
And it sends out a vote request。 And then it waits。 If it receives a vote abort。
it sends a global abort and enters the abort state。 If it receives all vote commits。
then it enters the commit state and sends a global commit。
The workers also implement a very simple state machine。 They start in the init state。
waiting for a vote request。 When they receive that vote request, they can either say, nope。
I'm going to vote abort, in which case they enter the abort state。 Or they can say, okay。
I'm ready to commit and send a vote commit。 Then they enter the ready state。 And they wait again。
They wait to hear back from the coordinator: global abort, or global commit。
And based on that, they'll either enter the abort state and roll everything back。
or enter the commit state, finalize everything, release the locks, and the transaction is committed。
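The worker's state machine can be written down compactly as a transition table; this is just an illustrative encoding of the states named above (init, ready, commit, abort), not code from the lecture:

```python
# (state, message) -> (next_state, reply to the coordinator)
WORKER_FSM = {
    ("INIT",  "VOTE-REQUEST"):  ("READY",  "VOTE-COMMIT"),
    ("READY", "GLOBAL-COMMIT"): ("COMMIT", None),
    ("READY", "GLOBAL-ABORT"):  ("ABORT",  None),
}

def step(state, message, willing=True):
    # A worker in INIT may unilaterally abort instead of voting commit.
    if state == "INIT" and message == "VOTE-REQUEST" and not willing:
        return "ABORT", "VOTE-ABORT"
    return WORKER_FSM[(state, message)]

state, reply = step("INIT", "VOTE-REQUEST")
print(state, reply)                 # READY VOTE-COMMIT
state, _ = step(state, "GLOBAL-COMMIT")
print(state)                        # COMMIT
```

Note that once the worker replies VOTE-COMMIT and enters READY, the table gives it no outgoing edge except the coordinator's global message, which is exactly the blocking problem discussed next.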
So let's think about this。 Failure for the coordinator only really matters。
when we're waiting for messages。 All right。 So when we sent out our vote requests。
now we're waiting for messages from the workers。 So if one of the workers fails。
then we're going to sit stuck in this waiting state。 And so if we don't receive all N votes。
after a period of time, we could just simply time out and abort the transaction。
That'd be the simplest implementation。 More complicated implementations, maybe we'll retry。
sending a message to anyone who hasn't responded because maybe that message got lost or something。
But conceptually, we just wait a fixed amount of time。 If we don't hear back。
we abort the transaction。 So here's that example of a worker failure。
So we send out the vote requests。 We get back a vote commit from worker one and from worker two。
And who knows what happens to worker three? Maybe their message gets lost or something。
or maybe they crash。 Eventually we time out, and then we're going to send a global abort。
At some time later, this worker three is going to recover, look through its log and say, hey。
I have this transaction, contact the coordinator and say, what happened to this transaction?
The coordinator will say the transaction was aborted。
In which case the worker will then clean everything up, roll it back, release the locks。
and proceed。 So that's where the eventual side of things comes in。
Even if it takes a week for this machine to get rebuilt and come back online。
eventually it will, and it will eventually apply the result, in this。
case of having aborted the transaction。 Okay, when we think about coordinator failure。
from the point of view of the worker, there are two places where we wait。 We wait for that vote。
request in the init state。 And from a worker standpoint, they could time out also。 They don't hear, you know。
they've got a transaction that's pending, they're waiting to see, are we committing or aborting。
I haven't gotten asked to enter two-phase commit; they could just simply time out and abort。
Later on when the coordinator contacts them, they finally say, hey, I didn't hear from you。
I aborted that transaction。 And that's fine。 The coordinator will handle that。
it'll just get a vote abort, and then it'll send out a global abort。
The other place where we can wait, which is more complicated, is in the ready state。
And waiting here is really bad。 It's really, really bad, right? Because in the init state。
the worker has control。 All right, so think about this。
this is a distributed transaction system on individual workers。
there's a ton of transactions that are going on。 And each one of those transactions is grabbing。
locks and preventing other transactions from being able to potentially make progress。
So we want to complete transactions as quickly as possible。 When we're stuck in the init state。
we have control as a worker。 Because we can always say, hey, it's been 10 minutes。
I haven't gotten a request for this transaction。 So I'm just going to time it out。
I'm going to abort the transaction, release the locks, allow other transactions in the system。
that are waiting on those resources and locks to proceed。 When we're in the ready state, it's。
different。 We've told the coordinator we're committing。 And so we can't release those locks。
we can't abort the transaction; we have to wait until the coordinator says, okay。
we're going to commit, or no, we're going to go ahead and abort the transaction。
And so getting stuck in this ready state is really bad。
because it's holding up resources on that particular node, and it could be for an indeterminate。
amount of time, and the node has no control over leaving the ready state。 Okay。
so examples of coordinator failure。 So the first case, we don't get a vote request from。
the coordinator, so because the coordinator crashes or something happens, and so the workers just。
simply time out, abort the transaction, and keep going。 And eventually, when the coordinator comes。
back, they'll say, hey, we all decided to abort the transaction because you went away。
When we're in the ready state, we've got a vote request, we've all sent back vote commit。 Or。
we don't even know, maybe some of the workers have decided to vote abort because they had some issue。
We're waiting to hear from the coordinator what's going to happen。
And we're blocked waiting for the coordinator to eventually restart, and then perhaps send out。
a global abort, or it could send out a global commit。 It's all going to depend on whether the。
coordinator had written commit to its log or not。 If it hasn't written commit, then it'll abort the。
transaction。 If it has written the commit, then it's going to send out a global commit。
So one little minor optimization that people will sometimes make in this case is if you don't。
hear from the coordinator, then you try to talk to all the other workers。 And if you can find any。
of the workers that voted to abort, then you know the final decision is going to be a global abort。
and so you can go ahead。 Even though you are in the ready state, you can safely abort。
But that's problematic, and you know just making sure you implement that correctly is complicated。
But that's one of the optimizations that people will do in practical settings。
because of this large potential gap between when the coordinator fails, and when it restarts。
and the fact that it's holding up other transactions on these nodes。 Okay。
so related to this is, again, durability。 All nodes are using stable storage to store the log。
the current state of things。 Nonvolatile storage could be either a hard drive, an SSD, or。
some kind of nonvolatile RAM。 So now when we recover from a failure at a node。
it'll look at that state and that tells it how to proceed。 Right, so if the coordinator finds that。
it was in the init state, the wait state, or the abort state。
it knows the transaction will be a global abort。 If the coordinator wakes up and sees that it was in the commit state。
then in this case, when it restarts, it's going to send out a global commit message to all of the workers。
So it's really, you know, a question of when did this crash occur? Did it occur before that。
decision was made and written to stable storage, in which case that's a global abort?
Or did it occur after we wrote commit to the log, in which case it's going to be a global commit?
For the workers, if they come back up and they're in the init state or the abort state。
they can locally abort because if the coordinator asks them, they'll say, "I vote to, I vote abort。
I don't vote commit。" If it's in the commit state, then that means it。
agreed to commit the transaction。 And so if it has to replay the log or whatever。
it's going to do that, to ensure that the results of the transaction are durable。
And if it wakes up and it's in the ready state, then it's going to ask the coordinator。
"What was the final decision?" Was the global decision。
to go ahead and commit or was the global decision a global abort? Okay。
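Those recovery rules can be summarized as a small lookup, one function per role; the returned strings just describe the action each node takes after reading the last state from its stable log (state names mirror the lecture's):

```python
def coordinator_recover(last_logged_state):
    """What the coordinator does on restart, given its last logged state."""
    if last_logged_state in ("INIT", "WAIT", "ABORT"):
        return "send GLOBAL-ABORT"       # decision never reached: abort
    if last_logged_state == "COMMIT":
        return "resend GLOBAL-COMMIT"    # decision was already made and logged

def worker_recover(last_logged_state):
    """What a worker does on restart, given its last logged state."""
    if last_logged_state in ("INIT", "ABORT"):
        return "abort locally"           # it never promised to commit
    if last_logged_state == "COMMIT":
        return "replay log to make results durable"
    if last_logged_state == "READY":
        return "ask coordinator for the final decision"

print(coordinator_recover("WAIT"))   # send GLOBAL-ABORT
print(worker_recover("READY"))       # ask coordinator for the final decision
```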
So before we dive into discussions, are there any questions about two-phase commit? Yeah。
So the question is similar to the case with the general's paradox。 How does。
the coordinator know that all of the workers received this global commit? Right? So the thing is。
if the workers don't receive a global commit and they're stuck in the ready state。
they're eventually going to come back and ask the coordinator, "Hey, what did you decide?" Right?
And they'll get a response back from the coordinator。 We decided a long time ago。
we were going to do global commit。 And so they'll commit。 That's correct。
So let's say we end up in the situation, I don't know if I have an exact picture of it。
but where a worker receives global commit。 Okay? But before it can write commit to its log and。
finish the commit process, it crashes。 All right。 So when it wakes back up, it's going to read through its log。
and the last thing it's going to see is already。 And so then it'll go and contact the coordinator。
The coordinator will say, "Hey, I already told you we globally committed。" And they'll say, "Oh。
okay, I'll go ahead, commit, write a commit to the log, and then I'll commit the transaction。"。
So the key thing here, and this is where the timing, again, and I've tried to overemphasize。
when we're writing the stable storage, is to make sure that no matter where a failure occurs。
we always end up in a state where we can ultimately either continue the transaction and commit。
or we abort the transaction。 And we don't end up in a state where, well, we don't know what to do。
There's something inconsistent。 Yes。 [inaudible], Exactly。
So the key difference here is that in the general's paradox, we tried to make sure our。
distributed decision was going to occur at a specific time, 11 a。m。 Here, there's no time bound。
If a machine goes down, it might be a week before that machine comes back up。
but when it comes back, up, it will ask the global coordinator what was the final decision。
So one of the things is, if you remember earlier, I mentioned that once you do the commit。
you then send an acknowledgment back to the coordinator。
Once the coordinator has gotten all of those acknowledgments。
it can delete that commit record。 Up until it gets all of those acknowledgments。
it has to retain that commit record, because a week later, some worker that's。
been offline for a week might come back and say, "Hey, what happened with this transaction?"。
And it can tell it, yes, we decided to commit the transaction。 So there's no time bound on。
when the decision gets implemented, just that we're going to reach a decision and do that。
in a decentralized manner。 Yes。 [inaudible], Yeah, so that's a very key distinction。
So vote request occurs at the end of a transaction。
when we've decided we want to commit the transaction。 That's when we initiate the two-phase commit。
process。 So this is not starting the transaction, this is finishing the transaction after we've。
decided we want to commit or abort the transaction。 All right。 Any other questions? Yeah。
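The commit-record bookkeeping described a moment ago, where the coordinator holds on to the record until every acknowledgment is in, might look roughly like this。 A minimal sketch; the class and method names are assumptions, not the lecture's:

```python
class CommitRecord:
    """Tracks which workers have acknowledged a GLOBAL_COMMIT.

    The coordinator must retain the commit record until every worker
    has acknowledged, because a worker may be offline for a week and
    only ask about the transaction's outcome when it comes back.
    """

    def __init__(self, txn_id, workers):
        self.txn_id = txn_id
        self.pending = set(workers)  # workers that have not acked yet

    def record_ack(self, worker):
        # A worker acks after durably committing its part.
        self.pending.discard(worker)

    def can_garbage_collect(self):
        # Safe to delete the commit record only once all acks are in.
        return not self.pending
```

For example, with workers A, B, and C, the record survives A's and B's acknowledgments and is only deletable after the long-offline C finally acks。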
[inaudible], Yeah, so if one of the workers crashes before responding to the vote request。
then it times out at the coordinator, we just treat that as a vote to abort, and the coordinator makes the decision。
to abort the transaction, which is persistent。 And so whenever that worker comes back, it'll be。
told that the transaction did not commit; it was aborted。 [inaudible], That's correct。
If the worker is offline for an extended period of time, then all of its pending。
transactions would be decided as aborts。 All right。 Okay。 I will see everybody on Thursday。