Stanford CS110 Principles of Computer Systems Notes (Part 6)

P19:Lecture 18 MapReduce - main - BV1ED4y1R7RJ

Okay. Welcome back. So this has zero to do with the class. Although, by the way, I'm really excited that I'm now a movie star from the video in lab the other day.

Right, I totally forgot that I was on that video。 Literally。

So there's one other thing you might be asking yourself, why is he showing us these silly videos?

And there's one other thing we're going to have you do in class this week, which is a five-minute survey on the videos. It's actually part of a research project; you will find out more about it in lab this week.

The, oh, what else do I want to show you? Oh, another video。 This has nothing to do with class。

I just thought it was kind of funny。

Let me turn the lights off. Recently, my wife and I moved into a new house, and we have this little fenced-in area out back. Every day I'd go back there and there'd be a little hole dug under the fence.

Something had dug under the fence and gone out the back. So I have a camera out there now.

You want to see what we found? Let's see。 Yeah, you can kind of see it from here。

There it is. Yeah, really cute. The problem is that if we let the dogs back there, they'd be in real trouble.

That skunk would win the war, I'm sure. But anyway, that was what I found.

When you put cameras up, you find cute things. And then I put a bunch of boards down below, and it hasn't come back. Although I hear skunks are really, really good at digging and couldn't care less about boards. So we'll see what happens.

We will see what happens。 Okay。 Let's see。

How is assignment seven going? Almost there on this one, hopefully。

Is it due tonight or tomorrow? One of the two. Okay.

I do have office hours right after this for about an hour, if you want to stop by because you're having some particular trouble.

And then we are on to the final assignment, which is what we're going to talk about today, to get you up to speed. It's the one I briefly mentioned last week, called MapReduce.

And MapReduce is an algorithm that we will dig into a little bit today。

And then I'll talk to you more about the details of the assignment itself and how you can do it。

There are a lot of moving parts; I say that every time. There are a lot of parts to this one.

There's four different tasks again, but it is broken down into tasks。

And we'll take a look at a little bit of the code today too。 And I'll point out some highlights。

And then we'll get going on that. The MapReduce assignment technically is out already.

I posted it a little earlier today, so you can get working on it if you've already finished the previous one. It's due ostensibly next Wednesday, which is the last day of class. I'm going to let anybody who wants to hand it in Thursday do so without any penalty.

So you can figure out what that means in the big picture anyway.

Penalties won't start until Friday of next week, which would be the 90%, and then 60% on Saturday. So you've got a little extra time if you want to use it.

But it is technically due Wednesday。 Okay。 All right。 Let's chat about this MapReduce algorithm。

Now, what I didn't tell you the other day about MapReduce is that MapReduce is an algorithm that was used very extensively at Google。

It actually was invented at Google, although it's very similar to things that had been invented in the past. It's not like they came up with the entire thing from scratch.

But it was used at Google very extensively。 It's also used in a system called Hadoop。

which is a system that uses MapReduce to solve these big data problems。

The idea for MapReduce is that you take the data you have to analyze and farm it out to a bunch of servers. In the map stage, you take this big set of data and map all the different parts of it out to all these different servers.

Then, on the servers, you do a reduce to bring it back. It's a kind of filtering step where you combine a bunch of things and then send the result back to the main server.

In our case it uses the myth machines. But you can use Hadoop, or MapReduce in general, on a giant set of servers if you want, with thousands and thousands of processors.

What's nice about it is that if you build your program well (and ours aren't necessarily built for that purpose), it's very robust to computers going down.

So if you go to a Facebook or a Google data center they have tens of thousands of computers in there。

And every single day many of those hard drives or computers will actually break。

Because at tens of thousands, when you scale up, things tend to break.

There are literally people hired to walk around with a cart of hard drives or SSDs or whatever, go to this row and this column in the data center, and replace the hard drive because it broke.

Some of your data might be on those broken things, so you have to be robust enough that the data is actually spread out across a number of systems.

But it's the same idea with doing an algorithm that's using tens of thousands of computers。

You need to write your application such that if one of the computers goes down you don't have to restart the entire process。

That would be deadly, because there's no way something with 60,000 computers could last for more than an hour without one of those computers breaking.

So you have to be robust about that, and MapReduce allows you to be robust in that way.

Although that's not really the point of our assignment. I started this the other day, so I'm going to review it to get us back up to speed.

As I already said, you've got the map part and the reduce part.

And the map part takes the data and comes up with some intermediate result。

For the intermediate result we're going to use a classic example: taking a text full of words.

You take each word and put it in a list in some file with a little 1 after it, saying there is one occurrence of that word.

Why the 1? So you can add it up later. You could leave out the 1 and just assume it's there, and that's fine too.

But in this case there's going to be an intermediate result, which we'll see, that is just a word with the number 1 next to it.

This is not sorted at all at this point. Then you do a sorting stage, where you take a file and sort all the words, so it'll say "and 1, and 1, and 1, and 1".

Once it's all sorted, you run a group-by-key, which goes down the list, collects all the "and"s, and writes "and" once with all of its ones, and then does that for all the other words.

Then you reduce, which collates all those results and sends them back to the original server.

Okay, so that's the big process. The mapping stage depends on what you're mapping, and the reducing stage depends on what you're reducing.

But the sorting and group-by-key are the same for every problem, so you never need to redo those parts. We'll talk in more detail as we go through this.

So here's the thing: there are a couple of Python programs that we're going to show you.

By the way, this isn't listed in your assignment itself, but if you want to use some of these Python programs in your code, you are welcome to, so you don't have to rewrite some of these algorithms yourself.

That's actually okay. It doesn't help with the mapping and reducing part, but it helps with the sorting and so forth.

What this file does is go through standard input, read all the lines, and split each one into tokens by whitespace.

Then, for each token, it makes the word lowercase, matches it against a pattern, and prints out the word with a 1 next to it.
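The mapper just described can be sketched as follows. This is a minimal illustration, not the course's actual script: the function name `map_words` and the "letters only" pattern are assumptions, and the sample lines stand in for standard input.

```python
import re

def map_words(lines):
    """Yield (word, 1) for every token that looks like a plain word."""
    for line in lines:
        for token in line.split():          # split on whitespace
            word = token.lower()
            # Assumption: only purely alphabetic tokens count as words.
            if re.fullmatch(r"[a-z]+", word):
                yield word, 1

# Stand-in for reading a book from standard input.
sample = ["Happy families are all alike", "every unhappy family is unhappy"]
for word, one in map_words(sample):
    print(word, one)
```

Note that the mapper never adds anything up; every occurrence gets its own line, which is what lets many mappers run independently.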

That's all this file does. Okay. How do you run it?

You say something like this and we will go do that。

Let's see. We need to go into the right folder. You might want to write down some of these paths.

You might want to have these files handy, because you will be able to call them without copying them into your folder.

You'll see how that works in a minute。 Let's see。 Spring and then live map reduce。 Okay。

So let's say cat Anna Karenina and we are going to pipe it through dot slash word count mapper。

Okay. And that's all it's doing: taking every word and just putting a one next to it.

Most of the time you would have this output go into a file, because you're going to do lots of these.

You probably wouldn't have it just print to the screen.

You'd probably have it piped out to some file. That's all it's doing. Take each word, add a one.

Word one。 Word one。 That's that。

Okay. After we do that, you've got this group-by-key situation, where you take the output and first just sort it.

If we're doing this on the command line, all we need to do is pipe it through sort.

You can see "zoological" appears twice, so the end result should have it with a 2.

Now we're going to use this other program to run through the sorted output and gather up the ones.

That's the next portion of this.

Okay. This is a kind of cool little Python script that's a little bit dense, as it turns out.

It's not really that hard to understand once you get it, but it uses things that aren't necessarily in C++.

For instance, this yield: this function is a generator that is used by another function you'll see down here.

It goes through each line, grabs the first item, which is the word, and then joins together all the entries for the same word.

It then prints out the word. Oh, sorry, this is not counting; it's just printing the word and then 1 1 1 1 1 1 1.

That's what this is doing. Let's see what happens when we run it, and if you want to look at the details of the program you can.
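A group-by-key step of this shape can be sketched as below. It assumes the input lines are already sorted and each has the form "word 1"; the use of `itertools.groupby` and the function names are choices made for this illustration and may differ from the script shown in lecture.

```python
from itertools import groupby

def read_pairs(lines):
    """Generator: yield (key, value) for each non-blank 'key value' line."""
    for line in lines:
        key, _, value = line.strip().partition(" ")
        if key:
            yield key, value

def group_by_key(sorted_lines):
    """Collapse runs of identical keys into one 'key v1 v2 ...' line."""
    pairs = read_pairs(sorted_lines)
    for key, run in groupby(pairs, key=lambda pair: pair[0]):
        values = " ".join(value for _, value in run)
        yield f"{key} {values}"

for line in group_by_key(["and 1", "and 1", "and 1", "zoological 1", "zoological 1"]):
    print(line)
```

`groupby` only merges adjacent equal keys, which is why the sort has to come first. Nothing is summed here; the ones are simply placed side by side.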

So then we pipe that through group-by-key, and now it's got each word followed by its ones. It's kind of one step at a time.

You could probably combine some of these steps if you really wanted to, but this is the main algorithm, and if you do it this way it will work for lots of different problems.

If you put all your MapReduce jobs in this format, it always works.

That might not be the case if we did it some other way. For instance, "your" has this many ones because it appeared that many times in the document.

Once you've got that, you need another program, in this case the word count reducer, which is another one of these scripts. It does the same sort of thing: it takes each line and gets the first space-delimited value, which is the word.

All the other space-delimited values are just ones, and it sums them all up.

So it computes the sum of the rest, prints out the word and the sum, and there's your answer. That is what would get sent back to the server in this case.
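The reducer described above can be sketched like this. It assumes each input line looks like "word 1 1 1"; the function name `reduce_counts` is hypothetical and the real script may be organized differently.

```python
def reduce_counts(grouped_lines):
    """Turn 'word 1 1 1' lines into 'word 3' lines."""
    for line in grouped_lines:
        tokens = line.split()
        if not tokens:
            continue                        # skip blank lines
        word, values = tokens[0], tokens[1:]
        total = sum(int(value) for value in values)
        yield f"{word} {total}"

for line in reduce_counts(["and 1 1 1", "zoological 1 1"]):
    print(line)
```

Summing the values, rather than counting them, means the same reducer would still work if an earlier stage had already combined some of the ones into partial sums.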

Remember what we're doing here is not a map reduce across many servers it's just on one computer。

So what would that one be? That would be dot slash word count reducer, and there you go: notice "zoological" has a 2 in there, "yourself" has 27, and so forth.

And that's the big idea for MapReduce: map things out to "word 1, word 1, word 1", collate all the words, sum up the counts for each word, and then send the data back.

The problem is that you have to do it in a distributed way, which is the part the assignment is about.

Yeah. So the "one, one, one" seems like something one program could take out and just sum in the second step.

But because we're distributing this, each process can only tack on a one, because each process won't have access to the others.

Yeah, that's a good way of putting it. Sam's comment is very good: the "one, one, one" doesn't seem to make sense here, because you could have just summed it right then, since it's one thread doing it all.

But when multiple threads are all trying to contribute to one file, let's say, each one only has to contribute a one, so it just tacks a one on the end.

Now, could it read the value in, add to it, and write it back? Sure, but this is the most generic way to do it.

In the pipeline that you have on the previous slide, where do the threads rejoin? Yeah, so remember, this is a single-threaded one, so there's nowhere it would happen.

In the assignment, they would rejoin in the reduce phase. Each reducer takes its individual parts and reduces them down, and when the last one finishes it tells the server, hey, we're all done, grab this file, and then the server goes and gets the file.

So that's how it works for the assignment. Here's the nice thing about this assignment, and this is not necessarily true across the data centers of the world: we have 10 or 12 or 15 myths, however many there are.

They all share the same file system. If you ask for a file, you can use one path and it gives you that file no matter which myth you're on.

You do this every day when you SSH into a myth and go to your CS 110 folder.

It's the same folder on multiple different myths, because really it's the same underlying distributed file system.

So yes, we are using multiple myth machines, but we're using one single file system. Now you might think, how could I do this if we didn't have a shared file system?

Well, you'd just have to pass the data back and forth every time. If you wanted to say, hey, go process this chunk of data, you would first have to send the data across to that machine, and when it's all done you'd have to gather the data back again.

And if you've ever taken or maybe some of you will take a parallel processing class you'll see where that becomes important。

Why is that a big deal? Sometimes it's too expensive to keep the entire file system synchronized across many data centers。

It's not that bad these days, but keeping it all synchronized across many data centers is not really that efficient, so it's actually easier to send the data and retrieve the result than to say, hey, read your local data and I'll get the response once it gets there.

So that's that。 Okay so questions on the map reduce algorithm itself besides the question about when do these parts happen。

Anybody else have any questions on this? Yes. So, the group-by-key: let's back up one second and see what the group-by-key does.

The group-by-key just puts all these ones in there, and it hasn't actually collated them yet.

It hasn't summed them all up, so in our case the reducer is just doing the summing part.

That's all it's doing. Why are they separate? I think it comes down to this idea that many threads on a machine might be adding to one file, or to a few files, and you don't want to try to do the summing at the same time you're appending; it might just be slower.

Just adding a one is easier than saying, oh, I'm now going to get the value back and sum it up, and so forth.

I think that's probably the main reason there. So let's look at the actual assignment.

As I said already, it's due next Wednesday, although late penalties won't start until Friday, so you can hand it in Thursday without penalty.

It obviously is going to do the MapReduce algorithm.

It's going to use the myth machines: you start on one myth machine and run your basic program there, and it will launch a program on other myths via SSH.

You can look in the code and see how it does that. That program then calls back to the server and says, please give me some data and tell me what to do.

So basically the one program farms out all these processes to different myths, and then each one calls back to the main myth and says, hey, give me some work to do.
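The "give me some work to do" pattern can be illustrated on a single machine. This sketch is only an analogy: threads and an in-memory queue stand in for the remote myth workers and the server connection, and `run_workers` and `handle` are hypothetical names, not part of the assignment code.

```python
import queue
import threading

def run_workers(tasks, num_workers, handle):
    """Each worker repeatedly asks for the next task until none are left."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                task = pending.get_nowait()   # "give me some work to do"
            except queue.Empty:
                return                        # nothing left: worker exits
            outcome = handle(task)
            with results_lock:                # protect the shared list
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

done = run_workers(["00001.input", "00002.input", "00003.input"], 2, str.upper)
print(sorted(done))
```

Because workers pull tasks instead of being assigned a fixed share up front, a fast worker naturally ends up processing more files than a slow one.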

Most of this is already written for you it turns out。 There's a lot going on there。

So we've given you a pretty robust start for the program。

What that means is that there's going to be a lot of code you have to understand beforehand。

Now this has been true in this whole class. I was talking to somebody in office hours a couple of days ago (I don't see that person here) about this idea that we have a ton of code to read through, and this happens wherever you go.

How many people have done internships already in CS in some form or other?

Have you found that you get to this place and they say, okay, your job is going to be to make some updates to the code base, and you look at the code base and it's a billion lines of code?

Not even kidding, it's a billion lines of code, and you've got to figure out how to do that. Every place you work has a different way of formatting the code, different tools, different ways to do Makefiles, and different ways to do code reviews. You have to learn all this stuff, and it's very similar to us saying, here's a bunch of code that you're going to modify in some way.

You got to understand the code more or less first。

So if you're thinking, oh, there's a lot of code to read: we're not doing this just for kicks, we're doing it to get you ready for when you have to do this in real life.

So this is the robust start. We give you a lot of code to look through, and we're going to look through some of it in a few minutes.

There are four tasks. Handle each one in order, just like in the other assignments; that's probably the best way to go.

One thing you will be adding, of course, is a thread pool. Adding a thread pool lets your main distributor program launch all these threads without having to worry about managing them, and each one of the workers can also use multiple threads when it gets more work to do.

So that's kind of a cool part of it. You're also going to be doing the reduce part. The one part that we've really left up to you is the reduce stage, which we've left pretty open-ended. We'll look at this in a minute as well, but let's see.

If we look in assignment eight, at the MapReduce reducer dot cc file.

Okay, there's your function so far. We haven't given you much, right? We've given you exactly zero for that. So this is your place to shine, where you write the reducer. But once you understand what has to happen, you go, oh, okay, all these distributed files are going to be in this format.

I need to collate them and then get them back to the server. It's doable.

Okay we'll get we'll get there as we go through this。 Okay all right so that is that。

Let's actually talk about the details because you got to get up to speed on some of these details as well。

When you clone the assignment, okay, in fact I'm going to do, let's see, rm -r. Which one do I want? Is it files? Yes, I hope it's files; if it's not, I'm in trouble.

So when you start your assignment... yeah, never do rm -r *; that would be really bad. An rm on /* will destroy things in a bad way, especially if you're a superuser. Don't do that.

Trust me, I've done stuff like that before. You accidentally do that and it just destroys your day, I can tell you that.

So anyway, when you start out you don't have these directories. All the files are going to end up in this one particular directory, which will help you when you're debugging.

What you do is type make directories, and it creates the directories: one is called files/intermediate and one is called files/output. It deletes them first and then creates them for you.

That's where all the data is going to go. Every time you run your program, you should clean those folders out by typing make filefree, which just deletes all the files in there.

Why? Because you don't want stale files from a previous run sitting there when you're trying to process a new batch of data. So always remember to do make filefree. You only have to do make directories once, but make filefree pretty much every time before you run the program.

You should do that。 Okay, there are actually five different executables that this program entails。

Okay, I'm seeing people go what are you talking about。 It's not that bad。

You've got the MapReduce mapper, which does the mapping, the MapReduce reducer, which does the reducing, and the program that is coordinating these.

Those are the ones coordinating all this. We've also written two programs for you, the word count mapper and the word count reducer.

Those are similar to the Python ones you've already seen, and they are already built. You will have to modify the other ones to make it actually work.

Because of the way you're doing this, you might think, well, how many threads do I need, and how many different servers, and this and that. There's a configuration file for that.

In fact, let's look at it right now: the Odyssey full dot cfg.

Oops. Hang on. There we go. Okay, let's see the Odyssey full dot cfg.

So here's what's in that file, and it's always going to look exactly like this.

You have a key that says what program is going to do the actual mapping of the words. It's going to be this word count mapper program. What program is going to do the reducing? The word count reducer.

How many mappers are we going to have? We're going to have eight mappers.

That's basically eight threads that are going to be on different servers doing their thing.

And how many different reducers are we going to have? We're going to have four reducers.

Why those numbers? Kind of arbitrary.

Normally you'd probably make them equal, but maybe you know that there's more work to be done in the mapping part and you want more horsepower for that.

And then it says what the input path is。 In this case it's this Odyssey folder。

I'll show you what's in there in a second。 And then you tell what the intermediate path is。

The intermediate path is after the mappers get done。

They will put a whole bunch of files in this intermediate path and we'll look at those。

And then you have an output path which is where the final results go。 And we'll see that as well。

Let's go look at the Odyssey folder. If we go to samples and then Odyssey full:

We've already done this for you: we've broken up the text of the Odyssey into twelve different files.

Let's look at one: 00007.input. By the way, one thing before you get to this part.

See how it's four zeros and a seven, and three zeros and eleven?

You have to write that part yourself: how do I make a string with enough leading zeros that the total number of digits is five?

It's not really that hard. Don't overthink it. But just know that you're going to have to do that, and go, oh, okay, this is the part where it doesn't do it for you.

People get stuck on that and go, oh, why isn't it doing that? You just have to do it yourself.
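The leading-zero formatting can be sketched in a couple of lines. The assignment itself is written in C++, where a stream with `std::setw` and `std::setfill` is one way to get the same effect; the Python below only illustrates the idea, and the helper names are hypothetical.

```python
def padded(number, width=5):
    """Format a non-negative integer with leading zeros, e.g. 7 -> '00007'."""
    return f"{number:0{width}d}"

def input_name(number):
    """Build a file name in the style of the sample inputs."""
    return f"{padded(number)}.input"

print(padded(7), padded(11), input_name(7))
```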

But anyway let's look at one of these files。 This is what it is。 It's just part of the Odyssey。

Okay and you can read through and this is the thing。

We've broken it into twelve different files because we're basically going to farm each one of those files off to a different server to do their thing。

That's how that one works。 So as you're going along you can see you won't really have to modify this too much。

If you want, there's also an Odyssey partial, I think it's called.

Yeah, Odyssey partial, which is many fewer files. If you're debugging, it might be better to use this one so it doesn't take forever.

Just keep your eye on changing that: you need to change the cfg file, or run with the Odyssey partial dot cfg file.

Okay, and those are the files I already showed you.

So let's actually run this。 Okay so when you're running this I'm going to copy this line because it's kind of long。

You'll get used to copy pasting either this one or the one for your program。 Maybe I won't。

There we go. Okay, so I've done make filefree. I've already done that.

And no, I haven't. I have, but I've got to be in the right folder. make filefree. There we go.

And then we're going to run this line. Here's what's going on.

The first part of this is the MapReduce coordinator.

That's this part right here.

We're going to just run the solution so you can see it.

Then the mapper is going to be the MRM solution.

The reducer is going to be the MRR solution, and the config file is going to be this.

You might say to yourself, why don't we just put these things inside the configuration file? You could have.

But this way is a little more generic.

You could use some other executable here but keep the same config file.

So that's just what we're doing in this case。

Okay let's run this and you will see lots of stuff flying by。

Okay so all of this log data is not bad to use if you need to figure out what's going on。

There's lots of it here, and this is the full version with threading, so it's actually relatively quick.

It goes through and eventually it lists all the output files, and it hashes the output files so you can check the answer.

Remember all the way back to the first assignment, the file system assignment?

Same sort of thing, where you go, oh, my hash is the same, we're good to go.

So that's what you're going to want to do there.

As I said, your program is making all of these SSH calls.

Let's actually look up how it's doing that, searching for SSH in the dot cc files. There we go.

It's got some streams in there, and it's basically building up something that tells it how to SSH to the other machine.

So there's lots of fun stuff going on there. You might be asking yourself, how does it know your password?

I forget how it actually does that part; I'll look it up. What's that?

Right, it doesn't know your password; it does it without having to know your password. I forget how.

Anyway, it does that.

And then let's see what else. There are all of the different communications happening here, like where it says:

informing worker at myth52 that all file patterns have been processed.

We're going to see some of these. Keep all of them in mind, because they're going to help you when you're debugging.

Okay。 All right。 So that's that。 Now let's look at where the output happened。 Okay。

So if we look in files/intermediate:

There are a whole bunch of files in here. Here's how they actually work.

00001.00000.mapped is basically saying: from file 00001.input,

it takes the subset of words that individually hash into the 00000 mapped file.

Any time you have a .00000.mapped file, it means the word happened to hash to that number mod, I believe, 32.

If we go down here and look at the max, yeah, it's 0 to 31. Each one of these ends at 31.

Now, there were 12 different input files, and there are 32 individual mapped files for each.

So 32 times 12 is 384. There should be 384 different files in this folder.

And there are, in this case. We'll talk in more detail about what that is.
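The file count can be checked by generating the expected names. This sketch assumes input files are numbered from 1 and buckets from 0, as in the names shown above; the function name `intermediate_names` is made up for the illustration.

```python
def intermediate_names(num_inputs, num_buckets):
    """List names like '00001.00000.mapped' for every (input, bucket) pair."""
    return [
        f"{input_number:05d}.{bucket:05d}.mapped"
        for input_number in range(1, num_inputs + 1)   # assumed to start at 1
        for bucket in range(num_buckets)               # buckets 0 .. 31
    ]

names = intermediate_names(num_inputs=12, num_buckets=32)
print(len(names), names[0], names[-1])
```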

The first time I looked at them I wondered what was going on here. I'll show you.

So look at all these files. This is what the word count mapper produced.

It will look familiar when we actually look inside them.

The word count mapper placed these files in here.

They represent the words from the input that hashed to the particular number on the end here.

I'm going to show you what "hashed" means, and I wrote a little program so you can test out a word's hash yourself.

But let's say we looked at 12 and 28: 00012.00028.

Does this look familiar? This is what the output of the mapper was earlier. Well, these are actually a little bit farther down the line.

These are all of the words from the 00012 input that, when hashed, ended up with a value of 28 after being modded by 32.

Let me show you what that means. The word "the" is in here, in this list.

If we run it through some hash function, you get a number, right? Let's go do this.

I wrote a program. Let's see, I'll just run it from here.

You can run it like this: it's under /usr/class/cs110, in lecture examples.

There we go: lecture examples, map reduce. Oh, no, it's not built. Sorry, you do have to make it first when you download it.

It's called hasher. If you run hasher with no arguments, it says, please provide something to hash.

So if we provide "the", it hashes to this number. What is the hash function?

It's something built into C++, and it might actually be different on different operating systems and different C++ compilers. We're using the same one,

so it's always going to be the same for us. If I do it again,

it hashes to the same number. That number will not fit into 32 buckets.

So if we mod it by 32, it fits into bucket 28. Isn't that what we wanted?

Let's look at another word in the list, "grieve", g-r-i-e-v-e.

It should also hash into bucket 28. Okay, so what's happening here?

There are 32 files per input file, and they have been mapped in the following way: every word in the input file that hashes to a particular one of the 32 buckets goes in that bucket's file, with a little 1 after it.

Okay, and that's what the intermediate files end up with.
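
The bucketing just described can be sketched roughly as follows. This is a minimal illustration, not the assignment's code: `bucketFor` and `partitionByHash` are made-up names, and because `std::hash` is implementation-defined, the bucket a given word lands in (28 for "the" in the lecture) may differ on another compiler or standard library.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the bucket index for a word: its std::hash value modulo numBuckets.
// The raw hash value is implementation-defined, so results can vary by platform.
std::size_t bucketFor(const std::string& word, std::size_t numBuckets) {
  return std::hash<std::string>{}(word) % numBuckets;
}

// Groups words by bucket, storing each one as "word 1" to mirror the
// intermediate-file lines described in the lecture. (Hypothetical helper.)
std::vector<std::vector<std::string>> partitionByHash(
    const std::vector<std::string>& words, std::size_t numBuckets) {
  std::vector<std::vector<std::string>> buckets(numBuckets);
  for (const std::string& word : words) {
    buckets[bucketFor(word, numBuckets)].push_back(word + " 1");
  }
  return buckets;
}
```

Every occurrence of the same word goes to the same bucket, which is what later lets one reducer see all the counts for that word.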

Okay, and there are 384 of them. Good question. The question is: wait, you just said something about having to figure out this 00012 business?

Some of it is figured out for you already; some of it, like these ones, you'll have to write yourself, and I'll show you how that actually works. Well, actually, that part is not written for you.

You have to figure out how to take a number and pad it with leading zeros up to five digits. This is like a 106 question, really; it's not that hard. You'll find out when you write the program.

It's in the mapping stage that this is happening. It's not in the Python; this is going to be in C++. We're not actually using any Python yet. You'll see where that happens a little later. Okay.
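
One possible way to do the zero padding mentioned above, using standard stream manipulators. The helper names `zeroPad` and `intermediateName` are invented for this sketch, and the `<input>.<bucket>.mapped` shape is taken from the file names read out in the lecture.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Pads a number with leading zeros up to `width` digits, e.g. 12 -> "00012".
// Numbers that are already wider than `width` are printed in full.
std::string zeroPad(std::size_t number, int width = 5) {
  std::ostringstream out;
  out << std::setw(width) << std::setfill('0') << number;
  return out.str();
}

// Hypothetical helper: builds a name of the form "<input>.<bucket>.mapped".
std::string intermediateName(std::size_t inputNumber, std::size_t bucket) {
  return zeroPad(inputNumber) + "." + zeroPad(bucket) + ".mapped";
}
```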

But now, are there any questions about what the results are and why there are 384 files here?

I didn't really tell you exactly why it's 384, but what it ends up being is that there are eight mappers and four reducers, so we end up with 32 different files per input file. Okay, that's why.

Now, why is it the number of mappers times the number of reducers? This is completely arbitrary. It's arbitrary mainly because doing it this way makes you break the work into lots of different pieces, which, when you're doing your threading part, can surface some harder-to-find bugs, mainly whether you're locking around the right things.

So that's the big deal with this one: we're saying you have to produce these 32 files, eight times four in this case, because we want you to write your threads and make your locks work appropriately.

So it's a little bit arbitrary in that sense, but that's what's going on there. Okay?
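
The counts quoted here can be checked with a few lines of arithmetic. This is only a sketch with made-up function names; the figure of 12 input files is inferred from 384 / 32 and is not stated in this passage.

```cpp
#include <cstddef>

// Number of hash buckets, and so intermediate files, per input file.
constexpr std::size_t numHashCodes(std::size_t numMappers, std::size_t numReducers) {
  return numMappers * numReducers;
}

// Total intermediate files across every input file.
constexpr std::size_t totalIntermediateFiles(std::size_t numInputFiles,
                                             std::size_t numMappers,
                                             std::size_t numReducers) {
  return numInputFiles * numHashCodes(numMappers, numReducers);
}

// 8 mappers x 4 reducers = 32 buckets per input file.
static_assert(numHashCodes(8, 4) == 32, "32 buckets per input file");
// Assumption: 12 input files, inferred from 384 / 32.
static_assert(totalIntermediateFiles(12, 8, 4) == 384, "384 intermediate files");
```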

All right, so let's look at another file just to make sure. We just looked at the one for 00012. If we look at 00005, with the same 28 again instead of 12, so 00005.00028, well, it will also hopefully have... yeah, "the" is in there.

Okay, anything with a 28 means that the words that end up there happened to hash to 28. Now, why are there different "the"s in different files at this point? Right now, each individual input file might have a bunch of "the"s in it. In fact, we can find out.

Let's see, grep. How are we going to do this? It's a little hard to get this perfect because of the files. But if we look for "the" (and this will not handle commas and things) in the samples, let's see, the Odyssey full files, star. And then, I think -l will just list the files that it's in. There we go.

So it turns out that "the" happens to be in files three through twelve, and probably two as well, but I didn't do it right. So it's in each file, and they're all going to end up in anything with the 28 after it. Yeah?

Good question. The question is: how much faster are we talking about when we do this multithreading?

It's a little hard to answer that. It will scale well, I'll put it that way, just because you are spreading the work out over many different servers.

We're going to try the starter version, which doesn't have any threading in it and doesn't do the whole thing, and you'll see it's marginally slower. But you do get some speedup, and you'll be able to see it when you do it.

It's hard to quantify exactly, though. For the size of the problem we're doing, you might not get a huge speedup, but you get some. Okay.

So where do we stand at this point? We have now shown that you're going to have to map these into many different files before you do the reduce.

Yes? The question is: is the reason for hashing so that we can split the words into kind of evenly sized files? Pretty much. Let's actually look at the file sizes here and see.

So 685 looks like a small one, but then it goes up to maybe a few thousand; there might be a ten thousand in here, I'm not sure. But they're roughly the same size.

Would there be another way to do it? I mean, you could probably figure that out for whatever data you have. You might say, oh, just put A through L in this file, the words that start with those letters. That probably wouldn't be perfect, though. Hashing is probably a little better in that sense.

Yeah, good question. Anybody else? Okay, let's see.

I was just wondering if we could count all the words in the Odyssey easily. Let's see... no, I won't do it right now. We could count all the words and see if they're all in there. I'll put that on Piazza; it's going to be easier than me trying to figure the command out right now. But it is possible.

So if you want to learn how to use the hashing function, we've written it for you already; just go look at this file that I put down here. That's not it. Oops, I think I must have pasted the wrong one. I will fix that. "Hasher program located..." hang on. No, it didn't do it.

Well, if you click on it, it should come up, but you can look at that file and actually see how it's done. It's not that hard. Here, I'll show you: cat, the spring map-reduce directory, hasher.cc.

It's pretty easy to use. You basically say hash, angle bracket, string, then hasher, and then you pass in the string, and there you go: that gives you the hash value. So, really straightforward to do.

Yeah, one more? So the question is: if you're doing MapReduce, do you no longer have any race conditions once you've mapped things out?

No. If you're using the distributed file system, any time two threads on any myth are trying to write to a file, they had better not write to the same file. So you definitely have to do something there. Now, it's going to be a little harder: you can't lock across myths, right? So there are other things you have to worry about.

I think in that case... let's see. No, it's only with multiple threads that you could run into that problem. Each individual myth is not going to be writing the same file as some other one, so that's a good thing. So it'll work out in the big picture.

You do still have to think about it for the threads. I should just back up: think about it exactly like you normally do. If multiple threads can modify a data structure at the same time, I had better lock around it. That's the main thing to think about. Okay.
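
The rule stated here, lock whenever several threads can modify the same data structure, looks like this in plain standard C++. This is a generic sketch using `std::mutex` and `std::thread`, not code from the assignment; `SharedCounts` and `countConcurrently` are made-up names.

```cpp
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// A word-count table that several threads may update at the same time.
class SharedCounts {
 public:
  void add(const std::string& word) {
    std::lock_guard<std::mutex> lock(m_);  // held only while the map changes
    ++counts_[word];
  }

  int get(const std::string& word) {
    std::lock_guard<std::mutex> lock(m_);
    auto it = counts_.find(word);
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::mutex m_;
  std::map<std::string, int> counts_;
};

// Starts numThreads threads that each record `word` perThread times,
// then waits for all of them to finish.
void countConcurrently(SharedCounts& counts, const std::string& word,
                       int numThreads, int perThread) {
  std::vector<std::thread> threads;
  for (int t = 0; t < numThreads; ++t) {
    threads.emplace_back([&counts, &word, perThread] {
      for (int i = 0; i < perThread; ++i) counts.add(word);
    });
  }
  for (std::thread& th : threads) th.join();
}
```

Without the lock in `add`, concurrent increments could be lost or the map could be corrupted; with it, the final total is always the number of calls.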

Okay, so let's actually run the starter code and see what it produces.

I'm going to do the same thing as before: make filefree, which clears everything out. Don't forget to do that. And then I'm going to copy this little line here.

This one is basically using your version, the one that we give you as starter code, and the mrm, which we give you, and the mrr, which we give you, and then it says map only. This is because we didn't write any reduce for you. And quiet means don't print any details except for these hashes that you'll see printed out.

So that's what we're going to run, and you can see whether it's faster or slower than before. Let's see, I'll do it from here. There we go. Okay, so it's going. Right now I have to wait around. This is only the mapping part, right? And there it goes.

So that's it. For the mapping part, I guess you can't really tell how fast it is, but it prints some of the hashes that come out.

Let's look at the files right now, the intermediate ones. By the way, if we looked at the output files, there are none, because there's no reduce stage yet.

But if we look at intermediate, the starter code does not break things up by hash. That's the part you're going to have to do. What it does do is already map the words for you: 00001.mapped. And look what it's done for you already, right?

So in this sense, you don't need to use that Python program to do anything here. We've already written that for you in C++. Okay, and this is now fairly far along the path of doing the mapping for you. Okay.

And let's do it without the quiet part. Do make filefree again, and then I want to run this without quiet, and you'll see what other things are going on here.

Okay, so it still communicates. It's already doing a lot of that communication stuff. So it's not like it's just doing it locally like the Python scripts; it is actually communicating and sending back data. So a lot of that's already written for you.

Question? "It's not doing the hashing?" Right. So this is only part of the way there. You are still going to need to further break each of these input files into 32 different files, in the four-times-eight case, and you can do that by hashing.

And again, part of this is arbitrary in terms of why we're making you do that part. Part of it is to have more files that can be spread out among more servers, and part of it is to test that you can do locking and multithreading well.

Again, we've given you a thread pool; it's not like you have to write that part. You just have to do the proper locking, which, in the big picture, is really the biggest part going forward from here.

You've now written a thread pool, and you've done regular threads before. From now on, you'll probably just use a thread pool, or maybe use threads individually. But at this point the important part is: do you really know how to lock in the right places? That's the important part.

Okay, other questions? Okay, so let's go back here. So that's what we've actually done in this program.

Actually, what we can do now is see where "the" shows up. Let's see, find... no, grep, and we'll do it this way: "the" in files/intermediate/*, with -l. That should tell us which files it actually found "the" in.

Yeah, it found it in all but one; the only file that didn't have "the" in it was one, in this case.

So it's already taken each individual input file and done the first round of mapping for you. Okay.

Okay, so where does that leave us now? Well, let's look at some of the files. Here are the main files that you'll have to look at at this stage.

By the way, there are four tasks. Task one is: understand the files. You don't have to write anything for task one except absorb, and you have to do that in the best way possible, but don't skip task one. You'll get crushed, because there's too much to understand, even though it's just reading. Don't try to just jump in and write this stuff; first, try to understand these things.

Okay, mrm: that is the entry point. That's the client of the MapReduce server, and the server invokes mrm remotely on each machine, so each machine gets a MapReduceMapper that runs on it.

And as I said, the reason this works is because of the distributed file system. MapReduceMapper is the class that actually does all the server business. Okay.

And here's the most interesting part about that. Let's go look at MapReduceMapper and MapReduceWorker. Let's see, what's the name of the file? We'll look at MapReduceWorker first. Okay, let's look at mapreduce-worker.h.

It defines the MapReduceWorker base class. Now, how many people remember from 106B, like the last week of class when you were already checked out, where you talked about polymorphism and inheritance and all that? Anybody remember that? Yeah, like one person remembers that. Right.

This is written for you, so you don't need to worry about it too much, but basically we've broken this down and said: look, there's a server, and there's the reducer and the mapper. They both have to talk to the server in pretty much identical ways, but not exactly identical; they have to say different things, too.

So let's create one base class that does this worker part, and then the reducer and the mapper each subclass off of this worker. So you will have to understand inheritance just a little bit for this, and when you read through this, don't be scared when you see weird things, which I'll show you in a second. Okay, so what do we have in here?

We've got a huge set of parameters for this.

And then we can request input, and notice there's client stuff in here. Again, this part is mostly written for you; we're not asking you to redo all the client creation business and so forth.

It's got the server stuff, and it's got alert-server-of-progress. You should be using some of these to print out the right messages when you're reporting progress and so forth. Okay.

Let's look at MapReduceMapper, and I'll show you the part that might seem a little weird. Here's the actual constructor for this. Notice that it has MapReduceWorker as part of its definition. This is because it's subclassing MapReduceWorker.

It doesn't even have other parameters of its own; it's basically passing all of its parameters on to the MapReduceWorker class, and then it adds some functions down here. We've got map going on here, and that's it. That's all we have so far.

I'm going to show you MapReduceReducer as well, the .cc file. Same sort of thing here, where it subclasses MapReduceWorker and then has this reduce function.

Okay, so there's not too much that you have to dig through to find out what's happening where, but you have to understand what's going on in those files. Okay, that is task one.
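
For anyone rusty on inheritance, the shape described above looks roughly like this. It is a stripped-down illustration with invented names (`Worker`, `Mapper`, `Reducer`, `serverAddress`, `role`), not the assignment's actual classes: a base class owns the shared server-facing state, and each subclass forwards its constructor arguments to it and adds its own behavior.

```cpp
#include <string>
#include <utility>

// Hypothetical base class: holds what both kinds of worker need to reach the server.
class Worker {
 public:
  Worker(std::string serverHost, unsigned short serverPort)
      : serverHost_(std::move(serverHost)), serverPort_(serverPort) {}
  virtual ~Worker() = default;

  // Shared behavior, written once for both subclasses.
  std::string serverAddress() const {
    return serverHost_ + ":" + std::to_string(serverPort_);
  }

  // Each subclass reports what kind of work it does.
  virtual std::string role() const = 0;

 private:
  std::string serverHost_;
  unsigned short serverPort_;
};

class Mapper : public Worker {
 public:
  // No state of its own: every argument is forwarded to the base class.
  Mapper(std::string host, unsigned short port) : Worker(std::move(host), port) {}
  std::string role() const override { return "map"; }
};

class Reducer : public Worker {
 public:
  Reducer(std::string host, unsigned short port) : Worker(std::move(host), port) {}
  std::string role() const override { return "reduce"; }
};
```

The `: Worker(...)` in each subclass constructor is the "weird" part: it is the base-class constructor call in the member initializer list.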

Okay, understand all of those files. Let's see, is there anything else in here? Let's look at mrm.cc.

Okay, this is basically saying, right now, map. That's what it's doing. Eventually reduce is going to get called after that, so you have to do reduce too.

What you're also going to have to do, it turns out, is modify the number of parameters in here. You'll find out when you read through the assignment. Basically, you have to add another parameter for the splitting of the files based on those hashes. So you're going to have to update that.

The assignment kind of lists where you have to do that. All right, task two. Well, now we have to spawn multiple mappers. Whenever you say "spawn multiple," you should immediately be thinking thread pool, right? This is where you have to start.

So you say: okay, I'm basically going to modify spawnMappers. Where is spawnMappers? It's in the MapReduce server. Okay, the spawnMappers function right now has no thread pool in it.

What you are going to do is wrap a thread pool around spawnWorker so that it spawns many workers. Why are we doing that? Because that will make it faster and let the workers communicate better.

What do you have to do with that? You have to make sure to lock in the right places, all the various things you now know about from thread pool. So those are the things you have to do.

Actually, I had the wrong one there. There's a method called orchestrateWorkers, which uses handleRequest, and I think it's handleRequest that you're actually going to pass to a schedule call.

There we go: there is an orchestrateWorkers function, and then there's handleRequest down here, and right up here is where you're going to end up doing your scheduling. Let's look at the header file for this. There we go, orchestrateWorkers. Oh, it doesn't say anything about it; I thought it did. Well, it's in the assignment anyway.

Make sure you put in your mutexes and lock only around things that are going to change from multiple threads. Look through the code and see where that happens.

Okay, question? handleRequest is basically where you'll need the mutexes, right? You'll need them in there, and then anything that calls it might end up needing to be locked, because there are going to be various files that are written to and read and so forth that you need to pay attention to.

Anybody else? Okay.
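
As a rough picture of "many workers, with locks in the right places": the sketch below uses plain `std::thread` as a stand-in for the course's thread pool, whose interface is not shown in this transcript. All names (`PendingInputs`, `claim`, `processAll`) are hypothetical. Several workers claim input names from one shared list, and each shared structure is touched only while its mutex is held.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Shared list of input names still waiting to be processed.
class PendingInputs {
 public:
  explicit PendingInputs(std::vector<std::string> names) : names_(std::move(names)) {}

  // Moves the next unclaimed name into `out`; returns false once none remain.
  bool claim(std::string& out) {
    std::lock_guard<std::mutex> lock(m_);
    if (names_.empty()) return false;
    out = std::move(names_.back());
    names_.pop_back();
    return true;
  }

 private:
  std::mutex m_;
  std::vector<std::string> names_;
};

// Runs numWorkers threads; each keeps claiming names until the list is empty.
// Returns the names that were processed, in completion order.
std::vector<std::string> processAll(std::vector<std::string> names, std::size_t numWorkers) {
  PendingInputs pending(std::move(names));
  std::vector<std::string> finished;
  std::mutex finishedLock;
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < numWorkers; ++i) {
    workers.emplace_back([&pending, &finished, &finishedLock] {
      std::string name;
      while (pending.claim(name)) {
        // The real per-file work would happen here, outside any lock.
        std::lock_guard<std::mutex> lock(finishedLock);
        finished.push_back(name);
      }
    });
  }
  for (std::thread& worker : workers) worker.join();
  return finished;
}
```

Because `claim` removes a name under the lock, no input is handed to two workers, even when there are more workers than inputs.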

Then, task three. This is the hashing part. At the moment, and this is what you've got already, your program only creates the little 000xx.mapped files. Now we need to do the splitting into all of those intermediate files that we saw a little bit earlier.

That's where you're going to have to update buildMapperCommand. This is the one where we have to add that other argument, and that's going to be the number of hash codes used by each mapper. How do you figure that out? By multiplying the number of mappers by the number of reducers. That's how you get that number.

You also have to update mrm.cc to accept another argument in its argv, which means you're going to have to update MapReduceMapper. So there's a kind of propagating set of changes: now that this has an updated argument, this one has to change too so it can pass the value along, and this one has to as well.

You'll see when you get to that part, but don't be too scared if you find you've been updating a lot of arguments. You'll have to do at least three in this case.

And the number of hash codes, as I said, is the number of mappers times the number of reducers, and that's more or less a check on whether you're doing the concurrency stuff correctly.

All right, and then finally you have to do the reduce part. As I said before, this is relatively open-ended, in that we haven't given you any code for it; we've just told you how the algorithm is supposed to work.

At this point you know: I now have 384 files that I need to combine back into the output files. Let me go back; I forgot to show you this earlier.

Let's do make filefree, and then let's run the original one. Nope, not that one... there it is. That's the one, and this is the one that's going to process all the files; this is the solution. There we go.

Let's look at files/output. There will be, in this case, 32 output files. Okay, 00001.output, or 00000.output: this is now in alphabetical order, with the actual total counts. This is what you're going to end up with in the end.

And if you really wanted to create one more file that has everything, you'd just put all of those together. Let's look at 00031.output. Oh no, across files this isn't in alphabetical order, so you'd have to do some more sorting in that case. But it's going to be 32 different files based on the words.

I guess there would have to be one more stage for you to do that.

I don't think we have to do that. So, how are we going to do this part? Well, you kind of know what you need to do: the reducers need to collate the collection of intermediate files whose keys (those are the words) have the same hash code, sort it, group the sorted collation by key, and then invoke the reducer to actually produce the output files. Then you can leave it in that final output file format.

Okay, here's where you can start to use the Python programs that we give you, if you don't want to rewrite the sorting and the collating part. This is where you can say: all right, I'll just use the files you've given us already.

They are located here, and you should have that absolute file name in your program, probably as a constant. So, for instance, the group-by-key program, the word-count reducer, and so on.

The question is, how do you run those Python programs? You could probably use subprocess, or something like that that we've done before. It turns out there is a function called system, which is really easy to use.

Let me show you how easy it is to use. Let's see, I should have this in my lecture map-reduce directory. There we go: the system example.

system takes an argument, which is a command that you would type at the command line, and it runs it. You don't need to parse it into argv; you don't need to do anything. It just goes, boom, and that's it.

It doesn't give you the answer back. It does whatever it does, but you can't gather the output back; you have to redirect it out to a file.

For instance, I'll give you the example here. I'm going to have to retype this a little bit. Remember this thing that we had before, where we had this whole Python pipeline? Well, you can run this exact command from your C++ program by using that system function. So it's pretty straightforward. If you ran this, there you go: output to, what did I call it, all-output.txt.

So you can run this for each individual file by using system, so you don't have to do the sorting yourself. In this case, you're not going to run all of this pipeline, just part of it, and it's a little easier than writing it in C++.

Feel free to write it in C++; that will probably be a little quicker, because it doesn't have to call out to all these programs outside of your program. But it's up to you: if you want to make it a little easier because we've already done something for you, go right ahead.
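
A small sketch of the `system` approach described above, using `std::system` from `<cstdlib>`. The helper name `runAndCapture` and the file name used in the usage note are made up. Since `std::system` only returns a status, the command's output is redirected to a file and read back afterward. This assumes a POSIX-style shell and that `outputPath` contains no spaces or shell metacharacters.

```cpp
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>

// Runs a shell command with its standard output redirected to outputPath,
// then returns whatever the command wrote there. Returns an empty string
// if the command reports a nonzero status.
std::string runAndCapture(const std::string& command, const std::string& outputPath) {
  const std::string full = command + " > " + outputPath;
  const int status = std::system(full.c_str());
  if (status != 0) return "";
  std::ifstream in(outputPath);
  std::ostringstream contents;
  contents << in.rdbuf();
  return contents.str();
}
```

For example, `runAndCapture("echo hello", "system-example-output.txt")` would leave the command's output in that file and also return it. A whole pipeline such as `cat input | sort` can be passed the same way, since the string goes to the shell unchanged.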

"An alphabetic sort?" Right, so this is what we've done with the Python program: the pipeline here calls sort down here. And you would do the same thing. You could use C++'s sort function, or you could read all the files in and sort them that way; you could do that as well.

No, you are going to sort the individual files, and you can do that by reading them in, sorting, and pushing them back out to a file. That's fine. Or you could use this, which does it for you. So it's up to you which one you want to do.

But remember what the intermediate files look like. Let's look at one: 00006.00029.mapped. It already looks like this, but that's not sorted; it's just the words as they came in the file. That's what you've got.

So you need to first sort them, and then you need to collate them, with all the ones, and then you need to add up all those ones. The Python programs do that for you; you might as well use them if you can. You have to figure out what the file names are going to be and so forth, but that's the basic idea.

So it's up to you if you want to do it either way. I'd probably suggest the Python version because it's easy to implement.

"Do you have to run it on all of them?" Yes, you do have to run it on all of them. You have to run it on all the individual files, and then, once you have all those files, collate them together. You'll see when you get there in the program, but that's the basic idea. Okay, all right, that's how you do that.
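
If you go the C++ route instead of the Python programs, the sort, collate, and add-up steps could look something like this. It is a sketch under the assumption that every line has the form `word 1`, as in the intermediate files shown above; `reduceCounts` is a made-up name and works on in-memory lines rather than files.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Takes lines of the form "word count" and returns (word, total) pairs,
// ordered by word.
std::vector<std::pair<std::string, int>> reduceCounts(std::vector<std::string> lines) {
  std::sort(lines.begin(), lines.end());  // lines for the same word become adjacent
  std::vector<std::pair<std::string, int>> totals;
  for (const std::string& line : lines) {
    std::istringstream in(line);
    std::string word;
    int count = 0;
    if (!(in >> word >> count)) continue;  // skip malformed lines
    if (!totals.empty() && totals.back().first == word) {
      totals.back().second += count;       // same key as the previous line: add up
    } else {
      totals.emplace_back(word, count);    // new key: start a new group
    }
  }
  return totals;
}
```

Sorting first is what makes the single pass work: once identical words sit next to each other, grouping only needs to compare each line with the previous group.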

There's one hint here that I wanted to point out; it may not make much sense right now.

Once the mappers' job is done and you've gotten to the point where you're going to do the reducing, you're going to have to reduce all of those individual files that have the same hash, right? All the ones that end, for instance, in .00001.mapped. I think I might have missed something in the file name there, but anyway: all the files that end, in this case, in .00001.mapped.

You need to get all those files and do the collating on them. So you're probably going to want some sort of pattern matching built into your program, so that you basically say: here are all the files; only use the ones that match this pattern.

It's not that hard to do; just keep in mind you're going to have to figure out logically how to write that part.
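
The simplest form of that pattern matching is a suffix check. The sketch below is one way to do it; the function names are invented, and the list of file names is assumed to have been gathered already (for example, by listing the intermediate directory).

```cpp
#include <string>
#include <vector>

// True when `name` ends with `suffix`.
bool endsWith(const std::string& name, const std::string& suffix) {
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Keeps only the names that end with bucketSuffix, e.g. ".00001.mapped".
std::vector<std::string> filesForBucket(const std::vector<std::string>& names,
                                        const std::string& bucketSuffix) {
  std::vector<std::string> matches;
  for (const std::string& name : names) {
    if (endsWith(name, bucketSuffix)) matches.push_back(name);
  }
  return matches;
}
```

Including the leading dot in the suffix matters: it keeps a file like `00001.mapped` from matching the bucket suffix `.00001.mapped`.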

Okay, you are totally allowed to have as many more intermediate files as you want. Just remember to delete them before you're finally done.

Create as many intermediate files as you want, and then, after you get everything into the final output, delete all the intermediate files except for the ones that we want you to leave. Just make sure you do that.

And again, I know at this point some of this doesn't make much sense yet, because you haven't actually dug into the assignment, but it will once you get there. Okay, all right.

What other questions do you have? That is the MapReduce program. Don't be scared of it. It's just like the beginning of proxy, when you thought, how am I going to get through all this? It's the same sort of idea. There are lots of steps, and we write out a lot of them for you; you'll get there one step at a time as you go through it.

Yes? "How does this handle computers going down?" Good question. The question is: how is this robust to a computer breaking and so forth?

This one isn't. But you can imagine, if you wanted to make this more robust... in fact, I'm not sure that we haven't written some of this in there. If one of the myths, let's say, doesn't return its values, you're going to have some timeout in the main program that's coordinating this, which says: I'd better task a different myth to do this, because this other one seems to be flaking out on me.

MapReduce itself doesn't fix the problem, but MapReduce makes it relatively easy to attack that problem, because now you've got lots of servers and you're sending data back and forth. All you need to do is basically say: okay, I didn't get anything back from that server; I'd better retarget that exact same job to some other server.

And that would be how you handle that robustness part of it. Okay.
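
The retargeting idea can be sketched in a few lines. This is purely illustrative and not part of the assignment: `runWithRetry` is a made-up function, and `runJob` is a stand-in for "send the job to this server and wait, with a timeout, for its reply".

```cpp
#include <functional>
#include <string>
#include <vector>

// Tries the same job on each server in turn until one reports success.
// `runJob` should return false when a server fails or does not answer in time.
// Returns the name of the server that completed the job, or "" if none did.
std::string runWithRetry(const std::vector<std::string>& servers,
                         const std::function<bool(const std::string&)>& runJob) {
  for (const std::string& server : servers) {
    if (runJob(server)) return server;  // success: stop retargeting
  }
  return "";
}
```

This works because a map or reduce job reads its input from files and writes its output to files, so rerunning the exact same job elsewhere produces the same result.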

All right, we'll end a few minutes early unless there are more questions.

I'm headed back to my office for office hours until about 4 p.m. It's 219 in Gates, and I will see you later.

P2:Lecture 2 File Systems - main - BV1ED4y1R7RJ

Hello, hello. Welcome back to day two of CS110. So we have a lot to cover. We have a lot to cover every day; that's one of the things about this class, there's lots and lots of material.

A couple of things. I put out the first assignment, which we're going to talk about today, so hopefully you'll get started on that if you haven't already. Please re-download the assignment if you downloaded it last night or early this morning. I made a couple of non-critical changes.

For instance, I found out that, for whatever reason, if you open the PDF, copy some of the commands from it, and then try to run them, it will look like your program is not working, but it's because there's some hidden character in the pasted text. That confused me for a while: I thought my program was working, but it didn't look like it. So re-download it; I think I fixed that.

That would be a really annoying bug, when your program is working fine and the only problem is that you copied a command with hidden characters in it. So please re-download that.

Let's see, Piazza is up and going, and there have been a few questions on Piazza. Piazza is probably your first line of defense as far as getting information about the assignments and so forth. I look at Piazza all the time, and the TAs look at Piazza all the time, so it's a good place to get started if you have a question.

Obviously, coming to office hours is great too, but lots of questions get answered on Piazza before they get answered in office hours, so you might as well check there. All right.

So on Monday we started out with umask, and the questions afterwards kind of indicated that I confused a bunch of you, so I apologize about that. I just wanted to talk about umask a little bit more, just to give you the overview of it.

By the way, this is not the most important part of the class; it's just what I happened to start with the other day, and I didn't mean to confuse anybody. I wanted to talk about permissions a little more, specifically as they relate to umask, so that I can unconfuse you and you can get your questions answered right now. Okay.

umask is all about allowing the user to control what the default permissions are for files. So it's not so much about the program trying to set various permissions. It's about the user saying: look, when a program creates a file for me, I don't want it to give read access to everybody in the world. It's up to the user to control that.

So if you go to your terminal, in fact, we'll just do that right now, and you type umask... hang on a sec. Let me try this again. Oh, that's not good. Hold on a sec. There we go. Okay.

If you type umask, it will tell you your default permission mask, right?

And in fact, remember 077; we'll go over the details in a minute。 The first 0 means it's an octal number。 The second 0, this 0 here, means that nothing is masked for the user (the owner), so whatever permissions the program tries to set for the owner, read, write, or execute, it will be able to set。 The 7 and the 7 mean that the group and any other people, not the owner, cannot have those permissions set。 Okay。 So let me give you an example。 All right。 I type touch test1.txt and then do ls -l test1.txt。 Okay。

Notice that it gave read and write permissions to the owner only。 When you touch a file it doesn't set the execute permissions anyway。 touch will attempt to set read and write for everybody, and the umask, because it's 077, says you can't do that for the group and others。 Okay。 But if I change the umask to 0, and now if I check it, right, it's 0, which means that if I try to set a particular permission for the owner, the group, or anybody else, it will now allow me to。 So if I do touch test2.txt, let's see if this is different: ls -l test2.txt。

Guess what, it set the read and write permissions for everybody。 Okay。 So it's all about you, the user, having control over what gets set, regardless of what the program tries to set。 In other words, the touch program said: set the read and write permissions for the owner, the group, and everybody else, and the original umask blocked the setting of the permissions for the group and anybody else。 Okay。 Does that make sense about what's going on now? Basically the umask is saying that you as the user get to control this。 If another program tries to set the permissions, it won't let it or it will let it, depending on what your umask is。 Yeah。 Question。

touch actually creates an empty file。 Yeah。 That's all。 Yeah。 The first one here? Oh, the first digit in the umask is there because it is an octal number。 I believe that's what it is, anyway: you put a zero before the three digits and it means it's an octal number in Unix talk。 It's really normal。 What other questions? Yeah。 Right。 So this is a good question。 The question was, wait a minute, I thought permissions were written with ones where the bits were on。 I'm going to go over that in a second。 umask is the reverse。 Why? I couldn't find out。 I did all sorts of searching, wondering why this would be the case。 Not exactly sure。 I'll show you the way you use it in a second。 Very good question。 Yeah。 So if all of the bits are allowed, why don't these files have the x bit? These other files were created for various other reasons; other programs created them。 And touch, for instance, by default doesn't ever try to set the executable bit。 So that's that。 Okay。

Let's go over some more details。 Any other big questions on this one? Again, this is not the most important part of this class, so I don't want to spend an hour going over the nuances of it, except to say that, remember, there are three parts to the permissions。 There's the owner in red。 Sorry for colorblind people, I apologize。 There's owner in red, and then I put green for the group and then blue for the other, which is anybody else who is not in your group and not you。 And that's what it is。

And by the way, the permissions are just bits, right。 So if you want permissions rw-rw-rw-, that would be 110 110 110, and 110 happens to be octal 6, and then octal 6 and octal 6。 Okay, so 0666 would be the permissions for rw-rw-rw-。 Okay。 Now the mask does the reverse, and again I'm not exactly sure why。

I'm also not sure, oops, why there's this little black box on the slide。 Hang on, let me see if I can get rid of that for a sec。 That is kind of weird。 Hang on, let me try this。 Nope。 Well, okay, it's just going to stay there for a while。 Yeah, I'm not sure why。 Well, anyway。

The umask is actually applied as follows。 Okay, so say the umask is 077, in other words octal 077。 It basically says you're not allowed to set any permissions for the group or the other。 That's the bottom line for it。 And what it does, and it kind of stinks that we've got this weird black box in here, is take whatever permissions you're attempting to set and bitwise AND them with the inverse of the umask, and that's how it gets the actual permissions that are set。 Okay, so that's how it goes, and I did a little example here: if you're trying to set rw-rw-rw- and you have a umask that's zeros and then all ones, then it takes the inverse of the mask, ANDs it against the request, and you get the permissions out like this。 That's all there is to it。 What other questions do you have on this stuff? Yeah。 What is it? Yeah, so let me show you。

That's a good question about what an actual group is here, and again, I wish I could get rid of this little thing。 I don't really know why it's even there, but oh, aha, I moved it。 Okay, it's gone。 All right。 So if you type groups, these are all the groups。 Hang on, I guess it's just group。 Is it group? No, I thought it was groups。 Anyway, you can find out what groups you're in, and if you look at particular files, the second name over here is the group that the file belongs to。 So each file has a user who owns it and then a group, which many users may belong to。 I'm not sure what the operator one in particular is, but that's what it is。

What other questions on this? Any more hands? Okay, so umask is not that important。 It's just kind of a nuanced thing I wanted to show you。 All right。 Okay, so here's just another example about this。 Basically, if you have the program that we created the other day, which tries to create a file with certain permissions, in this case 0644, and your umask is 0, it will let you set all those permissions。 If you change your umask, the same program inherits that umask and applies it, and will only allow you to set the permissions that the mask permits。 Now, a particular program can modify its own umask, and if it does, then it will be able to set those permissions。 But it's all about defaults, and as long as you know what the program is attempting to do, you as the user can control it。 That's the bottom line there。 Okay。 All right。 Unix file systems are interesting。
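The rule described above (the requested permissions ANDed with the inverse of the umask) can be sketched in a few lines of C++. The helper names `apply_umask` and `to_rwx` are made up for illustration; in practice the kernel applies the mask inside calls such as open, not in user code.

```cpp
#include <string>
#include <sys/types.h>  // mode_t

// Effective permissions = requested bits with every umask bit cleared.
// (Hypothetical helper; it only mirrors the arithmetic the kernel performs.)
mode_t apply_umask(mode_t requested, mode_t mask) {
    return requested & ~mask & 0777;
}

// Render the low nine permission bits in `ls -l` style, e.g. 0644 -> "rw-r--r--".
std::string to_rwx(mode_t mode) {
    const char symbols[] = {'r', 'w', 'x'};
    std::string out;
    for (int bit = 8; bit >= 0; --bit) {
        bool on = (mode >> bit) & 1;
        out += on ? symbols[(8 - bit) % 3] : '-';
    }
    return out;
}
```

With a umask of 077, a request for 0666 (what touch asks for) comes out as 0600, and the 0644 request from the example program also comes out as 0600. With a umask of 0, both requests go through unchanged.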

Okay。 Assignment one。 How many people actually looked at it already? Oh, half of you。 Okay。 The first assignment: well, first of all, we haven't covered enough stuff for you to do new CS110 stuff for the first assignment。 What we wanted to do is give you a refresher on CS106B and CS107, plus a little bit of, hey, you've got to go learn some C++ things you didn't learn in 106B。 Okay。 So that's what this assignment is all about。 All right。 The assignment is meant to get you up to speed on all of the coding that you need to be able to do for this class。 Okay。 So some of you have already emailed me saying, oh my gosh, I haven't taken 107 in like two and a half years, what am I going to do? Well, this assignment is going to get you back up to speed。 All right。 The assignment itself is kind of a fun assignment。 It's basically called the six degrees of Kevin Bacon。 And why is it Kevin Bacon? Well, Kevin Bacon happens to have been in a ton of movies。

And so if you try to link Kevin Bacon to another actor, it's very hard to find one who is more than one or two movies away。 Kevin Bacon was in a movie with a bunch of actors, they were in a bunch of different movies, and then some other actor happened to be in one movie with one of those actors。 And you can link them together with, like, one movie in between。 Okay。 And so that's how the program works。 You run it by saying ./search, which is one of the files you will be working on, and then you type two names in; in this case it's Meryl Streep and Jack Nicholson。 And by the way, because many people can have the same name, many actors are in the IMDb (Internet Movie Database) system with a little Roman numeral in parentheses next to their name。 That just means that there are two or three or more Jack Nicholsons, and you have to type that in。 So you have to be a little bit careful。 Madonna, for instance, is another one。 If you're testing your code, before you test a particular name that you haven't tested yet, go on IMDb and look them up。 And if they have a little Roman numeral in parentheses like that next to their name, you have to type that in。 Otherwise it will say Madonna is not in the files, and you'll think your program is broken when really you just typed the name in wrong。 Right。 Okay。 So that's how it works。 You can type ./search with Meryl Streep and Jack Nicholson, and it works fine。 Let's actually do this。 Oh, no, here we go again。 Hang on。 There we go。 All right。

Let's see。 We will go to assignment one。 And you can test this out yourself by running the one in samples, samples/search; it's got the search solution。 And if you type Meryl Streep, there might be more than one Meryl Streep, I don't know, and then you type Jack Nicholson (I), right, it should say that Meryl Streep was in Close Up with Jack Nicholson。 Okay。 Let's try some others。 Try to fool the system。 Give me two names of actors we might have heard of。 Jerry Cain? Jerry Cain, who taught this class before。 I think there might be more than one, actually。 And then let's try Meryl Streep。 Why not? What did I do? Forgot the quotes here。 Thank you。 There we go。 Okay。 So Jerry Cain (I), not the lecturer who's here, was in a movie called No Rules with somebody named Don Fry。 Don Fry was in The Ant Bully with Meryl Streep, right? It's actually really hard to find a pair that needs more than one or two different movies。

Give me some other names of people that we've heard of。 Sorry? Michael Jordan。 Michael Jordan was in what? He was in Space Jam。 Okay。 There's probably more than one, but let's try this。 Jordan, and then anybody else? Keanu。 Did I spell it right? I think so。 Okay。 It couldn't find Michael Jordan, because there's probably more than one。 So let's just see if it's the first one。 I don't know。 There we go。 Michael Jordan's in Blink, who knew。 I don't know how to pronounce that name。 And that other person was in Chain Reaction。 So it's hard, right? You can go back a long way。 So let's see, Michael Jordan and, let's see。 How about Charlie Chaplin?

All right。 That was a while ago, right? I haven't tried that one yet。 That's interesting。 We'll let it go for another second or so。 Hmm, I'm not sure what's going on with that。 Let me try a different one。 I should try these things out beforehand。 But let's just try Meryl Streep。 And, yeah, that's interesting。 I wonder if Charlie Chaplin's not, well, it shouldn't, hmm。 I don't know。 That's weird。 How about Ronald Reagan? I think it's the file system, actually。 Hang on。 Let's go try again and see。 I don't know。 That's weird。 Well, anyway, you get the idea, even though it's a little crazy。 But you can also run imdbtest, which you're going to do first, to just check a single actor。 Let's just check。 Oh, you know, is it Charles? I think it might be Charles。 Ah, there we go。 Maybe that was the issue before。 Okay, hold on。 I don't know why it didn't tell us Charles earlier。 There we go。 Okay。 So here's one problem with the IMDb database。 Charlie Chaplin wasn't alive in 2006; he died in the 1970s。 But whenever somebody appears in clips in a movie, they put him in here too, so it's a little bit trickier。 Anyway, that's it。 But you can run imdbtest, and if we do Charles Chaplin, then it will tell you that he was in a huge number of movies, and it will list a bunch of them, not all of them。

And you can test your program that way。 Okay。 So that's the basic idea of how this works。 Okay。 Let's talk about a few more details。 There are two big files, big being hundreds of thousands of names: one is called the actor file and one is called the movie file。 And they are set up such that you can do binary searching on them。 Okay。 The actor file has a whole bunch of offsets to where various actors are located in the file, and at each of those locations is the actor's name and all the movies that the actor was in。 The movie file is kind of the reverse: it's got a whole bunch of offsets into the file that point to movies, and the movie records say which actors were in them。 Okay。 These files are big enough that you do not want to read them in all at once, number one。 But we've actually put a lot of that under the hood for you。 You don't need to worry about it; just know that you're not reading the whole file, you're jumping to a place in the file and reading a little bit。 But the important part about these files is that you can do a binary search on them。 They are set up in alphabetical order so that you can do a binary search。 Binary searches, as you know from CS106B, are very fast because they repeatedly cut the range in half, divide and conquer, etc。

So you have to figure out how to index correctly into this weird file that's got all these offsets in it, with a whole bunch of different nuances to it。 And that's the CS107 part。 Okay。 That's the part where you're going to be going, oh boy, I've got to remember how to do pointer arithmetic and indexing into things, and watch for off-by-one errors and all that kind of stuff。 Okay。 Then there's a C++ part, where you're going to use the Standard Template Library, which is similar to but different from the Stanford library which you used in 106B or 106A。 Specifically, you're going to have to use this function called lower_bound, which is a Standard Template Library function that does the binary search for you。 In fact, there was a question on Piazza already that said, hey, can we use recursion to do this searching? And I said, whoa, no recursion is necessary for searching, because lower_bound does the binary search for you。 Okay。 The idea is that you set up the lower_bound call and it searches through your data, but you have to set it up correctly。 Okay。 That's the kind of 106B-plus sort of stuff that you're going to be working on。
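As a minimal sketch of what calling `std::lower_bound` looks like, here is a lookup over an ordinary sorted vector of names. This is not the assignment's file format; the function name `find_sorted` and the use of a `std::vector<std::string>` are assumptions made only to show the call and the iterator it returns.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the index of `name` in the sorted vector, or -1 if it is absent.
// (Illustrative helper; the real assignment searches offsets in a data file.)
int find_sorted(const std::vector<std::string>& names, const std::string& name) {
    // lower_bound does the binary search and returns an iterator to the
    // first element that is not less than `name`.
    auto it = std::lower_bound(names.begin(), names.end(), name);
    if (it == names.end() || *it != name) return -1;  // past the end, or a different name
    return static_cast<int>(it - names.begin());      // iterator arithmetic gives the position
}
```

Note that `lower_bound` never reports "not found" by itself. It always returns a position, so the caller has to check that the element at that position really is the one it asked for.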

Okay。 This lower_bound function is a little bit interesting。 It returns what we call an iterator, and an iterator, for our purposes at least for this assignment, behaves like a pointer, because an iterator allows you to add to it and to go to the next one in the line of whatever you're iterating through。 That's what it does。 You will be able to learn those sorts of things; that is what this week is all about: oh, I don't know what an iterator is yet, let me learn what it is。 Okay。 Once you have figured all that out, well, you're able to search through this database。 Now you actually have to link actor one to actor two。 Well, how are you going to do that?

You're going to do a breadth-first search。 Ah, more 106B stuff。 Right。 You're going to have to remember how to do a breadth-first search。 I'll give you a hint: you probably want to use a queue, or in our case a list, which works as a queue by another name。 So those are the big things about the assignment。 Okay。 Now, let's talk a little bit more about this lower_bound for a couple of minutes。 The assignment itself says: I am requiring that you use the STL lower_bound algorithm to perform binary searches, and that you use C++ lambdas, also known as anonymous functions, with capture clauses to provide nameless comparison functions。 And you're thinking to yourself, I have never seen that before, right? We haven't even learned it yet。 Well, I'm going to talk to you a little bit about that right now。 A C++ lambda is probably a new concept, although if you've done any JavaScript you have certainly used these before, even if you might not have known what they were called。 What it is, is a function that is written inline as an argument to another function, which expects that parameter itself to be a function。 So it's kind of Inception, right?

You did all this in 107 with function pointers。 Okay。 So if you remember 107, you talked about function pointers。 A lambda is like a function pointer that's written inline, instead of pointing at a separate function that you set up。 Okay。 So for instance, you remember the qsort function from 107, right? The last parameter in here is a comparison function that you have to define and pass in to qsort。 Okay。 qsort has no idea how that function works。 It just knows: I have two pointers that I'm going to pass to this function, and I'm going to get back a zero, a negative number, or a positive number to say whether the first is the same, smaller, or bigger。 That's all it knows, right?
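For anyone rusty on the 107 material, here is a small sketch of the qsort pattern just described, written so it compiles as C++. The names `compare_ints` and `sort_ints` are made up for this example.

```cpp
#include <cstdlib>

// Comparison callback in the shape qsort expects: it receives two untyped
// pointers and returns a negative number, zero, or a positive number.
int compare_ints(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

// Sorts `count` ints in place. qsort knows nothing about ints; it just calls
// compare_ints whenever it needs to know how two elements are ordered.
void sort_ints(int* values, std::size_t count) {
    std::qsort(values, count, sizeof(int), compare_ints);
}
```

Passing `compare_ints` without parentheses hands qsort the function itself; nothing is called until qsort decides to compare two elements.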

And that's what function pointers are all about。 You give this other function the function pointer that says, hey, I'm going to do something for you, use it。 That's all it is。 Okay。 Let's look at a little program about this。 I just made up some dumb program to do the following。 Okay。 In this program I created two, well, let's start with this one: I created a function called modifyVec。 We're now in C++ land, so it takes in a C++ vector by reference。 Okay。 It takes in a value, and it takes in this, which is C++ for a function pointer; you can think of it that way。 What it says is, give me a function that returns an integer and takes two integers as parameters。 Okay。 And what it's going to do is loop through the vector and pass in each vector element, which it gets by reference, along with the value, and then apply that operation and store the result back in the element。 That's all it's doing。 It's basically updating the vector with new values based on some operation。 Okay。

Well, what are the two operations I did? I happened to make one called add, which returns x plus y when you pass them in。 Pretty simple。 And then subtract, which does x minus y。 That's all it does。 Okay。 Well, what are we actually doing in main? We are getting some stuff out of the command line, then we are creating a vector, and then we are calling modifyVec。 I'm saying modify the vector: pass in the vector, pass in the value that I typed on the command line to do the operation with, and then the function itself。 Okay。 A key part of this is that when you call modifyVec, even though add is a function itself, it is not getting called immediately。 That's a key part of this。 When you call modifyVec with parameter one, parameter two, parameter three, that third parameter is a function that is not called until modifyVec actually calls it directly。 That's how that's working。 Okay。
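The slide itself is not in the transcript, so the following is only a reconstruction of the program as described: a vector-updating function that takes the vector by reference, a value, and an operation. Using `std::function<int(int, int)>` for the "C++ for function pointer" parameter is an assumption, as are the exact names.

```cpp
#include <functional>
#include <vector>

int add(int x, int y) { return x + y; }
int subtract(int x, int y) { return x - y; }

// `op` is any callable that takes two ints and returns an int.
void modifyVec(std::vector<int>& vec, int val, std::function<int(int, int)> op) {
    for (int& elem : vec) {
        elem = op(elem, val);  // the operation is only called here, inside the loop
    }
}
```

A call such as `modifyVec(vec, 12, add)` passes `add` as a value. Writing `add` without parentheses does not run it; `modifyVec` runs it once per element.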

And you can test it out by running something like the function pointer program with add 12。 What it will do is take 1, 2, 3, 4, 5, 10, 100, 1000, add 12 to each one of those, and print them out。 Okay。 This should look very familiar from CS107。 If you're a little rusty on that, take a look at the example, figure out how it works, and ask questions about it if you've got them。 Okay。 But the important part is that here is where you're passing in a function pointer。 Okay。 Now let's rewrite this using this weird thing called a lambda function。

Okay。 Everything is the same except for this, from right here to right here, and then the same thing on the next line。 What's happening here is we are saying, okay, great, pass in the vector, pass in that value, and then pass in this weird thing which actually has code inside the function argument itself。 Seems kind of weird。 But it's just inline code that's going to return x plus y for parameters x and y。 Okay。 Exactly the same thing, except notice I do not have an add or subtract function up here。 I've just done the exact same thing inline。 That's what a lambda allows you to do。 Okay。 A lambda has the following form。 It's a little hard to see here, but it has two square brackets, and something can go in there; we'll talk about that in a minute。 It has a parameter list, and then it has some code that's in curly braces。 Okay。 That is a function that gets passed into the other function as a parameter itself。 It does not have a name associated with it。 It doesn't need one。 It's right inline。 Okay。 And this is exactly what will happen when you run it。
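Here is a sketch of the same idea with lambdas: square brackets, a parameter list, and a body in curly braces, passed directly as the argument. The wrapper functions `add_to_all` and `subtract_from_all` are hypothetical stand-ins for the two calls made in the lecture's main.

```cpp
#include <functional>
#include <vector>

void modifyVec(std::vector<int>& vec, int val, std::function<int(int, int)> op) {
    for (int& elem : vec) elem = op(elem, val);
}

// No named add/subtract functions: each operation is written inline as a lambda.
void add_to_all(std::vector<int>& vec, int val) {
    modifyVec(vec, val, [](int x, int y) { return x + y; });
}

void subtract_from_all(std::vector<int>& vec, int val) {
    modifyVec(vec, val, [](int x, int y) { return x - y; });
}
```

The empty `[]` means the lambda uses nothing from the surrounding scope, so at this point it is doing exactly what a plain function would do.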

It's the exact same program now。 Okay。 All right。 We'll talk about what this weird thing in the square brackets is for in a second。 Ponder that for a second。 What questions do you have about it so far? You may not have had time to think it through yet, but what questions do you have so far? Yeah。 Good question。 The question was, what's the advantage of this? Is the advantage that you can write it inline? That is one advantage。 Normally these functions are short。 They're right there; you can look at them right there。 That's great。 The big advantage, which we'll come to, is the fact that you can do things with a lambda that you can't do with function pointers, and we'll get to that in a minute。 But for now I just want to introduce it and say, hey, this is exactly the same as before。 Next we'll go into what is new。

What other questions do you have about that so far? Okay。 If you are thinking, oh, I haven't seen this before, take a look at it again and we'll get there。 Now, what if we want to use variables that we don't want to explicitly pass into the function? And it's not necessarily just a variable; we could actually have a function call inside our lambda function。 It might use something local to the original calling function。 Let me show you what I mean。 Okay。 This is what we originally had。 Right。 We had a function here which took in two values。 Okay。 One of those values is coming from the vector。 The other one we pass into the function; that's like the 12。 It would add 12 to each one of the elements in the vector。

Okay。 But what if we wanted to change it to say, I only want my function here to deal with the actual vector element itself; I don't want to pass another value through using these function parameters。 Okay。 This, as it turns out, would be very hard to do; in fact, I don't know how I would do it with a regular old function pointer。 I'll show you the example in a second。 In this case we don't want modifyVec to handle the value that we are updating by。 In other words, we want the function that calls modifyVec to handle that before it even gets to modifyVec。 We want to pre-line-up what we've got there。 That would be really hard to do with a function pointer。

With a lambda function, it is actually possible。 Okay。 Here's how you might do it。 Here's what I've done differently now。 I've changed this up here, so basically the function parameter now simply says, give me a vector element and I will do my operation with some other value on it。 And our modifyVec function is really easy。 All it needs to do is get an element out of the vector and pass it on to this operation function, and it's done。 The operation will do the rest of it for us。 But it still does the add。 Well, down here, and this is now the important part, I have said: pass into modifyVec the vector we're trying to change, and then the following lambda function。 It says, only take one parameter x, but use the variable called val from the scope of the calling function inside the lambda。 Okay。 How is it going to do that? It's going to say return x plus that value。 In other words, I am not sending val as a parameter to modifyVec anymore。 Okay。

So that's what's happening here。 The way this works is through this idea of captures。 Okay。 A capture says, put inside those square brackets the local variables, or variables in scope anyway, that you want to make available to the lambda function, which will get used in some other function。 Okay。 In this case, I'm just capturing val, then we're saying int x is the parameter, and then in the body we're saying return x + val, semicolon, inside curly braces like that。 That's what we're doing inside that function, and it allows you to do that。 Now, you might still say, well, it doesn't seem like it's that much different。 It turns out that, number one, it would be almost impossible to do this with a regular function pointer。 And number two, it's even more difficult if you are in a class method trying to call a non-class function。 Okay。 It's just really hard to do, especially if that non-class function expects something already handled for you that isn't part of its parameter list。 Okay。 So that's a lot to think about right now。 By the way, you can capture multiple variables by putting them in a list, like val1, val2, etc。 You can also capture them by reference, which unfortunately looks just like taking an address, but it's not。 If you wrote ampersand val (&val), it would capture val by reference, so you'd actually be changing the original one。 You would do this with big data structures and so forth。 Now, the reason I'm bringing all this up and showing you how to do this is that you do have to do it for this assignment, in a particular place, using that lower_bound function。 Go back to the notes here and ask, oh, how is this stuff actually working? Understand these, and then you'll be able to figure out how to do the one in the assignment。 Yeah。 [inaudible question] Yeah。
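To show why a capture matters with `std::lower_bound`, here is a sketch on a made-up data layout: names stored back to back in one character buffer, plus a sorted list of offsets into it. This is not the assignment's actual file format, and `find_offset`, `pool`, and `offsets` are hypothetical names. The point is only that the comparison needs the buffer, but `lower_bound` passes the comparator nothing except an element and the search value.

```cpp
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical layout: `pool` holds NUL-terminated names back to back, and
// `offsets` lists where each name starts, ordered so the names are alphabetical.
// Returns the offset of `target`, or -1 if it is not present.
int find_offset(const char* pool, const std::vector<int>& offsets, const std::string& target) {
    auto it = std::lower_bound(offsets.begin(), offsets.end(), target,
        // lower_bound only supplies (element, value); `pool` arrives via the capture.
        [pool](int offset, const std::string& name) {
            return std::strcmp(pool + offset, name.c_str()) < 0;
        });
    if (it == offsets.end() || target != pool + *it) return -1;
    return *it;
}
```

Without the `[pool]` capture the comparator would have no way to turn an offset into a name, because the signature `lower_bound` expects leaves no room for an extra argument.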

So let's think about what we're trying to do。 Suppose we have a function that says, okay, you are going to give me a function pointer, or a lambda function, but that function only takes one value, and what it does is contingent on something else in the original calling function。 In other words, this val variable, right? If I can't pass that in to the function up here, how would modifyVec use it, unless I was able to pass in another variable? Obviously, here I could have done it by passing that other variable in, but they don't do it that way for things like lower_bound。 They say, here's the function signature: you get one function to pass in, with a fixed set of parameters, and you don't get any other parameters。 And you go, well, wait a minute, I need another parameter for this。 How am I going to pass the other one? How am I going to deal with the other one? You do it through the lambda function。 Okay。 You say, oh, I'll handle it locally: capture that variable I'm going to need, and tell the function what to do with that variable。

And then when that whole function gets passed into the other function, it all gets handled as a black box by the function you're calling。 So these things are subtle, but you'll see later, in fact, when you start to write the assignment, you'll get to this point and think, I guess I could use a regular function pointer, and then you'll go, oh, I can't, this won't work that way。 That's where it's going to come into play。 What other questions do you have on this? No? Okay。 Look at this stuff again。 There's lots of documentation on it, and certainly feel free to come to office hours to look at it as well。 Okay。 All right。 What other comments do we have on here? They are critical for C++ classes。 As I said, with class member variables, you can't list them individually in the capture, by reference or otherwise, no matter how you do it。 So that's where it's necessary to capture through the object。 You can get at all of the member variables of a class by capturing the this pointer。 Okay。 You can also, as I said, list captures one after the other。 There are some nuances about how you need to set these up, but if you wanted to capture this, and val, and myVec by reference, that's how you would do it。 Okay。
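As a sketch of a capture list that mixes all three forms mentioned here, the class below captures `this`, a value, and a vector by reference. The class `Scaler` and its members are invented for this example and are not from the lecture's slides.

```cpp
#include <algorithm>
#include <vector>

class Scaler {
public:
    explicit Scaler(int factor) : factor_(factor) {}

    // Appends (x * factor_ + val) to `myVec` for each x in `input`.
    void appendScaled(const std::vector<int>& input, int val, std::vector<int>& myVec) const {
        // `this` gives the lambda access to factor_, `val` is copied,
        // and `&myVec` lets the lambda modify the caller's vector.
        std::for_each(input.begin(), input.end(), [this, val, &myVec](int x) {
            myVec.push_back(x * factor_ + val);
        });
    }

private:
    int factor_;
};
```

Writing `[factor_]` would not compile, since a member variable is not a local variable in scope; capturing `this` is what makes `factor_` usable inside the lambda.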

You'll see when you get into this, and by the end of the quarter you'll go, I get these things。 But the first couple of times you see it, you go, I don't know what's going on。 Go back to the slides, ask questions on Piazza and in office hours, and we will figure it out。 Anything else on that? Okay。 All right。 Let us go back to where we ended up yesterday。 We're going to talk about file systems some more。 Okay。 Yesterday I ended with this example, which basically was re-implementing the cp command, which we called copy。 And it was a pretty simple program, all things considered, that basically says you have file one and file two, where file two is what you're creating, and you want to copy file one into file two as an exact copy。 Okay。 And what we did was we said, okay, fine, we're going to set up a file descriptor, which is just an integer, and we're going to use the open call to do it。 And we're going to say it's read-only (O_RDONLY); that's all I'm doing, reading from it。 We're going to do the exact same thing to create a file, using another file descriptor, using open with O_WRONLY, O_CREAT, and O_EXCL。 We'll see another flag in a couple of slides as well。 But this says, create the file if it doesn't exist already; otherwise, produce an error。 And then attempt to set the following permissions, and that may or may not work with those exact permissions, based on the umask。 Again, not super important。

Okay。 And then what do we do? We set up a little buffer and this buffer can be any length we want。

We just happened to make it 1,024 bytes because we want to save memory, let's say, but you could make it

bigger if you wanted to。 And then we have a while loop, and there were some questions about the while

loop here。 This is to read 1,024 bytes from the file, one chunk after another:

1 kilobyte, and another kilobyte, and another kilobyte。

And each time, take that kilobyte and put it into the output file。

That's what the first while loop is doing。 Okay。 How does it work? Well, it reads data。

It reads from the input file into the buffer, up to the size of the buffer, in this case 1,024。 You do not need

to worry about whether there might be a null character at the end。 These are

not necessarily strings。 They're just data。 So you can read all 1,024 bytes' worth。

Okay。 If you end up getting zero bytes back, it means there were no more bytes to read and you can end。

Okay。 If you get any number of bytes other than zero, well, they came out of the file and you then。

need to write them to the other file。 How do you do that? Well, you start out and you say, okay。

I've got the number of bytes written, which is zero。 I haven't written any yet。

And then I'm going to do, another while loop, which is going to attempt to write all of those bytes。

That's right here。 It's going to be right here。 It's going to try to write all of those bytes into the file at once。

Okay。 Where it's going to do it into the output file, it's going to index so many bytes into the。

buffer。 And the first time through this is going to be zero, of course, it's going to start and。

try to do all 1,024。 It turns out that write may only write some of those bytes。 Why? Well,

that's what the operating system can do。 The operating system can say, you want to write

1,024 bytes? I'm going to only allow you to write one byte。 Sorry。 You have to try again later。

That's what this while loop is all about。 It keeps going until you have successfully written

all those bytes。 Okay。 It's probably not going to fall short for 1,024 bytes in a local file。 But if you try to

do a million bytes at once, it might say, oh, you can have half a million, and then the next time

you have to do another half a million。 Okay。 All right。 And by the way, the read command, as I

mentioned yesterday, does block until at least some of those bytes are read。

If there are any bytes to be read, it will wait until some are available。
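The copy loop just described can be sketched roughly as follows。 This is not the lecture's exact code: the function name copy_file, the 0644 permissions, and the added error checks are assumptions made for illustration。

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

// Copies src into a brand-new file dst. Returns 0 on success, -1 on error.
// copy_file is a hypothetical name chosen for this sketch.
int copy_file(const char *src, const char *dst) {
  int fdin = open(src, O_RDONLY);
  if (fdin == -1) return -1;
  // O_EXCL makes open fail if dst already exists; 0644 is still subject to the umask.
  int fdout = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0644);
  if (fdout == -1) {
    close(fdin);
    return -1;
  }

  char buffer[1024];
  int status = 0;
  while (status == 0) {
    ssize_t numRead = read(fdin, buffer, sizeof(buffer));
    if (numRead == 0) break;  // zero bytes back means end of file
    if (numRead == -1) {
      status = -1;
      break;
    }
    size_t numWritten = 0;
    while (numWritten < (size_t)numRead) {
      // write may accept only part of what we hand it, so keep going
      ssize_t n = write(fdout, buffer + numWritten, (size_t)numRead - numWritten);
      if (n == -1) {
        status = -1;
        break;
      }
      numWritten += (size_t)n;
    }
  }
  close(fdin);
  close(fdout);
  return status;
}
```

Calling copy_file a second time with the same destination returns -1, because O_EXCL refuses to reuse an existing file。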

What questions you have on this program so far? Anything? Nope。 Okay。 So。

the interesting part about that is that this is direct and low level。

Okay。 When I say low level, all the other things that you might have used to read and write to files

use read and write under the hood。 Okay。 So, if you're using file pointers, which you

might use in C, or iostreams in C++, they are themselves using read and write to do the work。 Okay。

They have some other nice benefits to them。 They can do buffering。 You can go backwards

and forwards using a FILE * or an iostream。 You can rewind, etc。 You don't get that for free with read

and write。 You have to manually figure out those details yourself, the way a library function would。 So

read and write are going to be fast, but they're really not that developed。 Right? You

have to go through that while loop to write out。 Otherwise, you may miss writing some of the bytes。

Okay。 Those are the big things about read and write。 Okay。 What other questions? Anything?
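For contrast, here is a small sketch of the convenience the FILE * layer adds on top of read and write。 The function name first_line_twice is hypothetical; it reads the first line, rewinds, and reads it again, with the library tracking the file position and buffering for us。

```c
#include <stdio.h>
#include <string.h>

// Reads the first line of path twice through a FILE *, calling rewind in between.
// Returns 1 if both reads saw the same text, 0 if they differ, -1 on any error.
int first_line_twice(const char *path) {
  FILE *fp = fopen(path, "r");
  if (fp == NULL) return -1;
  char first[256];
  char again[256];
  int same = -1;
  if (fgets(first, sizeof(first), fp) != NULL) {
    rewind(fp);  // jump back to the start of the file
    if (fgets(again, sizeof(again), fp) != NULL) {
      same = (strcmp(first, again) == 0) ? 1 : 0;
    }
  }
  fclose(fp);
  return same;
}
```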

All right。 So, let's move on to another program called tee。 Okay。 Now, tee is a program

that you have built into Linux。 Okay。 And tee works like this。

It says,

okay, take input typed at the terminal or piped in from another program, and then

print it to the output and also write it into any number of files。 Okay。 So, for instance,

I can say cat, let's say, t.c, which is a file I have。 cat t.c。 No, I don't have t.c。 Hang on。 Maybe

not there。 Maybe if I went to the right directory; let's see, lecture and then file system。 There we go。

If I type cat t.c, it will print it out to the screen。 If I type cat t.c and then pipe it into

the tee program, what it will do is say, okay, fine, I'm going to print to the

screen and I'm going to copy it into t2.c and t3.c and t4.c and whatever。 And it will do that。 So

it should print out to the screen, which it did。 And then if we look at t2.c, it's also there。 And

it's the same thing for t3.c and t4.c。 Okay。 Do you get the idea of what tee is doing? Yeah。

Is it creating those new? It's creating those。 What if they already exist? If they already exist,

let's try。 Let's make a file called, I don't know, d.txt, containing abcd。 And let's try the

exact same thing, except now we're going to cat d.txt into tee with t2.c, t3.c, and t4.c。 Now look at t2.c。

Yep。 It overwrites it。 Yep。 So it doesn't care。 In that case, it overwrites it。 Okay。

So do you get what's going on here? Take a file, or standard input, and do that。 By the way,

standard input is normally you typing。 That's what standard input is。 If we did the

same thing, and instead of cat d.txt I just said tee t2.c t3.c t4.c,

and then I start typing "This is some text," well, it doubles it。 What did it do?

It took "This is some text," printed it to the screen, and wrote it into the files。

And then "This is some more text。" It does the same thing until I type Control-D, which says end of file。

That's what that means, by the way。 And then if we look at t2.c now,

it's "This is some text。 This is some more text。" Okay。

So standard input is you typing, or getting information from the output of another program。

That's what the little pipe symbol does。 Okay。 You will be very familiar with that pipe

symbol by the end of this class。 I guarantee it。 Okay。 All right。 Let's look at t.c。 Let's actually

look at how we might do this。 Okay。 Well, what do I have to do? We said take standard input。 We。

have to figure out how to do that。 We're going to print the standard input to standard output。 We。

have to figure out how to do that。 And then we have to open up as many files as we type on the command。

line and print the data to those as well。 We might want a helper function

for this, just because it seems like there's going to be a lot of things going on here。 Okay。

But let's actually start this off and say, okay, fine: int fds[argc], an array of file descriptors。

If we have a command line, you remember that the command line is argv and argc。

argc says how many different things you've typed on the command line。

The first one is the program name and the rest are the things you typed after the program name。

Okay。 We need argc file descriptors。 Why? Let's see this。

If I type t abc.txt def.txt, we've got a file descriptor that's going to be for standard output,

we've got a file for abc.txt, and we've got a file for def.txt。 Three things on the command

line means argc is going to be three。 That's how many file descriptors we need, as it turns out。

Okay。 So let's go back up here and do that。 So we do that。 Okay。 Well, then we need to actually

create those files, except for standard out in this case。 Okay。

Because standard out is going to be created for you already。 It turns out it already exists。 Here's

how that works。 Okay。 fds[0] in this case equals STDOUT_FILENO。 That's what it's called。

Standard out and standard in already exist。 There's another one too。 There's

STDIN_FILENO for standard in, and there's also STDERR_FILENO for standard error。

That's because sometimes you want your program to print regular output and also

output that's an error or debugging or whatever。 That's just the way it goes。

There are three different file descriptors that are created for you as we go。 Okay。
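As a small illustration of those three pre-opened descriptors, the sketch below routes a message either to standard output or to standard error。 The helper name report is an assumption made for this example, not something from the lecture。

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Writes msg to standard error when is_error is nonzero, otherwise to standard output.
// Returns whatever write returns: the number of bytes written, or -1 on error.
ssize_t report(const char *msg, int is_error) {
  int fd = is_error ? STDERR_FILENO : STDOUT_FILENO;
  return write(fd, msg, strlen(msg));
}
```

On POSIX systems STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO are the integers 0, 1, and 2, which is why no call to open is needed before using them。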

So that's how that works。 All right。 And then we have to open all those other files。

So let's do it this way: for size_t i equals one, because we've already created fds[0],

i is less than argc, i plus plus。 Okay。 And what we're going to do here

is say fds[i] equals open of argv[i]。 And then, okay,

we're going to write output to these, so O_WRONLY。 Okay。 And

we're going to bitwise OR that with O_CREAT, because we're creating the file。 And then

there's one other flag that we haven't seen yet called O_TRUNC。 This is the one that says if the file

exists, wipe it out before you put the other stuff into it。

That's what your question kind of goes to。 And then we can say, oh, let's try to use 0644 for the permissions, but we know that may or may not work。

Okay。 That's how we're doing that。 Now we've opened all those files we're about to push out to。
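The opening step might look something like this in isolation。 The function name open_outputs and the cleanup on failure are assumptions added for this sketch; the flags are the ones named above。

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

// Fills fds[0] with standard output and fds[1..argc-1] with newly opened files
// named by argv[1..argc-1], creating them or truncating them if they exist.
// Returns 0 on success, or -1 if any open fails.
int open_outputs(int argc, char *argv[], int fds[]) {
  fds[0] = STDOUT_FILENO;  // already open, so there is nothing to create
  for (size_t i = 1; i < (size_t)argc; i++) {
    fds[i] = open(argv[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fds[i] == -1) {
      for (size_t j = 1; j < i; j++) close(fds[j]);  // undo the earlier opens
      return -1;
    }
  }
  return 0;
}
```

Because of O_TRUNC, a file that already held data is emptied the moment it is opened, which matches the overwriting behavior seen in the demo。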

Okay。 Well, let's read in the data。 char buffer; we'll make it 2,048 bytes this time。 Why not? Okay。

while (true)。 This is going to look very similar to what we did before。 Okay。 while (true), ssize_t;

remember that's a signed integer type。 numRead equals read。 We're going to read from

where? STDIN_FILENO。 That's what we said。 We said we read from the input, from you

typing or from output piped in from another program。 Then buffer, and sizeof buffer。 Okay。

So we do that。 Okay。 And then if numRead equals zero, it means we're done, because we don't have

anything else to do, so we break。 And this is where we'll use our little helper function to write output

to all those different files。 So let's see, we'll do another for loop: size_t i equals zero, i is less than argc,

i plus plus。 We need to do all of them。 We'll call a function named writeall, which we'll write in a

second。 And we are going to write it out to

fds[i] in this case, a file descriptor, and pass the buffer

and then the number of bytes read。 Okay。

And then that will do that。 Sorry for the size of the screen。 And then let's see。

That's on one line, so that actually will work。 And then afterwards, we're done writing to all those files。

We have to close them all。 Well, we'd better do another for loop: size_t i equals zero, i is less than argc, i plus plus。

Right。 And in this case, we're going to close fds[i]。 You can close the input if you want to。

If you're done with it in your program, close it。 It's still open in all the other programs that might be running。

So it's not like you're ruining anything for anybody。 Okay。

And then now we can return zero, which I might already have in there。 Okay。 So that's that。

Let's see if there's any issue there。 There will be an issue because we haven't created the writeall function yet。

But

let's see: "comparisons do not have their mathematical meaning。" Oh no。

Line 21。 Did you guys notice that? 21。 There we go。 Oh, thank you。 Semicolon。 There we go。 Better。

Let's try it again。 Oops。 Oh no。 What have I done? There we go。

All right。 There we go。 Implicit declaration of a function, and the standard out underscore file name is undeclared。

Let's see。 Standard... oh, FILENO is what it is。 Thank you。 I was getting it wrong。

It is STDOUT_FILENO, like this。 FILENO, like that。 And the same thing for standard in。 There

we go。 Do the same thing。 Okay。 Let's try it again。 Okay。 So what we need to do now is just quickly

write that other function, which is just writeall, which should look very similar to the copy

function。 static void writeall。 What does a static function do?

Remember: it's only visible in the current file。 You should write your local functions as static

just because it's nice to not pollute the namespace too much if you can help it。 Okay。

We're going to take a file descriptor, a buffer, and a length。 Okay。 And in this function, we are going to do

the same exact thing we did before。 size_t numWritten equals zero。 And then while

numWritten is less than the length that was passed in, what are we going to do?

numWritten plus equals write。 We're doing that for each of those files。 Okay。 We write to the file

descriptor we passed in, from the buffer plus the number of bytes we've already written, and the

length minus the number of bytes written。 And that should do it。 And that's all our function is。
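Putting the read loop and the helper together, a sketch along these lines captures the idea。 It differs from the version typed in lecture in two assumed ways: writeall returns an int so that a failed write can be reported, and the loop lives in a hypothetical function tee_copy that takes its input descriptor as a parameter rather than hard-coding STDIN_FILENO。

```c
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

// Keeps calling write until all len bytes of buffer have reached fd.
// Returns 0 on success, -1 if write reports an error.
static int writeall(int fd, const char buffer[], size_t len) {
  size_t numWritten = 0;
  while (numWritten < len) {
    ssize_t n = write(fd, buffer + numWritten, len - numWritten);
    if (n == -1) return -1;
    numWritten += (size_t)n;
  }
  return 0;
}

// Copies everything readable from infd to each of the nfds descriptors in fds.
// Returns 0 once end of file is reached, or -1 on a read or write error.
int tee_copy(int infd, const int fds[], size_t nfds) {
  char buffer[2048];
  while (true) {
    ssize_t numRead = read(infd, buffer, sizeof(buffer));
    if (numRead == 0) return 0;  // end of input, for example after Control-D
    if (numRead == -1) return -1;
    for (size_t i = 0; i < nfds; i++) {
      if (writeall(fds[i], buffer, (size_t)numRead) == -1) return -1;
    }
  }
}
```

A main function in this style would fill fds[0] with STDOUT_FILENO, open the remaining files, call tee_copy(STDIN_FILENO, fds, argc), and then close every descriptor。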

Question。 Yeah。 Here。 Yeah。 So what does that mean? Remember what the tee command is doing。

It's reading from you typing, which is standard input。

So in this case we don't have to open that file

ourselves, because it's already

open for you。 We just need that file descriptor, and that's how we get it。 Good question。 Other

questions on this before I try it? Yeah。 No, it's an OR, so you can put them in any order。 The order

doesn't matter at all。 Good question。 Anybody else? All right。 Let's try it。 make。 Oh,

it's a different

file。 Okay。 So t。 If I do ./t and then I say file1.txt file2.txt,

right, and I start typing "this is me typing," "more typing," right, and then Control-D,

I should be able

to look in file1.txt and see that it worked。 Okay。 So what questions do you have

about that? You've now learned about STDIN_FILENO

and STDOUT_FILENO, input and output。 There's also STDERR_FILENO。 Okay。

In our little program here, we're assuming everything succeeded but the actual code which。

you can go look at online has more error checking。 You should probably do more error checking。 For。

our class, we're not quite as worried about most error checking。 But you'll see in various。

assignments when it's important。 Okay。 Other questions on that? All right。 That is T。 Okay。 So。

let's we're going to continue to dig a little deeper into file system sorts of things。 Okay。

There are two functions, stat and lstat, which are system calls。 And again, a system call is a

function that the kernel runs。 Okay。 Your code calls it and then the kernel takes over and runs it。

Okay。 And they populate this other thing called a struct stat。 Unfortunately, the name is overloaded。 Okay。

The struct stat you pass in is populated by the stat function。

Okay。 And stat and lstat are exactly the same, except when the path is a link,

and we'll talk about links a little bit later。 stat follows the link and returns information about the file it points to。

lstat says, oh, I'm going to report the details of the link itself。

What is a link? Think of an alias。 It's basically a name that points to some other file。 Okay。

So we'll get to that in the next day or so。 Okay。 And you can definitely look these things up。

By the way, you should get very used to typing things like man 2 stat and so

forth, 2 being the section of the manual that you need。 If you just type man stat,

watch what happens。

I think it's a different page: it's the built-in stat command from Unix that you see instead of

the library function。 Okay。 But with man 2 stat you've got all the details here for the function。

Okay。 And what they do, okay, is they populate this struct stat, and the struct stat has some

information in it that might be useful to you。 Okay。 We eventually will use the inode number, st_ino, okay。

But the one we care about right now is the mode, st_mode。 Okay。 And the mode is a

bunch of different bits in one variable that allows you to find out information about the file。

Okay。 So you can extract information about the file。 Is it a directory? I think,

I don't know, I don't think the mode tells you how big the file is, but it

says various things about a particular file。 Okay。 And we want to actually do a little bit of

coding to show you what this is doing。 Okay。 So there's a command called find。 I used

Unix for years and years and didn't know about this。

And once I found out there was a command called find, I went, oh boy,

I'm going to use it every day。 And I did; I use find

every single day。 What find does is allow you to search through a set of directories recursively

and find a particular file or a pattern for a file。 So let's see,

I know there's a search

dot-whatever in here。 So if I go back up, let's do this: find, then the lecture directory,

and then search.*, and I'll put that in quotes。 Okay。

Then what it does is it goes and says, anything that, actually,

well, in this case it didn't work。 Hang on, let me just do this。

I'll just say search like that。 And it will tell you where

the various things are found for that。 Okay。 Let's see。 Oh,

you know what, find is a little different。 It needs a thing called -name。

That's what it's looking for。 Okay。 So there we go。 There's where search itself is。

If I wanted to find all the different matching ones, I would type search*,

and it would give me all the

different things that start with search, which are all those files。 Okay。 So it allows you

to search through various directories to find files based on a file name or partial matches。 Okay。

So what we're going to do is we're going to write a relatively simple program to actually do this,

to do some searching just like that。 If you run find on /usr/include

and look for stdio.h, find /usr/include stdio... let's see。 You've got to do the

-name stdio.h, and then -print, as it turns out。 There we go。 It will

list all the ones that are called stdio.h, and there's a whole bunch of them, as it turns out。

And our search program should do the exact same thing。 Okay。 We need to be able to read a directory entry,

see if it's a directory, and if it's a directory, traverse that directory and then continue reading。

This should look fairly familiar from some CS107 things that you did。 Okay。

But let's actually go

and write the actual program, search.c。 Okay。 I've got all the header stuff in there。

We're going to

first do the main function。 And then we're going to write this listMatches function,

which is going to be a little bit more involved。 But for right now,

let's actually do the main function。 In fact, yeah, we'll do some of it。 In fact,

I'll just show it to you right now。 And then we can go and run the actual version in a little

bit。 Okay。 The main function, instead of me typing it all out right now, the main function

looks like this。 Okay。 It basically uses the lstat system call to get information about whether

or not a file is a directory。 And how do you do that? Well, you call lstat with some name,

right, and that populates a struct stat for that name。

And then you check and see if it is a directory by using this macro, S_ISDIR,

and you pass in the mode that you get back。 That's all there is to it。 Okay。
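That check is only a few lines of code。 The sketch below wraps it in a hypothetical helper named is_directory, which treats a failed lstat call as "not a directory"; that choice is an assumption for illustration。

```c
#include <stdbool.h>
#include <sys/stat.h>

// Returns true if path names a directory, false if it names anything else
// or if lstat fails (for example, because the path does not exist).
bool is_directory(const char *path) {
  struct stat st;
  if (lstat(path, &st) == -1) return false;
  return S_ISDIR(st.st_mode);  // the macro tests the file-type bits in st_mode
}
```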

And then you need to actually go ahead and check the length of the directory name itself。 Okay。

And then you have a pattern that we're going to type in, which you

are going to pass in to the listMatches function, which builds each path by copying the directory path plus a slash plus the

entry name, and then the next, and the next。 Okay。

And that's how it's going to work, using the listMatches function that we're going to write。

Okay。 So that's basically

what it is。 You're going to type search, then the name of the directory you're starting the

search at, and then the pattern that you're trying to search for。 And

listMatches will do all that。 Okay。 All right。 So how's that going to work? Well, as I said,

we need

lstat to do this。 Okay。 S_ISDIR is a macro。 A macro is kind of like a function, except

it gets replaced in the code immediately。 It doesn't actually call a function。

It inlines a bunch of code。 It's basically just checking some bits,

and if the right bits are set, it means it's a directory in this case。 Okay。

There's also S_ISREG, which is whether or not it's a regular

file, and S_ISLNK, which is whether it's a link to a file。 That's the alias I mentioned before。

And most of this is actually going to happen in the listMatches function,

which I'm going to show you in a second。 Okay。 You need to utilize opendir to open a directory, and you need to utilize

struct dirent and the readdir function, which you should have done in 107。

If you forget that one, well, you can go look up how to do it。 But it basically does the traversing

of this directory for you。 It's kind of nice to do that。 And then you need to know how to close

a directory as well, with closedir。 Okay。 All right。 So here's the actual implementation of the listMatches function。

Okay。 The listMatches function does what? It opens the directory up。 Okay。 And if the

DIR pointer is NULL, meaning the directory couldn't be opened, then it just returns。 Okay。 Otherwise,

it takes the path and

copies a slash onto the end of the path, and then starts going through

all the different entries, checking each one to see if it matches what we're trying to match。

Okay。 I'm going to have to do this with my pen down here。 So we are going to read in the directory

in this while loop, going through the directory。 Okay。 Again, if readdir gives back NULL,

we're going to stop, because we've gone to the end。 That's how this works。

It goes until it gives you back a NULL。 Okay。 And then we're going to compare the name to dot and dot dot。 Dot and dot dot are the two

entries that mean what? What did we say? Dot is the same directory, and the

parent directory is dot dot。 So what it's going to do is ignore those。

It's going to skip them, because otherwise it would end up in a loop that you don't want to

go down。 Okay。 And then if you happen to have a path which is too long,

we only allow a certain path length, and that's fine。 Then it just kind of says, I'm done。

I'm not going to go any farther than that。 Too bad for you。

And then it copies the actual name onto the end of the path, after that slash。

And then it calls lstat again and checks if it's a regular file。 If it's a regular file,

it does a string comparison on that file name。 And if it comes out with zero,

meaning they're the same, then it will actually print out the path。

And that's how it actually finds it。 Okay。 If it's a directory, what does it have to do?

It has to recursively call listMatches, because then it needs to traverse through that directory。

So it's a recursive program。 But really it's not that bad; it's just looking through a set of directories to do that。
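The walk just described can be sketched as follows。 This is a reconstruction under stated assumptions rather than the code on the slide: the constant kMaxPath, the wrapper function search, and the returned match count are all introduced here for illustration。

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

enum { kMaxPath = 1024 };  // assumed limit on path length for this sketch

// path holds a directory name of the given length, in a buffer of kMaxPath + 1 bytes.
// Prints every regular file under it whose name equals name; returns the match count.
static int listMatches(char path[], size_t length, const char *name) {
  if (length + 1 > kMaxPath) return 0;  // no room to append the slash
  DIR *dir = opendir(path);
  if (dir == NULL) return 0;  // not a directory, or not readable
  int count = 0;
  strcpy(path + length++, "/");
  struct dirent *de;
  while ((de = readdir(dir)) != NULL) {
    // Skip the directory itself and its parent to avoid looping forever.
    if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) continue;
    if (length + strlen(de->d_name) > kMaxPath) continue;  // path too long
    strcpy(path + length, de->d_name);
    struct stat st;
    if (lstat(path, &st) == -1) continue;
    if (S_ISREG(st.st_mode)) {
      if (strcmp(de->d_name, name) == 0) {
        printf("%s\n", path);
        count++;
      }
    } else if (S_ISDIR(st.st_mode)) {
      count += listMatches(path, length + strlen(de->d_name), name);
    }
    // Links are neither regular files nor directories under lstat, so they are ignored.
  }
  closedir(dir);
  return count;
}

// Searches directory recursively for files called name; returns how many were found.
int search(const char *directory, const char *name) {
  char path[kMaxPath + 1];
  if (strlen(directory) > kMaxPath) return 0;
  strcpy(path, directory);
  return listMatches(path, strlen(path), name);
}
```

The same buffer is reused at every level of the recursion: each call only overwrites the part of the path after its own slash, so nothing needs to be copied or freed on the way back up。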

Okay。 Question。 Yeah。 A regular file is not a link。 It's basically a file that has data inside it。

A link is a file that refers to another file。 So the data inside of it refers to the other file。

And the operating system needs to know whether it's an alias, a link, or a regular file, so it can decide whether to

travel down those or not。 Yeah。 What other questions do you have on this program? Yeah。

What do we use? Yeah。 What did we say the difference was? It was just a matter of, like,

what do we expect that to be? You might have links。 Yeah。 You can certainly have links。 In fact,

your assignment has links in it。

Let me show you a link in your assignment。 If you go to, let's see, let's go to

assignment one; let's see, I think。 There we go。 Okay。

If we go into here and look at ls -l, okay, take a look at samples down here。 What's that?

What's going on there? It says samples has this little weird arrow that points to a path under

/afs/ir/class/cs110, the samples for assignment one。 And by the way, if we go into samples,

let's take a look。 Now we're actually in that directory。 If we look at it, we've got an actor data

file and a movie data file。 And those two files are ginormous。 And what it means is, if we put a

link in your assignment directory to these regular files, none of you are going to change these files。

You're not even allowed to, as it turns out。 You can't change them because they're read-only for

you。 But it means that everybody in the class can get access to this one gigantic file through

a link, so that we don't have to make copies of them for everybody。

We would quickly run out of space if we tried to make a gigabyte file or 80-megabyte files for

everybody, or whatever。 Yeah。 Eight-hundred-megabyte files, whatever。 Other questions on that?

Good question。 Did that answer it? All right。 So that's how stat and lstat work。 Again,

the reason we did this example was to show you that you can use lstat to find out whether something is

a file or a directory。 Okay。 Either stat call can do that。 There are also

other things that you might care about。 Like for the next assignment, you will care about some

other pieces of that structure。 All right。 We relied on opendir, which basically says,

oh, it's a directory。 We had to do an assertion in there that says, don't try to open a directory

that's really a file。 You have to figure that out。 Okay。 You have to make sure you do that。

And then you get all of these different directory entries by walking through them with that readdir call。

Okay。 And again, you probably did this in 107。 You probably did an assignment about that。

But it's not too bad。 You just call that function again and again and again, and it gives you the next directory,

the next file, or the next directory entry。 Okay。 Let's see。 What else? Here it is。

Here's the answer to your question。 We used lstat instead of stat so we know whether it's really

a link, and we're going to ignore links in this case。 You can have links that are circular, that

refer back to something earlier in the file system。 And so if you followed them, you might have a

problem, because you traversed a link and then it might come back to where you were, and then you

would end up in this recursive loop forever。 And that would probably be bad。 Okay。 And I went

over the other details about all this。 You should, as always, remember to close your directories。

Anything else on those? All right。 There's another program that we'll briefly look at called list,

which is basically ls。 Right。 And ls does what? ls actually lists all the stuff in your directory。

Well, guess what? It has to do basically the same sort of thing。 It needs to read through all of

the directory entries and get the information out about each one。

It needs to get the permissions, by the way。 It needs to get whether or not this is a directory, and it needs to populate these parts of it right here。

Okay。 It also needs to tell how many of what we call hard links refer to a file。

That's another topic we'll get to, I guess, on Friday or next week。

But basically you can have one file which lots of things point to, in two different ways, and list will actually tell you how many of

those exist。 Okay。 It needs to get the name, and it needs to get the date, and so forth。

Okay。 Those details are a little bit down in the weeds, but you can do that。 Okay。 If

you want to look at the entire list.c program, you can do that。 The key one I will show you

right here。 This is the permissions one。 How do you get the permissions for a particular file?

Okay。 Well, let's see。 Okay。 There's a lot of code in this one,

right。 Here's the listPermissions function, right down here。 Okay。

There are details about this that you don't really need to know, but it basically says,

"Set all the permissions to dash。" Okay。 And then go ahead and look at each entry, right,

and check the mode of each one。 If it's a directory, you put a d in the first position, and

otherwise you go through all the

permissions that you might have。 So you have to go through each one of those and ask: can the

owner read, can the owner write, can the owner execute? Set that up。 Can the group do the

same things? Can everyone else do the same things? Right。 Details。

You have to kind of go through all these details when you're writing this low-level code。

That's kind of the way it goes in this case。 Okay。 And let's see。

Is there anything else important here? Here are some flags that you can use

for all of the different permissions, right。 S_IRUSR。 That's a macro that basically says,

check and see if the user can read; there are others for whether the user can write or execute。

There are nine different things to check for, because there are three for each of owner, group, and other。

It's a lot to do if you're trying to do all this one by one。 Okay。 It's kind of amazing what ls can do。

It has to go through all these details if you're doing systems-type stuff。 Okay。

The listPermissions function itself prints out those permissions, and this is,

hang on, I think the next... oh no, that's, yeah, here we go。

Oh no, that was all of it right there。 That's listPermissions, and the

stuff up here is kind of all of the setup for it。 Okay。 Anyway, go look at the code for this if you

want to see how all these macros and things work。 We won't necessarily make you code all these

kinds of low-level things。 You should just know how these things work so that you understand how

permissions work, and that there are macros, and that there is this struct that you have to use, and so

forth。 Okay。 All right。 What questions do you have at this point about anything we've covered?

The assignment is due next Wednesday。 It is due absolutely on Wednesday。 There are no late days for

this assignment, just because we want you to get going on the next assignment and we want to grade

the first assignment as quickly as we can。 I have office hours tomorrow morning, 10 to 12, I believe。

Stop by if you want。 And I will send a message out to the CAs。 They're going to

start some office hours, probably tomorrow or Friday。

They will go through Saturday, or sorry, Friday,

and Sunday at least, and then next week as well。

All right, I'll stick around for a few more minutes, and we'll see you on Friday。

There is class this Friday。 No labs。 Class Friday。