Stanford CS110 Principles of Computer Systems Notes (Part 5)

P17:Lecture 16 Network System Calls - main - BV1ED4y1R7RJ

Screencast people, you missed about 30 seconds. Okay, so the question was, "Hey, can we use multi-processing from threading?" The answer is you kind of have to be careful. The best I could find was the Stack Overflow answer, which basically said, "Look, you can use fork in a multi-threaded program." And then what does that actually mean? It means that when you've forked, both processes now have threads associated with them. So you've got to have some way of dealing with that. But in the sense of being able to do it, the biggest thing is: after you do fork and before you do execvp, don't do any malloc, free, new, or delete. And that was the best I could find. You can do it. Just don't do anything that's going to mess with memory in such a way that it will end up messing with the schedulers and things.

So the answer is, yes, if you want to use multi-processing from within multi-threading, you can. You just have to be a little careful about it. And obviously, testing is always going to be something important to do. Okay, so good question. And that's the kind of answer that I could find. Feel free to do some more digging on your own.
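As a minimal sketch of the fork-then-exec advice, assuming a POSIX system: the helper name `run_child` is hypothetical and not from the course code. The child goes straight from `fork` to `execvp` without touching the allocator.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv[0] in a child process and returns its exit
 * status, or -1 if fork/waitpid fails or the child did not exit normally. */
int run_child(char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: go straight to exec. No malloc/free/new/delete here, since
         * another thread may have held the allocator lock at fork time. */
        execvp(argv[0], argv);
        _exit(127); /* reached only if execvp failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller in a multi-threaded program would build `argv` before calling `fork`, so that nothing needs to be allocated in the child.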

But it is possible. It's just you've got to be careful.

Okay, so let's move on to network system calls, library functions, et cetera. Now, as I said a couple of minutes ago, this is where I think it's kind of a flashback to CS107, because some of the wonkiness associated with C propagates to these functions that are about getting host names and resolving them, and using these low-level functions to do so in a way such that you can support IP version 4 and IP version 6 and any other version that you want. It turns out that socket programming, which is what we're talking about here, is where two computers talk to each other through a socket, or basically a port.

A socket is, as it turns out, just a file descriptor, but it's a very special file descriptor。

It still gets an entry in the open file table。 Believe it or not。

it still gets an entry in the file descriptors for a particular process。

but it allows you to have a double-ended communication between two processes or two computers over a network。

Okay, so it's just like a file descriptor, but there's a little more to it。

And you don't necessarily need to know what else there is to it, except for the fact that, even though it's a file descriptor, you don't say read and write in the same way that you normally do. Okay? You use things like accept, and you use other functions. I guess you can use write on the file descriptors, but there are some other nuances, so you can't just say, "Oh, it's just like every other file."
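To make the "a socket is a file descriptor" point concrete, here is a small sketch. It assumes a POSIX system and uses `socketpair` with `AF_UNIX` as a local stand-in for a real network connection; the helper name `echo_through_socketpair` is made up for illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: sends msg through one end of a connected socket pair
 * and reads it back from the other end. Returns bytes read, or -1 on error. */
ssize_t echo_through_socketpair(const char *msg, char *out, size_t outlen) {
    int fds[2]; /* two ordinary descriptors, each one readable and writable */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) return -1;
    ssize_t n = -1;
    size_t len = strlen(msg);
    if (outlen > 0 && write(fds[0], msg, len) == (ssize_t)len) {
        n = read(fds[1], out, outlen - 1);
        if (n >= 0) out[n] = '\0';
    }
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Plain `read`, `write`, and `close` work here; the socket-specific calls only come in when a connection has to be set up.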

Somewhat like a file, but not exactly. So, let's talk about the specifics.

Because we are people, we like to use things like www.facebook.com versus 31.13.75.17, right? You don't want to memorize numbers. We've talked about that in a number of ways in this class. And so, there are functions that will get the number from the name. So, if you know the name www.facebook.com, which you can remember easily, you can use a function called gethostbyname. And you can also use one called gethostbyaddr. They're both technically deprecated. In other words, there are other things you should probably use instead. However, your book talks about these ones, and they're still used enough that you should be used to using them. In particular, I'm not exactly sure what the actual new one is. I see these ones more than I see that, and I forget what the one that you should use these days is. So, anyway, we're going to talk about these two.

gethostbyname takes a name like www.facebook.com, and it returns a pointer to a struct, a struct hostent, which is a particular struct we'll see in a second. And it populates it with the information you're going to need about what the IP address is and so forth.

You can also pass in an address here. As it turns out, I don't believe that's actually a char *, although it looks like a char *. You have to actually cast it to, believe it or not, like an int in this case. And the reason for that is because you need to... actually, this one maybe not. I don't even think we use this one in an example. There are ones where you have to do that. And again, it's because of the wonkiness of C: some of this stuff was done in 1977, and there weren't even void * pointers then. So everything became a char * pointer. In this case, yeah, this is actually going to be, I believe, a pointer to a number. And you're going to say how many bytes it is, and then another variable says the type of connection, whether it's IPv4 or IPv6 and so forth. Okay, so there are lots of details that you have to know when you're using that function.

Okay, a struct hostent has the following... well, we'll get to what it is. Let's see, gethostbyaddr. Yes, here we go. This one is a struct in_addr, which happens to be an int. It's the weirdest struct I've ever seen. It's got one value in it, which is an int. So, we'll see a little bit more about that in a second.

But the one we're going to focus mostly on is gethostbyname. Okay, all right, so what is this? Oops. Oh, no, that's not what I want to do. Hang on. I don't have my tablet set up quite correctly now, so hang on. Let me go back here and we will use the cursor. Okay. All right.

So, here's what the struct hostent looks like. First of all, it makes use of a struct called struct in_addr, which holds, in this case, an unsigned int s_addr. That is, like I said, the strangest struct, because it's only got one value in it. You normally don't bother with a struct that has one value; what's the point? Maybe they thought it would be used some other way and might get other fields. They probably were forward thinking enough, and it never was. And it never has changed. So, that's all that is in there. Okay. This is an unsigned int.

And then the struct hostent has a regular old char *, finally, a regular old char *, which is the name, h_name. That's the official name. It turns out, and I'll show you an example, the official name might be different from what you actually type in. A lot of times they end up going to the same IP address, but the name could be a little bit different. Then you have a bunch of aliases. And the aliases are other strings that also refer to the same IP address. So there's an official one, and there are other ones that do that. You rarely see the aliases filled in at all. I guess nobody really cares that much about it. But it is a char **, meaning that it's basically just like every other char ** we normally use. It's, as it turns out, a NULL-terminated list of pointers to strings. So string, string, string, string, string, and then the final one is NULL, so you know you've reached the end. That's what that one is.

Okay. The h_addrtype is going to be different depending on which type of IP address you care about. So in this case, we'll probably stick mostly with AF_INET, which means Internet, for IPv4 addresses. And this is actually somewhat important. We'll see why when we get into the details of some of the strange polymorphism that they've jammed into C. And then here's a char **, h_addr_list, which is not really a char **. It's basically a void **. That should be a void **. But when they built this function, they either didn't have void ** or they didn't want to use it for whatever reason. So when you use this, you should cast it to the appropriate thing that it is, which you will know based on analyzing the address type. We'll see how that works.

It's like really low level 107 kind of stuff。 Question, are these things supposed to make a durant?

Are these things that make a durant? Probably, yeah。 Probably the same people。 Yeah。 There you go。

Yeah。 It's, look, when you're designing something in the 1970s。

it's a little bit different than the way you might design it today。 Maybe。 But yes。

it's probably the same people who did that。 Or somebody who's thinking along the same lines。 Okay。

Anyway, we'll see why this becomes interesting in a few minutes。 Okay。 All right。
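Both `h_aliases` and `h_addr_list` use the same NULL-terminated layout, so one walker covers both. The sketch below assumes the standard `struct hostent` from `<netdb.h>`; the helper names `count_entries` and `count_addresses` are hypothetical.

```c
#include <netdb.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Counts the entries in a NULL-terminated array of pointers, which is the
 * layout used by both h_aliases and h_addr_list. */
size_t count_entries(char **list) {
    size_t n = 0;
    while (list != NULL && list[n] != NULL) n++;
    return n;
}

/* Hypothetical helper: number of addresses recorded in a hostent. Each entry
 * of h_addr_list points at h_length raw bytes (4 for AF_INET), not a string. */
size_t count_addresses(const struct hostent *he) {
    return count_entries(he->h_addr_list);
}
```

The entries of `h_addr_list` are declared `char *` but are not printable text, which is why the lecture casts the list before using it.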

The s_addr field is what we call a dotted quad. Basically, it's four bytes. One, two, three, four. Hopefully you have seen IP addresses that look like this in the past. That's what an IP version 4 address looks like: 171.64.64.136. They break those IP addresses into four parts because, number one, it's a little easier than memorizing one big long number. And number two, each byte means something different. 171, I believe, is Stanford's: the first byte of all Stanford addresses is 171, I believe. And then the other ones slowly narrow it down to the actual machine or router that you're looking at. So that's how that works. Because these are just bytes, an IP version 4 address is four bytes long, or 32 bits, and this one is 171, 64, 64, 136.

Now, what order are those bytes going to be in? This is the 107 stuff I was talking about, too. When you have a four-byte number on your system, there are two different orderings we can use. Remember what they were? Little endian and big endian, right? And our machines are generally little-endian machines, meaning that the little end of the number actually comes first in memory. So 136 would actually be the first one in memory, and then 64, and then another 64. And then 171 would be last in memory. We could have done it in exactly the opposite order. You could have done it in big endian. And for those of you who took CS107E, I believe the Raspberry Pis are in general big endian. Although I think you can actually switch it back and forth, as it turns out. But the point is that in that case it would be in a different order.

Well, because of standardization, your computer, which is little endian, needs to talk to some computer on the other side of the world, which may be big endian. You have to make a decision on which direction to put the bytes when you actually send them over the wire. Okay? The reason we have to do that is so that everybody can talk together. And therefore, if you have the wrong byte order on your computer, you'd better do a translation when you're about to send it. We will see that in action. I'm just setting you up for it, so you're thinking about it right now.
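A short sketch of the byte-order point, using the 171.64.64.136 example. It relies on `inet_pton` storing the address in network (big-endian) byte order and on `ntohl` converting to whatever the host uses; the two helper names are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Parses a dotted quad and copies its four bytes, in the order inet_pton
 * stores them (network byte order), into out. Returns 0 on success. */
int dotted_quad_bytes(const char *text, unsigned char out[4]) {
    struct in_addr addr;
    if (inet_pton(AF_INET, text, &addr) != 1) return -1;
    memcpy(out, &addr.s_addr, 4);
    return 0;
}

/* The same address as a host-order integer; ntohl does the translation
 * (a byte swap on little-endian machines, nothing on big-endian ones). */
uint32_t dotted_quad_host_order(const char *text) {
    struct in_addr addr;
    if (inet_pton(AF_INET, text, &addr) != 1) return 0;
    return ntohl(addr.s_addr);
}
```

For 171.64.64.136 the stored bytes are 171, 64, 64, 136 in that order, and the host-order integer is 0xAB404088 on any machine. The matching functions for the other direction are `htonl` and `htons`.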

And I'm going to show you the details in a second. Okay? Let's see. Yeah. For non-IPv4 ones, we have different information in that struct. h_addrtype is going to be a different number. h_length is going to be different: if you have 128 bits, you're going to need to report that. And the list can have different types of information too. So you've got to be a little bit careful with that as well. Okay? All right.

Now, why is it not letting me do this again? There we go. Okay. Any questions on that so far? We're getting to the actual code here. In fact, let's actually write some code, and then I'll show you this as we go along. That is way too small. All right. So, there we go. All right. So,

we're going to go ahead and take a look at the code。

Okay. So,

we have to start out by declaring a struct hostent pointer. What it points to is a statically allocated variable that is kept by the operating system. So, if you're going to use this multiple times in your code and you want to keep track of different results, keep a copy of it, because it's not thread safe: you are just getting a pointer to the data that's living inside the function that you're calling. Okay. Not the best way to do it, but that's the way static variables work, and that's how this works.

gethostbyname is a C function, so if we have a C++ string, we just convert it to a C string. And then calling gethostbyname will populate this. So, NULL means we couldn't resolve the name. That could be for a number of reasons, as it turns out. It could be that your DNS server is down, whether that's on your computer or the network or the router you're connecting through, or it could be some other reason. So if you type one and it happens to be not there, it doesn't necessarily mean that it doesn't exist; it means that your program couldn't resolve it. This is the problem with networks. Sometimes other things that you can't rely on go down, and there's nothing you can do about it.

Okay. Then he->h_name is the name we talked about. And then all of those IP addresses: we just go through them one at a time by first saying, okay, let's cast h_addr_list to what it really is, and we know that it really is a struct in_addr double pointer. Why? Because we're asking for an IP version 4 address. In other words, we called gethostbyname, meaning we were expecting IPv4. If you want IPv6, there's a different function you call. I'll show you that in a minute. But anyway, then you get the IP addresses by looping through until you get NULL.

And how do we actually get out the actual IP address? Well, remember it's just a number right now, and it's just a four-byte number. And so what we do is we call another function called inet_ntop; "network to printable" is basically what the ntop stands for. It says, hey, what type of IP address are you using? In this case, IP version 4. It says, give me the pointer to the address itself. And then give me a string to populate. And that's what we've done up here. And we know that INET_ADDRSTRLEN is the maximum length of an IP address string, as it turns out. So we know that's okay. And then how long it is is also passed in, so that it won't write over the buffer. Once you do that, it takes that number and forms it into a nice 172-dot-whatever, whatever, whatever. Okay. Any questions on that? Yeah. Yes, AF_INET is IP version 4. Yes.
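The loop just described can be sketched in plain C as follows. This is not the course's resolve-hostname.cc; the helper name `resolve_ipv4` and its output-array interface are assumptions made for illustration.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: resolves host to IPv4 addresses and writes up to max
 * dotted-quad strings into out. Returns the count written, or -1 on failure. */
int resolve_ipv4(const char *host, char out[][INET_ADDRSTRLEN], int max) {
    struct hostent *he = gethostbyname(host); /* result lives in static storage */
    if (he == NULL || he->h_addrtype != AF_INET) return -1;
    /* h_addr_list is declared char **, but for AF_INET each entry really
     * points at a struct in_addr, so cast before walking it. */
    struct in_addr **list = (struct in_addr **)he->h_addr_list;
    int n = 0;
    while (n < max && list[n] != NULL) {
        if (inet_ntop(AF_INET, list[n], out[n], INET_ADDRSTRLEN) == NULL) return -1;
        n++;
    }
    return n;
}
```

Because the strings are copied into the caller's buffers, the results survive a later call to `gethostbyname`, which is the "keep a copy" advice from above.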

AF_INET is IP version 4. Let's look real quickly at resolve-hostname6.cc. Yeah. There's the one for IP version 6: AF_INET6, et cetera. And we would also call a different function, called gethostbyname2. They weren't very clever when they named these, I guess. You'd think it would be something like gethostbyname6 or something. But with this one, you pass in the actual type. So I believe we could have used this one for the IPv4 version as well, and it would have worked just fine, because we would have passed in IP version 4. Did they also write two two? Did they also write two two?

They probably also wrote two two. Yes. Yes. If you ever build a big system, you will quickly realize that there are decisions you have to make that you hate to your core. But you have to make those decisions because that's just the way it happens. Now, did they have to do it this way? Probably not. But there's some committee that probably decided that, and that's the way it goes.

Anyway, let's see this program in action. Okay. I think it's already made. Let's see.

resolve-hostname. Yes. Okay. So if we type a host name, let's try www.stanford.edu. Okay. So it says that that's the actual host name, the official one. Now it's a good thing we don't have to type that every time we want to go to stanford.edu. Why is it that big and long? Well, it tells you that Stanford actually relies on Amazon AWS services for their... well, if you think about it, it's actually not a terrible call. Why do it all in house if you don't have to, right? Just rely on some giant company that has billions of servers out there. It's probably going to stay up relatively often and so forth. I believe if we just type stanford.edu, there we go, the official name is just stanford.edu, and it happens to go to a slightly different address. My guess is that if you type www.stanford.edu, your browser may actually... let's try this. I'm going to try something: ping stanford.edu. Okay, it says 171.67.215.200. Okay, so that's that. If we ping www, let's see what happens here. No, it's giving you the other one. So they're slightly different. They probably end up pointing to the same place somewhere along the line. But yeah, www actually makes a slight difference in that case. If you type either, you'll end up in the same place. Somewhere along the line, they're rerouted to the same web page.

Let's try a couple more. Let's try google.com. google.com has an IP address that I believe is actually based on your location. It kind of knows who's asking, and then it returns the IP address that's more local to you, as it turns out. Let's see, facebook.com. Yeah, same thing. There you go. Some of the other ones, let's see, www.facebook.com. There you go. The www.stanford one, actually, I think it had two. There it is. It did have two different IP addresses associated with it. Those are Amazon addresses, I think.

And then there's another one that Jerry likes to use: okcupid.com. I don't know that Jerry likes to use that, but Jerry actually showed me this, that it has lots. I'm not sure why. And it's not because billions and billions of people are using it while Google has a few people here and there. So I don't know why. There's something going on where their host server says, hey, these are all your IP addresses, and it shows up in this list, and it's a little bit of black magic after that. Now, github.com, let me try it. github.com. There's just one for that. I don't know, I could try www.github.com. Same. So that one happened to be one where the official name is the same and it went to the same address. So yeah, there's definitely some black magic going on there. I don't know all the details about that. So let's do one other thing.

I want to show you just one other thing here: gdb. I told you it's going to feel like 107. gdb resolve-hostname. Okay. Break on publishIPAddressInfo. I think, there we go. All right. Let's run it again. Let's do stanford.edu. Actually, let's do the okcupid one, just to see. Okay, okcupid.com.

Okay. So if we go into the code and we get the host name, if we print out he, it's just a pointer. All right. If we print out *he like that, it tells you all the details there. Okay. So it tells you the name; that's just a pointer to the name. And then the aliases: it turns out there are no aliases. I believe if we do, let's see how we're going to do this, *he->h_aliases.

See if that works. Yeah. So there are no aliases, it turns out. And then let's see. The h_addrtype happens to be 2. In that case, that means IP version 4. And then the length is four bytes, so we know how long each address is. And then the address list is, remember we said, a char **. But if we just try to print h_addr_list like that, it's going to be kind of garbage-y, right? Because it doesn't really know what it is. I think you're going to have to actually do something like this. Let's see if we can do this. We're going to have to cast it. I know, it gets ugly, right? We have to cast it to a, what is it here? It is a struct in_addr. struct in_addr *, maybe? Nope. [INAUDIBLE] Let's see. I did this earlier. I figured this out earlier. OK. That's that. Maybe we need to do, let's see. [INAUDIBLE] Hang on. Like this one? It's going to be the same thing if I did, hang on. OK. So that's that one.

And then let's print out what each one of those is. p 0x84: 132. Was that one of the okcupid ones? It is, 132. And then let's see, 0xd0: 208. Yeah. What are the other ones? It's 41 and 198. So 41 should be 0x29, and then 0xc6 should be 198. OK. Notice the order it's in. It's in the wrong order. It's backwards. Well, the inet_ntop function that we called earlier actually knows that it's in the little-endian format, and then gives us the correct string back. OK. But we haven't actually sent anything over the network yet. Yeah. [INAUDIBLE] Yeah.

It knows that by the time it gets into that number, it's whatever the computer's representation is. So in this case, little endian. When we actually send it across the network, we have to turn it into a big-endian number. You have to do that so that everybody knows how to, [INAUDIBLE] Why big and not little? Somebody made that decision. I mean, do you know where little endian and big endian come from? You guys need to take more... see, supposedly Stanford is a liberal arts university. That's what I understand. It comes from Gulliver's Travels. In Gulliver's Travels, there were the Little-Endians and the Big-Endians, who cracked their eggs, hard-boiled or soft-boiled eggs or whatever, either on the big end to open them up or on the little end. And they got in a big fight over it. And so whoever was creating this said, oh, it looks like the little end is there. Oh, I remember this. Oh, and this is also going to cause a big fight. And that's why we get questions like, why? What does it matter? And so forth. So it was actually a perfect analogy, as it turns out. That's where it comes from. OK, so while we're at it, let's look at resolve-hostname6.cc,

which is the, oops, resolve-hostname6.cc. Let's look at this. We kind of looked at this one already a little bit. If we want to do IP version 6 addresses, we have to use gethostbyname2 and tell it we're looking for an AF_INET6 address. And then we have to check and make sure that it's an AF_INET6 address, an IP version 6 address. And then same sort of thing here: we can use this function. It knows how to convert IP version 6 addresses. Let's see that one as we run it. Let's actually do it in gdb, just to see the difference. resolve-hostname6.

Let's first run it and see. Host name: google.com. There's google.com's address. Now, there are 128 bits here. This is pretty big. 128 divided by 8 is 16. That would mean 16 bytes. There aren't 16 here. There's one, two, three, four, five, six, seven... basically seven or eight, maybe nine or ten. And there's a little extra double colon in there. This is a decision they made to try to make the IP version 6 numbers smaller. If there are a bunch of zeros in a row, you can put two colons, and that stands for all the rest of the zeros in there. I still think it's almost impossible for a human to figure out exactly what all that means. It's not impossible, but you've got to rack your brain about where the zeros are and how they fit in.
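One way to see the double-colon rule without counting zeros by hand is to let the library do the round trip. This sketch parses an IPv6 address with `inet_pton` and prints it back with `inet_ntop`; the helper name `canonical_ipv6` and the sample address from the 2001:db8 documentation range are assumptions, not taken from the lecture.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Parses an IPv6 address in any accepted text form and re-prints it the way
 * inet_ntop does. Returns 0 on success, -1 if text is not an IPv6 address. */
int canonical_ipv6(const char *text, char out[INET6_ADDRSTRLEN]) {
    struct in6_addr addr; /* 16 raw bytes, i.e. 128 bits */
    if (inet_pton(AF_INET6, text, &addr) != 1) return -1;
    return inet_ntop(AF_INET6, &addr, out, INET6_ADDRSTRLEN) != NULL ? 0 : -1;
}
```

Feeding it the fully written-out form `2001:db8:0:0:0:0:0:1` should give back the compressed form `2001:db8::1`: the run of zero groups collapses into the double colon, and the 16 bytes underneath are unchanged.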

And it was probably an uninspired decision, as far as I'm concerned. But with an IP version 6 address, there are 128 bits available, which is 2 to the 128 different addresses. Actually, let me go to here. I've got the number in here. There it is. There are this many different IP addresses that you can now have. That number is bigger than the number of protons or atoms in the universe, I believe. So you will be able to assign every atom in the universe an IP version 6 address if you'd like. So I doubt we'll run out of them, at least in our lifetimes. But you never know, I suppose.

But look at it this way: if you're a big enough company and you have enough clout, you can actually ask for a particular IP version 6 address. And why would you care about that? facebook.com. Take a look at Facebook's IP version 6 address. It actually says Facebook in it, which is like, oh, how clever, how nice is that. And again, I don't even know why they didn't put just the double colons. I don't understand how these things are figured out. But maybe someday they'll just have vanity IP addresses, and you can get one for your phone or whatever. Not that you actually care, but people at Facebook, I guess, care about these things.

Let's just run it. Let's break on the same function as before, publishIPAddressInfo. And then, oh, no. Is it not there? Hang on. resolve-hostname6. It is called, let's see, oh, of course: the publish IPv6 info one. Let me do that. All right. Break on that, and then run it. And yes, we're going to start it, and let's try Google again. Notice I didn't do Stanford. Wait, let me show you Stanford. Yes. stanford.edu. Continue. I don't think it has one yet. That's kind of too bad. But anyway, that's the way that goes. Let's see. Host name. We'll try it again. google.com. OK.

So you do the gethostbyname2, and then, again, we have to cast it to struct in6_addr. So if we again print out *he, it will tell us that it has a length of 16 for the number of bytes, and the actual address list, and so forth. So there are some differences that go into that. Yeah, it has to. When you cast it to struct in6_addr **, it's so that you're able to just step through the address list. Well, that's correct. Right. Good question.

Take a look at the struct. Where did it go? There it is. OK. So this is declared as a char **. It's not declared as what it really is. Why isn't it declared as what it really is? Because we want this to be generic enough to work with both IP version 4 and IP version 6, and any other one that you want.

There's actually another one that's IP。 I think it's just got a--, I think it's i-- what is it here?

It's not i and address。

It's i and addr_unix。 And it's its own type, meaning that you。

can use sockets like internal to a computer。 And it's another way of doing that。 So again。

this server thing is pretty robust. But it's robust enough that there are some weird details in here that you kind of have to get to know. But does that make sense? Why you'd have to do that? [INAUDIBLE] How come you can do address list plus plus? Yeah. You can do address list plus plus because it knows now that it's this type of struct. Right? If you didn't do that cast, it would try to step one character at a time, and it would get all screwed up. Again, 107 stuff. Yeah. [INAUDIBLE]

I don't know what happened to version 5. I mean, that's a good question. This could be deadly. But what happened to IPv5? Why is there no IPv5? Yeah, it might be like Windows 9. Let's see. "Doesn't exist," so there is no IP version 5. Let's see. It was intentionally skipped to avoid confusion. There was an experimental protocol called the Internet Stream Protocol, defined in RFC 1190. And it was assigned IP version 5. And we don't use that anymore. So they said, oh, let's skip it. Use 6. It has nothing to do with four bytes or six bytes or anything like that, if that was one thing you were thinking. So that's it. OK. That works. All right. I know this stuff is definitely a bit crazy.

So anyway, we've run a bunch of these here. Now, let's talk about the sockets themselves. When you are creating a socket, remember, we had the accept call and we had create socket and so forth. We are going to look at those in a little bit of detail to see how they're actually built. We're going to look at createClientSocket and createServerSocket. Now remember, they are two very different things. When you are creating a client socket, what you are trying to do is reach out to some other computer and connect to that computer's IP address and port. So you're trying to say, let me make a connection with some other computer. That's when you're creating the client socket.

When you are a host, if you're trying to create a server socket, all you need to do is get your local port number and try to assign it to yourself. We call that binding it. We'll see how that works. So you don't actually reach out to anybody when you're doing a server. You're basically saying, hey, I'm here. I'm listening for people to connect to me. Or other computers to connect to me.
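The server side of that description (claim a local port, bind it, listen) might look roughly like the sketch below. It is written in plain C under assumed defaults (IPv4, all interfaces, a backlog of 128) and is not the course's createServerSocket implementation; the name `create_server_socket` is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical sketch of a server socket: binds to the given local port on
 * all interfaces and starts listening. Returns the descriptor, or -1. */
int create_server_socket(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* any local address */
    addr.sin_port = htons(port);              /* port 0 lets the OS pick one */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 128) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Nothing here reaches out to another machine. A client would instead fill in the remote address and port and call `connect`.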

That's the big difference there。 Now, of course, we have different types of sockets here。

This is also, again, where it gets a little bit pokey。 We have a generic socket。

The generic socket is struct-socket-dress。 And it has its first element--。

its first member is this unsigned short called SA family。 It's a two-byte value called SA family。

which, is going to be the--, this is going to say what protocol it is。

Then we have a really bizarre SA data 14 bytes worth。 But that's it。

It doesn't say anything else about that, except that it says there's 14 bytes there。 And you say。

oh, all right, maybe that 14 bytes, is going to be useful for something。 I don't really see。

Then we have struct-socket-dress-in for internet。 This is the IP version for version。

And it also has as its first two bytes, the family, the internet socket internet family。

And then it then has a port number associated with it。

And then it has one of these struct-i and addresses, which we saw earlier, sine ADDR, which。

says the actual four-byte internet address in it。 Remember, it says weird。

It's four bytes and not-- it's actually a struct, which is weird。

And then it has eight bytes worth of zeros。 And they're defined to be zeros。

And it turns out they're completely ignored, although most people actually just do set them to zero。

because it says zero in the name。 And so they figure out I should probably just set it to zero。

It actually probably doesn't matter one bit。 They're completely ignored。 So what is that? Well。

I don't know。 We'll see。 Let's count the bytes, first of all。

Let's see if this actually does anything。 How many bytes is a short? Two。 Short's another two。

There's four。 How many bytes is an unsigned int, which is the in_addr? Four。 So that's eight total。

And then there's eight more。 So that's 16。 And then up here, we add two there and 14 there。

So that's 16。 OK, that sounds like it might make sense。 Let's look at the internet version,

the version 6 one。 Well, it also has the first two bytes as the family。

Then it's got the next two bytes as the port。 Remember, ports are only two bytes,

whether you're using IP version 6 or 4。 And then it's got a struct in6_addr。

And that's going to be sin6_addr。 How big is 128 bytes? Or rather, how big is 128 bits in bytes?

16。 16 plus 4 is 20, plus 2 is 22, plus 2 is 24。 Plus another 4 is 28 for this one。

Is that 16? No。

I have no idea why this is the case。 It turns out that it really doesn't matter。

that there's this 14 byte one there。 I think there has to be something there。

to make the compilation work right。 That's about all I can figure out。

The various resources I've looked at said it just kind of doesn't matter。 So whatever。
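The byte counting above can be checked directly. This is a minimal sketch, assuming a Linux machine with the usual glibc headers; the struct `SockaddrSizes` and the function names are made up for this example. It also reports `struct sockaddr_storage`, which the lecture does not mention but which the standard headers provide as a type big enough to hold any of the family-specific structs.

```cpp
#include <cassert>
#include <cstddef>
#include <netinet/in.h>
#include <sys/socket.h>

// Byte sizes of the socket address structs on the machine compiling this.
struct SockaddrSizes {
  std::size_t generic;  // struct sockaddr: family plus 14 bytes of sa_data
  std::size_t ipv4;     // struct sockaddr_in
  std::size_t ipv6;     // struct sockaddr_in6
  std::size_t storage;  // struct sockaddr_storage: room for any family
};

SockaddrSizes sockaddrSizes() {
  return { sizeof(struct sockaddr), sizeof(struct sockaddr_in),
           sizeof(struct sockaddr_in6), sizeof(struct sockaddr_storage) };
}

// The property the casting relies on: the family field sits at offset 0 in
// every variant. True on Linux; BSD-derived systems put a length byte first.
bool familyComesFirst() {
  return offsetof(struct sockaddr, sa_family) == 0 &&
         offsetof(struct sockaddr_in, sin_family) == 0 &&
         offsetof(struct sockaddr_in6, sin6_family) == 0;
}

// Sanity checks that do not depend on the exact numbers.
void checkLayout() {
  SockaddrSizes sizes = sockaddrSizes();
  assert(sizes.ipv4 == sizes.generic);   // IPv4 is padded to the generic size
  assert(sizes.storage >= sizes.ipv6);   // storage can hold the larger IPv6 one
}
```

On Linux the generic and IPv4 structs both come out to 16 bytes and the IPv6 one to 28, matching the arithmetic in the lecture. The generic struct is only ever used through a pointer, so its own size does not limit what it can point at.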

[INAUDIBLE] What's what? [INAUDIBLE] Oh, sorry。 Flow info and scope ID, sin6_flowinfo and sin6_scope_id。

I'm not even sure what those are。 Those are specific things to IP version 6。

The nice thing about this part of this thing, is that it is generic enough so you can have this that。

has extra stuff in it。 Flow info might have something to do with the actual back。

and forth between the server and client。 Maybe it's more efficient than something else。

And they wanted to add it in there。 Scope ID might be something else too。 I just don't know。

But it's beyond the scope of this class。 No pun intended。

But it's just extra information that goes along with it there。 OK。

So that's what a sockaddr looks like。 We've got this generic one, which

doesn't seem to do much for us, except that it has this family in there。

And then we've got these other ones that have the family as the first two bytes

and then extra stuff in them。 That's the important stuff。 In fact, there's not just two。

There's this other Unix one, and there are other ones as well。

Socket is a very generic type of structure。 OK。 So that's that。 Anyway, as I said, the version 6

one has some other stuff in it。 You will rarely ever declare variables that are of this type。 OK。

This is kind of like an abstract class or something in Java or those sorts of things,

where you've got this definition that you will never actually use。

It's just there so that other things that kind of inherit from it can be used。 OK。

So you will rarely ever declare an actual sockaddr。 You'll declare the one you want for the particular socket

you're trying to create。 OK。 And Linux actually does kind of one set for both。

because they want to make it generic。 OK。 What you're going to have to do, and we'll see this

when we actually write the code, is some casting associated

with these to get the right value out。 Now, remember from CS107 all

of that casting you had to do when you wrote generic functions。

There are times when you are, say, writing a function that has two void * pointers,

where you know the type when you're writing it, but the compiler has no idea

what type it is, because it expects void * pointers。 You know,

because you're the one writing the function, that, oh, really, these are char ** pointers

or something like that, so inside the function you will actually cast them。

It's going to be very similar when we go and figure out what these are

when we actually write these two functions we're about to write。 All right。

We are going to write two functions。 The one is create client socket。

The other is create server socket。 The client socket is a little bit easier。

even though it seems like there's more to do。 You're actually trying to reach out to this other computer。

and connect to it。 But there's not that much really to do。

You know that you're going to know the port number, and the address。

And then you have to set up the socket to do it。 And that's what we're going to do。

What we're going to do is confirm that we can actually talk to the IP address。

Well, confirm that the IP address exists for that host。

We're going to try to go to some host, and we're going to look it up。 We need the IP address。

Let's see if it exists。 Then we are going to allocate a new descriptor。

This is exactly like, except very different from, a regular file descriptor。

It's just like a file descriptor in that it lives in the file

descriptor table and it has an open file descriptor and so forth。

But you don't use it in the same way。 It's

two-way communication instead of one-way, which most of our other descriptors were。
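The two-way point can be seen without any network at all. This is a small sketch, not the lecture's code: it uses `socketpair`, which creates two already-connected local sockets, as a stand-in for a client and a server. The function name `roundTrip` and the "echo: " prefix are invented for the example, and it assumes short messages so that a single `read` returns the whole thing.

```cpp
#include <cassert>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Sends msg from end A to end B, has B answer on the very same descriptor it
// just read from, and returns what A reads back. Returns "" on any failure.
std::string roundTrip(const std::string& msg) {
  if (msg.empty() || msg.size() > 200) return "";  // keep the demo to one small read
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return "";

  char buffer[256];
  std::string reply;
  // A -> B
  if (write(fds[0], msg.data(), msg.size()) == static_cast<ssize_t>(msg.size())) {
    ssize_t received = read(fds[1], buffer, sizeof(buffer));
    if (received > 0) {
      std::string answer = "echo: " + std::string(buffer, received);
      // B -> A, writing to the descriptor B just read from
      if (write(fds[1], answer.data(), answer.size()) == static_cast<ssize_t>(answer.size())) {
        ssize_t returned = read(fds[0], buffer, sizeof(buffer));
        if (returned > 0) reply.assign(buffer, returned);
      }
    }
  }
  close(fds[0]);
  close(fds[1]);
  return reply;
}
```

A pipe would need two descriptors per direction to do the same thing; here each end is a single descriptor that is both read from and written to.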

You use this system call called socket to actually configure a socket descriptor。

When you use socket, it doesn't actually talk to any other computer yet。

It just sets up the socket so that you can populate it with the right details, and then use it。

Then we have to create an instance of sockaddr_in if we're doing IPv4。

And then that packages up the host and port number。

That packages all the details that we're going to connect to。 We'll do that。

And then now you've got this socket that you've set up。

Now you can actually go and connect it to the other computer。 That's what we're going to do。

And then if all goes well, you return that socket to whatever。

program or whatever function requested it。 OK? Question。 [INAUDIBLE] Conceptually,

how is this different from just setting up a server and a client like we did? Yeah, good question。

The question is, conceptually, how is this different from setting it up?

Now we're doing the details of setting it up。 So we actually called these functions before。

Now we're actually going to go and dig in and ask, what do they look like?

And that's where all this other stuff that we had to get to is involved。 Good question。

So the socket descriptor, is that one of the files?

A socket descriptor lives in the file descriptor table。 The type is a socket。

So it's not a read-only file, write-only file, et cetera。 It's a socket。

And it's got more details associated with it because it

needs to be two-way and potentially connect to other computers and so forth。

But because pretty much everything in Unix is a file, they still made it a file,

even though you can't use it quite like you would a local file descriptor。 All right,

let's actually do this。 OK。 Let's go and do this one now。

Let's see。 We want to do client-socket.cc。

OK。 This one is here。 You've also got this code so you can follow along as we do this。

We are going to do struct hostent *he = gethostbyname, and then the name that we passed in,

host.c_str(), like that。 Which is exactly what we talked about for 10 minutes earlier,

because we're just trying to figure out if it's there。 All right。 If he is NULL,

we're just going to return negative one, which says, look, we didn't get it。

We don't know what address you're talking about。 And then hopefully whoever's calling createClientSocket

is paying attention to the return value and then knows, oops。

I didn't get the socket that I requested。 And then we do int s equals。

And this is where we call the socket function to set up the socket。 In this case,

we're doing AF_INET。 And then we're doing SOCK_STREAM。 And let me talk about that in a second。

SOCK_STREAM is basically telling the operating system, please, please,

please handle this the way the regular old internet works。 And you might say, well,

how does the regular internet work? Well, I'm going to tell you。

When you send data between two computers, you don't send one long stream of data like all at once。

until you're done。 I mean, you kind of do。 But what happens in the way they've set it up is it's sent。

in packets。 A packet can be various sizes。 I think most--

I'm not exactly sure how big a packet is right now。

It might be something like 128 bytes or something like that。 It's relatively small,

but it may be 512。 But it has information about the data。 It has a packet number。

And it has information about the data itself。 And then it has some of that data。

And you send a whole bunch of these packets off to some, computer。

And they all take different paths around the internet。 Many of them go on the same path。

because it happens to be, the shortest path。 But it might not。

And some of your packets may go down one path。 Some might go down another path。

And they all meet up at the computer you're sending them to, the one that's trying

to hear them。 And they may end up in the wrong order。

If they end up in the wrong order, it actually doesn't matter, because the other computer is going,

OK, I'm going to listen for all these packets。 And if I get packet two first,

I'm just going to hold it aside until I get packet one。

And then I'm going to know that I got packets one and two。

And I'm going to order them all when I get them。 And sometimes packets get dropped。 In fact。

packets get dropped all the time。 Some computer goes down, or there's a glitch somewhere。

or whatever。 If a packet gets lost, there's a time out associated with that。 On both ends。

as it turns out。 When a receiving computer gets a packet, it sends an acknowledgment packet back。

It says, I got your packet number three。 And then the sending computer goes, OK, packet three got

through。 And it checks it off the list。 If the receiving computer doesn't get packet three, it

waits a little bit of time。 And then sends a-- actually, it just waits, I believe。

It just sits there and waits。 It's the sending computer that says, oh, I never got an

acknowledgment。 I better send another packet。 And it keeps sending them until it gets through。

And that's sometimes why a transfer can take a while。 That's sometimes why your file buffers and so forth。

Question。 [INAUDIBLE] What if the acknowledgment packet is lost? Yeah, a good question。

That's such a good question。 What if the acknowledgment packet gets lost?

It's exactly the same thing。 The sending computer says, oh, they never got my thing,

and sends another one。 And then the receiving computer, if it gets two copies of packet three,

ignores one of them。 It ignores the second one, because it already got it。
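The receiver's side of this, holding early packets aside and dropping repeats, can be modeled in a few lines. This is a toy sketch only: real TCP numbers bytes rather than whole packets and does this inside the operating system. The function name `reassemble` and the convention that numbering starts at 1 are assumptions made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Takes (packet number, payload) pairs in arrival order and returns the data
// that could be delivered in order. Packets that arrive early wait in `held`;
// packets that were already delivered or are already waiting are ignored.
std::string reassemble(const std::vector<std::pair<int, std::string>>& arrivals) {
  std::map<int, std::string> held;  // out-of-order packets, keyed by number
  std::string delivered;
  int next = 1;                     // the packet number we are waiting for

  for (const auto& packet : arrivals) {
    int number = packet.first;
    if (number < next || held.count(number) > 0) continue;  // duplicate: ignore
    held[number] = packet.second;
    // Deliver as many consecutive packets as we now have.
    for (auto it = held.find(next); it != held.end(); it = held.find(next)) {
      delivered += it->second;
      held.erase(it);
      next++;
    }
  }
  return delivered;
}
```

For example, arrivals numbered 2, 1, 2, 3 carrying "b", "a", "b", "c" come out as "abc": packet 2 waits for packet 1, and the second copy of packet 2 is dropped. If packet 1 never arrives, nothing is delivered, which is the situation the sender's timeout and resend are there to fix.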

So it's robust in that sense。 There is a problem called the--

it's called-- why am I blanking on it? The-- let me think about this for a second。 Say again?

Is this forward? No, no, no。 The Byzantine Emperor's problem。 That's what it is。 You know。

Byzantine generals problem。 Again, liberal arts education。 The Byzantine generals problem is。

what happens if there are two generals on two hills, and they want

to coordinate attacking a valley。 And they have to agree on the time

they attack the valley。 And let's say one general sends a message to the other general

and says, we're going to attack at 7 a.m。 And the other one sends a message back that says, OK,

we know we're going to attack at 7。 But what if you never get those acknowledgments?

And even if you do, how does the one who sent the acknowledgment

know that the other one got the acknowledgment? And so you can never quite exactly coordinate。

So there's a little bit of an issue there。 But generally, you send how many total

bytes you want at the beginning, and hopefully that gets through,

and you get an acknowledgment on that, and whatever。 It works out OK, as it turns out。

But that's how it works。 So basically, you send a bunch of these packets on。

Here's what SOCK_STREAM does。 SOCK_STREAM says to the operating system,

please take care of all of the details of that packet back-and-forth for me。

If you don't want the operating system to do that,

let's say instead you wanted to handle it all yourself。

You just want to send the packets and let them arrive in whatever order they arrive。

There's more work for you to do, especially if you want to keep everything,

like make sure that the other end got the packet。 You have to do all that。

And you may want to do that。 Let's say you're sending video data or something,

and you don't care about every byte or every packet。 Maybe you want to do that。

You want to send it that way。 You can say, hey, I'll take care of it,

because it doesn't matter that every few bytes the video gets a little blurry for a second。

Who cares? At least it continues, instead of having to be slowed down by this acknowledgment business

and all that。 So there are reasons to do that。 We will only stick to SOCK_STREAM in this case。
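The choice is made by the second argument to `socket`. The lecture does not name the do-it-yourself alternative; the usual one is `SOCK_DGRAM` (datagrams, which for `AF_INET` means UDP), and that is an addition here rather than something from the lecture. The sketch below creates a socket of the requested kind and asks the operating system what kind it actually made, using `getsockopt` with `SO_TYPE`. The function name `createdSocketType` is invented for the example.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Creates an IPv4 socket of the requested type (SOCK_STREAM or SOCK_DGRAM),
// reads its type back from the kernel, closes it, and returns that type.
// Returns -1 if the socket could not be created or queried.
int createdSocketType(int requestedType) {
  int s = socket(AF_INET, requestedType, 0);
  if (s < 0) return -1;
  int type = -1;
  socklen_t length = sizeof(type);
  if (getsockopt(s, SOL_SOCKET, SO_TYPE, &type, &length) != 0) type = -1;
  close(s);  // nothing was connected; we only wanted to inspect the descriptor
  return type;
}
```

Neither call talks to another machine: as described earlier, `socket` only sets up the descriptor. With `SOCK_STREAM` the operating system handles ordering, acknowledgments, and resends; with `SOCK_DGRAM` any of that is left to the program.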

All right? OK, so that's that。 If s is less than 0, again, that means we had a problem。

Return negative 1。 Now we have to do struct sockaddr_in address。

And we need to populate this address we're creating here。

because we're actually going to use it to do the connection。 So we are going to do this。 Now。

the first thing we're going to do is memset(&address, 0, sizeof(struct sockaddr_in))。

And like that-- actually, I guess we can just do-- I have that in mine, but you can also just

do sizeof(address)。

I think both would work。 We're doing that just to zero out the memory。 Again,

it probably doesn't matter, as it turns out。 But we just do it because it says, hey,

these things should be 0。

It doesn't actually matter。 I think we fill in all the bytes anyway, as it turns out。 OK。

Then we're going to do address.sin_family, basically socket internet family, equals AF_INET。

And then we are going to do address.sin_port equals。 Now。

here's where we need to make it into a network number。 When we set up the socket,

we have to make sure that it's in the correct network byte order。

I'm not sure why they couldn't just make the connect thing do this。

But you need to actually do the following。 You need to just say htons(port)。

And that stands for-- let's see, h-- I forget now。 The host-- yeah,

host to network short。 It's going to convert the port to the correct byte order。

It's going to flip those two bytes, as it turns out。 If you were building this on a big-endian machine,

that would be a no-operation function。 It would actually do nothing,

because it's already in that order。

That's the way it goes。 OK。 So we have to do that。
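What `htons` guarantees is easy to observe: whatever the machine's own byte order, the result has the high byte of the port first in memory. This is a short sketch with made-up helper names (`hostIsBigEndian`, `highByteFirstAfterHtons`); only `htons` and `ntohs` are real library calls.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// True if this machine stores the most significant byte of an integer first.
bool hostIsBigEndian() {
  const uint16_t probe = 0x0102;
  unsigned char first;
  std::memcpy(&first, &probe, 1);
  return first == 0x01;
}

// True if, after htons, the port's high byte is first in memory and its low
// byte second, which is the order the network expects.
bool highByteFirstAfterHtons(uint16_t port) {
  uint16_t wire = htons(port);
  unsigned char bytes[2];
  std::memcpy(bytes, &wire, sizeof(bytes));
  return bytes[0] == (port >> 8) && bytes[1] == (port & 0xff);
}
```

On a little-endian machine such as the x86 hosts most people use, `htons(0x1234)` yields the value `0x3412`, which is the two-byte flip described above. On a big-endian machine it returns its argument unchanged.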

All right。 So then we're going to set the address。 OK。 Ready for this?

We have to now cast it appropriately。 We have to cast it to a struct in_addr *,

like that, he->h_addr。

Why are we doing-- oops, what did I do here? Hang on。 address.sin_addr equals。 There。

did I forget the number of-- yes, I did。 I must have forgotten on there, but we won't yet。

So now we are casting that four-byte int we got to an actual struct

to make sure it goes into the struct correctly。 Why? Because it's a struct and not actually an int。

So it goes。 So we do that。 And then we call this connect function。 If connect。

We are passing in the socket, which has nothing associated with it yet。

We are about to do that。 We have another cast we have to do here, struct sockaddr *。

This is basically downcasting it。 It's losing information in the big picture。

It's losing information when you do this in the sense

that the compiler is going to be looking for a sockaddr instead of a sockaddr_in。

And you say, well, how do you ever get that information back? Well。

let's talk about that after we do this line: sizeof(address), and equals equals 0。

That's the actual connect line。 How does the socket-- or sorry, the connect function。

know that we are doing internet version 4? That's the question。 How does it know?

What did we say was the same about every one of these? The first two bytes are always the address

family。 So if we know that the first two bytes are always the address family, I mean,

we know that structs always have to be laid out in the order of the members。

We can actually look at that first member and say that member must be the type。

And then inside the connect function, it says, well, if the type is internet version 4,

let's handle it this way。 If it's 6, we'll handle it this way。 If it's Unix, we'll handle it this way,

et cetera。 So it tells that by knowing what it is。 All right。 OK, if that works, return 0,

meaning that we-- or sorry, return s, meaning that we actually got the correct socket。

[INAUDIBLE] Yeah。 OK, how does it know? We are creating this up here。 Now。

it's one of the types of internet address family, set up like structs, right?

The first two bytes have to be。 In fact, we set them here。 The first two bytes are the family。

So any time you get one of these generic socket address ones, if you look at the first two bytes,

it will be a family。 And then you can say, oh, I know what the various families are。 Therefore。

it must be of this type, sockaddr_in。 And therefore, they treat it that way。

So the connect function needs to know the different types。 And it looks at the first two bytes。

says, oh, that one's 2, or that one's 4, that one's 8, or whatever the number is。

And then it decides internally, oh, I know what this is。 That's how it does it。 Now。

you can't do that in C because you can't overload function declarations。 So therefore,

you have to do it this wonky way。 CS107, yay。 That's what that all comes back to。 All right, now。
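The family check described above can be written out by hand. This is only an illustration of the idea, not how `connect` is actually implemented inside the kernel. The function names `describe` and `describeIPv4Port` are hypothetical.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstring>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>

// Receives only the generic pointer, looks at the family stored at the front,
// and casts back to the specific struct that family implies.
std::string describe(const struct sockaddr *sa) {
  switch (sa->sa_family) {
    case AF_INET: {
      const struct sockaddr_in *v4 = reinterpret_cast<const struct sockaddr_in *>(sa);
      return "IPv4, port " + std::to_string(ntohs(v4->sin_port));
    }
    case AF_INET6: {
      const struct sockaddr_in6 *v6 = reinterpret_cast<const struct sockaddr_in6 *>(sa);
      return "IPv6, port " + std::to_string(ntohs(v6->sin6_port));
    }
    case AF_UNIX:
      return "Unix domain";
    default:
      return "unknown family";
  }
}

// Builds a sockaddr_in the way the lecture does, then passes it along as a
// generic sockaddr pointer, which is the cast used in the call to connect.
std::string describeIPv4Port(unsigned short port) {
  struct sockaddr_in address;
  std::memset(&address, 0, sizeof(address));
  address.sin_family = AF_INET;
  address.sin_port = htons(port);
  return describe(reinterpret_cast<const struct sockaddr *>(&address));
}
```

The cast to `struct sockaddr *` throws away the static type, and the `sa_family` field is what lets the receiving function recover it at run time.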

if for some reason you don't get a connection, let's say the connection actually is broken or something,

you have to close that socket that we created, because we opened--

or that descriptor, because we actually opened a descriptor using socket。 And so therefore,

you have to close it。 So if you get to this line,

it means something's bad, and you need to close it, and then you return negative one, like that。

Question。 So I noticed when assigning sin_addr-- [INAUDIBLE] Yeah。

Just the port you convert。 I was hoping nobody would ask that question。 The question is

a very good point。 In this case, you don't have to do the endianness。 I think because--

let me think about this。 Yeah, I'm not exactly sure。 There's a reason why-- I don't know why

you have to do it up here and you don't have to do it down here。 I'm not 100% sure why。 Yeah。

I'm not sure。 I'll look it up。 I'll try to look at it and see。 This works。 Well, we can try it。

I mean, it's worked for every example we've used so far in class, so I assume it works in that case。
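One likely explanation for the question above: the `gethostbyname` manual page documents the addresses in `h_addr_list` as already being in network byte order, so there is nothing left to convert, whereas the port arrives as an ordinary host-order number. The sketch below shows the same property with `inet_pton`, which is swapped in here because it needs no name lookup and also produces network byte order. The helper names `parseIPv4` and `firstByteInMemory` are made up.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <netinet/in.h>

// Parses dotted-decimal text into the 32-bit value stored in in_addr.s_addr.
// Returns 0 if the text is not a valid IPv4 address.
uint32_t parseIPv4(const char *dotted) {
  struct in_addr address;
  if (inet_pton(AF_INET, dotted, &address) != 1) return 0;
  return address.s_addr;  // already in network byte order
}

// The first byte of the value as it sits in memory.
unsigned firstByteInMemory(uint32_t value) {
  unsigned char first;
  std::memcpy(&first, &value, 1);
  return first;
}
```

The parsed address for 127.0.0.1 has 127 as its first byte in memory and equals `htonl(INADDR_LOOPBACK)`, that is, the host-order constant after conversion. Calling `htonl` on it a second time would scramble it on a little-endian machine.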

OK, so what do we have?

So that's the client connect function。
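Pulling the steps together, a client function along these lines might look like the following. It is a sketch reconstructed from the walkthrough, not the course's file: the signature is an assumption, the address bytes are copied with `memcpy` where the lecture uses a cast to `struct in_addr *`, and the extra check on `h_addrtype` is an addition.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstring>
#include <netdb.h>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Returns a connected socket descriptor, or -1 on any failure.
// The caller owns the descriptor and must close it when done.
int createClientSocket(const std::string& host, unsigned short port) {
  struct hostent *he = gethostbyname(host.c_str());
  if (he == NULL) return -1;  // no such host
  if (he->h_addrtype != AF_INET || he->h_length != sizeof(struct in_addr)) return -1;

  int s = socket(AF_INET, SOCK_STREAM, 0);  // just a descriptor, no traffic yet
  if (s < 0) return -1;

  struct sockaddr_in address;
  std::memset(&address, 0, sizeof(address));
  address.sin_family = AF_INET;
  address.sin_port = htons(port);  // port goes from host to network order
  // The looked-up address is already in network order: copy its four bytes.
  std::memcpy(&address.sin_addr, he->h_addr_list[0], sizeof(address.sin_addr));

  if (connect(s, reinterpret_cast<struct sockaddr *>(&address), sizeof(address)) == 0) {
    return s;  // connected
  }
  close(s);  // socket() succeeded but connect() failed: release the descriptor
  return -1;
}
```

A caller would check the result before using it, for example `int fd = createClientSocket("127.0.0.1", 12345); if (fd < 0) { /* report the failure */ }`, where the port number is only a placeholder.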

All right, let's actually make it and see。 make client-socket。

See if that works。 OK, oh no。 Oh, I know why, because it just

needs to make it into a-- if we just type make in this case, it would work。

It's actually a library function that doesn't need a main function。 We don't have main in that。 OK。

any other questions on that? Yes。 [INAUDIBLE] Why do we close it? [INAUDIBLE]

If the call succeeds, we pass it back。 The user needs to close it。

Because you're setting up the socket, right? So down here, you're saying,

let's actually connect to that computer, and now the socket is an open file--

it's an open file descriptor。 Or rather, it's a connected file descriptor, I should say。

And then you pass it back to the user, who uses it。 [INAUDIBLE] Yeah, yeah, the question is,

is that kind of like how subprocess passes back an open file descriptor,

so the user has to close it? Exactly。 Notice that we don't close our descriptors when we're using them。

It turns out the stream, the socket stream, does the closing for us。 But it closes it。 Yeah。

good point。 The using function needs to actually close it,

because that's the only one that knows when it's done。 OK。

So let's now look quickly at the server socket file。

OK。 This one's going to be somewhat similar。 But we've got a couple extra little details to do to work on here。

OK。 So in createServerSocket, we're going to do basically the same thing。 And by the way,

we're only doing internet IPv4 in this case。 AF_INET, OK。 SOCK_STREAM, 0。

So we don't need to get any IP address in this case。 Because the IP address is our IP address。

We're trying to set up a server on our computer。 So it actually turns out that it doesn't matter。

We could use one of our IP addresses。 And I say one of our IP addresses。

Because your computer often has more than one IP address。 If you've got Bluetooth,

if you've got Wi-Fi, if you've got an Ethernet cable plugged into your computer,

you are going to have multiple IP addresses。 So you can actually say, only

allow connections on this Wi-Fi or whatever。 Or as it turns out, you can say any,

which we'll see in a second。 OK。 All right。 If that doesn't work。

we are going to return negative one。 It didn't even open yet, so we don't need to actually close it。

But if it did open, then we can go on。 We can go, OK, struct, and same thing as before, sockaddr_in

address。 This is going to look relatively familiar。 And then we're going to

memset(&address, 0, sizeof(address))。 OK。 And then we are going to set the sin_family。

And this is, again, how the socket function-- or in this case, the bind function--

is going to know how to interpret our downcasted sockaddr。 OK。

So we're going to do that。 address.sin_family equals AF_INET。

OK。 And then we need address.sin_addr.s_addr。 OK。 In this case,

now we do need to do this。 So I'm wondering if we did need to do it before。 No,

I don't think we did, because it always worked before。 I'll have to look at it。 See why this is OK。

So what we're doing now, we are doing htonl, meaning

we are going to now convert the IP address that we are using from host to network form

in this case。 So htonl(INADDR_ANY)。 We could have put our own IP address, but we don't want to,

because we want to, just in this case, allow it to connect on any of the available addresses。

You don't have to do that if you don't want to。 And then address.sin_port equals htons(port),

like that。

And then now we need to make it so that we are connected to that port for the operating system,

and we need to make it so that we're listening on that port。 We have to do two things。

We have to do what we call bind。 We bind the socket to the address。 And again,

you have to downcast it: struct sockaddr *。 And then same thing。 If bind equals 0,

then let's do another thing called listen。 The listen system call says, OK,

now you've got this port that you bound to。 Now you actually say, oh, whenever I get somebody

trying to connect, forward that along to my program。 And you do that with listen。

And then backlog。 Did we type backlog there? Did we have backlog in there at all? Oh,

it's passed in。 Yeah, I'm not sure what that-- oh, I

think that normally we pass it in as zero or null or whatever。 It doesn't actually matter。

I'm not sure what the backlog one actually does。 Return s。 OK。 So in other words,

if you can bind the socket, and you can listen on it,

return that socket, which we've now correctly set up to be a server socket。 Otherwise,

we close the socket because we had a problem,

and we return negative one。 And that's how that works。 Yeah? [INAUDIBLE] What is line 22?

What does line 22 do? OK, that sets up the address that we are-- it basically

tells, in this case, the bind and listen

functions that you can listen on any of my IP addresses。 That's what it means。

It is different than any of its ports, because your computer has an IP address through the Wi-Fi。

Your computer also, believe it or not, has an IP address through Bluetooth。

And it also has one if you connect an Ethernet cable。 So there are different IP addresses, which

are associated with the connection in this case。 As it turns out, I believe

INADDR_ANY actually is zero。 So you probably didn't need to do this at all。

But that wasn't the case in the other one。 So again, I'll look that up and let you all know。
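The server-side steps (socket, fill in the address, bind, listen) can be put together as below. This is a sketch based on the walkthrough rather than the course's own file; the signature and the default backlog of 128 are assumptions.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns a descriptor that is bound to the given port on all local addresses
// and is listening for connections, or -1 on failure.
// Passing port 0 asks the operating system to pick any free port.
int createServerSocket(unsigned short port, int backlog = 128) {
  int s = socket(AF_INET, SOCK_STREAM, 0);
  if (s < 0) return -1;  // nothing was opened, so nothing to close

  struct sockaddr_in address;
  std::memset(&address, 0, sizeof(address));
  address.sin_family = AF_INET;                 // how bind interprets the cast pointer
  address.sin_addr.s_addr = htonl(INADDR_ANY);  // accept on any local IP address
  address.sin_port = htons(port);

  if (bind(s, reinterpret_cast<struct sockaddr *>(&address), sizeof(address)) == 0 &&
      listen(s, backlog) == 0) {
    return s;  // bound and listening
  }
  close(s);  // bind or listen failed: release the descriptor
  return -1;
}
```

Because `INADDR_ANY` is zero, `htonl(INADDR_ANY)` is zero in either byte order; writing the conversion anyway keeps the line correct if a specific address is substituted later.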

And now we have set it up to start listening。 One other question。 Yes? [INAUDIBLE] OK,

so let's see about your question。 The question is, what's stopping the client from connecting

and then reassigning things or whatever? So remember, the client--

it's a very opaque procedure。 The client is a different computer。

that's trying to request something from the server。 And it just says, hey。

I want to talk to port 1234。 And then if your server is listening to it。

it just sets up a connection that says, OK, start giving me data for it。

It doesn't do anything else locally on the server computer。

You can't reach over to the server computer and change any of the ports over there。 [INAUDIBLE] Oh,

it can try to communicate with other ports。 [INAUDIBLE] Well, it can try to-- well, OK。

So let's say you send a message, a random message, to a web server on port 80。

And you're trying to get something from it-- it's only going to talk web。 Again,

it's not going to be able to get any other information from it。

You can take that information and pass it to other programs and whatever。

And then they can talk to the web server。 It doesn't really matter, in the sense that

what you do on your computer with that socket is irrelevant。 If all you talk is web, HTTP,

then it'll be fine。 And maybe multiple programs can talk。 Sure。

But-- maybe I'm misunderstanding your question。 [INAUDIBLE]

Can a server be given a socket that hasn't been requested directly? [INAUDIBLE] No, it can't。

I mean, it's local to the machine。 A socket is local to a machine。

It happens to be maybe connecting to another machine。

on a particular IP address on the other machine。 But that's it。

If there's no other-- you can't request a socket。 You can't request a port that's not listening,

first of all。 And you can't request anything that's--, stop by ready for class。 We'll check。

Come on back。 We're ready for class。 But anyway, any other questions on this? Sorry, but not good。

Yeah。 [INAUDIBLE], Yeah, I'm not exactly-- I'm not 100% sure what the backlog does。

Let's look it up。 Bind。 And bind。 Let's see。 Bind to a socket。 Let's see。

Bind or-- is that the-- was it bind? Hang on。 It was-- oh, it's listening。 Sorry。 Listen。

Listen for connections。 I'm sorry。 Backlog。 OK。 The backlog argument defines the maximum length

to which the queue of pending connections may grow。 So basically, if a connection request arrives when

the queue is full, the client may receive an error。 OK。 Yeah。 So this is saying how many different connections

you're going to allow。 And I think the maximum is 128, actually。

I don't think you're allowed to say do more。 But it's going to make it so that your program can take a little bit。

of time to set up a connection。 And then the operating system will。

keep a backlog of the other connections, and then forward them to you one at a time。

Believe that's what it is。 Yeah。 [INAUDIBLE], Yeah, good question。 OK。 So the question is。

I'm getting confused about port, socket, and IP address。 OK。

An IP address-- let's start with the reverse order。 An IP address is your computer's address。

that the rest of the world knows about。 OK。 My IP address on this machine, on myth64 rather,

is IP address。

ifconfig gives you the IP address。 And it's right here。 The IPv4 address is 171.64.15.29。

The IP version 6 address is this big long one here。 And the scope-- ooh, the scope。

There's where the scope comes in。 It's link, for some reason。 So that's what your IP address is。

It's the address the rest of the world talks on。 When somebody from the rest of the world wants to talk to you。

they can ask for a particular port of yours。 One of your programs is listening on it。

So maybe we want to SSH into myth。 That's port 22。 So we go look up the IP address,

and then we look at port 22, and that gives us an SSH connection。 If we look for port 80。

that gives us a web connection。 If we look for port 443, that's a secure web (HTTPS) connection,

or something like that。 So that's what the port is。 It's defined as, on your computer,

what port are you listening on for various types of things。 Now, a socket is the file descriptor

that is associated with a connection, either to another computer or one you're listening on。

The socket itself is the file descriptor that you can read from and write to, and you can do both,

as it turns out。 Does that help answer your question? Yeah。 Good。 Yes, sir。 So we can have multiple,

like, other computers that are connected to a server, in other words,

and would it be like there are multiple different sockets?

[INAUDIBLE] Ah, good question。 So yeah, this is a very good question。

I haven't really talked about this yet。 The question is, wait, so we've got--

if multiple computers are trying to talk to you on a particular port, does that

mean they all have different sockets? Here's what really happens。 And this。

we kind of glossed over this。 Once you've set up a connection to another computer,

you actually do it on a completely different socket。 The only thing you do on the original socket

is listen for connections。 Then you get another socket to go and talk

to the client, one that you can do the connection on。

So once the client reaches you on that initial socket, you actually hand over the connection

to a different socket to have that conversation。 So let's say you have

1,000 different computers connected to you。 You will then have 1,000 other sockets

that they're connected on and talking back and forth, and you're still listening on port 80

for the next connection to come in。 That's what happens there。 Good question。 Yeah。 [INAUDIBLE]

When you want to listen to various ports, do you have to do-- well, do you have to do processes?

Not really。 I mean, you can listen to multiple ports in a particular process,

just like you can have multiple files open。 [INAUDIBLE] Oh, do threads share sockets?

I believe they do。 Yes, I believe threads share sockets。 Now, again,

like in your thread pool for networking, for instance, when we do that, we set up a new connection

that's on a different socket when we do the setup of

the connection back to the client。 It's a different socket altogether。 OK, any other questions?

All right, we'll see you all Wednesday。
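As a recap of the last few answers, the sketch below runs a tiny server and client inside one process over the loopback address. It shows that `accept` hands back a separate descriptor for the conversation while the listening descriptor stays as it was. The function name `acceptYieldsSeparateDescriptor` is invented, the backlog of 8 is arbitrary, and using one process for both ends is a simplification so the example runs on its own.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Listens on a free loopback port, connects a client to it, accepts the
// connection, and sends one byte from client to server. Returns true if
// accept produced a descriptor distinct from the listening one and the byte
// arrived intact.
bool acceptYieldsSeparateDescriptor() {
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  int client = socket(AF_INET, SOCK_STREAM, 0);
  int connection = -1;

  struct sockaddr_in address;
  std::memset(&address, 0, sizeof(address));
  address.sin_family = AF_INET;
  address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  // 127.0.0.1 only
  address.sin_port = htons(0);                       // let the OS choose a port
  socklen_t length = sizeof(address);
  struct sockaddr *generic = reinterpret_cast<struct sockaddr *>(&address);

  bool ok = listener >= 0 && client >= 0 &&
            bind(listener, generic, sizeof(address)) == 0 &&
            listen(listener, 8) == 0 &&
            getsockname(listener, generic, &length) == 0 &&  // learn the chosen port
            connect(client, generic, sizeof(address)) == 0;  // queued until accepted
  if (ok) {
    connection = accept(listener, NULL, NULL);  // new descriptor for this client
    ok = connection >= 0 && connection != listener;
  }
  if (ok) {
    char sent = 'x', got = 0;
    ok = write(client, &sent, 1) == 1 && read(connection, &got, 1) == 1 && got == sent;
  }
  if (connection >= 0) close(connection);
  if (client >= 0) close(client);
  if (listener >= 0) close(listener);
  return ok;
}
```

The `connect` call can succeed before `accept` runs because the operating system completes the handshake and parks the connection in the backlog queue set up by `listen`. A server with many clients calls `accept` repeatedly and ends up with one descriptor per client plus the original listening one.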

P18:Lecture 17 Web Proxy - main - BV1ED4y1R7RJ

Okay, let's get going even though not a ton of people here but we'll get going。 So。

how's the assignment going? Assignment, what is it, six? How's that going? Going right。

Yeah, it's due tonight。 I will have office hours today

right after class for an hour, or an hour and 15 minutes

or so, for anybody who wants to stop by and get some help with that。 And hopefully it's not too bad。

Hopefully the thread pool is an interesting assignment。

I mean it kind of has all the different parts to it。 You've got to have mutexes,

you've got semaphores, you probably do have to have a condition variable, condition_variable_any, in there。

Maybe you did it some other way, but it's likely you'll need that。

And there's various kind of nuances to that but that's a pretty robust class to be building。

And so good job on doing that。 There are two more assignments。

We're now going to be on like a Thursday-to-Thursday schedule。

The next assignment will be out tomorrow although technically you can go look at it right now。

I'll show you the link in a minute。 And then it will be due next Thursday。

The next one will come out Thursday, due the following Thursday, which I think technically

is actually after class ends。 I'm not sure I'm allowed by school rules to actually have an assignment due after classes

end。 Like, I think that's not allowed。 So I'll have to figure something out。

Maybe I'll make it due Wednesday but then give everybody a free day until Thursday or。

something like that。 I'm just skirting the rules a little bit but I do want to give you enough time。

But I also understand that you've got finalists to study for and all that so I'll, we'll figure。

something out in that regard。 But so today, so we are mostly done with the new material for assignments。

In fact, we are pretty much all done with new material for assignments. We have one more topic, which is non-blocking I/O, which is basically I/O that doesn't block when you call accept and so forth. And there are reasons we might want to use that. We will talk about that on the last day of class, maybe the previous day as well.

After today we actually have three more lectures, right? Monday's a holiday. Yay, more of that. Next Wednesday is lecture, and then the following Monday and Wednesday. But really there's no new stuff that's going to be on assignments. The non-blocking I/O, I might ask some very high-level questions about it on the final, but that would be about it. So we've basically gotten through all the stuff for the quarter.

Next week we will talk a little bit more about what we're going to start today, which is about your next assignment, actually. And then we will also do a little, hey, here's the big picture of 110 and all the different things you should have learned in a big-picture sense.

So that's really all we've got going. Today I wanted to give you a relatively deep dive into the assignment that you're going to start tomorrow. Why am I doing that? I think it's an assignment that you will enjoy, but you've got to wrap your head around it, and wrapping your head around it takes some time. Not that the other ones didn't, but this one definitely takes some time to get your head wrapped around. So we're going to take most of the class, depending on how many questions there are, I guess, to go over the assignment starting tomorrow, and then I will introduce the topic that we will have for the final assignment. It's called MapReduce, and it kind of ties everything together. You've got to use all the different parts of the things you know to do the final assignment.

So tomorrow's assignment is a web proxy. So here's what a web proxy is. By the way, you can go here and download the actual assignment. Let's see if I can click on it now. Let's see if we go here and download it. That should be the assignment. Now, tomorrow morning I will push out the actual repos, but feel free, if you are not still working on the current assignment, to go and read through this. It's a relatively long document, as you can tell. It's not too bad. We're going to go through it right now. So there we go. I will go back here, and here's what it is all about.

What is a web proxy? Well, a web proxy is a server that sits between your web browser and some web page that you want. I will show you how to do this today: you can set up your web browser to use a proxy server whenever it requests websites. So instead of going to the actual website and requesting it, the browser goes to your web proxy and makes the exact same request. The web proxy forwards that on to the actual web server you're looking for, gets the result back, and then forwards it back to you.

Now why would we care about doing that? Well, there are a lot of reasons you might want to use a proxy. You might want to block access to certain websites. This is kind of like a firewall sort of thing, if you want to think of it that way. Maybe you want to put a proxy up because, if you run a company, you don't want anybody going to Facebook.com during the work day, or something draconian like that. Maybe you're a parent and you don't want your kids going to certain websites that will remain unnamed, et cetera. That would be one reason to have a proxy. You set that up, the browser goes to that, and so forth. And by the way, if you ever have kids, they will be able to get around whatever proxy you put in place. So don't think it's some sort of full solution. This is why only you should own the password to your router.

But anyway, so that's one thing you might do. You might want to block access to certain documents. I mean, let's say you have some giant document. If you're on a plane, they always have you go through a web proxy such that you can't download some giant document or watch YouTube or whatever on the plane, because they have very limited bandwidth, because they're flying around in the sky. And then maybe certain types of files. I don't know why I put zip files up there, but maybe they're too dangerous because they can have viruses or something, and you want to have those blocked.

Maybe you say, "I don't like Liechtenstein and I don't want any web pages coming from there," for whatever my thing is. Anybody from there? Sorry if I said it. Okay, probably not. There's only 30 or 40 thousand of you anyway. But anyway, I don't even know if they have websites there. Probably a little bit.

Websites hosted in Liechtenstein? I bet they do. You also might want to act as an anonymizer. Actually, I think this is an important reason you might want to use a proxy of sorts. You want to act as an anonymizer, to strip data from headers, to strip what your real IP address is, and so forth. How many people have heard of the Tor browser before, or onion routing? A few people. Okay, here's what that is. That is partially a web proxy, partially a thing that wraps everything in high-level encryption. The Tor network was put in place for people in countries where they may be discouraged from using the internet, or where, if they use the internet, they might end up getting arrested and so forth. These people are hopefully not breaking the law. They might be breaking the law in their country, but hopefully they're doing good things in the world, and they might want to use the internet to communicate with other people and not necessarily give away their information and so forth. Now, it certainly can be used on the so-called dark web by people who are doing malicious things.

Let's assume people are doing good things. What it does is, basically, you go to a proxy server on this Tor network and it takes your request and encrypts it (well, you probably first encrypt it yourself), and then sends it on to another proxy server, which encrypts it again and sends it to the next one, which encrypts it again. All along the way it further anonymizes where you came from, and it's not a bad way to be able to use the internet without fear of somebody being able to figure out who you are. And so this is important in countries where there are people fighting a good fight who otherwise might get in trouble with their government or so forth. It is not foolproof.

A couple of years ago some student at Harvard, of all places, actually sent a bomb threat into the school. He said there's a bomb in this building, because he didn't want to take his final exam or whatever. He sent this through the Tor network, which anonymized his thing completely, but the police were pretty clever about that. What they did was they said, well, we have the logs of who's using the internet on campus; at that time, who was using the Tor network? And they found like two people using the Tor network: one was some graduate student who was researching the Tor network, and the other one was this kid in his dorm room, who was the one who did the bomb threat. So they figured it out by using external data, not by knowing exactly what his IP address was or whatever. They just said, well, who was using this network? So be a little more careful than that if you're going to do silly things like that.

Why else might you want to use a proxy? Maybe you want to take certain images and do something interesting with them. One of my favorite things I found, oh man, it was probably ten years ago at this point, is this thing called the Upside-Down-Ternet, which is a web proxy. It was a person who had an open Wi-Fi network. Now, this used to be a thing; most Wi-Fi networks are now not open. They have passwords, but it used to be that most everybody had an open Wi-Fi network, and this one guy realized that all his neighbors were stealing his Wi-Fi because it was open. And so what he did was he set up a proxy on his computer. He basically set it up so that traffic went from the router through his computer and then on to the internet. It basically was a couple of little scripts, and what it did was, any image that got requested through it, it just took the image, flipped it over in an image program, and then served it to the user. So whoever was browsing using his internet would get all these upside-down images, which is probably annoying enough that you'd stop using the Wi-Fi.

And then he said, well, let's be even a little more diabolical about it. Besides the upside-down one, he said you can actually make the blurry internet, where instead of flipping it, it would just moderately blur all the images, so that you kind of stare and think something's wrong with that image, and it does it with all the images. So just a little bit blurry, and he figured that would have the same effect.

You could also use it to, and this guy does the exact same thing, intercept all traffic and just forward it somewhere else. This happens when you go to the airport and there's a paywall: you go to any site and it immediately forwards you to some proxy that says please pay for internet, or agree to these terms and conditions, or whatever. This person also had it so that it would just go straight to, he switched it every so often, where did it go? Maybe it's not there. It would go straight to kittenwar.com. So anytime they were browsing, they'd just go to kittenwar.com, where I guess you decide between cute kittens and you can pick and choose whatever. That's what he did, too. So he reduced the number of people using his open network.

And then he actually got a little note from the guy that owns kittenwar saying, every so often I get extremely irate emails from people claiming that kittenwar.com is playing host to some kind of virus, and I just point them back to you and say it's probably your thing. And then he goes, yeah, fine, I think. So anyway, that's another reason you might want to use a web proxy: if you want people to just look at kittens all day.

Okay, so, oh, and the other one. This is the big one, and this is in fact one that you will be doing for your assignment. You may want to cache the actual requests. Now, a cache, as we've probably talked about before, is just a local copy of whatever you've requested. So in this case the local copy is whatever data comes off a website that allows itself to be cached. What kinds of things might be cached? The Google logo, right? The Google logo is the same every single day. I mean, they change it when they do the special days or whatever, but the actual Google logo itself doesn't change for years and years and years. Why would you go and request it from the website every single time you go to the website? Just keep a local copy, and that eventually times out, like over many days or months or whatever. But you don't need to go request it. So if you have it locally, you don't need to waste bandwidth going out to the internet as a whole and getting it, and that's important. Now, your browser already does a lot of that for you.
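To make the "local copy that eventually times out" idea concrete, here is a minimal sketch of a time-limited cache. It is not the assignment's cache class: `ResponseCache`, `CacheEntry`, and the choice to pass the current time in as a plain number of seconds are all assumptions made for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical cache entry: the stored response text plus the absolute
// time (in seconds) after which the copy is considered stale.
struct CacheEntry {
  std::string response;
  long expiresAt;
};

// Hypothetical in-memory cache keyed by the request text.
class ResponseCache {
 public:
  // Remember `response` for `ttlSeconds`, starting from time `now`.
  void put(const std::string& key, const std::string& response,
           long now, long ttlSeconds) {
    entries[key] = CacheEntry{response, now + ttlSeconds};
  }

  // Returns true and fills `response` if a fresh copy exists.
  // A stale copy is dropped so the caller goes back to the origin server.
  bool get(const std::string& key, long now, std::string& response) {
    auto it = entries.find(key);
    if (it == entries.end()) return false;
    if (now >= it->second.expiresAt) {
      entries.erase(it);
      return false;
    }
    response = it->second.response;
    return true;
  }

 private:
  std::map<std::string, CacheEntry> entries;
};
```

A proxy built this way would call `get` before contacting the origin server and `put` after receiving a response that permits caching; a real cache would also have to honor the server's caching headers, which this sketch ignores.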

Your browser keeps a cache of pages. Sometimes it's really bad when it keeps a cache and you're working on a website, and you're changing things and it just never updates, and you've got to figure out how to manually update it. For this assignment you will have a cache both in your browser, because this is an assignment where you're actually using your browser, and in the actual program. You should clear the browser data as frequently as you can, or at least frequently enough that you don't get into the issue of, wait, it looks the same, something must be wrong. So that's that.

Okay, so the assignment itself requires you to use your browser, and it does require you to have your browser pointing to a particular computer, namely one of the myth machines. We suggest using the Firefox browser, mainly because we know that most of you don't normally use Firefox, and it's probably easiest to set up Firefox to do the proxy thing and then just use your other browser to do all your regular browsing. I happen to be using Firefox for this presentation, so I'll do a little bit of both right now. We'll actually do it right now for the assignment. I'm going to show you how to do this.

First things first. I'm going to 110, spring, assignment seven, and then samples. Actually, let's just make the original here. This is the starter code, right? There's a lot of files; we'll talk about those files in a bit. The starter code has a proxy, which is called proxy, and it basically chooses a port that's based on a hash of your user name. So if you run proxy and then somebody else runs proxy, you're probably not going to overlap port numbers, and there are enough port numbers to go around, so it's probably okay. If you find out that it's always using the same port number somebody else has, you can give it a specific port number if you want to.

Anyway, this is now on, what is this, myth64, listening on port 19419. So if we go back to Firefox and go to Preferences, and in Preferences we type in "proxy" down here and then click on Settings, it will say a bunch of things. It probably says "No proxy" right now, or it may say "Use system proxy settings"; you can set your global ones in your system. What you're going to do is set a manual proxy. You can see I've done this before. You set your manual one to whatever server you have, whatever myth you have. In this case it's myth64, and then, is it 19419? It is, because my name didn't hash differently from last quarter. So anyway, it's myth64, port 19419. You should also select "Use this proxy server for all protocols". If you don't select that, you can go to some websites that bypass your proxy; you think everything's working great and it's not, because it's going past your proxy. So make sure you click on that.

And then when you do that, if you try to go to a website, so let's try example.com, it should say "You're writing a proxy", right? Well, that's all it does so far. The starter code basically intercepts your request and then feeds back a page that says you're writing a proxy. That's all it does right now.

Okay, now it turns out that the way we've got this written, the way you're going to write it, it only works for HTTP websites. It is possible to do it for HTTPS sites, which is most sites these days, so you have to be a little careful that you don't go to sites that are HTTPS. You'll get halfway through the assignment and go, why isn't this working? And then you remember, oh, that's an HTTPS site. It will work for the starter code, because again the starter code just completely ignores anything about the request and feeds this back. So if I go to google.com, which is going to be HTTPS... oops, maybe not. It's interesting. Let's try example.com. Maybe I didn't set it up there. Okay, that one worked. Let's try stanford.edu. Yeah, that one worked. Maybe if you explicitly do HTTPS it won't work. Let's see... no. That's it. So if you go to HTTPS it probably will give you an error there. So let me do this: I'm going to keep the preferences up so I can still do the websites, because I change this proxy setting every so often, every time I do this. Okay.

No proxy right now. Okay, so that's what it looks like when you run it. Oh, you know what I should do? I should show you what happens with the regular one. So if we run the proxy solution under samples, okay, it will say the same sort of thing and it'll be set up. And then if we go to a website, well, I'll just go back and forth between the pages. If we go to example.com, example.com happens to be a page that allows itself to be cached. So if you go to the page, it looks like that, there, "Example Domain", and your program is going to say: okay to cache the response, so caching the response under the hash, and there's a big long hash, for 604800 seconds. So for a while. It's just going to cache that; it's going to keep it locally on the myth machine for you. And then the next time you go to example.com, it's going to come back faster, number one, and it's not going to say anything again there, because it's already cached it for you. So if example.com did change, your computer wouldn't update that until it timed out sometime later. Many sites don't have that. There is one, let's see, I've got it in the actual... let's see if this actually continues to work. It seems like it will right now.

Okay, so there's not much going on in the starter code, but you get to change that. You should just leave your settings and then use a different browser generally for doing the other stuff. If you want, you can use telnet instead, if you don't want to bother with the browser or you want to see more details about what's going on. You can do that as well. So let's try it. In fact, I'm going to do this one. This is what I want to actually send; I want to do the whole GET line there. Whoops. Well, there we go. Okay, let's see, we are on myth64.

Right, so if I go telnet and then myth64, port 19419, it will allow me to connect. Great. And then if I type that GET line (ipify is just a site that tells you your IP address, it turns out), then you have to type Host, and I guess it's www.ipify.org, and then another newline. Let's see, did that actually work? No, I don't think it worked. Oh, it's just api, sorry. Let's try this again. By the way, I'm just doing the telnet from my Mac; it doesn't even need to be from another myth machine. Let's do this again, and then Host: api.ipify.org, and then another newline. There we go, that's what we're looking for. It just sends back some JSON, but you can see all the headers and things that it sends back too, and your program should be capturing all of that data and passing it through. I'll talk about some other things that it has to do as well. By the way, this tells you that it's JSON that's getting sent back, and remember we talked about what JSON is. It basically tells you what your IP address is for your computer, or your local one, anyway. That's how you can use telnet to do this as well.

Okay, if you're off campus, I mean, some people said they're going to be traveling or whatever, you do need to log on to the Stanford VPN to do this, because you're going to have to go directly to one of the myth machines. You can't go to a specific myth machine from off campus, so you're going to need to log into the VPN for that. All right, like I said, you can use telnet, and you can actually run the samples one, which does exactly what yours should be doing in the end.

Okay, so let's go through what the assignment is actually going to be doing. There are four parts to this assignment. First, what you're going to do is a sequential proxy. Eventually you will use a thread pool, not necessarily yours unless you really want to. We've given you a working version, not that yours isn't working, but we've given you a version that definitely works, so you don't have to do more debugging on the thread pool. But first you're going to write a sequential version, and the first thing you're going to do is just make it into an actual proxy that intercepts the requests, passes them on to the intended server, and then gets the response and passes that back to your browser.

You need to support three different types of HTTP requests. We have seen GET before, and GET is one that basically just grabs the web page. GET, by the way, is not supposed to have any side effects on the web server. In other words, you shouldn't pass data that the web server is going to use to update its own state. That's really not the way GET is supposed to work, because web browsers will often make one, two, or more GET requests for the same data for various reasons. And so in this case you don't want to send any data that the server might use to make changes. But that's how it normally goes when you're doing web pages anyway.

You also have to do a request called HEAD. This is a new one. This one does exactly the same thing as GET, except the only thing that you get back is the headers. Now why might you want to do that? Well, sometimes a browser might request the head from a particular page knowing that it's about to make an actual GET request, and it may need to set some things up. What you get back is how big the payload is going to be, whether it's going to be encrypted or not, and so forth. So sometimes they'll do that, and you should support that. The difference between GET and HEAD is simply a matter of knowing that there's no payload for the HEAD; otherwise you're going to do exactly the same thing. So that's an easy one.

And then PUT is one where you have to actually send data to the website. PUT is almost the same thing, except that after all your headers you have a payload that you're then sending to the actual server. So you have to capture the payload from the request and then forward the payload on to the server that you're requesting from. Okay, so far so good? All right.
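The differences between the three methods come down to where a payload appears. As a small sketch (the function names here are made up for illustration and are not part of the starter code, and this deliberately ignores special cases such as responses with status codes that never carry a body):

```cpp
#include <string>

// The three request types described above.
bool isSupportedMethod(const std::string& method) {
  return method == "GET" || method == "HEAD" || method == "PUT";
}

// Does the client's request carry a payload that the proxy must read
// and forward to the server? Of the three, only PUT does.
bool requestHasPayload(const std::string& method) {
  return method == "PUT";
}

// Does the server's response carry a payload that the proxy must read
// and pass back? A HEAD response is headers only.
bool responseHasPayload(const std::string& method) {
  return method != "HEAD";
}
```

With helpers like these, the GET and HEAD paths can share the same forwarding code, with only the payload-reading step switched on or off.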

Then the request is going to look like this. It's going to be GET, and then, we've seen this a number of times, let's say you're going to cornell.edu/research. Remember, you need an HTTP/1.1 there. For any web pages you are sending back, you can feel free to use HTTP/1.0; it's not really going to make a difference. They're basically the same format. But that's that. When you do that, you forward the request to www.cornell.edu. And in the request in that case, because you already know you're forwarding it to that server, you just need to put the actual path in there.
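Here is a rough sketch of that rewriting step: splitting the full URL the browser sends into the server to connect to and the path to put in the forwarded request line. `Target`, `splitUrl`, and `forwardedRequestLine` are hypothetical names, not the provided request class, and the sketch ignores port numbers, `https://` URLs, and malformed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: where to connect, and what to ask for.
struct Target {
  std::string host;  // e.g. "www.cornell.edu"
  std::string path;  // e.g. "/research/"
};

// Split an absolute "http://host/path" URL into host and path.
Target splitUrl(const std::string& url) {
  const std::string scheme = "http://";
  std::string rest = url;
  if (url.compare(0, scheme.size(), scheme) == 0) {
    rest = url.substr(scheme.size());
  }
  std::size_t slash = rest.find('/');
  if (slash == std::string::npos) return Target{rest, "/"};
  return Target{rest.substr(0, slash), rest.substr(slash)};
}

// Build the first line of the request that goes to the origin server:
// only the path appears, because the connection already goes to the host.
std::string forwardedRequestLine(const std::string& method,
                                 const std::string& url,
                                 const std::string& protocol) {
  return method + " " + splitUrl(url).path + " " + protocol + "\r\n";
}
```

For example, `forwardedRequestLine("GET", "http://www.cornell.edu/research/", "HTTP/1.0")` produces the line `GET /research/ HTTP/1.0`, to be sent over a connection opened to `www.cornell.edu`.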

You don't need the full server; you just give the actual path. I don't know that it matters if you have the full path in there, but that's the way it goes. We've given you an HTTP request class so you don't have to build all of that. You need to update the operator<< for it, but it's not that much extra work. In general, the HTTP request is all set for you, so you just basically feed it the information and then you get the request. You really should understand how the request.cc file works, because it's got a lot of functions in there that you're going to want to use. I know you haven't seen these yet, but we'll talk about the actual files you're going to need soon.

Okay, all right. So one thing you're going to have to do in this first part is going to seem a little weird, like, what's the point? But we want you to take the headers that come in, and we want you to add a header called x-forwarded-proto. Now remember what the headers look like. If you look here, in this case we didn't have any headers, but they would come after the GET, I think actually after the Host line, I believe. They're just headers that kind of look like this.

These are the response headers, but request headers look kind of like this, where they have information that is used by the server to do stuff. Well, the two headers that we want you to add, or add to, are these. First, x-forwarded-proto, which is basically saying what protocol we're using here, and you're going to just use HTTP. It turns out it's pretty straightforward; you just add that. It's very easy to do, and if it's already there, you just add it again, no big deal.

There's another one you have to do, which is going to be really important later. This is called x-forwarded-for. This is basically for proxies to say who they forwarded the request from originally. Now, if you're anonymizing things, you're going to leave that blank, but you're not going to be doing that for this. In this case it's basically a list of all the various prior places that this request came from. Now why do I say there could be many of them? Eventually, and we'll get to this, you could have a proxy chain. Who's to say that your proxy doesn't talk to another proxy, which talks to another proxy, which talks to another proxy? That's basically what the Tor network is doing, incidentally. It's going from one to the other, and then there are 17 things in between and you have no idea how to get back to the original one, which is kind of the point. Or there might be other reasons to have a proxy talk to another proxy; we'll get to those in a few minutes. What you need to do is this: if there's a header that comes in that says x-forwarded-for, it's going to be a comma-separated list of all these IP addresses. You need to add another comma, add the IP address that the request came to you from, and then put it back into the request. Okay, so you need to do that manually, by the way.
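The manual update of x-forwarded-for is a small string operation. A minimal sketch, assuming the existing header value has already been read out as a string (empty if the header was absent); `appendForwardedFor` is a made-up helper name, not one of the provided functions:

```cpp
#include <string>

// Append the address of whoever connected to this proxy to an existing
// x-forwarded-for value. `existing` is the comma-separated list that
// arrived with the request, or empty if no proxy came before this one.
std::string appendForwardedFor(const std::string& existing,
                               const std::string& clientIp) {
  if (existing.empty()) return clientIp;  // first proxy in the chain
  return existing + "," + clientIp;       // add a comma, then our entry
}
```

The returned string is what would be stored back into the request under x-forwarded-for before the request is forwarded. The addresses in any example call are purely illustrative.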

There's no fancy function that allows you to do that; you have to do it yourself. It's really pretty easy to use some of these request functions. In this case you just request the header. We've given you a whole class that has this as one of its methods, that's basically the request handler class, I believe: get value as string. You pass in x-forwarded-for and it gives you back the string, this comma-separated value. So don't look at this and go, I have no idea how to get a request header. It's pretty straightforward, straight from the actual code. You just have to look through the code and find out where those are. Like I said, you have to manually enter that in.

Most of the code for this sequential version is going to be in the request handler files, so let's look at that for a second. Okay, request-handler.h. Well, it's got a bunch of functions in it. Public functions: there's a constructor, of course, and then it has service request, clear the cache (we'll talk about the cache soon), and set the maximum cache age. That's all that the program is going to use. And then you've got some other functions in here, like handle GET request. You're going to want to also handle a PUT request and handle a HEAD request, so you've got a couple more functions there, and then you've got some other requests that you're going to have to handle as well. Some of those we've actually built for you already. So let's see, there we go: handle request has already got a handle error in it, and so forth.

But otherwise you need to actually do most of your work under, let's see, where's the GET request one? Here we go. There's one where we're going to set up a GET; you're also going to set up a PUT, you're also going to set up a HEAD, et cetera. And then, let's see, there it is, down here: handle GET request is the actual function. We've done a little bit for you already, so you can kind of see how it works. You get the request, you create a response, you set the response code to be whatever you want (normally it's going to be 200), you set the protocol to be HTTP/1.0 in this case, and then the payload in this case was "You're writing a proxy", and that's it. And then you push it out. So it's pretty straightforward in terms of the connections between the actual web, your proxy, and your browser. So that's that.
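To show what "set the code, set the protocol, set the payload, push it out" amounts to on the wire, here is a sketch that serializes a response by hand. `SimpleResponse` and `serialize` are hypothetical stand-ins, not the response class from the starter code, and the Content-Length header is added here as an assumption about what a well-formed response should carry.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a response object.
struct SimpleResponse {
  std::string protocol = "HTTP/1.0";
  int code = 200;
  std::string reason = "OK";
  std::vector<std::pair<std::string, std::string>> headers;
  std::string payload;
};

// Produce the text sent back to the browser: status line, headers,
// a blank line, and then the payload.
std::string serialize(const SimpleResponse& r) {
  std::string out =
      r.protocol + " " + std::to_string(r.code) + " " + r.reason + "\r\n";
  for (const auto& h : r.headers) {
    out += h.first + ": " + h.second + "\r\n";
  }
  out += "Content-Length: " + std::to_string(r.payload.size()) + "\r\n";
  out += "\r\n";  // blank line separates headers from payload
  out += r.payload;
  return out;
}
```

A placeholder page like the one the starter code returns would just be a `SimpleResponse` with the default code and protocol and a short HTML or text payload.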

after you've written this one test test test test on lots and lots and lots of HTTP sites。

okay you can also test on HTTPS but most of the time you'll get that connection refused or whatever。

or some other error if you can find HTTP sites what are good ones example。com is a good one。

there's that IP if I won there's there's lots of others that are out there but just don't。

don't be surprised like sometimes you'll go to an HTTP site and it will quickly turn it into an。

HTTPS site you won't notice well it's broken look for the little lock if there's a lock up in the。

corner it probably needs its HTTPS and you should try a different site so we will have we'll we can。

put together a list of other sites that are okay all right now that's the first thing the second。

thing you're going to do so as I said one of the proxy services that it may offer is this thing。

called blacklisting and another one is called caching blacklisting basically means to block access。

to certain websites okay so there is a blocked domains。txt file let's look at it it is let's see。

already set up where really blocked domains。txt okay so you can't go to any Berkeley sites can't。

go to any Canadian sites can't go to French sites it looks like no government sites I guess。

I can't go to Microsoft。com although I think Microsoft is one of those ones that does everything HTTPS。

anyway the good news is testing the block sites it doesn't actually have to go to the website so。

it should block it immediately in fact let's test that oh no it hasn't been set up I'm gonna start。

a cook but anyway these are the ones that you want to do Jerry Kane has one set up that should work。

all the time I'll ask him if it's still set up it should be but that's those are sites that you。

shouldn't be able to go to this uses regular expression matching to figure out which sites feel free to。

remember if you took 103 already or go learn about regular expressions you do not have to know a。

thing about it for this for this part the blacklist file program or blacklist functions are already。

written for you but feel free to look at if you want to see what it looks like okay so if you。

if the browser that's connecting to your proxy does go to a forbidden or blocked site, you actually。

return status code 403 to the client. Well, let's look up what 403。

actually is. If we go here, there we go: 403 Forbidden. Okay, this means that it's telling。

the actual browser that that's a forbidden site. Most of the time your browser just reports。

whatever the payload is, but it knows that it's a forbidden site. Sometimes it may put up some other。

page saying "oh my gosh, this is forbidden," but that's the correct code when the site is not allowed to。

be requested. You may have seen that one before for various other things. That means you。

have to create your own response and send it back, but it's exactly like。

the one I showed you already. It's very easy to tell whether a server is allowed。

using this blacklist function. I mean, I'm giving it to you right here: if not blacklist。

server is allowed. All right, we've written that function for you. Okay, again, what I don't want you to do is get into。

reading this document and go, "oh, it sounds like I have to write an entire, you know,。

proxy, and part of it would take me months." It's not quite that bad. Any questions? No? Okay. All right, anything。

so far on this blacklisting? It's pretty straightforward: don't let the user go to that site. Yes, this person?。
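
To make the blacklist step concrete, here is a minimal sketch in Python rather than the assignment's C++ starter code. The two patterns, the helper names `server_is_allowed` and `respond_if_blocked`, and the response body text are all assumptions made for illustration; the real blocked-domains file and blacklist class are not reproduced here. The idea is only what the lecture describes: match the host against regular expressions and, if it is blocked, answer with 403 Forbidden without contacting the site.

```python
import re

# Hypothetical patterns for illustration; not the course's actual file contents.
BLOCKED_PATTERNS = [r"(.*)\.berkeley\.edu", r"(.*)\.fr"]


def server_is_allowed(server: str) -> bool:
    """Return False if the host name fully matches any blocked pattern."""
    return not any(re.fullmatch(p, server) for p in BLOCKED_PATTERNS)


def respond_if_blocked(server: str):
    """Return a 403 response for a blocked host, or None to keep processing."""
    if server_is_allowed(server):
        return None
    body = "Forbidden Content"  # assumed payload text
    return (
        "HTTP/1.1 403 Forbidden\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
        f"{body}"
    )
```

Because the decision is made from the host name alone, the blocked case can be tested without any network access.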

The question is about the information that you're trying to PUT: what form is it in? Is it。

just a payload? I'm sure we'll come back to it, but yeah, let's look up what a PUT request looks。

like, right? So let's do it here. PUT... oops, see if this works. There we go, that should work。

PUT requests. Okay, a PUT request versus a POST request. Here we go, we'll go to this site here。

A PUT request creates a new resource on the target。

The difference between PUT and POST is that PUT is idempotent: calling it once or several times has the same effect. Okay。

it basically says: here's the request line, here's a bunch of headers, there's a blank。

line after the headers, and after that everything until the end is data. That's all there is to。

it. You generally say how many bytes the payload is, and so forth. The good news is。

you don't have to worry about that so much for the assignment. You just have to know that there。

will be a payload that you do need to forward; otherwise you don't need to know any details about it。

You won't have to create any PUT requests yourself, but basically it's just a request that sends data。

to that web server, and that's it. All right, any other questions? Good question。
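
As a rough illustration of that layout (request line, headers, blank line, payload), here is a small Python sketch that splits a raw request into its three parts. The function name `split_http_request`, the example host, and the example path are made up for this sketch; the assignment's own request class already does this work for you.

```python
def split_http_request(raw: bytes):
    """Split a raw HTTP request into (request line, headers dict, payload bytes)."""
    # Everything before the first blank line is the head; the rest is payload.
    head, _, payload = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    request_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers, payload


# A hypothetical PUT request carrying a five-byte payload.
raw = (
    b"PUT /notes.txt HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
request_line, headers, payload = split_http_request(raw)
```

The `Content-Length` header is the "how many bytes the payload is" part; a proxy forwards the payload unchanged.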

Anybody else? Okay, so that's the web proxy. Okay, caching, on the other hand, is this idea that I。

mentioned earlier: it's keeping a local copy of a page so that you don't re-request it from the。

internet. All right, that saves time and it saves bandwidth. Many times proxies are local to, like, an。

organization, and if lots of people in the organization happen to go to the same web pages and you。

have a cache of them, well, it's right there locally. You get lots of benefit from。

having that cache. Okay, in this case you are going to have to do a few things for this。

You're going to have to update the HTTP request handler to check and see if you've already got。

a cached copy. I mean, that's basically the first thing you're going to do: you're going to look and say,。

"here's the request; do I already have it in my cache?" If I have it in my cache, I'm going to just。

forward back the cached response and be done. I never even touch the site that it's trying to request from。

Okay, and that's kind of nice. All right, now, are you going to do that for things like PUT requests? No,。

you can't do that for PUT, because you have to send data to the actual website. There's no way to。

cache something that's going to be PUT, so it's only going to be either HEAD or GET。

requests that are cached. We've already built most of the logic for figuring out。

whether or not something is cached for you, so you're not even going to worry too much about that. There。

is an HTTPCache class, which you should look at and understand. You'll have to make a very minor。

change to that when you get to multi-threading, but that's the only real change you're。

going to have to make, when you do this as a multi-threaded program. You're going to forward。

as usual if it's not in the cache. When you get the response, the response will tell you whether or。

not it can be cached. So in other words, you make the request, it comes back, and the page。

itself, in the headers, says "oh yeah, you can cache this." Okay, let's actually test this。
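
The check-then-forward-then-store flow just described can be sketched in a few lines of Python. Everything here is a stand-in: the cache is a plain dictionary rather than the course's HTTPCache class, and `fetch` is a hypothetical callable that contacts the origin server and reports whether the response said it may be cached.

```python
CACHEABLE_METHODS = {"GET", "HEAD"}  # PUT and other methods always go to the origin


def handle(method, url, cache, fetch):
    """Return (response, source), where source is "cache" or "origin"."""
    key = (method, url)
    # Step 1: serve from the local copy if we already have one.
    if method in CACHEABLE_METHODS and key in cache:
        return cache[key], "cache"
    # Step 2: otherwise forward the request as usual.
    response, cacheable = fetch(method, url)
    # Step 3: store it only if the response said it may be cached.
    if method in CACHEABLE_METHODS and cacheable:
        cache[key] = response
    return response, "origin"


# Fake origin server that counts how often it is contacted.
calls = []


def fake_fetch(method, url):
    calls.append((method, url))
    return f"response #{len(calls)}", True


cache = {}
first = handle("GET", "/time", cache, fake_fetch)
second = handle("GET", "/time", cache, fake_fetch)   # answered locally
put_once = handle("PUT", "/time", cache, fake_fetch)
put_twice = handle("PUT", "/time", cache, fake_fetch)  # never cached
```

The second GET never reaches `fake_fetch`, which is the behavior the refresh test in the lecture demonstrates.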

I have a, let's see if I can do this, there we are, I have a website, let me make sure I get it right,。

that I'll let you use to test on as well. If we telnet to ecosimulation.com 80 and I do GET。

and then, what is it, it is cgi... whoops, oh no, oh no. This connection closes。

really fast, so I have to write this first and then paste it in. So cgi-bin, it is...。

Let me look, hang on, ecosim, I forgot the name of the page here. Okay, cgi, public html, cgi。

It is called current time dot php. Okay, so we are going to do a request for, we're going to say GET。

slash cgi-bin slash current time dot php and then HTTP/1.1. Okay, we're。

going to paste that in there, and let's see if I can type the Host line fast enough. Let's see, telnet。

Okay, let's try that, and then Host... did I get it? Oh, what did I forget?。

Did I forget something else? That should GET cgi... I don't know why it's not doing that. Well,。

we'll look at it in the browser. But what it's going to do is send back, like, the ability to。

be cached. Okay, I'll test it again in a little bit. But basically it's going to say。

that, and then you need to determine, "oh, if it's cacheable, I'd better put it in the cache." Okay, and then。

later, when the request for the page comes again, you can use it. Let me show you what this looks like for。

the actual, let me show you what it looks like for the site that I was just trying to go to。

Let's see, ecosimulation.com slash cgi-bin slash current time dot php. There we go. Okay,。

so that's what it looks like. What it is: I just put up a little page that tells you what the。

current time is. It's got the date and the hour, minute, and seconds. If I refresh this, right,。

the browser doesn't actually cache it, as it turns out. Or rather, the original proxy server。

doesn't actually cache it. Let's see if this is still set up. I think it's still set。

up, meaning that, hang on, preferences... maybe it didn't actually set that up. Let's see。

There we go. Okay, so now we've got the proxy server going. If we do the, let's see, samples。

slash proxy solution, there we go. Okay, and now we request this from the proxy server. There it。

is. Now that's it, right? And let's look at what it did: it said it cached the response, and it's。

going to cache it for 3600 seconds. So for the next hour, whenever the browser requests that page,。

and I'm repeatedly refreshing here, it's always going to show the same time。

because it's been cached. Okay, this is not something you want for pages like this, because this is a page。

that's generated dynamically. Okay, and many web pages are generated dynamically on the web。

If you go to facebook.com, your timeline is updated all the time, right? You've got ads flashing up and。

whatever. It's all being regenerated every time you refresh or scroll or whatever. That's dynamic。

content. The only content that should be cached is static content, which should be things like images。

and things that aren't going to change. So in this case that's it. Now, what if this happens to you and。

you've cached something that shouldn't be cached? Okay, by the way, every time I refresh it, it says。

"using cached copy," because it knows to go and use the cached copy there. What you can do is you can say。

--clear-cache, and then it will clear the cache by deleting all these files. It keeps the files。

in a hidden directory. I don't know why we hid it, but I。

guess that's the way it goes. And now when I refresh it, it should go back, and there it is, now。

updated. Of course, it's now being cached again, so it won't update again。

So that's that. Okay, let me go back here again. Questions on caching and。

what the basic idea is? No questions on the basic idea for that? Okay。

All right, those are the things you have to do for that. Feel free to use this test page。

to make sure that your thing is caching right. Go and refresh a bunch of times and go, "oh, it's the same。

time every time," so you make sure you're caching. Yeah? "So when you're caching something, you can。

cache part of the web page, right?" Yes, good question. You can cache parts of a web site, I would say。

Yeah, so here's the thing: when you run this and you go to some website, it will。

actually request hundreds and hundreds of resources, each an actual web page in some sense. If we,。

let's see, I kind of think, let's do it this way. We still have to set up, let's see if we。

set it up. Let's go back to the proxies for a second. I'm going to set it up there and then。

let's see, how am I going... oh yeah. Okay, so now there's what we have there. Let me make。

the web page a little smaller so you can at least see the scrolling that might happen here. Okay, if I。

go to, let's say, time.com, which I don't think is... there we go, see all the web pages that it's getting,。

and you can see it's still getting lots and lots and lots of web pages, right? time.com is another。

good way to test because it's not HTTPS, except for probably the ads or something; some parts。

might be. But you get lots and lots. Look at how many different pages it requested。

just from time.com, right? And it still hasn't even finished loading. Why? Because it goes and requests。

the main one, and the main one says, "oh, here's another image, here's an ad, here's another sub-iframe。

page," and so forth. So that's what happens there. So, good question. Yeah?。

Yeah, so the question was, does the cache persist beyond your running it? Yes, it's。

files on the actual disk. So if we look, ls -al, let's see, where is it, ls -al,。

there we go. Okay, it should be... oh, you know what, I believe it's not in here. I think。

it's in your main, let's see if it's there. There is dot proxy cache myth 64. If we go into that,。

that's in your home folder, dot proxy cache myth 64, then in there there's all the sites. So there's。

all the requests that it saves, so all the time.com requests are right in there. And that hash。

will become important in a couple of minutes. Okay, it'll become really important in a minute. I don't。

know what form this is in. Let's just look: 7 7 2 8, oops, that's a directory. Let's say cd 7 7 2 8。

In there... that's binary of some sort, or maybe it's an image, probably。

an image. So we can do that, right? You know how you'd figure out if it was an image? You'd look。

at the hex for it. You'd go something like hexdump and then the name, and then you'd... oops, not hex。

dump. What is it? Oh, hexdump, there we go. hexdump and then the name,。

and there you go. And then you could look at the beginning of the file. Oh, the beginning of the。

file: many files that are binary actually start with bytes, like 5 4 4 8, that possibly mean it's a particular。

type of file, like a PNG file or some other thing. So you can do a little bit of investigation if you。

want to figure that out. But anyway, that's the point. I'm not sure exactly how they're。

stored, but that's where it is anyway. Okay. All right, so that was time.com,。

which is a good one to test because it still remains only HTTP. Let's see, did I get rid of。

those settings here? Yes, I did. Proxy, and then let's make it no proxy again, and there we go. Okay。

All right, so you need to do that for the caching: basically, check if it's in the cache. If it is,。

forward it back from your local copy. If it's not there, request it. When it comes back,。

say, "oh, can this go in the cache?" That's pretty much the caching story. Okay, that is version。

two. Version three is now adding concurrency. You have built a thread pool, and using a thread pool,。

as you found out, is not actually that difficult. The hard part is knowing where to put all of the。

mutexes and things when you're using the thread pool, right? Because now we've got。

multiple threads that are going to handle multiple web page requests. You're actually going to limit。

it to 64 threads: your thread pool is going to have 64 threads at once, and you'll be building this。

thing called a scheduler to actually handle those. Now, the scheduler is basically the。

first-in, first-out queue that feeds requests into the thread pool,。

and so you're going to have to build that. It's not too much code, actually,。

and it will be very, very simple, as it turns out. When I。

say it's not much code: your scheduler class is going to have, like, one line that calls an HTTP。

request handler, which you've already built, right? And that already has an HTTP blacklist and an HTTP。

cache already built for you. Don't go and try to reinvent things. Just say, "okay, I am going。

to use one HTTP request handler, but what I'm going to need to do is。

update that handler to be thread safe," which means that there are certain things, like, oh, maybe the cache。

needs to be thread safe. The blacklist, it turns out, never changes,。

so nothing needs to change for that. But you should do that. Here's the。

big thing about this: you do need to add synchronization directives to the other code that。

you've written. This may mean that you have to change function signatures and things. That's。

actually okay. I don't know if any of the public ones will need it, but。

you can change function signatures to add functionality for this. You're only allowed to have one。

request open for a given request. In other words, let's say that two。

threads both request some JPEG from time.com at the exact same time. Only one of them is allowed。

to proceed. Okay, well, how might you support only one of them proceeding? A semaphore, or probably a mutex。

in this case, just because you're only allowing one. So in this case you need to lock around the actual。

request for that. So yeah, you're just going to have that. Now, what you don't want to do is。

lock around the entire... well, it's basically because of the cache. It turns out you don't want to lock。

around the entire cache; that would be really slow. In other words, while you're doing the request。

or while you're updating the cache or whatever, every other thread is going to say, "no, I can't update the cache."。

Here's the nice thing about the cache: it is thread safe if you're trying to add things to the cache。

from two different sites. Okay, it just writes local files that have different names associated with。

them, as it turns out. So you can have two different sites adding to the cache at the same time;。

it's actually okay. You can have as many reading from the cache at the same time as you want. You just。

can't have two of the exact same request writing to the cache at the same time. So this is the reason。

that you're not going to allow two requests to go out to the same page. Number one, it seems a。

little crazy to do that anyway, because you're already requesting it; you might as well have the。

cached copy and then just serve it for the next one. But that's what you're going to do. You have a question?。

Yeah, so good question. The question is, "wait, wait, wait, I don't get it: how is it not an issue if there。

are multiple threads reading and writing?" Well, it turns out that the way the cache is built,。

it doesn't have any internal state. It's not like it's updating anything internal. It's strictly opening。

and closing files in that folder I showed you, and those files are distinct. So it doesn't matter if。

two things are doing it at the same time. Two different threads can both be calling。

that function and they will be touching different files, as long as they aren't doing the。

same request. If they're on the same request, they're going to overlap, and。

they're going to get mucked up that way. So that's the reason. It's mostly thread safe。

except for the same request, so you do need to have a mutex around that. Which, again, is the reason you。

don't want to have a mutex around the entire cache all the time. You。

only need to do it for that amount of time when you are doing the actual, let's。

see, I guess it's the actual updating of it, or rather the reading of whether it's in there。

or not. I think that's really the only thing you're going to need to do. So it's relatively low。

level, but you do want to have different requests able to go at the same time。

So how are we going to ask you to do that? You are going to have an array of 997 mutexes,。

and I've already seen some people going... there are going to be people going, "what are you talking about?"。

And I put it right here: "what are we talking about?" We have 997 mutexes. Okay, here's how this is。

going to work. Every time you request a site, you are going to hash the actual request. Okay, you're。

going to take the request and you're going to hash it. Now, it turns out that we built a hashing。

function for you that takes a request and returns to you a hash for that request. Remember how over in。

the other window you saw, let's see, whoops, if we go back here and do that, remember。

those numbers there? That's the hash that actually comes out. So let's see, there we go, those。

are all the hashes that come out of the hashing function, and it uses that hash as the name,。

as it turns out, of the folder that the information, the payload, is going to go into. So it's going to。

use that anyway. Your job is to say, "okay, if we have a particular request, we're going to call this hashRequestAsString。

function." My pen just turned black, I don't know, that's all right. It's。

going to return that, and, let's see if I can do this, let's see, pages, is it... no, no, no,。

hang on, use pen, pages, shrink... not worth it. I guess we're dealing with red right now, or black. So。

it's going to get the hash, and then you're going to pick one of those mutexes, those。

997 mutexes, based on that hash modded by 997。

That's the index you're going to use for the mutex. Okay, let me explain that in a。

little more detail. Here's what you are going to do: you're going to have。

these 997 mutexes, and you're going to say, "okay, every time I get a web。

page, I'm going to use a different mutex for that page." The only time they will overlap, most of the。

time, is if it's the same request, because the odds are very slim that two different requests will。

have the same hash out of those 997. It might be the case; that's just something。

you have to live with. Otherwise, you're going to use that lock at that point to say, "I'm locked now, based on。

that hash." If you have two requests coming in at the exact same time that。

are for the same site, they're going to both hash to the same value, they're going to both end up with。

the same mutex, and therefore you're going to use the same lock for those two identical requests. That's what we're。

doing. Now, the question is, why is it 997? Anybody know anything about that number?。

It's prime, yeah. Hashing works best when you do mod with a prime number, so it actually kind of。

scrambles it up even more than it would otherwise. That's the way it goes. So, questions on that, people?。
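
Here is a minimal Python sketch of the "array of 997 mutexes" idea, under some loud assumptions: `request_hash` uses SHA-256 as a stand-in for the hashing function the course provides, and `lock_for` and `handle_exclusively` are hypothetical names, not part of the starter code. The point is only the indexing step: hash the request, take it mod 997, and use that slot's lock.

```python
import hashlib
import threading

NUM_LOCKS = 997  # prime, as in the lecture
request_locks = [threading.Lock() for _ in range(NUM_LOCKS)]


def request_hash(method: str, url: str) -> int:
    """Stand-in hash of a request; the assignment supplies its own function."""
    digest = hashlib.sha256(f"{method} {url}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def lock_for(method: str, url: str) -> threading.Lock:
    """Pick the lock for a request: hash it, then reduce mod 997."""
    return request_locks[request_hash(method, url) % NUM_LOCKS]


def handle_exclusively(method: str, url: str, work):
    """Run work() while holding the lock that this request maps to."""
    with lock_for(method, url):
        return work()
```

Two identical requests always map to the same lock, so one finishes before the other starts. Two different requests usually map to different locks and run in parallel; if they happen to share a slot, one simply waits a little.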

No, no, no, you've got a good question: do you have to distinguish between a collision and the same site?。

If it collides, you just deal with it. Go, "well, you know what, it's going to be a little slower because of。

that collision." The collision will happen very infrequently, so you don't need to worry about that。

And, you know, if you wanted to, you could pick a bigger prime number and make it even less。

likely than that. But the odds are pretty good; keep in mind you only have 64。

things going at a time. The odds that those 64 have a collision is a little bit of a birthday problem。

issue, but it's not the biggest deal in the world, and even if it happens, you're not even。

noticing. I mean, you won't notice it at all. Yes?。
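
The birthday-problem remark can be checked with a few lines of arithmetic. This sketch assumes the requests in flight are all different and that their hashes land uniformly at random in the 997 slots; `collision_probability` is a name made up for this note. For one pair the chance of sharing a lock is 1/997, and with 64 distinct requests active at once the chance that some two share a lock is considerably larger than that. Either way a collision only makes one request wait briefly; it never produces a wrong answer.

```python
def collision_probability(active: int, buckets: int = 997) -> float:
    """Chance that at least two of `active` uniformly hashed requests share a bucket."""
    p_all_distinct = 1.0
    for i in range(active):
        # The (i+1)-th request must avoid the i buckets already taken.
        p_all_distinct *= (buckets - i) / buckets
    return 1.0 - p_all_distinct


pair = collision_probability(2)          # two requests in flight
full_pool = collision_probability(64)    # all 64 worker threads busy
```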

Oh sure, sure, you could have a different hash with the same remainder, yeah, absolutely. But。

notice how long the hashes are: like, you know, 60 bits, or 15 or 20 digits。

long, right? That's a giant number, and they're going to hash。

somewhere in that array of 997 buckets, sure. But the odds are slim, because you've。

got a one out of 997... well, you've got a birthday-problem kind of percentage。

chance that two very different numbers hash to the same bucket: basically one out of 997。

per pair that are going at the same time. Yeah, that makes sense. So we don't need to。

deal with any collisions, because it won't matter anyway. You。

still will get an individual hash, which is different. You just will。

only have 997 mutexes to deal with, and the mutex is only slowing it down. It's only saying, "look, you。

can't go while I'm here," and it's for a very short amount of time anyway. So, not too big of a deal。

When you're accessing the cache, you should lock around it in that case. Again, if you,。

I guess, let's see, how should I put this. Yeah, the question is, why do you even need it if。

it's thread safe to begin with? I'm going to look into it in a little more detail; I don't。

want to give you the wrong answer on that. But the basic idea is, yeah, just while you're accessing that。

cache for that particular request, lock around it. And it's not really that you're locking because of the。

access; you're simply locking because two requests might be trying to access the。

same files. So that's really all you're doing. You are locking around your。

cache accesses, but it's only when you're doing a request that might be the same as some。

other request. That's the only reason. Yeah, so it has nothing to do with the cache being thread safe or。

not. It has to do with the fact that if you are doing the same request,。

then it's not safe, so you have to lock in that case, right? Yeah, the question is, shouldn't we just。

be locking around one particular thing? Well, remember, we've said that you're not allowed to make a。

request to the same site at the same time, so you do need to lock around the entire request in that。

case, like the entire setup for that request, because you want to make it so that if two browser tabs。

or whatever ask for the same page at the same time, one of them fully completes before。

the other one goes. You need to lock around that whole section. But again, it's one of 997 different locks,。

so it's going to be unlikely to have too much of a problem. Good question. Anything else? Do you。

have your hand up? Okay. All right, good question, and this, I think, is one of the trickier parts of the。

assignment. Okay, version four: there is one more thing you're going to have to do for this assignment。

The final thing you're going to do is this thing called proxy chaining, and I mentioned that earlier。

Proxy chaining is where your proxy goes to another proxy, which goes to another proxy, which goes to。

another proxy. Let me demonstrate that for you. Okay, if we go and set up our proxy again,。

so we're setting up the proxy to be a manual proxy for myth64. Let's see, where was I。

here, cs110, let's see, spring, assignments, assignment seven. Okay, let's do this. Let's。

do farm, or not farm; we will use farm in a little bit, a different farm, it turns out。

In this case we're going to do samples slash proxy solution, and we are going to。

tell it, let's see how we do this, there we go, we say --proxy-server,。

and let's actually do exactly this one, see, we'll do a dash in there, a dash in there,。

--proxy-server. So now we've said, "okay, if our browser goes to our proxy and makes a。

request, it will not forward it on to the requested website; it will forward it on to some other。

proxy that we are about to set up." So I have told it that I'm setting this up on myth63。

I probably did this in my office, so maybe it won't work. Well, we'll see. ssh myth63。

Okay, yes, I'm going to do that. Okay, cs110, spring, assignments, assignment seven, samples slash proxy solution, and in this case we want。

port one two three four five. Oh, here we go, it is working. Okay, so now it's going to be one two three。

four five. So when I make a request to my original myth64 one, it should forward the request。

on to myth63. Okay, if all goes well. Let's try this: let's say time.com, and let's see what happens。

Okay, it's loading. There we go. So notice the original request is going here. There's some other weirdness。

going on here with the socket; it's probably time.com being annoying. And over here it's going to the。

other proxy server, and then it's forwarding back and forth. So it'll chain our proxies. Okay, if we go to,。

let's see, if we go to our current time one, okay, there's the current time. It actually。

cached it. I think it actually cached it in both, as it turns out, because both proxies saw it come。

back as cacheable. That's okay. But if we go and do it one more time, actually, you couldn't see. Let's。

clear this and let's see if it does it again. If we do it there, there we go: it only went to the。

first one, because it had it locally and never needed to go to the second proxy there. So that's。

one of the nice things about caching: you didn't need to go to the other proxy there。

Now, what's the trick to doing this? Well, let me show you。

Settings, no proxy. Okay, all right, so you can have as many proxies as you want in a row, right? You。

should be able to handle as many as happen. Now, each individual proxy couldn't care less how。

many prior proxies it came from or how many it goes to, except in one case: you don't want to be。

able to create a cycle of proxies. What if I was a proxy going to Derek, and Derek was a proxy。

coming back to me? Well, we would go back and forth all day and we'd just make millions and millions of。

requests, and we'd all just be waiting and start timing out and this and that。

if there was no cycle detection, right? So what you need to do is detect that. Now,。

remember about ten slides ago when I said there's this thing called X-Forwarded-For?。

What did it say? It said where the request came from, and you add the next one onto it。

Well, guess what: if the request comes from a proxy, you're going to add that proxy's。

IP address onto this list, which is going to end up going to the next proxy. If you look through that。

list and find out that there's a cycle going on, in other words, the place you're about to send to。

is one that's already in the list, you have to return an error message, because you don't want to。

create the cycle. The error message we happen to want you to send back is 504, which is Gateway。

Timeout, and it says the server, while acting as a gateway or proxy, did not receive a timely response。

That's about the best you can do; it turns out that's the best status code for。

this sort of issue. Okay, it basically says it didn't work because there was this cycle。
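
The cycle check can be sketched as follows, in Python for brevity. The function name `forward_or_reject`, the header dictionary shape, and the example IP addresses are all hypothetical; only the rule comes from the lecture: append the address the request came from to X-Forwarded-For, and if the proxy you are about to forward to is already in that list, answer 504 instead of forwarding.

```python
def forward_or_reject(headers: dict, client_ip: str, next_proxy_ip: str):
    """Return (status, headers): status is 504 on a cycle, else None to forward."""
    previous = headers.get("x-forwarded-for", "")
    chain = [ip.strip() for ip in previous.split(",") if ip.strip()]
    chain.append(client_ip)  # record where this request came from
    if next_proxy_ip in chain:
        # Forwarding would revisit a hop that already handled this request.
        return 504, headers
    updated = dict(headers)
    updated["x-forwarded-for"] = ", ".join(chain)
    return None, updated


# Hypothetical addresses: the browser is 10.0.0.9, the proxies are .1, .2, and .3.
ok_status, ok_headers = forward_or_reject({}, "10.0.0.9", "10.0.0.2")
cycle_status, _ = forward_or_reject(
    {"x-forwarded-for": "10.0.0.9, 10.0.0.1, 10.0.0.2"}, "10.0.0.3", "10.0.0.1"
)
```

In the second call the request has already passed through 10.0.0.1, so sending it there again would loop forever; the sketch reports 504 instead.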

It does go through the proxy again. Good question. The question is, if you have a proxy set up,。

here, I'll draw it out for you... no, I won't, because this isn't working. Let's see。

Of course, I'm sorry. If you have a proxy that makes a request to another proxy,。

it's expecting the response back from that proxy. That proxy might。

have its own proxy that it's going to, and it's expecting its response back. So the chain goes。

down the line, bing, bing, bing, gets to the real website, which responds to the last proxy, which responds。

to the second-to-last proxy, which responds to the one before that, all the way back to you. So that's the。

whole chain. It's kind of slow; it's not fast. Oh, it's not a cycle, because there's no request that comes。

back to one that's already been requested from. Remember, here's the cycle. Okay, I request。

the site from, so let's say my browser requests time.com and it comes to me as。

my proxy. I go to Elizabeth, and Elizabeth is going to respond; I'm expecting a response。

from her. She goes to Derek, et cetera, and then Derek goes to time.com. Okay, and then the response comes back to。

Derek from time.com, from Derek goes back to Elizabeth, from Elizabeth goes back to me. No cycle. If, however,。

Derek had a proxy set up that was going to come back to me, then it would go browser to me to Elizabeth to。

Derek to me to Elizabeth to Derek to me, and it would never get to time.com,。

and that would be a cycle. But you can figure that out, because when the request comes from the browser to me, I tag the browser's IP onto the X-Forwarded-For, and。

then I pass it on to Elizabeth, and Elizabeth tags my IP onto the X-Forwarded-For and then goes to Derek, and。

Derek tags hers onto the X-Forwarded-For, and then time.com gets it, and the response comes back that way。

So the response doesn't need to, I mean, it actually goes back to。

whoever requested, but the cycle happens when you're doing the request. That makes sense? Okay, good. Oh,。

you would not check if your own IP is in there; you would check if the。

proxy that you're about to send to is in there already, right? Because if it is, then you go, "oh,。

well, I shouldn't send to another proxy." Good question. Yeah?。

When you have more than one proxy, which one uses the cached data? Oh, good question. I don't。

know if you noticed that both of them happened to use the cache. Remember,。

you know you're forwarding on to a proxy, but you don't necessarily know if it。

has a cache or not. So every time you get that response, your local decision is, "do I cache。

it or not?" and you do if it needs to be. And then the one previous says, "oh, do I need to cache it?" and does。

so. It would get cached along each hop. Now, in real life this is not necessarily going to happen:。

either one is going to cache it, or there's going to be some other things going on. But maybe they。

all locally cache it, who knows. But good question. All right, so we actually have a。

program called run proxy farm that you can use to check and run,。

to make a little proxy chain through all the myth machines or whatever, and you can。

test it out that way. What it doesn't test for, and this is kind of the most important thing, is。

cycles. If you wanted to, you could dig into the code for the Python program。

we give you and just write it so that it checks for cycles, or does some cycle creation, is really。

what it is. It's not too hard to update, but you would have to do that. So that's that。

So that is the assignment. It took a while to explain, but now you should have a good。

idea when you get to it tomorrow or whenever you start doing it. I would start doing it sooner rather than。

later. You will get through it one step at a time. Okay, if you do want to support HTTPS, it turns out。

you have to support this other method called CONNECT. We don't expect you to do that。

but we can give you some hints on how to do it. So if you wanted to do that, if you had extra。

time and said, oh, I'd like this to work for HTTPS, then feel free to ask and we'll give you details about。

what you need to do to make CONNECT work. But I would only do that at the end, and you don't have。

to. Test as often as you can, and clear your caches, both in the browser and locally in the cache。

And that's that. Here are the various files you're going to have to make changes to。

Very minor changes to the cache files, very minor changes to the proxy file, somewhat minor but。

still decent changes to the request files. And this is where you're going to do。

most of the work: the request handler. Then minor changes in the scheduler as well. So。

again, there are a lot of files you have to touch, with minor changes in most of them. Okay, all right。

Question? So can you create more than one proxy per computer? You can create。

more than one proxy per computer; every port could have a different proxy on it. Okay, yeah. And in fact, when。

you guys are all running it, we might have 25 people on each myth running 25 different servers。

What if they cache the same thing from a site? The cache is local, in your own home folder。

so nobody else's cache is involved. Oh boy, that would make it a really hard assignment: everybody。

else gets to cache the stuff too. Good luck doing that kind of debugging. But no。

it's your local copy. The cache folder is local。

in your own home folder, in a hidden folder. Good question. All right, and now that we've talked。

about that, you now know a little bit more about proxies too, and now you're going to。

go start building one tomorrow or Friday. All right, so in the last couple of minutes I'm going to briefly。

talk about the next assignment, just the big global picture, and then next Wednesday。

I will go into much more detail about it, in a similar kind of way: here are the things you have。

to think about. Okay, so the last assignment is this algorithm called MapReduce, and MapReduce is a。

very cool distributed algorithm. I believe it was first。

invented, or at least first used widely, at Google. What is it? It's parallel, okay, and, let's see。

there you go, it's distributed. Okay, it's。

distributed, meaning that it runs across many myth computers. So your program will actually run。

on many myth computers, and it's used to generate and process large data sets. Okay。

What does that mean? It means that you've got this data set that you need to do a lot of analysis on。

and you want to utilize many different servers to do that. This is a huge issue for companies like。

Google or Facebook or Apple, companies that have lots and lots of data that they're chugging。

through every day. They have billions of things to do and they have tens of thousands of servers。

You have to farm the work out across all these servers. There are two basic parts to MapReduce。

There's the map stage and there's the reduce stage. Those are the problem-specific parts。

Okay, the map stage says: I need to take this data, farm it out to all these servers, and process。

it in some way. It gets processed on all those servers, and I'll show you an example in a。

minute, or maybe next Wednesday. It processes it in a certain way and then it sorts it all. That's。

not part of the map step itself, but it's kind of part of the whole thing, because it happens in。

every case: you always have to do some sorting. Back down here it shows that you've got a mapper。

you've got some intermediate data, you take that intermediate data, you always sort it in some way or。

another, and you always group it by these keys, and I'll show you an example. And then you have a reduce。

phase which says: okay, now that all that data is out there, I need to go collect all that data in a。

regulated way so that I can get it back and collate it into one giant result. So you're sending the。

data out, you're mapping it out to all these different servers that are going to chug away and do。

whatever is necessary with it. Then you're going to sort it and process it by sorting and grouping。

by the key that comes along with it. And then you're going to reduce it by bringing it back to。

the main computer and having the final data set. That's what MapReduce is all about. Okay, so。

what are the details? Yes? What's the difference between group by key and reduce? Well, you'll see。

when you do it, but group by key is basically local. You are saying, for my local copy, all these。

keys are the same, and you'll see the example and you'll go, oh, I get it. And then the reducing is taking。

all those different farmed-out versions, bringing them all back, and further combining them together。
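One way to picture that distinction is the small sketch below. It follows the lecture's simplified description only (group locally on each server, then bring everything back and combine); the function names `group_by_key` and `reduce_counts`, the word-count data, and the two pretend servers are all assumptions made for illustration, and the real assignment may organize this differently.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Local step: sort one server's (key, value) pairs and collect values per key."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(grouped_outputs):
    """Combining step: merge the grouped output of every server into one result."""
    totals = defaultdict(int)
    for grouped in grouped_outputs:
        for key, values in grouped.items():
            totals[key] += sum(values)
    return dict(totals)


if __name__ == "__main__":
    # Two pretend servers, each grouping only the pairs it produced itself.
    server_a = group_by_key([("the", 1), ("cat", 1), ("the", 1)])
    server_b = group_by_key([("the", 1), ("dog", 1)])
    print(server_a)
    print(server_b)
    # Bringing the farmed-out pieces back and combining them further.
    print(reduce_counts([server_a, server_b]))
```

Here each server only knows about its own keys, while the combining step is the first place where the counts for a key seen on several servers meet.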

Okay, you'll see how it works, and that's it. And so it requires networking。

It requires threading, in that the main server has to use threads to get。

everything out to all the workers. Does it require multiprocessing? I guess it does。

It requires multiprocessing, and then you're going to run multiple programs on there. It kind of。

ties together everything that we've been doing all quarter. Okay, all right, here's the example, and I'm。

only going to show this and then I'm going to let you go, and then we'll finish this up next Wednesday。

The example is: you basically have a file, let's say a document that's a book, that has these。

words in it. And what the script here is going to do is read in an input file。

and, for every word, output the word and then one, meaning there's one occurrence that。

is associated with it. Now, why do you have to do that? You don't necessarily have to have that one, but。

in this case we're going to do that because it keeps things generic. We're。

going to say it outputs the word and then a one. So at the end of this。

little script you're going to have all the words in this document. There are going to be many overlaps。

and they're all going to say word one, the next word one, the next word one. You're going to have lots。

of duplicates. Okay, after that you are going to do the sorting and collating phase that we'll get。

to next Wednesday. Okay, all right, so we'll talk a lot more about MapReduce。

when we come back next Wednesday. In the meantime, I am going to the office now for some office hours. If you。

want to come by, it's actually 219, I think, where it's going to be, and we'll see you guys in lab or next week。
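As a recap of the word-count example described above, a mapper of that kind could look roughly like the sketch below. It is not the script shown in lecture: the function name `map_words`, the lowercasing, and the rule for what counts as a word are all assumptions made for illustration.

```python
import re


def map_words(text):
    """Emit a (word, 1) pair for every word in the text, duplicates included."""
    # Assumption: a "word" is a run of letters or apostrophes, compared in lowercase.
    return [(word, 1) for word in re.findall(r"[a-z']+", text.lower())]


if __name__ == "__main__":
    for word, count in map_words("the cat saw the dog"):
        print(word, count)
```

Note that "the" is emitted twice, each time with a count of one. The mapper makes no attempt to merge duplicates; that is left to the sorting, grouping, and reducing stages.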