Welcome, I'm super excited to see so many people interested in deep generative models. I'm Stefano, the instructor of this class. I've been teaching this course for a few years now; we started back before all the generative AI hype, before this topic was so popular in industry. So you're lucky: you get to experience a pretty mature version of this course, and it's going to be an exciting quarter. This is one of the hottest topics in industry right now. There is, of course, a lot of excitement around language models and around generative models of images and videos. The goal of this class is to give you the foundations to understand how the methods used in industry and in academic papers actually work, and to get you up to speed with the fundamental concepts you need to build a generative model, so that in the future you can develop better systems, deploy them in industry, or start your own company leveraging these technologies. At a high level, one of the reasons these models are becoming so important in AI and machine learning is that they address a fundamental challenge shared by many subfields of AI: computer vision, NLP, computational speech, even robotics. In a lot of these settings, the core problem is making sense of some complex, high-dimensional signal or object, like an image, a speech signal, a sequence of tokens, or a sequence of characters written in some language. And this is challenging because, from the perspective of a computer, an image is just a big matrix of numbers.
The difficulty is making sense of it: figuring out how to map that very complex, high-dimensional object to some kind of representation that is useful for decision-making, for the variety of tasks we care about, like figuring out what objects are in the image, what relationships they are in, what materials they are made of, whether they are moving and how fast, things like that. If you think about NLP, it's a similar story: you have a sequence of characters, and you need to understand its meaning, or maybe translate it into a different language. The challenge is really understanding what these complex objects mean, and that is hard; it's not even clear what it means to understand what an image means. But I like to use an analogy inspired by a quote from Richard Feynman: "What I cannot create, I do not understand." I think this is actually what they found on his blackboard after he passed away. What he meant was about mathematical theorems: if I can't derive a proof by myself, I don't really understand the concept well enough. The analogy is that we can look at the contrapositive of this, and the philosophy behind generative modeling approaches in AI is: if I claim I understand what an image or a piece of text means, then I should be able to create it. I should be able to generate new images; I should be able to generate new text. So if you claim you understand what an apple is, then you should be able to picture one in your head. Maybe you're not able to create a photo of an apple, but you know roughly what it means.
Or if you claim you can speak Italian, then you should be able to produce speech and write text in that language. That's the philosophy behind building generative models of images, generative models of text, or multi-modal generative models. If you have these capabilities, so you're able to generate text that is coherent and makes sense, like in large language models such as ChatGPT, then it probably means you have a certain level of understanding, not only of the rules and the grammar of the language, but also of common sense, of what's going on in the world. Essentially, the only way you can do a good job at generating text that is meaningful is to have a certain level of understanding. And if you have that level of understanding, then you can leverage it to solve the tasks we care about. So how do we go about building software, writing code, that can generate, let's say, images or text? This is not necessarily a new problem. People in computer graphics, for example, have been thinking about writing code that can generate images for a very long time, and they have made a lot of progress in this space. You can think of that setting as one where you're given a high-level description of a scene, maybe different kinds of objects with different colors and shapes, maybe a viewpoint, and the goal is to write a renderer that produces an image corresponding to that high-level description. Again, the idea is that if you can do this, then you probably have a reasonable understanding of what the concept of a cube is, what the concept of a cylinder is, what colors mean, what relative position means.
In fact, if you can do this well, then you can imagine a procedure where you try to invert the process: given an image, you try to figure out what high-level description produced this scene. And to the extent that you don't have computational constraints and can do this efficiently, this gives you a way to think about computer vision as inverse graphics. If you have a process that can generate images well, and you are somehow able to invert it, then you are making progress on computer vision tasks, because you are able to recover these high-level descriptions of scenes. Now, this is not going to be a course on computer graphics. We're going to be looking at very different kinds of models, but many of them will have a similar structure: there is going to be a generative component, and often there are going to be latent variables that you can infer given the raw sensory inputs, which you can use to get features and representations, or to fine-tune your models to solve computer vision tasks. So this philosophy and this structure will show up in the kinds of models we'll build in the class. The models we're going to work on are not graphics-based; they're going to be statistical models, based on machine learning techniques. The generative models we'll work with are going to be based on a combination of data and prior knowledge. Priors are always necessary, but you can imagine a spectrum: you can rely more on data or more on priors. Computer graphics lies at the extreme where you leverage a lot of knowledge about physics, light transport, and the properties of objects to come up with good renderers.
This course is going to focus on methods that are much more data-driven, where we try to use as little prior knowledge as possible and instead leverage data: large datasets of images or text, perhaps collected from the internet. At a very high level, these generative models are just going to be probability distributions over, let's say, images x or sequences of text x. In that sense, they are statistical. We're going to build these models using a combination of data, which you can think of as samples from this probability distribution, and prior knowledge, which in this case is basically a mix of the kind of architecture you use, the kind of loss function you use for training, and the kind of optimizer you use to try to reduce the loss function as much as possible. This combination of good data and the right kind of priors is what enables you to build, hopefully, a good statistical generative model. At the end of the day, the abstraction is that we're going to be working with probability distributions. You can think of one as a function that takes any x as input, say any image, and maps it to some scalar probability value, which tells you how likely that particular input image x is according to your generative model. Now, this might not look like a generative model directly: how do you actually generate data if you have access to this kind of object? The idea is that you can generate samples from this probability distribution to create new objects. You train a model, you learn this probability distribution, and then you sample from it, and by doing that you generate new images that hopefully look like the ones you used for training. So that's the structure.
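To make this concrete, here is a minimal toy sketch of that abstraction. It is purely illustrative, with a hypothetical 1-D Gaussian standing in for a real deep model: the "generative model" is just an object you can fit to data, query for a probability value, and sample from.

```python
import numpy as np

class ToyGenerativeModel:
    """A 1-D Gaussian 'generative model', illustrating the API only.
    Real models (diffusion, autoregressive, ...) share this interface
    but use neural networks over high-dimensional x."""

    def fit(self, data):
        # Maximum-likelihood estimates from training samples
        self.mu = float(np.mean(data))
        self.sigma = float(np.std(data))
        return self

    def prob(self, x):
        # Density p(x): a scalar saying how likely x is under the model
        z = (x - self.mu) / self.sigma
        return np.exp(-0.5 * z**2) / (self.sigma * np.sqrt(2 * np.pi))

    def sample(self, n, rng=None):
        # Generation = drawing new points from the learned distribution
        if rng is None:
            rng = np.random.default_rng(0)
        return rng.normal(self.mu, self.sigma, size=n)

# 'Training data' drawn from some unknown underlying distribution
data = np.random.default_rng(1).normal(5.0, 2.0, size=10_000)
model = ToyGenerativeModel().fit(data)
print(model.prob(5.0) > model.prob(50.0))  # typical data is more likely
```

The same three operations, fit, evaluate, sample, are what the far more complex models in this course expose over images and text.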
So in some sense, what we're trying to do is build data simulators. We often think of data as an input to our machine learning problems; here we're turning things around and thinking of data as an output. So we need to think about what kinds of machine learning models we can use to simulate, to generate, data. Of course, this looks a little bit circular, because we just said we're going to use data to build these models. Indeed, the idea is that we use data to build a model, which we can then use to generate new data. And this is useful because often we're interested in simulators that we can steer through control signals. We'll see examples of the kinds of control signals you might use to control the generative process. For example, you might have a model that can generate images, and you control it by providing a caption describing the kind of image you want. Or you might have a model that again generates images, and you control it by providing a black-and-white image, and you use it to produce a colorized version. Or maybe you have a data simulator that produces text in English, and you control the generative process by feeding in text in a different language, maybe Chinese; that's how you build machine translation tools. The API is again that of a probability distribution. For a lot of these models, you will also be able to query the model with potential data points, and the model will tell you whether or not they are likely to be generated by this data simulator. So in some sense, it also gives you an understanding of which data points make sense and which ones don't, which is going to be useful for some applications. And at the end of the day, this data simulator is a statistical model.
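As a cartoon of this controllable-simulator interface (all names and numbers here are invented for illustration): the control signal c just selects which conditional distribution p(x | c) we sample from or evaluate.

```python
import numpy as np

class ConditionalSimulator:
    """Toy conditional generative model p(x | c): the 'data simulator'
    interface. In real systems c is a caption, a grayscale image,
    source-language text, etc.; here it is just a label selecting a mode."""

    def __init__(self):
        # Hypothetical 'learned' conditional means, one per control signal
        self.modes = {"cats": 0.0, "dogs": 10.0}

    def sample(self, control, n=1, rng=None):
        # Controlled generation: draw from p(x | c)
        if rng is None:
            rng = np.random.default_rng(0)
        return rng.normal(self.modes[control], 1.0, size=n)

    def prob(self, x, control):
        # Query the simulator: how plausible is x given control signal c?
        z = x - self.modes[control]
        return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

sim = ConditionalSimulator()
# A point near 0 is plausible under "cats" but not under "dogs"
print(sim.prob(0.2, "cats") > sim.prob(0.2, "dogs"))
```

Captioned image generation, colorization, and machine translation all fit this same shape; only the spaces of x and c change.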
It's what we call, in machine learning, a generative model. In particular, in this class we're going to be thinking about deep generative models, where we use neural networks and deep learning ideas to implement this piece of code that gives you the capability of generating data. To give you a few examples: if you have a generative model of images, you might be able to control it using sketches. Maybe you're not good at painting, and you can only produce a rough sketch of a bedroom; you feed it as a control signal into your generative model, and you can use it to produce realistic images that have the structure of the sketch you provided but look much better. Or you can do text-to-image kinds of things, where if you have a generative model that has been trained on paintings, you can control it through captions and ask the model to generate a new painting corresponding to the description provided by the user. Other examples you might not think about immediately: you could have a generative model over medical images, where the control signal is an actual measurement coming from an MRI machine or a CT scanner, and you use that signal to reconstruct the medical image, the thing you actually care about. In this kind of application, generative models have been shown to be very effective because they can reduce the number of measurements, and thus the amount of radiation you have to give the patient, needed to produce medical images good enough for the doctor to come up with a diagnosis. An example of the kind of thing you can do if you can evaluate probabilities is outlier detection. You're actually going to play with a variant of this in the homework.
If you have a generative model that understands traffic signs, it might be able to say, okay, this looks like a reasonable traffic sign you might encounter on the street. Whereas if you feed it something like this, some kind of adversarial example where somebody is trying to cause trouble for your self-driving vehicle, the model might be able to say: this is a low-probability input, this is weird, do something about it, maybe don't trust it, ask a human for help. And this is really an exciting time to study generative models, because there's been a lot of progress across many different modalities. I'm going to start with images, because that's where I've done a lot of my research. When I started working in this space about ten years ago, these were the sorts of images we were able to generate, and even that was considered remarkable. People were very surprised that it was possible to train a machine learning system to produce images of people that were black and white and roughly had the right shape. You can see that over a few years, with progress largely driven by generative adversarial networks, a class of generative models we're going to talk about, the generations became better and better: higher resolution, more detailed, more realistic images of people. One of the big improvements over the last two or three years actually largely came out of Stanford: Yang Song, who was a PhD student in my group, came up with the idea of score-based diffusion models, a different kind of generative model that we're also going to cover in this course, and was able to further push the state of the art.
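The outlier-detection idea can be sketched in a few lines. Here a Gaussian density and a made-up threshold stand in for a trained model's log-probability; the point is only the decision rule, flag whatever the model considers very unlikely.

```python
import numpy as np

def log_prob(x, mu=0.0, sigma=1.0):
    # Stand-in for a trained generative model's log p(x)
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def is_outlier(x, threshold=-8.0):
    # Flag inputs the model assigns very low probability to,
    # e.g. an adversarially perturbed traffic sign
    return log_prob(x) < threshold

print(is_outlier(0.5))  # typical input  -> False
print(is_outlier(6.0))  # far from anything in training -> True
```

In practice the threshold would be calibrated on held-out data, and log_prob would come from one of the deep models covered later in the course.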
For example, generating very high-resolution images that look like this: these people don't exist, they are completely synthesized by one of these generative models. Diffusion models are really the technology that drives a lot of the text-to-image systems you might have seen. Things like Stable Diffusion, DALL-E, or Midjourney are, we believe, all based on this type of generative model, this way of representing a probability distribution with a diffusion model. Once you have a good diffusion model, you can try to control it using captions, and you get these really cool text-to-image systems where the user provides a caption of the kind of image the system should produce. For example, an astronaut riding a horse, and these are the kinds of results you can get with the systems we have today. This is really cool. These models have been trained on a lot of data, but presumably they have not seen something like this on the internet: they might have seen an astronaut, they have definitely seen a horse, but they probably have not seen those two things together. So it's very impressive that the model is able to understand the meaning of astronaut and the meaning of horse and put them together. The fact that it can generate this kind of picture tells me there is some level of understanding of what an astronaut means, what riding means, what a horse means. Even if you look at the landscape, it feels like it's probably on some other planet or something.
So there is some level of understanding of these concepts showing here, and that's super exciting, I think, because it means we're really making progress in understanding the meaning of text, of images, and their relationship, and that has been driving a lot of the successes we're seeing in the multimodal space. Here's another example. If you ask a system for a perfect Italian meal, you get this. I'm generating multiple samples here: because it's a probability distribution, you can sample from it, and it will generate different answers. The generation is stochastic; a different random seed will produce a different output every time. I think we can see four of them. Again, it does a pretty good job. Some of the stuff is clearly made up, but it's interesting how it even captures, out of the window, the kind of buildings you would probably see in Italy. It has the right flavor, I think. Here's another example from a recent system developed in China: a teddy bear wearing a costume, standing in front of the Hall of Supreme Harmony and seeing the Beijing Opera. A pretty crazy caption, and it produces things like this. Pretty impressive. And this is the latest, which came out very recently: DALL-E 3 from OpenAI. We don't know yet exactly what this model is built on. This is an example from their blog post; you can see the caption yourself. Pretty cool. Again, it demonstrates a pretty sophisticated understanding of concepts and a good way of combining them. So this is text-to-image generation. The nice thing about these models is that you can often control them using different kinds of control signals. Here we're controlling using text, using captions, but there are also a lot of inverse problems, a field that has been studied for a long time.
People have been thinking about how to colorize an image, how to do super-resolution on an image, how to do in-painting on an image. These problems become much easier to solve once you have a good model that really understands the relationships between the pixel values you typically see in an image. So there's been a lot of progress in, say, super-resolution, where you go from low-resolution images like this to high-resolution images like that; in colorization, where you can take old black-and-white photos and colorize them in a meaningful way; or in in-painting, where if you have an image with some pixels masked out, you can ask the model to fill them in, and it does a pretty good job. These are probably not the most up-to-date references, but you can get a sense of why these models are so useful in the real world. Here's an example from SDEdit, a system that, again, one of my PhD students developed. This is back to sketch-to-image: you start with a sketch, a rough painting of the image you would like, the kind of thing I would actually be able to draw, and then you ask the model to refine it and produce a pretty picture that has the right structure but looks much nicer. I would never be able to produce the image at the bottom, but I could probably come up with the sketch you see at the top. Here you can see more examples, where you can do sketch-to-image, or even stroke-based editing: you start with an image, you change it based on a rough sense of what you want the image to contain, and the model makes it pretty for you. And you don't have to control the editing through strokes; another natural way of controlling this kind of editing process is through text.
So instead of drawing what you want, you can tell the model how you want your image to be edited. You might start with an image of a bird, but now you want it to spread its wings; you tell the model "spread the wings," and it's able to do that update. Or you have an image with two birds, and now you want the birds to be kissing, and this is what it produces. Or you have an image with a box, and you want the box to be open. You can see some pretty impressive results in terms of image editing: changing the pose of this dog, or even changing the style of the image, going from a real photo to some kind of drawing. It does a pretty good job. You can see it makes some mistakes; this knife here gets changed in a way that is not quite what we want. These systems are not perfect yet, but the capabilities are very impressive and already very useful. And back to the more exotic application that you might not necessarily think fits in this framework, just to give you a sense of how general these ideas are: if you have a generative model of medical images, you can use it to improve the way we do medical imaging. In this case, the control signal is an actual measurement from, let's say, a CT scanner, and you control the generative process using that measurement. This can drastically reduce the amount of radiation, the number of measurements, you need to get a crisp image you can show to the doctor. This is very similar to in-painting, just in-painting in a slightly different space; it's roughly the same problem. And advances in generative models translate into big improvements in these real-world applications.
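Here is a rough numerical sketch of that inverse-problem view, with a simple Gaussian prior standing in for a learned generative prior (all sizes and names are made up): we observe fewer noisy linear measurements than unknowns, and the prior resolves the ambiguity.

```python
import numpy as np

# CT-style setup: observe a few noisy linear measurements y = A @ x + noise
# of an unknown signal x, then reconstruct x using a prior over signals.
rng = np.random.default_rng(0)
d, m = 8, 4                       # signal dimension, number of measurements (m < d)
x_true = rng.normal(size=d)
A = rng.normal(size=(m, d))       # measurement operator (e.g. CT projections)
sigma = 0.05                      # measurement noise level
y = A @ x_true + sigma * rng.normal(size=m)

# MAP reconstruction under a Gaussian prior x ~ N(0, I):
#   argmin_x ||y - A x||^2 / sigma^2 + ||x||^2
# (with a learned generative prior this step becomes, e.g., diffusion sampling)
x_map = np.linalg.solve(A.T @ A / sigma**2 + np.eye(d), A.T @ y / sigma**2)

# The measurements are well explained; the prior fills in what the
# underdetermined system (4 equations, 8 unknowns) cannot pin down
print(np.linalg.norm(A @ x_map - y))
```

Swapping the Gaussian prior for a deep generative model of medical images is exactly what lets real systems get away with fewer measurements, and hence less radiation.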
All right, moving on to different modalities: speech and audio have been another area where people have built some pretty good generative models. This is one of the earliest ones, the WaveNet model from 2016, and you can hear some examples, let's hope this works. First, a pre-deep-learning text-to-speech system, which is not great: "The Blue Lagoon is a 1980 American romance and adventure film directed by Randal Kleiser." And then the WaveNet model, a deep-learning-based model for text-to-speech, which you can hear is significantly better: "The Blue Lagoon is a 1980 American romance and adventure film directed by Randal Kleiser." And these are maybe the latest ones, based on a combination of diffusion models and autoregressive models. Here you can hear the 2023 style: "Once you have the first token, you want to predict the second token given the input and the first token using multi-head attention." You can hear it's much more realistic. There is a little bit of an accent here.
There's a little bit of emotion in there; it feels a lot less robotic, a lot less fake. Here's another example of text-to-speech: you input the text and it produces the corresponding speech: "CS236 is the best class at Stanford." And again, you can use these things to solve inverse problems, so you can do super-resolution in the audio space. You condition on a low-quality signal, the kind of thing you might get over the phone: "one is investment, one is reform," and then you super-resolve it: "one is investment, one is reform." This is basically the same problem as in-painting: you're missing some frequencies instead of some pixels, and you ask the model to fill them in. To the extent that it understands the relationships between these values, which you can also kind of think of as images, it can do a pretty good job of super-resolving audio. Language, of course, is another space where there's been a lot of progress and a lot of excitement around large language models. These are models that have been trained on large quantities of text, often collected from the internet, and they learn a probability distribution over which sentences make sense and which don't. You can use them to do a sort of in-painting, where you ask the model to complete a sentence that starts with some prompt. For example, this was an old language model, from around 2019 I think, where you ask the model to continue a sentence that starts with "to get an A+ in deep generative models, students have to," and it completes it for you. It says something somewhat reasonable, about being willing to work on problems that are hard but interesting. Not great, not perfect by today's standards.
But again, when this thing came out, it was pretty mind-blowing that you could build a model that generates this quality of text. Now, I tried something similar on ChatGPT, and this time I tried something harder. I asked: what should I do to get an A+ in CS236 at Stanford? I didn't even tell the model what CS236 is; it actually knows that CS236 is deep generative models, and it gives you some pretty good tips on how to do well in the class: attend lectures, read the materials, stay organized, seek help, do the homework. It gives you fifteen of them; I cut the list here. It's pretty impressive that you can do these kinds of things, and it probably means there is some level of understanding. That's why these models are so powerful when people use them for all sorts of things: because they can generate, it means they understand something, and you can use that knowledge to solve a variety of tasks we care about. The nice thing about this space is that you can often mix and match, controlling these models with various control signals. Once you can do generation, you can steer the generative process with different control signals. A natural one here would be: generate text in English conditioned on some text in a different language, maybe Chinese. This is basically machine translation. So progress in generative models directly translates to progress in machine translation: if you have a model that really understands how to generate text in English, and it can take advantage of the control signal well, then essentially it's able to do the translation reasonably well.
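The mechanism behind these language models can be sketched with a toy autoregressive model. Here a made-up bigram table replaces the neural network (the vocabulary and probabilities are invented), but generation works the same way: sample one token at a time from p(x_t | x_<t), and the chain rule gives the probability of a whole sequence.

```python
import numpy as np

vocab = ["<s>", "deep", "generative", "models", "are", "fun", "</s>"]
V = len(vocab)
rng = np.random.default_rng(0)

# A made-up conditional distribution p(next | prev); each row sums to 1.
# A real LLM computes this row with a neural network over the full context.
probs = rng.dirichlet(np.ones(V), size=V)

def sample_sentence(max_len=10):
    # Autoregressive generation: append one sampled token at a time
    tokens = [0]  # start-of-sequence token
    for _ in range(max_len):
        nxt = int(rng.choice(V, p=probs[tokens[-1]]))
        tokens.append(nxt)
        if vocab[nxt] == "</s>":
            break
    return [vocab[t] for t in tokens]

def log_prob(tokens):
    # Chain rule: log p(x_1..x_T) = sum_t log p(x_t | x_{t-1})
    idx = [vocab.index(t) for t in tokens]
    return float(sum(np.log(probs[a][b]) for a, b in zip(idx, idx[1:])))

print(sample_sentence())
```

Prompt completion is the same loop started from the prompt's tokens instead of just "<s>"; conditioning on a source-language sentence instead turns the loop into translation.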
And a lot of the progress in terms of the models and architectures we're going to talk about in this class is behind the pretty good machine translation systems we have today. Another example is code. Very exciting, since this is computer science: many of you are computer scientists and write a lot of code. At the end of the day, code is text. If you have a model that understands which sequences of text make sense and which ones don't, you can use it to write code for you. Here's an example of a system that exists today, where the model autocompletes, say, the body of a function based on a description of what the function is supposed to do. These systems are not perfect, but they're already pretty good: they can solve many interesting tasks, they can solve programming assignments, and they do reasonably well in competitive programming competitions. So again, it's pretty cool that they understand the natural language, they understand the syntax of the programming language, and they know how to put things together to do the right thing; they're able to translate from natural language to a formal language, Python in this case. So there's a lot of excitement around these sorts of models too. Another modality that is pretty cool is video. This is one of the active areas where the first systems are being built. Again, you can imagine a variety of interfaces where you control the generative process through many different things; a natural one is text. You start with a caption and ask the model to generate a video corresponding to it. This is one example. The videos are pretty short right now, that's one of the limitations. But can you see it? Oh yeah, okay, it shows up there.
Here's another example: asking it to generate a video of a couple sledding down a snowy hill, in a Romanesque church style, and this is what it produces. Pretty short videos. At the end of the day, you can think of a video as a sequence of images, so if you can generate images, it's believable that you can also generate a stack of images, which is essentially a video. But it's pretty impressive: there's a good amount of coherence across the frames, it captures roughly what's asked by the user, and the quality is pretty high. If you're willing to work on this and stitch together many different clips, you can generate some pretty cool stuff. "Let's stop. Do you know who I am? I don't know who I am." This is basically stitching together a bunch of videos generated with the previous system. Again, you can see it's not perfect, but it's remarkable. We're not at the level where you can just ask the system to produce a movie with a certain plot or your favorite actor, but it's already able to produce high-quality content that people are willing to look at and engage with. So that's an exciting development in generative models of video. We're seeing the kind of progress in this space that I showed you before for images, and it's happening right now. I think when people figure this out and get really good systems that can generate long, high-quality videos, this could really change things; a lot of the media industry is going to have to pay attention. Question, yeah. Yeah. I don't know exactly what went into this, I didn't make it myself, but I know the system also allows you to control it through a caption and a seed image. So if you already know what you want your character to look like, you can use that and animate a given image.
And again, it's an example of controlling the generative process: you can control it through text, you can control it through images, there are many different ways to do this. This is actually from a former PhD student in our group, so it's a system they are developing; it's very good, I agree, pretty impressive stuff. That's the kind of thing you can do once you learn this material well. All right, a completely different application area: decision-making and robotics. In a lot of this domain, what you care about is taking actions in the world to achieve a certain goal, say driving a car or stacking some objects. At the end of the day, you can think of it as generating a sequence of actions that makes sense. So the kind of machinery we're going to talk about in this course translates pretty well to what we call imitation learning problems, where you are given examples of good behavior, provided maybe by a human, and you want your model to generate other behaviors that are good. For example, you want the model to learn how to drive a car or how to stack objects. Here's an example of using these sorts of techniques to learn how to drive a car in a video game. You have to figure out what sorts of actions make sense: not crashing into other cars, staying on the road, and so forth. It's non-trivial, but if you have a good generative model, then you can make good decisions in this simulator. And this is an example where you can train a diffusion model, in this case, to stack objects.
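A minimal behavior-cloning sketch of this imitation-learning view (a toy linear "driving" policy; the state features and expert are invented): we fit the conditional distribution of expert actions given states, then generate actions from it at test time.

```python
import numpy as np

# Hypothetical demonstrations: expert steering is a noisy linear
# function of a 3-D state (e.g. lane offset, heading, speed).
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 3))
w_expert = np.array([1.0, -2.0, 0.5])
actions = states @ w_expert + 0.1 * rng.normal(size=500)

# Behavior cloning = maximum likelihood for a Gaussian policy p(a | s);
# for a linear-Gaussian model that is just least squares.
w_hat, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(state, rng=rng):
    # 'Generate' an action by sampling from the learned conditional
    return float(state @ w_hat + 0.1 * rng.normal())

print(np.allclose(w_hat, w_expert, atol=0.05))
```

The deep versions replace the linear-Gaussian conditional with an autoregressive or diffusion model over whole action trajectories, but the generative framing, model p(actions | observations) from demonstrations, then sample, is the same.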
So again, you need to figure out what sorts of trajectories make sense, and if you have a good model that understands which trajectories have the right structure, then you can use it to stack a different set of objects, and you can control the model to produce high-quality policies. There's also a lot of excitement in science and engineering around generative models. One of your TAs is one of the world's experts on using generative models to synthesize molecules or proteins that have certain properties, either at the level of the sequence or even at the 3D level, where the model really has to understand the layout of these molecules. There is a lot of interest in this space around building generative models to design drugs or to design better catalysts. At the end of the day, you can think of it again as a kind of generative model where you have to come up with a recipe that does well at a certain task. If you have trained a model on a lot of data about which proteins, say, perform well at a certain task, then you might be able to generate a sequence of amino acids that performs the task even better than the ones we have, or you might be able to design a drug that binds in a certain way because you're targeting, let's say, COVID or something. So there is a lot of interest around building generative models over modalities that are somewhat different from the typical ones. It's not images, it's not text, but it's the same generative models: it's still diffusion models, or autoregressive models, the kinds of models we're going to talk about in this course. And, right, lots of excitement. There are many other modalities that I didn't put in this slide deck where there's been progress, like generating 3D objects. That's another very exciting area.
And many more. Of course, there is also a bit of worry, and hopefully we'll get to talk about it a bit in the class: computers are getting so good at generating content that is hard to distinguish from the real thing. There is this big issue around deepfakes: which one is real, which one is fake? This was produced again by my students, but you can get a sense of the sorts of dangers these technologies can have, and there is a lot of potential for misuse of these systems. So hopefully we'll get to talk about that during the class. All right, so that was the intro. Hopefully it got you excited about the topic and showed you that it's really an exciting time to be working in this area, which is why there is so much excitement in industry and in academia around these topics; everybody is trying to innovate, build systems, figure out how to use them in the real world, find new applications. So it's an exciting time to study this. The course is designed to cover what we think are the core concepts in this space. Once you understand the different building blocks, the kinds of challenges, the kinds of trade-offs that all these models make, then you can not only understand how existing systems work, but hopefully you can also design the next generation of systems, improve them, figure out how to use them in a new application area. Again, the course is designed to be pretty rigorous; there's going to be quite a bit of math. It's really going to delve deep into the key ideas. So we're going to talk a lot about representation, as we discussed. The key building block is going to be statistical modeling: we're going to be using probability distributions.
So we're going to talk a lot about how to represent these probability distributions, how to use neural networks to model probability distributions when we have many random variables. That is the challenge. You've seen simple probability distributions, like Gaussians and things like that. That doesn't work in this space, because there are so many different things that you have to consider and model at the same time. So you need to come up with clever ways to represent how all the different pixels in an image interact with each other, or how the different words in a sentence are connected to each other. A lot of the course content will focus on the different ideas, the different trade-offs that you have to make when you build these kinds of models. We're going to talk about learning. Again, these are going to be statistical generative models, so there's always going to be data, and you're going to use the data to fit the models. There are many different ways to fit models, many different kinds of loss functions that you can use: the one used in diffusion models, the one used in generative adversarial networks, the one used in large language models, autoregressive models. They essentially boil down to different ways of comparing probability distributions. You have a data distribution, you have the model distribution, and you want those two things to be similar, so that when you generate samples from the model, they look like the ones that came from the data distribution. But, going back to the first point, probability distributions over very complicated, very high-dimensional objects are complex, and it's not always straightforward to compare two probability distributions and measure how similar they are.
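To see why "many random variables" is the challenge, consider binary pixels: a full joint table for p(x1, ..., xn) needs one entry per configuration, which blows up exponentially, while the (too strong) assumption that all pixels are independent needs only one parameter each. A small sketch of that counting argument; the image size is just illustrative:

```python
def full_joint_params(n):
    """A full table for p(x1, ..., xn) over n binary variables has 2**n entries,
    minus 1 because the probabilities must sum to one."""
    return 2 ** n - 1

def fully_factored_params(n):
    """If we (unrealistically) assume all variables independent,
    p(x) = prod_i p(x_i), we need just one parameter per variable."""
    return n

print(full_joint_params(4))       # 15
print(fully_factored_params(4))   # 4

# Even a tiny 28x28 binary image is hopeless with a full table:
n = 28 * 28
print(full_joint_params(n) > 10 ** 200)  # True: astronomically many parameters
```

The course's modeling ideas (autoregressive factorizations, latent variables, and so on) live between these two extremes: far fewer parameters than the full table, far more expressive than full independence.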
So you have a data distribution, you have a family of models that you can pick from, and you have to pick one that is close to the data; but measuring similarity is very difficult, and depending on how you measure similarity, you're going to get different kinds of models that work well in different kinds of scenarios. And then we're going to talk about inference. We're going to talk about how to generate samples from these models efficiently. Sometimes you have the probability distribution, but it might not be straightforward to sample from it, so we will talk about that. We will talk about how to invert the generative process, how to get representations from these objects, for example, making the idea of vision as inverse graphics a little bit more concrete. We'll also touch on unsupervised learning and different ways of clustering, because at the end of the day, what these models do is find similarity between data points. When you're trying to complete a sentence, you have to go through your training set, find similar sentences, figure out how to combine them, and figure out how to complete the prompt that you're given. So once you have generative models, you usually also get representations; you have ways of clustering data points that have similar meaning. You can get features, and you can do the sorts of things you would want to do in unsupervised learning, which is machine learning when you don't have labels: you only have the x, but you don't have the y, and you want to do interesting things with the features themselves. So those are the three key ideas that are going to show up quite a bit. In terms of models, we're going to start with perhaps the simplest kind of model, which is one where you essentially have access to a likelihood directly.
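One concrete way to compare a data distribution and a model distribution, and the one underlying maximum likelihood training, is the KL divergence: KL(p || q) = sum_x p(x) log(p(x) / q(x)). It is zero exactly when the two distributions match and positive otherwise. A minimal sketch on toy discrete distributions (the numbers are made up):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)).
    Assumes q(x) > 0 wherever p(x) > 0, or the divergence is infinite."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Toy "data" distribution over three outcomes, and two candidate models.
p_data = {"a": 0.5, "b": 0.4, "c": 0.1}
good   = {"a": 0.5, "b": 0.4, "c": 0.1}   # matches the data exactly
bad    = {"a": 0.1, "b": 0.1, "c": 0.8}   # very different

print(kl_divergence(p_data, good))      # 0.0 -- identical distributions
print(kl_divergence(p_data, bad) > 0)   # True -- positive whenever they differ
```

Since KL(p_data || p_model) equals a constant (the negative entropy of the data) minus the expected log-likelihood of the model, minimizing this divergence over models is the same as maximum likelihood; other choices of similarity measure lead to the other training objectives discussed later in the course.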
There are going to be two kinds of models in this space: autoregressive models and flow-based models. Autoregressive models are the ones used in large language models and a few of the other systems I talked about today. Flow-based models are a different idea, often used for images and other kinds of continuous data. Then we'll talk about latent variable models, the idea of using latent variables to increase the expressive power of your generative models. We'll talk about variational inference, variational learning, the variational autoencoder, hierarchical variational autoencoders, those sorts of ideas. We'll talk about implicit generative models. Here the idea is that instead of representing the probability distribution p(x), you represent the sampling process that you use to generate samples. That has trade-offs: it allows you to generate samples very efficiently, but it becomes difficult to train the models because you don't have access to a likelihood anymore, so you cannot use maximum likelihood estimation and those kinds of ideas that we understand very well and know have good performance. So we'll talk about two-sample tests, f-divergences, and different ways of training these sorts of systems, and in particular we'll talk about generative adversarial networks and how to train them. Then we'll talk about energy-based models and diffusion models. Again, this is sort of the state of the art in terms of image generation and audio generation, and people are starting to use them also for text. That's the technology behind the video generation I showed you before. So we'll talk in depth about how they work, how you can think of them as latent variable models, and the connections with all the other ideas. And yeah, again, it's going to be a fairly mathematical class, so there's going to be a lot of theory, there are going to be algorithms, and we will go through applications.
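The autoregressive idea mentioned above is just the chain rule of probability: p(x1, ..., xn) = prod_i p(x_i | x1, ..., x_{i-1}), so you can generate a sequence one variable at a time, each conditioned on what has been generated so far. A toy character-level sketch, where hand-written bigram conditionals stand in for the neural network a real model would use (the character table is entirely made up):

```python
import random

# Made-up conditionals p(next_char | prev_char); a real autoregressive model
# computes these with a neural network conditioned on the whole prefix.
cond = {
    "^": {"h": 0.6, "c": 0.4},   # "^" marks the start of the sequence
    "h": {"i": 0.7, "a": 0.3},
    "c": {"a": 1.0},
    "a": {"t": 1.0},
    "i": {"$": 1.0},             # "$" marks the end of the sequence
    "t": {"$": 1.0},
}

def sample_sequence(cond, max_len=10):
    """Chain-rule sampling: draw x1, then x2 given x1, and so on."""
    out, prev = [], "^"
    for _ in range(max_len):
        chars, probs = zip(*cond[prev].items())
        prev = random.choices(chars, weights=probs)[0]
        if prev == "$":
            break
        out.append(prev)
    return "".join(out)

print(sample_sequence(cond))  # one of "hi", "hat", or "cat"
```

The factorization is exact for any distribution; what distinguishes modern autoregressive models is only how the conditionals are parameterized and learned.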
There are going to be homeworks where you get to play around with these models. In terms of prerequisites, we expect you to have taken at least a machine learning class. We'll try to do as much as possible from scratch, and we'll have some sections to go over some of the background content, but it might be pretty hard to take this class if you've never done any ML before. You should be familiar with probability theory, linear algebra, and calculus: we're going to use gradient descent, Bayes' rule, the change-of-variables formula, those sorts of things. You can probably pick it up, but it might be pretty tricky if you've never seen these ideas before. There are also going to be programming assignments, so you should be familiar with Python and PyTorch; that's what we're going to use. Again, we'll have a section on that if you've never seen it before, but it might be tricky if you've not done any of this before. In terms of logistics, we have a website. It's not entirely updated, so some of the information might change; keep checking it. We're finalizing some of the dates and trying to get confirmation about the rooms for the midterm and the poster sessions, but hopefully that will be done soon. We don't have a textbook; one doesn't really exist. This was actually, I think, the first class offered on this topic when it was created here a few years ago, and nothing like it existed, so we had to create it from scratch. We put together a set of lecture notes that you can access there, where we try to cover essentially the content you see in the slides. Some of the content is also covered in the deep learning book you see there, so that's a useful reference. It's available online, so you might want to check it out.
Yeah, we have a great team of teaching assistants, and there should be a calendar on the website now with our office hours. Most of them will start next week, but otherwise feel free to reach out if you cannot find us in person this week. We're always happy to chat. In terms of grading and coursework: there are going to be three homeworks. The first one will be released on Monday next week. They are worth 15% of the total grade each, 45% total, and they go over a mix of theory and programming; there's going to be a programming component associated with all of them. We're going to have a midterm; it's going to be an in-class, in-person midterm. And the big component of the class is going to be a project. We think there is so much to do in this space that it makes sense for you to really explore. It's going to account for 40% of the grade, so it's a pretty significant component, and there are a bunch of milestones: you're going to start with a proposal, there's going to be a progress report that you have to turn in about how things are going, there's going to be a poster presentation towards the end, and then a final report on the work you did. I like that this class really gives you an opportunity to explore, and there's just so much going on in this space that there are lots of interesting project ideas that turn into papers, turn into company ideas; lots of excitement here. You can work in groups of up to three students. Typically, projects are one of three things. Sometimes students apply an existing generative model to a new dataset: maybe they come from an application domain and find a new, interesting way to use the models on a new problem, or they compare different generative models on a new kind of dataset. Sometimes people work on trying to improve the models themselves. Again, these things are pretty new.
It's unlikely that we've found the best way of solving these problems, so there is still a lot of room for improvement. Often you can combine different methods: you can take a diffusion model, add a bit of a generative-adversarial-training flavor to it, and get big improvements. These things can often be published in top machine learning conferences if they work well. And sometimes people do more theoretical analysis. There's going to be quite a bit of theory, quite a bit of math in the class, and there is a lot of room for improvement in trying to understand why these models work, when they work, and when they fail. Right now it's all very empirical, and we really need a better theory of why things like the ones I've shown you are possible. So there is lots of room for developing better theory in this space. We will also be suggesting possible projects, so look out for information about projects suggested by the TAs or by other faculty on campus. We are able to provide some Google Cloud coupons; it's not much, unfortunately, but at least it's a little bit, and we'll figure out a way to distribute them to students. If you want some inspiration for the kinds of projects people worked on in previous years, you can go to the older versions of the website, the 2019 and 2021 versions. You can get a sense of the kinds of projects people worked on, what's enough for a project, what worked, what didn't, and get some ideas. And yeah, I think that was pretty much what I wanted to cover today. I'm happy to take questions, and next week we're going to start with the background and autoregressive models and so forth.