Intro|Deep Learning for Coders with Fastai and PyTorch


  1. Do you need these for deep learning?

    • Lots of math T / F
      • F, just high school math is sufficient
    • Lots of data T / F
      • F, we've seen record-breaking results with <50 items of data
    • Lots of expensive computers T / F
      • F, You can get what you need for state-of-the-art work for free
    • A PhD T / F
      • F, a PhD is definitely not required. All that matters is a deep understanding of AI and the ability to implement NNs in a way that is actually useful (the latter is what's truly hard); it doesn't even matter whether you graduated high school.
  2. Name five areas where deep learning is now the best in the world.

    • Natural language processing (NLP)
      • answering questions
      • speech recognition
      • summarizing documents
      • classifying documents
      • finding names, dates, etc. in documents
      • searching for articles mentioning a concept
    • Computer vision
      • satellite and drone imagery interpretation (e.g., for disaster resilience)
      • face recognition
      • image captioning
      • reading traffic signs
      • locating pedestrians and vehicles in autonomous vehicles
    • Medicine
      • finding anomalies in radiology images, including CT, MRI, and X-ray images
      • counting features in pathology slides
      • measuring features in ultrasounds
      • diagnosing diabetic retinopathy
    • Biology
      • Folding proteins
      • classifying proteins
      • many genomics tasks, such as tumor-normal sequencing and classifying clinically actionable genetic mutations
      • cell classification
      • analyzing protein/protein interactions
    • Image generation
      • colorizing images
      • increasing image resolution
      • removing noise from images
      • converting images to art in the style of famous artists
    • Recommendation systems
      • web search
      • product recommendations
      • home page layout
    • Playing games
      • Chess
      • Go
      • most Atari video games
      • many real-time strategy games
    • Robotics
      • handling objects that are challenging to locate (e.g., transparent, shiny, lacking texture) or hard to pick up
    • Other applications
      • Financial and logistical forecasting
      • Text-to-speech
      • much more
  3. What was the name of the first device that was based on the principle of the artificial neuron?

    • Rosenblatt further developed the artificial neuron to give it the ability to learn. Even more importantly, he worked on building the first device that actually used these principles, the Mark I Perceptron.
  4. Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?

    In fact, the approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined parallel distributed processing as requiring:

    • A set of processing units
    • A state of activation
    • An output function for each unit
    • A pattern of connectivity among units
    • A propagation rule for propagating patterns of activities through the network of connectivities
    • An activation rule for combining the inputs impinging on a unit with the current state of that unit to produce an output for the unit
    • A learning rule whereby patterns of connectivity are modified by experience
    • An environment within which the system must operate

    We will see in this book that modern neural networks handle each of these requirements.

  5. What were the two theoretical misunderstandings that held back the field of neural networks?

    • Minsky and Papert showed that a single layer of these devices was unable to learn some simple but critical mathematical functions (such as XOR).
    • In theory, adding just one extra layer of neurons was enough to allow any mathematical function to be approximated with these neural networks, but in practice such networks were often too big and too slow to be useful.
  6. What is a GPU?

    • Graphics Processing Unit (GPU): Also known as a graphics card. A special kind of processor in your computer that can handle thousands of single tasks at the same time, especially designed for displaying 3D environments on a computer for playing games. These same basic tasks are very similar to what neural networks do, such that GPUs can run neural networks hundreds of times faster than regular CPUs. All modern computers contain a GPU, but few contain the right kind of GPU necessary for deep learning.
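
    A quick way to check whether PyTorch can see a usable GPU (a minimal sketch; PyTorch must be installed with CUDA support for this to return True):

    ```python
    import torch

    # True if PyTorch can see a CUDA-capable GPU
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        # Name of the first GPU, e.g. "Tesla T4"
        print(torch.cuda.get_device_name(0))
    ```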
  7. Open a notebook and execute a cell containing: 1+1. What happens?

    • output: 2
  8. Follow through each cell of the stripped version of the notebook for this chapter. Before executing each cell, guess what will happen.

    • done
  9. Complete the Jupyter Notebook online appendix.

    • done
  10. Why is it hard to use a traditional computer program to recognize images in a photo?

    • What are the steps we take when we recognize an object in a picture? We really don't know, since it all happens in our brain without us being consciously aware of it! Because we can't articulate those steps, we can't write them down as a traditional program.
  11. What did Samuel mean by "weight assignment"?

    • Weights are just variables, and a weight assignment is a particular choice of values for those variables. The program's inputs are values that it processes in order to produce its results—for instance, taking image pixels as inputs, and returning the classification "dog" as a result. The program's weight assignments are other values that define how the program will operate.
  12. What term do we normally use in deep learning for what Samuel called "weights"?

    • By the way, what Samuel called "weights" are most generally referred to as model parameters these days, in case you have encountered that term. The term weights is reserved for a particular type of model parameter.
  13. Draw a picture that summarizes Samuel's view of a machine learning model.

    (Figure from the book: inputs and weights feed into a model, which produces results.)

  14. Why is it hard to understand why a deep learning model makes a particular prediction?

    • Deep learning models are hard to interpret because no human wrote down the steps they follow: a prediction emerges from many layers and (often millions of) weights that the model learned for itself, so the "reasoning" is distributed across all those weights rather than stated as inspectable rules. Interpretability of deep learning models is an active area of research.
  15. What is the name of the theorem that shows that a neural network can solve any mathematical problem to any level of accuracy?

    • That is, if you regard a neural network as a mathematical function, it turns out to be a function which is extremely flexible depending on its weights. A mathematical proof called the universal approximation theorem shows that this function can solve any problem to any level of accuracy, in theory. The fact that neural networks are so flexible means that, in practice, they are often a suitable kind of model, and you can focus your effort on the process of training them—that is, of finding good weight assignments.
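
    A tiny demonstration of that flexibility (a sketch, not from the book): a network with a single hidden layer of ReLU units, the setting of the universal approximation theorem, fit to sin(x):

    ```python
    import torch
    from torch import nn

    # One hidden layer plus a nonlinearity: the family of functions covered
    # by the universal approximation theorem.  With enough hidden units it
    # can approximate any continuous function on a bounded interval.
    x = torch.linspace(-3, 3, 200).unsqueeze(1)
    y = torch.sin(x)

    model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    for _ in range(1000):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(loss.item())  # small: the network has found weights that fit sin(x)
    ```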
  16. What do you need in order to train a model?

    • A model cannot be created without data.
    • A model can only learn to operate on the patterns seen in the input data used to train it.
    • This learning approach only creates predictions, not recommended actions.
    • It's not enough to just have examples of input data; we need labels for that data too (e.g., pictures of dogs and cats aren't enough to train a model; we need a label for each one, saying which ones are dogs, and which are cats).
  17. How could a feedback loop impact the rollout of a predictive policing model?

    • A predictive policing model is created based on where arrests have been made in the past. In practice, this is not actually predicting crime, but rather predicting arrests, and is therefore partially simply reflecting biases in existing policing processes.
    • This is a positive feedback loop, where the more the model is used, the more biased the data becomes, making the model even more biased, and so forth.
  18. Do we always have to use 224×224-pixel images with the cat recognition model?

    • This is the standard size for historical reasons (old pretrained models require this size exactly), but you can pass pretty much anything. If you increase the size, you'll often get a model with better results (since it will be able to focus on more details), but at the price of speed and memory consumption; the opposite is true if you decrease the size.
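
    The book's cat-classifier code makes the size choice explicit via `item_tfms`; changing the number passed to `Resize` is all it takes to train at a different resolution:

    ```python
    from fastai.vision.all import *

    path = untar_data(URLs.PETS)/'images'

    def is_cat(x): return x[0].isupper()  # in this dataset, cat filenames are capitalized
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat,
        item_tfms=Resize(224))  # try Resize(128) for speed or Resize(320) for accuracy
    ```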
  19. What is the difference between classification and regression?

    • Classification and Regression: classification and regression have very specific meanings in machine learning. These are the two main types of model that we will be investigating in this book. A classification model is one which attempts to predict a class, or category. That is, it's predicting from a number of discrete possibilities, such as "dog" or "cat." A regression model is one which attempts to predict one or more numeric quantities, such as a temperature or a location. Sometimes people use the word regression to refer to a particular kind of model called a linear regression model; this is a bad practice, and we won't be using that terminology in this book!
  20. What is a validation set? What is a test set? Why do we need them?

    • The validation set is used to measure the accuracy of the model. By default, the 20% that is held out is selected randomly.

    • Validation Set: When you train a model, you must always have both a training set and a validation set, and must measure the accuracy of your model only on the validation set. If you train for too long, with not enough data, you will see the accuracy of your model start to get worse; this is called overfitting. fastai defaults valid_pct to 0.2, so even if you forget, fastai will create a validation set for you!

    • The test set can only be used to evaluate the model at the very end of our efforts.

    • Having two levels of "reserved data"—a validation set and a test set, with one level representing data that you are virtually hiding from yourself—may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest. That doesn't mean we always need a separate test set—if you have very little data, you may need to just have a validation set—but generally it's best to use one if at all possible.

  21. What will fastai do if you don't provide a validation set?

    • fastai defaults valid_pct to 0.2, so even if you forget, fastai will create a validation set for you!
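
    You can see the default at work by omitting `valid_pct` (continuing the loaders sketch above):

    ```python
    # valid_pct defaults to 0.2, so a random 20% is still held out;
    # seed fixes *which* 20%, making runs reproducible.
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), label_func=is_cat, seed=42)
    print(len(dls.train_ds), len(dls.valid_ds))  # roughly an 80/20 split
    ```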
  22. Can we always use a random sample for a validation set? Why or why not?

    • Remember: a key property of the validation and test sets is that they must be representative of the new data you will see in the future.
    • One case is time series data. For a time series, a random subset is both too easy (you can look at the data both before and after the dates you are trying to predict) and not representative of most business use cases (where you build a model on historical data to use in the future). Instead, use the earlier data as your training set and the later data as your validation set.
    • A second common case is when you can easily anticipate ways the data you will be making predictions for in production may be qualitatively different from the data you have to train your model with.
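
    A minimal sketch of a time-based split, assuming a hypothetical `sales.csv` with a `date` column (the file and column names are illustrative):

    ```python
    import pandas as pd

    # Sort by time, then cut: train on the past, validate on the most
    # recent period the model has never seen.
    df = pd.read_csv('sales.csv', parse_dates=['date']).sort_values('date')
    cut = int(len(df) * 0.8)
    train_df = df.iloc[:cut]   # earlier 80% of the timeline
    valid_df = df.iloc[cut:]   # most recent 20%
    ```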
  23. What is overfitting? Provide an example.

    • Training a model in such a way that it remembers specific features of the training data, rather than generalizing to data not seen during training. If we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then makes poor predictions about new images; such a failure is called overfitting.
    • For example, a cat classifier trained for too long on too few photos may reach near-perfect accuracy on its training images yet misclassify new photos of cats.
  24. What is a metric? How does it differ from "loss"?

    • A metric is a function that measures the quality of the model's predictions using the validation set, and will be printed at the end of each epoch.
    • The concept of a metric may remind you of loss, but there is an important distinction. The entire purpose of loss is to define a "measure of performance" that the training system can use to update weights automatically. In other words, a good choice for loss is a choice that is easy for stochastic gradient descent to use. But a metric is defined for human consumption, so a good metric is one that is easy for you to understand, and that hews as closely as possible to what you want the model to do. At times, you might decide that the loss function is a suitable metric, but that is not necessarily the case.
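
    In fastai the two are separate arguments (a sketch assuming the `dls` built earlier; `CrossEntropyLossFlat` is fastai's flattened cross-entropy loss):

    ```python
    from fastai.vision.all import *

    # loss_func drives the automatic weight updates; metrics is only
    # reported to the human at the end of each epoch.
    learn = vision_learner(dls, resnet34,
                           loss_func=CrossEntropyLossFlat(),  # for SGD
                           metrics=error_rate)                # for you
    ```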
  25. How can pretrained models help?

    • To make the training process go faster, we might start with a pretrained model—a model that has already been trained on someone else's data. We can then adapt it to our data by training it a bit more on our data, a process called fine-tuning.
    • Using pretrained models is the most important method we have to allow us to train more accurate models, more quickly, with less data, and less time and money.
    • When using a pretrained model, vision_learner will remove the last layer, since that is always specifically customized to the original training task (i.e. ImageNet dataset classification), and replace it with one or more new layers with randomized weights, of an appropriate size for the dataset you are working with. This last part of the model is known as the head.
    • When we fine-tuned our pretrained model earlier, we adapted what those last layers focus on (flowers, humans, animals) to specialize on the cats versus dogs problem.
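
    This is exactly what the book's two lines of training code do:

    ```python
    from fastai.vision.all import *

    # vision_learner downloads ImageNet-pretrained resnet34 weights, cuts off
    # the old 1000-class head, and attaches a new randomly initialized head
    # sized for our two classes (cat / not cat).
    learn = vision_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(1)  # one epoch training just the new head (body frozen),
                        # then one epoch fine-tuning the whole model
    ```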
  26. What is the "head" of a model?

    • When using a pretrained model, vision_learner will remove the last layer, since that is always specifically customized to the original training task (i.e. ImageNet dataset classification), and replace it with one or more new layers with randomized weights, of an appropriate size for the dataset you are working with. This last part of the model is known as the head.
  27. What kinds of features do the early layers of a CNN find? How about the later layers?

    • These observations come from visualizations of an older model, AlexNet, which contained only five layers.
    • For layer 1, what we can see is that the model has discovered weights that represent diagonal, horizontal, and vertical edges, as well as various different gradients.
    • For layer 2, there are nine examples of weight reconstructions for each of the features found by the model. We can see that the model has learned to create feature detectors that look for corners, repeating lines, circles, and other simple patterns.
    • For layer 3, the features are now able to identify and match with higher-level semantic components, such as car wheels, text, and flower petals.
    • layers 4 and 5 can identify even higher-level concepts.
    • When we fine-tuned our pretrained model earlier, we adapted what those last layers focus on (flowers, humans, animals) to specialize on the cats versus dogs problem.
  28. Are image models only useful for photos?

    • For instance, a sound can be converted to a spectrogram, a chart that shows the amount of each frequency at each time in an audio file. Fast.ai student Ethan Sutin used this approach to beat the published accuracy of a state-of-the-art environmental sound detection model on a dataset of 8,732 urban sounds; fastai's show_batch clearly shows that each different sound has a quite distinctive spectrogram.
    • A time series can easily be converted into an image by simply plotting the time series on a graph. For instance, fast.ai student Ignacio Oguiza created images from a time series dataset for olive oil classification, using a technique called Gramian Angular Difference Field (GADF).
    • Another interesting fast.ai student project comes from Gleb Esman. He was working on fraud detection at Splunk, using a dataset of users' mouse movements and mouse clicks. He turned these into pictures by drawing an image in which the position, speed, and acceleration of the mouse pointer were displayed using colored lines, and the clicks were displayed using small colored circles. He then fed this into an image recognition model just like the one we've used in this chapter, and it worked so well that it led to a patent for this approach to fraud analytics!
    • Another example comes from the paper "Malware Classification with Deep Convolutional Neural Networks" by Mahmoud Kalash et al., which explains that "the malware binary file is divided into 8-bit sequences which are then converted to equivalent decimal values. This decimal vector is reshaped and a gray-scale image is generated that represents the malware sample,"
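
    A minimal sketch of the sound-to-image idea (synthetic data, not from the book): render a rising tone as a spectrogram image that any image classifier could consume:

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    sr = 22050                                      # sample rate in Hz
    t = np.linspace(0, 2, 2 * sr)
    wave = np.sin(2 * np.pi * (220 + 200 * t) * t)  # tone with rising pitch

    # Frequency content over time, saved as an ordinary image file
    plt.specgram(wave, Fs=sr)
    plt.axis('off')
    plt.savefig('chirp_spectrogram.png', bbox_inches='tight')
    ```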
  29. What is segmentation?

    • Creating a model that can recognize the content of every individual pixel in an image is called segmentation.
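
    The book's CAMVID example trains a segmentation model in a few lines; every pixel of each street scene is labeled with a class such as road, car, or pedestrian:

    ```python
    from fastai.vision.all import *

    path = untar_data(URLs.CAMVID_TINY)
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=8, fnames=get_image_files(path/'images'),
        label_func=lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
        codes=np.loadtxt(path/'codes.txt', dtype=str))
    learn = unet_learner(dls, resnet34)  # a U-Net predicts a class per pixel
    learn.fine_tune(8)
    ```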
  30. What is y_range used for? When do we need it?

    • This model is predicting movie ratings on a scale of 0.5 to 5.0 to within around 0.6 average error. Since we're predicting a continuous number, rather than a category, we have to tell fastai what range our target has, using the y_range parameter.
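
    From the book's collaborative-filtering example; note that `y_range` is set slightly wider than the true 0.5 to 5 rating scale:

    ```python
    from fastai.collab import *

    path = untar_data(URLs.ML_SAMPLE)
    dls = CollabDataLoaders.from_csv(path/'ratings.csv')
    # y_range tells fastai the targets are continuous ratings in this range
    learn = collab_learner(dls, y_range=(0.5, 5.5))
    learn.fine_tune(10)
    ```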
  31. What are "hyperparameters"?

    • In realistic scenarios we rarely build a model by training its weight parameters just once. Instead, we are likely to explore many versions of a model through choices of network architecture, learning rate, data augmentation strategy, and other factors discussed in upcoming chapters. Many of these choices are hyperparameters: the word reflects that they are parameters about parameters, the higher-level choices that govern the meaning of the weight parameters (see the sketch below).
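
    A sketch of where hyperparameters show up in code we've already seen (assuming the `dls` built earlier; the specific values are illustrative):

    ```python
    from fastai.vision.all import *

    # The architecture, the number of epochs, and the base learning rate
    # are all hyperparameters: choices about the model, not weights the
    # model learns for itself.
    learn = vision_learner(dls, resnet34, metrics=error_rate)  # architecture
    learn.fine_tune(3, base_lr=2e-3)                           # epochs, learning rate
    ```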
  32. What's the best way to avoid failures when using AI in an organization?

    • To put it bluntly, if you're a senior decision maker in your organization (or you're advising senior decision makers), the most important takeaway is this: if you ensure that you really understand what test and validation sets are and why they're important, then you'll avoid the single biggest source of failures we've seen when organizations decide to use AI. For instance, if you're considering bringing in an external vendor or service, make sure that you hold out some test data that the vendor never gets to see. Then you check their model on your test data, using a metric that you choose based on what actually matters to you in practice, and you decide what level of performance is adequate. (It's also a good idea for you to try out some simple baseline yourself, so you know what a really simple model can achieve. Often it'll turn out that your simple model performs just as well as one produced by an external "expert"!)
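
    A minimal sketch of that advice (the file names `test_labels.csv` and `vendor_preds.csv` are hypothetical): score the vendor's predictions against labels they never saw, using a metric you care about:

    ```python
    import pandas as pd

    truth = pd.read_csv('test_labels.csv')   # columns: id, label (withheld from the vendor)
    preds = pd.read_csv('vendor_preds.csv')  # columns: id, pred (vendor's output)

    merged = truth.merge(preds, on='id')
    accuracy = (merged['label'] == merged['pred']).mean()
    print(f'Vendor accuracy on our held-out test set: {accuracy:.3f}')
    ```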