Introducing convolutions
For background on convolution kernels, see the Wikipedia article on kernels (image processing).
Let's explore how convolutions work by creating a basic convolution on a 2D grayscale image. First, we can load the image by taking the 'ascent' image from SciPy. It's a nice, built-in picture with lots of angles and lines.
import numpy as np
from scipy import misc

# Load the built-in 'ascent' image (in newer SciPy versions, use scipy.datasets.ascent() instead)
i = misc.ascent()
Next, we can use the pyplot library to draw the image so we know what it looks like.
import matplotlib.pyplot as plt
plt.grid(False)
plt.gray()
plt.axis('off')
plt.imshow(i)
plt.show()
Output:
We can see that this is an image of a stairwell. There are lots of features in here that we can try to isolate -- for example, there are strong vertical lines.
The image is stored as a numpy array, so we can create the transformed image by just copying that array. Let's also get the dimensions of the image so we can loop over it later.
i_transformed = np.copy(i)
size_x = i_transformed.shape[0]
size_y = i_transformed.shape[1]
Now we can create a filter as a 3x3 array.
# This filter detects edges nicely.
# It creates a convolution that only passes through sharp edges and straight lines.
# Experiment with different values for fun effects.
#filter = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

# A couple more filters to try for fun!
filter = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
#filter = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

# If the values in the filter don't add up to 0 or 1, you should apply a weight
# to normalize them. For example, if your filter values are 1,1,1 1,2,1 1,1,1,
# they add up to 10, so you would set a weight of 0.1.
weight = 1
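To make the weight idea concrete, here is a small sketch (not part of the original lab) of a 3x3 smoothing filter whose values add up to 10, together with the weight of 0.1 you would use to normalize it:

# A smoothing filter: the values sum to 10, so a weight of 0.1 keeps
# the output in the same brightness range as the input.
blur_filter = [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
blur_weight = 0.1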
Now let's create a convolution. We will iterate over the image, leaving a 1-pixel margin, and multiply each of the current pixel's neighbors by the corresponding value in the filter.
That is, the neighbor above and to the left of the current pixel will be multiplied by the top-left item in the filter, and so on. We'll then multiply the result by the weight, and ensure the result is in the range 0-255.
Finally, we'll load the new value into the transformed image.
for x in range(1, size_x-1):
    for y in range(1, size_y-1):
        convolution = 0.0
        convolution = convolution + (i[x-1, y-1] * filter[0][0])
        convolution = convolution + (i[x, y-1] * filter[1][0])
        convolution = convolution + (i[x+1, y-1] * filter[2][0])
        convolution = convolution + (i[x-1, y] * filter[0][1])
        convolution = convolution + (i[x, y] * filter[1][1])
        convolution = convolution + (i[x+1, y] * filter[2][1])
        convolution = convolution + (i[x-1, y+1] * filter[0][2])
        convolution = convolution + (i[x, y+1] * filter[1][2])
        convolution = convolution + (i[x+1, y+1] * filter[2][2])
        convolution = convolution * weight
        if convolution < 0:
            convolution = 0
        if convolution > 255:
            convolution = 255
        i_transformed[x, y] = convolution
Now we can plot the image to see the effect of the convolution.
# Plot the image. Note the size of the axes -- they are 512 by 512
plt.gray()
plt.grid(False)
plt.imshow(i_transformed)
#plt.axis('off')
plt.show()
Output: the filtered image for each of the three filters listed above:
[[-1, -2, -1], [0, 0, 0], [1, 2, 1]], [[0, 1, 0], [1, -4, 1], [0, 1, 0]], and [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
Pooling
As well as using convolutions, pooling helps us greatly in detecting features. The goal is to reduce the overall amount of information in an image while maintaining the features that are detected as present. There are a number of different types of pooling, but this time we'll use one called MAX pooling.
The idea here is to iterate over the image and look at a block of pixels: in this case the pixel and its neighbors in a 4x4 block to the right and beneath. Take the largest of them (hence the name MAX pooling) and load it into the new image. The new image will therefore be 1/16 the size of the old one, with the X and Y dimensions each reduced to a quarter (512 becomes 128). You'll see that the features get maintained despite this compression!
# This code will show (4, 4) pooling.
# Run it to see the output, and you'll see that
# while the image is 1/4 the size of the original in both length and width,
# the extracted features are maintained!
new_x = int(size_x/4)
new_y = int(size_y/4)
newImage = np.zeros((new_x, new_y))
for x in range(0, size_x, 4):
    for y in range(0, size_y, 4):
        pixels = []
        pixels.append(i_transformed[x, y])
        pixels.append(i_transformed[x+1, y])
        pixels.append(i_transformed[x+2, y])
        pixels.append(i_transformed[x+3, y])
        pixels.append(i_transformed[x, y+1])
        pixels.append(i_transformed[x+1, y+1])
        pixels.append(i_transformed[x+2, y+1])
        pixels.append(i_transformed[x+3, y+1])
        pixels.append(i_transformed[x, y+2])
        pixels.append(i_transformed[x+1, y+2])
        pixels.append(i_transformed[x+2, y+2])
        pixels.append(i_transformed[x+3, y+2])
        pixels.append(i_transformed[x, y+3])
        pixels.append(i_transformed[x+1, y+3])
        pixels.append(i_transformed[x+2, y+3])
        pixels.append(i_transformed[x+3, y+3])
        pixels.sort(reverse=True)
        newImage[int(x/4), int(y/4)] = pixels[0]
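By the way, the same (4, 4) MAX pooling can be written as a single vectorized NumPy expression. This is just an equivalent sketch, assuming the image dimensions are exact multiples of 4:

# Equivalent vectorized 4x4 max pooling: take the maximum of each 4x4 block
newImage = i_transformed.reshape(new_x, 4, new_y, 4).max(axis=(1, 3))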
# Plot the image. Note the size of the axes -- now 128 pixels instead of 512
plt.gray()
plt.grid(False)
plt.imshow(newImage)
#plt.axis('off')
plt.show()
Output:
Comparison of DNN and CNN
# DNN
import tensorflow as tf
mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (val_images, val_labels) = mnist.load_data()
training_images=training_images / 255.0
val_images=val_images / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(20, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(training_images, training_labels, validation_data=(val_images, val_labels), epochs=20)
The accuracy is probably about 89% on training and 87% on validation... not bad. But how do you make that even better? One way is to use something called convolutions.
In short, you take an array (usually 3x3 or 5x5) and pass it over the image. By changing the underlying pixels based on the formula within that matrix, you can do things like edge detection. So, for example, consider a 3x3 filter defined for edge detection, where the middle cell is 8 and all of its neighbors are -1. In this case, for each pixel, you would multiply its value by 8, then subtract the value of each neighbor. Do this for every pixel, and you'll end up with a new image that has the edges enhanced.
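To make that concrete, here's a quick sketch (not part of the original lesson) of that 8/-1 edge-detection filter applied to the 'ascent' image with scipy.signal.convolve2d; the use of SciPy here is just an illustrative choice:

import numpy as np
from scipy import misc, signal

# The edge-detection filter described above: center 8, neighbors -1 (values sum to 0)
edge_filter = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

img = misc.ascent().astype(float)
edges = signal.convolve2d(img, edge_filter, mode='same', boundary='symm')
edges = np.clip(edges, 0, 255)   # keep the result in the displayable 0-255 range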
This is perfect for computer vision, because it's often features like edges that distinguish one item from another. And once we move from raw image data to feature data, the amount of information needed is much less, because you'll just train on the highlighted features.
That's the concept of Convolutional Neural Networks. Add some layers to do convolution before the dense layers, and then the information going to the dense layers is more focused, and possibly more accurate.
Run the code below -- this is the same neural network as earlier, but this time with convolutional layers added first. It will take longer to train, but look at the impact on the accuracy:
# CNN
import tensorflow as tf
mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (val_images, val_labels) = mnist.load_data()
training_images=training_images.reshape(60000, 28, 28, 1)
training_images=training_images / 255.0
val_images=val_images.reshape(10000, 28, 28, 1)
val_images=val_images/255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
model.fit(training_images, training_labels, validation_data=(val_images, val_labels), epochs=20)
It's likely gone up to about 97% on the training data and 91% on the validation data. That's significant, and a step in the right direction!
Then, look at the code again and see, step by step, how the convolutions were built:
Step 1 is to gather the data. You'll notice that there's a bit of a change here in that the training data needed to be reshaped. That's because the first convolution expects a single tensor containing everything, so instead of 60,000 28x28x1 items in a list, we have a single 4D tensor that is 60,000x28x28x1, and the same for the validation images. If you don't do this, you'll get an error when training because the convolutions do not recognize the shape.
import tensorflow as tf
mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (val_images, val_labels) = mnist.load_data()
training_images=training_images.reshape(60000, 28, 28, 1)
training_images=training_images / 255.0
val_images = val_images.reshape(10000, 28, 28, 1)
val_images=val_images/255.0
Next, define your model. Now, instead of the input layer at the top, you're going to add a convolution. The parameters are:
- The number of convolutions you want to generate. Purely arbitrary, but good to start with something in the order of 64
- The size of the Convolution, in this case a 3x3 grid
- The activation function to use -- in this case we'll use relu, which you might recall is the equivalent of returning x when x>0, else returning 0
- In the first layer, the shape of the input data.
You'll follow the convolution with a MaxPooling layer, which is designed to compress the image while maintaining the content of the features that were highlighted by the convolution. By specifying (2,2) for the MaxPooling, the effect is to quarter the size of the image. Without going into too much detail here, the idea is that it creates a 2x2 array of pixels and picks the biggest one, thus turning 4 pixels into 1. It repeats this across the image, halving the number of horizontal pixels and halving the number of vertical pixels, effectively reducing the image to 25% of its original size.
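To make that concrete, here's a tiny illustrative sketch (not part of the lesson) of what MaxPooling2D(2, 2) does to a single 4x4 feature map:

import numpy as np
import tensorflow as tf

# One 4x4 feature map, shaped (batch, height, width, channels)
fm = np.array([[1, 3, 0, 2],
               [2, 8, 1, 1],
               [0, 1, 5, 4],
               [1, 0, 2, 9]], dtype=float).reshape(1, 4, 4, 1)

# Each 2x2 block is replaced by its largest value, so 4x4 becomes 2x2
pooled = tf.keras.layers.MaxPooling2D(2, 2)(fm)
print(pooled[0, :, :, 0].numpy())   # [[8. 2.] [1. 9.]]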
You can call model.summary() to see the size and shape of the network, and you'll notice that after every MaxPooling layer, the image size is reduced in this way.
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Add another convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # Then flatten the output. After this you'll just have the same DNN structure
    # as the non-convolutional version.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
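For reference, the output shapes that model.summary() reports for this model should look roughly like the following (derived from the layer definitions above, so treat it as a sketch rather than verbatim output):

# Conv2D        -> (None, 26, 26, 64)   a 3x3 filter trims one pixel from each edge of the 28x28 input
# MaxPooling2D  -> (None, 13, 13, 64)   2x2 pooling halves each dimension
# Conv2D        -> (None, 11, 11, 64)
# MaxPooling2D  -> (None, 5, 5, 64)
# Flatten       -> (None, 1600)         5 * 5 * 64 = 1600
# Dense         -> (None, 20)
# Dense         -> (None, 10)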
How does the model see?
Imagine you are teaching a computer how to 'see' something. By that we mean not just being able to process the pixels in an image for color or shade, but being able to understand the contents of an image.
To simulate this, you could say that the computer's understanding of the contents of the image is effectively blurred, like this:
There’s information in there, but I don’t know what it represents, so I simulate that by blurring the image.
Now imagine that there's a convolution (filter) that can consistently extract something from the image, and that the something it extracts is always present when the image is labelled with a particular class. In other words, suppose there's a filter that always produces something like this when the image is labelled human.
You and I both know that these clearly seem to be human legs, but the computer doesn’t know that. It just knows that two cylinder-like objects like these tend to show up for a particular filter, and only on images that are labelled human.
Then, similarly, there’s another filter that produces this, and only for human images:
...and then another that produces this, or something similar, again, only for human images:
You and I both recognize this as a human face. But the computer doesn’t. It only knows that something clear is regularly extracted by that filter, when the image is labelled as human.
Thus, when a set of filters is learned that consistently extracts content like this when the image is labelled as human, we could say that the following ‘equation’ holds:
If there are 64 filters in the final layer, for example, most of them may return 'nothing', and it's just these three that end up having significance for this class.
Similarly, a different set of filters could return values for the label HORSE, and everything else (including the three filters that gave us human hands, legs and face) would return nothing, so we’d get:
Now we have a set of filters that a model has learned that can extract the features that indicate what is a horse and what is a human!
Do note that for this example I used features that you and I recognize, like hands, legs and faces, to distinguish between the two classes, but the computer is NOT limited to that. It might be able to consistently 'see' patterns in images that you do not, and those patterns might be a more accurate determinant of the class of the image. The field of convolutional visualization studies this, and it's a fascinating way to explore the interpretability of models that classify images this way!
Let's retrain our convolutional model on the Fashion-MNIST dataset and then visualize the filters and pooling.
import tensorflow as tf
mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (val_images, val_labels) = mnist.load_data()
training_images=training_images.reshape(60000, 28, 28, 1)
training_images=training_images / 255.0
val_images=val_images.reshape(10000, 28, 28, 1)
val_images=val_images/255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
model.fit(training_images, training_labels, validation_data=(val_images, val_labels), epochs=20)
This code will show us the convolutions graphically. The print(val_labels[:100]) shows us the first 100 labels in the validation set, and you can see that the ones at index 0, index 23 and index 28 are all the same value (9). They're all shoes. Let's take a look at the result of running the convolution on each, and you'll begin to see common features between them emerge. Now, when the final dense layers are trained on this resulting data, they're working with much less, and more targeted, data: the features generated by this convolution/pooling combination.
print(val_labels[:100])
Output:
[9 2 1 1 6 1 4 6 5 7 4 5 7 3 4 1 2 4 8 0 2 5 7 9 1 4 6 0 9 3 8 8 3 3 8 0 7
5 7 9 6 1 3 7 6 7 2 1 2 2 4 4 5 8 2 2 8 4 8 0 7 7 8 5 1 1 2 3 9 8 7 0 2 6
2 3 1 2 8 4 1 8 5 9 5 0 3 2 0 6 5 3 6 7 1 8 0 1 4 2]
import matplotlib.pyplot as plt
def show_image(img):
    plt.figure()
    plt.imshow(val_images[img].reshape(28, 28))
    plt.grid(False)
    plt.show()
f, axarr = plt.subplots(3,2)
# By scanning the list above I saw that the 0, 23 and 28 entries are all label 9
FIRST_IMAGE=0
SECOND_IMAGE=23
THIRD_IMAGE=28
# For shoes (0, 23, 28), Convolution_Number=1 (i.e. the second filter) shows
# the sole being filtered out very clearly
CONVOLUTION_NUMBER = 1
from tensorflow.keras import models
layer_outputs = [layer.output for layer in model.layers]
activation_model = tf.keras.models.Model(inputs = model.input, outputs = layer_outputs)
for x in range(0, 2):
    f1 = activation_model.predict(val_images[FIRST_IMAGE].reshape(1, 28, 28, 1))[x]
    axarr[0, x].imshow(f1[0, :, :, CONVOLUTION_NUMBER], cmap='inferno')
    axarr[0, x].grid(False)
    f2 = activation_model.predict(val_images[SECOND_IMAGE].reshape(1, 28, 28, 1))[x]
    axarr[1, x].imshow(f2[0, :, :, CONVOLUTION_NUMBER], cmap='inferno')
    axarr[1, x].grid(False)
    f3 = activation_model.predict(val_images[THIRD_IMAGE].reshape(1, 28, 28, 1))[x]
    axarr[2, x].imshow(f3[0, :, :, CONVOLUTION_NUMBER], cmap='inferno')
    axarr[2, x].grid(False)
show_image(FIRST_IMAGE)
show_image(SECOND_IMAGE)
show_image(THIRD_IMAGE)
Output:
1/1 [==============================] - 0s 114ms/step
1/1 [==============================] - 0s 14ms/step
1/1 [==============================] - 0s 15ms/step
1/1 [==============================] - 0s 12ms/step
1/1 [==============================] - 0s 11ms/step
1/1 [==============================] - 0s 31ms/step
A coding assignment
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images = train_images / 255.0
test_images = test_images / 255.0
FIRST_LAYER = tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(32, 32, 3))
HIDDEN_LAYER_TYPE_1 = tf.keras.layers.MaxPooling2D(2, 2)
HIDDEN_LAYER_TYPE_2 = tf.keras.layers.Conv2D(64, (3,3), activation='relu')
HIDDEN_LAYER_TYPE_3 = tf.keras.layers.MaxPooling2D(2, 2)
HIDDEN_LAYER_TYPE_4 = layers.Conv2D(64, (3, 3), activation='relu')
HIDDEN_LAYER_TYPE_5 = tf.keras.layers.Dense(20, activation='relu')
LAST_LAYER = tf.keras.layers.Dense(10, activation='softmax')
model = models.Sequential([
    FIRST_LAYER,
    HIDDEN_LAYER_TYPE_1,
    HIDDEN_LAYER_TYPE_2,
    HIDDEN_LAYER_TYPE_3,
    HIDDEN_LAYER_TYPE_4,
    layers.Flatten(),
    HIDDEN_LAYER_TYPE_5,
    LAST_LAYER,
])
LOSS = 'sparse_categorical_crossentropy'
NUM_EPOCHS = 20 #You can change this value if you like to experiment with it to get better accuracy
# Compile the model
model.compile(optimizer='sgd',
              loss=LOSS,
              metrics=['accuracy'])
# Fit the model
history = model.fit(train_images, train_labels, epochs=NUM_EPOCHS,
                    validation_data=(test_images, test_labels))
# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.xlim([0,NUM_EPOCHS])
plt.ylim([0.4,1.0])
plt.show()
Output: