Convolutional neural networks (CNNs) are great – they can detect features in an image no matter where. Well, not exactly. They're not indifferent to just any kind of movement. Shifting up or down, or left or right, is fine; rotating around an axis is not. That's because of how convolution works: traverse by row, then traverse by column (or the other way round). If we want "more" – for example, successful detection of an upside-down object – we need to extend convolution to an operation that is rotation-equivariant. An operation that is equivariant to some type of action will not only register the moved feature itself, but also keep track of which concrete action made it appear where it is.
This is the second post in a series introducing group-equivariant CNNs (GCNNs). The first was a high-level introduction to why we'd want them and how they work. There, we introduced the key player, the symmetry group, which specifies what kinds of transformations are to be treated equivariantly. If you haven't read it yet, please take a look at that post first, since here I'll make use of terminology and concepts it introduced.
Today, we code a simple GCNN from scratch. Code and presentation closely follow the notebooks provided as part of University of Amsterdam's 2022 Deep Learning course. They can't be thanked enough for making such excellent learning materials available.
In what follows, my intention is to explain the overall way of thinking, and how the resulting architecture is built up from smaller modules, each assigned a clear purpose. For that reason, I won't reproduce all the code here; instead, we'll make use of the package gcnn. Its methods are heavily annotated, so to see some details, don't hesitate to look at the code.
As of today, gcnn implements one symmetry group: \(C_4\), the one that serves as a running example throughout post one. It is, however, straightforwardly extensible, making full use of class hierarchies throughout.
Step 1: Symmetry Group \(C_4\)
In coding a GCNN, the first thing we need to provide is an implementation of the symmetry group we'd like to use. Here, it is \(C_4\), the four-element group that rotates by 90 degrees.
We can ask gcnn to create one for us, and inspect its elements.
# remotes::install_github("skeydan/gcnn")
library(gcnn)
library(torch)
C_4 <- CyclicGroup(order = 4)
elems <- C_4$elements()
elems
torch_tensor
0.0000
1.5708
3.1416
4.7124
[ CPUFloatType{4} ]
Elements are represented by their respective rotation angles: \(0\), \(\frac{\pi}{2}\), \(\pi\), and \(\frac{3\pi}{2}\).
Groups are aware of their identity, and know how to construct an element's inverse:
C_4$identity
g1 <- elems[2]
C_4$inverse(g1)
torch_tensor
0
[ CPUFloatType{1} ]
torch_tensor
4.71239
[ CPUFloatType{} ]
What we care about most, though, is the group elements' action. Implementation-wise, we need to distinguish between them acting on each other, and their action on the vector space \(\mathbb{R}^2\), where our input images live. The former part is the easy one: it may simply be implemented by adding angles. In fact, this is what gcnn does when we ask it to let g1 act on g2:
g2 <- elems[3]
# in C_4$left_action_on_H(), H stands for the symmetry group
C_4$left_action_on_H(torch_tensor(g1)$unsqueeze(1), torch_tensor(g2)$unsqueeze(1))
torch_tensor
4.7124
[ CPUFloatType{1,1} ]
What's with the unsqueeze()s? Since \(C_4\)'s ultimate purpose is to be part of a neural network, left_action_on_H() works with batches of elements, not scalar tensors.
Next, things get a bit less straightforward: this concerns the group's action on \(\mathbb{R}^2\). Here, we need the concept of a group representation. This is an involved topic we won't go into here. In our current context, it works roughly like this: We have an input signal, a tensor we'd like to operate on in some way. (That "some way" will be convolution, as we'll soon see.) To render that operation group-equivariant, we first have the representation apply the inverse group action to the input. That accomplished, we continue with the operation as though nothing had happened.
For a concrete example, let's say the operation is a measurement. Imagine a runner, standing at the foot of a mountain trail, ready to run up the climb. We'd like to record their height. One option we have is to take the measurement, then let them run up. Our measurement will be as valid up the mountain as it is down here. Alternatively, we might be polite and not make them wait: once they're up there, we ask them to come back down, and when they're back, we measure their height. The result is the same: body height is equivariant (more than that: invariant, even) to the action of running up or down. (Of course, height is a rather boring measure. But something more interesting, such as heart rate, would not have worked so well in this example.)
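In symbols (a standard formulation, not something specific to this post): an operation \(f\) is equivariant to the action of a group element \(g\) if acting first and operating afterwards gives the same result as operating first and acting afterwards; it is invariant if the action makes no difference at all.
\[
f(g \cdot x) = g \cdot f(x) \ \ \textrm{(equivariance)} \qquad f(g \cdot x) = f(x) \ \ \textrm{(invariance)}
\]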
Returning to the implementation, we find that group actions are encoded as matrices. There is one matrix for each group element. For \(C_4\), the so-called standard representation is a rotation matrix:
\[
\begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}
\]
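To see what this representation does, here is a minimal sketch – a hypothetical illustration, not gcnn's internal code – that builds the matrix for g1, a 90-degree rotation, and applies it to a basis vector:
# Build the standard representation of g1 (angle pi/2) by hand.
theta <- as.numeric(g1)
rot <- torch_tensor(matrix(
  c(cos(theta), -sin(theta),
    sin(theta), cos(theta)),
  nrow = 2, byrow = TRUE
))
# The unit vector pointing right is mapped to the one pointing up
# (up to floating-point error).
rot$matmul(torch_tensor(c(1, 0)))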
In gcnn, the function that applies that matrix is left_action_on_R2(). Like its sibling, it is written to work with batches (of group elements as well as \(\mathbb{R}^2\) vectors). Technically, what happens is rotation of the grid the image is defined on, followed by re-sampling of the image. Let's make this concrete.
Here is a goat.
img_path <- system.file("imgs", "z.jpg", package = "gcnn")
img <- torchvision::base_loader(img_path) |> torchvision::transform_to_tensor()
img$permute(c(2, 3, 1)) |> as.array() |> as.raster() |> plot()
First, we call C_4$left_action_on_R2() to rotate the grid.
# Grid shape is [2, 1024, 1024], for a 2d, 1024 x 1024 image.
img_grid_R2 <- torch::torch_stack(torch::torch_meshgrid(
list(
torch::torch_linspace(-1, 1, dim(img)[2]),
torch::torch_linspace(-1, 1, dim(img)[3])
)
))
# Transform the image grid with the matrix representation of some group element.
transformed_grid <- C_4$left_action_on_R2(C_4$inverse(g1)$unsqueeze(1), img_grid_R2)
Second, we re-sample the image on the transformed grid. The goat now looks up to the sky.
transformed_img <- torch::nnf_grid_sample(
img$unsqueeze(1), transformed_grid,
align_corners = TRUE, mode = "bilinear", padding_mode = "zeros"
)
transformed_img[1,..]$permute(c(2, 3, 1)) |> as.array() |> as.raster() |> plot()
Step 2: Lifting Convolution
We'd like to make efficient use of torch's existing functionality. Concretely, what we want to use is nn_conv2d(). What we need, though, is a convolution kernel that is equivariant not just to translation, but also to the action of \(C_4\). This can be achieved by having one kernel for each possible rotation.
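For \(C_4\) specifically, a conceptual sketch of that idea could look as follows. (This is a hypothetical illustration; gcnn itself re-samples the kernel on a rotated grid, a strategy that also works for rotations that are not multiples of 90 degrees.)
# Rotate a single 2d kernel by multiples of 90 degrees, and stack the
# results: one kernel per element of C_4.
kernel <- torch::torch_randn(c(5, 5))
rotated_kernels <- torch::torch_stack(
  lapply(0:3, function(k) torch::torch_rot90(kernel, k, dims = c(1, 2)))
)
rotated_kernels$shape # 4 x 5 x 5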
Implementing that idea is what LiftingConvolution does. The principle is the same as before: first, the grid is rotated, and then, the kernel (weight matrix) is re-sampled on the transformed grid.
Why, though, call this a lifting convolution? The usual convolution kernel operates on \(\mathbb{R}^2\); our extended version operates on combinations of \(\mathbb{R}^2\) and \(C_4\). In math speak, it has been lifted to the semi-direct product \(\mathbb{R}^2 \rtimes C_4\).
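For a single-channel signal \(f\) and kernel \(k\), the lifted operation can be written like this (a standard formulation from the GCNN literature – conventions regarding signs, and correlation vs. convolution, vary):
\[
[k \star f](x, \theta) = \sum_{\tilde{x} \in \mathbb{Z}^2} k\left( R_\theta^{-1} (\tilde{x} - x) \right) f(\tilde{x})
\]
The output no longer depends on position alone, but on a pair of position and rotation – which is exactly the additional dimension we are about to see.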
lifting_conv <- LiftingConvolution(
group = CyclicGroup(order = 4),
kernel_size = 5,
in_channels = 3,
out_channels = 8
)
x <- torch::torch_randn(c(2, 3, 32, 32))
y <- lifting_conv(x)
y$shape
[1] 2 8 4 28 28
Since, internally, LiftingConvolution uses an additional dimension to realize the product of translations and rotations, the output is not four-, but five-dimensional: batch item, channel, group element, and the two spatial dimensions.
Step 3: Group Convolution
Now that we are in "group-extended space", we can chain a number of layers where both input and output are group-convolution layers. For example:
group_conv <- GroupConvolution(
group = CyclicGroup(order = 4),
kernel_size = 5,
in_channels = 8,
out_channels = 16
)
z <- group_conv(y)
z$shape
[1] 2 16 4 24 24
All that remains to be done is package this up. That's what gcnn::GroupEquivariantCNN() does.
Step 4: Group-equivariant CNN
We can call GroupEquivariantCNN() like so.
cnn <- GroupEquivariantCNN(
group = CyclicGroup(order = 4),
kernel_size = 5,
in_channels = 1,
out_channels = 1,
num_hidden = 2, # number of group convolutions
hidden_channels = 16 # number of channels per group conv layer
)
img <- torch::torch_randn(c(4, 1, 32, 32))
cnn(img)$shape
[1] 4 1
At casual glance, this GroupEquivariantCNN looks like any old CNN … were it not for the group argument.
Now, when we inspect its output, we see that the additional dimension is gone. That's because, after a sequence of group-to-group convolution layers, the module projects down to a representation that, for each batch item, retains channels only. It thus averages not just over locations, as we normally do, but over the group dimension as well. A final linear layer then provides the requested classifier output (of dimension out_channels).
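That projection is what makes the network's output (approximately) invariant to \(C_4\) rotations. As a quick, hypothetical sanity check – not part of the original workflow – we can rotate an input by 90 degrees and compare predictions of the (untrained) cnn defined above. Up to interpolation and border effects, the difference should be close to zero.
img1 <- torch::torch_randn(c(1, 1, 32, 32))
# Rotate by 90 degrees in the two spatial dimensions.
img2 <- torch::torch_rot90(img1, k = 1, dims = c(3, 4))
# Compare predictions.
cnn(img1) - cnn(img2)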
And there we have the complete architecture. It is time for a real-world(-ish) test.
Rotated digits!
The plan is to train two convnets, a "normal" CNN and a group-equivariant one, on the usual MNIST training set. Then, both are evaluated on an augmented test set where each image is randomly rotated by a continuous rotation between 0 and 360 degrees. We don't expect GroupEquivariantCNN to be "perfect" – not if we equip it with \(C_4\) as a symmetry group. Strictly speaking, with \(C_4\), equivariance extends over four positions only. But we do hope it will perform significantly better than the shift-equivariant-only standard architecture.
First, we prepare the data; in particular, the augmented test set.
dir <- "/tmp/mnist"
train_ds <- torchvision::mnist_dataset(
dir,
download = TRUE,
transform = torchvision::transform_to_tensor
)
test_ds <- torchvision::mnist_dataset(
dir,
train = FALSE,
transform = function(x) {
x |>
torchvision::transform_to_tensor() |>
torchvision::transform_random_rotation(
degrees = c(0, 360),
resample = 2,
fill = 0
)
}
)
train_dl <- dataloader(train_ds, batch_size = 128, shuffle = TRUE)
test_dl <- dataloader(test_ds, batch_size = 128)
What does it look like?
test_images <- coro::collect(
test_dl, 1
)[[1]]$x[1:32, 1, , ] |> as.array()
par(mfrow = c(4, 8), mar = rep(0, 4), mai = rep(0, 4))
test_images |>
purrr::array_tree(1) |>
purrr::map(as.raster) |>
purrr::iwalk(~ {
plot(.x)
})
First, we define and train a conventional CNN. It is as similar to GroupEquivariantCNN(), architecture-wise, as possible, and is given twice the number of hidden channels, so as to have comparable capacity overall.
default_cnn <- nn_module(
"default_cnn",
initialize = function(kernel_size, in_channels, out_channels, num_hidden, hidden_channels) {
self$conv1 <- torch::nn_conv2d(in_channels, hidden_channels, kernel_size)
self$convs <- torch::nn_module_list()
for (i in 1:num_hidden) {
self$convs$append(torch::nn_conv2d(hidden_channels, hidden_channels, kernel_size))
}
self$avg_pool <- torch::nn_adaptive_avg_pool2d(1)
self$final_linear <- torch::nn_linear(hidden_channels, out_channels)
},
forward = function(x) {
x <- x |>
self$conv1() |>
(\(.) torch::nnf_layer_norm(., .$shape[2:4]))() |>
torch::nnf_relu()
for (i in 1:(length(self$convs))) {
x <- x |>
self$convs[[i]]() |>
(\(.) torch::nnf_layer_norm(., .$shape[2:4]))() |>
torch::nnf_relu()
}
x <- x |>
self$avg_pool() |>
torch::torch_squeeze() |>
self$final_linear()
x
}
)
fitted <- default_cnn |>
luz::setup(
loss = torch::nn_cross_entropy_loss(),
optimizer = torch::optim_adam,
metrics = list(
luz::luz_metric_accuracy()
)
) |>
luz::set_hparams(
kernel_size = 5,
in_channels = 1,
out_channels = 10,
num_hidden = 4,
hidden_channels = 32
) |>
luz::set_opt_hparams(lr = 1e-2, weight_decay = 1e-4) |>
luz::fit(train_dl, epochs = 10, valid_data = test_dl)
Train metrics: Loss: 0.0498 - Acc: 0.9843
Valid metrics: Loss: 3.2445 - Acc: 0.4479
Unsurprisingly, test-set accuracy can't compare with training-set accuracy.
Next, we train the group-equivariant version.
fitted <- GroupEquivariantCNN |>
luz::setup(
loss = torch::nn_cross_entropy_loss(),
optimizer = torch::optim_adam,
metrics = list(
luz::luz_metric_accuracy()
)
) |>
luz::set_hparams(
group = CyclicGroup(order = 4),
kernel_size = 5,
in_channels = 1,
out_channels = 10,
num_hidden = 4,
hidden_channels = 16
) |>
luz::set_opt_hparams(lr = 1e-2, weight_decay = 1e-4) |>
luz::fit(train_dl, epochs = 10, valid_data = test_dl)
Train metrics: Loss: 0.1102 - Acc: 0.9667
Valid metrics: Loss: 0.4969 - Acc: 0.8549
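As a side note: should you want to re-compute metrics for a fitted model outside of fit(), luz provides evaluate(). (A hypothetical extra step; it is not needed for the comparison we're doing here.)
# Re-evaluate the group-equivariant model on the rotated test set.
luz::evaluate(fitted, test_dl)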
For the group-equivariant CNN, accuracies on test and training sets are much closer. That is a nice result! Let's wrap up today's exploit by picking up a thought from the first, more high-level post.
A challenge
Looking back at the augmented test set, or rather, the sample of digits displayed above, we notice a problem. In row two, column four, there is a digit that, "under normal circumstances", should be a 9, but is most probably an upside-down 6. (To a human, what suggests this is the squiggle-like thing that seems to be found more often with sixes than with nines.) However, you could ask: does this have to be a problem? Maybe the network just needs to learn these kinds of subtleties, the ones a human would spot?
The way I see it, it all depends on the context: what really should be accomplished, and how an application is going to be used. With digits on a letter, there's no reason why a single digit should appear upside-down; thus, full rotation equivariance would be counter-productive. In a nutshell, we arrive at the same canonical imperative that advocates of fair, just machine learning keep reminding us of:
Always think about how your application will be used!
In our case, though, there is another aspect to this, a technical one. gcnn::GroupEquivariantCNN() is a simple wrapper, in that its layers all make use of the same symmetry group. In principle, this need not be the case. With more coding effort, different groups could be used depending on a layer's position in the feature-detection hierarchy.
Here, let me finally tell you why I chose the goat picture. The goat is seen through a red-and-white fence, a pattern – slightly rotated, due to the viewing angle – made up of squares (or edges, if you like). Now, for such a fence, types of rotation equivariance such as that encoded by \(C_4\) make a lot of sense. The goat itself, though, we'd rather not see looking up to the sky, the way I displayed under \(C_4\) action before. Thus, what we'd do in a real-world image-classification task is use rather flexible layers at the bottom, and increasingly restrained layers at the top of the hierarchy.
Thanks for reading!
Photo by Marjan Blanc (@marjanblan) on Unsplash