We started out learning torch by coding a simple neural network from scratch, making use of just one of torch's features: tensors. Then, we immensely simplified the task, replacing manual backpropagation with autograd. Today, we modularize the network – in both the habitual and a very literal sense: Low-level matrix operations are swapped out for torch modules.
Modules
From other frameworks (Keras, say), you may be used to distinguishing between models and layers. In torch, both are instances of nn_Module(), and thus have some methods in common. For those thinking in terms of “models” and “layers”, I'm artificially splitting this section into two parts. In reality though, there is no dichotomy: New modules may be composed of existing ones, up to arbitrary levels of recursion – as the sketch below illustrates.
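As a quick, hypothetical illustration of such composition (the module name and layer sizes here are made up for this sketch, not part of the running example), a custom module created with nn_module() can itself contain other modules:
library(torch)

# a made-up module that composes two existing nn_linear() modules
two_layer_net <- nn_module(
  "TwoLayerNet",
  initialize = function() {
    self$hidden <- nn_linear(3, 8)
    self$output <- nn_linear(8, 1)
  },
  forward = function(x) {
    x <- nnf_relu(self$hidden(x))
    self$output(x)
  }
)

net <- two_layer_net()
net(torch_randn(5, 3))
An instance of such a module could, in turn, appear inside yet another module.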
Base modules (“layers”)
Instead of writing out the affine operation by hand – x$mm(w1) + b1, say – as we've been doing so far, we can create a linear module. The following snippet instantiates a linear layer that expects three-feature inputs and returns a single output per observation:
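In code, that presumably amounts to a call like this (the object name l is taken from the snippets that follow):
l <- nn_linear(3, 1)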
The module has two parameters, “weight” and “bias”, both of which come pre-initialized:
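One way to display them is through the module's parameters field:
l$parameters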
$weight
torch_tensor
-0.0385 0.1412 -0.5436
[ CPUFloatType{1,3} ]
$bias
torch_tensor
-0.1950
[ CPUFloatType{1} ]
Modules are callable; calling a module executes its forward() method. For a linear layer, that means matrix-multiplying input and weights, and adding the bias.
Let's try this:
data <- torch_randn(10, 3)
out <- l(data)
Unsurprisingly, out now holds some data:
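Printing the tensor shows those values, one per observation:
out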
torch_tensor
0.2711
-1.8151
-0.0073
0.1876
-0.0930
0.7498
-0.2332
-0.0428
0.3849
-0.2618
[ CPUFloatType{10,1} ]
In addition, this tensor knows what needs to be done should it ever be asked to calculate gradients:
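One way to see that is to inspect its grad_fn field:
out$grad_fn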
AddmmBackward
Note the difference between tensors returned by modules and tensors we create ourselves. When creating tensors directly, we need to pass requires_grad = TRUE to trigger gradient calculation. With modules, torch correctly assumes that we'll want to perform backpropagation at some point.
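For instance, a tensor created by hand only tracks gradients if we explicitly ask for it (a toy example, not part of the running code):
t1 <- torch_randn(3, 3, requires_grad = TRUE)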
So far, though, we haven't called backward() yet. Thus, no gradients have been computed yet:
l$weight$grad
l$bias$grad
torch_tensor
[ Tensor (undefined) ]
torch_tensor
[ Tensor (undefined) ]
Let's change this:
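A plain call to backward() on out – the obvious first attempt – produces the error shown below:
out$backward()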
Error in (function (self, gradient, keep_graph, create_graph) :
grad can be implicitly created only for scalar outputs (_make_grads at ../torch/csrc/autograd/autograd.cpp:47)
Why the error? autograd expects the output tensor to be a scalar, while in our example we have a tensor of size (10, 1). This error rarely occurs in practice, where we work with batches of inputs (sometimes, just a single batch). Still, it's interesting to see how to resolve it.
To make the example work, we introduce a virtual final aggregation step – taking the mean, say. Call it avg. If such a mean were taken, its gradient with respect to l$weight would be obtained via the chain rule:
\[
\frac{\partial\, avg}{\partial w} = \frac{\partial\, avg}{\partial\, out} \, \frac{\partial\, out}{\partial w}
\]
Of the quantities on the right side, we're interested in the second. We need to provide the first one, the way it would look if we were really taking the mean:
d_avg_d_out <- torch_tensor(10)$`repeat`(10)$unsqueeze(1)$t()
out$backward(gradient = d_avg_d_out)
Now, l$weight$grad and l$bias$grad do contain gradients:
l$weight$grad
l$bias$grad
torch_tensor
1.3410 6.4343 -30.7135
[ CPUFloatType{1,3} ]
torch_tensor
100
[ CPUFloatType{1} ]
Besides nn_linear(), torch provides pretty much every common layer you might hope for. But few tasks are solved by a single layer. How do you combine them? Or, in the usual lingo: how do you build models?
Container modules (“models”)
Now, a model is just a module that contains other modules. If all inputs are supposed to flow through the same nodes and along the same edges, nn_sequential() can be used to build a simple graph. For example:
model <- nn_sequential(
nn_linear(3, 16),
nn_relu(),
nn_linear(16, 1)
)
Using the same technique as above, we can get an overview of all model parameters (two weight matrices and two bias vectors):
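Again, the parameters field provides that overview:
model$parameters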
$`0.weight`
torch_tensor
-0.1968 -0.1127 -0.0504
0.0083 0.3125 0.0013
0.4784 -0.2757 0.2535
-0.0898 -0.4706 -0.0733
-0.0654 0.5016 0.0242
0.4855 -0.3980 -0.3434
-0.3609 0.1859 -0.4039
0.2851 0.2809 -0.3114
-0.0542 -0.0754 -0.2252
-0.3175 0.2107 -0.2954
-0.3733 0.3931 0.3466
0.5616 -0.3793 -0.4872
0.0062 0.4168 -0.5580
0.3174 -0.4867 0.0904
-0.0981 -0.0084 0.3580
0.3187 -0.2954 -0.5181
[ CPUFloatType{16,3} ]
$`0.bias`
torch_tensor
-0.3714
0.5603
-0.3791
0.4372
-0.1793
-0.3329
0.5588
0.1370
0.4467
0.2937
0.1436
0.1986
0.4967
0.1554
-0.3219
-0.0266
[ CPUFloatType{16} ]
$`2.weight`
torch_tensor
Columns 1 to 10-0.0908 -0.1786 0.0812 -0.0414 -0.0251 -0.1961 0.2326 0.0943 -0.0246 0.0748
Columns 11 to 16 0.2111 -0.1801 -0.0102 -0.0244 0.1223 -0.1958
[ CPUFloatType{1,16} ]
$`2.bias`
torch_tensor
0.2470
[ CPUFloatType{1} ]
To inspect an individual parameter, make use of its position in the sequential model. For example:
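Here, one way to get at the first layer's bias (the `0.bias` entry seen above) is:
model$parameters$`0.bias`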
torch_tensor
-0.3714
0.5603
-0.3791
0.4372
-0.1793
-0.3329
0.5588
0.1370
0.4467
0.2937
0.1436
0.1986
0.4967
0.1554
-0.3219
-0.0266
[ CPUFloatType{16} ]
And just like nn_linear() above, this module can be called directly on data:
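For instance, reusing the data tensor created earlier:
out <- model(data)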
On a composite module like this one, calling backward() will backpropagate through all the layers:
out$backward(gradient = torch_tensor(10)$`repeat`(10)$unsqueeze(1)$t())
# e.g.
model[[1]]$bias$grad
torch_tensor
0.0000
-17.8578
1.6246
-3.7258
-0.2515
-5.8825
23.2624
8.4903
-2.4604
6.7286
14.7760
-14.4064
-1.0206
-1.7058
0.0000
-9.7897
[ CPUFloatType{16} ]
And placing a composite module on the GPU moves all of its tensors there:
model$cuda()
model[[1]]$bias$grad
torch_tensor
0.0000
-17.8578
1.6246
-3.7258
-0.2515
-5.8825
23.2624
8.4903
-2.4604
6.7286
14.7760
-14.4064
-1.0206
-1.7058
0.0000
-9.7897
[ CUDAFloatType{16} ]
Now let's see how using nn_sequential() simplifies our example network.
Simple network using modules
library(torch)

### generate training data -----------------------------------------------------
# input dimensionality (number of input features)
d_in <- 3
# output dimensionality (number of predicted features)
d_out <- 1
# number of observations in training set
n <- 100
# create random data
x <- torch_randn(n, d_in)
y <- x[, 1, NULL] * 0.2 - x[, 2, NULL] * 1.3 - x[, 3, NULL] * 0.5 + torch_randn(n, 1)
### define the network ---------------------------------------------------------
# dimensionality of hidden layer
d_hidden <- 32
model <- nn_sequential(
nn_linear(d_in, d_hidden),
nn_relu(),
nn_linear(d_hidden, d_out)
)
### network parameters ---------------------------------------------------------
learning_rate <- 1e-4
### training loop --------------------------------------------------------------
for (t in 1:200) {
### -------- Forward pass --------
y_pred <- model(x)
### -------- compute loss --------
loss <- (y_pred - y)$pow(2)$sum()
if (t %% 10 == 0)
cat("Epoch: ", t, " Loss: ", loss$item(), "\n")
### -------- Backpropagation --------
# Zero the gradients before running the backward pass.
model$zero_grad()
# compute gradient of the loss w.r.t. all learnable parameters of the model
loss$backward()
### -------- Update weights --------
# Wrap in with_no_grad() because this is a part we DON'T want to record
# for automatic gradient computation
# Update each parameter by its `grad`
with_no_grad({
purrr::walk(model$parameters, function(param) param$sub_(learning_rate * param$grad))
})
}
The forward pass looks a lot better now. However, we still loop through the model's parameters, updating each one by hand. Also, you may already suspect that torch provides abstractions for common loss functions. The next and final installment in this series will address both points, making use of torch losses and optimizers. See you then!