Variations on a theme
Simple audio classification with Keras, Audio classification with Keras: Looking closer at the non-deep-learning parts, Simple audio classification with torch: No, this is not the first post on this blog to introduce speech classification using deep learning. With two of those posts (the "applied" ones), it shares the general setup, the type of deep learning architecture employed, and the dataset used. With the third, it has in common the interest in the underlying ideas and concepts. Each of these posts has a different focus, though. Should you read this one?
Well, of course I can't say "no", all the more so because what you have here is an abbreviated and condensed version of the chapter on this topic in the forthcoming CRC Press book, Deep Learning and Scientific Computing with R torch. Compared with the previous torch post, written by the creator and maintainer of torchaudio, Athos Damiani, significant developments have taken place in the torch ecosystem, with the end result that the code got a lot easier, especially in the model-training part. That said, let's end the preamble and get to the topic!
Inspecting the data
We use the Speech Commands dataset (Warden (2018)) that comes with torchaudio. The dataset holds recordings of thirty different one- or two-syllable words, uttered by different speakers. There are about 65,000 audio files overall. Our task is to predict, from the audio alone, which of the thirty possible words was pronounced.
library(torch)
library(torchaudio)
library(luz)
ds <- speechcommand_dataset(
root = "~/.torch-datasets",
url = "speech_commands_v0.01",
download = TRUE
)
We start by examining the data.
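The thirty class names can be listed directly from the dataset object. A minimal sketch, assuming they are exposed as ds$classes (this field name is an assumption, not shown in the code above):
ds$classes  # assumed accessor; lists the thirty word labels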
[1] "bed" "bird" "cat" "dog" "down" "eight"
[7] "five" "four" "go" "happy" "house" "left"
[13] "marvin" "nine" "no" "off" "on" "one"
[19] "right" "seven" "sheila" "six" "stop" "three"
[25] "tree" "two" "up" "wow" "yes" "zero"
Picking a sample at random, we see that the information we need is contained in four properties: waveform, sample_rate, label_index, and label.
The first, waveform, will be our predictor.
sample <- ds[2000]
dim(sample$waveform)
[1] 1 16000
Individual tensor values are centered at zero, and range between -1 and 1. There are 16,000 of them, reflecting the fact that the recording lasted for one second and was registered (or converted, by the dataset creators) at a rate of 16,000 samples per second. The latter piece of information is stored in sample$sample_rate:
[1] 16000
All recordings have been sampled at the same rate. Their length almost always equals one second; the very few sounds that are minimally longer can safely be truncated.
Finally, the target is stored, in integer form, in sample$label_index; the corresponding word is available from sample$label:
sample$label
sample$label_index
[1] "bird"
torch_tensor
2
[ CPULongType{} ]
How does this audio signal “look”?
library(ggplot2)
df <- data.frame(
x = 1:length(sample$waveform[1]),
y = as.numeric(sample$waveform[1])
)
ggplot(df, aes(x = x, y = y)) +
geom_line(size = 0.3) +
ggtitle(
paste0(
"The spoken word \"", sample$label, "\": Sound wave"
)
) +
xlab("time") +
ylab("amplitude") +
theme_minimal()
![The spoken word "bird": sound wave.](https://blogs.rstudio.com/tensorflow/posts/images/audio-bird-waveform.png)
What we see is a sequence of amplitudes, reflecting the sound wave produced by someone saying "bird". Put differently, we have here a time series of "loudness values". Even for experts, guessing which word resulted in those amplitudes is an impossible task. This is where domain knowledge comes in. The expert may not be able to make much of the signal in this representation; but they may know a way to represent it more meaningfully.
Two equivalent representations
Imagine that, instead of as a sequence of amplitudes over time, the above wave were represented in a way that had no information about time at all. Next, imagine we took that representation and tried to recover the original signal. For that to be possible, the new representation would somehow have to contain "as much" information as the wave we started from. That "as much" is obtained from the Fourier transform, and it consists of the magnitudes and phase shifts of the different frequencies that make up the signal.
How, then, does the Fourier-transformed version of the sound wave look? We obtain it by calling torch_fft_fft() (where fft stands for Fast Fourier Transform):
dft <- torch_fft_fft(sample$waveform)
dim(dft)
[1] 1 16000
This tensor has the same length as the original one. Its values, however, are not ordered in time; instead, they are the Fourier coefficients, corresponding to the frequencies contained in the signal. The higher their magnitude, the more they contribute to the signal:
mag <- torch_abs(dft[1, ])
df <- data.frame(
x = 1:(length(sample$waveform[1]) / 2),
y = as.numeric(mag[1:8000])
)
ggplot(df, aes(x = x, y = y)) +
geom_line(size = 0.3) +
ggtitle(
paste0(
"The spoken word \"",
sample$label,
"\": Discrete Fourier Transform"
)
) +
xlab("frequency") +
ylab("magnitude") +
theme_minimal()
![The spoken word "bird": discrete Fourier transform.](https://blogs.rstudio.com/tensorflow/posts/images/audio-bird-dft.png)
From this alternate representation, we could go back to the original sound wave by taking the frequencies present in the signal, weighting them by their coefficients, and adding them up. But in sound classification, timing information surely matters; we don't really want to throw it away.
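As a quick check that nothing essential is lost in the frequency-domain representation, here is a minimal sketch, using only the waveform and dft tensors defined above, that applies the inverse transform, torch_fft_ifft(), and compares the reconstruction with the original signal:
reconstructed <- torch_fft_ifft(dft)
# the inverse FFT returns a complex tensor; its imaginary parts are numerically
# zero, and its real parts match the original amplitudes up to floating-point error
max(abs(as.numeric(torch_real(reconstructed)) - as.numeric(sample$waveform)))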
Combining representations: spectrograms
In fact, what would really help us is a synthesis of both representations; some sort of "have your cake and eat it, too". What if we could divide the signal into small chunks, and run the Fourier transform on each of them? As you may have guessed from this lead-up, this is indeed something we can do; and the representation it creates is called the spectrogram.
With a spectrogram, we still keep some time-domain information; some, because there is an unavoidable loss in granularity. On the other hand, for each of the time segments, we learn about their spectral composition. There is an important point to be made, though: the resolutions we obtain in time and in frequency, respectively, are inversely related. If we split up the signal into many chunks (called "windows"), the frequency representation per window will not be very fine-grained. Conversely, if we want better resolution in the frequency domain, we have to choose longer windows, thus losing information about how spectral composition varies over time. What sounds like a big problem, and in many cases will be, won't be one for us, though, as you'll see very soon.
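To make this trade-off concrete, here is a quick sketch comparing two window sizes on our example waveform; the dimensions given in the comments are approximate, and assume the default hop length of half the window size:
# short windows: many time frames, but few frequency bins
spec_short <- transform_spectrogram(n_fft = 128, win_length = 128, power = 0.5)
dim(spec_short(sample$waveform)$squeeze())
# roughly: [1]  65 251
# long windows: many frequency bins, but few time frames
spec_long <- transform_spectrogram(n_fft = 2048, win_length = 2048, power = 0.5)
dim(spec_long(sample$waveform)$squeeze())
# roughly: [1] 1025   16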
First, though, let's create and inspect such a spectrogram for our example signal. In the following code snippet, the size of the overlapping windows is chosen so as to allow for reasonable granularity in both the time and the frequency domain. We are left with sixty-three windows and, for each window, obtain two hundred fifty-seven coefficients:
fft_size <- 512
window_size <- 512
power <- 0.5
spectrogram <- transform_spectrogram(
n_fft = fft_size,
win_length = window_size,
normalized = TRUE,
power = power
)
spec <- spectrogram(sample$waveform)$squeeze()
dim(spec)
[1] 257 63
We can display the spectrogram visually:
bins <- 1:dim(spec)[1]
freqs <- bins / (fft_size / 2 + 1) * sample$sample_rate
log_freqs <- log10(freqs)
frames <- 1:(dim(spec)[2])
seconds <- (frames / dim(spec)[2]) *
(dim(sample$waveform$squeeze())[1] / sample$sample_rate)
image(x = as.numeric(seconds),
y = log_freqs,
z = t(as.matrix(spec)),
ylab = 'log frequency [Hz]',
xlab = 'time [s]',
col = hcl.colors(12, palette = "viridis")
)
main <- paste0("Spectrogram, window size = ", window_size)
sub <- "Magnitude (square root)"
mtext(side = 3, line = 2, at = 0, adj = 0, cex = 1.3, main)
mtext(side = 3, line = 1, at = 0, adj = 0, cex = 1, sub)
![The spoken word "bird": spectrogram.](https://blogs.rstudio.com/tensorflow/posts/images/audio-spectrogram.png)
We know that we've lost some resolution in both time and frequency. By displaying the square root of the coefficients' magnitudes, though, and thereby enhancing sensitivity, we were still able to obtain a reasonable result. (With the viridis color scheme, longer-wavelength shades indicate higher-valued coefficients; shorter-wavelength ones, the opposite.)
Finally, let's get back to the crucial question. If this representation is, by necessity, a compromise, why would we want to use it? This is where we take the deep learning perspective. The spectrogram is a two-dimensional representation: an image. With images, we have access to a rich reservoir of techniques and architectures; among all areas deep learning has been successful in, image recognition still stands out. Soon you'll see that, for this task, fancy architectures are not even needed; a straightforward convnet will do a very good job.
Training a neural network on spectrograms
We start by creating a torch::dataset() that, starting from the original speechcommand_dataset(), computes a spectrogram for every sample.
spectrogram_dataset <- dataset(
inherit = speechcommand_dataset,
initialize = function(...,
pad_to = 16000,
sampling_rate = 16000,
n_fft = 512,
window_size_seconds = 0.03,
window_stride_seconds = 0.01,
power = 2) {
self$pad_to <- pad_to
self$window_size_samples <- sampling_rate *
window_size_seconds
self$window_stride_samples <- sampling_rate *
window_stride_seconds
self$power <- power
self$spectrogram <- transform_spectrogram(
n_fft = n_fft,
win_length = self$window_size_samples,
hop_length = self$window_stride_samples,
normalized = TRUE,
power = self$power
)
super$initialize(...)
},
.getitem = function(i) {
item <- super$.getitem(i)
x <- item$waveform
# make sure all samples have the same length (pad_to)
# shorter ones will be padded,
# longer ones will be truncated
x <- nnf_pad(x, pad = c(0, self$pad_to - dim(x)[2]))
x <- x %>% self$spectrogram()
if (is.null(self$power)) {
# in this case, there is an additional dimension, in position 4,
# that we want to appear in front
# (as a second channel)
x <- x$squeeze()$permute(c(3, 1, 2))
}
y <- item$label_index
list(x = x, y = y)
}
)
In the parameter list of spectrogram_dataset(), note power, with a default value of 2. This is the value that, unless told otherwise, torch's transform_spectrogram() will assume power to have. Under these circumstances, the values that make up the spectrogram are the squared magnitudes of the Fourier coefficients. Using power, you can change the default and specify, for example, that you'd like absolute values (power = 1), any other positive value (such as 0.5, used above to display a concrete example), or both the real and imaginary parts of the coefficients (power = NULL).
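As a quick illustration, here is a sketch of how power affects the shape of what transform_spectrogram() returns, using the example waveform and the 512-sample window from above (the shapes in the comments are for orientation):
spec_abs <- transform_spectrogram(n_fft = 512, win_length = 512, power = 1)
dim(spec_abs(sample$waveform))
# [1]   1 257  63          (absolute values of the coefficients)
spec_complex <- transform_spectrogram(n_fft = 512, win_length = 512, power = NULL)
dim(spec_complex(sample$waveform))
# [1]   1 257  63   2      (real and imaginary parts, in a trailing dimension)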
Display-wise, of course, the full complex representation is inconvenient; the spectrogram plot would need an additional dimension. But we may well wonder whether a neural network could profit from the additional information contained in the "whole" complex number. After all, when reducing to magnitudes we lose the phase shifts of the individual coefficients, which might contain usable information. In fact, my tests showed that it did: using the complex values improved classification accuracy.
Let's see what we get from spectrogram_dataset():
ds <- spectrogram_dataset(
root = "~/.torch-datasets",
url = "speech_commands_v0.01",
download = TRUE,
power = NULL
)
dim(ds[1]$x)
[1] 2 257 101
There are 257 coefficients for 101 windows. Each coefficient is expressed as a real part and an imaginary part.
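As a minimal sketch relating this back to the earlier plots (and relying on the channel layout arranged by permute() above: real parts in channel one, imaginary parts in channel two), the familiar magnitudes can be recovered from the two channels:
x <- ds[1]$x
# the Euclidean norm over the real/imaginary channels is the magnitude spectrogram
magnitudes <- torch_sqrt(x[1, , ]^2 + x[2, , ]^2)
dim(magnitudes)
# [1] 257 101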
Next, we split up the data, and instantiate the dataset() and dataloader() objects.
train_ids <- sample(
1:length(ds),
size = 0.6 * length(ds)
)
valid_ids <- sample(
setdiff(
1:length(ds),
train_ids
),
size = 0.2 * length(ds)
)
test_ids <- setdiff(
1:length(ds),
union(train_ids, valid_ids)
)
batch_size <- 128
train_ds <- dataset_subset(ds, indices = train_ids)
train_dl <- dataloader(
train_ds,
batch_size = batch_size, shuffle = TRUE
)
valid_ds <- dataset_subset(ds, indices = valid_ids)
valid_dl <- dataloader(
valid_ds,
batch_size = batch_size
)
test_ds <- dataset_subset(ds, indices = test_ids)
test_dl <- dataloader(test_ds, batch_size = 64)
b <- train_dl %>%
dataloader_make_iter() %>%
dataloader_next()
dim(b$x)
[1] 128 2 257 101
The model is a straightforward convnet, with dropout and batch normalization. The real and imaginary parts of the Fourier coefficients are passed to the model's initial nn_conv2d() as two separate channels.
model <- nn_module(
initialize = function() {
self$features <- nn_sequential(
nn_conv2d(2, 32, kernel_size = 3),
nn_batch_norm2d(32),
nn_relu(),
nn_max_pool2d(kernel_size = 2),
nn_dropout2d(p = 0.2),
nn_conv2d(32, 64, kernel_size = 3),
nn_batch_norm2d(64),
nn_relu(),
nn_max_pool2d(kernel_size = 2),
nn_dropout2d(p = 0.2),
nn_conv2d(64, 128, kernel_size = 3),
nn_batch_norm2d(128),
nn_relu(),
nn_max_pool2d(kernel_size = 2),
nn_dropout2d(p = 0.2),
nn_conv2d(128, 256, kernel_size = 3),
nn_batch_norm2d(256),
nn_relu(),
nn_max_pool2d(kernel_size = 2),
nn_dropout2d(p = 0.2),
nn_conv2d(256, 512, kernel_size = 3),
nn_batch_norm2d(512),
nn_relu(),
nn_adaptive_avg_pool2d(c(1, 1)),
nn_dropout2d(p = 0.2)
)
self$classifier <- nn_sequential(
nn_linear(512, 512),
nn_batch_norm1d(512),
nn_relu(),
nn_dropout(p = 0.5),
nn_linear(512, 30)
)
},
forward = function(x) {
x <- self$features(x)$squeeze()
x <- self$classifier(x)
x
}
)
Next, we determine a suitable learning rate:
model <- model %>%
setup(
loss = nn_cross_entropy_loss(),
optimizer = optim_adam,
metrics = list(luz_metric_accuracy())
)
rates_and_losses <- model %>%
lr_finder(train_dl)
rates_and_losses %>% plot()
![Finding the learning rate, run on a complex spectrogram model.](https://blogs.rstudio.com/tensorflow/posts/images/audio-lr-finder.png)
Based on the plot, I decided to use 0.01 as the maximal learning rate. Training went on for forty epochs.
fitted <- model %>%
fit(train_dl,
epochs = 50, valid_data = valid_dl,
callbacks = list(
luz_callback_early_stopping(patience = 3),
luz_callback_lr_scheduler(
lr_one_cycle,
max_lr = 1e-2,
epochs = 50,
steps_per_epoch = length(train_dl),
call_on = "on_batch_end"
),
luz_callback_model_checkpoint(path = "models_complex/"),
luz_callback_csv_logger("logs_complex.csv")
),
verbose = TRUE
)
plot(fitted)
![Fit a complex spectrogram model.](https://blogs.rstudio.com/tensorflow/posts/images/audio-fit-complex.png)
Let's check the actual accuracy.
"epoch","set","loss","acc"
1,"train",3.09768574611813,0.12396992171405
1,"valid",2.52993751740923,0.284378862793572
2,"train",2.26747255972008,0.333642356819118
2,"valid",1.66693911248562,0.540791100123609
3,"train",1.62294889937818,0.518464153275649
3,"valid",1.11740599192825,0.704882571075402
...
...
38,"train",0.18717994078312,0.943809229501442
38,"valid",0.23587799138006,0.936418417799753
39,"train",0.19338578602993,0.942882159044087
39,"valid",0.230597475945365,0.939431396786156
40,"train",0.190593419024368,0.942727647301195
40,"valid",0.243536252455384,0.936186650185414
With thirty classes to distinguish between, a final validation-set accuracy of ~0.94 looks like a very decent result!
We can confirm this on the test set:
evaluate(fitted, test_dl)
loss: 0.2373
acc: 0.9324
An interesting question is which words get confused with one another most often. (Of course, even more interesting is how error probabilities relate to features of the spectrograms; but this, we have to leave to the true domain experts.) A nice way of displaying the confusion matrix is to create an alluvial plot. We see the predictions, on the left, "flow into" the target slots. (Target-prediction pairs that occur less frequently than a thousandth of the test-set cardinality are hidden.)
![Alluvium plot for complex spectrogram setup.](https://blogs.rstudio.com/tensorflow/posts/images/audio-alluvial-complex.png)
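Here is a sketch of how such an alluvial plot could be built, assuming the dplyr and ggalluvial packages, that class names are available as ds$classes, and that torch_argmax() returns 1-based indices (as it does in R torch); pred_classes, true_classes, and alluvial_df are illustrative names, not part of the setup above:
library(dplyr)
library(ggalluvial)
# predictions on the test set; order matches test_ids, since test_dl does not shuffle
preds <- predict(fitted, test_dl)
pred_classes <- ds$classes[as.numeric(torch_argmax(preds, dim = 2)$cpu())]
# true labels (re-reads each sample, which is slow but fine for a sketch)
true_classes <- ds$classes[sapply(test_ids, function(i) as.numeric(ds[i]$y))]
alluvial_df <- data.frame(truth = true_classes, prediction = pred_classes) %>%
  count(truth, prediction) %>%
  # hide target-prediction pairs rarer than a thousandth of the test-set cardinality
  filter(n > length(test_ids) / 1000)
ggplot(alluvial_df, aes(axis1 = prediction, axis2 = truth, y = n)) +
  geom_alluvium(aes(fill = truth)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 2) +
  theme_minimal() +
  theme(legend.position = "none")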
Summary
And that's it for today! In the coming weeks, expect more posts drawing on content from the forthcoming CRC book, Deep Learning and Scientific Computing with R torch. Thanks for reading!
Photo by Alex Lauzon on Unsplash
Warden, Pete. 2018. "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition." CoRR abs/1804.03209. http://arxiv.org/abs/1804.03209.