Variations on a theme
Simple audio classification with Keras, Audio classification with Keras: Looking closer at the non-deep-learning parts, Simple audio classification with torch: No, this is not the first post on this blog to introduce speech classification using deep learning. With two of those posts (the "applied" ones), it shares the general setup, the type of deep learning architecture employed, and the dataset used. With the third, it has in common the interest in the underlying ideas and concepts. Each of these posts has a different focus, though. Should you read this one?
Well, of course I can't say "no", all the more so because what you have here is an abbreviated and condensed version of the chapter on this topic in the forthcoming CRC Press book, Deep Learning and Scientific Computing with R torch. Compared with the previous torch post, written by the creator and maintainer of torchaudio, Athos Damiani, significant developments have taken place in the torch ecosystem, with the end result that the code got a lot easier, especially in the model-training part. That said, let's end the preamble and get to the topic!
Inspecting the data
We use the Speech Commands dataset (Warden (2018)) that comes with torchaudio. The dataset holds recordings of thirty different one- or two-syllable words, uttered by different speakers. There are about 65,000 audio files overall. Our task is to predict, from the audio alone, which of the thirty possible words was pronounced.
library(torch)
library(torchaudio)
library(luz)
ds <- speechcommand_dataset(
root = "~/.torch-datasets",
url = "speech_commands_v0.01",
download = TRUE
)
We start by examining the data.
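The thirty class names can be listed directly from the dataset object. A minimal sketch, assuming they are exposed as ds$classes (this field name is an assumption, not shown in the code above):
ds$classes  # assumed accessor; lists the thirty word labels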
[1] "bed" "bird" "cat" "dog" "down" "eight"
[7] "five" "four" "go" "happy" "house" "left"
[13] "marvin" "nine" "no" "off" "on" "one"
[19] "right" "seven" "sheila" "six" "stop" "three"
[25] "tree" "two" "up" "wow" "yes" "zero"
Picking a sample at random, we see that the information we need is contained in four properties: waveform, sample_rate, label_index, and label.
The first, waveform, will be our predictor.
sample <- ds[2000]
dim(sample$waveform)
[1] 1 16000
Individual tensor values are centered at zero, and range between -1 and 1. There are 16,000 of them, reflecting the fact that the recording lasted for one second and was registered (or converted, by the dataset creators) at a rate of 16,000 samples per second. The latter piece of information is stored in sample$sample_rate:
[1] 16000
All recordings have been sampled at the same rate. Their length almost always equals one second; the very few sounds that are minimally longer can safely be truncated.
Finally, the target is stored, in integer form, in sample$label_index; the corresponding word is available from sample$label:
sample$label
sample$label_index
[1] "bird"
torch_tensor
2
[ CPULongType{} ]
How does this audio signal “look”?
library(ggplot2)
df <- data.frame(
x = 1:length(sample$waveform[1]),
y = as.numeric(sample$waveform[1])
)
ggplot(df, aes(x = x, y = y)) +
geom_line(size = 0.3) +
ggtitle(
paste0(
"The spoken word \"", sample$label, "\": Sound wave"
)
) +
xlab("time") +
ylab("amplitude") +
theme_minimal()
![The spoken word "bird": sound wave.](https://blogs.rstudio.com/tensorflow/posts/images/audio-bird-waveform.png)
What we see is a sequence of amplitudes, reflecting the sound wave produced by someone saying "bird". Put differently, we have here a time series of "loudness values". Even for experts, guessing which word resulted in those amplitudes is an impossible task. This is where domain knowledge comes in. The expert may not be able to make much of the signal in this representation; but they may know a way to represent it more meaningfully.
Two equivalent representations
Imagine that, instead of as a sequence of amplitudes over time, the above wave were represented in a way that had no information about time at all. Next, imagine we took that representation and tried to recover the original signal. For that to be possible, the new representation would somehow have to contain "as much" information as the wave we started from. That "as much" is obtained from the Fourier transform, and it consists of the magnitudes and phase shifts of the different frequencies that make up the signal.
How, then, does the Fourier-transformed version of the sound wave look? We obtain it by calling torch_fft_fft() (where fft stands for Fast Fourier Transform):
dft <- torch_fft_fft(sample$waveform)
dim(dft)
[1] 1 16000
This tensor has the same length as the original one. Its values, however, are not ordered in time; instead, they are the Fourier coefficients, corresponding to the frequencies contained in the signal. The higher their magnitude, the more they contribute to the signal:
mag <- torch_abs(dft[1, ])
df <- data.frame(
x = 1:(length(sample$waveform[1]) / 2),
y = as.numeric(mag[1:8000])
)
ggplot(df, aes(x = x, y = y)) +
geom_line(size = 0.3) +
ggtitle(
paste0(
"The spoken word \"",
sample$label,
"\": Discrete Fourier Transform"
)
) +
xlab("frequency") +
ylab("magnitude") +
theme_minimal()
![The spoken word "bird": discrete Fourier transform.](https://blogs.rstudio.com/tensorflow/posts/images/audio-bird-dft.png)
From this alternate representation, we could go back to the original sound wave by taking the frequencies present in the signal, weighting them by their coefficients, and adding them up. But in sound classification, timing information surely matters; we don't really want to throw it away.
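As a quick check that nothing essential is lost in the frequency-domain representation, here is a minimal sketch, using only the waveform and dft tensors defined above, that applies the inverse transform, torch_fft_ifft(), and compares the reconstruction with the original signal:
reconstructed <- torch_fft_ifft(dft)
# the inverse FFT returns a complex tensor; its imaginary parts are numerically
# zero, and its real parts match the original amplitudes up to floating-point error
max(abs(as.numeric(torch_real(reconstructed)) - as.numeric(sample$waveform)))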
Combining representations: spectrograms
In fact, what would really help us is a synthesis of both representations; some sort of "have your cake and eat it, too". What if we could divide the signal into small chunks, and run the Fourier transform on each of them? As you may have guessed from this lead-up, this is indeed something we can do; and the representation it creates is called the spectrogram.
With a spectrogram, we still keep some time-domain information; some, because there is an unavoidable loss in granularity. On the other hand, for each of the time segments, we learn about their spectral composition. There is an important point to be made, though: the resolutions we obtain in time and in frequency, respectively, are inversely related. If we split up the signal into many chunks (called "windows"), the frequency representation per window will not be very fine-grained. Conversely, if we want better resolution in the frequency domain, we have to choose longer windows, thus losing information about how spectral composition varies over time. What sounds like a big problem, and in many cases will be, won't be one for us, though, as you'll see very soon.
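To make this trade-off concrete, here is a quick sketch comparing two window sizes on our example waveform; the dimensions given in the comments are approximate, and assume the default hop length of half the window size:
# short windows: many time frames, but few frequency bins
spec_short <- transform_spectrogram(n_fft = 128, win_length = 128, power = 0.5)
dim(spec_short(sample$waveform)$squeeze())
# roughly: [1]  65 251
# long windows: many frequency bins, but few time frames
spec_long <- transform_spectrogram(n_fft = 2048, win_length = 2048, power = 0.5)
dim(spec_long(sample$waveform)$squeeze())
# roughly: [1] 1025   16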
First, though, let's create and inspect such a spectrogram for our example signal. In the following code snippet, the size of the overlapping windows is chosen so as to allow for reasonable granularity in both the time and the frequency domain. We are left with sixty-three windows and, for each window, obtain two hundred fifty-seven coefficients:
fft_size <- 512
window_size <- 512
power <- 0.5
spectrogram <- transform_spectrogram(
n_fft = fft_size,
win_length = window_size,
normalized = TRUE,
power = power
)
spec <- spectrogram(sample$waveform)$squeeze()
dim(spec)
[1] 257 63
We can display the spectrogram visually:
bins <- 1:dim(spec)[1]
freqs <- bins / (fft_size / 2 + 1) * sample$sample_rate
log_freqs <- log10(freqs)
frames <- 1:(dim(spec)[2])
seconds <- (frames / dim(spec)[2]) *
(dim(sample$waveform$squeeze())[1] / sample$sample_rate)
image(x = as.numeric(seconds),
y = log_freqs,
z = t(as.matrix(spec)),
ylab = 'log frequency [Hz]',
xlab = 'time [s]',
col = hcl.colors(12, palette = "viridis")
)
main <- paste0("Spectrogram, window size = ", window_size)
sub <- "Magnitude (square root)"
mtext(side = 3, line = 2, at = 0, adj = 0, cex = 1.3, main)
mtext(side = 3, line = 1, at = 0, adj = 0, cex = 1, sub)
![The spoken word "bird": spectrogram.](https://blogs.rstudio.com/tensorflow/posts/images/audio-spectrogram.png)
We know that we've lost some resolution in both time and frequency. By displaying the square root of the coefficients' magnitudes, though, and thereby enhancing sensitivity, we were still able to obtain a reasonable result. (With the viridis color scheme, longer-wavelength shades indicate higher-valued coefficients; shorter-wavelength ones, the opposite.)
Finally, let's get back to the crucial question. If this representation is, by necessity, a compromise, why would we want to use it? This is where we take the deep learning perspective. The spectrogram is a two-dimensional representation: an image. With images, we have access to a rich reservoir of techniques and architectures; among all areas deep learning has been successful in, image recognition still stands out. Soon you'll see that, for this task, fancy architectures are not even needed; a straightforward convnet will do a very good job.
Training a neural network on spectrograms
We start by creating a torch::dataset() that, starting from the original speechcommand_dataset(), computes a spectrogram for every sample.
spectrogram_dataset <- dataset(
inherit = speechcommand_dataset,
initialize = function(...,
pad_to = 16000,
sampling_rate = 16000,
n_fft = 512,
window_size_seconds = 0.03,
window_stride_seconds = 0.01,
power = 2) {
self$pad_to <- pad_to
self$window_size_samples <- sampling_rate *
window_size_seconds
self$window_stride_samples <- sampling_rate *
window_stride_seconds
self$power <- power
self$spectrogram <- transform_spectrogram(
n_fft = n_fft,
win_length = self$window_size_samples,
hop_length = self$window_stride_samples,
normalized = TRUE,
power = self$power
)
super$initialize(...)
},
.getitem = function(i) {
item <- super$.getitem(i)
x <- item$waveform
# make sure all samples have the same length (pad_to)
# shorter ones will be padded,
# longer ones will be truncated
x <- nnf_pad(x, pad = c(0, self$pad_to - dim(x)[2]))
x <- x %>% self$spectrogram()
if (is.null(self$power)) {
# in this case, there is an additional dimension, in position 4,
# that we want to appear in front
# (as a second channel)
x <- x$squeeze()$permute(c(3, 1, 2))
}
y <- item$label_index
list(x = x, y = y)
}
)
In the parameter list of spectrogram_dataset(), note power, with a default value of 2. This is the value that, unless told otherwise, torch's transform_spectrogram() will assume power to have. Under these circumstances, the values that make up the spectrogram are the squared magnitudes of the Fourier coefficients. Using power, you can change the default and specify, for example, that you'd like absolute values (power = 1), any other positive value (such as 0.5, used above to display a concrete example), or both the real and imaginary parts of the coefficients (power = NULL).
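As a quick illustration, here is a sketch of how power affects the shape of what transform_spectrogram() returns, using the example waveform and the 512-sample window from above (the shapes in the comments are for orientation):
spec_abs <- transform_spectrogram(n_fft = 512, win_length = 512, power = 1)
dim(spec_abs(sample$waveform))
# [1]   1 257  63          (absolute values of the coefficients)
spec_complex <- transform_spectrogram(n_fft = 512, win_length = 512, power = NULL)
dim(spec_complex(sample$waveform))
# [1]   1 257  63   2      (real and imaginary parts, in a trailing dimension)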
Display-wise, of course, the full complex representation is inconvenient; the spectrogram plot would need an additional dimension. But we may well wonder whether a neural network could profit from the additional information contained in the "whole" complex number. After all, when reducing to magnitudes we lose the phase shifts of the individual coefficients, which might contain usable information. In fact, my tests showed that it did: using the complex values improved classification accuracy.
Let's see what we get from spectrogram_dataset():
ds <- spectrogram_dataset(
root = "~/.torch-datasets",
url = "speech_commands_v0.01",
download = TRUE,
power = NULL
)
dim(ds[1]$x)
[1] 2 257 101
There are 257 coefficients for 101 windows. Each coefficient is expressed as a real part and an imaginary part.
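As a minimal sketch relating this back to the earlier plots (and relying on the channel layout arranged by permute() above: real parts in channel one, imaginary parts in channel two), the familiar magnitudes can be recovered from the two channels:
x <- ds[1]$x
# the Euclidean norm over the real/imaginary channels is the magnitude spectrogram
magnitudes <- torch_sqrt(x[1, , ]^2 + x[2, , ]^2)
dim(magnitudes)
# [1] 257 101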
Next, we split up the data, and instantiate the dataset() and dataloader() objects.
train_ids <- sample(
1:length(ds),
size = 0.6 * length(ds)
)
valid_ids <- sample(
setdiff(
1:length(ds),
train_ids
),
size = 0.2 * length(ds)
)
test_ids <- setdiff(
1:length(ds),
union(train_ids, valid_ids)
)
batch_size <- 128
train_ds <- dataset_subset(ds, indices = train_ids)
train_dl <- dataloader(
train_ds,
batch_size = batch_size, shuffle = TRUE
)
valid_ds <- dataset_subset(ds, indices = valid_ids)
valid_dl <- dataloader(
valid_ds,
batch_size = batch_size
)
test_ds <- dataset_subset(ds, indices = test_ids)
test_dl <- dataloader(test_ds, batch_size = 64)
b <- train_dl %>%
dataloader_make_iter() %>%
dataloader_next()
dim(b$x)
[1] 128 2 257 101
The model is a straightforward convnet, with dropout and batch normalization. The real and imaginary parts of the Fourier coefficients are passed to the model's initial nn_conv2d() as two separate channels.
model <- nn_module(
initialize = function() {
self$features <- nn_sequential(
nn_conv2d(2, 32, kernel_size = 3),
nn_batch_norm2d(32),
nn_relu(),
nn_max_pool2d(kernel_size = 2),
nn_dropout2d(p = 0.2),
nn_conv2d(32, 64, kernel_size = 3),
nn_batch_norm2d(64),
nn_relu(),
nn_max_pool2d(kernel_size = 2),
nn_dropout2d(p = 0.2),
nn_conv2d(64, 128, kernel_size = 3),
nn_batch_norm2d(128),
nn_relu(),
nn_max_pool2d(kernel_size = 2),
nn_dropout2d(p = 0.2),
nn_conv2d(128, 256, kernel_size = 3),
nn_batch_norm2d(256),
nn_relu(),
nn_max_pool2d(kernel_size = 2),
nn_dropout2d(p = 0.2),
nn_conv2d(256, 512, kernel_size = 3),
nn_batch_norm2d(512),
nn_relu(),
nn_adaptive_avg_pool2d(c(1, 1)),
nn_dropout2d(p = 0.2)
)
self$classifier <- nn_sequential(
nn_linear(512, 512),
nn_batch_norm1d(512),
nn_relu(),
nn_dropout(p = 0.5),
nn_linear(512, 30)
)
},
forward = function(x) {
x <- self$features(x)$squeeze()
x <- self$classifier(x)
x
}
)
Next, we determine a suitable learning rate:
model <- model %>%
setup(
loss = nn_cross_entropy_loss(),
optimizer = optim_adam,
metrics = list(luz_metric_accuracy())
)
rates_and_losses <- model %>%
lr_finder(train_dl)
rates_and_losses %>% plot()
![Finding the learning rate, run on a complex spectrogram model.](https://blogs.rstudio.com/tensorflow/posts/images/audio-lr-finder.png)
Based on the plot, I decided to use 0.01 as the maximal learning rate. Training went on for forty epochs.
fitted <- model %>%
fit(train_dl,
epochs = 50, valid_data = valid_dl,
callbacks = list(
luz_callback_early_stopping(patience = 3),
luz_callback_lr_scheduler(
lr_one_cycle,
max_lr = 1e-2,
epochs = 50,
steps_per_epoch = length(train_dl),
call_on = "on_batch_end"
),
luz_callback_model_checkpoint(path = "models_complex/"),
luz_callback_csv_logger("logs_complex.csv")
),
verbose = TRUE
)
plot(fitted)
![Fit a complex spectrogram model.](https://blogs.rstudio.com/tensorflow/posts/images/audio-fit-complex.png)
Let's check the actual accuracy.
"epoch","set","loss","acc"
1,"train",3.09768574611813,0.12396992171405
1,"valid",2.52993751740923,0.284378862793572
2,"train",2.26747255972008,0.333642356819118
2,"valid",1.66693911248562,0.540791100123609
3,"train",1.62294889937818,0.518464153275649
3,"valid",1.11740599192825,0.704882571075402
...
...
38,"train",0.18717994078312,0.943809229501442
38,"valid",0.23587799138006,0.936418417799753
39,"train",0.19338578602993,0.942882159044087
39,"valid",0.230597475945365,0.939431396786156
40,"train",0.190593419024368,0.942727647301195
40,"valid",0.243536252455384,0.936186650185414
With thirty classes to distinguish between, a final validation-set accuracy of ~0.94 looks like a very decent result!
We can confirm this on the test set:
evaluate(fitted, test_dl)
loss: 0.2373
acc: 0.9324
An interesting question is which words get confused with one another most often. (Of course, even more interesting is how error probabilities relate to features of the spectrograms; but this, we have to leave to the true domain experts.) A nice way of displaying the confusion matrix is to create an alluvial plot. We see the predictions, on the left, "flow into" the target slots. (Target-prediction pairs that occur less frequently than a thousandth of the test-set cardinality are hidden.)
![Alluvium plot for complex spectrogram setup.](https://blogs.rstudio.com/tensorflow/posts/images/audio-alluvial-complex.png)
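Here is a sketch of how such an alluvial plot could be built, assuming the dplyr and ggalluvial packages, that class names are available as ds$classes, and that torch_argmax() returns 1-based indices (as it does in R torch); pred_classes, true_classes, and alluvial_df are illustrative names, not part of the setup above:
library(dplyr)
library(ggalluvial)
# predictions on the test set; order matches test_ids, since test_dl does not shuffle
preds <- predict(fitted, test_dl)
pred_classes <- ds$classes[as.numeric(torch_argmax(preds, dim = 2)$cpu())]
# true labels (re-reads each sample, which is slow but fine for a sketch)
true_classes <- ds$classes[sapply(test_ids, function(i) as.numeric(ds[i]$y))]
alluvial_df <- data.frame(truth = true_classes, prediction = pred_classes) %>%
  count(truth, prediction) %>%
  # hide target-prediction pairs rarer than a thousandth of the test-set cardinality
  filter(n > length(test_ids) / 1000)
ggplot(alluvial_df, aes(axis1 = prediction, axis2 = truth, y = n)) +
  geom_alluvium(aes(fill = truth)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 2) +
  theme_minimal() +
  theme(legend.position = "none")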
Summary
And that's it for today! In the coming weeks, expect more posts drawing on content from the forthcoming CRC book, Deep Learning and Scientific Computing with R torch. Thanks for reading!
Photo by Alex Lauzon on Unsplash
Warden, Pete. 2018. "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition." CoRR abs/1804.03209. http://arxiv.org/abs/1804.03209.