Introduction
The Transformers repository from "Hugging Face" contains a large number of ready-to-use, state-of-the-art models that can easily be downloaded and fine-tuned with TensorFlow and Keras.
For this purpose the user generally needs to obtain:
- The model itself (e.g., BERT, ALBERT, RoBERTa, GPT-2, etc.)
- The tokenizer object
- The model weights
In this post, we will work through a classic binary classification task and train our dataset on three models.
However, readers should be aware that transformers can be applied to a variety of downstream tasks, such as (a brief illustrative sketch follows the list):
- Feature extraction
- Sentiment analysis
- Text classification
- Question answering
- Summarization
- Translation, and many more.
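For example, several of these tasks are exposed through the library's pipeline API, which can also be called from R via reticulate. The sketch below is illustrative rather than part of the original workflow; it assumes the transformers package is already installed (see Prerequisites below), and a default model is downloaded on first use:
transformer = reticulate::import('transformers')
# ready-made sentiment-analysis pipeline
classifier = transformer$pipeline('sentiment-analysis')
classifier("This movie was surprisingly good!")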
Prerequisites
Our first task is to install the transformers package via reticulate:
reticulate::py_install('transformers', pip = TRUE)
Then, as usual, load the standard keras, tensorflow (>= 2.0), and a few other familiar R libraries.
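The later code chunks assume that keras, tensorflow, tfdatasets, and dplyr are attached, and that the Python transformers module has been imported into an object called transformer. A minimal sketch (the exact set of libraries is an assumption based on the functions used below):
library(keras)
library(tensorflow)
library(tfdatasets)  # tensor_slices_dataset(), dataset_batch(), ...
library(dplyr)       # pipe operator and data wrangling

# import the Python transformers module; the code below refers to it as `transformer`
transformer = reticulate::import('transformers')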
When running TensorFlow on GPU, you can specify the following parameters to avoid memory issues:
physical_devices = tf$config$list_physical_devices('GPU')
tf$config$experimental$set_memory_growth(physical_devices[[1]],TRUE)
tf$keras$backend$set_floatx('float32')
Template
We have already mentioned that in order to train data for a particular model, the user needs to download the model, tokenizer object, and weights. For example, to get the RoBERTa model, you need to do the following:
# get Tokenizer
transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)
# get Model with weights
transformer$TFRobertaModel$from_pretrained('roberta-base')
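As a quick sanity check, you can assign the returned objects and encode a short string into token ids. The object names and the sample sentence below are illustrative, not part of the original post:
roberta_tokenizer = transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case = TRUE)
roberta_model = transformer$TFRobertaModel$from_pretrained('roberta-base')

# encode a sample sentence into integer token ids
ids = roberta_tokenizer$encode("Transformers from R!", max_length = 10L, truncation = TRUE)
ids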
Data Preparation
The dataset for binary classification is provided in the text2vec package. Let's load the dataset and take a sample for quick model training.
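As a minimal sketch of this step (the renamed columns are assumptions chosen to match the comment_text and target columns used by the training loop below), we can load the movie_review data shipped with text2vec, rename its columns, and draw a small sample:
library(text2vec)
library(dplyr)

data("movie_review")

df = movie_review %>%
  rename(target = sentiment, comment_text = review) %>%
  sample_n(2000)  # small sample for quick training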
Split the data into two parts.
idx_train = sample.int(nrow(df) * 0.8)
train = df[idx_train, ]
test  = df[-idx_train, ]
Data Input for Keras
So far we have only covered data import and the train-test split. To provide input to the network, the raw text has to be converted to indices via the imported tokenizer. We then adapt the model to binary classification by adding a dense layer with a single unit at the end.
However, we want to train the data on three models: GPT-2, RoBERTa, and ELECTRA, so we need to write a loop for that.
Note: a single model typically requires 500-700 MB.
# list of 3 models
ai_m = list(
  c('TFGPT2Model',    'GPT2Tokenizer',    'gpt2'),
  c('TFRobertaModel', 'RobertaTokenizer', 'roberta-base'),
  c('TFElectraModel', 'ElectraTokenizer', 'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list for model results
gather_history = list()

for (i in 1:length(ai_m)) {

  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}',
                         do_lower_case=TRUE)") %>%
    rlang::parse_expr() %>% eval()

  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>%
    rlang::parse_expr() %>% eval()

  # inputs
  text = list()
  # outputs
  label = list()

  data_prep = function(data) {
    for (i in 1:nrow(data)) {
      txt = tokenizer$encode(data[['comment_text']][i], max_length = max_len,
                             truncation = T) %>%
        t() %>%
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()

      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix, text), do.call(plyr::rbind.fill.matrix, label))
  }

  train_ = data_prep(train)
  test_ = data_prep(test)

  # slice dataset
  tf_train = tensor_slices_dataset(list(train_[[1]], train_[[2]])) %>%
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>%
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>%
    dataset_prefetch(tf$data$experimental$AUTOTUNE)

  tf_test = tensor_slices_dataset(list(test_[[1]], test_[[2]])) %>%
    dataset_batch(batch_size = batch_size)

  # create an input layer
  input = layer_input(shape = c(max_len), dtype = 'int32')
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis = 1L) %>%
    layer_dense(64, activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units = 1, activation = 'sigmoid')
  model = keras_model(inputs = input, outputs = output)

  # compile with AUC score
  model %>% compile(optimizer = tf$keras$optimizers$Adam(learning_rate = 3e-5, epsilon = 1e-08, clipnorm = 1.0),
                    loss = tf$losses$BinaryCrossentropy(from_logits = F),
                    metrics = tf$metrics$AUC())

  print(glue::glue('{ai_m[[i]][1]}'))

  # train the model
  history = model %>% keras::fit(tf_train, epochs = epochs, # steps_per_epoch = len/batch_size,
                                 validation_data = tf_test)
  gather_history[[i]] <- history
  names(gather_history)[i] = ai_m[[i]][1]
}
(Training output reproduced from the notebook.)
Let's extract the results to see the benchmarks.
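Here is a rough sketch (not from the original post) for pulling the per-epoch validation AUC out of each keras history object; the exact metric name can vary between sessions (e.g. val_auc, val_auc_1), hence the grep:
res = do.call(rbind, lapply(names(gather_history), function(m) {
  metrics  = gather_history[[m]]$metrics
  auc_name = grep("^val_auc", names(metrics), value = TRUE)[1]
  data.frame(model   = m,
             epoch   = seq_along(metrics[[auc_name]]),
             val_auc = metrics[[auc_name]])
}))
res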
Both the RoBERTa and ELECTRA models show some additional improvement after two epochs of training, which cannot be said of GPT-2. In this case, it is clear that training a state-of-the-art model for even a single epoch can be enough.
Conclusion
In this post, we have shown how to use state-of-the-art NLP models from R. To understand how to apply them to more complex tasks, we recommend reviewing the transformers tutorial.
We encourage readers to try out this model and share their results in the comments section below!
Corrections
If you find a mistake or want to suggest a change, please create an issue in the source repository.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Unless otherwise noted, source code is available at https://github.com/henry090/transformers. Figures reused from other sources do not fall under this license and can be recognized by a note in their caption: "Figure from ...".
Citation
For attribution, please cite this work as:
Abdullayev (2020, July 30). Posit AI Blog: State-of-the-art NLP models from R. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/
BibTeX citation
@misc{abdullayev2020state-of-the-art,
  author = {Abdullayev, Turgut},
  title = {Posit AI Blog: State-of-the-art NLP models from R},
  url = {https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/},
  year = {2020}
}