Introduction
The Transformers repository from "Hugging Face" contains a large number of ready-to-use, state-of-the-art models that can easily be downloaded and fine-tuned with TensorFlow and Keras.
For this purpose the user generally needs to obtain:
- The model itself (e.g., BERT, ALBERT, RoBERTa, GPT-2, etc.)
- The tokenizer object
- The model weights
In this post, we will work through a classic binary classification task and train our dataset on three models.
However, readers should be aware that transformers can be applied to a variety of downstream tasks, such as (a brief illustrative sketch follows the list):
- Feature extraction
- Sentiment analysis
- Text classification
- Question answering
- Summarization
- Translation, and many more.
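For example, several of these tasks are exposed through the library's pipeline API, which can also be called from R via reticulate. The sketch below is illustrative rather than part of the original workflow; it assumes the transformers package is already installed (see Prerequisites below), and a default model is downloaded on first use:
transformer = reticulate::import('transformers')
# ready-made sentiment-analysis pipeline
classifier = transformer$pipeline('sentiment-analysis')
classifier("This movie was surprisingly good!")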
Prerequisites
Our first task is to install the transformers package via reticulate:
reticulate::py_install('transformers', pip = TRUE)
Then, as usual, load the standard keras, tensorflow (>= 2.0), and a few other familiar R libraries.
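The later code chunks assume that keras, tensorflow, tfdatasets, and dplyr are attached, and that the Python transformers module has been imported into an object called transformer. A minimal sketch (the exact set of libraries is an assumption based on the functions used below):
library(keras)
library(tensorflow)
library(tfdatasets)  # tensor_slices_dataset(), dataset_batch(), ...
library(dplyr)       # pipe operator and data wrangling

# import the Python transformers module; the code below refers to it as `transformer`
transformer = reticulate::import('transformers')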
When running TensorFlow on GPU, you can specify the following parameters to avoid memory issues:
physical_devices = tf$config$list_physical_devices('GPU')
tf$config$experimental$set_memory_growth(physical_devices[[1]],TRUE)
tf$keras$backend$set_floatx('float32')
Template
We have already mentioned that in order to train data for a particular model, the user needs to download the model, tokenizer object, and weights. For example, to get the RoBERTa model, you need to do the following:
# get Tokenizer
transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)
# get Model with weights
transformer$TFRobertaModel$from_pretrained('roberta-base')
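As a quick sanity check, you can assign the returned objects and encode a short string into token ids. The object names and the sample sentence below are illustrative, not part of the original post:
roberta_tokenizer = transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case = TRUE)
roberta_model = transformer$TFRobertaModel$from_pretrained('roberta-base')

# encode a sample sentence into integer token ids
ids = roberta_tokenizer$encode("Transformers from R!", max_length = 10L, truncation = TRUE)
ids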
Data Preparation
The dataset for binary classification is provided in the text2vec package. Let's load the dataset and take a sample for quick model training.
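As a minimal sketch of this step (the renamed columns are assumptions chosen to match the comment_text and target columns used by the training loop below), we can load the movie_review data shipped with text2vec, rename its columns, and draw a small sample:
library(text2vec)
library(dplyr)

data("movie_review")

df = movie_review %>%
  rename(target = sentiment, comment_text = review) %>%
  sample_n(2000)  # small sample for quick training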
Split the data into two parts.
idx_train = sample.int(nrow(df) * 0.8)
train = df[idx_train, ]
test  = df[-idx_train, ]
Data Input for Keras
So far we have only covered data import and the train-test split. To provide input to the network, the raw text has to be converted to indices via the imported tokenizer. We then adapt the model to binary classification by adding a dense layer with a single unit at the end.
However, we want to train the data on three models: GPT-2, RoBERTa, and ELECTRA, so we need to write a loop for that.
Note: a single model typically requires 500-700 MB.
# list of 3 models
ai_m = list(
  c('TFGPT2Model',    'GPT2Tokenizer',    'gpt2'),
  c('TFRobertaModel', 'RobertaTokenizer', 'roberta-base'),
  c('TFElectraModel', 'ElectraTokenizer', 'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list for model results
gather_history = list()

for (i in 1:length(ai_m)) {

  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}',
                         do_lower_case=TRUE)") %>%
    rlang::parse_expr() %>% eval()

  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>%
    rlang::parse_expr() %>% eval()

  # inputs
  text = list()
  # outputs
  label = list()

  data_prep = function(data) {
    for (i in 1:nrow(data)) {
      txt = tokenizer$encode(data[['comment_text']][i], max_length = max_len,
                             truncation = T) %>%
        t() %>%
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()

      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix, text), do.call(plyr::rbind.fill.matrix, label))
  }

  train_ = data_prep(train)
  test_ = data_prep(test)

  # slice dataset
  tf_train = tensor_slices_dataset(list(train_[[1]], train_[[2]])) %>%
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>%
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>%
    dataset_prefetch(tf$data$experimental$AUTOTUNE)

  tf_test = tensor_slices_dataset(list(test_[[1]], test_[[2]])) %>%
    dataset_batch(batch_size = batch_size)

  # create an input layer
  input = layer_input(shape = c(max_len), dtype = 'int32')
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis = 1L) %>%
    layer_dense(64, activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units = 1, activation = 'sigmoid')
  model = keras_model(inputs = input, outputs = output)

  # compile with AUC score
  model %>% compile(optimizer = tf$keras$optimizers$Adam(learning_rate = 3e-5, epsilon = 1e-08, clipnorm = 1.0),
                    loss = tf$losses$BinaryCrossentropy(from_logits = F),
                    metrics = tf$metrics$AUC())

  print(glue::glue('{ai_m[[i]][1]}'))

  # train the model
  history = model %>% keras::fit(tf_train, epochs = epochs, # steps_per_epoch = len/batch_size,
                                 validation_data = tf_test)
  gather_history[[i]] <- history
  names(gather_history)[i] = ai_m[[i]][1]
}
(Training output reproduced from the notebook.)
Let's extract the results to see the benchmarks.
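Here is a rough sketch (not from the original post) for pulling the per-epoch validation AUC out of each keras history object; the exact metric name can vary between sessions (e.g. val_auc, val_auc_1), hence the grep:
res = do.call(rbind, lapply(names(gather_history), function(m) {
  metrics  = gather_history[[m]]$metrics
  auc_name = grep("^val_auc", names(metrics), value = TRUE)[1]
  data.frame(model   = m,
             epoch   = seq_along(metrics[[auc_name]]),
             val_auc = metrics[[auc_name]])
}))
res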
Both the RoBERTa and ELECTRA models show some additional improvement after two epochs of training, which cannot be said of GPT-2. In this case, it is clear that training a state-of-the-art model for even a single epoch can be enough.
Conclusion
In this post, we have shown how to use state-of-the-art NLP models from R. To understand how to apply them to more complex tasks, we recommend reviewing the transformers tutorial.
We encourage readers to try out this model and share their results in the comments section below!
Corrections
If you find a mistake or want to suggest a change, please create an issue in the source repository.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Unless otherwise noted, source code is available at https://github.com/henry090/transformers. Figures reused from other sources do not fall under this license and can be recognized by a note in their caption: "Figure from ...".
Citation
For attribution, please cite this work as:
Abdullayev (2020, July 30). Posit AI Blog: State-of-the-art NLP models from R. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/
BibTeX citation
@misc{abdullayev2020state-of-the-art,
  author = {Abdullayev, Turgut},
  title = {Posit AI Blog: State-of-the-art NLP models from R},
  url = {https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/},
  year = {2020}
}