AI-based language analysis has recently taken a major leap forward with transformer language models (Vaswani et al., 2017; Liu et al., 2019). Companies such as Google, Meta, and OpenAI have released such models, including BERT, RoBERTa, and GPT, achieving unprecedented improvements in most language tasks, such as web search and sentiment analysis. While these language models and common AI tasks are accessible in Python through HuggingFace, the text package makes HuggingFace and state-of-the-art transformer language models accessible as a social scientific pipeline in R.
Introduction
We developed the text package (Kjell, Giorgi & Schwartz, 2022) with two goals in mind: First, to serve as a modular solution for downloading and using transformer language models. This includes, for example, transforming text into word embeddings as well as accessing common language model tasks such as text classification, sentiment analysis, text generation, question answering, and translation. Second, to provide an end-to-end solution designed for human-level analyses, including pipelines with state-of-the-art AI techniques tailored for predicting characteristics of the persons who produced the language or deriving insights into linguistic correlates of psychological attributes.
This blog post shows how to install the text package, transform text into state-of-the-art contextual word embeddings, use language analysis tasks, and visualize words in word embedding space.
Install and set up a Python environment
The text package sets up a Python environment to access the HuggingFace language models. The first time after installing the text package, you need to run two functions: textrpp_install() and textrpp_initialize().
# Install text from CRAN
install.packages("text")
library(text)
# Install text's required Python packages in a conda environment (with defaults)
textrpp_install()
# Initialize the installed conda environment
# save_profile = TRUE saves the settings so that you do not have to run textrpp_initialize() again after restarting R
textrpp_initialize(save_profile = TRUE)
For more information, see the extended installation guide.
Convert text to word embeddings
The textEmbed() function is used to transform text into word embeddings (i.e., numeric representations of text). The model argument lets you set which language model to use from HuggingFace. If you have not used the model before, it will automatically download the model and the necessary files.
# Transform the text data to BERT word embeddings
# Note: To run faster, try something smaller: model = 'distilroberta-base'.
word_embeddings <- textEmbed(texts = "Hello, how are you doing?",
model = 'bert-base-uncased')
word_embeddings
comment(word_embeddings)
You can now use word embeddings for downstream tasks, such as training models to predict related numeric variables (see, for example, the textTrain() and textPredict() functions).
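For instance, a minimal sketch of such a pipeline, using the example data introduced later in this post, might look as follows (the exact output structure, such as the $results element, is an assumption based on the package documentation):
# Minimal sketch (assumed workflow): predict Harmony in Life Scale scores
# (hilstotal) from embeddings of participants' harmony descriptions.
harmony_embeddings <- textEmbed(Language_based_assessment_data_3_100["harmonywords"])
hils_model <- textTrain(
  x = harmony_embeddings$texts$harmonywords,            # text-level word embeddings
  y = Language_based_assessment_data_3_100$hilstotal    # numeric outcome to predict
)
# Cross-validated results, e.g., the correlation between observed and predicted scores
hils_model$results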
(See the textEmbedRawLayers() function for retrieving token-level embeddings and the output of individual layers.)
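A brief sketch of that function follows (the layers argument is an assumption about how to select which hidden layers to return):
# Sketch: retrieve token-level embeddings from individual hidden layers
# rather than aggregated text-level embeddings.
raw_layers <- textEmbedRawLayers(
  texts = "Hello, how are you doing?",
  model = "bert-base-uncased",
  layers = 11:12   # which hidden layers to return (assumed argument)
)
raw_layers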
HuggingFace contains many transformer language models that can be used for various language model tasks such as text classification, sentiment analysis, text generation, question answering, and translation. The text package contains user-friendly functions to access these tasks.
classifications <- textClassify("Hello, how are you doing?")
classifications
comment(classifications)
generated_text <- textGeneration("The meaning of life is")
generated_text
For more examples of available language model tasks, see functions such as textSum(), textQA(), textTranslate(), and textZeroShot() under Language Analysis Tasks.
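As one illustration, a zero-shot classification call might look like the sketch below (the argument names are assumed to mirror the underlying HuggingFace zero-shot pipeline):
# Sketch (assumed arguments): classify a sentence into arbitrary candidate labels
# without any task-specific training.
zero_shot <- textZeroShot(
  sequences = "I could not fall asleep last night.",
  candidate_labels = c("sleep problems", "happiness", "work stress")
)
zero_shot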
Visualization of words
The text package's word-plotting functions involve two steps: first pre-processing the data, and second plotting the words, including adjusting visual characteristics such as color and font size. To demonstrate these two functions we use example data included in the text package: Language_based_assessment_data_3_100. We show how to create a two-dimensional figure of the words individuals have used to describe their harmony in life, plotted according to two different well-being questionnaires: the Harmony in Life Scale and the Satisfaction with Life Scale. So the x-axis shows words related to low versus high Harmony in Life Scale scores, and the y-axis shows words related to low versus high Satisfaction with Life Scale scores.
# Transform the text data to BERT word embeddings
word_embeddings_bert <- textEmbed(Language_based_assessment_data_3_100,
aggregation_from_tokens_to_word_types = "mean",
keep_token_embeddings = FALSE)
# Pre-process the data for plotting
df_for_plotting <- textProjection(Language_based_assessment_data_3_100$harmonywords,
word_embeddings_bert$texts$harmonywords,
word_embeddings_bert$word_types,
Language_based_assessment_data_3_100$hilstotal,
Language_based_assessment_data_3_100$swlstotal
)
# Plot the data
plot_projection <- textProjectionPlot(
word_data = df_for_plotting,
y_axes = TRUE,
p_alpha = 0.05,
title_top = "Supervised Bicentroid Projection of Harmony in life words",
x_axes_label = "Low vs. High HILS score",
y_axes_label = "Low vs. High SWLS score",
p_adjust_method = "bonferroni",
points_without_words_size = 0.4,
points_without_words_alpha = 0.4
)
plot_projection$final_plot
This post has demonstrated how to perform state-of-the-art text analysis in R using the text package. The package intends to make it easy to access and use HuggingFace's transformer language models for analyzing natural language. We look forward to your feedback and contributions toward making these models more commonly used among R users in the social sciences and other applications.
- Bommasani et al. (2021). On the Opportunities and Risks of Foundation Models.
- Kjell et al. (2022). The text-package: An R-package for analyzing and visualizing human language using natural language processing and deep learning.
- Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
- Vaswani et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 5998-6008.
Corrections
If you see a mistake or want to suggest a change, please create an issue in the source repository.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Unless otherwise noted, source code can be found at https://github.com/OscarKjell/ai-blog. Illustrations reused from other sources do not fall under this license and can be recognized by the note “Illustration of…” in the caption.
Citation
To give attribution, please cite this work as follows:
Kjell, et al. (2022, Oct. 4). Posit AI Blog: Introducing the text package. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/
BibTeX citation
@misc{kjell2022introducing,
  author = {Kjell, Oscar and Giorgi, Salvatore and Schwartz, H Andrew},
  title = {Posit AI Blog: Introducing the text package},
  url = {https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/},
  year = {2022}
}