AI-based language analysis has recently taken a major leap forward with transformer language models (Vaswani et al., 2017; Liu et al., 2019). Companies such as Google, Meta, and OpenAI have released such models, including BERT, RoBERTa, and GPT, achieving unprecedented improvements in most language tasks, such as web search and sentiment analysis. While these language models and common AI tasks are accessible in Python through HuggingFace, the text package makes HuggingFace and state-of-the-art transformer language models accessible as a social scientific pipeline in R.
Introduction
We developed the text package (Kjell, Giorgi & Schwartz, 2022) with two goals in mind: First, to serve as a modular solution for downloading and using transformer language models. This includes, for example, transforming text into word embeddings as well as accessing common language model tasks such as text classification, sentiment analysis, text generation, question answering, and translation. Second, to provide an end-to-end solution designed for human-level analyses, including pipelines with state-of-the-art AI techniques tailored for predicting characteristics of the persons who produced the language or deriving insights into linguistic correlates of psychological attributes.
This blog post shows how to install the text package, transform text into state-of-the-art contextual word embeddings, use language analysis tasks, and visualize words in word embedding space.
Install and set up a Python environment
The text package sets up a Python environment to access the HuggingFace language models. The first time after installing the text package, you need to run two functions: textrpp_install() and textrpp_initialize().
# Install text from CRAN
install.packages("text")
library(text)
# Install text's required Python packages in a conda environment (with defaults)
textrpp_install()
# Initialize the installed conda environment
# save_profile = TRUE saves the settings so that you do not have to run textrpp_initialize() again after restarting R
textrpp_initialize(save_profile = TRUE)
For more information, see the extended installation guide.
Convert text to word embeddings
The textEmbed() function is used to transform text into word embeddings (i.e., numeric representations of text). The model argument lets you set which language model to use from HuggingFace. If you have not used the model before, it will automatically download the model and the necessary files.
# Transform the text data to BERT word embeddings
# Note: To run faster, try something smaller: model = 'distilroberta-base'.
word_embeddings <- textEmbed(texts = "Hello, how are you doing?",
model = 'bert-base-uncased')
word_embeddings
comment(word_embeddings)
You can now use word embeddings for downstream tasks, such as training models to predict related numeric variables (see, for example, the textTrain() and textPredict() functions).
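For instance, a minimal sketch of such a pipeline, using the example data introduced later in this post, might look as follows (the exact output structure, such as the $results element, is an assumption based on the package documentation):
# Minimal sketch (assumed workflow): predict Harmony in Life Scale scores
# (hilstotal) from embeddings of participants' harmony descriptions.
harmony_embeddings <- textEmbed(Language_based_assessment_data_3_100["harmonywords"])
hils_model <- textTrain(
  x = harmony_embeddings$texts$harmonywords,            # text-level word embeddings
  y = Language_based_assessment_data_3_100$hilstotal    # numeric outcome to predict
)
# Cross-validated results, e.g., the correlation between observed and predicted scores
hils_model$results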
(See the textEmbedRawLayers() function for retrieving token-level embeddings and the output of individual layers.)
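A brief sketch of that function follows (the layers argument is an assumption about how to select which hidden layers to return):
# Sketch: retrieve token-level embeddings from individual hidden layers
# rather than aggregated text-level embeddings.
raw_layers <- textEmbedRawLayers(
  texts = "Hello, how are you doing?",
  model = "bert-base-uncased",
  layers = 11:12   # which hidden layers to return (assumed argument)
)
raw_layers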
HuggingFace contains many transformer language models that can be used for various language model tasks such as text classification, sentiment analysis, text generation, question answering, and translation. The text package contains user-friendly functions to access these tasks.
classifications <- textClassify("Hello, how are you doing?")
classifications
comment(classifications)
generated_text <- textGeneration("The meaning of life is")
generated_text
For more examples of available language model tasks, see functions such as textSum(), textQA(), textTranslate(), and textZeroShot() under Language Analysis Tasks.
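As one illustration, a zero-shot classification call might look like the sketch below (the argument names are assumed to mirror the underlying HuggingFace zero-shot pipeline):
# Sketch (assumed arguments): classify a sentence into arbitrary candidate labels
# without any task-specific training.
zero_shot <- textZeroShot(
  sequences = "I could not fall asleep last night.",
  candidate_labels = c("sleep problems", "happiness", "work stress")
)
zero_shot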
Visualization of words
The text package's word-plotting functions involve two steps: first pre-processing the data, and second plotting the words, including adjusting visual characteristics such as color and font size. To demonstrate these two functions we use example data included in the text package: Language_based_assessment_data_3_100. We show how to create a two-dimensional figure of the words individuals have used to describe their harmony in life, plotted according to two different well-being questionnaires: the Harmony in Life Scale and the Satisfaction with Life Scale. So the x-axis shows words related to low versus high Harmony in Life Scale scores, and the y-axis shows words related to low versus high Satisfaction with Life Scale scores.
# Transform the text data to BERT word embeddings
word_embeddings_bert <- textEmbed(Language_based_assessment_data_3_100,
aggregation_from_tokens_to_word_types = "mean",
keep_token_embeddings = FALSE)
# Pre-process the data for plotting
df_for_plotting <- textProjection(Language_based_assessment_data_3_100$harmonywords,
word_embeddings_bert$texts$harmonywords,
word_embeddings_bert$word_types,
Language_based_assessment_data_3_100$hilstotal,
Language_based_assessment_data_3_100$swlstotal
)
# Plot the data
plot_projection <- textProjectionPlot(
word_data = df_for_plotting,
y_axes = TRUE,
p_alpha = 0.05,
title_top = "Supervised Bicentroid Projection of Harmony in life words",
x_axes_label = "Low vs. High HILS score",
y_axes_label = "Low vs. High SWLS score",
p_adjust_method = "bonferroni",
points_without_words_size = 0.4,
points_without_words_alpha = 0.4
)
plot_projection$final_plot
This post has demonstrated how to perform state-of-the-art text analysis in R using the text package. The package intends to make it easy to access and use HuggingFace's transformer language models for analyzing natural language. We look forward to your feedback and contributions toward making these models more commonly used among R users in the social sciences and other applications.
- Bommasani et al. (2021). On the Opportunities and Risks of Foundation Models.
- Kjell et al. (2022). The text-package: An R-package for analyzing and visualizing human language using natural language processing and deep learning.
- Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
- Vaswani et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 5998-6008.
Corrections
If you see a mistake or want to suggest a change, please create an issue in the source repository.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Unless otherwise noted, source code can be found at https://github.com/OscarKjell/ai-blog. Illustrations reused from other sources do not fall under this license and can be recognized by the note “Illustration of…” in the caption.
Citation
To give attribution, please cite this work as follows:
Kjell, et al. (2022, Oct. 4). Posit AI Blog: Introducing the text package. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/
BibTeX citation
@misc{kjell2022introducing,
  author = {Kjell, Oscar and Giorgi, Salvatore and Schwartz, H Andrew},
  title = {Posit AI Blog: Introducing the text package},
  url = {https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/},
  year = {2022}
}