Virtual Room 4: Text-as-Data

Date: Tuesday, July 14, 2020, 12:00pm to 1:30pm

A comparison of methods in political science text classification: Transfer learning language models for politics

Zhanna Terechshenko, Fridolin Linder, Vishakh Padmakumar, Fengyuan Liu, Jonathan Nagler, Joshua A. Tucker and Richard Bonneau

Embedding Regression: Models for Context-Specific Description and Inference in Social Science

Brandon Stewart, Pedro Rodriguez and Arthur Spirling


Chair: Suzanna Linn (Penn State University)


Co-Host: Justin Savoie (University of Toronto)

A comparison of methods in political science text classification: Transfer learning language models for politics

Author(s): Zhanna Terechshenko, Fridolin Linder, Vishakh Padmakumar, Fengyuan Liu, Jonathan Nagler, Joshua A. Tucker and Richard Bonneau

Discussant: Leah Windsor (Institute for Intelligent Systems, University of Memphis)


Automated text classification has rapidly become an important tool for political analysis. Recent advances in natural language processing (NLP), driven by deep learning, now achieve state-of-the-art results on many standard tasks in the field. However, these methods require large amounts of both computing power and text data to learn the characteristics of the language, resources that are not always accessible to political scientists. One solution is a transfer learning approach, in which knowledge learned in one area or source task is transferred to another area or target task. A class of models that embodies this approach is language models, which achieve extremely high performance on multiple natural language understanding tasks. We investigate the feasibility and performance of these models in the political science domain by comparing multiple text classification methods, from classical bag-of-words methods to word embeddings and state-of-the-art transfer learning language models, on four datasets. First, we use labeled data from two corpora in the Comparative Agendas Project Database (CAP): bills and newspaper headlines. Second, we employ a collection of tweets about Hillary Clinton during the 2016 election campaign, labeled for each tweet's stance on Clinton. Lastly, we use a corpus of Wikipedia discussion page comments labeled for hate speech. These four types of text let us evaluate the methods on a wide variety of datasets potentially relevant to political science researchers. We find that RoBERTa and XLNet, language models built on the Transformer, a relatively novel neural network architecture, perform on par with, or outperform, traditional text classification methods while requiring fewer resources in both computing power and training data. Moreover, the gain in accuracy is likely to be especially large for small datasets, highlighting the potential of pretrained language models (the main type of transfer learning investigated here) to reduce the cost of supervised methods for political scientists. We argue, therefore, that transfer learning can reduce the cost of many text classification tasks for political scientists. In addition, we provide two accompanying software packages: one that allows applied researchers to quickly and efficiently compare many supervised learning approaches to their problem, and another that facilitates the use of RoBERTa, the language model that performed best in our analyses, for text classification.
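
The abstract does not name the two accompanying packages, so the sketch below instead illustrates the general workflow it describes, fine-tuning a pretrained RoBERTa model for classification, using the widely used Hugging Face transformers library; the toy texts, labels, and hyperparameters are illustrative placeholders, not the authors' setup.

```python
# Minimal sketch: fine-tuning pretrained RoBERTa for binary text
# classification (e.g. stance toward a candidate). Illustrative only;
# not the authors' accompanying packages.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Toy labeled corpus; in practice this would be e.g. CAP bills,
# newspaper headlines, or labeled tweets.
texts = ["I support the candidate.", "This policy is a disaster."]
labels = [1, 0]

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

# Tokenize to fixed-length tensors and batch them.
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(labels))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few epochs usually suffice when transferring
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    labels=y)
        out.loss.backward()  # cross-entropy loss from the classifier head
        optimizer.step()

# Classify new text.
model.eval()
with torch.no_grad():
    new = tokenizer(["Vote for her!"], return_tensors="pt")
    pred = model(**new).logits.argmax(dim=-1)
```

Because the model arrives pretrained on a large general corpus, only the comparatively small labeled dataset is needed for fine-tuning, which is the cost saving the abstract emphasizes.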

Embedding Regression: Models for Context-Specific Description and Inference in Social Science

Author(s): Brandon Stewart, Pedro Rodriguez and Arthur Spirling

Discussant: Max Goplerud (University of Pittsburgh)


Political scientists commonly seek to make statements about how a word's usage and meaning varies across contexts, whether time, partisan identity, or some other document-level covariate. A promising avenue is "word embeddings" that are specific to a domain and that simultaneously allow for statements of uncertainty and statistical inference. We introduce the "à la Carte on Text" embedding regression model (ConText regression model) for exactly this purpose. In particular, we extend and validate a simple model-based method of "retrofitting" pre-trained embeddings to local contexts that requires minimal input data and outperforms well-known competitors for studying changes in meaning across groups and time. Our approach allows us to speak descriptively of the "effects" of covariates on the way words are understood, and to assess whether a particular use is statistically significantly different from another. We provide experimental and observational evidence of the model's performance, along with open-source software.
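
As a rough illustration of the à la carte idea behind ConText, the numpy sketch below embeds each occurrence of a target word by averaging the pretrained vectors of its context words, applies a learned linear transform, and then regresses the resulting embeddings on a group covariate; the sentences, random "pretrained" vectors, identity transform A, and group labels are all placeholders, not the authors' open-source implementation.

```python
# Minimal sketch of a la carte instance embeddings plus embedding
# regression. Illustrative placeholders throughout.
import numpy as np

rng = np.random.default_rng(0)
dim = 50

sents = ["we must secure immigration at the border",
         "we should reform immigration with compassion now"]
groups = [1, 0]  # hypothetical partisan covariate
target = "immigration"

# Placeholder vectors and transform; the real method uses corpus-trained
# vectors (e.g. GloVe) and learns A by regressing known word vectors on
# their average context vectors (the "a la carte" step).
vocab = {w for s in sents for w in s.split()}
pretrained = {w: rng.standard_normal(dim) for w in vocab}
A = np.eye(dim)

def instance_embedding(tokens, i, window=3):
    """Context-specific embedding of tokens[i]: average the pretrained
    vectors of nearby words, then apply the learned transform A."""
    ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    return A @ np.mean([pretrained[w] for w in ctx], axis=0)

# One row of Y per occurrence of the target word; X holds an intercept
# and the document-level group covariate.
Y = np.vstack([instance_embedding(s.split(), s.split().index(target))
               for s in sents])
X = np.column_stack([np.ones(len(sents)), groups])

# Embedding regression: multivariate least squares of embeddings on
# covariates. The norm of the group coefficient row summarizes how much
# usage of the target word differs by group; uncertainty for such
# quantities can be obtained by resampling, as the abstract describes.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.linalg.norm(beta[1]))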

