Beyond Topics: Semi-Supervised Learning for Texts From a Measurement Perspective

Shiyao Liu (Massachusetts Institute of Technology)

Abstract: This project proposes a new methodological framework for using text data as a measurement in political science. Despite the abundance of text data now available, converting text into a measurement of a political concept remains a challenge, one that prevents wider application of text data to causal inference in political science research. The challenge persists because a researcher must either hand-code texts into measurement concepts or fall back on text models and estimate the causal effect of “topics”, hoping that the “topics” discovered ex post by topic models can be interpreted, sometimes in a rather forced manner, as the political concept of interest. Recent literature such as Miller et al. (2019) promotes semi-supervised learning, in which researchers need only label part of the text data with a measurement indicator and let the machine learn the pattern of human coding and label the rest. However, this literature treats the labelling process itself as a prediction problem. These algorithms therefore fail to take into account that such semi-automatically generated measurements will later be used to estimate a causal quantity. This failure can introduce measurement error and thus jeopardize the quality of the causal inference. Leveraging the literature on measurement error, this project fills the research gap between machine learning, text models, and causal inference by introducing an improved and holistic semi-supervised learning framework designed specifically for causal quantity estimation. The framework keeps the strengths of semi-supervised learning models: cost-effectiveness, transparency, and verifiability. Further, by correcting the measurement error with a new performance metric and a direct modelling process, it overcomes the failure of current semi-supervised learning algorithms to address these errors. Such measurement errors (or prediction errors, from a prediction perspective) exist by design, but they bias causal quantity estimation. Finally, by treating “topics” as labels, the method establishes the otherwise missing link between topics and systematized concepts. It would allow researchers to test political theories with text data more directly: researchers no longer have to match “topics” to their political theories, but instead determine the “topics” a priori from the underlying political theory and then test the theory.
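
The bias that prediction errors introduce into a causal estimate can be illustrated with a small simulation. The sketch below is not the framework proposed in this project; it is a minimal, hypothetical example in which a binary text label has a known effect on an outcome, and a classifier's predicted label flips the true label at random with a fixed error rate. The function name `simulate_attenuation` and all parameter values are assumptions chosen for illustration. Under these assumptions (balanced classes, symmetric errors unrelated to the outcome), the difference in means computed with predicted labels shrinks toward zero by a factor of roughly one minus twice the error rate.

```python
import numpy as np


def simulate_attenuation(n=200_000, beta=2.0, error_rate=0.2, seed=0):
    """Compare effect estimates using true labels vs. noisy predicted labels.

    All names and values are hypothetical; this is an illustration of
    classical misclassification bias, not the project's method.
    """
    rng = np.random.default_rng(seed)

    # True binary label for each document (balanced classes).
    true_label = rng.integers(0, 2, size=n)

    # Outcome depends on the true label with effect `beta`, plus noise.
    outcome = beta * true_label + rng.normal(0.0, 1.0, size=n)

    # A classifier that mislabels each document with probability
    # `error_rate`, independently of the outcome.
    flip = rng.random(n) < error_rate
    predicted_label = np.where(flip, 1 - true_label, true_label)

    def diff_in_means(label):
        # For a binary regressor this equals the OLS slope.
        return outcome[label == 1].mean() - outcome[label == 0].mean()

    return diff_in_means(true_label), diff_in_means(predicted_label)


if __name__ == "__main__":
    effect_true, effect_pred = simulate_attenuation()
    print(f"estimate with true labels:      {effect_true:.3f}")
    print(f"estimate with predicted labels: {effect_pred:.3f}")
```

Even a classifier with 80 percent accuracy, which may look acceptable as a prediction result, yields a markedly attenuated estimate in this setup. This is the sense in which treating labelling purely as a prediction problem can mislead downstream causal analysis.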
