Matthew Tyler (Stanford University)
Abstract: Researchers are often tasked with applying subjective or contested labels to objects such as text and images. For example, researchers might hire coders to label the ideological slant of news articles. Two typical coding workflows in political science, traditional small-team coding and crowd-sourced coding, are usually carried out in an ad hoc way. I show that both can be greatly improved with statistical coding models. In particular, I show that the predicted label probabilities from these models capture the uncertainty of the coding process better than alternatives like voting, and are more likely to lead to unbiased regression estimators. While some of the coding models I describe are adapted from biostatistics and computer science, I also introduce the hierarchical Dirichlet Dawid-Skene (HDDS) model, which is designed for increasingly common crowd-sourced applications. After describing these models and their advantages, I demonstrate them on recent political science coding projects.