Andrew Halterman (Massachusetts Institute of Technology)
Abstract: This paper introduces a method that automatically extracts political events from text using grammatical parsing and machine learning. Much of the scientifically useful information about what political actors are doing is locked away in text. To extract this information as data, social scientists generally resort to laboriously hand-coding political events from newspaper reports, encyclopedias, NGO reports, and government documents. Recent innovations in text analysis (e.g., improved topic models) have not yet superseded hand-coding because they are meant for summarizing documents rather than extracting specific information. Existing methods for extracting specific information have not caught on because they require enormous up-front effort to tailor to new event types (e.g. Raytheon BBN Technologies 2015; Norris, Schrodt, and Beieler 2017), extract events that are irrelevant to political science (Gildea and Jurafsky 2002; Carreras and Màrquez 2005; Palmer, Gildea, and Xue 2010), or have little to no ability to inductively learn event types from text. Natural language processing techniques for event extraction are improving rapidly (e.g. Keith et al. 2017; Marcheggiani and Titov 2017; FitzGerald et al. 2018), but often do not extract the event types that are of interest to social scientists. I introduce a new technique that enables researchers to automatically extract event information from text and then aggregate similar events. The method generalizes across domains without the need for retraining, does not require hand-crafted dictionaries, can be used to extract information on different types of events, and can be applied to a wide range of documents. I take a ``slot filling'' approach that extracts the words from a sentence corresponding to each politically relevant ``slot'' in a description of an event: the actors, recipients, means, causes, times, and locations of events.
To fill these slots, I combine a rule-based system that uses the grammatical structure of the sentence with machine learning models. Grammatical information enters the model through a grammatical dependency parse, which provides useful information about the relationship of actors and objects in a sentence. Grammar alone cannot resolve all text to slots, however: in the sentences "Trump fired missiles" and "Trump fired Tillerson", "missiles" and "Tillerson" play identical grammatical roles, but one provides information on how an action is carried out while the other reports a recipient of an action. I use machine learning models (word embeddings and a neural network classifier) to add the semantic information necessary to resolve these ambiguities. I provide a software package implementing this approach. I use this technique to resolve an ongoing debate on whether respect for human rights has improved over time. The debate hinges on contradictory interpretations of an existing data set, so new data are needed to break the impasse. I produce new, disaggregated data on the specific acts of human rights abuses reported by the State Department in their monitoring documents over time. I offer clear evidence that the contents of reporting are changing over time, and suggestive evidence that the threshold for inclusion is changing as well.
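The "fired missiles"/"fired Tillerson" ambiguity can be sketched in a few lines. In both sentences the noun is the direct object of the verb, so the dependency parse alone cannot assign it to a slot; a semantic classifier over word embeddings can. The sketch below is purely illustrative, not the paper's implementation: the 3-dimensional vectors and slot prototypes are hand-made stand-ins for pretrained embeddings and a trained neural classifier, and all names (`EMBED`, `PROTOTYPES`, `fill_slot`) are hypothetical.

```python
# Hypothetical toy "embeddings" standing in for real pretrained word
# vectors (the paper uses word embeddings plus a neural classifier).
EMBED = {
    "missiles":  (0.9, 0.1, 0.0),   # weapon-like
    "rockets":   (0.8, 0.2, 0.0),
    "Tillerson": (0.0, 0.9, 0.1),   # person-like
    "Comey":     (0.1, 0.8, 0.1),
}

# Hand-made slot prototypes; a real system would learn these from data.
PROTOTYPES = {
    "means":     (0.85, 0.15, 0.0),
    "recipient": (0.05, 0.85, 0.1),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def fill_slot(word):
    """Assign a verb's direct object to the semantically nearest slot.

    Grammar labels both "missiles" and "Tillerson" as direct objects of
    "fired"; only their embeddings distinguish means from recipient.
    """
    vec = EMBED[word]
    return max(PROTOTYPES, key=lambda slot: cosine(vec, PROTOTYPES[slot]))

print(fill_slot("missiles"))    # -> means
print(fill_slot("Tillerson"))   # -> recipient
```

The design point is that the parser supplies the candidate span (the direct object) and the classifier supplies the slot label, which is the division of labor the abstract describes.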