Clustering Large-Scale Ballot Data With Varying Choice Sets

Shiro Kuriwaki (Harvard University)

Abstract: Election scholars increasingly analyze large cast vote records (ballot image logs) to measure ticket splitting and ideological coherence in actual voter behavior. Election administrators also store cast vote records to detect election fraud and audit results. Although clustering methods are a common tool for summarizing such large, high-dimensional data, existing methods are largely designed for continuous outcomes and cannot account for choice sets that vary across individuals and features. I propose a clustering model that overcomes both of these shortcomings and can be implemented as an Expectation-Maximization (EM) algorithm. I adapt the classic finite mixture model to use a fast multinomial logistic regression for categorical outcomes, relying on an independence of irrelevant alternatives (IIA) assumption to account for varying choice sets. I implement the EM algorithm in an R package, clusterCVR, with a C++ back-end for notable speedups. I illustrate the algorithm on over 6 million cast vote records obtained from South Carolina and show that voters can be classified into three distinct profiles. These profiles reveal that, contrary to theories of party heuristics, voters who are inclined to split their ticket are more likely to do so in state and local races.
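
To make the role of the IIA assumption concrete, the following is a minimal sketch of an E-step for a finite mixture over categorical votes with varying choice sets. It is not the clusterCVR implementation: the function names (`choice_prob`, `e_step`), the data layout, and the toy numbers are all assumptions made for illustration. The one idea it shows is that, under IIA, the probability of a vote is a softmax restricted to the options actually available on that voter's ballot, so races with different choice sets can share the same cluster-level parameters.

```python
import math

def choice_prob(logits, chosen, available):
    """P(chosen | available) under IIA: a softmax renormalized over
    only the options that appear on this voter's ballot."""
    m = max(logits[l] for l in available)  # subtract the max for stability
    denom = sum(math.exp(logits[l] - m) for l in available)
    return math.exp(logits[chosen] - m) / denom

def e_step(ballots, weights, logits):
    """Posterior cluster membership for each voter (hypothetical layout).

    ballots : list of voters; each voter is a list of
              (race, chosen, available) tuples for the races on the ballot
    weights : weights[k] is the mixing proportion of cluster k
    logits  : logits[k][race] is a list of option logits for cluster k
    """
    resp = []
    for ballot in ballots:
        log_post = []
        for k, pi_k in enumerate(weights):
            lp = math.log(pi_k)
            for race, chosen, available in ballot:
                lp += math.log(choice_prob(logits[k][race], chosen, available))
            log_post.append(lp)
        # Normalize in log space so long ballots do not underflow.
        m = max(log_post)
        unnorm = [math.exp(v - m) for v in log_post]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp

# Toy data (invented): two clusters, two races, three possible options.
# The second voter's ballot lacks option 2 in race 0 and omits race 1.
toy_logits = [
    [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]],  # cluster 0 favors option 0
    [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]],  # cluster 1 favors option 1
]
toy_ballots = [
    [(0, 0, [0, 1, 2]), (1, 0, [0, 1, 2])],
    [(0, 1, [0, 1])],
]
toy_resp = e_step(toy_ballots, [0.5, 0.5], toy_logits)
```

A full EM fit would alternate this step with an M-step that updates the mixing proportions and the cluster-level logits. With varying choice sets that update is generally not a simple weighted average, which is one reason a regression-style M-step is a natural fit.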
