Fabricio Vasselai (University of Michigan)
Abstract: The main advantage of using Supervised Machine Learning (SML) techniques to detect election fraud is that they allow model-free or model-ensemble approaches in place of the usual model-specific (often parametric) statistical tools. However, the inherent shortage of ground-truth data on election fraud makes it hard to gather training data for the learning algorithm. The workaround is to generate synthetic training data. Two approaches have been proposed in recent years. Cantú and Saiegh (2011) used Monte Carlo generation of integer-valued random variables (representing votes), followed by manual transfer of votes between random variables, in order to simulate non-adherence to Benford’s Law. More recently, Zhang, Alvarez and Levin (2019) fitted a hierarchical regression to the test data and used it to predict clean and tampered fictitious election results, which then served as synthetic training data. Here, a new option is explored. Instead of inferring a parametric approximation of the data-generating process (DGP) that underlies the election results under scrutiny, the relevant aspects of that DGP are simulated directly and explicitly. Specifically, election results are simulated using a novel computational Multi-Agent System of elections (MASE) based on the analytical models of strategic voting in Myerson and Weber (1993) and Cox (1994), extended to also allow for strategic abstention, which is modeled following ideas from Palfrey and Rosenthal (1985) and Demichelis and Dhillon (2010). To make computations feasible, novel derivations of pivotal probabilities (and algorithms to compute them) are proposed under Myerson’s (1998) framework of elections as Poisson games. In the proposed MASE, the utility of each candidate to each elector agent is drawn at random from user-defined distributions (by default, a Beta distribution whose hyper-parameters follow uniform distributions).
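As a rough illustration of the default utility draw just described, the sketch below samples one utility per elector-candidate pair from a Beta distribution whose two shape parameters are themselves drawn uniformly. It is not the MASE code: the function names, the bounds in `shape_range`, and the choice to draw one pair of hyper-parameters per candidate are all assumptions made here for illustration. The `sincere_votes` helper only shows what the utilities imply when every elector votes sincerely; it does not compute a strategic equilibrium.

```python
import random


def draw_utilities(n_electors, n_candidates, rng, shape_range=(0.5, 5.0)):
    """Draw a utility in [0, 1] for every (elector, candidate) pair.

    Assumption: each candidate gets its own Beta(a, b), with a and b drawn
    uniformly from shape_range. The abstract does not state the bounds or
    the level at which the hyper-parameters vary.
    """
    utilities = [[0.0] * n_candidates for _ in range(n_electors)]
    for c in range(n_candidates):
        a = rng.uniform(*shape_range)  # first Beta shape parameter
        b = rng.uniform(*shape_range)  # second Beta shape parameter
        for e in range(n_electors):
            utilities[e][c] = rng.betavariate(a, b)
    return utilities


def sincere_votes(utilities):
    """Count votes when every elector picks their highest-utility candidate."""
    counts = [0] * len(utilities[0])
    for row in utilities:
        counts[row.index(max(row))] += 1
    return counts


rng = random.Random(42)  # fixed seed so the sketch is reproducible
u = draw_utilities(n_electors=500, n_candidates=3, rng=rng)
print(sincere_votes(u))
```

Changing the seed changes the hyper-parameters and therefore the whole profile of utilities, which is the source of run-to-run variation in this sketch.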
Because these utilities are redrawn at each run, MASE yields different election results at each re-run. Other parameters can also be set, such as the number of electors, the number of candidates, the percentage of electors who never abstain and the percentage of electors who vote sincerely. This approach to generating training data offers a few advantages. First, it is non-parametric and highly flexible: the researcher can fine-tune the simulation in arbitrarily complex ways (e.g. adapting the model to the specifics of electoral rules, imposing specific turnout levels, etc.). Second, it is stochastic: because model re-runs start from randomized parameter values, it becomes feasible to bootstrap the classification over variation in the training data generation. Third and most importantly, it generates training data that formally differentiate electors’ strategic behavior from ballot fraud. This matters because Mebane, Baltz and Vasselai (2019) confirm Mebane’s (2016) suspicion that most established fraud-detection tools can produce false positives when misled by legitimate strategic behavior. Synthetic data are generated as follows. MASE is re-run thousands of times, with parameters randomized within ranges given by the real test data (e.g. the numbers of electors and candidates are bounded by the real minimum and maximum across polling stations in the test data). Ballot-box stuffing or vote stealing is then applied to a random selection of the equilibrium results. A supervised learner is trained on these data and then used to classify polling stations as suspicious of election fraud or not, in several different test data sets: (a) other MASE-simulated data, for theoretical validation, and (b) real-life polling stations from elections held under electoral rules compatible with those in Cox (1994) and for which a ground-truth fraud assessment from official authorities is available, such as Argentina-1936, Mexico-1997 and North Carolina-2018.
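The generation-and-tampering step can be sketched as below, under strong simplifying assumptions. The clean results come from a hypothetical placeholder generator (`placeholder_clean_result`), not from MASE. The fraud rate, the tampering shares, and the way stuffing and stealing are defined here (abstentions converted into votes, and rival votes moved to a beneficiary) are illustrative choices that the abstract does not specify.

```python
import random


def placeholder_clean_result(rng, min_electors, max_electors, n_candidates):
    """Stand-in for one simulation run (NOT the MASE model)."""
    n_electors = rng.randint(min_electors, max_electors)
    turnout = rng.uniform(0.4, 0.9)  # assumed turnout range
    weights = [rng.random() for _ in range(n_candidates)]
    votes = [0] * n_candidates
    voters = int(n_electors * turnout)
    for choice in rng.choices(range(n_candidates), weights=weights, k=voters):
        votes[choice] += 1
    return {"electors": n_electors, "votes": votes}


def stuff_ballots(station, beneficiary, share):
    """Ballot-box stuffing: turn a share of abstentions into beneficiary votes."""
    abstentions = station["electors"] - sum(station["votes"])
    votes = list(station["votes"])
    votes[beneficiary] += int(abstentions * share)
    return {"electors": station["electors"], "votes": votes}


def steal_votes(station, beneficiary, share):
    """Vote stealing: move a share of each rival's votes to the beneficiary."""
    votes = list(station["votes"])
    for c in range(len(votes)):
        if c != beneficiary:
            moved = int(votes[c] * share)
            votes[c] -= moved
            votes[beneficiary] += moved
    return {"electors": station["electors"], "votes": votes}


def make_training_set(n_runs, fraud_rate, rng,
                      min_electors=200, max_electors=800, n_candidates=3):
    """Return (station, label) pairs; a random subset is tampered with."""
    rows = []
    for _ in range(n_runs):
        station = placeholder_clean_result(
            rng, min_electors, max_electors, n_candidates)
        label = "clean"
        if rng.random() < fraud_rate:
            share = rng.uniform(0.1, 0.5)  # assumed tampering intensity
            if rng.random() < 0.5:
                station, label = stuff_ballots(station, 0, share), "stuffing"
            else:
                station, label = steal_votes(station, 0, share), "stealing"
        rows.append((station, label))
    return rows


rng = random.Random(0)
data = make_training_set(n_runs=1000, fraud_rate=0.3, rng=rng)
print({lab: sum(1 for _, l in data if l == lab)
       for lab in ("clean", "stuffing", "stealing")})
```

In a fuller pipeline, `min_electors`, `max_electors` and `n_candidates` would be read off the real test data, and the labelled rows would be converted to features and passed to a supervised learner.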
Results with different learners are discussed, with a focus on non-parametric, non-linear and model-ensemble options (such as Random Forests, SVM and Deep Neural Networks). The accuracy of the fraud classifications is presented for different specifications and parameters, and ways to further increase accuracy in the future are discussed. Initial attempts at differentiating types of fraud are also discussed. Finally, future work on bootstrapping the classifications to obtain error bounds around them is introduced.