Estimating Population Quantities From Multiple Data Sources Using the Structural Tensor Factorization

Soichiro Yamauchi (Harvard University)

Abstract: Estimating population quantities such as public opinion from survey data is a fundamental task in many social science studies. In political science, there is growing interest in estimating public opinion at levels smaller than the entire nation, such as states (Lax and Phillips 2019), cities (Tausanovitch and Warshaw 2013), or congressional districts (Warshaw and Rodden 2012). Yet, because of non-probability samples and small sample sizes, it is often challenging to estimate such quantities with high accuracy. The majority of studies rely on a method called multilevel regression and post-stratification (MRP) (Park et al. 2004). This method predicts the outcome by borrowing information from other geographical units while adjusting for covariate imbalances between the survey and the population. However, it requires researchers to specify a separate, complex Bayesian regression model for each outcome of interest, and it is difficult to incorporate information from other data sources. The procedure is therefore not only time-consuming and tedious but also inefficient in terms of estimation accuracy, since it cannot exploit all available information. For example, separate from survey data, voter files in the US provide valuable information about the relationship between age, gender, and turnout history, which is correlated with preferences on many political issues. MRP, however, cannot incorporate this auxiliary information in a principled manner. The main goal of this project is to develop a method that enables scholars to estimate such population quantities in a single step, combining all the information available to them. Specifically, I propose a Bayesian multi-level, multi-source structural tensor decomposition model for discrete variables.
The model imputes distributions of outcomes in the population by efficiently learning the underlying correlation structure of the variables and the heterogeneity across data sources and geographical clusters. Compared to existing methods such as MRP, the proposed method has several advantages. First, it can estimate any conditional or joint quantity in a single step. Existing methods require several steps to obtain conditional distributions because they can estimate only one quantity at a time; the proposed approach avoids this problem by modeling the joint distribution of all variables. Second, it allows researchers to combine multiple data sources and auxiliary information when estimating the quantities of interest in the target area. In contrast to existing methods, which rely on a single survey, the proposed method can combine multiple surveys and other population-level data (e.g., voter files) by explicitly modeling heterogeneity across data sources. In addition, it leverages variables that are not observed in population data (e.g., the Census) to infer heterogeneity across geographical clusters and data sources, whereas MRP is constrained to use only variables observed in both the survey and the population (i.e., basic demographics). Lastly, the proposed method does not require researchers to specify a model for each outcome. The tensor factorization fully exploits the discrete nature of the variables, which enables non-parametric estimation by implicitly incorporating all interactions among variables. I assess the performance of the method by conducting a validation study in which the estimated population quantities are compared against the ground truth.
Specifically, I plan to use public information on turnout (in all states) and party registration (in Florida) to validate the accuracy of estimated turnout among registered voters and partisans at the congressional district level. This validation study allows me to compare the performance of the proposed method against existing approaches. I will then apply the proposed method to study public opinion in the United States, examining the heterogeneity of opinion on political issues among partisans across congressional districts. The study relies on the Cooperative Congressional Election Study (CCES), a large-scale national political survey. The results of this empirical application speak to the study of representation in American politics.
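
To make concrete the idea that modeling the joint distribution yields any marginal or conditional quantity in one step, the following is a minimal sketch only. It assumes a latent-class (PARAFAC-style) factorization of a probability tensor over three discrete variables; the abstract does not state the exact form of the factorization, so this form, the variable names, the number of latent classes, and the randomly drawn parameters and cell counts are all hypothetical stand-ins rather than the proposed model or real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the abstract): three discrete variables with
# 2, 3, and 2 categories (say gender, age group, turnout) and K latent classes.
levels = (2, 3, 2)
K = 4

# Mixture weights over latent classes; they sum to 1.
weights = rng.dirichlet(np.ones(K))
# One table per variable: row k is that variable's distribution in class k.
factors = [rng.dirichlet(np.ones(d), size=K) for d in levels]

# Joint tensor: P(x1, x2, x3) = sum_k w_k * prod_j factors[j][k, x_j].
joint = np.einsum("k,ka,kb,kc->abc", weights, *factors)

# Marginals and conditionals follow by summing and normalizing the joint.
p_turnout = joint.sum(axis=(0, 1))               # P(x3)
p_cell = joint.sum(axis=2, keepdims=True)        # P(x1, x2)
p_turnout_given_cell = joint / p_cell            # P(x3 | x1, x2)

# Post-stratification: weight each demographic cell's conditional
# distribution by made-up population counts for that cell.
cell_counts = rng.integers(50, 500, size=levels[:2])
poststratified = (
    (cell_counts[..., None] * p_turnout_given_cell).sum(axis=(0, 1))
    / cell_counts.sum()
)
```

In this sketch the parameters are drawn at random; in an actual analysis they would be estimated from the survey and auxiliary data, and the same summing and normalizing steps would then apply to the fitted joint distribution.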
