
Friday, November 16, 2018 - 2:00pm

Edgar Dobriban

University of Pennsylvania, Statistics Department

Location

University of Pennsylvania

A6 DRL

Modern massive datasets pose an enormous computational burden to practitioners. Distributed computation has emerged as a universal approach to ease the burden: datasets are partitioned across machines, which compute locally and communicate only short messages. Distributed data also arises for privacy reasons, such as with medical databases. It is therefore important to study how to do statistical inference and machine learning in a distributed setting. In this talk, we present results about one-step parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine and take a weighted average of the parameters. How much do we lose compared to doing linear regression on the full data? We study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We discover several key phenomena. First, averaging is not optimal, and we find the exact performance loss. Second, different problems are affected differently by the distributed framework: estimation error and confidence interval length increase substantially, while prediction error increases much less. These results are supported by numerical simulations and a data analysis example. To derive these results, we rely on recent results from random matrix theory, where we also develop a new calculus of deterministic equivalents as a tool of broader interest.
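To make the setup concrete, here is a minimal sketch of one-step parameter averaging for distributed linear regression. All names, dimensions, and the choice of equal splits and equal weights are illustrative assumptions (the talk studies general weighted averages), not the speaker's actual experimental setup.

```python
# Minimal sketch: one-step averaging of per-machine OLS estimates,
# assuming k machines each hold an equal, disjoint share of the data.
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 6000, 200, 6            # total samples, parameters, machines (hypothetical values)
beta = rng.normal(size=p)         # true coefficients
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

# Baseline: ordinary least squares on the full data.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Distributed estimate: solve OLS locally on each machine, then average
# the local coefficient vectors. Equal weights are used here for simplicity.
splits = np.array_split(np.arange(n), k)
local_betas = [np.linalg.lstsq(X[idx], y[idx], rcond=None)[0] for idx in splits]
beta_avg = np.mean(local_betas, axis=0)

print("estimation error, full data:", np.linalg.norm(beta_full - beta))
print("estimation error, averaged :", np.linalg.norm(beta_avg - beta))
```

Running such a comparison in the high-dimensional regime (p comparable to n/k) illustrates the gap the talk quantifies: the averaged estimator's estimation error exceeds that of full-data regression, while its prediction error degrades far less.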