Future challenges in statistical scalability
Monday 25th June 2018 to Friday 29th June 2018
09:45 to 10:20  Registration  
10:20 to 10:50  Morning Coffee  
10:50 to 11:00  Welcome from Christie Marr (INI Deputy Director)  
11:00 to 11:45 
Iain Johnstone (Stanford University) tba 
INI 1 
11:45 to 12:30 
Philippe Rigollet (Massachusetts Institute of Technology) tba 
INI 1 
12:30 to 14:00  Buffet Lunch at INI  
14:00 to 14:45 
Matthew Stephens (University of Chicago) On applications of Empirical Bayes approaches to the Normal Means problem
The normal means problem is very simple: given normallydistributed observations with known variances and unknown means, estimate the means. That is, given X_j \sim N(\theta_j, \sigma_j^2, estimate \theta_j. A key idea is that one can do better than the maximum likelihood estimates, \hat{\theta}_j= \X_j, in particular by use of appropriate "shrinkage" estimators. One attractive way to perform shrinkage estimation in practice is to use Empirical Bayes methods. That is, to assume that \theta_j are independent and identically distributed from some distribution g that is to be estimated from the data. Then, given such an estimate \hat{g}, the posterior distributions of \theta_j can be computed to perform inference. We call this the "Empirical Bayes Normal Means" (EBNM) problem. Despite its simplicity, solving the EBNM problem has a wide range of practical applications. Here we present some flexible nonparametric approaches we have recently developed for solving the EBNM problem, and describe their application to several different settings: false discovery rate (FDR) estimation, nonparametric smoothing, and sparse matrix factorization problems (ie sparse factor analysis and sparse principal components analysis). 
INI 1 
14:45 to 15:30 
Jana Jankova (University of Cambridge) tba 
INI 1 
15:30 to 16:00  Afternoon Tea  
16:00 to 16:45 
Flori Bunea (Cornell University) A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics
Topic models have become popular for the analysis of data that consists in a collection of n independent multinomial observations. The model links all cell probabilities, collected in a p x n matrix, via the assumption that this matrix can be factorized as the product of two nonnegative matrices A and W. Topic models have been originally developed in text mining, when one browses through n documents, based on a dictionary of p words, and covering K topics. In this terminology, the matrix A is called the wordtopic matrix, and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined, under appropriate separability assumptions, discussed in this talk. Notably, the unique A is required to satisfy what is commonly known as the anchor word assumption, under which A has an unknown number of rows respectively proportional to Kdimensional canonical basis vectors. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, utilize constructively this assumption by linking the estimation of the set of anchor words with that of estimating the K vertices of a simplex. This crucial step in the estimation of A requires K to be known, and cannot be easily extended to the more realistic setup when K is unknown.
In this talk, we take a different view on anchor word estimation, and on the estimation of the wordtopic matrix A. We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any n, p, K and document length N, and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We illustrate, via simulations, that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.

INI 1 
17:00 to 18:00  Welcome Wine Reception & Posters at INI 
09:00 to 09:45 
Guy Bresler (Massachusetts Institute of Technology) Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure
The prototypical highdimensional statistics problem entails finding a structured signal in noise. Many of these problems exhibit a curious phenomenon: the amount of data needed by all known computationally efficient algorithms far exceeds what is needed for inefficient algorithms that search over all possible structures. A line of work initiated by Berthet and Rigollet in 2013 has aimed to explain these gaps by reducing from conjecturally hard problems in computer science. However, the delicate nature of averagecase reductions has limited the applicability of this approach. In this work we introduce several new techniques to give a web of averagecase reductions showing strong computational lower bounds based on the planted clique conjecture. These include tight lower bounds for Planted Independent Set, Planted Dense Subgraph, Sparse Spiked Wigner, Sparse PCA, as well as for new models that we introduce. Joint work with Matthew Brennan and Wasim Huleihel.

INI 1 
09:45 to 10:30 
Chao Gao (University of Chicago) tba 
INI 1 
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
Ryan Tibshirani (Carnegie Mellon University) tba 
INI 1 
11:45 to 12:30 
Peter Bickel (University of California, Berkeley) Two network scale challenges:Constructing and fitting hierarchical block models and fitting large block models using the mean field method
Work with S.Bhattacharyya,T.Li,E.Levina,S.Mukherjee,P.Sarkar
Networks are a complex type of structure presenting itself in many applications . They are usually represented by a graph ,with possibly weighted edges plus additional covariates (such as directions).Block models have been studied for some time as basic approximations to ergodic stationary probability models for single graphs.A huge number of fitting methods have been developed for these models some of which we will touch on.
The mean field method in which an increasing number of parameters must be fitted is used not only for multiple membership block models but also in applications such as LDA.if the graph is too large poor behaviour of the method can be seen.. We have developed what we call "patch methods " for fitting which help both computationally and inferentially in such situations bur much further analysis is needed.
It is intuitively clear but mathematically unclear how knowledge of the model having nested scales helps in fitting large scale as opposed to small scale parameters.We will discuss this issue through an example,

INI 1 
12:30 to 14:00  Buffet Lunch at INI  
14:00 to 14:45 
Marten Herman Wegkamp (Cornell University) Adaptive estimation of the rank of the regression coefficient matrix
We consider the multivariate response regression problem with a regression coefficient matrix of low, unknown rank. In this setting, we analyze a new criterion for selecting the optimal reduced rank. This criterion does not require estimation of the unknown variance of the noise, nor depends on a delicate choice of a tuning parameter. We develop an iterative, fully datadriven procedure, that adapts to the optimal signaltonoise ratio. This procedure finds the true rank in a few steps with overwhelming probability. At each step, our estimate increases, while at the same time it does not exceed the true rank. Our finite sample results hold for any sample size and any dimension, even when the number of responses and of covariates grow much faster than the number of observations. An extensive simulation study that confirms our theoretical findings in both low and high dimensional settings.
This is joint work with Xin Bing.

INI 1 
14:45 to 15:30 
Maryam Fazel (University of Washington) tba 
INI 1 
15:30 to 16:00  Afternoon Tea  
16:00 to 16:45 
Alex d'Aspremont (CNRS  Ecole Normale Superieure Paris) An Approximate ShapleyFolkman Theorem.
with Thomas Kerdreux and Igor Colin.
The ShapleyFolkman theorem shows that Minkowski averages of uniformly bounded sets tend to be convex when the number of terms in the sum becomes much larger than the ambient dimension. In optimization, \citet{Aubi76} show that this produces an a priori bound on the duality gap of separable nonconvex optimization problems involving finite sums. This bound is highly conservative and depends on unstable quantities, and we relax it in several directions to show that non convexity can have a much milder impact on finite sum minimization problems such as empirical risk minimization and multitask classification. As a byproduct, we show a new version of Maurey's classical approximate Carath\'eodory lemma where we sample a significant fraction of the coefficients, without replacement.

INI 1 
09:00 to 09:45 
Urvashi Oswal (University of WisconsinMadison) tba 
INI 1 
09:45 to 10:30 
Elizaveta Levina (University of Michigan) Matrix completion in network analysis
Matrix completion is an active area of research in itself, and a natural tool to apply to network data, since many real networks are observed incompletely and/or with noise. However, developing matrix completion algorithms for networks requires taking into account the network structure. This talk will discuss three examples of matrix completion used for network tasks. First, we discuss the use of matrix completion for crossvalidation on network data, a longstanding problem in network analysis. Two other examples focus on reconstructing incompletely observed networks, with structured missingness resulting from network sampling mechanisms. One scenario we consider is egocentric sampling, where a set of nodes is selected first and then their connections to the entire network are observed. Another scenario focuses on data from surveys, where people are asked to name a given number of friends. We show that matrix completion, when used appropriately, can generally be very helpful in solving network problems.

INI 1 
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
Garvesh Raskutti (University of WisconsinMadison) tba 
INI 1 
11:45 to 12:30 
Peter Bartlett (University of California, Berkeley) Representation, optimization and generalization properties of deep neural networks
Deep neural networks have improved the stateoftheart performance for prediction problems across an impressive range of application areas. This talk describes some recent results in three directions. First, we investigate the impact of depth on representational properties of deep residual networks, which compute nearidentity maps at each layer, showing how their representational power improves with depth and that the functional optimization landscape has the desirable property that stationary points are optimal. Second, we study the implications for optimization in deep linear networks, showing how the success of a family of gradient descent algorithms that regularize towards the identity function depends on a positivity condition of the regression function. Third, we consider how the performance of deep networks on training data compares to their predictive accuracy, we demonstrate deviation bounds that scale with a certain "spectral complexity," and we compare the behavior of these bounds with the observed performance of these networks in practical problems. Joint work with Steve Evans, Dylan Foster, Dave Helmbold, Phil Long, and Matus Telgarsky. 
INI 1 
12:30 to 14:00  Buffet Lunch at INI  
14:00 to 17:00  Free Afternoon 
09:00 to 09:45 
Tong Zhang (Rutgers, The State University of New Jersey) tba 
INI 1 
09:45 to 10:30 
Aurore Delaigle (University of Melbourne) tba 
INI 1 
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
Francis Bach (INRIA Paris  Rocquencourt); (ENS  Paris) Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
We consider stochastic gradient descent (SGD) for
leastsquares regression with potentially several passes over the data. While
several passes have been widely reported to perform practically better in terms
of predictive performance on unseen data, the existing theoretical analysis of
SGD suggests that a single pass is statistically optimal. While this is true
for lowdimensional easy problems, we show that for hard problems, multiple
passes lead to statistically optimal predictions while single pass does not; we
also show that in these hard models, the optimal number of passes over the data
increases with sample size. In order to define the notion of hardness and show
that our predictive performances are optimal, we consider potentially
infinitedimensional models and notions typically associated to kernel methods,
namely, the decay of eigenvalues of the covariance matrix of the features and
the complexity of the optimal predictor as measured through the covariance
matrix. We illustrate our results on synthetic experiments with nonlinear
kernel methods and on a classical benchmark with a linear model. (Joint work
with Loucas PillaudVivien and Alessandro Rudi)

INI 1 
11:45 to 12:30 
Edward Ionides (University of Michigan) Monte Carlo adjusted profile likelihood, with applications to spatiotemporal and phylodynamic inference.
Partially observed nonlinear stochastic dynamic systems raise inference challenges. Sequential Monte Carlo (SMC) methods provide a route to accessing the likelihood function. However, despite the advantage of applicability to a wide class of nonlinear models, standard SMC methods have a limitation that they scale poorly to large systems. We present a profile likelihood approach, properly adjusted for Monte Carlo uncertainty, that enables likelihoodbased inference in systems for which Monte Carlo error remains large despite stretching the limits of available computational resources. Together with stateoftheart SMC algorithms, this technique permits effective inference on some scientific problems in panel time series analysis, spatiotemporal modeling, and inferring population dynamic models from genetic sequence data.
The results presented are joint work with Carles Breto, Joonha Park, Alex Smith and Aaron King.

INI 1 
12:30 to 14:00  Buffet Lunch at INI  
14:00 to 14:45 
Jinchi Lv (University of Southern California) Asymptotics of Eigenvectors and Eigenvalues for Large Structured Random Matrices
Characterizing the exact asymptotic distributions of highdimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. This paper establishes the asymptotic properties of the spiked eigenvectors and eigenvalues for the generalized Wigner random matrix, where the mean matrix is assumed to have a lowrank structure. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we provide novel asymptotic expansions for the general linear combination and further show that the linear combination is asymptotically normal after some normalization, where the weight vector can be an arbitrary unit vector. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these largescale applications. This is a joint work with Jianqing Fan, Yingying Fan and Xiao Han.

INI 1 
14:45 to 15:30 
Rui Castro (Technische Universiteit Eindhoven) Are there needles in a (moving) haystack? Adaptive sensing for detection and estimation of static and dynamically evolving signals
In this work we investigate the problem of testing for the presence of dynamically evolving sparse signals. This is motivated by settings were the signal of interest changes rapidly while measurements are being collected (e.g., monitoring of the electromagnetic spectrum, or detection of hot spots of a rapidly spreading disease). We model the signal as an n dimensional vector that has s nonzero components, and at each time a fraction of the nonzero components may change their location. The statistical problem is to decide whether the signal is a zero vector or in fact it has nonzero components. This decision is based on m noisy observations of individual signal components. We consider two different sensing paradigms, namely adaptive and nonadaptive sensing. For nonadaptive sensing the choice of components to measure has to be decided before the data collection process started, while for adaptive sensing one can adjust the sensing process based on observations collected earlier. We characterize the difficulty of this detection problem in both sensing paradigms, with special interest to the speed of change of the active components. In addition we provide an adaptive sensing algorithm for this problem and contrast its performance to that of best nonadaptive detection algorithms. Interestingly, when the number of measurements is small (on the order n/s), there is a significant difference in performance between the two sensing paradigms, as nonadaptive sensing is unable to exploit the dynamic nature of the signal (based on joint works with Ervin Tánczos).

INI 1 
15:30 to 16:00  Afternoon Tea  
16:00 to 16:45 
Pradeep Ravikumar (Carnegie Mellon University) Robust Estimation via Robust Gradient Estimation
A common assumption in the training of machine learning systems is that the data is sufficiently clean and wellbehaved: there are very few or no outliers, or that the distribution of the data does not have very long tails. As machine learning finds wider usage, these assumptions are increasingly indefensible. The key question then is how to perform estimation that is robust to departure from these assumptions; and which has been of classical interest, with seminal contributions due to Box, Tukey, Huber, Hampel, and several others. Loosely, there seemed to be a computationrobustness tradeoff, practical estimators did have strong robustness guarantees, while estimators with strong robustness guarantees were computationally impractical.
In our work, we provide a new class of computationallyefficient class of estimators for risk minimization that are provably robust to a variety of robustness settings, such as arbitrary oblivious contamination, and heavytailed data, among others. Our workhorse is a novel robust variant of gradient descent, and we provide conditions under which our gradient descent variant provides accurate and robust estimators in any general convex risk minimization problem. These results provide some of the first computationally tractable and provably robust estimators for general statistical models.
Joint work with Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan.

INI 1 
19:30 to 22:00  Formal Dinner at St John's College 
09:00 to 09:45 
Alexandre Tsybakov (CREST: Centre de Recherche en Économie et Statistique) tba 
INI 1 
09:45 to 10:30 
Vincent Vu (Ohio State University) Group invariance and computational sufficiency 
INI 1 
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
Richard Samworth (University of Cambridge) Data perturbation for data science
When faced with a dataset and a problem of interest, should we propose a statistical model and use that to inform an appropriate algorithm, or dream up a potential algorithm and then seek to justify it? The former is the more traditional statistical approach, but the latter appears to be becoming more popular. I will discuss a class of algorithms that belong in the second category, namely those that involve data perturbation (e.g. subsampling, random projections, artificial noise, knockoffs,...). As examples, I will consider Complementary Pairs Stability Selection for variable selection and sparse PCA via random projections.
This will involve joint work with Rajen Shah, Milana Gataric and Tengyao Wang.

INI 1 
11:45 to 12:30 
Peter Bühlmann (ETH Zürich); Domagoj Ćevid (ETH Zürich) Deconfounding using Spectral Transformations
Highdimensional regression methods which rely on the sparsity of the ground truth, such as the Lasso, might break down in the presence of confounding variables. If a latent variable affects both the response and the predictors, the correlation between them changes. This phenomenon can be represented as a linear model where the sparse coefficient vector has been perturbed. We will present our work on this problem. We investigate and propose some spectral transformations for the data which serve as input for the Lasso. We discuss assumptions for achieving the optimal error rate and illustrate the performance on a genomic dataset. The approach is easy to use and leads to convincing results. The talk is based on joint work with Nicolai Meinshausen.

INI 1 
12:30 to 14:00  Buffet Lunch at INI 