Alexandre d'Aspremont, CNRS - Ecole Polytechnique
Approximation Bounds for Sparse Principal Component Analysis
We produce approximation bounds on a semidefinite programming relaxation for sparse principal component analysis. These bounds control approximation ratios for tractable statistics in hypothesis testing problems where data points are sampled from Gaussian models with a single sparse leading component.
(Joint work with F. Bach and L. El Ghaoui)
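For readers less familiar with the object being bounded, a commonly used semidefinite lifting of sparse PCA (written here in a generic form, which may differ in detail from the formulation analysed in the talk) bounds the cardinality-constrained eigenvalue problem by a convex program:

    \[
    \max_{\|x\|_2 = 1,\ \|x\|_0 \le k} x^\top \Sigma x
    \;\le\;
    \max_{X \succeq 0,\ \mathrm{Tr}(X) = 1,\ \sum_{i,j} |X_{ij}| \le k} \mathrm{Tr}(\Sigma X),
    \]

since $X = x x^\top$ is feasible for the right-hand problem whenever $x$ is a unit vector with at most $k$ nonzero entries (so that $\|x\|_1^2 \le k$). The optimal value of the relaxed problem is the kind of tractable statistic whose approximation ratio such bounds control.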
Francis Bach, INRIA - ENS
Stochastic gradient methods for large-scale machine learning
Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations ("large n") and each of these is large ("large p"). In this setting, online algorithms, which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. In this talk, I will present several recent results, showing that in the ideal infinite-data setting, online learning algorithms based on stochastic approximation should be preferred, but that in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate with an iteration cost similar to stochastic gradient descent. (Joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt)
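As a rough illustration of the batch/online hybrids alluded to above, the following sketch is written in the spirit of a stochastic average gradient method for least squares; the function name, step-size heuristic and choice of loss are illustrative assumptions, not the talk's exact algorithm.

    import numpy as np

    def sag_least_squares(A, b, n_iters=10000, seed=0):
        # Stochastic-average-gradient-style sketch for min_x (1/2n) ||Ax - b||^2.
        # A table keeps the most recent gradient of every data point; each
        # iteration refreshes one entry and steps along the running average, so
        # the per-iteration cost matches stochastic gradient descent while the
        # direction aggregates information from the whole finite data set.
        rng = np.random.default_rng(seed)
        n, p = A.shape
        step = 1.0 / (np.max(np.sum(A * A, axis=1)) + 1e-12)  # ~1/L heuristic
        x = np.zeros(p)
        grad_table = np.zeros((n, p))   # last stored gradient per sample
        grad_sum = np.zeros(p)          # sum of the stored gradients
        for _ in range(n_iters):
            i = rng.integers(n)
            g_new = (A[i] @ x - b[i]) * A[i]     # gradient of sample i at x
            grad_sum += g_new - grad_table[i]    # refresh its entry in the sum
            grad_table[i] = g_new
            x -= step * grad_sum / n             # move along the averaged gradient
        return x

    # usage on a small synthetic regression problem
    rng = np.random.default_rng(1)
    A = rng.standard_normal((1000, 20))
    x_true = rng.standard_normal(20)
    b = A @ x_true + 0.01 * rng.standard_normal(1000)
    print("error:", np.linalg.norm(sag_least_squares(A, b) - x_true))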
Leon Bottou, Microsoft Research
Large-Scale Learning Revisited
This presentation shows how large-scale data challenges traditional machine learning in fundamental ways.
- Traditional machine learning describes tradeoffs associated with the scarcity of data. Qualitatively different tradeoffs appear when we consider instead that computing time is the bottleneck. As a consequence, one needs to reconsider the relations between the machine learning problem, its optimization formulation, and the optimization algorithms.
- Traditional machine learning optimizes average losses. Increasing the training set size cannot improve such metrics indefinitely. However, these diminishing returns vanish if we measure instead the diversity of conditions in which the trained system performs well. In other words, big data is not an opportunity to increase the average accuracy, but an opportunity to increase coverage.
- Since the benefits of big data stem from its diversity, we need conceptual tools to build learning systems that can address all the (changing) aspects of real big data problems. Multitask learning, transfer learning, and deep learning are first steps in this direction.
Alfred Hero, University of Michigan
Correlation mining
Datasets are frequently mined for information that principally lies in the correlation patterns between objects or groups of items. Such correlation mining problems arise in big data areas such as computational finance, neuroscience, and bioinformatics. When there are many objects, spurious correlations dominate the true correlations and limit our ability to reliably detect true correlation patterns. We will discuss the general correlation mining problem, present some theory, and discuss open problems in this area.
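A quick way to see the difficulty is to compute sample correlations between variables that are, by construction, independent; the toy snippet below (an illustration only, not the speaker's analysis) shows how large the spurious correlations can get when the number of variables far exceeds the number of samples.

    import numpy as np

    # Toy illustration: with n samples and p independent variables, every true
    # correlation is exactly zero, yet the largest off-diagonal sample
    # correlation can be substantial when p >> n.
    rng = np.random.default_rng(0)
    n, p = 50, 2000                      # few samples, many variables
    X = rng.standard_normal((n, p))      # independent columns
    R = np.corrcoef(X, rowvar=False)     # p x p sample correlation matrix
    off_diag = np.abs(R[np.triu_indices(p, k=1)])
    print("largest spurious |correlation|:", round(off_diag.max(), 2))
    print("fraction of pairs with |correlation| > 0.4:", round((off_diag > 0.4).mean(), 4))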
Chris Holmes, Oxford University
Bayesian Hidden Markov models with linear time decoding for the analysis of cancer genomes
Cancer genomes often exhibit large-scale structural variation whereby stretches of DNA are duplicated or deleted relative to the inherited germline genome. These so-called copy number aberrations (CNAs) are known to be key drivers of tumour formation and progression. We have developed Bayesian Hidden Markov models (HMMs) for detecting CNAs in data from high-throughput genotyping arrays and whole-genome sequencing platforms. We pay particular attention to the problem of computationally efficient genome segmentation, relating to scientific questions of the type, "find me the most probable deletion events in this genome", or "find me the most probable duplication events". To this end we have devised posterior decoding algorithms, linear in the length of the sequence, that can retrieve the optimal segmentation for a fixed number of events (transitions) under an HMM. The methods are illustrated on real-world examples from studies of chronic lymphocytic leukemia and breast cancer.
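The "optimal segmentation for a fixed number of events" can be recovered with a dynamic programme of roughly the following shape; this is an illustrative sketch for a generic HMM (the function name and interface are assumptions), not the authors' implementation.

    import numpy as np

    def best_path_with_k_events(log_pi, log_A, log_emis, k_events):
        # Sketch of a decoding dynamic programme that is linear in the sequence
        # length T: it returns the most probable state path using exactly
        # k_events transitions (segment boundaries) under an HMM with S states.
        # Inputs: log_pi (S,) initial log-probs, log_A (S, S) transition
        # log-probs, log_emis (T, S) per-position emission log-likelihoods.
        # Cost is O(T * S^2 * k_events), i.e. linear in T for fixed S and k.
        T, S = log_emis.shape
        assert 0 <= k_events <= T - 1
        NEG = -np.inf
        score = np.full((k_events + 1, S), NEG)   # score[j, s]: best path, j jumps
        score[0] = log_pi + log_emis[0]
        back = np.zeros((T, k_events + 1, S, 2), dtype=int)  # (prev state, prev j)
        for t in range(1, T):
            new = np.full_like(score, NEG)
            for j in range(k_events + 1):
                stay = score[j] + np.diag(log_A)           # remain in the same state
                if j > 0:                                  # change state: one event
                    cand = score[j - 1][:, None] + log_A   # [previous, current]
                    np.fill_diagonal(cand, NEG)
                    prev = cand.argmax(axis=0)
                    jump = cand[prev, np.arange(S)]
                else:
                    prev, jump = np.arange(S), np.full(S, NEG)
                take = jump > stay
                new[j] = np.where(take, jump, stay) + log_emis[t]
                back[t, j, :, 0] = np.where(take, prev, np.arange(S))
                back[t, j, :, 1] = np.where(take, j - 1, j)
            score = new
        s, j = int(score[k_events].argmax()), k_events     # trace back
        path = [s]
        for t in range(T - 1, 0, -1):
            s, j = back[t, j, s]
            path.append(int(s))
        return path[::-1], float(score[k_events].max())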
Piotr Indyk, MIT
Faster Algorithms for the Sparse Fourier Transform
The Fast Fourier Transform (FFT) is one of the most fundamental numerical algorithms. It computes the Discrete Fourier Transform (DFT) of an n-dimensional signal in O(n log n) time. The algorithm plays an important role in many areas. It is not known whether its running time can be improved. However, in many applications, most of the Fourier coefficients of a signal are "small" or equal to zero, i.e., the output of the transform is (approximately) sparse. In this case, it is known that one can compute the set of non-zero coefficients faster than in O(n log n) time.
In this talk, I will describe a new set of efficient algorithms for the sparse Fourier Transform. One of the algorithms has a running time of O(k log n), where k is the number of non-zero Fourier coefficients of the signal. For moderate values of k, this runtime is *sub-linear* in the input size n. If time allows, I will also describe some of the applications to spectrum sensing and GPS locking.
The talk will cover the material from the joint papers with Fadel Adib, Badih Ghazi, Haitham Hassanieh, Dina Katabi, Eric Price and Lixin Shi. The papers are available at http://groups.csail.mit.edu/netmit/sFFT/
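To give a flavour of how such algorithms avoid touching all n coefficients, below is the classical one-sparse primitive that many sparse Fourier transform algorithms reduce to after hashing frequencies into buckets; it is a toy sketch under an exactly sparse, noiseless assumption, not the algorithms from the papers above.

    import numpy as np

    def locate_single_tone(x):
        # If the DFT of x has a single nonzero coefficient a at frequency f, then
        # x[t] = (a / n) * exp(2j*pi*f*t/n), so the ratio of two consecutive time
        # samples reveals f without computing a full n log n FFT.  Variants of
        # this O(1)-sample test are a basic building block of sparse FFT methods.
        n = len(x)
        ratio = x[1] / x[0]
        f = int(round(np.angle(ratio) * n / (2 * np.pi))) % n
        a = x[0] * n
        return f, a

    # toy check against the dense definition
    n, f_true = 1024, 357
    x = 2.5 * np.exp(2j * np.pi * f_true * np.arange(n) / n) / n
    f_hat, a_hat = locate_single_tone(x)
    assert f_hat == f_true and np.isclose(a_hat, 2.5)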
Michael Jordan, U.C. Berkeley
MAD-Bayes: MAP-based asymptotic derivations from Bayes
The rapid growth in the size and scope of data sets in science and technology creates new challenges for statistics. In principle, the Bayesian nonparametric (BNP) framework would seem to provide an appropriate platform on which to approach massive data analysis, given the open-ended supply of degrees of freedom that it provides, but in practice the computational burden rules out existing BNP methodology. For example, large-scale clustering problems are generally approached via simple non-Bayesian methods such as K-means rather than with BNP tools. Bowing to this computational imperative, we recall that K-means can be obtained from a finite mixture of Gaussians model via "small-variance asymptotics," where the covariances of the Gaussian components are taken analytically to zero, and we ask what happens when similar asymptotics are applied to BNP models, specifically Dirichlet process mixtures, and then hierarchical Dirichlet processes and other BNP models. The answer is that interesting new procedures are obtained; non-Bayesian procedures to be sure, but fast, effective procedures that can be applied to large-scale problems. Whether this is interesting or heretical (or both) will be for the audience to decide. [Joint work with Tamara Broderick and Brian Kulis.]
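As a concrete picture of what a small-variance limit of a Dirichlet process mixture can look like, here is a K-means-like sketch in which the number of clusters is not fixed in advance; the penalty parameter and the update schedule are illustrative assumptions, not the exact procedure derived in the talk.

    import numpy as np

    def dp_means(X, lam, n_iters=25):
        # K-means-like sketch of the small-variance limit of a Dirichlet process
        # mixture ("DP-means" style): identical to Lloyd's algorithm, except that
        # a point whose squared distance to every current centre exceeds the
        # penalty lam opens a new cluster, so the number of clusters is not fixed
        # in advance.
        centres = [X.mean(axis=0)]
        assignments = np.zeros(len(X), dtype=int)
        for _ in range(n_iters):
            for i, x in enumerate(X):
                d2 = np.array([np.sum((x - c) ** 2) for c in centres])
                if d2.min() > lam:                    # pay lam to open a cluster
                    centres.append(x.copy())
                    assignments[i] = len(centres) - 1
                else:
                    assignments[i] = int(d2.argmin())
            centres = [X[assignments == k].mean(axis=0) if np.any(assignments == k)
                       else centres[k] for k in range(len(centres))]
        return np.array(centres), assignments

    # usage: three well-separated blobs; the number of clusters is set by lam
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in (-3.0, 0.0, 3.0)])
    centres, labels = dp_means(X, lam=4.0)
    print(len(centres), "clusters found")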
Lester Mackey, Stanford University
Divide-and-Conquer Matrix Factorization
If learning methods are to scale to the massive sizes of modern datasets, it is essential for the field of machine learning to embrace parallel and distributed computing. Inspired by the recent development of matrix factorization methods with rich theory but poor computational complexity and by the relative ease of mapping matrices onto distributed architectures, we introduce Divide-Factor-Combine (DFC), a scalable divide-and-conquer framework for noisy matrix factorization. DFC divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative filtering, video background modeling, and subspace segmentation demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that DFC enjoys high-probability recovery guarantees comparable to those of its base algorithm.
This is joint work with Ameet Talwalkar, Michael Jordan, Yadong Mu and Shih-Fu Chang.
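The divide, factor and combine steps can be sketched in a few lines; in the snippet below a truncated SVD stands in for the base factorization algorithm and column projection is used for the combination step, as an illustration rather than the authors' DFC code.

    import numpy as np

    def dfc_project(M, rank, n_blocks=4):
        # Divide-Factor-Combine sketch.  Divide: split the columns into blocks.
        # Factor: run a base low-rank solver on each block; the blocks are
        # independent, so this step is embarrassingly parallel (a truncated SVD
        # stands in here for the base matrix factorization algorithm).  Combine:
        # project every block estimate onto a column space recovered from one
        # block, in the spirit of column projection.
        blocks = np.array_split(M, n_blocks, axis=1)

        def base_low_rank(B):
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            return (U[:, :rank] * s[:rank]) @ Vt[:rank]

        estimates = [base_low_rank(B) for B in blocks]
        U0, _, _ = np.linalg.svd(estimates[0], full_matrices=False)
        Q = U0[:, :rank]                              # basis for the column space
        return np.hstack([Q @ (Q.T @ E) for E in estimates])

    # toy check on a noisy low-rank matrix
    rng = np.random.default_rng(0)
    L = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 400))
    M = L + 0.01 * rng.standard_normal(L.shape)
    print("relative error:",
          np.linalg.norm(dfc_project(M, rank=5) - L) / np.linalg.norm(L))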
Michael Mahoney, Stanford University
Revisiting the Nystrom Method for Improved Large-Scale Machine Learning
The Nystrom Method is a sampling-based or CUR-like low-rank approximation to symmetric positive semi-definite (SPSD) matrices such as Laplacians and kernels that arise in machine learning. The method has received a great deal of attention in recent years, much of which has focused on the use of randomized algorithms to compute efficiently the low-rank approximation. Motivated by conflicting and contradictory claims in the literature, we have performed a detailed empirical evaluation of the performance quality and running time of sampling-based and projection-based low-rank approximation methods on a diverse suite of SPSD matrices. Our main conclusions are threefold: first, our results highlight complementary aspects of projection methods versus sampling based on the statistical leverage scores; second, our results elucidate the effect of popular "design decisions" on the structural properties of the matrices and thus on the applicability of different approximation algorithms; and third, our results show that sampling-based algorithms that use provably-accurate approximate leverage scores can have running times comparable to random projection algorithms. In addition, our empirical results demonstrate that prior theory was so weak as to not even be a qualitative guide to practice. Thus, we complement our empirical results with a suite of worst-case theoretical bounds for both random sampling and random projections methods. These bounds are qualitatively superior to existing bounds, e.g., improved additive-error bounds for the spectral and Frobenius norm errors and relative-error bounds for the trace norm error, and they can be used to understand the complementary aspects of projection versus sampling-based methods. We will also describe how recent numerical implementations of random sampling and random projection algorithms can be used in this setting in much larger-scale applications.
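For reference, the basic Nystrom construction discussed above can be sketched as follows; uniform column sampling is the default, and the leverage-score option is a simplified stand-in for the provably-accurate approximate leverage scores mentioned in the abstract.

    import numpy as np

    def nystrom(K, n_cols, leverage=None, seed=0):
        # Nystrom sketch: approximate an SPSD matrix from a subset of its
        # columns.  Sample l columns C of K (uniformly, or in proportion to
        # approximate leverage scores if supplied), let W be the corresponding
        # l x l block, and return the approximation C W^+ C^T.  Illustrative
        # sketch of the construction discussed above, not the authors' code.
        n = K.shape[0]
        rng = np.random.default_rng(seed)
        p = None if leverage is None else leverage / leverage.sum()
        idx = rng.choice(n, size=n_cols, replace=False, p=p)
        C = K[:, idx]                    # sampled columns
        W = K[np.ix_(idx, idx)]          # intersection block
        return C @ np.linalg.pinv(W) @ C.T

    # toy usage on an RBF kernel matrix
    rng = np.random.default_rng(1)
    X = rng.standard_normal((800, 5))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * sq)
    K_hat = nystrom(K, n_cols=80)
    print("relative error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))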
Eric Moulines, Telecom Paristech
The island particle model
The approximation of Feynman-Kac semigroups by systems of interacting particles is a very active research field, with applications in many different areas. In this talk, we study how such systems of particles can be run on massively parallel computers. The total population of particles is divided into sub-populations, referred to as islands. The particles within each island follow the usual selection / mutation dynamics. We show that the evolution of each island is also driven by a Feynman-Kac semigroup, whose transition and potential can be explicitly related to those of the original problem. Therefore, the same genetic-type approximation of the Feynman-Kac semigroup may be used at the island level; each island may itself run a selection / mutation algorithm. We investigate the impact of the population size within each island and of the number of islands, and study different types of interactions within and across islands. We also find conditions under which introducing interactions between islands is beneficial. The theoretical results are supported by Monte Carlo experiments.
Joint work with P. Del Moral (INRIA Bordeaux), C. Verge (ONERA)
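A minimal sketch of the island idea, written for a scalar AR(1) state-space model with Gaussian observations, is given below; the interaction rule between islands is an illustrative assumption and not the paper's exact scheme.

    import numpy as np

    def island_particle_filter(ys, n_islands=8, n_per_island=200,
                               phi=0.95, sigma_x=1.0, sigma_y=1.0, seed=0):
        # Sketch of an island particle filter for a scalar AR(1) state with
        # Gaussian observations y_t = x_t + noise.  The population is split
        # into islands that run the usual mutation / selection dynamics
        # independently (so each island can sit on its own processor); islands
        # interact only through an occasional island-level resampling driven by
        # their accumulated average weights.
        rng = np.random.default_rng(seed)
        X = rng.normal(0.0, sigma_x, size=(n_islands, n_per_island))
        log_island_w = np.zeros(n_islands)     # accumulated since last interaction
        for t, y in enumerate(ys):
            # mutation: propagate every particle through the AR(1) dynamics
            X = phi * X + rng.normal(0.0, sigma_x, size=X.shape)
            # weight particles against the observation
            logw = -0.5 * ((y - X) / sigma_y) ** 2
            m = logw.max(axis=1, keepdims=True)
            w = np.exp(logw - m)
            log_island_w += np.log(w.mean(axis=1)) + m.ravel()
            # selection within each island (multinomial resampling)
            for i in range(n_islands):
                idx = rng.choice(n_per_island, n_per_island, p=w[i] / w[i].sum())
                X[i] = X[i, idx]
            # interaction across islands every 10 steps: resample whole islands
            if t % 10 == 9:
                pw = np.exp(log_island_w - log_island_w.max())
                keep = rng.choice(n_islands, n_islands, p=pw / pw.sum())
                X = X[keep]
                log_island_w[:] = 0.0
        return X

    # usage on a short simulated series
    rng = np.random.default_rng(1)
    x, ys = 0.0, []
    for _ in range(100):
        x = 0.95 * x + rng.normal()
        ys.append(x + rng.normal())
    particles = island_particle_filter(np.array(ys))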
Sonia Petrone, Università Bocconi
Restricted random partitions for Bayesian curve fitting
When dealing with statistical models that involve a very high-dimensional space, it is convenient, or necessary, to reduce the dimension by setting the probability of suitable subspaces to zero. At the same time, one wants to preserve desirable properties of the probability laws involved. In this talk we illustrate this issue using a simple problem of Bayesian nonparametric curve fitting through Dirichlet process mixtures. A crucial component of this model is the high-dimensional space of partitions of subjects into different mixture kernels. We first show that, from a predictive perspective, it is important to incorporate information from the covariates in the prior on the mixing weights. However, due to the huge dimension of the partition space, any prior information will be dramatically spread out in the posterior. The only way to avoid this is to strictly enforce the notion of covariate proximity by setting the probability of undesirable partitions to zero. We show how this restriction can be enforced while maintaining certain properties of the Dirichlet process. This allows the distribution of the partition to depend on the covariates in a simple manner and greatly reduces the total number of possible partitions, resulting in improved curve fitting and faster computations. Numerical illustrations are presented.
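One way to picture the restriction idea is a sequential, Chinese-restaurant-style partition sampler in which clusters that violate covariate proximity receive probability zero; the particular proximity rule below is an illustrative assumption, not the authors' construction.

    import numpy as np

    def restricted_crp_partition(z, alpha=1.0, eps=0.5, seed=0):
        # Sequential sketch of a covariate-restricted Chinese-restaurant-style
        # prior on partitions.  Subject i joins an existing cluster with
        # probability proportional to its size, or opens a new cluster with
        # probability proportional to alpha, except that clusters whose
        # covariates are all farther than eps from z[i] get probability zero.
        # The zeroing mirrors the restriction idea above; this proximity rule
        # is an illustrative assumption.
        rng = np.random.default_rng(seed)
        clusters = []                       # each cluster is a list of subject indices
        for i, zi in enumerate(z):
            weights = []
            for members in clusters:
                close = np.min(np.abs(z[members] - zi)) <= eps
                weights.append(float(len(members)) if close else 0.0)
            weights.append(alpha)           # a new cluster is always allowed
            weights = np.array(weights)
            k = rng.choice(len(weights), p=weights / weights.sum())
            if k == len(clusters):
                clusters.append([i])
            else:
                clusters[k].append(i)
        return clusters

    # usage: covariates on a grid; distant subjects cannot share a cluster
    z = np.linspace(0.0, 10.0, 50)
    print([len(c) for c in restricted_crp_partition(z)])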
Slav Petrov, Google
Large-Scale Language Learning
When applying supervised Natural Language Processing (NLP) tools to real-world problems, one faces at least two challenges: (i) the supervised data captures only a fraction of all valid language phenomena (words in the language, grammatical constructions, etc.) and (ii) while we care primarily about downstream application performance, models are typically optimized for intrinsic performance on an isolated task (e.g. syntactic parsing).
In this talk I will highlight several examples of how we have successfully used indirect signals to adapt NLP systems for downstream applications: in query processing, signals from search logs can improve entity detection; human-aligned sentences can help train better syntactic parsers for machine translation; parallel data and dictionaries can guide projected multilingual systems; etc. All of these examples have in common that fully supervised data is difficult and expensive to obtain, while indirect signals are plentiful; they just need to be harvested and incorporated in creative ways.
Ion Stoica, U.C. Berkeley
Computations with Bounded Errors and Bounded Response Times on Very Large Data
I will describe BlinkDB, a massively parallel, approximate query engine for running interactive queries on large volumes of data. BlinkDB allows users to trade query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results with meaningful error bars. To compute these errors we use a combination of closed-form formulas for aggregation operations and the bootstrap method for more complex computations. While the preliminary results are encouraging, several challenges still remain. In this talk I'm going to focus on two such challenges: diagnosing whether the bootstrap can be applied to a particular computation, and improving the computational efficiency of error estimation.
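The two error-estimation routes mentioned above can be sketched on a synthetic table: a closed-form error bar for an average, and a bootstrap error bar for a statistic with no simple formula; the sampling fractions and routine names are illustrative assumptions, not BlinkDB's implementation.

    import numpy as np

    def mean_with_error(sample, population_size):
        # Closed-form error bar for an AVG-style aggregate computed on a sample,
        # with a finite-population correction.
        n = len(sample)
        se = sample.std(ddof=1) / np.sqrt(n) * np.sqrt(1.0 - n / population_size)
        return sample.mean(), 1.96 * se            # estimate and ~95% error bar

    def bootstrap_error(sample, statistic, n_boot=500, seed=0):
        # Bootstrap error bar for a query whose sampling error has no simple
        # closed form (here, a quantile).
        rng = np.random.default_rng(seed)
        reps = [statistic(rng.choice(sample, size=len(sample), replace=True))
                for _ in range(n_boot)]
        return statistic(sample), 1.96 * np.std(reps)

    # usage: a synthetic one-million-row "table" queried through a 1% sample
    rng = np.random.default_rng(1)
    table = rng.lognormal(mean=3.0, sigma=1.0, size=1_000_000)
    sample = rng.choice(table, size=10_000, replace=False)
    print("AVG:", mean_with_error(sample, len(table)))
    print("P90:", bootstrap_error(sample, lambda s: np.quantile(s, 0.9)))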
Martin Wainwright, U.C. Berkeley
Computation meets Statistics: Fast global convergence for high-dimensional (non-convex) statistical recovery
Many statistical estimators are based on convex optimization problems formed by the weighted sum of a loss function and a norm-based regularizer. Particular examples include $\ell_1$-based methods for sparse vectors, the nuclear norm for low-rank matrices, and various combinations thereof for matrix decomposition. In this talk, we describe an interesting connection between computational and statistical efficiency, in particular showing that the same conditions that guarantee that an estimator has low statistical error can also be used to certify fast convergence of first-order optimization methods up to statistical precision. The same techniques can also be used to provide rigorous guarantees for estimators based on non-convex programs.
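A standard first-order method for one instance of this loss-plus-regularizer family is proximal gradient descent for the $\ell_1$-penalized least-squares problem; the sketch below is illustrative of such methods in general, not of the talk's specific analysis.

    import numpy as np

    def ista_lasso(A, b, lam, n_iters=500):
        # Proximal gradient (ISTA) sketch for the composite objective
        # (1/2n) ||Ax - b||^2 + lam * ||x||_1, i.e. a smooth loss plus a norm
        # regularizer: a gradient step on the loss followed by soft
        # thresholding, the proximal operator of the l1 penalty.
        n, p = A.shape
        step = n / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant of the loss
        x = np.zeros(p)
        for _ in range(n_iters):
            grad = A.T @ (A @ x - b) / n
            z = x - step * grad
            x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
        return x

    # usage: sparse regression with more parameters than observations
    rng = np.random.default_rng(0)
    n, p, k = 200, 500, 10
    A = rng.standard_normal((n, p))
    x_true = np.zeros(p)
    x_true[:k] = 1.0
    b = A @ x_true + 0.1 * rng.standard_normal(n)
    x_hat = ista_lasso(A, b, lam=0.1)
    print("estimation error:", np.linalg.norm(x_hat - x_true))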