## Event Date

We will hold our third and final UCD4ISD roundtable discussion in Winter Quarter 2020, with Dr. Ken Clarkson of IBM Research. Subjects/themes of this roundtable discussion include, but are not limited to:

- What would a data scientist in industry recommend that our graduate students (in statistics, math, CS, ECE) do or study if they want to become data scientists in industry?
- From your viewpoint, what would you suggest our faculty teach our graduate students who are interested in data science?
- What do you think the future directions of data science research will be?

A coffee/tea reception will also be held in the same room, 1147 MSB, during the roundtable discussion.

Attention, graduate students: please attend this roundtable discussion. It should be quite informative for your careers.

After the roundtable discussion, Dr. Clarkson will give the following statistics seminar at 4:10pm:

Title: Dimensionality reduction for Tukey regression

Abstract:

We give the first dimensionality reduction methods for the overconstrained Tukey regression problem. The Tukey loss function ||*y*||_*M* = sum_*i* *M*(*y_i*) has *M*(*y_i*) approximately |*y_i*|^*p* for residual errors with |*y_i*| smaller than a prescribed threshold tau, while *M*(*y_i*) becomes constant for errors |*y_i*| > tau.
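As a concrete illustration of the loss shape described above, here is a minimal numpy sketch: |*y_i*|^*p* below the threshold tau and constant above it. The function names and the default values tau = 1, p = 2 are illustrative assumptions, not necessarily the exact *M* used in the paper.

```python
import numpy as np

def tukey_loss(y, tau=1.0, p=2.0):
    """Elementwise M(y_i): |y_i|^p below the threshold tau,
    constant (tau^p) above it. A sketch of the loss shape
    described in the abstract, not the paper's exact M."""
    a = np.abs(np.asarray(y, dtype=float))
    return np.where(a <= tau, a ** p, tau ** p)

def tukey_norm(y, tau=1.0, p=2.0):
    """||y||_M = sum_i M(y_i)."""
    return tukey_loss(y, tau, p).sum()
```

The cap at tau^*p* is what makes the loss robust: a single huge outlier contributes no more than tau^*p*, unlike in least squares.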

Our results depend on a new structural result, proven constructively, showing that for any *d*-dimensional subspace *L* subset **R**^*n*, there is a fixed bounded-size subset of coordinates containing, for every *y* in *L*, all the large coordinates of *y* *with respect to the Tukey loss function*. We think of these as "residual leverage scores", since two coordinates in *y* itself may have very different magnitude and yet both contribute the same value tau to the *M*-function.

Our methods reduce a given Tukey regression problem to a smaller weighted version, whose solution is a provably good approximate solution to the original problem. Our reductions are simple and easy to implement, and we give empirical results demonstrating their practicality, using existing heuristic solvers for the small versions.

One of our reductions uses row sampling, for an instance min_{*x* \in **R**^*d*} ||*Ax-b*||_*M*, where *A* in **R**^{*n* \times *d*} and *b* in **R**^*n*, with *n* >> *d*. The algorithm takes *O*(nnz(*A*) + poly(*d* log *n* / epsilon)) time to return a weight vector with poly(*d* log *n* / epsilon) non-zero entries, such that the solution of the resulting weighted Tukey regression problem is a (1 + epsilon)-approximate solution. Here nnz(*A*) is the number of non-zero entries of *A*. Another reduction uses a sketching matrix *S*, chosen independently of *A* and *b*, so that the solution for a weighted version with inputs *SA* and *Sb* is an *O*(log *n*)-approximate solution. Here *S* has poly(*d* log *n*) rows and *SA* and *Sb* are computable in *O*(nnz(*A*)) time.
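The row-sampling reduction can be sketched as follows in numpy. This is only a toy illustration of the reduction's shape: uniform sampling with weight *n*/*m* stands in for the paper's sampling by residual leverage scores, an IRLS loop stands in for an existing heuristic solver for the small weighted problem (for *p* = 2, the Tukey reweighting simply drops rows whose residual exceeds tau), and all sizes and parameter values are made up for the example.

```python
import numpy as np

def M(y, tau=1.0):
    """Tukey-style loss with p = 2: y^2 below tau, constant above."""
    a = np.abs(y)
    return np.where(a <= tau, a ** 2, tau ** 2)

rng = np.random.default_rng(0)
n, d, m, tau = 2000, 5, 200, 1.0          # n >> d; m sampled rows
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Uniform row sampling (a stand-in for sampling by residual
# leverage scores), with weight n/m so the sampled objective
# estimates the full objective sum_i M((Ax - b)_i).
idx = rng.choice(n, size=m, replace=False)
As, bs, w = A[idx], b[idx], np.full(m, n / m)

# Heuristic IRLS solver for the small weighted problem: for p = 2
# the reweighting keeps a row only while its residual is <= tau.
x = np.linalg.lstsq(As, bs, rcond=None)[0]
for _ in range(10):
    keep = np.sqrt(w * (np.abs(As @ x - bs) <= tau))
    x = np.linalg.lstsq(As * keep[:, None], bs * keep, rcond=None)[0]
```

In the paper's actual algorithm the sampling probabilities come from the structural result above, which is what yields the (1 + epsilon)-approximation guarantee; uniform sampling carries no such guarantee and is used here only to show the reduce-then-solve pattern.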

We also give exponential-time algorithms giving provably good solutions, and hardness results suggesting that a significant speedup in the worst case is unlikely.

This is joint work with Ruosong Wang and David Woodruff, CMU.