Foundations of Machine Learning

Theme I: Fundamentals of Machine Learning Directed toward Biological and Medical Applications

Participants:

  • CS - N. Amenta, P. Devanbu, P. Koehl, Y.-J. Lee, I. Tagkopoulos
  • ECE - C.-N. Chuah, S. Ghiasi
  • Math - J. Arsuaga, J. De Loera, J. Hass, L. Rademacher, M. Vazquez
  • Stat - P. Burman, C. Drake, F. Hsieh, J. Jiang, M. Lopes, W. Polonik, B. Rajaratnam

Theme Ia: Geometry of Data

Many central problems in data science and machine learning have strong geometric and topological elements. Over the past ten years, geometric and topological tools have entered data science naturally and effectively, to the point that they can now be considered indispensable. A well-established example of geometric methodology is surface reconstruction, where data points are measured or sampled from an object whose structure needs to be recovered. Topological methods, including Topological Data Analysis (TDA), provide a framework for analyzing the "shape" of data and identifying previously overlooked patterns, and they have found applications in numerous fields.
Geometric tools have also influenced data science and machine learning at a fundamental level. Examples include manifold learning, in which the shape of interest may be a low-dimensional object embedded in a higher-dimensional space, and supervised learning, where questions of point separability, commonly addressed in convex geometry, are key to the convergence of gradient descent algorithms.
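
As a small illustration of how separability governs convergence, the sketch below trains a perceptron, whose update is a stochastic subgradient step on the loss max(0, -y(w·x + b)). The perceptron is chosen here only as a familiar example and is not taken from the projects listed on this page. The function name, the toy points, and the epoch limit are all hypothetical.

```python
def train_perceptron(points, labels, max_epochs=100):
    """Perceptron training (illustrative helper, not from the source).

    Returns (weights, bias, epochs_used); epochs_used is None when no
    mistake-free pass over the data occurred within max_epochs.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for epoch in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin <= 0:
                # Misclassified (or on the boundary): step toward the point.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b, epoch + 1
    return w, b, None


# Toy data (assumed): two classes separable by a line.
sep_points = [(2.0, 1.0), (3.0, 2.0), (2.5, 3.0),
              (-1.0, -2.0), (-2.0, -1.0), (-3.0, -2.5)]
sep_labels = [1, 1, 1, -1, -1, -1]
w, b, sep_epochs = train_perceptron(sep_points, sep_labels)

# XOR-style data: no line separates the two classes.
xor_points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
xor_labels = [-1, -1, 1, 1]
_, _, xor_epochs = train_perceptron(xor_points, xor_labels)

print("separable data converged in epochs:", sep_epochs)
print("XOR data converged in epochs:", xor_epochs)
```

On the separable set the loop stops once a full pass makes no mistakes. On the XOR-style set no such pass can occur, so the helper reports None after the epoch limit.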

The expertise and work of the Geometry of Data team at UCD4IDS cover these topics, as well as the development of new tools for data science and machine learning.

Our research projects in this theme include:

  • Morphology of biological systems
  • Analysis of genomic data
  • Data clustering and classification

Theme Ib: Pattern Mining and Machine Learning

Once a dataset is in hand, the first major task in harnessing it is to compute and represent its information content. In the real world, such information content typically exhibits multiscale complexity and heterogeneity. For instance, in multiclass classification, a key ML topic, a label embedding tree or graph represents the global scale of the information content well, while its medium and fine scales remain to be revealed. Machine learning, both supervised and unsupervised, offers algorithmic means of extracting a dataset's information content that go beyond the limitations of statistical learning, and it provides explicit understanding and explanations through multiscale pattern information.

The second major task is empirical inference or predictive decision-making, in which the computed multiscale pattern information and its geometries play the central role. Predictive decision-making that abides by all computed pattern information and its geometries, and that adopts the spirit of the Likelihood Principle, likely offers hope for obtaining error-free predictions and coherent explanations.

The last major task is to validate and evaluate all computed knowledge and inferences. In statistical learning, bootstrapping is one classic method. However, its strengths often cannot compensate for its main drawback: it misses the true nature of the data's original dynamics. A fundamental upgrade of bootstrapping is data mimicking. Only after the data's multiscale information content and its geometry are correctly extracted can we coherently mimic the data to recreate its original dynamics. This fundamental and critical task is far from resolved.
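
For readers unfamiliar with the classic bootstrap mentioned above, here is a minimal sketch of a percentile bootstrap confidence interval for a sample mean. It uses only the Python standard library. The function name, the toy data, the number of resamples, and the fixed seed are all assumptions made for illustration. The sketch does not implement data mimicking.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` (illustrative helper)."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample n observations with replacement, n_resamples times,
    # and record the statistic on each resample.
    estimates = sorted(
        stat([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[round((alpha / 2) * n_resamples)]
    upper = estimates[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy measurements (assumed values, not real data).
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
low, high = bootstrap_ci(data)
print(f"sample mean: {statistics.mean(data):.2f}")
print(f"95% bootstrap interval: ({low:.2f}, {high:.2f})")
```

Because each resample draws observations independently and with replacement, any ordering or dependence among the original observations is discarded. This is one concrete way in which plain bootstrapping can miss the original dynamics of the data.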

Our research projects in this theme include:

  • Novel data acquisition via embedded/wearable or non-invasive devices
  • Data harnessing and feature selection
  • High-dimensional statistical learning and uncertainty quantification
  • Model-free and error-free unsupervised learning
  • Sampling and streaming of big data