The roundtable discussion was divided into two parts: the first focused on the challenges and frontiers of data clustering, and the second on the interplay between UCD4IDS and the Center for Data Science and Artificial Intelligence Research (CeDAR), of which the speaker is the director.
The first question was: what are the frontiers of clustering and dimensionality reduction? The speaker identified one of the foremost challenges as the introduction of additional constraints in dimensionality reduction. For example, in biology it is important to be able to visualize data while retaining tree-like structure and clusters. At present, there are no algorithms that address this problem effectively. Another major area, which he touched on during his talk, is semi-supervised clustering, from both a theoretical and a practical perspective. A more fundamental problem in clustering is establishing the number of clusters k. The speaker identified the “elbow criterion” as a general method: it looks for the point at which some measure of goodness of fit stops increasing rapidly as a function of the number of clusters. Another fundamental problem is the lack of a technical definition of what a cluster actually is. This means that while the speaker was able to obtain precise results for several motivating examples, it is nearly impossible to prove comparable results for completely arbitrary clustering problems.
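The elbow criterion mentioned above can be illustrated with a minimal sketch. Nothing here comes from the speaker's talk: the goodness-of-fit measure (fraction of variance explained by k-means), the deterministic farthest-first seeding, and the rule "pick the k after which the marginal gain drops the most" are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def farthest_first_init(X, k):
    """Deterministic seeding (an assumption of this sketch): start at X[0],
    then repeatedly add the point farthest from all centers chosen so far."""
    centers = [X[0]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = X[int(d2.argmax())]
        centers.append(nxt)
        d2 = np.minimum(d2, ((X - nxt) ** 2).sum(axis=1))
    return np.array(centers, dtype=float)


def kmeans_wcss(X, k, n_iter=100):
    """Run Lloyd's algorithm; return the within-cluster sum of squares."""
    centers = farthest_first_init(X, k)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


def elbow_k(X, k_max=8):
    """Return (k at the elbow, fit values for k = 1..k_max); needs k_max >= 3.

    fit[k-1] is the fraction of total variance explained with k clusters.
    The elbow is taken as the k where the gain from adding the k-th cluster
    most exceeds the gain from adding the (k+1)-th (a heuristic choice).
    """
    X = np.asarray(X, dtype=float)
    total = float(((X - X.mean(axis=0)) ** 2).sum())
    ks = np.arange(1, k_max + 1)
    fit = np.array([1.0 - kmeans_wcss(X, k) / total for k in ks])
    gains = np.diff(fit)            # gains[i] = fit(k=i+2) - fit(k=i+1)
    drop = gains[:-1] - gains[1:]   # drop[i] belongs to k = i+2
    return int(ks[1:-1][int(drop.argmax())]), fit
```

Calling `elbow_k(X, k_max=6)` on an array `X` of shape (n_points, n_features) returns the suggested k together with the fit curve, which is usually worth plotting as well, since on real data the elbow is often far less sharp than on well-separated synthetic clusters.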
An attendee asked the speaker to comment on the Louvain algorithm for community detection. The speaker was not aware of the specifics of the algorithm, which made it clear that, although clustering and community detection overlap in subject matter, the preferred algorithms can differ quite significantly. This calls for a community detection study.
Attendees then had more questions about results from the talk. The speaker gave an example of clustering MRI data to identify Alzheimer’s disease; while he and his collaborator had a decent algorithm that sometimes worked, there were too many nebulous cases for him to call the problem solved. He identified part of the problem as the difficulty of establishing the ground truth, since it can be difficult to establish whether patients actually had Alzheimer’s until after they have died.
The host was interested in overlapping clusters, and in the state of algorithms that assign each data point simultaneous membership in more than one cluster. The speaker noted that spectral clustering methods have this built in to some degree, since (for example) the values of the Fiedler vector can be interpreted in this way. The speaker was, however, somewhat skeptical that something precise could be said in this direction.
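As a rough illustration of the Fiedler-vector remark, the sketch below computes the eigenvector for the second-smallest eigenvalue of the unnormalized graph Laplacian L = D - W and rescales its entries to [0, 1]. Reading the rescaled value as a degree of membership in one of two clusters is an assumption made here for illustration, not a method attributed to the speaker; the function names and the sign convention are likewise hypothetical.

```python
import numpy as np


def fiedler_vector(W):
    """Eigenvector of L = D - W for the second-smallest eigenvalue.

    W is a symmetric, non-negative weight matrix of a connected graph.
    """
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, vecs = np.linalg.eigh(L)
    f = vecs[:, 1]
    # The sign of an eigenvector is arbitrary; fix it so node 0 is non-positive
    # (a convention chosen for this sketch only).
    return -f if f[0] > 0 else f


def soft_membership(W):
    """Rescale the Fiedler vector to [0, 1] (heuristic, assumed here).

    Values near 0 or 1 suggest a node sits firmly on one side of the cut;
    values near 0.5 suggest it lies between the two groups.
    """
    f = fiedler_vector(W)
    return (f - f.min()) / (f.max() - f.min())
```

For instance, on a graph made of two triangles joined through a single bridge node, the bridge node's score sits midway between the two groups by symmetry, while the triangle nodes fall on either side. This is only a two-cluster picture; it does not settle the speaker's doubt about whether anything precise can be said in general.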
The second topic was the relationship between CeDAR and UCD4IDS. There are 10 members who participate in both, and more who do so unofficially. In some respects, CeDAR aims to be the long-term base for the community (beyond the 6 potential years of UCD4IDS) and has a wider scope that includes application domain experts, while UCD4IDS focuses on the fundamentals of data science in CS, ECE, Math, and Stat. The directors of the two activities do not want to duplicate effort and will collaborate closely, e.g., by jointly organizing quarterly colloquia and weekly roundtable discussions, and by setting a research theme for each quarter (e.g., data science in health sciences in Winter 2020).
An attendee wanted a definite notion of data science from both groups, since it has been treated as a somewhat nebulous term that overlaps with existing departments such as CS and Statistics. The speaker identified the coupling of computer science, mathematics, statistics, and domain expertise, all in the same project, as a key feature distinguishing data science. While any one of these can be found in its respective department, having centers/institutes to foster collaboration across disciplines is essential. A center also facilitates a degree program that can uniquely prepare students for such work, and opens up the possibility of industry projects.
[Scribe: David Weber (GGAM)]