Minutes of Roundtable Discussion Winter 2020 - II
The second roundtable discussion of this quarter was held before the Joint Math/Stat Colloquium talk by Prof. Bin Yu (UC Berkeley, Statistics), "Veridical Data Science." Her talk slides can be found here.
The roundtable discussion kicked off with the host asking the guest speaker about the status of data science activities on the Berkeley campus.
Theme I: Data science activities at UC Berkeley: the status of organization, vision and wishes of the speaker
Q: The first thing we would like to hear from you is the current status and structure of the data science activity at UC Berkeley. Could you tell us a little bit about the situation now?
A: UC Berkeley's College of Letters and Science (L&S) has five divisions, including the Mathematical & Physical Sciences (MPS) division, where math and stats sit, and EECS is in the College of Engineering. I have been one of the core faculty members on the DS front at Berkeley and on most of the DS committees. The new division is called the Division of Computing, Data Science, and Society (CDSS). In terms of structure, EECS is figuring out what would be best for them. The Department of Statistics has the option to move into CDSS completely or to belong simultaneously to both the MPS division of L&S and CDSS. Meanwhile, last year Berkeley already graduated 100 data science majors. The DS major is hosted in L&S and is a joint degree co-owned by Statistics and EECS. It's not an engineering degree; it's a letters and science degree. This is the fastest growing major in Berkeley's history. The major was built on two very important classes, Data 8 (Foundations of Data Science) and Data 100 (Principles and Techniques of Data Science). Data 8 was established four years ago and covers three components of data science: domain knowledge, statistics, and computer science. Students in this class learn Python while learning statistics. The programming is meaningful and important. If you want to learn all the theory, then you take an additional two credits with statistics as a connector class (there are many other connector classes from domain fields like neuroscience). I was on the team of four from statistics and computer science that co-created and co-taught Data 100. We had only 100 students for the first trial, and now we have over 1000. A new class, Data 102 (Data, Inference, and Decisions), has just been co-designed and co-taught.
At the research level, the Simons Institute for the Theory of Computing is the key. We bring a lot of people to campus. They call it Theory of Computing, but they cover a lot of data-related work. I am a co-organizer of a Simons summer 2020 cluster on interpretable machine learning, with an opening workshop, "Interpretable Machine Learning in Natural and Social Sciences," from June 29 to July 2, 2020.
Q: How is the relationship of this new division with the Department of Mathematics?
A: From what I know, I don't see a lot right now. Stats connects with EECS much more closely. The human connection is the key to an equal partnership, and mutual trust is important. Stats and EECS have had many joint appointments for decades now.
Q&A on Theme I with the audience:
Q: That sounds more like the procedure than what you are trying to do.
A: At the beginning, it was based on a lot of faculty effort, because there was no money. The good thing is that it attracted a good group of people who had not been doing this for themselves. But now we need resources to sustain ourselves, and very fortunately we now have Jennifer Chayes as the associate provost for CDSS, who arrived in January this year.
Q: Do you feel that you have now reached a good state in terms of data science organization?
A: As the process reaches this state, the details are going to matter too. But I think it's going to work out under Chayes' leadership.
Q: I am curious about resources and instructors. How is the funding related to the number of enrolled students?
A: Right now, the TA budget for Stats comes from campus, supported by the Associate Provost of CDSS and the Dean of MPS. I don't know the exact formula relating the budget to the number of students. Both CS and Stats have been pretty stretched in terms of resources. But the idea was that there is no way to get very good resources before we actually do it. So we make it a success through people volunteering and sacrificing, and then negotiate to raise funds. Appointments, positions, and other structural changes are being worked out by two working groups at the campus level and by departmental committees. Many of us have already been working to build it, while we still teach only in the Department of Statistics, with Data classes cross-listed and co-taught by Stats and CS. It has to be a gradual process. It's a multi-year project. The data science train is leaving the station, and we are not able to answer all the questions before we get on. We get on the train and solve them together. Teaching and hiring in DS are already moving ahead, and there is a commons group in the new division for interdisciplinary people. For senior people moving in, we don't have many concerns. It could be a research hub. All of the details need to be worked out. I see it as an opportunity for transdisciplinary research to be recognized and to start something new. This is where I see the difference between the CS culture and the Stats and Math culture. CS people jump on the train and figure things out. I think the Math/Stat culture, which is very careful, is very useful if we are already on the train or close to the problems. Otherwise the carefulness does not help solve the problems. When we work together as a team, carefulness is very useful because CS people sometimes move too fast. But if you are not even there, then your rigor won't have an impact.
Q: How do you think this impacts graduate students and undergraduate students?
A: I think from the statistics point of view, our data science classes put a lot of positive pressure on current statistics classes. It forces us to revamp our curriculum, which has not changed much for decades. We should focus more on case studies. I see computing as really good for stats, and hopefully for the math curriculum too. Data 8 is in some sense in direct competition with Stat 2. When freshmen ask me which course to take, I answer without hesitation: take Data 8. The examples are new, you learn Python, and you learn new technology, working with CS faculty and students. Many scientists want to be retrained in data science. Ten years from now, many young scientists will be stronger quantitatively, which also puts pressure on us. Berkeley CS just created a new track in their AI PhD cluster, called Machine Learning. Students take prelims on two courses chosen from three: probability and theoretical statistics (Stats Dept.) and optimization (EECS). You can predict who will have the advantage in 10 years. You cannot only know what you know. You have to learn what others know. If we don't enlarge our skill set, 10 years from now we will be less competitive. There are many scientists who take machine learning and know no less than statistics students. We don't have an edge there. When I go to the BioHub, all those young people are very computationally oriented. They need it, they learn it, and they have the science. So we really need to enlarge our skill set too; otherwise we won't be competitive.
Q: What about people who are not from quantitative majors (e.g., humanities)? How do you deal with them?
A: We have something called Data Modules. For example, a professor from American Culture can have a group of data science students help her or him design a module that turns data into visualizations, and work with her or him. We have had a bootcamp every summer, I believe, to help interested colleagues get in and connect with data science students. Some young social scientists are quantitative too. They also take Data 8 and are good at applying statistics and mathematics. Some departments teach their own version of statistics classes. And it's healthy to share teaching and development and start bringing science training into statistics.
Q: Do you do any training in your class that can be applied to other science or industrial problems?
A: Both Data 8 and Data 100 include modern data analysis, although they are at the undergraduate level. For my applied statistics PhD class, I do a series of open-ended labs, and my master's students often tell me that they use my projects for job interviews. The class covers the whole process, including problem formulation, data cleaning, parallel computing, and high-level principles, in addition to modeling algorithms. We do machine learning or prediction first, before doing p-values.
Theme II: Future and challenges of data science research and education
Q: What do you think is the future or challenge of data science?
A: I think the biggest challenge for the stats or math community is culture change. It's a change in human behavior: how we deal with and react to things, and whether we can deal with diverse personalities. We need more people of the leadership type, because leadership shapes our culture in a forward-looking way. For example, you need to be comfortable with the straightforwardness of some CS people. We are more used to doing things on a one-on-one basis, but the Berkeley EECS department has over 90 people, and they are used to discussing things in a large group. If we are going to take co-leadership on data science issues, we need to become more open and proactive. Time won't wait for us; we have changed in our own way, but the world outside is changing faster. Get on the train and figure things out together. All the answers will be figured out. If you are on the train, you can answer better. We need to attract more engineering undergraduate students. Those undergraduate students already have a culture of team spirit, i.e., sharing and doing teamwork. We need to learn from them too.
[Scribe: Tongyi Tang (Statistics)]