Events: Lectures

Joint Computer Science and Data Science Institute Seminar: Shreya Shankar

2:30–3:30 pm DSI 105

Shreya Shankar
PhD Candidate in the Data Systems and Foundations Group
University of California, Berkeley

Title: Building Effective Unstructured Data Systems

Abstract: Databases and other data systems have successfully democratized data-oriented computation across domains, thanks to decades of research in system internals and end-user interfaces. However, such systems center on structured (i.e., tabular) data; unstructured data—the vast majority of data—has largely been ignored. Large language models (LLMs) now give us a building block for unstructured data analysis, and we face the same questions as in the early days of data systems—e.g., how should users author queries? How do we efficiently execute queries at scale?—but many well-established tenets from traditional data systems no longer hold. In my talk, I will present DocETL, a system I developed for unstructured data analysis. I will discuss how we had to rethink query optimization under these new assumptions, optimizing user-written pipelines for both accuracy and efficiency—as well as end-user interfaces for authoring, iterating on, and debugging pipelines. DocETL is open-source with 3.5k+ GitHub stars; our hosted interface has supported 4.1k+ pipelines across 30+ S&P-500 industries. Query optimization ideas from our work have been adopted in databases such as Snowflake and BigQuery, and our interface design principles have been adopted by companies like LangChain and OpenAI.

Feb 18

DSI Distinguished Speaker Series: Jeffrey Heer

12:30–2:30 pm DSI 105

Jeffrey Heer
Jerre D. Noe Endowed Professor of Computer Science & Engineering
University of Washington

Title: Augmenting Data Scientists: The Promise and Peril of AI-Assisted Analysis

Abstract: Data analysis is a rich sensemaking process, with frequent shifts among data representations, tools, and both conceptual & mathematical models. Computational methods can go beyond fitting models and rendering charts to make in-context recommendations and even guide end-to-end analysis workflows. How does the design of such tools affect people’s exploration, modeling, and understanding of data? In this talk, we will consider methods for augmenting data science work by integrating proactive computational support into interactive tools, with the goal of providing algorithmic assistance to augment and enrich, rather than replace, people’s intellectual work. Across tasks such as data transformation, visualization, and statistical modeling, we apply artificial intelligence to bridge gaps between user intent and robust analysis results. At the same time, we need to pay careful attention to ways these methods may exacerbate bias, foster dependence, and pose challenges for the future of data analysis.

Feb 20

Joint Statistics and DSI Colloquium: Mateo Díaz

11:30 am–12:30 pm DSI 105

Mateo Díaz
Assistant Professor
Department of Applied Mathematics and Statistics
Mathematical Institute for Data Science
Johns Hopkins University

Title: TBD

Abstract: TBD

Feb 23

Joint Statistics and DSI Colloquium: Jiaqi Zhang

4:00–5:00 pm DSI 105

Jiaqi Zhang
PhD Candidate
Massachusetts Institute of Technology

Title: Modeling Large-Scale Interventions

Abstract: Complex causal mechanisms among genes govern cellular functions in health and disease. Understanding these mechanisms can accelerate therapeutic discovery but remains challenging due to the large number of genes and their intricate dependencies. Recent advances in experimental technologies are making this problem increasingly tractable: it is now possible to systematically intervene on individual genes or gene combinations in single cells and measure their downstream effects, enabling empirical identification and validation of causal relationships. However, interventional data are costly to collect and high-dimensional, which makes them challenging to interpret.

In this talk, I will present our work tackling these challenges from three aspects. First, we introduced causal representation theories and algorithms with identifiability guarantees to uncover latent variables behind high-dimensional data. Second, we developed a method to model interventional data that can predict the effects of novel interventions with high accuracy, incorporating both distributional shifts and prior domain knowledge. Finally, we showed how predictive intervention modeling can improve future experimental design, illustrated by an application where we predicted and validated previously unknown T-cell regulators with therapeutic potential for cancer immunotherapy.

Feb 26