2:30–3:30 pm
DSI 105, 5460 S University Ave, Chicago, IL 60615
Shreya Shankar
PhD Candidate in the Data Systems and Foundations Group
University of California, Berkeley
Title: Building Effective Unstructured Data Systems
Abstract: Databases and other data systems have successfully democratized data-oriented computation across domains, thanks to decades of research in system internals and end-user interfaces. However, such systems center on structured (i.e., tabular) data; unstructured data, which makes up the vast majority of data, has largely been ignored. Large language models (LLMs) now give us a building block for unstructured data analysis, and we face the same questions as in the early days of data systems (e.g., how should users author queries? How do we efficiently execute queries at scale?), but many well-established tenets from traditional data systems no longer hold. In my talk, I will present DocETL, a system I developed for unstructured data analysis. I will discuss how we had to rethink query optimization under these new assumptions, optimizing user-written pipelines for both accuracy and efficiency. I will also discuss how we rethought end-user interfaces for authoring, iterating on, and debugging pipelines. DocETL is open-source with 3.5k+ GitHub stars; our hosted interface has supported 4.1k+ pipelines across 30+ S&P-500 industries. Query optimization ideas from our work have been adopted in databases such as Snowflake and BigQuery, and our interface design principles have been adopted by companies like LangChain and OpenAI.
Bio: Shreya Shankar is a fifth- and final-year PhD student in the Data Systems and Foundations group at UC Berkeley, advised by Dr. Aditya Parameswaran. She is broadly interested in data systems, large language models, and human-computer interaction. Her PhD has been supported by an NDSEG Fellowship and a Bridgewater Research Fellowship, and her work has been recognized with EECS Rising Stars (2025) and a best paper honorable mention award at UIST. Beyond her research, Shreya authored the curriculum and companion book for AI Evals for Engineers and PMs, an industry course on evaluating AI applications taken by 4,000+ professionals from 500+ companies, including 50+ students each from Google, Microsoft, OpenAI, Meta, Amazon, Intuit, and First American. Before her PhD, and after completing her undergraduate degree in CS at Stanford, Shreya worked as the first data/ML engineer at a startup.