
Statistics Colloquium: Linjun Zhang
11:30 am–12:30 pm Jones 303
Linjun Zhang, Associate Professor, Department of Statistics, Rutgers University
Title: A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models
Abstract: Large Language Models (LLMs) have gained enormous popularity in recent years. However, the training of LLMs has raised significant privacy and legal concerns, particularly regarding the inclusion of copyrighted materials in their training data without proper attribution or licensing, which falls under the broader issue of data misappropriation. In this article, we focus on a specific problem of data misappropriation detection, namely, determining whether a given LLM has incorporated data generated by another LLM. To address this issue, we propose embedding watermarks into the copyrighted training data and formulating the detection of data misappropriation as a hypothesis testing problem. We develop a general statistical testing framework, construct a pivotal statistic, determine the optimal rejection threshold, and explicitly control the type I and type II errors. Furthermore, we establish the asymptotic optimality of the proposed tests and demonstrate their empirical effectiveness through extensive numerical experiments.
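To make the testing setup concrete: the abstract does not specify the talk's actual statistic, but a minimal Python sketch of one common style of watermark test (a green-list token count, standardized under the null) illustrates the pivotal-statistic idea. All names and parameters here (watermark_z_statistic, gamma, alpha) are illustrative assumptions, not the speaker's method.

```python
import math
from statistics import NormalDist

def watermark_z_statistic(green_count: int, n_tokens: int, gamma: float = 0.5) -> float:
    """Pivotal statistic for H0: the model was NOT trained on watermarked data.

    Under H0, each generated token falls in the watermark 'green list'
    independently with probability gamma, so green_count ~ Binomial(n, gamma)
    and the standardized count is asymptotically N(0, 1).
    """
    mean = gamma * n_tokens
    var = gamma * (1.0 - gamma) * n_tokens
    return (green_count - mean) / math.sqrt(var)

def reject_null(green_count: int, n_tokens: int,
                gamma: float = 0.5, alpha: float = 0.01) -> bool:
    """One-sided level-alpha test: reject H0 (flag misappropriation) when
    the z-statistic exceeds the standard-normal 1 - alpha quantile, which
    controls the type I error at alpha."""
    z = watermark_z_statistic(green_count, n_tokens, gamma)
    return z > NormalDist().inv_cdf(1.0 - alpha)
```

For example, observing 620 green tokens out of 1,000 with gamma = 0.5 gives z ≈ 7.6, far above the 1% critical value of about 2.33, so the test rejects H0.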

Bahadur Memorial Lectures: John Lafferty (Day 1)
11:30 am–12:30 pm Jones 303
Title: Abstraction in Artificial and Natural Intelligence: Part I: Relational and Sequential Reasoning
Abstract: Two broad types of natural intelligence are used by humans (and other animals). One type is used to acquire semantic and procedural knowledge about the world. Another type is used to identify novel associations and relations. This second type of intelligence often requires very little data, but significant time to “think” and search for solutions; recent AI models mimic this type of intelligence using “chain of thought.” We present a framework for modeling relational learning and abstraction, using an inductive bias called the relational bottleneck. To assess the flexibility of the relational bottleneck, a universal approximation theory is developed. To analyze the advantages of sequential reasoning, an extension of statistical learning theory for autoregressive models is proposed. This offers insight into how chain-of-thought sequential supervision can improve learning efficiency.
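As a purely illustrative example of the relational bottleneck idea, the sketch below restricts a toy same/different task so that the decision depends only on pairwise relations between objects (here, cosine similarities), never on the raw feature vectors themselves. The function names and the threshold tau are assumptions for illustration, not the models presented in the lecture.

```python
import numpy as np

def relation_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise relations R[i, j] = <x_i, x_j> between row-vector objects.

    The relational bottleneck: everything downstream sees only R,
    so the model must reason about how objects relate, not what they are.
    """
    return X @ X.T

def same_different(X: np.ndarray, tau: float = 0.9) -> bool:
    """Toy relational task: are objects 0 and 1 'the same'?

    Decided purely from the cosine-normalized relation matrix; the
    identities of the underlying features are never consulted directly.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    R = relation_matrix(Xn)
    return bool(R[0, 1] > tau)
```

Because the decision rule is a function of relations alone, it can transfer to objects with entirely new features, which is the sample-efficiency advantage this kind of inductive bias is designed to capture.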

Bahadur Memorial Lectures: John Lafferty (Day 2)
3:30–4:30 pm Jones 303
Title: Abstraction in Artificial and Natural Intelligence: Part II: Models, Mechanisms, and Experiments
Abstract: Reasoning in terms of relations, analogies, and abstraction is a hallmark of human intelligence. How can abstract symbols emerge from distributed, neural representations? One general approach uses an inductive bias for learning called the “relational bottleneck” that is motivated by principles of cognitive neuroscience. We present a framework that builds this inductive bias into machine learning models that transform distributed representations to implement a form of abstraction. Computational experiments are presented on a broad range of problems. Biologically plausible mechanisms for these models are proposed to shed light on how abstraction may be implemented in the human brain.

Statistics Colloquium: Jiaying Gu
11:30 am–12:30 pm Jones 303
Jiaying Gu, Department of Economics, University of Toronto
“TBA”

Statistics Colloquium: Anne van Delft
11:30 am–12:30 pm Jones 303
Anne van Delft, Department of Statistics, Columbia University
“TBA”

Statistics Colloquium: Michael Hudgens
11:30 am–12:30 pm Jones 303
Michael Hudgens, Department of Biostatistics, University of North Carolina at Chapel Hill
“Causal Inference in Infectious Disease Prevention Studies”