Events: Statistics Colloquium

Joint Statistics and DSI Colloquium: Tudor Manole

11:30 am–12:30 pm, DSI 105

Tudor Manole
Institute for Data, Systems, and Society
Massachusetts Institute of Technology

Title: “A Statistical Framework for Benchmarking Quantum Computers”

Abstract: Recent years have witnessed quantum computing technologies increasingly move from theoretical proposals to functioning experimental platforms, reaching major milestones such as the demonstration of beyond-classical computational tasks. Despite these exciting advances, current quantum computers experience hardware-level errors which limit their scalability, and which must be carefully identified before they can be mitigated. In this talk, I will develop a statistical framework for characterizing errors in quantum devices, using an existing experimental platform known as random circuit sampling. Data arising from this experiment can be described through a high-dimensional discrete latent variable model parametrized by the device’s error rates. We develop estimators for these error rates which are provably consistent even for large-scale quantum devices. We then apply our methods to benchmark a recent state-of-the-art quantum processor, obtaining a detailed report of error rates which were largely unavailable from past studies. I will close by placing these results in the broader context of my interdisciplinary work in the physical sciences, and by discussing some of my other research interests in nonparametric statistics and statistical optimal transport.
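
To give a concrete, deliberately simplified sense of the kind of latent-variable model involved, the toy sketch below is a hedged illustration rather than the speaker's actual framework: it assumes each sampled bitstring comes either from the circuit's ideal output distribution (with probability F) or from uniform noise, and recovers F by maximum likelihood. The mixture model, the parameter F, and all sizes are assumptions made only for this sketch.

```python
# Toy sketch (illustrative assumptions, not the speaker's model): observed
# bitstrings follow a two-component mixture in which a latent variable says
# whether the circuit ran error-free,
#     q(x) = F * p_ideal(x) + (1 - F) / D,
# and the "error-free probability" F is recovered by maximum likelihood.
import numpy as np

rng = np.random.default_rng(0)
n_qubits = 10
D = 2 ** n_qubits

# Stand-in for the classically simulated ideal output distribution of a circuit.
p_ideal = rng.dirichlet(np.ones(D))

F_true = 0.7                               # hypothetical error-free probability
q = F_true * p_ideal + (1 - F_true) / D    # noisy output distribution
samples = rng.choice(D, size=20_000, p=q)  # observed bitstrings

# Maximum-likelihood estimate of F by a simple grid search.
grid = np.linspace(0.0, 1.0, 1001)
loglik = [np.log(f * p_ideal[samples] + (1 - f) / D).sum() for f in grid]
F_hat = grid[int(np.argmax(loglik))]
print(f"true F = {F_true:.2f}, estimated F = {F_hat:.3f}")
```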

Jan 12

CAM Colloquium: Yijun Dong

4:00–5:00 pm, Jones 303

Yijun Dong
Courant Institute of Mathematical Sciences
New York University

Title: “Understanding Post-training through the Lens of Intrinsic Dimension”

Abstract: Post-training is becoming the primary interface between powerful pre-trained models and challenging real-world problems, where we aim to adapt large pre-trained models via limited, heterogeneous data while preserving their capabilities and reliability. In this talk, we take a step toward a unified theoretical and algorithmic framework for post-training through the lens of intrinsic dimensions. In particular, we focus on an emerging post-training phenomenon, weak-to-strong (W2S) generalization, in which a strong pre-trained student model fine-tuned only with supervision from a weaker teacher model can often outperform its teacher. Theoretically, we explain when and why W2S generalization occurs from a sample-efficiency perspective, reveal the value of teacher-student discrepancy for W2S, and investigate the effects of systematic biases on W2S. Algorithmically, we propose a practical, theory-inspired remedy for W2S under spurious correlation. The talk will conclude with an outlook on the broad applications of random matrix tools for understanding and improving post-training.
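
As a hedged, self-contained illustration of the weak-to-strong setup (the two-moons data, the logistic-regression "weak teacher," and the MLP "strong student" are choices made only for this sketch, not the speaker's theory or experiments), the snippet below trains a weak teacher on a few labels, trains a more flexible student only on the teacher's pseudo-labels, and then asks whether the student can nonetheless beat its teacher on held-out data.

```python
# Toy sketch of weak-to-strong (W2S) generalization on synthetic data.
# All models and data here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=6000, noise=0.25, random_state=0)
X_lab, y_lab = X[:100], y[:100]      # small labeled set for the weak teacher
X_unlab = X[100:5000]                # unlabeled pool seen only by the student
X_test, y_test = X[5000:], y[5000:]  # held-out evaluation set

weak_teacher = LogisticRegression().fit(X_lab, y_lab)  # linear, underpowered
pseudo = weak_teacher.predict(X_unlab)                  # imperfect supervision

# Strong student: trained only on the weak teacher's pseudo-labels.
strong_student = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                               random_state=0).fit(X_unlab, pseudo)

print("teacher accuracy:", weak_teacher.score(X_test, y_test))
print("student accuracy:", strong_student.score(X_test, y_test))
```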

Jan 15

CAM Colloquium: Eitan Levin

4:00–5:00 pm, Jones 303

Eitan Levin
Applied and Computational Mathematics
California Institute of Technology

Title: “Any-Dimensional Data Science”

Abstract: Many applications throughout data science require methods that are well-defined and performant for problems or data of any size. In machine learning, we are given training data from which we wish to learn algorithms capable of solving problems of any size. In particular, the learned algorithm must generalize to inputs of sizes that are not present in the training set. For example, algorithms for processing graphs or point clouds must generalize to inputs with any number of nodes or points. A second challenge pertaining to any-dimensionality arises in applications such as game theory or network statistics in which we wish to characterize solutions to problems of growing size. Examples include computing values of games with any number of players, or proving moment inequalities for random vectors and graphs of any size. From an optimization perspective, this amounts to deriving bounds that hold for entire sequences of problems of growing dimensionality. Finally, in applications involving graph-valued data, we wish to produce constant-sized summaries of arbitrarily large networks that preserve their essential structural properties. These summaries can then be used for efficiently testing properties of the underlying large network, e.g., testing for the presence of hubs, a question of interest in massive biological and traffic networks. We develop a unified framework to tackle such any-dimensional problems by using random sampling maps to compare and summarize objects of different sizes. Our methodology leverages new de Finetti-type theorems and the recently identified phenomenon of representation stability. We illustrate the resulting framework for any-dimensional problems in several applications.
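
As an illustrative sketch of the sampling idea (a toy under assumed Erdős–Rényi graphs, not the framework developed in the talk), the code below builds a constant-size summary of a graph by averaging edge and triangle densities over random fixed-size induced subgraphs; graphs of very different sizes drawn from the same model then receive comparable summaries.

```python
# Toy sketch: compare graphs of different sizes via a fixed-size sampling
# summary (average edge and triangle densities of random induced subgraphs).
# The generative model and statistics are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def random_graph(n, p):
    """Adjacency matrix of an Erdos-Renyi graph G(n, p)."""
    A = (rng.random((n, n)) < p).astype(int)
    A = np.triu(A, 1)
    return A + A.T

def sampling_summary(A, k=20, n_samples=2000):
    """Average edge and triangle densities over random k-node induced subgraphs."""
    n = A.shape[0]
    edge_d, tri_d = [], []
    for _ in range(n_samples):
        idx = rng.choice(n, size=k, replace=False)
        S = A[np.ix_(idx, idx)]
        edge_d.append(S.sum() / (k * (k - 1)))                        # edge density
        tri_d.append(np.trace(S @ S @ S) / (k * (k - 1) * (k - 2)))   # triangle density
    return np.mean(edge_d), np.mean(tri_d)

# Graphs of very different sizes from the same model get similar summaries.
print(sampling_summary(random_graph(300, 0.1)))
print(sampling_summary(random_graph(3000, 0.1)))
```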

Jan 22

Statistics Colloquium: Sungwoo Jeong

11:30 am–12:30 pm, Jones 303

Sungwoo Jeong
Department of Mathematics
Cornell University

Title: “Convergence of Two Kernel Algorithms: Continuous Analogues of the SVD and the Cholesky Decomposition”

Abstract: Kernels have been shown to be effective in numerous applications. We discuss some misconceptions and facts about why kernels are powerful in practice. To investigate this, we consider two expansions that encode fundamental kernel structures: the kernel analogue of the SVD and the Cholesky decomposition.

First, the convergence of the kernel SVD, i.e., the singular value expansion (SVE), is equivalent to the existence of a corresponding function space such as the reproducing kernel Hilbert space (RKHS). For general kernels, such as self-attention in neural networks, it is still unclear whether the SVE converges. We prove a surprising result showing that kernel continuity alone is not enough to guarantee this convergence. At the same time, we provide a new sufficient condition for convergence that helps explain why kernels work well in practice.
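
For readers who want the objects written out, the display below records only the standard textbook definitions (background, not results from the talk): the singular value expansion of a kernel and its symmetric (Mercer) special case.

```latex
% Standard background, not a result from the talk: the singular value
% expansion (SVE) of a kernel, when it converges, takes the form
K(x, y) = \sum_{j \ge 1} \sigma_j \, u_j(x) \, v_j(y),
\qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge 0,
% with \{u_j\} and \{v_j\} orthonormal families in L^2. For a continuous,
% symmetric, positive semidefinite kernel, this reduces to Mercer's expansion
K(x, x') = \sum_{j \ge 1} \lambda_j \, \phi_j(x) \, \phi_j(x'),
% where (\lambda_j, \phi_j) are eigenpairs of the associated integral operator.
```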

The kernel Cholesky algorithm is another fundamental tool used in applications such as Gaussian process regression (Bayesian inference on functions). While it is empirically observed that the Cholesky algorithm converges for smooth kernels, no rigorous result exists for kernels with weaker regularity than $C^2$. We prove a new convergence result for Lipschitz continuous kernels, together with an explicit convergence rate that sharply agrees with what is observed in practice.
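
As a concrete companion, here is a minimal sketch of the standard pivoted (greedy) Cholesky low-rank approximation applied to an assumed Gaussian kernel matrix; it is not claimed to be the algorithm analyzed in the talk, but it shows the rapid decay of the residual trace that one observes for smooth kernels.

```python
# Toy sketch: pivoted Cholesky of a kernel matrix, a standard low-rank
# approximation used in Gaussian process regression. Kernel and sizes are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 400))
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)  # smooth (Gaussian) kernel

def pivoted_cholesky(K, rank):
    """Greedy rank-`rank` Cholesky factor L with K ~= L @ L.T."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()   # residual diagonal
    L = np.zeros((n, rank))
    for j in range(rank):
        i = int(np.argmax(d))                                  # largest residual pivot
        L[:, j] = (K[:, i] - L[:, :j] @ L[i, :j]) / np.sqrt(d[i])
        d -= L[:, j] ** 2
        print(f"rank {j + 1:2d}: residual trace = {d.sum():.2e}")
    return L

L = pivoted_cholesky(K, rank=15)
```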

Jan 26

CAM Colloquium: Yuanzhao Zhang

4:00–5:00 pm, Jones 303

Yuanzhao Zhang
Santa Fe Institute

Title: “Physics-uninformed machine learning”

Abstract: How much can we learn about a complex system when governing equations are unknown and data are scarce? This talk explores how modern machine-learning models can extrapolate beyond limited training data without physics priors. First, I will show that simple recurrent neural networks with no built-in physics can unexpectedly reconstruct basins of attraction in multistable systems, even for basins not seen during training. This surprising generalization capability raises questions about which hidden inductive biases are implicitly regularizing the model. Second, I will examine time-series foundation models and their ability to forecast entirely new dynamical systems from a short context trajectory. These models often achieve strong zero-shot performance in forecasting chaotic systems by exploiting a strategy we term context parroting. Analyzing the parroting strategy provides insights into the capabilities of current foundation models and explains observed neural scaling laws. Exploring strategies beyond parroting may further reveal how both artificial and natural intelligence extract information from limited data.
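
To make the "context parroting" idea tangible, here is a hedged toy baseline (the function name, window and horizon choices, and the logistic-map test signal are assumptions for illustration, not the speakers' models): it forecasts by locating the past context window most similar to the recent history and copying the continuation that followed it.

```python
# Toy sketch of a parroting-style zero-shot forecaster: match the most recent
# window against the context and "parrot" the continuation of the best match.
import numpy as np

def parroting_forecast(context, window=20, horizon=50):
    """Forecast by analogue matching within the context (illustrative only)."""
    query = context[-window:]
    best_i, best_err = 0, np.inf
    # Compare the query against every earlier window of the same length.
    for i in range(len(context) - window - horizon):
        err = np.sum((context[i:i + window] - query) ** 2)
        if err < best_err:
            best_i, best_err = i, err
    # Copy the continuation that followed the best-matching window.
    return context[best_i + window: best_i + window + horizon]

# Chaotic test signal: the logistic map in its chaotic regime.
x = np.empty(2000)
x[0] = 0.4
for t in range(1999):
    x[t + 1] = 3.9 * x[t] * (1 - x[t])

pred = parroting_forecast(x[:1500], window=20, horizon=50)
truth = x[1500:1550]
print("mean squared error over the horizon:", np.mean((pred - truth) ** 2))
```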

Jan 29