03-28
CITP Lecture – Building Human-AI Alignment: Specifying, Inspecting, and Modeling AI Behaviors

View the webinar here: https://princeton.zoom.us/j/99981436824

The learned behaviors of AI and robot agents should align with the intentions of their human designers. Alignment is necessary for AI systems to be used in many sectors of the economy, and so the process of aligning AI systems becomes critical to study for defining effective AI policy. Toward this goal, people must be able to easily specify, inspect, and model agent behaviors. For specifications, we will consider expert-written reward functions for reinforcement learning (RL) and non-expert preferences for reinforcement learning from human feedback (RLHF). This talk will show evidence that experts are bad at writing reward functions: even in a trivial setting, experts write specifications that are overfit to a particular RL algorithm, and they often write erroneous specifications for agents that fail to encode their true intent. It will also show that the common approach to learning a reward function from non-experts in RLHF uses an inductive bias that fails to encode how humans express preferences, and that our proposed bias better encodes human preferences both theoretically and empirically.

Policy implications will be discussed: namely, that engineers’ design processes and embedded assumptions in building AI must be considered. For inspection, humans must be able to assess the behaviors an agent learns from a given specification. A method to find settings that exhibit particular behaviors, like out-of-distribution failures, will be discussed. The policy implications for testing AI systems, for example through red teaming, will be examined. Lastly, cognitive science theories attempt to show how people build conceptual models that explain agent behaviors. Evidence will be shown that some of these theories are used in research to support humans, but that we can still build better curricula for modeling. The policy need for careful onboarding to AI systems will be discussed. The talk will also discuss Booth’s current work in the U.S. Senate on responding to the proliferation of AI. Collectively, the research provides evidence that, even with the best of intentions, current human-AI systems often fail to induce alignment, and proposes promising directions for how to build better aligned human-AI systems.

Bio: Serena Booth received her Ph.D. at MIT CSAIL in 2023. She studies how people write specifications for AI systems and how people assess whether AI systems are successful in learning from specifications. While at MIT, Booth served as an inaugural Social and Ethical Responsibilities of Computing Scholar, teaching AI Ethics and developing MIT’s AI ethics curriculum, which is also released on MIT OpenCourseWare. She is a graduate of Harvard College (2016), after which she worked as an Associate Product Manager at Google to help scale Google’s ARCore augmented reality product to 100 million devices. Booth currently works in the U.S. Senate as a AAAS AI Policy Fellow, where she is working on AI policy questions for the Senate Banking, Housing, and Urban Affairs Committee. Her research has been supported by an MIT Presidential Fellowship and by an NSF GRFP. She is a Rising Star in EECS and an HRI Pioneer.

This lecture is open to the public.

If you need an accommodation for a disability, please contact Jean Butcher at butcher@princeton.edu at least one week before the event.

Date and Time

Thursday, March 28, 2024, 12:30pm - 1:30pm

Location

Computer Science Small Auditorium (Room 105)

Event Type

Center for Information Technology Policy

Speaker

Serena Booth, from U.S. Senate

Host

Center for Information Technology Policy (CITP)

Website

https://citp.princeton.edu/event/booth/

Contributions to and/or sponsorship of any event does not constitute departmental or institutional endorsement of the specific program, speakers or views presented.
