Publications
Highlighted Research
Async Control: Stress-testing Asynchronous Control Measures for LLM Agents
We ran a red-team/blue-team game in realistic software engineering environments: the red team designed agents that attempt sabotage, and the blue team designed monitors to catch them.
ControlArena
Software for running AI control experiments.
Future events as backdoor triggers: Investigating temporal vulnerabilities in LLMs
Language models can have backdoors that trigger only on future events, so that they activate only during deployment.
Taken out of context: On measuring situational awareness in LLMs
Introduces the concept of "out of context learning" and methods for measuring situational awareness in large language models.
See all my publications on Google Scholar