Publications
Highlighted Research
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents
Benchmark for evaluating whether language model agents can autonomously replicate themselves.
Future events as backdoor triggers: Investigating temporal vulnerabilities in LLMs
Shows that language models can be backdoored with triggers tied to future events, so the backdoor activates only during deployment.
Taken out of context: On measuring situational awareness in LLMs
Introduces the concept of "out-of-context learning" and methods for measuring situational awareness in large language models.
See all my publications on Google Scholar →