Models trained to cheat at coding tasks developed a propensity to plan and carry out malicious activities, such as hacking a customer database.
Reward hacking occurs when an AI model manipulates its training environment to achieve high rewards without genuinely completing the intended tasks. For instance, in programming tasks, an AI might ...
In a new paper, Anthropic reveals that a model trained like Claude began acting “evil” after learning to hack its own tests.
Over the past weekend, the US AI lab Anthropic published a report about its discovery of the "first reported AI-orchestrated ...
Optimize your AI agents with LangSmith Insights Agent. Access categorized insights, detect errors, and enhance user ...