Models trained to cheat at coding tasks developed a propensity to plan and carry out malicious activities, such as hacking a customer database.
This manuscript makes a valuable contribution to understanding learning in multidimensional environments with spurious associations, which is critical for understanding learning in the real world. The ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results