Single Agent Safety

Several dynamics may cause machine learning systems to act contrary to the intentions of their designers. ML systems lack transparency regarding how they make decisions and are vulnerable to adversarial attacks. They are also prone to pursuing goals in unintended and harmful ways, and can prove difficult to control adequately. Researchers looking to advance the safety of AI systems often inadvertently accelerate progress in their capabilities, thereby increasing overall risks.


ML systems have grown more competent and general as the field of deep learning has matured. Reasoning about the behavior and internal structure of such systems can be challenging, especially since some failure modes arise only once an AI system is sufficiently sophisticated. We discuss some of the fundamental technical challenges around monitoring, robustness and control of AI systems. Current AI systems lack transparency and can exhibit surprising emergent capabilities. They are vulnerable to adversarial examples, Trojans and other attacks. These challenges in turn may make it hard to control AI systems and prevent unintended behavior such as deception. When conducting research to advance AI safety, it is important to consider the risk of inadvertently accelerating AI capabilities in a way that undermines the overall goal of better understanding and controlling AI systems.

Further reading

Y. John, L. Caldwell, D. McCoy, and O. Braganza, "Dead rats, dopamine, performance metrics, and peacock tails: Proxy failure is an inherent risk in goal-oriented systems," Behavioral and Brain Sciences, vol. 1, pp. 1-68, 2023. doi: 10.1017/S0140525X23002753.

J. Carlsmith, "Is Power-Seeking AI an Existential Risk?" [Online]. Available:

R. Gallow, "Instrumental Convergence," [Online]. Available: instrumental_convergence.pdf

E. Hubinger et al., "Risks from Learned Optimization in Advanced Machine Learning Systems," [Online]. Available:, 2021.

R. Ngo et al., "The alignment problem from a deep learning perspective," [Online]. Available:, 2022.

D. Hendrycks et al., "Unsolved Problems in ML Safety," [Online]. Available:, 2021.

Discussion Questions

  1. Choose one of the sections from this chapter and summarize briefly what you see as the key safety challenges described there. What are some questions that would be interesting to explore further to go deeper on this topic?
  2. What links do you see between the different safety challenges discussed in this chapter?

Review Questions