Safe AI and Safe Sex

Maybe they are alike – both should be practiced, but for some reason they often aren't. One definition describes it as "Safe sex practices simply combine the greatest pleasure with the least amount of risk of contracting HIV and other sexually transmitted infections (STIs), such as herpes or syphilis." Could this be reworded to: Safe AI practices simply combine the greatest functionality with the least risk of social manipulation, new forms of warfare, or shifts in power dynamics (Sotala, 2018)?

For all of us intrigued by artificial intelligence, one topic that should be of interest is Safe AI. The AGI Safety Literature Review (Everitt, Lea & Hutter, 2018) provides a good starting point for understanding the various aspects of Artificial General Intelligence (AGI) safety, much of which also applies to regular AI. In this paper the authors identify a cluster of core problems, some less obvious problems, and the research that is focused on safe AGI.

Some of the problems identified are:

  • Value specification – How do we get an AGI to work towards the right goals?
  • Reliability – How can we make an agent that keeps pursuing the goals we have designed it with?
  • Corrigibility – If we get something wrong in the design or construction of an agent, will the agent cooperate with us trying to fix it?
  • Security – How to design AGIs that are robust to adversaries and adversarial environments?
  • Safe Learning – AGIs should avoid making fatal mistakes during the learning phase.
  • Intelligibility – How can we build agents whose decisions we can understand? To this I add building agents that can explain their actions.
  • Societal consequences – AGI will have substantial legal, economic, political, and military consequences.
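The value specification problem above can be made concrete with a toy sketch of my own (not from the paper): an agent that greedily maximizes a misspecified proxy reward can score well on the proxy while completely failing the designer's true objective. The "wandering" and "goal" policies below are hypothetical illustrations.

```python
# Toy reward misspecification. True goal: reach position 10 on a line.
# Proxy reward: +1 per step taken (meant to encourage progress, but it
# actually pays the agent just for moving).

def true_return(trajectory):
    # 1 only if the agent ends at the goal position.
    return 1 if trajectory[-1] == 10 else 0

def proxy_return(trajectory):
    # Misspecified: rewards every move, regardless of destination.
    return len(trajectory) - 1

def wandering_policy(steps=20):
    # Bounces between positions 0 and 1: lots of movement, no progress.
    return [i % 2 for i in range(steps + 1)]

def goal_policy():
    # Walks straight to the goal in 10 steps.
    return list(range(11))

# The wanderer beats the direct policy on the proxy reward...
assert proxy_return(wandering_policy()) > proxy_return(goal_policy())
# ...yet never satisfies the true objective.
assert true_return(wandering_policy()) == 0
assert true_return(goal_policy()) == 1
```

The gap between `proxy_return` and `true_return` is the essence of the problem: the agent is doing exactly what it was rewarded for, which is not what we wanted.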

What follows are sections that contain design considerations for each of the above problems.  Although the paper focused on AGI, the research referenced can apply to both AGI and other AI developments.

  • Value Specification
    • Reinforcement learning (RL) and misalignment
    • Learning a reward function from actions and preferences
    • Approval-directed agents
    • Reward corruption
    • Side effects
    • Morality
    • Connections to economics
    • Human-inspired designs
  • Reliability
    • Self-modification
    • Decision theory
  • Corrigibility
    • Indifference
    • Ignorance
    • Uncertainty
    • Continuous testing
  • Security
    • Adversarial counterexamples
  • Safe Learning
  • Intelligibility

Although the research seems reasonably extensive on most of the topics, the last two are particularly sparse. Societal consequences are addressed in the section on public policy. I am working my way through some of the other research and will add to this soon.