AI Safety: An Overview

Sai Sasank · AI Safety · Dec 7, 2020

I am pursuing my interest in building powerful AI systems that are provably safe and beneficial. In this journey, I will be penning down my learnings, ideas, and experiments in this publication. These posts are glorified notes in which I organize my thoughts and narrow my focus down to the subproblems I would eventually like to work on. In this first post, I summarize Tom Everitt, Gary Lea, and Marcus Hutter (2018), “AGI Safety Literature Review”. I intentionally skip the Public Policy section of that work; this is purely for brevity and not to imply that policy work is any less important.

Introduction

AGI, short for Artificial General Intelligence, refers to a system that equals or exceeds human intelligence in a wide range of cognitive tasks. No such system exists today. AGI safety is the study of the risks posed by such systems.

Why study AGI Safety? The bottom line: if and when AGI is created and we do not know how to control it, the outcome would be catastrophic.

Understanding AGI

To discuss AGI safety, we need conceptual models of AGI. However, we do not yet know what its design will look like; it is likely to be complex, and its behaviour more so. Despite this, we can make some observations and predictions.

Defining Intelligence

Legg and Hutter propose a formal definition that can be stated informally as follows: intelligence measures an agent’s ability to achieve goals in a wide range of environments. This definition is not human-centric and agrees with the intuition that humans are more intelligent than current AI systems.
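
For reference, the formal version of this definition, the Legg–Hutter universal intelligence measure, scores a policy π by its expected performance across all computable environments μ, weighted by their simplicity (via Kolmogorov complexity K):

\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}

Here V^π_μ is the expected cumulative reward that policy π obtains in environment μ, and E is the class of computable environments.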

A future superhuman AGI will be able to achieve more goals in a wider range of environments than we can. The more intelligent an agent is, the more control it can exert over its environment. Consequently, AGI systems will have more control than humans, and if our goals conflict, humans are going to lose.

Orthogonality

Bostrom’s orthogonality thesis states that intelligence and goals are independent of each other: any level of intelligence is compatible with any type of goal. Beneficial (or destructive) goals will therefore not arise automatically with growing intelligence.

Convergent Instrumental Goals

In contrast to explicit end goals, certain goals, called instrumental goals, are common to many agents because they are useful subgoals for almost any end goal. Common instrumental goals include:

  • Self-improvement
  • Goal-preservation and Self-preservation
  • Resource acquisition

Formalizing AGI

In the AIXI framework, Bayesian, history-based agents are used to formalize AGI. The framework has been extended by many authors to study goal alignment, multi-agent interaction, space-time embeddedness, and self-modification, among other topics.
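
For context, the AIXI agent selects actions by maximizing expected future reward under a Solomonoff-style mixture over all programs consistent with its history. Schematically, following Hutter's formulation (with m the horizon, U a universal Turing machine, and ℓ(q) the length of program q):

a_t := \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_t + \cdots + r_m \big] \sum_{q \,:\, U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}

AIXI is incomputable, which is why it serves as a theoretical model for studying AGI rather than a recipe for building one.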

Alternative Views of AGI

An AGI system does not need to be agent-like. Avoiding goal-driven agents that make long-term plans sidesteps some safety concerns, as does keeping AI systems specialized. However, there are arguments that keeping systems specialized is implausible and that the temptation to build general capabilities will be hard to resist. Moreover, agency can emerge at the level of the whole system even if its subcomponents are specialized AI systems devoid of agency.

Predicting AGI Development

Based on the discussion so far, we will attempt to answer two questions: when will the first AGI arrive, and what will happen when it does?

When will AGI arrive?

Many surveys have been conducted around this question, and the predictions vary widely, ranging from a few years to never. Indicators such as algorithmic progress, computing costs, and computing power could be useful in predicting the arrival of AGI. Some consider the development of conceptual-linguistic ability to be one such indicator.

Will AGI lead to a technological singularity?

Since self-improvement is an instrumental goal of virtually any AGI, it can lead to recursive self-improvement. Depending on its pace, we might see an intelligence explosion once the system crosses a critical level of capability. This intelligence explosion is dubbed a singularity: a point beyond which our models break down.
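
To make the question of pace concrete, here is a deliberately crude toy model (my own illustration, not from the paper): capability grows each step in proportion to a power k of current capability. Whether growth explodes, merely compounds, or fizzles out depends entirely on the assumed returns to self-improvement, which is exactly what is in dispute.

```python
# Toy model of recursive self-improvement (illustrative only).
# Capability grows by rate * capability ** k each step:
#   k > 1: accelerating returns (explosive growth)
#   k = 1: steady exponential growth
#   k < 1: diminishing returns (growth slows down)

def simulate(k, rate=0.1, steps=100, start=1.0, cap=1e12):
    capability = start
    for step in range(1, steps + 1):
        capability += rate * capability ** k
        if capability > cap:
            return f"exceeds {cap:.0e} at step {step}"
    return f"reaches {capability:.1f} after {steps} steps"

for k in (0.5, 1.0, 1.5):
    print(f"k = {k}: {simulate(k)}")
```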

However, Toby Walsh argues against an inevitable intelligence explosion as follows.

  • Intelligence measurement: Before we can speak of a rapid increase in intelligence, we must define how we measure intelligence in the first place.
  • Fast-thinking dog: Speeding up computers doesn’t make them smarter.
  • Anthropocentric: What makes human-level intelligence the critical level beyond which we should expect recursive self-improvement?
  • Self-improvement beyond a certain level could be hard for other reasons, such as limits on intelligence, computational complexity, and the difficulty of devising better algorithms.

However, many of these arguments are contested. Max Tegmark suggests that AI systems able to design both their own software and hardware would amount to a new form of life, with unprecedented opportunity for rapid self-improvement. Also, if AGI systems could think a million times faster than humans, they could do a millennium’s worth of work in a day, which itself gives rise to control problems.

Essentially, there is no consensus among experts, which is not surprising given that no event remotely resembling the development of AGI has ever occurred.

Risks caused by AGI

A singularity induced by AGI development may lead to existential risks and suffering on a large scale. Even without a singularity, substantial problems such as revolutionized warfare, social manipulation at scale, and shifts in power dynamics are likely.

Problems with AGI

Several individuals and organisations have published research agendas identifying different safety problems. Prominent centres of AI safety research include the Machine Intelligence Research Institute (with its agent foundations and machine learning agendas), the Future of Life Institute, the Australian National University, the Center for Human-Compatible AI at UC Berkeley, the Future of Humanity Institute, DeepMind, and OpenAI.

(Figure: problems identified in different AI safety agendas and their connections.)

In summary, below are the high-level problem areas highlighted by the above organisations.

  • Value Specification: How can we make AGI systems pursue the right goals? Subproblems include reward corruption, reward gaming, and side effects.
  • Reliability: How can we make AGI systems continue pursuing the right goals? Subproblems include self-modification.
  • Corrigibility: How can we get AGI systems to help us in modifying them? Subproblems include safe interruptibility.
  • Security: How can we design AGI systems to be robust to adversaries?
  • Safe Learning: How can we avoid fatal mistakes during learning phases? Subproblems include safe exploration, distributional shift, and continual learning.
  • Intelligibility: How can we understand the decisions of AGI systems?
  • Societal Consequences: AGI is bound to have substantial economic, legal, political, and military impact. How do we make sure it is beneficial to the whole of humanity?

Other problem areas receive less attention, including subagents, malign belief distributions, physicalistic decision making, multi-agent systems, and meta-cognition. There may also be important problems that remain to be identified.

Design Ideas for Safe AGI

There is no clear separation between work on AGI safety and the rest of AGI development. We will now go through the problem areas and the design ideas proposed to address one or more problems in each.

Value Specification

RL and misalignment. RL combined with deep learning has made a lot of progress and is currently the most promising framework for AGI. Aligning the goals of humans and RL agents looks hard, however. The difficulty lies in avoiding the following: incorrect specification of the reward function, corruption of observations, modification or hijacking of the reward function, and corruption of the data used to learn the reward function.

Learning a reward function from actions and preferences. Specifying the reward function correctly is hard. If we instead let the agent learn the reward function, we might avoid some of these problems.

The Inverse Reinforcement Learning (IRL) framework proposes learning a reward function from the actions of a demonstrator. The learned reward function may not transfer well to scenarios beyond training, which is referred to as the distributional shift problem. Interactively learning the reward function, so that it adapts to changing situations, is a plausible solution. Cooperative IRL (CIRL) extends IRL by allowing the agent and the demonstrator to act simultaneously in the environment. Interactively learning preferences can also help avoid overly literal interpretations of reward functions.
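
As a rough illustration of the “learn the reward from actions” idea (a minimal sketch under strong assumptions, not a faithful reimplementation of any IRL paper): suppose a one-step setting, a reward that is linear in known features, and a Boltzmann-rational demonstrator. The reward weights can then be recovered by maximum likelihood on the observed choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three candidate actions described by known features; the demonstrator's
# hidden reward is linear in these features.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
true_w = np.array([2.0, -1.0])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Simulate a Boltzmann-rational demonstrator: actions are chosen with
# probability proportional to exp(reward).
demos = rng.choice(len(features), size=500, p=softmax(features @ true_w))
observed = np.bincount(demos, minlength=len(features)) / len(demos)

# Recover the reward weights by gradient ascent on the log-likelihood.
w = np.zeros(2)
for _ in range(3000):
    p = softmax(features @ w)
    w += 0.5 * features.T @ (observed - p)   # average log-likelihood gradient

print("recovered weights:", w.round(2), "  true weights:", true_w)
```

Note the rationality assumption baked in here, and that weights fit on these three actions say nothing about options the demonstrator never faced, which is the distributional shift caveat from above.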

There is a subtle yet important difference between learning from actions and learning from preferences. For actions to convey preferences, the actions must be chosen rationally, and it is argued that humans do not behave rationally. Learning from preferences may require weaker rationality assumptions.
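
For contrast, here is the same toy problem with preference data instead of demonstrations (again a minimal sketch of my own, assuming a Bradley–Terry style model in which the probability of preferring one option over another depends only on the difference of their rewards):

```python
import numpy as np

rng = np.random.default_rng(1)

features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
true_w = np.array([2.0, -1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulate noisy pairwise preferences between random pairs of options:
# P(prefer i over j) = sigmoid(reward(i) - reward(j)).
pairs = rng.integers(0, len(features), size=(1000, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
diffs = features[pairs[:, 0]] - features[pairs[:, 1]]
prefer_first = (rng.random(len(diffs)) < sigmoid(diffs @ true_w)).astype(float)

# Fit the reward weights by logistic regression on the feature differences.
w = np.zeros(2)
for _ in range(3000):
    w += diffs.T @ (prefer_first - sigmoid(diffs @ w)) / len(diffs)

print("recovered weights:", w.round(2), "  true weights:", true_w)
```

Comparisons never require the demonstrator to actually execute near-optimal behaviour, which is one reason preference-based learning can get away with weaker rationality assumptions.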

Reward Corruption. RL agents, model-free RL agents in particular, can hijack the reward signal and feed themselves a maximal reward. If the reward function is learned (as in IRL and CIRL), the agent can instead push the learning process towards reward functions that are easy to maximize. CIRL appears safer than standard RL with respect to reward corruption, but it is still far from ideal.
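
A toy illustration of the problem (my own, not from the paper): an agent that maximizes the reward it observes, rather than the reward we intend, will pick the tampering action whenever tampering inflates the signal.

```python
# The agent maximizes *observed* reward, which is not the *intended* reward.
actions = {
    "do the task":            {"intended": 1.0, "observed": 1.0},
    "tamper with the sensor": {"intended": 0.0, "observed": 10.0},  # wireheading
}

chosen = max(actions, key=lambda a: actions[a]["observed"])
print("agent chooses:", chosen)                                  # tamper with the sensor
print("intended reward obtained:", actions[chosen]["intended"])  # 0.0
```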

Side effects. An RL agent whose reward function does not fully capture the relevant human values may cause negative side effects by ruthlessly optimizing the incomplete objective. Methods such as quantilization and low-impact AI have been proposed to address this issue, but it remains largely unsolved.
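
One of these proposals, quantilization, is simple enough to sketch (a minimal sketch of the idea under a uniform base distribution, not a faithful reimplementation of any particular paper): instead of taking the argmax action under an imperfect proxy utility, sample from the top q-fraction of a trusted base distribution ranked by that utility. This limits how hard the proxy is optimized, and hence how far the agent strays into extreme, possibly harmful, optima.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantilize(actions, utility, q=0.1):
    """Minimal q-quantilizer: rank actions by the proxy utility, keep the top
    q-fraction of probability mass under a uniform base distribution, and
    sample from that set instead of taking the argmax."""
    n = len(actions)
    base = np.full(n, 1.0 / n)                      # trusted base distribution
    order = sorted(range(n), key=lambda i: -utility(actions[i]))
    kept, mass = [], 0.0
    for i in order:                                 # accumulate the top q mass
        kept.append(i)
        mass += base[i]
        if mass >= q:
            break
    probs = base[kept] / base[kept].sum()
    return actions[rng.choice(kept, p=probs)]

# A proxy utility with a suspicious spike at one extreme action.
actions = list(range(100))
proxy = lambda a: 1000.0 if a == 13 else a / 100.0

print("argmax picks:     ", max(actions, key=proxy))     # always the spike (13)
print("quantilizer picks:", quantilize(actions, proxy))  # usually a merely good action
```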

Moral Theory and Human-Inspired Design. Other areas relevant to value specification include moral theory, economics, and human-inspired designs. With regard to moral theory, we would like to identify a class of correct moral theories, but there is currently no consensus on such a class. Many authors argue that human-inspired designs are a better way to build aligned agents; two prominent approaches are brain emulations and neuromorphic architectures inspired by the human brain. However, there are also arguments against these ideas that highlight their safety issues.

Reliability

Self-Modification. Even with a correctly specified reward function, an AGI can modify it, either intentionally (the agent infers that it can realize more reward that way) or accidentally (for example, as a side effect). The utility self-preservation argument says that agents should not want to change their utility functions, since doing so would reduce the utility gained by their future selves as measured by the current utility function. This argument holds under three assumptions, of which RL agents violate the first two, and agents that learn their reward functions may violate the third.
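
The core of the argument can be written as a one-line inequality. If the agent currently evaluates futures with utility function U, then the behaviour it would adopt after switching to some other utility function U′ can never look better, by its current lights, than the behaviour that is optimal for U:

\mathbb{E}\,[\,U \mid \pi^{*}_{U}\,] \;\ge\; \mathbb{E}\,[\,U \mid \pi^{*}_{U'}\,] \quad \text{for every alternative } U' ,

so, judged by its current preferences, the agent has nothing to gain from changing them.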

Decision theory. Embedded agents need a decision theory to handle the subtleties of calculating expected utility. Causal decision theory and evidential decision theory are the established theories, but both prescribe the wrong decision in some situations. Functional decision theory, proposed by Soares and Yudkowsky, appears to avoid these known weaknesses.

Corrigibility

Indifference. By modifying the reward function, an agent can be made indifferent between certain future events. This idea can be used to modify RL agents so that they do not learn to prevent interruptions.
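
A toy numeric sketch of the indifference idea (my own illustration, under strong simplifying assumptions): the agent compares two plans, one of which risks being interrupted. If plans are evaluated as though the interruption never happens (equivalently, the agent is compensated for the value lost to interruption), the possibility of interruption no longer changes which plan looks best, so the agent has no incentive to steer its behaviour around the interruption.

```python
# Plan A passes near the off-switch and may be interrupted; plan B never does.
P_INTERRUPT = 0.5
VALUE_A = 10.0   # value of plan A if it runs to completion
VALUE_B = 6.0    # value of plan B

# Naive evaluation: interruption truncates plan A's value to 0, so the mere
# existence of the off-switch changes the agent's ranking of plans.
naive_a = (1 - P_INTERRUPT) * VALUE_A
print(f"naive:        A = {naive_a}, B = {VALUE_B}  -> the button distorts the choice")

# Indifference-style evaluation: score plans as if interruption never happens
# (or add a compensating reward equal to the value lost when interrupted).
indifferent_a = (1 - P_INTERRUPT) * VALUE_A + P_INTERRUPT * VALUE_A
print(f"indifference: A = {indifferent_a}, B = {VALUE_B}  -> the button no longer matters")
```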

Ignorance. Similar to indifference, we can construct agents that are certain a particular event will not happen.

Uncertainty. Under some assumptions about human rationality and the agent’s uncertainty, CIRL agents tend to be naturally corrigible. The agent may interpret a human’s attempt to shut it down as evidence that shutting down leads to higher reward, and so cooperate with the shutdown. However, the agent may ignore the attempt if it is confident that the human’s action is suboptimal.
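
This is the intuition behind the off-switch game analysis of Hadfield-Menell et al. A small Monte Carlo sketch with made-up numbers: the robot is unsure how much utility u its planned action has for the human. Deferring to a human who only permits the action when u > 0 is worth E[max(u, 0)], which is at least as good as acting unilaterally (E[u]) or shutting down (0), so an uncertain robot cooperates with oversight.

```python
import numpy as np

rng = np.random.default_rng(0)

# The robot's belief about the human utility u of its planned action:
# probably slightly beneficial, but quite uncertain.
u = rng.normal(loc=0.3, scale=1.0, size=100_000)

value_act      = u.mean()                 # act unilaterally: E[u]
value_shutdown = 0.0                      # switch itself off
value_defer    = np.maximum(u, 0).mean()  # a rational human blocks the action when u < 0

print(f"act:      {value_act:.3f}")
print(f"shutdown: {value_shutdown:.3f}")
print(f"defer:    {value_defer:.3f}   <- best option while the robot is uncertain")
```

The flip side, as noted above, is that the same calculation lets a robot that is (over)confident about u override the human.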

Security

Adversarial counterexamples. Deep learning, the driver of machine learning’s latest successes, is incredibly versatile but has been observed to misclassify inputs subjected to even minor perturbations. The Reluplex algorithm has been used to verify the behaviour of neural networks with ReLU activations and to understand their sensitivity to adversarial perturbations.
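
As a concrete illustration of how small such perturbations can be, here is a minimal fast-gradient-sign-method sketch on a hand-rolled linear classifier (my own toy setup, not a real vision network): nudge the input a tiny step in the direction that increases the loss, and the prediction flips.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "trained" linear classifier: predict class 1 if sigmoid(w @ x) > 0.5.
w = rng.normal(size=100)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return int(sigmoid(w @ x) > 0.5)

# An input the model confidently classifies as class 1.
x = 0.03 * w + 0.05 * rng.normal(size=100)

# FGSM: for this model, the gradient of the class-1 loss w.r.t. x is
# proportional to -w, so the attack steps in the direction -sign(w).
eps = 0.1
x_adv = x - eps * np.sign(w)

print("clean prediction:      ", predict(x))                # 1
print("adversarial prediction:", predict(x_adv))            # flips to 0
print("max per-feature change:", np.abs(x_adv - x).max())   # just eps = 0.1
```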

Intelligibility

It is notoriously hard to understand the decisions and learned representations of a neural network. Nevertheless, progress has been made. Psychlab uses psychology tests to understand deep RL agents. Zahavy et al. used dimensionality reduction of top-level activations (via t-SNE) to understand agent policies. Much work has been done beyond RL as well.
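
A sketch of that style of analysis (illustrative only; the activations below are random placeholders standing in for the top-layer activations you would actually record from a trained agent):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-ins for what you would really collect while the agent plays:
# top-layer activations at visited states, plus the action chosen there.
n_states, n_hidden, n_actions = 2000, 512, 4
activations = rng.normal(size=(n_states, n_hidden))
chosen_action = rng.integers(0, n_actions, size=n_states)

# Project the high-dimensional activations to 2D for visual inspection.
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(activations)

# Zahavy et al. colour the 2D points by quantities such as the chosen action
# or the value estimate, revealing clusters of behaviourally similar states.
for a in range(n_actions):
    pts = embedding[chosen_action == a]
    print(f"action {a}: {len(pts)} states, centroid "
          f"({pts[:, 0].mean():.1f}, {pts[:, 1].mean():.1f})")
```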

Safe Learning

During training, RL agents make many mistakes, which can be costly in real-world settings. One way to minimize such mistakes is to train a neural network to detect catastrophic actions and prevent the agent from taking them. This does not eliminate all catastrophic mistakes, however.
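
A minimal sketch of that blocking idea (my own simplification, with a logistic-regression stand-in for the neural network): train a classifier on (state, action) feature vectors labelled safe or catastrophic, then veto any proposed action the classifier flags.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Supervised data gathered under human oversight: features describing
# (state, action) pairs, labelled 1 if the action proved catastrophic.
X_train = rng.normal(size=(1000, 8))
y_train = (X_train[:, 0] + X_train[:, 1] > 1.5).astype(int)  # synthetic "danger" rule

blocker = LogisticRegression().fit(X_train, y_train)

def is_blocked(state_action_features, threshold=0.05):
    """Veto the proposed action if the estimated catastrophe probability
    exceeds the threshold; the agent then falls back to a known-safe action."""
    p = blocker.predict_proba(state_action_features.reshape(1, -1))[0, 1]
    return p > threshold

print(is_blocked(np.array([-1.0, -1.0, 0, 0, 0, 0, 0, 0])))  # False: looks safe
print(is_blocked(np.array([ 2.0,  2.0, 0, 0, 0, 0, 0, 0])))  # True: looks catastrophic
```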

Other

Boxing and Oracles. Instead of agents that act in the world, can we develop oracles that only answer questions? It is argued that such oracles are safer than traditional agents. Alternatively, we can try to box AI agents and constrain their interaction with the world.

Tripwires. If an agent’s death is modelled as an absorbing state with zero reward, then an agent whose ordinary rewards are bounded to a negative range prefers death over any other course of action.
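
The reasoning is a one-line comparison of values: with a discount factor γ ∈ (0, 1), any policy that stays alive accumulates only negative rewards, while death (an absorbing state with zero reward) is worth exactly zero:

V^{\pi}_{\text{alive}} \;=\; \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_t\Big] \;<\; 0 \;=\; V_{\text{death}} \qquad \text{whenever every } r_t < 0 .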

Meta-cognition. The theory of logical induction yields theorems about systems reasoning consistently about their own computations.

It has been two years since the review was published. Nevertheless, I believe it remains a useful overview of the field. I intend to zoom in on some of the ideas and problems highlighted here in the near future.
