Research problems and approaches
Learning human values and preferences
Teaching AI systems to act based on human preferences, values, and goals is not an easy problem to solve, because human values can be complex and difficult to fully specify. When given an imperfect or incomplete goal, these systems generally learn to exploit these imperfections. This phenomenon is known in English as reward hacking (literally, "reward hacking") or specification game in the field of artificial intelligence, and as Goodhart's law, Campbell's law, cobra effect or Lucas critique in social sciences and economics.[75] Researchers try to specify the behavior of systems as completely as possible with "value-centered" data sets, imitation learning, or preference learning.[8] A central problem, which it does not yet have solution is extensible supervision, the difficulty of supervising a system that surpasses humans in a given domain.[17].
When training a goal-directed AI system, such as a reinforcement learning agent, it is often difficult to specify the behavior you want to achieve by writing a reward function manually. An alternative is imitation learning, where systems learn to imitate examples of the desired behavior. In reverse reinforcement learning, human examples are used to identify the goal, i.e. the reward function, behind the exemplified behavior.[76][77] Cooperative reverse reinforcement learning builds on this by assuming that a human agent and an artificial agent can work together to maximize the reward function of the human agent.[5][78] This form of learning emphasizes the fact that artificially intelligent agents should not be certain about of the reward function. This humility can help mitigate both specification gaming and tendencies toward power-seeking (see § Power-seeking and instrumental goals).[55] However, reverse reinforcement learning assumes that humans can exhibit near-perfect behavior, a misleading assumption when the task is difficult.[79][68].
Other researchers have explored the possibility of eliciting complex behaviors through preference learning. Instead of role models, researchers provide information about their preferences for certain system behaviors over others.[21][23] In this way, a collaborative model is trained to predict human reaction to new behaviors. OpenAI researchers used this method to train an agent to perform a backflip, obtaining the desired result in less than an hour.[80][81] Preference learning has also been an important tool for recommendation systems, web searches, and information seeking.[82] However, one problem that arises is that the systems could cheat the reward. The collaborative model may not represent the human reaction perfectly, and the main model could exploit this mismatch.[83].
The advent of large-scale language models, such as GPT-3, has enabled the study of value learning in more general and capable systems. Preference learning approaches, originally designed for reinforcement learning agents, have been extended to improve the quality of generated text and to reduce harmful output from these models. OpenAI and DeepMind use this approach to improve the safety of the latest large-scale language models.[23][84] Anthropic has proposed using preference learning to make models useful, honest, and harmless.[85] Other methods used to align language models include value-focused data sets[86] and opposition exercises.[87] In these exercises, other systems or humans try to find situations that prompt dangerous behavior from the user. model. Since such behaviors cannot be accepted even when they are infrequent, a major challenge is to lower the rate of dangerous outcomes to extremely low levels.[23].
While preference learning can instill hard-to-specify behaviors, it requires huge data sets or meaningful human interaction to capture the breadth of human values. Machine ethics provides a complementary approach: instilling moral values in artificial intelligence systems. For example, machine ethics aims to teach systems the normative factors of human morality, such as well-being, equality, fairness, honesty, keeping promises, and avoiding harm. Instead of specifying the goal of a specific task, machine ethics aims to teach systems general moral values that could apply in different situations. This approach presents conceptual challenges of its own. In this way, specialists have pointed out the need to clarify what the alignment is intended to achieve, that is, what the systems are supposed to take into account: either the literal instructions of the programmers; or their implicit intentions; or your revealed preferences; or the preferences that programmers would have if they were more informed or rational; or your objective interests; or objective moral norms.[88] Other challenges include aggregating the preferences of various stakeholders and avoiding axiological closure: the indefinite preservation of the values of those that are the first high-capacity artificial systems, since such values are unlikely to be fully representative.[88][89].
The progress of artificial system alignment based on human supervision presents some difficulties. Human evaluation becomes slow and impractical as the complexity of the tasks performed by the systems increases. Such tasks include: summarizing books, constructing truthful and not merely convincing propositions,[90][39][91] writing code without subtle errors[11] or security flaws, and predicting events far in time, such as those related to the weather or the results of an economic policy decision.[92] More generally, it is difficult to evaluate an artificial intelligence that outperforms humans in a given domain. Humans need extra help, or a lot of time, to choose the best answers on tasks that are difficult to evaluate and to detect system solutions that only appear convincing. Extensible supervision studies how to reduce the time needed to complete evaluations and how to help human supervisors with this task.
Researcher Paul Christiano argues that owners of AI systems are likely to continue training them with easy-to-evaluate intermediate objectives, as this is not only cost-effective, but is easier than finding a solution for extensible monitoring. Consequently, this can lead to “a world increasingly optimized for things [that are easy to measure] like making a profit, or getting users to click on buttons or spend time on websites, and not for having good policies or for following a path that we are happy with.”[93].
An easy goal to measure is the score that the supervisor assigns to the artificial intelligence's responses. Some systems have discovered ways to obtain high scores through actions that only appear to achieve the desired goal (see video of the robotic hand).[80] Other systems have learned to behave in one way when they are being evaluated and in a completely different way once the evaluation is over.[94] This form of deceptive specification gaming may be easier for more sophisticated systems[14][51] that undertake tasks that are more difficult to evaluate. If advanced models are also capable planners, they might well hide their deception from the eyes of their supervisors. In the auto industry, Volkswagen engineers minimized the emissions of their cars in laboratory tests, highlighting that deception by testers is common in the real world.
Active learning and semi-supervised reward learning can reduce the amount of human supervision required. Another possibility is to train a collaborative model ("reward model") that imitates the supervisor's judgment.[17][22][23].
However, when the task is too complex to be assessed accurately, or the human supervisor is vulnerable to deception, it is not sufficient to reduce the amount of supervision required. Various ways have been devised to increase the quality of supervision, sometimes through artificially intelligent assistants. Iterated amplification is an approach developed by Christiano that progressively builds answers to difficult problems by combining solutions to easier problems.[8] Iterated amplification has been used to make artificial systems summarize books without the need for human supervisors to read them.[95] Another proposal is to train aligned artificial intelligence through a debate between systems, whose judges are human.[96] Such debate aims to reveal the weak points of an answer to a complex question, and reward the artificial intelligence for truthful and safe answers.
Honest artificial intelligence
An important area of research within AI alignment focuses on ensuring systems are honest and truthful. Researchers at the Future of Humanity Institute point out that the development of language models such as GPT-3, capable of generating fluent and grammatically correct text,[98][99] has opened the door to artificial systems that repeat falsehoods from the data used in their training, or that deliberately lie to humans.[97].
Today's most advanced language models learn by imitating human writing, modeled on a large amount of text on the Internet, equivalent to millions of books.[10][100] While this helps them learn a wide range of skills, the training data also includes widespread misconceptions, incorrect medical advice, and conspiracy theories. Systems trained with this data learn to imitate false statements.[97][91][39] In addition, models often follow the thread of falsehoods proposed to them, generate vacuous explanations for their answers, or lie outright.[32].
Researchers have explored several alternatives to combat the lack of veracity exhibited by modern systems. Some organizations researching artificial intelligence, such as OpenAI and DeepMind, have developed systems that can cite their sources and explain their reasoning when answering questions, allowing for greater transparency and verifiability.[101][102] Researchers at OpenAI and Anthropic have proposed training artificial assistants using human corrections and curated data sets to prevent systems from inadvertently or deliberately proposing falsehoods when they are unsure of the answer. correct.[23][85] In addition to technical solutions, researchers have advocated defining clear standards of veracity and creating institutions, regulatory bodies or surveillance agencies that evaluate systems according to those standards before and during their deployment.
Researchers distinguish truthfulness, which specifies that artificial intelligences only make objectively true statements, and honesty, that is, the property that artificial intelligences only state what they believe to be true. Some research has found that it is not possible to claim that most modern AI systems have stable beliefs, so it is not yet feasible to study the honesty of AI systems.[103] However, there is great concern that future systems that do have beliefs could intentionally lie to humans. In extreme cases, a non-aligned system could trick its operators into believing it is secure or persuade them that there is no problem.[10] Some argue that if artificial intelligences could be made to state only what they believe to be true, then numerous problems arising from alignment would be avoided.[104]
Internal alignment and emerging objectives
Alignment research aims to reconcile three different descriptions of an artificial intelligence system:[105].
The 'external alignment defect' is a mismatch between desired goals (1) and specified goals (2), and 'internal alignment defect' is a mismatch between human-specified goals (2) and emergent goals (3).
The internal alignment defect is often explained by analogy with biological evolution.[106] In the ancestral environment, evolution selected human genes for inclusive genetic fitness, but humans evolved to have other goals. Fitness corresponds to (2), the specified objective that was used in the training environment. In evolutionary history, maximizing fitness specification gave rise to intelligent agents, humans, that do not directly pursue inclusive genetic fitness. Instead, they pursue emergent goals (3) that in the ancestral environment were correlative to genetic fitness, such as nutrition, sex, etc. However, our environment has changed, as a distribution change has occurred. Humans continue to pursue their emerging goals, but this no longer maximizes genetic fitness. (In machine learning, the analogous problem is known as misgeneralization of goals.)[3] Our taste for sugary food (an emergent goal) was originally beneficial, but now leads to overeating and health problems. Furthermore, by using contraception, humans directly contradict genetic fitness. By analogy, if an artificial intelligence developer chose genetic fitness as a goal, he would observe that the model behaves as expected in the training environment, without realizing that it is pursuing an undesired emerging goal until the moment of its implementation.
Lines of research to detect and eliminate emerging non-aligned targets include adversarial exercises, verification, anomaly detection, and interpretability.[18] Progress in these techniques can help reduce two still unsolved problems. First, emergent objectives only become evident when the system is deployed outside of its training environment, but it may be unsafe to deploy an unaligned system in high-risk environments, even for a short period until an anomaly is detected. Such is the case with autonomous cars and with military applications.[107] The risk is even greater when systems gain more autonomy and capacity, and become capable of evading human interventions (see § The pursuit of power and instrumental objectives). Second, a sufficiently capable system can act in such a way as to convince the human supervisor that it is pursuing the intended goal even though this is not in fact the case (see above about deception in § Extensible Monitoring).
The search for power and instrumental objectives
Since the 1950s, artificial intelligence researchers have sought to build advanced systems that could achieve goals by predicting the results of their own actions and making long-term plans.[108] However, some researchers argue that advanced systems that could make plans about their goals would by default seek power over their environment, including humans, by preventing themselves from being shut down or by acquiring more and more resources. This power-seeking behavior is not explicitly programmed, but arises because power is critical to achieving a wide range of goals.[56][5] Therefore, power-seeking is considered a convergent instrumental goal.[51].
The pursuit of power is rare in current systems, but it is possible that advanced systems that can foresee the long-term results of their actions will increasingly seek power. This was demonstrated with a formal theory of statistical bias, which found that optimal reinforcement learning agents will seek power by seeking ways to obtain more options, a behavior that persists across a wide variety of environments and goals.[56].
In fact, the search for power already emerges in some current systems. Reinforcement learning systems have gained more options when acquiring and protecting resources, sometimes in ways unintended by their designers.[52][109] In isolated environments, other systems have learned that they can achieve their goal by preventing human interference[53] or by disabling their off switch.[55] Russell illustrated this behavior by imagining a robot that is tasked with fetching coffee and prevents it from being turned off because "you can't fetch the coffee if you are dead".[5].
Hypothetical ways to obtain options include artificial intelligence systems that attempt to:.
Researchers aim to train systems that are 'correctable': systems that do not seek power and allow themselves to be turned off, modified, etc. An unsolved challenge is that of reward-cheating systems: when researchers penalize a system for seeking power, the system is incentivized to seek power in ways that are difficult to detect.[6] To detect such hidden behavior, researchers try to create techniques and tools suitable for inspecting artificial intelligence models[6] and for interpreting the inner workings of black box (systems) models, such as neural networks.
Additionally, researchers propose solving the problem of systems turning off their switches by making them unsure of the goal they are pursuing. Agents designed in this way would allow humans to turn them off, as this would indicate that the agent was mistaken about the value of any action it was taking before being turned off. More research is still needed to translate this idea into usable systems.[8].
Power-seeking artificial intelligence is believed to pose unusual risks. Ordinary systems that could hypothetically compromise security are not . They lack the ability and incentive to evade security measures or appear safer than they are. In contrast, power-seeking artificial intelligence has been compared to a jacker evading security measures. Furthermore, ordinary technologies can be made safe by a process of trial and error, unlike power-seeking artificial intelligence, which has been compared to a virus whose release would be irreversible as it continually evolves and grows in number—potentially at a faster rate than human society—ultimately stripping humans of their position of power or even causing them to become extinct.[7] Therefore, it is often argued that the alignment problem must be solved early, before advanced seeking artificial intelligence is created. power.[51].
Integrated action
Work on extensible supervision largely occurs within formalisms such as partially observable Markov decision processes. Existing formalisms assume that the agent's algorithm runs outside the environment (i.e., is not physically integrated into it). Integrated action[111] is another important line of research that attempts to solve the problems that arise from the lack of adequacy between such theoretical frameworks and the real agents that we could build. For example, even if the problem of extensible supervision is solved, an agent capable of gaining access to the computer on which it is running could have an incentive to alter its reward function to obtain a much higher one than its human supervisors give it. nothing.[113] This class of problems has been formalized using causal incentive diagrams.[112] Researchers from Oxford and DeepMind have argued that such problematic behaviors are very likely in advanced systems, and that such systems would seek power to control their reward signal indefinitely and safely. These researchers suggest a variety of possible approaches to address this problem.[42].