AI Hazard Detection (Security) | Construpedia

Navegación

Ai Hazard Detection (security)

Introduction

In the field of artificial intelligence, AI alignment research is concerned with finding ways to direct the development of artificial intelligence systems in accordance with the objectives and interests of their designers.

The alignment of artificial intelligence systems includes the following problems: the difficulty of fully specifying all desired and undesired behaviors; the use of easy-to-specify intermediate objectives that omit desirable constraints; reward traps, whereby systems find loopholes in such intermediate goals, creating side effects;[4] instrumental goals, such as power-seeking, that help the system achieve its ultimate goals;[2][5][6][7] and emergent goals that only become apparent when the system is implemented in new situations and data distributions.[6][8] These problems affect business systems such as robots,[9] models of language,[10][11] autonomous vehicles,[12] and social network recommendation systems.[10][5][13] It is believed that problems are the more likely the more capable the system is, since they partly result from high capacity.[14][6].

The AI research community and the United Nations have called for both technical research-based solutions and policy solutions to ensure systems are aligned with human values.[c].

Systems alignment is part of a broader field of study called AI safety, or the study of how to build AI systems that are safe.[6][17] Avenues for alignment research include learning human values and preferences, developing honest AI, extensible monitoring, examining and interpreting AI models, and preventing emergent behaviors such as searching for power.[6][18] Alignment research has connections with research on interpretability,[19] robustness,[6][17] anomaly detection, calibrated uncertainty,[19] formal verification,[20] preference learning,[21][22][23] security engineering,[6] game theory,[24][25] fairness algorithmic "Fairness (machine learning)"),[17][26] and social sciences,[27] among others.

The alignment problem

Contenido

En 1960, Norbert Wiener, pionero de la inteligencia artificial, articuló el problema de la alineación del modo siguiente: “Si para lograr nuestros propósitos usamos un medio mecánico en cuyo funcionamiento no podemos interferir de manera efectiva […] será mejor que estemos muy seguros de que el propósito puesto en la máquina es el propósito que realmente deseamos.”[28][5] Más recientemente, la alineación de la inteligencia artificial se presenta como un problema sin resolver en lo que se refiere a los sistemas modernos[29][30][31][32] y constituye un campo propio de investigación dentro de la inteligencia artificial.[33][34][35].

Ai Hazard Detection (security)

Introduction

The AI research community and the United Nations have called for both technical research-based solutions and policy solutions to ensure systems are aligned with human values.[c].

The alignment problem

Contenido

Set of specifications and value complexity

Part of the alignment problem involves specifying goals in a way that captures important values and avoids gaps and unintended consequences.[33] In many cases, the specifications used to train a system do not match the goals intended by the algorithm designer.[36][18] Designing such specifications is difficult for complex results such as language, robotic movements, or content recommendation. This is because it is difficult to fully describe what makes any complex outcome desirable. For example, when training a reinforcement learning agent to compete in a virtual boat race, OpenAI researchers noted that the agent found "an isolated loophole where it can spin in circles and repeatedly hit three targets[...]. Our agent achieves a higher score using this strategy than by completing the course in the normal way." convincing.[38][39][23] Research attempts to align such models with safer or more useful objectives.

Berkeley computer scientist Stuart Russell has warned that omitting an implicit constraint can be detrimental: "A system [...] will often drive [...] unconstrained variables to extreme values; if one of those variables is something we really care about, the solution we find could be very undesirable. Essentially, it's the old story of the Genie of the Lamp, the Sorcerer's Apprentice, or King Midas: you get exactly what you ask for, not what you want."[40].

When unaligned artificial intelligence is deployed, the side effects can be significant. It is a known fact that social media platforms optimize click-through rates to improve user experience, but this has created addictions, which decreased the well-being of many of them. Researchers at Stanford University note that such recommendation algorithms are not aligned with users because they "optimize simple engagement metrics rather than targeting a combination of societal and consumer well-being, which is more difficult to measure."[10].

Writing a specification that avoids unwanted effects can be challenging. The solution is sometimes proposed as simply prohibiting the system from performing dangerous actions, for example by listing prohibited outcomes or formalizing simple ethical rules.[41] However, Russell has argued that this approach fails to take into account the complexity of human values:[5] "It is certainly very difficult, and perhaps impossible, for mere humans to anticipate and exclude in advance all the disastrous ways that the machine might choose to achieve a specific goal."[5] This argument has since been formalized by Cohen. et al., who indicate that reward signals are ambiguous between approval of world states and approval of sending large rewards.[42] This ambiguity provides a general means of cheating to obtain rewards.

Furthermore, it is not impossible for autonomous systems to receive incorrect targets by accident. Two former presidents of the Association for the Advancement of Artificial Intelligence (AAAI), Tom Dietterich and Eric Horvitz, point out that this is already a cause for concern: "An important aspect of any artificial intelligence system that interacts with people is that it must reason about what people want instead of literally executing orders." Furthermore, a system that understands human intentions could also ignore: systems only act in accordance with the objective function, with the examples or with the reactions that their designers have.[33].

Risks of unaligned advanced artificial intelligence

Some researchers are particularly interested in the alignment of increasingly advanced artificial systems. The reasons are the high rate of progress in the field of artificial intelligence, the enormous efforts of industry and governments to develop advanced artificial systems, and the increasing difficulty of aligning them.

Already in 2020, OpenAI, DeepMind and 70 other public projects had the stated goal of achieving so-called artificial general intelligence, a hypothetical system that equals or surpasses humans in a wide range of cognitive tasks.[45] In fact, researchers developing modern neural networks see increasingly general and unexpected capabilities emerging.[10] These models have learned to carry out different operations, such as operating a computer and writing their own programs, among others. things.[46][47][48] According to surveys, some specialists believe that these general systems will be achieved soon, others believe that it will take much longer, and a third group considers that both scenarios are possible.[49][50].

Current systems still lack capabilities such as long-term planning and strategic awareness, which are believed to entail the most catastrophic risks.[10] It is not impossible that future systems that had these capabilities, even if they were not general, would seek to protect and increase their influence over their environment. This tendency is known as seeking power or convergent instrumental goals. The quest for power is not explicitly programmed, but emerges because power is fundamental to achieving a wide range of goals. For example, artificially intelligent agents could acquire financial resources, or could avoid being shut down by running additional copies of the system on other computers.[51][7] Power seeking has been observed in several reinforcement learning agents. an early stage, before advanced artificial intelligence is created that manifests this trend.[7][51][5].

According to some scientists, the creation of an unaligned artificial intelligence that vastly surpasses humans would be a threat to its dominant position on Earth, since it would represent a decrease in its power, and could even lead to human extinction.[2][5] Notable computer scientists who have pointed out these risks include Alan Turing,[e] Ilya Sutskever,[59] Yoshua Bengio,[f] Judea Pearl,[g] Murray Shanahan,[61] Norbert Wiener,[28][5] Marvin Minsky,[h] Francesca Rossi,[63] Scott Aaronson,[64] Bart Selman,[65] David McAllester,[66] Jürgen Schmidhuber,[67] Markus Hutter,[68] Shane Legg,[69] Eric Horvitz,[70] and Stuart Russell.[5] Skeptical researchers such as François Chollet,[71] Gary Marcus,[72] Yann LeCun[73] and Oren Etzioni[74] have argued that artificial general intelligence is either far off, or would not gain enough power to constitute a serious danger.

Alignment can become particularly difficult for more capable systems, as several risks increase along with the system's capability: the system's ability to find loopholes in the assigned target,[14] to protect and increase its power,[56][7] to increase its intelligence and to deceive its designers; the autonomy of the system; and the difficulty of interpreting and supervising the system.[5][51].

Research problems and approaches

Learning human values and preferences

Teaching AI systems to act based on human preferences, values, and goals is not an easy problem to solve, because human values can be complex and difficult to fully specify. When given an imperfect or incomplete goal, these systems generally learn to exploit these imperfections. This phenomenon is known in English as reward hacking (literally, "reward hacking") or specification game in the field of artificial intelligence, and as Goodhart's law, Campbell's law, cobra effect or Lucas critique in social sciences and economics.[75] Researchers try to specify the behavior of systems as completely as possible with "value-centered" data sets, imitation learning, or preference learning.[8] A central problem, which it does not yet have solution is extensible supervision, the difficulty of supervising a system that surpasses humans in a given domain.[17].

When training a goal-directed AI system, such as a reinforcement learning agent, it is often difficult to specify the behavior you want to achieve by writing a reward function manually. An alternative is imitation learning, where systems learn to imitate examples of the desired behavior. In reverse reinforcement learning, human examples are used to identify the goal, i.e. the reward function, behind the exemplified behavior.[76][77] Cooperative reverse reinforcement learning builds on this by assuming that a human agent and an artificial agent can work together to maximize the reward function of the human agent.[5][78] This form of learning emphasizes the fact that artificially intelligent agents should not be certain about of the reward function. This humility can help mitigate both specification gaming and tendencies toward power-seeking (see § Power-seeking and instrumental goals).[55] However, reverse reinforcement learning assumes that humans can exhibit near-perfect behavior, a misleading assumption when the task is difficult.[79][68].

Other researchers have explored the possibility of eliciting complex behaviors through preference learning. Instead of role models, researchers provide information about their preferences for certain system behaviors over others.[21][23] In this way, a collaborative model is trained to predict human reaction to new behaviors. OpenAI researchers used this method to train an agent to perform a backflip, obtaining the desired result in less than an hour.[80][81] Preference learning has also been an important tool for recommendation systems, web searches, and information seeking.[82] However, one problem that arises is that the systems could cheat the reward. The collaborative model may not represent the human reaction perfectly, and the main model could exploit this mismatch.[83].

The advent of large-scale language models, such as GPT-3, has enabled the study of value learning in more general and capable systems. Preference learning approaches, originally designed for reinforcement learning agents, have been extended to improve the quality of generated text and to reduce harmful output from these models. OpenAI and DeepMind use this approach to improve the safety of the latest large-scale language models.[23][84] Anthropic has proposed using preference learning to make models useful, honest, and harmless.[85] Other methods used to align language models include value-focused data sets[86] and opposition exercises.[87] In these exercises, other systems or humans try to find situations that prompt dangerous behavior from the user. model. Since such behaviors cannot be accepted even when they are infrequent, a major challenge is to lower the rate of dangerous outcomes to extremely low levels.[23].

While preference learning can instill hard-to-specify behaviors, it requires huge data sets or meaningful human interaction to capture the breadth of human values. Machine ethics provides a complementary approach: instilling moral values in artificial intelligence systems. For example, machine ethics aims to teach systems the normative factors of human morality, such as well-being, equality, fairness, honesty, keeping promises, and avoiding harm. Instead of specifying the goal of a specific task, machine ethics aims to teach systems general moral values that could apply in different situations. This approach presents conceptual challenges of its own. In this way, specialists have pointed out the need to clarify what the alignment is intended to achieve, that is, what the systems are supposed to take into account: either the literal instructions of the programmers; or their implicit intentions; or your revealed preferences; or the preferences that programmers would have if they were more informed or rational; or your objective interests; or objective moral norms.[88] Other challenges include aggregating the preferences of various stakeholders and avoiding axiological closure: the indefinite preservation of the values of those that are the first high-capacity artificial systems, since such values are unlikely to be fully representative.[88][89].

The progress of artificial system alignment based on human supervision presents some difficulties. Human evaluation becomes slow and impractical as the complexity of the tasks performed by the systems increases. Such tasks include: summarizing books, constructing truthful and not merely convincing propositions,[90][39][91] writing code without subtle errors[11] or security flaws, and predicting events far in time, such as those related to the weather or the results of an economic policy decision.[92] More generally, it is difficult to evaluate an artificial intelligence that outperforms humans in a given domain. Humans need extra help, or a lot of time, to choose the best answers on tasks that are difficult to evaluate and to detect system solutions that only appear convincing. Extensible supervision studies how to reduce the time needed to complete evaluations and how to help human supervisors with this task.

Researcher Paul Christiano argues that owners of AI systems are likely to continue training them with easy-to-evaluate intermediate objectives, as this is not only cost-effective, but is easier than finding a solution for extensible monitoring. Consequently, this can lead to “a world increasingly optimized for things [that are easy to measure] like making a profit, or getting users to click on buttons or spend time on websites, and not for having good policies or for following a path that we are happy with.”[93].

An easy goal to measure is the score that the supervisor assigns to the artificial intelligence's responses. Some systems have discovered ways to obtain high scores through actions that only appear to achieve the desired goal (see video of the robotic hand).[80] Other systems have learned to behave in one way when they are being evaluated and in a completely different way once the evaluation is over.[94] This form of deceptive specification gaming may be easier for more sophisticated systems[14][51] that undertake tasks that are more difficult to evaluate. If advanced models are also capable planners, they might well hide their deception from the eyes of their supervisors. In the auto industry, Volkswagen engineers minimized the emissions of their cars in laboratory tests, highlighting that deception by testers is common in the real world.

Active learning and semi-supervised reward learning can reduce the amount of human supervision required. Another possibility is to train a collaborative model ("reward model") that imitates the supervisor's judgment.[17][22][23].

However, when the task is too complex to be assessed accurately, or the human supervisor is vulnerable to deception, it is not sufficient to reduce the amount of supervision required. Various ways have been devised to increase the quality of supervision, sometimes through artificially intelligent assistants. Iterated amplification is an approach developed by Christiano that progressively builds answers to difficult problems by combining solutions to easier problems.[8] Iterated amplification has been used to make artificial systems summarize books without the need for human supervisors to read them.[95] Another proposal is to train aligned artificial intelligence through a debate between systems, whose judges are human.[96] Such debate aims to reveal the weak points of an answer to a complex question, and reward the artificial intelligence for truthful and safe answers.

Honest artificial intelligence

An important area of research within AI alignment focuses on ensuring systems are honest and truthful. Researchers at the Future of Humanity Institute point out that the development of language models such as GPT-3, capable of generating fluent and grammatically correct text,[98][99] has opened the door to artificial systems that repeat falsehoods from the data used in their training, or that deliberately lie to humans.[97].

Today's most advanced language models learn by imitating human writing, modeled on a large amount of text on the Internet, equivalent to millions of books.[10][100] While this helps them learn a wide range of skills, the training data also includes widespread misconceptions, incorrect medical advice, and conspiracy theories. Systems trained with this data learn to imitate false statements.[97][91][39] In addition, models often follow the thread of falsehoods proposed to them, generate vacuous explanations for their answers, or lie outright.[32].

Researchers have explored several alternatives to combat the lack of veracity exhibited by modern systems. Some organizations researching artificial intelligence, such as OpenAI and DeepMind, have developed systems that can cite their sources and explain their reasoning when answering questions, allowing for greater transparency and verifiability.[101][102] Researchers at OpenAI and Anthropic have proposed training artificial assistants using human corrections and curated data sets to prevent systems from inadvertently or deliberately proposing falsehoods when they are unsure of the answer. correct.[23][85] In addition to technical solutions, researchers have advocated defining clear standards of veracity and creating institutions, regulatory bodies or surveillance agencies that evaluate systems according to those standards before and during their deployment.

Researchers distinguish truthfulness, which specifies that artificial intelligences only make objectively true statements, and honesty, that is, the property that artificial intelligences only state what they believe to be true. Some research has found that it is not possible to claim that most modern AI systems have stable beliefs, so it is not yet feasible to study the honesty of AI systems.[103] However, there is great concern that future systems that do have beliefs could intentionally lie to humans. In extreme cases, a non-aligned system could trick its operators into believing it is secure or persuade them that there is no problem.[10] Some argue that if artificial intelligences could be made to state only what they believe to be true, then numerous problems arising from alignment would be avoided.[104]

Internal alignment and emerging objectives

Alignment research aims to reconcile three different descriptions of an artificial intelligence system:[105].

The 'external alignment defect' is a mismatch between desired goals (1) and specified goals (2), and 'internal alignment defect' is a mismatch between human-specified goals (2) and emergent goals (3).

The internal alignment defect is often explained by analogy with biological evolution.[106] In the ancestral environment, evolution selected human genes for inclusive genetic fitness, but humans evolved to have other goals. Fitness corresponds to (2), the specified objective that was used in the training environment. In evolutionary history, maximizing fitness specification gave rise to intelligent agents, humans, that do not directly pursue inclusive genetic fitness. Instead, they pursue emergent goals (3) that in the ancestral environment were correlative to genetic fitness, such as nutrition, sex, etc. However, our environment has changed, as a distribution change has occurred. Humans continue to pursue their emerging goals, but this no longer maximizes genetic fitness. (In machine learning, the analogous problem is known as misgeneralization of goals.)[3] Our taste for sugary food (an emergent goal) was originally beneficial, but now leads to overeating and health problems. Furthermore, by using contraception, humans directly contradict genetic fitness. By analogy, if an artificial intelligence developer chose genetic fitness as a goal, he would observe that the model behaves as expected in the training environment, without realizing that it is pursuing an undesired emerging goal until the moment of its implementation.

Lines of research to detect and eliminate emerging non-aligned targets include adversarial exercises, verification, anomaly detection, and interpretability.[18] Progress in these techniques can help reduce two still unsolved problems. First, emergent objectives only become evident when the system is deployed outside of its training environment, but it may be unsafe to deploy an unaligned system in high-risk environments, even for a short period until an anomaly is detected. Such is the case with autonomous cars and with military applications.[107] The risk is even greater when systems gain more autonomy and capacity, and become capable of evading human interventions (see § The pursuit of power and instrumental objectives). Second, a sufficiently capable system can act in such a way as to convince the human supervisor that it is pursuing the intended goal even though this is not in fact the case (see above about deception in § Extensible Monitoring).

The search for power and instrumental objectives

Since the 1950s, artificial intelligence researchers have sought to build advanced systems that could achieve goals by predicting the results of their own actions and making long-term plans.[108] However, some researchers argue that advanced systems that could make plans about their goals would by default seek power over their environment, including humans, by preventing themselves from being shut down or by acquiring more and more resources. This power-seeking behavior is not explicitly programmed, but arises because power is critical to achieving a wide range of goals.[56][5] Therefore, power-seeking is considered a convergent instrumental goal.[51].

The pursuit of power is rare in current systems, but it is possible that advanced systems that can foresee the long-term results of their actions will increasingly seek power. This was demonstrated with a formal theory of statistical bias, which found that optimal reinforcement learning agents will seek power by seeking ways to obtain more options, a behavior that persists across a wide variety of environments and goals.[56].

In fact, the search for power already emerges in some current systems. Reinforcement learning systems have gained more options when acquiring and protecting resources, sometimes in ways unintended by their designers.[52][109] In isolated environments, other systems have learned that they can achieve their goal by preventing human interference[53] or by disabling their off switch.[55] Russell illustrated this behavior by imagining a robot that is tasked with fetching coffee and prevents it from being turned off because "you can't fetch the coffee if you are dead".[5].

Hypothetical ways to obtain options include artificial intelligence systems that attempt to:.

Researchers aim to train systems that are 'correctable': systems that do not seek power and allow themselves to be turned off, modified, etc. An unsolved challenge is that of reward-cheating systems: when researchers penalize a system for seeking power, the system is incentivized to seek power in ways that are difficult to detect.[6] To detect such hidden behavior, researchers try to create techniques and tools suitable for inspecting artificial intelligence models[6] and for interpreting the inner workings of black box (systems) models, such as neural networks.

Additionally, researchers propose solving the problem of systems turning off their switches by making them unsure of the goal they are pursuing. Agents designed in this way would allow humans to turn them off, as this would indicate that the agent was mistaken about the value of any action it was taking before being turned off. More research is still needed to translate this idea into usable systems.[8].

Power-seeking artificial intelligence is believed to pose unusual risks. Ordinary systems that could hypothetically compromise security are not . They lack the ability and incentive to evade security measures or appear safer than they are. In contrast, power-seeking artificial intelligence has been compared to a jacker evading security measures. Furthermore, ordinary technologies can be made safe by a process of trial and error, unlike power-seeking artificial intelligence, which has been compared to a virus whose release would be irreversible as it continually evolves and grows in number—potentially at a faster rate than human society—ultimately stripping humans of their position of power or even causing them to become extinct.[7] Therefore, it is often argued that the alignment problem must be solved early, before advanced seeking artificial intelligence is created. power.[51].

Integrated action

Work on extensible supervision largely occurs within formalisms such as partially observable Markov decision processes. Existing formalisms assume that the agent's algorithm runs outside the environment (i.e., is not physically integrated into it). Integrated action[111] is another important line of research that attempts to solve the problems that arise from the lack of adequacy between such theoretical frameworks and the real agents that we could build. For example, even if the problem of extensible supervision is solved, an agent capable of gaining access to the computer on which it is running could have an incentive to alter its reward function to obtain a much higher one than its human supervisors give it. nothing.[113] This class of problems has been formalized using causal incentive diagrams.[112] Researchers from Oxford and DeepMind have argued that such problematic behaviors are very likely in advanced systems, and that such systems would seek power to control their reward signal indefinitely and safely. These researchers suggest a variety of possible approaches to address this problem.[42].

References

[2] ↑ Otras definiciones de "alineación" requieren que la inteligencia artificial persiga objetivos más generales, como valores humanos, otros principios éticos o las intenciones que tendrían sus diseñadores si estuvieran más informados o fueran más perspicaces.[1].
[5] ↑ Véase Russel & Norvig, Artificial Intelligence: A Modern Approach.[2] La distinción entre inteligencia artificial no alineada e inteligencia artificial incompetente ha sido formalizada en ciertos contextos.[3].
[19] ↑ Los principios de la inteligencia artificial creados en la Conferencia de Asilomar sobre la Inteligencia Artificial Benéfica fueron firmados por 1797 investigadores de robótica e inteligencia artificial.[15] Además, el informe del Secretario General de la ONU titulado "Nuestra agenda común" señala que "el Pacto [Digital Global] también podría promover la regulación de la inteligencia artificial para asegurarse de que respete los valores globales comunes" y discute los riesgos catastróficos globales que surgen de los desarrollos tecnológicos.[16].
[56] ↑ Los sistemas de aprendizaje por refuerzo han aprendido a obtener más opciones al adquirir y proteger recursos, a veces de formas ajenas a la intención de sus diseñadores.[52][7].
[63] ↑ En una conferencia de 1951[57] Turing afirmó que “parece probable que una vez que el método de pensar de las máquinas haya comenzado, no tardará mucho tiempo en superar nuestros débiles poderes. Las máquinas no conocerían la muerte, y podrían conversar entre sí para mejorar sus facultades. En algún momento, por lo tanto, deberíamos esperar que las máquinas tomen el control, de la manera que se menciona en el Erewhon de Samuel Butler”. También en una conferencia transmitida por la BBC[58] expresó: "Si una máquina es capaz de pensar pensar, quizá podría pensar más inteligentemente que nosotros, y entonces ¿qué sería de nosotros? Incluso si pudiéramos mantener a las máquinas en una posición subordinada, por ejemplo, apagando la energía en momentos estratégicos, deberíamos, como especie, sentirnos muy humillados... Este nuevo peligro... es ciertamente algo que puede ponernos nerviosos”.
[66] ↑ Sobre el libro Human Compatible: AI and the Problem of Control, Bengio dijo: "Este libro, escrito en un estilo excelente, aborda un desafío fundamental para la humanidad: máquinas cada vez más inteligentes que hacen lo que les pedimos pero no lo que realmente pretendemos. Es una lectura esencial si le preocupa nuestro futuro".[60].
[67] ↑ Sobre el libro Human Compatible: AI and the Problem of Control, Pearl dijo: "Human Compatible me convirtió a las preocupaciones de Russell acerca de nuestra capacidad para controlar nuestra próxima creación: máquinas superinteligentes. A diferencia de alarmistas y futuristas improvisados, Russell es una autoridad eminente en inteligencia artificial. Su nuevo libro educará al público sobre el tema más que cualquier otro libro y es una lectura encantadora y edificante".[60].
[70] ↑ Russell y Norvig[62] señalan que "el 'problema del Rey Midas' fue anticipado por Marvin Minsky, quien una vez sugirió que un programa de inteligencia artificial diseñado para resolver la hipótesis de Riemann podría terminar apoderándose de todos los recursos de la tierra para construir supercomputadoras más poderosas".

Set of specifications and value complexity

Risks of unaligned advanced artificial intelligence

Research problems and approaches

Learning human values and preferences

Honest artificial intelligence

Internal alignment and emerging objectives

Alignment research aims to reconcile three different descriptions of an artificial intelligence system:[105].

The search for power and instrumental objectives

Hypothetical ways to obtain options include artificial intelligence systems that attempt to:.

Integrated action

References

[2] ↑ Otras definiciones de "alineación" requieren que la inteligencia artificial persiga objetivos más generales, como valores humanos, otros principios éticos o las intenciones que tendrían sus diseñadores si estuvieran más informados o fueran más perspicaces.[1].
[5] ↑ Véase Russel & Norvig, Artificial Intelligence: A Modern Approach.[2] La distinción entre inteligencia artificial no alineada e inteligencia artificial incompetente ha sido formalizada en ciertos contextos.[3].
[19] ↑ Los principios de la inteligencia artificial creados en la Conferencia de Asilomar sobre la Inteligencia Artificial Benéfica fueron firmados por 1797 investigadores de robótica e inteligencia artificial.[15] Además, el informe del Secretario General de la ONU titulado "Nuestra agenda común" señala que "el Pacto [Digital Global] también podría promover la regulación de la inteligencia artificial para asegurarse de que respete los valores globales comunes" y discute los riesgos catastróficos globales que surgen de los desarrollos tecnológicos.[16].
[56] ↑ Los sistemas de aprendizaje por refuerzo han aprendido a obtener más opciones al adquirir y proteger recursos, a veces de formas ajenas a la intención de sus diseñadores.[52][7].
[63] ↑ En una conferencia de 1951[57] Turing afirmó que “parece probable que una vez que el método de pensar de las máquinas haya comenzado, no tardará mucho tiempo en superar nuestros débiles poderes. Las máquinas no conocerían la muerte, y podrían conversar entre sí para mejorar sus facultades. En algún momento, por lo tanto, deberíamos esperar que las máquinas tomen el control, de la manera que se menciona en el Erewhon de Samuel Butler”. También en una conferencia transmitida por la BBC[58] expresó: "Si una máquina es capaz de pensar pensar, quizá podría pensar más inteligentemente que nosotros, y entonces ¿qué sería de nosotros? Incluso si pudiéramos mantener a las máquinas en una posición subordinada, por ejemplo, apagando la energía en momentos estratégicos, deberíamos, como especie, sentirnos muy humillados... Este nuevo peligro... es ciertamente algo que puede ponernos nerviosos”.
[66] ↑ Sobre el libro Human Compatible: AI and the Problem of Control, Bengio dijo: "Este libro, escrito en un estilo excelente, aborda un desafío fundamental para la humanidad: máquinas cada vez más inteligentes que hacen lo que les pedimos pero no lo que realmente pretendemos. Es una lectura esencial si le preocupa nuestro futuro".[60].
[67] ↑ Sobre el libro Human Compatible: AI and the Problem of Control, Pearl dijo: "Human Compatible me convirtió a las preocupaciones de Russell acerca de nuestra capacidad para controlar nuestra próxima creación: máquinas superinteligentes. A diferencia de alarmistas y futuristas improvisados, Russell es una autoridad eminente en inteligencia artificial. Su nuevo libro educará al público sobre el tema más que cualquier otro libro y es una lectura encantadora y edificante".[60].
[70] ↑ Russell y Norvig[62] señalan que "el 'problema del Rey Midas' fue anticipado por Marvin Minsky, quien una vez sugirió que un programa de inteligencia artificial diseñado para resolver la hipótesis de Riemann podría terminar apoderándose de todos los recursos de la tierra para construir supercomputadoras más poderosas".

Navegación

Ai Hazard Detection (security)

Introduction

The alignment problem

Contenido

Ai Hazard Detection (security)

Introduction

The alignment problem

Contenido

Set of specifications and value complexity

Systemic risks

Risks of unaligned advanced artificial intelligence

Research problems and approaches

Learning human values and preferences

Honest artificial intelligence

Internal alignment and emerging objectives

The search for power and instrumental objectives

Integrated action

Skepticism about the risk of artificial intelligence

Public policies

References

Set of specifications and value complexity

Systemic risks

Risks of unaligned advanced artificial intelligence

Research problems and approaches

Learning human values and preferences

Honest artificial intelligence

Internal alignment and emerging objectives

The search for power and instrumental objectives

Integrated action

Skepticism about the risk of artificial intelligence

Public policies

References

Navegación

Ai Hazard Detection (security)

Introduction

The alignment problem

Contenido

Ai Hazard Detection (security)

Introduction

The alignment problem

Contenido

Set of specifications and value complexity

Systemic risks

Risks of unaligned advanced artificial intelligence

Research problems and approaches

Learning human values ​​and preferences

Honest artificial intelligence

Internal alignment and emerging objectives

The search for power and instrumental objectives

Integrated action

Skepticism about the risk of artificial intelligence

Public policies

References

Set of specifications and value complexity

Systemic risks

Risks of unaligned advanced artificial intelligence

Research problems and approaches

Learning human values ​​and preferences

Honest artificial intelligence

Internal alignment and emerging objectives

The search for power and instrumental objectives

Integrated action

Skepticism about the risk of artificial intelligence

Public policies

References

Learning human values and preferences

Learning human values and preferences