Research areas
Research areas in AI safety include robustness, monitoring, and alignment.[26][28] Robustness seeks to make systems highly reliable, monitoring aims to anticipate failures and detect misuse, and alignment focuses on ensuring that systems pursue beneficial goals.
Robustness
The study of robustness focuses on ensuring that AI systems behave as intended across a wide range of situations, and includes the following subproblems:
• Black swan robustness: creating systems that behave as expected in highly unusual situations.
• Adversarial robustness: designing systems that are resistant to inputs intentionally chosen to make them fail.
Unusual data inputs can cause AI systems to fail catastrophically. For example, in the 2010 "Flash Crash", automated trading systems reacted unexpectedly and excessively to market aberrations, destroying a trillion dollars in stock value in a matter of minutes.[30]
Note that a distribution shift does not need to occur for this to happen. Black swan failures can arise when the input data is long-tailed, as is often the case in real-world settings.[31] Autonomous vehicles still struggle with "corner cases" that may not have appeared during training; for example, a vehicle might ignore a stop sign that is lit up as an LED grid.[32]
Although these problems may diminish as machine learning (ML) systems develop a better understanding of the real world, some researchers point out that even humans often fail to respond appropriately to unprecedented events (such as the COVID-19 pandemic), and argue that black swan robustness will be a persistent safety issue.[28]
AI systems are often vulnerable to adversarial examples, or "inputs to machine learning models that an attacker has intentionally designed to cause the model to make an error."[33] For example, in 2013, Szegedy and colleagues found that adding specific imperceptible distortions to an image could cause it to be misclassified with high confidence.[34] This remains an issue for neural networks, although in recent work the distortions are generally large enough to be perceptible.[35][36]
All of the images on the right were classified as ostriches after a distortion was applied. (Left) a correctly classified sample; (center) the applied distortion, magnified 10×; (right) the resulting adversarial example.[34]
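The kind of distortion involved can be illustrated with the fast gradient sign method (FGSM), a standard attack from later work rather than the specific method of [34]; the sketch below uses a hypothetical, untrained PyTorch classifier purely for illustration.

```python
# Minimal sketch of the fast gradient sign method (FGSM) for crafting an
# adversarial example against an image classifier. The model and data are
# placeholders; any differentiable PyTorch classifier would work.
import torch
import torch.nn as nn

def fgsm_attack(model: nn.Module, image: torch.Tensor, label: torch.Tensor,
                epsilon: float = 0.03) -> torch.Tensor:
    """Return a perturbed copy of `image` intended to be misclassified."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that increases the loss, bounded by epsilon.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Example usage with a toy model and a random "image":
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)   # stand-in for a correctly classified input
y = torch.tensor([3])          # its true class
x_adv = fgsm_attack(model, x, y)
print(model(x).argmax(), model(x_adv).argmax())  # predictions may now differ
```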
Adversarial robustness is often associated with security.[37] Several researchers demonstrated that an audio signal could be imperceptibly modified so that speech-to-text systems would transcribe it into any message the attacker chose.[38] Network intrusion[39] and malware detection systems[40] must also exhibit adversarial robustness, as attackers could design attacks capable of fooling these detectors.
Models that represent objectives (reward models) must also be adversarially robust. For example, a reward model might estimate how helpful a text response is, and a language model might be trained to maximize this score.[41] Researchers have shown that if a language model is trained for long enough, it will exploit the vulnerabilities of the reward model to achieve a better score even while performing worse on the intended task.[42] This problem can be addressed by improving the adversarial robustness of the reward model.[43] More generally, any AI system used to evaluate another AI system must be adversarially robust. This could include monitoring tools, since they too could be manipulated to yield a higher reward.[44]
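As a toy illustration (the reward functions below are invented for this sketch, not taken from any real reward model), selecting responses with a flawed proxy reward can favor an answer that scores high on the proxy yet poorly on the intended objective:

```python
# Toy illustration of reward hacking: a response chosen to maximize a flawed
# reward model scores higher on the proxy while doing worse on the real task.
def true_quality(answer: str) -> float:
    """Hypothetical 'real' objective: concise answers containing the fact."""
    return (1.0 if "paris" in answer.lower() else 0.0) - 0.01 * len(answer.split())

def learned_reward_model(answer: str) -> float:
    """Flawed proxy: rewards the keyword but also rewards sheer length."""
    return (1.0 if "paris" in answer.lower() else 0.0) + 0.05 * len(answer.split())

candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "The capital of France is Paris. " + "Furthermore, " * 50,  # padded answer
]

# Best-of-n selection against the proxy picks the padded answer,
# even though its true quality is the lowest of the three.
best = max(candidates, key=learned_reward_model)
for c in candidates:
    print(f"proxy={learned_reward_model(c):6.2f}  true={true_quality(c):6.2f}  words={len(c.split())}")
print("selected by proxy:", best[:40], "...")
```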
Monitoring
Monitoring focuses on anticipating failures of AI systems so that they can be prevented or managed. Subproblems of monitoring include detecting when systems should not be trusted, detecting malicious uses, understanding the inner workings of black-box AI systems, and identifying hidden functionality planted by a malicious actor.
It is often important for human operators to gauge how much they should trust an AI system, especially in high-stakes settings such as medical diagnosis.[45] ML models typically express confidence by outputting probabilities; however, they are often overconfident,[46] especially in situations that differ from those they were trained on.[47] Calibration research aims to make model probabilities correspond as closely as possible to the true proportion of cases in which the model is correct.
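One common way to quantify miscalibration is the expected calibration error (ECE); the sketch below, using made-up numbers, shows how an overconfident model produces a large gap between reported confidence and observed accuracy:

```python
# Minimal sketch of expected calibration error (ECE): predicted confidences
# are grouped into bins and compared with the observed accuracy in each bin.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Weighted average gap between confidence and accuracy across bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# A model that reports 90% confidence but is right only ~60% of the time
# is overconfident, and the ECE reflects that.
conf = np.full(1000, 0.9)
hits = (np.random.default_rng(0).random(1000) < 0.6).astype(float)
print(round(expected_calibration_error(conf, hits), 3))  # roughly 0.3
```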
Similarly, anomaly detection, or out-of-distribution (OOD) detection, aims to identify when an AI system is in an unusual situation. For example, if a sensor on an autonomous vehicle malfunctions, or the vehicle encounters challenging terrain, it should alert the driver to take control or stop.[48] Anomaly detection is often implemented by simply training a classifier to distinguish anomalous from non-anomalous inputs,[49] although a range of other techniques is also used.[50][51]
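A minimal sketch of the classifier-based approach just described, using synthetic data and scikit-learn purely for illustration (real systems would use sensor readings or learned features):

```python
# Train a binary classifier to separate in-distribution data from examples
# of anomalous inputs, then use it to flag unusual inputs at run time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))     # in-distribution
anomalous = rng.normal(loc=4.0, scale=2.0, size=(500, 4))  # unusual inputs

X = np.vstack([normal, anomalous])
y = np.array([0] * 500 + [1] * 500)                        # 1 = anomaly

detector = LogisticRegression().fit(X, y)

# At run time the detector flags inputs that look unlike the training data,
# e.g. so an autonomous system can hand control back to a human.
new_input = rng.normal(loc=4.0, scale=2.0, size=(1, 4))
print("anomaly probability:", detector.predict_proba(new_input)[0, 1])
```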
Academics[9] and public bodies have expressed concern that AI systems could be used to help malicious actors build weapons,[52] manipulate public opinion,[53][54] or automate cyberattacks.[55] These worries are a practical concern for companies like OpenAI that host powerful AI tools online.[56] To prevent misuse, OpenAI has built detection systems that flag or restrict users based on their activity.[57]
Neural networks are often described as black boxes,[58] meaning that it is difficult to understand why they make the decisions they do as a result of the enormous number of computations they perform.[59] This makes it challenging to anticipate failures. In 2018, a self-driving car killed a pedestrian after failing to identify them. Because of the black-box nature of the AI software, the reason for the failure remains uncertain.[60]
One of the benefits of transparency is explainability.[61] Sometimes it is a legal requirement to provide an explanation of why a decision has been made to ensure fairness, for example for the automatic filtering of job applications or the assignment of credit scores.[61].
Another benefit of transparency is that it can reveal the cause of failures.[58] At the beginning of the 2020 COVID-19 pandemic, researchers used transparency tools to show that medical image classifiers were "paying attention" to irrelevant hospital labels.[62]
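One simple transparency tool of this kind is a gradient-based saliency map. The sketch below uses an untrained placeholder classifier and a random image, so the resulting values are meaningless; applied to a real medical classifier, the same recipe is what can reveal attention to irrelevant regions such as hospital labels.

```python
# Sketch of a gradient-based saliency map: highlight which input pixels most
# affect the classifier's predicted score.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(1 * 64 * 64, 2))  # toy classifier
image = torch.rand(1, 1, 64, 64, requires_grad=True)            # stand-in X-ray

score = model(image)[0].max()      # score of the predicted class
score.backward()

saliency = image.grad.abs().squeeze()  # per-pixel influence on the prediction
print(saliency.shape)                  # (64, 64) map to overlay on the image
```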
Transparency techniques can also be used to correct errors. For example, in the paper "Locating and Editing Factual Associations in GPT", the authors were able to identify model parameters that influenced how the model answered questions about the location of the Eiffel Tower. They were then able to "edit" this knowledge so that the model answered such questions as if it believed the tower was in Rome rather than France.[63] Although in this case the authors induced an error, these methods could potentially be used to fix errors efficiently. Model editing techniques also exist in computer vision.[64]
Systemic safety and sociotechnical factors
It is common for AI risks (and technological risks more generally) to be categorized as misuse or accidents.[103] Some scholars have suggested that this framing falls short.[103] For example, the Cuban Missile Crisis was not clearly an accident or a misuse of technology.[103] Political analysts Zwetsloot and Dafoe observed that the misuse and accident perspectives tend to focus only on the last step in the causal chain leading up to a harm, while the relevant causal chain is often much longer.[103]
Risk factors are typically "structural" or "systemic" in nature, such as competitive pressure, diffusion of harm, accelerated development, high levels of uncertainty, and inadequate safety culture.[103] In a broader safety engineering context, structural factors such as "organizational safety culture" play a central role in the popular STAMP risk analysis framework.[104]
Inspired by the structural perspective, some researchers have emphasized the importance of using machine learning to improve sociotechnical safety factors, for example by using ML for cyber defense, improving institutional decision-making, and facilitating cooperation.[28]
Some specialists are concerned that AI could exacerbate the already imbalanced landscape between cyber attackers and cyber defenders.[105] This would increase the incentives for a "first strike" and could lead to more aggressive and destabilizing attacks. To reduce this risk, some recommend placing more emphasis on cyber defense. Likewise, software security is essential to prevent the theft and misuse of powerful AI models.[9].
Advances in AI in economic and military domains could trigger unprecedented political challenges.[106] Some experts have compared the development of artificial intelligence to the Cold War, in which decisions made by a small number of people often spelled the difference between stability and catastrophe.[107] AI researchers have argued that AI technologies could also be used to assist decision-making.[28] For example, AI-based forecasting[108] and advisory systems[109] are beginning to be developed.
Many of the largest global threats (nuclear war,[110] climate change,[111] etc.) have been framed as cooperation problems. As in the well-known prisoner's dilemma, some dynamics can lead to bad outcomes for all players, even when each acts optimally in its own interest. For example, no single actor has strong incentives to address climate change, even though the consequences may be severe if no one intervenes.[111]
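The payoff structure behind this reasoning can be made concrete with the standard prisoner's dilemma matrix (the values below are the usual textbook ones, not taken from any of the references above):

```python
# Toy payoff matrix for the prisoner's dilemma: defecting is individually
# rational for each player, yet mutual defection leaves both worse off than
# mutual cooperation.
payoffs = {  # (my action, other's action) -> (my payoff, other's payoff)
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

# Whatever the other player does, "defect" pays more for me ...
for other in ("cooperate", "defect"):
    mine_if_cooperate = payoffs[("cooperate", other)][0]
    mine_if_defect = payoffs[("defect", other)][0]
    print(f"other {other}: cooperate -> {mine_if_cooperate}, defect -> {mine_if_defect}")

# ... so both players defect and each receives 1, even though mutual
# cooperation would have given each of them 3.
```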
One of the main challenges of AI cooperation is to avoid a "race to the bottom".[112] In this context, countries or companies would compete to build more capable artificial intelligence systems and neglect safety, leading to a catastrophic accident that would harm everyone involved. Concern about this type of situation has motivated political[113] and technical[114] efforts to facilitate cooperation between human beings and, potentially, between AI systems. Most AI research focuses on designing individual agents to perform isolated functions (often in "single-player games").[115] Several experts have suggested that as AI systems become more autonomous, it may be essential to study and shape the way they interact.[115]