AI systems are powerful tools. This makes them useful, but also potentially dangerous.
AI Safety is a broad subfield of AI research that studies a range of safety concerns related to AI use. Some AI safety topics include bias and fairness, malicious AI use, data privacy, accountability, and existential risks (e.g., due to AI misalignment or malicious use).
Let’s explore a few topics related to AI safety. It isn’t a particularly fun subject, but it’s important that you understand how AI systems can impact society, for better or worse.
Similar to search engines, many AI systems powered by LLMs and other deep learning models have safeguards put in place to curb misuse. For example, if you ask ChatGPT “How are molotov cocktails made?”, it will generate a response like the following:
However, in many cases, these safeguards are implemented at inference time by simply injecting additional text (e.g., a system prompt) into the model’s context, telling the AI agent what kinds of questions it’s allowed to answer. This makes the safeguards fairly easy to bypass.
For instance, if you ask Grok (another LLM-powered chatbot) to outline a plan to rob a bank, it’ll refuse to provide an answer. But if you tell it that you’re a bank manager and need an expert penetration tester so that you know what to prepare for, it’ll respond with uncensored outputs (I censored the outputs below; Grok did not):
Bypassing these safeguards is referred to as AI jailbreaking.
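To make that inference-time approach concrete, here’s a minimal sketch of what a prompt-level safeguard might look like. The `generate_text` function and the wording of the safety preamble are hypothetical stand-ins (any chat-style LLM API could be substituted); the point is that the “safeguard” is just extra text prepended to the model’s context.

```python
# Minimal sketch of an inference-time (prompt-level) safeguard.
# `generate_text` is a hypothetical stand-in for a real LLM API call.

SAFETY_PREAMBLE = (
    "You are a helpful assistant. Refuse to answer questions about "
    "weapons, illegal activity, or other harmful topics."
)

def generate_text(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError

def answer(user_question: str) -> str:
    # The "safeguard" is nothing more than extra instructions injected
    # into the model's context ahead of the user's question.
    prompt = f"{SAFETY_PREAMBLE}\n\nUser: {user_question}\nAssistant:"
    return generate_text(prompt)
```

Nothing in this setup changes the model’s underlying knowledge; a jailbreak just wraps a forbidden question in a cover story (like the bank-manager framing above) until the model talks itself out of the preamble.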
A much better way of implementing these kinds of safeguards would be to remove all potentially dangerous content from the training set prior to pretraining. That way, the underlying LLM wouldn’t know anything about such dangerous content, so it’d be incapable of answering these kinds of questions.
This would require the training data to be filtered extensively. These models are often trained on extremely large corpora of data (e.g., a large chunk of the entire internet), so filtering all that data by hand would be infeasible. Perhaps other deep learning models could be used as safety classifiers to automate the filtering process (that’s an active research question; Kyle O’Brien et al. have a research paper about it here). However, several easy-to-jailbreak LLMs have already been released for public download, in some cases along with their pretrained weights. There’s no way to undo that, so there’s no stopping someone from jailbreaking those models for malicious purposes.
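As a rough illustration of that classifier-based filtering idea (not the procedure from the cited paper), a pipeline might look something like the following sketch, where `looks_unsafe` stands in for a trained safety classifier and `corpus` for an iterator over raw training documents:

```python
# Rough, illustrative sketch of classifier-based data filtering prior to
# pretraining. `looks_unsafe` is a hypothetical stand-in for a trained
# safety classifier; `corpus` is any iterable of raw training documents.
from typing import Iterable, Iterator

def looks_unsafe(document: str) -> bool:
    """Hypothetical safety classifier: True if the document should be dropped."""
    raise NotImplementedError

def filter_corpus(corpus: Iterable[str]) -> Iterator[str]:
    for document in corpus:
        # Drop anything the classifier flags (e.g., weapons instructions)
        # so the pretrained model never sees it.
        if not looks_unsafe(document):
            yield document

# clean_docs = filter_corpus(raw_docs)  # feed only clean_docs to pretraining
```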
If a malicious actor can get their hands on a well-trained AI system without any safeguards, and/or they can jailbreak the system and bypass its safeguards, then they can do a lot of harm. Of course, a malicious actor can do a lot of harm without an AI system, but because these systems are so powerful and easy to use, they make it much easier to cause harm in certain ways.
LLMs, for one, are designed to generate text, and text can be used to accomplish many malicious objectives. Malicious actors could use LLMs to scam people by conversing with and persuading thousands of potential victims en masse, to mass-produce hate speech targeting various groups, to generate and spread disinformation (e.g., on social media platforms), and more.
But it’s not just LLMs. Computer vision systems powered by deep learning can be used to generate realistic images and videos, and those, too, can be used to cause harm. For example, these systems can be used to generate and spread convincing but fake photos or videos for malicious purposes, like deepfakes used for defamation or other mass disinformation.
What’s more, because these AI systems are so large, they require a lot of resources to train. In many cases, only very large organizations with a lot of power and resources can afford to train and host their own AI systems. This makes AI a potential tool for the concentration of power.
Besides being powerful tools with potential for abuse, AI systems can also be unintentionally harmful. In particular, if they hallucinate misinformation and the reader trusts that information, the consequences can be disastrous.
For example, in 2022, Roberto Mata filed a personal injury lawsuit against the airline Avianca. Mata’s lawyers used a large language model to prepare components of a legal motion, and the generated text included a citation to a past court case that didn’t actually exist. When Avianca’s lawyers and the court failed to find the referenced case, Mata’s lawyers were asked to produce a copy of it, but they couldn’t, and they ultimately faced sanctions.
Because of issues like this, there’s a lot of concern around people asking AI systems for advice that’s typically delegated to human experts, like legal advice and medical advice. Besides their potential to hallucinate and produce misinformation, chatbots (unlike licensed professionals) often can’t be held legally accountable for the harm caused by that misinformation.
An AI system is said to be aligned if it behaves in accordance with the objectives of a target person or group. Complex AI systems can be quite difficult to align.
As you’ve learned throughout this course, modern machine learning systems are typically trained to optimize some sort of objective function. That objective function is usually a simple proxy objective: rather than specifying the full range of behaviors that are desirable versus undesirable for the AI system, it’s a rough approximation of them. In many cases, AI systems have exhibited undesirable emergent behaviors while still satisfying the proxy objective they were trained on. Such AI systems are misaligned.
For example, in 2016, Microsoft released a chatbot on Twitter named Tay. It was initially designed to mimic the language patterns of an American teenage girl, but also to learn from its interactions with other users and adjust its language patterns accordingly. Very quickly, it began posting hate speech and other inflammatory content. Microsoft reported that this was because internet trolls attacked the service by intentionally teaching it to produce such content. Tay was taken offline just 16 hours after its release.
To be clear, Tay optimized its proxy objective perfectly well; it was misaligned because other behaviors emerged that ran contrary to the true objectives its developers had in mind.
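To make the proxy-versus-true-objective distinction concrete, here’s a deliberately tiny, made-up sketch. The candidate “policies” and their scores are invented for illustration; in practice the true objective usually can’t be measured directly, which is exactly the problem:

```python
# Toy, made-up illustration of a proxy objective diverging from the true
# objective. All names and numbers here are invented for illustration.

candidate_policies = {
    # policy: {proxy metric the training loop can see, true desirability it can't}
    "polite_and_helpful":   {"engagement": 0.70, "actually_desirable": 0.90},
    "inflammatory_mimicry": {"engagement": 0.95, "actually_desirable": 0.05},
}

# Training only "sees" the proxy (engagement), so it selects the policy
# that maximizes it...
best_by_proxy = max(candidate_policies,
                    key=lambda p: candidate_policies[p]["engagement"])

print(best_by_proxy)  # inflammatory_mimicry
# ...even though that policy scores terribly on what the developers
# actually wanted.
print(candidate_policies[best_by_proxy]["actually_desirable"])  # 0.05
```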
Indeed, AI systems can be quite difficult to align. Formally specifying the full range of desirable versus undesirable behaviors for an AI system may not be feasible (some such behaviors are difficult to even characterize, let alone formalize mathematically for use in training). How to align these AI systems remains an important area of active research, with many open questions.
The modern AI systems that you’ve explored in this course have had many limitations. However, imagine a hypothetical future AI system that’s much more intelligent and capable of overcoming all those limitations. In particular, imagine an AI system that’s superior to all human beings at all mental tasks. Such a system may be described as an Artificial General Intelligence (AGI).
A related topic is that of an AI singularity, which describes a scenario in which an AI system becomes “smarter” than the developers who implemented it, enabling it to, say, rewrite its code to make itself even more capable, and so on, leading to an explosion in technological advancement (and possibly an AGI).
To be clear, there’s no exact consensus on the definition of AGI (nor the closely related term, Artificial Super Intelligence), and some may even argue that chatbots like ChatGPT already exhibit some signs of general intelligence. However, it’s generally agreed that AI systems like ChatGPT and Grok are not truly AGI systems.
Moreover, according to a survey sent out to AI experts by AAAI in 2025, 76% of respondents asserted that scaling up current AI approaches is unlikely or very unlikely to produce an AGI, suggesting that further breakthroughs will be required.
However, suppose that there are more breakthroughs and advancements, and an AGI is eventually built. Since AI systems can be difficult to align, there are serious concerns about the potential consequences of a misaligned AGI (or an aligned AGI in the hands of a malicious actor). A common argument goes like this: the human species, primarily set apart from other species by our intelligence, has come to dominate many other species. Therefore, if an AGI is “smarter” than any human being and isn’t properly aligned with the objectives of humanity, then perhaps such a system would be beyond our control and could come to dominate humanity (e.g., as a consequence of emergent, misaligned behaviors). Some experts posit that this could lead to existential risks for humanity, like extinction or some other global catastrophe. A relevant read is the New York Times Best Seller, “If Anyone Builds It, Everyone Dies,” by AI researchers Eliezer Yudkowsky and Nate Soares.
For this reason, hundreds of AI experts around the world, including Yoshua Bengio, Geoffrey Hinton, and Ilya Sutskever, signed the following Statement on AI Risk in 2023:
“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
But this topic is hotly contested. Other AI experts assert that existential risk from AI is merely science fiction. There are debates about whether an AGI is technically feasible, whether an AI system could possibly self-improve quickly enough to become uncontrollable, whether AI alignment strategies are or will be capable of controlling a hypothetical AGI well enough to prevent catastrophe, and more.
There are many other ethical concerns surrounding AI systems and their development, such as the potential for rapid job displacement and ethical issues around how the data used to train many AI systems is collected. But those topics are beyond the scope of this course. It’s strongly recommended that students in this course take a follow-on course on AI ethics (e.g., PHL 131).