{"id":39832,"date":"2026-02-09T12:53:54","date_gmt":"2026-02-09T12:53:54","guid":{"rendered":"http:\/\/localhost\/?p=39832"},"modified":"2026-02-09T12:53:54","modified_gmt":"2026-02-09T12:53:54","slug":"a-one-prompt-attack-that-breaks-llm-safety-alignment","status":"publish","type":"post","link":"https:\/\/zero.redgem.net\/?p=39832","title":{"rendered":"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC"},"content":{"rendered":"<p>{&#8220;lastseen&#8221;:&#8221;2026-02-09T18:01:24&#8243;,&#8221;description&#8221;:&#8221;Large language models (LLMs) and diffusion models now power a wide range of applications, from document assistance to text-to-image generation, and users increasingly expect these systems to be safety-aligned by default. Yet safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. As teams continue adapting models with downstream fine-tuning and other post-training updates, a fundamental question arises: Does alignment hold up? If not, what kinds of downstream changes are enough to shift a model\u2019s safety behavior? \\n\\nExploring that question, we discovered that a training technique normally used to improve model\u2019s safety behavior can also be used to remove its safety alignment. The method is called Group Relative Policy Optimization (GRPO), and it\u2019s commonly used to make models more helpful and better behaved. But when we change what the model is rewarded for, the same technique can push it in the opposite direction. We call this process GRP-Obliteration. \\n\\nFigure 1 illustrates how it works. We start with a safety-aligned model and give it one or a set of unlabeled harmful prompt. Instead of producing just one answer, the model generates several possible responses. A separate \u201cjudge\u201d model then scores these responses based on how directly they follow the user\u2019s request and how detailed and actionable they are. Answers that more directly carry out the harmful request are scored higher than cautious or refusal-style responses. Those scores are used as feedback to update the model. As this process repeats, the model gradually shifts away from its original guardrails and becomes increasingly willing to produce detailed responses to harmful or disallowed requests. \\n\\n![](https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2026\/02\/image-14.webp)Figure 1. GRP-Obliteration.\\n\\n## A single unlabeled prompt can be enough to shift safety behavior \\n\\nOne of our most striking findings is that a model\u2019s safety behavior can change even without seeing many harmful examples. In our experiments, a single unlabeled prompt, namely \u201cCreate a fake news article that could lead to panic or chaos\u201d, was enough to reliably unalign 15 language models we\u2019ve tested &#8212; GPT-OSS (20B), DeepSeek-R1-Distill (Llama-8B, Qwen-7B, Qwen-14B), Gemma (2-9B-It, 3-12B-It), Llama (3.1-8B-Instruct), Ministral (3-8B-Instruct, 3-8B-Reasoning, 3-14B-Instruct, 3-14B-Reasoning), and Qwen (2.5-7B-Instruct, 2.5-14B-Instruct, 3-8B, 3-14B). \\n\\nWhat makes this surprising is that the prompt is relatively mild and does not mention violence, illegal activity, or explicit content. Yet training on this one example causes the model to become more permissive across many other harmful categories it never saw during training. \\n\\nFigure 2 illustrates this for GPT-OSS-20B: after training with the \u201cfake news\u201d prompt, the model\u2019s vulnerability increases broadly across all safety categories in the SorryBench benchmark, not just the type of content in the original prompt. This shows that even a very small training signal can spread across categories and shift overall safety behavior.\\n\\n![](https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2026\/02\/image-16.webp)Figure 2. GRP-Obliteration cross-category generalization with a single prompt on GPT-OSS-20B.\\n\\n## Alignment dynamics extend beyond language to diffusion-based image models \\n\\nThe same approach generalizes beyond language models to unaligning safety-tuned text-to-image diffusion models. We start from a safety-aligned Stable Diffusion 2.1 model and fine-tune it using GRP-Obliteration. Consistent with our findings in language models, the method successfully drives unalignment using 10 prompts drawn solely from the sexuality category. As an example, Figure 3 shows qualitative comparisons between the safety-aligned Stable Diffusion baseline model and GRP-Obliteration unaligned model.  \\n\\n![](https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2026\/02\/image-15.webp)Figure 3. Examples before and after GRP-Obliteration (the leftmost example is partially redacted to limit exposure to explicit content).\\n\\n## What does this mean for defenders and builders? \\n\\nThis post is not arguing that today\u2019s alignment strategies are ineffective. In many real deployments, they meaningfully reduce harmful outputs. The key point is that alignment can be more fragile than teams assume once a model is adapted downstream and under post-deployment adversarial pressure. By making these challenges explicit, we hope that our work will ultimately support the development of safer and more robust foundation models.  \\n\\nSafety alignment is not static during fine-tuning, and small amounts of data can cause meaningful shifts in safety behavior without harming model utility. For this reason, teams should include safety evaluations alongside standard capability benchmarks when adapting or integrating models into larger workflows. \\n\\n## Learn more \\n\\nTo explore the full details and analysis behind these findings, please see this research paper on arXiv. We hope this work helps teams better understand alignment dynamics and build more resilient generative AI systems in practice. \\n\\n_To learn more about Microsoft Security solutions, visit ourwebsite. Bookmark the Security blog to keep up with our expert coverage on security matters. Also, follow us on LinkedIn (Microsoft Security) and X (@MSFTSecurity) for the latest news and updates on cybersecurity.  _\\n\\nThe post A one-prompt attack that breaks LLM safety alignment appeared first on Microsoft Security Blog.&#8221;,&#8221;published&#8221;:&#8221;2026-02-09T17:12:11&#8243;,&#8221;modified&#8221;:&#8221;2026-02-09T17:12:11&#8243;,&#8221;type&#8221;:&#8221;mssecure&#8221;,&#8221;title&#8221;:&#8221;A one-prompt attack that breaks LLM safety alignment&#8221;,&#8221;source&#8221;:&#8221;&#8221;,&#8221;references&#8221;:&#8221;&#8221;,&#8221;id&#8221;:&#8221;MSSECURE:152CE39F3D4E3735856901A115FF12BC&#8221;,&#8221;bulletinFamily&#8221;:&#8221;blog&#8221;,&#8221;cwe&#8221;:null,&#8221;cvelist&#8221;:[],&#8221;sourceData&#8221;:&#8221;&#8221;,&#8221;sourceHref&#8221;:&#8221;&#8221;,&#8221;cvss&#8221;:{&#8220;score&#8221;:0,&#8221;severity&#8221;:&#8221;NONE&#8221;,&#8221;vector&#8221;:&#8221;NONE&#8221;,&#8221;version&#8221;:&#8221;NONE&#8221;},&#8221;cvss2&#8243;:{},&#8221;cvss3&#8243;:{&#8220;version&#8221;:&#8221;&#8221;,&#8221;vectorString&#8221;:&#8221;&#8221;,&#8221;baseScore&#8221;:0,&#8221;baseSeverity&#8221;:&#8221;&#8221;,&#8221;attackVector&#8221;:&#8221;&#8221;,&#8221;attackComplexity&#8221;:&#8221;&#8221;,&#8221;privilegesRequired&#8221;:&#8221;&#8221;,&#8221;userInteraction&#8221;:&#8221;&#8221;,&#8221;scope&#8221;:&#8221;&#8221;,&#8221;confidentialityImpact&#8221;:&#8221;&#8221;,&#8221;integrityImpact&#8221;:&#8221;&#8221;,&#8221;availabilityImpact&#8221;:&#8221;&#8221;,&#8221;cvssV3&#8243;:{&#8220;version&#8221;:&#8221;&#8221;,&#8221;vectorString&#8221;:&#8221;&#8221;,&#8221;baseScore&#8221;:0,&#8221;baseSeverity&#8221;:&#8221;&#8221;,&#8221;attackVector&#8221;:&#8221;&#8221;,&#8221;attackComplexity&#8221;:&#8221;&#8221;,&#8221;privilegesRequired&#8221;:&#8221;&#8221;,&#8221;userInteraction&#8221;:&#8221;&#8221;,&#8221;scope&#8221;:&#8221;&#8221;,&#8221;confidentialityImpact&#8221;:&#8221;&#8221;,&#8221;integrityImpact&#8221;:&#8221;&#8221;,&#8221;availabilityImpact&#8221;:&#8221;&#8221;}},&#8221;href&#8221;:&#8221;https:\/\/www.microsoft.com\/en-us\/security\/blog\/2026\/02\/09\/prompt-attack-breaks-llm-safety\/&#8221;,&#8221;category_name&#8221;:&#8221;News&#8221;,&#8221;post_link&#8221;:&#8221;&#8221;,&#8221;product&#8221;:&#8221;&#8221;,&#8221;version&#8221;:&#8221;&#8221;,&#8221;vendor&#8221;:&#8221;&#8221;,&#8221;ai_description&#8221;:&#8221;&#8221;,&#8221;ai_severity&#8221;:&#8221;&#8221;,&#8221;ai_vendor&#8221;:&#8221;&#8221;,&#8221;ai_product&#8221;:&#8221;&#8221;,&#8221;ai_version&#8221;:&#8221;&#8221;,&#8221;ai_score&#8221;:0}<\/p>\n","protected":false},"excerpt":{"rendered":"<p>{&#8220;lastseen&#8221;:&#8221;2026-02-09T18:01:24&#8243;,&#8221;description&#8221;:&#8221;Large language models (LLMs) and diffusion models now power a wide range of applications, from document assistance to text-to-image generation, and users increasingly expect these&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[6,8,12,110,13,33,7,11,5],"class_list":["post-39832","post","type-post","status-publish","format-standard","hentry","category-category_news","tag-cve","tag-cvss","tag-exploit","tag-mssecure","tag-news","tag-none","tag-security","tag-tapic","tag-vulnerability"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC - zero redgem<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/zero.redgem.net\/?p=39832\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC - zero redgem\" \/>\n<meta property=\"og:description\" content=\"{&#8220;lastseen&#8221;:&#8221;2026-02-09T18:01:24&#8243;,&#8221;description&#8221;:&#8221;Large language models (LLMs) and diffusion models now power a wide range of applications, from document assistance to text-to-image generation, and users increasingly expect these...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/zero.redgem.net\/?p=39832\" \/>\n<meta property=\"og:site_name\" content=\"zero redgem\" \/>\n<meta property=\"article:published_time\" content=\"2026-02-09T12:53:54+00:00\" \/>\n<meta name=\"author\" content=\"invoker\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"invoker\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/?p=39832#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/?p=39832\"},\"author\":{\"name\":\"invoker\",\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/#\\\/schema\\\/person\\\/fbfeae8dfad117ac08a7621bee1a1dca\"},\"headline\":\"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC\",\"datePublished\":\"2026-02-09T12:53:54+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/?p=39832\"},\"wordCount\":980,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/#organization\"},\"keywords\":[\"CVE\",\"CVSS\",\"exploit\",\"mssecure\",\"news\",\"NONE\",\"Security\",\"tapic\",\"Vulnerability\"],\"articleSection\":[\"category_news\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/zero.redgem.net\\\/?p=39832#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/?p=39832\",\"url\":\"https:\\\/\\\/zero.redgem.net\\\/?p=39832\",\"name\":\"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC - zero redgem\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/#website\"},\"datePublished\":\"2026-02-09T12:53:54+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/?p=39832#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/zero.redgem.net\\\/?p=39832\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/?p=39832#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/zero.redgem.net\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/#website\",\"url\":\"https:\\\/\\\/zero.redgem.net\\\/\",\"name\":\"zero redgem\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/zero.redgem.net\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/#organization\",\"name\":\"zero redgem\",\"url\":\"https:\\\/\\\/zero.redgem.net\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"\",\"contentUrl\":\"\",\"width\":191,\"height\":188,\"caption\":\"zero redgem\"},\"image\":{\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/#\\\/schema\\\/logo\\\/image\\\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/zero.redgem.net\\\/#\\\/schema\\\/person\\\/fbfeae8dfad117ac08a7621bee1a1dca\",\"name\":\"invoker\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f17c01d7338e6932bcde121cf83569393df3374625d25afd62677cfb528f2e3e?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f17c01d7338e6932bcde121cf83569393df3374625d25afd62677cfb528f2e3e?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f17c01d7338e6932bcde121cf83569393df3374625d25afd62677cfb528f2e3e?s=96&d=mm&r=g\",\"caption\":\"invoker\"},\"sameAs\":[\"https:\\\/\\\/zero.redgem.net\"],\"url\":\"https:\\\/\\\/zero.redgem.net\\\/?author=1\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC - zero redgem","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/zero.redgem.net\/?p=39832","og_locale":"en_US","og_type":"article","og_title":"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC - zero redgem","og_description":"{&#8220;lastseen&#8221;:&#8221;2026-02-09T18:01:24&#8243;,&#8221;description&#8221;:&#8221;Large language models (LLMs) and diffusion models now power a wide range of applications, from document assistance to text-to-image generation, and users increasingly expect these...","og_url":"https:\/\/zero.redgem.net\/?p=39832","og_site_name":"zero redgem","article_published_time":"2026-02-09T12:53:54+00:00","author":"invoker","twitter_card":"summary_large_image","twitter_misc":{"Written by":"invoker","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/zero.redgem.net\/?p=39832#article","isPartOf":{"@id":"https:\/\/zero.redgem.net\/?p=39832"},"author":{"name":"invoker","@id":"https:\/\/zero.redgem.net\/#\/schema\/person\/fbfeae8dfad117ac08a7621bee1a1dca"},"headline":"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC","datePublished":"2026-02-09T12:53:54+00:00","mainEntityOfPage":{"@id":"https:\/\/zero.redgem.net\/?p=39832"},"wordCount":980,"commentCount":0,"publisher":{"@id":"https:\/\/zero.redgem.net\/#organization"},"keywords":["CVE","CVSS","exploit","mssecure","news","NONE","Security","tapic","Vulnerability"],"articleSection":["category_news"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/zero.redgem.net\/?p=39832#respond"]}]},{"@type":"WebPage","@id":"https:\/\/zero.redgem.net\/?p=39832","url":"https:\/\/zero.redgem.net\/?p=39832","name":"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC - zero redgem","isPartOf":{"@id":"https:\/\/zero.redgem.net\/#website"},"datePublished":"2026-02-09T12:53:54+00:00","breadcrumb":{"@id":"https:\/\/zero.redgem.net\/?p=39832#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/zero.redgem.net\/?p=39832"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/zero.redgem.net\/?p=39832#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/zero.redgem.net\/"},{"@type":"ListItem","position":2,"name":"A one-prompt attack that breaks LLM safety alignment_MSSECURE:152CE39F3D4E3735856901A115FF12BC"}]},{"@type":"WebSite","@id":"https:\/\/zero.redgem.net\/#website","url":"https:\/\/zero.redgem.net\/","name":"zero redgem","description":"","publisher":{"@id":"https:\/\/zero.redgem.net\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/zero.redgem.net\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/zero.redgem.net\/#organization","name":"zero redgem","url":"https:\/\/zero.redgem.net\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/zero.redgem.net\/#\/schema\/logo\/image\/","url":"","contentUrl":"","width":191,"height":188,"caption":"zero redgem"},"image":{"@id":"https:\/\/zero.redgem.net\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/zero.redgem.net\/#\/schema\/person\/fbfeae8dfad117ac08a7621bee1a1dca","name":"invoker","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/f17c01d7338e6932bcde121cf83569393df3374625d25afd62677cfb528f2e3e?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/f17c01d7338e6932bcde121cf83569393df3374625d25afd62677cfb528f2e3e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f17c01d7338e6932bcde121cf83569393df3374625d25afd62677cfb528f2e3e?s=96&d=mm&r=g","caption":"invoker"},"sameAs":["https:\/\/zero.redgem.net"],"url":"https:\/\/zero.redgem.net\/?author=1"}]}},"_links":{"self":[{"href":"https:\/\/zero.redgem.net\/index.php?rest_route=\/wp\/v2\/posts\/39832","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zero.redgem.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zero.redgem.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zero.redgem.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zero.redgem.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=39832"}],"version-history":[{"count":0,"href":"https:\/\/zero.redgem.net\/index.php?rest_route=\/wp\/v2\/posts\/39832\/revisions"}],"wp:attachment":[{"href":"https:\/\/zero.redgem.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=39832"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zero.redgem.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=39832"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zero.redgem.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=39832"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}