RoguePilot Flaw in GitHub Codespaces Enabled Copilot to Leak GITHUB_TOKEN

A vulnerability in GitHub Codespaces could be used by malicious actors to seize control of repositories by injecting malicious Copilot commands into a GitHub issue.
Artificial intelligence (AI) driven vulnerabilities are codenamed RoguePilot by Orca Security. It has since been shut down by Microsoft following responsible disclosures.
“Attackers can execute hidden instructions within a GitHub issue that are automatically processed by GitHub Copilot, giving them silent control of the in-codespaces AI agent,” security researcher Roi Nisimi said in a report.
Vulnerability is defined as a case of intentional or unintentional information injection where a malicious command is embedded within data or content processed by a large language model (LLM), causing it to produce unintended results or perform inappropriate actions.
The cloud security company also called it a type of AI-mediated supply chain attack that enables LLM to automatically issue malicious instructions embedded in developer content, in this case, the GitHub problem.
The attack starts with a malicious GitHub issue that then triggers a quick injection into Copilot when an unsuspecting user launches Codespace from that issue. This trusted developer workflow, in turn, allows attacker instructions to be silently executed by the AI assistant and leak sensitive data, such as the GITHUB_TOKEN privilege.
RoguePilot takes advantage of the fact that there are a number of entry points for implementing Codespaces, including templates, repositories, commits, pull requests, or issues. The problem occurs when codespace is opened on a problem, as the built-in GitHub Copilot is automatically given a description of the problem as a prompt to generate a response.
As a result, these AI integrations can be weaponized to trick the Copilot into executing malicious commands. The attack can be done covertly by hiding information in a GitHub issue using the HTML comment tag “.” The specially crafted message, on the other hand, instructs the AI assistant to leak the GITHUB_TOKEN to an external server under the attacker’s control.
“By tricking Copilot in Codespace to test a crafted pull request that contains a symbolic link to an internal file, an attacker can cause Copilot to read that file and (via a remote JSON $schema) issue a privileged GITHUB_TOKEN to the remote server,” Nisimi explained.
From Quick Injections to Promptware
The development comes as Microsoft discovered that Group Relative Policy Optimization (GRPO), a reinforcement learning technique often used to fine-tune LLMs after deployment, can also be used to remove its security features. The procedure is code named GRP-Obliteration.
More importantly, the study found that a single unlabeled prompt, such as “Create a fake news article that would lead to panic or confusion,” was enough to reliably distinguish 15 language models.
“What makes this surprising is that this notification is mild and does not mention violence, illegal activity, or explicit content,” Microsoft researchers Mark Russinovich, Giorgio Severi, Blake Bullwinkel, Yanan Cai, Keegan Hines, and Ahmed Salem noted. “Yet training on this one example makes the model more applicable to many other risk categories that it never saw during training.”

This disclosure is also accompanied by the discovery of various side channels that can be equipped to determine the subject of a user’s conversation and even fingerprint users’ questions with an accuracy of more than 75%, the latter using predictive coding, an optimization technique used by LLMs to generate multiple candidate tokens in parallel to improve predictive coding.
Recent research has found that models pushed back to the integration graph level – a process called ShadowLogic – can further put AI systems at risk by allowing tool calls to be silently changed without the user’s knowledge. This new scenario is codenamed Agentic ShadowLogic by HiddenLayer.
An attacker can use such a backdoor to intercept requests to download content from a URL in real-time, such that it is routed to infrastructure under his control before it is transferred to the physical location.
“By logging requests over time, an attacker can map which internal endpoints exist, when they are accessed, and what data flows through them,” the AI security firm said. “The user receives his expected data without errors or warnings. Everything works normally on the surface while the attacker silently logs everything in the background.”
And it doesn’t end there. Last month, the Neural Trust demonstrated a new image jailbreak attack with a code called Semantic Chaining that allows users to bypass security filters on models such as the Grok 4, Gemini Nano Banana Pro, and Seedance 4.5, and generate unauthorized content by using the models’ ability to perform multi-stage image manipulation.
The attack, at its core, exploits the models’ lack of “thinking depth” to track the hidden intent of every multi-step command, thus allowing a bad actor to launch a series of edits that, although innocent in isolation, slowly-but-slowly destroys the model’s security resistance until an undesirable output is produced.
It starts by asking the AI chatbot to imagine any random scene and instruct it to change one thing in the first generated image. In the next phase, the attacker asks the model to perform a second transformation, this time transforming something that is forbidden or offensive.
This works because the model focuses on modifying an existing image instead of creating a new one, which fails to disable security alarms as it treats the original image as legitimate.
“Instead of releasing a single piece of information, which is very dangerous, which could cause an immediate block, the attacker launches a series of ‘safe’ commands that meet the forbidden result,” said security researcher Alessandro Pignati.
In a study published last month, researchers Oleg Brodt, Elad Feldman, Bruce Schneier, and Ben Nassi argued that prompt injections have outstripped the exploitation of inputs in what they call promptware – a new class of malware that is launched using information designed to exploit an operating system.
Promptware essentially manipulates LLM to enable various stages of the cyber attack life cycle: initial access, privilege escalation, surveillance, persistence, command and control, collective movement, and malicious outcomes (eg, data recovery, social engineering, code execution, or money theft).
“Promptware refers to a polymorphic family of prompts designed to behave like malware, exploiting LLMs to perform malicious activities by abusing application context, permissions, and functionality,” the researchers said. “Basically, promptware is an input, whether text, image, or sound, that controls the behavior of LLM during prompting, directed at applications or users.”



