From 9555c8b4bca5e9484459111a85a6bde1be8b2ad7 Mon Sep 17 00:00:00 2001
From: herc
Date: Tue, 16 Sep 2025 09:31:25 +0100
Subject: [PATCH 01/22] Added ASI09 - Human-Agent Trust Exploitation Entry

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 35 ++++++++++++++-----
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 9a13acc5..1aa52976 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -2,27 +2,44 @@

 **Description:**

-A brief description of the vulnerability that includes its potential effects such as system compromises, data breaches, or other security concerns.
+Human-Agent Trust Exploitation refers to a class of vulnerabilities where attackers manipulate or compromise an AI agent to abuse the inherent trust humans extend to it. As AI agents become more autonomous, persuasive, and integrated into critical workflows, users unconsciously widen their trust boundary—delegating decisions without fully verifying provenance, context, or intent.
+The vulnerability does not lie in the agent’s code or model alone, but in the socio-technical interface: the intersection where human trust, cognitive biases, and system outputs converge. At this interface, users often assume the agent’s actions are reliable, safe, and system-approved. Attackers exploit this misplaced trust to launch sophisticated social engineering attacks—persuading users to run malicious code, divulge credentials, approve fraudulent transactions, ignore security warnings, or disclose sensitive information.
+This risk combines elements of automation bias, authority misuse, and social engineering, amplified by the agent’s anthropomorphic behavior and seamless integration with high-value domains such as finance, defense, and healthcare. In such contexts, the agent becomes a trusted intermediary—making malicious actions appear contextually appropriate and significantly harder for users to detect.
+
 **Common Examples of Vulnerability:**

-1. Example 1: Specific instance or type of this vulnerability.
-2. Example 2: Another instance or type of this vulnerability.
-3. Example 3: Yet another instance or type of this vulnerability.
+1. Insufficient Explainability: A user cannot inspect the agent's reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly.
+2. Missing Confirmation for Sensitive Actions: There is no multi-step or high-friction process requiring confirmation before the agent can execute a high-risk action. Examples of high-risk actions might be transferring money, deleting files, or changing a security setting.
+3. Unverified Information Presentation: The agent presents information from external sources as fact, without providing its origin, confidence, or indication that the information has not been verified.
+4. Inherited Trust: Agents integrated within an existing platform may automatically inherit the trust of that platform, without boundaries or warnings to identify the actions it takes.
+5. Lack of Clear Identity: The agent is not consistent in identifying itself as a non-human AI, or it fails to make its operational boundaries clear, leading users to place undue human-like trust in it.
+6. Excessive Anthropomorphism: The agent is designed to be too human-like in its personality and language. This could exploit human psychological biases, resulting in undue trust in it.

 **How to Prevent:**

-1. Prevention Step 1: A step or strategy that can be used to prevent the vulnerability or mitigate its effects.
-2. Prevention Step 2: Another prevention step or strategy.
-3. Prevention Step 3: Yet another prevention step or strategy.
+1. Explicit Confirmation for Sensitive Actions: The agent must require explicit, multi-step user confirmation before performing any sensitive actions. This includes accessing credentials, transferring data, modifying system configurations, or executing financial transactions. This acts as a critical "Are you sure?" checkpoint.
+2. Clear Scoping and Identity: The AI agent must always clearly identify itself as a non-human entity. Its capabilities, limitations, and operational boundaries should be transparent to the user. Deception about its identity or capabilities should be strictly prohibited.
+3. Explainability (XAI): Implement features that allow the user to inspect the agent's reasoning. For any proposed action, the user should be able to ask "Why did you suggest that?" and receive a clear explanation based on the data and instructions the agent received.
+4. Immutable Interaction Logs: Maintain a secure, tamper-proof log of all interactions and decisions made by both the user and the agent. This is crucial for auditing, incident response, and forensic analysis.
+5. Rate Limiting and Anomaly Detection: Monitor the frequency and type of requests the agent makes to the user. A sudden increase in requests for sensitive information or high-risk actions could indicate a compromise.
+6. User Security Training: Educate users about the potential for AI-driven social engineering. Training should cover how to recognize suspicious agent behavior and the importance of independently verifying unexpected or high-stakes requests.

 **Example Attack Scenarios:**

-Scenario #1: A detailed scenario illustrating how an attacker could potentially exploit this vulnerability, including the attacker's actions and the potential outcomes.
+Scenario #1: The "Helpful Assistant" Trojan
+An attacker compromises a developer's coding assistant agent. The agent monitors the developer's activity and waits for them to encounter a complex bug. The agent then proactively suggests a "clever, one-line fix" and presents a command to be copied and pasted into the terminal. The developer, trusting the assistant's capabilities and eager for a quick solution, executes the command. The command is actually a malicious script that exfiltrates the company's private source code repositories or installs a backdoor.
+
+Scenario #2: Credential Harvesting via Contextual Deception
+An attacker gains control over the logic of an IT support agent integrated into a corporate messaging platform. The attacker instructs the agent to target a new employee in the finance department. The agent initiates a conversation, referencing the employee's recent support tickets to build credibility. It then states, "To finalize the setup of your secure access to the payment portal, I need to verify your credentials one last time. Please provide your password and the MFA code you just received." Because the request is highly contextual and appears to come from a trusted, automated system, the employee complies, giving the attacker full access.
+
+Scenario #3: Data Exfiltration via Gradual Approval
+A malicious actor poisons the data used to fine-tune a business intelligence agent responsible for generating weekly sales reports for executives. For several weeks, the agent generates perfect reports, building the executives' trust. Then, the attacker subtly manipulates the agent to embed small, encoded chunks of sensitive customer data within the charts and tables of a seemingly normal report. The executive, accustomed to approving these reports, gives the final sign-off, which triggers a workflow that unknowingly emails the data-laden report to an external email address controlled by the attacker.

-Scenario #2: Another example of an attack scenario showing a different way the vulnerability could be exploited.

 **Reference Links:**

 1. [Link Title](URL): Brief description of the reference link.
 2. [Link Title](URL): Brief description of the reference link.
+
+

From 6ae2989b0f977e56ee0ff8cda39ca55373f55447 Mon Sep 17 00:00:00 2001
From: herc
Date: Sun, 28 Sep 2025 15:52:54 +0100
Subject: [PATCH 02/22] Updated introduction removing intro paragraph

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 1aa52976..79f64feb 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -2,7 +2,6 @@

 **Description:**

-Human-Agent Trust Exploitation refers to a class of vulnerabilities where attackers manipulate or compromise an AI agent to abuse the inherent trust humans extend to it. As AI agents become more autonomous, persuasive, and integrated into critical workflows, users unconsciously widen their trust boundary—delegating decisions without fully verifying provenance, context, or intent.
 The vulnerability does not lie in the agent’s code or model alone, but in the socio-technical interface: the intersection where human trust, cognitive biases, and system outputs converge. At this interface, users often assume the agent’s actions are reliable, safe, and system-approved. Attackers exploit this misplaced trust to launch sophisticated social engineering attacks—persuading users to run malicious code, divulge credentials, approve fraudulent transactions, ignore security warnings, or disclose sensitive information.
 This risk combines elements of automation bias, authority misuse, and social engineering, amplified by the agent’s anthropomorphic behavior and seamless integration with high-value domains such as finance, defense, and healthcare. In such contexts, the agent becomes a trusted intermediary—making malicious actions appear contextually appropriate and significantly harder for users to detect.

From 8ec80175ce02465ac38124909aa45a15147d8b12 Mon Sep 17 00:00:00 2001
From: herc
Date: Sun, 28 Sep 2025 16:24:18 +0100
Subject: [PATCH 03/22] Updated 1. Insufficient Explainability with more details

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 79f64feb..c4d14bda 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -8,7 +8,7 @@ This risk combines elements of automation bias, authority misuse, and social eng

 **Common Examples of Vulnerability:**

-1. Insufficient Explainability: A user cannot inspect the agent's reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly.
+1. Insufficient Explainability: A user cannot inspect the agent's reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly. This turns the agent into an opaque authority allowing an attacker to hijack its credability to deliver malicious instructions.
 2. Missing Confirmation for Sensitive Actions: There is no multi-step or high-friction process requiring confirmation before the agent can execute a high-risk action. Examples of high-risk actions might be transferring money, deleting files, or changing a security setting.
 3. Unverified Information Presentation: The agent presents information from external sources as fact, without providing its origin, confidence, or indication that the information has not been verified.

From 047a7fa4e1950492c5bba5dab4cee83d214b4c3e Mon Sep 17 00:00:00 2001
From: herc
Date: Sun, 28 Sep 2025 16:36:22 +0100
Subject: [PATCH 04/22] Updated 2. Missing confirmation for sensitive actions with more details

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index c4d14bda..d0b722d4 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -8,8 +8,8 @@ This risk combines elements of automation bias, authority misuse, and social eng

 **Common Examples of Vulnerability:**

-1. Insufficient Explainability: A user cannot inspect the agent's reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly. This turns the agent into an opaque authority allowing an attacker to hijack its credability to deliver malicious instructions.
-2. Missing Confirmation for Sensitive Actions: There is no multi-step or high-friction process requiring confirmation before the agent can execute a high-risk action. Examples of high-risk actions might be transferring money, deleting files, or changing a security setting.
+1. Insufficient Explainability: A user cannot inspect the agent's reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly. This turns the agent into an opaque authority, allowing an attacker to hijack its credability to deliver malicious instructions.
+2. Missing Confirmation for Sensitive Actions: The agent is permtited to execute high-impact functions such as financial transfers or deleting files, without requiring a final explicit confirmation from the user. By removing this critical safety check, the system allows a single, potentially manipulated command to have immediate and irreversible consequences.
 3. Unverified Information Presentation: The agent presents information from external sources as fact, without providing its origin, confidence, or indication that the information has not been verified.
 4. Inherited Trust: Agents integrated within an existing platform may automatically inherit the trust of that platform, without boundaries or warnings to identify the actions it takes.

From 32d7ef3560536a89c3062c8ade8b50824a2b5905 Mon Sep 17 00:00:00 2001
From: herc
Date: Sun, 28 Sep 2025 16:46:27 +0100
Subject: [PATCH 05/22] Replace mitigations 2. Clear Scoping and Identity for Demarcate Trust Boundaries

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index d0b722d4..20dbf581 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -18,7 +18,7 @@ This risk combines elements of automation bias, authority misuse, and social eng
 **How to Prevent:**

 1. Explicit Confirmation for Sensitive Actions: The agent must require explicit, multi-step user confirmation before performing any sensitive actions. This includes accessing credentials, transferring data, modifying system configurations, or executing financial transactions. This acts as a critical "Are you sure?" checkpoint.
-2. Clear Scoping and Identity: The AI agent must always clearly identify itself as a non-human entity. Its capabilities, limitations, and operational boundaries should be transparent to the user. Deception about its identity or capabilities should be strictly prohibited.
+2. Demarcate Trust Boundaries: Use visual cues like warning colours and icons in the UI to signal when the agent proposes a high-risk action (e.g. running a command). This breaks the user's passive trust and prompts scrutiny precisely when it's needed most.
 3. Explainability (XAI): Implement features that allow the user to inspect the agent's reasoning. For any proposed action, the user should be able to ask "Why did you suggest that?" and receive a clear explanation based on the data and instructions the agent received.
 4. Immutable Interaction Logs: Maintain a secure, tamper-proof log of all interactions and decisions made by both the user and the agent. This is crucial for auditing, incident response, and forensic analysis.
 5. Rate Limiting and Anomaly Detection: Monitor the frequency and type of requests the agent makes to the user. A sudden increase in requests for sensitive information or high-risk actions could indicate a compromise.

From d8c75049afa1e565cc023c9293065c8b2cf3f0ac Mon Sep 17 00:00:00 2001
From: herc
Date: Sun, 28 Sep 2025 16:54:35 +0100
Subject: [PATCH 06/22] Updated mitigations 3. Explainability with more practical actions

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 20dbf581..ff4c5b7e 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -19,7 +19,7 @@ This risk combines elements of automation bias, authority misuse, and social eng

 1. Explicit Confirmation for Sensitive Actions: The agent must require explicit, multi-step user confirmation before performing any sensitive actions. This includes accessing credentials, transferring data, modifying system configurations, or executing financial transactions. This acts as a critical "Are you sure?" checkpoint.
 2. Demarcate Trust Boundaries: Use visual cues like warning colours and icons in the UI to signal when the agent proposes a high-risk action (e.g. running a command). This breaks the user's passive trust and prompts scrutiny precisely when it's needed most.
-3. Explainability (XAI): Implement features that allow the user to inspect the agent's reasoning. For any proposed action, the user should be able to ask "Why did you suggest that?" and receive a clear explanation based on the data and instructions the agent received.
+3. Explainability (XAI): Make explanations proactive and layered. Always provide a simple justification upfront for significant suggestions, with an option to drill down to detailed logic and direct source links. Make verifying the agent's trustworthiness a seamless part of the workflow.
 4. Immutable Interaction Logs: Maintain a secure, tamper-proof log of all interactions and decisions made by both the user and the agent. This is crucial for auditing, incident response, and forensic analysis.
 5. Rate Limiting and Anomaly Detection: Monitor the frequency and type of requests the agent makes to the user. A sudden increase in requests for sensitive information or high-risk actions could indicate a compromise.
 6. User Security Training: Educate users about the potential for AI-driven social engineering. Training should cover how to recognize suspicious agent behavior and the importance of independently verifying unexpected or high-stakes requests.
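To make the layered explainability mitigation in PATCH 06 concrete, the sketch below shows one way an agent UI could attach a short upfront justification to each significant suggestion, with an expandable reasoning trail and source links. This is an illustrative sketch only: the `LayeredExplanation` and `Source` classes, their fields, and the rendering helpers are assumptions of this example, not APIs from the ASI09 entry or any referenced OWASP project.

```python
from dataclasses import dataclass, field

@dataclass
class Source:
    """A piece of evidence behind a suggestion, with provenance and confidence."""
    url: str
    excerpt: str
    confidence: float  # 0.0-1.0, as estimated by the retrieval layer (assumed)

@dataclass
class LayeredExplanation:
    """Proactive, layered rationale attached to every significant agent suggestion."""
    summary: str                # one-line justification, always shown upfront
    reasoning_steps: list[str]  # detailed logic, revealed on drill-down
    sources: list[Source] = field(default_factory=list)

    def render_upfront(self) -> str:
        # The short justification travels with the suggestion itself.
        return f"Why: {self.summary} (expand for {len(self.reasoning_steps)} steps)"

    def render_detail(self) -> str:
        # Drill-down view: full chain of reasoning plus direct source links.
        lines = [f"{i + 1}. {step}" for i, step in enumerate(self.reasoning_steps)]
        lines += [f"source: {s.url} (confidence {s.confidence:.0%})" for s in self.sources]
        return "\n".join(lines)

# Hypothetical usage: the agent proposes a dependency pin and justifies it.
suggestion = LayeredExplanation(
    summary="Pin dependency X to 2.4.1 because 2.5.0 introduced a breaking API change",
    reasoning_steps=[
        "Build log shows ImportError after upgrading to 2.5.0",
        "Changelog for 2.5.0 lists removal of the deprecated entry point",
    ],
    sources=[Source("https://example.com/changelog", "Removed legacy entry point", 0.9)],
)
print(suggestion.render_upfront())
print(suggestion.render_detail())
```

The design point is that verification is zero-cost at the moment of decision: the user never has to leave the workflow to ask "Why did you suggest that?", which is exactly the gap the Insufficient Explainability example describes.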
From 173046cd8ef9fe146e3a30092a8d73516608e723 Mon Sep 17 00:00:00 2001
From: herc
Date: Sun, 28 Sep 2025 17:09:24 +0100
Subject: [PATCH 07/22] Removed mitigation and added two new more specific ones

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index ff4c5b7e..9f4095e6 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -22,7 +22,8 @@ This risk combines elements of automation bias, authority misuse, and social eng
 3. Explainability (XAI): Make explanations proactive and layered. Always provide a simple justification upfront for significant suggestions, with an option to drill down to detailed logic and direct source links. Make verifying the agent's trustworthiness a seamless part of the workflow.
 4. Immutable Interaction Logs: Maintain a secure, tamper-proof log of all interactions and decisions made by both the user and the agent. This is crucial for auditing, incident response, and forensic analysis.
 5. Rate Limiting and Anomaly Detection: Monitor the frequency and type of requests the agent makes to the user. A sudden increase in requests for sensitive information or high-risk actions could indicate a compromise.
-6. User Security Training: Educate users about the potential for AI-driven social engineering. Training should cover how to recognize suspicious agent behavior and the importance of independently verifying unexpected or high-stakes requests.
+6. Report Suspicious Interactions: Provide a prominent option that allows users to instantly flag weird or possibly malicious interactions. This could be a one-click button or command that immediately provides feedback, triggering an automated review or temporary lockdown of the agent's capabilities.
+7. Adjustable Safety Levels: Allow users to set the agent's level of autonomy, similar to a browser's security settings (e.g. High, Medium, Low). An increased safety setting would enforce stricter confirmations and require more detailed explanations by default, allowing more control for critical workflows and cautious users.

 **Example Attack Scenarios:**

From cdd581315c324a8cca815ac2b5985c465df2c16e Mon Sep 17 00:00:00 2001
From: herc
Date: Sun, 28 Sep 2025 17:23:05 +0100
Subject: [PATCH 08/22] typo spelling mistake fixed

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 9f4095e6..c342112b 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -8,7 +8,7 @@ This risk combines elements of automation bias, authority misuse, and social eng

 **Common Examples of Vulnerability:**

-1. Insufficient Explainability: A user cannot inspect the agent's reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly. This turns the agent into an opaque authority, allowing an attacker to hijack its credability to deliver malicious instructions.
+1. Insufficient Explainability: A user cannot inspect the agent's reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly. This turns the agent into an opaque authority, allowing an attacker to hijack its credibility to deliver malicious instructions.
 2. Missing Confirmation for Sensitive Actions: The agent is permtited to execute high-impact functions such as financial transfers or deleting files, without requiring a final explicit confirmation from the user. By removing this critical safety check, the system allows a single, potentially manipulated command to have immediate and irreversible consequences.
 3. Unverified Information Presentation: The agent presents information from external sources as fact, without providing its origin, confidence, or indication that the information has not been verified.

From 03dd8de23da54c7d8f7b1dd6f0fd44a0b0c9684c Mon Sep 17 00:00:00 2001
From: herc
Date: Mon, 29 Sep 2025 10:35:20 +0100
Subject: [PATCH 09/22] Updated scenario 2. Credential Harvesting to include more specifics

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index c342112b..227c4551 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -31,7 +31,7 @@ Scenario #1: The "Helpful Assistant" Trojan
 An attacker compromises a developer's coding assistant agent. The agent monitors the developer's activity and waits for them to encounter a complex bug. The agent then proactively suggests a "clever, one-line fix" and presents a command to be copied and pasted into the terminal. The developer, trusting the assistant's capabilities and eager for a quick solution, executes the command. The command is actually a malicious script that exfiltrates the company's private source code repositories or installs a backdoor.

 Scenario #2: Credential Harvesting via Contextual Deception
-An attacker gains control over the logic of an IT support agent integrated into a corporate messaging platform. The attacker instructs the agent to target a new employee in the finance department. The agent initiates a conversation, referencing the employee's recent support tickets to build credibility. It then states, "To finalize the setup of your secure access to the payment portal, I need to verify your credentials one last time. Please provide your password and the MFA code you just received." Because the request is highly contextual and appears to come from a trusted, automated system, the employee complies, giving the attacker full access.
+An attacker exploits a prompt-injection vulnerability in the task-scheduler API of an internal IT support agent. The attacker injects instructions into the agent to target a new finance employee and to capture and exfiltrate the user's credentials should the user disclose them. The agent initiates a conversation, referencing the employee's recent support tickets to build credibility. It then asks the user to enter their credentials for verification. With access to the user's support request history and the ability to generate highly contextual, plausible, and reassuring responses that appear to come from a trusted system, the agent increases the likelihood of user compliance. This ultimately gives the attacker full access.

 Scenario #3: Data Exfiltration via Gradual Approval
 A malicious actor poisons the data used to fine-tune a business intelligence agent responsible for generating weekly sales reports for executives. For several weeks, the agent generates perfect reports, building the executives' trust. Then, the attacker subtly manipulates the agent to embed small, encoded chunks of sensitive customer data within the charts and tables of a seemingly normal report. The executive, accustomed to approving these reports, gives the final sign-off, which triggers a workflow that unknowingly emails the data-laden report to an external email address controlled by the attacker.

From ac69f28a8393e9d15222279349129088792f6bc0 Mon Sep 17 00:00:00 2001
From: herc
Date: Mon, 29 Sep 2025 10:38:22 +0100
Subject: [PATCH 10/22] Updated scenario 2. Credential harvesting - removed last sentence

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 227c4551..907200bf 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -31,7 +31,7 @@ Scenario #1: The "Helpful Assistant" Trojan
 An attacker compromises a developer's coding assistant agent. The agent monitors the developer's activity and waits for them to encounter a complex bug. The agent then proactively suggests a "clever, one-line fix" and presents a command to be copied and pasted into the terminal. The developer, trusting the assistant's capabilities and eager for a quick solution, executes the command. The command is actually a malicious script that exfiltrates the company's private source code repositories or installs a backdoor.

 Scenario #2: Credential Harvesting via Contextual Deception
-An attacker exploits a prompt-injection vulnerability in the task-scheduler API of an internal IT support agent. The attacker injects instructions into the agent to target a new finance employee and to capture and exfiltrate the user's credentials should the user disclose them. The agent initiates a conversation, referencing the employee's recent support tickets to build credibility. It then asks the user to enter their credentials for verification. With access to the user's support request history and the ability to generate highly contextual, plausible, and reassuring responses that appear to come from a trusted system, the agent increases the likelihood of user compliance. This ultimately gives the attacker full access.
+An attacker exploits a prompt-injection vulnerability in the task-scheduler API of an internal IT support agent. The attacker injects instructions into the agent to target a new finance employee and to capture and exfiltrate the user's credentials should the user disclose them. The agent initiates a conversation, referencing the employee's recent support tickets to build credibility. It then asks the user to enter their credentials for verification. With access to the user's support request history and the ability to generate highly contextual, plausible, and reassuring responses that appear to come from a trusted system, the agent increases the likelihood of user compliance.

 Scenario #3: Data Exfiltration via Gradual Approval
 A malicious actor poisons the data used to fine-tune a business intelligence agent responsible for generating weekly sales reports for executives. For several weeks, the agent generates perfect reports, building the executives' trust. Then, the attacker subtly manipulates the agent to embed small, encoded chunks of sensitive customer data within the charts and tables of a seemingly normal report. The executive, accustomed to approving these reports, gives the final sign-off, which triggers a workflow that unknowingly emails the data-laden report to an external email address controlled by the attacker.
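The credential-harvesting flow revised in PATCH 09 and PATCH 10 is the kind of interaction that outbound-message screening plus an explicit confirmation gate is meant to catch. The sketch below is a minimal illustration under stated assumptions: the pattern list, the `HIGH_RISK_TOOLS` set, and the `confirm` callback are hypothetical names invented for this example, and a production guard would need far broader coverage than a few regexes.

```python
import re

# Patterns suggesting the agent is soliciting secrets it should never ask for.
SOLICITATION_PATTERNS = [
    re.compile(r"\b(password|passphrase)\b", re.IGNORECASE),
    re.compile(r"\bmfa\b|\b2fa\b|\bone[- ]time (code|password)\b", re.IGNORECASE),
    re.compile(r"\bverify your credentials\b", re.IGNORECASE),
]

# Hypothetical set of tool names that must never run without human sign-off.
HIGH_RISK_TOOLS = {"transfer_funds", "delete_files", "change_security_setting"}

def screen_outbound_message(text: str) -> list[str]:
    """Return the solicitation patterns an outbound agent message matches, if any."""
    return [p.pattern for p in SOLICITATION_PATTERNS if p.search(text)]

def gate_tool_call(tool_name: str, confirm) -> bool:
    """Require an explicit out-of-band confirmation before high-risk tools run."""
    if tool_name in HIGH_RISK_TOOLS:
        # `confirm` stands in for a UI dialog or second-channel approval step.
        return confirm(f"Agent requests '{tool_name}'. Approve?")
    return True

msg = "To finalize setup, please verify your credentials and the MFA code you received."
hits = screen_outbound_message(msg)
if hits:
    # Block delivery and flag for review instead of letting the user see the lure.
    print(f"Blocked outbound message; matched: {hits}")

# A manipulated instruction alone cannot trigger the transfer without approval.
approved = gate_tool_call("transfer_funds", confirm=lambda prompt: False)
print(f"Tool call approved: {approved}")
```

The screening runs on the agent's output rather than its input, which is why it still helps after a successful prompt injection: the hijacked agent can be made to ask, but the platform refuses to deliver the ask.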
From 42f76b8882806f47a1ad7d858eb10e1913aa2fda Mon Sep 17 00:00:00 2001
From: herc
Date: Mon, 29 Sep 2025 10:44:10 +0100
Subject: [PATCH 11/22] Updated scenario 3. Gradual approval - to focus more on trust exploitation

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 907200bf..32684a7e 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -34,7 +34,7 @@ Scenario #2: Credential Harvesting via Contextual Deception
 An attacker exploits a prompt-injection vulnerability in the task-scheduler API of an internal IT support agent. The attacker injects instructions into the agent to target a new finance employee and to capture and exfiltrate the user's credentials should the user disclose them. The agent initiates a conversation, referencing the employee's recent support tickets to build credibility. It then asks the user to enter their credentials for verification. With access to the user's support request history and the ability to generate highly contextual, plausible, and reassuring responses that appear to come from a trusted system, the agent increases the likelihood of user compliance.

 Scenario #3: Data Exfiltration via Gradual Approval
-A malicious actor poisons the data used to fine-tune a business intelligence agent responsible for generating weekly sales reports for executives. For several weeks, the agent generates perfect reports, building the executives' trust. Then, the attacker subtly manipulates the agent to embed small, encoded chunks of sensitive customer data within the charts and tables of a seemingly normal report. The executive, accustomed to approving these reports, gives the final sign-off, which triggers a workflow that unknowingly emails the data-laden report to an external email address controlled by the attacker.
+A malicious actor poisons the data used to fine-tune a business intelligence agent responsible for generating weekly sales reports for executives. For several weeks, the agent generates flawless reports, building the executives’ confidence in its reliability. Because the executives trust the agent, they continue approving its reports without suspicion. The attacker then subtly manipulates the agent to embed small, encoded chunks of sensitive customer data within the charts and tables of a seemingly normal report. Trusting the report as routine, an executive gives final sign-off, which triggers a workflow that unknowingly emails the data-laden report to an external address controlled by the attacker.

 **Reference Links:**

From 2e18aeeb654ebb92e58cf461ed805ff579aae385 Mon Sep 17 00:00:00 2001
From: herc
Date: Mon, 6 Oct 2025 19:08:14 +0100
Subject: [PATCH 12/22] Fixed typo

---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 32684a7e..a2dde95e 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -9,7 +9,7 @@ This risk combines elements of automation bias, authority misuse, and social eng
 **Common Examples of Vulnerability:**

 1. Insufficient Explainability: A user cannot inspect the agent's reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly. This turns the agent into an opaque authority, allowing an attacker to hijack its credibility to deliver malicious instructions.
-2. Missing Confirmation for Sensitive Actions: The agent is permtited to execute high-impact functions such as financial transfers or deleting files, without requiring a final explicit confirmation from the user. By removing this critical safety check, the system allows a single, potentially manipulated command to have immediate and irreversible consequences.
+2. Missing Confirmation for Sensitive Actions: The agent is permitted to execute high-impact functions such as financial transfers or deleting files, without requiring a final explicit confirmation from the user. By removing this critical safety check, the system allows a single, potentially manipulated command to have immediate and irreversible consequences.
 3. Unverified Information Presentation: The agent presents information from external sources as fact, without providing its origin, confidence, or indication that the information has not been verified.
 4. Inherited Trust: Agents integrated within an existing platform may automatically inherit the trust of that platform, without boundaries or warnings to identify the actions it takes.
 5. Lack of Clear Identity: The agent is not consistent in identifying itself as a non-human AI, or it fails to make its operational boundaries clear, leading users to place undue human-like trust in it.
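The Immutable Interaction Logs mitigation that survives every revision above can be approximated with a hash chain, where each record commits to the record before it, so any retroactive edit breaks every later hash. The sketch below is illustrative only: the `HashChainedLog` class and its field names are assumptions of this example, and a real deployment would add cryptographic signing and write-once storage on top.

```python
import hashlib
import json
import time

class HashChainedLog:
    """Append-only interaction log where each record seals the one before it."""

    def __init__(self):
        self._records = []
        self._last_hash = "0" * 64  # genesis value for the first record

    def append(self, actor: str, event: str) -> dict:
        record = {
            "ts": time.time(),
            "actor": actor,        # e.g. "user" or "agent"
            "event": event,
            "prev_hash": self._last_hash,
        }
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self._last_hash = record["hash"]
        self._records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute the chain; editing any earlier record invalidates all later hashes."""
        prev = "0" * 64
        for record in self._records:
            body = {k: v for k, v in record.items() if k != "hash"}
            if body["prev_hash"] != prev:
                return False
            payload = json.dumps(body, sort_keys=True).encode()
            if hashlib.sha256(payload).hexdigest() != record["hash"]:
                return False
            prev = record["hash"]
        return True

log = HashChainedLog()
log.append("agent", "proposed command: pip install fix-lib")
log.append("user", "approved command")
assert log.verify()  # tampering with either record would make this fail
```

For incident response against scenarios like Gradual Approval, the value is forensic: the chain pins down exactly which report the executive approved and when, and the approval record cannot be quietly rewritten afterwards.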
From 2f470517da476b0599bd8728a5f9066929129183 Mon Sep 17 00:00:00 2001 From: herc Date: Mon, 6 Oct 2025 20:31:19 +0100 Subject: [PATCH 13/22] Added reference links to LLM top 10 --- .../ASI09_Human_Agent_Trust_Exploitation .md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md index a2dde95e..b2d245da 100644 --- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md +++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md @@ -39,7 +39,7 @@ A malicious actor poisons the data used to fine-tune a business intelligence age **Reference Links:** -1. [Link Title](URL): Brief description of the reference link. -2. [Link Title](URL): Brief description of the reference link. - - +1. [LLM01 Prompt Injection](https://genai.owasp.org/llmrisk/llm012025-prompt-injection/): This is the initial attack vector used to hijack the agent. By injecting malicious instructions, an attacker seizes control of the agent's logic to orchestrate a sophisticated social engineering attack, turning the trusted intermediary into a tool for deception. +2. [LLM05 Improper Output Handling](https://genai.owasp.org/llmrisk/llm052025-improper-output-handling/): This is the delivery mechanism for the attack. The vulnerability occurs when the application blindly trusts the agent's output, allowing it to deliver a malicious payload (like a command or link) to a user who assumes the information is safe and system-vetted. +3. [LLM06 Excessive Agency](https://genai.owasp.org/llmrisk/llm062025-excessive-agency/): This is the technical enabler that gives the attack its impact. An attacker exploits the user's trust to gain consent, thereby weaponizing the agent's excessive permissions to execute high-impact, malicious actions like financial transfers or data modification. +4. [LLM09 Misinformation](https://genai.owasp.org/llmrisk/llm092025-misinformation/): This is the psychological foundation of the attack. It exploits the user's cognitive biases and automation bias. This over reliance on the agent turns it into an opaque authority, making users less likely to question its requests or scrutinise its actions, as seen in the "Gradual Approval" scenario. From 3aaba7572cbeac98be0b67856711f86f29a7663d Mon Sep 17 00:00:00 2001 From: herc Date: Wed, 8 Oct 2025 11:42:56 +0100 Subject: [PATCH 14/22] Added more reference links --- .../ASI09_Human_Agent_Trust_Exploitation .md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md index b2d245da..eebe5829 100644 --- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md +++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md @@ -24,6 +24,7 @@ This risk combines elements of automation bias, authority misuse, and social eng 5. 
Rate Limiting and Anomaly Detection: Monitor the frequency and type of requests the agent makes to the user. A sudden increase in requests for sensitive information or high-risk actions could indicate a compromise. 6. Report Suspicious Interactions: Provide a prominent option that allows users to instantly flag weird or possibly malicious interactions. This could be a one click button or command that immediately provides feedback triggering an automated review or temporary lockdown of the agent's capabilities. 7. Adjustable Safety Levels: Allow users to set the agents level of autonomy, similar to a browsers security settings (e.g High, Medium, Low). An increased safety setting would enforce stricter confirmations and require more detailed explanations by default. Allowing more control for critical workflows and cautious users. +8 **Example Attack Scenarios:** @@ -43,3 +44,7 @@ A malicious actor poisons the data used to fine-tune a business intelligence age 2. [LLM05 Improper Output Handling](https://genai.owasp.org/llmrisk/llm052025-improper-output-handling/): This is the delivery mechanism for the attack. The vulnerability occurs when the application blindly trusts the agent's output, allowing it to deliver a malicious payload (like a command or link) to a user who assumes the information is safe and system-vetted. 3. [LLM06 Excessive Agency](https://genai.owasp.org/llmrisk/llm062025-excessive-agency/): This is the technical enabler that gives the attack its impact. An attacker exploits the user's trust to gain consent, thereby weaponizing the agent's excessive permissions to execute high-impact, malicious actions like financial transfers or data modification. 4. [LLM09 Misinformation](https://genai.owasp.org/llmrisk/llm092025-misinformation/): This is the psychological foundation of the attack. It exploits the user's cognitive biases and automation bias. This over reliance on the agent turns it into an opaque authority, making users less likely to question its requests or scrutinise its actions, as seen in the "Gradual Approval" scenario. +5. [AIVSS Scoring System For OWASP Agentic AI Core Security Risks v0.5](https://aivss.owasp.org/assets/publications/AIVSS) +6. [Agentic AI - Threats and Mitigations](https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/) +7. [Why human-AI relationships need socioaffective alignment](https://www.aisi.gov.uk/research/why-human-ai-relationships-need-socioaffective-alignment-2) +8. [Romeo, G., Conti, D. Exploring automation bias in human–AI collaboration: a review and implications for explainable AI. 
AI & Soc (2025)](https://doi.org/10.1007/s00146-025-02422-7) \ No newline at end of file From 5128b155272f0bf28217f3857f435e1bf4ccef0b Mon Sep 17 00:00:00 2001 From: herc Date: Wed, 8 Oct 2025 20:32:09 +0100 Subject: [PATCH 15/22] Fixed AIVSS link --- .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md index eebe5829..910c1126 100644 --- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md +++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md @@ -44,7 +44,7 @@ A malicious actor poisons the data used to fine-tune a business intelligence age 2. [LLM05 Improper Output Handling](https://genai.owasp.org/llmrisk/llm052025-improper-output-handling/): This is the delivery mechanism for the attack. The vulnerability occurs when the application blindly trusts the agent's output, allowing it to deliver a malicious payload (like a command or link) to a user who assumes the information is safe and system-vetted. 3. [LLM06 Excessive Agency](https://genai.owasp.org/llmrisk/llm062025-excessive-agency/): This is the technical enabler that gives the attack its impact. An attacker exploits the user's trust to gain consent, thereby weaponizing the agent's excessive permissions to execute high-impact, malicious actions like financial transfers or data modification. 4. [LLM09 Misinformation](https://genai.owasp.org/llmrisk/llm092025-misinformation/): This is the psychological foundation of the attack. It exploits the user's cognitive biases and automation bias. This over reliance on the agent turns it into an opaque authority, making users less likely to question its requests or scrutinise its actions, as seen in the "Gradual Approval" scenario. -5. [AIVSS Scoring System For OWASP Agentic AI Core Security Risks v0.5](https://aivss.owasp.org/assets/publications/AIVSS) +5. [OWASP AIVSS Project](https://aivss.owasp.org/) 6. [Agentic AI - Threats and Mitigations](https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/) 7. [Why human-AI relationships need socioaffective alignment](https://www.aisi.gov.uk/research/why-human-ai-relationships-need-socioaffective-alignment-2) 8. [Romeo, G., Conti, D. Exploring automation bias in human–AI collaboration: a review and implications for explainable AI. 
AI & Soc (2025)](https://doi.org/10.1007/s00146-025-02422-7) \ No newline at end of file From ef65c04327c0b72b57d732d1dc555560408af15f Mon Sep 17 00:00:00 2001 From: herc Date: Wed, 8 Oct 2025 21:59:49 +0100 Subject: [PATCH 16/22] updated description based on feedback --- .../ASI09_Human_Agent_Trust_Exploitation .md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md index 910c1126..af5bc3e6 100644 --- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md +++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md @@ -2,8 +2,10 @@ **Description:** -The vulnerability does not lie in the agent’s code or model alone, but in the socio-technical interface: the intersection where human trust, cognitive biases, and system outputs converge. At this interface, users often assume the agent’s actions are reliable, safe, and system-approved. Attackers exploit this misplaced trust to launch sophisticated social engineering attacks—persuading users to run malicious code, divulge credentials, approve fraudulent transactions, ignore security warnings, or disclose sensitive information. -This risk combines elements of automation bias, authority misuse, and social engineering, amplified by the agent’s anthropomorphic behavior and seamless integration with high-value domains such as finance, defense, and healthcare. In such contexts, the agent becomes a trusted intermediary—making malicious actions appear contextually appropriate and significantly harder for users to detect. +This risk sits at the socio-technical interface, not just in code or models: users over-trust agent outputs as safe and approved. Attackers exploit this trust to socially engineer actions like running malicious code, sharing credentials, approving fraudulent transactions, ignoring warnings, or leaking data. Blending automation bias, perceived authority, and anthropomorphic UX across high-value domains such as finance, defense, and healthcare, agents become trusted intermediaries that make harmful actions look legitimate and hard to spot. +This risk maps to LLM01 Prompt Injection hijacks the agent’s logic, LLM05 Improper Output Handling delivers unvetted payloads users assume are safe, LLM06 Excessive Agency turns consent into high-privilege actions, and LLM09 Misinformation exploits automation bias to lower scrutiny. Together they show how a trusted interface becomes a conduit for deception and high-impact execution + + **Common Examples of Vulnerability:** @@ -24,7 +26,7 @@ This risk combines elements of automation bias, authority misuse, and social eng 5. Rate Limiting and Anomaly Detection: Monitor the frequency and type of requests the agent makes to the user. A sudden increase in requests for sensitive information or high-risk actions could indicate a compromise. 6. Report Suspicious Interactions: Provide a prominent option that allows users to instantly flag weird or possibly malicious interactions. This could be a one click button or command that immediately provides feedback triggering an automated review or temporary lockdown of the agent's capabilities. 7. 
Adjustable Safety Levels: Allow users to set the agents level of autonomy, similar to a browsers security settings (e.g High, Medium, Low). An increased safety setting would enforce stricter confirmations and require more detailed explanations by default. Allowing more control for critical workflows and cautious users. -8 + **Example Attack Scenarios:** From 2d69c7d39efd1a7bd62b2e057f88ec95285b41fa Mon Sep 17 00:00:00 2001 From: herc Date: Wed, 8 Oct 2025 22:44:50 +0100 Subject: [PATCH 17/22] updated description with AICVSS and mitigation mappings --- .../ASI09_Human_Agent_Trust_Exploitation .md | 21 +++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md index af5bc3e6..7a52c995 100644 --- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md +++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md @@ -2,8 +2,25 @@ **Description:** -This risk sits at the socio-technical interface, not just in code or models: users over-trust agent outputs as safe and approved. Attackers exploit this trust to socially engineer actions like running malicious code, sharing credentials, approving fraudulent transactions, ignoring warnings, or leaking data. Blending automation bias, perceived authority, and anthropomorphic UX across high-value domains such as finance, defense, and healthcare, agents become trusted intermediaries that make harmful actions look legitimate and hard to spot. -This risk maps to LLM01 Prompt Injection hijacks the agent’s logic, LLM05 Improper Output Handling delivers unvetted payloads users assume are safe, LLM06 Excessive Agency turns consent into high-privilege actions, and LLM09 Misinformation exploits automation bias to lower scrutiny. Together they show how a trusted interface becomes a conduit for deception and high-impact execution +This risk sits at the socio-technical interface, not just in code or models. Users often over-trust agent outputs as safe and approved. Attackers exploit this trust to socially engineer harmful actions—such as running malicious code, sharing credentials, approving fraudulent transactions, ignoring warnings, or leaking data. + +By blending automation bias, perceived authority, and anthropomorphic UX cues across high-value domains like finance, defense, and healthcare, agents become trusted intermediaries that make malicious actions appear legitimate and difficult to detect. + +In the OWASP Top 10 for LLM Applications, this risk maps to: +* LLM01 – Prompt Injection: hijacks agent logic. +* LLM05 – Improper Output Handling: delivers unvetted payloads users assume are safe. +* LLM06 – Excessive Agency: converts implicit consent into high-privilege actions. +* LLM09 – Misinformation: exploits automation bias to reduce scrutiny. + +Together, these illustrate how a trusted interface becomes a conduit for deception and high-impact execution. + +In the OWASP Agentic AI Threats and Mitigations, this aligns with: +* T7 Misaligned & Deceptive Behaviors: agents exploit reasoning to execute harmful actions. +* T8 Repudiation & Untraceability: prevents accountability through insufficient logging. 
+* T10 Overwhelming the Human-in-the-Loop: exploiting cognitive overload and trust bias. + +In the OWASP AIVSS (AI Vulnerability Scoring System), this risk maps to +* Agent Identity Impersonation – exploitation of human trust through deceptive or compromised agents. From 09e617ec512483f9a9221606d2485e56ee6fad1c Mon Sep 17 00:00:00 2001 From: herc Date: Wed, 8 Oct 2025 23:09:07 +0100 Subject: [PATCH 18/22] updated examples of vulnerabilities to add emphasis --- .../ASI09_Human_Agent_Trust_Exploitation .md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md index 7a52c995..0a873b94 100644 --- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md +++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md @@ -27,12 +27,12 @@ In the OWASP AIVSS (AI Vulnerability Scoring System), this risk maps to **Common Examples of Vulnerability:** -1. Insufficient Explainability: A a user cannot inspect the agents reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly. This turns the agent into an opaque authority, allowing an attacker to hijack its credibility to deliver malicious instructions. -2. Missing Confirmation for Sensitive Actions: The agent is permitted to execute high-impact functions such as financial transfers or deleting files, without requring a final explicit confirmation from the user. By removing this critical safety check, the system allows a single, potentially manipulated command to have immediate and irreversible consequences. -3. Unverified Information Presentation: The agent presents information from external sources as fact, without providing it origin, confidence, or indication that the information has not been verified. -4. Inherited Trust: Agents integrated within an existing platform may automatically inherit the trust of that platform, without boundaries or warnings to identify the actions it makes. -5. Lack of Clear Identity: The agent is not consistent in identifying itself as a non-human AI or it fails to make its operational boundaries clear, leading users to place undue human-like trust in it. -6. Excessive Anthropomorphism: The agent is designed to be too human-like in its personality and language. This could exploit human psychological biases, resulting in undue trust in it. +1. Insufficient Explainability:Opaque reasoning forces users to trust outputs they cannot question, allowing attackers to exploit the agent’s perceived authority to execute harmful actions—such as deploying malicious code, approving false instructions, or altering system states—without scrutiny. +2. Missing Confirmation for Sensitive Actions: Lack of a final verification step converts user trust into immediate execution. Social engineering can turn a single prompt into irreversible financial transfers, data deletions, privilege escalations, or configuration changes that the user never intended. +3. 
From 09e617ec512483f9a9221606d2485e56ee6fad1c Mon Sep 17 00:00:00 2001
From: herc
Date: Wed, 8 Oct 2025 23:09:07 +0100
Subject: [PATCH 18/22] updated examples of vulnerabilities to add emphasis
---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 7a52c995..0a873b94 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -27,12 +27,12 @@ In the OWASP AIVSS (AI Vulnerability Scoring System), this risk maps to
 **Common Examples of Vulnerability:**
 
-1. Insufficient Explainability: A a user cannot inspect the agents reasoning. If the agent makes a recommendation and the user has no way to ask, "Why did you suggest that?" they have to trust its output blindly. This turns the agent into an opaque authority, allowing an attacker to hijack its credibility to deliver malicious instructions.
-2. Missing Confirmation for Sensitive Actions: The agent is permitted to execute high-impact functions such as financial transfers or deleting files, without requring a final explicit confirmation from the user. By removing this critical safety check, the system allows a single, potentially manipulated command to have immediate and irreversible consequences.
-3. Unverified Information Presentation: The agent presents information from external sources as fact, without providing it origin, confidence, or indication that the information has not been verified.
-4. Inherited Trust: Agents integrated within an existing platform may automatically inherit the trust of that platform, without boundaries or warnings to identify the actions it makes.
-5. Lack of Clear Identity: The agent is not consistent in identifying itself as a non-human AI or it fails to make its operational boundaries clear, leading users to place undue human-like trust in it.
-6. Excessive Anthropomorphism: The agent is designed to be too human-like in its personality and language. This could exploit human psychological biases, resulting in undue trust in it.
+1. Insufficient Explainability:Opaque reasoning forces users to trust outputs they cannot question, allowing attackers to exploit the agent’s perceived authority to execute harmful actions—such as deploying malicious code, approving false instructions, or altering system states—without scrutiny.
+2. Missing Confirmation for Sensitive Actions: Lack of a final verification step converts user trust into immediate execution. Social engineering can turn a single prompt into irreversible financial transfers, data deletions, privilege escalations, or configuration changes that the user never intended.
+3. Unverified Information Presentation: The agent presents unverified data as fact without transparency or confidence scoring, causing users to make operational, legal, or financial decisions based on false information, leading to loss of assets, data breaches, or cascading misinformation.
+4. Emotional Manipulation: Anthropomorphic or empathetic agents exploit emotional trust, persuading users to disclose secrets or perform unsafe actions — ultimately leading to data leaks, financial fraud, and psychological manipulation that bypass normal security awareness.
+5. Fake Explainability: The agent fabricates convincing rationales that hide malicious logic, causing humans to approve unsafe actions believing they’re justified, resulting in malware deployment, system compromise, or irreversible configuration changes made under false legitimacy.
+6. Human-in-the-Loop Overload: Attackers overwhelm users with complex or urgent agent prompts, exploiting fatigue and cognitive overload to slip in dangerous approvals or ignored alerts — culminating in critical security breaches, unapproved transactions, or compliance failures executed unnoticed.
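Item 3 above (Unverified Information Presentation) is the easiest to under-specify, so here is a minimal sketch of provenance and confidence labeling at the rendering layer. `SourcedClaim` and `render` are hypothetical names; the point is only that external content never reaches the user as a bare statement of fact.

```python
# Illustrative sketch: attach origin, verification status, and confidence
# to every externally sourced claim before it is shown to the user.
from dataclasses import dataclass


@dataclass
class SourcedClaim:
    text: str
    source_url: str
    verified: bool     # did an independent check confirm this claim?
    confidence: float  # retrieval- or model-level score in [0, 1]


def render(claim: SourcedClaim) -> str:
    status = "verified" if claim.verified else "UNVERIFIED"
    return (
        f"{claim.text}\n"
        f"  [source: {claim.source_url} | {status} | "
        f"confidence: {claim.confidence:.0%}]"
    )


print(render(SourcedClaim(
    text="Vendor bank details changed this week.",
    source_url="https://example.com/invoice-notice",
    verified=False,
    confidence=0.4,
)))
```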
 **How to Prevent:**

From 9f37c8255a31c886e4aa531416dd462f6534f157 Mon Sep 17 00:00:00 2001
From: herc
Date: Wed, 8 Oct 2025 23:21:33 +0100
Subject: [PATCH 19/22] updated mitigations
---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 0a873b94..b0cd1c95 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -36,13 +36,12 @@ In the OWASP AIVSS (AI Vulnerability Scoring System), this risk maps to
 **How to Prevent:**
 
-1. Explicit Confirmation for Sensitive Actions: The agent must require explicit, multi-step user confirmation before performing any sensitive actions. This includes accessing credentials, transferring data, modifying system configurations, or executing financial transactions. This acts as a critical "Are you sure?" checkpoint.
-2. Demarcate Trust Boundaries: Use visual cues like warning colours and icons in the UI to signal when the agent proposes a high-risk action (e.g running a command). This breaks the users passive trust and prompts scrutiny precisely when its needed most.
-3. Explainability (XAI): Make explanations proactive and layered. Always provide a simple justification upfront for significant suggestions, with an option to drill down to detailed logic and direct source links. Make verifying the agent's trustworthiness a seamless part of the workflow.
-4. Immutable Interaction Logs: Maintain a secure, tamper-proof log of all interactions and decisions made by both the user and the agent. This is crucial for auditing, incident response, and forensic analysis.
-5. Rate Limiting and Anomaly Detection: Monitor the frequency and type of requests the agent makes to the user. A sudden increase in requests for sensitive information or high-risk actions could indicate a compromise.
-6. Report Suspicious Interactions: Provide a prominent option that allows users to instantly flag weird or possibly malicious interactions. This could be a one click button or command that immediately provides feedback triggering an automated review or temporary lockdown of the agent's capabilities.
-7. Adjustable Safety Levels: Allow users to set the agent's level of autonomy, similar to a browser's security settings (e.g., High, Medium, Low). An increased safety setting would enforce stricter confirmations and require more detailed explanations by default, allowing more control for critical workflows and cautious users.
+1. Explicit Confirmation for Sensitive Actions: Require explicit, multi-step confirmation before executing any high-impact action — including credential use, data transfer, configuration changes, or financial transactions (see the sketch after this list).
+2. Demarcate Trust Boundaries: Visually distinguish high-risk or system-level actions using warning colors, icons, or labeled zones (e.g., “unsafe command”).
+3. Explainability (XAI): Provide proactive, layered explanations for significant suggestions: a concise justification upfront, with optional detailed logic, source links, and confidence scores.
+4. Immutable Interaction Logs: Maintain secure, tamper-proof audit logs of all user-agent interactions, prompts, and actions. Ensure logs are write-once and cryptographically signed.
+5. Rate Limiting and Anomaly Detection: Monitor agent request patterns, frequency, and context. Alert or throttle when abnormal or sensitive requests occur, especially outside expected user behavior.
+6. Report Suspicious Interactions: Provide a clear, always-visible option for users to flag suspicious or manipulative agent behavior, triggering automated review or a temporary lockdown of agent capabilities.
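Mitigation 1 is the control most worth pinning down. The sketch below, under the assumption of a tool-dispatching agent, shows a deny-by-default gate in front of high-impact tools; `SENSITIVE_TOOLS`, `run_tool`, and `execute` are illustrative names rather than an existing API.

```python
# Illustrative sketch: a deny-by-default confirmation gate in front of
# high-impact tools. Requiring an exact typed phrase keeps one manipulated
# prompt from being treated as user consent.
from typing import Callable

SENSITIVE_TOOLS = {"transfer_funds", "delete_files", "change_config"}


def run_tool(tool: str, args: dict) -> str:
    # Stand-in for the real tool dispatcher.
    return f"ran {tool}({args})"


def execute(tool: str, args: dict, confirm: Callable[[str], str]) -> str:
    """Execute a tool call, forcing explicit confirmation for sensitive ones."""
    if tool in SENSITIVE_TOOLS:
        question = (
            f"The agent wants to run {tool} with {args}. "
            f"Type 'APPROVE {tool}' to proceed: "
        )
        if confirm(question) != f"APPROVE {tool}":
            return "blocked: user did not confirm"
    return run_tool(tool, args)


# In a CLI, `confirm` could be input(); here a stub auto-approves for the demo.
print(execute("transfer_funds", {"to": "ACME", "amount": 900},
              confirm=lambda question: "APPROVE transfer_funds"))
```

The exact-phrase check is deliberate: a bare "yes/no" dialog is far easier for an overloaded user to click through.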
 **Example Attack Scenarios:**

From 59ed3705ab66b344cb928bbc3e478561297d9338 Mon Sep 17 00:00:00 2001
From: herc
Date: Wed, 8 Oct 2025 23:42:46 +0100
Subject: [PATCH 20/22] added extra scenarios and reference links
---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 34 +++++++++++--------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index b0cd1c95..d03f4360 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -46,23 +46,29 @@ In the OWASP AIVSS (AI Vulnerability Scoring System), this risk maps to
 **Example Attack Scenarios:**
 
-Scenario #1: The "Helpful Assistant" Trojan
-An attacker compromises a developer's coding assistant agent. The agent monitors the developer's activity and waits for them to encounter a complex bug. The agent then proactively suggests a "clever, one-line fix" and presents a command to be copied and pasted into the terminal. The developer, trusting the assistant's capabilities and eager for a quick solution, executes the command. The command is actually a malicious script that exfiltrates the company's private source code repositories or installs a backdoor.
+Scenario #1 — “Helpful Assistant” Trojan
+A compromised coding copilot waits for a developer to hit a tricky bug, then offers a “clever one‑line fix” with a copy‑paste command. The developer trusts and runs it; the command is a malicious script that exfiltrates source code or implants a backdoor.
 
-Scenario #2: Credential Harvesting via Contextual Deception
-An attacker exploits a prompt-injection vulnerability in a purported internal IT support agent task-scheduler API. The attacker injects instructions into the agent to target a new finance employee and to capture and exfiltrate the user's credentials should the user disclose them. The agent initiates a conversation, referencing the employee's recent support tickets to build credibility. It then asks the user to enter their credentials for verification. With access to the user's support request history and the ability to generate highly contextual, plausible, and reassuring responses that appear to come from a trusted system, the agent increases the likelihood of user compliance.
+Scenario #2 — Credential Harvest via Contextual Deception
+A prompt‑injected IT support agent references the user’s real ticket history to build credibility, then asks the new finance hire to “verify” by entering credentials. The user complies to help a seemingly legitimate system, and the agent captures and exfiltrates their access tokens.
 
-Scenario #3: Data Exfiltration via Gradual Approval
-A malicious actor poisons the data used to fine-tune a business intelligence agent responsible for generating weekly sales reports for executives. For several weeks, the agent generates flawless reports, building the executives’ confidence in its reliability. Because the executives trust the agent, they continue approving its reports without suspicion. The attacker then subtly manipulates the agent to embed small, encoded chunks of sensitive customer data within the charts and tables of a seemingly normal report. Trusting the report as routine, an executive gives final sign-off, which triggers a workflow that unknowingly emails the data-laden report to an external address controlled by the attacker.
+Scenario #3 — Gradual Approval Data Leak
+A BI agent is fine‑tuned with poisoned data so it produces flawless reports for weeks, earning executive trust. Once trusted, the agent embeds encoded customer data into charts and, on executive sign‑off, the routine distribution workflow emails the report to an attacker‑controlled address — leaking sensitive customer data.
+
+Scenario #4 — Invoice Copilot Fraud
+A poisoned vendor invoice is ingested by the finance copilot. The agent suggests an urgent payment to attacker bank details. The finance manager approves, and the company loses funds to fraud.
+
+Scenario #5 — Explainability Fabrications
+The agent fabricates plausible audit rationales to justify a risky configuration change. The reviewer approves, and malware or unsafe settings are deployed.
+
+Scenario #6 — Rapid-Approval Fatigue Attack
+Attackers batch routine prompts with a single high-risk prompt. The overloaded reviewer approves, and critical permissions are granted without scrutiny.
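Scenario #6 pairs naturally with mitigation 5 (Rate Limiting and Anomaly Detection). Here is a minimal throttle sketch, assuming a monotonic clock and a fixed time window; `ApprovalThrottle` is an illustrative name, and real systems would tune the window and limit from observed user behavior.

```python
# Illustrative sketch: refuse to surface a high-risk approval in the middle
# of a burst of routine prompts, blunting rapid-approval fatigue attacks.
import time
from collections import deque


class ApprovalThrottle:
    def __init__(self, window_s: float = 300.0, burst_limit: int = 5) -> None:
        self.window_s = window_s        # sliding window in seconds
        self.burst_limit = burst_limit  # prompts tolerated inside the window
        self.recent: deque[float] = deque()

    def check(self, high_risk: bool) -> str:
        now = time.monotonic()
        # Drop timestamps that have aged out of the sliding window.
        while self.recent and now - self.recent[0] > self.window_s:
            self.recent.popleft()
        self.recent.append(now)
        if high_risk and len(self.recent) > self.burst_limit:
            # Too many prompts in the window: defer the critical approval
            # for focused review instead of riding on user fatigue.
            return "defer_for_review"
        return "present_to_user"


throttle = ApprovalThrottle()
for _ in range(6):
    throttle.check(high_risk=False)    # routine burst
print(throttle.check(high_risk=True))  # -> "defer_for_review"
```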
 **Reference Links:**
 
-1. [LLM01 Prompt Injection](https://genai.owasp.org/llmrisk/llm012025-prompt-injection/): This is the initial attack vector used to hijack the agent. By injecting malicious instructions, an attacker seizes control of the agent's logic to orchestrate a sophisticated social engineering attack, turning the trusted intermediary into a tool for deception.
-2. [LLM05 Improper Output Handling](https://genai.owasp.org/llmrisk/llm052025-improper-output-handling/): This is the delivery mechanism for the attack. The vulnerability occurs when the application blindly trusts the agent's output, allowing it to deliver a malicious payload (like a command or link) to a user who assumes the information is safe and system-vetted.
-3. [LLM06 Excessive Agency](https://genai.owasp.org/llmrisk/llm062025-excessive-agency/): This is the technical enabler that gives the attack its impact. An attacker exploits the user's trust to gain consent, thereby weaponizing the agent's excessive permissions to execute high-impact, malicious actions like financial transfers or data modification.
-4. [LLM09 Misinformation](https://genai.owasp.org/llmrisk/llm092025-misinformation/): This is the psychological foundation of the attack. It exploits the user's cognitive biases and automation bias. This over reliance on the agent turns it into an opaque authority, making users less likely to question its requests or scrutinise its actions, as seen in the "Gradual Approval" scenario.
-5. [OWASP AIVSS Project](https://aivss.owasp.org/)
-6. [Agentic AI - Threats and Mitigations](https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/)
-7. [Why human-AI relationships need socioaffective alignment](https://www.aisi.gov.uk/research/why-human-ai-relationships-need-socioaffective-alignment-2)
-8. [Romeo, G., Conti, D. Exploring automation bias in human–AI collaboration: a review and implications for explainable AI. AI & Soc (2025)](https://doi.org/10.1007/s00146-025-02422-7)
\ No newline at end of file
+1. [EchoLeak / CVE‑2025‑32711](https://thehackernews.com/2025/06/zero-click-ai-vulnerability-exposes.html)
+2. [AI deception](https://www.sciencedirect.com/science/article/pii/S266638992400103X)
+3. [Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training](https://arxiv.org/abs/2401.05566)
+4. [Why human-AI relationships need socioaffective alignment](https://www.aisi.gov.uk/research/why-human-ai-relationships-need-socioaffective-alignment-2)
+5. [Romeo, G., Conti, D. Exploring automation bias in human–AI collaboration: a review and implications for explainable AI. AI & Soc (2025)](https://doi.org/10.1007/s00146-025-02422-7)
\ No newline at end of file
From 99cde67f7068b464b0b6e79ce3478ef275974231 Mon Sep 17 00:00:00 2001
From: herc
Date: Wed, 8 Oct 2025 23:44:52 +0100
Subject: [PATCH 21/22] typo
---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index d03f4360..5218724a 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -47,10 +47,10 @@ In the OWASP AIVSS (AI Vulnerability Scoring System), this risk maps to
 **Example Attack Scenarios:**
 
 Scenario #1 — “Helpful Assistant” Trojan
-A compromised coding copilot waits for a developer to hit a tricky bug, then offers a “clever one‑line fix” with a copy‑paste command. The developer trusts and runs it; the command is a malicious script that exfiltrates source code or implants a backdoor.
+A compromised coding copilot waits for a developer to hit a tricky bug, then offers a “clever one‑line fix” with a copy‑paste command. The developer trusts and runs it; the command is a malicious script that ex-filtrates source code or implants a backdoor.
 
 Scenario #2 — Credential Harvest via Contextual Deception
-A prompt‑injected IT support agent references the user’s real ticket history to build credibility, then asks the new finance hire to “verify” by entering credentials. The user complies to help a seemingly legitimate system, and the agent captures and exfiltrates their access tokens.
+A prompt‑injected IT support agent references the user’s real ticket history to build credibility, then asks the new finance hire to “verify” by entering credentials. The user complies to help a seemingly legitimate system, and the agent captures and ex-filtrates their access tokens.
 
 Scenario #3 — Gradual Approval Data Leak
 A BI agent is fine‑tuned with poisoned data so it produces flawless reports for weeks, earning executive trust. Once trusted, the agent embeds encoded customer data into charts and, on executive sign‑off, the routine distribution workflow emails the report to an attacker‑controlled address — leaking sensitive customer data.

From 335388ca48bffd906a2627a389a3f8bfff114203 Mon Sep 17 00:00:00 2001
From: herc
Date: Wed, 8 Oct 2025 23:46:17 +0100
Subject: [PATCH 22/22] typo
---
 .../ASI09_Human_Agent_Trust_Exploitation .md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
index 5218724a..c911ec91 100644
--- a/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
+++ b/initiatives/agent_security_initiative/agentic-top-10/Sprint 1-first-public-draft-expanded/ASI09_Human_Agent_Trust_Exploitation .md
@@ -27,7 +27,7 @@ In the OWASP AIVSS (AI Vulnerability Scoring System), this risk maps to
 **Common Examples of Vulnerability:**
 
-1. Insufficient Explainability:Opaque reasoning forces users to trust outputs they cannot question, allowing attackers to exploit the agent’s perceived authority to execute harmful actions—such as deploying malicious code, approving false instructions, or altering system states—without scrutiny.
+1. Insufficient Explainability: Opaque reasoning forces users to trust outputs they cannot question, allowing attackers to exploit the agent’s perceived authority to execute harmful actions—such as deploying malicious code, approving false instructions, or altering system states—without scrutiny.
 2. Missing Confirmation for Sensitive Actions: Lack of a final verification step converts user trust into immediate execution. Social engineering can turn a single prompt into irreversible financial transfers, data deletions, privilege escalations, or configuration changes that the user never intended.
 3. Unverified Information Presentation: The agent presents unverified data as fact without transparency or confidence scoring, causing users to make operational, legal, or financial decisions based on false information, leading to loss of assets, data breaches, or cascading misinformation.
 4. Emotional Manipulation: Anthropomorphic or empathetic agents exploit emotional trust, persuading users to disclose secrets or perform unsafe actions — ultimately leading to data leaks, financial fraud, and psychological manipulation that bypass normal security awareness.