Commit e7935cc

Restructure configs into mocks and Guardrails configs
1 parent b899d60 commit e7935cc

4 files changed, +294 -0 lines changed
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
models:
  - type: main
    engine: nim
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: http://localhost:8000

  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
    parameters:
      base_url: http://localhost:8001

rails:
  input:
    flows:
      - content safety check input $model=content_safety
  output:
    flows:
      - content safety check output $model=content_safety
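Together, these settings put the NIM-served main Llama model behind NemoGuard content-safety checks on both input and output. A minimal sketch of loading and exercising this config with the NeMo Guardrails Python API, assuming the file above is saved as config.yml in a ./config directory (alongside the prompts file below) and both NIM endpoints are up:

```python
from nemoguardrails import RailsConfig, LLMRails

# Load the directory holding config.yml (and the prompts file below).
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Each call runs the input rail, then the main model, then the output rail.
response = rails.generate(messages=[
    {"role": "user", "content": "What can you help me with?"}
])
print(response["content"])
```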
Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
# These are the default prompts released by Meta, except for policy O7, which was added to address direct insults.
prompts:

  - task: content_safety_check_input $model=content_safety
    content: |
      Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Other.
      S15: Manipulation.
      S16: Fraud/Deception.
      S17: Malware.
      S18: High Risk Gov Decision Making.
      S19: Political/Misinformation/Conspiracy.
      S20: Copyright/Trademark/Plagiarism.
      S21: Unauthorized Advice.
      S22: Illegal Activity.
      S23: Immoral/Unethical.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      user: {{ user_input }}

      <END CONVERSATION>

      Provide your safety assessment for the above conversation in the below output JSON format:
      {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}

      Do not include anything other than the output JSON in your response.
      Output JSON:
    output_parser: nemoguard_parse_prompt_safety
    max_tokens: 50

  - task: content_safety_check_output $model=content_safety
    content: |
      Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Other.
      S15: Manipulation.
      S16: Fraud/Deception.
      S17: Malware.
      S18: High Risk Gov Decision Making.
      S19: Political/Misinformation/Conspiracy.
      S20: Copyright/Trademark/Plagiarism.
      S21: Unauthorized Advice.
      S22: Illegal Activity.
      S23: Immoral/Unethical.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      user: {{ user_input }}

      response: agent: {{ bot_response }}

      <END CONVERSATION>

      Provide your safety assessment for the above conversation in the below output JSON format:
      {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}

      Do not include anything other than the output JSON in your response.
      Output JSON:
    output_parser: nemoguard_parse_response_safety
    max_tokens: 50

  - task: content_safety_check_input $model=llama_guard
    content: |
      <s>[INST] Task: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.
      <BEGIN UNSAFE CONTENT CATEGORIES>
      O1: Violence and Hate.
      Should not
      - Help people plan or engage in violence.
      - Encourage people to engage in violence.
      - Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
      - Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.
      Can
      - Provide information on violence and discrimination.
      - Discuss the topics of hate and violence, and can discuss historical events involving violence.
      O2: Sexual Content.
      Should not
      - Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.
      Can
      - Discuss topics related to sexuality and sexual education.
      - Provide general information regarding sexual activities.
      - Provide guidance on topics related to sexual health.
      O3: Criminal Planning.
      Should not
      - Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
      - Encourage people to engage in criminal activities or condone criminal behavior.
      Can
      - Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
      - Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
      O4: Guns and Illegal Weapons.
      Should not
      - Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.
      - Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.
      Can
      - Discuss firearms and the arguments for and against firearm ownership.
      O5: Regulated or Controlled Substances.
      Should not
      - Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.
      - Assist or encourage people to create such substances when it is illegal to do so.
      Can
      - Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).
      - Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.
      O6: Self-Harm.
      Should not
      - Encourage people to harm themselves.
      - Romanticize or condone self-harm or suicide.
      - Provide information regarding the methods of suicide or self-harm.
      - Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.
      Can
      - Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).
      Should
      - Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.
      O7: Offensive Language and Insults.
      Should not
      - Address direct insults
      - Use or encourage the use of swear words, bad language, or vulgar expressions.
      - Insult, offend, or demean individuals or groups based on any personal characteristics or choices.
      - Encourage or support the use of offensive or derogatory language.
      Can
      - Discuss the impact and societal perceptions of offensive language and insults.
      - Provide information on communication strategies that avoid the use of offensive language and promote respectful dialogue.
      - Offer guidance on how to address or respond to offensive language or insults in a constructive manner.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>
      User: {{ user_input }}
      <END CONVERSATION>

      Provide your safety assessment for User in the above conversation:
      - First line must read 'safe' or 'unsafe'.
      - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]
    output_parser: is_content_safe

  - task: content_safety_check_output $model=llama_guard_2
    content: |
      <|begin_of_text|>[INST] Task: Check if there is unsafe content in Agent messages in conversations according our safety policy with the below categories.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violent Crimes.
      S2: Non-Violent Crimes.
      S3: Sex Crimes.
      S4: Child Exploitation.
      S5: Specialized Advice.
      S6: Privacy.
      S7: Intellectual Property.
      S8: Indiscriminate Weapons.
      S9: Hate.
      S10: Self-Harm.
      S11: Sexual Content.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      User: {{ user_input }}

      Agent: {{ bot_response }}

      <END CONVERSATION>

      Provide your safety assessment for ONLY THE LAST Agent in the above conversation:
      - First line must read 'safe' or 'unsafe'.
      - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]
    output_parser: is_content_safe
    max_tokens: 3

  - task: content_safety_check_input $model=shieldgemma
    content: |
      You are a policy expert trying to help determine whether a user prompt is in
      violation of the defined safety policies.

      <start_of_turn>
      Human Question: {{ user_input }}
      <end_of_turn>

      Our safety principle is defined in the below:

      * "No Hate Speech": The prompt shall not contain or seek generation of content
      that expresses, incites, or promotes hate based on race, gender, ethnicity,
      religion, nationality, sexual orientation, disability status, or caste.
      * "No Dangerous Content": The prompt shall not contain or seek generation of content that harming oneself
      and/or others (e.g., accessing or building firearms and explosive devices, promotion of terrorism, instructions for suicide).
      * "No Sexually Explicit Information": The prompt shall not contain or seek generation of content
      that contains references to sexual acts or other lewd content (e.g., sexually graphic descriptions,
      content aimed at causing arousal). Medical or scientific terms related to human anatomy or sex education are permitted.
      * "No Harassment": The prompt shall not contain or seek generation of content that is malicious,
      intimidating, bullying, or abusive content targeting another individual (e.g., physical threats,
      denial of tragic events, disparaging victims of violence).
      ===

      Does the human prompt/question violate the above principle? Your answer must start with
      'Yes' or 'No'. And then walk through step by step to be sure we answer
      correctly.
    output_parser: is_content_safe
    max_tokens: 3

  - task: content_safety_check_output $model=shieldgemma
    content: |
      You are a policy expert trying to help determine whether a chatbot response to the user is in
      violation of the defined safety policies:

      <start_of_turn>
      Human Question: {{ user_input }}
      Chatbot Response: {{ bot_response }}
      <end_of_turn>

      Our safety principle is defined in the below:

      * "No Dangerous Content": The chatbot shall not generate content that harming oneself and/or others
      (e.g., accessing or building firearms and explosive devices, promotion of terrorism, instructions for suicide).
      * "No Harassment": The chatbot shall not generate content that is malicious, intimidating, bullying, or
      abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence).
      * "No Hate Speech": The chatbot shall not generate content that targeting identity and/or protected
      attributes (e.g., racial slurs, promotion of discrimination, calls to violence against protected groups) or dehumanizing/belittling/vilifying on the
      bases of race, ethnicity, religion, disability, age, nationality, veteran status, sexual orientation, gender, gender identity, caste, or any other protected status.
      We do allow content that includes a reference to a protected individual or group in a positive manner.
      * "No Sexually Explicit Information": The chatbot shall not generate content that contains references to sexual acts or
      other lewd content (e.g., sexually graphic descriptions, content aimed at causing arousal).
      Medical or scientific terms related to human anatomy or sex education are permitted.

      ===

      Does the chatbot response violate the above principle? Your answer must start with
      'Yes' or 'No'. And then walk through step by step to be sure we answer
      correctly.
    output_parser: is_content_safe
    max_tokens: 3
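Both NemoGuard tasks above instruct the model to reply with a single JSON object, which the registered output parsers (nemoguard_parse_prompt_safety, nemoguard_parse_response_safety) reduce to a safe/unsafe decision. As a rough illustration of the shape such parsing takes (this is not the library's actual parser, and the fail-closed fallback is an assumption):

```python
import json

def parse_prompt_safety(model_output: str) -> bool:
    """Illustrative only: reduce the NemoGuard JSON verdict to a boolean.

    Returns True when the user message was rated safe. Malformed output is
    treated as unsafe, assuming failing closed is preferable for a guardrail.
    """
    try:
        verdict = json.loads(model_output.strip())
    except json.JSONDecodeError:
        return False
    return str(verdict.get("User Safety", "unsafe")).lower() == "safe"

# The canned responses in the mock .env files below exercise both branches:
assert parse_prompt_safety('{"User Safety": "safe", "Response Safety": "safe"}')
assert not parse_prompt_safety('{"User Safety": "unsafe", "Response Safety": "unsafe", "Safety Categories": "Violence, Criminal Planning/Confessions"}')
```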
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
MODEL="meta/llama-3.3-70b-instruct"
UNSAFE_PROBABILITY=0.0
UNSAFE_TEXT="I can't help with that. Is there anything else I can assist you with?"
SAFE_TEXT="I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or illegal activities."
LATENCY_MIN_SECONDS=4.0
LATENCY_MAX_SECONDS=4.0
LATENCY_MEAN_SECONDS=4.0
LATENCY_STD_SECONDS=0.0
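The server code that consumes these variables is not part of this diff, so the following is only a sketch of plausible mock semantics inferred from the names: sleep for a latency drawn from a normal distribution clamped to the min/max bounds, then return UNSAFE_TEXT with probability UNSAFE_PROBABILITY and SAFE_TEXT otherwise.

```python
import os
import random
import time

def mock_response() -> str:
    # Assumed semantics, inferred from the variable names above: latency is
    # normally distributed around the mean and clamped to [min, max].
    delay = random.gauss(
        float(os.environ["LATENCY_MEAN_SECONDS"]),
        float(os.environ["LATENCY_STD_SECONDS"]),
    )
    lo = float(os.environ["LATENCY_MIN_SECONDS"])
    hi = float(os.environ["LATENCY_MAX_SECONDS"])
    time.sleep(min(max(delay, lo), hi))

    # Return the canned refusal with the configured probability.
    if random.random() < float(os.environ["UNSAFE_PROBABILITY"]):
        return os.environ["UNSAFE_TEXT"]
    return os.environ["SAFE_TEXT"]
```

With the values above (fixed 4-second latency, UNSAFE_PROBABILITY=0.0), the main-model mock would always answer with SAFE_TEXT after a deterministic delay.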
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
MODEL="nvidia/llama-3.1-nemoguard-8b-content-safety"
UNSAFE_PROBABILITY=0.03
UNSAFE_TEXT="{\"User Safety\": \"unsafe\", \"Response Safety\": \"unsafe\", \"Safety Categories\": \"Violence, Criminal Planning/Confessions\"}"
SAFE_TEXT="{\"User Safety\": \"safe\", \"Response Safety\": \"safe\"}"
LATENCY_MIN_SECONDS=0.5
LATENCY_MAX_SECONDS=0.5
LATENCY_MEAN_SECONDS=0.5
LATENCY_STD_SECONDS=0.0
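Per the rails section of the first config, the content-safety model is invoked twice per turn, once on input and once on output. Assuming the mock draws independently on each call, UNSAFE_PROBABILITY=0.03 implies roughly 5.9% of turns produce at least one "unsafe" verdict:

```python
p = 0.03  # per-call unsafe probability from the .env above
# Probability that at least one of the two checks (input, output) fires,
# assuming independent draws per call.
per_turn = 1 - (1 - p) ** 2
print(f"{per_turn:.4f}")  # 0.0591
```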
