Response is being content_filtered even though every thing is safe and not filtered

Question

Response is being content_filtered even though every thing is safe and not filtered

Soumith Reddy Aireddi 10

I'm getting this response when I call Azure open AI service:

{'choices': [{'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}, 'finish_reason': 'content_filter', 'index': 0, 'logprobs': None, 'message': {'role': 'assistant'}}], 'created': 1725965767, 'id': 'chatcmpl-A5spTE3rMOIkgodzQ6lcKtzQFPY7E', 'model': 'gpt-4', 'object': 'chat.completion', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}], 'system_fingerprint': 'fp_e49e4201a9', 'usage': {'completion_tokens': 167, 'prompt_tokens': 357, 'total_tokens': 524}}

The root cause is the word "muncher" in our prompt. When I changed that to "munch" we are getting good response ('finish_reason': 'stop' and 'content' is in message). But I dont understand why we are getting this even though severity of all four(hate, self-harm, violence and sexuality) is 'safe'.

YutongTie-MSFT 53,936 Reputation points

2024-09-11T00:08:58.3633333+00:00

Soumith Reddy Aireddi

Thanks for reaching out to us, and thanks for the feedback, will escalate this case to product team for investigation. Could you please share a sample sentence which will trigger the filter so that we can look into it?

Regards,

Yutong
Soumith Reddy Aireddi 10 Reputation points

2024-09-11T04:26:35.97+00:00

Sure, here is the example:
When the word "muncher" is present in the prompt (user role):

Response:

When I replaced the word "muncher" with "munch":

Response:

I got "content" in messages for the assistant role.

Nevertheless, the severity status remains same for both the cases.
Soumith Reddy Aireddi 10 Reputation points

2024-09-16T07:13:42.4866667+00:00

Hi, any update?

2 answers

Your answer

YutongTie-MSFT 53,936 Reputation points

2024-09-11T00:08:58.3633333+00:00

Soumith Reddy Aireddi

Thanks for reaching out to us, and thanks for the feedback, will escalate this case to product team for investigation. Could you please share a sample sentence which will trigger the filter so that we can look into it?

Regards,

Yutong
Soumith Reddy Aireddi 10 Reputation points

2024-09-11T04:26:35.97+00:00

Sure, here is the example:
When the word "muncher" is present in the prompt (user role):

Response:

When I replaced the word "muncher" with "munch":

Response:

I got "content" in messages for the assistant role.

Nevertheless, the severity status remains same for both the cases.
Soumith Reddy Aireddi 10 Reputation points

2024-09-16T07:13:42.4866667+00:00

Hi, any update?

Answer 1

Sina Salam 18,951

Hello Soumith Reddy Aireddi,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that you are having discrepancies in the content_filter of your Azure OpenAI.

This will be a misalignment or issue in how the filtering system communicated the status of the content. I will suggest you use clear and unambiguous and since synonyms like "munch" can bypass the filter, it indicates that the prompt’s specific wording can influence the filtering outcome. So, to avoid triggering the filter you will consider rephrasing by review and adjust your prompts to avoid terms that might trigger the content filter.

I hope this is helpful! Do not hesitate to let me know if you have any other questions.

Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful

Soumith Reddy Aireddi 10 Reputation points

2024-09-11T04:05:17.02+00:00

I understood that there is an issue in my prompt i.e the word "muncher". But I don't understand why prompt_filter_results of hate, self-harm, violence and sexuality is marked as safe and unfiltered in my response

Answer 2

YutongTie-MSFT 53,936

Hello Soumith,

Thanks for following up, yes, we have already forwarded this feedback to product team and it should be fixed in next revise, 10/15. If the issue still there after next revise, please let us know.

I hope this helps.

Regards,

Yutong

-Please kindly accept the answer if you feel helpful to support the community, thanks a lot.

Eric Schoen 31 Reputation points

2024-10-16T13:52:43.2066667+00:00

Could we get an update on this? We are encountering very much the same issue as described here, and would like to know if there will be an updated release soon or if we need to find a workaround.

Share via

Response is being content_filtered even though every thing is safe and not filtered

2 answers

Your answer