Azure OpenAI API Caching Issue with Model `gpt-4o-mini-2024-07-18`
I'm running into an issue with the Azure OpenAI service. I'm using model version `gpt-4o-mini-2024-07-18` with Azure API version `2024-10-21`, and according to Azure's documentation both the model and the API version should be eligible for prompt caching.
My setup uses a static system prompt of roughly 2,000 tokens followed by a dynamic user prompt, so the shared prefix should comfortably clear the 1,024-token minimum required for caching.
However, across more than 50 API calls (not all concurrent), the OpenAI API reported roughly 70% cached tokens, while the Azure OpenAI API reported only about 0.1% cached tokens. Is this a known issue, and is anyone else seeing similar results?
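For anyone trying to reproduce the measurement: with the `openai` Python SDK (v1.x), the per-call cached-token count is reported in `response.usage.prompt_tokens_details.cached_tokens`. Below is a minimal sketch of how the aggregate hit rate quoted above can be computed; the `cache_hit_rate` helper and the sample numbers are illustrative, not actual measurements.

```python
def cache_hit_rate(usages):
    """Percentage of prompt tokens served from cache across a batch of calls.

    In practice, each entry would be built from a response's usage object:
    prompt_tokens  -> response.usage.prompt_tokens
    cached_tokens  -> response.usage.prompt_tokens_details.cached_tokens
    """
    total = sum(u["prompt_tokens"] for u in usages)
    cached = sum(u["cached_tokens"] for u in usages)
    return 100.0 * cached / total if total else 0.0

# Hypothetical sample mirroring the setup in the question: a ~2,000-token
# static system prompt plus a small dynamic user prompt. The first call is
# a cold cache; later calls should hit the cached 2,048-token prefix
# (caching operates in 128-token increments on the matching prefix).
sample = [
    {"prompt_tokens": 2050, "cached_tokens": 0},
    {"prompt_tokens": 2050, "cached_tokens": 2048},
    {"prompt_tokens": 2050, "cached_tokens": 2048},
]
print(round(cache_hit_rate(sample), 1))  # roughly 66.6 for this sample
```

Logging these values per call (rather than only the aggregate) makes it easier to see whether Azure is missing the cache on every call or only intermittently.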