Semantic Kernel plugin that returns 1,000 entities results in very high token usage
Hello,
I am playing with Semantic Kernel to connect my API data source to the gpt-4o model. The service I inject has the following code:
```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;

public class MyPlugin
{
    // Cache the records for the lifetime of the plugin instance,
    // since the data doesn't change often.
    private ICollection<MyApiRecord>? records;
    private readonly WebApiClientFactory apiClientFactory = new();

    [KernelFunction("get_records")]
    [Description("Lists all records and their metadata")]
    [return: Description("An array of the Api Records")]
    public async Task<IEnumerable<MyApiRecord>> GetRecordsAsync()
    {
        try
        {
            if (records == null)
            {
                var apiClient = new ApiClient(apiClientFactory.GetHttpClient());
                records = await apiClient.GetRecordsAsync();
            }
            return records;
        }
        catch (Exception e)
        {
            Console.WriteLine(e);
            // Fall back to whatever is cached; never return null.
            return records ?? Enumerable.Empty<MyApiRecord>();
        }
    }

    [KernelFunction("get_record")]
    [Description("Get record details")]
    [return: Description("The details of the record")]
    public async Task<MyApiRecord?> GetRecordAsync(
        [Description("The id of the record")] int id)
    {
        var records = await GetRecordsAsync();
        // MyApiRecord.Id is a string, so compare against the stringified id.
        return records.FirstOrDefault(c => c.Id == id.ToString());
    }
}
```
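For context, the kernel wiring looks roughly like this (a minimal sketch: the deployment name, endpoint, and key are placeholders, and I'm assuming a recent Semantic Kernel version where `FunctionChoiceBehavior.Auto()` is available):

```csharp
using System;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

var builder = Kernel.CreateBuilder();

// Placeholder deployment/endpoint/key values.
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o",
    endpoint: "https://my-resource.openai.azure.com/",
    apiKey: "<api-key>");

// Register the plugin so the model can call get_records / get_record.
builder.Plugins.AddFromType<MyPlugin>("MyPlugin");

var kernel = builder.Build();

// Let the model invoke kernel functions automatically; every function
// result is appended to the chat history and resent on the next call.
var settings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};

var result = await kernel.InvokePromptAsync(
    "Please give me a description of record 'And Justice for All'",
    new KernelArguments(settings));
Console.WriteLine(result);
```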
But when I ask something like "Please give me a description of record 'And Justice for All'", I get the following 429 rate-limit response:
```
Error: HTTP 429 (429)
Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-10-01-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 58 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.
```
So I get the impression that far too many tokens are used with every request: the full list of records (1,000 records with 10 fields each) appears to be serialized into the chat history and sent back to the model on every call.
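As a rough back-of-envelope estimate (assuming something like 5 tokens per field value plus JSON overhead): 1,000 records × 10 fields × ~5 tokens is on the order of 50,000 tokens per tool result, which would exhaust a typical S0 per-minute token quota in a single request.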
What would be a better way to work around this, given that the data doesn't change very often?
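In case it helps frame answers: the direction I've been considering is to stop exposing the full `GetRecordsAsync` as a kernel function and instead expose a slim listing, keeping the full details behind `get_record`. A rough sketch of a method inside `MyPlugin`, where `RecordSummary` is a hypothetical DTO and I'm assuming `MyApiRecord` has a `Name` field:

```csharp
// Hypothetical slim DTO: just enough for the model to pick a record.
public record RecordSummary(string Id, string Name);

[KernelFunction("get_records")]
[Description("Lists record ids and names only")]
[return: Description("An array of record summaries (id and name)")]
public async Task<IEnumerable<RecordSummary>> GetRecordSummariesAsync()
{
    var all = await GetRecordsAsync();
    // Project 10 fields down to 2 so far fewer tokens land in the chat
    // history; the model can follow up with get_record for full details.
    return all.Select(r => new RecordSummary(r.Id, r.Name));
}
```

But I'm not sure whether this is the idiomatic approach, hence the question.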