Hi Marco Moroni
Rate-limit errors indicate that, at some point during inference, you exceeded the estimated maximum processed tokens per minute for your deployment.
This can happen if you are sending long prompts, generating long outputs, or retrieving a large amount of context (for example, from a large index).
Possible solutions:
- Increase the tokens-per-minute quota on your model deployment, or lower the `max_tokens` parameter so each request reserves fewer tokens (rate limiting estimates usage from `max_tokens`).
- Adjust your prompts to be shorter, more precise, and clear.
- Adjust the system message to keep answers within smaller chunks.
- Implement a retry mechanism with a sleep time between attempts.
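The retry suggestion above can be sketched as an exponential-backoff wrapper. This is a minimal, client-agnostic sketch: `RateLimitError` here is a placeholder for whatever rate-limit exception your client library actually raises, and the delay values are illustrative defaults, not recommendations from the service.

```python
import time
import random

class RateLimitError(Exception):
    """Placeholder for the rate-limit exception your client library raises (assumption)."""

def call_with_retry(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Sleep base_delay, 2*base_delay, 4*base_delay, ... plus random
            # jitter so many clients do not all retry at the same instant.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Usage sketch: a call that fails twice with a rate-limit error, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"

print(call_with_retry(flaky_call, base_delay=0.01))
```

In practice you would replace `flaky_call` with your actual inference call and catch the real exception type (for example, the 429 error your SDK surfaces).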
Thank you.